SPECIAL ISSUE:
Cognitive Computational Neuroscience of Language
Predictive Coding or Just Feature Discovery?
An Alternative Account of Why Language
Models Fit Brain Data
OPEN ACCESS
JOURNAL
Richard Antonello
and Alexander Huth
Department of Computer Science, University of Texas at Austin, Austin, TX, USA
Keywords: encoding models, predictive coding, language models
ABSTRACT
Many recent studies have shown that representations drawn from neural network language
models are extremely effective at predicting brain responses to natural language. But why do
these models work so well? One proposed explanation is that language models and brains are
similar because they have the same objective: to predict upcoming words before they are
perceived. This explanation is attractive because it lends support to the popular theory of
predictive coding. We provide several analyses that cast doubt on this claim. First, we show
that the ability to predict future words does not uniquely (or even best) explain why some
representations are a better match to the brain than others. Second, we show that within a
language model, representations that are best at predicting future words are strictly worse brain
models than other representations. Finally, we argue in favor of an alternative explanation for
the success of language models in neuroscience: These models are effective at predicting brain
responses because they generally capture a wide variety of linguistic phenomena.
INTRODUCTION
Predictive coding is a cognitive theory of the high-level mechanisms underlying sensory pro-
cessing in the brain. It holds that the brain is constantly attempting to predict future events
before they occur. These predictions are revised and updated via error signals generated upon
comparison of predictions with observations. Predictive coding is attractive as a theory
because it provides a concrete, conceptually simple, and mechanistically plausible objective
for brain processing that seems to also relate to our own introspective experience of what it
feels like to learn. Although originally formulated to explain visual processing in the brain
(Huang & Rao, 2011; Jiang & Rao, 2021; Rao & Ballard, 1999), this theory has also been
extended to language processing. For language, predictive coding theories posit that the brain
works to preemptively generate predictions about future words and sentences as it perceives
natural language stimuli.
Evidence for predictive coding in language processing comes from several strands of
research. First, many studies have shown electrophysiological signals associated with syntac-
tically or semantically incongruent words or surprisal (Frank et al., 2015; Gagnepain et al.,
2012; Heilbron et al., 2022; Kuperberg & Jaeger, 2016; Kutas & Hillyard, 1984; Münte
et al., 1990; Schmitt et al., 2021; Shain et al., 2020). These signals are thought to correspond
to “prediction error” between what was predicted and what actually occurred.
Citation: Antonello, R., & Huth, A.
(2022). Predictive coding or just feature
discovery? An alternative account of
why language models fit brain data.
Neurobiology of Language. Advance
publication. https://doi.org/10.1162/nol
_a_00087
DOI:
https://doi.org/10.1162/nol_a_00087
Supporting Information:
https://doi.org/10.1162/nol_a_00087
Received: 28 February 2022
Accepted: 26 October 2022
Competing Interests: The authors have
declared that no competing interests
exist.
Corresponding Author:
Alexander Huth
huth@cs.utexas.edu
Handling Editor:
Evelina Fedorenko
Copyright: © 2022
Massachusetts Institute of Technology
Published under a Creative Commons
Attribution 4.0 International
(CC BY 4.0) license
The MIT Press
Predictive coding:
A neuroscientific theory that posits
that the brain uses a prediction
objective or error to efficiently learn.
Language models:
Autoregressive machine learning
models that are trained to predict
next words given a previous context.
第二, many recent studies have shown that neural network language models (NNLMs),
which embody (some elements of ) predictive coding theory, are much more effective at
explaining brain activity elicited by natural language than earlier methods (Anderson et al.,
2021; Antonello et al., 2021; Caucheteux et al., 2021a; Caucheteux & King, 2022; Goldstein
et al., 2021; Jain & Huth, 2018; Jat et al., 2019; LeBel, Jain, & Huth, 2021; Li et al., 2021;
Schrimpf et al., 2021; Tikochinski et al., 2021; Toneva et al., 2020). Some of these studies
claim that the superiority of NNLMs over other methods is evidence for predictive coding theory
in language (Goldstein et al., 2021; Schrimpf et al., 2021). We argue in this paper that the high
performance of these models should not be construed as positive evidence in support of a theory
of predictive coding. As an alternative, we propose that the prediction task which these NNLMs
attempt to solve is simply one way out of many to discover useful linguistic features.
Language Models and Encoding Models
Unidirectional NNLMs are artificial neural networks that are trained to perform a “next word
prediction” task (Dai et al., 2019; Radford et al., 2019). 具体来说, these neural networks are
trained to generate a probability distribution over the next word in a sequence, conditioned on
a context consisting of previous words. For example, when fed the context “Better late than”, a
language model might assign a high probability to the next word being “never.”
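The mechanics of this lookup are easy to see in code. Below is a minimal sketch (not part of this study's pipeline) that queries GPT-2 Small for its next-word distribution using the public Hugging Face transformers package; the model and tokenizer identifiers are standard public ones.

```python
# Minimal sketch: querying a unidirectional NNLM (GPT-2 Small) for its
# distribution over the next token given a context. Uses the public
# Hugging Face `transformers` package; not the paper's own code.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

context = "Better late than"
inputs = tokenizer(context, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # shape: (1, n_tokens, vocab_size)
probs = torch.softmax(logits[0, -1], dim=-1)   # distribution over the next token

top = torch.topk(probs, 5)
for p, idx in zip(top.values, top.indices):
    # " never" should appear near the top of this list
    print(f"{tokenizer.decode([int(idx)])!r}: {p.item():.3f}")
```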
Compared to tasks that require labeled data, such as translation, question answering, 或者
word sense disambiguation, NNLMs have a distinct advantage because of the near-limitless
amount of data that can be used to train them; almost any natural language text that can be
scraped from the internet is valid data to train an NNLM. Further, in order to do effective next-
word prediction, NNLMs need to capture a great deal about the statistical regularities in nat-
ural language, including everything from part of speech (Tsai et al., 2019) to topic (Sun et al.,
2019) to coreference information (Joshi et al., 2019). The ease of training NNLMs and their
ability to learn many types of statistical dependencies have, in recent years, developed into
the paradigm of language model fine-tuning. In this paradigm, representations extracted from
existing NNLMs are retooled for other linguistic tasks such as named entity recognition (Li
et al., 2020), summarization (Nikolich et al., 2021), question answering (Su et al., 2019),
and sentiment analysis (Socher et al., 2013). Fine-tuning from NNLMs often outperforms
models that are trained from scratch on these tasks, as it allows the model to reuse linguistic
features that were learned by the original NNLM, and helps make up for the limited and costly
hand-labeled training data that many downstream tasks currently require (Dodge et al., 2020).
State-of-the-art NNLMs are typically organized into a series of architecturally homogeneous
layers of processing blocks called transformers (Radford et al., 2019; Vaswani et al., 2017).
Transformers use a mechanism known as dot product attention to selectively process some
elements of their input context while ignoring others. This mechanism enables models to inte-
grate information over much longer timescales than other methods (Vaswani et al., 2017). 这
output of each transformer layer is an encoded representation of its inputs, often called a hid-
den state. 例如, in the commonly used GPT-2 Small model (Radford et al., 2019), 这
hidden state is a 768-dimensional vector. This output vector is then fed into the next layer as its
input. These layers serve to transform information from the initial input (often provided as word
embeddings; see Mikolov et al., 2013) to a next word prediction output at the last layer. For
this reason, the hidden states of later layers (those near the output) generally tend to act as
representations that are more suitable for next word prediction than the hidden states of earlier
layers, which are more similar to the initial word embeddings. Language models are typically
evaluated by a metric known as perplexity, which measures how well they can predict next words.
Perplexity:
A formal metric for how well a
language model can predict a given
dataset; lower is better.
Encoding models:
Machine learning models that predict
brain response from natural stimulus
features.
Low perplexity means that the model assigns a high probability to the actual next word,
while high perplexity means that it assigns a low probability; that is, lower perplexity is better.
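As a concrete illustration, perplexity is just the exponentiated average negative log probability that the model assigned to the words that actually occurred. A tiny, self-contained sketch:

```python
import numpy as np

def perplexity(next_word_probs):
    """Exponentiated average negative log probability of the observed
    next words; lower is better."""
    p = np.asarray(next_word_probs, dtype=float)
    return float(np.exp(-np.mean(np.log(p))))

# A model that assigns probability 0.5 to every observed word has
# perplexity 2; one that assigns 0.01 has perplexity 100.
print(perplexity([0.5, 0.5, 0.5]))   # 2.0
print(perplexity([0.01, 0.01]))      # 100.0
```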
Drawing on the success of NNLMs for transferring to many different language tasks, neu-
roscientists have used NNLM representations that encode linguistic context to predict brain
responses to natural language (Jain & Huth, 2018). Regression models that attempt to predict
brain response to natural language stimuli by using an analytic feature space derived from the
stimuli can be called encoding models (Huth et al., 2016; Naselaris et al., 2011). Much recent
work has examined the extent to which features generated by language models can be used as
encoding model inputs (Caucheteux & King, 2022; Schrimpf et al., 2021). Particular interest
has been afforded to these LM-based encoding models, as they appear to outperform previous
approaches that used representations sourced from non-contextual word embedding spaces.
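Schematically, an encoding model is a regularized linear regression from stimulus-derived features to the response of every voxel, scored by voxelwise correlation on held-out data. The sketch below uses random arrays purely as placeholders; the shapes, regularization strength, and scikit-learn ridge solver are illustrative choices, not the specific pipeline used here.

```python
# Schematic LM-based encoding model: ridge regression from stimulus features
# (e.g., language model hidden states) to per-voxel BOLD responses.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_train = rng.standard_normal((3000, 768))   # features for training stimuli
Y_train = rng.standard_normal((3000, 5000))  # BOLD responses (time x voxels)
X_test = rng.standard_normal((300, 768))
Y_test = rng.standard_normal((300, 5000))

encoder = Ridge(alpha=100.0).fit(X_train, Y_train)
Y_pred = encoder.predict(X_test)

def voxelwise_corr(pred, actual):
    """Correlation between predicted and actual response, per voxel."""
    p = (pred - pred.mean(0)) / pred.std(0)
    a = (actual - actual.mean(0)) / actual.std(0)
    return (p * a).mean(0)

print("mean encoding performance:", voxelwise_corr(Y_pred, Y_test).mean())
```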
The success of this approach raises a key question: Why do LM-based encoding models
perform so much better than encoding models that use other feature spaces? One hypothesis
is that these features work so well precisely because their training objective—next word
prediction—is the same objective that the brain has learned to solve. For example, both
Schrimpf et al. (2021) and Caucheteux and King (2022) showed that there is a strong correla-
tion between encoding model performance for a feature space and that feature space’s
capacity for next word prediction. Schrimpf et al. (2021) in particular argue that this strong
correlation may be taken as evidence that the next-word prediction task is a fundamental part
of biological language processing. Accepting this argument requires us to interpret correlation
as causation: Some representations have high encoding performance because they have high
next-word prediction performance.
Goldstein et al. (2021) went even further, showing that embeddings for future words can be
predicted at significantly above chance by brain responses before word onset, even if simple
contextual and semantic information such as word meaning and bigram information is
removed. Caucheteux et al. (2021b) demonstrate a similar result, showing that embeddings
of future words improve LM-based encoding models over using only present context. They
each suggest that these results stand as strong direct evidence of predictive coding in the brain
during language processing.
In this paper, we analyze the strength of the evidence that encoding model research pro-
vides for the theory of predictive coding. We claim that existing evidence does not favor pre-
dictive coding above alternative explanations. (However, we distinguish this evidence from
the theory of predictive coding itself: It is plausible that the brain is doing predictive coding
even if it cannot be proven using this type of evidence.) Our claim is based on two major
arguments.
First, we examine the correlation between next word prediction performance and encoding
performance and present an alternative hypothesis for why representations from NNLMs per-
form well as encoding model inputs. In this alternative hypothesis, we suggest that the high
encoding performance of NNLM representations can be explained by the fact that these rep-
resentations transfer effectively to representations from many other linguistic tasks, a quality
which is acknowledged in the fine-tuning literature. We produce a standardized metric for this
“general” transfer performance and show that it is well correlated with brain encoding perfor-
mance. We construct another metric that captures transfer performance to a representation
extracted from a machine translation model from English to German. We show that the cor-
relation between this translation metric and next word prediction performance is also high,
and use this to argue that one should be generally skeptical of drawing strong inferences from
correlations with encoding performance alone.
Second, we argue that a theory of predictive coding implies that language representations
that are more useful for next word prediction should in general be better at predicting brain
responses when controlling for other factors. Caucheteux and King (2022) analyzed the per-
formance of individual layer hidden states as encoding model input features and showed that
the intermediate layers of these language models, which are not the best at next word predic-
tion, consistently outperform early and later layers as encoding model features. Using a vari-
ance partitioning argument, we build on this result to show that the late representations from
NNLMs, which are the best at predicting next words, explain strictly less variance in nearly
every cortical voxel than intermediate representations that are less effective at predicting next
words. Using these results, we further argue that the existence of predictive information in the
brain does not inherently necessitate a theory of predictive coding.
MATERIALS AND METHODS
MRI Data Collection
We used functional magnetic resonance imaging (fMRI) data collected from five human sub-
jects as they listened to English language podcast stories over Sensimetrics S14 (2022) 头-
phones. Subjects were not asked to make any responses, but simply to listen attentively to the
故事. For encoding model training, each subject listened to approximately 5 hr of unique
stories across five scanning sessions, yielding a total of 9,189 data points for each voxel across
the whole brain. For model testing, the subjects listened to the same test story once in each
session (i.e., five times). These responses were then averaged across repetitions. Functional
signal-to-noise ratios in each voxel were computed using the mean-explainable variance
method from Nishimoto et al. (2017) on the repeated test data. Only voxels within 8 mm of
the mid-cortical surface were analyzed, yielding roughly 90,000 voxels per subject. Language-
responsive voxels were identified as those where at least 5% of the response variance for the
test story, which was played at least five times for each subject, could be explained by the
average response across repetitions (Nishimoto et al., 2017).
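The sketch below illustrates this voxel-selection logic under simplifying assumptions (it omits the bias corrections of the full mean-explainable-variance method of Nishimoto et al., 2017); the array shapes are placeholders.

```python
# Simplified sketch of selecting language-responsive voxels: keep voxels where
# the mean response across repeated test-story presentations explains at least
# 5% of the single-repetition response variance.
import numpy as np

rng = np.random.default_rng(0)
repeats = rng.standard_normal((5, 300, 1000))   # (n_repeats, n_timepoints, n_voxels)

mean_resp = repeats.mean(axis=0)                # average over the five repetitions
residual = repeats - mean_resp                  # repetition-specific variability
explainable = 1.0 - residual.var(axis=(0, 1)) / repeats.var(axis=(0, 1))

language_responsive = explainable >= 0.05
print(language_responsive.sum(), "voxels pass the 5% threshold")
```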
MRI data were collected on a 3T Siemens Skyra scanner at the University of Texas at Austin
Biomedical Imaging Center using a 64-channel Siemens volume coil. Functional scans were
collected using a gradient echo-planar imaging sequence with repetition time (TR) = 2.00 s,
echo time (TE) = 30.8 ms, flip angle = 71°, multiband factor (simultaneous multislice) = 2,
voxel size = 2.6 mm × 2.6 mm × 2.6 mm (slice thickness = 2.6 mm), matrix size = 84 ×
84, and field of view = 220 mm. Anatomical data were collected using a T1-weighted
multi-echo MP-RAGE sequence with voxel size = 1 mm × 1 mm × 1 mm following the Free-
surfer morphometry protocol (Fischl, 2012).
All subjects were healthy and had normal hearing. The experimental protocol was
approved by the Institutional Review Board at the University of Texas at Austin. Written
informed consent was obtained from all subjects.
fMRI Preprocessing
All functional data were motion corrected using the FMRIB Linear Image Registration Tool
(FLIRT) from FSL 5.0 (Jenkinson & Smith, 2001). FLIRT was used to align all data to a template
that was made from the average across the first functional run in the first story session for each
subject. These automatic alignments were manually checked for accuracy.
Low frequency voxel response drift was identified using a second order Savitzky-Golay
filter (Savitzky & Golay, 1964) with a 120 s window and then subtracted from the signal. To
avoid onset artifacts and poor detrending performance near each end of the scan, responses
were trimmed by removing 20 s (10 volumes) at the beginning and end of each scan, which
removed the 10 s silent period and the first and last 10 s of each story. The mean response for
each voxel was subtracted and the remaining response was scaled to have unit variance.
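For concreteness, the detrending and normalization steps can be sketched as follows, assuming a (time × voxels) response array at TR = 2 s; the data are placeholders and only the filter settings follow the text.

```python
# Sketch of the detrending pipeline: estimate low-frequency drift with a
# second-order Savitzky-Golay filter (~120 s window), subtract it, trim
# 10 volumes (20 s) at each end, and z-score every voxel.
import numpy as np
from scipy.signal import savgol_filter

TR = 2.0
resp = np.random.default_rng(0).standard_normal((400, 1000))  # placeholder (time x voxels)

window = int(120 / TR) + 1                         # 120 s window; must be odd
drift = savgol_filter(resp, window_length=window, polyorder=2, axis=0)
detrended = resp - drift

trimmed = detrended[10:-10]                        # drop 10 volumes at each end
zscored = (trimmed - trimmed.mean(0)) / trimmed.std(0)
```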
Encoding Model Construction
We used the fMRI data to generate voxelwise brain encoding models for 97 different language
representations. In order to temporally align word times with TR times, we applied Lanczos
interpolation together with a finite impulse response model as described in Huth et al. (2016).
Let t_i(S) correspond to the instantiation of the i-th representation on our transcribed stimulus set
S. Let g(t_i(S)) indicate a linearized ridge regression model that uses a temporally transformed
version of the representation instantiation t_i(S) as predictors. The temporal transformation
accounts for the lag in the hemodynamic response function (Huth et al., 2016; Nishimoto
et al., 2011). We use time delays of 2, 4, 6, and 8 s of the representation to generate this
temporal transformation. For each subject x, voxel v, and representation t_i, we fit a separate
encoding model g(x, v, t_i) to predict the BOLD response B̂ from our represented stimulus, that is,
B̂(x, v, t_i) = g(x, v, t_i)(t_i(S)). Encoding model performance for a representation was computed as the
average voxelwise performance across our five subjects.
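A schematic version of this fitting procedure is below: the representation is duplicated at delays of 1-4 TRs (2, 4, 6, and 8 s at TR = 2 s), the delayed copies are concatenated, and a ridge regression maps them to every voxel. The shapes, data, and regularization value are placeholders, not the study's actual settings.

```python
# Sketch of the FIR delay scheme plus ridge regression used for voxelwise
# encoding models: B_hat(x, v, t_i) = g(x, v, t_i)(t_i(S)).
import numpy as np
from sklearn.linear_model import Ridge

def make_delayed(features, delays=(1, 2, 3, 4)):
    """Concatenate copies of `features` (time x dims) shifted by `delays` TRs."""
    n_t, n_d = features.shape
    out = np.zeros((n_t, n_d * len(delays)))
    for i, d in enumerate(delays):
        out[d:, i * n_d:(i + 1) * n_d] = features[:n_t - d]
    return out

rng = np.random.default_rng(0)
t_i_S = rng.standard_normal((500, 768))     # t_i(S): representation of the stimulus
bold = rng.standard_normal((500, 1000))     # BOLD responses for one subject

X = make_delayed(t_i_S)                     # temporal transformation of t_i(S)
g = Ridge(alpha=100.0).fit(X, bold)         # one weight vector per voxel
bold_hat = g.predict(X)                     # predicted BOLD, B_hat(x, v, t_i)
```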
Next-Word Prediction Performance
We performed a linear regression between each representation and the GloVe embedding of
the next word (Pennington et al., 2014). We then computed the exponentiated average cross
entropy between the distribution over the predicted next word from this regression against the
ground truth next word. This value is used as a metric for how well each representation pre-
dicts next words. This metric was computed using a test corpus of approximately 54,000
words consisting of transcribed podcasts (LeBel, Wagner, et al., 2021).
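A hedged sketch of this metric is given below. The step that converts the predicted GloVe embedding into a distribution over the vocabulary (a softmax over dot products with the vocabulary's GloVe vectors) is our own assumption, and fitting and scoring on the same array is only to keep the example short; all data are placeholders.

```python
# Sketch of "linearly extractable perplexity": regress each representation onto
# the GloVe embedding of the next word, convert predictions to a vocabulary
# distribution, and exponentiate the average cross-entropy with the truth.
import numpy as np
from scipy.special import log_softmax
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_words, rep_dim, glove_dim, vocab_size = 5000, 768, 300, 2000
reps = rng.standard_normal((n_words, rep_dim))              # representation at each word
glove_vocab = rng.standard_normal((vocab_size, glove_dim))  # GloVe vectors for the vocabulary
next_ids = rng.integers(0, vocab_size, n_words)             # indices of the true next words

reg = LinearRegression().fit(reps, glove_vocab[next_ids])
pred_emb = reg.predict(reps)                                # predicted next-word embeddings

logits = pred_emb @ glove_vocab.T                           # similarity to every vocabulary word
logprobs = log_softmax(logits, axis=1)
cross_entropy = -logprobs[np.arange(n_words), next_ids].mean()
print("linearly extractable perplexity:", np.exp(cross_entropy))
```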
Representational Generality
For our 97 representations, we used the method and publicly available data and code from our
earlier work (Antonello et al., 2021) to measure the overall generality of the information con-
tained in these representations. Let S be our set of stimulus data. Further define U(S) as the
universal input feature space for our stimuli S. We used GloVe word embeddings of our stim-
ulus data for U(S). For each representation t ∈ T, we generated an encoder E_t(·) such that the
encoder extracts only information in U(S) that is needed to predict t(S). We did this by using a
bottlenecked linear neural network that maps every u ∈ U(S) to an intermediate low-
dimensional latent space L_t = E_t(U(S)) and then maps it to the given representation space,

∀s ∈ S: t(s) ≈ f(E_t(U(s))),

where f(·) is a mapping from L_t to t(S).
We used a small latent space of 20 dimensions to encourage the encoder to extract only the
information in U(S) that is relevant to compute t (S). These latent spaces were then scored on
how much better they transferred to other representations. The use of this approach over sim-
ple linear regression enables us to normalize representations by their dimensionality and mea-
sure the overall generality of each representation rather than the total amount of information
contained in each representation, which is more dependent on the total number of dimensions
in each representation. For every pair of representations (t1, t2) ∈ T, we next generate a
decoder D_{t1→t2} such that D_{t1→t2}(L_{t1}) = D_{t1→t2}(E_{t1}(U(S))) approximates t2(S). This yields a total
of n² decoders, where n = |T| is the total number of representations. All networks were trained
with batches of size 1024 and standard stochastic gradient descent with a learning rate of 10−4
for the initial encoders and 2 × 10−5 for the decoders. We enforce a strict linearity constraint on
both the encoder and decoder to ensure that representations that are nonlinearly decodable
from one another are treated as distinct (Naselaris et al., 2011). Hyperparameters were chosen
via coordinate descent.
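A compact sketch of one such encoder is shown below (placeholder data and dimensions; the decoders D_{t1→t2} are analogous linear maps from the 20-dimensional latent space to another representation, trained with the smaller learning rate given above).

```python
# Sketch of the strictly linear bottlenecked encoder: GloVe features U(S) are
# compressed to a 20-dimensional latent space L_t = E_t(U(S)) and then mapped
# to the target representation t(S), trained with plain SGD on MSE.
import torch
from torch import nn

glove_dim, latent_dim, rep_dim = 300, 20, 768

class BottleneckEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(glove_dim, latent_dim)   # E_t
        self.readout = nn.Linear(latent_dim, rep_dim)     # f: L_t -> t(S)

    def forward(self, u):
        return self.readout(self.encoder(u))              # no nonlinearities

U = torch.randn(4096, glove_dim)      # GloVe embeddings of the stimulus words
t_S = torch.randn(4096, rep_dim)      # target representation t(S)

net = BottleneckEncoder()
opt = torch.optim.SGD(net.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()
for step in range(1000):
    idx = torch.randint(0, U.shape[0], (1024,))           # batches of size 1024
    loss = loss_fn(net(U[idx]), t_S[idx])
    opt.zero_grad()
    loss.backward()
    opt.step()
```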
We finally used the decoders to generate a pairwise tournament matrix W_t for each repre-
sentation t by “fighting” all pairs of decoders that output to representation t using a held-out
test set S_test of sentences. Element (i, j) in W_t contains the ratio of samples in the test set for
which D_{ti→t} has lower mean squared error than D_{tj→t}, that is,

W_t(i, j) = E_{s∈S_test}[err_i(s) < err_j(s)] / E_{s∈S_test}[err_i(s) > err_j(s)],

where err_i(s) denotes the squared error of D_{ti→t}(s) against t(s).
For example, if the decoder D_{A→C} has lower mean squared error than decoder D_{B→C} for 75%
of the data in S_test, we assign the ratio of 0.75/0.25 = 3 to entry (A, B) in the tournament matrix
W_C for representation C.
We then averaged these pairwise tournament matrices W_t over all t to generate an average
pairwise tournament matrix W* which encodes the average relative performances of each rep-
resentation in transferring to the other representations in our set. Further averaging this matrix
along its first axis yields a metric of the relative propensity of each representation to transfer to
each other representation in general. We used this metric to denote the generality score of a
representation.
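A small sketch of the tournament scoring is given below, with random per-sample decoder errors standing in for the real decoders; the averaging convention at the end is one reasonable reading of the text.

```python
# Sketch of the pairwise tournament: entry (i, j) of W_t is the ratio of
# held-out samples on which decoder D_{ti->t} has lower squared error than
# D_{tj->t} (e.g., 0.75 / 0.25 = 3).
import numpy as np

def tournament_matrix(errors):
    """`errors[i]` holds per-sample squared errors of decoder i on S_test."""
    n = errors.shape[0]
    W = np.ones((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            wins = np.mean(errors[i] < errors[j])
            losses = max(np.mean(errors[i] > errors[j]), 1.0 / errors.shape[1])
            W[i, j] = wins / losses
    return W

rng = np.random.default_rng(0)
errors = rng.random((4, 500))            # 4 candidate decoders, 500 test samples
W_t = tournament_matrix(errors)
generality = W_t.mean(axis=1)            # mean win ratio of each representation
```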
Finally, we isolated the pairwise tournament matrix of an intermediate representation from
a machine translation model from English to German. We similarly averaged this matrix along
its first axis to yield a metric of translation transfer performance for each representation that
was not from the English to German model.
Voxelwise Variance Partitioning
For voxelwise variance partitioning, we used the method established by de Heer et al. (2017).
When partitioning the variance explained between two input spaces, A and B, over an output
set of voxels, we generated three models per voxel v and subject x: B̂(x, v, t_A), B̂(x, v, t_B), and B̂(x, v, t_{A·B}).
B̂(x, v, t_A) and B̂(x, v, t_B) refer to the models generated by using only A or B, respectively, as the input
representation. B̂(x, v, t_{A·B}) refers to the model generated by using A concatenated with B as the
input representation.
Variance explained was computed on a held-out pair of test stories from our podcast data.
Variance explained by the concatenated model but not explained by a single model was
inferred to be uniquely explained by the other single model. Only language responsive voxels
where at least 5% of the response variance for the test story was explainable (Nishimoto et al.,
2017) were included in our variance partitioning analyses.
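The partitioning scheme itself reduces to fitting three models and taking differences of their explained variance. The sketch below uses placeholder data and the signed R² convention from Figure 2 (r * |r|); alphas and shapes are illustrative.

```python
# Sketch of variance partitioning (after de Heer et al., 2017): fit encoding
# models on A, on B, and on their concatenation, then infer each space's
# unique and shared contributions from differences in explained variance.
import numpy as np
from sklearn.linear_model import Ridge

def r2_signed(pred, actual):
    """Per-voxel signed R^2, computed as r * |r| so values can be negative."""
    p = (pred - pred.mean(0)) / pred.std(0)
    a = (actual - actual.mean(0)) / actual.std(0)
    r = (p * a).mean(0)
    return r * np.abs(r)

def fit_and_score(X_tr, X_te, Y_tr, Y_te):
    model = Ridge(alpha=100.0).fit(X_tr, Y_tr)
    return r2_signed(model.predict(X_te), Y_te)

rng = np.random.default_rng(0)
A_tr, B_tr = rng.standard_normal((3000, 768)), rng.standard_normal((3000, 768))
A_te, B_te = rng.standard_normal((300, 768)), rng.standard_normal((300, 768))
Y_tr, Y_te = rng.standard_normal((3000, 500)), rng.standard_normal((300, 500))

r2_A = fit_and_score(A_tr, A_te, Y_tr, Y_te)
r2_B = fit_and_score(B_tr, B_te, Y_tr, Y_te)
r2_AB = fit_and_score(np.hstack([A_tr, B_tr]), np.hstack([A_te, B_te]), Y_tr, Y_te)

unique_A = r2_AB - r2_B   # explained by the joint model but not by B alone
unique_B = r2_AB - r2_A   # explained by the joint model but not by A alone
shared = r2_A + r2_B - r2_AB
```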
RESULTS
Correlations Between Encoding Performance and Other Metrics on Language Representations
Several recent studies (Caucheteux et al., 2021a; Schrimpf et al., 2021) have shown that lan-
guage models whose representations perform better as encoding model inputs tend to perform
better at predicting upcoming words or sentences. We first sought to replicate this result by
examining the relationship between encoding performance and the ability of a representation
to predict next words. We extracted a total of 97 representations from several different natural
language processing (NLP) models, including three word embedding spaces (GloVe, BERT-E,
and FLAIR; Akbik et al., 2019; Devlin et al., 2019; Pennington et al., 2014), three unidirec-
tional language models (GPT-2 Small, GPT-2 Medium, and Transformer-XL; Dai et al., 2019;
Radford et al., 2019; Wolf et al., 2019), two masked bidirectional language models (BERT and
ALBERT; Devlin et al., 2019; Lan et al., 2019), four common interpretable language tagging
tasks (named entity recognition, part-of-speech identification, sentence chunking, and frame
semantic parsing; Akbik et al., 2019), and two machine translation models (English →
Mandarin, English → German; Tiedemann & Thottingal, 2020). A full description of each of
these representations is given in the Supporting Information, which is available at https://doi
.org/10.1162/nol_a_00087.
Using a natural language fMRI data set, we constructed voxelwise encoding models for
each of the 97 language representations. For each voxel, we then computed the encoding per-
formance as the correlation between predicted and actual BOLD responses on a held-out test
dataset. We measured the overall encoding performance for each representation by comput-
ing the average encoding performance across all language-responsive voxels. We then mea-
sured how well each representation can do next word prediction by computing a “linearly
extractable perplexity” score (see Materials and Methods). Comparing encoding performance
and next word prediction performance across the 97 representations showed that these metrics
have a high mean correlation (r = 0.847; Figure 1A), replicating earlier results (Caucheteux
et al., 2021a; Schrimpf et al., 2021).
While the high correlation between next word prediction performance and encoding per-
formance is argued to be evidence for predictive coding in the brain, an alternative hypothesis
is that certain representations work well as encoding models because they contain information
that is generally useful for predicting representations from many language tasks, including next
word prediction. To test this hypothesis, we measured how well each of the 97 representations
could predict, or “transfer to,” the other 96 representations (see Materials and Methods). This
yields a metric measuring general transfer performance or representational generality. This met-
ric tells us how much generally useful language information is contained in each representa-
tion as compared to the other representations. Representations that contain information useful
for explaining other representations will have higher generality values, while those that con-
tain little useful information will have lower values. An extended discussion of this metric and
the motivation behind it is given in the Supporting Information, which is available at https://doi
.org/10.1162/nol_a_00087.
Figure 1B shows that there exists a very strong mean correlation (r = 0.864) between how
well a representation transfers in general and its encoding performance. This correlation is
numerically greater but not significantly different from the correlation between encoding per-
formance and next word prediction performance. This result provides support for the hypoth-
esis that certain representations produce effective encoding models because they have high
general transfer performance, but does not constitute proof. 的确, the high correlation
between all three metrics—next word prediction performance, general transfer performance,
and encoding performance—makes differentiation between competing causal hypotheses dif-
ficult. Yet even this confusion raises a salient point: Correlation between these metrics is not
sufficient to support a causal argument.
To further illustrate the difficulty of making causal claims based on this type of evidence, we
present a final example of the same type which is absurd on its face.
Figure 1. Correlates of encoding performance. Plotted are 97 language representations as mea-
sured according to four metrics: average encoding performance across five subjects; (A) next word
prediction performance, shown here as negative perplexity; (B) general transfer performance to
other representations; and (C) transfer performance to a representation extracted from an English-
to-German translation model. In each plot, encoding performance is compared to one of the other
metrics. In every case, encoding performance of a representation correlates strongly with the other
metric. Additionally, representations extracted from unidirectional language models (GPT-2 Small
and GPT-2 Medium) are the highest in each of these metrics. This suggests that the reason features
from unidirectional models such as GPT-2 (shown in orange) perform well may be because they are
generally good features that perform well when transferring to other language representations,
rather than because they are simply good at next word prediction. (D) Subsamples. To robustly esti-
mate correlations, 70 points from each comparison were selected at random 10,000 times and then
correlated. These are presented in the boxplot.
In this third analysis, we compared encoding performance for each representation to one specific type of transfer per-
formance: the ability of each representation to predict features extracted from an English-to-
German translation model (Tiedemann & Thottingal, 2020; see Materials and Methods). From
the set of models used to compute our representational generality metric, we isolated those
that predicted the intermediate representation of a machine translation model that was trained
to convert English text to German text. We then computed the relative transfer performance of
each of our representations to this machine translation representation, yielding a metric we
call “Eng → De translation transfer performance.” Comparing encoding performance to Eng →
De translation transfer performance again showed a high mean correlation (r = 0.780;
Figure 1C). How should we interpret this result? If we were to assume that this correlation
suggests causation (and were not aware of the other results), we might conclude that the objec-
tive underlying the brain’s processing of English language is translation to German. But this is
absurd, not least because none of the subjects in this study speak fluent German. Instead, we
should conclude that this correlation—like the others we have reported here—is likely the
result of common causes. To effectively predict brain responses, a representation must contain
many different types of linguistic information. Some types of linguistic information are useful
for predicting representations extracted from an Eng → De translation model. Thus, represen-
tations that make for good encoding models also excel at translating English to German.
Comparing Across Layers of Neural Network Language Models
We next investigated implications of predictive coding theory just within a single NNLM. One
consequence of predictive coding theory is that the brain should encode information about its
next word predictions. Thus, representations that contain predictive information about next
words should explain brain responses well. Further, representations that can predict next
words should uniquely explain some variance in brain responses that is not explained by rep-
resentations that lack that predictive information. We investigated this issue by analyzing
encoding performance for different layers from two variations of the same NNLM, GPT-2 Small
and GPT-2 Medium (Radford et al., 2019). In these unidirectional language models, words
enter at the first layer and then propagate through many intermediate layers until, at the last
层, the model predicts the next word. Across layers, the representations slowly shift from
more input-like in the early layers to more prediction-like in the latest layers. Many earlier
reports have shown that the best encoding performance (and transfer performance) is obtained
from layers closer to the middle of such a model, and not the latest layers (Antonello et al.,
2021; Caucheteux et al., 2021a; Caucheteux & King, 2022; Jain & Huth, 2018; Toneva &
Wehbe, 2019). This suggests that the intermediate layers are better at capturing linguistic struc-
ture than the latest layers, even though the latest layers are best at next word prediction. This
could contradict predictive coding theory, which would suggest that the latest layers, which
are best at predicting future words, should also yield the best encoding models.
To study this issue more closely, we both constructed encoding models and measured next
word prediction performance for each layer of the two GPT models. Figure 2A shows the next
word prediction performance of each layer alongside the hypothesized relationship between
encoding performance and depth suggested by predictive coding. As expected, the next word
prediction performance increases nearly monotonically, achieving its highest values in the
latest layers. 然而, actual encoding model performance (averaged across voxels and
subjects) does not follow this pattern. Here, consistent with earlier reports, we see that
encoding performance peaks at between 60% and 80% of maximum model depth, and then
falls precipitously for the latest layers. If the brain was truly representing predictions for the
next word, we should not see this pattern.
然而, this overall comparison is not conclusive: Although the intermediate layers pro-
vide better average encoding performance, it is still possible that the latest layers, by virtue of
doing better next word prediction, uniquely capture some variance in brain responses. This
would be sufficient to support the theory of predictive coding, which does not require that
every brain area represent next word predictions, only that some do. Put succinctly, next word
prediction anywhere supports predictive coding everywhere.
Figure 2. Variance partitioning is performed on the encoding performance of encoding models built from GPT-2. (A) A plot showing the
change in encoding performance as a function of layer depth in GPT-2 Small and GPT-2 Medium. (B) Maps showing the individual contri-
bution of variance explained from each component of a joint GPT-2 Small encoding model. (C) A mean breakdown of the contribution of
variance of each component of this model. The 12th layer explains no unique variance above the 9th layer despite better next word prediction
performance. R² is computed as R * |R| to allow for negative values.
To explicitly test for this possibility, we used a variance partitioning analysis to determine whether any brain responses
are uniquely explained by the last layer. In this analysis, we measured how much of the var-
iance in brain response could be uniquely explained by either the most performant layer in
each model (measured by average voxelwise correlation) or the last layer in each model, as
well as the amount of variance that could be explained equally well by either of those layers.
This was done by fitting three encoding models: one with just the best layer, one with just the
last layer, and one with both representations concatenated.
Figure 2B and C show the results of this variance partitioning analysis. Here we see that the
most performant layer (the ninth layer in GPT-2 Small) does not merely outperform the last
层, but actually dominates the last layer across the entire cortex. While much of the vari-
ance that can be explained by either layer is explained by both, the last layer uniquely
explains no significant additional variance above the ninth layer, while the ninth layer
explains some variance above the last layer. In fact, owing to the combination of high covari-
ance of the 12th layer features with the ninth layer features and having low beneficial contri-
bution of its own, the ridge regression using the concatenated model performs slightly worse
than the ridge regression using just the ninth layer features. This leads to a negative average
measured unique variance explained for the 12th layer, which can be seen in Figure 2C.
If the brain was performing an internal prediction task, then we would expect that at least
some voxels would have unique variance that could be explained only by the last layer, which
is most similar to the final predictive output of the language model. The fact that no variance is
uniquely explained by the last layer suggests that some intermediate structural representation
that is reached in the course of next word prediction is closer to what the brain internally rep-
resents. As the intermediate layers are also the best at transferring to other representations, this
further supports the hypothesis that overall representational generality—and not next word
prediction—underlies the success of language models at predicting brain data.
DISCUSSION
Recent work has argued in favor of a predictive coding theory of linguistic cognition based on
evidence from encoding models (Schrimpf et al., 2021). Among the most noteworthy claims
stemming from the encoding model literature is the observation, which we have replicated,
that a strong correlation exists between the encoding performance of a linguistic representa-
tion and its ability to predict next words. This correlation has been taken as causal evidence
that the brain is driven by predictive mechanisms that underlie its high-level objectives. We
believe, 然而, that this inference is flawed. It is perfectly reasonable to expect that if the
brain encodes a feature, then a model that also encodes the same feature will fit the brain better
than a model that does not, all other things equal. But predictive coding arguments apply this
implication in the wrong direction by assuming that models that fit the brain better than others
have feature X, so therefore the brain also has feature X, where “X” in this case is next word
prediction. Issues with this particular type of reasoning about artificial and biological compu-
tation are discussed extensively by Guest and Martin (2021).
As an analogy, consider the problem in signal processing of spectral density estimation.
Linear autoregressive models are often used to provide regularized estimates of the spectrum
of a signal (Ulrych & Bishop, 1975). Yet it would be false to suggest that spectral density esti-
mation is an example of predictive coding, as autoregressive models are merely one way to
accomplish this goal. In the same way, we cannot assume that language models fit the brain
well because the brain is trying to predict future inputs. The correlation between a represen-
tation’s encoding performance and its ability to transfer to an English-to-German translation
representation underscores this problem. If we were to apply the same logic to this correlation
as is applied to the correlation between the predictive power of models and their encoding
model performance, we might—absurdly—conclude that what underlies linguistic processing
in the brain is German translation. Yet a much simpler explanation for both effects is that gen-
erality in transferring to linguistic tasks is highly correlated with both measures, and represen-
tations that are suitable for one sufficiently general task (such as language modeling) are likely
to be suitable for many others (such as translation or brain encoding).
Furthermore, one possible entailment of predictive coding theory is that representations that
better encode next word prediction ought to capture some responses somewhere in the brain
better than representations that do not. However, our variance partitioning analysis showed
that as next-word linear decodability continues to improve across layers in GPT-2 Small,
encoding performance declines not merely on average, but everywhere.
One might object to an argument such as this, on the basis that such an entailment is not
necessary for predictive coding and that prediction may simply be an objective of the lan-
guage system, or that prediction in the brain occurs not at the word level but at a more abstract
conceptual level. While this seems exceedingly plausible, we are somewhat wary of treating
predictive coding itself as a rigorous scientific theory if it is only to mean that the brain uses the
objective of (possibly conceptual) prediction in order to help generate or manifest the lan-
guage system. We feel that this interpretation of predictive coding is vague and underdefined,
as it is unclear to us what provably false statements about the nature of the language system
could be made if directly measurable quantities such as linear next-word prediction perfor-
mance are rejected as irrelevant. We acknowledge that the tests we have explored here
may not be suitable for assessing every potential interpretation of predictive coding. Thus,
we would encourage our peers in the field who hold affirmative views regarding “conceptual”
predictive coding to expand and formalize them, so that they can be more precisely evaluated.
Of course, it is possible that the effects of predictive coding are simply undetectable at the
spatial and temporal resolution of fMRI, and that is a fundamental limitation of the analyses in
this article. But suppose that we could do this variance partitioning analysis at perfect resolu-
tion, without the limitations of neuroimaging methods, limited data, and imperfect regression
techniques. If we still observed no meaningful improvement anywhere in the brain from add-
ing a later layer of a language model to an earlier one, then proponents of predictive coding
would surely need to specify what quantifiable and falsifiable claims are being made about the
language system according to predictive coding theory that uniquely distinguish prediction
from absurd objectives like English-to-German translation.
Encoding model arguments concluding that the brain learns through prediction must nec-
essarily contend with the possibility that observed phenomena are the product of the low-
dimensional structure that naturally arises across language representations (Antonello et al.,
2021), whether they be from the brain or artificial models, and not the consequence of an
inherently predictive process. Furthermore, eliminating the confounds between structure
and prediction is extremely challenging, as any sufficiently structured linguistic system will
necessarily contain some predictive information, and any sufficiently predictive linguistic sys-
tem will possess inherent structure.
What does this all mean for the wider claims about a theory of predictive coding for lin-
guistic processing? We do not believe any of the results or arguments made in this article
should be considered evidence against predictive coding as a cognitive theory. Indeed, pre-
dictive coding elegantly and mechanistically explains many observed phenomena. We do,
however, claim that evidence from encoding model research should not be seen to currently
support a theory of predictive coding. This is due to the fact that much of what is cited as the
strongest evidence in favor of predictive coding from encoding model research would very
likely be true even in the absence of predictive coding, as our representational generality
results demonstrate.
If we are to reject the existing evidence, a logical next question is: What would suffice as
evidence for predictive coding? One possible avenue might be to determine whether next
word information can be used to predict brain activity before word onset better than informa-
tion from previous words. This is exactly the approach taken by Goldstein et al. (2021) 和
Caucheteux et al. (2021乙). They showed that a small but statistically significant improvement
in encoding performance can be gleaned by using future words to predict brain responses, 作为
compared to only using past words. While this is an elegant test, we feel the conclusion that is
drawn—that this implies that predictive coding occurs in the brain—should still be viewed
with skepticism. This is because it is challenging to differentiate between next word predictive
information that is incidentally useful for prediction but was generated for some other objec-
tive, and information that has been gleaned in the process of directly trying to predict next
words. As we have seen, linguistic information is highly versatile and general, and information
that is useful for one task is often useful for many others. Recall, for example, that it is entirely
possible to build a reasonably effective encoding model for English speakers using information
derived from an English-to-German translation model. So it is quite reasonable to believe that
some predictive or future information would be useful for brain encoding even if prediction
itself is not the driving mechanism of linguistic processing in the brain.
If evidence suggesting that next word information aids in brain encoding does not suffice,
what might? Predictive coding as a theory seems, ironically, to not predict many phenomena
uniquely. Much of what predictive coding can explain can also be explained without it. So
what measurable phenomenon differentiates a world where the brain does predictive coding
from one where the brain does not? The discovery of some naturally occurring low-level neu-
ral circuit that encodes prediction as an objective of language learning would be strong evi-
登塞. There is undeniably much existing evidence that is necessary for predictive coding to
be true. But without direct access to the neural circuits underlying language processing, 骗局-
vincingly sufficient evidence for predictive coding will no doubt be difficult to produce. 齿轮-
nitive theories invoking prediction as an essential element are fundamentally tied to those that
invoke generality, or more simply, learned structure, as each can plausibly explain the other.
There may be no easy path forward in disentangling these concepts.
Predictive coding presents both a promise and a challenge to computational neurolinguists.
On the one hand, as a cognitive theory, it makes a relatively concrete and exceedingly plausible
claim about the high-level nature of the brain that greatly coincides with our intuition. It would
plainly represent a grand achievement of modern computational neuroscience if it could be
proven to be true. 另一方面, serious inquiry into predictive coding naturally intro-
duces a perfidious tangle of confounds. Finding a solution to these confounding issues may be
a major step toward discovering the computational principles underlying language processing
in the human brain.
ACKNOWLEDGMENTS
We would like to acknowledge Shailee Jain and Arjun Bose for editing and feedback on this
manuscript. This research was funded by grants from the NIDCD and NSF (1R01DC020088-
001), the Burroughs-Wellcome Foundation, and a gift from Intel Inc.
FUNDING INFORMATION
Alexander Huth, Burroughs Wellcome Fund (https://dx.doi.org/10.13039/100000861).
Alexander Huth, Intel Corporation (https://dx.doi.org/10.13039/100002418). Alexander Huth,
National Institute on Deafness and Other Communication Disorders (https://dx.doi.org/10
.13039/100000055), Award ID: 1R01DC020088-001.
AUTHOR CONTRIBUTIONS
Richard Antonello: Conceptualization: Lead; Funding acquisition: Supporting; Investigation:
Lead; Methodology: Lead; Software: Lead; Validation: Lead; Visualization: Lead; Writing—
original draft: Lead; Writing—review & editing: Equal. Alexander Huth: Conceptualization:
Supporting; Data curation: Lead; Formal analysis: Lead; Funding acquisition: Lead; Investiga-
tion: Supporting; Methodology: Supporting; Project administration: Lead; Resources: Lead;
Software: Supporting; Supervision: Lead; Validation: Supporting; Visualization: Supporting;
Writing—original draft: Supporting; Writing—review & editing: Equal.
REFERENCES
Akbik, A., Bergmann, T., Blythe, D., Rasul, K., Schweter, S., & Vollgraf, R. (2019). FLAIR: An easy-to-use framework for state-of-the-art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations) (pp. 54–59). Association for Computational Linguistics.
Anderson, A. J., Kiela, D., Binder, J. R., Fernandino, L., Humphries, C. J., Conant, L. L., Raizada, R. D., Grimm, S., & Lalor, E. C. (2021). Deep artificial neural networks reveal a distributed cortical network encoding propositional sentence-level meaning. Journal of Neuroscience, 41(18), 4100–4119. https://doi.org/10.1523/JNEUROSCI.1152-20.2021, PMID: 33753548
Antonello, R., Turek, J. S., Vo, V., & Huth, A. (2021). Low-dimensional structure in the space of language representations is reflected in brain responses. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, & J. Wortman Vaughan (Eds.), Advances in neural information processing systems 34 (NeurIPS 2021). NeurIPS. https://proceedings.neurips.cc/paper/2021/file/464074179972cbbd75a39abc6954cd12-Paper.pdf
Caucheteux, C., Gramfort, A., & King, J.-R. (2021a). GPT-2’s activations predict the degree of semantic comprehension in the human brain. BioRxiv. https://doi.org/10.1101/2021.04.20.440622
Caucheteux, C., Gramfort, A., & King, J.-R. (2021b). Long-range and hierarchical language predictions in brains and algorithms. ArXiv. https://doi.org/10.48550/arXiv.2111.14232
Caucheteux, C., & King, J.-R. (2022). Brains and algorithms partially converge in natural language processing. Communications Biology, 5(1), Article 134. https://doi.org/10.1038/s42003-022-03036-1, PMID: 35173264
Dai, Z., Yang, Z., Yang, Y., Cohen, W. W., Carbonell, J., Le, Q. V., & Salakhutdinov, R. (2019). Transformer-XL: Attentive language models beyond a fixed-length context. ArXiv. https://doi.org/10.48550/arXiv.1901.02860
de Heer, W. A., Huth, A. G., Griffiths, T. L., Gallant, J. L., & Theunissen, F. E. (2017). The hierarchical cortical organization of human speech processing. Journal of Neuroscience, 37(27), 6539–6557. https://doi.org/10.1523/JNEUROSCI.3267-16.2017, PMID: 28588065
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Long and Short Papers) (pp. 4171–4186). Association for Computational Linguistics.
Dodge, J., Ilharco, G., Schwartz, R., Farhadi, A., Hajishirzi, H., & Smith, N. (2020). Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. ArXiv. https://doi.org/10.48550/arXiv.2002.06305
Fischl, B. (2012). FreeSurfer. NeuroImage, 62(2), 774–781. https://doi.org/10.1016/j.neuroimage.2012.01.021, PMID: 22248573
Frank, S. L., Otten, L. J., Galli, G., & Vigliocco, G. (2015). The ERP response to the amount of information conveyed by words in sentences. Brain and Language, 140, 1–11. https://doi.org/10.1016/j.bandl.2014.10.006, PMID: 25461915
Gagnepain, P., Henson, R. N., & Davis, M. H. (2012). Temporal predictive codes for spoken words in auditory cortex. Current Biology, 22(7), 615–621. https://doi.org/10.1016/j.cub.2012.02.015, PMID: 22425155
Goldstein, A., Zada, Z., Buchnik, E., Schain, M., Price, A., Aubrey, B., Nastase, S. A., Feder, A., Emanuel, D., Cohen, A., Jansen, A., Gazula, H., Choe, G., Rao, A., Kim, S. C., Casto, C., Fanda, L., Doyle, W., Friedman, D., … Hasson, U. (2021). Thinking ahead: Spontaneous prediction in context as a keystone of language in humans and machines. BioRxiv. https://doi.org/10.1101/2020.12.02.403477
Guest, O., & Martin, A. E. (2021). On logical inference over brains, behaviour, and artificial neural networks. PsyArXiv. https://doi.org/10.31234/osf.io/tbmcg
Heilbron, M., Armeni, K., Schoffelen, J.-M., Hagoort, P., & de Lange, F. P. (2022). A hierarchy of linguistic predictions during natural language comprehension. Proceedings of the National Academy of Sciences, 119(32), Article e2201968119. https://doi.org/10.1073/pnas.2201968119, PMID: 35921434
Huang, Y., & Rao, R. P. (2011). Predictive coding. Wiley Interdisciplinary Reviews: Cognitive Science, 2(5), 580–593. https://doi.org/10.1002/wcs.142, PMID: 26302308
Huth, A. G., de Heer, W. A., Griffiths, T. L., Theunissen, F. E., & Gallant, J. L. (2016). Natural speech reveals the semantic maps that tile human cerebral cortex. Nature, 532(7600), 453–458. https://doi.org/10.1038/nature17637, PMID: 27121839
Jain, S., & Huth, A. (2018). Incorporating context into language encoding models for fMRI. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in neural information processing systems 31 (NIPS 2018). NeurIPS. https://proceedings.neurips.cc/paper/2018/hash/f471223d1a1614b58a7dc45c9d01df19-Abstract.html
Jat, S., Tang, H., Talukdar, P., & Mitchell, T. (2019). Relating simple sentence representations in deep neural networks and the brain. ArXiv. https://doi.org/10.48550/arXiv.1906.11861
Jenkinson, M., & Smith, S. M. (2001). A global optimisation method for robust affine registration of brain images. Medical Image Analysis, 5(2), 143–156. https://doi.org/10.1016/S1361-8415(01)00036-6, PMID: 11516708
Jiang, L. P., & Rao, R. P. N. (2021). Predictive coding theories of cortical function. ArXiv. https://doi.org/10.48550/arXiv.2112.10048
Joshi, M., Levy, O., Weld, D. S., & Zettlemoyer, L. (2019). BERT for coreference resolution: Baselines and analysis. ArXiv. https://doi.org/10.48550/arXiv.1908.09091
Kuperberg, G. R., & Jaeger, T. F. (2016). What do we mean by prediction in language comprehension? Language, Cognition and Neuroscience, 31(1), 32–59. https://doi.org/10.1080/23273798.2015.1102299, PMID: 27135040
Kutas, M., & Hillyard, S. A. (1984). Brain potentials during reading reflect word expectancy and semantic association. Nature, 307(5947), 161–163. https://doi.org/10.1038/307161a0, PMID: 6690995
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). ALBERT: A lite BERT for self-supervised learning of language representations. ArXiv. https://doi.org/10.48550/arXiv.1909.11942
LeBel, A., Jain, S., & Huth, A. G. (2021). Voxelwise encoding models show that cerebellar language representations are highly conceptual. Journal of Neuroscience, 41(50), 10341–10355. https://doi.org/10.1523/JNEUROSCI.0118-21.2021, PMID: 34732520
LeBel, A., Wagner, L., Jain, S., Adhikari-Desai, A., Gupta, B., Morgenthal, A., Tang, J., Xu, L., & Huth, A. G. (2021). An fMRI
dataset during a passive natural language listening task, Open-
Neuro, Version 1.0.2. https://doi.org/10.18112/openneuro
.ds003020.v1.0.2
李, J。, Sun, A。, Han, J。, & 李, C. (2020). A survey on deep learning for
named entity recognition. IEEE Transactions on Knowledge and
Data Engineering, 34(1), 50–70. https://doi.org/10.1109/ TKDE
.2020.2981314
Li, J., Wang, S., Luh, W.-M., Pylkkänen, L., Yang, Y., & Hale, J. (2021). Cortical processing of reference in language revealed by computational models. BioRxiv. https://doi.org/10.1101/2020.11.24.396598
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In C. J. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Q. Weinberger (Eds.), Advances in neural information processing systems 26 (NIPS 2013). Neural Information Processing Systems. https://papers.nips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf
Münte, T., Heinze, H., & Prevedel, H. (1990). Ereigniskorrelierte Hirnpotentiale reflektieren semantische und synthaktische Fehler bei der Sprachverarbeitung [Event-related brain potentials reflect semantic and syntactic errors during language processing]. EEG-EMG Zeitschrift für Elektroenzephalographie, Elektromyographie und verwandte Gebiete, 21(2), 75–81. https://doi.org/10.1055/s-2008-1060778, PMID: 2113458
Naselaris, T., Kay, K. N., Nishimoto, S., & Gallant, J. L. (2011). Encoding and decoding in fMRI. NeuroImage, 56(2), 400–410. https://doi.org/10.1016/j.neuroimage.2010.07.073, PMID: 20691790
Nikolich, A., Osliakova, I., Kudinova, T., Kappusheva, I., & Puchkova, A. (2021). Fine-tuning GPT-3 for Russian text summarization. In R. Silhavy, P. Silhavy, & Z. Prokopova (Eds.), Data science and intelligent systems: CoMeSySo 2021 (pp. 748–757). Springer. https://doi.org/10.1007/978-3-030-90321-3_61
Nishimoto, S., Huth, A. G., Bilenko, N. Y., & Gallant, J. L. (2017). Eye movement-invariant representations in the human visual system. Journal of Vision, 17(1), 11. https://doi.org/10.1167/17.1.11, PMID: 28114479
Nishimoto, S., Vu, A. T., Naselaris, T., Benjamini, Y., Yu, B., & Gallant, J. L. (2011). Reconstructing visual experiences from brain activity evoked by natural movies. Current Biology, 21(19), 1641–1646. https://doi.org/10.1016/j.cub.2011.08.031, PMID: 21945275
Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532–1543). Association for Computational Linguistics. https://aclanthology.org/D14-1162.pdf. https://doi.org/10.3115/v1/D14-1162
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI.com. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
Rao, R. P., & Ballard, D. H. (1999). Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1), 79–87. https://doi.org/10.1038/4580, PMID: 10195184
Savitzky, A., & Golay, M. J. E. (1964). Smoothing and differentiation of data by simplified least squares procedures. Analytical Chemistry, 36(8), 1627–1639. https://doi.org/10.1021/ac60214a047
Schmitt, L.-M., Erb, J., Tune, S., Rysop, A. U., Hartwigsen, G., & Obleser, J. (2021). Predicting speech from a cortical hierarchy of event-based time scales. Science Advances, 7(49), Article eabi6070. https://doi.org/10.1126/sciadv.abi6070, PMID: 34860554
Schrimpf, M., Blank, I. A., Tuckute, G., Kauf, C., Hosseini, E. A., Kanwisher, N., Tenenbaum, J. B., & Fedorenko, E. (2021). The neural architecture of language: Integrative modeling converges on predictive processing. Proceedings of the National Academy of Sciences, 118(45), Article e2105646118. https://doi.org/10.1073/pnas.2105646118, PMID: 34737231
Schulte, P., & Neander, K. (2022). Teleological theories of mental content. In E. N. Zalta (Ed.), The Stanford encyclopedia of philosophy (Summer 2022 edition). Metaphysics Research Lab, Stanford University.
Sensimetrics. (2022). Model S14: Insert earphones for fMRI research. https://www.sens.com/products/model-s14/
Shain, C., Blank, I. A., van Schijndel, M., Schuler, W., & Fedorenko, E. (2020). fMRI reveals language-specific predictive coding during naturalistic sentence comprehension. Neuropsychologia, 138, Article 107307. https://doi.org/10.1016/j.neuropsychologia.2019.107307, PMID: 31874149
Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A. Y., & Potts, C. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (pp. 1631–1642). Association for Computational Linguistics.
Su, D., Xu, Y., Winata, G. I., Xu, P., Kim, H., Liu, Z., & Fung, P. (2019). Generalizing question answering system with pre-trained language model fine-tuning. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering (pp. 203–211). Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-5827
Sun, C., Qiu, X., Xu, Y., & Huang, X. (2019). How to fine-tune BERT for text classification? In M. Sun, X. Huang, H. Ji, Z. Liu, & Y. Liu (Eds.), China national conference on Chinese computational linguistics (pp. 194–206). Springer. https://doi.org/10.1007/978-3-030-32381-3_16
Tiedemann, J., & Thottingal, S. (2020). OPUS-MT—Building open translation services for the World. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT) (pp. 479–480). EAMT.
Tikochinski, R., Goldstein, A., Yeshurun, Y., Hasson, U., & Reichart, R. (2021). Fine-tuning of deep language models as a computational framework of modeling listeners’ perspective during language comprehension. BioRxiv. https://doi.org/10.1101/2021.11.22.469596
Toneva, M., Mitchell, T. M., & Wehbe, L. (2020). Combining computational controls with natural text reveals new aspects of meaning composition. BioRxiv. https://doi.org/10.1101/2020.09.28.316935
Toneva, M., & Wehbe, L. (2019). Interpreting and improving natural-language processing (in machines) with natural language-processing (in the brain). In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in neural information processing systems 32. Neural Information Processing Systems. https://proceedings.neurips.cc/paper/2019/file/749a8e6c231831ef7756db230b4359c8-Paper.pdf
Tsai, H., Riesa, J., Johnson, M., Arivazhagan, N., Li, X., & Archer, A. (2019). Small and practical BERT models for sequence labeling. ArXiv. https://doi.org/10.48550/arXiv.1909.00100
Ulrych, T. J., & Bishop, T. N. (1975). Maximum entropy spectral analysis and autoregressive decomposition. Reviews of Geophysics, 13(1), 183–200. https://doi.org/10.1029/RG013i001p00183
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is
all you need. In Advances in neural information processing systems 30 (NIPS 2017). Neural Information Processing Systems. https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. R. (2018). GLUE: A multi-task benchmark and analysis platform for natural language understanding. ArXiv. https://doi.org/10.48550/arXiv.1804.07461
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., & Brew, J. (2019). Huggingface’s transformers: State-of-the-art natural language processing. ArXiv. https://doi.org/10.48550/arXiv.1910.03771