BLiMP: The Benchmark of Linguistic Minimal Pairs for English

Alex Warstadt1, Alicia Parrish1, Haokun Liu2, Anhad Mohananey2,
Wei Peng2, Sheng-Fu Wang1, Samuel R. Bowman1,2,3

1Department of Linguistics
New York University

2Department of Computer Science
New York University

3Center for Data Science
New York University

{warstadt,alicia.v.parrish,haokunliu,anhad,
weipeng,shengfu.wang,bowman}@nyu.edu

Abstract

We introduce The Benchmark of Linguistic
Minimal Pairs (BLiMP),1 a challenge set for
evaluating the linguistic knowledge of lan-
guage models (LMs) on major grammatical
phenomena in English. BLiMP consists of
67 individual datasets, each containing 1,000
minimal pairs—that is, pairs of minimally dif-
ferent sentences that contrast in grammatical
acceptability and isolate a specific phenomenon
in syntax, morphology, or semantics. We gen-
erate the data according to linguist-crafted
grammar templates, and human aggregate
agreement with the labels is 96.4%. We
evaluate n-gram, LSTM, and Transformer
(GPT-2 and Transformer-XL) LMs by observ-
ing whether they assign a higher probability to
the acceptable sentence in each minimal pair.
We find that state-of-the-art models identify
morphological contrasts related to agreement
reliably, but they struggle with some subtle
semantic and syntactic phenomena, such as
negative polarity items and extraction islands.

1 Introduction

Current neural networks for sentence processing
rely on unsupervised pretraining tasks like
language modeling. Still, it is an open question
how the linguistic knowledge of state-of-the-art
language models (LMs) varies across the
linguistic phenomena of English. Recent studies
(e.g., Linzen et al., 2016; Marvin and Linzen,
2018; Wilcox et al., 2018) have explored this
question by evaluating LMs’ preferences between
minimal pairs of sentences differing in
grammatical acceptability, as in Example 1. However, each

1https://github.com/alexwarstadt/blimp.

of these studies uses a different set of metrics,
and focuses on a small set of linguistic
paradigms, severely limiting any possible
big-picture conclusions.

(1)
a. The cats annoy Tim. (grammatical)
b. *The cats annoys Tim. (ungrammatical)

We introduce the Benchmark of Linguistic
Minimal Pairs (shortened to BLiMP), a linguis-
tically motivated benchmark for assessing the
sensitivity of LMs to acceptability contrasts across
a wide range of English phenomena, covering both
previously studied and novel contrasts. BLiMP
consists of 67 datasets automatically generated
from linguist-crafted grammar templates, each
containing 1,000 minimal pairs and organized
by phenomenon into 12 categories. Validation
with crowdworkers shows that BLiMP faithfully
represents human preferences.

We use BLiMP to study several pretrained LMs:
Transformer-based LMs GPT-2 (Radford et al.,
2019) and Transformer-XL (Dai et al., 2019), an
LSTM LM trained by Gulordava et al. (2019), and
an n-gram LM. We evaluate whether the LM
assigns a higher probability to the acceptable
sentence in each minimal pair to determine which
grammatical distinctions LMs are sensitive to.
This gives us indirect evidence about each model’s
linguistic knowledge and allows us to compare
models in a fine-grained way. We conclude that
current neural LMs appear to acquire robust
knowledge of morphological agreement and some
syntactic phenomena such as ellipsis and control/
raising. They show weaker evidence of knowledge
about argument structure, negative polarity item
licensing, and the semantic properties of quan-
tifiers. All models perform at or near chance
on extraction islands. Overall, every model we

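Transactions of the Association for Computational Linguistics, vol. 8, pp. 377–392, 2020. https://doi.org/10.1162/tacl_a_00321
Action Editor: Mark Steedman. Submission batch: 1/2020; Revision batch: 3/2020; Published 7/2020.
© 2020 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.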

evaluate falls short of human performance by a
wide margin. GPT-2, which performs the best,
performs 8 points below humans overall, though
it does match or exceed human performance on
specific phenomena.

In §6.3 we conduct additional experiments to
investigate the effect of training size on the
LSTM LM and Transformer-XL’s performance
on BLiMP. Although we see steady improvements
in overall performance, we find that LMs learn
phenomenon-specific distinctions at different
rates.
In §6.4 we consider alternative well-
motivated evaluation metrics on BLiMP, but find
that they do not differ drastically from our method
of comparing LM probabilities for full sentences.
We conclude that whereas models like GPT-2
appear to have significant linguistic knowledge,
this knowledge is concentrated in some specific
domains of English grammar. We use BLiMP
to uncover several linguistic phenomena where
even state-of-the-art
language models clearly
lack human-like knowledge, and to bring into
focus those areas of grammar that future studies
evaluating LMs should investigate in greater
depth.

2 Background and Related Work

2.1 Language Models

The objective of a language model is to give
a probability distribution over the strings of a
language. Both neural network and non-neural
network architectures are used to build LMs, and
neural models can be trained in a self-supervised
setting without the need for labeled data. Recently,
variants of neural language modeling have been
shown to be a strong pretraining task for natural
language processing tasks (Howard and Ruder,
2018; Peters et al., 2018; Radford et al., 2018;
Devlin et al., 2019).

The last decade has seen two major paradigm
shifts in the state of the art for language modeling.
First, there was a movement from models based on
local n-gram statistics (see Chen and Goodman,
1999) to neural sequence models such as LSTMs
(Mikolov et al., 2010), which optimize on the
task of predicting the next token. Subsequently,
Transformer-based architectures employing
self-attention (Vaswani et al., 2017) have outperformed
LSTMs (e.g., Dai et al., 2019). Although these
shifts have resulted in stronger LMs, perplexity

on large benchmark datasets like WikiText-103
(Merity et al., 2016) has remained the primary
performance metric, which cannot give detailed
insight into these models’ knowledge of grammar.
Evaluation on benchmarks like GLUE (Wang
et al., 2018, 2019a), which heavily adapt language
models to perform downstream tasks, is more
informative, but doesn’t offer broad coverage
of linguistic phenomena, and doesn’t necessarily
reflect knowledge that is already present in the
LMs.

2.2 Linguistic Knowledge of NNs

Many recent studies have searched for evidence
that neural networks (NNs) learn representations
that implicitly encode grammatical concepts. We
refer to the ability to encode these concepts
as linguistic knowledge. Some studies evaluate
NNs’ linguistic knowledge using probing tasks
in which a classifier is trained to directly
predict grammatical properties of a sentence
(e.g., syntactic tree depth) or part of a sentence
(e.g., part-of-speech) using only the NNs’ learned
representation as input (Shi et al., 2016; Adi et al.,
2017; Conneau et al., 2018; Ettinger et al., 2018;
Tenney et al., 2019). We follow a complementary
approach that uses acceptability judgments to
address the same question without the need for
training data labeled with grammatical concepts.
Acceptability judgments are the main form of
behavioral data used in generative linguistics to
measure human linguistic competence (Chomsky,
1965; Schütze, 1996).

One branch of this literature uses minimal pairs
to infer whether LMs detect specific grammatical
contrasts. Table 1 summarizes linguistic
phenomena studied in this work. For example, Linzen
et al. (2016) look closely at minimal pairs
contrasting subject-verb agreement. Marvin and Linzen
(2018) expand the investigation to negative
polarity item and reflexive licensing. However,
these and related studies cover a limited set of
phenomena, to the exclusion of well-studied
phenomena in linguistics such as control and
raising, ellipsis, quantification, and countless
others. This is likely due to the labor-intensive
nature of collecting such targeted minimal pairs.

A related line of work evaluates neural networks
on acceptability judgments in a more domain-
general way. Corpora of sentences and their
grammaticality are collected for this purpose in a

Phenomenon             Relevant work
Anaphora/binding       Marvin and Linzen (2018), Futrell et al. (2018), Warstadt et al. (2019b)
Subj.-verb agreement   Linzen et al. (2016), Futrell et al. (2018), Gulordava et al. (2019), Marvin and Linzen (2018), An et al. (2019), Warstadt et al. (2019b)
Neg. polarity items    Marvin and Linzen (2018), Futrell et al. (2018), Jumelet and Hupkes (2018), Wilcox et al. (2019), Warstadt et al. (2019a)
Filler-gap/Islands     Wilcox et al. (2018), Warstadt et al. (2019b), Chowdhury and Zamparelli (2018, 2019), Chaves (2020), Da Costa and Chaves (2020)
Argument structure     Kann et al. (2019), Warstadt et al. (2019b), Chowdhury and Zamparelli (2019)

Table 1: Summary of related work organized by linguistic phenomena tested. All studies analyze neural
networks using acceptability judgments on minimal pairs mainly in English. Some studies appear
multiple times.

number of studies (Heilman et al., 2014; Lau et al.,
2017; Warstadt et al., 2019b). The most recent
and comprehensive corpus is CoLA (Warstadt
et al., 2019b), containing 10k sentences covering
a wide variety of linguistic phenomena provided as
examples in linguistics papers and books. CoLA,
which is included in the GLUE benchmark (Wang
et al., 2018), has been used to track advances
in the sensitivity of reusable sentence encoding
models to acceptability. Current models like
BERT (Devlin et al., 2019) and T5 (Raffel et al.,
2019) now learn to give acceptability judgments
that approach or even exceed individual human
agreement with CoLA.

Although CoLA can provide evidence about
phenomenon-specific knowledge of models, this
method is limited by the need to train a
supervised classifier on CoLA data prior to evaluation.
This is because CoLA is designed for binary
acceptability classification, and there is no
generally accepted method for obtaining binary
acceptability predictions from unsupervised
models like LMs.2 Warstadt and Bowman (2019)
measure phenomenon-specific performance on
CoLA for several pretrained sentence encoding
models: an LSTM, GPT (Radford et al., 2018),
and BERT. However, the use of supervision
prevents making strong conclusions about the
sentence encoding component, since it is not
possible to distinguish what the encoder knows
from what is learned through supervised training
on acceptability data.

Evaluating LMs on minimal pairs avoids this
problem, with the caveat that the LM probability

2Though see Lau et al. (2017) for some promising
proposals for normalizing LM probabilities to correlate with
gradient acceptability.

of a sentence can only serve as a proxy for
acceptability if confounding factors impacting
a sentence’s probability such as length and
lexical content are controlled for. It is with these
considerations in mind that we design BLiMP.

3 Data

BLiMP consists of 67 minimal pair paradigms,
each with 1,000 sentence pairs in mainstream
American English grouped into 12 categories.3
We refer to minimal pair types as paradigms
and categories as phenomena. Each paradigm is
annotated for the unique contrast it isolates and the
broader phenomena it is part of. We automatically
generate the data from linguist-crafted grammar
templates, and our automatic labels are validated
with crowd-sourced human judgments.

Although each minimal pair type corresponds
to exactly one paradigm, a particular fact about
English grammar may be illustrated by multiple
paradigms. For example, the fact that certain
determiners and nouns agree can be illustrated
by keeping the determiner the same and changing
the number marking of the noun as in the example
in Table 2, or by keeping the noun the same
and changing the determiner (e.g., Rachelle had
bought those chair.). With completeness in mind,
we include such complementary paradigms in
BLiMP whenever possible.

3We choose English because it is the native language of
the linguists who built the grammar templates, though in the
long run, creating versions of BLiMP in additional languages
would allow for coverage of more phenomena and expand
BLiMP’s range of usefulness. We assume 1,000 pairs is
sufficient to limit random noise resulting from small sample
sizes.

Phenomenon          N   Acceptable Example                                                  Unacceptable Example
ANAPHOR AGR.        2   Many girls insulted themselves.                                     Many girls insulted herself.
ARG. STRUCTURE      9   Rose wasn’t disturbing Mark.                                        Rose wasn’t boasting Mark.
BINDING             7   Carlos said that Lori helped him.                                   Carlos said that Lori helped himself.
CONTROL/RAISING     5   There was bound to be a fish escaping.                              There was unable to be a fish escaping.
DET.-NOUN AGR.      8   Rachelle had bought that chair.                                     Rachelle had bought that chairs.
ELLIPSIS            2   Anne’s doctor cleans one important book and Stacey cleans a few.    Anne’s doctor cleans one book and Stacey cleans a few important.
FILLER-GAP          7   Brett knew what many waiters find.                                  Brett knew that many waiters find.
IRREGULAR FORMS     2   Aaron broke the unicycle.                                           Aaron broken the unicycle.
ISLAND EFFECTS      8   Whose hat should Tonya wear?                                        Whose should Tonya wear hat?
NPI LICENSING       7   The truck has clearly tipped over.                                  The truck has ever tipped over.
QUANTIFIERS         4   No boy knew fewer than six guys.                                    No boy knew at most six guys.
SUBJECT-VERB AGR.   6   These casseroles disgust Kayla.                                     These casseroles disgusts Kayla.

Table 2: Minimal pairs from each of the twelve linguistic phenomenon categories covered by BLiMP.
Differences are underlined. N is the number of 1,000-example minimal pair paradigms within each
broad category.

3.1 Data Generation Procedure

To create minimal pairs exemplifying a wide array
of linguistic contrasts, we found it necessary to
artificially generate all datasets. This ensures both
that we have sufficient unacceptable examples,
and that the data is fully controlled, allowing for
repeated isolation of a single linguistic pheno-
menon (Ettinger et al., 2018). For each paradigm,
we use a generation script to sample lexical items
from a vocabulary of over 3,000 items according
to a template specifying linear order of the phrases
in the acceptable and unacceptable sentences in
each minimal pair. Our data generation scripts are
publicly available.4 We annotate these lexical
items with the morphological, syntactic, and
semantic features needed to enforce selectional
restrictions and create grammatical and seman-
tically felicitous sentences.
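
The template idea described above can be illustrated with a minimal sketch. It is not the authors' generation script: the tiny lexicon, the feature names, and the function names below are hypothetical stand-ins for the annotated vocabulary of over 3,000 items, and only one determiner-noun agreement template is shown.

```python
import random

# Hypothetical toy lexicon (an assumption for illustration only).
# Each noun is annotated with its singular and plural forms.
NOUNS = [
    {"sg": "chair", "pl": "chairs"},
    {"sg": "casserole", "pl": "casseroles"},
    {"sg": "unicycle", "pl": "unicycles"},
]
SUBJECTS = ["Rachelle", "Aaron", "Tonya"]
VERBS = ["had bought", "had seen", "had cleaned"]
DETERMINERS = {"sg": "that", "pl": "those"}


def generate_pair(rng):
    """Sample one determiner-noun agreement minimal pair.

    The template fixes the linear order subject-verb-determiner-noun.
    The determiner is held constant and only the noun's number marking
    changes, so the two sentences differ in exactly one lexical item.
    """
    subject = rng.choice(SUBJECTS)
    verb = rng.choice(VERBS)
    noun = rng.choice(NOUNS)
    number = rng.choice(["sg", "pl"])
    other = "pl" if number == "sg" else "sg"
    det = DETERMINERS[number]
    good = f"{subject} {verb} {det} {noun[number]}."
    bad = f"{subject} {verb} {det} {noun[other]}."
    return {"good": good, "bad": bad}


def generate_paradigm(n_pairs, seed=0):
    """Generate n_pairs minimal pairs from the same template."""
    rng = random.Random(seed)
    return [generate_pair(rng) for _ in range(n_pairs)]


if __name__ == "__main__":
    for pair in generate_paradigm(3):
        print(pair["good"], "|", pair["bad"])
```

Because both sentences in a pair share every sampled item except the one carrying the contrast, length and lexical content are held constant across the pair by construction.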

All examples in a paradigm are structurally
analogous up to the point required for the relevant
contrast but may vary in some ways. For example,
the template for NPI LICENSING, illustrated in
Table 2, specifies that an arbitrary verb phrase
needs to be generated. Thus, the generation
script samples from the entire set of verbs and
generates the required arguments on-the-fly. As a result,
the structure of the sentence then depends on
whether the sampled verb is transitive,
clause-embedding, raising, and so on, but that same

4https://github.com/alexwarstadt/data_generation.

verb phrase and its arguments are used in both
sentences of the pair.

This generation procedure is not without
limitations, and despite the very detailed
vocabulary we use, implausible sentences are
occasionally generated (e.g., Sam ran around some
glaciers). In these cases, though, both
acceptable and unacceptable sentences will be
equally implausible given world knowledge, so
any difference in the probability assigned to them
is still attributable to the intended grammatical
contrast.

3.2 Coverage

The paradigms covered by BLiMP represent
well-established contrasts in English morphology,
syntax, and semantics. Each paradigm is grouped
into one of 12 phenomena, as shown in Table 2.
Examples of all 67 paradigms appear in Table 4
of the Appendix. The paradigms are selected with
the constraints that they can be characterized using
templates as described above and illustrated with
minimal pairs of sentences equal in length5 that
differ in at most one vocabulary item.

Although this dataset has broad coverage, it is
not exhaustive. It is not possible to include every

5We define length as the number of entries from our
lexicon. Some sentences in a pair contain different numbers
of words because visit and drop by are each one lexical entry.
Where discrepancies in number of words occur, they are
generally randomly distributed across the grammatical and
ungrammatical sentences in a paradigm.

grammatical phenomenon of English, and there is
no agreed-upon set of core phenomena. However,
we consider frequent inclusion of a phenomenon in
a syntax/semantics textbook as an informal proxy
for what linguists consider to be core phenomena.
We survey several syntax textbooks (e.g., Sag
et al., 2003; Adger, 2003; Sportiche et al., 2013),
and find that nearly all of the phenomena in
BLiMP are discussed in some source. Most of
the topics that repeatedly appear in textbooks
and can be represented with minimal pairs (e.g.,
agreement, control/raising, wh-extraction/islands,
binding) are present in BLiMP.6

We characterize the 12 phenomena in BLiMP as follows7:

• ANAPHOR AGREEMENT: the requirement that
reflexive pronouns like himself (a.k.a.
anaphora) agree with their antecedents in
person, number, gender, and animacy.

• ARGUMENT STRUCTURE: the ability of different
verbs to appear with different types of
arguments. For example, different verbs can
appear with a direct object, participate in the
causative alternation, or take an inanimate
argument.

• BINDING: the structural relationship between
a pronoun and its antecedent. All paradigms
illustrate aspects of Chomsky’s (1981)
Principle A. Because coindexation cannot
be annotated in BLiMP, Principles B and C
are not illustrated.

• CONTROL/RAISING: syntactic and semantic
differences between various types of
predicates that embed an infinitival VP.
This includes control, raising, and
tough-movement predicates.

• DETERMINER-NOUN AGREEMENT: number
agreement between demonstrative determiners
(e.g., this/these) and the associated noun.

• ELLIPSIS: the possibility of omitting
expressions from a sentence. Because this is
difficult to illustrate with sentences of equal
length, our paradigms cover only special
cases of noun phrase ellipsis that meet this
constraint.

• FILLER-GAP: dependencies arising from
phrasal movement in, for example,
wh-questions.

• IRREGULAR FORMS: irregular morphology on
English past participles (e.g., broken). We
are unable to evaluate models on nonexistent
forms like *breaked because such forms are
out of the vocabulary for some LMs.

• ISLAND EFFECTS: restrictions on syntactic
environments where the gap in a filler-gap
dependency may occur.

• NPI LICENSING: restrictions on the distribution
of negative polarity items like any and ever
limited to, for example, the scope of negation
and only.

• QUANTIFIERS: restrictions on the distribution
of quantifiers. We cover two such
restrictions: superlative quantifiers (e.g., at
least) cannot embed under negation, and
definite quantifiers and determiners cannot
be subjects in existential-there constructions.

• SUBJECT-VERB AGREEMENT: subjects and
present tense verbs must agree in number.

6In line with these textbooks, we rely on stereotyped
gender-name pairings and contrasts not present in all English
dialects (more detail provided in the Appendix).

7Our implementation of these phenomena is often
narrower than the linguistic definition because of the
particular constraints described above.

3.3 Comparison to Related Resources

With a vocabulary of over 3,000 words, BLiMP
has by far the most lexical variation of any
related generated dataset. It includes verbs with
11 different subcategorization frames, including
verbs that select for PPs, infinitival VPs, and
embedded clauses. By contrast, datasets by
Ettinger et al. (2018) and Marvin and Linzen
(2018) use vocabularies of under 200 items. Other
datasets of minimal pairs that achieve more lexical
and syntactic variety use data-creation methods
that limit empirical scope and control. Linzen
et al. (2016) construct a dataset of minimal
pairs for subject-verb agreement by changing
verbs’ number marking in a subset of English
Wikipedia, but this approach does not generalize
beyond agreement phenomena. Lau et al. (2017)
construct minimal pairs by passing sentences from
the BNC through round-trip machine translation.
The resulting sentences contain a wider variety of

Model    Overall  ANA. AGR  ARG. STR  BINDING  CTRL. RAIS.  D-N AGR  ELLIPSIS  FILLER GAP  IRREGULAR  ISLAND  NPI   QUANTIFIERS  S-V AGR
5-gram   61.2     47.9      71.9      64.4     68.5         70.0     36.9      60.2        79.5       57.2    45.5  53.5         60.3
LSTM     69.8     91.7      73.2      73.5     67.0         85.4     67.6      73.9        89.1       46.6    51.7  64.5         80.1
TXL      69.6     94.1      69.5      74.7     71.5         83.0     77.2      66.6        78.2       48.4    55.2  69.3         76.0
GPT-2    81.5     99.6      78.3      80.1     80.5         93.3     86.6      81.3        84.1       70.6    78.9  71.3         89.0
Human    88.6     97.5      90.0      87.3     83.9         92.2     85.0      86.9        97.0       84.9    88.1  86.6         90.9

Table 3: Percentage accuracy of four baseline models and raw human performance on BLiMP using a forced-choice
task. A random guessing baseline would achieve an accuracy of 50%.

grammatical violations, but it is not possible to
control the nature or quantity of violations in the
resulting sentences.

3.4 Data Validation

To verify that the generated sentences represent a
real contrast in acceptability, we conduct human
validation via Amazon Mechanical Turk.8 Twenty
separate validators rated five pairs from each of
the 67 paradigms, for a total of 6,700 judgments.
We restricted validators to individuals currently
located in the US who self-reported as native
speakers of English. To ensure that our validators
made a genuine effort on the task, each HIT
included an attention check item and a hidden
field question to catch bot-assisted humans.
Validators were paid $0.25 for completing five
judgments, which we estimate took 1-2 minutes.
For each minimal pair, 20 individuals completed
a forced-choice task mirroring the LMs’ task;
the human-determined acceptable sentence was
calculated via majority vote of annotators. By this
metric, we estimate aggregate human agreement
with our annotations to be 96.4% overall. As a
threshold of inclusion in BLiMP, the majority of
validators needed to agree with BLiMP on at least
4/5 examples from each paradigm. Thus, all 67
paradigms in the public version of BLiMP passed
this validation; only two additional paradigms
were rejected on this criterion. We also estimate
individual human agreement to be 88.6% overall
using the approximately 100 annotations from
each paradigm.9 Table 3 reports individual human
results (and model results) as a conservative
measure of human agreement.
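
The difference between the aggregate and individual agreement figures can be made concrete with a short sketch. It assumes each validated pair is stored as a list of booleans, one per validator, marking whether that validator chose the sentence BLiMP labels acceptable. The function names and the tie-handling rule (a tie counts as disagreement) are assumptions for illustration, not details given in the text.

```python
def individual_agreement(votes_by_pair):
    """Fraction of all single judgments that agree with the BLiMP label."""
    votes = [v for pair in votes_by_pair for v in pair]
    return sum(votes) / len(votes)


def aggregate_agreement(votes_by_pair):
    """Fraction of pairs whose majority vote agrees with the BLiMP label.

    Assumption: a tie is counted as disagreement.
    """
    majority = [sum(pair) > len(pair) / 2 for pair in votes_by_pair]
    return sum(majority) / len(majority)


def passes_inclusion_threshold(votes_by_pair, min_pairs=4):
    """True if the majority agrees with BLiMP on at least min_pairs pairs."""
    agreeing = sum(sum(pair) > len(pair) / 2 for pair in votes_by_pair)
    return agreeing >= min_pairs


if __name__ == "__main__":
    # One hypothetical paradigm: 5 pairs, 20 validators per pair.
    paradigm = [
        [True] * 20,
        [True] * 15 + [False] * 5,
        [True] * 11 + [False] * 9,
        [True] * 9 + [False] * 11,
        [True] * 20,
    ]
    print(individual_agreement(paradigm))
    print(aggregate_agreement(paradigm))
    print(passes_inclusion_threshold(paradigm))
```

Majority voting smooths over individual disagreements, which is why the aggregate figure is higher than the individual one and why the individual figure is the more conservative point of comparison for models.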

4 Models

GPT-2 GPT-2 (Radford et al., 2019) is a large-
scale language model using the Transformer
architecture (Vaswani et al., 2017). Our main
experiments use GPT-2-large with 36 layers and
774M parameters.10 The model is pretrained on
Radford et al.’s WebText dataset, which contains
40GB of English text extracted from Web pages
and filtered for quality. To our knowledge,
WebText is not publicly available, so assuming
an average of 5–6 bytes/chars per word, we
estimate WebText contains about 8B tokens. We
use jiant, a codebase for training and evaluating
sentence understanding models (Wang et al.,
2019b), to implement code for evaluating GPT-2
on BLiMP.11
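
The WebText size estimate above is a back-of-the-envelope calculation, reproduced here as a small sketch. It assumes that 40GB means 40 × 10^9 bytes and that one word counts as one token; both are simplifying assumptions, and the function name is hypothetical.

```python
def estimate_tokens(corpus_bytes, bytes_per_word):
    """Rough token count, treating every word as one token."""
    return corpus_bytes / bytes_per_word


WEBTEXT_BYTES = 40e9  # assumption: decimal gigabytes

if __name__ == "__main__":
    for bytes_per_word in (5, 6):
        tokens = estimate_tokens(WEBTEXT_BYTES, bytes_per_word)
        print(f"{bytes_per_word} bytes/word: {tokens / 1e9:.1f}B tokens")
```

Under these assumptions the estimate falls between roughly 6.7B and 8B tokens, so the quoted figure of about 8B corresponds to the 5 bytes per word end of the range.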

Transformer-XL Transformer-XL (Dai et al.,
2019) is another multilayer Transformer-based
neural language model. We test the pretrained
Transformer-XL Large model with 18 layers of
Transformer decoders and 16 attention heads for
each layer. The model is trained on WikiText-
103 (Merity et al., 2016), a corpus of 103M
tokens from English Wikipedia. Code for testing
Transformer-XL on BLiMP is also implemented
in jiant.

LSTM We include a long short-term memory
(LSTM, Hochreiter and Schmidhuber, 1997)
LM in our experiments. Specifically, we test
a pretrained LSTM LM from Gulordava et al.
(2019) on BLiMP. The model is trained on an
83M-token corpus extracted from English Wikipedia.
To investigate the effect of training size on
model performance (§6.3), we retrain a series
of LSTM and Transformer-XL models with the
same hyperparameters and the following training

8The full set of human judgments and a summary of the
results for all 67 paradigms is in Table 4 in the Appendix.
9A few had to be excluded due to ineligible annotators.

10GPT-2-XL performs slightly worse on BLiMP; see §6.3.
11https://github.com/nyu-mll/jiant/tree/blimp-and-npi/scripts/blimp.

sizes: 64M, 32M, 16M, 8M, 4M, 2M, 1M, 1/2M,
1/4M, and 1/8M tokens. For each size, we train
the model on five different random samples of the
original training data, which has a size of 83M
tokens.12
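
The subsampling setup can be sketched as follows. The text does not say how the random samples were drawn, so this sketch assumes sentence-level sampling without replacement and whitespace token counts. All names are hypothetical.

```python
import random

# The ten training sizes listed above, in tokens.
TRAINING_SIZES = [
    int(m * 1_000_000) for m in (64, 32, 16, 8, 4, 2, 1, 0.5, 0.25, 0.125)
]
SAMPLES_PER_SIZE = 5


def sample_subcorpus(sentences, token_budget, rng):
    """Randomly pick sentences until the token budget cannot be filled further.

    Assumption: tokens are whitespace-separated words, and a sentence that
    would exceed the budget is skipped rather than truncated.
    """
    order = list(range(len(sentences)))
    rng.shuffle(order)
    picked, used = [], 0
    for i in order:
        n_tokens = len(sentences[i].split())
        if used + n_tokens > token_budget:
            continue
        picked.append(sentences[i])
        used += n_tokens
    return picked


def training_samples(sentences, sizes=TRAINING_SIZES, seed=0):
    """Yield (size, sample_index, subcorpus) for every size and sample."""
    for size in sizes:
        for k in range(SAMPLES_PER_SIZE):
            rng = random.Random(f"{seed}-{size}-{k}")
            yield size, k, sample_subcorpus(sentences, size, rng)
```

Training on several independent samples per size gives a sense of how much of the variation in accuracy comes from the particular sample rather than from the amount of data.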

5-gram We build a 5-gram LM on the English
Gigaword corpus (Graff et al., 2003), which
consists of 3.1B tokens. To efficiently query
n-grams we use an implementation13 based on
Heafield et al. (2013).14

5 Results and Discussion

An LM’s overall accuracy on BLiMP is simply
the proportion of the 67,000 minimal pairs in
which the model assigns a higher probability to
the acceptable sentence. We report the results for
all models and human evaluation in Table 3. GPT-
2 achieves the highest accuracy and the 5-gram
model the lowest. All models perform well below
estimated human accuracy (as described in § 3.4).
The 5-gram model’s poor performance—overall
and on every individual category—indicates
that BLiMP is likely not solvable from local
co-occurrence statistics alone.
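
The accuracy metric defined at the start of this section can be sketched in a few lines. The scorer below is a deliberately weak toy stand-in (an add-one-smoothed unigram model built from a two-sentence corpus), not any of the models evaluated in the paper. The function names are hypothetical, and any function returning a full-sentence log-probability could be passed in instead.

```python
import math
from collections import Counter


def make_unigram_scorer(corpus_sentences):
    """Toy stand-in for an LM: add-one-smoothed unigram log-probability."""
    counts = Counter(w for s in corpus_sentences for w in s.lower().split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot for unseen words

    def log_prob(sentence):
        # Sum of per-word log-probabilities over the full sentence.
        return sum(
            math.log((counts[w] + 1) / (total + vocab_size))
            for w in sentence.lower().split()
        )

    return log_prob


def forced_choice_accuracy(pairs, log_prob):
    """Proportion of (acceptable, unacceptable) pairs in which the
    acceptable sentence receives the strictly higher probability."""
    correct = sum(log_prob(good) > log_prob(bad) for good, bad in pairs)
    return correct / len(pairs)


if __name__ == "__main__":
    scorer = make_unigram_scorer(["the cats annoy tim", "the dogs annoy tim"])
    pairs = [
        ("The cats annoy Tim", "The cats annoys Tim"),
        ("The dogs annoy Tim", "The dogs annoys Tim"),
        ("The cat annoys Tim", "The cat annoy Tim"),
    ]
    print(forced_choice_accuracy(pairs, scorer))
```

The toy scorer gets the third pair wrong because it has only seen the form annoy, regardless of the subject. This is a small-scale version of the point made above about models that rely on local co-occurrence statistics.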

Because we evaluate pretrained models that
differ in architecture and training data, we can
only speculate about what drives these differences
(though see § 6.3 for a controlled ablation study
on the LSTM LM). The results seem to indicate
that access to training data is the main driver of
performance on BLiMP for the neural models we
evaluate. This may explain why Transformer-XL
and the LSTM LM perform similarly in spite of
differences in architecture, as both are trained on
approximately 100M tokens of Wikipedia text.
Relatedly, GPT-2’s advantage may come from
the fact that it is trained on roughly two orders of
magnitude more data. Possibly, LSTMs trained on
larger datasets could perform comparably to GPT-
2, but such experiments are impractical because of
the inefficiency of training LSTMs at this scale.

5.1 Results and Discussion by Phenomenon

The results also give insight into how LMs’
linguistic knowledge varies by domain. Models
generally perform best and closest to human
level on morphological phenomena. For instance,
GPT-2 performs within 2.1 points of humans
on ANAPHOR AGR., DET.-NOUN AGR., and SUBJ.-VERB
AGR. The set of challenging phenomena is more
diverse. ISLANDS are the hardest phenomenon by
a wide margin. Only GPT-2 performs well above
chance, and it remains 14 points below humans.
Some semantic phenomena, specifically those
involving NPI LICENSING and QUANTIFIERS, are also
challenging overall. All models perform relatively
poorly on ARG. STRUCTURE.

From these results we conclude that current
SotA LMs robustly encode basic facts of English
agreement. This does not mean that LMs will
come close to human performance for all
agreement phenomena. §6.1 discusses evidence
that increased dependency length and the presence
of agreement attractors of the kind investigated by
Linzen et al. (2016) and Gulordava et al. (2019)
reduce performance on agreement phenomena.

We find, in accordance with Wilcox et al.
(2018), that LMs do represent long-distance
wh-dependencies, but we also conclude that
their representations differ fundamentally from
humans’. Although some models approach human
performance in ordinary filler-gap dependencies,
they are exceptionally poor at identifying island
violations overall. This finding suggests that they
reliably encode long-distance dependencies in
general, but not the syntactic domains in which
these dependencies are blocked, though GPT-2
does perform well above chance on some
paradigms of ISLAND EFFECTS. However, strong
conclusions about how these models represent
wh-dependencies are not possible using the
forced-choice task compatible with BLiMP, and
a complete assessment of syntactic islands is best
addressed using a factorial design that manipulates
both the presence of an island and an attempt to
extract from it, as in Kush et al. (2018) or Wilcox
et al. (2018).

12https://github.com/sheng-fu/colorlessgreenRNNs.

13https://github.com/kpu/kenlm.
14https://github.com/anhad13/blimp_ngram.

In the semantic phenomena where models
struggle (NPIs and QUANTIFIERS), violations are
often attributed in semantic theories to a
presupposition failure or contradiction arising from
semantic composition or pragmatic reasoning
(e.g., Chierchia, 2013; Ward and Birner, 1995;
Geurts and Nouwen, 2007). These abstract

semantic and pragmatic factors may be difficult
for LMs to learn. Marvin and Linzen (2018) also find
that LSTMs largely fail to recognize NPI licensing
conditions. Warstadt et al. (2019a) find that BERT
(which is similar in scale to GPT-2) recognizes
these conditions inconsistently in an unsupervised
setting.

The weak performance on ARG. STRUCTURE is
somewhat surprising, since arguments and heads
are usually, though not always, adjacent (e.g.,
subjects and direct objects are adjacent to the
verb in default English word order). However,
argument structure is closely related to semantic
event structure (see Marantz, 2013), which may
be comparatively difficult for LMs to learn.
Also, judgments about argument structure are
complicated by the possibility of coercing a
frequently transitive verb to be intransitive and
vice versa, as well as the existence of secondary
meanings of verbs with different argument
structures (e.g., normally intransitive boast has a
transitive use as in The spa boasts 10 pools), which
might make this domain somewhat more difficult
for LMs. Even with these complications, though,
humans detect the intended contrast 90% of the
time. We note that the reported difficulty of these
phenomena contradicts Warstadt and Bowman’s
(2019) conclusion that argument structure is one of
the strongest domains for neural models. However,
Warstadt and Bowman evaluate classifiers with
supervision on CoLA, a large proportion of which
is sentences related to argument structure.

Finally, we caution against interpreting positive
results on a general phenomenon in BLiMP as
proof of human-like knowledge. Although it is
unlikely that GPT-2 could reach human
performance on the SUBJ.-VERB AGR. paradigms without
acquiring a concept of number marking that
abstracts away from specific lexical items, it
is difficult to rule out this possibility without
accumulating different forms of evidence, for
instance, by testing how it generalizes to nonce
words. We take the paradigms in FILLER-GAP as
a cautionary example (see Table 4). There are
four paradigms that assess a model’s sensitivity to
the syntactic requirements of complementizer that
versus a wh-word. We observe that all models
more or less succeed when the unacceptable
sentence lacks a necessary gap, but fail when
it contains an illicit gap. These results suggest the
models’ ability to accurately detect a contrast in
whether a gap is filled following a wh-word is
not clearly based on a generalization about the
relationship between that wh-word and its gap,
as such a generalization should extend to the
cases where the models currently fail to detect
the correct contrast. More generally, conclusions
about a model’s knowledge of a particular
grammatical concept can only be reached by
considering several paradigms.

Figure 1: Heatmap showing the correlation between
models’ accuracies in each of the 67 paradigms.

5.2 Shallow Predictors of Performance

We also ask what factors besides linguistic
phenomena affect model accuracy. Figure 2 shows
how sentence length, perplexity (which does not
depend on length), the probability of the good
sentence (which does depend on length), and
confidence affect model performance. The effect
of perplexity is much weaker for GPT-2 than
for other models, which indicates that it is probably
more robust to sentences with non-stereotypical
syntax or describing unlikely scenarios. GPT-2
is the only model where accuracy increases
largely monotonically with confidence. A similar
relationship holds between confidence and
agreement in human acceptability judgments.
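The three shallow predictors can be illustrated with a minimal sketch. It assumes only that per-token log probabilities are available from some LM; the function names and the toy numbers are hypothetical and are not taken from the released BLiMP code. It shows why perplexity is length-normalized while sentence log probability is not, and computes confidence as the absolute difference between the two sentence log probabilities.

```python
import math


def sentence_log_prob(token_log_probs):
    """Sum of per-token log probabilities; becomes more negative as length grows."""
    return sum(token_log_probs)


def perplexity(token_log_probs):
    """Per-token perplexity: exp of the negative mean token log probability."""
    return math.exp(-sentence_log_prob(token_log_probs) / len(token_log_probs))


def confidence(good_log_probs, bad_log_probs):
    """Absolute gap between the log probabilities of the two sentences in a pair."""
    return abs(sentence_log_prob(good_log_probs) - sentence_log_prob(bad_log_probs))


# Hypothetical per-token log probabilities for one minimal pair (not real model output).
good = [math.log(0.5)] * 4
bad = [math.log(0.5)] * 3 + [math.log(0.125)]

print("perplexity of good sentence:", perplexity(good))
print("log probability of good sentence:", sentence_log_prob(good))
print("confidence on this pair:", confidence(good, bad))
```

Doubling the length of a sentence with the same per-token probabilities leaves the perplexity unchanged but halves (in log space, doubles the magnitude of) the sentence log probability, which is the distinction drawn in the paragraph above.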

5.3 Correlation of Model and Human Performance

We examine the extent to which models and
humans succeed at detecting contrasts for the same
linguistic phenomena. Figure 1 shows the Pearson
correlations among the four LMs and humans,
computed over their accuracies on the 67 paradigms. The
neural models correlate moderately with humans,
with GPT-2 correlating most strongly. The n-gram
model’s performance correlates with humans relatively
weakly. Neural models correlate with each
other more strongly, suggesting neural networks
share some biases that are not human-like.


Figure 2: Models’ performance on BLiMP as a function
of sentence length, perplexity, log probability of the
acceptable sentence, and model confidence (calculated
as |log P(S1) − log P(S2)|).

Transformer-XL and LSTM’s high correlation of
0.9 possibly reflects their similar training data.
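As a rough illustration of the statistic used in this section, the sketch below computes a Pearson correlation between two accuracy vectors. The `pearson` helper is a hypothetical name, and the input is only the seven BINDING rows of Table 4 for GPT-2 and humans, so the value it prints is not the figure reported in the heatmap, which is computed over all 67 paradigms.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Percent accuracies on the seven BINDING paradigms, as listed in Table 4.
gpt2 = [100, 96, 73, 99, 73, 82, 37]
human = [86, 98, 96, 95, 75, 83, 78]

r = pearson(gpt2, human)
print(f"Pearson r over these seven paradigms: {r:.2f}")
```

Applying the same function to every pair of columns in Table 4 would give a matrix of the kind plotted in Figure 1.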

6 Analysis

6.1 Long-Distance Dependencies

The presence of intervening material can lower the
ability of humans to detect agreement dependencies
(Bock and Miller, 1991). We study how
intervening material affects the LMs’ sensitivity
to mismatches in agreement in BLiMP. First, we
test for sensitivity to determiner-noun agreement
with and without an intervening adjective, as in
Example (2). The results are plotted in Figure 3.
The n-gram model is the most heavily impacted,
performing on average 35 points worse. This is
unsurprising, since the bigram consisting of a
determiner and noun is far more likely to be
observed than the trigram of determiner, adjective,
and noun. For the neural models, we find a weak
but consistent effect, with all models performing
on average between 3 and 5 points worse when
there is an intervening adjective.

(2)

a. Ron saw that man/*men.
b. Ron saw that nice man/*men.

Second, we test for sensitivity to mismatches
in subject-verb agreement when an attractor noun
of the opposite number intervenes. We compare
attractors in relative clauses (3-b) and as part of
a relational noun (3-c), following experiments
by Linzen et al. (2016) and others. Again, we
find that the n-gram model’s performance is
reduced significantly by this intervening material,
suggesting the model is consistently misled by the
presence of an attractor. All the neural models
perform above chance with an attractor present,
but GPT-2 and the LSTM perform 22 and 20 points
worse when an attractor is present than when
there is no attractor, while Transformer-XL’s
performance is reduced by only 5 points. Thus,
we reproduce Linzen et al.’s finding that attractors
significantly reduce LSTM LMs’ sensitivity to
mismatches in agreement and find evidence that
this holds true of some Transformer LMs as well.

Figure 3: The effect of the locality of determiner-noun
agreement (upper panel) and the type of agreement
attractor (lower panel) on model performance.

(3)

a. The sisters bake/*bakes.
b. The sisters who met Cheryl bake/*bakes.
c. The sisters of Cheryl bake/*bakes.

6.2 Regular vs. Irregular Agreement

In DET.-NOUN AGR. and SUBJ.-VERB AGR., we generate
separate datasets for nouns with regular and
irregular number marking, as in Example (4). All
else being equal, only models with access
to sub-word-level information should make
any distinction between regular and irregular
morphology.

(4)

a. Ron saw that nice kid/*kids. (regular)
b. Ron saw that nice man/*men. (irregular)

In fact, Figure 4 shows that the two sub-word-level
models GPT-2 and Transformer-XL show
little effect of irregular morphology: They perform
less than 1.3 points worse on irregulars than
regulars. Their high overall performance suggests
that they robustly encode number features without
relying on segmental cues.15

15 The LSTM LM, which has word-level tokens, averages
5.2 points worse on the irregular paradigms. This effect is
not due to morphology, but rather to the higher proportion
of out-of-vocabulary items among the irregular nouns, which
include many loanwords such as theses and alumni.


Figure 4: Models’ performance on agreement phenomena
between a determiner and noun and between
a subject and verb, broken down by whether the
noun/subject has a regular or irregular plural form.

6.3 Training Size and BLiMP Performance

We use BLiMP to track how a model’s representation
of particular phenomena varies with
the quantity of training data. Using different
sized subsets of Gulordava et al.’s (2019) training
data, we retrain the LSTM and Transformer-XL
models and evaluate their performance on BLiMP.
Figure 5 shows that different phenomena have
notably different learning curves across different
training sizes even if the full model trained on 83M
tokens achieved equivalent accuracy scores. For
example, the LSTM model ultimately performs
well on both IRREGULAR and ANAPHOR AGR., but
requires more training to reach this level of
performance for ANAPHOR AGR. These learning
curve differences show how BLiMP performance
dissociates from perplexity on Wikipedia data, a
standard measure of LM performance: Although
perplexity decreases with more training data,16
performance on different phenomena grows at
varying rates.

16 Average perplexity on the Gulordava et al. (2019) test
set: 595 at 0.125M, 212 at 1M, 92.8 at 8M, and 53 at 64M.

Figure 5: Transformer-XL (top) and LSTM LM
(bottom) performance as a function of training size
and phenomena in BLiMP. The gray line shows the
average across all phenomena.

We conjecture that there is a sigmoid
relationship between the logarithm of training
set size and BLiMP performance that appears
to be roughly linear at this scale. We conduct
linear regression analyses to estimate the rate
of increase in performance in relation to the
logarithm (base 2) of dataset size. For the LSTM
LM, best-fit lines for phenomena on which
the model had the highest accuracy have the
steepest slopes: ANAPHOR AGR. (0.0623), DET.-NOUN
AGR. (0.0426), and IRREGULAR (0.039). We see
the shallowest slopes on phenomena with the
worst performance: NPIS (0.0078) and ISLANDS
(0.0036). For Transformer-XL, we observe a
similar pattern: The steepest learning curves
again belong to ANAPHOR AGR. (0.0545) and DET.-NOUN
AGR. (0.0405), and the shallowest to NPIS
(0.0055) and ISLANDS (0.0039). Based on these
values, we estimate that if log-linear improvement
continues, the LSTM LM and Transformer-XL
should require well over 10^20 tokens of training
data to achieve human-like performance on these
hardest phenomena.
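The extrapolation described above can be sketched as follows. This is an illustration of log-linear extrapolation in general, not the authors' analysis script: `fit_line` and `tokens_needed` are hypothetical helper names, and the training sizes and accuracies are made-up values chosen so the arithmetic is easy to follow. Accuracy is assumed to be on a 0 to 1 scale.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def tokens_needed(sizes, accuracies, target):
    """Training-set size at which the fitted log-linear trend reaches `target`."""
    log_sizes = [math.log2(s) for s in sizes]
    slope, intercept = fit_line(log_sizes, accuracies)
    return 2 ** ((target - intercept) / slope)


# Made-up learning curve: accuracy rises by 0.05 with every doubling of the data.
sizes = [2**20, 2**21, 2**22, 2**23]
accuracies = [0.50, 0.55, 0.60, 0.65]

slope, intercept = fit_line([math.log2(s) for s in sizes], accuracies)
needed = tokens_needed(sizes, accuracies, target=0.90)
print(f"slope per doubling: {slope:.4f}")
print(f"tokens needed for 0.90 accuracy: 2^{math.log2(needed):.1f}")
```

If the reported slopes are read on the same 0 to 1 scale, a slope of 0.0036 means that each additional 0.01 of accuracy corresponds to roughly 2.8 further doublings of the training set, which is why the extrapolated token counts for the hardest phenomena become so large.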

We also find that increasing model size (number
of parameters) is unlikely to improve performance:
We evaluate four pretrained versions of GPT-2
with 117M to 1,558M parameters trained
on WebText. All models have overall BLiMP
accuracy of 0.84 ± 0.01, and the standard deviation
among the models on each of the 12 phenomena
does not exceed 0.03. This finding bolsters
our earlier conclusion in §5 that amount of
training data has the biggest impact on BLiMP
performance.


6.4 Alternate Evaluation Methods

There are several other methods one can
use to measure an LM’s preference between
two minimally different sentences. So far, we
have considered only the full-sentence method,
advocated for by Marvin and Linzen (2018), which
compares LM likelihoods of full sentences. In a
follow-up experiment, we use two prefix methods,
each of which has appeared in related prior work,
that evaluate a model’s preferences by comparing
its prediction at a key point of divergence between
the sentences. Subsets of BLiMP data are designed
to be compatible with multiple methods, allowing
us to conduct the first direct comparison. We find
that all methods give broadly similar results when
aggregating over a set of paradigms. We see no
strong argument against evaluating solely using
the full-sentence method, though some results
diverge for specific paradigms.

One-Prefix Method In the one-prefix method,
used by Linzen et al. (2016), a pair of sentences
share the same initial portion of a sentence, but
differ in a critical word that makes them differ in
grammaticality (e.g., The cat eats mice vs. The
cat eat mice). The model’s prediction is correct if
it assigns a higher probability to the grammatical
token given the shared prefix.

Two-Prefix Method In the two-prefix method,
used by Wilcox et al. (2019), a pair of sentences
differ in their initial string, and the grammaticality
difference is only revealed when a shared critical
word is included (e.g., The cat eats mice vs. The
cats eats mice). For these paradigms, we evaluate
whether the model assigns a higher probability to
the critical word conditioned on the grammatical
prefix than on the ungrammatical prefix.
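The three scoring rules can be compared side by side in a minimal sketch. It assumes a toy next-word model stored in a lookup table; `TOY_PROBS`, `DEFAULT_PROB`, and all probabilities in it are hypothetical stand-ins for a real LM and are not taken from the paper or its code. The sentences are the examples used in the definitions above.

```python
import math

# Hypothetical next-word probabilities P(word | prefix); not from any real LM.
TOY_PROBS = {
    (("the", "cat"), "eats"): 0.20,
    (("the", "cat"), "eat"): 0.05,
    (("the", "cats"), "eats"): 0.04,
    (("the", "cats"), "eat"): 0.25,
}
DEFAULT_PROB = 0.1  # assumed probability for every context the table does not list


def log_prob(prefix, word):
    """Log probability of `word` given the preceding words in `prefix`."""
    return math.log(TOY_PROBS.get((tuple(prefix), word), DEFAULT_PROB))


def sentence_log_prob(words):
    """Chain-rule log probability of a whole sentence."""
    return sum(log_prob(words[:i], word) for i, word in enumerate(words))


def full_sentence_correct(good, bad):
    """Full-sentence method: the acceptable sentence must be more probable overall."""
    return sentence_log_prob(good) > sentence_log_prob(bad)


def one_prefix_correct(prefix, good_word, bad_word):
    """One-prefix method: compare two critical words after a shared prefix."""
    return log_prob(prefix, good_word) > log_prob(prefix, bad_word)


def two_prefix_correct(good_prefix, bad_prefix, word):
    """Two-prefix method: compare one critical word after two different prefixes."""
    return log_prob(good_prefix, word) > log_prob(bad_prefix, word)


full = full_sentence_correct(["the", "cat", "eats", "mice"], ["the", "cat", "eat", "mice"])
one = one_prefix_correct(["the", "cat"], "eats", "eat")
two = two_prefix_correct(["the", "cat"], ["the", "cats"], "eats")
print(full, one, two)
```

With a real LM, `log_prob` would be replaced by the model's conditional log probability for the next token; the comparison logic of the three methods would stay the same.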

The prefix methods differ from the full-sentence
method in two key ways: (i) they
require that the acceptability of the sentence be
unambiguously predictable from the critical word,
but not sooner, and (ii) they are not affected
by predictions made by the LM following the
critical word. These values do affect the full-sentence
method. For example, assuming
P(are numerous) ≫ P(is numerous), a model
could predict that The cats are numerous is more
likely than The cats is numerous without correctly
predicting that P(are | the cats) > P(is | the cats).
Using prefix probabilities allows us to exclude
models’ use of this additional information and
evaluate how the models perform when they have
just enough information to judge grammaticality.

Figure 6: Comparison of models’ performance on the
simple LM method and the 1- and 2-prefix methods.
The upper panels show results from three phenomena
that are compatible with both 1-prefix and 2-prefix
methods. The lower panel shows the averages and
standard deviations across all phenomena.
Figure 6 shows that models have generally
comparable accuracies across all three methods.
However, there are some cases where we observe
differences between these methods. For example,
Transformer-XL performs much worse at BINDING,
DET.-NOUN AGR., and SUBJ.-VERB AGR. in the simple
LM method, suggesting that the probabilities
Transformer-XL assigns to the irrelevant part
at the end of the sentence very often overturn
the observed preference based on probability up
to the critical word. On the other hand, GPT-2
benefits from reading the whole sentence for
BINDING phenomena, as its performance is better in
the simple LM method than in the prefix method.
We conclude that with a sufficiently diverse
set of paradigms, the various metrics under
consideration will give similar results. Thus, it
is not problematic that BLiMP relies only on the
full-sentence method, and doing so allows BLiMP
to include many paradigms not compatible with
either prefix method. Nonetheless, prefix methods
are still valuable for detailed analysis or for studies
making direct comparison to psycholinguistic
theories (e.g., Wilcox et al., 2018).

7 Conclusion and Future Work

We have shown ways in which BLiMP can be used
as a tool to gain evidence about both the overall
and fine-grained linguistic knowledge of language
models. Like the GLUE benchmark (Wang et al.,
2018), BLiMP assigns a single overall score
to an LM that summarizes its general sensitivity to
minimal pair contrasts. It also provides a breakdown
of LM performance by linguistic phenomenon,
which can be used to draw more concrete
conclusions about the kinds of grammatical
features acquired by a given model. This
kind of information is a linguistically motivated
evaluation of LMs that can complement common
metrics like perplexity.

Furthermore, the extent to which humans
resemble data-driven learners like language
models is debated in linguistics and cognitive
science (see e.g., Chomsky, 1965; Reali and
Christiansen, 2005). In some domains, we may
require the aid of innate knowledge to acquire
phenomenon-specific knowledge resembling that
tested in BLiMP. By evaluating whether self-supervised
learners like LMs acquire human-like
grammatical acuity in a particular domain, we
gather indirect evidence as to whether this
phenomenon is a necessary component of humans’
innate knowledge.

Another aim of BLiMP is to serve as a guide for
future work on the linguistic evaluation of LMs.
It is particularly interesting to better understand
those empirical domains where current LMs
appear to acquire some relevant knowledge,
but still fall short of human performance. The
results from BLiMP suggest that, in addition
to relatively well-studied phenomena like filler-gap
dependencies, NPIs, and binding, argument
structure remains one area where there is much to
uncover about what LMs learn. More generally,
as language modeling techniques continue to
improve, it will be useful to have large-scale tools
like BLiMP to efficiently track changes in what
these models do and do not know about grammar.

Acknowledgments

This material is based upon work supported by
the National Science Foundation under grant no.
1850208. Any opinions, findings, and conclusions
or recommendations expressed in this material are
those of the author(s) and do not necessarily reflect
the views of the National Science Foundation.
This project has also benefited from support to
SB by Eric and Wendy Schmidt (made by recommendation
of the Schmidt Futures program), by
Samsung Research (under the project Improving
Deep Learning using Latent Structure), by Intuit,
Inc., and by NVIDIA Corporation (with the
donation of a Titan V GPU).

Appendix

The following contains examples from each of the
67 paradigms in BLiMP.

Caveats Some paradigms include non-transparent
factors that may influence interpretation. We list
here those factors that we are aware of:

• Several paradigms within ANAPHOR AGREEMENT
and BINDING rely on stereotyped gender
assignment associated with names (e.g.,
Mary). A model has to have at least a weak
gender-name association in order to succeed
on some paradigms in BLiMP. For example,
we mark sentences like Mary hugged
themselves and Mary hugged himself as
unacceptable, and we never include possibilities
like Mary hugged themself.

• To isolate certain phenomena, we had to rely
on acceptability contrasts present in mainstream
US and UK English but absent in
many other dialects. For example, some
speakers would accept the sentence Suzy
don’t lie, but we would mark this
unacceptable based on mainstream US English
judgments. BLiMP assesses models’ knowledge
of this specific dialect of English; in
some cases it could penalize models that
conform to a different dialect.

How to read this table:

• Phenomenon refers to the linguistic phenomenon
as noted in Table 2. UID refers to the
unique identifier used in the released dataset.

• Model and human performance are reported
as percent accuracy. ‘Human’ uses the
more conservative individual judgments (as
opposed to majority vote, for which each
paradigm would be either 100% or 80%).

• Each pair is marked for whether it is usable
with a prefix method. All sentences are valid
for the simple LM method.

• If a sentence has a checkmark (✓) under
the 1pfx column, the sentence can be used
with the 1-prefix method in addition to the
simple LM method. The bolded word is the
critical word: the probability of the two
different critical words for the acceptable
and unacceptable sentences can be compared
based on the same ‘prefix’.

• If a sentence has a checkmark (✓) under the
2pfx column, the sentence can be used with
the 2-prefix method in addition to the simple
LM method. The bolded word is the critical
word: the probability of that particular word
can be compared based on the two different
acceptable and unacceptable ‘prefixes’.

| Phenomenon | UID | 5-gram | LSTM | TXL | GPT-2 | Human | Acceptable Example | Unacceptable Example |
|---|---|---|---|---|---|---|---|---|
| ANAPHOR AGREEMENT | anaphor_gender_agreement | 44 | 88 | 91 | 99 | 96 | Katherine can’t help herself. | Katherine can’t help himself. |
| | anaphor_number_agreement | 52 | 95 | 97 | 100 | 99 | Many teenagers were helping themselves. | Many teenagers were helping herself. |
| ARGUMENT STRUCTURE | animate_subject_passive | 54 | 68 | 58 | 77 | 98 | Amanda was respected by some waitresses. | Amanda was respected by some picture. |
| | animate_subject_trans | 72 | 79 | 70 | 80 | 87 | Danielle visited Irene. | The eye visited Irene. |
| | causative | 51 | 65 | 54 | 68 | 82 | Aaron breaks the glass. | Aaron appeared the glass. |
| | drop_argument | 68 | 79 | 67 | 84 | 90 | The Lutherans couldn’t skate around. | The Lutherans couldn’t disagree with. |
| | inchoative | 89 | 72 | 81 | 90 | 95 | A screen was fading. | A screen was cleaning. |
| | intransitive | 82 | 73 | 81 | 90 | 86 | Some glaciers are vaporizing. | Some glaciers are scaring. |
| | passive_1 | 71 | 65 | 76 | 89 | 99 | Jeffrey’s sons are insulted by Tina’s supervisor. | Jeffrey’s sons are smiled by Tina’s supervisor. |
| | passive_2 | 70 | 72 | 74 | 79 | 86 | Most cashiers are disliked. | Most cashiers are flirted. |
| | transitive | 91 | 87 | 89 | 49 | 87 | A lot of actresses’ nieces have toured that art gallery. | A lot of actresses’ nieces have coped that art gallery. |
| BINDING | principle_A_c_command | 58 | 59 | 61 | 100 | 86 | A lot of actresses that thought about Alice healed themselves. | A lot of actresses that thought about Alice healed herself. |
| | principle_A_case_1 | 100 | 100 | 100 | 96 | 98 | Tara thinks that she sounded like Wayne. | Tara thinks that herself sounded like Wayne. |
| | principle_A_case_2 | 49 | 87 | 95 | 73 | 96 | Stacy imagines herself praising this actress. | Stacy imagines herself praises this actress. |
| | principle_A_domain_1 | 95 | 98 | 99 | 99 | 95 | Carlos said that Lori helped him. | Carlos said that Lori helped himself. |
| | principle_A_domain_2 | 56 | 68 | 70 | 73 | 75 | Mark imagines Erin might admire herself. | Mark imagines Erin might admire himself. |
| | principle_A_domain_3 | 52 | 55 | 60 | 82 | 83 | Nancy could say every guy hides himself. | Every guy could say Nancy hides himself. |
| | principle_A_reconstruction | 40 | 46 | 38 | 37 | 78 | It’s herself who Karen criticized. | It’s herself who criticized Karen. |
| CONTROL/RAISING | existential_there_object_raising | 84 | 66 | 76 | 92 | 90 | William has declared there to be no guests getting fired. | William has obliged there to be no guests getting fired. |
| | existential_there_subject_raising | 77 | 80 | 79 | 89 | 88 | There was bound to be a fish escaping. | There was unable to be a fish escaping. |
| | expletive_it_object_raising | 72 | 63 | 72 | 58 | 86 | Regina wanted it to be obvious that Maria thought about Anna. | Regina forced it to be obvious that Maria thought about Anna. |
| | tough_vs_raising_1 | 33 | 34 | 45 | 72 | 75 | Julia wasn’t fun to talk to. | Julia wasn’t unlikely to talk to. |
| | tough_vs_raising_2 | 77 | 93 | 86 | 92 | 81 | Rachel was apt to talk to Alicia. | Rachel was exciting to talk to Alicia. |
| DETERMINER-NOUN AGR. | determiner_noun_agreement_1 | 88 | 92 | 92 | 100 | 96 | Craig explored that grocery store. | Craig explored that grocery stores. |
| | determiner_noun_agreement_2 | 86 | 92 | 81 | 93 | 95 | Carl cures those horses. | Carl cures that horses. |
| | determiner_noun_agreement_irregular_1 | 85 | 82 | 88 | 94 | 92 | Phillip was lifting this mouse. | Phillip was lifting this mice. |
| | determiner_noun_agreement_irregular_2 | 90 | 86 | 82 | 93 | 85 | Those ladies walk through those oases. | Those ladies walk through that oases. |
| | determiner_noun_agreement_with_adj_1 | 50 | 86 | 78 | 90 | 96 | Tracy praises those lucky guys. | Tracy praises those lucky guys. |
| | determiner_noun_agreement_with_adj_2 | 53 | 76 | 81 | 96 | 94 | Some actors buy these gray books. | Some actors buy this gray books. |
| | determiner_noun_agreement_with_adj_irregular_1 | 55 | 83 | 77 | 88 | 85 | This person shouldn’t criticize this upset child. | This person shouldn’t criticize this upset children. |
| | determiner_noun_agreement_with_adj_irregular_2 | 52 | 87 | 86 | 93 | 95 | That adult has brought that purple octopus. | That adult has brought those purple octopus. |
| ELLIPSIS | ellipsis_n_bar_1 | 23 | 68 | 65 | 88 | 92 | Brad passed one big museum and Eva passed several. | Brad passed one museum and Eva passed several big. |
| | ellipsis_n_bar_2 | 50 | 67 | 89 | 86 | 78 | Curtis’s boss discussed four sons and Andrew discussed five sick sons. | Curtis’s boss discussed four happy sons and Andrew discussed five sick. |
| FILLER GAP | wh_questions_object_gap | 53 | 79 | 61 | 84 | 85 | Joel discovered the vase that Patricia might take. | Joel discovered what Patricia might take the vase. |
| | wh_questions_subject_gap | 82 | 92 | 83 | 95 | 98 | Cheryl thought about some dog that upset Sandra. | Cheryl thought about who some dog upset Sandra. |
| | wh_questions_subject_gap_long_distance | 86 | 96 | 86 | 88 | 85 | Bruce knows that person that Dawn likes that argued about a lot of guys. | Bruce knows who that person that Dawn likes argued about a lot of guys. |
| | wh_vs_that_no_gap | 83 | 97 | 86 | 97 | 97 | Danielle finds out that many organizations have alarmed Chad. | Danielle finds out who many organizations have alarmed Chad. |
| | wh_vs_that_no_gap_long_distance | 81 | 97 | 91 | 94 | 92 | Christina forgot that all plays that win worry Dana. | Christina forgot who all plays that win worry Dana. |
| | wh_vs_that_with_gap | 18 | 43 | 42 | 56 | 77 | Nina has learned who most men sound like. | Nina has learned that most men sound like. |
| | wh_vs_that_with_gap_long_distance | 20 | 14 | 17 | 56 | 75 | Martin did find out what every cashier that shouldn’t drink wore. | Martin did find out that every cashier that shouldn’t drink wore. |
| IRREGULAR FORMS | irregular_past_participle_adjectives | 79 | 93 | 91 | 78 | 99 | The forgotten newspaper article was bad. | The forgot newspaper article was bad. |
| | irregular_past_participle_verbs | 80 | 85 | 66 | 90 | 95 | Edward hid the cats. | Edward hidden the cats. |
| ISLAND EFFECTS | adjunct_island | 48 | 67 | 65 | 91 | 94 | Who has Colleen aggravated before kissing Judy? | Who has Colleen aggravated Judy before kissing? |
| | complex_NP_island | 50 | 47 | 58 | 72 | 80 | Who hadn’t some driver who would fire Jennifer’s colleague embarrassed? | Who hadn’t Jennifer’s colleague embarrassed some driver who would fire? |
| | coordinate_structure_constraint_complex_left_branch | 32 | 30 | 36 | 42 | 90 | What lights could Spain sell and Andrea discover? | What could Spain sell lights and Andrea discover? |
| | coordinate_structure_constraint_object_extraction | 59 | 71 | 74 | 88 | 91 | Who will Elizabeth and Gregory cure? | Who will Elizabeth cure and Gregory? |
| | left_branch_island_echo_question | 96 | 32 | 63 | 77 | 91 | David would cure what snake? | What would David cure snake? |
| | left_branch_island_simple_question | 57 | 36 | 36 | 82 | 99 | Whose hat should Tonya wear? | Whose should Tonya wear hat? |
| | sentential_subject_island | 61 | 43 | 37 | 35 | 61 | Who have many women’s touring Spain embarrassed. | Who have many women’s touring embarrassed Spain. |
| | wh_island | 56 | 47 | 20 | 77 | 73 | What could Alan discover he has run around? | What could Alan discover who has run around? |
| NPI LICENSING | matrix_question_npi_licensor_present | 1 | 2 | 1 | 67 | 98 | Should Monica ever grin? | Monica should ever grin. |
| | npi_present_1 | 47 | 54 | 61 | 55 | 83 | Even these trucks have often slowed. | Even these trucks have ever slowed. |
| | npi_present_2 | 47 | 54 | 48 | 62 | 98 | Many skateboards also roll. | Many skateboards ever roll. |
| | only_npi_licensor_present | 57 | 93 | 80 | 100 | 92 | Only Bill would ever complain. | Even Bill would ever complain. |
| | only_npi_scope | 30 | 36 | 45 | 85 | 72 | Only those doctors who Karla respects ever conceal many snakes. | Those doctors who only Karla respects ever conceal many snakes. |
| | sentential_negation_npi_licensor_present | 93 | 100 | 99 | 89 | 93 | Those banks had not ever lied. | Those banks had really ever lied. |
| | sentential_negation_npi_scope | 45 | 23 | 53 | 95 | 81 | Those turtles that are boring April could not ever break those couches. | Those turtles that are not boring April could ever break those couches. |
| QUANTIFIERS | existential_there_quantifiers_1 | 91 | 96 | 94 | 99 | 94 | There aren’t many lights darkening. | There aren’t all lights darkening. |
| | existential_there_quantifiers_2 | 62 | 16 | 14 | 24 | 76 | Each book is there disturbing Margaret. | There is each book disturbing Margaret. |
| | superlative_quantifiers_1 | 45 | 63 | 84 | 84 | 91 | No man has revealed more than five forks. | No man has revealed at least five forks. |
| | superlative_quantifiers_2 | 17 | 83 | 85 | 78 | 85 | An actor arrived at at most six lakes. | No actor arrived at at most six lakes. |
| SUBJECT-VERB AGR. | distractor_agreement_relational_noun | 24 | 76 | 77 | 83 | 81 | A sketch of lights doesn’t appear. | A sketch of lights don’t appear. |
| | distractor_agreement_relative_clause | 22 | 63 | 60 | 68 | 86 | Boys that aren’t disturbing Natalie suffer. | Boys that aren’t disturbing Natalie suffers. |
| | irregular_plural_subject_verb_agreement_1 | 73 | 81 | 78 | 95 | 95 | This goose isn’t bothering Edward. | This goose weren’t bothering Edward. |
| | irregular_plural_subject_verb_agreement_2 | 88 | 89 | 83 | 96 | 94 | The woman cleans every public park. | The women cleans every public park. |
| | regular_plural_subject_verb_agreement_1 | 76 | 89 | 73 | 97 | 95 | Jeffrey hasn’t criticized Donald. | Jeffrey haven’t criticized Donald. |
| | regular_plural_subject_verb_agreement_2 | 81 | 83 | 85 | 96 | 95 | The dress crumples. | The dresses crumples. |

[1pfx / 2pfx checkmark columns: row alignment not recoverable]

Table 4: Examples of all 67 paradigms in BLiMP along with model performance and estimated human agreement.

References

Alec Marantz. 2013. Verbal argument structure:
Events and participants. Lingua, 30:152–168.
Elsevier.

David Adger. 2003. Core Syntax: A Minimalist
Approach. Oxford University Press, Oxford.

Yossi Adi, Einat Kermany, Yonatan Belinkov,
Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained
analysis of sentence embeddings using
auxiliary prediction tasks. In Proceedings of
ICLR Conference Track. Toulon, France.

Aixiu An, Peng Qian, Ethan Wilcox, and Roger
Levy. 2019. Representation of constituents in
neural language models: Coordination phrase as
a case study. arXiv preprint arXiv:1909.04625.

Kathryn Bock and Carol A. Miller. 1991. Broken
agreement. Cognitive Psychology, 23(1):45–93.

Rui P. Chaves. 2020. What don’t RNN language
models learn about filler-gap dependencies? In
Proceedings of the Third Meeting of the Society
for Computation in Linguistics (SCiL).

Stanley F. Chen and Joshua Goodman. 1999.
An empirical study of smoothing techniques
for language modeling. Computer Speech &
Language, 13(4):359–394.


Gennaro Chierchia. 2013. Logic in Grammar.
Oxford University Press.

Noam Chomsky. 1965. Aspects of the Theory of
Syntax. MIT Press.

Noam Chomsky. 1981. Lectures on Government
and Binding.

Shammur Absar Chowdhury and Roberto
Zamparelli. 2018. RNN simulations of
grammaticality judgments on long-distance
dependencies. In Proceedings of the 27th
International Conference on Computational
Linguistics, pages 133–144.

Shammur Absar Chowdhury and Roberto
Zamparelli. 2019. An LSTM adaptation study
of (un)grammaticality. In Proceedings of the
2019 ACL Workshop BlackboxNLP: Analyzing
and Interpreting Neural Networks for NLP,
pages 204–212.

Alexis Conneau, German Kruszewski, Guillaume
Lample, Loïc Barrault, and Marco Baroni.
2018. What you can cram into a single
&!#* vector: Probing sentence embeddings for
linguistic properties. In ACL 2018-56th Annual
Meeting of the Association for Computational
Linguistics, volume 1, pages 2126–2136.

Jillian K. Da Costa and Rui P. Chaves.
2020. Assessing the ability of transformer-based
neural models to represent structurally
unbounded dependencies. In Proceedings of the
Third Meeting of the Society for Computation
in Linguistics (SCiL).

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime
Carbonell, Quoc Le, and Ruslan Salakhutdinov.
2019. Transformer-XL: Attentive language
models beyond a fixed-length context. In
Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics,
pages 2978–2988. Florence, Italy. Association
for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of the 2019
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, Volume 1
(Long and Short Papers), pages 4171–4186.

Allyson Ettinger, Ahmed Elgohary, Colin Phillips,
and Philip Resnik. 2018. Assessing composition
in sentence vector representations. In Proceedings
of the 27th International Conference on
Computational Linguistics, pages 1790–1801.
Association for Computational Linguistics.

Richard Futrell, Ethan Wilcox, Takashi
Morita, and Roger Levy. 2018. RNNs
as psycholinguistic subjects: Syntactic state
and grammatical dependency. arXiv preprint
arXiv:1809.01329.

Bart Geurts and Rick Nouwen. 2007. ‘At least’
et al.: The semantics of scalar modifiers.
Language, pages 533–559.

David Graff, Junbo Kong, Ke Chen, and Kazuaki
Maeda. 2003. English gigaword. Linguistic
Data Consortium, Philadelphia, 4(1):34.

Kristina Gulordava, Piotr Bojanowski, Edouard
Grave, Tal Linzen, and Marco Baroni. 2019.
Colorless green recurrent networks dream
hierarchically. Proceedings of the Society for
Computation in Linguistics, 2(1):363–364.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H.
Clark, and Philipp Koehn. 2013. Scalable modified
Kneser-Ney language model estimation.
In Proceedings of the 51st Annual Meeting of
the Association for Computational Linguistics
(Volume 2: Short Papers), pages 690–696.

Michael Heilman, Aoife Cahill, Nitin Madnani,
Melissa Lopez, Matthew Mulholland, and
Joel Tetreault. 2014. Predicting grammaticality
on an ordinal scale. In Proceedings of the
52nd Annual Meeting of the Association for
Computational Linguistics (Volume 2: Short
Papers), volume 2, pages 174–180.

Sepp Hochreiter and Jürgen Schmidhuber. 1997.
Long short-term memory. Neural Computation,
9(8):1735–1780.

Jeremy Howard and Sebastian Ruder. 2018.
Universal language model fine-tuning for text
classification. In Proceedings of the 56th
Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long
Papers), pages 328–339.

Jaap Jumelet and Dieuwke Hupkes. 2018. Do
language models understand anything? On
the ability of LSTMs to understand negative
polarity items. In Proceedings of the 2018
EMNLP Workshop BlackboxNLP: Analyzing
and Interpreting Neural Networks for NLP,
pages 222–231.

Katharina Kann, Alex Warstadt, Adina Williams,
and Samuel R. Bowman. 2019. Verb argument
structure alternations in word and sentence
embeddings. Proceedings of the Society for
Computation in Linguistics, 2(1):287–297.

Dave Kush, Terje Lohndal, and Jon Sprouse. 2018.
Investigating variation in island effects. Natural
Language & Linguistic Theory, 36(3):743–779.

Jey Han Lau, Alexander Clark, and Shalom Lappin.
2017. Grammaticality, acceptability, and probability:
A probabilistic view of linguistic knowledge.
Cognitive Science, 41(5):1202–1241.

Tal Linzen, Emmanuel Dupoux, and Yoav
Goldberg. 2016. Assessing the ability of LSTMs
to learn syntax-sensitive dependencies. Transactions
of the Association for Computational
Linguistics, 4:521–535.

Rebecca Marvin and Tal Linzen. 2018. Targeted
syntactic evaluation of language models. In
Proceedings of the 2018 Conference on
Empirical Methods in Natural Language
Processing, pages 1192–1202.

Stephen Merity, Caiming Xiong, James Bradbury,
and Richard Socher. 2016. Pointer sentinel
mixture models. CoRR, abs/1609.07843.

Tomáš Mikolov, Martin Karafiát, Lukáš Burget,
Jan Černocký, and Sanjeev Khudanpur. 2010.
Recurrent neural network based language
model. In Eleventh Annual Conference of
the International Speech Communication
Association.

Matthew E. Peters, Mark Neumann, Mohit Iyyer,
Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. 2018. Deep contextualized
word representations. arXiv preprint
arXiv:1802.05365.

Alec Radford, Karthik Narasimhan, Tim
Salimans, and Ilya Sutskever. 2018. Improving
language understanding with unsupervised
learning. Technical report, OpenAI.

Alec Radford, Jeffrey Wu, Rewon Child, David
Luan, Dario Amodei, and Ilya Sutskever. 2019.
Language models are unsupervised multitask
learners. OpenAI Blog, 1(8).

Colin Raffel, Noam Shazeer, Adam Roberts,
Katherine Lee, Sharan Narang, Michael
Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.
2019. Exploring the limits of transfer learning
with a unified text-to-text transformer. arXiv
e-prints.

Florencia Reali and Morten H. Christiansen.
2005. Uncovering the richness of the stimulus:
Structure dependence and indirect statistical
evidence. Cognitive Science, 29(6):1007–1028.

Ivan A. Sag, Thomas Wasow, and Emily M.
Bender. 2003. Syntactic Theory: A Formal
Introduction, 2nd edition. CSLI Publications.

Carson T. Schütze. 1996. The Empirical Base
of Linguistics: Grammaticality Judgments and
Linguistic Methodology. University of Chicago
Press.

Xing Shi, Inkit Padhi, and Kevin Knight. 2016.
Does string-based neural MT learn source
syntax? In Proceedings of the 2016 Conference
on Empirical Methods in Natural Language
Processing, pages 1526–1534.

Dominique Sportiche, Hilda Koopman, and
Edward Stabler. 2013. An Introduction to
Syntactic Analysis and Theory. John Wiley &
Sons.

Ian Tenney, Patrick Xia, Berlin Chen, Alex
Wang, Adam Poliak, R. Thomas McCoy,
Najoung Kim, Benjamin Van Durme, Samuel R.
Bowman, Dipanjan Das, et al. 2019. What
do you learn from context? Probing for
sentence structure in contextualized word
representations. In Proceedings of ICLR.

Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N.
Gomez, Łukasz Kaiser, and Illia Polosukhin.
2017. Attention is all you need. In I. Guyon,
U. V. Luxburg, S. Bengio, H. Wallach,
R. Fergus, S. Vishwanathan, and R. Garnett,
editors, Advances in Neural Information
Processing Systems 30, pages 5998–6008.
Curran Associates, Inc.

Alex Wang, Yada Pruksachatkun, Nikita Nangia,
Amanpreet Singh, Julian Michael, Felix Hill,
Omer Levy, and Samuel R. Bowman. 2019a.
SuperGLUE: A stickier benchmark for general-
purpose language understanding systems. In
33rd Conference on Neural Information
Processing Systems.

Alex Wang, Amanpreet Singh, Julian Michael,
Felix Hill, Omer Levy, and Samuel R. Bowman.
2018. GLUE: A multi-task benchmark
and analysis platform for natural language
understanding. In Proceedings of the 2018
EMNLP Workshop BlackboxNLP: Analyzing
and Interpreting Neural Networks for NLP,
pages 353–355.

Alex Wang, Ian F. Tenney, Yada Pruksachatkun,
Katherin Yu, Jan Hula, Patrick Xia, Raghu
Pappagari, Shuning Jin, R. Thomas McCoy,
Roma Patel, Yinghui Huang, Jason Phang,
Edouard Grave, Haokun Liu, Najoung Kim,
Phu Mon Htut, Thibault Févry, Berlin Chen,
Nikita Nangia, Anhad Mohananey, Katharina
Kann, Shikha Bordia, Nicolas Patry, David
Benton, Ellie Pavlick, and Samuel R. Bowman.
2019b. jiant 1.2: A software toolkit for
research on general-purpose text understanding
models. http://jiant.info/.

Gregory Ward and Betty Birner. 1995. Definiteness
and the English existential. Language,
pages 722–742.

Alex Warstadt and Samuel R. Bowman. 2019. Linguistic
analysis of pretrained sentence encoders
with acceptability judgments. arXiv preprint
arXiv:1901.03438.

Alex Warstadt, Yu Cao, Ioana Grosu, Wei Peng,
Hagen Blix, Yining Nie, Anna Alsop, Shikha
Bordia, Haokun Liu, Alicia Parrish, Sheng-Fu
Wang, Jason Phang, Anhad Mohananey,
Phu Mon Htut, Paloma Jeretič, and Samuel
R. Bowman. 2019a. Investigating BERT’s
knowledge of language: Five analysis methods
with NPIs. In Proceedings of EMNLP-IJCNLP,
pages 2870–2880.

Alex Warstadt, Amanpreet Singh, and Samuel R.
Bowman. 2019b. Neural network acceptability
judgments. Transactions of the Association for
Computational Linguistics, 7:625–641.

Ethan Wilcox, Roger Levy, Takashi Morita, and
Richard Futrell. 2018. What do RNN language
models learn about filler–gap dependencies? In
Proceedings of the 2018 EMNLP Workshop
BlackboxNLP: Analyzing and Interpreting
Neural Networks for NLP, pages 211–221.

Ethan Wilcox, Peng Qian, Richard Futrell, Miguel
Ballesteros, and Roger Levy. 2019. Structural
supervision improves learning of non-local
grammatical dependencies. In Proceedings of
the 2019 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
pages 3302–3312.
