BLiMP: The Benchmark of Linguistic Minimal Pairs for English
Alex Warstadt1, Alicia Parrish1, Haokun Liu2, Anhad Mohananey2,
Wei Peng2, Sheng-FuWang1, Samuel R. Bowman1,2,3
1Department of Linguistics
New York University
2Department of Computer Science
New York University
3Center for Data Science
New York University
{warstadt,alicia.v.parrish,haokunliu,anhad,
weipeng,shengfu.wang,bowman}@nyu.edu
Abstrait
We introduce The Benchmark of Linguistic
Minimal Pairs (BLiMP),1 a challenge set for
evaluating the linguistic knowledge of lan-
guage models (LMs) on major grammatical
phenomena in English. BLiMP consists of
67 individual datasets, each containing 1,000
minimal pairs—that is, pairs of minimally dif-
ferent sentences that contrast in grammatical
acceptability and isolate specific phenomenon
in syntax, morphology, or semantics. We gen-
erate the data according to linguist-crafted
grammar templates, and human aggregate
agreement with the labels is 96.4%. Nous
evaluate n-gram, LSTM, and Transformer
(GPT-2 and Transformer-XL) LMs by observ-
ing whether they assign a higher probability to
the acceptable sentence in each minimal pair.
We find that state-of-the-art models identify
morphological contrasts related to agreement
reliably, but they struggle with some subtle
semantic and syntactic phenomena, tel que
negative polarity items and extraction islands.
1 Introduction
Current neural networks for sentence processing
rely on unsupervised pretraining tasks like lan-
is an open question
it
guage modeling. Toujours,
how the linguistic knowledge of state-of-the-art
language models (LMs) varies across the lin-
guistic phenomena of English. Recent studies
(par exemple., Linzen et al., 2016; Marvin and Linzen,
2018; Wilcox et al., 2018) have explored this
question by evaluating LMs’ preferences between
minimal pairs of sentences differing in gramma-
tical acceptability, as in Example 1. Cependant, chaque
1https://github.com/alexwarstadt/blimp.
377
of these studies uses a different set of metrics,
and focuses on a small
linguistic
paradigms, severely limiting any possible big-
picture conclusions.
set of
(1)
un. The cats annoy Tim. (grammatical)
b. *The cats annoys Tim. (ungrammatical)
We introduce the Benchmark of Linguistic
Minimal Pairs (shortened to BLiMP), a linguis-
tically motivated benchmark for assessing the
sensitivity of LMs to acceptability contrasts across
a wide range of English phenomena, covering both
previously studied and novel contrasts. BLiMP
consists of 67 datasets automatically generated
from linguist-crafted grammar templates, chaque
containing 1,000 minimal pairs and organized
by phenomenon into 12 catégories. Validation
with crowdworkers shows that BLiMP faithfully
represents human preferences.
We use BLiMP to study several pretrained LMs:
Transformer-based LMs GPT-2 (Radford et al.,
2019) and Transformer-XL (Dai et al., 2019), un
LSTM LM trained by Gulordava et al. (2019), et
an n-gram LM. We evaluate whether the LM
assigns a higher probability to the acceptable
sentence in each minimal pair to determine which
grammatical distinctions LMs are sensitive to.
This gives us indirect evidence about each model’s
linguistic knowledge and allows us to compare
models in a fine-grained way. We conclude that
current neural LMs appear to acquire robust
knowledge of morphological agreement and some
syntactic phenomena such as ellipsis and control/
raising. They show weaker evidence of knowledge
about argument structure, negative polarity item
licensing, and the semantic properties of quan-
tifiers. All models perform at or near chance
on extraction islands. Dans l'ensemble, every model we
Transactions of the Association for Computational Linguistics, vol. 8, pp. 377–392, 2020. https://doi.org/10.1162/tacl a 00321
Action Editor: Mark Steedman. Submission batch: 1/2020; Revision batch: 3/2020; Published 7/2020.
c(cid:13) 2020 Association for Computational Linguistics. Distributed under a CC-BY 4.0 Licence.
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
2
1
1
9
2
3
6
9
7
/
/
t
je
un
c
_
un
_
0
0
3
2
1
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
evaluate falls short of human performance by a
wide margin. GPT-2, which performs the best,
performs 8 points below humans overall, though
it does match or exceed human performance on
specific phenomena.
In §6.3 we conduct additional experiments to
investigate the effect of training size on the
LSTM LM and Transformer-XL’s performance
on BLiMP. Although we see steady improvements
in overall performance, we find that LMs learn
phenomenon-specific distinctions at different
rates.
In §6.4 we consider alternative well-
motivated evaluation metrics on BLiMP, but find
that they do not differ drastically from our method
of comparing LM probabilities for full sentences.
We conclude that whereas models like GPT-2
appear to have significant linguistic knowledge,
this knowledge is concentrated in some specific
domains of English grammar. We use BLiMP
to uncover several linguistic phenomena where
even state-of-the-art
language models clearly
lack human-like knowledge, and to bring into
focus those areas of grammar that future studies
evaluating LMs should investigate in greater
depth.
2 Background and Related Work
2.1 Language Models
The objective of a language model is to give
a probability distribution over the strings of a
langue. Both neural network and non-neural
network architectures are used to build LMs, et
neural models can be trained in a self-supervised
setting without the need for labeled data. Recently,
variants of neural language modeling have been
shown to be a strong pretraining task for natural
language processing tasks (Howard and Ruder,
2018; Peters et al., 2018; Radford et al., 2018;
Devlin et al., 2019).
The last decade has seen two major paradigm
shifts in the state of the art for language modeling.
D'abord, there was a movement from models based on
local n-gram statistics (see Chen and Goodman,
1999) to neural sequence models such as LSTMs
(Mikolov et al., 2010), which optimize on the
task of predicting the next token. Subsequently,
Transformer-based architectures employing self-
attention (Vaswani et al., 2017) have outperformed
LSTMs (par exemple., Dai et al., 2019). Although these
shifts have resulted in stronger LMs, perplexity
on large benchmark datasets like WikiText-103
(Merity et al., 2016) has remained the primary
performance metric, which cannot give detailed
insight into these models’ knowledge of grammar.
Evaluation on benchmarks like GLUE (Wang
et coll., 2018, 2019un), which heavily adapt language
models to perform downstream tasks, is more
informative, but doesn’t offer broad coverage
of linguistic phenomena, and doesn’t necessary
reflect knowledge that is already present in the
LMs.
2.2 Linguistic Knowledge of NNs
Many recent studies have searched for evidence
that neural networks (NNs) learn representations
that implicitly encode grammatical concepts. Nous
refer to the ability to encode these concepts
as linguistic knowledge. Some studies evaluate
NNs’ linguistic knowledge using probing tasks
is trained to directly
in which a classifier
predict grammatical properties of a sentence
(par exemple., syntactic tree depth) or part of a sentence
(par exemple., part-of-speech) using only the NNs’ learned
representation as input (Shi et al., 2016; Adi et al.,
2017; Conneau et al., 2018; Ettinger et al., 2018;
Tenney et al., 2019). We follow a complementary
approach that uses acceptability judgments to
address the same question without the need for
training data labeled with grammatical concepts.
Acceptability judgments are the main form of
behavioral data used in generative linguistics to
measure human linguistic competence (Chomsky,
1965; Sch¨utze, 1996).
One branch of this literature uses minimal pairs
to infer whether LMs detect specific grammatical
contrasts. Tableau 1 summarizes linguistic pheno-
mena studied in this work. Par exemple, Linzen
et autres. (2016) look closely at minimal pairs contrast-
ing subject-verb agreement. Marvin and Linzen
(2018) expand the investigation to negative
polarity item and reflexive licensing. Cependant,
these and related studies cover a limited set of
phenomena,
to the exclusion of well-studied
phenomena in linguistics such as control and
raising, ellipsis, quantification, and countless
others. This is likely due to the labor-intensive
nature of collecting such targeted minimal pairs.
A related line of work evaluates neural networks
on acceptability judgments in a more domain-
general way. Corpora of sentences and their
grammaticality are collected for this purpose in a
378
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
2
1
1
9
2
3
6
9
7
/
/
t
je
un
c
_
un
_
0
0
3
2
1
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Phenomenon
Relevant work
Marvin and Linzen (2018), Futrell et al. (2018), Warstadt et al. (2019b)
Anaphora/binding
Subj.-verb agreement Linzen et al. (2016), Futrell et al. (2018), Gulordava et al. (2019), Marvin and
Linzen (2018), An et al. (2019), Warstadt et al. (2019b)
Neg. polarity items Marvin and Linzen (2018), Futrell et al. (2018), Jumelet and Hupkes (2018),
Filler-gap/Islands
Argument structure
Wilcox et al. (2019), Warstadt et al. (2019un)
Wilcox et al. (2018), Warstadt et al. (2019b), Chowdhury and Zamparelli
(2018, 2019), Chaves (2020), Da Costa and Chaves (2020)
Kann et al. (2019), Warstadt et al. (2019b), Chowdhury and Zamparelli (2019)
Tableau 1: Summary of related work organized by linguistic phenomena tested. All studies analyze neural
networks using acceptability judgments on minimal pairs mainly in English. Some studies appear
multiple times.
number of studies (Heilman et al., 2014; Lau et al.,
2017; Warstadt et al., 2019b). The most recent
and comprehensive corpus is CoLA (Warstadt
et coll., 2019b), containing 10k sentences covering
a wide variety of linguistic phenomena provided as
examples in linguistics papers and books. CoLA,
which is included in the GLUE benchmark (Wang
et coll., 2018), has been used to track advances
in the sensitivity of reusable sentence encoding
models to acceptability. Current models like
BERT (Devlin et al., 2019) and T5 (Raffel et al.,
2019) now learn to give acceptability judgments
that approach or even exceed individual human
agreement with CoLA.
prédictions
Although CoLA can provide evidence about
phenomenon-specific knowledge of models, ce
method is limited by the need to train a super-
vised classifier on CoLA data prior to evaluation.
This is because CoLA is designed for binary
acceptability classification, and there is no gen-
erally accepted method for obtaining binary
acceptability
from unsupervised
models like LMs.2 Warstadt and Bowman (2019)
measure phenomenon-specific performance on
CoLA for several pretrained sentence encoding
models: an LSTM, GPT (Radford et al., 2018),
and BERT. Cependant,
the use of supervision
prevents making strong conclusions about the
sentence encoding component, since it
is not
possible to distinguish what the encoder knows
from what is learned through supervised training
on acceptability data.
Evaluating LMs on minimal pairs avoids this
problem, with the caveat that the LM probability
2Though see Lau et al. (2017) for some promising
proposals for normalizing LM probabilities to correlate with
gradient acceptability.
of a sentence can only serve as a proxy for
acceptability if confounding factors impacting
a sentence’s probability such as length and
lexical content are controlled for. It is with these
considerations in mind that we design BLiMP.
3 Données
BLiMP consists of 67 minimal pair paradigms,
each with 1,000 sentence pairs in mainstream
American English grouped into 12 categories.3
We refer to minimal pair types as paradigms
and categories as phenomena. Each paradigm is
annotated for the unique contrast it isolates and the
broader phenomena it is part of. We automatically
generate the data from linguist-crafted grammar
templates, and our automatic labels are validated
with crowd-sourced human judgments.
the fact
Although each minimal pair type corresponds
to exactly one paradigm, a particular fact about
English grammar may be illustrated by multiple
paradigms. Par exemple,
that certain
determiners and nouns agree can be illustrated
by keeping the determiner the same and changing
the number marking of the noun as in the example
in Table 2, or by keeping the noun the same
and changing the determiner (par exemple., Rachelle had
bought those chair.). With completeness in mind,
we include such complementary paradigms in
BLiMP whenever possible.
3We choose English because it is the native language of
the linguists who built the grammar templates, though in the
long run, creating versions of BLiMP in additional languages
would allow for coverage of more phenomena and expand
BLiMP’s range of usefulness. We assume 1,000 pairs is
sufficient to limit random noise resulting from small sample
sizes.
379
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
2
1
1
9
2
3
6
9
7
/
/
t
je
un
c
_
un
_
0
0
3
2
1
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Phenomenon
N Acceptable Example
Unacceptable Example
ANAPHOR AGR.
ARG. STRUCTURE
BINDING
CONTROL/RAISING
DET.-NOUN AGR.
ELLIPSIS
2 Many girls insulted themselves.
9 Rose wasn’t disturbing Mark.
7 Carlos said that Lori helped him.
5 There was bound to be a fish escaping. There was unable to be a fish escaping.
8 Rachelle had bought that chair.
2 Anne’s doctor cleans one important
Many girls insulted herself.
Rose wasn’t boasting Mark.
Carlos said that Lori helped himself.
book and Stacey cleans a few.
FILLER-GAP
IRREGULAR FORMS
ISLAND EFFECTS
NPI LICENSING
QUANTIFIERS
SUBJECT-VERB AGR. 6 These casseroles disgust Kayla.
7 Brett knew what many waiters find.
2 Aaron broke the unicycle.
8 Whose hat should Tonya wear?
7 The truck has clearly tipped over.
4 No boy knew fewer than six guys.
Rachelle had bought that chairs.
Anne’s doctor cleans one book and
Stacey cleans a few important.
Brett knew that many waiters find.
Aaron broken the unicycle.
Whose should Tonya wear hat?
The truck has ever tipped over.
No boy knew at most six guys.
These casseroles disgusts Kayla.
Tableau 2: Minimal pairs from each of the twelve linguistic phenomenon categories covered by BLiMP.
Differences are underlined. N is the number of 1,000-example minimal pair paradigms within each
broad category.
3.1 Data Generation Procedure
To create minimal pairs exemplifying a wide array
of linguistic contrasts, we found it necessary to
artificially generate all datasets. This ensures both
that we have sufficient unacceptable examples,
and that the data is fully controlled, allowing for
repeated isolation of a single linguistic pheno-
menon (Ettinger et al., 2018). For each paradigm,
we use a generation script to sample lexical items
from a vocabulary of over 3,000 items according
to a template specifying linear order of the phrases
in the acceptable and unacceptable sentences in
each minimal pair. Our data generation scripts are
publicly available.4 We annotate these lexical
items with the morphological, syntactic, et
semantic features needed to enforce selectional
restrictions and create grammatical and seman-
tically felicitous sentences.
All examples in a paradigm are structurally
analogous up to the point required for the relevant
contrast but may vary in some ways. Par exemple,
illustrated in
the template for NPI LICENSING,
Tableau 2, specifies that an arbitrary verb phrase
needs to be generated. Accordingly, the generation
script samples from the entire set of verbs and
generates the required arguments on-the-fly. Ainsi,
the structure of the sentence then depends on
whether the sampled verb is transitive, clause-
embedding, raising, and so forth, but that same
4https://github.com/alexwarstadt/data
generation.
verb phrase and its arguments are used in both
pairs in the paradigm.
This generation procedure is not without
limitations, and despite the very detailed voca-
bulary we use, implausible sentences are occa-
sionally generated (par exemple., Sam ran around some
glaciers).
though, both the
acceptable and unacceptable sentences will be
equally implausible given world knowledge, donc
any difference in the probability assigned to them
is still attributable to the intended grammatical
contraste.
In these cases,
3.2 Couverture
The paradigms covered by BLiMP represent
well-established contrasts in English morphology,
syntax, and semantics. Each paradigm is grouped
into one of 12 phenomena, shown in Table 2.
Examples of all 67 paradigms appear in Table 4
of the Appendix. The paradigms are selected with
the constraints that they can be characterized using
templates as described above and illustrated with
minimal pairs of sentences equal in length5 that
differ in at most one vocabulary item.
Although this dataset has broad coverage, it is
pas exhaustive. It is not possible to include every
5We define length as the number of entries from our
lexicon. Some sentences in a pair contain different numbers
of words because visit and drop by are each one lexical entry.
Where discrepancies in number of words occur, ils sont
generally randomly distributed across the grammatical and
ungrammatical sentences in a paradigm.
380
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
2
1
1
9
2
3
6
9
7
/
/
t
je
un
c
_
un
_
0
0
3
2
1
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
grammatical phenomenon of English, and there is
no agreed-upon set of core phenomena. Cependant,
we consider frequent inclusion of a phenomenon in
a syntax/semantics textbook as an informal proxy
for what linguists consider to be core phenomena.
We survey several syntax textbooks (par exemple., Sag
et coll., 2003; Adger, 2003; Sportiche et al., 2013),
and find that nearly all of the phenomena in
BLiMP are discussed in some source. Most of
the topics that repeatedly appear in textbooks
and can be represented with minimal pairs (par exemple.,
agreement, control/raising, wh-extraction/islands,
binding) are present in BLiMP.6
We characterize the 12 phenomena in BLiMP
as follows7:
• ANAPHOR AGREEMENT:
the requirement that
reflexive pronouns like himself
anaphora) agree with their antecedents in
person, number, genre, and animacy.
(a.k.a.
• ARGUMENT STRUCTURE: the ability of different
verbs to appear with different
types of
arguments. Par exemple, different verbs can
appear with a direct object, participate in the
causative alternation, or take an inanimate
argument.
• BINDING: the structural relationship between
a pronoun and its antecedent. All paradigms
illustrate aspects of Chomsky’s
(1981)
Principle A. Because coindexation cannot
be annotated in BLiMP, Principles B and C
are not illustrated.
• CONTROL/RAISING:
entre
syntactic and semantic
differences
de
predicates that embed an infinitival VP.
This includes control, raising, and tough-
movement predicates.
various
les types
• DETERMINER-NOUN AGREEMENT: number agree-
ment between demonstrative determiners
(par exemple., this/these) and the associated noun.
• ELLIPSIS:
le
omitting
possibility
expressions from a sentence. Because this is
difficult to illustrate with sentences of equal
de
6In line with these textbooks, we rely on stereotyped
gender-name pairings and contrasts not present in all English
dialects (more detail provided in the Appendix).
7Notre
narrower
particular constraints described above.
implementation of
than the linguistic definition because of
these phenomena is often
le
381
length, our paradigms cover only special
cases of noun phrase ellipsis that meet this
constraint.
• FILLER-GAP:
dependencies
phrasal movement
questions.
dans,
arising
depuis
Par exemple, wh-
• IRREGULAR FORMS: irregular morphology on
English past participles (par exemple., broken). Nous
are unable to evaluate models on nonexistent
forms like *breaked because such forms are
out of the vocabulary for some LMs.
• ISLAND EFFECTS:
restrictions on syntactic
environments where the gap in a filler-gap
dependency may occur.
• NPI LICENSING: restrictions on the distribution
of negative polarity items like any and ever
limited to, Par exemple, the scope of negation
and only.
cover
quantifiers. Nous
• QUANTIFIERS: restrictions on the distribution
de
tel
restrictions: superlative quantifiers (par exemple., à
least) cannot embed under negation, et
definite quantifiers and determiners cannot
be subjects in existential-there constructions.
deux
• SUBJECT-VERB AGREEMENT:
sujets
et
present tense verbs must agree in number.
3.3 Comparison to Related Resources
With a vocabulary of over 3,000 words, BLiMP
lexical variation of any
has by far the most
related generated dataset. It includes verbs with
11 different subcategorization frames, y compris
verbs that select for PPs, infinitival VPs, et
embedded clauses. By comparison, datasets by
Ettinger et al. (2018) and Marvin and Linzen
(2018) use vocabularies of under 200 items. Other
datasets of minimal pairs that achieve more lexical
and syntactic variety use data-creation methods
that limit empirical scope and control. Linzen
et autres. (2016) construct a dataset of minimal
pairs for subject-verb agreement by changing
verbs’ number marking in a subset of English
Wikipedia, but this approach does not generalize
beyond agreement phenomena. Lau et al. (2017)
construct minimal pairs by taking sentences from
the BNC through round-trip machine translation.
The resulting sentences contain a wider variety of
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
2
1
1
9
2
3
6
9
7
/
/
t
je
un
c
_
un
_
0
0
3
2
1
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
r
e
v
Ô
Model
5-gram 61.2
LSTM 69.8
TXL
69.6
GPT-2
81.5
Human
88.6
a ll
UN. UN
N
UN
47.9
91.7
94.1
99.6
97.5
R.
G
G. S
R.
UN
71.9
73.2
69.5
78.3
90.0
R.
T
G
D I N
I N
B
64.4
73.5
74.7
80.1
87.3
L . R.
R.
T
C
68.5
67.0
71.5
80.5
83.9
S I S
R.
G
UN
A I S.
D – N
70.0
85.4
83.0
93.3
92.2
L I P
L
E
36.9
67.6
77.2
86.6
85.0
E
L
I L
F
60.2
73.9
66.6
81.3
86.9
P.
UN
R. . G
G
E
R.
I R
79.5
89.1
78.2
84.1
97.0
R.
UN
L
U
N
UN
L
I S
57.2
46.6
48.4
70.6
84.9
D
P I
N
45.5
51.7
55.2
78.9
88.1
N
UN
U
Q
53.5
64.5
69.3
71.3
86.6
S
R.
T I F I E
R.
G
UN
S – V
60.3
80.1
76.0
89.0
90.9
Tableau 3: Percentage accuracy of four baseline models and raw human performance on BLiMP using a forced-choice
task. A random guessing baseline would achieve an accuracy of 50%.
grammatical violations, but it is not possible to
control the nature or quantity of violations in the
resulting sentences.
3.4 Data Validation
To verify that the generated sentences represent a
real contrast in acceptability, we conduct human
validation via Amazon Mechanical Turk.8 Twenty
separate validators rated five pairs from each of
le 67 paradigms, for a total of 6,700 judgments.
We restricted validators to individuals currently
located in the US who self-reported as native
speakers of English. To assure that our validators
made a genuine effort on the task, each HIT
included an attention check item and a hidden
field question to catch bot-assisted humans.
Validators were paid $0.25 for completing five
judgments, which we estimate took 1-2 minutes.
For each minimal pair, 20 individuals completed
a forced-choice task mirroring the LMs’ task;
the human-determined acceptable sentence was
calculated via majority vote of annotators. By this
metric, we estimate aggregate human agreement
with our annotations to be 96.4% overall. As a
threshold of inclusion in BLiMP, the majority of
validators needed to agree with BLiMP on at least
4/5 examples from each paradigm. Ainsi, tous 67
paradigms in the public version of BLiMP passed
this validation; only two additional paradigms
were rejected on this criterion. We also estimate
individual human agreement to be 88.6% overall
using the approximately 100 annotations from
each paradigm.9 Table 3 reports individual human
résultats (and model results) as a conservative
measure of human agreement.
4 Models
GPT-2 GPT-2 (Radford et al., 2019) is a large-
scale language model using the Transformer
architecture (Vaswani et al., 2017). Our main
experiments use GPT-2-large with 36 layers and
774M parameters.10 The model is pretrained on
Radford et al.’s WebText dataset, which contains
40GB of English text extracted from Web pages
and filtered for quality. To our knowledge,
WebText is not publicly available, so assuming
an average of 5–6 bytes/chars per word, nous
estimate WebText contains about 8B tokens. Nous
use jiant, a codebase for training and evaluating
sentence understanding models (Wang et al.,
2019b), to implement code for evaluating GPT-2
on BLiMP.11
Transformer-XL Transformer-XL (Dai et al.,
2019) is another multilayer Transformer-based
neural language model. We test the pretrained
Transformer-XL Large model with 18 layers of
Transformer decoders and 16 attention heads for
each layer. The model is trained on WikiText-
103 (Merity et al., 2016), a corpus of 103M
tokens from English Wikipedia. Code for testing
Transformer-XL on BLiMP is also implemented
in jiant.
LSTM We include a long-short term memory
(LSTM, Hochreiter and Schmidhuber, 1997)
LM in our experiments. Spécifiquement, we test
a pretrained LSTM LM from Gulordava et al.
(2019) on BLiMP. The model is trained on a 83M-
token corpus extracted from English Wikipedia.
To investigate the effect of training size on
model performance (§6.3), we retrain a series
of LSTM and Transformer-XL models with the
same hyperparameters and the following training
8The full set of human judgments and a summary of the
results for all 67 paradigms is in Table 4 in the Appendix.
9A few had to be excluded due to ineligible annotators.
10GPT-2-XL performs slightly worse on BLiMP; see §6.3.
11https://github.com/nyu-mll/jiant/tree/
blimp-and-npi/scripts/blimp.
382
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
2
1
1
9
2
3
6
9
7
/
/
t
je
un
c
_
un
_
0
0
3
2
1
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
sizes: 64M., 32M., 16M., 8M., 4M., 2M., 1M., 1/2M.,
1/4M., and 1/8M tokens. For each size, we train
the model on five different random samples of the
original training data, which has a size of 83M
tokens.12
5-gram We build a 5-gram LM on the English
Gigaword corpus (Graff et al., 2003), lequel
consists of 3.1B tokens. To efficiently query
n-grams we use an implementation13 based on
Heafield et al. (2013).14
5 Results and Discussion
An LM’s overall accuracy on BLiMP is simply
the proportion of the 67,000 minimal pairs in
which the model assigns a higher probability to
the acceptable sentence. We report the results for
all models and human evaluation in Table 3. GPT-
2 achieves the highest accuracy and the 5-gram
model the lowest. All models perform well below
estimated human accuracy (as described in § 3.4).
The 5-gram model’s poor performance—overall
and on every individual category—indicates
that BLiMP is likely not solvable from local
co-occurrence statistics alone.
Because we evaluate pretrained models that
differ in architecture and training data, we can
only speculate about what drives these differences
(though see § 6.3 for a controlled ablation study
on the LSTM LM). The results seem to indicate
that access to training data is the main driver of
performance on BLiMP for the neural models we
evaluate. This may explain why Transformer-XL
and the LSTM LM perform similarly in spite of
differences in architecture, as both are trained on
approximately 100M tokens of Wikipedia text.
Relatedly, GPT-2’s advantage may come from
the fact that it is trained on roughly two orders of
magnitude more data. Possibly, LSTMs trained on
larger datasets could perform comparably to GPT-
2, but such experiments are impractical because of
the inefficiency of training LSTMs at this scale.
generally perform best and closest
to human
level on morphological phenomena. Par exemple,
GPT-2 performs within 2.1 points of humans
on ANAPHOR AGR., DET.-NOUN AGR., and SUBJ.-VERB
AGR.. The set of challenging phenomena is more
diverse. ISLANDS are the hardest phenomenon by
a wide margin. Only GPT-2 performs well above
chance, and it remains 14 points below humans.
Some semantic phenomena, specifically those
involving NPI LICENSING and QUANTIFIERS, are also
challenging overall. All models perform relatively
poorly on ARG. STRUCTURE.
From these results we conclude that current
SotA LMs robustly encode basic facts of English
agreement. This does not mean that LMs will
come close to human performance for all
agreement phenomena. §6.1 discusses evidence
that increased dependency length and the presence
of agreement attractors of the kind investigated by
Linzen et al. (2016) and Gulordava et al. (2019)
reduce performance on agreement phenomena.
We find,
that LMs do represent
in accordance with Wilcox et al.
(2018),
long-distance
wh-dependencies, but we also conclude that
their representations differ fundamentally from
humans’. Although some models approach human
performance in ordinary filler-gap dependencies,
they are exceptionally poor at identifying island
violations overall. This finding suggests that they
reliably encode long-distance dependencies in
général, but not the syntactic domains in which
these dependencies are blocked, though GPT-2
does perform well above chance on some para-
digms of ISLAND EFFECTS. Cependant, strong con-
clusions about how these models
represent
wh-dependencies are not possible using the
forced-choice task compatible with BLiMP, et
a complete assessment of syntactic islands is best
addressed using a factorial design that manipulates
both the presence of an island and an attempt to
extract from it, as in Kush et al. (2018) or Wilcox
et autres. (2018).
5.1 Results and Discussion by Phenomenon
into how LM’s
The results also give insight
linguistic knowledge varies by domain. Models
12https://github.com/sheng-fu/colorless
greenRNNs.
13https://github.com/kpu/kenlm.
14https://github.com/anhad13/blimp ngram.
In the semantic phenomena where models
struggle (NPIS and QUANTIFIERS), violations are
often attributed in semantic theories to a presup-
position failure or contradiction arising from
semantic composition or pragmatic reasoning
(par exemple., Chierchia, 2013; Ward and Birner, 1995;
Geurts and Nouwen, 2007). These abstract
383
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
2
1
1
9
2
3
6
9
7
/
/
t
je
un
c
_
un
_
0
0
3
2
1
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
semantic and pragmatic factors may be difficult
for LMs to learn. Marvin and Linzen also find
that LSTMs largely fail to recognize NPI licensing
conditions. Warstadt et al. (2019un) find that BERT
(which is similar in scale to GPT-2) recognizes
these conditions inconsistently in an unsupervised
setting.
The weak performance on ARG. STRUCTURE is
somewhat surprising, since arguments and heads
are usually—though not always—adjacent (par exemple.,
subjects and direct objects are adjacent to the
verb in default English word order). Cependant,
argument structure is closely related to semantic
event structure (see Marantz, 2013), which may
be comparatively difficult for LMs to learn.
judgments about argument structure are
Aussi,
complicated by the possibility of coercing a
frequently transitive verb to be intransitive and
vice versa as well as the existence of secondary
meanings of verbs with different argument
structures (par exemple., normally intransitive boast has a
transitive use as in The spa boasts 10 pools), lequel
might make this domain somewhat more difficult
for LMs. Though even with these complications,
humans detect the intended contrast 90% of the
temps. We note that the reported difficulty of these
phenomena contradicts Warstadt and Bowman’s
(2019) conclusion that argument structure is one of
the strongest domains for neural models. Cependant,
Warstadt and Bowman evaluate classifiers with
supervision on CoLA, a large proportion of which
is sentences related to argument structure.
items,
Enfin, we caution against interpreting positive
results on a general phenomenon in BLiMP as
proof of human-like knowledge. Although it is
unlikely that GPT-2 could reach human perfor-
mance on the SUBJ.-VERB AGR. paradigms without
acquiring a concept of number marking that
abstracts away from specific lexical
it
is difficult to rule out this possibility without
accumulating different forms of evidence, pour
instance, by testing how it generalizes to nonce
words. We take the paradigms in FILLER-GAP as
a cautionary example (see Table 4). Il y a
four paradigms that assess a model’s sensitivity to
the syntactic requirements of complementizer that
versus a wh-word. We observe that all models
more or less succeed when the unacceptable
sentence lacks a necessary gap, but fail when
it contains an illicit gap. These results suggest the
models’ ability to accurately detect a contrast in
384
Chiffre 1: Heatmap showing the correlation between
models’ accuracies in each of the 67 paradigms.
whether a gap is filled following a wh-word is
not clearly based on a generalization about the
relationship between that wh-word and its gap,
as such a generalization should extend to the
cases where the models currently fail to detect
the correct contrast. Plus généralement, conclusions
about a model’s knowledge of a particular
grammatical concept can only be reached by
considering several paradigms.
5.2 Shallow Predictors of Performance
We also ask what
factors besides linguistic
phenomena affect model accuracy. Chiffre 2 shows
how sentence length, perplexity (which does not
depend on length), the probability of the good
sentence (which does depend on length), et
confidence affect model performance. The effect
of perplexity is much weaker for GPT-2 than
for other models, which indicates that it is probably
more robust to sentences with non-stereotypical
syntax or describing unlikely scenarios. GPT-2
is the only model where accuracy increases
largely monotonically with confidence. A similar
relationship holds between confidence
et
agreement in human acceptability judgments.
5.3 Correlation of Model and Human
Performance
We examine the extent to which models and
humans succeed at detecting contrasts for the same
linguistic phenomena. Chiffre 1 shows the Pearson
correlation between the four LMs and humans
of their accuracies on the 67 paradigms. Le
neural models correlate moderately with humans,
with GPT-2 correlating most strongly. The n-gram
model’s performance correlates with humans rela-
tively weakly. Neural models correlate with each
other more strongly, suggesting neural networks
share some biases that are not human-like.
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
2
1
1
9
2
3
6
9
7
/
/
t
je
un
c
_
un
_
0
0
3
2
1
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Chiffre 2: Models’ performance on BLiMP as a function
of sentence length, perplexity, log probability of the
acceptable sentence, and model confidence (calculated
comme |log P (S1) − log P (S2)|).
Transformer-XL and LSTM’s high correlation of
0.9 possibly reflects their similar training data.
6 Analysis
6.1 Long-Distance Dependencies
The presence of intervening material can lower the
ability of humans to detect agreement depen-
dencies (Bock and Miller, 1991). We study how
intervening material affects the LMs’ sensitivity
to mismatches in agreement in BLiMP. D'abord, nous
test for sensitivity to determiner-noun agreement
with and without an intervening adjective, as in
Exemple (2). The results are plotted in Figure 3.
The n-gram model is the most heavily impacted,
performing on average 35 points worse. This is
unsurprising, since the bigram consisting of a
determiner and noun is far more likely to be
observed than the trigram of determiner, adjective,
and noun. For the neural models, we find a weak
but consistent effect, with all models performing
on average between 5 et 3 points worse when
there is an intervening adjective.
(2)
un. Ron saw that man/*men.
b. Ron saw that nice man/*men.
Deuxième, we test for sensitivity to mismatches
in subject-verb agreement when an attractor noun
of the opposite number intervenes. We compare
attractors in relative clauses (3-b) and as part of
a relational noun (3-c), following experiments
by Linzen et al. (2016) et d'autres. Encore, nous
the n-gram model’s performance is
find that
reduced significantly by this intervening material,
suggesting the model is consistently misled by the
presence of an attractor. All the neural models
perform above chance with an attractor present,
but GPT-2 and the LSTM perform 22 et 20 points
Chiffre 3: The effect of the locality of determiner-noun
agreement (upper panel) and the type of agreement
attractor (lower panel) on model performance.
worse when an attractor is present than when
there is no attractor, while Transformer-XL’s
performance is reduced by only 5 points. Ainsi,
we reproduce Linzen et al.’s finding that attractors
significantly reduce LSTM LMs’ sensitivity to
mismatches in agreement and find evidence that
this holds true of some Transformer LMs as well.
(3)
un. The sisters bake/*bakes.
b. The sisters who met Cheryl bake/*bakes.
c. The sisters of Cheryl bake/*bakes.
6.2 Regular vs. Irregular Agreement
In DET.-NOUN AGR. and SUBJ.-VERB AGR., we generate
separate datasets for nouns with regular and
irregular number marking, as in Example (4). All
else being equal, only models with access
information should make
to sub-word-level
any distinction between regular and irregular
morphology.
(4)
(regular)
un. Ron saw that nice kid/*kids.
b. Ron saw that nice man/*men. (irregular)
En fait, Chiffre 4 shows that the two sub-word-
level models GPT-2 and Transformer-XL show
little effect of irregular morphology: They perform
less than 1.3 points worse on irregulars than
regulars. Their high overall performance suggests
that they robustly encode number features without
relying on segmental cues.15
15The LSTM LM, which has word-level tokens, averages
5.2 points worse on the irregular paradigms. This effect is
not due to morphology, but rather to the higher proportion
of out-of-vocabulary items among the irregular nouns, lequel
include many loanwords such as theses and alumni.
385
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
2
1
1
9
2
3
6
9
7
/
/
t
je
un
c
_
un
_
0
0
3
2
1
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Chiffre 4: Models’ performance on agreement phe-
nomena between a determiner and noun and between
a subject and verb, broken down by whether the
noun/subject has a regular or irregular plural form
6.3 Training size and BLiMP performance
We use BLiMP to track how a model’s rep-
resentation of particular phenomena varies with
the quantity of training data. Using different
sized subsets of Gulordava et al.’s (2019) entraînement
data, we retrain the LSTM and Transformer-XL
models and evaluate their performance on BLiMP.
Chiffre 5 shows that different phenomena have
notably different learning curves across different
training sizes even if the full model trained on 83M
tokens achieved equivalent accuracy scores. Pour
example, the LSTM model ultimately performs
well on both IRREGULAR and ANAPHOR AGR., mais
requires more training to reach this level of
performance for ANAPHOR AGR. These learning
curve differences show how BLiMP performance
dissociates from perplexity on Wikipedia data, un
standard measure of LM performance: Although
perplexity decreases with more training data,16
performance on different phenomena grows at
varying rates.
un
est
Nous
que
là
conjecture
sigmoid
relationship between the logarithm of training
set size and BLiMP performance that appears
to be roughly linear at this scale. We conduct
linear regression analyses to estimate the rate
of increase in performance in relation to the
logarithm (base 2) of dataset size. For the LSTM
LM, best-fit
lines for phenomena on which
the model had the highest accuracy have the
steepest slopes: ANAPHOR AGR. (0.0623), DET.-NOUN
16Average perplexity on the Gulordava et al. (2019) test
ensemble: 595 at 0.125M, 212 at 1M, 92.8 at 8M, et 53 at 64M.
386
Chiffre 5: Transformer-XL (top) and LSTM LM
(bottom) performance as a function of training size
and phenomena in BLiMP. The gray line shows the
average across all phenomena.
AGR. (0.0426), and IRREGULAR (0.039). We see
the shallowest slopes on phenomena with the
worst performance: NPIS (0.0078) and ISLANDS
(0.0036). For Transformer-XL, we observe a
similar pattern: The steepest
learning curves
again belong to ANAPHOR AGR. (0.0545) and DET.-
NOUN AGR. (0.0405), and the shallowest to NPIS
(0.0055) and ISLANDS (0.0039). Based on these
valeurs, we estimate that if log-linear improvement
continues, the LSTM LM and Transformer-XL
should require well over 1020 tokens of training
data to achieve human-like performance on these
hardest phenomena.
We also find that increasing model size (number
of parameters) is unlikely to improve performance:
We evaluate four pretrained versions of GPT-
2 avec 117 M to 1,558 M parameters trained
on WebText. All models have overall BLiMP
accuracy of 0.84 ± .01%, and standard deviation
among the models on each of the 12 phenomena
does not exceed 0.03. This finding bolsters
our earlier conclusion in §5 that amount of
training data has the biggest impact on BLiMP
performance.
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
2
1
1
9
2
3
6
9
7
/
/
t
je
un
c
_
un
_
0
0
3
2
1
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
6.4 Alternate Evaluation Methods
There are several other methods one can
use to measure an LM’s preference between
two minimally different sentences. So far, nous
have considered only the full-sentence method,
advocated for by Marvin and Linzen (2018), lequel
compares LM likelihoods of full sentences. In a
followup experiment, we use two prefix methods,
each of which has appeared in related prior work,
that evaluate a model’s preferences by comparing
its prediction at a key point of divergence between
the sentences. Subsets of BLiMP data are designed
to be compatible with multiple methods, allowing
us to conduct the first direct comparison. We find
that all methods give broadly similar results when
aggregating over a set of paradigms. We see no
strong argument against evaluating solely using
the full-sentence method, though some results
diverge for specific paradigms.
One-Prefix Method In the one-prefix method,
used by Linzen et al. (2016), a pair of sentences
share the same initial portion of a sentence, mais
differ in a critical word that make them differ in
grammaticality (par exemple., The cat eats mice vs. Le
cat eat mice). The model’s prediction is correct if
it assigns a higher probability to the grammatical
token given the shared prefix.
Two-Prefix Method In the two-prefix method,
used by Wilcox et al. (2019), a pair of sentences
differ in their initial string, and the grammaticality
difference is only revealed when a shared critical
word is included (par exemple., The cat eats mice vs. Le
cats eats mice). For these paradigms, we evaluate
whether the model assigns a higher probability to
the critical word conditioned on the grammatical
prefix than on the ungrammatical prefix.
The prefix methods differ
from the full-
sentence method in two key ways: (je)
ils
require that the acceptability of the sentence be
unambiguously predictable from the critical word,
but not sooner, et (ii) they are not affected
by predictions made by the LM following the
critical word. These values do affect the full
sentence method. Par exemple, assuming that
P. (are numerous) ≫ P (is numerous), un modèle
could predict that The cats are numerous is more
likely than The cats is numerous without correctly
predicting that P (sont|the cats) > P (est|the cats).
Using prefix probabilities allows us to exclude
models’ use of this additional information and
387
Chiffre 6: Comparison of models’ performance on the
simple LM method and the 1- and 2-prefix methods.
The upper panels show results from three phenomena
that are compatible with both 1-prefix and 2-prefix
méthodes. The lower panel shows the averages and
standard deviations across all phenomena.
evaluate how the models perform when they have
just enough information to judge grammaticality.
Chiffre 6 shows that models have generally
comparable accuracies across all three methods.
Cependant, there are some cases where we observe
differences between these methods. Par exemple,
Transformer-XL performs much worse at BINDING,
DET.-NOUN AGR., and SUBJ.-VERB AGR. in the simple
the probabilities
LM method, suggesting that
Transformer-XL assigns to the irrelevant part
at the end of the sentence very often overturn
the observed preference based on probability up
to the critical word. On the other hand, GPT-2
benefits from reading the whole sentence for
BINDING phenomena, as its performance is better in
the simple LM method than in the prefix method.
We conclude that with a sufficiently diverse
set of paradigms,
the various metrics under
consideration will give similar results. Ainsi, it
is not problematic that BLiMP relies only on the
full-sentence method, and doing so allows BLiMP
to include many paradigms not compatible with
either prefix method. Néanmoins, prefix methods
are still valuable for detailed analysis or for studies
making direct comparison to psycholinguistic
theories (par exemple., Wilcox et al., 2018).
7 Conclusion and Future Work
We have shown ways in which BLiMP can be used
as tool to gain evidence about both the overall
and fine-grained linguistic knowledge of language
models. Like the GLUE benchmark (Wang et al.,
2018), BLiMP assigns a single overall score
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
2
1
1
9
2
3
6
9
7
/
/
t
je
un
c
_
un
_
0
0
3
2
1
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
to an LM that summarizes its general sensitivity to
minimal pair contrasts. It also provides a break-
down of LM performance by linguistic phenom-
enon, which can be used to draw more concrete
conclusions about
the kinds of grammatical
features learned acquired by a given model. Ce
kind of information is a linguistically motivated
evaluation of LMs that can complement common
metrics like perplexity.
the extent
En outre,
to which humans
resemble data-driven learners
like language
models is debated in linguistics and cognitive
science (see e.g., Chomsky, 1965; Reali and
Christiansen, 2005). In some domains, we may
require the aid of innate knowledge to acquire
phenomenon-specific knowledge resembling that
tested in BLiMP. By evaluating whether self-
supervised learners like LMs acquire human-like
grammatical acuity in a particular domain, nous
ce
indirect evidence as to whether
gather
phenomenon is a necessary component of humans’
innate knowledge.
Another aim of BLiMP is to serve as a guide for
future work on the linguistic evaluation of LMs.
It is particularly interesting to better understand
those empirical domains where current LMs
to acquire some relevant knowledge,
appear
but still fall short of human performance. Le
results from BLiMP suggest that—in addition
to relatively well-studied phenomena like filler-
gap dependencies, NPIs, and binding—argument
structure remains one area where there is much to
uncover about what LMs learn. Plus généralement,
as language modeling techniques continue to
improve, it will be useful to have large-scale tools
like BLiMP to efficiently track changes in what
these models do and do not know about grammar.
Remerciements
This material is based upon work supported by
the National Science Foundation under grant no.
1850208. Any opinions, findings, and conclusions
or recommendations expressed in this material are
those of the author(s) and do not necessarily reflect
the views of the National Science Foundation.
This project has also benefited from support to
SB by Eric and Wendy Schmidt (made by rec-
ommendation of the Schmidt Futures program), par
Samsung Research (under the project Improving
Deep Learning using Latent Structure), by Intuit,
388
Inc., and by NVIDIA Corporation (avec le
donation of a Titan V GPU).
Appendix
The following contains examples from each of the
67 paradigms in BLiMP.
Caveats Some paradigms include non-transparent
factors that may influence interpretation. We list
here those factors that we are aware of:
• Several paradigms within ANAPHOR AGREE-
MENT and BINDING rely on stereotyped gender
assignment associated with names (par exemple.,
Mary). A model has to have at least a weak
gender-name association in order to succeed
on some paradigms in BLiMP. Par exemple,
like Mary hugged
we mark sentences
themselves and Mary hugged himself as un-
acceptable, and we never include possibilities
like Mary hugged themself.
• To isolate certain phenomena, we had to rely
on acceptability contrasts present in main-
stream US and UK English but absent in
many other dialects. Par exemple, quelques
the sentence Suzy
speakers would accept
lie, but we would mark this un-
don’t
acceptable based on mainstream US English
judgments. BLiMP assesses models’ know-
ledge of this specific dialect of English; dans
some cases it could penalize models that
conform to a different dialect.
How to read this table:
• Phenomenon refers to the linguistic phenom-
enon as noted in Table 2. UID refers to the
unique identifier used in the released dataset.
• Model and human performance are reported
as percent accuracy.
‘Human’ uses the
more conservative individual judgments (comme
opposed to majority vote, for which each
paradigm would be either 100% ou 80%).
• Each pair is marked for whether it is usable
with a prefix method. All sentences are valid
for the simple LM method.
• If a sentence has a checkmark (X) sous
the 1pfx column, the sentence can be used
with the 1-prefix method in addition to the
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
2
1
1
9
2
3
6
9
7
/
/
t
je
un
c
_
un
_
0
0
3
2
1
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
T M
L
X
T
T – 2
P.
G
n
un
m
toi
H
Acceptable Example
Phenomenon
UID
ANAPHOR
AGREEMENT
anaphor gender agreement
anaphor number agreement
ARGUMENT
STRUCTURE
BINDING
CONTROL/
RAISING
DETER-
MINER-
NOUN
AGR.
animate subject passive
animate subject trans
causative
drop argument
inchoative
intransitive
passive 1
passive 2
transitive
principle A c command
principle A case 1
principle A case 2
principle A domain 1
principle A domain 2
principle A domain 3
principle A reconstruction
existential there object raising
existential there subject raising
expletive it object raising
tough vs raising 1
tough vs raising 2
determiner noun agreement 1
determiner noun agreement 2
determiner noun agreement irregular 1
determiner noun agreement irregular 2
determiner noun agreement with adj 1
determiner noun agreement with adj 2
determiner noun agreement with adj irregular 1
determiner noun agreement with adj irregular 2
ELLIPSIS
ellipsis n bar 1
ellipsis n bar 2
FILLER
GAP
wh questions object gap
wh questions subject gap
wh questions subject gap long distance
wh vs that no gap
wh vs that no gap long distance
wh vs that with gap
wh vs that with gap long distance
IRREGULAR
FORMS
irregular past participle adjectives
irregular past participle verbs
ISLAND
EFFECTS
NPI
LICENSING
QUANTIFIERS
SUBJECT-
VERB
AGR.
adjunct island
complex NP island
coordinate structure constraint complex left branch
coordinate structure constraint object extraction
left branch island echo question
left branch island simple question
sentential subject island
wh island
matrix question npi licensor present
npi present 1
npi present 2
only npi licensor present
only npi scope
sentential negation npi licensor present
sentential negation npi scope
existential there quantifiers 1
existential there quantifiers 2
superlative quantifiers 1
superlative quantifiers 2
distractor agreement relational noun
distractor agreement relative clause
irregular plural subject verb agreement 1
irregular plural subject verb agreement 2
regular plural subject verb agreement 1
regular plural subject verb agreement 2
m
5 – g r a
44
52
54
72
51
68
89
82
71
70
91
58
100
49
95
56
52
40
84
77
72
33
77
88
86
85
90
50
53
55
52
23
50
53
82
86
83
81
18
20
79
80
48
50
32
59
96
57
61
56
1
47
47
57
30
93
45
91
62
45
17
24
22
73
88
76
81
S
L
88
95
68
79
65
79
72
73
65
72
87
59
100
87
98
68
55
46
66
80
63
34
93
92
92
82
86
86
76
83
87
68
67
79
92
96
97
97
43
14
93
85
67
47
30
71
32
36
43
47
2
54
54
93
36
100
23
96
16
63
83
76
63
81
89
89
83
91
97
58
70
54
67
81
81
76
74
89
61
100
95
99
70
60
38
76
79
72
45
86
92
81
88
82
78
81
77
86
65
89
61
83
86
86
91
42
17
91
66
65
58
36
74
63
36
37
20
1
61
48
80
45
99
53
94
14
84
85
77
60
78
83
73
85
99
100
77
80
68
84
90
90
89
79
49
100
96
73
99
73
82
37
92
89
58
72
92
100
93
94
93
90
96
88
93
88
86
84
95
88
97
94
56
56
78
90
91
72
42
88
77
82
35
77
67
55
62
100
85
89
95
99
24
84
78
83
68
95
96
97
96
96
99
98
87
82
90
95
86
99
86
87
86
98
96
95
75
83
78
90
88
86
75
81
96
95
92
85
96
94
85
95
92
78
85
98
85
97
92
77
75
99
95
94
80
90
91
91
99
61
73
98
83
98
92
72
93
81
94
76
91
85
81
86
95
94
95
95
Katherine can’t help herself.
Many teenagers were helping themselves.
Amanda was respected by some waitresses.
Danielle visited Irene.
Aaron breaks the glass.
The Lutherans couldn’t skate around.
A screen was fading.
Some glaciers are vaporizing.
Jeffrey’s sons are insulted by Tina’s supervisor.
Most cashiers are disliked.
A lot of actresses’ nieces have toured that art gallery.
A lot of actresses that thought about Alice healed themselves.
Tara thinks that she sounded like Wayne.
Stacy imagines herself praising this actress.
Carlos said that Lori helped him.
Mark imagines Erin might admire herself.
Nancy could say every guy hides himself.
It’s herself who Karen criticized.
William has declared there to be no guests getting fired.
There was bound to be a fish escaping.
Regina wanted it to be obvious that Maria thought about Anna.
Julia wasn’t fun to talk to.
Rachel was apt to talk to Alicia.
Craig explored that grocery store.
Carl cures those horses.
Phillip was lifting this mouse.
Those ladies walk through those oases.
Tracy praises those lucky guys.
Some actors buy these gray books.
This person shouldn’t criticize this upset child.
That adult has brought that purple octopus.
Unacceptable Example
Katherine can’t help himself.
Many teenagers were helping herself.
Amanda was respected by some picture.
The eye visited Irene.
Aaron appeared the glass.
The Lutherans couldn’t disagree with.
A screen was cleaning.
Some glaciers are scaring.
Jeffrey’s sons are smiled by Tina’s supervisor.
Most cashiers are flirted.
A lot of actresses’ nieces have coped that art gallery.
A lot of actresses that thought about Alice healed herself.
Tara thinks that herself sounded like Wayne.
Stacy imagines herself praises this actress.
Carlos said that Lori helped himself.
Mark imagines Erin might admire himself.
Every guy could say Nancy hides himself.
It’s herself who criticized Karen.
William has obliged there to be no guests getting fired.
There was unable to be a fish escaping.
Regina forced it to be obvious that Maria thought about Anna.
Julia wasn’t unlikely to talk to.
Rachel was exciting to talk to Alicia.
Craig explored that grocery stores.
Carl cures that horses.
Phillip was lifting this mice.
Those ladies walk through that oases.
Tracy praises those lucky guys.
Some actors buy this gray books.
This person shouldn’t criticize this upset children.
That adult has brought those purple octopus.
Brad passed one big museum and Eva passed several.
Curtis’s boss discussed four sons and Andrew discussed five sick sons.
Brad passed one museum and Eva passed several big.
Curtis’s boss discussed four happy sons and Andrew discussed five sick.
Joel discovered the vase that Patricia might take.
Cheryl thought about some dog that upset Sandra.
Bruce knows that person that Dawn likes that argued about a lot of guys.
Danielle finds out that many organizations have alarmed Chad.
Christina forgot that all plays that win worry Dana.
Nina has learned who most men sound like.
Martin did find out what every cashier that shouldn’t drink wore.
Joel discovered what Patricia might take the vase.
Cheryl thought about who some dog upset Sandra.
Bruce knows who that person that Dawn likes argued about a lot of guys.
Danielle finds out who many organizations have alarmed Chad.
Christina forgot who all plays that win worry Dana.
Nina has learned that most men sound like.
Martin did find out that every cashier that shouldn’t drink wore.
The forgotten newspaper article was bad.
Edward hid the cats.
The forgot newspaper article was bad.
Edward hidden the cats.
Who has Colleen aggravated before kissing Judy?
Who hadn’t some driver who would fire Jennifer’s colleague embarrassed?
What lights could Spain sell and Andrea discover?
Who will Elizabeth and Gregory cure?
David would cure what snake?
Whose hat should Tonya wear?
Who have many women’s touring Spain embarrassed.
What could Alan discover he has run around?
Who has Colleen aggravated Judy before kissing?
Who hadn’t Jennifer’s colleague embarrassed some driver who would fire?
What could Spain sell lights and Andrea discover?
Who will Elizabeth cure and Gregory?
What would David cure snake?
Whose should Tonya wear hat?
Who have many women’s touring embarrassed Spain.
What could Alan discover who has run around?
Should Monica ever grin?
Even these trucks have often slowed.
Many skateboards also roll.
Only Bill would ever complain.
Only those doctors who Karla respects ever conceal many snakes.
Those banks had not ever lied.
Those turtles that are boring April could not ever break those couches.
Monica should ever grin.
Even these trucks have ever slowed.
Many skateboards ever roll.
Even Bill would ever complain.
Those doctors who only Karla respects ever conceal many snakes.
Those banks had really ever lied.
Those turtles that are not boring April could ever break those couches.
There aren’t many lights darkening.
Each book is there disturbing Margaret.
No man has revealed more than five forks.
An actor arrived at at most six lakes.
A sketch of lights doesn’t appear.
Boys that aren’t disturbing Natalie suffer.
This goose isn’t bothering Edward.
The woman cleans every public park.
Jeffrey hasn’t criticized Donald.
The dress crumples.
There aren’t all lights darkening.
There is each book disturbing Margaret.
No man has revealed at least five forks.
No actor arrived at at most six lakes.
A sketch of lights don’t appear.
Boys that aren’t disturbing Natalie suffers.
This goose weren’t bothering Edward.
The women cleans every public park.
Jeffrey haven’t criticized Donald.
The dresses crumples.
1pfx 2pfx
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
Tableau 4: Examples of all 67 paradigms in BLiMP along with model performance and estimated human agreement.
simple LM method. The bolded word is the
critical word—the probability of the two
different critical words for the acceptable
and unacceptable sentences can be compared
based on the same ‘prefix’.
Yossi Adi, Einat Kermany, Yonatan Belinkov,
Ofer Lavi, and Yoav Goldberg. 2017. Fine-
grained analysis of sentence embeddings using
auxiliary prediction tasks. In Proceedings of
ICLR Conference Track. Toulon, France.
• If a sentence has a checkmark (X) under the
2pfx column, the sentence can be used with
the 2-prefix method in addition to the simple
LM method. The bolded word is the critical
word—the probability of that particular word
can be compared based on the two different
acceptable and unacceptable ‘prefixes’.
Aixiu An, Peng Qian, Ethan Wilcox, and Roger
Levy. 2019. Representation of constituents in
neural language models: Coordination phrase as
a case study. arXiv preprint arXiv:1909.04625.
Kathryn Bock and Carol A. Miller. 1991. Broken
agreement. Psychologie Cognitive, 23(1):45–93.
Les références
Marantz, Alec. 2013. Verbal argument structure:
Events and participants. Lingua, 30:152–168.
Elsevier.
David Adger. 2003. Core Syntax: A Minimalist
Approach. Oxford University Press Oxford.
Rui P. Chaves. 2020. What don’t RNN language
models learn about filler-gap dependencies? Dans
Proceedings of the Third Meeting of the Society
for Computation in Linguistics (SCiL).
Stanley F. Chen and Joshua Goodman. 1999.
An empirical study of smoothing techniques
for language modeling. Computer Speech &
Language, 13(4):359–394.
389
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
2
1
1
9
2
3
6
9
7
/
/
t
je
un
c
_
un
_
0
0
3
2
1
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Gennaro Chierchia. 2013. Logic in Grammar.
Presse universitaire d'Oxford.
Noam Chomsky. 1965. Aspects of the Theory of
Syntax. AVEC Presse.
Noam Chomsky. 1981. Lectures on Government
and Binding.
Shammur Absar Chowdhury
2018. RNN simulations
and Roberto
de
Zamparelli.
grammaticality judgments on long-distance
the 27th
dependencies.
International Conference on Computational
Linguistics, pages 133–144.
In Proceedings of
Shammur Absar Chowdhury
and Roberto
Zamparelli. 2019. An LSTM adaptation study
de (et) grammaticality. In Proceedings of the
2019 ACL Workshop BlackboxNLP: Analyzing
and Interpreting Neural Networks for NLP,
pages 204–212.
Alexis Conneau, German Kruszewski, Guillaume
Lample, Lo¨ıc Barrault, and Marco Baroni.
2018. What you can cram into a single
&!#* vector: Probing sentence embeddings for
linguistic properties. In ACL 2018-56th Annual
Meeting of the Association for Computational
Linguistics, volume 1, pages 2126–2136.
Jillian K. Da Costa and Rui P. Chaves.
2020. Assessing the ability of transformer-
based neural models to represent structurally
unbounded dependencies. In Proceedings of the
Third Meeting of the Society for Computation
in Linguistics (SCiL).
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime
Carbonell, Quoc Le, and Ruslan Salakhutdinov.
2019. Transformer-XL: Attentive language
Dans
models beyond a fixed-length context.
Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics,
pages 2978–2988. Florence, Italy. Association
for Computational Linguistics.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, et
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of
le 2019
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, Volume 1
(Long and Short Papers), pages 4171–4186.
390
Allyson Ettinger, Ahmed Elgohary, Colin Phillips,
and Philip Resnik. 2018. Assessing composition
in sentence vector representations. In Proceed-
ings of the 27th International Conference on
Computational Linguistics, pages 1790–1801.
Association for Computational Linguistics.
Richard
Futrell,
Takashi
Ethan Wilcox,
Morita,
and Roger Levy. 2018. RNNs
as psycholinguistic subjects: Syntactic state
and grammatical dependency. arXiv preprint
arXiv:1809.01329.
Bart Geurts and Rick Nouwen. 2007. ‘At least’
et coll.: The semantics of scalar modifiers.
Language, pages 533–559.
David Graff, Junbo Kong, Ke Chen, and Kazuaki
Maeda. 2003. English gigaword. Linguistic
Data Consortium, Philadelphia, 4(1):34.
Kristina Gulordava, Piotr Bojanowski, Edouard
Grave, Tal Linzen, and Marco Baroni. 2019.
Colorless green recurrent networks dream
hierarchically. Proceedings of the Society for
Computation in Linguistics, 2(1):363–364.
Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H.
Clark, and Philipp Koehn. 2013. Scalable mod-
ified Kneser-Ney language model estimation.
In Proceedings of the 51st Annual Meeting of
the Association for Computational Linguistics
(Volume 2: Short Papers), pages 690–696.
Michael Heilman, Aoife Cahill, Nitin Madnani,
Melissa Lopez, Matthew Mulholland, et
Joel Tetreault. 2014. Predicting grammaticality
on an ordinal scale. In Proceedings of
le
52nd Annual Meeting of the Association for
Computational Linguistics (Volume 2: Short
Papers), volume 2, pages 174–180.
Sepp Hochreiter and J¨urgen Schmidhuber. 1997.
Long short-term memory. Neural Computation,
9(8):1735–1780.
Jeremy Howard and Sebastian Ruder. 2018.
Universal language model fine-tuning for text
the 56th
classification.
Annual Meeting of
the Association for
Computational Linguistics (Volume 1: Long
Papers), pages 328–339.
In Proceedings of
Jaap Jumelet and Dieuwke Hupkes. 2018. Do
language models understand anything? Sur
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
2
1
1
9
2
3
6
9
7
/
/
t
je
un
c
_
un
_
0
0
3
2
1
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
the ability of LSTMs to understand negative
le 2018
polarity items. In Proceedings of
EMNLP Workshop BlackboxNLP: Analyzing
and Interpreting Neural Networks for NLP,
pages 222–231.
Katharina Kann, Alex Warstadt, Adina Williams,
and Samuel R. Bowman. 2019. Verb argument
structure alternations in word and sentence
embeddings. Proceedings of the Society for
Computation in Linguistics, 2(1):287–297.
Dave Kush, Terje Lohndal, and Jon Sprouse. 2018.
Investigating variation in island effects. Natural
Language & Linguistic Theory, 36(3):743–779.
Jey Han Lau, Alexander Clark, and Shalom Lappin.
2017. Grammaticality, acceptability, and proba-
bility: A probabilistic view of linguistic knowl-
bord. Sciences cognitives, 41(5):1202–1241.
Tal Linzen, Emmanuel Dupoux, and Yoav
Goldberg. 2016. Assessing the ability of LSTMs
to learn syntax-sensitive dependencies. Trans-
actions of the Association for Computational
Linguistics, 4:521–535.
Rebecca Marvin and Tal Linzen. 2018. Targeted
syntactic evaluation of language models. Dans
le 2018 Conference on
Proceedings of
Empirical Methods
in Natural Language
Processing, pages 1192–1202.
Stephen Merity, Caiming Xiong, James Bradbury,
and Richard Socher. 2016. Pointer sentinel
mixture models. CoRR, abs/1609.07843.
Tom´aˇs Mikolov, Martin Karafi´at, Luk´aˇs Burget,
Jan ˇCernock`y, and Sanjeev Khudanpur. 2010.
Recurrent neural network based language
In Eleventh Annual Conference of
model.
le
Speech Communication
Association.
International
Matthew E. Peters, Mark Neumann, Mohit Iyyer,
Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. 2018. Deep contex-
tualized word representations. arXiv preprint
arXiv:1802.05365.
Alec Radford, Karthik Narasimhan, Tim
Salimans, and Ilya Sutskever. 2018. Improving
language understanding with unsupervised
learning, Technical report, OpenAI.
Alec Radford, Jeffrey Wu, Rewon Child, David
Luan, Dario Amodei, and Ilya Sutskever. 2019.
Language models are unsupervised multitask
learners. OpenAI Blog, 1(8).
Colin Raffel, Noam Shazeer, Adam Roberts,
Katherine Lee, Sharan Narang, Michael
Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.
2019. Exploring the limits of transfer learning
with a unified text-to-text transformer. arXiv
e-prints.
Florencia Reali and Morten H. Christiansen.
2005. Uncovering the richness of the stimulus:
Structure dependence and indirect statistical
evidence. Sciences cognitives, 29(6):1007–1028.
Ivan A. Sag, Thomas Wasow, and Emily M.
Cintreuse. 2003. Syntactic Theory: A Formal
Introduction, 2nd edition. CSLI Publications.
Carson T. Sch¨utze. 1996. The Empirical Base
of Linguistics: Grammaticality Judgments and
Linguistic Methodology. University of Chicago
Presse.
Xing Shi, Inkit Padhi, and Kevin Knight. 2016.
Does string-based neural MT learn source
syntax? In Proceedings of the 2016 Conference
on Empirical Methods in Natural Language
Processing, pages 1526–1534.
Dominique Sportiche, Hilda Koopman, et
Edward Stabler. 2013. An Introduction to
Syntactic Analysis and Theory, John Wiley &
Fils.
Ian Tenney, Patrick Xia, Berlin Chen, Alex
Wang, Adam Poliak, R.. Thomas McCoy,
Najoung Kim, Benjamin Van Durme, Samuel R
Bowman, Dipanjan Das, et autres. 2019. What
do you learn from context? Probing for
sentence structure in contextualized word
representations. In Proceedings of ICLR.
Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N.
Gomez, Łukasz Kaiser, and Illia Polosukhin.
2017. Attention is all you need, je. Guyon,
U. V. Luxburg, S. Bengio, H. Wallach,
R.. Fergus, S. Vishwanathan, et R. Garnett,
Information
editors, Advances
Processing Systems 30, pages 5998–6008.
Curran Associates, Inc.
in Neural
391
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
2
1
1
9
2
3
6
9
7
/
/
t
je
un
c
_
un
_
0
0
3
2
1
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Alex Wang, Yada Pruksachatkun, Nikita Nangia,
Amanpreet Singh, Julian Michael, Felix Hill,
Omer Levy, and Samuel R. Bowman. 2019un.
SuperGLUE: A stickier benchmark for general-
purpose language understanding systems. Dans
Information
33rd Conference on Neural
Processing Systems.
Alex Wang, Amanpreet Singh, Julian Michael,
Felix Hill, Omer Levy, and Samuel R. Bowman.
benchmark
2018. GLUE: A multi-task
and analysis platform for natural
langue
le 2018
understanding. In Proceedings of
EMNLP Workshop BlackboxNLP: Analyzing
and Interpreting Neural Networks for NLP,
pages 353–355.
Alex Wang, Ian F. Tenney, Yada Pruksachatkun,
Katherin Yu, Jan Hula, Patrick Xia, Raghu
Pappagari, Shuning Jin, R.. Thomas McCoy,
Roma Patel, Yinghui Huang, Jason Phang,
Edouard Grave, Haokun Liu, Najoung Kim,
Phu Mon Htut, Thibault F’evry, Berlin Chen,
Nikita Nangia, Anhad Mohananey, Katharina
Kann, Shikha Bordia, Nicolas Patry, David
Benton, Ellie Pavlick, and Samuel R. Bowman.
2019b. jiant 1.2: A software toolkit for
research on general-purpose text understanding
models. http://jiant.info/.
Gregory Ward and Betty Birner. 1995. Definite-
ness and the English existential. Language,
pages 722–742.
Alex Warstadt and Samuel R. Bowman. 2019. Lin-
guistic analysis of pretrained sentence encoders
with acceptability judgments. arXiv preprint
arXiv:1901.03438.
Alex Warstadt, Yu Cao, Ioana Grosu, Wei Peng,
Hagen Blix, Yining Nie, Anna Alsop, Shikha
Bordia, Haokun Liu, Alicia Parrish, Sheng-
Fu Wang, Jason Phang, Anhad Mohananey,
Phu Mon Htut, Paloma Jeretiˇc, and Samuel
R.. Bowman. 2019un.
Investigating BERT’s
knowledge of language: Five analysis methods
with NPIs. In Proceedings of EMNLP-IJCNLP,
pages 2870–2880.
Alex Warstadt, Amanpreet Singh, and Samuel R.
Bowman. 2019b. Neural network acceptability
judgments. Transactions of the Association for
Computational Linguistics, 7:625–641.
Ethan Wilcox, Roger Levy, Takashi Morita, et
Richard Futrell. 2018. What do RNN language
models learn about filler–gap dependencies? Dans
Actes du 2018 EMNLP Workshop
BlackboxNLP: Analyzing and Interpreting
Neural Networks for NLP, pages 211–221.
Ethan Wilcox, Peng Qian, Richard Futrell, Miguel
Ballesteros, and Roger Levy. 2019. Structural
supervision improves learning of non-local
grammatical dependencies. In Proceedings of
le 2019 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
pages 3302–3312.
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
2
1
1
9
2
3
6
9
7
/
/
t
je
un
c
_
un
_
0
0
3
2
1
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
392