Beat the AI: Investigating Adversarial Human Annotation
for Reading Comprehension
Max Bartolo Alastair Roberts
Johannes Welbl Sebastian Riedel Pontus Stenetorp
Department of Computer Science
University College London
{m.bartolo,a.roberts,j.welbl,s.riedel,p.stenetorp}@cs.ucl.ac.uk
Abstract
Innovations in annotation methodology have
been a catalyst for Reading Comprehension
(RC) datasets and models. One recent trend
to challenge current RC models is to involve
a model in the annotation process: Humans
create questions adversarially, such that the
model fails to answer them correctly. In
this work we investigate this annotation
methodology and apply it in three different
settings, collecting a total of 36,000 samples
with progressively stronger models in the
annotation loop. This allows us to explore
questions such as the reproducibility of the
adversarial effect, transfer from data collected
with varying model-in-the-loop strengths, and
generalization to data collected without a
model. We find that training on adversarially
collected samples leads to strong generalization
to non-adversarially collected datasets, yet with
progressive performance deterioration with
increasingly stronger models-in-the-loop.
Furthermore, we find that stronger models can
still learn from datasets collected with
substantially weaker models-in-the-loop. When
trained on data collected with a BiDAF model
in the loop, RoBERTa achieves 39.9F1 on
questions that it cannot answer when trained
on SQuAD—only marginally lower than when
trained on data collected using RoBERTa
itself (41.0F1).
1 Introduction
Data collection is a fundamental prerequisite for
Machine Learning-based approaches to Natural
Language Processing (NLP). Innovations in data
acquisition methodology, such as crowdsourcing,
have led to major breakthroughs in scalability
and preceded the ‘‘deep learning revolution’’, for
which they can arguably be seen as co-responsible
(Deng et al., 2009; Bowman et al., 2015; Rajpurkar
et al., 2016). Annotation approaches include ex-
pert annotation, for example, relying on trained
linguists (Marcus et al., 1993), crowd-sourcing by
non-experts (Snow et al., 2008), distant supervi-
sion (Mintz et al., 2009; Joshi et al., 2017), and
leveraging document structure (Hermann et al.,
2015). The concrete data collection paradigm cho-
sen dictates the degree of scalability, annotation
cost, precise task structure (often arising as a
compromise of the above) and difficulty, domain
coverage, as well as resulting dataset biases and
model blind spots (Jia and Liang, 2017; Schwartz
et al., 2017; Gururangan et al., 2018).
A recently emerging trend in NLP dataset
creation is the use of a model-in-the-loop when
composing samples: A contemporary model is
used either as a filter or directly during annotation,
to identify samples wrongly predicted by the
model. Examples of this method are realized
in Build It Break It, The Language Edition
(Ettinger et al., 2017), HotpotQA (Yang et al.,
2018a), SWAG (Zellers et al., 2018), Mechanical
Turker Descent (Yang et al., 2018b), DROP
(Dua et al., 2019), CODAH (Chen et al., 2019),
Quoref (Dasigi et al., 2019), and AdversarialNLI
(Nie et al., 2019).1 This approach probes model
robustness and ensures that the resulting datasets
pose a challenge to current models, which drives
research to tackle new sets of problems.
We study this approach in the context of
Reading Comprehension (RC), and investigate its
robustness in the face of continuously progressing
models—do adversarially constructed datasets
quickly become outdated in their usefulness as
models grow stronger?
1The idea was alluded to at least as early as Richardson
et al. (2013), but it has only recently seen wider adoption.
Transactions of the Association for Computational Linguistics, vol. 8, pp. 662–678, 2020. https://doi.org/10.1162/tacl_a_00338
Action Editor: Christopher Potts. Submission batch: 3/2020; Revision batch: 6/2020; Published 10/2020.
© 2020 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
3
8
1
9
2
3
6
5
8
/
/
t
je
un
c
_
un
_
0
0
3
3
8
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Based on models trained on the widely used
SQuAD dataset, and following the same annotation
protocol, we investigate the annotation setup
where an annotator has to compose questions for
which the model predicts the wrong answer. As a
result, only samples that the model fails to predict
correctly are retained in the dataset—see Figure 1
for an example.

Figure 1: Human annotation with a model in the loop,
showing: i) the ‘‘Beat the AI’’ annotation setting where
only questions that the model does not answer correctly
are accepted, and ii) questions generated this way, with
a progressively stronger model in the annotation loop.

We apply this annotation strategy with three
distinct models in the loop, resulting in datasets
with 12,000 samples each. We then study the
reproducibility of the adversarial effect when
retraining the models with the same data, as well
as the generalization ability of models trained
using datasets produced with and without a model
adversary. Models can, to a considerable degree,
learn to generalize to more challenging questions,
based on training sets collected with both stronger
and also weaker models in the loop. Compared
to training on SQuAD, training on adversarially
composed questions leads to a similar degree
of generalization to non-adversarially written
questions, both for SQuAD and NaturalQuestions
(Kwiatkowski et al., 2019). It furthermore leads
to general improvements across the model-in-the-
loop datasets we collect, as well as improvements
of more than 20.0F1 for both BERT and RoBERTa
on an extractive subset of DROP (Dua et al.,
2019), another adversarially composed dataset.
When conducting a systematic analysis of the
concrete questions different models fail to answer
correctly, as well as non-adversarially composed
questions, we see that the nature of the resulting
questions changes: Questions composed with a
model in the loop are overall more diverse, use
more paraphrasing, multi-hop inference, comparisons,
and background knowledge, and are generally less
easily answered by matching an explicit statement
that states the required information literally. Given
our observations, we believe a model-in-the-loop
approach to annotation shows promise and should
be considered when creating future RC datasets.
To summarize, our contributions are as follows:
First, an investigation into the model-in-the-loop
approach to RC data collection based on three
progressively stronger models, together with an
empirical performance comparison when trained
on datasets constructed with adversaries of
different strength. Second, a comparative inves-
tigation into the nature of questions composed
to be unsolvable by a sequence of progressively
stronger models. Third, a study of the reproduc-
ibility of the adversarial effect and the gener-
alization ability of models trained in various
settings.
2 Related Work
Constructing Challenging Datasets Recent
efforts
in dataset construction have driven
considerable progress in RC, yet datasets are
structurally diverse and annotation methodologies
vary. With its large size and combination of free-
form questions with answers as extracted spans,
SQuAD1.1 (Rajpurkar et al., 2016) has become
an established benchmark that has inspired the
construction of a series of similarly structured
datasets. However, mounting evidence suggests
that models can achieve strong generalization
performance merely by relying on superficial
cues—such as lexical overlap, term frequencies,
or entity type matching (Chen et al., 2016;
Weissenborn et al., 2017; Sugawara et al., 2018).
It has thus become an increasingly important
consideration to construct datasets that RC models
find challenging, and for which natural language
understanding is a requisite for generalization.
Attempts to achieve this non-trivial aim have
typically revolved around extensions to the
SQuAD dataset annotation methodology. They
include unanswerable questions (Trischler et al.,
2017; Rajpurkar et al., 2018; Reddy et al., 2019;
Choi et al., 2018), adding the option of ‘‘Yes’’
or ‘‘No’’ answers (Dua et al., 2019; Kwiatkowski
et al., 2019), questions requiring reasoning over
multiple sentences or documents (Welbl et al.,
2018; Yang et al., 2018un), questions requiring
rule interpretation or context awareness (Saeidi
et al., 2018; Choi et al., 2018; Reddy et al.,
2019), limiting annotator passage exposure by
sourcing questions first (Kwiatkowski et al., 2019),
controlling answer types by including options for
dates, numbers, or spans from the question (Dua
et coll., 2019), as well as questions with free-form
answers (Nguyen et al., 2016; Kočiský et al.,
2018; Reddy et al., 2019).
Adversarial Annotation One recently adopted
approach to constructing challenging datasets
involves the use of an adversarial model to
select examples that it does not perform well
on, an approach which superficially is akin to
active learning (Lewis and Gale, 1994). Here, we
make a distinction between two sub-categories
of adversarial annotation: je) adversarial filtering,
where the adversarial model is applied offline
in a separate stage of the process, usually after
data generation; examples include SWAG (Zellers
et al., 2018), ReCoRD (Zhang et al., 2018),
HotpotQA (Yang et al., 2018a), and HellaSWAG
(Zellers et al., 2019); ii) model-in-the-loop
adversarial annotation, where the annotator can
directly interact with the adversary during the an-
notation process and uses the feedback to further
inform the generation process; examples include
CODAH (Chen et al., 2019), Quoref (Dasigi
et al., 2019), DROP (Dua et al., 2019), FEVER2.0
(Thorne et al., 2019), AdversarialNLI (Nie et al.,
2019), as well as work by Dinan et al. (2019),
Kaushik et al. (2020), and Wallace et al. (2019)
for the Quizbowl task.
We are primarily interested in the latter cate-
gory, as this feedback loop creates an environ-
ment where the annotator can probe the model
directly to explore its weaknesses and formulate
targeted adversarial attacks. Although Dua et al.
(2019) and Dasigi et al. (2019) make use of
adversarial annotations for RC, both annotation
setups limit the reach of the model-in-the-loop: In
DROP, primarily due to the imposition of specific
answer types, and in Quoref by focusing on co-
reference, which is already a known RC model
weakness.
In contrast, we investigate a scenario where
annotators interact with a model in its original task
setting—annotators must thus explore a range of
natural adversarial attacks, as opposed to filtering
out ‘‘easy’’ samples during the annotation process.
3 Annotation Methodology
3.1 Annotation Protocol
The data annotation protocol is based on SQuAD1.1,
with a model in the loop, and the additional
instruction that questions should only have one
answer in the passage, which directly mirrors the
setting in which these models were trained.
Formally, provided with a passage p, a human
annotator generates a question q and selects a
(human) answer ah by highlighting the corre-
sponding span in the passage. The input (p, q)
is then given to the model, which returns a
predicted (model) answer am. To compare the
two, a word-overlap F1 score between ah and am
is computed; a score above a threshold of 40% is
considered a ‘‘win’’ for the model.2 This process
is repeated until the human ‘‘wins’’; Chiffre 2
gives a schematic overview of the process. All
successful (p, q, ah) triples, that is, those which
the model is unable to answer correctly, are then
retained for further validation.
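To make the word-overlap criterion concrete, here is a minimal sketch of the ‘‘win’’ decision described above. It assumes simple lowercased whitespace tokenization and an inclusive threshold; the function names are ours and are not taken from the authors' codebase.

```python
from collections import Counter

def f1_score(human_answer: str, model_answer: str) -> float:
    """Token-level word-overlap F1 between the human and the model answer."""
    human_tokens = human_answer.lower().split()
    model_tokens = model_answer.lower().split()
    overlap = Counter(human_tokens) & Counter(model_tokens)
    num_same = sum(overlap.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(model_tokens)
    recall = num_same / len(human_tokens)
    return 2 * precision * recall / (precision + recall)

def model_wins(human_answer: str, model_answer: str, threshold: float = 0.4) -> bool:
    """A model 'win' rejects the question; the annotator has to try again."""
    return f1_score(human_answer, model_answer) >= threshold

# Footnote 2: a close but inexact model span still counts as a model win.
assert model_wins("New York", "New York City")
```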
3.2 Annotation Details
Models in the Annotation Loop We begin
by training three different models, which are
used as adversaries during data annotation. As
a seed dataset for training the models we select
the widely used SQuAD1.1 (Rajpurkar et al.,
2016) dataset, a large-scale resource for which a
variety of mature and well-performing models are
readily available. Furthermore, unlike cloze-based
datasets, SQuAD is robust to passage/question-
only adversarial attacks (Kaushik and Lipton,
2018). We will compare dataset annotation with
a series of three progressively stronger models
as adversary in the loop, namely, BiDAF (Seo
et al., 2017), BERTLARGE (Devlin et al., 2019),
and RoBERTaLARGE (Liu et al., 2019b).
2This threshold is set after initial experiments to not
be overly restrictive given acceptable answer spans, e.g., a
human answer of ‘‘New York’’ vs. model answer ‘‘New
York City’’ would still lead to a model ‘‘win’’.
Each of these will serve as a model adversary in a
separate annotation experiment and result in three
distinct datasets; we will refer to these as DBiDAF,
DBERT, and DRoBERTa respectively. Examples
from the validation set of each are shown in
Table 1. We rely on the AllenNLP (Gardner
et al., 2018) and Transformers (Wolf et al., 2019)
model implementations, and our models achieve
EM/F1 scores of 65.5%/77.5%, 82.7%/90.3%, and
86.9%/93.6% for BiDAF, BERT, and RoBERTa,
respectively, on the SQuAD1.1 validation set,
consistent with results reported in other work.
Our choice of models reflects both the transition
from LSTM-based to pre-trained transformer-based
models, as well as a graduation among the latter;
we investigate how this is reflected in datasets
collected with each of these different models in
the annotation loop. For each of the models we
collect 10,000 training, 1,000 validation, and 1,000
test examples. Dataset sizes are motivated by the
data efficiency of transformer-based pretrained
models (Devlin et al., 2019; Liu et al., 2019b),
which has improved the viability of smaller-scale
data collection efforts for investigative and
analysis purposes.
To ensure the experimental integrity provided
by reporting all results on a held-out test set,
we split the existing SQuAD1.1 validation set in
half (stratified by document title) as the official
test set is not publicly available. We maintain
passage consistency across the training, validation,
and test sets of all datasets to enable like-for-like
comparisons. Lastly, we use the majority vote
answer as ground truth for SQuAD1.1 to ensure
that all our datasets have one valid answer per
question, enabling us to fairly draw direct
comparisons. For clarity, we will hereafter refer
to this modified version of SQuAD1.1 as DSQuAD.
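As an illustration of this majority-vote step, the following is a small sketch of our own; SQuAD1.1 provides several reference answers per validation question, and ties are broken here by first occurrence, which is an assumption rather than a detail specified by the authors.

```python
from collections import Counter

def majority_vote_answer(reference_answers: list[str]) -> str:
    """Pick the most frequent reference answer as the single ground truth."""
    counts = Counter(reference_answers)
    best_count = max(counts.values())
    # First-occurrence tie-break (an assumption on our part).
    for answer in reference_answers:
        if counts[answer] == best_count:
            return answer

print(majority_vote_answer(["1858", "1858", "in 1858"]))  # -> 1858
```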
Crowdsourcing We use custom-designed Hu-
man Intelligence Tasks (HITs) served through
Amazon Mechanical Turk (AMT) for all anno-
tation efforts. Workers are required to be based in
Canada, the UK, or the US, have a HIT Approval
Rate greater than 98%, and have previously
completed at least 1,000 HITs successfully. We
experiment with and without the AMT Master
requirement and find no substantial difference
in quality, but observe a throughput reduction
of nearly 90%. We pay USD 2.00 for every
question generation HIT, during which workers
are required to compose up to five questions that
‘‘beat’’ the model in the loop (cf. Figure 3). The
mean HIT completion times for BiDAF, BERT,
and RoBERTa are 551.8s, 722.4s, and 686.4s.
Furthermore, we find that human workers are able
to generate questions that successfully ‘‘beat’’ the
model in the loop 59.4% of the time for BiDAF,
47.1% for BERT, and 44.0% for RoBERTa. These
metrics broadly reflect the relative strength of the
models.
3.3 Quality Control
Training and Qualification We provide a two-
part worker training interface in order to i) famil-
iarize workers with the process, and ii) conduct
a first screening based on worker outputs. The
interface familiarizes workers with formulating
questions, and answering them through span
selection. Workers are asked to generate questions
for two given answers, to highlight answers for
two given questions, to generate one full question-
answer pair, and finally to complete a question
generation HIT with BiDAF as the model in
the loop. Each worker’s output is then reviewed
manually (by the authors); those who pass the
screening are added to the pool of qualified
annotators.
Manual Worker Validation In the second
annotation stage, qualified workers produce data
for the ‘‘Beat the AI’’ question generation task.
A sample of every worker’s HITs is manually
reviewed based on their total number of completed
tasks n, determined by ⌊5·log10(n)+1⌋, chosen for
convenience. This is done after every annotation
batch; if workers fall below an 80% success thresh-
old at any point, their qualification is revoked and
their work is discarded in its entirety.
Figure 2: Overview of the annotation process to collect
adversarially written questions from humans using a
model in the loop.
(BiDAF) Passage: [. . . ] the United Methodist Church has placed great emphasis on the importance of education.
As such, the United Methodist Church established and is affiliated with around one hundred colleges
[. . . ] of Methodist-related Schools, Colleges, and Universities. The church operates three hundred sixty
schools and institutions overseas.
Question: The United Methodist Church has how many schools internationally?

(BiDAF) Passage: In a purely capitalist mode of production (i.e. where professional and labor organizations cannot
limit the number of workers) the workers wages will not be controlled by these organizations, or by the
employer, but rather by the market. Wages work in the same way as prices for any other good. Thus, wages
can be considered as a [. . . ]
Question: What determines worker wages?

(BiDAF) Passage: [. . . ] released to the atmosphere, and a separate source of water feeding the boiler is supplied.
Normally water is the fluid of choice due to its favourable properties, such as non-toxic and unreactive
chemistry, abundance, low cost, and its thermodynamic properties. Mercury is the working fluid in the
mercury vapor turbine [. . . ]
Question: What is the most popular type of fluid?

(BERT) Passage: [. . . ] Jochi was secretly poisoned by an order from Genghis Khan. Rashid al-Din reports that
the great Khan sent for his sons in the spring of 1223, and while his brothers heeded the order, Jochi
remained in Khorasan. Juzjani suggests that the disagreement arose from a quarrel between Jochi and his
brothers in the siege of Urgench [. . . ]
Question: Who went to Khan after his order in 1223?

(BERT) Passage: In the Sandgate area, to the east of the city and beside the river, resided the close-knit community
of keelmen and their families. They were so called because [. . . ] transfer coal from the river banks to the
waiting colliers, for export to London and elsewhere. In the 1630s about 7,000 out of 20,000 inhabitants
of Newcastle died of plague [. . . ]
Question: Where did almost half the people die?

(BERT) Passage: [. . . ] was important to reduce the weight of coal carried. Steam engines remained the dominant
source of power until the early 20th century, when advances in the design of electric motors and internal
combustion engines gradually resulted in the replacement of reciprocating (piston) steam engines, with
shipping in the 20th-century [. . . ]
Question: Why did steam engines become obsolete?

(RoBERTa) Passage: [. . . ] and seven other hymns were published in the Achtliederbuch, the first Lutheran hymnal.
In 1524 Luther developed his original four-stanza psalm paraphrase into a five-stanza Reformation hymn
that developed the theme of “grace alone” more fully. Because it expressed essential Reformation doctrine,
this expanded version of “Aus [. . . ]
Question: Luther’s reformed hymn did not feature stanzas of what quantity?

(RoBERTa) Passage: [. . . ] tight end Greg Olsen, who caught a career-high 77 passes for 1,104 yards and seven
touchdowns, and wide receiver Ted Ginn, Jr., who caught 44 passes for 739 yards and 10 touchdowns;
[. . . ] receivers included veteran Jerricho Cotchery (39 receptions for 485 yards), rookie Devin Funchess
(31 receptions for 473 yards and [. . . ]
Question: Who caught the second most passes?

(RoBERTa) Passage: Other prominent alumni include anthropologists David Graeber and Donald Johanson, who is
best known for discovering the fossil of a female hominid australopithecine known as “Lucy” in the Afar
Triangle region, psychologist John B. Watson, American psychologist who established the psychological
school of behaviorism, communication theorist Harold Innis, chess grandmaster Samuel Reshevsky, and
conservative international relations scholar and White House Coordinator of Security Planning for the
National Security Council Samuel P. Huntington.
Question: Who thinks three moves ahead?

Table 1: Validation set examples of questions collected using different RC models (BiDAF, BERT, and
RoBERTa) in the annotation loop. The answer to the question is highlighted in the passage.
Figure 3: ‘‘Beat the AI’’ question generation interface. Human annotators are tasked with asking questions about
a provided passage that the model in the loop fails to answer correctly.
Question Answerability As the models used in
the annotation task become stronger, the resulting
questions tend to become more complex. However,
this also means that it becomes more challenging
to disentangle measures of dataset quality from
inherent question difficulty. As such, we use the
condition of human answerability for an annotated
question-answer pair as follows: It is answerable
if at least one of three additional non-expert
human validators can provide an answer matching
the original. We conduct answerability checks on
both the validation and test sets, and achieve
answerability scores of 87.95%, 85.41%, and
82.63% for DBiDAF, DBERT, and DRoBERTa.
We discard all questions deemed unanswerable
from the validation and test sets, and further
discard all data from any workers with less than
half of their questions considered answerable. It
should be emphasized that the main purpose of
this process is to create a level playing field
for comparison across datasets constructed for
different model adversaries, and it can inevitably
result in valid questions being discarded.
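The answerability filter can be sketched as follows. We assume here that ‘‘matching the original’’ is judged with the same word-overlap F1 criterion as during annotation (reusing the f1_score helper from the sketch in Section 3.1), and the data structures are illustrative only.

```python
def is_answerable(original_answer: str, validator_answers: list[str],
                  threshold: float = 0.4) -> bool:
    """Answerable if at least one additional validator matches the original answer.

    Matching via word-overlap F1 is our assumption; f1_score is the helper
    from the earlier sketch."""
    return any(f1_score(original_answer, answer) >= threshold
               for answer in validator_answers)

def workers_to_keep(answerable_by_worker: dict[str, list[bool]]) -> set[str]:
    """Keep only workers with at least half of their questions answerable."""
    return {worker for worker, flags in answerable_by_worker.items()
            if sum(flags) / len(flags) >= 0.5}
```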
Resource    | Dev EM | Dev F1 | Test EM | Test F1
DBiDAF      | 63.0   | 76.9   | 62.6    | 78.5
DBERT       | 59.2   | 74.3   | 63.9    | 76.9
DRoBERTa    | 58.1   | 72.0   | 58.7    | 73.7
Table 2: Non-expert human performance results
for a randomly-selected validator per question.
The total cost for training and qualification, dataset
construction, and validation is approximately
USD 27,000.
Human Performance We select a randomly
chosen validator’s answer to each question and
compute Exact Match (EM) and word overlap F1
scores with the original to calculate non-expert
human performance; Table 2 shows the result. We
observe a clear trend: The stronger the model in
the loop used to construct the dataset, the harder
the resulting questions become for humans.
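A minimal sketch of this evaluation, using hypothetical data structures and a SQuAD-style Exact Match after simple normalization (the exact normalization used is an assumption on our part); F1 would be computed analogously with the earlier word-overlap sketch.

```python
import random
import string

def normalize(answer: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD-style)."""
    answer = answer.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in answer.split() if t not in {"a", "an", "the"}]
    return " ".join(tokens)

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

def non_expert_human_em(examples: list[dict]) -> float:
    """Average EM of a randomly selected validator answer against the original answer."""
    scores = [exact_match(random.choice(ex["validator_answers"]), ex["original_answer"])
              for ex in examples]
    return 100.0 * sum(scores) / len(scores)
```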
3.4 Dataset Statistics
Table 3 provides general details on the number
of passages and question-answer pairs used in the
different dataset splits. The average number of
words in questions and answers, as well as the
average longest n-gram overlap between passage
and question, are given in Table 4.
Resource   | #Passages (Train / Dev / Test) | #QAs (Train / Dev / Test)
DSQuAD     | 18,891 / 971 / 1,096           | 87,599 / 5,278 / 5,292
DBiDAF     | 2,523 / 278 / 277              | 10,000 / 1,000 / 1,000
DBERT      | 2,444 / 283 / 292              | 10,000 / 1,000 / 1,000
DRoBERTa   | 2,552 / 341 / 333              | 10,000 / 1,000 / 1,000

Table 3: Number of passages and question-
answer pairs for each data resource.
                   DSQuAD   DBiDAF   DBERT   DRoBERTa
Question length    10.3     9.8      9.8     10.0
Answer length      2.6      2.9      3.0     3.2
N-gram overlap     3.0      2.2      2.1     2.0
Table 4: Average number of words per question
and answer, and average longest n-gram
overlap between passage and question.
We can again observe two clear trends: From
weaker towards stronger models used in the
annotation loop, the average length of answers
increases, and the largest n-gram overlap drops
from 3 to 2 tokens. That is, on average there
is a trigram overlap between the passage and
question for DSQuAD, but only a bigram overlap
for DRoBERTa (Chiffre 4).3 This is in line with prior
observations on lexical overlap as a predictive
cue in SQuAD (Weissenborn et al., 2017; Min
et al., 2018); questions with less overlap are
harder to answer for any of the three models.
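The longest n-gram overlap statistic reported in Table 4 can be computed with a short sketch like the following (lowercased whitespace tokenization is our assumption):

```python
def longest_ngram_overlap(passage: str, question: str) -> int:
    """Length of the longest token n-gram shared by passage and question."""
    p_tokens = passage.lower().split()
    q_tokens = question.lower().split()
    best = 0
    for n in range(1, len(q_tokens) + 1):
        q_ngrams = {tuple(q_tokens[i:i + n]) for i in range(len(q_tokens) - n + 1)}
        p_ngrams = {tuple(p_tokens[i:i + n]) for i in range(len(p_tokens) - n + 1)}
        if q_ngrams & p_ngrams:
            best = n
        else:
            # Any (n+1)-gram contains an n-gram, so no longer overlap can exist.
            break
    return best

print(longest_ngram_overlap(
    "Mercury is the working fluid in the mercury vapor turbine",
    "What is the most popular type of fluid"))  # -> 2 ("is the")
```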
We furthermore analyze question types based
on the question wh-word. We find that—in con-
trast to DSQuAD—the datasets collected with a
model in the annotation loop have fewer when,
how, and in questions, and more which, where,
and why questions, as well as questions in
the other category, which indicates increased
question diversity. In terms of answer types,
we observe more common noun and verb phrase
clauses than in DSQuAD, as well as fewer dates,
names, and numeric answers. This reflects on
the strong answer-type matching capabilities
of contemporary RC models. The training and
validation sets used in this analysis (DBiDAF,
DBERT, and DRoBERTa) will be publicly released.
3Note that the original SQuAD1.1 dataset can be con-
sidered a limit case of the adversarial annotation framework,
in which the model in the loop always predicts the wrong
answer, thus every question is accepted.
Figure 4: Distribution of longest n-gram overlap
between passage and question for different datasets.
µ: mean; σ: standard deviation.
Model     | Resource    | Split | Original EM / F1 | Re-init. EM | Re-init. F1
BiDAF     | DBiDAF      | dev   | 0.0 / 5.3        | 10.7 ±0.8   | 20.4 ±1.0
BERT      | DBERT       | dev   | 0.0 / 4.9        | 19.7 ±1.0   | 30.1 ±1.2
RoBERTa   | DRoBERTa    | dev   | 0.0 / 6.1        | 15.7 ±0.9   | 25.8 ±1.2
BiDAF     | DBiDAF      | test  | 0.0 / 5.5        | 11.6 ±1.0   | 21.3 ±1.2
BERT      | DBERT       | test  | 0.0 / 5.3        | 18.9 ±1.2   | 29.4 ±1.1
RoBERTa   | DRoBERTa    | test  | 0.0 / 5.9        | 16.1 ±0.8   | 26.7 ±0.9
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
3
8
1
9
2
3
6
5
8
/
/
t
je
un
c
_
un
_
0
0
3
3
8
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Table 5: Consistency of the adversarial effect (or
lack thereof) when retraining the models in the
loop on the same data again, but with different
random seeds. We report the mean and standard
deviation (±) over 10 re-initialization runs.
4 Experiments
4.1 Consistency of the Model in the Loop
We begin with an experiment regarding the
consistency of the adversarial nature of the models
in the annotation loop. Our annotation pipeline is
designed to reject all samples where the model
correctly predicts the answer. How reproducible
is this when retraining the model with the same
training data? To measure this, we evaluate the
performance of instances of BiDAF, BERT, and
RoBERTa, which only differ from the model used
during annotation in their random initialization
and order of mini-batch samples during training.
These results are shown in Table 5.
Model   | Trained On   | DSQuAD EM / F1        | DBiDAF EM / F1        | DBERT EM / F1         | DRoBERTa EM / F1      | DDROP EM / F1         | DNQ EM / F1
BiDAF   | DSQuAD(10K)  | 40.9±0.6 / 54.3±0.6   | 7.1±0.6 / 15.7±0.6    | 5.6±0.3 / 13.5±0.4    | 5.7±0.4 / 13.5±0.4    | 3.8±0.4 / 8.6±0.6     | 25.1±1.1 / 38.7±0.7
BiDAF   | DBiDAF       | 11.5±0.4 / 20.9±0.4   | 5.3±0.4 / 11.6±0.5    | 7.1±0.4 / 14.8±0.6    | 6.8±0.5 / 13.5±0.6    | 6.5±0.5 / 12.4±0.4    | 15.7±1.1 / 28.7±0.8
BiDAF   | DBERT        | 10.8±0.3 / 19.8±0.4   | 7.2±0.5 / 14.4±0.6    | 6.9±0.3 / 14.5±0.4    | 8.1±0.4 / 15.0±0.6    | 7.8±0.9 / 14.5±0.9    | 16.5±0.6 / 28.3±0.9
BiDAF   | DRoBERTa     | 10.7±0.2 / 20.2±0.3   | 6.3±0.7 / 13.5±0.8    | 9.4±0.6 / 17.0±0.6    | 8.9±0.9 / 16.0±0.8    | 15.3±0.8 / 22.9±0.8   | 13.4±0.9 / 27.1±1.2
BERT    | DSQuAD(10K)  | 69.4±0.5 / 82.7±0.4   | 35.1±1.9 / 49.3±2.2   | 15.6±2.0 / 27.3±2.1   | 11.9±1.5 / 23.0±1.4   | 18.9±2.3 / 28.9±3.2   | 52.9±1.0 / 68.2±1.0
BERT    | DBiDAF       | 66.5±0.7 / 80.6±0.6   | 46.2±1.2 / 61.1±1.2   | 37.8±1.4 / 48.8±1.5   | 30.6±0.8 / 42.5±0.6   | 41.1±2.3 / 50.6±2.0   | 54.2±1.2 / 69.8±0.9
BERT    | DBERT        | 61.2±1.8 / 75.7±1.6   | 42.9±1.9 / 57.5±1.8   | 37.4±2.1 / 47.9±2.0   | 29.3±2.1 / 40.0±2.3   | 39.4±2.2 / 47.6±2.2   | 49.9±2.3 / 65.7±2.3
BERT    | DRoBERTa     | 57.0±1.7 / 71.7±1.8   | 37.0±2.3 / 52.0±2.5   | 34.8±1.5 / 45.9±2.0   | 30.5±2.2 / 41.2±2.2   | 39.0±3.1 / 47.4±2.8   | 45.8±2.4 / 62.4±2.5
RoBERTa | DSQuAD(10K)  | 68.6±0.5 / 82.8±0.3   | 37.7±1.1 / 53.8±1.1   | 20.8±1.2 / 34.0±1.0   | 11.0±0.8 / 22.1±0.9   | 25.0±2.2 / 39.4±2.4   | 43.9±3.8 / 62.8±3.1
RoBERTa | DBiDAF       | 64.8±0.7 / 80.0±0.4   | 48.0±1.2 / 64.3±1.1   | 40.0±1.5 / 51.5±1.3   | 29.0±1.9 / 39.9±1.8   | 44.5±2.1 / 55.4±1.9   | 48.4±1.1 / 66.9±0.8
RoBERTa | DBERT        | 59.5±1.0 / 75.1±0.9   | 45.4±1.5 / 60.7±1.5   | 38.4±1.8 / 49.8±1.7   | 28.2±1.5 / 38.8±1.5   | 42.2±2.3 / 52.6±2.0   | 45.8±1.1 / 63.6±1.1
RoBERTa | DRoBERTa     | 56.2±0.7 / 72.1±0.7   | 41.4±0.8 / 57.1±0.8   | 38.4±1.1 / 49.5±0.9   | 30.2±1.3 / 41.0±1.2   | 41.2±0.9 / 51.2±0.8   | 43.6±1.1 / 61.6±0.9
Table 6: Training models on various datasets, each with 10,000 samples, and measuring their
generalization to different evaluation datasets. Results underlined indicate the best result per model.
We report the mean and standard deviation (±) over 10 runs with different random seeds.
D'abord, we observe—as expected given our
annotation constraints—that model performance
is 0.0EM on datasets created with the same re-
spective model in the annotation loop. We ob-
serve, however,
that retrained models do not
reliably perform as poorly on those samples.
For example, BERT reaches 19.7EM, whereas
the original model used during annotation
provides no correct answer with 0.0EM. This
demonstrates that random model components can
substantially affect the adversarial annotation
process. The evaluation furthermore serves as
a baseline for subsequent model evaluations:
This much of the performance range can be
learned merely by retraining the same model.
A possible takeaway for using the model-in-
the-loop annotation strategy in the future is to
rely on ensembles of adversaries and reduce the
dependency on one particular model instantia-
tion, as investigated by Grefenstette et al. (2018).
4.2 Adversarial Generalization
A potential problem with the focus on challenging
questions is that they might be very distinct from
one another, leading to difficulties in learning to
generalize to and from them. We conduct a series
of experiments in which we train on DBiDAF,
DBERT, and DRoBERTa, and observe how well
models can learn to generalize to the respective
test portions of these datasets. Tableau 6 shows the
résultats, and there is a multitude of observations.
D'abord, one clear trend we observe across all
training data setups is a negative performance
progression when evaluated against datasets
constructed with a stronger model in the loop. This
trend holds true for all but the BiDAF model, in
each of the training configurations, and for each of
the evaluation datasets. Par exemple, RoBERTa
trained on DRoBERTa achieves 72.1, 57.1, 49.5,
and 41.0F1 when evaluated on DSQuAD, DBiDAF,
DBERT, and DRoBERTa respectively.
Deuxième, we observe that the BiDAF model is
not able to generalize well to datasets constructed
with a model in the loop, independent of its
training setup. In particular, it is unable to learn
from DBiDAF, thus failing to overcome some of
its own blind spots through adversarial training.
Irrespective of
the training dataset, BiDAF
consistently performs poorly on the adversarially
collected evaluation datasets, and we also note
a substantial performance drop when trained on
DBiDAF, DBERT, or DRoBERTa and evaluated on
DSQuAD.
In contrast, BERT and RoBERTa are able
to partially overcome their blind spots through
training on data collected with a model in the
loop, and to a degree that far exceeds what would
be expected from random retraining (cf. Tableau 5).
Model   | Training Dataset      | DSQuAD EM / F1        | DBiDAF EM / F1        | DBERT EM / F1         | DRoBERTa EM / F1
BiDAF   | DSQuAD                | 56.7±0.5 / 70.1±0.3   | 11.6±1.0 / 21.3±1.1   | 8.6±0.6 / 17.3±0.8    | 8.3±0.7 / 16.8±0.5
BiDAF   | DSQuAD + DBiDAF       | 56.3±0.6 / 69.7±0.4   | 14.4±0.9 / 24.4±0.9   | 15.6±1.1 / 24.7±1.1   | 14.3±0.5 / 23.3±0.7
BiDAF   | DSQuAD + DBERT        | 56.2±0.6 / 69.4±0.6   | 14.4±0.7 / 24.2±0.8   | 15.7±0.6 / 25.1±0.6   | 13.9±0.8 / 22.7±0.8
BiDAF   | DSQuAD + DRoBERTa     | 56.2±0.7 / 69.6±0.6   | 14.7±0.9 / 24.8±0.8   | 17.9±0.5 / 26.7±0.6   | 16.7±1.1 / 25.0±0.8
BERT    | DSQuAD                | 74.8±0.3 / 86.9±0.2   | 46.4±0.7 / 60.5±0.8   | 24.4±1.2 / 35.9±1.1   | 17.3±0.7 / 28.9±0.9
BERT    | DSQuAD + DBiDAF       | 75.2±0.4 / 87.2±0.2   | 52.4±0.9 / 66.5±0.9   | 40.9±1.3 / 51.2±1.5   | 32.9±0.9 / 44.1±0.8
BERT    | DSQuAD + DBERT        | 75.1±0.3 / 87.1±0.3   | 54.1±1.0 / 68.0±0.8   | 43.7±1.1 / 54.1±1.3   | 34.7±0.7 / 45.7±0.8
BERT    | DSQuAD + DRoBERTa     | 75.3±0.4 / 87.1±0.3   | 53.0±1.1 / 67.1±0.8   | 44.1±1.1 / 54.4±0.9   | 36.6±0.8 / 47.8±0.5
RoBERTa | DSQuAD                | 73.2±0.4 / 86.3±0.2   | 48.9±1.1 / 64.3±1.1   | 31.3±1.1 / 43.5±1.2   | 16.1±0.8 / 26.7±0.9
RoBERTa | DSQuAD + DBiDAF       | 73.9±0.4 / 86.7±0.2   | 55.0±1.4 / 69.7±0.9   | 46.5±1.1 / 57.3±1.1   | 31.9±0.8 / 42.4±1.0
RoBERTa | DSQuAD + DBERT        | 73.8±0.2 / 86.7±0.2   | 55.4±1.0 / 70.1±0.9   | 48.9±1.0 / 59.0±1.2   | 32.9±1.3 / 43.7±1.4
RoBERTa | DSQuAD + DRoBERTa     | 73.5±0.3 / 86.5±0.2   | 55.9±0.7 / 70.6±0.7   | 49.1±1.2 / 59.5±1.2   | 34.7±1.0 / 45.9±1.2
Table 7: Training models on SQuAD, as well as SQuAD combined with different adversarially created
datasets. Results underlined indicate the best result per model. We report the mean and standard
deviation (±) over 10 runs with different random seeds.
Par exemple, BERT reaches 47.9F1 when trained
and evaluated on DBERT, while RoBERTa trained
on DRoBERTa reaches 41.0F1 on DRoBERTa, both
considerably better than random retraining or
when training on the non-adversarially collected
DSQuAD(10K), showing gains of 20.6F1 for BERT
and 18.9F1 for RoBERTa. These observations
suggest that there exists learnable structure among
harder questions that can be picked up by some
of the models, yet not all, as BiDAF fails to
achieve this. The fact that even BERT can learn to
generalize to DRoBERTa, but not BiDAF to DBERT
suggests the existence of an inherent limitation to
what BiDAF can learn from these new samples,
compared with BERT and RoBERTa.
More generally, we observe that training on
DS, where S is a stronger RC model, helps gen-
eralize to DW, where W is a weaker model—for
example, training on DRoBERTa and testing on
DBERT. On the other hand, training on DW also
leads to generalization towards DS. For example,
RoBERTa trained on 10,000 SQuAD samples
reaches 22.1F1 on DRoBERTa (DS), whereas train-
ing RoBERTa on DBiDAF and DBERT (DW) bumps
this number to 39.9F1 and 38.8F1, respectively.
Troisième, we observe similar performance deg-
radation patterns for both BERT and RoBERTa
on DSQuAD when trained on data collected with
increasingly stronger models in the loop. For
example, RoBERTa evaluated on DSQuAD
achieves 82.8, 80.0, 75.1, and 72.1F1 when trained
on DSQuAD(10K), DBiDAF, DBERT, and DRoBERTa,
respectively. This may indicate a gradual shift
in the distributions of composed questions as the
model in the loop gets stronger.
These observations suggest an encouraging
takeaway for the model-in-the-loop annotation
paradigm: Even though a particular model might
be chosen as an adversary in the annotation
loop, which at some point falls behind more recent
state-of-the-art models, these future models can
still benefit from data collected with the weaker
model, and also generalize better to samples
composed with the stronger model in the loop.
We further show experimental results for the
same models and training datasets, but now
including SQuAD as additional training data, in
Table 7. In this training setup we generally see
improved generalization to DBiDAF, DBERT, and
DRoBERTa. Interestingly, the relative differences
between DBiDAF, DBERT, and DRoBERTa as training
sets used in conjunction with SQuAD are much
diminished, and especially DRoBERTa as (part of)
the training set now generalizes substantially
better. We see that BERT and RoBERTa both
show consistent performance gains with the
addition of the original SQuAD1.1 training data,
but unlike in Table 6, this comes without any
noticeable decline in performance on DSQuAD,
suggesting that the adversarially constructed
datasets expose inherent model weaknesses, as
investigated by Liu et al. (2019a).
Model   | DSQuAD EM / F1        | DBiDAF EM / F1        | DBERT EM / F1         | DRoBERTa EM / F1
BiDAF   | 57.1±0.4 / 70.4±0.3   | 17.1±0.8 / 27.0±0.9   | 20.0±1.0 / 29.2±0.8   | 18.3±0.6 / 27.4±0.7
BERT    | 75.5±0.2 / 87.2±0.2   | 57.7±1.0 / 71.0±1.1   | 52.1±0.7 / 62.2±0.7   | 43.0±1.1 / 54.2±1.0
RoBERTa | 74.2±0.3 / 86.9±0.3   | 59.8±0.5 / 74.1±0.6   | 55.1±0.6 / 65.1±0.7   | 41.6±1.0 / 52.7±1.0
Table 8: Training models on SQuAD combined with all the adversarially created datasets DBiDAF,
DBERT, and DRoBERTa. Results underlined indicate the best result per model. We report the mean and
standard deviation (±) over 10 runs with different random seeds.
Furthermore, RoBERTa achieves the strongest
results on the adversarially collected eval-
uation sets, in particular when trained on
DSQuAD + DRoBERTa. This stands in contrast
to the results in Table 6, where training on DBiDAF
in several cases led to better generalization than
training on DRoBERTa. A possible explanation is
that training on DRoBERTa leads to a larger degree
of overfitting to specific adversarial examples in
DRoBERTa than training on DBiDAF, and that the
inclusion of a large number of standard SQuAD
training samples can mitigate this effect.
Results for the models trained on all the
datasets combined (DSQuAD, DBiDAF, DBERT, and
DRoBERTa) are shown in Table 8. These further
support the previous observations and provide
additional performance gains where, for exam-
ple, RoBERTa achieves F1 scores of 86.9 on
DSQuAD, 74.1 on DBiDAF, 65.1 on DBERT, and 52.7
on DRoBERTa, surpassing the best previous perfor-
mance on all adversarial datasets.
Finally, we identify a risk of datasets con-
structed with weaker models in the loop becom-
ing outdated. For example, RoBERTa achieves
58.2EM/73.2F1 on DBiDAF, in contrast to 0.0EM/
5.5F1 for BiDAF—which is not far from the
non-expert human performance of 62.6EM/78.5F1
(cf. Tableau 2).
It is also interesting to note that, even when
training on all the combined data (cf. Tableau 8),
BERT outperforms RoBERTa on DRoBERTa and
vice versa, suggesting that there may exist
weaknesses inherent to each model class.
4.3 Generalization to Non-Adversarial Data
Compared with standard annotation, the model-
in-the-loop approach generally results in new ques-
tion distributions. Consequently, models trained
on adversarially composed questions might not be
able to generalize to standard (‘‘easy’’) questions,
thus limiting the practical usefulness of
the
resulting data. To what extent do models trained
on model-in-the-loop questions generalize differ-
ently to standard (‘‘easy’’) questions, compared
with models trained on standard (‘‘easy’’)
questions?
To measure this we further train each of our three
models on either DBiDAF, DBERT, or DRoBERTa
and test on DSQuAD, with results in the DSQuAD
columns of Table 6. For comparison, the models
are also trained on 10,000 SQuAD1.1 samples
(referred to as DSQuAD(10K)) chosen from the same
passages as the adversarial datasets, thus elim-
inating size and paragraph choice as potential con-
founding factors. The models are tuned for EM
on the held-out DSQuAD validation set. Note that,
although performance values on the majority vote
DSQuAD dataset are lower than on the original, for
the reasons described earlier, this enables direct
comparisons across all datasets.
Remarkably, neither BERT nor RoBERTa
show substantial drops when trained on DBiDAF
compared to training on SQuAD data (−2.1F1,
and −2.8F1): Training these models on a dataset
with a weaker model in the loop still leads to
strong generalization even to data from the origi-
nal SQuAD distribution, which all models in the
loop are trained on. BiDAF, on the other hand,
fails to learn such information from the adversar-
ially collected data, and drops >30F1 for each of
the new training sets, compared to training on
SQuAD.
We also observe a gradual decrease in gener-
alization to SQuAD when training on DBiDAF to-
wards training on DRoBERTa. This suggests that
the stronger the model, the more dissimilar the
resulting data distribution becomes from the orig-
inal SQuAD distribution. We later find further
support for this explanation in a qualitative
analysis (Section 5). It may, however, also be due
to a limitation of BERT and RoBERTa—similar
to BiDAF—in learning from a data distribution
designed to beat these models; an even stronger
model might learn more from, for example,
DRoBERTa.
4.4 Generalization to DROP and
NaturalQuestions
Finally, we investigate to what extent models
can transfer skills learned on the datasets created
with a model in the loop to two recently intro-
duced datasets: DROP (Dua et al., 2019), and
NaturalQuestions (Kwiatkowski et al., 2019). In
this experiment we select the subsets of DROP and
NaturalQuestions that align with the structural
constraints of SQuAD to ensure a like-for-like
analysis. Specifically, we only consider questions
in DROP where the answer is a span in the passage
and where there is only one candidate answer.
For NaturalQuestions, we consider all non-tabular
long answers as passages, remove HTML tags
and use the short answer as the extracted span.
We apply this filtering on the validation sets for
both datasets. Next we split them, stratifying by
document (as we did for DSQuAD), which results
in 1409/1418 validation and test set examples
for DROP, and 964/982 for NaturalQuestions,
respectively. We denote these datasets as DDROP
and DNQ for clarity and distinction from their
unfiltered versions. We consider the same models
and training datasets as before, but tune on the
respective validation sets of DDROP and DNQ.
Table 6 shows the results of these experiments in
the respective DDROP and DNQ columns.
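A sketch of this filtering logic, with hypothetical field names standing in for the actual DROP and NaturalQuestions formats (which differ in detail), is shown below:

```python
import re
from typing import Optional

def is_extractive_drop_question(qa: dict, passage: str) -> bool:
    """Keep DROP questions whose single candidate answer is a span of the passage."""
    spans = qa.get("answer_spans", [])  # hypothetical field name
    return len(spans) == 1 and spans[0] in passage

def nq_to_squad_example(example: dict) -> Optional[dict]:
    """Turn a NaturalQuestions item into a SQuAD-style triple where possible:
    a non-tabular long answer becomes the passage (HTML tags stripped) and the
    short answer becomes the extracted span."""
    if example.get("long_answer_is_table") or not example.get("short_answer"):
        return None
    passage = re.sub(r"<[^>]+>", " ", example["long_answer"])
    passage = " ".join(passage.split())
    short_answer = example["short_answer"]
    if short_answer not in passage:
        return None
    return {"context": passage, "question": example["question"], "answer": short_answer}
```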
D'abord, we observe clear generalization improve-
ments towards DDROP across all models compared
to training on DSQuAD(10K) when training on any
of DBiDAF, DBERT, or DRoBERTa. That is, including
a model in the loop for the training dataset leads
to improved transfer towards DDROP. Note that
DROP also makes use of a BiDAF model in
the loop during annotation; these results are in
line with our prior observations when testing the
same setups on DBiDAF, DBERT, and DRoBERTa,
compared to training on DSQuAD(10K).
Deuxième, we observe overall strong transfer
results towards DNQ, with up to 69.8F1 for a BERT
model trained on DBiDAF. Note that this result is
similar to, and even slightly improves over, model
training with SQuAD data of the same size. That
is, relative to training on SQuAD data, training
on adversarially collected data DBiDAF does not
impede generalization to the DNQ dataset, which
was created without a model in the annotation
loop. We then, however, see a similar negative
performance progression as observed before when
testing on DSQuAD: The stronger the model in
the annotation loop of the training dataset, the
lower the test accuracy on test data from a data
distribution composed without a model in the loop.
5 Qualitative Analysis
Having applied the general model-in-the-loop
methodology on models of varying strength, we
next perform a qualitative comparison of the na-
ture of the resulting questions. As reference points
we also include the original SQuAD questions,
as well as DROP and NaturalQuestions, in this
comparison: these datasets are both constructed
to overcome limitations in SQuAD and have sub-
sets sufficiently similar to SQuAD to make an
analysis possible. Specifically, we seek to under-
stand the qualitative differences in terms of
reading comprehension challenges posed by the
questions in each of these datasets.
5.1 Comprehension Requirements
There exists a variety of prior work that seeks
to understand the types of knowledge, compre-
hension skills, or types of reasoning required to
answer questions based on text (Rajpurkar et al.,
2016; Clark et al., 2018; Sugawara et al., 2019;
Dua et al., 2019; Dasigi et al., 2019); we are,
however, unaware of any commonly accepted
formalism. We take inspiration from these but
develop our own taxonomy of comprehension
requirements which suits the datasets analyzed.
Our taxonomy contains 13 labels, most of which
are commonly used in other work. However, the
following three deserve additional clarification:
je) explicit–for which the answer is stated nearly
Figure 5: Comparison of comprehension types for the questions in different datasets. The label types are neither
mutually exclusive nor comprehensive. Values above columns indicate excess of the axis range.
word-for-word in the passage as it is in the ques-
tion, ii) filtering–a set of answers is narrowed
down to select one by some particular distin-
guishing feature, and iii) implicit–the answer
builds on information implied by the passage and
does not otherwise require any of the other types
of reasoning.
We annotate questions with labels from this
catalogue in a manner that is not mutually
exclusive, and neither fully comprehensive; the
development of such a catalogue is itself very
challenging. Rather, we focus on capturing the
most salient characteristics of each given question,
and assign it up to three of the labels in our
catalogue. In total, we analyze 100 samples from
the validation set of each of the datasets; Figure 5
shows the results.
5.2 Observations
An initial observation is that the majority (57%)
of answers to SQuAD questions are stated explic-
itly, without comprehension requirements beyond
the literal level. This number decreases substan-
tially for any of the model-in-the-loop datasets
derived from SQuAD (e.g., 8% for DBiDAF)
and also DDROP, yet 42% of questions in DNQ
share this property. In contrast to SQuAD, the
model-in-the-loop questions generally tend to
involve more paraphrasing. They also require
more external knowledge, and multi-hop infer-
ence (beyond co-reference resolution) with an
increasing trend for stronger models used in
the annotation loop. Model-in-the-loop questions
further fan out into a variety of small, but non-
negligible proportions of more specific types
of inference required for comprehension, pour
example, spatial or temporal inference (both
going beyond explicitly stated spatial or temporal
information)—SQuAD questions rarely require
these at all. Some of these more particular infer-
ence types are common features of the other two
datasets, in particular comparative questions for
DROP (60%) and to a small extent also Natu-
ralQuestions. Interestingly, DBiDAF possesses the
largest number of comparison questions (11%)
among our model-in-the-loop datasets, whereas
DBERT and DRoBERTa only possess 1% and 3%,
respectively. This offers an explanation for our
previous observation in Table 6, where BERT and
RoBERTa perform better on DDROP when trained
on DBiDAF rather than on DBERT or DRoBERTa. It is
likely that BiDAF as a model in the loop is worse
than BERT and RoBERTa at comparative ques-
tions, as evidenced by the results in Table 6 with
BiDAF reaching 8.6F1, BERT reaching 28.9F1,
and RoBERTa reaching 39.4F1 on DDROP (when
trained on DSQuAD(10K)).
The distribution of NaturalQuestions contains
elements of both the SQuAD and DBiDAF
distributions, which offers a potential explanation
for the strong performance on DNQ of models
trained on DSQuAD(10K) and DBiDAF. Finally, the
gradually shifting distribution away from both
SQuAD and NaturalQuestions as the model-
in-the-loop strength increases reflects our prior
observations on the decreasing performance on
SQuAD and NaturalQuestions of models trained
on datasets with progressively stronger models in
the loop.
6 Discussion and Conclusions
We have investigated an RC annotation para-
digm that requires a model in the loop to be
‘‘beaten’’ by an annotator. Applying this approach
with progressively stronger models in the loop
(BiDAF, BERT, and RoBERTa), we produced
three separate datasets. Using these datasets,
we investigated several questions regarding the
annotation paradigm, in particular, whether such
datasets grow outdated as
stronger models
emerge, and their generalization to standard
(non-adversarially collected) questions. We found
that stronger models can still learn from data
collected with a weak adversary in the loop, and
their generalization improves even on datasets
collected with a stronger adversary. Models
trained on data collected with a model in the
loop further generalize well to non-adversarially
collected data, both on SQuAD and on Natu-
ralQuestions, yet we observe a gradual deteriora-
tion in performance with progressively stronger
adversaries.
We see our work as a contribution towards
the emerging paradigm of model-in-the-loop
annotation. Although this paper has focused on
RC, with SQuAD as the original dataset used
to train model adversaries, we see no reason in
principle why findings would not be similar for
other tasks using the same annotation paradigm,
when crowdsourcing challenging samples with a
model in the loop. We would expect the insights
and benefits conveyed by model-in-the-loop
annotation to be the greatest on mature datasets
where models exceed human performance: Ici
the resulting data provides a magnifying glass
on model performance, focused in particular on
samples which models struggle on. On the other
hand, applying the method to datasets where
performance has not yet plateaued would likely
result in a more similar distribution to the original
data, which is challenging to models a priori.
We hope that the series of experiments on repli-
cability, observations on transfer between datasets
collected using models of different strength, as
well as our findings regarding generalization to
non-adversarially collected data, can support and
inform future research and annotation efforts using
this paradigm.
Acknowledgments
The authors would like to thank Christopher Potts
for his detailed and constructive feedback, and
our reviewers. This work was supported by the
European Union’s Horizon 2020 Research and
Innovation Programme under grant agreement
No. 875160 and the UK Defence Science and
Technology Laboratory (Dstl) and Engineering
and Physical Sciences Research Council (EPSRC) under
grant EP/R018693/1 as a part of the collaboration
between US DOD, UK MOD, and UK EPSRC
under the Multidisciplinary University Research
Initiative (MURI).
References

Samuel R. Bowman, Gabor Angeli, Christopher
Potts, and Christopher D. Manning. 2015. A
large annotated corpus for learning natural
language inference. In Proceedings of the 2015
Conference on Empirical Methods in Natural
Language Processing, pages 632–642, Lisbon,
Portugal. Association for Computational Lin-
guistics. DOI: https://doi.org/10.18653/v1/D15-1075

Danqi Chen, Jason Bolton, and Christopher D.
Manning. 2016. A thorough examination of
the CNN/Daily Mail reading comprehension
task. In Proceedings of the 54th Annual
Meeting of the Association for Computa-
tional Linguistics (Volume 1: Long Papers),
pages 2358–2367, Berlin, Germany. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/P16-1223, PMID: 30036459

Michael Chen, Mike D'Arcy, Alisa Liu, Jared
Fernandez, and Doug Downey. 2019. CODAH:
An adversarially-authored question answering
dataset for common sense. In Proceedings
of the 3rd Workshop on Evaluating Vector
Space Representations for NLP, pages 63–69,
Minneapolis, USA. Association for Computa-
tional Linguistics.

Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar,
Wen-tau Yih, Yejin Choi, Percy Liang, and
Luke Zettlemoyer. 2018. QuAC: Question an-
swering in context. In Proceedings of the 2018
Conference on Empirical Methods in Nat-
ural Language Processing, pages 2174–2184,
Brussels, Belgium. Association for Compu-
tational Linguistics. DOI: https://doi.org/10.18653/v1/D18-1241, PMID: 30142985

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar
Khot, Ashish Sabharwal, Carissa Schoenick,
and Oyvind Tafjord. 2018. Think you have
solved question answering? Try ARC, the AI2
reasoning challenge. CoRR, abs/1803.05457.
Pradeep Dasigi, Nelson F. Liu, Ana Marasovi´c,
Noah A. Forgeron, and Matt Gardner. 2019.
Quoref: A reading comprehension dataset with
questions requiring coreferential
reasoning.
le 2019 Conference on
In Proceedings of
Empirical Methods
in Natural Language
Processing and the 9th International Joint
Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 5925–5932, Hong
Kong, Chine. Association for Computational
Linguistics. EST CE QUE JE: https://est ce que je.org/10
.18653/v1/D19-1606
Jia Deng, R.. Socher, Li Fei-Fei, Wei Dong,
Kai Li, and Li-Jia Li. 2009. ImageNet: UN
large-scale hierarchical
Dans
2009 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pages 248–255.
EST CE QUE JE: https://doi.org/10.1109/CVPR
.2009.5206848
image database.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, et
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of
le 2019
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, Volume 1
(Long and Short Papers), pages 4171–4186,
Minneapolis, Minnesota. Association for Com-
putational Linguistics.
Emily Dinan, Samuel Humeau, Bharath Chintagunta,
and Jason Weston. 2019. Build it break it fix it
for dialogue safety: Robustness from adver-
sarial human attack. In Proceedings of the 2019
Conference on Empirical Methods in Natural
Language Processing and the 9th Interna-
tional Joint Conference on Natural Language
Processing (EMNLP-IJCNLP), pages 4537–4546,
Hong Kong, Chine. Association for Compu-
tational Linguistics. EST CE QUE JE: https://est ce que je
.org/10.18653/v1/D19-1461
Dheeru Dua, Yizhong Wang, Pradeep Dasigi,
Gabriel Stanovsky, Sameer Singh, and Matt
Gardner. 2019. DROP: A reading comprehen-
sion benchmark requiring discrete reasoning
over paragraphs. In Proceedings of the 2019
Conference of the North American Chapter
of the Association for Computational Linguis-
tics: Human Language Technologies, Volume 1
(Long and Short Papers), pages 2368–2378,
Minneapolis, Minnesota. Association for Com-
putational Linguistics.
Allyson Ettinger, Sudha Rao, Hal Daum´e III, et
Emily M. Cintreuse. 2017. Towards linguistically
generalizable NLP systems: A workshop and
shared task. CoRR, abs/1711.01505.
Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1–6, Melbourne, Australia. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/W18-2501.
Edward Grefenstette, Robert Stanforth, Brendan O'Donoghue, Jonathan Uesato, Grzegorz Swirszcz, and Pushmeet Kohli. 2018. Strength in numbers: Trading-off robustness and computation via adversarially-trained ensembles. CoRR, abs/1811.09300.
Suchin Gururangan, Swabha Swayamdipta, Omer
Levy, Roy Schwartz, Samuel Bowman, et
Noah A. Forgeron. 2018. Annotation artifacts in
natural language inference data. In Proceedings
of the 2018 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
Volume 2 (Short Papers), pages 107–112,
La Nouvelle Orléans, Louisiana. Association for Com-
putational Linguistics. DOI: https://doi.org/10.18653/v1/N18-2017.
Karl Moritz Hermann, Tomas Kocisky, Edward
Grefenstette, Lasse Espeholt, Will Kay,
Mustafa Suleyman, and Phil Blunsom. 2015.
Teaching machines to read and comprehend.
In C. Cortes, N. D. Lawrence, D. D. Lee,
M.. Sugiyama, et R. Garnett, editors, Advances
in Neural Information Processing Systems 28,
pages 1693–1701. Curran Associates, Inc.
Robin Jia and Percy Liang. 2017. Adversarial
examples for evaluating reading comprehension
systèmes. In Proceedings of the 2017 Conference
on Empirical Methods in Natural Language
Processing, pages 2021–2031, Copenhagen, Den-
mark. Association for Computational Linguistics.
EST CE QUE JE: https://doi.org/10.18653/v1
/D17-1215
Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P17-1147.
Divyansh Kaushik, Eduard Hovy, and Zachary
Lipton. 2020. Learning the difference that makes
a difference with counterfactually-augmented
data. In International Conference on Learning
Representations.
Divyansh Kaushik and Zachary C. Lipton. 2018. How much reading does reading comprehension require? A critical investigation of popular benchmarks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5010–5015, Brussels, Belgium. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D18-1546.
Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2018. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328. DOI: https://doi.org/10.1162/tacl_a_00023.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia
Redfield, Michael Collins, Ankur Parikh, Chris
Alberti, Danielle Epstein,
Illia Polosukhin,
Jacob Devlin, Kenton Lee, Kristina Toutanova,
Llion Jones, Matthew Kelcey, Ming-Wei
Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc
Le, and Slav Petrov. 2019. Natural Questions:
A benchmark for question answering research.
Transactions of the Association for Computa-
tional Linguistics, 7:453–466. DOI: https://doi.org/10.1162/tacl_a_00276.
David D. Lewis and William A. Coup de vent. 1994. UN
sequential algorithm for training text classifiers.
In SIGIR, pages 3–12. ACM/Springer.
le 2019 Conference of
Nelson F. Liu, Roy Schwartz, and Noah A. Forgeron.
2019un. Inoculation by fine-tuning: A method
for analyzing challenge datasets. In Proceed-
ings of
the North
American Chapter of the Association for Com-
putational Linguistics: Human Language Tech-
nologies, Volume 1 (Long and Short Papers),
pages 2171–2179, Minneapolis, Minnesota.
Association for Computational Linguistics.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du,
Mandar Joshi, Danqi Chen, Omer Levy, Mike
Lewis, Luke Zettlemoyer, and Veselin Stoyanov.
2019b. RoBERTa: A robustly optimized BERT
pretraining approach. CoRR, abs/1907.11692.
Mitchell P. Marcus, Beatrice Santorini, et
Mary Ann Marcinkiewicz. 1993. Building a
large annotated corpus of English: The Penn
Treebank. Computational Linguistics, 19(2):
313–330. EST CE QUE JE: https://est ce que je.org/10
.21236/ADA273556
Sewon Min, Victor Zhong, Richard Socher,
and Caiming Xiong. 2018. Efficient and robust
question answering from minimal context
over documents. In Proceedings of the 56th
Annual Meeting of the Association for Compu-
tational Linguistics (Volume 1: Long Papers),
pages 1725–1735, Melbourne, Australia. Asso-
ciation for Computational Linguistics. EST CE QUE JE:
https://doi.org/10.18653/v1/P18-1160.
Mike Mintz, Steven Bills, Rion Snow, et
Daniel Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 1003–1011, Suntec, Singapore. Association for Computational Linguistics. DOI: https://doi.org/10.3115/1690219.1690287.
Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng
Gao, Saurabh Tiwary, Rangan Majumder, et
Li Deng. 2016. MS MARCO: Un humain
generated MAchine Reading COmprehension
dataset. arXiv preprint arXiv:1611.09268.
Yixin Nie, Adina Williams, Emily Dinan, Mohit
Bansal, Jason Weston, and Douwe Kiela. 2019.
Adversarial NLI: A new benchmark for nat-
ural language understanding. arXiv preprint
arXiv:1910.14599. DOI: https://doi.org/10.18653/v1/2020.acl-main.441.
Pranav Rajpurkar, Robin Jia, and Percy Liang.
2018. Know what you don’t know: Unanswer-
able questions for SQuAD. In Proceedings of
the 56th Annual Meeting of the Association for
Computational Linguistics (Volume 2: Short
Papers), pages 784–789, Melbourne, Australia.
Association for Computational Linguistics.
EST CE QUE JE: https://doi.org/10.18653/v1
/P18-2124
Pranav Rajpurkar,
Jian Zhang, Konstantin
Lopyrev, and Percy Liang. 2016. SQuAD:
100,000+ questions
for machine compre-
hension of text. In Proceedings of the 2016
Conference on Empirical Methods in Nat-
ural Language Processing, pages 2383–2392,
Austin, Texas. Association for Computational
Linguistics. EST CE QUE JE: https://est ce que je.org/10
.18653/v1/D16-1264
Siva Reddy, Danqi Chen, and Christopher D.
Manning. 2019. CoQA: A conversational ques-
tion answering challenge. Transactions of the
Association for Computational Linguistics,
7:249–266. EST CE QUE JE: https://est ce que je.org/10
.1162/tacl a 00266
Matthew Richardson, Christopher J. C. Burges,
and Erin Renshaw. 2013. MCTest: A chal-
lenge dataset for the open-domain machine
comprehension of text. In Proceedings of the
2013 Conference on Empirical Methods in
Natural Language Processing, pages 193–203,
Seattle, Washington, Etats-Unis. Association for
Computational Linguistics.
Marzieh Saeidi, Max Bartolo, Patrick Lewis,
Sameer Singh, Tim Rocktäschel, Mike Sheldon,
Guillaume Bouchard, and Sebastian Riedel.
2018. Interpretation of natural language rules
in conversational machine reading. En Pro-
ceedings of the 2018 Conference on Empirical
Methods in Natural Language Processing,
pages 2087–2097, Brussels, Belgium. Asso-
ciation for Computational Linguistics. EST CE QUE JE:
https://doi.org/10.18653/v1/D18-1233.
Roy Schwartz, Maarten Sap, Ioannis Konstas,
Leila Zilles, Yejin Choi, and Noah A. Forgeron.
2017. The effect of different writing tasks on
linguistic style: A case study of the ROC story
cloze task. In Proceedings of the 21st Confer-
ence on Computational Natural Language
Apprentissage (CoNLL 2017), pages 15–25, Vancou-
ver, Canada. Association for Computational
Linguistics. EST CE QUE JE: https://est ce que je.org/10
.18653/v1/K17-1004
Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi,
and Hannaneh Hajishirzi. 2017. Bidirectional
attention flow for machine comprehension.
In The International Conference on Learning
Representations (ICLR).
Rion Snow, Brendan O’Connor, Daniel Jurafsky,
and Andrew Ng. 2008. Cheap and fast – but
is it good? Evaluating non-expert annotations
for natural language tasks. In Proceedings of the
2008 Conference on Empirical Methods in
Natural Language Processing, pages 254–263,
Honolulu, Hawaii. Association for Compu-
tational Linguistics. EST CE QUE JE: https://est ce que je
.org/10.3115/1613715.1613751
Saku Sugawara, Kentaro Inui, Satoshi Sekine,
and Akiko Aizawa. 2018. What makes reading
comprehension questions easier? In Proceed-
ings of
le 2018 Conference on Empirical
Methods in Natural Language Processing,
pages 4208–4219, Brussels, Belgium. Asso-
ciation for Computational Linguistics. EST CE QUE JE:
https://doi.org/10.18653/v1/D18-1453.
Saku Sugawara, Pontus Stenetorp, Kentaro Inui,
and Akiko Aizawa. 2019. Assessing the bench-
marking capacity of machine reading compre-
hension datasets. CoRR, abs/1911.09241.
James Thorne, Andreas Vlachos, Oana Cocarascu,
Christos Christodoulopoulos, and Arpit Mittal.
2019. The FEVER2.0 shared task. In Proceed-
ings of the Second Workshop on Fact Extraction
and VERification (FEVER), pages 1–6, Hong
Kong, Chine. Association for Computational
Linguistics. EST CE QUE JE: https://est ce que je.org/10
.18653/v1/D19-6601, PMCID: PMC6533707
Adam Trischler, Tong Wang, Xingdi Yuan, Justin
Harris, Alessandro Sordoni, Philip Bachman,
and Kaheer Suleman. 2017. NewsQA: A ma-
chine comprehension dataset. In Proceedings of
the 2nd Workshop on Representation Learning
for NLP, pages 191–200, Vancouver, Canada.
Association for Computational Linguistics.
EST CE QUE JE: https://doi.org/10.18653/v1
/W17-2623
Eric Wallace, Pedro Rodriguez, Shi Feng, Ikuya
Yamada, and Jordan Boyd-Graber. 2019. Trick
me if you can: Human-in-the-loop generation
of adversarial examples for question answering.
Transactions of the Association for Computa-
tional Linguistics, 7:387–401. DOI: https://doi.org/10.1162/tacl_a_00279.
Dirk Weissenborn, Georg Wiese, and Laura Seiffe.
2017. Making neural QA as simple as possible
but not simpler. In Proceedings of the 21st
Conference on Computational Natural Lan-
guage Learning (CoNLL 2017), pages 271–280,
Vancouver, Canada. Association for Compu-
tational Linguistics. EST CE QUE JE: https://est ce que je
.org/10.18653/v1/K17-1028
Johannes Welbl, Pontus Stenetorp, and Sebastian
Riedel. 2018. Constructing datasets for multi-
hop reading comprehension across documents.
Transactions of the Association for Computa-
tional Linguistics, 6:287–302. DOI: https://doi.org/10.1162/tacl_a_00021.
Thomas Wolf, Lysandre Debut, Victor Sanh,
Julien Chaumond, Clement Delangue, Antoine
Moi, Pierric Cistac, Tim Rault, Rémi Louf,
Morgan Funtowicz, and Jamie Brew. 2019.
HuggingFace’s Transformers: State-of-the-art
Natural Language Processing.
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua
Bengio, William Cohen, Ruslan Salakhutdinov,
and Christopher D. Manning. 2018un. HotpotQA:
A dataset for diverse, explainable multi-hop
question answering. In Proceedings of the 2018
Conference on Empirical Methods in Natural
Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D18-1259.
Zhilin Yang, Saizheng Zhang, Jack Urbanek, Will
Feng, Alexander Miller, Arthur Szlam, Douwe
Kiela, and Jason Weston. 2018b. Mastering
the dungeon: Grounded language learning by
mechanical turker descent. In International Conference on Learning Representations.
Rowan Zellers, Yonatan Bisk, Roy Schwartz, et
Yejin Choi. 2018. SWAG: A large-scale adver-
sarial dataset for grounded commonsense infer-
ence. In Proceedings of the 2018 Conference
on Empirical Methods in Natural Language
Processing, pages 93–104, Brussels, Belgium.
Association for Computational Linguistics.
EST CE QUE JE: https://doi.org/10.18653/v1
/D18-1009
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali
Farhadi, and Yejin Choi. 2019. HellaSwag:
Can a machine really finish your sentence?
In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P19-1472.
Sheng Zhang, Xiaodong Liu,
Jingjing Liu,
Jianfeng Gao, Kevin Duh, and Benjamin Van
Durme. 2018. ReCoRD: Bridging the gap be-
tween human and machine commonsense read-
ing comprehension. arXiv preprint arXiv:1810.12885.