Beat the AI: Investigating Adversarial Human Annotation
for Reading Comprehension

Max Bartolo Alastair Roberts

Johannes Welbl Sebastian Riedel Pontus Stenetorp

Department of Computer Science
University College London
{m.bartolo,a.roberts,j.welbl,s.riedel,p.stenetorp}@cs.ucl.ac.uk

Abstract

Innovations in annotation methodology have been a catalyst for Reading Comprehension (RC) datasets and models. One recent trend to challenge current RC models is to involve a model in the annotation process: humans create questions adversarially, such that the model fails to answer them correctly. In this work we investigate this annotation methodology and apply it in three different settings, collecting a total of 36,000 samples with progressively stronger models in the annotation loop. This allows us to explore questions such as the reproducibility of the adversarial effect, transfer from data collected with varying model-in-the-loop strengths, and generalization to data collected without a model. We find that training on adversarially collected samples leads to strong generalization to non-adversarially collected datasets, yet with progressive performance deterioration with increasingly stronger models-in-the-loop. Furthermore, we find that stronger models can still learn from datasets collected with substantially weaker models-in-the-loop. When trained on data collected with a BiDAF model in the loop, RoBERTa achieves 39.9 F1 on questions that it cannot answer when trained on SQuAD, only marginally lower than when trained on data collected using RoBERTa itself (41.0 F1).

1 Introduction

Data collection is a fundamental prerequisite for Machine Learning-based approaches to Natural Language Processing (NLP). Innovations in data acquisition methodology, such as crowdsourcing, have led to major breakthroughs in scalability and preceded the ‘‘deep learning revolution’’, for which they can arguably be seen as co-responsible (Deng et al., 2009; Bowman et al., 2015; Rajpurkar et al., 2016). Annotation approaches include expert annotation, for example, relying on trained linguists (Marcus et al., 1993), crowd-sourcing by non-experts (Snow et al., 2008), distant supervision (Mintz et al., 2009; Joshi et al., 2017), and leveraging document structure (Hermann et al., 2015). The concrete data collection paradigm chosen dictates the degree of scalability, annotation cost, precise task structure (often arising as a compromise of the above) and difficulty, domain coverage, as well as resulting dataset biases and model blind spots (Jia and Liang, 2017; Schwartz et al., 2017; Gururangan et al., 2018).

A recently emerging trend in NLP dataset
creation is the use of a model-in-the-loop when
composing samples: A contemporary model is
used either as a filter or directly during annotation,
to identify samples wrongly predicted by the
model. Examples of this method are realized
in Build It Break It, The Language Edition
(Ettinger et al., 2017), HotpotQA (Yang et al.,
2018a), SWAG (Zellers et al., 2018), Mechanical
Turker Descent (Yang et al., 2018b), DROP
(Dua et al., 2019), CODAH (Chen et al., 2019),
Quoref (Dasigi et al., 2019), and AdversarialNLI
(Nie et al., 2019).1 This approach probes model
robustness and ensures that the resulting datasets
pose a challenge to current models, which drives
research to tackle new sets of problems.

We study this approach in the context of
Reading Comprehension (RC), and investigate its
robustness in the face of continuously progressing
models—do adversarially constructed datasets
quickly become outdated in their usefulness as
models grow stronger?

1 The idea was alluded to at least as early as Richardson et al. (2013), but it has only recently seen wider adoption.

Transactions of the Association for Computational Linguistics, vol. 8, pp. 662–678, 2020. https://doi.org/10.1162/tacl_a_00338
Action Editor: Christopher Potts. Submission batch: 3/2020; Revision batch: 6/2020; Published 10/2020.
© 2020 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Figure 1: Human annotation with a model in the loop, showing: i) the ‘‘Beat the AI’’ annotation setting where only questions that the model does not answer correctly are accepted, and ii) questions generated this way, with a progressively stronger model in the annotation loop.

Based on models trained on the widely used SQuAD dataset, and following the same annotation protocol, we investigate the annotation setup where an annotator has to compose questions for which the model predicts the wrong answer. As a result, only samples that the model fails to predict correctly are retained in the dataset; see Figure 1 for an example.

We apply this annotation strategy with three distinct models in the loop, resulting in datasets with 12,000 samples each. We then study the reproducibility of the adversarial effect when retraining the models with the same data, as well as the generalization ability of models trained using datasets produced with and without a model adversary. Models can, to a considerable degree, learn to generalize to more challenging questions, based on training sets collected with both stronger and also weaker models in the loop. Compared to training on SQuAD, training on adversarially composed questions leads to a similar degree of generalization to non-adversarially written questions, both for SQuAD and NaturalQuestions (Kwiatkowski et al., 2019). It furthermore leads to general improvements across the model-in-the-loop datasets we collect, as well as improvements of more than 20.0 F1 for both BERT and RoBERTa on an extractive subset of DROP (Dua et al., 2019), another adversarially composed dataset. When conducting a systematic analysis of the concrete questions different models fail to answer correctly, as well as non-adversarially composed questions, we see that the nature of the resulting questions changes: questions composed with a model in the loop are overall more diverse, use more paraphrasing, multi-hop inference, comparisons, and background knowledge, and are generally less easily answered by matching an explicit statement that states the required information literally. Given our observations, we believe a model-in-the-loop approach to annotation shows promise and should be considered when creating future RC datasets.

To summarize, our contributions are as follows: First, an investigation into the model-in-the-loop approach to RC data collection based on three progressively stronger models, together with an empirical performance comparison when trained on datasets constructed with adversaries of different strength. Second, a comparative investigation into the nature of questions composed to be unsolvable by a sequence of progressively stronger models. Third, a study of the reproducibility of the adversarial effect and the generalization ability of models trained in various settings.

2 Related Work

Constructing Challenging Datasets. Recent efforts in dataset construction have driven considerable progress in RC, yet datasets are structurally diverse and annotation methodologies vary. With its large size and combination of free-form questions with answers as extracted spans, SQuAD1.1 (Rajpurkar et al., 2016) has become an established benchmark that has inspired the construction of a series of similarly structured datasets. However, mounting evidence suggests that models can achieve strong generalization performance merely by relying on superficial cues, such as lexical overlap, term frequencies, or entity type matching (Chen et al., 2016; Weissenborn et al., 2017; Sugawara et al., 2018). It has thus become an increasingly important consideration to construct datasets that RC models find challenging, and for which natural language understanding is a requisite for generalization. Attempts to achieve this non-trivial aim have typically revolved around extensions to the SQuAD dataset annotation methodology. They include unanswerable questions (Trischler et al., 2017; Rajpurkar et al., 2018; Reddy et al., 2019; Choi et al., 2018), adding the option of ‘‘Yes’’ or ‘‘No’’ answers (Dua et al., 2019; Kwiatkowski et al., 2019), questions requiring reasoning over multiple sentences or documents (Welbl et al., 2018; Yang et al., 2018a), questions requiring rule interpretation or context awareness (Saeidi et al., 2018; Choi et al., 2018; Reddy et al., 2019), limiting annotator passage exposure by sourcing questions first (Kwiatkowski et al., 2019), controlling answer types by including options for dates, numbers, or spans from the question (Dua et al., 2019), as well as questions with free-form answers (Nguyen et al., 2016; Kočiský et al., 2018; Reddy et al., 2019).

Adversarial Annotation. One recently adopted approach to constructing challenging datasets involves the use of an adversarial model to select examples that it does not perform well on, an approach which superficially is akin to active learning (Lewis and Gale, 1994). Here, we make a distinction between two sub-categories of adversarial annotation: i) adversarial filtering, where the adversarial model is applied offline in a separate stage of the process, usually after data generation; examples include SWAG (Zellers et al., 2018), ReCoRD (Zhang et al., 2018), HotpotQA (Yang et al., 2018a), and HellaSWAG (Zellers et al., 2019); ii) model-in-the-loop adversarial annotation, where the annotator can directly interact with the adversary during the annotation process and uses the feedback to further inform the generation process; examples include CODAH (Chen et al., 2019), Quoref (Dasigi et al., 2019), DROP (Dua et al., 2019), FEVER2.0 (Thorne et al., 2019), AdversarialNLI (Nie et al., 2019), as well as work by Dinan et al. (2019), Kaushik et al. (2020), and Wallace et al. (2019) for the Quizbowl task.

We are primarily interested in the latter category, as this feedback loop creates an environment where the annotator can probe the model directly to explore its weaknesses and formulate targeted adversarial attacks. Although Dua et al. (2019) and Dasigi et al. (2019) make use of adversarial annotations for RC, both annotation setups limit the reach of the model-in-the-loop: in DROP, primarily due to the imposition of specific answer types, and in Quoref by focusing on co-reference, which is already a known RC model weakness.

In contrast, we investigate a scenario where
annotators interact with a model in its original task
setting—annotators must thus explore a range of
natural adversarial attacks, as opposed to filtering
out ‘‘easy’’ samples during the annotation process.

3 Annotation Methodology

3.1 Annotation Protocol

The data annotation protocol is based on SQuAD1.1,
with a model in the loop, and the additional
instruction that questions should only have one
answer in the passage, which directly mirrors the
setting in which these models were trained.

Formally, provided with a passage p, a human annotator generates a question q and selects a (human) answer a_h by highlighting the corresponding span in the passage. The input (p, q) is then given to the model, which returns a predicted (model) answer a_m. To compare the two, a word-overlap F1 score between a_h and a_m is computed; a score above a threshold of 40% is considered a ‘‘win’’ for the model.2 This process is repeated until the human ‘‘wins’’; Figure 2 gives a schematic overview of the process. All successful (p, q, a_h) triples, that is, those which the model is unable to answer correctly, are then retained for further validation.
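
The acceptance rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the answer normalization (lowercasing, stripping punctuation and English articles) is an assumption modelled on common SQuAD-style evaluation, and the names `normalize`, `word_overlap_f1`, and `human_wins` are hypothetical.

```python
import re
import string
from collections import Counter
from typing import List

F1_THRESHOLD = 0.40  # a score above this counts as a "win" for the model


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation and articles, split on whitespace.

    Assumption: SQuAD-style normalization; the paper does not spell it out.
    """
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def word_overlap_f1(human_answer: str, model_answer: str) -> float:
    """Token-level F1 between the human answer a_h and the model answer a_m."""
    human_tokens = normalize(human_answer)
    model_tokens = normalize(model_answer)
    if not human_tokens or not model_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(human_tokens == model_tokens)
    overlap = sum((Counter(human_tokens) & Counter(model_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(model_tokens)
    recall = overlap / len(human_tokens)
    return 2 * precision * recall / (precision + recall)


def human_wins(human_answer: str, model_answer: str) -> bool:
    """True when the model fails, so the (p, q, a_h) triple is retained."""
    return word_overlap_f1(human_answer, model_answer) <= F1_THRESHOLD
```

With the example from the footnote, a human answer of ‘‘New York’’ against a model answer of ‘‘New York City’’ gives precision 2/3 and recall 1, hence an F1 of 0.8, which is above the threshold and therefore a model ‘‘win’’.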

2 This threshold is set after initial experiments to not be overly restrictive given acceptable answer spans; for example, a human answer of ‘‘New York’’ vs. model answer ‘‘New York City’’ would still lead to a model ‘‘win’’.

Figure 2: Overview of the annotation process to collect adversarially written questions from humans using a model in the loop.

3.2 Annotation Details

Models in the Annotation Loop. We begin by training three different models, which are used as adversaries during data annotation. As a seed dataset for training the models we select the widely used SQuAD1.1 (Rajpurkar et al., 2016) dataset, a large-scale resource for which a variety of mature and well-performing models are readily available. Furthermore, unlike cloze-based datasets, SQuAD is robust to passage/question-only adversarial attacks (Kaushik and Lipton, 2018). We will compare dataset annotation with a series of three progressively stronger models as adversary in the loop, namely, BiDAF (Seo et al., 2017), BERT-Large (Devlin et al., 2019), and RoBERTa-Large (Liu et al., 2019b). Each of these will serve as a model adversary in a separate annotation experiment and result in three distinct datasets; we will refer to these as DBiDAF, DBERT, and DRoBERTa respectively. Examples from the validation set of each are shown in Table 1. We rely on the AllenNLP (Gardner et al., 2018) and Transformers (Wolf et al., 2019) model implementations, and our models achieve EM/F1 scores of 65.5%/77.5%, 82.7%/90.3% and 86.9%/93.6% for BiDAF, BERT, and RoBERTa, respectively, on the SQuAD1.1 validation set, consistent with results reported in other work.

Our choice of models reflects both the transition from LSTM-based to pre-trained transformer-based models, as well as a graduation among the latter; we investigate how this is reflected in datasets collected with each of these different models in the annotation loop. For each of the models we collect 10,000 training, 1,000 validation, and 1,000 test examples. Dataset sizes are motivated by the data efficiency of transformer-based pretrained models (Devlin et al., 2019; Liu et al., 2019b), which has improved the viability of smaller-scale data collection efforts for investigative and analysis purposes.

To ensure the experimental integrity provided by reporting all results on a held-out test set, we split the existing SQuAD1.1 validation set in half (stratified by document title) as the official test set is not publicly available. We maintain passage consistency across the training, validation and test sets of all datasets to enable like-for-like comparisons. Finally, we use the majority vote answer as ground truth for SQuAD1.1 to ensure that all our datasets have one valid answer per question, enabling us to fairly draw direct comparisons. For clarity, we will hereafter refer to this modified version of SQuAD1.1 as DSQuAD.

Crowdsourcing. We use custom-designed Human Intelligence Tasks (HITs) served through Amazon Mechanical Turk (AMT) for all annotation efforts. Workers are required to be based in Canada, the UK, or the US, have a HIT Approval Rate greater than 98%, and have previously completed at least 1,000 HITs successfully. We experiment with and without the AMT Master requirement and find no substantial difference in quality, but observe a throughput reduction of nearly 90% when the requirement is enabled. We pay USD 2.00 for every question generation HIT, during which workers are required to compose up to five questions that ‘‘beat’’ the model in the loop (cf. Figure 3). The mean HIT completion times for BiDAF, BERT, and RoBERTa are 551.8s, 722.4s, and 686.4s. Furthermore, we find that human workers are able to generate questions that successfully ‘‘beat’’ the model in the loop 59.4% of the time for BiDAF, 47.1% for BERT, and 44.0% for RoBERTa. These metrics broadly reflect the relative strength of the models.

3.3 Quality Control

Training and Qualification. We provide a two-part worker training interface in order to i) familiarize workers with the process, and ii) perform a first screening based on worker outputs. The interface familiarizes workers with formulating questions, and answering them through span selection. Workers are asked to generate questions for two given answers, to highlight answers for two given questions, to generate one full question-answer pair, and finally to complete a question generation HIT with BiDAF as the model in the loop. Each worker’s output is then reviewed manually (by the authors); those who pass the screening are added to the pool of qualified annotators.

Manual Worker Validation. In the second annotation stage, qualified workers produce data for the ‘‘Beat the AI’’ question generation task. A sample of every worker’s HITs is manually reviewed based on their total number of completed tasks n, determined by ⌊5 · log10(n) + 1⌋, chosen for


Model in the loop: BiDAF

Passage: [. . . ] the United Methodist Church has placed great emphasis on the importance of education. As such, the United Methodist Church established and is affiliated with around one hundred colleges [. . . ] of Methodist-related Schools, Colleges, and Universities. The church operates three hundred sixty schools and institutions overseas.
Question: The United Methodist Church has how many schools internationally?

Passage: In a purely capitalist mode of production (i.e. where professional and labor organizations cannot limit the number of workers) the workers wages will not be controlled by these organizations, or by the employer, but rather by the market. Wages work in the same way as prices for any other good. Thus, wages can be considered as a [. . . ]
Question: What determines worker wages?

Passage: [. . . ] released to the atmosphere, and a separate source of water feeding the boiler is supplied. Normally water is the fluid of choice due to its favourable properties, such as non-toxic and unreactive chemistry, abundance, low cost, and its thermodynamic properties. Mercury is the working fluid in the mercury vapor turbine [. . . ]
Question: What is the most popular type of fluid?

Model in the loop: BERT

Passage: [. . . ] Jochi was secretly poisoned by an order from Genghis Khan. Rashid al-Din reports that the great Khan sent for his sons in the spring of 1223, and while his brothers heeded the order, Jochi remained in Khorasan. Juzjani suggests that the disagreement arose from a quarrel between Jochi and his brothers in the siege of Urgench [. . . ]
Question: Who went to Khan after his order in 1223?

Passage: In the Sandgate area, to the east of the city and beside the river, resided the close-knit community of keelmen and their families. They were so called because [. . . ] transfer coal from the river banks to the waiting colliers, for export to London and elsewhere. In the 1630s about 7,000 out of 20,000 inhabitants of Newcastle died of plague [. . . ]
Question: Where did almost half the people die?

Passage: [. . . ] was important to reduce the weight of coal carried. Steam engines remained the dominant source of power until the early 20th century, when advances in the design of electric motors and internal combustion engines gradually resulted in the replacement of reciprocating (piston) steam engines, and shipping in the 20th-century [. . . ]
Question: Why did steam engines become obsolete?

Model in the loop: RoBERTa

Passage: [. . . ] and seven other hymns were published in the Achtliederbuch, the first Lutheran hymnal. In 1524 Luther developed his original four-stanza psalm paraphrase into a five-stanza Reformation hymn that developed the theme of “grace alone” more fully. Because it expressed essential Reformation doctrine, this expanded version of “Out of [. . . ]
Question: Luther’s reformed hymn did not feature stanzas of what quantity?

Passage: [. . . ] tight end Greg Olsen, who caught a career-high 77 passes for 1,104 yards and seven touchdowns, and wide receiver Ted Ginn, Jr., who caught 44 passes for 739 yards and 10 touchdowns; [. . . ] receivers included veteran Jerricho Cotchery (39 receptions for 485 yards), rookie Devin Funchess (31 receptions for 473 yards and [. . . ]
Question: Who caught the second most passes?

Passage: Other prominent alumni include anthropologists David Graeber and Donald Johanson, who is best known for discovering the fossil of a female hominid australopithecine known as “Lucy” in the Afar Triangle region, psychologist John B. Watson, American psychologist who established the psychological school of behaviorism, communication theorist Harold Innis, chess grandmaster Samuel Reshevsky, and conservative international relations scholar and White House Coordinator of Security Planning for the National Security Council Samuel P. Huntington.
Question: Who thinks three moves ahead?

Table 1: Validation set examples of questions collected using different RC models (BiDAF, BERT, and RoBERTa) in the annotation loop. The answer to the question is highlighted in the passage.


Figure 3: ‘‘Beat the AI’’ question generation interface. Human annotators are tasked with asking questions about
a provided passage that the model in the loop fails to answer correctly.

convenience. This is done after every annotation batch; if workers fall below an 80% success threshold at any point, their qualification is revoked and their work is discarded in its entirety.
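
As a rough illustration of the review schedule, the snippet below evaluates ⌊5 · log10(n) + 1⌋ for a few values of n and applies the 80% threshold. It is a sketch under assumptions: the function names are hypothetical, and ‘‘success’’ is taken to mean the fraction of manually reviewed HITs judged acceptable, which the text does not define further.

```python
import math

SUCCESS_THRESHOLD = 0.80  # workers falling below this lose their qualification


def review_sample_size(n_completed: int) -> int:
    """Number of HITs to review manually: floor(5 * log10(n) + 1)."""
    if n_completed < 1:
        return 0  # nothing to review before the first completed task
    return math.floor(5 * math.log10(n_completed) + 1)


def keeps_qualification(n_accepted: int, n_reviewed: int) -> bool:
    """Assumed reading of the rule: accepted / reviewed must be at least 80%."""
    if n_reviewed == 0:
        return True
    return n_accepted / n_reviewed >= SUCCESS_THRESHOLD


for n in (1, 10, 50, 100):
    print(n, review_sample_size(n))
```

The sample grows logarithmically: 1 reviewed HIT after the first task, 6 after 10 tasks, 9 after 50, and 11 after 100, so prolific workers are still checked without the review load growing linearly.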

Question Answerability. As the models used in the annotation task become stronger, the resulting questions tend to become more complex. However, this also means that it becomes more challenging to disentangle measures of dataset quality from inherent question difficulty. As such, we use the condition of human answerability for an annotated question-answer pair as follows: It is answerable if at least one of three additional non-expert human validators can provide an answer matching the original. We conduct answerability checks on both the validation and test sets, and achieve answerability scores of 87.95%, 85.41%, and 82.63% for DBiDAF, DBERT, and DRoBERTa. We discard all questions deemed unanswerable from the validation and test sets, and further discard all data from any workers with less than half of their questions considered answerable. It should be emphasized that the main purpose of this process is to create a level playing field for comparison across datasets constructed for different model adversaries, and can inevitably result in valid questions being discarded. The total cost for training and qualification, dataset construction, and validation is approximately USD 27,000.

Human Performance. We select a randomly chosen validator’s answer to each question and compute Exact Match (EM) and word overlap F1 scores with the original to calculate non-expert human performance; Table 2 shows the result. We observe a clear trend: The stronger the model in the loop used to construct the dataset, the harder the resulting questions become for humans.

            Dev             Test
Resource    EM      F1      EM      F1
DBiDAF      63.0    76.9    62.6    78.5
DBERT       59.2    74.3    63.9    76.9
DRoBERTa    58.1    72.0    58.7    73.7

Table 2: Non-expert human performance results for a randomly-selected validator per question.
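
The answerability condition and the human-performance estimate can be sketched as follows. This is an illustration only: the text does not state how ‘‘matching the original’’ is decided, so normalized exact match is assumed here, and all function names are hypothetical.

```python
import random
import re
import string
from typing import Optional, Sequence


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles (assumed SQuAD-style)."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)


def is_answerable(original: str, validator_answers: Sequence[str]) -> bool:
    """Answerable if at least one validator answer matches the original.

    Assumption: 'matching' means normalized exact match.
    """
    return any(exact_match(answer, original) for answer in validator_answers)


def human_exact_match(
    originals: Sequence[str],
    validator_answers: Sequence[Sequence[str]],
    rng: Optional[random.Random] = None,
) -> float:
    """EM (in %) of one randomly chosen validator answer per question."""
    rng = rng or random.Random(0)
    hits = sum(
        exact_match(rng.choice(list(answers)), original)
        for original, answers in zip(originals, validator_answers)
    )
    return 100.0 * hits / len(originals)
```

Because answerability needs only one of three validators to agree while human performance scores a single random validator, the answerability rate is by construction at least as high as the human EM on the same questions.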

3.4 Dataset Statistics

Table 3 provides general details on the number of passages and question-answer pairs used in the different dataset splits. The average number of words in questions and answers, as well as the average longest n-gram overlap between passage and question are given in Table 4.

            #Passages                 #QAs
Resource    Train    Dev    Test      Train    Dev     Test
DSQuAD      18,891   971    1,096     87,599   5,278   5,292
DBiDAF      2,523    278    277       10,000   1,000   1,000
DBERT       2,444    283    292       10,000   1,000   1,000
DRoBERTa    2,552    341    333       10,000   1,000   1,000

Table 3: Number of passages and question-answer pairs for each data resource.

                  DSQuAD   DBiDAF   DBERT   DRoBERTa
Question length   10.3     9.8      9.8     10.0
Answer length     2.6      2.9      3.0     3.2
N-Gram overlap    3.0      2.2      2.1     2.0

Table 4: Average number of words per question and answer, and average longest n-gram overlap between passage and question.

We can again observe two clear trends: From weaker towards stronger models used in the annotation loop, the average length of answers increases, and the largest n-gram overlap drops from 3 to 2 tokens. That is, on average there is a trigram overlap between the passage and question for DSQuAD, but only a bigram overlap for DRoBERTa (Figure 4).3 This is in line with prior observations on lexical overlap as a predictive cue in SQuAD (Weissenborn et al., 2017; Min et al., 2018); questions with less overlap are harder to answer for any of the three models.

We furthermore analyze question types based on the question wh-word. We find that, in contrast to DSQuAD, the datasets collected with a model in the annotation loop have fewer when, how, and in questions, and more which, where, and why questions, as well as questions in the other category, which indicates increased question diversity. In terms of answer types, we observe more common noun and verb phrase clauses than in DSQuAD, as well as fewer dates, names, and numeric answers. This reflects on the strong answer-type matching capabilities of contemporary RC models. The training and validation sets used in this analysis (DBiDAF, DBERT, and DRoBERTa) will be publicly released.
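
The longest n-gram overlap statistic reported in Table 4 can be computed as the longest run of consecutive tokens shared by passage and question. The sketch below is one way to do this with dynamic programming; the tokenization (lowercased word characters) is an assumption, since the text does not specify it, and the function names are hypothetical.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Assumed tokenization: lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())


def longest_ngram_overlap(passage: str, question: str) -> int:
    """Length (in tokens) of the longest contiguous sequence shared by both."""
    p_tokens = tokenize(passage)
    q_tokens = tokenize(question)
    best = 0
    # prev[j] = length of the common run ending at the previous passage token
    # and at question token j (1-indexed); classic longest-common-substring DP.
    prev = [0] * (len(q_tokens) + 1)
    for p_tok in p_tokens:
        curr = [0] * (len(q_tokens) + 1)
        for j, q_tok in enumerate(q_tokens, start=1):
            if p_tok == q_tok:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best
```

For the second Table 1 example, the question ‘‘What determines worker wages?’’ shares only the single token ‘‘wages’’ with its passage, the kind of low-overlap question that becomes more frequent with stronger models in the loop.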

3 Note that the original SQuAD1.1 dataset can be considered a limit case of the adversarial annotation framework, in which the model in the loop always predicts the wrong answer, thus every question is accepted.

Figure 4: Distribution of longest n-gram overlap
between passage and question for different datasets.
µ: mean; σ: standard deviation.

                                 Original        Re-init.
Model     Resource   Split      EM     F1       EM          F1
BiDAF     DBiDAF     dev        0.0    5.3      10.7±0.8    20.4±1.0
BERT      DBERT      dev        0.0    4.9      19.7±1.0    30.1±1.2
RoBERTa   DRoBERTa   dev        0.0    6.1      15.7±0.9    25.8±1.2
BiDAF     DBiDAF     test       0.0    5.5      11.6±1.0    21.3±1.2
BERT      DBERT      test       0.0    5.3      18.9±1.2    29.4±1.1
RoBERTa   DRoBERTa   test       0.0    5.9      16.1±0.8    26.7±0.9

Table 5: Consistency of the adversarial effect (or lack thereof) when retraining the models in the loop on the same data again, but with different random seeds. We report the mean ± standard deviation over 10 re-initialization runs.

4 Experiments

4.1 Consistency of the Model in the Loop

We begin with an experiment regarding the consistency of the adversarial nature of the models in the annotation loop. Our annotation pipeline is designed to reject all samples where the model correctly predicts the answer. How reproducible is this when retraining the model with the same training data? To measure this, we evaluate the performance of instances of BiDAF, BERT, and RoBERTa, which only differ from the model used during annotation in their random initialization and order of mini-batch samples during training. These results are shown in Table 5.

                      Evaluation (Test) Dataset
Model    Trained On   DSQuAD              DBiDAF              DBERT               DRoBERTa            DDROP               DNQ
                      EM        F1        EM        F1        EM        F1        EM        F1        EM        F1        EM        F1
BiDAF    DSQuAD(10K)  40.9±0.6  54.3±0.6  7.1±0.6   15.7±0.6  5.6±0.3   13.5±0.4  5.7±0.4   13.5±0.4  3.8±0.4   8.6±0.6   25.1±1.1  38.7±0.7
BiDAF    DBiDAF       11.5±0.4  20.9±0.4  5.3±0.4   11.6±0.5  7.1±0.4   14.8±0.6  6.8±0.5   13.5±0.6  6.5±0.5   12.4±0.4  15.7±1.1  28.7±0.8
BiDAF    DBERT        10.8±0.3  19.8±0.4  7.2±0.5   14.4±0.6  6.9±0.3   14.5±0.4  8.1±0.4   15.0±0.6  7.8±0.9   14.5±0.9  16.5±0.6  28.3±0.9
BiDAF    DRoBERTa     10.7±0.2  20.2±0.3  6.3±0.7   13.5±0.8  9.4±0.6   17.0±0.6  8.9±0.9   16.0±0.8  15.3±0.8  22.9±0.8  13.4±0.9  27.1±1.2
BERT     DSQuAD(10K)  69.4±0.5  82.7±0.4  35.1±1.9  49.3±2.2  15.6±2.0  27.3±2.1  11.9±1.5  23.0±1.4  18.9±2.3  28.9±3.2  52.9±1.0  68.2±1.0
BERT     DBiDAF       66.5±0.7  80.6±0.6  46.2±1.2  61.1±1.2  37.8±1.4  48.8±1.5  30.6±0.8  42.5±0.6  41.1±2.3  50.6±2.0  54.2±1.2  69.8±0.9
BERT     DBERT        61.2±1.8  75.7±1.6  42.9±1.9  57.5±1.8  37.4±2.1  47.9±2.0  29.3±2.1  40.0±2.3  39.4±2.2  47.6±2.2  49.9±2.3  65.7±2.3
BERT     DRoBERTa     57.0±1.7  71.7±1.8  37.0±2.3  52.0±2.5  34.8±1.5  45.9±2.0  30.5±2.2  41.2±2.2  39.0±3.1  47.4±2.8  45.8±2.4  62.4±2.5
RoBERTa  DSQuAD(10K)  68.6±0.5  82.8±0.3  37.7±1.1  53.8±1.1  20.8±1.2  34.0±1.0  11.0±0.8  22.1±0.9  25.0±2.2  39.4±2.4  43.9±3.8  62.8±3.1
RoBERTa  DBiDAF       64.8±0.7  80.0±0.4  48.0±1.2  64.3±1.1  40.0±1.5  51.5±1.3  29.0±1.9  39.9±1.8  44.5±2.1  55.4±1.9  48.4±1.1  66.9±0.8
RoBERTa  DBERT        59.5±1.0  75.1±0.9  45.4±1.5  60.7±1.5  38.4±1.8  49.8±1.7  28.2±1.5  38.8±1.5  42.2±2.3  52.6±2.0  45.8±1.1  63.6±1.1
RoBERTa  DRoBERTa     56.2±0.7  72.1±0.7  41.4±0.8  57.1±0.8  38.4±1.1  49.5±0.9  30.2±1.3  41.0±1.2  41.2±0.9  51.2±0.8  43.6±1.1  61.6±0.9

Table 6: Training models on various datasets, each with 10,000 samples, and measuring their generalization to different evaluation datasets. Results underlined indicate the best result per model. We report the mean ± standard deviation over 10 runs with different random seeds.

First, we observe, as expected given our annotation constraints, that model performance is 0.0 EM on datasets created with the same respective model in the annotation loop. We observe, however, that retrained models do not reliably perform as poorly on those samples. For example, BERT reaches 19.7 EM, whereas the original model used during annotation provides no correct answer with 0.0 EM. This demonstrates that random model components can substantially affect the adversarial annotation process. The evaluation furthermore serves as a baseline for subsequent model evaluations: This much of the performance range can be learned merely by retraining the same model. A possible takeaway for using the model-in-the-loop annotation strategy in the future is to rely on ensembles of adversaries and reduce the dependency on one particular model instantiation, as investigated by Grefenstette et al. (2018).
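
The aggregation over re-initialization runs can be illustrated as below. This is a sketch with made-up toy predictions: the paper does not state whether the sample or population standard deviation is reported, so the sample version (`statistics.stdev`) is assumed, and the helper names are hypothetical.

```python
import statistics
from typing import Sequence, Tuple


def exact_match_score(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of predictions equal to their reference (case-insensitive)."""
    hits = sum(
        pred.strip().lower() == ref.strip().lower()
        for pred, ref in zip(predictions, references)
    )
    return 100.0 * hits / len(references)


def summarize_runs(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and (assumed) sample standard deviation across retrained models."""
    return statistics.mean(scores), statistics.stdev(scores)


# Toy illustration: two retrained models evaluated on four adversarial questions.
references = ["a", "b", "c", "d"]
run_predictions = [
    ["a", "x", "x", "x"],  # first random seed
    ["a", "b", "x", "x"],  # second random seed
]
scores = [exact_match_score(preds, references) for preds in run_predictions]
mean_em, std_em = summarize_runs(scores)
```

The original model in the loop scores 0.0 EM on such a set by construction, so any non-zero mean across seeds measures how much of the adversarial effect is specific to one particular model instantiation.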

4.2 Adversarial Generalization

A potential problem with the focus on challenging questions is that they might be very distinct from one another, leading to difficulties in learning to generalize to and from them. We conduct a series of experiments in which we train on DBiDAF, DBERT, and DRoBERTa, and observe how well models can learn to generalize to the respective test portions of these datasets. Table 6 shows the results, and there is a multitude of observations.

First, one clear trend we observe across all training data setups is a negative performance progression when evaluated against datasets constructed with a stronger model in the loop. This trend holds true for all but the BiDAF model, in each of the training configurations, and for each of the evaluation datasets. For example, RoBERTa trained on DRoBERTa achieves 72.1, 57.1, 49.5, and 41.0 F1 when evaluated on DSQuAD, DBiDAF, DBERT, and DRoBERTa respectively.

Second, we observe that the BiDAF model is not able to generalize well to datasets constructed with a model in the loop, independent of its training setup. In particular, it is unable to learn from DBiDAF, thus failing to overcome some of its own blind spots through adversarial training. Irrespective of the training dataset, BiDAF consistently performs poorly on the adversarially collected evaluation datasets, and we also note a substantial performance drop when trained on DBiDAF, DBERT, or DRoBERTa and evaluated on DSQuAD.

相比之下, BERT and RoBERTa are able
to partially overcome their blind spots through
training on data collected with a model in the
环形, and to a degree that far exceeds what would
be expected from random retraining (比照. 桌子 5).

669

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你

/
t

A
C
我
/

我

A
r
t
我
C
e
–
p
d

F
/

d
哦

我
/

1
0
1
1
6
2

/
t

我

A
C
_
A
_
0
0
3
3
8
1
9
2
3
6
5
8

/
t

我

A
C
_
A
_
0
0
3
3
8
p
d

乙
y
G
你
e
s
t

哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3

                            Evaluation (Test) Dataset
Model    Training Dataset   DSQuAD              DBiDAF              DBERT               DRoBERTa
                            EM        F1        EM        F1        EM        F1        EM        F1
BiDAF    DSQuAD             56.7±0.5  70.1±0.3  11.6±1.0  21.3±1.1   8.6±0.6  17.3±0.8   8.3±0.7  16.8±0.5
BiDAF    DSQuAD + DBiDAF    56.3±0.6  69.7±0.4  14.4±0.9  24.4±0.9  15.6±1.1  24.7±1.1  14.3±0.5  23.3±0.7
BiDAF    DSQuAD + DBERT     56.2±0.6  69.4±0.6  14.4±0.7  24.2±0.8  15.7±0.6  25.1±0.6  13.9±0.8  22.7±0.8
BiDAF    DSQuAD + DRoBERTa  56.2±0.7  69.6±0.6  14.7±0.9  24.8±0.8  17.9±0.5  26.7±0.6  16.7±1.1  25.0±0.8
BERT     DSQuAD             74.8±0.3  86.9±0.2  46.4±0.7  60.5±0.8  24.4±1.2  35.9±1.1  17.3±0.7  28.9±0.9
BERT     DSQuAD + DBiDAF    75.2±0.4  87.2±0.2  52.4±0.9  66.5±0.9  40.9±1.3  51.2±1.5  32.9±0.9  44.1±0.8
BERT     DSQuAD + DBERT     75.1±0.3  87.1±0.3  54.1±1.0  68.0±0.8  43.7±1.1  54.1±1.3  34.7±0.7  45.7±0.8
BERT     DSQuAD + DRoBERTa  75.3±0.4  87.1±0.3  53.0±1.1  67.1±0.8  44.1±1.1  54.4±0.9  36.6±0.8  47.8±0.5
RoBERTa  DSQuAD             73.2±0.4  86.3±0.2  48.9±1.1  64.3±1.1  31.3±1.1  43.5±1.2  16.1±0.8  26.7±0.9
RoBERTa  DSQuAD + DBiDAF    73.9±0.4  86.7±0.2  55.0±1.4  69.7±0.9  46.5±1.1  57.3±1.1  31.9±0.8  42.4±1.0
RoBERTa  DSQuAD + DBERT     73.8±0.2  86.7±0.2  55.4±1.0  70.1±0.9  48.9±1.0  59.0±1.2  32.9±1.3  43.7±1.4
RoBERTa  DSQuAD + DRoBERTa  73.5±0.3  86.5±0.2  55.9±0.7  70.6±0.7  49.1±1.2  59.5±1.2  34.7±1.0  45.9±1.2

Table 7: Training models on SQuAD, as well as SQuAD combined with different adversarially created
datasets. Results underlined indicate the best result per model. We report the mean ± standard
deviation over 10 runs with different random seeds.

For example, BERT reaches 47.9F1 when trained
and evaluated on DBERT, while RoBERTa trained
on DRoBERTa reaches 41.0F1 on DRoBERTa, both
considerably better than random retraining or
when training on the non-adversarially collected
DSQuAD(10K), showing gains of 20.6F1 for BERT
and 18.9F1 for RoBERTa. These observations
suggest that there exists learnable structure among
harder questions that can be picked up by some
of the models, yet not all, as BiDAF fails to
achieve this. The fact that even BERT can learn to
generalize to DRoBERTa, but BiDAF cannot
generalize to DBERT, suggests the existence of an
inherent limitation to what BiDAF can learn from
these new samples, compared with BERT and RoBERTa.

More generally, we observe that training on
DS, where S is a stronger RC model, helps
generalize to DW, where W is a weaker model; for
example, training on DRoBERTa and testing on
DBERT. On the other hand, training on DW also
leads to generalization towards DS. For example,
RoBERTa trained on 10,000 SQuAD samples
reaches 22.1F1 on DRoBERTa (DS), whereas
training RoBERTa on DBiDAF and DBERT (DW) raises
this number to 39.9F1 and 38.8F1, respectively.

Third, we observe similar performance
degradation patterns for both BERT and RoBERTa
on DSQuAD when trained on data collected with
increasingly stronger models in the loop. For
example, RoBERTa evaluated on DSQuAD
achieves 82.8, 80.0, 75.1, and 72.1F1 when trained
on DSQuAD(10K), DBiDAF, DBERT, and DRoBERTa,
respectively. This may indicate a gradual shift
in the distributions of composed questions as the
model in the loop gets stronger.

These observations suggest an encouraging
takeaway for the model-in-the-loop annotation
paradigm: even though a particular model might
be chosen as an adversary in the annotation
loop, and at some point falls behind more recent
state-of-the-art models, these future models can
still benefit from data collected with the weaker
model, and also generalize better to samples
composed with the stronger model in the loop.

We further show experimental results for the
same models and training datasets, but now
including SQuAD as additional training data, in
Table 7. In this training setup we generally see
improved generalization to DBiDAF, DBERT, and
DRoBERTa. Interestingly, the relative differences
between DBiDAF, DBERT, and DRoBERTa as training
sets used in conjunction with SQuAD are much
diminished, and especially DRoBERTa as (part of)
the training set now generalizes substantially
better. We see that BERT and RoBERTa both
show consistent performance gains with the
addition of the original SQuAD1.1 training data,
but unlike in Table 6, this comes without any
noticeable decline in performance on DSQuAD,
suggesting that the adversarially constructed
datasets expose inherent model weaknesses, as
investigated by Liu et al. (2019a).

         Evaluation (Test) Dataset
Model    DSQuAD              DBiDAF              DBERT               DRoBERTa
         EM        F1        EM        F1        EM        F1        EM        F1
BiDAF    57.1±0.4  70.4±0.3  17.1±0.8  27.0±0.9  20.0±1.0  29.2±0.8  18.3±0.6  27.4±0.7
BERT     75.5±0.2  87.2±0.2  57.7±1.0  71.0±1.1  52.1±0.7  62.2±0.7  43.0±1.1  54.2±1.0
RoBERTa  74.2±0.3  86.9±0.3  59.8±0.5  74.1±0.6  55.1±0.6  65.1±0.7  41.6±1.0  52.7±1.0

Table 8: Training models on SQuAD combined with all the adversarially created datasets DBiDAF,
DBERT, and DRoBERTa. Results underlined indicate the best result per model. We report the mean ±
standard deviation over 10 runs with different random seeds.
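
The result tables in this section report each score as a mean with a standard deviation over 10 runs with different random seeds. A minimal sketch of that aggregation is given below. It assumes the sample standard deviation, since the text does not state whether the sample or population form was used, and the function names are hypothetical rather than taken from the authors' code.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) over per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a deviation")
    return mean(scores), stdev(scores)


def format_cell(scores):
    """Render one table cell, e.g. '2.0 ± 1.0', from per-seed scores."""
    avg, dev = summarize_runs(scores)
    return f"{avg:.1f} ± {dev:.1f}"
```

In use, `format_cell` would be called once per (model, training set, evaluation set, metric) combination with the list of scores obtained from the individual seeds.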

Furthermore, RoBERTa achieves the strongest
results on the adversarially collected
evaluation sets, in particular when trained on
DSQuAD + DRoBERTa. This stands in contrast
to the results in Table 6, where training on DBiDAF
in several cases led to better generalization than
training on DRoBERTa. A possible explanation is
that training on DRoBERTa leads to a larger degree
of overfitting to specific adversarial examples in
DRoBERTa than training on DBiDAF, and that the
inclusion of a large number of standard SQuAD
training samples can mitigate this effect.

Results for the models trained on all the
datasets combined (DSQuAD, DBiDAF, DBERT, and
DRoBERTa) are shown in Table 8. These further
support the previous observations and provide
additional performance gains where, for
example, RoBERTa achieves F1 scores of 86.9 on
DSQuAD, 74.1 on DBiDAF, 65.1 on DBERT, and 52.7
on DRoBERTa, surpassing the best previous
performance on all adversarial datasets.

Finally, we identify a risk of datasets
constructed with weaker models in the loop
becoming outdated. For example, RoBERTa achieves
58.2EM/73.2F1 on DBiDAF, in contrast to
0.0EM/5.5F1 for BiDAF; the RoBERTa result is not
far from the non-expert human performance of
62.6EM/78.5F1 (cf. Table 2).

It is also interesting to note that, even when
training on all the combined data (cf. Table 8),
BERT outperforms RoBERTa on DRoBERTa and
vice versa, suggesting that there may exist
weaknesses inherent to each model class.

4.3 Generalization to Non-Adversarial Data

Compared with standard annotation, the
model-in-the-loop approach generally results in new
question distributions. Consequently, models trained
on adversarially composed questions might not be
able to generalize to standard (“easy”) questions,
thus limiting the practical usefulness of the
resulting data. To what extent do models trained
on model-in-the-loop questions generalize
differently to standard (“easy”) questions, compared
with models trained on standard (“easy”)
questions?

To measure this we further train each of our three
models on either DBiDAF, DBERT, or DRoBERTa
and test on DSQuAD, with results in the DSQuAD
columns of Table 6. For comparison, the models
are also trained on 10,000 SQuAD1.1 samples
(referred to as DSQuAD(10K)) chosen from the same
passages as the adversarial datasets, thus
eliminating size and paragraph choice as potential
confounding factors. The models are tuned for EM
on the held-out DSQuAD validation set. Note that,
although performance values on the majority vote
DSQuAD dataset are lower than on the original, for
the reasons described earlier, this enables direct
comparisons across all datasets.
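
For readers less familiar with the two metrics used throughout this section, the sketch below shows how exact match (EM) and token-level F1 are commonly computed for SQuAD-style span answers. It assumes the usual SQuAD answer normalization (lowercasing, stripping punctuation and English articles) and a single gold answer per question; it is an illustration of the metrics, not the authors' evaluation script.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

A prediction that contains the gold span plus extra words scores 0 on EM but receives partial credit on F1, which is why the F1 columns are consistently higher than the EM columns.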

Notably, neither BERT nor RoBERTa
shows substantial drops when trained on DBiDAF
compared to training on SQuAD data (−2.1F1
and −2.8F1): training these models on a dataset
with a weaker model in the loop still leads to
strong generalization even to data from the
original SQuAD distribution, which all models in the
loop are trained on. BiDAF, on the other hand,
fails to learn such information from the
adversarially collected data, and drops by more than
30F1 for each of the new training sets, compared to
training on SQuAD.

We also observe a gradual decrease in
generalization to SQuAD as we move from training on
DBiDAF towards training on DRoBERTa. This suggests that
the stronger the model, the more dissimilar the
resulting data distribution becomes from the
original SQuAD distribution. We later find further
support for this explanation in a qualitative
analysis (Section 5). It may, however, also be due
to a limitation of BERT and RoBERTa, similar
to BiDAF, in learning from a data distribution
designed to beat these models; an even stronger
model might learn more from, for example,
DRoBERTa.

4.4 Generalization to DROP and NaturalQuestions

Finally, we investigate to what extent models
can transfer skills learned on the datasets created
with a model in the loop to two recently
introduced datasets: DROP (Dua et al., 2019), and
NaturalQuestions (Kwiatkowski et al., 2019). In
this experiment we select the subsets of DROP and
NaturalQuestions that align with the structural
constraints of SQuAD to ensure a like-for-like
analysis. Specifically, we only consider questions
in DROP where the answer is a span in the passage
and where there is only one candidate answer.
For NaturalQuestions, we consider all non-tabular
long answers as passages, remove HTML tags,
and use the short answer as the extracted span.
We apply this filtering on the validation sets for
both datasets. Next we split them, stratifying by
document (as we did for DSQuAD), which results
in 1409/1418 validation and test set examples
for DROP, and 964/982 for NaturalQuestions,
respectively. We denote these datasets as DDROP
and DNQ for clarity and distinction from their
unfiltered versions. We consider the same models
and training datasets as before, but tune on the
respective validation sets of DDROP and DNQ.
Table 6 shows the results of these experiments in
the respective DDROP and DNQ columns.
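
The filtering and splitting steps described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the record fields `doc_id`, `passage`, and `answers` are hypothetical, "stratifying by document" is interpreted here as keeping all questions from one document on the same side of the split, and the greedy balancing rule is one simple choice among several.

```python
import random
from collections import defaultdict


def keep_squad_like(example):
    """Keep examples with exactly one candidate answer that is a passage span."""
    answers = example["answers"]
    return (
        len(answers) == 1
        and bool(answers[0])
        and answers[0] in example["passage"]
    )


def split_by_document(examples, seed=0):
    """Split into (validation, test) so that no document appears in both."""
    by_doc = defaultdict(list)
    for example in examples:
        by_doc[example["doc_id"]].append(example)

    doc_ids = sorted(by_doc)
    random.Random(seed).shuffle(doc_ids)

    validation, test = [], []
    for doc_id in doc_ids:
        # Greedily add each whole document to the currently smaller side.
        target = validation if len(validation) <= len(test) else test
        target.extend(by_doc[doc_id])
    return validation, test
```

Splitting at the document level avoids the same passage contributing questions to both the validation and the test portion, which would otherwise make tuning on the validation set leak information about the test set.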

First, we observe clear generalization
improvements towards DDROP across all models compared
to training on DSQuAD(10K) when training on any
of DBiDAF, DBERT, or DRoBERTa. That is, including
a model in the loop for the training dataset leads
to improved transfer towards DDROP. Note that
DROP also makes use of a BiDAF model in
the loop during annotation; these results are in
line with our prior observations when testing the
same setups on DBiDAF, DBERT, and DRoBERTa,
compared to training on DSQuAD(10K).

Second, we observe overall strong transfer
results towards DNQ, with up to 69.8F1 for a BERT
model trained on DBiDAF. Note that this result is
similar to, and even slightly improves over, model
training with SQuAD data of the same size. That
is, relative to training on SQuAD data, training
on adversarially collected data DBiDAF does not
impede generalization to the DNQ dataset, which
was created without a model in the annotation
loop. We then, however, see a similar negative
performance progression as observed before when
testing on DSQuAD: the stronger the model in
the annotation loop of the training dataset, the
lower the test accuracy on test data from a data
distribution composed without a model in the loop.

5 Qualitative Analysis

Having applied the general model-in-the-loop
methodology on models of varying strength, we
next perform a qualitative comparison of the
nature of the resulting questions. As reference points
we also include the original SQuAD questions,
as well as DROP and NaturalQuestions, in this
comparison: these datasets are both constructed
to overcome limitations in SQuAD and have
subsets sufficiently similar to SQuAD to make an
analysis possible. Specifically, we seek to
understand the qualitative differences in terms of
reading comprehension challenges posed by the
questions in each of these datasets.

5.1 Comprehension Requirements

There exists a variety of prior work that seeks
to understand the types of knowledge,
comprehension skills, or types of reasoning required to
answer questions based on text (Rajpurkar et al.,
2016; Clark et al., 2018; Sugawara et al., 2019;
Dua et al., 2019; Dasigi et al., 2019); we are,
however, unaware of any commonly accepted
formalism. We take inspiration from these but
develop our own taxonomy of comprehension
requirements which suits the datasets analyzed.
Our taxonomy contains 13 labels, most of which
are commonly used in other work. However, the
following three deserve additional clarification:
i) explicit: the answer is stated nearly
word-for-word in the passage as it is in the
question; ii) filtering: a set of answers is narrowed
down to select one by some particular
distinguishing feature; and iii) implicit: the answer
builds on information implied by the passage and
does not otherwise require any of the other types
of reasoning.

Figure 5: Comparison of comprehension types for the questions in different datasets. The label types are neither
mutually exclusive nor comprehensive. Values above columns indicate excess of the axis range.

We annotate questions with labels from this
catalogue in a manner that is neither mutually
exclusive nor fully comprehensive; the
development of such a catalogue is itself very
challenging. Instead, we focus on capturing the
most salient characteristics of each given question,
and assign it up to three of the labels in our
catalogue. In total, we analyze 100 samples from
the validation set of each of the datasets; Figure 5
shows the results.
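
Because each question can carry up to three labels, the per-label percentages in such an analysis need not sum to 100. A small sketch of how these proportions could be computed is shown below; the function name and the example label lists are hypothetical and do not reproduce the annotated data.

```python
from collections import Counter


def label_percentages(annotations):
    """Percentage of questions carrying each label.

    `annotations` holds one list of labels per question. Labels are not
    mutually exclusive, so the percentages may sum to more than 100.
    """
    total = len(annotations)
    if total == 0:
        return {}
    # set() guards against a label being listed twice for one question.
    counts = Counter(label for labels in annotations for label in set(labels))
    return {label: 100.0 * count / total for label, count in counts.items()}


# Hypothetical annotations for four questions.
example_annotations = [
    ["explicit"],
    ["paraphrasing", "multi-hop"],
    ["explicit", "paraphrasing"],
    ["implicit"],
]
example_result = label_percentages(example_annotations)
```

With 100 annotated samples per dataset, each count maps directly to a percentage, which is how figures such as "57% explicit" can be read.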

5.2 Observations

An initial observation is that the majority (57%)
of answers to SQuAD questions are stated
explicitly, without comprehension requirements beyond
the literal level. This number decreases
substantially for any of the model-in-the-loop datasets
derived from SQuAD (e.g., 8% for DBiDAF)
and also DDROP, whereas 42% of questions in DNQ
share this property. In contrast to SQuAD, the
model-in-the-loop questions generally tend to
involve more paraphrasing. They also require
more external knowledge, and multi-hop
inference (beyond co-reference resolution) with an
increasing trend for stronger models used in
the annotation loop. Model-in-the-loop questions
further fan out into a variety of small, but
non-negligible proportions of more specific types
of inference required for comprehension, for
example, spatial or temporal inference (both
going beyond explicitly stated spatial or temporal
information); SQuAD questions rarely require
these at all. Some of these more particular
inference types are common features of the other two
datasets, in particular comparative questions for
DROP (60%) and to a small extent also
NaturalQuestions. Interestingly, DBiDAF possesses the
largest number of comparison questions (11%)
among our model-in-the-loop datasets, whereas
DBERT and DRoBERTa only possess 1% and 3%,
respectively. This offers an explanation for our
previous observation in Table 6, where BERT and
RoBERTa perform better on DDROP when trained
on DBiDAF rather than on DBERT or DRoBERTa. It is
likely that BiDAF as a model in the loop is worse
than BERT and RoBERTa at comparative
questions, as evidenced by the results in Table 6, with
BiDAF reaching 8.6F1, BERT reaching 28.9F1,
and RoBERTa reaching 39.4F1 on DDROP (when
trained on DSQuAD(10K)).

The distribution of NaturalQuestions contains
elements of both the SQuAD and DBiDAF
distributions, which offers a potential explanation
for the strong performance on DNQ of models
trained on DSQuAD(10K) and DBiDAF. Finally, the
gradually shifting distribution away from both
SQuAD and NaturalQuestions as the
model-in-the-loop strength increases reflects our prior
observations on the decreasing performance on
SQuAD and NaturalQuestions of models trained
on datasets with progressively stronger models in
the loop.

6 Discussion and Conclusions

We have investigated an RC annotation
paradigm that requires a model in the loop to be
“beaten” by an annotator. Applying this approach
with progressively stronger models in the loop
(BiDAF, BERT, and RoBERTa), we produced
three separate datasets. Using these datasets,
we investigated several questions regarding the
annotation paradigm, in particular, whether such
datasets grow outdated as stronger models
emerge, and their generalization to standard
(non-adversarially collected) questions. We find
that stronger models can still learn from data
collected with a weak adversary in the loop, and
their generalization improves even on datasets
collected with a stronger adversary. Models
trained on data collected with a model in the
loop further generalize well to non-adversarially
collected data, both on SQuAD and on
NaturalQuestions, yet we observe a gradual
deterioration in performance with progressively stronger
adversaries.

We see our work as a contribution towards
the emerging paradigm of model-in-the-loop
annotation. Although this paper has focused on
RC, with SQuAD as the original dataset used
to train model adversaries, we see no reason in
principle why findings would not be similar for
other tasks using the same annotation paradigm,
when crowdsourcing challenging samples with a
model in the loop. We would expect the insights
and benefits conveyed by model-in-the-loop
annotation to be the greatest on mature datasets
where models exceed human performance: here
the resulting data provides a magnifying glass
on model performance, focused in particular on
samples which models struggle on. On the other
hand, applying the method to datasets where
performance has not yet plateaued would likely
result in a more similar distribution to the original
data, which is challenging to models a priori.
We hope that the series of experiments on
replicability, observations on transfer between datasets
collected using models of different strength, as
well as our findings regarding generalization to
non-adversarially collected data, can support and
inform future research and annotation efforts using
this paradigm.

Acknowledgments

The authors would like to thank Christopher Potts
for his detailed and constructive feedback, and
our reviewers. This work was supported by the
European Union’s Horizon 2020 Research and
Innovation Programme under grant agreement
No. 875160 and the UK Defence Science and
Technology Laboratory (Dstl) and Engineering
and Physical Research Council (EPSRC) under
grant EP/R018693/1 as a part of the collaboration
between US DOD, UK MOD, and UK EPSRC
under the Multidisciplinary University Research
Initiative (MURI).

References

Samuel R. Bowman, Gabor Angeli, Christopher
Potts, and Christopher D. Manning. 2015. A
large annotated corpus for learning natural
language inference. In Proceedings of the 2015
Conference on Empirical Methods in Natural
Language Processing, pages 632–642, Lisbon,
Portugal. Association for Computational
Linguistics. DOI: https://doi.org/10.18653/v1/D15-1075

Danqi Chen, Jason Bolton, and Christopher D.
Manning. 2016. A thorough examination of
the CNN/Daily Mail reading comprehension
task. In Proceedings of the 54th Annual
Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers),
pages 2358–2367. Berlin, Germany.
Association for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/P16-1223, PMID: 30036459

Michael Chen, Mike D’Arcy, Alisa Liu, Jared
Fernandez, and Doug Downey. 2019. CODAH:
An adversarially-authored question answering
dataset for common sense. In Proceedings
of the 3rd Workshop on Evaluating Vector
Space Representations for NLP, pages 63–69,
Minneapolis, USA. Association for
Computational Linguistics.

Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar,
Wen-tau Yih, Yejin Choi, Percy Liang, and
Luke Zettlemoyer. 2018. QuAC: Question
answering in context. In Proceedings of the 2018
Conference on Empirical Methods in
Natural Language Processing, pages 2174–2184,
Brussels, Belgium. Association for
Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D18-1241, PMID: 30142985

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar
Khot, Ashish Sabharwal, Carissa Schoenick,
and Oyvind Tafjord. 2018. Think you have
solved question answering? Try ARC, the AI2
reasoning challenge. CoRR, abs/1803.05457.

Pradeep Dasigi, Nelson F. Liu, Ana Marasović,
Noah A. Smith, and Matt Gardner. 2019.
Quoref: A reading comprehension dataset with
questions requiring coreferential reasoning.
In Proceedings of the 2019 Conference on
Empirical Methods in Natural Language
Processing and the 9th International Joint
Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 5925–5932, Hong
Kong, China. Association for Computational
Linguistics. DOI: https://doi.org/10.18653/v1/D19-1606

Jia Deng, R. Socher, Li Fei-Fei, Wei Dong,
Kai Li, and Li-Jia Li. 2009. ImageNet: A
large-scale hierarchical image database. In
2009 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pages 248–255.
DOI: https://doi.org/10.1109/CVPR.2009.5206848

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of the 2019
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, Volume 1
(Long and Short Papers), pages 4171–4186,
Minneapolis, Minnesota. Association for
Computational Linguistics.

Emily Dinan, Samuel Humeau, Bharath Chintagunta,
and Jason Weston. 2019. Build it break it fix it
for dialogue safety: Robustness from
adversarial human attack. In Proceedings of the 2019
Conference on Empirical Methods in Natural
Language Processing and the 9th
International Joint Conference on Natural Language
Processing (EMNLP-IJCNLP), pages 4537–4546,
Hong Kong, China. Association for
Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D19-1461

Dheeru Dua, Yizhong Wang, Pradeep Dasigi,
Gabriel Stanovsky, Sameer Singh, and Matt
Gardner. 2019. DROP: A reading
comprehension benchmark requiring discrete reasoning
over paragraphs. In Proceedings of the 2019
Conference of the North American Chapter
of the Association for Computational
Linguistics: Human Language Technologies, Volume 1
(Long and Short Papers), pages 2368–2378,
Minneapolis, Minnesota. Association for
Computational Linguistics.

Allyson Ettinger, Sudha Rao, Hal Daumé III, and
Emily M. Bender. 2017. Towards linguistically
generalizable NLP systems: A workshop and
shared task. CoRR, abs/1711.01505.

Matt Gardner, Joel Grus, Mark Neumann,
Oyvind Tafjord, Pradeep Dasigi, Nelson F.
Liu, Matthew Peters, Michael Schmitz, and
Luke Zettlemoyer. 2018. AllenNLP: A deep
semantic natural language processing platform.
In Proceedings of Workshop for NLP Open
Source Software (NLP-OSS), pages 1–6,
Melbourne, Australia. Association for
Computational Linguistics. DOI:
https://doi.org/10.18653/v1/W18-2501, PMCID: PMC5753512

Edward Grefenstette, Robert Stanforth, Brendan
O’Donoghue, Jonathan Uesato, Grzegorz
Swirszcz, and Pushmeet Kohli. 2018. Strength
in numbers: Trading-off robustness and
computation via adversarially-trained ensembles.
CoRR, abs/1811.09300.

Suchin Gururangan, Swabha Swayamdipta, Omer
Levy, Roy Schwartz, Samuel Bowman, and
Noah A. Smith. 2018. Annotation artifacts in
natural language inference data. In Proceedings
of the 2018 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
Volume 2 (Short Papers), pages 107–112,
New Orleans, Louisiana. Association for
Computational Linguistics. DOI:
https://doi.org/10.18653/v1/N18-2017

Karl Moritz Hermann, Tomas Kocisky, Edward
Grefenstette, Lasse Espeholt, Will Kay,
Mustafa Suleyman, and Phil Blunsom. 2015.
Teaching machines to read and comprehend.
In C. Cortes, N. D. Lawrence, D. D. Lee,
M. Sugiyama, and R. Garnett, editors, Advances
in Neural Information Processing Systems 28,
pages 1693–1701. Curran Associates, Inc.

Robin Jia and Percy Liang. 2017. Adversarial
examples for evaluating reading comprehension
systems. In Proceedings of the 2017 Conference
on Empirical Methods in Natural Language
Processing, pages 2021–2031, Copenhagen,
Denmark. Association for Computational Linguistics.
DOI: https://doi.org/10.18653/v1/D17-1215

Mandar Joshi, Eunsol Choi, Daniel Weld, and
Luke Zettlemoyer. 2017. TriviaQA: A large
scale distantly supervised challenge dataset
for reading comprehension. In Proceedings
of the 55th Annual Meeting of the
Association for Computational Linguistics (Volume 1:
Long Papers), pages 1601–1611, Vancouver,
Canada. Association for Computational
Linguistics. DOI: https://doi.org/10.18653/v1/P17-1147

Divyansh Kaushik, Eduard Hovy, and Zachary
Lipton. 2020. Learning the difference that makes
a difference with counterfactually-augmented
data. In International Conference on Learning
Representations.

Divyansh Kaushik and Zachary C. Lipton. 2018.
How much reading does reading
comprehension require? A critical investigation of
popular benchmarks. In Proceedings of the
2018 Conference on Empirical Methods in
Natural Language Processing, pages 5010–5015,
Brussels, Belgium. Association for
Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D18-1546

Tomáš Kočiský, Jonathan Schwarz, Phil
Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor
Melis, and Edward Grefenstette. 2018. The
NarrativeQA reading comprehension
challenge. Transactions of the Association for
Computational Linguistics, 6:317–328. DOI:
https://doi.org/10.1162/tacl_a_00023

Tom Kwiatkowski, Jennimaria Palomaki, Olivia
Redfield, Michael Collins, Ankur Parikh, Chris
Alberti, Danielle Epstein, Illia Polosukhin,
Jacob Devlin, Kenton Lee, Kristina Toutanova,
Llion Jones, Matthew Kelcey, Ming-Wei
Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc
Le, and Slav Petrov. 2019. Natural Questions:
A benchmark for question answering research.
Transactions of the Association for
Computational Linguistics, 7:453–466. DOI:
https://doi.org/10.1162/tacl_a_00276

David D. Lewis and William A. Gale. 1994. A
sequential algorithm for training text classifiers.
In SIGIR, pages 3–12. ACM/Springer.

Nelson F. Liu, Roy Schwartz, and Noah A. Smith.
2019a. Inoculation by fine-tuning: A method
for analyzing challenge datasets. In
Proceedings of the 2019 Conference of the North
American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers),
pages 2171–2179, Minneapolis, Minnesota.
Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du,
Mandar Joshi, Danqi Chen, Omer Levy, Mike
Lewis, Luke Zettlemoyer, and Veselin Stoyanov.
2019b. RoBERTa: A robustly optimized BERT
pretraining approach. CoRR, abs/1907.11692.

Mitchell P. Marcus, Beatrice Santorini, and
Mary Ann Marcinkiewicz. 1993. Building a
large annotated corpus of English: The Penn
Treebank. Computational Linguistics, 19(2):
313–330. DOI: https://doi.org/10.21236/ADA273556

Sewon Min, Victor Zhong, Richard Socher,
and Caiming Xiong. 2018. Efficient and robust
question answering from minimal context
over documents. In Proceedings of the 56th
Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers),
pages 1725–1735, Melbourne, Australia.
Association for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/P18-1160

Mike Mintz, Steven Bills, Rion Snow, and
Daniel Jurafsky. 2009. Distant supervision
for relation extraction without labeled data.
In Proceedings of the Joint Conference of
the 47th Annual Meeting of the ACL and
the 4th International Joint Conference on
Natural Language Processing of the AFNLP,
pages 1003–1011, Suntec, Singapore.
Association for Computational Linguistics. DOI:
https://doi.org/10.3115/1690219.1690287

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng
Gao, Saurabh Tiwary, Rangan Majumder, and
Li Deng. 2016. MS MARCO: A human
generated MAchine Reading COmprehension
dataset. arXiv preprint arXiv:1611.09268.

Yixin Nie, Adina Williams, Emily Dinan, Mohit
Bansal, Jason Weston, and Douwe Kiela. 2019.
Adversarial NLI: A new benchmark for
natural language understanding. arXiv preprint
arXiv:1910.14599. DOI:
https://doi.org/10.18653/v1/2020.acl-main.441

Pranav Rajpurkar, Robin Jia, and Percy Liang.
2018. Know what you don’t know:
Unanswerable questions for SQuAD. In Proceedings of
the 56th Annual Meeting of the Association for
Computational Linguistics (Volume 2: Short
Papers), pages 784–789, Melbourne, Australia.
Association for Computational Linguistics.
DOI: https://doi.org/10.18653/v1/P18-2124

Pranav Rajpurkar, Jian Zhang, Konstantin
Lopyrev, and Percy Liang. 2016. SQuAD:
100,000+ questions for machine
comprehension of text. In Proceedings of the 2016
Conference on Empirical Methods in
Natural Language Processing, pages 2383–2392,
Austin, Texas. Association for Computational
Linguistics. DOI: https://doi.org/10.18653/v1/D16-1264

Siva Reddy, Danqi Chen, and Christopher D.
Manning. 2019. CoQA: A conversational
question answering challenge. Transactions of the
Association for Computational Linguistics,
7:249–266. DOI: https://doi.org/10.1162/tacl_a_00266

Matthew Richardson, Christopher J. C. Burges,
and Erin Renshaw. 2013. MCTest: A
challenge dataset for the open-domain machine
comprehension of text. In Proceedings of the
2013 Conference on Empirical Methods in
Natural Language Processing, pages 193–203,
Seattle, Washington, USA. Association for
Computational Linguistics.

Marzieh Saeidi, Max Bartolo, Patrick Lewis,
Sameer Singh, Tim Rocktäschel, Mike Sheldon,
Guillaume Bouchard, and Sebastian Riedel.
2018. Interpretation of natural language rules
in conversational machine reading. In
Proceedings of the 2018 Conference on Empirical
Methods in Natural Language Processing,
pages 2087–2097, Brussels, Belgium.
Association for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D18-1233

Roy Schwartz, Maarten Sap, Ioannis Konstas,
Leila Zilles, Yejin Choi, and Noah A. Smith.
2017. The effect of different writing tasks on
linguistic style: A case study of the ROC story
cloze task. In Proceedings of the 21st
Conference on Computational Natural Language
Learning (CoNLL 2017), pages 15–25,
Vancouver, Canada. Association for Computational
Linguistics. DOI: https://doi.org/10.18653/v1/K17-1004

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi,
and Hannaneh Hajishirzi. 2017. Bidirectional
attention flow for machine comprehension.
In The International Conference on Learning
Representations (ICLR).

Rion Snow, Brendan O’Connor, Daniel Jurafsky,
and Andrew Ng. 2008. Cheap and fast – but
is it good? Evaluating non-expert annotations
for natural language tasks. In Proceedings of the
2008 Conference on Empirical Methods in
Natural Language Processing, pages 254–263,
Honolulu, Hawaii. Association for
Computational Linguistics. DOI:
https://doi.org/10.3115/1613715.1613751

Saku Sugawara, Kentaro Inui, Satoshi Sekine,
and Akiko Aizawa. 2018. What makes reading
comprehension questions easier? In
Proceedings of the 2018 Conference on Empirical
Methods in Natural Language Processing,
pages 4208–4219, Brussels, Belgium.
Association for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D18-1453

Saku Sugawara, Pontus Stenetorp, Kentaro Inui,
and Akiko Aizawa. 2019. Assessing the
benchmarking capacity of machine reading
comprehension datasets. CoRR, abs/1911.09241.

James Thorne, Andreas Vlachos, Oana Cocarascu,
Christos Christodoulopoulos, and Arpit Mittal.
2019. The FEVER2.0 shared task. In
Proceedings of the Second Workshop on Fact Extraction
and VERification (FEVER), pages 1–6, Hong
Kong, China. Association for Computational
Linguistics. DOI: https://doi.org/10.18653/v1/D19-6601,
PMCID: PMC6533707

Adam Trischler, Tong Wang, Xingdi Yuan, Justin
Harris, Alessandro Sordoni, Philip Bachman,
and Kaheer Suleman. 2017. NewsQA: A machine
comprehension dataset. In Proceedings of
the 2nd Workshop on Representation Learning
for NLP, pages 191–200, Vancouver, Canada.
Association for Computational Linguistics.
DOI: https://doi.org/10.18653/v1/W17-2623

Eric Wallace, Pedro Rodriguez, Shi Feng, Ikuya
Yamada, and Jordan Boyd-Graber. 2019. Trick
me if you can: Human-in-the-loop generation
of adversarial examples for question answering.
Transactions of the Association for Computational
Linguistics, 7:387–401. DOI:
https://doi.org/10.1162/tacl_a_00279

Dirk Weissenborn, Georg Wiese, and Laura Seiffe.
2017. Making neural QA as simple as possible
but not simpler. In Proceedings of the 21st
Conference on Computational Natural Language
Learning (CoNLL 2017), pages 271–280,
Vancouver, Canada. Association for Computational
Linguistics. DOI: https://doi.org/10.18653/v1/K17-1028

Johannes Welbl, Pontus Stenetorp, and Sebastian
Riedel. 2018. Constructing datasets for multi-
hop reading comprehension across documents.
Transactions of the Association for Computational
Linguistics, 6:287–302. DOI:
https://doi.org/10.1162/tacl_a_00021

Thomas Wolf, Lysandre Debut, Victor Sanh,
Julien Chaumond, Clement Delangue, Anthony
Moi, Pierric Cistac, Tim Rault, Rémi Louf,
Morgan Funtowicz, and Jamie Brew. 2019.
HuggingFace’s Transformers: State-of-the-art
Natural Language Processing.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua
Bengio, William Cohen, Ruslan Salakhutdinov,
and Christopher D. Manning. 2018a. HotpotQA:
A dataset for diverse, explainable multi-hop
question answering. In Proceedings of the 2018
Conference on Empirical Methods in Natural
Language Processing, pages 2369–2380,
Brussels, Belgium. Association for Computational
Linguistics. DOI: https://doi.org/10.18653/v1/D18-1259,
PMCID: PMC6156886

Zhilin Yang, Saizheng Zhang, Jack Urbanek, Will
Feng, Alexander Miller, Arthur Szlam, Douwe
Kiela, and Jason Weston. 2018b. Mastering
the dungeon: Grounded language learning by
turker descent. In International
Conference on Learning Representations.

Rowan Zellers, Yonatan Bisk, Roy Schwartz, and
Yejin Choi. 2018. SWAG: A large-scale adversarial
dataset for grounded commonsense inference.
In Proceedings of the 2018 Conference
on Empirical Methods in Natural Language
Processing, pages 93–104, Brussels, Belgium.
Association for Computational Linguistics.
DOI: https://doi.org/10.18653/v1/D18-1009

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali
Farhadi, and Yejin Choi. 2019. HellaSwag:
Can a machine really finish your sentence?
In Proceedings of the 57th Annual Meeting
of the Association for Computational
Linguistics, pages 4791–4800, Florence,
Italy. Association for Computational Linguistics.
DOI: https://doi.org/10.18653/v1/P19-1472

Sheng Zhang, Xiaodong Liu, Jingjing Liu,
Jianfeng Gao, Kevin Duh, and Benjamin Van
Durme. 2018. ReCoRD: Bridging the gap between
human and machine commonsense reading
comprehension. arXiv preprint arXiv:1810.12885.
