Natural Questions: A Benchmark for Question Answering Research


Tom Kwiatkowski♣♦♠∗ Jennimaria Palomaki♠ Olivia Redfield♦♠ Michael Collins♣♦♠♥
Ankur Parikh♥ Chris Alberti♥ Danielle Epstein♤♦ Illia Polosukhin♤♦ Jacob Devlin♤
Kenton Lee♥ Kristina Toutanova♥ Llion Jones♤ Matthew Kelcey♤♦ Ming-Wei Chang♥

Andrew M. Dai♣♦

Jakob Uszkoreit♣ Quoc Le♣♦ Slav Petrov♣

Google Research
natural-questions@google.com

Abstract

We present the Natural Questions corpus, a question answering
data set. Questions consist of real anonymized, aggregated queries
issued to the Google search engine. An annotator is presented with
a question along with a Wikipedia page from the top 5 search
results, and annotates a long answer (typically a paragraph) and a
short answer (one or more entities) if present on the page, or
marks null if no long/short answer is present. The public release
consists of 307,373 training examples with single annotations;
7,830 examples with 5-way annotations for development data; and a
further 7,842 examples with 5-way annotations sequestered as test
data. We present experiments validating quality of the data. We
also describe analysis of 25-way annotations on 302 examples,
giving insights into human variability on the annotation task. We
introduce robust metrics for the purposes of evaluating question
answering systems; demonstrate high human upper bounds on these
metrics; and establish baseline results using competitive methods
drawn from related literature.

1 Introduction

In recent years there has been dramatic progress
in machine learning approaches to problems such
as machine translation, speech recognition, and
image recognition. One major factor in these
successes has been the development of neural
methods that far exceed the performance of
previous approaches. A second major factor has

been the existence of large quantities of training
data for these systems.

Open-domain question answering (QA) is a benchmark task in
natural language understanding (NLU), which has significant
utility to users, and in addition is potentially a challenge task
that can drive the development of methods for NLU. Several pieces
of recent work have introduced QA data sets (e.g., Rajpurkar
et al., 2016; Reddy et al., 2018). However, in contrast to tasks
where it is relatively easy to gather naturally occurring
examples,1 the definition of a suitable QA task, and the
development of a methodology for annotation and evaluation, is
challenging. Key issues include the methods and sources used to
obtain questions; the methods used to annotate and collect
answers; the methods used to measure and ensure annotation
quality; and the metrics used for evaluation. For more discussion
of the limitations of previous work with respect to these issues,
see Section 2 of this paper.

This paper introduces Natural Questions2 (NQ),
a new data set for QA research, along with
methods for QA system evaluation. Our goals are
three-fold: 1) To provide large-scale end-to-end
training data for the QA problem. 2) To provide
a data set that drives research in natural language
understanding. 3) To study human performance in
providing QA annotations for naturally occurring
questions.

In brief, our annotation process is as follows. An
annotator is presented with a (question, Wikipedia
page) pair. The annotator returns a (long answer,
short answer) pair. The long answer (l) can
be an HTML bounding box on the Wikipedia

∗ ♣Project initiation; ♦Project design; ♠Data creation;
♥Model development; ♤Project support; ♥Also affiliated
with Columbia University, work done at Google; ♦No longer
at Google, work done at Google.

1For example, for machine translation/speech recognition
humans provide translations/transcriptions relatively easily.
2Available at: https://ai.google.com/research/NaturalQuestions.


Transactions of the Association for Computational Linguistics, vol. 7, pp. 453–466, 2019. https://doi.org/10.1162/tacl_a_00276.
Action Editor: Jing Jiang. Submission batch: 4/2018; Revision batch: 6/2018; Published 7/2019.
© 2019 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.


Task definition The input to a model is a question together
with an entire Wikipedia page. The target output from the model
is: 1) a long answer (e.g., a paragraph) from the page that
answers the question, or alternatively an indication that there
is no answer on the page; 2) a short answer where applicable. The
task was designed to be close to an end-to-end question answering
application.

Ensuring high-quality annotations at scale
Comprehensive guidelines were developed for the
task. These are summarized in Section 3. Annotation
quality was constantly monitored.

Evaluation of quality Section 4 describes post hoc
evaluation of annotation quality. Long/short
answers have 90%/84% precision, respectively.

Study of variability One clear finding in NQ is
that for naturally occurring questions there is often
genuine ambiguity in whether or not an answer
is acceptable. There are also often a number
of acceptable answers. Section 4 examines this
variability using 25-way annotations.

Robust evaluation metrics Section 5 intro-
duces methods of measuring answer quality that
account for variability in acceptable answers. We
demonstrate a high human upper bound on these
measures for both long answers (90% precision,
85% recall), and short answers (79% precision,
72% recall).

We propose NQ as a new benchmark for research
in QA. In Section 6.4 we present baseline results
from recent models developed on comparable data
sets (Clark and Gardner, 2018), as well as a simple
pipelined model designed for the NQ task. We
demonstrate a large gap between the performance
of these baselines and a human upper bound. We
argue that closing this gap will require significant
advances in NLU.

2 Related Work

The SQuAD (Rajpurkar et al., 2016), SQuAD 2.0
(Rajpurkar et al., 2018), NarrativeQA (Kocisky
et al., 2018), and HotpotQA (Yang et al., 2018)
data sets contain questions and answers writ-
ten by annotators who have first read a short
text containing the answer. The SQuAD data
sets contain questions/paragraph/answer triples
from Wikipedia. In the original SQuAD data set,
annotators often borrow part of the evidence para-
graph to create a question. Jia and Liang (2017)

Figure 1: Example annotations from the corpus.

page—typically a paragraph or table—that con-
tains the information required to answer the
question. Alternatively, the annotator can return
l = NULL if there is no answer on the page, or if
the information required to answer the question is
spread across many paragraphs. The short answer
(s) can be a span or set of spans (typically entities)
within l that answer the question, a boolean yes or
no answer, or NULL. If l = NULL then s = NULL,
necessarily. Figure 1 shows examples.
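
The constraints on (l, s) described above can be made concrete with a small data structure. This is only an illustrative sketch: the class name `Annotation`, the `Span` offset-pair representation, and the use of `None` for NULL are assumptions made here, not the format of the released corpus.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union

# Hypothetical representation: a span is a (start, end) offset pair into the page.
Span = Tuple[int, int]


@dataclass(frozen=True)
class Annotation:
    """One annotator's (long answer, short answer) pair for a (question, page) pair."""

    long_answer: Optional[Span]  # None stands for l = NULL
    # None stands for s = NULL; a bool is a yes/no answer; otherwise a tuple of spans.
    short_answer: Union[None, bool, Tuple[Span, ...]] = None

    def __post_init__(self) -> None:
        # If l = NULL then s = NULL, necessarily.
        if self.long_answer is None and self.short_answer is not None:
            raise ValueError("short answer must be NULL when long answer is NULL")
        if isinstance(self.short_answer, tuple):
            lo, hi = self.long_answer
            for start, end in self.short_answer:
                # Short answer spans must lie within the long answer l.
                if not (lo <= start < end <= hi):
                    raise ValueError("short answer span must lie inside the long answer")
```

Encoding the rule "l = NULL implies s = NULL" in the constructor means that an inconsistent annotation cannot be built in the first place.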

Natural Questions has the following properties:

Source of questions The questions consist of
real anonymized, aggregated queries issued to the
Google search engine. Simple heuristics are used
to filter questions from the query stream. Thus the
questions are ‘‘natural’’ in that they represent real
queries from people seeking information.

Number of items The public release contains
307,373 training examples with single annotations,
7,830 examples with 5-way annotations for
development data, and 7,842 5-way annotated
items sequestered as test data. We justify the use
of 5-way annotation for evaluation in Section 5.


showed that systems trained on SQuAD could be easily fooled by the
insertion of distractor sentences that should not change the
answer, and SQuAD 2.0 introduces questions that are designed to be
unanswerable. However, we argue that questions written to be
unanswerable can be identified as such with little reasoning, in
contrast to NQ’s task of deciding whether a paragraph contains all
of the evidence required to answer a real question. Both SQuAD
tasks have driven significant advances in reading comprehension,
but systems now outperform humans and harder challenges are
needed. NarrativeQA aims to elicit questions that are not close
paraphrases of the evidence by separate summary texts. No human
performance upper bound is provided for the full task and,
although an extractive system could theoretically perfectly
recover all answers, current approaches only just outperform a
random baseline. NarrativeQA may just be too hard for the current
state of NLU. HotpotQA is designed to contain questions that
require reasoning over text from separate Wikipedia pages. As well
as answering questions, systems must also identify passages that
contain supporting facts. This is similar in motivation to NQ’s
long answer task, where the selected passage must contain all of
the information required to infer the answer. Mirroring our
identification of acceptable variability in the NQ task
definition, HotpotQA’s authors observe that the choice of
supporting facts is somewhat subjective. They set high human upper
bounds by selecting, for each example, the score maximizing
partition of four annotations into one prediction and three
references. The reference labels chosen by this maximization are
not representative of the reference labels in HotpotQA’s
evaluation set, and it is not clear that the upper bounds are
achievable. A more robust approach is to keep the evaluation
distribution fixed, and calculate an achievable upper bound by
approximating the expectation over annotations, as we have done
for NQ in Section 5.

The QuAC (Choi et al., 2018) and CoQA
(Reddy et al., 2018) data sets contain dialogues
between a questioner, who is trying to learn about
a text, and an answerer. QuAC also prevents the
questioner from seeing the evidence text. Conversational
QA is an exciting new area, but it is
significantly different from the single turn QA
task in NQ. In both QuAC and CoQA, conversa-
tions tend to explore evidence texts incrementally,
progressing from the start to the end of the text.

This contrasts with NQ, where individual questions often require
reasoning over large bodies of text.
The WikiQA (Yang et al., 2015) and MS Marco (Nguyen et al., 2016)
data sets contain queries sampled from the Bing search engine.
WikiQA contains only 3,047 questions. MS Marco contains 100,000
questions with free-form answers. For each question, the annotator
is presented with 10 passages returned by the search engine, and
is asked to generate an answer to the query, or to say that the
answer is not contained within the passages. Free-form text
answers allow more flexibility in providing abstractive answers,
but lead to difficulties in evaluation (BLEU score [Papineni
et al., 2002] is used). MS Marco’s authors do not discuss issues
of variability or report quality metrics for their annotations.
From our experience, these issues are critical. DuReader (He
et al., 2018) is a Chinese language data set containing queries
from Baidu search logs. Like NQ, DuReader contains real user
queries; it requires systems to read entire documents to find
answers; and it identifies acceptable variability in answers.
However, as with MS Marco, DuReader is reliant on BLEU for answer
scoring, and systems already outperform humans according to this
metric.

There are a number of reading comprehension
benchmarks based on multiple choice tests
(Mihaylov et al., 2018; Richardson et al., 2013; Lai
et al., 2017). The TriviaQA data set (Joshi et al.,
2017) contains questions and answers taken from
trivia quizzes found online. A number of Cloze-
style tasks have also been proposed (Hermann
et al., 2015; Hill et al., 2015; Paperno et al., 2016;
Onishi et al., 2016). We believe that all of these
tasks are related to, but distinct from, answering
information-seeking questions. We also believe
that, because a solution to NQ will have genuine
utility, it is better equipped as a benchmark for
NLU.

3 Task Definition and Data Collection

Natural Questions contains (question, Wikipedia page, long answer,
short answer) quadruples where: the question seeks factual
information; the Wikipedia page may or may not contain the
information required to answer the question; the long answer is a
bounding box on this page containing all information required to
infer the answer; and the short answer is one or more entities
that give a short answer to the question, or a boolean yes or

Heuristic  Example
1.a        where does the nature conservancy get its funding
1.b        who is the song killing me softly written about
2          who owned most of the railroads in the 1800s
4          how far is chardon ohio from cleveland ohio
5          american comedian on have i got news for you

Table 1: Matches for heuristics in Section 3.1.

no. Both the long and short answer can be NULL if
no viable candidates exist on the Wikipedia page.

3.1 Questions and Evidence Documents

All the questions in NQ are queries of 8 words or
more that have been issued to the Google search
engine by multiple users in a short period of time.
From these queries, we sample a subset that either:

1. start with ‘‘who’’, ‘‘when’’, or ‘‘where’’ directly
followed by: a) a finite form of ‘‘do’’ or a modal verb; or
b) a finite form of ‘‘be’’ or ‘‘have’’ with a verb in some
later position;

2. start with ‘‘who’’ directly followed by a verb that is not
a finite form of ‘‘be’’;

3. contain multiple entities as well as an adjective, adverb,
verb, or determiner;

4. contain a categorical noun phrase immediately preceded by a
preposition or relative clause;

5. end with a categorical noun phrase, and do not contain a
preposition or relative clause.3
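
As a rough illustration of how such a filter might look, the sketch below implements only heuristic 1(a) together with the 8-word minimum. The word lists and the function name `matches_heuristic_1a` are assumptions made here for illustration; the remaining heuristics need part-of-speech tags and entity annotations, which this sketch does not attempt.

```python
# Hypothetical, hand-written word lists; a real filter would use a POS tagger.
WH_WORDS = {"who", "when", "where"}
DO_FORMS = {"do", "does", "did"}
MODALS = {"can", "could", "may", "might", "must", "shall", "should", "will", "would"}


def matches_heuristic_1a(query: str, min_words: int = 8) -> bool:
    """Return True if the query has at least `min_words` words and starts with
    who/when/where directly followed by a finite form of "do" or a modal verb."""
    tokens = query.lower().split()
    if len(tokens) < max(min_words, 2):
        return False
    return tokens[0] in WH_WORDS and tokens[1] in (DO_FORMS | MODALS)
```

Under these assumptions, the query ‘‘where does the nature conservancy get its funding’’ is accepted, whereas ‘‘who is the song killing me softly written about’’ is not, because it falls under heuristic 1(b) rather than 1(a).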

Table 1 gives examples. We run questions
through the Google search engine and keep those
where there is a Wikipedia page in the top 5 search
results. The (question, Wikipedia page) pairs are
the input to the human annotation task described
next.

The goal of these heuristics is to discard a
large proportion of queries that are non-questions,
while retaining the majority of queries of 8 words
or more in length that are questions. A manual
inspection showed that the majority of questions in
the data, with the exclusion of questions beginning
with ‘‘how to’’, are accepted by the filters. We
focus on longer queries as they are more complex,
and are thus a more challenging test for deep NLU.

3We pre-define the set of categorical noun phrases used
in 4 and 5 by running Hearst patterns (Hearst, 1992) to find
a broad set of hypernyms. Part of speech tags and entities
are identified using Google’s Cloud NLP API:
https://cloud.google.com/natural-language.

Figure 2: Annotation decision process with path proportions
from NQ training data. Percentages are proportions of the
entire data set. A total of 49% of all examples have a long
answer.

We focus on Wikipedia as it is a very important
source of factual information, and we believe that
stylistically it is similar to other sources of factual
information on the Web; however, like any data
set there may be biases in this choice. Future data-
collection efforts may introduce shorter queries,
‘‘how to’’ questions, or domains other than
Wikipedia.

3.2 Human Identification of Answers

Annotation is performed using a custom annota-
tion interface, by a pool of around 50 annotators,
with an average annotation time of 80 seconds.

The guidelines and tooling divide the annotation
task into three conceptual stages, where all three
stages are completed by a single annotator in
succession. The decision flow through these is
illustrated in Figure 2 and the instructions given
to annotators are summarized below.

Question Identification: Contributors deter-
mine whether the given question is good or bad.
A good question is a fact-seeking question that
can be answered with an entity or explanation.
A bad question is ambiguous, incomprehensible,


dependent on clear false presuppositions, opinion-seeking,
or not clearly a request for factual information.
Annotators must make this judgment
solely by the content of the question; they are not
yet shown the Wikipedia page.

Long Answer Identification: For good ques-
tions only, annotators select the earliest HTML
bounding box containing enough information for
a reader to completely infer the answer to the ques-
tion. Bounding boxes can be paragraphs, tables,
list items, or whole lists. Alternatively, annotators
mark ‘‘no answer’’ if the page does not answer the
question, or if the information is present but not
contained in a single one of the allowed elements.

Short Answer Identification: For examples
with long answers, annotators select the entity or
set of entities within the long answer that answer
the question. Alternatively, annotators can flag
that the short answer is yes, no, or they can flag
that no short answer is possible.

3.3 Data Statistics

In total, annotators identify a long answer for
49% of the examples, and short answer spans or
a yes/no answer for 36% of the examples. We
consider the choice of whether or not to answer
a question a core part of the question answering
task, and do not discard the remaining 51% that
have no answer labeled.

Annotators identify long answers by selecting
the smallest HTML bounding box that contains
all of the information required to answer the
question. These are mostly paragraphs (73%).
The remainder are made up of tables (19%), table
rows (1%), lists (3%), or list items (3%).4 We
leave further subcategorization of long answers to
future work, and provide a breakdown of base-
line performance on each of these three types of
answers in Section 6.4.

4 Evaluation of Annotation Quality

This section describes evaluation of the quality
of the human annotations in our data. We use
a combination of two methods: 1) post hoc
evaluation of correctness of non-null answers,
under consensus judgments from four ‘‘experts’’;

4We note that both tables and lists may be used purely for
the purposes of formatting text, or they may have their own
complex semantics—as in the case of Wikipedia infoboxes.

2) k-way annotations (with k = 25) on a subset of
the data.

Post hoc evaluation of non-null answers leads
directly to a measure of annotation precision. As is
common in information-retrieval style problems
such as long-answer identification, measuring
recall is more challenging. However, we describe
how 25-way annotated data provide useful insights
into recall, particularly when combined with ex-
pert judgments.

4.1 Preliminaries: The Sampling Distribution

Each item in our data consists of a four-tuple
(q, d, l, s) where q is a question, d is a document,
l is a long answer, and s is a short answer. Thus
we introduce random variables Q, D, L, and S
corresponding to these items. Note that L can be
a span within the document, or NULL. Similarly,
S can be one or more spans within L, a boolean,
or NULL.

For now we consider the three-tuple (q, d, l). The
treatment for short answers is the same throughout,
with (q, d, s) replacing (q, d, l).

Each data item (q, d, l) is sampled independently and
identically distributed (IID) from

\[
p(l, q, d) = p(q, d) \times p(l \mid q, d)
\]

Here, p(q, d) is the sampling distribution (probability mass
function [PMF]) over question/document pairs. It is defined as the
PMF corresponding to the following sampling process:5 First,
sample a question at random from some distribution; second,
perform a search on a major search engine using the question as
the underlying query; finally, either: 1) return (q, d) where d is
the top Wikipedia result for q, if d is in the top 5 search
results for q; 2) if there is no Wikipedia page in the top 5
results, discard q and repeat the sampling process.
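
The sampling process just described is a form of rejection sampling, and can be sketched as follows. Everything named here is hypothetical: `search` stands in for the search engine, `is_wikipedia` for a page classifier, and the base distribution over questions is taken to be uniform purely for illustration.

```python
import random
from typing import Callable, List, Sequence, Tuple


def sample_question_document(
    questions: Sequence[str],
    search: Callable[[str], List[str]],    # hypothetical: ranked result URLs for a query
    is_wikipedia: Callable[[str], bool],   # hypothetical: True for Wikipedia pages
    rng: random.Random,
    top_k: int = 5,
    max_tries: int = 10_000,
) -> Tuple[str, str]:
    """Draw one (q, d) pair by rejection sampling."""
    for _ in range(max_tries):
        q = rng.choice(questions)           # sample a question (uniform base distribution assumed)
        results = search(q)[:top_k]         # keep only the top-k search results
        wiki = [url for url in results if is_wikipedia(url)]
        if wiki:
            return q, wiki[0]               # d is the top-ranked Wikipedia page
        # no Wikipedia page in the top-k results: discard q and repeat
    raise RuntimeError("no question with a Wikipedia page in the top results was found")
```

Because questions without a Wikipedia page in the top results are discarded and redrawn, the resulting distribution over q is the base distribution renormalized over the retained questions.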

Here p(l|q, d) is the conditional distribution
(PMF) over long answer l conditioned on the pair
(q, d). The value for l is obtained by: 1) sampling
an annotator uniformly at random from the pool
5More formally, there is some base distribution p_b(q)
from which queries q are drawn, and a deterministic function
s(q) which returns the top-ranked Wikipedia page in the top
5 search results, or NULL if there is no Wikipedia page in
the top 5 results. Define Q to be the set of queries such that
s(q) ≠ NULL, and $b = \sum_{q \in Q} p_b(q)$. Then p(q, d) = p_b(q)/b
if q ∈ Q and d ≠ NULL and d = s(q), otherwise p(q, d) = 0.


of annotators; 2) presenting the pair (q, d) to the
annotator, who then provides a value for l.

Note that l is non-deterministic due to two
sources of randomness: 1) the random choice of
annotator; 2) the potentially random behavior of
a particular annotator (the annotator may give a
different answer depending on the time of day,
etc.).

We will also consider the distribution

\[
p(l, q, d \mid L \neq \text{NULL}) =
\begin{cases}
\dfrac{p(l, q, d)}{P(L \neq \text{NULL})} & \text{if } l \neq \text{NULL} \\
0 & \text{otherwise}
\end{cases}
\]

where $P(L \neq \text{NULL}) = \sum_{l,q,d:\, l \neq \text{NULL}} p(l, q, d)$.
Thus p(l, q, d|L ≠ NULL) is the probability of
seeing the triple (l, q, d), conditioned on L not
being NULL.

We now define precision of annotations. Consider a function
π(l, q, d) that is equal to 1 if l is a ‘‘correct’’ answer for the
pair (q, d), 0 if the answer is incorrect. The next section gives
a concrete definition of π. The annotation precision is defined as

\[
\Psi = \sum_{l,q,d} p(l, q, d \mid L \neq \text{NULL}) \times \pi(l, q, d)
\]

Given a set of annotations $S = \{(l^{(i)}, q^{(i)}, d^{(i)})\}_{i=1}^{|S|}$
drawn IID from p(l, q, d|L ≠ NULL), we can derive an estimate of Ψ as

\[
\hat{\Psi} = \frac{1}{|S|} \sum_{(l,q,d) \in S} \pi(l, q, d)
\]

4.2 Expert Evaluations of Correctness

We now describe the process for deriving ‘‘expert’’ judgments of
answer correctness. We used four experts for these judgments.
These experts had prepared the guidelines for the annotation
process.6 In a first phase each of the four experts independently
annotated examples for correctness. In a second phase the four
experts met to discuss disagreements in judgments, and to reach
a single consensus judgment for each example.

A key step is to define the criteria used to determine
correctness of an example. Given a triple (l, q, d), we extracted
the passage l′ corresponding to l on the page d. The pair (q, l′)
was then presented to the expert. Experts categorized (q, l′)
pairs into the following three categories:

Correct (C): It is clear beyond a reasonable doubt

that the answer is correct.

6The first four authors of this paper.


Figure 3: Examples with consensus expert judgments, and
justification for these judgments. See Figure 6 for more
examples.

Correct (but debatable) (Cd): A reasonable person
could be satisfied by the answer; however,
a reasonable person could raise a reasonable
doubt about the answer.

Wrong (W): There is not convincing evidence

that the answer is correct.

Figure 3 shows some example judgments. We
introduced the intermediate Cd category after
observing that many (q, l′) pairs are high quality
answers, but raise some small doubt or quibble
about whether they fully answer the question. The
use of the word ‘‘debatable’’ is intended to be
literal: (q, l′) pairs falling into the Cd category
could literally lead to some debate between
reasonable people as to whether they fully answer
the question or not.

Given this background, we will make the following assumption:

Answers in the Cd category should be very
useful to a user interacting with a QA system, and
should be considered to be high-quality answers;
however, an annotator would be justified in either
annotating or not annotating the example.

For these cases there is often disagreement
between annotators as to whether the page contains

Quantity   Long answer   Short answer
ˆΨ         90%           84%
ˆE(C)      59%           51%
ˆE(Cd)     31%           33%
ˆE(W)      10%           16%

Table 2: Precision results (ˆΨ) and empirical
estimates of the proportions of C, Cd, and W
items.

an answer or not: We will see evidence of this
when we consider the 25-way annotations.

4.3 Results for Precision Measurements

We used the following procedure to derive measurements of
precision: 1) We sampled examples IID from the distribution
p(l, q, d|L ≠ NULL). We call this set S. We had |S| = 139.
2) Four experts independently classified each of the items in S
into the categories C, Cd, W. 3) The four experts met to come up
with a consensus judgment for each item. For each example
(l(i), q(i), d(i)) ∈ S, we define c(i) to be the consensus
judgment. This process was repeated to derive judgments for short
answers.

We can then calculate the percentage of examples falling into
the three expert categories; we denote these values as ˆE(C),
ˆE(Cd), and ˆE(W).7 We define ˆΨ = ˆE(C) + ˆE(Cd). We have
explicitly included samples C and Cd in the overall precision
as we believe that Cd answers are essentially correct.
Table 2 shows the values for these quantities.
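
The estimates above amount to simple counting over the consensus judgments. A minimal sketch follows; the function names and the string labels "C", "Cd", and "W" are choices made here for illustration, and the data in any call is assumed to be a plain list of consensus labels.

```python
from collections import Counter
from typing import Dict, Sequence

CATEGORIES = ("C", "Cd", "W")


def category_proportions(consensus: Sequence[str]) -> Dict[str, float]:
    """Empirical proportion of each expert category among the consensus labels."""
    if not consensus:
        raise ValueError("need at least one consensus judgment")
    counts = Counter(consensus)
    return {c: counts[c] / len(consensus) for c in CATEGORIES}


def precision_estimate(consensus: Sequence[str]) -> float:
    """Precision estimate: the proportion of C plus the proportion of Cd."""
    e = category_proportions(consensus)
    return e["C"] + e["Cd"]
```

The reported values are consistent with this definition: for long answers 59% + 31% = 90%, and for short answers 51% + 33% = 84%.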

4.4 Variability of Annotations

We have shown that an annotation drawn from
p(l, q, d|L ≠ NULL) has high expected precision.
Now we address the distribution over annotations
for a given (q, d) pair. Annotators can disagree
about whether or not d contains an answer to
q, that is, whether or not L = NULL. In the case
that annotators agree that L ≠ NULL, they can
also disagree about the correct assignment to L.

In order to study variability, we collected 24
additional annotations from separate annotators
for each of the (q, d, l) triples in S. For each
(q, d, l) triple, we now have a 5-tuple (q(i), d(i),
l(i), c(i), a(i)) where $a^{(i)} = a^{(i)}_1 \ldots a^{(i)}_{25}$ is a vector
of 25 annotations (including l(i)), and c(i) is

7More formally, let [[e]] for any statement e be 1 if e is
true, 0 if e is false. We define
$\hat{E}(C) = \frac{1}{|S|} \sum_{i=1}^{|S|} [[c^{(i)} = C]]$.
The values for ˆE(Cd) and ˆE(W) are calculated in a similar
manner.

Figure 4: Values of ˆE[(θ1, θ2]] and ˆE[(θ1, θ2], C/Cd/
W] for different intervals (θ1, θ2]. The height of each
bar is equal to ˆE[(θ1, θ2]]; the divisions within each bar
show ˆE[(θ1, θ2], C], ˆE[(θ1, θ2], Cd], and ˆE[(θ1, θ2], W].

the consensus judgment for l(i). For each i also
define

\[
\mu^{(i)} = \frac{1}{25} \sum_{j=1}^{25} [[a^{(i)}_j \neq \text{NULL}]]
\]

to be the proportion of the 25-way annotations
that are non-null.
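
The proportion of non-null annotations for one example can be computed directly. In this sketch the function name is hypothetical and `None` is assumed to stand for a NULL annotation.

```python
from typing import Optional, Sequence


def non_null_proportion(annotations: Sequence[Optional[object]]) -> float:
    """Fraction of annotations that are non-null (None stands for NULL)."""
    if not annotations:
        raise ValueError("need at least one annotation")
    return sum(a is not None for a in annotations) / len(annotations)
```

For instance, an example where 20 of 25 annotators select some passage and 5 mark NULL has a proportion of 0.8.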

We now show that μ(i) is highly correlated with annotation
precision. We define

\[
\hat{E}[(0.8, 1.0]] = \frac{1}{|S|} \sum_{i=1}^{|S|} [[0.8 < \mu^{(i)} \leq 1]]
\]

to be the proportion of examples with greater than 80% of the 25
annotators marking a non-null long answer, and

\[
\hat{E}[(0.8, 1.0], C] = \frac{1}{|S|} \sum_{i=1}^{|S|} [[0.8 < \mu^{(i)} \leq 1 \text{ and } c^{(i)} = C]]
\]

to be the proportion of examples with greater than 80% of the 25
annotators marking a non-null long answer and with c(i) = C.
Similar definitions apply for the intervals (0, 0.2], (0.2, 0.4],
(0.4, 0.6], and (0.6, 0.8], and for judgments Cd and W.

Figure 4 illustrates the proportion of annotations falling into
the C/Cd/W categories in different regions of μ(i). For those
(q, d) pairs where more than 80% of annotators gave some non-null
answer, our expert judgments agree that these annotations are
overwhelmingly correct. Similarly, when fewer than 20% of
annotators gave a non-null answer, these answers tend to be
incorrect. In between these two extremes, the disagreement between
annotators is largely accounted for by the Cd category, where a
reasonable person could either be satisfied with the answer, or
want more information. Later, in Section 5, we make use of the
correlation between μ(i) and accuracy to define a metric for the
evaluation of answer quality. In that section, we also show that a
model trained on (l, q, d) triples can outperform a single
annotator on this metric by accounting for the uncertainty of
whether or not an answer is present.

As well as disagreeing about whether (q, d) contains a valid
answer, annotators can disagree about the location of the best
answer.
In many cases there are multiple valid long answers in multiple
distinct locations on the page.8 The most extreme example of this
that we see in our 25-way annotated data is for the question
‘‘name the substance used to make the filament of bulb’’ paired
with the Wikipedia page about incandescent light bulbs. Annotators
identify 7 passages that discuss tungsten wire filaments.

Short answers can be arbitrarily delimited and this can lead to
extreme variation. The most extreme example of this that we see in
the 25-way annotated data is the 11 distinct, but correct, answers
for the question ‘‘where is blood pumped after it leaves the right
ventricle’’. Here, 14 annotators identify a substring of ‘‘to the
lungs’’ as the best possible short answer. Of these, 6 label the
entire string, 4 reduce it to ‘‘the lungs’’, and 4 reduce it to
‘‘lungs’’. A further 6 annotators do not consider this short
answer to be sufficient and choose more precise phrases such as
‘‘through the semilunar pulmonary valve into the left and right
main pulmonary arteries (one for each lung)’’. The remaining 5
annotators decide that there is no adequate short answer.

For each question, we ranked each of the unique answers given by
our 25 annotators according to the number of annotators that chose
it. We found that by just taking the most popular long answer, we
could account for 83% of the long answer annotations. The two most
popular long answers account for 96% of the long answer
annotations. It is extremely uncommon for a question to have more
than three distinct long answers annotated. Short answers have
greater variability, but the most popular short answer still
accounts for 64% of all short answer annotations. The three most
popular short answers account for 90% of all short answer
annotations.

8As stated earlier in this paper, we did instruct annotators to
select the earliest instance of an answer when there are multiple
answer instances on the page.
However, there are still cases where different annotators
disagree on whether an answer earlier in the page is sufficient in
comparison to a later answer, leading to differences between
annotators.

5 Evaluation Measures

NQ includes 5-way annotations on 7,830 items for development data,
and we will sequester a further 7,842 items, 5-way annotated, for
test data. This section describes evaluation metrics using this
data, and gives justification for these metrics.

We choose 5-way annotations for the following reasons: First, we
have evidence that aggregating annotations from 5 annotators is
likely to be much more robust than relying on a single annotator
(see Section 4). Second, 5 annotators is a small enough number
that the cost of annotating thousands of development and test
items is not prohibitive.

5.1 Definition of an Evaluation Measure Based on 5-Way Annotations

Assume that we have a model fθ with parameters θ that maps an
input (q, d) to a long answer l = fθ(q, d). We would like to
evaluate the accuracy of this model. Assume we have evaluation
examples {q(i), d(i), a(i)} for i = 1 . . . n, where q(i) is a
question, d(i) is the associated Wikipedia document, and a(i) is a
vector with components $a^{(i)}_j$ for j = 1 . . . 5. Each
$a^{(i)}_j$ is the output from the j’th annotator, and can be a
paragraph in d(i), or can be NULL. The five annotators are chosen
uniformly at random from a pool of annotators.

We define an evaluation measure based on the five-way annotations
as follows. If at least two out of five annotators have given a
non-null long answer on the example, then the system is required
to output a non-null answer that is seen at least once in the five
annotations; conversely, if fewer than two annotators give a
non-null long answer, the system is required to return NULL as its
output.

To make this more formal, define the function g(a(i)) to be the
number of annotations in a(i) that are non-null.
Define a function $h_\beta(a, l)$ that judges the correctness of label $l$ given annotations $a = a_1 \ldots a_5$. This function is parameterized by an integer $\beta$. The function returns 1 if the label $l$ is judged to be correct, and 0 otherwise:

Definition 1 (Definition of $h_\beta(a, l)$)
If $g(a) \geq \beta$ and $l \neq \text{NULL}$ and $l = a_j$ for some $j \in \{1 \ldots 5\}$ Then $h_\beta(a, l) = 1$;
Else If $g(a) < \beta$ and $l = \text{NULL}$ Then $h_\beta(a, l) = 1$;
Else $h_\beta(a, l) = 0$.

We used $\beta = 2$ in our experiments.⁹

The accuracy of a model is then

$$A_\beta(f_\theta) = \frac{1}{n} \sum_{i=1}^{n} h_\beta\left(a^{(i)}, f_\theta(q^{(i)}, d^{(i)})\right)$$

The value for $A_\beta$ is an estimate of accuracy with respect to the underlying distribution, which we define as $\bar{A}_\beta(f_\theta) = E[h_\beta(a, f_\theta(q, d))]$. Here the expectation is taken with respect to $p(a, q, d) = p(q, d) \prod_{j=1}^{5} p(a_j | q, d)$ where $p(a_j | q, d) = P(L = a_j | Q = q, D = d)$; hence the annotations $a_1 \ldots a_5$ are assumed to be drawn IID from $p(l | q, d)$.¹⁰

We discuss this measure at length in this section. First, however, we make the following critical point: It is possible for a model trained on $(l^{(i)}, q^{(i)}, d^{(i)})$ triples drawn IID from $p(l, q, d)$ to exceed the performance of a single annotator on this measure. In particular, if we have a model $p(l | q, d; \theta)$, trained on $(l, q, d)$ triples, which is a good approximation to $p(l | q, d)$, it is then possible to use $p(l | q, d; \theta)$ to make predictions that outperform a single random draw from $p(l | q, d)$. The Bayes optimal hypothesis (see Devroye et al., 1997) for $h_\beta$, defined as $\arg\max_f E_{q,d,a}[h_\beta(a, f(q, d))]$, is a function of the posterior distribution $p(\cdot | q, d)$,¹¹ and will generally exceed the performance of a single random annotation, $E_{q,d,a}\left[\sum_{l} p(l | q, d) \times h_\beta(a, l)\right]$. We also show this empirically, by constructing an approximation to $p(l | q, d)$ from 20-way annotations, then using this approximation to make predictions that significantly outperform a single annotator.

Precision and Recall. During evaluation, it is often beneficial to separately measure false positives (incorrectly predicting an answer), and false negatives (failing to predict an answer). We define the precision ($P$) and recall ($R$) of $f_\theta$:

$$t(q, d, a, f_\theta) = h_\beta(a, f_\theta(q, d)) \, [\![ f_\theta(q, d) \neq \text{NULL} ]\!]$$

$$R(f_\theta) = \frac{\sum_{i=1}^{n} t(q^{(i)}, d^{(i)}, a^{(i)}, f_\theta)}{\sum_{i=1}^{n} [\![ g(a^{(i)}) \geq \beta ]\!]}$$

$$P(f_\theta) = \frac{\sum_{i=1}^{n} t(q^{(i)}, d^{(i)}, a^{(i)}, f_\theta)}{\sum_{i=1}^{n} [\![ f_\theta(q^{(i)}, d^{(i)}) \neq \text{NULL} ]\!]}$$

⁹ This is partly motivated through the results on 25-way annotations (see Section 4.4), where for $\mu^{(i)} \geq 0.4$ over 93% (114/122 annotations) are in the C or Cd categories, whereas for $\mu^{(i)} < 0.4$ over 35% (11/17 annotations) are in the W category.

¹⁰ This isn't quite accurate as the annotators are sampled without replacement; however, it simplifies the analysis.

¹¹ Specifically, for an input $(q, d)$, if we define $l^* = \arg\max_{l \neq \text{NULL}} p(l | q, d)$, $\gamma = p(l^* | q, d)$, and $\bar{\gamma} = p(\text{NULL} | q, d)$, then the Bayes optimal hypothesis is to output $l^*$ if $P(h_\beta(a, l^*) = 1 | \gamma, \bar{\gamma}) \geq P(h_\beta(a, \text{NULL}) = 1 | \gamma, \bar{\gamma})$, and to output NULL otherwise. Implementation of this strategy is straightforward if $\gamma$ and $\bar{\gamma}$ are known; this strategy will in general give a higher accuracy value than taking a single sample $l$ from $p(l | q, d)$ and using this sample as the prediction. In principle a model $p(l | q, d; \theta)$ trained on $(l, q, d)$ triples can converge to a good estimate of $\gamma$ and $\bar{\gamma}$. Note that for the special case $\gamma + \bar{\gamma} = 1$ we have $P(h_\beta(a, \text{NULL}) = 1 | \gamma, \bar{\gamma}) = \bar{\gamma}^5 + 5\bar{\gamma}^4(1 - \bar{\gamma})$ and $P(h_\beta(a, l^*) = 1 | \gamma, \bar{\gamma}) = 1 - P(h_\beta(a, \text{NULL}) = 1 | \gamma, \bar{\gamma})$.
It follows that the Bayes optimal hypothesis is to predict $l^*$ if $\gamma \geq \alpha$ where $\alpha \approx 0.31381$, and to predict NULL otherwise. $\alpha$ is $1 - \bar{\alpha}$ where $\bar{\alpha}$ is the solution to $\bar{\alpha}^5 + 5\bar{\alpha}^4(1 - \bar{\alpha}) = 0.5$.

5.2 Super-Annotator Upper Bound

To place an upper bound on the metrics introduced above we create a "super-annotator" from the 25-way annotated data introduced in Section 4. From this data, we create 4-tuples $(q^{(i)}, d^{(i)}, a^{(i)}, b^{(i)})$. The first three terms in this tuple are the question, document, and vector of five reference annotations. $b^{(i)}$ is a vector of annotations $b^{(i)}_j$ for $j = 1 \ldots 20$ drawn from the same distribution as $a^{(i)}$. The super-annotator predicts NULL if $g(b^{(i)}) < \alpha$, and $l^* = \arg\max_{l \in d} \sum_{j=1}^{20} [\![ l = b_j ]\!]$ otherwise.

Table 3 shows super-annotator performance for $\alpha = 8$, with 90.0% precision, 84.6% recall, and 87.2% F-measure. This significantly exceeds the performance (80.4% precision/67.6% recall/73.4% F-measure) for a single annotator. We subsequently view the super-annotator numbers as an effective upper bound on performance of a learned model.

6 Baseline Performance

The NQ corpus is designed to provide a benchmark with which we can evaluate the performance of QA systems. Every question in NQ is unique under exact string match, and we split questions randomly in NQ into separate train/development/test sets. To facilitate comparison, we introduce baselines that either make use of high-level data set regularities, or are trained on the 307k examples in the training set. Here, we present well-established baselines that were state of the art at the time of submission. We also refer readers to Alberti et al. (2019) for more recent advances in modeling.

All of our baselines focus on the long and short answer extraction tasks. We leave boolean answers to future work.
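The evaluation measure of Section 5.1 and the super-annotator of Section 5.2 can be sketched in a few lines of Python. This is a minimal illustration rather than the official NQ evaluation script: the names `g`, `h_beta`, `precision_recall`, and `super_annotator` are chosen here to mirror the paper's symbols, `None` stands in for NULL, long answers are assumed to be comparable by simple equality, and the handling of empty denominators and of ties is an assumption of the sketch.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence, Tuple

# A long answer label; None stands in for the paper's NULL.
Label = Optional[Hashable]


def g(annotations: Sequence[Label]) -> int:
    """Number of annotations that are non-null."""
    return sum(a is not None for a in annotations)


def h_beta(annotations: Sequence[Label], label: Label, beta: int = 2) -> int:
    """Definition 1: return 1 if `label` is judged correct, 0 otherwise."""
    if g(annotations) >= beta:
        # Enough annotators answered: the label must match one of them.
        return int(label is not None and label in annotations)
    # Too few annotators answered: only NULL is correct.
    return int(label is None)


def precision_recall(
    references: Sequence[Sequence[Label]],
    predictions: Sequence[Label],
    beta: int = 2,
) -> Tuple[float, float]:
    """Precision and recall from Section 5.1.

    Returning 0.0 for an empty denominator is an assumption of this sketch.
    """
    # t = h_beta(a, l) * [[l != NULL]], summed over examples.
    true_pos = sum(
        h_beta(a, l, beta)
        for a, l in zip(references, predictions)
        if l is not None
    )
    n_predicted = sum(l is not None for l in predictions)
    n_answerable = sum(g(a) >= beta for a in references)
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_answerable if n_answerable else 0.0
    return precision, recall


def super_annotator(b: Sequence[Label], alpha: int = 8) -> Label:
    """Section 5.2: NULL if fewer than `alpha` annotations in `b` are
    non-null, otherwise the most frequent non-null annotation.

    Ties go to the label seen first, which is an assumption here.
    """
    if g(b) < alpha:
        return None
    counts = Counter(x for x in b if x is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Two examples with 5-way references; the system answers "p1" for both.
refs = [["p1", "p1", None, None, None], ["p1", None, None, None, None]]
preds = ["p1", "p1"]
print(precision_recall(refs, preds))  # (0.5, 1.0)
```

In the usage example, the second prediction counts against precision because only one of the five annotators gave an answer, so NULL was the required output; it does not affect recall because that example is not counted as answerable.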
                      Long answer Dev     Long answer Test    Short answer Dev    Short answer Test
                      P     R     F1      P     R     F1      P     R     F1      P     R     F1
First paragraph       22.2  37.8  27.8    22.3  38.5  28.3    –     –     –       –     –     –
Most frequent         43.1  20.0  27.3    40.2  18.4  25.2    –     –     –       –     –     –
Closest question      37.7  28.5  32.4    36.2  27.8  31.4    –     –     –       –     –     –
DocumentQA            47.5  44.7  46.1    48.9  43.3  45.7    38.6  33.2  35.7    40.6  31.0  35.1
DecAtt + DocReader    52.7  57.0  54.8    54.3  55.7  55.0    34.3  28.9  31.4    31.9  31.1  31.5
Single annotator†     80.4  67.6  73.4    –     –     –       63.4  52.6  57.5    –     –     –
Super-annotator†      90.0  84.6  87.2    –     –     –       79.1  72.6  75.7    –     –     –

Table 3: Precision (P), recall (R), and the harmonic mean of these (F1) of all baselines, a single annotator, and the super-annotator upper bound. The human performances marked with † are evaluated on a sample of five annotations from the 25-way annotated data introduced in Section 5.

6.1 Untrained Baselines

NQ's long answer selection task admits several untrained baselines. The first paragraph of a Wikipedia page commonly acts as a summary of the most important information regarding the page's subject. We therefore implement a long answer baseline that simply selects the first paragraph for all pages.

Furthermore, because 79% of the Wikipedia pages in the development set also appear in the training set, we implement two "copying" baselines. The first of these simply selects the most frequent annotation applied to a given page in the training set. The second selects the annotation given to the training set question closest to the eval set question according to TFIDF-weighted word overlap. These three baselines are reported as First paragraph, Most frequent, and Closest question in Table 3, respectively.

6.2 Document-QA

We adapt the reference implementation¹² of Document-QA (Clark and Gardner, 2018) for the NQ task. This system performs well on the SQuAD and TriviaQA short answer extraction tasks, but it is not designed to represent: (i) the long answers that do not contain short answers, and (ii) the NULL answers that occur in NQ.
To address (i) we choose the shortest available answer span at training, differentiating long and short answers only through the inclusion of special start and end of passage tokens that identify long answer candidates. At prediction time, the model can either predict a long answer (and no short answer), or a short answer (which implies a long answer).

To address (ii), we tried adding special NULL passages to represent the lack of answer. However, we achieved better performance by training on the subset of questions with answers and then only predicting those answers whose scores exceed a threshold.

With these two modifications, we are able to apply Document-QA to NQ. We follow Clark and Gardner (2018) in pruning documents down to the set of passages that have highest TFIDF similarity with the question. Under this approach, we consider the top 16 passages as long answers. We consider short answers containing up to 17 words. We train Document-QA for 30 epochs with batches containing 15 examples. The post hoc score threshold is set to 3.0. All of these values were chosen on the basis of development set performance.

¹² https://github.com/allenai/document-qa.

6.3 Custom Pipeline (DecAtt + DocReader)

One view of the long answer selection task is that it is more closely related to natural language inference (Bowman et al., 2015; Williams et al., 2018) than short answer extraction. A valid long answer must contain all of the information required to infer the answer. Short answers do not need to contain this information; they need to be surrounded by it. Motivated by this intuition, we implement a pipelined approach that uses a model drawn from the natural language inference literature to select long answers. Then short answers are selected from these using a model drawn from the short answer extraction literature.

Long answer selection. Let $t(d, l)$ denote the sequence of tokens in $d$ for the long answer candidate $l$.
We then use the Decomposable Attention model (Parikh et al., 2016) to produce a score for each (question, candidate) pair, $x_l = \text{DecAtt}(q, t(d, l))$. To this we add a 10-dimensional trainable embedding $r_l$ of the long answer candidate's position in the sequence of candidates;¹³ an integer $u_l$ containing the number of the words shared by $q$ and $t(d, l)$; and a scalar $v_l$ containing the number of words shared by $q$ and $t(d, l)$ weighted by inverse document frequency. The long answer score $z_l$ is then given as a linear function of the above features

$$z_l = w^\top [x_l, r_l, u_l, v_l] + b$$

where $w$ and $b$ are the trainable weight vector and bias, respectively.

Short answer selection. Given a long answer, the Document Reader model (Chen et al., 2017; abbreviated DocReader) is used to extract short answers.

Figure 5: Examples from the questions with 25-way annotations.

Training. The long answer selection model is trained by minimizing the negative log-likelihood of the correct answer $l^{(i)}$ with a hyperparameter $\eta$ that down-weights examples with the NULL label:

$$-\sum_{i=1}^{n} \log\left(\frac{\exp(z_{l^{(i)}})}{\sum_{l} \exp(z_l)}\right) \times \left(1 - \eta \, [\![ l^{(i)} = \text{NULL} ]\!]\right)$$

We found that the inclusion of $\eta$ is useful in accounting for the asymmetry in labels, because a NULL label is less informative than an answer location. Varying $\eta$ also seems to provide a more stable method of setting a model's precision point than post hoc thresholding of prediction scores. An analogous strategy is used for the short answer model where examples with no entity answers are given a different weight.

¹³ Specifically, we have a unique learned 10-dimensional embedding for each position 1 . . . 19 in the sequence, and a 20th embedding used for all positions ≥ 20.
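The long answer scoring function and the η-weighted training objective of Section 6.3 can be sketched as follows. This is a hedged illustration using only the Python standard library, not the authors' implementation: the function names are hypothetical, the DecAtt score `x` is taken as a given number, and the sketch assumes that NULL is scored like any other candidate (stored under the key `None`), which the text above does not specify.

```python
import math
from typing import Dict, Hashable, Optional, Sequence, Tuple

# A long answer label; None stands in for the NULL label.
Label = Optional[Hashable]


def position_index(position: int, num_embeddings: int = 20) -> int:
    """Map a 1-based candidate position to a 0-based embedding row.

    Positions 1..19 get their own row; all positions >= 20 share the last.
    """
    if position < 1:
        raise ValueError("positions are 1-based")
    return min(position, num_embeddings) - 1


def long_answer_score(
    x: float,            # DecAtt(q, t(d, l)), assumed precomputed
    r: Sequence[float],  # position embedding r_l
    u: int,              # number of words shared by q and t(d, l)
    v: float,            # IDF-weighted word overlap
    w: Sequence[float],  # weight vector over [x, r, u, v]
    b: float,            # bias
) -> float:
    """z_l = w^T [x_l, r_l, u_l, v_l] + b."""
    features = [x, *r, float(u), v]
    if len(w) != len(features):
        raise ValueError("weight vector must match the feature vector")
    return sum(wi * fi for wi, fi in zip(w, features)) + b


def weighted_nll(
    batch: Sequence[Tuple[Dict[Label, float], Label]], eta: float
) -> float:
    """Negative log-likelihood with NULL examples scaled by (1 - eta).

    Each batch item is (scores, gold), where `scores` maps every candidate
    label (including None for NULL, an assumption here) to its score z_l.
    """
    total = 0.0
    for scores, gold in batch:
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(scores.values())
        log_z = m + math.log(sum(math.exp(z - m) for z in scores.values()))
        log_p = scores[gold] - log_z
        weight = 1.0 - eta if gold is None else 1.0
        total -= weight * log_p
    return total
```

With η = 0 the objective reduces to the ordinary negative log-likelihood, and with η = 1 examples labeled NULL contribute nothing, which is why varying η shifts the model's precision point.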
6.4 Results

Table 3 shows results for all baselines as well as a single annotator, and the super-annotator introduced in Section 5. It is clear that there is a great deal of headroom in both tasks. We find that Document-QA performs significantly worse than DecAtt+DocReader in long answer identification. This is likely because Document-QA was designed for the short answer task only.

To ground these results in the context of comparable tasks, we measure performance on the subset of NQ that has non-NULL labels for both long and short answers. Freed from the decision of whether or not to answer, DecAtt+DocReader obtains 68.0% F1 on the long answer task, and 40.4% F1 on the short answer task. We also examine performance of the short answer extraction systems in the setting where the long answer is given, and a short answer is known to exist. With this simplification, short answer F1 increases to 57.7% for DocReader. Under this restriction NQ roughly approximates the SQuAD 1.1 task. From the gap to the super-annotator upper bound we know that this task is far from being solved in NQ.

Finally, we break the long answer identification results down according to long answer type. From Table 3 we know that DecAtt+DocReader predicts long answers with 54.8% F1. If we only measure performance on examples that should have a paragraph long answer, this increases to 65.1%. For tables and table rows it is 66.4%. And for lists and list items it is 32.0%. All other examples have a NULL label. Clearly, the model is struggling to learn some aspect of list-formatted data from the 6% of the non-NULL examples that have this type.
Figure 6: Answer annotations for four examples from Figure 5 that have long answers that are paragraphs (i.e., not tables or lists). We show the expert judgment (C/Cd/W) for each non-null answer. "Long answer stats" a/25, b/25 have a = number of non-null long answers for this question, b = number of long answers the same as that shown in the figure. For example, for question A1, 13 out of 25 annotators give some non-null answer, and 4 out of 25 annotators give the same long answer After mashing . . .. "Short answer stats" has similar statistics for short answers.

7 Conclusion

We argue that progress on QA has been hindered by a lack of appropriate training and test data. To address this, we present the Natural Questions corpus. This is the first large publicly available data set to pair real user queries with high-quality annotations of answers in documents. We also present metrics to be used with NQ, for the purposes of evaluating the performance of question answering systems. We demonstrate a high upper bound on these metrics and show that existing methods do not approach this upper bound. We argue that for them to do so will require significant advances in NLU. Figure 5 shows example questions from the data set. Figure 6 shows example question/answer pairs from the data set, together with expert judgments and statistics from the 25-way annotations.

References

Chris Alberti, Kenton Lee, and Michael Collins. 2019. A BERT Baseline for the Natural Questions. arXiv preprint arXiv:1901.08634.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642.
Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879.

Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2174–2184, Brussels.

Christopher Clark and Matt Gardner. 2018. Simple and effective multi-paragraph reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 845–855, Melbourne.

Luc Devroye, László Györfi, and Gábor Lugosi. 1997. A Probabilistic Theory of Pattern Recognition, corrected 2nd edition, volume 31 of Applications of Mathematics. Springer.

Wei He, Kai Liu, Jing Liu, Yajuan Lyu, Shiqi Zhao, Xinyan Xiao, Yuan Liu, Yizhong Wang, Hua Wu, Qiaoqiao She, Xuan Liu, Tian Wu, and Haifeng Wang. 2018. DuReader: A Chinese machine reading comprehension dataset from real-world applications. In Proceedings of the Workshop on Machine Reading for Question Answering, pages 37–46, Melbourne.

Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In COLING 1992 Volume 2: The 15th International Conference on Computational Linguistics.

Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS'15. Cambridge, MA.

Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2015. The Goldilocks principle: Reading children's books with explicit memory representations. In Proceedings of the International Conference on Learning Representations.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031, Copenhagen.

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611.

Tomas Kocisky, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gabor Melis, and Edward Grefenstette. 2018. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794. Copenhagen.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, Brussels.

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. In Proceedings of the Workshop on Cognitive Computation: Integrating Neural and Symbolic Approaches.

Takeshi Onishi, Hai Wang, Mohit Bansal, Kevin Gimpel, and David McAllester. 2016. Who did what: A large-scale person-centered cloze dataset. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2230–2235. Austin, TX.

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernandez. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1525–1534, Berlin.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia.

Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2249–2255, Austin, TX.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, TX.

Siva Reddy, Danqi Chen, and Christopher D. Manning. 2018. CoQA: A conversational question answering challenge. arXiv preprint arXiv:1808.07042.

Matthew Richardson, Christopher J. C. Burges, and Erin Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 193–203, Seattle, WA.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, LA.

Yi Yang, Wen-tau Yih, and Christopher Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2013–2018, Lisbon.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels.