Natural Questions: A Benchmark for Question Answering Research
Tom Kwiatkowski♣♦♠∗ Jennimaria Palomaki♠ Olivia Redfield♦♠ Michael Collins♣♦♠♥
Ankur Parikh♥ Chris Alberti♥ Danielle Epstein♤♦ Illia Polosukhin♤♦ Jacob Devlin♤
Kenton Lee♥ Kristina Toutanova♥ Llion Jones♤ Matthew Kelcey♤♦ Ming-Wei Chang♥
Andrew M. Dai♣♦
Jakob Uszkoreit♣ Quoc Le♣♦ Slav Petrov♣
Google Research
natural-questions@google.com
Abstract
We present the Natural Questions corpus, a question answering data set. Questions consist of real anonymized, aggregated queries issued to the Google search engine. An annotator is presented with a question along with a Wikipedia page from the top 5 search results, and annotates a long answer (typically a paragraph) and a short answer (one or more entities) if present on the page, or marks null if no long/short answer is present. The public release consists of 307,373 training examples with single annotations; 7,830 examples with 5-way annotations for development data; and a further 7,842 examples with 5-way annotations sequestered as test data. We present experiments validating quality of the data. We also describe analysis of 25-way annotations on 302 examples, giving insights into human variability on the annotation task. We introduce robust metrics for the purposes of evaluating question answering systems; demonstrate high human upper bounds on these metrics; and establish baseline results using competitive methods drawn from related literature.
1 Introduction
In recent years there has been dramatic progress
in machine learning approaches to problems such
as machine translation, speech recognition, and
image recognition. One major factor in these
successes has been the development of neural
methods that far exceed the performance of
previous approaches. A second major factor has
been the existence of large quantities of training
data for these systems.
Open-domain question answering (QA) is a
benchmark task in natural language understanding
(NLU), which has significant utility to users, and
in addition is potentially a challenge task that
can drive the development of methods for NLU.
Several pieces of recent work have introduced
QA data sets (e.g., Rajpurkar et al., 2016; Reddy
et al., 2018). However, in contrast to tasks where
it is relatively easy to gather naturally occurring
examples,1 the definition of a suitable QA task,
and the development of a methodology for an-
notation and evaluation, is challenging. Key issues
include the methods and sources used to obtain
questions; the methods used to annotate and col-
lect answers; the methods used to measure and
ensure annotation quality; and the metrics used for
evaluation. For more discussion of the limitations
of previous work with respect to these issues, see
Section 2 of this paper.
This paper introduces Natural Questions2 (NQ),
a new data set for QA research, along with
methods for QA system evaluation. Our goals are
three-fold: 1) To provide large-scale end-to-end
training data for the QA problem. 2) To provide
a data set that drives research in natural language
understanding. 3) To study human performance in
providing QA annotations for naturally occurring
questions.
In brief, our annotation process is as follows. An
annotator is presented with a (question, Wikipedia
page) pair. The annotator returns a (long answer,
short answer) pair. The long answer (l) can
be an HTML bounding box on the Wikipedia
∗♣Project initiation; ♦Project design; ♠Data creation; ♥Model development; ♤Project support; ♥Also affiliated with Columbia University, work done at Google; ♦No longer at Google, work done at Google.
1For example, for machine translation/speech recognition, humans provide translations/transcriptions relatively easily.
2Available at: https://ai.google.com/research/NaturalQuestions.
Transactions of the Association for Computational Linguistics, vol. 7, pp. 453–466, 2019. https://doi.org/10.1162/tacl_a_00276.
Action Editor: Jing Jiang. Submission batch: 4/2018; Revision batch: 6/2018; Published 7/2019.
© 2019 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
page—typically a paragraph or table—that contains the information required to answer the question. Alternatively, the annotator can return l = NULL if there is no answer on the page, or if the information required to answer the question is spread across many paragraphs. The short answer (s) can be a span or set of spans (typically entities) within l that answer the question, a boolean yes or no answer, or NULL. If l = NULL then s = NULL, necessarily. Figure 1 shows examples.

Figure 1: Example annotations from the corpus.

Natural Questions has the following properties:

Source of questions The questions consist of real anonymized, aggregated queries issued to the Google search engine. Simple heuristics are used to filter questions from the query stream. Thus the questions are ‘‘natural’’ in that they represent real queries from people seeking information.

Number of items The public release contains 307,373 training examples with single annotations, 7,830 examples with 5-way annotations for development data, and 7,842 5-way annotated items sequestered as test data. We justify the use of 5-way annotation for evaluation in Section 5.

Task definition The input to a model is a question together with an entire Wikipedia page. The target output from the model is: 1) a long answer (e.g., a paragraph) from the page that answers the question, or alternatively an indication that there is no answer on the page; 2) a short answer where applicable. The task was designed to be close to an end-to-end question answering application.

Ensuring high-quality annotations at scale Comprehensive guidelines were developed for the task. These are summarized in Section 3. Annotation quality was constantly monitored.

Evaluation of quality Section 4 describes post-hoc evaluation of annotation quality. Long/short answers have 90%/84% precision, respectively.

Study of variability One clear finding in NQ is that for naturally occurring questions there is often genuine ambiguity in whether or not an answer is acceptable. There are also often a number of acceptable answers. Section 4 examines this variability using 25-way annotations.

Robust evaluation metrics Section 5 introduces methods of measuring answer quality that account for variability in acceptable answers. We demonstrate a high human upper bound on these measures for both long answers (90% precision, 85% recall) and short answers (79% precision, 72% recall).

We propose NQ as a new benchmark for research in QA. In Section 6.4 we present baseline results from recent models developed on comparable data sets (Clark and Gardner, 2018), as well as a simple pipelined model designed for the NQ task. We demonstrate a large gap between the performance of these baselines and a human upper bound. We argue that closing this gap will require significant advances in NLU.
2 Related Work

The SQuAD (Rajpurkar et al., 2016), SQuAD 2.0 (Rajpurkar et al., 2018), NarrativeQA (Kocisky et al., 2018), and HotpotQA (Yang et al., 2018) data sets contain questions and answers written by annotators who have first read a short text containing the answer. The SQuAD data sets contain question/paragraph/answer triples from Wikipedia. In the original SQuAD data set, annotators often borrow part of the evidence paragraph to create a question. Jia and Liang (2017)
showed that systems trained on SQuAD could
be easily fooled by the insertion of distractor
sentences that should not change the answer, and
SQuAD 2.0 introduces questions that are designed
to be unanswerable. However, we argue that ques-
tions written to be unanswerable can be identified
as such with little reasoning, in contrast to NQ’s task
of deciding whether a paragraph contains all of the
evidence required to answer a real question. Both
SQuAD tasks have driven significant advances in
reading comprehension, but systems now outper-
form humans and harder challenges are needed.
NarrativeQA aims to elicit questions that are not
close paraphrases of the evidence by using separate
summary texts. No human performance upper bound is
provided for the full task and, although an extrac-
tive system could theoretically perfectly recover
all answers, current approaches only just outper-
form a random baseline. NarrativeQA may just be
too hard for the current state of NLU. HotpotQA is
designed to contain questions that require reason-
ing over text from separate Wikipedia pages. As
well as answering questions, systems must also
identify passages that contain supporting facts.
This is similar in motivation to NQ’s long answer
task, where the selected passage must contain all
of the information required to infer the answer.
Mirroring our identification of acceptable variabil-
ity in the NQ task definition, HotpotQA’s authors
observe that the choice of supporting facts is
somewhat subjective. They set high human upper
bounds by selecting, for each example, the score
maximizing partition of four annotations into one
prediction and three references. The reference
labels chosen by this maximization are not rep-
resentative of the reference labels in HotpotQA’s
evaluation set, and it is not clear that the upper
bounds are achievable. A more robust approach
is to keep the evaluation distribution fixed, Und
calculate an achievable upper bound by approx-
imating the expectation over annotations—as we
have done for NQ in Section 5.
The QuAC (Choi et al., 2018) and CoQA
(Reddy et al., 2018) data sets contain dialogues
between a questioner, who is trying to learn about
a text, and an answerer. QuAC also prevents the
questioner from seeing the evidence text. Con-
versational QA is an exciting new area, but it is
significantly different from the single turn QA
task in NQ. In both QuAC and CoQA, conversa-
tions tend to explore evidence texts incrementally,
progressing from the start to the end of the text.
This contrasts with NQ, where individual questions
often require reasoning over large bodies of text.
The WikiQA (Yang et al., 2015) and MS Marco
(Nguyen et al., 2016) data sets contain queries
sampled from the Bing search engine. WikiQA
contains only 3,047 questions. MS Marco con-
tains 100,000 questions with freeform answers.
For each question, the annotator is presented with
10 passages returned by the search engine, Und
is asked to generate an answer to the query, or
to say that the answer is not contained within the
passages. Free-form text answers allow more flex-
ibility in providing abstractive answers, but lead to
difficulties in evaluation (BLEU score [Papineni
et al., 2002] is used). MS Marco’s authors do
not discuss issues of variability or report quality
metrics for their annotations. From our expe-
rience, these issues are critical. DuReader (He
et al., 2018) is a Chinese language data set con-
taining queries from Baidu search logs. Like NQ,
DuReader contains real user queries; it requires
systems to read entire documents to find answers;
and it identifies acceptable variability in answers.
However, as with MS Marco, DuReader is reliant
on BLEU for answer scoring, and systems already
outperform humans according to this metric.
There are a number of reading comprehension
benchmarks based on multiple choice tests
(Mihaylov et al., 2018; Richardson et al., 2013; Lai
et al., 2017). The TriviaQA data set (Joshi et al.,
2017) contains questions and answers taken from
trivia quizzes found online. A number of Cloze-
style tasks have also been proposed (Hermann
et al., 2015; Hill et al., 2015; Paperno et al., 2016;
Onishi et al., 2016). We believe that all of these
tasks are related to, but distinct from, answering
information-seeking questions. We also believe
that, because a solution to NQ will have genuine
utility, it is better equipped as a benchmark for
NLU.
3 Task Definition and Data Collection
Natural Questions contains (question, Wikipedia
page, long answer, short answer) quadruples
where: the question seeks factual information; the
Wikipedia page may or may not contain the infor-
mation required to answer the question; the long
answer is a bounding box on this page containing
all information required to infer the answer; and
the short answer is one or more entities that give
a short answer to the question, or a boolean yes or
no. Both the long and short answer can be NULL if
no viable candidates exist on the Wikipedia page.

1.a  where does the nature conservancy get its funding
1.b  who is the song killing me softly written about
2    who owned most of the railroads in the 1800s
4    how far is chardon ohio from cleveland ohio
5    american comedian on have i got news for you

Table 1: Matches for heuristics in Section 3.1.
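To make the (question, Wikipedia page, long answer, short answer) format described above concrete, the following is a minimal illustrative sketch of one such quadruple and its consistency constraint (the class and field names are ours, not the official release schema):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Span:
    """A span within the Wikipedia page (e.g., byte or token offsets)."""
    start: int
    end: int

@dataclass
class NQExample:
    """One (question, Wikipedia page, long answer, short answer) quadruple."""
    question: str
    page_html: str
    long_answer: Optional[Span] = None              # None represents a NULL long answer
    short_answers: List[Span] = field(default_factory=list)
    yes_no_answer: Optional[str] = None             # "YES", "NO", or None

    def is_consistent(self) -> bool:
        # If the long answer is NULL, the short answer must also be NULL.
        if self.long_answer is None:
            return not self.short_answers and self.yes_no_answer is None
        # Short answer spans must fall inside the long answer bounding box.
        return all(self.long_answer.start <= s.start and s.end <= self.long_answer.end
                   for s in self.short_answers)
```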
3.1 Questions and Evidence Documents
All the questions in NQ are queries of 8 words or
more that have been issued to the Google search
engine by multiple users in a short period of time.
From these queries, we sample a subset that either:
1. start with ‘‘who’’, ‘‘when’’, or ‘‘where’’ di-
rectly followed by: a) a finite form of ‘‘do’’
or a modal verb; or b) a finite form of ‘‘be’’
or ‘‘have’’ with a verb in some later position;
2. start with ‘‘who’’ directly followed by a verb
that is not a finite form of ‘‘be’’;
3. contain multiple entities as well as an adjec-
tive, adverb, verb, or determiner;
4. contain a categorical noun phrase immedi-
ately preceded by a preposition or relative
clause;
5. end with a categorical noun phrase, and do
not contain a preposition or relative clause.3
Table 1 gives examples. We run questions
through the Google search engine and keep those
where there is a Wikipedia page in the top 5 search
results. The (question, Wikipedia page) pairs are
the input to the human annotation task described
next.
The goal of these heuristics is to discard a
large proportion of queries that are non-questions,
while retaining the majority of queries of 8 Wörter
or more in length that are questions. A manual
inspection showed that the majority of questions in
the data, with the exclusion of questions beginning
with ‘‘how to’’, are accepted by the filters. We
focus on longer queries as they are more complex,
and are thus a more challenging test for deep NLU.
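As an illustration only (the production filters are not released, and they also rely on POS tags, entities, and Hearst-pattern hypernyms), a rough sketch of how heuristics 1 and 2 above might be approximated with simple pattern matching:

```python
import re

# Illustrative approximations of heuristics 1 and 2 from Section 3.1.
MODALS = r"(?:do|does|did|can|could|will|would|shall|should|may|might|must)"
BE_HAVE = r"(?:is|are|was|were|has|have|had)"

def matches_heuristic_1(query: str) -> bool:
    q = query.lower()
    # 1a: "who"/"when"/"where" directly followed by a finite "do" or a modal verb.
    if re.match(rf"^(who|when|where)\s+{MODALS}\b", q):
        return True
    # 1b: "who"/"when"/"where" + finite "be"/"have" with a verb in a later position
    # (crudely approximated here by requiring a later -ed/-ing/-en word).
    return bool(re.match(rf"^(who|when|where)\s+{BE_HAVE}\b.*\b\w+(ed|ing|en)\b", q))

def matches_heuristic_2(query: str) -> bool:
    q = query.lower()
    # 2: "who" directly followed by a verb that is not a finite form of "be".
    # Without a POS tagger we simply require a non-be/have/modal token after "who".
    m = re.match(r"^who\s+(\w+)", q)
    return bool(m) and not re.fullmatch(rf"{BE_HAVE}|{MODALS}", m.group(1))

print(matches_heuristic_1("when did the us enter world war 2"))                # True
print(matches_heuristic_2("who owned most of the railroads in the 1800s"))     # True
```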
3We pre-define the set of categorical noun phrases used
in 4 and 5 by running Hearst patterns (Hearst, 1992) to find
a broad set of hypernyms. Part of speech tags and entities
are identified using Google’s Cloud NLP API: https://cloud.
google.com/natural-language.
Figure 2: Annotation decision process with path pro-
portions from NQ training data. Percentages are propor-
tions of entire data set. A total of 49% of all examples
have a long answer.
We focus on Wikipedia as it is a very important
source of factual information, and we believe that
stylistically it is similar to other sources of factual
information on the Web; however, like any data
set there may be biases in this choice. Future data-
collection efforts may introduce shorter queries,
‘‘how to’’ questions, or domains other than
Wikipedia.
3.2 Human Identification of Answers
Annotation is performed using a custom annota-
tion interface, by a pool of around 50 annotators,
with an average annotation time of 80 seconds.
The guidelines and tooling divide the annotation
task into three conceptual stages, where all three
stages are completed by a single annotator in
succession. The decision flow through these is
illustrated in Figure 2 and the instructions given
to annotators are summarized below.
Question Identification: Contributors deter-
mine whether the given question is good or bad.
A good question is a fact-seeking question that
can be answered with an entity or explanation.
A bad question is ambiguous, incomprehensible,
dependent on clear false presuppositions, opinion-
seeking, or not clearly a request for factual in-
formation. Annotators must make this judgment
solely by the content of the question; they are not
yet shown the Wikipedia page.
Long Answer Identification: For good ques-
tions only, annotators select the earliest HTML
bounding box containing enough information for
a reader to completely infer the answer to the ques-
tion. Bounding boxes can be paragraphs, tables,
list items, or whole lists. Alternatively, annotators
mark ‘‘no answer’’ if the page does not answer the
question, or if the information is present but not
contained in a single one of the allowed elements.
Short Answer Identification: For examples
with long answers, annotators select the entity or
set of entities within the long answer that answer
the question. Alternatively, annotators can flag
that the short answer is yes, no, or they can flag
that no short answer is possible.
3.3 Data Statistics
In total, annotators identify a long answer for
49% of the examples, and short answer spans or
a yes/no answer for 36% of the examples. We
consider the choice of whether or not to answer
a question a core part of the question answering
Aufgabe, and do not discard the remaining 51% Das
have no answer labeled.
Annotators identify long answers by selecting
the smallest HTML bounding box that contains
all of the information required to answer the
question. These are mostly paragraphs (73%).
The remainder are made up of tables (19%), table
rows (1%), lists (3%), or list items (3%).4 We
leave further subcategorization of long answers to
future work, and provide a breakdown of base-
line performance on each of these three types of
answers in Section 6.4.
4 Evaluation of Annotation Quality
This section describes evaluation of the quality
of the human annotations in our data. We use
a combination of two methods: 1) post hoc
evaluation of correctness of non-null answers,
under consensus judgments from four ‘‘experts’’;
4We note that both tables and lists may be used purely for
the purposes of formatting text, or they may have their own
complex semantics—as in the case of Wikipedia infoboxes.
2) k-way annotations (with k = 25) on a subset of
the data.
Post hoc evaluation of non-null answers leads
directly to a measure of annotation precision. As is
common in information-retrieval style problems
such as long-answer identification, measuring
recall is more challenging. However, we describe
how 25-way annotated data provide useful insights
into recall, particularly when combined with ex-
pert judgments.
4.1 Preliminaries: The Sampling
Distribution
Each item in our data consists of a four-tuple (q, d, l, s) where q is a question, d is a document, l is a long answer, and s is a short answer. Thus we introduce random variables Q, D, L, and S corresponding to these items. Note that L can be a span within the document, or NULL. Similarly, S can be one or more spans within L, a boolean, or NULL.
For now we consider the three-tuple (q, d, l). The treatment for short answers is the same throughout, with (q, d, s) replacing (q, d, l).
Each data item (q, d, l) is independent and identically distributed (IID), sampled from

p(l, q, d) = p(q, d) × p(l | q, d)
Here, p(q, d) is the sampling distribution (probability mass function [PMF]) over question/document pairs. It is defined as the PMF corresponding to the following sampling process:5 First, sample a question at random from some distribution; second, perform a search on a major search engine using the question as the underlying query; finally, either: 1) return (q, d) where d is the top Wikipedia result for q, if d is in the top 5 search results for q; 2) if there is no Wikipedia page in the top 5 results, discard q and repeat the sampling process.
Here p(l | q, d) is the conditional distribution (PMF) over long answer l conditioned on the pair (q, d). The value for l is obtained by: 1) sampling
an annotator uniformly at random from the pool
5More formally, there is some base distribution p_b(q) from which queries q are drawn, and a deterministic function s(q) which returns the top-ranked Wikipedia page in the top 5 search results, or NULL if there is no Wikipedia page in the top 5 results. Define Q to be the set of queries such that s(q) ≠ NULL, and b = Σ_{q∈Q} p_b(q). Then p(q, d) = p_b(q)/b if q ∈ Q and d ≠ NULL and d = s(q); otherwise p(q, d) = 0.
of annotators; 2) presenting the pair (q, d) to the
annotator, who then provides a value for l.
Note that l is non-deterministic due to two
sources of randomness: 1) the random choice of
annotator; 2) the potentially random behavior of
a particular annotator (the annotator may give a
different answer depending on the time of day,
usw.).
We will also consider the distribution

p(l, q, d | L ≠ NULL) = p(l, q, d) / P(L ≠ NULL)   if l ≠ NULL,
                      = 0                          otherwise,

where P(L ≠ NULL) = Σ_{l,q,d: l ≠ NULL} p(l, q, d). Thus p(l, q, d | L ≠ NULL) is the probability of seeing the triple (l, q, d), conditioned on L not being NULL.
We now define precision of annotations. Consider a function π(l, q, d) that is equal to 1 if l is a ‘‘correct’’ answer for the pair (q, d), 0 if the answer is incorrect. The next section gives a concrete definition of π. The annotation precision is defined as

Ψ = Σ_{l,q,d} p(l, q, d | L ≠ NULL) × π(l, q, d)

Given a set of annotations S = {(l(i), q(i), d(i))}_{i=1}^{|S|} drawn IID from p(l, q, d | L ≠ NULL), we can derive an estimate of Ψ as

Ψ̂ = (1/|S|) Σ_{(l,q,d)∈S} π(l, q, d)
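A minimal sketch of this estimator, assuming the expert judgments for the sampled set S are available as a list of category labels (the function and variable names are illustrative):

```python
from collections import Counter

# Estimate annotation precision from expert judgments of a sample S drawn
# from p(l, q, d | L != NULL). Both "C" and "Cd" count as correct, so the
# precision estimate is E_hat(C) + E_hat(Cd).
def precision_estimates(judgments):
    counts = Counter(judgments)
    n = len(judgments)
    props = {cat: counts[cat] / n for cat in ("C", "Cd", "W")}
    props["precision"] = props["C"] + props["Cd"]
    return props

# Toy sample (not the real |S| = 139 sample): 10 judgments.
print(precision_estimates(["C"] * 6 + ["Cd"] * 3 + ["W"]))
# -> C: 0.6, Cd: 0.3, W: 0.1, precision ~ 0.9
```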
4.2 Expert Evaluations of Correctness
We now describe the process for deriving ‘‘expert’’ judgments of answer correctness. We used four experts for these judgments. These
experts had prepared the guidelines for the anno-
tation process.6 In a first phase each of the four
experts independently annotated examples for cor-
rectness. In a second phase the four experts met to
discuss disagreements in judgments, and to reach
a single consensus judgment for each example.
A key step is to define the criteria used to
determine correctness of an example. Given a
triple (l, q, d), we extracted the passage l′ corresponding to l on the page d. The pair (q, l′) was then presented to the expert. Experts categorized (q, l′) pairs into the following three categories:
Correct (C): It is clear beyond a reasonable doubt
that the answer is correct.
6The first four authors of this paper.
Figure 3: Examples with consensus expert judgments, and justification for these judgments. See Figure 6 for more examples.
Correct (but debatable) (Cd): A reasonable person
could be satisfied by the answer; however,
a reasonable person could raise a reasonable
doubt about the answer.
Wrong (W): There is not convincing evidence
that the answer is correct.
Figur 3 shows some example judgments. Wir
introduced the intermediate Cd category after
observing that many (q, l′) pairs are high quality
answers, but raise some small doubt or quibble
about whether they fully answer the question. The
use of the word ‘‘debatable’’ is intended to be
literal: (q, l′) pairs falling into the Cd category
could literally lead to some debate between
reasonable people as to whether they fully answer
the question or not.
Given this background, we will make the follow-
ing assumption:
Answers in the Cd category should be very
useful to a user interacting with a QA system, and
should be considered to be high-quality answers;
Jedoch, an annotator would be justified in either
annotating or not annotating the example.
For these cases there is often disagreement
between annotators as to whether the page contains
an answer or not: We will see evidence of this
when we consider the 25-way annotations.

Quantity    Long answer    Short answer
Ψ̂           90%            84%
Ê(C)        59%            51%
Ê(Cd)       31%            33%
Ê(W)        10%            16%

Table 2: Precision results (Ψ̂) and empirical estimates of the proportions of C, Cd, and W items.
4.3 Results for Precision Measurements
We used the following procedure to derive mea-
surements of precision: 1) We sampled examples
IID from the distribution p(l, q, d | L ≠ NULL). We
call this set S. We had |S| = 139. 2) Four experts
independently classified each of the items in S into
the categories C, Cd, W. 3) The four experts met to
come up with a consensus judgment for each item.
For each example (l(ich), Q(ich), D(ich)) ∈ S, we define
C(ich) to be the consensus judgment. Dieser Prozess war
repeated to derive judgments for short answers.
We can then calculate the percentage of exam-
ples falling into the three expert categories; Wir
denote these values as Ê(C), Ê(Cd), and Ê(W).7
We define Ψ̂ = Ê(C) + Ê(Cd). We have explicitly
included samples C and Cd in the overall precision
as we believe that Cd answers are essentially cor-
rect. Tisch 2 shows the values for these quantities.
4.4 Variability of Annotations
We have shown that an annotation drawn from
P(l, Q, D|L (cid:8)= NULL) has high expected precision.
Now we address the distribution over annotations
for a given (Q, D) pair. Annotators can disagree
about whether or not d contains an answer to
q—that is, whether or not L = NULL. In the case
that annotators agree that L ≠ NULL, they can
also disagree about the correct assignment to L.
In order to study variability, we collected 24 additional annotations from separate annotators for each of the (q, d, l) triples in S. For each (q, d, l) triple, we now have a 5-tuple (q(i), d(i), l(i), c(i), a(i)), where a(i) = a(i)_1 . . . a(i)_25 is a vector of 25 annotations (including l(i)), and c(i) is the consensus judgment for l(i). For each i also define

μ(i) = (1/25) Σ_{j=1}^{25} [[a(i)_j ≠ NULL]]

to be the proportion of the 25-way annotations that are non-null.

7More formally, let [[e]] for any statement e be 1 if e is true, 0 if e is false. We define Ê(C) = (1/|S|) Σ_{i=1}^{|S|} [[c(i) = C]]. The values for Ê(Cd) and Ê(W) are calculated in a similar manner.

Figure 4: Values of Ê[(θ1, θ2]] and Ê[(θ1, θ2], C/Cd/W] for different intervals (θ1, θ2]. The height of each bar is equal to Ê[(θ1, θ2]]; the divisions within each bar show Ê[(θ1, θ2], C], Ê[(θ1, θ2], Cd], and Ê[(θ1, θ2], W].
We now show that μ(i) is highly correlated with annotation precision. We define

Ê[(0.8, 1.0]] = (1/|S|) Σ_{i=1}^{|S|} [[0.8 < μ(i) ≤ 1]]

to be the proportion of examples with greater than 80% of the 25 annotators marking a non-null long answer, and

Ê[(0.8, 1.0], C] = (1/|S|) Σ_{i=1}^{|S|} [[0.8 < μ(i) ≤ 1 and c(i) = C]]

to be the proportion of examples with greater than 80% of the 25 annotators marking a non-null long answer and with c(i) = C. Similar definitions apply for the intervals (0, 0.2], (0.2, 0.4], (0.4, 0.6], and (0.6, 0.8], and for judgments Cd and W.
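A small illustrative sketch of how these bucketed proportions could be computed from per-example values of μ(i) and consensus judgments (the data shown is toy data, not the real sample):

```python
# Bucket examples by mu(i), the fraction of 25 annotators giving a non-null
# long answer, and report the proportion of examples in each bucket broken
# down by the consensus expert judgment (C, Cd, or W).
BUCKETS = [(0.0, 0.2), (0.2, 0.4), (0.4, 0.6), (0.6, 0.8), (0.8, 1.0)]

def bucket_proportions(mus, judgments):
    n = len(mus)
    table = {b: {"C": 0.0, "Cd": 0.0, "W": 0.0} for b in BUCKETS}
    for mu, c in zip(mus, judgments):
        for lo, hi in BUCKETS:
            if lo < mu <= hi:          # intervals are (lo, hi]
                table[(lo, hi)][c] += 1 / n
                break
    return table

# Toy data: three examples with high, middling, and low annotator agreement.
print(bucket_proportions([0.92, 0.48, 0.12], ["C", "Cd", "W"]))
```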
Figure 4 illustrates the proportion of annotations
falling into the C/Cd/W categories in different
regions of μ(i). For those (q, d) pairs where
more than 80% of annotators gave some non-null
answer, our expert judgements agree that these
annotations are overwhelmingly correct. Simi-
larly, when fewer than 20% of annotators gave
a non-null answer, these answers tend to be incor-
rect. In between these two extremes, the disagree-
ment between annotators is largely accounted for
by the Cd category—where a reasonable person
could either be satisfied with the answer, or want
more information. Later, in Section 5, we make
use of the correlation between μ(i) and accuracy
to define a metric for the evaluation of answer
quality. In that section, we also show that a model
trained on (l, q, d) triples can outperform a sin-
gle annotator on this metric by accounting for the
uncertainty of whether or not an answer is present.
As well as disagreeing about whether (q, d)
contains a valid answer, annotators can disagree
about the location of the best answer. In many
cases there are multiple valid long answers in
multiple distinct locations on the page.8 The most
extreme example of this that we see in our 25-way
annotated data is for the question ‘‘name the sub-
stance used to make the filament of bulb’’ paired
with the Wikipedia page about incandescent light
bulbs. Annotators identify 7 passages that discuss
tungsten wire filaments.
Short answers can be arbitrarily delimited and
this can lead to extreme variation. The most
extreme example of this that we see in the 25-way
annotated data is the 11 distinct, but correct,
answers for the question ‘‘where is blood pumped
after it leaves the right ventricle’’. Here, 14 anno-
tators identify a substring of ‘‘to the lungs’’ as
the best possible short answer. Of these, 6 label
the entire string, 4 reduce it to ‘‘the lungs’’, and
4 reduce it to ‘‘lungs’’. A further 6 annotators do
not consider this short answer to be sufficient and
choose more precise phrases such as ‘‘through the
semilunar pulmonary valve into the left and right
main pulmonary arteries (one for each lung)’’.
The remaining 5 annotators decide that there is no
adequate short answer.
For each question, we ranked each of the unique
answers given by our 25 annotators according to
the number of annotators that chose it. We found
that by just taking the most popular long answer,
we could account for 83% of the long answer
annotations. The two most popular long answers
account for 96% of the long answer annotations.
It is extremely uncommon for a question to have
more than three distinct long answers annotated.
Short answers have greater variability, but the
most popular short answer still accounts for 64%
of all short answer annotations. The three most
popular short answers account for 90% of all short
answer annotations.
8As stated earlier in this paper, we did instruct annotators
to select the earliest instance of an answer when there are
multiple answer instances on the page. However, there are
still cases where different annotators disagree on whether an
answer earlier in the page is sufficient in comparison to a
later answer, leading to differences between annotators.
5 Evaluation Measures
NQ includes 5-way annotations on 7,830 items for
development data, and we will sequester a further
7,842 items, 5-way annotated, for test data. This
section describes evaluation metrics using this
data, and gives justification for these metrics.
We choose 5-way annotations for the following
reasons: First, we have evidence that aggregating
annotations from 5 annotators is likely to be much
more robust than relying on a single annotator (see
Section 4). Second, 5 annotators is a small enough
number that the cost of annotating thousands of
development and test items is not prohibitive.
5.1 Definition of an Evaluation Measure
Based on 5-Way Annotations
Assume that we have a model fθ with parameters θ that maps an input (q, d) to a long answer l = fθ(q, d). We would like to evaluate the accuracy of this model. Assume we have evaluation examples {q(i), d(i), a(i)} for i = 1 . . . n, where q(i) is a question, d(i) is the associated Wikipedia document, and a(i) is a vector with components a(i)_j for j = 1 . . . 5. Each a(i)_j is the output from the j’th annotator, and can be a paragraph in d(i), or can be NULL. The five annotators are chosen uniformly at random from a pool of annotators.
We define an evaluation measure based on the
five way annotations as follows. If at least two
out of five annotators have given a non-null long
answer on the example, then the system is required
to output a non-null answer that is seen at least
once in the five annotations; conversely, if fewer
than two annotators give a non-null long answer,
the system is required to return NULL as its output.
To make this more formal, define the function
g(a(i)) to be the number of annotations in a(i)
that are non-null. Define a function hβ(a, l) that
judges the correctness of label l given annotations
a = a1 . . . a5. This function is parameterized by
an integer β. The function returns 1 if the label l
is judged to be correct, and 0 otherwise:
Definition 1 (Definition of hβ(a, l)) If g(a) ≥ β
and l ≠ NULL and l = aj for some j ∈ {1 . . . 5}
Then hβ(a, l) = 1; Else If g(a) < β and
l = NULL Then hβ(a, l) = 1; Else hβ(a, l) = 0.
We used β = 2 in our experiments.9
9This is partly motivated through the results on 25-way
annotations (see Section 4.4), where for μ(i) ≥ 0.4 over 93%
(114/122 annotations) are in the C or Cd categories, whereas
The accuracy of a model is then

Aβ(fθ) = (1/n) Σ_{i=1}^{n} hβ(a(i), fθ(q(i), d(i)))
The value for Aβ is an estimate of accuracy with respect to the underlying distribution, which we define as Āβ(fθ) = E[hβ(a, fθ(q, d))]. Here the expectation is taken with respect to p(a, q, d) = p(q, d) Π_{j=1}^{5} p(aj | q, d), where p(aj | q, d) = P(L = aj | Q = q, D = d); hence the annotations a1 . . . a5 are assumed to be drawn IID from p(l | q, d).10
Precision and Recall During evaluation, it is often beneficial to separately measure false positives (incorrectly predicting an answer) and false negatives (failing to predict an answer). We define the precision (P) and recall (R) of fθ:

t(q, d, a, fθ) = hβ(a, fθ(q, d)) [[fθ(q, d) ≠ NULL]]

P(fθ) = Σ_{i=1}^{n} t(q(i), d(i), a(i), fθ) / Σ_{i=1}^{n} [[fθ(q(i), d(i)) ≠ NULL]]

R(fθ) = Σ_{i=1}^{n} t(q(i), d(i), a(i), fθ) / Σ_{i=1}^{n} [[g(a(i)) ≥ β]]
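A compact sketch of these metrics for the long answer task with β = 2 (answers are represented here as opaque strings for simplicity; this is not the official evaluation script):

```python
from typing import List, Optional

BETA = 2  # value of beta used in the paper's experiments

def h_beta(annotations: List[Optional[str]], prediction: Optional[str]) -> int:
    """1 if the prediction is judged correct against the 5 annotations, else 0."""
    non_null = sum(a is not None for a in annotations)
    if non_null >= BETA:
        return int(prediction is not None and prediction in annotations)
    return int(prediction is None)

def precision_recall(examples, predict):
    """examples: list of (q, d, annotations); predict: function (q, d) -> answer or None."""
    correct = predicted = should_answer = 0
    for q, d, annotations in examples:
        pred = predict(q, d)
        if pred is not None:
            predicted += 1
            correct += h_beta(annotations, pred)   # t(q, d, a, f)
        if sum(a is not None for a in annotations) >= BETA:
            should_answer += 1
    precision = correct / predicted if predicted else 0.0
    recall = correct / should_answer if should_answer else 0.0
    return precision, recall
```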
We discuss this measure at length in this section.
First, however, we make the following critical
point:
It is possible for a model trained on (l(i), q(i),
d(i)) triples drawn IID from p(l, q, d) to exceed the
performance of a single annotator on this measure.
In particular, if we have a model p(l|q, d; θ),
trained on (l, q, d) triples, which is a good
approximation to p(l|q, d), it is then possible to use
p(l|q, d; θ) to make predictions that outperform a
single random draw from p(l|q, d). The Bayes
optimal hypothesis (see Devroye et al., 1997) for
hβ, defined as arg maxf Eq,d,a[[hβ(a, f (q, d))]], is
a function of the posterior distribution p(·|q, d),11 and will generally exceed the performance of a single random annotation, E_{q,d,a}[[ Σ_l p(l | q, d) × hβ(a, l) ]].
We also show this empirically, by constructing
an approximation to p(l|q, d) from 20-way anno-
tations, then using this approximation to make
predictions that significantly outperform a single
annotator.
for μ(i) < 0.4 over 35% (11/17 annotations) are in the W
category.
10This isn’t quite accurate as the annotators are sampled
without replacement; however, it simplifies the analysis.
11Specifically, for an input (q, d), if we define l* = arg max_{l ≠ NULL} p(l | q, d), γ = p(l* | q, d), and γ̄ = p(NULL | q, d), then the Bayes optimal hypothesis is to output l* if P(hβ(a, l*) = 1 | γ, γ̄) ≥ P(hβ(a, NULL) = 1 | γ, γ̄), and to output NULL otherwise. Implementation of this strategy is straightforward if γ and γ̄ are known; this strategy will in general give a higher accuracy value than taking a single sample l from p(l | q, d) and using this sample as the prediction. In principle a model p(l | q, d; θ) trained on (l, q, d) triples can converge to a good estimate of γ and γ̄. Note that for the special case γ + γ̄ = 1 we have P(hβ(a, NULL) = 1 | γ, γ̄) = γ̄^5 + 5γ̄^4(1 − γ̄) and P(hβ(a, l*) = 1 | γ, γ̄) = 1 − P(hβ(a, NULL) = 1 | γ, γ̄). It follows that the Bayes optimal hypothesis is to predict l* if γ ≥ α where α ≈ 0.31381, and to predict NULL otherwise. α is 1 − ᾱ where ᾱ is the solution to ᾱ^5 + 5ᾱ^4(1 − ᾱ) = 0.5.
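As a quick numerical check of the threshold derived in footnote 11, a bisection sketch that solves ᾱ^5 + 5ᾱ^4(1 − ᾱ) = 0.5:

```python
# Solve alpha_bar^5 + 5*alpha_bar^4*(1 - alpha_bar) = 0.5 by bisection;
# the resulting answer threshold is alpha = 1 - alpha_bar (approximately 0.31381).
def f(a: float) -> float:
    return a**5 + 5 * a**4 * (1 - a) - 0.5   # monotonically increasing on [0, 1]

lo, hi = 0.0, 1.0
for _ in range(60):
    mid = (lo + hi) / 2
    if f(mid) < 0:
        lo = mid
    else:
        hi = mid

alpha_bar = (lo + hi) / 2
print(f"alpha = {1 - alpha_bar:.5f}")  # ~0.31381
```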
5.2 Super-Annotator Upper Bound
To place an upper bound on the metrics introduced
above we create a ‘‘super-annotator’’ from the 25-
way annotated data introduced in Section 4. From
this data, we create four-tuples (q(i), d(i), a(i), b(i)). The first three terms in this tuple are the question, document, and vector of five reference annotations. b(i) is a vector of annotations b(i)_j for j = 1 . . . 20, drawn from the same distribution as a(i). The super-annotator predicts NULL if g(b(i)) < α, and l* = arg max_{l∈d} Σ_{j=1}^{20} [[l = b(i)_j]] otherwise.
Table 3 shows super-annotator performance
for α = 8, with 90.0% precision, 84.6% recall,
and 87.2% F-measure. This significantly exceeds
the performance (80.4% precision/67.6% recall/
73.4% F-measure) for a single annotator. We
subsequently view the super-annotator numbers
as an effective upper bound on performance of a
learned model.
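A minimal sketch of the super-annotator decision rule described above, with α = 8 and 20 annotations per example (the representation of annotations is illustrative):

```python
from collections import Counter
from typing import List, Optional

ALPHA = 8  # threshold used for the super-annotator results in Table 3

def super_annotator(annotations: List[Optional[str]]) -> Optional[str]:
    """Aggregate 20 annotations: answer only if at least ALPHA are non-null,
    then return the most frequently chosen long answer."""
    non_null = [a for a in annotations if a is not None]
    if len(non_null) < ALPHA:
        return None
    answer, _count = Counter(non_null).most_common(1)[0]
    return answer
```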
6 Baseline Performance
The NQ corpus is designed to provide a benchmark
with which we can evaluate the performance of
QA systems. Every question in NQ is unique under
exact string match, and we split questions ran-
domly in NQ into separate train/development/test
sets. To facilitate comparison, we introduce base-
lines that either make use of high-level data set
regularities, or are trained on the 307k examples in
the training set. Here, we present well-established
baselines that were state of the art at the time
of submission. We also refer readers to Alberti
et al. (2019) for more recent advances in model-
ing. All of our baselines focus on the long and
short answer extraction tasks. We leave boolean
answers to future work.
                     Long answer Dev     Long answer Test    Short answer Dev    Short answer Test
                     P     R     F1      P     R     F1      P     R     F1      P     R     F1
First paragraph      22.2  37.8  27.8    22.3  38.5  28.3    –     –     –       –     –     –
Most frequent        43.1  20.0  27.3    40.2  18.4  25.2    –     –     –       –     –     –
Closest question     37.7  28.5  32.4    36.2  27.8  31.4    –     –     –       –     –     –
DocumentQA           47.5  44.7  46.1    48.9  43.3  45.7    38.6  33.2  35.7    40.6  31.0  35.1
DecAtt + DocReader   52.7  57.0  54.8    54.3  55.7  55.0    34.3  28.9  31.4    31.9  31.1  31.5
Single annotator†    80.4  67.6  73.4    –     –     –       63.4  52.6  57.5    –     –     –
Super-annotator†     90.0  84.6  87.2    –     –     –       79.1  72.6  75.7    –     –     –

Table 3: Precision (P), recall (R), and the harmonic mean of these (F1) of all baselines, a single annotator, and the super-annotator upper bound. The human performances marked with † are evaluated on a sample of five annotations from the 25-way annotated data introduced in Section 5.
6.1 Untrained Baselines
NQ’s long answer selection task admits several
untrained baselines. The first paragraph of a
Wikipedia page commonly acts as a summary
of the most important information regarding the
page’s subject. We therefore implement a long
answer baseline that simply selects the first
paragraph for all pages.
Furthermore, because 79% of the Wikipedia
pages in the development set also appear in
the training set, we implement two ‘‘copying’’
baselines. The first of these simply selects the
most frequent annotation applied to a given page in
the training set. The second selects the annotation
given to the training set question closest to the eval
set question according to TFIDF weighted word
overlap. These three baselines are reported as First
paragraph, Most frequent, and Closest question in
Table 3, respectively.
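A rough sketch of the ‘‘closest question’’ copying baseline, assuming scikit-learn is available; the paper specifies only TFIDF-weighted word overlap, so the remaining details (including searching over all training questions rather than only those paired with the same page) are our simplifications:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def closest_question_baseline(train_questions, train_annotations, eval_questions):
    """For each eval question, copy the annotation of the most similar
    training question under TFIDF-weighted word overlap."""
    vectorizer = TfidfVectorizer()
    train_matrix = vectorizer.fit_transform(train_questions)
    eval_matrix = vectorizer.transform(eval_questions)
    similarities = cosine_similarity(eval_matrix, train_matrix)
    return [train_annotations[row.argmax()] for row in similarities]
```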
6.2 Document-QA
We adapt the reference implementation12 of
Document-QA (Clark and Gardner, 2018) for the
NQ task. This system performs well on the SQuAD
and TriviaQA short answer extraction tasks, but it
is not designed to represent: (i) the long answers
that do not contain short answers, and (ii) the
NULL answers that occur in NQ.
To address (i) we choose the shortest available
answer span at training, differentiating long and
short answers only through the inclusion of special
start and end of passage tokens that identify long
answer candidates. At prediction time, the model
can either predict a long answer (and no short
answer), or a short answer (which implies a long
answer).
12https://github.com/allenai/document-qa.
To address (ii), we tried adding special NULL
passages to represent the lack of answer. However,
we achieved better performance by training on the
subset of questions with answers and then only
predicting those answers whose scores exceed a
threshold.
With these two modifications, we are able to
apply Document-QA to NQ. We follow Clark and
Gardner (2018) in pruning documents down to the
set of passages that have highest TFIDF similarity
with the question. Under this approach, we con-
sider the top 16 passages as long answers. We con-
sider short answers containing up to 17 words. We
train Document-QA for 30 epochs with batches
containing 15 examples. The post hoc score thresh-
old is set to 3.0. All of these values were chosen
on the basis of development set performance.
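A sketch of the post hoc score thresholding described above for producing NULL predictions (the threshold value 3.0 comes from the text; the function and variable names are illustrative):

```python
SCORE_THRESHOLD = 3.0  # tuned on the development set

def predict_with_threshold(scored_candidates):
    """scored_candidates: list of (answer_span, score) pairs from a model trained
    only on answerable questions. Return the best span if its score clears the
    threshold, otherwise predict NULL (no answer)."""
    if not scored_candidates:
        return None
    best_span, best_score = max(scored_candidates, key=lambda x: x[1])
    return best_span if best_score >= SCORE_THRESHOLD else None
```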
6.3 Custom Pipeline (DecAtt + DocReader)
One view of the long answer selection task is that
it is more closely related to natural language infer-
ence (Bowman et al., 2015; Williams et al., 2018)
than short answer extraction. A valid long answer
must contain all of the information required to
infer the answer. Short answers do not need to con-
tain this information—they need to be surrounded
by it.
Motivated by this intuition, we implement a
pipelined approach that uses a model drawn from
the natural language inference literature to se-
lect long answers. Then short answers are selected
from these using a model drawn from the short
answer extraction literature.
Long answer selection Let t(d, l) denote the
sequence of tokens in d for the long answer
candidate l. We then use the Decomposable
Attention model (Parikh et al., 2016) to produce
Figure 5: Examples from the questions with 25-way annotations.
a score for each question, candidate pair xl = DecAtt(q, t(d, l)). To this we add a 10-dimensional trainable embedding rl of the long answer candidate's position in the sequence of candidates;13 an integer ul containing the number of words shared by q and t(d, l); and a scalar vl containing the number of words shared by q and t(d, l) weighted by inverse document frequency. The long answer score zl is then given as a linear function of the above features, zl = w⊤[xl, rl, ul, vl] + b, where w and b are the trainable weight vector and bias, respectively.
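An illustrative sketch of this linear scoring function, taking the DecAtt output as a given feature vector (dimensions, initialization, and names here are assumptions, not the authors' implementation):

```python
import numpy as np

# Illustrative long answer scorer: z_l = w^T [x_l, r_l, u_l, v_l] + b, where
# x_l is the DecAtt output for (question, candidate), r_l a learned position
# embedding, u_l the word-overlap count, and v_l the IDF-weighted word overlap.
class LongAnswerScorer:
    def __init__(self, decatt_dim: int):
        rng = np.random.default_rng(0)
        self.position_emb = rng.normal(size=(20, 10))   # positions 1..19, plus one for >= 20
        feat_dim = decatt_dim + 10 + 2                   # [x_l, r_l, u_l, v_l]
        self.w = rng.normal(size=feat_dim)
        self.b = 0.0

    def score(self, x_l: np.ndarray, position: int, u_l: float, v_l: float) -> float:
        r_l = self.position_emb[min(position, 20) - 1]
        features = np.concatenate([x_l, r_l, [u_l, v_l]])
        return float(self.w @ features + self.b)
```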
Short answer selection Given a long answer,
the Document Reader model (Chen et al., 2017;
abbreviated DocReader) is used to extract short
answers.
Training The long answer selection model is
trained by minimizing the negative log-likelihood
of the correct answer l(i) with a hyperparameter η
that down-weights examples with the NULL label:
− Σ_{i=1}^{n} log( exp(z_{l(i)}) / Σ_{l} exp(z_{l}) ) × (1 − η [[l(i) = NULL]])
We found that the inclusion of η is useful in
accounting for the asymmetry in labels—because
a NULL label is less informative than an answer
location. Varying η also seems to provide a more
stable method of setting a model’s precision point
than post hoc thresholding of prediction scores.
An analogous strategy is used for the short answer
model where examples with no entity answers are
given a different weight.
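A sketch of this down-weighted negative log-likelihood for a single example, assuming the candidate scores z have already been computed (illustrative, not the authors' code):

```python
import numpy as np

def weighted_nll(scores: np.ndarray, gold_index: int, gold_is_null: bool, eta: float) -> float:
    """Negative log-likelihood of the gold candidate under a softmax over
    candidate scores, down-weighted by (1 - eta) when the gold label is NULL."""
    log_z = np.logaddexp.reduce(scores)              # log sum_l exp(z_l)
    nll = -(scores[gold_index] - log_z)              # -log softmax(z)[gold]
    weight = 1.0 - eta if gold_is_null else 1.0
    return weight * nll
```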
13Specifically, we have a unique learned 10-dimensional
embedding for each position 1 . . . 19 in the sequence, and a
20th embedding used for all positions ≥ 20.
6.4 Results
Table 3 shows results for all baselines as well
as a single annotator, and the super-annotator
introduced in Section 5. It is clear that there is a
great deal of headroom in both tasks. We find that
Document-QA performs significantly worse than
DecAtt+DocReader in long answer identification.
This is likely because Document-QA was designed
for the short answer task only.
To ground these results in the context of
comparable tasks, we measure performance on
the subset of NQ that has non-NULL labels for both
long and short answers. Freed from the decision
of whether or not to answer, DecAtt+DocReader
obtains 68.0% F1 on the long answer task, and
40.4% F1 on the short answer task. We also ex-
amine performance of the short answer extraction
systems in the setting where the long answer
is given, and a short answer is known to exist.
With this simplification, short answer F1 increases
57.7% for DocReader. Under
this restriction
NQ roughly approximates the SQuAD 1.1 task.
From the gap to the super-annotator upper bound
we know that this task is far from being solved
in NQ.
Finally, we break the long answer identification
results down according to long answer type. From
Table 3 we know that DecAtt+DocReader predicts
long answers with 54.8% F1. If we only measure
performance on examples that should have a
paragraph long answer, this increases to 65.1%.
For tables and table rows it is 66.4%. And for lists
and list items it is 32.0%. All other examples have
a NULL label. Clearly, the model is struggling to
learn some aspect of list-formatted data from the
6% of the non NULL examples that have this type.
Figure 6: Answer annotations for four examples from Figure 5 that have long answers that are paragraphs (i.e., not
tables or lists). We show the expert judgment (C/Cd/W) for each non-null answer. ‘‘Long answer stats’’ a/25,
b/25 have a = number of non-null long answers for this question, b = number of long answers the same as that
shown in the figure. For example, for question A1, 13 out of 25 annotators give some non-null answer, and 4 out
of 25 annotators give the same long answer After mashing . . .. ‘‘Short answer stats’’ has similar statistics for
short answers.
7 Conclusion
We argue that progress on QA has been hindered
by a lack of appropriate training and test data.
To address this, we present the Natural Questions
corpus. This is the first large publicly available
data set to pair real user queries with high-quality
annotations of answers in documents. We also
present metrics to be used with NQ, for the purposes
of evaluating the performance of question answer-
ing systems. We demonstrate a high upper bound
on these metrics and show that existing methods do
not approach this upper bound. We argue that for
them to do so will require significant advances in
NLU. Figure 5 shows example questions from the
data set. Figure 6 shows example question/answer
pairs from the data set, together with expert judg-
ments and statistics from the 25-way annotations.
References
Chris Alberti, Kenton Lee, and Michael Collins.
2019. A BERT Baseline for the Natural Ques-
tions. arXiv preprint arXiv:1901.08634.
Samuel R. Bowman, Gabor Angeli, Christopher
Potts, and Christopher D. Manning. 2015. A
large annotated corpus for learning natural
language inference. In Proceedings of the 2015
Conference on Empirical Methods in Natural
Language Processing, pages 632–642.
Danqi Chen, Adam Fisch, Jason Weston, and
Antoine Bordes. 2017. Reading Wikipedia to
answer open-domain questions. In Proceedings
of the 55th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long
Papers), pages 1870–1879.
Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar,
Wen-tau Yih, Yejin Choi, Percy Liang, and
Luke Zettlemoyer. 2018. Quac: Question answer-
ing in context. In Proceedings of the 2018
Conference on Empirical Methods in Natu-
ral Language Processing, pages 2174–2184,
Brussels.
Christopher Clark and Matt Gardner. 2018.
Simple and effective multi-paragraph reading
464
comprehension. In Proceedings of the 56th An-
nual Meeting of the Association for Compu-
tational Linguistics (Volume 1: Long Papers),
pages 845–855, Melbourne.
Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2018. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328.
Luc Devroye, László Györfi, and Gábor Lugosi.
1997. A Probabilistic Theory of Pattern Rec-
ognition, corrected 2nd edition, volume 31 of
Applications of Mathematics. Springer.
Wei He, Kai Liu, Jing Liu, Yajuan Lyu, Shiqi
Zhao, Xinyan Xiao, Yuan Liu, Yizhong Wang,
Hua Wu, Qiaoqiao She, Xuan Liu, Tian Wu,
and Haifeng Wang. 2018. Dureader: A Chinese
machine reading comprehension dataset from
real-world applications. In Proceedings of the
Workshop on Machine Reading for Question
Answering, pages 37–46, Melbourne.
Marti A. Hearst. 1992. Automatic acquisition of
hyponyms from large text corpora. In COLING
1992 Volume 2: The 15th International Confer-
ence on Computational Linguistics.
Karl Moritz Hermann, Tomáš Kočiský, Edward
Grefenstette, Lasse Espeholt, Will Kay, Mustafa
Suleyman, and Phil Blunsom. 2015. Teaching
machines to read and comprehend. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS'15, Cambridge, MA.
Felix Hill, Antoine Bordes, Sumit Chopra, and Jason
Weston. 2015. The goldilocks principle: Read-
ing children’s books with explicit memory rep-
resentations. In Proceedings of the International
Conference on Learning Representations.
Robin Jia and Percy Liang. 2017. Adversarial
examples for evaluating reading comprehension
systems. In Proceedings of the 2017 Conference
on Empirical Methods in Natural Language
Processing, pages 2021–2031, Copenhagen.
Mandar Joshi, Eunsol Choi, Daniel Weld, and
Luke Zettlemoyer. 2017. Triviaqa: A large scale
distantly supervised challenge dataset for read-
ing comprehension. In Proceedings of the 55th
Annual Meeting of the Association for Compu-
tational Linguistics (Volume 1: Long Papers),
pages 1601–1611.
Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming
Yang, and Eduard Hovy. 2017. Race: Large-
scale reading comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen.
Todor Mihaylov, Peter Clark, Tushar Khot, and
Ashish Sabharwal. 2018. Can A suit of armor
conduct electricity? A new dataset for open book
question answering. In Proceedings of the 2018
Conference on Empirical Methods in Natural
Language Processing, pages 2381–2391, Brussels.
Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng
Gao, Saurabh Tiwary, Rangan Majumder, and
Li Deng. 2016. MS MARCO: A human gen-
erated machine reading comprehension dataset.
In Proceedings of the Workshop on Cognitive
Computation: Integrating Neural and Symbolic
Approaches.
Takeshi Onishi, Hai Wang, Mohit Bansal, Kevin
Gimpel, and David McAllester. 2016. Who
did what: A large-scale person-centered cloze
dataset. In Proceedings of the 2016 Conference
on Empirical Methods in Natural Language
Processing, pages 2230–2235. Austin, TX.
Denis Paperno, Germán Kruszewski, Angeliki
Lazaridou, Ngoc Quan Pham, Raffaella Bernardi,
Sandro Pezzelle, Marco Baroni, Gemma
Boleda, and Raquel Fernandez. 2016. The
LAMBADA dataset: Word prediction requiring
a broad discourse context. In Proceedings of
the 54th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long
Papers), pages 1525–1534, Berlin.
Kishore Papineni, Salim Roukos, Todd Ward, and
Wei-Jing Zhu. 2002. BLEU: A method for
automatic evaluation of machine translation.
In Proceedings of 40th Annual Meeting of
the Association for Computational Linguistics,
pages 311–318, Philadelphia.
Ankur Parikh, Oscar Täckström, Dipanjan Das,
and Jakob Uszkoreit. 2016. A decomposable
attention model for natural language inference.
In Proceedings of the 2016 Conference on Em-
pirical Methods in Natural Language Pro-
cessing, pages 2249–2255, Austin, TX.
Pranav Rajpurkar, Robin Jia, and Percy Liang.
2018. Know what you don’t know: Un-
answerable questions for squad. In Proceedings
of the 56th Annual Meeting of the Association
for Computational Linguistics (Volume 2: Short
Papers), pages 784–789.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev,
and Percy Liang. 2016. SQuAD: 100,000+
Questions for Machine Comprehension of Text.
In Proceedings of the 2016 Conference on Em-
pirical Methods in Natural Language Pro-
cessing, pages 2383–2392, Austin, TX.
Siva Reddy, Danqi Chen, and Christopher
D. Manning. 2018. Coqa: A conversational
question answering challenge. arXiv preprint
arXiv:1808.07042.
Matthew Richardson, Christopher J. C. Burges,
and Erin Renshaw. 2013. MCTest: A chal-
lenge dataset for the open-domain machine
comprehension of text. In Proceedings of the
2013 Conference on Empirical Methods in
Natural Language Processing, pages 193–203,
Seattle, WA.
Adina Williams, Nikita Nangia, and Samuel
Bowman. 2018. A broad-coverage challenge
corpus for sentence understanding through
inference. In Proceedings of the 2018 Confer-
ence of the North American Chapter of the
Association for Computational Linguistics:
Human Language Technologies, Volume 1
(Long Papers), pages 1112–1122, New Orleans,
LA.
Yi Yang, Wen-tau Yih, and Christopher Meek.
2015. Wikiqa: A challenge dataset for open-
domain question answering. In Proceedings of
the 2015 Conference on Empirical Methods
in Natural Language Processing, pages 2013–2018,
Lisbon.
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua
Bengio, William Cohen, Ruslan Salakhutdinov,
and Christopher D. Manning. 2018. Hotpotqa:
A dataset for diverse, explainable multi-hop
question answering. In Proceedings of the 2018
Conference on Empirical Methods in Natural Lan-
guage Processing, pages 2369–2380, Brussels.