PAQ: 65 Million Probably-Asked Questions and
What You Can Do With Them
Patrick Lewis†‡ Yuxiang Wu‡ Linqing Liu‡ Pasquale Minervini‡ Heinrich Küttler†
Aleksandra Piktus† Pontus Stenetorp‡ Sebastian Riedel†‡
†Facebook AI Research ‡University College London, United Kingdom
{plewis,hnr,piktus,sriedel}@fb.com
{yuxiang.wu,linqing.liu,p.minervini,p.stenetorp}@cs.ucl.ac.uk
Abstract
Open-domain Question Answering models
that directly leverage question-answer (QA)
pairs, such as closed-book QA (CBQA) models
and QA-pair retrievers, show promise in terms
of speed and memory compared with conven-
tional models which retrieve and read from
text corpora. QA-pair retrievers also offer in-
terpretable answers, a high degree of control,
and are trivial to update at test time with new
knowledge. However, these models fall short
of the accuracy of retrieve-and-read systems,
as substantially less knowledge is covered by
the available QA-pairs relative to text cor-
pora like Wikipedia. To facilitate improved
QA-pair models, we introduce Probably Asked
Questions (PAQ), a very large resource of 65M
automatically generated QA-pairs. We intro-
duce a new QA-pair retriever, RePAQ, to
complement PAQ. We find that PAQ preempts
and caches test questions, enabling RePAQ to
match the accuracy of recent retrieve-and-read
models, whilst being significantly faster. Using
PAQ, we train CBQA models which outper-
form comparable baselines by 5%, but trail
RePAQ by over 15%, indicating the effec-
tiveness of explicit retrieval. RePAQ can be
configured for size (under 500MB) or speed
(over 1K questions per second) while retain-
ing high accuracy. Finally, we demonstrate
RePAQ’s strength at selective QA, abstaining
from answering when it is likely to be incor-
rect. This enables RePAQ to ‘‘back-off’’ to
a more expensive state-of-the-art model, lead-
ing to a combined system which is both more
accurate and 2x faster than the state-of-the-art
model alone.
1 Introduction
Open-domain QA (ODQA) systems usually have
access to a background corpus that can be used to
answer questions. Models that explicitly exploit
this corpus are commonly referred to as Open-
book models (Roberts et al., 2020). They typically
index the whole corpus, and then retrieve-and-
read documents to answer questions (Voorhees
and Harman, 1999; Chen et al., 2017, inter alia).
A second class of models, closed-book ques-
tion answering (CBQA) models, has recently
been proposed. They learn to directly map ques-
tions to answers from training question-answer
(QA) pairs without access to a background corpus
(Roberts et al., 2020; Ye et al., 2021). These mod-
els usually take the form of pretrained seq2seq
models such as T5 (Raffel et al., 2020) or BART
(Lewis et al., 2020a), fine-tuned on QA-pairs. It
has recently been shown that current closed-book
models mostly memorize training QA-pairs, and
can struggle to answer questions that do not
overlap with training data (Lewis et al., 2021).
Models that explicitly retrieve (training) QA-
pairs, rather than memorizing them in parameters,
have been shown to perform competitively with
CBQA models (Lewis et al., 2021; Xiao et al.,
2021). These models have a number of useful pro-
perties, such as fast inference, interpretable out-
puts (by inspecting retrieved QA-pairs), and the
ability to update the model’s knowledge at test
time by adding or removing QA-pairs.
However, CBQA and QA-pair retriever models
are currently not competitive with retrieve-and-
read systems in terms of accuracy, largely because
the training QA-pairs they operate on cover sub-
stantially less knowledge than background corpora
like Wikipedia. In this paper, we explore whether
greatly expanding the coverage of QA-pairs en-
ables CBQA and QA-pair retriever models which
are competitive with retrieve-and-read models.
We present Probably Asked Questions (PAQ),
a semi-structured Knowledge Base (KB) of 65M
natural language QA-pairs, which models can
memorise and/or learn to retrieve from. PAQ
differs from traditional KBs in that questions and
answers are stored in natural language, and that
questions are generated such that they are likely to
appear in ODQA datasets. PAQ is automatically
constructed using a question generation model
and Wikipedia. To ensure generated questions
are not only answerable given the passage they
are generated from, we use a global filtering
post-processing step using a state-of-the-art
ODQA system. This greatly reduces the amount
of wrong/ambiguous questions compared to other
approaches (Fang et al., 2020; Alberti et al., 2019),
and is critical for high-accuracy, downstream QA.
To complement PAQ we develop RePAQ,
an ODQA model based on question retrieval/
matching models, using dense Maximum Inner
Product Search-based retrieval, and optionally,
re-ranking. We show that PAQ and RePAQ
provide accurate ODQA predictions, at the level
of relatively recent large-scale retrieve-and-read
systems such as RAG (Lewis et al., 2020b) on
NaturalQuestions (Kwiatkowski et al., 2019) and
TriviaQA (Joshi et al., 2017). PAQ instances are
annotated with scores that reflect how likely we
expect questions to appear, which can be used
to control the memory footprint of RePAQ by
pruning the KB accordingly. As a result, RePAQ
becomes flexible, allowing us to configure QA
systems with near state-of-the-art results, very
small memory size, or inference speeds of over
1,000 questions per second.
We also show that PAQ is a useful source of
training data for CBQA models. BART models
trained on PAQ outperform standard data base-
lines by 5%. However, these models struggle to
effectively memorize all the knowledge in PAQ,
lagging behind RePAQ by 15%. This demon-
strates the effectiveness of RePAQ at leverag-
ing PAQ.
Finally, we show that as RePAQ’s question
matching score correlates well with QA accuracy,
it effectively ‘‘knows when it doesn’t know’’,
allowing for selective question answering
(Voorhees, 2002) where systems may abstain
from answering. Although answer abstaining is
important in its own right, it also enables an
elegant ‘‘back-off’’ approach where we can defer
to a more accurate but expensive QA system when
the answer confidence is low. This allows us to
make use of the best of both speed and accuracy.
In summary, we make the following con-
tributions: i) we introduce PAQ, 65M QA-pairs
automatically generated from Wikipedia, and
demonstrate the importance of global filtering
for high quality; ii) we introduce RePAQ, a QA sys-
tem designed to utilize PAQ and demonstrate how
it can be optimised for memory, speed, or accu-
racy; iii) we investigate the utility of PAQ for CBQA
models, improving by 5% but noting significant
headroom to RePAQ; and iv) we demonstrate RePAQ’s
strength on selective QA, enabling us to combine
RePAQ with a state-of-the-art QA model,
making it both more accurate and 2x faster.1
2 Open-Domain Question Answering
ODQA is the task of answering natural language
factoid questions from an open set of domains.
A typical question might be ‘‘when was the last
year astronauts landed on the moon?’’, with a
target answer ‘‘1972’’. The goal of ODQA is to
develop an answer function m : Q → A, where
Q and A, respectively, are the sets of all possi-
ble questions and answers. We assume there is
a distribution P (q, a) of QA-pairs, defined over
Q × A. A good answer function will minimize
the expected error over P (q, a) with respect to
some loss function, such as answer string match.
In practice, we do not have access to P (q, a), and
instead rely on an empirical sample of QA-pairs
K drawn from P , and measure the empirical loss
of answer functions on K. Our goal in this work
is to implicitly model P (q, a) in order to draw
a large sample of QA-pairs, PAQ, which we can
train on and/or retrieve from. A sufficiently-large
drawn sample will overlap with K, essentially
pre-empting and caching questions that humans
may ask at test-time. This allows us to shift com-
putation from test-time to train-time compared to
retrieve-and-read methods.
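As a concrete illustration of this protocol, the following sketch (with hypothetical helper names, not part of our implementation) computes the empirical exact-match accuracy of an answer function on a sample of QA-pairs:

```python
# Minimal sketch of the evaluation protocol above: score an answer function
# on an empirical sample K of QA-pairs under the exact-match loss.
# `normalize` is a simplified stand-in for the standard EM normalisation.

def normalize(text: str) -> str:
    # The usual protocol also strips articles and punctuation; omitted here.
    return text.lower().strip()

def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)

def empirical_accuracy(answer_fn, qa_sample):
    """qa_sample: list of (question, answer) pairs drawn from P(q, a)."""
    hits = sum(exact_match(answer_fn(q), a) for q, a in qa_sample)
    return hits / len(qa_sample)
```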
3 Generating Question-Answer Pairs
In this section, we describe the process for gen-
erating PAQ. Given a large background corpus
C, our QA-pair generation process consists of the
following components:
1. A passage selection model ps(C), to identify
passages which humans are likely to ask
questions about.
1Datos, modelos, and code are available at https://
github.com/facebookresearch/PAQ.
Figure 1: Top Left: Generation pipeline for QA-pairs in PAQ. Top Right: PAQ used as training data for CBQA
models. Bottom Left: RePAQ retrieves similar QA-pairs to input questions from PAQ. Bottom right: RePAQ’s
confidence is predictive of accuracy. If confidence is low, we can defer to slower, more accurate systems,
like FiD.
2. An answer extraction model pa(a|C), for
identifying spans in a passage that are more
likely to be answers to a question.
3. A question generator pq(q|a, C) that, given a
passage and an answer, generates a question.
4. A filtering QA model pf (a|q, C) that gen-
erates an answer for a given question. If an
answer generated by pf does not match the
answer a question was generated from, the
question is discarded. This ensures generated
questions are consistent (Alberti et al., 2019).
As shown in Figure 1, these models are applied
sequentially to generate QA-pairs, similarly to
contextual QA generation (Alberti et al., 2019;
Lewis et al., 2019). First a passage c is selected
with a high probability under ps. Next, candidate
answers a are extracted from c using pa, and
questions q are generated for each answer using
pq. Finally, pf generates a new answer a′ for the
question. If the source answer a matches a′, then (q, a)
is deemed consistent and added to PAQ. The
pipeline is based on Alberti et al. (2019), updated
to take advantage of recent modeling advances.
Passage selection and our filtering approach are
novel contributions to the best of our knowledge,
specifically designed for ODQA QA-pair gen-
eration. Each component is described in detail
below.
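For concreteness, the following sketch illustrates how the four components compose end-to-end; the function names are placeholders for ps, pa, pq, and the global filter pf, not our released code:

```python
# Illustrative composition of the generation pipeline (placeholder functions).
def generate_paq(corpus, top_n_passages):
    paq = []
    # 1) Rank passages by p_s and keep those most likely to be asked about.
    passages = sorted(corpus, key=passage_score, reverse=True)[:top_n_passages]
    for c in passages:
        # 2) Extract candidate answer spans with p_a.
        for a in extract_answers(c):
            # 3) Generate questions conditioned on (a, c) with p_q.
            for q in generate_questions(a, c):
                # 4) Global filtering: keep (q, a) only if an ODQA model,
                #    given the question alone, reproduces the source answer.
                if odqa_answer(q) == a:
                    paq.append((q, a))
    return paq
```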
3.1 Passage Selection, ps
The passage selection model ps is used to find
passages that are likely to contain information
that humans may ask about, and thus make good
candidates to generate questions from. We learn
ps using a similar method to Karpukhin et al.
(2020). Concretely, we assume access to a set of
positive passages C + ⊂ C, obtained from answer-
containing passages from ODQA train sets. Como
we do not have a set of labeled negatives,
we sample negatives either randomly or using
heuristics. We then maximize log-likelihood of
positive passages relative to negatives. We im-
plement ps with RoBERTa (Liu et al., 2019)
and obtain positive passages from Natural Ques-
tions (NQ, Kwiatkowski et al., 2019). We sample
easy negatives at random from Wikipedia, and
hard negatives from the same Wikipedia article
as the positive passage. Easy negatives help the
model to learn topics of interest, and hard nega-
tives help to differentiate between interesting and
non-interesting passages from the same article.
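A minimal sketch of this training setup, assuming RoBERTa from the transformers library and a single positive per batch (the hyperparameters and helper names are illustrative, not our exact configuration):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")
score_head = torch.nn.Linear(encoder.config.hidden_size, 1)

def passage_scores(passages):
    batch = tok(passages, padding=True, truncation=True, return_tensors="pt")
    cls = encoder(**batch).last_hidden_state[:, 0]    # sentence representation
    return score_head(cls).squeeze(-1)                # one logit per passage

def selection_loss(positive, easy_negatives, hard_negatives):
    # Maximise the log-likelihood of the positive passage relative to
    # randomly sampled easy negatives and same-article hard negatives.
    logits = passage_scores([positive] + easy_negatives + hard_negatives)
    target = torch.zeros(1, dtype=torch.long)         # positive sits at index 0
    return F.cross_entropy(logits.unsqueeze(0), target)
```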
3.2 Answer Extraction, pa
Given a passage, this component identifies spans
that are likely to be answers to questions. We
consider two alternatives: an off-the-shelf Named
Entity Recognizer (NER) or training a BERT
(Devlin et al., 2019) answer extraction model
on NQ.
The NER answer extractor simply extracts all
named entities from a passage.2 The majority of
questions in ODQA datasets consist of entity men-
tions (Kwiatkowski et al., 2019; Joshi et al., 2017),
so this approach can achieve high answer cover-
age. However, as we extract all entity mentions
in a passage, we may extract unsuitable mentions,
2We use the spaCy (Honnibal et al., 2020) NER system,
trained on OntoNotes (Hovy et al., 2006).
or miss answers that do not conform to the NER
system’s annotation schema. The trained answer
span extractor aims to address these issues.
BERT answer span extraction is typically
performed by modelling answer start and end inde-
pendently (Devlin et al., 2019). We instead follow
the approach of Alberti et al. (2019), which breaks
the conditional independence of answer spans
by directly predicting pa(a|C) = p([astart, aend]|C).
Our implementation first feeds a passage through
BERT, before concatenating the start and end to-
ken representations of all possible spans of up to
length 30, before passing them through an MLP
to give pa(a|C). At generation time, we extract the
top-K most probable spans from each passage.
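A simplified sketch of this span scorer, assuming the token representations for one passage have already been produced by a BERT encoder (the MLP sizes and top-k value are illustrative):

```python
import torch

class SpanScorer(torch.nn.Module):
    def __init__(self, hidden_size, max_len=30):
        super().__init__()
        self.max_len = max_len
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(2 * hidden_size, hidden_size),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_size, 1),
        )

    def forward(self, token_reps, top_k=8):
        # token_reps: [seq_len, hidden_size] BERT outputs for one passage.
        seq_len = token_reps.size(0)
        spans, feats = [], []
        for start in range(seq_len):
            for end in range(start, min(start + self.max_len, seq_len)):
                spans.append((start, end))
                # Concatenate start and end token representations of the span.
                feats.append(torch.cat([token_reps[start], token_reps[end]]))
        scores = self.mlp(torch.stack(feats)).squeeze(-1)
        probs = torch.softmax(scores, dim=0)            # distribution over spans
        best = torch.topk(probs, k=min(top_k, len(spans)))
        return [(spans[i], probs[i].item()) for i in best.indices.tolist()]
```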
3.3 Question Generation, pq
Given a passage and an answer, this model gener-
ates likely questions with that answer. To indicate
the answer and its occurrence in the passage,
we prepend the answer to the passage and label
the answer span with surrounding special tokens.
We train on a combination of NQ, TriviaQA,
and SQuAD, and perform standard fine-tuning of
BART-base (Lewis et al., 2020a) to obtain pq.
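The input formatting can be sketched as follows; the marker strings are assumptions for illustration rather than the exact special tokens we use:

```python
# Build the source string for the question generator: the answer is prepended
# to the passage and its span in the passage is wrapped in marker tokens.
ANSWER_SEP, SPAN_START, SPAN_END = "[ANSWER]", "[A_START]", "[A_END]"

def qgen_input(answer: str, passage: str, answer_char_start: int) -> str:
    end = answer_char_start + len(answer)
    marked = (passage[:answer_char_start]
              + f"{SPAN_START} {answer} {SPAN_END}"
              + passage[end:])
    return f"{answer} {ANSWER_SEP} {marked}"

# The fine-tuning target for a seq2seq model such as BART-base is simply the
# reference question paired with this source string.
```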
3.4 Filtering, pf
The filtering model pf improves the quality of
generated questions, by ensuring that they are
consistent—that the answer they were generated
from is likely to be a valid answer to the question. Prior
work (Alberti et al., 2019; Fang et al., 2020) has
used a machine reading comprehension (MRC)
QA model for this purpose, pf (a|q, C), which pro-
duces an answer when supplied with a question
and the passage it was generated from. We refer
to this as local filtering. Sin embargo, local filter-
ing will not remove questions that are ambiguous
(Min et al., 2020b), and can only be answered
correctly with access to the source passage. Thus,
we use an ODQA model for filtering, pf (a|q, C),
supplied with only the generated question, and
not the source passage. We refer to this as global
filtering, and later show it is vital for strong down-
stream results. We use FiD-base with 50 passages,
trained on NQ (Izacard and Grave, 2021).
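Global filtering can be sketched as a simple consistency check; `odqa_model` below is a placeholder for the FiD filter, which retrieves its own evidence rather than seeing the source passage:

```python
def global_filter(candidates, odqa_model):
    """candidates: iterable of (question, source_answer) pairs."""
    kept = []
    for question, source_answer in candidates:
        predicted = odqa_model(question)   # answers from the question alone
        # Keep the pair only if the filter reproduces the source answer.
        if predicted.strip().lower() == source_answer.strip().lower():
            kept.append((question, source_answer))
    return kept
```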
4 Question Answering Using PAQ
We consider two uses of PAQ for building QA
models. The first is to use PAQ as a source of
training QA-pairs for CBQA models. El segundo
treats PAQ as a KB, which models learn to directly
retrieve from. These are related, as CBQA models
have been shown to memorize the train data in
their parameters, latently retrieving from them at
test time (Lewis et al., 2021; Domingos, 2020).
4.1 PAQ for Closed-Book QA
We fine-tune BART-large (Lewis et al., 2020a)
with QA-pairs from the concatenation of the
training data and PAQ, using a similar training
procedure to Roberts et al. (2020). We use a batch
size of 512, and use validation Exact Match score
for early stopping (Rajpurkar et al., 2018). Fol-
lowing recent best practices (Alberti et al., 2019;
Yang et al., 2019), we then fine-tune on training
QA-pairs only. We note that effective CBQA
models must be able to understand the semantics
of questions and how to generate answers, in
addition to being able to store large numbers
of facts in their parameters. This model thus
represents a kind of combined parametric knowl-
edgebase and retrieval system (Petroni et al.,
2020). The model proposed in the next section,
RePAQ, represents an explicit non-parametric
instantiation of this idea.
4.2 RePAQ
RePAQ is a retrieval model
that operates on
KBs of QA-pairs, such as PAQ. RePAQ extends
recently proposed nearest neighbor QA-pair
retriever models (Lewis et al., 2021; Xiao et al.,
2021). These models assume access to a KB of
N QA-pairs K = {(q1, a1) . . . (qN , aN )}. These
models provide an answer to a test question q
by finding the most relevant QA-pair (q′, a′)
in K, using a scalable relevance function, then
returning a′ as the answer to q. This function
could be implemented using standard information
retrieval techniques (e.g., TF-IDF) or learned
from training data. RePAQ is learned from
ODQA data and consists of a neural retriever,
optionally followed by a reranker.
4.2.1 RePAQ Retriever
Our retriever adopts the dense Maximum In-
ner Product Search (MIPS) paradigm, which has
recently been shown to obtain state-of-the-art re-
sults in a number of settings (Karpukhin et al.,
2020; Lee et al., 2021, inter alia). Our goal is to
embed queries q and indexed items d into a repre-
sentation space via embedding functions gq and gd,
so that the inner product gq(q)⊤gd(d) is maximized
for items relevant to q. In our case, queries are
questions and indexed items are QA-pairs (q′, a′).
We make our retriever symmetric by embedding q′
instead of (q′, a′). As such, only one embedding
function gq is required, which maps questions to
embeddings. This applies a useful inductive bias,
and we find that it aids stability during training.
Learning the embedding function gq is compli-
cated by the lack of labeled question paraphrase
pairs in ODQA datasets. We propose a latent
variable approach similar to retrieval-augmented
generation (RAG, Lewis et al., 2020b),3 where we
index training QA-pairs rather than documents.
For an input question q, the top K QA-pairs
(q′, a′) are retrieved by a retriever pret where
pret(q′|q) ∝ exp(gq(q)⊤gq(q′)). These are then
fed into a seq2seq model pgen which generates an
answer for each retrieved QA-pair, before a final
answer is produced by marginalising,

p(a|q) = Σ_{(a′, q′) ∈ top-k pret(·|q)} pgen(a|q, q′, a′) pret(q′|q)
As pgen generates answers token-by-token, credit
can be given for retrieving helpful QA-pairs that
do not exactly match the target answer. For ex-
ample, for the question ‘‘when was the last time
anyone was on the moon’’ and target answer
‘‘December 1972’’, retrieving ‘‘when was the last
year astronauts landed on the moon’’ with answer
‘‘1972’’ will help to generate the target answer,
despite the answers having different granularity.
After training, we discard pret,4 retaining only
the question embedder gq. We implement pret
with ALBERT (Lan et al., 2020) with an out-
put dimension of 768, and pgen with BART-large
(Lewis et al., 2020a). We train with 100 retrieved
QA-pairs, and refresh the index every 5 training
steps.
Once the embedder gq is trained, we build a
test-time QA system by embedding and indexing
a QA KB such as PAQ. Answering is achieved
3Other methods, such as heuristically constructing para-
phrase pairs assuming that questions with the same answer
are paraphrases, and training with sampled negatives would
also be valid, but were not competitive in early experiments.
4We could use pgen as a reranker/aggregator for QA, but
in practice find it both slower and less accurate than the
reranker described in Section 4.2.2.
by retrieving the most similar stored question, and
returning its answer. The matched QA-pair can
be displayed to the user, providing a mechanism
for more interpretable answers than CBQA mod-
els and many retrieve-and-read generators which
consume thousands of tokens to generate an an-
swer. Efficient MIPS libraries such as FAISS
(Johnson et al., 2019) enable RePAQ’s retriever
to answer 100s to 1,000s of questions per second
(see Section 5.2.3). We use a KB for RePAQ con-
sisting of train set QA-pairs and QA-pairs from
PAQ.
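Test-time answering with the retriever alone can be sketched with FAISS as follows; `embed_questions` stands in for the trained embedder gq, and the flat index is the exact-search baseline (Section 5.2.3 discusses the approximate HNSW variant):

```python
import faiss

def build_index(kb_questions, embed_questions):
    embeddings = embed_questions(kb_questions).astype("float32")  # [N, 768]
    index = faiss.IndexFlatIP(embeddings.shape[1])  # exact inner-product search
    index.add(embeddings)
    return index

def answer(question, index, kb_answers, embed_questions, k=1):
    q = embed_questions([question]).astype("float32")
    scores, ids = index.search(q, k)
    # Return the answer of the most similar stored question, plus the match
    # score that is later used for selective QA (Section 5.2.4).
    return kb_answers[ids[0][0]], float(scores[0][0])
```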
4.2.2 RePAQ Reranker
Accuracy can be improved using a reranker on the
top-K QA-pairs from the retriever. The reranker
uses cross-encoding, and includes the retrieved
answer in the scoring function for richer fea-
turisation. The model is trained as a multi-class
classifier, attempting to classify a QA-pair which
answers a question correctly against K-1 retrieved
QA-pairs which do not. For each QA-pair candi-
date, we concatenate the input question q with the
QA-pair (q′, a′), and feed it through ALBERT, and
project the CLS representation to a logit score.
The model produces a distribution over the K
QA-pairs via softmax, and is trained to minimize
negative log-likelihood of the correct QA-pair.
We obtain training data in the following man-
ner: For a training QA-pair, we retrieve the top
2K QA-pairs from PAQ using RePAQ’s retriever.
If one of the retrieved QA-pairs has the correct an-
swer, we treat it as a positive, and randomly sample
K-1 of the incorrect retrieved questions as nega-
tives. We train with K=10, and rerank 50 QA-pairs
at test time. The reranker improves accuracy at the
expense of speed. However, as QA-pairs consist
of fewer tokens than passages, the reranker is
still faster than retrieve-and-read models, even
for architectures such as ALBERT-xxlarge.
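A sketch of the reranker’s scoring, assuming ALBERT via the transformers library; the exact concatenation format here is an assumption for illustration:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("albert-base-v2")
encoder = AutoModel.from_pretrained("albert-base-v2")
logit_head = torch.nn.Linear(encoder.config.hidden_size, 1)

def rerank(question, candidates):
    """candidates: list of (q', a') QA-pairs returned by the retriever."""
    texts = [f"{question} {tok.sep_token} {q} {tok.sep_token} {a}"
             for q, a in candidates]
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    cls = encoder(**batch).last_hidden_state[:, 0]   # [CLS] per candidate
    logits = logit_head(cls).squeeze(-1)
    probs = torch.softmax(logits, dim=0)             # distribution over candidates
    best = int(torch.argmax(probs))
    return candidates[best][1], float(probs[best])   # answer and its confidence
```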
5 Results
We first examine the PAQ resource in general,
before exploring how both CBQA models and
RePAQ perform using PAQ, comparing to recently
published systems. We measure performance us-
ing Natural Questions (NQ, Kwiatkowski et al.,
2019) and TriviaQA (Joshi et al., 2017), evaluat-
ing using standard Exact Match (EM) score.
5.1 Examining PAQ
We generate PAQ by applying the pipeline de-
scribed in Section 3 to the Wikipedia dump from
Karpukhin et al. (2020), which splits Wikipedia
into 100-word passages. We use the passage selec-
tion model ps to rank all passages, generate from
the top 10M, before applying global filtering.5
We are interested in understanding the effect-
iveness of different answer extractors, and whether
generating more questions per answer span leads
to better results. To address these questions, we
create three versions of PAQ, described below.
PAQL uses a learned answer extractor, and a
question generator trained on NQ and TriviaQA.
We extract 8 answers per passage and use beam
size 4 for question generation. In PAQL,1 we
only use the top scoring question in the beam,
whereas in PAQL,4 we use all four questions
from the beam, allowing several questions to be
generated from each answer span. PAQN E,1 usos
the NER answer extractor, and a generator trained
on NQ. PAQN E,1 allow us to assess whether
diversity in the form of answer extractors and
question generators gives better results. The final
KB, referred to as just ‘‘PAQ’’, is the union of
PAQL and PAQN E.
As shown in Table 1, PAQ consists of 65M
filtered QA pairs.6 This was obtained by extracting
165M answer spans and generating 279M unique
questions before applying global filtering. Table 1
shows that the PAQL pipeline is more efficient
than PAQN E, with 24.4% of QA-pairs surviving
filtering, compared to 18.6%.
PAQ Answer Coverage To evaluate answer ex-
tractors, we calculate how many answers in the
validation sets of TriviaQA and NQ also occur
in PAQ’s filtered QA-pairs. Table 1 shows that
the answer coverage of PAQ is very high—over
90% for both TriviaQA and NQ. Comparing
PAQL with PAQN E shows that the learnt ex-
tractor achieves higher coverage, but the union
of the two leads to the highest coverage overall.
Comparing PAQL,1 and PAQL,4 indicates that us-
ing more questions from the beam also results in
higher coverage.
5Generation was stopped when downstream performance
with RePAQ did not significantly improve.
Dataset     Extracted Answers   Unique Qs   Filtered QAs   Ratio   Coverage (NQ)   Coverage (TQA)
PAQL,1      76.4M               58.0M       14.1M          24.4%   88.3            90.2
PAQL,4      76.4M               225.2M      53.8M          23.9%   89.8            90.9
PAQN E,1    122.2M              65.4M       12.0M          18.6%   83.5            88.3
PAQ         165.7M              279.2M      64.9M          23%     90.2            91.1

Table 1: PAQ dataset statistics and ODQA dataset answer coverage. ‘‘Ratio’’ refers to the number
of generated questions which pass the global consistency filter.
PAQ Question Generation Quality  Illustrative
examples from PAQ can be seen in Table 2. Man-
ual inspection of 50 questions from PAQ reveals
that 82% of questions accurately capture infor-
mation from the passage and contain sufficient
details to locate the answer. Sixteen percent of
questions confuse the semantics of certain answer
types, either by conflating similar entities in the
passage or by misinterpreting rare phrases (see
examples 7 and 8 in Table 2). Finally, we find
small numbers of grammar errors (such as exam-
ple 5) and mismatched wh-words (5% and 2%,
respectively).7
Other Observations  PAQ often contains sev-
eral paraphrases of the same QA-pair. This
redundancy reflects how information is distributed
in Wikipedia, with facts often mentioned on sev-
eral different pages. Generating several questions
per answer span also increases redundancy. Al-
though this means that PAQ could be more
information-dense if a de-duplication step was
applied, we later show that RePAQ always im-
proves with more questions (Section 5.2.1). This
suggests that it is worth increasing redundancy for
greater coverage.
5.2 Question Answering Results
In this section, we shall compare how the PAQ-
leveraging models proposed in Section 4 compare
to existing approaches. We primarily compare to
a state-of-the-art retrieve-and-read model, Fusion-
in-Decoder (FiD, Izacard and Grave, 2021). FiD
uses DPR (Karpukhin et al., 2020) to retrieve
passages from Wikipedia, and feeds them into T5
(Raffel et al., 2020) to generate a final answer.
Table 3 shows the highest-accuracy con-
figurations of our models alongside recent
state-of-the-art models. We make the following
6Each question only has one answer due to global filtering.
7Further details in Appendix A.3.
#  Question                                                  Answer               Comment
1  who created the dutch comic strip panda                   Martin Toonder       ✓
2  what was the jazz group formed by john hammond in 1935    Goodman Trio         ✓
3  astrakhan is russia’s main market for what commodity      fish                 ✓
4  what material were aramaic documents rendered on          leather              ✓
5  when did the giant panda chi chi died                     22 July 1972         ✓, Grammar error
6  pinewood is a village in which country                    England              ∼, Also a Pinewood village in USA
7  who was the mughal emperor at the battle of lahore        Ahmad Shah Bahadur   ✗ Confuses with Ahmad Shah Abdali
8  how many jersey does mitch richmond have in the nba       2                    ✗ His Jersey No. was 2

Table 2: Representative examples from PAQ. ✓ indicates correct, ∼ ambiguous, and ✗ incorrect
facts, respectively.
#   Model Type                     Model                                           NaturalQuestions   TriviaQA
1   Closed-book                    T5-11B-SSM (Roberts et al., 2020)               35.2               51.8
2   Closed-book                    BART-large (Lewis et al., 2021)                 26.5               26.7
3   QA-pair retriever              Dense retriever (Lewis et al., 2021)            26.7               28.9
4   Open-book, retrieve-and-read   RAG-Sequence (Lewis et al., 2020b)              44.5               56.8
5   Open-book, retrieve-and-read   FiD-large, 100 docs (Izacard and Grave, 2021)   51.4               67.6
6   Open-book, phrase index        DensePhrases (Lee et al., 2021)                 40.9               50.7
7   Closed-book                    BART-large, pre-finetuned on PAQ                32.7               33.2
8   QA-pair retriever              RePAQ (retriever only)                          41.2               38.8
9   QA-pair retriever              RePAQ (with reranker)                           47.7               50.7
10  QA-pair retriever              RePAQ-multitask (retriever only)                41.7               41.3
11  QA-pair retriever              RePAQ-multitask (with reranker)                 47.6               52.1
12  QA-pair retriever              RePAQ-multitask w/ FiD-Large Backoff            52.3               67.3

Table 3: Exact Match score for highest accuracy RePAQ configurations in comparison to re-
cent state-of-the-art systems. Highest score indicated in bold, highest non-retrieve-and-read model
underlined.
observations: Comparing rows 2 and 7 shows that
a CBQA BART model trained with PAQ out-
performs a comparable NQ-only model by 5%,
and is only 3% behind T5-11B (row 1) which
has 27x more parameters. Second, we note strong
results for RePAQ on NQ (row 9), outperforming
retrieve-and-read systems such as RAG by 3%
(row 4).
Multitask training RePAQ on NQ and TriviaQA
improves TriviaQA results by 1%-2% (comparing
filas 8-9 con 10-11). RePAQ does not perform
as strongly on TriviaQA (see Section 5.2.6), but
is within 5% of RAG, and outperforms concur-
rent work on real-time QA, DensePhrases (row 6,
Lee et al., 2021). Finally, row 12 shows that com-
bining RePAQ and FiD-large into a combined
system is 0.9% more accurate than FiD-large (see
Sección 5.2.4 for more details).
5.2.1 Ablating PAQ Using RePAQ
Table 4 shows RePAQ’s accuracy using different
PAQ variants. To establish the effect of filtering,
#  KB         Filtering   Size    EM (Retrieve)   EM (Rerank)
1  NQ-Train   –           87.9k   27.9            31.8
2  PAQL,1     None        58.0M   21.6            30.6
3  PAQL,1     Local       31.7M   28.3            34.9
4  PAQL,1     Global      14.1M   38.6            44.3
5  PAQL,4     Global      53.8M   40.3            45.2
6  PAQN E,1   Global      12.0M   37.3            42.6
7  PAQ        Global      64.9M   41.6            46.4

Table 4: The effect of different PAQ subsets on
the NQ validation accuracy of RePAQ.
we evaluate RePAQ with unfiltered, locally fil-
tered and globally filtered QA-pairs on PAQL,1.
Rows 2-4 show that global filtering is crucial,
leading to a 9% and 14% increase over locally
filtered and unfiltered QA-pairs, respectively.
We also note a general
trend in Table 4
that adding more globally filtered questions im-
proves accuracy. Rows 4-5 show that using four
Model      Retriever   Reranker   Exact Match   Q/sec
FiD-large  –           –          51.4          0.5
FiD-base   –           –          48.2          2
RePAQ      base        –          40.9          1400
RePAQ      xlarge      –          41.5          800
RePAQ      base        base       45.7          55
RePAQ      xlarge      xxlarge    47.6          6

Table 5: Inference speeds of various configura-
tions of RePAQ compared to FiD models on NQ.
curacy and inference speed. We use a fast
Hierarchical Navigable Small World (HNSW)
index in FAISS (Malkov and Yashunin, 2020;
Johnson et al., 2019)9 and measure the time re-
quired to evaluate the NQ test set on a system with
access to one GPU.10 Table 5 shows these results.
Some retriever-only RePAQ models can answer
over 1,000 questions per second, and are relatively
insensitive to model size, with ALBERT-base
only scoring 0.5% lower than ALBERT-xlarge.
They also outperform retrieve-and-read models
like REALM (40.4%, Guu et al., 2020) and recent
real-time QA models like DensePhrases (40.9%,
Lee et al., 2021). We find that larger, slower
RePAQ rerankers achieve higher accuracy. How-
ever, even the slowest RePAQ is 3x faster than
FiD-base, while only being 0.8% less accurate,
and 12x faster than FiD-large.
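For reference, swapping the exact index for the approximate HNSW index used for these speed results is a small change in FAISS; the parameter values below are illustrative rather than our exact settings, and assume a reasonably recent FAISS build:

```python
import faiss

dim = 768
index = faiss.IndexHNSWFlat(dim, 32, faiss.METRIC_INNER_PRODUCT)  # 32 graph links per node
index.hnsw.efConstruction = 128   # build-time beam width
index.hnsw.efSearch = 64          # query-time beam width (speed/recall trade-off)
# index.add(...) and index.search(...) are then used exactly as with a flat index.
```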
5.2.4 Selective Question Answering
Models should not just be able to answer accu-
rately, but also ‘‘know when they don’t know’’,
and abstain when they are unlikely to produce
good answers (Voorhees, 2002). This task is chal-
lenging for current systems (Asai and Choi, 2020;
Jiang et al., 2020b), and has been approached
in MRC by training on unanswerable questions
(Rajpurkar et al., 2018) and for trivia systems by
using incremental QA (Rodriguez et al., 2019).
We find that RePAQ’s retrieval and reranking
scores are well-correlated with answering cor-
rectly. RePAQ can thus be used for selective QA
by abstaining when the score is below a certain
threshold. Figure 3 shows a risk-coverage plot
(Wang et al., 2018) for RePAQ and FiD, where
Figure 2: Size vs. accuracy for RePAQ and FiD-large
as a function of the number of items in the index.
questions per answer span is better than generating
one (+0.9%), and rows 5-7 show that combin-
ing PAQN E and PAQL also improves accuracy
(+1.2%). Empirically we did not observe any cases
where increasing the number of globally filtered
QA-pairs reduced accuracy, even when there were
millions of QA-pairs already.
5.2.2 System Size vs. Accuracy
PAQ’s QA-pairs are accompanied by scores of
how likely they are to be asked. These scores can
be used to filter the KB and reduce the RePAQ
system size. A similar procedure can be used to fil-
ter the background corpus for a retrieve-and-read
modelo (Izacard et al., 2020). We compare the sys-
tem size of a FiD-large system and RePAQ as
the number of items (passages and QA-pairs, re-
spectively) in their indexes are reduced. We select
which passages and QA-pairs are included using
the passage selection model ps.8 Further exper-
imental details can be found in Appendix A.4.
Figure 2 shows that both system sizes can be
reduced several-fold with only a small drop in
accuracy, demonstrating the effectiveness of ps.
FiD can achieve a higher accuracy, but requires
larger system sizes. RePAQ can be reduced to
a smaller size before a significant accuracy drop,
driven primarily by the higher information density
of QA-pairs relative to passages, and fewer model
parameters used by RePAQ compared to FiD.
5.2.3 Inference Speed vs. Accuracy
We train a variety of differently sized RePAQ
models to explore the relationship between ac-
8Aquí, we use PAQL1, which is 5x smaller than the full
PAQ, but retains most of the accuracy (see Table 4).
9The HNSW index has negligible (∼0.1%) drop in
retriever accuracy compared to a flat index.
10System details can be found in Appendix A.5.
Input: who was the film chariots of fire about              A: Eric Liddell
  who was the main character in chariots of fire            A: Eric Liddell     ✓
  who starred in the movie chariots of fire                 A: Ian Charleson    ✗
  which part did straan rodger play in chariots of fire     A: Sandy McGrath    ✗
  who played harold in the 1981 film chariots of fire       A: Ben Cross        ✗
  who is the main character in chariots of fire             A: Eric Liddell     ✓

Input: what is the meaning of the name didymus              A: twin
  what language does the name didymus come from             A: Greek            ✗
  where does the name didymus come from in english          A: Greek            ✗
  what does the word domus mean in english                  A: home             ✗
  how long has the term domus been used                     A: 1000s of years   ✗
  what does the greek word didyma mean                      A: twin             ✓

Input: what is the name of a group of llamas                A: herd
  what are llamas and alpacas considered to be              A: domesticated     ✗
  what are the figures of llamas in azapa valley            A: Atoca            ✗
  what are the names of the llamas in azapa valley          A: Atoca            ✗
  what is the scientific name for camels and llamas         A: Camelidae        ✗
  are llamas bigger or smaller than current forms           A: larger           ✗

Table 6: Examples of top 5 retrieved QA-pairs for
NQ. Italics indicate QA-pairs chosen by reranker.
there is high (80.8 ROUGE-L) similarity between
correctly answered test questions and the top re-
trieved questions. Nine percent of test questions
even exist verbatim in PAQ, and are thus trivial to
respuesta. The reranker primarily improves over the
retriever for ambiguous cases, and cases where
the top retrieved answer does not have the right
granularity. En 32% of cases, RePAQ does not
retrieve the correct answer in the top 50 QA-pairs,
suggesting a lack of coverage may be a significant
source of error. In these cases, retrieved ques-
tions are much less similar to the test question
than for correctly answered questions, dropping
by 20 ROUGE-L. We also observe cases where
retrieved questions match the test question, but the
answer does not match the desired answer. This
is usually due to different answer granularity, but
in a small number of cases was due to factually
incorrect answers.
5.2.6 Does the Filtering Model Limit
RePAQ’s Accuracy?
As RePAQ relies on retrieving paraphrases of
test questions, we may expect that the ODQA
filtering model places an upper bound on its per-
rendimiento. Por ejemplo, if a QA-pair is generated
that overlaps with a test QA-pair, but the filter
cannot answer it correctly, that QA-pair will not
be added to PAQ, and RePAQ cannot use it to an-
swer the test question. The NQ FiD-base-50-doc
model used for filtering scores 46.1% and 53.1%
for NQ and TriviaQA, respectively. RePAQ ac-
tually outperforms the filter model on NQ by
1.6%. This is possible because generated questions
can be phrased in such a way that they are eas-
ier to answer, for example, being less ambiguous
Figure 3: Risk-coverage plot for FiD and RePAQ.
we use FiD’s answer log probability for its an-
swer confidence.11 The plot shows the accuracy
on the top N% highest confidence answers for
NQ. If we require models to answer 75% of user
questions, RePAQ’s accuracy on the questions
it does answer is 59%, whereas FiD, which has
poorer calibration, scores 55%. This difference is
even more pronounced with stricter thresholds—
with coverage of 50%, RePAQ outperforms FiD
by over 10%. FiD only outperforms RePAQ
when we require systems to answer over 85%
of questions.
Although RePAQ’s selective QA is useful in
its own right, it also allows us to combine the
slow but accurate FiD with the fast and precise
RePAQ, which we refer to as backoff. We first
try to answer with RePAQ, and if the confidence
is below a threshold determined on validation
data, we pass the question onto FiD. For NQ, the
combined system is 2.1x faster than FiD-large,
with RePAQ answering 57% of the questions, and
the overall accuracy is 1% higher than FiD-large
(see Table 3).
If inference speed is a priority, the threshold can
be decreased so that RePAQ answers 80% of the
questions, which retains the same overall accu-
racy as FiD, with a 4.6x speedup. For TriviaQA,
the combined system backs off to FiD earlier,
due to the stronger relative performance of FiD.
Additional details can be found in Appendix A.6.
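The back-off logic itself is a simple threshold test; the sketch below assumes callables for the two systems and a threshold tuned on validation data:

```python
def backoff_answer(question, repaq, fid, threshold):
    answer, confidence = repaq(question)   # e.g., RePAQ's reranker score
    if confidence >= threshold:
        return answer                      # fast path: RePAQ is confident
    return fid(question)                   # defer to the slower, more accurate FiD
```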
5.2.5 Analyzing RePAQ’s Predictions
Some examples of top retrieved questions are
shown in Table 6. When RePAQ answers cor-
rectly, the retrieved question is a paraphrase of the
test question from PAQ in 89% of cases. As such,
11We also investigate improving FiD’s calibration using
an auxiliary model, see Appendix A.6. We find that the most
effective way to calibrate FiD is to use RePAQ’s confidences.
Model                     Total   Q Overlap   A-only Overlap   No Overlap
CBQA BART w/ NQ           26.5    67.6        10.2             0.8
CBQA BART w/ NQ+PAQ       28.2    52.8        24.4             9.4
  + final NQ finetune     32.7    69.8        22.2             7.51
RePAQ (retriever only)    41.7    65.4        31.7             21.4
RePAQ (with reranker)     47.3    73.5        39.7             26.0

Table 7: NQ Behavioural splits (Lewis et al.,
2021). ‘‘Q overlap’’ are test questions with para-
phrases in training data. ‘‘A-only’’ are test ques-
tions where answers appear in training data, but
questions do not. ‘‘No overlap’’ where neither
question or answer overlap.
(Min et al., 2020b). RePAQ can then retrieve
the paraphrased QA-pair and answer correctly,
even if the filter could not answer the test ques-
tion directly. The filtering model’s weaker scores
on TriviaQA help explain why RePAQ is not
as strong on this dataset. We speculate that a
stronger filtering model for TriviaQA would in
turn improve RePAQ’s results.
5.3 Closed-book QA vs. RePAQ
Table 7 shows results on test set splits that measure
how effectively models memorize QA-pairs from
the NQ train set (‘‘Q overlap’’), and generalize
to novel questions (‘‘A overlap only’’ and ‘‘No
overlap’’).12 Comparing CBQA models trained
on NQ vs. those trained on NQ and PAQ show
that models trained with PAQ answer more ques-
tions correctly from the ‘‘A-only overlap’’ and
‘‘No overlap’’ categories, indicating they learned
facts not present in the NQ train set. Apply-
ing further NQ finetuning on the PAQ CBQA
model improves scores on ‘‘Q overlap’’ (indi-
cating greater memorisation of NQ), but scores
on the other categories drop (indicating reduced
memorization of PAQ). However, RePAQ, which
explicitly retrieves from PAQ rather than memo-
rizing it in parameters, strongly outperforms the
CBQA model in all categories, demonstrating that
the CBQA model struggles to memorize enough
facts from PAQ. Larger CBQA models should be
better able to memorise PAQ, but have downsides
in terms of system resources. Future work should
address how to better store PAQ in CBQA model
parameters.
12See Lewis et al. (2021) for more details.
6 Related Work
ODQA has been a topic of interest for at least
five decades (Simmons, 1965), with its modern
formulation established by TREC in the early
2000s (Voorhees, 1999). For a detailed history, the
reader is referred to Chen and Yih (2020). Interest
in ODQA has recently intensified for its practical
applications and for measuring how well models
store and access knowledge (Petroni et al., 2021).
KBQA A number of early approaches in ODQA
focused on using structured KBs (Berant et al.,
2013) such as Freebase (Bollacker et al., 2008),
with recent examples from Févry et al. (2020)
and Verga et al. (2020). This approach often has
high precision but suffers when the KB does
not match user requirements, or where the schema
limits what knowledge can be stored. We populate
our KB with semi-structured QA-pairs that are
specifically designed to be relevant at test time,
mitigating these drawbacks, while sharing benefits
such as precision and extensibility.
OpenIE Our work touches on KB construction
and open information extraction (OpenIE) (Angeli
et al., 2015). Here, the goal is to extract structured
or semi-structured facts from text, typically (sub-
ject, relation, object) triples for use in tasks such as
slot-filling (Surdeanu, 2013). We generate natu-
ral language QA-pairs rather than OpenIE triples,
and do not attempt to extract all possible facts in a
corpus, focusing only on those likely to be asked.
QA-pairs have also been used in semantic role
labeling, for example, QA-SRL (FitzGerald et al.,
2018).
Real-time ODQA Systems prioritizing fast run-
time over accuracy are sometimes referred to as
real-time QA systems (Seo et al., 2018). DenSPI
(Seo et al., 2019) and a contemporary work,
DensePhrases (Lee et al., 2021), index all possi-
ble phrases in a corpus, and learn mappings from
questions to passage-phrase pairs. We also build
an index for fast answering, but generate and index
globally answerable questions. Indexing QA-pairs
can be considered as indexing summaries of im-
portant facts from the corpus, rather than indexing
the corpus itself. We also generate and store
multiple questions per passage-answer pair, re-
lieving information bottlenecks from encoding a
passage-answer pair into a single vector.
Question Generation for QA Question gener-
ation has been used for various purposes, such
as data augmentation (Alberti et al., 2019; Lewis
et al., 2019; Lee et al., 2021), improved retrieval
(Nogueira et al., 2019), generative modelling for
contextual QA (Lewis and Fan, 2018), as well as
being studied in its own right (Du et al., 2017;
Hosking and Riedel, 2019). Serban et al. (2016)
generate large numbers of questions from Free-
base, but do not address how to use them for
QA. Closest to our work is the recently proposed
OceanQA (Fang et al., 2020). OceanQA first gen-
erates contextual QA-pairs from Wikipedia. At
test-time, a document retrieval system is used to
retrieve the most relevant passage for a question
and the closest pre-generated QA-pair from that
passage is selected. In contrast, we focus on gen-
erating a large KB of non-contextual, globally
consistent ODQA questions and explore what QA
systems are facilitated by such a resource.
7 Discussion and Conclusion
We have introduced a dataset of 65M QA-pairs,
and explored its uses for improving ODQA mod-
els. We demonstrated the effectiveness of RePAQ,
a system that retrieves from PAQ, in terms of ac-
curacy, speed, space efficiency and selective QA.
We found that RePAQ’s errors are driven by a
lack of coverage, thus generating more QA-pairs
should improve accuracy further. However, phe-
nomena such as compositionality may impose
practical limits on this approach. Multi-hop
RePAQ extensions suggest themselves as ways
forward here, as well as back-off systems (see
Section 5.2.4). Generating PAQ is also computa-
tionally intensive due to its large scale and global
filtering, but it should be a useful, re-usable re-
source for researchers. However, future work
should be carried out to improve the efficiency of
generation to expand PAQ’s coverage.
We also demonstrated PAQ’s utility for im-
proved CBQA, but note a large accuracy gap
between our CBQA models and RePAQ. Explor-
ing the trade-offs between storing and retrieving
knowledge parametrically or non-parametrically
is of great current interest (Lewis et al., 2020b;
Cao et al., 2021), and PAQ should be a useful
testbed for probing this relationship further. We
also note that PAQ could be used as general data-
augmentation when training any open-domain QA
model or retriever. We consider such work out-of-
scope here, but leveraging PAQ to improve other
models should be explored in future work.
Acknowledgments
The authors would like to extend their gratitude
to the anonymous reviewers and Action Editor for
their highly detailed and insightful comments and
comentario. The authors would also like to thank
Gautier Izacard, Ethan Pérez, Max Bartolo, Tom
Kwiatkowski, and Jimmy Lin for helpful discus-
sions and feedback on the project. PM and PS are
supported by the European Union’s Horizon 2020
research and innovation programme under grant
agreement no. 875160.
References
Chris Alberti, Daniel Andor, Emily Pitler, Jacob
Devlin, and Michael Collins. 2019. Synthetic
QA corpora generation with roundtrip con-
sistency. In Proceedings of the 57th Annual
Meeting of the Association for Computational
Linguistics, pages 6168–6173, Florence, Italy.
Association for Computational Linguistics.
Gabor Angeli, Melvin Jose Johnson Premkumar,
and Christopher D. Manning. 2015. Leveraging
linguistic structure for open domain information
extraction. In Proceedings of the 53rd Annual
Meeting of the Association for Computational
Linguistics and the 7th International Joint
Conference on Natural Language Processing
(Volume 1: Long Papers), pages 344–354,
Beijing, China. Association for Computational
Linguistics.
Akari Asai and Eunsol Choi. 2020. Challenges
in information seeking QA: Unanswerable
questions and paragraph retrieval. arXiv:2010
.11915 [cs]. ArXiv: 2010.11915.
Jonathan Berant, Andrew Chou, Roy Frostig,
and Percy Liang. 2013. Semantic parsing
on Freebase from question-answer pairs. In
Proceedings of the 2013 Conference on Empir-
ical Methods in Natural Language Processing,
pages 1533–1544, Seattle, Washington, USA.
Association for Computational Linguistics.
Kurt Bollacker, Colin Evans, Praveen Paritosh,
Tim Sturge, and Jamie Taylor. 2008. Free-
base: A collaboratively created graph database
for structuring human knowledge. In Proceed-
ings of the 2008 ACM SIGMOD international
conference on Management of data, SIGMOD
’08, pages 1247–1250, Vancouver, Canada.
Association for Computing Machinery.
Nicola De Cao, Gautier Izacard, Sebastian Riedel,
and Fabio Petroni. 2021. Autoregressive en-
tity retrieval. In International Conference on
Learning Representations.
Danqi Chen, Adam Fisch, Jason Weston, and
Antoine Bordes. 2017. Reading Wikipedia to
answer open-domain questions. In Proceedings
of the 55th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long
Papers), pages 1870–1879, Vancouver, Canada.
Association for Computational Linguistics.
Danqi Chen and Wen-tau Yih. 2020. Open-domain
question answering. In Proceedings of the
58th Annual Meeting of the Association for
Computational Linguistics: Tutorial Abstracts,
pages 34–37, Online. Association for Compu-
tational Linguistics.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of the 2019
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, Volume 1
(Long and Short Papers), pages 4171–4186,
Minneapolis, Minnesota. Association for Com-
putational Linguistics.
Pedro Domingos. 2020. Every Model Learned by
Gradient Descent Is Approximately a Kernel
Machine. arXiv:2012.00152 [cs, stat]. ArXiv:
2012.00152.
Xinya Du, Junru Shao, and Claire Cardie. 2017.
Learning to ask: Neural question generation
for reading comprehension. In Proceedings of
the 55th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long
Papers), pages 1342–1352, Vancouver, Canada.
Association for Computational Linguistics.
Yuwei Fang, Shuohang Wang, Zhe Gan, Siqi Sun,
and Jingjing Liu. 2020. Accelerating real-time
question answering via question generation.
arXiv:2009.05167 [cs]. ArXiv: 2009.05167.
Nicholas FitzGerald, Julian Michael, Luheng
He, and Luke Zettlemoyer. 2018. Large-scale
QA-SRL parsing. In Proceedings of the 56th
Annual Meeting of the Association for Compu-
tational Linguistics (Volume 1: Long Papers),
pages 2051–2060, Melbourne, Australia. Asso-
ciation for Computational Linguistics.
Jerome H. Friedman. 2001. Greedy function ap-
proximation: A gradient boosting machine.
Annals of Statistics, 29(5):1189–1232.
Thibault Févry, Livio Baldini Soares, Nicholas
FitzGerald, Eunsol Choi, and Tom Kwiatkowski.
2020. Entities as experts: Sparse memory access
with entity supervision. In Proceedings of the
2020 Conference on Empirical Methods in
Natural Language Processing (EMNLP),
pages 4937–4951, Online. Association for
Computational Linguistics.
Kelvin Guu, Kenton Lee, Zora Tung, Panupong
Pasupat, and Mingwei Chang. 2020. Retrieval
augmented language model pre-training. In
Proceedings of the 37th International Confer-
ence on Machine Learning, volume 119 of
Proceedings of Machine Learning Research,
pages 3929–3938. PMLR.
Matthew Honnibal, Ines Montani, Sofie Van
Landeghem, and Adriane Boyd. 2020. spaCy:
Industrial-strength natural language processing
in python.
Tom Hosking and Sebastian Riedel. 2019. Eval-
uating rewards for question generation models.
In Proceedings of the 2019 Conference of
the North American Chapter of the Associ-
ation for Computational Linguistics: Human
Language Technologies, Volume 1 (Long and
Short Papers), pages 2278–2283, Minnea-
polis, Minnesota. Association for Computa-
tional Linguistics.
Eduard Hovy, Mitchell Marcus, Martha Palmer,
Lance Ramshaw, and Ralph Weischedel. 2006.
OntoNotes: The 90% solution. In Proceedings
of the Human Language Technology Confer-
ence of the NAACL, Companion Volume: Short
Papers, NAACL-Short ’06, pages 57–60, USA.
Association for Computational Linguistics.
Gautier Izacard and Edouard Grave. 2021.
Leveraging passage retrieval with generative
models for open domain question answer-
ing. In Proceedings of the 16th Conference
of the European Chapter of the Association
for Computational Linguistics: Main Vol-
ume, pages 874–880, Online. Association for
Computational Linguistics.
Gautier Izacard, Fabio Petroni, Lucas Hosseini,
Nicola De Cao, Sebastian Riedel, and Edouard
Grave. 2020. A memory efficient baseline
for open domain question answering. arXiv:
2012.15156 [cs]. ArXiv: 2012.15156.
Zhengbao Jiang, Jun Araki, Haibo Ding, and
Graham Neubig. 2020a. How can we know
when language models know? arXiv:2012
.00955 [cs]. ArXiv: 2012.00955.
Zhengbao Jiang, Wei Xu, Jun Araki, and Graham
Neubig. 2020b. Generalizing natural language
analysis through span-relation representations.
In Proceedings of the 58th Annual Meeting
of the Association for Computational Linguis-
tics, pages 2120–2133, Online. Association for
Computational Linguistics.
Jeff Johnson, Matthijs Douze, and Hervé Jégou.
2019. Billion-scale similarity search with
GPUs. IEEE Transactions on Big Data,
pages 1–1.
Mandar Joshi, Eunsol Choi, Daniel Weld, and
Luke Zettlemoyer. 2017. TriviaQA: A large
scale distantly supervised challenge dataset for
reading comprehension. In Proceedings of the
55th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long
Papers), pages 1601–1611, Vancouver, Canada.
Association for Computational Linguistics.
h. J´egou, METRO. Douze, and C. Schmid. 2011. Product
quantization for nearest neighbor search. IEEE
Transactions on Pattern Analysis and Machine
Inteligencia, 33(1):117–128. https://doi
.org/10.1109/TPAMI.2010.57
Vladimir Karpukhin, Barlas Oguz, Sewon Min,
Patrick Lewis, Ledell Wu, Sergey Edunov,
Danqi Chen, and Wen-tau Yih. 2020. Dense
Passage retrieval for open-domain question
answering. In Proceedings of the 2020 Confer-
ence on Empirical Methods in Natural Lan-
guage Processing (EMNLP), pages 6769–6781,
Online. Association for Computational Linguistics.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia
Redfield, Michael Collins, Ankur Parikh, Chris
Alberti, Danielle Epstein, Illia Polosukhin,
Matthew Kelcey, Jacob Devlin, Kenton Lee,
Kristina N. Toutanova, Llion Jones, Ming-Wei
Chang, Andrew Dai, Jakob Uszkoreit, Quoc
Le, and Slav Petrov. 2019. Natural questions:
A benchmark for question answering research.
Transactions of the Association for Computa-
tional Linguistics, 7:452–466. https://doi
.org/10.1162/tacl_a_00276
Zhenzhong Lan, Mingda Chen, Sebastian
Goodman, Kevin Gimpel, Piyush Sharma, and
Radu Soricut. 2020. ALBERT: A lite BERT for
self-supervised learning of language representa-
tions. In International Conference on Learning
Representations.
Jinhyuk Lee, Mujeen Sung, Jaewoo Kang, y
Danqi Chen. 2021. Learning dense represen-
tations of phrases at scale. arXiv:2012.12624
[cs]. ArXiv: 2012.12624.
Kenton Lee, Ming-Wei Chang, and Kristina
Toutanova. 2019. Latent retrieval for weakly
supervised open domain question answering.
In Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics,
pages 6086–6096, Florence, Italy. Association
for Computational Linguistics.
Mike Lewis and Angela Fan. 2018. Generative
question answering: Learning to answer the
whole question. In International Conference on
Learning Representations.
mike lewis, Yinhan Liu, Naman Goyal, Marjan
Ghazvininejad, Abdelrahman Mohamed, Omer
Exacción, Veselin Stoyanov, and Luke Zettlemoyer.
2020a. BART: Denoising sequence-to-sequence
pre-training for natural language generation,
traducción, and comprehension. En curso-
cosas de
el
Asociación de Lingüística Computacional,
pages 7871–7880, En línea. Asociación para
Ligüística computacional.
the 58th Annual Meeting of
Patrick Lewis, Ludovic Denoyer, and Sebastian
Riedel. 2019. Unsupervised question answer-
ing by cloze translation. En Actas de la
57th Annual Meeting of the Association for Computational Linguistics, pages 4896–4910, Florence, Italy. Association for Computational Linguistics.
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020b. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474. Curran Associates, Inc.

Patrick Lewis, Pontus Stenetorp, and Sebastian Riedel. 2021. Question and answer test-train overlap in open-domain question answering datasets. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1000–1008, Online. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692 [cs]. ArXiv: 1907.11692.

Yu A. Malkov and D. A. Yashunin. 2020. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(4):824–836. https://doi.org/10.1109/TPAMI.2018.2889473

Sewon Min, Jordan Boyd-Graber, Chris Alberti, Danqi Chen, Eunsol Choi, Michael Collins, Kelvin Guu, Hannaneh Hajishirzi, Kenton Lee, Jennimaria Palomaki, Colin Raffel, Adam Roberts, Tom Kwiatkowski, Patrick Lewis, Yuxiang Wu, Heinrich Küttler, Linqing Liu, Pasquale Minervini, Pontus Stenetorp, Sebastian Riedel, Sohee Yang, Minjoon Seo, Gautier Izacard, Fabio Petroni, Lucas Hosseini, Nicola De Cao, Edouard Grave, Ikuya Yamada, Sonse Shimaoka, Masatoshi Suzuki, Shumpei Miyawaki, Shun Sato, Ryo Takahashi, Jun Suzuki, Martin Fajcik, Martin Docekal, Karel Ondrej, Pavel Smrz, Hao Cheng, Yelong Shen, Xiaodong Liu, Pengcheng He, Weizhu Chen, Jianfeng Gao, Barlas Oguz, Xilun Chen, Vladimir Karpukhin, Stan Peshterliev, Dmytro Okhonko, Michael Schlichtkrull, Sonal Gupta, Yashar Mehdad, and Wen-tau Yih. 2020a. NeurIPS 2020 EfficientQA Competition: Systems, analyses and lessons learned. arXiv:2101.00133 [cs]. ArXiv: 2101.00133.

Sewon Min, Julian Michael, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2020b. AmbigQA: Answering ambiguous open-domain questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5783–5797, Online. Association for Computational Linguistics.

Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019. Document expansion by query prediction. arXiv:1904.08375 [cs]. ArXiv: 1904.08375.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.

Fabio Petroni, Patrick Lewis, Aleksandra Piktus, Tim Rocktäschel, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. 2020. How context affects language models' factual predictions. In Automated Knowledge Base Construction.
Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel. 2021. KILT: A benchmark for knowledge intensive language tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2523–2544, Online. Association for Computational Linguistics.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics.

Adam Roberts, Colin Raffel, and Noam Shazeer. 2020. How much knowledge can you pack into the parameters of a language model? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5418–5426, Online. Association for Computational Linguistics.

Pedro Rodriguez, Shi Feng, Mohit Iyyer, He He, and Jordan Boyd-Graber. 2019. Quizbowl: The case for incremental question answering. arXiv:1904.04792 [cs]. ArXiv: 1904.04792.

Minjoon Seo, Tom Kwiatkowski, Ankur Parikh, Ali Farhadi, and Hannaneh Hajishirzi. 2018. Phrase-indexed question answering: A new challenge for scalable document comprehension. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 559–564, Brussels, Belgium. Association for Computational Linguistics.

Minjoon Seo, Jinhyuk Lee, Tom Kwiatkowski, Ankur Parikh, Ali Farhadi, and Hannaneh Hajishirzi. 2019. Real-time open-domain question answering with dense-sparse phrase index. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4430–4441, Florence, Italy. Association for Computational Linguistics.

Iulian Vlad Serban, Alberto García-Durán, Caglar Gulcehre, Sungjin Ahn, Sarath Chandar, Aaron Courville, and Yoshua Bengio. 2016. Generating factoid questions with recurrent neural networks: The 30M factoid question-answer corpus. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 588–598, Berlin, Germany. Association for Computational Linguistics.

R. F. Simmons. 1965. Answering English questions by computer: A survey. Communications of the Association for Computing Machinery, 8(1):53–70. https://doi.org/10.1145/363707.363732

M. Surdeanu. 2013. Overview of the TAC2013 Knowledge Base Population Evaluation: English Slot Filling and Temporal Slot Filling. TAC.

Pat Verga, Haitian Sun, Livio Baldini Soares, and William W. Cohen. 2020. Facts as experts: Adaptable and interpretable neural memory over symbolic knowledge. arXiv:2007.00849 [cs]. ArXiv: 2007.00849.

Ellen M. Voorhees. 1999. The TREC-8 Question Answering Track Report. In Proceedings of TREC-8, pages 77–82.

Ellen M. Voorhees. 2002. Overview of the TREC 2002 question answering track. In Proceedings of The Eleventh Text REtrieval Conference, TREC 2002, Gaithersburg, Maryland, USA, November 19–22, 2002, volume 500-251 of NIST Special Publication. National Institute of Standards and Technology (NIST).

Ellen M. Voorhees and Donna K. Harman, editors. 1999. Proceedings of The Eighth Text REtrieval Conference, TREC 1999, Gaithersburg, Maryland, USA, November 17–19, 1999, volume 500-246 of NIST Special Publication. National Institute of Standards and Technology (NIST).
William Wang, Angelina Wang, Aviv Tamar, Xi
Chen, and Pieter Abbeel. 2018. Safer classifica-
tion by synthesis. arXiv:1711.08534 [cs, stat].
ArXiv: 1711.08534.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Jinfeng Xiao, Lidan Wang, Franck Dernoncourt, Trung Bui, Tong Sun, and Jiawei Han. 2021. Open-domain question answering with pre-constructed question spaces. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 61–67, Online. Association for Computational Linguistics.

Wei Yang, Yuqing Xie, Luchen Tan, Kun Xiong, Ming Li, and Jimmy Lin. 2019. Data augmentation for BERT fine-tuning in open-domain question answering. arXiv:1904.06652 [cs]. ArXiv: 1904.06652.

Qinyuan Ye, Belinda Z. Li, Sinong Wang, Benjamin Bolte, Hao Ma, Wen-tau Yih, Xiang Ren, and Madian Khabsa. 2021. Studying strategically: Learning to mask for closed-book QA. arXiv:2012.15856 [cs]. ArXiv: 2012.15856.
A Appendices
A.1 Details on Dataset Splits
For NQ we use the standard open-domain splits introduced by Lee et al. (2019), and the train-development splits used by Karpukhin et al. (2020). For TriviaQA, we use the standard
open-domain splits, which correspond to the
unfiltered-train and unfiltered-dev reading com-
prehension splits (Joshi et al., 2017; Lee et al.,
2019).
A.2 Further Details on Passage Selection
The passage selection model is based on RoBERTaBASE (Liu et al., 2019). We feed each
passage into the model and use an MLP on top of
the [CLS] representation to produce a score. We
use this model to obtain a score for every passage
in the corpus. The top N highest-scoring passages
are selected for QA-pair generation. This model
achieves 84.7% recall on the NQ dev set.
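For intuition, a minimal sketch of this scoring setup is given below, assuming the Hugging Face Transformers API with an MLP head over the first ([CLS]-style) token; the class and variable names are illustrative and not our released implementation.

# Hedged sketch: score passages with a RoBERTa encoder and an MLP head.
import torch
import torch.nn as nn
from transformers import RobertaModel, RobertaTokenizerFast

class PassageScorer(nn.Module):
    def __init__(self, model_name: str = "roberta-base"):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        # MLP head producing one scalar score per passage.
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]   # representation of the first token
        return self.head(cls).squeeze(-1)   # one score per passage

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
scorer = PassageScorer().eval()
passages = ["The Kerch Peninsula is located at the eastern end of the Crimean Peninsula."]
batch = tokenizer(passages, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    scores = scorer(batch["input_ids"], batch["attention_mask"])
# The top-N passages by score would then be kept for QA-pair generation.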
A.3 Further Details on Question Quality
For NQ, we find that the retrieved questions are
paraphrases of the test questions in the major-
ity of cases. We conduct a human evaluation on 50 randomly sampled questions generated from the Wikipedia passage pool. We make the following
observaciones: i) 82% of questions accurately cap-
ture the context of the answer in the passage, y
contain sufficient details to locate the answer. ii)
16% of questions have incorrect semantics with
respect to their answers. These errors are driven
by two main factors: Mistaking extremely similar
entities and Generalization to rare phrases. An
example of the former is ‘‘what is the eastern
end of the Kerch peninsula’’ for the passage ‘‘The
Kerch Peninsula is located at the eastern end of the
Crimean Peninsula’’ and the answer ‘‘the Crimean
Peninsula’’. An example of the latter is where the
model interprets digits separated by colons as date
ranges, such as for the passage ‘‘under a 109–124
loss to the Milwaukee Bucks’’, the question is
generated as ‘‘when did . . . play for the Toronto
Raptors’’. iii) Only 2% of questions mismatch
question wh-words in the analysis sample.
A.4 Further Details on System Size
vs. Accuracy
The experiment in Section 5.2.2 measures the
bytes required to store the models, the text of
the documents/QA-pairs, and a dense index. For Figure 2, we assume models are stored at FP16 precision, the text has been compressed using LZMA (https://tukaani.org/xz/), and the indexes use 768-dimensional vectors with Product Quantization (Jégou et al.,
2011). These are relatively standard settings when
building efficient systems (Izacard et al., 2020;
Min et al., 2020a). The RePAQ model used
here consists of an ALBERT-base retriever and
an ALBERT-xxlarge reranker, and the FiD system consists of DPR (Karpukhin et al., 2020) (consisting of two BERT-base retrievers) and a T5-large reader (Raffel et al., 2020). Using a different setup (e.g., storing models at full precision, no text compression, and FP16 index quantization, shown in Figure 4) shifts the relative position of the curves in Figure 2, but the qualitative relationship is unchanged.

Figure 4: System size vs. accuracy for RePAQ and FiD as a function of the number of items in the index, with a different experimental setup than in the main paper.

Figure 5: Risk-coverage plot on TriviaQA. FiD has higher overall accuracy, but RePAQ’s reranker still performs best for coverages <50%.
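For intuition, the dense-index sizes behind Figure 2 can be approximated with simple arithmetic. The sketch below assumes 65M items, 768-dimensional vectors, and 64 bytes per vector under Product Quantization; these are illustrative values rather than the exact configuration used for the figure.

# Hedged back-of-the-envelope estimate of dense index sizes.
NUM_ITEMS = 65_000_000
DIM = 768

def dense_index_bytes(num_items: int, bytes_per_vector: int) -> int:
    return num_items * bytes_per_vector

flat_fp16 = dense_index_bytes(NUM_ITEMS, DIM * 2)   # FP16: 2 bytes per dimension
pq64 = dense_index_bytes(NUM_ITEMS, 64)             # PQ: assumed 64 bytes per vector

print(f"FP16 flat index: {flat_fp16 / 1e9:.0f} GB")  # ~100 GB
print(f"PQ index:        {pq64 / 1e9:.1f} GB")       # ~4.2 GB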
A.5 Further Details on Inference Speed
The machine used for speed benchmarking is a
machine learning workstation with 80 CPU cores,
512GB of CPU RAM and access to one 32GB
NVIDIA V100 GPU. Inference is carried out at
mixed precision for all systems, and questions
are allowed to be answered in parallel. Models
are implemented in PyTorch (Paszke et al., 2019)
using Transformers (Wolf et al., 2020). Measure-
ments are repeated 3 times and the mean time
is reported, rounded to an appropriate significant
figure. The HNSW index used in this experiment
indexes all 65M PAQ QA-pairs with 768 dimen-
sional vectors, uses an ef_construction of 80, an ef_search of 32, and a store_n of 256, and
performs up to 2048 searches in parallel. This
index occupies 220GB, but can be considerably
compressed with scalar or product quantization,
or training retrievers with smaller dimensions –
see Section A.8 for details of such an index.
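A minimal sketch of building such an index with FAISS is shown below, using the parameter values quoted above; we assume store_n corresponds to the number of HNSW links per node, and the data here is a small random stand-in rather than the 65M PAQ embeddings.

# Hedged sketch: an HNSW index over question embeddings with FAISS.
import numpy as np
import faiss

dim = 768
embeddings = np.random.rand(10_000, dim).astype("float32")  # stand-in for PAQ vectors

index = faiss.IndexHNSWFlat(dim, 256)   # 256 links per node (assumed to match store_n)
index.hnsw.efConstruction = 80
index.hnsw.efSearch = 32
index.add(embeddings)

queries = np.random.rand(4, dim).astype("float32")  # a small batch of question vectors
scores, ids = index.search(queries, 50)              # top-50 QA-pairs per query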
A.6 Further Details on Selective QA
Figure 5 shows the risk-coverage plot for Trivia-
QA. The results are qualitatively similar to NQ
(Figure 3), although FiD’s stronger overall performance shifts its risk-coverage curve up the accuracy axis relative to RePAQ. FiD also appears better calibrated on TriviaQA than it is for NQ, as indicated by its steeper gradient. However, RePAQ remains better calibrated, outperforming FiD for answer coverages below 50%.

Figure 6: Risk-coverage plot for different calibration methods for FiD (RePAQ included for comparison). Using RePAQ’s confidence scores to calibrate FiD leads to FiD’s strongest results.
We also investigate improving FiD’s calibra-
tion on NQ, using a post-hoc calibration technique
similar to Jiang et al. (2020a). We train a Gradi-
ent Boosting Machine (GBM; Friedman, 2001)
on development data to predict whether FiD has
answered correctly or not. The GBM is featurized
with FiD’s answer loss, answer log probability
and the retrieval score of the top 100 retrieved
documents from DPR. Figure 6 shows these re-
sults. We first note that FiD-Large’s answer loss
and answer log probabilities perform similarly,
and both struggle to calibrate FiD, as mentioned
in the main paper. The GBM improves calibra-
tion, especially at lower coverages, but still lags
behind RePAQ by 7% EM at 50% coverage. We
also note that we can actually use RePAQ’s confi-
dence scores to calibrate FiD. Here, we use FiD’s
predicted answer, but RePAQ’s confidence score
to decide whether to answer or not. This result is
also plotted in Figure 6, and results in FiD’s best
risk-coverage curve. Despite these improvements,
FiD is still not as well-calibrated as RePAQ.
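A hedged sketch of this calibration procedure is given below, using scikit-learn’s GradientBoostingClassifier as the GBM and synthetic stand-in features; the feature names, shapes, and threshold choice are assumptions for illustration only.

# Hedged sketch: post-hoc GBM calibration for selective QA.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

n_dev = 2000
answer_loss = np.random.rand(n_dev, 1)            # FiD answer loss per question
answer_logprob = -np.random.rand(n_dev, 1)        # FiD answer log-probability
retrieval_scores = np.random.rand(n_dev, 100)     # DPR scores of the top-100 passages
features = np.hstack([answer_loss, answer_logprob, retrieval_scores])
is_correct = np.random.randint(0, 2, size=n_dev)  # 1 if FiD's answer matched gold

gbm = GradientBoostingClassifier().fit(features, is_correct)

# At test time the GBM's probability acts as a confidence score: answer only when
# it exceeds a threshold chosen for the desired coverage (the risk-coverage trade-off).
confidence = gbm.predict_proba(features)[:, 1]
threshold = np.quantile(confidence, 0.5)           # e.g., answer the most confident 50%
answer_mask = confidence >= threshold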
A.7 Additional Model Training Details
RePAQ models were trained for up to 3 days on a
machine with 8 NVIDIA 32GB V100 GPUs. Val-
idation Exact Match score was used to determine
when to stop training in all cases. RePAQ retriev-
ers were trained using Fairseq (Ott et al., 2019),
and rerankers were trained in Transformers (Wolf
et al., 2020) in PyTorch (Paszke et al., 2019). The
PAQ CBQA models were trained in Fairseq for
up to 6 days on 8 NVIDIA 32GB V100 GPUs,
after which validation accuracy had plateaued.
Hyperparameters were tuned to try to promote
faster learning, but learning became unstable with
learning rates greater than 0.0001.
A.8 Memory-Efficient REPAQ Retriever
Code, models, and data are available at https://
github.com/facebookresearch/PAQ. As part
of this release, we have trained a memory-efficient
RePAQ retriever designed for use with more
modest hardware than the main RePAQ mod-
els. This consists of an ALBERT-base retriever,
with 256-dimensional embedding, rather than the
768-dimensional models in the main paper. We
provide 2 FAISS indices (Johnson et al., 2019) for
use with this model, both built with 8-bit scalar
quantization. The first index is a flat index, which
is very memory-friendly, requiring only 16GB of
CPU RAM, but relatively slow (1-10 questions per
second). The other is an HNSW approximate in-
dex (Malkov and Yashunin, 2020), requiring ∼32
GB of CPU RAM, but can process 100-1000 ques-
tions per second. This memory-efficient system is
highly competitive with the models in the main paper, actually outperforming the ALBERT-base model (+0.6% NQ, +0.5% TQA), and only trailing the ALBERT-xlarge model by 0.6% on average (−0.3% NQ, −0.9% TQA).
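The two indices could be reproduced along the following lines with FAISS 8-bit scalar quantization; the HNSW link count and efSearch value below are assumptions for illustration, not the released settings.

# Hedged sketch: memory-efficient flat and HNSW indices with 8-bit scalar quantization.
import numpy as np
import faiss

dim = 256
vectors = np.random.rand(100_000, dim).astype("float32")  # stand-in for PAQ embeddings

# Flat index with 8-bit scalar quantization: small, but brute-force (slower) search.
flat = faiss.IndexScalarQuantizer(dim, faiss.ScalarQuantizer.QT_8bit)
flat.train(vectors)
flat.add(vectors)

# HNSW index over the same scalar-quantized codes: more RAM, much faster queries.
hnsw = faiss.IndexHNSWSQ(dim, faiss.ScalarQuantizer.QT_8bit, 32)  # 32 links per node (assumed)
hnsw.train(vectors)
hnsw.add(vectors)
hnsw.hnsw.efSearch = 64                                            # assumed search-time setting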
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
4
1
5
1
9
6
6
2
0
5
/
/
t
l
a
c
_
a
_
0
0
4
1
5
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
1115
Descargar PDF