Exploring Contrast Consistency of Open-Domain Question Answering
Systems on Minimally Edited Questions
Zhihan Zhang, Wenhao Yu, Zheng Ning, Mingxuan Ju, Meng Jiang
University of Notre Dame, Notre Dame, IN, USA
{zzhang23, wyu1, zning, mju2, mjiang2}@nd.edu
Abstract
Contrast consistency, the ability of a model
to make consistently correct predictions in the
presence of perturbations, is an essential aspect
in NLP. While studied in tasks such as sen-
timent analysis and reading comprehension,
it remains unexplored in open-domain ques-
tion answering (OpenQA) due to the difficulty
of collecting perturbed questions that satisfy
factuality requirements. In this work, we col-
lect minimally edited questions as challeng-
ing contrast sets to evaluate OpenQA models.
Our collection approach combines both human
annotation and large language model genera-
tion. We find that the widely used dense pas-
sage retriever (DPR) performs poorly on our
contrast sets, despite fitting the training set
well and performing competitively on standard
test sets. To address this issue, we introduce
a simple and effective query-side contrastive
loss with the aid of data augmentation to im-
prove DPR training. Our experiments on the
contrast sets demonstrate that DPR’s contrast
consistency is improved without sacrificing its
accuracy on the standard test sets.1
1 Introduction
Contrast consistency (Gardner et al., 2020) is a
crucial aspect for neural models in NLP. Models
are expected to identify perturbations in the text
input and decide whether such a semantic shift
leads to a different
label. To evaluate this
consistency, contrast sets have been introduced
in various tasks such as sentiment analysis
(Wu et al., 2021), natural
language inference
(Ross et al., 2022), and reading comprehension
(Longpre et al., 2021) by minimally modifying
the original input to reverse the original label.
However, to the best of our knowledge, there is no study
on the contrast consistency in open-domain ques-
tion answering (OpenQA). In OpenQA, even a
1Data and code are available at https://github.com/ytyz1307zzh/Minimally_Edited_Questions.
slight modification of a word or two can alter the
meaning of the question, which leads to a com-
pletely different answer. To maintain contrast
consistency, models are expected to predict the
corresponding answer when such semantic shift
occurs.
Studying contrast consistency in OpenQA poses
unique challenges. Firstly, collecting appropri-
ate contrast sets is difficult. While contrast sets
have been developed for reading comprehension
(Longpre et al., 2021; Li et al., 2022), they typ-
ically replaced an entity (e.g., Barack Obama was born in Hawaii) in a given context with another entity (e.g., Barack Obama was born in New York), leading to a different answer to the given question (e.g., Where was Barack Obama born?). Constructing such contrast sets does not necessitate the factuality of the perturbed context, as
the answer depends solely on the context rather
than world knowledge. Cependant, in the absence
of evidence context, the perturbed questions in
OpenQA must be factually answerable in ac-
cordance with world knowledge, which is be-
yond what rule-based methods can do. Secondly,
achieving contrast consistency is challenging
for OpenQA models, which usually follow the
‘‘retrieve-then-read’’ pipeline (Lewis et al., 2020).
In addition to the challenge of predicting answers
from a contrast context as in reading comprehen-
sion, models also face the challenge of mapping
the perturbed question with its corresponding ev-
idence passage in a large corpus. The latter re-
quires the retriever to distinguish the minimal
semantic difference between embeddings of the
perturbed question and the original question,
which is ignored in typical retriever training.
To fill this gap in OpenQA, we propose to
create contrast sets using Minimally Edited Questions (MEQs). Given a question q and its answer a, an MEQ q′ is defined as a question that possesses high lexical and semantic similarity with q, while having a distinct answer a′ (a′ ≠ a).
Figure 1: Above: Trained on question q1 but not a contrast question q2, DPR generated an overly similar embedding of q2 to q1's and thus falsely retrieved p1. We aim to identify q2 as a distinct question and retrieve p2 instead. Below: The performance of DPR-based OpenQA models on the standard NQ question set and our contrast set of minimally edited questions (MEQs).

For example, in Figure 1, changing ‘‘Pet Sematary 2’’ to ‘‘Pet Sematary’’ generates an MEQ that resembles the original question but has a distinct answer (‘‘Coweta County, Georgia’’ → ‘‘Maine’’). We use the training set of an existing benchmark as the original questions because neural OpenQA models exhibit high performance on them. Thus, we are able to evaluate the models' ability to distinguish MEQs by measuring their performance on the MEQ contrast set. Specifically, we collect MEQs for training questions in the Natural Questions (NQ) benchmark (Kwiatkowski et al., 2019) from two sources, namely, (1) InstructGPT-based question generation (Ouyang et al., 2022) followed by crowdsourced annotation and (2) the AmbigQA dataset (Min et al., 2020).

We find that the state-of-the-art OpenQA models which employ the dense passage retriever (DPR) (Karpukhin et al., 2020) struggle on our MEQ contrast sets. As shown in Figure 1, DPR-retrieved passages lead to 63% downstream QA accuracy on the training set and 43% on the standard test set. However, the accuracy drops to 20%∼25% on our MEQ contrast sets. The problem lies in the contrastive training process of DPR. The model is trained to optimize question embeddings to be closer to their positive passage embeddings2 than negative passage embeddings. This paradigm does not provide explicit signals for understanding the relationships between questions, which causes the generated question embeddings to be insensitive to minimal discrepancies. Consequently, the model generates overly similar embeddings for the MEQ and the original question, leading to incorrect passage retrieval for the MEQ. In fact, the overlap between the retrieved passages of the original question and those of its MEQ is as high as ∼70%, which reflects DPR's limited ability in distinguishing the questions. To overcome such limitations, it is necessary to complement DPR training with signals on inter-question relationships. Besides building the mapping between questions and passages, DPR needs to know which questions are the same and which are different.

2A passage that provides evidence for answering the question is its positive passage; otherwise it is a negative passage.

In this pioneering study, we propose a simple and effective method based on a query-side contrastive loss to improve the performance of DPR on MEQs. Specifically, in order to learn inter-question relationships, DPR is trained to distinguish between paraphrase questions and semantically different questions. To achieve this, we obtain synthetic MEQs for training questions from the machine-created QA corpus, PAQ (Lewis et al., 2021), as augmented data. Experiments demonstrate that learning the query-side contrastive loss on the augmented MEQs improves the performance of DPR on contrast sets, without sacrificing its performance on standard open-domain questions in the NQ test set.

2 Related Work

2.1 Open-Domain Question Answering

OpenQA is a task that aims to answer user questions without any specified context, thereby testing the ability of QA systems to retrieve, comprehend, and utilize world knowledge (Zhu et al., 2021). The state-of-the-art approach in OpenQA is a two-stage pipeline, consisting of evidence retrieval and answer prediction (Chen et al., 2017).

In the evidence retrieval stage, a retriever model finds evidence passages from a large corpus (e.g., Wikipedia) based on their relevance
to the question. Traditional retrievers like BM25
(Robertson and Zaragoza, 2009) perform lexi-
cal matching to measure such relevance scores.
Recently, DPR (Karpukhin et al., 2020) revo-
lutionized the field by employing dual BERT
(Devlin et al., 2019) encoders to compute em-
beddings for the question and the passage, respectively. It searches evidence passages based
on the inner product of question and passage em-
beddings. Despite subsequent approaches having
sought to improve the architecture of the retriever
by using fine-grained question-passage interac-
tion (Khattab and Zaharia, 2020) or enhancing
global embedding training (Gao and Callan, 2021),
DPR remains the most widely-used model due
to its simplicity and efficiency. However, the capability of DPR in distinguishing contrastive information has not been thoroughly studied. In
this work, we use MEQs as contrast sets and
show that DPR has limited contrast consistency
when solving MEQs.
In the answer prediction stage, a reader model
encodes and fuses the representations of all pas-
sages, then predicts an answer by extracting a
span (Kedia et al., 2022), generating a free-form
sequence (Izacard and Grave, 2021), or using a hy-
brid approach (Fajcik et al., 2021). While answer
prediction is also challenging on MEQs, our ap-
proach mainly focuses on the retrieval part which
is the bottleneck of solving the MEQs in OpenQA.
2.2 Contrast Sets
NLP benchmark datasets are typically composed of i.i.d. examples that are randomly divided into training and test sets. Conversely, contrast sets re-
fer to data created from small yet label-changing
modifications to the existing examples (Gardner
et coll., 2020). Such characteristics make contrast
sets an ideal testbed for evaluating contrast con-
sistency. For example, Gardner et al. (2020) and Kaushik et al. (2020) employed humans to modify
linguistic patterns on tasks like syntactic pars-
ing, relation extraction, and claim verification.
On sentiment analysis and language inference
tasks, controlled text modification models could
automatically generate contrast sets (Wu et al.,
2021; Ross et al., 2022). In reading compre-
hension, rule-based algorithms created contrast
sets by replacing the answer with another entity
(Longpre et al., 2021; Ye et al., 2021; Li et al.,
2022). In video-to-text matching, a pre-trained
T5 model was used to find replacements for
verbs and entities in the original caption (Park et al., 2022).
Nevertheless, building contrast sets to evalu-
ate contrast consistency in OpenQA has not been
explored yet, where data collection must guar-
antee the factuality of MEQs. The most relevant
work is Paranjape et al. (2022) which automat-
ically generated perturbed questions for data aug-
mentation on QA datasets. However, we focus on
collecting challenging MEQs to evaluate model
consistency instead of data augmentation. More-
over, their generated questions did not meet the
requirements of MEQs. The limited accuracy of
the question generation model would lead to lots
of noise instead of perfect factuality. Also, their
method did not ensure the minimality of edits.
Donc, their generated data cannot be used
as challenging contrast sets to evaluate contrast
consistency in OpenQA.
3 Task: Contrast Consistency on MEQs
3.1 Problem Formulation
In this work, we study minimally edited questions
(MEQ) as challenging contrast sets in OpenQA.
Suppose we have two questions q and q′ with answers a and a′, respectively, where q is the original question in the training set and q′ is an MEQ of q. In this study, the minimality of edits is measured in two aspects: the lexical distance d_l(q, q′) and the semantic distance d_s(q, q′). That is to say, q′ needs to satisfy d_l(q, q′) ≤ ε_l, d_s(q, q′) ≤ ε_s, and a′ ≠ a, where ε_l and ε_s are distance thresholds.
3.2 Evaluation Metrics
To evaluate DPR on MEQ contrast sets, we consider metrics on both ranking and retrieval evaluation. In addition, we run end-to-end QA
experiments using the passages retrieved by DPR.
Ranking evaluation measures
the model’s
ability to differentiate a positive passage from
negative passages, by ranking a set of candidate
passages based on the relevance score to the ques-
tion. We collect 50 candidates for each question,
including a positive passage, 30 hard negative
passages, et 19 random negative passages. Hard
negatives are the top-ranked passages in BM25
retrieval that do not contain the answer. We re-
port Mean Rank (MR) and Mean Reciprocal Rank
(MRR) of the positive passage.
Retrieval evaluation tests the model’s ability
to retrieve passages relevant to answering the
question from a large corpus. Our retrieval cor-
pus contains ∼21M passages from Wikipedia.
We calculate Recall@k, the number of passages
containing the answer in top-k retrieved passages.
End-to-end QA evaluation checks whether the
retrieved passages contain useful information for
predicting the correct answer. The retrieved pas-
sages are fed into a Fusion-in-Decoder (FiD)
reader (Izacard and Grave, 2021) trained on NQ.
We calculate Exact Match between model pre-
dictions and answers.
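For concreteness, a minimal Python sketch of these three metrics is given below. It is an illustration rather than the evaluation code used in this work: the function names are ours, and the answer normalization inside the exact-match check is deliberately simplified.

```python
# Illustrative sketch of the evaluation metrics: mean rank / MRR over a ranked
# candidate list, Recall@k over retrieved passages, and a simplified exact match.

def mean_rank_and_mrr(ranked_candidate_lists, positive_ids):
    """ranked_candidate_lists[i]: passage ids sorted by relevance score for question i;
    positive_ids[i]: id of the gold (positive) passage among the 50 candidates."""
    ranks = [cands.index(pos) + 1  # 1-based rank of the positive passage
             for cands, pos in zip(ranked_candidate_lists, positive_ids)]
    mr = sum(ranks) / len(ranks)
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    return mr, mrr

def recall_at_k(retrieved_passages, answers, k):
    """Fraction of questions whose top-k retrieved passages contain a gold answer string."""
    hits = sum(
        any(any(a in p for a in golds) for p in passages[:k])
        for passages, golds in zip(retrieved_passages, answers)
    )
    return hits / len(retrieved_passages)

def exact_match(prediction, golds):
    """1 if the normalized reader prediction equals any gold answer (simplified normalization)."""
    norm = lambda s: " ".join(s.lower().strip().split())
    return int(any(norm(prediction) == norm(a) for a in golds))
```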
4 Data: MEQ Contrast Sets
4.1 Dataset Construction
Based on the above evaluation metrics, we col-
lect two MEQ contrast sets to evaluate models’
contrast consistency. The first set, referred to
as MEQ-GPT, is generated using InstructGPT
(Ouyang et al., 2022) then manually filtered and
annotated with answers by crowdsource work-
ers. The second set, named MEQ-AmbigQA, is
sourced from the AmbigQA dataset (Min et al.,
2020). The construction of our contrast sets con-
sists of four phases: question collection, MEQ
filtering, answer annotation, and evidence pas-
sage annotation.
4.1.1 Collection of Candidate MEQs
MEQ-InstructGPT Generating
answerable
MEQs is very difficult for crowdsource workers
who are not domain experts. It is hard for them
to determine which modifications to the origi-
nal question result in an answerable MEQ with-
out extensive Internet searches. However, recent
GPT-3 models have demonstrated their ability
to possess vast amounts of knowledge through massive pre-training (Brown et al., 2020). Therefore, we first utilize the InstructGPT model
(text-davinci-002) to generate a set of MEQ candi-
dates, and leave the answer annotation task to
crowdsource workers. The input to InstructGPT
is of the form [I, x1, · · · , xt, q, a], where I is
the instruction ‘‘Generate a similar ques-
tion that has a different answer’’.
x1, . . . , xt are in-context demonstrations that are manually created, where each xi is a tuple [qi, ai, q′i, a′i] (q′i is the MEQ of qi). The original
question q and answer a are appended to the
input, prompting InstructGPT to generate a
new question q′ and its answer a′ to complete
the sequence. For each input q, we sample 10
completions from InstructGPT to generate a set
of candidate MEQs.
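The following sketch illustrates how such a prompt can be assembled and sampled; it is a hypothetical reconstruction under the stated format [I, x1, · · · , xt, q, a], not the authors' released prompt. The demonstration, decoding settings, and the legacy OpenAI Completion call are assumptions.

```python
# Hedged sketch of MEQ candidate generation with text-davinci-002.
import openai

INSTRUCTION = "Generate a similar question that has a different answer."

# Hypothetical in-context demonstration (q_i, a_i, q'_i, a'_i); the real demonstrations
# were manually created by the authors and are not reproduced here.
DEMOS = [
    ("when did the eiffel tower open to the public", "1889",
     "when did the eiffel tower close to the public during ww2", "1940"),
]

def build_prompt(question, answer):
    lines = [INSTRUCTION, ""]
    for q, a, q_new, a_new in DEMOS:
        lines += [f"Question: {q}", f"Answer: {a}",
                  f"New question: {q_new}", f"New answer: {a_new}", ""]
    lines += [f"Question: {question}", f"Answer: {answer}", "New question:"]
    return "\n".join(lines)

def sample_candidate_meqs(question, answer, n=10):
    # Assumed signature of the legacy Completion endpoint; 10 samples per question.
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=build_prompt(question, answer),
        max_tokens=64,
        temperature=0.7,
        n=n,
        stop=["\n\n"],
    )
    candidates = []
    for choice in response.choices:
        text = choice.text.strip()            # e.g. "who sang ...\nNew answer: ..."
        if "New answer:" in text:
            q_new, a_new = text.split("New answer:", 1)
            candidates.append((q_new.strip(), a_new.strip()))
    return candidates
```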
MEQ-AmbigQA The AmbigQA dataset initially targeted a subset of NQ consisting of am-
biguous questions. The dataset was introduced to
decompose each ambiguous question into mul-
tiple disambiguated questions, each of which is
a slight modification of the original question.
For each NQ question covered in AmbigQA, its
corresponding disambiguated questions are con-
sidered as its candidate MEQs and are delivered
to the subsequent filtering phase (§4.1.2). However, such questions are limited as we set strict
criteria for MEQs, so we need more data gener-
ated by InstructGPT for solid evaluation.
4.1.2 MEQ Filtering
To build challenging contrast sets, a series of
criteria are applied to eliminate unqualified can-
didates and select MEQs based on the definition
in §3.1.
1. Quality control: We do not allow q and q′ to differ in question words (e.g., how, what), or if the only word that q′ adds to q falls into {first, last, new, next, original, not}. We have found that InstructGPT frequently adds these words to create MEQs, but they usually lead to unanswerable questions.
2. Lexical distance: Word-level edit distance is used as d_l(q, q′), and we remove q′ if d_l(q, q′) = 0 or d_l(q, q′) > 3.
3. Semantic distance: The cosine similarity of semantic embeddings is used to measure d_s(q, q′). We remove q′ if cos(h_q, h_q′) < 0.95, which indicates non-negligible semantic discrepancy. The semantic embedding h should be generated by a sentence embedding model. Here we use the question encoder of the unsupervised dense retrieval model Contriever (Izacard et al., 2021).
4. Paraphrase filtering: q′ is discarded if it is determined to be a paraphrase of q by a
paraphrase detection model. Here we use a
RoBERTa-large (Liu et al., 2019) fine-tuned
on the Quora Question Pairs dataset (Wang
et al., 2019) for paraphrase classification.
5. Answer difference: q′ is discarded if a′ = a. For AmbigQA questions, since they are
originally human-annotated, we ask human
volunteers to check whether a(cid:2) and a are
aliases to the same entity. For GPT-generated
questions, the inspection of answer differ-
ence is included in the answer annotation
process, which we will elaborate in §4.1.3.
Among GPT-generated questions, for a certain
original question q, there may be multiple MEQ
candidates that pass the above filtering. In such
cases, the question that is generated most fre-
quently across 10 samples is selected as the most
confident MEQ by InstructGPT. This is similar to
the self-consistency idea in Wang et al. (2022).
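A compact sketch of this filtering pipeline is shown below. It is a simplified re-implementation of the stated criteria, not the released filtering code: `embed` (the Contriever question encoder), `is_paraphrase` (the QQP-tuned RoBERTa classifier), and `same_answer` are assumed callables supplied by the caller.

```python
# Hedged sketch of the MEQ filters (Section 4.1.2) and the most-frequent selection.
from collections import Counter
import numpy as np

QUESTION_WORDS = {"who", "what", "when", "where", "why", "how", "which"}
BANNED_ADDITIONS = {"first", "last", "new", "next", "original", "not"}

def word_edit_distance(a, b):
    a, b = a.split(), b.split()
    dp = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    dp[:, 0], dp[0, :] = np.arange(len(a) + 1), np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i, j] = min(dp[i - 1, j] + 1, dp[i, j - 1] + 1,
                           dp[i - 1, j - 1] + (a[i - 1] != b[j - 1]))
    return int(dp[len(a), len(b)])

def passes_meq_filters(q, q_new, a, a_new, embed, is_paraphrase, same_answer):
    q_tok, qn_tok = set(q.lower().split()), set(q_new.lower().split())
    # 1. Quality control: same question word, no trivially-added words.
    if (q_tok & QUESTION_WORDS) != (qn_tok & QUESTION_WORDS):
        return False
    added = qn_tok - q_tok
    if added and added <= BANNED_ADDITIONS:
        return False
    # 2. Lexical distance: 1 <= word-level edit distance <= 3.
    d = word_edit_distance(q.lower(), q_new.lower())
    if d == 0 or d > 3:
        return False
    # 3. Semantic distance: cosine similarity of question embeddings >= 0.95.
    h_q, h_qn = embed(q), embed(q_new)
    cos = float(np.dot(h_q, h_qn) / (np.linalg.norm(h_q) * np.linalg.norm(h_qn)))
    if cos < 0.95:
        return False
    # 4. Paraphrase filtering and 5. answer difference.
    return not is_paraphrase(q, q_new) and not same_answer(a, a_new)

def most_confident_meq(surviving_candidates):
    # Among the 10 sampled generations, keep the most frequently generated survivor.
    return Counter(surviving_candidates).most_common(1)[0][0] if surviving_candidates else None
```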
4.1.3 Answer Annotation
Due to the limited accuracy of InstructGPT in
directly answering open-domain questions (Yu
et al., 2023), we recruit crowdsource workers
to annotate the answer of each candidate MEQ
generated by InstructGPT. Before human annota-
tion, we first check the answer generated by In-
structGPT via Google Search. If Google Search
returns a highlighted answer box which matches
the InstructGPT-generated answer, we skip the
subsequent human labeling step. For the remain-
ing questions, we recruit human annotators from
Surge AI3 for data labeling. We ask them the
following questions:
Q1. Is q′ a good variation of q? Bad variations include being unanswerable or having the same answer as q, and are discarded from
our dataset.
Q2. If q′ is deemed a good variation, find the answer a′ using search engines. If necessary,
the question may have multiple answers.
Quality Control To ensure answer correctness,
each question is answered by two different anno-
tators. If the annotators disagree on the answer
or if either annotator determines the question is
a bad variation, the question is discarded. Since
the answers are free-form responses, we manually
check whether the answers given by two annota-
tors are aliases to the same entity. If the response
of the first annotator matches exactly with the an-
swer provided by InstructGPT, we do not recruit
a second annotator to reduce costs.
4.1.4 Gold Evidence Passages
As mentioned in §3.2, ranking evaluation on
MEQs needs gold evidence passages as positive
examples, so we collect them from Wikipedia for
our contrast sets. For MEQ-AmbigQA, we utilize
the semi-oracle evidence documents4 provided
by the original authors, dividing them into 100-
word passages. Then, we identify the first passage
that contains the gold answer. For MEQ-GPT, our
initial step involves finding candidate evidence
passages that include the gold answer. This is
achieved by retrieving Wiki passages with BM25
and selecting the top 3 passages that contain the
answer. Next, we recruit human annotators from
Surge AI to assess whether any of these pas-
sages provide sufficient evidence for answering
the question. The highest-ranked passage that
passed human annotation is chosen as the gold
evidence passage. Finally, both contrast sets have
a subset of questions paired with a correspond-
ing gold evidence passage.
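The candidate-selection step before annotation can be sketched as follows; the `rank_bm25` usage is illustrative, and the passage list is assumed to be the 100-word Wikipedia splits described in §3.2.

```python
# Hedged sketch: rank Wikipedia passages with BM25 and keep the top 3 that contain
# the gold answer; these candidates are then verified by human annotators.
from rank_bm25 import BM25Okapi

def candidate_gold_passages(question, answers, passages, top_n=3):
    tokenized_corpus = [p.lower().split() for p in passages]
    bm25 = BM25Okapi(tokenized_corpus)
    scores = bm25.get_scores(question.lower().split())
    ranked = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    candidates = []
    for i in ranked:
        if any(a.lower() in passages[i].lower() for a in answers):
            candidates.append(passages[i])      # highest-ranked answer-bearing passages
            if len(candidates) == top_n:
                break
    return candidates
```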
4.2 Dataset Analysis
The full dataset is composed of 3,343 MEQs
(2,293 from InstructGPT and 1,050 from Am-
bigQA). Each of these MEQs has its original
question in the NQ training set. Among them,
1,229 (53.6%) InstructGPT questions and 625
(59.5%) AmbigQA questions are paired with a
gold evidence passage from Wikipedia. We use
this subset in ranking evaluation and the full set in
retrieval and end-to-end QA evaluation.
Data Statistics We summarize basic statis-
tics of the MEQ contrast sets compared to the
original NQ questions. As shown in Table 1,
MEQ-GPT is similar to NQ regarding the aver-
age length of questions and answers. Questions in
MEQ-AmbigQA are longer because the original
AmbigQA annotators usually added conditions to
3https://www.surgehq.ai.
4https://github.com/shmsw25/AmbigQA/blob/main/evidence.md.
disambiguate the original NQ questions.

Statistics            NQ-Train   NQ-Test   MEQ-AmbigQA   MEQ-GPT
Size                  79,168     3,610     1,050         2,293
With Gold Passage     58,880     1,766     625           1,229
Question Length       9.17       9.22      10.73         9.69
Answer Length         2.16       2.22      2.62          1.96
#Answers              1.22       1.79      1.47          1.18
Edit Distance         9.10       9.16      2.39          1.18
Semantic Similarity   30.12      29.87     96.47         97.96

Table 1: Dataset statistics. Question lengths, answer lengths, and edit distances are all measured in words. Semantic similarity is computed by Contriever (Izacard et al., 2021). For NQ-train and NQ-test, edit distance and semantic similarity are computed between random question pairs. For MEQ contrast sets, they are computed between the original question and its MEQ.

Besides,
AmbigQA does not impose a limit on the answer
length, while we limit each answer in MEQ-GPT
to at most 5 words, consistent with NQ. The num-
ber of answers per question is lower in MEQ-
GPT than in MEQ-AmbigQA, because most
answers are obtained through strict text matching
on candidate answers from two sources. In ad-
dition, we observe that MEQ-GPT has a smaller
edit distance and higher semantic similarity be-
tween q and q(cid:2), making it hard for models to dis-
tinguish them.
Types of Edits We review and categorize dif-
ferent types of minimal edits that are used to
create MEQs. Since MEQ-AmbigQA primarily
consists of edits that add specifications to the
original NQ question, we consider MEQ-GPT as
a more natural representation of minimal edits.
As shown in Table 2, the edits in MEQ-GPT
involve nouns (28.0%), verbs (18.5%), adjec-
tives (18.2%), numbers (14.2%), ordinals (9.2%),
dates (6.6%), prepositions/conjunctions (2.9%),
and others (2.4%). A word cloud of the edited
words is given in Figure 2. We also observe
that 22.5% of the total edits are antonym edits
where a word in the original question is replaced
by its antonym. Our dataset of diverse MEQs
provides a comprehensive evaluation of contrast
consistency.
4.3 Challenges of MEQ Contrast Sets
The collected MEQ contrast sets are challeng-
ing for the widely-used DPR-based OpenQA sys-
tem, although these perturbed questions are only
minimal edits to the well-learned training ques-
tions. As shown in Figure 1, the model signifi-
cantly underperforms on the contrast sets, where
the passage ranking score of DPR decreases by
39% and 45% compared to NQ-train, and by 29%
and 18% compared to NQ-test. This makes a sub-
stantial impact on the QA performance, with the
accuracy being 69% and 60% lower on the two
contrast sets compared to NQ-train, and 54% and
40% lower than NQ-test. The results show that
the collected MEQs are much harder to solve than
random test questions, which indicates our con-
trast sets can serve as testbeds for evaluating the
contrast consistency of OpenQA.
5 Method: Training DPR with Query-
Side Contrastive Loss
5.1 Preliminary: DPR
As a dense retriever, DPR includes a question
encoder EQ(·) and a passage encoder EP (·). Both
encoders map the input sequence to a dense vector
as its semantic representation. The relevance score
s(q, p) between a question q and a passage p is
defined as the dot product of their representations:
s(q, p) = EQ(q)ᵀ EP(p)
DPR is trained via a contrastive loss. Given a positive passage p+ and a set of negative passages {p−_1, . . . , p−_n} for a certain question, the model is trained to maximize the relevance score between q and p+, while minimizing the relevance score between q and each p−_i. The loss function is:
LQP = −log [ exp(s(q, p+)) / ( exp(s(q, p+)) + Σ_{i=1}^{n} exp(s(q, p−_i)) ) ]
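A minimal PyTorch rendering of this objective is given below for reference; it assumes precomputed embeddings and omits in-batch negatives and other details of the actual DPR implementation.

```python
# Sketch of L_QP: softmax cross-entropy over one positive and n negative passages.
import torch
import torch.nn.functional as F

def l_qp(q_emb, pos_emb, neg_emb):
    """q_emb: [B, d] question embeddings; pos_emb: [B, d] positive passages;
    neg_emb: [B, n, d] negative passages for each question."""
    pos_score = (q_emb * pos_emb).sum(dim=-1, keepdim=True)       # [B, 1]
    neg_score = torch.einsum("bd,bnd->bn", q_emb, neg_emb)        # [B, n]
    logits = torch.cat([pos_score, neg_score], dim=1)             # positive in column 0
    targets = torch.zeros(q_emb.size(0), dtype=torch.long, device=q_emb.device)
    return F.cross_entropy(logits, targets)   # equals -log softmax prob. of the positive
```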
The above training paradigm works well on
retrieving passages for random test questions, but
does not perform as effectively on MEQ contrast
sets, as discussed in §1 and §4.3. The training
loss LQP does not provide explicit signals for
DPR to learn the relationships between questions.
As a result, the question embeddings are insen-
sitive to minimal discrepancies, which prevents
the model from identifying the MEQ as a dis-
tinct question after seeing the original question in
training. This causes DPR to generate an overly
similar embedding for the MEQ, leading to a high
overlap in the retrieved passages and low contrast consistency.

Nouns: 641 (28.0%); antonym edits: 151 (24%)
  Q: Who wrote the music for the national anthem? A: John Stafford Smith
  Q′: Who wrote the lyrics for the national anthem? A: Francis Scott Key
Verbs: 425 (18.5%); antonym edits: 176 (41%)
  Q: When did Australia stop using one cent coins? A: 1992
  Q′: When did Australia start using one cent coins? A: 1966
Adjectives: 418 (18.2%); antonym edits: 146 (35%)
  Q: How many islands are in Andaman and Nicobar? A: 572
  Q′: How many inhabited islands are in Andaman and Nicobar? A: 37
Numbers: 326 (14.2%); antonym edits: –
  Q: Where did season 2 of Jersey Shore take place? A: Miami Beach, Florida
  Q′: Where did season 3 of Jersey Shore take place? A: Seaside Heights, New Jersey
Ordinals: 211 (9.2%); antonym edits: 30 (14%)
  Q: Highest scoring NBA players of all time in one game? A: Wilt Chamberlain
  Q′: Second highest scoring NBA players of all time in one game? A: Kobe Bryant
Dates: 152 (6.6%); antonym edits: –
  Q: Who ruled the Holy Roman Empire in 1509? A: Maximilian I
  Q′: Who ruled the Holy Roman Empire in 1519? A: Charles V
Prepositions/Conjunctions: 66 (2.9%); antonym edits: 14 (21%)
  Q: Where did the Titanic make its maiden voyage from? A: Southampton
  Q′: Where did the Titanic make its maiden voyage to? A: New York

Table 2: Different MEQ edit types in MEQ-GPT with their proportions of antonym edits and examples. The remaining 2.4% of the instances are of miscellaneous types. The first line in each example is the original question and the second line is the MEQ. Words in green and red are the deleted and added words, respectively.

Figure 2: Word cloud of the edited words. Words in green and red are the deleted and added words, respectively. Larger font sizes indicate higher frequencies.

5.2 Proposed Method

We propose to improve the contrast consistency of DPR by introducing a query-side contrastive loss to distinguish between paraphrase questions and MEQs, which are positive and negative question examples for an original question, respectively. We devise a data augmentation approach to collect synthetic question examples to train this loss.

5.2.1 Data Augmentation

For a training question q, its positive example q+ is a synthetic paraphrase question which is slightly different from q and has the same answer; its negative question q− is a synthetic MEQ with a different answer.

To obtain q+, we leverage back translation provided by the nlpaug5 package. The original question q is translated to another language and then translated back to produce a new phrasing of q. We used translation models of 6 languages provided by Ng et al. (2019) and Tiedemann and Thottingal (2020). Questions that are identical to q (i.e., edit distance = 0) or classified as ‘‘not paraphrase’’ by the paraphrase detection model used in §4.1.2 are eliminated. The remaining questions constitute a candidate set of positive questions from which a random q+ is sampled in each epoch.

To obtain q−, synthetic MEQs are retrieved from the machine-built QA corpus PAQ (Lewis et al., 2021). All questions in PAQ that are similar to q are retrieved by the question retriever in the work of PAQ. Then, the MEQ requirements specified in §4.1.2 are applied to filter the retrieved synthetic questions. The remaining questions constitute a candidate set of negative questions from which a random q− is sampled in each epoch.

Apart from learning the relationships among q, q+, and q−, the loss LQP can be augmented to learn the relevance between synthetic questions and their corresponding passages. Because q+ is a paraphrase question mapping the passages of q, it does not have to be involved in LQP. To train on q−, its positive passage is the Wikipedia passage that was used to generate the question during the construction of PAQ; its negative passages are collected from the top-ranked passages retrieved by BM25 which do not contain the answer.

5https://github.com/makcedward/nlpaug.
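The sketch below illustrates this augmentation step. The nlpaug back-translation call is shown with one assumed translation-model pair, and `retrieve_similar_paq_questions`, `meets_meq_criteria`, and `is_paraphrase` are assumed helpers standing in for the PAQ question retriever and the §4.1.2 filters.

```python
# Hedged sketch: build candidate pools of paraphrases (q+) and synthetic MEQs (q-),
# then resample one of each per training question in every epoch.
import random
import nlpaug.augmenter.word as naw

bt_aug = naw.BackTranslationAug(
    from_model_name="facebook/wmt19-en-de",   # assumed model pair; the paper uses 6 languages
    to_model_name="facebook/wmt19-de-en",
)

def positive_pool(q, is_paraphrase):
    outs = bt_aug.augment(q, n=3)
    outs = outs if isinstance(outs, list) else [outs]
    # Drop outputs identical to q or rejected by the paraphrase classifier.
    return [p for p in outs
            if p.strip().lower() != q.strip().lower() and is_paraphrase(q, p)]

def negative_pool(q, a, retrieve_similar_paq_questions, meets_meq_criteria):
    # Synthetic MEQs: PAQ questions similar to q that survive the MEQ filters.
    return [cand for cand in retrieve_similar_paq_questions(q)
            if meets_meq_criteria(q, cand.question, a, cand.answer)]

def sample_epoch_pair(pos_pool, neg_pool):
    q_pos = random.choice(pos_pool) if pos_pool else None
    q_neg = random.choice(neg_pool) if neg_pool else None
    return q_pos, q_neg   # resampled each epoch for varied supervision
```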
Figure 3: Above: the original contrastive training of DPR. Below: our improved DPR with the query-side contrastive loss, where q+ and q− are obtained through data augmentation.
5.2.2 Model Training
To provide more supervision signals and pre-
vent overfitting, we randomly sample q+, q−,
and p− for each training question q in each
epoch. This means that while the original train-
ing questions remain fixed, a different set of aug-
mented questions is used. For explicit supervision
on inter-question relationships, given q, DPR is
trained to assign higher relevance scores to its
paraphrase question (q+) and lower relevance
scores to its MEQ (q−). The relevance score of
any pair of questions (q1, q2) is calculated as the inner product of their embeddings: s(q1, q2) = EQ(q1)ᵀ EQ(q2). Specifically, we consider three forms of query-side contrastive loss functions in experiments:
(1) InfoNCE Loss (van den Oord et al., 2018),
which differentiates the positive question from
a set of m negative questions. Besides the syn-
thetic MEQ which is considered as a hard neg-
ative, the other questions in the same batch are
included as random negatives. The loss func-
tion is:
LQQ = −log [ exp(s(q, q+)) / ( exp(s(q, q+)) + Σ_{j=1}^{m} exp(s(q, q−_j)) ) ].
(2) Dot Product Loss, which directly penal-
izes the relevance score between a sample ques-
tion q and its augmented MEQ:
LQQ = s(q, q−).
(3) Triplet Loss (Schroff et al., 2015), which
trains the model to assign a higher relevance
score to q+ compared to q−, enforced by a margin α:

LQQ = max(0, α − s(q, q+) + s(q, q−)).
The final training loss of our improved DPR is
L = LQP + λLQQ, where the hyperparameter λ
weights the trade-off between the loss terms.
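For clarity, the three LQQ variants and the combined objective can be written as follows. This is a simplified illustration over precomputed question embeddings (q, its paraphrase q+, and its synthetic MEQ q−, each of shape [B, d]); the margin value is a placeholder, and the in-batch negative construction for InfoNCE is one reasonable reading of the description above.

```python
# Sketch of the query-side contrastive losses and L = L_QP + lambda * L_QQ.
import torch
import torch.nn.functional as F

def info_nce_qq(q, q_pos, q_neg):
    pos = (q * q_pos).sum(-1, keepdim=True)   # [B, 1] paraphrase question
    negs = q @ q_neg.t()                      # [B, B]: own MEQ on the diagonal,
                                              # other in-batch MEQs as random negatives
    logits = torch.cat([pos, negs], dim=1)
    targets = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, targets)

def dot_product_qq(q, q_neg):
    return (q * q_neg).sum(-1).mean()         # directly penalize similarity to the MEQ

def triplet_qq(q, q_pos, q_neg, margin=1.0):  # margin alpha is a placeholder value
    s_pos = (q * q_pos).sum(-1)
    s_neg = (q * q_neg).sum(-1)
    return F.relu(margin - s_pos + s_neg).mean()

def total_loss(l_qp_value, q, q_pos, q_neg, variant="infonce", lam=0.5):
    l_qq = {"infonce": lambda: info_nce_qq(q, q_pos, q_neg),
            "dot":     lambda: dot_product_qq(q, q_neg),
            "triplet": lambda: triplet_qq(q, q_pos, q_neg)}[variant]()
    return l_qp_value + lam * l_qq
```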
6 Experiments
In experiments, we compare our proposed training
method against the original training setting of
DPR. After training the models on the NQ train-
ing set, we test them on the standard NQ test set as well as two MEQ contrast sets that we collected in this work.

Model      Augmentation   NQ MR↓   NQ MRR↑   AmbigQA MR↓   AmbigQA MRR↑   GPT MR↓   GPT MRR↑
DPRBASE    None           2.36     0.784     5.09          0.563          5.44      0.507
DPRBASE    Random         2.36     0.781     5.09          0.557          5.25      0.524
DPRBASE    MEQs           2.34     0.783     5.09          0.543          5.10      0.529
DPRBASE    MEQs + LQQ     2.25     0.791     4.85          0.569          4.88      0.547
DPRLARGE   None           2.31     0.780     4.84          0.569          5.46      0.515
DPRLARGE   Random         2.20     0.797     4.98          0.554          5.18      0.533
DPRLARGE   MEQs           2.17     0.797     4.79          0.561          5.00      0.544
DPRLARGE   MEQs + LQQ     2.14     0.804     4.59          0.592          4.61      0.565

Table 3: Ranking evaluation results. MR and MRR stand for mean rank and mean reciprocal rank, respectively. A lower MR or higher MRR indicates better performance. BM25 is not listed because sampling hard negatives from top-ranked passages in BM25 retrieval lowers the ranking performance of BM25 in return.
6.1 Models
We augment the training set with M = 33k syn-
thetic MEQs and train DPR with both LQP and
LQQ. We consider the following baselines:
• Vanilla DPR. This is the original training
setting of DPR, proposed by Karpukhin et al.
(2020). The model is trained only with LQP
on the standard NQ training set.
• DPR with random augmented questions.
This model is trained only with LQP , but
we add M random synthetic questions from
PAQ to the training set. This is to rule out
the effect of simply adding more synthetic
data.
• DPR with augmented MEQs. This model
uses the same set of M synthetic MEQs
retrieved from PAQ as data augmentation,
but is trained only with LQP . We use this
variant to test if LQQ is necessary in model
training.
Additionally, we test the performance of BM25
on retrieval as a reference. Recent research has
shown that larger retrievers may exhibit better
generalization (Ni et al., 2022). Therefore, in ad-
dition to the standard DPR which is built on
BERT-Base, we use BERT-Large as the backbone
model to see: (1) whether MEQ contrast sets are
still challenging for larger models and (2) whether
our training method is still effective for larger
models. We name the smaller model and larger
model DPRBASE and DPRLARGE, respectively.
We use the same set of basic hyper-parameters
for each DPR model: a learning rate of 10−5, a
batch size of 64 (32 for DPRLARGE), 40 train-
ing epochs with 5% warmup steps. On ranking
evaluation, our best setting uses the InfoNCE loss
with λ = 0.5. On retrieval and QA evaluation,
our best setting uses the dot product loss with
λ = 0.03. Since we do not have a dev set for
MEQs,6 we conduct ranking evaluation on MEQ
contrast sets in a dev setting, where we select
the highest score among all checkpoints. Then we
use the checkpoint with the best ranking score to
test its retrieval and QA performance. The scores
on NQ-test are reported using the best checkpoint
on NQ-dev.
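For reference, the basic settings stated in this subsection are collected below; values not reported in the paper (e.g., the optimizer) are intentionally left out.

```python
# Training configuration as reported in Section 6.1 (no additional assumptions).
TRAIN_CONFIG = {
    "learning_rate": 1e-5,
    "batch_size": {"DPR_base": 64, "DPR_large": 32},
    "epochs": 40,
    "warmup_ratio": 0.05,           # 5% warmup steps
    "num_synthetic_meqs": 33_000,   # M = 33k augmented questions
    "query_side_loss": {
        "ranking_eval":          {"type": "InfoNCE",     "lambda": 0.5},
        "retrieval_and_qa_eval": {"type": "dot_product", "lambda": 0.03},
    },
}
```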
6.2 Results
Experimental results on three datasets (NQ-test,
MEQ-AmbigQA, MEQ-GPT) are presented from
Table 3 to Table 6. We have the following
findings:
(1) Our proposed method improves DPR’s abil-
ity to distinguish MEQs.

6We empirically found that model performance on NQ-dev is inconsistent with MEQ contrast sets.

As shown in Tables 3
and 5, on passage ranking and passage retrieval, the DPR trained with query-side contrastive loss outperforms the vanilla DPR on both contrast sets, showing improved contrast consistency on MEQs.

Model      LQQ           AmbigQA MR↓   AmbigQA MRR↑   GPT MR↓   GPT MRR↑
DPRBASE    InfoNCE       4.85          0.569          4.88      0.547
DPRBASE    Dot Product   4.79          0.574          4.98      0.539
DPRBASE    Triplet       4.80          0.568          4.91      0.542
DPRLARGE   InfoNCE       4.76          0.572          4.63      0.570
DPRLARGE   Dot Product   4.59          0.592          4.61      0.565
DPRLARGE   Triplet       4.61          0.582          4.59      0.573

Table 4: Ranking evaluation with different LQQ functions on two MEQ contrast sets. All loss functions outperform the baselines in Table 3.

This improvement is consistent across
models of different sizes. For example, on
MEQ-GPT, our model improves the vanilla DPR
by 8% and 10% on ranking MRR for base and
large versions, respectively. On the choice of LQQ,
Table 4 demonstrates that all three loss functions
improve performance over baselines, while the
optimal setting may require tuning on the specific
dataset.
(2) The query-side contrastive loss contributes
the most to the improved contrast consistency.
Although synthetic MEQs themselves bring more
training signals, the model cannot consistently
outperform the vanilla DPR without LQQ. Ac-
tually, its performance is sometimes even lower
than DPR. In contrast, after including the query-
side contrastive loss, we observe consistent im-
provements across all datasets, as shown in
Tables 3 and 5. For example, on MEQ-AmbigQA,
simply adding synthetic MEQs into the training
set gives 12% lower recall@1 than the vanilla
DPR, while training with LQQ outperforms the
naive augmentation method by 18%.
(3) The improvement does not simply come
from the increased number of training data.
There is no significant difference on the per-
formance between DPR augmented with random
synthetic questions (‘‘Random’’ in ‘‘Augmenta-
tion’’ column) and the original DPR (‘‘None’’ in
the column) in Tables 3, 5, and 6. The average
improvement of inserting random synthetic ques-
tions on all metrics is only 0.2% for DPRBASE
and 1.6% for DPRLARGE, which indicates simply
adding more synthetic data is not an effective
solution.
(4) Improved retrieval performance leads to
higher end-to-end QA accuracy. As shown
in Table 6, our improved DPR provides more relevant information for answer prediction on MEQs. Even using only 1 retrieved passage,
our improved DPR-Large outperforms its vanilla
version by 12% and 11% on two contrast sets,
respectively.
(5) Our method does not sacrifice performance
on standard test questions. After joint training
with the query-side contrastive loss and aug-
mented with synthetic MEQs, our model still
maintains its competitive performance on the stan-
dard NQ test set. Specifically, it outperforms all
baselines in ranking evaluation (see Table 3),
while performing on par with the best baseline in
retrieval and QA scores (see Tables 5 and 6).
Summary: The results are consistent across
ranking, retrieval, and end-to-end QA experi-
ments, which demonstrates the solidity of the
above findings. Nevertheless, the performance of
DPR still has a long way to go, and such a
gap is observed in both base and large versions
of the model. Notably, DPR models perform sig-
nificantly worse on MEQ contrast sets than the
standard test set, even though they are evaluated under a development setting. This suggests that further
research is still necessary to improve the contrast
consistency of retrieval models on MEQs.
6.3 Analysis
Passage Overlap One of the indications that
DPR lacks the ability to distinguish the original
question and its MEQ is the high overlap between
the passages retrieved for each. Figure 4 illus-
trates that both synthetic data augmentation and
the query-side contrastive loss can reduce pas-
sage overlap. The synthetic MEQ augmentation
helps to train the question embeddings of MEQs
closer to their positive passages. Moreover, the
query-side contrastive loss explicitly trains the
model to distinguish the original question and its
MEQ apart. Nevertheless, a lower passage over-
lap does not always indicate better performance.
For instance, our model with the dot product loss
does not have the lowest passage overlap, but performs the best in retrieval evaluation.

Model      Augmentation   NQ R@1   NQ R@5   NQ R@20   AmbigQA R@1   AmbigQA R@5   AmbigQA R@20   GPT R@1   GPT R@5   GPT R@20
BM25       None           23.2     45.3     64.5      16.8          34.2          48.8           21.1      42.7      61.7
DPRBASE    None           46.6     70.0     81.2      28.5          50.0          65.6           31.5      57.3      73.2
DPRBASE    Random         48.2     71.2     81.6      27.5          49.2          65.8           31.8      58.0      73.8
DPRBASE    MEQs           46.4     69.9     81.3      25.2          46.2          62.7           31.5      55.9      72.3
DPRBASE    MEQs + LQQ     48.1     70.8     81.9      29.5          52.3          66.4           32.8      58.7      74.4
DPRLARGE   None           46.0     67.6     80.3      26.8          49.2          64.0           29.2      54.9      70.7
DPRLARGE   Random         49.0     70.9     81.5      26.2          48.4          64.1           31.3      56.7      72.3
DPRLARGE   MEQs           48.0     70.5     81.4      27.7          47.2          61.5           31.2      57.0      71.8
DPRLARGE   MEQs + LQQ     51.0     71.2     81.6      30.1          52.3          65.4           32.5      58.4      73.1

Table 5: Retrieval evaluation results. R@k stands for Recall@k.

Model      Augmentation   NQ 1P   NQ 5P   NQ 20P   AmbigQA 1P   AmbigQA 5P   AmbigQA 20P   GPT 1P   GPT 5P   GPT 20P
BM25       None           16.4    28.4    37.3     10.9         15.2         18.1          13.3     20.5     25.8
DPRBASE    None           32.6    43.2    49.1     14.0         19.7         21.9          17.6     25.8     29.3
DPRBASE    Random         33.7    44.8    49.4     14.7         19.1         22.4          16.8     25.4     29.5
DPRBASE    MEQs           32.0    43.4    48.7     13.5         19.3         23.1          17.1     25.5     29.5
DPRBASE    MEQs + LQQ     34.4    44.7    49.2     16.6         21.8         22.8          19.5     26.7     31.1
DPRLARGE   None           31.4    42.2    47.9     14.3         19.2         21.4          16.1     24.6     29.1
DPRLARGE   Random         33.7    44.6    49.3     13.4         20.4         21.5          17.3     25.5     29.4
DPRLARGE   MEQs           33.0    44.7    48.7     15.7         19.3         21.7          17.4     25.0     29.1
DPRLARGE   MEQs + LQQ     33.7    44.6    49.3     16.1         22.1         23.0          19.4     27.6     31.6

Table 6: End-to-end QA results (Exact Match). 1P, 5P, and 20P are the number of passages read by the FiD reader.

Figure 4: Overlap in top-5 retrieved passages between the original training question and its MEQ.

Figure 5: The ratio of successful MEQ identifications of different models on contrast sets, with paraphrase questions as distractors.
Identification of Inter-question relationships
To further analyze model behavior after the query-
side contrastive training, we test the models’ abil-
ity to distinguish inter-question relationships. A
model is considered successful in identifying the
MEQ if the generated embedding of the original
question is closer to its paraphrase question than to its MEQ. The paraphrase questions are sepa-
rately generated using InstructGPT to avoid con-
flict with those used in data augmentation. As
shown in Figure 5, training with the query-side
contrastive loss leads to an improved ability to
distinguish between paraphrase questions and dif-
ferent questions, which indicates our models are
better at identifying inter-question relationships.
The model trained with InfoNCE loss has the
highest success rate in identifying inter-question
relationships, because it received more training sig-
nals from a positive example and a set of negative
examples than those with other types of loss.
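This identification test can be expressed compactly as below; `encode` stands for the trained question encoder, and the inner product is used as the closeness measure, matching DPR's scoring function (an assumption, since the text only says ''closer'').

```python
# Sketch of the inter-question identification test: success if the original question's
# embedding is closer to its paraphrase than to its MEQ.
import numpy as np

def identification_success_rate(triples, encode):
    """triples: list of (original_question, paraphrase_question, meq) strings."""
    successes = 0
    for q, q_para, q_meq in triples:
        h_q, h_para, h_meq = encode(q), encode(q_para), encode(q_meq)
        if np.dot(h_q, h_para) > np.dot(h_q, h_meq):
            successes += 1
    return successes / len(triples)
```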
7 Conclusion
In this study, we addressed the gap in research
on contrast consistency in OpenQA by collecting
MEQs as challenging contrast sets to the popu-
lar NQ benchmark. Our findings reveal that DPR
lacks contrast consistency on our contrast sets. To
address this limitation, we introduced a query-side
contrastive loss with the aid of data augmen-
tation, which improved its ability to recognize
inter-question relationships. Overall, our findings
and data can pave the way for further exploring
the role of contrast consistency in developing ro-
bust and effective OpenQA systems.
Acknowledgments
This work was supported in part by NSF
IIS-2119531, IIS-2137396, IIS-2142827, CCF-
1901059, and ONR N00014-22-1-2507. Wenhao
Yu is also supported in part by Bloomberg Data
Science PhD Fellowship. We would like to thank
the anonymous reviewers and the action editor
for their valuable suggestions to this paper.
References
Tom B. Brown, Benjamin Mann, Nick Ryder,
Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam,
Girish Sastry, Amanda Askell, Sandhini Agarwal,
Ariel Herbert-Voss, Gretchen Krueger, Tom
Henighan, Rewon Child, Aditya Ramesh,
Daniel M. Ziegler,
Jeffrey Wu, Clemens
Winter, Christopher Hesse, Mark Chen, Eric
Sigler, Mateusz Litwin, Scott Gray, Benjamin
Chess, Jack Clark, Christopher Berner, Sam
McCandlish, Alec Radford, Ilya Sutskever,
and Dario Amodei. 2020. Language models
are few-shot learners. In Advances in Neural
Information Processing Systems 33: Annual
Conference on Neural Information Processing
Systems 2020, NeurIPS 2020.
Danqi Chen, Adam Fisch, Jason Weston, and
Antoine Bordes. 2017. Reading Wikipedia to
answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017. https://doi.org/10.18653/v1/P17-1171
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of
the 2019
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, NAACL-HLT
2019. https://doi.org/10.18653/v1
/N19-1423
Martin Fajcik, Martin Docekal, Karel Ondrej,
and Pavel Smrz. 2021. R2-D2: A modular base-
line for open-domain question answering. In
Findings of
the Association for Compu-
tational Linguistics: EMNLP 2021, pages
854–870, Punta Cana, Dominican Republic.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2021
.findings-emnlp.73
Luyu Gao and Jamie Callan. 2021. Con-
denser: A pre-training architecture for dense
retrieval. In Proceedings of
the 2021 Con-
ference on Empirical Methods in Natural
Language Processing, EMNLP 2021, pages
981–993, Online and Punta Cana, Domini-
can Republic. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/2021.emnlp-main.75
Matt Gardner, Yoav Artzi, Victoria Basmova,
Jonathan Berant, Ben Bogin, Sihao Chen,
Pradeep Dasigi, Dheeru Dua, Yanai Elazar,
Ananth Gottumukkala, Nitish Gupta, Hannaneh
Hajishirzi, Gabriel Ilharco, Daniel Khashabi,
Kevin Lin, Jiangming Liu, Nelson F. Liu,
Phoebe Mulcaire, Qiang Ning, Sameer Singh,
Noah A. Smith, Sanjay Subramanian, Reut
Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou. 2020. Evaluating models' local decision boundaries via contrast sets. In Findings of the Association for Computational Linguistics: EMNLP, pages 1307–1323, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.117
Gautier Izacard, Mathilde Caron, Lucas Hosseini,
Sebastian Riedel, Piotr Bojanowski, Armand
Joulin, and Edouard Grave. 2021. Towards
unsupervised dense information retrieval with
contrastive learning. ArXiv preprint, 2112.09118.
https://doi.org/10.48550/arXiv
.2112.09118
Gautier
Izacard and Edouard Grave. 2021.
Leveraging passage retrieval with generative
models for open domain question answering.
In Proceedings of the 16th Conference of the
European Chapter of the Association for Com-
putational Linguistics: Main Volume, EACL
2021. https://doi.org/10.18653/v1
/2021.eacl-main.74
Vladimir Karpukhin, Barlas Oguz, Sewon Min,
Patrick S. H. Lewis, Ledell Wu, Sergey Edunov,
Danqi Chen, and Wen-tau Yih. 2020. Dense
passage retrieval for open-domain question
answering. In Proceedings of the 2020 Con-
ference on Empirical Methods in Natural Lan-
guage Processing, EMNLP 2020, pages
6769–6781, Online. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/2020.emnlp-main.550
Divyansh Kaushik, Eduard H. Hovy, and Zachary
Chase Lipton. 2020. Learning the difference
that makes a difference with counterfactually-
augmented data. In 8th International Con-
ference on Learning Representations, ICLR
2020.
Akhil Kedia, Mohd Abbas Zaidi, and Haejun
Lee. 2022. Fie: Building a global probability
space by leveraging early fusion in encoder
for open-domain question answering. In Pro-
ceedings of the 2022 Conference on Empirical
Methods in Natural Language Processing,
EMNLP 2022.
Omar Khattab
and Matei Zaharia.
2020.
Colbert: Efficient and effective passage search
via contextualized late interaction over BERT.
In Proceedings of the 43rd International ACM
SIGIR Conference on Research and Develop-
ment in Information Retrieval, SIGIR 2020,
pages 39–48. Association for Computational
Linguistics. https://doi.org/10.1145
/3397271.3401075
Tom Kwiatkowski,
Jennimaria
Palomaki,
Olivia Redfield, Michael Collins, Ankur P.
Parikh, Chris Alberti, Danielle Epstein, Illia
Polosukhin, Jacob Devlin, Kenton Lee, Kristina
Toutanova, Llion Jones, Matthew Kelcey,
Ming-Wei Chang, Andrew M. Dai, Jakob
Uszkoreit, Quoc Le, and Slav Petrov. 2019.
Natural questions: A benchmark for ques-
tion answering research. Transactions of the
Association for Computational Linguistics,
7:452–466. https://doi.org/10.1162
/tacl_a_00276
Patrick S. H. Lewis, Ethan Perez, Aleksandra
Piktus, Fabio Petroni, Vladimir Karpukhin,
Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian
Riedel, and Douwe Kiela. 2020. Retrieval-
augmented generation for knowledge-intensive
NLP tasks. In Advances in Neural Information
Processing Systems 33: Annual Conference on
Neural Information Processing Systems 2020,
NeurIPS 2020.
Patrick S. H. Lewis, Yuxiang Wu, Linqing
Liu, Pasquale Minervini, Heinrich Küttler,
Aleksandra Piktus, Pontus Stenetorp, and
Sebastian Riedel. 2021. PAQ: 65 million probably-
asked questions and what you can do with them.
Transactions of the Association for Computa-
tional Linguistics, 9:1098–1115. https://
doi.org/10.1162/tacl_a_00415
Daliang Li, Ankit Singh Rawat, Manzil Zaheer,
Xin Wang, Michal Lukasik, Andreas Veit,
Felix X. Yu, and Sanjiv Kumar. 2022.
Large language models with controllable
working memory. ArXiv preprint, 2211.05110.
https://doi.org/10.48550/arXiv
.2211.05110
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019. Roberta: A robustly opti-
mized BERT pretraining approach. ArXiv
preprint, 1907.11692. https://doi.org
/10.48550/arXiv.1907.11692
Shayne Longpre, Kartik Perisetla, Anthony Chen,
Nikhil Ramesh, Chris DuBois, and Sameer
Singh. 2021. Entity-based knowledge conflicts in question answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, pages 7052–7063, Online and Punta Cana, Dominican Republic.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2021
.emnlp-main.565
Sewon Min, Julian Michael, Hannaneh Hajishirzi,
and Luke Zettlemoyer. 2020. Ambigqa: An-
swering ambiguous open-domain questions.
In Proceedings of
the 2020 Conference on
Empirical Methods in Natural Language Pro-
cessing, EMNLP 2020, pages 5783–5797,
Online. Association for Computational Linguis-
tics. https://doi.org/10.18653/v1
/2020.emnlp-main.466
Nathan Ng, Kyra Yee, Alexei Baevski, Myle
Ott, Michael Auli, and Sergey Edunov.
2019. Facebook fair’s WMT19 news transla-
tion task submission. In Proceedings of
the
Fourth Conference on Machine Translation,
WMT 2019, pages 314–319, Florence, Italy.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/W19
-5333
Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai,
Gustavo Hernández Abrego, Ji Ma, Vincent Y.
Zhao, Yi Luan, Keith B. Hall, Ming-Wei
Chang, and Yinfei Yang. 2022. Large dual
encoders are generalizable retrievers. In Pro-
ceedings of the 2022 Conference on Empirical
Methods in Natural Language Processing,
EMNLP 2022, pages 9844–9855. Association
for Computational Linguistics.
Aäron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. ArXiv preprint, 1807.03748. https://doi.org/10.48550/arXiv.1807.03748
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida,
Carroll L. Wainwright, Pamela Mishkin, Chong
Zhang, Sandhini Agarwal, Katarina Slama,
Alex Ray, John Schulman, Jacob Hilton, Fraser
Kelton, Luke Miller, Maddie Simens, Amanda
Askell, Peter Welinder, Paul F. Christiano,
Jan Leike, and Ryan Lowe. 2022. Training
language models to follow instructions with
human feedback. ArXiv preprint, 2203.02155.
https://doi.org/10.48550/arXiv
.2203.02155
Bhargavi Paranjape, Matthew Lamm, and Ian
Tenney. 2022. Retrieval-guided counterfactual
generation for QA. In Proceedings of the 60th
Annual Meeting of the Association for Compu-
tational Linguistics (Volume 1: Long Papers),
ACL 2022, pages 1670–1686, Dublin, Ireland.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2022.acl-long.117
Jae Sung Park, Sheng Shen, Ali Farhadi, Trevor Darrell, Yejin Choi, and Anna Rohrbach. 2022. Exposing the limits of video-text models through contrast sets. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, pages 3574–3586, Seattle, United States. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.naacl-main.261
Stephen E. Robertson and Hugo Zaragoza. 2009.
The probabilistic relevance framework: BM25
and beyond. Foundations and Trends® in Information Retrieval, 3(4):333–389. https://
doi.org/10.1561/1500000019
Alexis Ross, Tongshuang Wu, Hao Peng, Matthew E. Peters, and Matt Gardner. 2022. Tailor: Generating and perturbing text with semantic controls. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1:
Long Papers), ACL 2022, pages 3194–3213,
Dublin, Ireland. Association for Computational
Linguistics. https://doi.org/10.18653
/v1/2022.acl-long.228
Florian Schroff, Dmitry Kalenichenko, and James
Philbin. 2015. Facenet: A unified embedding
for face recognition and clustering. In IEEE
Conference on Computer Vision and Pattern
Recognition, CVPR 2015, pages 815–823,
Boston, MA. https://doi.org/10.1109
/CVPR.2015.7298682
Jörg Tiedemann and Santhosh Thottingal. 2020. OPUS-MT – Building open translation services for the world. In Proceedings of the 22nd
Annual Conference of the European Association for Machine Translation, EAMT 2020, pages 479–480, Lisboa, Portugal. European Association for Machine Translation.
Alex Wang, Amanpreet Singh, Julian Michael,
Felix Hill, Omer Levy, and Samuel R. Bowman.
2019. GLUE: A multi-task benchmark and
analysis platform for natural language understanding. In 7th International Conference on Learning Representations, ICLR 2019, pages 353–355, Brussels, Belgium.
Association for Computational Linguistics.
OpenReview.net. https://doi.org/10
.18653/v1/W18-5446
Xuezhi Wang, Jason Wei, Dale Schuurmans,
Quoc V. Le, Ed H. Chi, and Denny Zhou. 2022.
Self-consistency improves chain of thought reasoning in language models. ArXiv preprint, 2203.11171. https://doi.org/10.48550/arXiv.2203.11171
Xi Ye, Rohan Nair, and Greg Durrett. 2021. Con-
necting attributions and QA model behavior
on realistic counterfactuals. In Proceedings of
the 2021 Conference on Empirical Methods in
Natural Language Processing, EMNLP 2021,
pages 5496–5512, Online and Punta Cana,
Dominican Republic. Association for Compu-
tational Linguistics. https://doi.org/10
.18653/v1/2021.emnlp-main.447
Wenhao Yu, Dan Iter, Shuohang Wang, Yichong
Xu, Mingxuan Ju, Soumya Sanyal, Chenguang
Zhu, Michael Zeng, and Meng Jiang. 2023.
Generate rather
than retrieve: Large lan-
guage models are strong context generators.
In 11th International Conference on Learning
Representations, ICLR 2023.
Tongshuang Wu, Marco Túlio Ribeiro, Jeffrey Heer, and Daniel S. Weld. 2021. Polyjuice: Generating counterfactuals for explaining, evaluating, and improving models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, pages 6707–6723, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.523
Fengbin Zhu, Wenqiang Lei, Chao Wang,
Jianming Zheng, Soujanya Poria, and Tat-Seng
Chua. 2021. Retrieving and reading: A
comprehensive survey on open-domain ques-
tion answering. ArXiv preprint, 2101.00774.
https://doi.org/10.48550/arXiv.2101
.00774