Exploring Contrast Consistency of Open-Domain Question Answering
Systems on Minimally Edited Questions

Zhihan Zhang, Wenhao Yu, Zheng Ning, Mingxuan Ju, Meng Jiang
University of Notre Dame, Notre Dame, IN, USA
{zzhang23, wyu1, zning, mju2, mjiang2}@nd.edu

Abstract

Contrast consistency, the ability of a model
to make consistently correct predictions in the
presence of perturbations, is an essential aspect
in NLP. While studied in tasks such as sen-
timent analysis and reading comprehension,
it remains unexplored in open-domain ques-
tion answering (OpenQA) due to the difficulty
of collecting perturbed questions that satisfy
factuality requirements. In this work, we col-
lect minimally edited questions as challeng-
ing contrast sets to evaluate OpenQA models.
Our collection approach combines both human
annotation and large language model genera-
tion. We find that the widely used dense pas-
sage retriever (DPR) performs poorly on our
contrast sets, despite fitting the training set
well and performing competitively on standard
test sets. To address this issue, we introduce
a simple and effective query-side contrastive
loss with the aid of data augmentation to im-
prove DPR training. Our experiments on the
contrast sets demonstrate that DPR’s contrast
consistency is improved without sacrificing its
accuracy on the standard test sets.1

1 Introduction

Contrast consistency (Gardner et al., 2020) is a crucial aspect for
neural models in NLP. Models are expected to identify perturbations in
the text input and decide whether such a semantic shift leads to a
different label. To evaluate this consistency, contrast sets have been
introduced in various tasks such as sentiment analysis (Wu et al.,
2021), natural language inference (Ross et al., 2022), and reading
comprehension (Longpre et al., 2021) by minimally modifying the
original input to reverse the original label. However, to the best of
our knowledge, there is no study on contrast consistency in
open-domain question answering (OpenQA). In OpenQA, even a slight
modification of a word or two can alter the meaning of the question,
which leads to a completely different answer. To maintain contrast
consistency, models are expected to predict the corresponding answer
when such a semantic shift occurs.

1Data and code are available at https://github.com/ytyz1307zzh/Minimally_Edited_Questions.

Studying contrast consistency in OpenQA poses unique challenges.
First, collecting appropriate contrast sets is difficult. While
contrast sets have been developed for reading comprehension (Longpre
et al., 2021; Li et al., 2022), they typically replaced an entity
(e.g., Barack Obama was born in Hawaii) in a given context with
another entity (e.g., Barack Obama was born in New York), leading to a
different answer to the given question (e.g., Where was Barack Obama
born?). Constructing such contrast sets does not necessitate the
factuality of the perturbed context, as the answer depends solely on
the context rather than world knowledge. However, in the absence of
evidence context, the perturbed questions in OpenQA must be factually
answerable in accordance with world knowledge, which is beyond what
rule-based methods can do. Second, achieving contrast consistency is
challenging for OpenQA models, which usually follow the
“retrieve-then-read” pipeline (Lewis et al., 2020). In addition to the
challenge of predicting answers from a contrast context as in reading
comprehension, models also face the challenge of mapping the perturbed
question to its corresponding evidence passage in a large corpus. The
latter requires the retriever to distinguish the minimal semantic
difference between embeddings of the perturbed question and the
original question, which is ignored in typical retriever training.

Transactions of the Association for Computational Linguistics, vol. 11, pp. 1082–1096, 2023. https://doi.org/10.1162/tacl_a_00591
Action Editor: Lidong Bing. Submission batch: 2/2023; Revision batch: 5/2023; Published 8/2023.
© 2023 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

To fill this gap in OpenQA, we propose to create contrast sets using
Minimally Edited Questions (MEQs). Given a question q and its answer
a, an MEQ q′ is defined as a question that possesses high lexical and
semantic similarity with q, while having a distinct answer a′
(a′ ≠ a). For example, in Figure 1, changing “Pet Sematary 2” to
“Pet Sematary” generates an MEQ that resembles the original question
but has a distinct answer (“Coweta County, Georgia” → “Maine”). We use
the training set of an existing benchmark as the original questions
because neural OpenQA models exhibit high performance on them. Thus,
we are able to evaluate the models’ ability to distinguish MEQs by
measuring their performance on the MEQ contrast set. Specifically, we
collect MEQs for training questions in the Natural Questions (NQ)
benchmark (Kwiatkowski et al., 2019) from two sources, namely,
(1) InstructGPT-based question generation (Ouyang et al., 2022)
followed by crowdsourced annotation and (2) the AmbigQA dataset (Min
et al., 2020).

Figure 1: Above: Trained on question q1 but not on a contrast question
q2, DPR generated an embedding of q2 that is overly similar to that of
q1 and thus falsely retrieved p1. We aim to identify q2 as a distinct
question and retrieve p2 instead. Below: The performance of DPR-based
OpenQA models on the standard NQ question set and our contrast set of
minimally edited questions (MEQs).

We find that the state-of-the-art OpenQA models which employ the dense
passage retriever (DPR) (Karpukhin et al., 2020) struggle on our MEQ
contrast sets. As shown in Figure 1, DPR-retrieved passages lead to
63% downstream QA accuracy on the training set and 43% on the standard
test set. However, the accuracy drops to 20%∼25% on our MEQ contrast
sets. The problem lies in the contrastive training process of DPR. The
model is trained to optimize question embeddings to be closer to their
positive passage embeddings2 than negative passage embeddings. This
paradigm does not provide explicit signals for understanding the
relationships between questions, which causes the generated question
embeddings to be insensitive to minimal discrepancies. As a result,
the model generates overly similar embeddings for the MEQ and the
original question, leading to incorrect passage retrieval for the MEQ.
In fact, the overlap between the retrieved passages of the original
question and those of its MEQ is as high as ∼70%, which reflects DPR’s
limited ability in distinguishing the questions. To overcome such
limitations, it is necessary to complement DPR training with signals
on inter-question relationships. Besides building the mapping between
questions and passages, DPR needs to know which questions are the same
and which are different.

2A passage that provides evidence for answering the question is its
positive passage, otherwise a negative passage.

In this pioneering study, we propose a simple and effective method
based on a query-side contrastive loss to improve the performance of
DPR on MEQs. Specifically, in order to learn inter-question
relationships, DPR is trained to distinguish between paraphrase
questions and semantically different questions. To achieve this, we
obtain synthetic MEQs for training questions from the machine-created
QA corpus, PAQ (Lewis et al., 2021), as augmented data. Experiments
demonstrate that learning the query-side contrastive loss on the
augmented MEQs improves the performance of DPR on contrast sets,
without sacrificing its performance on standard open-domain questions
in the NQ test set.

2 Related Work

2.1 Open-Domain Question Answering

OpenQA is a task that aims to answer user questions without any
specified context, thereby testing the ability of QA systems to
retrieve, comprehend, and utilize world knowledge (Zhu et al., 2021).
The state-of-the-art approach in OpenQA is a two-stage pipeline,
consisting of evidence retrieval and answer prediction (Chen et al.,
2017).
In the evidence retrieval stage, a retriever model finds evidence
passages from a large corpus (e.g., Wikipedia) based on their relevance
to the question. Traditional retrievers like BM25 (Robertson and
Zaragoza, 2009) perform lexical matching to measure such relevance
scores. Recently, DPR (Karpukhin et al., 2020) revolutionized the
field by employing dual BERT (Devlin et al., 2019) encoders to compute
embeddings for the question and the passage, respectively. It searches
evidence passages based on the inner product of question and passage
embeddings. Although subsequent approaches have sought to improve the
architecture of the retriever by using fine-grained question-passage
interactions (Khattab and Zaharia, 2020) or enhancing global embedding
training (Gao and Callan, 2021), DPR remains the most widely used
model due to its simplicity and efficiency. However, the capability of
DPR in distinguishing contrastive information has not been thoroughly
studied. In this work, we use MEQs as contrast sets and show that DPR
has limited contrast consistency when solving MEQs.

In the answer prediction stage, a reader model encodes and fuses the
representations of all passages, then predicts an answer by extracting
a span (Kedia et al., 2022), generating a free-form sequence (Izacard
and Grave, 2021), or using a hybrid approach (Fajcik et al., 2021).
While answer prediction is also challenging on MEQs, our approach
mainly focuses on the retrieval part, which is the bottleneck of
solving the MEQs in OpenQA.

2.2 Contrast Sets

NLP benchmark datasets are typically composed of i.i.d. examples that
are randomly divided into training and test sets. Conversely, contrast
sets refer to data created from small yet label-changing modifications
to the existing examples (Gardner et al., 2020). Such characteristics
make contrast sets an ideal testbed for evaluating contrast
consistency. For example, Gardner et al. (2020) and Kaushik et al.
(2020) employed humans to modify linguistic patterns on tasks like
syntactic parsing, relation extraction, and claim verification. On
sentiment analysis and language inference tasks, controlled text
modification models could automatically generate contrast sets (Wu et
al., 2021; Ross et al., 2022). In reading comprehension, rule-based
algorithms created contrast sets by replacing the answer with another
entity (Longpre et al., 2021; Ye et al., 2021; Li et al., 2022). In
video-to-text matching, a pre-trained T5 model was used to find
replacements for verbs and entities in the original caption (Park et
al., 2022).

Nevertheless, building contrast sets to evaluate contrast consistency
in OpenQA has not been explored yet, where data collection must
guarantee the factuality of MEQs. The most relevant work is Paranjape
et al. (2022), which automatically generated perturbed questions for
data augmentation on QA datasets. However, we focus on collecting
challenging MEQs to evaluate model consistency instead of data
augmentation. Moreover, their generated questions did not meet the
requirements of MEQs. The limited accuracy of the question generation
model would lead to a lot of noise instead of perfect factuality.
Also, their method did not ensure the minimality of edits. Therefore,
their generated data cannot be used as challenging contrast sets to
evaluate contrast consistency in OpenQA.

3 Task: Contrast Consistency on MEQs

3.1 Problem Formulation

In this work, we study minimally edited questions (MEQs) as
challenging contrast sets in OpenQA. Suppose we have two questions q
and q′ with answers a and a′, respectively, where q is the original
question in the training set and q′ is an MEQ of q. In this study, the
minimality of edits is measured in two aspects: lexical distance
$d_\ell(q, q')$ and semantic distance $d_s(q, q')$. That is to say, q′
needs to satisfy $d_\ell(q, q') \le \epsilon_\ell$,
$d_s(q, q') \le \epsilon_s$, and $a' \ne a$, where $\epsilon_\ell$ and
$\epsilon_s$ are distance thresholds.
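
The three conditions above can be sketched as a single predicate. This is only an illustration under stated assumptions: word-level edit distance stands in for the lexical distance, one minus cosine similarity stands in for the semantic distance, answers are compared as plain strings, and the function names and threshold arguments are hypothetical rather than taken from the released code.

```python
import math
from typing import Sequence


def word_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance over word tokens (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        curr = [i]
        for j, wb in enumerate(b, start=1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,          # delete wa
                            curr[j - 1] + 1,      # insert wb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


def is_meq(q: str, q_prime: str, a: str, a_prime: str,
           emb_q: Sequence[float], emb_q_prime: Sequence[float],
           eps_lex: int, eps_sem: float) -> bool:
    """Check d_lex <= eps_lex, d_sem <= eps_sem, and a' != a."""
    d_lex = word_edit_distance(q.lower().split(), q_prime.lower().split())
    d_sem = cosine_distance(emb_q, emb_q_prime)
    return d_lex <= eps_lex and d_sem <= eps_sem and a_prime != a
```

In practice the embeddings would come from a sentence encoder; here they are passed in as plain lists so the sketch stays self-contained.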

3.2 Evaluation Metrics

To evaluate DPR on MEQ contrast sets, we consider metrics on both
ranking and retrieval evaluation. Additionally, we run end-to-end QA
experiments using the passages retrieved by DPR.

Ranking evaluation measures the model’s ability to differentiate a
positive passage from negative passages, by ranking a set of candidate
passages based on the relevance score to the question. We collect 50
candidates for each question, including a positive passage, 30 hard
negative passages, and 19 random negative passages. Hard negatives are
the top-ranked passages in BM25 retrieval that do not contain the
answer. We report Mean Rank (MR) and Mean Reciprocal Rank (MRR) of the
positive passage.
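
As a rough illustration of the two ranking metrics, the sketch below computes the rank of the positive passage from a list of relevance scores and then averages ranks and reciprocal ranks over questions. The helper names are hypothetical, and counting ties against the model is an assumption made here, not something the text specifies.

```python
from typing import Sequence, Tuple


def positive_rank(scores: Sequence[float], positive_index: int) -> int:
    """1-based rank of the positive passage among all candidates.

    Assumption: a negative passage whose score ties with the positive
    one is ranked ahead of it (a pessimistic tie-breaking rule).
    """
    pos_score = scores[positive_index]
    better = sum(1 for i, s in enumerate(scores)
                 if i != positive_index and s >= pos_score)
    return 1 + better


def mean_rank_and_mrr(ranks: Sequence[int]) -> Tuple[float, float]:
    """Mean Rank (lower is better) and Mean Reciprocal Rank (higher is better)."""
    mr = sum(ranks) / len(ranks)
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    return mr, mrr
```

With 50 candidates per question, `positive_rank` would be called once per question on the 50 relevance scores, and the resulting ranks passed to `mean_rank_and_mrr`.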

Retrieval evaluation tests the model’s ability
to retrieve passages relevant to answering the
question from a large corpus. Our retrieval cor-
pus contains ∼21M passages from Wikipedia.
We calculate Recall@k, the number of passages
containing the answer in top-k retrieved passages.
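
A minimal sketch of this measurement follows, assuming that a passage "contains the answer" when any gold answer string appears in it as a case-insensitive substring (the matching rule is an assumption). `hits_in_top_k` follows the wording above by counting answer-bearing passages among the top k; `top_k_hit_rate` is a different, hypothetical aggregation included only for comparison, which reports the fraction of questions with at least one hit.

```python
from typing import Sequence


def contains_answer(passage: str, answers: Sequence[str]) -> bool:
    """True if any gold answer occurs in the passage (case-insensitive)."""
    text = passage.lower()
    return any(ans.lower() in text for ans in answers)


def hits_in_top_k(ranked_passages: Sequence[str],
                  answers: Sequence[str], k: int) -> int:
    """Number of passages among the top k that contain an answer."""
    return sum(contains_answer(p, answers) for p in ranked_passages[:k])


def top_k_hit_rate(all_ranked: Sequence[Sequence[str]],
                   all_answers: Sequence[Sequence[str]], k: int) -> float:
    """Fraction of questions with at least one answer-bearing passage in the top k."""
    hits = [hits_in_top_k(r, a, k) > 0 for r, a in zip(all_ranked, all_answers)]
    return sum(hits) / len(hits)
```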

End-to-end QA evaluation checks whether the
retrieved passages contain useful information for
predicting the correct answer. The retrieved pas-
sages are fed into a Fusion-in-Decoder (FiD)
reader (Izacard and Grave, 2021) trained on NQ.
We calculate Exact Match between model pre-
dictions and answers.
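
Exact Match is usually computed after light text normalization. The sketch below assumes a normalization that lowercases, strips punctuation, drops English articles, and collapses whitespace; the text does not state which normalization was used, so these steps and the function names are assumptions.

```python
import re
import string
from typing import Sequence

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, remove punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: Sequence[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)
```

Because a question may have several gold answers, a prediction counts as correct if it matches any one of them.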

4 Data: MEQ Contrast Sets

4.1 Dataset Construction

Based on the above evaluation metrics, we collect two MEQ contrast
sets to evaluate models’ contrast consistency. The first set, referred
to as MEQ-GPT, is generated using InstructGPT (Ouyang et al., 2022),
then manually filtered and annotated with answers by crowdsource
workers. The second set, named MEQ-AmbigQA, is sourced from the
AmbigQA dataset (Min et al., 2020). The construction of our contrast
sets consists of four phases: question collection, MEQ filtering,
answer annotation, and evidence passage annotation.

4.1.1 Collection of Candidate MEQs

MEQ-InstructGPT Generating answerable MEQs is very difficult for
crowdsource workers who are not domain experts. It is hard for them to
determine which modifications to the original question result in an
answerable MEQ without extensive Internet searches. However, recent
GPT-3 models have demonstrated their ability to possess a vast amount
of knowledge through massive pre-training (Brown et al., 2020).
Therefore, we first utilize the InstructGPT model (text-davinci-002)
to generate a set of MEQ candidates, and leave the answer annotation
task to crowdsource workers. The input to InstructGPT is of the form
$[I, x_1, \cdots, x_t, q, a]$, where $I$ is the instruction “Generate
a similar question that has a different answer”. $\{x_i\}_{i=1}^{t}$
are in-context demonstrations that are manually created, where each
$x_i$ is a tuple $[q_i, a_i, q_i', a_i']$ ($q_i'$ is the MEQ of
$q_i$). The original question q and answer a are appended to the
input, prompting InstructGPT to generate a new question q′ and its
answer a′ to complete the sequence. For each input q, we sample 10
completions from InstructGPT to generate a set of candidate MEQs.
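
The input format described above can be sketched as a small prompt builder. Only the instruction text and the ordering (instruction, demonstrations, then the original question and answer) come from the description; the field labels such as `Question:` and `New question:`, the line layout, and the function name are hypothetical, and the call to the language model itself is deliberately left out.

```python
from typing import List, Tuple

INSTRUCTION = "Generate a similar question that has a different answer"

# One demonstration x_i = (q_i, a_i, q_i', a_i').
Demo = Tuple[str, str, str, str]


def build_meq_prompt(demos: List[Demo], question: str, answer: str) -> str:
    """Assemble [I, x_1, ..., x_t, q, a] into one text prompt.

    The prompt ends with an open "New question:" slot so that the model
    completes it with an MEQ and its answer.
    """
    parts = [INSTRUCTION, ""]
    for q, a, q_new, a_new in demos:
        parts += [f"Question: {q}", f"Answer: {a}",
                  f"New question: {q_new}", f"New answer: {a_new}", ""]
    parts += [f"Question: {question}", f"Answer: {answer}", "New question:"]
    return "\n".join(parts)


demo = ("Who wrote the music for the national anthem?", "John Stafford Smith",
        "Who wrote the lyrics for the national anthem?", "Francis Scott Key")
prompt = build_meq_prompt(
    [demo], "When did Australia stop using one cent coins?", "1992")
```

Sampling several completions for the same prompt would then yield the set of candidate MEQs for that question.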

MEQ-AmbigQA The AmbigQA dataset initially targeted a subset of NQ
consisting of ambiguous questions. The dataset was introduced to
decompose each ambiguous question into multiple disambiguated
questions, each of which is a slight modification of the original
question. For each NQ question covered in AmbigQA, its corresponding
disambiguated questions are considered as its candidate MEQs and are
delivered to the subsequent filtering phase (§4.1.2). However, such
questions are limited as we set strict criteria for MEQs, so we need
more data generated by InstructGPT for solid evaluation.

4.1.2 MEQ Filtering

To build challenging contrast sets, a series of
criteria are applied to eliminate unqualified can-
didates and select MEQs based on the definition
in §3.1.

1. Quality control: We discard q′ if it differs from q in its question
word (e.g., how, what), or if the only word that q′ adds to q falls
into {first, last, new, next, original, not}. We have found that
InstructGPT frequently adds these words to create MEQs, but they
usually lead to unanswerable questions.

2. Lexical distance: Word-level edit distance is used as
$d_\ell(q, q')$, and we remove q′ if $d_\ell(q, q') = 0$ or
$d_\ell(q, q') > 3$.

3. Semantic distance: The cosine similarity of semantic embeddings is
used to measure $d_s(q, q')$. We remove q′ if
$\cos(h_q, h_{q'}) < 0.95$, which indicates a non-negligible semantic
discrepancy. The semantic embedding h is generated by a sentence
embedding model. Here we use the question encoder of the unsupervised
dense retrieval model Contriever (Izacard et al., 2021).

4. Paraphrase filtering: q′ is discarded if it is determined to be a
paraphrase of q by a paraphrase detection model. Here we use a
RoBERTa-large (Liu et al., 2019) fine-tuned on the Quora Question
Pairs dataset (Wang et al., 2019) for paraphrase classification.

5. Answer difference: q′ is discarded if a′ = a. For AmbigQA
questions, since they are originally human-annotated, we ask human
volunteers to check whether a′ and a are aliases of the same entity.
For GPT-generated questions, the inspection of answer difference is
included in the answer annotation process, which we will elaborate on
in §4.1.3.

Among GPT-generated questions, for a certain original question q,
there may be multiple MEQ candidates that pass the above filtering. In
such cases, the question that is generated most frequently across 10
samples is selected as the most confident MEQ by InstructGPT. This is
similar to the self-consistency idea in Wang et al. (2022).

4.1.3 Answer Annotation

Due to the limited accuracy of InstructGPT in directly answering
open-domain questions (Yu et al., 2023), we recruit crowdsource
workers to annotate the answer of each candidate MEQ generated by
InstructGPT. Before human annotation, we first check the answer
generated by InstructGPT via Google Search.
If Google Search returns a highlighted answer box which matches the
InstructGPT-generated answer, we skip the subsequent human labeling
step. For the remaining questions, we recruit human annotators from
Surge AI3 for data labeling. We ask them the following questions:

Q1. Is q′ a good variation of q? Bad variations include being
unanswerable or having the same answer as q, and are discarded from
our dataset.

Q2. If q′ is deemed a good variation, find the answer a′ using search
engines. If necessary, the question may have multiple answers.

Quality Control To ensure answer correctness, each question is
answered by two different annotators. If the annotators disagree on
the answer or if either annotator determines the question is a bad
variation, the question is discarded. Since the answers are free-form
responses, we manually check whether the answers given by two
annotators are aliases of the same entity. If the response of the
first annotator matches exactly with the answer provided by
InstructGPT, we do not recruit a second annotator, in order to reduce
costs.

4.1.4 Gold Evidence Passages

As mentioned in §3.2, ranking evaluation on MEQs needs gold evidence
passages as positive examples, so we collect them from Wikipedia for
our contrast sets. For MEQ-AmbigQA, we utilize the semi-oracle
evidence documents4 provided by the original authors, dividing them
into 100-word passages. Then, we identify the first passage that
contains the gold answer. For MEQ-GPT, our initial step involves
finding candidate evidence passages that include the gold answer. This
is achieved by retrieving Wiki passages with BM25 and selecting the
top 3 passages that contain the answer. Next, we recruit human
annotators from Surge AI to assess whether any of these passages
provide sufficient evidence for answering the question. The
highest-ranked passage that passed human annotation is chosen as the
gold evidence passage.
Finally, both contrast sets have a subset of questions paired with a
corresponding gold evidence passage.

3https://www.surgehq.ai.
4https://github.com/shmsw25/AmbigQA/blob/main/evidence.md.

4.2 Dataset Analysis

The full dataset is composed of 3,343 MEQs (2,293 from InstructGPT and
1,050 from AmbigQA). Each of these MEQs has its original question in
the NQ training set. Among them, 1,229 (53.6%) InstructGPT questions
and 625 (59.5%) AmbigQA questions are paired with a gold evidence
passage from Wikipedia. We use this subset in ranking evaluation and
the full set in retrieval and end-to-end QA evaluation.

                        NQ                  MEQ
Statistics              Train     Test      AmbigQA   GPT
Size                    79,168    3,610     1,050     2,293
With Gold Passage       58,880    1,766     625       1,229
Question Length         9.17      9.22      10.73     9.69
Answer Length           2.16      2.22      2.62      1.96
#Answers                1.22      1.79      1.47      1.18
Edit Distance           9.10      9.16      2.39      1.18
Semantic Similarity     30.12     29.87     96.47     97.96

Table 1: Dataset statistics. Question lengths, answer lengths, and
edit distances are all measured in words. Semantic similarity is
computed by Contriever (Izacard et al., 2021). For NQ-train and
NQ-test, edit distance and semantic similarity are computed between
random question pairs. For MEQ contrast sets, they are computed
between the original question and its MEQ.

Data Statistics We summarize basic statistics of the MEQ contrast sets
compared to the original NQ questions. As shown in Table 1, MEQ-GPT is
similar to NQ regarding the average length of questions and answers.
Questions in MEQ-AmbigQA are longer because the original AmbigQA
annotators usually added conditions to disambiguate the original NQ
questions.
Besides, AmbigQA does not impose a limit on the answer length, while
we limit each answer in MEQ-GPT to at most 5 words, consistent with
NQ. The number of answers per question is lower in MEQ-GPT than in
MEQ-AmbigQA, because most answers are obtained through strict text
matching on candidate answers from two sources. In addition, we
observe that MEQ-GPT has a smaller edit distance and higher semantic
similarity between q and q′, making it hard for models to distinguish
them.

Types of Edits We review and categorize different types of minimal
edits that are used to create MEQs. Since MEQ-AmbigQA primarily
consists of edits that add specifications to the original NQ question,
we consider MEQ-GPT as a more natural representation of minimal edits.
As shown in Table 2, the edits in MEQ-GPT involve nouns (28.0%), verbs
(18.5%), adjectives (18.2%), numbers (14.2%), ordinals (9.2%), dates
(6.6%), prepositions/conjunctions (2.9%), and others (2.4%). A word
cloud of the edited words is given in Figure 2. We also observe that
22.5% of the total edits are antonym edits where a word in the
original question is replaced by its antonym. Our dataset of diverse
MEQs provides a comprehensive evaluation of contrast consistency.

4.3 Challenges of MEQ Contrast Sets

The collected MEQ contrast sets are challenging for the widely used
DPR-based OpenQA system, although these perturbed questions are only
minimal edits to the well-learned training questions. As shown in
Figure 1, the model significantly underperforms on the contrast sets,
where the passage ranking score of DPR decreases by 39% and 45%
compared to NQ-train, and by 29% and 18% compared to NQ-test. This
makes a substantial impact on the QA performance, with the accuracy
being 69% and 60% lower on the two contrast sets compared to NQ-train,
and 54% and 40% lower than NQ-test.
The results show that the collected MEQs are much harder to solve than
random test questions, which indicates our contrast sets can serve as
testbeds for evaluating the contrast consistency of OpenQA.

Edit Type       Proportion     Antonym Edits   Example
Nouns           641 (28.0%)    151 (24%)       Q: Who wrote the music for the national anthem? A: John Stafford Smith
                                               Q: Who wrote the lyrics for the national anthem? A: Francis Scott Key
Verbs           425 (18.5%)    176 (41%)       Q: When did Australia stop using one cent coins? A: 1992
                                               Q: When did Australia start using one cent coins? A: 1966
Adjectives      418 (18.2%)    146 (35%)       Q: How many islands are in Andaman and Nicobar? A: 572
                                               Q: How many inhabited islands are in Andaman and Nicobar? A: 37
Numbers         326 (14.2%)    –               Q: Where did season 2 of Jersey Shore take place? A: Miami Beach, Florida
                                               Q: Where did season 3 of Jersey Shore take place? A: Seaside Heights, New Jersey
Ordinals        211 (9.2%)     30 (14%)        Q: Highest scoring NBA players of all time in one game? A: Wilt Chamberlain
                                               Q: Second highest scoring NBA players of all time in one game? A: Kobe Bryant
Dates           152 (6.6%)     –               Q: Who ruled the Holy Roman Empire in 1509? A: Maximilian I
                                               Q: Who ruled the Holy Roman Empire in 1519? A: Charles V
Prepositions/   66 (2.9%)      14 (21%)        Q: Where did the Titanic make its maiden voyage from? A: Southampton
Conjunctions                                   Q: Where did the Titanic make its maiden voyage to? A: New York

Table 2: Different MEQ edit types in MEQ-GPT with their proportions of
antonym edits and examples. The remaining 2.4% of the instances are of
miscellaneous types. The first line in each example is the original
question and the second line is the MEQ. Words in green and red are
the deleted and added words, respectively.

Figure 2: Word cloud of the edited words. Words in green and red are
the deleted and added words, respectively. Larger font sizes indicate
higher frequencies.

5 Method: Training DPR with Query-Side Contrastive Loss

5.1 Preliminary: DPR

As a dense retriever, DPR includes a question encoder $E_Q(\cdot)$ and
a passage encoder $E_P(\cdot)$. Both encoders map the input sequence
to a dense vector as its semantic representation. The relevance score
$s(q, p)$ between a question q and a passage p is defined as the dot
product of their representations:

$$s(q, p) = E_Q(q)^\top E_P(p)$$

DPR is trained via a contrastive loss. Given a positive passage $p^+$
and a set of negative passages $\{p_i^-\}_{i=1}^{n}$ for a certain
question, the model is trained to maximize the relevance score between
q and $p^+$, while minimizing the relevance score between q and each
$p_i^-$. The loss function is:

$$\mathcal{L}_{QP} = -\log \frac{\exp(s(q, p^+))}{\exp(s(q, p^+)) + \sum_{i=1}^{n} \exp(s(q, p_i^-))}$$

The above training paradigm works well on retrieving passages for
random test questions, but does not perform as effectively on MEQ
contrast sets, as discussed in §1 and §4.3. The training loss
$\mathcal{L}_{QP}$ does not provide explicit signals for DPR to learn
the relationships between questions. As a result, the question
embeddings are insensitive to minimal discrepancies, which prevents
the model from identifying the MEQ as a distinct question after seeing
the original question in training. This causes DPR to generate an
overly similar embedding for the MEQ, leading to a high overlap in the
retrieved passages and low contrast consistency.

5.2 Proposed Method

We propose to improve the contrast consistency of DPR by introducing a
query-side contrastive loss to distinguish between paraphrase
questions and MEQs, which are positive and negative question examples
for an original question, respectively. We devise a data augmentation
approach to collect synthetic question examples to train this loss.

Figure 3: Above: the original contrastive training of DPR. Below: our
improved DPR with the query-side contrastive loss, where $q^+$ and
$q^-$ are obtained through data augmentation.

5.2.1 Data Augmentation

For a training question q, its positive example $q^+$ is a synthetic
paraphrase question which is slightly different from q and has the
same answer; its negative question $q^-$ is a synthetic MEQ with a
different answer.

To obtain $q^+$, we leverage back translation provided by the nlpaug5
package. The original question q is translated to another language and
then translated back to produce a new phrasing of q. We used
translation models of 6 languages provided by Ng et al. (2019) and
Tiedemann and Thottingal (2020). Questions that are identical to q
(i.e., edit distance = 0) or classified as “not paraphrase” by the
paraphrase detection model used in §4.1.2 are eliminated. The
remaining questions constitute a candidate set of positive questions
from which a random $q^+$ is sampled in each epoch.

To obtain $q^-$, synthetic MEQs are retrieved from the machine-built
QA corpus PAQ (Lewis et al., 2021). All questions in PAQ that are
similar to q are retrieved by the question retriever in the work of
PAQ. Then, the MEQ requirements specified in §4.1.2 are applied to
filter the retrieved synthetic questions. The remaining questions
constitute a candidate set of negative questions from which a random
$q^-$ is sampled in each epoch.

Apart from learning the relationships among q, $q^+$, and $q^-$, the
loss $\mathcal{L}_{QP}$ can be augmented to learn the relevance
between synthetic questions and their corresponding passages. Because
$q^+$ is a paraphrase question mapping to the passages of q, it does
not have to be involved in $\mathcal{L}_{QP}$. To train on $q^-$, its
positive passage is the Wikipedia passage that was used to generate
the question during the construction of PAQ; its negative passages are
collected from the top-ranked passages retrieved by BM25 which do not
contain the answer.

5https://github.com/makcedward/nlpaug.

5.2.2 Model Training

To provide more supervision signals and prevent overfitting, we
randomly sample $q^+$, $q^-$, and $p^-$ for each training question q
in each epoch. This means that while the original training questions
remain fixed, a different set of augmented questions is used in each
epoch.

For explicit supervision on inter-question relationships, given q, DPR
is trained to assign higher relevance scores to its paraphrase
question ($q^+$) and lower relevance scores to its MEQ ($q^-$). The
relevance score of any pair of questions $(q_1, q_2)$ is calculated as
the inner product of their embeddings:
$s(q_1, q_2) = E_Q(q_1)^\top E_Q(q_2)$. Specifically, we consider
three forms of query-side contrastive loss functions in experiments:

(1) InfoNCE Loss (van den Oord et al., 2018), which differentiates the
positive question from a set of m negative questions. Besides the
synthetic MEQ, which is considered as a hard negative, the other
questions in the same batch are included as random negatives. The loss
function is:

$$\mathcal{L}_{QQ} = -\log \frac{\exp(s(q, q^+))}{\exp(s(q, q^+)) + \sum_{j=1}^{m} \exp(s(q, q_j^-))}$$

(2) Dot Product Loss, which directly penalizes the relevance score
between a sample question q and its augmented MEQ:

$$\mathcal{L}_{QQ} = s(q, q^-)$$

(3) Triplet Loss (Schroff et al., 2015), which trains the model to
assign a higher relevance score to $q^+$ compared to $q^-$, enforced
by a margin α:

$$\mathcal{L}_{QQ} = \max\left(0,\; \alpha - s(q, q^+) + s(q, q^-)\right)$$
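
The three query-side losses can be sketched on plain embedding vectors as follows. This is an illustrative, framework-free version for a single training question: embeddings are ordinary Python lists standing in for question-encoder outputs, the function names are hypothetical, and the log-sum-exp trick in the InfoNCE variant is an added numerical-stability detail rather than something the text describes.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def score(u: Vector, v: Vector) -> float:
    """Relevance score s(q1, q2): inner product of two embeddings."""
    return sum(x * y for x, y in zip(u, v))


def info_nce_loss(q: Vector, q_pos: Vector, q_negs: Sequence[Vector]) -> float:
    """-log( exp(s(q,q+)) / (exp(s(q,q+)) + sum_j exp(s(q,q_j-))) )."""
    pos = score(q, q_pos)
    logits = [pos] + [score(q, neg) for neg in q_negs]
    top = max(logits)  # subtract the max before exponentiating for stability
    log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_denominator - pos


def dot_product_loss(q: Vector, q_neg: Vector) -> float:
    """Directly penalize the score between q and its MEQ."""
    return score(q, q_neg)


def triplet_loss(q: Vector, q_pos: Vector, q_neg: Vector, margin: float) -> float:
    """max(0, margin - s(q, q+) + s(q, q-))."""
    return max(0.0, margin - score(q, q_pos) + score(q, q_neg))
```

For the InfoNCE variant, `q_negs` would hold the synthetic MEQ together with the other questions in the batch; the other two losses use only the synthetic MEQ as the negative.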
The final training loss of our improved DPR is

$$\mathcal{L} = \mathcal{L}_{QP} + \lambda \mathcal{L}_{QQ},$$

where the hyperparameter λ weights the trade-off between the loss terms.

6 Experiments

In experiments, we compare our proposed training method against the original training setting of DPR. After training the models on the NQ training set, we test them on the standard NQ test set as well as the two MEQ contrast sets that we collected in this work.

Model      Augmentation  NQ              MEQ-AmbigQA     MEQ-GPT
                         MR↓     MRR↑    MR↓     MRR↑    MR↓     MRR↑
DPR_BASE   None          2.36    0.784   5.09    0.563   5.44    0.507
DPR_BASE   Random        2.36    0.781   5.09    0.557   5.25    0.524
DPR_BASE   MEQs          2.34    0.783   5.09    0.543   5.10    0.529
DPR_BASE   MEQs + L_QQ   2.25    0.791   4.85    0.569   4.88    0.547
DPR_LARGE  None          2.31    0.780   4.84    0.569   5.46    0.515
DPR_LARGE  Random        2.20    0.797   4.98    0.554   5.18    0.533
DPR_LARGE  MEQs          2.17    0.797   4.79    0.561   5.00    0.544
DPR_LARGE  MEQs + L_QQ   2.14    0.804   4.59    0.592   4.61    0.565

Table 3: Ranking evaluation results. MR and MRR stand for mean rank and mean reciprocal rank, respectively. A lower MR or higher MRR indicates better performance. BM25 is not listed because sampling hard negatives from top-ranked passages in BM25 retrieval lowers the ranking performance of BM25 in return.

6.1 Models

We augment the training set with M = 33k synthetic MEQs and train DPR with both L_QP and L_QQ. We consider the following baselines:

• Vanilla DPR. This is the original training setting of DPR, proposed by Karpukhin et al. (2020). The model is trained only with L_QP on the standard NQ training set.

• DPR with random augmented questions. This model is trained only with L_QP, but we add M random synthetic questions from PAQ to the training set. This is to rule out the effect of simply adding more synthetic data.

• DPR with augmented MEQs. This model uses the same set of M synthetic MEQs retrieved from PAQ as data augmentation, but is trained only with L_QP. We use this variant to test whether L_QQ is necessary in model training.

Additionally, we test the performance of BM25 on retrieval as a reference.
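The two ranking metrics reported in Table 3 can be sketched as follows. This is an illustrative example only: the helper names are hypothetical, and how the candidate passages for each question are chosen in the ranking evaluation is not specified in this section, so the sketch simply assumes a list of model scores with one known gold passage.

```python
from typing import Sequence


def rank_of_gold(scores: Sequence[float], gold_index: int) -> int:
    """1-based rank of the gold passage among the scored candidates.

    Ties are resolved optimistically: only strictly higher scores
    push the gold passage down.
    """
    gold_score = scores[gold_index]
    return 1 + sum(1 for s in scores if s > gold_score)


def mean_rank(ranks: Sequence[int]) -> float:
    """MR: average rank of the gold passage (lower is better)."""
    return sum(ranks) / len(ranks)


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """MRR: average of 1 / rank (higher is better)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Three toy questions whose gold passages are ranked 1st, 2nd, and 4th.
ranks = [1, 2, 4]
print(mean_rank(ranks))
print(mean_reciprocal_rank(ranks))
```

MR is sensitive to a few badly ranked questions, whereas MRR mostly rewards placing the gold passage at the very top, which is presumably why both are reported.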
Recent research has shown that larger retrievers may exhibit better generalization (Ni et al., 2022). Therefore, in addition to the standard DPR, which is built on BERT-Base, we use BERT-Large as the backbone model to see (1) whether MEQ contrast sets are still challenging for larger models and (2) whether our training method is still effective for larger models. We name the smaller and larger models DPR_BASE and DPR_LARGE, respectively.

We use the same set of basic hyperparameters for each DPR model: a learning rate of 10^-5, a batch size of 64 (32 for DPR_LARGE), and 40 training epochs with 5% warmup steps. On ranking evaluation, our best setting uses the InfoNCE loss with λ = 0.5. On retrieval and QA evaluation, our best setting uses the dot product loss with λ = 0.03. Since we do not have a dev set for MEQs,6 we conduct ranking evaluation on MEQ contrast sets in a dev setting, where we select the highest score among all checkpoints. Then we use the checkpoint with the best ranking score to test its retrieval and QA performance. The scores on NQ-test are reported using the best checkpoint on NQ-dev.

6 We empirically found that model performance on NQ-dev is inconsistent with MEQ contrast sets.

Model      L_QQ         AmbigQA         GPT
                        MR↓     MRR↑    MR↓     MRR↑
DPR_BASE   InfoNCE      4.85    0.569   4.88    0.547
DPR_BASE   Dot Product  4.79    0.574   4.98    0.539
DPR_BASE   Triplet      4.80    0.568   4.91    0.542
DPR_LARGE  InfoNCE      4.76    0.572   4.63    0.570
DPR_LARGE  Dot Product  4.59    0.592   4.61    0.565
DPR_LARGE  Triplet      4.61    0.582   4.59    0.573

Table 4: Ranking evaluation with different L_QQ functions on two MEQ contrast sets. All loss functions outperform the baselines in Table 3.

6.2 Results

Experimental results on three datasets (NQ-test, MEQ-AmbigQA, MEQ-GPT) are presented in Tables 3 through 6. We have the following findings:

(1) Our proposed method improves DPR's ability to distinguish MEQs. As shown in Tables 3 and 5, on passage ranking and passage retrieval, the DPR trained with the query-side contrastive loss outperforms the vanilla DPR on both contrast sets, showing improved contrast consistency on MEQs. This improvement is consistent across models of different sizes. For example, on MEQ-GPT, our model improves on the vanilla DPR by 8% and 10% in ranking MRR for the base and large versions, respectively. On the choice of L_QQ, Table 4 demonstrates that all three loss functions improve performance over the baselines, while the optimal setting may require tuning on the specific dataset.

(2) The query-side contrastive loss contributes the most to the improved contrast consistency. Although synthetic MEQs themselves bring more training signals, the model cannot consistently outperform the vanilla DPR without L_QQ. In fact, its performance is sometimes even lower than that of the vanilla DPR. In contrast, after including the query-side contrastive loss, we observe consistent improvements across all datasets, as shown in Tables 3 and 5. For example, on MEQ-AmbigQA, simply adding synthetic MEQs to the training set gives 12% lower recall@1 than the vanilla DPR, while training with L_QQ outperforms the naive augmentation method by 18%.

(3) The improvement does not simply come from the increased amount of training data. There is no significant difference in performance between DPR augmented with random synthetic questions ("Random" in the "Augmentation" column) and the original DPR ("None" in the column) in Tables 3, 5, and 6.
The average improvement from inserting random synthetic questions across all metrics is only 0.2% for DPR_BASE and 1.6% for DPR_LARGE, which indicates that simply adding more synthetic data is not an effective solution.

(4) Improved retrieval performance leads to higher end-to-end QA accuracy. As shown in Table 6, our improved DPR provides more relevant information for answer prediction on MEQs. Even using only 1 retrieved passage, our improved DPR_LARGE outperforms its vanilla version by 12% and 11% on the two contrast sets, respectively.

(5) Our method does not sacrifice performance on standard test questions. After joint training with the query-side contrastive loss and augmentation with synthetic MEQs, our model still maintains its competitive performance on the standard NQ test set. Specifically, it outperforms all baselines in ranking evaluation (see Table 3), while performing on par with the best baseline in retrieval and QA scores (see Tables 5 and 6).

Summary: The results are consistent across ranking, retrieval, and end-to-end QA experiments, which supports the robustness of the above findings. Nevertheless, the performance of DPR still has considerable room for improvement, and this gap is observed in both the base and large versions of the model. Notably, DPR models perform significantly worse on MEQ contrast sets than on the standard test set, even though their checkpoints are selected in a dev setting on the contrast sets. This suggests that further research is still necessary to improve the contrast consistency of retrieval models on MEQs.

6.3 Analysis

Passage Overlap. One of the indications that DPR lacks the ability to distinguish the original question from its MEQ is the high overlap between the passages retrieved for each. Figure 4 illustrates that both synthetic data augmentation and the query-side contrastive loss can reduce passage overlap. The synthetic MEQ augmentation helps to train the question embeddings of MEQs closer to their positive passages.
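As a rough illustration of the passage-overlap measurement, the sketch below computes the fraction of shared passages between the top-k lists retrieved for a question and for its MEQ. The exact definition used for the overlap figure is not given in this section, so the normalization by k, the function names, and the assumption that passage ids are unique within each list are all hypothetical.

```python
from typing import Hashable, Sequence, Tuple


def top_k_overlap(
    retrieved_for_q: Sequence[Hashable],
    retrieved_for_meq: Sequence[Hashable],
    k: int = 5,
) -> float:
    """Fraction of the top-k passages shared by a question and its MEQ.

    Assumes each list is sorted by relevance and holds unique passage ids.
    """
    top_q = set(retrieved_for_q[:k])
    top_meq = set(retrieved_for_meq[:k])
    return len(top_q & top_meq) / k


def mean_top_k_overlap(
    pairs: Sequence[Tuple[Sequence[Hashable], Sequence[Hashable]]],
    k: int = 5,
) -> float:
    """Average top-k overlap over (question, MEQ) retrieval result pairs."""
    return sum(top_k_overlap(a, b, k) for a, b in pairs) / len(pairs)


# Toy retrieval results: two of the five passages are shared.
for_q = ["p1", "p2", "p3", "p4", "p5"]
for_meq = ["p3", "p9", "p1", "p7", "p8"]
print(top_k_overlap(for_q, for_meq, k=5))
```

A retriever that embeds a question and its MEQ almost identically would return nearly the same passages for both, giving an overlap close to 1 under this definition.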
Moreover, the query-side contrastive loss explicitly trains the model to tell the original question and its MEQ apart. Nevertheless, a lower passage overlap does not always indicate better performance. For instance, our model with the dot product loss does not have the lowest passage overlap, but performs the best in retrieval evaluation.

Model      Augmentation  NQ                   MEQ-AmbigQA          MEQ-GPT
                         R@1    R@5    R@20   R@1    R@5    R@20   R@1    R@5    R@20
BM25       None          23.2   45.3   64.5   16.8   34.2   48.8   21.1   42.7   61.7
DPR_BASE   None          46.6   70.0   81.2   28.5   50.0   65.6   31.5   57.3   73.2
DPR_BASE   Random        48.2   71.2   81.6   27.5   49.2   65.8   31.8   58.0   73.8
DPR_BASE   MEQs          46.4   69.9   81.3   25.2   46.2   62.7   31.5   55.9   72.3
DPR_BASE   MEQs + L_QQ   48.1   70.8   81.9   29.5   52.3   66.4   32.8   58.7   74.4
DPR_LARGE  None          46.0   67.6   80.3   26.8   49.2   64.0   29.2   54.9   70.7
DPR_LARGE  Random        49.0   70.9   81.5   26.2   48.4   64.1   31.3   56.7   72.3
DPR_LARGE  MEQs          48.0   70.5   81.4   27.7   47.2   61.5   31.2   57.0   71.8
DPR_LARGE  MEQs + L_QQ   51.0   71.2   81.6   30.1   52.3   65.4   32.5   58.4   73.1

Table 5: Retrieval evaluation results. R@k stands for Recall@k.

Model      Augmentation  NQ                   MEQ-AmbigQA          MEQ-GPT
                         1P     5P     20P    1P     5P     20P    1P     5P     20P
BM25       None          16.4   28.4   37.3   10.9   15.2   18.1   13.3   20.5   25.8
DPR_BASE   None          32.6   43.2   49.1   14.0   19.7   21.9   17.6   25.8   29.3
DPR_BASE   Random        33.7   44.8   49.4   14.7   19.1   22.4   16.8   25.4   29.5
DPR_BASE   MEQs          32.0   43.4   48.7   13.5   19.3   23.1   17.1   25.5   29.5
DPR_BASE   MEQs + L_QQ   34.4   44.7   49.2   16.6   21.8   22.8   19.5   26.7   31.1
DPR_LARGE  None          31.4   42.2   47.9   14.3   19.2   21.4   16.1   24.6   29.1
DPR_LARGE  Random        33.7   44.6   49.3   13.4   20.4   21.5   17.3   25.5   29.4
DPR_LARGE  MEQs          33.0   44.7   48.7   15.7   19.3   21.7   17.4   25.0   29.1
DPR_LARGE  MEQs + L_QQ   33.7   44.6   49.3   16.1   22.1   23.0   19.4   27.6   31.6

Table 6: End-to-end QA results (Exact Match). 1P, 5P, and 20P are the number of passages read by the FiD reader.

Figure 4: Overlap in top-5 retrieved passages between the original training question and its MEQ.

Figure 5: The ratio of successful MEQ identifications of different models on contrast sets, with paraphrase questions as distractors.

Identification of Inter-question Relationships. To further analyze model behavior after the query-side contrastive training, we test the models' ability to distinguish inter-question relationships. A model is considered successful in identifying the MEQ if the generated embedding of the original question is closer to its paraphrase question than to its MEQ. The paraphrase questions are separately generated using InstructGPT to avoid conflict with those used in data augmentation. As shown in Figure 5, training with the query-side contrastive loss leads to an improved ability to distinguish between paraphrase questions and different questions, which indicates that our models are better at identifying inter-question relationships. The model trained with the InfoNCE loss has the highest success rate in identifying inter-question relationships, because it receives more training signals, from a positive example and a set of negative examples, than models trained with the other types of loss.

7 Conclusion

In this study, we addressed the gap in research on contrast consistency in OpenQA by collecting MEQs as challenging contrast sets for the popular NQ benchmark. Our findings reveal that DPR lacks contrast consistency on our contrast sets. To address this limitation, we introduced a query-side contrastive loss with the aid of data augmentation, which improved its ability to recognize inter-question relationships.
Overall, our findings and data can pave the way for further exploring the role of contrast consistency in developing robust and effective OpenQA systems.

Acknowledgments

This work was supported in part by NSF IIS-2119531, IIS-2137396, IIS-2142827, CCF-1901059, and ONR N00014-22-1-2507. Wenhao Yu is also supported in part by the Bloomberg Data Science PhD Fellowship. We would like to thank the anonymous reviewers and the action editor for their valuable suggestions on this paper.

References

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017. https://doi.org/10.18653/v1/P17-1171

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019. https://doi.org/10.18653/v1/N19-1423

Martin Fajcik, Martin Docekal, Karel Ondrej, and Pavel Smrz. 2021. R2-D2: A modular baseline for open-domain question answering. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 854–870, Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.findings-emnlp.73

Luyu Gao and Jamie Callan. 2021. Condenser: A pre-training architecture for dense retrieval. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, pages 981–993, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.75

Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hannaneh Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou. 2020. Evaluating models' local decision boundaries via contrast sets. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1307–1323, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.117

Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2021. Towards unsupervised dense information retrieval with contrastive learning. ArXiv preprint, 2112.09118. https://doi.org/10.48550/arXiv.2112.09118

Gautier Izacard and Edouard Grave. 2021. Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021. https://doi.org/10.18653/v1/2021.eacl-main.74

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick S. H. Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, pages 6769–6781, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.550

Divyansh Kaushik, Eduard H. Hovy, and Zachary Chase Lipton. 2020. Learning the difference that makes a difference with counterfactually-augmented data. In 8th International Conference on Learning Representations, ICLR 2020.

Akhil Kedia, Mohd Abbas Zaidi, and Haejun Lee. 2022. FiE: Building a global probability space by leveraging early fusion in encoder for open-domain question answering. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022.

Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2020, pages 39–48. Association for Computational Linguistics. https://doi.org/10.1145/3397271.3401075

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur P. Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466. https://doi.org/10.1162/tacl_a_00276

Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020.

Patrick S. H. Lewis, Yuxiang Wu, Linqing Liu, Pasquale Minervini, Heinrich Küttler, Aleksandra Piktus, Pontus Stenetorp, and Sebastian Riedel. 2021. PAQ: 65 million probably-asked questions and what you can do with them. Transactions of the Association for Computational Linguistics, 9:1098–1115. https://doi.org/10.1162/tacl_a_00415

Daliang Li, Ankit Singh Rawat, Manzil Zaheer, Xin Wang, Michal Lukasik, Andreas Veit, Felix X. Yu, and Sanjiv Kumar. 2022. Large language models with controllable working memory. ArXiv preprint, 2211.05110. https://doi.org/10.48550/arXiv.2211.05110

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv preprint, 1907.11692. https://doi.org/10.48550/arXiv.1907.11692

Shayne Longpre, Kartik Perisetla, Anthony Chen, Nikhil Ramesh, Chris DuBois, and Sameer Singh. 2021. Entity-based knowledge conflicts in question answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, pages 7052–7063, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.565

Sewon Min, Julian Michael, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2020. AmbigQA: Answering ambiguous open-domain questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, pages 5783–5797, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.466

Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, and Sergey Edunov. 2019. Facebook FAIR's WMT19 news translation task submission. In Proceedings of the Fourth Conference on Machine Translation, WMT 2019, pages 314–319, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-5333

Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernández Abrego, Ji Ma, Vincent Y. Zhao, Yi Luan, Keith B. Hall, Ming-Wei Chang, and Yinfei Yang. 2022. Large dual encoders are generalizable retrievers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, pages 9844–9855. Association for Computational Linguistics.

Aäron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. ArXiv preprint, 1807.03748. https://doi.org/10.48550/arXiv.1807.03748

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. ArXiv preprint, 2203.02155. https://doi.org/10.48550/arXiv.2203.02155

Bhargavi Paranjape, Matthew Lamm, and Ian Tenney. 2022. Retrieval-guided counterfactual generation for QA. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, pages 1670–1686, Dublin, Ireland. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.117
Jae Sung Park, Sheng Shen, Ali Farhadi, Trevor Darrell, Yejin Choi, and Anna Rohrbach. 2022. Exposing the limits of video-text models through contrast sets. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, pages 3574–3586, Seattle, United States. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.naacl-main.261

Stephen E. Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval, 3(4):333–389. https://doi.org/10.1561/1500000019

Alexis Ross, Tongshuang Wu, Hao Peng, Matthew E. Peters, and Matt Gardner. 2022. Tailor: Generating and perturbing text with semantic controls. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, pages 3194–3213, Dublin, Ireland. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.228

Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, pages 815–823, Boston, MA. https://doi.org/10.1109/CVPR.2015.7298682

Jörg Tiedemann and Santhosh Thottingal. 2020. OPUS-MT - Building open translation services for the world. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, EAMT 2020, pages 479–480, Lisboa, Portugal. European Association for Machine Translation.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In 7th International Conference on Learning Representations, ICLR 2019, pages 353–355, Brussels, Belgium. Association for Computational Linguistics. OpenReview.net. https://doi.org/10.18653/v1/W18-5446

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. ArXiv preprint, 2203.11171. https://doi.org/10.48550/arXiv.2203.11171

Tongshuang Wu, Marco Túlio Ribeiro, Jeffrey Heer, and Daniel S. Weld. 2021. Polyjuice: Generating counterfactuals for explaining, evaluating, and improving models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, pages 6707–6723, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.523

Xi Ye, Rohan Nair, and Greg Durrett. 2021. Connecting attributions and QA model behavior on realistic counterfactuals. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, pages 5496–5512, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.447

Wenhao Yu, Dan Iter, Shuohang Wang, Yichong Xu, Mingxuan Ju, Soumya Sanyal, Chenguang Zhu, Michael Zeng, and Meng Jiang. 2023. Generate rather than retrieve: Large language models are strong context generators. In 11th International Conference on Learning Representations, ICLR 2023.

Fengbin Zhu, Wenqiang Lei, Chao Wang, Jianming Zheng, Soujanya Poria, and Tat-Seng Chua. 2021. Retrieving and reading: A comprehensive survey on open-domain question answering. ArXiv preprint, 2101.00774. https://doi.org/10.48550/arXiv.2101.00774