On the Robustness of Dialogue History Representation in
Conversational Question Answering: A Comprehensive Study
and a New Prompt-based Method
Zorik GekhmanT∗ Nadav OvedT ∗ Orgad KellerG Idan SzpektorG Roi ReichartT
T Technion – Israel Institute of Technology, Israel   G Google Research, Israel
{zorik@campus.|nadavo@campus.|roiri@}technion.ac.il
{orgad|szpektor}@google.com

Transactions of the Association for Computational Linguistics, vol. 11, pp. 351–366, 2023. https://doi.org/10.1162/tacl_a_00549
Action Editor: Preslav I. Nakov. Submission batch: 7/2022; Revision batch: 11/2022; Published 4/2023.
© 2023 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Abstract
Most work on modeling the conversation his-
tory in Conversational Question Answering
(CQA) reports a single main result on a com-
mon CQA benchmark. While existing models
show impressive results on CQA leaderboards,
it remains unclear whether they are robust to
shifts in setting (sometimes to more realis-
tic ones), training data size (e.g., from large
to small sets) and domain. In this work, we
design and conduct the first large-scale robust-
ness study of history modeling approaches for
CQA. We find that high benchmark scores
do not necessarily translate to strong robust-
ness, and that various methods can perform
extremely differently under different settings.
Equipped with the insights from our study, we
design a novel prompt-based history modeling
approach and demonstrate its strong robust-
ness across various settings. Our approach is
inspired by existing methods that highlight
historic answers in the passage. However, in-
stead of highlighting by modifying the passage
token embeddings, we add textual prompts
directly in the passage text. Our approach
is simple, easy to plug into practically any
model, and highly effective, thus we recom-
mend it as a starting point for future model
developers. We also hope that our study and
insights will raise awareness to the importance
of robustness-focused evaluation, in addition
to obtaining high leaderboard scores, leading
to better CQA systems.1
1 Introduction
Conversational Question Answering (CQA) in-
volves a dialogue between a user who asks
questions and an agent that answers them based
on a given document. CQA is an extension of the
∗Authors contributed equally to this work.
1Our code and data are available at: https://github
.com/zorikg/MarCQAp.
351
traditional single-turn QA task (Rajpurkar et al.,
2016), with the major difference being the pres-
ence of the conversation history, which requires
effective history modeling (Gupta et al., 2020).
Previous work demonstrated that the straightfor-
ward approach of concatenating the conversation
turns to the input is lacking (Qu et al., 2019a),
leading to various proposals of architecture com-
ponents that explicitly model the conversation
history (Choi et al., 2018; Huang et al., 2019;
Yeh and Chen, 2019; Qu et al., 2019a,b; Chen
et al., 2020; Kim et al., 2021). However, there is
no single agreed-upon setting for evaluating the
effectiveness of such methods, with the majority
of prior work reporting a single main result on
a CQA benchmark, such as CoQA (Reddy et al.,
2019) or QuAC (Choi et al., 2018).
While recent CQA models show impressive re-
sults on these benchmarks, such a single-score
evaluation scheme overlooks aspects that can be
essential in real-world use-cases. First, QuAC and
CoQA contain large annotated training sets, which
makes it unclear whether existing methods can re-
main effective in small-data settings, where the
annotation budget is limited. Moreover, the eval-
uation is done in-domain, ignoring the model’s
robustness to domain shifts, with target domains
that may even be unknown at model training
time. Furthermore, the models are trained and
evaluated using a ‘‘clean’’ conversation history
between 2 humans, while in reality the history
can be ‘‘noisy’’ and less fluent, due to the in-
correct answers by the model (Li et al., 2022).
Finally, these benchmarks mix the impact of ad-
vances in pre-trained language models (LMs) and
conversation history modeling effectiveness.
In this work, we investigate the robustness
of history modeling approaches in CQA. We ask
whether high performance on existing benchmarks
also indicates strong robustness. To address this
question, we carry out the first large-scale robust-
ness study using 6 common modeling approaches.
We design 5 robustness-focused evaluation set-
tings, which we curate based on 4 existing CQA
datasets. Our settings are designed to evaluate
efficiency in low-data scenarios, the ability to
scale in a high-resource setting, as well as robust-
ness to domain-shift and to noisy conversation
history. We then perform a comprehensive ro-
bustness study, where we evaluate the considered
methods in our settings.

                  Training   In-Domain Evaluation   Out-Of-Domain Evaluation
Data source       QuAC       QuAC     QuAC-NH       CoQA                                                             DoQA
Domain            –          –        –             children stories  literature  mid-high school  news   wikipedia  cooking  movies  travel
# Examples        83,568     7,354    10,515        1,425             1,630       1,653            1,649  1,626      1,797    1,884   1,713
# Conversations   11,567     1,000    1,204         100               100         100              100    100        400      400     400

Table 1: Datasets statistics.
We focus exclusively on history modeling, as it
is considered the most significant aspect of CQA
(Gupta et al., 2020), differentiating it from the
classic single-turn QA task. To better reflect the
contribution of the history modeling component,
we adapt the existing evaluation metric. First, to
avoid differences which stem from the use of dif-
ferent pre-trained LMs, we fix the underlying LM
for all the evaluated methods, re-implementing
all of them. Second, instead of focusing on final
scores on a benchmark, we focus on each model’s
improvement (Δ%) compared to a baseline QA
model that has no access to the conversation
history.
Our results show that history modeling meth-
ods perform very differently in different settings,
and that approaches that achieve high benchmark
scores are not necessarily robust under low-data
and domain-shift settings. Moreover, we notice
that approaches that highlight historic answers
within the document by modifying the document
embeddings achieve the top benchmark scores,
but their performance is surprisingly lacking in
low-data and domain-shift settings. We hypothe-
size that history highlighting yields high-quality
representations, but since the existing highlighting
methods add dedicated embedding parameters,
specifically designed to highlight the document’s
tokens, they are prone to over-fitting.
These findings motivate us to search for
an alternative history modeling approach with
improved robustness across different settings.
Following latest trends w.r.t. prompting in NLP
(Liu et al., 2021) we design MarCQAp, a novel
prompt-based approach for history modeling,
which adds textual prompts within the grounding
document in order to highlight previous answers
from the conversation history. While our approach
is inspired by the embedding-based highlighting
methods, it is not only simpler, but it also shows
superior robustness compared to other evaluated
methods. As MarCQAp is prompt-based, it can
be easily combined with any architecture, allow-
ing to fine-tune any model with a QA architecture
for the CQA task with minimal effort. Thus, we
hope that it will be adopted by the community as
a useful starting point, owing to its simplicity, as
well as high effectiveness and robustness. We also
hope that our study and insights will encourage
more robustness-focused evaluations, in addition
to obtaining high leaderboard scores, leading to
better CQA systems.
2 Preliminaries
2.1 CQA Task Definition and Notations
Given a text passage P , the current question qk
and a conversation history Hk in a form of a
sequence of previous questions and answers Hk =
(q1, a1, . . . , qk−1, ak−1), a CQA model predicts the
answer ak based on P as a knowledge source. 这
answers can be either spans within the passage P
(extractive) or free-form text (abstractive).
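To make this notation concrete, a single CQA example can be represented as follows (an illustrative sketch; the field names are ours):

    from dataclasses import dataclass

    # A minimal sketch of one CQA example under the notation above.
    @dataclass
    class CQAExample:
        passage: str                    # P, the knowledge source
        history: list[tuple[str, str]]  # Hk = ((q1, a1), ..., (q_{k-1}, a_{k-1}))
        question: str                   # qk, the current question

    # A CQA model maps a CQAExample to the answer ak: either a span of
    # `passage` (extractive) or free-form text (abstractive).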
2.2 CQA Datasets
Full datasets statistics are presented in Table 1.
QuAC (Choi et al., 2018) and CoQA (Reddy
等人。, 2019) are the two leading CQA datasets,
with different properties. In QuAC, the questions
are more exploratory and open-ended with longer
answers that are more likely to be followed up.
This makes QuAC more challenging and realistic.
We follow the common practice in recent work
(Qu et al., 2019a,b; Kim et al., 2021; Li et al.,
2022), focusing on QuAC as our main dataset,
using its training set for training and its vali-
dation set for in-domain evaluation (the test set
is hidden, reserved for a leaderboard challenge).
We use CoQA for additional pre-training or for
domain-shift evaluation.
DoQA (Campos et al., 2020) is another CQA
dataset with dialogues from the Stack Exchange
online forum. Due to its relatively small size, it
is typically used for testing transfer and zero-shot
learning. We use it for domain-shift evaluation.
QuAC Noisy-History (QuAC-NH) is based on
a dataset of human-machine conversations col-
lected by Li et al. (2022), using 100 passages
from the QuAC validation set. While Li et al. used
it for human evaluation, we use it for automatic
evaluation, leveraging the fact that the answers
are labeled for correctness, which allows us to use
the correct answers as labels.
In existing CQA datasets, each conversa-
tion (q1, a1, .., qm, am) and the corresponding
passage P are used to create m examples
{Ek}m k=1 = {(P, Hk, qk)}m k=1, where Hk =
(q1, a1, . . . qk−1, ak−1). ak is then used as a la-
bel for Ek. Since QuAC-NH contains incorrect
answers, if ak is incorrect we discard Ek to
avoid corrupting the evaluation set with incor-
rectly labeled examples. We also filtered out
invalid questions (Li et al., 2022) and answers
that did not appear in P .2
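The example-creation and filtering procedure just described can be sketched as follows (illustrative Python; the dictionary fields are our own naming):

    # Build evaluation examples from one conversation. For QuAC-NH, turns
    # whose answer is incorrect remain in the history, but are discarded as
    # labeled examples to avoid corrupting the evaluation set.
    def build_examples(passage, turns):
        # turns: [(q1, a1, is_correct1), ..., (qm, am, is_correctm)]
        examples = []
        for k, (q_k, a_k, is_correct) in enumerate(turns):
            if not is_correct:
                continue
            examples.append({
                "passage": passage,
                "history": [(q, a) for q, a, _ in turns[:k]],  # Hk
                "question": q_k,
                "label": a_k,
            })
        return examples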
2.3 CQA Related Work
Conversation History Modeling is the major chal-
lenge in CQA (Gupta et al., 2020). Early work
used recurrent neural networks (RNNs) and vari-
ants of attention mechanisms (Reddy et al., 2019;
Choi et al., 2018; Zhu et al., 2018). Another
trend was to use flow-based approaches, which
generate a latent representation for the tokens in
Hk, using tokens from P (Huang et al., 2019;
Yeh and Chen, 2019; Chen et al., 2020). Modern
approaches, which are the focus of our work, lever-
age Transformer-based (Vaswani et al., 2017)
pre-trained language models.
The simplest approach to model the history with
pre-trained LMs is to concatenate Hk with qk and
P (Choi et al., 2018; Zhao et al., 2021). Alter-
native approaches rewrite qk based on Hk and
use the rewritten questions instead of Hk and qk
(Vakulenko et al., 2021), or as an additional train-
ing signal (Kim et al., 2021). Another fundamental
approach is to highlight historic answers within
P by modifying the passage's token embeddings
(Qu et al., 2019a,b). Qu et al. also introduced
a component that performs dynamic history se-
lection after each turn is encoded. However, in our
corresponding baseline we utilize only the his-
toric answer highlighting mechanism, owing to its
simplicity and high effectiveness. A contempora-
neous work proposed a global history attention
component, designed to capture long-distance
dependencies between conversation turns (Qian
et al., 2022).3

2Even though Li et al. only used extractive models, a
small portion of the answers did not appear in the passage.
3Published 2 weeks before our submission.

Setting         Pre-trained LM Size   Training               Evaluation
Standard        Base                  QuAC                   QuAC
High-Resource   Large                 CoQA + QuAC            QuAC
Low-Resource    Base                  QuAC smaller samples   QuAC
Domain-Shift    Base                  QuAC                   CoQA + DoQA
Noisy-History   Base                  QuAC                   QuAC-NH

Table 2: Summary of our proposed settings.
3 History Modeling Study
In this work, we examine the effect of a model's
history representation on its robustness. To this
end, we evaluate different approaches under sev-
eral settings that diverge from the standard
supervised benchmark (§3.1). This allows us to
examine whether the performance of some meth-
ods deteriorates more quickly than others in
different scenarios. To better isolate the gains
from history modeling, we measure performance
compared to a baseline QA model which has
no access to Hk (§3.2), and re-implement all
the considered methods using the same under-
lying pre-trained language model (LM) for text
表示 (§3.3).
3.1 Robustness Study Settings
We next describe each comparative setting in our
study and the rationale behind it, as summarized
in Table 2. Table 1 depicts the utilized datasets.
Standard. Defined by Choi et al. (2018), this
setting is followed by most studies. We use a
medium-sized pre-trained LM for each method,
commonly known as its base version, and then
fine-tune and evaluate the models on QuAC.
High-Resource. This setting examines the ex-
tent to which methods can improve their
performance when given more resources. To this
end, we use a large pre-trained LM, perform ad-
ditional pre-training on CoQA (with the CQA
objective), and then fine-tune and evaluate on
QuAC.
Low-Resource. In this setting, we examine the
resource efficiency of the history modeling ap-
proaches by reducing the size of the training set.
This setting is similar to the standard setting,
except that we fine-tune on smaller samples of
QuAC’s training set. For each evaluated method
we train 4 model variants: 20%, 10%, 5%, and 1%,
reflecting the percentage of training data retained.
Domain-Shift. This setting examines robust-
ness to domain shift. To this end, we use the
8 domains in the CoQA and DoQA datasets as
test sets from unseen target domains, evaluating
the models trained under the standard setting on
these test-sets.
Noisy-History. This setting examines robust-
ness to noisy conversation history, where the
answers are sometimes incorrect and the conver-
sation flow is less fluent. To this end, we evaluate
the models trained under the standard setting on
the QuAC-NH dataset, consisting of conversations
between humans and other CQA models (§2.2).
We note that a full human-machine evaluation
requires a human in the loop. We choose to eval-
uate against other models' predictions as a middle
ground. This allows us to test the models' behav-
ior on noisy conversations with incorrect answers
and less fluent flow, but without a human in the
loop.
3.2 Evaluation Metric
The standard CQA evaluation metric is the average
word-level F1 score (Rajpurkar et al., 2016; Choi
et al., 2018; Reddy et al., 2019; Campos et al.,
2020).4 Since we focus on the impact of history
modeling, we propose to consider each model's
improvement in F1 (Δ%) compared to a baseline
QA model that has no access to the dialogue
history.

4We follow the calculation presented in Choi et al. (2018).
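Concretely, the Δ% values reported in this paper correspond to the relative F1 improvement over this baseline (consistent with the numbers in Tables 4–6):

    \Delta\% = 100 \cdot \frac{\mathrm{F1}_{\mathrm{method}} - \mathrm{F1}_{\mathrm{NO\ HISTORY}}}{\mathrm{F1}_{\mathrm{NO\ HISTORY}}}

For example, CONCAT's 65.8 F1 against NO HISTORY's 60.4 F1 in the standard setting yields Δ% = +8.9% (Table 4).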
3.3 Pre-trained LM
To control for differences which stem from the
use of different pre-trained LMs, we re-implement
all the considered methods using the Longformer
(Beltagy et al., 2020), a sparse-attention Trans-
former designed to process long input sequences.
It is therefore a good fit for handling the con-
versation history and the source passage as a
combined (long) input. Prior work usually utilized
dense-attention Transformers, whose input length
limitation forced them to truncate Hk and split
P into chunks, processing them separately and
combining the results (Choi et al., 2018; Qu et al.,
2019a,b; Kim et al., 2021; Zhao et al., 2021). This
introduces additional complexity and diversity in
implementation, while with the Longformer
we can keep implementation simple, as this model
can attend to the entire history and passage.
We would also like to highlight RoR (Zhao
et al., 2021), which enhances a dense-attention
Transformer to better handle long sequences.
Notably, the state-of-the-art result on QuAC
was reported using ELECTRA+RoR with simple
history concatenation (see CONCAT in §3.4). While
this suggests that ELECTRA+RoR can outper-
form the Longformer, since our primary focus
is on analyzing the robustness of different his-
tory modeling techniques rather than on long
sequence modeling, we opt for a general-purpose
commonly used LM for long sequences, which
exhibits competitive performance.
3.4 Evaluated Methods
In our study we choose to focus on modern history
modeling approaches that
leverage pre-trained
LMs. These models have demonstrated significant
progress in recent years (§2.3).
NO HISTORY A classic single-turn QA model
without access to Hk. We trained a Longformer
for QA (Beltagy et al., 2020), using qk and P as a
single packed input sequence (ignoring Hk). The
model then extracts the answer span by predicting
its start and end positions within P .
In contrast to the rest of the evaluated methods,
we do not consider this method as a baseline for
history modeling, but rather as a reference for
calculating our Δ% metric. As discussed in §3.2,
we evaluate all history modeling methods for their
ability to improve over this model.
CONCAT Concatenating Hk to the input (i.e.,
to qk and P ), which is (arguably) the most
straightforward way to model the history (Choi
et al., 2018; Qu et al., 2019a; Zhao et al.,
2021). Other than the change to the input, the
model architecture and training is identical to NO
HISTORY.
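For illustration, the input packing for NO HISTORY and CONCAT might look as follows (a sketch; the separator handling is an assumption, as real code would use the tokenizer's special tokens):

    # Sketch of input packing for NO HISTORY vs. CONCAT.
    SEP = " </s> "  # assumed separator; in practice, the tokenizer's sep token

    def no_history_input(q_k: str, passage: str) -> str:
        return q_k + SEP + passage  # the model never sees Hk

    def concat_input(history: list, q_k: str, passage: str) -> str:
        turns = SEP.join(q + SEP + a for q, a in history)  # q1 a1 ... q_{k-1} a_{k-1}
        return turns + SEP + q_k + SEP + passage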
REWRITE This approach was proposed in
Vakulenko et al. (2021). It consists of a pipeline
of two models, question rewriting (QR) and
question answering (QA). An external QR model
first generates a rewritten question ˜qk, based on
qk and Hk. ˜qk and P are then used as input to
a standard QA model, identical to NO HISTORY,
but trained with the rewritten questions. For the
external QR model we follow Lin et al. (2020),
Vakulenko et al. (2021), and Kim et al. (2021)
and fine-tune T5-base (Raffel et al., 2020) on the
CANARD dataset (Elgohary et al., 2019). We use
the same QR model across all the settings in our
study (§3.1), meaning that in the low-resource
setting we limit only the CQA data, which is used
to train the QA model.
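Schematically, the pipeline can be sketched as follows (illustrative only; `qr_model` and `qa_model` are placeholders for the fine-tuned T5 rewriter and the Longformer QA model):

    # Sketch of the REWRITE pipeline: rewrite qk given Hk, then answer the
    # self-contained question with a standard single-turn QA model.
    def rewrite_pipeline(qr_model, qa_model, passage, history, q_k):
        q_rewritten = qr_model.rewrite(history, q_k)  # T5 fine-tuned on CANARD
        return qa_model.answer(q_rewritten, passage)  # same model as NO HISTORY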
REWRITEC Hypothesizing that there is useful in-
formation in Hk on top of the rewritten question
˜qk, we combine REWRITE and CONCAT, obtaining
a model which is similar to CONCAT, except that
it replaces qk with ˜qk.
ExCorDLF Our implementation of the ExCorD
approach, proposed in Kim et al. (2021). Instead
of rewriting the original question, qk, at inference
time (REWRITE), ExCorD uses the rewritten ques-
tion only at training time as a regularization signal
when encoding the original question.
HAELF Our implementation of the HAE ap-
proach proposed in Qu et al. (2019a), which
highlights the conversation history within P . In-
stead of concatenating Hk to the input, HAE
highlights the historic answers {ai}k−1 i=1 within
P , by modifying the passage token embeddings.
HAE adds an additional dedicated embedding
layer with 2 learned embedding vectors, denoting
whether a token from P appears in any historic
answers or not.
PosHAELF Our implementation of the PosHAE
approach proposed in Qu et al. (2019b), which ex-
tends HAE by adding positional information. The
embedding matrix is extended to contain a vector
per conversation turn, each vector representing the
turn that the corresponding token appeared in.
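The embedding mechanism can be sketched as follows (an assumption-level re-implementation, not the original authors' code):

    import torch
    import torch.nn as nn

    # Sketch of HAE/PosHAE-style history answer embeddings: learned vectors
    # added to the passage token embeddings. HAE uses 2 vectors (in/out of a
    # historic answer); PosHAE uses one vector per conversation turn.
    class HistoryAnswerEmbedding(nn.Module):
        def __init__(self, hidden_size: int, num_turns: int = 1):
            super().__init__()
            self.embed = nn.Embedding(num_turns + 1, hidden_size)

        def forward(self, token_embeds: torch.Tensor, turn_ids: torch.Tensor):
            # turn_ids[i] = 0 if token i appears in no historic answer,
            # otherwise the turn index of the answer containing it.
            return token_embeds + self.embed(turn_ids)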
3.5 Implementation Details
We fine-tune all models on QuAC for 10 epochs,
employ an accumulated batch size of 640, a weight
decay of 0.01, and a learning rate of 3 · 10−5. In the
high-resource setup, we also pre-train on CoQA
for 5 epochs. We use a maximum output length
of 64 tokens. Following Beltagy et al. (2020), we
set Longformer's global attention to all the tokens
of qk. We use the cross-entropy loss and AdamW
optimizer (Kingma and Ba, 2015; Loshchilov and
Hutter, 2019). Our implementation makes use
of the HuggingFace Transformers (Wolf et al.,
2020), and PyTorch-Lightning libraries.5

            Original Work              Original LM   Original Result   Our Impl.
CONCAT      Qu et al. (2019a)          BERT          62.0              65.8
REWRITE     Vakulenko et al. (2021)    BERT          Not Reported      64.6
REWRITEC    N/A (this baseline was first proposed in this work)        67.3
ExCorD      Kim et al. (2021)          RoBERTa       67.7              67.5
HAE         Qu et al. (2019a)          BERT          63.9              68.9
PosHAE      Qu et al. (2019b)          BERT          64.7              69.8

Table 3: F1 scores comparison between original
implementations and ours (using Longformer as
the LM), for all methods described in §3.4, in the
standard setting.
For the base LM (used in all settings
except high-resource) we found that a Long-
former that was further pre-trained on SQuADv2
(Rajpurkar et al., 2018),6 achieved consistently
better performance than the base Longformer.
Hence, we adopted it as our base LM. For the large
LM (used in the high-resource setting) we used
Longformer-large.7
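For concreteness, the training setup described above can be summarized as follows (an illustrative sketch; our actual code uses HuggingFace Transformers with PyTorch-Lightning):

    # Fine-tuning configuration, collected from the description above.
    train_config = {
        "base_lm": "mrm8488/longformer-base-4096-finetuned-squadv2",  # footnote 6
        "large_lm": "allenai/longformer-large-4096",                  # footnote 7
        "epochs": 10,                   # plus 5 epochs of CoQA pre-training
                                        # in the high-resource setting
        "accumulated_batch_size": 640,
        "learning_rate": 3e-5,
        "weight_decay": 0.01,
        "optimizer": "AdamW",
        "loss": "cross_entropy",
        "max_output_length": 64,        # tokens
        "global_attention": "all tokens of qk",
    }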
In §5, we introduce a novel method (MarCQAp)
and perform statistical significance tests (Dror
et al., 2018, 2020). Following Qu et al. (2019b),
we use the Student’s paired t-test with p < 0.05,
to compare MarCQAp to all other methods in each
setting.
In our re-implementation of the evaluated meth-
ods, we carefully followed their descriptions and
implementation details as published by the authors
in their corresponding papers and codebases. A
key difference in our implementation is the use
of a long sequence Transformer, which removes
the need to truncate Hk and split P into chunks
(§3.3). This simplifies our implementation and
avoids differences between methods.8 Table 3
compares between our results and those reported
in previous works. In almost all cases we achieved
a higher score (probably since Longformer out-
performs BERT), with the exception of ExCorD,
where we achieved a comparable score (proba-
bly since Longformer is actually initialized using
RoBERTa's weights [Beltagy et al., 2020]).

5https://github.com/PyTorchLightning/pytorch-lightning.
6https://huggingface.co/mrm8488/longformer-base-4096-finetuned-squadv2.
7https://huggingface.co/allenai/longformer-large-4096.
8The maximum length limit of Hk varies between dif-
ferent works, as well as how sub-document chunks are
handled.

Setting             Low-Resource                                                    Standard            High-Resource
LM                  Longformer-base, Pre-trained SQuAD                              Longformer-base,    Longformer-large,
                                                                                    Pre-trained SQuAD   Pre-trained CoQA
Training set size   800 (1%)        4K (5%)         8K (10%)        16K (20%)       80K (100%)          80K (100%)          Avg Δ%

NO HISTORY          45.0            50.0            52.9            55.4            60.4                65.6                –
CONCAT              43.9 (-2.4%)    51.2 (+2.4%)    53.4 (+0.9%)    57.8 (+4.3%)    65.8 (+8.9%)        72.3 (+10.2%)       +1.3%
REWRITE             46.5 (+3.3%)    54.0 (+8.0%)    56.4 (+6.6%)    59.2 (+6.9%)    64.6 (+7.0%)        69.0 (+5.2%)        +6.2%
REWRITEC            42.3 (-6.0%)    54.4 (+8.8%)    57.2 (+8.1%)    60.6 (+9.4%)    67.3 (+11.4%)       72.5 (+10.5%)       +5.1%
ExCorDLF            46.0 (+2.2%)    53.0 (+6.0%)    57.2 (+8.1%)    60.3 (+8.8%)    67.5 (+11.8%)       73.8 (+12.3%)       +6.3%
HAELF               44.5 (-1.1%)    50.8 (+1.6%)    55.0 (+4.0%)    59.8 (+7.9%)    69.0 (+14.2%)       73.2 (+11.4%)       +3.1%
PosHAELF            40.5 (-10.0%)   51.0 (+2.0%)    55.1 (+4.2%)    60.9 (+9.9%)    69.8 (+15.6%)       74.2 (+12.9%)       +1.5%
MarCQAp (§5)        48.2 (+7.1%)    57.4 (+14.8%)   61.3 (+15.9%)   64.6 (+16.6%)   70.2 (+16.2%)       74.7 (+13.7%)       +13.6%

Table 4: In-domain F1 and Δ% scores on the full QuAC validation set, for the standard, high-resource
and low-resource settings. We color coded the Δ% for positive and negative numbers.
4 Results and Analysis
We next discuss the takeaways from our study,
where we evaluated the considered methods across
the proposed settings. Table 4 presents the results
of the standard, high-resource, and low-resource
settings. Table 5 further presents the domain-shift
results. Finally, Table 6 depicts the results of the
noisy-history setting. Each method is compared
to NO HISTORY by calculating the Δ% (§3.2). The
tables also present the results of our method,
termed MarCQAp, which is discussed in §5.
We further analyze the effect of the conversa-
tion history length in Figure 1, evaluating models
from the standard setting with different limits on
the history length. For instance, when the limit is
2, we expose the model to up to the 2 most recent
turns, by truncating Hk.9

9We exclude REWRITE, since it utilizes Hk only in the
form of the rewritten question. For REWRITEC, we truncate
the concatenated Hk for the CQA model, while the QR model
remains exposed to the entire history.
Key Findings A key goal of our study is to
examine the robustness of history modeling ap-
proaches to setting shifts. This research reveals
limitations of the single-score benchmark-based
evaluation adopted in previous works (§4.1), as
such scores are shown to be only weakly correlated
with low-resource and domain-shift robustness.
Furthermore, keeping in mind that history mod-
eling is a key aspect of CQA, our study also
demonstrates the importance of isolating the con-
tribution of the history modeling method from
other model components (§4.2). Finally, we dis-
cover that while existing history highlighting
approaches yield high-quality input representa-
tions, their robustness is surprisingly poor. We
further analyze the history highlighting results
and provide possible explanations for this phe-
nomenon (§4.3). This finding is the key motivation
for our proposed method (§5).
4.1 High CQA Benchmark Scores do not
Indicate Good Robustness
First, we observe some expected general trends:
All methods improve on top of NO HISTORY, as
demonstrated by the positive Δ% in the standard
setting, showing that all the methods can leverage
information from Hk. All methods scale with more
training data and a larger model (high-resource),
and their performances drop significantly when
the training data size is reduced (low-resource)
or when they are presented with noisy history. A
performance drop is also observed when evaluat-
ing on domain-shift, as expected in the zero shot
setting.
However, not all methods scale equally well
and some deteriorate faster than others. This
phenomenon is illustrated in Table 7, where the
methods are ranked by their scores in each set-
ting. We observe high instability between settings.
For instance, PosHAELF is top performing in
3 settings but is second worst in 2 others.
REWRITE is second best in low-resource, but among
the last ones in other settings. So is the case
with CONCAT: Second best in domain-shift but
among the worst ones in others. In addition, while
all the methods improve when they are exposed
to longer histories (Figure 1), some saturate earlier
than others.
Domain-Shift     NO HIST.   CONCAT          REWRITE         REWRITEC        ExCorDLF        HAELF           PosHAELF        MarCQAp (§5)
CoQA
 Children Sto.   54.8       62.2 (+13.5%)   60.1 (+9.7%)    62.7 (+14.4%)   62.7 (+14.4%)   61.8 (+12.8%)   56.6 (+3.3%)    66.7 (+21.7%)
 Literature      42.6       48.0 (+12.7%)   47.7 (+12.0%)   49.0 (+15.0%)   51.5 (+20.9%)   50.5 (+18.5%)   47.4 (+11.3%)   56.4 (+32.4%)
 M/H Sch.        50.3       55.3 (+9.9%)    55.0 (+9.3%)    56.7 (+12.7%)   58.2 (+15.7%)   56.6 (+12.5%)   55.4 (+10.1%)   61.8 (+22.9%)
 News            50.1       54.9 (+9.6%)    54.8 (+9.4%)    55.2 (+10.2%)   57.0 (+13.8%)   55.4 (+10.6%)   52.7 (+5.2%)    60.8 (+21.4%)
 Wikipedia       58.2       59.9 (+2.9%)    60.9 (+4.6%)    59.4 (+2.1%)    63.6 (+9.3%)    60.9 (+4.6%)    61.7 (+6.0%)    67.5 (+16.0%)
DoQA
 Cooking         46.9       54.8 (+16.8%)   44.6 (-4.9%)    52.0 (+10.9%)   53.7 (+14.5%)   45.0 (-4.1%)    45.6 (-2.8%)    53.3 (+13.6%)
 Movies          45.0       52.0 (+15.6%)   43.2 (-4.0%)    49.1 (+9.1%)    51.1 (+13.6%)   45.1 (+0.2%)    45.8 (+1.8%)    51.8 (+15.1%)
 Travel          44.0       48.4 (+10.0%)   40.9 (-7.0%)    46.4 (+5.5%)    48.6 (+10.5%)   45.1 (+2.5%)    44.7 (+1.6%)    50.1 (+13.9%)
Avg Δ%           –          +11.4%          +3.6%           +10.0%          +14.1%          +7.2%           +4.6%           +19.6%

Table 5: F1 and Δ% scores for the domain-shift setting. We color coded the Δ% for positive and
negative numbers.
Setting          NO HISTORY   CONCAT          REWRITE         REWRITEC        ExCorDLF        HAELF           PosHAELF        MarCQAp (§5)
Noisy-History    49.9         55.3 (+10.8%)   56.0 (+12.2%)   58.5 (+17.2%)   56.8 (+13.8%)   57.9 (+16.0%)   60.1 (+20.4%)   62.3 (+24.9%)

Table 6: F1 and Δ% scores for the noisy-history
setting.
We conclude that the winner does not take it
all: There are significant instabilities in meth-
ods’ performance across settings. This reveals
the limitations of the existing single-score bench-
mark evaluation practice, and calls for more
comprehensive robustness-focused evaluation.
4.2 The Contribution of the History
Modeling Method should be Isolated
In the high-resource setting, NO HISTORY reaches
65.6 F1, higher than many CQA results reported
in previous work (Choi et al., 2018; Qu et al.,
2019a,b; Huang et al., 2019). Since it is clearly
ignoring the history, this shows that significant
improvements can stem from simply using a better
LM. Thus comparing between history modeling
methods that use different LMs can be misleading.
This is further illustrated with HAELF ’s and
PosHAELF ’s results. The score that Kim et al.
reported for ExCorD is higher than Qu et al.
reported for HAE and PosHAE. While both au-
thors used a setting equivalent to our standard
setting, Kim et al. used RoBERTa while Qu
et al. used BERT, as their underlying LM. It is
therefore unclear whether ExCorD’s higher score
stems from better history representation or from
choosing to use RoBERTa. In our study, HAELF
and PosHAELF actually outperform ExCorDLF
in the standard setting. This suggests that these
methods can perform better than reported, and
demonstrates the importance of controlling for the
choice of LM when comparing between history
modeling methods.

Table 7: Per setting rankings of the methods evalu-
ated in our study (top is best), excluding MarCQAp.
C is CONCAT, R is REWRITE, RC is REWRITEC, Ex is
ExCorDLF, H is HAELF, and PH is PosHAELF.
As can be seen in Figure 1, CONCAT sat-
urates at 6 turns, which is interesting since
Qu et al. (2019a) reported saturation at 1
turn in a BERT-based equivalent. Furthermore,
Qu et al. observed a performance degradation
with more turns, while we observe stability. These
differences probably stem from the history trunca-
tion in BERT, due to the input length limitation of
dense attention Transformers. This demonstrates
the advantages of sparse attention Transformers
for history modeling evaluation, since the com-
parison against CONCAT can be more ‘‘fair’’. This
comparison is important, since the usefulness of
any method should be established by comparing it
to the straightforward solution, which is CONCAT
in case of history modeling.
We would also like to highlight PosHAELF ’s
F1 scores in the noisy-history (60.1) and the
20% low-resource setting (60.9), both lower than
the 69.8 F1 in the standard setting. Do these
performance drops reflect
lower effectiveness
in modeling the conversation history? Here the
Δ% comes to the rescue. While the Δ% de-
creased between the standard and the 20% settings
(15.6 → 9.9),
it actually increased in the
noisy-history setting (to 20.4). This indicates
that even though the F1 decreased, the ability
to leverage the history actually increased.

Figure 1: F1 as a function of # history turns, for models
from the standard setup. The first occurrence of the
maximum F1 value (saturation point) is highlighted.

Figure 2: Δ% as a function of # training examples.
Results taken from the standard and low-resource
settings.
We conclude that our study results support the
design choices we made, in our effort to better iso-
late the contribution of the history representation.
We recommend future works to compare history
modeling methods using the same LM (prefer-
ably a long sequence LM), and to measure a Δ%
compared to a NO HISTORY baseline.
4.3 History Highlighting is Effective in
Resource-rich Setups, but is not Robust
The most interesting results are observed for the
history highlighting methods: HAE and PosHAE.
First, when implemented using the Longformer,
HAELF and PosHAELF perform better than re-
ported in previous work, with 68.9 and 69.8 F1
respectively, compared to 63.9 and 64.7 reported
by Qu et al. using BERT. The gap between HAELF
and PosHAELF demonstrates the effect of the po-
sitional information in PosHAELF . This effect is
further observed in Figure 1: HAELF saturates
earlier since it cannot distinguish between dif-
ferent conversation turns, which probably yields
conflicting information. PosHAELF saturates at
9 turns,
later than the rest of the methods,
which indicates that it can better leverage long
conversations.
PosHAELF outperforms all methods in the stan-
dard, high-resource, and noisy-history settings,10
demonstrating the high effectiveness of history
highlighting. However, it shows surprisingly poor
performance in low-resource and domain-shift set-
tings, with extremely low average Δ% compared
to other methods. The impact of the training set
size is further illustrated in Figure 2. We plot the
Δ% as a function of the training set size, and
specifically highlight PosHAELF in bold red. Its
performance deteriorates significantly faster than
others when the training set size is reduced. In
the 1% setting it is actually the worst performing
method.

10We ignore MarCQAp's results in this section.
This poor robustness could be caused by the
additional parameters added in the embedding
layer of PosHAELF . Figure 2 demonstrates that
properly training these parameters, in order to
benefit from this method’s full potential, seems
to require large amounts of data. Furthermore,
the poor domain-shift performance indicates that,
even with enough training data, this embedding
layer seems to be prone to overfitting to the source
domain.
We conclude that history highlighting clearly
yields a very strong representation, but the addi-
tional parameters of the embedding layer seem to
require large amounts of data to train properly and
over-fit to the source domain. Is there a way to
highlight historic answers in the passage, without
adding dedicated embedding layers?
In §5 we present MarCQAp, a novel history
modeling approach that is inspired by PosHAE,
adopting the idea of history highlighting. How-
ever, instead of modifying the passage embedding,
we highlight historic answers by adding textual
prompts directly in the input text. By leveraging
prompts, we reduce model complexity and remove
the need for training dedicated parameters, hoping
to mitigate the robustness weaknesses of PosHAE.
5 MarCQAp
Motivated by our findings, we design MarCQAp,
a novel prompt-based history modeling approach
that highlights answers from previous conversa-
tion turns by inserting textual prompts in their
respective positions within P . By highlighting
with prompts instead of embedding vectors, we
hope to encode valuable dialogue information,
while reducing the learning complexity incurred
by the existing embedding-based methods. Thus,
we expect MarCQAp to perform well not only in
high-resource settings, but also in low-resource
and domain adaptation settings, in which prompt-
ing methods have shown to be particularly useful
(Brown et al., 2020; Le Scao and Rush, 2021;
Ben-David et al., 2022).
Prompting often refers to the practice of
adding phrases to the input,
in order to en-
courage pre-trained LMs to perform specific
tasks (Liu et al., 2021), yet it is also used as
a method for injecting task-specific guidance
during fine-tuning (Le Scao and Rush, 2021;
Ben-David et al., 2022). MarCQAp closely re-
sembles the prompting approach from Ben-David
et al. (2022) since our prompts are: (1) discrete
(i.e., the prompt is an actual text-string), (2) dy-
namic (i.e., example-based), and (3) added to the
input text and the model then makes predictions
conditioned on the modified input. Moreover, as
in Ben-David et al., in our method the underlying
LM is further trained on the downstream task with
prompts. However, in contrast to most prompting
approaches, which predefine the prompt’s loca-
tion in the input (Liu et al., 2021), our prompts are
inserted in different locations for each example. In
addition, while most textual prompting approaches
leverage prompts comprised of natural language,
our prompts contain non-verbal symbols (e.g.,
"<1>", see Figure 3 and §5.1), which were proven
useful for supervision of NLP tasks. For example,
Aghajanyan et al. (2022) showed the usefulness
of structured pre-training by adding HTML sym-
bols to the input text. Finally, to the best of our
knowledge, this work is the first to propose a
prompting mechanism for the CQA task.
5.1 Method
MarCQAp utilizes a standard single-turn QA
model architecture and input, with the input com-
prising the current question qk and the passage P .
For each CQA example (P, Hk, qk), MarCQAp
inserts a textual prompt within P , based on in-
formation extracted from the conversation history
Hk. In extractive QA, the answer ak is typically a
span within P . Given the input (P, Hk, qk), Mar-
CQAp transforms P into an answer-highlighted
passage P̂k, by constructing a prompt pk and in-
serting it within P . pk is constructed by locating
the beginning and end positions of all historic
answers {ai}k−1 i=1 within P , and inserting a unique
textual marker for each answer in its respective
positions (see example in Figure 3). The input
(P̂k, qk) is then passed to the QA model, instead
of (P, qk).

Figure 3: The MarCQAp highlighting scheme: Answers
to previous questions are highlighted in the grounding
document, which is then provided as input to the model.
In abstractive QA, a free form answer is gen-
erated based on an evidence span that is first
extracted from P . 因此, the final answer does
not necessarily appear in P . To support this set-
ting, MarCQAp highlights the historical evidence
spans (which appear in P ) instead of the generated
answers.
To encode positional dialogue information, the
markers for aj ∈ {ai}k−1 i=1 include its turn index
number in reverse order, that is, k − 1 − j. This
encodes relative historic positioning w.r.t. the cur-
rent question qk, allowing the model to distinguish
between the historic answers by their recency.
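The full highlighting procedure can be sketched as follows (a minimal illustration; the exact marker rendering follows Figure 3, and the `<i>`/`</i>` pair used here is an assumption):

    # Sketch of MarCQAp prompt insertion. `answers` holds the historic
    # answer (or evidence) spans, oldest first; each is (text, start, end)
    # with character offsets into the passage, or ("NO ANSWER", -1, -1).
    def marcqap_passage(passage: str, answers: list) -> str:
        if any(a[0] == "NO ANSWER" for a in answers) and "NO ANSWER" not in passage:
            passage += " NO ANSWER"  # appended target span for unanswerables
        k_minus_1 = len(answers)
        inserts = []
        for j, (text, start, end) in enumerate(answers):
            idx = k_minus_1 - j  # reverse turn index: most recent answer -> 1
            if text == "NO ANSWER":
                start = passage.rfind("NO ANSWER")
                end = start + len("NO ANSWER")
            inserts.append((start, f"<{idx}>"))
            inserts.append((end, f"</{idx}>"))
        # Insert right-to-left so earlier offsets stay valid; overlapping
        # historic answers are represented naturally by interleaved markers.
        for pos, marker in sorted(inserts, key=lambda t: t[0], reverse=True):
            passage = passage[:pos] + marker + passage[pos:]
        return passage

With two history turns, for instance, the most recent answer ends up wrapped as <1>...</1> and the earlier one as <2>...</2>, so the model can distinguish them by recency.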
MarCQAp highlights only the historic answers,
since the corresponding questions do not appear
in P . While this might lead to information loss,
in §5.3 we implement MarCQAp’s variants that
add the historic questions to the input, and show
that the contribution of the historic questions to
the performance is minor.11

11Which is also in line with the findings in Qu et al.
(2019b).
A CQA dialogue may also contain unanswer-
able questions. Before inserting the prompts,
MarCQAp first appends a 'NO ANSWER' string
to P .12 Each historical 'NO ANSWER' is then
highlighted with prompts, similarly to ordinary
historical answers. For example see a4 in Figure 3.
MarCQAp has several advantages over prior
methods. First, since it is prompt-based, it does
not modify the model architecture, which makes
it easier to port across various models, alleviat-
ing the need for model-specific implementation
and training procedures. Moreover, it naturally
represents overlapping answers in P , which is
a limitation in prior work (Qu et al., 2019a,b).
Overlapping answers contain tokens which relate
to multiple turns, yet the existing token-based em-
bedding methods encode the relation of a token
from P only to a single turn from Hk. Since
MarCQAp is span-based, it naturally represents
overlapping historic answers (e.g., see a2 and a3
in Figure 3).
5.2 MarCQAp Evaluation
We evaluate MarCQAp in all our proposed
experimental settings (§3.1). As presented in
Tables 4, 5, and 6, it outperforms all other methods
in all settings. In the standard, high-resource,
and noisy-history settings,
its performance is
very close to PosHAELF ,13 indicating that our
prompt-based approach is an effective alterna-
tive implementation for the idea of highlighting
historical answers. Similarly to PosHAELF , Mar-
CQAp is able to handle long conversations and its
performance gains saturate at 9 turns (Figure 1).
However, in contrast to PosHAELF , MarCQAp
performs especially well in the low-resource and
the domain-shift settings.
In the low-resource settings, MarCQAp out-
performs all methods by a large margin, with
an average Δ% of 13.6% compared to the best
baseline with 6.3%. The dramatic improvement
over PosHAELF ’s average Δ% (1.5% → 13.6%)
serves as a strong indication that our prompt-based
approach is much more robust. This boost in ro-
bustness is best illustrated in Figure 2, which
presents the Δ% as a function of the training set
size, highlighting PosHAELF (red) and MarCQAp
(green) specifically. An example of MarCQAp's
robustness in the low-resource setting is provided
in Figure 4.

Figure 4: An example of MarCQAp's robustness in
the low-resource setting. Even though ExCorDLF,
HAELF, and PosHAELF predict correct answers in the
standard setting, they fail on the same example when
the training data size is reduced to 10%. MarCQAp
predicts a correct answer in both settings.
In the domain-shift settings, MarCQAp is
the best performing method in 6 out of 8 do-
mains.14 On the remaining two domains (Cooking
& Movies), CONCAT is the best performing.15
Notably, MarCQAp's average Δ% (19.6%) is
substantially higher compared to the next best
method (14.1%). These results serve as additional
strong evidence of MarCQAp’s robustness.
MarCQAp’s Performance Using Different
LMs In addition to Longformer, we evaluated
MarCQAp using RoBERTa (Liu et al., 2019)
and BigBird (Zaheer et al., 2020) in the stan-
dard setting. The results are presented in Table 8.
MarCQAp shows a consistent positive effect
across different LMs, which further highlights
its effectiveness.
12Only if it is not already appended to P , in some datasets
the passages are always suffixed with 'NO ANSWER'.
13In the standard and high-resource MarCQAp's improve-
ments over PosHAELF are not statistically significant.
14For the Travel domain MarCQAp's improvement over
ExCorDLF is not statistically significant.
15The differences between CONCAT and MarCQAp for both
domains are not statistically significant.
Model              No History   MarCQAp   Δ%
RoBERTa            57.7         68.0      (+17.9%)
BigBird            57.6         66.3      (+15.1%)
Longformer-base    60.0         68.4      (+14.0%)
Longformer-SQuAD   60.4         70.2      (+16.2%)

Table 8: MarCQAp's standard setting per-
formance across different Transformer-based
pre-trained LMs.
BiDAF++ w/ 2-Context (Choi et al., 2018)   60.1
HAE (Qu et al., 2019a)                     62.4
FlowQA (Huang et al., 2019)                64.1
GraphFlow (Chen et al., 2020)              64.9
HAM (Qu et al., 2019b)                     65.4
FlowDelta (Yeh and Chen, 2019)             65.5
GHR (Qian et al., 2022)                    73.7
RoR (Zhao et al., 2021)                    74.9
MarCQAp (Ours)                             74.0

Table 9: Results from the official QuAC leader-
board, presenting F1 scores for the hidden test set,
for MarCQAp and other models with published
papers.
We note that since RoBERTa is a dense-
attention Transformer with input length limita-
tion of 512 代币, longer passages are split into
chunks. This may lead to some chunks containing
part of the historic answers, and therefore partial
highlighting by MarCQAp. Our analysis showed
that 51% of all examples in QuAC were split
into several chunks, and 61% of the resulting chunks
contained partial highlighting. MarCQAp’s strong
performance with RoBERTa suggests that it can
remain effective even with partial highlighting.
Official QuAC Leaderboard Results For com-
pleteness, we submitted our best performing
model (from the high-resource setting) to the
official QuAC leaderboard,16 evaluating its per-
formance on the hidden test set. Table 9 presents
the results.17 MarCQAp achieves a very competi-
tive score of 74.0 F1, very close to the published
state-of-the-art (RoR by Zhao et al. [2021] with
74.9 F1), yet with a much simpler model.18
5.3 Prompt Design
Recall that MarCQAp inserts prompts at the begin-
ning and end positions for each historical answer
within P (Figure 3). The prompts are designed
with predefined marker symbols and include the
answer's turn index (e.g., "<1>"). This design
builds on 3 main assumptions: (1) textual prompts
can represent conversation history information,
(2) the positioning of the prompts within P facil-
itates highlighting of historical answers, and (3)
indexing the historical answers encodes valuable
information. We validate our design assumptions
by comparing MarCQAp against ablated variants
(Table 10).
16https://quac.ai.
17The leaderboard contains additional results for mod-
els which (at the time of writing) include no descriptions
or published papers, rendering them unsuitable for fair
comparison.
18See §3.3 for a discussion of RoR.
To validate assumption (1), we compare Mar-
CQAp to MARCQAPC, a variant which adds Hk
to the input, in addition to P̂k and qk. MARC-
QAPC is exposed to information from Hk via two
sources: The concatenated Hk and the MarCQAp
prompt within P̂k. We observe a negligible ef-
fect,19 suggesting that MarCQAp indeed encodes
information from the conversation history, since
providing Hk does not add useful information on
top of P̂k.
To validate assumptions (2) and (3), we use
two additional MarCQAp’s variants. Answer Pos
inserts a constant predefined symbol ("<>"), at
each answer’s beginning and end positions within
P (i.e., similar to MarCQAp, but without turn
indexing). Random Pos inserts the same number
of symbols but in random positions within P .
Answer Pos achieves a Δ% of 12.7%, while
Random Pos achieves 1.7%. This demonstrates
that the positioning of the prompts within P is cru-
cial, and that most of MarCQAp’s performance
gains stem from its prompts positioning w.r.t.
historical answers {ai}k−1 i=1 . When the prompts
are inserted at meaningful positions, the model
seems to learn to leverage these positions in
order to derive an effective history representa-
tion. Surprisingly, Random Pos leads to a minor
improvement of 1.7%.20 Finally, MarCQAp's im-
provement over Answer Pos (a Δ% of 15.9%
compared to 12.7%), indicates that answer in-
dexing encodes valuable information, helping us
validate assumption (3).
19The difference is not statistically significant.
20The difference is statistically significant; we did not
further investigate the reasons behind this particular result.

NO HISTORY            52.9
Random Pos            53.8 (+1.7%)
Answer Pos            59.6 (+12.7%)
Full Q                59.2 (+11.9%)
Word from Q           60.4 (+14.2%)
Word from Q + Index   60.7 (+14.8%)
MARCQAPC              61.5 (+16.3%)
MarCQAp               61.3 (+15.9%)

Table 10: F1 and Δ% scores for MarCQAp's
ablated variants, in the 10% setup of the
low-resource setting.

Finally, since textual prompts allow for easy
injection of additional information, we make
several initial attempts in this direction, inject-
ing different types of information into our textual
prompts. In Word from Q, the marker contains
the first word from the historic answer's corre-
sponding question, which is typically a wh-word.
In Word from Q + Index, we also add the historic
answer's turn index. In Full Q, we insert the full
historic question into the prompt. Word from Q
and Word from Q + Index achieved comparable
scores, lower than MarCQAp’s but higher than
Answer Pos’s.21 This suggests that adding se-
mantic information is useful (since Word from Q
outperformed Answer Pos), and that combining
such information with the positional information
is not trivial (since MarCQAp outperformed Word
from Q + Index). This points to the effects of the
prompt structure and the information included:
We see that ''<1>'' and the first-word marker outper-
form ''<>'', yet constructing a prompt by naively
combining these signals does not
lead to a complementary effect. Finally, Word from
Q outperformed Full Q. We hypothesize that
since the full question can be long, it might sub-
stantially interfere with the natural structure of
the passage text. This provides evidence that the
prompts should probably remain compact symbols
with small footprint within the passage. These ini-
tial results call for further exploration of optimal
prompt design in future work.

21Both differences are statistically significant.
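To summarize the variants, the marker shapes can be illustrated as below (the concrete strings for the Q-based variants are hypothetical, as the paper's exact examples did not survive extraction; only "<1>" and "<>" are attested above):

    # Illustrative marker strings per ablated variant (Q-based formats are
    # hypothetical). "<1>" encodes the reversed turn index; "<>" is constant.
    variant_markers = {
        "MarCQAp":             "<1>",       # turn index only
        "Answer Pos":          "<>",        # constant symbol at answer positions
        "Random Pos":          "<>",        # constant symbol at random positions
        "Word from Q":         "<When>",    # hypothetical: first word of question
        "Word from Q + Index": "<When 1>",  # hypothetical: word + turn index
        "Full Q":              "<When did he release the album?>",  # hypothetical
    }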
5.4 Case Study
Figure 5 presents an example of all evaluated
methods in action from the standard setting. The
current question ''Did he have any other crit-
ics?'' has two correct answers: Alan Dershowitz
or Omer Bartov. We first note that all methods
predicted a name of a person, which indicates that
the main subject of the question was captured cor-
rectly. However, the methods differ in their prediction
of the specific person.

Figure 5: Our case study example, comparing answers
predicted by each evaluated method in the standard
setting. We provide a detailed analysis in §5.4.
REWRITE and CONCAT predict a correct answer
(Alan Dershowitz), yet CONCAT predicts it based on
incorrect evidence. This may indicate that CONCAT
did not capture the context correctly (just the fact
that it needs to predict a person's name), and was
lucky enough to guess the correct name.
Interestingly, REWRITEC predicts Daniel Gold-
hagen, which is different
from the answers
predicted by CONCAT and REWRITE. This shows that
combining both methods can yield completely dif-
ferent results, and demonstrates an instance where
REWRITEC performs worse than REWRITE and CON-
CAT (for instance in the 1% low-resource setting).
This is also an example of a history modeling flaw,
since Daniel Goldhagen was already mentioned
as a critic in previous conversation turns.
This example also demonstrates how errors
can propagate through a pipeline-based system.
The gold rewritten question is ‘‘Did Norman
Finkelstein have any other critics aside from
Peter Novick and Daniel Goldhagen?'',22 while
the question rewriting model generated ‘‘Besides
Peter Novick, did Norman Finkelstein have any
other critics?'', omitting Daniel Goldhagen. This
makes it impossible for REWRITE to figure out
that Daniel Goldhagen was already mentioned,
making it a legitimate answer. This reveals that
REWRITE might have also gotten lucky and provides
a possible explanation for the incorrect answer
predicted by REWRITEC.
ExCorDLF , HAELF , and PosHAELF not only
predict a wrong answer, but also seem to fail
to resolve the conversational coreferences, since
the pronoun ‘‘he’’, in the current question ‘‘Did
he have any other critics?’’, refers to Norman
Finkelstein.
MarCQAp predicts a correct answer, Omer
Bartov. This demonstrates an instance where Mar-
CQAp succeeds while HAELF and PosHAELF
fail, even though they are all history-highlighting
methods. Interestingly, MarCQAp is the only
model that predicts Omer Bartov, a non-trivial
choice compared to Alan Dershowitz, since Omer
Bartov appears later in the passage, further away
from the historic answers.
6 Limitations
This work focuses on a single-document CQA
setting, which is in line with the majority of the
previous work on conversation history model-
ing in CQA (§2.3). Accordingly, MarCQAp
was designed for single-document CQA. Apply-
ing MarCQAp in multi-document settings (Qu
et al., 2020; Anantha et al., 2021; Adlakha et al.,
2022) may result in partial history representation,
since the retrieved document may contain only
part of the historic answers, therefore MarCQAp
will only highlight the answers which appear in
the document.23
In §5.3 we showed initial evidence that Mar-
CQAp prompts can encode additional informa-
tion that can be useful for CQA. In this work we
focused on the core idea behind prompt-based an-
swer highlighting, as a proposed solution in light
of our results in §4. However, we did not conduct a com-
prehensive exploration in search of the optimal
prompt design, and leave this for future work.

22As annotated in CANARD (Elgohary et al., 2019).
23We note that this limitation applies to all highlighting
methods, including HAE and PosHAE (Qu et al., 2019a,b).
7 Conclusion

In this work, we carry out the first compre-
hensive robustness study of history modeling
approaches for Conversational Question Answer-
ing (CQA), including sensitivity to model and
training data size, domain shift, and noisy history
inputs. We revealed limitations of the existing
benchmark-based evaluation, by demonstrating
that it cannot reflect the models’ robustness to
such changes in setting. Furthermore, we propose
evaluation practices that better isolate the contri-
bution of the history modeling component, and
demonstrated their usefulness.
We also discovered that highlighting historic
answers via passage embedding is very effective
in standard setups, but it suffers from substantial
performance degradation in low data and domain
shift settings. Following this finding, we design
a novel prompt-based history highlighting ap-
proach. We show that highlighting with prompts,
rather than with embeddings, significantly im-
proves robustness, while maintaining overall high
表现.
Our approach can be a good starting point for
future work, due to its high effectiveness, robust-
ness, and portability. We also hope that the insights
from our study will encourage evaluations with
focus on robustness, leading to better CQA systems.
Acknowledgments
We would like to thank the action editor and
the reviewers, as well as the members of the
IE@Technion NLP group and Roee Aharoni for
their valuable feedback and advice. The Technion
team was supported by the Zuckerman Fund to the
Technion Artificial Intelligence Hub (Tech.AI).
This research was also supported in part by a
grant from Google.
References
Vaibhav Adlakha, Shehzaad Dhuliawala, Kaheer
Suleman, Harm de Vries, and Siva Reddy.
2022. Topiocqa: Open-domain conversational
question answering with topic switching. Trans-
actions of the Association for Computational
Linguistics, 10:468–483. https://doi.org/10.1162/tacl_a_00471
Armen Aghajanyan, Dmytro Okhonko, Mike
Lewis, Mandar Joshi, Hu Xu, Gargi Ghosh,
and Luke Zettlemoyer. 2022. HTLM: Hyper-
text pre-training and prompting of language
models. In International Conference on Learn-
ing Representations.
Raviteja Anantha, Svitlana Vakulenko, Zhucheng
Tu, Shayne Longpre, Stephen Pulman, and
Srinivas Chappidi. 2021. Open-domain ques-
tion answering goes conversational via ques-
tion rewriting. In Proceedings of the 2021
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, NAACL-HLT
2021, Online, June 6–11, 2021, pages 520–534.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2021.naacl-main.44
Iz Beltagy, Matthew E. Peters, and Arman
Cohan. 2020. Longformer: The long-document
transformer. CoRR, abs/2004.05150.
Eyal Ben-David, Nadav Oved, and Roi Reichart.
2022. PADA: Example-based prompt learning
for on-the-fly adaptation to unseen domains.
Transactions of the Association for Computatio-
nal Linguistics, 10:414–433. https://doi.org/10.1162/tacl_a_00468
Tom Brown, Benjamin Mann, Nick Ryder,
Melanie Subbiah, Jared D. Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam,
Girish Sastry, Amanda Askell, Sandhini
Agarwal, Ariel Herbert-Voss, Gretchen
Krueger, Tom Henighan, Rewon Child, Aditya
Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens
Winter, Chris Hesse, Mark Chen, Eric
Sigler, Mateusz Litwin, Scott Gray, Benjamin
Chess, Jack Clark, Christopher Berner, Sam
McCandlish, Alec Radford, Ilya Sutskever,
and Dario Amodei. 2020. Language models
are few-shot learners. In Advances in Neural
Information Processing Systems, volume 33,
pages 1877–1901. Curran Associates, Inc.
Jon Ander Campos, Arantxa Otegi, Aitor Soroa,
Jan Deriu, Mark Cieliebak, and Eneko Agirre.
2020. Doqa – accessing domain-specific faqs
via conversational QA. In Proceedings of the
58th Annual Meeting of the Association for
Computational Linguistics, ACL 2020, Online,
July 5–10, 2020, pages 7302–7314. Association
for Computational Linguistics.
Yu Chen, Lingfei Wu, and Mohammed J. Zaki.
2020. GraphFlow: Exploiting conversation flow
with graph neural networks for conversational
machine comprehension. In Proceedings of
the Twenty-Ninth International Joint Conference
on Artificial Intelligence, IJCAI 2020,
pages 1230–1236. ijcai.org. https://doi
.org/10.24963/ijcai.2020/171
Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar,
Wen-tau Yih, Yejin Choi, Percy Liang, 和
Luke Zettlemoyer. 2018. QuAC: Question answering
in context. In Proceedings of the 2018 Conference
on Empirical Methods in Natural Language Processing,
Brussels, Belgium, October 31 – November 4, 2018,
pages 2174–2184. Association for Computational
Linguistics. https://doi.org/10.18653/v1/D18-1241
Rotem Dror, Gili Baumer, Segev Shlomov, 和
Roi Reichart. 2018. The hitchhiker’s guide
to testing statistical significance in natural
language processing. In Proceedings of the
56th Annual Meeting of the Association for
Computational Linguistics, ACL 2018, Melbourne,
Australia, July 15–20, 2018, Volume 1:
Long Papers, pages 1383–1392. Association
for Computational Linguistics. https://doi
.org/10.18653/v1/P18-1128
Rotem Dror, Lotem Peled-Cohen, Segev
Shlomov, and Roi Reichart. 2020. Statistical
Significance Testing for Natural Language
Processing. Synthesis Lectures on Human
Language Technologies. Morgan & Claypool
Publishers. https://doi.org/10.1007
/978-3-031-02174-9
Ahmed Elgohary, Denis Peskov, and Jordan L.
Boyd-Graber. 2019. Can you unpack that?
Learning to rewrite questions-in-context.
In Proceedings of the 2019 Conference on
Empirical Methods in Natural Language
Processing and the 9th International Joint
Conference on Natural Language Processing,
EMNLP-IJCNLP 2019, Hong Kong, China,
November 3–7, 2019, pages 5917–5923.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/D19-1605
Somil Gupta, Bhanu Pratap Singh Rawat,
and Hong Yu. 2020. Conversational machine
comprehension: A literature review. In
Proceedings of the 28th International Conference
on Computational Linguistics, COLING 2020,
Barcelona, Spain (Online), December 8–13, 2020,
pages 2739–2753. International Committee on
Computational Linguistics. https://doi.org/10.18653/v1
/2020.coling-main.247
Hsin-Yuan Huang, Eunsol Choi, and Wen-tau
Yih. 2019. FlowQA: Grasping flow in history
for conversational machine comprehension. In
7th International Conference on Learning Representations,
ICLR 2019, New Orleans, LA, USA, May 6–9, 2019.
OpenReview.net.
Gangwoo Kim, Hyunjae Kim, Jungsoo Park, 和
Jaewoo Kang. 2021. Learn to resolve conver-
sational dependency: A consistency training
framework for conversational question an-
swering. In Proceedings of the 59th Annual
Meeting of the Association for Computational
Linguistics and the 11th International Joint
Conference on Natural Language Processing,
ACL/IJCNLP 2021 (Volume 1: Long Papers),
Virtual Event, August 1–6, 2021,
pages 6130–6141. Association for Computational
Linguistics. https://doi.org/10.18653/v1
/2021.acl-long.478
Diederik P. Kingma and Jimmy Ba. 2015. Adam:
A method for stochastic optimization. In 3rd
International Conference on Learning Representations,
ICLR 2015, San Diego, CA, USA, May 7–9, 2015,
Conference Track Proceedings.
Teven Le Scao and Alexander Rush. 2021.
How many data points is a prompt worth?
In Proceedings of the 2021 Conference of
the North American Chapter of the Association
for Computational Linguistics: Human Language
Technologies, pages 2627–2636, Online.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2021.naacl-main.208
Huihan Li, Tianyu Gao, Manan Goenka, 和
Danqi Chen. 2022. Ditch the gold stan-
dard: Re-evaluating conversational question
answering. In Proceedings of the 60th Annual
Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), ACL
2022, Dublin, Ireland, May 22–27, 2022,
pages 8074–8085. Association for Computational
Linguistics. https://doi.org/10.18653/v1
/2022.acl-long.555
Sheng-Chieh Lin, Jheng-Hong Yang, Rodrigo
Nogueira, Ming-Feng Tsai, Chuan-Ju Wang,
and Jimmy Lin. 2020. Conversational question
reformulation via sequence-to-sequence archi-
tectures and pretrained language models. CoRR,
abs/2004.01909.
Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao
Jiang, Hiroaki Hayashi, and Graham Neubig.
2021. Pre-train, 迅速的, and predict: A system-
atic survey of prompting methods in natural
语言处理. CoRR, abs/2107.13586.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019. RoBERTa: A robustly op-
timized BERT pretraining approach. CoRR,
abs/1907.11692.
Ilya Loshchilov and Frank Hutter. 2019. Decoupled
weight decay regularization. In 7th
International Conference on Learning Representations,
ICLR 2019, New Orleans, LA, USA,
May 6–9, 2019. OpenReview.net.
Jin Qian, Bowei Zou, Mengxing Dong, Xiao
Li, AiTi Aw, and Yu Hong. 2022. Capturing
conversational interaction for question answering
via global history reasoning. In Findings of
the Association for Computational Linguistics:
NAACL 2022, Seattle, WA, United States,
July 10–15, 2022, pages 2071–2078. Association
for Computational Linguistics.
https://doi.org/10.18653/v1/2022.findings-naacl.159
Chen Qu, Liu Yang, Cen Chen, Minghui
Qiu, W. Bruce Croft, and Mohit Iyyer. 2020.
Open-retrieval conversational question answering.
In Proceedings of the 43rd International
ACM SIGIR Conference on Research and
Development in Information Retrieval.
Chen Qu, Liu Yang, Minghui Qiu, W. Bruce Croft,
Yongfeng Zhang, and Mohit Iyyer. 2019a.
BERT with history answer embedding for con-
versational question answering. In Proceedings
of the 42nd International ACM SIGIR Confe-
rence on Research and Development in Infor-
mation Retrieval, SIGIR 2019, Paris, France,
July 21–25, 2019, pages 1133–1136. ACM.
Chen Qu, Liu Yang, Minghui Qiu, Yongfeng
Zhang, Cen Chen, W. Bruce Croft, and Mohit
Iyyer. 2019b. Attentive history selection for
conversational question answering. In Proceed-
ings of the 28th ACM International Conference
on Information and Knowledge Management,
CIKM 2019, Beijing, China, November 3–7,
2019, pages 1391–1400. ACM.
Colin Raffel, Noam Shazeer, Adam Roberts,
Katherine Lee, Sharan Narang, Michael
Matena, Yanqi Zhou, Wei Li, and Peter J.
Liu. 2020. Exploring the limits of transfer
learning with a unified text-to-text transformer.
Journal of Machine Learning Research,
21:140:1–140:67.
Pranav Rajpurkar, Robin Jia, and Percy Liang.
2018. Know what you don’t know: Unanswerable
questions for SQuAD. In Proceedings of
the 56th Annual Meeting of the Association
for Computational Linguistics, ACL 2018, Melbourne,
Australia, July 15–20, 2018, Volume 2:
Short Papers, pages 784–789. Association for
Computational Linguistics. https://doi
.org/10.18653/v1/P18-2124
Pranav Rajpurkar, Jian Zhang, Konstantin
Lopyrev, and Percy Liang. 2016. SQuAD:
100,000+ questions for machine comprehension
of text. In Proceedings of the 2016 Conference
on Empirical Methods in Natural Language
Processing, EMNLP 2016, Austin, Texas, USA,
November 1–4, 2016, pages 2383–2392. The
Association for Computational Linguistics.
https://doi.org/10.18653/v1/D16-1264
Siva Reddy, Danqi Chen, and Christopher D.
Manning. 2019. CoQA: A conversational question
answering challenge. Transactions of
the Association for Computational Linguistics,
7:249–266. https://doi.org/10.1162
/tacl_a_00266
Svitlana Vakulenko, Shayne Longpre, Zhucheng
Tu, and Raviteja Anantha. 2021. 问题
rewriting for conversational question answer-
ing. In WSDM ’21, The Fourteenth ACM
International Conference on Web Search and
Data Mining, Virtual Event, Israel, March
8–12, 2021, pages 355–363. ACM.
Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Lukasz Kaiser, and Illia Polosukhin. 2017. Attention
is all you need. In Advances in Neural
Information Processing Systems 30: Annual
Conference on Neural Information Processing
Systems 2017, December 4–9, 2017, Long
Beach, CA, USA, pages 5998–6008.
Thomas Wolf, Lysandre Debut, Victor Sanh,
Julien Chaumond, Clement Delangue, Anthony
Moi, Pierric Cistac, Tim Rault, Remi Louf,
Morgan Funtowicz, Joe Davison, Sam Shleifer,
Patrick von Platen, Clara Ma, Yacine Jernite,
Julien Plu, Canwen Xu, Teven Le Scao,
Sylvain Gugger, Mariama Drame, Quentin
Lhoest, and Alexander Rush. 2020. Transformers:
State-of-the-art natural language
processing. In Proceedings of the 2020 Conference
on Empirical Methods in Natural
Language Processing: System Demonstrations,
pages 38–45, Online. Association for Computational
Linguistics. https://doi.org/10
.18653/v1/2020.emnlp-demos.6
Yi-Ting Yeh and Yun-Nung Chen. 2019.
FlowDelta: Modeling flow information gain in
reasoning for conversational machine comprehension.
In Proceedings of the 2nd Workshop
on Machine Reading for Question Answering,
MRQA@EMNLP 2019, Hong Kong, China,
November 4, 2019, pages 86–90. Association
for Computational Linguistics.
Manzil Zaheer, Guru Guruganesh, Kumar
Avinava Dubey, Joshua Ainslie, Chris Alberti,
Santiago Ontañón, Philip Pham, Anirudh
Ravula, Qifan Wang, Li Yang, and Amr Ahmed.
2020. Big Bird: Transformers for longer sequences.
In Advances in Neural Information
Processing Systems 33: Annual Conference
on Neural Information Processing Systems
2020, NeurIPS 2020, December 6–12, 2020,
virtual.
Jing Zhao, Junwei Bao, Yifan Wang, Yongwei
Zhou, Youzheng Wu, Xiaodong He, and Bowen
Zhou. 2021. RoR: Read-over-read for long
document machine reading comprehension.
In Findings of the Association for Computational
Linguistics: EMNLP 2021, Virtual
Event / Punta Cana, Dominican Republic,
16–20 November, 2021, pages 1862–1872.
Association for Computational Linguistics.
https://doi.org/10.18653/v1
/2021.findings-emnlp.160
Chenguang Zhu, Michael Zeng, and Xuedong
Huang. 2018. SDNet: Contextualized attention-
based deep network for conversational question
answering. CoRR, abs/1812.03593.