On the Robustness of Dialogue History Representation in
Conversational Question Answering: A Comprehensive Study
and a New Prompt-based Method
Zorik GekhmanT∗ Nadav OvedT ∗ Orgad KellerG Idan SzpektorG Roi ReichartT
T Technion – Israel Institute of Technology, Israel    G Google Research, Israel
{zorik@campus.|nadavo@campus.|roiri@}technion.ac.il
{orgad|szpektor}@google.com
Abstract
Most work on modeling the conversation his-
tory in Conversational Question Answering
(CQA) reports a single main result on a com-
mon CQA benchmark. While existing models
show impressive results on CQA leaderboards,
it remains unclear whether they are robust to
shifts in setting (sometimes to more realis-
tic ones), training data size (e.g., from large
to small sets) and domain. In this work, we
design and conduct the first large-scale robust-
ness study of history modeling approaches for
CQA. We find that high benchmark scores
do not necessarily translate to strong robust-
ness, and that various methods can perform
extremely differently under different settings.
Equipped with the insights from our study, we
design a novel prompt-based history modeling
approach and demonstrate its strong robust-
ness across various settings. Our approach is
inspired by existing methods that highlight
historic answers in the passage. However, in-
stead of highlighting by modifying the passage
token embeddings, we add textual prompts
directly in the passage text. Our approach
is simple, easy to plug into practically any
modelo, and highly effective, thus we recom-
mend it as a starting point for future model
developers. We also hope that our study and
insights will raise awareness to the importance
of robustness-focused evaluation, in addition
to obtaining high leaderboard scores, leading
to better CQA systems.1
1 Introduction
Conversational Question Answering (CQA) in-
volves a dialogue between a user who asks
questions and an agent that answers them based
on a given document. CQA is an extension of the
∗Authors contributed equally to this work.
1Our code and data are available at: https://github.com/zorikg/MarCQAp.
traditional single-turn QA task (Rajpurkar et al.,
2016), with the major difference being the pres-
ence of the conversation history, which requires
effective history modeling (Gupta et al., 2020).
Previous work demonstrated that the straightfor-
ward approach of concatenating the conversation
turns to the input is lacking (Qu et al., 2019a),
leading to various proposals of architecture com-
ponents that explicitly model the conversation
history (Choi et al., 2018; Huang et al., 2019;
Yeh and Chen, 2019; Qu et al., 2019a,b; Chen
et al., 2020; Kim et al., 2021). However, there is
no single agreed-upon setting for evaluating the
effectiveness of such methods, with the majority
of prior work reporting a single main result on
a CQA benchmark, such as CoQA (Reddy et al.,
2019) or QuAC (Choi et al., 2018).
While recent CQA models show impressive re-
sults on these benchmarks, such a single-score
evaluation scheme overlooks aspects that can be
essential in real-world use-cases. First, QuAC and
CoQA contain large annotated training sets, cual
makes it unclear whether existing methods can re-
main effective in small-data settings, where the
annotation budget is limited. Moreover, the eval-
uation is done in-domain, ignoring the model’s
robustness to domain shifts, with target domains
that may even be unknown at model training
time. Furthermore, the models are trained and
evaluated using a ‘‘clean’’ conversation history
between 2 humans, while in reality the history
can be ‘‘noisy’’ and less fluent, due to the in-
correct answers by the model (Le et al., 2022).
Finally, these benchmarks mix the impact of ad-
vances in pre-trained language models (LMs) with
conversation history modeling effectiveness.
In this work, we investigate the robustness
of history modeling approaches in CQA. We ask
whether high performance on existing benchmarks
also indicates strong robustness. To address this
Transactions of the Association for Computational Linguistics, vol. 11, pp. 351–366, 2023. https://doi.org/10.1162/tacl_a_00549
Action Editor: Preslav I. Nakov. Submission batch: 7/2022; Revision batch: 11/2022; Published 4/2023.
© 2023 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
mi
d
tu
/
t
a
C
yo
/
yo
a
r
t
i
C
mi
–
pag
d
F
/
d
oh
i
/
.
1
0
1
1
6
2
/
t
yo
a
C
_
a
_
0
0
5
4
9
2
1
4
5
1
9
9
/
/
t
yo
a
C
_
a
_
0
0
5
4
9
pag
d
.
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
Training: QuAC (11,567 conversations, 83,568 examples).
In-Domain Evaluation: QuAC (1,000 conversations, 7,354 examples); QuAC-NH (1,204 conversations, 10,515 examples).
Out-Of-Domain Evaluation: CoQA domains (children stories, literature, mid-high school, news, wikipedia), 100 conversations per domain and between 1,425 and 1,653 examples per domain (1,425; 1,626; 1,630; 1,649; 1,653); DoQA domains (cooking, movies, travel), 400 conversations per domain and between 1,713 and 1,884 examples per domain (1,713; 1,797; 1,884).

Table 1: Datasets statistics.
question, we carry out the first large-scale robust-
ness study using 6 common modeling approaches.
We design 5 robustness-focused evaluation set-
tings, which we curate based on 4 existing CQA
datasets. Our settings are designed to evaluate
efficiency in low-data scenarios, the ability to
scale in a high-resource setting, as well as robust-
ness to domain-shift and to noisy conversation
history. We then perform a comprehensive ro-
bustness study, where we evaluate the considered
methods in our settings.
We focus exclusively on history modeling, as it
is considered the most significant aspect of CQA
(Gupta et al., 2020), differentiating it from the
classic single-turn QA task. To better reflect the
contribution of the history modeling component,
we adapt the existing evaluation metric. First, to
avoid differences which stem from the use of dif-
ferent pre-trained LMs, we fix the underlying LM
for all the evaluated methods, re-implementing
all of them. Segundo, instead of focusing on final
scores on a benchmark, we focus on each model’s
improvement (Δ%) compared to a baseline QA
model that has no access to the conversation
history.
Our results show that history modeling meth-
ods perform very differently in different settings,
and that approaches that achieve high benchmark
scores are not necessarily robust under low-data
and domain-shift settings. Moreover, we notice
that approaches that highlight historic answers
within the document by modifying the document
embeddings achieve the top benchmark scores,
but their performance is surprisingly lacking in
low-data and domain-shift settings. We hypothe-
size that history highlighting yields high-quality
representation, but since the existing highlighting
methods add dedicated embedding parameters,
specifically designed to highlight the document’s
tokens, they are prone to over-fitting.
These findings motivate us to search for
an alternative history modeling approach with
improved robustness across different settings.
Following latest trends w.r.t. prompting in NLP
(Liu et al., 2021), we design MarCQAp, a novel
prompt-based approach for history modeling,
which adds textual prompts within the grounding
document in order to highlight previous answers
from the conversation history. While our approach
is inspired by the embedding-based highlighting
methods, it is not only simpler, but it also shows
superior robustness compared to other evaluated
approaches. As MarCQAp is prompt-based, it can
be easily combined with any architecture, allow-
ing to fine-tune any model with a QA architecture
for the CQA task with minimal effort. Thus, we
hope that it will be adopted by the community as
a useful starting point, owing to its simplicity, as
well as high effectiveness and robustness. We also
hope that our study and insights will encourage
more robustness-focused evaluations, in addition
to obtaining high leaderboard scores, conduciendo a
better CQA systems.
2 Preliminaries
2.1 CQA Task Definition and Notations
Given a text passage P, the current question qk,
and a conversation history Hk in the form of a
sequence of previous questions and answers Hk =
(q1, a1, . . . , qk−1, ak−1), a CQA model predicts the
answer ak based on P as a knowledge source. The
answers can be either spans within the passage P
(extractive) or free-form text (abstractive).
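To make the notation concrete, the following minimal sketch illustrates a single CQA example; the passage text and turns are hypothetical and only meant to mirror the structure of (P, Hk, qk) and ak:

```python
# A hypothetical CQA example, mirroring the notation above (illustrative only).
passage = "Norman Finkelstein is an American political scientist and author. ..."  # P
history = [                      # Hk = (q1, a1, ..., q_{k-1}, a_{k-1})
    ("Who is Norman Finkelstein?", "an American political scientist and author"),
]
question = "Did he have any other critics?"   # qk (refers back to the history)
# A CQA model predicts ak from P given qk and Hk; in the extractive case ak is
# a span of P, in the abstractive case it is free-form text.
```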
2.2 CQA Datasets
Full datasets statistics are presented in Table 1.
QuAC (Choi et al., 2018) and CoQA (Reddy
et al., 2019) are the two leading CQA datasets,
with different properties. In QuAC, the questions
are more exploratory and open-ended with longer
answers that are more likely to be followed up.
This makes QuAC more challenging and realistic.
We follow the common practice in recent work
(Qu et al., 2019a,b; Kim y cols., 2021; Le et al.,
2022), focusing on QuAC as our main dataset,
using its training set for training and its vali-
dation set for in-domain evaluation (the test set
is hidden, reserved for a leaderboard challenge).
We use CoQA for additional pre-training or for
domain-shift evaluation.
DoQA (Campos et al., 2020) is another CQA
dataset with dialogues from the Stack Exchange
online forum. Due to its relatively small size, it
is typically used for testing transfer and zero-shot
learning. We use it for domain-shift evaluation.
QuAC Noisy-History (QuAC-NH)
is based on
a dataset of human-machine conversations col-
lected by Li et al. (2022), using 100 passages
from the QuAC validation set. While Li et al. used
it for human evaluation, we use it for automatic
evaluation, leveraging the fact that the answers
are labeled for correctness, which allows us to use
the correct answers as labels.
In existing CQA datasets, each conversation
(q1, a1, . . . , qm, am) and the corresponding passage P
are used to create m examples Ek = (P, Hk, qk),
k = 1, . . . , m, where Hk = (q1, a1, . . . , qk−1, ak−1).
ak is then used as a label for Ek. Since QuAC-NH
contains incorrect answers, if ak is incorrect we
discard Ek to avoid corrupting the evaluation set
with incorrectly labeled examples. We also filtered
out invalid questions (Le et al., 2022) and answers
that did not appear in P.2
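The example construction and the QuAC-NH filtering described above can be sketched as follows (a simplified illustration; the function and variable names are ours and not taken from any released code):

```python
# Build m examples {E_k} from one conversation and its passage (sketch).
def build_examples(passage, conversation):
    # conversation is [(q1, a1), ..., (qm, am)]
    examples = []
    for k, (q_k, a_k) in enumerate(conversation):
        # Hk = (q1, a1, ..., q_{k-1}, a_{k-1}), flattened into one list.
        history = [qa for turn in conversation[:k] for qa in turn]
        examples.append({"passage": passage, "history": history,
                         "question": q_k, "label": a_k})
    return examples

# For QuAC-NH we additionally discard E_k whose label a_k is an incorrect model
# answer, an invalid question, or an answer that does not appear in the passage.
def filter_noisy(examples, is_correct):
    return [e for e in examples
            if is_correct(e["label"]) and e["label"] in e["passage"]]
```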
2.3 CQA Related Work
Conversation History Modeling is the major chal-
lenge in CQA (Gupta et al., 2020). Early work
used recurrent neural networks (RNNs) and vari-
ants of attention mechanisms (Reddy et al., 2019;
Choi et al., 2018; Zhu et al., 2018). Another
trend was to use flow-based approaches, cual
generate a latent representation for the tokens in
Hk, using tokens from P (Huang et al., 2019;
Yeh and Chen, 2019; Chen et al., 2020). Modern
approaches, which are the focus of our work, lever-
age Transformer-based (Vaswani et al., 2017)
pre-trained language models.
The simplest approach to model the history with
pre-trained LMs is to concatenate Hk with qk and
PAG (Choi et al., 2018; Zhao et al., 2021). Alter-
native approaches rewrite qk based on Hk and
use the rewritten questions instead of Hk and qk
(Vakulenko et al., 2021), or as an additional train-
ing signal (Kim y cols., 2021). Another fundamental
approach is to highlight historic answers within
2Even though Li et al. only used extractive models, a small portion of the answers did not appear in the passage.
Setting          Pre-trained LM Size   Training                Evaluation
Standard         Base                  QuAC                    QuAC
High-Resource    Large                 CoQA + QuAC             QuAC
Low-Resource     Base                  QuAC smaller samples    QuAC
Domain-Shift     Base                  QuAC                    CoQA + DoQA
Noisy-History    Base                  QuAC                    QuAC-NH

Table 2: Summary of our proposed settings.
P by modifying the passage’s token embeddings
(Qu et al., 2019a,b). Qu et al. also introduced
a component that performs dynamic history se-
lection after each turn is encoded. Yet, in our
corresponding baseline we utilize only the his-
toric answer highlighting mechanism, owing to its
simplicity and high effectiveness. A contempora-
neous work proposed a global history attention
component, designed to capture long-distance
dependencies between conversation turns (Qian
et al., 2022).3
3 History Modeling Study
En este trabajo, we examine the effect of a model’s
history representation on its robustness. To this
end, we evaluate different approaches under sev-
eral settings that diverge from the standard
supervised benchmark (§3.1). This allows us to
examine whether the performance of some meth-
ods deteriorates more quickly than others in
different scenarios. To better isolate the gains
from history modeling, we measure performance
compared to a baseline QA model which has
no access to Hk (§3.2), and re-implement all
the considered methods using the same under-
lying pre-trained language model (LM) for text
representación (§3.3).
3.1 Robustness Study Settings
We next describe each comparative setting in our
study and the rationale behind it, as summarized
in Table 2. Table 1 depicts the utilized datasets.
Standard. Defined by Choi et al. (2018), this
setting is followed by most studies. We use a
medium-sized pre-trained LM for each method,
commonly known as its base version, and then
fine-tune and evaluate the models on QuAC.
High-Resource. This setting examines the ex-
tent to which methods can improve their
performance when given more resources. To this
3Publicado 2 weeks before our submission.
end, we use a large pre-trained LM, perform ad-
ditional pre-training on CoQA (with the CQA
objetivo), and then fine-tune and evaluate on
QuAC.
Low-Resource.
In this setting, we examine the
resource efficiency of the history modeling ap-
proaches by reducing the size of the training set.
This setting is similar to the standard setting,
except that we fine-tune on smaller samples of
QuAC’s training set. For each evaluated method
we train 4 model variants: 20%, 10%, 5%, and 1%,
reflecting the percentage of training data retained.
Domain-Shift. This setting examines robust-
ness to domain shift. To this end, we use the
8 domains in the CoQA and DoQA datasets as
test sets from unseen target domains, evaluating
the models trained under the standard setting on
these test-sets.
Noisy-History. This setting examines robust-
ness to noisy conversation history, donde el
answers are sometimes incorrect and the conver-
sation flow is less fluent. To this end, we evaluate
the models trained under the standard setting on
the QuAC-NH dataset, consisting of conversations
between humans and other CQA models (§2.2).
We note that a full human-machine evaluation
requires a human in the loop. We choose to eval-
uate against other models predictions as a middle
ground. This allows us to test the models’ behav-
ior on noisy conversations with incorrect answers
and less fluent flow, but without a human in the
loop.
3.2 Evaluation Metric
The standard CQA evaluation metric is the average
word-level F1 score (Rajpurkar et al., 2016; Choi
et al., 2018; Reddy et al., 2019; Campos et al.,
2020).4 Since we focus on the impact of history
modelado, we propose to consider each model’s
improvement in F1 (Δ%) compared to a baseline
QA model that has no access to the dialogue
history.
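Concretely, the relative improvement that we report throughout the paper can be computed as in the following sketch (our formulation of the metric just described):

```python
def delta_percent(f1_model: float, f1_no_history: float) -> float:
    """Relative F1 improvement (Delta%) over the NO HISTORY baseline."""
    return 100.0 * (f1_model - f1_no_history) / f1_no_history

# Example with standard-setting values from Table 4:
# delta_percent(65.8, 60.4) -> ~8.9, i.e., CONCAT's +8.9%.
```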
3.3 Pre-trained LM
To control for differences which stem from the
use of different pre-trained LMs, we re-implement
all the considered methods using the Longformer
(Beltagy et al., 2020), a sparse-attention Trans-
former designed to process long input sequences.
4We follow the calculation presented in Choi et al. (2018).
It is therefore a good fit for handling the con-
versation history and the source passage as a
combined (long) input. Prior work usually utilized
dense-attention Transformers, whose input length
limitation forced them to truncate Hk and split
P into chunks, processing them separately and
combining the results (Choi et al., 2018; Qu et al.,
2019a,b; Kim et al., 2021; Zhao et al., 2021). This
introduces additional complexity and diversity in
the implementation, while with the Longformer
we can keep implementation simple, as this model
can attend to the entire history and passage.
We would also like to highlight RoR (Zhao
et al., 2021), which enhances a dense-attention
Transformer to better handle long sequences.
Notably, the state-of-the-art result on QuAC
was reported using ELECTRA+RoR with simple
history concatenation (see CONCAT in §3.4). Mientras
this suggests that ELECTRA+RoR can outper-
form the Longformer, since our primary focus
is on analyzing the robustness of different his-
tory modeling techniques rather than on long
sequence modeling, we opt for a general-purpose
commonly used LM for long sequences, cual
exhibits competitive performance.
3.4 Evaluated Methods
In our study we choose to focus on modern history
modeling approaches that
leverage pre-trained
LMs. These models have demonstrated significant
progress in recent years (§2.3).
NO HISTORY A classic single-turn QA model
without access to Hk. We trained a Longformer
for QA (Beltagy et al., 2020), using qk and P as a
single packed input sequence (ignoring Hk). The
model then extracts the answer span by predicting
its start and end positions within P .
In contrast to the rest of the evaluated methods,
we do not consider this method as a baseline for
history modeling, but rather as a reference for
calculating our Δ% metric. As discussed in §3.2,
we evaluate all history modeling methods for their
ability to improve over this model.
CONCAT Concatenating Hk to the input (i.e.,
to qk and P ), cual es (arguably) the most
straightforward way to model the history (Choi
et al., 2018; Qu et al., 2019a; Zhao et al.,
2021). Other than the change to the input, this
model architecture and training is identical to NO
HISTORY.
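As an illustration, the difference between the two input formats can be sketched as follows (the separator token and formatting are ours for illustration and are not necessarily the exact ones used in our implementation):

```python
def pack_no_history(question: str, passage: str) -> str:
    # NO HISTORY: a single-turn QA input containing only qk and P.
    return f"{question} </s> {passage}"

def pack_concat(question: str, history: list[str], passage: str) -> str:
    # CONCAT: the previous turns (q1, a1, ..., q_{k-1}, a_{k-1}) are prepended
    # to the current question before the passage.
    return f"{' '.join(history)} {question} </s> {passage}"
```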
REWRITE This
approach was proposed in
Vakulenko et al. (2021). It consists of a pipeline
of two models, question rewriting (QR) y
question answering (control de calidad). An external QR model
first generates a rewritten question ˜qk, based on
qk and Hk. ˜qk and P are then used as input to
a standard QA model, identical to NO HISTORY,
but trained with the rewritten questions. For the
external QR model we follow Lin et al. (2020),
Vakulenko et al. (2021), and Kim et al. (2021)
and fine-tune T5-base (Raffel et al., 2020) on the
CANARD dataset (Elgohary et al., 2019). We use
the same QR model across all the settings in our
study (§3.1), meaning that in the low-resource
setting we limit only the CQA data, which is used
to train the QA model.
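Schematically, the two-stage pipeline can be summarized as follows, where qr_model and qa_model stand for the fine-tuned T5 rewriter and the downstream QA model (the names and call signatures are ours, for illustration only):

```python
def rewrite_pipeline(question, history, passage, qr_model, qa_model):
    # Stage 1: question rewriting (QR) - produce a self-contained question
    # from qk and Hk (e.g., resolving pronouns against the history).
    rewritten = qr_model(question=question, history=history)
    # Stage 2: standard single-turn QA over the rewritten question; the raw
    # history itself is not given to the QA model.
    return qa_model(question=rewritten, passage=passage)
```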
REWRITEC Hypothesizing that there is useful in-
formation in Hk on top of the rewritten question
˜qk, we combine REWRITE and CONCAT, obtaining
a model which is similar to CONCAT, except that
it replaces qk with ˜qk.
ExCorDLF Our implementation of the ExCorD
approach, proposed in Kim et al. (2021). Instead
of rewriting the original question, qk, at inference
time (REWRITE), ExCorD uses the rewritten ques-
tion only at training time as a regularization signal
when encoding the original question.
HAELF Our implementation of the HAE ap-
proach proposed in Qu et al. (2019a), which
highlights the conversation history within P . In-
stead of concatenating Hk to the input, HAE
highlights the historic answers a1, . . . , ak−1 within
P, by modifying the passage token embeddings.
HAE adds an additional dedicated embedding
layer with 2 learned embedding vectors, denoting
whether a token from P appears in any historic
answers or not.
PosHAELF Our implementation of the PosHAE
approach proposed in Qu et al. (2019b), which ex-
tends HAE by adding positional information. The
embedding matrix is extended to contain a vector
per conversation turn, each vector representing the
turn that the corresponding token appeared in.
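Conceptually, this embedding-based highlighting can be sketched in PyTorch as follows (a simplified illustration of the idea, not the authors' exact implementation):

```python
import torch
import torch.nn as nn

class HistoryAnswerEmbedding(nn.Module):
    """Adds a learned vector to each passage token, marking history membership.

    HAE uses 2 vectors (token appears / does not appear in a historic answer);
    PosHAE extends the table to one vector per conversation turn, encoding
    which turn the token's answer came from.
    """
    def __init__(self, hidden_size: int, num_turns: int = 2):
        super().__init__()
        self.answer_embedding = nn.Embedding(num_turns, hidden_size)

    def forward(self, token_embeddings, turn_ids):
        # token_embeddings: (batch, seq_len, hidden); turn_ids: (batch, seq_len)
        return token_embeddings + self.answer_embedding(turn_ids)
```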
3.5 Implementation Details
We fine-tune all models on QuAC for 10 epochs,
employ an accumulated batch size of 640, a weight
decay of 0.01, and a learning rate of 3 · 10−5. In the
high-resource setup, we also pre-train on CoQA
for 5 epochs.

Method      Original Work                                        Original LM   Original Result   Our Impl.
CONCAT      Qu et al. (2019a)                                    BERT          62.0              65.8
REWRITE     Vakulenko et al. (2021)                              BERT          Not Reported      64.6
REWRITEC    N/A (this baseline was first proposed in this work)  –             –                 67.3
ExCorD      Kim et al. (2021)                                    RoBERTa       67.7              67.5
HAE         Qu et al. (2019a)                                    BERT          63.9              68.9
PosHAE      Qu et al. (2019b)                                    BERT          64.7              69.8

Table 3: F1 scores comparison between original implementations and ours (using Longformer as the LM), for all methods described in §3.4, in the standard setting.

We use a maximum output length
of 64 tokens. Following Beltagy et al. (2020), we
set Longformer’s global attention to all the tokens
of qk. We use the cross-entropy loss and AdamW
optimizer (Kingma and Ba, 2015; Loshchilov and
Hutter, 2019). Our implementation makes use
of the HuggingFace Transformers (Wolf et al.,
2020), and PyTorch-Lightning libraries.5
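For reference, the main fine-tuning hyperparameters listed above can be summarized in a single configuration (the key names are ours; the values are taken from this section):

```python
finetune_config = {
    "epochs": 10,                        # the CoQA pre-training step uses 5
    "accumulated_batch_size": 640,
    "weight_decay": 0.01,
    "learning_rate": 3e-5,
    "max_output_length_tokens": 64,
    "optimizer": "AdamW",
    "loss": "cross-entropy",
    "global_attention": "all tokens of the current question qk",
}
```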
For the base LM (used in all settings
except high-resource) we found that a Long-
former that was further pre-trained on SQuADv2
(Rajpurkar et al., 2018)6 achieved consistently
better performance than the base Longformer.
Thus, we adopted it as our base LM. For the large
LM (used in the high-resource setting) we used
Longformer-large.7
In §5, we introduce a novel method (MarCQAp)
and perform statistical significance tests (Dror
et al., 2018, 2020). Following Qu et al. (2019b),
we use the Student’s paired t-test with p < 0.05,
to compare MarCQAp to all other methods in each
setting.
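The significance test can be run as in the following sketch; we assume paired per-example F1 scores for the two compared systems (an assumption about the exact pairing unit, which is not spelled out above):

```python
from scipy import stats

def is_significant(f1_scores_a, f1_scores_b, alpha=0.05):
    # Student's paired t-test over per-example F1 scores of two systems.
    t_stat, p_value = stats.ttest_rel(f1_scores_a, f1_scores_b)
    return p_value < alpha
```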
In our re-implementation of the evaluated meth-
ods, we carefully followed their descriptions and
implementation details as published by the authors
in their corresponding papers and codebases. A
key difference in our implementation is the use
of a long sequence Transformer, which removes
the need to truncate Hk and split P into chunks
(§3.3). This simplifies our implementation and
avoids differences between methods.8 Table 3
compares between our results and those reported
in previous works. In almost all cases we achieved
5https://github.com/PyTorchLightning
/pytorch-lightning.
6https://huggingface.co/mrm8488
/longformer-base-4096-finetuned-squadv2.
7https://huggingface.co/allenai
/longformer-large-4096.
8The maximum length limit of Hk varies between dif-
ferent works, as well as how sub-document chunks are
handled.
In the low-resource setting (the first four columns) and the standard setting, all methods use Longformer-base pre-trained on SQuAD; the high-resource setting uses Longformer-large pre-trained on CoQA.

Training set size   800 (1%)        4K (5%)         8K (10%)        16K (20%)       Avg Δ%    Standard, 80K (100%)   High-Resource, 80K (100%)
NO HISTORY          45.0            50.0            52.9            55.4            –         60.4                   65.6
CONCAT              43.9 (-2.4%)    51.2 (+2.4%)    53.4 (+0.9%)    57.8 (+4.3%)    +1.3%     65.8 (+8.9%)           72.3 (+10.2%)
REWRITE             46.5 (+3.3%)    54.0 (+8.0%)    56.4 (+6.6%)    59.2 (+6.9%)    +6.2%     64.6 (+7.0%)           69.0 (+5.2%)
REWRITEC            42.3 (-6.0%)    54.4 (+8.8%)    57.2 (+8.1%)    60.6 (+9.4%)    +5.1%     67.3 (+11.4%)          72.5 (+10.5%)
ExCorDLF            46.0 (+2.2%)    53.0 (+6.0%)    57.2 (+8.1%)    60.3 (+8.8%)    +6.3%     67.5 (+11.8%)          73.8 (+12.3%)
HAELF               44.5 (-1.1%)    50.8 (+1.6%)    55.0 (+4.0%)    59.8 (+7.9%)    +3.1%     69.0 (+14.2%)          73.2 (+11.4%)
PosHAELF            40.5 (-10.0%)   51.0 (+2.0%)    55.1 (+4.2%)    60.9 (+9.9%)    +1.5%     69.8 (+15.6%)          74.2 (+12.9%)
MarCQAp (§5)        48.2 (+7.1%)    57.4 (+14.8%)   61.3 (+15.9%)   64.6 (+16.6%)   +13.6%    70.2 (+16.2%)          74.7 (+13.7%)

Table 4: In-domain F1 and Δ% scores on the full QuAC validation set, for the standard, high-resource, and low-resource settings.
a higher score (probably since Longformer out-
performs BERT), with the exception of ExCorD,
where we achieved a comparable score (proba-
bly since Longformer is actually initialized using
RoBERTa’s weights [Beltagy et al., 2020]).
4 Results and Analysis
We next discuss the takeaways from our study,
where we evaluated the considered methods across
the proposed settings. Table 4 presents the results
of the standard, high-resource, and low-resource
settings. Table 5 further presents the domain-shift
results. Finally, Table 6 depicts the results of the
noisy-history setting. Each method is compared
to NO HISTORY by calculating the Δ% (§3.2). The
tables also present the results of our method,
termed MarCQAp, which is discussed in §5.
We further analyze the effect of the conversa-
tion history length in Figure 1, evaluating models
from the standard setting with different limits on
the history length. For instance, when the limit is
2, we expose the model to up to the 2 most recent
turns, by truncating Hk.9
Key Findings A key goal of our study is to
examine the robustness of history modeling ap-
proaches to setting shifts. This research reveals
limitations of the single-score benchmark-based
evaluation adopted in previous works (§4.1), as
such scores are shown to be only weakly correlated
with low-resource and domain-shift robustness.
Furthermore, keeping in mind that history mod-
eling is a key aspect of CQA, our study also
demonstrates the importance of isolating the con-
tribution of the history modeling method from
9We exclude REWRITE, since it utilizes Hk only in the
form of the rewritten question. For REWRITEC , we truncate
the concatenated Hk for the CQA model, while the QR model
remains exposed to the entire history.
other model components (§4.2). Finally, we dis-
cover that while existing history highlighting
approaches yield high-quality input representa-
tions, their robustness is surprisingly poor. We
further analyze the history highlighting results
and provide possible explanations for this phe-
nomenon (§4.3). This finding is the key motivation
for our proposed method (§5).
4.1 High CQA Benchmark Scores do not
Indicate Good Robustness
First, we observe some expected general trends:
All methods improve on top of NO HISTORY, as
demonstrated by the positive Δ% in the standard
setting, showing that all the methods can leverage
information from Hk. All methods scale with more
training data and a larger model (high-resource),
and their performances drop significantly when
the training data size is reduced (low-resource)
or when they are presented with noisy history. A
performance drop is also observed when evaluat-
ing on domain-shift, as expected in the zero shot
setting.
However, not all methods scale equally well
and some deteriorate faster than others. This
phenomenon is illustrated in Table 7, where the
methods are ranked by their scores in each set-
ting. We observe high instability between settings.
For instance, PosHAELF is top performing in
3 settings but is second worst in 2 others.
REWRITE is second best in low-resource, but among
the last ones in other settings. So is the case
with CONCAT: Second best in domain-shift but
among the worst ones in others. In addition, while
all the methods improve when they are exposed
to longer histories (Figure 1), some saturate earlier
than others.
Setting: Domain-Shift

Domain           NO HIST.   CONCAT          REWRITE         REWRITEC        ExCorDLF        HAELF           PosHAELF        MarCQAp (§5)
CoQA
 Children Sto.   54.8       62.2 (+13.5%)   60.1 (+9.7%)    62.7 (+14.4%)   62.7 (+14.4%)   61.8 (+12.8%)   56.6 (+3.3%)    66.7 (+21.7%)
 Literature      42.6       48.0 (+12.7%)   47.7 (+12.0%)   49.0 (+15.0%)   51.5 (+20.9%)   50.5 (+18.5%)   47.4 (+11.3%)   56.4 (+32.4%)
 M/H Sch.        50.3       55.3 (+9.9%)    55.0 (+9.3%)    56.7 (+12.7%)   58.2 (+15.7%)   56.6 (+12.5%)   55.4 (+10.1%)   61.8 (+22.9%)
 News            50.1       54.9 (+9.6%)    54.8 (+9.4%)    55.2 (+10.2%)   57.0 (+13.8%)   55.4 (+10.6%)   52.7 (+5.2%)    60.8 (+21.4%)
 Wikipedia       58.2       59.9 (+2.9%)    60.9 (+4.6%)    59.4 (+2.1%)    63.6 (+9.3%)    60.9 (+4.6%)    61.7 (+6.0%)    67.5 (+16.0%)
DoQA
 Cooking         46.9       54.8 (+16.8%)   44.6 (-4.9%)    52.0 (+10.9%)   53.7 (+14.5%)   45.0 (-4.1%)    45.6 (-2.8%)    53.3 (+13.6%)
 Movies          45.0       52.0 (+15.6%)   43.2 (-4.0%)    49.1 (+9.1%)    51.1 (+13.6%)   45.1 (+0.2%)    45.8 (+1.8%)    51.8 (+15.1%)
 Travel          44.0       48.4 (+10.0%)   40.9 (-7.0%)    46.4 (+5.5%)    48.6 (+10.5%)   45.1 (+2.5%)    44.7 (+1.6%)    50.1 (+13.9%)
Avg Δ%           –          +11.4%          +3.6%           +10.0%          +14.1%          +7.2%           +4.6%           +19.6%

Table 5: F1 and Δ% scores for the domain-shift setting.
Setting: Noisy-History

Method          F1
NO HISTORY      49.9
CONCAT          55.3 (+10.8%)
REWRITE         56.0 (+12.2%)
REWRITEC        58.5 (+17.2%)
ExCorDLF        56.8 (+13.8%)
HAELF           57.9 (+16.0%)
PosHAELF        60.1 (+20.4%)
MarCQAp (§5)    62.3 (+24.9%)

Table 6: F1 and Δ% scores for the noisy-history setting.
We conclude that the winner does not take it
all: There are significant instabilities in meth-
ods’ performance across settings. This reveals
the limitations of the existing single-score bench-
mark evaluation practice, and calls for more
comprehensive robustness-focused evaluation.
4.2 The Contribution of the History
Modeling Method should be Isolated
In the high-resource setting, NO HISTORY reaches
65.6 F1, higher than many CQA results reported
in previous work (Choi et al., 2018; Qu et al.,
2019a,b; Huang et al., 2019). Since it is clearly
ignoring the history, this shows that significant
improvements can stem from simply using a better
LM. Thus comparing between history modeling
methods that use different LMs can be misleading.
This is further illustrated with HAELF ’s and
PosHAELF ’s results. The score that Kim et al.
reported for ExCorD is higher than Qu et al.
reported for HAE and PosHAE. While both au-
thors used a setting equivalent to our standard
setting, Kim et al. used RoBERTa while Qu
et al. used BERT, as their underlying LM. It is
therefore unclear whether ExCorD’s higher score
stems from better history representation or from
choosing to use RoBERTa. In our study, HAELF
Table 7: Per setting rankings of the methods evalu-
ated in our study (top is best), excluding MarCQAp.
C is CONCAT, R is REWRITE, RC is REWRITEC, Ex is
ExCorDLF , H is HAELF , and PH is PosHAELF .
and PosHAELF actually outperform ExCorDLF
in the standard setting. This suggests that these
methods can perform better than reported, and
demonstrates the importance of controlling for the
choice of LM when comparing between history
modeling methods.
As can be seen in Figure 1, CONCAT sat-
urates at 6 turns, which is interesting since
Qu et al. (2019a) reported saturation at 1
turn in a BERT-based equivalent. Furthermore,
Qu et al. observed a performance degradation
with more turns, while we observe stability. These
differences probably stem from the history trunca-
tion in BERT, due to the input length limitation of
dense attention Transformers. This demonstrates
the advantages of sparse attention Transformers
for history modeling evaluation, since the com-
parison against CONCAT can be more ‘‘fair’’. This
comparison is important, since the usefulness of
any method should be established by comparing it
to the straight-forward solution, which is CONCAT
in case of history modeling.
We would also like to highlight PosHAELF ’s
F1 scores in the noisy-history (60.1) and the
20% low-resource setting (60.9), both lower than
the 69.8 F1 in the standard setting. Do these
performance drops reflect
lower effectiveness
in modeling the conversation history? Here the
Δ% comes to the rescue. While the Δ% de-
creased between the standard and the 20% settings
Figure 1: F1 as a function of # history turns, for models
from the standard setup. The first occurrence of the
maximum F1 value (saturation point) is highlighted.
Figure 2: Δ% as a function of # training examples.
Results taken from the standard and low-resource
settings.
(15.6 → 9.9),
it actually increased in the
noisy-history setting (to 20.4). This indicates
that even though the F1 decreased, the ability
to leverage the history actually increased.
We conclude that our study results support the
design choices we made, in our effort to better iso-
late the contribution of the history representation.
We recommend future works to compare history
modeling methods using the same LM (prefer-
ably a long sequence LM), and to measure a Δ%
compared to a NO HISTORY baseline.
4.3 History Highlighting is Effective in
Resource-rich Setups, but is not Robust
The most interesting results are observed for the
history highlighting methods: HAE and PosHAE.
First, when implemented using the Longformer,
HAELF and PosHAELF perform better than re-
ported in previous work, with 68.9 and 69.8 F1
respectively, compared to 63.9 and 64.7 reported
by Qu et al. using BERT. The gap between HAELF
and PosHAELF demonstrates the effect of the po-
sitional information in PosHAELF . This effect is
further observed in Figure 1: HAELF saturates
earlier since it cannot distinguish between dif-
ferent conversation turns, which probably yields
conflicting information. PosHAELF saturates at
9 turns,
later than the rest of the methods,
which indicates that it can better leverage long
conversations.
PosHAELF outperforms all methods in the stan-
dard, high-resource, and noisy-history settings,10
demonstrating the high effectiveness of history
highlighting. However, it shows surprisingly poor
10We ignore MarCQAp’s results in this section.
performance in low-resource and domain-shift set-
tings, with extremely low average Δ% compared
to other methods. The impact of the training set
size is further illustrated in Figure 2. We plot the
Δ% as a function of the training set size, and
specifically highlight PosHAELF in bold red. Its
performance deteriorates significantly faster than
others when the training set size is reduced. In
the 1% setting it is actually the worst performing
method.
This poor robustness could be caused by the
additional parameters added in the embedding
layer of PosHAELF . Figure 2 demonstrates that
properly training these parameters, in order to
benefit from this method’s full potential, seems
to require large amounts of data. Furthermore,
the poor domain-shift performance indicates that,
even with enough training data, this embedding
layer seems to be prone to overfitting to the source
domain.
We conclude that history highlighting clearly
yields a very strong representation, but the addi-
tional parameters of the embedding layer seem to
require large amounts of data to train properly and
over-fit to the source domain. Is there a way to
highlight historic answers in the passage, without
adding dedicated embedding layers?
In §5 we present MarCQAp, a novel history
modeling approach that is inspired by PosHAE,
adopting the idea of history highlighting. How-
ever, instead of modifying the passage embedding,
we highlight historic answers by adding textual
prompts directly in the input text. By leveraging
prompts, we reduce model complexity and remove
the need for training dedicated parameters, hoping
to mitigate the robustness weaknesses of PosHAE.
5 MarCQAp
Motivated by our findings, we design MarCQAp,
a novel prompt-based history modeling approach
that highlights answers from previous conversa-
tion turns by inserting textual prompts in their
respective positions within P . By highlighting
with prompts instead of embedding vectors, we
hope to encode valuable dialogue information,
while reducing the learning complexity incurred
by the existing embedding-based methods. Thus,
we expect MarCQAp to perform well not only in
high-resource settings, but also in low-resource
and domain adaptation settings, in which prompt-
ing methods have shown to be particularly useful
(Brown et al., 2020; Le Scao and Rush, 2021;
Ben-David et al., 2022).
Prompting often refers to the practice of
adding phrases to the input,
in order to en-
courage pre-trained LMs to perform specific
tasks (Liu et al., 2021), yet it is also used as
a method for injecting task-specific guidance
during fine-tuning (Le Scao and Rush, 2021;
Ben-David et al., 2022). MarCQAp closely re-
sembles the prompting approach from Ben-David
et al. (2022) since our prompts are: (1) discrete
(i.e., the prompt is an actual text-string), (2) dy-
namic (i.e., example-based), and (3) added to the
input text and the model then makes predictions
conditioned on the modified input. Moreover, as
in Ben-David et al., in our method the underlying
LM is further trained on the downstream task with
prompts. However, in contrast to most prompting
approaches, which predefine the prompt’s loca-
tion in the input (Liu et al., 2021), our prompts are
inserted in different locations for each example. In
addition, while most textual prompting approaches
leverage prompts comprised of natural language,
our prompts contain non-verbal symbols (e.g.,
"<1>«, ver figura 3 and §5.1), which were proven
useful for supervision of NLP tasks. For example,
Aghajanyan et al. (2022) showed the usefulness
of structured pre-training by adding HTML sym-
bols to the input text. Finally, to the best of our
knowledge, this work is the first to propose a
prompting mechanism for the CQA task.
5.1 Method
MarCQAp utilizes a standard single-turn QA
model architecture and input, with the input com-
prising the current question qk and the passage P .
For each CQA example (P, Hk, qk), MarCQAp
Figure 3: The MarCQAp highlighting scheme: Answers
to previous questions are highlighted in the grounding
document, which is then provided as input to the model.
inserts a textual prompt within P , based on in-
formation extracted from the conversation history
Hk. In extractive QA, the answer ak is typically a
span within P . Given the input (P, Hk, qk), Mar-
CQAp transforms P into an answer-highlighted
passage P̂k, by constructing a prompt pk and in-
serting it within P . pk is constructed by locating
the beginning and end positions of all historic
answers a1, . . . , ak−1 within P, and inserting a unique
textual marker for each answer in its respective
positions (see example in Figure 3). The input
(P̂k, qk) is then passed to the QA model, instead
of (P, qk).
In abstractive QA, a free-form answer is gen-
erated based on an evidence span that is first
extracted from P. Hence, the final answer does
not necessarily appear in P. To support this set-
ting, MarCQAp highlights the historical evidence
spans (which appear in P) instead of the generated
answers.
To encode positional dialogue information, the
markers for aj ∈ {a1, . . . , ak−1} include its turn index
number in reverse order, that is, k − 1 − j. This
encodes relative historic positioning w.r.t. the cur-
rent question qk, allowing the model to distinguish
between the historic answers by their recency.
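The following sketch illustrates the core of this highlighting scheme (our simplified re-implementation for illustration: it assumes the historic answers are given as character spans within P, uses an illustrative marker format, and omits the 'NO ANSWER' handling described below):

```python
def insert_marcqap_prompts(passage, historic_answer_spans, k):
    """Wrap each historic answer a_j (j = 1..k-1) in P with a textual marker.

    historic_answer_spans: character (start, end) offsets of a_1..a_{k-1} in
    `passage`. Following the reverse-order indexing above, a_j gets the index
    k - 1 - j, so more recent answers receive smaller indices.
    """
    insertions = []
    for j, (start, end) in enumerate(historic_answer_spans, start=1):
        idx = k - 1 - j
        insertions.append((start, f"<{idx}> "))   # marker at the answer's start
        insertions.append((end, f" </{idx}>"))    # marker at the answer's end
    # Insert from the end of the passage backwards so earlier offsets stay
    # valid; overlapping answer spans are handled naturally.
    highlighted = passage
    for pos, marker in sorted(insertions, key=lambda x: x[0], reverse=True):
        highlighted = highlighted[:pos] + marker + highlighted[pos:]
    return highlighted
```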
MarCQAp highlights only the historic answers,
since the corresponding questions do not appear
in P . While this might lead to information loss,
in §5.3 we implement MarCQAp’s variants that
add the historic questions to the input, and show
that the contribution of the historic questions to
the performance is minor.11
A CQA dialogue may also contain unanswer-
able questions. Before inserting the prompts,
MarCQAp first appends a 'NO ANSWER' string
to P.12 Each historical 'NO ANSWER' is then
highlighted with prompts, similarly to ordinary
historical answers. For example see a4 in Figure 3.
MarCQAp has several advantages over prior
approaches. First, since it is prompt-based, it does
not modify the model architecture, which makes
it easier to port across various models, alleviat-
ing the need for model-specific implementation
and training procedures. Moreover, it naturally
represents overlapping answers in P, which was
a limitation in prior work (Qu et al., 2019a,b).
Overlapping answers contain tokens which relate
to multiple turns, yet the existing token-based em-
bedding methods encode the relation of a token
from P only to a single turn from Hk. Since
MarCQAp is span-based, it naturally represents
overlapping historic answers (e.g., see a2 and a3
in Figure 3).
5.2 MarCQAp Evaluation
We evaluate MarCQAp in all our proposed
experimental settings (§3.1). As presented in
Tables 4, 5, y 6, it outperforms all other methods
in all settings. In the standard, high-resource,
and noisy-history settings,
its performance is
very close to PosHAELF ,13 indicating that our
prompt-based approach is an effective alterna-
tive implementation for the idea of highlighting
historical answers. Similarly to PosHAELF , Mar-
CQAp is able to handle long conversations and its
performance gains saturate at 9 turns (Cifra 1).
Sin embargo, in contrast to PosHAELF , MarCQAp
performs especially well in the low-resource and
the domain-shift settings.
In the low-resource settings, MarCQAp out-
performs all methods by a large margin, con
an average Δ% of 13.6% compared to the best
baseline with 6.3%. The dramatic improvement
over PosHAELF ’s average Δ% (1.5% → 13.6%)
serves as a strong indication that our prompt-based
11Which is also in line with the findings in Qu et al.
(2019b).
Figure 4: An example of MarCQAp's robustness in
the low-resource setting. Even though ExCorDLF ,
HAELF , and PosHAELF predict correct answers in the
standard setting, they fail on the same example when
the training data size is reduced to 10%. MarCQAp
predicts a correct answer in both settings.
approach is much more robust. This boost in ro-
bustness is best illustrated in Figure 2, which
presents the Δ% as a function of the training set
size, highlighting PosHAELF (red) and MarCQAp
(green) specifically. An example of MarCQAp's
robustness in the low-resource setting is provided
in Figure 4.
In the domain-shift settings, MarCQAp is
the best performing method in 6 out of 8 do-
mains.14 On the remaining two domains (Cooking
& Movies), CONCAT is the best performing.15
Notably, MarCQAp's average Δ% (19.6%) is
substantially higher compared to the next best
method (14.1%). These results serve as additional
strong evidence of MarCQAp’s robustness.
MarCQAp’s Performance Using Different
LMs
In addition to Longformer, we evaluated
MarCQAp using RoBERTa (Liu et al., 2019)
and BigBird (Zaheer et al., 2020) in the stan-
dard setting. The results are presented in Table 8.
MarCQAp shows a consistent positive effect
across different LMs, which further highlights
its effectiveness.
12Only if it is not already appended to P; in some datasets the passages are always suffixed with 'NO ANSWER'.
13In the standard and high-resource settings MarCQAp's improvements over PosHAELF are not statistically significant.
14For the Travel domain MarCQAp's improvement over ExCorDLF is not statistically significant.
15The differences between CONCAT and MarCQAp for both domains are not statistically significant.
Model              No History   MarCQAp   Δ%
RoBERTa            57.7         68.0      (+17.9%)
BigBird            57.6         66.3      (+15.1%)
Longformer-base    60.0         68.4      (+14.0%)
Longformer-SQuAD   60.4         70.2      (+16.2%)

Table 8: MarCQAp's standard setting performance across different Transformer-based pre-trained LMs.
Model                                        F1
BiDAF++ w/ 2-Context (Choi et al., 2018)     60.1
HAE (Qu et al., 2019a)                       62.4
FlowQA (Huang et al., 2019)                  64.1
GraphFlow (Chen et al., 2020)                64.9
HAM (Qu et al., 2019b)                       65.4
FlowDelta (Yeh and Chen, 2019)               65.5
GHR (Qian et al., 2022)                      73.7
RoR (Zhao et al., 2021)                      74.9
MarCQAp (Ours)                               74.0

Table 9: Results from the official QuAC leaderboard, presenting F1 scores for the hidden test set, for MarCQAp and other models with published papers.
We note that since RoBERTa is a dense-
attention Transformer with input length limita-
ción de 512 tokens, longer passages are split into
chunks. This may lead to some chunks containing
part of the historic answers, and therefore partial
highlighting by MarCQAp. Our analysis showed
that 51% of all examples in QuAC were split
into several chunks, and 61% of the resulting chunks
contained partial highlighting. MarCQAp’s strong
performance with RoBERTa suggests that it can
remain effective even with partial highlighting.
Official QuAC Leaderboard Results For com-
pleteness, we submitted our best performing
modelo (from the high-resource setting) to the
official QuAC leaderboard,16 evaluating its per-
formance on the hidden test set. Mesa 9 presents
the results.17 MarCQAp achieves a very competi-
tive score of 74.0 F1, very close to the published
state-of-the-art (RoR by Zhao et al. [2021] with
74.9 F1), yet with a much simpler model.18
5.3 Prompt Design
Recall that MarCQAp inserts prompts at the begin-
ning and end positions for each historical answer
within P (Cifra 3). The prompts are designed
with predefined marker symbols and include the
answer’s turn index (p.ej., «<1>«). This design
builds on 3 main assumptions: (1) textual prompts
can represent conversation history information,
(2) the positioning of the prompts within P facil-
itates highlighting of historical answers, y (3)
indexing the historical answers encodes valuable
información. We validate our design assumptions
by comparing MarCQAp against ablated variants
(Mesa 10).
16https://quac.ai.
17The leaderboard contains additional results for mod-
els which (at the time of writing) include no descriptions
or published papers, rendering them unsuitable for fair
comparación.
18See §3.3 for a discussion of RoR.
To validate assumption (1), we compare Mar-
CQAp to MARCQAPC, a variant which adds Hk
to the input, in addition to P̂k and qk. MARC-
QAPC is exposed to information from Hk via two
sources: the concatenated Hk and the MarCQAp
prompt within P̂k. We observe a negligible ef-
fect,19 suggesting that MarCQAp indeed encodes
information from the conversation history, since
providing Hk does not add useful information on
top of P̂k.
To validate assumptions (2) y (3), we use
two additional MarCQAp variants. Answer Pos
inserts a constant predefined symbol ("<>"), at
each answer’s beginning and end positions within
PAG (es decir., similar to MarCQAp, but without turn
indexing). Random Pos inserts the same number
of symbols but in random positions within P .
Answer Pos achieves a Δ% of 12.7%, while
Random Pos achieves 1.7%. This demonstrates
that the positioning of the prompts within P is cru-
cial, and that most of MarCQAp’s performance
gains stem from its prompts positioning w.r.t.
historical answers a1, . . . , ak−1. When the prompts
are inserted at meaningful positions, the model
seems to learn to leverage these positions in
order to derive an effective history representa-
tion. Surprisingly, Random Pos leads to a minor
improvement of 1.7%.20 Finally, MarCQAp's im-
provement over Answer Pos (a Δ% of 15.9%
compared to 12.7%), indicates that answer in-
dexing encodes valuable information, helping us
validate assumption (3).
Finally, since textual prompts allow for easy
injection of additional information, we make
19The difference is not statistically significant.
20The difference is statistically significant, we did not
further investigate the reasons behind this particular result.
Variant                 F1
NO HISTORY              52.9
Random Pos              53.8 (+1.7%)
Answer Pos              59.6 (+12.7%)
Full Q                  59.2 (+11.9%)
Word from Q             60.4 (+14.2%)
Word from Q + Index     60.7 (+14.8%)
MARCQAPC                61.5 (+16.3%)
MarCQAp                 61.3 (+15.9%)

Table 10: F1 and Δ% scores for MarCQAp's ablated variants, in the 10% setup of the low-resource setting.
several initial attempts in this direction, inject-
ing different types of information into our textual
prompts. In Word from Q, the marker contains
the first word from the historic answer's corre-
sponding question, which is typically a wh-word.
In Word from Q + Index, we also add the historic
answer's turn index to this marker. In Full Q, we
insert the full historic question into the prompt. Word from Q
and Word from Q + Index achieved comparable
puntuaciones, lower than MarCQAp’s but higher than
Answer Pos’s.21 This suggests that adding se-
mantic information is useful (since Word from Q
outperformed Answer Pos), and that combining
such information with the positional information
is not trivial (since MarCQAp outperformed Word
from Q + Index). This points to the effects of the
prompt structure and the information included:
We see that "<1>" and the wh-word marker each
outperform "<>", yet constructing a prompt by
naively combining these signals does not
lead to a complementary effect. Finally, Word from
Q outperformed Full Q. We hypothesize that
since the full question can be long, it might sub-
stantially interfere with the natural structure of
the passage text. This provides evidence that the
prompts should probably remain compact symbols
with small footprint within the passage. These ini-
tial results call for further exploration of optimal
prompt design in future work.
5.4 Case Study
Figure 5 presents an example of all evaluated
methods in action from the standard setting. The
current question ‘‘Did he have any other crit-
circuitos integrados?’’ has two correct answers: Alan Dershowitz
or Omer Bartov. We first note that all methods
21Both differences are statistically significant.
Figure 5: Our case study example, comparing answers
predicted by each evaluated method in the standard
setting. We provide a detailed analysis in §5.4.
predicted a name of a person, which indicates that
the main subject of the question was captured cor-
rectly. Yet, the methods differ in their prediction
of the specific person.
REWRITE and CONCAT predict a correct answer
(Alan Dershowitz), yet CONCAT predicts it based on
incorrect evidence. This may indicate that CONCAT
did not capture the context correctly (just the fact
that it needs to predict a person's name), and was
lucky enough to guess the correct name.
Interestingly, REWRITEC predicts Daniel Gold-
hagen, which is different
from the answers
predicted by CONCAT and REWRITE. This shows that
combining both methods can yield completely dif-
ferent results, and demonstrates an instance where
REWRITEC performs worse than REWRITE and CON-
CAT (for instance in the 1% low-resource setting).
This is also an example of a history modeling flaw,
since Daniel Goldhagen was already mentioned
as a critic in previous conversation turns.
This example also demonstrates how errors
can propagate through a pipeline-based system.
The gold rewritten question is ‘‘Did Norman
Finkelstein have any other critics aside from
Peter Novick and Daniel Goldhagen?'',22 while
the question rewriting model generated ‘‘Besides
Peter Novick, did Norman Finkelstein have any
other critics?'', omitting Daniel Goldhagen. This
makes it impossible for REWRITE to figure out
that Daniel Goldhagen was already mentioned,
making it a legitimate answer. This reveals that
REWRITE might have also gotten lucky and provides
a possible explanation for the incorrect answer
predicted by REWRITEC.
ExCorDLF , HAELF , and PosHAELF not only
predict a wrong answer, but also seem to fail
to resolve the conversational coreferences, since
the pronoun ‘‘he’’, in the current question ‘‘Did
he have any other critics?'', refers to Norman
Finkelstein.
MarCQAp predicts a correct answer, Omer
Bartov. This demonstrates an instance where Mar-
CQAp succeeds while HAELF and PosHAELF
fail, even though they are all history-highlighting
methods. Interestingly, MarCQAp is the only
model that predicts Omer Bartov, a non-trivial
choice compared to Alan Dershowitz, since Omer
Bartov appears later in the passage, further away
from the historic answers.
6 Limitations
This work focuses on a single-document CQA
configuración, which is in line with the majority of the
previous work on conversation history model-
ing in CQA (§2.3). Correspondingly, MarCQAp
was designed for single-document CQA. Apply-
ing MarCQAp in multi-document settings (Qu
et al., 2020; Anantha et al., 2021; Adlakha et al.,
2022) may result in partial history representation,
since the retrieved document may contain only
part of the historic answers, therefore MarCQAp
will only highlight the answers which appear in
the document.23
In §5.3 we showed initial evidence that Mar-
CQAp prompts can encode additional informa-
tion that can be useful for CQA. In this work we
focused on the core idea behind prompt-based an-
swer highlighting, as a proposed solution in light
of our results in §4. Yet, we did not conduct a com-
22As annotated in CANARD (Elgohary et al., 2019).
23We note that this limitation applies to all highlighting
approaches, including HAE and PosHAE (Qu et al., 2019a,b).
prehensive exploration in search of the optimal
prompt design, and leave this for future work.
7 Conclusion
In this work, we carry out the first compre-
hensive robustness study of history modeling
approaches for Conversational Question Answer-
En g (CQA), including sensitivity to model and
training data size, domain shift, and noisy history
aporte. We revealed limitations of the existing
benchmark-based evaluation, by demonstrating
that it cannot reflect the models' robustness to
such changes in setting. Moreover, we proposed
evaluation practices that better isolate the contri-
bution of the history modeling component, y
demonstrated their usefulness.
We also discovered that highlighting historic
answers via passage embedding is very effective
in standard setups, but it suffers from substantial
performance degradation in low data and domain
shift settings. Following this finding, we design
a novel prompt-based history highlighting ap-
proach. We show that highlighting with prompts,
rather than with embeddings, significantly im-
prove robustness, while maintaining overall high
actuación.
Our approach can be a good starting point for
future work, due to its high effectiveness, robusto-
ness, and portability. We also hope that the insights
from our study will encourage evaluations with
focus on robustness, leading to better CQA systems.
Acknowledgments
We would like to thank the action editor and
the reviewers, as well as the members of the
IE@Technion NLP group and Roee Aharoni for
their valuable feedback and advice. The Technion
team was supported by the Zuckerman Fund to the
Technion Artificial Intelligence Hub (Tech.AI).
This research was also supported in part by a
grant from Google.
References
Vaibhav Adlakha, Shehzaad Dhuliawala, Kaheer Suleman, Harm de Vries, and Siva Reddy. 2022. TopiOCQA: Open-domain conversational question answering with topic switching. Transactions of the Association for Computational Linguistics, 10:468–483. https://doi.org/10.1162/tacl_a_00471
Armen Aghajanyan, Dmytro Okhonko, Mike
Lewis, Mandar Joshi, Hu Xu, Gargi Ghosh,
and Luke Zettlemoyer. 2022. HTLM: Hyper-
text pre-training and prompting of language
modelos. In International Conference on Learn-
ing Representations.
Raviteja Anantha, Svitlana Vakulenko, Zhucheng
Tu, Shayne Longpre, Stephen Pulman, and
Srinivas Chappidi. 2021. Open-domain ques-
tion answering goes conversational via ques-
tion rewriting. In Proceedings of the 2021
Conference of the North American Chapter of
the Association for Computational
Linguistics: Human Language Technologies,
NAACL-HLT 2021, Online, June 6–11, 2021,
pages 520–534. Association for Computational
Linguistics. https://doi.org/10
.18653/v1/2021.naacl-main.44
Iz Beltagy, Matthew E. Peters, and Arman
Cohán. 2020. Longformer: The long-document
transformador. CORR, abs/2004.05150.
Eyal Ben-David, Nadav Oved, and Roi Reichart.
2022. PADA: Example-based prompt learning
for on-the-fly adaptation to unseen domains.
Transactions of the Association for Computatio-
nal Linguistics, 10:414–433. https://doi
.org/10.1162/tacl_a_00468
Tom Brown, Benjamin Mann, Nick Ryder,
Melanie Subbiah, Jared D. Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam,
Girish Sastry, Amanda Askell, Sandhini
Agarwal, Ariel Herbert-Voss, Gretchen
Krueger, Tom Henighan, Rewon Child, Aditya
Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens
Winter, Chris Hesse, Mark Chen, Eric
Sigler, Mateusz Litwin, Scott Gray, Benjamin
Chess, Jack Clark, Christopher Berner, Sam
McCandlish, Alec Radford, Ilya Sutskever,
and Dario Amodei. 2020. Language models
are few-shot learners. In Advances in Neural
Information Processing Systems, volume 33,
pages 1877–1901. Curran Associates, Inc.
Jon Ander Campos, Arantxa Otegi, Aitor Soroa,
Jan Deriu, Mark Cieliebak, and Eneko Agirre.
2020. Doqa – accessing domain-specific faqs
via conversational QA. In Proceedings of the
58th Annual Meeting of the Association for
Computational Linguistics, ACL 2020, Online,
July 5–10, 2020, pages 7302–7314. Association
for Computational Linguistics.
Yu Chen, Lingfei Wu, and Mohammed J. Zaki.
2020. Graphflow: Exploiting conversation flow
with graph neural networks for conversational
machine comprehension. In Proceedings of
the Twenty-Ninth International Joint Confer-
ence on Artificial Intelligence, IJCAI 2020,
pages 1230–1236. ijcai.org. https://doi
.org/10.24963/ijcai.2020/171
Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar,
Wen-tau Yih, Yejin Choi, Percy Liang, and
Luke Zettlemoyer. 2018. Quac: Question an-
swering in context. In Proceedings of the
2018 Conference on Empirical Methods in
Natural Language Processing, Brussels, Bel-
gium, October 31 – November 4, 2018,
pages 2174–2184. Association for Computational
Linguistics. https://doi.org/10
.18653/v1/D18-1241
Rotem Dror, Gili Baumer, Segev Shlomov, and
Roi Reichart. 2018. The hitchhiker's guide
to testing statistical significance in natural
language processing. In Proceedings of the
56th Annual Meeting of the Association for
Computational Linguistics, ACL 2018, Mel-
bourne, Australia, July 15–20, 2018, Volume 1:
Long Papers, pages 1383–1392. Association for
Computational Linguistics. https://doi
.org/10.18653/v1/P18-1128
Rotem Dror, Lotem Peled-Cohen, Segev
Shlomov, and Roi Reichart. 2020. Statisti-
cal Significance Testing for Natural Language
Processing. Synthesis Lectures on Human
Language Technologies. Morgan & Claypool
Publishers. https://doi.org/10.1007
/978-3-031-02174-9
Ahmed Elgohary, Denis Peskov, and Jordan L.
Boyd-Graber. 2019. Can you unpack that?
Learning to rewrite questions-in-context.
In Proceedings of the 2019 Conference on
Empirical Methods in Natural Language
Processing and the 9th International Joint
Conference on Natural Language Processing,
EMNLP-IJCNLP 2019, Hong Kong, China,
November 3–7, 2019, pages 5917–5923.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/D19
-1605
Somil Gupta, Bhanu Pratap Singh Rawat,
and Hong Yu. 2020. Conversational ma-
chine comprehension: A literature review. In
Proceedings of the 28th International Confer-
ence on Computational Linguistics, COLECCIONAR
2020, Barcelona, Spain (Online), December
8–13, 2020, pages 2739–2753. Interna-
tional Committee on Computational Linguis-
tics. https://doi.org/10.18653/v1
/2020.coling-main.247
Hsin-Yuan Huang, Eunsol Choi, and Wen-tau
Yih. 2019. Flowqa: Grasping flow in history
for conversational machine comprehension. In
7th International Conference on Learning Rep-
resentaciones, ICLR 2019, Nueva Orleans, LA,
EE.UU, May 6–9, 2019. OpenReview.net.
Gangwoo Kim, Hyunjae Kim, Jungsoo Park, and
Jaewoo Kang. 2021. Learn to resolve conver-
sational dependency: A consistency training
framework for conversational question an-
swering. In Proceedings of the 59th Annual
Meeting of the Association for Computational
Linguistics and the 11th International Joint
Conference on Natural Language Process-
En g, ACL/IJCNLP 2021,
(Volumen 1: Largo
Documentos), Virtual Event, August 1–6, 2021,
pages 6130–6141. Asociación de Computación-
lingüística nacional. https://doi.org/10
.18653/v1/2021.acl-long.478
Diederik P. Kingma and Jimmy Ba. 2015. Adán:
A method for stochastic optimization. In 3rd
International Conference on Learning Repre-
sentations, ICLR 2015, San Diego, CA, USA,
May 7–9, 2015, Conference Track Proceedings.
Teven Le Scao and Alexander Rush. 2021.
How many data points is a prompt worth?
In Proceedings of the 2021 Conference of
the North American Chapter of the Associ-
ation for Computational Linguistics: Human
Language Technologies, pages 2627–2636,
Online. Association for Computational Linguis-
tics. https://doi.org/10.18653/v1
/2021.naacl-main.208
Huihan Li, Tianyu Gao, Manan Goenka, and
Danqi Chen. 2022. Ditch the gold stan-
dard: Re-evaluating conversational question
answering. In Proceedings of the 60th Annual
Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), ACL
2022, Dublin, Ireland, May 22–27, 2022,
pages 8074–8085. Association for Computational
Linguistics. https://doi.org/10
.18653/v1/2022.acl-long.555
Sheng-Chieh Lin, Jheng-Hong Yang, Rodrigo
Nogueira, Ming-Feng Tsai, Chuan-Ju Wang,
and Jimmy Lin. 2020. Conversational question
reformulation via sequence-to-sequence archi-
tectures and pretrained language models. CORR,
abs/2004.01909.
Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao
Jiang, Hiroaki Hayashi, and Graham Neubig.
2021. Pre-train, prompt, and predict: A system-
atic survey of prompting methods in natural
language processing. CORR, abs/2107.13586.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019. RoBERTa: A robustly op-
timized BERT pretraining approach. CORR,
abs/1907.11692.
Ilya Loshchilov and Frank Hutter. 2019. De-
coupled weight decay regularization. In 7th
International Conference on Learning Repre-
sentations, ICLR 2019, New Orleans, LA, USA,
May 6–9, 2019. OpenReview.net.
Jin Qian, Bowei Zou, Mengxing Dong, Xiao
Li, AiTi Aw, and Yu Hong. 2022. Capturing
conversational interaction for question answer-
ing via global history reasoning. In Findings
of the Association for Computational Lin-
guistics: NAACL 2022, Seattle, WA, United
States, July 10–15, 2022, pages 2071–2078.
Association for Computational Linguis-
tics. https://doi.org/10.18653/v1
/2022.findings-naacl.159
Chen Qu, Liu Yang, Cen Chen, Minghui
Qiu, W. Bruce Croft, and Mohit Iyyer. 2020.
Open-retrieval conversational question answer-
ing. In Proceedings of the 43rd International
ACM SIGIR Conference on Research and
Development in Information Retrieval.
Chen Qu, Liu Yang, Minghui Qiu, W.. Bruce Croft,
Yongfeng Zhang, and Mohit Iyyer. 2019a.
BERT with history answer embedding for con-
versational question answering. In Proceedings
of the 42nd International ACM SIGIR Confe-
rence on Research and Development in Infor-
mation Retrieval, SIGIR 2019, Paris, France,
July 21–25, 2019, pages 1133–1136. ACM.
Chen Qu, Liu Yang, Minghui Qiu, Yongfeng
Zhang, Cen Chen, W. Bruce Croft, and Mohit
Iyyer. 2019b. Attentive history selection for
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
mi
d
tu
/
t
a
C
yo
/
yo
a
r
t
i
C
mi
–
pag
d
F
/
d
oh
i
/
.
1
0
1
1
6
2
/
t
yo
a
C
_
a
_
0
0
5
4
9
2
1
4
5
1
9
9
/
/
t
yo
a
C
_
a
_
0
0
5
4
9
pag
d
.
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
conversational question answering. In Proceed-
ings of the 28th ACM International Conference
on Information and Knowledge Management,
CIKM 2019, Beijing, China, November 3–7,
2019, pages 1391–1400. ACM.
Colin Raffel, Noam Shazeer, Adam Roberts,
Katherine Lee, Sharan Narang, Michael
Matena, Yanqi Zhou, Wei Li, and Peter J.
Liu. 2020. Exploring the limits of transfer
learning with a unified text-to-text
trans-
anterior. Journal of Machine Learning Research,
21:140:1–140:67.
Pranav Rajpurkar, Robin Jia, and Percy Liang.
2018. Know what you don't know: Unanswer-
able questions for squad. In Proceedings of
the 56th Annual Meeting of the Association
for Computational Linguistics, ACL 2018, Mel-
bourne, Australia, July 15–20, 2018, Volume 2:
Short Papers, pages 784–789. Association for
Computational Linguistics. https://doi
.org/10.18653/v1/P18-2124
Pranav Rajpurkar, Jian Zhang, Konstantin
Lopyrev, and Percy Liang. 2016. Squad:
100,000+ questions for machine compre-
hension of text. In Proceedings of the
2016 Conference on Empirical Methods in
Natural Language Processing, EMNLP 2016,
Austin, Texas, USA, November 1–4, 2016,
pages 2383–2392. The Association for
Computational Linguistics. https://doi
.org/10.18653/v1/D16-1264
Siva Reddy, Danqi Chen, and Christopher D.
Manning. 2019. Coqa: A conversational ques-
tion answering challenge. Transactions of
the Association for Computational Linguistics,
7:249–266. https://doi.org/10.1162
/tacl_a_00266
Svitlana Vakulenko, Shayne Longpre, Zhucheng
Tu, and Raviteja Anantha. 2021. Question
rewriting for conversational question answer-
En g. In WSDM ’21, The Fourteenth ACM
International Conference on Web Search and
Data Mining, Virtual Event, Israel, March
8–12, 2021, pages 355–363. ACM.
Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Lukasz Kaiser, and Illia Polosukhin. 2017. At-
tention is all you need. In Advances in Neural
Information Processing Systems 30: Annual
Conference on Neural Information Processing
Systems 2017, December 4–9, 2017, Long
Beach, California, EE.UU, pages 5998–6008.
Thomas Wolf, Lysandre Debut, Victor Sanh,
Julien Chaumond, Clement Delangue, Anthony
Moi, Pierric Cistac, Tim Rault, Remi Louf,
Morgan Funtowicz, Joe Davison, Sam Shleifer,
Patrick von Platen, Clara Ma, Yacine Jernite,
Julien Plu, Canwen Xu, Teven Le Scao,
Sylvain Gugger, Mariama Drame, Quentin
Lhoest, and Alexander Rush. 2020. Trans-
formers: State-of-the-art natural language
processing. In Proceedings of the 2020 Con-
ference on Empirical Methods in Natural
Language Processing: System Demonstrations,
pages 38–45, Online. Association for Computational
Linguistics. https://doi.org/10
.18653/v1/2020.emnlp-demos.6
Yi-Ting Yeh and Yun-Nung Chen. 2019.
Flowdelta: Modeling flow information gain in
reasoning for conversational machine compre-
hension. In Proceedings of the 2nd Workshop
on Machine Reading for Question Answering,
MRQA@EMNLP 2019, Hong Kong, China,
November 4, 2019, pages 86–90. Association
for Computational Linguistics.
Manzil Zaheer, Guru Guruganesh, Kumar
Avinava Dubey, Joshua Ainslie, Chris Alberti,
Santiago Ontañón, Philip Pham, Anirudh
Ravula, Qifan Wang, Li Yang, and Amr Ahmed.
2020. Big bird: Transformers for longer se-
quences. In Advances in Neural Information
Sistemas de procesamiento 33: Annual Conference
on Neural
Sistemas de procesamiento de información
2020, NeurIPS 2020, December 6–12, 2020,
virtual.
Jing Zhao, Junwei Bao, Yifan Wang, Yongwei
Zhou, Youzheng Wu, Xiaodong He, and Bowen
Zhou. 2021. Ror: Read-over-read for long
document machine reading comprehension.
In Findings of the Association for Compu-
tational Linguistics: EMNLP 2021, Virtual
Event / Punta Cana, Dominican Republic,
16–20 November, 2021, pages 1862–1872.
Association for Computational Linguis-
tics. https://doi.org/10.18653/v1
/2021.findings-emnlp.160
Chenguang Zhu, Michael Zeng, and Xuedong
Huang. 2018. Sdnet: Contextualized attention-
based deep network for conversational question
answering. CORR, abs/1812.03593.
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
mi
d
tu
/
t
a
C
yo
/
yo
a
r
t
i
C
mi
–
pag
d
F
/
d
oh
i
/
.
1
0
1
1
6
2
/
t
yo
a
C
_
a
_
0
0
5
4
9
2
1
4
5
1
9
9
/
/
t
yo
a
C
_
a
_
0
0
5
4
9
pag
d
.
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3