TopiOCQA: Open-domain Conversational Question Answering with Topic Switching

Vaibhav Adlakha1,4  Shehzaad Dhuliawala2  Kaheer Suleman3  Harm de Vries4  Siva Reddy1,5
1Mila, McGill University, Canada
2ETH Zürich, Switzerland
3Microsoft Montréal, Canada
4ServiceNow Research, Canada
5Facebook CIFAR AI Chair, Canada
{vaibhav.adlakha,siva.reddy}@mila.quebec

Abstract
In a conversational question answering scenario, a questioner seeks to extract information about a topic through a series of interdependent questions and answers. As the conversation progresses, they may switch to related topics, a phenomenon commonly observed in information-seeking search sessions. However, current datasets for conversational question answering are limiting in two ways: 1) they do not contain topic switches; and 2) they assume the reference text for the conversation is given, that is, the setting is not open-domain. We introduce TOPIOCQA (pronounced Tapioca), an open-domain conversational dataset with topic switches based on Wikipedia. TOPIOCQA contains 3,920 conversations with information-seeking questions and free-form answers. On average, a conversation in our dataset spans 13 question-answer turns and involves four topics (documents). TOPIOCQA poses a challenging test-bed for models, where efficient retrieval is required on multiple turns of the same conversation, in conjunction with constructing valid responses using conversational history. We evaluate several baselines, by combining state-of-the-art document retrieval methods with neural reader models. Our best model achieves F1 of 55.8, falling short of human performance by 14.2 points, indicating the difficulty of our dataset. Our dataset and code are available at https://mcgill-nlp.github.io/topiocqa.
1 Introduction
People often engage in information-seeking conversations to discover new knowledge (Walton, 2019). In such conversations, a questioner (the seeker) asks multiple rounds of questions to an answerer (the expert). As the conversation proceeds, the questioner becomes inquisitive of new but related topics based on the information provided in the answers (Stede and Schlangen, 2004). Such topic switching behaviour is natural in information-seeking conversations and is commonly observed when people seek information through search engines (Spink et al., 2002).
According to Spink et al., people switch from
one to ten topics with a mean of 2.11 topic switches
per search session. For example, a person can start
a search session about tennis, and then land on
Roger Federer, and after learning a bit about him
may land on his country Switzerland, and spend
more time learning about other Swiss athletes.
Thanks to tremendous progress in question an-
swering research (Rogers et al., 2021), we are
coming close to enabling information-seeking
conversations with machines (as opposed to just
using keywords-based search). In order to real-
ize this goal further, it is crucial to construct
datasets that contain information-seeking conver-
sations with topic switching, and measure progress
of conversational models on this task, the two
primary contributions of this work.
In the literature, a simplified setting of information-seeking conversation known as conversational question answering (CQA) has been deeply explored (Choi et al., 2018; Reddy et al., 2019). In this task, the entire conversation is based on a given reference text of a topic/entity. While the CQA task is challenging, it still falls short of the real-world setting, where the reference text is not known beforehand (first limitation) and the conversation is not restricted to a single topic (second limitation).
Qu et al. (2020) and Anantha et al. (2021) have attempted to overcome the first limitation by adapting existing CQA datasets to the open-domain setting. They do so by obtaining context-independent rewrites of the first question to make the question independent of the reference text. For example, if the reference text is about Augusto
Pinochet and the conversation starts with a question ‘‘Was he known for being intelligent?’’, the question is re-written to ‘‘Was Augusto Pinochet known for being intelligent?’’. However, as the entire question sequence in the conversation was collected with a given reference text of a topic, all the turns still revolve around a single topic.
In this work, we present TOPIOCQA1—Topic
switching in Open-domain Conversational Question
Answering—a large-scale dataset for information-
seeking conversations in open-domain based on
the Wikipedia corpus. We consider each Wikipedia
document to be a separate topic. The conversations
in TOPIOCQA start with a real information-seeking
question from Natural Questions (Kwiatkowski
et al., 2019) in order to determine a seed topic
(document), and then the questioner may shift to other related topics (documents) as the conversation progresses.2 Throughout the conversation, the questioner is never shown the content of the documents (but only the main title and section ti-
tles) to simulate an information-seeking scenario,
whereas the answerer has full access to the content
along with the hyperlink structure for navigation.
In each turn, both questioner and answerer use
free-form text to converse (as opposed to extrac-
tive text spans as is common for an answerer in
many existing datasets).
Figure 1 shows an example of a conversation
from our dataset. The first question leads to the
seed topic Byzantine Empire, and after two turns
switches to Mehmed the Conqueror in Q4, based
on part of the answer (A3) that contains reference
to Mehmed. Note that the answers A1, A3, and A4
are free-form answers that do not occur as spans
in either the seed document or the follow up doc-
umento. The topic then switches to Anatolia based
on part of the previous answer (A4). The topics
change in further turns to Turkey and Ankara.
Because of the conversational nature, TOPIOCQA
contains questions rife with complex coreference
phenomena, for example, Q9 relies on entities
mentioned in A7, A8 and Q1.
TOPIOCQA contains 3,920 conversations and 50,574 QA pairs, based on a Wikipedia corpus of 5.9 million documents. On average, a conversation has 13 question-answer turns and involves 4 topics. Twenty-eight percent of turns in our dataset require retrieving a document different from the previous turn. To the best of our knowledge, TOPIOCQA is the first open-domain information-seeking CQA dataset that incorporates topical switches, along with other desirable properties (see Table 1).

1TOPIOCQA is pronounced as Tapioca.
2A portion of the training data also contains conversations where the questioner asks the first question given a seed topic.

Figure 1: A conversation from TOPIOCQA. Our dataset has information-seeking questions with free-form answers across multiple topics (documents). Consecutive turns from the same topic (document) have been excluded for brevity.
To investigate the difficulty of the TOPIOCQA dataset, we benchmark several strong retriever-reader neural baselines, considering both sparse and dense retrievers, as well as extractive and generative readers (Karpukhin et al., 2020; Izacard and Grave, 2021). Inspired by previous work, we explore two ways to represent the question: (1) concatenating the entire conversation history (Qu et al., 2020), and (2) self-contained rewrites of the conversational question (Anantha et al., 2021). The best performing model—Fusion-in-Decoder (Izacard and Grave, 2021) trained on concatenated conversation history—is 14.2 F1 points short of human performance, indicating significant room for improvement. We also evaluate GPT-3 to estimate the performance in a closed-book zero-shot setting, and its performance is 38.2 F1 points below the human performance.
Dataset | Multi-turn | Open-domain | Free-form answers | Information-seeking questions | Topic Switching
TOPIOCQA (ours) | ✓ | ✓ | ✓ | ✓ | ✓
QReCC (Anantha et al., 2021) | ✓ | ✓ | ✓ | partial | partial
OR-QuAC (Qu et al., 2020) | ✓ | ✓ | ✗ | ✓ | ✗
CoQA (Reddy et al., 2019) | ✓ | ✗ | ✓ | ✗ | ✗
QuAC (Choi et al., 2018) | ✓ | ✗ | ✗ | ✓ | ✗
NarrativeQA (Kočiský et al., 2018) | ✗ | ✓ | ✓ | ✓ | ✗
Natural Questions (Kwiatkowski et al., 2019) | ✗ | ✓ | ✗ | ✓ | ✗
SQuAD 2.0 (Rajpurkar et al., 2018) | ✗ | ✗ | ✗ | ✗ | ✗

Table 1: Comparison of TOPIOCQA with other QA datasets. TOPIOCQA incorporates topical changes, along with several best practices of previous datasets. A ''partial'' entry indicates that only a proportion of the dataset satisfies the property.
2 Related Work
2.1 Open-Domain Question Answering
In open-domain question answering, a model has
to answer natural language questions by retriev-
ing relevant documents. This can be considered
as a simplified setting of open-domain CQA,
where the conversation is limited to just one
doblar. Several datasets have been proposed for this
task. On the one hand, reading comprehension data-
sets like SQuAD (Rajpurkar et al., 2016, 2018),
which consist of (pregunta, documento, respuesta)
triplets, have been adapted for the task by with-
holding access to the document (Chen et al., 2017).
While these datasets have been helpful in spurring
modelling advances, they suffer from an anno-
tator bias because they were not collected in an
information-seeking setup. That is, annotators had
access to the target answer and its surrounding
context and therefore formulated questions that
had a high lexical overlap with the answer (Jia
and Liang, 2017). On the other hand, Web-search
based datasets do not suffer from such artefacts
because they are curated from real search engine
queries. The WikiQA (Yang et al., 2015) and MS MARCO (Nguyen et al., 2016) datasets contain queries from the Bing search engine, whereas
Natural Questions (Kwiatkowski et al., 2019)
contain queries from the Google search engine.
Models for open-domain QA often follow a two-stage process: (1) A retriever selects a small collection of documents relevant to the question from a big corpus (e.g., Wikipedia), (2) a reader extracts or generates an answer from the selected documents. While classical approaches rely on counting-based bag-of-words representations like TF-IDF or BM25 (Chen et al., 2017; Wang et al., 2018; Yang et al., 2019), more recent deep learning approaches learn dense representations of the questions and documents through a dual-encoder structure (Lee et al., 2019; Karpukhin et al., 2020). In such learned retriever setups, document retrieval is done efficiently using Maximum Inner Product Search (MIPS, Shrivastava and Li, 2014).
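As a minimal illustration of the dual-encoder setup (a sketch, not the DPR implementation), retrieval reduces to an inner product between a query embedding and precomputed passage embeddings; the exact search shown here is what MIPS libraries approximate at corpus scale.

import numpy as np

def retrieve_top_k(query_vec, passage_matrix, k=100):
    """Rank passages by inner product with the query embedding.

    query_vec: (d,) array from the question encoder.
    passage_matrix: (num_passages, d) array of precomputed passage embeddings.
    """
    scores = passage_matrix @ query_vec            # inner-product scores
    k = min(k, scores.shape[0])
    top = np.argpartition(-scores, k - 1)[:k]      # unordered top-k candidates
    return top[np.argsort(-scores[top])]           # indices sorted by decreasing score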
2.2 Conversational Question
Answering (CQA)
CQA extends the reading comprehension task
from a single turn to multiple turns. Given a refer-
ence document, a system is tasked with interac-
tively answering a sequence of information-seeking
questions about the corresponding document. This
conversational extension leads to novel challenges
in modeling linguistic phenomena such as ana-
phora (referencing previous turns) and ellipsis
(omitting words from questions), as well as in per-
forming pragmatic reasoning. Large-scale con-
versational datasets such as CoQA (Reddy et al.,
2019) and QuAC (Choi et al., 2018) have facil-
itated much of the research in this area. Estos
datasets differ along several dimensions, two of
cuales son (1) CoQA has short free-form answers,
whereas QuAC has long extractive span-based an-
respuestas, y (2) unlike CoQA, QuAC is collected
in a simulated information-seeking scenario.
Models for CQA have used simple concatena-
tion of the question-answer history (Zhu et al.,
2019), history turn selection (Qu et al., 2019a,b),
and question-rewrites (Vakulenko et al., 2021).
For question-rewriting, a different module is
trained on self-contained rewrites of context-
dependent questions. For example, a plausible rewrite of Q8 (Figure 1) is ‘‘can you name
some of the major cities in Turkey apart from
Ankara?''. The re-written question is then an-
swered using open-domain QA systems. Two pop-
ular question-rewriting datasets for training this
module are (1) CANARD (Elgohary et al., 2019),
which contains re-writes of 50% of QuAC, and
(2) QReCC (Anantha et al., 2021), which contains
rewrites of the entire QuAC dataset and a small
portion from other sources.
2.3 Open-Domain CQA
In this work, we focus on constructing a chal-
lenging benchmark for open-domain CQA. El
open-domain aspect requires systems to answer
questions without access to a reference document.
The conversational aspect enables users to ask
multiple related questions, which can, in principle,
span several different topics. With TOPIOCQA, we
introduce the first open-domain CQA dataset that
explicitly covers such topical switches.
Previous datasets for
this task re-purpose
existing CQA datasets. The OR-QuAC dataset
(Qu et al., 2020) is automatically constructed
from QuAC (Choi et al., 2018) and CANARD
(Elgohary et al., 2019) by replacing the first ques-
tion in QuAC with context-independent rewrites
from CANARD. QReCC (Anantha et al., 2021)
is a large-scale open-domain CQA and question
rewriting dataset that contains conversations from
QuAC, TREC CAsT (Dalton et al., 2020), and Nat-
ural Questions (NQ; Kwiatkowski et al., 2019). All
the questions in OR-QuAC and 78% of questions
in QReCC are based on QuAC. As conversations
in QuAC were collected with a given reference
document, the question sequences of these con-
versations revolve around the topic or entity cor-
responding to that document. Twenty-one percent
of questions in QReCC are from NQ-based con-
versations. As NQ is not a conversational dataset,
the annotators of QReCC use NQ to start a con-
versation. A single annotator is tasked with providing both follow-up questions and answers for a given NQ question. In contrast to QReCC, conversations in our dataset are collected in a simulated information-seeking scenario using two annotators (Section 3.3).
Deep learning models for this task have fol-
lowed a similar retriever-reader setup as open-
domain QA. Instead of a single question, previous
works have explored feeding the entire conversa-
tion history (Qu et al., 2020), or a context indepen-
dent re-written question (Anantha et al., 2021).
3 Dataset Collection
Each conversation in TOPIOCQA is an interac-
tion between two annotators—a questioner and an
answerer. The details about the annotator selec-
tion are provided in Appendix A.
3.1 Seed Topics and Document Collection
The seed topics essentially drive the conversation.
In order to make them interesting for annotators,
we select the good3 articles of Wikipedia as seed
topics (around 35k) for the first turn, but use the entire Wikipedia for later turns. We used the Wikipedia dump from 10/20/2020, which con-
sists of 5.9 million documents. We used Wikiex-
tractor4 to extract the text. While pre-processing
the Wikipedia documents, we retain the hyper-
links that refer to other Wikipedia documents,
thus ensuring that we can provide all the doc-
uments requested by annotators (via hyperlinks)
during the conversation.
3.2 Simulating Information-seeking Scenario
Information-seeking conversations are closer to
the real-world if an information need can be
simulated via the data collection interface. In TOPIOCQA, we achieve this by withholding the questioner's access to the full reference text of the document. The questioner can only see the metadata (main title and the section titles) of the
Wikipedia documents, whereas the answerer can
access the entire text of the documents. On finding
the answer, the answerer highlights a contiguous
span of text as rationale, and generates a free-form
respuesta. The answerer also has the option to mark
the question as unanswerable. The conversation
history is visible to both the annotators.
As a conversation starting point, the first question is sampled from a subset of NQ
(Kwiatkowski et al., 2019) since NQ contains
genuine information-seeking questions asked on
Google. We only sample those questions for which
the answer is in our seed document pool. To increase the diversity of our dataset, we also al-
low the questioner to formulate the first question
based on the provided seed topic entity for 28%
of the conversations.
3Wikipedia Good articles.
4github:wikiextractor.
3.3 Enabling Topic-switching
The key feature of the interface is enabling topic
switching via hyperlinks. For the answerer, the text of the document includes clickable hyperlinks to other documents. On clicking these links, the current document in the answerer's interface changes to the requested (clicked) document. This
enables the answerer to search for answers in doc-
uments beyond the current one. The questioner
can access the metadata of documents visited by
the answerer and documents present in the ration-
ale of the answers. For example, let us assume
that given the seed document Daniel Radcliffe and
the first question ‘‘Where was Daniel Radcliffe
born?'', the answerer selects the ‘‘Daniel Jacob
Radcliffe was born in London on 23 July 1989’’
span as rationale and provides ‘‘London’’ as the
answer. If London is a hyperlink in the rationale
span, then the metadata of both Daniel Radcliffe
and London is available to the questioner to form
the next question. If the next question is ‘‘What
is its population?'',
the answerer can switch
the current document from Daniel Radcliffe to
London by clicking on the hyperlink, and can
then find and provide the answer. The conversation up till this point involves two topics: Daniel Radcliffe and London. We also provide easy navigation to previously visited documents for both the annotators. This interface design (Figure 8) ensures that information about the new topic is semantically connected to topics of the previous turns, similar to natural human-human conversations (Sacks and Jefferson, 1995).
3.4 Additional Annotations
To account for multiple valid answers, we col-
lected three additional annotations for answers of
conversations in evaluation sets (development and
test splits). For this task, at any turn, the annotator
can see all the previous questions and original an-
swers. Showing original answers of previous turns
is important in a conversational setting as the sub-
sequent questions can potentially depend on them.
We also provide the list of documents correspond-
ing to previous turns of the original conversation.
This ensures that the current annotator has all
the information the original answerer had while
providing the answer. Similar to the answerer,
the annotator then provides the rationale and the
answer, or marks the question as unanswerable.
Statistic | Train | Dev | Test | Overall
# Turns | 45,450 | 2,514 | 2,502 | 50,466
# Conversations | 3,509 | 205 | 206 | 3,920
# Tokens / Question | 6.91 | 6.89 | 7.11 | 6.92
# Tokens / Answer | 11.71 | 11.96 | 12.27 | 11.75
# Turns / Conversation | 13 | 12 | 12 | 13
# Topics / Conversation | 4 | 4 | 4 | 4

Table 2: Dataset statistics of TOPIOCQA.
4 Dataset Analysis
We collected a total of 3,920 conversations, consisting of 50,466 turns. The annotators were encouraged to complete a minimum of 10 turns. Conversations with fewer than 5 turns were discarded. We split the data into train, development, and test splits.
Table 2 reports simple statistics of the dataset splits. On average, a conversation in TOPIOCQA has 13 question-answer turns and is based on 4 documents. Our dataset differs from other conversational question-answering datasets by incorporating topic switches in the conversation.
4.1 Topic Switching
Before we start our analysis, let us first define the
notion of a topic switch in TOPIOCQA. Recall that
answers are based on Wikipedia articles, where each document consists of several sections. While
one can argue that a topic switch occurs when the
answer is based on a different section of the same
document, we opt for a more conservative notion
and define a switch of topic if the answer is based
on a different Wikipedia document.
Number of Topics vs Conversation Length
We begin our analysis by investigating how the
number of topics varies with the conversation
length. In Figure 2(a) we show a heat-map of the number of topics for each conversation length, where each column is normalized by the number of conversations of that length. We observe that longer conversations usually include more topics. Most 10-turn conversations include 3 topics, 14-turn conversations include 4 topics, and 18-turn conversations include 5 topics. The conversations with fewer than 10 turns mostly include just 2 topics.
Topic Flow in Conversation Next, we examine
how often consecutive questions stay within the
Figure 2: Analysis of the topic switches in TOPIOCQA. In (a) we show the distribution of the number of topics (in percent) for each conversation length. Longer conversations typically include more topics. In (b) we show a histogram of the topic lengths, illustrating that usually 3–4 consecutive questions stay within the same topic.

Figure 3: A flow diagram of topic switches over conversations up to 15 turns. There are complex interactions between the topics, especially later in the conversation.
same topic. To do so, we first cluster conversations into sequences of turns for which all answers are from the same document. Then, we count how many turns belong to topic clusters of a particular length. Figure 2(b) shows the distribution of topic lengths. The mode of the distribution is 3, signifying that annotators usually ask 3 questions about the same topic before switching. Asking 2 or 4 consecutive questions on the same topic is also frequently observed. However, we rarely see more than 10 consecutive turns on the same topic.
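This segmentation can be sketched in a few lines, assuming each turn has been reduced to the Wikipedia document its answer comes from; the document names in the example are illustrative.

from itertools import groupby

def topic_segment_lengths(turn_documents):
    """Lengths of maximal runs of consecutive turns answered from the same document."""
    return [len(list(group)) for _, group in groupby(turn_documents)]

# Example: a six-turn conversation over three documents.
# topic_segment_lengths(["Byzantine Empire"] * 3 + ["Mehmed the Conqueror"] * 2 + ["Anatolia"])
# -> [3, 2, 1]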
We also analyze the flow of topics throughout
the conversation. Do annotators always introduce
new topics or do they also go back to old ones?
Figure 3 depicts a flow diagram of topics in conversations up to 15 turns. Note that we have indexed topics according to their first occurrence in the conversation. We can see that the majority of switches introduce new topics, but also that more complex topic switching emerges in later turns. Specifically, we see that, from the sixth turn onwards, questioners frequently go back one or two topics in the conversation. Overall, this diagram suggests
that there are complex interactions among the
topics in the conversation.
Qualitative Assessment of Topic Switching
In order to understand the nature of a topic
switch, inspired by Stede and Schlangen (2004),
we classify questions into three types: ask-
generic refers to general open-ended questions,
ask-specific questions ask about a specific
attribute or detail of a topic, and ask-further
is a question type that seeks additional details
of an attribute discussed in one of the previous
turns. Table 4 shows examples of each type for questions in the same conversation. We consider three types of turns for our evaluation. If the answer document of the turn is the same as the previous turn, we refer to it as no-switch. If a topic switch has happened, and the answer document is present in one of the previous turns, it is considered to be switch-to-old. The final category, switch-to-new, refers to turns where the current answer document has not been seen in the conversation before. These different types of topic switches are also illustrated in Table 4.
Question Type | Avg Answer Length
ask-generic | 22.43
ask-specific | 11.38
ask-further | 11.23

Table 3: Average answer length of different question types. Generic questions tend to have longer answers.
Turn type | Question type | Conversation turn
no-switch | ask-generic | Q: who is mariah carey? A: An American singer songwriter and actress (Topic: Mariah Carey)
no-switch | ask-specific | Q: name one of her famous songs. A: Oh Santa! (Topic: Mariah Carey)
switch-to-new | ask-specific | Q: how was it received? A: There were mixed reviews (Topic: Oh Santa!)
switch-to-old | ask-specific | Q: is she married? A: Yes (Topic: Mariah Carey)
no-switch | ask-further | Q: to whom? A: Tommy Mottola (Topic: Mariah Carey)

Table 4: Examples of various turn types and question types in a conversation. Random samples of each turn type are manually annotated with one of the question types.
We sample 50 turns of each type, and man-
ually label them with one of the three question
types. Figure 4 shows the results of our eval-
uation. ask-specific is the most common
question type across all types of turns, indicating
that most of the questions in the dataset focus on
specific attributes of a topic. ask-generic has
a much higher proportion in switch-to-new
turn types, indicating that it is more likely to
see generic questions in turns that introduce a
new topic in the conversation, compared to other
turn types. ask-further has almost equal pro-
portion in no-switch and switch-to-old,
with switch-to-old being slightly higher.
ask-further is not observed in switch-
to-new as follow-up questions are generally not
possible without the topic being discussed in any
of the previous turns.
We also look at the average answer length of answers of all three question types (Table 3). As expected, ask-generic has a much higher answer length compared to other types, presumably due to the open-ended nature of the question.
Figure 4: Distribution of various question types for each
turn type. Questions asking about specific attributes
are most common. Generic questions are likely to be
observed when switching to a new topic.
5 Experimental Setup
The task of open-domain information-seeking
conversation can be framed as follows. Given
previous questions and ground truth answers
{q1, a1, q2, a2, . . . , qi−1, ai−1} and current ques-
tion qi, the model has to provide the answer ai.
This can be considered as an oracle setting, as
the gold answers of previous questions are pro-
vided. The models can optionally use a corpus of
documents C = {d1, d2, . . . , dN }.
5.1 Models
We consider models from two categories, based on whether they use the document corpus or not. The closed-book models use just the question-answer pairs, whereas open-book models use the document corpus, along with question-answer pairs.
We now describe the implementation and tech-
nical details of both classes of models.
5.1.1 Closed-book
Large-scale language models often capture a lot of world knowledge during unsupervised pre-training (Petroni et al., 2019; Roberts et al., 2020). These models, in principle, can answer questions without access to any external corpus. We consider GPT-3 (Brown et al., 2020)—an autoregressive language model with 175 billion parameters, and evaluate it on TOPIOCQA. The input to GPT-3 is a prompt5 followed by previous question-answer pairs and the current question. Because GPT-3 is never explicitly exposed to any training examples, this can be considered as a zero-shot setting.

5 beta.openai.com/examples/default-qa.
5.1.2 Open-book
We build on state-of-the-art QA models that adopt a two-step retriever-reader approach. For the retriever, we consider BM25 (Robertson et al., 1995) and DPR Retriever (Karpukhin et al., 2020). Given a query, BM25 ranks the documents based on a bag-of-words scoring function. On the other hand, DPR learns dense vector representations of the document and query, and uses the dot product between them as a ranking function.
We consider two types of neural readers. (1) DPR Reader (Karpukhin et al., 2020), which re-ranks the retrieved passages and selects a span from each document independently. The span with the highest span score is chosen as the answer. (2) Fusion-in-Decoder (FiD; Izacard and Grave, 2021), which encodes all retrieved passages independently, and then jointly attends over all of them in the decoder to generate the answer.
For these models, we consider three different question representations for the question at the nth turn of the conversation (qn). Figure 5 shows an example of different question representations for the third question (Q3) of a conversation.

Figure 5: A partial conversation and different question representations of Q3. The REWRITES representation is an example, not the output of our QR module.

• ORIGINAL: This serves as a naive baseline where just the current question qn is passed to the model.
• ALLHISTORY: The question is represented as q1 [SEP] a1 [SEP] q2 [SEP] a2 [SEP] . . . [SEP] qn−1 [SEP] an−1 [SEP] qn. When constrained by the encoder input sequence length, we retain the first turn and as many turns prior to the current turn as possible, that is, k is chosen such that q1 [SEP] a1 [SEP] qn−k [SEP] an−k [SEP] . . . [SEP] qn−1 [SEP] an−1 [SEP] qn satisfies the encoder input limits (a short sketch of this rule follows the list).
• REWRITES: Given a query-rewriting module QR, let q′n = QR(q1, a1, . . . , qn−1, an−1, qn) denote the decontextualized question, conditioned on the conversation history. q′n is then passed to the model.
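The ALLHISTORY truncation rule can be made concrete with a minimal sketch; whitespace tokenization and a 512-token budget are assumptions for illustration, not the exact configuration used in our experiments.

def allhistory_query(questions, answers, max_tokens=512,
                     count_tokens=lambda text: len(text.split())):
    """Build the ALLHISTORY input for the current (last) question.

    Keeps the first turn and as many of the most recent completed turns as fit,
    i.e., q1 [SEP] a1 [SEP] q_{n-k} [SEP] a_{n-k} ... [SEP] q_{n-1} [SEP] a_{n-1} [SEP] q_n.
    """
    current = questions[-1]
    history = list(zip(questions[:-1], answers))  # completed (q_i, a_i) turns
    first = list(history[:1])
    recent = []
    for q, a in reversed(history[1:]):
        candidate = first + [(q, a)] + recent
        parts = [piece for turn in candidate for piece in turn] + [current]
        if count_tokens(" [SEP] ".join(parts)) > max_tokens:
            break
        recent.insert(0, (q, a))
    parts = [piece for turn in (first + recent) for piece in turn] + [current]
    return " [SEP] ".join(parts)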
Wang et al. (2019) observed that fixed-length text segments from documents are more useful than full documents in both retrieval and final QA accuracy. Hence, we split a Wikipedia document into multiple text blocks of at least 100 words, while preserving section and sentence boundaries. These text blocks, augmented with the metadata (main title and section title), are referred to as passages. This resulted in 25.7 million passages, which act as the basic units of retrieval. To form question-passage pairs for training the DPR Retriever, we select the passage from the gold answer document that contains the majority of the rationale span.
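A simplified sketch of this splitting step is shown below; it assumes the document has already been segmented into sentences and ignores the section metadata that the real pipeline carries along.

def split_into_passages(sentences, min_words=100):
    """Greedily group consecutive sentences into blocks of at least min_words words."""
    passages, current, word_count = [], [], 0
    for sentence in sentences:
        current.append(sentence)
        word_count += len(sentence.split())
        if word_count >= min_words:
            passages.append(" ".join(current))
            current, word_count = [], 0
    if current:  # attach any short remainder to the previous block
        if passages:
            passages[-1] += " " + " ".join(current)
        else:
            passages.append(" ".join(current))
    return passages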
Following the original works, we use BERT (Devlin et al., 2019) for DPR (both Retriever and Reader) and T5 (Raffel et al., 2020) for FiD as base models. Because DPR Reader requires a span from the passage for each training example, we heuristically select the span from the gold passage that has the highest lexical overlap (F1 score) with the gold answer. For the query-rewriting module QR, we fine-tune a T5 model on rewrites of QReCC (Anantha et al., 2021), and use it to generate the rewrites for TOPIOCQA. We refer the reader to Appendix B for more details. The hyperparameters for all models are mentioned in Appendix C.
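The span-selection heuristic for DPR Reader training can be sketched as follows. Enumerating every span up to a fixed length is a simplifying assumption made here; it is not necessarily how the heuristic was implemented.

from collections import Counter

def token_f1(candidate, reference):
    """Token-level F1 between two answer strings."""
    c, r = Counter(candidate.lower().split()), Counter(reference.lower().split())
    common = sum((c & r).values())
    if common == 0:
        return 0.0
    precision, recall = common / sum(c.values()), common / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def best_span(passage_tokens, gold_answer, max_span_len=30):
    """Return the (start, end) span with the highest F1 overlap with the gold answer."""
    best, best_score = (0, 1), -1.0
    for start in range(len(passage_tokens)):
        last = min(start + max_span_len, len(passage_tokens))
        for end in range(start + 1, last + 1):
            score = token_f1(" ".join(passage_tokens[start:end]), gold_answer)
            if score > best_score:
                best, best_score = (start, end), score
    return best, best_score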
5.2 Evaluation Metrics
Following Choi et al. (2018) and Reddy et al.
(2019), we use exact match (EM) and F1 as eval-
uation metrics for TOPIOCQA.
To compute human and system performance in the presence of multiple gold annotations, we follow an evaluation process similar to Choi et al. (2018) and Reddy et al. (2019). Given n human answers, human performance on the task is determined by considering each answer as the prediction and the other human answers as the reference set. This results in n scores, which are averaged to give the final human performance score. The system prediction is also compared with n distinct reference sets, each containing n − 1 human answers, and then averaged. For TOPIOCQA, n = 4 (the original answer and three additional annotations). Note that human performance is not necessarily an upper bound for the task, as document retrieval can potentially be performed better by the systems.
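A compact sketch of this averaging scheme is given below. It assumes that the score of a prediction against a reference set is the maximum over the answers in that set, in the style of SQuAD-like evaluations; the exact-match metric is shown, and the token-level F1 from the span-selection sketch can be substituted for it.

def exact_match(prediction, reference):
    """Binary exact-match score after lowercasing and stripping whitespace."""
    return float(prediction.strip().lower() == reference.strip().lower())

def score_against_set(prediction, references, metric=exact_match):
    """Score a prediction against a set of reference answers."""
    return max(metric(prediction, reference) for reference in references)

def system_score(prediction, human_answers, metric=exact_match):
    """Average over the n leave-one-out reference sets (n = 4 in TOPIOCQA)."""
    n = len(human_answers)
    leave_one_out = [human_answers[:i] + human_answers[i + 1:] for i in range(n)]
    return sum(score_against_set(prediction, refs, metric) for refs in leave_one_out) / n

def human_score(human_answers, metric=exact_match):
    """Each human answer is scored against the remaining answers, then averaged."""
    n = len(human_answers)
    return sum(
        score_against_set(human_answers[i],
                          human_answers[:i] + human_answers[i + 1:], metric)
        for i in range(n)
    ) / n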
Model | Question Rep | Dev EM | Dev F1 | Test EM | Test F1
Human | – | 40.2 | 70.1 | 40.3 | 70.0
GPT-3 | – | 12.4 | 33.4 | 10.4 | 31.8
BM25 + DPR Reader | ORIGINAL | 7.1 | 12.8 | 7.2 | 13.0
BM25 + DPR Reader | ALLHISTORY | 13.6 | 25.0 | 13.8 | 25.2
BM25 + DPR Reader | REWRITES | 15.4 | 32.5 | 15.7 | 31.7
BM25 + FiD | ORIGINAL | 10.1 | 21.8 | 10.5 | 22.6
BM25 + FiD | ALLHISTORY | 24.1 | 37.2 | 23.4 | 36.1
BM25 + FiD | REWRITES | 24.0 | 41.6 | 24.9 | 41.4
DPR Retriever + DPR Reader | ORIGINAL | 4.9 | 14.9 | 4.3 | 14.9
DPR Retriever + DPR Reader | ALLHISTORY | 21.0 | 43.4 | 19.4 | 41.1
DPR Retriever + DPR Reader | REWRITES | 17.2 | 36.4 | 16.5 | 35.2
DPR Retriever + FiD | ORIGINAL | 7.9 | 21.6 | 7.8 | 21.4
DPR Retriever + FiD | ALLHISTORY | 33.0 | 55.3 | 33.4 | 55.8
DPR Retriever + FiD | REWRITES | 23.5 | 44.2 | 24.0 | 44.7

Table 5: Overall performance of all model variants on the TOPIOCQA development and test sets.
Model | Question Rep | Dev Top-20 | Dev Top-100 | Test Top-20 | Test Top-100
BM25 | ORIGINAL | 5.2 | 9.1 | 6.0 | 10.1
BM25 | ALLHISTORY | 23.1 | 36.8 | 22.5 | 35.6
BM25 | REWRITES | 32.5 | 49.2 | 33.0 | 47.4
DPR Retriever | ORIGINAL | 9.9 | 16.5 | 10.0 | 15.3
DPR Retriever | ALLHISTORY | 70.4 | 82.4 | 67.0 | 80.8
DPR Retriever | REWRITES | 49.9 | 62.4 | 49.3 | 61.1

Table 6: Retrieval performance of all model variants on the TOPIOCQA development and test sets.
6 Results and Discussion
We report the end-to-end performance of all systems in Table 5. For open-book models, we also look at the performance of their constituents (retriever and reader). Table 6 reports the retrieval performance and Table 7 reports the reading comprehension performance of the readers, given the gold passage. Based on these results, we answer the following research questions.
How do the models compare against humans for
TOPIOCQA?
We report model and human performance on the development and test sets in Table 5. Overall, model performance in all settings is significantly lower than human performance. The best performing model (DPR Retriever + FiD using the ALLHISTORY question representation) achieves 33.4 points EM and 55.8 points F1 on the test set, which falls short of human performance by 6.9 points and 14.2 points, respectively, indicating room for further improvement.
Model | Question Rep | Dev EM | Dev F1 | Test EM | Test F1
Extractive Bound | – | 47.7 | 81.1 | 47.3 | 81.0
DPR Reader | ORIGINAL | 27.1 | 51.4 | 25.5 | 50.4
DPR Reader | ALLHISTORY | 29.7 | 54.2 | 28.0 | 52.6
DPR Reader | REWRITES | 29.8 | 53.8 | 28.1 | 52.1
FiD | ORIGINAL | 34.4 | 60.5 | 33.7 | 61.0
FiD | ALLHISTORY | 38.3 | 65.5 | 37.2 | 64.1
FiD | REWRITES | 34.5 | 61.9 | 35.3 | 62.8

Table 7: Reader performance of all model variants on the TOPIOCQA development and test sets when provided with the gold passage.
Which class of models perform better—Closed
book or Open book?
GPT-3 is directly comparable to the ALLHISTORY variant of the open-book models, as it takes the entire conversation history as input. Apart from BM25 + DPR Reader, GPT-3 performs worse than all other ALLHISTORY variants of open-book models. It achieves an F1 score of 31.8 on the test set, which is 24 points behind the best performing open-book model (DPR Retriever + FiD). We observe that GPT-3 often hallucinates many answers, a phenomenon commonly observed in the literature (Shuster et al., 2021).
How does the performance of open-book models
vary with various question representations?
For all open-book models, we fine-tune on three
different question representations (Section 5).
From the results in Table 5, we observe that the
ORIGINAL representation is consistently worse than
others for all models. This highlights the impor-
tance of encoding the conversational context for
TOPIOCQA. Between ALLHISTORY and REWRITES,
we observe that ALLHISTORY performs better
with the dense retriever (DPR Retriever), whereas REWRITES performs better with the sparse retriever (BM25). To confirm that this performance difference in end-to-end systems stems from the retriever, we look at the Top-20 and Top-100 retrieval accuracy of BM25 and DPR Retriever in Table 6. Indeed, the ALLHISTORY representation performs better than REWRITES for DPR Retriever but worse for BM25. As DPR Retriever is trained on TOPIOCQA, it can probably learn how to select relevant information from the ALLHISTORY representation, whereas for BM25, the non-relevant
keywords in the representation act as distractors.
The better performance of DPR Retriever over
BM25 indicates that TOPIOCQA requires learning
task-specific dense semantic encoding for superior
retrieval performance.
How much are the readers constrained due to
retrieved results?
Table 6 shows retrieval results. In an end-to-end system, the reader takes as input the retrieved passages, which may or may not contain the gold passage. To get an estimate of reader performance independently from the retriever, we experiment with directly providing only the gold passage to the readers, instead of the retrieved ones. Table 7 shows the results. This can be seen as an ''Ideal Retriever'' setting, where the retriever always retrieves the correct passage as the top one. Although we observe significant gains over end-to-end systems for all models across all variants, the best model (FiD with ALLHISTORY) still falls short of human performance by 3.1 points EM and 5.9 points F1 on the test set. These experiments indicate that while passage retrieval is a significant bottleneck for the task, technical advancements are needed for the readers as well.
While it is plausible to assume that DPR Reader is restricted in its performance due to its extractive nature, we show that this is not the case. We calculate the extractive upper bound for TOPIOCQA (reported in Table 7) by selecting the span from the gold document with the best F1 overlap with the ground truth answer. This bound is 47.3 points EM and 81.0 points F1, which essentially represents the best that any extractive model can do on this task. DPR Reader falls short of this upper bound by 19.2 points EM and 28.4 points F1.
7 Conclusion
We introduced TOPIOCQA, a novel open-domain conversational question answering dataset with topic switching. In this work, we described our data collection effort, analyzed its topic switching behaviour, and established strong neural baselines. The best performing model (DPR Retriever + FiD) is 6.9 points EM and 14.2 points F1 below human performance, suggesting that advances in modeling are needed. We hope our dataset will be an important resource to enable more research on conversational agents that support topic switches in information-seeking scenarios.
Acknowledgments
We would like to thank TAKT’s annotators and
Jai Thirani (Chief Data Officer) for contributing
to TOPIOCQA. We are grateful for construc-
tive and insightful feedback from the anonymous
reviewers. This work is supported by the follow-
ing grants: MSR-Mila Grant, NSERC Discovery
Grant on Robust conversational models for ac-
cessing the world’s knowledge, and the Facebook
CIFAR AI Chair. Shehzaad is partly supported
by the IBM PhD Fellowship. We thank Compute
Canada for the computing resources.
References

Raviteja Anantha, Svitlana Vakulenko, Zhucheng Tu, Shayne Longpre, Stephen Pulman, and Srinivas Chappidi. 2021. Open-domain question answering goes conversational via question rewriting. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 520–534, Online. https://doi.org/10.18653/v1/2021.naacl-main.44

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota.

Ahmed Elgohary, Denis Peskov, and Jordan Boyd-Graber. 2019. Can you unpack that? Learning to rewrite questions-in-context. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5918–5924, Hong Kong, China. https://doi.org/10.18653/v1/D19-1605

Gautier Izacard and Edouard Grave. 2021. Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 874–880, Online. https://doi.org/10.18653/v1/2021.eacl-main.74

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, Vancouver, Canada.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031, Copenhagen, Denmark. https://doi.org/10.18653/v1/D17-1215

Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2174–2184, Brussels, Belgium.

Jeffrey Dalton, Chenyan Xiong, Vaibhav Kumar, and Jamie Callan. 2020. CAsT-19: A dataset for conversational information seeking. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '20, pages 1985–1988, New York, NY, USA. https://doi.org/10.1145/3397271.3401206

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online. https://doi.org/10.18653/v1/2020.emnlp-main.550

Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2018. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328. https://doi.org/10.1162/tacl_a_00023
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466. https://doi.org/10.1162/tacl_a_00276

Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6086–6096, Florence, Italy.

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. In CoCo@NIPS.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China. https://doi.org/10.18653/v1/D19-1250

Chen Qu, Liu Yang, Cen Chen, Minghui Qiu, W. Bruce Croft, and Mohit Iyyer. 2020. Open-retrieval conversational question answering. SIGIR '20, pages 539–548, New York, NY, USA.

Chen Qu, Liu Yang, Minghui Qiu, W. Bruce Croft, Yongfeng Zhang, and Mohit Iyyer. 2019a. BERT with history answer embedding for conversational question answering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '19, pages 1133–1136, New York, NY, USA.

Chen Qu, Liu Yang, Minghui Qiu, Yongfeng Zhang, Cen Chen, W. Bruce Croft, and Mohit Iyyer. 2019b. Attentive history selection for conversational question answering. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM '19, pages 1391–1400, New York, NY, USA.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. https://doi.org/10.18653/v1/P18-2124

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. https://doi.org/10.18653/v1/D16-1264

Siva Reddy, Danqi Chen, and Christopher Manning. 2019. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266. https://doi.org/10.1162/tacl_a_00266

Adam Roberts, Colin Raffel, and Noam Shazeer. 2020. How much knowledge can you pack into the parameters of a language model? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5418–5426, Online. https://doi.org/10.18653/v1/2020.emnlp-main.437

Stephen Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, and M. Gatford. 1995. Okapi at TREC-3. In Overview of the Third Text REtrieval Conference (TREC-3), pages 109–126.

Anna Rogers, Matt Gardner, and Isabelle Augenstein. 2021. QA dataset explosion: A taxonomy of NLP resources for question answering and reading comprehension. arXiv preprint arXiv:2107.12708.

Harvey Sacks and Gail Jefferson. 1995. Lectures on Conversation, Volume 1. pages 289–331.
Anshumali Shrivastava and Ping Li. 2014. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In Proceedings of the 27th International Conference on Neural Information Processing Systems – Volume 2, pages 2321–2329, Montréal, Canada.

Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. 2021. Retrieval augmentation reduces hallucination in conversation. arXiv preprint arXiv:2104.07567. https://doi.org/10.18653/v1/2021.findings-emnlp.320

Amanda Spink, H. Cenk Ozmutlu, and Seda Ozmutlu. 2002. Multitasking information seeking and searching processes. Journal of the American Society for Information Science and Technology, 53(8):639–652. https://doi.org/10.1002/asi.10124

Manfred Stede and David Schlangen. 2004. Information-seeking chat: Dialogue management by topic structure.

Svitlana Vakulenko, Shayne Longpre, Zhucheng Tu, and Raviteja Anantha. 2021. Question rewriting for conversational question answering. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, WSDM '21, pages 355–363, New York, NY, USA. https://doi.org/10.1145/3437963.3441748

Douglas Walton. 2019. The New Dialectic: Conversational Contexts of Argument. University of Toronto Press.

Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerry Tesauro, Bowen Zhou, and Jing Jiang. 2018. R3: Reinforced ranker-reader for open-domain question answering. In AAAI, pages 5981–5988.

Zhiguo Wang, Patrick Ng, Xiaofei Ma, Ramesh Nallapati, and Bing Xiang. 2019. Multi-passage BERT: A globally normalized BERT model for open-domain question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5878–5882, Hong Kong, China. https://doi.org/10.18653/v1/D19-1599

Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li, Luchen Tan, Kun Xiong, Ming Li, and Jimmy Lin. 2019. End-to-end open-domain question answering with BERTserini. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 72–77, Minneapolis, Minnesota. https://doi.org/10.18653/v1/N19-4013

Yi Yang, Wen-tau Yih, and Christopher Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2013–2018, Lisbon, Portugal. https://doi.org/10.18653/v1/D15-1237

Chenguang Zhu, Michael Zeng, and Xuedong Huang. 2019. SDNet: Contextualized attention-based deep network for conversational question answering. arXiv preprint arXiv:1812.03593.
A Annotators Details
Each conversation in TOPIOCQA is an interac-
tion between two annotators, a questioner and
an answerer. The annotators were selected from
TAKT’s in-house workforce, based on their En-
glish language proficiency and trained for the role
of both questioner and answerer. The annotators
are provided with the following guidelines.
Guidelines for the Questioner:
• The first question should be unambiguous
and about the seed entity.
• The follow-up questions should be contextualized and dependent on the conversation history whenever possible.
• Avoid using the same words as in the section titles of the document. For example, if the section title is ‘‘Awards’’, a plausible question can be ‘‘What accolades did she receive for her work?’’.
• The conversation should involve multiple documents (topics).
Guidelines for the Answerer:
• Based on the question, identify the relevant
document and section.
• The answer should be based on the contents
of the identified document.
• The rationale should be selected such that it
justifies the answer.
• The answer should be a sub-string of the rationale whenever possible. However, answers should be edited to fit the conversational context (adding yes, no), perform reasoning (e.g., counting), and so on.
• Personal opinions should never be included.
After providing the guidelines and a few examples, the initial annotated conversations were manually inspected by the authors. The workers who provided low-quality annotations during this inspection phase were disqualified. The final workforce consisted of 15 workers, who provided annotations for the dataset over a period of two months. Random quality checks were performed by the authors and periodic feedback was given to the annotators throughout the data collection to maintain high data quality. Figure 8 shows the annotation interfaces for the questioner and answerer. Figure 6 shows an example from the dataset.
We also implemented several real-time checks in the questioner's interface to encourage topic switching and use of co-reference, and to reduce the lexical overlap with the metadata of the document while forming the question.
B Query Rewriting
A query-rewriting module, QR, takes the current question and the conversation history as input (q1, a1, . . . , qn−1, an−1, qn) and provides a decontextualized rewritten question, q′n, as the output. As we do not collect rewrites in TOPIOCQA, we rely on other datasets to train our QR model. Two datasets that provide rewrites for information-seeking conversations are CANARD (Elgohary et al., 2019) and QReCC (Anantha et al., 2021). Due to its large-scale and diverse nature, we use QReCC to train our T5-based QR module.
To rewrite the nth question, the conversation history and the current question are given to the model as q1 [SEP] a1 [SEP] q2 [SEP] a2 [SEP] . . . [SEP] qn−1 [SEP] an−1 [SEP] qn. We train this model on the QReCC dataset. On the test split of QReCC, our model achieves a BLEU score of 62.74 points. We use this model to generate rewrites for TOPIOCQA in our experiments. Figure 7 shows a conversation from the dataset along with rewrites from this T5-based QR module.
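For illustration, the sketch below shows how a fine-tuned T5 checkpoint could be queried with the [SEP]-delimited history. The checkpoint name, beam size, and length limit are placeholders rather than the exact configuration used here.

from transformers import T5ForConditionalGeneration, T5Tokenizer

def rewrite_question(model, tokenizer, history, current_question, max_len=64):
    """history: list of (question, answer) pairs from the previous turns."""
    flat = [piece for turn in history for piece in turn] + [current_question]
    source = " [SEP] ".join(flat)
    inputs = tokenizer(source, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_length=max_len, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Placeholder checkpoint; in practice this would be a T5 model fine-tuned on QReCC rewrites.
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")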
Cifra 6: A full conversation from TOPIOCQA.
We observe that while this QR module can resolve simple coreferences (Q2 and Q5), it struggles later in the conversation in the presence of multiple entities (he is resolved to albus dumbledore instead of tom marvolo riddle in Q10 and Q12). The QR module also fails to perform the reasoning required for correct rewrites, for example, boy wizard's nemesis is not rewritten to Lord Voldemort in Q9, even though this information is present in A5.
C Hyperparameter Details
We use Lucene BM25 with k1 = 0.9 (term fre-
quency scaling) and b = 0.4 (document length
normalization). For both DPR and FiD, apart
from the batch size, we use the hyperparameters
suggested in their codebases. We use the maxi-
mum batch size that fits in the GPU cluster. DPR
Retriever is trained on four 40GB A100 GPUs,
whereas DPR Reader and FiD are trained on 8
32GB V100 GPUs. We use base model size for
all systems. Following original implementations,
DPR Retriever is trained for 40 epochs, DPR
Reader for 20 epochs, and FiD for 15,000 gradient
steps. The model checkpoint with best EM score
on development set is selected as the final model.
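For reference, a minimal sketch of the Okapi BM25 score with these parameters (k1 = 0.9, b = 0.4) is given below; the tokenization and corpus statistics are simplified assumptions, not the Lucene implementation used in the experiments.

import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len, k1=0.9, b=0.4):
    """Okapi BM25 score of one document for a query.

    doc_freq maps a term to the number of documents that contain it.
    """
    term_freq = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in term_freq or term not in doc_freq:
            continue
        idf = math.log(1 + (num_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
        tf = term_freq[term]
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_terms) / avg_doc_len))
        score += idf * norm
    return score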
Cifra 7: An example of a conversation from
TOPIOCQA along with rewrites from the QR module.
Few turns are excluded and some answers are shorted
for brevity.
Cifra 8: Annotation interface for questioners and answerers.