TopiOCQA: Open-domain Conversational Question Answering
with Topic Switching

Vaibhav Adlakha1,4  Shehzaad Dhuliawala2  Kaheer Suleman3  Harm de Vries4  Siva Reddy1,5

1Mila, McGill University, Canada
2ETH Zürich, Switzerland
3Microsoft Montréal, Canada
4ServiceNow Research, Canada
5Facebook CIFAR AI Chair, Canada

{vaibhav.adlakha,siva.reddy}@mila.quebec

Abstract

In a conversational question answering scenario, a questioner seeks to extract information about a topic through a series of interdependent questions and answers. As the conversation progresses, they may switch to related topics, a phenomenon commonly observed in information-seeking search sessions. However, current datasets for conversational question answering are limiting in two ways: 1) they do not contain topic switches; and 2) they assume the reference text for the conversation is given, that is, the setting is not open-domain. We introduce TOPIOCQA (pronounced Tapioca), an open-domain conversational dataset with topic switches based on Wikipedia. TOPIOCQA contains 3,920 conversations with information-seeking questions and free-form answers. On average, a conversation in our dataset spans 13 question-answer turns and involves four topics (documents). TOPIOCQA poses a challenging test-bed for models, where efficient retrieval is required on multiple turns of the same conversation, in conjunction with constructing valid responses using conversational history. We evaluate several baselines by combining state-of-the-art document retrieval methods with neural reader models. Our best model achieves an F1 of 55.8, falling short of human performance by 14.2 points, indicating the difficulty of our dataset. Our dataset and code are available at https://mcgill-nlp.github.io/topiocqa.

1 Introduction

People often engage in information-seeking conversations to discover new knowledge (Walton, 2019). In such conversations, a questioner (the seeker) asks multiple rounds of questions to an answerer (the expert). As the conversation proceeds, the questioner becomes inquisitive of new but related topics based on the information provided in the answers (Stede and Schlangen, 2004). Such topic switching behaviour is natural in information-seeking conversations and is commonly observed when people seek information through search engines (Spink et al., 2002).

According to Spink et al., people switch from one to ten topics with a mean of 2.11 topic switches per search session. For example, a person can start a search session about tennis, and then land on Roger Federer, and after learning a bit about him may land on his country Switzerland, and spend more time learning about other Swiss athletes. Thanks to tremendous progress in question answering research (Rogers et al., 2021), we are coming close to enabling information-seeking conversations with machines (as opposed to just using keyword-based search). In order to realize this goal further, it is crucial to construct datasets that contain information-seeking conversations with topic switching, and to measure the progress of conversational models on this task; these are the two primary contributions of this work.

In the literature, a simplified setting of
information-seeking conversation known as con-
versational question answering (CQA) has been
deeply explored (Choi et al., 2018; Reddy et al.,
2019). In this task, the entire conversation is based
on a given reference text of a topic/entity. While
the CQA task is challenging, it still falls short of
the real-world setting, where the reference text is
not known beforehand (first limitation) and the
conversation is not restricted to a single topic
(second limitation).

Qu et al. (2020) and Anantha et al. (2021) have attempted to overcome the first limitation by adapting existing CQA datasets to the open-domain setting. They do so by obtaining context-independent rewrites of the first question to make the question independent of the reference text. For example, if the reference text is about Augusto

Pinochet and the conversation starts with a question ‘‘Was he known for being intelligent?’’, the question is re-written to ‘‘Was Augusto Pinochet known for being intelligent?’’. However, as the entire question sequence in the conversation was collected with a given reference text of a topic, all the turns still revolve around a single topic.

In this work, we present TOPIOCQA1—Topic
switching in Open-domain Conversational Question
Answering—a large-scale dataset for information-
seeking conversations in open-domain based on
the Wikipedia corpus. We consider each Wikipedia
document to be a separate topic. The conversations
in TOPIOCQA start with a real information-seeking
question from Natural Questions (Kwiatkowski
et al., 2019) in order to determine a seed topic
(document), and then the questioner may shift to
other related topics (documents) as the conver-
sation progresses.2 Throughout the conversation,
the questioner is never shown the content of the
documents (but only the main title and section ti-
tles) to simulate an information-seeking scenario,
whereas the answerer has full access to the content
along with the hyperlink structure for navigation.
In each turn, both questioner and answerer use
free-form text to converse (as opposed to extrac-
tive text spans as is common for an answerer in
many existing datasets).

Figure 1 shows an example of a conversation
from our dataset. The first question leads to the
seed topic Byzantine Empire, and after two turns
switches to Mehmed the Conqueror in Q4, based
on part of the answer (A3) that contains reference
to Mehmed. Note that the answers A1, A3, and A4
are free-form answers that do not occur as spans
in either the seed document or the follow up doc-
ument. The topic then switches to Anatolia based
on part of the previous answer (A4). The topics
change in further turns to Turkey and Ankara.
Because of the conversational nature, TOPIOCQA
contains questions rife with complex coreference
phenomena, for instance, Q9 relies on entities
mentioned in A7, A8 and Q1.

TOPIOCQA contains 3,920 conversations and
50,574 QA pairs, based on Wikipedia corpus of
5.9 million documents. On average, a conversa-
tion has 13 question-answer turns and involves 4
topics. Twenty-eight percent of turns in our data-

1TOPIOCQA is pronounced as Tapioca.
2A portion of the training data also contains conversations
where the questioner asks the first question given a seed topic.

Figure 1: A conversation from TOPIOCQA. Our dataset
has information-seeking questions with free-form an-
swers across multiple topics (documents). The consec-
utive turns from the same topic (document) have been
excluded for brevity.

set require retrieving a document different from the previous turn. To the best of our knowledge, TOPIOCQA is the first open-domain information-seeking CQA dataset that incorporates topical changes, along with other desirable properties (see Table 1).

To investigate the difficulty of the TOPIOCQA
dataset, we benchmark several strong retriever-
reader neural baselines, considering both sparse
and dense retrievers, as well as extractive and gen-
erative readers (Karpukhin et al., 2020; Izacard
and Grave, 2021). Inspired by previous work,
we explore two ways to represent the question:
(1) concatenating the entire conversation history
(Qu et al., 2020), and (2) self-contained rewrites
of the conversational question (Anantha et al.,
2021). The best performing model—Fusion-in-

Table 1: Comparison of TOPIOCQA with other QA datasets along five properties: multi-turn, open-domain, free-form answers, information-seeking questions, and topic switching. The datasets compared are TOPIOCQA (ours), QReCC (Anantha et al., 2021), OR-QuAC (Qu et al., 2020), CoQA (Reddy et al., 2019), QuAC (Choi et al., 2018), NarrativeQA (Kočiský et al., 2018), Natural Questions (Kwiatkowski et al., 2019), and SQuAD 2.0 (Rajpurkar et al., 2018). TOPIOCQA incorporates topical changes, along with several best practices of previous datasets; a partially filled marker in the original table indicates that only a proportion of a dataset satisfies the property.

Decoder (Izacard and Grave, 2021) trained on con-
catenated conversation history—is 14.2 F1 points
short of human performance, indicating signif-
icant room for improvement. We also evaluate
GPT-3 to estimate the performance in a closed-
book zero-shot setting, and its performance is 38.2
F1 points below the human performance.

2 Related Work

2.1 Open-Domain Question Answering

In open-domain question answering, a model has
to answer natural language questions by retriev-
ing relevant documents. This can be considered
as a simplified setting of open-domain CQA,
where the conversation is limited to just one
turn. Several datasets have been proposed for this
task. On one hand, reading comprehension data-
sets like SQuAD (Rajpurkar et al., 2016, 2018),
which consist of (question, document, answer)
triplets, have been adapted for the task by with-
holding access to the document (Chen et al., 2017).
While these datasets have been helpful in spurring
modelling advances, they suffer from an anno-
tator bias because they were not collected in an
information-seeking setup. That is, annotators had
access to the target answer and its surrounding
context and therefore formulated questions that
had a high lexical overlap with the answer (Jia
and Liang, 2017). On the other hand, Web-search
based datasets do not suffer from such artefacts
because they are curated from real search engine
queries. The WikiQA (Yang et al., 2015) E
MS Marco (Nguyen et al., 2016) datasets contain
queries from the Bing search engine, whereas
Natural Questions (Kwiatkowski et al., 2019)
contain queries from the Google search engine.

Models for open-domain QA often follow a
two-stage process: (1) A retriever selects a small
collection of documents relevant to the question
from a large corpus (e.g., Wikipedia), and (2) a reader

extracts or generates an answer from the selected
documents. While classical approaches rely on
counting-based bag-of-words representations like
TF-IDF or BM25 (Chen et al., 2017; Wang et al.,
2018; Yang et al., 2019), more recent deep learn-
ing approaches learn dense representations of the
questions and document through a dual-encoder
framework (Lee et al., 2019; Karpukhin et al.,
2020). In such learned retriever setups, document
retrieval is done efficiently using Maximum Inner
Product Search (MIPS, Shrivastava and Li, 2014).

2.2 Conversational Question
Answering (CQA)

CQA extends the reading comprehension task
from a single turn to multiple turns. Given a refer-
ence document, a system is tasked with interac-
tively answering a sequence of information-seeking
questions about the corresponding document. This
conversational extension leads to novel challenges
in modeling linguistic phenomena such as ana-
phora (referencing previous turns) and ellipsis
(omitting words from questions), as well as in per-
forming pragmatic reasoning. Large-scale con-
versational datasets such as CoQA (Reddy et al.,
2019) and QuAC (Choi et al., 2018) have facil-
itated much of the research in this area. These
datasets differ along several dimensions, two of
which are (1) CoQA has short free-form answers,
whereas QuAC has long extractive span-based an-
swers, and (2) unlike CoQA, QuAC is collected
in a simulated information-seeking scenario.

Models for CQA have used simple concatena-
tion of the question-answer history (Zhu et al.,
2019), history turn selection (Qu et al., 2019a,b),
and question-rewrites (Vakulenko et al., 2021).
For question-rewriting, a different module is
trained on self-contained rewrites of context-
dependent questions. For example, a plausible
rewrite of Q8 (Figura 1) is ‘‘can you name
some of the major cities in Turkey apart from

Ankara?’’. The re-written question is then an-
swered using open-domain QA systems. Two pop-
ular question-rewriting datasets for training this
module are (1) CANARD (Elgohary et al., 2019),
which contains re-writes of 50% of QuAC, E
(2) QReCC (Anantha et al., 2021), which contains
rewrites of the entire QuAC dataset and a small
portion from other sources.

2.3 Open-Domain CQA

In this work, we focus on constructing a chal-
lenging benchmark for open-domain CQA. The
open-domain aspect requires systems to answer
questions without access to a reference document.
The conversational aspect enables users to ask
multiple related questions, which can, in principle,
span several different topics. With TOPIOCQA, we
introduce the first open-domain CQA dataset that
explicitly covers such topical switches.

Previous datasets for this task re-purpose
existing CQA datasets. The OR-QuAC dataset
(Qu et al., 2020) is automatically constructed
from QuAC (Choi et al., 2018) and CANARD
(Elgohary et al., 2019) by replacing the first ques-
tion in QuAC with context-independent rewrites
from CANARD. QReCC (Anantha et al., 2021)
is a large-scale open-domain CQA and question
rewriting dataset that contains conversations from
QuAC, TREC CAsT (Dalton et al., 2020), and Nat-
ural Questions (NQ; Kwiatkowski et al., 2019). All
the questions in OR-QuAC and 78% of questions
in QReCC are based on QuAC. As conversations
in QuAC were collected with a given reference
document, the question sequences of these con-
versations revolve around the topic or entity cor-
responding to that document. Twenty-one percent
of questions in QReCC are from NQ-based con-
versations. As NQ is not a conversational dataset,
the annotators of QReCC use NQ to start a con-
versation. A single annotator is tasked with providing both follow-up questions and answers for a given NQ question. In contrast to QReCC, conversations in our dataset are collected in a simulated information-seeking scenario using two annotators (Section 3.3).

Deep learning models for this task have fol-
lowed a similar retriever-reader setup as open-
domain QA. Instead of a single question, previous
works have explored feeding the entire conversa-
tion history (Qu et al., 2020), or a context indepen-
dent re-written question (Anantha et al., 2021).

3 Dataset Collection

Each conversation in TOPIOCQA is an interac-
tion between two annotators—a questioner and an
answerer. The details about the annotator selec-
tion are provided in Appendix A.

3.1 Seed Topics and Document Collection

The seed topics essentially drive the conversation.
In order to make them interesting for annotators,
we select the good3 articles of Wikipedia as seed
topics (around 35k) for the first turn, but use
entire Wikipedia for later turns. We used the
Wikipedia dump from 10/20/2020, which con-
sists of 5.9 million documents. We used Wikiex-
tractor4 to extract the text. While pre-processing
the Wikipedia documents, we retain the hyper-
links that refer to other Wikipedia documents,
thus ensuring that we can provide all the doc-
uments requested by annotators (via hyperlinks)
during the conversation.

3.2 Simulating Information-seeking Scenario

Information-seeking conversations are closer to
the real-world if an information need can be
simulated via the data collection interface. In
TOPIOCQA, we achieve this by withholding ques-
tioner’s access to the full reference text of the
document. The questioner can only see the meta-
data (main title and the section titles) of the
Wikipedia documents, whereas the answerer can
access the entire text of the documents. On finding
the answer, the answerer highlights a contiguous
span of text as rationale, and generates a free-form
answer. The answerer also has the option to mark
the question as unanswerable. The conversation
history is visible to both the annotators.
As a conversation starting point, the first
question is sampled from a subset of NQ
(Kwiatkowski et al., 2019) since NQ contains
genuine information-seeking questions asked on
Google. We only sample those questions for which
the answer is in our seed document pool. To
increase the diversity of our dataset, we also al-
low the questioner to formulate the first question
based on the provided seed topic entity for 28%
of the conversations.

3Wikipedia Good articles.
4github:wikiextractor.

3.3 Enabling Topic-switching

The key feature of the interface is enabling topic
switching via hyperlinks. For the answerer, the
text of the document includes clickable hyper-
links to other documents. On clicking these links,
the current document in the answerer’s interface
changes to the requested (clicked) document. This
enables the answerer to search for answers in doc-
uments beyond the current one. The questioner
can access the metadata of documents visited by
the answerer and documents present in the ration-
ale of the answers. For example, let us assume
that given the seed document Daniel Radcliffe and
the first question ‘‘Where was Daniel Radcliffe
born?’’, the answerer selects the ‘‘Daniel Jacob
Radcliffe was born in London on 23 July 1989’’
span as rationale and provides ‘‘London’’ as the
answer. If London is a hyperlink in the rationale
span, then the metadata of both Daniel Radcliffe
and London is available to the questioner to form
the next question. If the next question is ‘‘What is its population?’’, the answerer can switch the current document from Daniel Radcliffe to London by clicking on the hyperlink, and can then find and provide the answer. The conversation up till this point involves two topics: Daniel Radcliffe and London. We also provide easy navigation to previously visited documents for both the annotators. This interface design (Figure 8) ensures that information about the new topic is semantically connected to topics of the previous turns, similar to natural human-human conversations (Sacks and Jefferson, 1995).

3.4 Additional Annotations

To account for multiple valid answers, we col-
lected three additional annotations for answers of
conversations in evaluation sets (development and
test splits). For this task, at any turn, the annotator
can see all the previous questions and original an-
swers. Showing original answers of previous turns
is important in a conversational setting as the sub-
sequent questions can potentially depend on them.
We also provide the list of documents correspond-
ing to previous turns of the original conversation.
This ensures that the current annotator has all
the information the original answerer had while
providing the answer. Similar to the answerer,
the annotator then provides the rationale and the
answer, or marks the question as unanswerable.

                          Train     Dev      Test     Overall
# Turns                   45,450    2,514    2,502    50,466
# Conversations           3,509     205      206      3,920
# Tokens / question       6.91      6.89     7.11     6.92
# Tokens / answer         11.71     11.96    12.27    11.75
# Turns / conversation    13        12       12       13
# Topics / conversation   4         4        4        4

Table 2: Dataset statistics of TOPIOCQA.

4 Dataset Analysis

We collected a total of 3,920 conversations, con-
sisting of 50,466 turns. The annotators were
encouraged to complete a minimum of 10 turns.
Conversations with fewer than 5 turns were dis-
carded. We split the data into train, development,
and test splits.

Table 2 reports simple statistics of the dataset
splits. On average, a conversation in TOPIOCQA
has 13 question-answer turns and is based on 4
documents. Our dataset differs from other con-
versational question-answering datasets by incor-
porating topic switches in the conversation.

4.1 Topic Switching

Before we start our analysis, let us first define the
notion of a topic switch in TOPIOCQA. Recall that
answers are based on Wikipedia articles, Dove
each document consists of several sections. While
one can argue that a topic switch occurs when the
answer is based on a different section of the same
document, we opt for a more conservative notion
and define a switch of topic if the answer is based
on a different Wikipedia document.
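
For reference, the sketch below implements this notion of a topic switch over a list of turns; the per-turn field names are illustrative rather than the released data schema.

```python
# Minimal sketch: count topic switches in a conversation, assuming each turn
# carries a "document" field naming the Wikipedia article its answer is
# grounded in (field name is illustrative, not the released schema).
from typing import Dict, List


def count_topic_switches(turns: List[Dict[str, str]]) -> int:
    """A turn is a topic switch if its answer document differs from the
    previous turn's answer document."""
    switches = 0
    for prev, curr in zip(turns, turns[1:]):
        if curr["document"] != prev["document"]:
            switches += 1
    return switches


def num_topics(turns: List[Dict[str, str]]) -> int:
    """Number of distinct documents touched over the conversation."""
    return len({t["document"] for t in turns})


# Example using the turns shown in Table 4.
conversation = [
    {"question": "who is mariah carey?", "document": "Mariah Carey"},
    {"question": "name one of her famous songs.", "document": "Mariah Carey"},
    {"question": "how was it received?", "document": "Oh Santa!"},
    {"question": "is she married?", "document": "Mariah Carey"},
]
print(count_topic_switches(conversation), num_topics(conversation))  # 2 2
```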

Number of Topics vs Conversation Length
We begin our analysis by investigating how the
number of topics varies with the conversation
length. In Figure 2(a) we show a heat-map of the
number of topics for each conversation length,
where each column is normalized by the num-
ber of conversations of that length. We observe
that longer conversations usually include more
topics. Most 10-turn conversations include 3 top-
ics, 14-turn conversations include 4 topics, E
18-turn conversations include 5 topics. The con-
versations with fewer than 10 turns mostly include
just 2 topics.

Topic Flow in Conversation Next, we examine
how often consecutive questions stay within the

Figure 2: Analysis of the topic switches in TOPIOCQA. In (a) we show the distribution of the number of topics (in percentage) for each conversation length. Longer conversations typically include more topics. In (b) we show a histogram of the topic lengths, illustrating that usually 3–4 consecutive questions stay within the same topic.

Figure 3: A flow diagram of topic switches over conversations up to 15 turns. There are complex interactions
between the topics, especially later in the conversation.

same topic. To do so, we first cluster conversa-
tions into sequences of turns for which all answers
are from the same document. Then, we count how
many turns belong to topic clusters of a particu-
lar length. Figure 2(b) shows the distribution of
topic lengths. The mode of the distribution is 3,
signifying that annotators usually ask 3 questions
about the same topic before switching. Asking 2
O 4 consecutive questions on the same topic is
also frequently observed. Tuttavia, we rarely see
more than 10 consecutive turns on the same topic.
We also analyze the flow of topics throughout
the conversation. Do annotators always introduce
new topics or do they also go back to old ones?
Figure 3 depicts a flow diagram of topics in
conversations up to 15 turns. Note that we have
indexed topics according to their first occurrence
in the conversation. We can see that the majority of
switches introduce new topics, but also that more
complex topic switching emerges in later turns.
Specifically, we see that, from sixth turn onwards,
questioners frequently go back one or two topics in

the conversation. Overall, this diagram suggests
that there are complex interactions among the
topics in the conversation.

Qualitative Assessment of Topic Switching
In order to understand the nature of a topic
switch, inspired by Stede and Schlangen (2004),
we classify questions into three types: ask-
generic refers to general open-ended questions,
ask-specific questions ask about a specific
attribute or detail of a topic, and ask-further
is a question type that seeks additional details
of an attribute discussed in one of the previous
turns. Table 4 shows examples of each type for
questions in the same conversation. We consider
three types of turns for our evaluation. If the an-
swer document of the turn is same as the previous
turn, we refer to it as no-switch. If a topic
switch has happened, and the answer document is
present in one of the previous turns, it is consid-
ered to be switch-to-old. The final category,
switch-to-new refers to turns where current

Question type    Avg. answer length
ask-generic      22.43
ask-specific     11.38
ask-further      11.23

Table 3: Average answer length of different question types. Generic questions tend to have longer answers.

Turn type        Question type    Conversation turn
no-switch        ask-generic      Q: who is mariah carey?
                                  A: An American singer songwriter and actress
                                  Topic: Mariah Carey
no-switch        ask-specific     Q: name one of her famous songs.
                                  A: Oh Santa!
                                  Topic: Mariah Carey
switch-to-new    ask-specific     Q: how was it received?
                                  A: There were mixed reviews
                                  Topic: Oh Santa!
switch-to-old    ask-specific     Q: is she married?
                                  A: Yes
                                  Topic: Mariah Carey
no-switch        ask-further      Q: to whom?
                                  A: Tommy Mottola
                                  Topic: Mariah Carey

Table 4: Examples of various turn types and question types in a conversation. Random samples of each turn type are manually annotated with one of the question types.

answer document has not been seen in the con-
versation before. These different types of topic
switches are also illustrated in Table 4.

We sample 50 turns of each type, and man-
ually label them with one of the three question
types. Figure 4 shows the results of our eval-
uation. ask-specific is the most common
question type across all types of turns, indicating
that most of the questions in the dataset focus on
specific attributes of a topic. ask-generic has
a much higher proportion in switch-to-new
turn types, indicating that it is more likely to
see generic questions in turns that introduce a
new topic in the conversation, compared to other
turn types. ask-further has almost equal pro-
portion in no-switch and switch-to-old,
with switch-to-old being slightly higher.
ask-further is not observed in switch-
to-new as follow-up questions are generally not
possible without the topic being discussed in any
of the previous turns.

We also look at average answer length of
answers of all three question types (Table 3).
Unsurprisingly, ask-generic has a much

Figure 4: Distribution of various question types for each
turn type. Questions asking about specific attributes
are most common. Generic questions are likely to be
observed when switching to a new topic.

higher answer length compared to other types,
presumably due to the open-ended nature of
the question.

5 Experimental Setup

The task of open-domain information-seeking
conversation can be framed as follows. Given
previous questions and ground truth answers
{q1, a1, q2, a2, . . . , qi−1, ai−1} and current ques-
tion qi, the model has to provide the answer ai.
This can be considered as an oracle setting, as
the gold answers of previous questions are pro-
vided. The models can optionally use a corpus of
documents C = {d1, d2, . . . , dN }.
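
The following sketch only illustrates the shape of this setting; the class and function names are ours and do not correspond to any released code.

```python
# Illustrative shape of the oracle setting described above: the model receives
# gold answers for all previous turns plus the current question, and may
# consult a document corpus. Names are ours for illustration only.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    question: str  # q_i
    answer: str    # gold a_i, visible as history for turn i+1 onwards


def answer_turn(history: List[Turn], question: str,
                corpus: Optional[List[str]] = None) -> str:
    """Return the predicted answer a_i for question q_i.

    Closed-book models ignore `corpus`; open-book models retrieve from it."""
    raise NotImplementedError  # each baseline in Section 5.1 fills this in
```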

5.1 Models

We consider models from two categories, based on
whether they use the document corpus or not. The
closed-book models use just the question-answer
pairs, whereas open-book models use the docu-
ment corpus, along with question-answer pairs.
We now describe the implementation and tech-
nical details of both classes of models.

5.1.1 Closed-book

Large-scale language models often capture a lot
of world knowledge during unsupervised pre-
training (Petroni et al., 2019; Roberts et al., 2020).
These models, in principle, can answer questions
without access to any external corpus. We consi-
der GPT-3 (Brown et al., 2020)—an autoregres-
sive language model with 175 billion parameters,
and evaluate it on TOPIOCQA. The input to GPT-3
is a prompt5 followed by previous question-answer
pairs and the current question. Because GPT-3 is

5beta.openai.com/examples/default-qa.

When constrained by the encoder input sequence length, we retain the first turn and as many turns prior to the current turn as possible, that is, k is chosen such that q1 [SEP] a1 [SEP] qn−k [SEP] an−k [SEP] . . . [SEP] qn−1 [SEP] an−1 [SEP] qn satisfies the encoder input limits.

• REWRITES: Given a query-rewriting module QR, let q′n = QR(q1, a1, . . . , qn−1, an−1, qn) denote the decontextualized question, conditioned on the conversation history. q′n is then passed to the model.

Wang et al. (2019) observed that fixed-length text
segments from documents are more useful than
full documents in both retrieval and final QA
accuracy. Hence, we split a Wikipedia document
into multiple text blocks of at least 100 words,
while preserving section and sentence boundaries.
These text blocks, augmented with the metadata
(main title and section title) are referred to as
passages. This resulted in 25.7 million passages,
which act as basic units of retrieval. To form
question-passage pairs for training DPR Retriever,
we select the passage from gold answer document
that contains the majority of rationale span.
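
A rough sketch of this passage construction is shown below; the sentence splitter and the metadata format are illustrative assumptions rather than the exact preprocessing used to build the corpus.

```python
# Rough sketch of the passage construction described above: within each
# section, sentences are packed greedily into blocks of at least 100 words,
# and every block is prefixed with the article title and section title as
# metadata. The splitter and separator here are illustrative choices.
import re
from typing import Dict, List


def split_into_passages(title: str, sections: Dict[str, str],
                        min_words: int = 100) -> List[str]:
    passages: List[str] = []
    for section_title, text in sections.items():
        if not text.strip():
            continue
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        block: List[str] = []
        n_words = 0
        for sentence in sentences:
            block.append(sentence)
            n_words += len(sentence.split())
            if n_words >= min_words:
                passages.append(f"{title} [SEP] {section_title} [SEP] " + " ".join(block))
                block, n_words = [], 0
        if block:  # merge a short remainder into the previous block of this section
            tail = " ".join(block)
            if passages and passages[-1].startswith(f"{title} [SEP] {section_title}"):
                passages[-1] += " " + tail
            else:
                passages.append(f"{title} [SEP] {section_title} [SEP] " + tail)
    return passages
```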

Following the original works, we use BERT
(Devlin et al., 2019) for DPR (both Retriever
and Reader) and T5 (Raffel et al., 2020) for FiD
as base models. Because DPR Reader requires a
span from passage for each training example, we
heuristically select the span from the gold passage
that has the highest lexical overlap (F1 score)
with the gold answer. For the query-rewriting
module QR, we fine-tune T5 model on rewrites
of QReCC (Anantha et al., 2021), and use that
to generate the rewrites for TOPIOCQA. We refer
the reader to Appendix B for more details. The
hyperparameters for all models are mentioned in
Appendix C.
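
The span heuristic can be pictured as in the sketch below, which enumerates candidate spans in the gold passage (with a length cap added here only to keep the example cheap) and keeps the one with the highest token-level F1 against the gold answer; the exact tokenization and normalization used in the paper may differ. The same routine, applied to the full gold document, also yields the extractive upper bound reported later in Table 7.

```python
# Hedged sketch of the span-selection heuristic: pick the passage span with
# the highest token-level F1 overlap with the gold answer.
from collections import Counter
from typing import Tuple


def token_f1(prediction: str, reference: str) -> float:
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def best_span(passage: str, answer: str, max_len: int = 30) -> Tuple[str, float]:
    """Return the span (up to max_len tokens) with the best F1 against answer."""
    tokens = passage.split()
    best, best_score = "", 0.0
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
            span = " ".join(tokens[i:j])
            score = token_f1(span, answer)
            if score > best_score:
                best, best_score = span, score
    return best, best_score
```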

5.2 Evaluation Metrics

Following Choi et al. (2018) and Reddy et al.
(2019), we use exact match (EM) and F1 as eval-
uation metrics for TOPIOCQA.

To compute human and system performance
in the presence of multiple gold annotations, we
follow the evaluation process similar to Choi et al.
(2018) and Reddy et al. (2019). Given n human
answers, human performance on the task is deter-
mined by considering each answer as prediction

Figure 5: A partial conversation and different ques-
tion representations of Q3. The REWRITES representa-
tion is an example, not the output of our QR module.

never explicitly exposed to any training examples,
this can be considered as a zero-shot setting.
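
A sketch of how such a zero-shot input can be assembled is given below; the instruction text is a stand-in for OpenAI's default QA example prompt referenced in the footnote, not the exact prompt, and the Q:/A: formatting is an assumption.

```python
# Sketch of the zero-shot GPT-3 input: a fixed QA-style instruction followed
# by the previous question-answer pairs and the current question. The
# instruction string below is a placeholder, not the prompt used in the paper.
from typing import List, Tuple

INSTRUCTION = "I am a highly intelligent question answering bot."  # placeholder


def build_prompt(history: List[Tuple[str, str]], question: str) -> str:
    lines = [INSTRUCTION]
    for q, a in history:
        lines.append(f"Q: {q}")
        lines.append(f"A: {a}")
    lines.append(f"Q: {question}")
    lines.append("A:")  # the model's completion is taken as the answer
    return "\n".join(lines)
```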

5.1.2 Open-book

We build on state-of-the-art QA models that adapt
a two step retriever-reader approach. For the re-
triever, we consider BM25 (Robertson et al., 1995)
and DPR Retriever (Karpukhin et al., 2020). Given
a query, BM25 ranks the documents based on a
bag-of-words scoring function. On the other hand,
DPR learns dense vector representations of docu-
ment and query, and uses the dot product between
them as a ranking function.

We consider two types of neural readers. (1)
DPR Reader (Karpukhin et al., 2020), which re-
ranks the retrieved passages and selects a span
from each document independently. The span with
highest span score is chosen as the answer. (2)
Fusion-in-Decoder (FiD; Izacard and Grave, 2021), which encodes all retrieved passages independently, and then jointly attends over all of them in the decoder to generate the answer.

For these models, we consider three different
question representations for question at nth turn of
the conversation (qn). Figure 5 shows an example
of different question representations for the third
question (Q3) of a conversation.

• ORIGINAL: This serves as a naive baseline
where just the current question qn is passed
to the model.

• ALLHISTORY: The question is represented as
q1 [SEP] a1 [SEP] q2 [SEP] a2 [SEP]
. . . [SEP] qn−1 [SEP] an−1 [SEP] qn.
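
A sketch of the ALLHISTORY representation, together with the truncation rule described earlier in this section, is given below; length is counted in whitespace tokens here for simplicity, whereas the actual models count subword tokens.

```python
# Sketch of the ALLHISTORY question representation with truncation: keep the
# first turn, the current question, and as many of the most recent turns as
# fit in the encoder budget.
from typing import List, Tuple

SEP = "[SEP]"


def allhistory(history: List[Tuple[str, str]], question: str,
               max_tokens: int = 384) -> str:
    def join(turns: List[Tuple[str, str]]) -> str:
        parts: List[str] = []
        for q, a in turns:
            parts += [q, SEP, a, SEP]
        return " ".join(parts + [question])

    if not history:
        return question
    first, rest = history[0], history[1:]
    # Try to keep k of the most recent remaining turns, for the largest k that fits.
    for k in range(len(rest), -1, -1):
        candidate = join([first] + rest[len(rest) - k:])
        if len(candidate.split()) <= max_tokens:
            return candidate
    return join([first])  # worst case: first turn plus the current question
```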

Model                         Question Rep    Dev EM    Dev F1    Test EM    Test F1
Human                         –               40.2      70.1      40.3       70.0
GPT-3                         –               12.4      33.4      10.4       31.8
BM25 + DPR Reader             ORIGINAL        7.1       12.8      7.2        13.0
                              ALLHISTORY      13.6      25.0      13.8       25.2
                              REWRITES        15.4      32.5      15.7       31.7
BM25 + FiD                    ORIGINAL        10.1      21.8      10.5       22.6
                              ALLHISTORY      24.1      37.2      23.4       36.1
                              REWRITES        24.0      41.6      24.9       41.4
DPR Retriever + DPR Reader    ORIGINAL        4.9       14.9      4.3        14.9
                              ALLHISTORY      21.0      43.4      19.4       41.1
                              REWRITES        17.2      36.4      16.5       35.2
DPR Retriever + FiD           ORIGINAL        7.9       21.6      7.8        21.4
                              ALLHISTORY      33.0      55.3      33.4       55.8
                              REWRITES        23.5      44.2      24.0       44.7

Table 5: Overall performance of all model variants on TOPIOCQA development and test set.

and other human answers as the reference set. Questo
results in n scores, which are averaged to give the
final human performance score. The system pre-
diction is also compared with n distinct reference
sets, each containing n − 1 human answers, and
then averaged. For TOPIOCQA, n = 4 (the origi-
nal answer and three additional annotations). Note
that human performance is not necessarily an up-
per bound for the task, as document retrieval can
potentially be performed better by the systems.
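
The protocol can be sketched as follows; taking the maximum over a reference set is the usual convention from SQuAD-style evaluation and is assumed here rather than quoted from the released evaluation script.

```python
# Hedged sketch of the leave-one-out evaluation described above. `metric` is
# any per-pair scorer, e.g. exact match or the token-level F1 sketched earlier.
# For TOPIOCQA, n = 4 (original answer plus three additional annotations).
from typing import Callable, List


def score_against(prediction: str, references: List[str],
                  metric: Callable[[str, str], float]) -> float:
    return max(metric(prediction, r) for r in references)


def human_score(answers: List[str], metric: Callable[[str, str], float]) -> float:
    # Each human answer is scored against the other n-1 answers, then averaged.
    n = len(answers)
    return sum(
        score_against(answers[i], answers[:i] + answers[i + 1:], metric)
        for i in range(n)
    ) / n


def system_score(prediction: str, answers: List[str],
                 metric: Callable[[str, str], float]) -> float:
    # The system prediction is scored against n reference sets of size n-1.
    n = len(answers)
    return sum(
        score_against(prediction, answers[:i] + answers[i + 1:], metric)
        for i in range(n)
    ) / n


def exact_match(pred: str, ref: str) -> float:
    return float(pred.strip().lower() == ref.strip().lower())
```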

Model            Question Rep    Dev Top-20    Dev Top-100    Test Top-20    Test Top-100
BM25             ORIGINAL        5.2           9.1            6.0            10.1
                 ALLHISTORY      23.1          36.8           22.5           35.6
                 REWRITES        32.5          49.2           33.0           47.4
DPR Retriever    ORIGINAL        9.9           16.5           10.0           15.3
                 ALLHISTORY      70.4          82.4           67.0           80.8
                 REWRITES        49.9          62.4           49.3           61.1

Table 6: Retrieval performance (top-k retrieval accuracy) of all model variants on TOPIOCQA development and test set.

6 Results and Discussion

We report the end-to-end performance of all
also look at the performance of its constituents
(retriever and reader). Tavolo 6 reports the retrieval
performance and Table 7 reports the reading
comprehension performance of the readers, given
the gold passage. Based on these results, we
answer the following research questions.

How do the models compare against humans for
TOPIOCQA?
We report model and human performance on de-
velopment and test set in Table 5. Overall, model
performance in all settings is significantly lower
than the human performance. The best performing
modello (DPR Retriever + FiD using ALLHISTORY

Model               Question Rep    Dev EM    Dev F1    Test EM    Test F1
Extractive Bound    –               47.7      81.1      47.3       81.0
DPR Reader          ORIGINAL        27.1      51.4      25.5       50.4
                    ALLHISTORY      29.7      54.2      28.0       52.6
                    REWRITES        29.8      53.8      28.1       52.1
FiD                 ORIGINAL        34.4      60.5      33.7       61.0
                    ALLHISTORY      38.3      65.5      37.2       64.1
                    REWRITES        34.5      61.9      35.3       62.8

Table 7: Reader performance of all model variants on TOPIOCQA development and test set when provided with the gold passage.

question representation) achieves 33.4 points
EM and 55.8 points F1 on the test set, Quale
falls short of human performance by 6.9 points
E 14.2 points, rispettivamente, indicating room for
further improvement.

Which class of models perform better—Closed
book or Open book?
GPT-3 is directly comparable to ALLHISTORY vari-
ant of open-book models as it takes the entire
conversation history as input. Apart from BM25 +
DPR Reader, GPT-3 performs worse than all
other ALLHISTORY variants of open-book models.
It achieves an F1 score of 31.8 on the test set,
which is 24 points behind the best performing
open-book model (DPR Retriever + FiD). We ob-
serve that GPT-3 often hallucinates many answers,
a phenomenon commonly observed in literature
(Shuster et al., 2021).

How does the performance of open-book models
vary with various question representations?
For all open-book models, we fine-tune on three
different question representations (Section 5).
From the results in Table 5, we observe that the
ORIGINAL representation is consistently worse than
others for all models. This highlights the impor-
tance of encoding the conversational context for
TOPIOCQA. Between ALLHISTORY and REWRITES,
we observe that ALLHISTORY performs better
with dense retriever (DPR Retriever), whereas
REWRITES performs better with sparse retriever
(BM25). To confirm that this performance dif-
ference in end-to-end systems stems from the
retriever, we look at Top-20 and Top-100 re-
trieval accuracy of BM25 and DPR Retriever in
Table 6. Indeed, the ALLHISTORY representation per-
forms better than REWRITES for DPR Retriever but
worse for BM25. As DPR Retriever is trained on
TOPIOCQA, it can probably learn how to select
relevant information from the ALLHISTORY rep-
resentation, whereas for BM25, the non-relevant
keywords in the representation act as distractors.
The better performance of DPR Retriever over
BM25 indicates that TOPIOCQA requires learning
task-specific dense semantic encoding for superior
retrieval performance.

How much are the readers constrained due to
retrieved results?
Table 6 shows retrieval results. In an end-to-end
system, the reader takes as input the retrieved
passages, which may or may not contain the gold
passage. To get an estimate of reader performance
independently from the retriever, we experiment
with directly providing only the gold passage to
the readers, instead of the retrieved ones. Table 7
shows the results. This can be seen as an ‘‘Ideal

Retriever’’ setting, where the retriever always re-
trieves the correct passage as the top one. Although
we observe significant gains over end-to-end sys-
tems for all models across all variants, the best
modello (FiD with ALLHISTORY) still falls short of
human performance by 3.1 points EM and 5.9
points F1 on the test set. These experiments in-
dicate that while passage retrieval is a significant
bottleneck for the task, technical advancements
are needed for the readers as well.

While it is plausible to assume that DPR Reader
is restricted in its performance due to its extractive
nature, we show that this is not the case. We cal-
culate the extractive upper bound for TOPIOCQA
(reported in Table 7) by selecting the span from
the gold document with best F1 overlap with the
ground truth answer. This bound is 47.3 points EM
E 81.0 points F1, which essentially represents
the best that any extractive model can do on this
task. DPR Reader falls short of this upper bound
by 19.2 points EM and 28.4 points F1.

7 Conclusion

We introduced TOPIOCQA, a novel open-domain
conversational question answering dataset with
topic switching. In this work, we described our
data collection effort, analyzed its topic switching
behavior, and established strong neural baselines.
The best performing model (DPR Retriever +
FiD) È 6.9 points EM and 14.2 points F1 below
human performance, suggesting that advances in
modeling are needed. We hope our dataset will be
an important resource to enable more research on
conversational agents that support topic switches
in information-seeking scenarios.

Acknowledgments

We would like to thank TAKT’s annotators and
Jai Thirani (Chief Data Officer) for contributing
to TOPIOCQA. We are grateful for construc-
tive and insightful feedback from the anonymous
reviewers. This work is supported by the follow-
ing grants: MSR-Mila Grant, NSERC Discovery
Grant on Robust conversational models for ac-
cessing the world’s knowledge, and the Facebook
CIFAR AI Chair. Shehzaad is partly supported
by the IBM PhD Fellowship. We thank Compute
Canada for the computing resources.

References

Raviteja Anantha, Svitlana Vakulenko, Zhucheng
Tu, Shayne Longpre, Stephen Pulman, E
Srinivas Chappidi. 2021. Open-domain ques-
tion answering goes conversational via question
rewriting. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 520–534, Online. https://doi.org/10.18653/v1/2021.naacl-main.44

Tom Brown, Benjamin Mann, Nick Ryder,
Melanie Subbiah, Jared D. Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam,
Girish Sastry, Amanda Askell, Sandhini
Agarwal, Ariel Herbert-Voss, Gretchen Krueger,
Tom Henighan, Rewon Child, Aditya Ramesh,
Daniel Ziegler, Jeffrey Wu, Clemens Winter,
Chris Hesse, Mark Chen, Eric Sigler, Mateusz
Litwin, Scott Gray, Benjamin Chess, Jack
Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot
learners. In Advances in Neural Information Pro-
cessing Systems, volume 33, pages 1877–1901.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, E
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186,
Minneapolis, Minnesota.

Ahmed Elgohary, Denis Peskov, and Jordan
Boyd-Graber. 2019. Can you unpack that? Learn-
ing to rewrite questions-in-context. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP),
pages 5918–5924, Hong Kong, China. https://
doi.org/10.18653/v1/D19-1605

Gautier Izacard and Edouard Grave. 2021.
Leveraging passage retrieval with generative
models for open domain question answer-
ing. In Proceedings of the 16th Conference
of the European Chapter of the Association
for Computational Linguistics: Main Volume,
pages 874–880, Online. https://doi.org
/10.18653/v1/2021.eacl-main.74

Danqi Chen, Adam Fisch, Jason Weston, E
Antoine Bordes. 2017. Reading Wikipedia to
answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, Vancouver,
Canada.

Robin Jia and Percy Liang. 2017. Adversarial ex-
amples for evaluating reading comprehension
systems. In Proceedings of the 2017 Confer-
ence on Empirical Methods in Natural Lan-
guage Processing, pages 2021–2031, Copenhagen,
Denmark. https://doi.org/10.18653
/v1/D17-1215

Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar,
Wen-tau Yih, Yejin Choi, Percy Liang, E
Luke Zettlemoyer. 2018. QuAC: Question an-
swering in context. In Proceedings of the 2018
Conference on Empirical Methods in Natu-
ral Language Processing, pages 2174–2184,
Brussels, Belgium.

Jeffrey Dalton, Chenyan Xiong, Vaibhav Kumar,
and Jamie Callan. 2020. CAsT-19: A data-
set for conversational information seeking. In
Proceedings of the 43rd International ACM
SIGIR Conference on Research and Devel-
opment in Information Retrieval, SIGIR ’20,
pages 1985–1988, New York, NY, USA.
https://doi.org/10.1145/3397271
.3401206

Vladimir Karpukhin, Barlas Oguz, Sewon Min,
Patrick Lewis, Ledell Wu, Sergey Edunov,
Danqi Chen, and Wen-tau Yih. 2020. Dense
passage retrieval for open-domain question
answering. In Proceedings of the 2020 Con-
ference on Empirical Methods in Natural Lan-
guage Processing (EMNLP), pages 6769–6781,
Online. https://doi.org/10.18653/v1
/2020.emnlp-main.550

Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom,
Chris Dyer, Karl Moritz Hermann, Gábor Melis,
and Edward Grefenstette. 2018. The Narrative-
QA reading comprehension challenge. Trans-
actions of the Association for Computational
Linguistics, 6:317–328. https://doi.org
/10.1162/tacl_a_00023

Tom Kwiatkowski, Jennimaria Palomaki, Olivia
Redfield, Michael Collins, Ankur Parikh, Chris
Alberti, Danielle Epstein,
Illia Polosukhin,
Jacob Devlin, Kenton Lee, Kristina Toutanova,
Llion Jones, Matthew Kelcey, Ming-Wei
Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc
Le, and Slav Petrov. 2019. Natural questions:
A benchmark for question answering research.
Transactions of the Association for Computational Linguistics, 7:452–466. https://doi.org/10.1162/tacl_a_00276

Kenton Lee, Ming-Wei Chang, and Kristina
Toutanova. 2019. Latent retrieval for weakly
supervised open domain question answering.
In Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics,
pages 6086–6096, Florence, Italy.

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng
Gao, Saurabh Tiwary, Rangan Majumder, E
Li Deng. 2016. MS MARCO: A Human Gener-
ated MAchine Reading COmprehension Data-
set. In CoCo@NIPS.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel,
Patrick Lewis, Anton Bakhtin, Yuxiang Wu,
and Alexander Miller. 2019. Language models
as knowledge bases? In Proceedings of the 2019
Conference on Empirical Methods in Natural
Language Processing and the 9th International
Joint Conference on Natural Language Pro-
cessazione (EMNLP-IJCNLP), pages 2463–2473,
Hong Kong, China. https://doi.org/10
.18653/v1/D19-1250

Chen Qu, Liu Yang, Cen Chen, Minghui Qiu, W. Bruce Croft, and Mohit Iyyer. 2020. Open-Retrieval Conversational Question Answering. SIGIR ’20, pages 539–548, New York, NY, USA.

Chen Qu, Liu Yang, Minghui Qiu, W. Bruce
Croft, Yongfeng Zhang, and Mohit Iyyer. 2019a. BERT with history answer embedding for conversational question answering. In Proceedings
of the 42nd International ACM SIGIR Confer-
ence on Research and Development in Infor-
mation Retrieval, SIGIR’19, pages 1133–1136,
New York, NY, USA.

Chen Qu, Liu Yang, Minghui Qiu, Yongfeng
Zhang, Cen Chen, W. Bruce Croft, and Mohit
Iyyer. 2019b. Attentive history selection for conversational question answering. In Proceed-
ings of the 28th ACM International Conference

on Information and Knowledge Management,
CIKM ’19, pages 1391–1400, New York,
NY, USA.

Colin Raffel, Noam Shazeer, Adam Roberts,
Katherine Lee, Sharan Narang, Michael Matena,
Yanqi Zhou, Wei Li, and Peter J. Liu. 2020.
Exploring the limits of transfer learning with
a unified text-to-text transformer. Journal of
Machine Learning Research, 21(140):1–67.

Pranav Rajpurkar, Robin Jia, and Percy Liang.
2018. Know what you don’t know: Unanswer-
able questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia.
https://doi.org/10.18653/v1/P18
-2124

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Con-
ference on Empirical Methods in Natural
Language Processing, pages 2383–2392, Austin,
Texas. https://doi.org/10.18653/v1
/D16-1264

Siva Reddy, Danqi Chen, and Christopher Manning.
2019. CoQA: A conversational question answer-
ing challenge. Transactions of the Association
for Computational Linguistics, 7(0):249–266.
https://doi.org/10.1162/tacl_a_00266

Adam Roberts, Colin Raffel, and Noam Shazeer.
2020. How much knowledge can you pack
into the parameters of a language model? In
Proceedings of the 2020 Conference on Empir-
ical Methods in Natural Language Processing
(EMNLP), pages 5418–5426, Online. https://
doi.org/10.18653/v1/2020.emnlp-main
.437

Stephen Robertson, S. Walker, S. Jones, M. M.
Hancock-Beaulieu, and M. Gatford. 1995. Okapi
at TREC-3. In Overview of the Third Text Re-
trieval Conference (TREC-3), pages 109–126.

Anna Rogers, Matt Gardner, and Isabelle Augenstein. 2021. QA dataset explosion: A
taxonomy of NLP resources for question an-
swering and reading comprehension. arXiv
preprint arXiv:2107.12708.

Harvey Sacks and Gail Jefferson. 1995. Lectures
on conversation, Volume 1. pages 289–331.

Anshumali Shrivastava and Ping Li. 2014.
Asymmetric LSH (ALSH) for sublinear time
maximum inner product search (MIPS). In Pro-
ceedings of the 27th International Conference
on Neural Information Processing Systems
Volume 2, pages 2321–2329, Montreal, Canada.

Kurt Shuster, Spencer Poff, Moya Chen, Douwe
Kiela, and Jason Weston. 2021. Retrieval
augmentation reduces hallucination in con-
versation. arXiv preprint arXiv:2104.07567.
https://doi.org/10.18653/v1/2021
.findings-emnlp.320

Amanda Spink, H. Cenk Ozmutlu, and Seda
Ozmutlu. 2002. Multitasking information seek-
ing and searching processes. Journal of the
American Society for Information Science and
Tecnologia, 53(8):639–652. https://doi
.org/10.1002/asi.10124

Manfred Stede and David Schlangen. 2004.
Information-seeking chat: Dialogue manage-
ment by topic structure.

Svitlana Vakulenko, Shayne Longpre, Zhucheng
Tu, and Raviteja Anantha. 2021. Question
rewriting for conversational question answer-
ing. In Proceedings of the 14th ACM Interna-
tional Conference on Web Search and Data
Mining, WSDM ’21, pages 355–363, New
York, NY, USA. https://doi.org/10
.1145/3437963.3441748

Douglas Walton. 2019. The New Dialectic: Con-
versational Contexts of Argument. Università
of Toronto Press.

Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo
Wang, Tim Klinger, Wei Zhang, Shiyu
Chang, Gerry Tesauro, Bowen Zhou, and Jing
Jiang. 2018. R3: Reinforced ranker-reader for
open-domain question answering. In AAAI,
pages 5981–5988.

Zhiguo Wang, Patrick Ng, Xiaofei Ma, Ramesh
Nallapati, and Bing Xiang. 2019. Multi-passage
BERT: A globally normalized BERT model
for open-domain question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP),
pages 5878–5882, Hong Kong, China. https://
doi.org/10.18653/v1/D19-1599

Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li,
Luchen Tan, Kun Xiong, Ming Li, and Jimmy

Lin. 2019. End-to-end open-domain question
answering with BERTserini. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 72–77,
Minneapolis, Minnesota. https://doi.org
/10.18653/v1/N19-4013

Yi Yang, Wen-tau Yih, and Christopher Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2013–2018, Lisbon, Portugal. https://doi.org/10.18653/v1/D15-1237

Chenguang Zhu, Michael Zeng, and Xuedong
Huang. 2019. SDNet: Contextualized attention-
based deep network for conversational ques-
tion answering. arXiv preprint arXiv:1812
.03593.

A Annotators Details

Each conversation in TOPIOCQA is an interac-
tion between two annotators, a questioner and
an answerer. The annotators were selected from
TAKT’s in-house workforce, based on their En-
glish language proficiency and trained for the role
of both questioner and answerer. The annotators
are provided with the following guidelines.

Guidelines for the Questioner:

• The first question should be unambiguous

and about the seed entity.

• The follow-up questions should be contextualized
and dependent on the conversation history
whenever possible.

• Avoid using the same words as in section titles
of the document. For example, if the section
title is ‘‘Awards’’, a plausible question can
be ‘‘What accolades did she receive for her
work?’’.

• The conversation should involve multiple

documents (topics).

Guidelines for the Answerer:

• Based on the question, identify the relevant

document and section.

• The answer should be based on the contents

of the identified document.

• The rationale should be selected such that it

justifies the answer.

• The answer should be a sub-string in rationale
whenever possible. However, answers should
be edited to fit the conversational context
(adding yes, no), perform reasoning (e.g.,
counting), and so forth.

• Personal opinions should never be included.

After providing the guidelines and a few examples,
the initial annotated conversations were manually
inspected by the authors. The workers who pro-
vided low-quality annotations during this inspec-
tion phase were disqualified. The final workforce
consisted of 15 workers, who provided annota-
tions for the dataset over a period of two months.
Random quality checks were performed by the
authors and periodic feedback was given to the
annotators throughout the data collection to main-
tain high quality of data. Figure 8 shows annotation
interfaces for questioner and answerer. Figure 6
shows an example from the dataset.

We also implemented several real-time checks
in the questioner’s interface to encourage topic
switching and use of co-reference, and to reduce
the lexical overlap with the metadata of the docu-
ment while forming the question.

B Query Rewriting

A query-rewriting module, QR, takes the cur-
rent question and the conversation history as
input (q1, a1, . . . , qn−1, an−1, qn) and provides a
decontextualized rewritten question, q′n, as the output. As we don’t collect rewrites in TOPI-
OCQA, we rely on other datasets to train our
QR model. Two datasets that provide rewrites for
information-seeking conversations are CANARD
(Elgohary et al., 2019) and QReCC (Anantha
et al., 2021). Due to its large-scale and diverse
nature, we use QReCC to train our T5-based QR module.

To rewrite the nth question, the conversation
history and the current question is given to model
as q1 [SEP] a1 [SEP] q2 [SEP] a2 [SEP] . . .
[SEP] qn−1 [SEP] an−1 [SEP] qn. We train
this model on QReCC dataset. On the test split
of QReCC, our model achieves a BLEU score of
62.74 points. We use this model to generate re-
writes for TOPIOCQA in our experiments. Figure 7
shows a conversation from the dataset along with
rewrites from this T5-based QR module.
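
At inference time, running such a rewriter can be sketched as follows; the checkpoint name is a placeholder for the T5 model after fine-tuning on QReCC, and the generation settings are illustrative rather than taken from the paper.

```python
# Inference-shape sketch for the QR module: the history and current question
# are flattened with [SEP] and fed to a seq2seq T5 model. "t5-base" is only a
# placeholder; in the paper the model is first fine-tuned on QReCC rewrites.
from typing import List, Tuple
from transformers import T5ForConditionalGeneration, T5Tokenizer


def rewrite(history: List[Tuple[str, str]], question: str,
            model, tokenizer, max_length: int = 64) -> str:
    parts: List[str] = []
    for q, a in history:
        parts += [q, "[SEP]", a, "[SEP]"]
    source = " ".join(parts + [question])
    inputs = tokenizer(source, return_tensors="pt", truncation=True)
    output_ids = model.generate(inputs.input_ids, max_length=max_length,
                                num_beams=4)  # beam size is illustrative
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


tokenizer = T5Tokenizer.from_pretrained("t5-base")             # placeholder checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-base")  # fine-tune on QReCC first
```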

Figure 6: A full conversation from TOPIOCQA.

We observe that while this QR module can resolve simple coreferences (Q2 and Q5), it struggles later in the conversation in the presence of multiple entities (he is resolved to albus dumbledore instead of tom marvolo riddle in Q10 and Q12). The QR module also fails to perform the reasoning required for correct rewrites (for example, boy wizard’s nemesis is not rewritten to Lord Voldemort in Q9, even though this information is present in A5).

C Hyperparameter Details

We use Lucene BM25 with k1 = 0.9 (term fre-
quency scaling) and b = 0.4 (document length
normalization). For both DPR and FiD, apart
from the batch size, we use the hyperparameters
suggested in their codebases. We use the maxi-
mum batch size that fits in the GPU cluster. DPR
Retriever is trained on four 40GB A100 GPUs,
whereas DPR Reader and FiD are trained on 8
32GB V100 GPUs. We use base model size for
all systems. Following original implementations,
DPR Retriever is trained for 40 epochs, DPR
Reader for 20 epochs, and FiD for 15,000 gradient
steps. The model checkpoint with best EM score
on development set is selected as the final model.
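
The effect of the two BM25 hyperparameters can be illustrated with the pure-Python rank_bm25 package (an assumption made only for this example; the paper's retrieval uses a Lucene index): k1 controls term-frequency saturation and b controls document-length normalization.

```python
# Illustration of the BM25 hyperparameters k1=0.9 and b=0.4 using rank_bm25
# (not the Lucene implementation used in the paper).
from rank_bm25 import BM25Okapi

passages = [
    "Byzantine Empire was the continuation of the Roman Empire",
    "Mehmed the Conqueror captured Constantinople in 1453",
]
tokenized = [p.lower().split() for p in passages]
bm25 = BM25Okapi(tokenized, k1=0.9, b=0.4)

query = "who captured constantinople".split()
print(bm25.get_top_n(query, passages, n=1))
```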

Figure 7: An example of a conversation from TOPIOCQA along with rewrites from the QR module. A few turns are excluded and some answers are shortened for brevity.

Figure 8: Annotation interface for questioners and answerers.

F

B

G
tu
e
S
T

T

o
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

483TopiOCQA: Open-domain Conversational Question Answering image
TopiOCQA: Open-domain Conversational Question Answering image
TopiOCQA: Open-domain Conversational Question Answering image
TopiOCQA: Open-domain Conversational Question Answering image
TopiOCQA: Open-domain Conversational Question Answering image

Scarica il pdf