TopiOCQA: Open-domain Conversational Question Answering
with Topic Switching
Vaibhav Adlakha1,4 Shehzaad Dhuliawala2
Kaheer Suleman3 Harm de Vries4 Siva Reddy1,5
2ETH Z¨urich, 瑞士
3Microsoft Montr´eal, 加拿大
1Mila, 麦吉尔大学, 加拿大
4ServiceNow Research, 加拿大
5Facebook CIFAR AI Chair, 加拿大
{vaibhav.adlakha,siva.reddy}@mila.quebec
抽象的
In a conversational question answering sce-
成员, a questioner seeks to extract information
about a topic through a series of interdepen-
dent questions and answers. As the conver-
sation progresses, they may switch to related
主题, a phenomenon commonly observed in
information-seeking search sessions. 然而,
current datasets for conversational question
answering are limiting in two ways: 1) 他们
do not contain topic switches; 和 2) they as-
sume the reference text for the conversation is
给定, 那是, the setting is not open-domain.
We introduce TOPIOCQA (pronounced Tapi-
oca), an open-domain conversational dataset
with topic switches based on Wikipedia.
TOPIOCQA contains 3,920 conversations with
information-seeking questions and free-form
答案. 平均而言, a conversation in our
dataset spans 13 question-answer turns and
involves four topics (文件). TOPIOCQA
poses a challenging test-bed for models, 在哪里
efficient retrieval is required on multiple turns
of the same conversation, in conjunction with
constructing valid responses using conversa-
tional history. We evaluate several baselines,
by combining state-of-the-art document re-
trieval methods with neural reader models.
Our best model achieves F1 of 55.8, falling
short of human performance by 14.2 点,
indicating the difficulty of our dataset. 我们的
dataset and code are available at https://
mcgill-nlp.github.io/topiocqa.
1
介绍
People often engage in information-seeking con-
versations to discover new knowledge (沃尔顿,
2019). In such conversations, a questioner (这
seeker) asks multiple rounds of questions to an
answerer (the expert). As the conversation pro-
ceeds, the questioner becomes inquisitive of new
but related topics based on the information pro-
468
vided in the answers (Stede and Schlangen, 2004).
Such topic switching behaviour is natural
在
information-seeking conversations and is com-
monly observed when people seek information
through search engines (Spink et al., 2002).
According to Spink et al., people switch from
one to ten topics with a mean of 2.11 topic switches
per search session. 例如, a person can start
a search session about tennis, and then land on
Roger Federer, and after learning a bit about him
may land on his country Switzerland, and spend
more time learning about other Swiss athletes.
Thanks to tremendous progress in question an-
swering research (Rogers et al., 2021), we are
coming close to enabling information-seeking
conversations with machines (as opposed to just
using keywords-based search). In order to real-
ize this goal further, it is crucial to construct
datasets that contain information-seeking conver-
sations with topic switching, and measure progress
of conversational models on this task, the two
primary contributions of this work.
在文献中, a simplified setting of
information-seeking conversation known as con-
versational question answering (CQA) 已经
deeply explored (Choi et al., 2018; Reddy et al.,
2019). In this task, the entire conversation is based
on a given reference text of a topic/entity. 尽管
the CQA task is challenging, it still falls short of
the real-world setting, where the reference text is
not known beforehand (first limitation) 和
conversation is not restricted to a single topic
(second limitation).
Qu et al. (2020) and Anantha et al. (2021)
have attempted to overcome the first limitation
by adapting existing CQA datasets to the open-
domain setting. They do so by obtaining context-
independent rewrites of the first question to make
the question independent of the reference text. 为了
例子, if the reference text is about Augusto
计算语言学协会会刊, 卷. 10, PP. 468–483, 2022. https://doi.org/10.1162/tacl 00471
动作编辑器: Hua Wu. 提交批次: 11/2021; 修改批次: 12/2021; 已发表 4/2022.
C(西德:2) 2022 计算语言学协会. 根据 CC-BY 分发 4.0 执照.
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
7
1
2
0
0
8
1
2
6
/
/
t
我
A
C
_
A
_
0
0
4
7
1
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
Pinochet and the conversation starts with a ques-
的 “Was he known for being intelligent?”, 这
question is re-written to “Was Augusto Pinochet
known for being intelligent?”. 然而, 作为
entire question sequence in the conversation was
collected with a given reference text of a topic, 全部
the turns still revolve around a single topic.
在这项工作中, we present TOPIOCQA1—Topic
switching in Open-domain Conversational Question
Answering—a large-scale dataset for information-
seeking conversations in open-domain based on
the Wikipedia corpus. We consider each Wikipedia
document to be a separate topic. The conversations
in TOPIOCQA start with a real information-seeking
question from Natural Questions (Kwiatkowski
等人。, 2019) in order to determine a seed topic
(文档), and then the questioner may shift to
other related topics (文件) as the conver-
sation progresses.2 Throughout the conversation,
the questioner is never shown the content of the
文件 (but only the main title and section ti-
tles) to simulate an information-seeking scenario,
whereas the answerer has full access to the content
along with the hyperlink structure for navigation.
In each turn, both questioner and answerer use
free-form text to converse (as opposed to extrac-
tive text spans as is common for an answerer in
many existing datasets).
数字 1 shows an example of a conversation
from our dataset. The first question leads to the
seed topic Byzantine Empire, and after two turns
switches to Mehmed the Conqueror in Q4, 基于
on part of the answer (A3) that contains reference
to Mehmed. Note that the answers A1, A3, and A4
are free-form answers that do not occur as spans
in either the seed document or the follow up doc-
ument. The topic then switches to Anatolia based
on part of the previous answer (A4). The topics
change in further turns to Turkey and Ankara.
Because of the conversational nature, TOPIOCQA
contains questions rife with complex coreference
现象, 例如, Q9 relies on entities
mentioned in A7, A8 and Q1.
TOPIOCQA contains 3,920 conversations and
50,574 QA pairs, based on Wikipedia corpus of
5.9 million documents. 平均而言, a conversa-
tion has 13 question-answer turns and involves 4
主题. Twenty-eight percent of turns in our data-
1TOPIOCQA is pronounced as Tapioca.
2A portion of the training data also contains conversations
where the questioner asks the first question given a seed topic.
数字 1: A conversation from TOPIOCQA. Our dataset
has information-seeking questions with free-form an-
swers across multiple topics (文件). The consec-
utive turns from the same topic (文档) 已经
excluded for brevity.
set require retrieving a document different from
the previous turn. 据我们所知,
TOPIOCQA is the first open-domain information-
seeking CQA dataset that incorporates topical
变化, along with other desirable properties
(见表 1).
To investigate the difficulty of the TOPIOCQA
dataset, we benchmark several strong retriever-
reader neural baselines, considering both sparse
and dense retrievers, as well as extractive and gen-
erative readers (Karpukhin et al., 2020; Izacard
and Grave, 2021). Inspired by previous work,
we explore two ways to represent the question:
(1) concatenating the entire conversation history
(Qu et al., 2020), 和 (2) self-contained rewrites
of the conversational question (Anantha et al.,
2021). The best performing model—Fusion-in-
469
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
7
1
2
0
0
8
1
2
6
/
/
t
我
A
C
_
A
_
0
0
4
7
1
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
数据集
TOPIOCQA (ours)
QReCC (Anantha et al., 2021)
OR-QuAC (Qu et al., 2020)
CoQA (Reddy et al., 2019)
QuAC (Choi et al., 2018)
NarrativeQA (Koˇcisk´y et al., 2018)
Natural Questions (Kwiatkowski et al., 2019)
SQuAD 2.0 (Rajpurkar et al., 2018)
Multi–turn
Open–domain
Free–form answers
Information–seeking questions
Topic Switching
✓
✓
✓
✓
✓
✗
✗
✗
✓
✓
✓
✗
✗
✓
✓
✗
✓
✓
✗
✓
✗
✓
✗
✗
✓
✓
✗
✓
✓
✓
✗
✓
✗
✗
✗
✗
✗
✗
桌子 1: Comparison of TOPIOCQA with other QA datasets. TOPIOCQA incorporates topical changes,
represents that only a proportion of dataset
along with several best practices of previous datasets.
satisfies the property.
Decoder (Izacard and Grave, 2021) trained on con-
catenated conversation history—is 14.2 F1 points
short of human performance, indicating signif-
icant room for improvement. We also evaluate
GPT-3 to estimate the performance in a closed-
book zero-shot setting, and its performance is 38.2
F1 points below the human performance.
2 相关工作
2.1 Open-Domain Question Answering
In open-domain question answering, a model has
to answer natural language questions by retriev-
ing relevant documents. This can be considered
as a simplified setting of open-domain CQA,
where the conversation is limited to just one
转动. Several datasets have been proposed for this
任务. 一方面, reading comprehension data-
sets like SQuAD (Rajpurkar et al., 2016, 2018),
which consist of (问题, 文档, 回答)
triplets, have been adapted for the task by with-
holding access to the document (陈等人。, 2017).
While these datasets have been helpful in spurring
modelling advances, they suffer from an anno-
tator bias because they were not collected in an
information-seeking setup. 那是, annotators had
access to the target answer and its surrounding
context and therefore formulated questions that
had a high lexical overlap with the answer (Jia
and Liang, 2017). 另一方面, Web-search
based datasets do not suffer from such artefacts
because they are curated from real search engine
queries. The WikiQA (杨等人。, 2015) 和
MS Marco (Nguyen et al., 2016) datasets contain
queries from the Bing search engine, 然而
Natural Questions (Kwiatkowski et al., 2019)
contain queries from the Google search engine.
Models for open-domain QA often follow a
two-stage process: (1) A retriever selects a small
collection of documents relevant to the question
from a big corpus (例如, 维基百科), (2) a reader
extracts or generates an answer from the selected
文件. While classical approaches rely on
counting-based bag-of-words representations like
TF-IDF or BM25 (陈等人。, 2017; 王等人。,
2018; 杨等人。, 2019), more recent deep learn-
ing approaches learn dense representations of the
questions and document through a dual-encoder
框架 (李等人。, 2019; Karpukhin et al.,
2020). In such learned retriever setups, 文档
retrieval is done efficiently using Maximum Inner
Product Search (MIPS, Shrivastava and Li, 2014).
2.2 Conversational Question
Answering (CQA)
CQA extends the reading comprehension task
from a single turn to multiple turns. Given a refer-
ence document, a system is tasked with interac-
tively answering a sequence of information-seeking
questions about the corresponding document. 这
conversational extension leads to novel challenges
in modeling linguistic phenomena such as ana-
phora (referencing previous turns) and ellipsis
(omitting words from questions), as well as in per-
forming pragmatic reasoning. Large-scale con-
versational datasets such as CoQA (Reddy et al.,
2019) and QuAC (Choi et al., 2018) have facil-
itated much of the research in this area. 这些
datasets differ along several dimensions, two of
哪个是 (1) CoQA has short free-form answers,
whereas QuAC has long extractive span-based an-
swers, 和 (2) unlike CoQA, QuAC is collected
in a simulated information-seeking scenario.
Models for CQA have used simple concatena-
tion of the question-answer history (Zhu et al.,
2019), history turn selection (Qu et al., 2019A,乙),
and question-rewrites (Vakulenko et al., 2021).
For question-rewriting, a different module is
trained on self-contained rewrites of context-
dependent questions. 例如, a plausible
rewrite of Q8 (数字 1) is ‘‘can you name
some of the major cities in Turkey apart from
470
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
7
1
2
0
0
8
1
2
6
/
/
t
我
A
C
_
A
_
0
0
4
7
1
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
Ankara?’’. The re-written question is then an-
swered using open-domain QA systems. Two pop-
ular question-rewriting datasets for training this
module are (1) CANARD (Elgohary et al., 2019),
which contains re-writes of 50% of QuAC, 和
(2) QReCC (Anantha et al., 2021), which contains
rewrites of the entire QuAC dataset and a small
portion from other sources.
2.3 Open-Domain CQA
在这项工作中, we focus on constructing a chal-
lenging benchmark for open-domain CQA. 这
open-domain aspect requires systems to answer
questions without access to a reference document.
The conversational aspect enables users to ask
multiple related questions, which can, 原则,
span several different topics. With TOPIOCQA, 我们
introduce the first open-domain CQA dataset that
explicitly covers such topical switches.
Previous datasets for
this task re-purpose
existing CQA datasets. The OR-QuAC dataset
(Qu et al., 2020) is automatically constructed
from QuAC (Choi et al., 2018) and CANARD
(Elgohary et al., 2019) by replacing the first ques-
tion in QuAC with context-independent rewrites
from CANARD. QReCC (Anantha et al., 2021)
is a large-scale open-domain CQA and question
rewriting dataset that contains conversations from
QuAC, TREC CAsT (Dalton et al., 2020), and Nat-
ural Questions (NQ; Kwiatkowski et al., 2019). 全部
the questions in OR-QuAC and 78% of questions
in QReCC are based on QuAC. As conversations
in QuAC were collected with a given reference
文档, the question sequences of these con-
versations revolve around the topic or entity cor-
responding to that document. Twenty-one percent
of questions in QReCC are from NQ-based con-
诗篇. As NQ is not a conversational dataset,
the annotators of QReCC use NQ to start a con-
versation. A single annotator is tasked with pro-
viding both follow-up questions and answers for
to QReCC,
a given NQ question. 相比之下
conversations in our dataset are collected in a
simulated information-seeking scenario using two
annotators (部分 3.3).
Deep learning models for this task have fol-
lowed a similar retriever-reader setup as open-
domain QA. Instead of a single question, previous
works have explored feeding the entire conversa-
tion history (Qu et al., 2020), or a context indepen-
dent re-written question (Anantha et al., 2021).
3 Dataset Collection
Each conversation in TOPIOCQA is an interac-
tion between two annotators—a questioner and an
answerer. The details about the annotator selec-
tion are provided in Appendix A.
3.1 Seed Topics and Document Collection
The seed topics essentially drive the conversation.
In order to make them interesting for annotators,
we select the good3 articles of Wikipedia as seed
主题 (around 35k) for the first turn, but use
entire Wikipedia for later turns. We used the
Wikipedia dump from 10/20/2020, which con-
sists of 5.9 million documents. We used Wikiex-
tractor4 to extract the text. While pre-processing
the Wikipedia documents, we retain the hyper-
links that refer to other Wikipedia documents,
thus ensuring that we can provide all the doc-
uments requested by annotators (via hyperlinks)
during the conversation.
3.2 Simulating Information-seeking Scenario
Information-seeking conversations are closer to
the real-world if an information need can be
simulated via the data collection interface. 在
TOPIOCQA, we achieve this by withholding ques-
tioner’s access to the full reference text of the
文档. The questioner can only see the meta-
数据 (main title and the section titles) 的
Wikipedia documents, whereas the answerer can
access the entire text of the documents. On finding
the answer, the answerer highlights a contiguous
span of text as rationale, and generates a free-form
回答. The answerer also has the option to mark
the question as unanswerable. The conversation
history is visible to both the annotators.
As a conversation starting point,
the first
question is sampled from a subset of NQ
(Kwiatkowski et al., 2019) since NQ contains
genuine information-seeking questions asked on
谷歌. We only sample those questions for which
the answer is in our seed document pool. 到
increase the diversity of our dataset, we also al-
low the questioner to formulate the first question
based on the provided seed topic entity for 28%
of the conversations.
3Wikipedia Good articles.
4github:wikiextractor.
471
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
7
1
2
0
0
8
1
2
6
/
/
t
我
A
C
_
A
_
0
0
4
7
1
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
3.3 Enabling Topic-switching
The key feature of the interface is enabling topic
switching via hyperlinks. For the answerer, 这
text of the document includes clickable hyper-
links to other documents. On clicking these links,
the current document in the answerer’s interface
changes to the requested (clicked) 文档. 这
enables the answerer to search for answers in doc-
uments beyond the current one. The questioner
can access the metadata of documents visited by
the answerer and documents present in the ration-
ale of the answers. 例如, let us assume
that given the seed document Daniel Radcliffe and
the first question ‘‘Where was Daniel Radcliffe
born?’’, the answerer selects the ‘‘Daniel Jacob
Radcliffe was born in London on 23 July 1989’’
span as rationale and provides ‘‘London’’ as the
回答. If London is a hyperlink in the rationale
span, then the metadata of both Daniel Radcliffe
and London is available to the questioner to form
the next question. If the next question is ‘‘What
is its population?’’,
the answerer can switch
the current document from Daniel Radcliffe to
London by clicking on the hyperlink, 并且可以
then find and provide the answer. The conver-
sation up till
involves two topics:
Daniel Radcliffe and London. We also provide
easy navigation to previously visited documents
for both the annotators. This interface design
(数字 8) ensures that information about the new
topic is semantically connected to topics of the
previous turns, similar to natural human-human
conversations (Sacks and Jefferson, 1995).
这一点
3.4 Additional Annotations
To account for multiple valid answers, we col-
lected three additional annotations for answers of
conversations in evaluation sets (development and
test splits). For this task, at any turn, the annotator
can see all the previous questions and original an-
swers. Showing original answers of previous turns
is important in a conversational setting as the sub-
sequent questions can potentially depend on them.
We also provide the list of documents correspond-
ing to previous turns of the original conversation.
This ensures that the current annotator has all
the information the original answerer had while
providing the answer. Similar to the answerer,
the annotator then provides the rationale and the
回答, or marks the question as unanswerable.
数据集
# Turns
# Conversations
# 代币 / 问题
# 代币 / Answer
# Turns / conversation
# Topics / conversation
Train
45,450
3,509
6.91
11.71
13
4
Dev
2,514
205
6.89
11.96
12
4
Test
2,502
206
7.11
12.27
12
4
全面的
50,466
3920
6.92
11.75
13
4
桌子 2: Dataset statistics of TOPIOCQA.
4 Dataset Analysis
We collected a total of 3,920 conversations, 骗局-
sisting of 50,466 轮流. The annotators were
encouraged to complete a minimum of 10 轮流.
Conversations with fewer than 5 turns were dis-
carded. We split the data into train, 发展,
and test splits.
桌子 2 reports simple statistics of the dataset
splits. 平均而言, a conversation in TOPIOCQA
有 13 question-answer turns and is based on 4
文件. Our dataset differs from other con-
versational question-answering datasets by incor-
porating topic switches in the conversation.
4.1 Topic Switching
Before we start our analysis, let us first define the
notion of a topic switch in TOPIOCQA. Recall that
answers are based on Wikipedia articles, 在哪里
each document consists of several sections. 尽管
one can argue that a topic switch occurs when the
answer is based on a different section of the same
文档, we opt for a more conservative notion
and define a switch of topic if the answer is based
on a different Wikipedia document.
Number of Topics vs Conversation Length
We begin our analysis by investigating how the
number of topics varies with the conversation
length. 图中 2(A) we show a heat-map of the
number of topics for each conversation length,
where each column is normalized by the num-
ber of conversations of that length. We observe
that longer conversations usually include more
主题. Most 10-turn conversations include 3 顶部-
集成电路, 14-turn conversations include 4 主题, 和
18-turn conversations include 5 主题. 骗局-
versations with fewer than 10 turns mostly include
只是 2 主题.
Topic Flow in Conversation Next, we examine
how often consecutive questions stay within the
472
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
7
1
2
0
0
8
1
2
6
/
/
t
我
A
C
_
A
_
0
0
4
7
1
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
数字 2: Analysis of the topic switches in TOPIOCQA. 在 (A) we show the distribution of the number of topics (在
percentage) for each conversation length. Longer conversations typically include more topics. 在 (乙) we show a
histogram of the topic lengths, illustrating that usually 3–4 consecutive questions stay within the same topic.
数字 3: A flow diagram of topic switches over conversations up to 15 轮流. There are complex interactions
between the topics, especially later in the conversation.
same topic. 这样做, we first cluster conversa-
tions into sequences of turns for which all answers
are from the same document. 然后, we count how
many turns belong to topic clusters of a particu-
lar length. 数字 2(乙) shows the distribution of
topic lengths. The mode of the distribution is 3,
signifying that annotators usually ask 3 问题
about the same topic before switching. Asking 2
或者 4 consecutive questions on the same topic is
also frequently observed. 然而, we rarely see
多于 10 consecutive turns on the same topic.
We also analyze the flow of topics throughout
the conversation. Do annotators always introduce
new topics or do they also go back to old ones?
数字 3 depicts a flow diagram of topics in
conversations up to 15 轮流. Note that we have
indexed topics according to their first occurrence
in the conversation. We can see that the majority of
switches introduce new topics, but also that more
complex topic switching emerges in later turns.
具体来说, we see that, from sixth turn onwards,
questioners frequently go back one or two topics in
the conversation. 全面的, this diagram suggests
that there are complex interactions among the
topics in the conversation.
Qualitative Assessment of Topic Switching
In order to understand the nature of a topic
switch, inspired from Stede and Schlangen (2004),
we classify questions into three types: ask-
generic refers to general open-ended questions,
ask-specific questions ask about a specific
attribute or detail of a topic, and ask-further
is a question type that seeks additional details
of an attribute discussed in one of the previous
轮流. 桌子 4 shows examples of each type for
questions in the same conversation. We consider
three types of turns for our evaluation. If the an-
swer document of the turn is same as the previous
转动, we refer to it as no-switch. If a topic
switch has happened, and the answer document is
present in one of the previous turns, it is consid-
ered to be switch-to-old. The final category,
switch-to-new refers to turns where current
473
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
7
1
2
0
0
8
1
2
6
/
/
t
我
A
C
_
A
_
0
0
4
7
1
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
Question Type
Avg Answer length
ask-generic
ask-specific
ask-further
22.43
11.38
11.23
桌子 3: Average answer length of different
question types. Generic questions tend to
have longer answers.
Turn type
Question type
Conversation turn
no-switch
ask-generic
no-switch
ask-specific
switch-to-new
ask-specific
switch-to-old
ask-specific
no-switch
ask-further
问: who is mariah carey?
A: An American singer
songwriter and actress
话题: Mariah Carey
问: name one of her famous songs.
A: Oh Santa!
话题: Mariah Carey
问: how was it received?
A: There were mixed reviews
话题: Oh Santa!
问: is she married?
A: 是的
话题: Mariah Carey
问: to whom?
A: Tommy Mottola
话题: Mariah Carey
桌子 4: Examples of various turn types and
question types in a conversation. Random samples
of each turn type are manually annotated with one
of the question types.
answer document has not been seen in the con-
versation before. These different types of topic
switches are also illustrated in Table 4.
We sample 50 turns of each type, 和男人-
ually label them with one of the three question
类型. 数字 4 shows the results of our eval-
uation. ask-specific is the most common
question type across all types of turns, indicating
that most of the questions in the dataset focus on
specific attributes of a topic. ask-generic has
a much higher proportion in switch-to-new
turn types, indicating that it is more likely to
see generic questions in turns that introduce a
new topic in the conversation, compared to other
turn types. ask-further has almost equal pro-
portion in no-switch and switch-to-old,
with switch-to-old being slightly higher.
ask-further is not observed in switch-
to-new as follow-up questions are generally not
possible without the topic being discussed in any
of the previous turns.
We also look at average answer length of
answers of all three question types (桌子 3).
不出所料, ask-generic has a much
数字 4: Distribution of various question types for each
turn type. Questions asking about specific attributes
are most common. Generic questions are likely to be
observed when switching to a new topic.
higher answer length compared to other types,
presumably due to the open-ended nature of
问题.
5 实验装置
The task of open-domain information-seeking
conversation can be framed as follows. 给定
previous questions and ground truth answers
{q1, a1, q2, a2, . . . , qi−1, ai−1} and current ques-
tion qi, the model has to provide the answer ai.
This can be considered as an oracle setting, 作为
the gold answers of previous questions are pro-
vided. The models can optionally use a corpus of
documents C = {d1, d2, . . . , dN }.
5.1 楷模
We consider models from two categories, 基于
whether they use the document corpus or not. 这
closed-book models use just the question-answer
对, whereas open-book models use the docu-
ment corpus, along with question-answer pairs.
We now describe the implementation and tech-
nical details of both classes of models.
5.1.1 Closed-book
Large-scale language models often capture a lot
of world knowledge during unsupervised pre-
训练 (Petroni et al., 2019; Roberts et al., 2020).
These models, 原则, can answer questions
without access to any external corpus. We consi-
der GPT-3 (Brown et al., 2020)—an autoregres-
sive language model with 175 billion parameters,
and evaluate it on TOPIOCQA. The input to GPT-3
is a prompt5 followed by previous question-answer
pairs and the current question. Because GPT-3 is
5beta.openai.com/examples/default-qa.
474
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
7
1
2
0
0
8
1
2
6
/
/
t
我
A
C
_
A
_
0
0
4
7
1
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
When constrained by the encoder input se-
quence length, we retain the first turn and as
many turns prior to the current turn as possi-
布莱, 那是, k is chosen such that q1 [SEP]
a1 [SEP] qn−k [SEP] an−k [SEP] . . .
[SEP] qn−1 [SEP] an−1 [SEP] qn satisfies
encoder input limits.
• REWRITES: Given a query-rewriting module
QR, let q(西德:3)
n = QR(q1, a1, . . . , qn−1, an−1, qn)
denote the decontextualized question, 骗局-
ditioned on the conversation history. q(西德:3)
n is
then passed to the model.
Wang et al. (2019) observed that fixed-length text
segments from documents are more useful than
full documents in both retrieval and final QA
准确性. 因此, we split a Wikipedia document
into multiple text blocks of at least 100 字,
while preserving section and sentence boundaries.
These text blocks, augmented with the metadata
(main title and section title) are referred to as
passages. This resulted in 25.7 million passages,
which act as basic units of retrieval. To form
question-passage pairs for training DPR Retriever,
we select the passage from gold answer document
that contains the majority of rationale span.
Following the original works, we use BERT
(Devlin et al., 2019) for DPR (both Retriever
and Reader) and T5 (Raffel et al., 2020) for FiD
as base models. Because DPR Reader requires a
span from passage for each training example, 我们
heuristically select the span from the gold passage
that has the highest lexical overlap (F1 score)
with the gold answer. For the query-rewriting
module QR, we fine-tune T5 model on rewrites
of QReCC (Anantha et al., 2021), and use that
to generate the rewrites for TOPIOCQA. We refer
the reader to Appendix B for more details. 这
hyperparameters for all models are mentioned in
附录C.
5.2 Evaluation Metrics
Following Choi et al. (2018) and Reddy et al.
(2019), we use exact match (EM) and F1 as eval-
uation metrics for TOPIOCQA.
To compute human and system performance
in the presence of multiple gold annotations, 我们
follow the evaluation process similar to Choi et al.
(2018) and Reddy et al. (2019). Given n human
答案, human performance on the task is deter-
mined by considering each answer as prediction
数字 5: A partial conversation and different ques-
tion representations of Q3. The REWRITES representa-
tion is an example, not the output of our QR module.
never explicitly exposed to any training examples,
this can be considered as a zero-shot setting.
5.1.2 Open-book
We build on state-of-the-art QA models that adapt
a two step retriever-reader approach. For the re-
triever, we consider BM25 (Robertson et al., 1995)
and DPR Retriever (Karpukhin et al., 2020). 给定
a query, BM25 ranks the documents based on a
bag-of-words scoring function. 另一方面,
DPR learns dense vector representations of docu-
ment and query, and uses the dot product between
them as a ranking function.
We consider two types of neural readers. (1)
DPR Reader (Karpukhin et al., 2020), 哪个重新-
ranks the retrieved passages and selects a span
from each document independently. The span with
highest span score is chosen as the answer. (2)
Fusion-in-Decoder
Izacard and Grave,
2021), which encodes all retrieved passages in-
dependently, and then jointly attends over all of
them in the decoder to generate the answer.
(FiD;
For these models, we consider three different
question representations for question at nth turn of
the conversation (qn). 数字 5 shows an example
of different question representations for the third
问题 (Q3) of a conversation.
• ORIGINAL: This serves as a naive baseline
where just the current question qn is passed
to the model.
• ALLHISTORY: The question is represented as
q1 [SEP] a1 [SEP] q2 [SEP] a2 [SEP]
. . . [SEP] qn−1 [SEP] an−1 [SEP] qn.
475
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
7
1
2
0
0
8
1
2
6
/
/
t
我
A
C
_
A
_
0
0
4
7
1
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
模型
人类
GPT-3
BM25 + DPR Reader
BM25 + FiD
DPR Retriever + DPR Reader
DPR Retriever + FiD
Question Rep
Dev
Test
EM
40.2
12.4
7.1
13.6
15.4
10.1
24.1
24.0
4.9
21.0
17.2
7.9
33.0
23.5
F1
70.1
33.4
12.8
25.0
32.5
21.8
37.2
41.6
14.9
43.4
36.4
21.6
55.3
44.2
EM
40.3
10.4
7.2
13.8
15.7
10.5
23.4
24.9
4.3
19.4
16.5
7.8
33.4
24.0
F1
70.0
31.8
13.0
25.2
31.7
22.6
36.1
41.4
14.9
41.1
35.2
21.4
55.8
44.7
ORIGINAL
ALLHISTORY
REWRITES
ORIGINAL
ALLHISTORY
REWRITES
ORIGINAL
ALLHISTORY
REWRITES
ORIGINAL
ALLHISTORY
REWRITES
桌子 5: Overall performance of all model variants on TOPIOCQA development and test set.
and other human answers as the reference set. 这
results in n scores, which are averaged to give the
final human performance score. The system pre-
diction is also compared with n distinct reference
套, each containing n − 1 human answers, 和
then averaged. For TOPIOCQA, n = 4 (the origi-
nal answer and three additional annotations). 笔记
that human performance is not necessarily an up-
per bound for the task, as document retrieval can
potentially be performed better by the systems.
模型
Question Rep
Dev
Test
BM25
ORIGINAL
ALLHISTORY
REWRITES
ORIGINAL
DPR Retriever ALLHISTORY
REWRITES
Top-20 Top-100 Top-20 Top-100
5.2
23.1
32.5
9.9
70.4
49.9
9.1
36.8
49.2
16.5
82.4
62.4
6.0
22.5
33.0
10.0
67.0
49.3
10.1
35.6
47.4
15.3
80.8
61.1
桌子 6: Retrieval performance of all model
variants on TOPIOCQA development and test set.
6 Results and Discussion
We report
the end-to-end performance of all
systems in Table 5. For open-book models, 我们
also look at the performance of its constituents
(retriever and reader). 桌子 6 reports the retrieval
performance and Table 7 reports the reading
comprehension performance of the readers, 给定
the gold passage. Based on these results, 我们
answer the following research questions.
How do the models compare against humans for
TOPIOCQA?
We report model and human performance on de-
velopment and test set in Table 5. 全面的, 模型
performance in all settings is significantly lower
than the human performance. The best performing
模型 (DPR Retriever + FiD using ALLHISTORY
模型
Question Rep
Dev
Test
Extractive Bound
DPR Reader
FiD
ORIGINAL
ALLHISTORY
REWRITES
ORIGINAL
ALLHISTORY
REWRITES
EM
47.7
27.1
29.7
29.8
34.4
38.3
34.5
F1
81.1
51.4
54.2
53.8
60.5
65.5
61.9
EM
47.3
25.5
28.0
28.1
33.7
37.2
35.3
F1
81.0
50.4
52.6
52.1
61.0
64.1
62.8
桌子 7: Reader performance of all model variants
on TOPIOCQA development and test set when
provided with the gold passage.
question representation) achieves 33.4 点
EM and 55.8 points F1 on the test set, 哪个
falls short of human performance by 6.9 点
和 14.2 点, 分别, indicating room for
further improvement.
476
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
7
1
2
0
0
8
1
2
6
/
/
t
我
A
C
_
A
_
0
0
4
7
1
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
Which class of models perform better—Closed
book or Open book?
GPT-3 is directly comparable to ALLHISTORY vari-
ant of open-book models as it takes the entire
conversation history as input. Apart from BM25 +
DPR Reader, GPT-3 performs worse than all
other ALLHISTORY variants of open-book models.
It achieves an F1 score of 31.8 on the test set,
这是 24 points behind the best performing
open-book model (DPR Retriever + FiD). We ob-
serve that GPT-3 often hallucinates many answers,
a phenomenon commonly observed in literature
(Shuster et al., 2021).
How does the performance of open-book models
vary with various question representations?
For all open-book models, we fine-tune on three
different question representations (部分 5).
From the results in Table 5, we observe that the
ORIGINAL representation is consistently worse than
others for all models. This highlights the impor-
tance of encoding the conversational context for
TOPIOCQA. Between ALLHISTORY and REWRITES,
we observe that ALLHISTORY performs better
with dense retriever (DPR Retriever), 然而
REWRITES performs better with sparse retriever
(BM25). To confirm that this performance dif-
ference in end-to-end systems stems from the
retriever, we look at Top-20 and Top-100 re-
trieval accuracy of BM25 and DPR Retriever in
桌子 6. 的确, ALLHISTORY representation per-
forms better than REWRITES for DPR Retriever but
worse for BM25. As DPR Retriever is trained on
TOPIOCQA, it can probably learn how to select
relevant information from the ALLHISTORY rep-
resentation, whereas for BM25, the non-relevant
keywords in the representation act as distractors.
The better performance of DPR Retriever over
BM25 indicates that TOPIOCQA requires learning
task-specific dense semantic encoding for superior
retrieval performance.
How much are the readers constrained due to
retrieved results?
桌子 6 shows retrieval results. In an end-to-end
系统, the reader takes as input the retrieved
passages, which may or may not contain the gold
passage. To get an estimate of reader performance
independently from the retriever, we experiment
with directly providing only the gold passage to
the readers, instead of the retrieved ones. 桌子 7
shows the results. This can be seen as an ‘‘Ideal
Retriever’’ setting, where the retriever always re-
trieves the correct passage as the top one. 虽然
we observe significant gains over end-to-end sys-
tems for all models across all variants, the best
模型 (FiD with ALLHISTORY) still falls short of
human performance by 3.1 points EM and 5.9
points F1 on the test set. These experiments in-
dicate that while passage retrieval is a significant
bottleneck for the task, technical advancements
are needed for the readers as well.
While it is plausible to assume that DPR Reader
is restricted in its performance due to its extractive
自然, we show that this is not the case. We cal-
culate the extractive upper bound for TOPIOCQA
(reported in Table 7) by selecting the span from
the gold document with best F1 overlap with the
ground truth answer. This bound is 47.3 points EM
和 81.0 points F1, which essentially represents
the best that any extractive model can do on this
任务. DPR Reader falls short of this upper bound
经过 19.2 points EM and 28.4 points F1.
7 结论
We introduced TOPIOCQA, a novel open-domain
conversational question answering dataset with
topic switching. 在这项工作中, we described our
data collection effort, analyzed its topic switching
行为, and established strong neural baselines.
The best performing model (DPR Retriever +
FiD) 是 6.9 points EM and 14.2 points F1 below
human performance, suggesting that advances in
modeling are needed. We hope our dataset will be
an important resource to enable more research on
conversational agents that support topic switches
in information-seeking scenarios.
致谢
We would like to thank TAKT’s annotators and
Jai Thirani (Chief Data Officer) for contributing
to TOPIOCQA. We are grateful for construc-
tive and insightful feedback from the anonymous
reviewers. This work is supported by the follow-
ing grants: MSR-Mila Grant, NSERC Discovery
Grant on Robust conversational models for ac-
cessing the world’s knowledge, and the Facebook
CIFAR AI Chair. Shehzaad is partly supported
by the IBM PhD Fellowship. We thank Compute
Canada for the computing resources.
477
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
7
1
2
0
0
8
1
2
6
/
/
t
我
A
C
_
A
_
0
0
4
7
1
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
参考
Raviteja Anantha, Svitlana Vakulenko, Zhucheng
Tu, Shayne Longpre, Stephen Pulman, 和
Srinivas Chappidi. 2021. Open-domain ques-
tion answering goes conversational via question
rewriting. 在诉讼程序中 2021 Confer-
ence of the North American Chapter of the
计算语言学协会: 胡-
man Language Technologies, pages 520–534,
在线的. https://doi.org/10.18653/v1
/2021.naacl-main.44
Tom Brown, Benjamin Mann, Nick Ryder,
Melanie Subbiah, Jared D. 卡普兰, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam,
Girish Sastry, Amanda Askell, Sandhini
阿加瓦尔, Ariel Herbert-Voss, Gretchen Krueger,
Tom Henighan, Rewon Child, Aditya Ramesh,
Daniel Ziegler, Jeffrey Wu, Clemens Winter,
Chris Hesse, Mark Chen, Eric Sigler, Mateusz
Litwin, Scott Gray, Benjamin Chess, 杰克
克拉克, Christopher Berner, Sam McCandlish,
伊利亚·苏茨克维尔, and Dario
Alec Radford,
Amodei. 2020. Language models are few-shot
learners. In Advances in Neural Information Pro-
cessing Systems, 体积 33, pages 1877–1901.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, 和
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
理解. 在诉讼程序中
这 2019
Conference of the North American Chapter
of the Association for Computational Linguis-
抽动症: 人类语言技术, 体积 1
(Long and Short Papers), pages 4171–4186,
明尼阿波利斯, Minnesota.
Ahmed Elgohary, Denis Peskov, and Jordan
Boyd-Graber. 2019. Can you unpack that? 学习-
ing to rewrite questions-in-context. In Pro-
ceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and
the 9th International Joint Conference on Nat-
ural Language Processing (EMNLP-IJCNLP),
pages 5918–5924, 香港, 中国. https://
doi.org/10.18653/v1/D19-1605
Gautier
Izacard and Edouard Grave. 2021.
Leveraging passage retrieval with generative
models for open domain question answer-
英. In Proceedings of the 16th Conference
of the European Chapter of the Association
for Computational Linguistics: Main Volume,
pages 874–880, 在线的. https://doi.org
/10.18653/v1/2021.eacl-main.74
Danqi Chen, Adam Fisch, Jason Weston, 和
Antoine Bordes. 2017. Reading Wikipedia to
answer open-domain questions. In Proceedings
of the 55th Annual Meeting of the Associa-
tion for Computational Linguistics (体积 1:
Long Papers), pages 1870–1879, Vancouver,
加拿大.
Robin Jia and Percy Liang. 2017. Adversarial ex-
amples for evaluating reading comprehension
系统. 在诉讼程序中 2017 Confer-
ence on Empirical Methods in Natural Lan-
guage Processing, pages 2021–2031, 哥本哈根,
丹麦. https://doi.org/10.18653
/v1/D17-1215
Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar,
Wen-tau Yih, Yejin Choi, Percy Liang, 和
Luke Zettlemoyer. 2018. QuAC: Question an-
swering in context. 在诉讼程序中 2018
Conference on Empirical Methods in Natu-
ral Language Processing, pages 2174–2184,
布鲁塞尔, 比利时.
Jeffrey Dalton, Chenyan Xiong, Vaibhav Kumar,
and Jamie Callan. 2020. CAsT-19: A data-
set for conversational information seeking. 在
Proceedings of the 43rd International ACM
SIGIR Conference on Research and Devel-
opment in Information Retrieval, SIGIR ’20,
pages 1985–1988, 纽约, 纽约, 美国.
https://doi.org/10.1145/3397271
.3401206
Vladimir Karpukhin, Barlas Oguz, Sewon Min,
Patrick Lewis, Ledell Wu, Sergey Edunov,
Danqi Chen, and Wen-tau Yih. 2020. Dense
passage retrieval for open-domain question
answering. 在诉讼程序中 2020 骗局-
ference on Empirical Methods in Natural Lan-
guage Processing (EMNLP), pages 6769–6781,
在线的. https://doi.org/10.18653/v1
/2020.emnlp-main.550
Tom´aˇs Koˇcisk´y, Jonathan Schwarz, Phil Blunsom,
Chris Dyer, Karl Moritz Hermann, G´abor Melis,
and Edward Grefenstette. 2018. The Narrative-
QA reading comprehension challenge. 反式-
actions of the Association for Computational
语言学, 6:317–328. https://doi.org
/10.1162/tacl_a_00023
478
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
7
1
2
0
0
8
1
2
6
/
/
t
我
A
C
_
A
_
0
0
4
7
1
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
Tom Kwiatkowski, Jennimaria Palomaki, Olivia
Redfield, Michael Collins, Ankur Parikh, Chris
Alberti, Danielle Epstein,
Illia Polosukhin,
Jacob Devlin, Kenton Lee, Kristina Toutanova,
Llion Jones, Matthew Kelcey, Ming-Wei
张, 安德鲁·M. Dai, Jakob Uszkoreit, Quoc
Le, and Slav Petrov. 2019. Natural questions:
A benchmark for question answering research.
Transactions of the Association for Computa-
tional Linguistics, 7:452–466. https://土井
.org/10.1162/tacl_a_00276
Kenton Lee, Ming-Wei Chang, and Kristina
Toutanova. 2019. Latent retrieval for weakly
supervised open domain question answering.
In Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics,
pages 6086–6096, Florence, 意大利.
Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng
高, Saurabh Tiwary, Rangan Majumder, 和
Li Deng. 2016. MS MARCO: A Human Gener-
ated MAchine Reading COmprehension Data-
放. In CoCo@NIPS.
Fabio Petroni, Tim Rockt¨aschel, Sebastian Riedel,
Patrick Lewis, Anton Bakhtin, Yuxiang Wu,
and Alexander Miller. 2019. Language models
as knowledge bases? 在诉讼程序中 2019
Conference on Empirical Methods in Natural
Language Processing and the 9th International
Joint Conference on Natural Language Pro-
cessing (EMNLP-IJCNLP), pages 2463–2473,
香港, 中国. https://doi.org/10
.18653/v1/D19-1250
Chen Qu, Liu Yang, Cen Chen, Minghui Qiu,
伊耶尔. 2020.
瓦. Bruce Croft, and Mohit
Open-Retrieval Conversational Question An-
swering. SIGIR ’20, pages 539–548, 纽约,
纽约, 美国.
Chen Qu, Liu Yang, Minghui Qiu, 瓦. 布鲁斯
Croft, Yongfeng Zhang, and Mohit Iyyer. 2019A.
BERT with history answer embedding for con-
versational question answering. In Proceedings
of the 42nd International ACM SIGIR Confer-
ence on Research and Development in Infor-
mation Retrieval, SIGIR’19, pages 1133–1136,
纽约, 纽约, 美国.
Chen Qu, Liu Yang, Minghui Qiu, Yongfeng
张, Cen Chen, 瓦. Bruce Croft, and Mohit
伊耶尔. 2019乙. Attentive history selection for
conversational question answering. In Proceed-
ings of the 28th ACM International Conference
on Information and Knowledge Management,
CIKM ’19, pages 1391–1400, 纽约,
纽约, 美国.
Colin Raffel, Noam Shazeer, Adam Roberts,
Katherine Lee, Sharan Narang, Michael Matena,
Yanqi Zhou, Wei Li, and Peter J. 刘. 2020.
Exploring the limits of transfer learning with
a unified text-to-text transformer. 杂志
Machine Learning Research, 21(140):1–67.
Pranav Rajpurkar, Robin Jia, and Percy Liang.
2018. Know what you don’t know: Unanswer-
able questions for SQuAD. 在诉讼程序中
the 56th Annual Meeting of the Association for
计算语言学 (体积 2: Short
文件), pages 784–789, 墨尔本, 澳大利亚.
https://doi.org/10.18653/v1/P18
-2124
Pranav Rajpurkar,
Jian Zhang, Konstantin
Lopyrev, and Percy Liang. 2016. SQuAD:
100,000+ questions for machine comprehen-
sion of text. 在诉讼程序中 2016 骗局-
ference on Empirical Methods in Natural
语言处理, pages 2383–2392, Austin,
德克萨斯州. https://doi.org/10.18653/v1
/D16-1264
Siva Reddy, Danqi Chen, and Christopher Manning.
2019. CoQA: A conversational question answer-
ing challenge. Transactions of the Association
for Computational Linguistics, 7(0):249–266.
https://doi.org/10.1162/tacl 00266
Adam Roberts, Colin Raffel, and Noam Shazeer.
2020. How much knowledge can you pack
into the parameters of a language model? 在
诉讼程序 2020 Conference on Empir-
ical Methods in Natural Language Processing
(EMNLP), pages 5418–5426, 在线的. https://
doi.org/10.18653/v1/2020.emnlp-main
.437
Stephen Robertson, S. 沃克, S. 琼斯, 中号. 中号.
Hancock-Beaulieu, 和M. Gatford. 1995. Okapi
at TREC-3. In Overview of the Third Text Re-
trieval Conference (TREC-3), pages 109–126.
Anna Rogers, Matt Gardner,
and Isabelle
Augenstein. 2021. QA dataset explosion: A
taxonomy of NLP resources for question an-
swering and reading comprehension. arXiv
preprint arXiv:2107.12708.
Harvey Sacks and Gail Jefferson. 1995. Lectures
on conversation, 体积 1. pages 289–331.
479
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
7
1
2
0
0
8
1
2
6
/
/
t
我
A
C
_
A
_
0
0
4
7
1
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
Anshumali Shrivastava and Ping Li. 2014.
Asymmetric LSH (ALSH) for sublinear time
maximum inner product search (MIPS). In Pro-
ceedings of the 27th International Conference
on Neural Information Processing Systems –
体积 2, pages 2321–2329, 蒙特利尔, 加拿大.
Kurt Shuster, Spencer Poff, Moya Chen, Douwe
Kiela, and Jason Weston. 2021. Retrieval
augmentation reduces hallucination in con-
versation. arXiv 预印本 arXiv:2104.07567.
https://doi.org/10.18653/v1/2021
.findings-emnlp.320
Amanda Spink, H. Cenk Ozmutlu, and Seda
Ozmutlu. 2002. Multitasking information seek-
ing and searching processes. Journal of the
American Society for Information Science and
技术, 53(8):639–652. https://土井
.org/10.1002/asi.10124
Manfred Stede and David Schlangen. 2004.
Information-seeking chat: Dialogue manage-
ment by topic structure.
Svitlana Vakulenko, Shayne Longpre, Zhucheng
Tu, and Raviteja Anantha. 2021. 问题
rewriting for conversational question answer-
英. In Proceedings of the 14th ACM Interna-
tional Conference on Web Search and Data
Mining, WSDM ’21, pages 355–363, 新的
约克, 纽约, 美国. https://doi.org/10
.1145/3437963.3441748
Douglas Walton. 2019. The New Dialectic: 骗局-
versational Contexts of Argument. 大学
of Toronto Press.
Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo
王, Tim Klinger, Wei Zhang, Shiyu
张, Gerry Tesauro, Bowen Zhou, and Jing
Jiang. 2018. R3: Reinforced ranker-reader for
open-domain question answering. In AAAI,
pages 5981–5988.
Zhiguo Wang, Patrick Ng, Xiaofei Ma, Ramesh
Nallapati, and Bing Xiang. 2019. Multi-passage
BERT: A globally normalized BERT model
for open-domain question answering. In Pro-
ceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and
the 9th International Joint Conference on Nat-
ural Language Processing (EMNLP-IJCNLP),
pages 5878–5882, 香港, 中国. https://
doi.org/10.18653/v1/D19-1599
Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li,
Luchen Tan, Kun Xiong, Ming Li, and Jimmy
林. 2019. End-to-end open-domain question
answering with BERTserini. 在诉讼程序中
这 2019 Conference of the North American
Chapter of the Association for Computational
语言学 (Demonstrations), pages 72–77,
明尼阿波利斯, Minnesota. https://doi.org
/10.18653/v1/N19-4013
Yi Yang, Wen-tau Yih, and Christopher Meek.
为了
2015. WikiQA: A challenge dataset
open-domain question answering. In Proceed-
ings of
这 2015 Conference on Empirical
Methods in Natural Language Processing,
pages 2013–2018, 里斯本, Portugal. https://
doi.org/10.18653/v1/D15-1237
Chenguang Zhu, Michael Zeng, and Xuedong
黄. 2019. SDNet: Contextualized attention-
based deep network for conversational ques-
tion answering. arXiv 预印本 arXiv:1812
.03593.
A Annotators Details
Each conversation in TOPIOCQA is an interac-
tion between two annotators, a questioner and
an answerer. The annotators were selected from
TAKT’s in-house workforce, based on their En-
glish language proficiency and trained for the role
of both questioner and answerer. The annotators
are provided with the following guidelines.
Guidelines for the Questioner:
• The first question should be unambiguous
and about the seed entity.
• The follow-up questions be contextualized
and dependent on the conversation history
whenever possible.
• Avoid using same words as in section titles
of the document. 例如, if the section
title is ‘‘Awards’’, a plausible question can
be ‘‘What accolades did she receive for her
工作?’’.
• The conversation should involve multiple
文件 (主题).
Guidelines for the Answerer:
• Based on the question, identify the relevant
document and section.
• The answer should be based on the contents
of the identified document.
480
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
7
1
2
0
0
8
1
2
6
/
/
t
我
A
C
_
A
_
0
0
4
7
1
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
• The rationale should be selected such that it
justifies the answer.
• The answer should be a sub-string in rationale
whenever possible. 然而, answers should
be edited to fit the conversational context
(adding yes, 不), perform reasoning (例如,
数数), 等等.
• Personal opinions should never be included.
After providing the guidelines and a few examples,
the initial annotated conversations were manually
inspected by the authors. The workers who pro-
vided low-quality annotations during this inspec-
tion phase were disqualified. The final workforce
consisted of 15 工人, who provided annota-
tions for the dataset over a period of two months.
Random quality checks were performed by the
authors and periodic feedback was given to the
annotators throughout the data collection to main-
tain high quality of data. 数字 8 shows annotation
interfaces for questioner and answerer. 数字 6
shows an example from the dataset.
We also implemented several real-time checks
in the questioner’s interface to encourage topic
switching and use of co-reference, and to reduce
the lexical overlap with the metadata of the docu-
ment while forming the question.
B Query Rewriting
A query-rewriting module, QR, takes the cur-
rent question and the conversation history as
输入 (q1, a1, . . . , qn−1, an−1, qn) and provides a
decontextualized rewritten question, q(西德:3)
n, 作为
输出. As we don’t collect rewrites in TOPI-
OCQA, we rely on other datasets to train our
QR model. Two datasets that provide rewrites for
information-seeking conversations are CANARD
(Elgohary et al., 2019) and QReCC (Anantha
等人。, 2021). Due to its large-scale and diverse
自然, we use QReCC to train our T5 model
based QR module.
To rewrite the nth question, the conversation
history and the current question is given to model
as q1 [SEP] a1 [SEP] q2 [SEP] a2 [SEP] . . .
[SEP] qn−1 [SEP] an−1 [SEP] qn. We train
this model on QReCC dataset. On the test split
of QReCC, our model achieves a BLEU score of
62.74 点. We use this model to generate re-
writes for TOPIOCQA in our experiments. 数字 7
shows a conversation from the dataset along with
rewrites from this T5-based QR module.
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
7
1
2
0
0
8
1
2
6
/
/
t
我
A
C
_
A
_
0
0
4
7
1
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
数字 6: A full conversation from TOPIOCQA.
We observe that while this QR module can re-
solve simple coreferences (Q2 and Q5), it struggles
later in the conversation in the presence of mul-
tiple entities (he is resolved to albus dumbledore
481
instead of tom marvolo riddle in Q10 and Q12).
The QR module also fails to perform reason-
ing required for correct rewrites, 例如,
boy wizard’s nemesis is not rewritten to Lord
Voldemort in Q9, even though this information is
present in A5).
C Hyperparameter Details
We use Lucene BM25 with k1 = 0.9 (term fre-
quency scaling) and b = 0.4 (document length
normalization). For both DPR and FiD, apart
from the batch size, we use the hyperparameters
suggested in their codebases. We use the maxi-
mum batch size that fits in the GPU cluster. DPR
Retriever is trained on four 40GB A100 GPUs,
whereas DPR Reader and FiD are trained on 8
32GB V100 GPUs. We use base model size for
all systems. Following original implementations,
DPR Retriever is trained for 40 纪元, DPR
Reader for 20 纪元, and FiD for 15,000 gradient
脚步. The model checkpoint with best EM score
on development set is selected as the final model.
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
7
1
2
0
0
8
1
2
6
/
/
t
我
A
C
_
A
_
0
0
4
7
1
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
数字 7: An example of a conversation from
TOPIOCQA along with rewrites from the QR module.
Few turns are excluded and some answers are shorted
for brevity.
482
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
7
1
2
0
0
8
1
2
6
/
/
t
我
A
C
_
A
_
0
0
4
7
1
p
d
.
数字 8: Annotation interface for questioners and answerers.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
483