DREAM: A Challenge Data Set and Models for
Dialogue-Based Reading Comprehension
Kai Sun♠ ∗ Dian Yu♥
Jianshu Chen♥ Dong Yu♥ Yejin Choi♦, ♣ Claire Cardie♠
♠Cornell University, Ithaca, New York, USA
♥Tencent AI Lab, Bellevue, WA, USA
♦University of Washington, Seattle, WA, USA
♣Allen Institute for Artificial Intelligence, Seattle, WA, USA
ks985@cornell.edu {yudian,jianshuchen,dyu}@tencent.com
yejin@cs.washington.edu cardie@cs.cornell.edu
Abstract
We present DREAM, the first dialogue-based
multiple-choice reading comprehension data
set. Collected from English as a Foreign
Language examinations designed by human
experts to evaluate the comprehension level
of Chinese learners of English, our data set
contains 10,197 multiple-choice questions for
6,444 dialogues. In contrast to existing reading
comprehension data sets, DREAM is the first
to focus on in-depth multi-turn multi-party
dialogue understanding. DREAM is likely to
present significant challenges for existing
reading comprehension systems: 84% of an-
swers are non-extractive, 85% of questions
require reasoning beyond a single sentence,
and 34% of questions also involve common-
sense knowledge.
We apply several popular neural reading
comprehension models that primarily exploit
surface information within the text and find
them to, at best, just barely outperform a rule-
based approach. We next investigate the effects
of incorporating dialogue structure and dif-
ferent kinds of general world knowledge into
both rule-based and (neural and non-neural)
machine learning-based reading comprehen-
sion models. Experimental
results on the
DREAM data set show the effectiveness of dia-
logue structure and general world knowledge.
DREAM is available at https://dataset.
org/dream/.
1 Introduction
Recently a significant amount of research has fo-
cused on the construction of large-scale multiple-
∗This work was done when K. S. was an intern at the
Tencent AI Lab, Bellevue, WA.
choice (Lai et al., 2017; Khashabi et al., 2018;
Ostermann et al., 2018) and extractive (Hermann
et al., 2015; Hill et al., 2016; Rajpurkar et al., 2016;
Trischler et al., 2017) reading comprehension data
sets (Section 2). Source documents in these data
sets have generally been drawn from formal
written texts such as news, fiction, and Wikipedia
articles, which are commonly considered well-
written, accurate, and neutral.
With the goal of advancing research in ma-
chine reading comprehension and facilitating dia-
logue understanding, we construct and present
DREAM — the first multiple-choice Dialogue-
based REAding comprehension exaMination data
set. We collect 10,197 questions for 6,444 multi-
turn multi-party dialogues from English language
exams, which are carefully designed by educa-
tional experts (e.g., English teachers) to assess
the comprehension level of Chinese learners of
English. Each question is associated with three
answer options, exactly one of which is correct.
(See Table 1 for an example.) DREAM covers a
variety of topics and scenarios in daily life such
as conversations on the street, on the phone, in a
classroom or library, at the airport or the office or
a shop (Section 3).
Based on our analysis of DREAM, we argue
that dialogue-based reading comprehension is at
least as difficult as existing non-conversational
counterparts. In particular, answering 34% of
DREAM questions requires unspoken common-
sense knowledge, Zum Beispiel, unspoken scene
Information. This might be due to the nature
of dialogues: For efficient oral communication,
people rarely state obvious explicit world knowl-
edge (Forbes and Choi, 2017) such as ‘‘Christmas
Day is celebrated on December 25th.’’ Understanding
Dialogue 1 (D1)
W: Tom, look at your shoes. How dirty they are!
You must clean them.
M: Oh, mum, I just cleaned them yesterday.
W: They are dirty now. You must clean them
again.
M: I do not want to clean them today. Even if
I clean them today, they will get dirty again
tomorrow.
W: All right, then.
M: Mum, give me something to eat, please.
W: You had your breakfast in the morning, Tom,
and you had lunch at school.
M: I am hungry again.
W: Oh, hungry? But if I give you something to eat
today, you will be hungry again tomorrow.
Q1 Why did the woman say that she wouldn’t give
him anything to eat?
A. Because his mother wants to correct his bad
habit. ✓
B. Because he had lunch at school.
C. Because his mother wants to leave him hungry.
Table 1: A sample DREAM problem that requires
general world knowledge (✓: the correct answer option).
the social implications of an utterance as well as
inferring a speaker’s intentions is also regularly
required for answering dialogue-based questions.
The dialogue content in Table 1, for example,
is itself insufficient for readers to recognize the
intention of the female speaker (W) in the first
question (Q1). However, world knowledge is
rarely considered in state-of-the-art reading com-
prehension models (Tay et al., 2018; Wang et al.,
2018b).
Moreover, dialogue-based questions can cover
information imparted across multiple turns involv-
ing multiple speakers. In DREAM, approximately
85% of questions can only be answered by con-
sidering the information from multiple sentences.
For example, to answer Q1 in Table 3 later
in the paper regarding the date of birth of the
male speaker (M), the supporting sentences (in
bold) include ‘‘You know, tomorrow is Christ-
mas Day’’ from the female speaker and ‘‘. . . I
am more than excited about my birthday, which
will come in two days’’ from the male speaker.
Compared with ‘‘multiple-sentence questions’’
in traditional reading comprehension data sets,
DREAM further requires an understanding of the
turn-based structure of dialogue—for example,
for aligning utterances with their corresponding
speakers.
As only 16% of correct answer options are text
spans from the source documents, we primarily
explore rule-based methods and state-of-the-
art neural models designed for multiple-choice
reading comprehension (Abschnitt 4). We find first
that neural models designed for non–dialogue-
based reading comprehension (Chen et al., 2016;
Dhingra et al., 2017; Wang et al., 2018b) do not
fare well: The highest achieved accuracy is 45.5%,
only slightly better than the accuracy (44.6%)
of a simple lexical baseline (Richardson et al.,
2013). For the most part, these models fundamen-
tally exploit only surface-level information from
the source documents. Considering the above-
mentioned challenges, however, we hypothesize
that incorporating general world knowledge and
aspects of the dialogue structure would allow a
better understanding of the dialogues. As a result,
we modify our baseline systems to include (1)
general world knowledge in the form of
ConceptNet relations (Speer et al., 2017) and a
pre-trained language model (Radford et al., 2018),
and (2) speaker information for each utterance.
Experiments show the effectiveness of these fac-
tors on the lexical baselines as well as neural
and non-neural machine learning approaches: we
acquire up to 11.9% absolute gain in accuracy
compared with the highest performance achieved
by the state-of-the-art reading comprehension
model (Wang et al., 2018b), which mainly relies
on explicit surface-level information in the text
(Section 5).
Finally, we see a significant gap between the
best automated approach (59.5%) and human
ceiling performance (98.6%) on the DREAM data
set. This provides yet additional evidence that
dialogue-based reading comprehension is a very
challenging task. We hope that it also inspires the
research community to develop methods for the
dialogue-based reading comprehension task.
2 Related Work
We divide reading comprehension data sets into
three categories based on the types of answers:
extractive, abstractive, and multiple choice.
2.1 Extractive and Abstractive Data Sets
In recent years, we have seen increased interest in
large-scale cloze/span-based reading comprehension
                         SQuAD         NarrativeQA   CoQA          RACE             DREAM (this work)
Answer type              extractive    abstractive   abstractive   multiple-choice  multiple-choice
Source document type     written text  written text  written text  written text     dialogue
# of source documents    536           1,572         8,399         27,933           6,444
Average answer length    3.2           4.7           2.7           5.3              5.3
Extractive (%)           100.0         73.6          66.8          13.0             16.3
Abstractive (%)          0.0           26.4          33.2          87.0             83.7

Table 2: Distribution of answer (or correct answer option) types in three kinds of reading comprehension data sets.
Statistics of other data sets come from Reddy et al. (2018), Kočiský et al. (2018), and Lai et al. (2017).
data set construction (Hermann et al., 2015; Hill
et al., 2016; Onishi et al., 2016; Rajpurkar et al.,
2016; Bajgar et al., 2016; Nguyen et al., 2016;
Trischler et al., 2017; Joshi et al., 2017; Choi et al.,
2018). We regard them as extractive since candi-
date answers are usually short spans from source
documents. State-of-the-art neural models with
attention mechanisms already achieve very high
performance based on local lexical information.
Recently, researchers have worked on the construction of
spoken span-based data sets (Lee et al., 2018;
Li et al., 2018) by applying text-to-speech tech-
nologies or recruiting human speakers based on
formal written document-based data sets such as
SQuAD (Rajpurkar et al., 2016). Some span-
based conversation data sets are constructed from
a relatively small number of dialogues from television
shows (Chen and Choi, 2016; Ma et al., 2018).
Considering the limitations in extractive data
sets, answers in abstractive data sets such as MS
MARCO (Nguyen et al., 2016), SearchQA (Dunn
et al., 2017), and NarrativeQA (Koˇcisk`y et al.,
2018) are human-crowdsourced based on source
documents or summaries. Concurrently, there is a
growing interest in conversational reading com-
prehension such as CoQA (Reddy et al., 2018).
Because annotators tend to copy spans as an-
swers (Reddy et al., 2018), the majority of answers
are still extractive in these data sets (Table 2).
Compared to the data sets mentioned above, most
of the correct answer options (83.7%) in DREAM
are free-form text.
2.2 Multiple-Choice Data Sets
We primarily discuss the multiple-choice data
sets, in which answer options are not restricted
to extractive text spans in the given document.
Instead, most of the correct answer options are
abstractive (Table 2). Multiple-choice data sets in-
volve extensive human involvement for problem
generation during crowdsourcing (i.e., questions,
correct answer option, and distractors). Besides
surface matching, a significant portion of ques-
tions require multiple-sentence reasoning and
external knowledge (Richardson et al., 2013;
Mostafazadeh et al., 2016; Khashabi et al., 2018;
Ostermann et al., 2018).
Besides crowdsourcing, some data sets are col-
lected from examinations designed by educa-
tional experts (Penas et al., 2014; Shibuki et al.,
2014; Tseng et al., 2016; Clark et al., 2016;
Lai et al., 2017; Mihaylov et al., 2018), which
aim to test human examinees. There are various
types of complicated questions such as math word
problems, summarization, logical reasoning, Und
sentiment analysis. Because we can adopt more
objective evaluation criteria such as accuracy,
these questions are usually easy to grade. Besides,
questions from examinations are generally clean
and high-quality. Therefore, the human performance
ceiling on this kind of data set is much higher
(e.g., 94.5% on RACE [Lai et al., 2017] and
98.6% on DREAM in accuracy) than that of data
sets built by crowdsourcing.
In comparison, we present the first multiple-
choice dialogue-based data set from examinations
that contains a large percentage of questions that
require multiple sentence inference. To the best of
our knowledge, DREAM also contains the largest
number of questions involving commonsense rea-
soning compared with other examination data sets.
3 Data
In this section, we describe how we construct
DREAM (Section 3.1) and provide a detailed
analysis of this data set (Section 3.2).
3.1 Collection Methodology
We collect dialogue-based comprehension prob-
lems from a variety of English language exams
Dialogue 2 (D2)
W: Hey, Mike. Where have you been? I didn’t see
you around these days?
M: I was hiding in my office. My boss gave me
loads of work to do, and I tried to finish it
before my birthday. Anyway, I am done now.
Thank goodness! How is everything going
with you?
W: I’m quite well. You know, tomorrow is
Christmas Day. Do you have any plans?
M: Well, to tell you the truth, I am more
than excited about my birthday, which will
come in two days. I am going to visit my
parents-in-law with my wife.
W: Wow, sounds great.
M: Definitely! This is my first time to spend my
birthday with them.
W: Do they live far away from here?
M: A little bit. We planned to take the train, Aber
considering the travel peak, my wife strongly
suggested that we go to the airport right after
we finish our work this afternoon. How about
Du? What’s your holiday plan?
W: Well, our situations are just the opposite. My
parents-in-law will come to my house, and
they wish to stay at home and have a quiet
Christmas Day. So I have to call my friends to
cancel our party that will be held at my house.
M: You’ll experience a quite different and lovely
holiday. Enjoy your Christmas!
W: Thanks, the same to you!
Q1 What is the date of the man’s birthday?
A. 25th, December.
B. 26th, December. ✓
C. 27th, December.
Q2 How will the man go to his wife’s parents’
home?
A. By train.
B. By bus.
C. By plane. ✓
Q3 What is the probable relationship between the
two speakers?
A. Husband and wife.
B. Friends. ✓
C. Parent-in-law and son-in-law.

Table 3: A complete sample DREAM problem (✓: the
correct answer option).
(including practice exams) such as National Col-
lege Entrance Examination, College English Test,
and Public English Test,1 which are designed by
human experts to assess either the listening or
reading comprehension level of Chinese English
1We list all the Web sites used for data collection in the
released data set.
Metric                                     Value
# of answer options per question           3
# of turns                                 30,183
Avg./Max. # of questions per dialogue      1.6 / 10
Avg./Max. # of speakers per dialogue       2.0 / 7
Avg./Max. # of turns per dialogue          4.7 / 48
Avg./Max. option length (in tokens)        5.3 / 21
Avg./Max. question length (in tokens)      8.6 / 24
Avg./Max. dialogue length (in tokens)      85.9 / 1,290
Vocabulary size                            13,037

Table 4: The overall statistics of DREAM. A turn is
defined as an uninterrupted stream of speech from one
speaker in a dialogue.
                 Train   Dev     Test    All
# of dialogues   3,869   1,288   1,287   6,444
# of questions   6,116   2,040   2,041   10,197

Table 5: The separation of the training, development,
and test sets in DREAM.
learners in high schools and colleges (for indi-
viduals aged 12–22 years). All the problems in
DREAM are freely accessible online for public
usage. Each problem consists of a dialogue and
a series of multiple-choice questions. To ensure
every question is associated with exactly three
answer options, we drop wrong answer option(S)
randomly for questions with more than three
options. We remove duplicate problems and ran-
domly split the data at the problem level, with
60% train, 20% development, and 20% test.
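As a minimal illustrative sketch of this preparation step (in Python; the problem/question dictionary layout and function names are ours, not the released file format):

```python
import random

rng = random.Random(0)  # fixed seed for illustration; the paper does not specify one

def keep_three_options(question):
    # Drop wrong answer options at random so that every question keeps exactly
    # three options, one of which is the correct answer.
    wrong = [o for o in question["options"] if o != question["answer"]]
    rng.shuffle(wrong)
    question["options"] = [question["answer"]] + wrong[:2]
    rng.shuffle(question["options"])
    return question

def split_problems(problems):
    # Random 60% / 20% / 20% split at the problem (dialogue) level.
    problems = list(problems)
    rng.shuffle(problems)
    n = len(problems)
    return (problems[: int(0.6 * n)],                 # train
            problems[int(0.6 * n): int(0.8 * n)],     # development
            problems[int(0.8 * n):])                  # test
```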
3.2 Data Analysis
We summarize the statistics of DREAM in Table 4
and data split in Table 5. Compared with existing
data sets built from formal written texts, the
vocabulary size is relatively small since spoken
English by its nature makes greater use of high-
frequency words and needs a smaller vocabulary for
efficient real-time communication (Nation, 2006).
We categorize questions into two main cate-
gories according to the types of knowledge re-
quired to answer them: matching and reasoning.
• Matching A question is entailed or para-
phrased by exactly one sentence in a dialogue.
The answer can be extracted from the same
sentence. For example, we can easily verify
the correctness of the question-answer pair
(‘‘What kind of room does the man want
to rent?’’, ‘‘A two-bedroom apartment.’’)
based on the sentence ‘‘M: I’m interested
in renting a two-bedroom apartment.’’ This
category is further divided into two cate-
gories, word matching and paraphrasing, in
previous work (Chen et al., 2016; Trischler
et al., 2017).
• Reasoning Questions that cannot be an-
swered by the surface meaning of a single
sentence belong to this category. We further
define four subcategories as follows.
Question Type        Dev    Test   Dev + Test
Matching             13.0   10.3   11.7
Reasoning            87.0   89.7   88.3
  Summary            8.4    15.9   12.1
  Logic              74.5   70.4   72.5
  Arithmetic         5.1    3.6    4.4
  Commonsense        31.5   35.9   33.7
Single sentence      17.1   13.7   15.4
Multiple sentences   82.9   86.3   84.6

Table 6: Distribution (%) of question types.
– Summary Answering this kind of ques-
tions requires the whole picture of a dia-
logue, such as the topic of a dialogue and
the relation between speakers (e.g., D2-Q3
in Table 3). Under this category, ques-
tions such as ‘‘What are the two speakers
talking about?’’ and ‘‘What are the speak-
ers probably doing?’’ are frequently asked.
– Logic We require logical reasoning to
answer questions in this category. We usu-
ally need to identify logically implied
relations among multiple sentences in a
dialogue. To reduce the ambiguity during
the annotation, we regard a question that
can only be solved by considering the con-
tent of multiple sentences and does not
belong to the summary subcategory that
involves all the sentences in a dialogue as
a logic question. Following this definition,
both D2-Q1 and D2-Q2 in Table 3 belong
to this category.
– Arithmetic Inferring the answer re-
quires arithmetic knowledge (e.g., D2-Q1
in Table 3 requires 25 − 1 + 2 = 26).
– Commonsense To answer questions under
this subcategory, besides the textual infor-
mation in the dialogue, we also require
external commonsense knowledge that
cannot be obtained from the dialogue.
For example, all questions in Table 3 fall
under this category. D2-Q1 and D2-Q2 in
Tisch 3 belong to both logic and common-
sense since they require multiple sentences
as well as commonsense knowledge for
question answering. There exist multiple
types of commonsense knowledge in DREAM
such as the well-known properties of a
highly recognizable entity (e.g., D2-Q1
in Table 3), the prominent relationship
between two speakers (e.g., D2-Q3 in
Table 3), the knowledge of or shared by
a particular culture (e.g., when a speaker
says ‘‘Cola? I think it tastes like medicine.’’,
she/he probably means ‘‘I don’t like cola.’’),
and the cause-effect relation between events
(e.g., D1-Q1 in Table 1). We refer read-
ers to LoBue and Yates (2011) for detailed
definitions.
Table 6 shows the question type distribution
labeled by two human annotators on 25% ques-
tions randomly sampled from the development
and test sets. Besides the previously defined ques-
tion categories, we also report the percentage of
questions that require reasoning over multiple sen-
tences (i.e., summary or logic questions) and the
percentage of questions that require the surface-
level understanding or commonsense/math knowl-
edge based on the content of a single sentence.
As a question can belong to multiple reasoning
subcategories, the summation of the percentage
of reasoning subcategories is not equal to the
percentage of reasoning. The Cohen’s kappa coef-
ficient is 0.67 on the development set and 0.68 An
the test set.
Dialogues in DREAM are generally clean and
mostly error-free because they are carefully de-
signed by educational experts. However, it is not
guaranteed that each dialogue is written or proof-
read by a native speaker. Besides, dialogues tend
to be more proper and less informal for exam
purposes. To have a rough estimation of the qual-
ity of dialogues in DREAM and the differences
between these dialogues and more casual ones in
movies or television shows, we run a proofreading
tool—Grammarly2—on all the dialogues from the
annotated 25% instances of the development set
and the same size (20.7k tokens) of dialogues
from Friends, a famous American television show
2https://app.grammarly.com.
Metric                          DREAM   Friends
# of spelling errors            11      146
# of grammar errors             23      16
# of conciseness suggestions    6       2
# of vocabulary suggestions     18      3
General Performance             98.0    95.0
Readability Score               93.7    95.3

Table 7: Comparison of the quality of dialogues from
DREAM and Friends (a TV show).
whose transcripts are commonly used for dialogue
understanding (Chen and Choi, 2016; Ma et al.,
2018). As shown in Table 7, there exist fewer
spelling mistakes and the overall score is slightly
higher than that of the dialogues in Friends.
Based on the evaluated instances, articles and
verb forms are the two most frequent grammar
error categories (10 and 8, respectively, out of
23) in DREAM. Besides, the language tends to
be less precise in DREAM, indicated by the
number of vocabulary suggestions. For example,
experts tend to use expressions such as ‘‘really
hot,’’ ‘‘really beautiful,’’ ‘‘very bad,’’ and ‘‘very
important’’ rather than more appropriate yet more
advanced adjectives that might hinder reading
comprehension of language learners with smaller
vocabularies. According to the explanations pro-
vided by the tool, the readability scores for both
data sets fall into the same category ‘‘Your text
is very simple and easy to read, likely to be
understood by an average 5th-grader (age 10).’’
4 Approaches
We formally introduce the dialogue-based reading
comprehension task and notations in Section 4.1.
To investigate the effects of different kinds of
general world knowledge and dialogue structure,
we incorporate them into rule-based approaches
(Section 4.2) as well as non-neural (Section 4.3)
and neural (Section 4.4) machine learning ap-
proaches. We describe in detail preprocessing
and training in Section 4.5.
4.1 Problem Formulation and Notations
We start with a formal definition of the dialogue-
based multiple-choice reading comprehension
task. An n-turn dialogue D is defined as D =
{s1 : t1, s2 : t2, . . . , sn : tn}, where si represents
the speaker ID (z.B., ‘‘M’’ and ‘‘W’’), and ti
represents the text of the ith turn. Let Q denote
the text of question, and O1..3 denote the text of
three answer options. The task is to choose the
correct one from answer options O1..3 associated
with question Q given dialogue D. In this paper,
we regard this task as a three-class classification
problem, each class corresponding to an answer
option.
For convenience, we define the following
notations, which will be referred to in the rest of
this paper. Let Ds denote the turns spoken by
speaker s in D. Formally, Ds = {si1 : ti1, si2 :
ti2, . . . , sim : tim}, where {i1, i2, . . . , im} =
{i | si = s} and i1 < i2 < . . . < im. In particular,
s = ∗ denotes all the speakers. W Ds and W Oi
denote the ordered set of the running words
(excluding punctuation marks) in Ds and Oi,
respectively. Questions designed for dialogue-
based reading comprehension often focus on a
particular speaker. If there is exactly one speaker
mentioned in a question, we use sQ to denote this
target speaker. Otherwise, sQ = ∗. For example,
given the dialogue in Table 3, sQ =‘‘M’’ for
Question 1 and 2, and sQ = ∗ for Question 3.
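To make the notation concrete, a small illustrative sketch follows; the list-of-pairs representation and helper names are ours and not the released data format.

```python
# Illustration of the notation with a toy dialogue.
D = [("W", "Hey, Mike. Where have you been?"),
     ("M", "I was hiding in my office."),
     ("W", "Do you have any plans?")]

def D_s(D, s):
    # D_s: the turns spoken by speaker s; s = "*" denotes all speakers.
    return D if s == "*" else [(si, ti) for si, ti in D if si == s]

def W_words(turns):
    # W^{D_s}: the ordered running words (punctuation stripped) of those turns.
    words = []
    for _, text in turns:
        words.extend(w.strip(".,!?\"'").lower() for w in text.split())
    return [w for w in words if w]

print(W_words(D_s(D, "M")))  # words spoken by the male speaker only
```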
4.2 Rule-Based Approaches
We first attempt to incorporate dialogue structure
information into sliding window (SW), a rule-
based approach developed by Richardson et al.
(2013). This approach matches a bag of words
constructed from a question Q and one of its
answer options Oi with a given document, and
calculates the TF-IDF style matching score for
each answer option.
Let ˆDs, ˆQ, and ˆOi be the unordered set of
distinct words (excluding punctuation marks)
in Ds, Q, and Oi, respectively. Instead of only
regarding dialogue D as a non-conversational text
snippet, we also pay special attention to the context
that is relevant to the target speaker mentioned in
the question. Therefore, given a target speaker
sQ, we propose to compute a speaker-focused
sliding window score for each answer option Oi,
by matching a bag of words constructed from Q
and Oi with DsQ (i.e., turns spoken by sQ). Given
speaker s, we formally define the sliding window
score sw of Oi as:
$$sw^s_i = \max_j \sum_{k=1 \ldots |T_i|} \begin{cases} ics(W^{D_s}_{j+k}) & \text{if } W^{D_s}_{j+k} \in T_i \\ 0 & \text{otherwise} \end{cases} \quad (1)$$

where $ics(w) = \log\big(1 + 1/\sum_i \mathbb{1}(W^{D_s}_i = w)\big)$, $T_i = \hat{O}_i \cup \hat{Q}$, and $W^{D_s}_i$ denotes the $i$-th word in $W^{D_s}$.
Based on these definitions, we can regard $sw^{*}_i$ as
the general score defined in the original sliding
window approach, and $sw^{s_Q}_i$ represents the speaker-
focused sliding window score considering the
target speaker $s_Q$.
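A rough sketch of Expression (1), reusing the word-list helpers from the notation sketch above; the general score passes the whole dialogue's words as the passage, while the speaker-focused score passes only the words of the target speaker's turns.

```python
import math
from collections import Counter

def sliding_window_score(option_words, question_words, passage_words):
    # Expression (1): slide a window of |T_i| positions over W^{D_s} and sum the
    # inverse-count weights ics(w) of window words that fall in T_i = O_i ∪ Q.
    T = set(option_words) | set(question_words)
    counts = Counter(passage_words)
    ics = {w: math.log(1.0 + 1.0 / c) for w, c in counts.items()}
    best, size = 0.0, len(T)
    for j in range(len(passage_words)):
        window = passage_words[j:j + size]
        best = max(best, sum(ics[w] for w in window if w in T))
    return best
```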
Because the sliding window score ignores long-
range dependencies, Richardson et al. (2013)
introduce a distance-based variation (DSW), in
which a word-distance based score is subtracted
from the sliding window score to arrive at the final
score. Similarly, we calculate the speaker-focused
distance-based score given a (Q, Oi) pair and sQ,
by counting the distance between the occurrence
of a word in Q and a word in Oi in DsQ. More
formally, given speaker s and a set of stop words3
U , the distance-based score d of Oi is defined as
$$d^s_i = \begin{cases} 1 & \text{if } |I^s_Q| = 0 \text{ or } |I^s_{O_i}| = 0 \\ \frac{\delta^s_i}{|W^{D_s}|-1} & \text{otherwise} \end{cases} \quad (2)$$

where $I^s_Q = (\hat{Q} \cap \hat{D}_s) - U$, $I^s_{O_i} = (\hat{O}_i \cap \hat{D}_s) - \hat{Q} - U$, and $\delta^s_i$ is the minimum number of words between an occurrence of a question word and an answer option word in $W^{D_s}$, plus one. The formal definition of $\delta^s_i$ is as follows.

$$\delta^s_i = \min_{W^{D_s}_j \in I^s_Q,\; W^{D_s}_k \in I^s_{O_i}} |j - k| + 1 \quad (3)$$
Based on these definitions, we can regard $d^{*}_i$ as the distance-based score defined in the original sliding window approach, and $d^{s_Q}_i$ represents the speaker-focused distance-based score considering speaker $s_Q$. In addition, the final distance-based sliding window score of $O_i$ (Richardson et al., 2013) can be formulated as

$$sw^{*}_i - d^{*}_i \quad (4)$$

Expression (4) only focuses on the general (or speaker-independent) information (i.e., $sw^{*}_i$ and $d^{*}_i$); we can capture general and speaker-focused information (i.e., $sw^{s_Q}_i$, $sw^{*}_i$, $d^{s_Q}_i$, and $d^{*}_i$) simultaneously by averaging them:

$$\frac{sw^{s_Q}_i + sw^{*}_i}{2} - \frac{d^{s_Q}_i + d^{*}_i}{2} \quad (5)$$
Since a large percentage of questions cannot be solved by word-level matching, we also attempt to incorporate general world knowledge into our rule-based method. We calculate $cs^s_i$, the maximum cosine similarity between $O_i$ and consecutive words of the same length in $W^{D_s}$, as:

$$cs^s_i = \max_j \cos\big(\overline{W^{O_i}}, \overline{W^{D_s}_{j \ldots j+|W^{O_i}|-1}}\big) \quad (6)$$

where $\overline{x}$ is obtained by averaging the embeddings of the constituent words in $x$. Here we use ConceptNet embeddings (Speer et al., 2017) because they leverage the knowledge graph that focuses on general world knowledge. Following Expression (5), we capture both general and speaker-focused semantic information within a dialogue as follows.

$$\frac{cs^{s_Q}_i + cs^{*}_i}{2} \quad (7)$$

To make the final answer option selection, our rule-based method combines Expressions (5) and (7):

$$\arg\max_i \; \frac{sw^{s_Q}_i + sw^{*}_i}{2} - \frac{d^{s_Q}_i + d^{*}_i}{2} + \frac{cs^{s_Q}_i + cs^{*}_i}{2} \quad (8)$$

3We use the list of stop words from NLTK (Bird and Loper, 2004).
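Putting Expressions (5), (7), and (8) together, a sketch of the DSW++ decision rule; sliding_window_score is sketched above, while distance_score and embedding_similarity are assumed helper functions standing in for Expressions (2) and (6).

```python
def dsw_plus_plus(options, question_words, all_words, target_speaker_words):
    # Expression (8): average general (s = *) and speaker-focused (s = s_Q)
    # evidence, combining sliding window, distance, and embedding similarity,
    # then return the index of the highest-scoring answer option.
    def score(opt_words):
        total = 0.0
        for passage in (all_words, target_speaker_words):
            sw = sliding_window_score(opt_words, question_words, passage)
            d = distance_score(opt_words, question_words, passage)   # assumed helper, Expression (2)
            cs = embedding_similarity(opt_words, passage)            # assumed helper, Expression (6)
            total += (sw - d + cs) / 2.0
        return total
    scores = [score(opt) for opt in options]
    return max(range(len(options)), key=scores.__getitem__)
```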
4.3 Feature-Based Classifier
To explore what features are effective for dia-
logue understanding, we first consider a gradient
boosting decision tree (GBDT) classifier. Besides
the conventional bag-of-words features, we pri-
marily focus on features related to general world
knowledge and dialogue structure.
• Bag of words of each answer option.

• Features inspired by rule-based approaches: We adopt the features introduced in Section 4.2, including speaker-independent scores (i.e., $sw^{*}_i$ and $d^{*}_i$) and speaker-focused scores (i.e., $sw^{s_Q}_i$ and $d^{s_Q}_i$).

• Matching position: $p^{s_Q}_{1..3}$ and $p^{*}_{1..3}$, where $p^s_i$ is the last position (in percentage) of a word in $D_s$ that is also mentioned in $O_i$; 0 if none of the words in $D_s$ is mentioned in $O_i$. We consider matching position because of our observation of the existence of concessions and negotiations in dialogues (Amgoud et al., 2007). We assume the facts or opinions expressed near the end of a dialogue tend to be more critical for us to answer a question.

• Pointwise mutual information (PMI): $pmi^{s_Q}_{max,1..3}$, $pmi^{*}_{max,1..3}$, $pmi^{s_Q}_{min,1..3}$, $pmi^{*}_{min,1..3}$, $pmi^{s_Q}_{avg,1..3}$, and $pmi^{*}_{avg,1..3}$, where $pmi^s_{f,i}$ is defined as

$$pmi^s_{f,i} = \frac{\sum_j \log f_k \, \frac{C_2(W^{O_i}_j, W^{D_s}_k)}{C_1(W^{O_i}_j)\, C_1(W^{D_s}_k)}}{|W^{O_i}|} \quad (9)$$
C1(w) denotes the word frequency of w in
external corpora (we use Reddit posts [Tan
and Lee, 2015]), and C2(w1, w2) represents
the co-occurrence frequency of word w1 and
w2 within a distance < K in external corpora.
We use PMI to evaluate the relatedness
between the content of an answer option and
the target-speaker-focused context based on
co-occurrences of words in external corpora,
inspired by previous studies on narrative
event chains (Chambers and Jurafsky, 2008).
• ConceptNet relations (CR): $cr_{1..3,1..|R|}$. $R = \{r_1, r_2, \ldots\}$ is the set of ConceptNet relation types (e.g., ‘‘CapableOf’’ and ‘‘PartOf’’). $cr_{i,j}$ is the number of relation triples $(w_1, r_j, w_2)$ that appear in ConceptNet (Speer et al., 2017), where $w_1$ represents a word in answer option $O_i$, $w_2$ represents a word in $D$, and the relation type $r_j \in R$. Similar to the motivation for using PMI, we use CR to capture the association between an answer option and the source dialogue based on raw co-occurrence counts in the commonsense knowledge base.
• ConceptNet embeddings (CE): Besides the lexical similarity based on string matching, we also calculate $cs^{*}_{1..3}$ and $cs^{s_Q}_{1..3}$, where $cs^{*}_i$ and $cs^{s_Q}_i$ represent the maximum cosine similarity between $O_i$ and consecutive words of the same length in $D$ and $D_{s_Q}$, respectively (Expression (6) in Section 4.2). We use ConceptNet embeddings (Speer et al., 2017) because they leverage the general world knowledge graph.
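As an illustration of one of the knowledge features, a sketch of the CR counts; the triples argument is assumed to hold (word, relation, word) triples pre-extracted from ConceptNet, which is our own loading convention rather than a ConceptNet API.

```python
def conceptnet_relation_counts(option_words, dialogue_words, triples, relation_types):
    # CR features cr_{i,j}: count triples (w1, r_j, w2) with w1 in option O_i,
    # w2 in the dialogue D, and r_j one of the ConceptNet relation types.
    opt, dlg = set(option_words), set(dialogue_words)
    counts = dict.fromkeys(relation_types, 0)
    for w1, rel, w2 in triples:
        if rel in counts and w1 in opt and w2 in dlg:
            counts[rel] += 1
    return [counts[r] for r in relation_types]  # one feature per relation type
```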
4.4 End-To-End Neural Network
Our end-to-end neural model is based on a genera-
tive pre-trained language model (LM). We follow
the framework of
finetuned transformer LM
(FTLM) (Radford et al., 2018) and make modifica-
tions for dialogue-based reading comprehension.
The training procedure of FTLM consists of
two stages. The first stage is to learn a high-
capacity language model on a large-scale un-
supervised corpus of tokens U = {u1, . . . , un} by
maximizing the following likelihood:
$$L_{LM}(U) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta) \quad (10)$$
where k is the context window size, and the
conditional probability P is modeled by a multi-
layer transformer decoder (Liu et al., 2018) with
parameters Θ. In the second stage, the model is
adapted to a labeled data set C, where each instance
consists of a sequence of input tokens x1, . . . , xm
with a label y, by maximizing:
$$L(C) = \sum_{x,y} \log P(y \mid x_1, \ldots, x_m) + \lambda L_{LM}(C) \quad (11)$$
where P (y | x1, . . . , xm) is obtained by a linear +
softmax layer over the final transformer block’s
activation, and λ is the weight for language model.
For multiple-choice reading comprehension, the
input tokens x1, . . . , xm come from the concat-
enation of a start token, dialogue, question, a
delimiter token, answer option, and an end token;
y indicates if the answer option is correct. We refer
readers to Radford et al. (2018) for more details.
Because the original FTLM framework already
leverages rich linguistic information from a large
unlabeled corpus, which can be regarded as a
type of tacit general world knowledge, we inves-
tigate whether additional dialogue structure can
further improve this strong baseline. We pro-
pose speaker embedding to better capture dialogue
structure. Specifically, in the original framework,
given an input context (u−k, . . . , u−1) of the trans-
former, the encoding of u−i is we(u−i) + pe(i),
where we(·) is the word embedding and pe(·)
is the position embedding. When adapting Θ to
DREAM, we change the encoding to we(u−i) +
pe(i) + se(u−i, sQ), where the speaker embedding
se(u−i, sQ) is (a) 0 if the token u−i is not in the
dialogue (i.e., it is either a start/end/delimiter token
or a token in the question/option); (b) e_target if
the token is spoken by sQ; (c) e_rest if the token is
in the dialogue but not spoken by sQ. e_target and
e_rest are trainable and initialized randomly. We
show the overall framework in Figure 1.
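A sketch of the modified input encoding (PyTorch-style, with assumed shapes and module names; the actual FTLM implementation differs in detail):

```python
import torch
import torch.nn as nn

class DialogueInputEmbedding(nn.Module):
    # Word + position embedding, plus the proposed speaker embedding se(u, s_Q).
    def __init__(self, vocab_size, max_len, dim):
        super().__init__()
        self.we = nn.Embedding(vocab_size, dim)   # word embedding we(.)
        self.pe = nn.Embedding(max_len, dim)      # position embedding pe(.)
        # speaker ids: 0 = token not in the dialogue (kept at zero),
        # 1 = spoken by s_Q (e_target), 2 = other dialogue speakers (e_rest)
        self.se = nn.Embedding(3, dim, padding_idx=0)

    def forward(self, token_ids, speaker_ids):
        # token_ids, speaker_ids: LongTensors of shape (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.we(token_ids) + self.pe(positions) + self.se(speaker_ids)
```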
4.5 Preprocessing and Training Details
For all the models, we conduct coreference res-
olution to determine speaker mentions of sQ
based on simple heuristics. Particularly, we map
three most common speaker abbreviations (i.e.,
‘‘M’’; ‘‘W’’ and ‘‘F’’) that appear in dialogues to
their eight most common corresponding mentions
(i.e., ‘‘man,’’ ‘‘boy,’’ ‘‘he,’’ and ‘‘his’’; ‘‘woman,’’
‘‘girl,’’ ‘‘she,’’ and ‘‘her’’) in questions. We keep
Figure 1: Overall neural network framework (Section 4.4).
speaker abbreviations unchanged, since neither
replacing them with their corresponding full forms
nor removing them contributes to the performance
based on our experiments.
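A sketch of this heuristic; the abbreviation-to-mention map follows the lists above, while the tokenization and tie handling are simplified assumptions.

```python
MENTIONS = {"M": {"man", "boy", "he", "his"},
            "W": {"woman", "girl", "she", "her"},
            "F": {"woman", "girl", "she", "her"}}

def resolve_target_speaker(question_text, dialogue_speakers):
    # Return s_Q if exactly one dialogue speaker is referred to in the question
    # via the abbreviation-to-mention map above, and "*" otherwise.
    q_words = {w.strip("?.,'\"").lower() for w in question_text.split()}
    hits = {s for s in dialogue_speakers if MENTIONS.get(s, set()) & q_words}
    return hits.pop() if len(hits) == 1 else "*"
```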
For the neural model mentioned in Section 4.4,
most of our parameter settings follow Radford
et al. (2018). We adopt the same preprocessing
procedure and use their publicly released language
model, which is pre-trained on the BooksCorpus
data set (Zhu et al., 2015). We set the batch size to
8, language model weight λ to 2, and maximum
epochs of training to 10.
For other models, we use the following pre-
processing steps. We tokenize and lowercase the
corpus, convert number words to numeric digits,
normalize time expressions to 24-hour numeric
form, and address negation by removing interrog-
ative sentences that receive ‘‘no’’ as the reply. We
use the gradient boosting classifier implemented
in the scikit-learn toolkit (Pedregosa et al., 2011).
We set the number of boosting iterations to 600
and keep the rest of hyperparameters unchanged.
The distance upper bound K for PMI is set to 10.
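For reference, the corresponding scikit-learn configuration is roughly as follows; the feature matrix is a random placeholder, and only the number of boosting iterations deviates from the library defaults, as stated above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Placeholder features: one row per question holding the Section 4.3 features
# of its three options; y is the index (0-2) of the correct option.
X_train = np.random.rand(200, 30)
y_train = np.random.randint(0, 3, 200)

clf = GradientBoostingClassifier(n_estimators=600)  # 600 boosting iterations;
clf.fit(X_train, y_train)                           # other settings left at defaults
print(clf.predict(X_train[:5]))
```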
We perform several runs of machine learning
models (Section 4.3 and Section 4.4) with random-
ness introduced by different random seeds and/or
GPU non-determinism and select the model or
models (for ensemble) that perform best on the
development set.
5 Experiment
5.1 Baselines
We implement several baselines, including rule-
based methods and state-of-the-art neural models.
• Word Matching This strong baseline (Yih
et al., 2013) selects the answer option that
has the highest count of overlapping words
with the given dialogue.
• Sliding Window We implement the sliding
window approach (i.e., $\arg\max_i sw^{*}_i$) and
its distance-based variation DSW (i.e.,
$\arg\max_i sw^{*}_i - d^{*}_i$) (Richardson et al., 2013)
introduced in Section 4.2.
• Enhanced Distance-Based Sliding Window
(DSW++) We also use general world knowl-
edge and speaker-focused information to
improve the original sliding window base-
line, formulated in Expression 8 (Section 4.2).
• Stanford Attentive Reader This neural base-
line compares each candidate answer (i.e.,
entity) representation to the question-aware
document representation built with atten-
tion mechanism (Hermann et al., 2015;
Chen et al., 2016). Lai et al. (2017) add
a bilinear operation to compare document
and answer option representations to answer
multiple-choice questions.
• Gated-Attention Reader The baseline mod-
els multiplicative question-specific document
representations based on a gated-attention
mechanism (Dhingra et al., 2017), which are
then compared to each answer option (Lai
et al., 2017).
• Co-Matching This state-of-the-art multiple-
choice reading comprehension model explic-
itly treats question and answer option as two
sequences and jointly matches them against
a given document (Wang et al., 2018b).
• Finetuned Transformer LM This is a gen-
eral task-agnostic model introduced in Sec-
tion 4.4, which achieves the best reported
performance on several tasks requiring multi-
sentence reasoning (Radford et al., 2018).
Method                                                                         Dev    Test
Random                                                                         32.8   33.4
Word Matching (WM) (Yih et al., 2013)                                          41.7   42.0
Sliding Window (SW) (Richardson et al., 2013)                                  42.6   42.5
Distance-Based Sliding Window (DSW) (Richardson et al., 2013)                  44.4   44.6
Stanford Attentive Reader (SAR) (Chen et al., 2016)                            40.2   39.8
Gated-Attention Reader (GAR) (Dhingra et al., 2017)                            40.5   41.3
Co-Matching (CO) (Wang et al., 2018b)                                          45.6   45.5
Finetuned Transformer LM (FTLM) (Radford et al., 2018)                         55.9   55.5
Our Approaches:
DSW++ (DSW w/ Dialogue Structure and ConceptNet Embedding)                     51.4   50.1
GBDT++ (GBDT w/ Features of Dialogue Structure and General World Knowledge)    53.3   52.8
FTLM++ (FTLM w/ Speaker Embedding)                                             57.6   57.4
Ensemble of 3 FTLM++                                                           58.1   58.2
Ensemble of 1 GBDT++ and 3 FTLM++                                              59.6   59.5
Human Performance                                                              93.9*  95.5*
Ceiling Performance                                                            98.7*  98.6*

Table 8: Performance in accuracy (%) on the DREAM data set. Performance marked by * is reported based on
25% annotated questions from the development and test sets.
We do not investigate other ways of leveraging
pre-trained deep models such as adding ELMo
representations (Peters et al., 2018) as additional
features to a neural model since recent stud-
ies show that directly fine-tuning a pre-trained
language model such as FTLM is significantly
superior on multiple-choice reading comprehen-
sion tasks (Radford et al., 2018; Chen et al., 2019).
We do not apply more recent extractive models
such as AOA (Cui et al., 2017) and QANet (Yu
et al., 2018) since they aim at precisely locating a
span in a document. When adapted to solve ques-
tions with abstractive answer options, extractive
models generally tend to perform less well (Chen
et al., 2016; Dhingra et al., 2017; Lai et al., 2017).
5.2 Results and Analysis
We report the performance of the baselines intro-
duced in Section 5.1 and our proposed approaches
in Table 8. We report the averaged accuracy of
two annotators as the human performance. The
proportion of valid questions (i.e., an unambigu-
ous question with a unique correct answer option
provided) that are manually checked by annota-
tors on the annotated test and development sets is
regarded as the human ceiling performance.
Surface matching is insufficient. Experimen-
tal results show that neural models that primarily
exploit surface-level information (i.e., SAR, GAR,
and CO) attain a performance level close to that
of simple rule-based approaches (i.e., WM, SW,
and DSW). The highest accuracy achieved by CO
is 45.5%, a similar level of performance to the
rule-based method DSW (44.6%).
It is helpful to incorporate general world
knowledge and dialogue structure. We see a
significant gain of 5.5% in accuracy when enhanc-
ing DSW using general world knowledge from
ConceptNet embeddings and considering speaker-
focused information (Section 4.2). FTLM, which
leverages rich external linguistic knowledge from
thousands of books, already achieves a much
higher accuracy (55.5%) compared with previous
state-of-the-art machine comprehension models,
indicating the effectiveness of general world
knowledge. Experimental results show that our
best single model FTLM++ significantly outper-
forms FTLM (p-value = 0.03), illustrating the
usefulness of additional dialogue structure. Com-
pared with the state-of-the-art neural reader Co-
Matching that primarily explores surface-level
information (45.5%), the tacit general world knowl-
edge (in the pre-trained language model) and dia-
logue structure in FTLM++ lead to an absolute
gain of 11.9% in accuracy.
Ensembling different types of methods can
bring further improvements. We use the
majority vote strategy to obtain the ensemble
model performance. Although GBDT++ (52.8%)
itself does not outperform FTLM++, GBDT++
Method                    Accuracy   Δ
DSW++                     51.4       −
  − dialogue structure    50.0       −1.4
  − CE                    46.7       −4.7
GBDT++                    53.3       −
  − bag of words          51.6       −1.7
  − rule-based features   51.2       −2.1
  − matching position     53.0       −0.3
  − dialogue structure    51.9       −1.4
  − PMI                   51.4       −1.9
  − CR                    52.7       −0.6
  − CE                    52.7       −0.6
  − PMI, CR, CE           47.1       −6.2
FTLM++                    57.6       −
  − speaker embedding     55.9       −1.7
  − LM pre-training       36.2       −21.4

Table 9: Ablation tests on the development set (%).
Minus (−) indicates percentage decrease.
can serve as a supplement to FTLM++ because
they leverage different types of general world
knowledge and model architectures. We achieve
the highest accuracy (59.5%) by ensembling one
GBDT++ and three FTLM++.
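A sketch of the majority-vote combination; the paper does not specify a tie-breaking rule, so this sketch falls back to the earliest-listed model's prediction.

```python
from collections import Counter

def majority_vote(per_model_predictions):
    # per_model_predictions: one list of predicted option indices (0-2) per model,
    # e.g. [gbdt_preds, ftlm1_preds, ftlm2_preds, ftlm3_preds].
    ensembled = []
    for votes in zip(*per_model_predictions):
        counts = Counter(votes)
        top = max(counts.values())
        # keep the earliest-listed model's vote among the most frequent options
        ensembled.append(next(v for v in votes if counts[v] == top))
    return ensembled
```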
5.3 Ablation Tests
We conduct ablation tests to evaluate the indi-
vidual components of our proposed approaches
(Table 9). In Table 10, we summarize the involved
types of dialogue structure and general world
knowledge in our approaches.
Dialogue Structure Specifically, we observe a 1.4%
drop in accuracy if we set the target speaker sQ
to ∗ for all questions when we apply DSW++.
We observe a similar performance drop when we
remove speaker-focused features from GBDT++.
In addition, removing speaker embeddings from
FTLM++ leads to a 1.7% drop in accuracy (in
this case, the model becomes the original FTLM).
These results consistently indicate the usefulness
of dialogue structure for dialogue understanding.
General World Knowledge We also investigate
the effects of general world knowledge. The
accuracy of DSW++ drops by 4.7% if we remove
ConceptNet embeddings (CE) by deleting the last
term of Expression (8) in Section 4.2. Additionally,
the accuracy of GBDT++ drops by 6.2% if we
remove all the general world knowledge features
(i.e., ConceptNet embeddings/relations and PMI),
leading to prediction failures on questions such
           Dialogue Structure        General World Knowledge
DSW++      speaker-focused scores    CE
GBDT++     speaker-focused features  CE, CR, and PMI
FTLM++     speaker embedding         pre-trained LM

Table 10: Types of dialogue structure and general world
knowledge investigated in our approaches.
as ‘‘What do we learn about the man?’’ whose
correct answer option ‘‘He is health-conscious.’’
is not explicitly mentioned in the source dialogue
‘‘M: We had better start to eat onions frequently,
Linda. W: But you hate onions, don’t you? M:
Until I learned from a report from today’s paper
that
they protect people from flu and colds.
After all, compared with health, taste is not so
important.’’ Moreover, if we train FTLM++ with
randomly initialized transformer weights instead
of weights pre-trained on the external corpus, the
accuracy drops dramatically to 36.2%, which is
only slightly better than a random baseline.
5.4 Error Analysis
Impact of Longer Turns The number of dial-
ogue turns has a significant
impact on the
performance of FTLM++. As shown in Figure 2,
its performance reaches the peak when the number
of turns ranges from 0 to 10, while it suffers
severe performance drops when the given dialogue
contains more turns. Both DSW++ (56.8%)
and GBDT++ (57.4%) outperform FTLM++
(55.7%) when the number of turns ranges from
10 to 48. To deal with lengthy context, it may be
helpful to first identify relevant sentences based
on a question and its associated answer options
rather than using the entire dialogue context as
input.
Impact of Confusing Distractors For 54.5% of
questions on the development set, the fuzzy match-
ing score (Sikes, 2007) of at least one distractor
answer option against the dialogue is higher than
the score of the correct answer option. For ques-
tions that all models (i.e., DSW++, GBDT++,
and FTLM++) fail to answer correctly, 73.0% of
them contain at least one such confusing distractor
answer option. The causes of this kind of errors
can be roughly divided into two categories. First,
the distractor is wrongly associated with the target
speaker/s mentioned in the question (e.g., answer
option A and C in D2-Q3 in Table 3). Second,
although the claim in the distractor is supported
by the dialogue, it is irrelevant to the question
(e.g., D1-Q1-B in Table 1). A promising direction
to solve this problem could be the construction
of speaker-focused event chains (Chambers and
Jurafsky, 2008) and advanced dialogue-specific
coreference resolution systems for more reliable
evidence context detection in a dialogue.

Question Type        FTLM++   GBDT++
Matching             57.0     68.1
Reasoning            56.8     49.4
  Summary            73.6     47.1
  Logic              55.0     49.7
  Arithmetic         30.2     24.5
  Commonsense        53.4     41.7
Single sentence      56.5     63.3
Multiple sentences   56.9     49.5

Table 11: Accuracy (%) by question type on the
annotated development subset.

Figure 2: Performance comparison of different number
of turns on the test set.

Impact of Question Types We further report the
performance of the best single model FTLM++
and the GBDT++ baseline on the categories
defined in Section 3.2 (Table 11). Not surprisingly,
both models perform worse than random guessing
on math problems. While most of the math
problems can be solved by one single linear
equation, it is still difficult to apply recent neural
math word problem solvers (Huang et al., 2018;
Wang et al., 2018a) due to informal dialogue-
based problem descriptions and the requirement
of commonsense inference. For example, given
the dialogue:
‘‘W: The plane arrives at 10:50. It is already
10:40 now. Be quick! M: Relax. Your watch must
be fast. There are still twenty minutes left.’’
We need prior knowledge to infer that the watch of
the man is showing incorrect time 10:40. Instead,
10:50 should be used as the reference time with
the time interval ‘‘twenty minutes left’’ together to
answer the question ‘‘What time is it now?’’
Results show that GBDT++ is superior to the
fine-tuned language model on the questions under
the category matching (68.1% vs. 57.0%) and the
latter model is more capable of answering implicit
questions (e.g., under the category summary, logic,
and commonsense) which require aggregation of
information from multiple sentences, the under-
standing of the entire dialogue, or the utilization
of world knowledge. Therefore, it might be useful
to leverage the strengths of individual models to
solve different types of questions.
6 Conclusion and Future Work
We present DREAM, the first multiple-choice
dialogue-based reading comprehension data set
from English language examinations. Besides the
multi-turn multi-party dialogue context, 85% of
questions require multiple-sentence reasoning,
and 34% of questions also require commonsense
knowledge, making this task very challenging.
We apply several popular reading comprehension
models and find that surface-level information is
insufficient. We incorporate general world knowl-
edge and dialogue structure into rule-based and
machine learning methods and show the effec-
tiveness of these factors, suggesting a promising
direction for dialogue-based reading comprehen-
sion. For future work, we are interested in problem
generation for dialogues and investigating whether
it will lead to more gains to pre-train a deep lan-
guage model such as FTLM over large-scale
dialogues from movies and TV shows instead of
the BookCorpus data set (Zhu et al., 2015) used
by previous work (Radford et al., 2018).
Acknowledgments
We would like to thank the editors and anony-
mous reviewers for their helpful feedback. We
also thank Hai Wang from Toyota Technological
Institute at Chicago for useful discussions and
valuable comments.
References
Leila Amgoud, Yannis Dimopoulos, and Pavlos
Moraitis. 2007. A unified and general frame-
work for argumentation-based negotiation.
In Proceedings of the AAMAS, pages 1–8.
New York, NY, USA.
Ondrej Bajgar, Rudolf Kadlec, and Jan
Kleindienst. 2016. Embracing data abundance:
Booktest data set for reading comprehension.
CoRR, cs.CL/1610.00956v1.
Steven Bird and Edward Loper. 2004. NLTK: the
natural language toolkit. In Proceedings of the
ACL on Interactive poster and demonstration
sessions, pages 31–34. Barcelona, Spain.
Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang,
Ting Liu, and Guoping Hu. 2017. Attention-
over-attention neural networks for
reading
comprehension. In Proceedings of the ACL,
pages 593–602. Vancouver, Canada.
Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang,
William Cohen, and Ruslan Salakhutdinov.
2017. Gated-attention readers for text com-
prehension. In Proceedings of the ACL,
pages 1832–1846. Vancouver, Canada.
Matthew Dunn, Levent Sagun, Mike Higgins,
V Ugur Guney, Volkan Cirik, and Kyunghyun
Cho. 2017. SearchQA: A new Q&A data set
augmented with context from a search engine.
CoRR, cs.CL/1704.05179v3.
Nathanael Chambers and Dan Jurafsky. 2008.
Unsupervised learning of narrative event chains.
In Proceedings of the ACL, pages 789–797.
Columbus, OH.
Maxwell Forbes and Yejin Choi. 2017. Verb
physics: Relative physical knowledge of actions
and objects. In Proceedings of the ACL,
pages 266–276. Vancouver, Canada.
Danqi Chen, Jason Bolton, and Christopher D.
Manning. 2016. A thorough examination of the
CNN/Daily Mail reading comprehension task.
In Proceedings of the ACL, pages 2358–2367.
Berlin, Germany.
Yu-Hsin Chen and Jinho D. Choi. 2016. Character
identification on multiparty conversation: Iden-
tifying mentions of characters in TV shows.
In Proceedings of the SIGDial, pages 90–100.
Los Angeles, CA.
Zhipeng Chen, Yiming Cui, Wentao Ma, Shijin
Wang, and Guoping Hu. 2019. Convolutional
spatial attention model for reading compre-
hension with multiple-choice questions.
In
Proceedings of the AAAI. Honolulu, HI.
Karl Moritz Hermann, Tomas Kocisky, Edward
Grefenstette, Lasse Espeholt, Will Kay, Mustafa
Suleyman, and Phil Blunsom. 2015. Teaching
machines to read and comprehend. In Proceed-
ings of the NIPS, pages 1693–1701. Montreal,
Canada.
Felix Hill, Antoine Bordes, Sumit Chopra, and
Jason Weston. 2016. The goldilocks principle:
Reading children’s books with explicit memory
representations. In Proceedings of the ICLR.
Caribe Hilton, Puerto Rico.
Danqing Huang, Jing Liu, Chin-Yew Lin, and Jian
Yin. 2018. Neural math word problem solver
with reinforcement learning. In Proceedings of
the COLING, pages 213–223. Santa Fe, NM.
Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar,
Wen-tau Yih, Yejin Choi, Percy Liang, and
Luke Zettlemoyer. 2018. QuAC: Question an-
swering in context. pages 2174–2184. Brussels,
Belgium.
Mandar Joshi, Eunsol Choi, Daniel S. Weld,
and Luke Zettlemoyer. 2017. TriviaQA: A
large scale distantly supervised challenge
data set for reading comprehension. CoRR,
cs.CL/1705.03551v2.
Peter Clark, Oren Etzioni, Tushar Khot, Ashish
Sabharwal, Oyvind Tafjord, Peter D. Turney,
and Daniel Khashabi. 2016. Combining re-
trieval, statistics, and inference to answer ele-
mentary science questions. In Proceedings of
the AAAI, pages 2580–2586. Phoenix, AZ.
Daniel Khashabi, Snigdha Chaturvedi, Michael
Roth, Shyam Upadhyay, and Dan Roth. 2018.
Looking beyond the surface: A challenge
set for reading comprehension over multiple
sentences. In Proceedings of the NAACL-HLT,
pages 252–262. New Orleans, LA.
Tomáš Kočiský, Jonathan Schwarz, Phil
Blunsom, Chris Dyer, Karl Moritz Hermann,
Gábor Melis, and Edward Grefenstette.
2018. The NarrativeQA reading comprehension
challenge. Transactions of the Association for
Computational Linguistics, 6:317–328.
Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming
Yang, and Eduard Hovy. 2017. RACE: Large-
scale reading comprehension data set from
examinations. In Proceedings of the EMNLP,
pages 785–794. Copenhagen, Denmark.
Chia-Hsuan Lee, Shang-Ming Wang, Huan-Cheng
Chang, and Hung-Yi Lee. 2018. ODSQA:
Open-domain spoken question answering data
set. CoRR, cs.CL/1808.02280v1.
Chia-Hsuan Li, Szu-Lin Wu, Chi-Liang Liu,
and Hung-yi Lee. 2018. Spoken SQuAD: A
study of mitigating the impact of speech
recognition errors on listening comprehension.
CoRR, cs.CL/1804.00320v1.
Peter J Liu, Mohammad Saleh, Etienne Pot, Ben
Goodrich, Ryan Sepassi, Lukasz Kaiser, and
Noam Shazeer. 2018. Generating Wikipedia by
summarizing long sequences. In Proceedings
of the ICLR. Vancouver, Canada.
Peter LoBue and Alexander Yates. 2011. Types
of common-sense knowledge needed for rec-
ognizing textual entailment. In Proceedings of
the ACL, pages 329–334. Portland, OR.
Kaixin Ma, Tomasz Jurczyk, and Jinho D. Choi.
2018. Challenging reading comprehension on
daily conversation: Passage completion on multi-
party dialog. In Proceedings of the NAACL-
HLT, pages 2039–2048. New Orleans, LA.
Todor Mihaylov, Peter Clark, Tushar Khot, and
Ashish Sabharwal. 2018. Can a suit of armor
conduct electricity? A new data set for open
book question answering. In Proceedings of the
EMNLP. Brussels, Belgium.
Nasrin Mostafazadeh, Nathanael Chambers,
Xiaodong He, Devi Parikh, Dhruv Batra, Lucy
Vanderwende, Pushmeet Kohli, and James
Allen. 2016. A corpus and evaluation framework
for deeper understanding of commonsense
stories. In Proceedings of the NAACL-HLT,
pages 839–849. San Diego, CA.
I Nation. 2006. How large a vocabulary is needed
for reading and listening? Canadian Modern
Language Review, 63:59–82.
Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng
Gao, Saurabh Tiwary, Rangan Majumder, and
Li Deng. 2016. MS MARCO: A human gen-
erated machine reading comprehension data set.
CoRR, cs.CL/1611.09268v2.
Takeshi Onishi, Hai Wang, Mohit Bansal, Kevin
Gimpel, and David McAllester. 2016. Who
did What: A large-scale person-centered cloze
data set. In Proceedings of the EMNLP,
pages 2230–2235. Austin, TX.
Simon Ostermann, Michael Roth, Ashutosh Modi,
Stefan Thater, and Manfred Pinkal. 2018.
SemEval-2018 Task 11: Machine compre-
hension using commonsense knowledge. In
Proceedings of the SemEval, pages 747–757.
New Orleans, LA.
Fabian Pedregosa, Ga¨el Varoquaux, Alexandre
Gramfort, Vincent Michel, Bertrand Thirion,
Olivier Grisel, Mathieu Blondel,
Peter
Prettenhofer, Ron Weiss, Vincent Dubourg,
Jake Vanderplas, Alexandre Passos, David
Cournapeau, Matthieu Brucher, Matthieu
Perrot, and Eduard Duchesnay. 2011. Scikit-
learn: Machine learning in python. Journal of
Machine Learning Research, 12:2825–2830.
Anselmo Penas, Yusuke Miyao, Alvaro Rodrigo,
Eduard H Hovy, and Noriko Kando. 2014.
Overview of CLEF QA Entrance Exams
Task 2014. In Proceedings of the CLEF,
pages 1194–1200. Sheffield, UK.
Matthew Peters, Mark Neumann, Mohit Iyyer,
Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. 2018. Deep contex-
tualized word representations. In Proceed-
ings of the NAACL-HLT, pages 2227–2237.
New Orleans, LA.
Alec Radford, Karthik Narasimhan, Tim Salimans,
and Ilya Sutskever. 2018. Improving language
understanding by generative pre-training. Preprint,
available at https://openai.com/blog/language-
unsupervised/.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev,
and Percy Liang. 2016. SQuAD: 100,000+
questions for machine comprehension of text. In
Proceedings of the EMNLP, pages 2383–2392.
Austin, TX.
Siva Reddy, Danqi Chen, and Christopher D.
Manning. 2018. CoQA: A conversational ques-
tion answering challenge. CoRR, cs.CL/1808.
07042v1.
Matthew Richardson, Christopher JC Burges, and
Erin Renshaw. 2013. MCTest: A challenge data
set for the open-domain machine comprehen-
sion of text. In Proceedings of the EMNLP,
pages 193–203. Seattle, WA.
Hideyuki Shibuki, Kotaro Sakamoto, Yoshinobu
Kano, Teruko Mitamura, Madoka Ishioroshi,
Kelly Y Itakura, Di Wang, Tatsunori Mori,
and Noriko Kando. 2014. Overview of the
NTCIR-11 QA-Lab Task. In NTCIR.
Richard Sikes. 2007. Fuzzy matching in theory
and practice. Multilingual, 18(6):39–43.
Robyn Speer, Joshua Chin, and Catherine Havasi.
2017. ConceptNet 5.5: An Open Multilingual
Graph of General Knowledge. In Proceedings
of the AAAI, pages 4444–4451. San Francisco,
CA.
Chenhao Tan and Lillian Lee. 2015. All who
wander: On the prevalence and characteristics
of multi-community engagement. In Proceed-
ings of the WWW, pages 1056–1066. Florence,
Italy.
Yi Tay, Luu Anh Tuan, and Siu Cheung Hui.
2018. Multi-range reasoning for machine com-
prehension. CoRR, cs.CL/1803.09074v1.
Adam Trischler, Tong Wang, Xingdi Yuan, Justin
Harris, Alessandro Sordoni, Philip Bachman,
and Kaheer Suleman. 2017. NewsQA: A
machine comprehension data set. In Pro-
ceedings of the RepL4NLP, pages 191–200.
Vancouver, Canada.
Bo-Hsiang Tseng, Sheng-Syun Shen, Hung-Yi
Lee, and Lin-Shan Lee. 2016. Towards machine
comprehension of spoken content: Initial toefl
listening comprehension test by machine. In
Proceedings of the Interspeech. San Francisco,
CA.
Lei Wang, Yan Wang, Deng Cai, Dongxiang
Zhang, and Xiaojiang Liu. 2018a. Translating
a math word problem to an expression tree. In
Proceedings of the EMNLP, pages 1064–1069.
Brussels, Belgium.
Shuohang Wang, Mo Yu, Shiyu Chang, and Jing
Jiang. 2018b. A co-matching model for multi-
choice reading comprehension. In Proceedings
of the ACL, pages 1–6. Melbourne, Australia.
Wen-tau Yih, Ming-Wei Chang, Christopher Meek,
and Andrzej Pastusiak. 2013. Question answer-
ing using enhanced lexical semantic models.
In Proceedings of the ACL, pages 1744–1753.
Sofia, Bulgaria.
Adams Wei Yu, David Dohan, Minh-Thang Luong,
Rui Zhao, Kai Chen, Mohammad Norouzi, and
Quoc V Le. 2018. QANet: Combining local
convolution with global self-attention for read-
ing comprehension. In Proceedings of the ICLR.
Vancouver, Canada.
Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan
Salakhutdinov, Raquel Urtasun, Antonio Torralba,
and Sanja Fidler. 2015. Aligning books and
movies: Towards story-like visual explanations
by watching movies and reading books. In
Proceedings of the IEEE ICCV, pages 19–27.
Santiago, Chile.