
DREAM: A Challenge Data Set and Models for
Dialogue-Based Reading Comprehension

Kai Sun♠ ∗ Dian Yu♥

Jianshu Chen♥ Dong Yu♥ Yejin Choi♦, ♣ Claire Cardie♠

♠Cornell University, Ithaca, NY, USA
♥Tencent AI Lab, Bellevue, WA, USA
♦University of Washington, Seattle, WA, USA
♣Allen Institute for Artificial Intelligence, Seattle, WA, USA
ks985@cornell.edu {yudian,jianshuchen,dyu}@tencent.com
yejin@cs.washington.edu cardie@cs.cornell.edu

Abstract

We present DREAM, the first dialogue-based multiple-choice reading comprehension data set. Collected from English as a Foreign Language examinations designed by human experts to evaluate the comprehension level of Chinese learners of English, our data set contains 10,197 multiple-choice questions for 6,444 dialogues. In contrast to existing reading comprehension data sets, DREAM is the first to focus on in-depth multi-turn multi-party dialogue understanding. DREAM is likely to present significant challenges for existing reading comprehension systems: 84% of answers are non-extractive, 85% of questions require reasoning beyond a single sentence, and 34% of questions also involve commonsense knowledge.

We apply several popular neural reading comprehension models that primarily exploit surface information within the text and find that they, at best, just barely outperform a rule-based approach. We next investigate the effects of incorporating dialogue structure and different kinds of general world knowledge into both rule-based and (neural and non-neural) machine learning-based reading comprehension models. Experimental results on the DREAM data set show the effectiveness of dialogue structure and general world knowledge. DREAM is available at https://dataset.org/dream/.

1 Introduction

Recently a significant amount of research has focused on the construction of large-scale multiple-choice (Lai et al., 2017; Khashabi et al., 2018; Ostermann et al., 2018) and extractive (Hermann et al., 2015; Hill et al., 2016; Rajpurkar et al., 2016; Trischler et al., 2017) reading comprehension data sets (Section 2). Source documents in these data sets have generally been drawn from formal written texts such as news, fiction, and Wikipedia articles, which are commonly considered well-written, accurate, and neutral.

∗This work was done when K. S. was an intern at the Tencent AI Lab, Bellevue, WA.

With the goal of advancing research in machine reading comprehension and facilitating dialogue understanding, we construct and present DREAM — the first multiple-choice Dialogue-based REAding comprehension exaMination data set. We collect 10,197 questions for 6,444 multi-turn multi-party dialogues from English language exams, which are carefully designed by educational experts (e.g., English teachers) to assess the comprehension level of Chinese learners of English. Each question is associated with three answer options, exactly one of which is correct. (See Table 1 for an example.) DREAM covers a variety of topics and scenarios in daily life, such as conversations on the street, on the phone, in a classroom or library, at the airport, or at the office or a shop (Section 3).

Based on our analysis of DREAM, we argue that dialogue-based reading comprehension is at least as difficult as its existing non-conversational counterparts. In particular, answering 34% of DREAM questions requires unspoken commonsense knowledge, for example, unspoken scene information. This might be due to the nature of dialogues: for efficient oral communication, people rarely state obvious explicit world knowledge (Forbes and Choi, 2017) such as ''Christmas Day is celebrated on December 25th.''

Transactions of the Association for Computational Linguistics, vol. 7, pp. 217–231, 2019. Action Editor: Jing Jiang.
Submission batch: 9/2018; Revision batch: 12/2018; Final submission: 2/2019; Published 4/2019.
© 2019 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.


Dialogue 1 (D1)

W: Tom, look at your shoes. How dirty they are! You must clean them.
M: Oh, mum, I just cleaned them yesterday.
W: They are dirty now. You must clean them again.
M: I do not want to clean them today. Even if I clean them today, they will get dirty again tomorrow.
W: All right, then.
M: Mum, give me something to eat, please.
W: You had your breakfast in the morning, Tom, and you had lunch at school.
M: I am hungry again.
W: Oh, hungry? But if I give you something to eat today, you will be hungry again tomorrow.

Q1 Why did the woman say that she wouldn't give him anything to eat?
A. Because his mother wants to correct his bad habit. (✓)
B. Because he had lunch at school.
C. Because his mother wants to leave him hungry.

Table 1: A sample DREAM problem that requires general world knowledge (✓: the correct answer option).

Understanding the social implications of an utterance as well as inferring a speaker's intentions is also regularly required for answering dialogue-based questions. The dialogue content in Table 1, for example, is itself insufficient for readers to recognize the intention of the female speaker (W) in the first question (Q1). However, world knowledge is rarely considered in state-of-the-art reading comprehension models (Tay et al., 2018; Wang et al., 2018b).

Moreover, dialogue-based questions can cover information imparted across multiple turns involving multiple speakers. In DREAM, approximately 85% of questions can only be answered by considering information from multiple sentences. For example, to answer Q1 in Table 3 later in the paper regarding the date of birth of the male speaker (M), the supporting sentences (in bold) include ''You know, tomorrow is Christmas Day'' from the female speaker and ''. . . I am more than excited about my birthday, which will come in two days'' from the male speaker. Compared with ''multiple-sentence questions'' in traditional reading comprehension data sets, DREAM further requires an understanding of the turn-based structure of dialogue—for example, for aligning utterances with their corresponding speakers.

As only 16% of correct answer options are text spans from the source documents, we primarily explore rule-based methods and state-of-the-art neural models designed for multiple-choice reading comprehension (Section 4). We find first that neural models designed for non-dialogue-based reading comprehension (Chen et al., 2016; Dhingra et al., 2017; Wang et al., 2018b) do not fare well: the highest achieved accuracy is 45.5%, only slightly better than the accuracy (44.6%) of a simple lexical baseline (Richardson et al., 2013). For the most part, these models fundamentally exploit only surface-level information from the source documents. Considering the above-mentioned challenges, however, we hypothesize that incorporating general world knowledge and aspects of the dialogue structure would allow a better understanding of the dialogues. Thus, we modify our baseline systems to include (1) general world knowledge in the form of ConceptNet relations (Speer et al., 2017) and a pre-trained language model (Radford et al., 2018), and (2) speaker information for each utterance. Experiments show the effectiveness of these factors on the lexical baselines as well as on neural and non-neural machine learning approaches: we obtain up to an 11.9% absolute gain in accuracy compared with the highest performance achieved by the state-of-the-art reading comprehension model (Wang et al., 2018b), which mainly relies on explicit surface-level information in the text (Section 5).

Finally, we see a significant gap between the best automated approach (59.5%) and the human ceiling performance (98.6%) on the DREAM data set. This provides yet additional evidence that dialogue-based reading comprehension is a very challenging task. We hope that it also inspires the research community to develop methods for the dialogue-based reading comprehension task.

2 Related Work

We divide reading comprehension data sets into three categories based on the types of answers: extractive, abstractive, and multiple choice.

2.1 Extractive and Abstractive Data Sets

In recent years, we have seen increased interest in large-scale cloze/span-based reading comprehension

| | SQuAD | NarrativeQA | CoQA | RACE | DREAM (this work) |
| --- | --- | --- | --- | --- | --- |
| Answer type | extractive | abstractive | abstractive | multiple-choice | multiple-choice |
| Source document type | written text | written text | written text | written text | dialogue |
| # of source documents | 536 | 1,572 | 8,399 | 27,933 | 6,444 |
| Average answer length | 3.2 | 4.7 | 2.7 | 5.3 | 5.3 |
| Extractive (%) | 100.0 | 73.6 | 66.8 | 13.0 | 16.3 |
| Abstractive (%) | 0.0 | 26.4 | 33.2 | 87.0 | 83.7 |

Table 2: Distribution of answer (or correct answer option) types in three kinds of reading comprehension data sets. Statistics of other data sets come from Reddy et al. (2018), Kočiský et al. (2018), and Lai et al. (2017).

data set construction (Hermann et al., 2015; Hill et al., 2016; Onishi et al., 2016; Rajpurkar et al., 2016; Bajgar et al., 2016; Nguyen et al., 2016; Trischler et al., 2017; Joshi et al., 2017; Choi et al., 2018). We regard them as extractive since candidate answers are usually short spans from the source documents. State-of-the-art neural models with attention mechanisms already achieve very high performance based on local lexical information. Recently, researchers have worked on the construction of spoken span-based data sets (Lee et al., 2018; Li et al., 2018) by applying text-to-speech technologies or recruiting human speakers based on formal written document-based data sets such as SQuAD (Rajpurkar et al., 2016). Some span-based conversation data sets are constructed from a relatively small number of dialogues from television shows (Chen and Choi, 2016; Ma et al., 2018).

Considering the limitations of extractive data sets, answers in abstractive data sets such as MS MARCO (Nguyen et al., 2016), SearchQA (Dunn et al., 2017), and NarrativeQA (Kočiský et al., 2018) are human-crowdsourced based on source documents or summaries. Concurrently, there is a growing interest in conversational reading comprehension such as CoQA (Reddy et al., 2018). Because annotators tend to copy spans as answers (Reddy et al., 2018), the majority of answers are still extractive in these data sets (Table 2). Compared to the data sets mentioned above, most of the correct answer options (83.7%) in DREAM are free-form text.

2.2 Multiple-Choice Data Sets

We primarily discuss the multiple-choice data sets, in which answer options are not restricted to extractive text spans in the given document. Instead, most of the correct answer options are abstractive (Table 2). Multiple-choice data sets involve extensive human involvement for problem generation during crowdsourcing (i.e., question, correct answer option, and distractors). Besides surface matching, a significant portion of questions require multiple-sentence reasoning and external knowledge (Richardson et al., 2013; Mostafazadeh et al., 2016; Khashabi et al., 2018; Ostermann et al., 2018).

Besides crowdsourcing, some data sets are collected from examinations designed by educational experts (Penas et al., 2014; Shibuki et al., 2014; Tseng et al., 2016; Clark et al., 2016; Lai et al., 2017; Mihaylov et al., 2018), which aim to test human examinees. There are various types of complicated questions such as math word problems, summarization, logical reasoning, and sentiment analysis. Because we can adopt more objective evaluation criteria such as accuracy, these questions are usually easy to grade. Besides, questions from examinations are generally clean and of high quality. Therefore, the human performance ceiling on this kind of data set is much higher (e.g., 94.5% on RACE [Lai et al., 2017] and 98.6% on DREAM in accuracy) than that of data sets built by crowdsourcing.

In comparison, we present the first multiple-choice dialogue-based data set from examinations that contains a large percentage of questions that require multiple-sentence inference. To the best of our knowledge, DREAM also contains the largest number of questions involving commonsense reasoning compared with other examination data sets.

3 Data

In this section, we describe how we construct DREAM (Section 3.1) and provide a detailed analysis of this data set (Section 3.2).

3.1 Collection Methodology

We collect dialogue-based comprehension problems from a variety of English language exams


Dialogue 2 (D2)

W: Hey, Mike. Where have you been? I didn't see you around these days?
M: I was hiding in my office. My boss gave me loads of work to do, and I tried to finish it before my birthday. Anyway, I am done now. Thank goodness! How is everything going with you?
W: I'm quite well. You know, tomorrow is Christmas Day. Do you have any plans?
M: Well, to tell you the truth, I am more than excited about my birthday, which will come in two days. I am going to visit my parents-in-law with my wife.
W: Wow, sounds great.
M: Definitely! This is my first time to spend my birthday with them.
W: Do they live far away from here?
M: A little bit. We planned to take the train, but considering the travel peak, my wife strongly suggested that we go to the airport right after we finish our work this afternoon. How about you? What's your holiday plan?
W: Well, our situations are just the opposite. My parents-in-law will come to my house, and they wish to stay at home and have a quiet Christmas Day. So I have to call my friends to cancel our party that will be held at my house.
M: You'll experience a quite different and lovely holiday. Enjoy your Christmas!
W: Thanks, the same to you!

Q1 What is the date of the man's birthday?
A. 25th, December.
B. 26th, December. (✓)
C. 27th, December.

Q2 How will the man go to his wife's parents' home?
A. By train.
B. By bus.
C. By plane. (✓)

Q3 What is the probable relationship between the two speakers?
A. Husband and wife.
B. Friends. (✓)
C. Parent-in-law and son-in-law.

Table 3: A complete sample DREAM problem (✓: the correct answer option).

(including practice exams) such as National College Entrance Examination, College English Test, and Public English Test,1 which are designed by human experts to assess either the listening or reading comprehension level of Chinese English

1We list all the Web sites used for data collection in the

released data set.


| Metric | Value |
| --- | --- |
| # of answer options per question | 3 |
| # of turns | 30,183 |
| Avg./Max. # of questions per dialogue | 1.6 / 10 |
| Avg./Max. # of speakers per dialogue | 2.0 / 7 |
| Avg./Max. # of turns per dialogue | 4.7 / 48 |
| Avg./Max. option length (in tokens) | 5.3 / 21 |
| Avg./Max. question length (in tokens) | 8.6 / 24 |
| Avg./Max. dialogue length (in tokens) | 85.9 / 1,290 |
| Vocabulary size | 13,037 |

Table 4: The overall statistics of DREAM. A turn is defined as an uninterrupted stream of speech from one speaker in a dialogue.

| | Train | Dev | Test | All |
| --- | --- | --- | --- | --- |
| # of dialogues | 3,869 | 1,288 | 1,287 | 6,444 |
| # of questions | 6,116 | 2,040 | 2,041 | 10,197 |

Table 5: The separation of the training, development, and test sets in DREAM.

learners in high schools and colleges (for individuals aged 12–22 years). All the problems in DREAM are freely accessible online for public use. Each problem consists of a dialogue and a series of multiple-choice questions. To ensure that every question is associated with exactly three answer options, we randomly drop wrong answer option(s) for questions with more than three options. We remove duplicate problems and randomly split the data at the problem level, with 60% for training, 20% for development, and 20% for testing.
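As a rough illustration of the two normalization steps just described (option truncation and the problem-level 60/20/20 split), the following sketch could be used. It is not the authors' released preprocessing code, and the `problem` dictionary fields are hypothetical.

```python
import random

def truncate_options(problem, rng, n_options=3):
    """Keep the correct answer option and randomly drop distractors
    until exactly `n_options` options remain."""
    distractors = [o for o in problem["options"] if o != problem["answer"]]
    rng.shuffle(distractors)
    kept = distractors[:n_options - 1] + [problem["answer"]]
    rng.shuffle(kept)
    return {**problem, "options": kept}

def split_problems(problems, rng, ratios=(0.6, 0.2, 0.2)):
    """Shuffle at the problem (dialogue) level and split into train/dev/test."""
    problems = list(problems)
    rng.shuffle(problems)
    n_train = int(ratios[0] * len(problems))
    n_dev = int(ratios[1] * len(problems))
    return (problems[:n_train],
            problems[n_train:n_train + n_dev],
            problems[n_train + n_dev:])

# Example usage with a hypothetical problem record:
# rng = random.Random(0)
# problem = {"options": ["A", "B", "C", "D"], "answer": "B"}
# print(truncate_options(problem, rng)["options"])
```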

3.2 Data Analysis

We summarize the statistics of DREAM in Table 4 and the data split in Table 5. Compared with existing data sets built from formal written texts, the vocabulary size is relatively small, since spoken English by its nature makes greater use of high-frequency words and needs a smaller vocabulary for efficient real-time communication (Nation, 2006).

We categorize questions into two main categories according to the types of knowledge required to answer them: matching and reasoning.

• Matching A question is entailed or paraphrased by exactly one sentence in a dialogue. The answer can be extracted from the same sentence. For example, we can easily verify the correctness of the question-answer pair (''What kind of room does the man want to rent?'', ''A two-bedroom apartment.'')
based on the sentence ''M: I'm interested in renting a two-bedroom apartment.'' This category is further divided into two categories, word matching and paraphrasing, in previous work (Chen et al., 2016; Trischler et al., 2017).

• Reasoning Questions that cannot be answered by the surface meaning of a single sentence belong to this category. We further define four subcategories as follows.

| Question Type | Dev | Test | Dev + Test |
| --- | --- | --- | --- |
| Matching | 13.0 | 10.3 | 11.7 |
| Reasoning | 87.0 | 89.7 | 88.3 |
| Summary | 8.4 | 15.9 | 12.1 |
| Logic | 74.5 | 70.4 | 72.5 |
| Arithmetic | 5.1 | 3.6 | 4.4 |
| Commonsense | 31.5 | 35.9 | 33.7 |
| Single sentence | 17.1 | 13.7 | 15.4 |
| Multiple sentences | 82.9 | 86.3 | 84.6 |

Table 6: Distribution (%) of question types.

– Summary Answering this kind of question requires the whole picture of a dialogue, such as the topic of a dialogue and the relation between the speakers (e.g., D2-Q3 in Table 3). Under this category, questions such as ''What are the two speakers talking about?'' and ''What are the speakers probably doing?'' are frequently asked.

– Logic We require logical reasoning to answer questions in this category. We usually need to identify logically implied relations among multiple sentences in a dialogue. To reduce ambiguity during annotation, we regard a question that can only be solved by considering the content of multiple sentences, and that does not belong to the summary subcategory (which involves all the sentences in a dialogue), as a logic question. Following this definition, both D2-Q1 and D2-Q2 in Table 3 belong to this category.

– Arithmetic Inferring the answer requires arithmetic knowledge (e.g., D2-Q1 in Table 3 requires 25 − 1 + 2 = 26).

– Commonsense To answer questions under this subcategory, besides the textual information in the dialogue, we also require external commonsense knowledge that cannot be obtained from the dialogue. For example, all questions in Table 3 fall under this category. D2-Q1 and D2-Q2 in Table 3 belong to both logic and commonsense since they require multiple sentences as well as commonsense knowledge for question answering. There exist multiple types of commonsense knowledge in DREAM, such as the well-known properties of a highly recognizable entity (e.g., D2-Q1 in Table 3), the prominent relationship between two speakers (e.g., D2-Q3 in Table 3), the knowledge of or shared by a particular culture (e.g., when a speaker says ''Cola? I think it tastes like medicine.'', she/he probably means ''I don't like cola.''), and the cause-effect relation between events (e.g., D1-Q1 in Table 1). We refer readers to LoBue and Yates (2011) for detailed definitions.

Table 6 shows the question type distribution labeled by two human annotators on 25% of the questions randomly sampled from the development and test sets. Besides the previously defined question categories, we also report the percentage of questions that require reasoning over multiple sentences (i.e., summary or logic questions) and the percentage of questions that require surface-level understanding or commonsense/math knowledge based on the content of a single sentence. As a question can belong to multiple reasoning subcategories, the sum of the percentages of the reasoning subcategories is not equal to the percentage of reasoning. The Cohen's kappa coefficient is 0.67 on the development set and 0.68 on the test set.

Dialogues in DREAM are generally clean and mostly error-free because they are carefully designed by educational experts. However, it is not guaranteed that each dialogue is written or proofread by a native speaker. Besides, dialogues tend to be more proper and less informal for exam purposes. To obtain a rough estimate of the quality of the dialogues in DREAM and of the differences between these dialogues and more casual ones in movies or television shows, we run a proofreading tool—Grammarly2—on all the dialogues from the annotated 25% instances of the development set and the same amount (20.7k tokens) of dialogues from Friends, a famous American television show
2https://app.grammarly.com.



| Metric | DREAM | Friends |
| --- | --- | --- |
| # of spelling errors | 11 | 146 |
| # of grammar errors | 23 | 16 |
| # of conciseness suggestions | 6 | 2 |
| # of vocabulary suggestions | 18 | 3 |
| General Performance | 98.0 | 95.0 |
| Readability Score | 93.7 | 95.3 |

Table 7: Comparison of the quality of dialogues from DREAM and Friends (a TV show).

whose transcripts are commonly used for dialogue understanding (Chen and Choi, 2016; Ma et al., 2018). As shown in Table 7, there exist fewer spelling mistakes, and the overall score is slightly higher than that of the dialogues in Friends. Based on the evaluated instances, articles and verb forms are the two most frequent grammar error categories (10 and 8, respectively, out of 23) in DREAM. Besides, the language tends to be less precise in DREAM, as indicated by the number of vocabulary suggestions. For example, experts tend to use expressions such as ''really hot,'' ''really beautiful,'' ''very bad,'' and ''very important'' rather than more appropriate yet more advanced adjectives that might hinder the reading comprehension of language learners with smaller vocabularies. According to the explanations provided by the tool, the readability scores for both data sets fall into the same category: ''Your text is very simple and easy to read, likely to be understood by an average 5th-grader (age 10).''

4 Approaches

We formally introduce the dialogue-based reading
comprehension task and notations in Section 4.1.
To investigate the effects of different kinds of
general world knowledge and dialogue structure,
we incorporate them into rule-based approaches
(Section 4.2) as well as non-neural (Section 4.3) and neural (Section 4.4) machine learning approaches. We describe preprocessing and training details in Section 4.5.

4.1 Problem Formulation and Notations

We start with a formal definition of the dialogue-based multiple-choice reading comprehension task. An n-turn dialogue D is defined as D = {s_1 : t_1, s_2 : t_2, . . . , s_n : t_n}, where s_i represents the speaker ID (e.g., ''M'' and ''W'') and t_i represents the text of the i-th turn. Let Q denote the text of the question, and O_{1..3} denote the text of the three answer options. The task is to choose the correct one from the answer options O_{1..3} associated with question Q given dialogue D. In this paper, we regard this task as a three-class classification problem, each class corresponding to an answer option.
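To make the formulation concrete, a minimal data structure for one DREAM question could look as follows. This is a sketch in Python; the class and field names are ours, not part of the released data format.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class DreamQuestion:
    """One question attached to a dialogue D = {s1: t1, ..., sn: tn}."""
    dialogue: List[Tuple[str, str]]  # (speaker ID, turn text) pairs, e.g. ("M", "...")
    question: str                    # Q
    options: List[str]               # O_1..3, exactly three answer options
    label: int                       # index (0-2) of the correct option

def turns_of(dialogue, speaker="*"):
    """D_s: the turns spoken by `speaker`; '*' selects all speakers."""
    return [turn for spk, turn in dialogue if speaker == "*" or spk == speaker]
```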

For convenience, we define the following notations, which will be referred to in the rest of this paper. Let D_s denote the turns spoken by speaker s in D. Formally, D_s = {s_{i_1} : t_{i_1}, s_{i_2} : t_{i_2}, . . . , s_{i_m} : t_{i_m}}, where {i_1, i_2, . . . , i_m} = {i | s_i = s} and i_1 < i_2 < . . . < i_m. In particular, s = ∗ denotes all the speakers. W^{D_s} and W^{O_i} denote the ordered sets of the running words (excluding punctuation marks) in D_s and O_i, respectively. Questions designed for dialogue-based reading comprehension often focus on a particular speaker. If there is exactly one speaker mentioned in a question, we use s_Q to denote this target speaker. Otherwise, s_Q = ∗. For example, given the dialogue in Table 3, s_Q = ''M'' for Questions 1 and 2, and s_Q = ∗ for Question 3.

4.2 Rule-Based Approaches

We first attempt to incorporate dialogue structure information into sliding window (SW), a rule-based approach developed by Richardson et al. (2013). This approach matches a bag of words constructed from a question Q and one of its answer options O_i with a given document, and calculates a TF-IDF-style matching score for each answer option. Let \hat{D}_s, \hat{Q}, and \hat{O}_i be the unordered sets of distinct words (excluding punctuation marks) in D_s, Q, and O_i, respectively. Instead of only regarding dialogue D as a non-conversational text snippet, we also pay special attention to the context that is relevant to the target speaker mentioned in the question. Therefore, given a target speaker s_Q, we propose to compute a speaker-focused sliding window score for each answer option O_i by matching a bag of words constructed from Q and O_i with D_{s_Q} (i.e., the turns spoken by s_Q). Given speaker s, we formally define the sliding window score sw of O_i as

sw_i^s = \max_j \sum_{k=1}^{|T_i|} \begin{cases} ics(W_{j+k}^{D_s}) & \text{if } W_{j+k}^{D_s} \in T_i \\ 0 & \text{otherwise} \end{cases} \quad (1)

where ics(w) = \log\big(1 + 1/\sum_i \mathbb{1}(W_i^{D_s} = w)\big), T_i = \hat{O}_i \cup \hat{Q}, and W_i^{D_s} denotes the i-th word in W^{D_s}.

Based on these definitions, we can regard sw_i^* as the general score defined in the original sliding window approach, and sw_i^{s_Q} represents the speaker-focused sliding window score considering the target speaker s_Q.

Because the sliding window score ignores long-range dependencies, Richardson et al. (2013) introduce a distance-based variation (DSW), in which a word-distance-based score is subtracted from the sliding window score to arrive at the final score. Similarly, we calculate the speaker-focused distance-based score given a (Q, O_i) pair and s_Q, by counting the distance between the occurrence of a word in Q and a word in O_i in D_{s_Q}. More formally, given speaker s and a set of stop words3 U, the distance-based score d of O_i is defined as

d_i^s = \begin{cases} 1 & \text{if } |I_Q^s| = 0 \text{ or } |I_{O_i}^s| = 0 \\ \frac{1}{|W^{D_s}|-1}\,\delta_i^s & \text{otherwise} \end{cases} \quad (2)

where I_{O_i}^s = (\hat{O}_i \cap \hat{D}_s) - \hat{Q} - U, I_Q^s = (\hat{Q} \cap \hat{D}_s) - U, and \delta_i^s is the minimum number of words between an occurrence of a question word and an answer option word in W^{D_s}, plus one. The formal definition of \delta_i^s is as follows:

\delta_i^s = \min_{W_j^{D_s} \in I_Q^s,\; W_k^{D_s} \in I_{O_i}^s} |j - k| + 1 \quad (3)

Based on these definitions, we can regard d_i^* as the distance-based score defined in the original sliding window approach, and d_i^{s_Q} represents the speaker-focused distance-based score considering speaker s_Q.
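A minimal sketch of the speaker-focused sliding window score in Eq. (1) and the distance score in Eqs. (2)-(3), assuming simple whitespace tokenization; this is our illustration rather than the authors' implementation.

```python
import math
from collections import Counter

def sliding_window_score(dialogue_words, question_words, option_words):
    """sw_i^s (Eq. 1): slide a window of size |T_i| over W^{D_s} and sum the
    inverse-count weights ics(w) of window words that fall in T_i = O_i ∪ Q."""
    counts = Counter(dialogue_words)
    target = set(option_words) | set(question_words)   # T_i
    ics = lambda w: math.log(1.0 + 1.0 / counts[w])     # only applied to words of D_s
    size = len(target)
    best = 0.0
    for j in range(max(1, len(dialogue_words) - size + 1)):
        window = dialogue_words[j:j + size]
        best = max(best, sum(ics(w) for w in window if w in target))
    return best

def distance_score(dialogue_words, question_words, option_words, stopwords=frozenset()):
    """d_i^s (Eqs. 2-3): minimum token distance (plus one) between a question word
    and an option word in W^{D_s}, normalized by |W^{D_s}| - 1; 1 if either set is empty."""
    d_set = set(dialogue_words)
    i_q = (set(question_words) & d_set) - stopwords                      # I_Q^s
    i_o = (set(option_words) & d_set) - set(question_words) - stopwords  # I_{O_i}^s
    if not i_q or not i_o or len(dialogue_words) < 2:
        return 1.0
    pos_q = [j for j, w in enumerate(dialogue_words) if w in i_q]
    pos_o = [k for k, w in enumerate(dialogue_words) if w in i_o]
    delta = min(abs(j - k) + 1 for j in pos_q for k in pos_o)            # delta_i^s
    return delta / (len(dialogue_words) - 1)
```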
In addition, the final distance-based sliding window score of O_i (Richardson et al., 2013) can be formulated as

sw_i^* - d_i^* \quad (4)

Expression (4) only focuses on the general (or speaker-independent) information (i.e., sw_i^* and d_i^*); we can capture general and speaker-focused information (i.e., sw_i^{s_Q} and d_i^{s_Q}) simultaneously by averaging them:

\frac{sw_i^{s_Q} + sw_i^*}{2} - \frac{d_i^{s_Q} + d_i^*}{2} \quad (5)

Since a large percentage of questions cannot be solved by word-level matching, we also attempt to incorporate general world knowledge into our rule-based method. We calculate cs_i^s, the maximum cosine similarity between O_i and consecutive words of the same length in W^{D_s}, as

cs_i^s = \max_j \cos\big(\overline{W^{O_i}}, \overline{W_{j \ldots j+|W^{O_i}|-1}^{D_s}}\big) \quad (6)

where \overline{x} is obtained by averaging the embeddings of the constituent words in x. Here we use ConceptNet embeddings (Speer et al., 2017) because they leverage a knowledge graph that focuses on general world knowledge. Following Expression (5), we capture both general and speaker-focused semantic information within a dialogue as follows:

\frac{cs_i^{s_Q} + cs_i^*}{2} \quad (7)

To make the final answer option selection, our rule-based method combines Expressions (5) and (7):

\arg\max_i \left( \frac{sw_i^{s_Q} + sw_i^*}{2} - \frac{d_i^{s_Q} + d_i^*}{2} + \frac{cs_i^{s_Q} + cs_i^*}{2} \right) \quad (8)

4.3 Feature-Based Classifier

To explore what features are effective for dialogue understanding, we first consider a gradient boosting decision tree (GBDT) classifier. Besides the conventional bag-of-words features, we primarily focus on features related to general world knowledge and dialogue structure.

• Bag of words of each answer option.

• Features inspired by rule-based approaches: We adopt the features introduced in Section 4.2, including the speaker-independent scores (i.e., sw_i^* and d_i^*) and the speaker-focused scores (i.e., sw_i^{s_Q} and d_i^{s_Q}).

• Matching position: p_{1..3}^{s_Q} and p_{1..3}^*, where p_i^s is the last position (in percentage) of a word in D_s that is also mentioned in O_i, and 0 if none of the words in D_s is mentioned in O_i. We consider matching position because of our observation of the existence of concessions and negotiations in dialogues (Amgoud et al., 2007). We assume that the facts or opinions expressed near the end of a dialogue tend to be more critical for answering a question.

• Pointwise mutual information (PMI): pmi_{max,1..3}^{s_Q}, pmi_{min,1..3}^{s_Q}, pmi_{avg,1..3}^{s_Q}, pmi_{max,1..3}^*, pmi_{min,1..3}^*, and pmi_{avg,1..3}^*, where pmi_{f,i}^s (f ∈ {max, min, avg}, applied over k) is defined as

pmi_{f,i}^s = \frac{1}{|W^{O_i}|} \sum_j f_k \left[ \log \frac{C_2(W_j^{O_i}, W_k^{D_s})}{C_1(W_j^{O_i})\, C_1(W_k^{D_s})} \right] \quad (9)

C_1(w) denotes the word frequency of w in external corpora (we use Reddit posts [Tan and Lee, 2015]), and C_2(w_1, w_2) represents the co-occurrence frequency of words w_1 and w_2 within a distance < K in external corpora. We use PMI to evaluate the relatedness between the content of an answer option and the target-speaker-focused context based on co-occurrences of words in external corpora, inspired by previous studies on narrative event chains (Chambers and Jurafsky, 2008).

• ConceptNet relations (CR): cr_{1..3, 1..|R|}, where R = {r_1, r_2, . . .} is the set of ConceptNet relation types (e.g., ''CapableOf'' and ''PartOf'').
cr_{i,j} is the number of relation triples (w_1, r_j, w_2) that appear in ConceptNet (Speer et al., 2017), where w_1 represents a word in answer option O_i, w_2 represents a word in D, and the relation type r_j ∈ R. Similar to the motivation for using PMI, we use CR to capture the association between an answer option and the source dialogue based on raw co-occurrence counts in the commonsense knowledge base.

• ConceptNet embeddings (CE): Besides the lexical similarity based on string matching, we also calculate cs_{1..3}^* and cs_{1..3}^{s_Q}, where cs_i^* and cs_i^{s_Q} represent the maximum cosine similarity between O_i and consecutive words of the same length in D and D_{s_Q}, respectively (Expression (6) in Section 4.2). We use ConceptNet embeddings (Speer et al., 2017) because they leverage the general world knowledge graph.

4.4 End-To-End Neural Network

Our end-to-end neural model is based on a generative pre-trained language model (LM). We follow the framework of the finetuned transformer LM (FTLM) (Radford et al., 2018) and make modifications for dialogue-based reading comprehension.

The training procedure of FTLM consists of two stages. The first stage is to learn a high-capacity language model on a large-scale unsupervised corpus of tokens U = {u_1, . . . , u_n} by maximizing the following likelihood:

L_{LM}(U) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta) \quad (10)

where k is the context window size, and the conditional probability P is modeled by a multi-layer transformer decoder (Liu et al., 2018) with parameters Θ. In the second stage, the model is adapted to a labeled data set C, where each instance consists of a sequence of input tokens x_1, . . . , x_m with a label y, by maximizing

L(C) = \sum_{x,y} \log P(y \mid x_1, \ldots, x_m) + \lambda L_{LM}(C) \quad (11)

where P(y | x_1, . . . , x_m) is obtained by a linear + softmax layer over the final transformer block's activation, and λ is the weight for the language model. For multiple-choice reading comprehension, the input tokens x_1, . . . , x_m come from the concatenation of a start token, the dialogue, the question, a delimiter token, an answer option, and an end token; y indicates whether the answer option is correct. We refer readers to Radford et al. (2018) for more details.

Because the original FTLM framework already leverages rich linguistic information from a large unlabeled corpus, which can be regarded as a type of tacit general world knowledge, we investigate whether additional dialogue structure can further improve this strong baseline. We propose a speaker embedding to better capture dialogue structure. Specifically, in the original framework, given an input context (u_{-k}, . . . , u_{-1}) of the transformer, the encoding of u_{-i} is we(u_{-i}) + pe(i), where we(·) is the word embedding and pe(·) is the position embedding. When adapting Θ to DREAM, we change the encoding to we(u_{-i}) + pe(i) + se(u_{-i}, s_Q), where the speaker embedding se(u_{-i}, s_Q) is (a) 0 if the token u_{-i} is not in the dialogue (i.e., it is either a start/end/delimiter token or a token in the question/option); (b) e_target if the token is spoken by s_Q; (c) e_rest if the token is in the dialogue but not spoken by s_Q. e_target and e_rest are trainable and initialized randomly. We show the overall framework in Figure 1.

Figure 1: Overall neural network framework (Section 4.4).
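A minimal PyTorch-style sketch of the modified input encoding we(u) + pe(i) + se(u, s_Q) described above; the module and variable names are ours, and this is not the released implementation.

```python
import torch
import torch.nn as nn

class SpeakerAwareInputEmbedding(nn.Module):
    """Word + position embeddings plus a speaker embedding.
    speaker_ids: 0 = token not in the dialogue (start/end/delimiter, question, option),
                 1 = token spoken by the target speaker s_Q (e_target),
                 2 = token in the dialogue but spoken by another speaker (e_rest)."""
    def __init__(self, vocab_size, max_len, dim):
        super().__init__()
        self.word = nn.Embedding(vocab_size, dim)
        self.pos = nn.Embedding(max_len, dim)
        # padding_idx=0 keeps the "not in dialogue" row fixed at zero, while
        # e_target and e_rest (rows 1 and 2) are trainable and randomly initialized.
        self.speaker = nn.Embedding(3, dim, padding_idx=0)

    def forward(self, token_ids, speaker_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.word(token_ids)
                + self.pos(positions).unsqueeze(0)
                + self.speaker(speaker_ids))

# Example: a batch of 2 sequences of length 5.
# emb = SpeakerAwareInputEmbedding(vocab_size=1000, max_len=512, dim=64)
# out = emb(torch.randint(0, 1000, (2, 5)), torch.randint(0, 3, (2, 5)))
```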
4.5 Preprocessing and Training Details

For all the models, we conduct coreference resolution to determine speaker mentions of s_Q based on simple heuristics. In particular, we map the three most common speaker abbreviations (i.e., ''M''; ''W'' and ''F'') that appear in dialogues to their eight most common corresponding mentions (i.e., ''man,'' ''boy,'' ''he,'' and ''his''; ''woman,'' ''girl,'' ''she,'' and ''her'') in questions. We keep speaker abbreviations unchanged, since neither replacing them with their corresponding full forms nor removing them contributes to the performance based on our experiments.
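A sketch of the heuristic speaker-mention mapping described in the previous paragraph; the abbreviation and mention lists are those given above, while the function itself is our hypothetical illustration.

```python
SPEAKER_MENTIONS = {
    "m": {"man", "boy", "he", "his"},
    "w": {"woman", "girl", "she", "her"},
    "f": {"woman", "girl", "she", "her"},
}

def target_speaker(question_tokens, dialogue_speakers):
    """Return s_Q if exactly one dialogue speaker is mentioned in the question,
    otherwise '*' (all speakers)."""
    tokens = {t.lower() for t in question_tokens}
    mentioned = [s for s in dialogue_speakers
                 if tokens & SPEAKER_MENTIONS.get(s.lower(), {s.lower()})]
    return mentioned[0] if len(mentioned) == 1 else "*"

# Example: target_speaker(["What", "will", "the", "man", "do"], ["M", "W"]) -> "M"
```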
For the neural model mentioned in Section 4.4, most of our parameter settings follow Radford et al. (2018). We adopt the same preprocessing procedure and use their publicly released language model, which is pre-trained on the BooksCorpus data set (Zhu et al., 2015). We set the batch size to 8, the language model weight λ to 2, and the maximum number of training epochs to 10.

For the other models, we use the following preprocessing steps. We tokenize and lowercase the corpus, convert number words to numeric digits, normalize time expressions to 24-hour numeric form, and address negation by removing interrogative sentences that receive ''no'' as the reply. We use the gradient boosting classifier implemented in the scikit-learn toolkit (Pedregosa et al., 2011). We set the number of boosting iterations to 600 and keep the rest of the hyperparameters unchanged. The distance upper bound K for PMI is set to 10.

We perform several runs of the machine learning models (Section 4.3 and Section 4.4) with randomness introduced by different random seeds and/or GPU non-determinism and select the model or models (for ensemble) that perform best on the development set.

5 Experiment

5.1 Baselines

We implement several baselines, including rule-based methods and state-of-the-art neural models.

• Word Matching This strong baseline (Yih et al., 2013) selects the answer option that has the highest count of overlapping words with the given dialogue.

• Sliding Window We implement the sliding window approach (i.e., arg max_i sw_i^*) and its distance-based variation DSW (i.e., arg max_i sw_i^* − d_i^*) (Richardson et al., 2013) introduced in Section 4.2.

• Enhanced Distance-Based Sliding Window (DSW++) We also use general world knowledge and speaker-focused information to improve the original sliding window baseline, as formulated in Expression (8) (Section 4.2).

• Stanford Attentive Reader This neural baseline compares each candidate answer (i.e., entity) representation to the question-aware document representation built with an attention mechanism (Hermann et al., 2015; Chen et al., 2016). Lai et al. (2017) add a bilinear operation to compare document and answer option representations to answer multiple-choice questions.

• Gated-Attention Reader This baseline models multiplicative question-specific document representations based on a gated-attention mechanism (Dhingra et al., 2017), which are then compared to each answer option (Lai et al., 2017).

• Co-Matching This state-of-the-art multiple-choice reading comprehension model explicitly treats question and answer option as two sequences and jointly matches them against a given document (Wang et al., 2018b).

• Finetuned Transformer LM This is a general task-agnostic model introduced in Section 4.4, which achieves the best reported performance on several tasks requiring multi-sentence reasoning (Radford et al., 2018).

| Method | Dev | Test |
| --- | --- | --- |
| Random | 32.8 | 33.4 |
| Word Matching (WM) (Yih et al., 2013) | 41.7 | 42.0 |
| Sliding Window (SW) (Richardson et al., 2013) | 42.6 | 42.5 |
| Distance-Based Sliding Window (DSW) (Richardson et al., 2013) | 44.4 | 44.6 |
| Stanford Attentive Reader (SAR) (Chen et al., 2016) | 40.2 | 39.8 |
| Gated-Attention Reader (GAR) (Dhingra et al., 2017) | 40.5 | 41.3 |
| Co-Matching (CO) (Wang et al., 2018b) | 45.6 | 45.5 |
| Finetuned Transformer LM (FTLM) (Radford et al., 2018) | 55.9 | 55.5 |
| Our Approaches: | | |
| DSW++ (DSW w/ Dialogue Structure and ConceptNet Embedding) | 51.4 | 50.1 |
| GBDT++ (GBDT w/ Features of Dialogue Structure and General World Knowledge) | 53.3 | 52.8 |
| FTLM++ (FTLM w/ Speaker Embedding) | 57.6 | 57.4 |
| Ensemble of 3 FTLM++ | 58.1 | 58.2 |
| Ensemble of 1 GBDT++ and 3 FTLM++ | 59.6 | 59.5 |
| Human Performance | 93.9* | 95.5* |
| Ceiling Performance | 98.7* | 98.6* |

Table 8: Performance in accuracy (%) on the DREAM data set. Performance marked by * is reported based on 25% annotated questions from the development and test sets.

We do not investigate other ways of leveraging pre-trained deep models, such as adding ELMo representations (Peters et al., 2018) as additional features to a neural model, since recent studies show that directly fine-tuning a pre-trained language model such as FTLM is significantly superior on multiple-choice reading comprehension tasks (Radford et al., 2018; Chen et al., 2019). We do not apply more recent extractive models such as AOA (Cui et al., 2017) and QANet (Yu et al., 2018) since they aim at precisely locating a span in a document. When adapted to solve questions with abstractive answer options, extractive models generally tend to perform less well (Chen et al., 2016; Dhingra et al., 2017; Lai et al., 2017).

5.2 Results and Analysis

We report the performance of the baselines introduced in Section 5.1 and our proposed approaches in Table 8. We report the averaged accuracy of two annotators as the human performance. The proportion of valid questions (i.e., unambiguous questions with a unique correct answer option provided) that are manually checked by annotators on the annotated test and development sets is regarded as the human ceiling performance.

Surface matching is insufficient. Experimental results show that neural models that primarily exploit surface-level information (i.e., SAR, GAR, and CO) attain a performance level close to that of simple rule-based approaches (i.e., WM, SW, and DSW). The highest accuracy achieved by CO is 45.5%, a similar level of performance to the rule-based method DSW (44.6%).

It is helpful to incorporate general world knowledge and dialogue structure. We see a significant gain of 5.5% in accuracy when enhancing DSW using general world knowledge from ConceptNet embeddings and considering speaker-focused information (Section 4.2). FTLM, which leverages rich external linguistic knowledge from thousands of books, already achieves a much higher accuracy (55.5%) compared with previous state-of-the-art machine comprehension models, indicating the effectiveness of general world knowledge.
Experimental results show that our best single model, FTLM++, significantly outperforms FTLM (p-value = 0.03), illustrating the usefulness of additional dialogue structure. Compared with the state-of-the-art neural reader Co-Matching, which primarily explores surface-level information (45.5%), the tacit general world knowledge (in the pre-trained language model) and dialogue structure in FTLM++ lead to an absolute gain of 11.9% in accuracy.

Ensembling different types of methods can bring further improvements. We use the majority vote strategy to obtain the ensemble model performance. Although GBDT++ (52.8%) itself does not outperform FTLM++, GBDT++ can serve as a supplement to FTLM++ because they leverage different types of general world knowledge and model architectures. We achieve the highest accuracy (59.5%) by ensembling one GBDT++ and three FTLM++.

| Method | Accuracy | Δ |
| --- | --- | --- |
| DSW++ | 51.4 | − |
| − dialogue structure | 50.0 | −1.4 |
| − CE | 46.7 | −4.7 |
| GBDT++ | 53.3 | − |
| − bag of words | 51.6 | −1.7 |
| − rule-based features | 51.2 | −2.1 |
| − matching position | 53.0 | −0.3 |
| − dialogue structure | 51.9 | −1.4 |
| − PMI | 51.4 | −1.9 |
| − CR | 52.7 | −0.6 |
| − CE | 52.7 | −0.6 |
| − PMI, CR, CE | 47.1 | −6.2 |
| FTLM++ | 57.6 | − |
| − speaker embedding | 55.9 | −1.7 |
| − LM pre-training | 36.2 | −21.4 |

Table 9: Ablation tests on the development set (%). Minus (−) indicates percentage decrease.

5.3 Ablation Tests

We conduct ablation tests to evaluate the individual components of our proposed approaches (Table 9). In Table 10, we summarize the types of dialogue structure and general world knowledge involved in our approaches.

| | Dialogue Structure | General World Knowledge |
| --- | --- | --- |
| DSW++ | speaker-focused scores | CE |
| GBDT++ | speaker-focused features | CE, CR, and PMI |
| FTLM++ | speaker embedding | pre-trained LM |

Table 10: Types of dialogue structure and general world knowledge investigated in our approaches.

Dialogue Structure Specifically, we observe a 1.4% drop in accuracy if we set the target speaker s_Q to ∗ for all questions when we apply DSW++. We observe a similar performance drop when we remove speaker-focused features from GBDT++. In addition, removing speaker embeddings from FTLM++ leads to a 1.7% drop in accuracy (in this case, the model becomes the original FTLM). These results consistently indicate the usefulness of dialogue structure for dialogue understanding.

General World Knowledge We also investigate the effects of general world knowledge. The accuracy of DSW++ drops by 4.7% if we remove ConceptNet embeddings (CE) by deleting the last term of Expression (8) in Section 4.2. Additionally, the accuracy of GBDT++ drops by 6.2% if we remove all the general world knowledge features (i.e., ConceptNet embeddings/relations and PMI), leading to prediction failures on questions such as ''What do we learn about the man?'' whose correct answer option ''He is health-conscious.'' is not explicitly mentioned in the source dialogue ''M: We had better start to eat onions frequently, Linda. W: But you hate onions, don't you? M: Until I learned from a report from today's paper that they protect people from flu and colds.
After all, compared with health, taste is not so important.’’ Moreover, if we train FTLM++ with randomly initialized transformer weights instead of weights pre-trained on the external corpus, the accuracy drops dramatically to 36.2%, which is only slightly better than a random baseline. 5.4 Error Analysis Impact of Longer Turns The number of dial- ogue turns has a significant impact on the performance of FTLM++. As shown in Figure 2, its performance reaches the peak when the number of turns ranges from 0 to 10, while it suffers severe performance drops when the given dialogue contains more turns. Both DSW++ (56.8%) and GBDT++ (57.4%) outperform FTLM++ (55.7%) when the number of turns ranges from 10 to 48. To deal with lengthy context, it may be helpful to first identify relevant sentences based on a question and its associated answer options rather than using the entire dialogue context as input. Impact of Confusing Distractors For 54.5% of questions on the development set, the fuzzy match- ing score (Sikes, 2007) of at least one distractor answer option against the dialogue is higher than the score of the correct answer option. For ques- tions that all models (i.e., DSW++, GBDT++, and FTLM++) fail to answer correctly, 73.0% of them contain at least one such confusing distractor answer option. The causes of this kind of errors can be roughly divided into two categories. First, 227 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 2 6 4 1 9 2 3 1 1 1 / / t l a c _ a _ 0 0 2 6 4 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Question Type FTLM++ GBDT++ Matching Reasoning Summary Logic Arithmetic Commonsense Single sentence Multiple sentences 57.0 56.8 73.6 55.0 30.2 53.4 56.5 56.9 68.1 49.4 47.1 49.7 24.5 41.7 63.3 49.5 Table 11: Accuracy (%) by question type on the annotated development subset. and commonsense) which require aggregation of information from multiple sentences, the under- standing of the entire dialogue, or the utilization of world knowledge. Therefore, it might be useful to leverage the strengths of individual models to solve different types of questions. 6 Conclusion and Future Work We present DREAM, the first multiple-choice dialogue-based reading comprehension data set from English language examinations. Besides the multi-turn multi-party dialogue context, 85% of questions require multiple-sentence reasoning, and 34% of questions also require commonsense knowledge, making this task very challenging. We apply several popular reading comprehension models and find that surface-level information is insufficient. We incorporate general world knowl- edge and dialogue structure into rule-based and machine learning methods and show the effec- tiveness of these factors, suggesting a promising direction for dialogue-based reading comprehen- sion. For future work, we are interested in problem generation for dialogues and investigating whether it will lead to more gains to pre-train a deep lan- guage model such as FTLM over large-scale dialogues from movies and TV shows instead of the BookCorpus data set (Zhu et al., 2015) used by previous work (Radford et al., 2018). Acknowledgments We would like to thank the editors and anony- mous reviewers for their helpful feedback. We also thank Hai Wang from Toyota Technological Institute at Chicago for useful discussions and valuable comments. Figure 2: Performance comparison of different number of turns on the test set. 
the distractor is wrongly associated with the target speaker/s mentioned in the question (e.g., answer option A and C in D2-Q3 in Table 3). Second, although the claim in the distractor is supported by the dialogue, it is irrelevant to the question (e.g., D1-Q1-B in Table 1). A promising direction to solve this problem could be the construction of speaker-focused event chains (Chambers and Jurafsky, 2008) and advanced dialogue-specific coreference resolution systems for more reliable evidence context detection in a dialogue. Impact of Question Types We further report the performance of the best single model FTLM++ and the GBDT++ baseline on the categories defined in Section 3.2 (Table 11). Not surprisingly, both models perform worse than random guessing on math problems. While most of the math problems can be solved by one single linear equation, it is still difficult to apply recent neural math word problem solvers (Huang et al., 2018; Wang et al., 2018a) due to informal dialogue- based problem descriptions and the requirement of commonsense inference. For example, given the dialogue: ‘‘W: The plane arrives at 10:50. It is already 10:40 now. Be quick! M: Relax. Your watch must be fast. There are still twenty minutes left.’’ We need prior knowledge to infer that the watch of the man is showing incorrect time 10:40. Instead, 10:50 should be used as the reference time with the time interval ‘‘twenty minutes left’’ together to answer the question ‘‘What time is it now?’’ Results show that GBDT++ is superior to the fine-tuned language model on the questions under the category matching (68.1% vs. 57.0%) and the latter model is more capable of answering implicit questions (e.g., under the category summary, logic, 228 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 2 6 4 1 9 2 3 1 1 1 / / t l a c _ a _ 0 0 2 6 4 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 References Leila Amgoud, Yannis Dimopoulos, and Pavlos Moraitis. 2007. A unified and general frame- work for argumentation-based negotiation. the AAMAS, pages 1–8. In Proceedings of New York, NY, USA. Ondrej Bajgar, Rudolf Kadlec, Jan Kleindienst. 2016. Embracing data abundance: Booktest data set for reading comprehension. CoRR, cs.CL/1610.00956v1. and Steven Bird and Edward Loper. 2004. NLTK: the natural language toolkit. In Proceedings of the ACL on Interactive poster and demonstration sessions, pages 31–34. Barcelona, Spain. Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, and Guoping Hu. 2017. Attention- over-attention neural networks for reading comprehension. In Proceedings of the ACL, pages 593–602. Vancouver, Canada. Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang, William Cohen, and Ruslan Salakhutdinov. 2017. Gated-attention readers for text com- the ACL, In Proceedings of prehension. pages 1832–1846. Vancouver, Canada. Matthew Dunn, Levent Sagun, Mike Higgins, V Ugur Guney, Volkan Cirik, and Kyunghyun Cho. 2017. SearchQA: A new Q&A data set augmented with context from a search engine. CoRR, cs.CL/1704.05179v3. Nathanael Chambers and Dan Jurafsky. 2008. Unsupervised learning of narrative event chains. In Proceedings of the ACL, pages 789–797. Columbus, OH. Maxwell Forbes and Yejin Choi. 2017. Verb physics: Relative physical knowledge of actions the ACL, and objects. pages 266–276. Vancouver, Canada. In Proceedings of Danqi Chen, Jason Bolton, and Christopher D. Manning. 2016. 
A thorough examination of the CNN/Daily Mail reading comprehension task. In Proceedings of the ACL, pages 2358–2367. Berlin, Germany. Yu-Hsin Chen and Jinho D. Choi. 2016. Character identification on multiparty conversation: Iden- tifying mentions of characters in TV shows. In Proceedings of the SIGDial, pages 90–100. Los Angeles, CA. Zhipeng Chen, Yiming Cui, Wentao Ma, Shijin Wang, and Guoping Hu. 2019. Convolutional spatial attention model for reading compre- hension with multiple-choice questions. In Proceedings of the AAAI. Honolulu, HI. Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proceed- ings of the NIPS, pages 1693–1701. Montreal, Canada. Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2016. The goldilocks principle: Reading children’s books with explicit memory representations. In Proceedings of the ICLR. Caribe Hilton, Puerto Rico. Danqing Huang, Jing Liu, Chin-Yew Lin, and Jian Yin. 2018. Neural math word problem solver with reinforcement learning. In Proceedings of the COLING, pages 213–223. Santa Fe, NM. Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question an- swering in context. pages 2174–2184. Brussels, Belgium. Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge data set for reading comprehension. CoRR, cs.CL/1705.03551v2. Peter Clark, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter D. Turney, and Daniel Khashabi. 2016. Combining re- trieval, statistics, and inference to answer ele- mentary science questions. In Proceedings of the AAAI, pages 2580–2586. Phoenix, AZ. Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the NAACL-HLT, pages 252–262. New Orleans, LA. 229 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 2 6 4 1 9 2 3 1 1 1 / / t l a c _ a _ 0 0 2 6 4 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Tom´aˇs Koˇcisk`y, Schwarz, Phil Jonathan Blunsom, Chris Dyer, Karl Moritz Hermann, G´aabor Melis, and Edward Grefenstette. 2018. The narrativeqa reading comprehension challenge. Transactions of the Association of Computational Linguistics, 6:317–328. Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large- scale reading comprehension data set from examinations. In Proceedings of the EMNLP, pages 785–794. Copenhagen, Denmark. Chia-Hsuan Lee, Shang-Ming Wang, Huan-Cheng Chang, and Hung-Yi Lee. 2018. ODSQA: Open-domain spoken question answering data set. CoRR, cs.CL/1808.02280v1. Chia-Hsuan Li, Szu-Lin Wu, Chi-Liang Liu, and Hung-yi Lee. 2018. Spoken SQuAD: A study of mitigating the impact of speech recognition errors on listening comprehension. CoRR, cs.CL/1804.00320v1. Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by summarizing long sequences. In Proceedings of the ICLR. Vancouver, Canada. Peter LoBue and Alexander Yates. 2011. Types of common-sense knowledge needed for rec- ognizing textual entailment. In Proceedings of the ACL, pages 329–334. Portland, OR. Kaixin Ma, Tomasz Jurczyk, and Jinho D. Choi. 
2018. Challenging reading comprehension on daily conversation: Passage completion on multi- party dialog. In Proceedings of the NAACL- HLT, pages 2039–2048. New Orleans, LA. Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new data set for open book question answering. In Proceedings of the EMNLP. Brussels, Belgium. Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and evaluvation framework for deeper understanding of commonsense stories. In Proceedings of the NAACL-HLT, pages 839–849. San Diego, CA. 230 I Nation. 2006. How large a vocabulary is needed for reading and listening? Canadian Modern Language Review, 63:59–82. Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human gen- erated machine reading comprehension data set. CoRR, cs.CL/1611.09268v2. Takeshi Onishi, Hai Wang, Mohit Bansal, Kevin Gimpel, and David McAllester. 2016. Who did What: A large-scale person-centered cloze the EMNLP, In Proceedings of data set. pages 2230–2235. Austin, TX. Simon Ostermann, Michael Roth, Ashutosh Modi, Stefan Thater, and Manfred Pinkal. 2018. SemEval-2018 Task 11: Machine compre- hension using commonsense knowledge. In Proceedings of the SemEval, pages 747–757. New Orleans, LA. Fabian Pedregosa, Ga¨el Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Eduard Duchesnay. 2011. Scikit- learn: Machine learning in python. Journal of Machine Learning Research, 12:2825–2830. Anselmo Penas, Yusuke Miyao, Alvaro Rodrigo, Eduard H Hovy, and Noriko Kando. 2014. Overview of CLEF QA Entrance Exams the CLEF, In Proceedings of Task 2014. pages 1194–1200. Sheffield, UK. Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contex- In Proceed- tualized word representations. ings of the NAACL-HLT, pages 2227–2237, New Orleans, LA. Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Preprint, available at https://openai.com/blog/language- unsupervised/. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 2 6 4 1 9 2 3 1 1 1 / / t l a c _ a _ 0 0 2 6 4 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 questions for machine comprehension of text. In Proceedings of the EMNLP, pages 2383–2392. Austin, TX. Siva Reddy, Danqi Chen, and Christopher D. Manning. 2018. CoQA: A conversational ques- tion answering challenge. CoRR, cs.CL/1808. 07042v1. Matthew Richardson, Christopher JC Burges, and Erin Renshaw. 2013. MCTest: A challenge data set for the open-domain machine comprehen- sion of text. In Proceedings of the EMNLP, pages 193–203. Seattle, WA. Hideyuki Shibuki, Kotaro Sakamoto, Yoshinobu Kano, Teruko Mitamura, Madoka Ishioroshi, Kelly Y Itakura, Di Wang, Tatsunori Mori, and Noriko Kando. 2014. Overview of the NTCIR-11 QA-Lab Task. In NTCIR. Richard Sikes. 2007. Fuzzy matching in theory and practice. Multilingual, 18(6):39–43. Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. 
ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. In Proceedings of the AAAI, pages 4444–4451. San Francisco, CA. Chenhao Tan and Lillian Lee. 2015. All who wander: On the prevalence and characteristics of multi-community engagement. In Proceed- ings of the WWW, pages 1056–1066. Florence, Italy. Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. 2018. Multi-range reasoning for machine com- prehension. CoRR, cs.CL/1803.09074v1. Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. NewsQA: A In Pro- machine comprehension data set. ceedings of the RepL4NLP, pages 191–200. Vancouver, Canada. Bo-Hsiang Tseng, Sheng-Syun Shen, Hung-Yi Lee, and Lin-Shan Lee. 2016. Towards machine comprehension of spoken content: Initial toefl listening comprehension test by machine. In Proceedings of the Interspeech. San Francisco, CA. Lei Wang, Yan Wang, Deng Cai, Dongxiang Zhang, and Xiaojiang Liu. 2018a. Translating a math word problem to a expression tree. In Proceedings of the EMNLP, pages 1064–1069. Brussels, Belgium. Shuohang Wang, Mo Yu, Shiyu Chang, and Jing Jiang. 2018b. A co-matching model for multi- choice reading comprehension. In Proceedings of the ACL, pages 1–6. Melbourne, Australia. Wen-tau Yih, Ming-Wei Chang, Christopher Meek, and Andrzej Pastusiak. 2013. Question answer- ing using enhanced lexical semantic models. In Proceedings of the ACL, pages 1744–1753. Sofia, Bulgaria. Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V Le. 2018. QANet: Combining local convolution with global self-attention for read- ing comprehension. In Proceedings of the ICLR. Vancouver, Canada. Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE ICCV, pages 19–27. Santiago, Chile. 231 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 2 6 4 1 9 2 3 1 1 1 / / t l a c _ a _ 0 0 2 6 4 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3