A Sequential Matching Framework for
Multi-Turn Response Selection in
Retrieval-Based Chatbots
Yu Wu
Beihang University
State Key Laboratory of Software
Development Environment
wuyu@buaa.edu.cn
Wei Wu
Microsoft Corporation
Research and AI Group
wuwei@microsoft.com
Chen Xing
Nankai University
College of Computer and Control
Engineering
xingchen1113@gmail.com
Can Xu
Microsoft Corporation
Research and AI Group
can.xu@microsoft.com
Zhoujun Li
Beihang University
State Key Laboratory of Software
Development Environment
lizj@buaa.edu.cn
Ming Zhou
Microsoft Research
Natural Language Computing Group
mingzhou@microsoft.com
Submission received: 14 August 2017; revised version received: 3 July 2018; accepted for publication:
6 December 2018.
doi:10.1162/coli_a_00345
© 2019 Association for Computational Linguistics
Published under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
(CC BY-NC-ND 4.0) license
Computational Linguistics, Volume 45, Number 1
We study the problem of response selection for multi-turn conversation in retrieval-based
chatbots. The task involves matching a response candidate with a conversation context, the
challenges for which include how to recognize important parts of the context, and how to
model the relationships among utterances in the context. Existing matching methods may lose
important information in contexts as we can interpret them with a unified framework in which
contexts are transformed to fixed-length vectors without any interaction with responses before
matching. This motivates us to propose a new matching framework that can sufficiently carry
important information in contexts to matching and model relationships among utterances at the
same time. The new framework, which we call a sequential matching framework (SMF), lets
each utterance in a context interact with a response candidate at the first step and transforms
the pair to a matching vector. The matching vectors are then accumulated following the order of
the utterances in the context with a recurrent neural network (RNN) that models relationships
among utterances. Context-response matching is then calculated with the hidden states of the
RNN. Under SMF, we propose a sequential convolutional network and sequential attention
network, and conduct experiments on two public data sets to test their performance. Experimental
results show that both models can significantly outperform state-of-the-art matching methods.
We also show that the models are interpretable with visualizations that provide us insights on
how they capture and leverage important information in contexts for matching.
1. Introduction
Recent years have witnessed a surge of interest on building conversational agents both
in industry and academia. Existing conversational agents can be categorized into task-
oriented dialog systems and non–task-oriented chatbots. Dialog systems focus on help-
ing people complete specific tasks in vertical domains (Young et al. 2010), such as flight
booking, bus route enquiry, restaurant recommendation, and so on; chatbots aim
to naturally and meaningfully converse with humans on open domain topics (Ritter,
Cherry, and Dolan 2011). Building an open domain chatbot is challenging, because
it requires the conversational engine to be capable of responding to any input from
humans that covers a wide range of topics. To address the problem, 研究人员有
considered leveraging the large amount of conversation data available on the Internet,
and proposed generation-based methods (Shang, Lu, and Li 2015; Vinyals and Le 2015;
Li et al. 2016b; Mou et al. 2016; Serban et al. 2016; Xing et al. 2017) and retrieval-based
methods (Wang et al. 2013; Hu et al. 2014; Ji, Lu, and Li 2014; Wang et al. 2015; Yan,
Song, and Wu 2016; Zhou et al. 2016; Wu et al. 2018a). Generation-based methods
generate responses with natural language generation models learned from conversation
data, while retrieval-based methods re-use existing responses by selecting proper
ones from an index of the conversation data. In this work, we study the problem of
response selection in retrieval-based chatbots, because retrieval-based chatbots have
the advantage of returning informative and fluent responses. Although most existing
work on retrieval-based chatbots studies response selection for single-turn conversation
(Wang et al. 2013) in which conversation history is ignored, we study the problem in a
multi-turn scenario. In a chatbot, multi-turn response selection takes a message and
utterances in its previous turns as an input and selects a response that is natural and
relevant to the entire context.
A key step in response selection is measuring matching degree between an input
and response candidates. Different from single-turn conversation, in which the input is
a single utterance (i.e., the message), multi-turn conversation requires context-response
matching where both the current message and the utterances in its previous turns
should be taken into consideration. The challenges of the task include (1) how to extract
important information (words, phrases, and sentences) from the context and leverage
the information in matching; and (2) how to model relationships and dependencies
among the utterances in the context. Table 1 uses an example to illustrate the challenges.
First, to find a proper response for the context, the chatbot must know that “hold a
drum class” and “drum” are important points. Without them, it may return a response
relevant to the message (i.e., Turn-5 in the context) but nonsensical in the context
(e.g., “what lessons do you want?”). On the other hand, words like “Shanghai” and
“Lujiazui” are less useful and even noisy to response selection. The responses from the
chatbot may drift to the topic of “Shanghai” if the chatbot pays significant attention
to these words. Therefore, it is crucial yet non-trivial to let the chatbot understand the
important points in the context and leverage them in matching and at the same time
circumvent noise. Second, there is a clear dependency between Turn-5 and Turn-2 in
the context, and the order of utterances matters in response selection because there will
be different proper responses if we exchange Turn-3 and Turn-5.
Existing work, including the recurrent neural network architectures proposed by
Lowe et al. (2015), the deep learning to respond architecture proposed by Yan, Song,
and Wu (2016), and the multi-view architecture proposed by Zhou et al. (2016), may
lose important information in context-response matching because they follow the same
paradigm to perform matching, which suffers clear drawbacks. In fact, although these
models have different structures, they can be interpreted with a unified framework:
A context and a response are first individually represented as vectors, and then their
matching score is computed with the vectors. The context representation includes two
layers. The first layer represents utterances in the context, and the second layer takes the
output of the first layer as an input and represents the entire context. The existing work
differs in how they design the context representation and the response representation
and how they calculate the matching score with the two representations. The framework
view unifies the existing models and indicates the common drawbacks they have:
everything in the context is compressed to one or more fixed-length vectors before
matching is conducted; and there is no interaction between the context and the response
in the formation of their representations. The context is represented without enough
supervision from the response, and so is the response.
Table 1
An example of multi-turn conversation.

Context
Turn-1 Human: How are you doing?
Turn-2 ChatBot: I am going to hold a drum class in Shanghai. Anyone wants to join?
The location is near Lujiazui.
Turn-3 Human: Interesting! Do you have coaches who can help me practice drum?
Turn-4 ChatBot: Of course.
Turn-5 Human: Can I have a free first lesson?

Response Candidates
Response 1: Sure. Have you ever played drum before? ✓
Response 2: What lessons do you want? ✗

To overcome the drawbacks, we propose a sequential matching network (SMN)
for context-response matching in our early work (Wu et al. 2017) where we construct
a matching vector for each utterance–response pair through convolution and pooling
on their similarity matrices, and then aggregate the sequence of matching vectors as
a matching score of the context and the response. In this work, we take it one step
further and generalize the SMN model to a sequential matching framework (SMF). The
framework view allows us to tackle the challenges of context-response matching from a
high level. Specifically, SMF matches each utterance in the context with the response
at the first step and forms a sequence of matching vectors. It then accumulates the
matching vectors of utterance–response pairs in the chronological order of the utter-
ances. The final context-response matching score is calculated with the accumulation
of pair matching. Different from the existing framework, SMF allows utterances in the
context and the response to interact with each other at the very beginning, and thus
important matching information in each utterance–response pair can be sufficiently
preserved and carried to the final matching score. Moreover, relationships and depen-
dencies among utterances are modeled in a matching fashion, so the order of utterances
can supervise the aggregation of the utterance–response matching. Generally speaking,
SMF consists of three layers. The first layer extracts important matching information
from each utterance–response pair and transforms the information into a matching
vector. The matching vectors are then uploaded to the second layer where a recurrent
neural network with gated recurrent units (GRU) (Chung et al. 2014) is used to model
the relationships and dependencies among utterances and accumulate the matching
vectors into its hidden states. The final layer takes the hidden states of the GRU as input
and calculates a matching score for the context and the response.
The key to the success of SMF lies in how to design the utterance–response matching
layer, which requires identification of important parts in each utterance. We first show
that the point-wise similarity calculation followed by convolution and pooling in SMN
is one implementation of the utterance–response matching layer of SMF, making the
SMN model a special case of the framework. Then, we propose a new model named
sequential attention network (SAN), which implements the utterance–response match-
ing layer of SMF with an attention mechanism. Specifically, for an utterance–response
pair, SAN lets the response attend to important parts (either words or segments) 在
the utterance by weighting the parts using each part of the response. Each weight
reflects how important the part in the utterance is with respect to the corresponding
part in the response. Then for each part in the response, parts in the utterance are
linearly combined with the weights, and the combination interacts with the part of
the response by Hadamard product to form a representation of the utterance. Such
utterance representations are computed on both a word level and a segment level. The
two levels of representations are finally concatenated and processed by a GRU to form
a matching vector. SMN and SAN are two different implementations of the utterance–
response matching layer, and we give a comprehensive comparison between SAN and
SMN. Theoretically, SMN is faster and easier to parallelize than SAN, whereas SAN
can better utilize the sequential relationship and dependency. The empirical results
are consistent with the theoretical analysis.
We empirically compare SMN and SAN on two public data sets: the Ubuntu Dia-
logue Corpus (Lowe et al. 2015) and the Douban Conversation Corpus (Wu et al. 2017).
The Ubuntu corpus is a large-scale English data set in which negative instances are
randomly sampled and dialogues are collected from a specific domain; the Douban
corpus is a newly published Chinese data set where conversations are crawled from
an open domain forum with response candidates collected following the procedure
of retrieval-based chatbots and their appropriateness judged by human annotators.
Experimental results show that on both data sets, both SMN and SAN can significantly
outperform the existing methods. Particularly, on the Ubuntu corpus, SMN and SAN
yield 6 and 7 percentage point improvements, respectively, on R10@1 over the best
performing baseline method, and on the Douban corpus, the improvements on mean
average precision from SMN and SAN over the best baseline are 2.6 and 3.6 percent-
age points, respectively. The empirical results indicate that SAN can achieve better
performance than SMN in practice. In addition to the quantitative evaluation, we also
visualize the two models with examples from the Ubuntu corpus. The visualization
reveals how the two models understand conversation contexts and provides us insights
on why they can achieve big improvement over state-of-the-art methods.
This work is a substantial extension of our previous work reported at ACL 2017.
The extension in this article includes a unified framework for the existing methods, a
proposal of a new framework for context-response matching, and a new model under
the framework. Specifically, the contributions of this work include the following.
• We unify existing context-response matching models with a framework
and disclose their intercorrelations with detailed mathematical
derivations, which reveals their common drawbacks and sheds light on
our new direction.
• We propose a new framework for multi-turn response selection, namely,
the sequential matching framework, which is capable of overcoming the
drawbacks suffered by the existing models and addressing the challenges
of context-response matching in an end-to-end way. The framework
indicates that the key to context-response matching is not the 2D
convolution and pooling operations in SMN, but a general
utterance–response matching function that can capture the important
matching information in utterance–response pairs.
• We propose a new architecture, the sequential attention network, under
the new framework. Moreover, we compare SAN with SMN on both
efficiency and effectiveness.
• We conduct extensive experiments on public data sets and verify that SAN
achieves new state-of-the-art performance on context-response matching.
The rest of the paper is organized as follows: In Section 2 we summarize the related
work. We formalize the learning problem in Section 3. In Section 4, we interpret the
existing models with a framework. Section 5 elaborates our new framework and gives
two models as special cases of the framework. Section 6 gives the learning objective
and some training details. In Section 7 we give details of the experiments. In Section 8,
we outline our conclusions.
2. Related Work
We briefly review the history and recent progress of chatbots, and application of text
matching techniques in other tasks. Together with the review of existing work, we
clarify the connection and difference between these works and our work in this article.
2.1 Chatbots
Research on chatbots goes back to the 1960s when ELIZA (Weizenbaum 1966), an early
chatbot, was designed with a large number of handcrafted templates and heuristic rules.
ELIZA needs huge human effort but can only return limited responses. To remedy
this, researchers have developed data-driven approaches (Higashinaka et al. 2014). The
idea behind data-driven approaches is to build a chatbot with the large amount of
conversation data available on social media such as forums and microblogging services.
Methods along this line can be categorized into retrieval-based and generation-based
ones.
Generation-based chatbots reply to a message with natural language generation
techniques. Early work (Ritter, Cherry, and Dolan 2011) regards messages and re-
sponses as source language and target language, respectively, and learns a phrase-based
statistical machine translation model to translate a message to a response. Recently,
together with the success of deep learning approaches, the sequence-to-sequence frame-
work has become the mainstream approach, because it can implicitly capture com-
positionality and long-span dependencies in languages. Under this framework, many
models have been proposed for both single-turn conversation and multi-turn conversa-
tion. For example, in single-turn conversation, sequence-to-sequence with an attention
mechanism (Shang, Lu, and Li 2015; Vinyals and Le 2015) has been applied to response
generation; Li et al. (2016a) proposed a maximum mutual information objective to
improve diversity of generated responses; Xing et al. (2017) and Mou et al. (2016)
introduced external knowledge into the sequence-to-sequence model; Wu et al. (2018b)
proposed decoding a response from a dynamic vocabulary; Li et al. (2016b) incorpo-
rated persona information into the sequence-to-sequence model to enhance response
consistency with speakers; and Zhou et al. (2018) explored how to generate emotional
responses with a memory augmented sequence-to-sequence model. In multi-turn con-
versation, Sordoni et al. (2015) compressed a context to a vector with a multi-layer per-
ceptron in response generation; Serban et al. (2016) extended the sequence-to-sequence
model to a hierarchical encoder-decoder structure; and under this structure, they further
proposed two variants including VHRED (Serban et al. 2017b) and MrRNN (Serban
et al. 2017a) to introduce latent and explicit variables into the generation process.
Xing et al. (2018) exploited a hierarchical attention mechanism to highlight the effect
of important words and utterances in generation. Building upon these methods, a reinforcement
learning technique (Li et al. 2016c) and an adversarial learning technique (Li et al. 2017)
have also been applied to response generation.
Different from the generation based systems, retrieval-based chatbots select a
proper response from an index and re-use the one to reply to a new input. The key to
response selection is how to match the input with a response. In a single-turn scenario,
matching is conducted between a message and a response. For example, Hu et al. (2014)
proposed message-response matching with convolutional neural networks; Wang et al.
(2015) incorporated syntax information into matching; Ji, Lu, and Li (2014) combined
several matching features, such as cosine similarity, topic similarity, and translation score, to
rank response candidates. In multi-turn conversation, matching requires taking the
entire context into consideration. In this scenario, Lowe et al. (2015) used a dual long
short-term memory (LSTM) model to match a response with the literal concatenation
of utterances in a context; Yan, Song, and Wu (2016) reformulated the input message
with the utterances in its previous turns and performed matching with a deep neural
network architecture; Zhou et al. (2016) adopted an utterance view and a word view
in matching to model relationships among utterances; and Wu et al. (2017) proposed
a sequential matching network that can capture important information in contexts
and model relationships among utterances in a unified form.
Our work is a retrieval-based method. It is an extension of the work by Wu et al.
(2017) reported at the ACL conference. In this work, we analyze the existing models
from a framework view, generalize the model in Wu et al. (2017) to a framework, give
another implementation with better performance under the framework, and compare
the new model with the model in the conference paper on various aspects.
2.2 Text Matching
In addition to response selection in chatbots, neural network–based text matching
techniques have proven effective in capturing semantic relations between text pairs in a
variety of NLP tasks. For example, in question answering, convolutional neural networks
(Qiu and Huang 2015; Severyn and Moschitti 2015) can effectively capture compositions
of n-grams and their relations in questions and answers. Inner-Attention (Wang, Liu,
and Zhao 2016) and multiple view (MV)-LSTM (Wan et al. 2016a) can model complex
interaction between questions and answers through recurrent neural network based
architectures. (More studies on text matching for question answering can be found in
Tan et al. [2016]; Liu et al. [2016a,b]; Wan et al. [2016b]; He and Lin [2016]; Yin et al.
[2016]; Yin and Schütze [2015].) In Web search, Shen et al. (2014) and Huang et al.
(2013) built a neural network with tri-letters to alleviate mismatching of queries and
documents due to spelling errors. In textual entailment, the model in Rockt¨aschel et al.
(2015) utilized a word-by-word attention mechanism to distinguish the relationship
between two sentences. Wang and Jiang (2016b) introduced another way to adopt an
attention mechanism for textual entailment. Besides those two works, Chen et al. (2016),
Parikh et al. (2016), and Wang and Jiang (2016a) also investigated the textual entailment
problem with neural network models.
In this work, we study text matching for response selection in multi-turn conver-
sation, in which matching is conducted between a piece of text and a context that
consists of multiple pieces of text dependent on each other. We propose a new matching
framework that is able to extract important information in the context and model
dependencies among utterances in the context.
3. Problem Formalization
Suppose that we have a data set D = {(y_i, s_i, r_i)}_{i=1}^{N}, where s_i is a conversation context,
r_i is a response candidate, and y_i ∈ {0, 1} is a label. s_i = {u_{i,1}, . . . , u_{i,n_i}}, where {u_{i,k}}_{k=1}^{n_i}
are utterances. ∀k, u_{i,k} = (w_{u_{i,k},1}, . . . , w_{u_{i,k},j}, . . . , w_{u_{i,k},n_{u_{i,k}}}), where w_{u_{i,k},j} is the j-th word in
u_{i,k} and n_{u_{i,k}} is the length of u_{i,k}. Similarly, r_i = (w_{r_i,1}, . . . , w_{r_i,j}, . . . , w_{r_i,n_{r_i}}), where w_{r_i,j} is
the j-th word in r_i and n_{r_i} is the length of the response. y_i = 1 if r_i is a proper response
to s_i; otherwise y_i = 0. Our goal is to learn a matching model g(·, ·) with D, and thus for
any new context-response pair (s, r), g(s, r) measures their matching degree. According
to g(s, r), we can rank candidates for s and select a proper one as its response.
In the following sections, we first review how the existing work defines g(·, ·) from a
framework view. The framework view discloses the common drawbacks of the existing
work. Then, based on this analysis, we propose a new matching framework and give
two models under the framework.
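To make the formalization concrete, the following sketch (our illustration, not code from the paper; the scoring function is a deliberately naive stand-in for a learned g(·, ·)) shows how a matching model would be used to rank response candidates for a context:

```python
# Illustrative sketch (not from the paper): ranking response candidates
# for a context s with a matching model g(s, r). The scorer below is a
# placeholder; in the paper g is a trained neural network.

def g(context, response):
    # Placeholder matching degree: word overlap between the context and
    # the response, normalized by response length.
    context_words = set(w for utterance in context for w in utterance.split())
    response_words = response.split()
    overlap = sum(1 for w in response_words if w in context_words)
    return overlap / max(len(response_words), 1)

context = [
    "i am going to hold a drum class in shanghai",
    "do you have coaches who can help me practice drum",
    "can i have a free first lesson",
]
candidates = [
    "sure have you ever played drum before",
    "what lessons do you want",
]

# Rank candidates by g(s, r) and select the top one as the response.
ranked = sorted(candidates, key=lambda r: g(context, r), reverse=True)
print(ranked[0])
```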
4. A Framework for the Existing Models
Before our work, a few studies on context-response matching for response selection in
multi-turn conversation have been conducted. For example, Lowe et al. (2015) match a
context and a response with recurrent neural networks (RNNs); Yan et al. (2016) present
a deep learning to respond architecture for multi-turn response selection; and Zhou
et al. (2016) perform context-response matching from both a word view and an utterance
view. Although these models are proposed from different backgrounds, we find that
they can be interpreted with a unified framework, given by Figure 1.

Figure 1
Existing models can be interpreted with a unified framework. f(·), f′(·), h(·), and m(·, ·) are the
utterance representation function, response representation function, context representation
function, and matching function, respectively.

The framework
consists of utterance representation f(·), response representation f′(·), context repre-
sentation h(·), and matching calculation m(·, ·). Given a context s = {u_1, . . . , u_n} and a
response candidate r, f(·) and f′(·) represent each u_i in s and r as vectors or matrices
by f(u_i) and f′(r), respectively. {f(u_i)}_{i=1}^{n} are then uploaded to h(·), which transforms
the utterance representations into h(f(u_1), . . . , f(u_n)) as a representation of the context s.
Finally, m(·, ·) takes h(f(u_1), . . . , f(u_n)) and f′(r) as input and calculates a matching score
for s and r. In summary, the framework performs context-response matching following
a paradigm in which context s and response r are first individually represented as vectors
and then their matching degree is determined by the vectors. Under the framework,
the matching model g(s, r) can be defined with f(·), h(·), f′(·), and m(·, ·), as follows:

g(s, r) = m\big(h(f(u_1), \ldots, f(u_n)),\, f'(r)\big)    (1)
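As a toy illustration of Equation (1) (ours; mean-pooling stand-ins for f(·), f′(·), and h(·), and a bilinear m(·, ·)), the composition looks like this:

```python
import numpy as np

# Toy instantiation of g(s, r) = m(h(f(u_1), ..., f(u_n)), f'(r)).
# Real models replace these stand-ins with RNNs/CNNs; the point is only
# that the context and response are encoded independently before matching.

def f(utterance_embeddings):          # utterance representation
    return utterance_embeddings.mean(axis=0)

def f_prime(response_embeddings):     # response representation
    return response_embeddings.mean(axis=0)

def h(utterance_vectors):             # context representation
    return np.mean(utterance_vectors, axis=0)

def m(context_vector, response_vector, M):   # matching function
    score = context_vector @ M @ response_vector
    return 1.0 / (1.0 + np.exp(-score))      # sigmoid

d = 8
rng = np.random.default_rng(0)
s = [rng.normal(size=(5, d)) for _ in range(3)]   # 3 utterances, 5 words each
r = rng.normal(size=(4, d))                       # 4-word response
M = rng.normal(size=(d, d))

g = m(h([f(u) for u in s]), f_prime(r), M)
print(g)
```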
The existing models are special cases under the framework with different defini-
tions of f(·), h(·), f′(·), and m(·, ·). Specifically, the RNN models in Lowe et al. (2015)
can be defined as

m_{rnn}(s, r) = \sigma\big( h_{rnn}(f_{rnn}(u_1), \ldots, f_{rnn}(u_n))^\top \cdot M \cdot f'_{rnn}(r) + b \big)    (2)

where M is a linear transformation, b is a bias, and σ(·) is a sigmoid function. ∀u_i =
{w_{u_i,1}, . . . , w_{u_i,n_i}}, f_{rnn}(u_i) is defined by

f_{rnn}(u_i) = [\vec{w}_{u_i,1}, \ldots, \vec{w}_{u_i,k}, \ldots, \vec{w}_{u_i,n_i}]    (3)
where \vec{w}_{u_i,k} is the embedding of the k-th word w_{u_i,k}, and [·] denotes a horizontal con-
catenation operator on vectors or matrices.¹ Suppose that the dimension of the word
embedding is d; then the output of f_{rnn}(u_i) is a d × n_i matrix with each column an
embedding vector. Suppose that r = (w_{r,1}, . . . , w_{r,n_r}); then f′_{rnn}(r) is defined as

f'_{rnn}(r) = RNN(\vec{w}_{r,1}, \ldots, \vec{w}_{r,k}, \ldots, \vec{w}_{r,n_r})    (4)

where \vec{w}_{r,k} is the embedding of the k-th word in r, and RNN(·) is either a vanilla RNN
(Elman 1990) or an RNN with LSTM units (Hochreiter and Schmidhuber 1997). RNN(·)
takes a sequence of vectors as an input, and outputs the last hidden state of the network.
Finally, the context representation h_{rnn}(·) is defined by

h_{rnn}(f_{rnn}(u_1), \ldots, f_{rnn}(u_n)) = RNN([f_{rnn}(u_1), \ldots, f_{rnn}(u_n)])    (5)

¹ We borrow the operator from MATLAB.
In the deep learning to respond (DL2R) architecture (Yan, Song, and Wu 2016),
the authors first transform the context s to an s′ = {v_1, . . . , v_o} with heuristics includ-
ing “no context,” “whole context,” “add-one,” “drop-out,” and “combined.” These
heuristics differ on how utterances before the last input in the context are incor-
porated into matching. In “no context,” s′ = {u_n}, and thus no previous utterances
are considered; in “whole context,” s′ = {u_1 ⊕ · · · ⊕ u_n, u_n}, where the operator ⊕ glues
vectors together and forms a long vector. Therefore, in “whole context,” the conver-
sation context is represented as a concatenation of all its utterances; in “add-one,”
s′ = {u_1 ⊕ u_n, . . . , u_{n−1} ⊕ u_n, u_n}. “add-one” leverages the conversation context by con-
catenating each of its utterances (except the last one) with the last input; in “drop-out,”
s′ = {(c\u_1) ⊕ u_n, . . . , (c\u_{n−1}) ⊕ u_n, u_n}, where c = u_1 ⊕ · · · ⊕ u_n and c\u_i means exclud-
ing u_i from c. “drop-out” also utilizes each utterance before the last one individually, but
concatenates the complement of each utterance with the last input; and in “combined,”
s′ is the union of the other heuristics. Let v_o = u_n in all heuristics; then the matching
model of DL2R can be reformulated as

m_{dl2r}(s, r) = \sum_{i=1}^{o} MLP(f_{dl2r}(v_i) \oplus f_{dl2r}(v_o)) \cdot MLP(f_{dl2r}(v_i) \oplus f'_{dl2r}(r))    (6)

where MLP(·) is a multi-layer perceptron. ∀v ∈ {v_1, . . . , v_o}, suppose that {\vec{w}_{v,1}, . . . ,
\vec{w}_{v,n_v}} represent the embedding vectors of the words in v; then f_{dl2r}(v) is given by

f_{dl2r}(v) = CNN\big(Bi\text{-}LSTM(\vec{w}_{v,1}, \ldots, \vec{w}_{v,n_v})\big)    (7)

where CNN(·) is a convolutional neural network (CNN) (Kim 2014) and Bi-LSTM(·) is a
bi-directional recurrent neural network with LSTM units (Bi-LSTM) (Graves, Mohamed,
and Hinton 2013). The output of Bi-LSTM(·) is all the hidden states of the Bi-LSTM
model. f′_{dl2r}(·) is defined in the same way as f_{dl2r}(·). In DL2R, h_{dl2r}(·) can be viewed
as an identity function on {f_{dl2r}(v_1), . . . , f_{dl2r}(v_o)}. Note that in the paper of Yan, Song,
and Wu (2016), the authors also assume that each response candidate is associated
with an antecedent posting p. This assumption does not always hold in multi-turn
response selection. For example, in the Ubuntu Dialogue Corpus (Lowe et al. 2015), there
are no antecedent postings. To make the framework compatible with their assumption,
we can simply extend f_{dl2r}(r) to [f_{dl2r}(p), f_{dl2r}(r)], and define m_{dl2r}(s, r) as

m_{dl2r}(s, r) = \sum_{p} \sum_{i=1}^{o} MLP(f_{dl2r}(v_i) \oplus f_{dl2r}(v_o)) \cdot MLP(f_{dl2r}(v_i) \oplus f_{dl2r}(p)) \cdot MLP(f_{dl2r}(v_i) \oplus f_{dl2r}(r))    (8)
Finally, in Zhou et al. (2016), the multi-view matching model can be rewritten as

m_{mv}(s, r) = \sigma\big( h_{mv}(f_{mv}(u_1), \ldots, f_{mv}(u_n))^\top [M_1; M_2]\, f'_{mv}(r) + [b_1; b_2] \big)    (9)

where M_1 and M_2 are linear transformations, and b_1 and b_2 are biases. ∀u_i = {w_{u_i,1}, . . . ,
w_{u_i,n_i}}, f_{mv}(u_i) is defined as

f_{mv}(u_i) = \{f_w(u_i), f_u(u_i)\}    (10)

where f_w(u_i) and f_u(u_i) are utterance representations from a word view and an utterance
view, respectively. The formulations of f_w(u_i) and f_u(u_i) are given by

f_w(u_i) = [\vec{w}_{u_i,1}, \ldots, \vec{w}_{u_i,n_i}]
f_u(u_i) = CNN(\vec{w}_{u_i,1}, \ldots, \vec{w}_{u_i,n_i})

Suppose that r = (w_{r,1}, . . . , w_{r,n_r}); then f′_{mv}(r) is defined as

f'_{mv}(r) = [f'_w(r)^\top, f'_u(r)^\top]^\top    (11)

where the word view representation f′_w(r) and the utterance view representation f′_u(r)
are formulated as

f'_w(r) = GRU(\vec{w}_{r,1}, \ldots, \vec{w}_{r,n_r})
f'_u(r) = CNN(\vec{w}_{r,1}, \ldots, \vec{w}_{r,n_r})

where GRU(·) is a recurrent neural network with GRUs (Cho et al. 2014). The out-
put of f′_w(r) is the last hidden state of the GRU model. The context representation
h_{mv}(f_{mv}(u_1), . . . , f_{mv}(u_n)) is defined as

h_{mv}(f_{mv}(u_1), \ldots, f_{mv}(u_n)) = [h_w(f_w(u_1), \ldots, f_w(u_n))^\top, h_u(f_u(u_1), \ldots, f_u(u_n))^\top]^\top    (12)

where the word view h_w(·) and the utterance view h_u(·) are defined as

h_w(f_w(u_1), \ldots, f_w(u_n)) = GRU([f_w(u_1), \ldots, f_w(u_n)])
h_u(f_u(u_1), \ldots, f_u(u_n)) = GRU(f_u(u_1), \ldots, f_u(u_n))
There are several advantages when applying the framework view to the existing
context-response matching models. First, it unifies the existing models and reveals the
intrinsic connections among them. These models are nothing but similarity functions of
a context representation and a response representation. Their difference in performance
comes from how well the two representations capture the semantics and the structures
of the context and the response and how accurate the similarity calculation is. For
example, in empirical studies, the multi-view model performs much better than the
RNN models. This is because the multi-view model captures the sequential relationship
among words, the composition of n-grams, and the sequential relationship of utterances
by h_w(·) and h_u(·), whereas in RNN models, only the sequential relationship among
words is modeled by h_{rnn}(·). Second, it is easy to make an extension of the existing
models by replacing f(·), f′(·), h(·), and m(·, ·). For example, we can replace the h_{rnn}(·)
in RNN models with a composition of CNN and RNN to model both the composition of
n-grams and their sequential relationship, and we can replace the m_{rnn}(·) with a more
powerful neural tensor network (Socher et al. 2013). Third, the framework unveils the
limitations that the existing models and their possible extensions suffer from: Everything in the
context is compressed to one or more fixed-length vectors before matching; and there
is no interaction between the context and the response in the formation of their repre-
sentations. The context is represented without enough supervision from the response,
and so is the response. As a result, these models may lose important information of
contexts in matching, and more seriously, no matter how we improve them, as long
as the improvement is under the framework, we cannot overcome the limitations. This
framework view motivates us to propose a new framework that can essentially change
the existing matching paradigm.
5. Sequential Matching Framework
We propose a sequential matching framework (SMF) that can simultaneously capture
important information in a context and model relationships among utterances in the
context. Figure 2 gives the architecture of SMF. SMF consists of utterance–response
matching f(·, ·), matching accumulation h(·), and matching prediction m(·).

Figure 2
Our new framework for multi-turn response selection, which is called the Sequential Matching
Framework. It first computes a matching vector between an utterance and a response; the
matching vectors are then accumulated by a GRU. Finally, the matching score is obtained with the
hidden states in the second layer.

The three
components are organized in a three-layer architecture. Given a context s = {u_1, . . . , u_n}
and a response candidate r, the first layer matches each u_i in s with r through f(·, ·) and
forms a sequence of matching vectors {f(u_1, r), . . . , f(u_n, r)}. Here, we require f(·, ·) to
be capable of differentiating important parts from unimportant parts in u_i and carrying
the important information into f(u_i, r). Details of how to design such an f(·, ·) will be
described later. The matching vectors {f(u_1, r), . . . , f(u_n, r)} are then uploaded to the
second layer where h(·) models relationships and dependencies among the utterances
{u_1, . . . , u_n}. Here, we define h(·) as a recurrent neural network whose output is a
sequence of hidden states {h_1, . . . , h_n}. ∀k ∈ {1, . . . , n}, h_k is given by

h_k = h'(h_{k-1}, f(u_k, r))    (13)

where h′(·, ·) is a non-linear transformation, and h_0 = 0. h(·) accumulates the matching
vectors {f(u_1, r), . . . , f(u_n, r)} in its hidden states. Finally, in the third layer, m(·) takes
{h_1, . . . , h_n} as an input and predicts a matching score for (s, r). In brief, SMF matches s
and r with a g(s, r) defined as

g(s, r) = m\big(h(f(u_1, r), f(u_2, r), \ldots, f(u_n, r))\big)    (14)
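The composition in Equations (13) and (14) can be sketched as follows (our illustration; the matching function and the recurrent cell are toy stand-ins for the SCN/SAN and GRU components described later):

```python
import numpy as np

# Schematic of SMF: f(u_i, r) -> matching vectors, accumulated by a
# recurrent cell h'(.,.), then m(.) maps hidden states to a score.
# The matching function and the cell here are toy stand-ins.

def f(u, r):
    # Toy utterance-response matching vector: summary statistics of the
    # word-level similarities (placeholder for SCN or SAN).
    sim = u @ r.T                      # (len_u, len_r) similarities
    return np.array([sim.mean(), sim.max(), sim.min()])

def h_prime(h_prev, match_vec, W, U):  # one recurrent accumulation step
    return np.tanh(W @ match_vec + U @ h_prev)

def m(hidden_states, w):               # "last state" prediction head
    score = w @ hidden_states[-1]
    return 1.0 / (1.0 + np.exp(-score))

d, hidden = 8, 4
rng = np.random.default_rng(1)
context = [rng.normal(size=(5, d)) for _ in range(3)]
response = rng.normal(size=(4, d))
W, U = rng.normal(size=(hidden, 3)), rng.normal(size=(hidden, hidden))
w = rng.normal(size=hidden)

h = np.zeros(hidden)
states = []
for u in context:                      # accumulate in utterance order
    h = h_prime(h, f(u, response), W, U)
    states.append(h)
print(m(states, w))
```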
SMF has two major differences from the existing framework: First, SMF lets each
utterance in the context and the response “meet” at the very beginning, and thus the
utterances and the response can sufficiently interact with each other. Through the inter-
action, the response will help recognize important information in each utterance. The
information is preserved in the matching vectors and carried into the final matching
score with minimal loss; second, matching and utterance relationships are coupled
rather than separately modeled as in the existing framework. As a result, the utterance
relationships (e.g., the order of the utterances), as a kind of knowledge, can supervise
the formation of the matching score. Because of these differences, SMF can overcome the
drawbacks the existing models suffer from and tackle the two challenges of context-response
matching simultaneously.
It is obvious that the success of SMF lies in how to design f(·, ·), because f(·, ·) plays
a key role in capturing important information in a context. In the following sections,
we will first specify the design of f(·, ·), and then discuss how to define h(·) and m(·).
5.1 Utterance–Response Matching
We design the utterance–response matching function f (·, ·) in SMF as neural networks to
benefit from their powerful representation abilities. To guarantee that f (·, ·) can capture
important information in utterances with the help of the response, we implement f (·, ·)
using a convolution-pooling technique and an attention technique, which results in a
sequential convolutional network (SCN) and a sequential attention network (SAN).
Moreover, in both SCN and SAN, we consider matching on multiple levels of granu-
larity of text. Note that in our ACL paper (Wu et al. 2017), the sequential convolutional
network is named “SMN.” Here, we rename it to SCN in order to distinguish it from
the framework.
Figure 3
The architecture of SCN. The first layer extracts matching information from interactions between
utterances and a response on a word level and a segment level by a CNN. The second layer
accumulates the matching information from the first layer by a GRU. The third layer takes the
hidden states of the second layer as an input and calculates a matching score.
5.1.1 Sequential Convolutional Network. Figure 3 gives the architecture of SCN. Given an
utterance u in a context s and a response candidate r, SCN looks up an embedding table
and represents u and r as U = [e_{u,1}, . . . , e_{u,n_u}] and R = [e_{r,1}, . . . , e_{r,n_r}], respectively, where
e_{u,i}, e_{r,i} ∈ R^d are the embeddings of the i-th word of u and r, respectively. With U and R,
SCN constructs a word–word similarity matrix M_1 ∈ R^{n_u × n_r} and a sequence–sequence
similarity matrix M_2 ∈ R^{n_u × n_r} as two input channels of a convolutional neural network
(CNN). The CNN then extracts important matching information from the two matrices
and encodes the information into a matching vector v.
Specifically, ∀i, j, the (i, j)-th element of M_1 is defined by

e_{1,i,j} = e_{u,i}^\top \cdot e_{r,j}    (15)
M1 models the interaction between u and r on a word level.
To get M_2, we first transform U and R to sequences of hidden vectors with a GRU.
Suppose that H_u = [h_{u,1}, . . . , h_{u,n_u}] are the hidden vectors of U; then ∀i, h_{u,i} ∈ R^m is
defined by

z_i = \sigma(W_z e_{u,i} + U_z h_{u,i-1})
r_i = \sigma(W_r e_{u,i} + U_r h_{u,i-1})
\tilde{h}_{u,i} = \tanh(W_h e_{u,i} + U_h (r_i \odot h_{u,i-1}))
h_{u,i} = z_i \odot \tilde{h}_{u,i} + (1 - z_i) \odot h_{u,i-1}    (16)

where h_{u,0} = 0, z_i and r_i are an update gate and a reset gate, respectively, σ(·) is a
sigmoid function, and W_z, W_h, W_r, U_z, U_r, U_h are parameters. Similarly, we have
H_r = [h_{r,1}, . . . , h_{r,n_r}] as the hidden vectors of R. Then, ∀i, j, the (i, j)-th element of M_2
is defined by

e_{2,i,j} = h_{u,i}^\top A h_{r,j}    (17)
where A ∈ Rm×m is a linear transformation. ∀i, GRU encodes the sequential information
and the dependency among words up to position i in u into the i-th hidden state. As a
consequence, M_2 models the interaction between u and r on a segment level.
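The sketch below (ours, with randomly initialized parameters) computes the two similarity matrices: M_1 from raw word embeddings as in Equation (15), and M_2 from GRU hidden states following Equations (16) and (17):

```python
import numpy as np

rng = np.random.default_rng(2)
d, m_dim, n_u, n_r = 6, 5, 7, 4
U = rng.normal(size=(n_u, d))          # word embeddings of the utterance
R = rng.normal(size=(n_r, d))          # word embeddings of the response

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_states(E, params):
    # Hidden states per Equation (16); one m-dimensional vector per word.
    Wz, Wr, Wh, Uz, Ur, Uh = params
    h = np.zeros(m_dim)
    states = []
    for e in E:
        z = sigmoid(Wz @ e + Uz @ h)
        r = sigmoid(Wr @ e + Ur @ h)
        h_tilde = np.tanh(Wh @ e + Uh @ (r * h))
        h = z * h_tilde + (1 - z) * h
        states.append(h)
    return np.stack(states)

params = tuple(rng.normal(size=(m_dim, d)) for _ in range(3)) + \
         tuple(rng.normal(size=(m_dim, m_dim)) for _ in range(3))
Hu, Hr = gru_states(U, params), gru_states(R, params)

M1 = U @ R.T                           # Equation (15): word-word similarity
A = rng.normal(size=(m_dim, m_dim))
M2 = Hu @ A @ Hr.T                     # Equation (17): segment-segment similarity
print(M1.shape, M2.shape)              # both (n_u, n_r)
```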
M_1 and M_2 are then processed by a CNN to compute the matching vector v. ∀f =
1, 2, the CNN regards M_f as an input channel, and alternates convolution and max-pooling
operations. If we denote the k-th feature map at the l-th layer as z^k, whose filters are
determined by a tensor W^k and a bias b^k, then the feature map z^k is obtained as follows:

z^k_{i,j} = \sigma\big((W^k * z')_{i,j} + b^k\big)    (18)

z^k_{i,j} = \sigma\Big(\big(\sum_{u} W^k_u * z'_u\big)_{i,j} + b^k\Big)    (19)
where σ(·) is a ReLU, W^k_u is the weight of the u-th feature map, z′ = (z'_1, . . . , z'_u, . . . , z'_U)
denotes the feature maps on the (l − 1)-th layer, and U is the number of feature maps. In particular,
∗ is a 2D convolutional operation, sliding a window over the feature maps at that layer, that
is formulated as

(W * o)_{m,n} = \sum_{i=0}^{width} \sum_{j=0}^{height} W_{i,j} \cdot o_{m+i,\,n+j}    (20)
where width and height are the hyper-parameters of the convolutional window, and o =
z'_u. A max pooling operation follows a convolution operation and picks the maximal
values within a window sliding on the output of the convolution operation, and carries
out a linear transformation on the feature values within the window. The max pooling
operation can be formulated as

z^k_{i,j} = \max z_{(i:i+p_w,\; j:j+p_h)}    (21)
where p_w and p_h are the width and the height of the 2D pooling window, respectively. The
matching vector v is defined by concatenating the outputs of the last feature maps and
transforming them to a low-dimensional space:

v = W_c [z^0, z^1, \ldots, z^{f'}] + b_c    (22)

where f′ denotes the number of feature maps, W_c and b_c are parameters, and z^k is the
concatenation of the elements of the k-th feature map, meaning z^k = [z^k_{0,0}, z^k_{0,1}, \ldots, z^k_{I,J}], where
I and J are the maximum indices of the feature map.
SCN distills important information in each utterance in the context from multiple
levels of granularity through convolution and pooling operations on similarity matrices.
From Equations (15), (17), (18), and (21), we can see that by learning word embeddings
and parameters of GRU from training data, important words or segments in the ut-
terance may have high similarity with some words or segments in the response and
result in high value areas in the similarity matrices. These areas will be transformed
and extracted to the matching vector by convolutions and poolings. We will further
explore the mechanism of SCN by visualizing M1 and M2 of an example in Section 7.
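As a rough sketch of the convolution-and-pooling pipeline in Equations (18)–(22) (ours; a single convolution layer with one filter per channel, valid padding, and non-overlapping pooling), the two similarity matrices are turned into a matching vector as follows:

```python
import numpy as np

def conv2d_valid(channel, kernel, bias):
    # Plain 2D convolution (Equation (20)) with ReLU, valid padding.
    kh, kw = kernel.shape
    H, W = channel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(kernel * channel[i:i + kh, j:j + kw]) + bias
    return np.maximum(out, 0.0)

def max_pool(feature_map, ph, pw):
    # Non-overlapping max pooling (Equation (21)).
    H, W = feature_map.shape
    return np.array([[feature_map[i:i + ph, j:j + pw].max()
                      for j in range(0, W - pw + 1, pw)]
                     for i in range(0, H - ph + 1, ph)])

rng = np.random.default_rng(3)
M1, M2 = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))   # two input channels
kernels = rng.normal(size=(2, 3, 3))
pooled = [max_pool(conv2d_valid(M, k, 0.1), 2, 2).ravel()
          for M, k in zip([M1, M2], kernels)]

Wc = rng.normal(size=(16, sum(p.size for p in pooled)))
v = Wc @ np.concatenate(pooled)        # matching vector (Equation (22))
print(v.shape)
```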
Figure 4
The architecture of SAN. The first layer highlights important words and segments in context,
and computes a matching vector from both the word level and the segment level. Similar to SCN, the
second layer uses a GRU to accumulate the matching information, and the third layer predicts
the final matching score.
5.1.2 Sequential Attention Network. With word embeddings U and R and hidden vectors
Hu and Hr, SAN also performs utterance–response matching on a word level and a
segment level. Figure 4 gives the architecture of SAN. In each level of matching, SAN
exploits every part of the response (either a word or a hidden state) to weight the parts
of the utterance and obtain a weighted representation of the utterance. The utterance
representation then interacts with the part of the response. The interactions are finally
aggregated following the order of the parts in the response as a matching vector.
Specifically, ∀e_{r,i} ∈ R, the weight of e_{u,j} ∈ U is given by

\omega_{i,j} = \tanh(e_{u,j}^\top W_{att1} e_{r,i} + b_{att1})    (23)

\alpha_{i,j} = \frac{e^{\omega_{i,j}}}{\sum_{j=1}^{n_u} e^{\omega_{i,j}}}    (24)

where W_{att1} ∈ R^{d×d} and b_{att1} ∈ R are parameters. ω_{i,j} ∈ R represents the importance of
e_{u,j} in the utterance corresponding to e_{r,i} in the response. α_{i,j} is the normalized importance.
The interaction between u and e_{r,i} is then defined as

t_{1,i} = \Big(\sum_{j=1}^{n_u} \alpha_{i,j} e_{u,j}\Big) \odot e_{r,i}    (25)

where \sum_{j=1}^{n_u} \alpha_{i,j} e_{u,j} is the representation of u with weights \{\alpha_{i,j}\}_{j=1}^{n_u}, and ⊙ is the
Hadamard product.
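A compact sketch of the word-level attention in Equations (23)–(25) (ours, with random parameters):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n_u, n_r = 6, 7, 4
U = rng.normal(size=(n_u, d))              # utterance word embeddings
R = rng.normal(size=(n_r, d))              # response word embeddings
W_att1 = rng.normal(size=(d, d))
b_att1 = 0.1

T1 = []
for e_r in R:                              # each response word attends to the utterance
    omega = np.tanh(U @ W_att1 @ e_r + b_att1)   # Equation (23): one weight per utterance word
    alpha = np.exp(omega) / np.exp(omega).sum()  # Equation (24): normalized importance
    weighted_u = alpha @ U                       # weighted utterance representation
    T1.append(weighted_u * e_r)                  # Equation (25): Hadamard interaction
T1 = np.stack(T1)                          # word-level interaction, one row per response word
print(T1.shape)                            # (n_r, d)
```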
Similarly, ∀h_{r,i} ∈ H_r, the weight of h_{u,j} ∈ H_u can be defined as

\omega'_{i,j} = v'^\top \tanh(h_{u,j}^\top W_{att2} h_{r,i} + b_{att2})    (26)

\alpha'_{i,j} = \frac{e^{\omega'_{i,j}}}{\sum_{j=1}^{n_u} e^{\omega'_{i,j}}}    (27)

where W_{att2} ∈ R^{d×d}, v′ ∈ R^d, and b_{att2} ∈ R^d are parameters. The interaction between u
and h_{r,i} can then be formulated as

t_{2,i} = \Big(\sum_{j=1}^{n_u} \alpha'_{i,j} h_{u,j}\Big) \odot h_{r,i}    (28)

We denote the attention weights {α_{i,j}} and {α′_{i,j}} as A_1 and A_2, respectively. With
the word-level interaction T_1 = [t_{1,1}, . . . , t_{1,n_r}] and the segment-level interaction T_2 =
[t_{2,1}, . . . , t_{2,n_r}], we form T = [t_1, . . . , t_{n_r}] by defining t_i as [t_{1,i}^\top, t_{2,i}^\top]^\top. The matching vector
v of SAN is then obtained by processing T with a GRU:

v = GRU(T)    (29)

where the specific parameterization of GRU(·) is similar to Equation (16), and we take
the last hidden state of the GRU as v.
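To illustrate how the two levels of interaction are fused into the matching vector of Equation (29) (our sketch; T1 and T2 stand in for the outputs of Equations (25) and (28), and the GRU uses random parameters):

```python
import numpy as np

# Sketch (ours): concatenate word-level and segment-level interactions per
# response position and run a GRU over them to get the matching vector v.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_last_state(X, dim):
    rng = np.random.default_rng(5)
    Wz, Wr, Wh = (rng.normal(size=(dim, X.shape[1])) for _ in range(3))
    Uz, Ur, Uh = (rng.normal(size=(dim, dim)) for _ in range(3))
    h = np.zeros(dim)
    for x in X:
        z, r = sigmoid(Wz @ x + Uz @ h), sigmoid(Wr @ x + Ur @ h)
        h = z * np.tanh(Wh @ x + Uh @ (r * h)) + (1 - z) * h
    return h

n_r, d = 4, 6
rng = np.random.default_rng(6)
T1 = rng.normal(size=(n_r, d))          # word-level interactions t_{1,i}
T2 = rng.normal(size=(n_r, d))          # segment-level interactions t_{2,i}
T = np.concatenate([T1, T2], axis=1)    # t_i = [t_{1,i}; t_{2,i}]
v = gru_last_state(T, dim=8)            # matching vector of SAN
print(v.shape)
```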
From Equations (23) and (26), we can see that SAN identifies important information
in utterances in a context through an attention mechanism. Words or segments in
utterances that are useful to recognize the appropriateness between the context and
a response will receive high weights from the response. The information conveyed by
these words and segments will be highlighted in the interaction between the utterances
and the response and carried to the matching vector through an RNN that models the
aggregation of information in the utterances under the supervision of the response.
Similar to SCN, we will further investigate the effect of the attention mechanism in
SAN by visualizing the attention weights in Section 7.
5.1.3 SAN vs. SCN. Because SCN and SAN exploit different mechanisms to understand
important parts in contexts, an interesting question arises: What are the advantages and
disadvantages of the two models in practice? Here, we leave the empirical comparison of
their performance to the experiments and first compare SCN with SAN on the following
aspects: (1) the amount of parallelizable computation, which is measured by the minimum
number of sequential operations required; and (2) total time complexity.
Table 2 summarizes the comparison between the two models. In terms of parallel-
izability, SAN uses two RNNs to learn the representations, which requires 2n sequential
operations, whereas SCN has n sequentially executed operations in the construction
of M_2. Thus, SCN is easier to parallelize than SAN. In terms of time complexity, the
complexity of SCN is O(k · n · d² + n · d² + n² · d), where k is the number of feature maps
in convolutions, n is max(n_u, n_r), and d is the embedding size. More specifically, in SCN, the
cost of constructing M_1 and M_2 is O(n · d² + n² · d), and the cost of convolution and
pooling is O(k · n · d²). The complexity of SAN is O(n² · d + n² · d²), where O(n² · d) is
the cost of calculating H_u and H_r and O(n² · d²) is the cost of the following attention-
based GRU. In practice, k is usually much smaller than the maximum sentence length n.
Table 2
Comparison between SCN and SAN. k is the kernel number of convolutions. n is max(n_u, n_r). d is
the embedding size.

        time complexity                   number of sequential operations
SCN     O(k · n · d² + n · d² + n² · d)   n
SAN     O(n² · d² + n² · d)               2n
Therefore, SCN could be faster than SAN. The conclusion is also verified by the empirical
results in Section 7.
5.2 Matching Accumulation
The function of matching accumulation h(·) in SMF can be implemented with any
recurrent neural network such as an LSTM or a GRU. In this work, we fix h(·) as a GRU
in both SCN and SAN. Given {f(u_1, r), . . . , f(u_n, r)} as the output of the first layer of
SMF, the non-linear transformation h′(·, ·) in Equation (13) is formulated as

z'_i = \sigma(W'_z f(u_i, r) + U'_z h_{i-1})
r'_i = \sigma(W'_r f(u_i, r) + U'_r h_{i-1})
\tilde{h}_i = \tanh(W'_h f(u_i, r) + U'_h (r'_i \odot h_{i-1}))
h_i = z'_i \odot \tilde{h}_i + (1 - z'_i) \odot h_{i-1}    (30)

where W'_z, W'_r, W'_h, U'_z, U'_r, U'_h are parameters, and z'_i and r'_i are an update gate
and a reset gate, respectively. Here, h_i is a hidden state that encodes the matching
information in the previous turns. From Equation (30) we can see that the reset gate (i.e.,
r'_i) and the update gate (i.e., z'_i) control how much information from the current matching
vector f(u_i, r) flows into the accumulation vector h_i. Ideally, the two gates should let
matching vectors that correspond to important utterances make a great impact on the
accumulation vectors (i.e., the hidden states) while blocking the information from the
unimportant utterances. In practice, we find that we can achieve this by learning SCN
and SAN from large-scale conversation data. The details will be given in Section 7.
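A toy version of the gated accumulation in Equation (30) (ours, with random parameters) makes the role of the two gates explicit:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(7)
match_dim, hid = 3, 4
Wz, Wr, Wh = (rng.normal(size=(hid, match_dim)) for _ in range(3))
Uz, Ur, Uh = (rng.normal(size=(hid, hid)) for _ in range(3))

matching_vectors = [rng.normal(size=match_dim) for _ in range(5)]  # f(u_1,r)..f(u_5,r)

h = np.zeros(hid)
hidden_states = []
for f_ui_r in matching_vectors:
    z = sigmoid(Wz @ f_ui_r + Uz @ h)          # update gate
    r = sigmoid(Wr @ f_ui_r + Ur @ h)          # reset gate
    h_tilde = np.tanh(Wh @ f_ui_r + Uh @ (r * h))
    h = z * h_tilde + (1 - z) * h              # Equation (30)
    hidden_states.append(h)
print(len(hidden_states), hidden_states[-1].shape)
```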
5.3 Matching Prediction
m(·) takes {h_1, . . . , h_n} from h(·) as an input and predicts a matching score for (s, r).
We consider three approaches to implementing m(·).
5.3.1 Last State. The first approach is that we only use the last hidden state hn to calculate
a matching score. The underlying assumption is that important information in the
context, after selection by the gates of the GRU, has been encoded into the vector h_n.
Then m(·) is formulated as
m_{last}(h_1, \ldots, h_n) = softmax(W_l h_n + b_l)    (31)

where W_l and b_l are parameters.
5.3.2 Static Average. The second approach is combining all hidden states with weights
determined by their positions. In this approach, m(·) can be formulated as

m_{static}(h_1, \ldots, h_n) = softmax\Big(W_s \big(\sum_{i=1}^{n} w_i h_i\big) + b_s\Big)    (32)

where W_s and b_s are parameters, and w_i is the weight of the i-th hidden state and is
learned from data. Note that in m_{static}(·), once \{w_i\}_{i=1}^{n} are learned, they are fixed for any
(s, r) pair, and that is why we call the approach “static average.” Compared with the last
state, the static average can leverage more information in the early parts of {h_1, . . . , h_n},
and thus can avoid information loss from the process of the GRU in h(·).
5.3.3 Dynamic Average. Similar to static average, we also combine all hidden states
to calculate a matching score, but the difference is that the combination weights are
dynamically computed by the hidden states and the utterance vectors through an
attention mechanism as in Bahdanau, Cho, and Bengio (2014). The weights will change
according to the content of the utterances in different contexts, and that is why we call
the approach “dynamic average.” In this approach, m(·) is defined as

t_i = t_s^\top \tanh(W_{d1} h_{u,n_u} + W_{d2} h_i + b_{d1})
\alpha_i = \frac{\exp(t_i)}{\sum_{i} \exp(t_i)}
m(h_1, \ldots, h_n) = softmax\Big(W_d \big(\sum_{i=1}^{n} \alpha_i h_i\big) + b_{d2}\Big)    (33)

where W_{d1} ∈ R^{q×m}, W_{d2} ∈ R^{q×q}, b_{d1} ∈ R^q, W_d ∈ R^{q×q}, and b_{d2} ∈ R^q are parameters. t_s
is a virtual context vector that is learned in training. h_i and h_{u,n_u} are the i-th hidden state of
h(·) and the final hidden state of the utterance, respectively.
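The three prediction heads can be sketched as follows (our illustration; the output is a softmax over two classes, all parameters are random, and W_d is given shape 2 × q here so that it maps to the two classes, a simplification of the dimensions stated above):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(8)
n, q = 5, 4
H = [rng.normal(size=q) for _ in range(n)]     # hidden states h_1..h_n from h(.)

# Last state (Equation (31))
Wl, bl = rng.normal(size=(2, q)), rng.normal(size=2)
m_last = softmax(Wl @ H[-1] + bl)

# Static average (Equation (32)): position weights learned once, then fixed
Ws, bs = rng.normal(size=(2, q)), rng.normal(size=2)
w = rng.normal(size=n)
m_static = softmax(Ws @ sum(w[i] * H[i] for i in range(n)) + bs)

# Dynamic average (Equation (33)): weights computed from the hidden states
Wd1, Wd2, bd1 = rng.normal(size=(q, q)), rng.normal(size=(q, q)), rng.normal(size=q)
ts, Wd, bd2 = rng.normal(size=q), rng.normal(size=(2, q)), rng.normal(size=2)
h_u_last = rng.normal(size=q)                  # final hidden state of the utterance
t = np.array([ts @ np.tanh(Wd1 @ h_u_last + Wd2 @ h + bd1) for h in H])
alpha = np.exp(t) / np.exp(t).sum()
m_dynamic = softmax(Wd @ sum(alpha[i] * H[i] for i in range(n)) + bd2)

print(m_last, m_static, m_dynamic)
```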
6. Model Training
We choose cross entropy as the loss function. Let Θ denote the parameters of f(·, ·), h(·, ·),
and m(·); then the objective function L(D, Θ) can be written as

L(D, \Theta) = -\sum_{i=1}^{N} \big[ y_i \log(g(s_i, r_i)) + (1 - y_i) \log(1 - g(s_i, r_i)) \big]    (34)

where N is the number of instances in D. We optimize the objective function using back-
propagation and the parameters are updated by stochastic gradient descent with the
Adam algorithm (Kingma and Ba 2014) on a single Tesla K80 GPU. The initial learning
rate is 0.001, and the parameters of Adam, β1 and β2, are 0.9 and 0.999, respectively.
We use early-stopping as a regularization strategy. Models are trained in mini-batches
with a batch size of 200.
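A minimal sketch of the objective in Equation (34) (ours; a logistic scorer stands in for g(s, r), the loss is averaged rather than summed, and plain gradient descent replaces Adam for brevity):

```python
import numpy as np

# Toy logistic scorer standing in for g(s, r): features x_i represent a
# (context, response) pair; labels y_i in {0, 1}.
rng = np.random.default_rng(9)
X = rng.normal(size=(200, 10))
y = (rng.random(200) < 0.5).astype(float)
theta = np.zeros(10)

def g_scores(X, theta):
    return 1.0 / (1.0 + np.exp(-(X @ theta)))

def loss(X, y, theta):
    p = g_scores(X, theta)
    # Equation (34): negative log-likelihood over the data set (mean here)
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

lr = 0.1
for _ in range(100):                   # plain gradient descent; the paper uses Adam
    p = g_scores(X, theta)
    grad = X.T @ (p - y) / len(y)
    theta -= lr * grad
print(loss(X, y, theta))
```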
7. Experiments
We test SAN and SCN on two public data sets with both quantitative metrics and
qualitative analysis.
7.1 Data Sets
The first data set we exploited to test the performance of our models is the Ubuntu
Dialogue Corpus v1 (Lowe et al. 2015). The corpus contains large-scale two-way conver-
sations collected from the chat logs of the Ubuntu forum. The conversations are multi-
turn discussions about Ubuntu-related technical issues. We used the copy shared by
Xu et al. (2017),2 in which numbers, URLs, and paths are replaced by special
placeholders. The data set consists of 1 million context-response pairs for training,
0.5 million pairs for validation, 和 0.5 million pairs for testing. In each conversation,
a human reply is selected as a positive response to the context, and negative responses
are randomly sampled. The ratio of positive responses and negative responses is 1:1 在
the training set, 和 1:9 in both the validation and test sets.
In addition to the Ubuntu Dialogue Corpus, we selected the Douban Conversation
Corpus (Wu et al. 2017) as another data set. The data set is a recently released large-
scale open-domain conversation corpus in which conversations are crawled from a
popular Chinese forum Douban Group.3 The training set contains 1 million context-
response pairs, and the validation set contains 50,000 pairs. In both sets, a context
has a human reply as a positive response and a randomly sampled reply as a nega-
tive response. Therefore, the ratio of positive instances and negative instances in both
training and validation is 1:1. Different from the Ubuntu Dialogue Corpus, the test set
of the Douban Conversation Corpus contains 1,000 contexts, with each one having 10
responses retrieved from a pre-built index. Each response receives three labels from
human annotators that indicate its appropriateness as a reply to the context and the
majority of the labels are taken as the final decision. An appropriate response means
that the response can naturally reply to the conversation history by satisfying logic
consistency, fluency, and semantic relevance. Otherwise, if a response does not meet any
of the three conditions, it is an inappropriate response. The Fleiss kappa (Fleiss 1971) of
the labeling is 0.41, which means that the labelers reached a moderate agreement in
their work. Note that in our experiments, we removed contexts whose responses are all
labeled as positive or negative. After this step, there are 6,670 context-response pairs
left in the test set.
Table 3 summarizes the statistics of the two data sets.
7.2 Baselines
We compared our methods with the following methods:
TF-IDF: We followed Lowe et al. (2015) and computed TF-IDF-based cosine simi-
larity between a context and a response. Utterances in the context are concatenated to
form a document. IDF is computed on the training data.
Basic deep learning models: We used models in Lowe et al. (2015) and Kadlec,
Schmid, and Kleindienst (2015), in which representations of a context are learned by
2 https://www.dropbox.com/s/2fdn26rj6h9bpvl/ubuntudata.zip?dl=0.
3 https://www.douban.com/group/.
Table 3
Statistics of the two data sets.

                                      Ubuntu Corpus             Douban Corpus
                                      train   val     test      train   val     test
# context-response pairs              1M      0.5M    0.5M      1M      50k     10k
# candidates per context              2       10      10        2       2       10
# positive candidates per context     1       1       1         1       1       1.18
Min. # turns per context              3       3       3         3       3       3
Max. # turns per context              19      19      19        98      91      45
Avg. # turns per context              10.10   10.11   10.10     6.69    6.75    6.45
Avg. # words per utterance            12.45   12.48   12.44     18.56   18.50   20.74
neural networks with the concatenation of utterances as inputs and the final matching
score is computed by a bilinear function of the context representation and the response
representation. Models including RNN, CNN, LSTM, and BiLSTM were selected as
baselines.
Multi-View: The model proposed in Zhou et al. (2016) that utilizes a hierarchical
recurrent neural network to model utterance relationships. It integrates information in
a context from an utterance view and a word view. Details of the model can be found in
Equation (9).
Deep learning to respond (DL2R): The authors in Yan, Song, and Wu (2016) pro-
posed several approaches to reformulate a message with previous turns in a context.
The response and the reformulated message are then represented by a composition of
RNN and CNN. Finally, the matching score is computed with the concatenation of the
representations. Details of the model can be found in Equation (6).
Advanced single-turn matching models: Because BiLSTM does not represent the
state-of-the-art matching model, we concatenated the utterances in a context and
matched the long text with a response candidate using more powerful models, includ-
ing MV-LSTM (Wan et al. 2016b) (2D matching), Match-LSTM (Wang and Jiang 2016b),
and Attentive-LSTM (Tan et al. 2016) (two attention-based models). To demonstrate the
importance of modeling utterance relationships, we also calculated a matching score
for the concatenation of utterances and the response candidate using the methods in
Section 5.1. The two models are simple versions of SCN and SAN, respectively, with-
out considering utterance relationships. We denote them as SCNsingle and SANsingle,
respectively.
7.3 Evaluation Metrics
In experiments on the Ubuntu corpus, we followed Lowe et al. (2015) and used recall at
position k in n candidates (Rn@k) as evaluation metrics. Here the matching models are
required to return k most likely responses, and Rn@k = 1 if the true response is among
the k candidates. Rn@k will become larger when k gets larger or n gets smaller.
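A small sketch of the metric, under the assumption that each context has exactly one true response among its n candidates:

```python
# R_n@k for a single context: 1 if the true response is ranked among the top-k
# of the n candidates, 0 otherwise; averaging over all test contexts gives R_n@k.
def recall_at_k(scores, positive_index, k):
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return 1.0 if positive_index in ranked[:k] else 0.0
```

For example, with 10 candidates per context, averaging recall_at_k(..., k=1) over the test set yields R10@1.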
Rn@k has bias when there are multiple true candidates for a context. Thus, on
the Douban corpus, apart from Rn@ks, we also followed the convention of information
retrieval and used mean average precision (MAP) (Baeza-Yates, Ribeiro-Neto et al.
1999), mean reciprocal rank (MRR) (Voorhees and Tice 2000), and precision at position 1
(P@1) as evaluation metrics, which are defined as follows
\begin{align}
\mathrm{MAP} &= \frac{1}{|S|}\sum_{s_i \in S} \mathrm{AP}(s_i), \quad \text{where } \mathrm{AP}(s_i) = \frac{\sum_{j=1}^{N_r}\frac{\sum_{k=1}^{j}\mathit{rel}(r_k,s_i)}{j}\cdot \mathit{rel}(r_j,s_i)}{\sum_{j=1}^{N_r}\mathit{rel}(r_j,s_i)} \tag{35}\\
\mathrm{MRR} &= \frac{1}{|S|}\sum_{s_i \in S} \mathrm{RR}(s_i), \quad \text{where } \mathrm{RR}(s_i) = \frac{1}{\mathit{rank}_i} \tag{36}\\
\mathrm{P@1} &= \frac{1}{|S|}\sum_{s_i \in S} \mathit{rel}(r_{\mathrm{top1}},s_i) \tag{37}
\end{align}
where ranki refers to the position of the first relevant response to context si in the ranking
list; rj refers to the response ranked at the j-th position; rel(rj, si) = 1 if rj is an appropriate
response to context si, otherwise rel(rj, si) = 0; rtop1 is the response ranked at the top
position; S is the universal set of contexts; and Nr denotes the number of retrieved
responses.
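The following sketch computes the three metrics from binary relevance judgments ordered by a model's ranking; it follows Equations (35)-(37) as we read them, with variable names of our own choosing:

```python
# `relevance` is a 0/1 list for the retrieved candidates of one context,
# ordered from the top-ranked response downward.
def average_precision(relevance):
    hits, precisions = 0, []
    for j, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / j)      # precision at each relevant position
    return sum(precisions) / max(sum(relevance), 1)

def reciprocal_rank(relevance):
    for j, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / j                   # 1 / rank of the first relevant response
    return 0.0

def evaluate(all_relevance):
    """all_relevance holds one 0/1 list per context in the test set S."""
    n = len(all_relevance)
    map_score = sum(average_precision(r) for r in all_relevance) / n
    mrr_score = sum(reciprocal_rank(r) for r in all_relevance) / n
    p_at_1 = sum(r[0] for r in all_relevance) / n
    return map_score, mrr_score, p_at_1
```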
We did not calculate R2@1 for the test data in the Douban corpus because one
context could have more than one correct response, and we have to randomly sample
one for R2@1, which may bring bias to the evaluation.
7.4 Parameter Tuning
For baseline models, we copied the numbers in the existing papers if their results on
the Ubuntu corpus are reported in their original paper (TF-IDF, RNN, CNN, LSTM,
BiLSTM, Multi-View); otherwise we implemented the models by tuning their parame-
ters on the validation sets. All models were implemented using the Theano framework
(Theano Development Team 2016). Word embeddings in neural networks were initial-
ized by the results of word2vec (Mikolov et al. 2013)4 pre-trained on the training data.
We did not use GloVe (Pennington, Socher, and Manning 2014) because the Ubuntu
corpus contains many technical words that are not covered by Twitter or Wikipedia.
The word embedding size was chosen as 200. The maximum utterance length was set
to 50. The maximum context length (i.e., number of utterances per context) was varied
from 1 to 20 and finally set to 10. We padded zeros if the number of utterances in a
context is less than 10; otherwise we kept the last 10 utterances. We will discuss how the
performance of models changes in terms of different maximum context length later.
For SCN, the window size of convolution and pooling was tuned over {(2, 2),
(3, 3), (4, 4)} and finally set to (3, 3). The number of feature maps is 8. The size of
the hidden states in the construction of M2 is the same as the word embedding size,
and the size of the output vector v was set to 50. In addition, the size of the hidden
states in the matching accumulation module is also 50. In SAN, the size of the hidden
states in the segment-level representation is 200, and the size of the hidden states in
Equation (29) was set to 400.
All tuning was done according to R2@1 on the validation data.
4 https://code.google.com/archive/p/word2vec/.
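For reference, the settings above can be collected in one place; this summary is ours, not a configuration file released with the paper:

```python
# Hyper-parameters reported in Section 7.4 (values as stated in the text).
HYPERPARAMS = {
    "word_embedding_size": 200,       # word2vec, pre-trained on the training data
    "max_utterance_length": 50,
    "max_context_length": 10,         # pad with zeros or keep the last 10 utterances
    "scn": {
        "conv_pool_window": (3, 3),   # tuned over {(2, 2), (3, 3), (4, 4)}
        "feature_maps": 8,
        "m2_hidden_size": 200,        # same as the word embedding size
        "matching_vector_size": 50,   # size of the output vector v
        "accumulation_gru_size": 50,
    },
    "san": {
        "segment_hidden_size": 200,
        "eq29_hidden_size": 400,
    },
}
```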
7.5 Evaluation Results
Tables 4 and 5 show the evaluation results on the Ubuntu Corpus and the Douban
Corpus, respectively. SAN and SCN outperform baselines over all metrics on both
data sets with large margins, and except for R10@5 of SCN on the Douban corpus,
the improvements are statistically significant (t-test with p-value ≤ 0.01). Our models
are better than state-of-the-art single-turn matching models such as MV-LSTM, Match-
LSTM, SCNsingle, and SANsingle. The results demonstrate that one cannot neglect utter-
ance relationships and simply perform multi-turn response selection by concatenating
utterances together.

Table 4
Evaluation results on the Ubuntu corpus. Subscripts including last, static, and dynamic indicate
three approaches to predicting a matching score as described in Section 5.3. Numbers in bold
mean that the improvement from the models is statistically significant over the best baseline
method.

                  R2@1    R10@1   R10@2   R10@5
TF-IDF            0.659   0.410   0.545   0.708
RNN               0.768   0.403   0.547   0.819
CNN               0.848   0.549   0.684   0.896
LSTM              0.901   0.638   0.784   0.949
BiLSTM            0.895   0.630   0.780   0.944
Multi-View        0.908   0.662   0.801   0.951
DL2R              0.899   0.626   0.783   0.944
MV-LSTM           0.906   0.653   0.804   0.946
Match-LSTM        0.904   0.653   0.799   0.944
Attentive-LSTM    0.903   0.633   0.789   0.943
SCNsingle         0.904   0.656   0.809   0.942
SANsingle         0.906   0.662   0.810   0.945
SCNlast           0.923   0.723   0.842   0.956
SCNstatic         0.927   0.725   0.838   0.962
SCNdynamic        0.926   0.726   0.847   0.961
SANlast           0.930   0.733   0.850   0.961
SANstatic         0.932   0.734   0.852   0.962
SANdynamic        0.932   0.733   0.851   0.961

Table 5
Evaluation results on the Douban corpus. Notations have the same meaning as those in Table 4.
On R10@5, only SAN significantly outperforms baseline methods.

                  MAP     MRR     P@1     R10@1   R10@2   R10@5
TF-IDF            0.331   0.359   0.180   0.096   0.172   0.405
RNN               0.390   0.422   0.208   0.118   0.223   0.589
CNN               0.417   0.440   0.226   0.121   0.252   0.647
LSTM              0.485   0.527   0.320   0.187   0.343   0.720
BiLSTM            0.479   0.514   0.313   0.184   0.330   0.716
Multi-View        0.505   0.543   0.342   0.202   0.350   0.729
DL2R              0.488   0.527   0.330   0.193   0.342   0.705
MV-LSTM           0.498   0.538   0.348   0.202   0.351   0.710
Match-LSTM        0.500   0.537   0.345   0.202   0.348   0.720
Attentive-LSTM    0.495   0.523   0.331   0.192   0.328   0.718
SCNsingle         0.506   0.543   0.349   0.203   0.351   0.709
SANsingle         0.508   0.547   0.352   0.206   0.353   0.720
SCNlast           0.526   0.571   0.393   0.236   0.387   0.729
SCNstatic         0.523   0.572   0.387   0.228   0.387   0.734
SCNdynamic        0.529   0.569   0.397   0.233   0.396   0.724
SANlast           0.536   0.581   0.393   0.236   0.404   0.761
SANstatic         0.532   0.575   0.387   0.228   0.393   0.736
SANdynamic        0.534   0.577   0.391   0.230   0.393   0.742
TF-IDF shows the worst performance, indicating that the multi-turn response selec-
tion problem cannot be addressed with shallow features. LSTM is the best model among
the basic models. The reason might be that it models relationships among words. Multi-
View is better than LSTM, demonstrating the effectiveness of the utterance-view in
context modeling. Advanced models have better performance, because they are capable
of capturing more complicated structures in contexts.
SAN is better than SCN on both data sets, which might be attributed to three
reasons. The first reason is that SAN uses vectors instead of scalars to represent in-
teractions between words or text segments. Therefore, the matching vectors in SAN
can encode more information from the pairs than those in SCN. The second reason is
that SAN uses a soft attention mechanism to emphasize important words or segments
in utterances, whereas SCN uses a max pooling operation to select important infor-
mation from similarity matrices. When multiple words or segments are important in
an utterance–response pair, a max pooling operation just selects the top one, but the
attention mechanism can leverage all of them. The last reason is that SAN models the
sequential relationship and dependency among words or segments in the interaction
aggregation module, whereas SCN only considers n-grams.
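The contrast between the two operations can be illustrated with a toy example (the numbers below are made up and do not come from either model):

```python
import math

# One row of word-response similarities; two entries are almost equally important.
similarities = [0.1, 0.9, 0.85, 0.2]

# Max pooling, as used in SCN, keeps only the single strongest signal.
max_pooled = max(similarities)            # 0.9; the 0.85 entry is discarded

# Soft attention, as used in SAN, converts all entries into weights and blends them,
# so every important entry contributes to the result.
exp_s = [math.exp(s) for s in similarities]
weights = [e / sum(exp_s) for e in exp_s]
attended = sum(w * s for w, s in zip(weights, similarities))
```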
The three approaches to matching prediction do not show much difference in both
SCN and SAN, but dynamic average and static average are better than the last state on
the Ubuntu corpus and worse than it on the Douban corpus. This is because contexts
in the Ubuntu corpus are longer than those in the Douban corpus (average context
length 10.1 vs. 6.7), and thus the last hidden state may lose information in history on the
Ubuntu data. In contrast, the Douban corpus has shorter contexts but longer utterances
(average utterance length 18.5 vs. 12.4), and thus noise may be involved in response
selection if more hidden states are taken into consideration.
There are two reasons that Rn@ks on the Douban corpus are much smaller than
those on the Ubuntu corpus. One is that response candidates in the Douban corpus are
returned by a search engine rather than negative sampling. 所以, some negative
responses in the Douban corpus might be semantically closer to the true positive re-
sponses than those in the Ubuntu corpus, and thus more difficult to differentiate by
a model. The other is that there are multiple correct candidates for a context, so the
maximum R10@1 for some contexts is not 1. For example, if there are three correct
responses, then the maximum R10@1 is 0.33. P@1 is about 40% on the Douban corpus,
indicating the difficulty of the task in a real chatbot.
7.6 Further Analysis
7.6.1 Model Ablation. We first investigated how different parts of SCN and SAN affect
their performance by ablating SCNlast and SANlast. Table 6 reports the results of ablation
on the test data. First, we replaced the utterance–response matching module in SCN and
SAN with a neural tensor (Socher et al. 2013) (denoted as ReplaceM), which matches an
utterance and a response by feeding their representations to a neural tensor network
(NTN). The result is that the performance of the two models dropped dramatically. This
is because in NTN there is no interaction between the utterance and the response before
their matching; and it is doubtful whether NTN can recognize important parts in the
pair and encode the information into matching. As a result, the model loses important
information in the pair. Therefore, we can conclude that a good utterance–response
matching mechanism is crucial to the success of SMF. At least, one has to let an utterance
and a response interact with each other and explicitly highlight important parts in their
matching vector. Second, we replaced the GRU in the matching accumulation modules
of SCN and SAN with a multi-layer perceptron (denoted as SCN ReplaceA and SAN
ReplaceA, respectively). The change led to a slight performance drop. This indicates
that utterance relationships are useful in context-response matching. Finally, we only
left one level of granularity, either word level or segment level, in SCN and SAN, and
denoted the models as SCN with words, SCN with segments, SAN with words, and
SAN with segments, respectively. The results indicate that segment-level matching on
utterance–response pairs contributes more to the final context-response matching, and
both segments and words are useful in response selection.

Table 6
Evaluation results of model ablation.

                     Ubuntu Corpus                      Douban Corpus
                     R2@1    R10@1   R10@2   R10@5     MAP     MRR     P@1     R10@1   R10@2   R10@5
ReplaceM             0.905   0.661   0.799   0.950     0.503   0.541   0.343   0.201   0.364   0.729
SCN with words       0.919   0.704   0.832   0.955     0.518   0.562   0.370   0.228   0.371   0.737
SCN with segments    0.921   0.715   0.836   0.956     0.521   0.565   0.382   0.232   0.380   0.734
SCN ReplaceA         0.918   0.716   0.832   0.954     0.522   0.565   0.376   0.220   0.385   0.727
SCNlast              0.923   0.723   0.842   0.956     0.526   0.571   0.393   0.236   0.387   0.729
SAN with words       0.922   0.713   0.842   0.957     0.523   0.565   0.372   0.232   0.381   0.747
SAN with segments    0.928   0.729   0.846   0.959     0.532   0.575   0.385   0.234   0.393   0.754
SAN ReplaceA         0.927   0.728   0.842   0.959     0.532   0.561   0.386   0.225   0.395   0.757
SANlast              0.930   0.733   0.850   0.961     0.536   0.581   0.393   0.236   0.404   0.761
7.6.2 Comparison with Respect to Context Length. We then studied how the performance
of SCNlast and SANlast changes across contexts with different lengths. Context-response
pairs were bucketed into three bins according to the length of the contexts (i.e., the
number of utterances in the contexts), and comparison was made in different bins
on different metrics. Figure 5 gives the results. Note that we did the analysis only on
the Douban corpus because on the Ubuntu corpus many results were copied from the
existing literature and the bin-level results are not available. SAN and SCN consistently
perform better than the baselines over bins, and a general trend is that when contexts
become longer, gaps become larger. For example, in (2, 5], SAN is 3 points higher than
LSTM on R10@5, but the gap becomes 5 points in (5, 10]. The results demonstrate that our
models can well capture dependencies, especially long-distance dependencies, among
utterances in contexts. SAN and SCN have similar trends because both of them use a
GRU in the second layer to model dependencies among utterances. The performance
of all models drops when the length of contexts increases from (2, 5] to (5, 10]. This
is because the semantics of longer contexts is more difficult to capture than that of shorter
contexts. On the other hand, the performance of all models improves when the length of
contexts increases from (5, 10] to (10, ). This is because the bin of (10, ) contains much less
data than the other two bins (the data distribution is 53% for (2, 5], 38% for (5, 10], and
9% for (10, )), and thus the improvement does not make much sense from a statistical
point of view.

Figure 5
Model performance across context length. We compared SAN and SCN with LSTM, MV-LSTM,
and Multi-View on the Douban corpus.
7.6.3 Sensitivity to Hyper-Parameters. We checked how sensitive SCN and SAN are re-
garding the size of word embedding and the maximum context length. Table 7 reports
evaluation results of SCNlast and SANlast with embedding sizes varying in {50, 100, 200}.
Table 7
Evaluation results in terms of different word embedding sizes.

           Ubuntu Corpus                      Douban Corpus
           R2@1    R10@1   R10@2   R10@5     MAP     MRR     P@1     R10@1   R10@2   R10@5
SCN50d     0.920   0.715   0.834   0.952     0.503   0.541   0.343   0.201   0.364   0.729
SCN100d    0.921   0.718   0.838   0.954     0.524   0.569   0.391   0.234   0.387   0.727
SCN200d    0.923   0.723   0.842   0.956     0.526   0.571   0.393   0.236   0.387   0.729
SAN50d     0.914   0.698   0.828   0.950     0.503   0.541   0.343   0.201   0.364   0.729
SAN100d    0.921   0.711   0.840   0.953     0.525   0.565   0.375   0.220   0.388   0.746
SAN200d    0.930   0.733   0.850   0.961     0.536   0.581   0.393   0.236   0.404   0.761
We can see that SAN is more sensitive to the word embedding size than SCN. SCN
becomes stable after the embedding size exceeds 100, whereas SAN keeps improving
with the increase of the embedding size. Our explanation of the phenomenon is that
SCN transforms word vectors and hidden vectors of GRU to scalars in the similarity
matrices by dot products, thus information in extra dimensions (e.g., entries with
indices larger than 100) might be lost; on the other hand, SAN leverages the whole
d-dimensional vectors in matching, so the information in the embedding can be ex-
ploited more sufficiently.
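The difference can be sketched as follows; this is an illustration of the contrast described above (scalar dot products vs. full vectors), not the exact equations of the two models, and the toy shapes are our own:

```python
import numpy as np

d = 200                                    # word embedding size
utterance = np.random.randn(8, d)          # 8 word vectors of an utterance (toy data)
response = np.random.randn(6, d)           # 6 word vectors of a response (toy data)

# SCN-style interaction: every word pair collapses to one scalar, giving an
# 8 x 6 similarity matrix; information spread over the d dimensions is not
# recoverable from these scalars afterwards.
M1 = utterance @ response.T                # shape (8, 6)

# SAN-style interaction keeps vectors: attention weights derived from the same
# scores build a d-dimensional aligned vector per utterance word, so all
# embedding dimensions remain available to later layers.
weights = np.exp(M1) / np.exp(M1).sum(axis=1, keepdims=True)
aligned = weights @ response               # shape (8, d)
```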
Figure 6 shows the performance of SCN and SAN with respect to the maximum
context length. We find that both models significantly become better with the increase of
maximum context length when it is lower than 5, and become stable after the maximum
context length reaches 10. The results indicate that utterances from early history can
provide useful information to response selection. Moreover, model performance is more
sensitive to the maximum context length on the Ubuntu corpus than it is on the Douban
corpus. This is because utterances in the Douban corpus are longer than those in the
Ubuntu corpus (average length 18.5 vs. 12.4), which means single utterances in the
Douban corpus could contain more information than those in the Ubuntu corpus.
In practice, we set the maximum context length to 10 to balance effectiveness and
efficiency.

Figure 6
Performance with respect to different maximum context lengths. (a) Performance of SCN across
different context lengths; (b) performance of SAN across different context lengths.
7.6.4 Model Efficiency. In Section 5.1.3, we theoretically analyzed the efficiency of SCN
and SAN. To verify the theoretical results, we further empirically compared their effi-
ciency using the training data and the test data of the two data sets. The experiments
were conducted using Theano on a Tesla K80 GPU with a Windows Server 2012 oper-
ating system. The parameters of the two models are described in Section 7.4. Figure 7
gives the training time and the test time of SAN and SCN. We can see that SCN is
twice as fast as SAN in the training process (as a result of lower time complexity and ease
of parallelization), and saves 3 msec per batch in the test process. Moreover, different
matching functions do not influence the running time as much, because the bottleneck
is the utterance representation learning.
The empirical results are consistent with our theoretical results: SCN is faster than
SAN. The results indicate that SCN is suitable for systems that care more about effi-
ciency, whereas SAN can reach a higher accuracy with a little sacrifice of efficiency.

Figure 7
Efficiency of SCN and SAN. The left panel shows the training time per batch with 200-dimensional
word embeddings, and the right panel shows the inference time per batch. One batch contains
200 instances.
7.6.5 Visualization. We finally explained how SAN and SCN understand the semantics
of conversation contexts by visualizing the similarity matrices of SCN, the attention
weights of SAN, and the update gate and the reset gate of the accumulation GRU of
the two models using an example from the Ubuntu corpus. Table 8 shows an example
that is selected from the test set of the Ubuntu corpus and ranked at the top position by
both SAN and SCN.
Figure 8(a) illustrates word–word similarity matrices M1 in SCN. We can see that
important words in u1 such as "unzip," "rar," and "files" are recognized and highlighted
by words like "command," "extract," and "directory" in r. On the other hand, the simi-
larity matrix of r and u3 is almost blank, as there is no important information conveyed
by u3. Figure 8(b) shows the sequence-to-sequence similarity matrices M2 in SCN. We
find that important segments like "unzip many rar" are highlighted, and the matrices
also provide complementary matching information to M1.

Table 8
An example for visualization from the Ubuntu corpus.

Context
u1: how can unzip many rar files at once?
u2: sure you can do that in bash
u3: okay how?
u4: are the files all in the same directory?
u5: yes they all are;
Response
Response: then the command glebihan should extract them all from/to that directory

Figure 8(c) visualizes the reset gate and the update gate of the accumulation GRU,
respectively. Higher values in the update gate represent more information from the
corresponding matching vector flowing into matching accumulation. From Figure 8(c),
we can see that u1 is crucial to response selection and nearly all information from u1
and r flows to the hidden state of GRU, whereas other utterances are less informative
and the corresponding gates are almost "closed" to keep the information from u1 and r
until the final state.

Figure 8
Visualization of SCN. (a) Visualization of M1; (b) visualization of M2; (c) visualization of gates.
Darker squares refer to higher values.
Regarding SAN, Figures 9(a) and 9(b) illustrate the word-level attention
weights A1 and the segment-level attention weights A2, respectively. Similar to SCN, impor-
tant words such as "zip" and "file" and important segments like "unzip many rar" get
high weights, whereas function words like "that" and "for" are less attended. It should
be noted that because the attention weights are normalized, the gaps between high and low
values in A1 and A2 are not as large as those in M1 and M2 of SCN. Figure 9(c) visualizes
the gates of the accumulation GRU, from which we observed similar distributions as
those of SCN.

Figure 9
Visualization of SAN. (a) Visualization of A1; (b) visualization of A2; (c) visualization of gates.
Darker squares refer to higher values.
7.7 Error Analysis and Future Work
Although models under SMF outperform baseline methods on the two data sets, there
are still several problems that cannot yet be handled perfectly.
(1) Logical consistency. SMF models the context and response on a semantic level,
but pays little attention to logical consistency. This leads to several bad cases in the
Douban corpus. We give a typical example in Table 9. In the conversation history, one of
the speakers says that he thinks the item on Taobao is fake, and the response is expected
to explain why he dislikes the fake shoes. However, both SCN and SAN rank the response
"It is not a fake. I just worry about the date of manufacture." at the top position. This
response is inconsistent with the context in terms of logic, as it claims that the jogging
shoes are not fake, which is contradictory to the context.
The reason behind this is that SMF only models semantics of context-response pairs.
Logic, attitude, and sentiment are not taken into account in response selection.
In the future, we shall explore the logical consistency problem in retrieval-based
chatbots by leveraging more features.

Table 9
An example in the Douban corpus. The response is ranked at the top position among candidates,
but it is logically inconsistent with the current context.

Context
u1: Does anyone know Newton jogging shoes?
u2: 100 RMB on Taobao.
u3: I know that. I do not want to buy it because that is a fake which is made in Qingdao.
u4: Is it the only reason you do not want to buy it?
Response
Response: It is not a fake. I just worry about the date of manufacture.
(2) No valid candidates. Another serious issue is the quality of candidates after
retrieval. According to Wu et al. (2017), the candidate retrieval method can be described
as follows: given a message un with {u1, . . . , un−1} utterances in its previous turns, the
D
哦
w
n
我
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
C
哦
我
我
/
A
r
t
我
C
我
e
–
p
d
F
/
4
5
/
1
/
1
6
3
/
1
8
0
9
6
7
7
/
C
哦
我
我
_
A
_
0
0
3
4
5
.
p
d
F
乙
y
G
你
e
s
t
哦
n
0
8
S
e
p
t
e
米
乙
e
r
2
0
2
3
top five keywords are extracted from {u1, . . . , un−1} based on their TF-IDF scores.5
un is then expanded with the keywords, and the expanded message is sent to the
index to retrieve response candidates using the inline retrieval algorithm of the index.
The performance of the heuristic message expansion method is not good enough. In
the experiment, only 667 out of 1,000 contexts have correct candidates after response
candidate retrieval. This indicates that there is still much room to improve the retrieval
component, and message expansion with several keywords from previous turns may
not be enough for candidate retrieval. In the future, we will consider advanced methods
for retrieving candidates.

5 TF is word frequency in the context, and IDF is calculated using the entire index.
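The expansion heuristic can be sketched as follows; `search_index` is a placeholder for the inline retrieval algorithm of the index, and the helper name is ours:

```python
from collections import Counter

def expand_message(previous_turns, message, idf, top_k=5):
    """Append the top-k TF-IDF keywords of the conversation history to the message."""
    tf = Counter(word for turn in previous_turns for word in turn.split())
    scores = {word: count * idf.get(word, 0.0) for word, count in tf.items()}
    keywords = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return message + " " + " ".join(keywords)

# candidates = search_index(expand_message(history, last_message, idf))
```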
(3) Gap between training and test. The current method requires a huge amount of
training data (i.e., context-response pairs) to learn a matching model. However, it is
too expensive to obtain large-scale (e.g., millions of) human-labeled pairs in practice.
Therefore, we regard conversations with human replies as positive instances and con-
versations with randomly sampled replies as negative instances in model training. This
negative sampling method, however, oversimplifies the learning of a matching model
because most negative candidates are semantically far from human responses, and thus
easy to recognize; and some negative candidates might be proper responses if they are
judged by a human. Because of the gap in training and test, our matching models,
although performing much better than the baseline models, are still far from perfect
on the Douban corpus (see the low P@1 in Table 5). In the future, we may consider
using small human labeled data sets but leveraging the large-scale unlabeled data to
learn matching models.
8. Conclusion
In this paper we studied the problem of multi-turn response selection in which one
has to model the relationships among utterances in a context and pay more attention
to important parts of the context. We find that the existing models cannot address the
two challenges at the same time when we summarize them into a general framework.
Motivated by the analysis, we propose a sequential matching framework for context-
response matching. The new framework is able to capture the important information in
a context and model the utterance relationships simultaneously. Under the framework,
we propose two specific models based on a convolution-pooling technique and an at-
tention mechanism. We test the two models on two public data sets. The results indicate
that both models can significantly outperform the state-of-the-art models. To further
understand the models, we conduct ablation analysis and visualize key components of
the two models. We also compare the two models in terms of their efficacy, efficiency,
and sensitivity to hyper-parameters.
Acknowledgments
Yu Wu is supported by an AdeptMind
Scholarship and a Microsoft Scholarship.
This work was supported in part by the
Natural Science Foundation of China (grants
U1636211, 61672081, 61370126), the Beijing
Advanced Innovation Center for Imaging
Technology (grant BAICIT-2016001), and the
National Key R&D Program of China (grant
2016QY04W0802).
References
Baeza-Yates, Ricardo, Berthier Ribeiro-Neto,
et al. 1999. Modern Information Retrieval,
463. ACM Press, New York.
Bahdanau, Dzmitry, Kyunghyun Cho, and
Yoshua Bengio. 2014. Neural machine
translation by jointly learning to align and
translate. CoRR, abs/1409.0473.
Chen, Qian, Xiaodan Zhu, Zhen-Hua Ling,
Si Wei, and Hui Jiang. 2016. Enhancing
and combining sequential and tree LSTM
for natural language inference. CoRR,
abs/1609.06038.
Cho, Kyunghyun, Bart Van Merriënboer,
Caglar Gulcehre, Dzmitry Bahdanau, Fethi
Bougares, Holger Schwenk, and Yoshua
Bengio. 2014. Learning phrase
representations using RNN encoder-decoder
for statistical machine translation. In
Conference on Empirical Methods in Natural
Language Processing, pages 1724–1734,
Doha.
Chung, Junyoung, Çaglar Gülçehre,
KyungHyun Cho, and Yoshua Bengio.
2014. Empirical evaluation of gated
recurrent neural networks on sequence
modeling. CoRR, abs/1412.3555.
Elman, Jeffrey L. 1990. Finding structure in
time. Cognitive Science, 14(2):179–211.
Fleiss, Joseph L. 1971. Measuring nominal
scale agreement among many raters.
Psychological Bulletin, 76(5):378.
Graves, Alex, Abdel-rahman Mohamed, and
Geoffrey Hinton. 2013. Speech recognition
with deep recurrent neural networks. In
Acoustics, Speech and Signal Processing
(ICASSP), 2013 IEEE International
Conference, pages 6645–6649, Vancouver.
He, Hua and Jimmy J. Lin. 2016. Pairwise
word interaction modeling with deep
neural networks for semantic similarity
measurement. In NAACL HLT 2016, The
2016 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
pages 937–948, San Diego, CA.
Higashinaka, Ryuichiro, Kenji Imamura,
Toyomi Meguro, Chiaki Miyazaki, Nozomi
Kobayashi, Hiroaki Sugiyama, Toru
Hirano, Toshiro Makino, and Yoshihiro
Matsuo. 2014. Towards an open-domain
conversational system fully based on
natural language processing. In COLING,
pages 928–939, Dublin.
Hochreiter, Sepp and Jürgen Schmidhuber.
1997. Long short-term memory. Neural
Computation, 9(8):1735–1780.
Hu, Baotian, Zhengdong Lu, Hang Li, and
Qingcai Chen. 2014. Convolutional neural
network architectures for matching natural
language sentences. In Advances in Neural
Information Processing Systems,
pages 2042–2050, Montreal.
Huang, Po-Sen, Xiaodong He, Jianfeng Gao,
Li Deng, Alex Acero, and Larry Heck.
2013. Learning deep structured semantic
models for web search using clickthrough
data. In Proceedings of the 22nd ACM
International Conference on Information &
Knowledge Management, pages 2333–2338,
San Francisco, CA.
Ji, Zongcheng, Zhengdong Lu, and Hang Li.
2014. An information retrieval approach to
short text conversation. CoRR,
abs/1408.6988.
Kadlec, Rudolf, Martin Schmid, and Jan
Kleindienst. 2015. Improved deep learning
baselines for Ubuntu corpus dialogs.
CoRR, abs/1510.03753.
Kim, Yoon. 2014. Convolutional neural
networks for sentence classification. In
Proceedings of the 2014 Conference on
Empirical Methods in Natural Language
Processing, EMNLP 2014, pages 1746–1751,
Doha.
Kingma, Diederik P. and Jimmy Ba. 2014.
Adam: A method for stochastic
optimization. CoRR, abs/1412.6980.
Li, Jiwei, Michel Galley, Chris Brockett,
Jianfeng Gao, and Bill Dolan. 2016a. A
diversity-promoting objective function for
neural conversation models. In NAACL
HLT 2016, The 2016 Conference of the North
American Chapter of the Association for
Computational Linguistics: Human
Language Technologies, pages 110–119,
San Diego, CA.
Li, Jiwei, Michel Galley, Chris Brockett,
Georgios P. Spithourakis, Jianfeng Gao,
and William B. Dolan. 2016b. A
persona-based neural conversation model.
In Proceedings of the 54th Annual Meeting of
the Association for Computational Linguistics,
ACL 2016, pages 994–1003, Berlin.
Li, Jiwei, Will Monroe, Alan Ritter, Dan
Jurafsky, Michel Galley, and Jianfeng Gao.
2016c. Deep reinforcement learning for
dialogue generation. In Proceedings of the
2016 Conference on Empirical Methods in
Natural Language Processing, EMNLP 2016,
pages 1192–1202, Austin, TX.
Li, Jiwei, Will Monroe, Tianlin Shi, Sébastien
Jean, Alan Ritter, and Dan Jurafsky. 2017.
Adversarial learning for neural dialogue
generation. In Proceedings of the 2017
Conference on Empirical Methods in Natural
Language Processing, EMNLP 2017,
pages 2157–2169, Copenhagen.
Liu, Pengfei, Xipeng Qiu, Jifan Chen, and
Xuanjing Huang. 2016a. Deep fusion LSTMs
for text semantic matching. In Proceedings
of the 54th Annual Meeting of the Association
for Computational Linguistics, Volume 1: Long
Papers, pages 1034–1043, Berlin.
Liu, Pengfei, Xipeng Qiu, Yaqian Zhou, Jifan
Chen, and Xuanjing Huang. 2016b.
Modelling interaction of sentence pair
with coupled-LSTMs. In Proceedings of the
2016 Conference on Empirical Methods in
Natural Language Processing, EMNLP 2016,
pages 1703–1712, Austin, TX.
Lowe, Ryan, Nissan Pow, Iulian Serban, and
Joelle Pineau. 2015. The Ubuntu dialogue
corpus: A large dataset for research in
unstructured multi-turn dialogue systems.
In Proceedings of the SIGDIAL 2015
Conference, The 16th Annual Meeting of the
Special Interest Group on Discourse and
Dialogue, pages 285–294, Prague.
Mikolov, Tomas, Ilya Sutskever, Kai Chen,
Greg S. Corrado, and Jeff Dean. 2013.
Distributed representations of words
and phrases and their compositionality.
In Advances in Neural Information
Processing Systems, pages 3111–3119,
Lake Tahoe, NV.
Mou, Lili, Yiping Song, Rui Yan, Ge Li,
Lu Zhang, and Zhi Jin. 2016. Sequence
to backward and forward sequences:
A content-introducing approach to
generative short-text conversation. In
COLING 2016, 26th International Conference
on Computational Linguistics, Proceedings of
the Conference: Technical Papers,
pages 3349–3358, Osaka.
Parikh, Ankur P., Oscar Täckström, Dipanjan
Das, and Jakob Uszkoreit. 2016. A
decomposable attention model for natural
language inference. In Proceedings of the
2016 Conference on Empirical Methods in
Natural Language Processing, EMNLP 2016,
pages 2249–2255, Austin, TX.
Pennington, Jeffrey, Richard Socher, and
Christopher D. Manning. 2014. GloVe:
Global vectors for word representation. In
Proceedings of the 2014 Conference on
Empirical Methods in Natural Language
Processing, EMNLP 2014, pages 1532–1543,
Doha.
Qiu, Xipeng and Xuanjing Huang. 2015.
Convolutional neural tensor network
architecture for community-based
question answering. In Proceedings of the
24th International Joint Conference on
Artificial Intelligence (IJCAI),
pages 1305–1311, Buenos Aires.
Ritter, Alan, Colin Cherry, and William B.
Dolan. 2011. Data-driven response
generation in social media. In Proceedings
of the Conference on Empirical Methods in
Natural Language Processing,
pages 583–593, Edinburgh.
Rocktäschel, Tim, Edward Grefenstette,
Karl Moritz Hermann, Tomáš Kočiský, and
Phil Blunsom. 2015. Reasoning about
entailment with neural attention. CoRR,
abs/1509.06664.
Serban, Iulian Vlad, Tim Klinger, Gerald
Tesauro, Kartik Talamadupula, Bowen
Zhou, Yoshua Bengio, and Aaron C.
Courville. 2017a. Multiresolution recurrent
neural networks: An application to
dialogue response generation. In
Proceedings of the Thirty-First AAAI
Conference on Artificial Intelligence,
pages 3288–3294, San Francisco, CA.
Serban, Iulian Vlad, Alessandro Sordoni,
Yoshua Bengio, Aaron C. Courville, and
Joelle Pineau. 2016. Building end-to-end
dialogue systems using generative
hierarchical neural network models.
In Proceedings of the Thirtieth AAAI
Conference on Artificial Intelligence,
pages 3776–3784, Phoenix, AZ.
Serban, Iulian Vlad, Alessandro Sordoni,
Ryan Lowe, Laurent Charlin, Joelle Pineau,
Aaron C. Courville, and Yoshua Bengio.
2017b. A hierarchical latent variable
encoder-decoder model for generating
dialogues. In AAAI, pages 3295–3301,
San Francisco, CA.
Severyn, Aliaksei and Alessandro Moschitti.
2015. Learning to rank short text pairs with
convolutional deep neural networks. In
Proceedings of the 38th International ACM
SIGIR Conference on Research and
Development in Information Retrieval,
pages 373–382, Santiago.
Shang, Lifeng, Zhengdong Lu, and Hang Li.
2015. Neural responding machine for
short-text conversation. In ACL 2015,
Volume 1: Long Papers, pages 1577–1586,
Beijing.
Shen, Yelong, Xiaodong He, Jianfeng Gao,
Li Deng, and Grégoire Mesnil. 2014.
A latent semantic model with
convolutional-pooling structure for
information retrieval. In Proceedings of the
23rd ACM International Conference on
Conference on Information and Knowledge
Management, pages 101–110, Shanghai.
Socher, Richard, Danqi Chen, Christopher
D. Manning, and Andrew Ng. 2013.
Reasoning with neural tensor networks for
knowledge base completion. In Advances in
Neural Information Processing Systems,
pages 926–934, Lake Tahoe, NV.
Sordoni, Alessandro, Michel Galley, Michael
Auli, Chris Brockett, Yangfeng Ji, Margaret
Mitchell, Jian-Yun Nie, Jianfeng Gao, and
Bill Dolan. 2015. A neural network
approach to context-sensitive generation
of conversational responses. In NAACL
HLT 2015, The 2015 Conference of the North
American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, pages 196–205, Denver, CO.
Tan, Ming, Cícero Nogueira dos Santos, Bing
Xiang, and Bowen Zhou. 2016. Improved
representation learning for question
answer matching. In Proceedings of the 54th
Annual Meeting of the Association for
Computational Linguistics, ACL 2016,
Volume 1: Long Papers, pages 464–473,
Berlin.
Theano Development Team. 2016. Theano: A
Python framework for fast computation of
mathematical expressions. CoRR,
abs/1605.02688.
Vinyals, Oriol and Quoc V. Le. 2015. A neural
conversational model. CoRR, abs/1506.05869.
Voorhees, Ellen M. and Dawn M. Tice. 2000.
The TREC-8 question answering track.
In Proceedings of the Second International
Conference on Language Resources and
Evaluation, LREC 2000, pages 26–34,
Athens.
Wan, Shengxian, Yanyan Lan, Jiafeng Guo,
Jun Xu, Liang Pang, and Xueqi Cheng.
2016a. A deep architecture for semantic
matching with multiple positional
sentence representations. In Proceedings of
the Thirtieth AAAI Conference on
Artificial Intelligence, pages 2835–2841,
Phoenix, AZ.
Wan, Shengxian, Yanyan Lan, Jun Xu, Jiafeng
Guo, Liang Pang, and Xueqi Cheng. 2016b.
Match-SRNN: Modeling the recursive
matching structure with spatial RNN. In
Proceedings of the Twenty-Fifth International
Joint Conference on Artificial Intelligence,
IJCAI 2016, pages 2922–2928, New York,
NY.
Wang, Bingning, Kang Liu, and Jun Zhao.
2016. Inner attention based recurrent
neural networks for answer selection. In
Proceedings of the 54th Annual Meeting of the
Association for Computational Linguistics,
ACL 2016, Volume 1: Long Papers,
pages 1288–1297, Berlin.
Wang, Hao, Zhengdong Lu, Hang Li, and
Enhong Chen. 2013. A dataset for research
on short-text conversations. In Proceedings
of the 2013 Conference on Empirical
Methods in Natural Language Processing,
EMNLP 2013, pages 935–945,
Seattle, WA.
Wang, Mingxuan, Zhengdong Lu, Hang Li,
and Qun Liu. 2015. Syntax-based deep
matching of short texts. In Twenty-Fourth
International Joint Conference on Artificial
Intelligence, pages 1354–1361, Buenos
Aires.
Wang, Shuohang and Jing Jiang. 2016a. A
compare-aggregate model for matching
text sequences. CoRR, abs/1611.01747.
Wang, Shuohang and Jing Jiang. 2016b.
Learning natural language inference with
LSTM. In NAACL HLT 2016, The 2016
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies,
pages 1442–1451, San Diego, CA.
Weizenbaum, Joseph. 1966. ELIZA: A
computer program for the study of natural
language communication between man
and machine. Communications of the ACM,
9(1):36–45.
Wu, Yu, Wei Wu, Zhoujun Li, and Ming
Zhou. 2018a. Learning matching models
with weak supervision for response
selection in retrieval-based chatbots. In
Proceedings of the 56th Annual Meeting of the
Association for Computational Linguistics,
ACL 2018, Volume 2: Short Papers,
pages 420–425, Melbourne.
Wu, Yu, Wei Wu, Chen Xing, Ming Zhou,
and Zhoujun Li. 2017. Sequential matching
network: A new architecture for multi-turn
response selection in retrieval-based
chatbots. In Proceedings of the 55th Annual
Meeting of the Association for Computational
Linguistics, ACL 2017, Volume 1: Long
Papers, pages 496–505, Vancouver.
Wu, Yu, Wei Wu, Dejian Yang, Can Xu, and
Zhoujun Li. 2018b. Neural response
generation with dynamic vocabularies.
In Proceedings of the Thirty-Second AAAI
Conference on Artificial Intelligence
(AAAI-18), the 30th Innovative Applications
of Artificial Intelligence (IAAI-18), and the 8th
AAAI Symposium on Educational Advances
in Artificial Intelligence (EAAI-18),
pages 5594–5601, New Orleans, LA.
Xing, Chen, Wei Wu, Yu Wu, Jie Liu, Yalou
Huang, Ming Zhou, and Wei-Ying Ma.
2017. Topic aware neural response
generation. In Proceedings of the Thirty-First
AAAI Conference on Artificial Intelligence,
pages 3351–3357, San Francisco, CA.
Xing, Chen, Yu Wu, Wei Wu, Yalou Huang,
and Ming Zhou. 2018. Hierarchical
recurrent attention network for response
generation. In Proceedings of the
Thirty-Second AAAI Conference on Artificial
Intelligence (AAAI-18), the 30th Innovative
Applications of Artificial Intelligence
(IAAI-18), and the 8th AAAI Symposium on
Educational Advances in Artificial
Intelligence (EAAI-18), pages 5610–5617,
New Orleans, LA.
Xu, Zhen, Bingquan Liu, Baoxun Wang,
Chengjie Sun, and Xiaolong Wang. 2017.
Incorporating loose-structured knowledge
into conversation modeling via recall-gate
LSTM. In 2017 International Joint Conference
on Neural Networks, IJCNN 2017,
pages 3506–3513, Anchorage, AK.
Yan, Rui, Yiping Song, and Hua Wu. 2016.
Learning to respond with deep neural
networks for retrieval-based
human-computer conversation system.
In Proceedings of the 39th International
ACM SIGIR Conference on Research and
Development in Information Retrieval, SIGIR
2016, pages 55–64, Pisa.
Yin, Wenpeng and Hinrich Schütze. 2015.
MultigranCNN: An architecture for
general matching of text chunks on
multiple levels of granularity. In
Proceedings of the 53rd Annual Meeting of the
Association for Computational Linguistics
(ACL), pages 63–73, Beijing.
Yin, Wenpeng, Hinrich Schütze, Bing Xiang,
and Bowen Zhou. 2016. ABCNN:
Attention-based convolutional neural
network for modeling sentence pairs.
Transactions of the Association for
Computational Linguistics, 4:259–272.
Young, Steve, Milica Gašić, Simon Keizer,
François Mairesse, Jost Schatzmann, Blaise
Thomson, and Kai Yu. 2010. The hidden
information state model: A practical
framework for POMDP-based spoken
dialogue management. Computer Speech &
Language, 24(2):150–174.
Zhou, Hao, Minlie Huang, Tianyang Zhang,
Xiaoyan Zhu, and Bing Liu. 2018.
Emotional chatting machine: Emotional
conversation generation with internal and
external memory. In Proceedings of the
Thirty-Second AAAI Conference on Artificial
Intelligence (AAAI-18), the 30th Innovative
Applications of Artificial Intelligence
(IAAI-18), and the 8th AAAI Symposium on
Educational Advances in Artificial
Intelligence (EAAI-18), pages 730–739,
New Orleans, LA.
Zhou, Xiangyang, Daxiang Dong, Hua Wu,
Shiqi Zhao, Dianhai Yu, Hao Tian, Xuan
Liu, and Rui Yan. 2016. Multi-view
response selection for human-computer
conversation. In Proceedings of the 2016
Conference on Empirical Methods in Natural
Language Processing, EMNLP 2016,
pages 372–381, Austin, TX.