Graph Convolutional Network with Sequential Attention for
Goal-Oriented Dialogue Systems

Suman Banerjee and Mitesh M. Khapra

Department of Computer Science and Engineering,
Robert Bosch Centre for Data Science and Artificial Intelligence (RBC-DSAI),
Indian Institute of Technology Madras, India
{suman, miteshk}@cse.iitm.ac.in

Abstract
Domain-specific goal-oriented dialogue systems typically require modeling three types of inputs, namely, (i) the knowledge base associated with the domain, (ii) the history of the conversation, which is a sequence of utterances, and (iii) the current utterance for which the response needs to be generated. While modeling these inputs, current state-of-the-art models such as Mem2Seq typically ignore the rich structure inherent in the knowledge graph and in the sentences of the conversation context. Inspired by the recent success of structure-aware Graph Convolutional Networks (GCNs) for various NLP tasks such as machine translation, semantic role labeling, and document dating, we propose a memory-augmented GCN for goal-oriented dialogues. Our model exploits (i) the entity relation graph in a knowledge base and (ii) the dependency graph associated with an utterance to compute richer representations for words and entities. Further, we take cognizance of the fact that in certain situations, such as when the conversation is in a code-mixed language, dependency parsers may not be available. We show that in such situations we could use the global word co-occurrence graph to enrich the representations of utterances. We experiment with four datasets: (i) the modified DSTC2 dataset, (ii) recently released code-mixed versions of the DSTC2 dataset in four languages, (iii) the Wizard-of-Oz style Cam676 dataset, and (iv) the Wizard-of-Oz style MultiWOZ dataset. On all four datasets our method outperforms existing methods, on a wide range of evaluation metrics.

1 Introduction

Goal-oriented dialogue systems that can assist humans in various day-to-day activities have widespread applications in several domains such as e-commerce, entertainment, healthcare, and so forth. For example, such systems can help humans in scheduling medical appointments, reserving restaurants, or booking tickets. From a modeling perspective, one clear advantage of dealing with domain-specific goal-oriented dialogues is that the vocabulary is typically limited, the utterances largely follow a fixed set of templates, and there is an associated domain knowledge that can be exploited. More specifically, there is some structure associated with the utterances as well as the knowledge base (KB).

More formally, the task here is to generate the next response given (i) the previous utterances in the conversation history, (ii) the current user utterance (known as the query), and (iii) the entities and their relationships in the associated knowledge base. Current state-of-the-art methods (Seo et al., 2017; Eric and Manning, 2017; Madotto et al., 2018) typically use variants of Recurrent Neural Networks (RNNs) (Elman, 1990) to encode the history and current utterance, or an external memory network (Sukhbaatar et al., 2015) to encode them along with the entities in the knowledge base. The encodings of the utterances and memory elements are then suitably combined using an attention network and fed to the decoder to generate the response, one word at a time. However, these methods do not exploit the structure in the knowledge base as defined by entity–entity relations or the structure in the utterances as defined by a dependency parse. Such structural information can be exploited to improve the performance of the system, as demonstrated by recent works on syntax-aware neural machine translation (Eriguchi et al., 2016; Bastings et al., 2017; Chen et al., 2017), semantic role labeling (Marcheggiani and Titov, 2017), and document dating (Vashishth et al., 2018), which use Graph Convolutional Networks (GCNs)

Transactions of the Association for Computational Linguistics, vol. 7, pp. 485–500, 2019. https://doi.org/10.1162/tacl_a_00284
Action Editor: Asli Celikyilmaz. Submission batch: 1/2019; Revision batch: 5/2019; Published 9/2019.
© 2019 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

(Defferrard et al., 2016; Duvenaud et al., 2015; Kipf and Welling, 2017) to exploit sentence structure.

In this work, we propose to use such graph structures for goal-oriented dialogues. In particular, we compute the dependency parse tree for each utterance in the conversation and use a GCN to capture the interactions between words. This allows us to capture interactions between distant words in the sentence as long as they are connected by a dependency relation. We also use GCNs to encode the entities of the KB, where the entities are treated as nodes and their relations as edges of the graph. Once we have a richer structure-aware representation for the utterances and the entities, we use a sequential attention mechanism to compute an aggregated context representation from the GCN node vectors of the query, history, and entities. Further, we note that in certain situations, such as when the conversation is in a code-mixed language or a language for which parsers are not available, it may not be possible to construct a dependency parse for the utterances. To overcome this, we construct a co-occurrence matrix from the entire corpus and use this matrix to impose a graph structure on the utterances. More specifically, we add an edge between two words in a sentence if they co-occur frequently in the corpus. Our experiments suggest that this simple strategy acts as a reasonable substitute for dependency parse trees.

We perform experiments with the modified DSTC2 (Bordes et al., 2017) dataset, which contains goal-oriented conversations for making restaurant reservations. We also use its recently released code-mixed versions (Banerjee et al., 2018), which contain code-mixed conversations in four different languages: Hindi, Bengali, Gujarati, and Tamil. We compare with recent state-of-the-art methods and show that, on average, the proposed model gives an improvement of 2.8 BLEU points and 2 ROUGE points. We also perform experiments on two human–human dialogue datasets of different sizes: (i) Cam676 (Wen et al., 2017): a small-scale dataset containing 676 dialogues from the restaurant domain; and (ii) MultiWOZ (Budzianowski et al., 2018): a large-scale dataset containing around 10k dialogues and spanning multiple domains for each dialogue. On these two datasets as well, we observe a similar trend, wherein our model outperforms existing methods.

Our contributions can be summarized as follows: (i) We use GCNs to incorporate structural information for encoding the query, history, and KB entities in goal-oriented dialogues; (ii) we use a sequential attention mechanism to obtain query-aware and history-aware context representations; (iii) we leverage co-occurrence frequencies and PPMI (positive pointwise mutual information) values to construct contextual graphs for code-mixed utterances; and (iv) we show that the proposed model obtains state-of-the-art results on four different datasets spanning five different languages.

2 Related Work

In this section, we review the previous work on goal-oriented dialogue systems and describe the introduction of GCNs in NLP.

Goal-Oriented Dialogue Systems: Initial goal-oriented dialogue systems (Young, 2000; Williams and Young, 2007) were based on dialogue state tracking (Williams et al., 2013; Henderson et al., 2014a,b) and included pipelined modules for natural language understanding, dialogue state tracking, policy management, and natural language generation. Wen et al. (2017) used neural networks for these intermediate modules but still lacked absolute end-to-end trainability. Such pipelined modules were restricted by the fixed slot-structure assumptions on the dialogue state and required per-module labeling. To mitigate this problem, Bordes et al. (2017) released a version of a goal-oriented dialogue dataset that focuses on the development of end-to-end neural models. Such models need to reason over the associated KB triples and generate responses directly from the utterances without any additional annotations. For example, Bordes et al. (2017) proposed a Memory Network (Sukhbaatar et al., 2015) based model to match the response candidates with the multi-hop attention-weighted representation of the conversation history and the KB triples in memory. Liu and Perez (2017) further added highway (Srivastava et al., 2015) and residual connections (He et al., 2016) to the memory network in order to regulate access to the memory blocks. Seo et al. (2017) developed a variant of the RNN cell that computes a refined representation of the query over multiple iterations before querying the memory. However, all these approaches retrieve the response from a set of

candidate responses, and such a candidate set is not easy to obtain for any new domain of interest. To account for this, Eric and Manning (2017) and Zhao et al. (2017) adapted RNN-based encoder-decoder models to generate appropriate responses instead of retrieving them from a candidate set. Eric et al. (2017) introduced a key-value memory network based generative model that integrates the underlying KB with RNN-based encode-attend-decode models. Madotto et al. (2018) used memory networks on top of the RNN decoder to tightly integrate KB entities with the decoder in order to generate more informative responses. However, as opposed to our work, all these works ignore the underlying structure of the entity–entity graph of the KB and the syntactic structure of the utterances.

GCNs in NLP: Recently, there has been an active interest in enriching existing encode-attend-decode models (Bahdanau et al., 2015) with structural information for various NLP tasks. Such structure is typically obtained from the constituency and/or dependency parse of sentences. The idea is to treat the output of a parser as a graph and use an appropriate network to capture the interactions between the nodes of this graph. For example, Eriguchi et al. (2016) and Chen et al. (2017) showed that incorporating such syntactic structures as Tree-LSTMs in the encoder can improve the performance of neural machine translation. Peng et al. (2017) use Graph-LSTMs to perform cross-sentence n-ary relation extraction and show that their formulation is applicable to any graph structure and that Tree-LSTMs can be thought of as a special case of it. In parallel, Graph Convolutional Networks (GCNs) (Duvenaud et al., 2015; Defferrard et al., 2016; Kipf and Welling, 2017) and their variants (Li et al., 2016) have emerged as state-of-the-art methods for computing representations of entities in a knowledge graph. They provide a more flexible way of encoding such graph structures by capturing multi-hop relationships between nodes. This has led to their adoption for various NLP tasks such as neural machine translation (Marcheggiani et al., 2018; Bastings et al., 2017), semantic role labeling (Marcheggiani and Titov, 2017), document dating (Vashishth et al., 2018), and question answering (Johnson, 2017; De Cao et al., 2019).


To the best of our knowledge, ours is the first work that uses GCNs to incorporate dependency structural information and the entity–entity graph structure in a single end-to-end neural model for goal-oriented dialogues. This is also the first work that incorporates contextual co-occurrence information for code-mixed utterances, for which no dependency structures are available.

3 Background

In this section, we describe GCNs (Kipf and Welling, 2017) for undirected graphs and then describe their syntactic versions, which work with directed labeled edges of dependency parse trees.

3.1 GCN for Undirected Graphs

Graph convolutional networks operate on a graph structure and compute representations for the nodes of the graph by looking at the neighborhood of each node. We can stack k layers of GCNs to account for neighbors that are k hops away from the current node. Formally, let G = (V, E) be an undirected graph, where V is the set of nodes (let |V| = n) and E is the set of edges. Let X ∈ R^{n×m} be the input feature matrix with n nodes, where each node x_u (u ∈ V) is represented by an m-dimensional feature vector. The output of a 1-layer GCN is the hidden representation matrix H ∈ R^{n×d}, where each d-dimensional representation of a node captures the interactions with its 1-hop neighbors. Each row of this matrix can be computed as:

$$h_v = \mathrm{ReLU}\Big(\sum_{u \in \mathcal{N}(v)} (W x_u + b)\Big), \quad \forall v \in V \qquad (1)$$

Here W ∈ R^{d×m} is the model parameter matrix, b ∈ R^d is the bias vector, and ReLU is the rectified linear unit activation function. N(v) is the set of neighbors of node v and is assumed to also include the node v itself, so that the previous representation of node v is also considered while computing its new hidden representation. To capture interactions with nodes that are multiple hops away, multiple layers of GCNs can be stacked together. Specifically, the representation of node v after the kth GCN layer can be formulated as:

$$h^{k+1}_v = \mathrm{ReLU}\Big(\sum_{u \in \mathcal{N}(v)} (W^k h^k_u + b^k)\Big), \quad \forall v \in V \qquad (2)$$

Here h^k_u is the representation of the uth node in the (k − 1)th GCN layer and h^1_u = x_u.
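To make the formulation above concrete, the following minimal NumPy sketch implements Equations 1 and 2 for a toy undirected graph; the graph, feature sizes, and randomly initialized parameters are illustrative assumptions and not taken from the paper.

```python
import numpy as np

def gcn_layer(H, A_hat, W, b):
    """One GCN hop (Equation 2): for every node v, sum W*h_u + b over
    neighbours u in N(v), where N(v) includes v itself, then apply ReLU."""
    # A_hat is the adjacency matrix with self-loops: A_hat[v, u] = 1 iff u in N(v)
    msg = H @ W.T + b          # (n, d): W h_u + b for every node u
    agg = A_hat @ msg          # (n, d): sum over the neighbours of each node
    return np.maximum(agg, 0)  # ReLU

# Toy undirected graph with n = 4 nodes and m = 5 input features.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 0],
              [1, 0, 0, 0]], dtype=float)
A_hat = A + np.eye(4)                            # add self-loops so N(v) contains v
X = rng.normal(size=(4, 5))                      # input feature matrix (n x m)
W1, b1 = rng.normal(size=(8, 5)), np.zeros(8)    # layer-1 parameters (d = 8)
W2, b2 = rng.normal(size=(8, 8)), np.zeros(8)    # layer-2 parameters

H1 = gcn_layer(X, A_hat, W1, b1)    # 1-hop representations (Equation 1)
H2 = gcn_layer(H1, A_hat, W2, b2)   # 2-hop representations (Equation 2, k = 2)
print(H2.shape)                     # (4, 8)
```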

3.2 Syntactic GCN

In a directed labeled graph G = (V, E), each edge between nodes u and v is represented by a triple (u, v, L(u, v)), where L(u, v) is the associated edge label. Marcheggiani and Titov (2017) modified GCNs to operate over directed labeled graphs, such as the dependency parse tree of a sentence. For such a tree, in order to allow information to flow from head to dependents and vice versa, they added inverse dependency edges from dependents to heads, such as (v, u, L(u, v)'), to E and made the model parameters and biases label specific. In their formulation,

$$h^{k+1}_v = \mathrm{ReLU}\Big(\sum_{u \in \mathcal{N}(v)} \big(W^k_{L(u,v)} h^k_u + b^k_{L(u,v)}\big)\Big) \qquad (3)$$

∀v ∈ V. Notice that unlike Equation 2, Equation 3 has parameters W^k_{L(u,v)} and b^k_{L(u,v)}, which are label specific. Suppose there are L different labels; then this formulation will require L weights and biases per GCN layer, resulting in a large number of parameters. To avoid this, the authors use only three sets of weights and biases per GCN layer (as opposed to L), depending on the direction in which the information flows. More specifically, W^k_{L(u,v)} = W^k_{dir(u,v)}, where dir(u, v) indicates whether information flows from u to v, from v to u, or u = v. In this work, we also make b^k_{L(u,v)} = b^k_{dir(u,v)} instead of having a separate bias per label. The final GCN formulation can thus be described as:

$$h^{k+1}_v = \mathrm{ReLU}\Big(\sum_{u \in \mathcal{N}(v)} \big(W^k_{dir(u,v)} h^k_u + b^k_{dir(u,v)}\big)\Big) \qquad (4)$$

∀v ∈ V.

4 Model

We first formally define the task of end-to-end goal-oriented dialogue generation. Each dialogue of t turns can be viewed as a succession of user utterances (U) and system responses (S) and can be represented as: (U_1, S_1, U_2, S_2, . . . , U_t, S_t). Along with these utterances, each dialogue is also accompanied by e KB triples that are relevant to that dialogue and can be represented as: (k_1, k_2, k_3, . . . , k_e). Each triple is of the form: (entity_1, relation, entity_2). These triples can be represented in the form of a graph G_K = (V_K, E_K), where V_K is the set of all entities and each edge in E_K is of the form (entity_1, entity_2, relation), where relation signifies the edge label. At any dialogue turn i, given (i) the dialogue history H = (U_1, S_1, U_2, . . . , S_{i−1}), (ii) the current user utterance as the query Q = U_i, and (iii) the associated knowledge graph G_K, the task is to generate the current response S_i that leads to a completion of the goal. As mentioned earlier, we exploit the graph structure in the KB and the syntactic structure in the utterances to generate appropriate responses. Toward this end, we propose a model with the following components for encoding these three types of inputs. The code for the model is released publicly.1

4.1 Query Encoder

The query Q = U_i is the ith (current) user utterance in the dialogue and contains |Q| tokens. We denote the embedding of the ith token in the query as q_i. We first compute the contextual representations of these tokens by passing them through a bidirectional RNN:

$$b_t = \mathrm{BiRNN}_Q(b_{t-1}, q_t) \qquad (5)$$

Now, consider the dependency parse tree of the query sentence, denoted by G_Q = (V_Q, E_Q). We use a query-specific GCN to operate on G_Q, which takes {b_i}_{i=1}^{|Q|} as the input to the first GCN layer. The node representation in the kth hop of the query-specific GCN is computed as:

$$c^{k+1}_v = \mathrm{ReLU}\Big(\sum_{u \in \mathcal{N}(v)} \big(W^k_{dir(u,v)} c^k_u + g^k_{dir(u,v)}\big)\Big) \qquad (6)$$

∀v ∈ V_Q. Here W^k_{dir(u,v)} and g^k_{dir(u,v)} are edge direction-specific query-GCN weights and biases for the kth hop, and c^1_u = b_u.
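As a concrete illustration of the direction-specific updates in Equations 4 and 6, the sketch below applies one such hop to the BiRNN states of a toy query; the dependency arcs, dimensions, and random parameters are assumptions made only for this example.

```python
import numpy as np

def directed_gcn_layer(H, edges, params):
    """One hop of the edge-direction-specific GCN (Equations 4 / 6).
    `edges` holds dependency arcs as (head, dependent) pairs; information is
    propagated along each arc, along its inverse, and through a self-loop,
    each with its own weight matrix and bias."""
    n, _ = H.shape
    d = params["self"][0].shape[0]
    out = np.zeros((n, d))
    for u, v in edges:
        W_f, b_f = params["forward"]          # u -> v (head to dependent)
        W_r, b_r = params["reverse"]          # v -> u (added inverse edge)
        out[v] += W_f @ H[u] + b_f
        out[u] += W_r @ H[v] + b_r
    W_s, b_s = params["self"]                 # u = v (self-loop)
    for v in range(n):
        out[v] += W_s @ H[v] + b_s
    return np.maximum(out, 0)                 # ReLU

# Toy query of 4 tokens with a made-up dependency structure.
rng = np.random.default_rng(1)
B = rng.normal(size=(4, 6))                   # BiRNN states b_t (Equation 5)
edges = [(1, 0), (1, 2), (2, 3)]              # (head, dependent) arcs
params = {k: (rng.normal(size=(6, 6)), np.zeros(6))
          for k in ("forward", "reverse", "self")}
C = directed_gcn_layer(B, edges, params)      # c_v after one hop (Equation 6)
print(C.shape)                                # (4, 6)
```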

4.2 Dialogue History Encoder

The history H of the dialogue contains |H| tokens and we denote the embedding of the ith token in the history by p_i. Again, we first compute the hidden representations of these tokens using a bidirectional RNN:

$$s_t = \mathrm{BiRNN}_H(s_{t-1}, p_t) \qquad (7)$$

We now compute a dependency parse tree for each sentence in the history and collectively represent all the trees as a single graph G_H = (V_H, E_H).

1 https://github.com/sumanbanerjee1/GCN-SeA.


Figure 1: Illustration of the GCN and RNN+GCN modules which are used as encoders in our model. The notations are specific to the dialogue history encoder, but both encoders are similar for the query. We use only the GCN encoder for the KB.

Note that this graph will only contain edges between words belonging to the same sentence; there will be no edges between words across sentences. We then use a history-specific GCN to operate on G_H, which takes s_t as the input to the first layer. The node representation in the kth hop of the history-specific GCN is computed as:

$$a^{k+1}_v = \mathrm{ReLU}\Big(\sum_{u \in \mathcal{N}(v)} \big(V^k_{dir(u,v)} a^k_u + o^k_{dir(u,v)}\big)\Big) \qquad (8)$$

∀v ∈ V_H. Here V^k_{dir(u,v)} and o^k_{dir(u,v)} are edge direction-specific history-GCN weights and biases in the kth hop, and a^1_u = s_u. Such an encoder with a single hop of GCN is illustrated in Figure 1(b), and the encoder without the BiRNN is depicted in Figure 1(a).
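The following small sketch illustrates how the per-sentence dependency trees of the history could be merged into the single graph G_H described above, with token indices offset per sentence and no edges added across sentences; the example parses are hypothetical.

```python
def history_graph(sentence_parses):
    """Sketch of the combined history graph G_H of Section 4.2: the dependency
    tree of every sentence is kept as-is, token indices are offset by the
    length of the preceding sentences, and no edge is added across sentences."""
    edges, offset = [], 0
    for n_tokens, arcs in sentence_parses:        # arcs: (head, dependent) pairs
        edges.extend((offset + h, offset + d) for h, d in arcs)
        offset += n_tokens
    return offset, edges                          # |H| tokens and their edges

# Two assumed parses: a 3-token and a 4-token sentence from the history.
n_tokens, edges = history_graph([(3, [(1, 0), (1, 2)]),
                                 (4, [(0, 1), (1, 2), (2, 3)])])
print(n_tokens, edges)   # 7 [(1, 0), (1, 2), (3, 4), (4, 5), (5, 6)]
```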

4.3 KB Encoder
As mentioned earlier, GK = (VK, EK) is the graph
capturing the interactions between the entities in
the knowledge graph associated with the dialogue.
Let there be m such entities and we denote the
embedding of the node corresponding to the ith
entity as ei. We then operate a KB-specific GCN

on these entity representations to obtain refined representations that capture relations between entities. The node representation in the kth hop of the KB-specific GCN is computed as:

$$r^{k+1}_v = \mathrm{ReLU}\Big(\sum_{u \in \mathcal{N}(v)} \big(U^k_{dir(u,v)} r^k_u + z^k_{dir(u,v)}\big)\Big) \qquad (9)$$

∀v ∈ V_K. Here U^k_{dir(u,v)} and z^k_{dir(u,v)} are edge direction-specific KB-GCN weights and biases in the kth hop, and r^1_u = e_u. We also add inverse edges to E_K, similar to the case of syntactic GCNs, in order to allow information flow in both directions for an entity pair in the knowledge graph.
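As an illustration of how the KB graph G_K with inverse edges can be built from the triples before applying Equation 9, consider the sketch below; the triples are borrowed from the example in Table 9, and the edge representation is an assumption of this sketch.

```python
# Sketch: turn KB triples (entity1, relation, entity2) into the entity graph
# G_K = (V_K, E_K), adding an inverse edge for every relation so that
# information can flow in both directions (Section 4.3).
triples = [
    ("pizza_hut_city_centre", "r_cuisine", "italian"),
    ("pizza_hut_city_centre", "r_price", "cheap"),
    ("pizza_hut_city_centre", "r_location", "centre"),
]

entities = sorted({e for e1, _, e2 in triples for e in (e1, e2)})
idx = {e: i for i, e in enumerate(entities)}      # node index for each entity

edges = []                                        # (u, v, direction) tuples
for e1, rel, e2 in triples:
    edges.append((idx[e1], idx[e2], "forward"))   # entity1 -> entity2
    edges.append((idx[e2], idx[e1], "reverse"))   # added inverse edge

print(len(entities), "entities,", len(edges), "directed edges")
# These edges, together with an embedding e_i for each node, would be fed to a
# direction-specific GCN layer exactly like the one sketched for the query.
```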

4.4 Sequential Attention

We use an RNN decoder to generate the tokens of the response, and let the hidden states of the decoder be denoted as {d_t}_{t=1}^{T}, where T is the total number of decoder time steps. In order to obtain a single representation of the node vectors from the final layer (k = f) of the query-GCN, we use an attention mechanism as described below:

$$\mu_{jt} = v_1^T \tanh(W_1 c^f_j + W_2 d_{t-1}) \qquad (10)$$
$$\alpha_t = \mathrm{softmax}(\mu_t) \qquad (11)$$
$$h^Q_t = \sum_{j'=1}^{|Q|} \alpha_{j't} \, c^f_{j'} \qquad (12)$$

Here v_1, W_1, and W_2 are parameters. Further, at each decoder time step, we obtain a query-aware representation from the final layer of the history-GCN by computing an attention score for each node/token in the history based on the query context vector h^Q_t, as shown below:

$$\nu_{jt} = v_2^T \tanh(W_3 a^f_j + W_4 d_{t-1} + W_5 h^Q_t) \qquad (13)$$
$$\beta_t = \mathrm{softmax}(\nu_t) \qquad (14)$$
$$h^H_t = \sum_{j'=1}^{|H|} \beta_{j't} \, a^f_{j'} \qquad (15)$$

Here v_2, W_3, W_4, and W_5 are parameters. Finally, we obtain a query- and history-aware representation of the KB by computing an attention score over all the nodes in the final layer of the KB-GCN using h^Q_t and h^H_t, as shown below:

$$\omega_{jt} = v_3^T \tanh(W_6 r^f_j + W_7 d_{t-1} + W_8 h^Q_t + W_9 h^H_t) \qquad (16)$$
$$\gamma_t = \mathrm{softmax}(\omega_t) \qquad (17)$$
$$h^K_t = \sum_{j'=1}^{m} \gamma_{j't} \, r^f_{j'} \qquad (18)$$

Here v_3, W_6, W_7, W_8, and W_9 are parameters. This sequential attention mechanism is illustrated in Figure 2. For simplicity, we depict the GCN and RNN+GCN encoders as blocks. The internal structure of these blocks is shown in Figure 1.
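The sketch below traces Equations 10–18 for a single decoder time step in NumPy; the dimensions and randomly initialized parameters are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sequential_attention(C, A, R, d_prev, p):
    """One decoder step of the sequential attention of Equations 10-18.
    C, A, R are the final-layer GCN node vectors of the query, history and KB;
    d_prev is the previous decoder state; p holds randomly initialised
    parameters in this sketch."""
    # Query attention (Equations 10-12).
    mu = np.array([p["v1"] @ np.tanh(p["W1"] @ c + p["W2"] @ d_prev) for c in C])
    hQ = softmax(mu) @ C
    # Query-aware history attention (Equations 13-15).
    nu = np.array([p["v2"] @ np.tanh(p["W3"] @ a + p["W4"] @ d_prev + p["W5"] @ hQ)
                   for a in A])
    hH = softmax(nu) @ A
    # Query- and history-aware KB attention (Equations 16-18).
    om = np.array([p["v3"] @ np.tanh(p["W6"] @ r + p["W7"] @ d_prev
                                     + p["W8"] @ hQ + p["W9"] @ hH) for r in R])
    hK = softmax(om) @ R
    return hQ, hH, hK

# Toy sizes: 4 query tokens, 10 history tokens, 6 KB entities, dimension 8.
rng = np.random.default_rng(2)
dim = 8
p = {f"W{i}": rng.normal(size=(dim, dim)) for i in range(1, 10)}
p.update({f"v{i}": rng.normal(size=dim) for i in range(1, 4)})
hQ, hH, hK = sequential_attention(rng.normal(size=(4, dim)),
                                  rng.normal(size=(10, dim)),
                                  rng.normal(size=(6, dim)),
                                  rng.normal(size=dim), p)
print(hQ.shape, hH.shape, hK.shape)   # (8,) (8,) (8,)
```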

4.5 Decoder

The decoder is conditioned on two components: (i) the context, which contains the history and the KB, and (ii) the query, which is the last/previous utterance in the dialogue. We use an aggregator that learns the overall attention to be given to the history and KB components. These attention scores, θ^H_t and θ^K_t, are dependent on the respective context vectors and the previous decoder state d_{t−1}. The final context vector is obtained as:

$$h^C_t = \theta^H_t h^H_t + \theta^K_t h^K_t \qquad (19)$$
$$h^{final}_t = [h^C_t ; h^Q_t] \qquad (20)$$

where [;] denotes the concatenation operator. At every time step, the decoder then computes a probability distribution over the vocabulary using the following equations:

$$d_t = \mathrm{RNN}(d_{t-1}, [h^{final}_t ; w_t]) \qquad (21)$$
$$P_{vocab} = \mathrm{softmax}(V' d_t + b') \qquad (22)$$

where w_t is the decoder input at time step t, and V' and b' are parameters. P_{vocab} gives us a probability distribution over the entire vocabulary, and the loss for time step t is l_t = − log P_{vocab}(w*_t), where w*_t is the tth word in the ground truth response. The total loss is an average of the per-time-step losses.
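A minimal sketch of one decoding step following Equations 19–22. The paper uses GRU cells and does not spell out how the aggregator weights θ^H_t and θ^K_t are parameterized, so the plain tanh RNN cell and the softmax-based aggregator below are assumptions made purely for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decoder_step(hQ, hH, hK, d_prev, w_prev, p):
    """One decoding step following Equations 19-22. The aggregator scores are
    computed here from the context vectors and previous decoder state with a
    simple linear scoring function (an assumed parameterisation)."""
    scores = np.array([p["uH"] @ np.concatenate([hH, d_prev]),
                       p["uK"] @ np.concatenate([hK, d_prev])])
    theta_H, theta_K = softmax(scores)
    hC = theta_H * hH + theta_K * hK                   # Equation 19
    h_final = np.concatenate([hC, hQ])                 # Equation 20
    rnn_in = np.concatenate([h_final, w_prev])
    d_t = np.tanh(p["W_rnn"] @ rnn_in + p["U_rnn"] @ d_prev)   # simple RNN cell
    p_vocab = softmax(p["V"] @ d_t + p["b"])           # Equation 22
    return d_t, p_vocab

rng = np.random.default_rng(3)
dim, emb, vocab = 8, 5, 20
p = {"uH": rng.normal(size=2 * dim), "uK": rng.normal(size=2 * dim),
     "W_rnn": rng.normal(size=(dim, 2 * dim + emb)),
     "U_rnn": rng.normal(size=(dim, dim)),
     "V": rng.normal(size=(vocab, dim)), "b": np.zeros(vocab)}
d_t, p_vocab = decoder_step(rng.normal(size=dim), rng.normal(size=dim),
                            rng.normal(size=dim), rng.normal(size=dim),
                            rng.normal(size=emb), p)
print(round(p_vocab.sum(), 6))   # 1.0
```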

4.6 Contextual Graph Creation

For the dialogue history and query encoder, we used the dependency parse tree for capturing structural information in the encodings. However, if the conversations occur in a language for which no dependency parsers exist, for example, code-mixed languages like Hinglish (Hindi–English) (Banerjee et al., 2018), then we need an alternate way of extracting a graph structure from the utterances. One simple solution that has worked well in practice is to create a word co-occurrence matrix from the entire corpus, where the context window is an entire sentence. Once we have such a co-occurrence matrix, for a given sentence we can connect an edge between two words if their co-occurrence frequency is above a threshold value. The co-occurrence matrix can either contain co-occurrence frequency counts or positive pointwise mutual information (PPMI) values (Church and Hanks, 1990; Dagan et al., 1993; Niwa and Nitta, 1994).
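The sketch below shows one way the contextual graph could be built from sentence-level co-occurrence statistics with PPMI, as described above; the tiny corpus, tokenization, and threshold are illustrative assumptions.

```python
import math
from collections import Counter
from itertools import combinations

def contextual_graph(sentences, threshold=0.0):
    """Sketch of the contextual-graph construction of Section 4.6: count
    sentence-level co-occurrences over the corpus, convert them to PPMI
    values, and connect two words of a sentence if their PPMI exceeds a
    threshold. The threshold and tokenisation are illustrative choices."""
    word_count, pair_count, n_sent = Counter(), Counter(), 0
    for sent in sentences:
        toks = set(sent.split())                  # context window = sentence
        n_sent += 1
        word_count.update(toks)
        pair_count.update(frozenset(p) for p in combinations(sorted(toks), 2))

    def ppmi(w1, w2):
        p_xy = pair_count[frozenset((w1, w2))] / n_sent
        if p_xy == 0:
            return 0.0
        p_x, p_y = word_count[w1] / n_sent, word_count[w2] / n_sent
        return max(0.0, math.log(p_xy / (p_x * p_y)))

    def edges_for(sentence):
        # Connect two token positions if the corpus-level PPMI of the words is high.
        toks = sentence.split()
        return [(i, j) for i, j in combinations(range(len(toks)), 2)
                if ppmi(toks[i], toks[j]) > threshold]

    return edges_for

corpus = ["book a cheap italian restaurant",
          "cheap italian food please",
          "book a table for two"]
edges_for = contextual_graph(corpus, threshold=0.1)
print(edges_for("cheap italian restaurant please"))
```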

5 Experimental Setup

In this section, we describe the datasets used in our experiments, the various hyperparameters that we considered, and the models that we compared.

5.1 Datasets

The original DSTC2 dataset (Henderson et al., 2014a) was based on the task of restaurant table reservation and contains transcripts of real conversations between humans and bots. The utterances were labeled with dialogue state annotations such as the semantic intent representation, requested slots, and the constraints on the slot values. We report our results on the modified DSTC2 dataset of Bordes et al. (2017), where

Figure 2: Illustration of the sequential attention mechanism in RNN+GCN-SeA.

such annotations are removed and only the raw utterance–response pairs are present, with an associated set of KB triples for each dialogue. It contains around 1,618 training dialogues, 500 validation dialogues, and 1,117 test dialogues. For our experiments with contextual graphs, we report our results on the code-mixed versions of modified DSTC2, which were recently released by Banerjee et al. (2018). This dataset has been collected by code-mixing the utterances of the English version of modified DSTC2 (En-DSTC2) in four languages: Hindi (Hi-DSTC2), Bengali (Be-DSTC2), Gujarati (Gu-DSTC2), and Tamil (Ta-DSTC2), via crowdsourcing. We also perform experiments on two goal-oriented dialogue datasets that contain conversations between humans, wherein the conversations were collected in a Wizard-of-Oz (WOZ) manner. Specifically, we use the Cam676 dataset (Wen et al., 2017), which contains 676 KB-grounded dialogues from the restaurant domain


Model                                        per-resp. acc   BLEU   ROUGE-1   ROUGE-2   ROUGE-L   Entity F1
Rule-Based (Bordes et al., 2017)                  33.3          -       -         -         -          -
MEMNN (Bordes et al., 2017)                       41.1          -       -         -         -          -
QRN (Seo et al., 2017)                            50.7          -       -         -         -          -
GMEMNN (Liu and Perez, 2017)                      48.7          -       -         -         -          -
Seq2Seq-Attn (Bahdanau et al., 2015)              46.0         57.3    67.2      56.0      64.9       67.1
Seq2Seq-Attn+Copy (Eric and Manning, 2017)        47.3         55.4     -         -         -         71.6
HRED (Serban et al., 2016)                        48.9         58.4    67.9      57.6      65.7       75.6
Mem2Seq (Madotto et al., 2018)                    45.0         55.3     -         -         -         75.3
GCN-SeA                                           47.1         59.0    67.4      57.1      65.0       71.9
RNN+CROSS-GCN-SeA                                 51.2         60.9    69.4      59.9      67.2       78.1
RNN+GCN-SeA                                       51.4         61.2    69.6      60.2      67.4       77.9

Table 1: Comparison of RNN+GCN-SeA with other models on the English version of modified DSTC2.

and the MultiWOZ (Budzianowski et al., 2018) dataset, which contains 10,438 dialogues.

5.2 Hyperparameters

We used the same train, test, and validation splits as provided in the original versions of the datasets. We minimized the cross-entropy loss using the Adam optimizer (Kingma and Ba, 2015) and tuned the initial learning rates in the range of 0.0006 to 0.001. For regularization we used an L2 penalty of 0.001 in addition to a dropout (Srivastava et al., 2014) of 0.1. We used randomly initialized word embeddings of size 300. The RNN and GCN hidden dimensions were also chosen to be 300. We used GRU (Cho et al., 2014) cells for the RNNs. All parameters were initialized from a truncated normal distribution with a standard deviation of 0.1.

5.3 Models Compared

We compare the performance of the following models.

(i) RNN+GCN-SeA vs GCN-SeA: We use RNN+GCN-SeA to refer to the model described in Section 4. Instead of using the hidden representations obtained from the bidirectional RNNs, we also experiment by providing the token embeddings directly to the GCNs—that is, c^1_u = q_u in Equation 6 and a^1_u = p_u in Equation 8. We refer to this model as GCN-SeA.

(ii) Cross edges between the GCNs: In addition to the dependency and contextual edges, we add edges between words in the dialogue history/query and KB entities if a history/query word exactly matches the KB entity. Such edges create a single connected graph that is encoded using a single GCN encoder and then separated into different contexts to compute the sequential attention. This model is referred to as RNN+CROSS-GCN-SeA.

(iii) GCN-SeA+Random vs GCN-SeA+Structure: We experiment with a model where the graph is constructed by randomly connecting edges between two words in a context. We refer to this model as GCN-SeA+Random. We refer to the model that uses either dependency or contextual graphs instead of random graphs as GCN-SeA+Structure.

6 Results and Discussions

In this section, we discuss the results of our experiments as summarized in Tables 1–5. We use BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) metrics to evaluate the generation quality of the responses. We also report the per-response accuracy, which computes the percentage of responses in which the generated response exactly matches the ground truth response. To evaluate the model's capability of correctly injecting entities into the generated response, we report the entity F1 measure as defined in Eric and Manning (2017).
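For concreteness, the sketch below computes per-response accuracy and a simplified entity F1 in the spirit of the description above; the exact matching and normalization rules of Eric and Manning (2017) are not reproduced here, so treat this as an approximation with made-up example strings.

```python
def per_response_accuracy(predictions, references):
    """Fraction of generated responses that exactly match the ground truth."""
    exact = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * exact / len(references)

def entity_f1(predictions, references, kb_entities):
    """Micro-averaged entity F1: precision/recall of KB entities appearing in
    the generated response against those in the ground truth (a simplified
    reading of the metric of Eric and Manning, 2017)."""
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        pred_ents = {e for e in kb_entities if e in pred}
        ref_ents = {e for e in kb_entities if e in ref}
        tp += len(pred_ents & ref_ents)
        fp += len(pred_ents - ref_ents)
        fn += len(ref_ents - pred_ents)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 200.0 * prec * rec / (prec + rec) if prec + rec else 0.0

preds = ["pizza_hut_city_centre serves cheap italian food"]
refs = ["pizza_hut_city_centre serves italian food in the cheap price range"]
kb = {"pizza_hut_city_centre", "italian", "cheap"}
print(per_response_accuracy(preds, refs), entity_f1(preds, refs, kb))
```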

Results on En-DSTC2: We compare our model
with the previous works on the English version
of modified DSTC2 in Table 1. For most of the
retrieval-based models, the BLEU or ROUGE
scores are not available as they select a candidate



Dataset     Model                    per-resp. acc   BLEU   ROUGE-1   ROUGE-2   ROUGE-L   Entity F1
Hi-DSTC2    Seq2Seq-Bahdanau Attn        48.0         55.1    62.9      52.5      61.0       74.3
            HRED                         47.2         55.3    63.4      52.7      61.5       71.3
            Mem2Seq                      43.1         50.2    55.5      48.1      54.0       73.8
            GCN-SeA                      47.0         56.0    65.0      55.3      63.0       72.4
            RNN+CROSS-GCN-SeA            47.2         56.4    64.7      54.9      62.6       73.5
            RNN+GCN-SeA                  49.2         57.1    66.4      56.8      64.4       75.9
Be-DSTC2    Seq2Seq-Bahdanau Attn        50.4         55.6    67.4      57.6      65.1       76.2
            HRED                         47.8         55.6    67.2      57.0      64.9       71.5
            Mem2Seq                      41.9         52.1    58.9      50.8      57.0       73.2
            GCN-SeA                      47.1         58.4    67.4      57.3      64.9       69.6
            RNN+CROSS-GCN-SeA            50.4         59.1    68.3      58.9      65.9       74.9
            RNN+GCN-SeA                  50.3         59.2    69.0      59.4      66.6       75.1
Gu-DSTC2    Seq2Seq-Bahdanau Attn        47.7         54.5    64.8      54.9      62.6       71.3
            HRED                         48.0         54.7    65.4      55.2      63.3       71.8
            Mem2Seq                      43.1         48.9    55.7      48.6      54.2       75.5
            GCN-SeA                      48.1         55.7    65.5      56.2      63.5       72.2
            RNN+CROSS-GCN-SeA            49.4         56.9    66.4      57.2      64.3       73.4
            RNN+GCN-SeA                  48.9         56.7    66.1      56.9      64.1       73.0
Ta-DSTC2    Seq2Seq-Bahdanau Attn        49.3         62.9    67.8      56.3      65.6       77.7
            HRED                         47.8         61.5    66.9      55.2      64.8       74.4
            Mem2Seq                      44.2         58.9    58.6      50.8      57.0       74.9
            GCN-SeA                      46.4         62.8    68.5      57.5      66.1       71.9
            RNN+CROSS-GCN-SeA            50.8         64.5    69.8      59.6      67.5       78.8
            RNN+GCN-SeA                  50.7         64.9    70.2      59.9      67.9       77.9

Table 2: Comparison of RNN+GCN-SeA with other models on all code-mixed datasets.

Model           Match   Success   BLEU   ROUGE-1   ROUGE-2   ROUGE-L
Seq2seq-Attn    85.29    48.53    18.81    40.41     24.69     48.11
HRED            83.82    44.12    19.38    39.93     24.09     48.25
GCN-SeA         85.29    21.32    18.48    40.29     25.15     47.69
RNN+GCN-SeA     94.12    45.59    21.62    42.35     27.69     50.49

Table 3: Comparison of our models with the baselines on the Cam676 dataset.

from a list of candidates as opposed to generating it. Our model outperforms all of the retrieval and generation-based models. We obtain a gain of 0.7 in the per-response accuracy compared with the previous retrieval-based state-of-the-art model of Seo et al. (2017), which is a very strong baseline for our generation-based model. We call this a strong baseline because the candidate selection task of this model is easier than the response generation task of our model. We also obtain a gain of 2.8 BLEU points, 2 ROUGE points, and 2.5 entity F1 points compared with current state-of-the-art generation-based models.

Results on code-mixed datasets and effect of using RNNs: The results of our experiments on the code-mixed datasets are reported in Table 2. Our model outperforms the baseline models on all the code-mixed languages. One common observation from the results over all the languages is that RNN+GCN-SeA performs better than GCN-SeA. Similar observations were made by Marcheggiani and Titov (2017) for semantic role labeling.

Results on Cam676 dataset: The results of
our experiments on the Cam676 dataset are
reported in Table 3. In order to evaluate goal-
completeness, we use two additional metrics as


Single Domain Dialogues (SNG)
Model                 Match   Success   BLEU   ROUGE-1   ROUGE-2   ROUGE-L
Seq2seq-Attn          68.16    36.77    11.53    35.30     13.44     28.28
HRED                  84.30    52.02    10.27    38.30     14.49     30.38
GCN-SeA               63.68    44.84    12.30    39.79     16.11     32.51
RNN+GCN-SeA-400d      86.10    59.19    11.73    38.76     15.22     30.93
RNN+GCN-SeA-100d      75.78    32.74    13.13    40.76     17.67     33.59

Multi-Domain Dialogues (MUL)
Model                 Match   Success   BLEU   ROUGE-1   ROUGE-2   ROUGE-L
Seq2seq-Attn          44.40    22.10    14.03    38.99     16.39     30.87
HRED                  66.40    37.70    12.75    40.57     16.83     31.98
GCN-SeA               57.40    37.90    14.16    42.40     19.03     34.25
RNN+GCN-SeA           62.20    40.30    15.85    43.40     19.63     35.15

Table 4: Comparison of our models with the baselines on the MultiWOZ dataset.

Dataset     Model                  per-resp. acc   BLEU   ROUGE-1   ROUGE-2   ROUGE-L   Entity F1
En-DSTC2    GCN-SeA+Random             45.9         57.8    67.1      56.5      64.8       72.2
            GCN-SeA+Structure          47.1         59.0    67.4      57.1      65.0       71.9
Hi-DSTC2    GCN-SeA+Random             44.4         54.9    63.1      52.9      60.9       67.2
            GCN-SeA+Structure          47.0         56.0    65.0      55.3      63.0       72.4
Be-DSTC2    GCN-SeA+Random             44.9         56.5    65.4      54.8      62.7       65.6
            GCN-SeA+Structure          47.1         58.4    67.4      57.3      64.9       69.6
Gu-DSTC2    GCN-SeA+Random             45.0         54.0    64.1      54.0      61.9       69.1
            GCN-SeA+Structure          48.1         55.7    65.5      56.2      63.5       72.2
Ta-DSTC2    GCN-SeA+Random             44.8         61.4    66.9      55.6      64.3       70.5
            GCN-SeA+Structure          46.4         62.8    68.5      57.5      66.1       71.9

Table 5: GCN-SeA with random graphs and dependency/contextual graphs on all DSTC2 datasets.

Figure 3: GCN-SeA with multiple hops on all DSTC2 datasets.


Dataset     Model                       per-resp. acc   BLEU   ROUGE-1   ROUGE-2   ROUGE-L   Entity F1
Hi-DSTC2    Seq2seq-Bahdanau Attn           48.0         55.1    62.9      52.5      61.0       74.3
            GCN-Bahdanau Attn               38.5         50.4    58.9      47.7      56.7       59.1
            RNN+GCN-Bahdanau Attn           47.1         56.0    65.1      55.2      62.9       72.2
            RNN-SeA                         45.8         55.9    65.1      55.5      63.1       71.8
            RNN+GCN-SeA                     49.2         57.1    66.4      56.8      64.4       75.9
Be-DSTC2    Seq2seq-Bahdanau Attn           50.4         55.6    67.4      57.6      65.1       76.2
            GCN-Bahdanau Attn               42.1         55.1    63.7      52.8      61.1       64.3
            RNN+GCN-Bahdanau Attn           47.0         57.7    67.0      57.4      64.6       70.9
            RNN-SeA                         46.8         58.5    67.6      58.1      65.1       71.9
            RNN+GCN-SeA                     50.3         59.2    69.0      59.4      66.6       75.1
Gu-DSTC2    Seq2seq-Bahdanau Attn           47.7         54.5    64.8      54.9      62.6       71.3
            GCN-Bahdanau Attn               38.8         49.5    59.2      48.3      56.8       58.0
            RNN+GCN-Bahdanau Attn           46.5         55.5    65.6      55.9      63.4       70.6
            RNN-SeA                         45.4         56.0    66.0      56.6      63.9       69.8
            RNN+GCN-SeA                     48.9         56.7    66.1      56.9      64.1       73.0
Ta-DSTC2    Seq2seq-Bahdanau Attn           49.3         62.9    67.8      56.3      65.6       77.7
            GCN-Bahdanau Attn               42.0         59.3    64.8      52.8      62.1       69.7
            RNN+GCN-Bahdanau Attn           46.3         63.2    68.0      57.2      65.6       72.1
            RNN-SeA                         46.8         64.0    69.3      59.0      67.1       74.2
            RNN+GCN-SeA                     50.7         64.9    70.2      59.9      67.9       77.9
En-DSTC2    Seq2seq-Bahdanau Attn           46.0         57.3    67.2      56.0      64.9       67.1
            GCN-Bahdanau Attn               45.7         58.1    66.5      55.9      64.1       70.1
            RNN+GCN-Bahdanau Attn           47.4         59.5    67.9      57.7      65.6       72.9
            RNN-SeA                         47.0         60.2    68.5      58.9      66.2       72.7
            RNN+GCN-SeA                     51.4         61.2    69.6      60.2      67.4       77.9

Table 6: Ablation results of various models on all versions of DSTC2.

Dataset     Model             per-resp. acc   BLEU   ROUGE-1   ROUGE-2   ROUGE-L   Entity F1
En-DSTC2    Query                  22.8        38.1    53.5      37.6      50.6       18.4
            Query + History        47.1        60.6    68.8      59.4      66.6       72.8
            Query + KB             41.4        55.8    63.7      52.4      60.9       63.5
Hi-DSTC2    Query                  22.5        37.5    50.9      37.8      48.4       11.1
            Query + History        45.5        55.9    65.3      55.7      63.3       69.8
            Query + KB             40.5        52.6    60.8      49.7      58.5       60.2
Be-DSTC2    Query                  22.7        37.9    51.9      38.0      49.0       10.6
            Query + History        45.7        57.4    67.1      57.4      64.6       69.9
            Query + KB             41.2        54.6    63.0      52.1      60.3       60.2
Gu-DSTC2    Query                  22.4        36.1    50.7      37.2      48.4       10.9
            Query + History        21.1        36.6    48.6      35.1      46.3       07.2
            Query + KB             40.1        50.6    60.9      50.1      58.7       59.5
Ta-DSTC2    Query                  22.8        39.3    53.6      39.0      50.6       18.8
            Query + History        45.8        63.1    68.9      58.4      66.5       72.6
            Query + KB             40.9        59.2    64.2      52.3      61.5       64.2

Table 7: Ablations on different parts of the encoder of RNN+GCN-SeA.


Dataset     Wins %   Losses %   Ties %
En-DSTC2     42.17     35.00     22.83
Cam676       29.00     43.66     27.33

Table 8: Human evaluation results showing wins, losses, and ties % on En-DSTC2 and Cam676.

used in the original paper (Wen et al., 2017), which introduced this dataset: (i) match rate: the number of times the correct entity was suggested by the model, and (ii) success rate: if the correct entity was suggested and the system provided all the requestable slots, then the dialogue results in a success. The results suggest that our model's responses are more fluent, as indicated by the BLEU and ROUGE scores. It also produces the correct entities according to the dialogue goals but fails to provide enough requestable slots. Note that the model described in the original paper (Wen et al., 2017) is not directly comparable to our work, as it uses an explicit belief tracker, which requires extra supervision/annotation about the belief state. However, for the sake of completeness, we would like to mention that their model using this extra supervision achieves a BLEU score of 23.69 and a success rate of 83.82%.

Results on MultiWOZ dataset: The results of our experiments on two versions of the MultiWOZ dataset are reported in Table 4. The first version (SNG) contains around 3K dialogues in which each dialogue involves only a single domain, and the second version (MUL) contains all 10k dialogues. The baseline models do not use an oracle belief state, as mentioned in Budzianowski et al. (2018), and are therefore comparable to our models. We observed that with a larger GCN hidden dimension (400d in Table 4) our model is able to provide the correct entities and requestable slots in SNG. On the other hand, with a smaller GCN hidden dimension (100d) we are able to generate fluent responses in SNG. On MUL, our model is able to generate fluent responses but struggles in providing the correct entity, mainly due to the increased complexity of multiple domains. However, our model still provides a high number of correct requestable slots, as shown by the success rate. This is because multiple domains (hotel, restaurant, attraction, hospital) have the same requestable slots (address, phone, postcode).
Effect of using hops: As we increased the number of hops of the GCNs (Figure 3), we observed a decrease in performance. One reason for such a drop in performance could be that the average utterance length is very small (7.76 words). Hence, there is not much scope for capturing distant neighborhood information, and more hops can add noisy information. The reduction is more prominent in contextual graphs, in which multi-hop neighbors can turn out to be dissimilar words in different sentences.

Effect of using random graphs: GCN-SeA+Random and GCN-SeA+Structure take the token embeddings directly instead of passing them through an RNN. This ensures that the difference in performance between the two models is not influenced by the RNN encodings. The results are shown in Table 5, and we observe a drop in performance for GCN-SeA+Random across all the languages. This shows that the dependency and contextual structures play an important role and cannot be replaced by random graphs.

Ablations: We experiment with replacing the sequential attention by Bahdanau attention (Bahdanau et al., 2015). We also experiment with various combinations of RNNs and GCNs as encoders. The results are reported in Table 6. We observed that GCNs do not outperform RNNs independently. In general, RNN-Bahdanau attention performs better than GCN-Bahdanau attention. The sequential attention mechanism outperforms Bahdanau attention, as observed from the following comparisons: (i) GCN-Bahdanau attention vs GCN-SeA, (ii) RNN-Bahdanau attention vs RNN-SeA (in BLEU and ROUGE), and (iii) RNN+GCN-Bahdanau attention vs RNN+GCN-SeA. Overall, the best results are always obtained by our final model, which combines RNN, GCN, and sequential attention. We also performed ablations by removing specific parts of the encoder. Specifically, we experiment with (i) the query encoder alone, (ii) the query + history encoder, and (iii) the query + KB encoder. The results shown in Table 7 suggest that the query and the KB are not enough to generate fluent responses and that the previous conversation history is essential.

Human evaluations: In order to evaluate the appropriateness of our model's responses compared to the baselines, we perform a human evaluation of the generated responses using in-house evaluators. We evaluated randomly chosen responses from 200 dialogues of En-DSTC2 and

Context:
Speaker 1: [silence] .
Speaker 2: hello, welcome to the cambridge
restaurant system . you can ask for restaurants by area,
price range or food type . how may i help you ?
Speaker 1: uh cheap restaurant malaysian food .
Speaker 2: i’m sorry but there is no restaurant
serving malaysian food .
Speaker 1: uh italian food .
Speaker 2: could you please repeat that ?
Speaker 1: cheap restaurant uh italian food .
Speaker 2: there are restaurants serving
italian in the cheap price range . what area would you like ?
Speaker 1: i dont care .

KB Triples:
ask, r cuisine, italian
ask, r location, centre
ask, r phone, ask phone
ask, r address, ask address
ask, r price, cheap
ask, r rating, 0
pizza hut city centre, r cuisine, italian
pizza hut city centre, r location, centre
pizza hut city centre, r phone,
pizza hut city centre phone
pizza hut city centre, r address,
pizza hut city centre address
pizza hut city centre, r price, cheap
pizza hut city centre, r rating, 9

Seq2seq + Attn: ask is a nice place in the west of town serving tasty italian food
HRED: pizza hut city centre serves italian food in the expensive price range
GCN-SeA: pizza hut city centre serves italian food in the cheap price range
RNN+GCN-SeA: pizza hut city centre is a great restaurant serving cheap italian food in the centre of town
RNN+CROSS-GCN-SeA: pizza hut city centre is a nice place in the centre of town serving tasty italian food

Table 9: Qualitative comparison of responses between the baselines and different versions of our model.

100 dialogues of Cam676 using the method of pairwise comparisons introduced in Serban et al. (2017). We chose the best baseline model for each dataset, namely, HRED for En-DSTC2 and Seq2seq+Attn for Cam676. We show each dialogue context to three different evaluators and ask them to select the most appropriate response in that context. The evaluators were given no information about which model generated which response. They were allowed to choose the option of a tie if they were not able to decide whether one model's response was better than the other's. The results reported in Table 8 suggest that our model's responses are favorable in the noisy contexts of spontaneous conversations, such as those exhibited in the DSTC2 dataset. However, in a WOZ setting for human–human dialogues, where the conversations are less spontaneous and contexts are properly established, both models generate appropriate responses.

Qualitative analysis: We show the generated responses of the baselines and of different versions of our model in Table 9. We see that the Seq2seq+Attn model is not able to suggest a restaurant with a high rating, whereas HRED gets the restaurant right but suggests an incorrect price range. However, RNN+GCN-SeA suggests the correct restaurant with the preferred attributes. Although GCN-SeA selects the correct restaurant, it does not provide the location in its response.

7 Conclusion

We showed that structure-aware representations are useful in goal-oriented dialogue and that our model outperforms existing methods on four dialogue datasets. We used GCNs to infuse structural information from dependency graphs and contextual graphs to enrich the representations of the dialogue context and the KB. We also proposed a sequential attention mechanism for combining the representations of (i) the query (current utterance), (ii) the conversation history, and (iii) the KB. Finally, we empirically showed that when dependency parsers are not available for certain languages, such as code-mixed languages, we can use word co-occurrence frequencies and PPMI values to extract a contextual graph and use such a graph with GCNs for improved performance.

Acknowledgments

We would like to thank the anonymous reviewers and the action editor for their insightful comments and suggestions. We would like to thank the Department of Computer Science and Engineering, IIT Madras, and the Robert Bosch Centre for Data Science and Artificial Intelligence (RBC-DSAI), IIT Madras, for providing the necessary resources. We would also like to thank Accenture Technology Labs, India, for supporting our work through their generous academic research grant.


References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA.

Suman Banerjee, Nikita Moghe, Siddhartha Arora, and Mitesh M. Khapra. 2018. A dataset for building code-mixed goal oriented conversation systems. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3766–3780.

Joost Bastings, Ivan Titov, Wilker Aziz, Diego Marcheggiani, and Khalil Simaan. 2017. Graph convolutional encoders for syntax-aware neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1957–1967.

Antoine Bordes, Y.-Lan Boureau, and Jason Weston. 2017. Learning end-to-end goal-oriented dialog. In Proceedings of the 5th International Conference on Learning Representations, ICLR 2017, Toulon.

Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic. 2018. MultiWOZ – A large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5016–5026, Brussels.

Huadong Chen, Shujian Huang, David Chiang, and Jiajun Chen. 2017. Improved neural machine translation with a syntax-aware encoder and decoder. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1936–1945.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1724–1734.

Kenneth Ward Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.

Ido Dagan, Shaul Marcus, and Shaul Markovitch. 1993. Contextual word similarity and estimation from sparse data. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 164–171, Columbus, OH.

Nicola De Cao, Wilker Aziz, and Ivan Titov. 2019. Question answering by reasoning across documents with graph convolutional networks. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2306–2317, Minneapolis, MN.

Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3844–3852. Curran Associates, Inc.

David K. Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alan Aspuru-Guzik, and Ryan P. Adams. 2015. Convolutional networks on graphs for learning molecular fingerprints. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2224–2232. Curran Associates, Inc.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Mihail Eric, Lakshmi Krishnan, Francois Charette, and Christopher D. Manning. 2017. Key-value retrieval networks for task-oriented dialogue. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 37–49, Saarbrücken.

Mihail Eric and Christopher Manning. 2017. A copy-augmented sequence-to-sequence architecture gives good performance on task-oriented dialogue. In Proceedings of the 15th

Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 468–473.

Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2016. Tree-to-sequence attentional neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 823–833.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, pages 770–778, Las Vegas, NV.

Matthew Henderson, Blaise Thomson, and Jason D. Williams. 2014a. The second dialog state tracking challenge. In Proceedings of the SIGDIAL 2014 Conference, The 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 263–272, Philadelphia, PA.

Matthew Henderson, Blaise Thomson, and Jason D. Williams. 2014b. The third dialog state tracking challenge. In 2014 IEEE Spoken Language Technology Workshop, SLT 2014, pages 324–329, South Lake Tahoe, NV.

Daniel D. Johnson. 2017. Learning graphical state transitions. In 5th International Conference on Learning Representations, ICLR 2017, Toulon.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA.

Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon.

Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S. Zemel. 2016. Gated graph sequence neural networks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81, Barcelona.

Fei Liu and Julien Perez. 2017. Gated end-to-end memory networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, 2017, pages 1–10, Valencia.

Andrea Madotto, Chien-Sheng Wu, and Pascale Fung. 2018. Mem2Seq: Effectively incorporating knowledge bases into end-to-end task-oriented dialog systems. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1468–1478.

Diego Marcheggiani, Joost Bastings, and Ivan Titov. 2018. Exploiting semantics in neural machine translation with graph convolutional networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 486–492.

Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1506–1515.

Yoshiki Niwa and Yoshihiko Nitta. 1994. Co-occurrence vectors from corpora vs. distance vectors from dictionaries. In Proceedings of the 15th Conference on Computational Linguistics – Volume 1, COLING '94, pages 304–309, Stroudsburg, PA.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, PA.

Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. 2017. Cross-sentence n-ary relation extraction with graph LSTMs. Transactions of the Association for Computational Linguistics, 5:101–115.

Minjoon Seo, Sewon Min, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Query-reduction networks for question answering. In 5th International Conference on Learning Representations, ICLR 2017, Toulon.


Iulian V. Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In Thirty-First AAAI Conference on Artificial Intelligence, page 1583.

Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 3776–3784, Phoenix, AZ.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.

Rupesh K. Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Training very deep networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2377–2385. Curran Associates, Inc.

Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, pages 2440–2448, Montreal.

Shikhar Vashishth, Shib Sankar Dasgupta, Swayambhu Nath Ray, and Partha Talukdar. 2018. Dating documents using graph convolution networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1605–1615.

Tsung-Hsien Wen, David Vandyke, Nikola Mrkšić, Milica Gašić, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 438–449, Valencia.

Jason D. Williams, Antoine Raux, Deepak Ramachandran, and Alan W. Black. 2013. The dialog state tracking challenge. In Proceedings of the SIGDIAL 2013 Conference, The 14th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 404–413, Metz.

Jason D. Williams and Steve J. Young. 2007. Partially observable Markov decision processes for spoken dialog systems. Computer Speech & Language, 21(2):393–422.

Steve J. Young. 2000. Probabilistic methods in spoken-dialogue systems. Philosophical Transactions: Mathematical, Physical and Engineering Sciences, 358(1769):1389–1402.

Tiancheng Zhao, Allen Lu, Kyusong Lee, and Maxine Eskenazi. 2017. Generative encoder-decoder models for task-oriented spoken dialog systems with chatting capability. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 27–36.

