Graph Convolutional Network with Sequential Attention for
Goal-Oriented Dialogue Systems

Suman Banerjee and Mitesh M. Khapra

Department of Computer Science and Engineering,
Robert Bosch Centre for Data Science and Artificial Intelligence (RBC-DSAI),
Indian Institute of Technology Madras, India
{suman, miteshk}@cse.iitm.ac.in

Abstract
Domain-specific goal-oriented dialogue systems typically require modeling three types of inputs, namely, (i) the knowledge base associated with the domain, (ii) the history of the conversation, which is a sequence of utterances, and (iii) the current utterance for which the response needs to be generated. While modeling these inputs, current state-of-the-art models such as Mem2Seq typically ignore the rich structure inherent in the knowledge graph and the sentences in the conversation context. Inspired by the recent success of structure-aware Graph Convolutional Networks (GCNs) for various NLP tasks such as machine translation, semantic role labeling, and document dating, we propose a memory-augmented GCN for goal-oriented dialogues. Our model exploits (i) the entity relation graph in a knowledge base and (ii) the dependency graph associated with an utterance to compute richer representations for words and entities. Further, we take cognizance of the fact that in certain situations, such as when the conversation is in a code-mixed language, dependency parsers may not be available. We show that in such situations we could use the global word co-occurrence graph to enrich the representations of utterances. We experiment with four datasets: (i) the modified DSTC2 dataset, (ii) recently released code-mixed versions of the DSTC2 dataset in four languages, (iii) the Wizard-of-Oz style Cam676 dataset, and (iv) the Wizard-of-Oz style MultiWOZ dataset. On all four datasets our method outperforms existing methods on a wide range of evaluation metrics.

1 Introduction

Goal-oriented dialogue systems that can assist humans in various day-to-day activities have widespread applications in several domains such as e-commerce, entertainment, healthcare, and so forth. For example, such systems can help humans schedule medical appointments, reserve restaurants, or book tickets. From a modeling perspective, one clear advantage of dealing with domain-specific goal-oriented dialogues is that the vocabulary is typically limited, the utterances largely follow a fixed set of templates, and there is associated domain knowledge that can be exploited. More specifically, there is some structure associated with the utterances as well as the knowledge base (KB).

More formally, the task here is to generate the next response given (i) the previous utterances in the conversation history, (ii) the current user utterance (known as the query), and (iii) the entities and their relationships in the associated knowledge base. Current state-of-the-art methods (Seo et al., 2017; Eric and Manning, 2017; Madotto et al., 2018) typically use variants of Recurrent Neural Networks (RNNs) (Elman, 1990) to encode the history and current utterance or an external memory network (Sukhbaatar et al., 2015) to encode them along with the entities in the knowledge base. The encodings of the utterances and memory elements are then suitably combined using an attention network and fed to the decoder to generate the response, one word at a time. However, these methods do not exploit the structure in the knowledge base as defined by entity–entity relations and the structure in the utterances as defined by a dependency parse. Such structural information can be exploited to improve the performance of the system, as demonstrated by recent works on syntax-aware neural machine translation (Eriguchi et al., 2016; Bastings et al., 2017; Chen et al., 2017), semantic role labeling (Marcheggiani and Titov, 2017), and document dating (Vashishth et al., 2018), which use Graph Convolutional Networks (GCNs) (Defferrard et al., 2016; Duvenaud et al., 2015; Kipf and Welling, 2017) to exploit sentence structure.

Transactions of the Association for Computational Linguistics, vol. 7, pp. 485–500, 2019. https://doi.org/10.1162/tacl_a_00284
Action Editor: Asli Celikyilmaz. Submission batch: 1/2019; Revision batch: 5/2019; Published 9/2019.
© 2019 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

In this work, we propose to use such graph
structures for goal-oriented dialogues. In partic-
ular, we compute the dependency parse tree for
each utterance in the conversation and use a
GCN to capture the interactions between words.
This allows us to capture interactions between
distant words in the sentence as long as they are
connected by a dependency relation. We also use
GCNs to encode the entities of the KB where the
entities are treated as nodes and their relations
as edges of the graph. Once we have a richer structure-aware representation for the utterances and the entities, we use a sequential attention mechanism to compute an aggregated context representation from the GCN node vectors of the query, history, and entities. Further, we note that in certain situations, such as when the conversation is in a code-mixed language or a language for which parsers are not available, it may not be possible to construct a dependency parse for the utterances. To overcome this, we construct a co-occurrence matrix from the entire corpus and use this matrix to impose a graph structure on the utterances. More specifically, we add an edge between two words in a sentence if they co-occur frequently in the corpus. Our experiments suggest that this simple strategy acts as a reasonable substitute for dependency parse trees.

We perform experiments with the modified DSTC2 (Bordes et al., 2017) dataset, which contains goal-oriented conversations for making restaurant reservations. We also use its recently released code-mixed versions (Banerjee et al., 2018), which contain code-mixed conversations in four different languages: Hindi, Bengali, Gujarati, and Tamil. We compare with recent state-of-the-art methods and show that on average, the proposed model gives an improvement of 2.8 BLEU points and 2 ROUGE points. We also perform experiments on two human–human dialogue datasets of different sizes: (i) Cam676 (Wen et al., 2017), a small-scale dataset containing 676 dialogues from the restaurant domain; and (ii) MultiWOZ (Budzianowski et al., 2018), a large-scale dataset containing around 10k dialogues and spanning multiple domains for each dialogue. On these two datasets as well, we observe a similar trend, wherein our model outperforms existing methods.

Our contributions can be summarized as follows: (i) We use GCNs to incorporate structural information for encoding query, history, and KB entities in goal-oriented dialogues; (ii) We use a sequential attention mechanism to obtain query-aware and history-aware context representations; (iii) We leverage co-occurrence frequencies and PPMI (positive pointwise mutual information) values to construct contextual graphs for code-mixed utterances; and (iv) We show that the proposed model obtains state-of-the-art results on four different datasets spanning five different languages.

2 Related Work

In this section, we review the previous work in
goal-oriented dialogue systems and describe the
introduction of GCNs in NLP.

Goal-Oriented Dialogue Systems: Initial goal-oriented dialogue systems (Young, 2000; Williams and Young, 2007) were based on dialogue state tracking (Williams et al., 2013; Henderson et al., 2014a,b) and included pipelined modules for natural language understanding, dialogue state tracking, policy management, and natural language generation. Wen et al. (2017) used neural networks for these intermediate modules, but their model still lacked absolute end-to-end trainability. Such pipelined modules were restricted by the fixed slot-structure assumptions on the dialogue state and required per-module based labeling. To mitigate this problem, Bordes et al. (2017) released a version of goal-oriented dialogue dataset that focuses on the development of end-to-end neural models. Such models need to reason over the associated KB triples and generate responses directly from the utterances without any additional annotations. For example, Bordes et al. (2017) proposed a Memory Network (Sukhbaatar et al., 2015) based model to match the response candidates with the multi-hop attention weighted representation of the conversation history and the KB triples in memory. Liu and Perez (2017) further added highway (Srivastava et al., 2015) and residual connections (He et al., 2016) to the memory network in order to regulate the access to the memory blocks. Seo et al. (2017) developed a variant of RNN cell that computes a refined representation of the query over multiple iterations before querying the memory. However, all these approaches retrieve the response from a set of candidate responses and such a candidate set is not easy to obtain for any new domain of interest. To account for this, Eric and Manning (2017) and Zhao et al. (2017) adapted RNN-based encoder-decoder models to generate appropriate responses instead of retrieving them from a candidate set. Eric et al. (2017) introduced a key-value memory network based generative model that integrates the underlying KB with RNN-based encode-attend-decode models. Madotto et al. (2018) used memory networks on top of the RNN decoder to tightly integrate KB entities with the decoder in order to generate more informative responses. However, as opposed to our work, all these works ignore the underlying structure of the entity–entity graph of the KB and the syntactic structure of the utterances.

GCNs in NLP: Recently, there has been an active interest in enriching existing encode-attend-decode models (Bahdanau et al., 2015) with structural information for various NLP tasks. Such structure is typically obtained from the constituency and/or dependency parse of sentences. The idea is to treat the output of a parser as a graph and use an appropriate network to capture the interactions between the nodes of this graph. For example, Eriguchi et al. (2016) and Chen et al. (2017) showed that incorporating such syntactical structures as Tree-LSTMs in the encoder can improve the performance of neural machine translation. Peng et al. (2017) use Graph-LSTMs to perform cross-sentence n-ary relation extraction and show that their formulation is applicable to any graph structure and Tree-LSTMs can be thought of as a special case of it. In parallel, Graph Convolutional Networks (GCNs) (Duvenaud et al., 2015; Defferrard et al., 2016; Kipf and Welling, 2017) and their variants (Li et al., 2016) have emerged as state-of-the-art methods for computing representations of entities in a knowledge graph. They provide a more flexible way of encoding such graph structures by capturing multi-hop relationships between nodes. This has led to their adoption for various NLP tasks such as neural machine translation (Marcheggiani et al., 2018; Bastings et al., 2017), semantic role labeling (Marcheggiani and Titov, 2017), document dating (Vashishth et al., 2018), and question answering (Johnson, 2017; De Cao et al., 2019).


To the best of our knowledge, ours is the first
work that uses GCNs to incorporate dependency
structural information and the entity–entity graph
structure in a single end-to-end neural model
for goal-oriented dialogues. This is also the first
work that incorporates contextual co-occurrence
information for code-mixed utterances, for which
no dependency structures are available.

3 Background

In this section, we describe GCNs (Kipf and
Welling, 2017) for undirected graphs and then
describe their syntactic versions, which work with
directed labeled edges of dependency parse trees.

3.1 GCN for Undirected Graphs

Graph convolutional networks operate on a graph structure and compute representations for the nodes of the graph by looking at the neighborhood of the node. We can stack $k$ layers of GCNs to account for neighbors that are $k$ hops away from the current node. Formally, let $G = (V, E)$ be an undirected graph, where $V$ is the set of nodes (let $|V| = n$) and $E$ is the set of edges. Let $X \in \mathbb{R}^{n \times m}$ be the input feature matrix with $n$ nodes, where each node $x_u$ ($u \in V$) is represented by an $m$-dimensional feature vector. The output of a 1-layer GCN is the hidden representation matrix $H \in \mathbb{R}^{n \times d}$ where each $d$-dimensional representation of a node captures the interactions with its 1-hop neighbors. Each row of this matrix can be computed as:

$$h_v = \mathrm{ReLU}\Big(\sum_{u \in \mathcal{N}(v)} (W x_u + b)\Big), \quad \forall v \in V \tag{1}$$

Here $W \in \mathbb{R}^{d \times m}$ is the model parameter matrix, $b \in \mathbb{R}^{d}$ is the bias vector, and ReLU is the rectified linear unit activation function. $\mathcal{N}(v)$ is the set of neighbors of node $v$ and is assumed to also include the node $v$ so that the previous representation of the node $v$ is also considered while computing its new hidden representation. To capture interactions with nodes that are multiple hops away, multiple layers of GCNs can be stacked together. Specifically, the representation of node $v$ after the $k$th GCN layer can be formulated as:

$$h_v^{k+1} = \mathrm{ReLU}\Big(\sum_{u \in \mathcal{N}(v)} (W^k h_u^k + b^k)\Big), \quad \forall v \in V \tag{2}$$

Here $h_u^k$ is the representation of the $u$th node in the $(k-1)$th GCN layer and $h_u^1 = x_u$.
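As a minimal sketch of the layer in equations 1 and 2, the following pure-Python example applies one and then two GCN hops to a three-node path graph. The helper names (`gcn_layer`, `neighbors_with_self_loops`) and the toy weights are illustrative assumptions, not part of the released code; the bias is added once per neighbor, as written in equation 1.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def neighbors_with_self_loops(edges, n):
    """Build N(v) for an undirected graph; N(v) also contains v itself."""
    neighbors = [{v} for v in range(n)]
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    return [sorted(s) for s in neighbors]


def gcn_layer(H, neighbors, W, b):
    """One hop: h_v' = ReLU(sum over u in N(v) of (W h_u + b))."""
    new_H = []
    for v in range(len(H)):
        total = [0.0] * len(b)
        for u in neighbors[v]:
            message = matvec(W, H[u])
            total = [t + m + bias for t, m, bias in zip(total, message, b)]
        new_H.append([max(0.0, t) for t in total])  # ReLU
    return new_H


# Toy path graph 0 - 1 - 2 with 2-dimensional node features (hypothetical).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity weights keep the arithmetic readable
b = [0.0, 0.0]
nbrs = neighbors_with_self_loops([(0, 1), (1, 2)], n=3)

H1 = gcn_layer(X, nbrs, W, b)   # each node sees its 1-hop neighbors
H2 = gcn_layer(H1, nbrs, W, b)  # stacking a second layer reaches 2 hops
```

After one hop, node 0 only aggregates itself and node 1; after the second hop its representation also depends on node 2, which is the multi-hop effect that stacking is meant to provide.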


3.2 Syntactic GCN

In a directed labeled graph $G = (V, E)$, each edge between nodes $u$ and $v$ is represented by a triple $(u, v, l(u, v))$ where $l(u, v)$ is the associated edge label. Marcheggiani and Titov (2017) modified GCNs to operate over directed labeled graphs, such as the dependency parse tree of a sentence. For such a tree, in order to allow information to flow from head to dependents and vice versa, they added inverse dependency edges from dependents to heads such as $(v, u, l(u, v)')$ to $E$ and made the model parameters and biases label specific. In their formulation,

$$h_v^{k+1} = \mathrm{ReLU}\Big(\sum_{u \in \mathcal{N}(v)} \big(W^k_{l(u,v)} h_u^k + b^k_{l(u,v)}\big)\Big), \quad \forall v \in V \tag{3}$$

Notice that unlike equation 2, equation 3 has parameters $W^k_{l(u,v)}$ and $b^k_{l(u,v)}$, which are label-specific. Suppose there are $L$ different labels, then this formulation will require $L$ weights and biases per GCN layer, resulting in a large number of parameters. To avoid this, the authors use only three sets of weights and biases per GCN layer (as opposed to $L$) depending on the direction in which the information flows. More specifically, $W^k_{l(u,v)} = W^k_{dir(u,v)}$, where $dir(u, v)$ indicates whether information flows from $u$ to $v$, $v$ to $u$, or $u = v$. In this work, we also make $b^k_{l(u,v)} = b^k_{dir(u,v)}$ instead of having a separate bias per label. The final GCN formulation can thus be described as:

$$h_v^{k+1} = \mathrm{ReLU}\Big(\sum_{u \in \mathcal{N}(v)} \big(W^k_{dir(u,v)} h_u^k + b^k_{dir(u,v)}\big)\Big) \tag{4}$$

4 Model

We first formally define the task of end-to-end goal-oriented dialogue generation. Each dialogue of $t$ turns can be viewed as a succession of user utterances ($U$) and system responses ($S$) and can be represented as $(U_1, S_1, U_2, S_2, \ldots, U_t, S_t)$. Along with these utterances, each dialogue is also accompanied by $e$ KB triples that are relevant to that dialogue and can be represented as $(k_1, k_2, k_3, \ldots, k_e)$. Each triple is of the form (entity1, relation, entity2). These triples can be represented in the form of a graph $G_K = (V_K, E_K)$ where $V_K$ is the set of all entities and each edge in $E_K$ is of the form (entity1, entity2, relation), where relation signifies the edge label. At any dialogue turn $i$, given (i) the dialogue history $H = (U_1, S_1, U_2, \ldots, S_{i-1})$, (ii) the current user utterance as the query $Q = U_i$, and (iii) the associated knowledge graph $G_K$, the task is to generate the current response $S_i$ that leads to a completion of the goal. As mentioned earlier, we exploit the graph structure in KB and the syntactic structure in the utterances to generate appropriate responses. Toward this end, we propose a model with the following components for encoding these three types of inputs. The code for the model is released publicly.¹

4.1 Query Encoder

The query $Q = U_i$ is the $i$th (current) user utterance in the dialogue and contains $|Q|$ tokens. We denote the embedding of the $i$th token in the query as $q_i$. We first compute the contextual representations of these tokens by passing them through a bidirectional RNN:

$$b_t = \mathrm{BiRNN}_Q(b_{t-1}, q_t) \tag{5}$$

Now, consider the dependency parse tree of the query sentence denoted by $G_Q = (V_Q, E_Q)$. We use a query-specific GCN to operate on $G_Q$, which takes $\{b_i\}_{i=1}^{|Q|}$ as the input to the first GCN layer. The node representation in the $k$th hop of the query-specific GCN is computed as:

$$c_v^{k+1} = \mathrm{ReLU}\Big(\sum_{u \in \mathcal{N}(v)} \big(W^k_{dir(u,v)} c_u^k + g^k_{dir(u,v)}\big)\Big), \quad \forall v \in V_Q \tag{6}$$

Here $W^k_{dir(u,v)}$ and $g^k_{dir(u,v)}$ are edge direction-specific query-GCN weights and biases for the $k$th hop and $c_u^1 = b_u$.
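The direction-specific hop used by the syntactic GCN and the query encoder (equations 4 and 6) can be sketched as follows. This is only an illustration under stated assumptions: the direction keys `"along"`, `"inverse"`, and `"self"`, the function name, and the toy parse of "book a table" are hypothetical, and 1-dimensional features with distinct scalar weights are used so that each contribution is easy to trace.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def directional_gcn_layer(H, dep_edges, params):
    """One hop of a GCN with one (W, b) pair per edge direction.

    H         : list of node vectors, one per token
    dep_edges : list of (head, dependent) index pairs from a parse
    params    : dict mapping "along", "inverse", "self" to (W, b);
                these three keys stand in for dir(u, v)
    """
    out_dim = len(params["self"][1])
    # incoming[v] lists (u, direction) for every u in N(v), including v.
    incoming = [[(v, "self")] for v in range(len(H))]
    for head, dep in dep_edges:
        incoming[dep].append((head, "along"))    # head -> dependent
        incoming[head].append((dep, "inverse"))  # added inverse edge
    new_H = []
    for v in range(len(H)):
        total = [0.0] * out_dim
        for u, direction in incoming[v]:
            W, b = params[direction]
            message = matvec(W, H[u])
            total = [t + m + bias for t, m, bias in zip(total, message, b)]
        new_H.append([max(0.0, t) for t in total])  # ReLU
    return new_H


# Hypothetical parse of "book a table": book -> table, table -> a.
tokens = ["book", "a", "table"]
dep_edges = [(0, 2), (2, 1)]
H = [[1.0], [2.0], [3.0]]
params = {
    "along": ([[1.0]], [0.0]),
    "inverse": ([[10.0]], [0.0]),
    "self": ([[100.0]], [0.0]),
}
H_next = directional_gcn_layer(H, dep_edges, params)
```

With these weights, the hundreds digit of each output shows the self contribution, the tens digit the inverse-edge contribution, and the units digit the head-to-dependent contribution.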

4.2 Dialogue History Encoder
The history $H$ of the dialogue contains $|H|$ tokens and we denote the embedding of the $i$th token in the history by $p_i$. Once again, we first compute the hidden representations of these tokens using a bidirectional RNN:

$$s_t = \mathrm{BiRNN}_H(s_{t-1}, p_t) \tag{7}$$

We now compute a dependency parse tree for each sentence in the history and collectively represent all the trees as a single graph $G_H = (V_H, E_H)$. Note that this graph will only contain edges between words belonging to the same sentence and there will be no edges between words across sentences. We then use a history-specific GCN to operate on $G_H$, which takes $s_t$ as the input to the first layer. The node representation in the $k$th hop of the history-specific GCN is computed as:

$$a_v^{k+1} = \mathrm{ReLU}\Big(\sum_{u \in \mathcal{N}(v)} \big(V^k_{dir(u,v)} a_u^k + o^k_{dir(u,v)}\big)\Big), \quad \forall v \in V_H \tag{8}$$

Here $V^k_{dir(u,v)}$ and $o^k_{dir(u,v)}$ are edge direction-specific history-GCN weights and biases in the $k$th hop and $a_u^1 = s_u$. Such an encoder with a single hop of GCN is illustrated in Figure 1(b) and the encoder without the BiRNN is depicted in Figure 1(a).

Figure 1: Illustration of the GCN and RNN+GCN modules which are used as encoders in our model. The notations are specific to the dialogue history encoder but both the encoders are similar for the query. We use only the GCN encoder for the KB.

¹ https://github.com/sumanbanerjee1/GCN-SeA

4.3 KB Encoder
As mentioned earlier, $G_K = (V_K, E_K)$ is the graph capturing the interactions between the entities in the knowledge graph associated with the dialogue. Let there be $m$ such entities and we denote the embedding of the node corresponding to the $i$th entity as $e_i$. We then operate a KB-specific GCN on these entity representations to obtain refined representations that capture relations between entities. The node representation in the $k$th hop of the KB-specific GCN is computed as:

$$r_v^{k+1} = \mathrm{ReLU}\Big(\sum_{u \in \mathcal{N}(v)} \big(U^k_{dir(u,v)} r_u^k + z^k_{dir(u,v)}\big)\Big), \quad \forall v \in V_K \tag{9}$$

Here $U^k_{dir(u,v)}$ and $z^k_{dir(u,v)}$ are edge direction-specific KB-GCN weights and biases in the $k$th hop and $r_u^1 = e_u$. We also add inverse edges to $E_K$ similar to the case of syntactic GCNs in order to allow information flow in both directions for an entity pair in the knowledge graph.
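To make the graph construction concrete, the sketch below turns a list of (entity1, relation, entity2) triples into node indices and directed edges, adding one inverse edge per triple. The function name, the `_inv` label suffix, and the example triples are hypothetical and chosen only for illustration.

```python
def build_kb_graph(triples):
    """Map KB triples to (entities, edges) with inverse edges added.

    triples : list of (entity1, relation, entity2) tuples
    returns : entities (list, index = node id) and a list of
              (source, target, label) edges
    """
    entities, index = [], {}
    for head, _, tail in triples:
        for entity in (head, tail):
            if entity not in index:
                index[entity] = len(entities)
                entities.append(entity)
    edges = []
    for head, relation, tail in triples:
        u, v = index[head], index[tail]
        edges.append((u, v, relation))           # original direction
        edges.append((v, u, relation + "_inv"))  # added inverse edge
    return entities, edges


# Hypothetical restaurant-domain triples.
triples = [
    ("pizza_hut", "R_cuisine", "italian"),
    ("pizza_hut", "R_area", "north"),
]
entities, edges = build_kb_graph(triples)
```

Each entity becomes one node whose embedding would play the role of $e_i$, and the doubled edge list is what lets a direction-specific GCN pass information both ways along every relation.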

4.4 Sequential Attention

We use an RNN decoder to generate the tokens of the response and let the hidden states of the decoder be denoted as $\{d_i\}_{i=1}^{T}$, where $T$ is the total number of decoder time steps. In order to obtain a single representation of the node vectors from the final layer ($k = f$) of the query-GCN, we use an attention mechanism as described below:

$$\mu_{jt} = v_1^T \tanh(W_1 c_j^f + W_2 d_{t-1}) \tag{10}$$

$$\alpha_t = \mathrm{softmax}(\mu_t) \tag{11}$$

$$h_t^Q = \sum_{j'=1}^{|Q|} \alpha_{j't}\, c_{j'}^f \tag{12}$$

Here $v_1$, $W_1$, and $W_2$ are parameters. Further, at each decoder time step, we obtain a query-aware representation from the final layer of the history-GCN by computing an attention score for each node/token in the history based on the query context vector $h_t^Q$ as shown below:

$$\nu_{jt} = v_2^T \tanh(W_3 a_j^f + W_4 d_{t-1} + W_5 h_t^Q) \tag{13}$$

$$\beta_t = \mathrm{softmax}(\nu_t) \tag{14}$$

$$h_t^H = \sum_{j'=1}^{|H|} \beta_{j't}\, a_{j'}^f \tag{15}$$

Here $v_2$, $W_3$, $W_4$, and $W_5$ are parameters. Finally, we obtain a query- and history-aware representation of the KB by computing an attention score over all the nodes in the final layer of KB-GCN using $h_t^Q$ and $h_t^H$ as shown below:

$$\omega_{jt} = v_3^T \tanh(W_6 r_j^f + W_7 d_{t-1} + W_8 h_t^Q + W_9 h_t^H) \tag{16}$$

$$\gamma_t = \mathrm{softmax}(\omega_t) \tag{17}$$

$$h_t^K = \sum_{j'=1}^{m} \gamma_{j't}\, r_{j'}^f \tag{18}$$

Here $v_3$, $W_6$, $W_7$, $W_8$, and $W_9$ are parameters. This sequential attention mechanism is illustrated in Figure 2. For simplicity, we depict the GCN and RNN+GCN encoders as blocks. The internal structure of these blocks is shown in Figure 1.
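The chain of three attention steps (equations 10 to 18) can be sketched in plain Python as below. This is a toy illustration, not the released implementation: the helper names, the parameter dictionary, the identity weight matrices, and the 2-dimensional node vectors are all assumptions made for readability.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(nodes, v, W_node, conditions):
    """Score each node with v^T tanh(W_node node + sum of W c) and pool.

    conditions is a list of (W, vector) pairs such as (W2, d_prev).
    Returns the attention-weighted context vector and the weights.
    """
    shift = [0.0] * len(v)
    for W, vec in conditions:
        shift = [s + p for s, p in zip(shift, matvec(W, vec))]
    scores = []
    for node in nodes:
        pre = [p + s for p, s in zip(matvec(W_node, node), shift)]
        scores.append(sum(vi * math.tanh(x) for vi, x in zip(v, pre)))
    weights = softmax(scores)
    dim = len(nodes[0])
    context = [sum(w * node[i] for w, node in zip(weights, nodes))
               for i in range(dim)]
    return context, weights


def sequential_attention(c_f, a_f, r_f, d_prev, p):
    """Query -> history -> KB attention, each step conditioned on the last."""
    h_Q, _ = additive_attention(c_f, p["v1"], p["W1"], [(p["W2"], d_prev)])
    h_H, _ = additive_attention(
        a_f, p["v2"], p["W3"], [(p["W4"], d_prev), (p["W5"], h_Q)])
    h_K, _ = additive_attention(
        r_f, p["v3"], p["W6"],
        [(p["W7"], d_prev), (p["W8"], h_Q), (p["W9"], h_H)])
    return h_Q, h_H, h_K


# Toy inputs: identity weights and small hand-picked node vectors.
I2 = [[1.0, 0.0], [0.0, 1.0]]
params = {"v1": [1.0, 1.0], "v2": [1.0, 1.0], "v3": [1.0, 1.0]}
params.update({f"W{i}": I2 for i in range(1, 10)})
c_f = [[1.0, 0.0], [0.0, 1.0]]                # final query-GCN nodes
a_f = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]   # final history-GCN nodes
r_f = [[2.0, 0.0], [0.0, 2.0]]                # final KB-GCN nodes
d_prev = [0.1, -0.1]                          # previous decoder state
h_Q, h_H, h_K = sequential_attention(c_f, a_f, r_f, d_prev, params)
```

Because each context vector is a convex combination of node vectors, every coordinate of `h_Q`, `h_H`, and `h_K` stays within the range spanned by the corresponding nodes, which is a convenient sanity check when wiring the three steps together.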

4.5 Decoder

The decoder is conditioned on two components: (i) the context that contains the history and the KB and (ii) the query that is the last/previous utterance in the dialogue. We use an aggregator that learns the overall attention to be given to the history and KB components. These attention scores, $\theta_t^H$ and $\theta_t^K$, are dependent on the respective context vectors and the previous decoder state $d_{t-1}$. The final context vector is obtained as:

$$h_t^C = \theta_t^H h_t^H + \theta_t^K h_t^K \tag{19}$$

$$h_t^{final} = [h_t^C; h_t^Q] \tag{20}$$

where $[;]$ denotes the concatenation operator. At every time step, the decoder then computes a probability distribution over the vocabulary using the following equations:

$$d_t = \mathrm{RNN}(d_{t-1}, [h_t^{final}; w_t]) \tag{21}$$

$$P_{vocab} = \mathrm{softmax}(V' d_t + b') \tag{22}$$

where $w_t$ is the decoder input at time step $t$, and $V'$ and $b'$ are parameters. $P_{vocab}$ gives us a probability distribution over the entire vocabulary and the loss for time step $t$ is $l_t = -\log P_{vocab}(w_t^*)$, where $w_t^*$ is the $t$th word in the ground truth response. The total loss is an average of the per-time step losses.
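A small sketch of the context aggregation (equations 19 and 20) and of the averaged per-step loss follows. The text does not spell out how the scores $\theta_t^H$ and $\theta_t^K$ are computed, so here they are simply passed in as given numbers; the function names and toy values are likewise assumptions.

```python
import math


def final_context(h_H, h_K, h_Q, theta_H, theta_K):
    """Return [h_C ; h_Q] with h_C = theta_H * h_H + theta_K * h_K."""
    h_C = [theta_H * x + theta_K * y for x, y in zip(h_H, h_K)]
    return h_C + h_Q  # list concatenation plays the role of [ ; ]


def average_nll(step_distributions, target_ids):
    """Average of l_t = -log P_vocab(w_t*) over all decoder time steps."""
    losses = [-math.log(dist[target])
              for dist, target in zip(step_distributions, target_ids)]
    return sum(losses) / len(losses)


# Toy values: equal weight on history and KB contexts.
h_final = final_context([1.0, 2.0], [3.0, 4.0], [5.0, 6.0], 0.5, 0.5)

# Two decoder steps over a 2-word vocabulary; the gold word has id 0 twice.
loss = average_nll([[0.5, 0.5], [0.25, 0.75]], [0, 0])
```

In the toy run, the loss is the mean of $-\log 0.5$ and $-\log 0.25$, which shows how a less confident second step pulls the average up.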

4.6 Contextual Graph Creation

For the dialogue history and query encoder, we used the dependency parse tree for capturing structural information in the encodings. However, if the conversations occur in a language for which no dependency parsers exist, for example, code-mixed languages like Hinglish (Hindi–English) (Banerjee et al., 2018), then we need an alternate way of extracting a graph structure from the utterances. One simple solution that has worked well in practice was to create a word co-occurrence matrix from the entire corpus where the context window is an entire sentence. Once we have such a co-occurrence matrix, for a given sentence we can connect an edge between two words if their co-occurrence frequency is above a threshold value. The co-occurrence matrix can either contain co-occurrence frequency counts or positive pointwise mutual information (PPMI) values (Church and Hanks, 1990; Dagan et al., 1993; Niwa and Nitta, 1994).
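The contextual graph construction described above can be sketched as follows, using sentence-level co-occurrence counts and a PPMI threshold. The probability estimates (fraction of sentences containing a word or a word pair), the function names, the threshold value, and the tiny corpus are assumptions made for illustration; the text does not fix these details.

```python
import math
from collections import Counter
from itertools import combinations


def count_cooccurrences(corpus):
    """Count words and unordered word pairs per sentence (window = sentence)."""
    word_counts, pair_counts = Counter(), Counter()
    for sentence in corpus:
        words = sorted(set(sentence))
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))
    return word_counts, pair_counts


def ppmi(pair, word_counts, pair_counts, n_sentences):
    """Positive PMI: max(0, log2(P(a, b) / (P(a) P(b))))."""
    if pair_counts[pair] == 0:
        return 0.0
    p_joint = pair_counts[pair] / n_sentences
    p_first = word_counts[pair[0]] / n_sentences
    p_second = word_counts[pair[1]] / n_sentences
    return max(0.0, math.log2(p_joint / (p_first * p_second)))


def contextual_edges(sentence, word_counts, pair_counts, n_sentences, threshold):
    """Connect two words of a sentence if their PPMI exceeds the threshold."""
    words = sorted(set(sentence))
    return [pair for pair in combinations(words, 2)
            if ppmi(pair, word_counts, pair_counts, n_sentences) > threshold]


# Hypothetical four-sentence corpus.
corpus = [
    ["book", "a", "table"],
    ["book", "a", "cab"],
    ["cheap", "table"],
    ["a", "cheap", "cab"],
]
word_counts, pair_counts = count_cooccurrences(corpus)
edges = contextual_edges(["book", "a", "table"], word_counts, pair_counts,
                         n_sentences=len(corpus), threshold=0.0)
```

Swapping `ppmi` for the raw `pair_counts[pair]` value gives the frequency-count variant; in both cases the resulting word pairs would be mapped to token positions before being fed to the GCN.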

5 Experimental Setup

In this section, we describe the datasets used in
our experiments, the various hyperparameters that
we considered, and the models that we compared.

5.1 Datasets

The original DSTC2 dataset (Henderson et al., 2014a) was based on the task of restaurant table reservation and contains transcripts of real conversations between humans and bots. The utterances were labeled with the dialogue state annotations like the semantic intent representation, requested slots, and the constraints on the slot values. We report our results on the modified DSTC2 dataset of Bordes et al. (2017), where such annotations are removed and only the raw utterance–response pairs are present with an associated set of KB triples for each dialogue. It contains around 1,618 training dialogues, 500 validation dialogues, and 1,117 test dialogues. For our experiments with contextual graphs we report our results on the code-mixed versions of modified DSTC2, which were recently released by Banerjee et al. (2018). This dataset has been collected by code-mixing the utterances of the English version of modified DSTC2 (En-DSTC2) in four languages: Hindi (Hi-DSTC2), Bengali (Be-DSTC2), Gujarati (Gu-DSTC2), and Tamil (Ta-DSTC2), via crowdsourcing. We also perform experiments on two goal-oriented dialogue datasets that contain conversations between humans wherein the conversations were collected in a Wizard-of-Oz (WOZ) manner. Specifically, we use the Cam676 dataset (Wen et al., 2017), which contains 676 KB-grounded dialogues from the restaurant domain, and the MultiWOZ (Budzianowski et al., 2018) dataset, which contains 10,438 dialogues.

Figure 2: Illustration of sequential attention mechanism in RNN+GCN-SeA.

| Model | per-resp. acc | BLEU | ROUGE-1 | ROUGE-2 | ROUGE-L | Entity F1 |
|---|---|---|---|---|---|---|
| Rule-Based (Bordes et al., 2017) | 33.3 | - | - | - | - | - |
| MEMNN (Bordes et al., 2017) | 41.1 | - | - | - | - | - |
| QRN (Seo et al., 2017) | 50.7 | - | - | - | - | - |
| GMEMNN (Liu and Perez, 2017) | 48.7 | - | - | - | - | - |
| Seq2Seq-Attn (Bahdanau et al., 2015) | 46.0 | 57.3 | 67.2 | 56.0 | 64.9 | 67.1 |
| Seq2Seq-Attn+Copy (Eric and Manning, 2017) | 47.3 | 55.4 | - | - | - | 71.6 |
| HRED (Serban et al., 2016) | 48.9 | 58.4 | 67.9 | 57.6 | 65.7 | 75.6 |
| Mem2Seq (Madotto et al., 2018) | 45.0 | 55.3 | - | - | - | 75.3 |
| GCN-SeA | 47.1 | 59.0 | 67.4 | 57.1 | 65.0 | 71.9 |
| RNN+CROSS-GCN-SeA | 51.2 | 60.9 | 69.4 | 59.9 | 67.2 | 78.1 |
| RNN+GCN-SeA | 51.4 | 61.2 | 69.6 | 60.2 | 67.4 | 77.9 |

Table 1: Comparison of RNN+GCN-SeA with other models on the English version of modified DSTC2.

5.2 Hyperparameters

We used the same train, test, and validation splits as provided in the original versions of the datasets. We minimized the cross entropy loss using the Adam optimizer (Kingma and Ba, 2015) and tuned the initial learning rates in the range of 0.0006 to 0.001. For regularization we used an L2 penalty of 0.001 in addition to a dropout (Srivastava et al., 2014) of 0.1. We used randomly initialized word embeddings of size 300. The RNN and GCN hidden dimensions were also chosen to be 300. We used GRU (Cho et al., 2014) cells for the RNNs. All parameters were initialized from a truncated normal distribution with a standard deviation of 0.1.

5.3 Models Compared

We compare the performance of the following
models.

(i) RNN+GCN-SeA vs GCN-SeA: We use RNN+GCN-SeA to refer to the model described in Section 4. Instead of using the hidden representations obtained from the bidirectional RNNs, we also experiment by providing the token embeddings directly to the GCNs, that is, $c_u^1 = q_u$ in equation 6 and $a_u^1 = p_u$ in equation 8. We refer to this model as GCN-SeA.

(ii) Cross edges between the GCNs: In addition to the dependency and contextual edges, we add edges between words in the dialogue history/query and KB entities if a history/query word exactly matches the KB entity. Such edges create a single connected graph that is encoded using a single GCN encoder and then separated into different contexts to compute sequential attention. This model is referred to as RNN+CROSS-GCN-SeA.

(iii) GCN-SeA+Random vs GCN-SeA+Structure: We experiment with the model where the graph is constructed by randomly connecting edges between two words in a context. We refer to this model as GCN-SeA+Random. We refer to the model that either uses dependency or contextual graphs instead of random graphs as GCN-SeA+Structure.
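The two graph variants in (ii) and (iii) can be illustrated with a short sketch. The function names, the example tokens, and in particular the number of random edges and the seeding are assumptions, since those details are not specified here.

```python
import random
from itertools import combinations


def cross_edges(context_tokens, kb_entities):
    """Link a context token to a KB entity when the strings match exactly.

    Returns (token_position, entity_index) pairs.
    """
    entity_index = {entity: j for j, entity in enumerate(kb_entities)}
    return [(i, entity_index[token])
            for i, token in enumerate(context_tokens)
            if token in entity_index]


def random_edges(n_tokens, n_edges, seed=0):
    """Sample distinct word pairs uniformly, ignoring any linguistic structure."""
    rng = random.Random(seed)
    candidates = list(combinations(range(n_tokens), 2))
    return rng.sample(candidates, min(n_edges, len(candidates)))


# Hypothetical utterance and KB entity list.
tokens = ["is", "pizza_hut", "in", "the", "north"]
kb_entities = ["pizza_hut", "italian", "north"]
links = cross_edges(tokens, kb_entities)
noise = random_edges(len(tokens), n_edges=4)
```

Comparing a model trained on `noise`-style edges with one trained on dependency or co-occurrence edges is what separates the benefit of meaningful structure from the benefit of simply having extra connections.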


6 Results and Discussions

In this section, we discuss the results of our
experiments as summarized in Tables 1–5. We use
BLEU (Papineni et al., 2002) and ROUGE (Lin,
2004) metrics to evaluate the generation quality
of responses. We also report the per-response
accuracy, which computes the percentage of
responses in which the generated response exactly
matches the ground truth response. To evaluate the
model’s capability of correctly injecting entities
in the generated response, we report the entity F1
measure as defined in Eric and Manning (2017).
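For intuition, the two task-specific metrics can be sketched as below. Per-response accuracy follows the exact-match description above. The entity F1 shown here is a micro-averaged variant over whitespace tokens that appear in the KB entity set, which is an assumption for illustration and may differ in detail from the definition of Eric and Manning (2017); the function names and example strings are hypothetical.

```python
def per_response_accuracy(predictions, references):
    """Percentage of generated responses that exactly match the reference."""
    matches = sum(pred == ref for pred, ref in zip(predictions, references))
    return 100.0 * matches / len(references)


def entity_f1(predictions, references, kb_entities):
    """Micro-averaged F1 over KB entities mentioned in each response."""
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        pred_entities = set(pred.split()) & kb_entities
        ref_entities = set(ref.split()) & kb_entities
        tp += len(pred_entities & ref_entities)
        fp += len(pred_entities - ref_entities)
        fn += len(ref_entities - pred_entities)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100.0 * 2 * precision * recall / (precision + recall)


# Hypothetical system outputs and gold responses.
predictions = ["pizza_hut is in the north", "sorry no results"]
references = ["pizza_hut is in the south", "sorry no results"]
kb_entities = {"pizza_hut", "north", "south"}

accuracy = per_response_accuracy(predictions, references)
f1 = entity_f1(predictions, references, kb_entities)
```

In this toy case one of two responses matches exactly, and the first response names one correct and one wrong entity, so both scores come out at 50.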

Results on En-DSTC2: We compare our model
with the previous works on the English version
of modified DSTC2 in Table 1. For most of the
retrieval-based models, the BLEU or ROUGE
scores are not available as they select a candidate

492

l

D
o
w
N
o
UN
D
e
D

F
R
o
M
H

T
T

P

:
/
/

D
io
R
e
C
T
.

M

io
T
.

e
D
tu

/
T

UN
C
l
/

l

UN
R
T
io
C
e

P
D

F
/

D
o

io
/

.

1
0
1
1
6
2

/
T

l

UN
C
_
UN
_
0
0
2
8
4
1
9
2
3
4
5
0

/

/
T

l

UN
C
_
UN
_
0
0
2
8
4
P
D

.

Dataset

Model

per-resp.
acc

BLEU

ROUGE

Entity F1

Hi-DSTC2

Be-DSTC2

GU-DSTC2

Ta-DSTC2

Seq2Seq-Bahdanau Attn
HRED
Mem2Seq
GCN-SeA
RNN+CROSS-GCN-SeA
RNN+GCN-SeA
Seq2Seq-Bahdanau Attn
HRED
Mem2Seq
GCN-SeA
RNN+CROSS-GCN-SeA
RNN+GCN-SeA
Seq2Seq-Bahdanau Attn
HRED
Mem2Seq
GCN-SeA
RNN+CROSS-GCN-SeA
RNN+GCN-SeA
Seq2Seq-Bahdanau Attn
HRED
Mem2Seq
GCN-SeA
RNN+CROSS-GCN-SeA
RNN+GCN-SeA

48.0
47.2
43.1
47.0
47.2
49.2
50.4
47.8
41.9
47.1
50.4
50.3
47.7
48.0
43.1
48.1
49.4
48.9
49.3
47.8
44.2
46.4
50.8
50.7

1
62.9
63.4
55.5
65.0
64.7
66.4
67.4
67.2
58.9
67.4
68.3
69.0
64.8
65.4
55.7
65.5
66.4
66.1
67.8
66.9
58.6
68.5
69.8
70.2

2
52.5
52.7
48.1
55.3
54.9
56.8
57.6
57.0
50.8
57.3
58.9
59.4
54.9
55.2
48.6
56.2
57.2
56.9
56.3
55.2
50.8
57.5
59.6
59.9

l
61.0
61.5
54.0
63.0
62.6
64.4
65.1
64.9
57.0
64.9
65.9
66.6
62.6
63.3
54.2
63.5
64.3
64.1
65.6
64.8
57.0
66.1
67.5
67.9

55.1
55.3
50.2
56.0
56.4
57.1
55.6
55.6
52.1
58.4
59.1
59.2
54.5
54.7
48.9
55.7
56.9
56.7
62.9
61.5
58.9
62.8
64.5
64.9

74.3
71.3
73.8
72.4
73.5
75.9
76.2
71.5
73.2
69.6
74.9
75.1
71.3
71.8
75.5
72.2
73.4
73.0
77.7
74.4
74.9
71.9
78.8
77.9

Tavolo 2: Comparison of RNN+GCN-SeA with other models on all code-mixed datasets.

Match Success BLEU ROUGE-1 ROUGE-2 ROUGE-L
Modelli
48.11
85.29
Seq2seq-Attn
48.25
83.82
HRED
47.69
GCN-SeA
85.29
50.49
RNN+GCN-SeA 94.12

48.53
44.12
21.32
45.59

40.41
39.93
40.29
42.35

24.69
24.09
25.15
27.69

18.81
19.38
18.48
21.62

Tavolo 3: Comparison of our models with the baselines on the Cam676 dataset.

from a list of candidates as opposed to generating
it. Our model outperforms all of the retrieval and
generation-based models. We obtain a gain of 0.7
in the per-response accuracy compared with the
previous retrieval-based state-of-the-art model of
Seo et al. (2017), which is a very strong baseline
for our generation-based model. We call this a
strong baseline because the candidate selection
task of this model is easier than the response
generation task of our model. We also obtain
a gain of 2.8 BLEU points, 2 ROUGE points,
and 2.5 entity F1 points compared with current
state-of-the-art generation-based models.
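
As a rough illustration of two of the metrics quoted above, the sketch below computes per-response accuracy as the fraction of exact matches and entity F1 as a micro-averaged F1 over KB entity tokens. The function names, the whitespace tokenization, and the micro-averaging are assumptions made for illustration; the evaluation scripts actually used may differ.

```python
def per_response_accuracy(predictions, references):
    """Fraction of generated responses that exactly match the reference."""
    if not references:
        return 0.0
    exact = sum(p.split() == r.split() for p, r in zip(predictions, references))
    return exact / len(references)


def entity_f1(predictions, references, kb_entities):
    """Micro-averaged F1 over KB entities mentioned in each response."""
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        # Assumption: each entity is a single whitespace-delimited token.
        pred_ents = {tok for tok in pred.split() if tok in kb_entities}
        ref_ents = {tok for tok in ref.split() if tok in kb_entities}
        tp += len(pred_ents & ref_ents)
        fp += len(pred_ents - ref_ents)
        fn += len(ref_ents - pred_ents)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Under this definition a response can score zero on per-response accuracy and still contribute fully to entity F1, which is one reason the two metrics can rank models differently.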

Results on code-mixed datasets and effect of
using RNNs: The results of our experiments on the
code-mixed datasets are reported in Table 2. Our
model outperforms the baseline models on all the
code-mixed languages. One common observation
from the results over all the languages is that
RNN+GCN-SeA performs better than GCN-SeA.
Similar observations were made by Marcheggiani
and Titov (2017) for semantic role labeling.

Results on Cam676 dataset: The results of
our experiments on the Cam676 dataset are
reported in Table 3. In order to evaluate goal-
completeness, we use two additional metrics as

Single Domain Dialogues (SNG)

| Models | Match | Success | BLEU | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|---|---|---|
| Seq2seq-Attn | 68.16 | 36.77 | 11.53 | 35.30 | 13.44 | 28.28 |
| HRED | 84.30 | 52.02 | 10.27 | 38.30 | 14.49 | 30.38 |
| GCN-SeA | 63.68 | 44.84 | 12.30 | 39.79 | 16.11 | 32.51 |
| RNN+GCN-SeA-400d | 86.10 | 59.19 | 11.73 | 38.76 | 15.22 | 30.93 |
| RNN+GCN-SeA-100d | 75.78 | 32.74 | 13.13 | 40.76 | 17.67 | 33.59 |

Multi-Domain Dialogues (MUL)

| Models | Match | Success | BLEU | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|---|---|---|
| Seq2seq-Attn | 44.40 | 22.10 | 14.03 | 38.99 | 16.39 | 30.87 |
| HRED | 66.40 | 37.70 | 12.75 | 40.57 | 16.83 | 31.98 |
| GCN-SeA | 57.40 | 37.90 | 14.16 | 42.40 | 19.03 | 34.25 |
| RNN+GCN-SeA | 62.20 | 40.30 | 15.85 | 43.40 | 19.63 | 35.15 |

Table 4: Comparison of our models with the baselines on the MultiWOZ dataset.

| Dataset | Model | per-resp. acc | BLEU | ROUGE-1 | ROUGE-2 | ROUGE-L | Entity F1 |
|---|---|---|---|---|---|---|---|
| En-DSTC2 | GCN-SeA+Random | 45.9 | 57.8 | 67.1 | 56.5 | 64.8 | 72.2 |
| En-DSTC2 | GCN-SeA+Structure | 47.1 | 59.0 | 67.4 | 57.1 | 65.0 | 71.9 |
| Hi-DSTC2 | GCN-SeA+Random | 44.4 | 54.9 | 63.1 | 52.9 | 60.9 | 67.2 |
| Hi-DSTC2 | GCN-SeA+Structure | 47.0 | 56.0 | 65.0 | 55.3 | 63.0 | 72.4 |
| Be-DSTC2 | GCN-SeA+Random | 44.9 | 56.5 | 65.4 | 54.8 | 62.7 | 65.6 |
| Be-DSTC2 | GCN-SeA+Structure | 47.1 | 58.4 | 67.4 | 57.3 | 64.9 | 69.6 |
| Gu-DSTC2 | GCN-SeA+Random | 45.0 | 54.0 | 64.1 | 54.0 | 61.9 | 69.1 |
| Gu-DSTC2 | GCN-SeA+Structure | 48.1 | 55.7 | 65.5 | 56.2 | 63.5 | 72.2 |
| Ta-DSTC2 | GCN-SeA+Random | 44.8 | 61.4 | 66.9 | 55.6 | 64.3 | 70.5 |
| Ta-DSTC2 | GCN-SeA+Structure | 46.4 | 62.8 | 68.5 | 57.5 | 66.1 | 71.9 |

Table 5: GCN-SeA with random graphs and dependency/contextual graphs on all DSTC2 datasets.

Figure 3: GCN-SeA with multiple hops on all DSTC2 datasets.

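| Dataset | Model | per-resp. acc | BLEU | ROUGE-1 | ROUGE-2 | ROUGE-L | Entity F1 |
|---|---|---|---|---|---|---|---|
| Hi-DSTC2 | Seq2seq-Bahdanau Attn | 48.0 | 55.1 | 62.9 | 52.5 | 61.0 | 74.3 |
| Hi-DSTC2 | GCN-Bahdanau Attn | 38.5 | 50.4 | 58.9 | 47.7 | 56.7 | 59.1 |
| Hi-DSTC2 | RNN+GCN-Bahdanau Attn | 47.1 | 56.0 | 65.1 | 55.2 | 62.9 | 72.2 |
| Hi-DSTC2 | RNN-SeA | 45.8 | 55.9 | 65.1 | 55.5 | 63.1 | 71.8 |
| Hi-DSTC2 | RNN+GCN-SeA | 49.2 | 57.1 | 66.4 | 56.8 | 64.4 | 75.9 |
| Be-DSTC2 | Seq2seq-Bahdanau Attn | 50.4 | 55.6 | 67.4 | 57.6 | 65.1 | 76.2 |
| Be-DSTC2 | GCN-Bahdanau Attn | 42.1 | 55.1 | 63.7 | 52.8 | 61.1 | 64.3 |
| Be-DSTC2 | RNN+GCN-Bahdanau Attn | 47.0 | 57.7 | 67.0 | 57.4 | 64.6 | 70.9 |
| Be-DSTC2 | RNN-SeA | 46.8 | 58.5 | 67.6 | 58.1 | 65.1 | 71.9 |
| Be-DSTC2 | RNN+GCN-SeA | 50.3 | 59.2 | 69.0 | 59.4 | 66.6 | 75.1 |
| Gu-DSTC2 | Seq2seq-Bahdanau Attn | 47.7 | 54.5 | 64.8 | 54.9 | 62.6 | 71.3 |
| Gu-DSTC2 | GCN-Bahdanau Attn | 38.8 | 49.5 | 59.2 | 48.3 | 56.8 | 58.0 |
| Gu-DSTC2 | RNN+GCN-Bahdanau Attn | 46.5 | 55.5 | 65.6 | 55.9 | 63.4 | 70.6 |
| Gu-DSTC2 | RNN-SeA | 45.4 | 56.0 | 66.0 | 56.6 | 63.9 | 69.8 |
| Gu-DSTC2 | RNN+GCN-SeA | 48.9 | 56.7 | 66.1 | 56.9 | 64.1 | 73.0 |
| Ta-DSTC2 | Seq2seq-Bahdanau Attn | 49.3 | 62.9 | 67.8 | 56.3 | 65.6 | 77.7 |
| Ta-DSTC2 | GCN-Bahdanau Attn | 42.0 | 59.3 | 64.8 | 52.8 | 62.1 | 69.7 |
| Ta-DSTC2 | RNN+GCN-Bahdanau Attn | 46.3 | 63.2 | 68.0 | 57.2 | 65.6 | 72.1 |
| Ta-DSTC2 | RNN-SeA | 46.8 | 64.0 | 69.3 | 59.0 | 67.1 | 74.2 |
| Ta-DSTC2 | RNN+GCN-SeA | 50.7 | 64.9 | 70.2 | 59.9 | 67.9 | 77.9 |
| En-DSTC2 | Seq2seq-Bahdanau Attn | 46.0 | 57.3 | 67.2 | 56.0 | 64.9 | 67.1 |
| En-DSTC2 | GCN-Bahdanau Attn | 45.7 | 58.1 | 66.5 | 55.9 | 64.1 | 70.1 |
| En-DSTC2 | RNN+GCN-Bahdanau Attn | 47.4 | 59.5 | 67.9 | 57.7 | 65.6 | 72.9 |
| En-DSTC2 | RNN-SeA | 47.0 | 60.2 | 68.5 | 58.9 | 66.2 | 72.7 |
| En-DSTC2 | RNN+GCN-SeA | 51.4 | 61.2 | 69.6 | 60.2 | 67.4 | 77.9 |

Table 6: Ablation results of various models on all versions of DSTC2.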

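| Dataset | Model | per-resp. acc | BLEU | ROUGE-1 | ROUGE-2 | ROUGE-L | Entity F1 |
|---|---|---|---|---|---|---|---|
| En-DSTC2 | Query | 22.8 | 38.1 | 53.5 | 37.6 | 50.6 | 18.4 |
| En-DSTC2 | Query + History | 47.1 | 60.6 | 68.8 | 59.4 | 66.6 | 72.8 |
| En-DSTC2 | Query + KB | 41.4 | 55.8 | 63.7 | 52.4 | 60.9 | 63.5 |
| Hi-DSTC2 | Query | 22.5 | 37.5 | 50.9 | 37.8 | 48.4 | 11.1 |
| Hi-DSTC2 | Query + History | 45.5 | 55.9 | 65.3 | 55.7 | 63.3 | 69.8 |
| Hi-DSTC2 | Query + KB | 40.5 | 52.6 | 60.8 | 49.7 | 58.5 | 60.2 |
| Be-DSTC2 | Query | 22.7 | 37.9 | 51.9 | 38.0 | 49.0 | 10.6 |
| Be-DSTC2 | Query + History | 45.7 | 57.4 | 67.1 | 57.4 | 64.6 | 69.9 |
| Be-DSTC2 | Query + KB | 41.2 | 54.6 | 63.0 | 52.1 | 60.3 | 60.2 |
| Gu-DSTC2 | Query | 22.4 | 36.1 | 50.7 | 37.2 | 48.4 | 10.9 |
| Gu-DSTC2 | Query + History | 21.1 | 36.6 | 48.6 | 35.1 | 46.3 | 7.2 |
| Gu-DSTC2 | Query + KB | 40.1 | 50.6 | 60.9 | 50.1 | 58.7 | 59.5 |
| Ta-DSTC2 | Query | 22.8 | 39.3 | 53.6 | 39.0 | 50.6 | 18.8 |
| Ta-DSTC2 | Query + History | 45.8 | 63.1 | 68.9 | 58.4 | 66.5 | 72.6 |
| Ta-DSTC2 | Query + KB | 40.9 | 59.2 | 64.2 | 52.3 | 61.5 | 64.2 |

Table 7: Ablations on different parts of the encoder of RNN+GCN-SeA.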

| Dataset | Wins % | Losses % | Ties % |
|---|---|---|---|
| En-DSTC2 | 42.17 | 22.83 | 35.00 |
| Cam676 | 29.00 | 27.33 | 43.66 |

Table 8: Human evaluation results showing wins, losses, and ties % on En-DSTC2 and Cam676.

used in the original paper (Wen et al., 2017)
which introduced this dataset: (i) match rate: the
number of times the correct entity was suggested
by the model, and (ii) success rate: if the correct
entity was suggested and the system provided all
the requestable slots then the dialogue results in
a success. The results suggest that our model’s
responses are more fluent as indicated by the
BLEU and ROUGE scores. It also produces the
correct entities according to the dialogue goals but
fails to provide enough requestable slots. Note that
the model described in the original paper (Wen
et al., 2017) is not directly comparable to our work
as it uses an explicit belief tracker, which requires
extra supervision/annotation about the belief
state. However, for the sake of completeness we
would like to mention that their model using this
extra supervision achieves a BLEU score of 23.69
and a success rate of 83.82%.
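
The two goal-completion metrics described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `DialogueOutcome` record and its field names are hypothetical, and both rates are reported as percentages of dialogues, which may not match the original evaluation script in every detail.

```python
from dataclasses import dataclass, field


@dataclass
class DialogueOutcome:
    goal_entity: str          # entity that satisfies the user's goal
    suggested_entities: set   # entities the system mentioned
    requested_slots: set = field(default_factory=set)  # slots the user asked for
    provided_slots: set = field(default_factory=set)   # slots the system gave


def match_and_success(outcomes):
    """Return (match rate %, success rate %) over a list of dialogues."""
    matched = succeeded = 0
    for d in outcomes:
        is_match = d.goal_entity in d.suggested_entities
        matched += is_match
        # Success requires a match plus every requested slot being provided.
        succeeded += is_match and d.requested_slots <= d.provided_slots
    n = len(outcomes)
    if n == 0:
        return 0.0, 0.0
    return 100.0 * matched / n, 100.0 * succeeded / n
```

Because success is conditioned on a match, the success rate can never exceed the match rate, which is consistent with the pattern in Table 3.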

Results on MultiWOZ dataset: The results of
our experiments on two versions of the MultiWOZ
dataset are reported in Table 4. The first version
(SNG) contains around 3K dialogues in which
each dialogue involves only a single domain
and the second version (MUL) contains all 10K
dialogues. The baseline models do not use an
oracle belief state as mentioned in Budzianowski
et al. (2018) and therefore are comparable to our
model. We observed that with a larger GCN hidden
dimension (400d in Table 4) our model is able to
provide the correct entities and requestable slots
in SNG. On the other hand, with a smaller GCN
hidden dimension (100d) we are able to generate
fluent responses in SNG. On MUL, our model is
able to generate fluent responses but struggles
in providing the correct entity mainly due to
the increased complexity of multiple domains.
However, our model still provides a high number
of correct requestable slots, as shown by the
success rate. This is because multiple domains
(hotel, restaurant, attraction, hospital) have the
same requestable slots (address, phone, postcode).

Effect of using hops: As we increased the
number of hops of GCNs (Figure 3), we observed
a decrease in the performance. One reason for such
a drop in performance could be that the average
utterance length is very small (7.76 words). Thus,
there is not much scope for capturing distant
neighborhood information and more hops can
add noisy information. The reduction is more
prominent in contextual graphs in which multi-hop
neighbors can turn out to be dissimilar words
in different sentences.
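
To make the point about short utterances concrete, the sketch below counts how many tokens fall within k hops of a given token in a small graph, which is the neighborhood that k stacked GCN layers can draw on. The helper name `k_hop_neighbors` and the seven-token edge list are hypothetical; the edges are a hand-written dependency-style tree, not parser output.

```python
from collections import deque


def k_hop_neighbors(edges, source, k):
    """Nodes within k undirected hops of `source` (excluding itself)."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == k:
            continue  # do not expand beyond k hops
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return seen - {source}


# Hypothetical tree over 7 tokens: token 1 is the root, token 5 a noun head.
toy_edges = [(1, 0), (1, 5), (5, 2), (5, 3), (5, 4), (1, 6)]
for hops in (1, 2, 3):
    print(hops, sorted(k_hop_neighbors(toy_edges, 3, hops)))
```

In this toy example, three hops already reach every other token from token 3, so additional hops cannot add new neighbors and can only re-mix information that is already available.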

Effect of using random graphs: GCN-SeA+Random
and GCN-SeA+Structure take the
token embeddings directly instead of passing them
through an RNN. This ensures that the difference in
performance of the two models is not influenced
by the RNN encodings. The results are shown in
Table 5 and we observe a drop in performance
for GCN-SeA+Random across all the languages.
This shows that the dependency and contextual
structures play an important role and cannot be
replaced by random graphs.
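
One plausible way to build such a control graph is sketched below. The text does not state how the random graphs were sampled, so the choice made here (keep the node set and the number of edges, sample the edges uniformly) is an assumption, and `random_graph_like` is a hypothetical helper name.

```python
import random


def random_graph_like(num_nodes, structured_edges, seed=0):
    """Sample a random undirected graph with as many edges as the structured one."""
    rng = random.Random(seed)
    # All unordered node pairs (i < j); assumes structured_edges has no duplicates.
    all_pairs = [(i, j) for i in range(num_nodes) for j in range(i + 1, num_nodes)]
    return set(rng.sample(all_pairs, len(structured_edges)))
```

Matching the edge count keeps the amount of message passing comparable, so that any difference in performance can be attributed to which tokens are connected rather than to how many connections exist.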

Ablations: We experiment with replacing the
sequential attention by the Bahdanau attention
(Bahdanau et al., 2015). We also experiment
with various combinations of RNNs and GCNs
as encoders. The results are shown in Table 6.
We observed that GCNs do not outperform
RNNs independently. In general, RNN-Bahdanau
attention performs better than GCN-Bahdanau
attention. The sequential attention mechanism
outperforms Bahdanau attention as observed from
the following comparisons: (i) GCN-Bahdanau
attention vs GCN-SeA, (ii) RNN-Bahdanau
attention vs RNN-SeA (in BLEU and ROUGE),
and (iii) RNN+GCN-Bahdanau attention vs
RNN+GCN-SeA. Overall, the best results are
always obtained by our final model, which combines
RNN, GCN, and sequential attention. We
also performed ablations by removing specific
parts of the encoder. Specifically, we experiment
with (i) query encoder alone, (ii) query +
history encoder, and (iii) query + KB encoder.
The results shown in Table 7 suggest that the
query and the KB are not enough to generate
fluent responses and the previous conversation
history is essential.
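
For reference, the Bahdanau-style additive attention used as the ablation baseline can be sketched in plain Python as below. This shows only the generic mechanism (score, softmax, weighted sum) and not the sequential attention of our model; the parameter names `W_s`, `W_h`, and `v` are illustrative, and the weights would be learned in practice.

```python
import math


def additive_attention(decoder_state, encoder_states, W_s, W_h, v):
    """Bahdanau-style attention: score_i = v . tanh(W_s s + W_h h_i)."""

    def matvec(matrix, vector):
        return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

    projected_state = matvec(W_s, decoder_state)
    scores = []
    for h in encoder_states:
        projected_h = matvec(W_h, h)
        hidden = [math.tanh(a + b) for a, b in zip(projected_state, projected_h)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))

    # Numerically stable softmax over the scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of the encoder states.
    dim = len(encoder_states[0])
    context = [
        sum(w * h[j] for w, h in zip(weights, encoder_states)) for j in range(dim)
    ]
    return weights, context
```

A single call attends over one memory (for example, the history tokens) given the decoder state; the ablation rows labeled "Bahdanau Attn" in Table 6 correspond to this kind of attention replacing the sequential variant.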

Human evaluations: In order to evaluate
the appropriateness of our model’s responses
compared to the baselines, we perform a human
evaluation of the generated responses using in-house
evaluators. We evaluated randomly chosen
responses from 200 dialogues of En-DSTC2 and

Context:
Speaker 1: [silence] .
Speaker 2: hello, welcome to the cambridge restaurant system . you can ask for restaurants by area, price range or food type . how may i help you ?
Speaker 1: uh cheap restaurant malaysian food .
Speaker 2: i’m sorry but there is no restaurant serving malaysian food .
Speaker 1: uh italian food .
Speaker 2: could you please repeat that ?
Speaker 1: cheap restaurant uh italian food .
Speaker 2: there are restaurants serving italian in the cheap price range . what area would you like ?
Speaker 1: i dont care .

KB Triples:
ask, r cuisine, italian
ask, r location, centre
ask, r phone, ask phone
ask, r address, ask address
ask, r price, cheap
ask, r rating, 0
pizza hut city centre, r cuisine, italian
pizza hut city centre, r location, centre
pizza hut city centre, r phone, pizza hut city centre phone
pizza hut city centre, r address, pizza hut city centre address
pizza hut city centre, r price, cheap
pizza hut city centre, r rating, 9

Seq2seq + Attn: ask is a nice place in the west of town serving tasty italian food
HRED: pizza hut city centre serves italian food in the expensive price range
GCN-SeA: pizza hut city centre serves italian food in the cheap price range
RNN+GCN-SeA: pizza hut city centre is a great restaurant serving cheap italian food in the centre of town
RNN+CROSS-GCN-SeA: pizza hut city centre is a nice place in the centre of town serving tasty italian food

Table 9: Qualitative comparison of responses between the baselines and different versions of our model.

100 dialogues of Cam676 using the method
of pairwise comparisons introduced in Serban
et al. (2017). We chose the best baseline model
for each dataset, namely, HRED for En-DSTC2
and Seq2seq+Attn for Cam676. We show each
dialogue context to three different evaluators and
ask them to select the most appropriate response
in that context. The evaluators were given no
information about which model generated which
response. They were allowed to choose an option
for tie if they were not able to decide whether
one model’s response was better than the other
model’s. The results reported in Table 8 suggest
that our model’s responses are favorable in noisy
contexts of spontaneous conversations, such as
those exhibited in the DSTC2 dataset. However,
in a WOZ setting for human–human dialogues,
where the conversations are less spontaneous and
contexts are properly established, both the models
generate appropriate responses.
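
The win, loss, and tie percentages of such a pairwise comparison can be tallied as in the sketch below. It assumes that every individual evaluator judgment counts once (three judgments per dialogue context) rather than taking a majority vote per context; the text does not specify which aggregation was used, and `tally_pairwise` is a hypothetical helper name.

```python
from collections import Counter


def tally_pairwise(judgments):
    """Percentage of 'win', 'loss', and 'tie' labels over all judgments."""
    counts = Counter(judgments)
    total = sum(counts[label] for label in ("win", "loss", "tie"))
    if total == 0:
        return {"win": 0.0, "loss": 0.0, "tie": 0.0}
    return {label: 100.0 * counts[label] / total for label in ("win", "loss", "tie")}
```

Here "win" means the evaluator preferred our model's response over the baseline's, so the three percentages for a dataset sum to 100 up to rounding.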

Qualitative analysis: We show the generated
responses of the baselines and different versions
of our model in Table 9. We see that the Seq2seq+Attn
model is not able to suggest a restaurant with a
high rating whereas HRED gets the restaurant right
but suggests an incorrect price range. However,
RNN+GCN-SeA suggests the correct restaurant
with the preferred attributes. Although GCN-SeA
selects the correct restaurant, it does not provide
the location in its response.

7 Conclusion

We showed that structure-aware representations
are useful in goal-oriented dialogue and our
model outperforms existing methods on four
dialogue datasets. We used GCNs to infuse structural
information of dependency graphs and contextual
graphs to enrich the representations of the
dialogue context and KB. We also proposed a
sequential attention mechanism for combining the
representations of (i) query (current utterance),
(ii) conversation history, and (iii) the KB. Finally,
we empirically showed that when dependency
parsers are not available for certain languages,
such as code-mixed languages, then we can use
word co-occurrence frequencies and PPMI values
to extract a contextual graph and use such a graph
with GCNs for improved performance.
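
As a reminder of how a PPMI-based contextual graph can be extracted, the sketch below keeps an edge between two words whenever their positive pointwise mutual information exceeds a threshold. It assumes sentence-level co-occurrence and whitespace tokenization; the helper name `ppmi_edges` and the default threshold are illustrative rather than the exact settings used in our experiments.

```python
import math
from collections import Counter
from itertools import combinations


def ppmi_edges(sentences, threshold=0.0):
    """Return {(w1, w2): ppmi} for word pairs co-occurring in a sentence."""
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        words = sorted(set(sentence.split()))
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))
    n = len(sentences)
    edges = {}
    for (w1, w2), joint in pair_counts.items():
        # PMI = log( p(w1, w2) / (p(w1) p(w2)) ), probabilities over sentences.
        pmi = math.log((joint * n) / (word_counts[w1] * word_counts[w2]))
        if pmi > threshold:  # with threshold >= 0, the kept value equals PPMI
            edges[(w1, w2)] = pmi
    return edges
```

The resulting edge set plays the role that dependency edges play for languages with a parser: each utterance's words are connected according to this global graph before being passed to the GCN.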

Acknowledgments

We would like to thank the anonymous reviewers
and the action editor for their insightful comments
and suggestions. We would like to thank
the Department of Computer Science and Engineering,
IIT Madras and Robert Bosch Centre
for Data Science and Artificial Intelligence
(RBC-DSAI), IIT Madras for providing the
necessary resources. We would also like to thank
Accenture Technology Labs, India, for supporting
our work through their generous academic
research grant.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua
Bengio. 2015. Neural machine translation by
jointly learning to align and translate. In
Proceedings of the 3rd International Conference
on Learning Representations, ICLR 2015,
San Diego, CA.

Suman Banerjee, Nikita Moghe, Siddhartha
Arora, and Mitesh M. Khapra. 2018. A dataset
for building code-mixed goal oriented conversation
systems. In Proceedings of the 27th
International Conference on Computational
Linguistics, pages 3766–3780.

Joost Bastings, Ivan Titov, Wilker Aziz, Diego
Marcheggiani, and Khalil Simaan. 2017. Graph
convolutional encoders for syntax-aware neural
machine translation. In Proceedings of the 2017
Conference on Empirical Methods in Natural
Language Processing, pages 1957–1967.

Antoine Bordes, Y.-Lan Boureau, and Jason
Weston. 2017. Learning end-to-end goal-oriented
dialog. In Proceedings of the 5th International
Conference on Learning Representations, ICLR
2017, Toulon.

Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang
Tseng, Iñigo Casanueva, Stefan Ultes,
Osman Ramadan, and Milica Gasic. 2018.
MultiWOZ - A large-scale multi-domain wizard-of-Oz
dataset for task-oriented dialogue
modelling. In Proceedings of the 2018 Conference
on Empirical Methods in Natural
Language Processing, pages 5016–5026,
Brussels.

Huadong Chen, Shujian Huang, David Chiang,
and Jiajun Chen. 2017. Improved neural
machine translation with a syntax-aware encoder
and decoder. In Proceedings of the
55th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long
Papers), pages 1936–1945.

Kyunghyun Cho, Bart van Merrienboer, Caglar
Gulcehre, Dzmitry Bahdanau, Fethi Bougares,
Holger Schwenk, and Yoshua Bengio. 2014.
Learning phrase representations using RNN
encoder–decoder for statistical machine translation.
In Proceedings of the 2014 Conference
on Empirical Methods in Natural Language
Processing, pages 1724–1734.

Kenneth Ward Church and Patrick Hanks. 1990.
Word association norms, mutual information,
and lexicography. Computational Linguistics,
16(1):22–29.

Ido Dagan, Shaul Marcus, and Shaul Markovitch.
1993. Contextual word similarity and estimation
from sparse data. In Proceedings of
the 31st Annual Meeting of the Association
for Computational Linguistics, pages 164–171,
Columbus, OH.

Nicola De Cao, Wilker Aziz, and Ivan Titov.
2019. Question answering by reasoning across
documents with graph convolutional networks.
In Proceedings of the 2019 Conference of the
North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers),
pages 2306–2317, Minneapolis, MN.

Michaël Defferrard, Xavier Bresson, and Pierre
Vandergheynst. 2016. Convolutional neural
networks on graphs with fast localized spectral
filtering. In D. D. Lee, M. Sugiyama, U. V.
Luxburg, I. Guyon, and R. Garnett, editors,
Advances in Neural Information Processing
Systems 29, pages 3844–3852. Curran Associates,
Inc.

David K. Duvenaud, Dougal Maclaurin, Jorge
Iparraguirre, Rafael Bombarell, Timothy Hirzel,
Alan Aspuru-Guzik, and Ryan P. Adams. 2015.
Convolutional networks on graphs for learning
molecular fingerprints. In C. Cortes, N. D.
Lawrence, D. D. Lee, M. Sugiyama, and R.
Garnett, editors, Advances in Neural Information
Processing Systems 28, pages 2224–2232.
Curran Associates, Inc.

Jeffrey L. Elman. 1990. Finding structure in time.
Cognitive Science, 14(2):179–211.

Mihail Eric, Lakshmi Krishnan, Francois Charette,
and Christopher D. Manning. 2017. Key-value
retrieval networks for task-oriented
dialogue. In Proceedings of the 18th Annual
SIGdial Meeting on Discourse and Dialogue,
pages 37–49, Saarbrücken.

Mihail Eric and Christopher Manning. 2017. A
copy-augmented sequence-to-sequence architecture
gives good performance on task-oriented
dialogue. In Proceedings of the 15th
Conference of the European Chapter of the
Association for Computational Linguistics:
Volume 2, Short Papers, pages 468–473.

Akiko Eriguchi, Kazuma Hashimoto, and
Yoshimasa Tsuruoka. 2016. Tree-to-sequence
attentional neural machine translation. In
Proceedings of the 54th Annual Meeting of
the Association for Computational Linguistics
(Volume 1: Long Papers), pages 823–833.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and
Jian Sun. 2016. Deep residual learning for
image recognition. In 2016 IEEE Conference
on Computer Vision and Pattern Recognition,
CVPR 2016, pages 770–778, Las Vegas, NV.

Matthew Henderson, Blaise Thomson, and
Jason D. Williams. 2014a. The second dialog
state tracking challenge. In Proceedings of
the SIGDIAL 2014 Conference, The 15th
Annual Meeting of the Special Interest Group
on Discourse and Dialogue, pages 263–272,
Philadelphia, PA.

Matthew Henderson, Blaise Thomson, and
Jason D. Williams. 2014b. The third dialog
state tracking challenge. In 2014 IEEE Spoken
Language Technology Workshop, SLT 2014,
pages 324–329, South Lake Tahoe, NV.

Daniel D. Johnson. 2017. Learning graphical state
transitions. In 5th International Conference on
Learning Representations, ICLR 2017, Toulon.

Diederik P. Kingma and Jimmy Ba. 2015.
Adam: A method for stochastic optimization.
In 3rd International Conference on Learning
Representations, ICLR 2015, San Diego, CA.

Thomas N. Kipf and Max Welling. 2017.
Semi-supervised classification with graph
convolutional networks. In 5th International
Conference on Learning Representations, ICLR
2017, Toulon.

Yujia Li, Daniel Tarlow, Marc Brockschmidt,
and Richard S. Zemel. 2016. Gated graph
sequence neural networks. In 4th International
Conference on Learning Representations, ICLR
2016, San Juan.

Chin-Yew Lin. 2004. ROUGE: A package for
automatic evaluation of summaries. In Text
Summarization Branches Out: Proceedings of
the ACL-04 Workshop, pages 74–81, Barcelona.

Fei Liu and Julien Perez. 2017. Gated end-to-end
memory networks. In Proceedings of the
15th Conference of the European Chapter of
the Association for Computational Linguistics,
2017, pages 1–10, Valencia.

Andrea Madotto, Chien-Sheng Wu, and Pascale
Fung. 2018. Mem2Seq: Effectively incorporating
knowledge bases into end-to-end
task-oriented dialog systems. In Proceedings
of the 56th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long
Papers), pages 1468–1478.

Diego Marcheggiani, Joost Bastings, and Ivan
Titov. 2018. Exploiting semantics in neural
machine translation with graph convolutional
networks. In Proceedings of the 2018 Conference
of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, Volume 2
(Short Papers), pages 486–492.

Diego Marcheggiani and Ivan Titov. 2017.
Encoding sentences with graph convolutional
networks for semantic role labeling. In Proceedings
of the 2017 Conference on Empirical
Methods in Natural Language Processing,
pages 1506–1515.

Yoshiki Niwa and Yoshihiko Nitta. 1994. Co-occurrence
vectors from corpora vs. distance
vectors from dictionaries. In Proceedings of the
15th Conference on Computational Linguistics
– Volume 1, COLING ’94, pages 304–309,
Stroudsburg, PA.

Kishore Papineni, Salim Roukos, Todd Ward,
and Wei-Jing Zhu. 2002. BLEU: A method for
automatic evaluation of machine translation.
In Proceedings of the 40th Annual Meeting of
the Association for Computational Linguistics,
pages 311–318, Philadelphia, PA.

Nanyun Peng, Hoifung Poon, Chris Quirk,
Kristina Toutanova, and Wen-tau Yih. 2017.
Cross-sentence n-ary relation extraction with
graph LSTMs. Transactions of the Association
for Computational Linguistics, 5:101–115.

Minjoon Seo, Sewon Min, Ali Farhadi, and
Hannaneh Hajishirzi. 2017. Query-reduction
networks for question answering. In 5th International
Conference on Learning Representations,
ICLR 2017, Toulon.

Iulian V. Serban, Alessandro Sordoni, Ryan
Lowe, Laurent Charlin, Joelle Pineau, Aaron
Courville, and Yoshua Bengio. 2017. A hierarchical
latent variable encoder-decoder model
for generating dialogues. In Thirty-First
AAAI Conference on Artificial Intelligence,
page 1583.

Iulian Vlad Serban, Alessandro Sordoni, Yoshua
Bengio, Aaron C. Courville, and Joelle
Pineau. 2016. Building end-to-end dialogue
systems using generative hierarchical neural
network models. In Proceedings of the Thirtieth
AAAI Conference on Artificial Intelligence,
pages 3776–3784, Phoenix, AZ.

Nitish Srivastava, Geoffrey E. Hinton, Alex
Krizhevsky, Ilya Sutskever, and Ruslan
Salakhutdinov. 2014. Dropout: A simple way
to prevent neural networks from overfitting.
Journal of Machine Learning Research,
15(1):1929–1958.

Rupesh K Srivastava, Klaus Greff, and Jürgen
Schmidhuber. 2015. Training very deep networks.
In C. Cortes, N. D. Lawrence, D. D. Lee,
M. Sugiyama, and R. Garnett, editors, Advances
in Neural Information Processing Systems 28,
pages 2377–2385. Curran Associates, Inc.

Sainbayar Sukhbaatar, Arthur Szlam, Jason
Weston, and Rob Fergus. 2015. End-to-end
memory networks. In Advances in Neural
Information Processing Systems 28: Annual
Conference on Neural Information Processing
Systems 2015, pages 2440–2448, Montreal.

Shikhar Vashishth, Shib Sankar Dasgupta,
Swayambhu Nath Ray, and Partha Talukdar.
2018. Dating documents using graph convolution
networks. In Proceedings of the
56th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long
Papers), pages 1605–1615.

Tsung-Hsien Wen, David Vandyke, Nikola
Mrkšić, Milica Gašić, Lina M. Rojas-Barahona,
Pei-Hao Su, Stefan Ultes, and
Steve Young. 2017. A network-based end-to-end
trainable task-oriented dialogue system.
In Proceedings of the 15th Conference of
the European Chapter of the Association for
Computational Linguistics: Volume 1, Long
Papers, pages 438–449, Valencia.

Jason D. Williams, Antoine Raux, Deepak
Ramachandran, and Alan W. Black. 2013. The
dialog state tracking challenge. In Proceedings
of the SIGDIAL 2013 Conference, The 14th
Annual Meeting of the Special Interest Group
on Discourse and Dialogue, pages 404–413,
Metz.

Jason D. Williams and Steve J. Young. 2007.
Partially observable Markov decision processes
for spoken dialog systems. Computer Speech &
Language, 21(2):393–422.

Steve J. Young. 2000. Probabilistic methods
in spoken-dialogue systems. Philosophical
Transactions: Mathematical, Physical and
Engineering Sciences, 358(1769):1389–1402.

Tiancheng Zhao, Allen Lu, Kyusong Lee, and
Maxine Eskenazi. 2017. Generative encoder-decoder
models for task-oriented spoken dialog
systems with chatting capability. In Proceedings
of the 18th Annual SIGdial Meeting on
Discourse and Dialogue, pages 27–36.
