Dialogue State Tracking with Incremental Reasoning

Dialogue State Tracking with Incremental Reasoning

Lizi Liao, Le Hong Long, Yunshan Ma, Wenqiang Lei, Tat-Seng Chua
School of Computing
National University of Singapore
{liaolizi.llz, yunshan.ma, wenqianglei}@gmail.com
lehonglong@u.nus.edu
chuats@comp.nus.edu.sg

Abstrakt

Tracking dialogue states to better interpret
user goals and feed downstream policy
learning is a bottleneck in dialogue
management. Common practice has been
to treat it as a problem of classifying
dialogue content into a set of pre-defined
slot-value pairs, or generating values for
different slots given the dialogue history.
Both have limitations on considering
dependencies that occur on dialogues, Und
are lacking of reasoning capabilities. Das
paper proposes to track dialogue states
gradually with reasoning over dialogue
turns with the help of
the back-end
Daten. Empirical results demonstrate that
our method outperforms the state-of-the-
art methods in terms of
joint belief
accuracy for MultiWOZ 2.1, a large-scale
human–human dialogue dataset across
multiple domains.

1

Einführung

to monitor

Dialogue State Tracking (DST) usually works
as a core component
the user’s
intentional states (or belief states) and is cru-
cial for appropriate dialogue management. A
state in DST typically consists of a set of
dialogue acts and slot value pairs. Consider
the task of restaurant reservation as shown in
Figur 1. In each turn,
the user may inform
the agent of particular goals (e.g. single one as
inform(food=Indian) or composed one as
inform(area=center,food=Jamaican)).
Such goals given during a turn are referred as
turn belief. The joint belief
is the set of accu-
mulated turn goals updated until the current turn,
which summarizes the information needed to
successfully maintain and finish the dialogue.

557

Traditionell, dialogue system is supported by
a domain ontology, which defines a collection
of slots and the values that each slot can take.
The aim of DST is to identify good features or
patterns, and map to entries such as specific slot-
value pairs in the ontology. It is often treated as
a classification problem. daher, most efforts
center on (1) finding salient features: from hand-
crafted features (Wang and Lemon, 2013; Sun
et al., 2014A), semantic dictionaries (Henderson
et al., 2014B; Rastogi et al., 2017), to neural
network extracted features (Mrkˇsi´c et al., 2017);
oder (2) investigating effective mappings: aus
rule-based models (Sun et al., 2014B), generative
Modelle (Thomson and Young, 2010; Williams
und Jung, 2007), to discriminative ones (Lee
and Eskenazi, 2013; Ren et al., 2018; Xie
et al., 2018). Andererseits, some researchers
attack these methods’ over-dependence on domain
ontology. They perform DST in the absence of
a comprehensive domain ontology and handle
unknown slot values by generating words from
dialogue history or knowledge source (Rastogi
et al., 2017; Xu and Hu, 2018; Wu et al., 2019).

Jedoch, the critical problem of modeling the
dependencies and reasoning over dialogue history
is not well researched. Many existing methods
work on turn level only, which takes in the cur-
rent turn utterance and outputs the corresponding
turn belief (Henderson et al., 2014B; Zilka and
Jurcicek, 2015; Rastogi et al., 2017; Xu and Hu,
2018). Compared to joint belief, das resultierende
turn belief only reflects single turn informa-
tion, and thus is of less practical use. daher,
the joint belief
more recent efforts target at
that summarizes the dialogue history. Generally
Apropos, they accumulate turn beliefs by rules
((Mrkˇsi´c et al., 2017; Zhong et al., 2018); Nouri
and Hosseini-Asl, 2018) or model information
across turns via various recurrent neural networks
(RNNs) (Wen et al., 2017; Ramadan et al., 2018).

Transactions of the Association for Computational Linguistics, Bd. 9, S. 557–569, 2021. https://doi.org/10.1162/tacl a 00384
Action Editor: Wenjie (Maggie) Li. Submission batch: 7/2020; Revision batch: 1/2021; Published 5/2021.
C(cid:2) 2021 Verein für Computerlinguistik. Distributed under a CC-BY 4.0 Lizenz.

l

D
Ö
w
N
Ö
A
D
e
D

F
R
Ö
M
H

T
T

P

:
/
/

D
ich
R
e
C
T
.

M

ich
T
.

e
D
u

/
T

A
C
l
/

l

A
R
T
ich
C
e

P
D

F
/

D
Ö

ich
/

.

1
0
1
1
6
2

/
T

l

A
C
_
A
_
0
0
3
8
4
1
9
2
3
7
3
9

/

/
T

l

A
C
_
A
_
0
0
3
8
4
P
D

.

F

B
j
G
u
e
S
T

T

Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

provides valuable hints for it to reason about
user goals and update belief states. It is therefore
natural to construct a bipartite graph based on the
database where the entities and entity attributes are
the two groups of nodes; with edges connecting
them to express attribute belonging relation. Als
the example in Figure 1, the database does not
contain restaurant entity serving Jamaican food
and located in center area. Thus there would
be no two-hop path between these two nodes.
Existing methods like Wu et al. (2019) müssen
understand it via system utterances, while a DST
reasoning over database would easily obtain such
clues explicitly.

In diesem Papier, we propose to do reasoning over
turns and reasoning over database in Dialogue
State Tracking (ReDST) for task-oriented systems.
For reasoning over turns, we model dialogue
state tracking as a recursive process in which
the current joint belief relies on the generated
current turn belief and last joint belief. Motivated
by the limited length of single turn utterance
and the good performance of pre-trained BERT
(Devlin et al., 2019), we formalize the turn belief
prediction as a token and sequence classification
Problem. It follows a multitask learning setting
with augmented utterance inputs. To integrate
results, an incremental
the last
inference module is applied for more robust
belief updates. For reasoning over a database,
we abstract the back-end database as a bipartite
graph, and propagate extracted beliefs over the
graph to obtain more realistic dialogue states.
Contributions are summarized as:

turn belief

• We propose to rethink the dialogue state
tracking problem for task-oriented agents,
pointing out the need for proper reasoning
over turns and reasoning over back-end data.
• We represent the database into a bipartite
graph and perform belief propagation on
Es, which enables the belief
Zu
gain insight on potential candidates and
detect conflicting requirements along the
conversation course.

tracker

augmented

• With the help from pre-trained Transformer
models working
short
An
utterance for achieving more accurate turn
beliefs, we incrementally infer joint belief
via reasoning in a turn by turn style and
outperform state-of-the-art methods by a
large margin.

Figur 1: An example dialogue for illustration. Turn
belief labels are provided based on turn information,
while the joint belief captures most updated user
intention up to the current turn.

Although these RNN based methods model dia-
logue in turn by turn style, they usually feed
the whole turn utterance directly to the RNN,
which contains a large portion of noise, and result
in unsatisfactory performance (Liao et al., 2018;
Zhang et al., 2019B). More recently, es gibt
works that directly merge fixed window of past
turns (Perez and Liu, 2017; Wu et al., 2019) as new
input and achieve state-of-the-art performance
(Wu et al., 2019). dennoch, their capability of
modeling long-range dependencies and doing rea-
soning in the interactive dialogue process is rather
limited. Zum Beispiel, (Wu et al., 2019) performs
gated copy to generate slot values from dialogue
Geschichte. Although certain turns of utterances are
exposed to the model, since the interactive signals
are lost when concatenating turns together, it fails
to do in-depth reasoning over turns.

such methods

Very recently, there is research starting to work
in turn-by-turn style with pre-trained models.
Generally speaking,
take the
previous turn’s belief state and the current turn
to generate new dialogue
utterances as input
state (Chao and Lane, 2019; Kim et al., 2020;
Chen et al., 2020). Jedoch, there exists a long
ignored fact that as an agent’s central component,
the state tracker not only receives dialogue
history but also observes the back-end database
or knowledge base. Such an information source

558

l

D
Ö
w
N
Ö
A
D
e
D

F
R
Ö
M
H

T
T

P

:
/
/

D
ich
R
e
C
T
.

M

ich
T
.

e
D
u

/
T

A
C
l
/

l

A
R
T
ich
C
e

P
D

F
/

D
Ö

ich
/

.

1
0
1
1
6
2

/
T

l

A
C
_
A
_
0
0
3
8
4
1
9
2
3
7
3
9

/

/
T

l

A
C
_
A
_
0
0
3
8
4
P
D

.

F

B
j
G
u
e
S
T

T

Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

2 Related Work

2.1 Dialogue State Tracking

A plethora of research has been focused on DST.
We briefly discuss them in general chronological
Befehl. At the early stage, traditional dialogue state
trackers combine semantic information extracted
by Language Understanding (LU) modules to
do DST (Williams and Young, 2007; Williams,
2014). Such trackers accumulate errors from the
LU part and possibly suffer from information
loss of dialogue context. Subsequent word-based
(Henderson et al., 2014B; Zilka and Jurcicek,
2015) trackers thus forgo the LU part and directly
infer states using dialogue history. Hand-crafted
semantic dictionaries are utilized to hold all
key terms, rephrases and alternative mentions to
delexicalize for achieving generalization (Rastogi
et al., 2017).

Kürzlich, most approaches for dialogue state
tracking rely on deep learning models (Wen et al.,
2017; Ramadan et al., 2018). (Mrkˇsi´c et al.,
2017)
leveraged pre-trained word vectors to
resolve lexical/morphological ambiguity. As it
treats slots independently that might result
In
missing relations among slots (Ouyang et al.,
2020), Zhong et al. (2018) proposed global mod-
ules to share parameters between estimators for
different slots. Ähnlich, (Nouri and Hosseini-Asl
2018) used only one recurrent network with global
conditioning to reduce latency while preserving
Leistung. Allgemein, these methods represent
the dialogue state as a distribution over all candi-
date slot values that are defined in the ontology.
This is often solved as a classification or matching
Problem. Jedoch, these methods rely heavily
on a comprehensive ontology, which often might
not be available. daher, Rastogi et al. (2017)
introduced a sophisticated candidate generation
strategy, while (Perez and Liu, 2017) followed
the general paradigm of machine reading and
proposed to solve it using an end-to-end memory
Netzwerk. Xu and Hu (2018) utilized the pointer
network to extract slot values from utterances,
while Wu et al. (2019) integrated copy mechanism
to generate slot values.

Jedoch,

these methods

tend to largely
ignore the dialogue logic and dependencies.
Zum Beispiel,
inter-utterance information and
correlations between slot values have been shown
to be challenging, let alone the frequent goal
shifting of users. Folglich, reasoning over

turns is sensible. We first aim to improve the
turn belief prediction, then model the joint belief
prediction as an updating process. Very recently,
we see such design leveraged by several works.
Zum Beispiel, Chao and Lane (2019) leverage
BERT model to extract slot values for each turn,
then employ a rule-based update mechanism to
track dialogue states across turns. Ren et al. (2019)
encode previous dialogue state and current turn
utterances using Bi-LSTM, then hierarchically
decode domains, slots, and values one after
another. Gleichzeitig, Kim et al. (2020)
encode these inputs with BERT model while
predicting operation gates and generating possible
Werte. Trotzdem, such methods largely ignore the fact
that as an agent, it has access to the back-end
data structure which can be leveraged to further
improve the performance of DST.

2.2 Incremental Reasoning

The ability to do reasoning over the dialogue
history is essential for dialogue state trackers.
Bei
the turn level, we aim to extract more
accurate slot values from user utterance with
the help of contextualized semantic inference.
Contextualized representation learning in NLP
dates back to Collobert and Weston (2008)
but has had a resurgence in the recent year.
Contextualized word vectors were pre-trained
using machine translation data and transferred
to text classification and QA tasks (McCann et al.,
2017). Most recently, BERT (Devlin et al., 2019)
employed Transformer layers (Vaswani et al.,
2017) with a masked language modeling objective
and achieved superior performance across various
tasks. In DST, we also observe a wide adoption
of such models (Shan et al., 2020; Liao et al.,
2021). Zum Beispiel, Kim et al. (2020) and Heck
et al., (2020) adopted the pre-trained BERT as
base network. Hosseini-Asl et al. (2020) applied
the pre-trained GPT-2 (Alec et al., 2019) Modell
as the base network for dialogue state tracking.

At dialogue context level, since we perform
reasoning via belief propagation through graph,
our work is also related to a wide range of graph
reasoning studies. As a relatively early work,
the page-ranking algorithm (Page et al., 1999)
used a random walk with restart mechanism to
perform multi-hop reasoning. Almost at the same
Zeit, loopy belief propagation (Murphy et al.,
1999) was proposed to calculate the approximate
marginal probabilities of vertices in a graph based

559

l

D
Ö
w
N
Ö
A
D
e
D

F
R
Ö
M
H

T
T

P

:
/
/

D
ich
R
e
C
T
.

M

ich
T
.

e
D
u

/
T

A
C
l
/

l

A
R
T
ich
C
e

P
D

F
/

D
Ö

ich
/

.

1
0
1
1
6
2

/
T

l

A
C
_
A
_
0
0
3
8
4
1
9
2
3
7
3
9

/

/
T

l

A
C
_
A
_
0
0
3
8
4
P
D

.

F

B
j
G
u
e
S
T

T

Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

l

D
Ö
w
N
Ö
A
D
e
D

F
R
Ö
M
H

T
T

P

:
/
/

D
ich
R
e
C
T
.

M

ich
T
.

e
D
u

/
T

A
C
l
/

l

A
R
T
ich
C
e

P
D

F
/

D
Ö

ich
/

.

1
0
1
1
6
2

/
T

l

A
C
_
A
_
0
0
3
8
4
1
9
2
3
7
3
9

/

/
T

l

A
C
_
A
_
0
0
3
8
4
P
D

.

F

B
j
G
u
e
S
T

T

Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

Figur 2: The architecture of the proposed ReDST model, which comprises (A) a turn belief generator, (B) A
bipartite belief propagator, Und (C) an incremental belief generator. The turn belief generator will predict values
for domain slot pairs. Together with the last joint belief, the beliefs will be aggregated via the bipartite belief
propagator based on the database structure. Then the incremental belief generator infers the final joint belief.

on partial information. In den vergangenen Jahren, Forschung
on graph reasoning has moved to learn symbolic
inference rules from relational paths in the KG
(Xiong et al., 2017; Das et al., 2017). Under these
settings, a large number of entities and many
types of relationships are usually involved. In
DST, Chen et al. (2020) leveraged schema graphs
containing slot relations, but their method heavily
relied on a complete slot ontology. Zhou and
Small (2019) incorporated a dynamically evolving
knowledge graph to explicitly learn relationships
slots. In our work, only the attribute-belonging
relations are captured, and the constructed graph
is simply a bipartite graph. We thus resort to
heuristic belief propagation on the bipartite graph
for reasoning. Further exploring more advanced
models are treated as our future work.

3 ReDST Model

The proposed ReDST model in Figure 2 consists
of three components: a turn belief generator,
a bipartite graph belief propagator, and an in-
cremental belief generator. Instead of predicting
the joint belief directly from dialogue history,
we perform two-stage inference: It first obtains
turn belief from augmented turn utterance via
transformer models. Dann, it reasons over turn
belief and last joint belief with the help of the

bipartite graph propagation results. Based on this,
it incrementally infers the final joint belief.

To facilitate the model description in detail, Wir
first introduce our mathematical notations here.
We define X = {(U1, R1), · · · (UT , RT )} as the
set of user utterance and system response pairs in
T turns of dialogue, and B = {B1, · · · , BT }
as the joint belief states at each turn. Während
Bt summarizes the dialogue history up to the
current turn t, we also model the turn belief Qt
that corresponds to the belief state of a specific
turn (Ut, Rt), and denote Dt as the domain of
this specific turn. Following (Wu et al., 2019),
we design our state tracker to handle multiple
tasks. Daher, each Bt or Qt consists of tuples
wie (Domain, slot, value). Suppose there are K
anders (Domain, slot) pairs in total, we denote
Yk as the true slot value for the k-th (Domain, slot)
pair.

3.1 BERT-based Turn Belief Generator

Denoting Xt = (Ut, Rt) as the t-th turn utterance,
the goal of turn belief generator is to predict
accurate state for this specific utterance. Obwohl
the dialogue history X can accumulate in arbitrary
Länge, the turn utterance Xt is often relatively
in oftentimes. To utilize contextualized
short
representation for extracting beliefs and enjoy
the good performance of pre-trained encoders,

560

we fine-tune BERT as our base network while
attaching the sequence classification and token
classification layers in a multitask learning setting.
The token classification task extracts specific
slot value spans. The sequence classification task
decides which domain the turn is talking about
and whether a specific (Domain, slot) pair takes
the gate value like yes, NEIN, doncare, none, oder
generate from token classification, und so weiter.
The model architecture of BERT is a multi-
layer bidirectional Transformer encoder based on
the original Transformer model (Vaswani et al.,
2017). The input representation is a concate-
nation of WordPiece embeddings (Wu et al.,
2016), positional embeddings, and the seg-
ment embedding. As we need to predict
Die
values for each (Domain, slot) pair, we aug-
the input sequence as follows. Suppose
ment
as Xt =
we have
the original utterance
x1, · · · , xN , the augmented utterance is then X (cid:3)
t =
[CLS], Domain, slot, [SEP], x1, · · · , xN , [SEP].
The specific (Domain, slot) works as queries to
extract the answer span. We denote the outputs of
BERT as H = h1, …, hN +5.1 The BERT model
is pre-trained with two strategies on large-scale
unlabeled text, das ist, masked language model and
next sentence prediction, which provide a power-
ful context-dependent sentence representation.

We use the hidden state h1 corresponding to
[CLS] as the aggregated sequence representation
to do the domain dt and gate zt classification:

dt = sof tmax(Wdm · (h1)T + bdm),
zt = sof tmax(Wgt · (h1)T + bgt)

where Wdm is trainable weight matrix and bdm
is the bias for domain classification. And Wgt is
trainable weight matrix and bgt is the bias for gate
classification.

For token classification, we feed the hidden
states of other tokens h2, · · · , hN +5 into a softmax
layer to classify over the token labels S, ICH, Ö,
[SEP] von

together. For the former, the cross-entropy loss
Lsc is computed between the predicted d, z and
the true one-hot label ˆd, ˆz,

Lsc = −log(d · (ˆd)T ) − log(z · (ˆz)T ).

(2)

For the latter, we apply another cross-entropy
loss Ltc between each token label in the input
sequence.

Ltc = −

N +5(cid:2)

n=2

log(yn

· (ˆyn)T ).

(3)

We optimize the turn belief generator via a
weighted sum of these two loss functions as below
over all training samples:

Lturn = αLsc + βLtc.

(4)

3.1.1 Filter for Improving Efficiency
As in turn belief, most of the slots will get the
value not mentioned. To enhance the efficiency of
our model, we further design a gate mechanism
similar to Wu et al. (2019) to filter out such slots
Erste, for which we can skip the generation process
and predict the value none directly. We apply the
separate training objective as the cross entropy
loss computed between the predicted slot gate
pf ilter
as below:
S

and the true one-hot label qf ilter

S

Lf ilter = −log(pf ilter

S

· (qf ilter
S

)T ),

where for prediction, we calculate HXt =
fBERT (Xt) as contextualized word representa-
tions for turn utterance, and then apply query
attention to classify whether the slot should be
filtered,

η = Sof tmax(HXt

· (qs)T ),

pf ilter
S

= Sof tmax(Wf ilter · (ηT · HXt)T ).

Wf ilter is the weight matrix and qs is the [CLS]
position’s output from a BERT encoder for the
domain-slot query.

yn = sof tmax(Wtc · (hn)T + btc),

(1)

3.2 Joint Belief Reasoning

where Wtc is trainable weight matrix and btc is
the bias for token classification.

To jointly model the sequence classification
and token classification, we optimize their loss

1For ease of illustration, we ignore the WordPiece

separation effect on token numbers.

the turn level belief
Now we can predict
state for each turn. Intuitively, we can directly
apply our turn belief generator on concatenated
dialogue history to obtain the joint belief as
is hardly
in Wu et al. (2019). Jedoch,
Es
treating all
an optimal practice. First of all,
lose the
utterances as a long sequence will

561

l

D
Ö
w
N
Ö
A
D
e
D

F
R
Ö
M
H

T
T

P

:
/
/

D
ich
R
e
C
T
.

M

ich
T
.

e
D
u

/
T

A
C
l
/

l

A
R
T
ich
C
e

P
D

F
/

D
Ö

ich
/

.

1
0
1
1
6
2

/
T

l

A
C
_
A
_
0
0
3
8
4
1
9
2
3
7
3
9

/

/
T

l

A
C
_
A
_
0
0
3
8
4
P
D

.

F

B
j
G
u
e
S
T

T

Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

iterative character of dialogue, thus resulting in
information loss. Zweite, current models like
recurrent networks or Transformers are known
for not being able to model
the long-range
dependencies well. Long sequences introduce
Die
difficulty to the modeling as well as
computational complexity of Transformers. Der
WordPiece separation operation makes sequences
even longer. daher, we simulate the dialogue
procedure as a recursive process where current
joint belief Bt relies on the last joint belief Bt−1
and the current turn belief Qt. Generally speaking,
we use Bt−1 and Qt to perform belief propagation
on the bipartite graph constructed based on the
back-end database to obtain credibility score for
each slot value pairs. Dann, we do incremental
belief reasoning over the recursive process using
different methods.

3.2.1 Bipartite Graph Belief Propagator

As the central component for dialogue systems,
the dialogue state tracker has access to the back-
end database most of the time. In the course
of the task-oriented dialogue, the user and agent
interact with each other to reach the same stage of
information awareness regarding a specific task.
The user expresses requirements that, often, Sind
hard to meet. The agent resorts to the back-end
database and responds accordingly. Then the user
would adjust their requirements to get the task
done. In most existing DSTs, the tracker has to
infer such adjustment requirements from dialogue
Geschichte. With reasoning over the agent’s database,
we expect to harvest more accurate clues explicitly
for belief update.

Folglich, we abstract

the database as
a bipartite graph G = (V, E), where vertices
are partitioned into two groups: The entity set
Vent and attribute set Vattr, where V = Vent ∪
Vattr and Vent ∩ Vattr = φ. The entities within
Vent and Vattr are totally disconnected. Edges
link two vertices from each of Vent and Vattr,
representing the attribute belonging relationship.
During each turn, we first map the predicted Qt
and last joint belief Bt−1 to belief distributions
over the graph via the function g(·). Here we
apply fuzzy match and calculate the similarity
with a threshold (cid:5) to realize g(·). We use BERT
tokenizer to tokenize both dialogue and database
entries. The mapping is done based on a pre-
set threshold on the token level overlap ratio. Für
Beispiel, the generated ‘cambridge punt ##er’ will

be mapped to the database entry ‘the cambridge
punt ##er’ when their overlap ratio is larger than
(cid:5). In our experiment, we find that approximately
60.5% of entity names and 12.2% other slot values
can be mapped.2 This mapping operation actually
helps to correct some minor errors made in span
extraction or generation.

After the mapping of beliefs to the database
bipartite graph via g(·), we start to do belief
propagation over the graph. Generally speaking,
there are two kinds of belief propagation in the
bipartite graph. The first is from Vent to Vattr.
It simulates the situation when a venue entity is
erwähnt, its attributes will be activated. Für
Beispiel, after a restaurant is recommended, A
nearby hotel will have the same location value
with it. The second one is from Vattr to Vent.
This simulates the situation when an attribute is
erwähnt, all entities having this attribute will
also receive the propagated beliefs. If an entity
gets more attributes mentioned, it will receive
more propagated beliefs. Suppose the propagation
result is ct for the current turn t, it can be viewed
as the credibility scores of the state values after
reasoning over the database graph. We reason over
this set of entries via doing belief propagation in
the bipartite graph to obtain the certainty scores
for them as below:

ct = γ · g(Bt−1) + η · g(Qt) · (ICH + Wadj),

(5)

where γ is a hyper-parameter for modeling the
credibility decay, because newly provided slot
values usually reflect more updated user intention.
η adjusts the effect of propagated beliefs. Wadj
is the adjacency matrix of the bipartite graph.
Note that the belief propagation method is rather
simple but effective. We tried more advanced
methods such as loopy belief propagation (Murphy
et al., 1999). Jedoch, we did not see obvious
performance gain, which might be due to the
relatively small bipartite graph size (273 Knoten
in Summe). Auch, we suspect that graph reasoning
might be more helpful for down stream tasks such
as action prediction. We will explore further in
future.

3.2.2 Incremental Belief Generator
With the credibility scores ct obtained from the
belief propagator, we now incrementally infer the

2Over half of the slot values are time, Menschen, stay, day,
usw. There are no such nodes in the bipartite graph but we
keep these slot values’ existence in the belief vector

562

l

D
Ö
w
N
Ö
A
D
e
D

F
R
Ö
M
H

T
T

P

:
/
/

D
ich
R
e
C
T
.

M

ich
T
.

e
D
u

/
T

A
C
l
/

l

A
R
T
ich
C
e

P
D

F
/

D
Ö

ich
/

.

1
0
1
1
6
2

/
T

l

A
C
_
A
_
0
0
3
8
4
1
9
2
3
7
3
9

/

/
T

l

A
C
_
A
_
0
0
3
8
4
P
D

.

F

B
j
G
u
e
S
T

T

Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

current joint belief Bt. Mathematically, we have

Bt = f (Qt, Bt−1, ct).

(6)

The function f integrates evidence from the
turn belief, last joint belief, and the propagated
credibility scores. There are wide variety of
models that can be applied. We may leverage the
straight-forward Multi-Layer Perceptron (MLP) Zu
model the interactions between these beliefs (Er
et al., 2017) deeply. Due to the sequential nature
of the belief generator, we can also apply GRU
cells to predict the beliefs turn by turn (Cho et al.,
2014). Intuitively, given these remaining and new
belief entries as well as credibility scores, Die
essential task here is to reason out what entries
to keep, update, or delete. daher, we make
use of these information to carry out the operation
classification task. There are three operations keep,
update, and delete to choose from for each domain
slot. For the GRU case, the detailed equation for
operation classification is as below:

ht = GRU (W · [G(Qt), ct], ht−1)
opk = sof tmax(Wopk

· (ht)T + bopk ),

where W · [G(Qt), ct] and ht−1 are the inputs to the
GRU cell. [, ] denotes vector concatenation. Wopk
and bopk are the weight matrix and bias vector for
the corresponding k-th (Domain, slot) pair. Nach
the operation op in the current turn t is predicted,
we obtain the corresponding current joint belief
Bt via performing corresponding operations.

4 Experimente

4.1 Dataset

We carry out experiments on MultiWOZ 2.1
(Eric et al., 2019). It is a multi-domain dialogue
dataset spanning seven distinct domains and
containing over 10,000 dialogues. As compared to
MultiWOZ 2.0, it fixed substantial noisy dialogue
state annotations and dialogue utterances that
could negatively impact the performance of state-
tracking models. In MultiWOZ 2.1,
es gibt
30 domain-slot pairs and over 4,500 möglich
Werte, which is different from existing standard
datasets like WOZ (Wen et al., 2017) and DSTC2
(Henderson et al., 2014A), which have fewer than
ten slots and only a few hundred values. We follow
the original training, validation, and testing split
and directly use the DST labels. Since the hospital
and police domain have very few dialogues (10%

compared to others) and only appear in the training
set, we only use the other five domains in our
Experiment.

4.2 Settings

Training Details Our model is trained in a two-
stage style. We first train the turn belief generator
using the Adam optimizer with a batch size of 32.
We adopt the bert-base-uncased version of BERT
and initialize the learning rate for fine-tuning as
3e-5. The α and β in Equation 4 are set to 0.05
Und 1.0, jeweils. We use the average of the
last four hidden layer outputs of BERT as the final
representation of each token.

During the later reasoning stage, regarding
incremental belief reasoning, we use a fully
connected two-layer feed-forward neural network
with ReLU activation for MLP. The hidden size
is set to 500, and the learning rate is initialized
als 0.002. For GRU, we set the learning rate as
0.005. We pre-process turn utterances to alleviate
the problem of ground truth absence, Zum Beispiel,
formalize time values into standard forms. Similar
to Heck et al. (2020), we also make use of the
system acts to enrich the system utterances.

Evaluation Metrics Similar to Wu et al. (2019),
we adopt the evaluation metric joint goal accuracy
to evaluate the performance. It is a relatively
strict elevation standard. The joint goal accuracy
compares the predicted belief states to the ground
truth Bt at each turn t. The joint accuracy is 1.0 Wenn
and only if all (Domain, slot, value) triplets are
predicted correctly at each turn, otherwise it is 0.

Baselines We denote the two versions of ReDST
with different
incremental reasoning modules
as ReDST M LP , and ReDST GRU . Sie sind
compared with the following baselines.

DST Reader
(Gao et al., 2019): It treats DST
as a reading comprehension problem. Given the
Geschichte, it learns to extract slot values as spans.

HyST (Goel et al., 2019):
It combines a
hierarchical encoder in a fixed vocabulary system
with an open vocabulary n-gram copy-based
System.

TRADE (Wu et al., 2019): It concatenates
the whole dialogue history as input and uses a
generative state tracker with a copy mechanism to
generate value for each slot separately.

563

l

D
Ö
w
N
Ö
A
D
e
D

F
R
Ö
M
H

T
T

P

:
/
/

D
ich
R
e
C
T
.

M

ich
T
.

e
D
u

/
T

A
C
l
/

l

A
R
T
ich
C
e

P
D

F
/

D
Ö

ich
/

.

1
0
1
1
6
2

/
T

l

A
C
_
A
_
0
0
3
8
4
1
9
2
3
7
3
9

/

/
T

l

A
C
_
A
_
0
0
3
8
4
P
D

.

F

B
j
G
u
e
S
T

T

Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

DST-Picklist
(Zhang et al., 2019A): Given the
whole dialogue history as input, it uses two BERT-
based encoders and takes a hybrid approach
of predefined ontology-based DST and open
vocabulary-based DST. It defines picklist-based
slots for classification and span-based slots for
span extraction like DSTRead (Gao et al., 2019).

SOM (Kim et al., 2020): It works in turn-by-turn
style and considers state as an explicit fixed-sized
Erinnerung, and adopts a selectively overwriting
mechanism for generating values with copy.

SST (Chen et al., 2020): It leverages a graph
attention matching network to fuse information
from utterances and schema graphs. A recurrent
graph attention network controls state updating. Es
relies on a predefined ontology.

4.3 DST Results

We first compare our model with the state-of-the-
art methods. As shown in Table 1, we observe that
our method outperforms all the other baselines.
Zum Beispiel, in terms of joint accuracy, welche
is a rather strict metric, ReDST GRU improves
the performance by 46.2%, 17.4%, Und 1.3% als
compared to open-vocabulary based methods: Die
DST Reader, TRADE, and SOM, jeweils.
Based on results in Table 1, the methods such
as DST-Picklist and SST perform better than
our method. Jedoch, they rely heavily on a
predefined ontology. In such methods, der Wert
candidates for each slot to choose from are fixed
bereits. They cannot handle unknown slot values,
which largely limits their application in real-life
scenarios.

We observe that a large portion of baselines
work on relatively long window-sized dialogue
Geschichte. FJST directly encodes the raw dialogue
history using recurrent neural networks. In con-
trast, HJST first encodes turn utterance to vectors
using a word-level RNN, and then encodes the
whole history to vectors using a context level
RNN. Jedoch, the lower performance of HJST
demonstrates its inefficiency in learning useful
features in this task. Based on HJST, HyST man-
ages to achieve better performance by further
integrating a copy-based module. Trotzdem, die Perfor-
mance is lower than TRADE, which encodes the
raw concatenated whole dialogue history, gener-
ates or copies slot values with extra slot gates.
Generally speaking, these baselines are based on

predefined
ontology

offen-
vocabulary

Modell
FJST
HJST
HyST
DST-Picklist
SST
DST Reader
TRADE
TRADE w/o gate
SOM
ReDST M LP
ReDST GRU

Joint Acc
0.378
0.356
0.381
0.533
0.552
0.364
0.453
0.411
0.525
0.511
0.532

Tisch 1: The multi-domain DST evaluation
results on the MultiWOZ 2.1 dataset. Der
ReDST GRU method achieves the highest
joint accuracy.

recurrent neural networks for encoding dialogue
Geschichte. Since the interactions between user and
agent can be arbitrarily long and recurrent neural
networks are not effective in modeling long-range
dependencies, they might not be a good choice
to model the dialogue for DST. Andererseits,
single turn utterances usually are short and con-
tain relatively simple information as compared
to complicated dialogue history. It is thus better
to generate belief in turn level and then integrate
them via reasoning. According to the comparisons
of baselines, the superior performance of SST,
SOM, and ReDSTs validate this design.

Darüber hinaus, we also tested the performance of
TRADE without the slot gate. The performance
drops dramatically–from 0.453 Zu 0.411 in terms
of joint accuracy. We suspect that this is due
to lengthy dialogue history, where the decoder
and copy mechanism start to lose focus. It might
generate some value that appears in dialogue
history but is not the ground truth. daher, Die
slot gate is used to decide which slot value should
genommen werden, which resembles the inference in some
sense. To validate this, we feed the single turn
utterances to TRADE and generate the turn beliefs
as output. Interessant, we find that it performs
similar with gate or without it, which validates our
guess. Jedoch, such resembled inference is not
enough. When the dialogue history becomes long,
the gating mechanism will lose its focus easily.
Entsprechend, we report the results of TRADE and
ReDST GRU on the last four turns of dialogues in
Tisch 2. The better performance of ReDST GRU

564

l

D
Ö
w
N
Ö
A
D
e
D

F
R
Ö
M
H

T
T

P

:
/
/

D
ich
R
e
C
T
.

M

ich
T
.

e
D
u

/
T

A
C
l
/

l

A
R
T
ich
C
e

P
D

F
/

D
Ö

ich
/

.

1
0
1
1
6
2

/
T

l

A
C
_
A
_
0
0
3
8
4
1
9
2
3
7
3
9

/

/
T

l

A
C
_
A
_
0
0
3
8
4
P
D

.

F

B
j
G
u
e
S
T

T

Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

Modell
TRADE
ReDST GRU

T-3
0.411
0.487

T-2
0.339
0.440

T-1
0.269
0.391

T
0.282
0.377

Setting
ReDST M LP
ReDST GRU

w BP
0.511
0.532

w/o BP
0.507
0.530

Tisch 2: The last four turns’ joint accuracy of
TRADE and proposed ReDST. (T refers to the
last turn of each dialogue session.)

Tisch 4: The joint accuracy results for ReDST
methods with or without bipartite graph reasoning.

Modell

TRADE
SOM
ReDST

Joint Acc

0.697
0.799
0.808

Tisch 3: The turn belief generation results of
TRADE, SOM, and proposed ReDST.

further validates the importance of reasoning over
turns. Usually, as the interactive dialogue goes
An, users might frequently adjust
their goals,
which requires special consideration. Since turn
utterance is relatively more straightforward and
dialogue is turn by turn in nature, doing DST turn
by turn is a useful and practical design.

4.4 Component Analysis

Since our model makes use of the advanced
BERT structure to learn the contextualized
representation, we first test how much contribution
the BERT has made. daher, we carried out a
study on a turn belief generator and compare it
with SOM and the BiLSTM baseline TRADE on
the single turn utterance. As shown in Table 3, Wir
observe that the BERT-based SOM and ReDST
indeed perform better than single turn TRADE.
This is due to the usage of pre-trained BERT
in learning better-contextualized features. Im
multitask setting of our design, both the token
classification and sequence classification tasks
benefit from BERT’s strength. Darüber hinaus, Wir
notice that when doing the single turn setting,
the system response usually depends on certain
information mentioned in the former turn user
utterance. daher, we concatenate the former
turn utterance to each current single turn as the
input for BERT. Under this setting, we achieved
in performance regarding joint
a large boost
accuracy as in Table 3. It provides an excellent
base for the later stage inferences.

We also tested the effect of reasoning over
the database. For a clear comparison, we ignore
the evidence obtained via bipartite graph belief

propagation while keeping other settings the same.
To show it more clearly, we re-organize the
results in Table 4. It can be observed that both
ReDST M LP and ReDST GRU gain a bit from
belief propagation. It validates the usefulness of
database reasoning. Jedoch, since the graph
is rather small, the performance improvement is
rather limited. Similar patterns are found in Chen
et al., (2020) and we suspect that it will be more
helpful with larger database structure. Auch, Wir
will further explore its usage in down-stream tasks
such as action prediction.

For different incremental reasoning modules,
the results are also shown in Table 1. We find
that ReDST GRU performs better. Jedoch, Wir
notice that simply accumulating turn belief as in
Zhong et al. (2018) performs very well. The rule
is to add newly predicted turn belief entries to the
last joint belief. When different values for a slot
appear, only keep the new one. Although this rule
seems simple, it actually reflects the dialogue’s
interactive and updating nature. We tried to
directly apply this rule on the ground truth turn
belief to generate joint belief. It results in 0.963
joint accuracy. Jedoch, a critical problem of
such accumulation rule is that when the generated
turn belief is wrong, it will not be able to add a
missing entry or delete a wrong entry. By applying
GRU in ReDST GRU , it manages to modify a bit
with the help of database evidence. Trotzdem, there is
large space for more powerful reasoning models
to address this error accumulation issue. We will
further investigate in this direction.

4.5 Error Analysis

We also provide error analysis regarding each slot
for ReDST GRU in Figure 3. To make it more clear,
we also list the results of SOM for comparison. Wir
observe that a large portion of the improvements
for our method are on name entities and time-
related slots. As mentioned in Wu et al. (2019),
name slots in the attraction, restaurant, Und
hotel domains have the highest error rates. Es
is partly because these slots have a relatively

565

l

D
Ö
w
N
Ö
A
D
e
D

F
R
Ö
M
H

T
T

P

:
/
/

D
ich
R
e
C
T
.

M

ich
T
.

e
D
u

/
T

A
C
l
/

l

A
R
T
ich
C
e

P
D

F
/

D
Ö

ich
/

.

1
0
1
1
6
2

/
T

l

A
C
_
A
_
0
0
3
8
4
1
9
2
3
7
3
9

/

/
T

l

A
C
_
A
_
0
0
3
8
4
P
D

.

F

B
j
G
u
e
S
T

T

Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

algorithms for performing reasoning over turns
and on graphs for generating more accurate
summarization of user intention.

Danksagungen

This research is supported by the National Re-
search Foundation, Singapur, under its Inter-
national Research Centres in Singapore Funding
Initiative. Any opinions, Erkenntnisse, and conclusions
or recommendations expressed in this material are
those of the author(S) and do not reflect the views
of National Research Foundation, Singapur.

Verweise

Radford Alec, Wu Jeffrey, Child Rewon, Luan
David, Amodei Dario, and Sutskever Ilya. 2019.
Language models are unsupervised multitask
learners. Technical report, OpenAI.

end-to-end dialogue

Guan-Lin Chao and Ian Lane. 2019. BERT-
DST: Scalable
state
tracking with bidirectional encoder represen-
tations from transformer. In INTERSPEECH,
pages 1468–1472. DOI: https://doi.org
/10.21437/Interspeech.2019-1355

Lu Chen, Boer Lv, Chi Wang, Su Zhu, Bowen
Bräunen, and Kai Yu. 2020. Schema-guided
multi-domain dialogue state tracking with
In AAAI,
graph attention neural networks.
7521–7528. DOI: https://doi
Seiten
.org/10.1609/aaai.v34i05.6250

Kyunghyun Cho, Bart van Merrienboer, Caglar
Gulcehre, Dzmitry Bahdanau, Fethi Bougares,
Holger Schwenk, and Yoshua Bengio. 2014.
Learning phrase representations using rnn
encoder–decoder for statistical machine trans-
lation. In EMNLP, pages 1724–1734.

Ronan Collobert and Jason Weston. 2008. A
unified architecture for natural language pro-
Abschließen: Deep neural networks with multitask
In ICML, pages 160–167. DOI:
learning.
https://doi.org/10.1145/1390156
.1390177

Rajarshi Das, Shehzaad Dhuliawala, Manzil
Zaheer, Luke Vilnis, Ishan Durugkar, Akshay
Krishnamurthy, Alex Smola, and Andrew
McCallum. 2017. Go for a walk and arrive at
the answer: Reasoning over paths in knowledge

Figur 3: Slot error rate on the test set. The error rate
for name slots on restaurant, hotel, and attraction
domain drops 4.2% on average.

large number of possible values that are hard
to recognize. In ReDST GRU , we map beliefs
into a bipartite graph constructed via database
and do belief propagation on it. This helps
to improve the accuracy on name slots. Auch,
the classification gate design helps to improve
performance on Yes/No slots. We also observe that
the performance for taxi destination becomes
worse. This is due to the value co-reference
phenomenon where the user might just mention
‘taxi to the hotel’ to refer to the hotel name
mentioned earlier. These findings are interesting
and we will explore it further.

5 Abschluss

We rethink DST from the angle of agent and
point out the urgent need for in-depth reasoning
other than being obsessed with generating values
from history text as a whole. We demonstrated
the importance of doing reasoning over turns
and over the database. In detail, we fine-tuned
pre-trained BERT for more accurate turn level
belief generation while doing belief propagation in
bipartite graph to harvest more clues. Experimente
on a large-scale multi-domain dataset demonstrate
the superior performance of the proposed method.
In the future, we will explore more advanced

566

l

D
Ö
w
N
Ö
A
D
e
D

F
R
Ö
M
H

T
T

P

:
/
/

D
ich
R
e
C
T
.

M

ich
T
.

e
D
u

/
T

A
C
l
/

l

A
R
T
ich
C
e

P
D

F
/

D
Ö

ich
/

.

1
0
1
1
6
2

/
T

l

A
C
_
A
_
0
0
3
8
4
1
9
2
3
7
3
9

/

/
T

l

A
C
_
A
_
0
0
3
8
4
P
D

.

F

B
j
G
u
e
S
T

T

Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

bases using reinforcement
preprint arXiv:1711.05851.

learning. arXiv

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Und
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
Verständnis. In NAACL, pages 4171–4186.

Mihail Eric, Rahul Goel, Shachi Paul, Abhishek
Sethi, Sanchit Agarwal, Shuyang Gao,
and Dilek Hakkani-T¨ur. 2019. MultiWOZ
2.1: Multi-domain dialogue
correc-
tions and state tracking baselines. CoRR,
abs/1907.01669.

state

Shuyang Gao, Abhishek Sethi, Sanchit Agarwal,
Tagyoung Chung, and Dilek Hakkani-Tur.
2019. Dialog state tracking: A neural read-
In SIGDIAL,
ing comprehension approach.
pages 264–273. DOI: https://doi.org
/10.18653/v1/W19-5932

Rahul Goel, Shachi Paul, and Dilek Hakkani-T¨ur.
2019. HyST: A hybrid approach for flexible
and accurate dialogue state tracking. arXiv
preprint arXiv:1907.00883. DOI: https://
doi.org/10.21437/Interspeech
.2019-1863

Xiangnan He, Lizi Liao, Hanwang Zhang,
Liqiang Nie, Xia Hu, and Tat-Seng Chua.
2017. Neural collaborative filtering. In WWW,
pages 173–182.

Michael Heck, Carel van Niekerk, Nurul Lubis,
Christian Geishauser, Hsien-Chin Lin, Marco
Moresi, and Milica Gaˇsi´c. 2020. Trippy: A
triple copy strategy for value independent neural
dialog state tracking. In SIGDIAL, pages 35–44.

Und
Matthew Henderson, Blaise Thomson,
Jason D. Williams. 2014A. Der
zweite
dialog state tracking challenge. In SIGDIAL,
pages 263–272. DOI: https://doi.org
/10.3115/v1/W14-4337

Matthew Henderson, Blaise Thomson,

Und
Steve Young. 2014B. Word-based dialog
state tracking with recurrent neural net-
funktioniert. In SIGDIAL, pages 292–299. DOI:
https://doi.org/10.3115/v1/W14
-4340

Ehsan Hosseini-Asl, Bryan McCann, Chien-
and Richard

Sheng Wu, Semih Yavuz,

Socher. 2020. A simple language model
task-oriented dialogue. arXiv preprint
für
arXiv:2005.00796.

Sungdong Kim, Sohee Yang, Gyuwan Kim, Und
Sang-Woo Lee. 2020. Efficient dialogue state
tracking by selectively overwriting memory. In
ACL, pages 567–582.

Sungjin Lee

and Maxine Eskenazi. 2013.
Recipe for building robust spoken dialog state
trackers: Dialog state tracking challenge system
description. In SIGDIAL, pages 414–422.

Lizi Liao, Yunshan Ma, Xiangnan He, Richang
Hong, and Tat-seng Chua. 2018. Knowledge-
aware multimodal dialogue systems. In Pro-
the 26th ACM international
ceedings of
conference on Multimedia, pages 801–809.
DOI: https://doi.org/10.1145/3240508
.3240605

Lizi Liao, Tongyao Zhu, Long Lehong, and Tat-
Seng Chua. 2021. Multi-domain dialogue state
tracking with recursive inference. In The Web
Conference. To appear.

Bryan McCann,

James Bradbury, Caiming
Xiong, and Richard Socher. 2017. Learned
in translation: Contextualized word vectors. In
NIPS, pages 6294–6305.

Nikola Mrkˇsi´c, Diarmuid ´O. S´eaghdha, Tsung-
Hsien Wen, Blaise Thomson, and Steve
tracker: Data-
Jung. 2017. Neural belief
In ACL,
driven dialogue
pages 1777–1788. DOI: https://doi.org
/10.18653/v1/P17-1163

Verfolgung.

state

Kevin P. Murphy, Yair Weiss, and Michael I.
Jordanien. 1999. Loopy belief propagation for
approximate inference: An empirical study. In
UAI, pages 467–475.

Elnaz Nouri and Ehsan Hosseini-Asl. 2018.
Toward scalable neural dialogue state tracking
Modell. arXiv preprint arXiv:1812.00899.

Yawen Ouyang, Moxin Chen, Xinyu Dai,
Yinggong Zhao, Shujian Huang, and Jiajun
Chen. 2020. Dialogue state tracking with
explicit slot connection modeling. In ACL,
pages 34–40. DOI: https://doi.org
/10.18653/v1/2020.acl-main.5

567

l

D
Ö
w
N
Ö
A
D
e
D

F
R
Ö
M
H

T
T

P

:
/
/

D
ich
R
e
C
T
.

M

ich
T
.

e
D
u

/
T

A
C
l
/

l

A
R
T
ich
C
e

P
D

F
/

D
Ö

ich
/

.

1
0
1
1
6
2

/
T

l

A
C
_
A
_
0
0
3
8
4
1
9
2
3
7
3
9

/

/
T

l

A
C
_
A
_
0
0
3
8
4
P
D

.

F

B
j
G
u
e
S
T

T

Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

Lawrence Page, Sergey Brin, Rajeev Motwani,
and Terry Winograd. 1999, The pagerank
citation ranking: Bringing order to the Web.
Stanford InfoLab.

Julien Perez and Fei Liu. 2017. Dialog state track-
ing, a machine reading approach using memory
In EACL, pages 305–314. DOI:
Netzwerk.
https://doi.org/10.18653/v1/E17
-1029

Osman Ramadan, Paweł Budzianowski, Und
Milica Gasic. 2018. Large-scale multi-domain
belief tracking with knowledge sharing. In
ACL, pages 432–437. DOI: https://doi
.org/10.18653/v1/P18-2069

Abhinav Rastogi, Dilek Hakkani-T¨ur, and Larry
Heck. 2017. Scalable multi-domain dia-
In ASRU Workshop,
logue state tracking.
pages 561–568. DOI: https://doi.org
/10.1109/ASRU.2017.8268986

Liliang Ren, Jianmo Ni, and Julian McAuley.
2019. Scalable and accurate dialogue state
tracking via hierarchical sequence generation.
In EMNLP, pages 1876–1885.

Liliang Ren, Kaige Xie, Lu Chen, and Kai
Yu. 2018. Towards universal dialogue state
Verfolgung. In EMNLP, pages 2780–2786.

Yong Shan, Zekang Li, Jinchao Zhang, Fandong
Meng, Yang Feng, Cheng Niu, and Jie Zhou.
2020. A contextual hierarchical attention
network with adaptive objective for dialogue
state tracking. In ACL, pages 6322–6333. DOI:
https://doi.org/10.18653/v1/2020
.acl-main.563

Kai Sun, Lu Chen, Su Zhu, and Kai Yu.
2014A. A generalized rule based tracker for
dialogue state tracking. In SLT Workshop,
pages 330–335. DOI: https://doi.org
/10.1109/SLT.2014.7078596

Kai Sun, Lu Chen, Su Zhu, and Kai Yu. 2014B.
The sjtu system for dialog state tracking chal-
lenge 2. In SIGDIAL, pages 318–326. DOI:
https://doi.org/10.3115/v1/W14
-4343

Blaise Thomson and Steve Young. 2010. Bayesian
update of dialogue state: A POMDP frame-
work for spoken dialogue systems. Computer

Speech & Language, 24(4):562–588. DOI:
https://doi.org/10.1016/j.csl
.2009.07.003

Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N.
Gomez, Łukasz Kaiser, and Illia Polosukhin.
2017. Attention is all you need. In NIPS,
pages 5998–6008.

Zhuoran Wang and Oliver Lemon. 2013. A
simple and generic belief tracking mechanism
for the dialog state tracking challenge: An
the believability of observed information. In
SIGDIAL, pages 423–432.

T. H. Wen, D. Vandyke, N. Mrkˇs´ıc, M. Gaˇs´ıc,
L. M. Rojas-Barahona, P. H. Su, S. Ultes,
and S. Jung. 2017. A network-based end-
trainable
to-end
dialogue
task-oriented
In EACL, pages 438–449. DOI:
System.
https://doi.org/10.18653/v1/E17
-1042

Jason D. Williams. 2014. Web-style ranking
and slu combination for dialog state track-
In SIGDIAL, pages 282–291. DOI:
ing.
https://doi.org/10.3115/v1/W14
-4339

Jason D. Williams and Steve Young. 2007.
Partially observable markov decision pro-
cesses for spoken dialog systems. Computer
Speech & Language, 21(2):393–422. DOI:
https://doi.org/10.1016/j.csl
.2006.06.008

Chien-Sheng Wu, Andrea Madotto, Ehsan
Hosseini-Asl, Caiming Xiong, Richard Socher,
and Pascale Fung. 2019. Transferable multi-
domain state generator
task-oriented
dialogue systems. In ACL, pages 808–819.

für

Yonghui Wu, Mike Schuster, Zhifeng Chen,
Quoc V. Le, Mohammad Norouzi, Wolfgang
Macherey, Maxim Krikun, Yuan Cao, Qin
Gao, Klaus Macherey, Jeff Klingner, Apurva
Shah, Melvin Johnson, Xiaobing Liu, Łukasz
Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku
Kudo, Hideto Kazawa, Keith Stevens, George
Kurian, Nishant Patil, Wei Wang, Cliff Young,
Jason SmithJason Smith, Jason Riesa, Alex
Rudnick, Oriol Vinyals, Greg Corrado, Macduff
Hughes, and Jeffrey Dean. 2016. Google’s
neural machine translation system: Bridging the

568

l

D
Ö
w
N
Ö
A
D
e
D

F
R
Ö
M
H

T
T

P

:
/
/

D
ich
R
e
C
T
.

M

ich
T
.

e
D
u

/
T

A
C
l
/

l

A
R
T
ich
C
e

P
D

F
/

D
Ö

ich
/

.

1
0
1
1
6
2

/
T

l

A
C
_
A
_
0
0
3
8
4
1
9
2
3
7
3
9

/

/
T

l

A
C
_
A
_
0
0
3
8
4
P
D

.

F

B
j
G
u
e
S
T

T

Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

gap between human and machine translation.
arXiv preprint arXiv:1609.08144.

https://doi.org/10.18653/v1/2020
.emnlp-main.243

Kaige Xie, Cheng Chang, Liliang Ren, Lu Chen,
and Kai Yu. 2018. Cost-sensitive active
In
learning for dialogue
SIGDIAL, pages 209–213. DOI: https://
doi.org/10.18653/v1/W18-5022

Verfolgung.

state

Wenhan Xiong, Thien Hoang, and William Yang
Wang. 2017. Deeppath: A reinforcement
learning method for knowledge graph rea-
soning. In EMNLP, pages 564–573. DOI:
https://doi.org/10.18653/v1/D17
-1060

Puyang Xu and Qi Hu. 2018. An end-to-
end approach for handling unknown slot
values in dialogue state tracking. In ACL,
pages 1448–1457.

Jian-Guo Zhang, Kazuma Hashimoto, Chien-
Sheng Wu, Yao Wan, Philip S. Yu, Richard
Socher, and Caiming Xiong. 2019A. Find
slot-value
or classify? Dual
predictions on multi-domain dialog state track-
ing. arXiv preprint arXiv:1910.03544. DOI:

strategy for

Zheng Zhang, Lizi Liao, Minlie Huang, Xiaoyan
Zhu, and Tat-Seng Chua. 2019B. Neural mul-
timodal belief tracker with adaptive attention
for dialogue systems. In The World Wide
Web Conference, pages 2401–2412. DOI:
https://doi.org/10.1145/3308558
.3313598

Victor Zhong, Caiming Xiong, and Richard
Socher. 2018. Global-locally self-attentive en-
coder for dialogue state tracking. In ACL,
1458–1467. DOI: https://doi
Seiten
.org/10.18653/v1/P18-1135

Li Zhou and Kevin Small. 2019. Multi-domain
dialogue state tracking as dynamic knowledge
graph enhanced question answering. arXiv
preprint arXiv:1911.06192.

lstm-based

Lukas Zilka and Filip Jurcicek. 2015. Incre-
mental
tracker.
In ASRU Workshop, pages 757–762. DOI:
https://doi.org/10.1109/ASRU.2015
.7404864

dialog

state

l

D
Ö
w
N
Ö
A
D
e
D

F
R
Ö
M
H

T
T

P

:
/
/

D
ich
R
e
C
T
.

M

ich
T
.

e
D
u

/
T

A
C
l
/

l

A
R
T
ich
C
e

P
D

F
/

D
Ö

ich
/

.

1
0
1
1
6
2

/
T

l

A
C
_
A
_
0
0
3
8
4
1
9
2
3
7
3
9

/

/
T

l

A
C
_
A
_
0
0
3
8
4
P
D

.

F

B
j
G
u
e
S
T

T

Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

569Dialogue State Tracking with Incremental Reasoning image
Dialogue State Tracking with Incremental Reasoning image

PDF Herunterladen