CrossWOZ: A Large-Scale Chinese Cross-Domain
Task-Oriented Dialogue Dataset
Qi Zhu1, Kaili Huang2, Zheng Zhang1, Xiaoyan Zhu1, Minlie Huang1∗
1Dept. of Computer Science and Technology, 1Institute for Artificial Intelligence,
1Beijing National Research Center for Information Science and Technology,
2Dept. of Industrial Engineering, Tsinghua University, Beijing, China
{zhu-q18,hkl16,z-zhang15}@mails.tsinghua.edu.cn
{zxy-dcs,aihuang}@tsinghua.edu.cn
Abstract
To advance multi-domain (cross-domain) dialogue modeling as well as
alleviate the shortage of Chinese task-oriented datasets, we propose
CrossWOZ, the first large-scale Chinese Cross-Domain Wizard-of-Oz
task-oriented dataset. It contains 6K dialogue sessions and 102K
utterances for 5 domains, including hotel, restaurant, attraction,
metro, and taxi. Moreover, the corpus contains rich annotation of
dialogue states and dialogue acts on both user and system sides. About
60% of the dialogues have cross-domain user goals that favor
inter-domain dependency and encourage natural transition across
domains in conversation. We also provide a user simulator and several
benchmark models for pipelined task-oriented dialogue systems, which
will help researchers compare and evaluate their models on this
corpus. The large size and rich annotation of CrossWOZ make it
suitable for investigating a variety of tasks in cross-domain dialogue
modeling, such as dialogue state tracking, policy learning, user
simulation, etc.
1 Introduction
Recently, there have been a variety of task-oriented
dialogue models thanks to the prosperity of neural
architectures (Yao et al., 2013; Wen et al., 2015;
Mrkšić et al., 2017; Peng et al., 2017; Lei et al.,
2018; Gür et al., 2018). However, research is still
largely limited by the lack of large-scale high-quality
dialogue data. Many corpora have advanced
the research of task-oriented dialogue systems,
most of which are single-domain conversations,
including ATIS (Hemphill et al., 1990), DSTC 2
(Henderson et al., 2014), Frames (El Asri et al.,
2017), KVRET (Eric et al., 2017), WOZ 2.0
(Wen et al., 2017), and M2M (Shah et al., 2018).
∗Corresponding author.
Despite the significant contributions to the
community, these datasets are still limited in size,
language variation, or task complexity. Furthermore,
there is a gap between existing dialogue
corpora and real-life human dialogue data. In
real-life conversations, it is natural for humans to
transition between different domains or scenarios
while still maintaining coherent contexts. Thus,
real-life dialogues are much more complicated
than those dialogues that are only simulated
within a single domain. To address this issue,
some multi-domain corpora have been proposed
(Budzianowski et al., 2018b; Rastogi et al.,
2019). The most notable corpus is MultiWOZ
(Budzianowski et al., 2018b), a large-scale multi-domain
dataset that consists of crowdsourced
human-to-human dialogues. It contains 10K
dialogue sessions and 143K utterances for 7
domains, with annotation of system-side dialogue
states and dialogue acts. However, the state
annotations are noisy (Eric et al., 2019), and user-side
dialogue acts are missing. The dependency
across domains is simply embodied in imposing
the same pre-specified constraints on different
domains, such as requiring both a hotel and an
attraction to be located in the center of the town.
In comparison to the abundance of English
dialogue data, surprisingly, there is still no widely
recognized Chinese task-oriented dialogue corpus.
In this paper, we propose CrossWOZ, a large-scale
Chinese multi-domain (cross-domain) task-oriented
dialogue dataset. A dialogue example
is shown in Figure 1. We compare CrossWOZ to
other corpora in Tables 1 and 2. Our dataset has
the following features compared to other corpora
(particularly MultiWOZ (Budzianowski et al.,
2018b)):
1. The dependency between domains is more
challenging because the choice in one domain
will affect the choices in related domains
Transactions of the Association for Computational Linguistics, vol. 8, pp. 281–295, 2020. https://doi.org/10.1162/tacl_a_00314
Action Editor: Bonnie Webber. Submission batch: 10/2019; Revision batch: 1/2020; Published 6/2020.
© 2020 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
in CrossWOZ. As shown in Figure 1 and
Table 2, the hotel must be near the attraction
chosen by the user in previous turns, which
requires more accurate context understanding.
2. It is the first Chinese corpus that contains
large-scale multi-domain task-oriented dia-
logues, consisting of 6K sessions and 102K
utterances for 5 domains (attraction, restau-
rant, hotel, metro, and taxi).
3. Annotation of dialogue states and dialogue
acts is provided for both the system side
and user side. The annotation of user states
enables us to track the conversation from
the user’s perspective and can empower
the development of more elaborate user
simulators.
In this paper, we present the process of
dialogue collection and provide detailed data
analysis of the corpus. Statistics show that
our cross-domain dialogues are complicated. To
facilitate model comparison, benchmark models
are provided for different modules in pipelined
task-oriented dialogue systems, including natural
language understanding, dialogue state tracking,
dialogue policy learning, and natural language
generation. We also provide a user simulator,
which will facilitate the development and
evaluation of dialogue models on this corpus.
The corpus and the benchmark models are
publicly available at https://github.com/thu-coai/CrossWOZ.
2 Related Work
According to whether the dialogue agent is human
or machine, we can group the collection methods
of existing task-oriented dialogue datasets into
three categories. The first one is human-to-human
dialogues. The ATIS dataset (Hemphill et al., 1990),
one of the earliest and best known, used this
setting, followed by El Asri et al. (2017), Eric et al.
(2017), Wen et al. (2017), Lewis et al. (2017),
Wei et al. (2018), and Budzianowski et al.
(2018b). Though this setting requires much human
effort, it can collect natural and diverse dialogues.
The second one is human-to-machine dialogues,
which need a ready dialogue system to converse
with humans. The famous Dialogue State Tracking
Challenges provided a set of human-to-machine
dialogue data (Williams et al., 2013; Henderson
Figure 1: A dialogue example. The user state is
initialized by the user goal: finding an attraction and
one of its nearby hotels, then booking a taxi to commute
between these two places. In addition to expressing pre-specified
informable slots and filling in requestable
slots, users need to consider and modify cross-domain
informable slots (bold) that vary through conversation.
We only show a few turns (turn number on the left),
each with either the user or system state of the current
domain, which is shown above each utterance.
et al., 2014). The performance of the dialogue
system will largely influence the quality of the
dialogue data. The third one is machine-to-machine
dialogues. It needs to build both user and
Type          Single-domain goal                             Multi-domain goal
Dataset       DSTC2    WOZ 2.0  Frames   KVRET    M2M        MultiWOZ  Schema    CrossWOZ
Language      EN       EN       EN       EN       EN         EN        EN        CN
Speakers      H2M      H2H      H2H      H2H      M2M        H2H       M2M       H2H
# Domains     1        1        1        3        2          7         16        5
# Dialogues   1,612    600      1,369    2,425    1,500      8,438     16,142    5,012
# Turns       23,354   4,472    19,986   12,732   14,796     115,424   329,964   84,692
Avg. domains  1        1        1        1        1          1.80      1.84      3.24
Avg. turns    14.5     7.5      14.6     5.3      9.9        13.7      20.4      16.9
# Slots       8        4        61       13       14         25        214       72
# Values      212      99       3,871    1,363    138        4,510     14,139    7,871
Table 1: Comparison of CrossWOZ to other task-oriented corpora (training set). H2H, H2M, and
M2M represent human-to-human, human-to-machine, and machine-to-machine, respectively. The average
numbers of domains and turns are for each dialogue.
MultiWOZ
usr: I’m looking for a college type attraction.
. . .
usr: I would like to visit in town centre please.
. . .
usr: Can you find an Indian restaurant for me that is also in the town centre?

Schema
usr: I want a hotel in San Diego and I want to check out on Thursday next week.
. . .
usr: I need a one way flight to go there.

CrossWOZ
usr: Hello, could you recommend an attraction with a rating of 4.5 or higher?
sys: Tiananmen, Gui Street, and Beijing Happy Valley are very nice places.
usr: I like Beijing Happy Valley. What hotels are around this attraction?
sys: There are many, such as hotel A, hotel B, and hotel C.
usr: Great! I am planning to find a hotel to stay near the attraction. Which one has a rating of 4 or higher and offers wake-up call service?

Table 2: Cross-domain dialogue examples in MultiWOZ, Schema, and CrossWOZ. The values of cross-domain
constraints (bold) are underlined. Some turns are omitted to save space. Names of hotels are
replaced by A, B, C for simplicity. Cross-domain constraints are pre-specified in MultiWOZ and Schema,
while they are determined dynamically in CrossWOZ. In CrossWOZ, the choice in one domain will greatly affect
related domains.
system simulators to generate dialogue outlines,
then use templates (Peng et al., 2017) to
generate dialogues or further use people to
paraphrase the dialogues to make them more
natural (Shah et al., 2018; Rastogi et al., 2019).
It needs much less human effort. However, the
complexity and diversity of dialogue policy are
limited by the simulators. To explore dialogue
policy in multi-domain scenarios, and to collect
natural and diverse dialogues, we resort to the
human-to-human setting.
Most of the existing datasets only involve a
single domain in one dialogue, except MultiWOZ
(Budzianowski et al., 2018b) and Schema (Rastogi
et al., 2019). The MultiWOZ dataset has attracted
much attention recently, due to its large size and
multi-domain characteristics. It is at least one
order of magnitude larger than previous datasets,
amounting to 8,438 dialogues and 115K turns in
the training set. It greatly promotes the research
on multi-domain dialogue modeling, such as
policy learning (Takanobu et al., 2019), state
tracking (Wu et al., 2019), and context-to-text
generation (Budzianowski et al., 2018a). Recently,
the Schema dataset has been collected in a
machine-to-machine fashion, resulting in 16,142
dialogues and 330K turns for 16 domains in
the training set. However, the multi-domain
dependency in these two datasets is only embodied
in imposing the same pre-specified constraints on
different domains, such as requiring a restaurant
and an attraction to be located in the same area, or the
city of a hotel and the destination of a flight to be
the same (Table 2).
Table 1 presents a comparison of our
dataset with other task-oriented datasets. In comparison
to MultiWOZ, our dataset has a comparable
scale: 5,012 dialogues and 84K turns in
the training set. The average numbers of domains
and turns per dialogue are larger than those
of MultiWOZ, which indicates that our task is
more complex. The cross-domain dependency in
our dataset is natural and challenging. For example,
as shown in Table 2, the system needs to
recommend a hotel near the attraction chosen by
the user in previous turns. Thus, both system
recommendation and user selection will dynamically
impact the dialogue. We also allow the
same domain to appear multiple times in a user
goal since a tourist may want to go to more than
one attraction.
To better track the conversation flow and model
user dialogue policy, we provide annotation of
user states in addition to system states and
dialogue acts. While the system state tracks the
dialogue history, the user state is maintained by
the user and indicates whether the sub-goals have
been completed, which can be used to predict
user actions. This information will facilitate the
construction of the user simulator.
To the best of our knowledge, CrossWOZ is the
first large-scale Chinese dataset for task-oriented
dialogue systems, which will
largely alleviate
the shortage of Chinese task-oriented dialogue
corpora that are publicly available.
3 Data Collection
Our corpus simulates scenarios where a
traveler seeks tourism information and plans her
or his travel in Beijing. Domains include hotel,
attraction, restaurant, metro, and taxi. The data
collection process is summarized as follows:
1. Database Construction: We crawled travel
information in Beijing from the Web,
including the Hotel, Attraction, and Restaurant
domains (hereafter we name the three
domains as HAR domains). Then, we used
the metro information of entities in HAR
domains to build the metro database. For the
taxi domain, there is no need to store the
information. Instead, we can call the API
directly if necessary.
2. Goal Generation: A multi-domain goal
generator was designed based on the
database. The relation across domains is
captured in two ways. One is to constrain two
targets to be located near each other. The other
is to use a taxi or metro to commute between
two targets in HAR domains mentioned in
the context. To make workers understand
the task more easily, we crafted templates
to generate natural language descriptions for
each structured goal.
3. Dialogue Collection: Before the formal data
collection started, we required the workers to
make a small number of dialogues and gave
them feedback about the dialogue quality.
Then, well-trained workers were paired to
converse according to the given goals. The
workers were also asked to annotate both
user states and system states.
4. Dialogue Annotation: We used some rules
to automatically annotate dialogue acts
according to user states, system states,
and dialogue histories. To evaluate the
quality of the annotation of dialogue acts
and states, three experts were employed to
manually annotate dialogue acts and states
for 50 dialogues. The results show that
our annotations are of high quality. Finally,
each dialogue contains a structured goal, a
task description, user states, system states,
dialogue acts, and utterances.
3.1 Database Construction
We collected 465 attractions, 951 restaurants, and
1,133 hotels in Beijing from the Web. Some
statistics are shown in Table 3. There are three
types of slots for each entity: common slots
such as name and address; binary slots for
hotel services such as wake-up call; and nearby
attractions/restaurants/hotels slots that contain
nearby entities in the attraction, restaurant, and
hotel domains. Because it is not usual to find
another nearby hotel in the hotel domain, we did
not collect such information. This nearby relation
allows us to generate natural cross-domain goals,
such as ‘‘find another attraction near the first
one’’ and ‘‘find a restaurant near the attraction’’.
Domain                 Attract.   Rest.   Hotel
# Entities             465        951     1,133
# Slots                9          10      8 + 37∗
Avg. nearby attract.   4.7        3.3     0.8
Avg. nearby rest.      6.7        4.1     2.0
Avg. nearby hotels     2.1        2.4     –
Table 3: Database statistics. ∗ indicates that there
are 37 binary slots for hotel services such as wake-up
call. The last three rows show the average
number of nearby attractions/restaurants/hotels for
each entity. We did not collect nearby hotels
information for the hotel domain.
The nearest metro stations of HAR entities form the
metro database. In contrast, we provided the
pseudo car type and plate number for the taxi
domain.
3.2 Goal Generation
To avoid generating overly complex goals, each
goal has at most five sub-goals. To generate
more natural goals, the sub-goals can be of the
same domain, such as two attractions near each
other. The goal is represented as a list of (sub-goal
id, domain, slot, value) tuples, named
semantic tuples. The sub-goal id is used to
distinguish sub-goals, which may be in the same
domain. There are two types of slots: informable
slots, which are the constraints that the user
needs to inform the system of, and requestable
slots, which are the information that the user
needs to inquire from the system. As shown in
Table 4, besides common informable slots (italic
values) whose values are determined before the
conversation, we specially design cross-domain
informable slots (bold values) whose values refer
to other sub-goals. Cross-domain informable slots
use the sub-goal id to connect different sub-goals.
Thus the actual constraints vary according to the
different contexts instead of being pre-specified.
The values of common informable slots are
sampled randomly from the database. Based on
the informable slots, users are required to gather
the values of requestable slots (blank values in
Table 4) through conversation.
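As a minimal sketch of this representation, the goal of Table 4 could be held as a list of semantic tuples as follows. The Python names (`SemanticTuple`, `CrossRef`, `slot_kind`) and the `"near"`/`"same"` relation labels are illustrative assumptions; they are not the format of the released corpus.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class CrossRef:
    """Value of a cross-domain informable slot: it points at another sub-goal."""
    sub_goal_id: int
    relation: str  # assumed labels: "near" (nearby constraint), "same" (taxi/metro endpoint)


@dataclass
class SemanticTuple:
    sub_goal_id: int
    domain: str
    slot: str
    value: Optional[Union[str, CrossRef]] = None  # None = blank requestable slot


def slot_kind(t: SemanticTuple) -> str:
    """Classify a tuple of the initial goal by the kind of value it carries."""
    if t.value is None:
        return "requestable"
    if isinstance(t.value, CrossRef):
        return "cross-domain informable"
    return "common informable"


# The user goal of Table 4: an attraction, a nearby hotel, and a taxi between them.
goal = [
    SemanticTuple(1, "Attraction", "fee", "free"),
    SemanticTuple(1, "Attraction", "name"),
    SemanticTuple(1, "Attraction", "nearby hotels"),
    SemanticTuple(2, "Hotel", "name", CrossRef(1, "near")),
    SemanticTuple(2, "Hotel", "wake-up call", "yes"),
    SemanticTuple(2, "Hotel", "rating"),
    SemanticTuple(3, "Taxi", "from", CrossRef(1, "same")),
    SemanticTuple(3, "Taxi", "to", CrossRef(2, "same")),
    SemanticTuple(3, "Taxi", "car type"),
    SemanticTuple(3, "Taxi", "plate number"),
]

kinds = [slot_kind(t) for t in goal]
```

Keeping the reference as a separate value type, rather than a string such as "near (id = 1)", makes it easy to tell which slots still have to be resolved from the dialogue context.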
There are four steps in goal generation. First, we
generate independent sub-goals in HAR domains.
For each domain in HAR domains, with the same
probability P we generate a sub-goal, while with
Id   Domain       Slot            Value
1    Attraction   fee             free
1    Attraction   name            (blank)
1    Attraction   nearby hotels   (blank)
2    Hotel        name            near (id = 1)
2    Hotel        wake-up call    yes
2    Hotel        rating          (blank)
3    Taxi         from            (id = 1)
3    Taxi         to              (id = 2)
3    Taxi         car type        (blank)
3    Taxi         plate number    (blank)
Table 4: A user goal example (translated into
English). Slots with bold/italic/blank values are
cross-domain informable slots, common informable
slots, and requestable slots, respectively. In this example,
the user wants to find an attraction and one of
its nearby hotels, then book a taxi to commute
between these two places.
the probability of 1 − P we do not generate
any sub-goal for this domain. Each sub-goal has
common informable slots and requestable slots.
As shown in Table 5, all slots of HAR domains
can be requestable slots, while the slots with an
asterisk can be common informable slots.
Second, we generate cross-domain sub-goals
in HAR domains. For each generated sub-goal
(e.g., the attraction sub-goal in Table 4), if its
requestable slots contain ‘‘nearby hotels’’, we
generate an additional sub-goal in the hotel domain
(e.g., the hotel sub-goal in Table 4) with the
probability of P_attraction→hotel. Of course, the
selected hotel must satisfy the nearby relation to
the attraction entity. Similarly, we do not generate
any additional sub-goal in the hotel domain with
the probability of 1 − P_attraction→hotel. This also
works for the attraction and restaurant domains.
P_hotel→hotel = 0 because we do not allow the user
to find the nearby hotels of one hotel.
Third, we generate sub-goals in the metro and
taxi domains. With the probability of P_taxi, we
generate a sub-goal in the taxi domain (e.g., the
taxi sub-goal in Table 4) to commute between
two entities of HAR domains that are already
generated. It is similar for the metro domain and
we set P_metro = P_taxi. All slots in the metro or
taxi domain appear in the sub-goals and must be
filled. As shown in Table 5, from and to slots are
Attraction domain
name∗, rating∗, fee∗, duration∗, address, phone,
nearby attract., nearby rest., nearby hotels
Restaurant domain
name∗, rating∗, cost∗, dishes∗, address, phone,
open, nearby attract., nearby rest., nearby hotels
Hotel domain
name∗, rating∗, price∗, type∗, 37 services∗,
phone, address, nearby attract., nearby rest.
Taxi domain
from, to, car type, plate number
Metro domain
from, to, from station, to station
Table 5: All slots in each domain (translated
into English). Slots in bold can be cross-domain
informable slots. Slots with an asterisk are informable
slots. All slots are requestable slots except
the ‘‘from’’ and ‘‘to’’ slots in the taxi and metro
domains. The ‘‘nearby attractions/restaurants/
hotels’’ slots and the ‘‘dishes’’ slot can be multiple
valued (a list). The value of each ‘‘service’’ is
either yes or no.
always cross-domain informable slots, whereas
the others are always requestable slots.
Last, we rearrange the order of the sub-goals to
generate more natural and logical user goals. We
require that a sub-goal be followed by the
sub-goals that refer to it as closely as possible.
To make the workers aware of this cross-domain
feature, we additionally provide a task description
for each user goal in natural language, which is
generated from the structured goal by hand-crafted
templates.
Compared with the goals whose constraints are
all pre-specified, our goals impose much more
dependency between different domains, which
will significantly influence the conversation. The
exact values of cross-domain informable slots
are finally determined according to the dialogue
context.
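The four steps can be sketched as the sampler below, which only decides which sub-goals exist and which sub-goals they refer to. It is a simplified illustration under several assumptions: the probability values are placeholders (the text does not state them), a single `p_cross` stands in for all P_x→y, the ‘‘nearby’’ requestable-slot condition of the second step is folded into that probability, an empty goal is resampled, and the function and key names are hypothetical. Slot values and database lookups are left out.

```python
import random

HAR = ("Hotel", "Attraction", "Restaurant")


def generate_goal_skeleton(rng, p=0.5, p_cross=0.4, p_traffic=0.3, max_sub_goals=5):
    """Return ordered sub-goals as dicts with "id", "domain" and "refs" (ids referred to)."""
    # Step 1: independent HAR sub-goals, each domain kept with probability p.
    nodes = []
    while not nodes:  # assumption: resample until the goal is non-empty
        nodes = [{"domain": d, "parent": None} for d in HAR if rng.random() < p]

    # Step 2: cross-domain sub-goals that must be near an already generated one.
    for src in list(nodes):
        for dst in HAR:
            if len(nodes) >= max_sub_goals:
                break
            if src["domain"] == "Hotel" and dst == "Hotel":
                continue  # P_hotel->hotel = 0
            if rng.random() < p_cross:
                nodes.append({"domain": dst, "parent": src})

    # Step 4 (applied here to the HAR sub-goals): put each dependent
    # sub-goal directly after the sub-goal it refers to.
    ordered = []

    def visit(node):
        ordered.append(node)
        for child in nodes:
            if child["parent"] is node:
                visit(child)

    for root in nodes:
        if root["parent"] is None:
            visit(root)
    for i, node in enumerate(ordered, start=1):
        node["id"] = i
    goal = [
        {"id": n["id"], "domain": n["domain"],
         "refs": [n["parent"]["id"]] if n["parent"] else []}
        for n in ordered
    ]

    # Step 3: taxi and metro sub-goals commute between two HAR entities.
    for domain in ("Taxi", "Metro"):
        if len(ordered) >= 2 and len(goal) < max_sub_goals and rng.random() < p_traffic:
            a, b = rng.sample(ordered, 2)
            goal.append({"id": len(goal) + 1, "domain": domain,
                         "refs": [a["id"], b["id"]]})
    return goal


example = generate_goal_skeleton(random.Random(0))
```

In this sketch the traffic sub-goals are appended last, so every reference points to a sub-goal with a smaller id, which is the property the reordering step is meant to guarantee.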
3.3 Dialogue Collection
We developed a specialized website that allows
two workers to converse synchronously and make
annotations online. On the website, workers are
free to choose one of the two roles: tourist (user)
or system (wizard). Then, two paired workers are
sent to a chatroom. The user needs to accomplish
the allocated goal through conversation while
the wizard searches the database to provide the
necessary information and gives responses. Before
the formal data collection, we trained the workers
to complete a small number of dialogues by giving
them feedback. Finally, 90 well-trained workers
participated in the data collection.
In contrast, MultiWOZ (Budzianowski et al.,
2018b) hired more than a thousand workers to
converse asynchronously. Each worker received a
dialogue context to review and had to respond for
only one turn at a time. The collected dialogues
may be incoherent because workers may not
understand the context correctly and multiple
workers contributed to the same dialogue session,
possibly leading to more variance in the data quality.
For example, some workers expressed two
mutually exclusive constraints in two consecutive
user turns and failed to eliminate the system’s
confusion in the next several turns. Compared
with MultiWOZ, our synchronous conversation
setting may produce more coherent dialogues.
3.3.1 User Side
The user state is the same as the user goal before
a conversation starts. At each turn, the user needs
to 1) modify the user state according to the system
response at the preceding turn, 2) select some
semantic tuples in the user state, which indicate
the dialogue acts, and 3) compose the utterance
according to the selected semantic tuples. In
addition to filling the required values and updating
cross-domain informable slots with real values in
the user state, the user is encouraged to modify
the constraints when there is no result under such
constraints. The change will also be recorded in
the user state. Once the goal is completed (all the
values in the user state are filled), the user can
terminate the dialogue.
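A minimal sketch of the user-side bookkeeping, assuming the user state is a list of rows keyed by sub-goal id and slot. The row layout, the `{"near": 1}` placeholder for an unresolved cross-domain slot, and both function names are illustrative assumptions.

```python
def apply_system_response(user_state, learned):
    """Write values learned from the last system turn into the user state (in place).

    `learned` maps (sub_goal_id, slot) to a concrete value.
    """
    for row in user_state:
        key = (row["id"], row["slot"])
        if key in learned:
            row["value"] = learned[key]


def goal_completed(user_state):
    """The goal is done once every slot holds a concrete, non-empty string."""
    return all(isinstance(row["value"], str) and row["value"] != ""
               for row in user_state)


user_state = [
    {"id": 1, "domain": "Attraction", "slot": "fee", "value": "free"},
    {"id": 1, "domain": "Attraction", "slot": "name", "value": None},      # requestable
    {"id": 2, "domain": "Hotel", "slot": "name", "value": {"near": 1}},    # cross-domain
]

# The system names an attraction; the hotel slot is still unresolved.
apply_system_response(user_state, {(1, "name"): "Beijing Happy Valley"})
done_after_first = goal_completed(user_state)

# The user picks one of the recommended hotels near that attraction.
apply_system_response(user_state, {(2, "name"): "hotel A"})
done_after_second = goal_completed(user_state)
```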
3.3.2 Wizard Side
We regard the database query as the system
state, which records the constraints of each
domain up to the current turn. At each turn, the
wizard needs to 1) fill the query according to the
previous user response and search the database if
necessary, 2) select the retrieved entities, and
3) respond in natural language based on the
information of the selected entities. If none of the
entities satisfy all the constraints, the wizard will
try to relax some of them for a recommendation,
resulting in multiple queries. The first query
records original user constraints while the last
one records the constraints relaxed by the system.
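The multi-query behavior can be illustrated with the sketch below. In the corpus a human wizard decides which constraint to relax; the fixed `relax_order` argument, the exact-match search, and the toy hotel entries are assumptions made only for this example.

```python
def search(db, constraints):
    """Return the entities whose slots match every constraint exactly."""
    return [e for e in db if all(e.get(k) == v for k, v in constraints.items())]


def query_with_relaxation(db, constraints, relax_order):
    """Return (queries, results).

    queries[0] is the original user query; queries[-1] is the query that
    produced `results`, possibly with some constraints dropped.
    """
    current = dict(constraints)
    queries = [dict(current)]
    results = search(db, current)
    for slot in relax_order:
        if results:
            break
        if slot not in current:
            continue
        current.pop(slot)              # relax one constraint
        queries.append(dict(current))  # each relaxation is recorded as a new query
        results = search(db, current)
    return queries, results


hotels = [
    {"name": "hotel A", "price": "low", "wake-up call": "no"},
    {"name": "hotel B", "price": "high", "wake-up call": "yes"},
]
queries, results = query_with_relaxation(
    hotels,
    {"price": "low", "wake-up call": "yes"},
    relax_order=["wake-up call", "price"],
)
```

Here no hotel satisfies both constraints, so the sketch drops ‘‘wake-up call’’ and recommends hotel A, leaving two recorded queries for the turn.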
3.4 Dialogue Annotation
After collecting the conversation data, we used
some rules to annotate dialogue acts automatically.
Each utterance can have several dialogue
acts. Each dialogue act is a tuple that consists of
intent, domain, slot, and value. We pre-define 6
types of intents and use the update of the user state
and system state as well as keyword matching to
obtain dialogue acts. For the user side, dialogue
acts are mainly derived from the selection of
semantic tuples that contain the information of
domain, slot, and value. For example, if (1,
Attraction, fee, free) in Table 4 is selected by
the user, then (Inform, Attraction, fee, free) is
labeled. If (1, Attraction, name, ) is selected,
then (Request, Attraction, name, none) is labeled.
If (2, Hotel, name, near (id=1)) is selected, then
(Select, Hotel, src domain, Attraction) is labeled.
This intent is specially designed for the ‘‘nearby’’
constraint. For the system side, we mainly applied
keyword matching to label dialogue acts. The Inform
intent is derived by matching the system utterance
with the information of selected entities. When
the wizard selects multiple retrieved entities and
recommends them, the Recommend intent is labeled.
When the wizard expresses that no result satisfies
user constraints, NoOffer is labeled. For General
intents such as ‘‘goodbye’’ and ‘‘thanks’’ at both user
and system sides, keyword matching is applied.
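The three user-side examples above can be written as a small rule function. This is a sketch of those three cases only, not the full rule set: the tuple encoding `("near", 1)` for a cross-domain value, the `goal_domains` lookup, and the function name are assumptions.

```python
def user_dialogue_act(selected, goal_domains):
    """Map one selected semantic tuple to a dialogue act (intent, domain, slot, value).

    `goal_domains` maps a sub-goal id to its domain, so that a "near"
    reference can be turned into the source domain of the Select act.
    """
    sub_goal_id, domain, slot, value = selected
    if value is None:
        # A blank requestable slot: the user asks for it.
        return ("Request", domain, slot, "none")
    if isinstance(value, tuple) and value[0] == "near":
        # An unresolved "nearby" constraint that refers to another sub-goal.
        return ("Select", domain, "src domain", goal_domains[value[1]])
    # A concrete constraint: the user informs the system.
    return ("Inform", domain, slot, value)


goal_domains = {1: "Attraction", 2: "Hotel", 3: "Taxi"}
acts = [
    user_dialogue_act((1, "Attraction", "fee", "free"), goal_domains),
    user_dialogue_act((1, "Attraction", "name", None), goal_domains),
    user_dialogue_act((2, "Hotel", "name", ("near", 1)), goal_domains),
]
```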
We also obtained a binary label for each seman-
tic tuple in the user state, which indicates whether
this semantic tuple has been selected to be
expressed by the user. This annotation directly
illustrates the progress of the conversation.
To evaluate the quality of the annotation of
dialogue acts and states (both user and system
states), three experts were employed to manually
annotate dialogue acts and states for the same 50
dialogues (806 utterances), 10 for each goal type
(see Section 4). Because dialogue act annotation is
not a classification problem, we did not use Fleiss’
kappa to measure the agreement among experts.
We used dialogue act F1 and state accuracy to
measure the agreement between each pair of
experts’ annotations. The average dialogue act F1 is
                 Train       Valid     Test
# Dialogues      5,012       500       500
# Turns          84,692      8,458     8,476
# Tokens         1,376,033   137,736   137,427
Vocab            12,502      5,202     5,143
Avg. sub-goals   3.24        3.26      3.26
Avg. STs         14.8        14.9      15.0
Avg. turns       16.9        16.9      17.0
Avg. tokens      16.3        16.3      16.2
Table 6: Data statistics. The average numbers
of sub-goals, turns, and STs (semantic tuples)
are for each dialogue. The average number of
tokens is for each turn.
94.59% and the average state accuracy is 93.55%.
We then compared our annotations with each
expert’s annotations, which are regarded as the gold
standard. The average dialogue act F1 is 95.36%
and the average state accuracy is 94.95%, which
indicates the high quality of our annotations.
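The text does not spell out how the dialogue act F1 is aggregated. One plausible reading, shown here only as an assumption, is a micro-averaged F1 over the sets of act tuples of aligned utterances, with one annotation treated as the reference; the function name and the toy annotations are hypothetical.

```python
def dialogue_act_f1(pred_turns, gold_turns):
    """Micro-averaged F1 between two annotations of the same utterances.

    Each argument is a list with one collection of dialogue act tuples per utterance.
    """
    tp = fp = fn = 0
    for pred, gold in zip(pred_turns, gold_turns):
        pred, gold = set(pred), set(gold)
        tp += len(pred & gold)   # acts both annotations agree on
        fp += len(pred - gold)   # acts only in the first annotation
        fn += len(gold - pred)   # acts only in the second annotation
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


expert_a = [
    [("Inform", "Attraction", "fee", "free")],
    [("Request", "Attraction", "name", "none"), ("General", "thanks", "none", "none")],
]
expert_b = [
    [("Inform", "Attraction", "fee", "free")],
    [("Request", "Attraction", "name", "none"), ("Request", "Attraction", "address", "none")],
]
f1 = dialogue_act_f1(expert_a, expert_b)
```

With two shared acts and one disagreement on each side, precision and recall are both 2/3, so the F1 of this toy pair is also 2/3.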
4 Statistics
After removing incomplete dialogues, we collected
6,012 dialogues in total. The dataset is split
randomly for training/validation/test, and the
statistics are shown in Table 6. The average
number of sub-goals in our dataset is 3.24,
which is much larger than that in MultiWOZ
(1.80) (Budzianowski et al., 2018b) and Schema
(1.84) (Rastogi et al., 2019). The average number
of turns (16.9) is also larger than that in MultiWOZ
(13.7). These statistics indicate that our dialogue
data are more complex.
According to the type of user goal, we group the
dialogues in the training set into five categories:
Single-domain (S): 417 dialogues have only one
sub-goal in HAR domains.
Independent multi-domain (M): 1,573 dialogues
have multiple sub-goals (2∼3) in HAR domains.
However, these sub-goals do not have
cross-domain informable slots.
Independent multi-domain + traffic (M+T): 691
dialogues have multiple sub-goals in HAR
domains and at least one sub-goal in the
metro or taxi domain (3∼5 sub-goals). The
sub-goals in HAR domains do not have
cross-domain informable slots.
Goal type            S      M      M+T    CM     CM+T
# Dialogues          417    1573   691    1759   572
NoOffer rate         0.10   0.22   0.22   0.61   0.55
Multi-query rate     0.06   0.07   0.07   0.14   0.12
Goal change rate     0.10   0.28   0.31   0.69   0.63
Avg. dialogue acts   1.85   1.90   2.09   2.06   2.11
Avg. sub-goals       1.00   2.49   3.62   3.87   4.57
Avg. STs             4.5    11.3   15.8   18.2   20.7
Avg. turns           6.8    13.7   16.0   21.0   21.6
Avg. tokens          13.2   15.2   16.3   16.9   17.0
Table 7: Statistics for dialogues of different goal
types in the training set. NoOffer rate and Goal
change rate are for each dialogue. Multi-query rate
is for each system turn. The average number of
dialogue acts is for each turn.
Cross multi-domain (CM): 1,759 dialogues have
multiple sub-goals (2∼5) in HAR domains
with cross-domain informable slots.
Cross multi-domain + traffic (CM+T): 572 dialogues
have multiple sub-goals in HAR
domains with cross-domain informable slots
and at least one sub-goal in the metro or taxi
domain (3∼5 sub-goals).
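The five categories can be read as a simple decision rule over the sub-goals of a goal. The sketch below assumes each sub-goal is summarized by its domain and a flag saying whether it has a cross-domain informable slot; this summary format and the function name are illustrative, not part of the corpus.

```python
HAR_DOMAINS = {"Hotel", "Attraction", "Restaurant"}
TRAFFIC_DOMAINS = {"Metro", "Taxi"}


def goal_type(sub_goals):
    """Return one of "S", "M", "M+T", "CM", "CM+T" for a list of sub-goal summaries."""
    har = [s for s in sub_goals if s["domain"] in HAR_DOMAINS]
    has_traffic = any(s["domain"] in TRAFFIC_DOMAINS for s in sub_goals)
    # Only cross-domain slots inside HAR sub-goals decide between M and CM.
    has_cross = any(s["cross_domain"] for s in har)
    if len(har) == 1:
        return "S"
    base = "CM" if has_cross else "M"
    return base + "+T" if has_traffic else base


# The goal of Table 4: attraction, a hotel near it, and a taxi between them.
table4_goal = [
    {"domain": "Attraction", "cross_domain": False},
    {"domain": "Hotel", "cross_domain": True},
    {"domain": "Taxi", "cross_domain": True},
]
table4_type = goal_type(table4_goal)
```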
The data statistics are shown in Table 7. As
mentioned in Section 3.2, we generate indepen-
dent multi-domain, cross multi-domain, and traffic
domain sub-goals one by one. Thus in terms of
the task complexity, we have S