SOLOIST: Building Task Bots at Scale with
Transfer Learning and Machine Teaching
Baolin Peng, Chunyuan Li, Jinchao Li
Shahin Shayandeh, Lars Liden, Jianfeng Gao
Microsoft Research, Redmond, USA
{bapeng,chunyl,jincli,shahins,lars.liden,jfgao}@microsoft.com
Abstract
We present a new method, SOLOIST,1 that
uses transfer learning and machine teaching
to build task bots at scale. We parameterize
classical modular task-oriented dialog systems
using a Transformer-based auto-regressive
language model, which subsumes different
dialog modules into a single neural model. We
pre-train, on heterogeneous dialog corpora,
a task-grounded response generation model,
which can generate dialog responses grounded
in user goals and real-world knowledge for
task completion. The pre-trained model can be
efficiently adapted to accomplish new tasks
with a handful of task-specific dialogs via
machine teaching, where training samples are
generated by human teachers interacting with
the system. Experiments show that (i) SOLOIST
creates new state-of-the-art on well-studied
task-oriented dialog benchmarks, including
CamRest676 and MultiWOZ; (ii) in the
few-shot fine-tuning settings, SOLOIST signif-
icantly outperforms existing methods; and
(iii) the use of machine teaching substantially
reduces the labeling cost of fine-tuning. The
pre-trained models and codes are available at
https://aka.ms/soloist.
1 Introduction
The increasing use of personal assistants and
messaging applications has spurred interest in
building task-oriented dialog systems (or task
bots) that can communicate with users through
natural language to accomplish a wide range of
tasks, such as restaurant booking, weather query,
flight booking, IT helpdesk (e.g., Zhou et al.,
2020; Adiwardana et al., 2020; Roller et al.,
2020b; Gao et al., 2020; Peng et al., 2020a). The
1TASK-ORIENTED DIALOG WITH A SINGLE PRE-TRAINED
MODEL. In this paper, SOLOIST refers to both the proposed bot
building method and the dialog model or system developed
using the method.
wide variety of tasks and domains has created
the need for a flexible task-oriented dialog devel-
opment platform that can support many different
use cases while remaining straightforward for
developers to use and maintain.
A typical task-oriented dialog system uses a
modular pipeline, which has four modules and
executes sequentially (Young et al., 2013; Gao
et al., 2019a), as illustrated in Figure 1(a). A natural
language understanding (NLU) module identifies
user intents and extracts associated information
such as slots and their values from users’ input. A
dialog state tracker (DST) infers the belief state
(or user goal) from dialog history. The belief state
is often used to query a task-specific database
(DB) to obtain the DB state, such as the number of
entities that match the user goal. The dialog state
and DB state are then passed to a dialog policy
(POL) to select the next system action. A natural
language generation (NLG) module converts the
action to a natural language response.
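To make the module boundaries concrete, the sketch below outlines the pipeline as plain Python function signatures executed once per turn; the function names, argument types, and the tiny driver are illustrative assumptions for this sketch, not part of any released SOLOIST code.

from typing import Dict, List

def nlu(user_utterance: str) -> Dict:
    """Identify the user intent and extract slot-value pairs."""
    ...

def dst(dialog_history: List[str], frame: Dict) -> Dict:
    """Infer the belief state (user goal) from the dialog history."""
    ...

def db_lookup(belief_state: Dict) -> Dict:
    """Query the task-specific database, e.g., count matching entities."""
    ...

def policy(belief_state: Dict, db_state: Dict) -> str:
    """Select the next system action."""
    ...

def nlg(system_action: str) -> str:
    """Convert the selected action into a natural language response."""
    ...

def respond(dialog_history: List[str], user_utterance: str) -> str:
    # The modules are executed sequentially for every dialog turn.
    frame = nlu(user_utterance)                 # intent + slot values
    belief_state = dst(dialog_history, frame)   # accumulate the user goal
    db_state = db_lookup(belief_state)          # e.g., number of matching entities
    action = policy(belief_state, db_state)     # next system action
    return nlg(action)                          # natural language response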
Most popular commercial tools for dialog
development employ the modular systems,
including Google's Dialog Flow,2 Microsoft's
Power Virtual Agents (PVA),3 Facebook's
Wit.ai,4 Amazon's Lex,5 and IBM's Watson
Assistant.6 They are designed mainly to help
develop systems manually, i.e., writing code,
crafting rules and templates. Unfortunately, even
with these tools, building dialog systems remains
a label-intensive, time-consuming task, requiring
rich domain knowledge, reasonable coding skill,
and expert experience. The cost of building dialog
systems at scale (i.e., tens of thousands of bots for
different tasks) can be prohibitively expensive.
2https://dialogflow.com/.
3https://powervirtualagents.microsoft.com/.
4https://wit.ai/.
5https://aws.amazon.com/lex/.
6https://www.ibm.com/watson/.
Transactions of the Association for Computational Linguistics, vol. 9, pp. 807–824, 2021. https://doi.org/10.1162/tacl_a_00399
Action Editor: James Henderson. Submission batch: 7/2020; Revision batch: 1/2021; Published 8/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Figure 1: Illustration of a traditional modular task-oriented dialog system, an example of the model input, and
the proposed model. The SOLOIST solution utilizes a single neural auto-regressive model in (c) to parameterize the
sequential dialog pipeline in (a), with the input sequence represented in (b). Different from GPT-2, the SOLOIST model
learns to ground response generation in user goals and database/knowledge.
With the recent advances in neural approaches
to conversational AI (Gao et al., 2019a), research-
ers have been developing data-driven methods and
neural models for either individual dialog mod-
ules or end-to-end systems. For example, recent
attempts such as RASA (Bocklisch et al., 2017),
ConvLab (Lee et al., 2019b; Zhu et al., 2020), and
Conversation Learner (Shukla et al., 2020) are
made to allow the use of data-driven approaches
based on machine learning and machine teaching
to develop dialog modules. End-to-end trainable
dialog systems have also been studied (e.g., Wen
et al., 2017; Zhao and Eskenazi, 2016; Li et al.,
2017; Williams et al., 2017; Lei et al., 2018; Gao
et al., 2019a; Zhang et al., 2020b). Although these
methods have achieved promising results, they
require large amounts of task-specific labeled
data for training, which are rarely available for
new tasks in real-world applications.
In this paper, we propose a novel method
of building task bots at scale, SOLOIST, which
significantly eases the workflow of training and
deploying dialog systems for new tasks, compared
to existing tools and methods. Our approach is
inspired by the recent success of applying transfer
learning to natural language processing (NLP)
tasks: Big language models pre-trained on large
amounts of raw text (e.g., BERT (Devlin et al.,
2019), RoBERTa (Liu et al., 2019), and UniLM
(Dong et al., 2019)) can be effectively fine-tuned
for a wide range of NLP tasks with few in-domain
labels. Recently, these pre-trained language mod-
els have also been employed to develop dialog
modules such as NLU and DST (Henderson
et al., 2020; Coope et al., 2020; Wu et al.,
2020a). The proposed SOLOIST uses a similar pre-
training-and-fine-tuning framework for building
end-to-end dialog systems. We parameterize a task
bot using a Transformer-based auto-regressive
language model, which subsumes different dialog
modules (i.e., NLU, DST, POL, and NLG) into a
single neural model. Task bot building proceeds in
two stages: (i) In the pre-training stage, initialized
using GPT-2 (Radford et al., 2019), we train a
Transformer-based, task-grounded, response gen-
eration model using large heterogeneous dialog
corpora. The model learns the primary task com-
pletion skills such as DST and POL, and can
generate dialog responses grounded in user goals
and real-world knowledge for task completion.
(ii) In the fine-tuning stage, we adapt the pre-
trained SOLOIST model to complete a specific
(new) task using a handful of task-specific dialogs
via machine teaching, where training samples are
generated by human teachers interacting with the
system (Zhu, 2015; Shukla et al., 2020).
We show through a comprehensive empirical
study that SOLOIST is an effective method of build-
ing task bots at scale by successfully transferring
two capabilities from the pre-trained model to
a new task bot: (i) the capability of NLU and
NLG learned on raw text, and (ii) the capability
of grounding system responses in user goals
and real-world knowledge for task completion,
learned on the out-domain dialog corpora.
SOLOIST achieves state-of-the-art performance
on two well-studied task-oriented dialog bench-
marks, lifting the combined score by 10 points
in automatic evaluation, and the success rate by
20 points in human evaluation. In the few-shot
fine-tuning settings, SOLOIST adapts to the new
domain much more effectively than competing
方法, achieving a reasonable success rate
using less than 50 dialogs. The promising results
demonstrate the potential of the new method
for developing task bots at scale. Instead of
collecting and labeling data, and building one bot per
task, we can pre-train a task-grounded response
generation model, and adapt it to new tasks via
transfer learning and machine teaching.
2 SOLOIST
2.1 An Auto-Regressive Model for Dialog
The modular dialog system in Figure 1 constitutes
a data processing pipeline that produces a
sequence, through concatenating the input-output
pair of each module along the generation process.
Each consecutive pair in this sequence plays
the role of annotated data for the corresponding
module. Ideally, when the entire sequence is
available, the data generation process of a dia-
log system (NLU, DST, POL, NLG) can be for-
mulated as a single auto-regressive model.
GPT-2 (Radford et al., 2019) is a state-of-
the-art (SoTA) auto-regressive language model
trained on large amounts of open Web text data.
Although after being fine-tuned using conver-
sational data, GPT-2 can respond to users with
realistic and coherent continuations about any
topic of their choosing (Zhang et al., 2020c), the
generated responses are not useful for completing
any specific task due to the lack of grounding.
SOLOIST inherits GPT-2’s capability of produc-
ing human-like responses. Nevertheless, unlike
GPT-2, SOLOIST is pre-trained to generate re-
sponses grounded in user goals and real-world
knowledge for task completion. While GPT-2 is
a language model for text prediction, SOLOIST is a
stateful decision-making model for task comple-
tion, with the capabilities of tracking dialog states,
selecting the best system actions, and so on. Hence,
SOLOIST is pre-trained using task-oriented dialog
sessions annotated with grounding information,
i.e., user goals, dialog belief states, DB states, and
system responses. Specifically, each dialog turn
in our training data is represented as:

x = (s, b, c, r),    (1)
where s is the dialog history up to the current
dialog turn, b is the dialog belief state acquired
from human annotation, c is the DB state auto-
matically retrieved from a database using b, and r
is the delexicalized dialog response, from which
the system response in natural language can be
generated using some automatic post-processing.
Each item in x is by itself a sequence of tokens,
as illustrated by the examples in Figure 1(b).
Thus, it is natural to treat the concatenation of
them as a long sequence for model training, as
shown in Figure 1(c). We pre-train the SOLOIST
model using publicly available heterogeneous
dialog corpora with labels of belief states and DB
states. The pre-trained model can be fine-tuned to
any new task to generate responses grounded in
task-specific user goals and a database.
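As a concrete, purely illustrative picture of this grounding step, the snippet below shows one way the deterministic DB lookup could turn a belief state into a DB state; the toy restaurant table and helper names are assumptions for the sketch, not part of the released code.

# A minimal sketch of the deterministic database lookup, assuming a toy
# restaurant table; the real system would query a task-specific DB or API.
RESTAURANT_DB = [
    {"name": "golden wok", "pricerange": "expensive", "food": "chinese", "area": "north"},
    {"name": "river thai", "pricerange": "cheap", "food": "thai", "area": "centre"},
]

def db_state(belief_state: dict) -> str:
    """Return the DB state, here simply the number of matching entities."""
    matches = [
        row for row in RESTAURANT_DB
        if all(row.get(slot) == value for slot, value in belief_state.items())
    ]
    return f"Restaurant {len(matches)} matched"

# Example: a belief state in the style of the Figure 1(b) annotations.
print(db_state({"pricerange": "expensive", "food": "chinese", "area": "north"}))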
2.2 Task-Grounded Pre-Training
Given training data of N samples D = {x_n}_{n=1}^N,
our goal is to build a neural model parameterized
by θ to characterize the sequence generation
probability pθ(x). We use a multi-task objective
for learning θ, where each task is a self-supervised
learning task.
To leverage the sequential structure of a
task-oriented dialog system, the joint probability
p(x) can be factorized in the auto-regressive
manner as:

p(x) = p(r, c, b, s)    (2)
     = p(r|c, b, s) p(b|s) p(s),    (3)

where p(r|c, b, s) is the grounded response generation
term and p(b|s) is the belief prediction term. The
factorization from (2) to (3) is based
on the fact that p(c|b, s) = p(c|b) = 1, because
the DB state c is obtained using a deterministic
database-lookup process given a belief state b
(e.g., via an API call). Note that (3) decomposes
the joint distribution modeling problem into two
sub-problems: belief state prediction p(b|s) and
grounded response generation p(r|c, b, s). Since
b and r are sequences, we can further factorize
them in the left-to-right auto-regressive manner,
respectively.
Task 1: Belief Prediction. For a belief state
sequence of length T_b, we define the objective of
predicting the belief state as:

L_B = log p(b|s) = Σ_{t=1}^{T_b} log pθ(b_t | b_{<t}, s)
Restaurant { pricerange = expensive, food =
Chinese, area = north } < EOB > DB: Restau-
rant 1 matched < EOKB > The [restaurant name]
is a great [value food] restaurant. Would you
like to book a table there ? < EOS >
This sequence, tokenized using byte pair encod-
ings (Sennrich et al., 2016), can be readily used for
multi-task training, as illustrated in Figure 1(c). The
implementation of SOLOIST is based on Hugging-
face PyTorch Transformer (Wolf et al., 2020). The
task-grounded pre-training of SOLOIST uses the
public 117M-parameter GPT-2 as initialization.
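As a rough sketch of how such a turn could be turned into a training example, the code below concatenates the dialog history, belief state, DB state, and delexicalized response into one string and tokenizes it with the GPT-2 byte-pair tokenizer; the separator strings and the note about restricting the language-modeling loss to the belief and response segments are assumptions made for illustration, not the exact released format.

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # 117M-parameter GPT-2 vocabulary

def build_training_sequence(history, belief, db, response):
    # Concatenate (s, b, c, r) into a single token sequence, as in Figure 1(b)/(c).
    text = f"{history} => Belief state : {belief} <EOB> DB : {db} <EOKB> {response} <EOS>"
    ids = tokenizer.encode(text)
    # In practice the loss would be computed only on the belief-state and
    # response tokens (the self-supervised targets), not on the user input.
    return ids

ids = build_training_sequence(
    history="user : i want an expensive chinese restaurant in the north",
    belief="Restaurant { pricerange = expensive , food = chinese , area = north }",
    db="Restaurant 1 matched",
    response="the [restaurant name] is a great [value food] restaurant . "
             "would you like to book a table there ?",
)
print(len(ids))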
Name            #Dialog   #Utterance   Avg. Turn   #Domain
task-grounded pre-training:
Schema          22,825    463,284      20.3        17
Taskmaster      13,215    303,066      22.9        6
fine-tuning:
MultiWOZ2.0     10,420    71,410       6.9         7
CamRest676      676       2,744        4.1         1
Banking77       –         25,716      –           21
Restaurant-8k   –         8,198       –           1

Table 1: Dialog corpora. The datasets in the upper
block are used for task-grounded pre-training, and
the datasets in the lower block are for fine-tuning.
Adam (Kingma and Ba, 2014) with weight
decay is used for pre-training. Table 1 shows the
dialog corpora (Kim et al., 2019; Rastogi et al.,
2020; Byrne et al., 2019) used for task-grounded
pre-training. To ensure there is no overlap bet-
ween pre-training and fine-tuning datasets, we ex-
clude the data akin to MultiWOZ (Budzianowski
et al., 2018), CamRest676 (Wen et al., 2017),
Banking77 (Casanueva et al., 2020), and Restaurant-
8k (Coope et al., 2020).
2.3 Fine-Tuning and Machine Teaching
When deploying SOLOIST to a new task, we collect
task-specific x in the same format as that used
for pre-training in (1). When x is available, the
conventional fine-tuning procedure is utilized: We
use the same multi-task objective of (7) to update
θ to adapt the model to complete the new task
using labeled task-specific dialogs.
In real applications, annotated task-specific
data is often unavailable, or noisy/incomplete
beforehand. One may deploy the dialog system
and acquire high-quality task-specific labels (例如,
belief state and system response) for each dialog
turn using machine teaching. Machine teaching
is an active learning paradigm that focuses
on leveraging the knowledge and expertise of
domain experts as ''teachers''. This paradigm
puts a strong emphasis on tools and techniques
that enable teachers—particularly non-data
scientists and non-machine-learning experts—to
visualize data, find potential problems, and
provide corrections or additional training inputs
in order to improve the system's performance
(Simard et al., 2017; Zhu, 2015; Williams and
Liden, 2017; Shukla et al., 2020).
We proceed with fine-tuning using Conversation
Learner (Shukla et al., 2020), a machine teaching
tool, in the following steps: (i) Dialog authors
deploy the pre-trained SOLOIST model for a specific
task. (ii) Users (or human subjects recruited for
system fine-tuning) interact with the system and
generate human-bot dialog logs. (iii) Dialog
authors revise a dozen training samples by se-
lecting representative failed dialogs from the logs,
correcting their belief and/or responses so that the
system can complete these dialogs successfully, as
illustrated in Figure 2. The corrected task-specific
dialog turns are used to fine-tune the model.
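The sketch below casts steps (i)–(iii) as a simple correction loop; the helper functions (teacher.is_successful, teacher.correct, fine_tune) are hypothetical placeholders rather than the Conversation Learner API, and the dozen-sample budget follows the description above.

def machine_teaching_round(model, dialog_logs, teacher, num_to_correct=12):
    """One round of machine teaching following steps (i)-(iii):
    deploy, collect logs, correct a handful of failed dialogs, fine-tune."""
    # (ii) dialog_logs are human-bot conversations collected from the deployed model.
    # (iii) the dialog author picks representative failures and fixes the
    # belief state and/or the response of the offending turns.
    failed = [d for d in dialog_logs if not teacher.is_successful(d)]
    corrected = [teacher.correct(d) for d in failed[:num_to_correct]]

    # The corrected task-specific turns are used to continue fine-tuning.
    fine_tune(model, corrected)
    return model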
Implementation Details. To adapt a pre-trained
SOLOIST to a new task in our experiments, we
always fine-tune SOLOIST using a small amount
of pre-collected task-specific dialogs, and then
continue to fine-tune it via machine teaching,
as detailed in Section 3.3. Training examples
are truncated to ensure a maximal length of
512. The pre-trained models are fine-tuned with
a mini-batch of 6 on 8 Nvidia V100 GPUs until no
progress is observed on validation data or up to
10 epochs. Nucleus sampling (Holtzman et al.,
2019) is used for decoding, where the sampling
top-p ranges from 0.2 to 0.5 for all our models.
The best setup of hyper-parameters is selected
through grid-search on the validation set. For the
machine teaching experiments, pre-trained models
are fine-tuned with SGD on a single Nvidia V100.
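For reference, nucleus (top-p) sampling of a response from a fine-tuned GPT-2-style model could look like the hedged sketch below; the checkpoint name and prompt format are placeholders, not the released SOLOIST artifacts.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Placeholder checkpoint; the released SOLOIST weights would be loaded here.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = ("user : i want an expensive chinese restaurant in the north "
          "=> Belief state :")
input_ids = tokenizer.encode(prompt, return_tensors="pt")

with torch.no_grad():
    output = model.generate(
        input_ids,
        do_sample=True,      # nucleus sampling rather than greedy decoding
        top_p=0.5,           # the paper reports top-p in the 0.2-0.5 range
        max_length=128,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(output[0]))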
3 Experiments
This section evaluates the proposed SOLOIST to
answer three questions: Q1: How does SOLOIST
perform on standard benchmarks compared to
SoTA methods? Q2: Does SOLOIST meet the goal
of effectively generalizing to new domains in the
few-shot fine-tuning setting? Q3: How effective
is machine teaching for fine-tuning? Note that
we employ the conventional fine-tuning method
without machine teaching for a fair comparison
when studying Q1 and Q2.
3.1 Experimental Setup
Dialog Datasets for Fine-Tuning. We validate
the end-to-end dialog system performance of
SOLOIST on two well-studied datasets. (i) Cam-
Rest676 (Wen et al., 2017) is a single-domain task-
oriented dialog corpus. It contains 408/136/136
dialogs for training/validation/testing, respec-
tively. Following Lei et al. (2018), we delexicalize
each token that occurs in the ontology with its slot
Figure 2: Illustration of the machine teaching process using Conversation Learner. The human-bot conversation log in
(a) can be edited by correcting its belief state in (b), and selecting/inserting a more appropriate response in (c).
names such as restaurant name, phone number, and
postcode. (ii) The MultiWOZ dataset (Budzianowski
et al., 2018) is a multi-domain task-oriented dialog
dataset. It contains 8438/1000/1000 dialogs for train-
ing/validation/testing, respectively. Each dialog
session contains 1 to 3 domains, such as Attrac-
tion, Hotel, Hospital, Police, Restaurant, Train,
and Taxi. MultiWOZ is inherently challenging
due to its multi-domain setting and diverse lan-
guage styles.
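A hedged illustration of the delexicalization step described above: ontology values that appear in an utterance are replaced by their slot names, so the model works with templates such as "[restaurant name]" that are re-filled during post-processing. The tiny ontology is invented for this example.

# Toy ontology mapping slot names to surface values; the real ontologies
# come from the CamRest676 / MultiWOZ annotations.
ONTOLOGY = {
    "restaurant name": ["golden wok", "river thai"],
    "phone number": ["01223 123456"],
    "postcode": ["cb21ab"],
}

def delexicalize(utterance: str) -> str:
    """Replace every ontology value occurring in the utterance by its slot name."""
    for slot, values in ONTOLOGY.items():
        for value in values:
            utterance = utterance.replace(value, f"[{slot}]")
    return utterance

print(delexicalize("golden wok is at cb21ab , phone 01223 123456"))
# -> "[restaurant name] is at [postcode] , phone [phone number]"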
Automatic Evaluation Metrics. Following
Budzianowski et al. (2018), Inform, Success,
and BLEU scores are reported. The first two metrics
relate to the dialogue task completion—whether
the system has provided an appropriate entity
(Inform) and then answered all the requested
attributes (Success). BLEU evaluates how
natural the generated responses are compared
to those generated by human agents. A com-
bined score (Combined) is also reported using
Combined = (Inform + Success) × 0.5 + BLEU
as an overall quality measure.
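As a worked example of the formula, the SOLOIST row of Table 3 combines as (85.50 + 72.90) × 0.5 + 16.54 = 95.74.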
Baselines. We compare SOLOIST with several
strong baselines, which hold SoTA on the Cam-
Rest676 or MultiWOZ datasets. (i) Multi-Action
Data Augmentation (DAMD) (Zhang et al.,
2020b) is a modular system, where each di-
alog module is implemented using a neural
network, and the whole system is trained in an
end-to-end manner. (ii) Sequicity (Lei et al.,
2018) is similar to DAMD except that it
does not use multi-action data augmentation.
(iii) GPT fine-tuning (Budzianowski and Vulić,
2019) is fine-tuned on GPT-2 to generate re-
sponses based on the dialog state and history.
(iv) ARDM (Wu et al., 2019b) utilizes GPT-2
as the pre-trained model to learn to generate
role-aware responses given dialog context. The
model has to work with a separate dialog state
tracker for task completion. (v) HDSA (Chen
et al., 2019) is a modular dialog system, which gen-
erates responses using a BERT-based dialog pol-
icy and graph structure dialog act representations.
3.2 End-to-End Evaluation
CamRest676. Table 2 shows the results and lists
annotations used by different models. SOLOIST
achieves the best scores in all the metrics. ARDM
performs similarly to SOLOIST in terms of Success
and BLEU. However, ARDM cannot track dialog
states and requires a separately trained state
tracker to accomplish tasks. GPT-2 fine-tuned
with task-specific data works reasonably well but
lags behind SOLOIST by a large margin. Sequicity,
which uses a jointly trained model with belief
state and policy annotations, underperforms
独奏者. This result also shows that, compared
to other end-to-end models, SOLOIST not only
achieves better performance but requires lower
labeling cost for fine-tuning due to the use of
task-grounded pre-training.
MultiWOZ. The results are shown in Table 3.
SOLOIST achieves the best performance in terms of
Inform, Success, and Combined, lifting the pre-
vious SoTA by a significant margin (e.g., about 10
points improvement in Combined over DAMD).
SOLOIST also outperforms the method of Ham et al.
(2020), where GPT-2 is fine-tuned and applied
for end-to-end dialog modeling. Compared to the
Model                                            Belief State  Policy  Inform ↑  Success ↑  BLEU ↑  Combined ↑
Sequicity (Lei et al., 2018)                     ✓                     92.30     85.30      21.40   110.20
Sequicity (w/o RL)                               ✓                     94.00     83.40      23.40   112.10
GPT fine-tuning (Budzianowski and Vulić, 2019)   ✓                     –         86.20      19.20   –
ARDM1 (Wu et al., 2019b)                         ✓                     –         87.10      25.20   –
SOLOIST                                          ✓                     94.70     87.10      25.50   116.40

1ARDM is not fully E2E, as it requires a rule-based dialog state tracker.

Table 2: End-to-End evaluation on CamRest676. Results of existing methods are from Wu et al.
(2019b).
Model                                      Belief State  Policy  Inform ↑  Success ↑  BLEU ↑  Combined ↑
Sequicity (Lei et al., 2018)               ✓             ✓       66.41     45.32      15.54   71.41
HRED-TS (Peng et al., 2019)                ✓             ✓       70.00     58.00      17.50   81.50
Structured Fusion (Mehri et al., 2019b)    ✓             ✓       73.80     58.60      16.90   83.10
DSTC8 Track 1 Winner1 (Ham et al., 2020)   ✓             ✓       73.00     62.40      16.00   83.50
DAMD (Zhang et al., 2020b)                 ✓             ✓       76.40     60.40      16.60   85.00
SOLOIST                                    ✓                     85.50     72.90      16.54   95.74

1The result of DSTC8 Track 1 Winner is produced by adapting their code to our setting.

Table 3: End-to-end evaluation on MultiWOZ.
classical modular dialog systems such as DAMD,
SOLOIST uses a much simpler architecture and
requires much lower labeling effort. For example,
SOLOIST requires only the belief states, while
DAMD requires additional annotations for task
definition (i.e., defining the intents, slots, and
corresponding value ranges) and dialog acts.
3.3 Few-Shot Evaluation
It is desirable for task bots to effectively gener-
alize to new tasks with few task-specific training
samples. Thus, the few-shot fine-tuning setting
is a more realistic setting for evaluating dialog
systems. Unfortunately, the existing task-oriented
dialog benchmarks typically contain for each
task hundreds to thousands of dialogs. Therefore,
we re-organize CamRest676 and MultiWOZ
to simulate the few-shot fine-tuning setting for
end-to-end evaluation.7 We sample from the
MultiWOZ dataset the dialog tasks that contain
only one domain. Attraction, Train,
Hotel, and Restaurant domains are used.
We do not use the domains of Police, Taxi,
and Hospital, as they do not require explicitly
tracking dialog states for task completion. 为了
each domain, we randomly sample 50 dialog
sessions for training and validation and 200 dialog
sessions for testing. The only exception is the
7We will release the re-organized datasets.
Domain   Attra.   Train   Hotel   Rest.   CamRest676
#Train   50       50      50      50      20
#Valid   50       50      50      50      136
#Test    100      200     200     200     136

Table 4: Data statistics for domains used in few-
shot evaluation. Attra. denotes the Attraction
domain and Rest. means Restaurant.
Model                          Inform ↑  Success ↑  BLEU ↑
Sequicity (Lei et al., 2018)   60.61     66.11      11.15
SOLOIST w/o pre-training       73.88     72.22      13.11
SOLOIST                        85.82     84.22      19.18
SOLOISTL                       88.05     84.79      18.88

Table 5: End-to-end evaluation on CamRest676
in the few-shot fine-tuning setting.
Attraction domain, which has 100 sessions
for testing. For CamRest676, we randomly sam-
ple 20 sessions. Details are shown in Table 4.
Tables 5 and 6 report the end-to-end perfor-
mance in the few-shot fine-tuning settings on
CamRest676 and MultiWOZ, respectively. In
all the domains, SOLOIST obtains substantially
better performance in all the metrics. Removing
task-grounded pre-training significantly hurts
the performance of SOLOIST, although SOLOIST
Model                        Attraction                   Train                        Hotel                        Restaurant
                             Inform ↑ Success ↑ BLEU ↑    Inform ↑ Success ↑ BLEU ↑    Inform ↑ Success ↑ BLEU ↑    Inform ↑ Success ↑ BLEU ↑
DAMD (Zhang et al., 2020b)   70.00    15.00     6.90      75.00    39.50     6.20      62.50    20.50     7.60      68.00    19.50     10.50
SOLOIST w/o pre-training     65.66    46.97     5.85      59.00    44.00     7.07      62.50    40.00     7.70      75.50    44.50     11.00
SOLOIST                      86.00    65.00     12.90     80.81    64.65     9.96      74.50    43.50     8.12      81.00    55.50     12.80
SOLOISTL                     86.00    68.00     14.60     81.31    74.24     11.90     75.00    51.50     10.09     84.00    62.50     13.17

Table 6: End-to-end evaluation on MultiWOZ in the few-shot fine-tuning setting.
Model                        1%                           5%                           10%                          20%
                             Inform ↑ Success ↑ BLEU ↑    Inform ↑ Success ↑ BLEU ↑    Inform ↑ Success ↑ BLEU ↑    Inform ↑ Success ↑ BLEU ↑
DAMD (Zhang et al., 2020b)   34.40    9.10      8.10      52.50    31.80     11.60     55.30    30.30     13.00     62.60    44.10     14.90
SOLOIST w/o pre-training     46.10    24.40     10.39     63.40    38.70     11.19     64.90    44.50     13.57     70.10    52.20     14.72
SOLOIST                      58.40    35.30     10.58     69.30    52.30     11.80     69.90    51.90     14.60     74.00    60.10     15.24

Table 7: End-to-end evaluation on MultiWOZ with varying sizes of task-specific training data for
fine-tuning.
Model            Attraction                   Train                        Hotel                        Restaurant
                 Inform ↑ Success ↑ BLEU ↑    Inform ↑ Success ↑ BLEU ↑    Inform ↑ Success ↑ BLEU ↑    Inform ↑ Success ↑ BLEU ↑
SOLOIST          45.00    19.00     7.67      67.68    58.08     7.13      33.50    22.50     8.70      50.50    10.00     8.61
SOLOIST +Extra   63.00    41.00     11.08     65.15    57.58     9.74      41.50    19.00     7.96      44.50    27.00     9.77
SOLOIST +Teach   78.00    45.00     11.90     68.18    63.64     9.45      46.50    22.50     7.68      53.00    32.00     9.81

Table 8: Machine teaching results. SOLOIST is trained with 10 examples for each domain. SOLOIST+Teach
indicates continual training with 5 dialogs recommended by CL with human teacher corrections.
SOLOIST+Extra indicates continual training using 5 randomly sampled dialogs with full annotations.
without task-grounded pre-training still consis-
tently outperforms DAMD in all the domains.
SOLOIST without task-grounded pre-training is
conceptually similar to Ham et al. (2020),
but is architecturally simpler and needs fewer
annotations. The result verifies the importance of
task-grounded pre-training on annotated dialog
corpora, allowing SOLOIST to learn how to track
dialog and database states to accomplish a task.
To study the impact of using larger model size, we
build a large version of SOLOIST, SOLOISTL, which
is task-grounded pre-trained on the same data
but using GPT-2medium with 345M parameters as
initialization. SOLOISTL consistently outperforms
SOLOIST by a large margin. It indicates that
a larger model is a better few-shot learner,
exhibiting stronger generalization ability with
limited in-domain data. We leave it to future work
to significantly scale up SOLOIST.
We conduct experiments to fine-tune SOLOIST
by varying the percentage of task-specific training
samples, ranging from 1% (80 examples) to
20% (1600 examples), on the MultiWOZ dataset.
As shown in Table 7, SOLOIST consistently
outperforms DAMD for a wide range of dataset
sizes, and the improvement is more substantial
when smaller numbers of in-domain examples are
used for fine-tuning.
3.4 Machine Teaching Results
The machine teaching module of Conversation
Learner (CL) (Shukla et al., 2020) allows human
teachers (dialog authors) to select and visualize
dialogs, find potential problems, and provide
corrections or additional training samples to
improve the bot's performance. We use CL to
evaluate the effectiveness of machine teaching
for task bot fine-tuning. In our experiment, we
first sample 10 dialogs from each domain to
fine-tune SOLOIST as described in Section 3.3. The
result is presented in the first row of Table 8.
We then deploy the model to interact with human
users via CL. The row of SOLOIST+Teach shows
the result of machine teaching, where a human
teacher has manually corrected 5 dialogs, which
are recommended by CL using a ranking heuristic
based on perplexity. The corrections are utilized
to continually fine-tune the deployed system.
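A hedged sketch of the perplexity-based ranking heuristic: dialogs on which the deployed model assigns low probability to its own turns are surfaced first for teacher review. The model and tokenizer interfaces are placeholders, not the CL implementation.

import math
import torch

def dialog_perplexity(model, tokenizer, dialog_text: str) -> float:
    """Per-token perplexity of a logged dialog under the deployed model."""
    ids = tokenizer.encode(dialog_text, return_tensors="pt")
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood
    return math.exp(loss.item())

def recommend_for_teaching(model, tokenizer, logs, k=5):
    """Return the k logged dialogs the model is least confident about."""
    return sorted(logs, key=lambda d: dialog_perplexity(model, tokenizer, d),
                  reverse=True)[:k]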
Table 8 shows that SOLOIST+Teach consistently
improves Combined by a large margin compared
with that without human teaching. SOLOIST+Extra
is used as an ablation baseline, where 5 randomly
selected dialogs with full annotations from
experts are added as extra examples to fine-tune
the model. It shows lower performance than
machine teaching. Figure 3 demonstrates the
performance of SOLOIST in Restaurant by
repeating the above machine teaching process in
multiple iterations. We observe that in the second
iteration of machine teaching SOLOIST+Teach
improves Combined by more than 8 points while
SOLOIST+Extra achieves 5 points higher. The result
demonstrates the effectiveness of our two-step
fine-tuning scheme to deploy SOLOIST for a new
task (domain). In terms of machine teaching cost,
taking the restaurant domain as an example,
we assume that one slot-value pair of belief state
correction counts as one edit and a response
correction counts as ten edits. The total numbers
of edits for SOLOIST+Teach and SOLOIST+Extra are
61 and 396, respectively, suggesting that machine
teaching reduces the labeling cost by 6×.
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
3
9
9
1
9
6
8
9
7
3
/
/
t
我
A
C
_
A
_
0
0
3
9
9
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
Model          Banking77
               10       30       Full
BERT-Fixed     67.55    80.07    87.19
BERT-Tuned     83.42    90.03    93.66
USE            84.23    89.74    92.81
ConveRT        83.32    89.37    93.01
USE+ConveRT    85.19    90.57    93.36
SOLOIST        78.73    89.28    93.80

Table 9: Intent classification accuracy scores (5
runs average) on Banking77 with varying numbers
of training examples (10, 30 examples for each
intent, and full training examples). The baseline
results are cited from Casanueva et al. (2020).
Figure 3: Machine teaching performance across different
iterations in the Restaurant domain. Machine teaching
with CL achieves a near 1.5X efficiency gain (i.e., the
1st iteration used 15 dialogs while the 3rd iteration
used 25 dialogs) and boosts performance by 10 points
compared with that without teaching.
3.5 Component-Wise Evaluation
This section evaluates SOLOIST on two NLU tasks
(i.e., intent classification and slot filling), the DST
task and the response generation task. We show
that although SOLOIST is an end-to-end dialog
model, it also performs well on these component
tasks.
Intent Classification. The task is to classify
a user utterance into one of several pre-defined
classes (intents). We follow the experiment setting
of Casanueva et al. (2020). The last hidden state of
SOLOIST is used as the sequence representation for
classification. Several baseline methods are used
for comparison. BERT-fixed and BERT-tuned are
fine-tuned on BERT, with BERT parameters fixed
and updated during fine-tuning, respectively. A
linear classifier with a softmax layer is added
on top of BERT for classification. Universal
Sentence Encoder and ConveRT are sentence
encoders tailored for modeling sentence pairs,
and are trained for optimizing the conversational
response selection task. The results in Table 9
show that SOLOIST is comparable with SoTA intent
classification models. SOLOIST is the best per-
former when the full dataset is used for fine-tuning,
but its performance deteriorates more quickly
than USE+ConveRT when fewer samples are
used for fine-tuning. It is interesting to investigate
whether incorporating intent classification tasks
in task-grounded pre-training can boost SOLOIST's
performance. We leave it to future work.
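A minimal sketch of the classification setup described above, assuming the SOLOIST backbone is a GPT-2-style transformer: the final hidden state of the sequence feeds a linear layer with a softmax over the 77 Banking77 intents. The module and checkpoint names are illustrative placeholders.

import torch
import torch.nn as nn
from transformers import GPT2Model, GPT2Tokenizer

class IntentClassifier(nn.Module):
    def __init__(self, num_intents: int = 77):
        super().__init__()
        self.backbone = GPT2Model.from_pretrained("gpt2")  # stand-in for SOLOIST
        self.head = nn.Linear(self.backbone.config.n_embd, num_intents)

    def forward(self, input_ids):
        hidden = self.backbone(input_ids).last_hidden_state  # (batch, seq, dim)
        last = hidden[:, -1, :]   # last hidden state as the sequence representation
        return torch.softmax(self.head(last), dim=-1)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = IntentClassifier()
ids = tokenizer.encode("how do i activate my new card ?", return_tensors="pt")
print(model(ids).shape)  # torch.Size([1, 77])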
Slot Filling. We follow the experiment setting of
Coope et al. (2020) and formulate slot filling as a
turn-based span extraction problem. The results in
Table 10 show that SOLOIST performs significantly
better than the SoTA method Span-ConveRT, a
variant of ConveRT designed explicitly for slot
filling. The gap is wider when fewer examples are
used for training. For example, when 64 samples
are used for training, SOLOIST outperforms Span-
ConveRT by 20 points in F1 score.
Dialog State Tracking. We compare the dialog
state tracking capability of SOLOIST with several
strong baselines on MultiWOZ 2.0 and 2.1. The
results in Table 11 show that SOLOIST achieves the
best performance on MultiWOZ2.1 and similar
performance to DST-Picklist (Zhang et al., 2020a),
which requires a pre-defined task ontology to guide
state tracking. In comparison with Simple-TOD
(Hosseini-Asl et al., 2020), which is based on GPT-2,
SOLOIST obtains 1.13% higher joint goal accu-
racy. We attribute the gain to the task-grounded
pre-training that equips SOLOIST with task comple-
tion skills including dialog state tracking.
Fraction       SOLOIST  Span-ConveRT  V-CNN-CRF  Span-BERT
1 (8198)       0.98     0.96          0.94       0.93
1/2 (4099)     0.95     0.94          0.92       0.91
1/4 (2049)     0.93     0.91          0.89       0.88
1/8 (1024)     0.89     0.89          0.85       0.85
1/16 (512)     0.84     0.81          0.74       0.77
1/32 (256)     0.79     0.64          0.57       0.54
1/64 (128)     0.74     0.58          0.37       0.42
1/128 (64)     0.61     0.41          0.26       0.30

Table 10: Average F1 scores across all slots
for Restaurant-8K with varying training set frac-
tions. Numbers in parentheses represent training
set sizes. The baseline results are quoted from
Coope et al. (2020).
Context-to-Response. In this task, systems
need to generate responses given the ground-truth
belief state and DB search result (Wen et al.,
2017). The results on MultiWOZ 2.0 are shown in
桌子 12. SOLOIST achieves the best performance
in terms of Inform and Success but performs
slightly worse in BLEU. The Combined score of
SOLOIST is comparable with the current SoTA
method DAMD. However, DAMD uses the labels
of dialog act on both the user and system sides,
which demands significantly higher labeling
efforts than SOLOIST for model training. HDSA
achieves the best BLEU score. Compared with
HDSA, SOLOIST is much simpler and able to
perform better in terms of Combined. SOLOIST
outperforms ARDM in Combined. It is worth
mentioning that ARDM cannot perform dialog
state tracking and requires an extra dialog state
tracker to accomplish tasks. These results show
that SOLOIST can learn dialog policies accurately
and generate natural language responses in the
multi-domain scenario.
3.6 Human Evaluation Results
We conduct human evaluation to assess the
quality of SOLOIST interacting with human users.
Following the evaluation protocol in the DSTC8
track 1 challenge (Kim et al., 2019), we host the
best-performing SOLOIST on the validation set in
the MultiWOZ domain in the back-end as bot services
and crowdsource the work to Amazon Mechanical
Turk. For each dialog session, we present Turks
a goal with instructions. Then Turks are required
Model                                   Joint Goal Accuracy ↑
                                        MWoz2.0   MWoz2.1
MDBT (Ramadan et al., 2018)             15.57     –
GLAD (Zhong et al., 2018)               35.57     –
GCE (Nouri and Hosseini-Asl, 2018)      36.27     –
FJST (Eric et al., 2020)                40.20     38.00
HyST (Goel et al., 2019)                44.24     –
SUMBT (Lee et al., 2019a)               46.65     –
TOD-BERT (Wu et al., 2020a)             –         48.00
Neural Reading (Gao et al., 2019b)      41.10     –
TRADE (Wu et al., 2019a)                48.62     45.60
COMER (Ren et al., 2019)                48.79     –
NADST (Le et al., 2020)                 50.52     49.04
DSTQA (Zhou and Small, 2019)            51.44     51.17
SOM-DST (Kim et al., 2020)              51.38     52.57
DST-Picklist (Zhang et al., 2020a)      53.30     –
MinTL (Lin et al., 2020)                52.10     53.62
SST (Chen et al., 2020)                 51.17     55.23
TripPy (Heck et al., 2020)              –         55.29
Simple-TOD (Hosseini-Asl et al., 2020)  –         55.72
SOLOIST                                 53.20     56.85

Table 11: Dialog state tracking results on
MultiWOZ 2.0 and 2.1.
to converse with SOLOIST to achieve the goal
and judge the overall dialog experience at the
end of a session using four metrics. (i) Success
evaluates task completion. (ii) Under. (language
understanding score) ranging from 1 (bad) to 5
(good) indicates the extent to which the system
understands user inputs. (iii) Appr. (response
appropriateness score) scaling from 1 (bad) to
5 (good) denotes whether the response is appro-
priate and human-like. (iv) Turns is the average
number of turns in a dialog over all successful
dialog sessions. Turks are further required to write
down a justification for giving a specific rating. In
total, 120 dialog sessions are gathered for analysis.
Table 13 shows the human assessment results
on MultiWOZ. The results are consistent with the
automatic evaluation. SOLOIST achieves substan-
tially better performance than other systems over
all the metrics. Moreover, SOLOIST outperforms
the DSTC8 Track 1 Winner by a much larger
margin in Success (+20 points) in human
evaluation than that in automatic evaluation (+10
evaluation than that in automatic evaluation (+10
points in Table 3). We attribute this to the fact that
Turks use more diverse language to interact with
the target bots in interactive human evaluation
than that in the pre-collected MultiWOZ dataset
and the use of heterogeneous dialog data for
task-grounded pre-training makes SOLOIST a more
robust task bot than the others. In many test cases
against SOLOIST, Turks comment that they feel
like they are talking to a real person.
Model                                            Inform ↑  Success ↑  BLEU ↑  Combined ↑
Baseline (Budzianowski et al., 2018)             71.29     60.94      18.80   84.93
TokenMoE (Pei et al., 2019)                      75.30     59.70      16.81   84.31
GPT fine-tuning (Budzianowski and Vulić, 2019)   70.96     61.36      19.05   85.21
Structured Fusion (Mehri et al., 2019b)          82.70     72.10      16.34   93.74
LaRL (Zhao et al., 2019)                         82.80     79.20      12.80   93.80
MD-Sequicity (Zhang et al., 2020b)               86.60     71.60      16.68   95.90
HDSA (Chen et al., 2019)                         82.90     68.90      23.60   99.50
ARDM (Wu et al., 2019b)                          87.40     72.80      20.60   100.70
DAMD (Zhang et al., 2020b)                       89.20     77.90      18.60   102.15
SOLOIST                                          89.60     79.30      18.03   102.49

Table 12: Context-to-response evaluation on MultiWOZ.
Model                   Success ↑  Under. ↑  Appr. ↑  Turns ↓
SOLOIST                 91.67      4.29      4.43     18.97
DSTC8 Track 1 Winner    68.32      4.15      4.29     19.51
DSTC8 2nd Place         65.81      3.54      3.63     15.48
DSTC8 3rd Place         65.09      3.54      3.84     13.88
DSTC8 Baseline          56.45      3.10      3.56     17.54

Table 13: Human evaluation results. The results
except SOLOIST are quoted from Li et al. (2020b).
Figure 4 depicts a dialog example where a user
interacts with SOLOIST to complete a multi-domain
任务. The user starts the conversation by asking
for a recommendation of a museum in the center
of town. SOLOIST identifies the user intent, and
provides a recommendation based on the search
result from an attraction DB. Then, the user wants
to book a table in a restaurant in the same area.
We can see that through the conversation, SOLOIST
develops belief state, which can be viewed as the
system’s understanding of what the user needs
and what is available in the DB. Based on the
belief state and DB state, SOLOIST picks the next
action, either asking for clarification or providing
the user with the information being requested. This
example also demonstrates that SOLOIST is able to
deal with some NLU challenges displayed often
in human conversations, such as co-reference
resolution. For example, SOLOIST understands that
the ‘‘same area’’ at Turn 5 refers to ‘‘centre of
town’’, and then identifies a proper entity from the
restaurant booking DB to make the reservation.
Figure 4: An interactive example.

4 Related Work

Dialog Systems. Dialog systems are typically
grouped into two categories, task-oriented sys-
tems and social chatbots (e.g., Chen et al., 2017;
Gao et al., 2019a; Roller et al., 2020a; Zhou et al.,
2020). Recently many variants have been devel-
oped to extend the scope of dialog systems, includ-
ing empathetic dialog systems (Ma et al., 2020;
Zhou et al., 2018), chatbots for sentiment analysis
(Li et al., 2020c), dialog systems with common-
sense knowledge (Young et al., 2018; Shuster
et al., 2020), or using audio features (Young et al.,
2020). In this paper, we focus on end-to-end
dialog models for task-oriented systems.
Pre-Trained Language Models. Recent ad-
vances on self-supervised learning have witnessed
the blooming of large-scale pre-trained language
型号 (例如, Devlin et al., 2019; Radford et al.,
2019; Dong et al., 2019), which achieve
SoTA performance on a variety of language under-
standing and generation tasks. The closest
到
SOLOIST are GPT-2 (Radford et al., 2019) 和
its variants that ground language generation
in the prescribed control codes such as CTRL
(Keskar et al., 2019) and Grover (Zellers et al.,
2019), or latent variables such as Optimus
(Li et al., 2020a).
Recently, pre-trained language models have
been adopted to develop task-oriented and chit-
chat dialog systems. To name a few examples of
chit-chat dialog systems: DialoGPT (Zhang et al.,
2020c), TransferTransfo (Wolf et al., 2019), and
CGRG (Wu et al., 2020b) adapt GPT-2 using
human conversational data for response genera-
tion. Plato (Bao et al., 2020) pre-trains a discrete
latent variable model for response generation.
Meena (Adiwardana et al., 2020) and BST (Roller
et al., 2020b) pre-train large models on conver-
sational data and have demonstrated expressive
performance in generating social chit-chat dialogs.
For task-oriented dialogs, Mehri et al. (2019a)
explores different pre-training methods for dialog
context representation learning. TOD-BERT
(Wu et al., 2020a) adapts the pre-trained BERT
to achieve strong performance on four dialog
sub-tasks. ConveRT (Henderson et al., 2020)
pre-trains a model on Reddit data for intent clas-
sification and response selection. Span-ConveRT
(Coope et al., 2020) extends the framework to
entity extraction. SC-GPT (Peng et al., 2020b)
uses a pre-trained language model to convert a
dialog act to a natural language response. All these
works use the pre-training and fine-tuning frame-
work. However, they follow the modular archi-
tecture of task bots, and the pre-trained models
are used for improving individual dialog modules
such as NLU and DST. SOLOIST generalizes these
methods to the entire dialog pipeline, building an
end-to-end dialog system.
End-to-End Trainable Dialog Systems. The
end-to-end dialog systems based on neural
models have been studied in Wen et al. (2017); Li
et al. (2017); Lei et al. (2018); Xu et al. (2019).
Although these methods have achieved promising
results, they are designed for specific domains,
rendering difficulties in generalizing to multi-
domains such as MultiWOZ. Dialog models that
can handle multi-domain tasks are studied in (裴
等人。, 2019; Budzianowski and Vuli´c, 2019; 梅赫里
等人。, 2019乙; 赵等人。, 2019; Wu et al., 2019乙;
张等人。, 2020乙; 彭等人。, 2017). 然而,
these works require large amounts of in-domain
labels to achieve good performance. In contrast,
SOLOIST can effectively adapt to a new task in the
few-shot fine-tuning settings.
The most related work to ours is Ham et al.
(2020), which is the first attempt to fine-tune GPT-
2 to build end-to-end dialog models. Hosseini-Asl
et al. (2020) take a similar approach, and is a
concurrent work of SOLOIST. However, SOLOIST
differs from these two methods in two major
aspects. The first is the use of task-grounded
pre-training that allows SOLOIST to learn primary
task completion skills, such as tracking dialog
states and selecting system actions. These skills can
be easily reused and adapted (e.g., via few-shot
fine-tuning) to solve new dialog tasks, leading
to a much higher task success rate, as reported
in Section 3. The second is that the annotation
cost required for training SOLOIST is much lower
than that of Ham et al. (2020) or Hosseini-Asl
et al. (2020). Training SOLOIST requires only belief
states as labels. But training of Ham et al. (2020)
and Hosseini-Asl et al. (2020) requires labeling
each dialog turn with dialog acts. In addition,
while SOLOIST is end-to-end trainable, the other
two models are not and need heuristic rules to
handle different database search conditions.
5 Conclusions
SOLOIST is a method of building task bots
at scale with transfer learning and machine
teaching. Unlike GPT-2, SOLOIST is pre-trained
in a task-grounded manner. Therefore, it can generate
in a task-grounded manner. 所以, it can generate
responses grounded in user goals and real-world
knowledge for task completion. Experiments
show that SOLOIST creates new SoTA on two
popular task-oriented dialog benchmarks, and
that SOLOIST outperforms existing methods by a
large margin in the few-shot fine-tuning settings
where only a limited number of task labels are
available for fine-tuning.
We hope that SOLOIST can inspire dialog
researchers and developers to comprehensively
explore the new paradigm for building task bots
based on task-grounded pre-training and fine-
tuning via machine teaching, and improving the
recipe we present in this paper, i.e., formulat-
ing task-oriented dialog as a single auto-regressive
language model, pre-training a task-grounded re-
sponse generation model on heterogeneous dia-
log corpora, and adapting the pre-trained model
to new tasks through fine-tuning using a handful of
task-specific examples via machine teaching.
References
Daniel Adiwardana, Minh-Thang Luong,
David R. So, Jamie Hall, Noah Fiedel, Romal
Thoppilan, Zi Yang, Apoorv Kulshreshtha,
Gaurav Nemade, Yifeng Lu, et al. 2020.
Towards a human-like open-domain chatbot.
arXiv preprint arXiv:2001.09977.
Siqi Bao, Huang He, Fan Wang, Hua Wu, and
Haifeng Wang. 2020. Plato: Pre-trained dia-
logue generation model with discrete latent
variable. In Proceedings of the 58th Annual
Meeting of the Association for Computational
Linguistics, pages 85–96. https://doi
.org/10.18653/v1/2020.acl-main.9
Tom Bocklisch, Joey Faulkner, Nick Pawlowski,
and Alan Nichol. 2017. Rasa: Open source
language understanding and dialogue manage-
ment. CoRR, abs/1712.05181.
Paweł Budzianowski and Ivan Vulić. 2019.
Hello, it's GPT-2 - How can I help you?
Towards the use of pretrained language
models for task-oriented dialogue systems. In
Proceedings of the 3rd Workshop on Neural
Generation and Translation, pages 15–22.
https://doi.org/10.18653/v1/D19
-5602
Paweł Budzianowski, Tsung-Hsien Wen,
Bo-Hsiang Tseng, Iñigo Casanueva, Stefan
Ultes, Osman Ramadan, and Milica Gasic.
2018. MultiWOZ - a large-scale multi-domain
wizard-of-oz dataset for task-oriented dia-
logue modelling. In Proceedings of the 2018
Conference on Empirical Methods in Natural
Language Processing, pages 5016–5026.
https://doi.org/10.18653/v1/D18
-1547
Bill Byrne, Karthik Krishnamoorthi,
Chinnadhurai Sankar, Arvind Neelakantan,
Ben Goodrich, Daniel Duckworth, Semih
Yavuz, Amit Dubey, Kyu-Young Kim, and
Andy Cedilnik. 2019. Taskmaster-1: Toward
a realistic and diverse dialog dataset. In
Proceedings of the 2019 Conference on
Empirical Methods in Natural Language
Processing and the 9th International Joint
Conference on Natural Language Process-
ing (EMNLP-IJCNLP), pages 4506–4517.
https://doi.org/10.18653/v1/D19
-1459
Iñigo Casanueva, Tadas Temčinas, Daniela Gerz,
Matthew Henderson, and Ivan Vulić. 2020.
Efficient intent detection with dual sentence
encoders. In Proceedings of the 2nd Workshop
on Natural Language Processing for Conver-
sational AI, pages 38–45. https://doi.org
/10.18653/v1/2020.nlp4convai-1.5
Hongshen Chen, Xiaorui Liu, Dawei Yin, and
Jiliang Tang. 2017. A survey on dialogue sys-
tems: Recent advances and new frontiers. ACM
SIGKDD Explorations Newsletter, 19(2):25–35.
https://doi.org/10.1145/3166054
.3166058
Lu Chen, Boer Lv, Chi Wang, Su Zhu, Bowen
Tan, and Kai Yu. 2020. Schema-guided
multi-domain dialogue state tracking with
graph attention neural networks. In AAAI,
pages 7521–7528. https://doi
.org/10.1609/aaai.v34i05.6250
Wenhu Chen, Jianshu Chen, Pengda Qin, Xifeng
Yan, and William Yang Wang. 2019. Seman-
tically conditioned dialog response generation
via hierarchical disentangled self-attention.
In Proceedings of the 57th Annual Meeting
of the Association for Computational Lin-
guistics, pages 3696–3709, Florence, Italy.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/P19
-1360
Sam Coope, Tyler Farghly, Daniela Gerz,
Ivan Vulić, and Matthew Henderson. 2020.
Span-ConveRT: Few-shot span extraction
for dialog with pretrained conversational
representations. In Proceedings of the
58th Annual Meeting of the Association
for Computational Linguistics, ACL 2020,
Online, July 5–10, 2020, pages 107–121.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020
.acl-main.11
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of the 2019 Con-
ference of the North American Chapter of the
Association for Computational Linguistics: Hu-
man Language Technologies, Volume 1 (Long
and Short Papers), pages 4171–4186.
Li Dong, Nan Yang, Wenhui Wang, Furu Wei,
Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming
Zhou, and Hsiao-Wuen Hon. 2019. Unified lan-
guage model pre-training for natural language
understanding and generation. In Advances
in Neural Information Processing Systems,
pages 13042–13054.
Mihail Eric, Rahul Goel, Shachi Paul, Abhishek
Sethi, Sanchit Agarwal, Shuyang Gao, Adarsh
Kumar, Anuj Goyal, Peter Ku, and Dilek
Hakkani-Tur. 2020. Multiwoz 2.1: A consol-
idated multi-domain dialogue dataset with state
corrections and state tracking baselines. In Pro-
ceedings of The 12th Language Resources and
Evaluation Conference, pages 422–428.
Jianfeng Gao, Michel Galley, and Lihong Li.
2019a. Neural approaches to conversational
AI. Foundations and Trends in Informa-
tion Retrieval, 13(2–3):127–298. https://
doi.org/10.1561/1500000074
Jianfeng Gao, Baolin Peng, Chunyuan Li,
Jinchao Li, Shahin Shayandeh, Lars Liden, and
Heung-Yeung Shum. 2020. Robust conversa-
tional AI with grounded text generation. CoRR,
abs/2009.03457.
Shuyang Gao, Abhishek Sethi, Sanchit Agarwal,
Tagyoung Chung, and Dilek Hakkani-Tur.
2019b. Dialog state tracking: A neural reading
comprehension approach. In Proceedings of the
20th Annual SIGdial Meeting on Discourse and
Dialogue, pages 264–273.
Rahul Goel, Shachi Paul, and Dilek Hakkani-Tür.
2019. HyST: A hybrid approach for flexible and
accurate dialogue state tracking. In Proceedings
of Interspeech 2019, pages 1458–1462.
https://doi.org/10.21437/Interspeech
.2019-1863
Donghoon Ham, Jeong-Gwan Lee, Youngsoo
Jang, and Kee-Eung Kim. 2020. End-to-end
neural pipeline for goal-oriented dialogue sys-
tems using GPT-2. In Proceedings of the 58th
Annual Meeting of the Association for Com-
putational Linguistics, pages 583–592.
Michael Heck, Carel van Niekerk, Nurul Lubis,
Christian Geishauser, Hsien-Chin Lin, Marco
Moresi, and Milica Gasic. 2020. TripPy: A
triple copy strategy for value independent neural
dialog state tracking. In Proceedings of the 21th
Annual Meeting of the Special Interest Group
on Discourse and Dialogue, pages 35–44.
Matthew Henderson, Iñigo Casanueva, Nikola
Mrksic, Pei-Hao Su, Tsung-Hsien Wen, and
Ivan Vulić. 2020. ConveRT: Efficient and
accurate conversational representations from
transformers. In Proceedings of the 2020
Conference on Empirical Methods in Natural
Language Processing: Findings, EMNLP
2020, Online Event, 16–20 November 2020,
pages 2161–2174. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/2020.findings-emnlp.196
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes,
and Yejin Choi. 2019. The curious case
of neural text degeneration. In International
Conference on Learning Representations.
Ehsan Hosseini-Asl, Bryan McCann,
Chien-Sheng Wu, Semih Yavuz, and Richard
Socher. 2020. A simple language model
for task-oriented dialogue. arXiv preprint
arXiv:2005.00796.
Nitish Shirish Keskar, Bryan McCann, Lav R.
Varshney, Caiming Xiong, and Richard Socher.
2019. Ctrl: A conditional
transformer lan-
guage model for controllable generation. arXiv
preprint arXiv:1909.05858.
Seokhwan Kim, Michel Galley, Chulaka
Gunasekara, Sungjin Lee, Adam Atkinson,
Baolin Peng, Hannes Schulz, Jianfeng Gao,
Jinchao Li, Mahmoud Adada, Minlie Huang,
Luis Lastras, Jonathan K. Kummerfeld, Walter
S. Lasecki, Chiori Hori, Anoop Cherian, Tim
K. Marks, Abhinav Rastogi, Xiaoxue Zang,
Srinivas Sunkara, and Raghav Gupta. 2019.
The eighth dialog system technology challenge.
arXiv preprint arXiv:1911.06394.
Sungdong Kim, Sohee Yang, Gyuwan Kim, and
Sang-Woo Lee. 2020. Efficient dialogue state
tracking by selectively overwriting memory.
In Proceedings of the 58th Annual Meeting of
the Association for Computational Linguistics,
pages 567–582.
Diederik P. Kingma and Jimmy Ba. 2014. Adam:
A method for stochastic optimization. arXiv
preprint arXiv:1412.6980.
Hung Le, Richard Socher, and Steven C. H. Hoi.
2020. Non-autoregressive dialog state tracking.
arXiv 预印本 arXiv:2002.08024.
Hwaran Lee, Jinsik Lee, and Tae-Yoon Kim.
2019a. SUMBT: Slot-utterance matching for
universal and scalable belief tracking. In Pro-
ceedings of
the 57th Annual Meeting of
the Association for Computational Linguistics,
pages 5478–5483.
Sungjin Lee, Qi Zhu, Ryuichi Takanobu, Zheng
Zhang, Yaoqin Zhang, Xiang Li, Jinchao Li,
Baolin Peng, Xiujun Li, Minlie Huang, and
Jianfeng Gao. 2019b. ConvLab: Multi-domain
end-to-end dialog system platform. In Pro-
ceedings of the 57th Annual Meeting of the
Association for Computational Linguistics:
System Demonstrations, pages 64–69.
Wenqiang Lei, Xisen Jin, Min-Yen Kan,
Zhaochun Ren, Xiangnan He, and
Dawei Yin. 2018. Sequicity: Simplifying
task-oriented dialogue systems with single
sequence-to-sequence architectures. In Pro-
ceedings of the 56th Annual Meeting of the
Association for Computational Linguistics
(Volume 1: Long Papers).
Chunyuan Li, Xiang Gao, Yuan Li, Baolin
Peng, Xiujun Li, Yizhe Zhang, and Jianfeng
Gao. 2020a. Optimus: Organizing sentences
via pre-trained modeling of a latent space.
In Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Pro-
cessing (EMNLP), pages 4678–4699, Online.
Association for Computational Linguistics.
Jinchao Li, Baolin Peng, Sungjin Lee, Jianfeng
高, Ryuichi Takanobu, Qi Zhu, Minlie Huang,
Hannes Schulz, Adam Atkinson, and Mahmoud
Adada. 2020乙. Results of the multi-domain
task-completion dialog challenge. In Proceed-
ings of the 34th AAAI Conference on Artificial
智力, Eighth Dialog System Technology
Challenge Workshop.
Wei Li, Wei Shao, Shaoxiong Ji, and Erik
Cambria. 2020c. BiERU: Bidirectional emo-
tional recurrent unit for conversational sen-
timent analysis. arXiv preprint arXiv:2006
.00492.
Xiujun Li, Yun-Nung Chen, Lihong Li, Jianfeng
Gao, and Asli Celikyilmaz. 2017. End-to-end
task-completion neural dialogue systems. arXiv
preprint arXiv:1703.01008.
Zhaojiang Lin, Andrea Madotto, Genta Indra
Winata, and Pascale Fung. 2020. MinTL:
Minimalist transfer learning for task-oriented
dialogue systems. In Proceedings of the
2020 Conference on Empirical Methods in
Natural Language Processing (EMNLP),
pages 3391–3405.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019. RoBERTa: A robustly opti-
mized BERT pretraining approach. arXiv pre-
print arXiv:1907.11692.
Yukun Ma, Khanh Linh Nguyen, Frank Z.
Xing, and Erik Cambria. 2020. A survey
on empathetic dialogue systems. Information
Fusion, 64:50–70. https://doi.org/10
.1016/j.inffus.2020.06.011
Shikib Mehri, Evgeniia Razumovskaia, Tiancheng
Zhao, and Maxine Eskenazi. 2019a. Pretrain-
ing methods for dialog context representation
学习. In Proceedings of the 57th Annual
Meeting of the Association for Computational
语言学, pages 3836–3845. https://
doi.org/10.18653/v1/P19-1373
Shikib Mehri, Tejas Srinivasan, and Maxine
Eskenazi. 2019乙. Structured fusion networks
for dialog. In Proceedings of the 20th An-
nual SIGdial Meeting on Discourse and
Dialogue, pages 165–177. https://土井
.org/10.18653/v1/W19-5921
Elnaz Nouri and Ehsan Hosseini-Asl. 2018. To-
ward scalable neural dialogue state tracking
models. arXiv preprint arXiv:1812.00899.
Jiahuan Pei, Pengjie Ren, and Maarten de Rijke.
2019. A modular task-oriented dialogue sys-
tem using a neural mixture-of-experts. arXiv
preprint arXiv:1907.05346.
Baolin Peng, Chunyuan Li, Zhu Zhang,
Chenguang Zhu, Jinchao Li, and Jianfeng Gao.
2020a. RADDLE: An evaluation benchmark
and analysis platform for robust task-oriented
dialog systems. CoRR, abs/2012.14666.
Baolin Peng, Xiujun Li, Lihong Li, Jianfeng Gao,
Asli Celikyilmaz, Sungjin Lee, and Kam-Fai
Wong. 2017. Composite task-completion dia-
logue policy learning via hierarchical deep re-
inforcement learning. In Proceedings of the
2017 Conference on Empirical Methods in Nat-
ural Language Processing, pages 2231–2240.
https://doi.org/10.18653/v1/D17
-1237
Baolin Peng, Chenguang Zhu, Chunyuan Li,
Xiujun Li, Jinchao Li, Michael Zeng, and
Jianfeng Gao. 2020b. Few-shot natural language
generation for task-oriented dialog. In
Findings of the Association for Computational
Linguistics: EMNLP 2020, pages 172–182,
Online. Association for Computational
Linguistics. https://doi.org/10.18653/v1
/2020.findings-emnlp.17
Shuke Peng, Xinjing Huang, Zehao Lin,
Feng Ji, Haiqing Chen, and Yin Zhang.
2019. Teacher-student framework enhanced
multi-domain dialogue generation. arXiv
preprint arXiv:1908.07137.
Alec Radford, Jeffrey Wu, Rewon Child, David
Luan, Dario Amodei, and Ilya Sutskever. 2019.
Language models are unsupervised multitask
learners.
Osman Ramadan, Paweł Budzianowski, and
Milica Gasic. 2018. Large-scale multi-domain
belief tracking with knowledge sharing. In
Proceedings of the 56th Annual Meeting of
the Association for Computational Linguistics
(Volume 2: Short Papers), pages 432–437.
https://doi.org/10.18653/v1/P18
-2069
Abhinav Rastogi, Xiaoxue Zang, Srinivas
Sunkara, Raghav Gupta, and Pranav Khaitan.
2020. Towards scalable multi-domain conver-
sational agents: The schema-guided dialogue
dataset. In Proceedings of the AAAI Confer-
ence on Artificial Intelligence, volume 34,
pages 8689–8696. https://doi.org/10
.1609/aaai.v34i05.6394
Liliang Ren, Jianmo Ni, and Julian McAuley.
2019. Scalable and accurate dialogue state
tracking via hierarchical sequence generation.
In Proceedings of the 2019 Conference
on Empirical Methods in Natural Language
Processing and the 9th International Joint
Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 1876–1885.
Stephen Roller, Y-Lan Boureau, Jason Weston,
Antoine Bordes, Emily Dinan, Angela Fan,
David Gunning, Da Ju, Margaret Li, Spencer
Poff, Pratik Ringshia, Kurt Shuster, Eric
Michael Smith, Arthur Szlam, Jack Urbanek,
and Mary Williamson. 2020a. Open-domain
conversational agents: Current progress, open
problems, and future directions. CoRR,
abs/2006.12442.
Stephen Roller, Emily Dinan, Naman Goyal,
Da Ju, Mary Williamson, Yinhan Liu, Jing Xu,
Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan
Boureau, and Jason Weston. 2020b. Recipes
for building an open-domain chatbot. arXiv
preprint arXiv:2004.13637.
Rico Sennrich, Barry Haddow, and Alexandra
Birch. 2016. Neural machine translation of rare
words with subword units. 在诉讼程序中
the 54th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long
Papers), pages 1715–1725.
Swadheen Shukla, Lars Liden, Shahin Shayandeh,
Eslam Kamal, Jinchao Li, Matt Mazzola,
Thomas Park, Baolin Peng, and Jianfeng
高. 2020. Conversation learner-a machine
teaching tool for building dialog managers for
task-oriented dialog systems. In Proceedings
of the 58th Annual Meeting of the Associa-
tion for Computational Linguistics: 系统
Demonstrations, pages 343–349. https://
doi.org/10.18653/v1/2020.acl
-demos.39
Kurt Shuster, Da Ju, Stephen Roller, Emily
Dinan, Y-Lan Boureau, and Jason Weston.
2020. The dialogue dodecathlon: Open-domain
knowledge and image grounded conversational
agents. In Proceedings of the 58th Annual
Meeting of the Association for Computational
Linguistics, ACL 2020, Online, July 5–10,
2020, pages 2453–2470. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2020.acl-main.222
Patrice Y. Simard, Saleema Amershi,
David Maxwell Chickering, Alicia Edelman
Pelton, Soroush Ghorashi, Christopher Meek,
Gonzalo Ramos, Jina Suh, Johan Verwey,
Mo Wang, and John Wernsing. 2017. Machine
teaching: A new paradigm for building machine
learning systems. CoRR, abs/1707.06742.
Tsung-Hsien Wen, David Vandyke, Nikola
Mrkˇsi´c, Milica Gasic, Lina M. Rojas Barahona,
Pei-Hao Su, Stefan Ultes, and Steve Young.
2017. A network-based end-to-end trainable
task-oriented dialogue system. In Proceed-
ings of the 15th Conference of the European
Chapter of
the Association for Computa-
tional Linguistics: 体积 1, Long Papers,
pages 438–449.
Jason D. Williams, Kavosh Asadi Atui, and
Geoffrey Zweig. 2017. Hybrid code networks:
Practical and efficient end-to-end dialog control
with supervised and reinforcement learning.
In Proceedings of the 55th Annual Meeting of
the Association for Computational Linguistics
(Volume 1: Long Papers), pages 665–677.
https://doi.org/10.18653/v1/P17
-1062
Jason D. Williams and Lars Liden. 2017.
Demonstration of interactive teaching for end-to-end
dialog control with hybrid code networks. In
Proceedings of the 18th Annual SIGdial Meet-
ing on Discourse and Dialogue, pages 82–85.
https://doi.org/10.18653/v1/W17
-5511
Thomas Wolf, Lysandre Debut, Victor Sanh,
Julien Chaumond, Clement Delangue, Anthony
Moi, Pierric Cistac, Tim Rault, Remi Louf,
Morgan Funtowicz, Joe Davison, Sam Shleifer,
Patrick von Platen, Clara Ma, Yacine Jernite,
Julien Plu, Canwen Xu, Teven Le Scao, Sylvain
Gugger, Mariama Drame, Quentin Lhoest,
and Alexander Rush. 2020. Transformers:
State-of-the-art natural language processing. In
Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing:
System Demonstrations, pages 38–45, Online.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020
.emnlp-demos.6
Chien-Sheng Wu, Steven C. H. Hoi, Richard
Socher, and Caiming Xiong. 2020a. TOD-BERT:
Pre-trained natural language understanding
for task-oriented dialogue. In Proceedings of
the 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP),
pages 917–929.
Chien-Sheng Wu, Andrea Madotto, Ehsan
Hosseini-Asl, Caiming Xiong, Richard Socher,
and Pascale Fung. 2019a. Transferable
multi-domain state generator for task-oriented
dialogue systems. In Proceedings of the
57th Annual Meeting of the Association for
Computational Linguistics, pages 808–819.
Qingyang Wu, Yichi Zhang, Yu Li, and Zhou Yu.
2019b. Alternating recurrent dialog model with
large-scale pre-trained language models. arXiv
preprint arXiv:1910.03756.
Zeqiu Wu, Michel Galley, Chris Brockett,
Yizhe Zhang, Xiang Gao, Chris Quirk, Rik
Koncel-Kedziorski, Jianfeng Gao, Hannaneh
Hajishirzi, Mari Ostendorf, and Bill Dolan.
2020b. A controllable model of grounded
response generation. CoRR, abs/2005.00613.
Haotian Xu, Haiyun Peng, Haoran Xie, Erik
Cambria, Liuyang Zhou, and Weiguo Zheng.
2019. End-to-end latent-variable task-oriented
dialogue system with exact log-likelihood
optimization. World Wide Web, pages 1–14.
Steve J. Young, Milica Gasic, Blaise Thomson,
and Jason D. Williams. 2013. POMDP-based
statistical spoken dialog systems: A review.
Proceedings of the IEEE, 101(5):1160–1179.
https://doi.org/10.1109/JPROC
.2012.2225812
Tom Young, Erik Cambria, Iti Chaturvedi, Hao
Zhou, Subham Biswas, and Minlie Huang.
2018. Augmenting end-to-end dialogue systems
with commonsense knowledge. In Proceedings
of the Thirty-Second AAAI Conference on Ar-
tificial Intelligence, pages 4970–4977. AAAI
Press.
Thomas Wolf, Victor Sanh, Julien Chaumond,
and Clement Delangue. 2019. TransferTransfo:
A transfer learning approach for neural net-
work based conversational agents. CoRR,
abs/1901.08149.
Tom Young, Vlad Pandelea, Soujanya Poria, 和
Erik Cambria. 2020. Dialogue systems with
audio context. Neurocomputing, 388:102–109.
https://doi.org/10.1016/j.neucom
.2019.12.126
Rowan Zellers, Ari Holtzman, Hannah Rashkin,
Yonatan Bisk, Ali Farhadi, Franziska Roesner,
and Yejin Choi. 2019. Defending against neural
fake news. In Advances in Neural Information
Processing Systems.
Jianguo Zhang, Kazuma Hashimoto, Chien-Sheng
Wu, Yao Wang, S. Yu Philip, Richard Socher,
and Caiming Xiong. 2020a. Find or classify?
Dual strategy for slot-value predictions on
multi-domain dialog state tracking. In Proceedings
of the Ninth Joint Conference on Lexical
and Computational Semantics, pages 154–167.
Yichi Zhang, Zhijian Ou, and Zhou Yu. 2020b.
Task-oriented dialog systems that consider
multiple appropriate responses under the same
context. In Proceedings of the AAAI Conference
on Artificial Intelligence, volume 34,
pages 9604–9611.
Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun
Chen, Chris Brockett, Xiang Gao, Jianfeng
Gao, Jingjing Liu, and Bill Dolan. 2020c.
DIALOGPT: Large-scale generative pre-training
for conversational response generation. In
Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics:
System Demonstrations, pages 270–278, Online.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020
.acl-demos.30
Tiancheng Zhao and Maxine Eskenazi. 2016.
Towards end-to-end learning for dialog state
tracking and management using deep rein-
forcement learning. In Proceedings of the 17th
Annual Meeting of the Special Interest Group
on Discourse and Dialogue, pages 1–10.
Tiancheng Zhao, Kaige Xie, and Maxine
Eskenazi. 2019. Rethinking action spaces
for reinforcement learning in end-to-end
dialog agents with latent variable models. In
Proceedings of the 2019 Conference of the
North American Chapter of the Association
for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short
Papers), pages 1208–1218.
Victor Zhong, Caiming Xiong, and Richard
Socher. 2018. Global-locally self-attentive
encoder for dialogue state tracking. In
Proceedings of the 56th Annual Meeting of the
Association for Computational Linguistics
(Volume 1: Long Papers), pages 1458–1467.
https://doi.org/10.18653/v1/P18
-1135
Hao Zhou, Minlie Huang, Tianyang Zhang,
Xiaoyan Zhu, and Bing Liu. 2018. Emotional
chatting machine: Emotional conversation gen-
eration with internal and external memory.
In Proceedings of the AAAI Conference on
Artificial Intelligence, volume 32.
Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung
Shum. 2020. The design and implementation of
XiaoIce, an empathetic social chatbot.
Computational Linguistics, 46(1):53–93. https://
doi.org/10.1162/coli_a_00368
Li Zhou and Kevin Small. 2019. Multi-domain
dialogue state tracking as dynamic knowl-
edge graph enhanced question answering. arXiv
preprint arXiv:1911.06192.
Qi Zhu, Zheng Zhang, Yan Fang, Xiang Li,
Ryuichi Takanobu, Jinchao Li, Baolin Peng,
Jianfeng Gao, Xiaoyan Zhu, and Minlie
黄. 2020. ConvLab-2: An open-source
toolkit for building, evaluating, and diagnos-
ing dialogue systems. In Proceedings of the
58th Annual Meeting of the Association for
Computational Linguistics: System Demonstrations,
pages 142–149, Online. Association
for Computational Linguistics. https://doi
.org/10.18653/v1/2020.acl-demos.19
Xiaojin Zhu. 2015. Machine teaching: An inverse
problem to machine learning and an approach
toward optimal education. In Twenty-Ninth
AAAI Conference on Artificial Intelligence.