SOLOIST: Building Task Bots at Scale with
Transfer Learning and Machine Teaching

Baolin Peng, Chunyuan Li, Jinchao Li
Shahin Shayandeh, Lars Liden, Jianfeng Gao

Microsoft Research, Redmond, United States
{bapeng,chunyl,jincli,shahins,lars.liden,jfgao}@microsoft.com

Abstract

We present a new method, SOLOIST,1 which
uses transfer learning and machine teaching
to build task bots at scale. We parameterize
classical modular task-oriented dialog systems
using a Transformer-based auto-regressive
language model, which subsumes different
dialog modules into a single neural model. We
pre-train, on heterogeneous dialog corpora,
a task-grounded response generation model,
which can generate dialog responses grounded
in user goals and real-world knowledge for
task completion. The pre-trained model can be
efficiently adapted to accomplish new tasks
with a handful of task-specific dialogs via
machine teaching, where training samples are
generated by human teachers interacting with
the system. Experiments show that (i ) SOLOIST
creates new state-of-the-art on well-studied
task-oriented dialog benchmarks,
including CamRest676 and MultiWOZ; (ii) in the
few-shot fine-tuning settings, SOLOIST signif-
icantly outperforms existing methods; and
(iii) the use of machine teaching substantially
reduces the labeling cost of fine-tuning. The
pre-trained models and codes are available at
https://aka.ms/soloist.

1 Introduction

The increasing use of personal assistants and
messaging applications has spurred interest in
building task-oriented dialog systems (or task
bots) that can communicate with users through
natural language to accomplish a wide range of
tasks, such as restaurant booking, weather query,
flight booking, and IT helpdesk (e.g., Zhou et al.,
2020; Adiwardana et al., 2020; Roller et al.,
2020b; Gao et al., 2020; Peng et al., 2020a). The

1TASK-ORIENTED DIALOG WITH A SINGLE PRE-TRAINED
MODEL. In this paper, SOLOIST refers to both the proposed bot
building method and the dialog model or system developed
using the method.

wide variety of tasks and domains has created
the need for a flexible task-oriented dialog devel-
opment platform that can support many different
use cases while remaining straightforward for
developers to use and maintain.

A typical task-oriented dialog system uses a
modular pipeline, which has four modules and
executes sequentially (Young et al., 2013; Gao
et al., 2019a), as shown in Figure 1(a). A natural
language understanding (NLU) module identifies
user intents and extracts associated information
such as slots and their values from users’ input. A
dialog state tracker (DST) infers the belief state
(or user goal) from dialog history. The belief state
is often used to query a task-specific database
(DB) to obtain the DB state, such as the number of
entities that match the user goal. The dialog state
and DB state are then passed to a dialog policy
(POL) to select the next system action. A natural
language generation (NLG) module converts the
action to a natural language response.
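As a toy illustration of this four-module pipeline, the sketch below chains rule-based stand-ins for the learned components; all module bodies, slot names, and the sample database are invented for illustration.

```python
# Toy sketch of the modular pipeline: NLU -> DST -> DB lookup -> POL -> NLG.
# Every module body here is an illustrative stand-in, not a learned model.

def nlu(utterance):
    """Extract a coarse intent and slot values from the user input."""
    slots = {}
    if "cheap" in utterance:
        slots["pricerange"] = "cheap"
    if "chinese" in utterance:
        slots["food"] = "chinese"
    return {"intent": "find_restaurant", "slots": slots}

def dst(belief, nlu_out):
    """Accumulate slot values across turns into the belief state."""
    belief = dict(belief)
    belief.update(nlu_out["slots"])
    return belief

def db_lookup(belief, database):
    """Deterministically count entities matching the belief state."""
    matches = [e for e in database
               if all(e.get(k) == v for k, v in belief.items())]
    return {"n_matches": len(matches), "matches": matches}

def policy(db_state):
    """Pick the next system action from the belief and DB states."""
    return "inform_name" if db_state["n_matches"] > 0 else "request_more"

def nlg(action, db_state):
    """Convert the action into a (delexicalized) response."""
    if action == "inform_name":
        return "[restaurant_name] is a nice place matching your request."
    return "Sorry, no match. Could you change your requirements?"

DATABASE = [{"name": "golden wok", "pricerange": "cheap", "food": "chinese"}]
belief = dst({}, nlu("I want a cheap chinese restaurant"))
db_state = db_lookup(belief, DATABASE)
response = nlg(policy(db_state), db_state)
```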

Most popular commercial tools for dialog
development employ the modular systems,
including Google’s Dialog Flow,2 Microsoft’s
Power Virtual Agents (PVA),3 Facebook’s
Wit.ai,4 Amazon’s Lex,5 and IBM’s Watson
Assistant.6 They are designed mainly to help
develop systems manually, namely, writing code,
crafting rules and templates. Unfortunately, even
with these tools, building dialog systems remains
a label-intensive, time-consuming task, requiring
rich domain knowledge, reasonable coding skill,
and expert experience. The cost of building dialog
systems at scale (i.e., tens of thousands of bots for
different tasks) can be prohibitively expensive.

2https://dialogflow.com/.
3https://powervirtualagents.microsoft.com/.
4https://wit.ai/.
5https://aws.amazon.com/lex/.
6https://www.ibm.com/watson/.


Transactions of the Association for Computational Linguistics, vol. 9, pp. 807–824, 2021. https://doi.org/10.1162/tacl_a_00399
Action Editor: James Henderson. Submission batch: 7/2020; Revision batch: 1/2021; Published 8/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Cifra 1: Illustration of a traditional modular task-oriented dialog system, an example for the model input, y
the proposed model. The SOLOIST solution utilizes a single neural auto-regressive model in (C) to parameterize the
sequential dialog pipeline in (a), with input sequence represented in (b). Different from GPT-2, the SOLOIST model
learns to ground response generation in user goals and database/knowledge.

With the recent advances in neural approaches
to conversational AI (Gao et al., 2019a), research-
ers have been developing data-driven methods and
neural models for either individual dialog mod-
ules or end-to-end systems. For example, recent
attempts such as RASA (Bocklisch et al., 2017),
ConvLab (Lee et al., 2019b; Zhu et al., 2020), and
Conversation Learner (Shukla et al., 2020) have been
made to allow the use of data-driven approaches
based on machine learning and machine teaching
to develop dialog modules. End-to-end trainable
dialog systems have also been studied (e.g., Wen
et al., 2017; Zhao and Eskenazi, 2016; Le et al.,
2017; Williams et al., 2017; Lei et al., 2018; Gao
et al., 2019a; Zhang et al., 2020b). Although these
methods have achieved promising results, they
require large amounts of task-specific labeled
data for training, which are rarely available for
new tasks in real-world applications.

In this paper, we propose a novel method
of building task bots at scale, SOLOIST, which
significantly eases the workflow of training and
deploying dialog systems for new tasks, compared
to existing tools and methods. Our approach is
inspired by the recent success of applying transfer
learning to natural language processing (NLP)
tasks: Large language models pre-trained on large
amounts of raw text (p.ej., BERT (Devlin et al.,
2019), RoBERTa (Liu et al., 2019), and UniLM
(Dong et al., 2019)) can be effectively fine-tuned
for a wide range of NLP tasks with few in-domain
labels. Recently, these pre-trained language mod-
els have also been employed to develop dialog
modules such as NLU and DST (Henderson
et al., 2020; Coope et al., 2020; Wu et al.,
2020a). The proposed SOLOIST uses a similar pre-
training-and-fine-tuning framework for building
end-to-end dialog systems. We parameterize a task


bot using a Transformer-based auto-regressive
language model, which subsumes different dialog
modules (i.e., NLU, DST, POL, and NLG) into a
single neural model. Task bot building proceeds in
two stages: (i) In the pre-training stage, initialized
using GPT-2 (Radford et al., 2019), we train a
Transformer-based, task-grounded, response gen-
eration model using large heterogeneous dialog
corpus. The model learns the primary task com-
pletion skills such as DST and POL, and can
generate dialog responses grounded in user goals
and real-world knowledge for task completion.
(ii) In the fine-tuning stage, we adapt the pre-
trained SOLOIST model to complete a specific
(new) task using a handful of task-specific dialogs
via machine teaching, where training samples are
generated by human teachers interacting with the
system (Zhu, 2015; Shukla et al., 2020).

We show through a comprehensive empirical
study that SOLOIST is an effective method of build-
ing task bots at scale by successfully transferring
two capabilities from the pre-trained model to
a new task bot: (i) the capability of NLU and
NLG learned on raw text, and (ii) the capability
of grounding system responses in user goals
and real-world knowledge for task completion,
learned on out-of-domain dialog corpora.

SOLOIST achieves state-of-the-art performance
on two well-studied task-oriented dialog bench-
marks, lifting the combined score by 10 points
in automatic evaluation, and the success rate by
20 points in human evaluation. In the few-shot
fine-tuning settings, SOLOIST adapts to the new
domain much more effectively than competing
methods, achieving a reasonable success rate
using less than 50 dialogs. The promising results
demonstrate the potential of the new method
for developing task bots at scale. Instead of
collecting and labeling data and building one bot per
task, we can pre-train a task-grounded response
generation model, and adapt it to new tasks via
transfer learning and machine teaching.

2 SOLOIST

2.1 An Auto-Regressive Model for Dialog

The modular dialog system in Figure 1 constitutes
a data processing pipeline that produces a
sequence by concatenating the input-output
pair of each module along the generation process.
Each consecutive pair in this sequence plays

the role of annotated data for the corresponding
module. Ideally, when the entire sequence is
available, the data generation process of a dia-
log system (NLU, DST, POL, NLG) can be for-
mulated as a single auto-regressive model.

GPT-2 (Radford et al., 2019) is a state-of-
the-art (SoTA) auto-regressive language model
trained on large amounts of open Web text data.
Although after being fine-tuned using conver-
sational data, GPT-2 can respond to users with
realistic and coherent continuations about any
topic of their choosing (Zhang et al., 2020c), the
generated responses are not useful for completing
any specific task due to the lack of grounding.
SOLOIST inherits GPT-2’s capability of produc-
ing human-like responses. However, unlike
GPT-2, SOLOIST is pre-trained to generate re-
sponses grounded in user goals and real-world
knowledge for task completion. While GPT-2 is
a language model for text prediction, SOLOIST is a
stateful decision-making model for task comple-
ción, with the capabilities of tracking dialog states,
selecting the best system actions, and so on. Thus,
SOLOIST is pre-trained using task-oriented dialog
sessions annotated with grounding information,
i.e., user goals, dialog belief states, DB states, and
system responses. Specifically, each dialog turn
in our training data is represented as:

x = (s, b, c, r),    (1)

where s is the dialog history up to the current
dialog turn, b is the dialog belief state acquired
from human annotation, c is the DB state auto-
matically retrieved from a database using b, and r
is the delexicalized dialog response, from which
the system response in natural language can be
generated using some automatic post-processing.
Each item in x is by itself a sequence of tokens,
as illustrated by the examples in Figure 1(b).
Thus, it is natural to treat the concatenation of
them as one long sequence for model training, as
shown in Figure 1(c). We pre-train the SOLOIST
model using publicly available heterogeneous
dialog corpora with labels of belief states and DB
states. The pre-trained model can be fine-tuned to
any new task to generate responses grounded in
task-specific user goals and a database.
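The turn representation x = (s, b, c, r) can be sketched as a simple serialization helper. The delimiter tokens <EOB>, <EOKB>, <EOS> follow the examples in this paper; the helper name and the "=> Belief State:" marker are illustrative assumptions.

```python
def serialize_turn(history, belief, db_state, response,
                   eob="<EOB>", eokb="<EOKB>", eos="<EOS>"):
    """Concatenate (s, b, c, r) into one long training sequence,
    mirroring the x = (s, b, c, r) layout. The segment markers are
    illustrative; the delimiter tokens follow the paper's examples."""
    return " ".join([history, "=> Belief State:", belief, eob,
                     "DB:", db_state, eokb, response, eos])

x = serialize_turn(
    "User: I would like an expensive chinese restaurant in the north.",
    "Restaurant { pricerange = expensive, food = chinese, area = north }",
    "Restaurant 1 match",
    "The [restaurant name] is a great [value food] restaurant.")
```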

2.2 Task-Grounded Pre-Training
Given training data of N samples D = {xn}norte
norte=1,
our goal is to build a neural model parameterized

by θ to characterize the sequence generation
probability pθ(X). We use a multi-task objective
for learning θ, where each task is a self-supervised
learning task.

To leverage the sequential structure of a
task-oriented dialog system, the joint probability
pag(X) can be factorized in the auto-regressive
manner as:

p(x) = p(r, c, b, s)    (2)
     = p(r|c, b, s) · p(b|s) · p(s),    (3)

with the first factor corresponding to grounded response generation and the second to belief prediction,
where the factorization from (2) to (3) is based
on the fact that p(c|b, s) = p(c|b) = 1, because
the DB state c is obtained using a deterministic
database-lookup process given a belief state b
(e.g., via an API call). Note that (3) decomposes
the joint distribution modeling problem into two
sub-problems: belief state prediction p(b|s) and
grounded response generation p(r|c, b, s). Since
b and r are sequences, we can further factorize
each of them in a left-to-right auto-regressive
manner.
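A minimal numeric sketch of this factorization, with invented per-token probabilities: the deterministic DB-state factor contributes log 1 = 0, so the joint log-likelihood reduces to the belief and response terms.

```python
import math

def sequence_logprob(token_probs):
    """Left-to-right auto-regressive log-likelihood of one segment:
    sum_t log p(token_t | tokens_<t, context)."""
    return sum(math.log(p) for p in token_probs)

# Invented per-token probabilities for the belief segment b and
# response segment r. The DB state c is a deterministic lookup given
# b, so p(c|b) = 1 contributes log 1 = 0 to the joint log-likelihood.
belief_token_probs = [0.9, 0.8, 0.95]    # p(b_t | b_<t, s)
response_token_probs = [0.7, 0.85, 0.9]  # p(r_t | r_<t, c, b, s)

log_p_b = sequence_logprob(belief_token_probs)
log_p_r = sequence_logprob(response_token_probs)
log_p_x = log_p_b + log_p_r              # log p(c|b) = 0 omitted
```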

Task 1: Belief Prediction. For a belief state
sequence of length T_b, we define the objective of
predicting the belief state as:

L_B = log p(b|s) = Σ_{t=1}^{T_b} log p_θ(b_t | b_{<t}, s),    (4)

where b_{<t} denotes the belief-state tokens before
position t. Following the format in Figure 1(b),
the grounded portion of a dialog turn is
represented as, e.g.:

Belief State: Restaurant { pricerange = expensive,
food = chinese, area = north } <EOB> DB: Restau-
rant 1 match <EOKB> The [restaurant name]
is a great [value food] restaurant. Would you
like to book a table there ? <EOS>

This sequence, tokenized using byte pair encod-
ings (Sennrich et al., 2016), can be readily used for
multi-task training, as shown in Figure 1(c). The
implementation of SOLOIST is based on the Hugging-
face PyTorch Transformers library (Wolf et al., 2020). The
task-grounded pre-training of SOLOIST uses the
public 117M-parameter GPT-2 as initialization.

Name           #Dialog   #Utterance   Avg. Turn   #Domain

task-grounded pre-training:
Schema          22,825    463,284       20.3        17
Taskmaster      13,215    303,066       22.9         6

fine-tuning:
MultiWOZ2.0     10,420     71,410        6.9         7
CamRest676         676      2,744        4.1         1
Banking77            –     25,716          –        21
Restaurant-8k        –      8,198          –         1

Table 1: Dialog corpora. The datasets in the upper
block are used for task-grounded pre-training, and
the datasets in the lower block are for fine-tuning.

Adam (Kingma and Ba, 2014) with weight
decay is used for pre-training. Table 1 shows the
dialog corpora (Kim et al., 2019; Rastogi et al.,
2020; Byrne et al., 2019) used for task-grounded
pre-training. To ensure there is no overlap bet-
ween the pre-training and fine-tuning datasets, we ex-
clude any data akin to MultiWOZ (Budzianowski
et al., 2018), CamRest676 (Wen et al., 2017),
Banking77 (Casanueva et al., 2020), and Restaurant-
8k (Coope et al., 2020).

2.3 Fine-Tuning and Machine Teaching

When deploying SOLOIST to a new task, we collect
task-specific x in the same format as that used
for pre-training, as in (1). When x is available, the
conventional fine-tuning procedure is utilized: we
use the same multi-task objective of (7) to update
θ, adapting the model to complete the new task
using labeled task-specific dialogs.

In real applications, annotated task-specific
data is often unavailable, noisy, or incomplete
beforehand. One may deploy the dialog system
and acquire high-quality task-specific labels (e.g.,
belief state and system response) for each dialog
turn using machine teaching. Machine teaching
is an active learning paradigm that focuses
on leveraging the knowledge and expertise of
domain experts as ‘‘teachers’’. This paradigm
puts a strong emphasis on tools and techniques
that enable teachers—particularly non-data
scientists and non-machine-learning experts—to
visualize data, find potential problems, and
provide corrections or additional training inputs
in order to improve the system’s performance
(Simard et al., 2017; Zhu, 2015; Williams and
Liden, 2017; Shukla et al., 2020).

We perform fine-tuning using Conversation
Learner (Shukla et al., 2020), a machine teaching

811

tool, in the following steps: (i) Dialog authors
deploy the pre-trained SOLOIST model for a specific
task. (ii) Users (or human subjects recruited for
system fine-tuning) interact with the system and
generate human-bot dialog logs. (iii) Dialog
authors revise a dozen training samples by se-
lecting representative failed dialogs from the logs
and correcting their belief states and/or responses so that the
system can complete these dialogs successfully, as
illustrated in Figure 2. The corrected task-specific
dialog turns are used to fine-tune the model.
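One machine-teaching correction can be thought of as a fully labeled dialog turn appended to the fine-tuning set. The record below is a sketch with assumed field names and sample values, not the Conversation Learner schema.

```python
# Illustrative record of one machine-teaching correction: a failed
# turn from the human-bot logs whose belief state and response were
# fixed by a dialog author, then added to the fine-tuning data.
# All field names and values here are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class CorrectedTurn:
    history: str
    belief_state: str   # corrected by the dialog author
    db_state: str       # re-queried from the corrected belief state
    response: str       # selected/inserted by the dialog author

fine_tune_set = []
fine_tune_set.append(CorrectedTurn(
    history="User: I need a cheap hotel in the east.",
    belief_state="Hotel { pricerange = cheap, area = east }",
    db_state="Hotel 3 matches",
    response="I found [value_count] cheap hotels in the east."))
```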

Implementation Details. To adapt a pre-trained
SOLOIST to a new task in our experiments, we
always fine-tune SOLOIST using a small amount
of pre-collected task-specific dialogs, and then
continue to fine-tune it via machine teaching,
as detailed in Section 3.3. Training examples
are truncated to ensure a maximal length of
512. The pre-trained models are fine-tuned with
a mini-batch of 6 on 8 Nvidia V100 GPUs until no
progress is observed on validation data or up to
10 epochs. Nucleus sampling (Holtzman et al.,
2019) is used for decoding, where the sampling
top-p ranges from 0.2 to 0.5 for all our models.
The best setup of hyper-parameters is selected
through grid search on the validation set. For the
machine teaching experiments, pre-trained models
are fine-tuned with SGD on a single Nvidia V100.
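Nucleus (top-p) sampling itself can be sketched in a few lines; this is a from-scratch illustration over an explicit probability list, not the decoding code used in the paper.

```python
import random

def nucleus_sample(probs, top_p=0.5, rng=random):
    """Top-p (nucleus) sampling (Holtzman et al., 2019): keep the
    smallest set of highest-probability tokens whose cumulative mass
    reaches top_p, renormalize within that set, and sample from it."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cum = [], 0.0
    for i in order:
        nucleus.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    total = sum(probs[i] for i in nucleus)
    r, acc = rng.random() * total, 0.0
    for i in nucleus:
        acc += probs[i]
        if r <= acc:
            return i
    return nucleus[-1]
```

With a peaked distribution and a small top_p, the nucleus collapses to the single most likely token, which is why small top-p values trade diversity for reliability in task-oriented decoding.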

3 Experiments

This section evaluates the proposed SOLOIST to
answer three questions: Q1: How does SOLOIST
perform on standard benchmarks compared to
SoTA methods? Q2: Does SOLOIST meet the goal
of effectively generalizing to new domains in the
few-shot fine-tuning setting? Q3: How effective
is machine teaching for fine-tuning? Note that
we employ the conventional fine-tuning method
without machine teaching for a fair comparison
when studying Q1 and Q2.

3.1 Experimental Setup

Dialog Datasets for Fine-Tuning. We validate
the end-to-end dialog system performance of
SOLOIST on two well-studied datasets. (i) Cam-
Rest676 (Wen et al., 2017) is a single-domain task-
oriented dialog corpus. It contains 408/136/136
dialogs for training/validation/testing, respect-
ively. Following Lei et al. (2018), we delexicalize
each token that occurs in the ontology with its slot


Figure 2: Illustration of the machine teaching process using Conversation Learner. The human-bot conversation log in
(a) can be edited by correcting its belief state in (b), and selecting/inserting a more appropriate response in (c).

names such as restaurant name, phone number, and
postcode. (ii) MultiWOZ (Budzianowski
et al., 2018) is a multi-domain task-oriented dialog
dataset. It contains 8,438/1,000/1,000 dialogs for train-
ing/validation/testing, respectively. Each dialog
session covers 1 to 3 domains, such as Attrac-
tion, Hotel, Hospital, Police, Restaurant, Train,
and Taxi. MultiWOZ is inherently challenging
due to its multi-domain setting and diverse lan-
guage styles.
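Delexicalization can be sketched as value-to-placeholder substitution over an ontology; the slot names and values below are illustrative, and longest-match-first ordering avoids splitting multi-word values.

```python
def delexicalize(text, ontology):
    """Replace ontology values in a response with slot-name
    placeholders, e.g. 'golden wok' -> '[restaurant_name]'.
    Longer values are replaced first so that multi-word values
    are not partially overwritten by shorter ones."""
    pairs = sorted(((v, s) for s, vals in ontology.items() for v in vals),
                   key=lambda p: len(p[0]), reverse=True)
    for value, slot in pairs:
        text = text.replace(value, f"[{slot}]")
    return text

# Illustrative ontology fragment (invented values).
ontology = {"restaurant_name": ["golden wok"],
            "phone_number": ["01223 350688"]}
out = delexicalize("golden wok's phone is 01223 350688.", ontology)
```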

Automatic Evaluation Metrics. Following
Budzianowski et al. (2018), Inform, Success,
and BLEU scores are reported. The first two metrics
relate to dialog task completion: whether
the system has provided an appropriate entity
(Inform) and then answered all the requested
attributes (Success). BLEU evaluates how natural
the generated responses are compared to those
generated by human agents. A com-
bined score (Combined) is also reported, using
Combined = (Inform + Success) × 0.5 + BLEU
as an overall quality measure.
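The Combined metric is a one-line computation; the sketch below checks it against SOLOIST's MultiWOZ numbers from Table 3.

```python
def combined_score(inform, success, bleu):
    """Overall quality: Combined = (Inform + Success) * 0.5 + BLEU,
    with Inform/Success as percentages and BLEU on a 0-100 scale."""
    return (inform + success) * 0.5 + bleu

# SOLOIST's MultiWOZ numbers from Table 3: 85.50, 72.90, 16.54.
score = combined_score(85.50, 72.90, 16.54)
```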

Baselines. We compare SOLOIST with several
strong baselines, which hold SoTA results on the Cam-
Rest676 or MultiWOZ datasets. (i) Multi-Action
Data Augmentation (DAMD) (Zhang et al.,
2020b) is a modular system, where each di-
alog module is implemented using a neural
network, and the whole system is trained in an
end-to-end manner. (ii) Sequicity (Lei et al.,
2018) is similar to DAMD except that it
does not use multi-action data augmentation.
(iii) GPT fine-tuning (Budzianowski and Vuli´c,
2019) fine-tunes GPT-2 to generate re-
sponses based on the dialog state and history.
(iv) ARDM (Wu et al., 2019b) utilizes GPT-2
as the pre-trained model to learn to generate
role-aware responses given dialog context. The
model has to work with a separate dialog state
tracker for task completion. (v) HDSA (Chen
et al., 2019) is a modular dialog system, which gen-
erates responses using a BERT-based dialog pol-
icy and graph-structured dialog act representations.

3.2 End-to-End Evaluation

CamRest676. Table 2 shows the results and lists the
annotations used by different models. SOLOIST
achieves the best scores on all the metrics. ARDM
performs similarly to SOLOIST in terms of Success
and BLEU. However, ARDM cannot track dialog
states and requires a separately trained state
tracker to accomplish tasks. GPT-2 fine-tuned
with task-specific data works reasonably well but
lags behind SOLOIST by a large margin. Sequicity,
which uses a jointly trained model with belief
state and policy annotations, underperforms
SOLOIST. This result also shows that, compared
to other end-to-end models, SOLOIST not only
achieves better performance but also requires lower
labeling cost for fine-tuning, owing to the use of
task-grounded pre-training.

MultiWOZ. The results are shown in Table 3.
SOLOIST achieves the best performance in terms of
Inform, Success, and Combined, lifting the pre-
vious SoTA by a significant margin (e.g., about 10
points improvement in Combined over DAMD).
SOLOIST also outperforms the method of Ham et al.
(2020), where GPT-2 is fine-tuned and applied
to end-to-end dialog modeling. Compared to the


Model                                            Annotations            Evaluation Metrics
                                                 Belief State  Policy   Inform ↑  Success ↑  BLEU ↑  Combined ↑

Sequicity (Lei et al., 2018)                          ✓          ✓       92.30     85.30     21.40    110.20
Sequicity (w/o RL)                                    ✓          ✓       94.00     83.40     23.40    112.10
GPT fine-tuning (Budzianowski and Vuli´c, 2019)       ✓          –           –     86.20     19.20         –
ARDM1 (Wu et al., 2019b)                              –          –           –     87.10     25.20         –
SOLOIST                                               ✓          –       94.70     87.10     25.50    116.40

1ARDM is not fully E2E, as it requires a rule-based dialog state tracker.

Table 2: End-to-end evaluation on CamRest676. Results of existing methods are from Wu et al.
(2019b).

Model                                       Annotations            Evaluation Metrics
                                            Belief State  Policy   Inform ↑  Success ↑  BLEU ↑  Combined ↑

Sequicity (Lei et al., 2018)                     ✓          ✓       66.41     45.32     15.54    71.41
HRED-TS (Peng et al., 2019)                      ✓          ✓       70.00     58.00     17.50    81.50
Structured Fusion (Mehri et al., 2019b)          ✓          ✓       73.80     58.60     16.90    83.10
DSTC8 Track 1 Winner1 (Ham et al., 2020)         ✓          ✓       73.00     62.40     16.00    83.50
DAMD (Zhang et al., 2020b)                       ✓          ✓       76.40     60.40     16.60    85.00
SOLOIST                                          ✓          –       85.50     72.90     16.54    95.74

1The result of the DSTC8 Track 1 Winner is produced by adapting their code to our setting.

Table 3: End-to-end evaluation on MultiWOZ.

classical modular dialog systems such as DAMD,
SOLOIST uses a much simpler architecture and
requires much less labeling effort. For example,
SOLOIST requires only the belief states, while
DAMD requires additional annotations for task
definition (i.e., defining the intents, slots, and the
corresponding value ranges) and dialog acts.

3.3 Few-Shot Evaluation

It is desirable for task bots to effectively gener-
alize to new tasks with few task-specific training
samples. Thus, the few-shot fine-tuning setting
is a more realistic setting for evaluating dialog
systems. Unfortunately, the existing task-oriented
dialog benchmarks typically contain hundreds
to thousands of dialogs for each task. Therefore,
we re-organize CamRest676 and MultiWOZ
to simulate the few-shot fine-tuning setting for
end-to-end evaluation.7 We sample from the
MultiWOZ dataset the dialog tasks that contain
only one domain; the Attraction, Train,
Hotel, and Restaurant domains are used.
We do not use the Police, Taxi,
and Hospital domains, as they do not require explicitly
tracking dialog states for task completion. For
each domain, we randomly sample 50 dialog
sessions for training and validation and 200 dialog
sessions for testing. The only exception is the

7We will release the re-organized datasets.

Domain   Attra.  Train  Hotel  Rest.  CamRest676

#Train      50     50     50     50        20
#Valid      50     50     50     50       136
#Test      100    200    200    200       136

Table 4: Data statistics for the domains used in few-
shot evaluation. Attra. denotes the Attraction
domain and Rest. the Restaurant domain.

Model                          CamRest676
                             Inform ↑  Success ↑  BLEU ↑

Sequicity (Lei et al., 2018)   60.61     66.11     11.15
SOLOIST w/o pre-training       73.88     72.22     13.11
SOLOIST                        85.82     84.22     19.18
SOLOISTL                       88.05     84.79     18.88

Table 5: End-to-end evaluation on CamRest676
in the few-shot fine-tuning setting.

Attraction domain, which has 100 sessions
for testing. For CamRest676, we randomly sam-
ple 20 sessions. Details are shown in Table 4.

Tables 5 and 6 report the end-to-end perfor-
mance in the few-shot fine-tuning settings on
CamRest676 and MultiWOZ, respectively. On
all the domains, SOLOIST obtains substantially
better performance in all the metrics. Removing
task-grounded pre-training significantly hurts
the performance of SOLOIST, although SOLOIST

Model                        Attraction                Train                     Hotel                     Restaurant
                      Inform ↑ Success ↑ BLEU ↑  Inform ↑ Success ↑ BLEU ↑  Inform ↑ Success ↑ BLEU ↑  Inform ↑ Success ↑ BLEU ↑

DAMD (Zhang et al., 2020b)   70.00  15.00   6.90   75.00  39.50   6.20   62.50  20.50   7.60   68.00  19.50  10.50
SOLOIST w/o pre-training     65.66  46.97   5.85   59.00  44.00   7.07   62.50  40.00   7.70   75.50  44.50  11.00
SOLOIST                      86.00  65.00  12.90   80.81  64.65   9.96   74.50  43.50   8.12   81.00  55.50  12.80
SOLOISTL                     86.00  68.00  14.60   81.31  74.24  11.90   75.00  51.50  10.09   84.00  62.50  13.17

Table 6: End-to-end evaluation on MultiWOZ in the few-shot fine-tuning setting.

Model                        1%                        5%                        10%                       20%
                      Inform ↑ Success ↑ BLEU ↑  Inform ↑ Success ↑ BLEU ↑  Inform ↑ Success ↑ BLEU ↑  Inform ↑ Success ↑ BLEU ↑

DAMD (Zhang et al., 2020b)   34.40   9.10   8.10   52.50  31.80  11.60   55.30  30.30  13.00   62.60  44.10  14.90
SOLOIST w/o pre-training     46.10  24.40  10.39   63.40  38.70  11.19   64.90  44.50  13.57   70.10  52.20  14.72
SOLOIST                      58.40  35.30  10.58   69.30  52.30  11.80   69.90  51.90  14.60   74.00  60.10  15.24

Table 7: End-to-end evaluation on MultiWOZ with varying sizes of task-specific training data for
fine-tuning.

Model            Attraction                Train                     Hotel                     Restaurant
          Inform ↑ Success ↑ BLEU ↑  Inform ↑ Success ↑ BLEU ↑  Inform ↑ Success ↑ BLEU ↑  Inform ↑ Success ↑ BLEU ↑

SOLOIST          45.00  19.00   7.67   67.68  58.08   7.13   33.50  22.50   8.70   50.50  10.00   8.61
SOLOIST+Extra    63.00  41.00  11.08   65.15  57.58   9.74   41.50  19.00   7.96   44.50  27.00   9.77
SOLOIST+Teach    78.00  45.00  11.90   68.18  63.64   9.45   46.50  22.50   7.68   53.00  32.00   9.81

Table 8: Machine teaching results. SOLOIST is trained with 10 examples for each domain. SOLOIST+Teach
indicates continual training with 5 dialogs recommended by CL with human teacher corrections.
SOLOIST+Extra indicates continual training using 5 randomly sampled dialogs with full annotations.

without task-grounded pre-training still consis-
tently outperforms DAMD in all the domains.
SOLOIST without task-grounded pre-training is
conceptually similar to Ham et al. (2020),
but is architecturally simpler and needs fewer
annotations. The result verifies the importance of
task-grounded pre-training on annotated dialog
corpora, allowing SOLOIST to learn how to track
dialog and database states to accomplish a task.
To study the impact of using a larger model size, we
build a large version of SOLOIST, SOLOISTL, which
is task-grounded pre-trained on the same data
but using GPT-2medium with 345M parameters as
initialization. SOLOISTL consistently outperforms
SOLOIST by a large margin. This indicates that
a larger model is a better few-shot learner,
exhibiting stronger generalization ability with
limited in-domain data. We leave it to future work
to significantly scale up SOLOIST.

We conduct experiments that fine-tune SOLOIST
by varying the percentage of task-specific training
samples, ranging from 1% (80 examples) to
20% (1,600 examples), on the MultiWOZ dataset.
As shown in Table 7, SOLOIST consistently
outperforms DAMD for a wide range of dataset
sizes, and the improvement is more substantial
when smaller numbers of in-domain examples are
used for fine-tuning.


3.4 Machine Teaching Results

The machine teaching module of Conversation
Learner (CL) (Shukla et al., 2020) allows human
teachers (dialog authors) to select and visualize
dialogs, find potential problems, and provide
corrections or additional training samples to
improve the bot’s performance. We use CL to
evaluate the effectiveness of machine teaching
for task bot fine-tuning. In our experiment, we
first sample 10 dialogs from each domain to
fine-tune SOLOIST as described in Section 3.3. The
result is presented in the first row of Table 8.
We then deploy the model to interact with human
users via CL. The SOLOIST+Teach row shows
the result of machine teaching, where a human
teacher has manually corrected 5 dialogs, which
are recommended by CL using a ranking heuristic
based on perplexity. The corrections are utilized
to continually fine-tune the deployed system.

Table 8 shows that SOLOIST+Teach consistently
improves Combined by a large margin compared
with that without human teaching. SOLOIST+Extra
is used as an ablation baseline, where 5 randomly
selected dialogs with full annotations from
experts are added as extra examples to fine-tune
the model. It shows lower performance than
machine teaching. Figure 3 demonstrates the


Model              Banking77
                10      30     Full

BERT-Fixed     67.55   80.07   87.19
BERT-Tuned     83.42   90.03   93.66
USE            84.23   89.74   92.81
ConveRT        83.32   89.37   93.01
USE+ConveRT    85.19   90.57   93.36
SOLOIST        78.73   89.28   93.80

Table 9: Intent classification accuracy scores (5-
run average) on Banking77 with a varying number
of training examples (10 or 30 examples per
intent, and the full training set). The baseline
results are cited from Casanueva et al. (2020).

and updated during fine-tuning, respectively. A linear classifier with a softmax layer is added on top of BERT for classification. Universal Sentence Encoder and ConveRT are sentence encoders tailored for modeling sentence pairs, trained to optimize the conversational response selection task. The results in Table 9 show that SOLOIST is comparable with SoTA intent classification models. SOLOIST is the best performer when the full dataset is used for fine-tuning, but its performance deteriorates more quickly than USE+ConveRT when fewer samples are used for fine-tuning. It would be interesting to investigate whether incorporating intent classification tasks in task-grounded pre-training can boost SOLOIST's performance; we leave this to future work.

Slot Filling. We follow the experiment setting of
Coope et al. (2020) and formulate slot filling as a
turn-based span extraction problem. The results in
Mesa 10 show that SOLOIST performs significantly
better than the SoTA method Span-ConveRT, a
variant of ConveRT designed explicitly for slot
filling. The gap is wider when fewer examples are
used for training. Por ejemplo, cuando 64 muestras
are used for training, SOLOIST outperforms Span-
ConveRT by 20 points in F1 score.
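Turn-based span extraction reduces to picking the best-scoring (start, end) token pair; a minimal decoding sketch, with hypothetical logits for a party-size slot, looks like this:

```python
import numpy as np

def extract_span(start_logits, end_logits):
    """Pick the highest-scoring (start, end) pair with start <= end,
    the usual decoding rule for span-extraction slot filling."""
    n = len(start_logits)
    best, best_score = (0, 0), -np.inf
    for i in range(n):
        for j in range(i, n):
            score = start_logits[i] + end_logits[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

tokens = ["book", "a", "table", "for", "seven", "people"]
start = np.array([0.1, 0.0, 0.0, 0.2, 3.0, 0.5])
end   = np.array([0.0, 0.1, 0.0, 0.0, 2.5, 0.7])
i, j = extract_span(start, end)
# tokens[i:j+1] is the predicted value for, e.g., a party-size slot
```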

Dialog State Tracking. We compare the dialog
state tracking capability of SOLOIST with several
strong baselines on MultiWOZ 2.0 y 2.1. El
results in Table 11 show that SOLOIST achieves the best performance on MultiWOZ 2.1 and similar performance to DST-Picklist (Zhang et al., 2020a),
which requires pre-defined task ontology to guide
state tracking. In comparison with Simple-TOD
(Hosseini-Asl et al., 2020) that is based on GPT-2,

Cifra 3: Machine teaching performance of different
iterations in Restaurant domain. Machine teaching
with CL achieves near 1.5X efficiency gain (es decir., el
1st iteration used 15 dialogs while the 3rd iteration
tiene 25 dialogs) and boosts performance by 10 puntos
compared with that without teaching.

performance of SOLOIST in Restaurant by repeating the above machine teaching process over multiple iterations. We observe that in the second iteration of machine teaching, SOLOIST+Teach improves Combined by more than 8 points, while SOLOIST+Extra gains only 5 points. The result demonstrates the effectiveness of our two-step fine-tuning scheme for deploying SOLOIST to a new task (domain). In terms of machine teaching cost, taking the restaurant domain as an example, we assume that one slot-value pair of belief state correction counts as one edit and a response correction counts as ten edits. The total numbers of edits for SOLOIST+Teach and SOLOIST+Extra are 61 and 396, respectively, suggesting that machine teaching reduces the labeling cost by about 6×.
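Under the stated cost model, the reported totals imply roughly a 6× saving. A small sketch of the arithmetic; the per-type correction counts below are hypothetical breakdowns chosen only to reproduce the reported totals of 61 and 396 edits:

```python
def labeling_cost(belief_corrections, response_corrections,
                  slot_edit=1, response_edit=10):
    """Edit-cost model used above: one edit per corrected slot-value pair,
    ten edits per corrected response."""
    return belief_corrections * slot_edit + response_corrections * response_edit

# Hypothetical breakdowns that reproduce the reported totals.
teach = labeling_cost(belief_corrections=31, response_corrections=3)   # 61 edits
extra = labeling_cost(belief_corrections=96, response_corrections=30)  # 396 edits
ratio = extra / teach  # ~6.5, i.e., machine teaching cuts labeling cost ~6x
```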

3.5 Component-Wise Evaluation

This section evaluates SOLOIST on two NLU tasks
(es decir., intent classification and slot filling), the DST
task and the response generation task. We show
that although SOLOIST is an end-to-end dialog
modelo, it also performs well on these component
tareas.

Intent Classification. The task is to classify a user utterance into one of several pre-defined classes (intents). We follow the experiment setting of Casanueva et al. (2020). The last hidden state of SOLOIST is used as the sequence representation for classification. Several baseline methods are used for comparison. BERT-Fixed and BERT-Tuned are classifiers built on BERT, with BERT parameters kept fixed


yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
3
9
9
1
9
6
8
9
7
3

/

/
t

yo

a
C
_
a
_
0
0
3
9
9
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Fraction      SOLOIST  Span-ConveRT  V-CNN-CRF  Span-BERT
1 (8198)      0.98     0.96          0.94       0.93
1/2 (4099)    0.95     0.94          0.92       0.91
1/4 (2049)    0.93     0.91          0.89       0.88
1/8 (1024)    0.89     0.89          0.85       0.85
1/16 (512)    0.84     0.81          0.74       0.77
1/32 (256)    0.79     0.64          0.57       0.54
1/64 (128)    0.74     0.58          0.37       0.42
1/128 (64)    0.61     0.41          0.26       0.30

Table 10: Average F1 scores across all slots for Restaurant-8K with varying training set fractions. Numbers in parentheses represent training set sizes. The baseline results are quoted from Coope et al. (2020).

SOLOIST obtains 1.13% higher joint goal accu-
racy. We attribute the gain to the task-grounded
pre-training that equips SOLOIST with task comple-
tion skills including dialog state tracking.
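Joint goal accuracy, the dialog state tracking metric reported here, counts a turn as correct only if every slot of the predicted belief state matches the gold state exactly; a minimal sketch with toy states:

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns whose predicted belief state matches the gold
    state exactly across all slots (the standard MultiWOZ metric)."""
    assert len(predicted_states) == len(gold_states)
    correct = sum(p == g for p, g in zip(predicted_states, gold_states))
    return correct / len(gold_states)

# Toy 3-turn dialog: the last turn gets one slot wrong, so it counts as 0.
pred = [{"restaurant-area": "centre"},
        {"restaurant-area": "centre", "restaurant-food": "thai"},
        {"restaurant-area": "north",  "restaurant-food": "thai"}]
gold = [{"restaurant-area": "centre"},
        {"restaurant-area": "centre", "restaurant-food": "thai"},
        {"restaurant-area": "centre", "restaurant-food": "thai"}]
acc = joint_goal_accuracy(pred, gold)  # 2 of 3 turns exactly correct
```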

tarea,

En esto

Context-to-Response.
sistemas
need to generate responses given the ground-truth
belief state and DB search result (Wen et al.,
2017). The results on MultiWOZ 2.0 are shown in
Mesa 12. SOLOIST achieves the best performance
in terms of Inform and Success but performs
slightly worse in BLEU. The Combined score of
SOLOIST is comparable with the current SoTA
method DAMD. Sin embargo, DAMD uses the labels
of dialog act on both the user and system sides,
which demands significantly higher
labeling
efforts than SOLOIST for model training. HDSA
achieves the best BLEU score. Comparado con
HDSA, SOLOIST is much simpler and able to
perform better in terms of Combined. SOLOIST
outperforms ARDM in Combined. It is worth
mentioning that ARDM cannot perform dialog
state tracking and requires an extra dialog state
tracker to accomplish tasks. These results show
that SOLOIST can learn dialog policies accurately
and generate natural language responses in the
multi-domain scenario.
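The Combined score used throughout these comparisons is the standard MultiWOZ summary metric, Combined = (Inform + Success) / 2 + BLEU; plugging in SOLOIST's row of Table 12 recovers its reported score up to rounding:

```python
def combined_score(inform, success, bleu):
    """Combined = (Inform + Success) / 2 + BLEU, the summary metric used
    for MultiWOZ context-to-response and end-to-end evaluation."""
    return (inform + success) / 2 + bleu

# SOLOIST's row in Table 12: Inform 89.60, Success 79.30, BLEU 18.03.
soloist_combined = combined_score(89.60, 79.30, 18.03)  # ~102.48
```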

3.6 Human Evaluation Results

We conduct human evaluation to assess the quality of SOLOIST interacting with human users. Following the evaluation protocol in the DSTC8 Track 1 challenge (Kim et al., 2019), we host the SOLOIST model that performs best on the validation set in the MultiWOZ domain in the back-end as a bot service and crowdsource the work to Amazon Mechanical Turk. For each dialog session, we present Turks with a goal and instructions. Then Turks are required


Model                                   MWoz2.0  MWoz2.1
MDBT (Ramadan et al., 2018)             15.57    –
GLAD (Zhong et al., 2018)               35.57    –
GCE (Nouri and Hosseini-Asl, 2018)      36.27    –
FJST (Eric et al., 2020)                40.20    38.00
HyST (Goel et al., 2019)                44.24    –
SUMBT (Lee et al., 2019a)               46.65    –
TOD-BERT (Wu et al., 2020a)             –        48.00
Neural Reading (Gao et al., 2019b)      41.10    –
TRADE (Wu et al., 2019a)                48.62    45.60
COMER (Ren et al., 2019)                48.79    –
NADST (Le et al., 2020)                 50.52    49.04
DSTQA (Zhou and Small, 2019)            51.44    51.17
SOM-DST (Kim et al., 2020)              51.38    52.57
DST-Picklist (Zhang et al., 2020a)      53.30    –
MinTL (Lin et al., 2020)                52.10    53.62
SST (Chen et al., 2020)                 51.17    55.23
TripPy (Heck et al., 2020)              –        55.29
Simple-TOD (Hosseini-Asl et al., 2020)  –        55.72
SOLOIST                                 53.20    56.85

Table 11: Dialog state tracking results (joint goal accuracy ↑) on MultiWOZ 2.0 and 2.1.

to converse with SOLOIST to achieve the goal and judge the overall dialog experience at the end of a session using four metrics. (i) Success evaluates task completion. (ii) Under. (language understanding score), ranging from 1 (bad) to 5 (good), indicates the extent to which the system understands user inputs. (iii) Appr. (response appropriateness score), scaling from 1 (bad) to 5 (good), denotes whether the response is appropriate and human-like. (iv) Turns is the average number of turns in a dialog over all successful dialog sessions. Turks are further required to write down a justification for giving a specific rating. In total, 120 dialog sessions are gathered for analysis.
Mesa 13 shows the human assessment results
on MultiWOZ. The results are consistent with the
automatic evaluation. SOLOIST achieves substan-
tially better performance than other systems over
all the metrics. Además, SOLOIST outperforms
the DSTC8 Track 1 Winner by a much larger
margin in Success (+20 puntos)
in human
evaluation than that in automatic evaluation (+10
points in Table 3). We attribute this to the fact that
Turks use more diverse language to interact with
the target bots in interactive human evaluation
than that in the pre-collected MultiWOZ dataset
and the use of heterogeneous dialog data for
task-grounded pre-training makes SOLOIST a more
robust task bot than the others. In many test cases
against SOLOIST, Turks comment that they feel
like they are talking to a real person.

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
3
9
9
1
9
6
8
9
7
3

/

/
t

yo

a
C
_
a
_
0
0
3
9
9
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Model                                           Belief State  Inform ↑  Success ↑  BLEU ↑  Combined ↑
Baseline (Budzianowski et al., 2018)            ✓             71.29     60.94      18.80   84.93
TokenMoE (Pei et al., 2019)                     ✓             75.30     59.70      16.81   84.31
GPT fine-tuning (Budzianowski and Vulic, 2019)  ✓             70.96     61.36      19.05   85.21
Structured Fusion (Mehri et al., 2019b)         ✓             82.70     72.10      16.34   93.74
LaRL (Zhao et al., 2019)                        ✓             82.80     79.20      12.80   93.80
MD-Sequicity (Zhang et al., 2020b)              ✓             86.60     71.60      16.68   95.90
HDSA (Chen et al., 2019)                        ✓             82.90     68.90      23.60   99.50
ARDM (Wu et al., 2019b)                         –             87.40     72.80      20.60   100.70
DAMD (Zhang et al., 2020b)                      ✓             89.20     77.90      18.60   102.15
SOLOIST                                         ✓             89.60     79.30      18.03   102.49

Table 12: Context-to-response evaluation on MultiWOZ. The Belief State column marks models trained with belief state annotations.

Model                  Success ↑  Under. ↑  Appr. ↑  Turns ↓
SOLOIST                91.67      4.29      4.43     18.97
DSTC8 Track 1 Winner   68.32      4.15      4.29     19.51
DSTC8 2nd Place        65.81      3.54      3.63     15.48
DSTC8 3rd Place        65.09      3.54      3.84     13.88
DSTC8 Baseline         56.45      3.10      3.56     17.54

Table 13: Human evaluation results. The results except SOLOIST are quoted from Li et al. (2020b).

Cifra 4 depicts a dialog example where a user
interacts with SOLOIST to complete a multi-domain
tarea. The user starts the conversation by asking
for a recommendation of a museum in the center
of town. SOLOIST identifies the user intent, y
provides a recommendation based on the search
result from an attraction DB. Entonces, the user wants
to book a table in a restaurant in the same area.
We can see that through the conversation, SOLOIST
develops belief state, which can be viewed as the
system’s understanding of what the user needs
and what is available in the DB. Basado en el
belief state and DB state, SOLOIST picks the next
acción, either asking for clarification or providing
the user with information being requested. Este
example also demonstrates that SOLOIST is able to
deal with some NLU challenges displayed often
in human conversations, such as co-reference
resolution. Por ejemplo, SOLOIST understands that
the ‘‘same area’’ at Turn 5 refers to ‘‘centre of
town’’, and then identifies a proper entity from the
restaurant booking DB to make the reservation.
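The behavior in this example follows from SOLOIST's single-sequence formulation: each turn concatenates the dialog history, the belief state, and the DB state, and the response is generated conditioned on all three. A minimal serialization sketch; the segment markers and field names are illustrative, not the exact tokens used by the released model:

```python
def serialize_turn(history, belief_state, db_result, response):
    """Sketch of a single training/inference sequence: dialog history,
    belief state, and DB state are concatenated, and the model generates
    the delexicalized response conditioned on them."""
    belief = " ; ".join(f"{slot} = {value}" for slot, value in belief_state.items())
    return (f"history: {history} <EOS_U> "
            f"belief: {belief} <EOS_B> "
            f"db: {db_result} <EOS_DB> "
            f"response: {response} <EOS_S>")

# Toy turn mirroring the Figure 4 example (values are illustrative).
seq = serialize_turn(
    history="user: book a table at a thai restaurant in the same area",
    belief_state={"restaurant-area": "centre", "restaurant-food": "thai"},
    db_result="restaurant 1 match",
    response="[restaurant_name] is available; shall I book it?")
```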

4 Trabajo relacionado

Dialog Systems. Dialog systems are typically
grouped into two categories, task-oriented sys-

817

Cifra 4: An interactive example.

tems and social chatbots (p.ej., Chen et al., 2017;
Gao et al., 2019a; Roller et al., 2020a; Zhou y cols.,
2020). Recently many variants have been devel-
oped to extend the scope of dialog systems, incluir-
ing empathetic dialog systems (Ma et al., 2020;
Zhou y cols., 2018), chatbots for sentiment analysis
(Le et al., 2020C), dialog systems with common-
sense knowledge (Young et al., 2018; Shuster
et al., 2020), or using audio features (Young et al.,
2020). en este documento, we focus on end-to-end
dialog models for task-oriented systems.

Pre-Trained Language Models. Recent ad-
vances on self-supervised learning have witnessed
the blooming of large-scale pre-trained language
modelos (p.ej., Devlin et al., 2019; Radford et al.,
2019; Dong et al., 2019), which achieve
SoTA performance on a variety of language under-
standing and generation tasks. The closest to SOLOIST are GPT-2 (Radford et al., 2019) and

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
3
9
9
1
9
6
8
9
7
3

/

/
t

yo

a
C
_
a
_
0
0
3
9
9
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

its variants that ground language generation in prescribed control codes, such as CTRL (Keskar et al., 2019) and Grover (Zellers et al., 2019), or latent variables, such as Optimus (Li et al., 2020a).

Recientemente, pre-trained language models have
been adopted to develop task-oriented and chit-
chat dialog systems. To name a few examples of
chit-chat dialog systems: DialoGPT (Zhang et al.,
2020C), TransferTransfo (Wolf et al., 2019) y
CGRG (Wu et al., 2020b) adapt GPT-2 using
human conversational data for response genera-
ción. Plato (Bao et al., 2020) pre-trains a discrete
latent variable model for response generation.
Meena (Adiwardana et al., 2020) and BST (Roller
et al., 2020b) pre-train large models on conver-
sational data and have demonstrated expressive
performance in generating social chit-chat dialogs.
For task-oriented dialogs, Mehri et al. (2019a) explore different pre-training methods for dialog context representation learning. TOD-BERT (Wu et al., 2020a) adapts the pre-trained BERT
to achieve strong performance on four dialog
sub-tasks. ConveRT (Henderson et al., 2020)
pre-trains a model on Reddit data for intent clas-
sification and response selection. Span-ConveRT
(Coope et al., 2020) extends the framework to
entity extraction. SC-GPT (Peng et al., 2020b)
uses a pre-trained language model to convert a
dialog act to a natural language response. All these
works use the pre-training and fine-tuning frame-
trabajar. Sin embargo, they follow the modular archi-
tecture of task bots, and the pre-trained models
are used for improving individual dialog modules
such as NLU and DST. SOLOIST generalizes these
methods to the entire dialog pipeline, building an
end-to-end dialog system.

End-to-End Trainable Dialog Systems. El
end-to-end dialog systems based on neural
models have been studied in Wen et al. (2017); li
et al. (2017); Lei et al. (2018); Xu et al. (2019).
Although these methods have achieved promising
resultados, they are designed for specific domains,
rendering difficulties in generalizing to multi-
domains such as MultiWOZ. Dialog models that
can handle multi-domain tasks are studied in (Pei
et al., 2019; Budzianowski and Vuli´c, 2019; Mehri
et al., 2019b; Zhao et al., 2019; Wu et al., 2019b;
Zhang et al., 2020b; Peng et al., 2017). Sin embargo,
these works require large amounts of in-domain
labels to achieve good performance. A diferencia de,

SOLOIST can effectively adapt to a new task in the
few-shot fine-tuning settings.

The most related work to ours is Ham et al. (2020), which is the first attempt to fine-tune GPT-2 to build end-to-end dialog models. Hosseini-Asl et al. (2020) take a similar approach in work concurrent with SOLOIST. However, SOLOIST differs from these two methods in two major aspects. The first is the use of task-grounded pre-training that allows SOLOIST to learn primary task completion skills, such as tracking dialog states and selecting system actions. These skills can
be easily reused and adapted (p.ej., via few-shot
fine-tuning) to solve new dialog tasks, leading
to a much higher task success rate, as reported
en la sección 3. The second is that the annotation
cost required for training SOLOIST is much lower
than that of Ham et al. (2020) or Hosseini-Asl
et al. 2020. Training SOLOIST requires only belief
states as labels. But training of Ham et al. (2020)
and Hosseini-Asl et al. (2020) requires labeling
each dialog turn with dialog acts. Además,
while SOLOIST is end-to-end trainable, the other
two models are not and need heuristic rules to
handle different database search conditions.

5 Conclusión

SOLOIST is a method of building task bots
at scale with transfer
learning and machine
teaching. Unlike GPT-2, SOLOIST is pre-trained
in a task-grounded manner. So, it can generate
responses grounded in user goals and real-world
knowledge for
task completion. experimentos
show that SOLOIST creates new SoTA on two
popular task-oriented dialog benchmarks, y
that SOLOIST outperforms existing methods by a
large margin in the few-shot fine-tuning settings
where only a limited number of task labels are
available for fine-tuning.

We hope that SOLOIST can inspire dialog researchers and developers to comprehensively explore the new paradigm for building task bots based on task-grounded pre-training and fine-tuning via machine teaching, and to improve the recipe we present in this paper, namely: formulating task-oriented dialog as a single auto-regressive language model, pre-training a task-grounded response generation model on heterogeneous dialog corpora, and adapting the pre-trained model to new tasks by fine-tuning with a handful of task-specific examples via machine teaching.


yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
3
9
9
1
9
6
8
9
7
3

/

/
t

yo

a
C
_
a
_
0
0
3
9
9
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Referencias

Daniel Adiwardana, Minh-Thang

Luong,
David R. So, Jamie Hall, Noah Fiedel, Romal
Thoppilan, Zi Yang, Apoorv Kulshreshtha,
Gaurav Nemade, Yifeng Lu, and et al. 2020.
Towards a human-like open-domain chatbot.
arXiv preimpresión arXiv:2001.09977.

Siqi Bao, Huang He, Fan Wang, Hua Wu, y
Haifeng Wang. 2020. Plato: Pre-trained dia-
logue generation model with discrete latent
variable. In Proceedings of the 58th Annual
Meeting of the Association for Computational
Lingüística, pages 85–96. https://doi
.org/10.18653/v1/2020.acl-main.9

Tom Bocklisch, Joey Faulkner, Nick Pawlowski,
and Alan Nichol. 2017. Rasa: Open source
language understanding and dialogue manage-
mento. CORR, abs/1712.05181.

Paweł Budzianowski and Ivan Vulić. 2019. Hello, it's GPT-2 - How can I help you? Towards the use of pretrained language models for task-oriented dialogue systems. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 15–22. https://doi.org/10.18653/v1/D19-5602

Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic. 2018. MultiWOZ - A large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5016–5026. https://doi.org/10.18653/v1/D18-1547

Factura

Byrne,

Karthik

Krishnamoorthi,
Chinnadhurai Sankar, Arvind Neelakantan,
Ben Goodrich, Daniel Duckworth, Semih
Yavuz, Amit Dubey, Kyu-Young Kim, y
Andy Cedilnik. 2019. Taskmaster-1: Toward
a realistic and diverse dialog dataset.
En
el 2019 Conferencia sobre
Actas de
Empirical Methods
in Natural Language
Procesamiento y IX Conjunción Internacional
Conference on Natural Language Process-
En g (EMNLP-IJCNLP), pages 4506–4517.
https://doi.org/10.18653/v1/D19
-1459

I˜nigo Casanueva, Tadas Temˇcinas, Daniela Gerz,
Matthew Henderson, and Ivan Vuli´c. 2020.
Efficient intent detection with dual sentence
encoders. In Proceedings of the 2nd Workshop
on Natural Language Processing for Conver-
sational AI, páginas 38–45. https://doi.org
/10.18653/v1/2020.nlp4convai-1.5

Hongshen Chen, Xiaorui Liu, Dawei Yin, y
Jiliang Tang. 2017. A survey on dialogue sys-
tems: Recent advances and new frontiers. Acm
Sigkdd Explorations Newsletter, 19(2):25–35.
https://doi.org/10.1145/3166054
.3166058

Lu Chen, Boer Lv, Chi Wang, Su Zhu, Bowen Tan, and Kai Yu. 2020. Schema-guided multi-domain dialogue state tracking with graph attention neural networks. In AAAI, pages 7521–7528. https://doi.org/10.1609/aaai.v34i05.6250

Wenhu Chen, Jianshu Chen, Pengda Qin, Xifeng Yan, and William Yang Wang. 2019. Semantically conditioned dialog response generation via hierarchical disentangled self-attention. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3696–3709, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1360

Sam Coope, Tyler Farghly, Daniela Gerz, Ivan Vulic, and Matthew Henderson. 2020. Span-ConveRT: Few-shot span extraction for dialog with pretrained conversational representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5–10, 2020, pages 107–121. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.11

Jacob Devlin, Ming-Wei Chang, Kenton Lee, y
Kristina Toutanova. 2019. Bert: Pre-entrenamiento
de transformadores bidireccionales profundos para el lenguaje
comprensión. En Actas de la 2019 Estafa-
ference of the North American Chapter of the
Asociación de Lingüística Computacional: Hu-
man Language Technologies, Volumen 1 (Largo
and Short Papers), páginas 4171–4186.


yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
3
9
9
1
9
6
8
9
7
3

/

/
t

yo

a
C
_
a
_
0
0
3
9
9
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Li Dong, Nan Yang, Wenhui Wang, Furu Wei,
Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming
zhou, and Hsiao-Wuen Hon. 2019. Unified lan-
guage model pre-training for natural language
comprensión y generación. In Advances
en sistemas de procesamiento de información neuronal,
pages 13042–13054.

Mihail Eric, Rahul Goel, Shachi Paul, Abhishek
Sethi, Sanchit Agarwal, Shuyang Gao, Adarsh
Kumar, Anuj Goyal, Peter Ku, and Dilek
Hakkani-Tur. 2020. Multiwoz 2.1: A consol-
idated multi-domain dialogue dataset with state
corrections and state tracking baselines. En profesional-
ceedings of The 12th Language Resources and
Evaluation Conference, pages 422–428.

Jianfeng Gao, Michel Galley, and Lihong Li.
2019a. Neural approaches to conversational
AI. Foundations and Trends in Informa-
tion Retrieval, 13(2–3):127–298. https://
doi.org/10.1561/1500000074

Jianfeng Gao, Baolin Peng, Chunyuan Li,
Jinchao Li, Shahin Shayandeh, Lars Liden, y
Heung-Yeung Shum. 2020. Robust conversa-
tional AI with grounded text generation. CORR,
abs/2009.03457.

Shuyang Gao, Abhishek Sethi, Sanchit Agarwal,
Tagyoung Chung, and Dilek Hakkani-Tur.
2019b. Dialog state tracking: A neural reading
comprehension approach. En Actas de la
20th Annual SIGdial Meeting on Discourse and
Dialogue, pages 264–273.

Rahul Goel, Shachi Paul, and Dilek Hakkani-Tür. 2019. HyST: A hybrid approach for flexible and accurate dialogue state tracking. In Proceedings of Interspeech 2019, pages 1458–1462. https://doi.org/10.21437/Interspeech.2019-1863

Donghoon Ham, Jeong-Gwan Lee, Youngsoo
Jang, and Kee-Eung Kim. 2020. End-to-end
neural pipeline for goal-oriented dialogue sys-
tems using GPT-2. In Proceedings of the 58th
Annual Meeting of the Association for Com-
Lingüística putacional, pages 583–592.

Michael Heck, Carel van Niekerk, Nurul Lubis,
Christian Geishauser, Hsien-Chin Lin, Marco
Moresi, and Milica Gasic. 2020. Trippy: A
triple copy strategy for value independent neural


dialog state tracking. In Proceedings of the 21th
Annual Meeting of the Special Interest Group
on Discourse and Dialogue, pages 35–44.

En procedimientos de

Matthew Henderson, I˜nigo Casanueva, Nikola
Mrksic, Pei-Hao Su, Tsung-Hsien Wen, y
Ivan Vulic. 2020. Convert: Efficient and
accurate conversational representations from
el 2020
transformadores.
Jornada sobre Métodos Empíricos en Natural
Procesamiento del lenguaje: Findings, EMNLP
2020, Online Event, 16-20 Noviembre 2020,
pages 2161–2174. Asociación de Computación-
lingüística nacional. https://doi.org/10
.18653/v1/2020.findings-emnlp.196

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes,
and Yejin Choi. 2019. The curious case
of neural text degeneration. En internacional
Conferencia sobre Representaciones del Aprendizaje.

Ehsan Hosseini-Asl, Bryan McCann, Chien-Sheng Wu, Semih Yavuz, and Richard Socher. 2020. A simple language model for task-oriented dialogue. arXiv preprint arXiv:2005.00796.

Nitish Shirish Keskar, Bryan McCann, Lav R.
Varshney, Caiming Xiong, and Richard Socher.
2019. Ctrl: A conditional
transformer lan-
guage model for controllable generation. arXiv
preprint arXiv:1909.05858.

Seokhwan Kim, Michel Galley, Chulaka
Gunasekara, Sungjin Lee, Adam Atkinson,
Baolin Peng, Hannes Schulz, Jianfeng Gao,
Jinchao Li, Mahmoud Adada, Minlie Huang,
Luis Lastras, Jonathan K. Kummerfeld, walter
S. Lasecki, Chiori Hori, Anoop Cherian, Tim
k. Marks, Abhinav Rastogi, Xiaoxue Zang,
Srinivas Sunkara, and Raghav Gupta. 2019.
The eighth dialog system technology challenge.
arXiv preimpresión arXiv:1911.06394.

Sungdong Kim, Sohee Yang, Gyuwan Kim, y
Sang-Woo Lee. 2020. Efficient dialogue state
tracking by selectively overwriting memory.
In Proceedings of the 58th Annual Meeting of
la Asociación de Lingüística Computacional,
pages 567–582.

Diederik P. Kingma and Jimmy Ba. 2014. Adán:
A method for stochastic optimization. arXiv
preprint arXiv:1412.6980.

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
3
9
9
1
9
6
8
9
7
3

/

/
t

yo

a
C
_
a
_
0
0
3
9
9
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Hung Le, Richard Socher, and Steven C. h. Hoi.
2020. Non-autoregressive dialog state tracking.
arXiv preimpresión arXiv:2002.08024.

Hwaran Lee, Jinsik Lee, and Tae-Yoon Kim.
2019a. SUMBT: Slot-utterance matching for
universal and scalable belief tracking. En profesional-
cesiones de
the 57th Annual Meeting of
la Asociación de Lingüística Computacional,
pages 5478–5483.

Sungjin Lee, Qi Zhu, Ryuichi Takanobu, Zheng
zhang, Yaoqin Zhang, Xiang Li, Jinchao Li,
Baolin Peng, Xiujun Li, Minlie Huang, y
Jianfeng Gao. 2019b. ConvLab: Multi-domain
end-to-end dialog system platform. En profesional-
ceedings of the 57th Annual Meeting of the
Asociación de Lingüística Computacional:
Demostraciones del sistema, pages 64–69.

Él,

Ren,

Xiangnan

Wenqiang Lei, Xisen Jin, Min-Yen Kan,
y
Zhaochun
Dawei Yin. 2018. Sequicity: Simplifying
task-oriented dialogue systems with single
En profesional-
sequence-to-sequence architectures.
ceedings of the 56th Annual Meeting of the
Asociación de Lingüística Computacional
(Volumen 1: Artículos largos).

Chunyuan Li, Xiang Gao, Yuan Li, Baolin
Peng, Xiujun Li, Yizhe Zhang, and Jianfeng
gao. 2020a. Optimus: Organizing sentences
via pre-trained modeling of a latent space.
En procedimientos de
el 2020 Conferencia sobre
Métodos empíricos en Natural Language Pro-
cesando (EMNLP), pages 4678–4699, En línea.
Asociación de Lingüística Computacional.

Jinchao Li, Baolin Peng, Sungjin Lee, Jianfeng
gao, Ryuichi Takanobu, Qi Zhu, Minlie Huang,
Hannes Schulz, Adam Atkinson, and Mahmoud
Adada. 2020b. Results of the multi-domain
task-completion dialog challenge. En curso-
ings of the 34th AAAI Conference on Artificial
Inteligencia, Eighth Dialog System Technology
Challenge Workshop.

wei li, Wei Shao, Shaoxiong Ji, and Erik
Cambria. 2020C. BiERU: Bidirectional emo-
tional recurrent unit for conversational sen-
timent analysis. arXiv preimpresión arXiv:2006
.00492.

Xiujun Li, Yun-Nung Chen, Lihong Li, Jianfeng
gao, and Asli Celikyilmaz. 2017. End-to-end


task-completion neural dialogue systems. arXiv
preprint arXiv:1703.01008.

Zhaojiang Lin, Andrea Madotto, Genta Indra Winata, and Pascale Fung. 2020. MinTL: Minimalist transfer learning for task-oriented dialogue systems. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3391–3405.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,
mike lewis, Lucas Zettlemoyer, and Veselin
Stoyanov. 2019. RoBERTa: A robustly opti-
mized BERT pretraining approach. arXiv pre-
print arXiv:1907.11692.

Yukun Ma, Khanh Linh Nguyen, Frank Z.
Xing, and Erik Cambria. 2020. A survey
on empathetic dialogue systems. Información
Fusion, 64:50–70. https://doi.org/10
.1016/j.inffus.2020.06.011

Shikib Mehri, Evgeniia Razumovskaia, Tiancheng
zhao, and Maxine Eskenazi. 2019a. Pretrain-
ing methods for dialog context representation
aprendiendo. In Proceedings of the 57th Annual
Meeting of the Association for Computational
Lingüística, pages 3836–3845. https://
doi.org/10.18653/v1/P19-1373

Shikib Mehri, Tejas Srinivasan, and Maxine
Eskenazi. 2019b. Structured fusion networks
for dialog. In Proceedings of the 20th An-
nual SIGdial Meeting on Discourse and
Dialogue, pages 165–177. https://doi
.org/10.18653/v1/W19-5921

Elnaz Nouri and Ehsan Hosseini-Asl. 2018. Toward scalable neural dialogue state tracking model. arXiv preprint arXiv:1812.00899.

Jiahuan Pei, Pengjie Ren, and Maarten de Rijke.
2019. A modular task-oriented dialogue sys-
tem using a neural mixture-of-experts. arXiv
preprint arXiv:1907.05346.

Baolin Peng, Chunyuan Li, Zhu Zhang,
Chenguang Zhu, Jinchao Li, and Jianfeng Gao.
2020a. RADDLE: An evaluation benchmark
and analysis platform for robust task-oriented
dialog systems. CORR, abs/2012.14666.

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
3
9
9
1
9
6
8
9
7
3

/

/
t

yo

a
C
_
a
_
0
0
3
9
9
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Baolin Peng, Xiujun Li, Lihong Li, Jianfeng Gao,
Asli Celikyilmaz, Sungjin Lee, and Kam-Fai
Wong. 2017. Composite task-completion dia-
logue policy learning via hierarchical deep re-
aprendizaje por refuerzo. En Actas de la
2017 Conference on Empirical Methods in Nat-
ural Language Processing, pages 2231–2240.
https://doi.org/10.18653/v1/D17
-1237

Baolin Peng, Chenguang Zhu, Chunyuan Li,
Xiujun Li, Jinchao Li, Michael Zeng, y
Jianfeng Gao. 2020b. Few-shot natural lan-
guage generation for task-oriented dialog. En
Hallazgos de la Asociación de Computación
Lingüística: EMNLP 2020, pages 172–182,
En línea. Asociación de Lin Computacional-
guísticos. https://doi.org/10.18653/v1
/2020.findings-emnlp.17

Shuke Peng, Xinjing Huang, Zehao Lin, Feng Ji, Haiqing Chen, and Yin Zhang. 2019. Teacher-student framework enhanced multi-domain dialogue generation. arXiv preprint arXiv:1908.07137.

Alec Radford, Jeffrey Wu, niño rewon, David
Luan, Dario Amodei, and Ilya Sutskever. 2019.
Language models are unsupervised multitask
learners.

Osman Ramadan, Paweł Budzianowski, y
Milica Gasic. 2018. Large-scale multi-domain
belief tracking with knowledge sharing. En
Proceedings of the 56th Annual Meeting of
la Asociación de Lingüística Computacional
(Volumen 2: Artículos breves), pages 432–437.
https://doi.org/10.18653/v1/P18
-2069

Abhinav Rastogi, Xiaoxue Zang, Srinivas
Sunkara, Raghav Gupta, and Pranav Khaitan.
2020. Towards scalable multi-domain conver-
sational agents: The schema-guided dialogue
conjunto de datos. In Proceedings of the AAAI Confer-
ence on Artificial Intelligence, volumen 34,
pages 8689–8696. https://doi.org/10
.1609/aaai.v34i05.6394

Liliang Ren, Jianmo Ni, and Julian McAuley.
2019. Scalable and accurate dialogue state
tracking via hierarchical sequence genera-
ción. En Actas de la 2019 Conferencia
sobre métodos empíricos en lenguaje natural


Procesamiento y IX Conjunción Internacional
Conferencia sobre procesamiento del lenguaje natural
(EMNLP-IJCNLP), pages 1876–1885.

Stephen Roller, Y-Lan Boureau, Jason Weston,
Antonio Bordes, Emily Dinan, Angela Fan,
David Gunning, Da Ju, Margaret Li, Spencer
Poff, Pratik Ringshia, Kurt Shuster, eric
Michael Smith, Arthur Szlam, Jack Urbanek,
and Mary Williamson. 2020a. Open-domain
conversational agents: Current progress, abierto
problemas, and future directions. CORR, abs
/2006.12442.

Stephen Roller, Emily Dinan, Naman Goyal,
Da Ju, Mary Williamson, Yinhan Liu, Jing Xu,
Myle Ott, Kurt Shuster, Eric M. Herrero, Y-Lan
bureau, y Jason Weston. 2020b. Recipes
for building an open-domain chatbot. arXiv
preprint arXiv:2004.13637.

Rico Sennrich, Barry Haddow, and Alexandra
Birch. 2016. Neural machine translation of rare
words with subword units. En procedimientos de
the 54th Annual Meeting of the Association
para Lingüística Computacional (Volumen 1: Largo
Documentos), pages 1715–1725.

Swadheen Shukla, Lars Liden, Shahin Shayandeh,
Eslam Kamal, Jinchao Li, Matt Mazzola,
Thomas Park, Baolin Peng, and Jianfeng
gao. 2020. Conversation learner-a machine
teaching tool for building dialog managers for
task-oriented dialog systems. En procedimientos
of the 58th Annual Meeting of the Associa-
ción para la Lingüística Computacional: Sistema
Demonstrations, pages 343–349. https://
doi.org/10.18653/v1/2020.acl
-demos.39

Kurt Shuster, Da Ju, Stephen Roller, Emily
Dinan, Y-Lan Boureau, y Jason Weston.
2020. The dialogue dodecathlon: Open-domain
knowledge and image grounded conversational
agents. En procedimientos de
the 58th Annual
Meeting of the Association for Computational
Lingüística, LCA 2020, En línea, July 5–10,
2020, pages 2453–2470. Asociación para Com-
Lingüística putacional. https://doi.org
/10.18653/v1/2020.acl-main.222

Patrice Y. Simard, Saleema Amershi, David
Maxwell Chickering, Alicia Edelman Pelton,
Soroush Ghorashi, Christopher Meek,
Gonzalo Ramos, Jina Suh, Johan Verwey,
Mo Wang, and John Wernsing. 2017. Machine
teaching: A new paradigm for building machine
learning systems. CoRR, abs/1707.06742.

Tsung-Hsien Wen, David Vandyke, Nikola
Mrkšić, Milica Gasic, Lina M. Rojas Barahona,
Pei-Hao Su, Stefan Ultes, and Steve Young.
2017. A network-based end-to-end trainable
task-oriented dialogue system. In Proceedings
of the 15th Conference of the European
Chapter of the Association for Computational
Linguistics: Volume 1, Long Papers,
pages 438–449.

Jason D. williams, Kavosh Asadi Atui, y
Geoffrey Zweig. 2017. Hybrid code networks:
Practical and efficient end-to-end dialog control
with supervised and reinforcement learning.
In Proceedings of the 55th Annual Meeting of
la Asociación de Lingüística Computacional
(Volumen 1: Artículos largos), pages 665–677.
https://doi.org/10.18653/v1/P17
-1062

Jason D. Williams and Lars Liden. 2017. Demon-
stration of interactive teaching for end-to-end
dialog control with hybrid code networks. En
Proceedings of the 18th Annual SIGdial Meet-
ing on Discourse and Dialogue, pages 82–85.
https://doi.org/10.18653/v1/W17
-5511

Tomás Lobo, Debut de Lysandre, Víctor Sanh,
Julien Chaumond, Clemente Delangue, Antonio
moi, Pierric Cistac, Tim Rault, Remi Louf,
Morgan Funtowicz, Joe Davison, Sam Shleifer,
Patrick von Platen, Clara Ma, Yacine Jernite,
Julien Plu, Canwen Xu, Teven Le Scao, Sylvain
Gugger, Mariama Drama, Quentin Lhoest,
y Alejandro Rush. 2020. transformadores:
State-of-the-art natural language processing. En
Actas de la 2020 Conferencia sobre el Imperio-
Métodos icales en el procesamiento del lenguaje natural:
Demostraciones del sistema, páginas 38–45, En línea.
Asociación de Lingüística Computacional.
https://doi.org/10.18653/v1/2020
.emnlp-demos.6

Chien-Sheng Wu, Steven C. h. Hoi, Ricardo
Socher, and Caiming Xiong. 2020a. TOD-BERT:
Pre-trained natural
language understanding
for task-oriented dialogue. En procedimientos de
el 2020 Conferencia sobre métodos empíricos
en procesamiento del lenguaje natural (EMNLP),
pages 917–929.

Chien-Sheng Wu, Andrea Madotto, Ehsan
Hosseini-Asl, Caiming Xiong, Richard Socher,
and Pascale Fung. 2019a. Transferable
multi-domain state generator for task-oriented
dialogue systems. In Proceedings of the
57th Annual Meeting of the Association for
Computational Linguistics, pages 808–819.

Qingyang Wu, Yichi Zhang, Yu Li, and Zhou Yu.
2019b. Alternating recurrent dialog model with
large-scale pre-trained language models. arXiv
preprint arXiv:1910.03756.

Zeqiu Wu, Michel Galley, Chris Brockett,
Yizhe Zhang, Xiang Gao, Chris Quirk, Rik
Koncel-Kedziorski, Jianfeng Gao, Hannaneh
Hajishirzi, Mari Ostendorf, and Bill Dolan.
2020b. A controllable model of grounded
response generation. CORR, abs/2005.00613.

Haotian Xu, Haiyun Peng, Haoran Xie, Erik
Cambria, Liuyang Zhou, and Weiguo Zheng.
2019. End-to-end latent-variable task-oriented
dialogue system with exact
log-likelihood
optimization. World Wide Web, pages 1–14.

Steve J. Joven, Milica Gasic, Blaise Thomson,
and Jason D. williams. 2013. POMDP-based
statistical spoken dialog systems: A review.
Actas de
IEEE, 101(5):1160–1179.
https://doi.org/10.1109/JPROC
.2012.2225812

Tom Young, Erik Cambria, Iti Chaturvedi, hao
zhou, Subham Biswas, and Minlie Huang.
2018. Augmenting end-to-end dialogue systems
with commonsense knowledge. En procedimientos
of the Thirty-Second AAAI Conference on Ar-
tificial Intelligence, pages 4970–4977. AAAI
Prensa.

Tomás Lobo, Víctor Sanh, Julien Chaumond,
and Clement Delangue. 2019. Transfertransfo:
A transfer learning approach for neural net-
work based conversational agents. CORR,
abs/1901.08149.

Tom Young, Vlad Pandelea, Soujanya Poria, y
Erik Cambria. 2020. Dialogue systems with
audio context. Neurocomputing, 388:102–109.
https://doi.org/10.1016/j.neucom
.2019.12.126

Rowan Zellers, Ari Holtzman, Hannah Rashkin,
Yonatan Bisk, Ali Farhadi, Franziska Roesner,
and Yejin Choi. 2019. Defending against neural
fake news. In Advances in Neural Information
Sistemas de procesamiento.

Jianguo Zhang, Kazuma Hashimoto, Chien-Sheng
Wu, Yao Wang, S. Yu Philip, Richard Socher,
and Caiming Xiong. 2020a. Find or classify?
Dual strategy for slot-value predictions on
multi-domain dialog state tracking. In
Proceedings of the Ninth Joint Conference on
Lexical and Computational Semantics,
pages 154–167.

Yichi Zhang, Zhijian Ou, and Zhou Yu. 2020b.
Task-oriented dialog systems that consider mul-
tiple appropriate responses under the same
contexto. En procedimientos de
the AAAI Con-
ference on Artificial Intelligence, volumen 34,
pages 9604–9611.

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun
Chen, Chris Brockett, Xiang Gao, Jianfeng
Gao, Jingjing Liu, and Bill Dolan. 2020c.
DIALOGPT: Large-scale generative pre-training
for conversational response generation. In
Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics:
System Demonstrations, pages 270–278, Online.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020
.acl-demos.30

Tiancheng Zhao and Maxine Eskenazi. 2016.
Towards end-to-end learning for dialog state
tracking and management using deep rein-
forcement learning. In Proceedings of the 17th
Annual Meeting of the Special Interest Group
on Discourse and Dialogue, pages 1–10.

reforzamiento

Tiancheng Zhao, Kaige Xie,

and Maxine
Eskenazi. 2019. Rethinking action spaces
para
learning in end-to-end
dialog agents with latent variable models. En
Actas de la 2019 Conference of the
North American Chapter of the Association

para Lingüística Computacional: Human Lan-
guage Technologies, Volumen 1 (Long and Short
Documentos), pages 1208–1218.

Victor Zhong, Caiming Xiong, y ricardo
Socher. 2018. Global-locally self-attentive
encoder for dialogue state tracking. En profesional-
ceedings of the 56th Annual Meeting of the
Asociación de Lingüística Computacional
(Volumen 1: Artículos largos), pages 1458–1467.
https://doi.org/10.18653/v1/P18
-1135

Hao Zhou, Minlie Huang, Tianyang Zhang,
Xiaoyan Zhu, and Bing Liu. 2018. Emotional
chatting machine: Emotional conversation gen-
eration with internal and external memory.
In Proceedings of the AAAI Conference on
Artificial Intelligence, volumen 32.

Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung
Shum. 2020. The design and implementation of
XiaoIce, an empathetic social chatbot.
Computational Linguistics, 46(1):53–93.
https://doi.org/10.1162/coli_a_00368

Li Zhou and Kevin Small. 2019. Multi-domain
dialogue state tracking as dynamic knowl-
edge graph enhanced question answering. arXiv
preprint arXiv:1911.06192.

Qi Zhu, Zheng Zhang, Yan Fang, Xiang Li,
Ryuichi Takanobu, Jinchao Li, Baolin Peng,
Jianfeng Gao, Xiaoyan Zhu, and Minlie
Huang. 2020. ConvLab-2: An open-source
toolkit for building, evaluating, and diagnos-
ing dialogue systems. En Actas de la
58ª Reunión Anual de la Asociación de
Ligüística computacional: System Demonstra-
ciones, pages 142–149, En línea. Asociación para
Ligüística computacional. https://doi
.org/10.18653/v1/2020.acl-demos.19

Xiaojin Zhu. 2015. Machine teaching: An inverse
problem to machine learning and an approach
toward optimal education. In Twenty-Ninth
Conferencia AAAI sobre Inteligencia Artificial.
