Pretraining the Noisy Channel Model for Task-Oriented Dialogue
Qi Liu2∗, Lei Yu1, Laura Rimell1, and Phil Blunsom1,2
1DeepMind, United Kingdom 2University of Oxford, United Kingdom
qi.liu@cs.ox.ac.uk
{leiyu,laurarimell,pblunsom}@google.com
Abstract
Direct decoding for task-oriented dialogue is
known to suffer from the explaining-away ef-
fect, manifested in models that prefer short
and generic responses. Here we argue for the
use of Bayes’ theorem to factorize the dialogue
task into two models, the distribution of the
context given the response, and the prior for
the response itself. This approach, an instan-
tiation of the noisy channel model, both miti-
gates the explaining-away effect and allows
the principled incorporation of large pretrained
models for the response prior. We present ex-
tensive experiments showing that a noisy chan-
nel model decodes better responses compared
to direct decoding and that a two-stage pre-
training strategy, employing both open-domain
and task-oriented dialogue data, improves over
randomly initialized models.
1 Introduction
Task-oriented dialogue agents provide a conver-
sational interface to assist users in accomplish-
ing specific goals, such as finding a restaurant
or booking a hotel (Seneff and Polifroni, 2000;
Raux et al., 2005; Budzianowski et al., 2018;
Peng et al., 2020a). Increasing demand from indus-
try for natural language assistants and scalable cus-
tomer service solutions has recently been driving
a renaissance in the development of task-oriented
dialogue models. In addition, the specification
of explicit dialogue agent goals, afforded by the
task-oriented paradigm, makes such research eas-
ier to ground and evaluate than open-domain
chatbots.
Current research on task-oriented dialogue is
dominated by monolithic sequence-to-sequence
models that directly parameterize the conditional
distribution of the response given the prior dialogue
context. However, this monolithic approach
conflates the task-specific and language-general
aspects of dialogue, and adversely favors short
and generic responses (Bao et al., 2020) due to
the explaining-away effect (Klein and Manning,
2002).
∗Work completed during an internship at DeepMind.
Here we pursue an alternative to the direct
model. Using Bayes’ rule allows us to factorize
the probability of the response given the context
p(R| C) into a language model p(R) and a
context model p(C| R).1 Within natural language
processing (NLP), this approach is traditionally
known as the noisy channel model (Shannon,
1948), and has recently seen renewed interest
with its successful application to neural machine
translation (Yu et al., 2017, 2020; Yee et al., 2019).
We hypothesize that the noisy channel reformu-
lation is advantageous for dialogue because the
factorization enables each sub-module to special-
ize in a dialogue sub-task. In particular, the con-
text conditional model can help to discount short
and generic responses and mitigate the explaining-
away effect, while the language model helps
ensure that responses are natural. We find that a
noisy channel model with the same number of
parameters as a direct model achieves better ac-
curacy on three task-oriented dialogue datasets.
Moreover, a larger noisy channel model can
be trained with the same hardware, by training
the sub-modules separately, yielding additional
improvements.
It has become common in recent years to pre-
train dialogue models on large text data, either
general text (Peng et al., 2020b; Budzianowski
and Vulić, 2019; Wu et al., 2020a) or dialogue-structured
data (Roller et al., 2020; Adiwardana
et al., 2020), such as tweets and Reddit posts. We
utilise a similar strategy with Reddit data and find
that the benefits of pretraining to the noisy channel
model are similar to those for the direct model.
Further, we evaluate transfer across task-oriented
dialogue datasets by implementing a second pretraining
stage using Taskmaster (Byrne et al.,
2019) and Schema-Guided Dialogue (Rastogi
et al., 2020) as training data, before fine-tuning
on our final tasks.
1Here we abstract away from the prediction of belief states
and dialogue acts, which also form part of our generative
model; see Section 3 for details.
Transactions of the Association for Computational Linguistics, vol. 9, pp. 657–674, 2021. https://doi.org/10.1162/tacl_a_00390
Action Editor: Wenjie (Maggie) Li. Submission batch: 2/2021; Revision batch: 2/2021; Published 7/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Figure 1: The data flow of one turn in a task-oriented dialogue for train booking from MultiWOZ.
We evaluate the algorithm on three datasets,
MultiWOZ 2.0 (Budzianowski et al., 2018), Cam-
Rest676 (Wen et al., 2017a), and SMCalFlow
(Andreas et al., 2020), demonstrating that the
noisy channel approach is robust to different dia-
logue schema annotations used across datasets.
Further analysis demonstrates that the noisy chan-
nel models can decode responses with similar
lengths and Zipf scores compared to ground-truth
responses and reduce the likelihood of falling into
repetition loops (Holtzman et al., 2019).
2 A Seq-to-Seq Dialogue Model
In this section, we introduce a discriminative
sequence-to-sequence model for task-oriented
dialogue. The traditional sequence of steps needed
to produce a system turn in a task-directed dia-
logue is shown in Figure 1, with an example from
MultiWOZ 2.0 (Budzianowski et al., 2018).
Given a dialogue context containing previous user
and system utterances, the dialogue system first
predicts a belief state, consisting of a set of slot-
value pairs (e.g., destination: Cambridge),
to capture user intent. To ground the system with
external information, the belief state can be con-
verted into a database query in order to retrieve
relevant information, such as the number of
matches and booking information. Next, the system
predicts a set of dialogue acts, representing
the abstract meaning of the proposed dialogue
response (Austin, 1975). Finally, a delexicalized
dialogue response is generated, where slot values
are replaced by generic placeholders, such
as [value_time] for a train departure time, in
order to reduce lexical variation. The delexicalized
response can be converted to a lexicalized
response in post-processing by filling in the
slot values based on belief states and database
information.
We use the MultiWOZ schema for illustra-
tion in Sections 2 and 3, but our models easily
generalize to different schema annotations (e.g.,
datasets without annotated dialogue acts [Andreas
et al., 2020]).
Because it is well known that pipelined mod-
els tend to suffer from error propagation, many
NLP tasks have been reformulated in recent
years as end-to-end text-to-text transformations
(Raffel et al., 2020; Brown et al., 2020). State-of-the-art
task-oriented dialogue systems have
followed this approach (Hosseini-Asl et al., 2020;
Peng et al., 2020b). We represent the example
from Figure 1 as follows, serializing turns and
using special start and end tokens to encapsu-
late each data field:
Context: [c] I am looking to . . . [/u] What is your . . . [/r] I’ll be leaving . . . [/u] [/c]
Belief: [b] [train] destination Cambridge, day Tuesday, arrive 12:30, departure London [/b]
Database: [db] [train] match 1, status not booked [/db]
Act: [a] [train] inform arrive, inform leave, offer reservation [/a]
Response: [r] There is a train that leaves at [value_time] and arrives at [value_time]. Should I book it? [/r]
Given this text representation, the direct discri-
minative approach models p(B, A, R| C), where
C, B, A, and R represent dialogue context, belief
state, dialogue act, and delexicalized response,
respectively.2 We use the serialized text of the
dialogue context as input, and the concatenation
of belief state, dialogue act, and response as target
output, making the task amenable to the appli-
cation of an autoregressive sequence-to-sequence
model. B, A, and R can be generated sequentially
with direct decoding methods, such as greedy
decoding and beam search. We use a sequence-
to-sequence Transformer (Vaswani et al., 2017)
to implement p(B, A, R| C). This distribution will
also be used to build the noisy channel model in
Section 3.
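The serialization described in this section can be sketched as follows. This is a minimal illustration, assuming the token layout of the example above; the function names (`serialize_context`, `serialize_field`, `serialize_target`) and the dictionary-based belief representation are hypothetical and not taken from the original implementation. Following footnote 2, the database field is not part of the target.

```python
from typing import Dict, List, Tuple


def serialize_context(turns: List[Tuple[str, str]]) -> str:
    """Serialize (role, text) turns; role is "u" (user) or "r" (system).

    Each utterance is closed by its role's end token, as in the example.
    """
    parts = [f"{text} [/{role}]" for role, text in turns]
    return "[c] " + " ".join(parts) + " [/c]"


def serialize_field(tag: str, domain: str, items: List[str]) -> str:
    """Wrap a comma-separated item list in [tag] ... [/tag] with a domain marker."""
    return f"[{tag}] [{domain}] " + ", ".join(items) + f" [/{tag}]"


def serialize_target(domain: str, belief: Dict[str, str],
                     acts: List[str], response: str) -> str:
    """Concatenate belief state, dialogue acts, and delexicalized response."""
    belief_items = [f"{slot} {value}" for slot, value in belief.items()]
    return " ".join([
        serialize_field("b", domain, belief_items),
        serialize_field("a", domain, acts),
        f"[r] {response} [/r]",
    ])


source = serialize_context([
    ("u", "I am looking to book a train"),
    ("r", "What is your destination?"),
    ("u", "Cambridge"),
])
target = serialize_target(
    "train",
    {"destination": "Cambridge", "day": "Tuesday"},
    ["inform arrive", "offer reservation"],
    "There is a train. Should I book it?",
)
```

Here `source` plays the role of the model input C and `target` the concatenation of B, A, and R that an autoregressive sequence-to-sequence model would be trained to produce.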
3 Noisy Channel Model for Dialogue
While direct decoding is an effective approach for
decoding belief states (Hosseini-Asl et al., 2020),
it may be sub-optimal for generating responses.
First, it favors short and generic responses (Bao
et al., 2020). As a result, the decoded responses
are bland and lack diversity (Li et al., 2016).
Second, it suffers from the explaining-away effect
(Klein and Manning, 2002), where inputs are
‘‘explained-away’’ by highly predictive output
prefixes. For example, if there is one hotel match-
ing the user’s intent as encoded in the belief state,
the model is nevertheless prone to decoding ‘‘no’’
given the output prefix ‘‘there is’’, ignoring the
input information.
In this work, we propose using the neural noisy
channel model (Yu et al., 2017) to mitigate the
above problems for response generation. Given
an input sequence x and output sequence y,
the noisy channel formulation (Shannon, 1948)
uses Bayes’ rule to rewrite the model p(y|x) as
$$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)} \propto p(x \mid y)\, p(y).$$
It was originally applied
to speech recognition, where p(y|x) is a conditional
model of the source text given a noisy observation.
The channel model p(x|y) estimates
the probability of the observation given the
source, while p(y) is an unconditional language
model (or source model), which can be trained
on unpaired data. More recently it has been applied
to machine translation, where y is a translation
of input text x.
2We do not model the probabilities of database state or
lexicalized response, as these are deterministic given the
belief state and delexicalized response, respectively.
Abstracting away from belief states and dia-
logue acts, for task-oriented dialogue we want to
estimate p(R| C), the probability of a response
given a context. The channel model p(C| R),
given a response, predicts a distribution over con-
texts which might have elicited that response. The
source model p(R) is an unconditional language
model. In this extension of the noisy channel ap-
proach to task-oriented dialogue, the ‘‘channel’’
can be understood as connecting dialogue contexts
with suitable responses.
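To make the factorization concrete, the following toy sketch checks numerically that normalizing p(C| R) · p(R) recovers p(R| C). The joint distribution, the context labels, and the response strings are invented purely for illustration and do not come from any dataset used in this paper.

```python
# Hypothetical joint distribution p(C, R) over two contexts and three responses.
joint = {
    ("ask_time", "it leaves at nine"): 0.30,
    ("ask_time", "ok"): 0.20,
    ("ask_price", "it costs ten pounds"): 0.30,
    ("ask_price", "ok"): 0.20,
}
contexts = sorted({c for c, _ in joint})
responses = sorted({r for _, r in joint})


def p_joint(c, r):
    return joint.get((c, r), 0.0)


def source(r):
    """Source model p(R): marginal probability of a response."""
    return sum(p_joint(c, r) for c in contexts)


def channel(c, r):
    """Channel model p(C | R)."""
    return p_joint(c, r) / source(r)


def direct(r, c):
    """Direct model p(R | C), computed from the joint for reference."""
    return p_joint(c, r) / sum(p_joint(c, r2) for r2 in responses)


def noisy_channel_posterior(c):
    """Normalize p(C | R) * p(R) over responses to obtain p(R | C)."""
    unnormalized = {r: channel(c, r) * source(r) for r in responses}
    z = sum(unnormalized.values())
    return {r: v / z for r, v in unnormalized.items()}
```

In this toy example the generic response "ok" is compatible with both contexts, so its channel probability for either context is 0.5, whereas each specific response identifies its context with probability 1. This is the sense in which a channel model can discount generic responses.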
For the full task, we develop a noisy channel
model for p(B, A, R| C). Using the chain rule,
p(B, A, R| C) = p(B| C) · p(A, R| C, B). Follow-
ing Hosseini-Asl et al. (2020), we use the direct
model described in Section 2 to parameterize
p(B| C) and decode B, which our preliminary
experiments confirmed to be advantageous.
We use the noisy channel formulation to parameterize
p(A, R| C, B). Using Bayes’ rule,
p(A, R| C, B) ∝ p(C, B| A, R) · p(A, R). The
channel model p(C, B| A, R) and source model
p(A, R) are implemented as Transformers.
We choose to use the noisy channel formulation
for decoding A based on preliminary experiments
that showed improved overall accuracy over direct
decoding, possibly because poor dialogue act pre-
diction by the direct model led to worse quality
responses. The serialized text of A and R are
concatenated during training, and the decoded
sequence is split into A and R with the special
start/end tokens during decoding.
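Splitting a decoded sequence back into A and R can be sketched with regular expressions over the special tokens. This is only an illustration under the assumption that the decoded string uses the `[a] ... [/a] [r] ... [/r]` layout shown in Section 2; the function name `split_act_and_response` is hypothetical.

```python
import re
from typing import Optional, Tuple

# Non-greedy matches between the start and end tokens of each field.
ACT_PATTERN = re.compile(r"\[a\](.*?)\[/a\]", re.DOTALL)
RESPONSE_PATTERN = re.compile(r"\[r\](.*?)\[/r\]", re.DOTALL)


def split_act_and_response(decoded: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (act, response); a field is None if its tokens are incomplete."""
    act = ACT_PATTERN.search(decoded)
    response = RESPONSE_PATTERN.search(decoded)
    return (
        act.group(1).strip() if act else None,
        response.group(1).strip() if response else None,
    )


decoded = ("[a] [train] inform arrive, offer reservation [/a] "
           "[r] There is a train. Should I book it? [/r]")
act, response = split_act_and_response(decoded)
```

Returning `None` for an unterminated field is a design choice of this sketch, so that truncated outputs can be detected by the caller.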
We suggest that the noisy channel model has
three advantages over the direct model for re-
sponse generation: (1) The channel model can
penalize short and generic responses. Such re-
sponses can be mapped to a large number of con-
texts, resulting in a flat distribution over contexts.
This leads to a lower channel model score for
short and generic responses (Zhang et al., 2020b).
(2) The channel model ensures that (A, R) must
explain the corresponding (C, B), alleviating the
explaining-away effect (Yu et al., 2017). (3) The
source model, an unconditional distribution over
A and R, can make use of abundant non-dialogue
textual data for pretraining, further improving the
fluency of generated sequences (Brants et al.,
2007). We leave exploration of this last advantage
for future work, as we pretrain all sub-modules
with the same data.
3.1 Decoding
Because exact decoding from the noisy channel
model $\arg\max_{A,R}\, p(C, B \mid A, R) \cdot p(A, R)$³
is computationally intractable, we experiment
with two approximation methods, noisy channel
reranking and noisy channel online decoding.
Since these methods rely on p(A, R| C, B) as a
proposal distribution for approximation, and both
p(A, R| C, B) and p(B| C) are parameterized with
the direct model introduced in Section 2, our noisy
channel model therefore has three sub-modules:
a direct model p(B, A, R| C), a channel model
p(C, B| A, R), and a source model p(A, R).
Noisy Channel Reranking: Noisy channel
reranking first decodes B and then continues
decoding a list S of (A, R) pairs by beam
search with the direct model, prior to utilizing
the noisy channel model to rerank (A, R) pairs. In
particular, during beam search, partial sequences
are expanded and pruned with p(A, R| C, B) (from
the direct model in Section 2). The pairs after
decoding are reranked using the following model
combination:
$$(A', R') = \arg\max_{(A,R) \in S} \; \log p(A, R \mid C, B) + \lambda_1 \cdot \log p(C, B \mid A, R) + \lambda_2 \cdot \log p(A, R) + \lambda_3 \cdot |A, R|, \qquad (1)$$
where | A, R| denotes the length of (A, R), and
λ1, λ2 and λ3 are hyperparameters. Besides the
channel model p(C, B| A, R) and the source
model p(A, R), we additionally use the direct
model p(A, R| C, B) and a length bias | A, R| to
encourage responses with high direct model likeli-
hood and discourage short responses, respectively.
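The reranking rule in Eq. 1 can be sketched as below. The `Candidate` container and the log-probability values are hypothetical placeholders; in practice the three scores would come from the direct, channel, and source Transformers. The weights in the example are the MultiWOZ values reported in Section 5.3.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Candidate:
    tokens: Sequence[str]   # serialized (A, R) pair
    log_direct: float       # log p(A, R | C, B)
    log_channel: float      # log p(C, B | A, R)
    log_source: float       # log p(A, R)


def combined_score(c: Candidate, lam1: float, lam2: float, lam3: float) -> float:
    """Model combination of Eq. 1, with |A, R| taken as the token count."""
    return (c.log_direct
            + lam1 * c.log_channel
            + lam2 * c.log_source
            + lam3 * len(c.tokens))


def rerank(candidates: List[Candidate],
           lam1: float, lam2: float, lam3: float) -> Candidate:
    """Return the candidate from the beam list S with the highest Eq. 1 score."""
    return max(candidates, key=lambda c: combined_score(c, lam1, lam2, lam3))


# Invented scores: a short generic candidate and a longer specific one.
generic = Candidate(["[r]", "ok", "[/r]"], -1.0, -9.0, -2.0)
specific = Candidate(["[r]", "it", "leaves", "at", "[value_time]", ".", "[/r]", ""],
                     -2.5, -4.0, -6.0)
best = rerank([generic, specific], lam1=0.8, lam2=1.0, lam3=0.8)
```

With these invented numbers the direct model alone prefers the generic candidate, while the combination prefers the specific one because the channel and length terms outweigh the difference in direct likelihood.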
Noisy Channel Online Decoding: In contrast
to reranking, online decoding applies the noisy
channel model during beam search for pruning
partial sequences, thus exploring a larger search
space.
3Although exact decoding is also computationally intractable
for the direct model, approximating arg max_B p(B| C)
is well-studied, e.g., beam search. The decoding for B is
therefore omitted here.
Algorithm 1: Online decoding for the noisy channel.
Input: Context C
Output: Belief, act and response (B, A, R)
Decode B given C with p(B| C)
Beam: S = {([a])}
while end(S) is False do
    S′ = ∅
    for O in S do
        if O.last() is [/r] or |O| > l then
            S′.add(O)
            continue
        end
        Get k1 tokens o_1, . . . , o_{k1} from the direct model p(O_{|O|+1} | C, B, O)
        for o_i in (o_1, . . . , o_{k1}) do
            S′.add((O, o_i))
        end
    end
    S = top-k2 over O ∈ S′ of log p(O| C, B) + λ1 · log p(C, B| O) + λ2 · log p(O) + λ3 · |O|
end
Select O ∈ S with the largest score using Eq. 1 and return (B, A, R)
As shown in Algorithm 1, we first decode the
belief state with p(B| C), which comes from the
direct model in Section 2. Then, starting with
a beam S containing a single sequence [a] (the
dialogue act start token), we continuously expand
the sequences in S until end(S) is met, namely,
all sequences in S either end with [/r] or have
lengths larger than l. In each iteration, we first
expand the sequences in the beam, then prune
the expanded beam. To expand a partial act and
response sequence (denoted as O in Algorithm 1),
a naive way is to use the noisy channel model
to score |V | (the vocabulary size) possible ex-
pansions, which is computationally expensive.
Instead, we use the probability of the next token
p(O_{|O|+1} | C, B, O) (where |O| denotes the length
of O) to select k1 candidates to be scored by the
noisy channel model. This next token probability
is from the direct model introduced in Section 2.
One straightforward way to select k1 expansions
from p(O_{|O|+1} | C, B, O) is using the top-k maximization,
but we can also take advantage of the
advances in sampling from a categorical distribution
for text generation (e.g., top-k sampling
[Fan et al., 2018] and nucleus sampling [Holtzman
et al., 2019]). After the expansion, we prune
the expanded beam S′ to obtain a smaller beam
with k2 partial sequences based on the model
combination in Eq. 1. Compared to noisy chan-
nel reranking, online decoding applies the noisy
channel model during beam search, which is
potentially less biased towards the direct model.
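The expand-and-prune loop of online decoding can be sketched as follows. This is a simplified illustration, not the original implementation: `propose` stands in for selecting k1 next tokens with the direct model, `score` stands in for Eq. 1 applied to a partial sequence, and both are hypothetical callables supplied by the caller. Decoding of B is omitted, and `propose` is assumed to return at least one token.

```python
from typing import Callable, List, Tuple

Tokens = Tuple[str, ...]


def online_decode(propose: Callable[[Tokens], List[str]],
                  score: Callable[[Tokens], float],
                  k2: int,
                  max_len: int,
                  start: str = "[a]",
                  end: str = "[/r]") -> Tokens:
    """Beam search that prunes partial sequences with a combined score."""

    def finished(seq: Tokens) -> bool:
        return seq[-1] == end or len(seq) > max_len

    beam: List[Tokens] = [(start,)]
    while not all(finished(seq) for seq in beam):
        expanded: List[Tokens] = []
        for seq in beam:
            if finished(seq):
                expanded.append(seq)      # carry finished sequences over
                continue
            for token in propose(seq):    # k1 candidate next tokens
                expanded.append(seq + (token,))
        if not expanded:
            raise ValueError("propose() returned no candidates")
        # Prune the expanded beam to the k2 best partial sequences.
        beam = sorted(expanded, key=score, reverse=True)[:k2]
    return max(beam, key=score)


# Toy stand-ins for the models (invented for illustration only).
NEXT = {"[a]": ["inform", "request"], "inform": ["[/a]"], "request": ["[/a]"],
        "[/a]": ["[r]"], "[r]": ["ok", "train"], "ok": ["[/r]"], "train": ["[/r]"]}
TOKEN_SCORE = {"request": -1.0, "ok": -2.0, "train": -0.5}

best = online_decode(
    propose=lambda seq: NEXT[seq[-1]],
    score=lambda seq: sum(TOKEN_SCORE.get(t, 0.0) for t in seq),
    k2=2,
    max_len=10,
)
```

Because the same `score` function is used for pruning and for the final selection, the sketch mirrors the use of Eq. 1 in both places of Algorithm 1.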
In summary, we note that beam search for
both the direct model and the online decoding
for our noisy channel model decodes (B, A, R)
autoregressively. Thus both approaches are end-
to-end models for task-oriented dialogue. The key
difference is that noisy channel online decoding
uses Eq. 1 for pruning, while the direct model
uses p(A, R| C, B).
4 Model and Pretraining
We use three Transformer (Vaswani et al., 2017)
networks to parameterize the direct model p(B,
A, R| C), the channel model p(C, B| A, R) and
the source model p(A, R), respectively. The input
to each Transformer is the sum of four embed-
dings: word embeddings, position embeddings,
role embeddings (user/system), and turn embed-
dings (each word corresponds to a turn number).
Cross entropy is used as the loss function.
Given training samples (C, B, A, R), if we train
the channel model using complete (A, R) pairs
as input, a significant discrepancy arises between
training and decoding for noisy channel online
decoding. Since the channel model is used to score
partial act and response pairs, that is, p(C, B| O)
in Algorithm 1, the channel model trained with
complete (A, R) pairs is unsuited to scoring par-
tial sequences. In order to manually create partial
sequences during training that are better matched
for online decoding, we truncate the (A, R) pairs
with a truncation length uniformly sampled from
1 to the sequence length (inclusive). The direct
model and the source model are trained with
complete sequences, as partial sequences occur
naturally in their standard autoregressive training
procedure.
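The truncation used to train the channel model on partial sequences can be sketched as below, assuming the (A, R) pair is given as a non-empty token list. The function name and the optional random generator argument are hypothetical conveniences of this sketch.

```python
import random
from typing import List, Optional, Sequence


def truncate_for_channel(act_response: Sequence[str],
                         rng: Optional[random.Random] = None) -> List[str]:
    """Keep a prefix whose length is uniform on {1, ..., len(act_response)}."""
    generator = rng if rng is not None else random
    # randint is inclusive at both ends, matching "1 to the sequence length".
    cut = generator.randint(1, len(act_response))
    return list(act_response[:cut])


tokens = ["[a]", "inform", "[/a]", "[r]", "ok", "[/r]"]
prefix = truncate_for_channel(tokens, random.Random(0))
```

Each training example then pairs the unchanged (C, B) target with a freshly sampled prefix of (A, R), so that the channel model sees inputs resembling the partial sequences O scored during online decoding.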
As in-domain dialogue data are usually scarce,
we use a two-stage pretraining strategy to enhance
the noisy channel model. Although the effec-
tiveness of pretraining with Reddit data has
been validated for open-domain dialogue (Zhang
et al., 2020b; Bao et al., 2019; Adiwardana et al.,
2020), relatively little work has applied such data
to task-oriented dialogue.4 In the first stage, we
explore Reddit pretraining (where the Reddit
data is pre-processed into (C, R), i.e., context-response,
pairs as described below). In the second
stage, we use two task-oriented dialogue datasets,
Taskmaster5 (Byrne et al., 2019) and Schema-Guided
Dialogue6 (Rastogi et al., 2020), to
specialize the Reddit-pretrained models. Because
the Reddit data consists of open-domain-style dia-
logues (where belief states and dialogue acts are
missing), pretraining on these datasets can famil-
iarize the models with the sequence-to-sequence
representation of task-oriented dialogue. Three
models, a context-to-response model, a response-
to-context model and a response language model,
are pretrained to initialize the direct model, the
channel model and the source model, respectively.
4.1 Implementation Details
Models: All models are implemented with JAX
(Bradbury et al., 2018) and Haiku (Hennigan
et al., 2020). For the direct model introduced in
Section 2, we use a Transformer model with
hidden size 512, 12 encoder-decoder layers, and
16 self-attention heads. The model has 114M pa-
rameters. For the noisy channel model, we use a
base setting and a large setting. The base setting
reduces the number of layers to 5, hidden size
to 384, and self-attention heads to 12. Its sub-
modules, a direct model, a reverse model and a
language model, have 43M, 43M, and 30M pa-
rameters, respectively. We employ the base set-
ting for a fair comparison with a single direct
model using roughly the same number of param-
eters (116M vs. 114M). For the large setting, we
use the same hyperparameters as the direct model
(114M), so that its sub-modules, a direct model, a
reverse model, and a language model, have 114M,
114M, and 64M parameters, respectively. We use
this large setting to explore the limits of the noisy
channel model. The large noisy channel model
(292M) is about 2.56 times the size of the direct
model (114M). This illustrates another advantage
of the noisy channel model during training. While
training a direct model with 292M parameters
will overflow the memory of 16GB TPUs (v3)
without using model parallelism, training the sub-modules
of the large noisy channel model can
easily fit into 16GB TPUs, as these modules are
independently trained with no need to load three
modules for training. This enables us to train a
noisy channel model with more parameters compared
to training a direct model using the same
hardware. For inference, we still need to load the
sub-modules into a TPU. Because gradients are
not required during inference, we are able to load
the three sub-modules of the large noisy channel
model (292M) into a single TPU with 16GB
memory for decoding. The large noisy channel
model (292M) still consumes more memory than
the direct model (114M) during inference.
4One exception is Henderson et al. (2019), who use Reddit
data to improve response retrieval and selection. We focus
on response generation in this work.
5https://cutt.ly/xkuUHUa.
6https://cutt.ly/QkuUZUu.
Dataset      # Dialog   # Turn    Avg. Turn/Dialog   Avg. Token/Turn   # Domain   Multi-Task   # Unique Slot   # Unique Value
Taskmaster   17,304     341,801   19.75              7.87              7          ✗            281             66,659
Schema       22,825     463,284   20.3               9.86              17         ✓            123             23,889
CamRest676   676        5,488     8.12               10.71             1          ✗            4               89
MultiWOZ     10,438     143,048   13.7               15.03             7          ✓            46              11,828
SMCalFlow    41,517     170,590   4.11               8.77              4          ✓            –               –
Table 1: Statistics of task-oriented dialogue datasets. We define a multi-task dialogue as a dialogue
involving multiple tasks (e.g., hotel and restaurant booking) while its counterpart handles a single task
(e.g., hotel booking). Taskmaster and CamRest676 do not contain any multi-task dialogues.
Pretraining Settings: The maximum sequence
length l is set to 1024, and sequences with longer
lengths are truncated. We reuse the vocabulary
from GPT-2 (Radford et al., 2019), which contains
50,257 BPE tokens. We use PreNorm (Nguyen
and Salazar, 2019) for faster convergence. GELU
(Hendrycks and Gimpel, 2016) is applied as the
activation function. Following ALBERT (Lan
et al., 2020), dropout is disabled during pretrain-
ing. We use the normal distribution truncated
to the range [−0.01, 0.01] to initialize the input
embeddings, while other parameters are initial-
ized using the normal distribution with zero mean
and standard deviation 0.1. The batch size is
set to 256. The LAMB optimizer (You et al.,
2020) (b1 = 0.9 and b2 = 0.999) is employed
for optimization. The initial learning rate is 1e-7,
and we apply 4000 warmup steps to increase
the learning rate to 1e-3, before utilizing cosine
annealing to decay the learning rate. Gradient
clipping with clipping value 1 is applied to avoid
gradient explosion. We use gradient accumulation
with accumulation step 20.
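The learning rate schedule described above can be sketched as a function of the training step. The text specifies the initial rate, the peak rate, the number of warmup steps, and the use of cosine annealing; the linear shape of the warmup, the decay horizon of 400,000 steps, and the final rate of zero are assumptions made for this sketch.

```python
import math


def learning_rate(step: int,
                  init_lr: float = 1e-7,
                  peak_lr: float = 1e-3,
                  warmup_steps: int = 4000,
                  total_steps: int = 400_000,   # assumed decay horizon
                  final_lr: float = 0.0) -> float:   # assumed floor
    """Linear warmup from init_lr to peak_lr, then cosine annealing."""
    if step < warmup_steps:
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine
```

Under these assumptions the rate reaches 1e-3 at step 4000, passes through roughly 5e-4 halfway through the decay, and approaches zero at the final step.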
Pretraining: For Reddit pretraining, we down-
load a Reddit dump (with Reddit posts ranging
from 2005-12 to 2019-09) from PushShift.7 Since
the comments of a Reddit post are organized into
a tree, we extract paths from a tree as dialogue
turns. The last comment of each comment path
is regarded as the response, while the others are
used as the dialogue context. We pretrain each
model for 400,000 steps, consuming 102,400,000
(400,000 × 256) comment paths in total. For the
task-oriented pretraining, we combine the two
datasets, Taskmaster and Schema-Guided Dia-
logue, and pretrain for 1e5 steps. The statistics of
the task-oriented dialogue datasets are shown in
Table 1.
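The conversion of a comment tree into (context, response) pairs can be sketched as follows. The tree representation (a mapping from a comment identifier to its child identifiers) is hypothetical, and the sketch assumes that paths run from the post to each leaf comment, which the description above does not state explicitly.

```python
from typing import Dict, List, Tuple


def comment_paths(tree: Dict[str, List[str]], root: str) -> List[List[str]]:
    """Return every root-to-leaf path using an iterative depth-first search."""
    paths: List[List[str]] = []
    stack: List[List[str]] = [[root]]
    while stack:
        path = stack.pop()
        children = tree.get(path[-1], [])
        if not children:
            paths.append(path)
            continue
        # Push in reverse so that children are visited in their listed order.
        for child in reversed(children):
            stack.append(path + [child])
    return paths


def to_context_response(path: List[str],
                        text: Dict[str, str]) -> Tuple[List[str], str]:
    """The last comment is the response; earlier ones form the context."""
    *context, response = [text[node] for node in path]
    return context, response


tree = {"post": ["c1", "c2"], "c1": ["c3"]}
text = {"post": "P", "c1": "A", "c2": "B", "c3": "C"}
paths = comment_paths(tree, "post")
pairs = [to_context_response(p, text) for p in paths]
```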
We train each model using 64 TPU chips with
16GB memory each. The pretraining takes around
4 days to complete.
5 Experiments
We fine-tune and evaluate the pretrained models
on three dialogue datasets: MultiWOZ 2.0, Cam-
Rest676 and SMCalFlow (Andreas et al., 2020). In
this section we describe the datasets (Section 5.1),
fine-tuning (Section 5.2), decoding (Section 5.3),
and evaluation metrics (Section 5.4). Results are
presented in Section 6, and analysis and ablation
studies in Section 7.
5.1 Datasets
MultiWOZ8 is a multi-domain dataset consisting
of dialogues annotated with C, B, A, R in the fol-
lowing seven domains: attraction, hotel, hospital,
police, restaurant, train, and taxi. Since its release,
MultiWOZ has been one of the most commonly
used task-oriented dialogue datasets.
CamRest6769 is annotated similarly to Multi-
WOZ and consists of dialogues in a single domain:
restaurant reservations. Though CamRest676 is
smaller than MultiWOZ and predates it, it still
provides a widely used benchmark for evaluating
task-oriented dialogue models.
7https://pushshift.io/.
8https://cutt.ly/0kuUCRS.
9https://cutt.ly/SkuUNfE.
SMCalFlow consists of dialogues in four do-
mains: calendar, weather, places, and people. Un-
like MultiWOZ and CamRest676, SMCalFlow
uses dataflow graphs instead of slot-value pairs
to represent belief states and does not annotate
dialogue acts. We refer readers to Andreas et al.
(2020) for a detailed description of the dataflow
representation. We follow Andreas et al. (2020)
to convert dataflow graphs into sequences to
apply seq2seq models. This dataset is newer and
offers fewer prior models to compare with, but
we use this dataset to study the robustness of the
noisy channel model under different annotation
schemas.
We use the public splits for these datasets,
where MultiWOZ, CamRest676 and SMCalFlow
are split to 8438/1000/1000, 404/136/136, and
32647/3649/5211 dialogues for training, develop-
ment, and testing, respectively. However, because
SMCalFlow’s test set has not been publicly
released, we randomly select 500 dialogues from
its training set to tune hyperparameters and use its
development set for testing.
Preprocessing: We use the standard prepro-
cessing procedures for each dataset in order to
facilitate fair comparison with previous meth-
ods.10,11,12 In particular, for MultiWOZ and Cam-
Rest676, delexicalization is used to reduce lexical
variation, while SMCalFlow does not use delexi-
calization. During delexicalization, slot values are
replaced by generic placeholders based on a pre-
defined dictionary. During decoding, following
prior work, our dialogue models generate delexi-
calized responses. These delexicalized responses
are re-lexicalized in post-processing by replacing
placeholders with their corresponding slot values
based on belief states and database information.
Since there is no public code for lexicalization,13
we implement our own functions for lexicaliza-
tion with regular expressions, for the purpose of
displaying example responses. However, this does
not affect reported results, as the standard metrics
for MultiWOZ and CamRest676 that we adopt
here are calculated using delexicalized responses.
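Re-lexicalization with regular expressions can be sketched as below. This is a hypothetical illustration rather than the functions used for this paper: the placeholder spelling `[value_time]` (with an underscore) and the per-placeholder value queues are assumptions, and the values would in practice be derived from belief states and database records.

```python
import re
from typing import Dict, List

# Placeholders are assumed to be lowercase names in square brackets.
PLACEHOLDER = re.compile(r"\[([a-z_]+)\]")


def lexicalize(response: str, values: Dict[str, List[str]]) -> str:
    """Fill each placeholder with the next unused value for its name.

    Placeholders without a remaining value are left unchanged.
    """
    remaining = {name: list(vals) for name, vals in values.items()}

    def fill(match: "re.Match[str]") -> str:
        queue = remaining.get(match.group(1))
        return queue.pop(0) if queue else match.group(0)

    return PLACEHOLDER.sub(fill, response)


delexicalized = ("There is a train that leaves at [value_time] "
                 "and arrives at [value_time].")
lexicalized = lexicalize(delexicalized, {"value_time": ["09:15", "12:30"]})
```

Consuming values in order lets a repeated placeholder, such as the two time slots above, receive different values.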
10https://cutt.ly/TkuU1oM.
11https://cutt.ly/zkuU0Ht.
12https://cutt.ly/vkuU9bT.
13We confirmed this with the dataset authors by email.
5.2 Fine-Tuning
We apply label smoothing with parameter 0.1.
Dropout is used on input embeddings and hidden
representations, with dropout rate 0.1. The Adam
optimizer (Kingma and Ba, 2015) (b1 = 0.9 and
b2 = 0.999) is adopted. We use a fixed learning
rate 1e-4 with gradient clipping for fine-tuning.
5.3 Decoding
We use direct decoding for belief state. For dia-
logue act and response, we study three decoding
methods: direct decoding, noisy channel rerank-
ing, and noisy channel online decoding. Since
all of these decoding methods require choosing
k1 tokens from a categorical distribution during
expansion, we compare four methods, top-k max-
imization, sampling without replacement, top-k
sampling, and nucleus sampling. Nucleus sam-
pling with cumulative probability 0.98 performs
marginally better and is adopted. We perform
a range search with the range [1, 20] on devel-
opment sets for the beam sizes k1 and k2, and
we set k1, k2 = 4, k1, k2 = 15, and k1, k2 = 4
for MultiWOZ, CamRest676, and SMCalFlow,
respectively. For noisy channel reranking and
noisy channel online decoding, a grid search with
range [0, 2] is performed for λ1, λ2, and λ3. We
set (λ1 = 0.8, λ2 = 1, λ3 = 0.8), (λ1 = 1.2,
λ2 = 1.2, λ3 = 0.8), and (λ1 = 0.4, λ2 = 1,
λ3 = 0.2) for MultiWOZ, CamRest676, and
SMCalFlow, respectively.
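Selecting expansion candidates with nucleus sampling can be sketched as follows. The text does not specify how k1 tokens are drawn from the nucleus, so this sketch assumes independent draws from the renormalized nucleus distribution (duplicates are possible); the function names and the toy distribution are hypothetical.

```python
import random
from typing import Dict, List, Optional


def nucleus(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Smallest set of most probable tokens with cumulative mass >= top_p,
    renormalized to sum to one."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, total = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        total += prob
        if total >= top_p:
            break
    return {token: prob / total for token, prob in kept}


def sample_candidates(probs: Dict[str, float], k1: int, top_p: float = 0.98,
                      rng: Optional[random.Random] = None) -> List[str]:
    """Draw k1 tokens (with replacement) from the nucleus distribution."""
    generator = rng if rng is not None else random
    restricted = nucleus(probs, top_p)
    return generator.choices(list(restricted),
                             weights=list(restricted.values()), k=k1)


next_token_probs = {"the": 0.6, "a": 0.3, "no": 0.09, "zzz": 0.01}
candidates = sample_candidates(next_token_probs, k1=4, rng=random.Random(0))
```

With a cumulative probability of 0.98, the low-probability tail token in this toy distribution is excluded before sampling.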
5.4 Evaluation Metrics
For MultiWOZ and CamRest676, following pre-
vious work, we adopt three automatic evaluation
metrics: inform, success, and BLEU score. Peng
et al. (2020a) showed that these metrics are
well correlated with human evaluation. The evaluators14,15
provided with the datasets are used for
calculating these metrics. To calculate the inform
score for a dialogue, the evaluator first checks
whether certain placeholders (e.g., [restaurant_name])
appear in decoded responses.
If so, decoded belief states are converted to
database queries to retrieve database records.
These database records are compared with the
records retrieved with ground-truth belief states.
The inform score is one if these two sets of
database records match. The success score takes
14https://cutt.ly/VkuU3FA.
15https://cutt.ly/MkuU88u.
663
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
3
9
0
1
9
2
9
7
2
7
/
/
t
l
a
c
_
a
_
0
0
3
9
0
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Model | Inform ↑ | Success ↑ | BLEU ↑ | Combined ↑
--- | --- | --- | --- | ---
Sequicity (Lei et al., 2018) | 66.4 | 45.3 | 15.54 | 71.39
HRED-TS (Peng et al., 2019) | 70.0 | 58.0 | 17.50 | 81.50
DSTC8 Track 1 Winner (Ham et al., 2020) | 73.0 | 62.4 | 16.00 | 83.50
DAMD (Zhang et al., 2020a) | 76.4 | 60.4 | 16.60 | 85.00
SimpleTOD (Hosseini-Asl et al., 2020) | 84.4 | 70.1 | 15.01 | 92.26
SOLOIST (Peng et al., 2020a) | 85.5 | 72.9 | 16.54 | 95.74
UBAR (Yang et al., 2021)† | 88.2 | 79.5 | 16.43 | 100.28
Randomly Initialized | | | |
Direct decoding (114M) | 81.0 | 54.7 | 15.12 | 82.97
Noisy channel reranking (116M) | 82.7 | 57.1 | 15.29 | 85.19
Noisy channel online decoding (116M) | 82.9 | 58.9 | 15.33 | 86.23
Noisy channel reranking (292M) | 82.1 | 58.1 | 15.37 | 85.47
Noisy channel online decoding (292M) | 83.9 | 60.9 | 15.57 | 87.97
Reddit Pretraining | | | |
Direct decoding (114M) | 81.0 | 69.2 | 17.06 | 92.16
Noisy channel reranking (116M) | 81.3 | 70.1 | 19.01 | 94.71
Noisy channel online decoding (116M) | 81.6 | 71.1 | 19.31 | 95.66
Noisy channel reranking (292M) | 82.2 | 70.9 | 19.89 | 96.44
Noisy channel online decoding (292M) | 82.4 | 71.7 | 20.49 | 97.54
Task-Oriented Pretraining | | | |
Direct decoding (114M) | 85.2 | 72.9 | 17.00 | 96.05
Noisy channel reranking (116M) | 85.6 | 73.8 | 19.38 | 99.08
Noisy channel online decoding (116M) | 85.9 | 74.8 | 19.76 | 100.11
Noisy channel reranking (292M) | 86.5 | 74.9 | 20.31 | 101.01
Noisy channel online decoding (292M) | 86.9 | 76.2 | 20.58 | 102.13
Table 2: MultiWOZ test results (end-to-end modeling with generated beliefs) with seq2seq approaches.
Results are significant (p < 0.01) comparing noisy channel decoding and direct decoding. † Yang et al.
(2021) also report a combined score of 105.1 with an alternative context and evaluation setting,
contributions orthogonal to our work and the other benchmarks reported here.
all the requestable slots (e.g., postcode, phone
number, and address) from a decoded response
and compares these requestable slots with the ones
in the ground-truth response. The success score
is one if generated requestable slots coincide with
the ground-truth ones. BLEU score (BLEU-4)
compares the n-grams of generated responses and
human responses, and is a widely used metric
in NLP for evaluating text quality. Following
Budzianowski et al. (2018), we also calculate a
combined score, which is (Inform + Success) /
2 + BLEU. For SMCalFlow, inform and success
scores are not applicable because calculation of
these scores relies on delexicalization placehold-
ers, and this dataset does not use delexicalization.
We use SacreBLEU16 and TER17 to directly mea-
sure the quality of responses. As prior work on
16https://cutt.ly/BkuU7dL.
17https://pypi.org/project/pyter/.
this dataset has focused on belief tracking rather
than end-to-end response generation, we are the
first to use these metrics on this dataset.
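As a small worked example of the combined score defined above, the sketch below recomputes two Combined entries of Table 2 from their Inform, Success, and BLEU values. The function name `combined_score` is our own choice for illustration and is not part of the evaluators shipped with the datasets.

```python
def combined_score(inform: float, success: float, bleu: float) -> float:
    """Combined = (Inform + Success) / 2 + BLEU, following Budzianowski et al. (2018)."""
    return (inform + success) / 2.0 + bleu


# Inform, Success, and BLEU values copied from Table 2 (MultiWOZ test set).
soloist = combined_score(85.5, 72.9, 16.54)
direct_random_init = combined_score(81.0, 54.7, 15.12)

# Rounded to two decimals, these should agree with the Combined column.
print(round(soloist, 2), round(direct_random_init, 2))
```

Because BLEU enters the sum unscaled while inform and success are averaged, a one-point change in BLEU moves the combined score as much as a two-point change in either inform or success.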
We perform significance tests, using a t-test
for the inform, success, and TER scores and a
permutation test for BLEU.
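The text does not say which variant of the permutation test is used. The sketch below shows one common choice, a paired approximate randomization test: the outputs of the two systems are swapped per item at random, and the metric gap is recomputed each time. The function name, the `metric` callable, the number of trials, and the toy per-turn scores are all assumptions made for illustration; for BLEU, `metric` would be a corpus-level scorer applied to the (possibly swapped) outputs.

```python
import random
from statistics import mean
from typing import Callable, Sequence


def paired_permutation_test(
    sys_a: Sequence[float],
    sys_b: Sequence[float],
    metric: Callable[[Sequence[float]], float],
    trials: int = 2000,
    seed: int = 0,
) -> float:
    """Estimate a two-sided p-value for the gap metric(sys_a) - metric(sys_b)."""
    if len(sys_a) != len(sys_b):
        raise ValueError("both systems must be scored on the same items")
    rng = random.Random(seed)
    observed = abs(metric(sys_a) - metric(sys_b))
    hits = 0
    for _ in range(trials):
        xs, ys = [], []
        for a, b in zip(sys_a, sys_b):
            # Under the null hypothesis the system labels are exchangeable,
            # so each pair is swapped with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            xs.append(a)
            ys.append(b)
        if abs(metric(xs) - metric(ys)) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (trials + 1)


# Toy per-turn scores (made up): system A is better on every turn.
toy_a = [1.0] * 50
toy_b = [0.0] * 50
p_value = paired_permutation_test(toy_a, toy_b, mean)
print(p_value < 0.01)
```

With identical systems every shuffle reproduces a gap of zero, so the estimate is 1.0; a small p-value arises only when random relabeling rarely matches the observed gap.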
6 Results
MultiWOZ: Results on the MultiWOZ test set
are shown in Table 2. We observe several trends.
First, the base noisy channel model (116M) per-
forms better than direct decoding (114M), despite
having a similar number of parameters, showing
that the noisy channel factorization is beneficial
for task-oriented dialogue. The large noisy chan-
nel setting improves further over the base setting.
Second, Reddit pretraining provides benefits over
random initialization, validating the use of large
Model | Inform ↑ | Success ↑ | BLEU ↑ | Combined ↑
--- | --- | --- | --- | ---
Sequicity (Lei et al., 2018) | 92.3 | 85.3 | 21.40 | 110.20
GPT-2 fine-tuned (Wu et al., 2019b) | - | 86.2 | 19.20 | -
ARDM (Wu et al., 2019b) | - | 87.1 | 25.20 | -
SOLOIST (Peng et al., 2020a) | 94.7 | 87.1 | 25.50 | 116.40
Randomly Initialized | | | |
Direct decoding (114M) | 83.5 | 78.1 | 21.58 | 102.38
Noisy channel online decoding (116M) | 84.1 | 79.8 | 22.83 | 104.78
Noisy channel online decoding (292M) | 84.9 | 80.9 | 23.19 | 106.09
Reddit Pretraining | | | |
Direct decoding (114M) | 93.3 | 83.9 | 23.41 | 112.01
Noisy channel online decoding (116M) | 93.7 | 84.5 | 25.14 | 114.24
Noisy channel online decoding (292M) | 93.9 | 84.7 | 25.38 | 114.68
Task-Oriented Pretraining | | | |
Direct decoding (114M) | 93.4 | 84.3 | 24.92 | 113.77
Noisy channel online decoding (116M) | 94.3 | 85.2 | 25.98 | 115.73
Noisy channel online decoding (292M) | 95.4 | 85.3 | 26.89 | 117.24
Table 3: CamRest676 test results (end-to-end modeling with generated beliefs) with seq2seq approaches.
Noisy channel reranking performs comparably to noisy channel online decoding, and its results are
not shown. Results are significant (p < 0.01) comparing noisy channel decoding and direct decoding.
Model | SacreBLEU ↑ | TER ↓
--- | --- | ---
Randomly Initialized | |
Direct decoding (114M) | 51.30 | 89.13
Online decoding (116M) | 53.66 | 74.18
Online decoding (292M) | 54.39 | 73.18
Reddit Pretraining | |
Direct decoding (114M) | 60.68 | 61.99
Online decoding (116M) | 63.29 | 47.16
Online decoding (292M) | 63.91 | 46.43
Task-Oriented Pretraining | |
Direct decoding (114M) | 61.02 | 59.84
Online decoding (116M) | 63.72 | 46.27
Online decoding (292M) | 64.29 | 45.81
Table 4: SMCalFlow results. Reranking performs
worse than online decoding, and the results are
not shown. Results are significant (p < 0.01)
comparing noisy channel decoding and direct
decoding.
open-domain dialogue-genre pretraining for task-
oriented dialogue, while the models with a second
stage of task-oriented pretraining obtain further
improvements. This effect is consistent across
both direct and noisy channel decoding. Finally,
we observe that online decoding consistently
outperforms reranking, indicating the benefits of
tighter model integration during decoding.
Our model performs better on combined score
than SOLOIST (Peng et al., 2020a), a closely
related baseline that pretrains a GPT2-initialized
Transformer with Taskmaster and Schema-
Guided Dialogue and decodes with nucleus
sampling.
CamRest676: Results on the CamRest676 test
set are shown in Table 3. We observe that the
base noisy channel model (116M) obtains bet-
ter results compared to direct decoding (114M),
again demonstrating the effectiveness of the noisy
channel model. Reddit pretraining again provides
a large benefit over random initialization for both
direct decoding and noisy channel decoding, while
task-oriented pretraining provides a further boost.
Our model again performs better than SOLOIST.
SMCalFlow: Results on the SMCalFlow devel-
opment set are shown in Table 4. As end-to-end
models have not previously been tested on this
dataset, we use it to demonstrate that the noisy
channel model, which we developed primar-
ily on MultiWOZ, continues to be effective on
task-oriented dialogue datasets with different
annotation schema. The results are consistent
with MultiWOZ and CamRest676. The noisy
channel model outperforms the direct model by
a large margin, demonstrating that dialogue act
annotations are not essential for the noisy channel
Model | CamRest676 | MultiWOZ
--- | --- | ---
Direct decoding | 115.17 | 96.73
Noisy Channel Online Decoding | |
Direct + Channel | 115.63 | 98.54
Direct + Source | 115.91 | 99.12
Direct + Length | 115.56 | 97.57
Channel + Source | 115.82 | 99.18
Channel + Length | 115.60 | 98.71
Source + Length | 115.62 | 99.19
All - Direct | 115.96 | 100.89
All - Channel | 116.56 | 100.93
All - Source | 116.38 | 99.92
All - Length | 116.52 | 101.11
All | 116.91 | 102.62
Table 5: Ablation results for model combination
on development sets (combined score). Results
for reranking are similar and are not shown.
‘All’, ‘Direct’, ‘Source’, and ‘Channel’ denote
no ablation, direct model, source model and
channel model, respectively. Rows with ‘+’ are
combinations of two sub-modules, while the rows
with ‘-’ are combinations of three sub-modules.
model, and that it remains effective across diverse
dialogue representations.
Reddit pretraining confers a similar large ben-
efit on SMCalFlow as on the other datasets, but
we observe that task-oriented pretraining brings
only marginal further improvements. This may be
due to differences in domain or format between
our pretraining datasets and SMCalFlow. Alterna-
tively, task-oriented pretraining may help more on
task-specific metrics, such as inform and success
scores, than on text quality metrics such as
BLEU and TER scores. This hypothesis is further
supported by the MultiWOZ results in Table 2.
7 Analysis
In this section, we use MultiWOZ and Cam-
Rest676 to perform ablation studies on the effects
of model combination,
large-scale pretraining,
and sample efficiency; as well as analyzing the
runtime requirements of our model and the reasons
for its success.
7.1 Ablation on Model Combination
Noisy channel decoding involves a combination
of four sub-modules, as in Eq. 1: the direct model,
channel model, language model, and length bias.
We perform an ablation study to determine
whether all model components are important to
the result, using the large model. Results on the
development sets of CamRest676 and MultiWOZ
are presented in Table 5. Note that the ablation
is performed after applying the direct model to
obtain k1 expansions at each beam search step for
noisy channel online decoding. We find that the
combination of all four sub-modules performs the
best, followed by combinations of three and then
two sub-modules. The results are significant when
comparing ‘All’ and the baselines (p < 0.01).
This result demonstrates the effectiveness of the
noisy channel factorization, and the importance
of each model component.
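To make the ablation concrete, the sketch below reranks candidate responses with a weighted sum of the four sub-module scores. It assumes that Eq. 1 is a linear combination of the direct, channel, and source log-probabilities plus a length term; the class name, the weight values, and the candidate scores are made up for illustration and are not the tuned values used in the experiments. Setting a weight to zero corresponds to removing that sub-module, as in the rows of Table 5.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    text: str
    log_p_direct: float   # log p(response | context), direct model
    log_p_channel: float  # log p(context | response), channel model
    log_p_source: float   # log p(response), source (language) model
    length: int           # number of tokens in the response


def combined_logscore(c: Candidate, weights: Dict[str, float]) -> float:
    """Weighted sum of sub-module scores; a missing or zero weight ablates a module."""
    return (
        weights.get("direct", 0.0) * c.log_p_direct
        + weights.get("channel", 0.0) * c.log_p_channel
        + weights.get("source", 0.0) * c.log_p_source
        + weights.get("length", 0.0) * c.length
    )


def rerank(cands: List[Candidate], weights: Dict[str, float]) -> List[Candidate]:
    """Sort candidates from best to worst under the combined score."""
    return sorted(cands, key=lambda c: combined_logscore(c, weights), reverse=True)


# Hypothetical scores: a short generic reply and a longer, more specific one.
generic = Candidate("There are [value count] places.", -2.0, -40.0, -5.0, 7)
specific = Candidate("How about [restaurant name] located at ...", -4.0, -25.0, -12.0, 20)
candidates = [generic, specific]

direct_only = {"direct": 1.0}
all_modules = {"direct": 1.0, "channel": 0.5, "source": 0.5, "length": 0.1}

print(rerank(candidates, direct_only)[0].text)
print(rerank(candidates, all_modules)[0].text)
```

In this toy setting the direct model alone prefers the generic reply, while the full combination prefers the specific one because the channel model assigns the dialogue context a much higher probability given that response.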
7.2 Effect of Pretraining Scale
We investigate the importance of scale for both
our pretraining stages. We select different check-
points for Reddit pretraining, and truncate the two
task-oriented dialogue datasets for task-oriented
pretraining. We fine-tune these models using the
full training data of CamRest676 or MultiWOZ.
The results of three decoding methods (with the
large noisy channel model) on the development
sets are shown in Figure 2. In Figure 2 (a) and
(c), the combined scores of all three decoding
methods improve with more Reddit pretraining
steps, demonstrating the advantage of increasing
amounts of data in the open-domain dialogue
pretraining stage. In Figure 2 (b) and (d), the
combined scores further
increase with more
task-oriented data, confirming that additional
task-oriented pretraining data is useful.
7.3 Sample Efficiency of Fine-Tuning
We investigate whether pretraining can improve
sample efficiency during fine-tuning. We gradu-
ally increase the amount of fine-tuning data and
evaluate the randomly-initialized, Reddit pre-
trained and task-oriented pretrained models. The
results on the development sets are shown in
Figure 3. Combined scores increase with more
training data under all conditions. Crucially,
Reddit pretrained models show better performance
with a smaller amount of fine-tuning data than
randomly initialized models, and task-oriented
pretrained models better still. We conclude that
both our pretraining stages can improve sample
efficiency, which is especially important when the
target task has little training data.
Figure 2: Results showing the effect of pretraining scale.
Figure 3: Pretraining improves sample efficiency during fine-tuning.
7.4 Decoding Runtime
In Table 6, we report the average clock time for
decoding one turn (including its belief state, dia-
logue act and response). Noisy channel reranking
is slightly slower compared to direct decoding,
with overhead due to the reranking step in Eq. 1.
Noisy channel online decoding is significantly
slower, since it needs to apply Eq. 1 at each beam
search step. In future work we will investigate
ways to improve the efficiency of online decoding.
7.5 Decoding Properties
In this section we analyze why the noisy channel
model performed better than direct decoding.
Model | CamRest676 | MultiWOZ
--- | --- | ---
Direct decoding | 4.89 | 6.48
Reranking | 5.43 | 6.92
Online decoding | 8.73 | 10.97
Table 6: Average decoding time (in seconds) for
each turn with different decoding methods.
Length: In Table 7 we show the average length
of generated responses. Direct decoding produces
shorter responses than the ground truth, confirm-
ing that the direct model prefers short and generic
responses. Adding a length bias to direct decoding
(with lambda tuned on the development sets)
produces responses longer than the ground truth,
Model | CamRest676 | MultiWOZ
--- | --- | ---
Ground truth | 14.50 | 16.91
Direct decoding | 12.07 | 12.85
Direct decoding + Length | 15.98 | 17.73
Reranking | 15.09* | 17.47
Online decoding | 15.14 | 17.32*
Table 7: The average length of responses with
different decoding methods (on test set). The value
closest to the ground truth is marked with *.
Model | CamRest676 | MultiWOZ
--- | --- | ---
Ground truth | 1.07 | 1.22
Direct decoding | 0.84 | 0.91
Reranking | 0.87 | 0.99
Online decoding | 0.89* | 1.03*
Table 8: The Zipf scores of responses with dif-
ferent decoding methods (on test set). The value
closest to the ground truth is marked with *.
Model | CamRest676 | MultiWOZ
--- | --- | ---
Direct decoding | 0.24 | 0.31
Reranking | 0.12 | 0.14
Online decoding | 0.08 | 0.11
Table 9: The likelihood (%) of falling into repe-
tition loops for different decoding methods (on
test set).
which may be a disadvantage. The noisy channel
models produce responses with average length
closest to the ground truth.
Zipf: Table 8 shows the Zipf scores of re-
sponses. We find that the word distributions of
responses generated by the noisy channel models
are closer to the word distribution of ground-truth
responses.
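The text does not spell out how the Zipf score is computed. A common choice, assumed in the sketch below, is to rank tokens by frequency and take the negated slope of a least-squares line fitted to log frequency against log rank. The function name and the toy corpus are ours and serve only as an illustration.

```python
import math
from collections import Counter
from typing import Iterable, List


def zipf_coefficient(responses: Iterable[List[str]]) -> float:
    """Negated slope of log(frequency) versus log(rank) over all tokens."""
    counts = Counter(tok for resp in responses for tok in resp)
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct tokens")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return -cov / var


# Toy corpus whose token counts follow 600 / rank exactly.
toy_corpus = [[w] * n for w, n in zip("abcdef", [600, 300, 200, 150, 120, 100])]
toy_zipf = zipf_coefficient(toy_corpus)
print(round(toy_zipf, 3))
```

Under this definition a lower coefficient means a flatter rank-frequency curve, so comparing each decoding method's value with the ground-truth value indicates how closely the generated word distribution follows that of human responses.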
Repetition: In Table 9 we examine the like-
lihood of falling into repetition loops (Holtzman
et al., 2019) for different decoding methods. Re-
petition loops are rare for all decoding methods,
but noisy channel decoding can further decrease
their likelihood. The channel model can discount
a sequence with a repetition loop, since it conveys
less information than a natural sequence of the
same length, making it harder to ‘‘explain’’ the
context.
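The exact criterion for a repetition loop is not restated here. As a rough sketch, the code below flags a response if it ends with some n-gram repeated consecutively at least `min_repeats` times, and reports the percentage of flagged responses. The function names, the thresholds `min_repeats` and `max_period`, and the example responses are assumptions for illustration and may differ from the criterion behind Table 9.

```python
from typing import List, Sequence


def ends_in_repetition_loop(
    tokens: Sequence[str], min_repeats: int = 3, max_period: int = 10
) -> bool:
    """True if the response ends with an n-gram repeated min_repeats times in a row."""
    n = len(tokens)
    for period in range(1, max_period + 1):
        span = period * min_repeats
        if span > n:
            break
        unit = tokens[n - period:]  # candidate repeating n-gram
        tail = tokens[n - span:]    # last min_repeats * period tokens
        if all(tail[i] == unit[i % period] for i in range(span)):
            return True
    return False


def loop_rate(responses: List[Sequence[str]]) -> float:
    """Percentage of responses that end in a repetition loop."""
    if not responses:
        return 0.0
    flagged = sum(ends_in_repetition_loop(r) for r in responses)
    return 100.0 * flagged / len(responses)


looping = "i can book that book that book that".split()
normal = "the train leaves at [value time]".split()
print(ends_in_repetition_loop(looping), ends_in_repetition_loop(normal))
print(loop_rate([looping, normal]))
```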
Examples: Some examples of responses are
shown in Table 10. We observe that noisy chan-
nel models decode longer responses compared
to direct decoding, and that the responses can
explain their dialogue contexts well to meet users’
requirements.
8 Related Work
Task-Oriented Dialogue Models: Most task-
oriented dialogue systems break down the task
into three components: belief tracking (Henderson
et al., 2013; Mrkšić et al., 2016; Rastogi et al.,
2017; Nouri and Hosseini-Asl, 2018; Wu et al.,
2019a; Zhang et al., 2019; Zhou and Small, 2019;
Heck et al., 2020), dialogue act prediction (Wen
et al., 2017a; Tanaka et al., 2019), and response
generation (Chen et al., 2019; Budzianowski et al.,
2018; Lippe et al., 2020). Traditionally, a mod-
ular approach is adopted, where these components
are optimized independently (i.e., a pipeline
design) or learned via multi-task learning (i.e.,
some parameters are shared among the compo-
nents) (Wen et al., 2017b; Neelakantan et al.,
2019; Zhao et al., 2019; Mehri et al., 2019;
Tseng et al., 2020; Lee et al., 2020). However,
it is known that improvements in one component
do not necessarily lead to overall performance
improvements (Ham et al., 2020), and the mod-
ular approach suffers from error propagation in
practice (Liu and Lane, 2018). These observations
gave rise to the sequence-to-sequence approach
(Lei et al., 2018; Pei et al., 2019; Budzianowski
and Vulić, 2019; Wu et al., 2019b; Zhang et al.,
2020a; Ham et al., 2020; Hosseini-Asl et al.,
2020; Peng et al., 2020a; Yang et al., 2021),
where dialogue beliefs and acts are represented
as text spans, and a sequence-to-sequence model
is applied to subsume the three components. Our
work is situated within this general approach. In
contrast to previous work, however, which uses a
direct model for decoding, we introduce the noisy
channel model to improve task-oriented dialogue.
Pretraining Models for Dialogue: Recent
work has applied pretraining (Peters et al., 2018;
Devlin et al., 2019; Radford et al., 2019) to
dialogue. For open-domain dialogue, DialoGPT
(Zhang et al., 2020b) and CGRG (Wu et al., 2020b)
extend GPT-2 (Radford et al., 2019) for res-
ponse generation. PLATO (Bao et al., 2019) and
PLATO-2 (Bao et al., 2020) pretrain a latent vari-
able model with social media data for diversified
response generation. Meena (Adiwardana et al.,
2020) collects a large-scale social media corpus
Model | Delexicalized Response | SacreBLEU ↑
--- | --- | ---
Dialogue: MUL1624, Turn Number: 4 | |
Ground truth | Sure. The trainid is [train id]. It leaves at [value time] and arrives at [value time]. Would you like to make a reservation? | –
Direct decoding | The travel time is [value count] minutes. | 2.03
Reranking | [train id] leaves [value place] at [value time] and arrives in [value place] at [value time]. Would you like me to book that for you? | 46.33
Online decoding | [train id] leaves at [value time] and arrives at [value time]. Would you like me to book that? | 60.90
Dialogue: MUL1276, Turn Number: 6 | |
Ground truth | The [restaurant name] has a postal code of [restaurant postcode]. Will this work for you? I can book it if you please. | –
Direct decoding | There are [value count] places. What type of food would you like? | 1.63
Reranking | How about [restaurant name] located at [restaurant address], postcode [restaurant postcode]. | 24.90
Online decoding | How about [restaurant name] located at [restaurant address], postcode [restaurant postcode]? | 22.00
Dialogue: MUL1898, Turn Number: 2 | |
Ground truth | Yes, [attraction name] is on [attraction address] and is in the [value area] side of town. Is there anything else you need to know? | –
Direct decoding | [attraction name] is located in the [value area] part of town and has free admission. | 27.53
Reranking | [attraction name] is located in the [value area] of town at [attraction address]. The entrance fee is free. Can I help you with anything else? | 41.66
Online decoding | [attraction name] is located in the [value area] part of town at [attraction address]. Can I help you with anything else? | 42.38
Table 10: Case study on the responses decoded by direct decoding, noisy channel reranking, and noisy
channel online decoding. The large noisy channel model is used.
for pretraining and proposes a metric named sen-
sibleness and specificity average for evaluation.
Roller et al. (2020) study various strategies for
building an open-domain chatbot with Reddit
for pretraining. For task-oriented dialogue, ToD-
BERT (Wu et al., 2020a) fine-tunes BERT
(Devlin et al., 2019) for four tasks, including
intention detection, belief tracking, dialogue act
prediction, and response selection. SC-GPT (Peng
et al., 2020b) fine-tunes GPT-2 for few-shot re-
sponse generation with given dialogue acts. Ham
et al. (2020) fine-tune GPT-2 for belief tracking
and context-to-response generation. SimpleTOD
(Hosseini-Asl et al., 2020) proposes a method
to serialize dialogue beliefs and acts into text
spans and fine-tunes GPT-2 for end-to-end dia-
logue modeling. SOLOIST (Peng et al., 2020a)
uses a series of task-oriented dialogue datasets
to further pretrain GPT-2 before fine-tuning it on
final tasks for evaluation. Unlike these BERT-
or GPT-initialized task-oriented dialogue models,
which are essentially pretrained with general text,
such as Wikipedia and BookCorpus, we use a
Reddit dump to pretrain the models to learn from
open-domain dialogues.
9 Conclusion
We introduced two noisy channel models, noisy
channel reranking and noisy channel online decod-
ing, for task-oriented dialogue. Large-scale pre-
training was further adopted to tackle data scarcity
in downstream tasks. Extensive experiments
on MultiWOZ, CamRest676, and SMCalFlow
demonstrated that (1) the noisy channel mod-
els significantly outperform direct decoding; (2)
models with pretraining improve over randomly-
initialized models; (3) the models are robust to
different dialogue schema annotations; and (4) the
noisy channel models can decode responses closer
to ground-truth responses than direct decoding.
Acknowledgments
We would like to thank the action editors
(Maggie, Wenjie Li, and Eneko Agirre) and
three anonymous reviewers for their insightful
comments. We also thank Angeliki Lazaridou,
Gábor Melis, Nando de Freitas, Chris Dyer, and
the DeepMind language team for their helpful
discussions.
References
Daniel Adiwardana, Minh-Thang Luong, David R.
So, Jamie Hall, Noah Fiedel, Romal Thoppilan,
Zi Yang, Apoorv Kulshreshtha, Gaurav
Nemade, Yifeng Lu, and Quoc V. Le. 2020.
Towards a human-like open-domain chatbot.
arXiv preprint arXiv:2001.09977.
Jacob Andreas, John Bufe, David Burkett, Charles
Chen, Josh Clausman, Jean Crawford, Kate
Crim, Jordan DeLoach, Leah Dorner, Jason
Eisner, Hao Fang, Alan Guo, David
Hall, Kristin Hayes, Kellie Hill, Diana Ho,
Wendy Iwaszuk, Smriti Jha, Dan Klein,
Jayant Krishnamurthy, Theo Lanman, Percy
Liang, Christopher H. Lin, Ilya Lintsbakh,
Andy McGovern, Aleksandr Nisnevich, Adam
Pauls, Dmitrij Petters, Brent Read, Dan Roth,
Subhro Roy, Jesse Rusak, Beth Short, Div
Slomin, Ben Snyder, Stephon Striplin, Yu
Su, Zachary Tellman, Sam Thomson, Andrei
Vorobev, Izabela Witoszko, Jason Wolfe, Abby
Wray, Yuchen Zhang, and Alexander Zotov.
2020. Task-oriented dialogue as dataflow
synthesis. Transactions of the Association for
Computational Linguistics, 8:556–571.
John Langshaw Austin. 1975. How To Do Things
with Words, 88, Oxford University Press.
Siqi Bao, Huang He, Fan Wang, and Hua Wu.
2019. Plato: Pre-trained dialogue generation
model with discrete latent variable. arXiv
preprint arXiv:1910.07931.
Siqi Bao, Huang He, Fan Wang, Hua Wu, and
Haifeng Wang. 2020. PLATO: pre-trained
dialogue generation model with discrete
latent variable. In Proceedings of the 58th
Annual Meeting of the Association for Com-
putational Linguistics, ACL 2020, Online,
July 5-10, 2020, pages 85–96. Association for
Computational Linguistics.
https://doi.org/10.18653/v1/2020.acl-main.9
James Bradbury, Roy Frostig, Peter Hawkins,
Matthew James Johnson, Chris Leary, Dougal
Maclaurin, and Skye Wanderman-Milne. 2018.
JAX: composable transformations of Python+
NumPy programs.
Thorsten Brants, Ashok C. Popat, Peng Xu,
Franz J. Och, and Jeffrey Dean. 2007. Large
language models in machine translation. In
Proceedings of the 2007 Joint Conference on
Empirical Methods in Natural Language Pro-
cessing and Computational Natural Language
Learning (EMNLP-CoNLL), pages 858–867,
Prague, Czech Republic. Association for Com-
putational Linguistics.
Tom B. Brown, Benjamin Mann, Nick Ryder,
Melanie Subbiah, Jared Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam,
Girish Sastry, Amanda Askell, Sandhini Agarwal,
Ariel Herbert-Voss, Gretchen Krueger, Tom
Henighan, Rewon Child, Aditya Ramesh,
Daniel M. Ziegler, Jeffrey Wu, Clemens Winter,
Christopher Hesse, Mark Chen, Eric Sigler,
Mateusz Litwin, Scott Gray, Benjamin
Chess, Jack Clark, Christopher Berner, Sam
McCandlish, Alec Radford, Ilya Sutskever, and
Dario Amodei. 2020. Language models are
few-shot learners. In
Advances in Neural Information Processing
Systems 33: Annual Conference on Neural
Information Processing Systems 2020, NeurIPS
2020, December 6-12, 2020, virtual.
Paweł Budzianowski and Ivan Vulić. 2019. Hello,
it’s GPT-2 - how can I help you? Towards
the use of pretrained language models for
task-oriented dialogue systems. In Proceedings
of the 3rd Workshop on Neural Generation
and Translation, pages 15–22, Hong Kong.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/D19-5602
Pawel Budzianowski, Tsung-Hsien Wen, Bo-
Hsiang Tseng, Iñigo Casanueva, Stefan
Ultes, Osman Ramadan, and Milica Gasic.
2018. Multiwoz - A large-scale multi-domain
wizard-of-oz dataset for task-oriented dia-
logue modelling. In Proceedings of the 2018
Conference on Empirical Methods in Natural
Language Processing, Brussels, Belgium, Octo-
ber 31 - November 4, 2018, pages 5016–5026.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/D18-1547
Bill Byrne, Karthik Krishnamoorthi, Chinnadhu-
rai Sankar, Arvind Neelakantan, Ben Goodrich,
Daniel Duckworth, Semih Yavuz, Amit Dubey,
Kyu-Young Kim, and Andy Cedilnik. 2019.
Taskmaster-1: Toward a realistic and diverse
dialog dataset. In Proceedings of
the 2019
Conference on Empirical Methods in Natural
Language Processing and the 9th International
Joint Conference on Natural Language Pro-
cessing, EMNLP-IJCNLP 2019, Hong Kong,
China, November 3–7, 2019, pages 4515–4524.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/D19-1459
Wenhu Chen, Jianshu Chen, Pengda Qin,
Xifeng Yan, and William Yang Wang. 2019.
Semantically conditioned dialog response
generation via hierarchical disentangled self-
attention. In Proceedings of the 57th Annual
Meeting of the Association for Computational
Linguistics, pages 3696–3709, Florence, Italy.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/P19-1360
Jacob Devlin, Ming-Wei Chang, Kenton Lee,
and Kristina Toutanova. 2019. BERT: Pre-
training of deep bidirectional transformers for
language understanding. In Proceedings of
the 2019 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
NAACL-HLT 2019, Minneapolis, MN, USA,
June 2-7, 2019, Volume 1 (Long and Short
Papers), pages 4171–4186. Association for
Computational Linguistics.
Angela Fan, Mike Lewis, and Yann N. Dauphin.
2018. Hierarchical neural story generation.
In Iryna Gurevych and Yusuke Miyao, editors,
Proceedings of the 56th Annual Meeting of
the Association for Computational Linguistics,
ACL 2018, Melbourne, Australia, July 15-20,
2018, Volume 1: Long Papers, pages 889–898.
Association for Computational Linguistics.
Donghoon Ham, Jeong-Gwan Lee, Youngsoo
Jang, and Kee-Eung Kim. 2020. End-to-
end neural pipeline for goal-oriented dialogue
systems using GPT-2. In Proceedings of the
58th Annual Meeting of the Association for
Computational Linguistics, pages 583–592.
Association for Computational Linguistics,
Online.
Michael Heck, Carel van Niekerk, Nurul Lubis,
Christian Geishauser, Hsien-Chin Lin, Marco
Moresi, and Milica Gasic. 2020. Trippy: A
triple copy strategy for value independent neural
dialog state tracking. In Proceedings of the 21th
Annual Meeting of the Special Interest Group
on Discourse and Dialogue, SIGdial 2020, 1st
virtual meeting, July 1-3, 2020, pages 35–44.
Association for Computational Linguistics.
Matthew Henderson, Blaise Thomson, and Steve
Young. 2013. Deep neural network approach
for the dialog state tracking challenge. In
Proceedings of the SIGDIAL 2013 Conference,
pages 467–471.
Matthew Henderson, Ivan Vulic, Daniela Gerz,
Iñigo Casanueva, Pawel Budzianowski, Sam
Coope, Georgios Spithourakis, Tsung-Hsien
Wen, Nikola Mrksic, and Pei-Hao Su. 2019.
Training neural response selection for task-
oriented dialogue systems. In Proceedings
of the 57th Conference of the Association
for Computational Linguistics, ACL 2019,
Florence, Italy, July 28- August 2, 2019,
Volume 1: Long Papers, pages 5392–5404.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/P19-1536
Dan Hendrycks and Kevin Gimpel. 2016.
Gaussian error linear units (gelus). arXiv
preprint arXiv:1606.08415.
Tom Hennigan, Trevor Cai, Tamara Norman, and
Igor Babuschkin. 2020. Haiku: Sonnet for JAX.
Ari Holtzman, Jan Buys, Maxwell Forbes, and
Yejin Choi. 2019. The curious case of neural
text degeneration. CoRR, abs/1904.09751.
Ehsan Hosseini-Asl, Bryan McCann, Chien-
Sheng Wu, Semih Yavuz, and Richard Socher.
2020. A simple language model for task-
oriented dialogue. In Advances in Neural
Information Processing Systems 33: Annual
Conference on Neural Information Processing
Systems 2020, NeurIPS 2020, December 6-12,
2020, virtual.
Diederik P. Kingma and Jimmy Ba. 2015.
Adam: A method for stochastic optimization.
In 3rd International Conference on Learn-
ing Representations, ICLR 2015, San Diego,
CA, USA, May 7–9, 2015, Conference Track
Proceedings.
Dan Klein and Christopher D. Manning. 2002.
Conditional structure versus conditional esti-
mation in NLP models. In Proceedings of
the 2002 Conference on Empirical Methods in
Natural Language Processing (EMNLP 2002),
pages 9–16. https://doi.org/10.3115/1118693.1118695
Zhenzhong Lan, Mingda Chen, Sebastian
Goodman, Kevin Gimpel, Piyush Sharma, and
Radu Soricut. 2020. ALBERT: A lite BERT
for self-supervised learning of language repre-
sentations. In 8th International Conference on
Learning Representations, ICLR 2020, Addis
Ababa, Ethiopia, April 26–30, 2020.
Hwaran Lee, Seokhwan Jo, HyungJun Kim,
Sangkeun Jung, and Tae-Yoon Kim. 2020.
Sumbt+ larl: End-to-end neural task-oriented
dialog system with reinforcement
learning.
arXiv preprint arXiv:2009.10447.
Wenqiang Lei, Xisen Jin, Min-Yen Kan,
Zhaochun Ren, Xiangnan He, and Dawei
Yin. 2018. Sequicity: Simplifying task-oriented
dialogue systems with single sequence-to-
sequence architectures. In Proceedings of the
56th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long
Papers), pages 1437–1447.
Jiwei Li, Michel Galley, Chris Brockett, Jianfeng
Gao, and Bill Dolan. 2016. A diversity-
promoting objective function for neural con-
versation models. In NAACL HLT 2016,
The 2016 Conference of the North American
Chapter of the Association for Com-
putational Linguistics: Human Language Tech-
nologies, San Diego California, USA, June
12–17, 2016, pages 110–119. The Association
for Computational Linguistics.
Phillip Lippe, Pengjie Ren, Hinda Haned,
Bart Voorn, and Maarten de Rijke. 2020.
Diversifying task-oriented dialogue response
generation with prototype guided paraphrasing.
CoRR, abs/2008.03391.
Bing Liu and Ian Lane. 2018. End-to-end learning
of task-oriented dialogs. In Proceedings of the
2018 Conference of the North American
Chapter of the Association for Computa-
tional Linguistics: Student Research Work-
shop, pages 67–73.
https://doi.org/10.18653/v1/N18-4010
Shikib Mehri, Tejas Srinivasan, and Maxine
Eskenazi. 2019. Structured fusion networks for
dialog. arXiv preprint arXiv:1907.10016.
Nikola Mrkšić, Diarmuid Ó Séaghdha, Tsung-
Hsien Wen, Blaise Thomson, and Steve Young.
2016. Neural belief tracker: Data-driven dia-
logue state tracking. arXiv preprint
arXiv:1606.03777.
Arvind Neelakantan, Semih Yavuz, Sharan
Narang, Vishaal Prasad, Ben Goodrich, Daniel
Duckworth, Chinnadhurai Sankar, and Xifeng
Yan. 2019. Neural assistant: Joint action predic-
tion, response generation, and latent knowledge
reasoning. arXiv preprint arXiv:1910.14613.
Toan Q. Nguyen and Julian Salazar. 2019.
Transformers without tears: Improving the
normalization of self-attention. arXiv preprint
arXiv:1910.05895.
Elnaz Nouri and Ehsan Hosseini-Asl. 2018.
Toward scalable neural dialogue state tracking
model. arXiv preprint arXiv:1812.00899.
Jiahuan Pei, Pengjie Ren, and Maarten de Rijke.
2019. A modular task-oriented dialogue system
using a neural mixture-of-experts. arXiv pre-
print arXiv:1907.05346.
Baolin Peng, Chunyuan Li, Jinchao Li, Shahin
Shayandeh, Lars Liden, and Jianfeng Gao.
2020a. Soloist: Few-shot task-oriented dialog
with a single pre-trained auto-regressive model.
arXiv preprint arXiv:2005.05298.
Baolin Peng, Chenguang Zhu, Chunyuan Li,
Xiujun Li, Jinchao Li, Michael Zeng, and
Jianfeng Gao. 2020b. Few-shot natural lan-
guage generation for task-oriented dialog.
In Proceedings of the 2020 Conference on
Empirical Methods in Natural Language
Processing: Findings, EMNLP 2020, Online
Event, 16–20 November 2020, pages 172–182.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020.findings-emnlp.17
Shuke Peng, Xinjing Huang, Zehao Lin, Feng
Ji, Haiqing Chen, and Yin Zhang. 2019.
Teacher-student framework enhanced multi-
domain dialogue generation. arXiv preprint
arXiv:1908.07137.
Matthew E. Peters, Mark Neumann, Mohit Iyyer,
Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. 2018. Deep contextu-
alized word representations. In Proceedings of
the 2018 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
NAACL-HLT, pages 2227–2237.
https://doi.org/10.18653/v1/N18-1202
Alec Radford, Jeffrey Wu, Rewon Child, David
Luan, Dario Amodei, and Ilya Sutskever. 2019.
Language models are unsupervised multitask
learners. OpenAI Blog, 1(8):9.
Colin Raffel, Noam Shazeer, Adam Roberts,
Katherine Lee, Sharan Narang, Michael
Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.
2020. Exploring the limits of transfer learning
with a unified text-to-text transformer. J. Mach.
Learn. Res., 21:140:1–140:67.
Abhinav Rastogi, Dilek Hakkani-Tür, and Larry
Heck. 2017. Scalable multi-domain dialogue
state tracking. In 2017 IEEE Automatic Speech
Recognition and Understanding Workshop
(ASRU), pages 561–568. IEEE.
https://doi.org/10.1109/ASRU.2017.8268986
Abhinav Rastogi, Xiaoxue Zang, Srinivas
Sunkara, Raghav Gupta, and Pranav Khaitan.
2020. Towards scalable multi-domain conver-
sational agents: The schema-guided dialogue
dataset. In The Thirty-Fourth AAAI Conference
on Artificial Intelligence, February 7–12, 2020,
pages 8689–8696. AAAI Press.
https://doi.org/10.1609/aaai.v34i05.6394
Antoine Raux, Brian Langner, Dan Bohus, Alan W
Black, and Maxine Eskenazi. 2005. Let’s go
public! taking a spoken dialog system to the
real world. In Ninth European conference on
speech communication and technology.
Stephen Roller, Emily Dinan, Naman Goyal,
Da Ju, Mary Williamson, Yinhan Liu, Jing
Xu, Myle Ott, Kurt Shuster, Eric M Smith, Y-
Lan Boureau, and Jason Weston. 2020. Recipes
for building an open-domain chatbot. arXiv
preprint arXiv:2004.13637.
Stephanie Seneff and Joseph Polifroni. 2000.
Dialogue management in the Mercury flight
reservation system. In ANLP-NAACL 2000
Workshop: Conversational Systems.
https://doi.org/10.3115/1117562.1117565
Claude Shannon. 1948. A mathematical theory of
communication. Bell System Technical Journal,
27:379–423.
Koji Tanaka, Junya Takayama, and Yuki
Arase. 2019. Dialogue-act prediction of future
responses based on conversation history. In
Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics:
Student Research Workshop, pages 197–202.
Bo-Hsiang Tseng, Jianpeng Cheng, Yimai Fang,
and David Vandyke. 2020. A generative model
for joint natural language understanding and
generation. In Proceedings of the 58th Annual
Meeting of the Association for Computational
Linguistics, ACL 2020, Online, July 5–10,
2020, pages 1795–1807. Association for Computational
Linguistics. https://doi.org/10.18653/v1/2020.acl-main.163
Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need. In Advances
in Neural Information Processing Systems,
pages 5998–6008.
Tsung-Hsien Wen, Yishu Miao, Phil Blunsom,
and Steve J. Young. 2017a. Latent intention
dialogue models. CoRR, abs/1705.10229.
Tsung-Hsien Wen, David Vandyke, Nikola
Mrksic, Milica Gasic, Lina Maria Rojas-
Barahona, Pei-Hao Su, Stefan Ultes, and
Steve J. Young. 2017b. A network-based end-
to-end trainable task-oriented dialogue system.
In Proceedings of the 15th Conference of
the European Chapter of the Association
for Computational Linguistics, EACL 2017,
Valencia, Spain, April 3–7, 2017, Volume 1:
Long Papers, pages 438–449. Association for
Computational Linguistics.
Chien-Sheng Wu, Steven C. H. Hoi, Richard
Socher, and Caiming Xiong. 2020a. TOD-BERT:
Pre-trained natural language understanding
for task-oriented dialogue. In Proceedings
of the 2020 Conference on Empirical
Methods in Natural Language Processing,
EMNLP 2020, Online, November 16–20, 2020,
pages 917–929. Association for Computational
Linguistics.
Chien-Sheng Wu, Andrea Madotto, Ehsan
Hosseini-Asl, Caiming Xiong, Richard Socher,
and Pascale Fung. 2019a. Transferable multi-
domain state generator for task-oriented dia-
logue systems. In Proceedings of
the 57th
Conference of the Association for Computa-
tional Linguistics, ACL 2019, Florence, Italy,
July 28–August 2, 2019, Volume 1: Long
Papers, pages 808–819. Association for Com-
putational Linguistics.
Qingyang Wu, Yichi Zhang, Yu Li, and Zhou Yu.
2019b. Alternating recurrent dialog model with
large-scale pre-trained language models. arXiv
preprint arXiv:1910.03756.
Zeqiu Wu, Michel Galley, Chris Brockett, Yizhe
Zhang, Xiang Gao, Chris Quirk, Rik Koncel-
Kedziorski, Jianfeng Gao, Hannaneh Hajishirzi,
Mari Ostendorf, and Bill Dolan. 2020b. A
controllable model of grounded response gen-
eration. arXiv preprint arXiv:2005.00613.
Yunyi Yang, Yunhao Li, and Xiaojun Quan. 2021.
UBAR: Towards fully end-to-end task-oriented
dialog systems with GPT-2. The Thirty-Fifth
AAAI Conference on Artificial Intelligence.
Kyra Yee, Yann N. Dauphin, and Michael Auli.
2019. Simple and effective noisy channel
modeling for neural machine translation. In
Proceedings of the 2019 Conference on
Empirical Methods in Natural Language
Processing and the 9th International Joint
Conference on Natural Language Processing,
EMNLP-IJCNLP 2019, Hong Kong, China,
November 3–7, 2019, pages 5695–5700. Association
for Computational Linguistics.
https://doi.org/10.18653/v1/D19-1571
Yang You, Jing Li, Sashank J. Reddi, Jonathan
Hseu, Sanjiv Kumar, Srinadh Bhojanapalli,
Xiaodan Song, James Demmel, Kurt Keutzer,
and Cho-Jui Hsieh. 2020. Large batch opti-
mization for deep learning: Training BERT in
76 minutes. In 8th International Conference on
Learning Representations, ICLR 2020, Addis
Ababa, Ethiopia, April 26–30, 2020.
Lei Yu, Phil Blunsom, Chris Dyer, Edward
Grefenstette, and Tomáš Kočiský. 2017. The
neural noisy channel. In 5th International
Conference on Learning Representations, ICLR
2017, Toulon, France, April 24–26, 2017,
Conference Track Proceedings.
Lei Yu, Laurent Sartran, Wojciech Stokowiec,
Wang Ling, Lingpeng Kong, Phil Blunsom,
and Chris Dyer. 2020. Better document-level
machine translation with Bayes’ rule.
Transactions of the Association for Computational
Linguistics, 8:346–360.
https://doi.org/10.1162/tacl_a_00319
Jian-Guo Zhang, Kazuma Hashimoto, Chien-
Sheng Wu, Yao Wan, Philip S. Yu, Richard
Socher, and Caiming Xiong. 2019. Find or
classify? Dual strategy for slot-value predic-
tions on multi-domain dialog state tracking.
arXiv preprint arXiv:1910.03544.
Yichi Zhang, Zhijian Ou, and Zhou Yu. 2020a.
Task-oriented dialog systems that consider
multiple appropriate responses under the same
context. In The Thirty-Fourth AAAI Confer-
ence on Artificial Intelligence, AAAI 2020,
The Thirty-Second Innovative Applications of
Artificial Intelligence Conference, IAAI 2020,
The Tenth AAAI Symposium on Educational
Advances in Artificial Intelligence, EAAI 2020,
New York, NY, USA, February 7–12, 2020,
pages 9604–9611. AAAI Press.
https://doi.org/10.1609/aaai.v34i05.6507
Yizhe Zhang, Siqi Sun, Michel Galley, Yen-
Chun Chen, Chris Brockett, Xiang Gao,
Jianfeng Gao, Jingjing Liu, and Bill Dolan.
2020b. DIALOGPT: Large-scale generative
pre-training for conversational response gen-
eration. In Proceedings of the 58th Annual
Meeting of the Association for Computational
Linguistics: System Demonstrations, ACL
2020, Online, July 5–10, 2020, pages 270–278.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020.acl-demos.30
Tiancheng Zhao, Kaige Xie, and Maxine
Eskénazi. 2019. Rethinking action spaces
for reinforcement learning in end-to-end
dialog agents with latent variable models. In
Proceedings of the 2019 Conference of the
North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, NAACL-HLT 2019, Minneapolis,
MN, USA, June 2–7, 2019, Volume 1 (Long and
Short Papers), pages 1208–1218. Association
for Computational Linguistics.
Li Zhou and Kevin Small. 2019. Multi-domain
dialogue state tracking as dynamic knowledge
graph enhanced question answering. arXiv
preprint arXiv:1911.06192.