Pretraining the Noisy Channel Model for Task-Oriented Dialogue

Qi Liu2∗, Lei Yu1, Laura Rimell1, and Phil Blunsom1,2
1DeepMind, United Kingdom   2University of Oxford, United Kingdom
qi.liu@cs.ox.ac.uk
{leiyu,laurarimell,pblunsom}@google.com

Abstract

Direct decoding for task-oriented dialogue is known to suffer from the explaining-away effect, manifested in models that prefer short and generic responses. Here we argue for the use of Bayes' theorem to factorize the dialogue task into two models, the distribution of the context given the response, and the prior for the response itself. This approach, an instantiation of the noisy channel model, both mitigates the explaining-away effect and allows the principled incorporation of large pretrained models for the response prior. We present extensive experiments showing that a noisy channel model decodes better responses compared to direct decoding and that a two-stage pretraining strategy, employing both open-domain and task-oriented dialogue data, improves over randomly initialized models.

1 Introduction

Task-oriented dialogue agents provide a conversational interface to assist users in accomplishing specific goals, such as finding a restaurant or booking a hotel (Seneff and Polifroni, 2000; Raux et al., 2005; Budzianowski et al., 2018; Peng et al., 2020a). Increasing demand from industry for natural language assistants and scalable customer service solutions has recently been driving a renaissance in the development of task-oriented dialogue models. Additionally, the specification of explicit dialogue agent goals, afforded by the task-oriented paradigm, makes such research easier to ground and evaluate than open-domain chatbots.

Current research on task-oriented dialogue is dominated by monolithic sequence-to-sequence models that directly parameterize the conditional distribution of the response given the prior dialogue context. However, this monolithic approach conflates the task-specific and language-general aspects of dialogue, and adversely favors short and generic responses (Bao et al., 2020) due to the explaining-away effect (Klein and Manning, 2002).

∗Work completed during an internship at DeepMind.

Here we pursue an alternative to the direct model. Using Bayes' rule allows us to factorize the probability of the response given the context p(R | C) into a language model p(R) and a context model p(C | R).1 Within natural language processing (NLP), this approach is traditionally known as the noisy channel model (Shannon, 1948), and has recently seen renewed interest with its successful application to neural machine translation (Yu et al., 2017, 2020; Yee et al., 2019). We hypothesize that the noisy channel reformulation is advantageous for dialogue because the factorization enables each sub-module to specialize in a dialogue sub-task. In particular, the context conditional model can help to discount short and generic responses and mitigate the explaining-away effect, while the language model helps ensure that responses are natural. We find that a noisy channel model with the same number of parameters as a direct model achieves better accuracy on three task-oriented dialogue datasets. Moreover, a larger noisy channel model can be trained with the same hardware, by training the sub-modules separately, yielding additional improvements.

It has become common in recent years to pretrain dialogue models on large text data, either general text (Peng et al., 2020b; Budzianowski and Vulić, 2019; Wu et al., 2020a) or dialogue-structured data (Roller et al., 2020; Adiwardana et al., 2020), such as tweets and Reddit posts. We utilise a similar strategy with Reddit data and find that the benefits of pretraining to the noisy channel model are similar to those for the direct model. Further, we evaluate transfer across task-oriented dialogue datasets by implementing a second pretraining stage using Taskmaster (Byrne et al., 2019) and Schema-Guided Dialogue (Rastogi et al., 2020) as training data, before fine-tuning on our final tasks.

1Here we abstract away from the prediction of belief states and dialogue acts, which also form part of our generative model; see Section 3 for details.

Transactions of the Association for Computational Linguistics, vol. 9, pp. 657–674, 2021. https://doi.org/10.1162/tacl_a_00390
Action Editor: Wenjie (Maggie) Li. Submission batch: 2/2021; Revision batch: 2/2021; Published 7/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Figure 1: The data flow of one turn in a task-oriented dialogue for train booking from MultiWOZ.

We evaluate the algorithm on three datasets, MultiWOZ 2.0 (Budzianowski et al., 2018), CamRest676 (Wen et al., 2017a), and SMCalFlow (Andreas et al., 2020), demonstrating that the noisy channel approach is robust to the different dialogue schema annotations used across datasets. Further analysis demonstrates that the noisy channel models can decode responses with similar lengths and Zipf scores compared to ground-truth responses and reduce the likelihood of falling into repetition loops (Holtzman et al., 2019).

2 A Seq-to-Seq Dialogue Model

In this section, we introduce a discriminative sequence-to-sequence model for task-oriented dialogue. The traditional sequence of steps needed to produce a system turn in a task-oriented dialogue is shown in Figure 1, with an example from MultiWOZ 2.0 (Budzianowski et al., 2018). Given a dialogue context containing previous user and system utterances, the dialogue system first predicts a belief state, consisting of a set of slot-value pairs (e.g., destination: Cambridge), to capture user intent. To ground the system with external information, the belief state can be converted into a database query in order to retrieve relevant information, such as the number of matches and booking information. Next, the system predicts a set of dialogue acts, representing the abstract meaning of the proposed dialogue response (Austin, 1975). Finally, a delexicalized dialogue response is generated, where slot values are replaced by generic placeholders, such as value time for a train departure time, in order to reduce lexical variation. The delexicalized response can be converted to a lexicalized response in post-processing by filling in the slot values based on belief states and database information.

We use the MultiWOZ schema for illustration in Sections 2 and 3, but our models easily generalize to different schema annotations (e.g., datasets without annotated dialogue acts [Andreas et al., 2020]).

Because it is well known that pipelined models tend to suffer from error propagation, many NLP tasks have been reformulated in recent years as end-to-end text-to-text transformations (Raffel et al., 2020; Brown et al., 2020). State-of-the-art task-oriented dialogue systems have followed this approach (Hosseini-Asl et al., 2020; Peng et al., 2020b). We represent the example from Figure 1 as follows, serializing turns and using special start and end tokens to encapsulate each data field:

Context: [C] I am looking to . . . [/u] What is your . . . [/R] I'll be leaving . . . [/u] [/C]
Belief: [B] [train] destination Cambridge, day Tuesday, arrive 12:30, departure London [/B]
Database: [db] [train] match 1, status not booked [/db]
Act: [A] [train] inform arrive, inform leave, offer reservation [/A]
Response: [R] There is a train that leaves at [value time] and arrives at [value time]. Should I book it? [/R]

Given this text representation, the direct discriminative approach models p(B, A, R | C), where C, B, A, and R represent the dialogue context, belief state, dialogue act, and delexicalized response, respectively.2 We use the serialized text of the dialogue context as input, and the concatenation of belief state, dialogue act, and response as the target output, making the task amenable to the application of an autoregressive sequence-to-sequence model. B, A, and R can be generated sequentially with direct decoding methods, such as greedy decoding and beam search. We use a sequence-to-sequence Transformer (Vaswani et al., 2017) to implement p(B, A, R | C). This distribution will also be used to build the noisy channel model in Section 3.
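To make the serialization concrete, here is a minimal sketch (our own illustration, not the authors' code) of how one turn could be flattened into the source and target strings of the direct model; the helper name and its arguments are assumptions.

```python
# Minimal sketch: serialize one turn into the source/target strings of the
# direct model p(B, A, R | C). The database field is deterministic given the
# belief state, so it is not part of the modeled target (cf. footnote 2).
def serialize_turn(context_turns, belief, act, response):
    # User turns end with [/u]; system turns end with [/R].
    ctx = " ".join(
        f"{text} [/u]" if role == "user" else f"{text} [/R]"
        for role, text in context_turns
    )
    source = f"[C] {ctx} [/C]"
    target = f"[B] {belief} [/B] [A] {act} [/A] [R] {response} [/R]"
    return source, target

src, tgt = serialize_turn(
    [("user", "I am looking to ..."), ("system", "What is your ..."),
     ("user", "I'll be leaving ...")],
    "[train] destination Cambridge, day Tuesday, arrive 12:30, departure London",
    "[train] inform arrive, inform leave, offer reservation",
    "There is a train that leaves at [value time] and arrives at"
    " [value time]. Should I book it?",
)
```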

3 Noisy Channel Model for Dialogue

While direct decoding is an effective approach for decoding belief states (Hosseini-Asl et al., 2020), it may be sub-optimal for generating responses. First, it favors short and generic responses (Bao et al., 2020). As a result, the decoded responses are bland and lack diversity (Li et al., 2016). Second, it suffers from the explaining-away effect (Klein and Manning, 2002), where inputs are ''explained away'' by highly predictive output prefixes. For example, if there is one hotel matching the user's intent as encoded in the belief state, the model is nevertheless prone to decoding ''no'' given the output prefix ''there is'', ignoring the input information.

In this work, we propose using the neural noisy channel model (Yu et al., 2017) to mitigate the above problems for response generation. Given an input sequence x and output sequence y, the noisy channel formulation (Shannon, 1948) uses Bayes' rule to rewrite the model p(y | x) as p(x | y)p(y) / p(x) ∝ p(x | y)p(y). It was originally applied to speech recognition, where p(y | x) is a conditional model of the source text given a noisy observation. The channel model p(x | y) estimates the probability of the observation given the source, while p(y) is an unconditional language model (or source model), which can be trained on unpaired data. More recently it has been applied to machine translation, where y is a translation of the input text x.

2We do not model the probabilities of database state or lexicalized response, as these are deterministic given the belief state and delexicalized response, respectively.

Abstracting away from belief states and dialogue acts, for task-oriented dialogue we want to estimate p(R | C), the probability of a response given a context. The channel model p(C | R), given a response, predicts a distribution over contexts which might have elicited that response. The source model p(R) is an unconditional language model. In this extension of the noisy channel approach to task-oriented dialogue, the ''channel'' can be understood as connecting dialogue contexts with suitable responses.

For the full task, we develop a noisy channel model for p(B, A, R | C). Using the chain rule, p(B, A, R | C) = p(B | C) · p(A, R | C, B). Following Hosseini-Asl et al. (2020), we use the direct model described in Section 2 to parameterize p(B | C) and decode B, which our preliminary experiments confirmed to be advantageous.

We use the noisy channel formulation to parameterize p(A, R | C, B). Using Bayes' rule, p(A, R | C, B) ∝ p(C, B | A, R) · p(A, R). The channel model p(C, B | A, R) and the source model p(A, R) are implemented as Transformers.

We choose to use the noisy channel formulation for decoding A based on preliminary experiments that showed improved overall accuracy over direct decoding, possibly because poor dialogue act prediction by the direct model led to worse quality responses. The serialized texts of A and R are concatenated during training, and the decoded sequence is split into A and R with the special start/end tokens during decoding.
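As an illustration of that splitting step, a minimal sketch (our own helper, not from the paper) that recovers A and R from a decoded string using the start/end tokens:

```python
# Minimal sketch: split a decoded "[A] ... [/A] [R] ... [/R]" sequence into
# the dialogue act and the delexicalized response using the special tokens.
def split_act_response(decoded: str):
    act = decoded.split("[A]", 1)[1].split("[/A]", 1)[0].strip()
    response = decoded.split("[R]", 1)[1].split("[/R]", 1)[0].strip()
    return act, response

act, response = split_act_response(
    "[A] [train] inform arrive, inform leave [/A] "
    "[R] There is a train that leaves at [value time]. Should I book it? [/R]"
)
```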

We suggest that the noisy channel model has three advantages over the direct model for response generation: (1) The channel model can penalize short and generic responses. Such responses can be mapped to a large number of contexts, resulting in a flat distribution over contexts. This leads to a lower channel model score for short and generic responses (Zhang et al., 2020b). (2) The channel model ensures that (A, R) must explain the corresponding (C, B), alleviating the explaining-away effect (Yu et al., 2017). (3) The
source model, an unconditional distribution over A and R, can make use of abundant non-dialogue textual data for pretraining, further improving the fluency of generated sequences (Brants et al., 2007). We leave exploration of this last advantage for future work, as we pretrain all sub-modules with the same data.

3.1 Decoding

Because exact decoding from the noisy channel model, arg max_{A,R} p(C, B | A, R) · p(A, R),3 is computationally intractable, we experiment with two approximation methods, noisy channel reranking and noisy channel online decoding. Since these methods rely on p(A, R | C, B) as a proposal distribution for approximation, and both p(A, R | C, B) and p(B | C) are parameterized with the direct model introduced in Section 2, our noisy channel model therefore has three sub-modules: a direct model p(B, A, R | C), a channel model p(C, B | A, R), and a source model p(A, R).

Noisy Channel Reranking: Noisy channel reranking first decodes B and then continues decoding a list S of (A, R) pairs by beam search with the direct model, prior to utilizing the noisy channel model to rerank the (A, R) pairs. In particular, during beam search, partial sequences are expanded and pruned with p(A, R | C, B) (from the direct model in Section 2). The pairs after decoding are reranked using the following model combination:

(A∗, R∗) = arg max_{(A,R) ∈ S} [ log p(A, R | C, B)
                                 + λ1 · log p(C, B | A, R)
                                 + λ2 · log p(A, R)
                                 + λ3 · |A, R| ],          (1)

where |A, R| denotes the length of (A, R), and λ1, λ2, and λ3 are hyperparameters. Besides the channel model p(C, B | A, R) and the source model p(A, R), we additionally use the direct model p(A, R | C, B) and a length bias |A, R| to encourage responses with high direct model likelihood and discourage short responses, respectively.
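A minimal sketch of this reranking step (Eq. 1), assuming the three sub-modules are exposed as log-probability functions; the function names and signatures are ours, not the authors' API:

```python
# Minimal sketch of noisy channel reranking (Eq. 1). direct_logp, channel_logp
# and source_logp stand in for the trained sub-modules (assumed interfaces).
def rerank(candidates, direct_logp, channel_logp, source_logp, lam1, lam2, lam3):
    """candidates: (A, R) pairs already decoded by beam search with the
    direct model; returns the pair maximizing the Eq. 1 combination."""
    def score(pair):
        act, resp = pair
        length = len(act) + len(resp)             # |A, R|: length of the pair
        return (direct_logp(act, resp)            # log p(A, R | C, B)
                + lam1 * channel_logp(act, resp)  # λ1 · log p(C, B | A, R)
                + lam2 * source_logp(act, resp)   # λ2 · log p(A, R)
                + lam3 * length)                  # λ3 · |A, R| (length bias)
    return max(candidates, key=score)
```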

Noisy Channel Online Decoding: In contrast to reranking, online decoding applies the noisy channel model during beam search for pruning partial sequences, thus exploring a larger search space.

3Although exact decoding is also computationally intractable for the direct model, approximating arg max_B p(B | C) is well-studied, e.g., beam search. The decoding for B is therefore omitted here.

Algorithm 1: Online decoding for the noisy channel.
  Input: Context C
  Output: Belief, act, and response (B, A, R)
  Decode B given C with p(B | C)
  Beam: S = {([A])}
  while end(S) is False do
      S′ = ∅
      for O in S do
          if O.last() is [/R] or |O| > l then
              S′.add(O)
              continue
          end
          Get k1 tokens o_1, . . . , o_{k1} from the direct model p(o_{|O|+1} | C, B, O)
          for o_i in (o_1, . . . , o_{k1}) do
              S′.add((O, o_i))
          end
      end
      S = top-k2 over O ∈ S′ of [ log p(O | C, B) + λ1 · log p(C, B | O)
                                  + λ2 · log p(O) + λ3 · |O| ]
  end
  Select O ∈ S with the largest score using Eq. 1 and return (B, A, R)

As shown in Algorithm 1, we first decode the belief state with p(B | C), which comes from the direct model in Section 2. Then, starting with a beam S containing a single sequence [A] (the dialogue act start token), we continuously expand the sequences in S until end(S) is met, namely, all sequences in S either end with [/R] or have lengths larger than l. In each iteration, we first expand the sequences in the beam, then prune the expanded beam. To expand a partial act and response sequence (denoted as O in Algorithm 1), a naive way is to use the noisy channel model to score |V| (the vocabulary size) possible expansions, which is computationally expensive. Instead, we use the probability of the next token p(o_{|O|+1} | C, B, O) (where |O| denotes the length of O) to select k1 candidates to be scored by the noisy channel model. This next token probability is from the direct model introduced in Section 2. One straightforward way to select k1 expansions from p(o_{|O|+1} | C, B, O) is top-k maximization, but we can also take advantage of advances in sampling from a categorical distribution for text generation (e.g., top-k sampling [Fan et al., 2018] and nucleus sampling [Holtzman
et al., 2019]). After the expansion, we prune the expanded beam S′ to obtain a smaller beam with k2 partial sequences based on the model combination in Eq. 1. Compared to noisy channel reranking, online decoding applies the noisy channel model during beam search, which is potentially less biased towards the direct model.
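A minimal sketch of one expand-and-prune iteration of Algorithm 1, assuming hypothetical helpers for the direct-model next-token proposals and the three scoring functions (none of these names come from the paper):

```python
# Minimal sketch of one iteration of noisy channel online decoding
# (Algorithm 1). next_token_topk, direct_logp, channel_logp and source_logp
# are assumed stand-ins for the trained sub-modules, not the authors' API.
def expand_and_prune(beam, k1, k2, lam1, lam2, lam3,
                     next_token_topk, direct_logp, channel_logp, source_logp,
                     end_token="[/R]", max_len=1024):
    expanded = []
    for seq in beam:
        # Finished hypotheses are carried over unchanged.
        if seq[-1] == end_token or len(seq) > max_len:
            expanded.append(seq)
            continue
        # Propose k1 continuations with the direct model p(o_{|O|+1} | C, B, O).
        for token in next_token_topk(seq, k1):
            expanded.append(seq + [token])

    # Prune the expanded beam with the model combination of Eq. 1.
    def score(seq):
        return (direct_logp(seq)             # log p(O | C, B)
                + lam1 * channel_logp(seq)   # λ1 · log p(C, B | O)
                + lam2 * source_logp(seq)    # λ2 · log p(O)
                + lam3 * len(seq))           # λ3 · |O|
    return sorted(expanded, key=score, reverse=True)[:k2]
```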

In summary, we note that beam search for the direct model and online decoding for our noisy channel model both decode (B, A, R) autoregressively. Thus both approaches are end-to-end models for task-oriented dialogue. The key difference is that noisy channel online decoding uses Eq. 1 for pruning, while the direct model uses p(A, R | C, B).

4 Model and Pretraining

We use three Transformer (Vaswani et al., 2017) networks to parameterize the direct model p(B, A, R | C), the channel model p(C, B | A, R), and the source model p(A, R), respectively. The input to each Transformer is the sum of four embeddings: word embeddings, position embeddings, role embeddings (user/system), and turn embeddings (each word corresponds to a turn number). Cross entropy is used as the loss function.
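A rough sketch of this input representation, assuming integer id arrays and illustrative table sizes (the shapes and initialization here are ours):

```python
import numpy as np

# Minimal sketch: the Transformer input is the sum of word, position, role
# (user/system), and turn embeddings. Table sizes below are illustrative.
VOCAB, MAX_LEN, N_ROLES, MAX_TURNS, D_MODEL = 50257, 1024, 2, 64, 512
rng = np.random.default_rng(0)
word_emb = rng.normal(0.0, 0.1, (VOCAB, D_MODEL))
pos_emb = rng.normal(0.0, 0.1, (MAX_LEN, D_MODEL))
role_emb = rng.normal(0.0, 0.1, (N_ROLES, D_MODEL))
turn_emb = rng.normal(0.0, 0.1, (MAX_TURNS, D_MODEL))

def embed(token_ids, role_ids, turn_ids):
    positions = np.arange(len(token_ids))
    return (word_emb[token_ids] + pos_emb[positions]
            + role_emb[role_ids] + turn_emb[turn_ids])
```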

Given training samples (C, B, A, R), if we train the channel model using complete (A, R) pairs as input, a significant discrepancy arises between training and decoding for noisy channel online decoding. Since the channel model is used to score partial act and response pairs, that is, p(C, B | O) in Algorithm 1, a channel model trained with complete (A, R) pairs is unsuited to scoring partial sequences. In order to manually create partial sequences during training that are better matched to online decoding, we truncate the (A, R) pairs with a truncation length uniformly sampled from 1 to the sequence length (inclusive). The direct model and the source model are trained with complete sequences, as partial sequences occur naturally in their standard autoregressive training procedure.
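A small sketch of this truncation step (our own illustration; `tokens` stands for the concatenated (A, R) token sequence):

```python
import random

# Minimal sketch: sample a partial (A, R) prefix for channel-model training,
# with the truncation length drawn uniformly from 1 to the full length.
def truncate_act_response(tokens):
    cut = random.randint(1, len(tokens))  # inclusive on both ends
    return tokens[:cut]
```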

As in-domain dialogue data are usually scarce, we use a two-stage pretraining strategy to enhance the noisy channel model. Although the effectiveness of pretraining with Reddit data has been validated for open-domain dialogue (Zhang et al., 2020b; Bao et al., 2019; Adiwardana et al., 2020), relatively little work has applied such data to task-oriented dialogue.4 In the first stage, we explore Reddit pretraining (where the Reddit data is pre-processed into (C, R), i.e., context-response, pairs as described below). In the second stage, we use two task-oriented dialogue datasets, Taskmaster5 (Byrne et al., 2019) and Schema-Guided Dialogue6 (Rastogi et al., 2020), to specialize the Reddit-pretrained models. Because the Reddit data consists of open-domain-style dialogues (where belief states and dialogue acts are missing), pretraining on these datasets can familiarize the models with the sequence-to-sequence representation of task-oriented dialogue. Three models, a context-to-response model, a response-to-context model, and a response language model, are pretrained to initialize the direct model, the channel model, and the source model, respectively.

4.1 Implementation Details

Models: All models are implemented with JAX (Bradbury et al., 2018) and Haiku (Hennigan et al., 2020). For the direct model introduced in Section 2, we use a Transformer model with hidden size 512, 12 encoder-decoder layers, and 16 self-attention heads. The model has 114M parameters. For the noisy channel model, we use a base setting and a large setting. The base setting reduces the number of layers to 5, the hidden size to 384, and the number of self-attention heads to 12. Its sub-modules, a direct model, a reverse model, and a language model, have 43M, 43M, and 30M parameters, respectively. We employ the base setting for a fair comparison with a single direct model using roughly the same number of parameters (116M vs. 114M). For the large setting, we use the same hyperparameters as the direct model (114M), so that its sub-modules, a direct model, a reverse model, and a language model, have 114M, 114M, and 64M parameters, respectively. We use this large setting to explore the limits of the noisy channel model. The large noisy channel model (292M) is 2.56 times larger than the direct model (114M). This illustrates another advantage of the noisy channel model during training. While training a direct model with 292M parameters would overflow the memory of 16GB TPUs (v3) without using model parallelism, training the sub-modules of the large noisy channel model easily fits into 16GB TPUs, as these modules are trained independently, with no need to load all three modules at once. This enables us to train a noisy channel model with more parameters than a direct model trained on the same hardware. For inference, we still need to load the sub-modules onto a TPU. Because gradients are not required during inference, we are able to load the three sub-modules of the large noisy channel model (292M) into a single TPU with 16GB memory for decoding. The large noisy channel model (292M) still consumes more memory than the direct model (114M) during inference.

4One exception is Henderson et al. (2019), who use Reddit data to improve response retrieval and selection. We focus on response generation in this work.

5https://cutt.ly/xkuUHUa.
6https://cutt.ly/QkuUZUu.


Dataset       # Dialog   # Turn    Avg. Turn/Dialog   Avg. Token/Turn   # Domain   Multi-Task   # Unique Slot   # Unique Value
Taskmaster    17,304     341,801   19.75              7.87              7          –            281             66,659
Schema        22,825     463,284   20.3               9.86              17         ✓            123             23,889
CamRest676    676        5,488     8.12               10.71             1          –            4               89
MultiWOZ      10,438     143,048   13.7               15.03             7          ✓            46              11,828
SMCalFlow     41,517     170,590   4.11               8.77              4          ✓            –               –

Table 1: Statistics of task-oriented dialogue datasets. We define a multi-task dialogue as a dialogue involving multiple tasks (e.g., hotel and restaurant booking) while its counterpart handles a single task (e.g., hotel booking). Taskmaster and CamRest676 do not contain any multi-task dialogues.


Pretraining Settings: The maximum sequence length l is set to 1024, and longer sequences are truncated. We reuse the vocabulary from GPT-2 (Radford et al., 2019), which contains 50,257 BPE tokens. We use PreNorm (Nguyen and Salazar, 2019) for faster convergence. GELU (Hendrycks and Gimpel, 2016) is applied as the activation function. Following ALBERT (Lan et al., 2020), dropout is disabled during pretraining. We use the normal distribution truncated to the range [−0.01, 0.01] to initialize the input embeddings, while other parameters are initialized using the normal distribution with zero mean and standard deviation 0.1. The batch size is set to 256. The LAMB optimizer (You et al., 2020) (β1 = 0.9 and β2 = 0.999) is employed for optimization. The initial learning rate is 1e-7, and we apply 4000 warmup steps to increase the learning rate to 1e-3, before utilizing cosine annealing to decay the learning rate. Gradient clipping with clipping value 1 is applied to avoid gradient explosion. We use gradient accumulation with accumulation step 20.
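As a rough sketch of this learning-rate schedule (linear warmup from 1e-7 to 1e-3 over 4000 steps, then cosine annealing; the decay horizon of 400,000 steps is our assumption, matching the number of pretraining steps reported below):

```python
import math

# Minimal sketch of the pretraining learning-rate schedule: linear warmup
# from 1e-7 to 1e-3 over 4000 steps, then cosine annealing. The decay horizon
# (400,000 steps) is an assumption for illustration.
def learning_rate(step, warmup=4000, total=400_000, lr_init=1e-7, lr_peak=1e-3):
    if step < warmup:
        return lr_init + (lr_peak - lr_init) * step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * lr_peak * (1.0 + math.cos(math.pi * min(progress, 1.0)))
```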

Pretraining: For Reddit pretraining, we download a Reddit dump (with Reddit posts ranging from 2005-12 to 2019-09) from PushShift.7 Since the comments of a Reddit post are organized into a tree, we extract paths from the tree as dialogue turns. The last comment of each comment path is regarded as the response, while the others are used as the dialogue context. We pretrain each model for 400,000 steps, consuming 102,400,000 (400,000 × 256) comment paths in total. For the task-oriented pretraining, we combine the two datasets, Taskmaster and Schema-Guided Dialogue, and pretrain for 1e5 steps. The statistics of the task-oriented dialogue datasets are shown in Table 1.

We train each model using 64 TPU chips with
16GB memory each. The pretraining takes around
4 days to complete.

5 Experiments

We fine-tune and evaluate the pretrained models on three dialogue datasets: MultiWOZ 2.0, CamRest676, and SMCalFlow (Andreas et al., 2020). In this section we describe the datasets (Section 5.1), fine-tuning (Section 5.2), decoding (Section 5.3), and evaluation metrics (Section 5.4). Results are presented in Section 6, and analysis and ablation studies in Section 7.

5.1 Datasets

MultiWOZ8 is a multi-domain dataset consisting of dialogues annotated with C, B, A, R in the following seven domains: attraction, hotel, hospital, police, restaurant, train, and taxi. Since its release, MultiWOZ has been one of the most commonly used task-oriented dialogue datasets.

CamRest6769 is annotated similarly to MultiWOZ and consists of dialogues in a single domain: restaurant reservations. Though CamRest676 is smaller than MultiWOZ and predates it, it still provides a widely used benchmark for evaluating task-oriented dialogue models.

7https://pushshift.io/.
8https://cutt.ly/0kuUCRS.
9https://cutt.ly/SkuUNfE.




SMCalFlow consists of dialogues in four domains: calendar, weather, places, and people. Unlike MultiWOZ and CamRest676, SMCalFlow uses dataflow graphs instead of slot-value pairs to represent belief states and does not annotate dialogue acts. We refer readers to Andreas et al. (2020) for a detailed description of the dataflow representation. We follow Andreas et al. (2020) to convert dataflow graphs into sequences so that seq2seq models can be applied. This dataset is newer and offers fewer prior models to compare with, but we use it to study the robustness of the noisy channel model under different annotation schemas.

We use the public splits for these datasets, where MultiWOZ, CamRest676, and SMCalFlow are split into 8438/1000/1000, 404/136/136, and 32647/3649/5211 dialogues for training, development, and testing, respectively. However, because SMCalFlow's test set has not been publicly released, we randomly select 500 dialogues from its training set to tune hyperparameters and use its development set for testing.

Preprocessing: We use the standard preprocessing procedures for each dataset in order to facilitate fair comparison with previous methods.10,11,12 In particular, for MultiWOZ and CamRest676, delexicalization is used to reduce lexical variation, while SMCalFlow does not use delexicalization. During delexicalization, slot values are replaced by generic placeholders based on a predefined dictionary. During decoding, following prior work, our dialogue models generate delexicalized responses. These delexicalized responses are re-lexicalized in post-processing by replacing placeholders with their corresponding slot values based on belief states and database information. Since there is no public code for lexicalization,13 we implement our own functions for lexicalization with regular expressions, for the purpose of displaying example responses. However, this does not affect reported results, as the standard metrics for MultiWOZ and CamRest676 that we adopt here are calculated using delexicalized responses.

10https://cutt.ly/TkuU1oM.
11https://cutt.ly/zkuU0Ht.
12https://cutt.ly/vkuU9bT.
13We confirmed this with the dataset authors by email.

5.2 Fine-Tuning

We apply label smoothing with parameter 0.1. Dropout is used on input embeddings and hidden representations, with dropout rate 0.1. The Adam optimizer (Kingma and Ba, 2015) (β1 = 0.9 and β2 = 0.999) is adopted. We use a fixed learning rate of 1e-4 with gradient clipping for fine-tuning.

5.3 Decoding

We use direct decoding for the belief state. For the dialogue act and response, we study three decoding methods: direct decoding, noisy channel reranking, and noisy channel online decoding. Since all of these decoding methods require choosing k1 tokens from a categorical distribution during expansion, we compare four methods: top-k maximization, sampling without replacement, top-k sampling, and nucleus sampling. Nucleus sampling with cumulative probability 0.98 performs marginally better and is adopted. We perform a range search with the range [1, 20] on the development sets for the beam sizes k1 and k2, and we set k1, k2 = 4, k1, k2 = 15, and k1, k2 = 4 for MultiWOZ, CamRest676, and SMCalFlow, respectively. For noisy channel reranking and noisy channel online decoding, a grid search with range [0, 2] is performed for λ1, λ2, and λ3. We set (λ1 = 0.8, λ2 = 1, λ3 = 0.8), (λ1 = 1.2, λ2 = 1.2, λ3 = 0.8), and (λ1 = 0.4, λ2 = 1, λ3 = 0.2) for MultiWOZ, CamRest676, and SMCalFlow, respectively.

5.4 Evaluation Metrics

For MultiWOZ and CamRest676, following previous work, we adopt three automatic evaluation metrics: inform, success, and BLEU score. Peng et al. (2020a) showed that these metrics are well correlated with human evaluation. The evaluators14,15 provided with the datasets are used for calculating these metrics. To calculate the inform score for a dialogue, the evaluator first checks whether certain placeholders (e.g., [restaurant name]) appear in decoded responses. If so, decoded belief states are converted to database queries to retrieve database records. These database records are compared with the records retrieved with ground-truth belief states. The inform score is one if these two sets of database records match. The success score takes

14https://cutt.ly/VkuU3FA.
15https://cutt.ly/MkuU88u.


Model                                      Inform ↑   Success ↑   BLEU ↑   Combined ↑
Sequicity (Lei et al., 2018)               66.4       45.3        15.54    71.39
HRED-TS (Peng et al., 2019)                70.0       58.0        17.50    81.50
DSTC8 Track 1 Winner (Ham et al., 2020)    73.0       62.4        16.00    83.50
DAMD (Zhang et al., 2020a)                 76.4       60.4        16.60    85.00
SimpleTOD (Hosseini-Asl et al., 2020)      84.4       70.1        15.01    92.26
SOLOIST (Peng et al., 2020a)               85.5       72.9        16.54    95.74
UBAR (Yang et al., 2021)†                  88.2       79.5        16.43    100.28

Randomly Initialized
Direct decoding (114M)                     81.0       54.7        15.12    82.97
Noisy channel reranking (116M)             82.7       57.1        15.29    85.19
Noisy channel online decoding (116M)       82.9       58.9        15.33    86.23
Noisy channel reranking (292M)             82.1       58.1        15.37    85.47
Noisy channel online decoding (292M)       83.9       60.9        15.57    87.97

Reddit Pretraining
Direct decoding (114M)                     81.0       69.2        17.06    92.16
Noisy channel reranking (116M)             81.3       70.1        19.01    94.71
Noisy channel online decoding (116M)       81.6       71.1        19.31    95.66
Noisy channel reranking (292M)             82.2       70.9        19.89    96.44
Noisy channel online decoding (292M)       82.4       71.7        20.49    97.54

Task-Oriented Pretraining
Direct decoding (114M)                     85.2       72.9        17.00    96.05
Noisy channel reranking (116M)             85.6       73.8        19.38    99.08
Noisy channel online decoding (116M)       85.9       74.8        19.76    100.11
Noisy channel reranking (292M)             86.5       74.9        20.31    101.01
Noisy channel online decoding (292M)       86.9       76.2        20.58    102.13

Table 2: MultiWOZ test results (end-to-end modeling with generated beliefs) with seq2seq approaches.
Results are significant (P < 0.01) comparing noisy channel decoding and direct decoding. † Yang et al. (2021) also report a combined score of 105.1 with an alternative context and evaluation setting, contributions orthogonal to our work and the other benchmarks reported here. all the requestable slots (e.g., postcode, phone number, and address) from a decoded response and compares these requestable slots with the ones in the ground-truth response. The success score is one if generated requestable slots coincide with the ground-truth ones. BLEU score (BLEU-4) compares the n-grams of generated responses and human responses, and is a widely used metric in NLP for evaluating text quality. Following Budzianowski et al. (2018), we also calculate a combined score, which is (Inform + Success) / 2 + BLEU. For SMCalFlow, inform and success scores are not applicable because calculation of these scores relies on delexicalization placehold- ers, and this dataset does not use delexicalization. We use SacreBLEU16 and TER17 to directly mea- sure the quality of responses. As prior work on 16https://cutt.ly/BkuU7dL. 17https://pypi.org/project/pyter/. this dataset has focused on belief tracking rather than end-to-end response generation, we are the first to use these metrics on this dataset. We perform significance tests, where we use t-test for inform, success, and TER scores and use permutation test for BLEU. 6 Results MultiWOZ: Results on the MultiWOZ test set are shown in Table 2. We observe several trends. First, the base noisy channel model (116M) per- forms better than direct decoding (114M), despite having a similar number of parameters, showing that the noisy channel factorization is beneficial for task-oriented dialogue. The large noisy chan- nel setting improves further over the base setting. Second, Reddit pretraining provides benefits over random initialization, validating the use of large 664 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 3 9 0 1 9 2 9 7 2 7 / / t l a c _ a _ 0 0 3 9 0 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Model Inform ↑ Success ↑ BLEU ↑ Combined ↑ Sequicity (Lei et al., 2018) GPT-2 fine-tuned (Wu et al., 2019b) ARDM (Wu et al., 2019b) SOLOIST (Peng et al., 2020a) 92.3 - - 94.7 85.3 86.2 87.1 87.1 Randomly Initialized Direct decoding (114M) Noisy channel online decoding (116M) Noisy channel online decoding (292M) 78.1 79.8 80.9 Reddit Pretraining Direct decoding (114M) Noisy channel online decoding (116M) Noisy channel online decoding (292M) 93.3 93.7 93.9 83.5 84.1 84.9 83.9 84.5 84.7 Direct decoding (114M) Noisy channel online decoding (116M) Noisy channel online decoding (292M) 93.4 94.3 95.4 84.3 85.2 85.3 Task-Oriented Pretraining 21.40 19.20 25.20 25.50 21.58 22.83 23.19 23.41 25.14 25.38 24.92 25.98 26.89 110.20 - - 116.40 102.38 104.78 106.09 112.01 114.24 114.68 113.77 115.73 117.24 Table 3: CamRest676 test results (end-to-end modeling with generated beliefs) with seq2seq approaches. Noisy channel reranking performs comparable with noisy channel online decoding, and the results are not shown. Results are significant (p < 0.01) comparing noisy channel decoding and direct decoding. 
Model SacreBLEU ↑ TER ↓ Direct decoding (114M) Online decoding (116M) Online decoding (292M) Randomly Initialized 51.30 53.66 54.39 Reddit Pretraining Direct decoding (114M) Online decoding (116M) Online decoding (292M) 60.68 63.29 63.91 Task-Oriented Pretraining Direct decoding (114M) Online decoding (116M) Online decoding (292M) 61.02 63.72 64.29 89.13 74.18 73.18 61.99 47.16 46.43 59.84 46.27 45.81 Table 4: SMCalFlow results. Reranking performs worse than online decoding, and the results are not shown. Results are significant (p < 0.01) comparing noisy channel decoding and direct decoding. open-domain dialogue-genre pretraining for task- oriented dialogue, while the models with a second stage of task-oriented pretraining obtain further improvements. This effect is consistent across both direct and noisy channel decoding. Finally, we observe that online decoding consistently outperforms reranking, indicating the benefits of tighter model integration during decoding. Our model performs better on combined score than SOLOIST (Peng et al., 2020a), a closely related baseline that pretrains a GPT2-initialized Transformer with Taskmaster and Schema- Guided Dialogue and decodes with nucleus sampling. CamRest676: Results on the CamRest676 test set are shown in Table 3. We observe that the base noisy channel model (116M) obtains bet- ter results compared to direct decoding (114M), again demonstrating the effectiveness of the noisy channel model. Reddit pretraining again provides a large benefit over random initialization for both direct decoding and noisy channel decoding, while task-oriented pretraining provides a further boost. Our model again performs better than SOLOIST. SMCalFlow: Results on the SMCalFlow devel- opment set are shown in Table 4. As end-to-end models have not previously been tested on this dataset, we use it to demonstrate that the noisy channel model, which we developed primar- ily on MultiWOZ, continues to be effective on task-oriented dialogue datasets with different annotation schema. The results are consistent with MultiWOZ and CamRest676. The noisy channel model outperforms the direct model by a large margin, demonstrating that dialogue act annotations are not essential for the noisy channel 665 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 3 9 0 1 9 2 9 7 2 7 / / t l a c _ a _ 0 0 3 9 0 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Model CamRest676 MultiWOZ Direct decoding 115.17 Noisy Channel Online Decoding Direct + Channel Direct + Source Direct + Length Channel + Source Channel + Length Source + Length All - Direct All - Channel All - Source All - Length All 115.63 115.91 115.56 115.82 115.60 115.62 115.96 116.56 116.38 116.52 116.91 96.73 98.54 99.12 97.57 99.18 98.71 99.19 100.89 100.93 99.92 101.11 102.62 Table 5: Ablation results for model combination on development sets (combined score). Results for reranking are similar and are not shown. ‘All’, ‘Direct’, ‘Source’, and ‘Channel’ denote no ablation, direct model, source model and channel model, respectively. Rows with ‘+’ are combinations of two sub-modules, while the rows with ‘-’ are combinations of three sub-modules. model, and that it remains effective across diverse dialogue representations. Reddit pretraining confers a similar large ben- efit on SMCalFlow as on the other datasets, but we observe that task-oriented pretraining brings only marginal further improvements. 
This may be due to differences in domain or format between our pretraining datasets and SMCalFlow. Alterna- tively, task-oriented pretraining may help more on task-specific metrics, such as inform and success than on text quality metrics such as scores, BLEU and TER scores. This hypothesis is further supported by the MultiWOZ results in Table 2. 7 Analysis In this section, we use MultiWOZ and Cam- Rest676 to perform ablation studies on the effects of model combination, large-scale pretraining, and sample efficiency; as well as analyzing the runtime requirements of our model and the reasons for its success. 7.1 Ablation on Model Combination Noisy channel decoding involves a combination of four sub-modules, as in Eq. 1: the direct model, channel model, language model, and length bias. We perform an ablation study to determine whether all model components are important to the result, using the large model. Results on the development sets of CamRest676 and MultiWOZ are presented in Table 5. Note that the ablation is performed after applying the direct model to obtain k1 expansions at each beam search step for noisy channel online decoding. We find that the combination of all four sub-modules performs the best, followed by combinations of three and then two sub-modules. The results are significant when comparing ‘All’ and the baselines (p < 0.01). This result demonstrates the effectiveness of the noisy channel factorization, and the importance of each model component. 7.2 Effect of Pretraining Scale We investigate the importance of scale for both our pretraining stages. We select different check- points for Reddit pretraining, and truncate the two task-oriented dialogue datasets for task-oriented pretraining. We fine-tune these models using the full training data of CamRest676 or MultiWOZ. The results of three decoding methods (with the large noisy channel model) on the development sets are shown in Figure 2. In Figure 2 (a) and (c), the combined scores of all three decoding methods improve with more Reddit pretraining steps, demonstrating the advantage of increasing amounts of data in the open-domain dialogue pretraining stage. In Figure 2 (b) and (d), the combined scores further increase with more task-oriented data, confirming that additional task-oriented pretraining data is useful. 7.3 Sample Efficiency of Fine-Tuning We investigate whether pretraining can improve sample efficiency during fine-tuning. We gradu- ally increase the amount of fine-tuning data and evaluate the randomly-initialized, Reddit pre- trained and task-oriented pretrained models. The results on the development sets are shown in Figure 3. Combined scores increase with more training data under all conditions. Crucially, Reddit pretrained models show better performance with a smaller amount of fine-tuning data than randomly initialized models, and task-oriented pretrained models better still. We conclude that both our pretraining stages can improve sample efficiency, which is especially important when the target task has little training data. 666 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 3 9 0 1 9 2 9 7 2 7 / / t l a c _ a _ 0 0 3 9 0 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Figure 2: Results showing the effect of pretraining scale. l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 
1 0 1 1 6 2 / t l a c _ a _ 0 0 3 9 0 1 9 2 9 7 2 7 / / t l a c _ a _ 0 0 3 9 0 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Figure 3: Pretraining improves sample efficiency during fine-tuning. 7.4 Decoding Runtime In Table 6, we report the average clock time for decoding one turn (including its belief state, dia- logue act and response). Noisy channel reranking is slightly slower compared to direct decoding, with overhead due to the reranking step in Eq. 1. Noisy channel online decoding is significantly slower, since it needs to apply Eq. 1 at each beam search step. In future work we will investigate ways to improve the efficiency of online decoding. 7.5 Decoding Properties In this section we analyze why the noisy channel model performed better than direct decoding. Model CamRest676 MultiWOZ Direct decoding Reranking Online decoding 4.89 5.43 8.73 6.48 6.92 10.97 Table 6: Average decoding time (in seconds) for each turn with different decoding methods. Length: In Table 7 we show the average length of generated responses. Direct decoding produces shorter responses than the ground truth, confirm- ing that the direct model prefers short and generic responses. Adding a length bias to direct decoding (with lambda tuned on the development sets) produces responses longer than the ground truth, 667 Model CamRest676 MultiWOZ Ground truth Direct decoding Direct decoding + Length Reranking Online decoding 14.50 12.07 15.98 15.09 15.14 16.91 12.85 17.73 17.47 17.32 Table 7: The average length of responses with different decoding methods (on test set). The value closest to the ground truth is bold. Model CamRest676 MultiWOZ Ground truth Direct decoding Reranking Online decoding 1.07 0.84 0.87 0.89 1.22 0.91 0.99 1.03 Table 8: The Zipf scores of responses with dif- ferent decoding methods (on test set). The value closest to the ground truth is bold. Model CamRest676 MultiWOZ Direct decoding Reranking Online decoding 0.24 0.12 0.08 0.31 0.14 0.11 Table 9: The likelihood (%) of falling into repe- tition loops for different decoding methods (on test set). which may be a disadvantage. The noisy channel models produce responses with average length closest to the ground truth. Zipf: Table 8 shows the Zipf scores of re- sponses. We find that the word distributions of responses generated by the noisy channel models are closer to the word distribution of ground-truth responses. Repetition: In Table 9 we examine the like- lihood of falling into repetition loops (Holtzman et al., 2019) for different decoding methods. Re- petition loops are rare for all decoding methods, but noisy channel decoding can further decrease their likelihood. The channel model can discount a sequence with a repetition loop, since it conveys less information than a natural sequence of the same length, making it harder to ‘‘explain’’ the context. Examples: Some examples of responses are shown in Table 10. We observe that noisy chan- nel models decode longer responses compared to direct decoding, and that the responses can explain their dialogue contexts well to meet users’ requirements. 
8 Related Work Task-Oriented Dialogue Models: Most task- oriented dialogue systems break down the task into three components: belief tracking (Henderson et al., 2013; Mrkˇsi´c et al., 2016; Rastogi et al., 2017; Nouri and Hosseini-Asl, 2018; Wu et al., 2019a; Zhang et al., 2019; Zhou and Small, 2019; Heck et al., 2020), dialogue act prediction (Wen et al., 2017a; Tanaka et al., 2019), and response generation (Chen et al., 2019; Budzianowski et al., 2018; Lippe et al., 2020). Traditionally, a mod- ular approach is adopted, where these components are optimized independently (i.e., a pipeline design) or learned via multi-task learning (i.e., some parameters are shared among the compo- nents) (Wen et al., 2017b; Neelakantan et al., 2019; Zhao et al., 2019; Mehri et al., 2019; Tseng et al., 2020; Lee et al., 2020). However, it is known that improvements in one component do not necessarily lead to overall performance improvements (Ham et al., 2020), and the mod- ular approach suffers from error propagation in practice (Liu and Lane, 2018). These observations gave rise to the sequence-to-sequence approach (Lei et al., 2018; Pei et al., 2019; Budzianowski and Vuli´c 2019; Wu et al., 2019b; Zhang et al., 2020a; Ham et al., 2020; Hosseini-Asl et al., 2020; Peng et al., 2020a; Yang et al., 2021), where dialogue beliefs and acts are represented as text spans, and a sequence-to-sequence model is applied to subsume the three components. Our work is situated within this general approach. In contrast to previous work, however, which uses a direct model for decoding, we introduce the noisy channel model to improve task-oriented dialogue. Pretraining Models for Dialogue: Recent work has applied pretraining (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2019) to dialogue. For open-domain dialogue, DialoGPT (Zhang et al., 2020b) and CGRG (Wu et al., 2020b) extend GPT-2 (Radford et al., 2019) for res- ponse generation. PLATO (Bao et al., 2019) and PLATO-2 (Bao et al., 2020) pretrain a latent vari- able model with social media data for diversified response generation. Meena (Adiwardana et al., 2020) collects a large-scale social media corpus 668 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 3 9 0 1 9 2 9 7 2 7 / / t l a c _ a _ 0 0 3 9 0 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Model Delexicalized Response SacreBLEU ↑ Dialogue: MUL1624, Turn Number: 4 Ground truth Direct decoding Reranking Online decoding Sure. The trainid is [train id]. It leaves at [value time] and arrives at [value time]. Would you like to make a reservation? The travel time is [value count] minutes. [train id] leaves [value place] at [value time] and arrives in [value place] at [value time]. Would you like me to book that for you? [train id] leaves at [value time] and arrives at [value time]. Would you like me to book that? Dialogue: MUL1276, Turn Number: 6 Ground truth Direct decoding Reranking Online decoding The [restaurant name] has a postal code of [restaurant postcode]. Will this work for you? I can book it if you please. There are [value count] places. What type of food would you like? How about [restaurant name] located at [restaurant address], postcode [restaurant postcode]. How about [restaurant name] located at [restaurant address], postcode [restaurant postcode]? 
Dialogue: MUL1898, Turn Number: 2 Ground truth Direct decoding Reranking Online decoding Yes, [attraction name] is on [attraction address] and is in the [value area] side of town. Is there anything else you need to know? [attraction name] is located in the [value area] part of town and has free admission. [attraction name] is located in the [value area] of town at [attraction address]. The entrance fee is free. Can I help you with anything else? [attraction name] is located in the [value area] part of town at [attraction address]. Can I help you with anything else? – 2.03 46.33 60.90 – 1.63 24.90 22.00 – 27.53 41.66 42.38 Table 10: Case study on the responses decoded by direct decoding, noisy channel reranking, and noisy channel online decoding. The large noisy channel model is used. for pretraining and proposes a metric named sen- sibleness and specificity average for evaluation. Roller et al. (2020) study various strategies for building an open-domain chatbot with Reddit for pretraining. For task-oriented dialogue, ToD- BERT (Wu et al., 2020a) fine-tunes BERT (Devlin et al., 2019) for four tasks, including intention detection, belief tracking, dialogue act prediction, and response selection. SC-GPT (Peng et al., 2020b) fine-tunes GPT-2 for few-shot re- sponse generation with given dialogue acts. Ham et al. (2020) fine-tune GPT-2 for belief tracking and context-to-response generation. SimpleTOD (Hosseini-Asl et al., 2020) proposes a method to serialize dialogue beliefs and acts into text spans and fine-tunes GPT-2 for end-to-end dia- logue modeling. SOLOIST (Peng et al., 2020a) uses a series of task-oriented dialogue datasets to further pretrain GPT-2 before fine-tuning it on final tasks for evaluation. Unlike these BERT- or GPT-initialized task-oriented dialogue models, which are essentially pretrained with general text, such as Wikipedia and BookCorpus, we use a Reddit dump to pretrain the models to learn from open-domain dialogues. 9 Conclusion We introduced two noisy channel models, noisy channel reranking and noisy channel online decod- ing, for task-oriented dialogue. Large-scale pre- training was further adopted to tackle data scarcity in downstream tasks. Extensive experiments on MultiWOZ, CamRest676, and SMCalFlow demonstrated that (1) the noisy channel mod- els significantly outperform direct decoding; (2) models with pretraining improve over randomly- initialized models; (3) the models are robust to different dialogue schema annotations; and (4) the noisy channel models can decode responses closer to ground-truth responses than direct decoding. Acknowledgments We would like to thank the action editors (Maggie, Wenjie Li, and Eneko Agirre) and three anonymous reviewers for their insightful comments. We also thank Angeliki Lazaridou, G´abor Melis, Nando de Freitas, Chris Dyer, and the DeepMind language team for their helpful discussions. References Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, and Quoc V. Le. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977. 669 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 3 9 0 1 9 2 9 7 2 7 / / t l a c _ a _ 0 0 3 9 0 p d . 
f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Jacob Andreas, John Bufe, David Burkett, Charles Chen, Josh Clausman, Jean Crawford, Kate Crim, Jordan DeLoach, Leah Dorner, Jason Eisner, and Hao Fang, Alan Guo, David Hall, Kristin Hayes, Kellie Hill, Diana Ho, Jha, Dan Klein, Wendy Iwaszuk, Smriti Jayant Krishnamurthy, Theo Lanman, Percy Liang, Christopher H. Lin, Ilya Lintsbakh, Andy McGovern, Aleksandr Nisnevich, Adam Pauls, Dmitrij Petters, Brent Read, Dan Roth, Subhro Roy, Jesse Rusak, Beth Short, Div Slomin, Ben Snyder, Stephon Striplin, Yu Su, Zachary Tellman, Sam Thomson, Andrei Vorobev, Izabela Witoszko, Jason Wolfe, Abby Wray, Yuchen Zhang, and Alexander Zotov. 2020. Task-oriented dialogue as dataflow syn- the Association for thesis. Transactions of Computational Linguistics, 8:556–571. John Langshaw Austin. 1975. How To Do Things with Words, 88, Oxford University Press. Siqi Bao, Huang He, Fan Wang, and Hua Wu. 2019. Plato: Pre-trained dialogue generation model with discrete latent variable. arXiv preprint arXiv:1910.07931. Siqi Bao, Huang He, Fan Wang, Hua Wu, and Haifeng Wang. 2020. PLATO: pre-trained dialogue generation model with discrete la- the 58th tent variable. Annual Meeting of the Association for Com- putational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 85–96. Association for Computational Linguistics. https://doi .org/10.18653/v1/2020.acl-main.9 In Proceedings of James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne. 2018. JAX: composable transformations of Python+ NumPy programs. Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Pro- cessing and Computational Natural Language Learning (EMNLP-CoNLL), pages 858–867, Prague, Czech Republic. Association for Com- putational Linguistics. Tom B. Brown, Benjamin Mann, Nick Ryder, Jared Kaplan, Prafulla Melanie Subbiah, Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutsk. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. Paweł Budzianowski and Ivan Vuli´c. 2019. Hello, it’s GPT-2 - how can I help you? Towards the use of pretrained language models for task-oriented dialogue systems. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 15–22, Hong Kong. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19 -5602 for Pawel Budzianowski, Tsung-Hsien Wen, Bo- I˜nigo Casanueva, Stefan Hsiang Tseng, Ultes, Osman Ramadan, and Milica Gasic. 2018. Multiwoz - A large-scale multi-domain task-oriented dia- wizard-of-oz dataset logue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, Octo- ber 31 - November 4, 2018, pages 5016–5026. Association for Computational Linguistics. 
https://doi.org/10.18653/v1/D18 -1547 Bill Byrne, Karthik Krishnamoorthi, Chinnadhu- rai Sankar, Arvind Neelakantan, Ben Goodrich, Daniel Duckworth, Semih Yavuz, Amit Dubey, Kyu-Young Kim, and Andy Cedilnik. 2019. Taskmaster-1: Toward a realistic and diverse dialog dataset. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Pro- cessing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3–7, 2019, pages 4515–4524. Association for Computational Linguistics. 670 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 3 9 0 1 9 2 9 7 2 7 / / t l a c _ a _ 0 0 3 9 0 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 https://doi.org/10.18653/v1/D19 -1459 Wenhu Chen, Jianshu Chen, Pengda Qin, Xifeng Yan, and William Yang Wang. 2019. Semantically conditioned dialog response generation via hierarchical disentangled self- attention. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3696–3709, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19 -1360 Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre- training of deep bidirectional transformers for In Proceedings of language understanding. the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics. Angela Fan, Mike Lewis, and Yann N. Dauphin. 2018. Hierarchical neural story generation. Iryna Gurevych and Yusuke Miyao, editors, In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 889–898. Association for Computational Linguistics. Donghoon Ham, Jeong-Gwan Lee, Youngsoo Jang, and Kee-Eung Kim. 2020. End-to- end neural pipeline for goal-oriented dialogue systems using GPT-2. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 583–592. Association for Computational Linguistics, Online. Michael Heck, Carel van Niekerk, Nurul Lubis, Christian Geishauser, Hsien-Chin Lin, Marco Moresi, and Milica Gasic. 2020. Trippy: A triple copy strategy for value independent neural dialog state tracking. In Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue, SIGdial 2020, 1st virtual meeting, July 1-3, 2020, pages 35–44. Association for Computational Linguistics. Matthew Henderson, Blaise Thomson, and Steve Young. 2013. Deep neural network approach for the dialog state tracking challenge. In Proceedings of the SIGDIAL 2013 Conference, pages 467–471. Matthew Henderson, Ivan Vulic, Daniela Gerz, I˜nigo Casanueva, Pawel Budzianowski, Sam Coope, Georgios Spithourakis, Tsung-Hsien Wen, Nikola Mrksic, and Pei-Hao Su. 2019. Training neural response selection for task- In Proceedings oriented dialogue systems. the Association the 57th Conference of of for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 5392–5404. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19 -1536 Dan Hendrycks Gaussian error preprint arXiv:1606.08415. and Kevin Gimpel. 2016. linear units (gelus). 
Tom Hennigan, Trevor Cai, Tamara Norman, and Igor Babuschkin. 2020. Haiku: Sonnet for JAX.
Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. CoRR, abs/1904.09751.
Ehsan Hosseini-Asl, Bryan McCann, Chien-Sheng Wu, Semih Yavuz, and Richard Socher. 2020. A simple language model for task-oriented dialogue. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020, virtual.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings.
Dan Klein and Christopher D. Manning. 2002. Conditional structure versus conditional estimation in NLP models. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), pages 9–16. https://doi.org/10.3115/1118693.1118695
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020.
Hwaran Lee, Seokhwan Jo, HyungJun Kim, Sangkeun Jung, and Tae-Yoon Kim. 2020. SUMBT+LaRL: End-to-end neural task-oriented dialog system with reinforcement learning. arXiv preprint arXiv:2009.10447.
Wenqiang Lei, Xisen Jin, Min-Yen Kan, Zhaochun Ren, Xiangnan He, and Dawei Yin. 2018. Sequicity: Simplifying task-oriented dialogue systems with single sequence-to-sequence architectures. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1437–1447.
Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12–17, 2016, pages 110–119. The Association for Computational Linguistics.
Phillip Lippe, Pengjie Ren, Hinda Haned, Bart Voorn, and Maarten de Rijke. 2020. Diversifying task-oriented dialogue response generation with prototype guided paraphrasing. CoRR, abs/2008.03391.
Bing Liu and Ian Lane. 2018. End-to-end learning of task-oriented dialogs. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 67–73. https://doi.org/10.18653/v1/N18-4010
Shikib Mehri, Tejas Srinivasan, and Maxine Eskenazi. 2019. Structured fusion networks for dialog. arXiv preprint arXiv:1907.10016.
Nikola Mrkšić, Diarmuid Ó Séaghdha, Tsung-Hsien Wen, Blaise Thomson, and Steve Young. 2016. Neural belief tracker: Data-driven dialogue state tracking. arXiv preprint arXiv:1606.03777.
Arvind Neelakantan, Semih Yavuz, Sharan Narang, Vishaal Prasad, Ben Goodrich, Daniel Duckworth, Chinnadhurai Sankar, and Xifeng Yan. 2019. Neural assistant: Joint action prediction, response generation, and latent knowledge reasoning. arXiv preprint arXiv:1910.14613.
Toan Q. Nguyen and Julian Salazar. 2019. Transformers without tears: Improving the normalization of self-attention. arXiv preprint arXiv:1910.05895.
Elnaz Nouri and Ehsan Hosseini-Asl. 2018. Toward scalable neural dialogue state tracking model. arXiv preprint arXiv:1812.00899.
Jiahuan Pei, Pengjie Ren, and Maarten de Rijke. 2019. A modular task-oriented dialogue system using a neural mixture-of-experts. arXiv preprint arXiv:1907.05346.
Baolin Peng, Chunyuan Li, Jinchao Li, Shahin Shayandeh, Lars Liden, and Jianfeng Gao. 2020a. Soloist: Few-shot task-oriented dialog with a single pre-trained auto-regressive model. arXiv preprint arXiv:2005.05298.
Baolin Peng, Chenguang Zhu, Chunyuan Li, Xiujun Li, Jinchao Li, Michael Zeng, and Jianfeng Gao. 2020b. Few-shot natural language generation for task-oriented dialog. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, EMNLP 2020, Online Event, 16–20 November 2020, pages 172–182. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.17
Shuke Peng, Xinjing Huang, Zehao Lin, Feng Ji, Haiqing Chen, and Yin Zhang. 2019. Teacher-student framework enhanced multi-domain dialogue generation. arXiv preprint arXiv:1908.07137.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, pages 2227–2237. https://doi.org/10.18653/v1/N18-1202
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:140:1–140:67.
Abhinav Rastogi, Dilek Hakkani-Tür, and Larry Heck. 2017. Scalable multi-domain dialogue state tracking. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 561–568. IEEE. https://doi.org/10.1109/ASRU.2017.8268986
Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. 2020. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, February 7–12, 2020, pages 8689–8696. AAAI Press. https://doi.org/10.1609/aaai.v34i05.6394
Antoine Raux, Brian Langner, Dan Bohus, Alan W. Black, and Maxine Eskenazi. 2005. Let's go public! Taking a spoken dialog system to the real world. In Ninth European Conference on Speech Communication and Technology.
Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, and Jason Weston. 2020. Recipes for building an open-domain chatbot. arXiv preprint arXiv:2004.13637.
Stephanie Seneff and Joseph Polifroni. 2000. Dialogue management in the Mercury flight reservation system. In ANLP-NAACL 2000 Workshop: Conversational Systems. https://doi.org/10.3115/1117562.1117565
Claude Shannon. 1948. A mathematical theory of communication. Bell System Technical Journal, 27:379–423.
Junya Takayama, Koji Tanaka, and Yuki Arase. 2019. Dialogue-act prediction of future responses based on conversation history. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 197–202.
Bo-Hsiang Tseng, Jianpeng Cheng, Yimai Fang, and David Vandyke. 2020. A generative model for joint natural language understanding and generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5–10, 2020, pages 1795–1807. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.163
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
Tsung-Hsien Wen, Yishu Miao, Phil Blunsom, and Steve J. Young. 2017a. Latent intention dialogue models. CoRR, abs/1705.10229.
Tsung-Hsien Wen, David Vandyke, Nikola Mrkšić, Milica Gasic, Lina Maria Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve J. Young. 2017b. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3–7, 2017, Volume 1: Long Papers, pages 438–449. Association for Computational Linguistics.
Chien-Sheng Wu, Steven C. H. Hoi, Richard Socher, and Caiming Xiong. 2020a. TOD-BERT: Pre-trained natural language understanding for task-oriented dialogue. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16–20, 2020, pages 917–929. Association for Computational Linguistics.
Chien-Sheng Wu, Andrea Madotto, Ehsan Hosseini-Asl, Caiming Xiong, Richard Socher, and Pascale Fung. 2019a. Transferable multi-domain state generator for task-oriented dialogue systems. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 808–819. Association for Computational Linguistics.
Qingyang Wu, Yichi Zhang, Yu Li, and Zhou Yu. 2019b. Alternating recurrent dialog model with large-scale pre-trained language models. arXiv preprint arXiv:1910.03756.
Zeqiu Wu, Michel Galley, Chris Brockett, Yizhe Zhang, Xiang Gao, Chris Quirk, Rik Koncel-Kedziorski, Jianfeng Gao, Hannaneh Hajishirzi, Mari Ostendorf, and Bill Dolan. 2020b. A controllable model of grounded response generation. arXiv preprint arXiv:2005.00613.
Yunyi Yang, Yunhao Li, and Xiaojun Quan. 2021. UBAR: Towards fully end-to-end task-oriented dialog systems with GPT-2. In The Thirty-Fifth AAAI Conference on Artificial Intelligence.
Kyra Yee, Yann N. Dauphin, and Michael Auli. 2019. Simple and effective noisy channel modeling for neural machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3–7, 2019, pages 5695–5700. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1571
Yang You, Jing Li, Sashank J. Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. 2020. Large batch optimization for deep learning: Training BERT in 76 minutes. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020.
Lei Yu, Phil Blunsom, Chris Dyer, Edward Grefenstette, and Tomás Kociský. 2017. The neural noisy channel. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings.
Lei Yu, Laurent Sartran, Wojciech Stokowiec, Wang Ling, Lingpeng Kong, Phil Blunsom, and Chris Dyer. 2020. Better document-level machine translation with Bayes' rule. Transactions of the Association for Computational Linguistics, 8:346–360. https://doi.org/10.1162/tacl_a_00319
Jian-Guo Zhang, Kazuma Hashimoto, Chien-Sheng Wu, Yao Wan, Philip S. Yu, Richard Socher, and Caiming Xiong. 2019. Find or classify? Dual strategy for slot-value predictions on multi-domain dialog state tracking. arXiv preprint arXiv:1910.03544.
Yichi Zhang, Zhijian Ou, and Zhou Yu. 2020a. Task-oriented dialog systems that consider multiple appropriate responses under the same context. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7–12, 2020, pages 9604–9611. AAAI Press. https://doi.org/10.1609/aaai.v34i05.6507
Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2020b. DIALOGPT: Large-scale generative pre-training for conversational response generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, ACL 2020, Online, July 5–10, 2020, pages 270–278. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-demos.30
Tiancheng Zhao, Kaige Xie, and Maxine Eskénazi. 2019. Rethinking action spaces for reinforcement learning in end-to-end dialog agents with latent variable models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2–7, 2019, Volume 1 (Long and Short Papers), pages 1208–1218. Association for Computational Linguistics.
Li Zhou and Kevin Small. 2019. Multi-domain dialogue state tracking as dynamic knowledge graph enhanced question answering. arXiv preprint arXiv:1911.06192.