Insertion-based Decoding with Automatically Inferred Generation Order
Jiatao Gu†, Qi Liu♦∗, and Kyunghyun Cho‡†
†Facebook AI Research
♦University of Oxford
‡New York University, CIFAR Azrieli Global Scholar
†{jgu, kyunghyuncho}@fb.com ♦qi.liu@st-hughs.ox.ac.uk
∗This work was completed while the author worked as an AI resident at Facebook AI Research.
Abstract
Conventional neural autoregressive decoding commonly assumes a fixed left-to-right generation order, which may be sub-optimal. In this work, we propose a novel decoding algorithm, InDIGO, which supports flexible sequence generation in arbitrary orders through insertion operations. We extend Transformer, a state-of-the-art sequence generation model, to efficiently implement the proposed approach, enabling it to be trained with either a pre-defined generation order or adaptive orders obtained from beam-search. Experiments on four real-world tasks, including word order recovery, machine translation, image captioning, and code generation, demonstrate that our algorithm can generate sequences following arbitrary orders, while achieving competitive or even better performance compared with the conventional left-to-right generation. The generated sequences show that InDIGO adopts adaptive generation orders based on input information.
1 Introduction
Neural autoregressive models have become the
de facto standard in a wide range of sequence
generation tasks, such as machine translation
(Bahdanau et al., 2015), summarization (Rush
et al., 2015), and dialogue systems (Vinyals and
Le, 2015). In these studies, a sequence is modeled
autoregressively with the left-to-right generation
order, which raises the question of whether gen-
eration in an arbitrary order is worth considering
(Vinyals et al., 2016; Ford et al., 2018). Nev-
ertheless, previous studies on generation orders
mostly resort to a fixed set of generation orders,
showing particular choices of ordering are help-
ful (Wu et al., 2018; Ford et al., 2018; Mehri
and Sigal, 2018), without providing an efficient
algorithm for finding adaptive generation orders,
or restrict the problem scope to n-gram segment
generation (Vinyals et al., 2016).
In this paper, we propose a novel decoding al-
gorithm, Insertion-based Decoding with Inferred
Generation Order (InDIGO), which models gener-
ation orders as latent variables and automatically
infers the generation orders by simultaneously pre-
dicting a word and its position to be inserted at each
decoding step. Given that absolute positions are
unknown before generating the whole sequence,
we use a relative-position-based representation
to capture generation orders. We show that de-
coding consists of a series of insertion operations
with a demonstration shown in Figure 1.
We extend Transformer (Vaswani et al., 2017) to support insertion operations, where the generation order is directly captured as relative positions through self-attention, inspired by Shaw et al. (2018). For learning, we maximize the evidence lower-bound (ELBO) of the maximum likelihood objective, and study two approximate posterior distributions of generation orders, based on a pre-defined generation order and on adaptive orders obtained from beam-search, respectively.
Experimental results on word order recovery, machine translation, code generation, and image captioning demonstrate that our algorithm can generate sequences with arbitrary orders, while achieving competitive or even better performance compared to the conventional left-to-right generation. Case studies show that the proposed method adopts adaptive orders based on input information. The code will be released as part of the official repository of Fairseq (https://github.com/pytorch/fairseq).
Figure 1: An example of InDIGO. At each step, we simultaneously predict the next token and its (relative) position to be inserted. The final output sequence is obtained by mapping the words based on their positions.

2 Neural Autoregressive Decoding

Let us consider the problem of generating a sequence $y = (y_1, \ldots, y_T)$ conditioned on some inputs, e.g., a source sequence $x = (x_1, \ldots, x_{T'})$. Our goal is to build a model parameterized by $\theta$ that models the conditional probability of $y$ given $x$, which is factorized as:

$$p_\theta(y|x) = \prod_{t=0}^{T} p_\theta(y_{t+1} \mid y_{0:t}, x_{1:T'}), \quad (1)$$

where $y_0$ and $y_{T+1}$ are special tokens $\langle s\rangle$ and $\langle/s\rangle$, respectively. The model sequentially predicts the conditional probability of the next token at each step $t$, which can be implemented by any function approximator such as RNNs (Bahdanau et al., 2015) or Transformers (Vaswani et al., 2017).

Learning  A neural autoregressive model is commonly learned by maximizing the conditional likelihood $\log p(y|x) = \sum_{t=0}^{T} \log p_\theta(y_{t+1} \mid y_{0:t}, x_{1:T'})$ given a set of parallel examples.

Decoding  A common way to decode a sequence from a trained model is to exploit the autoregressive nature that allows us to predict one word at each step. Given any source $x$, we essentially follow the order of factorization and generate tokens sequentially using heuristic algorithms such as greedy decoding and beam-search.

3 Insertion-based Decoding with Inferred Generation Order (InDIGO)

Equation 1 explicitly assumes a left-to-right (L2R) generation order of the sequence $y$. In principle, we can factorize the sequence probability with any permutation and train a model for each permutation separately. As long as we have an infinite amount of data and proper optimization is performed, all these models are equivalent. Nevertheless, Vinyals et al. (2016) have shown that the generation order of a sequence actually matters in many real-world tasks, e.g., language modeling.

Although the L2R order is a strong inductive bias, as it is natural for most human beings to read and write sequences from left to right, L2R is not necessarily the optimal option for generating sequences. For instance, people sometimes tend to think of central phrases first before building up a whole sentence. For programming languages, it is beneficial to generate code based on abstract syntax trees (Yin and Neubig, 2017). Therefore, a natural question arises: how can we decode a sequence in its best order?

3.1 Orders as Latent Variables

We address this question by modeling generation orders as latent variables. Similar to Vinyals et al. (2016), we rewrite the target sequence $y$ in a particular order $\pi = (z_2, \ldots, z_T, z_{T+1}) \in \mathcal{P}_T$¹ as a set $y_\pi = \{(y_2, z_2), \ldots, (y_{T+1}, z_{T+1})\}$, where $(y_t, z_t)$ represents the $t$-th generated token and its absolute position, respectively. Different from the common notation, the target sequence is 2-step drifted because the two special tokens $(y_0, z_0) = (\langle s\rangle, 0)$ and $(y_1, z_1) = (\langle/s\rangle, T+1)$ are always prepended to represent the left and right boundaries, respectively. Then, we model the conditional probability as the joint distribution of words and positions by marginalizing over all orders:

$$p_\theta(y|x) = \sum_{\pi \in \mathcal{P}_T} p_\theta(y_\pi \mid x),$$

where for each element:

$$p_\theta(y_\pi|x) = p_\theta(y_{T+2} \mid y_{0:T+1}, z_{0:T+1}, x_{1:T'}) \cdot \prod_{t=1}^{T} p_\theta(y_{t+1}, z_{t+1} \mid y_{0:t}, z_{0:t}, x_{1:T'}), \quad (2)$$

where the third special token $y_{T+2} = \langle\text{eod}\rangle$ is introduced to signal the end of decoding, and $p_\theta(y_{T+2}|\cdot)$ is the end-of-decoding probability. At decoding time, the factorization allows us to decode autoregressively by predicting the word $y_{t+1}$ and its position $z_{t+1}$ step by step. The generation order is automatically inferred during decoding.

¹ $\mathcal{P}_T$ is the set of all the permutations of $(1, \ldots, T)$.
3.2 Relative Representation of Positions
It is difficult and inefficient to predict the absolute positions $z_t$ without knowing the actual length $T$. One solution is to directly use the absolute positions $z^t_0, \ldots, z^t_t$ of the partial sequence $y_{0:t}$ at each autoregressive step $t$. For example, the absolute positions for the sequence ($\langle s\rangle$, $\langle/s\rangle$, dream, I) are ($z^t_0 = 0$, $z^t_1 = 3$, $z^t_2 = 2$, $z^t_3 = 1$) in Figure 1 at step $t = 3$. It is, however, inefficient to model such explicit positions using a single neural network without recomputing the hidden states for the entire partial sequence, as some positions change at every step (as shown in Figure 1).
Relative Positions  We propose using relative-position representations $r^t_{0:t}$ instead of absolute positions $z^t_{0:t}$. We use a ternary vector $r^t_i \in \{-1, 0, 1\}^{t+1}$ as the relative-position representation for $z^t_i$. The $j$-th element of $r^t_i$ is defined as:

$$r^t_{i,j} = \begin{cases} -1 & z^t_j > z^t_i \ \text{(left)} \\ \ \ 0 & z^t_j = z^t_i \ \text{(middle)} \\ \ \ 1 & z^t_j < z^t_i \ \text{(right)} \end{cases}, \quad (3)$$

where the elements of $r^t_i$ show the relative positions with respect to all the other words in the partial sequence at step $t$. We use a matrix $R^t = \left[ r^t_0, r^t_1, \ldots, r^t_t \right]$ to collect the relative-position representations of all the words in the sequence. The relative-position representation can always be mapped back to the absolute position $z^t_i$ by:

$$z^t_i = \sum_{j=0}^{t} \max(0, r^t_{i,j}). \quad (4)$$
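As a quick sanity check of Equations (3) and (4), the sketch below (ours; relative_matrix and to_absolute are hypothetical helper names) builds the matrix of $r_{i,j}$ values from a list of absolute positions and recovers the positions again:

```python
import numpy as np

def relative_matrix(z):
    """Matrix of r_{i,j} values (Eq. 3); row i holds r_i."""
    z = np.asarray(z)
    # r[i, j] = -1 if z[j] > z[i] (left), 0 if equal, +1 if z[j] < z[i] (right)
    return np.sign(z[:, None] - z[None, :]).astype(int)

def to_absolute(R):
    """Recover absolute positions (Eq. 4): count the +1 entries in each row."""
    return np.maximum(R, 0).sum(axis=1)

# Partial sequence (<s>, </s>, dream, I) with absolute positions (0, 3, 2, 1).
z = [0, 3, 2, 1]
R = relative_matrix(z)
assert to_absolute(R).tolist() == z
```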
One of the biggest advantages of such vector-based representations is that, at each step, updating the relative-position representations amounts to simply extending the relative-position matrix $R^t$ with the next predicted relative position, because the (left, middle, right) relations described in Equation (3) stay unchanged once they are created. Thus, we update $R^t$ as follows:

$$R^{t+1} = \begin{bmatrix} R^t & \begin{matrix} r^{t+1}_{t+1,0} \\ \vdots \\ r^{t+1}_{t+1,t} \end{matrix} \\ \begin{matrix} -r^{t+1}_{t+1,0} & \cdots & -r^{t+1}_{t+1,t} \end{matrix} & 0 \end{bmatrix}, \quad (5)$$

where we use $r^{t+1}_{t+1}$ to represent the relative position at step $t+1$. This append-only property enables our method to reuse the previous hidden states without recomputing them at each step. For simplicity, the superscript of $r$ is omitted from now on when this causes no conflict.

Algorithm 1  Insertion-based Decoding
  Initialize: $y = (\langle s\rangle, \langle/s\rangle)$, $R = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}$, $t = 1$
  repeat
    Predict the next word $y_{t+1}$ based on $y$, $R$.
    if $y_{t+1}$ is $\langle\text{eod}\rangle$ then break end if
    Choose an existing word $y_k \in y$;
    Choose the left or right ($s$) of $y_k$ to insert;
    Obtain the next position $r_{t+1}$ with $k$, $s$ (Eq. (6)).
    Update $R$ by appending $r_{t+1}$ (Eq. (5)).
    Update $y$ by appending $y_{t+1}$.
    Update $t = t + 1$.
  until the maximum length is reached
  Map back to absolute positions $\pi$ (Eq. (4)).
  Reorder $y$: $y_{z_i} = y_i\ \forall z_i \in \pi,\ i \in [0, t]$.
3.3 Insertion-based Decoding
Given a partial sequence $y_{0:t}$ and its corresponding relative-position representations $r_{0:t}$, not all of the $3^{t+2}$ possible vectors are valid for the next relative-position representation, $r_{t+1}$. Only those vectors corresponding to insertion operations satisfy Equation (4). In Algorithm 1, we describe an insertion-based decoding framework based on this observation. The next word $y_{t+1}$ is predicted based on $y_{0:t}$ and $r_{0:t}$. We then choose an existing word $y_k$ ($0 \le k \le t$) from $y_{0:t}$ and insert $y_{t+1}$ to its left or right. As a result, the next position $r_{t+1}$ is determined by

$$r_{t+1,j} = \begin{cases} s & j = k \\ r_{k,j} & j \ne k \end{cases}, \quad \forall j \in [0, t], \quad (6)$$

where $s = -1$ if $y_{t+1}$ is on the left of $y_k$, and $s = 1$ otherwise. Finally, we use $r_{t+1}$ to update the relative-position matrix $R$ as shown in Equation (5).
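The sketch below (ours, not the Fairseq implementation; the word predictor is left out and only the bookkeeping is shown) mirrors Algorithm 1 with Equations (4)-(6): each step inserts a new word to the left or right of an existing word, appends its relative-position vector to $R$, and only materializes absolute positions at the very end.

```python
import numpy as np

def insert_step(R, words, new_word, k, side):
    """One insertion: put new_word to the left (side=-1) or right (side=+1)
    of the existing word with index k (Eq. 6), then extend R (Eq. 5).
    R is stored with R[j, i] = r_i[j], matching the column layout of Eq. (5)."""
    r_new = R[:, k].copy()          # copy the relations of the anchor word y_k
    r_new[k] = side                  # ... except relative to y_k itself (Eq. 6)
    # Append-only update (Eq. 5): new column +r_new, new bottom row -r_new, corner 0.
    R = np.block([[R, r_new[:, None]],
                  [-r_new[None, :], np.zeros((1, 1), dtype=int)]])
    return R, words + [new_word]

def finalize(R, words):
    """Map relative positions back to absolute ones (Eq. 4) and reorder."""
    z = np.maximum(R, 0).sum(axis=0)   # z_i counts the words to the left of word i
    out = [None] * len(words)
    for i, zi in enumerate(z):
        out[zi] = words[i]
    return out

# Start from (<s>, </s>) as in Algorithm 1.
R = np.array([[0, 1], [-1, 0]])
words = ["<s>", "</s>"]
# Insert "dream" to the left of "</s>" (k=1), then "I" to the left of "dream" (k=2).
R, words = insert_step(R, words, "dream", k=1, side=-1)
R, words = insert_step(R, words, "I", k=2, side=-1)
print(finalize(R, words))   # ['<s>', 'I', 'dream', '</s>']
```

At this intermediate step the recovered absolute positions are (0, 3, 2, 1), matching the ($\langle s\rangle$, $\langle/s\rangle$, dream, I) example given in Section 3.2.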
Figure 2: The overall framework of the proposed Transformer-InDIGO, which includes (a) the word & position prediction module; (b) one-step decoding with position updating; and (c) the final decoding output obtained by reordering. The black-white blocks represent the relative position matrix.
4 Model
We present Transformer-InDIGO, an extension of
Transformer (Vaswani et al., 2017), supporting
insertion-based decoding. The overall framework
is shown in Figure 2.
4.1 Network Design
We extend the decoder of Transformer with
relative-position-based self-attention, joint word
and position prediction, and position updating
modules.
Self-Attention One of the major challenges that
prevents the vanilla Transformer from generating
sequences following arbitrary orders is that the
absolute-position-based positional encodings are
inefficient (as mentioned in Section 3.2), in that
absolute positions are changed during decoding,
invalidating the previous hidden states. In contrast,
we adapt Shaw et al. (2018) to use relative posi-
tions in self-attention. Different from Shaw et al.
(2018), in which a clipping distance d (usually
d ≥ 2) is set for relative positions, our relative-
position representations only preserve d = 1
relations (Equation (3)).
Each attention head in a multi-head self-attention module of Transformer-InDIGO takes the hidden states of a partial sequence $y_{0:t}$, denoted as $U = (u_0, \ldots, u_t)$, and its corresponding relative-position matrix $R^t$ as input, where each input state $u_i \in \mathbb{R}^{d_{\text{model}}}$. The logit $e_{i,j}$ for attention is computed as:

$$e_{i,j} = \frac{\left(u_i^\top Q\right) \cdot \left(u_j^\top K + A_{[r_{i,j}+1]}\right)^\top}{\sqrt{d_{\text{model}}}}, \quad (7)$$
where $Q, K \in \mathbb{R}^{d_{\text{model}} \times d_{\text{model}}}$ and $A \in \mathbb{R}^{3 \times d_{\text{model}}}$ are parameter matrices. $A_{[r_{i,j}+1]}$ is the row vector indexed by $r_{i,j} + 1$, which biases all the input keys based on the relative position $r_{i,j}$.
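A minimal PyTorch sketch of this relative-position bias (ours; single head, no batching, randomly initialized parameters for illustration):

```python
import math
import torch

def relative_attention_logits(U, R, Q, K, A):
    """Attention logits of Eq. (7) with the ternary relative-position bias.

    U: (t+1, d_model) hidden states; R: (t+1, t+1) matrix with R[i, j] = r_{i,j};
    Q, K: (d_model, d_model) projections; A: (3, d_model) bias table.
    """
    d_model = U.size(-1)
    queries = U @ Q                                 # (t+1, d_model)
    keys = U @ K                                    # (t+1, d_model)
    # A[R + 1] looks up one of three bias vectors for every (i, j) pair.
    biased_keys = keys.unsqueeze(0) + A[R + 1]      # (t+1, t+1, d_model)
    logits = (queries.unsqueeze(1) * biased_keys).sum(-1) / math.sqrt(d_model)
    return logits                                   # (t+1, t+1)

d = 8
z = torch.tensor([0, 3, 2, 1])                      # absolute positions (Sec. 3.2)
R = torch.sign(z[:, None] - z[None, :])             # ternary relative positions
U = torch.randn(len(z), d)
Q, K, A = torch.randn(d, d), torch.randn(d, d), torch.randn(3, d)
print(relative_attention_logits(U, R, Q, K, A).shape)   # torch.Size([4, 4])
```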
Word and Position Prediction  Like the vanilla Transformer, we take the representations from the last self-attention layer, $H = (h_0, \ldots, h_t) \in \mathbb{R}^{d_{\text{model}} \times (t+1)}$, to predict both the next word $y_{t+1}$ and its position vector $r_{t+1}$ in two stages based on the following factorization:

$$p(y_{t+1}, r_{t+1}|H) = p(y_{t+1}|H) \cdot p(r_{t+1}|y_{t+1}, H).$$

It can also be factorized as predicting the position before predicting the next word, yet our preliminary experiments show that predicting the word first works slightly better. The prediction modules for word and position prediction are shown in Figure 2(a).
First, we predict the next word $y_{t+1}$ from the categorical distribution $p_{\text{word}}(y|H)$ as:

$$p_{\text{word}}(y|H) = \text{softmax}\left( (h_t^\top F) \cdot W^\top \right), \quad (8)$$

where $W \in \mathbb{R}^{d_{\mathcal{V}} \times d_{\text{model}}}$ is the embedding matrix and $d_{\mathcal{V}}$ is the size of the vocabulary. We linearly project the last representation $h_t$ using $F \in \mathbb{R}^{d_{\text{model}} \times d_{\text{model}}}$ for querying $W$. Then, as shown in Equation (6), the prediction of the next position is done by performing insertion operations on existing words, which can be modeled similarly to Pointer Networks (Vinyals et al., 2015). We predict a pointer $k_{t+1} \in [0, 2t+1]$ based on:

$$p_{\text{pointer}}(k|y_{t+1}, H) = \text{softmax}\left( (h_t^\top E + W_{[y_{t+1}]}) \cdot \begin{bmatrix} H^\top C \\ H^\top D \end{bmatrix}^\top \right), \quad (9)$$
where $C, D, E \in \mathbb{R}^{d_{\text{model}} \times d_{\text{model}}}$ and $W_{[y_{t+1}]}$ is the embedding of the predicted word. $C$ and $D$ are used to obtain the left and right keys, respectively, considering that each word has two "keys" (its left and right) for inserting the generated word. The query vector is obtained by adding up the word embedding $W_{[y_{t+1}]}$ and the linearly projected state $h_t^\top E$. The resulting relative-position vector $r_{t+1}$ is computed from $k_{t+1}$ according to Equation (6). We manually set $p_{\text{pointer}}(0|\cdot) = p_{\text{pointer}}(t+2|\cdot) = 0$ to avoid any word being inserted to the left of $\langle s\rangle$ or to the right of $\langle/s\rangle$.
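A sketch of this two-stage prediction of Equations (8) and (9) (ours; single example, no batching, and the parameter names simply follow the text):

```python
import torch
import torch.nn.functional as F_

def predict_word_and_pointer(H, W, F, C, D, E, forbid):
    """Two-stage prediction: word distribution (Eq. 8), then pointer (Eq. 9).

    H: (t+1, d_model) decoder states; W: (V, d_model) embedding matrix;
    F, C, D, E: (d_model, d_model) projections; forbid: key indices masked out
    (here, the left of <s> and the right of </s>).
    """
    h_t = H[-1]                                        # last state queries W
    p_word = F_.softmax((h_t @ F) @ W.t(), dim=-1)     # Eq. (8)
    y_next = int(p_word.argmax())
    query = h_t @ E + W[y_next]                        # word-aware query
    keys = torch.cat([H @ C, H @ D], dim=0)            # left keys, then right keys
    logits = query @ keys.t()                          # (2t+2,)
    logits[forbid] = float("-inf")                     # p_pointer(0) = p_pointer(t+2) = 0
    p_pointer = F_.softmax(logits, dim=-1)             # Eq. (9)
    return y_next, int(p_pointer.argmax())

t, d, V = 3, 8, 100
H, W = torch.randn(t + 1, d), torch.randn(V, d)
F, C, D, E = (torch.randn(d, d) for _ in range(4))
y_next, k = predict_word_and_pointer(H, W, F, C, D, E, forbid=[0, t + 2])
# k <= t means "insert to the left of word k"; k > t means "to the right of word k-(t+1)".
print(y_next, k)
```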
Position Updating  As mentioned in Sec. 3.1, we update the relative-position representation $R^t$ with the predicted $r_{t+1}$. Because updating the relative positions does not change the pre-computed relative-position representations, Transformer-InDIGO can reuse the previous hidden states in the next decoding step, just like the vanilla Transformer.
4.2 Learning
Training requires maximizing the marginalized
likelihood in Equation (2). Yet this is intractable
since we need to enumerate all of the $T!$ per-
mutations of tokens. Instead, we maximize the
ELBO of the original objective by introducing an
approximate posterior distribution of generation
orders q(π|x, y), which provides the probabilities
of latent generation orders based on the ground-
truth sequences x and y:
$$\mathcal{L}_{\text{ELBO}} = \mathop{\mathbb{E}}_{\pi \sim q}\big[\log p_\theta(y_\pi|x)\big] + \mathcal{H}(q)$$

$$= \mathop{\mathbb{E}}_{r_{2:T+1} \sim q}\Bigg(\underbrace{\sum_{t=1}^{T+1} \log p_\theta(y_{t+1} \mid y_{0:t}, r_{0:t}, x_{1:T'})}_{\text{Word Prediction Loss}} + \underbrace{\sum_{t=1}^{T} \log p_\theta(r_{t+1} \mid y_{0:t+1}, r_{0:t}, x_{1:T'})}_{\text{Position Prediction Loss}}\Bigg) + \mathcal{H}(q), \quad (10)$$
where $\pi = r_{2:T+1}$, sampled from $q(\pi|x, y)$, is represented as relative positions. $\mathcal{H}(q)$ is the entropy term, which can be ignored if $q$ is fixed during training. Equation (10) shows that, given a sampled order, the learning objective is divided into word and position objectives. For calculating the position prediction loss, we aggregate the two probabilities corresponding to the same position by

$$p_\theta(r_{t+1}|\cdot) = p_{\text{pointer}}(k_l|\cdot) + p_{\text{pointer}}(k_r|\cdot), \quad (11)$$
where $p_{\text{pointer}}(k_l|\cdot)$ and $p_{\text{pointer}}(k_r|\cdot)$ are calculated simultaneously from the same softmax function in Equation (9), and $k_l, k_r$ ($k_l \ne k_r$) represent the two keys corresponding to the same relative position.
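Concretely, inserting a new word between two adjacent words can be pointed to either as "right of the left neighbor" or "left of the right neighbor"; the sketch below (ours, illustrative only) locates the two keys and sums their probabilities as in Equation (11):

```python
import torch

def position_log_prob(pointer_logits, z, gold_slot):
    """Aggregate the two equivalent pointer keys for one insertion (Eq. 11).

    pointer_logits: (2t+2,) logits over [left keys 0..t, right keys t+1..2t+1].
    z:              absolute positions of the t+1 existing words.
    gold_slot:      the new word goes between positions gold_slot-1 and gold_slot.
    """
    t = len(z) - 1
    p = torch.softmax(pointer_logits, dim=-1)
    left_word = z.index(gold_slot - 1)       # word currently at gold_slot - 1
    right_word = z.index(gold_slot)          # word currently at gold_slot
    k_r = t + 1 + left_word                  # "right of the left neighbor"
    k_l = right_word                         # "left of the right neighbor"
    return torch.log(p[k_l] + p[k_r])

# Toy partial sequence of 4 words at positions 0..3; insert between 1 and 2.
z = [0, 1, 2, 3]
logits = torch.randn(2 * len(z))             # 2t + 2 = 8 keys for t = 3
print(position_log_prob(logits, z, gold_slot=2))
```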
Here, we study two types of $q(\pi|x, y)$:

Pre-defined Order  If we already possess some prior knowledge about the sequence, e.g., that the L2R order is a strong baseline in many scenarios, we assume a Dirac-delta distribution $q(\pi|x, y) = \delta(\pi = \pi^*(x, y))$, where $\pi^*(x, y)$ is a predefined order. In this work, we study a set of pre-defined orders, listed in Table 1, to evaluate their effect on generation.
Searched Adaptive Order (SAO)  We choose the approximate posterior $q$ as the point estimate that maximizes $\log p_\theta(y_\pi|x)$, which can also be seen as the maximum-a-posteriori (MAP) estimation of the latent order $\pi$. In practice, we approximate these generation orders $\pi$ through beam-search (Pal et al., 2006). Unlike the original beam-search for autoregressive decoding, which searches in the sequence space to find the sequence maximizing the probability shown in Equation 1, we search in the space of all the permutations of the target sequence to find the $\pi$ maximizing Equation 2, as all the target tokens are known in advance during training.

More specifically, we maintain $B$ sub-sequences with the maximum probabilities using a set $\mathcal{B}$ at each step $t$. For every sub-sequence $y^{(b)}_{0:t} \in \mathcal{B}$, we evaluate the probabilities of every possible choice from the remaining words $y' \in y \setminus y^{(b)}_{0:t}$ and its position $r'$. We calculate the cumulative likelihood for each $(y', r')$, based on which we select the top-$B$ sub-sequences as the new set $\mathcal{B}$ for the next step. After obtaining the $B$ generation orders, we optimize our objective as an average over these orders:
$$\mathcal{L}_{\text{SAO}} = \frac{1}{B} \sum_{\pi \in \mathcal{B}} \log p_\theta(y_\pi|x), \quad (12)$$

where we assume $q(\pi|x, y) = \begin{cases} 1/B & \pi \in \mathcal{B} \\ 0 & \text{otherwise} \end{cases}$.
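A schematic of this training-time search (ours and heavily simplified: the scoring callable stands in for the model's word-and-position log-probability, and partial hypotheses are tracked as plain lists rather than relative-position matrices):

```python
import heapq

def search_adaptive_orders(target, beam_size, step_logprob):
    """Beam-search over generation orders of a known target (SAO, Sec. 4.2).

    target:       ground-truth tokens, all known in advance during training.
    step_logprob: callable(steps, word_index, slot) -> float; a stand-in for the
                  model's log-probability of inserting target[word_index] into
                  insertion slot `slot` of the current partial sequence.
    Returns up to beam_size orders, each a list of (word_index, slot) steps.
    """
    beams = [(0.0, [])]                        # (cumulative log-prob, steps so far)
    for t in range(len(target)):
        candidates = []
        for score, steps in beams:
            used = {i for i, _ in steps}
            for i in range(len(target)):
                if i in used:
                    continue
                for slot in range(t + 1):      # t words placed -> t + 1 slots
                    s = score + step_logprob(steps, i, slot)
                    candidates.append((s, steps + [(i, slot)]))
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return [steps for _, steps in beams]

# Toy scorer standing in for the model: prefer placing longer words earlier.
target = ["the", "transformer", "decodes", "flexibly"]
orders = search_adaptive_orders(
    target, beam_size=2,
    step_logprob=lambda steps, i, slot: 0.1 * len(target[i]))
print(orders)
```

In the actual setup the scorer is Transformer-InDIGO itself, and, as described next, dropout is kept on during this search to inject noise.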
Beam-Search with Dropout  The goal of beam-search is to approximately find the most likely generation orders, which limits learning from exploring other generation orders that may not be favorable currently but may ultimately be deemed better. Prior research (Vijayakumar et al., 2016) also pointed out that the search space of the standard beam-search is restricted. We encourage exploration by injecting noise during beam-search (Cho, 2016). In particular, we found it effective to keep the dropout on (e.g., dropout = 0.1).
Pre-defined Order     Description
Left-to-right (L2R)   Generate words from left to right. (Wu et al., 2018)
Right-to-left (R2L)   Generate words from right to left. (Wu et al., 2018)
Odd-Even (ODD)        Generate words at odd positions from left to right, then generate even positions. (Ford et al., 2018)
Balanced-tree (BLT)   Generate words with a top-down left-to-right order from a balanced binary tree. (Stern et al., 2019)
Syntax-tree (SYN)     Generate words with a top-down left-to-right order from the dependency tree. (Wang et al., 2018b)
Common-First (CF)     Generate all common words first from left to right, and then generate the others. (Ford et al., 2018)
Rare-First (RF)       Generate all rare words first from left to right, and then generate the remaining. (Ford et al., 2018)
Random (RND)          Generate words in a random order shuffled every time the example is loaded.

Table 1: Descriptions of the pre-defined orders used in this work. Major references that have explored these generation orders with different models and applications are also marked.
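To make the pre-defined orders in Table 1 concrete, a small sketch (ours) of how two of them could be realized as permutations of absolute positions:

```python
def odd_even_order(n):
    """ODD: absolute positions 1..n visited odd-first, then even."""
    return [i for i in range(1, n + 1) if i % 2 == 1] + \
           [i for i in range(1, n + 1) if i % 2 == 0]

def common_first_order(tokens, common_vocab):
    """CF: positions of common words (L2R), then positions of the rest (L2R)."""
    common = [i + 1 for i, w in enumerate(tokens) if w in common_vocab]
    rare = [i + 1 for i, w in enumerate(tokens) if w not in common_vocab]
    return common + rare

tokens = ["i", "have", "a", "wonderful", "dream"]
print(odd_even_order(len(tokens)))                       # [1, 3, 5, 2, 4]
print(common_first_order(tokens, {"i", "have", "a"}))    # [1, 2, 3, 4, 5]
```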
Bootstrapping from a Pre-defined Order  During preliminary experiments, sequences returned by beam-search were often degenerate, always predicting common or functional words ("the", ",", etc.) as the first several tokens, leading to inferior performance. We conjecture that this is because the position prediction module learns much faster than the word prediction module, so it quickly captures spurious correlations induced by a poorly initialized model. It is essential to balance the learning progress of these two modules. To do so, we bootstrap learning by pre-training the model with a pre-defined order (e.g., L2R) before training with beam-searched orders.
4.3 Decoding
As for decoding, we directly follow Algorithm 1 to sample or decode greedily from the proposed model. In practice, however, beam-search is important for exploring the output space of neural autoregressive models. In our implementation, we perform beam-search for InDIGO as a two-step search. Given a beam size $B$, at each step we first do beam-search for word prediction and then, with the searched words, try out all possible positions and select the top-$B$ sub-sequences. In preliminary experiments, we also tried doing beam-search for words and positions simultaneously with their joint probability, but it did not seem helpful.
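One way to read this two-step search is the following sketch (ours; word_topk and position_scores stand in for the model's Equation (8) and Equation (9) outputs for a given hypothesis state):

```python
import heapq

def two_step_beam_step(hypotheses, B, word_topk, position_scores):
    """One decoding step of the two-step beam-search described in Sec. 4.3.

    hypotheses:      list of (score, state) pairs, one per beam entry.
    word_topk:       callable(state, B) -> top-B [(word, log p_word)] pairs (Eq. 8).
    position_scores: callable(state, word) -> [(position, log p_pointer)] (Eq. 9).
    Returns the top-B expanded hypotheses as (score, (state, word, position)).
    """
    candidates = []
    for score, state in hypotheses:
        for word, lp_w in word_topk(state, B):               # step 1: words
            for pos, lp_r in position_scores(state, word):   # step 2: positions
                candidates.append((score + lp_w + lp_r, (state, word, pos)))
    return heapq.nlargest(B, candidates, key=lambda c: c[0])

# Tiny demo with stand-in scorers:
hyps = [(0.0, "h0")]
print(two_step_beam_step(
    hyps, B=2,
    word_topk=lambda s, B: [("a", -0.1), ("b", -0.3)][:B],
    position_scores=lambda s, w: [(0, -0.2), (1, -0.4)]))
```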
5 Experiments
We evaluate InDIGO extensively on four chal-
lenging sequence generation tasks: word order
recovery, machine translation, natural language to
code generation (NL2Code, Ling et al., 2016) and
image captioning. We compare our model trained
with the pre-defined orders and the adaptive
orders obtained by beam-search. We use the same
architecture for all orders including the standard
L2R order.
5.1 Experimental Settings
Dataset  The machine translation experiments are conducted on three language pairs to study how the decoding order influences the translation quality of languages with diverse characteristics: WMT'16 Romanian-English (Ro-En),2 WMT'18 English-Turkish (En-Tr),3 and KFTT English-Japanese (En-Ja, Neubig, 2011).4 The English part of the Ro-En dataset is used for the word order recovery task. We use the Django dataset (Oda et al., 2015)5 for the NL2Code task and MS COCO (Lin et al., 2014) with the standard split (Karpathy and Fei-Fei, 2015) for image captioning. The dataset statistics are shown in Table 2.
Preprocessing  We apply the Moses tokenization6 and normalization on all the text datasets except for code. We perform 32,000 joint BPE (Sennrich et al., 2016) operations for the MT datasets, while using all the unique words as the vocabulary for NL2Code. For image captioning, we follow the same procedure as described by Lee et al. (2018): we use 49 image feature vectors of 512 dimensions each (extracted from a pretrained ResNet-18 [He et al., 2016]) as the input to the Transformer encoder. The image features are fixed during training.

2 http://www.statmt.org/wmt16/translation-task.html
3 http://www.statmt.org/wmt18/translation-task.html
4 http://www.phontron.com/kftt/
5 https://github.com/odashi/ase15-django-dataset

Dataset        Train   Dev    Test   Length
WMT16 Ro-En    620k    2000   2000   26.48
WMT18 En-Tr    207k    3007   3000   25.81
KFTT En-Ja     405k    1166   1160   27.51
Django         16k     1000   1801   8.87
MS-COCO        567k    5000   5000   12.52

Table 2: Dataset statistics for the machine translation, code generation, and image captioning tasks. Length represents the average number of tokens for target sentences of the training set.
Models  We set $d_{\text{model}} = 512$, $d_{\text{hidden}} = 2048$, $n_{\text{heads}} = 8$, $n_{\text{layers}} = 6$, $lr_{\max} = 0.0005$, warmup $= 4000$, and dropout $= 0.1$ throughout all the experiments. The source and target embedding matrices are shared except for En-Ja, as our preliminary experiments showed that not sharing the embeddings significantly improves the translation quality. Both the encoder and decoder use relative positions during self-attention, except for the word order recovery experiments (where the position embedding is removed in the encoder, as there is no ground-truth position information in the input). We do not introduce task-specific modules such as a copying mechanism (Gu et al., 2016).
Training  When training with the pre-defined orders, we reorder the words of each training sequence in advance accordingly, which provides supervision for the ground-truth position at which each word should be inserted. We test the pre-defined orders listed in Table 1. The SYN orders were generated according to the dependency parse obtained by a dependency parser from spaCy (Honnibal and Montani, 2017), following a parent-to-children left-to-right order. The CF and RF orders are obtained based on a vocabulary cut-off such that the number of common words and the number of rare words are approximately the same (Ford et al., 2018). We also consider sampling a random order on the fly for each sentence as a baseline (RND). When using L2R as the pre-defined order, Transformer-InDIGO is almost equivalent to the vanilla Transformer, as the position prediction simply learns to predict the next position as the left of the ⟨/s⟩ symbol. The only difference is that it enhances the vanilla Transformer with a small number of additional parameters for the position prediction.

Figure 3: The BLEU scores on the test set for word order recovery with various decoding beam sizes.
We also train Transformer-InDIGO using SAO, where we set the beam size to 8. By default, models trained with SAO are bootstrapped from a slightly pre-trained (6,000 steps) model in the L2R order.
Inference  At test time, we do beam-search as described in Sec. 4.3. We observe from our preliminary experiments that models trained with different orders (either pre-defined or SAO) have very different optimal beam sizes for decoding. Therefore, we perform sensitivity studies in which the beam sizes vary from 1 to 20, and pick the beam size with the highest BLEU score on the validation set for each particular model.
5.2 Results and Analysis
Word Order Recovery  Word order recovery takes a bag of words as input and recovers its original order, which is challenging as the search space is factorial. We do not restrict the vocabulary of the input words. We compare our models trained with the L2R order and with SAO (beam size 8) for word order recovery. The BLEU scores over various decoding beam sizes are shown in Figure 3. The model trained with SAO leads to higher BLEU scores than the one trained with L2R, with a gain of up to 3 BLEU points. Furthermore, increasing the beam size brings larger improvements for SAO than for L2R, suggesting that InDIGO produces more diversified predictions and thus has a higher chance of recovering the order.
Order   WMT16 Ro → En                  WMT18 En → Tr                  KFTT En → Ja
        BLEU  Ribes  Meteor  TER       BLEU  Ribes  Meteor  TER       BLEU  Ribes  Meteor  TER
RND     20.20 79.35  41.00   63.20     03.04 55.45  19.12   90.60     17.09 70.89  35.24   70.11
L2R     31.82 83.37  52.19   50.62     14.85 69.20  33.90   71.56     30.87 77.72  48.57   59.92
R2L     31.62 83.18  52.09   50.20     14.38 68.87  33.33   71.91     30.44 77.95  47.91   61.09
ODD     30.11 83.09  50.68   50.79     13.64 68.85  32.48   72.84     28.59 77.01  46.28   60.12
BLT     24.38 81.70  45.67   55.38     08.72 65.70  27.40   77.76     21.50 73.97  40.23   64.39
SYN     29.62 82.65  50.25   52.14     –     –      –       –         –     –      –       –
CF      30.25 83.22  50.71   50.72     12.04 67.61  31.18   74.75     28.91 77.06  46.46   61.56
RF      30.23 83.29  50.72   51.73     12.10 67.44  30.72   73.40     27.35 76.40  45.15   62.14
SAO     32.47 84.10  53.00   49.02     15.18 70.06  34.60   71.56     31.91 77.56  49.66   59.80

Table 3: Results of translation experiments for three language pairs in different decoding orders. Scores are reported on the test set with four widely used evaluation metrics (BLEU↑, Meteor↑, TER↓, and Ribes↑). We do not report models trained with the SYN order on En-Tr and En-Ja due to the lack of reliable dependency parsers. The statistical significance analysis6 between the outputs of SAO and L2R is conducted using BLEU score as the metric, and the p-values are ≤ 0.001 for all three language pairs.
Machine Translation  As shown in Table 3, we compare our model trained with pre-defined orders and with SAO under varying setups. We use four evaluation metrics, BLEU (Papineni et al., 2002), Ribes (Isozaki et al., 2010), Meteor (Banerjee and Lavie, 2005), and TER (Snover et al., 2006), to avoid relying on a single metric that might favor a particular generation order. Most of the pre-defined orders (except for the random order and the balanced-tree (BLT) order) perform reasonably well with InDIGO on the three language pairs. Among the pre-defined orders, the best score is reached by the L2R order, except for En-Ja, where the R2L order works slightly better according to Ribes. This indicates that, in machine translation, the monotonic orders are reasonable choices that reflect these languages. ODD, CF, and RF show similar performance, which is below the L2R and R2L orders by around 2 BLEU points. The tree-based orders, such as SYN and BLT, do not perform well, indicating that predicting words following a syntactic path is not preferable. On the other hand, Table 3 shows that the model with SAO achieves competitive and even statistically significant improvements over the L2R order. The improvements are larger for Turkish and Japanese, indicating that a flexible generation order may improve the translation quality for languages whose syntactic structure differs from English.
6https://github.com/moses-smt/mosesdecoder
Model   Django                MS-COCO
        BLEU   Accuracy       BLEU   CIDEr-D
L2R     36.74  13.6%          22.12  68.88
SAO     42.33  16.3%          22.58  69.42

Table 4: Results on the official test sets for both the code generation and image captioning tasks.
Code Generation  The goal of this task is to generate Python code based on a natural language description, which can be achieved with a standard sequence-to-sequence generation framework such as the proposed Transformer-InDIGO. As shown in Table 4, SAO works significantly better than the L2R order in terms of both BLEU and accuracy, showing that flexible generation orders are preferable in code generation.

Image Captioning  For the captioning task, one caption is generated per image and is compared against five human-created captions during testing. As shown in Table 4, SAO obtains higher BLEU and CIDEr-D (Vedantam et al., 2015) scores than the L2R order, implying that better captions are generated with different orders.
5.3 Ablation Study
Model Variants  Table 5 shows the results of the ablation studies using the machine translation task. SAO without bootstrapping and without noise in beam-search degenerates by approximately 1 BLEU point on Ro-En, demonstrating the effectiveness of these
Model Variants                   dev     test
Baseline L2R                     32.53   31.82
SAO default                      33.60   32.47
  no bootstrap                   32.86   31.88
  no bootstrap, no noise         32.64   31.72
  bootstrap from R2L order       33.12   32.02
  bootstrap from SYN order       33.09   31.93
Stern et al. (2019) - Uniform    29.99   28.52
Stern et al. (2019) - Binary     32.27   30.66

Table 5: Ablation study for machine translation on WMT16 Ro-En. Results of Stern et al. (2019) are based on greedy decoding with the EOS penalty.
two methods. We also test SAO by bootstrapping from a model trained with an R2L order as well as a SYN order, which obtains slightly worse yet comparable results compared to bootstrapping from L2R. This suggests that the SAO algorithm is quite robust to different bootstrapping methods, and that L2R bootstrapping performs the best. In addition, we re-implement a recent approach (Stern et al., 2019) that adopts a similar idea of generating sequences through insertion operations for machine translation. We use the best settings of their algorithm, i.e., training with binary-tree/uniform slot-losses and slot-termination, while removing the knowledge distillation for a fair comparison with ours. Our model obtains better performance than Stern et al. (2019) on WMT16 Ro-En.
Running Time  As shown in Table 6, InDIGO decodes sentences as efficiently as the standard L2R autoregressive models. However, it is slower to train when using SAO as the supervision, as additional effort is needed to search for the generation orders, and it is difficult to parallelize the SAO. SAO with beam sizes 1 and 8 is 3.8 and 7.2 times slower than L2R, respectively. Note that enlarging the beam size during training does not affect the decoding time, as searching for the best orders only happens at training time. We will investigate off-line search methods to speed up SAO training and make InDIGO more scalable in the future.
Model         Training (b/s)   Decoding (ms/s)
L2R           4.21             12.3
SAO (b = 1)   1.12             12.5
SAO (b = 8)   0.58             12.8

Table 6: Comparison of the L2R order with SAO on running time, where b/s is batches per second and ms/s is milliseconds per sentence. All experiments are conducted on 8 Nvidia V100 GPUs with 2000 tokens per GPU. We also compare beam sizes of 1 and 8 for SAO to search the best orders during training. We report the decoding speed of all three models based on greedy decoding.

5.4 Visualization

Relative-Position Matrix  In Figure 4, we show an instantiated example produced by InDIGO,
which is randomly sampled from the validation set of the KFTT En-Ja dataset. The relative-position matrices ($R^t$) and their corresponding absolute positions ($z^t$) are shown at each step. We argue that relative-position matrices are a flexible way to encode position information, and their append-only property enables InDIGO to reuse previous hidden states.
Case Study  We demonstrate how InDIGO works by uniformly sampling examples from the validation sets for machine translation (Ro-En), image captioning, and code generation. As shown in Figure 5, the proposed model generates sequences in different orders based on the order used for learning (either pre-defined or SAO). For instance, the model generates tokens approximately following the dependency parse when we use the SYN order for the machine translation task. On the other hand, the model trained using the RF order learns to produce verbs and nouns first, before filling up the sequence with the remaining functional words.

We observe several key characteristics of the orders inferred by SAO by analyzing the model's output for each task: (1) For machine translation, the generation order of an output sequence does not deviate too much from L2R. Instead, the sequences are shuffled in chunks, and words within each chunk are generated in an L2R order; (2) In the examples of image captioning and code generation, the model tends to generate most of the words in the L2R order and insert a few words afterward in certain locations. We provide more examples in the appendix.
Figure 4: A concrete example of the decoding process using InDIGO, sampled from the En-Ja translation dataset. The final output is reordered based on the predicted relative-position matrix.
6 Related Work
Decoding for Neural Models  Neural autoregressive modeling has become one of the most successful approaches for generating sequences (Sutskever et al., 2011; Mikolov, 2012), and has been widely used in a range of applications, such as machine translation (Sutskever et al., 2014), dialogue response generation (Vinyals and Le, 2015), image captioning (Karpathy and Fei-Fei, 2015), and speech recognition (Chorowski et al., 2015). Another stream of work focuses on generating a sequence of tokens in a non-autoregressive fashion (Gu et al., 2018; Lee et al., 2018; van den Oord et al., 2018), in which the discrete tokens are generated in parallel. Semi-autoregressive modeling (Stern et al., 2018; Wang et al., 2018a) is a mixture of the two approaches, while largely adhering to left-to-right generation. Our method differs from these approaches in that it supports flexible generation orders while decoding autoregressively.
Non-L2R Orders  Previous studies on the generation order of sequences mostly resort to a fixed set of generation orders. Wu et al. (2018) empirically show that R2L generation outperforms its L2R counterpart in a few tasks. Ford et al. (2018) devise a two-pass approach that first produces partially-filled sentence "templates" and then fills in the missing tokens. Zhu et al. (2019) also propose to generate tokens by first predicting a text template and then infilling the sentence, in a more general way. Mehri and Sigal (2018) propose a
middle-out decoder that first predicts a middle word and then simultaneously expands the sequence in both directions. Previous studies have also explored decoding in a bidirectional fashion (e.g., Sun et al., 2017; Zhou et al., 2019a,b). Another line of work models sequence generation based on syntactic structures (Yamada and Knight, 2001; Charniak et al., 2003; Chiang, 2005; Emami and Jelinek, 2005; Zhang et al., 2016; Dyer et al., 2016; Aharoni and Goldberg, 2017; Wang et al., 2018b; Eriguchi et al., 2017). In contrast, Transformer-InDIGO supports fully flexible generation orders during decoding.

Figure 5: Examples randomly sampled from the three tasks, decoded by InDIGO with various learned generation orders. Words in red and underlined are the tokens inserted at each step. For visual convenience, we reorder all the partial sequences to their correct positions at each decoding step.
Two concurrent papers (Welleck et al., 2019; Stern et al., 2019) also study sequence generation in a non-L2R order. Welleck et al. (2019) propose a tree-like generation algorithm. Unlike this work, their tree-based generation order only covers a subset of all possible generation orders compared to our insertion-based model. Further, Welleck et al. (2019) find that L2R is superior to their learned orders on machine translation tasks, while Transformer-InDIGO with searched adaptive orders achieves better performance. Stern et al. (2019) propose a very similar idea of using insertion operations in the Transformer for machine translation. The major difference is that they directly use absolute positions, whereas ours utilizes relative positions. As a result, their model needs to re-encode the partial sequence at every step, which is computationally more expensive, while our approach does not require re-encoding the entire sentence during generation. In addition, knowledge distillation was necessary to achieve good performance in Stern et al. (2019), while our model is able to match the performance of L2R even without bootstrapping.
7 Conclusion

We have presented a novel approach, InDIGO, which supports flexible sequence generation. Our model can be trained with either pre-defined orders or searched adaptive orders. In contrast to conventional neural autoregressive models that generate from left to right, our model can flexibly generate a sequence following an arbitrary order. Experiments show that our method achieves competitive or even better performance than conventional left-to-right generation on four tasks: machine translation, word order recovery, code generation, and image captioning. For future work, it is worth exploring a trainable inference model that directly predicts the permutation (Mena et al., 2018) instead of relying on beam-search. The proposed InDIGO could also be extended to post-editing tasks, such as automatic post-editing for machine translation and grammatical error correction, by introducing additional operations such as "deletion" and "substitution".
Acknowledgments

We especially thank our action editor Alexandra Birch and all the reviewers for their great efforts in reviewing the draft. We would also like to thank Douwe Kiela, Marc'Aurelio Ranzato, Jake Zhao, and our colleagues at FAIR for their valuable feedback, discussions, and technical assistance. This work was partly supported by Samsung Advanced Institute of Technology (Next Generation Deep Learning: From Pattern Recognition to AI) and Samsung Electronics (Improving Deep Learning Using Latent Structure). KC thanks eBay and Nvidia for their support.
References
Roee Aharoni and Yoav Goldberg. 2017. Towards
string-to-tree neural machine translation. In
Proceedings of the 55th Annual Meeting of
the Association for Computational Linguistics
(Volume 2: Short Papers), volume 2, pages
132–140.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua
Bengio. 2015. Neural machine translation by
jointly learning to align and translate. In 3rd
International Conference on Learning Repre-
sentations, ICLR 2015, San Diego, CA, USA,
May 7-9, 2015, Conference Track Proceedings.
Satanjeev Banerjee and Alon Lavie. 2005. Meteor:
An automatic metric for MT evaluation with
improved correlation with human judgments.
In Proceedings of the ACL Workshop on In-
trinsic and Extrinsic Evaluation Measures for
Machine Translation and/or Summarization,
pages 65–72.
Eugene Charniak, Kevin Knight, and Kenji
Yamada. 2003. Syntax-based language models
for statistical machine translation. In MT Sum-
mit IX. Intl. Assoc. for Machine Translation.
Citeseer.
David Chiang. 2005. A hierarchical phrase-
based model for statistical machine translation.
In Proceedings of the 43rd Annual Meeting
on Association for Computational Linguistics,
pages 263–270. Association for Computational
Linguistics.
Kyunghyun Cho. 2016. Noisy parallel approxi-
mate decoding for conditional recurrent lan-
guage model. arXiv preprint arXiv:1605.03835.
Jan K. Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. 2015. Attention-based models for speech recognition. In NIPS, pages 577–585.
Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12-17, 2016, pages 199–209.
Ahmad Emami and Frederick Jelinek. 2005. A
neural syntactic language model. Machine
Learning, 60(1-3):195–227.
Akiko Eriguchi, Yoshimasa Tsuruoka,
and
Kyunghyun Cho. 2017. Learning to parse and
translate improves neural machine translation.
In Proceedings of the 55th Annual Meeting of
the Association for Computational Linguistics,
ACL 2017, Vancouver, Canada, July 30 - August 4,
Volume 2: Short Papers, pages 72–78.
Nicolas Ford, Daniel Duckworth, Mohammad Norouzi, and George E. Dahl. 2018. The importance of generation order in language modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 2942–2946.
Jiatao Gu, James Bradbury, Caiming Xiong,
Victor O. K. Li, and Richard Socher. 2018.
Non-autoregressive neural machine translation.
In 6th International Conference on Learn-
ing Representations, ICLR 2018, Vancouver,
Canada, April 30-May 3, 2018, Conference
Track Proceedings.
Jiatao Gu, Zhengdong Lu, Hang Li, and Victor
O. K. Li. 2016. Incorporating copying me-
chanism in sequence-to-sequence learning. In
Proceedings of the 54th Annual Meeting of
the Association for Computational Linguistics,
ACL 2016, August 7-12, 2016, Berlin, Germany,
Volume 1: Long Papers.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.
Matthew Honnibal and Ines Montani. 2017. spaCy
2: Natural language understanding with Bloom
embeddings, convolutional neural networks and
incremental parsing. To appear.
Hideki
Isozaki, Tsutomu Hirao, Kevin Duh,
Katsuhito Sudoh, and Hajime Tsukada. 2010.
Automatic evaluation of translation quality for
distant language pairs. In Proceedings of the
2010 Conference on Empirical Methods in
Natural Language Processing, pages 944–952.
Association for Computational Linguistics.
Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137.
Jason Lee, Elman Mansimov, and Kyunghyun
Cho. 2018. Deterministic non-autoregressive
neural sequence modeling by iterative refine-
ment. In Proceedings of the 2018 Conference
on Empirical Methods in Natural Language
Processing, Brussels, Belgium, October 31 -
November 4, 2018, pages 1173–1182.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer.
Wang Ling, Phil Blunsom, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, Fumin Wang, and Andrew W. Senior. 2016. Latent predictor networks for code generation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics,
ACL 2016, August 7-12, 2016, Berlin, Germany,
Volume 1: Long Papers.
Shikib Mehri and Leonid Sigal. 2018. Middle-out decoding. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 5523–5534. Curran Associates, Inc.
Gonzalo Mena, David Belanger, Scott Linderman,
and Jasper Snoek. 2018. Learning latent per-
mutations with Gumbel-Sinkhorn networks. In
6th International Conference on Learning Rep-
resentations, ICLR 2018, Vancouver, Canada,
April 30-May 3, 2018, Conference Track
Proceedings.
Tomáš Mikolov. 2012. Statistical language models based on neural networks. Presentation at Google, Mountain View, 2 April.
Graham Neubig. 2011. The Kyoto free translation
task. http://www.phontron.com/kftt.
Yusuke Oda, Hiroyuki Fudaba, Graham Neubig,
Hideaki Hata, Sakriani Sakti, Tomoki Toda,
and Satoshi Nakamura. 2015. Learning to gen-
erate pseudo-code from source code using
statistical machine translation. In Proceedings
of the 2015 30th IEEE/ACM International Con-
ference on Automated Software Engineering
(ASE), ASE ’15, pages 574–584, Lincoln,
Nebraska, USA. IEEE Computer Society.
Aaron van den Oord, Yazhe Li, Igor Babuschkin,
Karen Simonyan, Oriol Vinyals, Koray
Kavukcuoglu, George van den Driessche,
Edward Lockhart, Luis Cobo,
Florian
Stimberg, Norman Casagrande, Dominik
Grewe, Seb Noury, Sander Dieleman, Erich
Elsen, Nal Kalchbrenner, Heiga Zen, Alex
Graves, Helen King, Tom Walters, Dan Belov,
and Demis Hassabis. 2018. Parallel WaveNet:
Fast high-fidelity speech synthesis. In Proceed-
ings of the 35th International Conference on
Machine Learning, volume 80 of Proceedings
of Machine Learning Research, pages 3918–3926,
Stockholm, Sweden. PMLR.
Chris Pal, Charles Sutton, and Andrew McCallum.
2006. Sparse forward-backward using min-
imum divergence beams for fast training of
conditional random fields. In Acoustics, Speech
and Signal Processing, 2006. ICASSP 2006
Proceedings. 2006 IEEE International Confer-
ence on, volume 5, pages V–V. IEEE.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.
Alexander M. Rush, Sumit Chopra, and Jason
Weston. 2015. A neural attention model for
abstractive sentence summarization. In Pro-
ceedings of the 2015 Conference on Empirical
Methods in Natural Language Processing,
pages 379–389, Lisbon, Portugal. Association
for Computational Linguistics.
Rico Sennrich, Barry Haddow, and Alexandra
Birch. 2016. Neural machine translation of rare
words with subword units. In Proceedings of
the 54th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long
Papers), pages 1715–1725, Berlin, Germany.
Association for Computational Linguistics.
Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 464–468, New Orleans, Louisiana. Association for Computational Linguistics.
Matthew Snover, Bonnie Dorr, Richard Schwartz,
Linnea Micciulla, and John Makhoul. 2006.
A study of translation edit rate with targeted
human annotation. In Proceedings of Associa-
tion for Machine Translation in the Americas,
pages 223–231.
Mitchell Stern, William Chan, Jamie Kiros, and
Jakob Uszkoreit. 2019. Insertion transformer:
Flexible sequence generation via insertion
operations. arXiv preprint arXiv:1902.03249.
Mitchell Stern, Noam Shazeer, and Jakob
Uszkoreit. 2018. Blockwise parallel decoding
for deep autoregressive models. In Advances
in Neural Information Processing Systems,
pages 10107–10116.
Qing Sun, Stefan Lee, and Dhruv Batra. 2017.
Bidirectional beam search: Forward-backward
inference in neural sequence models for fill-in-
the-blank image captioning. In Proceedings of
the IEEE Conference on Computer Vision and
Pattern Recognition, pages 6961–6969.
Ilya Sutskever, James Martens, and Geoffrey E
Hinton. 2011. Generating text with recurrent
neural networks. In Proceedings of the 28th
International Conference on Machine Learning
(ICML-11), pages 1017–1024.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3104–3112.
Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Lukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need. In Proceedings of
the Annual Conference on Neural Information
Processing Systems (NIPS).
Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575.
Ashwin K Vijayakumar, Michael Cogswell,
Ramprasath R Selvaraju, Qing Sun, Stefan
Lee, David Crandall, and Dhruv Batra. 2016.
Diverse beam search: Decoding diverse solutions
from neural sequence models. arXiv preprint
arXiv:1610.02424.
Oriol Vinyals, Samy Bengio, and Manjunath
Kudlur. 2016. Order matters: Sequence to
sequence for sets. In 4th International Confer-
ence on Learning Representations, ICLR 2016,
San Juan, Puerto Rico, May 2-4, 2016, Confer-
ence Track Proceedings.
Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems, pages 2692–2700.
Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.
Chunqi Wang, Ji Zhang, and Haiqing Chen.
2018a. Semi-autoregressive neural machine
translation. In Proceedings of the 2018 Con-
ference on Empirical Methods in Natural
Language Processing, pages 479–488, Brussels,
Belgium. Association
for Computational
Linguistics.
Xinyi Wang, Hieu Pham, Pengcheng Yin, and
Graham Neubig. 2018b. A tree-based decoder
for neural machine translation. In Proceedings of the
2018 Conference on Empirical Methods in Nat-
ural Language Processing, pages 4772–4777,
Brussels, Belgium. Association for Compu-
tational Linguistics.
Sean Welleck, Kianté Brantley, Hal Daumé III, and Kyunghyun Cho. 2019. Non-monotonic sequential text generation. arXiv preprint arXiv:1902.02192.
Lijun Wu, Xu Tan, Di He, Fei Tian, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. 2018. Beyond error propagation in neural machine translation: Characteristics of language also matter. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3602–3611, Brussels, Belgium. Association for Computational Linguistics.
Kenji Yamada and Kevin Knight. 2001. A
syntax-based statistical translation model. In
Proceedings of the 39th Annual Meeting of the
Association for Computational Linguistics.
Pengcheng Yin and Graham Neubig. 2017. A
syntactic neural model for general-purpose code
generation. In Proceedings of the 55th Annual
Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pages
440–450, Vancouver, Canada. Association for
Computational Linguistics.
Xingxing Zhang, Liang Lu, and Mirella Lapata. 2016. Top-down tree long short-term memory networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 310–320, San Diego, California. Association for Computational Linguistics.
Long Zhou, Jiajun Zhang, and Chengqing Zong. 2019a. Synchronous bidirectional neural machine translation. Transactions of the Association for Computational Linguistics, 7:91–105.

Long Zhou, Jiajun Zhang, Chengqing Zong, and Heng Yu. 2019b. Sequence generation: From both sides to the middle. In IJCAI.

Wanrong Zhu, Zhiting Hu, and Eric Xing. 2019. Text infilling. arXiv preprint arXiv:1901.00158.