EDITOR: An Edit-Based Transformer with Repositioning
for Neural Machine Translation with Soft Lexical Constraints
Weijia Xu
University of Maryland
weijia@cs.umd.edu
Marine Carpuat
University of Maryland
marine@cs.umd.edu
Abstract
We introduce an Edit-Based TransfOrmer with Repositioning (EDITOR), which makes sequence generation flexible by seamlessly allowing users to specify preferences in output lexical choice. Building on recent models for non-autoregressive sequence generation (Gu et al., 2019), EDITOR generates new sequences by iteratively editing hypotheses. It relies on a novel reposition operation designed to disentangle lexical choice from word positioning decisions, while enabling efficient oracles for imitation learning and parallel edits at decoding time. Empirically, EDITOR uses soft lexical constraints more effectively than the Levenshtein Transformer (Gu et al., 2019) while speeding up decoding dramatically compared to constrained beam search (Post and Vilar, 2018). EDITOR also achieves comparable or better translation quality with faster decoding speed than the Levenshtein Transformer on standard Romanian-English, English-German, and English-Japanese machine translation tasks.
1 Introduction
Neural machine translation (NMT) architectures (Bahdanau et al., 2015; Vaswani et al., 2017) make it difficult for users to specify preferences that could be incorporated more easily in statistical MT models (Koehn et al., 2007) and have been shown to be useful for interactive machine translation (Foster et al., 2002; Barrachina et al., 2009) and domain adaptation (Hokamp and Liu, 2017). Lexical constraints or preferences have previously been incorporated by re-training NMT models with constraints as inputs (Song et al., 2019; Dinu et al., 2019) or with constrained beam search that drastically slows down decoding (Hokamp and Liu, 2017; Post and Vilar, 2018).

In this work, we introduce a translation model that can seamlessly incorporate users' lexical choice preferences without increasing the time and computational cost at decoding time, while being trained on regular MT samples. We apply this model to MT tasks with soft lexical constraints. As illustrated in Figure 1, when decoding with soft lexical constraints, user preferences for lexical choice in the output language are provided as an additional input sequence of target words in any order. The goal is to let users encode terminology, domain, or stylistic preferences in target word usage, without strictly enforcing hard constraints that might hamper NMT's ability to generate fluent outputs.
Our model is an Edit-Based TransfOrmer with Repositioning (EDITOR), which builds on recent progress on non-autoregressive sequence generation (Lee et al., 2018; Ghazvininejad et al., 2019).1 Specifically, the Levenshtein Transformer (Gu et al., 2019) showed that iteratively refining output sequences via insertions and deletions yields a fast and flexible generation process for MT and automatic post-editing tasks. EDITOR replaces the deletion operation with a novel reposition operation to disentangle lexical choice from reordering decisions. As a result, EDITOR exploits lexical constraints more effectively and efficiently than the Levenshtein Transformer, as a single reposition operation can subsume a sequence of deletions and insertions. To train EDITOR via imitation learning, the reposition operation is defined to preserve the ability to use the Levenshtein edit distance (Levenshtein, 1966)
as an efficient oracle. We also introduce a dual-path roll-in policy, which lets the reposition and insertion models learn to refine their respective outputs more effectively.
Experiments on Romanian-English, English-German, and English-Japanese MT show that
EDITOR achieves comparable or better trans-
lation quality with faster decoding speed than
1 https://github.com/Izecson/fairseq-editor.
Transactions of the Association for Computational Linguistics, vol. 9, pp. 311–328, 2021. https://doi.org/10.1162/tacl_a_00368
Action Editor: François Yvon. Submission batch: 7/2020; Revision batch: 10/2020; Published 3/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
the quality gap between non-autoregressive and
autoregressive models. Sin embargo, we argue that
these operations limit the flexibility and efficiency
of the resulting models for MT by entangling
lexical choice and reordering decisions.
Reordering vs. Lexical Choice EDITOR's insertion and reposition operations connect closely with the long-standing view of MT as a combination of a translation or lexical choice model, which selects appropriate translations for source units given their context, and a reordering model, which encourages the generation of a target sequence order appropriate for the target language. This view is reflected in architectures ranging from the word-based IBM models (Brown et al., 1990), to sentence-level models that generate a bag of target words that is reordered to construct a target sentence (Bangalore et al., 2007), to the Operation Sequence Model (Durrani et al., 2015; Stahlberg et al., 2018), which views translation as a sequence of translation and reordering operations over bilingual minimal units. By contrast, autoregressive NMT models (Bahdanau et al., 2015; Vaswani et al., 2017) do not explicitly separate lexical choice and reordering, and previous non-autoregressive models break up reordering into sequences of other operations. This work introduces the reposition operation, which makes it possible to move words around during the refinement process, as reordering models do. However, we will see that reposition differs from typical reordering to enable efficient oracles for training via imitation learning, and parallelization of edit operations at decoding time (Section 3).
MT with Soft Lexical Constraints NMT models lack flexible mechanisms to incorporate users' preferences in their outputs. Lexical constraints have been incorporated in prior work through 1) constrained training, where NMT models are trained on parallel samples augmented with constraint target phrases in both the source and target sequences (Song et al., 2019; Dinu et al., 2019), or 2) constrained decoding, where beam search is modified to include constraint words or phrases in the output (Hokamp and Liu, 2017; Post and Vilar, 2018). These mechanisms can incorporate domain-specific knowledge and lexicons, which is particularly helpful in low-resource cases (Arthur et al., 2016; Tang et al., 2016). Despite their success at domain adaptation for MT
Figure 1: Romanian to English MT example. Unconstrained MT incorrectly translates "gleznă" to "bullying". Given constraint words "plague" and "ankle", soft-constrained MT correctly uses "ankle" and avoids disfluencies introduced by using "plague" as a hard constraint in its exact form.
the Levenshtein Transformer (Gu et al., 2019) en
the standard MT tasks and exploits soft lexical
constraints better: It achieves significantly better
translation quality and matches more constraints
with faster decoding speed than the Levenshtein
Transformador. It also drastically speeds up decod-
ing compared with lexically constrained decoding
algorithms (Post and Vilar, 2018). Moreover,
results highlight the benefits of soft constraints
over hard ones—EDITOR with soft constraints
achieves translation quality on par or better than
both EDITOR and Levenshtein Transformer with
hard constraints (Susanto et al., 2020).
2 Background
Non-Autoregressive MT Although autoregres-
sive models that decode from left-to-right are the
de facto standard for many sequence generation
tasks (Cho et al., 2014; Chorowski et al., 2015;
Vinyals and Le, 2015), non-autoregressive models
offer a promising alternative to speed up de-
coding by generating a sequence of tokens in par-
allel (Gu et al., 2018; van den Oord et al., 2018;
Ma et al., 2019). Sin embargo, their output quality
suffers due to the large decoding space and strong
independence assumptions between target tokens
(Ma et al., 2019; Wang y cols., 2019). These issues
have been addressed via partially parallel decoding
(Wang y cols., 2018; Stern et al., 2018) or multi-
pass decoding (Lee et al., 2018; Ghazvininejad
et al., 2019; Gu et al., 2019). This work adopts
multi-pass decoding, where the model generates
the target sequences by iteratively editing the out-
puts from previous iterations. Edit operations such
as substitution (Ghazvininejad et al., 2019) and
insertion-deletion (Gu et al., 2019) have reduced
(Hokamp and Liu, 2017) and caption generation
(Anderson et al., 2017), they suffer from sev-
eral issues: Constrained training requires building
dedicated models for constrained language gener-
ación, while constrained decoding adds significant
computational overhead and treats all constraints
as hard constraints, which may hurt fluency. In
other tasks, various constraint types have been
introduced by designing complex architectures
tailored to specific content or style constraints
(Abu Sheikha and Inkpen, 2011; Mei et al., 2016),
or via segment-level ‘‘side-constraints’’ (Sennrich
et al., 2016a; Ficler and Goldberg, 2017; Agrawal
and Carpuat, 2019), which condition generation
on users’ stylistic preferences, but do not offer
fine-grained control over their realization in the
output sequence. We refer the reader to Yvon and
Abdul Rauf (2020) for a comprehensive review of
the strengths and weaknesses of current techniques
to incorporate terminology constraints in NMT.
Our work is closely related to Susanto et al.
(2020)’s idea of applying the Levenshtein Trans-
former to MT with hard terminology constraints.
We will see that their technique can directly be
used by EDITOR as well (Section 3.3), but this does not offer empirical benefits over the default EDITOR model (Section 4.3).
3 Approach
3.1 The EDITOR Model

We cast both constrained and unconstrained language generation as an iterative sequence refinement problem modeled by a Markov Decision Process (Y, A, E, R, y0), where a state y in the state space Y corresponds to a sequence of tokens y = (y1, y2, . . . , yL) from the vocabulary V up to length L, and y0 ∈ Y is the initial sequence. For standard sequence generation tasks, y0 is the empty sequence (⟨s⟩, ⟨/s⟩). For lexically constrained generation tasks, y0 consists of the words to be used as constraints (⟨s⟩, c1, . . . , cm, ⟨/s⟩). At the k-th decoding iteration, the model takes as input yk−1, the output from the previous iteration, chooses an action ak ∈ A to refine the sequence into yk = E(yk−1, ak), and receives a reward rk = R(yk). The policy π maps the input sequence yk−1 to a probability distribution P(A) over the action space A. Our model is based on the Transformer encoder-decoder (Vaswani et al., 2017), and we extract the decoder representations (h1, . . . , hn) to make the policy predictions.
Figure 2: Applying the reposition operation r to input y: ri > 0 is the 1-based index of token y′i in the input sequence; yi is deleted if ri = 0.
Each refinement action is based on two basic
operaciones: reposition and insertion.
Reposition For each position i in the input sequence y1...n, the reposition policy πrps(r | i, y) predicts an index r ∈ [0, n]: If r > 0, we place the r-th input token yr at the i-th output position; otherwise we delete the token at that position (Figure 2). We constrain πrps(1 | 1, y) = πrps(n | n, y) = 1 to maintain sequence boundaries. Note that reposition differs from typical reordering because 1) it makes it possible to delete tokens, and 2) it places tokens at each position independently, which enables parallelization at decoding time. In principle, the
same input token can thus be placed at multiple
output positions. Sin embargo, this happens rarely in
practice as the policy predictor is trained to follow
oracle demonstrations which cannot contain such
repetitions by design.2
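To make the reposition semantics concrete, here is a minimal sketch in plain Python of applying a predicted index vector r to an input sequence, following the definition above and Figure 2 (the function name and example are ours, not from the released code):

```python
def apply_reposition(y, r):
    """Apply a reposition action: for each output slot i, r[i] > 0 places
    the r[i]-th input token (1-based) at position i; r[i] == 0 deletes
    whatever would have occupied that slot."""
    out = []
    for r_i in r:
        if r_i > 0:                  # copy an existing input token (possibly from elsewhere)
            out.append(y[r_i - 1])   # r_i is 1-based, matching the paper's convention
        # r_i == 0 -> the slot is dropped (deletion)
    return out

# Example: keep the boundaries, swap B and A, and delete C.
y = ["<s>", "B", "A", "C", "</s>"]
print(apply_reposition(y, [1, 3, 2, 0, 5]))  # ['<s>', 'A', 'B', '</s>']
```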
The reposition classifier gives a categorical distribution over the index of the input token to be placed at each output position:

\[ \pi_{\mathrm{rps}}(r \mid i, y) = \mathrm{softmax}\big(h_i \cdot [b, e_1, \ldots, e_n]\big) \tag{1} \]

where e_j is the embedding of the j-th token in the input sequence, and b ∈ R^{d_model} is used to predict whether to delete the token. The dot product in the softmax function captures the similarity between the hidden state h_i and each input embedding e_j or the deletion vector b.
Insertion Following Gu et al. (2019), the insertion operation consists of two phases: (1) placeholder insertion: Given an input sequence y1...n, the placeholder predictor πplh(p | i, y) predicts the number of placeholders p ∈ [0, Kmax]
to be inserted between two neighboring tokens
(yi, yi+1);3 (2) token prediction: Given the output
of the placeholder predictor, the token predictor
2Empirically, fewer than 1% of tokens are repositioned to
more than one output position.
3In our implementation, we set Kmax = 255.
πtok(t | i, y) replaces each placeholder with an
actual token.
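As a rough illustration of the two phases (a sketch with made-up helper names and placeholder symbol, not the fairseq implementation), the placeholder predictor first opens slots between neighboring tokens and the token predictor then fills them:

```python
PLH = "<plh>"  # placeholder symbol, chosen here for illustration

def insert_placeholders(y, p):
    """p[i] placeholders are inserted between y[i] and y[i+1] (len(p) == len(y) - 1)."""
    out = [y[0]]
    for i, count in enumerate(p):
        out.extend([PLH] * count)
        out.append(y[i + 1])
    return out

def fill_tokens(y, tokens):
    """Replace placeholders left to right with predicted tokens."""
    tokens = iter(tokens)
    return [next(tokens) if tok == PLH else tok for tok in y]

y = ["<s>", "ankle", "</s>"]
with_slots = insert_placeholders(y, p=[1, 2])         # ['<s>', '<plh>', 'ankle', '<plh>', '<plh>', '</s>']
print(fill_tokens(with_slots, ["my", "hurts", "."]))  # ['<s>', 'my', 'ankle', 'hurts', '.', '</s>']
```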
The Placeholder Insertion Classifier gives a categorical distribution over the number of placeholders to be inserted between every two consecutive positions:

\[ \pi_{\mathrm{plh}}(p \mid i, y) = \mathrm{softmax}\big([h_i ; h_{i+1}] \cdot W^{\mathrm{plh}}\big) \tag{2} \]

where W^{plh} ∈ R^{2d_model × (K_max + 1)}.
The Token Prediction Classifier predicts the identity of each token to fill in each placeholder:

\[ \pi_{\mathrm{tok}}(t \mid i, y) = \mathrm{softmax}\big(h_i \cdot W^{\mathrm{tok}}\big) \tag{3} \]

where W^{tok} ∈ R^{d_model × |V|}.
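All three policy heads are light-weight projections on top of the decoder states. The PyTorch-style sketch below mirrors the shapes in Equations (1)–(3) under assumed dimensions; the tensor names are ours and the parameters are random stand-ins rather than trained weights:

```python
import torch
import torch.nn.functional as F

d_model, vocab_size, K_max = 512, 40000, 255
n = 6                                    # toy input length

h = torch.randn(n, d_model)              # decoder states h_1..h_n
e = torch.randn(n, d_model)              # embeddings e_1..e_n of the input tokens
b = torch.randn(d_model)                 # deletion vector b (Eq. 1)
W_plh = torch.randn(2 * d_model, K_max + 1)   # placeholder head (Eq. 2)
W_tok = torch.randn(d_model, vocab_size)      # token head (Eq. 3)

# Eq. (1): distribution over [delete, index 1..n] for every output position i.
rps_logits = h @ torch.cat([b.unsqueeze(0), e], dim=0).t()   # (n, n + 1)
pi_rps = F.softmax(rps_logits, dim=-1)

# Eq. (2): number of placeholders to insert between each pair of neighbors.
pairs = torch.cat([h[:-1], h[1:]], dim=-1)                   # (n - 1, 2 * d_model)
pi_plh = F.softmax(pairs @ W_plh, dim=-1)                    # (n - 1, K_max + 1)

# Eq. (3): token identity for each placeholder position.
pi_tok = F.softmax(h @ W_tok, dim=-1)                        # (n, |V|)
```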
Action Given an input sequence y1...n, an action consists of repositioning tokens, inserting and replacing placeholders. Formally, we define an action as a sequence of reposition (r), placeholder insertion (p), and token prediction (t) operations: a = (r, p, t). r, p, and t are applied in this order to adjust non-empty initial sequences via reposition before inserting new tokens. Each of r, p, and t consists of a set of basic operations that can be applied in parallel:

\[ \mathbf{r} = \{r_1, \ldots, r_n\} \qquad \mathbf{p} = \{p_1, \ldots, p_{m-1}\} \qquad \mathbf{t} = \{t_1, \ldots, t_l\} \]

where m = \sum_{i}^{n} \mathbb{1}(r_i > 0) and l = \sum_{i}^{m-1} p_i. We define the policy as

\[ \pi(\mathbf{a} \mid y) = \prod_{r_i \in \mathbf{r}} \pi_{\mathrm{rps}}(r_i \mid i, y) \cdot \prod_{p_i \in \mathbf{p}} \pi_{\mathrm{plh}}(p_i \mid i, y') \cdot \prod_{t_i \in \mathbf{t}} \pi_{\mathrm{tok}}(t_i \mid i, y'') \]

with intermediate outputs y′ = E(y, r) and y′′ = E(y′, p).
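Assuming per-position distributions produced by the three classifiers (here toy dictionaries rather than real model outputs, conditioned implicitly on the appropriate intermediate sequences y′ and y′′), the factorization above amounts to summing log-probabilities over the reposition, placeholder, and token stages; a minimal sketch:

```python
import math

def action_log_prob(r, p, t, pi_rps, pi_plh, pi_tok):
    """log pi(a | y) for a = (r, p, t) under the factorization above.

    pi_rps[i][k]: prob. of placing input token k (0 = delete) at slot i of y
    pi_plh[i][c]: prob. of inserting c placeholders after position i of y'
    pi_tok[i][w]: prob. of token w at placeholder i of y''
    """
    logp = sum(math.log(pi_rps[i][r_i]) for i, r_i in enumerate(r))
    logp += sum(math.log(pi_plh[i][p_i]) for i, p_i in enumerate(p))
    logp += sum(math.log(pi_tok[i][t_i]) for i, t_i in enumerate(t))
    return logp

# Toy example with two output slots and one placeholder.
pi_rps = [{0: 0.1, 1: 0.9}, {0: 0.2, 2: 0.8}]
pi_plh = [{0: 0.3, 1: 0.7}]
pi_tok = [{"ankle": 0.6, "plague": 0.4}]
print(action_log_prob(r=[1, 2], p=[1], t=["ankle"],
                      pi_rps=pi_rps, pi_plh=pi_plh, pi_tok=pi_tok))
```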
3.2 Dual-Path Imitation Learning

We train EDITOR using imitation learning (Daumé III et al., 2009; Ross et al., 2011; Ross and Bagnell, 2014) to efficiently explore the space of valid action sequences that can reach a reference translation. The key idea is to construct a roll-in policy πin to generate sequences to be refined and a roll-out policy πout to estimate cost-to-go for all possible actions given each input sequence. The model is trained to choose actions that minimize the cost-to-go estimates. We use a search-based oracle policy π∗ as the roll-out policy and train the model to imitate the optimal actions chosen by the oracle.
Formally, let d_{\pi^{\mathrm{in}}_{\mathrm{rps}}} and d_{\pi^{\mathrm{in}}_{\mathrm{ins}}} denote the distributions of sequences induced by running the roll-in policies π^in_rps and π^in_ins respectively. We update the model policy π = πrps · πplh · πtok to minimize the expected cost C(π ; y, π∗) by comparing the model policy against the cost-to-go estimates under the oracle policy π∗ given input sequences y:

\[ \mathbb{E}_{y_{\mathrm{rps}} \sim d_{\pi^{\mathrm{in}}_{\mathrm{rps}}}}\big[C(\pi_{\mathrm{rps}} ; y_{\mathrm{rps}}, \pi^*)\big] + \mathbb{E}_{y_{\mathrm{ins}} \sim d_{\pi^{\mathrm{in}}_{\mathrm{ins}}}}\big[C(\pi_{\mathrm{plh}}, \pi_{\mathrm{tok}} ; y_{\mathrm{ins}}, \pi^*)\big] \tag{4} \]
The cost function compares the model vs. oracle behavior. As prior work suggests that cost functions close to the cross-entropy loss are better suited to deep neural models than the squared error (Leblond et al., 2018; Cheng et al., 2018), we define the cost function as the KL divergence between the action distributions given by the model policy and by the oracle (Welleck et al., 2019):

\[ C(\pi ; y, \pi^*) = D_{\mathrm{KL}}\big[\, \pi^*(\mathbf{a} \mid y, y^*) \,\|\, \pi(\mathbf{a} \mid y) \,\big] = \mathbb{E}_{\mathbf{a} \sim \pi^*(\mathbf{a} \mid y, y^*)}\big[-\log \pi(\mathbf{a} \mid y)\big] + \mathrm{const.} \tag{5} \]

where the oracle has additional access to the reference sequence y∗. By minimizing the cost function, the model learns to imitate the oracle policy without access to the reference sequence.
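Because the oracle policy is deterministic (Equation (9) below puts all its mass on a single action), the KL cost in Equation (5) reduces, up to a constant, to the negative log-likelihood of the oracle action under the model. A minimal sketch of this reduction, with a stand-in log-probability function:

```python
import math

def imitation_cost(oracle_action, model_log_prob):
    """C(pi; y, pi*) up to an additive constant: the negative log-probability
    that the model assigns to the oracle's single optimal action a*."""
    return -model_log_prob(oracle_action)

# Toy usage: a model that assigns probability 0.42 to the oracle action.
print(imitation_cost(("r*", "p*", "t*"), lambda a: math.log(0.42)))  # ~0.868
```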
Next, we describe how the reposition operation is incorporated in the roll-in policy (Section 3.2.1) and the oracle roll-out policy (Section 3.2.2).
3.2.1 Dual-Path Roll-in Policy

As shown in Figure 3, the roll-in policies π^in_rps and π^in_ins for the reposition and insertion policy predictors are stochastic mixtures of the noised reference sequences and the output sequences sampled from their corresponding dual policy predictors.
Figure 3: Our dual-path imitation learning process uses both the reposition and insertion policies during roll-in
so that they can be trained to refine each other’s outputs: Given an initial sequence y0, created by noising the
reference y∗, the roll-in policy stochastically generates intermediate sequences yins and yrps via reposition and
insertion respectively. The policy predictors are trained to minimize the costs of reaching y∗ from yins and yrps
estimated by the oracle policy π∗.
Figure 4 shows an example for creating the roll-in sequences: We first create the initial sequence y0 by applying random word dropping (Gu et al., 2019) and random word shuffle (Lample et al., 2018) with a probability of 0.5 and a maximum shuffle distance of 3 to the reference sequence y∗, and produce the roll-in sequences for each policy predictor as follows:
1. Reposition: The roll-in policy π^in_rps is a stochastic mixture of the initial sequence y0 and the output sequence obtained by applying one iteration of the oracle placeholder insertion policy p∗ ∼ π∗ and the model's token prediction policy t̃ ∼ πtok to y0:

\[ d_{\pi^{\mathrm{in}}_{\mathrm{rps}}} = \begin{cases} y_0, & \text{if } u < \beta \\ E(E(y_0, \mathbf{p}^*), \tilde{\mathbf{t}}), & \text{otherwise} \end{cases} \tag{6} \]

where the mixture factor β ∈ [0, 1] and random variable u ∼ Uniform(0, 1).
2. Insertion: The roll-in policy π^in_ins is a stochastic mixture of the initial sequence y0 and the output sequence obtained by applying one iteration of the model's reposition policy r̃ ∼ πrps to y0:

\[ d_{\pi^{\mathrm{in}}_{\mathrm{ins}}} = \begin{cases} y_0, & \text{if } u < \alpha \\ E(y_0, \tilde{\mathbf{r}}), & \text{otherwise} \end{cases} \tag{7} \]

where the mixture factor α ∈ [0, 1] and random variable u ∼ Uniform(0, 1).

Figure 4: The roll-in sequence for the insertion predictor is a stochastic mixture of the noised reference y0 and the output obtained by applying the model's reposition policy πrps to y0. The roll-in sequence for the reposition predictor is a stochastic mixture of the noised reference y0 and the output obtained by applying the oracle placeholder insertion policy π∗_plh and the model's token prediction policy πtok to y0.

While Gu et al. (2019) define roll-in using only the model's insertion policy, we call our approach dual-path because roll-in creates two distinct intermediate sequences using the model's reposition or insertion policy. This makes it possible for the reposition and insertion policy predictors to learn to refine one another's outputs during roll-out, mimicking the iterative refinement process used at inference time.4

3.2.2 Oracle Roll-Out Policy

Policy Given an input sequence y and a reference sequence y∗, the oracle algorithm finds the optimal action to transform y into y∗ with the minimum number of basic edit operations:

\[ \mathrm{Oracle}(y, y^*) = \operatorname*{arg\,min}_{\mathbf{a}} \; \mathrm{NumOps}(y, y^* \mid \mathbf{a}) \tag{8} \]

The associated oracle policy is defined as:

\[ \pi^*(\mathbf{a} \mid y, y^*) = \begin{cases} 1, & \text{if } \mathbf{a} = \mathrm{Oracle}(y, y^*) \\ 0, & \text{otherwise} \end{cases} \tag{9} \]
4Different from the inference process, we generate the
roll-in sequences by applying the model’s reposition or
insertion policy for only one iteration.
Algorithm The reposition and insertion opera-
tions used in EDITOR are designed so that the
Levenshtein edit distance algorithm (Levenshtein,
1966) can be used as the oracle. The reposition
operation (Section 3.1) can be split into two dis-
tinct types of operations: (1) deletion and (2)
replacing a word with any other word appearing
in the input sequence, which is a constrained ver-
sion of the Levenshtein substitution operation. As
a result, we can use dynamic programming to find
the optimal action sequence in O(|y||y∗|) time.
By contrast, the Levenshtein Transformer restricts
the oracle and model to insertion and deletion
operations only. While in principle substitutions
can be performed indirectly by deletion and re-
insertion, our results show the benefits of using
the reposition variant of the substitution operation.
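For illustration, the cost computation behind NumOps can be sketched as a standard Levenshtein dynamic program in which substitutions are restricted to words already present in the input, mirroring reposition; this is our simplified rendering of the idea (the actual oracle also backtracks through the table to recover the (r, p, t) action):

```python
def oracle_num_ops(y, y_star):
    """Minimal number of basic edits (deletion, insertion, constrained
    substitution) needed to turn y into y_star, via the Levenshtein DP
    in O(|y| * |y_star|) time.  A substitution is only allowed when the
    target word already appears somewhere in y (the reposition variant)."""
    in_y = set(y)
    m, n = len(y), len(y_star)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i                          # delete all remaining input tokens
    for j in range(1, n + 1):
        dp[0][j] = j                          # insert all remaining target tokens
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if y[i - 1] == y_star[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]   # keep the token
            else:
                best = min(dp[i - 1][j] + 1,  # delete y[i-1]
                           dp[i][j - 1] + 1)  # insert y_star[j-1]
                if y_star[j - 1] in in_y:     # constrained substitution (reposition)
                    best = min(best, dp[i - 1][j - 1] + 1)
                dp[i][j] = best
    return dp[m][n]

print(oracle_num_ops("a b c".split(), "b a c".split()))  # 2
```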
3.3 Inference
During inference, we start from the initial sequence y0. For standard sequence generation
tasks, y0 is an empty sequence, whereas for lex-
ically constrained generation y0 is a sequence of
lexical constraints. Inference then proceeds in the
exact same way for constrained and unconstrained
tasks. The initial sequence is refined iteratively
by applying a sequence of actions (a1, a2, . . .) =
(r1, p1, t1 ; r2, p2, t2 ; . . .). We greedily select the
best action at each iteration given the model policy
in Equations (1) to (3). We stop refining if 1) the
output sequences from two consecutive iterations
are the same (Gu et al., 2019), or 2) the maximum
number of decoding steps is reached (Lee et al.,
2018; Ghazvininejad et al., 2019).5
Incorporating Soft Constraints Although
EDITOR is trained without lexical constraints, it
can be used seamlessly for MT with constraints
without any change to the decoding process
except using the constraint sequence as the initial
sequence.
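Schematically, inference is the same loop in both settings, differing only in the initial sequence; the sketch below uses made-up callables for the model's greedy action prediction and for applying an action, so it illustrates the control flow rather than the released decoder:

```python
def decode(predict_action, apply_action, constraints=None, max_steps=10):
    """Iterative refinement: start from the empty sequence (just the sentence
    boundaries) or from the soft-constraint tokens, and stop once the
    hypothesis no longer changes or the step budget is exhausted."""
    y = ["<s>"] + list(constraints or []) + ["</s>"]
    for _ in range(max_steps):
        r, p, t = predict_action(y)       # greedy argmax of Eqs. (1)-(3)
        y_new = apply_action(y, r, p, t)  # reposition, then insert placeholders, then fill
        if y_new == y:                    # stopping criterion 1: output unchanged
            break
        y = y_new
    return y

# Toy run with a "model" that immediately proposes no further edits.
identity = lambda y: (list(range(1, len(y) + 1)), [0] * (len(y) - 1), [])
print(decode(identity, lambda y, r, p, t: y, constraints=["ankle"]))  # ['<s>', 'ankle', '</s>']
```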
Incorporating Hard Constraints We adopt the
decoding technique introduced by Susanto et al.
(2020) to enforce hard constraints at decoding
5Following Stern et al. (2019), we also experiment with adding a penalty for inserting "empty" placeholders during inference by subtracting a penalty score γ ∈ [0, 3] from the logits of zero in Equation (2) to avoid overly short outputs. However, preliminary experiments show that a zero penalty score achieves the best performance.
          Train    Valid  Test  Provenance
Ro-En      599k     1911  1999  WMT16
En-De    3,961k     3000  3003  WMT14
En-Ja    2,000k     1790  1812  WAT2017

Table 1: MT Tasks. Data statistics (# sentence pairs) and provenance per language pair.
time by prohibiting deletion operations on constraint tokens or insertions within a multi-token constraint.
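A sketch of that masking, with our own variable names and a simple list-of-lists stand-in for the logits (the actual implementation operates on the model's tensors inside fairseq):

```python
NEG_INF = float("-inf")

def mask_for_hard_constraints(rps_logits, plh_logits, constraint_positions):
    """Forbid edits that would break hard constraints: the deletion entry
    (index 0 of the reposition head) is masked for constraint tokens, and
    no placeholders may be inserted between consecutive tokens of one
    multi-token constraint.  `constraint_positions` lists the indices of
    constraint tokens in the current hypothesis, grouped by constraint."""
    for span in constraint_positions:
        for i in span:
            rps_logits[i][0] = NEG_INF               # cannot delete a constraint token
        for left, right in zip(span, span[1:]):      # adjacent tokens of the same constraint
            if right == left + 1:
                plh_logits[left] = [NEG_INF] * len(plh_logits[left])
                plh_logits[left][0] = 0.0            # force zero insertions inside the span
    return rps_logits, plh_logits

# Toy usage: one two-token constraint at positions 1-2 of a 4-token hypothesis.
rps = [[0.0] * 4 for _ in range(4)]
plh = [[0.0] * 3 for _ in range(3)]
mask_for_hard_constraints(rps, plh, constraint_positions=[[1, 2]])
```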
4 Experiments
We evaluate the EDITOR model on standard
(Section 4.2) and lexically constrained machine
translation (Sections 4.3–4.4).
4.1 Experimental Settings
Dataset Following Gu et al. (2019), we exper-
iment on three language pairs spanning different
language families and data conditions (Table 1):
Romanian-English (Ro-En) from WMT16 (Bojar
et al., 2016), English-German (En-De) from WMT14
(Bojar et al., 2014), and English-Japanese (En-Ja)
from WAT2017 Small-NMT Task (Nakazawa et al.,
2017). We also evaluate EDITOR on the two En-
De test sets with terminology constraints released
by Dinu et al. (2019). The test sets are subsets of
the WMT17 En-De test set (Bojar et al., 2017)
with terminology constraints extracted from Wik-
tionary and IATE.6 For each test set, they only
select the sentence pairs in which the exact target
terms are used in the reference. The resulting
Wiktionary and IATE test sets contain 727 and
414 sentences respectively. We follow the same
preprocessing steps in Gu et al. (2019): We apply
normalization, tokenization, true-casing, and BPE
(Sennrich et al., 2016b) with 37k and 40k oper-
ations for En-De and Ro-En. For En-Ja, we use
the provided subword vocabularies (16,384 BPE
per language from SentencePiece [Kudo and
Richardson, 2018]).
Experimental Conditions We train and evaluate the following models in controlled conditions to thoroughly evaluate EDITOR:
• Auto-Regressive Transformers (AR) built
using Sockeye (Hieber et al., 2017) and
6Available at https://www.wiktionary.org/
and https://iate.europa.eu.
fairseq (Ott et al., 2019). We report AR
baselines with both toolkits to enable fair
comparisons when using our fairseq-based
implementation of EDITOR and Sockeye-
based implementation of lexically constrained
decoding algorithms (Post and Vilar, 2018).
• Non Auto-Regressive Transformers (NAR)
In addition to EDITOR, we train a Levensh-
tein Transformer (LevT) with approximately
the same number of parameters. Both are
implemented using fairseq.
Model and Training Configurations All models adopt the base Transformer architecture (Vaswani et al., 2017) with dmodel = 512,
dhidden = 2048, nheads = 8, nlayers = 6, and
pdropout = 0.3. For En-De and Ro-En, the source
and target embeddings are tied with the output
layer weights (Press and Wolf, 2017; Nguyen and
Chiang, 2018). We add dropout to embeddings
(0.1) and label smoothing (0.1). AR models are
trained with the Adam optimizer (Kingma and Ba,
2015) with a batch size of 4096 tokens. We check-
point models every 1000 updates. The initial
learning rate is 0.0002, and it is reduced by 30%
after 4 checkpoints without validation perplexity
improvement. Training stops after 20 check-
points without improvement. All NAR models are
trained using Adam (Kingma and Ba, 2015) with
initial learning rate of 0.0005 and a batch size
of 64,800 tokens for maximum 300,000 steps.7
We select the best checkpoint based on validation
BLEU (Papineni et al., 2002). All models are
trained on 8 NVIDIA V100 Tensor Core GPUs.
Knowledge Distillation We apply sequence-level
knowledge distillation from autoregressive teacher
models as widely used in non-autoregressive gen-
eration (Gu et al., 2018; Lee et al., 2018; Gu
et al., 2019). Specifically, when training the non-
autoregressive models, we replace the reference
sequences y∗ in the training data with translation
outputs from the AR teacher model (Sockeye,
with beam = 4).8 We also report the results when
applying knowledge distillation to autoregressive
models.
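In practice this amounts to swapping each target side for the teacher's beam-search output before NAR training; a minimal sketch with placeholder names for the data and the teacher:

```python
def distill_training_data(pairs, teacher_translate, beam=4):
    """Sequence-level knowledge distillation: keep the source sides and
    replace each reference with the autoregressive teacher's translation."""
    return [(src, teacher_translate(src, beam=beam)) for src, _ in pairs]

# Toy usage with a fake teacher.
print(distill_training_data([("src sentence", "original reference")],
                            teacher_translate=lambda s, beam: f"teacher translation of {s}"))
```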
Evaluation We evaluate translation quality via
case-sensitive tokenized BLEU (as in Gu et al.
7Our preliminary experiments and prior work show that
NAR models require larger training batches than AR models.
8This teacher model was selected for a fairer comparison
on MT with lexical constraints.
(2019))9 and RIBES (Isozaki et al., 2010), which
is more sensitive to word order differences. Before
computing the scores, we tokenize the German
and English outputs using Moses and Japanese
outputs using KyTea.10 For lexically constrained
decoding, we report the constraint preservation
rate (CPR) in the translation outputs.
We quantify decoding speed using latency per
sentence. It is computed as the average time (in
ms) required to translate the test set using batch
size of one (excluding the model loading time)
divided by the number of sentences in the test set.
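As a reference point, the constraint preservation rate can be computed as the fraction of provided constraint tokens that also appear in the system output; the exact matching details are not spelled out in the paper, so the sketch below is one plausible reading:

```python
def constraint_preservation_rate(outputs, constraints):
    """Fraction of constraint tokens found in the corresponding output,
    counted with multiplicity on the constraint side."""
    kept = total = 0
    for out_tokens, cons_tokens in zip(outputs, constraints):
        out_counts = {}
        for tok in out_tokens:
            out_counts[tok] = out_counts.get(tok, 0) + 1
        for tok in cons_tokens:
            total += 1
            if out_counts.get(tok, 0) > 0:
                out_counts[tok] -= 1
                kept += 1
    return kept / max(total, 1)

print(constraint_preservation_rate([["my", "ankle", "hurts"]], [["ankle", "plague"]]))  # 0.5
```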
4.2 MT Tasks
Because our experiments involve two different
toolkits, we first compare the same Transformer
AR models built with Sockeye and with fairseq:
The AR models achieve comparable decoding
speed and translation quality regardless of toolkit
—the Sockeye model obtains higher BLEU than
the fairseq model on Ro-En and En-De but lower
on En-Ja (Table 2). Further comparisons will
therefore center on the Sockeye AR model to better
compare EDITOR with the lexically constrained
decoding algorithm (Post and Vilar, 2018).
Table 2 also shows that knowledge distillation
has a small and inconsistent impact on AR models
(Sockeye): It yields higher BLEU on Ro-En, close
BLEU on En-De, and lower BLEU on En-Ja.11
Thus, we use the AR models trained without
distillation in further experiments.
Next, we compare the NAR models against
the AR (Sockeye) baseline. As expected, both
EDITOR and LevT achieve close translation
quality to their AR teachers with 2–4 times
speedup. BLEU differences are small (Δ < 1.1),
as in prior work (Gu et al., 2019). The RIBES
trends are more surprising: Both NAR models
significantly outperform the AR models (Sockeye)
on RIBES, except for En-Ja, where EDITOR and
the AR models significantly outperforms LevT.
This illustrates the strength of EDITOR in word
reordering.
Finally, results confirm the benefits of EDITOR's reposition operation over LevT:
Decoding with EDITOR is 6–7% faster than LevT
9https://github.com/pytorch/fairseq/blob
/master/fairseq/clib/libbleu/libbleu.cpp.
10http://www.phontron.com/kytea/.
11Kasai et al. (2020) found that AR models can benefit from
knowledge distillation but with a Transformer large model as
a teacher, while we use the Transformer base model.
                   Distill  Beam  Params   BLEU↑  RIBES↑  Latency (ms)↓
Ro-En
  AR (fairseq)               4    64.5M    32.0    83.8    357.14
  AR (sockeye)               4    64.5M    32.3    83.6    369.82
  AR (sockeye)              10    64.5M    32.5    83.8    394.52
  AR (sockeye)       ✓      10    64.5M    32.9    84.2    371.75
  NAR: LevT          ✓       −    90.9M    31.6    84.0     98.81
  NAR: EDITOR        ✓       −    90.9M    31.9    84.0     93.20
En-De
  AR (fairseq)               4    64.9M    27.1    80.4    363.64
  AR (sockeye)               4    64.9M    27.3    80.2    308.64
  AR (sockeye)              10    64.9M    27.4    80.3    332.73
  AR (sockeye)       ✓      10    64.9M    27.6    80.5    363.52
  NAR: LevT          ✓       −    91.1M    26.9    81.0    113.12
  NAR: EDITOR        ✓       −    91.1M    26.9    80.9    105.37
En-Ja
  AR (fairseq)               4    62.4M    44.9    85.7    292.40
  AR (sockeye)               4    62.4M    43.4    85.1    286.83
  AR (sockeye)              10    62.4M    43.5    85.3    311.38
  AR (sockeye)       ✓      10    62.4M    42.7    85.1    295.32
  NAR: LevT          ✓       −   106.1M    42.4    84.5    143.88
  NAR: EDITOR        ✓       −   106.1M    42.3    85.1     96.62

Table 2: Machine Translation Results. For each metric, we underline the top scores among all models and boldface the top scores among NAR models based on the paired bootstrap test with p < 0.05 (Clark et al., 2011). EDITOR decodes 6–7% faster than LevT on Ro-En and En-De, and 33% faster on En-Ja, while achieving comparable or higher BLEU and RIBES.
on Ro-En and En-De, and 33% faster on En-Ja
—a more distant language pair which requires
more reordering but no inflection changes on
reordered words—with no statistically significant
difference in BLEU nor RIBES, except for En-Ja,
where EDITOR significantly outperforms LevT
on RIBES. Overall, EDITOR is shown to be a
good alternative to LevT on standard machine
translation tasks and can also be used to replace
the AR models in settings where decoding speed
matters more than small differences in translation
quality.
4.3 MT with Lexical Constraints
We now turn to the main evaluation of EDITOR
on machine translation with lexical constraints.
Experimental Conditions We conduct a con-
trolled comparison of the following approaches:
• NAR models: EDITOR and LevT view
the lexical constraints as soft constraints,
provided via the initial
target sequence.
We also explore the decoding technique
introduced in Susanto et al. (2020) to support
hard constraints.
• AR models: They use the provided target
words as hard constraints enforced at decod-
ing time by an efficient form of constrained
beam search: dynamic beam allocation (DBA)
(Post and Vilar, 2018).12
Crucially, all models,
including EDITOR, are
the exact same models evaluated on the standard
MT tasks above, and do not need to be trained
specifically to incorporate constraints.
We define lexical constraints as Post and Vilar
(2018): For each source sentence, we randomly
select one to four words from the reference as
lexical constraints. We then randomly shuffle
the constraints and apply BPE to the constraint
sequence. Different from the terminology test
sets in Dinu et al. (2019), which contain only
several hundred sentences with mostly nominal
constraints, our constructed test sets are larger and
include lexical constraints of all types.
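A sketch of that construction (the helper names and the toy no-op BPE are ours; see Post and Vilar (2018) for the original recipe):

```python
import random

def make_constraints(reference_tokens, bpe_encode, rng=random):
    """Sample one to four reference words, shuffle them, and BPE-encode the
    result to obtain the soft-constraint sequence fed to the decoder."""
    k = rng.randint(1, min(4, len(reference_tokens)))
    words = rng.sample(reference_tokens, k)
    rng.shuffle(words)
    return [piece for w in words for piece in bpe_encode(w)]

# Toy usage with a no-op "BPE" that keeps words whole.
print(make_constraints("the plague hit my ankle".split(), bpe_encode=lambda w: [w]))
```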
12Although the beam pruning option in Post and Vilar
(2018) is not used here (since it is not supported in Sockeye
anymore), other Sockeye updates
improve efficiency.
Constrained decoding with DBA is 1.8–2.7 times slower
than unconstrained decoding here, while DBA is 3 times
slower when beam = 10 in Post and Vilar (2018).
                          Distill  Beam  BLEU↑  RIBES↑  CPR↑   Latency (ms)↓
Ro-En
  AR + DBA (sockeye)                  4   31.0    79.5    99.7    436.26
  AR + DBA (sockeye)                 10   34.6    84.5    99.5    696.68
  NAR: LevT                  ✓        –   31.6    83.4    80.3    121.80
    + hard constraints       ✓        –   27.7    78.4    99.9    140.79
  NAR: EDITOR                ✓        –   33.1    85.0    86.8    108.98
    + hard constraints       ✓        –   28.8    81.2    95.0    136.78
En-De
  AR + DBA (sockeye)                  4   26.1    74.7    99.7    434.41
  AR + DBA (sockeye)                 10   30.5    81.9    99.5    896.60
  NAR: LevT                  ✓        –   27.1    80.0    75.6    127.00
    + hard constraints       ✓        –   24.9    74.1   100.0    134.10
  NAR: EDITOR                ✓        –   28.2    81.6    88.4    121.65
    + hard constraints       ✓        –   25.8    77.2    96.8    134.10
En-Ja
  AR + DBA (sockeye)                  4   44.3    81.6   100.0    418.71
  AR + DBA (sockeye)                 10   48.0    85.9   100.0    736.92
  NAR: LevT                  ✓        –   42.8    84.0    74.3    161.17
    + hard constraints       ✓        –   39.7    77.4    99.9    159.27
  NAR: EDITOR                ✓        –   45.3    85.7    91.3    109.50
    + hard constraints       ✓        –   43.7    82.6    96.4    132.71

Table 3: Machine Translation with lexical constraints (averages over 5 runs). For each metric, we underline the top scores among all models and boldface the top scores among NAR models based on the independent student's t-test with p < 0.05. EDITOR exploits constraints better than LevT. It also achieves comparable RIBES to the best AR model with 6–7 times decoding speedup.
Main Results Table 3 shows that EDITOR exploits the soft constraints to strike a better balance between translation quality and decoding speed than other models. Compared to LevT, EDITOR preserves 7–17% more constraints and achieves significantly higher translation quality (+1.1–2.5 on BLEU and +1.6–1.8 on RIBES) and faster decoding speed. Compared
to the AR model with beam = 4, EDITOR
yields significantly higher BLEU (+1.0–2.2)
and RIBES (+4.1–6.9) with 3–4 times decoding
speedup. After increasing the beam to 10, EDI-
TOR obtains lower BLEU but comparable RIBES
with 6–7 times decoding speedup.13 Note that AR
models treat provided words as hard constraints
and therefore achieve over 99% CPR by design,
while NAR models treat them as soft constraints.
Results confirm that enforcing hard constraints
increases CPR but degrades translation quality
compared to the same model using soft con-
straints: For LevT, it degrades BLEU by 2.2–3.9
and RIBES by 5.0–6.6. For EDITOR, it degrades
13Post and Vilar (2018) show that the optimal beam size for
DBA is 20. Our experiment on En-De shows that increasing
the beam size from 10 to 20 improves BLEU by 0.7 at the
cost of doubling the decoding time.
BLEU by 1.6–4.3 and RIBES by 3.1–4.4 (Table 3).
By contrast, EDITOR with soft constraints strikes
a better balance between translation quality and
constraint preservation.
The strengths of EDITOR hold when varying
the number of constraints (Figure 5). For all tasks
and models, adding constraints helps BLEU up to
a certain point, ranging from 4 to 10 words. When
excluding the slower AR model (beam = 10),
EDITOR consistently reaches the highest BLEU
score with 2–10 constraints: EDITOR outperforms
LevT and the AR model with beam = 4.
Consistent with Post and Vilar (2018), as the
number of constraints increases, the AR model
needs larger beams to reach good performance.
When the number of constraints increases to 10,
EDITOR yields higher BLEU than the AR model
on En-Ja and Ro-En, even after incurring the cost
of increasing the AR beam to 10.
Are EDITOR improvements limited to pre-
serving constraints better? We verify that this
is not the case by computing the target word
F1 binned by frequency (Neubig et al., 2019).
Figure 6 shows that EDITOR improves over
LevT across all test frequency classes and closes
Figure 5: EDITOR improves BLEU over LevT for
2–10 constraints (counted pre-BPE) and beats the best
AR model on 2/3 tasks with 10 constraints.
the gap between NAR and AR models: The
largest improvements are obtained for low and
medium frequency words—on En-De and En-Ja,
the largest
improvements are on words with
frequency between 5 and 1000, while on Ro-En,
EDITOR improves more on words with frequency
between 5 and 100. EDITOR also improves F1 on
rare words (frequency in [0, 5]), but not as much
as for more frequent words.
Figure 6: Target word F1 score binned by word test
set frequency: EDITOR improves over LevT the most
for words of low or medium frequency. AR achieves
higher F1 than EDITOR for words of low or medium
frequency at the cost of much longer decoding time.
We now conduct further analysis to better understand the factors that contribute to EDITOR's advantages over LevT.
Impact of Reposition We compare the average
number of basic edit operations (Section 3.1) of
different types used by EDITOR and LevT on
each test sentence (averaged over the 5 runs):
Reposition (excluding deletion for controlled
comparison with LevT), deletion, and insertion
performed by LevT and EDITOR at decoding
           Repos.   Del.    Ins.   Total  Iter.
Ro-En
  LevT       0.00   4.61   33.05   37.67   2.01
  EDITOR     8.13   2.50   28.68   39.31   1.81
En-De
  LevT       0.00   7.13   45.45   52.58   2.14
  EDITOR     5.85   4.01   28.75   38.61   2.07
En-Ja
  LevT       0.00   5.24   32.83   38.07   2.93
  EDITOR     4.73   1.69   21.64   28.06   1.76

Table 4: Average number of repositions (excluding deletions), deletions, insertions, and decoding iterations to translate each sentence with soft lexical constraints (averaged over 5 runs). Thanks to reposition operations, EDITOR uses 40–70% fewer deletions, 10–40% fewer insertions, and 3–40% fewer decoding iterations overall.
time. Table 4 shows that LevT deletes tokens 2–3
times more often than EDITOR, which explains
its lower CPR than EDITOR. LevT also inserts
tokens 1.2–1.6 times more often than EDITOR and
performs 1.4 times more edit operations on En-De
and En-Ja. On Ro-En, LevT performs −4% fewer
edit operations in total than EDITOR but is overall
slower than EDITOR, since multiple operations
can be done in parallel at each action step. Overall,
EDITOR takes 3–40% fewer decoding iterations
than LevT. These results suggest that reposition
successfully reduces redundancy in edit operations
and makes decoding more efficient by replacing
sequences of insertions and deletions with a single
repositioning step.
Furthermore, Figure 7 illustrates how repo-
sition increases flexibility in exploiting lexical
constraints, even when they are provided in the
wrong order. While LevT generates an incorrect
output by using constraints in the provided order,
EDITOR’s reposition operation helps generate a
more fluent and adequate translation.
Impact of Dual-Path Roll-In Ablation exper-
iments (Table 5) show that EDITOR benefits
greatly from dual-path roll-in. Replacing dual-
path roll-in with the simpler roll-in policy used
in Gu et al. (2019), the model’s translation qual-
ity drops significantly (by 0.9–1.3 on BLEU
and 0.6–1.9 on RIBES) with fewer constraints
preserved and slower decoding. It still achieves
              BLEU↑  RIBES↑  CPR↑   Lat. (ms)↓
Ro-En
  EDITOR       33.1   85.0   86.8   108.98
  -dual-path   32.2   84.4   74.8   119.61
  LevT         31.6   83.4   80.3   121.80
En-De
  EDITOR       28.2   81.6   88.4   121.65
  -dual-path   27.2   80.4   78.7   130.85
  LevT         27.1   80.0   75.6   127.00
En-Ja
  EDITOR       45.3   85.7   91.3   109.50
  -dual-path   44.0   83.9   80.0   154.10
  LevT         42.8   84.0   74.3   161.17

Table 5: Ablating the dual-path roll-in policy hurts EDITOR on soft-constrained MT, but still outperforms LevT, confirming that reposition and dual-path imitation learning both benefit EDITOR.
                          Wiktionary            IATE
                       Term%↑   BLEU↑    Term%↑   BLEU↑
Prior Results
  Base Trans.            76.9    26.0      76.3    25.8
  Post18                 99.5    25.8      82.0    25.3
  Dinu19                 93.4    26.3      94.5    26.0
  Base LevT              81.1    30.2      80.3    29.0
  Susanto20             100.0    31.2     100.0    30.1
Our Results
  LevT                   84.3    28.2      83.9    27.9
  + soft constraints     90.5    28.5      92.5    28.3
  + hard constraints    100.0    28.8     100.0    28.9
  EDITOR                 83.5    28.8      83.0    27.9
  + soft constraints     96.8    29.3      97.1    28.8
  + hard constraints     99.8    29.3     100.0    28.9

Table 6: Term usage percentage (Term%) and BLEU scores of En-De models on terminology test sets (Dinu et al., 2019) provided with correct terminology entries (exact matches on both source and target sides). EDITOR with soft constraints achieves higher BLEU than LevT with soft constraints, and on par or higher BLEU than LevT with hard constraints.
better translation quality than LevT thanks to the
reposition operation: specifically, it yields signif-
icantly higher BLEU and RIBES on Ro-En, com-
parable BLEU and significantly higher RIBES on
En-De, and comparable RIBES and significantly
higher BLEU on En-Ja than LevT.
Figure 7: Ro-En translation with soft lexical constraints: while LevT uses the constraints in the provided order,
EDITOR’s reposition operation helps generate a more fluent and adequate translation.
4.4 MT with Terminology Constraints
We evaluate EDITOR on the terminology test sets
released by Dinu et al. (2019) to test its ability to
incorporate terminology constraints and to further
compare it with prior work (Dinu et al., 2019;
Post and Vilar, 2018; Susanto et al., 2020).
Compared to Post and Vilar (2018) and Dinu
et al., (2019), EDITOR with soft constraints
achieves higher absolute BLEU, and higher BLEU
improvements over its counterpart without con-
straints (Table 6). Consistent with previous find-
ings by Susanto et al. (2020), incorporating soft
constraints in LevT improves BLEU by +0.3 on
Wiktionary and by +0.4 on IATE. Enforcing hard
constraints as in Susanto et al. (2020) increases
the term usage by +8–10% and improves BLEU
by +0.3–0.6 over LevT using soft constraints.14
For EDITOR, adding soft constraints improves
BLEU by +0.5 on Wiktionary and +0.9 on IATE,
with very high term usages (96.8% and 97.1%
respectively). EDITOR thus correctly uses the pro-
vided terms almost all the time when they are pro-
vided as soft constraints, so there is little benefit to
enforcing hard constraints instead: They help close
the small gap to reach 100% term usage and do not
improve BLEU. Overall, EDITOR achieves on par
or higher BLEU than LevT with hard constraints.
14We use our implementations of Susanto et al.’s (2020)
technique for a more controlled comparison. The LevT
baseline in Susanto et al. (2020) achieves higher BLEU
than ours on the small Wiktionary and IATE test sets, while
it underperforms our LevT on the full WMT14 test set (26.5
vs. 26.9).
Results also suggest that EDITOR can han-
dle phrasal constraints even though it
relies
on token-level edit operations, since it achieves
above 99% term usage on the terminology test sets
where 26–27% of the constraints are multi-token.
5 Conclusion
We introduce EDITOR, a non-autoregressive
transformer model that iteratively edits hypotheses
using a novel reposition operation. Reposition
combined with a new dual-path imitation learning
strategy helps EDITOR generate output sequences
that flexibly incorporate user’s lexical choice
preferences. Extensive experiments show that
EDITOR exploits soft lexical constraints more
effectively than the Levenshtein Transformer
(Gu et al., 2019) while speeding up decoding
dramatically compared to constrained beam search
(Post and Vilar, 2018). Results also confirm the
benefits of using soft constraints over hard ones
in terms of translation quality. EDITOR also
achieves comparable or better translation quality
with faster decoding speed than the Levenshtein
Transformer on three standard MT tasks. These
promising results open several avenues
for
future work, including using EDITOR for other
generation tasks than MT and investigating its
ability to incorporate more diverse constraint types
into the decoding process.
Acknowledgments
We thank Sweta Agrawal, Kiant´e Brantley,
Eleftheria Briakou, Hal Daum´e III, Aquia
Richburg, Franc¸ois Yvon, the TACL reviewers,
and the CLIP lab at UMD for their helpful and
constructive comments. This research is supported
in part by an Amazon Web Services Machine
Learning Research Award and by the Office of
the Director of National Intelligence (ODNI),
Intelligence Advanced Research Projects Activity
(IARPA), via contract #FA8650-17-C-9117. The
views and conclusions contained herein are those
of the authors and should not be interpreted
as necessarily representing the official policies,
either expressed or implied, of ODNI, IARPA,
or the U.S. Government. The U.S. Government
is authorized to reproduce and distribute reprints
for governmental purposes notwithstanding any
copyright annotation therein.
References
Fadi Abu Sheikha and Diana Inkpen. 2011.
Generation of formal and informal sentences. In
Proceedings of the 13th European Workshop on
Natural Language Generation, pages 187–193,
Nancy, France. Association for Computational
Linguistics.
Sweta Agrawal and Marine Carpuat. 2019. Control-
ling text complexity in neural machine trans-
lation. In Proceedings of the 2019 Conference
on Empirical Methods in Natural Language
Processing and the 9th International Joint
Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 1549–1564, Hong
Kong, China. Association for Computational
Linguistics. DOI: https://doi.org/10
.18653/v1/D19-1166
Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2017. Guided open vocabulary image captioning with constrained beam search. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 936–945, Copenhagen, Denmark. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D17-1098, PMID: 30027537, PMCID: PMC6220700
Philip Arthur, Graham Neubig, and Satoshi Nakamura. 2016. Incorporating discrete translation lexicons into neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1557–1567, Austin, Texas. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D16-1162
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua
Bengio. 2015. Neural machine translation
by jointly learning to align and translate.
the 3th International
In Proceedings of
Conference on Learning Representations.
Srinivas Bangalore, Patrick Haffner, and Stephan
Kanthak. 2007. Statistical machine translation
through global lexical selection and sentence
reconstruction. In Proceedings of the 45th An-
nual Meeting of the Association of Compu-
tational Linguistics, pages 152–159, Prague,
Czech Republic. Association for Computational
Linguistics.
Sergio Barrachina, Oliver Bender, Francisco
Casacuberta, Jorge Civera, Elsa Cubel, Shahram
Khadivi, Antonio Lagarda, Hermann Ney,
Jes´us Tom´as, Enrique Vidal, and Juan-Miguel
Vilar. 2009. Statistical approaches to computer-
assisted translation. Computational Linguis-
tics, 35(1):3–28. DOI: https://doi.org
/10.1162/coli.2008.07-055-R2-06
-29
Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Aleš Tamchyna. 2014. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 12–58, Baltimore, Maryland, USA. Association for Computational Linguistics. DOI: https://doi.org/10.3115/v1/W14-3302
Ondˇrej Bojar, Rajen Chatterjee, Christian
Federmann, Yvette Graham, Barry Haddow,
Shujian Huang, Matthias Huck, Philipp Koehn,
Qun Liu, Varvara Logacheva, Christof Monz,
Matteo Negri, Matt Post, Raphael Rubino,
Lucia Specia, and Marco Turchi. 2017.
Findings of the 2017 conference on machine
translation (WMT17). In Proceedings of the
Second Conference on Machine Translation,
pages 169–214, Copenhagen, Denmark. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/W17
-4717
Ondˇrej Bojar, Rajen Chatterjee, Christian
Federmann, Yvette Graham, Barry Haddow,
Matthias Huck, Antonio Jimeno Yepes, Philipp
Koehn, Varvara Logacheva, Christof Monz,
Matteo Negri, Aur´elie N´ev´eol, Mariana Neves,
Martin Popel, Matt Post, Raphael Rubino,
Carolina Scarton, Lucia Specia, Marco Turchi,
Karin Verspoor, and Marcos Zampieri. 2016.
Findings of the 2016 conference on machine
translation. In Proceedings of the First Con-
ference on Machine Translation: Volume 2,
Shared Task Papers, pages 131–198, Berlin,
Germany. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/W16-2301
Peter F. Brown, John Cocke, Stephen A. Della
Pietra, Vincent J. Della Pietra, Fredrick Jelinek,
John D. Lafferty, Robert L. Mercer, and Paul
S. Roossin. 1990. A statistical approach to
machine translation. Computational Linguis-
tics, 16(2):79–85.
Ching-An Cheng, Xinyan Yan, Nolan Wagener,
and Byron Boots. 2018. Fast policy learning
through imitation and reinforcement.
In
Proceedings of
the 2018 Conference on
Uncertainty in Artificial Intelligence (UAI),
pages 845–855, Monterey, CA, USA.
Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar. Association for Computational Linguistics.
Jan K. Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. 2015. Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, pages 577–585, Montreal, Canada.
Jonathan H. Clark, Chris Dyer, Alon Lavie, and
Noah A. Smith. 2011. Better hypothesis testing
for statistical machine translation: Controlling
for optimizer instability. In Proceedings of the
49th Annual Meeting of the Association for
Computational Linguistics: Human Language
Technologies, pages 176–181, Portland, Ore-
gon, USA. Association for Computational
Linguistics.
Hal Daumé III, John Langford, and Daniel Marcu. 2009. Search-based structured prediction. Machine Learning, 75(3):297–325. DOI: https://doi.org/10.1007/s10994-009-5106-x
Georgiana Dinu, Prashant Mathur, Marcello Federico, and Yaser Al-Onaizan. 2019. Training neural machine translation to apply terminology constraints. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3063–3068, Florence, Italy. Association for Computational Linguistics.
Nadir Durrani, Helmut Schmid, Alexander Fraser,
Philipp Koehn, and Hinrich Sch¨utze. 2015.
The operation sequence model—Combining
n-gram-based and phrase-based statistical ma-
chine translation. Computational Linguistics,
41(2):157–186. DOI: https://doi.org
/10.1162/COLI a 00218
Jessica Ficler and Yoav Goldberg. 2017. Control-
ling linguistic style aspects in neural language
generation. In Proceedings of the Workshop
on Stylistic Variation, pages 94–104, Copen-
hagen, Denmark. Association for Computa-
tional Linguistics. DOI: https://doi.org
/10.18653/v1/W17-4912
George Foster, Philippe Langlais, and Guy Lapalme. 2002. User-friendly text prediction for translators. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), pages 148–155. Association for Computational Linguistics.
Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. 2019. Mask-predict: Parallel decoding of conditional masked language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6112–6121, Hong Kong, China. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D19-1633
Jiatao Gu, James Bradbury, Caiming Xiong,
Victor OK Li, and Richard Socher. 2018.
Non-autoregressive neural machine translation.
In International Conference on Learning
Representations.
Jiatao Gu, Changhan Wang, and Junbo Zhao.
2019. Levenshtein transformer. In Advances
in Neural Information Processing Systems 32,
pages 11181–11191. Curran Associates, Inc.
Felix Hieber,
Tobias Domhan, Michael
Denkowski, David Vilar, Artem Sokolov, Ann
Clifton, and Matt Post. 2017. Sockeye: A
toolkit for neural machine translation. CoRR,
abs/1712.05690.
Chris Hokamp and Qun Liu. 2017. Lexically
constrained decoding for sequence generation
using grid beam search. In Proceedings of
the 55th Annual Meeting of the Association
for Computational Linguistics (Volume 1:
Long Papers), pages 1535–1546, Vancouver,
Canada. Association
for Computational
Linguistics.
Hideki Isozaki, Tsutomu Hirao, Kevin Duh, Katsuhito Sudoh, and Hajime Tsukada. 2010. Automatic evaluation of translation quality for distant language pairs. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 944–952, Cambridge, MA. Association for Computational Linguistics.
Jungo Kasai, Nikolaos Pappas, Hao Peng, James
Cross, and Noah A Smith. 2020. Deep encoder,
shallow decoder: Reevaluating the speed-
quality tradeoff in machine translation. arXiv
preprint arXiv:2006.10369.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA.
Philipp Koehn, Hieu Hoang, Alexandra Birch,
Chris Callison-Burch, Marcello Federico,
Nicola Bertoldi, Brooke Cowan, Wade Shen,
Christine Moran, Richard Zens, Chris Dyer,
Ondˇrej Bojar, Alexandra Constantin, and
Evan Herbst. 2007. Moses: Open source
toolkit for statistical machine translation. In
Proceedings of the 45th Annual Meeting of
the Association for Computational Linguistics
Companion Volume Proceedings of the Demo
and Poster Sessions, pages 177–180, Prague,
Czech Republic. Association for Computational
Linguistics.
Taku Kudo and John Richardson. 2018.
SentencePiece: A simple and language inde-
pendent subword tokenizer and detokenizer for
neural text processing. In Proceedings of the
2018 Conference on Empirical Methods in
Natural Language Processing: System Dem-
onstrations, pages 66–71, Brussels, Belgium.
Association for Computational Linguistics.
DOI: https://doi.org/10.18653/v1
/D18-2012, PMID: 29382465
Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2018. Unsupervised machine translation using monolingual corpora only. In Proceedings of the 6th International Conference on Learning Representations.
Rémi Leblond, Jean-Baptiste Alayrac, Anton
Osokin, and Simon Lacoste-Julien. 2018.
SEARNN: Training RNNs with global-local
losses. In International Conference on Learning
Representations.
Jason Lee, Elman Mansimov, and Kyunghyun
Cho. 2018. Deterministic non-autoregressive
neural sequence modeling by iterative refine-
ment. In Proceedings of the 2018 Conference
on Empirical Methods in Natural Language
Processing, pages 1173–1182, Brussels, Bel-
gium. Association for Computational Linguis-
tics. DOI: https://doi.org/10.18653
/v1/D18-1149
Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.
Xuezhe Ma, Chunting Zhou, Xian Li, Graham Neubig, and Eduard Hovy. 2019. FlowSeq: Non-autoregressive conditional
sequence generation with generative flow. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4282–4292, Hong Kong, China. Association for Computational Linguistics.
Hongyuan Mei, Mohit Bansal, and Matthew R.
Walter. 2016. What to talk about and how? Selective generation using LSTMs with coarse-
to-fine alignment. In Proceedings of the 2016
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, pages 720–730,
San Diego, California. Association for Compu-
tational Linguistics.
Toshiaki Nakazawa, Shohei Higashiyama, Chenchen Ding, Hideya Mino, Isao Goto, Hideto Kazawa, Yusuke Oda, Graham Neubig, and Sadao Kurohashi. 2017. Overview of the 4th workshop on Asian translation. In Proceedings of the 4th Workshop on Asian Translation (WAT2017), pages 1–54, Taipei, Taiwan. Asian Federation of Natural Language Processing.
Graham Neubig, Zi-Yi Dou, Junjie Hu, Paul
Michel, Danish Pruthi, and Xinyi Wang.
2019. compare-mt: A tool for holistic com-
parison of language generation systems. In
Proceedings of the 2019 Conference of the
North American Chapter of the Association for
Computational Linguistics (Demonstrations),
pages 35–41, Minneapolis, Minnesota. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/N19
-4007
Toan Q. Nguyen and David Chiang. 2018. Improving lexical choice in neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 334–343. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/N18-1031
Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis Cobo, Florian Stimberg, Norman Casagrande, Dominik Grewe, Seb Noury, Sander Dieleman, Erich Elsen, Nal Kalchbrenner, Heiga Zen, Alex Graves, Helen King, Tom Walters, Dan Belov, and Demis Hassabis. 2018. Parallel WaveNet: Fast high-fidelity speech synthesis. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3918–3926, Stockholmsmässan, Stockholm, Sweden. PMLR.
Myle Ott, Sergey Edunov, Alexei Baevski,
Angela Fan, Sam Gross, Nathan Ng, David
Grangier, and Michael Auli. 2019. Fairseq: A
fast, extensible toolkit for sequence modeling.
In Proceedings of the 2019 Conference of the
North American Chapter of the Association for
Computational Linguistics (Demonstrations),
pages 48–53, Minneapolis, Minnesota. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/N19
-4009
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics. DOI: https://doi.org/10.3115/1073083.1073135
Matt Post and David Vilar. 2018. Fast lexically
constrained decoding with dynamic beam
allocation for neural machine translation. In
Proceedings of the 2018 Conference of the
North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, Volume 1 (Long Papers),
pages 1314–1324, New Orleans, Louisiana.
Association for Computational Linguistics.
DOI: https://doi.org/10.18653/v1
/N18-1119
Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 157–163.
Association for Computational Linguistics.
DOI: https://doi.org/10.18653/v1
/E17-2025
Stéphane Ross and J. Andrew Bagnell. 2014. Reinforcement and imitation learning via interactive no-regret learning. CoRR, abs/1406.5979.
Stephane Ross, Geoffrey Gordon, and Drew Bagnell. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 627–635. PMLR.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Controlling politeness in neural machine translation via side constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 35–40, San Diego, California. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/N16-1005
Rico Sennrich, Barry Haddow, and Alexandra
Birch. 2016b. Neural machine translation
of rare words with subword units. In Pro-
ceedings of the 54th Annual Meeting of the
Association for Computational Linguistics,
pages 1715–1725. Association for Computatio-
nal Linguistics. DOI: https://doi.org
/10.18653/v1/P16-1162
Kai Song, Yue Zhang, Heng Yu, Weihua
Luo, Kun Wang, and Min Zhang. 2019.
Code-switching for enhancing NMT with pre-
specified translation. In Proceedings of
the
2019 Conference of
the North American
Chapter of the Association for Computational
Linguistics: Human Language Technolo-
gies, Volume 1 (Long and Short Papers),
pages
449–459, Minneapolis, Minnesota.
Association for Computational Linguistics.
Felix Stahlberg, Danielle Saunders, and Bill Byrne. 2018. An operation sequence model for explainable neural machine translation. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 175–186, Brussels, Belgium. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/W18-5420
Mitchell Stern, William Chan, Jamie Kiros,
and Jakob Uszkoreit. 2019. Insertion transfor-
mer: Flexible sequence generation via insertion
operations. In Proceedings of the 36th Inter-
national Conference on Machine Learning,
volume 97 of Proceedings of Machine Learning
Research, pages 5976–5985, Long Beach,
California, USA. PMLR.
Mitchell Stern, Noam Shazeer,
and Jakob
Uszkoreit. 2018. Blockwise parallel decoding
for deep autoregressive models. In Advances
in Neural Information Processing Systems,
volume 31, pages 10086–10095, Montreal,
Canada. Curran Associates, Inc.
Raymond Hendy Susanto, Shamil Chollampatt,
and Liling Tan. 2020. Lexically constrained
neural machine translation with Levenshtein
transformer. In Proceedings of the 58th Annual
Meeting of the Association for Computational
Linguistics, pages 3536–3543, Online. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/2020
.acl-main.325
Yaohua Tang, Fandong Meng, Zhengdong Lu,
Hang Li, and Philip L. H. Yu. 2016.
Neural machine translation with external phrase
memory. arXiv preprint arXiv:1606.01792.
Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need. In Advances in Neural
Information Processing Systems, volume 30,
pages 5998–6008, Long Beach, CA, USA.
Curran Associates, Inc.
Oriol Vinyals and Quoc Le. 2015. A neural
conversational model. In ICML Deep Learning
Workshop. Lille, France.
Chunqi Wang, Ji Zhang, and Haiqing Chen. 2018.
Semi-autoregressive neural machine transla-
tion. In Proceedings of the 2018 Conference
on Empirical Methods in Natural Language
Processing, pages 479–488, Brussels, Belgium.
Association for Computational Linguistics.
DOI: https://doi.org/10.18653/v1
/D18-1044
Yiren Wang, Fei Tian, Di He, Tao Qin,
ChengXiang Zhai, and Tie-Yan Liu. 2019.
Non-autoregressive machine translation with
auxiliary regularization. In Proceedings of the
AAAI Conference on Artificial Intelligence,
33(01):5377–5384. DOI: https://doi
.org/10.1609/aaai.v33i01.33015377
Sean Welleck, Kianté Brantley, Hal Daumé III,
and Kyunghyun Cho. 2019. Non-monotonic
sequential text generation. In International Con-
ference on Machine Learning, pages 6716–6726.
François Yvon and Sadaf Abdul Rauf. 2020. Utilisation de ressources lexicales et terminologiques en traduction neuronale [Using lexical and terminological resources in neural translation]. Research Report 2020-001, LIMSI-CNRS.