Synchronous Bidirectional Neural Machine Translation
Long Zhou1,2,
Jiajun Zhang1,2∗, Chengqing Zong1,2,3
1National Laboratory of Pattern Recognition, CASIA, Beijing, China
2University of Chinese Academy of Sciences, Beijing, China
3CAS Center for Excellence in Brain Science and Intelligence Technology, Shanghai, China
{long.zhou, jjzhang, cqzong}@nlpr.ia.ac.cn
Abstract

Existing approaches to neural machine translation (NMT) generate the target language sequence token-by-token from left to right. However, this kind of unidirectional decoding framework cannot make full use of the target-side future contexts which can be produced in a right-to-left decoding direction, and thus suffers from the issue of unbalanced outputs. In this paper, we introduce a synchronous bidirectional neural machine translation (SB-NMT) that predicts its outputs using left-to-right and right-to-left decoding simultaneously and interactively, in order to leverage both the history and the future information at the same time. Specifically, we first propose a new algorithm that enables synchronous bidirectional decoding in a single model. Then, we present an interactive decoding model in which left-to-right (right-to-left) generation does not only depend on its previously generated outputs, but also relies on future contexts predicted by right-to-left (left-to-right) decoding. We extensively evaluate the proposed SB-NMT model on large-scale NIST Chinese-English, WMT14 English-German, and WMT18 Russian-English translation tasks. Experimental results demonstrate that our model achieves significant improvements over the strong Transformer model by 3.92, 1.49, and 1.04 BLEU points, respectively, and obtains the state-of-the-art performance on the Chinese-English and English-German translation tasks.1
1 Introduction
Neural machine translation has significantly improved the quality of machine translation in recent years (Sutskever et al., 2014; Bahdanau et al., 2015; Zhang and Zong, 2015; Wu et al., 2016; Gehring et al., 2017; Vaswani et al., 2017). Recent approaches to sequence-to-sequence learning typically leverage recurrence (Sutskever et al., 2014), convolution (Gehring et al., 2017), or attention (Vaswani et al., 2017) as basic building blocks.

Typically, NMT adopts the encoder-decoder architecture and generates the target translation from left to right. Despite their remarkable success, NMT models suffer from several weaknesses (Koehn and Knowles, 2017). One of the most prominent issues is the problem of unbalanced outputs, in which the translation prefixes are better predicted than the suffixes (Liu et al., 2016). We analyze the translation accuracy of the first and last 4 tokens for the left-to-right (L2R) and right-to-left (R2L) directions, respectively. As shown in Table 1, the statistical results show that L2R performs better on the first 4 tokens, whereas R2L translates better in terms of the last 4 tokens.

This problem is mainly caused by left-to-right unidirectional decoding, which conditions each output word on previously generated outputs only, while leaving the future information from target-side contexts unexploited during translation. The future context is commonly used in reading and writing in the human cognitive process (Xia et al., 2017), and it is crucial to avoid under-translation (Tu et al., 2016; Mi et al., 2016).
∗Corresponding author.
1The source code is available at https://github.com/wszlong/sb-nmt.
Model    The first 4 tokens    The last 4 tokens
L2R      40.21%                35.10%
R2L      35.67%                39.47%

Table 1: Translation accuracy of the first 4 tokens and the last 4 tokens on the NIST Chinese-English translation tasks. L2R denotes left-to-right decoding and R2L means right-to-left decoding for conventional NMT.
To alleviate these problems, existing studies usually used independent bidirectional decoders for NMT (Liu et al., 2016; Sennrich et al., 2016a). Most of them trained two NMT models with left-to-right and right-to-left directions, respectively. Then, they translated and re-ranked candidate translations using the two decoding scores together. More recently, Zhang et al. (2018) presented an asynchronous bidirectional decoding algorithm for NMT, which extended the conventional encoder-decoder framework by utilizing a backward decoder. However, these methods are more complicated than the conventional NMT framework because they require two NMT models or decoders. Moreover, the L2R and R2L decoders are independent from each other (Liu et al., 2016), or only the forward decoder can utilize information from the backward decoder (Zhang et al., 2018). It is therefore a promising direction to design a synchronous bidirectional decoding algorithm in which L2R and R2L generations can interact with each other.
Accordingly, we propose in this paper a novel framework (SB-NMT) that utilizes a single decoder to bidirectionally generate target sentences simultaneously and interactively. As shown in Figure 1, two special labels (⟨l2r⟩ and ⟨r2l⟩) at the beginning of the target sentence guide translating from left to right or right to left, and the decoder in each direction can utilize the previously generated symbols of bidirectional decoding when generating the next token. Taking L2R decoding as an example, at each moment, the generation of the target word (e.g., y3) does not only rely on previously generated outputs (y1 and y2) of L2R decoding, but also depends on previously predicted tokens (yn and yn−1) of R2L decoding.

Figure 1: Illustration of the decoder in the synchronous bidirectional NMT model. L2R denotes left-to-right decoding guided by the start token ⟨l2r⟩ and R2L means right-to-left decoding indicated by the start token ⟨r2l⟩. SBAtt is our proposed synchronous bidirectional attention (see §3.2). For example, the generation of y3 does not only rely on y1 and y2, but also depends on yn and yn−1 of R2L.

Compared with the previous related NMT models, our method has the following advantages: 1) We use a single model (one encoder and one decoder) to achieve decoding with left-to-right and right-to-left generation, which can be processed in parallel. 2) Via the synchronous bidirectional attention model (SBAtt, §3.2), our proposed model is an end-to-end joint framework and can optimize bidirectional decoding simultaneously. 3) Compared with the two-phase decoding scheme in previous work, our decoder is faster and more compact, using one beam search algorithm.
Specifically, we make the following contributions in this paper:

• We propose a synchronous bidirectional NMT model that adopts one decoder to generate outputs in the left-to-right and right-to-left directions simultaneously and interactively. To the best of our knowledge, this is the first work to investigate the effectiveness of a single NMT model with synchronous bidirectional decoding.

• Extensive experiments on NIST Chinese-English, WMT14 English-German, and WMT18 Russian-English translation tasks demonstrate that our SB-NMT model obtains significant improvements over the strong Transformer model by 3.92, 1.49, and 1.04 BLEU points, respectively. In particular, our approach separately establishes the state-of-the-art BLEU scores of 51.11 and 29.21 on the Chinese-English and English-German translation tasks.
2 Background

In this paper, we build our model based on the powerful Transformer (Vaswani et al., 2017) with
an encoder-decoder framework, where the encoder network first transforms an input sequence of symbols x = (x1, x2, …, xn) to a sequence of continuous representations z = (z1, z2, …, zn), from which the decoder generates an output sequence y = (y1, y2, …, ym) one element at a time. Particularly, relying entirely on the multi-head attention mechanism, the Transformer with a beam search algorithm achieves state-of-the-art results for machine translation.

Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention.
Multi-head attention allows the model to jointly
attend to information from different representa-
tion subspaces at different positions. It operates
on queries Q, keys K, and values V . For multi-
head intra-attention of the encoder or decoder, all
q, k, V are the output hidden-state matrices of
the previous layer. For multi-head inter-attention
of the decoder, Q are the hidden states of the pre-
vious decoder layer, and K-V pairs come from
the output (z1, z2, …, zn) of the encoder.
Formally, multi-head attention first obtains h different representations of (Qi, Ki, Vi). Specifically, for each attention head i, we project the hidden-state matrix into distinct query, key, and value representations Q_i = Q W_i^Q, K_i = K W_i^K, and V_i = V W_i^V, respectively. Then we perform scaled dot-product attention for each representation, concatenate the results, and project the concatenation with a feed-forward layer:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (1)

where W_i^Q, W_i^K, W_i^V, and W^O are parameter projection matrices.
Figure 3: Illustration of the standard beam search
algorithm with beam size 4. The black blocks denote
the ongoing expansion of the hypotheses.
Scaled dot-product attention can be described
as mapping a query and a set of key-value pairs
to an output. Specifically, we can then multiply query Qi by key Ki to obtain an attention weight matrix, which is then multiplied by value Vi for each token to obtain the self-attention token representation. As shown in Figure 2, scaled dot-
product attention operates on a query Q, a key K,
and a value V as:
Atención(q, k, V ) = Softmax
(cid:19)
(cid:18) QKT
√
dk
V
(2)
where dk is the dimension of the key. For the sake
of brevity, we refer the reader to Vaswani et al.
(2017) for more details.
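To make Equations 1 and 2 concrete, the following NumPy sketch implements scaled dot-product attention and a multi-head layer built on top of it. The helper names (softmax, attention, multi_head) and the weight shapes are illustrative assumptions for exposition, not the authors' implementation.

```python
# A minimal NumPy sketch of Eqs. 1-2; names and shapes are illustrative.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: Softmax(QK^T / sqrt(d_k)) V (Eq. 2)."""
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head(Q, K, V, WQ, WK, WV, WO, h):
    """Multi-head attention (Eq. 1): project into h heads, attend, concatenate, project."""
    d_model = Q.shape[-1]
    d_k = d_model // h
    # Project with W^Q, W^K, W^V and reshape to [h, seq_len, d_k], one slice per head.
    split = lambda X, W: (X @ W).reshape(X.shape[0], h, d_k).transpose(1, 0, 2)
    heads = attention(split(Q, WQ), split(K, WK), split(V, WV))   # [h, len_q, d_k]
    concat = heads.transpose(1, 0, 2).reshape(Q.shape[0], d_model)
    return concat @ WO
```

Here each per-head projection W_i of Equation 1 is realized as a reshaped slice of a single d_model × d_model matrix, which is the usual way multi-head attention is implemented; the decoder-side causal mask is omitted for brevity.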
Standard Beam Search Given the trained
model and input sentence x, we usually employ
beam search or greedy search (beam size = 1)
to find the best translation ŷ = argmax_y P(y|x).
Beam size N is used to control the search space
by extending only the top-N hypotheses in the
current stack. As shown in Figure 3, the blocks
represent the four best token expansions of the
previous states, and these token expansions are
sorted top-to-bottom from most probable to least
probable. We define a complete hypothesis as
a hypothesis which outputs EOS, where EOS
is a special target token indicating the end of
sentence. With the above settings, the translation
y is generated token-by-token from left to right.
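As a reference point for the synchronous variant in §3.1, a minimal sketch of this standard left-to-right beam search is given below. The score_next(prefix) callback, which returns (token, log-probability) pairs from the trained model, is a hypothetical interface introduced only for illustration.

```python
# A minimal sketch of standard left-to-right beam search.
import heapq

def beam_search(score_next, bos, eos, beam_size=4, max_len=100):
    beams = [(0.0, [bos])]            # (cumulative log probability, token prefix)
    complete = []                     # hypotheses that have produced EOS
    for _ in range(max_len):
        candidates = []
        for logp, prefix in beams:
            # Expand every live hypothesis with the model's next-token scores.
            for token, token_logp in score_next(prefix):
                candidates.append((logp + token_logp, prefix + [token]))
        # Keep only the top-N hypotheses in the current stack.
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
        live = []
        for hyp in beams:
            # A hypothesis that outputs EOS is complete and stops expanding.
            (complete if hyp[1][-1] == eos else live).append(hyp)
        beams = live
        if not beams:
            break
    pool = complete if complete else beams
    return max(pool, key=lambda c: c[0])[1]
```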
3 Our Approach

In this section, we will introduce the approach of synchronous bidirectional NMT. Our goal is to
design a synchronous bidirectional beam search
algoritmo (§3.1) which generates tokens with
both L2R and R2L decoding simultaneously and
interactively using a single model. The central
module is the synchronous bidirectional atten-
ción (SBAtt, see §3.2). By using SBAtt, the two
decoding directions in one beam search process
can help and interact with each other, and can
make full use of the target-side history and future
information during translation. Then, we apply our proposed SBAtt to replace the multi-head intra-attention in the decoder part of the Transformer model (§3.3), and the model is trained end-to-end
by maximum likelihood using stochastic gradient
descent (§3.4).
3.1 Synchronous Bidirectional Beam Search
Figure 4 illustrates the synchronous bidirectional beam search process with beam size 4. With two special start tokens, which are optimized during the training process, we let half of the beam keep decoding from left to right, guided by the label ⟨l2r⟩, and allow the other half of the beam to decode from right to left, indicated by the label ⟨r2l⟩. More importantly, via the proposed SBAtt model (§3.2), L2R (R2L) generation does not only depend on its previously generated outputs, but also relies on future contexts predicted by R2L (L2R) decoding.

Figure 4: The synchronous bidirectional decoding of our model. ⟨l2r⟩ and ⟨r2l⟩ are two special labels, which indicate the target-side translation direction in the L2R and R2L modes, respectively. Our model can decode in both the L2R and R2L directions in one beam search by using SBAtt, simultaneously and interactively. SBAtt means the synchronous bidirectional attention (§3.2) performed between items of L2R and R2L decoding.
Note that (1) at each time step, we choose the best items of the half beam from L2R decoding and the best items of the half beam from R2L decoding to continue expanding simultaneously; (2) the L2R and R2L beams should be thought of as parallel, with SBAtt computed between the items of the 1-best L2R and R2L hypotheses, the items of the 2-best L2R and R2L hypotheses, and so on2; (3) the black blocks denote the ongoing expansion of the hypotheses, and decoding terminates when the end-of-sentence flag EOS is predicted; (4) in our decoding algorithm, the complete hypotheses will not participate in subsequent SBAtt, and the L2R hypothesis attended by R2L decoding may change at different time steps, while the ongoing partial hypotheses in both directions of SBAtt always share the same length; (5) finally, we output the translation result with the highest probability from all complete hypotheses. Intuitively, our model is able to choose either the L2R output or the R2L output as the final hypothesis according to their model probabilities, and if an R2L hypothesis wins, we reverse its tokens before presenting it.

2We also did experiments in which all of the L2R hypotheses attend to the 1-best R2L hypothesis, and all of the R2L hypotheses attend to the 1-best L2R hypothesis. The results of the two schemes are similar. For the sake of simplicity, we employed the former scheme.
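The sketch below is only meant to make the bookkeeping of this procedure explicit: the beam is split into an L2R half and an R2L half, the i-th hypotheses of the two halves are decoded as a pair under SBAtt, and an R2L winner is reversed before being output. The step(l2r, r2l) callback, which runs one decoder step with SBAtt over the paired half-beams and returns scored expansions for each half, is a hypothetical interface; a real implementation would batch both directions in a single forward pass.

```python
# A schematic sketch of synchronous bidirectional beam search (beam size 4).
def sb_beam_search(step, l2r_bos="<l2r>", r2l_bos="<r2l>",
                   eos="<eos>", beam_size=4, max_len=100):
    half = beam_size // 2
    l2r = [(0.0, [l2r_bos])] * half      # left-to-right half of the beam
    r2l = [(0.0, [r2l_bos])] * half      # right-to-left half of the beam
    complete = []
    for _ in range(max_len):
        # The i-th L2R hypothesis and the i-th R2L hypothesis are expanded as a
        # pair, so SBAtt lets each direction attend to the other's partial output.
        l2r_cand, r2l_cand = step(l2r, r2l)
        l2r = sorted(l2r_cand, key=lambda c: c[0], reverse=True)[:half]
        r2l = sorted(r2l_cand, key=lambda c: c[0], reverse=True)[:half]
        # Completed hypotheses leave the beam and no longer take part in SBAtt.
        complete += [h for h in l2r + r2l if h[1][-1] == eos]
        l2r = [h for h in l2r if h[1][-1] != eos]
        r2l = [h for h in r2l if h[1][-1] != eos]
        if not l2r and not r2l:
            break
    pool = complete if complete else l2r + r2l
    _, tokens = max(pool, key=lambda c: c[0])
    words = [t for t in tokens[1:] if t != eos]       # strip start label and EOS
    # If the winner comes from the R2L half, reverse it before presenting it.
    return words[::-1] if tokens[0] == r2l_bos else words
```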
3.2 Synchronous Bidirectional Attention
Instead of multi-head intra-attention which pre-
vents future information flow in the decoder to
preserve the auto-regressive property, we propose
a synchronous bidirectional attention (SBAtt)
mechanism. With the two key modules of syn-
chronous bidirectional dot-product attention (§3.2.1)
and synchronous bidirectional multi-head atten-
ción (§3.2.2), SBAtt is capable of capturing and
combining the information generated by L2R and
R2L decoding.
3.2.1 Synchronous Bidirectional
Dot-Product Attention
Figure 5 shows our particular attention, Synchronous Bidirectional Dot-Product Attention (SBDPA). The input consists of queries ([→q; ←q]), keys ([→k; ←k]), and values ([→V; ←V]), which are all concatenated from forward (L2R) states and backward (R2L) states. The new forward state →H and backward state ←H can be obtained by synchronous bidirectional dot-product attention.

Figure 5: Synchronous bidirectional attention model based on scaled dot-product attention. It operates on forward (L2R) and backward (R2L) queries Q, keys K, and values V.

For the new forward state →H, it can be calculated as:

→H_history = Attention(→q, →k, →V)
→H_future = Attention(→q, ←k, ←V)
→H = Fusion(→H_history, →H_future)    (3)

where →H_history is obtained by using the conventional scaled dot-product attention introduced in Equation 2, and its purpose is to take advantage of previously generated tokens, namely history information. We calculate →H_future using the forward query (→q) and the backward key-value pairs (←k, ←V), which attempts to make use of future information from R2L decoding as effectively as possible in order to help predict the current token in L2R decoding. The role of Fusion(·) (the green block in Figure 5) is to combine →H_history and →H_future by using linear interpolation, nonlinear interpolation, or a gate mechanism.

Linear Interpolation  →H_history and →H_future have different importance for the prediction of the current word. Linear interpolation of →H_history and →H_future produces an overall hidden state:

→H = →H_history + λ ∗ →H_future    (4)

where λ is a hyper-parameter decided by the performance on the development set.3

Nonlinear Interpolation  →H_history is equal to →H in the conventional attention mechanism, and →H_future means the attention information between the current hidden state and the generated hidden states of the other decoding direction. In order to distinguish the two different information sources, we present a nonlinear interpolation by adding an activation function to the backward hidden states:

→H = →H_history + λ ∗ AF(→H_future)    (5)

where AF denotes an activation function, such as tanh or relu.

Gate Mechanism  We also propose a gate mechanism to dynamically control the amount of information flow from the forward and backward contexts. Specifically, we apply a feed-forward gating layer upon →H_history as well as →H_future to enrich the nonlinear expressiveness of our model:

r_t, z_t = σ(W_g [→H_history; →H_future])
→H = r_t ⊙ →H_history + z_t ⊙ →H_future    (6)

where ⊙ denotes element-wise multiplication. Via this gating layer, the model is able to control how much past information can be preserved from the previous context and how much reversed information can be captured from the backward hidden states.

Similar to the calculation of the forward hidden states →H, the backward hidden states ←H can be computed as follows:

←H_history = Attention(←q, ←k, ←V)
←H_future = Attention(←q, →k, →V)
←H = Fusion(←H_history, ←H_future)    (7)

where Fusion(·) is the same as introduced in Equations 4–6. Note that →H and ←H can be calculated in parallel. We refer to the whole procedure formulated in Equation 3 and Equation 7 as SBDPA(·):

[→H; ←H] = SBDPA([→q; ←q], [→k; ←k], [→V; ←V])    (8)

3Note that we can also set λ to be a vector and learn λ during training with standard back-propagation; we leave this as future exploration.
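To summarize this subsection, the sketch below computes both directions of SBDPA together with the three Fusion(·) variants of Equations 4–6. It reuses the attention helper from the sketch after Equation 2; the causal masking inside each direction is omitted, and the parameter shapes (e.g., a gating matrix W_g of size 2d × 2d) and the default λ are illustrative assumptions rather than the paper's exact configuration.

```python
# A NumPy sketch of SBDPA (Eqs. 3-8): each direction attends to its own history
# and to the other direction's states, and the two results are fused.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse(history, future, mode="linear", lam=0.1, Wg=None):
    if mode == "linear":                  # Eq. 4: H_history + lambda * H_future
        return history + lam * future
    if mode == "nonlinear":               # Eq. 5, with tanh as the activation AF
        return history + lam * np.tanh(future)
    # Gate mechanism (Eq. 6): r and z come from a feed-forward gating layer over
    # the concatenated states; Wg is assumed to have shape [2*d, 2*d].
    gates = sigmoid(np.concatenate([history, future], axis=-1) @ Wg)
    r, z = np.split(gates, 2, axis=-1)
    return r * history + z * future

def sbdpa(q_fwd, k_fwd, v_fwd, q_bwd, k_bwd, v_bwd, **fusion_kwargs):
    """Synchronous bidirectional dot-product attention for both directions."""
    h_fwd_hist = attention(q_fwd, k_fwd, v_fwd)    # L2R attends to its own history
    h_fwd_fut  = attention(q_fwd, k_bwd, v_bwd)    # ... and to the R2L states (future)
    h_bwd_hist = attention(q_bwd, k_bwd, v_bwd)    # R2L attends to its own history
    h_bwd_fut  = attention(q_bwd, k_fwd, v_fwd)    # ... and to the L2R states
    return (fuse(h_fwd_hist, h_fwd_fut, **fusion_kwargs),
            fuse(h_bwd_hist, h_bwd_fut, **fusion_kwargs))
```

The two directions share no sequential dependency, which is what allows →H and ←H to be computed in parallel.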
3.2.2 Synchronous Bidirectional
Multi-Head Attention
Multi-head attention consists of h attention heads, each of which learns a distinct attention function to attend to all of the tokens in the sequence, where a mask is used to prevent leftward information flow in the decoder. Compared with standard multi-head attention, our inputs are the concatenation of forward and backward hidden states. We extend standard multi-head attention by letting each head attend to both the forward and backward hidden states, combined via SBDPA(·):

MultiHead([→q; ←q], [→k; ←k], [→V; ←V]) = Concat([→H_1; ←H_1], …, [→H_h; ←H_h]) W^O    (9)

and [→H_i; ←H_i] can be computed as follows, which is the biggest difference from conventional multi-head attention:

[→H_i; ←H_i] = SBDPA([→q; ←q] W_i^Q, [→k; ←k] W_i^K, [→V; ←V] W_i^V)    (10)

where W_i^Q, W_i^K, W_i^V, and W^O are parameter projection matrices, which are the same as in the standard multi-head attention introduced in Equation 1.
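Continuing the sketches above, one head-wise arrangement consistent with Equations 9–10 is shown below. It reuses the sbdpa helper and, as before, folds the per-head projections W_i into single d_model × d_model matrices whose reshaped slices act as the individual heads; all names and shapes are illustrative assumptions.

```python
# A sketch of Eqs. 9-10: every head runs SBDPA over jointly projected forward
# and backward states, then the heads are concatenated and projected by W^O.
def sb_multi_head(s_fwd, s_bwd, WQ, WK, WV, WO, h, **fusion_kwargs):
    d_model = s_fwd.shape[-1]
    d_k = d_model // h
    # Project and reshape to [h, seq_len, d_k]; slices of WQ/WK/WV act as W_i.
    split = lambda X, W: (X @ W).reshape(X.shape[0], h, d_k).transpose(1, 0, 2)
    out_fwd, out_bwd = sbdpa(split(s_fwd, WQ), split(s_fwd, WK), split(s_fwd, WV),
                             split(s_bwd, WQ), split(s_bwd, WK), split(s_bwd, WV),
                             **fusion_kwargs)
    # Concatenate the heads of each direction and apply the output projection.
    merge = lambda X: X.transpose(1, 0, 2).reshape(-1, d_model) @ WO
    return merge(out_fwd), merge(out_bwd)
```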
3.3 Integrating Synchronous Bidirectional Attention into NMT
We apply our synchronous bidirectional attention
to replace the multi-head intra-attention in the
decoder, as illustrated in Figure 6. The neural
encoder of our model is identical to that of the
standard Transformer model. From the source
tokens, learned embeddings are generated which
are then modified by an additive positional
encoding. The encoded word embeddings are
then used as input to the encoder which consists
of N blocks each containing two layers: (1) a
multi-head attention layer (MHAtt), y (2) a
position-wise feed-forward layer (FFN).
The bidirectional decoder of our model is extended from the standard Transformer decoder. For each layer in the bidirectional decoder, the lowest sub-layer is our proposed synchronous bidirectional attention network, and it also uses residual connections around each of the sub-layers, followed by layer normalization:

s_d^l = LayerNorm(s^{l-1} + SBAtt(s^{l-1}, s^{l-1}, s^{l-1}))    (11)
Figure 6: The new Transformer architecture with the proposed synchronous bidirectional multi-head attention network, namely SBAtt. The input of the decoder is the concatenation of the forward (L2R) sequence and the backward (R2L) sequence. Note that all bidirectional information flow in the decoder runs in parallel and only interacts in the synchronous bidirectional attention layer.
where l denotes the layer depth, and the subscript d means the decoder-informed intra-attention representation. SBAtt is our proposed synchronous bidirectional attention, and s^{l-1} is equal to [→s^{l-1}; ←s^{l-1}], containing forward and backward hidden states. Moreover, the decoder stacks another two sub-layers to seek translation-relevant source semantics to bridge the gap between the source and target language:

s_e^l = LayerNorm(s_d^l + MHAtt(s_d^l, h^N, h^N))
s^l = LayerNorm(s_e^l + FFN(s_e^l))    (12)
where MHAtt denotes the multi-head attention in-
troduced in Equation 1, and we use e to denote
the encoder-informed inter-attention representa-
ción; hN is the source top layer hidden state, y
FFN means feed-forward networks.
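Putting Equations 11 and 12 together, one bidirectional decoder layer can be sketched as follows. The helpers (sb_att, mh_att, ffn, layer_norm) are assumed to exist with the obvious signatures; the sketch only shows the sub-layer order and the residual connections, not a full implementation.

```python
# A schematic sketch of one bidirectional decoder layer (Eqs. 11-12).
def decoder_layer(s_prev, h_enc, sb_att, mh_att, ffn, layer_norm):
    # (11) synchronous bidirectional intra-attention over [s_fwd; s_bwd]
    s_d = layer_norm(s_prev + sb_att(s_prev, s_prev, s_prev))
    # (12) encoder-decoder attention against the source top-layer states h^N
    s_e = layer_norm(s_d + mh_att(s_d, h_enc, h_enc))
    # position-wise feed-forward sub-layer with residual connection
    return layer_norm(s_e + ffn(s_e))
```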
Finally, we use a linear transformation and softmax activation to compute the probability of the next tokens based on s^N = [→s^N; ←s^N], namely the final hidden states of forward and backward decoding:

p(→y_j | →y