Synchronous Bidirectional Neural Machine Translation

Long Zhou1,2, Jiajun Zhang1,2∗, Chengqing Zong1,2,3
1National Laboratory of Pattern Recognition, CASIA, Beijing, China
2University of Chinese Academy of Sciences, Beijing, China
3CAS Center for Excellence in Brain Science and Intelligence Technology, Shanghai, China
{long.zhou, jjzhang, cqzong}@nlpr.ia.ac.cn

Abstract

Existing approaches to neural machine translation (NMT) generate the target language sequence token-by-token from left to right. However, this kind of unidirectional decoding framework cannot make full use of the target-side future contexts which can be produced in a right-to-left decoding direction, and thus suffers from the issue of unbalanced outputs. In this paper, we introduce a synchronous bidirectional neural machine translation (SB-NMT) model that predicts its outputs using left-to-right and right-to-left decoding simultaneously and interactively, in order to leverage both the history and the future information at the same time. Specifically, we first propose a new algorithm that enables synchronous bidirectional decoding in a single model. Then, we present an interactive decoding model in which left-to-right (right-to-left) generation does not only depend on its previously generated outputs, but also relies on future contexts predicted by right-to-left (left-to-right) decoding. We extensively evaluate the proposed SB-NMT model on large-scale NIST Chinese-English, WMT14 English-German, and WMT18 Russian-English translation tasks. Experimental results demonstrate that our model achieves significant improvements over the strong Transformer model by 3.92, 1.49, and 1.04 BLEU points, respectively, and obtains state-of-the-art performance on the Chinese-English and English-German translation tasks.1

1 Introduction

Neural machine translation has significantly improved the quality of machine translation in recent years (Sutskever et al., 2014; Bahdanau et al., 2015; Zhang and Zong, 2015; Wu et al., 2016; Gehring et al., 2017; Vaswani et al., 2017). Recent approaches to sequence-to-sequence learning typically leverage recurrence (Sutskever et al., 2014), convolution (Gehring et al., 2017), or attention (Vaswani et al., 2017) as basic building blocks.

Typically, NMT adopts the encoder-decoder architecture and generates the target translation from left to right. Despite their remarkable success, NMT models suffer from several weaknesses (Koehn and Knowles, 2017). One of the most prominent issues is the problem of unbalanced outputs, in which the translation prefixes are better predicted than the suffixes (Liu et al., 2016). We analyze the translation accuracy of the first and last 4 tokens for the left-to-right (L2R) and right-to-left (R2L) directions, respectively. As shown in Table 1, the statistical results show that L2R performs better on the first 4 tokens, whereas R2L translates better in terms of the last 4 tokens. This problem is mainly caused by left-to-right unidirectional decoding, which conditions each output word on previously generated outputs only, leaving the future information from target-side contexts unexploited during translation.

∗Corresponding author.
1The source code is available at https://github.com/wszlong/sb-nmt.


Model | The first 4 tokens | The last 4 tokens
L2R   | 40.21%             | 35.10%
R2L   | 35.67%             | 39.47%

Table 1: Translation accuracy of the first 4 tokens and the last 4 tokens on the NIST Chinese-English translation tasks. L2R denotes left-to-right decoding and R2L means right-to-left decoding for conventional NMT.

The future context is commonly used in reading and writing in the human cognitive process (Xia et al., 2017), and it is crucial for avoiding under-translation (Tu et al., 2016; Mi et al., 2016).

To alleviate these problems, existing studies usually used independent bidirectional decoders for NMT (Liu et al., 2016; Sennrich et al., 2016a). Most of them trained two NMT models with left-to-right and right-to-left directions, respectively. Then, they translated and re-ranked candidate translations using the two decoding scores together. More recently, Zhang et al. (2018) presented an asynchronous bidirectional decoding algorithm for NMT, which extended the conventional encoder-decoder framework by utilizing a backward decoder. However, these methods are more complicated than the conventional NMT framework because they require two NMT models or decoders. Moreover, the L2R and R2L decoders are either independent of each other (Liu et al., 2016), or only the forward decoder can utilize information from the backward decoder (Zhang et al., 2018). It is therefore a promising direction to design a synchronous bidirectional decoding algorithm in which L2R and R2L generation can interact with each other.

Accordingly, we propose in this paper a novel framework (SB-NMT) that utilizes a single decoder to generate target sentences bidirectionally, simultaneously and interactively. As shown in Figure 1, two special labels (⟨l2r⟩ and ⟨r2l⟩) at the beginning of the target sentence guide translating from left to right or from right to left, and the decoder in each direction can utilize the previously generated symbols of bidirectional decoding when generating the next token. Taking L2R decoding as an example, at each moment the generation of the target word (e.g., y3) does not only rely on the previously generated outputs (y1 and y2) of L2R decoding, but also depends on the previously predicted tokens (yn and yn−1) of R2L decoding. Compared with the previous related NMT models, our method has the following advantages: 1) We use a single model (one encoder and one decoder) to achieve decoding with left-to-right and right-to-left generation, which can be processed in parallel. 2) Via the synchronous bidirectional attention model (SBAtt, §3.2), our proposed model is an end-to-end joint framework and can optimize bidirectional decoding simultaneously. 3) Compared with the two-phase decoding scheme in previous work, our decoder is faster and more compact, using one beam search algorithm.

Figure 1: Illustration of the decoder in the synchronous bidirectional NMT model. L2R denotes left-to-right decoding guided by the start token ⟨l2r⟩ and R2L means right-to-left decoding indicated by the start token ⟨r2l⟩. SBAtt is our proposed synchronous bidirectional attention (see §3.2). For instance, the generation of y3 does not only rely on y1 and y2, but also depends on yn and yn−1 of R2L.

Specifically, we make the following contributions in this paper:

• We propose a synchronous bidirectional NMT model that adopts one decoder to generate outputs with left-to-right and right-to-left directions simultaneously and interactively. To the best of our knowledge, this is the first work to investigate the effectiveness of a single NMT model with synchronous bidirectional decoding.

• Extensive experiments on NIST Chinese-English, WMT14 English-German, and WMT18 Russian-English translation tasks demonstrate that our SB-NMT model obtains significant improvements over the strong Transformer model by 3.92, 1.49, and 1.04 BLEU points, respectively. In particular, our approach separately establishes the state-of-the-art BLEU scores of 51.11 and 29.21 on the Chinese-English and English-German translation tasks.

2 Background

In this paper, we build our model based on the powerful Transformer (Vaswani et al., 2017) with an encoder-decoder framework, where the encoder network first transforms an input sequence of symbols x = (x1, x2, ..., xn) to a sequence of continuous representations z = (z1, z2, ..., zn), from which the decoder generates an output sequence y = (y1, y2, ..., ym) one element at a time. Particularly, relying entirely on the multi-head attention mechanism, the Transformer with the beam search algorithm achieves state-of-the-art results for machine translation.

Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention.


Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. It operates on queries Q, keys K, and values V. For multi-head intra-attention of the encoder or decoder, all of Q, K, V are the output hidden-state matrices of the previous layer. For multi-head inter-attention of the decoder, Q are the hidden states of the previous decoder layer, and the K-V pairs come from the output (z1, z2, ..., zn) of the encoder.

Formally, multi-head attention first obtains h different representations of (Qi, Ki, Vi). Specifically, for each attention head i, we project the hidden-state matrix into distinct query, key, and value representations $Q_i = QW_i^Q$, $K_i = KW_i^K$, $V_i = VW_i^V$, respectively. Then we perform scaled dot-product attention for each representation, concatenate the results, and project the concatenation with a feed-forward layer:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}_i(\mathrm{head}_i)W^O, \quad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V) \quad (1)$$

where $W_i^Q$, $W_i^K$, $W_i^V$, and $W^O$ are parameter projection matrices.

Figure 3: Illustration of the standard beam search algorithm with beam size 4. The black blocks denote the ongoing expansion of the hypotheses.

Scaled dot-product attention can be described as mapping a query and a set of key-value pairs to an output. Specifically, we multiply the query $Q_i$ by the key $K_i$ to obtain an attention weight matrix, which is then multiplied by the value $V_i$ for each token to obtain the self-attention token representation. As shown in Figure 2, scaled dot-product attention operates on a query Q, a key K, and a value V as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \quad (2)$$

where $d_k$ is the dimension of the key. For the sake of brevity, we refer the reader to Vaswani et al. (2017) for more details.
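As a concrete illustration of Equations 1 and 2, the following NumPy sketch implements scaled dot-product attention and a minimal multi-head wrapper. It is not the authors' implementation; the head count, dimensions, and random projection matrices are placeholder assumptions, and the decoder-side mask that blocks future positions is omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Equation 2: Softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head(Q, K, V, W_q, W_k, W_v, W_o):
    # Equation 1: project per head, attend, concatenate, project with W_o.
    heads = [attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i])
             for i in range(len(W_q))]
    return np.concatenate(heads, axis=-1) @ W_o

# Toy example: 5 target positions, model dimension 16, 4 heads of size 4.
rng = np.random.default_rng(0)
m, d_model, h, d_k = 5, 16, 4, 4
x = rng.normal(size=(m, d_model))              # previous-layer hidden states
W_q = rng.normal(size=(h, d_model, d_k))
W_k = rng.normal(size=(h, d_model, d_k))
W_v = rng.normal(size=(h, d_model, d_k))
W_o = rng.normal(size=(h * d_k, d_model))
out = multi_head(x, x, x, W_q, W_k, W_v, W_o)  # self-attention: Q = K = V = x
print(out.shape)                               # (5, 16)
```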

Standard Beam Search  Given the trained model and an input sentence x, we usually employ beam search or greedy search (beam size = 1) to find the best translation $\hat{y} = \mathrm{argmax}_y P(y|x)$. The beam size N is used to control the search space by extending only the top-N hypotheses in the current stack. As shown in Figure 3, the blocks represent the four best token expansions of the previous states, and these token expansions are sorted top-to-bottom from most probable to least probable. We define a complete hypothesis as a hypothesis that outputs EOS, where EOS is a special target token indicating the end of the sentence. With the above settings, the translation y is generated token-by-token from left to right.
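The left-to-right beam search described above can be sketched as follows. It is a generic skeleton, assuming a hypothetical scoring function `log_probs(prefix)` that stands in for one decoder step of the trained model and returns next-token log-probabilities; length normalization and other practical refinements are omitted.

```python
def beam_search(log_probs, bos, eos, beam_size=4, max_len=100):
    """Standard left-to-right beam search.

    log_probs(prefix) -> dict mapping a next token to its log-probability
    (a stand-in for one decoder step of the trained model).
    """
    beam = [([bos], 0.0)]          # (prefix, accumulated log-probability)
    complete = []                  # hypotheses that have produced EOS
    for _ in range(max_len):
        candidates = []
        for prefix, score in beam:
            for tok, lp in log_probs(prefix).items():
                candidates.append((prefix + [tok], score + lp))
        # Keep only the top-N expansions (N = beam size).
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == eos:
                complete.append((prefix, score))
            else:
                beam.append((prefix, score))
        if not beam:
            break
    return max(complete + beam, key=lambda c: c[1])[0]
```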

3 Our Approach

In this section, we will introduce the approach of synchronous bidirectional NMT.


Our goal is to design a synchronous bidirectional beam search algorithm (§3.1), which generates tokens with both L2R and R2L decoding simultaneously and interactively using a single model. The central module is the synchronous bidirectional attention (SBAtt, see §3.2). By using SBAtt, the two decoding directions in one beam search process can help and interact with each other, and can make full use of the target-side history and future information during translation. Then, we apply our proposed SBAtt to replace the multi-head intra-attention in the decoder part of the Transformer model (§3.3), and the model is trained end-to-end by maximum likelihood using stochastic gradient descent (§3.4).

3.1 Synchronous Bidirectional Beam Search

Figure 4 illustrates the synchronous bidirectional beam search process with beam size 4. With two special start tokens, which are optimized during the training process, we let half of the beam keep decoding from left to right, guided by the label ⟨l2r⟩, and allow the other half of the beam to decode from right to left, indicated by the label ⟨r2l⟩. More importantly, via the proposed SBAtt model (§3.2), L2R (R2L) generation does not only depend on its previously generated outputs, but also relies on future contexts predicted by R2L (L2R) decoding.

Note that (1) at each time step, we choose the best items of the half beam from L2R decoding and the best items of the half beam from R2L decoding to continue expanding simultaneously; (2) the L2R and R2L beams should be thought of as parallel, with SBAtt computed between the items of the 1-best L2R and R2L hypotheses, the items of the 2-best L2R and R2L hypotheses, and so on2; (3) the black blocks denote the ongoing expansion of the hypotheses, and decoding terminates when the end-of-sentence flag EOS is predicted; (4) in our decoding algorithm, the complete hypotheses will not participate in subsequent SBAtt, and the L2R hypothesis attended to by R2L decoding may change at different time steps, while the ongoing partial hypotheses in both directions of SBAtt always share the same length; (5) finally, we output the translation result with the highest probability from all complete hypotheses. Intuitively, our model is able to choose the L2R output or the R2L output as the final hypothesis according to their model probabilities, and if an R2L hypothesis wins, we reverse its tokens before presenting it.

2We also did experiments in which all of the L2R hypotheses attend to the 1-best R2L hypothesis, and all of the R2L hypotheses attend to the 1-best L2R hypothesis. The results of the two schemes are similar. For the sake of simplicity, we employed the previous scheme.


Figure 4: The synchronous bidirectional decoding of our model. ⟨l2r⟩ and ⟨r2l⟩ are two special labels, which indicate the target-side translation direction in L2R and R2L modes, respectively. Our model can decode in both L2R and R2L directions in one beam search by using SBAtt, simultaneously and interactively. SBAtt means the synchronous bidirectional attention (§3.2) performed between items of L2R and R2L decoding.

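The following sketch illustrates the search procedure of Figure 4 under simplifying assumptions: the beam is split into an L2R half and an R2L half, the i-th best hypotheses of the two directions are paired so that SBAtt can attend across them, and both halves are expanded in lockstep. The function `bidir_step` is a hypothetical stand-in for one step of the SB-NMT decoder; this is not the authors' released code.

```python
def sync_bidir_beam_search(bidir_step, l2r_bos, r2l_bos, eos,
                           beam_size=4, max_len=100):
    """Sketch of synchronous bidirectional beam search (Figure 4).

    bidir_step(l2r_prefixes, r2l_prefixes) returns, for each paired
    hypothesis, a list of (token, log_prob) expansions computed with SBAtt
    between the i-th best L2R and R2L partial hypotheses.
    """
    half = beam_size // 2
    l2r = [([l2r_bos], 0.0) for _ in range(half)]   # L2R half of the beam
    r2l = [([r2l_bos], 0.0) for _ in range(half)]   # R2L half of the beam
    complete = []

    def expand(beam, cands):
        new = [(h + [t], s + lp)
               for (h, s), exps in zip(beam, cands)
               for t, lp in exps]
        new.sort(key=lambda c: c[1], reverse=True)
        keep = []
        for h, s in new:
            if h[-1] == eos:
                complete.append((h, s))     # finished hypotheses leave the beam
            elif len(keep) < half:
                keep.append((h, s))
        return keep

    for _ in range(max_len):
        l2r_cands, r2l_cands = bidir_step([h for h, _ in l2r],
                                          [h for h, _ in r2l])
        l2r, r2l = expand(l2r, l2r_cands), expand(r2l, r2l_cands)
        if not l2r or not r2l:
            break

    best, _ = max(complete + l2r + r2l, key=lambda c: c[1])
    body = best[1:]                          # drop the <l2r>/<r2l> start label
    return body[::-1] if best[0] == r2l_bos else body
```

Because the two halves always share the same length at every step, they can be batched and processed in parallel, so the whole procedure still runs as a single beam search.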

3.2 Synchronous Bidirectional Attention

Instead of the multi-head intra-attention that prevents future information flow in the decoder to preserve the auto-regressive property, we propose a synchronous bidirectional attention (SBAtt) mechanism. With the two key modules of synchronous bidirectional dot-product attention (§3.2.1) and synchronous bidirectional multi-head attention (§3.2.2), SBAtt is capable of capturing and combining the information generated by L2R and R2L decoding.

3.2.1 Synchronous Bidirectional Dot-Product Attention

Figure 5 shows our particular attention, Synchronous Bidirectional Dot-Product Attention (SBDPA). The input consists of queries ($[\overrightarrow{Q}; \overleftarrow{Q}]$), keys ($[\overrightarrow{K}; \overleftarrow{K}]$), and values ($[\overrightarrow{V}; \overleftarrow{V}]$), which are all concatenated from forward (L2R) states and backward (R2L) states. The new forward state $\overrightarrow{H}$ and backward state $\overleftarrow{H}$ can be obtained by synchronous bidirectional dot-product attention. For the new forward state $\overrightarrow{H}$, it can be calculated as:

$$\overrightarrow{H}_{history} = \mathrm{Attention}(\overrightarrow{Q}, \overrightarrow{K}, \overrightarrow{V})$$
$$\overrightarrow{H}_{future} = \mathrm{Attention}(\overrightarrow{Q}, \overleftarrow{K}, \overleftarrow{V})$$
$$\overrightarrow{H} = \mathrm{Fusion}(\overrightarrow{H}_{history}, \overrightarrow{H}_{future}) \quad (3)$$

where $\overrightarrow{H}_{history}$ is obtained by using conventional scaled dot-product attention as introduced in Equation 2, and its purpose is to take advantage of previously generated tokens, namely history information. We calculate $\overrightarrow{H}_{future}$ using the forward query ($\overrightarrow{Q}$) and the backward key-value pairs ($\overleftarrow{K}$, $\overleftarrow{V}$), which attempts to make use of future information from R2L decoding as effectively as possible in order to help predict the current token in L2R decoding. The role of Fusion(·) (the green block in Figure 5) is to combine $\overrightarrow{H}_{history}$ and $\overrightarrow{H}_{future}$ by using linear interpolation, nonlinear interpolation, or a gate mechanism.

Linear Interpolation  $\overrightarrow{H}_{history}$ and $\overrightarrow{H}_{future}$ have different importance for the prediction of the current word. Linear interpolation of $\overrightarrow{H}_{history}$ and $\overrightarrow{H}_{future}$ produces an overall hidden state:

$$\overrightarrow{H} = \overrightarrow{H}_{history} + \lambda \cdot \overrightarrow{H}_{future} \quad (4)$$

where λ is a hyper-parameter decided by the performance on the development set.3

Nonlinear Interpolation  $\overrightarrow{H}_{history}$ is equal to $\overrightarrow{H}$ in the conventional attention mechanism, and $\overrightarrow{H}_{future}$ means the attention information between the current hidden state and the generated hidden states of the other decoding direction. In order to distinguish the two different information sources, we present a nonlinear interpolation that adds an activation function to the backward hidden states:

$$\overrightarrow{H} = \overrightarrow{H}_{history} + \lambda \cdot \mathrm{AF}(\overrightarrow{H}_{future}) \quad (5)$$

where AF denotes an activation function, such as tanh or relu.

Gate Mechanism  We also propose a gate mechanism to dynamically control the amount of information flow from the forward and backward contexts. Specifically, we apply a feed-forward gating layer upon $\overrightarrow{H}_{history}$ as well as $\overrightarrow{H}_{future}$ to enrich the nonlinear expressiveness of our model:

$$r_t, z_t = \sigma(W^g[\overrightarrow{H}_{history}; \overrightarrow{H}_{future}])$$
$$\overrightarrow{H} = r_t \odot \overrightarrow{H}_{history} + z_t \odot \overrightarrow{H}_{future} \quad (6)$$

where $\odot$ denotes element-wise multiplication. Via this gating layer, the model is able to control how much past information is preserved from the previous context and how much reversed information is captured from the backward hidden states.

Similar to the calculation of the forward hidden states $\overrightarrow{H}$, the backward hidden states $\overleftarrow{H}$ can be computed as follows:

$$\overleftarrow{H}_{history} = \mathrm{Attention}(\overleftarrow{Q}, \overleftarrow{K}, \overleftarrow{V})$$
$$\overleftarrow{H}_{future} = \mathrm{Attention}(\overleftarrow{Q}, \overrightarrow{K}, \overrightarrow{V})$$
$$\overleftarrow{H} = \mathrm{Fusion}(\overleftarrow{H}_{history}, \overleftarrow{H}_{future}) \quad (7)$$

F

B

G
tu
e
S
T

T

o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3

where Fusion(·) is the same as introduced in Equations 4–6. Note that $\overrightarrow{H}$ and $\overleftarrow{H}$ can be calculated in parallel. We refer to the whole procedure formulated in Equation 3 and Equation 7 as SBDPA(·):

$$[\overrightarrow{H}; \overleftarrow{H}] = \mathrm{SBDPA}([\overrightarrow{Q}; \overleftarrow{Q}], [\overrightarrow{K}; \overleftarrow{K}], [\overrightarrow{V}; \overleftarrow{V}]) \quad (8)$$
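A minimal NumPy sketch of SBDPA (Equations 3, 7, and 8) and the three fusion variants (Equations 4–6) is given below. The value of λ, the choice of tanh, and the shape of the gate matrix W^g are illustrative assumptions; in the model they are tuned on the development set or learned.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Equation 2: scaled dot-product attention.
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def sbdpa(Qf, Kf, Vf, Qb, Kb, Vb, fusion, **kw):
    """Synchronous bidirectional dot-product attention (Equations 3 and 7).

    Forward (L2R) states attend to their own history and to the backward
    (R2L) states, and vice versa; the two terms are fused per Equations 4-6.
    """
    Hf_hist = attention(Qf, Kf, Vf)      # Eq. 3: L2R history
    Hf_fut  = attention(Qf, Kb, Vb)      # Eq. 3: future information from R2L
    Hb_hist = attention(Qb, Kb, Vb)      # Eq. 7: R2L history
    Hb_fut  = attention(Qb, Kf, Vf)      # Eq. 7: future information from L2R
    return fusion(Hf_hist, Hf_fut, **kw), fusion(Hb_hist, Hb_fut, **kw)

def linear_fusion(H_hist, H_fut, lam=0.5):
    # Equation 4: linear interpolation with hyper-parameter lambda.
    return H_hist + lam * H_fut

def nonlinear_fusion(H_hist, H_fut, lam=0.5):
    # Equation 5: apply an activation (tanh here) to the cross-direction term.
    return H_hist + lam * np.tanh(H_fut)

def gate_fusion(H_hist, H_fut, Wg):
    # Equation 6: gates r, z computed from the concatenated states;
    # Wg has shape (2*d, 2*d) so the sigmoid output splits into r and z.
    g = 1.0 / (1.0 + np.exp(-np.concatenate([H_hist, H_fut], axis=-1) @ Wg))
    r, z = np.split(g, 2, axis=-1)
    return r * H_hist + z * H_fut
```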

3Note that we can also set λ to be a vector and learn λ during training with standard back-propagation; we leave this as future exploration.


Figure 5: Synchronous bidirectional attention model based on scaled dot-product attention. It operates on forward (L2R) and backward (R2L) queries Q, keys K, and values V.


3.2.2 Synchronous Bidirectional Multi-Head Attention

Multi-head attention consists of h attention heads, each of which learns a distinct attention function to attend to all of the tokens in the sequence, where a mask is used to prevent leftward information flow in the decoder. Compared with standard multi-head attention, our inputs are the concatenation of forward and backward hidden states. We extend standard multi-head attention by letting each head attend to both the forward and the backward hidden states, combined via SBDPA(·):

$$\mathrm{MultiHead}([\overrightarrow{Q}; \overleftarrow{Q}], [\overrightarrow{K}; \overleftarrow{K}], [\overrightarrow{V}; \overleftarrow{V}]) = \mathrm{Concat}([\overrightarrow{H}_1; \overleftarrow{H}_1], \ldots, [\overrightarrow{H}_h; \overleftarrow{H}_h])W^O \quad (9)$$

and $[\overrightarrow{H}_i; \overleftarrow{H}_i]$ can be computed as follows, which is the biggest difference from conventional multi-head attention:

$$[\overrightarrow{H}_i; \overleftarrow{H}_i] = \mathrm{SBDPA}([\overrightarrow{Q}; \overleftarrow{Q}]W_i^Q, [\overrightarrow{K}; \overleftarrow{K}]W_i^K, [\overrightarrow{V}; \overleftarrow{V}]W_i^V) \quad (10)$$

where $W_i^Q$, $W_i^K$, $W_i^V$, and $W^O$ are parameter projection matrices, which are the same as in the standard multi-head attention introduced in Equation 1.
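Equations 9 and 10 can then be read as a thin wrapper around SBDPA: each head projects the forward and backward states with shared head matrices, runs SBDPA, and the per-direction head outputs are concatenated and projected with W^O. The sketch below builds on the `sbdpa` and fusion helpers from the previous sketch; the shapes and the fusion choice are illustrative assumptions.

```python
import numpy as np

# Assumes sbdpa(...) and a fusion function (e.g. linear_fusion) from the
# previous sketch are in scope.
def sb_multi_head(Hf, Hb, W_q, W_k, W_v, W_o, fusion, **kw):
    """Synchronous bidirectional multi-head attention (Equations 9-10).

    Hf, Hb: forward (L2R) and backward (R2L) hidden states, shape (m, d_model).
    W_q/W_k/W_v: per-head projections of shape (h, d_model, d_k); W_o: (h*d_k, d_model).
    """
    fwd_heads, bwd_heads = [], []
    for i in range(len(W_q)):
        # Eq. 10: project both directions with the same head matrices, then SBDPA.
        Hf_i, Hb_i = sbdpa(Hf @ W_q[i], Hf @ W_k[i], Hf @ W_v[i],
                           Hb @ W_q[i], Hb @ W_k[i], Hb @ W_v[i],
                           fusion, **kw)
        fwd_heads.append(Hf_i)
        bwd_heads.append(Hb_i)
    # Eq. 9: concatenate the heads per direction and project with W_o.
    return (np.concatenate(fwd_heads, axis=-1) @ W_o,
            np.concatenate(bwd_heads, axis=-1) @ W_o)
```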

3.3 Integrating Synchronous Bidirectional Attention into NMT

We apply our synchronous bidirectional attention to replace the multi-head intra-attention in the decoder, as illustrated in Figure 6. The neural encoder of our model is identical to that of the standard Transformer model. From the source tokens, learned embeddings are generated, which are then modified by an additive positional encoding. The encoded word embeddings are then used as input to the encoder, which consists of N blocks, each containing two layers: (1) a multi-head attention layer (MHAtt), and (2) a position-wise feed-forward layer (FFN).

The bidirectional decoder of our model is extended from the standard Transformer decoder. For each layer in the bidirectional decoder, the lowest sub-layer is our proposed synchronous bidirectional attention network, and it also uses residual connections around each of the sub-layers, followed by layer normalization:

$$s_d^l = \mathrm{LayerNorm}(s^{l-1} + \mathrm{SBAtt}(s^{l-1}, s^{l-1}, s^{l-1})) \quad (11)$$

Figure 6: The new Transformer architecture with the proposed synchronous bidirectional multi-head attention network, namely SBAtt. The input of the decoder is the concatenation of the forward (L2R) sequence and the backward (R2L) sequence. Note that all bidirectional information flow in the decoder runs in parallel and only interacts in the synchronous bidirectional attention layer.

where l denotes the layer depth, and the subscript d means the decoder-informed intra-attention representation. SBAtt is our proposed synchronous bidirectional attention, and $s^{l-1}$ is equal to $[\overrightarrow{s}^{l-1}; \overleftarrow{s}^{l-1}]$, containing forward and backward hidden states. Moreover, the decoder stacks another two sub-layers to seek translation-relevant source semantics to bridge the gap between the source and target language:

$$s_e^l = \mathrm{LayerNorm}(s_d^l + \mathrm{MHAtt}(s_d^l, h^N, h^N))$$
$$s^l = \mathrm{LayerNorm}(s_e^l + \mathrm{FFN}(s_e^l)) \quad (12)$$

where MHAtt denotes the multi-head attention introduced in Equation 1, and we use e to denote the encoder-informed inter-attention representation; $h^N$ is the source top-layer hidden state, and FFN means the feed-forward network.
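Putting Equations 11 and 12 together, one bidirectional decoder layer can be sketched as follows: SBAtt over the paired forward/backward states, encoder-decoder multi-head attention over h^N, and a position-wise feed-forward sub-layer, each wrapped in a residual connection and layer normalization. This reuses `multi_head`, `sb_multi_head`, and `linear_fusion` from the earlier sketches; sharing one set of projection matrices across sub-layers and the unlearned layer normalization are simplifications, not the authors' exact parameterization.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Simplified layer normalization (no learned gain/bias).
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def decoder_layer(s_fwd, s_bwd, enc_out, params):
    """One bidirectional decoder layer (Equations 11-12), as a sketch.

    s_fwd, s_bwd: forward/backward decoder states from the previous layer.
    enc_out: top-layer encoder states h^N.
    params: dict with per-head matrices Wq/Wk/Wv (h, d_model, d_k),
            Wo (h*d_k, d_model), and FFN weights W1 (d_model, d_ff),
            W2 (d_ff, d_model); shared here across sub-layers for brevity.
    """
    # Eq. 11: synchronous bidirectional self-attention + residual + layer norm.
    a_fwd, a_bwd = sb_multi_head(s_fwd, s_bwd, params["Wq"], params["Wk"],
                                 params["Wv"], params["Wo"],
                                 fusion=linear_fusion, lam=0.5)
    d_fwd, d_bwd = layer_norm(s_fwd + a_fwd), layer_norm(s_bwd + a_bwd)

    # Eq. 12 (first line): encoder-decoder attention over h^N, per direction.
    e_fwd = layer_norm(d_fwd + multi_head(d_fwd, enc_out, enc_out,
                                          params["Wq"], params["Wk"],
                                          params["Wv"], params["Wo"]))
    e_bwd = layer_norm(d_bwd + multi_head(d_bwd, enc_out, enc_out,
                                          params["Wq"], params["Wk"],
                                          params["Wv"], params["Wo"]))

    # Eq. 12 (second line): position-wise feed-forward sub-layer.
    def ffn(x):
        return np.maximum(0, x @ params["W1"]) @ params["W2"]
    return layer_norm(e_fwd + ffn(e_fwd)), layer_norm(e_bwd + ffn(e_bwd))
```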

Finally, we use a linear transformation and a softmax activation to compute the probability of the next tokens based on $s^N = [\overrightarrow{s}^N; \overleftarrow{s}^N]$, namely the final hidden states of forward and backward decoding:

$P(\overrightarrow{y}_j \mid \overrightarrow{y}$
