Synchronous Bidirectional Neural Machine Translation

Long Zhou1,2, Jiajun Zhang1,2∗, Chengqing Zong1,2,3

1National Laboratory of Pattern Recognition, CASIA, Beijing, China
2University of Chinese Academy of Sciences, Beijing, China
3CAS Center for Excellence in Brain Science and Intelligence Technology, Shanghai, China
{long.zhou, jjzhang, cqzong}@nlpr.ia.ac.cn

Abstract

Existing approaches to neural machine translation (NMT) generate the target language sequence token-by-token from left to right. However, this kind of unidirectional decoding framework cannot make full use of the target-side future contexts that can be produced in a right-to-left decoding direction, and thus suffers from the issue of unbalanced outputs. In this paper, we introduce a synchronous bidirectional neural machine translation (SB-NMT) model that predicts its outputs using left-to-right and right-to-left decoding simultaneously and interactively, in order to leverage both the history and future information at the same time. Specifically, we first propose a new algorithm that enables synchronous bidirectional decoding in a single model. Then, we present an interactive decoding model in which left-to-right (right-to-left) generation depends not only on its previously generated outputs, but also on future contexts predicted by right-to-left (left-to-right) decoding. We extensively evaluate the proposed SB-NMT model on large-scale NIST Chinese-English, WMT14 English-German, and WMT18 Russian-English translation tasks. Experimental results demonstrate that our model achieves significant improvements over the strong Transformer model by 3.92, 1.49, and 1.04 BLEU points, respectively, and obtains state-of-the-art performance on the Chinese-English and English-German translation tasks.1

1 Introduction

Neural machine translation has significantly improved the quality of machine translation in recent years (Sutskever et al., 2014; Bahdanau et al., 2015; Zhang and Zong, 2015; Wu et al., 2016; Gehring et al., 2017; Vaswani et al., 2017). Recent approaches to sequence-to-sequence learning typically leverage recurrence (Sutskever et al., 2014), convolution (Gehring et al., 2017), or attention (Vaswani et al., 2017) as basic building blocks.

Typically, NMT adopts the encoder-decoder architecture and generates the target translation from left to right. Despite their remarkable success, NMT models suffer from several weaknesses (Koehn and Knowles, 2017). One of the most prominent issues is the problem of unbalanced outputs, in which translation prefixes are better predicted than suffixes (Liu et al., 2016). We analyze the translation accuracy of the first and last 4 tokens for the left-to-right (L2R) and right-to-left (R2L) directions, respectively. As shown in Table 1, L2R performs better on the first 4 tokens, whereas R2L translates better on the last 4 tokens.

This problem is mainly caused by left-to-right unidirectional decoding, which conditions each output word on previously generated outputs only, leaving the future information from target-side contexts unexploited during translation. The future context is commonly used in reading and writing in the human cognitive process

∗Corresponding author.

1The source code is available at https://github.com/wszlong/sb-nmt.


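Transactions of the Association for Computational Linguistics, vol. 7, pp. 91–105, 2019. Action Editor: George Foster.
Submission batch: 8/2018; Revision batch: 10/2018; Published 4/2019.
© 2019 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.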


Model   First 4 tokens   Last 4 tokens
L2R     40.21%           35.10%
R2L     35.67%           39.47%

Table 1: Translation accuracy of the first 4 tokens and last 4 tokens in the NIST Chinese-English translation task. L2R denotes left-to-right decoding and R2L means right-to-left decoding for conventional NMT.

(Xia et al., 2017), and it is crucial to avoid under-translation (Tu et al., 2016; Mi et al., 2016).

To alleviate these problems, existing studies usually used independent bidirectional decoders for NMT (Liu et al., 2016; Sennrich et al., 2016a). Most of them trained two NMT models with left-to-right and right-to-left directions, respectively. Then, they translated and re-ranked candidate translations using the two decoding scores together. More recently, Zhang et al. (2018) proposed an asynchronous bidirectional decoding algorithm for NMT, which extended the conventional encoder-decoder framework by utilizing a backward decoder. However, these methods are more complicated than the conventional NMT framework because they require two NMT models or decoders. Furthermore, the L2R and R2L decoders are independent of each other (Liu et al., 2016), or only the forward decoder can utilize information from the backward decoder (Zhang et al., 2018). It is therefore a promising direction to design a synchronous bidirectional decoding algorithm in which L2R and R2L generation can interact with each other.

Therefore, we propose in this paper a novel framework (SB-NMT) that utilizes a single decoder to bidirectionally generate target sentences simultaneously and interactively. As shown in Figure 1, two special labels (⟨l2r⟩ and ⟨r2l⟩) at the beginning of the target sentence guide translating from left to right or right to left, and the decoder in each direction can utilize the previously generated symbols of bidirectional decoding when generating the next token. Taking L2R decoding as an example, at each moment, the generation of the target word (e.g., y3) relies not only on the previously generated outputs (y1 and y2) of L2R decoding, but also on the previously predicted tokens (yn and yn−1) of R2L decoding. Compared with previous related NMT models, our method has the following advantages: 1) We use a single model (one encoder and one decoder) to achieve decoding with left-to-right and right-to-left generation, which can be processed in parallel. 2) Via the synchronous bidirectional attention model (SBAtt, §3.2), our proposed model is an end-to-end joint framework and can optimize bidirectional decoding simultaneously. 3) Compared with the two-phase decoding scheme in previous work, our decoder is faster and more compact, using one beam search algorithm.

Figure 1: Illustration of the decoder in the synchronous bidirectional NMT model. L2R denotes left-to-right decoding guided by the start token ⟨l2r⟩ and R2L means right-to-left decoding indicated by the start token ⟨r2l⟩. SBAtt is our proposed synchronous bidirectional attention (see §3.2). For instance, the generation of y3 relies not only on y1 and y2, but also on yn and yn−1 of R2L.
Specifically, we make the following contributions in this paper:

• We propose a synchronous bidirectional NMT model that adopts one decoder to generate outputs with left-to-right and right-to-left directions simultaneously and interactively. To the best of our knowledge, this is the first work to investigate the effectiveness of a single NMT model with synchronous bidirectional decoding.

• Extensive experiments on NIST Chinese-English, WMT14 English-German, and WMT18 Russian-English translation tasks demonstrate that our SB-NMT model obtains significant improvements over the strong Transformer model by 3.92, 1.49, and 1.04 BLEU points, respectively. In particular, our approach establishes state-of-the-art BLEU scores of 51.11 and 29.21 on the Chinese-English and English-German translation tasks, respectively.

2 Background

In this paper, we build our model on the powerful Transformer (Vaswani et al., 2017) with

Figure 2: (Left) Scaled Dot-Product Attention. (Right) Multi-Head Attention.

an encoder-decoder framework, where the encoder network first transforms an input sequence of symbols x = (x1, x2, …, xn) into a sequence of continuous representations z = (z1, z2, …, zn), from which the decoder generates an output sequence y = (y1, y2, …, ym) one element at a time. In particular, relying entirely on the multi-head attention mechanism, the Transformer with the beam search algorithm achieves state-of-the-art results for machine translation.

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. It operates on queries Q, keys K, and values V. For the multi-head intra-attention of the encoder or decoder, Q, K, and V are all the output hidden-state matrices of the previous layer. For the multi-head inter-attention of the decoder, Q consists of the hidden states of the previous decoder layer, and the K-V pairs come from the output (z1, z2, …, zn) of the encoder.

Formally, multi-head attention first obtains h different representations of ($Q_i$, $K_i$, $V_i$). Specifically, for each attention head i, we project the hidden-state matrix into distinct query, key, and value representations $Q_i = QW_i^{Q}$, $K_i = KW_i^{K}$, $V_i = VW_i^{V}$, respectively. Then we perform scaled dot-product attention for each representation, concatenate the results, and project the concatenation with a feed-forward layer.

$$
\begin{aligned}
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}_i(\mathrm{head}_i)\, W^{O} \\
\mathrm{head}_i &= \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})
\end{aligned}
\tag{1}
$$

Figure 3: Illustration of the standard beam search algorithm with beam size 4. The black blocks denote the ongoing expansion of the hypotheses.

Scaled dot-product attention can be described as mapping a query and a set of key-value pairs to an output. Specifically, we multiply query $Q_i$ by key $K_i$ to obtain an attention weight matrix, which is then multiplied by value $V_i$ for each token to obtain the self-attention token representation. As shown in Figure 2, scaled dot-product attention operates on a query Q, a key K, and a value V as:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V
\tag{2}
$$

where $d_k$ is the dimension of the key. For the sake of brevity, we refer the reader to Vaswani et al. (2017) for more details.
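The computation in Equation 2 can be sketched in a few lines. The following is a minimal, dependency-free illustration rather than the implementation used in the paper: matrices are plain Python lists of row vectors, and the function names and toy inputs are assumptions made for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention (Equation 2).

    Q, K, V are lists of row vectors; K and V must have the same number of rows.
    """
    d_k = len(K[0])
    output = []
    for q in Q:
        # One scaled score per key: (q . k) / sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Each output row is the attention-weighted sum of the value rows.
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output

# Toy inputs (hypothetical): one query and two identical keys, so both
# attention weights are 0.5 and the two value rows are averaged.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [1.0, 0.0]]
V = [[0.0, 0.0], [2.0, 4.0]]
context = attention(Q, K, V)
print(context)
```

Because the two keys are identical, the query cannot prefer one over the other, and the output is the mean of the two value rows.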

Standard Beam Search. Given the trained model and input sentence x, we usually employ beam search or greedy search (beam size = 1) to find the best translation $\hat{y} = \arg\max_{y} P(y \mid x)$. Beam size N is used to control the search space by extending only the top-N hypotheses in the current stack. As shown in Figure 3, the blocks represent the four best token expansions of the previous states, and these token expansions are sorted top-to-bottom from most probable to least probable. We define a complete hypothesis as a hypothesis that outputs EOS, where EOS is a special target token indicating the end of the sentence. With the above settings, the translation y is generated token-by-token from left to right.
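As a rough sketch of the procedure described above, the following code runs left-to-right beam search over a toy model. The callback `next_log_probs`, the probability table in `toy_model`, and the absence of length normalization are assumptions made for illustration, not details taken from the paper.

```python
import math

def beam_search(next_log_probs, beam_size, max_len, eos="EOS"):
    """Left-to-right beam search; returns (tokens, log_prob) of the best hypothesis."""
    beam = [([], 0.0)]   # ongoing hypotheses: (tokens, cumulative log-probability)
    complete = []        # hypotheses that have already produced EOS
    for _ in range(max_len):
        candidates = []
        for tokens, score in beam:
            for token, log_p in next_log_probs(tokens).items():
                candidates.append((tokens + [token], score + log_p))
        # Keep only the top-N expansions, sorted from most to least probable.
        candidates.sort(key=lambda hyp: hyp[1], reverse=True)
        beam = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos:
                complete.append((tokens, score))
            else:
                beam.append((tokens, score))
        if not beam:
            break
    return max(complete or beam, key=lambda hyp: hyp[1])

def toy_model(prefix):
    """Hypothetical next-token distribution used only for this example."""
    table = {
        (): {"a": math.log(0.6), "b": math.log(0.4)},
        ("a",): {"EOS": math.log(0.55), "c": math.log(0.45)},
        ("b",): {"EOS": math.log(0.9), "c": math.log(0.1)},
    }
    return table.get(tuple(prefix), {"EOS": 0.0})

greedy_tokens, greedy_score = beam_search(toy_model, beam_size=1, max_len=5)
beam_tokens, beam_score = beam_search(toy_model, beam_size=2, max_len=5)
print(greedy_tokens, beam_tokens)
```

With this toy table, greedy search commits to the locally best first token "a", while beam size 2 also keeps "b" and ends with a more probable complete hypothesis.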

In Equation 1, $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ are parameter projection matrices.

3 Our Approach

In this section, we introduce the approach of synchronous bidirectional NMT. Our goal is to


design a synchronous bidirectional beam search algorithm (§3.1) that generates tokens with both L2R and R2L decoding simultaneously and interactively using a single model. The central module is the synchronous bidirectional attention (SBAtt, see §3.2). By using SBAtt, the two decoding directions in one beam search process can help and interact with each other, and can make full use of the target-side history and future information during translation. Then, we apply our proposed SBAtt to replace the multi-head intra-attention in the decoder part of the Transformer model (§3.3), and the model is trained end-to-end by maximum likelihood using stochastic gradient descent (§3.4).

3.1 Synchronous Bidirectional Beam Search

Figure 4 illustrates the synchronous bidirectional beam search process with beam size 4. With two special start tokens that are optimized during the training process, we let half of the beam keep decoding from left to right, guided by the label ⟨l2r⟩, and allow the other half of the beam to decode from right to left, indicated by the label ⟨r2l⟩. More importantly, via the proposed SBAtt model (§3.2), L2R (R2L) generation depends not only on its previously generated outputs, but also on future contexts predicted by R2L (L2R) decoding.

Note that (1) at each time step, we choose the best items of the half beam from L2R decoding and the best items of the half beam from R2L decoding to continue expanding simultaneously; (2) the L2R and R2L beams should be thought of as parallel, with SBAtt computed between the 1-best items of L2R and R2L, the 2-best items of L2R and R2L, and so on2; (3) the black blocks denote the ongoing expansion of the hypotheses, and decoding terminates when the end-of-sentence flag EOS is predicted; (4) in our decoding algorithm, the complete hypotheses will not participate in subsequent SBAtt, and the L2R hypothesis attended by R2L decoding may change at different time steps, while the ongoing partial hypotheses in both directions of SBAtt always share the same length; (5) finally, we output the translation result with the highest probability from all complete hypotheses. Intuitively, our model is able to choose either the L2R output or the R2L output as the final hypothesis according to their model probabilities, and if an R2L hypothesis wins, we reverse its tokens before presenting it.

2We also did experiments in which all of the L2R hypotheses attend to the 1-best R2L hypothesis, and all of the R2L hypotheses attend to the 1-best L2R hypothesis. The results of the two schemes are similar. For the sake of simplicity, we employed the former scheme.

Figure 4: The synchronous bidirectional decoding of our model. ⟨l2r⟩ and ⟨r2l⟩ are two special labels, which indicate the target-side translation direction in L2R and R2L modes, respectively. Our model can decode with both L2R and R2L directions in one beam search by using SBAtt, simultaneously and interactively. SBAtt means the synchronous bidirectional attention (§3.2) performed between items of L2R and R2L decoding.
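The beam bookkeeping described in this section can be sketched as follows. This is only an illustration under stated assumptions: the model is abstracted as a hypothetical callback `next_log_probs(direction, prefix, partner)`, the beam size is assumed to be even and at least 2, and the toy model ignores the partner hypothesis that a real SB-NMT model would attend to through SBAtt.

```python
import math

def sync_bidirectional_beam_search(next_log_probs, beam_size, max_len, eos="EOS"):
    """Sketch of synchronous bidirectional beam search.

    next_log_probs(direction, prefix, partner) is a hypothetical model callback
    returning {token: log_prob}; `partner` is the paired hypothesis of the other
    direction (what SBAtt would attend to), or None if no partner is left.
    """
    half = beam_size // 2
    beams = {"l2r": [([], 0.0)], "r2l": [([], 0.0)]}
    complete = []  # finished hypotheses: (direction, tokens without EOS, log-prob)
    for _ in range(max_len):
        new_beams = {}
        for direction, other in (("l2r", "r2l"), ("r2l", "l2r")):
            candidates = []
            for rank, (tokens, score) in enumerate(beams[direction]):
                # Pair the i-best item with the i-best item of the other direction.
                partner = beams[other][rank][0] if rank < len(beams[other]) else None
                for token, log_p in next_log_probs(direction, tokens, partner).items():
                    candidates.append((tokens + [token], score + log_p))
            candidates.sort(key=lambda hyp: hyp[1], reverse=True)
            new_beams[direction] = []
            for tokens, score in candidates[:half]:
                if tokens[-1] == eos:
                    # Complete hypotheses leave the beam and are no longer attended to.
                    complete.append((direction, tokens[:-1], score))
                else:
                    new_beams[direction].append((tokens, score))
        beams = new_beams  # both halves advance together, one token per step
        if not beams["l2r"] and not beams["r2l"]:
            break
    if not complete:
        return None
    direction, tokens, score = max(complete, key=lambda hyp: hyp[2])
    # An R2L hypothesis is reversed before it is returned.
    return (tokens if direction == "l2r" else tokens[::-1]), score

def make_toy_model(l2r_conf, r2l_conf, target=("x", "y", "z")):
    """Hypothetical model: each direction emits the target with a fixed confidence."""
    def next_log_probs(direction, prefix, partner):
        # A real model would condition on `partner` via SBAtt; this toy ignores it.
        sequence = target if direction == "l2r" else target[::-1]
        conf = l2r_conf if direction == "l2r" else r2l_conf
        if len(prefix) >= len(sequence):
            return {"EOS": 0.0}
        return {sequence[len(prefix)]: math.log(conf), "w": math.log(1.0 - conf)}
    return next_log_probs

l2r_wins = sync_bidirectional_beam_search(make_toy_model(0.9, 0.8), beam_size=4, max_len=6)
r2l_wins = sync_bidirectional_beam_search(make_toy_model(0.7, 0.9), beam_size=4, max_len=6)
print(l2r_wins[0], r2l_wins[0])
```

In the second call the R2L half of the beam produces the most probable complete hypothesis, so its tokens are reversed and the returned translation is again in left-to-right order.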

3.2 Synchronous Bidirectional Attention

Instead of multi-head intra-attention, which prevents future information flow in the decoder to preserve the auto-regressive property, we propose a synchronous bidirectional attention (SBAtt) mechanism. With the two key modules of synchronous bidirectional dot-product attention (§3.2.1) and synchronous bidirectional multi-head attention (§3.2.2), SBAtt is capable of capturing and combining the information generated by L2R and R2L decoding.

3.2.1 Synchronous Bidirectional Dot-Product Attention

Figure 5 shows our particular attention, Synchronous Bidirectional Dot-Product Attention (SBDPA). The input consists of queries ($[\overrightarrow{Q}; \overleftarrow{Q}]$), keys ($[\overrightarrow{K}; \overleftarrow{K}]$), and values ($[\overrightarrow{V}; \overleftarrow{V}]$), which are all concatenations of forward (L2R) states and backward (R2L) states. The new forward state $\overrightarrow{H}$ and backward state $\overleftarrow{H}$ can be obtained by synchronous bidirectional dot-product attention. The new forward state $\overrightarrow{H}$ can be calculated as:

$$
\begin{aligned}
\overrightarrow{H}_{history} &= \mathrm{Attention}(\overrightarrow{Q}, \overrightarrow{K}, \overrightarrow{V}) \\
\overrightarrow{H}_{future} &= \mathrm{Attention}(\overrightarrow{Q}, \overleftarrow{K}, \overleftarrow{V}) \\
\overrightarrow{H} &= \mathrm{Fusion}(\overrightarrow{H}_{history}, \overrightarrow{H}_{future})
\end{aligned}
\tag{3}
$$

where $\overrightarrow{H}_{history}$ is obtained by using conventional scaled dot-product attention as introduced in Equation 2, and its purpose is to take advantage of previously generated tokens, namely history information. We calculate $\overrightarrow{H}_{future}$ using the forward query ($\overrightarrow{Q}$) and the backward key-value pairs ($\overleftarrow{K}$, $\overleftarrow{V}$), which attempts to make use of future information from R2L decoding as effectively as possible in order to help predict the current token in L2R decoding. The role of Fusion(·) (green block in Figure 5) is to combine $\overrightarrow{H}_{history}$ and $\overrightarrow{H}_{future}$ by using linear interpolation, nonlinear interpolation, or a gate mechanism.

Linear Interpolation. $\overrightarrow{H}_{history}$ and $\overrightarrow{H}_{future}$ have different importance to the prediction of the current word. Linear interpolation of $\overrightarrow{H}_{history}$ and $\overrightarrow{H}_{future}$ produces an overall hidden state:

$$
\overrightarrow{H} = \overrightarrow{H}_{history} + \lambda * \overrightarrow{H}_{future}
\tag{4}
$$

where λ is a hyper-parameter decided by the performance on the development set.3

Nonlinear Interpolation. $\overrightarrow{H}$ is equal to $\overrightarrow{H}_{history}$ in the conventional attention mechanism, and $\overrightarrow{H}_{future}$ means the attention information between the current hidden state and the generated hidden states of the other decoding direction. In order to distinguish the two different information sources, we present a nonlinear interpolation by adding an activation function to the backward hidden states:

$$
\overrightarrow{H} = \overrightarrow{H}_{history} + \lambda * AF(\overrightarrow{H}_{future})
\tag{5}
$$

where AF denotes an activation function, such as tanh or relu.

Gate Mechanism. We also propose a gate mechanism to dynamically control the amount of information flow from the forward and backward contexts. Specifically, we apply a feed-forward gating layer upon $\overrightarrow{H}_{history}$ as well as $\overrightarrow{H}_{future}$ to enrich the nonlinear expressiveness of our model:

$$
\begin{aligned}
r_t, z_t &= \sigma(W^{g}[\overrightarrow{H}_{history}; \overrightarrow{H}_{future}]) \\
\overrightarrow{H} &= r_t \odot \overrightarrow{H}_{history} + z_t \odot \overrightarrow{H}_{future}
\end{aligned}
\tag{6}
$$

where $\odot$ denotes element-wise multiplication. Via this gating layer, the model is able to control how much past information can be preserved from the previous context and how much reversed information can be captured from the backward hidden states.

Similar to the calculation of the forward hidden states $\overrightarrow{H}_i$, the backward hidden states $\overleftarrow{H}_i$ can be computed as follows.

$$
\begin{aligned}
\overleftarrow{H}_{history} &= \mathrm{Attention}(\overleftarrow{Q}, \overleftarrow{K}, \overleftarrow{V}) \\
\overleftarrow{H}_{future} &= \mathrm{Attention}(\overleftarrow{Q}, \overrightarrow{K}, \overrightarrow{V}) \\
\overleftarrow{H} &= \mathrm{Fusion}(\overleftarrow{H}_{history}, \overleftarrow{H}_{future})
\end{aligned}
\tag{7}
$$

where Fusion(·) is the same as introduced in Equations 4–6. Note that $\overrightarrow{H}$ and $\overleftarrow{H}$ can be calculated in parallel. We refer to the whole procedure formulated in Equation 3 and Equation 7 as SBDPA(·).

$$
[\overrightarrow{H}; \overleftarrow{H}] = \mathrm{SBDPA}([\overrightarrow{Q}; \overleftarrow{Q}], [\overrightarrow{K}; \overleftarrow{K}], [\overrightarrow{V}; \overleftarrow{V}])
\tag{8}
$$

3Note that we can also set λ to be a vector and learn λ during training with standard back-propagation, and we leave this for future exploration.

Figure 5: Synchronous bidirectional attention model based on scaled dot-product attention. It operates on forward (L2R) and backward (R2L) queries Q, keys K, and values V.
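As a rough illustration of Equations 3, 4, 5, and 7, the sketch below computes the forward and backward states with a shared fusion step. It is a simplified, dependency-free example and not the paper's implementation: matrices are lists of row vectors, attention masks are omitted, the gate mechanism of Equation 6 is left out because it needs a learned matrix, and the value λ = 0.5 and the toy states are assumptions chosen for readability.

```python
import math

def attention(Q, K, V):
    """Scaled dot-product attention (Equation 2) on lists of row vectors."""
    d_k = len(K[0])
    output = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        weights = [e / sum(exps) for e in exps]
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output

def fusion(h_history, h_future, lam, activation=None):
    """Linear interpolation (Equation 4) if activation is None, else Equation 5."""
    act = activation if activation is not None else (lambda value: value)
    return [[h + lam * act(f) for h, f in zip(row_h, row_f)]
            for row_h, row_f in zip(h_history, h_future)]

def sbdpa(Qf, Kf, Vf, Qb, Kb, Vb, lam, activation=None):
    """Equations 3 and 7: each direction attends to its own states and to the other's."""
    H_forward = fusion(attention(Qf, Kf, Vf), attention(Qf, Kb, Vb), lam, activation)
    H_backward = fusion(attention(Qb, Kb, Vb), attention(Qb, Kf, Vf), lam, activation)
    return H_forward, H_backward

# Toy states (hypothetical): one forward and one backward position.
Qf, Kf, Vf = [[1.0, 0.0]], [[1.0, 0.0]], [[1.0, 2.0]]
Qb, Kb, Vb = [[0.0, 1.0]], [[0.0, 1.0]], [[10.0, 20.0]]
H_forward, H_backward = sbdpa(Qf, Kf, Vf, Qb, Kb, Vb, lam=0.5)
H_tanh, _ = sbdpa(Qf, Kf, Vf, Qb, Kb, Vb, lam=0.5, activation=math.tanh)
print(H_forward, H_backward, H_tanh)
```

With a single key per direction, each attention call returns the corresponding value row unchanged, so the effect of the fusion step is easy to read off: the forward state is its own value plus λ times the backward value, and vice versa.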

3.2.2 Synchronous Bidirectional Multi-Head Attention

Multi-head attention consists of h attention heads, each of which learns a distinct attention function to attend to all of the tokens in the sequence, where a mask is used to prevent leftward information flow in the decoder. Compared with standard multi-head attention, our inputs are the concatenation of forward and backward hidden states. We extend standard multi-head attention by letting each head attend to both forward and backward hidden states, combined via SBDPA(·):

$$
\mathrm{MultiHead}([\overrightarrow{Q}; \overleftarrow{Q}], [\overrightarrow{K}; \overleftarrow{K}], [\overrightarrow{V}; \overleftarrow{V}]) = \mathrm{Concat}([\overrightarrow{H}_1; \overleftarrow{H}_1], \ldots, [\overrightarrow{H}_h; \overleftarrow{H}_h])\, W^{O}
\tag{9}
$$

and $[\overrightarrow{H}_i; \overleftarrow{H}_i]$ can be computed as follows, which is the biggest difference from conventional multi-head attention:

$$
[\overrightarrow{H}_i; \overleftarrow{H}_i] = \mathrm{SBDPA}([\overrightarrow{Q}; \overleftarrow{Q}]W_i^{Q}, [\overrightarrow{K}; \overleftarrow{K}]W_i^{K}, [\overrightarrow{V}; \overleftarrow{V}]W_i^{V})
\tag{10}
$$

where $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ are parameter projection matrices, which are the same as in the standard multi-head attention introduced in Equation 1.

3.3 Integrating Synchronous Bidirectional Attention into NMT

We apply our synchronous bidirectional attention to replace the multi-head intra-attention in the decoder, as illustrated in Figure 6. The neural encoder of our model is identical to that of the standard Transformer model. From the source tokens, learned embeddings are generated, which are then modified by an additive positional encoding. The encoded word embeddings are then used as input to the encoder, which consists of N blocks, each containing two layers: (1) a multi-head attention layer (MHAtt), and (2) a position-wise feed-forward layer (FFN).


The bidirectional decoder of our model is extended from the standard Transformer decoder. For each layer in the bidirectional decoder, the lowest sub-layer is our proposed synchronous bidirectional attention network, and it also uses residual connections around each of the sub-layers, followed by layer normalization:

$$
s_d^{l} = \mathrm{LayerNorm}(s^{l-1} + \mathrm{SBAtt}(s^{l-1}, s^{l-1}, s^{l-1}))
\tag{11}
$$


Figure 6: The new Transformer architecture with the proposed synchronous bidirectional multi-head attention network, namely SBAtt. The input of the decoder is the concatenation of the forward (L2R) sequence and the backward (R2L) sequence. Note that all bidirectional information flow in the decoder runs in parallel and only interacts in the synchronous bidirectional attention layer.

where l denotes the layer depth, and the subscript d means the decoder-informed intra-attention representation. SBAtt is our proposed synchronous bidirectional attention, and $s^{l-1}$ is equal to $[\overrightarrow{s}^{l-1}; \overleftarrow{s}^{l-1}]$, containing the forward and backward hidden states. In addition, the decoder stacks another two sub-layers to seek translation-relevant source semantics to bridge the gap between the source and target languages:

$$
\begin{aligned}
s_e^{l} &= \mathrm{LayerNorm}(s_d^{l} + \mathrm{MHAtt}(s_d^{l}, h^{N}, h^{N})) \\
s^{l} &= \mathrm{LayerNorm}(s_e^{l} + \mathrm{FFN}(s_e^{l}))
\end{aligned}
\tag{12}
$$

where MHAtt denotes the multi-head attention introduced in Equation 1, and we use e to denote the encoder-informed inter-attention representation; $h^{N}$ is the source top-layer hidden state, and FFN means feed-forward network.
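The way Equations 11 and 12 chain the three sub-layers can be sketched as below. This is a structural illustration only: the sub-layers are passed in as callables, the placeholder sub-layers and toy inputs are hypothetical, and the layer normalization shown has no learned gain or bias, which is a simplifying assumption.

```python
import math

def layer_norm(rows, eps=1e-6):
    """Normalize each row to zero mean and unit variance (no learned gain or bias)."""
    normalized = []
    for row in rows:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        normalized.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return normalized

def add(a, b):
    """Element-wise sum of two equally shaped matrices (the residual connection)."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def decoder_layer(s_prev, h_enc, sb_att, mh_att, ffn):
    """One bidirectional decoder layer following Equations 11 and 12."""
    s_d = layer_norm(add(s_prev, sb_att(s_prev, s_prev, s_prev)))  # Equation 11
    s_e = layer_norm(add(s_d, mh_att(s_d, h_enc, h_enc)))          # Equation 12, line 1
    return layer_norm(add(s_e, ffn(s_e)))                          # Equation 12, line 2

def zero_attention(q, k, v):
    """Placeholder attention (hypothetical): contributes nothing to the residual."""
    return [[0.0] * len(row) for row in q]

def identity_ffn(x):
    """Placeholder feed-forward network (hypothetical): returns its input."""
    return x

s_prev = [[1.0, 2.0, 3.0], [4.0, 0.0, -4.0]]   # toy decoder states
h_enc = [[0.5, 0.5, 0.5]]                      # toy encoder output
s_next = decoder_layer(s_prev, h_enc, zero_attention, zero_attention, identity_ffn)
print(s_next)
```

In a real model, `sb_att` would be the SBAtt of §3.2 applied to the concatenated forward and backward states, and `mh_att` and `ffn` would be the standard Transformer sub-layers.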

Finally, we use a linear transformation and softmax activation to compute the probability of the next tokens based on $s^{N} = [\overrightarrow{s}^{N}; \overleftarrow{s}^{N}]$, namely the final hidden states of forward and backward decoding:

p(−→y j|−→y
