Better Document-Level Machine Translation with Bayes’ Rule

Lei Yu1, Laurent Sartran1, Wojciech Stokowiec1,
Wang Ling1, Lingpeng Kong1, Phil Blunsom1,2, Chris Dyer1

1DeepMind,

2University of Oxford

{leiyu, lsartran, wstokowiec, lingwang, lingpenk, pblunsom,
cdyer}@google.com

Abstract

We show that Bayes’ rule provides an effective mechanism for creating document translation models that can be learned from only parallel sentences and monolingual documents, a compelling benefit because parallel documents are not always available. In our formulation, the posterior probability of a candidate translation is the product of the unconditional (prior) probability of the candidate output document and the ‘‘reverse translation probability’’ of translating the candidate output back into the source language. Our proposed model uses a powerful autoregressive language model as the prior on target language documents, but it assumes that each sentence is translated independently from the target to the source language. Crucially, at test time, when a source document is observed, the document language model prior induces dependencies between the translations of the source sentences in the posterior. The model’s independence assumption not only enables efficient use of available data, but it additionally admits a practical left-to-right beam-search algorithm for carrying out inference. Experiments show that our model benefits from using cross-sentence context in the language model, and it outperforms existing document translation approaches.

1 Introduction

There have been many recent demonstrations that neural language models based on transformers (Vaswani et al., 2017; Dai et al., 2019) are capable of learning to generate remarkably coherent documents with few (Zellers et al., 2019) or no (Radford et al., 2019) conditioning variables. Despite this apparent generation ability, in practical applications, unconditional language models are most often used to provide representations for natural language understanding applications (Devlin et al., 2019; Yang et al., 2019; Peters et al., 2018), and how to use them for conditional generation applications remains an open question.
Our hypothesis in this work is that Bayes’ rule provides an effective way to leverage powerful unconditional document language models to improve a conditional task: machine translation. The application of Bayes’ rule to transform the translation modeling problem p(y | x), where y is the target language and x is the source language, has a long tradition and was the dominant paradigm in speech and language processing for many years (Brown et al., 1993), where it is often called a ‘‘noisy channel’’ decomposition, by analogy to an information-theoretic conception of Bayes’ rule.

Whereas several recent papers have demonstrated that the noisy channel decomposition has benefits when translating sentences one-by-one (Yu et al., 2017; Yee et al., 2019; Ng et al., 2019), in this paper we show that this decomposition is particularly suited to tackling the problem of translating complete documents. Although using cross-sentence context and maintaining cross-document consistency has long been recognized as essential to the translation problem (Tiedemann and Scherrer, 2017; Bawden et al., 2018, inter alia), operationalizing this in models has been challenging for several reasons. Most prosaically, parallel documents are not generally available (whereas parallel sentences are much more numerous), making direct estimation of document translation probabilities challenging. More subtly, documents are considerably more diverse than sentences, and models must be carefully biased so as not to pick up spurious correlations.

Our Bayes’ rule decomposition (§2) permits several innovations that enable us to solve these problems. Rather than directly modeling the conditional distribution, we rewrite it as p(y | x) ∝ p(y) × p(x | y).
This changes the learning problem from estimating a single complex conditional distribution to learning two different distributions: a language model p(y), which provides unconditional estimates of the output (in this paper, documents); and p(x | y), which provides the probability of translating a candidate output y into the (observed) source document x. As we will discuss subsequently, although the problems of estimating p(y | x) and p(x | y) are formally similar, independence assumptions made in p(x | y) are less statistically costly than they might otherwise be since, at test time, we will be conditioning on x and reasoning about a posterior distribution over y, which will be jointly dependent on all (conditionally independent) parts of x. This statistical fact, which is the same trick that gives naïve Bayes classifiers their expressiveness and ease of estimation, permits us to assume independence between sentence translations in the reverse translation model, and therefore to use parallel sentences (rather than parallel documents) to train it. In the posterior, we thus have an implicit estimate of a document-level translation system, even though we made no use of parallel documents when estimating the prior or likelihood models. This is particularly useful because parallel sentences are much more readily available than parallel documents. A second benefit of our approach is that the unconditional language model can be estimated from nonparallel data, which exists in vast quantities.
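To make the decomposition concrete, the following is a minimal sketch (in Python, not code from the paper) of how a candidate document translation could be scored under this factorization; doc_lm_logprob and sent_channel_logprob are hypothetical stand-ins for the trained document language model and the sentence-level reverse translation model.

```python
from typing import Callable, List

def noisy_channel_score(
    src_sents: List[str],
    cand_sents: List[str],
    doc_lm_logprob: Callable[[List[str]], float],        # log p(y): document prior
    sent_channel_logprob: Callable[[str, str], float],   # log p(x_i | y_i): reverse translation
) -> float:
    """Unnormalized log-posterior: log p(y | x) = log p(y) + sum_i log p(x_i | y_i) + const.

    The prior scores the whole candidate document, so it can enforce
    cross-sentence consistency, while the channel model factorizes over
    sentence pairs and can therefore be trained on parallel sentences alone.
    """
    assert len(src_sents) == len(cand_sents)  # the model assumes I = J
    prior = doc_lm_logprob(cand_sents)
    channel = sum(sent_channel_logprob(x_i, y_i)
                  for x_i, y_i in zip(src_sents, cand_sents))
    return prior + channel
```

Because the channel terms decompose per sentence, only the prior ties the target sentences together, which is exactly the mechanism that induces cross-sentence dependencies in the posterior.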

Although the noisy channel model is ideal for exploiting the data resources that naturally exist in the world (large corpora of parallel but independent sentences, and large corpora of monolingual documents), we are faced with a much harder decoding problem (§3). To address this problem, we propose a new beam-search algorithm, exploiting the fact that our document language model operates left-to-right, and our reverse translation model treats sentences independently. The search is guided by a proposal distribution that provides candidate continuations of a document prefix, and these are reranked according to the posterior distribution. In particular, we compare two proposal models: one based on estimates of independent sentence translations (Vaswani et al., 2017) and one that conditions on the source document context (Zhang et al., 2018). Although closely related, our algorithm is much simpler and faster than that proposed in Yu et al. (2017). Rather than using a specially designed channel model (Yu et al., 2016), which is limited in processing long sequences like documents, our conditional sentence independence assumptions allow us to use any sequence-to-sequence model as the channel model, making it a better option for document-level translation.
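As a rough illustration of this search procedure, the sketch below (hypothetical Python, not the authors' implementation) extends each partial document hypothesis with candidate translations of the next source sentence drawn from a proposal model and reranks the extended prefixes by the posterior score; propose, doc_lm_logprob, and sent_channel_logprob are assumed interfaces.

```python
from typing import Callable, List, Sequence, Tuple

def rerank_beam_search(
    src_sents: Sequence[str],
    propose: Callable[[str], List[str]],                 # candidate translations of one sentence
    doc_lm_logprob: Callable[[List[str]], float],        # log p(y_1 .. y_i) for a document prefix
    sent_channel_logprob: Callable[[str, str], float],   # log p(x_i | y_i)
    beam_size: int = 5,
) -> List[str]:
    """Left-to-right beam search over document prefixes.

    Each hypothesis is a list of target sentences.  At step i every surviving
    prefix is extended with the proposal model's candidates for the i-th
    source sentence, and the extended prefixes are reranked by the posterior
    (document prior plus accumulated channel scores).
    """
    beam: List[Tuple[List[str], float]] = [([], 0.0)]    # (target prefix, channel log-prob so far)
    for x_i in src_sents:
        scored = []
        for prefix, channel in beam:
            for y_i in propose(x_i):
                new_prefix = prefix + [y_i]
                new_channel = channel + sent_channel_logprob(x_i, y_i)
                posterior = doc_lm_logprob(new_prefix) + new_channel
                scored.append((posterior, new_prefix, new_channel))
        scored.sort(key=lambda t: t[0], reverse=True)
        beam = [(p, c) for _, p, c in scored[:beam_size]]
    return beam[0][0]  # highest-scoring complete document
```

The proposal model narrows the search to a small set of plausible sentence translations, and the Bayes'-rule posterior decides among them; the prefix language model score is what links each choice to the sentences already translated.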

To explore the performance of our proposed model, we focus on Chinese–English translation, following a series of papers on document translation (Zhang et al., 2018; Werlen et al., 2018; Tu et al., 2018; Xiong et al., 2019). Although in general it is unreasonable to expect that independent translations of sentences would lead to coherent translations of documents, the task of translating Chinese into English poses some particularly acute challenges. Because Chinese makes fewer inflectional distinctions than English does, and the relevant clues for predicting, for example, what tense an English verb should be in, or whether an English noun should have singular or plural morphology, may be spread throughout a document, it is crucial that extra-sentential context is used.

Our experiments (§4) explore: (1) different approaches to reranking, (2) different independence assumptions when modeling documents (i.e., whether sentences are generated independently or not), (3) different amounts of language modeling data, and (4) different proposal models. Briefly summarized, we find that document-context language models significantly improve the translation quality obtained with our system, both in terms of BLEU scores and in terms of a human evaluation. Targeted error analysis demonstrates that the document prior is capable of enforcing consistency of tense, number, and lexical choice across documents.

2 Model Description
We define x = (x_1, x_2, . . . , x_I) as the source document with I sentences and, similarly, y = (y_1, y_2, . . . , y_J) as the target document with J sentences. During the (human) translation process, translators may split or recombine sentences, but we will assume that I = J.1 Let x_i = (x_i^1, x_i^2, . . . , x_i^M) represent the ith sentence in the document, consisting of M words; likewise, let y_i = (y_i^1, y_i^2, . . . , y_i^N) denote the ith sentence in the target document, containing N words.

1Size mismatches are addressed by merging sentences
using sentence alignment algorithms (Gale and Church,
1993).

The translation of a document x is determined by finding the document ŷ for which p(ŷ | x) is optimal:

    ŷ = arg max_y p(y | x).    (1)

Instead of modeling the probability p(y | x) directly, we factorize it using Bayes’ rule:

    ŷ = arg max_y [p(x | y) × p(y)] / p(x)
      = arg max_y p(x | y) × p(y),    (2)

where p(x | y) is the channel model and p(y) is the language model.

We further assume that sentences are independently translated, and that the sentences are generated by a left-to-right factorization according to the chain rule. Therefore, we have

    ŷ ≈ arg max_y ∏_{i=1}^{|x|} p(x_i | y_i) × p(y_i | y_{<i}),

where y_{<i} denotes the target sentences preceding y_i.
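As a small sketch of this approximation (hypothetical Python, mirroring the factorization above rather than the paper's implementation), the log of the objective decomposes into per-sentence channel terms plus chain-rule language model terms; sent_channel_logprob and lm_continuation_logprob are assumed interfaces.

```python
from typing import Callable, List

def factorized_objective(
    src_sents: List[str],
    tgt_sents: List[str],
    sent_channel_logprob: Callable[[str, str], float],          # log p(x_i | y_i)
    lm_continuation_logprob: Callable[[List[str], str], float], # log p(y_i | y_{<i})
) -> float:
    """Log of the factorized objective: sum_i [ log p(x_i | y_i) + log p(y_i | y_{<i}) ].

    The channel model treats each sentence pair independently, while the
    language model conditions every target sentence on the document prefix,
    which is how cross-sentence context enters the translation.
    """
    total = 0.0
    for i, (x_i, y_i) in enumerate(zip(src_sents, tgt_sents)):
        total += sent_channel_logprob(x_i, y_i)
        total += lm_continuation_logprob(tgt_sents[:i], y_i)
    return total
```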