Better Document-Level Machine Translation with Bayes’ Rule

Lei Yu¹, Laurent Sartran¹, Wojciech Stokowiec¹,
Wang Ling¹, Lingpeng Kong¹, Phil Blunsom¹,², Chris Dyer¹

¹DeepMind, ²University of Oxford

{leiyu, lsartran, wstokowiec, lingwang, lingpenk, pblunsom, cdyer}@google.com

Abstract

We show that Bayes’ rule provides an effective
mechanism for creating document translation
models that can be learned from only paral-
lel sentences and monolingual documents—a
compelling benefit because parallel documents
are not always available. In our formulation,
the posterior probability of a candidate transla-
tion is the product of the unconditional (prior)
probability of the candidate output document
and the “reverse translation probability” of
translating the candidate output back into the
source language. Our proposed model uses a
powerful autoregressive language model as
the prior on target language documents, but it
assumes that each sentence is translated inde-
pendently from the target to the source lan-
guage. Crucially, at test time, when a source
document is observed, the document language
model prior induces dependencies between
the translations of the source sentences in the
posterior. The model’s independence assump-
tion not only enables efficient use of available
data, but it additionally admits a practical
left-to-right beam-search algorithm for carry-
ing out inference. Experiments show that our
model benefits from using cross-sentence con-
text in the language model, and it outperforms
existing document translation approaches.

1 Introduction

There have been many recent demonstrations that
neural language models based on transformers
(Vaswani et al., 2017; Dai et al., 2019) are capa-
ble of learning to generate remarkably coherent
documents with few (Zellers et al., 2019) or no
(Radford et al., 2019) conditioning variables.
Despite this apparent generation ability, in prac-
tical applications, unconditional language models
are most often used to provide representations
for natural language understanding applications
(Devlin et al., 2019; Yang et al., 2019; Peters
et al., 2018), and how to use them for conditional
generation applications remains an open question.
Our hypothesis in this work is that Bayes’ rule
provides an effective way to leverage powerful
unconditional document language models to improve
a conditional task: machine translation.
The application of Bayes’ rule to transform the
translation modeling problem p(y | x), where y is
the target-language text and x is the source-language
text, has a long tradition and was the dominant
paradigm in speech and language processing for many
years (Brown et al., 1993), where it is often called a
“noisy channel” decomposition, by analogy to an
information-theoretic conception of Bayes’ rule.

Whereas several recent papers have demon-
strated that the noisy channel decomposition has
benefits when translating sentences one-by-one
(Yu et al., 2017; Yee et al., 2019; Ng et al., 2019),
in this paper we show that this decomposition is
particularly suited to tackling the problem of trans-
lating complete documents. Although using cross-
sentence context and maintaining cross-document
consistency has long been recognized as essen-
tial to the translation problem (Tiedemann and
Scherrer, 2017; Bawden et al., 2018, inter alia),
operationalizing this in models has been challeng-
ing for several reasons. Most prosaically, parallel
documents are not generally available (whereas
parallel sentences are much more numerous),
making direct estimation of document translation
probabilities challenging. More subtly, documents
are considerably more diverse than sentences, and
models must be carefully biased so as not to pick
up spurious correlations.

Transactions of the Association for Computational Linguistics, vol. 8, pp. 346–360, 2020. https://doi.org/10.1162/tacl_a_00319
Action Editor: David Chiang. Submission batch: 12/2019; Revision batch: 2/2020; Published 6/2020.
© 2020 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Our Bayes’ rule decomposition (§2) permits
several innovations that enable us to solve these
problems. Rather than directly modeling the
conditional distribution, we rewrite it as
p(y | x) ∝ p(y) × p(x | y). This changes the
learning problem from estimating a single complex
conditional distribution to learning two different
distributions: a language model p(y), which
provides unconditional estimates of the output
(in this paper, documents); and p(x | y), which
provides the probability of translating a candidate
output y into the (observed) source document x.
As we will discuss subsequently, although the
problems of estimating p(y | x) and p(x | y)
are formally similar, independence assumptions
made in p(x | y) are less statistically costly than
they might otherwise be since, at test time, we
will be conditioning on x and reasoning about a
posterior distribution over y, which will be jointly
dependent on all (conditionally independent) parts
of x. This statistical fact—which is the same
trick that gives naïve Bayes classifiers their
expressiveness and ease of estimation—permits
us to assume independence between sentence
translations in the reverse translation model, and
therefore to use parallel sentences (rather than
parallel documents) to train it. In the posterior, we
thus have an implicit estimate of a document-level
translation system, even though we made no use
of parallel documents when estimating the prior
or likelihood models. This is particularly useful
because parallel sentences are much more readily
available than parallel documents. A second
benefit of our approach is that the unconditional
language model can be estimated from nonparallel
data, which exists in vast quantities.
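
The claim that a document prior induces dependencies between sentence translations in the posterior, even when the channel model treats sentences independently, can be checked on a toy example. The sketch below is not taken from the paper: it assumes a two-sentence document in which each target sentence is reduced to a single tense label, and every probability in it is made up for illustration. The first source sentence carries a clear tense cue, while the second is ambiguous.

```python
from itertools import product

TENSES = ("past", "present")

# Hypothetical document prior p(y1, y2): consistent tense is more likely.
prior = {(a, b): (0.4 if a == b else 0.1) for a, b in product(TENSES, TENSES)}

# Hypothetical channel likelihoods p(x_i | y_i) for the observed source sentences.
# Sentences are scored independently of each other.
channel = [
    {"past": 0.9, "present": 0.1},  # p(x1 | y1): x1 has a clear tense cue
    {"past": 0.5, "present": 0.5},  # p(x2 | y2): x2 is ambiguous about tense
]


def posterior(prior, channel):
    """Return p(y1, y2 | x1, x2), proportional to p(y1, y2) * p(x1 | y1) * p(x2 | y2)."""
    unnorm = {y: prior[y] * channel[0][y[0]] * channel[1][y[1]] for y in prior}
    z = sum(unnorm.values())
    return {y: v / z for y, v in unnorm.items()}


post = posterior(prior, channel)

# Marginal posterior probability that the second sentence is in the past tense.
p_y2_past = sum(p for (y1, y2), p in post.items() if y2 == "past")
print(round(p_y2_past, 2))
```

Under these assumed numbers the prior marginal for the second sentence being past tense is 0.5, and its own source sentence is uninformative, yet the posterior probability rises to 0.74 because the cue in the first source sentence is propagated through the document prior.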

Although the noisy channel model is ideal for
exploiting the data resources that naturally exist
in the world (large corpora of parallel but inde-
pendent sentences, and large corpora of mono-
lingual documents), we are faced with a much
harder decoding problem (§3). To address this
problem, we propose a new beam-search algo-
rithm, exploiting the fact that our document lan-
guage model operates left-to-right, and our reverse
translation model treats sentences independently.
The search is guided by a proposal distribution that
provides candidate continuations of a document
prefix, and these are reranked according to the
posterior distribution. In particular, we compare
two proposal models: one based on estimates of
independent sentence translations (Vaswani et al.,
2017) and one that conditions on the source doc-
ument context (Zhang et al., 2018). Although
closely related, our algorithm is much simpler and
faster than that proposed in Yu et al. (2017). Rather
than using a specially designed channel model
(Yu et al., 2016), which is limited in processing
long sequences like documents, our conditional
sentence independence assumptions allow
us to use any sequence-to-sequence model as the
channel model, making it a better option for
document-level translation.
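
The search procedure described in this paragraph can be sketched as a sentence-by-sentence beam search. This is a minimal illustration under stated assumptions, not the algorithm as specified in §3: the function names (`document_beam_search`, `propose`, `channel_logprob`, `lm_logprob`) are hypothetical, the score simply adds the channel and language model log-probabilities without any weighting, and the toy models and placeholder source strings at the bottom are invented so that the block runs on its own.

```python
import math
from typing import Callable, List, Sequence, Tuple

Hypothesis = Tuple[float, Tuple[str, ...]]  # (log score, target sentences so far)


def document_beam_search(
    source: Sequence[str],
    propose: Callable[[str, Tuple[str, ...]], List[str]],
    channel_logprob: Callable[[str, str], float],
    lm_logprob: Callable[[str, Tuple[str, ...]], float],
    beam_size: int = 2,
) -> Tuple[str, ...]:
    """Translate a document left to right, one sentence at a time."""
    beam: List[Hypothesis] = [(0.0, ())]
    for x_i in source:
        expanded: List[Hypothesis] = []
        for score, prefix in beam:
            # The proposal model supplies candidate continuations of the prefix.
            for y_i in propose(x_i, prefix):
                # Rerank with log p(x_i | y_i) + log p(y_i | previous sentences).
                new_score = score + channel_logprob(x_i, y_i) + lm_logprob(y_i, prefix)
                expanded.append((new_score, prefix + (y_i,)))
        beam = sorted(expanded, key=lambda h: h[0], reverse=True)[:beam_size]
    return beam[0][1]


# --- Toy stand-ins (all values are assumptions for illustration) ---
CANDIDATES = {
    "src1": ["He went home .", "He goes home ."],
    "src2": ["He slept .", "He sleeps ."],
}
PAST = {"He went home .", "He slept ."}


def toy_propose(x_i: str, prefix: Tuple[str, ...]) -> List[str]:
    return CANDIDATES[x_i]


def toy_channel_logprob(x_i: str, y_i: str) -> float:
    # "src1" is assumed to carry a past-tense cue; "src2" is ambiguous.
    if x_i == "src1":
        return math.log(0.9 if y_i in PAST else 0.1)
    return math.log(0.5)


def toy_lm_logprob(y_i: str, prefix: Tuple[str, ...]) -> float:
    # Favor the tense used by the previous target sentence.
    if not prefix:
        return math.log(0.5)
    same_tense = (y_i in PAST) == (prefix[-1] in PAST)
    return math.log(0.8 if same_tense else 0.2)


best = document_beam_search(
    ["src1", "src2"], toy_propose, toy_channel_logprob, toy_lm_logprob
)
print(best)
```

Because the language model term conditions on the whole prefix while the channel term looks only at the current sentence pair, each hypothesis can be extended and rescored incrementally, which is what makes a left-to-right search practical here.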

To explore the performance of our proposed
model, we focus on Chinese–English translation,
following a series of papers on document trans-
lation (Zhang et al., 2018; Werlen et al., 2018;
Tu et al., 2018; Xiong et al., 2019). Although
in general it is unreasonable to expect that inde-
pendent translations of sentences would lead to
coherent translations of documents, the task of
translating Chinese into English poses some
particularly acute challenges. As Chinese makes
fewer inflectional distinctions than English does,
and the relevant clues for predicting, for example,
what tense an English verb should be in, or
whether an English noun should have singular
or plural morphology, may be spread throughout a
document, it is crucial that extra-sentential context
is used.

Our experiments (§4) explore: (1) different
approaches to reranking, (2) different indepen-
dence assumptions when modeling documents
(i.e., whether sentences are generated indepen-
dently or not), (3) different amounts of language
modeling data, and (4) different proposal models.
Briefly summarized, we find that document-
context language models significantly improve the
translation quality obtained with our system, both
in terms of BLEU scores, and in terms of a human
evaluation. Targeted error analysis demonstrates
that the document prior is capable of enforcing
consistency of tense, number, and lexical choice
across documents.

2 Model Description

We define $x = (x^1, x^2, \ldots, x^I)$ as the source
document with $I$ sentences, and similarly,
$y = (y^1, y^2, \ldots, y^J)$ as the target document with
$J$ sentences. During the (human) translation process,
translators may split or recombine sentences, but
we will assume that $I = J$.¹ Let
$x^i = (x^i_1, x^i_2, \ldots, x^i_M)$ represent the $i$th
sentence in the document, consisting of $M$ words;
likewise, let $y^i = (y^i_1, y^i_2, \ldots, y^i_N)$ denote
the $i$th sentence in the target document,
containing $N$ words.

¹ Size mismatches are addressed by merging sentences
using sentence alignment algorithms (Gale and Church,
1993).

The translation of a document $x$ is determined
by finding the document $\hat{y}$ that maximizes
$p(y \mid x)$:

\[
\hat{y} = \arg\max_{y} \, p(y \mid x). \tag{1}
\]

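Instead of modeling the probability $p(y \mid x)$
directly, we factorize it using Bayes’ rule:

\[
\begin{aligned}
\hat{y} &= \arg\max_{y} \frac{p(x \mid y) \times p(y)}{p(x)} \\
        &= \arg\max_{y} \underbrace{p(x \mid y)}_{\text{channel model}} \times \underbrace{p(y)}_{\text{language model}}.
\end{aligned}
\tag{2}
\]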
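
The denominator p(x) can be dropped because it does not depend on the candidate y, so it rescales every candidate by the same constant and leaves the arg max unchanged. A small numeric check, with two hypothetical candidate translations and made-up probabilities, might look like this:

```python
# Hypothetical toy values for one observed source x and two candidates y.
prior = {"y_a": 0.7, "y_b": 0.3}        # language model p(y)
likelihood = {"y_a": 0.2, "y_b": 0.6}   # channel model p(x | y)

# p(x) sums over the candidates and is the same for every y.
evidence = sum(likelihood[y] * prior[y] for y in prior)

# Normalized posterior p(y | x).
posterior = {y: likelihood[y] * prior[y] / evidence for y in prior}

best_posterior = max(posterior, key=posterior.get)
best_unnormalized = max(prior, key=lambda y: likelihood[y] * prior[y])
print(best_posterior, best_unnormalized)
```

Both selections pick the same candidate. Here the candidate with the lower prior wins because the channel model assigns it a sufficiently higher probability of producing the observed source.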

We further assume that sentences are independently
translated, and that the sentences are generated
by a left-to-right factorization according to
the chain rule. Therefore, we have

\[
\hat{y} \approx \arg\max_{y} \prod_{i=1}^{|x|} p(x^i \mid y^i) \times p(y^i \mid y^{<i})
\]
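
Under the two assumptions just stated (independent sentence translation in the channel model, and a left-to-right language model over the target document), the log of the objective for a candidate document is a sum over sentences of log p(x^i | y^i) + log p(y^i | y^{<i}). The following sketch scores one candidate document this way. The function name `noisy_channel_log_score` and the two callables are hypothetical placeholders for trained models, and the constant toy models exist only to make the block runnable.

```python
import math
from typing import Callable, Sequence


def noisy_channel_log_score(
    source: Sequence[str],
    target: Sequence[str],
    channel_logprob: Callable[[str, str], float],
    lm_logprob: Callable[[str, Sequence[str]], float],
) -> float:
    """Sum log p(x^i | y^i) + log p(y^i | y^{<i}) over aligned sentence pairs."""
    if len(source) != len(target):
        raise ValueError("the model assumes one target sentence per source sentence")
    total = 0.0
    for i, (x_i, y_i) in enumerate(zip(source, target)):
        total += channel_logprob(x_i, y_i)     # sentence-level channel model
        total += lm_logprob(y_i, target[:i])   # document LM, conditioned on the prefix
    return total


# Toy stand-ins with constant probabilities (assumptions for illustration only).
def toy_channel(x_i: str, y_i: str) -> float:
    return math.log(0.5)


def toy_lm(y_i: str, prefix: Sequence[str]) -> float:
    return math.log(0.25)


score = noisy_channel_log_score(["x1", "x2"], ["y1", "y2"], toy_channel, toy_lm)
print(score)
```

Working in log space turns the product over sentences into a sum, which avoids numerical underflow on long documents and lets a search procedure update the score one sentence at a time.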