Multilingual Denoising Pre-training for Neural Machine Translation

Yinhan Liu‡∗, Jiatao Gu†∗, Naman Goyal†∗, Xian Li†, Sergey Edunov†,
Marjan Ghazvininejad†, Mike Lewis†, and Luke Zettlemoyer‡

†Facebook AI
‡Birch Technology
†{jgu,naman,xianl,edunov,ghazvini,mikelewis,lsz}@fb.com
‡yinhan@birch.ai

Abstract

This paper demonstrates that multilingual
denoising pre-training produces significant
performance gains across a wide variety of
machine translation (MT) tasks. We present
mBART—a sequence-to-sequence denoising
auto-encoder pre-trained on large-scale mono-
lingual corpora in many languages using the
BART objective (Lewis et al., 2019). mBART
is the first method for pre-training a complete
sequence-to-sequence model by denoising full
texts in multiple languages, whereas previous
approaches have focused only on the encoder,
decoder, or reconstructing parts of the text.
Pre-training a complete model allows it to
be directly fine-tuned for supervised (both
sentence-level and document-level) and un-
supervised machine translation, with no task-
specific modifications. We demonstrate that
adding mBART initialization produces per-
formance gains in all but the highest-resource
settings, including up to 12 BLEU points for
low resource MT and over 5 BLEU points
for many document-level and unsupervised
models. We also show that it enables transfer
to language pairs with no bi-text or that were
not in the pre-training corpus, and present
extensive analysis of which factors contribute
the most to effective pre-training.1

1 Introduction

Despite its wide adoption for other NLP tasks
(Devlin et al., 2019; Liu et al., 2019; Yang et al.,
2019b; Lewis et al., 2019; Raffel et al., 2019),

∗ Equal contribution. Most of the work was done when the first author worked at Facebook.

1Code and pre-trained models are available at https://github.com/pytorch/fairseq/tree/master/examples/mbart.


self-supervised pre-training is not yet common
practice in machine translation (MT). Existing
approaches (Lample and Conneau, 2019; Edunov
et al., 2019; Lewis et al., 2019; Raffel et al., 2019)
have been proposed to either pre-train only parts
of the model or focus exclusively on English corpora. In
this paper, we show that significant performance
gains are possible by pre-training a complete
autoregressive model with an objective that noises
and reconstructs full texts across many languages.
In this work, we present mBART—a multilin-
gual sequence-to-sequence (Seq2Seq) denoising
auto-encoder. mBART is trained by applying
the BART objective (Lewis et al., 2019) to large-scale
monolingual corpora across many languages. The
input texts are noised by masking phrases and
permuting sentences, and a single Transformer
(Vaswani et al., 2017) model is learned to recover
the texts. Unlike other pre-training approaches
for MT (Lample and Conneau, 2019; Song et al.,
2019), mBART pre-trains a
complete autoregressive Seq2Seq model. mBART
is trained once for all languages, providing a
set of parameters that can be fine-tuned for any
of the language pairs in both supervised and
unsupervised settings, without any task-specific or
language-specific modifications or initialization
schemes.

Extensive experiments demonstrate that this
simple approach works remarkably well. We first
focus on existing MT benchmarks. For supervised
sentence-level MT, mBART initialization leads to
significant gains (up to 12 BLEU points) across
low/medium-resource pairs (<10M bi-text pairs), without sacrificing performance in high-resource settings. These results further improve with back-translation (BT), setting a new state-of-the-art on WMT16 English-Romanian and the FLoRes test sets. For document-level MT, our document-level pre-training improves results by up to 5.5 BLEU points.

A language id symbol <LID> is used as
the initial token to predict the sentence. It is also
possible to use other noise types, such as those in
Lample et al. (2018c), but we leave the exploration
of the optimal noising strategy to future work.
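For concreteness, the sketch below shows one way the noising used here (sentence permutation followed by word-span masking, as in Figure 2) could be implemented. This is an illustrative sketch, not the paper's code: the masking ratio, the Poisson span-length parameter, and the whitespace tokenization are placeholder assumptions.

```python
import numpy as np

def add_noise(sentences, mask_token="<mask>", mask_ratio=0.35, poisson_lambda=3.5, rng=None):
    """Sketch of BART-style noising: (1) permute sentences, (2) mask word spans.
    mask_ratio and poisson_lambda are illustrative defaults, not values quoted
    from this excerpt; the real model operates on subword tokens."""
    rng = rng or np.random.default_rng()

    # (1) sentence permutation
    order = rng.permutation(len(sentences))
    sentences = [sentences[i] for i in order]

    # (2) span masking: replace spans of words with a single mask token
    tokens = " ".join(sentences).split()
    budget = int(len(tokens) * mask_ratio)
    while budget > 0 and len(tokens) > 1:
        span = max(1, int(rng.poisson(poisson_lambda)))   # sample a span length
        start = int(rng.integers(0, len(tokens)))
        span = min(span, len(tokens) - start, budget)
        tokens[start:start + span] = [mask_token]          # whole span -> one mask
        budget -= span
    return " ".join(tokens)
```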

Instance Format For each instance of a batch,
we sample a language id symbol <LID>, and we
pack as many consecutive sentences as possible,
sampled from the corresponding corpus of <LID>,
until either it hits the document boundary or
reaches the 512 max token length. Sentences in
the instance are separated by the end of sentence
(</S>) token. Then, we append the selected <LID>
token to represent the end of this instance. Pre-
training at the ‘‘multi-sentence’’ level enables us to
work on both sentence and document translation.
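The following sketch illustrates this instance format. The helper name `pack_instances`, the language code string, and the whitespace token counting are hypothetical; the actual pipeline operates on the shared subword vocabulary of §2.1.

```python
def pack_instances(sentences, lang_id, max_len=512, eos="</S>"):
    """Pack consecutive sentences from one language into instances of at most
    max_len tokens, separating sentences with </S> and appending the language
    id token at the end of each instance (illustrative sketch)."""
    instances, current, length = [], [], 0
    for sent in sentences:
        n = len(sent.split()) + 1                      # +1 for the </S> separator
        if current and length + n > max_len - 1:       # reserve one slot for <LID>
            instances.append(" ".join(current) + f" {lang_id}")
            current, length = [], 0
        current.extend([sent, eos])
        length += n
    if current:
        instances.append(" ".join(current) + f" {lang_id}")
    return instances

# e.g. pack_instances(doc_sentences, "[en_XX]") -> list of <=512-token instances
```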

Optimization Our full model (including 25
languages) is trained on 256 Nvidia V100 GPUs
(32GB) for 500K steps. The total batch size is
around 128K tokens per GPU, matching the BART
(Lewis et al., 2019) configuration. We use the
Adam optimizer (ε = 1e−6, β2 = 0.98) and linear
learning rate decay scheduling. The total training
time was approximately 2.5 weeks. We started the
training with dropout 0.1 and reduced it to 0.05 at
250K steps and 0 at 400K steps. All experiments
are done with Fairseq (Ott et al., 2019).
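As a rough illustration of this optimization setup (not the exact fairseq configuration), the PyTorch sketch below wires up Adam with ε = 1e−6 and β2 = 0.98, linear learning-rate decay, and the staged dropout reduction. The peak learning rate and warm-up length are placeholders, since they are not reported in this excerpt.

```python
import torch

def build_optimizer(model, total_steps=500_000, peak_lr=5e-4, warmup=10_000):
    """Adam (eps=1e-6, beta2=0.98) with linear learning-rate decay.
    peak_lr and warmup are placeholder values."""
    opt = torch.optim.Adam(model.parameters(), lr=peak_lr, betas=(0.9, 0.98), eps=1e-6)

    def lr_lambda(step):
        if step < warmup:                              # linear warm-up
            return step / max(1, warmup)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup))

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched

def dropout_at(step):
    """Staged dropout schedule described above: 0.1, then 0.05 at 250K steps, then 0 at 400K."""
    return 0.1 if step < 250_000 else (0.05 if step < 400_000 else 0.0)
```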

Reproducibility One potential issue of the
proposed approach is the replicability problem,
due to the requirement of massive monolingual
corpora and computational resources, together with
fine-grained selection of hyper-parameters during
pre-training. Re-training the system is likely to
yield slightly different fine-tuning performance.
To address this, we will release the pre-trained
checkpoints as well as the code with full
instructions for pre-training a new model.

Related Work: XLM(-R) and MASS There
are several closely related approaches to
multilingual pre-training for machine translation.
XLM (Lample and Conneau, 2019) and XLM-R
(Conneau et al., 2019) pre-train BERT-style models (Devlin
et al., 2019; Liu et al., 2019) in a multilingual
fashion, and the resulting parameters can be used to
initialize the translation model encoder. Different
from XLM(-R), mBART simultaneously pre-
trains the encoder and the decoder due to the

Seq2Seq setup, which is more natural to adapt to
machine translation applications.

Similar to mBART, MASS (Song et al., 2019)
is also a Seq2Seq-based pre-training technique
with ‘‘word-masking’’. However, the decoder of
MASS only predicts the tokens that were masked in
the encoder, whereas mBART reconstructs the full
target sequence, which allows us to apply not only
‘‘masking’’ but any possible noise function.

Furthermore, neither XLM nor MASS showed
evidence of the pre-trained models improving
translation performance beyond two languages.

2.3 Pre-trained Models

To better measure the effects of different levels
of multilinguality during pre-training, we built a
range of models as follows:

• mBART25 We pre-train a model on all
25 languages, using the setting described in
§2.2.

• mBART06 To explore the effect of pre-
training on related languages, we pretrain a
model on a subset of six European languages:
Ro, It, Cs, Fr, Es, and En. For a fair
comparison, we use ∼ 1/4 of the mBART25
batch size, which allows our model to have
the same number of updates per language
during pre-training.

• mBART02 We pre-train bilingual models,
using English and one other language, for
four language pairs: En-De, En-Ro, En-It,
and En-My. We use a batch size of ∼ 1/12 of
that in mBART25.

• BART-En/Ro To help establish a better
understanding of multilingual pre-training,
we also train monolingual BART models on
the En and Ro corpora only, respectively.

• Random As additional baselines, we will
also include a comparison with a model
randomly initialized without pre-training for
each translation task. Because the sizes
of different downstream datasets vary, we
always grid-search the hyper-parameters
(architecture, dropout, etc.) to find the best
non-pretrained configuration.

Figure 2: Framework for our multilingual denoising pre-training (left) and fine-tuning on downstream
MT tasks (right), where we use (1) sentence permutation and (2) word-span masking as the injected
noise. A special language id token is added at both the encoder and decoder. One multilingual pre-trained
model is used for all tasks.

All models use the same vocabulary (§2.1). Not
all tokens will frequently occur in all pre-training
corpora, but later experiments show that this
large vocabulary can improve generalization in
multilingual settings even for unseen languages.

2.4 Scaling-up Matters

Scaling-up the training data and model parameters
has been a key factor in pre-training (Devlin
et al., 2019; Conneau et al., 2019; Raffel et al.,
2019). Compared to conventional semi-supervised
methods (e.g., back-translation) and other pre-
training for MT (Lample and Conneau, 2019;
Song et al., 2019), we pre-train mBART on much
more monolingual data with a relatively deeper
architecture. This scale, in combination with the
new multi-lingual training, is central to our results
(sections 3 to 5), although future work could
more carefully study the relative contributions
of each.

3 Sentence-level Machine Translation

This section shows that mBART pre-training
provides consistent performance gains in low
to medium resource sentence-level MT settings,
including bi-text only and with back-translation,
and outperforms other existing pre-training
schemes (§3.2). We also present a detailed analysis
to understand better which factors contribute
the most to these gains (§3.3), and show that
pre-training can even improve performance for
languages not present in the pre-training data
(§3.4).

3.1 Experimental Settings

Datasets We gather 24 pairs of publicly avail-
able parallel corpora that cover all the languages
in CC25 (Figure 1). Most pairs are from previous
WMT (Gu, Kk, Tr, Ro, Et, Lt, Fi, Lv, Cs, Es, Zh,
De, Ru, Fr ↔ En) and IWSLT (Vi, Ja, Ko, Nl,
Ar, It ↔ En) competitions. We also use FLoRes
pairs (Guzmán et al., 2019, En-Ne and En-Si),
En-Hi from IITB (Kunchukuttan et al., 2017), and
En-My from WAT19 (Ding et al., 2018, 2019).
We divide the datasets into three categories—low
resource (<1M sentence pairs), medium resource (>1M and <10M), and high resource (>10M).

Fine-tuning & Decoding We fine-tune mBART
on a single pair of bi-text data, feeding the
source language into the encoder and decod-
ing the target language. As shown in Figure 2,
we load the pre-trained weights and train the MT
model on bi-texts with teacher forcing. For all
directions, we train with 0.3 dropout, 0.2 label
smoothing, 2500 warm-up steps, 3e−5 maximum
learning rate. We use a maximum of 40K training
updates for all low and medium resource pairs and
100K for high resource pairs. The final models are
selected based on validation likelihood. We use
beam-search with beam size 5 for decoding. Our
initial experiments indicate that the fine-tuning
process is generally stable with different seeds.
Therefore, to reduce the total computation, all
our results are reported from a single run. We
validate the statistical significance with scripts
from the mosesdecoder.3
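As a concrete illustration of this fine-tuning recipe (a sketch, not the actual fairseq invocation), the snippet below assembles the stated hyper-parameters; `model`, `batches`, and `pad_id` are placeholders, and the model itself is assumed to have been built with 0.3 dropout.

```python
import torch

def fine_tune(model, batches, pad_id, high_resource=False):
    """Sketch of the fine-tuning setup: label smoothing 0.2, 2,500 warm-up steps,
    3e-5 peak learning rate, 40K (or 100K) updates, teacher forcing."""
    max_updates = 100_000 if high_resource else 40_000
    warmup, max_lr = 2500, 3e-5
    criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.2, ignore_index=pad_id)
    optimizer = torch.optim.Adam(model.parameters(), lr=max_lr)
    for step, (src, tgt_in, tgt_out) in enumerate(batches):
        if step >= max_updates:
            break
        for g in optimizer.param_groups:          # linear warm-up to the 3e-5 peak
            g["lr"] = max_lr * min(1.0, (step + 1) / warmup)
        logits = model(src, tgt_in)               # teacher forcing: gold prefix as decoder input
        loss = criterion(logits.view(-1, logits.size(-1)), tgt_out.view(-1))
        optimizer.zero_grad(); loss.backward(); optimizer.step()
    return model
```

Decoding then uses beam search with beam size 5, and the checkpoint with the best validation likelihood is kept.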

3https://github.com/moses-smt/mosesdecoder
/b l o b/master/scripts/a n a lysis/bootstrap
-hypothesis-difference-significance.pl.
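The Moses script cited above implements paired bootstrap resampling. A simplified stand-in (not the Moses script itself) can be sketched with sacrebleu as follows:

```python
import random
import sacrebleu

def paired_bootstrap(sys_a, sys_b, refs, n_samples=1000, seed=0):
    """Estimate how often system A beats system B in BLEU over resampled test sets."""
    rng = random.Random(seed)
    idx = list(range(len(refs)))
    wins = 0
    for _ in range(n_samples):
        sample = [rng.choice(idx) for _ in idx]   # resample sentences with replacement
        a = sacrebleu.corpus_bleu([sys_a[i] for i in sample], [[refs[i] for i in sample]]).score
        b = sacrebleu.corpus_bleu([sys_b[i] for i in sample], [[refs[i] for i in sample]]).score
        wins += a > b
    return wins / n_samples   # fraction of samples where A wins; 1 minus this approximates a p-value
```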


Languages     En-Gu       En-Kk       En-Vi       En-Tr       En-Ja       En-Ko
Data Source   WMT19       WMT19       IWSLT15     WMT17       IWSLT17     IWSLT17
Size          10K         91K         133K        207K        223K        230K
Direction     ←     →     ←     →     ←     →     ←     →     ←     →     ←     →
Random        0.0   0.0   0.8   0.2   23.6  24.8  12.2  9.5   10.4  12.3  15.3  16.3
mBART25       0.3   0.1   7.4   2.5   36.1  35.4  22.5  17.8  19.1  19.4  24.6  22.6

Languages     En-Nl       En-Ar       En-It       En-My       En-Ne       En-Ro
Data Source   IWSLT17     IWSLT17     IWSLT17     WAT19       FLoRes      WMT16
Size          237K        250K        250K        259K        564K        608K
Direction     ←     →     ←     →     ←     →     ←     →     ←     →     ←     →
Random        34.6  29.3  27.5  16.9  31.7  28.0  23.3  34.9  7.6   4.3   34.0  34.3
mBART25       43.3  34.8  37.6  21.6  39.8  34.0  28.3  36.9  14.5  7.4   37.8  37.7

Languages     En-Si       En-Hi       En-Et       En-Lt       En-Fi       En-Lv
Data Source   FLoRes      IITB        WMT18       WMT19       WMT17       WMT17
Size          647K        1.56M       1.94M       2.11M       2.66M       4.50M
Direction     ←     →     ←     →     ←     →     ←     →     ←     →     ←     →
Random        7.2   1.2   10.9  14.2  22.6  17.9  18.1  12.1  21.8  20.2  15.6  12.9
mBART25       13.7  3.3   23.5  20.8  27.8  21.4  22.4  15.3  28.5  22.4  19.3  15.9

Table 1: Low/medium resource machine translation. Pre-training consistently improves over a
randomly initialized baseline, with particularly large gains on low resource language pairs (e.g.,
Vi-En).

3.2 Main Results

As shown in Table 1, initializing with the pre-
trained mBART25 weights shows gains on all the
low and medium resource pairs when compared
with randomly initialized baselines. We observe
gains of 12 or more BLEU points on low
resource pairs such as En-Vi, En-Tr, and noisily
aligned pairs like En-Hi. Fine-tuning still fails
in extremely low-resource cases such as En-Gu,
which have ∼10k examples. In these settings,
unsupervised translation is more appropriate,
see §5.2. For high resource cases (Table 2),
we do not observe consistent gains, and pre-
training slightly hurts performance when more
than 25M parallel sentences are available. When
a significant amount of bi-text data is given, we
suspect that supervised training washes out the
pre-trained weights.

Note that some runs of our baseline
systems, using vanilla Transformers with
randomly initialized weights, have considerable
gaps compared with the SoTA systems
reported in the original competitions.4 The differ-
ence is mainly because we train and search

4http://matrix.statmt.org/.

Languages   Cs     Es     Zh     De     Ru     Fr
Size        11M    15M    25M    28M    29M    41M

Random      16.5   33.2   35.0   30.9   31.5   41.4
mBART25     18.0   34.0   33.3   30.5   31.3   41.0

Table 2: High resource machine translation,
where all the datasets are from their latest WMT
competitions. We only evaluate our models on
En-X translation.

the hyper-parameters for baselines on officially
provided bitext only without using any mono-
lingual corpus or multilingual adaptation. For
instance, the SoTA score for En→Gu is 28.2
in WMT19, compared with 0 in Table 1. This is
mainly because the quality of the original bitext
data is low, and the SoTA systems commonly
used additional languages such as Hi to boost
the performance. Similar gaps can also be
observed in pairs such as Kk-En and Lt-En,
where Ru as an additional language is also
crucial. The main purpose of this part is to
discuss the effects of multilingual pre-training in a
constrained bitext setting for a better comparison.
We will include more discussions of combining


Figure 3: Pre-training + back translation on FLoRes with two iterations of BT.

multilingual translation with pretraining in future
work.

Plus Back-Translation Back-translation (BT;
Sennrich et al., 2016) is a standard approach
to augment bi-text with target-side monolingual
data. We combine our pre-training with BT and
test it on low resource language pairs—En-Si
and En-Ne—using the FLoRes dataset (Guzmán
et al., 2019). We use the same monolingual data
as Guzmán et al. (2019) to generate BT data.
Figure 3 shows that initializing the model with
our mBART25 pre-trained parameters improves
BLEU scores at each iteration of back translation,
resulting in new state-of-the-art results in all four
translation directions. This indicates that the pre-
trained mBART weights can be directly plugged
into existing pipelines that use BT.
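A minimal sketch of one back-translation iteration on top of a pre-trained initialization is shown below; `translate`, `fine_tune`, and the corpus variables are placeholders for whatever MT toolkit is used, not functions from this paper's codebase.

```python
def back_translation_round(fwd_model, bwd_model, mono_tgt, bitext, fine_tune, translate):
    """One BT iteration: translate target-side monolingual data with the reverse
    model to obtain synthetic sources, then fine-tune the forward model on
    real + synthetic pairs. All callables are placeholders."""
    synthetic_src = [translate(bwd_model, t) for t in mono_tgt]   # back-translate monolingual target text
    augmented = bitext + list(zip(synthetic_src, mono_tgt))       # real bi-text + synthetic pairs
    return fine_tune(fwd_model, augmented)                        # improved forward model

# Iterating this procedure gives the two BT iterations reported in Figure 3.
```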

Compared with Other Pre-training
Approaches We also compare our pre-trained
models with recent self-supervised pre-training
methods, as shown in Table 3. We consider En-Ro
translation, the only pair with established results.
Our mBART model outperforms all the other
pre-trained models, both with and without BT
augmentation. We also show comparisons with
the conventional BART model trained on the same
En and Ro data only. Both have improvements
over baselines, although worse than mBART
results, indicating that pre-training in a multilin-
gual setting is essential. Moreover, combining
BT leads to additional gains, resulting in a new
state-of-the-art for Ro-En translation.

3.3 Analysis

We also present additional analyses, to better
quantify when our pre-training helps.

How many languages should you pre-train on?
We investigate when it is helpful for pre-training to
include languages other than the targeted language
pair that will be used during fine tuning. Table 4


Model           Pre-training Data   En→Ro   Ro→En   +BT
Random          None                34.3    34.0    36.8
XLM (2019)      En Ro               —       35.6    38.5
MASS (2019)     En Ro               —       —       39.1
BART (2019)     En                  —       —       38.0
XLM-R (2019)    CC100               35.6    35.8    —
BART-En         En                  36.0    35.8    37.4
BART-Ro         Ro                  37.6    36.8    38.1
mBART02         En Ro               38.5    38.5    39.9
mBART25         CC25                37.7    37.8    38.8

Table 3: Comparison with other pre-training
approaches on WMT16 Ro-En.

Languages   De     Ro     It     My     En
Size/GB     66.6   61.4   30.2   1.6    300.8
mBART02     31.3   38.5   39.7   36.5
mBART06     —      38.5   39.3   —
mBART25     30.5   37.7   39.8   36.9

Table 4: Pre-training languages on En-X trans-
lation. The size refers to the size of monolingual
data for X. The size of En is shown as reference.
All the pretrained models were controlled to see
the same number of English instances during
training.

shows performance on four X-En pairs. Pre-
training on more languages helps most when the
target language monolingual data is limited (e.g.,
En-My, where the size of My is around 0.5%
of En).

In contrast, when monolingual data is plentiful
(De, Ro), pre-training on multiple languages
slightly hurts the final results (<1 BLEU). In these cases, additional languages may reduce the capacity available for each test language. Additionally, the fact that mBART06 performs similarly to mBART02 on Ro-En suggests that pre-training with similar languages is particularly helpful.

Generation continues until <LID> is predicted. We use beam size 5 by default.

4 Document-level Machine Translation

We evaluate mBART on document-level machine
translation tasks, where the goal is to translate
segments of text that contain more than one
sentence (up to an entire document). During pre-
training, we use document fragments of up to 512
tokens, allowing the models to learn dependencies
between sentences. We show that this pre-
training significantly improves document-level
translation.

4.1 Experimental Settings

Datasets We evaluate performance on two
common document-level MT datasets: WMT19
En-De and TED15 Zh-En. For En-De, we use the
document data from WMT19 to train our model,
without any additional sentence-level data. The
Zh-En dataset is from IWSLT 2014 and 2015
(Cettolo et al., 2012, 2015). Following Miculicich
et al. (2018), we use 2010-2013 TED as the test
set.

Pre-processing We pre-process with the ap-
proach used in pre-training. For each block, sen-
tences are separated by end of sentence symbols
(</S>) and the entire instance is ended with
the specific language id (<LID>). On average,
documents are split into 2–4 instances.
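Reusing the hypothetical `pack_instances` helper sketched in §2.2, this pre-processing step amounts to something like the following (the language code string is illustrative):

```python
# Split one document into pre-training-style instances of at most 512 tokens;
# on average a document yields 2-4 such instances.
doc_instances = pack_instances(document_sentences, lang_id="[de_DE]", max_len=512)
```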

Baselines & Evaluation We train 4 models: a
document-level (Doc-) MT model (§4.1) and a
corresponding sentence-level (Sent-) MT model
(§3.1) as the baseline, both with and without
pre-training. We use mBART25 as the common
pre-trained model for En-De and Zh-En. For
En-De, even though our mBART25 Doc-MT
model decodes multiple sentences together, the
translated sentences can be aligned to the source
sentences, which allows us to evaluate BLEU
scores both on sentence-level (s-BLEU) and
document-level (d-BLEU).5 For Zh-En, however,
we cannot produce the same number of translated
sentences as the reference due to alignment errors
in the test data. We only provide the d-BLEU
scores on this direction.

We also compare our models with Hierarchical
Attention Networks (HAN, Miculicich et al.,
2018) on Zh-En, which is the state-of-the-
art non-pretraining approach for document-level
translation for this pair. They combine two
layers of attention—first within and then across
sentences.

5Standard BLEU scores match n-grams at sentence-level.
We also consider a document-level variant where we match n-grams
over the whole document, resulting in a slightly higher score.
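To make the s-BLEU/d-BLEU distinction concrete, here is a small sketch using the sacrebleu package (not the evaluation code used for the paper): s-BLEU scores aligned sentences as usual, while d-BLEU concatenates each document's sentences before matching n-grams.

```python
import sacrebleu

def s_bleu(sys_sents, ref_sents):
    """Sentence-aligned corpus BLEU."""
    return sacrebleu.corpus_bleu(sys_sents, [ref_sents]).score

def d_bleu(sys_docs, ref_docs):
    """Document-level BLEU: join each document's sentences into one segment,
    so n-grams are matched over whole documents."""
    sys_joined = [" ".join(doc) for doc in sys_docs]
    ref_joined = [" ".join(doc) for doc in ref_docs]
    return sacrebleu.corpus_bleu(sys_joined, [ref_joined]).score
```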


(a) Sentence- and document-level BLEU scores on En-De

              Random              mBART25
Model         s-BLEU   d-BLEU     s-BLEU   d-BLEU
Sent-MT       34.5     35.9       36.4     38.0
Doc-MT        ×        7.7        37.1     38.5

(b) Document-level BLEU scores on Zh-En

              Random    mBART25    HAN (2018)
Model         d-BLEU    d-BLEU     d-BLEU
Sent-MT       22.0      28.4       —
Doc-MT        3.2       29.6       24.0

Table 6: Document-level machine translation on En-De and Zh-En. (×) The randomly initialized Doc-
MT model cannot produce translations aligned to the original sentences, so only document evaluation
is possible.

4.2 Main Results

Table 6 shows the main results for both En-De and
Zh-En at both sentence-level and document-level.

Random vs. Pre-trained The MT models
initialized with pre-trained weights outperform
randomly initialized models by large margins, for
both sentence-level and document-level training.
Our mBART25 models (both Sent-MT and Doc-
MT) also outperform HAN (Miculicich et al.,
2018),6 despite the fact that they are not
customized for document-level MT.

Sent-MT vs. Doc-MT For En-De and Zh-En,
the mBART25 Doc-MT models outperform
mBART25 fine-tuned at the sentence level by large
margins, reversing the trend seen for models
without pre-training. For both datasets, randomly
initialized Doc-MT fails to work, resulting
in much worse results than the sentence-
level models. Such large performance gaps
indicate that pre-training is critical for document
level performance. It is in general difficult to
collect high-quality document-level data in large
quantities, suggesting that pre-training may be a
strong strategy for future work. We also include a
sampled example in Figure 6.

5 Unsupervised Machine Translation

In addition to supervised machine translation, we
also evaluate our model on tasks where no bi-text
is available for the target language pair. We define
three types of unsupervised translation:

1. No bi-text of any kind. A common solution
is to learn from back-translation (Artetxe
et al., 2017; Lample et al., 2018c). We
show that mBART provides a simple and
effective initialization scheme for these
methods (§5.1).

6d-BLEU is recomputed from the provided system output.

2. No bi-text for the target pair, but both
languages appear in bi-text corpora with other
pairs. This setup is common for multilingual
MT systems (Johnson et al., 2017; Gu et al.,
2019). In this paper, we limit our focus to
building models for single language pairs,
and leave discussions for multilingual MT to
future work.

3. No bi-text for the target pair is available, but
there is bi-text for translating from some other
language into the target language. mBART
supports effective transfer, even if the source
language has no bi-text of any form (§5.2).

5.1 Unsupervised Machine Translation via Back-Translation

Datasets We evaluate our pre-trained models
on En-De, En-Ne, and En-Si. En and De are both
European languages sharing many sub-words,
whereas Ne and Si are quite distinct from En. We
use the same test sets as supervised benchmarks
§3.1, and use the same pre-training data (CC25)
for back-translation to avoid introducing new
information.

Learning Following Lample
and Conneau
(XLM, 2019), we initialize the translation model
with the mBART weights, and then learn to
predict the monolingual sentences conditioned
on source sentences generated by on-the-fly BT.
Furthermore, we constrain mBART to only gen-
erate tokens in the target language7 for the first
1000 steps of on-the-fly BT, to avoid it simply
copying the source text.
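A minimal sketch of this target-language constraint (footnote 7) is shown below: token ids whose relative frequency in the target monolingual corpus falls below a threshold are masked out of the output distribution. The threshold interpretation of the 1% criterion and the tensor shapes are assumptions made for illustration.

```python
import torch

def build_language_mask(token_counts, vocab_size, min_freq=0.01):
    """Allow only tokens whose relative frequency in the target monolingual
    corpus reaches min_freq (one reading of the 1% criterion)."""
    total = sum(token_counts.values())
    allowed = torch.zeros(vocab_size, dtype=torch.bool)
    for tok_id, count in token_counts.items():
        if count / total >= min_freq:
            allowed[tok_id] = True
    return allowed

def constrain_logits(logits, allowed):
    """Set logits of disallowed tokens to -inf before beam search or sampling.
    logits has shape (..., vocab_size)."""
    return logits.masked_fill(~allowed, float("-inf"))
```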

Results Table 7 shows the unsupervised trans-
lation results compared with non-pretrained mod-
els, as well as models with existing pre-training
methods. Our models achieve large gains over
non-pretrained models for all directions, and
outperform XLM significantly for dissimilar pairs
(En-Ne, En-Si), where the existing approaches
completely fail. For En-De, our model also
performs comparably against XLM and MASS.

7We mask out the output probability of predicted tokens
which appear less than 1% in the target monolingual corpus.


Figure 6: An example of document-level translation from mBART25 Sent-MT and Doc-MT, held out
from the test set of TED15 Zh-En. The Doc-MT system produces a much more fluent and coherent
translation, which is closer to the reference translation. For instance, the Doc-MT model produces
several instances of ‘‘And’’ to connect sentences so that the text reads better, while the Sent-MT model
lacks global context and produces sentences independently. Additionally, both systems produce much
better translations than models without pre-training, where the non-pretrained Doc-MT model
completely fails to produce readable output.


              En-De         En-Ne         En-Si
              ←      →      ←      →      ←      →
Random        21.0   17.2   0.0    0.0    0.0    0.0
XLM (2019)    34.3   26.4   0.5    0.1    0.1    0.1
MASS (2019)   35.2   28.3   —      —      —      —
mBART         34.0   29.8   10.0   4.4    8.2    3.9

Table 7: Unsupervised MT via BT between
dissimilar languages.



5.2 Unsupervised Machine Translation via Language Transfer

We also report results when the target language
appears in a bi-text with some other source
language.

Datasets We only consider X→En translation,
and choose the bitexts of 12 language pairs from
§3.1, covering Indic languages (Ne, Hi, Si, Gu),
European languages (Ro, It, Cs, Nl), East Asian
languages (Zh, Ja, Ko), and Arabic (Ar).

Results The pre-trained mBART25 model is
fine-tuned on each language pair, and then
evaluated on the rest of pairs, as seen in Table 8.
We also present the direct fine-tuning performance
(§3) on the diagonal, for reference. We see transfer
for all pairs with all fine-tuned models except
Gu-En, where the supervised model completely
fails (0.3 BLEU). In some cases, we can achieve
similar (Cs-En) or even much better (Ne-En,
Gu-En) results compared with the supervised
results. We also show an example of language
transfer in Figure 7.

As a comparison, we also apply the same proce-
dure on randomly initialized models without pre-
training, which always ends up with ≈ 0 BLEU.
This indicates that multilingual pre-training is
essential and produces universal representations
across languages, so that once the model learns to
translate one language to En, it learns to translate
all languages with similar representations.

When is language transfer useful? Table 8
also shows that the size of the transfer effect varies
with the similarity of the languages involved. First, for
most pairs, language transfer works better when
fine-tuning is also conducted in the same language
family, especially between Indic languages (Hi,
Ne, Gu). However, significant vocabulary sharing
is not required for effective transfer. For instance,
Zh-En and It-En achieve the best transfer learning
results on Ko-En and Ar-En, respectively. This is
despite the low vocabulary overlap (even character
overlap) between (Zh, Ko) and (It, Ar).

With BT We present a comparison of unsuper-
vised MT with BT vs. language transfer in Table 9
where language transfer works better when there
exists a closely related language to transfer from.
Moreover, we show promising results for
combining these two techniques. We start from the
best transferred model and apply (iterative) BT on
the same monolingual corpus used in pre-training.
Table 9 presents the results with 1 iteration of BT.
We see improvements for all pairs. The complete
analysis of both methods is left as future work.

6 Related Work

Self-supervised Learning for Text Generation
This work inherits from the recent success brought
by pre-training for NLP applications (Peters et al.,
2018; Radford et al., 2018; Devlin et al., 2019;
Yang et al., 2019b; Liu et al., 2019), especially
for text generation (Radford et al., 2019; Song
et al., 2019; Dong et al., 2019; Raffel et al.,
2019; Lewis et al., 2019). The pre-trained models
are usually used as the initialization for fine-
tuning downstream tasks such as controllable
language modeling (Shirish Keskar et al., 2019),
summarization (Song et al., 2019; Liu and Lapata,
2019) and dialogue generation (Zhang et al.,
2019).

Specifically for machine translation, unsuper-
vised pre-training methods were also explored
to improve the performance. Qi et al. (2018)
investigated the application of pre-trained word
embeddings for MT; Ramachandran et al. (2017)
proposed to pre-train the encoder-decoder mod-
ules as two separate language models. Yang et al.
(2019a); Zhu et al. (2020) explored fusion ap-
proaches to incorporate the pre-trained BERT
weights to improve NMT training. In contrast
to most prior work, we focus on pre-training one
denoising autoencoder, and adapt the weights of
the entire model for various MT applications.


                        Fine-tuning Languages
Testing Languages   Zh     Ja     Ko     Cs     Ro     Nl     It     Ar     Hi     Ne     Si     Gu
Domain              News   TED    TED    News   News   TED    TED    TED    News   Wiki   Wiki   Wiki
Zh                  23.7   8.8    9.2    2.8    7.8    7.0    6.8    6.2    7.2    4.2    5.9    0.0
Ja                  9.9    19.1   12.2   0.9    4.8    6.4    5.1    5.6    4.7    4.2    6.5    0.0
Ko                  5.8    16.9   24.6   5.7    8.5    9.5    9.1    8.7    9.6    8.8    11.1   0.0
Cs                  9.3    15.1   17.2   21.6   19.5   17.0   16.7   16.9   13.2   15.1   16.4   0.0
Ro                  16.2   18.7   17.9   23.0   37.8   22.3   21.6   22.6   16.4   18.5   22.1   0.0
Nl                  14.4   30.4   32.3   21.2   27.0   43.3   34.1   31.0   24.6   23.3   27.3   0.0
It                  16.9   25.8   27.8   17.1   23.4   30.2   39.8   30.6   20.1   18.5   23.2   0.0
Ar                  5.8    15.5   12.8   12.7   12.0   14.7   14.7   37.6   11.6   13.0   16.7   0.0
Hi                  3.2    10.1   9.9    5.8    6.7    6.1    5.0    7.6    23.5   14.5   13.0   0.0
Ne                  2.1    6.7    6.5    5.0    4.3    3.0    2.2    5.2    17.9   14.5   10.8   0.0
Si                  5.0    5.7    3.8    3.8    1.3    0.9    0.5    3.5    8.1    8.9    13.7   0.0
Gu                  8.2    8.5    4.7    5.4    3.5    2.1    0.0    6.2    13.8   13.5   12.8   0.3

Table 8: Unsupervised MT via language transfer on X-En translations. The model fine-tuned on one
language pair (columns) is directly tested on another (rows). Diagonal entries correspond to direct
fine-tuning (§3); in the original table, shading additionally marks transfer within similar language
groups and the highest transfer score for each pair is in bold.


Figure 7: An example of unsupervised MT via language transfer. mBART models fine-tuned on Ko or
Zh are able to translate a Ja sentence into En almost as correctly as in the supervised case.

Source   Online BT   Transfer       Combined
Ro       30.5        23.0 (Cs)      33.9
Ne       10.0        17.9 (Hi)      22.1
Zh       11.3        9.2 (Ko)       15.0
Nl       28.5        34.1 (It)      35.4

Table 9: BT vs. language transfer for unsupervised
MT on X-En translations. For language transfer,
we present the best transfer scores together
with the language transferred from.

Multilinguality in NLP tasks This work is also
related to the continual
trend of multilingual
language learning, including aligning multilingual
word embeddings (Mikolov et al., 2013; Chen
and Cardie, 2018; Lample et al., 2018b) into a
universal space, and learning cross-lingual models
(Wada and Iwata, 2018; Lample and Conneau,
2019; Conneau et al., 2019) to exploit shared
representations across languages.

For MT, the most relevant field is multilingual
translation (Firat et al., 2016; Johnson et al.,
2017; Aharoni et al., 2019; Arivazhagan et al.,
2019), where the ultimate goal is to jointly train
one translation model that translates multiple
language directions at the same time, and
shares representations to improve the translation
performance on low-resource languages (Gu et al.,
2018). In this paper, we focus on multilingualism
in the pre-training stage and fine-tune the
learned model in the standard bilingual scenario.

Compared with multilingual translation, we do
not require parallel data across multiple languages,
only for the targeted direction, which improves the
scalability to low-resource languages and specific
domains.

Document Translation As one of the key
applications, our work is also related to previous
efforts for incorporating document-level context
into neural machine translation (Wang et al.,
2017; Jean et al., 2017; Tiedemann and Scherrer,
2017; Miculicich et al., 2018; Tu et al., 2018).
Li et al. (2019) is the most relevant work
that also utilized a pre-trained encoder (BERT)
for handling longer context. However, the
focus has been on designing new task-specific
techniques, and doing sentence-level translation
with a wider input context. To the best of our
knowledge, our multilingual pre-trained model
is the first that shows improved results on
document-level translation with standard Seq2Seq
models.

Unsupervised Translation This work also
summarizes the previous efforts of learning to
translate between languages without a direct
parallel corpus. When no parallel data of any
kind is available, Artetxe et al. (2017) and
Lample et al. (2018a) proposed to jointly learn
denoising auto-encoding and back-translation from
both directions, which, however, required good
initialization and only worked well on similar
language pairs. Wu et al. (2019) solve the
problem by mining sentences from Wikipedia
and using them as weakly supervised translation
pairs. Similar to Lample and Conneau (2019) and
Song et al. (2019), we follow the first approach
and treat our pre-trained model as the initialization
step. We also investigate unsupervised translation
using language transfer, which is similar to
Pourdamghani et al. (2019), where the authors
generate translationese of the source language
and train a system on high-resource languages
to correct these intermediate utterances. It is
also closely related to Conneau et al. (2018)
and Artetxe et al. (2019) for cross-lingual
representation learning, where we also show that
the representation learned by mBART can be easily
transferred between languages without supervised
data.


7 Conclusion

We demonstrate that multilingual de-noising pre-
training is able to significantly improve both
supervised and unsupervised machine translation
at both the sentence level and document level.
We analyze when and how pre-training is most
effective and can be combined with other
approaches such as back-translation. Our results
also show the transfer learning ability of the
representations learned by multilingual
pre-training.

In future work, we will scale up the current
pre-training to more languages, for example, an
mBART100 model. The size of our model makes
it expensive to deploy in production—future work
will explore pre-training more efficient models.

Acknowledgments

We thank Marc’Aurelio Ranzato, Guillaume
Lample, Alexis Conneau, and Michael Auli
for sharing their expertise on low-resource and
unsupervised machine translation and Peng-Jen
Chen and Jiajun Shen for details about FloRes and
WAT datasets. We also thank our colleagues at
FAIR and FAIAR for valuable feedback.

References

Roee Aharoni, Melvin Johnson, and Orhan
Firat. 2019. Massively multilingual neural
machine translation. In Proceedings of the 2019
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, Volume 1
(Long and Short Papers), pages 3874–3884.
Minneapolis, Minnesota. Association for Computational Linguistics.

Naveen Arivazhagan, Ankur Bapna, Orhan Firat,
Dmitry Lepikhin, Melvin Johnson, Maxim
Krikun, Mia Xu Chen, Yuan Cao, George
Foster, Colin Cherry, Wolfgang Macherey,
Zhifeng Chen, and Yonghui Wu. 2019.
Massively multilingual neural machine trans-
lation in the wild: Findings and challenges.
CoRR, abs/1907.05019.

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and
Kyunghyun Cho. 2017. Unsupervised neural
machine translation. arXiv preprint arXiv:1710.11041.
DOI: https://doi.org/10.18653/v1/D18-1399


Mikel Artetxe, Sebastian Ruder, and Dani
Yogatama. 2019. On the cross-lingual trans-
ferability of monolingual representations. arXiv
preprint arXiv:1910.11856. DOI:
https://doi.org/10.18653/v1/2020.acl-main.421

Mauro Cettolo, Christian Girardi, and Marcello
Federico. 2012. Wit3: Web inventory of
transcribed and translated talks. In Conference
of the European Association for Machine
Translation, pages 261–268.

Mauro Cettolo, Niehues Jan, Stüker Sebas-
tian, Luisa Bentivogli, Roldano Cattoni, and
Marcello Federico. 2015. The IWSLT 2015
evaluation campaign. In International Work-
shop on Spoken Language Translation.

Xilun Chen and Claire Cardie. 2018. Unsuper-
vised multilingual word embeddings. In Pro-
ceedings of the 2018 Conference on Empirical
Methods in Natural Language Processing,
pages 261–270. Brussels, Belgium. Association
for Computational Linguistics. DOI: https://
doi.org/10.18653/v1/D18-1024

Alexis Conneau, Kartikay Khandelwal, Naman
Goyal, Vishrav Chaudhary, Guillaume Wenzek,
Francisco Guzm´an, Edouard Grave, Myle Ott,
Luke Zettlemoyer, and Veselin Stoyanov.
2019. Unsupervised cross-lingual representa-
tion learning at scale. arXiv preprint arXiv:
1911.02116. DOI:
https://doi.org/10.18653/v1/2020.acl-main.747

Alexis Conneau, Ruty Rinott, Guillaume
Lample, Adina Williams, Samuel R. Bowman,
Holger Schwenk, and Veselin Stoyanov.
2018. XNLI: Evaluating cross-lingual sentence
representations. In Proceedings of the 2018
Conference on Empirical Methods in Natu-
ral Language Processing. Association for Com-
putational Linguistics. DOI: https://doi
.org/10.18653/v1/D18-1269

Chenchen Ding, Hnin Thu Zar Aye, Win Pa Pa,
Khin Thandar Nwet, Khin Mar Soe, Masao
Utiyama, and Eiichiro Sumita. 2019. Towards
Burmese (Myanmar) morphological analysis:
Syllable-based tokenization and part-of-speech
tagging. ACM Transactions on Asian and
Low-Resource Language Information Process-
ing (TALLIP), 19(1):5. DOI: https://doi
.org/10.1145/3325885

Chenchen Ding, Masao Utiyama, and Eiichiro
Sumita. 2018. NOVA: A feasible and flexible
annotation system for joint tokenization and
part-of-speech tagging. ACM Transactions on
Asian and Low-Resource Language Informa-
tion Processing (TALLIP), 18(2):17. DOI:
https://doi.org/10.1145/3276773

Li Dong, Nan Yang, Wenhui Wang, Furu
Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao,
Ming Zhou, and Hsiao-Wuen Hon. 2019.
Unified language model pre-training for natural
language understanding and generation. arXiv
preprint arXiv:1905.03197.

Sergey Edunov, Alexei Baevski, and Michael
Auli. 2019. Pre-trained language model repre-
sentations for language generation. arXiv preprint
arXiv:1903.09722. DOI: https://doi.org
/10.18653/v1/N19-1409

Orhan Firat, Kyunghyun Cho, and Yoshua
Bengio. 2016. Multi-way, multilingual neural
machine translation with a shared attention
mechanism. In NAACL. DOI:
https://doi.org/10.18653/v1/N16-1101

Jiatao Gu, Hany Hassan, Jacob Devlin, and
Victor O. K. Li. 2018. Universal neural
machine translation for extremely low resource
languages. In Proceedings of the 2018
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, Volume 1
(Long Papers), pages 344–354. New Orleans,
Louisiana. Association for Computational
Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In North American Association
for Computational Linguistics (NAACL).


Jiatao Gu, Yong Wang, Kyunghyun Cho, and
Victor O. K. Li. 2019. Improved zero-shot neu-
ral machine translation via ignoring spurious
correlations. arXiv preprint arXiv:1906.01181.

Francisco Guzmán, Peng-Jen Chen, Myle
Ott, Juan Pino, Guillaume Lample, Philipp
Koehn, Vishrav Chaudhary, and Marc’Aurelio


Ranzato. 2019. The FLORES evaluation
datasets for low-resource machine translation:
Nepali–English and Sinhala–English. In Pro-
ceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and
the 9th International Joint Conference on Nat-
ural Language Processing (EMNLP-IJCNLP),
pages 6097–6110. Hong Kong, China. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D19
-1632

S´ebastien Jean, Stanislas Lauly, Orhan Firat, and
Kyunghyun Cho. 2017. Does neural machine
translation benefit from larger context? CoRR,
abs/1704.05135.

Melvin Johnson, Mike Schuster, Quoc V.
Le, Maxim Krikun, Yonghui Wu, Zhifeng
Chen, Nikhil Thorat, Fernanda Viégas, Martin
Wattenberg, Greg Corrado, Macduff Hughes,
and Jeffrey Dean. 2017. Google's multilingual
neural machine translation system: Enabling
zero-shot translation. Transactions of the
Association for Computational Linguistics,
5:339–351. DOI: https://doi.org/10.1162/tacl_a_00065

Taku Kudo and John Richardson. 2018. Sentence-
Piece: A simple and language independent
subword tokenizer and detokenizer for neural
text processing. In Proceedings of the 2018
Conference on Empirical Methods in Natural
Language Processing: System Demonstrations,
pages 66–71. Brussels, Belgium. Associa-
tion for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D18
-2012, PMID: 29382465

Anoop Kunchukuttan, Pratik Mehta, and Pushpak
Bhattacharyya. 2017. The IIT Bombay English-
Hindi parallel corpus. CoRR, abs/1710.02855.

Guillaume Lample and Alexis Conneau. 2019.
Cross-lingual language model pretraining.
arXiv preprint arXiv:1901.07291.

Guillaume Lample, Alexis Conneau, Ludovic
Denoyer, and Marc’Aurelio Ranzato. 2018a.
Unsupervised machine translation using mono-
lingual corpora only. In International Confer-
ence on Learning Representations.

Guillaume Lample, Alexis Conneau,
Marc’Aurelio Ranzato, Ludovic Denoyer,
and Hervé Jégou. 2018b. Word transla-
tion without parallel data. In International
Conference on Learning Representations.

Guillaume Lample, Myle Ott, Alexis Conneau,
Ludovic Denoyer, and Marc’Aurelio Ranzato.
2018c. Phrase-based & neural unsupervised
machine translation. arXiv preprint arXiv:
1804.07755. DOI: https://doi.org/10
.18653/v1/D18-1549

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan
Ghazvininejad, Abdelrahman Mohamed, Omer
Levy, Veselin Stoyanov, and Luke Zettlemoyer.
2019. BART: Denoising sequence-to-sequence
pre-training for natural language generation,
translation, and comprehension. arXiv preprint
arXiv:1910.13461. DOI: https://doi.org
/10.18653/v1/2020.acl-main.703

Liangyou Li, Xin Jiang, and Qun Liu. 2019.
Pretrained language models for document-
level neural machine translation. arXiv preprint
arXiv:1911.03110.

Yang Liu and Mirella Lapata. 2019. Text
summarization with pretrained encoders. arXiv
preprint arXiv:1908.08345. DOI: https://
doi.org/10.18653/v1/D19-1387

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019. RoBERTa: A robustly
optimized BERT pretraining approach. arXiv
preprint arXiv:1907.11692.

Lesly Miculicich, Dhananjay Ram, Nikolaos
Pappas, and James Henderson. 2018.
Document-level neural machine translation
with hierarchical attention networks. In Pro-
ceedings of the 2018 Conference on Empirical
Methods in Natural Language Processing,
pages 2947–2954. Brussels, Belgium. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D18-1325

Tomas Mikolov, Quoc V. Le, and Ilya Sutskever.
2013. Exploiting similarities among languages
for machine translation. CoRR, abs/1309.4168.

Myle Ott, Sergey Edunov, Alexei Baevski,
Angela Fan, Sam Gross, Nathan Ng, David
Grangier, and Michael Auli. 2019. FAIRSEQ: A


fast, extensible toolkit for sequence modeling.
In North American Association for
Computational Linguistics (NAACL): System
Demonstrations. DOI:
https://doi.org/10.18653/v1/N19-4009

Matthew Peters, Mark Neumann, Mohit Iyyer,
Matt Gardner, Christopher Clark, Kenton
Lee, and Luke Zettlemoyer. 2018. Deep
contextualized word representations. In North
American Association
for Computational
Linguistics (NAACL).

Telmo Pires, Eva Schlinger, and Dan Garrette.
2019. How multilingual
is multilingual
BERT? arXiv preprint arXiv:1906.01502. DOI:
https://doi.org/10.18653/v1/P19
-1493

Nima Pourdamghani, Nada Aldarrab, Marjan
Ghazvininejad, Kevin Knight, and Jonathan
May. 2019. Translating translationese: A two-
step approach to unsupervised machine trans-
lation. In ACL. DOI: https://doi.org
/10.18653/v1/P19-1293

Ye Qi, Devendra Singh Sachan, Matthieu Felix,
Sarguna Janani Padmanabhan, and Graham
Neubig. 2018. When and why are pre-
trained word embeddings useful for neural
machine translation? arXiv preprint arXiv:
1804.06323. DOI: https://doi.org/10
.18653/v1/N18-2084

Alec Radford, Karthik Narasimhan, Tim
Salimans, and Ilya Sutskever. 2018. Improving
language understanding with unsupervised
learning, OpenAI.

Alec Radford, Jeffrey Wu, Rewon Child, David
Luan, Dario Amodei, and Ilya Sutskever. 2019.
Language models are unsupervised multitask
learners, OpenAI.

Colin Raffel, Noam Shazeer, Adam Roberts,
Katherine Lee, Sharan Narang, Michael
Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.
2019. Exploring the limits of transfer learning
with a unified text-to-text transformer. arXiv
preprint arXiv:1910.10683.

Prajit Ramachandran, Peter J Liu, and Quoc Le.
2017. Unsupervised pretraining for sequence to
sequence learning. In Proceedings of the 2017

Conference on Empirical Methods in Natural
Language Processing, pages 383–391. DOI:
https://doi.org/10.18653/v1/D17
-1039

Rico Sennrich, Barry Haddow, and Alexandra
Birch. 2016. Improving neural machine trans-
lation models with monolingual data. In Pro-
ceedings of the 54th Annual Meeting of the
Association for Computational Linguistics
(Volume 1: Long Papers), pages 86–96.
Berlin, Germany. Association for Computa-
tional Linguistics. DOI: https://doi
.org/10.18653/v1/P16-1009

Nitish Shirish Keskar, Bryan McCann, Lav R.
Varshney, Caiming Xiong, and Richard Socher.
2019. CTRL: A conditional transformer language
model for controllable generation. arXiv
preprint arXiv:1909.05858.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu,
and Tie-Yan Liu. 2019. MASS: Masked
sequence to sequence pre-training for language
generation. In International Conference on
Machine Learning (ICML).

Jörg Tiedemann and Yves Scherrer. 2017.
Neural machine translation with extended
context. In Proceedings of the Third Work-
shop on Discourse in Machine Translation,
pages 82–92. Copenhagen, Denmark. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/W17
-4811

Zhaopeng Tu, Yang Liu, Shuming Shi, and
Tong Zhang. 2018. Learning to remember
translation history with a continuous cache.
Transactions of the Association for Com-
putational Linguistics, 6:407–420. DOI:
https://doi.org/10.1162/tacl_a_00029

Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need. In Advances in Neural
Information Processing Systems.

Takashi Wada and Tomoharu Iwata. 2018.
Unsupervised cross-lingual word embedding
by multilingual neural language models. CoRR,
abs/1809.02306. DOI: https://doi.org
/10.18653/v1/P19-1300


Longyue Wang, Zhaopeng Tu, Andy Way,
and Qun Liu. 2017. Exploiting cross-sentence
context for neural machine translation. In
Proceedings of the 2017 Conference on
Empirical Methods in Natural Language
Processing, pages 2826–2831. Copenhagen,
Denmark. Association for Computational
Linguistics. DOI: https://doi.org/10.18653/v1/D17-1301

Zihan Wang, Stephen Mayhew, Dan Roth,
and others. 2019. Cross-lingual ability of
multilingual bert: An empirical study. arXiv
preprint arXiv:1912.07840.

Guillaume Wenzek, Marie-Anne Lachaux, Alexis
Conneau, Vishrav Chaudhary, Francisco
Guzman, Armand Joulin, and Edouard Grave.
2019. CCNet: Extracting high quality
monolingual datasets from web crawl data.
arXiv preprint arXiv:1911.00359.

Lijun Wu, Jinhua Zhu, Di He, Fei Gao, Xu Tan,
Tao Qin, and Tie-Yan Liu. 2019. Machine
translation with weakly paired bilingual
documents.

Jiacheng Yang, Mingxuan Wang, Hao Zhou,
Chengqi Zhao, Yong Yu, Weinan Zhang, and
Lei Li. 2019a. Towards making the most of bert
in neural machine translation. arXiv preprint
arXiv:1908.05672.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime
Carbonell, Ruslan Salakhutdinov, and Quoc V.
Le. 2019b. XLNet: Generalized autoregressive
pretraining for language understanding. arXiv
preprint arXiv:1906.08237.

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun
Chen, Chris Brockett, Xiang Gao, Jianfeng
Gao, Jingjing Liu, and Bill Dolan. 2019.
DialoGPT: Large-scale generative pre-training
for conversational response generation.

Jinhua Zhu, Yingce Xia, Lijun Wu, Di He,
Tao Qin, Wengang Zhou, Houqiang Li, and
Tie-Yan Liu. 2020. Incorporating BERT into
neural machine translation. arXiv preprint
arXiv:2002.06823. DOI:
https://doi.org/10.18653/v1/2020.acl-demos.30
