Multilingual Denoising Pre-training for Neural Machine Translation
Yinhan Liu‡∗, Jiatao Gu†∗, Naman Goyal†∗, Xian Li†, Sergey Edunov†,
Marjan Ghazvininejad†, Mike Lewis†, and Luke Zettlemoyer‡
†Facebook AI
‡Birch Technology
†{jgu,naman,xianl,edunov,ghazvini,mikelewis,lsz}@fb.com
‡yinhan@birch.ai
Abstract
This paper demonstrates that multilingual
denoising pre-training produces significant
performance gains across a wide variety of
machine translation (MT) tasks. We present
mBART—a sequence-to-sequence denoising
auto-encoder pre-trained on large-scale mono-
lingual corpora in many languages using the
BART objective (Lewis et al., 2019). mBART
is the first method for pre-training a complete
sequence-to-sequence model by denoising full
texts in multiple languages, whereas previous
approaches have focused only on the encoder,
decoder, or reconstructing parts of the text.
Pre-training a complete model allows it to
be directly fine-tuned for supervised (both
sentence-level and document-level) and un-
supervised machine translation, with no task-
specific modifications. We demonstrate that
adding mBART initialization produces per-
formance gains in all but the highest-resource
settings, including up to 12 BLEU points for
low resource MT and over 5 BLEU points
for many document-level and unsupervised
models. We also show that it enables transfer
to language pairs with no bi-text or that were
not in the pre-training corpus, and present
extensive analysis of which factors contribute
the most to effective pre-training.1
1 Introduction
Despite its wide adoption for other NLP tasks
(Devlin et al., 2019; Liu et al., 2019; Yang et al.,
2019b; Lewis et al., 2019; Raffel et al., 2019),
* Equal contribution. Most of the work was done when
the first author worked at Facebook.
1Code and pre-trained models are available at https://github.com/pytorch/fairseq/tree/master/examples/mbart.
self-supervised pre-training is not yet common
practice in machine translation (MT). Existing
approaches (Lample and Conneau, 2019; Edunov
et al., 2019; Lewis et al., 2019; Raffel et al., 2019)
have been proposed either to partially pre-train
the model or to only focus on English corpora. In
this paper, we show that significant performance
gains are possible by pre-training a complete
autoregressive model with an objective that noises
and reconstructs full texts across many languages.
In this work, we present mBART—a multilin-
gual sequence-to-sequence (Seq2Seq) denoising
auto-encoder. mBART is trained by applying
the BART (Lewis et al., 2019) to large-scale
monolingual corpora across many languages. Der
input texts are noised by masking phrases and
permuting sentences, and a single Transformer
(Vaswani et al., 2017) model is learned to re-
cover the texts. Different from other pre-training
approaches for MT (Lample and Conneau,
2019; Song et al., 2019), mBART pre-trains a
complete autoregressive Seq2Seq model. mBART
is trained once for all languages, providing a
set of parameters that can be fine-tuned for any
of the language pairs in both supervised and
unsupervised settings, without any task-specific or
language-specific modifications or initialization
schemes.
Extensive experiments demonstrate that this
simple approach works remarkably well. We first
focus on existing MT benchmarks. For supervised
sentence-level MT, mBART initialization leads to
significant gains (up to 12 BLEU points) across
low/medium-resource pairs (<10M bi-text pairs),
without sacrificing performance in high-resource
settings. These results further improve with back-
translation (BT), setting a new state-of-the-art
on WMT16 English-Romanian and the FloRes
test sets. For document-level MT, our document-
level pre-training improves results by up to 5.5 BLEU points.
A language id symbol <LID> is used as
the initial token to predict the sentence. It is also
possible to use other noise types, such as those in
Lample et al. (2018c), but we leave the exploration
of the optimal noising strategy to future work.
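As a rough illustration of the two noise types used during pre-training, sentence permutation and word-span masking (cf. Figure 2), the sketch below applies both to one multi-sentence instance. It is only a sketch: the mask ratio, the span-length distribution, and the <MASK> symbol are illustrative placeholders, not the exact choices made for mBART.

    import random

    MASK = "<MASK>"  # placeholder mask symbol; the real vocabulary defines its own

    def permute_sentences(sentences, rng):
        """Randomly reorder the sentences of one training instance."""
        sentences = list(sentences)
        rng.shuffle(sentences)
        return sentences

    def mask_word_spans(tokens, mask_ratio, rng, avg_span_len=3):
        """Replace random word spans with a single <MASK> token until roughly
        `mask_ratio` of the original tokens have been removed."""
        tokens = list(tokens)
        budget = int(len(tokens) * mask_ratio)
        while budget > 0 and tokens:
            span = min(budget, 1 + int(rng.expovariate(1.0 / avg_span_len)))
            start = rng.randrange(len(tokens))
            tokens[start:start + span] = [MASK]
            budget -= span
        return tokens

    def add_noise(sentences, mask_ratio=0.35, seed=0):
        """Noise one instance: permute its sentences, then mask word spans."""
        rng = random.Random(seed)
        shuffled = permute_sentences(sentences, rng)
        return [mask_word_spans(s.split(), mask_ratio, rng) for s in shuffled]

The denoising objective is then to reconstruct the original, unpermuted and unmasked text from this corrupted input.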
Instance Format For each instance of a batch,
we sample a language id symbol <LID>, and we
pack as many consecutive sentences as possible,
sampled from the corresponding corpus of <LID>,
until either it hits the document boundary or
reaches the 512 max token length. Sentences in
the instance are separated by the end of sentence
(</S>) token. Then, we append the selected <LID>
token to represent the end of this instance. Pre-
training at the ‘‘multi sentence’’ level enables us to
work on both sentence and document translation.
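A minimal sketch of this packing scheme, assuming a caller-supplied tokenizer; the </S> and <LID> strings follow the description above (the concrete language id, e.g. <en_XX>, is passed in), while the function and argument names are ours.

    def pack_instances(doc_sentences, lang_id_token, tokenize, max_tokens=512):
        """Greedily pack consecutive sentences of one document into instances of at
        most `max_tokens` tokens, separated by </S> and closed with the <LID> token."""
        instances, current, length = [], [], 0
        for sent in doc_sentences:
            n = len(tokenize(sent)) + 1                    # +1 for the </S> separator
            if current and length + n + 1 > max_tokens:    # +1 reserves room for <LID>
                instances.append(" </S> ".join(current) + " </S> " + lang_id_token)
                current, length = [], 0
            current.append(sent)
            length += n
        if current:
            instances.append(" </S> ".join(current) + " </S> " + lang_id_token)
        return instances

    # usage sketch: pack_instances(sentences_of_one_document, "<en_XX>", str.split)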
Optimization Our full model (including 25
languages) is trained on 256 Nvidia V100 GPUs
(32GB) for 500K steps. The total batch size is
around 128K tokens per GPU, matching the BART
(Lewis et al., 2019) configuration. We use the
Adam optimizer (ε = 1e−6, β2 = 0.98) and linear
learning rate decay scheduling. The total training
time was approximately 2.5 weeks. We started the
training with dropout 0.1 and reduced it to 0.05 at
250K steps and to 0 at 400K steps. All experiments
are done with Fairseq (Ott et al., 2019).
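The schedules stated above can be written down directly. The sketch below encodes only the numbers given in this paragraph; the peak learning rate is left as a parameter because it is not stated here, and β1 for Adam is likewise unspecified (the comment uses the common default of 0.9 as an assumption).

    TOTAL_STEPS = 500_000  # pre-training updates (256 V100 GPUs, ~128K tokens per GPU)

    def dropout_at(step):
        """Dropout schedule: 0.1 until 250K steps, 0.05 until 400K, then 0."""
        if step < 250_000:
            return 0.1
        if step < 400_000:
            return 0.05
        return 0.0

    def lr_at(step, peak_lr):
        """Linear learning-rate decay from `peak_lr` down to 0 over training."""
        return peak_lr * max(0.0, 1.0 - step / TOTAL_STEPS)

    # Adam settings from the text: eps = 1e-6, beta2 = 0.98 (beta1 assumed 0.9), e.g.
    # torch.optim.Adam(model.parameters(), lr=peak_lr, betas=(0.9, 0.98), eps=1e-6)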
Reproducibility One potential issue of the
proposed approach is the replicability problem,
due to the requirement of massive monolingual
corpora and computational resources, together
with fine-grained selection of hyper-parameters
during pre-training: re-training the system is
likely to yield slightly different fine-tuning
performance. To address this, we will release the
pre-trained checkpoints as well as the code with
full instructions for pre-training a new model.
Related Work: XLM(-R) and MASS There
are several closely related approaches of
multilingual pre-training for machine translation.
XLM (Lample and Conneau, 2019) and XLM-R
(Conneau et al., 2019) pretrain BERT (Devlin
et al., 2019; Liu et al., 2019) in a multilingual
fashion, and the resulting parameters can be used to
initialize the translation model encoder. Different
from XLM(-R), mBART simultaneously pre-
trains the encoder and the decoder due to the
Seq2Seq setup, which is more natural to adapt to
machine translation applications.
Similar to mBART, MASS (Song et al., 2019)
is also a Seq2Seq-based pre-training technique
with ‘‘word-masking’’. However, the decoder of
MASS only predicts the tokens that were masked in
the encoder, whereas mBART reconstructs the full
target sequence, which allows us to apply not only
‘‘masking’’ but any possible noise functions.
Moreover, both XLM and MASS did
not show evidence of the pre-trained models
improving translation performance over two
languages.
2.3 Pre-trained Models
To better measure the effects of different levels
of multilinguality during pre-training, we built a
range of models as follows:
• mBART25 We pre-train a model on all
25 languages, using the setting described in
§2.2.
• mBART06 To explore the effect of pre-
training on related languages, we pretrain a
model on a subset of six European languages:
Ro, It, Cs, Fr, Es, and En. For a fair
comparison, we use ∼ 1/4 of the mBART25
batch size, which allows our model to have
the same number of updates per language
during pre-training.
• mBART02 We pre-train bilingual models,
using English and one other language for
four language pairs: En-De, En-Ro, En-It, and En-My.
We use a batch size of ∼ 1/12 of that in the
mBART25.
• BART-En/Ro To help establish a better
understanding towards multilingual pre-
training, we also train monolingual BART
models on the En and Ro corpus only,
respectively.
• Random As additional baselines, we will
also include a comparison with a model
randomly initialized without pre-training for
each translation task. Because the sizes
of different downstream datasets vary, we
always grid-search the hyper-parameters
(architecture, dropout, etc.) to find the best
non-pretrained configuration.
Figure 2: Framework for our multilingual denoising pre-training (left) and fine-tuning on downstream
MT tasks (right), where we use (1) sentence permutation and (2) word-span masking as the injected
noise. A special language id token is added at both the encoder and decoder. One multilingual pre-trained
model is used for all tasks.
All models use the same vocabulary (§2.1). Not
all tokens will frequently occur in all pre-training
corpora, but later experiments show that this
large vocabulary can improve generalization in
multilingual settings even for unseen languages.
2.4 Scaling-up Matters
Scaling-up the training data and model parameters
has been a key factor in pre-training (Devlin
et al., 2019; Conneau et al., 2019; Raffel et al.,
2019). Compared to conventional semi-supervised
methods (e.g., back-translation) and other pre-
training for MT (Lample and Conneau, 2019;
Song et al., 2019), we pre-train mBART on much
more monolingual data with relatively deeper
architecture. This scale, in combination with the
new multi-lingual training, is central to our results
(Sections 3 to 5), although future work could
more carefully study the relative contributions
of each.
3 Sentence-level Machine Translation
This section shows that mBART pre-training
provides consistent performance gains in low
to medium resource sentence-level MT settings,
including bi-text only and with back translation,
and outperforms other existing pre-training
schemes (§3.2). We also present a detailed analysis
to understand better which factors contribute
the most to these gains (§3.3), and show that
pre-training can even improve performance for
languages not present in the pre-training data
(§3.4).
3.1 Experimental Settings
Datasets We gather 24 pairs of publicly avail-
able parallel corpora that cover all the languages
in CC25 (Figure 1). Most pairs are from previous
WMT (Gu, Kk, Tr, Ro, Et, Lt, Fi, Lv, Cs, Es, Zh,
De, Ru, Fr ↔ En) and IWSLT (Vi, Ja, Ko, Nl,
Ar, It ↔ En) competitions. We also use FLoRes
pairs (Guzmán et al., 2019, En-Ne and En-Si),
En-Hi from IITB (Kunchukuttan et al., 2017), Und
En-My from WAT19 (Ding et al., 2018, 2019).
We divide the datasets into three categories—low
resource (<1M sentence pairs), medium resource
(>1M and <10M), and high resource (>10M).
Fine-tuning & Decoding We fine-tune mBART
on a single pair of bi-text data, feeding the
source language into the encoder and decod-
ing the target language. As shown in Figure 2,
we load the pre-trained weights and train the MT
model on bi-texts with teacher forcing. For all
directions, we train with 0.3 dropout, 0.2 label
smoothing, 2500 warm-up steps, 3e−5 maximum
learning rate. We use a maximum of 40K training
updates for all low and medium resource pairs and
100K for high resource pairs. The final models are
selected based on validation likelihood. We use
beam-search with beam size 5 for decoding. Our
initial experiments indicate that the fine-tuning
process is generally stable with different seeds.
Therefore, to reduce the total computation, all
our results are reported with a single execution. We
validate the statistical significance with scripts
from the mosesdecoder.3
3https://github.com/moses-smt/mosesdecoder/blob/master/scripts/analysis/bootstrap-hypothesis-difference-significance.pl.
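For reference, the fine-tuning recipe above can be collected into one configuration; the dictionary below merely restates the numbers from this paragraph, and its key names are ours rather than fairseq options.

    FINETUNE_CONFIG = {
        "dropout": 0.3,
        "label_smoothing": 0.2,
        "warmup_steps": 2500,
        "max_learning_rate": 3e-5,
        "max_updates": {"low_medium_resource": 40_000, "high_resource": 100_000},
        "beam_size": 5,
        "model_selection": "validation likelihood",
    }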
Languages     En-Gu    En-Kk    En-Vi     En-Tr     En-Ja      En-Ko
Data Source   WMT19    WMT19    IWSLT15   WMT17     IWSLT17    IWSLT17
Size          10K      91K      133K      207K      223K       230K
Direction     ←    →   ←    →   ←    →    ←    →    ←    →     ←    →
Random        0.0  0.0  0.8  0.2  23.6 24.8  12.2 9.5   10.4 12.3  15.3 16.3
mBART25       0.3  0.1  7.4  2.5  36.1 35.4  22.5 17.8  19.1 19.4  24.6 22.6

Languages     En-Nl    En-Ar    En-It     En-My     En-Ne      En-Ro
Data Source   IWSLT17  IWSLT17  IWSLT17   WAT19     FLoRes     WMT16
Size          237K     250K     250K      259K      564K       608K
Direction     ←    →   ←    →   ←    →    ←    →    ←    →     ←    →
Random        34.6 29.3 27.5 16.9 31.7 28.0  23.3 34.9  7.6  4.3   34.0 34.3
mBART25       43.3 34.8 37.6 21.6 39.8 34.0  28.3 36.9  14.5 7.4   37.8 37.7

Languages     En-Si    En-Hi    En-Et     En-Lt     En-Fi      En-Lv
Data Source   FLoRes   IITB     WMT18     WMT19     WMT17      WMT17
Size          647K     1.56M    1.94M     2.11M     2.66M      4.50M
Direction     ←    →   ←    →   ←    →    ←    →    ←    →     ←    →
Random        7.2  1.2  10.9 14.2 22.6 17.9  18.1 12.1  21.8 20.2  15.6 12.9
mBART25       13.7 3.3  23.5 20.8 27.8 21.4  22.4 15.3  28.5 22.4  19.3 15.9

Table 1: Low/medium resource machine translation. Pre-training consistently improves over a
randomly initialized baseline, with particularly large gains on low resource language pairs (e.g.,
Vi-En).
3.2 Main Results
As shown in Table 1, initializing with the pre-
trained mBART25 weights shows gains on all the
low and medium resource pairs when compared
with randomly initialized baselines. We observe
gains of 12 or more BLEU points on low
resource pairs such as En-Vi, En-Tr, and noisily
aligned pairs like En-Hi. Fine-tuning still fails
in extremely low-resource cases such as En-Gu,
which have ∼10k examples. In these settings,
unsupervised translation is more appropriate,
see §5.2. For high resource cases (Table 2),
we do not observe consistent gains, and pre-
training slightly hurts performance when more
than 25M parallel sentences are available. When
a significant amount of bi-text data is given, we
suspect that supervised training washes out the
pre-trained weights.
Note that some reported runs of our baseline
systems using the vanilla Transformers with
randomly initialized weights have considerable
gaps compared to the SoTA systems
reported in the original competitions.4 The differ-
ence is mainly because we train and search
4http://matrix.statmt.org/.
Languages   Cs     Es     Zh     De     Ru     Fr
Size        11M    15M    25M    28M    29M    41M
Random      16.5   33.2   35.0   30.9   31.5   41.4
mBART25     18.0   34.0   33.3   30.5   31.3   41.0

Table 2: High resource machine translation,
where all the datasets are from their latest WMT
competitions. We only evaluate our models on
En-X translation.
the hyper-parameters for baselines on officially
provided bitext only without using any mono-
lingual corpus or multilingual adaptation. For
example, the SoTA score for En→Gu is 28.2
in WMT19, compared with 0 in Table 1. This is
basically because the quality of the original bitext
data is low, and the SoTA systems commonly
used additional languages such as Hi to boost
performance. Similar gaps can also be
observed in pairs such as Kk-En and Lt-En,
where Ru as an additional language is also
crucial. The main purpose of this part is to
discuss the effects of multilingual pre-training in a
constrained bitext setting for a better comparison.
We will include more discussions of combining
Figure 3: Pre-training + back translation on FLoRes with two iterations of BT.
multilingual translation with pretraining in future
work.
Plus Back-Translation Back-translation (BT;
Sennrich et al., 2016) is a standard approach
to augment bi-text with target-side monolingual
data. We combine our pre-training with BT and
test it on low resource language pairs—En-Si
and En-Ne—using the FLoRes dataset (Guzmán
et al., 2019). We use the same monolingual data
as Guzmán et al. (2019) to generate BT data.
Figure 3 shows that initializing the model with
our mBART25 pre-trained parameters improves
BLEU scores at each iteration of back translation,
resulting in new state-of-the-art results in all four
translation directions. It indicates that the pre-
trained mBART weights can be directly plugged
into existing pipelines using BT.
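The BT setup can be sketched as the loop below, where `train` and `translate` are caller-supplied (hypothetical) helpers standing in for mBART-initialized fine-tuning and beam-search decoding; the sketch only illustrates how each iteration augments the bi-text with synthetic pairs produced by the current reverse model.

    def iterative_back_translation(bitext, mono_tgt, train, translate, iterations=2):
        """Iterative BT: `bitext` is a list of (src, tgt) pairs, `mono_tgt` a list of
        target-side monolingual sentences; `train(pairs)` returns a fine-tuned model
        and `translate(model, sentences)` returns its translations."""
        forward = train(bitext)                                  # src -> tgt
        backward = train([(t, s) for s, t in bitext])            # tgt -> src
        for _ in range(iterations):
            synthetic_src = translate(backward, mono_tgt)        # back-translate monolingual target data
            augmented = bitext + list(zip(synthetic_src, mono_tgt))
            forward = train(augmented)                           # re-train on real + synthetic pairs
            backward = train([(t, s) for s, t in augmented])
        return forward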
Compared with Other Pre-training
Approaches We also compare our pre-trained
models with recent self-supervised pre-training
methods, as shown in Table 3. We consider En-Ro
translation, the only pair with established results.
Our mBART model outperforms all the other
pre-trained models, both with and without BT
augmentation. We also show comparisons with
the conventional BART model trained on the same
En and Ro data only. Both have improvements
over baselines, although worse than mBART
results, indicating that pre-training in a multilin-
gual setting is essential. Moreover, combining
BT leads to additional gains, resulting in a new
state-of-the-art for Ro-En translation.
3.3 Analysis
We also present additional analyses, to better
quantify when our pre-training helps.
How many languages should you pre-train on?
We investigate when it is helpful for pre-training to
include languages other than the targeted language
pair that will be used during fine-tuning. Table 4
Model           Pre-training Data   En→Ro   Ro→En   +BT
Random          None                34.3    34.0    36.8
XLM (2019)      En Ro               –       35.6    38.5
MASS (2019)     En Ro               –       –       39.1
BART (2019)     En                  –       –       38.0
XLM-R (2019)    CC100               35.6    35.8    –
BART-En         En                  36.0    35.8    37.4
BART-Ro         Ro                  37.6    36.8    38.1
mBART02         En Ro               38.5    38.5    39.9
mBART25         CC25                37.7    37.8    38.8

Table 3: Comparison with other pre-training
approaches on WMT16 Ro-En.
Languages   De     Ro     It     My     En
Size/GB     66.6   61.4   30.2   1.6    300.8
mBART02     31.3   38.5   39.7   36.5
mBART06     –      38.5   39.3   –
mBART25     30.5   37.7   39.8   36.9

Table 4: Pretraining languages on En-X trans-
lation. The size refers to the size of monolingual
data for X. The size of En is shown as reference.
All the pretrained models were controlled to see
the same number of English instances during
training.
shows performance on four X-En pairs. Pre-
training on more languages helps most when the
target language monolingual data is limited (e.g.,
En-My, where the size of My is around 0.5%
of En).
In contrast, when monolingual data is plentiful
(De, Ro), pre-training on multiple languages
slightly hurts the final results (<1 BLEU). In
these cases, additional languages may reduce
the capacity available for each test language.
Additionally, the fact that mBART06 performs
similar to mBART02 on Ro-En suggests that
pre-training with similar languages is particularly
helpful.
the language id symbol <LID> is predicted. We use beam size 5 by default.
4 Document-level Machine Translation
We evaluate mBART on document-level machine
translation tasks, where the goal is to translate
segments of text that contain more than one
sentence (up to an entire document). During pre-
training, we use document fragments of up to 512
tokens, allowing the models to learn dependencies
between sentences. We show that this pre-
training significantly improves document-level
translation.
4.1 Experimental Settings
Datasets We evaluate performance on two
common document-level MT datasets: WMT19
En-De and TED15 Zh-En. For En-De, we use the
document data from WMT19 to train our model,
without any additional sentence-level data. The
Zh-En dataset is from IWSLT 2014 and 2015
(Cettolo et al., 2012, 2015). Following Miculicich
et al. (2018), we use 2010-2013 TED as the test
set.
Pre-processing We pre-process with the ap-
proach used in pre-training. For each block, sen-
tences are separated by end of sentence symbols
(</S>) and the entire instance is ended with
the specific language id (<LID>). On average,
documents are split into 2–4 instances.
Baselines & Evaluation We train 4 models: a
document-level (Doc-) MT model (§4.1) and a
corresponding sentence-level (Sent-) MT model
(§3.1) as the baseline, both with and without
pre-training. We use mBART25 as the common
pre-trained model for En-De and Zh-En. For
En-De, even though our mBART25 Doc-MT
model decodes multiple sentences together, the
translated sentences can be aligned to the source
sentences, which allows us to evaluate BLEU
scores both on sentence-level (s-BLEU) and
document-level (d-BLEU).5 For Zh-En, however,
we cannot produce the same number of translated
sentences as the reference due to alignment errors
in the test data. We only provide the d-BLEU
scores on this direction.
We also compare our models with Hierarchical
Attention Networks (HAN, Miculicich et al.,
2018) on Zh-En, which is the state-of-the-
art non-pretraining approach for document-level
translation for this pair. It combines two
layers of attention, first within and then across
sentences.
5Standard BLEU scores match n-grams at sentence-level.
We also consider document-level scores, where we match n-grams
over the whole document, resulting in a slightly higher score.
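As a small illustration of the distinction (not the authors' exact evaluation script), the sketch below uses sacrebleu: s-BLEU scores aligned sentences directly, while d-BLEU first concatenates the sentences of each document and then matches n-grams over whole documents.

    import sacrebleu

    def s_bleu(hyp_sentences, ref_sentences):
        """Sentence-level corpus BLEU over aligned hypothesis/reference sentences."""
        return sacrebleu.corpus_bleu(hyp_sentences, [ref_sentences]).score

    def d_bleu(hyp_documents, ref_documents):
        """Document-level BLEU: each document is a list of sentences that gets
        joined into a single line before n-gram matching."""
        hyps = [" ".join(sents) for sents in hyp_documents]
        refs = [" ".join(sents) for sents in ref_documents]
        return sacrebleu.corpus_bleu(hyps, [refs]).score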
(a) Sentence- and document-level BLEU scores on En-De

Model       Random               mBART25
            s-BLEU    d-BLEU     s-BLEU    d-BLEU
Sent-MT     34.5      35.9       36.4      38.0
Doc-MT      ×         7.7        37.1      38.5

(b) Document-level BLEU scores on Zh-En (d-BLEU)

Model       Random    mBART25    HAN (2018)
Sent-MT     22.0      28.4       −
Doc-MT      3.2       29.6       24.0

Table 6: Document-level machine translation on En-De and Zh-En. (×) The randomly initialized Doc-
MT model cannot produce translations aligned to the original sentences, so only document evaluation
is possible.
4.2 Main Results
Table 6 shows the main results for both En-De and
Zh-En at both sentence-level and document-level.
Random vs. Pre-trained The MT models
initialized with pre-trained weights outperform
randomly initialized models by large margins, for
both sentence-level and document-level training.
Our mBART25 models (both Sent-MT and Doc-
MT) also outperform HAN (Miculicich et al.,
2018),6 despite the fact that they are not
customized for document-level MT.
Sent-MT vs. Doc-MT For En-De and En-Zh,
the mBART25 Doc-MT models outperform
mBART25 fine-tuned at sentence-level by large
margins, reversing the trend seen for models
without pre-training. For both datasets, randomly
initialized Doc-MT fails to work, resulting
in much worse results than the sentence-
level models. Such large performance gaps
indicate that pre-training is critical for document
level performance. It is in general difficult to
collect high-quality document-level data in large
quantities, suggesting that pre-training may be a
strong strategy for future work. We also include a
sampled example in Figure 6.
5 Unsupervised Machine Translation
In addition to supervised machine translation, we
also evaluate our model on tasks where no bi-text
is available for the target language pair. We define
three types of unsupervised translation:
1. No bi-text of any kind. A common solution
is to learn from back-translation (Artetxe
et al., 2017; Lample et al., 2018c). We
show that mBART provides a simple and
effective initialization scheme for these
methods (§5.1).
6d-BLEU is recomputed from the provided system output.
2. No bi-text for the target pair, but both
languages appear in bi-text corpora with other
pairs. This setup is common for multilingual
MT systems (Johnson et al., 2017; Gu et al.,
2019). In this paper, we limit our focus to
building models for single language pairs,
and leave discussions for multilingual MT to
future work.
3. No bi-text for the target pair is available, but
there is bi-text for translating from some other
language into the target language. mBART
supports effective transfer, even if the source
language has no bi-text of any form (§5.2).
5.1 Unsupervised Machine Translation via
Back-Translation
Datasets We evaluate our pre-trained models
on En-De, En-Ne, and En-Si. En and De are both
European languages sharing many sub-words,
whereas Ne and Si are quite distinct from En. We
use the same test sets as the supervised benchmarks
(§3.1), and use the same pre-training data (CC25)
for back-translation to avoid introducing new
information.
Learning Following Lample
and Conneau
(XLM, 2019), we initialize the translation model
with the mBART weights, and then learn to
predict the monolingual sentences conditioned
on source sentences generated by on-the-fly BT.
Moreover, we constrain mBART to only gen-
erate tokens in the target language7 for the first
1000 steps of on-the-fly BT, to avoid it copying
the source text.
Results Table 7 shows the unsupervised trans-
lation results compared with non-pretrained mod-
els, as well as models with existing pre-training
methods. Our models achieve large gains over
non-pretrained models for all directions, and
7We mask out the output probability of predicted tokens
which appear less than 1% in the target monolingual corpus.
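A sketch of the constraint described in footnote 7, assuming a PyTorch decoder that exposes its output logits and a token-count table computed over the target monolingual corpus; the function names and the way the 1% threshold is applied are our own reading of the footnote.

    import torch

    def build_target_token_mask(vocab_size, token_counts, total_tokens, min_ratio=0.01):
        """Allow only tokens whose relative frequency in the target monolingual
        corpus reaches `min_ratio` (1%, following footnote 7)."""
        allowed = torch.zeros(vocab_size, dtype=torch.bool)
        for token_id, count in token_counts.items():
            if count / total_tokens >= min_ratio:
                allowed[token_id] = True
        return allowed

    def constrain_logits(logits, allowed):
        """Mask out disallowed tokens before the softmax; applied during the first
        1000 steps of on-the-fly BT (logits: [batch, vocab])."""
        return logits.masked_fill(~allowed, float("-inf"))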
Figure 6: An example of document-level translation from mBART25 Sent-MT and Doc-MT, held out
from the test set of TED15 Zh-En. The Doc-MT system produces a much more fluent and coherent translation,
which is closer to the reference translation. For example, the Doc-MT model produces several instances of ‘‘And’’ to
connect sentences so that the text reads better, while the Sent-MT model does not have global knowledge
and produces sentences independently. In addition, both systems produce much better translations
than models without pre-training; the non-pretrained Doc-MT model completely fails to produce
readable translation output.
             En-De        En-Ne        En-Si
             ←     →      ←     →      ←     →
Random       21.0  17.2   0.0   0.0    0.0   0.0
XLM (2019)   34.3  26.4   0.5   0.1    0.1   0.1
MASS (2019)  35.2  28.3   –     –      –     –
mBART        34.0  29.8   10.0  4.4    8.2   3.9

Table 7: Unsupervised MT via BT between
dissimilar languages.
outperform XLM significantly for dissimilar pairs
(En-Ne, En-Si) where the existing approaches
completely fail. For En-De, our model also
performs comparably against XLM and MASS.
5.2 Unsupervised Machine Translation via
Language Transfer
We also report results when the target language
appears in a bi-text with some other source
language.
Datasets We only consider X→En translation,
and choose the bitexts of 12 language pairs from
§3.1, covering Indic languages (Ne, Hi, Si, Gu),
European languages (Ro, It, Cs, Nl), East Asian
languages (Zh, Ja, Ko), and Arabic (Ar).
Results The pre-trained mBART25 model is
fine-tuned on each language pair, and then
evaluated on the remaining pairs, as seen in Table 8.
We also present the direct fine-tuning performance
(§3) on the diagonal, for reference. We see transfer
for all pairs with all fine-tuned models except for
Gu-En where the supervised model completely
fails (0.3 BLEU). In some cases we can achieve
similar (Cs-En) or even much better (Ne-En,
Gu-En) results compared with the supervised
results. We also show an example of language
transfer in Figure 7.
As a comparison, we also apply the same proce-
dure on randomly initialized models without pre-
Ausbildung, which always ends up with ≈ 0 BLEU.
This indicates that multilingual pre-training is
essential and produces universal representations
across languages, so that once the model learns to
translate one language to En, it learns to translate
all languages with similar representations.
When is language transfer useful? Table 8
also shows that the size of transfer effects varies
with the similarity of different languages. First, for
most pairs, language transfer works better when
fine-tuning is also conducted in the same language
family, especially between Indic languages (Hi,
Ne, Gu). However, significant vocabulary sharing
is not required for effective transfer. For example,
Zh-En and It-En achieve the best transfer learning
results on Ko-En and Ar-En, respectively. This is
despite the low vocabulary overlap (even character
overlap) between (Zh, Ko) and (It, Ar).
With BT We present a comparison of unsuper-
vised MT with BT vs. language transfer in Table 9
where language transfer works better when there
exists a closely related language to transfer from.
Moreover, we show promising results for
combining these two techniques. We start from the
best transferred model and apply (iterative) BT on
the same monolingual corpus used in pre-training.
Tisch 9 presents the results with 1 iteration of BT.
We see improvements for all pairs. The complete
analysis of both methods is left as future work.
6 Related Work
Self-supervised Learning for Text Generation
This work inherits from the recent success brought
by pre-training for NLP applications (Peters et al.,
2018; Radford et al., 2018; Devlin et al., 2019;
Yang et al., 2019b; Liu et al., 2019), especially
for text generation (Radford et al., 2019; Song
et al., 2019; Dong et al., 2019; Raffel et al.,
2019; Lewis et al., 2019). The pre-trained models
are usually used as the initialization for fine-
tuning downstream tasks such as controllable
language modeling (Shirish Keskar et al., 2019),
summarization (Song et al., 2019; Liu and Lapata,
2019) and dialogue generation (Zhang et al.,
2019).
Specifically for machine translation, unsuper-
vised pre-training methods were also explored
to improve the performance. Qi et al. (2018)
investigated the application of pre-trained word
embeddings for MT; Ramachandran et al. (2017)
proposed to pre-train the encoder-decoder mod-
ules as two separate language models. Yang et al.
(2019a); Zhu et al. (2020) explored fusion ap-
proaches to incorporate the pre-trained BERT
weights to improve NMT training. In contrast
to most prior work, we focus on pre-training one
denoising autoencoder, and adapt the weights of
the entire model for various MT applications.
Fine-tuning Languages (columns) vs. Testing Languages (rows):

            Zh    Ja    Ko    Cs    Ro    Nl    It    Ar    Hi    Ne    Si    Gu
Domain      News  TED   TED   News  News  TED   TED   TED   News  Wiki  Wiki  Wiki
Zh          23.7  8.8   9.2   2.8   7.8   7.0   6.8   6.2   7.2   4.2   5.9   0.0
Ja          9.9   19.1  12.2  0.9   4.8   6.4   5.1   5.6   4.7   4.2   6.5   0.0
Ko          5.8   16.9  24.6  5.7   8.5   9.5   9.1   8.7   9.6   8.8   11.1  0.0
Cs          9.3   15.1  17.2  21.6  19.5  17.0  16.7  16.9  13.2  15.1  16.4  0.0
Ro          16.2  18.7  17.9  23.0  37.8  22.3  21.6  22.6  16.4  18.5  22.1  0.0
Nl          14.4  30.4  32.3  21.2  27.0  43.3  34.1  31.0  24.6  23.3  27.3  0.0
It          16.9  25.8  27.8  17.1  23.4  30.2  39.8  30.6  20.1  18.5  23.2  0.0
Ar          5.8   15.5  12.8  12.7  12.0  14.7  14.7  37.6  11.6  13.0  16.7  0.0
Hi          3.2   10.1  9.9   5.8   6.7   6.1   5.0   7.6   23.5  14.5  13.0  0.0
Ne          2.1   6.7   6.5   5.0   4.3   3.0   2.2   5.2   17.9  14.5  10.8  0.0
Si          5.0   5.7   3.8   3.8   1.3   0.9   0.5   3.5   8.1   8.9   13.7  0.0
Gu          8.2   8.5   4.7   5.4   3.5   2.1   0.0   6.2   13.8  13.5  12.8  0.3

Table 8: Unsupervised MT via language transfer on X-En translations. The model fine-tuned on one
language pair (column) is directly tested on another (row). Diagonal entries are the direct fine-tuning
results (§3).
Figure 7: An example of unsupervised MT via language transfer. mBART models fine-tuned with Ko or
Zh are able to translate a Ja sentence into En almost as correctly as in the supervised case.
Source   online BT   Transfer     Combined
Ro       30.5        23.0 (Cs)    33.9
Ne       10.0        17.9 (Hi)    22.1
Zh       11.3        9.2 (Ko)     15.0
Nl       28.5        34.1 (It)    35.4

Table 9: BT vs. language transfer for unsupervised
MT for X-En translations. For language transfer,
we present the best transferring scores together
with the language transferred from.
Multilinguality in NLP tasks This work is also
related to the continual
trend of multilingual
language learning, including aligning multilingual
word embeddings (Mikolov et al., 2013; Chen
and Cardie, 2018; Lample et al., 2018B) into
universal space, and learning crosslingual models
(Wada and Iwata, 2018; Lample and Conneau,
2019; Conneau et al., 2019) to exploit shared
representations across languages.
For MT, the most relevant field is multilingual
Übersetzung (Firat et al., 2016; Johnson et al.,
2017; Aharoni et al., 2019; Arivazhagan et al.,
2019) where the ultimate goal is to jointly train
one translation model that translates multiple
language directions at the same time, and
shares representations to improve the translation
performance on low-resource languages (Gu et al.,
2018). In this paper, we focus on multilingualism
in the pre-training stage and fine-tune the
learned model in the standard bilingual scenario.
Compared with multilingual translation, we do
not require parallel data across multiple languages
but only for the targeted direction, which improves the
scalability to low-resource languages and specific
domains.
Document Translation As one of the key
applications, our work is also related to previous
efforts for incorporating document-level context
into neural machine translation (Wang et al.,
2017; Jean et al., 2017; Tiedemann and Scherrer,
2017; Miculicich et al., 2018; Tu et al., 2018).
Li et al. (2019) is the most relevant work
that also utilized a pre-trained encoder (BERT)
for handling longer context. However, the
focus has been on designing new task-specific
techniques, and doing sentence-level translation
with a wider input context. To the best of our
knowledge, our multilingual pre-trained model
is the first that shows improved results on
document-level translation with standard Seq2Seq
models.
Unsupervised Translation This work also
summarizes the previous efforts of learning to
translate between languages without a direct
parallel corpus. When no parallel data of any
kind is available, Artetxe et al. (2017) and
Lample et al. (2018a) proposed to jointly learn
denoising auto-encoders and back-translation from
both directions, which, however, required good
initialization and only worked well on similar
language pairs. Wu et al. (2019) solve the
problem by mining sentences from Wikipedia
and using them as weakly supervised translation
pairs. Similar to Lample and Conneau (2019) and
Song et al. (2019), we follow the first approach
and treat our pre-trained model as the initialization
step. We also investigate unsupervised translation
using language transfer, which is similar to
Pourdamghani et al. (2019), where the authors
generate translationese of the source language
and train a system on high-resource languages
to correct these intermediate utterances. It is
also closely related to Conneau et al. (2018)
and Artetxe et al. (2019) for cross-lingual
representation learning, where we also show that
representations learned by mBART can be easily
transferred between languages without supervised
data.
7 Conclusion
We demonstrate that multilingual de-noising pre-
training is able to significantly improve both
supervised and unsupervised machine translation
at both the sentence level and document level.
We analyze when and how pre-training is most
effective and can be combined with other
approaches such as back-translation. Our results
learning ability of
also show the transfer
the learned representations from multilingual
pre-training.
In future work, we will scale-up the current
pre-training to more languages, for example, an
mBART100 model. The size of our model makes
it expensive to deploy in production—future work
will explore pre-training more efficient models.
Acknowledgments
We thank Marc’Aurelio Ranzato, Guillaume
Lample, Alexis Conneau, and Michael Auli
for sharing their expertise on low-resource and
unsupervised machine translation and Peng-Jen
Chen and Jiajun Shen for details about FloRes and
WAT datasets. We also thank our colleagues at
FAIR and FAIAR for valuable feedback.
References
Roee Aharoni, Melvin Johnson, and Orhan
Firat. 2019. Massively multilingual neural
maschinelle Übersetzung. In Proceedings of the 2019
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, Volumen 1
(Long and Short Papers), pages 3874–3884.
Minneapolis, Minnesota. Association for
Computational Linguistics.
Naveen Arivazhagan, Ankur Bapna, Orhan Firat,
Dmitry Lepikhin, Melvin Johnson, Maxim
Krikun, Mia Xu Chen, Yuan Cao, George
Foster, Colin Cherry, Wolfgang Macherey,
Zhifeng Chen, and Yonghui Wu. 2019.
Massively multilingual neural machine trans-
lation in the wild: Findings and challenges.
CoRR, abs/1907.05019.
Mikel Artetxe, Gorka Labaka, Eneko Agirre, Und
Kyunghyun Cho. 2017. Unsupervised neural
maschinelle Übersetzung. arXiv preprint arXiv:
1710.11041. DOI: https://doi.org/10.18653/v1/D18-1399
Mikel Artetxe, Sebastian Ruder, and Dani
Yogatama. 2019. On the cross-lingual trans-
ferability of monolingual representations. arXiv
preprint arXiv:1910.11856. DOI: https://doi.org/10.18653/v1/2020.acl-main.421
Mauro Cettolo, Christian Girardi, and Marcello
Federico. 2012. Wit3: Web inventory of
transcribed and translated talks. In Conference
of European Association
for Machine
Translation, pages 261–268.
Mauro Cettolo, Niehues Jan, Stüker Sebas-
tian, Luisa Bentivogli, Roldano Cattoni, Und
Marcello Federico. 2015. The IWSLT 2015
evaluation campaign. In International Work-
shop on Spoken Language Translation.
Xilun Chen and Claire Cardie. 2018. Unsuper-
vised multilingual word embeddings. In Pro-
ceedings of the 2018 Conference on Empirical
Methods in Natural Language Processing,
pages 261–270. Brussels, Belgium. Association
for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D18-1024
Alexis Conneau, Kartikay Khandelwal, Naman
Goyal, Vishrav Chaudhary, Guillaume Wenzek,
Francisco Guzm´an, Edouard Grave, Myle Ott,
Luke Zettlemoyer, and Veselin Stoyanov.
2019. Unsupervised cross-lingual representa-
tion learning at scale. arXiv preprint arXiv:
1911.02116. DOI: https://doi.org/10.18653/v1/2020.acl-main.747
Alexis Conneau, Ruty Rinott, Guillaume
Lample, Adina Williams, Samuel R. Bowman,
Holger Schwenk, and Veselin Stoyanov.
2018. XNLI: Evaluating cross-lingual sentence
representations. In Proceedings of the 2018
Conference on Empirical Methods in Natu-
ral Language Processing. Association for Com-
putational Linguistics. DOI: https://doi.org/10.18653/v1/D18-1269
Chenchen Ding, Hnin Thu Zar Aye, Win Pa Pa,
Khin Thandar Nwet, Khin Mar Soe, Masao
Utiyama, and Eiichiro Sumita. 2019. Towards
Burmese (Myanmar) morphological analysis:
Syllable-based tokenization and part-of-speech
tagging. ACM Transactions on Asian and
Low-Resource Language Information Process-
ing (TALLIP), 19(1):5. DOI: https://doi.org/10.1145/3325885
Chenchen Ding, Masao Utiyama, and Eiichiro
Sumita. 2018. NOVA: A feasible and flexible
annotation system for joint tokenization and
part-of-speech tagging. ACM Transactions on
Asian and Low-Resource Language Informa-
tion Processing (TALLIP), 18(2):17. DOI:
https://doi.org/10.1145/3276773
Li Dong, Nan Yang, Wenhui Wang, Furu
Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao,
Ming Zhou, and Hsiao-Wuen Hon. 2019.
Unified language model pre-training for natural
language understanding and generation. arXiv
preprint arXiv:1905.03197.
Sergey Edunov, Alexei Baevski, and Michael
Auli. 2019. Pre-trained language model repre-
sentations for language generation. arXiv preprint
arXiv:1903.09722. DOI: https://doi.org/10.18653/v1/N19-1409
Orhan Firat, Kyunghyun Cho, and Yoshua
Bengio. 2016. Multi-way, multilingual neural
machine translation with a shared attention
mechanism. In NAACL. DOI: https://doi.org/10.18653/v1/N16-1101
Jiatao Gu, Hany Hassan, Jacob Devlin, Und
Victor O. K. Li. 2018. Universal neural
machine translation for extremely low resource
languages. In Proceedings of the 2018
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, Volumen 1
(Long Papers), pages 344–354. New Orleans,
Louisiana. Association for Computational
Linguistics.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Und
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In North American Association
for Computational Linguistics (NAACL).
Jiatao Gu, Yong Wang, Kyunghyun Cho, Und
Victor O. K. Li. 2019. Improved zero-shot neu-
ral machine translation via ignoring spurious
correlations. arXiv preprint arXiv:1906.01181.
Francisco Guzm´an, Peng-Jen Chen, Myle
Ott, Juan Pino, Guillaume Lample, Philipp
Koehn, Vishrav Chaudhary, and Marc’Aurelio
Ranzato. 2019. The FLORES evaluation
datasets for low-resource machine translation:
Nepali–English and Sinhala–English. In Pro-
ceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and
the 9th International Joint Conference on Nat-
ural Language Processing (EMNLP-IJCNLP),
pages 6097–6110. Hong Kong, China. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D19-1632
S´ebastien Jean, Stanislas Lauly, Orhan Firat, Und
Kyunghyun Cho. 2017. Does neural machine
translation benefit from larger context? CoRR,
abs/1704.05135.
Melvin Johnson, Mike Schuster, Quoc V.
Le, Maxim Krikun, Yonghui Wu, Zhifeng
Chen, Nikhil Thorat, Fernanda Vi´egas, Martin
Wattenberg, Greg Corrado, Macduff Hughes,
and Jeffrey Dean. 2017. Google's multilingual
neural machine translation system: Enabling
zero-shot translation. Transactions of the
Association for Computational Linguistics,
5:339–351. DOI: https://doi.org/10.1162/tacl_a_00065
Taku Kudo and John Richardson. 2018. Sentence-
Piece: A simple and language independent
subword tokenizer and detokenizer for neural
text processing. In Proceedings of the 2018
Conference on Empirical Methods in Natural
Language Processing: Systemdemonstrationen,
pages 66–71. Brussels, Belgium. Associa-
tion for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D18-2012, PMID: 29382465
Anoop Kunchukuttan, Pratik Mehta, and Pushpak
Bhattacharyya. 2017. The IIT Bombay English-
Hindi parallel corpus. CoRR, abs/1710.02855.
Guillaume Lample and Alexis Conneau. 2019.
Cross-lingual language model pretraining.
arXiv preprint arXiv:1901.07291.
Guillaume Lample, Alexis Conneau, Ludovic
Denoyer, and Marc’Aurelio Ranzato. 2018a.
Unsupervised machine translation using mono-
lingual corpora only. In International Confer-
ence on Learning Representations.
Guillaume Lample, Alexis Conneau,
Marc’Aurelio Ranzato, Ludovic Denoyer,
and Hervé Jégou. 2018b. Word transla-
tion without parallel data. In International
Conference on Learning Representations.
Guillaume Lample, Myle Ott, Alexis Conneau,
Ludovic Denoyer, and Marc’Aurelio Ranzato.
2018c. Phrase-based & neural unsupervised
maschinelle Übersetzung. arXiv preprint arXiv:
1804.07755. DOI: https://doi.org/10.18653/v1/D18-1549
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan
Ghazvininejad, Abdelrahman Mohamed, Omer
Levy, Veselin Stoyanov, and Luke Zettlemoyer.
2019. BART: Denoising sequence-to-sequence
pre-training for natural language generation,
translation, and comprehension. arXiv preprint
arXiv:1910.13461. DOI: https://doi.org/10.18653/v1/2020.acl-main.703
Liangyou Li, Xin Jiang, and Qun Liu. 2019.
Pretrained language models for document-
level neural machine translation. arXiv preprint
arXiv:1911.03110.
Yang Liu and Mirella Lapata. 2019. Text
summarization with pretrained encoders. arXiv
preprint arXiv:1908.08345. DOI: https://doi.org/10.18653/v1/D19-1387
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019. RoBERTa: A robustly
optimized BERT pretraining approach. arXiv
preprint arXiv:1907.11692.
Lesly Miculicich, Dhananjay Ram, Nikolaos
Pappas, and James Henderson. 2018.
Document-level neural machine translation
with hierarchical attention networks. In Pro-
ceedings of the 2018 Conference on Empirical
Methods in Natural Language Processing,
pages 2947–2954. Brussels, Belgium. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D18-1325
Tomas Mikolov, Quoc V. Le, and Ilya Sutskever.
2013. Exploiting similarities among languages
für maschinelle Übersetzung. CoRR, abs/1309.4168.
Myle Ott, Sergey Edunov, Alexei Baevski,
Angela Fan, Sam Gross, Nathan Ng, David
Grangier, and Michael Auli. 2019. FAIRSEQ: A
fast, extensible toolkit for sequence model-
ing. In North American Association for
Computational Linguistics (NAACL): System
Demonstrations. DOI: https://doi.org/10.18653/v1/N19-4009
Matthew Peters, Mark Neumann, Mohit Iyyer,
Matt Gardner, Christopher Clark, Kenton
Lee, and Luke Zettlemoyer. 2018. Deep
contextualized word representations. In North
American Association
for Computational
Linguistik (NAACL).
Telmo Pires, Eva Schlinger, and Dan Garrette.
2019. How multilingual
is multilingual
BERT? arXiv preprint arXiv:1906.01502. DOI:
https://doi.org/10.18653/v1/P19-1493
Nima Pourdamghani, Nada Aldarrab, Marjan
Ghazvininejad, Kevin Knight, and Jonathan
May. 2019. Translating translationese: A two-
step approach to unsupervised machine trans-
lation. In ACL. DOI: https://doi.org/10.18653/v1/P19-1293
Ye Qi, Devendra Singh Sachan, Matthieu Felix,
Sarguna Janani Padmanabhan, and Graham
Neubig. 2018. When and why are pre-
trained word embeddings useful for neural
maschinelle Übersetzung? arXiv preprint arXiv:
1804.06323. DOI: https://doi.org/10.18653/v1/N18-2084
Alec Radford, Karthik Narasimhan, Tim
Salimans, and Ilya Sutskever. 2018. Improving
language understanding with unsupervised
learning, OpenAI.
Alec Radford, Jeffrey Wu, Rewon Child, David
Luan, Dario Amodei, and Ilya Sutskever. 2019.
Language models are unsupervised multitask
learners, OpenAI.
Colin Raffel, Noam Shazeer, Adam Roberts,
Katherine Lee, Sharan Narang, Michael
Matena, Yanqi Zhou, Wei Li, und Peter J. Liu.
2019. Exploring the limits of transfer learning
with a unified text-to-text transformer. arXiv
preprint arXiv:1910.10683.
Prajit Ramachandran, Peter J Liu, and Quoc Le.
2017. Unsupervised pretraining for sequence to
sequence learning. In Proceedings of the 2017
Conference on Empirical Methods in Natural
Language Processing, pages 383–391. DOI:
https://doi.org/10.18653/v1/D17-1039
Rico Sennrich, Barry Haddow, and Alexandra
Birch. 2016. Improving neural machine trans-
lation models with monolingual data. In Pro-
ceedings of the 54th Annual Meeting of the
Association for Computational Linguistics
(Volume 1: Long Papers), pages 86–96.
Berlin, Germany. Association for Computa-
tional Linguistics. DOI: https://doi.org/10.18653/v1/P16-1009
Nitish Shirish Keskar, Bryan McCann, Lav R.
Varshney, Caiming Xiong, and Richard Socher.
2019. CTRL: A conditional transformer language
model for controllable generation. arXiv
preprint arXiv:1909.05858.
Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu,
and Tie-Yan Liu. 2019. MASS: Masked
sequence to sequence pre-training for language
generation. In International Conference on
Machine Learning (ICML).
Jörg Tiedemann and Yves Scherrer. 2017.
Neural machine translation with extended
context. In Proceedings of the Third Work-
shop on Discourse in Machine Translation,
pages 82–92. Copenhagen, Denmark. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/W17-4811
Zhaopeng Tu, Yang Liu, Shuming Shi, Und
Tong Zhang. 2018. Learning to remember
translation history with a continuous cache.
Transactions of the Association for Com-
putational Linguistics, 6:407–420. DOI:
https://doi.org/10.1162/tacl_a_00029
Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need. In Advances in Neural
Information Processing Systems.
Takashi Wada and Tomoharu Iwata. 2018.
Unsupervised cross-lingual word embedding
by multilingual neural language models. CoRR,
abs/1809.02306. DOI: https://doi.org/10.18653/v1/P19-1300
Longyue Wang, Zhaopeng Tu, Andy Way,
and Qun Liu. 2017. Exploiting cross-sentence
context for neural machine translation. In
Proceedings of the 2017 Conference on
Empirical Methods in Natural Language
Processing, pages 2826–2831. Copenhagen,
Denmark. Association for Computational
Linguistics. DOI: https://doi.org/10.18653/v1/D17-1301
Zihan Wang, Stephen Mayhew, Dan Roth,
and others. 2019. Cross-lingual ability of
multilingual BERT: An empirical study. arXiv
preprint arXiv:1912.07840.
Guillaume Wenzek, Marie-Anne Lachaux, Alexis
Conneau, Vishrav Chaudhary, Francisco
Guzman, Armand Joulin, and Edouard Grave.
2019. CCNet: Extracting high quality
monolingual datasets from web crawl data.
arXiv preprint arXiv:1911.00359.
Lijun Wu, Jinhua Zhu, Di He, Fei Gao, Xu Tan,
Tao Qin, and Tie-Yan Liu. 2019. Machine
translation with weakly paired bilingual
documents.
Jiacheng Yang, Mingxuan Wang, Hao Zhou,
Chengqi Zhao, Yong Yu, Weinan Zhang, Und
Lei Li. 2019a. Towards making the most of BERT
in neural machine translation. arXiv preprint
arXiv:1908.05672.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime
Carbonell, Ruslan Salakhutdinov, and Quoc V.
Le. 2019B. XLNet: Generalized autoregressive
pretraining for language understanding. arXiv
preprint arXiv:1906.08237.
Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun
Chen, Chris Brockett, Xiang Gao, Jianfeng
Gao, Jingjing Liu, and Bill Dolan. 2019.
DialoGPT: Large-scale generative pre-training
for conversational response generation.
Jinhua Zhu, Yingce Xia, Lijun Wu, Di He,
Tao Qin, Wengang Zhou, Houqiang Li, Und
Tie-Yan Liu. 2020. Incorporating BERT into
neural machine translation. arXiv preprint
arXiv:2002.06823. DOI: https://doi.org/10.18653/v1/2020.acl-demos.30