Multilingual Denoising Pre-training for Neural Machine Translation
Yinhan Liu‡∗, Jiatao Gu†∗, Naman Goyal†∗, Xian Li†, Sergey Edunov†,
Marjan Ghazvininejad†, Mike Lewis†, and Luke Zettlemoyer‡
†Facebook AI
‡Birch Technology
†{jgu,naman,xianl,edunov,ghazvini,mikelewis,lsz}@fb.com
‡yinhan@birch.ai
Abstract
This paper demonstrates that multilingual
denoising pre-training produces significant
performance gains across a wide variety of
machine translation (MT) tasks. We present
mBART—a sequence-to-sequence denoising
auto-encoder pre-trained on large-scale mono-
lingual corpora in many languages using the
BART objective (Lewis et al., 2019). mBART
is the first method for pre-training a complete
sequence-to-sequence model by denoising full
texts in multiple languages, whereas previous
approaches have focused only on the encoder,
decoder, or reconstructing parts of the text.
Pre-training a complete model allows it to
be directly fine-tuned for supervised (both
sentence-level and document-level) and un-
supervised machine translation, with no task-
specific modifications. We demonstrate that
adding mBART initialization produces per-
formance gains in all but the highest-resource
settings, including up to 12 BLEU points for
low resource MT and over 5 BLEU points
for many document-level and unsupervised
models. We also show that it enables transfer
to language pairs with no bi-text or that were
not in the pre-training corpus, and present
extensive analysis of which factors contribute
the most to effective pre-training.1
1 Introduction
Despite its wide adoption for other NLP tasks
(Devlin et al., 2019; Liu et al., 2019; Yang et al.,
2019b; Lewis et al., 2019; Raffel et al., 2019),
* Equal contribution. Most of the work was done when
the first author worked at Facebook.
1 Code and pre-trained models are available at https://github.com/pytorch/fairseq/tree/master/examples/mbart.
self-supervised pre-training is not yet common
practice in machine translation (MT). Existing
approaches (Lample and Conneau, 2019; Edunov
et al., 2019; Lewis et al., 2019; Raffel et al., 2019)
have been proposed either to partially pre-train
the model or to only focus on English corpora. In
this paper, we show that significant performance
gains are possible by pre-training a complete
autoregressive model with an objective that noises
and reconstructs full texts across many languages.
In this work, we present mBART—a multilingual
sequence-to-sequence (Seq2Seq) denoising
auto-encoder. mBART is trained by applying
the BART objective (Lewis et al., 2019) to large-scale
monolingual corpora across many languages. The
input texts are noised by masking phrases and
permuting sentences, and a single Transformer
(Vaswani et al., 2017) model is learned to recover
the texts. Different from other pre-training
approaches for MT (Lample and Conneau,
2019; Song et al., 2019), mBART pre-trains a
complete autoregressive Seq2Seq model. mBART
is trained once for all languages, providing a
set of parameters that can be fine-tuned for any
of the language pairs in both supervised and
unsupervised settings, without any task-specific or
language-specific modifications or initialization
schemes.
Extensive experiments demonstrate that this
simple approach works remarkably well. We first
focus on existing MT benchmarks. For supervised
sentence-level MT, mBART initialization leads to
significant gains (up to 12 BLEU points) across
low/medium-resource pairs (<10M bi-text pairs),
without sacrificing performance in high-resource
settings. These results further improve with back-
translation (BT), setting a new state-of-the-art
on WMT16 English-Romanian and the FloRes
test sets. For document-level MT, our document-
level pre-training improves results by up to 5.5 d-BLEU points.

A language id symbol <LID> is used as
the initial token to predict the sentence. It is also
possible to use other noise types, such as those in
Lample et al. (2018c), but we leave the exploration
of the optimal noising strategy to future work.
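To make the noising step concrete, here is a minimal Python sketch of the two noise functions used in this paper (sentence permutation and word-span masking); the Poisson span length, the 35% masking ratio, and the <mask> token name are illustrative assumptions rather than a statement of the exact pre-training configuration.

```python
import random
import numpy as np

def add_noise(sentences, mask_token="<mask>", span_lambda=3.5, mask_ratio=0.35):
    """Illustrative BART-style noising for one training instance.

    `sentences` is a list of token lists.  The Poisson span length and the
    35% masking ratio are assumed hyper-parameters used here for illustration."""
    # (1) sentence permutation
    sentences = [s[:] for s in sentences]
    random.shuffle(sentences)

    noised = []
    for tokens in sentences:
        budget = int(len(tokens) * mask_ratio)  # how many tokens to mask in this sentence
        while budget > 0 and len(tokens) > 1:
            # (2) word-span masking: a whole span is replaced by a single mask token
            span = min(max(1, int(np.random.poisson(span_lambda))), len(tokens) - 1)
            start = random.randrange(len(tokens) - span + 1)
            tokens = tokens[:start] + [mask_token] + tokens[start + span:]
            budget -= span
        noised.append(tokens)
    return noised
```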
Instance Format For each instance of a batch, we sample a language id symbol <LID>, and pack as many consecutive sentences as possible, sampled from the corresponding corpus of <LID>, until either it hits the document boundary or reaches the 512 max token length. Sentences in the instance are separated by the end-of-sentence (</s>) token. Then, we append the selected <LID> token to represent the end of this instance. Pre-training at the ‘‘multi-sentence’’ level enables us to work on both sentence and document translation.
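A minimal sketch of the instance-packing logic just described, assuming a generic tokenize function and the </s> and <LID> special tokens; it is illustrative and not the actual fairseq data pipeline.

```python
def pack_instances(sentences, lang_id, tokenize, max_tokens=512, eos="</s>"):
    """Pack consecutive sentences of one monolingual document into instances of
    at most `max_tokens` tokens, in the format: sent1 </s> sent2 </s> ... <LID>."""
    instances, current = [], []
    for sent in sentences:
        tokens = tokenize(sent) + [eos]            # sentences separated by </s>
        if current and len(current) + len(tokens) + 1 > max_tokens:
            instances.append(current + [lang_id])  # close the instance with <LID>
            current = []
        current.extend(tokens)
    if current:
        instances.append(current + [lang_id])
    return instances
```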
Optimization Our full model (including 25 languages) is trained on 256 Nvidia V100 GPUs (32GB) for 500K steps. The total batch size is around 128K tokens per GPU, matching the BART (Lewis et al., 2019) configuration. We use the Adam optimizer (ε = 1e−6, β2 = 0.98) and linear learning rate decay scheduling. The total training time was approximately 2.5 weeks. We started the training with dropout 0.1 and reduced it to 0.05 at 250K steps and 0 at 400K steps. All experiments
are done with Fairseq (Ott et al., 2019).
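For illustration, a PyTorch sketch of this optimization recipe (Adam with ε = 1e−6, β2 = 0.98, linear learning-rate decay, and the staged dropout schedule); the peak learning rate and warm-up length for pre-training are not stated in this section, so those values are placeholders.

```python
import torch

def build_optimizer_and_schedule(model, peak_lr=5e-4, warmup=10_000, total_steps=500_000):
    # Adam settings from the text; peak_lr and warmup are placeholder assumptions.
    optimizer = torch.optim.Adam(model.parameters(), lr=peak_lr,
                                 betas=(0.9, 0.98), eps=1e-6)

    def lr_lambda(step):
        # linear warm-up followed by linear decay to zero at total_steps
        if step < warmup:
            return step / max(1, warmup)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

def dropout_for_step(step):
    # staged dropout schedule from the text: 0.1, then 0.05 at 250K, then 0 at 400K
    if step < 250_000:
        return 0.1
    if step < 400_000:
        return 0.05
    return 0.0
```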
Reproducibility One potential issue of the proposed approach is the replicability problem due to the requirement of massive monolingual corpora and computational resources, with fine-grained selection of hyper-parameters during pre-training. It is likely that slightly different fine-tuning performance would be obtained if we re-trained the system. To address this, we will release the pre-trained checkpoints as well as the code with full instructions for pre-training a new model.
Related Work: XLM(-R) and MASS There are several closely related approaches of multilingual pre-training for machine translation. XLM (Lample and Conneau, 2019) and XLM-R (Conneau et al., 2019) pretrain BERT (Devlin et al., 2019; Liu et al., 2019) in a multilingual fashion, and the resulting parameters can be used to initialize the translation model encoder. Different from XLM(-R), mBART simultaneously pre-trains the encoder and the decoder due to the Seq2Seq setup, which is more natural to adapt to machine translation applications.
Similar to mBART, MASS (Song et al., 2019)
is also a Seq2Seq-based pre-training technique
with ‘‘word-masking’’. However, the decoder of MASS only predicted the tokens that were masked in the encoder, whereas mBART reconstructs the full target sequence, which allows it to apply not only ‘‘masking’’ but any possible noise function. Moreover, neither XLM nor MASS showed evidence of the pre-trained models improving translation performance over two languages.
2.3 Pre-trained Models
To better measure the effects of different levels
of multilinguality during pre-training, we built a
range of models as follows:
• mBART25 We pre-train a model on all
25 languages, using the setting described in
§2.2.
• mBART06 To explore the effect of pre-
training on related languages, we pretrain a
model on a subset of six European languages:
Ro, It, Cs, Fr, Es, and En. For a fair
comparación, we use ∼ 1/4 of the mBART25
batch size, which allows our model to have
the same number of updates per language
during pre-training.
• mBART02 We pre-train bilingual models,
using English and one other language for
four language pairs: En-De, En-Ro, En-It, and En-My.
We use a batch size of ∼ 1/12 of that in the
mBART25.
• BART-En/Ro To help establish a better
understanding towards multilingual pre-
training, we also train monolingual BART
models on the En and Ro corpus only,
respectivamente.
• Random As additional baselines, we will
also include a comparison with a model
randomly initialized without pre-training for
each translation task. Because the sizes
of different downstream datasets vary, we
always grid-search the hyper-parameters
(architecture, dropout, etc.) to find the best
non-pretrained configuration.
Figure 2: Framework for our multilingual denoising pre-training (left) and fine-tuning on downstream MT tasks (right), where we use (1) sentence permutation and (2) word-span masking as the injected noise. A special language id token is added at both the encoder and decoder. One multilingual pre-trained model is used for all tasks.
All models use the same vocabulary (§2.1). Not all tokens will frequently occur in all pre-training corpora, but later experiments show that this large vocabulary can improve generalization in multilingual settings even for unseen languages.
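The shared subword vocabulary (§2.1) is built with SentencePiece (Kudo and Richardson, 2018); a minimal usage sketch, with a placeholder model path, showing that one shared model tokenizes text in any language:

```python
import sentencepiece as spm

# Load one shared subword model for all 25 languages (the file name is a placeholder).
sp = spm.SentencePieceProcessor(model_file="sentence.bpe.model")

# The same model segments any input language, including ones unseen in pre-training.
print(sp.encode("Multilingual denoising pre-training.", out_type=str))
print(sp.encode("Ein Beispielsatz auf Deutsch.", out_type=str))
```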
2.4 Scaling-up Matters
Scaling-up the training data and model parameters
has been a key factor in pre-training (Devlin
et al., 2019; Conneau et al., 2019; Raffel et al., 2019). Compared to conventional semi-supervised methods (e.g., back-translation) and other pre-training for MT (Lample and Conneau, 2019; Song et al., 2019), we pre-train mBART on much more monolingual data with a relatively deeper architecture. This scale, in combination with the new multilingual training, is central to our results (sections 3 to 5), although future work could
more carefully study the relative contributions
of each.
3 Sentence-level Machine Translation
This section shows that mBART pre-training
provides consistent performance gains in low
to medium resource sentence-level MT settings,
including bi-text only and with back-translation, and outperforms other existing pre-training schemes (§3.2). We also present a detailed analysis
to understand better which factors contribute
the most to these gains (§3.3), and show that
pre-training can even improve performance for
languages not present in the pre-training data
(§3.4).
3.1 Experimental Settings
Datasets We gather 24 pairs of publicly avail-
able parallel corpora that cover all the languages
in CC25 (Cifra 1). Most pairs are from previous
WMT (Gu, Kk, Tr, Ro, Et, Lt, Fi, Lv, Cs, Es, Zh, De, Ru, Fr ↔ En) and IWSLT (Vi, Ja, Ko, Nl, Ar, It ↔ En) competitions. We also use FLoRes pairs (Guzmán et al., 2019, En-Ne and En-Si), En-Hi from IITB (Kunchukuttan et al., 2017), and En-My from WAT19 (Ding et al., 2018, 2019). We divide the datasets into three categories—low resource (<1M sentence pairs), medium resource (>1M and <10M), and high resource (>10M).
Fine-tuning & Decoding We fine-tune mBART
on a single pair of bi-text data, feeding the
source language into the encoder and decod-
ing the target language. As shown in Figure 2,
we load the pre-trained weights and train the MT
model on bi-texts with teacher forcing. For all
directions, we train with 0.3 dropout, 0.2 label
smoothing, 2500 warm-up steps, 3e−5 maximum
learning rate. We use a maximum of 40K training
updates for all low and medium resource pairs and
100K for high resource pairs. The final models are
selected based on validation likelihood. We use beam-search with beam size 5 for decoding. Our initial experiments indicate that the fine-tuning process is generally stable with different seeds. Therefore, to reduce the total computation, all our results are reported with a single execution. We
validate the statistical significance with scripts
from the mosesdecoder.3
3 https://github.com/moses-smt/mosesdecoder/blob/master/scripts/analysis/bootstrap-hypothesis-difference-significance.pl
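As a sketch of this fine-tuning recipe (teacher forcing with label smoothing 0.2), the following illustrative training step assumes a generic Seq2Seq model initialized from the pre-trained mBART weights and a simple batch dictionary; the model and batch interfaces are placeholders, not the exact fairseq implementation.

```python
import torch
import torch.nn.functional as F

def fine_tune_step(model, batch, optimizer, label_smoothing=0.2):
    """One supervised fine-tuning step with teacher forcing.

    `batch` is assumed to hold source tokens, shifted target inputs, gold target
    tokens, and a padding id; these field names are placeholders."""
    logits = model(src_tokens=batch["src"], prev_output_tokens=batch["tgt_in"])
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        batch["tgt_out"].reshape(-1),
        label_smoothing=label_smoothing,      # 0.2, as in the text
        ignore_index=batch.get("pad_id", 1),  # ignore padding positions
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```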
Languages     En-Gu    En-Kk    En-Vi    En-Tr    En-Ja    En-Ko
Data Source   WMT19    WMT19    IWSLT15  WMT17    IWSLT17  IWSLT17
Size          10k      91k      133k     207k     223k     230k
Direction     ←    →   ←    →   ←    →   ←    →   ←    →   ←    →
Random        0.0  0.0  0.8  0.2  23.6 24.8  12.2 9.5   10.4 12.3  15.3 16.3
mBART25       0.3  0.1  7.4  2.5  36.1 35.4  22.5 17.8  19.1 19.4  24.6 22.6

Languages     En-Nl    En-Ar    En-It    En-My    En-Ne    En-Ro
Data Source   IWSLT17  IWSLT17  IWSLT17  WAT19    FLoRes   WMT16
Size          237k     250k     250k     259k     564k     608k
Direction     ←    →   ←    →   ←    →   ←    →   ←    →   ←    →
Random        34.6 29.3  27.5 16.9  31.7 28.0  23.3 34.9  7.6  4.3   34.0 34.3
mBART25       43.3 34.8  37.6 21.6  39.8 34.0  28.3 36.9  14.5 7.4   37.8 37.7

Languages     En-Si    En-Hi    En-Et    En-Lt    En-Fi    En-Lv
Data Source   FLoRes   IITB     WMT18    WMT19    WMT17    WMT17
Size          647k     1.56M    1.94M    2.11M    2.66M    4.50M
Direction     ←    →   ←    →   ←    →   ←    →   ←    →   ←    →
Random        7.2  1.2   10.9 14.2  22.6 17.9  18.1 12.1  21.8 20.2  15.6 12.9
mBART25       13.7 3.3   23.5 20.8  27.8 21.4  22.4 15.3  28.5 22.4  19.3 15.9
Table 1: Low/medium resource machine translation. Pre-training consistently improves over a randomly initialized baseline, with particularly large gains on low-resource language pairs (e.g., Vi-En).
3.2 Main Results
As shown in Table 1, initializing with the pre-
trained mBART25 weights shows gains on all the
low and medium resource pairs when compared
with randomly initialized baselines. We observe
gains of 12 or more BLEU points on low
resource pairs such as En-Vi, En-Tr, and noisily
aligned pairs like En-Hi. Fine-tuning still fails
in extremely low-resource cases such as En-Gu,
which have ∼10k examples. In these settings,
unsupervised translation is more appropriate,
see §5.2. For high resource cases (Table 2), we do not observe consistent gains, and pre-training slightly hurts performance when more than 25M parallel sentences are available. When a significant amount of bi-text data is given, we
suspect that supervised training washes out the
pre-trained weights.
Note that some reported runs of our baseline
systems using the vanilla Transformers with
randomly initialized weights show noticeable gaps relative to the SoTA systems
reported in the original competitions.4 The differ-
ence is mainly because we train and search
4http://matrix.statmt.org/.
Languages   Cs     Es     Zh     De     Ru     Fr
Size        11M    15M    25M    28M    29M    41M
Random      16.5   33.2   35.0   30.9   31.5   41.4
mBART25     18.0   34.0   33.3   30.5   31.3   41.0

Table 2: High resource machine translation, where all the datasets are from their latest WMT competitions. We only evaluate our models on En-X translation.
the hyper-parameters for baselines on officially
provided bitext only without using any mono-
lingual corpus or multilingual adaptation. For instance, the SoTA score for En→Gu is 28.2 in WMT19, compared to 0 in Table 1. This is basically because the quality of the original bitext data is low, and the SoTA systems commonly used additional languages such as Hi to boost performance. Similar gaps can also be observed in pairs such as Kk-En and Lt-En, where Ru as an additional language is also crucial. The main purpose of this part is to
discuss the effects of multilingual pre-training in a
constrained bitext setting for a better comparison.
We will include more discussions of combining
Figure 3: Pre-training + back-translation on FLoRes with two iterations of BT.
multilingual translation with pretraining in future
work.
Plus Back-Translation Back-translation (BT;
Sennrich et al., 2016) is a standard approach
to augment bi-text with target-side monolingual
data. We combine our pre-training with BT and test it on low resource language pairs—En-Si and En-Ne—using the FLoRes dataset (Guzmán et al., 2019). We use the same monolingual data as Guzmán et al. (2019) to generate BT data. Figure 3 shows that initializing the model with our mBART25 pre-trained parameters improves BLEU scores at each iteration of back-translation, resulting in new state-of-the-art results in all four translation directions. This indicates that the pre-trained mBART weights can be directly plugged into an existing pipeline using BT.
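For clarity, a high-level sketch of one BT iteration as used here: the reverse model translates target-side monolingual data into synthetic sources, and the forward model is re-trained on the union of real and synthetic bi-text. The helper functions are placeholders.

```python
def back_translation_iteration(fwd_model, bwd_model, bitext, tgt_monolingual,
                               translate, fine_tune):
    """One BT iteration.  `translate(model, sentence)` and
    `fine_tune(model, pairs)` are placeholder helpers, not library APIs."""
    # The reverse (target-to-source) model generates synthetic source sentences.
    synthetic_sources = [translate(bwd_model, t) for t in tgt_monolingual]
    synthetic_pairs = list(zip(synthetic_sources, tgt_monolingual))
    # Re-train the forward model on real bi-text plus synthetic pairs.
    return fine_tune(fwd_model, bitext + synthetic_pairs)
```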
Compared with Other Pre-training Approaches We also compare our pre-trained models with recent self-supervised pre-training methods, as shown in Table 3. We consider En-Ro translation, the only pair with established results. Our mBART model outperforms all the other pre-trained models, both with and without BT augmentation. We also show comparisons with the conventional BART model trained on the same En and Ro data only. Both have improvements over baselines, although worse than mBART results, indicating that pre-training in a multilingual setting is essential. Moreover, combining
BT leads to additional gains, resulting in a new
state-of-the-art for Ro-En translation.
3.3 Analysis
We also present additional analyses, to better
quantify when our pre-training helps.
How many languages should you pre-train on?
We investigate when it is helpful for pre-training to
include languages other than the targeted language
pair that will be used during fine-tuning. Table 4
Model          Pre-training Data   Fine-tuning
                                   En→Ro   Ro→En   +BT
Random         None                34.3    34.0    36.8
XLM (2019)     En Ro               –       35.6    38.5
MASS (2019)    En Ro               –       –       39.1
BART (2019)    En                  –       –       38.0
XLM-R (2019)   CC100               35.6    35.8    –
BART-En        En                  36.0    35.8    37.4
BART-Ro        Ro                  37.6    36.8    38.1
mBART02        En Ro               38.5    38.5    39.9
mBART25        CC25                37.7    37.8    38.8

Table 3: Comparison with other pre-training approaches on WMT16 Ro-En.
Languages   De     Ro     It     My     En
Size/GB     66.6   61.4   30.2   1.6    300.8
mBART02     31.3   38.5   39.7   36.5
mBART06     –      38.5   39.3   –
mBART25     30.5   37.7   39.8   36.9

Table 4: Pretraining languages on En-X translation. The size refers to the size of monolingual data for X. The size of En is shown as reference. All the pretrained models were controlled to see the same number of English instances during training.
shows performance on four X-En pairs. Pre-
training on more languages helps most when the
target language monolingual data is limited (e.g., En-My, where the size of My is around 0.5% of En).

In contrast, when monolingual data is plentiful (De, Ro), pre-training on multiple languages slightly hurts the final results (<1 BLEU). In these cases, additional languages may reduce the capacity available for each test language. Additionally, the fact that mBART06 performs similarly to mBART02 on Ro-En suggests that pre-training with similar languages is particularly helpful.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
Decoding stops when <LID> is predicted. We use beam size 5 by default.
4 Document-level Machine Translation
We evaluate mBART on document-level machine
translation tasks, where the goal is to translate segments of text that contain more than one sentence (up to an entire document). During pre-training, we use document fragments of up to 512 tokens, allowing the models to learn dependencies between sentences. We show that this pre-training significantly improves document-level translation.
4.1 Experimental Settings
Datasets We evaluate performance on two
common document-level MT datasets: WMT19
En-De and TED15 Zh-En. For En-De, we use the
document data from WMT19 to train our model,
without any additional sentence-level data. The
Zh-En dataset is from IWSLT 2014 y 2015
(Cettolo et al., 2012, 2015). Following Miculicich
et al. (2018), we use 2010-2013 TED as the test
set.
Pre-processing We pre-process with the ap-
proach used in pre-training. For each block, sen-
tences are separated by end-of-sentence symbols (</s>) and the entire instance is ended with the specific language id (<LID>). Documents are split into 2–4 instances.
Baselines & Evaluation We train 4 models: a document-level (Doc-) MT model (§4.1) and a corresponding sentence-level (Sent-) MT model (§3.1) as the baseline, both with and without pre-training. We use mBART25 as the common pre-trained model for En-De and Zh-En. For En-De, even though our mBART25 Doc-MT model decodes multiple sentences together, the translated sentences can be aligned to the source sentences, which allows us to evaluate BLEU scores both at sentence-level (s-BLEU) and document-level (d-BLEU).5 For Zh-En, however,
we cannot produce the same number of translated
sentences as the reference due to alignment errors
in the test data. We only provide the d-BLEU
scores on this direction.
We also compare our models with Hierarchical
Attention Networks (HAN, Miculicich et al.,
2018) on Zh-En, which is the state-of-the-
art non-pretraining approach for document-level translation for this pair. They combine two layers of attention—first within and then across sentences.
5Standard BLEU scores match n-grams at sentence-level.
We also consider document-level where we match n-grams
over the whole document resulting in a slightly higher score.
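A small sketch of the s-BLEU / d-BLEU distinction in footnote 5, using sacrebleu: s-BLEU scores aligned sentences as usual, while d-BLEU concatenates the sentences of each document and matches n-grams over the whole document.

```python
import sacrebleu

def s_bleu(sys_sentences, ref_sentences):
    # Standard corpus BLEU over aligned sentence pairs.
    return sacrebleu.corpus_bleu(sys_sentences, [ref_sentences]).score

def d_bleu(sys_documents, ref_documents):
    # Document-level BLEU: join each document's sentences and match n-grams
    # over the whole document (typically a slightly higher score).
    sys_docs = [" ".join(doc) for doc in sys_documents]
    ref_docs = [" ".join(doc) for doc in ref_documents]
    return sacrebleu.corpus_bleu(sys_docs, [ref_docs]).score
```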
(a) Sentence- and document-level BLEU scores on En-De

Model     Random              mBART25
          s-BLEU   d-BLEU     s-BLEU   d-BLEU
Sent-MT   34.5     35.9       36.4     38.0
Doc-MT    ×        7.7        37.1     38.5

(b) Document-level BLEU scores on Zh-En

Model     Random    mBART25   HAN (2018)
          d-BLEU    d-BLEU    d-BLEU
Sent-MT   22.0      28.4      −
Doc-MT    3.2       29.6      24.0

Table 6: Document-level machine translation on En-De and Zh-En. (×) The randomly initialized Doc-MT model cannot produce translations aligned to the original sentences, so only document evaluation is possible.
4.2 Main Results
Mesa 6 shows the main results for both En-De and
Zh-En at both sentence-level and document-level.
Random vs. Pre-trained The MT models
initialized with pre-trained weights outperform
randomly initialized models by large margins, for both sentence-level and document-level training. Our mBART25 models (both Sent-MT and Doc-MT) also outperform HAN (Miculicich et al., 2018),6 despite the fact that they are not customized for document-level MT.
Sent-MT vs. Doc-MT For En-De and En-Zh,
the mBART25 Doc-MT models outperform
mBART25 fine-tuned at sentence-level by large
margins, reversing the trend seen for models
without pre-training. For both datasets, randomly initialized Doc-MT fails to work, resulting in much worse results than the sentence-level models. Such large performance gaps
indicate that pre-training is critical for document
level performance. It is in general difficult to
collect high-quality document-level data in large
quantities, suggesting that pre-training may be a
strong strategy for future work. We also include a
sampled example in Figure 6.
5 Unsupervised Machine Translation
In addition to supervised machine translation, we also evaluate our model on tasks where no bi-text is available for the target language pair. We define three types of unsupervised translation:

1. No bi-text of any kind. A common solution is to learn from back-translation (Artetxe et al., 2017; Lample et al., 2018c). We show that mBART provides a simple and effective initialization scheme for these methods (§5.1).
6d-BLEU is recomputed from the provided system output.
2. No bi-text for the target pair, but both
languages appear in bi-text corpora with other
pairs. This setup is common for multilingual MT systems (Johnson et al., 2017; Gu et al., 2019). In this paper, we limit our focus to building models for single language pairs, and leave discussions for multilingual MT to future work.

3. No bi-text for the target pair is available, but
there is bi-text for translating from some other
language into the target language. mBART
supports effective transfer, even if the source
language has no bi-text of any form (§5.2).
5.1 Unsupervised Machine Translation via
Back-Translation
Datasets We evaluate our pre-trained models
on En-De, En-Ne, and En-Si. En and De are both
European languages sharing many sub-words,
whereas Ne and Si are quite distinct from En. We use the same test sets as the supervised benchmarks (§3.1), and use the same pre-training data (CC25) for back-translation to avoid introducing new information.
Learning Following Lample and Conneau (XLM, 2019), we initialize the translation model with the mBART weights, and then learn to predict the monolingual sentences conditioned on source sentences generated by on-the-fly BT. Moreover, we constrain mBART to only generate tokens in the target language7 for the first 1000 steps of on-the-fly BT, to avoid it copying the source text.
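A sketch of this output constraint (see also footnote 7): tokens that are rare in the target monolingual corpus are masked out of the output distribution during the first 1000 on-the-fly BT steps. The exact thresholding below is an illustrative assumption.

```python
import torch

def build_target_vocab_mask(token_counts, vocab_size, min_fraction=0.01):
    """Allow only tokens whose relative frequency in the target monolingual
    corpus reaches `min_fraction` (a literal reading of the 1% in footnote 7)."""
    total = sum(token_counts.values())
    allowed = torch.zeros(vocab_size, dtype=torch.bool)
    for token_id, count in token_counts.items():
        if count / total >= min_fraction:
            allowed[token_id] = True
    return allowed

def constrain_logits(logits, allowed, step, max_constrained_steps=1000):
    # Only constrain generation during the first 1000 on-the-fly BT steps.
    if step < max_constrained_steps:
        logits = logits.masked_fill(~allowed, float("-inf"))
    return logits
```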
Results Table 7 shows the unsupervised trans-
lation results compared with non-pretrained models, as well as models with existing pre-training methods. Our models achieve large gains over non-pretrained models for all directions, and
7We mask out the output probability of predicted tokens
which appear less than 1% in the target monolingual corpus.
Figure 6: An example of document-level translation from mBART25 Sent-MT and Doc-MT, held out from the test set of TED15 Zh-En. The Doc-MT system produces a much more fluent and coherent translation, which is closer to the reference translation. For example, the Doc-MT model produces several ‘‘And’’s to connect sentences so that the text reads better, while the Sent-MT model does not use global context and produces sentences independently. Moreover, both systems produce much better translations than models without pre-training, where the non-pretrained Doc-MT model completely fails to produce readable translation output.
              En-De        En-Ne        En-Si
              ←     →      ←     →      ←     →
Random        21.0  17.2   0.0   0.0    0.0   0.0
XLM (2019)    34.3  26.4   0.5   0.1    0.1   0.1
MASS (2019)   35.2  28.3   –     –      –     –
mBART         34.0  29.8   10.0  4.4    8.2   3.9

Table 7: Unsupervised MT via BT between dissimilar languages.
outperform XLM significantly for dissimilar pairs
(En-Ne, En-Si) where the existing approaches
completely fail. For En-De, our model also
performs comparably against XLM and MASS.
5.2 Unsupervised Machine Translation via
Language Transfer
We also report results when the target language
appears in a bi-text with some other source
language.
Datasets We only consider X→En translation,
and choose the bitexts of 12 language pairs from
§3.1, covering Indic languages (Ne, Hi, Si, Gu), European languages (Ro, It, Cs, Nl), East Asian languages (Zh, Ja, Ko), and Arabic (Ar).
Results The pre-trained mBART25 model is fine-tuned on each language pair, and then evaluated on the rest of the pairs, as seen in Table 8. We also present the direct fine-tuning performance (§3) on the diagonal, for reference. We see transfer for all pairs with all fine-tuned models except Gu-En, where the supervised model completely fails (0.3 BLEU). In some cases we can achieve similar (Cs-En) or even much better (Ne-En, Gu-En) results compared with the supervised results. We also show an example of language
transfer in Figure 7.
As a comparison, we also apply the same proce-
dure on randomly initialized models without pre-
training, which always ends up with ≈ 0 BLEU.
This indicates that multilingual pre-training is
essential and produces universal representations
across languages, so that once the model learns to
translate one language to En, it learns to translate
all languages with similar representations.
When is language transfer useful? Table 8 also shows that the size of the transfer effect varies with the similarity of the languages. First, for most pairs, language transfer works better when fine-tuning is also conducted in the same language family, especially between Indic languages (Hi, Ne, Gu). However, significant vocabulary sharing is not required for effective transfer. For instance, Zh-En and It-En achieve the best transfer learning results on Ko-En and Ar-En, respectively. This is despite the low vocabulary overlap (even character overlap) between (Zh, Ko) and (It, Ar).
With BT We present a comparison of unsupervised MT with BT vs. language transfer in Table 9, where language transfer works better when there exists a close language to transfer from. Moreover, we show promising results for
combining these two techniques. We start from the
best transferred model and apply (iterative) BT on
the same monolingual corpus used in pre-training.
Mesa 9 presents the results with 1 iteration of BT.
We see improvements for all pairs. The complete
analysis of both methods is left as future work.
6 Related Work
Self-supervised Learning for Text Generation
This work inherits from the recent success brought
by pre-training for NLP applications (Peters et al.,
2018; Radford et al., 2018; Devlin et al., 2019;
Yang et al., 2019b; Liu et al., 2019), especially
for text generation (Radford et al., 2019; Song
et al., 2019; Dong et al., 2019; Raffel et al.,
2019; Lewis et al., 2019). The pre-trained models
are usually used as the initialization for fine-
tuning downstream tasks such as controllable
language modeling (Shirish Keskar et al., 2019),
summarization (Song et al., 2019; Liu and Lapata,
2019) and dialogue generation (Zhang et al.,
2019).
Specifically for machine translation, unsuper-
vised pre-training methods were also explored
to improve the performance. Qi et al. (2018)
investigated the application of pre-trained word
embeddings for MT; Ramachandran et al. (2017)
proposed to pre-train the encoder-decoder mod-
ules as two separate language models. Yang et al.
(2019a); Zhu et al. (2020) explored fusion ap-
proaches to incorporate the pre-trained BERT
weights to improve NMT training. In contrast to most prior work, we focus on pre-training one
denoising autoencoder, and adapt the weights of
the entire model for various MT applications.
Fine-tuning languages (columns) vs. testing languages (rows):

            Zh     Ja     Ko     Cs     Ro     Nl     It     Ar     Hi     Ne     Si     Gu
Domain      News   TED    TED    News   News   TED    TED    TED    News   Wiki   Wiki   Wiki
Zh          23.7   8.8    9.2    2.8    7.8    7.0    6.8    6.2    7.2    4.2    5.9    0.0
Ja          9.9    19.1   12.2   0.9    4.8    6.4    5.1    5.6    4.7    4.2    6.5    0.0
Ko          5.8    16.9   24.6   5.7    8.5    9.5    9.1    8.7    9.6    8.8    11.1   0.0
Cs          9.3    15.1   17.2   21.6   19.5   17.0   16.7   16.9   13.2   15.1   16.4   0.0
Ro          16.2   18.7   17.9   23.0   37.8   22.3   21.6   22.6   16.4   18.5   22.1   0.0
Nl          14.4   30.4   32.3   21.2   27.0   43.3   34.1   31.0   24.6   23.3   27.3   0.0
It          16.9   25.8   27.8   17.1   23.4   30.2   39.8   30.6   20.1   18.5   23.2   0.0
Ar          5.8    15.5   12.8   12.7   12.0   14.7   14.7   37.6   11.6   13.0   16.7   0.0
Hi          3.2    10.1   9.9    5.8    6.7    6.1    5.0    7.6    23.5   14.5   13.0   0.0
Ne          2.1    6.7    6.5    5.0    4.3    3.0    2.2    5.2    17.9   14.5   10.8   0.0
Si          5.0    5.7    3.8    3.8    1.3    0.9    0.5    3.5    8.1    8.9    13.7   0.0
Gu          8.2    8.5    4.7    5.4    3.5    2.1    0.0    6.2    13.8   13.5   12.8   0.3

Table 8: Unsupervised MT via language transfer on X-En translations. The model fine-tuned on one language pair is directly tested on another. Diagonal entries are the direct fine-tuning results; off-diagonal entries show language transfer, which tends to be strongest within similar language groups.
Figure 7: An example of unsupervised MT via language transfer. mBART models finetuned with Ko or
Zh are able to translate a Ja sentence to En almost as correctly as in the supervised case.
Source        Ro          Ne          Zh          Nl
Online BT     30.5        10.0        11.3        28.5
Transfer      23.0 (Cs)   17.9 (Hi)   9.2 (Ko)    34.1 (It)
Combined      33.9        22.1        15.0        35.4

Table 9: BT vs. language transfer for unsupervised MT for X-En translations. For language transfer, we present the best transferring scores together with the language transferred from.
Multilinguality in NLP tasks This work is also related to the continual trend of multilingual language learning, including aligning multilingual word embeddings (Mikolov et al., 2013; Chen and Cardie, 2018; Lample et al., 2018b) into a
universal space, and learning crosslingual models
(Wada and Iwata, 2018; Lample and Conneau,
2019; Conneau et al., 2019) to exploit shared
representations across languages.
For MT, the most relevant field is multilingual
traducción (Firat et al., 2016; Johnson et al.,
2017; Aharoni et al., 2019; Arivazhagan et al.,
2019) where the ultimate goal is to jointly train
one translation model that translates multiple language directions at the same time, and shares representations to improve the translation performance on low-resource languages (Gu et al., 2018). In this paper, we focus on multilingualism in the pre-training stage and fine-tune the learned model in the standard bilingual scenario.
Compared with multilingual translation, we do
not require parallel data across multiple languages
but only for the targeted direction, which improves the scalability to low-resource languages and specific domains.
Document Translation As one of the key applications, our work is also related to previous efforts for incorporating document-level context into neural machine translation (Wang et al., 2017; Jean et al., 2017; Tiedemann and Scherrer, 2017; Miculicich et al., 2018; Tu et al., 2018). Li et al. (2019) is the most relevant work that also utilized a pre-trained encoder (BERT) for handling longer context. However, the focus has been on designing new task-specific techniques, and doing sentence-level translation with a wider input context. To the best of our knowledge, our multilingual pre-trained model is the first that shows improved results on document-level translation with standard Seq2Seq models.
Unsupervised Translation This work also summarizes the previous efforts of learning to translate between languages without a direct parallel corpus. When no parallel data of any kind is available, Artetxe et al. (2017) and Lample et al. (2018a) proposed to jointly learn a denoising auto-encoder and back-translation from both directions, which, however, required good initialization and only worked well on similar language pairs. Wu et al. (2019) solve the problem by mining sentences from Wikipedia and using them as weakly supervised translation pairs. Similar to Lample and Conneau (2019) and Song et al. (2019), we follow the first approach and treat our pre-trained model as the initialization step. We also investigate unsupervised translation using language transfer, which is similar to Pourdamghani et al. (2019), where the authors generate translationese of the source language and train a system on high-resource languages to correct these intermediate utterances. It is also closely related to Conneau et al. (2018) and Artetxe et al. (2019) for cross-lingual representation learning, where we also show that the representation learned by mBART can be easily transferred between languages without supervised data.
7 Conclusion
We demonstrate that multilingual de-noising pre-
training is able to significantly improve both
supervised and unsupervised machine translation
at both the sentence level and document level.
We analyze when and how pre-training is most
effective and can be combined with other
approaches such as back-translation. Our results also show the transfer learning ability of the representations learned from multilingual pre-training.

In future work, we will scale up the current pre-training to more languages, for example, an
mBART100 model. The size of our model makes
it expensive to deploy in production—future work
will explore pre-training more efficient models.
Acknowledgments
We thank Marc’Aurelio Ranzato, Guillaume
Lample, Alexis Conneau, and Michael Auli
for sharing their expertise on low-resource and
unsupervised machine translation and Peng-Jen
Chen and Jiajun Shen for details about FloRes and
WAT datasets. We also thank our colleagues at
FAIR and FAIAR for valuable feedback.
References

Roee Aharoni, Melvin Johnson, and Orhan Firat. 2019. Massively multilingual neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3874–3884, Minneapolis, Minnesota. Association for Computational Linguistics.

Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, Wolfgang Macherey, Zhifeng Chen, and Yonghui Wu. 2019. Massively multilingual neural machine translation in the wild: Findings and challenges. CoRR, abs/1907.05019.

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2017. Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041. DOI: https://doi.org/10.18653/v1/D18-1399
Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2019. On the cross-lingual transferability of monolingual representations. arXiv preprint arXiv:1910.11856. DOI: https://doi.org/10.18653/v1/2020.acl-main.421

Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web inventory of transcribed and translated talks. In Conference of the European Association for Machine Translation, pages 261–268.

Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, Roldano Cattoni, and Marcello Federico. 2015. The IWSLT 2015 evaluation campaign. In International Workshop on Spoken Language Translation.

Xilun Chen and Claire Cardie. 2018. Unsupervised multilingual word embeddings. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 261–270, Brussels, Belgium. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D18-1024
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116. DOI: https://doi.org/10.18653/v1/2020.acl-main.747

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D18-1269
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL).

Chenchen Ding, Hnin Thu Zar Aye, Win Pa Pa, Khin Thandar Nwet, Khin Mar Soe, Masao Utiyama, and Eiichiro Sumita. 2019. Towards Burmese (Myanmar) morphological analysis: Syllable-based tokenization and part-of-speech tagging. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 19(1):5. DOI: https://doi.org/10.1145/3325885

Chenchen Ding, Masao Utiyama, and Eiichiro Sumita. 2018. NOVA: A feasible and flexible annotation system for joint tokenization and part-of-speech tagging. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 18(2):17. DOI: https://doi.org/10.1145/3276773

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197.

Sergey Edunov, Alexei Baevski, and Michael Auli. 2019. Pre-trained language model representations for language generation. arXiv preprint arXiv:1903.09722. DOI: https://doi.org/10.18653/v1/N19-1409

Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016. Multi-way, multilingual neural machine translation with a shared attention mechanism. In NAACL. DOI: https://doi.org/10.18653/v1/N16-1101

Jiatao Gu, Hany Hassan, Jacob Devlin, and Victor O. K. Li. 2018. Universal neural machine translation for extremely low resource languages. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 344–354, New Orleans, Louisiana. Association for Computational Linguistics.

Jiatao Gu, Yong Wang, Kyunghyun Cho, and Victor O. K. Li. 2019. Improved zero-shot neural machine translation via ignoring spurious correlations. arXiv preprint arXiv:1906.01181.

Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and Marc'Aurelio
Ranzato. 2019. The FLORES evaluation datasets for low-resource machine translation: Nepali–English and Sinhala–English. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6097–6110, Hong Kong, China. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D19-1632

Sébastien Jean, Stanislas Lauly, Orhan Firat, and Kyunghyun Cho. 2017. Does neural machine translation benefit from larger context? CoRR, abs/1704.05135.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351. DOI: https://doi.org/10.1162/tacl_a_00065

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D18-2012

Anoop Kunchukuttan, Pratik Mehta, and Pushpak Bhattacharyya. 2017. The IIT Bombay English-Hindi parallel corpus. CoRR, abs/1710.02855.

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.
Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018a. Unsupervised machine translation using monolingual corpora only. In International Conference on Learning Representations.

Guillaume Lample, Alexis Conneau, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018b. Word translation without parallel data. In International Conference on Learning Representations.

Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018c. Phrase-based & neural unsupervised machine translation. arXiv preprint arXiv:1804.07755. DOI: https://doi.org/10.18653/v1/D18-1549

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461. DOI: https://doi.org/10.18653/v1/2020.acl-main.703

Liangyou Li, Xin Jiang, and Qun Liu. 2019. Pretrained language models for document-level neural machine translation. arXiv preprint arXiv:1911.03110.

Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. arXiv preprint arXiv:1908.08345. DOI: https://doi.org/10.18653/v1/D19-1387

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Lesly Miculicich, Dhananjay Ram, Nikolaos Pappas, and James Henderson. 2018. Document-level neural machine translation with hierarchical attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2947–2954, Brussels, Belgium. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D18-1325

Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013. Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. FAIRSEQ: A
fast, extensible toolkit for sequence modeling. In North American Association for Computational Linguistics (NAACL): System Demonstrations. DOI: https://doi.org/10.18653/v1/N19-4009

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In North American Association for Computational Linguistics (NAACL).

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? arXiv preprint arXiv:1906.01502. DOI: https://doi.org/10.18653/v1/P19-1493

Nima Pourdamghani, Nada Aldarrab, Marjan Ghazvininejad, Kevin Knight, and Jonathan May. 2019. Translating translationese: A two-step approach to unsupervised machine translation. In ACL. DOI: https://doi.org/10.18653/v1/P19-1293

Ye Qi, Devendra Singh Sachan, Matthieu Felix, Sarguna Janani Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? arXiv preprint arXiv:1804.06323. DOI: https://doi.org/10.18653/v1/N18-2084

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning, OpenAI.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners, OpenAI.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Prajit Ramachandran, Peter J. Liu, and Quoc Le. 2017. Unsupervised pretraining for sequence to sequence learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 383–391. DOI: https://doi.org/10.18653/v1/D17-1039

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P16-1009

Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. In International Conference on Machine Learning (ICML).

Jörg Tiedemann and Yves Scherrer. 2017. Neural machine translation with extended context. In Proceedings of the Third Workshop on Discourse in Machine Translation, pages 82–92, Copenhagen, Denmark. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/W17-4811

Zhaopeng Tu, Yang Liu, Shuming Shi, and Tong Zhang. 2018. Learning to remember translation history with a continuous cache. Transactions of the Association for Computational Linguistics, 6:407–420. DOI: https://doi.org/10.1162/tacl_a_00029

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems.

Takashi Wada and Tomoharu Iwata. 2018. Unsupervised cross-lingual word embedding by multilingual neural language models. CoRR, abs/1809.02306. DOI: https://doi.org/10.18653/v1/P19-1300
Longyue Wang, Zhaopeng Tu, Andy Way, and Qun Liu. 2017. Exploiting cross-sentence context for neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2826–2831, Copenhagen, Denmark. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D17-1301

Zihan Wang, Stephen Mayhew, Dan Roth, et al. 2019. Cross-lingual ability of multilingual BERT: An empirical study. arXiv preprint arXiv:1912.07840.

Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. 2019. CCNet: Extracting high quality monolingual datasets from web crawl data. arXiv preprint arXiv:1911.00359.

Lijun Wu, Jinhua Zhu, Di He, Fei Gao, Xu Tan, Tao Qin, and Tie-Yan Liu. 2019. Machine translation with weakly paired bilingual documents.

Jiacheng Yang, Mingxuan Wang, Hao Zhou, Chengqi Zhao, Yong Yu, Weinan Zhang, and Lei Li. 2019a. Towards making the most of BERT in neural machine translation. arXiv preprint arXiv:1908.05672.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019b. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2019. DialoGPT: Large-scale generative pre-training for conversational response generation.

Jinhua Zhu, Yingce Xia, Lijun Wu, Di He, Tao Qin, Wengang Zhou, Houqiang Li, and Tie-Yan Liu. 2020. Incorporating BERT into neural machine translation. arXiv preprint arXiv:2002.06823. DOI: https://doi.org/10.18653/v1/2020.acl-demos.30