Multilingual Denoising Pre-training for Neural Machine Translation
Yinhan Liu‡∗, Jiatao Gu†∗, Naman Goyal†∗, Xian Li†, Sergey Edunov†,
Marjan Ghazvininejad†, Mike Lewis†, and Luke Zettlemoyer‡
†Facebook AI
‡Birch Technology
†{jgu,naman,xianl,edunov,ghazvini,mikelewis,lsz}@fb.com
‡yinhan@birch.ai
Abstract

This paper demonstrates that multilingual
denoising pre-training produces significant
performance gains across a wide variety of
machine translation (MT) tasks. We present
mBART—a sequence-to-sequence denoising
auto-encoder pre-trained on large-scale mono-
lingual corpora in many languages using the
BART objective (Lewis et al., 2019). mBART
is the first method for pre-training a complete
sequence-to-sequence model by denoising full
texts in multiple languages, whereas previous
approaches have focused only on the encoder,
decoder, or reconstructing parts of the text.
Pre-training a complete model allows it to
be directly fine-tuned for supervised (both
sentence-level and document-level) and un-
supervised machine translation, with no task-
specific modifications. We demonstrate that
adding mBART initialization produces per-
formance gains in all but the highest-resource
settings, including up to 12 BLEU points for
low resource MT and over 5 BLEU points
for many document-level and unsupervised
models. We also show that it enables transfer
to language pairs with no bi-text or that were
not in the pre-training corpus, and present
extensive analysis of which factors contribute
the most to effective pre-training.1
1 Introduction
Despite its wide adoption for other NLP tasks
(Devlin et al., 2019; Liu et al., 2019; Yang et al.,
2019b; Lewis et al., 2019; Raffel et al., 2019),
* Equal contribution. Most of the work was done when
the first author worked at Facebook.
1Code and pre-trained models are available at https://github.com/pytorch/fairseq/tree/master/examples/mbart.
self-supervised pre-training is not yet common
practice in machine translation (MT). Existing
approaches (Lample and Conneau, 2019; Edunov
et al., 2019; Lewis et al., 2019; Raffel et al., 2019)
have been proposed either to partially pre-train
the model or to focus only on English corpora. In
this paper, we show that significant performance
gains are possible by pre-training a complete
autoregressive model with an objective that noises
and reconstructs full texts across many languages.
In this work, we present mBART—a multilingual
sequence-to-sequence (Seq2Seq) denoising
auto-encoder. mBART is trained by applying
BART (Lewis et al., 2019) to large-scale
monolingual corpora across many languages. The
input texts are noised by masking phrases and
permuting sentences, and a single Transformer
(Vaswani et al., 2017) model is learned to re-
cover the texts. Different from other pre-training
approaches for machine translation (Lample and Conneau,
2019; Song et al., 2019), mBART pre-trains a
complete autoregressive Seq2Seq model. mBART
is trained once for all languages, providing a
set of parameters that can be fine-tuned for any
of the language pairs in both supervised and
unsupervised settings, without any task-specific or
language-specific modifications or initialization
schemes.
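As a rough illustration of this kind of denoising objective, the sketch below permutes sentence order and replaces random token spans with a single mask token. The mask ratio, span-length distribution, and toy tokenization are illustrative assumptions, not the exact settings used for mBART.

```python
import random

MASK = "<mask>"

def noise_instance(sentences, mask_ratio=0.35, avg_span=3.5, seed=0):
    """Toy BART-style noising for one training instance.

    (1) permute the order of sentences, then (2) replace random
    contiguous token spans with a single <mask> token until roughly
    `mask_ratio` of the tokens are covered. All constants here are
    illustrative assumptions.
    """
    rng = random.Random(seed)
    sents = sentences[:]
    rng.shuffle(sents)                          # (1) sentence permutation

    tokens = [t for s in sents for t in s]
    budget = int(round(mask_ratio * len(tokens)))
    out, i = [], 0
    while i < len(tokens):
        if budget > 0 and rng.random() < 0.15:  # start a masked span
            span = min(budget, len(tokens) - i,
                       max(1, int(rng.expovariate(1.0 / avg_span))))
            out.append(MASK)                    # (2) whole span -> one <mask>
            budget -= span
            i += span
        else:
            out.append(tokens[i])
            i += 1
    return out

# Two toy "sentences" given as token lists
print(noise_instance([["the", "cat", "sat", "."], ["it", "was", "happy", "."]]))
```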
Extensive experiments demonstrate that this
simple approach works remarkably well. We first
focus on existing MT benchmarks. For supervised
sentence-level MT, mBART initialization leads to
significant gains (up to 12 BLEU points) across
low/medium-resource pairs (<10M bi-text pairs),
without sacrificing performance in high-resource
settings. These results further improve with back-
translation (BT), setting a new state-of-the-art
on WMT16 English-Romanian and the FloRes
test sets. For document-level MT, our document-
level pre-training improves results by up to 5.5
d-BLEU points.

A language id symbol <LID> is used as
the initial token to predict the sentence. It is also
possible to use other noise types, such as those in
Lample et al. (2018c), but we leave the exploration
of the optimal noising strategy to future work.
Instance Format For each instance of a batch,
we sample a language id symbol <LID> first, and we
pack as many consecutive sentences as possible
sampled from the corresponding corpus of <LID>,
until either it hits the document boundary or
reaches the 512 max token length. Sentences in
the instance are separated by the end of sentence
(</S>) token. Then, we append the selected <LID>
token to represent the end of this instance. Pre-
training at ‘‘multi sentence’’ level enables us to
work on both sentence and document translation.
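A minimal sketch of this packing scheme is given below, assuming sentences are already tokenized; the simple token counting and the example language id string are stand-ins for the actual pipeline.

```python
def pack_instances(doc_sentences, lang_id, max_len=512):
    """Pack consecutive sentences of one document into training instances.

    `doc_sentences` is a list of tokenized sentences (lists of tokens)
    sampled from the corpus of `lang_id`. Sentences are separated by the
    end-of-sentence token </S>, and the selected language id token
    (e.g. "[en_XX]", a hypothetical spelling) closes each instance.
    """
    instances, current = [], []
    for sent in doc_sentences:
        # +2 accounts for the </S> separator and the final language id token
        if current and len(current) + len(sent) + 2 > max_len:
            instances.append(current + [lang_id])   # instance hit the length limit
            current = []
        current.extend(sent + ["</S>"])
    if current:
        instances.append(current + [lang_id])       # document boundary
    return instances

doc = [["Hello", "world", "."], ["How", "are", "you", "?"]]
print(pack_instances(doc, "[en_XX]"))
```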
Optimization Our full model (including 25
languages) is trained on 256 Nvidia V100 GPUs
(32GB) for 500K steps. The total batch size is
around 128K tokens per GPU, matching the BART
(Lewis et al., 2019) configuration. We use the
Adam optimizer (ε = 1e−6, β2 = 0.98) and linear
learning rate decay scheduling. The total training
time was approximately 2.5 weeks. We started the
training with dropout 0.1 and reduced it to 0.05 at
250K steps and 0 at 400K steps. All experiments
are done with Fairseq (Ott et al., 2019).
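The snippet below sketches these settings in generic PyTorch terms: Adam with ε = 1e−6 and β2 = 0.98, linear learning-rate decay after warmup, and the dropout annealing schedule (0.1 → 0.05 at 250K → 0 at 400K updates). The peak learning rate, warmup length, and the toy model are assumptions, since they are not stated in this passage.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import LambdaLR

model = nn.Transformer(d_model=512, nhead=8)   # stand-in for the actual mBART architecture

optimizer = torch.optim.Adam(model.parameters(),
                             lr=5e-4,          # peak LR: an assumption, not given in the text
                             betas=(0.9, 0.98),
                             eps=1e-6)

TOTAL_STEPS, WARMUP = 500_000, 10_000          # warmup length is also an assumption

def lr_lambda(step):
    if step < WARMUP:                          # linear warmup ...
        return step / max(1, WARMUP)
    # ... then linear decay to zero by 500K updates
    return max(0.0, (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP))

scheduler = LambdaLR(optimizer, lr_lambda)

def dropout_for_step(step):
    """Dropout annealing described in the text: 0.1 -> 0.05 -> 0."""
    if step < 250_000:
        return 0.1
    if step < 400_000:
        return 0.05
    return 0.0

def set_dropout(module, p):
    """Apply the scheduled dropout value to every Dropout layer."""
    for m in module.modules():
        if isinstance(m, nn.Dropout):
            m.p = p
```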
Reproducibility One potential issue of the
proposed approach is replicability: it requires
massive monolingual corpora and computational
resources, with fine-grained selection of hyper-
parameters during pre-training, and re-training
the system is likely to yield slightly different
fine-tuning performance. To tackle this, we will
release the pre-trained checkpoints as well as the
code with full instructions for pre-training a new
model.
Related Work: XLM(-R) and MASS There
are several closely related approaches to
multilingual pre-training for machine translation.
XLM (Lample and Conneau, 2019) and XLM-R
(Conneau et al., 2019) pretrain BERT (Devlin
et al., 2019; Liu et al., 2019) in a multilingual
fashion, and the resulting parameters can be used to
initialize the translation model encoder. Different
from XLM(-R), mBART simultaneously pre-
trains the encoder and the decoder due to its
Seq2Seq setup, which is more natural to adapt to
machine translation applications.

Similar to mBART, MASS (Song et al., 2019)
is also a Seq2Seq-based pre-training technique
with ‘‘word-masking’’. However, the decoder of
MASS only predicts the tokens that were masked in
the encoder, whereas mBART reconstructs the full
target sequence, which allows us to apply not only
‘‘masking’’ but any possible noise function.
Moreover, neither XLM nor MASS showed
evidence of the pre-trained models improving
translation performance over two languages.
2.3 Pre-trained Models
To better measure the effects of different levels
of multilinguality during pre-training, we built a
range of models as follows:
• mBART25 We pre-train a model on all
25 languages, using the setting described in
§2.2.

• mBART06 To explore the effect of pre-
training on related languages, we pretrain a
model on a subset of six European languages:
Ro, It, Cs, Fr, Es, and En. For a fair
comparison, we use ∼ 1/4 of the mBART25
batch size, which allows our model to have
the same number of updates per language
during pre-training.

• mBART02 We pre-train bilingual models,
using English and one other language, for
four language pairs: En-De, En-Ro, En-It,
and En-My. We use a batch size of ∼ 1/12
of that in the mBART25.

• BART-En/Ro To help establish a better
understanding of multilingual pre-training,
we also train monolingual BART models on
the En and Ro corpus only, respectively.

• Random As additional baselines, we also
include a comparison with a model randomly
initialized without pre-training for each
translation task. Because the sizes of different
downstream datasets vary, we always grid-
search the hyper-parameters (architecture,
dropout, etc.) to find the best non-pretrained
configuration.
Figure 2: Framework for our multilingual denoising pre-training (left) and fine-tuning on downstream
MT tasks (right), where we use (1) sentence permutation and (2) word-span masking as the injected
noise. A special language id token is added at both the encoder and decoder. One multilingual pre-trained
model is used for all tasks.
All models use the same vocabulary (§2.1). Not
all tokens will frequently occur in all pre-training
corpora, but later experiments show that this
large vocabulary can improve generalization in
multilingual settings even for unseen languages.
2.4 Scaling-up Matters
Scaling-up the training data and model parameters
has been a key factor in pre-training (Devlin
et al., 2019; Conneau et al., 2019; Raffel et al.,
2019). Compared to conventional semi-supervised
methods (e.g., back-translation) and other pre-
training approaches for MT (Lample and Conneau, 2019;
Song et al., 2019), we pre-train mBART on much
more monolingual data with a relatively deeper
architecture. This scale, in combination with the
new multilingual training, is central to our results
(Sections 3 to 5), although future work could
more carefully study the relative contributions
of each.
3 Sentence-level Machine Translation
This section shows that mBART pre-training
provides consistent performance gains in low
to medium resource sentence-level MT settings,
including bi-text only and with back-translation,
and outperforms other existing pre-training
schemes (§3.2). We also present a detailed analysis
to better understand which factors contribute
the most to these gains (§3.3), and show that
pre-training can even improve performance for
languages not present in the pre-training data
(§3.4).
3.1 Experimental Settings
Datasets We gather 24 pairs of publicly avail-
able parallel corpora that cover all the languages
in CC25 (Figure 1). Most pairs are from previous
WMT (Gu, Kk, Tr, Ro, Et, Lt, Fi, Lv, Cs, Es, Zh,
De, Ru, Fr ↔ En) and IWSLT (Vi, Ja, Ko, Nl,
Ar, It ↔ En) competitions. We also use FLoRes
pairs (Guzmán et al., 2019; En-Ne and En-Si),
En-Hi from IITB (Kunchukuttan et al., 2017), and
En-My from WAT19 (Ding et al., 2018, 2019).
We divide the datasets into three categories—low
resource (<1M sentence pairs), medium resource
(>1M and <10M), and high resource (>10M).
Fine-tuning & Decoding We fine-tune mBART
on a single pair of bi-text data, feeding the
source language into the encoder and decod-
ing the target language. As shown in Figure 2,
we load the pre-trained weights and train the MT
model on bi-texts with teacher forcing. For all
directions, we train with 0.3 dropout, 0.2 label
smoothing, 2500 warm-up steps, and a 3e−5 maximum
learning rate. We use a maximum of 40K training
updates for all low and medium resource pairs and
100K for high resource pairs. The final models are
selected based on validation likelihood. We use
beam-search with beam size 5 for decoding. Our
initial experiments indicate that the fine-tuning
process is generally stable with different seeds.
Therefore, to reduce the total computation, all
our results are reported from a single run. We
validate statistical significance with scripts
from the mosesdecoder.3
3https://github.com/moses-smt/mosesdecoder/blob/master/scripts/analysis/bootstrap-hypothesis-difference-significance.pl.
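For concreteness, the fragment below sketches the label-smoothed objective and warmup schedule implied by these settings (0.2 label smoothing, 2500 warm-up steps, 3e−5 maximum learning rate, 40K updates); it is a generic PyTorch rendering under those assumptions, not the exact fairseq implementation used in the paper.

```python
import torch
import torch.nn.functional as F

def label_smoothed_nll(logits, target, eps=0.2, ignore_index=1):
    """Label-smoothed cross-entropy with smoothing eps=0.2.

    logits: (batch, seq, vocab); target: (batch, seq) of token ids.
    Padding positions (ignore_index) are excluded from the loss.
    """
    lprobs = F.log_softmax(logits, dim=-1)
    nll = -lprobs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    smooth = -lprobs.mean(dim=-1)                  # uniform prior over the vocabulary
    pad = target.eq(ignore_index)
    nll = nll.masked_fill(pad, 0.0)
    smooth = smooth.masked_fill(pad, 0.0)
    return ((1.0 - eps) * nll + eps * smooth).sum() / (~pad).sum()

def lr_at(step, max_lr=3e-5, warmup=2500, total=40_000):
    """Linear warmup to 3e-5 over 2500 steps, then a simple decay to zero
    by 40K updates (the decay shape after warmup is an assumption)."""
    if step < warmup:
        return max_lr * step / warmup
    return max_lr * max(0.0, (total - step) / (total - warmup))
```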
Languages     En-Gu      En-Kk      En-Vi      En-Tr      En-Ja      En-Ko
Data Source   WMT19      WMT19      IWSLT15    WMT17      IWSLT17    IWSLT17
Size          10K        91K        133K       207K       223K       230K
Direction     ←    →     ←    →     ←    →     ←    →     ←    →     ←    →
Random        0.0  0.0   0.8  0.2   23.6 24.8  12.2 9.5   10.4 12.3  15.3 16.3
mBART25       0.3  0.1   7.4  2.5   36.1 35.4  22.5 17.8  19.1 19.4  24.6 22.6

Languages     En-Nl      En-Ar      En-It      En-My      En-Ne      En-Ro
Data Source   IWSLT17    IWSLT17    IWSLT17    WAT19      FLoRes     WMT16
Size          237K       250K       250K       259K       564K       608K
Direction     ←    →     ←    →     ←    →     ←    →     ←    →     ←    →
Random        34.6 29.3  27.5 16.9  31.7 28.0  23.3 34.9  7.6  4.3   34.0 34.3
mBART25       43.3 34.8  37.6 21.6  39.8 34.0  28.3 36.9  14.5 7.4   37.8 37.7

Languages     En-Si      En-Hi      En-Et      En-Lt      En-Fi      En-Lv
Data Source   FLoRes     IITB       WMT18      WMT19      WMT17      WMT17
Size          647K       1.56M      1.94M      2.11M      2.66M      4.50M
Direction     ←    →     ←    →     ←    →     ←    →     ←    →     ←    →
Random        7.2  1.2   10.9 14.2  22.6 17.9  18.1 12.1  21.8 20.2  15.6 12.9
mBART25       13.7 3.3   23.5 20.8  27.8 21.4  22.4 15.3  28.5 22.4  19.3 15.9

Table 1: Low/medium resource machine translation. Pre-training consistently improves over a
randomly initialized baseline, with particularly large gains on low resource language pairs (e.g.,
Vi-En).
3.2 Main Results
As shown in Table 1, initializing with the pre-
trained mBART25 weights shows gains on all the
low and medium resource pairs when compared
with randomly initialized baselines. We observe
gains of 12 or more BLEU points on low
resource pairs such as En-Vi, En-Tr, and noisily
aligned pairs like En-Hi. Fine-tuning still fails
in extremely low-resource cases such as En-Gu,
which has only ∼10K examples. In these settings,
unsupervised translation is more appropriate;
see §5.2. For high resource cases (Table 2),
we do not observe consistent gains, and pre-
training slightly hurts performance when more
than 25M parallel sentences are available. When
a significant amount of bi-text data is given, we
suspect that supervised training washes out the
pre-trained weights.

Note that some reported runs of our baseline
systems using vanilla Transformers with
randomly initialized weights show considerable
gaps compared with the SoTA systems
reported in the original competitions.4 The differ-
ence is mainly because we train and search
4http://matrix.statmt.org/.
Languages   Cs     Es     Zh     De     Ru     Fr
Size        11M    15M    25M    28M    29M    41M
RANDOM      16.5   33.2   35.0   30.9   31.5   41.4
MBART25     18.0   34.0   33.3   30.5   31.3   41.0

Table 2: High resource machine translation,
where all the datasets are from their latest WMT
competitions. We only evaluate our models on
En-X translation.
the hyper-parameters for the baselines on the officially
provided bitext only, without using any mono-
lingual corpus or multilingual adaptation. For
instance, the SoTA score for En→Gu is 28.2
in WMT19, compared with 0 in Table 1. This is
basically because the quality of the original bitext
data is low, and the SoTA systems commonly
used additional languages such as Hi to boost
performance. Similar gaps can also be
observed in pairs such as Kk-En and Lt-En,
where Ru as an additional language is also
crucial. The main purpose of this part is to
discuss the effects of multilingual pre-training in a
constrained bitext setting for a better comparison.
We will include more discussions of combining
multilingual translation with pretraining in future
work.
Figure 3: Pre-training + back-translation on FLoRes with two iterations of BT.
Plus Back-Translation Back-translation (BT;
Sennrich et al., 2016) is a standard approach
to augment bi-text with target-side monolingual
data. We combine our pre-training with BT and
test it on low resource language pairs—En-Si
and En-Ne—using the FLoRes dataset (Guzmán
et al., 2019). We use the same monolingual data
as Guzmán et al. (2019) to generate BT data.
Figure 3 shows that initializing the model with
our mBART25 pre-trained parameters improves
BLEU scores at each iteration of back-translation,
resulting in new state-of-the-art results in all four
translation directions. This indicates that the pre-
trained mBART weights can be directly plugged
into existing pipelines using BT.
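At a high level, one BT iteration looks like the sketch below; `translate` and `train` are placeholder callables for decoding and fine-tuning, not real APIs from the paper or from fairseq.

```python
def back_translation_round(forward_model, backward_model,
                           target_monolingual, real_bitext,
                           translate, train):
    """One back-translation iteration, e.g. for En->Ne.

    The backward (Ne->En) model translates Ne monolingual text into
    synthetic En sources, and the forward (En->Ne) model is re-trained
    on the real bi-text plus the synthetic pairs. `translate(model,
    sentences)` and `train(model, pairs)` are hypothetical helpers.
    """
    synthetic_sources = translate(backward_model, target_monolingual)
    synthetic_pairs = list(zip(synthetic_sources, target_monolingual))
    return train(forward_model, real_bitext + synthetic_pairs)
```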
Comparison with Other Pre-training
Approaches We also compare our pre-trained
models with recent self-supervised pre-training
methods, as shown in Table 3. We consider En-Ro
translation, the only pair with established results.
Our mBART model outperforms all the other
pre-trained models, both with and without BT
augmentation. We also show comparisons with
the conventional BART model trained on the same
En and Ro data only. Both have improvements
over baselines, although worse than the mBART
results, indicating that pre-training in a multilin-
gual setting is essential. Moreover, combining
BT leads to additional gains, resulting in a new
state-of-the-art for Ro-En translation.

3.3 Analysis

We also present additional analyses, to better
quantify when our pre-training helps.

How many languages should you pre-train on?
We investigate when it is helpful for pre-training to
include languages other than the targeted language
pair that will be used during fine-tuning.
               Pre-training       Fine-tuning
Model          Data        En→Ro   Ro→En   +BT
RANDOM         None        34.3    34.0    36.8
XLM (2019)     En Ro       –       35.6    38.5
MASS (2019)    En Ro       –       –       39.1
BART (2019)    En          –       –       38.0
XLM-R (2019)   CC100       35.6    35.8    –
BART-EN        En          36.0    35.8    37.4
BART-RO        Ro          37.6    36.8    38.1
MBART02        En Ro       38.5    38.5    39.9
MBART25        CC25        37.7    37.8    38.8

Table 3: Comparison with other pre-training
approaches on WMT16 Ro-En.
Languages   De      Ro      It      My      En
Size/GB     66.6    61.4    30.2    1.6     300.8
mBART02     31.3    38.5    39.7    36.5
mBART06     –       38.5    39.3    –
mBART25     30.5    37.7    39.8    36.9

Table 4: Pretraining languages on En-X trans-
lation. The size refers to the size of monolingual
data for X. The size of En is shown as a reference.
All the pretrained models were controlled to see
the same number of English instances during
training.
Table 4 shows performance on four X-En pairs. Pre-
training on more languages helps most when the
target language monolingual data is limited (e.g.,
En-My, where the size of My is around 0.5%
of En).

In contrast, when monolingual data is plentiful
(De, Ro), pre-training on multiple languages
slightly hurts the final results (<1 BLEU). In
these cases, additional languages may reduce
the capacity available for each test language.
Additionally, the fact that mBART06 performs
similarly to mBART02 on Ro-En suggests that
pre-training with similar languages is particularly
helpful.
the language id <LID> is predicted. We use beam size 5 by default.
4 Document-level Machine Translation
We evaluate mBART on document-level machine
translation tasks, where the goal is to translate
segments of text that contain more than one
sentence (up to an entire document). During pre-
training, we use document fragments of up to 512
tokens, allowing the models to learn dependencies
between sentences. We show that this pre-
training significantly improves document-level
translation.
4.1 Experimental Settings
Datasets We evaluate performance on two
common document-level MT datasets: WMT19
En-De and TED15 Zh-En. For En-De, we use the
document data from WMT19 to train our model,
without any additional sentence-level data. The
Zh-En dataset is from IWSLT 2014 and 2015
(Cettolo et al., 2012, 2015). Following Miculicich
et al. (2018), we use 2010–2013 TED as the test
set.
Pre-processing We pre-process with the ap-
proach used in pre-training. For each block, sen-
tences are separated by end of sentence symbols
(</S>) and the entire instance is ended with
the specific language id (<LID>). On average,
documents are split into 2–4 instances.
Baselines & Evaluation We train 4 models: a
document-level (Doc-) MT model (§4.1) and a
corresponding sentence-level (Sent-) MT model
(§3.1) as the baseline, both with and without
pre-training. We use mBART25 as the common
pre-trained model for En-De and Zh-En. For
En-De, even though our mBART25 Doc-MT
model decodes multiple sentences together, the
translated sentences can be aligned to the source
sentences, which allows us to evaluate BLEU
scores both at the sentence level (s-BLEU) and
the document level (d-BLEU).5 For Zh-En, however,
we cannot produce the same number of translated
sentences as the reference due to alignment errors
in the test data. We only provide the d-BLEU
scores for this direction.

We also compare our models with Hierarchical
Attention Networks (HAN, Miculicich et al.,
2018) on Zh-En, which is the state-of-the-
art non-pretraining approach for document-level
translation for this pair. It combines two
layers of attention—first within and then across
sentences.
5Standard BLEU scores match n-grams at the sentence level.
We also consider a document-level variant where we match n-grams
over the whole document, resulting in a slightly higher score.
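To make the s-BLEU/d-BLEU distinction concrete, the snippet below computes both with sacrebleu, treating a document as the concatenation of its sentences. This is one plausible reading of the footnote, not the authors' released evaluation script.

```python
import sacrebleu

def s_and_d_bleu(hyp_docs, ref_docs):
    """hyp_docs / ref_docs: lists of documents, each a list of sentence strings."""
    # s-BLEU: match n-grams sentence by sentence
    hyp_sents = [s for doc in hyp_docs for s in doc]
    ref_sents = [s for doc in ref_docs for s in doc]
    s_bleu = sacrebleu.corpus_bleu(hyp_sents, [ref_sents]).score

    # d-BLEU: match n-grams over whole documents (sentences joined)
    hyp_joined = [" ".join(doc) for doc in hyp_docs]
    ref_joined = [" ".join(doc) for doc in ref_docs]
    d_bleu = sacrebleu.corpus_bleu(hyp_joined, [ref_joined]).score
    return s_bleu, d_bleu
```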
(a) Sentence- and document-level BLEU scores on En-De

Model      Random             mBART25
           s-BLEU   d-BLEU    s-BLEU   d-BLEU
Sent-MT    34.5     35.9      36.4     38.0
Doc-MT     ×        7.7       37.1     38.5

(b) Document-level BLEU scores on Zh-En

Model      Random    mBART25   HAN (2018)
           d-BLEU    d-BLEU    d-BLEU
Sent-MT    22.0      28.4      –
Doc-MT     3.2       29.6      24.0

Table 6: Document-level machine translation on En-De and Zh-En. (×) The randomly initialized Doc-
MT model cannot produce translations aligned to the original sentences, so only document evaluation
is possible.
4.2 Main Results
Table 6 shows the main results for both En-De and
Zh-En at both the sentence level and the document level.

Random vs. Pre-trained The MT models
initialized with pre-trained weights outperform
randomly initialized models by large margins, for
both sentence-level and document-level training.
Our mBART25 models (both Sent-MT and Doc-
MT) also outperform HAN (Miculicich et al.,
2018),6 despite the fact that they are not
customized for document-level MT.

Sent-MT vs. Doc-MT For En-De and En-Zh,
the mBART25 Doc-MT models outperform
mBART25 fine-tuned at the sentence level by large
margins, reversing the trend seen for models
without pre-training. For both datasets, randomly
initialized Doc-MT fails to work, resulting
in much worse results than the sentence-
level models. Such large performance gaps
indicate that pre-training is critical for document-
level performance. It is in general difficult to
collect high-quality document-level data in large
quantities, suggesting that pre-training may be a
strong strategy for future work. We also include a
sampled example in Figure 6.
5 Unsupervised Machine Translation
In addition to supervised machine translation, we
also evaluate our model on tasks where no bi-text
is available for the target language pair. We define
three types of unsupervised translation:

1. No bi-text of any kind. A common solution
is to learn from back-translation (Artetxe
et al., 2017; Lample et al., 2018c). We
show that mBART provides a simple and
effective initialization scheme for these
methods (§5.1).

6d-BLEU is recomputed from the provided system output.
2. No bi-text for the target pair, but both
languages appear in bi-text corpora with other
pairs. This setup is common for multilingual
MT systems (Johnson et al., 2017; Gu et al.,
2019). In this paper, we limit our focus to
building models for single language pairs,
and leave discussion of multilingual MT to
future work.
3. No bi-text for the target pair is available, 但
there is bi-text for translating from some other
language into the target language. mBART
supports effective transfer, even if the source
language has no bi-text of any form (§5.2).
5.1 Unsupervised Machine Translation via
Back-Translation
Datasets We evaluate our pre-trained models
on En-De, En-Ne, and En-Si. En and De are both
European languages sharing many sub-words,
whereas Ne and Si are quite distinct from En. We
use the same test sets as the supervised benchmarks
(§3.1), and use the same pre-training data (CC25)
for back-translation to avoid introducing new
information.
Learning Following Lample and Conneau
(XLM, 2019), we initialize the translation model
with the mBART weights, and then learn to
predict the monolingual sentences conditioned
on source sentences generated by on-the-fly BT.
Moreover, we constrain mBART to only gen-
erate tokens in the target language7 for the first
1000 steps of on-the-fly BT, to avoid it simply copying
the source text.
Results Table 7 shows the unsupervised trans-
lation results compared with non-pretrained mod-
els, as well as models with existing pre-training
methods. Our models achieve large gains over
non-pretrained models for all directions, and
which appear less than 1% in the target monolingual corpus.
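One way to realize this constraint is to pre-compute the set of sufficiently frequent target-language tokens and mask everything else out of the decoder's output distribution during the first 1000 on-the-fly BT steps. The sketch below follows the 1% threshold stated in the footnote; the function names and tensor handling are otherwise assumptions.

```python
import torch
from collections import Counter

def allowed_token_mask(monolingual_token_ids, vocab_size, min_freq_ratio=0.01):
    """Boolean mask over the vocabulary: True for tokens whose relative
    frequency in the target-language monolingual corpus is at least 1%
    (the threshold given in the footnote)."""
    counts = Counter(tok for sent in monolingual_token_ids for tok in sent)
    total = sum(counts.values())
    mask = torch.zeros(vocab_size, dtype=torch.bool)
    for tok, c in counts.items():
        if c / total >= min_freq_ratio:
            mask[tok] = True
    return mask

def constrain_logits(logits, mask):
    """Set logits of disallowed tokens to -inf before beam search/sampling,
    so only target-language tokens can be generated."""
    return logits.masked_fill(~mask, float("-inf"))
```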
Figure 6: An example of document-level translation from mBART25 Sent-MT and Doc-MT, held out
from the test set of TED15 Zh-En. The Doc-MT system produces a much more fluent and coherent translation,
which is closer to the reference. For example, the Doc-MT model produces several ‘‘And’’s to
connect sentences so that the text reads better, while the Sent-MT model does not use global context
and produces sentences independently. In addition, both systems produce much better translations
than models without pre-training, where the non-pretrained Doc-MT model completely fails to produce
readable output.
              En-De         En-Ne         En-Si
              ←     →       ←     →       ←     →
Random        21.0  17.2    0.0   0.0     0.0   0.0
XLM (2019)    34.3  26.4    0.5   0.1     0.1   0.1
MASS (2019)   35.2  28.3    –     –       –     –
mBART         34.0  29.8    10.0  4.4     8.2   3.9

Table 7: Unsupervised MT via BT between
dissimilar languages.
outperform XLM significantly for dissimilar pairs
(En-Ne, En-Si), where the existing approaches
completely fail. For En-De, our model also
performs comparably to XLM and MASS.
5.2 Unsupervised Machine Translation via
Language Transfer
We also report results when the target language
appears in a bi-text with some other source
语言.
Datasets We only consider X→En translation,
and choose the bitexts of 12 language pairs from
§3.1, covering Indic languages (Ne, Hi, Si, Gu),
European languages (Ro, It, Cs, Nl), East Asian
languages (Zh, Ja, Ko), and Arabic (Ar).
Results The pre-trained mBART25 model is
fine-tuned on each language pair, and then
evaluated on the rest of the pairs, as shown in Table 8.
We also present the direct fine-tuning performance
(§3) on the diagonal, for reference. We see transfer
for all pairs with all fine-tuned models, except from
Gu-En, where the supervised model completely
fails (0.3 BLEU). In some cases we can achieve
similar (Cs-En) or even much better (Ne-En,
Gu-En) results compared with the supervised
results. We also show an example of language
transfer in Figure 7.
As a comparison, we also apply the same proce-
dure to randomly initialized models without pre-
training, which always ends up with ≈ 0 BLEU.
This indicates that multilingual pre-training is
essential and produces universal representations
across languages, so that once the model learns to
translate one language to En, it learns to translate
all languages with similar representations.
When is language transfer useful? Table 8
also shows that the size of the transfer effects varies
with the similarity of the languages. First, for
most pairs, language transfer works better when
fine-tuning is also conducted in the same language
family, especially between Indic languages (Hi,
Ne, Gu). However, significant vocabulary sharing
is not required for effective transfer. For example,
Zh-En and It-En achieve the best transfer learning
results on Ko-En and Ar-En, respectively. This is
despite the low vocabulary overlap (even character
overlap) between (Zh, Ko) and (It, Ar).
With BT We present a comparison of unsuper-
vised MT with BT vs. language transfer in Table 9.
Language transfer works better when there
exists a closely related language to transfer from.
Moreover, we show promising results from
combining these two techniques. We start from the
best transferred model and apply (iterative) BT on
the same monolingual corpus used in pre-training.
Table 9 presents the results with 1 iteration of BT.
We see improvements for all pairs. A complete
analysis of both methods is left as future work.
6 Related Work
Self-supervised Learning for Text Generation
This work builds on the recent success brought
by pre-training for NLP applications (Peters et al.,
2018; Radford et al., 2018; Devlin et al., 2019;
Yang et al., 2019b; Liu et al., 2019), especially
for text generation (Radford et al., 2019; Song
et al., 2019; Dong et al., 2019; Raffel et al.,
2019; Lewis et al., 2019). The pre-trained models
are usually used as the initialization for fine-
tuning on downstream tasks such as controllable
language modeling (Shirish Keskar et al., 2019),
summarization (Song et al., 2019; Liu and Lapata,
2019), and dialogue generation (Zhang et al.,
2019).
Specifically for machine translation, unsuper-
vised pre-training methods have also been explored
to improve performance. Qi et al. (2018)
investigated the application of pre-trained word
embeddings for MT; Ramachandran et al. (2017)
proposed to pre-train the encoder and decoder mod-
ules as two separate language models. Yang et al.
(2019a) and Zhu et al. (2020) explored fusion ap-
proaches to incorporate pre-trained BERT
weights to improve NMT training. In contrast
to most prior work, we focus on pre-training one
denoising autoencoder, and adapt the weights of
the entire model for various MT applications.
Testing                              Fine-tuning Languages
Languages   Zh     Ja     Ko     Cs     Ro     Nl     It     Ar     Hi     Ne     Si     Gu
Domain      News   TED    TED    News   News   TED    TED    TED    News   Wiki   Wiki   Wiki
ZH          23.7   8.8    9.2    2.8    7.8    7.0    6.8    6.2    7.2    4.2    5.9    0.0
JA          9.9    19.1   12.2   0.9    4.8    6.4    5.1    5.6    4.7    4.2    6.5    0.0
KO          5.8    16.9   24.6   5.7    8.5    9.5    9.1    8.7    9.6    8.8    11.1   0.0
CS          9.3    15.1   17.2   21.6   19.5   17.0   16.7   16.9   13.2   15.1   16.4   0.0
RO          16.2   18.7   17.9   23.0   37.8   22.3   21.6   22.6   16.4   18.5   22.1   0.0
NL          14.4   30.4   32.3   21.2   27.0   43.3   34.1   31.0   24.6   23.3   27.3   0.0
IT          16.9   25.8   27.8   17.1   23.4   30.2   39.8   30.6   20.1   18.5   23.2   0.0
AR          5.8    15.5   12.8   12.7   12.0   14.7   14.7   37.6   11.6   13.0   16.7   0.0
HI          3.2    10.1   9.9    5.8    6.7    6.1    5.0    7.6    23.5   14.5   13.0   0.0
NE          2.1    6.7    6.5    5.0    4.3    3.0    2.2    5.2    17.9   14.5   10.8   0.0
SI          5.0    5.7    3.8    3.8    1.3    0.9    0.5    3.5    8.1    8.9    13.7   0.0
GU          8.2    8.5    4.7    5.4    3.5    2.1    0.0    6.2    13.8   13.5   12.8   0.3

Table 8: Unsupervised MT via language transfer on X-En translations. The model fine-tuned on one
language pair is directly tested on another. We use gray color to show the direct fine-tuning results,
and lightgray color to show language transfer within similar language groups. We bold the highest
transferring score for each pair.
Figure 7: An example of unsupervised MT via language transfer. mBART models fine-tuned on Ko or
Zh are able to translate a Ja sentence into En almost as correctly as in the supervised case.
Source   online BT   Transfer       Combined
Ro       30.5        23.0 ( Cs )    33.9
Ne       10.0        17.9 ( Hi )    22.1
Zh       11.3        9.2  ( Ko )    15.0
Nl       28.5        34.1 ( It )    35.4

Table 9: BT vs. language transfer for unsupervised
MT on X-En translations. For language transfer,
we present the best transferring scores together
with the language transferred from.
Multilinguality in NLP tasks This work is also
related to the continual trend of multilingual
language learning, including aligning multilingual
word embeddings (Mikolov et al., 2013; Chen
and Cardie, 2018; Lample et al., 2018b) into a
universal space, and learning cross-lingual models
(Wada and Iwata, 2018; Lample and Conneau,
2019; Conneau et al., 2019) to exploit shared
cross-lingual representations.
For MT, the most relevant field is multilingual
translation (Firat et al., 2016; Johnson et al.,
2017; Aharoni et al., 2019; Arivazhagan et al.,
2019), where the ultimate goal is to jointly train
one translation model that translates multiple
language directions at the same time, and
shares representations to improve the translation
performance on low-resource languages (Gu et al.,
2018). In this paper, we focus on multilingualism
in the pre-training stage and fine-tune the
learned model in the standard bilingual scenario.
Compared with multilingual translation, we do
not require parallel data across multiple languages
but only for the targeted direction, which improves
scalability to low-resource languages and specific
domains.
Document Translation As one of the key
applications, our work is also related to previous
efforts for incorporating document-level context
into neural machine translation (Wang et al.,
2017; Jean et al., 2017; Tiedemann and Scherrer,
2017; Miculicich et al., 2018; Tu et al., 2018).
Li et al. (2019) is the most relevant work,
which also utilized a pre-trained encoder (BERT)
for handling longer context. However, the
focus has been on designing new task-specific
techniques, and doing sentence-level translation
with a wider input context. To the best of our
knowledge, our multilingual pre-trained model
is the first that shows improved results on
document-level translation with standard Seq2Seq
models.
Unsupervised Translation This work also
builds on previous efforts to learn to
translate between languages without a direct
parallel corpus. When no parallel data of any
kind is available, Artetxe et al. (2017) and
Lample et al. (2018a) proposed to jointly learn a
denoising auto-encoder and back-translation from
both directions, which, however, required good
initialization and only worked well on similar
language pairs. Wu et al. (2019) solve the
problem by mining sentences from Wikipedia
and using them as weakly supervised translation
pairs. Similar to Lample and Conneau (2019) and
Song et al. (2019), we follow the first approach
and treat our pre-trained model as the initialization
step. We also investigate unsupervised translation
using language transfer, which is similar to
Pourdamghani et al. (2019), where the authors
generate translationese of the source language
and train a system on high-resource languages
to correct these intermediate utterances. It is
also closely related to Conneau et al. (2018)
and Artetxe et al. (2019) on cross-lingual
representation learning, where we also show that
representations learned by mBART can be easily
transferred between languages without supervised
data.
7 Conclusion

We demonstrate that multilingual de-noising pre-
training is able to significantly improve both
supervised and unsupervised machine translation
at both the sentence level and the document level.
We analyze when and how pre-training is most
effective and can be combined with other
approaches such as back-translation. Our results
also show the transfer learning ability of
the representations learned by multilingual
pre-training.

In future work, we will scale up the current
pre-training to more languages, for example, an
mBART100 model. The size of our model makes
it expensive to deploy in production—future work
will explore pre-training more efficient models.
Acknowledgments
We thank Marc’Aurelio Ranzato, Guillaume
Lample, Alexis Conneau, and Michael Auli
for sharing their expertise on low-resource and
unsupervised machine translation and Peng-Jen
Chen and Jiajun Shen for details about FloRes and
WAT datasets. We also thank our colleagues at
FAIR and FAIAR for valuable feedback.
References

Roee Aharoni, Melvin Johnson, and Orhan Firat. 2019. Massively multilingual neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3874–3884. Minneapolis, Minnesota. Association for Computational Linguistics.

Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, Wolfgang Macherey, Zhifeng Chen, and Yonghui Wu. 2019. Massively multilingual neural machine translation in the wild: Findings and challenges. CoRR, abs/1907.05019.

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2017. Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041. DOI: https://doi.org/10.18653/v1/D18-1399
Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2019. On the cross-lingual transferability of monolingual representations. arXiv preprint arXiv:1910.11856. DOI: https://doi.org/10.18653/v1/2020.acl-main.421

Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web inventory of transcribed and translated talks. In Conference of the European Association for Machine Translation, pages 261–268.

Mauro Cettolo, Niehues Jan, Stüker Sebastian, Luisa Bentivogli, Roldano Cattoni, and Marcello Federico. 2015. The IWSLT 2015 evaluation campaign. In International Workshop on Spoken Language Translation.

Xilun Chen and Claire Cardie. 2018. Unsupervised multilingual word embeddings. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 261–270. Brussels, Belgium. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D18-1024

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116. DOI: https://doi.org/10.18653/v1/2020.acl-main.747

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D18-1269

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL).

Chenchen Ding, Hnin Thu Zar Aye, Win Pa Pa, Khin Thandar Nwet, Khin Mar Soe, Masao Utiyama, and Eiichiro Sumita. 2019. Towards Burmese (Myanmar) morphological analysis: Syllable-based tokenization and part-of-speech tagging. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 19(1):5. DOI: https://doi.org/10.1145/3325885

Chenchen Ding, Masao Utiyama, and Eiichiro Sumita. 2018. NOVA: A feasible and flexible annotation system for joint tokenization and part-of-speech tagging. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 18(2):17. DOI: https://doi.org/10.1145/3276773

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197.

Sergey Edunov, Alexei Baevski, and Michael Auli. 2019. Pre-trained language model representations for language generation. arXiv preprint arXiv:1903.09722. DOI: https://doi.org/10.18653/v1/N19-1409

Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016. Multi-way, multilingual neural machine translation with a shared attention mechanism. In NAACL. DOI: https://doi.org/10.18653/v1/N16-1101

Jiatao Gu, Hany Hassan, Jacob Devlin, and Victor O. K. Li. 2018. Universal neural machine translation for extremely low resource languages. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 344–354. New Orleans, Louisiana. Association for Computational Linguistics.

Jiatao Gu, Yong Wang, Kyunghyun Cho, and Victor O. K. Li. 2019. Improved zero-shot neural machine translation via ignoring spurious correlations. arXiv preprint arXiv:1906.01181.

Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and Marc'Aurelio
Ranzato. 2019. The FLORES evaluation datasets for low-resource machine translation: Nepali–English and Sinhala–English. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6097–6110. Hong Kong, China. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D19-1632
Sébastien Jean, Stanislas Lauly, Orhan Firat, and Kyunghyun Cho. 2017. Does neural machine translation benefit from larger context? CoRR, abs/1704.05135.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351. DOI: https://doi.org/10.1162/tacl_a_00065

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71. Brussels, Belgium. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D18-2012, PMID: 29382465

Anoop Kunchukuttan, Pratik Mehta, and Pushpak Bhattacharyya. 2017. The IIT Bombay English-Hindi parallel corpus. CoRR, abs/1710.02855.

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018a. Unsupervised machine translation using monolingual corpora only. In International Conference on Learning Representations.

Guillaume Lample, Alexis Conneau, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018b. Word translation without parallel data. In International Conference on Learning Representations.

Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018c. Phrase-based & neural unsupervised machine translation. arXiv preprint arXiv:1804.07755. DOI: https://doi.org/10.18653/v1/D18-1549

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461. DOI: https://doi.org/10.18653/v1/2020.acl-main.703

Liangyou Li, Xin Jiang, and Qun Liu. 2019. Pretrained language models for document-level neural machine translation. arXiv preprint arXiv:1911.03110.

Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. arXiv preprint arXiv:1908.08345. DOI: https://doi.org/10.18653/v1/D19-1387

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Lesly Miculicich, Dhananjay Ram, Nikolaos Pappas, and James Henderson. 2018. Document-level neural machine translation with hierarchical attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2947–2954. Brussels, Belgium. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D18-1325

Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013. Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. FAIRSEQ: A
fast, extensible toolkit for sequence modeling. In North American Association for Computational Linguistics (NAACL): System Demonstrations. DOI: https://doi.org/10.18653/v1/N19-4009
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In North American Association for Computational Linguistics (NAACL).

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? arXiv preprint arXiv:1906.01502. DOI: https://doi.org/10.18653/v1/P19-1493

Nima Pourdamghani, Nada Aldarrab, Marjan Ghazvininejad, Kevin Knight, and Jonathan May. 2019. Translating translationese: A two-step approach to unsupervised machine translation. In ACL. DOI: https://doi.org/10.18653/v1/P19-1293

Ye Qi, Devendra Singh Sachan, Matthieu Felix, Sarguna Janani Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? arXiv preprint arXiv:1804.06323. DOI: https://doi.org/10.18653/v1/N18-2084

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning, OpenAI.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners, OpenAI.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Prajit Ramachandran, Peter J. Liu, and Quoc Le. 2017. Unsupervised pretraining for sequence to sequence learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 383–391. DOI: https://doi.org/10.18653/v1/D17-1039

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96. Berlin, Germany. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P16-1009

Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. In International Conference on Machine Learning (ICML).

Jörg Tiedemann and Yves Scherrer. 2017. Neural machine translation with extended context. In Proceedings of the Third Workshop on Discourse in Machine Translation, pages 82–92. Copenhagen, Denmark. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/W17-4811

Zhaopeng Tu, Yang Liu, Shuming Shi, and Tong Zhang. 2018. Learning to remember translation history with a continuous cache. Transactions of the Association for Computational Linguistics, 6:407–420. DOI: https://doi.org/10.1162/tacl_a_00029

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems.

Takashi Wada and Tomoharu Iwata. 2018. Unsupervised cross-lingual word embedding by multilingual neural language models. CoRR, abs/1809.02306. DOI: https://doi.org/10.18653/v1/P19-1300
Longyue Wang, Zhaopeng Tu, Andy Way, and Qun Liu. 2017. Exploiting cross-sentence context for neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2826–2831. Copenhagen, Denmark. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D17-1301

Zihan Wang, Stephen Mayhew, Dan Roth, and others. 2019. Cross-lingual ability of multilingual BERT: An empirical study. arXiv preprint arXiv:1912.07840.

Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzman, Armand Joulin, and Edouard Grave. 2019. CCNet: Extracting high quality monolingual datasets from web crawl data. arXiv preprint arXiv:1911.00359.

Lijun Wu, Jinhua Zhu, Di He, Fei Gao, Xu Tan, Tao Qin, and Tie-Yan Liu. 2019. Machine translation with weakly paired bilingual documents.

Jiacheng Yang, Mingxuan Wang, Hao Zhou, Chengqi Zhao, Yong Yu, Weinan Zhang, and Lei Li. 2019a. Towards making the most of BERT in neural machine translation. arXiv preprint arXiv:1908.05672.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019b. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2019. DialoGPT: Large-scale generative pre-training for conversational response generation.

Jinhua Zhu, Yingce Xia, Lijun Wu, Di He, Tao Qin, Wengang Zhou, Houqiang Li, and Tie-Yan Liu. 2020. Incorporating BERT into neural machine translation. arXiv preprint arXiv:2002.06823. DOI: https://doi.org/10.18653/v1/2020.acl-demos.30