Unsupervised Bitext Mining and Translation
via Self-Trained Contextual Embeddings
Phillip Keung•
Julian Salazar• Yichao Lu• Noah A. Smith†‡
•Amazon †University of Washington
‡Allen Institute for AI
{keung,julsal,yichaolu}@amazon.com nasmith@cs.washington.edu
Abstract
We describe an unsupervised method to create
pseudo-parallel corpora for machine translation
(MT) from unaligned text. We use multilingual
BERT to create source and target
sentence embeddings for nearest-neighbor
search and adapt the model via self-training.
We validate our technique by extracting par-
allel sentence pairs on the BUCC 2017 bitext
mining task and observe up to a 24.5 point
increase (absolute) in F1 scores over previous
unsupervised methods. We then improve an
XLM-based unsupervised neural MT system
pre-trained on Wikipedia by supplementing it
with pseudo-parallel text mined from the same
corpus, boosting unsupervised translation performance
by up to 3.5 BLEU on the WMT’14
French-English and WMT’16 German-English
tasks and outperforming the previous state-of-the-art.
Finally, we enrich the IWSLT’15
English-Vietnamese corpus with pseudo-
parallel Wikipedia sentence pairs, yielding a
1.2 BLEU improvement on the low-resource
MT task. We demonstrate that unsupervised
bitext mining is an effective way of augment-
ing MT datasets and complements existing
techniques like initializing with pre-trained
contextual embeddings.
1 Introduction
Large corpora of parallel sentences are prerequi-
sites for training models across a diverse set of
applications, such as neural machine translation
(NMT; Bahdanau et al., 2015), paraphrase generation
(Bannard and Callison-Burch, 2005), and
aligned multilingual sentence embeddings (Artetxe
and Schwenk, 2019b). Systems that extract parallel
corpora typically rely on various cross-lingual
resources (e.g., bilingual lexicons, parallel corpora),
but recent work has shown that unsupervised
parallel sentence mining (Hangya et al.,
2018) and unsupervised NMT (Artetxe et al.,
2018; Lample et al., 2018a) produce surprisingly
good results.1
Existing approaches to unsupervised parallel
sentence (or bitext) mining start from bilingual
word embeddings (BWEs) learned via an unsupervised,
adversarial approach (Lample et al., 2018b).
Hangya et al. (2018) created sentence represen-
tations by mean-pooling BWEs over content
words. To disambiguate semantically similar but
non-parallel sentences, Hangya and Fraser (2019)
additionally proposed parallel segment detection
by searching for paired substrings with high similarity
scores per word. However, using word
embeddings to generate sentence embeddings
ignores sentential context, which may degrade
bitext retrieval performance.
We describe a new unsupervised bitext mining
approach based on contextual embeddings. We
create sentence embeddings by mean-pooling the
outputs of multilingual BERT (mBERT; Devlin
et al., 2019), which is pre-trained on unaligned
Wikipedia sentences across 104 languages. For
a pair of source and target languages, we find
candidate translations by using nearest-neighbor
search with margin-based similarity scores between
pairs of mBERT-embedded source and target
sentences. We bootstrap a dataset of positive
and negative sentence pairs from these initial
neighborhoods of candidates, then self-train
mBERT on its own outputs. A final retrieval step
gives a corpus of pseudo-parallel sentence pairs,
which we expect to be a mix of actual translations
and semantically related non-translations.
1By unsupervised, we mean that no cross-lingual
resources like parallel text or bilingual lexicons are used.
Unsupervised techniques have been used to bootstrap MT
systems for low-resource languages like Khmer and Burmese
(Marie et al., 2019).
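Transactions of the Association for Computational Linguistics, vol. 8, pp. 828–841, 2020. https://doi.org/10.1162/tacl_a_00348
Action Editor: Collin Cherry. Submission batch: 4/2020; Revision batch: 8/2020; Published 12/2020.
© 2020 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.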
We apply our technique on the BUCC 2017
parallel sentence mining task (Zweigenbaum et al.,
2017). We achieve state-of-the-art F1 scores on
unsupervised bitext mining, with an improvement
of up to 24.5 points (absolute) on published
results (Hangya and Fraser, 2019). Other work
(e.g., Libovický et al., 2019) has shown that
retrieval performance varies substantially with
the layer of mBERT used to generate sentence
representations; using the optimal mBERT layer
yields an improvement as large as 44.9 points.
Furthermore, our pseudo-parallel text improves
unsupervised NMT (UNMT) performance. We
build upon the UNMT framework of Lample
et al. (2018c) and XLM (Lample and Conneau,
2019) by incorporating our pseudo-parallel text
(also derived from Wikipedia) at training time.
This boosts performance on WMT’14 En-Fr and
WMT’16 En-De by up to 3.5 BLEU over the
XLM baseline, outperforming the state-of-the-art
on unsupervised NMT (Song et al., 2019).
Finally, we demonstrate the practical value of
unsupervised bitext mining in the low-resource
setting. We augment the English-Vietnamese
corpus (133k pairs) from the IWSLT’15 translation
task (Cettolo et al., 2015) with our pseudo-bitext
from Wikipedia (400k pairs), and observe a
1.2 BLEU increase over the best published model
(Nguyen and Salazar, 2019). When we reduced the
amount of parallel and monolingual Vietnamese
data by a factor of ten (13.3k pairs), the model
trained with pseudo-bitext performed 7 BLEU
points better than a model trained on the reduced
parallel text alone.
2 Our Approach
Our aim is to create a bilingual sentence em-
bedding space where, for each source sentence
embedding, a sufficiently close nearest neighbor
among the target sentence embeddings is its
translation. By aligning source and target sentence
embeddings in this way, we can extract sentence
pairs to create new parallel corpora. Artetxe and
Schwenk (2019a) construct this space by training
a joint encoder-decoder MT model over multiple
language pairs and using the resulting encoder
to generate sentence embeddings. A margin-based
similarity score is then computed between
embeddings for retrieval (Section 2.2). However,
this approach requires large parallel corpora to
train the encoder-decoder model in the first place.
We investigate whether contextualized sentence
embeddings created with unaligned text are useful
for unsupervised bitext retrieval. Previous work
explored the use of multilingual sentence encoders
taken from machine translation models (e.g.,
Artetxe and Schwenk, 2019b; Lu et al., 2018)
for zero-shot cross-lingual transfer. Our work is
motivated by recent success in tasks like zero-shot
text classification and named entity recognition
(e.g., Keung et al., 2019; Mulcaire et al., 2019)
with multilingual contextual embeddings, which
exhibit cross-lingual properties despite being
trained without parallel sentences.
We illustrate our method in Figure 1. We first
retrieve the candidate translation pairs:
• Each source and target language sentence
is converted into an embedding vector with
mBERT via mean-pooling.
• Margin-based scores are computed for each
sentence pair using the k nearest neighbors
of the source and target sentences (Sec. 2.2).
• Each source sentence is paired with its nearest
neighbor in the target language based on this
score.
• We select a threshold score that keeps some
top percentage of pairs (Sec. 2.2).
• Rule-based filters are applied to further remove
mismatched sentence pairs (Sec. 2.3).
The remaining candidate pairs are used to
bootstrap a dataset for self-training mBERT as
follows:
• Each candidate pair (a source sentence and its
closest nearest neighbor above the threshold)
is taken as a positive example.
• This source sentence is also paired with its
next k − 1 neighbors to give hard negative
examples (we compare this with random
negative samples in Sec. 3.3).
• We finetune mBERT to produce sentence
embeddings that discriminate between
positive and negative pairs (Sec. 2.4).
After self-training, the finetuned mBERT model
is used to generate new sentence embeddings.
Parallel sentences should be closer to each other
in this new embedding space, which improves
retrieval performance.
Figure 1: Our self-training scheme. Left: We index sentences using our two encoders. For each source sentence,
we retrieve k nearest-neighbor target sentences per the margin criterion (Eq. 1), depicted here for k = 4. If the
nearest neighbor is within a threshold, it is treated with the source sentence as a positive pair, and the remaining
k − 1 are treated with the source sentence as negative pairs. Right: We refine one of the encoders such that the
cosine similarity of the two embeddings is maximized on positive pairs and minimized on negative pairs.
2.1 Sentence Embeddings and
Nearest-neighbor Search
We use mBERT (Devlin et al., 2019) to create
sentence embeddings for both languages by mean-
pooling the representations from the final layer.
We use FAISS (Johnson et al., 2017) to perform
exact nearest-neighbor search on the embeddings.
We compare every sentence in the source language
to every sentence in the target language; we do
not use links between Wikipedia articles or other
metadata to reduce the size of the search space.
In our experiments, we retrieve the k = 4 closest
target sentences for each source sentence; the
source language is always non-English, while the
target language is always English.
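
As a rough illustration of this step, the sketch below mean-pools token vectors into a sentence embedding and runs an exact (brute-force) nearest-neighbor search in pure Python. The toy vectors stand in for real mBERT outputs, and the function names (mean_pool, cosine, k_nearest) are hypothetical; in practice a FAISS index would replace the inner loop.

```python
import math

def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / len(token_vectors) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def k_nearest(query, candidates, k=4):
    """Exact search: score every candidate and keep the k most similar.

    Returns a list of (candidate_index, cosine_similarity), best first.
    """
    scored = [(idx, cosine(query, cand)) for idx, cand in enumerate(candidates)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy data (assumed, not real mBERT outputs): three token vectors for one
# source sentence and three already-pooled target sentence vectors.
source_sentence = mean_pool([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
target_sentences = [[1.0, 1.0], [1.0, 0.0], [-1.0, -1.0]]
neighbors = k_nearest(source_sentence, target_sentences, k=2)
```

Here the first target vector points in the same direction as the pooled source vector, so it is returned as the top neighbor.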
2.2 Margin-based Score
We compute a margin-based similarity score
between each source sentence and its k nearest
target neighbors. Following Artetxe and Schwenk
(2019a), we use the ratio margin score, which
calibrates the cosine similarity by dividing it by
the average cosine similarity between each embedding
and its k nearest neighbors:
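\[
\mathrm{margin}(x, y) = \frac{\cos(x, y)}{\sum_{z \in \mathrm{NN}_k^{\mathrm{tgt}}(x)} \frac{\cos(x, z)}{2k} + \sum_{z \in \mathrm{NN}_k^{\mathrm{src}}(y)} \frac{\cos(y, z)}{2k}}. \tag{1}
\]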
We remove the sentence pairs with margin scores
below some pre-selected threshold. For BUCC,
we do not have development data for tuning the
threshold hyperparameter, so we simply use the
prior probability. For example, the creators of
the dataset estimate that ∼2% of De sentences
have an En translation, so we choose a score
threshold such that we retrieve ∼2% of the pairs.
We set the threshold in the same way for the
other BUCC pairs. For UNMT with Wikipedia
bitext mining, we set the threshold such that we
always retrieve 2.5 million sentence pairs for each
language pair.
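
The scoring and thresholding described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: the similarity values are made up, the helper names (ratio_margin, keep_top_fraction) are hypothetical, and the threshold is expressed as a fraction of candidate pairs to keep rather than as an explicit score cutoff.

```python
def ratio_margin(sim_xy, sims_x_neighbors, sims_y_neighbors):
    """Ratio margin of Eq. 1.

    sim_xy: cos(x, y) for the candidate pair.
    sims_x_neighbors: cos(x, z) for the k nearest target neighbors z of x.
    sims_y_neighbors: cos(y, z) for the k nearest source neighbors z of y.
    """
    k = len(sims_x_neighbors)
    if k == 0 or k != len(sims_y_neighbors):
        raise ValueError("both neighborhoods must contain the same k > 0 scores")
    denominator = sum(sims_x_neighbors) / (2 * k) + sum(sims_y_neighbors) / (2 * k)
    return sim_xy / denominator

def keep_top_fraction(scored_pairs, fraction):
    """Keep the highest-scoring `fraction` of (source_id, target_id, score) triples."""
    ranked = sorted(scored_pairs, key=lambda triple: triple[2], reverse=True)
    n_keep = max(1, round(len(ranked) * fraction))
    return ranked[:n_keep]

# Made-up similarities for one pair with k = 4 neighbors on each side.
score = ratio_margin(0.9, [0.9, 0.6, 0.5, 0.4], [0.9, 0.7, 0.5, 0.3])

# Made-up candidate list; fraction=0.5 is only for illustration
# (the BUCC De-En setting above would use roughly 0.02).
candidates = [("s1", "t7", 1.50), ("s2", "t3", 1.05), ("s3", "t9", 0.98), ("s4", "t1", 1.20)]
kept = keep_top_fraction(candidates, fraction=0.5)
```

A pair whose similarity is well above the average similarity of its two neighborhoods gets a margin above 1, which is what the threshold selects for.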
2.3 Rule-based Filtering
We also apply two simple filtering steps before
finalizing the candidate pairs list:
• Digit filtering: Sentence pairs that are
translations of each other must have digit
sequences that match exactly.2
• Edit distance: Sentences from English
Wikipedia sometimes appear in non-English
pages and vice versa. We remove sentence
pairs where the content of the source and
target share substantial overlap (i.e.,
character-level edit distance is ≤50%).
2In Python, set(re.findall("[0-9]+", sent1))
== set(re.findall("[0-9]+", sent2)).
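
The two filters can be sketched as below. The digit check follows the footnote's expression; the edit-distance check is an assumption-laden illustration, since the text does not say how the 50% figure is normalized. Here the distance is divided by the length of the longer sentence, and the function names are hypothetical.

```python
import re

def digits_match(sent1, sent2):
    """Digit filter: both sentences must contain the same set of digit sequences."""
    return set(re.findall("[0-9]+", sent1)) == set(re.findall("[0-9]+", sent2))

def edit_distance(a, b):
    """Character-level Levenshtein distance via the standard two-row DP."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def is_near_copy(source, target, max_ratio=0.5):
    """Assumed normalization: distance divided by the longer sentence's length."""
    longest = max(len(source), len(target), 1)
    return edit_distance(source, target) / longest <= max_ratio

def passes_filters(source, target):
    """A candidate pair survives if its digits agree and it is not a near copy."""
    return digits_match(source, target) and not is_near_copy(source, target)

# An English sentence copied almost verbatim into a non-English page is rejected.
copied = passes_filters("The Treaty of Berlin", "The Treaty of Berlin.")
```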
2.4 Self-training
We devise an unsupervised self-training technique
to improve mBERT for bitext retrieval using
mBERT’s own outputs. For each source sentence,
if the nearest target sentence is within the threshold
and not filtered out, the pair is treated as a positive
sentence. We then keep the next k − 1 nearest
neighbors as negative sentences. Altogether, these
give us a training set of examples which are labeled
as positive or negative pairs.
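
The bootstrapping of labeled pairs can be sketched as follows. This is a simplified illustration: the neighborhoods and scores are made up, the rule-based filters are represented by an optional callback named keep_pair, and build_training_pairs is a hypothetical helper rather than code from the released system.

```python
def build_training_pairs(neighborhoods, threshold, keep_pair=lambda src, tgt: True):
    """Turn k-nearest neighborhoods into labeled self-training examples.

    neighborhoods: {source_id: [(target_id, margin_score), ...]}, best first.
    Returns (source_id, target_id, label) triples, with label 1 for the
    positive pair and 0 for the hard negatives from the same neighborhood.
    """
    examples = []
    for source_id, ranked in neighborhoods.items():
        best_target, best_score = ranked[0]
        if best_score < threshold or not keep_pair(source_id, best_target):
            continue  # no confident positive, so this source sentence is skipped
        examples.append((source_id, best_target, 1))
        for target_id, _ in ranked[1:]:
            examples.append((source_id, target_id, 0))
    return examples

# Made-up neighborhoods with k = 4; only de_1 clears the threshold.
toy = {
    "de_1": [("en_5", 1.40), ("en_2", 1.10), ("en_9", 1.05), ("en_3", 1.01)],
    "de_2": [("en_8", 0.99), ("en_1", 0.97), ("en_4", 0.95), ("en_6", 0.90)],
}
pairs = build_training_pairs(toy, threshold=1.2)
```

Each accepted source sentence therefore contributes one positive and k − 1 negative examples.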
We train mBERT to discriminate between
positive and negative sentence pairs as a binary
classification task. We distinguish the mBERT
encoders for the source and target languages as
fsrc, ftgt respectively. Our training objective is
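\[
L(X, Y; \Theta_{\mathrm{src}}) = \left| \frac{f_{\mathrm{src}}(X; \Theta_{\mathrm{src}})^{\top} f_{\mathrm{tgt}}(Y)}{\lVert f_{\mathrm{src}}(X; \Theta_{\mathrm{src}}) \rVert \, \lVert f_{\mathrm{tgt}}(Y) \rVert} - \mathrm{Par}(X, Y) \right|, \tag{2}
\]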
where fsrc(X) and ftgt(Y) are the mean-pooled
representations of the source sentence X and
target sentence Y, and where Par(X, Y) is 1
if X, Y are parallel and 0 otherwise. This loss
encourages the cosine similarity between the
source and target embeddings to increase for
positive pairs and decrease otherwise. The process
is depicted in Figure 1.
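
To make the objective concrete, the sketch below evaluates the loss of Eq. 2 on fixed toy vectors. It only computes the loss value; it does not perform gradient updates, and the vectors are made up rather than produced by mBERT.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def self_training_loss(src_embedding, tgt_embedding, is_parallel):
    """Eq. 2: absolute difference between cosine similarity and the 0/1 label."""
    label = 1.0 if is_parallel else 0.0
    return abs(cosine(src_embedding, tgt_embedding) - label)

# Same direction and labeled parallel: cosine is 1, so the loss is 0.
aligned = self_training_loss([1.0, 0.0], [2.0, 0.0], is_parallel=True)
# Orthogonal and labeled non-parallel: cosine is 0, so the loss is 0.
orthogonal = self_training_loss([1.0, 0.0], [0.0, 3.0], is_parallel=False)
# Same direction but labeled non-parallel: the loss takes the value 1.
mislabeled = self_training_loss([1.0, 0.0], [2.0, 0.0], is_parallel=False)
```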
Note that we only finetune fsrc (parameters
Θsrc) and we hold ftgt fixed. If both fsrc and ftgt
are updated, then the training process collapses
to a trivial solution, since the model will map
all pseudo-parallel pairs to one representation and
all non-parallel pairs to another. We hold ftgt
fixed, which forces fsrc to align its outputs to
the target (in our experiments, always English)
mBERT embeddings.
After finetuning, we use the updated fsrc to
generate new non-English sentence embeddings.
We then repeat the retrieval process with FAISS,
yielding a final set of pseudo-parallel pairs after
thresholding and filtering.
3 Unsupervised Bitext Mining
We apply our method to the BUCC 2017 shared
task, ‘‘Spotting Parallel Sentences in Comparable
Corpora’’ (Zweigenbaum et al., 2017). The task
involves retrieving parallel sentences from mono-
lingual corpora derived from Wikipedia. Parallel
sentences were inserted into the corpora in a con-
textually appropriate manner by the task organi-
zers. The shared task assessed retrieval systems for
precision, recall, and F1-score on four language
pairs: De-En, Fr-En, Ru-En, and Zh-En. Prior
work on unsupervised bitext mining has generally
studied the European language pairs to avoid
dealing with Chinese word segmentation (Hangya
et al., 2018; Hangya and Fraser, 2019).
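
For reference, precision, recall, and F1 over retrieved sentence pairs can be computed as in the sketch below. The pair identifiers are made up, and this is not the official BUCC scoring script, only the standard definitions applied to sets of (source, target) pairs.

```python
def precision_recall_f1(retrieved, gold):
    """Score a set of retrieved (source_id, target_id) pairs against gold pairs."""
    retrieved, gold = set(retrieved), set(gold)
    true_positives = len(retrieved & gold)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Made-up example: 2 of 3 retrieved pairs are correct, 2 of 4 gold pairs are found.
gold_pairs = {("de_1", "en_5"), ("de_2", "en_8"), ("de_3", "en_2"), ("de_4", "en_4")}
system_pairs = {("de_1", "en_5"), ("de_2", "en_8"), ("de_9", "en_1")}
precision, recall, f1 = precision_recall_f1(system_pairs, gold_pairs)
```

In this example precision is 2/3 and recall is 1/2, giving an F1 of 4/7.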
3.1 Setup
For each BUCC language pair, we take the
corresponding source and target monolingual
corpora, which have been pre-split into training,
sample, and test sets at a ratio of 49%–2%–49%.
The identities of the parallel sentence pairs for
the test set were not publicly released, and are
only available for the training set. Following
the convention established in Hangya and Fraser
(2019) and Artetxe and Schwenk (2019a), we
use the test portion for unsupervised system
development and evaluate on the training portion.
We use the reference FAISS implementation3
for nearest-neighbor search. We used the
GluonNLP toolkit (Guo et al., 2020) with pre-trained
mBERT weights4 for inference and
self-training. We compute the margin similarity
score in Eq. 1 with k = 4 nearest neighbors. We
set a threshold on the score such that we retrieve
the prior proportion (e.g., ∼2%) of parallel pairs
in each language.
We then finetune mBERT via self-training. We
take minibatches of 100 sentence pairs. We use the
Adam optimizer with a constant learning rate of
0.00001 for 2 epochs. To avoid noisy translations,
we finetune on the top 50% of the highest-scoring
pairs from the retrieved bitext (e.g., if the prior
proportion is 2%, then we would use the top 1%
of sentence pairs for self-training).
We considered performing more than one round
of self-training but found it was not helpful for
the BUCC task. BUCC has very few parallel pairs
(e.g., 9,000 pairs for Fr-En) per language and thus
few positive pairs for our unsupervised method
to find. The size of the self-training corpus is
limited by the proportion of parallel sentences,
and mBERT rapidly overfits to small datasets.
3https://github.com/facebookresearch/faiss.
4https://github.com/google-research/bert/blob/master/multilingual.md.
Method                       De-En   Fr-En   Ru-En   Zh-En
Hangya and Fraser (2019)
  avg.                       30.96   44.81   19.80   -
  align-static               42.81   42.21   24.53   -
  align-dyn.                 43.35   43.44   24.97   -
Our method
  mBERT (final layer)        42.1    45.8    36.9    35.8
  + digit filtering (DF)     47.0    49.3    41.2    38.0
  + edit distance (ED)       47.0    49.3    41.2    38.0
  + self-training (ST)       60.6    60.2    49.5    45.7
  mBERT (layer 8)            67.0    65.3    59.3    53.3
  + DF, ED, ST               74.9    73.0    69.9    60.1
Table 1: F1 scores for unsupervised bitext retrieval on BUCC 2017. Results with mBERT are from our
method (Sec. 2) using the final (12th) layer. We also include results for the 8th layer (e.g., Libovický et al.,
2019), but do not consider this part of the unsupervised setting as we would not have known a priori which
layer was best to use.
Language pair   Parallel sentence pair
De-En           Beide Elemente des amerikanischen Traums haben heute einen Teil ihrer Anziehungskraft verloren.
                Both elements of the American dream have now lost something of their appeal.
Fr-En           L’Allemagne à elle seule s’attend à recevoir pas moins d’un million de demandeurs d’asile cette année.
                Germany alone expects as many as a million asylum-seekers this year.
Ru-En           [Russian source sentence not preserved]
                Nevertheless, in 1881, Thessaly and small parts of Epirus were ceded to Greece as part of the Treaty of Berlin.
Zh-En           [Chinese source sentence not preserved]
                In the strange new world of today, the modern and the pre-modern depend on each other.
Table 2: Examples of parallel sentences that were extracted by our method on the BUCC 2017
shared task.
3.2 Results
We provide a few examples of the bitext we
retrieved in Table 2. The examples were chosen
from the high-scoring pairs and verified to be
correct translations.
Our retrieval results are in Table 1. We
compare our results with strictly unsupervised
techniques, which do not use bilingual lexicons,
parallel text, or other cross-lingual resources.
Using mBERT as-is with the margin-based score
works reasonably well, giving F1 scores in the
range 35.8 to 45.8, which is competitive with
the previous state-of-the-art for some pairs, and
outperforming by 12 points in the case of Ru-En.
Furthermore, applying simple rule-based filters
(Sec. 2.3) on the candidate translation pairs adds a
few more points, although the edit distance filter
has a negligible effect when compared with the
digit filter.
Method            De-En   Fr-En   Ru-En   Zh-En
mBERT w/o ST      47.0    49.3    41.2    38.0
w/ ST (random)    57.7    55.7    48.1    45.2
w/ ST (hard)      60.6    60.2    49.5    45.7
Table 3: F1 scores for bitext retrieval on BUCC
2017 using random sentences as negative samples
instead of nearest neighbors.
We see that finetuning mBERT on its own
chosen sentence pairs (i.e., unsupervised self-training)
yields significant improvements, adding
another 8 to 14 points to the F1 score on top
of filtering. In all, these F1 scores represent a
34% to 98% relative improvement over existing
techniques in unsupervised parallel sentence
extraction for these language pairs.
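
The relative improvements can be checked directly from the Table 1 figures. The short sketch below compares the self-trained final-layer scores against the best prior unsupervised score per language pair (align-dyn. in each case for De-En and Ru-En, avg. for Fr-En); the dictionary names are illustrative.

```python
# Best scores from Hangya and Fraser (2019) per language pair, from Table 1.
previous_best = {"De-En": 43.35, "Fr-En": 44.81, "Ru-En": 24.97}
# Final-layer mBERT with digit filtering, edit distance, and self-training.
self_trained = {"De-En": 60.6, "Fr-En": 60.2, "Ru-En": 49.5}

relative_gain = {
    pair: 100.0 * (self_trained[pair] - previous_best[pair]) / previous_best[pair]
    for pair in previous_best
}
```

The smallest gain (Fr-En) rounds to 34% and the largest (Ru-En) to 98%, matching the range quoted above; Zh-En is omitted because no prior unsupervised score is listed for it.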
Libovický et al. (2019) explored bitext mining
with mBERT in the supervised context and
found that retrieval performance significantly
varies with the mBERT layer used to create
sentence embeddings. In particular, they found
layer 8 embeddings gave the highest precision-at-1.
We also observe an improvement (Table 1) in
unsupervised retrieval of another 13 to 20 points
by using the 8th layer instead of the default final
layer (12th). We include these results but do not
consider them unsupervised, as we would not
know a priori which layer was best to use.
3.3 Choosing Negative Sentence Pairs
Other authors (e.g., Guo et al., 2018) have noted
that the choice of negative examples has a considerable
impact on metric learning. Specifically,
using negative examples which are difficult to
distinguish from the positive nearest neighbor is
often beneficial for performance. We examine the
impact of taking random sentences instead of the
remaining k −1 nearest neighbors as the negatives
during self-training.
Our results are in Table 3. While self-training
with random negatives still greatly improves
the untuned baseline, the use of hard negative
examples mined from the k-nearest neighborhood
can make a significant difference to the final F1
score.
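
The random-negative condition in this comparison can be sketched as below: instead of taking the remaining k − 1 neighbors, the same number of target sentences is drawn uniformly from the target corpus. The identifiers, the per-sentence seeding, and the function name are assumptions made for the illustration.

```python
import random

def random_negatives(source_id, positive_target, all_target_ids, n_negatives=3, seed=0):
    """Sample n_negatives distinct targets uniformly, excluding the positive target.

    Returns (source_id, target_id, 0) triples, i.e. negatively labeled pairs.
    """
    rng = random.Random(f"{seed}-{source_id}")  # reproducible per source sentence
    pool = [target for target in all_target_ids if target != positive_target]
    return [(source_id, target, 0) for target in rng.sample(pool, n_negatives)]

# With k = 4, each positive pair is accompanied by k - 1 = 3 negatives.
target_ids = [f"en_{i}" for i in range(10)]
negatives = random_negatives("de_1", "en_5", target_ids)
```

Random negatives are usually easy to tell apart from the positive, which is one plausible reason the hard negatives from the k-nearest neighborhood give the larger gains in Table 3.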
4 Bitext for Neural Machine Translation
A major application of bitext mining is to create
new corpora for machine translation. We conduct
an extrinsic evaluation of our unsupervised bitext
mining approach on unsupervised (WMT’14
French-English, WMT’16 German-English) and
low-resource (IWSLT’15 English-Vietnamese)
translation tasks.
We perform large-scale unsupervised bitext
extraction on the October 2019 Wikipedia dumps
in various languages. We use wikifil.pl5 to
extract paragraphs from Wikipedia and remove
markup. We then use the syntok6 package for
sentence segmentation. Finally, we reduce the size
of the corpus by removing sentences that aren’t
part of the body of Wikipedia pages. Sentences
that contain *, =, //, ::, #, www, (talk), or the
pattern [0-9]{2}:[0-9]{2} are filtered out.
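
A possible form of this sentence filter is sketched below as a single regular expression over the listed markers. The exact matching rules used by the authors are not given, so plain substring matching is an assumption, as are the example sentences.

```python
import re

# Markers listed in the text: *, =, //, ::, #, www, (talk), and a dd:dd pattern.
NON_BODY_PATTERN = re.compile(r"\*|=|//|::|#|www|\(talk\)|[0-9]{2}:[0-9]{2}")

def is_body_sentence(sentence):
    """Return True if the sentence contains none of the non-body markers."""
    return NON_BODY_PATTERN.search(sentence) is None

sentences = [
    "Thessaly was ceded to Greece in 1881.",
    "== History ==",
    "See www.example.org for details.",
    "Edited at 12:45 by a user (talk)",
]
body_sentences = [s for s in sentences if is_body_sentence(s)]
```

Only the first example sentence survives the filter.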
We index, retrieve, and filter candidate sentence
pairs with the procedure in Sec. 3. Unlike BUCC,
the Wikipedia dataset does not fit in GPU
memory. The processed corpus is quite large,
with 133 million, 67 million, 36 million, and 6
million sentences in English, German, French,
and Vietnamese respectively. We therefore shard
the dataset into chunks of 32,768 sentences and
perform nearest-neighbor comparisons in chunks
for each language pair. We use a simple map-
reduce algorithm to merge the intermediate results
back together.
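
One way to picture the sharded search is sketched below: a map step finds the top-k targets for every query within a single shard, and a reduce step merges the per-shard lists into a global top-k. This is a pure-Python illustration with hypothetical function names; it assumes unit-normalized vectors (so the dot product equals cosine similarity), shards only the target side, and uses a shard size of 2 instead of 32,768.

```python
import heapq

def dot(u, v):
    """Dot product; equals cosine similarity for unit-normalized vectors."""
    return sum(a * b for a, b in zip(u, v))

def search_shard(queries, shard, shard_offset, k):
    """Map step: top-k targets within one shard for every query.

    Returns {query_index: [(score, global_target_index), ...]}, best first.
    """
    results = {}
    for q_idx, query in enumerate(queries):
        scored = [(dot(query, target), shard_offset + t_idx)
                  for t_idx, target in enumerate(shard)]
        results[q_idx] = heapq.nlargest(k, scored)
    return results

def merge_shards(partial_results, k):
    """Reduce step: combine per-shard top-k lists into a global top-k per query."""
    merged = {}
    for partial in partial_results:
        for q_idx, hits in partial.items():
            merged[q_idx] = heapq.nlargest(k, merged.get(q_idx, []) + hits)
    return merged

def sharded_search(queries, targets, shard_size, k):
    """Split the targets into shards, search each, and merge the results."""
    partials = [search_shard(queries, targets[start:start + shard_size], start, k)
                for start in range(0, len(targets), shard_size)]
    return merge_shards(partials, k)

queries = [[1.0, 0.0]]
targets = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [-1.0, 0.0], [0.8, 0.6]]
top = sharded_search(queries, targets, shard_size=2, k=2)
```

Because each shard keeps its own top-k, the merged result is identical to an exact search over the full target set.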
We follow the approach outlined in Sec. 2
for Wikipedia bitext mining. For each source
sentence, we retrieve the four nearest target
neighbors across the millions of sentences that
we extracted from Wikipedia and compute the
margin-based scores for each pair.
4.1 Unsupervised NMT
We show that our pseudo-parallel text can
complement existing techniques for unsupervised
translation (Artetxe et al., 2018; Lample et al.,
2018c). In line with existing work on UNMT, we
evaluate our approach on the WMT’14 Fr-En and
WMT’16 De-En test sets.
Our UNMT experiments build upon the
reference implementation7 of XLM (Lample and
Conneau, 2019). The UNMT model is trained
by alternating between two steps: a denoising
autoencoder step and a backtranslation step (refer
to Lample et al., 2018c for more details). The
backtranslation step generates pseudo-parallel
5https://github.com/facebookresearch/fastText/blob/master/wikifil.pl.
6https://github.com/fnl/syntok.
7https://github.com/facebookresearch/xlm.
Reference                                Architecture          Pre-training   En-De   De-En   En-Fr   Fr-En
Artetxe et al. (2018)                    2-layer RNN                          6.89    10.16   15.13   15.56
Lample et al. (2018a)                    3-layer RNN                          9.75    13.33   15.05   14.31
Yang et al. (2018)                       4-layer Transformer                  10.86   14.62   16.97   15.58
Lample et al. (2018c)                    4-layer Transformer                  17.16   21.00   25.14   24.18
Song et al. (2019)                       6-layer Transformer   MASS           28.3    35.2    37.5    34.9
XLM Baselines
  Lample and Conneau (2019)              6-layer Transformer   XLM            –       –       33.4    33.3
  Song et al. (2019)                     6-layer Transformer   XLM            27.0    34.3    33.4    33.3
  XLM reference implementation           6-layer Transformer   XLM            –       –       36.6    34.0
  Maximum performance across baselines   6-layer Transformer   XLM            27.0    34.3    36.6    34.0
Ours
  Our XLM baseline                       6-layer Transformer   XLM            27.7    34.5    36.7    34.5
  w/ pseudo-parallel text before ST      6-layer Transformer   XLM            30.4    36.3    39.7    35.9
  w/ pseudo-parallel text after ST       6-layer Transformer   XLM            30.7    37.3    40.2    36.9
Table 4: BLEU scores for unsupervised NMT performance on WMT’14 English-French and WMT’16
English-German test sets. All methods only use unaligned Wikipedia corpora for pre-training and/or
bitext mining. ‘ST’ refers to self-training.
training data, and we incorporate our bitext during
UNMT training in the same way, as another
set of pseudo-parallel sentences. We also use
the same initialization as Lample and Conneau
(2019), where the UNMT models have encoders
and decoders that are initialized with contextual
embeddings trained on the source and target
language Wikipedia corpora with the masked
language model (MLM) objective; no parallel
data is used.
We performed the exhaustive (Fr Wiki)-(En
Wiki) and (De Wiki)-(En Wiki) nearest-neighbor
comparison on eight V100 GPUs, which requires
3 to 4 days to complete per language pair. We
retained the top 2.5 million pseudo-parallel Fr-En
and De-En sentence pairs after mining.
4.2 Results
Our results are in Table 4. The addition of mined
bitext consistently increases the BLEU score in
both directions for WMT’14 Fr-En and WMT’16
De-En. Much of the existing work on improving
UNMT focuses on improved initialization with
contextual embeddings like XLM or MASS (Song
et al., 2019). These embeddings were already pre-trained
on Wikipedia data, so it is surprising that
adding our pseudo-parallel Wikipedia sentences
leads to a 2 to 3 BLEU improvement. In other
words, our approach is complementary to pre-trained
initialization techniques.
Previously (in Table 1), we saw that self-training
improved the F1 score for BUCC bitext
retrieval. The improvement in bitext quality carries
over to UNMT, and providing better pseudo-parallel
text yields a consistent improvement for
all translation directions.
Our results are state-of-the-art in UNMT, but
they should be interpreted relative to the strength
of our XLM baseline. We are building on top of
the XLM initialization, and the effectiveness of
the initialization (and the various hyperparameters
used during training and decoding) affects the
strength of our final results. For example, we
adjusted the beam width on our XLM baselines
to attain BLEU scores which are similar to what
others have published. One can apply our method
to MASS, which performs better than XLM on
UNMT, but we chose to report results on XLM
because it has been validated on a wider range of
tasks and languages.
We also trained a standard 6-layer transformer
encoder-decoder model directly on the pseudo-
parallel text. We used the standard implementation
in Sockeye (Hieber et al., 2018) as-is, and
trained models for French and German on 2.5
million Wikipedia sentence pairs. We withheld
10k pseudo-parallel pairs per language pair to
serve as a development set. We achieved BLEU
scores of 20.8, 21.1, 28.2, and 28.0 on En-De, De-En,
En-Fr, and Fr-En respectively. BLEU scores
were computed with SacreBLEU (Post, 2018).
This compares favorably with the best UNMT
results in Lample et al. (2018c), while avoiding
the use of parallel development data altogether.
4.3 Low-resource NMT
French and German are high-resource languages
and are linguistically close to English. We
therefore evaluate our mined bitext on a low-resource,
linguistically distant language pair. The
IWSLT’15 English-Vietnamese MT task (Cettolo
et al., 2015) provides 133k sentence pairs derived
from translated TED talk transcripts and is a
common benchmark for low-resource MT. We
take supervised training data from the IWSLT task
and augment it with different amounts of pseudo-parallel
text mined from English and Vietnamese
Wikipedia. Furthermore, we construct a very low-resource
setting by downsampling the parallel
text and monolingual Vietnamese Wikipedia text
by a factor of ten (13.3k sentence pairs).
We use the reference implementation8 for the
state-of-the-art model (Nguyen and Salazar, 2019),
which is a highly regularized 6+6-layer transformer
with pre-norm residual connections, scale
normalization, and normalized word embeddings.
We use the same hyperparameters (except for the
dropout rate) but train on our augmented datasets.
To mitigate domain shift, we finetune the best
checkpoint for 75k more steps using only the
IWSLT training data, in the spirit of ‘‘trivial’’
transfer learning for low-resource NMT (Kocmi
and Bojar, 2018).
In Table 5, we show BLEU scores as more
pseudo-parallel text is included during training.
As in previous works on En-Vi (cf. Luong and
Manning, 2015), we use tst2012 (1,553 pairs)
and tst2013 (1,268 pairs) as our development
and test sets respectively, we tokenize all data
with Moses, and we report tokenized BLEU via
multi-bleu.perl. The BLEU score increases
monotonically with the size of the pseudo-parallel
corpus and exceeds the state-of-the-art system’s
BLEU by 1.2 points. This result is consistent
with improvements observed with other types of
monolingual data augmentation like pre-trained
UNMT initialization, various forms of back-translation
(Hoang et al., 2018; Zhou and Keung,
2020), and cross-view training (CVT; Clark et al.,
2018).
8https://github.com/tnq177/transformers_without_tears.
                                  En-Vi
Luong and Manning (2015)          26.4
Clark et al. (2018)               28.9
Clark et al. (2018), with CVT     29.6
Xu et al. (2019)                  31.4
Nguyen and Salazar (2019)         32.8 (28.8)
  + top 100k mined pairs          33.2 (29.5)
  + top 200k mined pairs          33.9 (29.8)
  + top 300k mined pairs          34.0 (30.0)
  + top 400k mined pairs          34.1 (29.9)
Table 5: Tokenized BLEU scores on tst2013 for
the low-resource IWSLT’15 English-Vietnamese
translation task using bitext mined with our
method. Added pairs are sorted by their score.
Development scores on tst2012 in parentheses.
We describe our hyperparameter tuning and
infrastructure following Dodge et al. (2019). The
translation sections of this work mostly used
default parameters, but we did tune the dropout
rate (at 0.2 and 0.3) for each amount of mined
bitext for the supervised En-Vi task (at 100k,
200k, 300k, and 400k sentence pairs). We include
development scores for our best models; dropout
of 0.3 did best for 0k and 100k, while 0.2 did best
otherwise. Training takes less than a day on one
V100 GPU.
To simulate a very low-resource task, we use
one-tenth of the training data by downsampling the
IWSLT En-Vi train set to 13.3k sentence pairs.
Furthermore, we mine bitext from one-tenth of
the monolingual Wiki Vi text and extract proportionately
fewer sentence pairs (i.e., 10k, 20k,
30k, and 40k pairs). We use the implementation
and hyperparameters for the regularized 4+4-layer
transformer used by Nguyen and Salazar (2019)
in a similar setting. We tune the dropout rate (0.2,
0.3, 0.4) to maximize development performance;
0.4 was best for 0k, 0.3 for 10k and 20k, and
0.2 for 30k and 40k. In Table 6, we see larger
improvements in BLEU (4+ points) for the same
relative increases in mined data (as compared to
Table 5). In both cases, the rate of improvement
tapers off as the quality and relative quantity of
mined pairs degrade at each increase.
4.4 UNMT Ablation Study: Pre-training and
Bitext Mining Corpora
In Sec. 4.2, we mined bitext from the October
2019 Wikipedia snapshot whereas the pre-trained
                                    En-Vi, one-tenth
13.3k pairs (from 133k original)    20.7 (19.5)
  + top 10k mined pairs             25.0 (22.9)
  + top 20k mined pairs             26.7 (24.1)
  + top 30k mined pairs             27.3 (24.5)
  + top 40k mined pairs             27.7 (24.7)
Table 6: Tokenized BLEU scores (tst2013),
where the bitext was mined from one-tenth
of the monolingual Vietnamese data. Development
scores on tst2012 in parentheses.
XLM embeddings were created prior to January
2019. Thus, it is possible that the UNMT BLEU
increase would be smaller if the bitext were mined
from the same corpus used for pre-training. We
ran an ablation study to show the effect (or lack
thereof) of the overlap between the pre-training
and pseudo-parallel corpora.
For the En-Vi language pair, we used 5 million
English and 5 million Vietnamese Wiki sentences
to pre-train the XLM model. We only use text from
the October 2019 Wiki snapshot. We mined 300k
pseudo-parallel sentence pairs using our approach
(Sec. 2) from the same Wiki snapshot. We created
two datasets for XLM pre-training: a 10 million-sentence
corpus that is disjoint from the 600k
sentences of the mined bitext, and a 10 million-sentence
corpus that contains all 600k sentences of
the bitext. In Table 7, we show the BLEU increase
on the IWSLT En-Vi task with and without using
the mined bitext as parallel data, using each of the
two XLM models as the initialization.
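The construction of the two pre-training corpora can be sketched as below. This is only an illustration under stated assumptions: sentences are unique and compared by exact string match, the helper would be applied to each language side separately, `build_pretraining_corpora` is a hypothetical name, and the toy sizes stand in for the 10 million-sentence corpora and the 600k bitext sentences.

```python
import random
from typing import List, Set, Tuple


def build_pretraining_corpora(
    wiki_sentences: List[str],
    bitext_sentences: Set[str],
    size: int,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Return (disjoint, inclusive) corpora, each with `size` sentences.

    disjoint:  contains no sentence from the mined bitext.
    inclusive: contains every sentence from the mined bitext.
    """
    rng = random.Random(seed)
    # Candidate sentences that never appear in the mined bitext.
    pool = [s for s in wiki_sentences if s not in bitext_sentences]
    rng.shuffle(pool)
    n_fill = size - len(bitext_sentences)
    if n_fill < 0 or len(pool) < size:
        raise ValueError("not enough sentences for the requested size")
    disjoint = pool[:size]
    # Keep all bitext sentences and fill the remainder from the same pool.
    inclusive = sorted(bitext_sentences) + pool[:n_fill]
    return disjoint, inclusive


if __name__ == "__main__":
    wiki = [f"sentence {i}" for i in range(20)]
    bitext = {"sentence 0", "sentence 1", "sentence 2"}
    disjoint, inclusive = build_pretraining_corpora(wiki, bitext, size=10)
    print(len(disjoint), len(inclusive))
```

Holding both corpora to the same size keeps the comparison between the two XLM initializations focused on overlap with the bitext rather than on the amount of pre-training data.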
The benefit of using pseudo-parallel text is very
clear; even if the pre-trained XLM model saw
the pseudo-parallel sentences during pre-training,
using mined bitext still significantly improves
UNMT performance (23.1 vs. 28.3 BLEU). In
addition, the baseline UNMT performance without
the mined bitext is similar between the two
XLM initializations (23.1 vs. 23.2 BLEU), which
suggests that removing some of the parallel text
present during pre-training does not have a major
effect on UNMT.
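As a quick check of the figures quoted above, the gains can be recomputed from the Table 7 scores. The snippet below is only an illustrative calculation; the dictionary layout and the `bleu_gain` helper are our own naming for this sketch.

```python
# Tokenized BLEU on tst2013, copied from Table 7.
bleu = {
    "XLM excl. PP text": {"without_pp": 23.2, "with_pp": 28.9},
    "XLM incl. PP text": {"without_pp": 23.1, "with_pp": 28.3},
}


def bleu_gain(scores: dict) -> float:
    """BLEU improvement from adding the mined bitext, rounded to 0.1."""
    return round(scores["with_pp"] - scores["without_pp"], 1)


gains = {name: bleu_gain(scores) for name, scores in bleu.items()}

# Difference between the two initializations when no mined bitext is used.
baseline_gap = round(
    abs(
        bleu["XLM excl. PP text"]["without_pp"]
        - bleu["XLM incl. PP text"]["without_pp"]
    ),
    1,
)

print(gains)         # {'XLM excl. PP text': 5.7, 'XLM incl. PP text': 5.2}
print(baseline_gap)  # 0.1
```

The gain from adding mined bitext (over 5 BLEU for either initialization) is much larger than the 0.1 BLEU gap between the two baselines.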
Finally, we trained a standard encoder-decoder
model on the 300k pseudo-parallel pairs only,
using the same Sockeye recipe as in Sec. 4.2. This
yielded a BLEU score of 27.5 on En-Vi, which
is lower than the best XLM-based result (i.e.,
28.9), suggesting that the XLM initialization
improves unsupervised NMT. A similar outcome
was also reported in Lample and Conneau (2019).
                     w/o PP as bitext    w/ PP as bitext
XLM excl. PP text    23.2                28.9
XLM incl. PP text    23.1                28.3

Table 7: Tokenized UNMT BLEU scores on
IWSLT’15 English-Vietnamese (tst2013) with
XLM initialization. We mined 300k pseudo-parallel
(PP) sentence pairs from En and Vi Wikipedia
(Oct. 2019). We created two XLM models,
with the pre-training corpus including or excluding
the PP pairs. We compare their downstream
UNMT performance with and without PP pairs as
"bitext" during UNMT training.
5 Related Work
5.1 Parallel Sentence Mining
Approaches to parallel sentence (or bitext) mining
have been historically driven by the data requirements
of statistical machine translation. Some of
the earliest work in mining the Web for large-scale
parallel corpora can be found in Resnik (1998)
and Resnik and Smith (2003). Recent interest
in the field is reflected by new shared tasks on
parallel extraction and filtering (Zweigenbaum
et al., 2017; Koehn et al., 2018) and the creation
of massively multilingual parallel corpora mined
from the Web, like WikiMatrix (Schwenk et al.,
2019a) and CCMatrix (Schwenk et al., 2019b).
Existing parallel corpora have been exploited in
many ways to create sentence representations for
supervised bitext mining. One approach involves
a joint encoder with a shared wordpiece vocabulary,
trained as part of multiple encoder-decoder
translation models on parallel corpora (Schwenk,
2018). Artetxe and Schwenk (2019b) applied this
approach at scale, sharing a single encoder and
joint vocabulary across 93 languages. Another
approach uses negative sampling to align the
encoders’ sentence representations for nearest-neighbor
retrieval (Grégoire and Langlais, 2018;
Guo et al., 2018).
However, these approaches require training
with initial parallel corpora. In contrast, Hangya
et al. (2018) and Hangya and Fraser (2019) proposed
unsupervised methods for parallel sentence
extraction that use bilingual word embeddings
induced in an unsupervised manner. Our work
is the first to explore using contextual representations
(mBERT; Devlin et al., 2019) in an
unsupervised manner to mine for bitext, and
show improvements over the latest UNMT systems
(Lample and Conneau, 2019; Song et al., 2019),
for which transformers and encoder/decoder
pre-training have doubled or tripled
BLEU scores on unsupervised WMT’16 En-De
since Artetxe et al. (2018) and Lample et al.
(2018c).
5.2 Self-training Techniques
Self-training refers to techniques that use the
outputs of a model to provide labels for its
own training. Yarowsky (1995) proposed a semi-supervised
strategy where a model is first trained
on a small set of labeled data and then used
to assign pseudo-labels to unlabeled data.
Semi-supervised self-training has been used to improve
sentence encoders that project sentences into a
common semantic space. For example, Clark et al.
(2018) proposed cross-view training (CVT) with
labeled and unlabeled data to achieve state-of-the-art
results on a set of sequence tagging, MT, and
dependency parsing tasks.
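The generic semi-supervised self-training loop described above (train on a small labeled set, pseudo-label the unlabeled data, retrain) can be sketched as follows. This is a toy illustration and not the method of Yarowsky (1995), Clark et al. (2018), or our system: the one-dimensional nearest-centroid classifier, the `margin` confidence threshold, and all function names are assumptions chosen for brevity.

```python
from typing import Dict, List, Tuple


def centroids(labeled: List[Tuple[float, int]]) -> Dict[int, float]:
    """Fit the toy model: the mean feature value of each class."""
    by_class: Dict[int, List[float]] = {}
    for x, y in labeled:
        by_class.setdefault(y, []).append(x)
    return {y: sum(xs) / len(xs) for y, xs in by_class.items()}


def self_train(
    labeled: List[Tuple[float, int]],
    unlabeled: List[float],
    margin: float = 1.0,
    max_rounds: int = 10,
) -> Dict[int, float]:
    """Alternate between fitting the model and pseudo-labeling new points."""
    labeled = list(labeled)
    remaining = list(unlabeled)
    for _ in range(max_rounds):
        model = centroids(labeled)
        confident, uncertain = [], []
        for x in remaining:
            # Distance to every class centroid, closest first.
            dists = sorted((abs(x - c), y) for y, c in model.items())
            # Accept a pseudo-label only when the closest class wins clearly.
            if len(dists) == 1 or dists[1][0] - dists[0][0] >= margin:
                confident.append((x, dists[0][1]))
            else:
                uncertain.append(x)
        if not confident:
            break  # nothing new to learn from
        labeled.extend(confident)
        remaining = uncertain
    return centroids(labeled)


if __name__ == "__main__":
    seed_labels = [(0.0, 0), (10.0, 1)]
    print(self_train(seed_labels, [1.0, 9.0, 2.0, 8.0, 5.0]))
```

In this toy run the ambiguous point at 5.0 is equidistant from both classes and is never pseudo-labeled, which shows how a confidence threshold keeps uncertain examples out of the training set.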
Semi-supervised methods require some annotated
data, even if it is not directly related to the
target task. Our work is the first to apply unsupervised
self-training for generating cross-lingual
sentence embeddings. The most similar approach
to ours is the prevailing scheme for unsupervised
NMT (Lample et al., 2018c), which relies on
multiple iterations of backtranslation (Sennrich
et al., 2016) to create a sequence of pseudo-parallel
sentence pairs with which to bootstrap an
MT model.
6 Conclusion
In this work, we describe a novel approach
for state-of-the-art unsupervised bitext mining
using multilingual contextual representations. We
extract pseudo-parallel sentences from unaligned
corpora to create models that achieve state-of-the-art
performance on unsupervised and low-resource
translation tasks. Our approach is complementary
to the improvements derived from initializing MT
models with pre-trained encoders and decoders,
and helps narrow the gap between unsupervised
and supervised MT. We focused on mBERT-based
embeddings in our experiments, but we
expect unsupervised self-training to improve
the unsupervised bitext mining and downstream
UNMT performance of other forms of multilingual
contextual embeddings as well.
Our findings are in line with recent work showing
that multilingual embeddings are very useful
for cross-lingual zero-shot and zero-resource tasks.
Even without using aligned corpora, mBERT can
embed sentences across different languages in
a consistent fashion according to their semantic
content. More work will be needed to understand
how contextual embeddings discover these
cross-lingual correspondences.
Acknowledgments
We would like to thank the anonymous reviewers
for their thoughtful comments.
References
Mikel Artetxe, Gorka Labaka, Eneko Agirre,
and Kyunghyun Cho. 2018. Unsupervised
neural machine translation. In 6th International
Conference on Learning Representations, ICLR
2018, Vancouver, BC, Canada, April 30 –
May 3, 2018, Conference Track Proceedings.
OpenReview.net. DOI: https://doi.org/10.18653/v1/D18-1399
Mikel Artetxe and Holger Schwenk. 2019a.
Margin-based parallel corpus mining with
multilingual sentence embeddings. In Proceedings
of the 57th Annual Meeting of the
Association for Computational Linguistics,
pages 3197–3203. Florence, Italy. Association
for Computational Linguistics.
Mikel Artetxe and Holger Schwenk. 2019b.
Massively multilingual sentence embeddings
for zero-shot cross-lingual transfer and beyond.
Transactions of the Association for Computational
Linguistics, 7:597–610. DOI:
https://doi.org/10.1162/tacl_a_00288
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua
Bengio. 2015. Neural machine translation by
jointly learning to align and translate. In 3rd
International Conference on Learning Representations,
ICLR 2015, San Diego, CA, USA,
May 7-9, 2015, Conference Track Proceedings.
Colin Bannard and Chris Callison-Burch. 2005.
Paraphrasing with bilingual parallel corpora.
In Proceedings of the 43rd Annual Meeting
of the Association for Computational Linguistics
(ACL’05), pages 597–604. Ann Arbor,
Michigan. Association for Computational Linguistics.
DOI: https://doi.org/10.3115/1219840.1219914
Mauro Cettolo, Jan Niehues, Sebastian Stüker,
Luisa Bentivogli, Roldano Cattoni, and
Marcello Federico. 2015. The IWSLT 2015
evaluation campaign. In Proceedings of the 12th
International Workshop on Spoken Language
Translation, pages 2–14. Da Nang, Vietnam.
Kevin Clark, Minh-Thang Luong, Christopher D.
Manning, and Quoc Le. 2018. Semi-supervised
sequence modeling with cross-view training.
In Proceedings of the 2018 Conference on
Empirical Methods in Natural Language Processing,
pages 1914–1925. Brussels, Belgium.
Association for Computational Linguistics.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of the 2019
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, Volume 1
(Long and Short Papers), pages 4171–4186.
Minneapolis, Minnesota. Association for
Computational Linguistics.
Jesse Dodge, Suchin Gururangan, Dallas Card,
Roy Schwartz, and Noah A. Smith. 2019.
Show your work: Improved reporting of
experimental results. In Proceedings of the
2019 Conference on Empirical Methods in
Natural Language Processing and the 9th
International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP),
pages 2185–2194. Hong Kong, China. Association
for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D19-1224
Francis Grégoire and Philippe Langlais. 2018.
Extracting parallel sentences with bidirectional
recurrent neural networks to improve machine
translation. In Proceedings of the 27th
International Conference on Computational
Linguistics, pages 1442–1453. Santa Fe, New
Mexico, USA. Association for Computational
Linguistics.
Jian Guo, He He, Tong He, Leonard Lausen,
Mu Li, Haibin Lin, Xingjian Shi, Chenguang
Wang, Junyuan Xie, Sheng Zha, Aston Zhang,
Hang Zhang, Zhi Zhang, Zhongyue Zhang,
Shuai Zheng, and Yi Zhu. 2020. GluonCV and
GluonNLP: Deep learning in computer vision
and natural language processing. Journal of
Machine Learning Research, 21:23:1–23:7.
Mandy Guo, Qinlan Shen, Yinfei Yang, Heming
Ge, Daniel Cer, Gustavo Hernandez Abrego,
Keith Stevens, Noah Constant, Yun-Hsuan
Sung, Brian Strope, and Ray Kurzweil. 2018.
Effective parallel corpus mining using bilingual
sentence embeddings. In Proceedings of the
Third Conference on Machine Translation:
Research Papers, pages 165–176. Brussels,
Belgium. Association for Computational
Linguistics.
Viktor Hangya, Fabienne Braune, Yuliya
Kalasouskaya, and Alexander Fraser. 2018.
Unsupervised parallel sentence extraction from
comparable corpora. In Proceedings of the 15th
International Workshop on Spoken Language
Translation, pages 7–13. Bruges, Belgium.
Viktor Hangya and Alexander Fraser. 2019.
Unsupervised parallel sentence extraction with
parallel segment detection helps machine translation.
In Proceedings of the 57th Annual
Meeting of the Association for Computational
Linguistics, pages 1224–1234. Florence,
Italy. Association for Computational Linguistics.
DOI: https://doi.org/10.18653/v1/P19-1118
Felix Hieber, Tobias Domhan, Michael
Denkowski, David Vilar, Artem Sokolov, Ann
Clifton, and Matt Post. 2018. The Sockeye
neural machine translation toolkit at AMTA
2018. In Proceedings of the 13th Conference
of the Association for Machine Translation in
the Americas (Volume 1: Research Papers),
pages 200–207. Boston, MA. Association for
Machine Translation in the Americas.
Vu Cong Duy Hoang, Philipp Koehn, Gholamreza
Haffari, and Trevor Cohn. 2018. Iterative
back-translation for neural machine translation.
In Proceedings of the 2nd Workshop
on Neural Machine Translation and Generation,
pages 18–24. Melbourne, Australia.
Association for Computational Linguistics.
Jeff Johnson, Matthijs Douze, and Hervé Jégou.
2017. Billion-scale similarity search with GPUs.
CoRR, abs/1702.08734v1. DOI:
https://doi.org/10.1109/TBDATA.2019.2921572
Phillip Keung, Yichao Lu, and Vikas Bhardwaj.
2019. Adversarial learning with contextual
embeddings for zero-resource cross-lingual
classification and NER. In Proceedings of
the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th
International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP),
pages 1355–1360. Hong Kong, China. Association
for Computational Linguistics.
Tom Kocmi and Ondřej Bojar. 2018. Trivial transfer
learning for low-resource neural machine
translation. In Proceedings of the Third Conference
on Machine Translation: Research
Papers, pages 244–252. Brussels, Belgium.
Association for Computational Linguistics.
DOI: https://doi.org/10.18653/v1/W18-6325
Philipp Koehn, Huda Khayrallah, Kenneth
Heafield, and Mikel L. Forcada. 2018. Findings
of the WMT 2018 shared task on parallel
corpus filtering. In Proceedings of the Third
Conference on Machine Translation: Shared
Task Papers, pages 726–739. Belgium,
Brussels. Association for Computational Linguistics.
DOI: https://doi.org/10.18653/v1/W18-6453
Guillaume Lample and Alexis Conneau. 2019.
Cross-lingual language model pretraining.
In Hanna M. Wallach, Hugo Larochelle, Alina
Beygelzimer, Florence d’Alché-Buc, Emily B.
Fox, and Roman Garnett, editors, Advances
in Neural Information Processing Systems 32:
Annual Conference on Neural Information
Processing Systems 2019, NeurIPS 2019,
8-14 December 2019, Vancouver, BC, Canada,
pages 7057–7067.
Guillaume Lample, Alexis Conneau, Ludovic
Denoyer, and Marc’Aurelio Ranzato. 2018a.
Unsupervised machine translation using monolingual
corpora only. In 6th International Conference
on Learning Representations, ICLR
2018, Vancouver, BC, Canada, April 30 –
May 3, 2018, Conference Track Proceedings.
OpenReview.net.
Guillaume Lample, Alexis Conneau,
Marc’Aurelio Ranzato, Ludovic Denoyer,
and Hervé Jégou. 2018b. Word translation
without parallel data. In 6th International
Conference on Learning Representations,
ICLR 2018, Vancouver, BC, Canada, April 30 –
May 3, 2018, Conference Track Proceedings.
OpenReview.net.
Guillaume Lample, Myle Ott, Alexis Conneau,
Ludovic Denoyer, and Marc’Aurelio Ranzato.
2018c. Phrase-based & neural unsupervised
machine translation. In Proceedings of the 2018
Conference on Empirical Methods in Natural
Language Processing, pages 5039–5049.
Brussels, Belgium. Association for Computational
Linguistics.
Jindrich Libovický, Rudolf Rosa, and Alexander
Fraser. 2019. How language-neutral is multilingual
BERT? CoRR, abs/1911.03310v1.
Yichao Lu, Phillip Keung, Faisal Ladhak, Vikas
Bhardwaj, Shaonan Zhang, and Jason Sun.
2018. A neural interlingua for multilingual
machine translation. In Proceedings of the Third
Conference on Machine Translation: Research
Papers, pages 84–92. Brussels, Belgium.
Association for Computational Linguistics.
Minh-Thang Luong and Christopher D. Manning.
2015. Stanford neural machine translation
systems for spoken language domains.
In Proceedings of the 12th International
Workshop on Spoken Language Translation,
pages 76–79. Da Nang, Vietnam.
Benjamin Marie, Hour Kaing, Aye Myat Mon,
Chenchen Ding, Atsushi Fujita, Masao
Utiyama, and Eiichiro Sumita. 2019. Supervised
and unsupervised machine translation
for Myanmar-English and Khmer-English. In
Proceedings of the 6th Workshop on Asian
Translation, pages 68–75. Hong Kong, China.
Association for Computational Linguistics.
DOI: https://doi.org/10.18653/v1/D19-5206
Phoebe Mulcaire, Jungo Kasai, and Noah A. Smith.
2019. Polyglot contextual representations
improve crosslingual transfer. In Proceedings
of the 2019 Conference of the North
American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short
Papers), pages 3912–3918. Minneapolis,
Minnesota. Association for Computational
Linguistics. DOI: https://doi.org/10.18653/v1/N19-1392
Toan Q. Nguyen and Julian Salazar. 2019.
Transformers without tears: Improving the
normalization of self-attention. In Proceedings
of the 16th International Workshop on Spoken
Language Translation. Hong Kong, China.
Zenodo.
Matt Post. 2018. A call for clarity in reporting
BLEU scores. In Proceedings of the Third
Conference on Machine Translation: Research
Papers, pages 186–191. Brussels,
Belgium. Association for Computational
Linguistics. DOI: https://doi.org/10.18653/v1/W18-6319
Philip Resnik. 1998. Parallel strands: A
preliminary investigation into mining the web
for bilingual text. In David Farwell, Laurie Gerber,
and Eduard H. Hovy, editors, Machine
Translation and the Information Soup, Third
Conference of the Association for Machine
Translation in the Americas, AMTA ’98,
Langhorne, PA, USA, October 28-31, 1998,
Proceedings, volume 1529 of Lecture Notes in
Computer Science, pages 72–82. Springer.
Philip Resnik and Noah A. Smith. 2003. The
web as a parallel corpus. Computational Linguistics,
29(3):349–380. DOI:
https://doi.org/10.1162/089120103322711578
Holger Schwenk. 2018. Filtering and mining
parallel data in a joint multilingual space. In
Proceedings of the 56th Annual Meeting of the
Association for Computational Linguistics
(Volume 2: Short Papers), pages 228–234.
Melbourne, Australia. Association for Computational
Linguistics. DOI: https://doi.org/10.18653/v1/P18-2037
Holger Schwenk, Vishrav Chaudhary, Shuo Sun,
Hongyu Gong, and Francisco Guzmán. 2019a.
WikiMatrix: Mining 135M parallel sentences
in 1620 language pairs from Wikipedia. CoRR,
abs/1907.05791v2.
Holger Schwenk, Guillaume Wenzek, Sergey
Edunov, Edouard Grave, and Armand Joulin.
2019b. CCMatrix: Mining billions of high-quality
parallel sentences on the WEB. CoRR,
abs/1911.04944v2.
Rico Sennrich, Barry Haddow, and Alexandra
Birch. 2016. Improving neural machine translation
models with monolingual data. In Proceedings
of the 54th Annual Meeting of the
Association for Computational Linguistics
(Volume 1: Long Papers), pages 86–96.
Berlin, Germany. Association for Computational
Linguistics. DOI: https://doi.org/10.18653/v1/P16-1009
Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and
Tie-Yan Liu. 2019. MASS: Masked sequence to
sequence pre-training for language generation.
In Proceedings of the 36th International
Conference on Machine Learning, ICML 2019,
9-15 June 2019, Long Beach, California, USA,
volume 97 of Proceedings of Machine Learning
Research, pages 5926–5936. PMLR.
Jingjing Xu, Xu Sun, Zhiyuan Zhang, Guangxiang
Zhao, and Junyang Lin. 2019. Understanding
and improving layer normalization. In Advances
in Neural Information Processing Systems 32:
Annual Conference on Neural Information
Processing Systems 2019, NeurIPS 2019,
8-14 December 2019, Vancouver, BC, Canada,
pages 4383–4393.
Zhen Yang, Wei Chen, Feng Wang, and Bo Xu.
2018. Unsupervised neural machine translation
with weight sharing. In Proceedings of the 56th
Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers),
pages 46–55. Melbourne, Australia. Association
for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/P18-1005
David Yarowsky. 1995. Unsupervised word sense
disambiguation rivaling supervised methods.
In 33rd Annual Meeting of the Association
for Computational Linguistics, pages 189–196.
Cambridge, Massachusetts, USA. Association
for Computational Linguistics.
Jiawei Zhou and Phillip Keung. 2020. Improving
non-autoregressive neural machine translation
with monolingual data. In ACL.
Pierre Zweigenbaum, Serge Sharoff, and Reinhard
Rapp. 2017. Overview of the second BUCC
shared task: Spotting parallel sentences in
comparable corpora. In Proceedings of the 10th
Workshop on Building and Using Comparable
Corpora, pages 60–67. Vancouver, Canada.
Association for Computational Linguistics.