Unsupervised Bitext Mining and Translation
via Self-Trained Contextual Embeddings

Phillip Keung•

Julian Salazar• Yichao Lu• Noah A. Smith†‡

•Amazon †University of Washington

‡Allen Institute for AI

{keung,julsal,yichaolu}@amazon.com nasmith@cs.washington.edu

Abstract

We describe an unsupervised method to create pseudo-parallel corpora for machine translation (MT) from unaligned text. We use multilingual BERT to create source and target sentence embeddings for nearest-neighbor search and adapt the model via self-training. We validate our technique by extracting parallel sentence pairs on the BUCC 2017 bitext mining task and observe up to a 24.5 point increase (absolute) in F1 scores over previous unsupervised methods. We then improve an XLM-based unsupervised neural MT system pre-trained on Wikipedia by supplementing it with pseudo-parallel text mined from the same corpus, boosting unsupervised translation performance by up to 3.5 BLEU on the WMT'14 French-English and WMT'16 German-English tasks and outperforming the previous state-of-the-art. Finally, we enrich the IWSLT'15 English-Vietnamese corpus with pseudo-parallel Wikipedia sentence pairs, yielding a 1.2 BLEU improvement on the low-resource MT task. We demonstrate that unsupervised bitext mining is an effective way of augmenting MT datasets and complements existing techniques like initializing with pre-trained contextual embeddings.

1 Introduction

Large corpora of parallel sentences are prerequisites for training models across a diverse set of applications, such as neural machine translation (NMT; Bahdanau et al., 2015), paraphrase generation (Bannard and Callison-Burch, 2005), and aligned multilingual sentence embeddings (Artetxe and Schwenk, 2019b). Systems that extract parallel corpora typically rely on various cross-lingual resources (e.g., bilingual lexicons, parallel corpora), but recent work has shown that unsupervised parallel sentence mining (Hangya et al., 2018) and unsupervised NMT (Artetxe et al., 2018; Lample et al., 2018a) produce surprisingly good results.1

Existing approaches to unsupervised parallel sentence (or bitext) mining start from bilingual word embeddings (BWEs) learned via an unsupervised, adversarial approach (Lample et al., 2018b). Hangya et al. (2018) created sentence representations by mean-pooling BWEs over content words. To disambiguate semantically similar but non-parallel sentences, Hangya and Fraser (2019) additionally proposed parallel segment detection by searching for paired substrings with high similarity scores per word. However, using word embeddings to generate sentence embeddings ignores sentential context, which may degrade bitext retrieval performance.

We describe a new unsupervised bitext mining approach based on contextual embeddings. We create sentence embeddings by mean-pooling the outputs of multilingual BERT (mBERT; Devlin et al., 2019), which is pre-trained on unaligned Wikipedia sentences across 104 languages. For a pair of source and target languages, we find candidate translations by using nearest-neighbor search with margin-based similarity scores between pairs of mBERT-embedded source and target sentences. We bootstrap a dataset of positive and negative sentence pairs from these initial neighborhoods of candidates, then self-train mBERT on its own outputs. A final retrieval step gives a corpus of pseudo-parallel sentence pairs, which we expect to be a mix of actual translations and semantically related non-translations.

1By unsupervised, we mean that no cross-lingual
resources like parallel text or bilingual lexicons are used.
Unsupervised techniques have been used to bootstrap MT
systems for low-resource languages like Khmer and Burmese
(Marie et al., 2019).



We apply our technique on the BUCC 2017 parallel sentence mining task (Zweigenbaum et al., 2017). We achieve state-of-the-art F1 scores on unsupervised bitext mining, with an improvement of up to 24.5 points (absolute) on published results (Hangya and Fraser, 2019). Other work (e.g., Libovický et al., 2019) has shown that retrieval performance varies substantially with the layer of mBERT used to generate sentence representations; using the optimal mBERT layer yields an improvement as large as 44.9 points.

Furthermore, our pseudo-parallel text improves unsupervised NMT (UNMT) performance. We build upon the UNMT framework of Lample et al. (2018c) and XLM (Lample and Conneau, 2019) by incorporating our pseudo-parallel text (also derived from Wikipedia) at training time. This boosts performance on WMT'14 En-Fr and WMT'16 En-De by up to 3.5 BLEU over the XLM baseline, outperforming the state-of-the-art on unsupervised NMT (Song et al., 2019).

Finally, we demonstrate the practical value of unsupervised bitext mining in the low-resource setting. We augment the English-Vietnamese corpus (133k pairs) from the IWSLT'15 translation task (Cettolo et al., 2015) with our pseudo-bitext from Wikipedia (400k pairs), and observe a 1.2 BLEU increase over the best published model (Nguyen and Salazar, 2019). When we reduced the amount of parallel and monolingual Vietnamese data by a factor of ten (13.3k pairs), the model trained with pseudo-bitext performed 7 BLEU points better than a model trained on the reduced parallel text alone.

2 Our Approach

Our aim is to create a bilingual sentence embedding space where, for each source sentence embedding, a sufficiently close nearest neighbor among the target sentence embeddings is its translation. By aligning source and target sentence embeddings in this way, we can extract sentence pairs to create new parallel corpora. Artetxe and Schwenk (2019a) construct this space by training a joint encoder-decoder MT model over multiple language pairs and using the resulting encoder to generate sentence embeddings. A margin-based similarity score is then computed between embeddings for retrieval (Section 2.2). However, this approach requires large parallel corpora to train the encoder-decoder model in the first place.

We investigate whether contextualized sentence embeddings created with unaligned text are useful for unsupervised bitext retrieval. Previous work explored the use of multilingual sentence encoders taken from machine translation models (e.g., Artetxe and Schwenk, 2019b; Lu et al., 2018) for zero-shot cross-lingual transfer. Our work is motivated by recent success in tasks like zero-shot text classification and named entity recognition (e.g., Keung et al., 2019; Mulcaire et al., 2019) with multilingual contextual embeddings, which exhibit cross-lingual properties despite being trained without parallel sentences.

We illustrate our method in Figure 1. We first retrieve the candidate translation pairs:

• Each source and target language sentence
is converted into an embedding vector with
mBERT via mean-pooling.

• Margin-based scores are computed for each
sentence pair using the k nearest neighbors
of the source and target sentences (Sec. 2.2).

• Each source sentence is paired with its nearest neighbor in the target language based on this score.

• We select a threshold score that keeps some top percentage of pairs (Sec. 2.2).

• Rule-based filters are applied to further re-
move mismatched sentence pairs (Sec. 2.3).

The remaining candidate pairs are used to
bootstrap a dataset for self-training mBERT as
follows:

• Each candidate pair (a source sentence and its
closest nearest neighbor above the threshold)
is taken as a positive example.

• This source sentence is also paired with its
next k − 1 neighbors to give hard negative
examples (we compare this with random
negative samples in Sec. 3.3).

• We finetune mBERT to produce sentence embeddings that discriminate between positive and negative pairs (Sec. 2.4).

After self-training, the finetuned mBERT model
is used to generate new sentence embeddings.
Parallel sentences should be closer to each other
in this new embedding space, which improves
retrieval performance.


Figure 1: Our self-training scheme. Left: We index sentences using our two encoders. For each source sentence, we retrieve k nearest-neighbor target sentences per the margin criterion (Eq. 1), depicted here for k = 4. If the nearest neighbor is within a threshold, it is treated with the source sentence as a positive pair, and the remaining k − 1 are treated with the source sentence as negative pairs. Right: We refine one of the encoders such that the cosine similarity of the two embeddings is maximized on positive pairs and minimized on negative pairs.

2.1 Sentence Embeddings and Nearest-neighbor Search

We use mBERT (Devlin et al., 2019) to create sentence embeddings for both languages by mean-pooling the representations from the final layer. We use FAISS (Johnson et al., 2017) to perform exact nearest-neighbor search on the embeddings. We compare every sentence in the source language to every sentence in the target language; we do not use links between Wikipedia articles or other metadata to reduce the size of the search space. In our experiments, we retrieve the k = 4 closest target sentences for each source sentence; the source language is always non-English, while the target language is always English.
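To make the retrieval step concrete, the following sketch (not the authors' GluonNLP code; it substitutes the HuggingFace transformers API, and the two sentence lists are placeholder examples) mean-pools final-layer mBERT outputs and runs exact k = 4 nearest-neighbor search with FAISS:

    import faiss
    import numpy as np
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModel.from_pretrained("bert-base-multilingual-cased").eval()

    def embed(sentences, batch_size=32):
        """Mean-pool the final-layer token vectors of each sentence."""
        vecs = []
        for i in range(0, len(sentences), batch_size):
            batch = tokenizer(sentences[i:i + batch_size], padding=True,
                              truncation=True, return_tensors="pt")
            with torch.no_grad():
                hidden = model(**batch).last_hidden_state        # (B, T, 768)
            mask = batch["attention_mask"].unsqueeze(-1).float()
            vecs.append(((hidden * mask).sum(1) / mask.sum(1)).numpy())
        return np.ascontiguousarray(np.concatenate(vecs), dtype="float32")

    src_sentences = ["Das ist ein Beispiel.", "Berlin ist die Hauptstadt Deutschlands."]
    tgt_sentences = ["This is an example.", "Berlin is the capital of Germany."]
    src_vecs, tgt_vecs = embed(src_sentences), embed(tgt_sentences)
    faiss.normalize_L2(src_vecs)      # after L2 normalization, inner product = cosine
    faiss.normalize_L2(tgt_vecs)

    index_tgt = faiss.IndexFlatIP(tgt_vecs.shape[1])    # exact (non-approximate) search
    index_tgt.add(tgt_vecs)
    sims, nbrs = index_tgt.search(src_vecs, 4)          # k = 4 nearest English sentences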

2.2 Margin-based Score

We compute a margin-based similarity score between each source sentence and its k nearest target neighbors. Following Artetxe and Schwenk (2019a), we use the ratio margin score, which calibrates the cosine similarity by dividing it by the average cosine distance of each embedding's k nearest neighbors:

$$\mathrm{margin}(x, y) = \frac{\cos(x, y)}{\sum_{z \in \mathrm{NN}^{\mathrm{tgt}}_{k}(x)} \frac{\cos(x, z)}{2k} \;+\; \sum_{z \in \mathrm{NN}^{\mathrm{src}}_{k}(y)} \frac{\cos(y, z)}{2k}} \qquad (1)$$


We remove the sentence pairs with margin scores below some pre-selected threshold. For BUCC, we do not have development data for tuning the threshold hyperparameter, so we simply use the prior probability. For example, the creators of the dataset estimate that ∼2% of De sentences have an En translation, so we choose a score threshold such that we retrieve ∼2% of the pairs. We set the threshold in the same way for the other BUCC pairs. For UNMT with Wikipedia bitext mining, we set the threshold such that we always retrieve 2.5 million sentence pairs for each language pair.
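A minimal continuation of the previous sketch (reusing its src_vecs, tgt_vecs, and index_tgt, which are illustrative names rather than the authors' code) computes the ratio margin of Eq. 1 in both directions and keeps the top-scoring fraction of pairs:

    k = 4
    fwd_sims, fwd_nbrs = index_tgt.search(src_vecs, k)      # source -> target neighbors
    index_src = faiss.IndexFlatIP(src_vecs.shape[1])
    index_src.add(src_vecs)
    bwd_sims, bwd_nbrs = index_src.search(tgt_vecs, k)      # target -> source neighbors

    def margin(x_id, y_id):
        """Ratio margin of Eq. 1 on L2-normalized embeddings (dot product = cosine)."""
        cos_xy = float(np.dot(src_vecs[x_id], tgt_vecs[y_id]))
        denom = fwd_sims[x_id].sum() / (2 * k) + bwd_sims[y_id].sum() / (2 * k)
        return cos_xy / denom

    # Pair each source sentence with its best-scoring target neighbor, then keep
    # only the top fraction of pairs (e.g., ~2% for BUCC, 2.5M pairs for Wikipedia).
    scored = []
    for x_id, neighbors in enumerate(fwd_nbrs):
        best = max((int(y) for y in neighbors), key=lambda y_id: margin(x_id, y_id))
        scored.append((margin(x_id, best), x_id, best))
    scored.sort(reverse=True)
    kept = scored[:max(1, int(0.02 * len(scored)))]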

2.3 Rule-based Filtering

We also apply two simple filtering steps before
finalizing the candidate pairs list:

• Digit filtering: Sentence pairs that are translations of each other must have digit sequences that match exactly.2

• Edit distance: Sentences from English Wikipedia sometimes appear in non-English pages and vice versa. We remove sentence pairs where the content of the source and target share substantial overlap (i.e., the character-level edit distance is ≤50%).

2 In Python, set(re.findall("[0-9]+", sent1)) == set(re.findall("[0-9]+", sent2)).
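A small self-contained sketch of these two filters is given below (the digit filter follows footnote 2; difflib's similarity ratio stands in here for a character-level edit-distance criterion, so the 0.5 cutoff is only indicative):

    import difflib
    import re

    def digits_match(src, tgt):
        """Translations must contain exactly the same set of digit sequences."""
        return set(re.findall("[0-9]+", src)) == set(re.findall("[0-9]+", tgt))

    def too_similar(src, tgt, cutoff=0.5):
        """Flag pairs whose source and target overlap heavily at the character
        level, e.g., an English sentence copied verbatim into a non-English page."""
        return difflib.SequenceMatcher(None, src, tgt).ratio() >= cutoff

    def keep_pair(src, tgt):
        return digits_match(src, tgt) and not too_similar(src, tgt)

    assert not keep_pair("Fondée en 1998.", "Founded in 2004.")     # digit mismatch
    assert not keep_pair("An English sentence on a German page.",
                         "An English sentence on a German page.")   # near-duplicate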

2.4 Self-training

We devise an unsupervised self-training technique to improve mBERT for bitext retrieval using mBERT's own outputs. For each source sentence, if the nearest target sentence is within the threshold and not filtered out, the pair is treated as a positive pair. We then keep the next k − 1 nearest neighbors as negative sentences. Altogether, these give us a training set of examples which are labeled as positive or negative pairs.
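The bootstrap can be sketched as follows, reusing the fwd_nbrs, margin, and keep_pair names from the earlier sketches (again illustrative, not the authors' implementation):

    def build_self_training_set(src_sentences, tgt_sentences, threshold):
        """Label the nearest neighbor above the threshold as positive (1) and the
        next k - 1 neighbors as hard negatives (0)."""
        examples = []
        for x_id, neighbors in enumerate(fwd_nbrs):
            best = int(neighbors[0])                       # nearest target sentence
            if margin(x_id, best) < threshold:
                continue
            if not keep_pair(src_sentences[x_id], tgt_sentences[best]):
                continue
            examples.append((x_id, best, 1))               # positive pair
            for y_id in neighbors[1:]:                     # remaining k - 1 neighbors
                examples.append((x_id, int(y_id), 0))      # hard negative pairs
        return examples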

We train mBERT to discriminate between
positive and negative sentence pairs as a binary
classification task. We distinguish the mBERT
encoders for the source and target languages as
fsrc, ftgt respectively. Our training objective is

$$L(X, Y; \Theta_{\mathrm{src}}) = \left| \frac{f_{\mathrm{src}}(X; \Theta_{\mathrm{src}})^{\top} f_{\mathrm{tgt}}(Y)}{\lVert f_{\mathrm{src}}(X; \Theta_{\mathrm{src}}) \rVert \, \lVert f_{\mathrm{tgt}}(Y) \rVert} - \mathrm{Par}(X, Y) \right| \qquad (2)$$

where fsrc(X) and ftgt(Y) are the mean-pooled representations of the source sentence X and target sentence Y, and where Par(X, Y) is 1 if X, Y are parallel and 0 otherwise. This loss encourages the cosine similarity between the source and target embeddings to increase for positive pairs and decrease otherwise. The process is depicted in Figure 1.

Note that we only finetune fsrc (parameters Θsrc) and we hold ftgt fixed. If both fsrc and ftgt are updated, then the training process collapses to a trivial solution, since the model will map all pseudo-parallel pairs to one representation and all non-parallel pairs to another. Holding ftgt fixed forces fsrc to align its outputs to the target (in our experiments, always English) mBERT embeddings.
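A minimal PyTorch rendering of this objective is given below (the paper used GluonNLP/MXNet, so module and variable names here are assumptions); note that only the source-side encoder receives gradients:

    import torch
    import torch.nn.functional as F
    from transformers import AutoModel

    f_src = AutoModel.from_pretrained("bert-base-multilingual-cased")
    f_tgt = AutoModel.from_pretrained("bert-base-multilingual-cased")
    for p in f_tgt.parameters():            # hold f_tgt fixed to prevent collapse
        p.requires_grad_(False)

    def mean_pool(model, batch):
        hidden = model(**batch).last_hidden_state
        mask = batch["attention_mask"].unsqueeze(-1).float()
        return (hidden * mask).sum(1) / mask.sum(1)

    def self_training_loss(src_batch, tgt_batch, par):
        """Eq. 2: par is a 0/1 tensor, 1 for pseudo-parallel pairs, 0 for negatives."""
        cos = F.cosine_similarity(mean_pool(f_src, src_batch),
                                  mean_pool(f_tgt, tgt_batch))
        return (cos - par.float()).abs().mean()            # |cos(X, Y) - Par(X, Y)|

    optimizer = torch.optim.Adam(f_src.parameters(), lr=1e-5)   # constant LR (Sec. 3.1)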

After finetuning, we use the updated fsrc to
generate new non-English sentence embeddings.
We then repeat the retrieval process with FAISS,
yielding a final set of pseudo-parallel pairs after
thresholding and filtering.

3 Unsupervised Bitext Mining

We apply our method to the BUCC 2017 shared task, "Spotting Parallel Sentences in Comparable Corpora" (Zweigenbaum et al., 2017). The task involves retrieving parallel sentences from monolingual corpora derived from Wikipedia. Parallel sentences were inserted into the corpora in a contextually appropriate manner by the task organizers. The shared task assessed retrieval systems for precision, recall, and F1 score on four language pairs: De-En, Fr-En, Ru-En, and Zh-En. Previous work on unsupervised bitext mining has generally studied the European language pairs to avoid dealing with Chinese word segmentation (Hangya et al., 2018; Hangya and Fraser, 2019).

3.1 Setup

For each BUCC language pair, we take the corresponding source and target monolingual corpora, which have been pre-split into training, sample, and test sets at a ratio of 49%–2%–49%. The identities of the parallel sentence pairs for the test set were not publicly released; they are only available for the training set. Following the convention established in Hangya and Fraser (2019) and Artetxe and Schwenk (2019a), we use the test portion for unsupervised system development and evaluate on the training portion.
We use the reference FAISS implementation3 for nearest-neighbor search. We used the GluonNLP toolkit (Guo et al., 2020) with pre-trained mBERT weights4 for inference and self-training. We compute the margin similarity score in Eq. 1 with k = 4 nearest neighbors. We set a threshold on the score such that we retrieve the prior proportion (e.g., ∼2%) of parallel pairs in each language.

We then finetune mBERT via self-training. We take minibatches of 100 sentence pairs. We use the Adam optimizer with a constant learning rate of 0.00001 for 2 epochs. To avoid noisy translations, we finetune on the top 50% of the highest-scoring pairs from the retrieved bitext (e.g., if the prior proportion is 2%, then we would use the top 1% of sentence pairs for self-training).

We considered performing more than one round
of self-training but found it was not helpful for
the BUCC task. BUCC has very few parallel pairs
(e.g., 9,000 pairs for Fr-En) per language and thus
few positive pairs for our unsupervised method
to find. The size of the self-training corpus is
limited by the proportion of parallel sentences,
and mBERT rapidly overfits to small datasets.

3 https://github.com/facebookresearch/faiss.
4 https://github.com/google-research/bert/blob/master/multilingual.md.


Method                          De-En   Fr-En   Ru-En   Zh-En

Hangya and Fraser (2019)
  avg.                          30.96   44.81   19.80     –
  align-static                  42.81   42.21   24.53     –
  align-dyn.                    43.35   43.44   24.97     –

Our method
  mBERT (final layer)           42.1    45.8    36.9    35.8
  + digit filtering (DF)        47.0    49.3    41.2    38.0
  + edit distance (ED)          47.0    49.3    41.2    38.0
  + self-training (ST)          60.6    60.2    49.5    45.7
  mBERT (layer 8)               67.0    65.3    59.3    53.3
  + DF, ED, ST                  74.9    73.0    69.9    60.1

Table 1: F1 scores for unsupervised bitext retrieval on BUCC 2017. Results with mBERT are from our method (Sec. 2) using the final (12th) layer. We also include results for the 8th layer (e.g., Libovický et al., 2019), but do not consider this part of the unsupervised setting, as we would not have known a priori which layer was best to use.

Language pair   Parallel sentence pair

De-En           Beide Elemente des amerikanischen Traums haben heute einen Teil ihrer Anziehungskraft verloren.
                Both elements of the American dream have now lost something of their appeal.

Fr-En           L'Allemagne à elle seule s'attend à recevoir pas moins d'un million de demandeurs d'asile cette année.
                Germany alone expects as many as a million asylum-seekers this year.

Ru-En           However, in 1881, Thessaly and small parts of Epirus were ceded to Greece as part of the Treaty of Berlin.

Zh-En           In the strange new world of today, the modern and the pre-modern depend on each other.

Table 2: Examples of parallel sentences that were extracted by our method on the BUCC 2017 shared task.

3.2 Results

We provide a few examples of the bitext we
retrieved in Table 2. The examples were chosen
from the high-scoring pairs and verified to be
correct translations.

Our retrieval results are in Table 1. We compare our results with strictly unsupervised techniques, which do not use bilingual lexicons, parallel text, or other cross-lingual resources.

Using mBERT as-is with the margin-based score works reasonably well, giving F1 scores in the range of 35.8 to 45.8, which is competitive with the previous state-of-the-art for some pairs, and outperforming it by 12 points in the case of Ru-En. Furthermore, applying simple rule-based filters (Sec. 2.3) on the candidate translation pairs adds a few more points, although the edit distance filter has a negligible effect when compared with the digit filter.


Method           De-En   Fr-En   Ru-En   Zh-En

mBERT w/o ST      47.0    49.3    41.2    38.0
w/ ST (random)    57.7    55.7    48.1    45.2
w/ ST (hard)      60.6    60.2    49.5    45.7

Table 3: F1 scores for bitext retrieval on BUCC 2017 using random sentences as negative samples instead of nearest neighbors.

We see that finetuning mBERT on its own chosen sentence pairs (i.e., unsupervised self-training) yields significant improvements, adding another 8 to 14 points to the F1 score on top of filtering. In all, these F1 scores represent a 34% to 98% relative improvement over existing techniques in unsupervised parallel sentence extraction for these language pairs.

Libovický et al. (2019) explored bitext mining with mBERT in the supervised context and found that retrieval performance significantly varies with the mBERT layer used to create sentence embeddings. In particular, they found that layer 8 embeddings gave the highest precision-at-1. We also observe an improvement (Table 1) in unsupervised retrieval of another 13 to 20 points by using the 8th layer instead of the default final layer (12th). We include these results but do not consider them unsupervised, as we would not know a priori which layer was best to use.

3.3 Choosing Negative Sentence Pairs

Other authors (e.g., Guo et al., 2018) have noted that the choice of negative examples has a considerable impact on metric learning. Specifically, using negative examples which are difficult to distinguish from the positive nearest neighbor is often beneficial for performance. We examine the impact of taking random sentences instead of the remaining k − 1 nearest neighbors as the negatives during self-training.

Our results are in Table 3. While self-training with random negatives still greatly improves the untuned baseline, the use of hard negative examples mined from the k-nearest neighborhood can make a significant difference to the final F1 score.

4 Bitext for Neural Machine Translation

A major application of bitext mining is to create new corpora for machine translation. We conduct an extrinsic evaluation of our unsupervised bitext mining approach on unsupervised (WMT'14 French-English, WMT'16 German-English) and low-resource (IWSLT'15 English-Vietnamese) translation tasks.

We perform large-scale unsupervised bitext extraction on the October 2019 Wikipedia dumps in various languages. We use wikifil.pl5 to extract paragraphs from Wikipedia and remove markup. We then use the syntok6 package for sentence segmentation. Finally, we reduce the size of the corpus by removing sentences that are not part of the body of Wikipedia pages. Sentences that contain *, =, //, ::, #, www, (talk), or the pattern [0-9]{2}:[0-9]{2} are filtered out.
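The sentence-level cleanup amounts to a simple substring and regular-expression check, sketched below (the function and variable names are ours, not part of wikifil.pl or syntok):

    import re

    DROP_SUBSTRINGS = ("*", "=", "//", "::", "#", "www", "(talk)")
    TIMESTAMP = re.compile(r"[0-9]{2}:[0-9]{2}")

    def keep_sentence(sentence):
        """Drop extracted sentences that look like markup, talk pages, or timestamps."""
        if any(s in sentence for s in DROP_SUBSTRINGS):
            return False
        return not TIMESTAMP.search(sentence)

    sentences = ["Paris is the capital of France.",
                 "== History ==",
                 "Edited at 13:45 (talk)."]
    print([s for s in sentences if keep_sentence(s)])   # only the first sentence survives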

We index, retrieve, and filter candidate sentence pairs with the procedure in Sec. 3. Unlike BUCC, the Wikipedia dataset does not fit in GPU memory. The processed corpus is quite large, with 133 million, 67 million, 36 million, and 6 million sentences in English, German, French, and Vietnamese respectively. We therefore shard the dataset into chunks of 32,768 sentences and perform nearest-neighbor comparisons in chunks for each language pair. We use a simple map-reduce algorithm to merge the intermediate results back together.
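The sharded comparison can be sketched as follows (a simplified stand-in for our map-reduce pipeline, with assumed variable names): each chunk of target sentences is indexed separately, and the per-chunk top-k lists are merged into a global top-k.

    import faiss
    import numpy as np

    def search_sharded(src_vecs, tgt_vecs, k=4, chunk=32768):
        """Global top-k target neighbors for every source sentence, chunk by chunk."""
        n = len(src_vecs)
        best_sims = np.full((n, k), -np.inf, dtype="float32")
        best_ids = np.full((n, k), -1, dtype="int64")
        for start in range(0, len(tgt_vecs), chunk):        # "map": search one shard
            block = tgt_vecs[start:start + chunk]
            index = faiss.IndexFlatIP(block.shape[1])
            index.add(block)
            sims, ids = index.search(src_vecs, k)
            # "reduce": merge this shard's top-k with the running global top-k
            merged_sims = np.hstack([best_sims, sims])
            merged_ids = np.hstack([best_ids, ids + start])
            order = np.argsort(-merged_sims, axis=1)[:, :k]
            rows = np.arange(n)[:, None]
            best_sims, best_ids = merged_sims[rows, order], merged_ids[rows, order]
        return best_sims, best_ids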

We follow the approach outlined in Sec. 2 for Wikipedia bitext mining. For each source sentence, we retrieve the four nearest target neighbors across the millions of sentences that we extracted from Wikipedia and compute the margin-based scores for each pair.

4.1 Unsupervised NMT

We show that our pseudo-parallel text can complement existing techniques for unsupervised translation (Artetxe et al., 2018; Lample et al., 2018c). In line with existing work on UNMT, we evaluate our approach on the WMT'14 Fr-En and WMT'16 De-En test sets.

Our UNMT experiments build upon the reference implementation7 of XLM (Lample and Conneau, 2019). The UNMT model is trained by alternating between two steps: a denoising autoencoder step and a backtranslation step (refer to Lample et al., 2018c for more details). The backtranslation step generates pseudo-parallel

5 https://github.com/facebookresearch/fastText/blob/master/wikifil.pl.
6 https://github.com/fnl/syntok.
7 https://github.com/facebookresearch/xlm.


Reference                               Architecture          Pre-training   En-De   De-En   En-Fr   Fr-En

Artetxe et al. (2018)                   2-layer RNN                           6.89    10.16   15.13   15.56
Lample et al. (2018a)                   3-layer RNN                           9.75    13.33   15.05   14.31
Yang et al. (2018)                      4-layer Transformer                  10.86    14.62   16.97   15.58
Lample et al. (2018c)                   4-layer Transformer                  17.16    21.00   25.14   24.18
Song et al. (2019)                      6-layer Transformer   MASS           28.3     35.2    37.5    34.9

XLM Baselines
Lample and Conneau (2019)               6-layer Transformer   XLM              –       –      33.4    33.3
Song et al. (2019)                      6-layer Transformer   XLM            27.0     34.3    33.4    33.3
XLM reference implementation            6-layer Transformer   XLM              –       –      36.6    34.0
Maximum performance across baselines    6-layer Transformer   XLM            27.0     34.3    36.6    34.0

Ours
Our XLM baseline                        6-layer Transformer   XLM            27.7     34.5    36.7    34.5
w/ pseudo-parallel text before ST       6-layer Transformer   XLM            30.4     36.3    39.7    35.9
w/ pseudo-parallel text after ST        6-layer Transformer   XLM            30.7     37.3    40.2    36.9

Table 4: BLEU scores for unsupervised NMT performance on WMT'14 English-French and WMT'16 English-German test sets. All methods only use unaligned Wikipedia corpora for pre-training and/or bitext mining. 'ST' refers to self-training.

training data, and we incorporate our bitext during
UNMT training in the same way, as another
set of pseudo-parallel sentences. We also use
the same initialization as Lample and Conneau
(2019), where the UNMT models have encoders
and decoders that are initialized with contextual
embeddings trained on the source and target
language Wikipedia corpora with the masked
language model (MLM) objective; no parallel
data is used.

We performed the exhaustive (Fr Wiki)-(En Wiki) and (De Wiki)-(En Wiki) nearest-neighbor comparison on eight V100 GPUs, which requires 3 to 4 days to complete per language pair. We retained the top 2.5 million pseudo-parallel Fr-En and De-En sentence pairs after mining.

4.2 Results

Our results are in Table 4. The addition of mined bitext consistently increases the BLEU score in both directions for WMT'14 Fr-En and WMT'16 De-En. Much of the existing work on improving UNMT focuses on improved initialization with contextual embeddings like XLM or MASS (Song et al., 2019). These embeddings were already pre-trained on Wikipedia data, so it is surprising that adding our pseudo-parallel Wikipedia sentences leads to a 2 to 3 BLEU improvement. In other words, our approach is complementary to pre-trained initialization techniques.

Previously (in Table 1), we saw that self-training improved the F1 score for BUCC bitext retrieval. The improvement in bitext quality carries over to UNMT, and providing better pseudo-parallel text yields a consistent improvement for all translation directions.

Our results are state-of-the-art in UNMT, but they should be interpreted relative to the strength of our XLM baseline. We are building on top of the XLM initialization, and the effectiveness of the initialization (and the various hyperparameters used during training and decoding) affects the strength of our final results. For example, we adjusted the beam width on our XLM baselines to attain BLEU scores which are similar to what others have published. One can apply our method to MASS, which performs better than XLM on UNMT, but we chose to report results on XLM because it has been validated on a wider range of tasks and languages.

We also trained a standard 6-layer transformer encoder-decoder model directly on the pseudo-parallel text. We used the standard implementation in Sockeye (Hieber et al., 2018) as-is, and trained models for French and German on 2.5 million Wikipedia sentence pairs. We withheld 10k pseudo-parallel pairs per language pair to serve as a development set. We achieved BLEU scores of 20.8, 21.1, 28.2, and 28.0 on En-De, De-En, En-Fr, and Fr-En respectively. BLEU scores were computed with SacreBLEU (Post, 2018).


This compares favorably with the best UNMT
results in Lample et al. (2018C), while avoiding
the use of parallel development data altogether.

4.3 Low-resource NMT

French and German are high-resource languages and are linguistically close to English. We therefore evaluate our mined bitext on a low-resource, linguistically distant language pair. The IWSLT'15 English-Vietnamese MT task (Cettolo et al., 2015) provides 133k sentence pairs derived from translated TED talk transcripts and is a common benchmark for low-resource MT. We take supervised training data from the IWSLT task and augment it with different amounts of pseudo-parallel text mined from English and Vietnamese Wikipedia. In addition, we construct a very low-resource setting by downsampling the parallel text and monolingual Vietnamese Wikipedia text by a factor of ten (13.3k sentence pairs).

We use the reference implementation8 for the
state-of-the-art model (Nguyen and Salazar, 2019),
which is a highly regularized 6+6-layer trans-
former with pre-norm residual connections, scale
normalization, and normalized word embeddings.
We use the same hyperparameters (except for the
dropout rate) but train on our augmented datasets.
To mitigate domain shift, we finetune the best
checkpoint for 75k more steps using only the
IWSLT training data, in the spirit of ‘‘trivial’’
transfer learning for low-resource NMT (Kocmi
and Bojar, 2018).

In Table 5, we show BLEU scores as more pseudo-parallel text is included during training. As in previous work on En-Vi (cf. Luong and Manning, 2015), we use tst2012 (1,553 pairs) and tst2013 (1,268 pairs) as our development and test sets respectively, we tokenize all data with Moses, and we report tokenized BLEU via multi-bleu.perl. The BLEU score increases monotonically with the size of the pseudo-parallel corpus and exceeds the state-of-the-art system's BLEU by 1.2 points. This result is consistent with improvements observed with other types of monolingual data augmentation like pre-trained UNMT initialization, various forms of backtranslation (Hoang et al., 2018; Zhou and Keung, 2020), and cross-view training (CVT; Clark et al., 2018).

8 https://github.com/tnq177/transformers_without_tears.

                                    En-Vi

Luong and Manning (2015)            26.4
Clark et al. (2018)                 28.9
Clark et al. (2018), with CVT       29.6
Xu et al. (2019)                    31.4
Nguyen and Salazar (2019)           32.8 (28.8)
+ top 100k mined pairs              33.2 (29.5)
+ top 200k mined pairs              33.9 (29.8)
+ top 300k mined pairs              34.0 (30.0)
+ top 400k mined pairs              34.1 (29.9)

Table 5: Tokenized BLEU scores on tst2013 for the low-resource IWSLT'15 English-Vietnamese translation task using bitext mined with our method. Added pairs are sorted by their score. Development scores on tst2012 in parentheses.

We describe our hyperparameter tuning and infrastructure following Dodge et al. (2019). The translation sections of this work mostly used default parameters, but we did tune the dropout rate (at 0.2 and 0.3) for each amount of mined bitext for the supervised En-Vi task (at 100k, 200k, 300k, and 400k sentence pairs). We include development scores for our best models; dropout of 0.3 did best for 0k and 100k, while 0.2 did best otherwise. Training takes less than a day on one V100 GPU.

To simulate a very low-resource task, we use one-tenth of the training data by downsampling the IWSLT En-Vi training set to 13.3k sentence pairs. In addition, we mine bitext from one-tenth of the monolingual Wiki Vi text and extract proportionately fewer sentence pairs (i.e., 10k, 20k, 30k, and 40k pairs). We use the implementation and hyperparameters for the regularized 4+4-layer transformer used by Nguyen and Salazar (2019) in a similar setting. We tune the dropout rate (0.2, 0.3, 0.4) to maximize development performance; 0.4 was best for 0k, 0.3 for 10k and 20k, and 0.2 for 30k and 40k. In Table 6, we see larger improvements in BLEU (4+ points) for the same relative increases in mined data (as compared to Table 5). In both cases, the rate of improvement tapers off as the quality and relative quantity of mined pairs degrades at each increase.

4.4 UNMT Ablation Study: Pre-training and Bitext Mining Corpora

In Sec. 4.2, we mined bitext from the October
2019 Wikipedia snapshot whereas the pre-trained


                                     En-Vi, one-tenth

13.3k pairs (from 133k original)     20.7 (19.5)
+ top 10k mined pairs                25.0 (22.9)
+ top 20k mined pairs                26.7 (24.1)
+ top 30k mined pairs                27.3 (24.5)
+ top 40k mined pairs                27.7 (24.7)

Table 6: Tokenized BLEU scores (tst2013), where the bitext was mined from one-tenth of the monolingual Vietnamese data. Development scores on tst2012 in parentheses.

XLM embeddings were created prior to January 2019. Hence, it is possible that the UNMT BLEU increase would be smaller if the bitext were mined from the same corpus used for pre-training. We ran an ablation study to show the effect (or lack thereof) of the overlap between the pre-training and pseudo-parallel corpora.

For the En-Vi language pair, we used 5 million English and 5 million Vietnamese Wiki sentences to pre-train the XLM model. We only use text from the October 2019 Wiki snapshot. We mined 300k pseudo-parallel sentence pairs using our approach (Sec. 2) from the same Wiki snapshot. We created two datasets for XLM pre-training: a 10 million-sentence corpus that is disjoint from the 600k sentences of the mined bitext, and a 10 million-sentence corpus that contains all 600k sentences of the bitext. In Table 7, we show the BLEU increase on the IWSLT En-Vi task with and without using the mined bitext as parallel data, using each of the two XLM models as the initialization.

The benefit of using pseudo-parallel text is very clear; even if the pre-trained XLM model saw the pseudo-parallel sentences during pre-training, using mined bitext still significantly improves UNMT performance (23.1 vs. 28.3 BLEU). In addition, the baseline UNMT performance without the mined bitext is similar between the two XLM initializations (23.1 vs. 23.2 BLEU), which suggests that removing some of the parallel text present during pre-training does not have a major effect on UNMT.

Finally, we trained a standard encoder-decoder model on the 300k pseudo-parallel pairs only, using the same Sockeye recipe as in Sec. 4.2. This yielded a BLEU score of 27.5 on En-Vi, which is lower than the best XLM-based result (i.e., 28.9), which suggests that the XLM initialization improves unsupervised NMT. A similar outcome was also reported in Lample and Conneau (2019).

                      w/o PP as bitext   w/ PP as bitext

XLM excl. PP text           23.2              28.9
XLM incl. PP text           23.1              28.3

Table 7: Tokenized UNMT BLEU scores on IWSLT'15 English-Vietnamese (tst2013) with XLM initialization. We mined 300k pseudo-parallel (PP) sentence pairs from En and Vi Wikipedia (Oct. 2019). We created two XLM models, with the pre-training corpus including or excluding the PP pairs. We compare their downstream UNMT performance with and without PP pairs as "bitext" during UNMT training.

5 Related Work

5.1 Parallel Sentence Mining

Approaches to parallel sentence (or bitext) mining
have been historically driven by the data require-
ments of statistical machine translation. Some of
the earliest work in mining the Web for large-scale
parallel corpora can be found in Resnik (1998)
and Resnik and Smith (2003). Recent interest
in the field is reflected by new shared tasks on
parallel extraction and filtering (Zweigenbaum
et al., 2017; Koehn et al., 2018) and the creation
of massively multilingual parallel corpora mined
from the Web, like WikiMatrix (Schwenk et al.,
2019a) and CCMatrix (Schwenk et al., 2019b).

Existing parallel corpora have been exploited in
many ways to create sentence representations for
supervised bitext mining. One approach involves
a joint encoder with a shared wordpiece vocabu-
lary, trained as part of multiple encoder-decoder
translation models on parallel corpora (Schwenk,
2018). Artetxe and Schwenk (2019b) apply this
approach at scale, and shared a single encoder and
joint vocabulary across 93 languages. Another approach uses negative sampling to align the encoders' sentence representations for nearest-neighbor retrieval (Grégoire and Langlais, 2018; Guo et al., 2018).

However, these approaches require training with initial parallel corpora. In contrast, Hangya et al. (2018) and Hangya and Fraser (2019) proposed unsupervised methods for parallel sentence extraction that use bilingual word embeddings induced in an unsupervised manner. Our work is the first to explore using contextual representations (mBERT; Devlin et al., 2019) in an unsupervised manner to mine for bitext, and to

show improvements over the latest UNMT systems (Lample and Conneau, 2019; Song et al., 2019), for which transformers and encoder/decoder pre-training have doubled or tripled BLEU scores on unsupervised WMT'16 En-De since Artetxe et al. (2018) and Lample et al. (2018c).

5.2 Self-training Techniques

Self-training refers to techniques that use the outputs of a model to provide labels for its own training. Yarowsky (1995) proposed a semi-supervised strategy where a model is first trained on a small set of labeled data and then used to assign pseudo-labels to unlabeled data. Semi-supervised self-training has been used to improve sentence encoders that project sentences into a common semantic space. For example, Clark et al. (2018) proposed cross-view training (CVT) with labeled and unlabeled data to achieve state-of-the-art results on a set of sequence tagging, MT, and dependency parsing tasks.

Semi-supervised methods require some anno-
tated data, even if it is not directly related to the
target task. Our work is the first to apply unsuper-
vised self-training for generating cross-lingual
sentence embeddings. The most similar approach
to ours is the prevailing scheme for unsupervised
NMT (Lample et al., 2018C), which relies on
multiple iterations of backtranslation (Sennrich
et al., 2016) to create a sequence of pseudo-
parallel sentence pairs with which to bootstrap an
MT model.

6 Conclusion

In this work, we describe a novel approach for state-of-the-art unsupervised bitext mining using multilingual contextual representations. We extract pseudo-parallel sentences from unaligned corpora to create models that achieve state-of-the-art performance on unsupervised and low-resource translation tasks. Our approach is complementary to the improvements derived from initializing MT models with pre-trained encoders and decoders, and helps narrow the gap between unsupervised and supervised MT. We focused on mBERT-based embeddings in our experiments, but we expect unsupervised self-training to improve the unsupervised bitext mining and downstream UNMT performance of other forms of multilingual contextual embeddings as well.

Our findings are in line with recent work showing that multilingual embeddings are very useful for cross-lingual zero-shot and zero-resource tasks. Even without using aligned corpora, mBERT can embed sentences across different languages in a consistent fashion according to their semantic content. More work will be needed to understand how contextual embeddings discover these cross-lingual correspondences.

Acknowledgments

We would like to thank the anonymous reviewers
for their thoughtful comments.

References

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018. Unsupervised neural machine translation. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 – May 3, 2018, Conference Track Proceedings. OpenReview.net. DOI: https://doi.org/10.18653/v1/D18-1399

Mikel Artetxe and Holger Schwenk. 2019a. Margin-based parallel corpus mining with multilingual sentence embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3197–3203. Florence, Italy. Association for Computational Linguistics.

Mikel Artetxe and Holger Schwenk. 2019b. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597–610. DOI: https://doi.org/10.1162/tacl_a_00288

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Colin Bannard and Chris Callison-Burch. 2005.
Paraphrasing with bilingual parallel corpora.


In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 597–604. Ann Arbor, Michigan. Association for Computational Linguistics. DOI: https://doi.org/10.3115/1219840.1219914

Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, Roldano Cattoni, and Marcello Federico. 2015. The IWSLT 2015 evaluation campaign. In Proceedings of the 12th International Workshop on Spoken Language Translation, pages 2–14. Da Nang, Vietnam.

Kevin Clark, Minh-Thang Luong, Christopher D. Manning, and Quoc Le. 2018. Semi-supervised sequence modeling with cross-view training. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1914–1925. Brussels, Belgium. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186. Minneapolis, Minnesota. Association for Computational Linguistics.

Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A. Smith. 2019. Show your work: Improved reporting of experimental results. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2185–2194. Hong Kong, China. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D19-1224

Francis Grégoire and Philippe Langlais. 2018. Extracting parallel sentences with bidirectional recurrent neural networks to improve machine translation. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1442–1453. Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Jian Guo, He He, Tong He, Leonard Lausen, Mu Li, Haibin Lin, Xingjian Shi, Chenguang Wang, Junyuan Xie, Sheng Zha, Aston Zhang, Hang Zhang, Zhi Zhang, Zhongyue Zhang, Shuai Zheng, and Yi Zhu. 2020. GluonCV and GluonNLP: Deep learning in computer vision and natural language processing. Journal of Machine Learning Research, 21:23:1–23:7.

Mandy Guo, Qinlan Shen, Yinfei Yang, Heming Ge, Daniel Cer, Gustavo Hernandez Abrego, Keith Stevens, Noah Constant, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Effective parallel corpus mining using bilingual sentence embeddings. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 165–176. Brussels, Belgium. Association for Computational Linguistics.

Viktor Hangya, Fabienne Braune, Yuliya Kalasouskaya, and Alexander Fraser. 2018. Unsupervised parallel sentence extraction from comparable corpora. In Proceedings of the 15th International Workshop on Spoken Language Translation, pages 7–13. Bruges, Belgium.

Viktor Hangya and Alexander Fraser. 2019. Unsupervised parallel sentence extraction with parallel segment detection helps machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1224–1234. Florence, Italy. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P19-1118

Felix Hieber, Tobias Domhan, Michael Denkowski, David Vilar, Artem Sokolov, Ann Clifton, and Matt Post. 2018. The Sockeye neural machine translation toolkit at AMTA 2018. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Papers), pages 200–207. Boston, MA. Association for Machine Translation in the Americas.

Vu Cong Duy Hoang, Philipp Koehn, Gholamreza Haffari, and Trevor Cohn. 2018. Iterative back-translation for neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 18–24. Melbourne, Australia. Association for Computational Linguistics.


Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. CoRR, abs/1702.08734v1. DOI: https://doi.org/10.1109/TBDATA.2019.2921572

Phillip Keung, Yichao Lu, and Vikas Bhardwaj. 2019. Adversarial learning with contextual embeddings for zero-resource cross-lingual classification and NER. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1355–1360. Hong Kong, China. Association for Computational Linguistics.

Tom Kocmi and Ondřej Bojar. 2018. Trivial transfer learning for low-resource neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 244–252. Brussels, Belgium. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/W18-6325

Philipp Koehn, Huda Khayrallah, Kenneth Heafield, and Mikel L. Forcada. 2018. Findings of the WMT 2018 shared task on parallel corpus filtering. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 726–739. Belgium, Brussels. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/W18-6453

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett, editors, In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 7057–7067.

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018a. Unsupervised machine translation using monolingual corpora only. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 – May 3, 2018, Conference Track Proceedings. OpenReview.net.

Guillaume Lample, Alexis Conneau, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018b. Word translation without parallel data. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 – May 3, 2018, Conference Track Proceedings. OpenReview.net.

Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018c. Phrase-based & neural unsupervised machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5039–5049. Brussels, Belgium. Association for Computational Linguistics.

Jindřich Libovický, Rudolf Rosa, and Alexander Fraser. 2019. How language-neutral is multilingual BERT? CoRR, abs/1911.03310v1.

Yichao Lu, Phillip Keung, Faisal Ladhak, Vikas Bhardwaj, Shaonan Zhang, and Jason Sun. 2018. A neural interlingua for multilingual machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 84–92. Brussels, Belgium. Association for Computational Linguistics.

Minh-Thang Luong and Christopher D. Manning. 2015. Stanford neural machine translation systems for spoken language domains. In Proceedings of the 12th International Workshop on Spoken Language Translation, pages 76–79. Da Nang, Vietnam.

Benjamin Marie, Hour Kaing, Aye Myat Mon, Chenchen Ding, Atsushi Fujita, Masao Utiyama, and Eiichiro Sumita. 2019. Supervised and unsupervised machine translation for Myanmar-English and Khmer-English. In Proceedings of the 6th Workshop on Asian Translation, pages 68–75. Hong Kong, China. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D19-5206

Phoebe Mulcaire, Jungo Kasai, and Noah A. Smith. 2019. Polyglot contextual representations


improve crosslingual transfer. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3912–3918. Minneapolis, Minnesota. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/N19-1392

Toan Q. Nguyen and Julian Salazar. 2019. Transformers without tears: Improving the normalization of self-attention. In Proceedings of the 16th International Workshop on Spoken Language Translation. Hong Kong, China. Zenodo.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191. Brussels, Belgium. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/W18-6319

Philip Resnik. 1998. Parallel strands: A preliminary investigation into mining the web for bilingual text. David Farwell, Laurie Gerber, and Eduard H. Hovy, editors, In Machine Translation and the Information Soup, Third Conference of the Association for Machine Translation in the Americas, AMTA '98, Langhorne, PA, USA, October 28-31, 1998, Proceedings, volume 1529 of Lecture Notes in Computer Science, pages 72–82. Springer.

Philip Resnik and Noah A. Smith. 2003. The web as a parallel corpus. Computational Linguistics, 29(3):349–380. DOI: https://doi.org/10.1162/089120103322711578

Holger Schwenk. 2018. Filtering and mining parallel data in a joint multilingual space. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 228–234. Melbourne, Australia. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P18-2037

Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong, and Francisco Guzmán. 2019a. WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia. CoRR, abs/1907.05791v2.

Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave, and Armand Joulin. 2019b. CCMatrix: Mining billions of high-quality parallel sentences on the WEB. CoRR, abs/1911.04944v2.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96. Berlin, Germany. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P16-1009

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 5926–5936. PMLR.

Jingjing Xu, Xu Sun, Zhiyuan Zhang, Guangxiang Zhao, and Junyang Lin. 2019. Understanding and improving layer normalization. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 4383–4393.

Zhen Yang, Wei Chen, Feng Wang, and Bo Xu. 2018. Unsupervised neural machine translation with weight sharing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 46–55. Melbourne, Australia. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P18-1005

David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In 33rd Annual Meeting of the Association for Computational Linguistics, pages 189–196. Cambridge, Massachusetts, USA. Association for Computational Linguistics.


Jiawei Zhou and Phillip Keung. 2020. Improving
non-autoregressive neural machine translation
with monolingual data. In ACL.

Pierre Zweigenbaum, Serge Sharoff, and Reinhard Rapp. 2017. Overview of the second BUCC shared task: Spotting parallel sentences in comparable corpora. In Proceedings of the 10th Workshop on Building and Using Comparable Corpora, pages 60–67. Vancouver, Canada. Association for Computational Linguistics.
