Synthesizing Parallel Data of User-Generated Texts - Recherche en IA spécialisée au MIT

Synthesizing Parallel Data of User-Generated Texts
with Zero-Shot Neural Machine Translation

Benjamin Marie

Atsushi Fujita

National Institute of Information and Communications Technology
3-5 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0289, Japan
{bmarie, atsushi.fujita}@nict.go.jp

Abstrait

Neural machine translation (NMT) systèmes
are usually trained on clean parallel data. Ils
can perform very well for translating clean
in-domain texts. Cependant, as demonstrated by
previous work, the translation quality signifi-
cantly worsens when translating noisy texts,
such as user-generated texts (UGT) from on-
line social media. Given the lack of parallel
data of UGT that can be used to train or adapt
NMT systems, we synthesize parallel data of
UGT, exploiting monolingual data of UGT
through crosslingual
language model pre-
training and zero-shot NMT systems. Ce
paper presents two different but complemen-
tary approaches: One alters given clean parallel
data into UGT-like parallel data whereas the
other generates translations from monolingual
data of UGT. On the MTNT translation tasks,
we show that our synthesized parallel data can
lead to better NMT systems for UGT while
making them more robust in translating texts
from various domains and styles.

1 Introduction

Neural machine translation (NMT) requires large
parallel data for training. Cependant, même quand
trained on large clean parallel data, NMT gen-
erates translations of very poor quality when
translating out-of-domain or noisy texts. Pour
instance, Michel and Neubig (2018) empirically
showed that NMT systems trained on clean
parallel data from the news and parliamentary
debate domains perform reasonably well when
translating news articles but poorly perform at
translating user-generated texts (UGT) from a
social media. UGT can be from various domains
and manifest various forms of natural noise. Pour
instance, they can exhibit spelling/typographical

710

errors, words omission/insertion/repetition, gram-
matical/syntactic errors, or noise markers even
more specific to the writing style of social
media such as abbreviations, obfuscated profani-
liens, inconsistent capitalization, Internet slang, et
emojis. Normalizing and correcting them in a pre-
processing step is a solution to facilitate translation
(Gerlach et al., 2013; Matos Veliz et al., 2019),
but it impedes the correct transfer of the style of
the source text to its translation. In this paper, nous
posit that the NMT system should preserve the
style during the translation. Another trend of work
focuses on making NMT more robust in handling
noisy tokens, such as tokens with spelling mis-
takes, which can greatly disturb NMT (Belinkov
and Bisk, 2018). Cependant, it has only a mini-
mal impact in translating UGT (Karpukhin et al.,
2019) that contains other types of noise/errors.

Whereas domain adaptation methods are help-
ful in improving NMT for UGT (Li et al., 2019),
we do not usually have bilingual parallel data of
UGT created by professional translators to train
or adapt an NMT system. Par conséquent, previous
work on NMT for UGT merely focused on sce-
narios for which we have UGT parallel data, tel
as the MTNT dataset (Michel and Neubig, 2018).
In contrast to previous work, we assume that
parallel data of UGT are not available and that we
can only rely on the formal and clean texts that are
usually used to train NMT systems. En outre,
we exploit UGT monolingual data that are
publicly available in large quantity on the Internet
for many languages. We propose to synthesize
parallel data of UGT to train better NMT
systems for UGT. For this purpose, we present
two complementary approaches that associate
a pre-trained crosslingual language model with
zero-shot NMT systems. Our contributions are as
follows:

• A method for altering clean parallel data into

UGT parallel data

Transactions of the Association for Computational Linguistics, vol. 8, pp. 710–725, 2020. https://doi.org/10.1162/tacl a 00341
Action Editor: Philipp Koehn. Submission batch: 4/2020; Revision batch: 7/2020; Published 11/2020.
c(cid:13) 2020 Association for Computational Linguistics. Distributed under a CC-BY 4.0 Licence.

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

:
/
/

d
je
r
e
c
t
.

je
t
.

e
d
toi

/
t

un
c
je
/

un
r
t
je
c
e
–
p
d

F
/

d
o

je
/

1
0
1
1
6
2

/
t

un
c
_
un
_
0
0
3
4
1
1
9
2
3
4
5
9

/
t

un
c
_
un
_
0
0
3
4
1
p
d

b
oui
g
toi
e
s
t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

:
/
/

d
je
r
e
c
t
.

je
t
.

e
d
toi

/
t

un
c
je
/

un
r
t
je
c
e
–
p
d

F
/

d
o

je
/

1
0
1
1
6
2

/
t

un
c
_
un
_
0
0
3
4
1
1
9
2
3
4
5
9

/
t

un
c
_
un
_
0
0
3
4
1
p
d

b
oui
g
toi
e
s
t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

Chiffre 1: Examples of the impact of noise in NMT. The NMT systems are presented in Table 2. Vanilla NMT
is trained on clean parallel data, whereas ‘‘our work’’ refers to the configuration #1+#2 presented in Section 5.4
trained on synthetic parallel data of UGT.

• A method for synthesizing parallel data of

UGT from monolingual data

• An empirical evaluation, in four translation
instructions, of our methods that shows con-
sistent improvements in translation quality
over previous work for UGT but also on
various domains and styles

The remainder of this paper is organized as
follows. In Section 2, we present the research
problem and questions that we answer in this work.
Alors, in Section 3, we present a zero-shot NMT
framework that we use to synthesize parallel data
of UGT by our two methods presented in Section 4.
We evaluate the usefulness of our approaches for
better translating UGT in Section 5. In Sections 6
et 7, we evaluate alternative configurations for
our zero-shot NMT systems, and in Section 8 nous
verify whether our NMT systems trained on the
synthetic parallel data are more robust to changes
of domain and style. We analyze the synthetic
sentences and present examples in Section 9 à
better understand why our data lead to better NMT
systèmes. Following the presentation of related
work in Section 10, we conclude the paper in
Section 11.

2 Motivation

marginal in the discussions from Reddit (Michel
and Neubig, 2018).

Chiffre 1 shows the impact on MT of two
types of noise: spelling (Ex1) et
different
syntactic (Ex2) errors, compared to the translation
of the same but clean sentence (Ex3). Ex1 has
an intentional spelling error ‘‘vl`a’’ (instead of
‘‘voil`a’’) and a UGT-specific symbol,
‘‘#.’’
Comparison with Ex3 suggests that they have
negative effects on the vanilla NMT system and
eventually lead to an incorrect translation largely
different from the translation of the clean source
of Ex3. In Ex2, a syntactic error ‘‘arrive est’’
instead of ‘‘arrive’’ has also an impact, but to
a lesser extent, by inducing the past tense in
English. Vanilla NMT gives the best translation
for the clean source sentence (Ex3) only failing in
translating ‘‘COVID19.’’ For indicative purpose,
we present in the row ‘‘our work’’ translations
generated by our work. These examples highlight
the inability of vanilla NMT in translating sen-
tences with various types of noise.

In conducting the research to better translate
UGT, we answer the following research questions:

Q1 How can we generate synthetic parallel data
for UGT in a specific domain/style without
relying on any manually produced parallel
data of UGT?

UGT contains many different types of noise that
can also differ from one type of UGT to another.
Par exemple, posts on Twitter contain many
spelling errors intentionally introduced for text
compression, whereas this kind of error is rather

Q2 Do the synthetic parallel data lead to a better
NMT system for the targeted UGT and do
they make it more robust to the change of
domain or style?

711

3 Zero-Shot NMT for Synthesizing

Parallel Data

We describe in this section our zero-shot NMT
system used to synthesize parallel data of UGT.

3.1 Objective and Prerequisites

Let L1 and L2 be two languages for clean texts and
R1 and R2 for the same languages, respectivement,
but for UGT. The data prerequisites for our NMT
system described in Section 3.2 are as follows:

• PL1-L2 parallel data of clean and formal texts

that are usually used for training NMT,

• ML1 and ML2 monolingual data from any

domains, et

• MR1 and MR2 monolingual data of UGT.

Chiffre 2: Our zero-shot NMT framework.

Unlike previous work on NMT for UGT, we do
not assume any PR1-R2 parallel data for training or
validating NMT systems, except for evaluation.
PL1-L2, ML1, and ML2, parallel and monolingual
data, are usually used to build state-of-the-art
NMT systems. MR1 and MR2 monolingual data
are obtained by crawling social media.

Our objective is to synthesize parallel data of
UGT, which we henceforth denote as PS
R1-R2.
To this end, we propose the following two
approaches:

#1 Alter a clean parallel data PL1-L2 into PS

R1-R2

#2 Synthesize PS

R1-R2 parallel data by translating

MR2 monolingual data into R1

These approaches must regard L1 and R1, et
similarly L2 and R2, as two different languages.
Pour #1, we alter the PL1-L2 parallel data by
performing L1→R2 and L2→R1 translations.1
Pour #2, we generate the data via R2→R1
translation. Note that L1→R2, L2→R1, et
R2→R1 are all zero-shot translation tasks, because
we do not assume any PL1-R2, PL2-R1, PR1-R2
parallel data, nor any parallel data using a pivot
langue.

3.2 Zero-Shot NMT

For a given language pair L1-L2, we require
only one multilingual and multidirectional NMT
system to synthesize parallel data. The compo-

1We do not consider L1→R1 and L2→R2 (voir

Section 4.1).

nents of this system are presented in Figure 2.
Inspired by previous work in unsupervised NMT
(Conneau and Lample, 2019), we first pre-
train a cross-lingual language model to initialize
the NMT system. We use the XLM approach
(Conneau and Lample, 2019) trained with the
combination of
the following two different
objectifs:

Masked Language Model (MLM): MLM has
a similar objective to BERT (Devlin et al., 2019)
but uses text streams for training instead of pairs
of sentences. We optimize the MLM objective on
the ML1, ML2, MR1, and MR2 monolingual data.

Translation Language Model (TLM): TLM is
an extension of MLM where parallel data are
leveraged so that we can rely on context in two
different languages to predict masked words. Nous
optimize the TLM objective on PL1-L2 parallel
data, alternatively exploiting both translation
instructions.

The XLM approach alternates between MLM
and TLM objectives to train a single model. Par
sharing a single vocabulary for all of L1, L2, R1,
and R2, we expect XLM to implicitly model
translation knowledge for our zero-shot transla-
tion directions, namely, L1→R2, L2→R1, et
R2→R1, thanks to the joint training of MLM and
TLM, also maximally exploiting the similarity
between L1 and R1, and between L2 and R2.

Alors, the embeddings from the XLM model are
used to initialize the encoder and decoder embed-
dings of the NMT system instead of the standard

712

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

:
/
/

d
je
r
e
c
t
.

je
t
.

e
d
toi

/
t

un
c
je
/

un
r
t
je
c
e
–
p
d

F
/

d
o

je
/

1
0
1
1
6
2

/
t

un
c
_
un
_
0
0
3
4
1
1
9
2
3
4
5
9

/
t

un
c
_
un
_
0
0
3
4
1
p
d

b
oui
g
toi
e
s
t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

random initialization. We exploit unsupervised
NMT objectives (Lample et al., 2018) to which we
associate a supervised NMT objective as follows:

Auto-encoder (AE) Objectives: Using a noise
model that drops and swaps words, the objective
is to reconstruct the original sentences. We use
AE objectives for L1, L2, R1, and R2.

Back-translation (BT) Objectives: For train-
ing translation directions for which we do not have
parallel data, a round-trip translation is performed
during training in which a sentence s from mono-
lingual data is translated, and its translation back-
translated, with the objective of generating s. Nous
use the BT objectives corresponding to our tar-
geted zero-shot translation directions: L1→R2
→L1, R2→L1→R2, L2→R1→L2, R1→L2→R1,
R1→R2→R1, and R2→R1→R2.

Machine Translation (MT) Objectives: nous
use this objective for L1→L2 and L2→L1, pour
which we have parallel data.

AE and BT are unsupervised NMT objectives
used to train our zero-shot translation directions.
Cependant, using only these objectives would result
in very poor performance, especially for distant
and difficult language pairs. We thus also use MT
objectives for the necessary supervision.

To alter PL1-L2 into PS

R1-R2 by our method #1, nous
could have trained an NMT system for L1→R1
and L2→R2 with the BT objectives L1→R1→L1
and L2→R2→L2. Cependant, due to the similarity
between L1 and R1, the NMT system would often
perform a copy of ML1 to MS
R1. Donc, comme
done by previous work in paraphrase generation
(Bannard and Callison-Burch, 2005; Mallinson
et coll., 2017), we instead rely on pivot languages,
par exemple, by translating the L1 side of PL1-L2
parallel data into R2 as a translation of L2.

4 Synthesizing Parallel Data of UGT

This section presents our two approaches to
synthesize parallel data of UGT mentioned in
Section 3.1: #1 alters existing parallel data and #2
generates translations of UGT monolingual data.

4.1 Parallel Data Alteration

There exist several methods to synthesize parallel
data of UGT from existing parallel data in various
style or domains, but mostly requiring the use
of UGT parallel data. Vaibhav et al. (2019)

713

Chiffre 3: Alteration of PL1-L2 parallel data to synthe-
size PS

R1-R2 parallel data.

proposed a synthetic noise induction (SNI) que
applies manually defined editing operations, tel
as adding/dropping characters from a word or
adding emojis, to introduce noise into existing
parallel data. The resulting data were used for
adapting an NMT system for translating UGT.
They also proposed a tag-based method given a
small PR1-R2 parallel data: concatenate PR1-R2 and
PL1-L2 parallel data, prepend a tag onto each source
sentence to indicate whether the sentence pair is
from PR1-R2 or PL1-L2, and train NMT systems
on that data. Alors, they used this NMT system
to translate the L1 side of another PL1-L2 parallel
data prepended with the tag for PR1-R2 so that
the system is forced to translate L1 sentences
as R1 sentences. The resulting parallel data are
noisier than the original data and potentially more
suitable to train NMT systems for UGT. The data
are used to fine-tune NMT systems trained on
PL1-L2 parallel data.

In contrast, as illustrated in Figure 3, notre
approach uses a zero-shot NMT system that does
not require any manually produced PR1-R2 nor
relies on manually defined editing operations.
Given PL1-L2, we perform L1→R2 and L2→R1

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

:
/
/

d
je
r
e
c
t
.

je
t
.

e
d
toi

/
t

un
c
je
/

un
r
t
je
c
e
–
p
d

F
/

d
o

je
/

1
0
1
1
6
2

/
t

un
c
_
un
_
0
0
3
4
1
1
9
2
3
4
5
9

/
t

un
c
_
un
_
0
0
3
4
1
p
d

b
oui
g
toi
e
s
t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

translation for each of L1 and L2 sentences,
respectivement, to obtain a synthetic R1-R2 version,
c'est, PS
R1-R2, of the original PL1-L2. The resulting
PS
R1-R2 can be too noisy to be used to train NMT.
To filter PS
R1-R2, we evaluate the similarity between
original L1 and L2 sentences with their respective
R1 and R2 versions using sentence-level BLEU
(Lin and Och, 2004) (sBLEU). Given a sentence
pair in PS
R1-R2, if either sBLEU of L1 with respect
to R1 or sBLEU of L2 with respect to R2 is below
a predetermined threshold T , we filter out the
sentence pair, consider that it has been too much
altered. T can be set empirically: Create several
R1-R2 using different T values, train
version of PS
an NMT system for each version, and choose the
value that leads to the NMT system achieving the
best BLEU score on some PL1-L2 validation data.
Enfin, after filtering, we exploit the resulting
PS
R1-R2 by concatenating it to the original PL1-L2
parallel data and train a new NMT system for
translating UGT, or by using it for fine-tuning an
NMT system trained on PL1-L2 parallel data.

4.2 Translation of Monolingual Data

Previous work also proposed to synthesize parallel
data from monolingual data using NMT (Sennrich
et coll., 2016un): An L1→L2 NMT system is used
to translate ML1 monolingual data into L2, et
then the synthesized PS
L1-L2 parallel data are
concatenated to original parallel data and used to
train new L2→L1 (back-translation) or L1→L2
(forward translation) NMT systems. Cependant, à
the best of our knowledge, nobody has studied
the use of large UGT monolingual data, without
any manually produced PR1-R2 parallel data, et
its impact on translation quality.2

In our scenario, translating R1 texts with an
L1→L2 would lead to translations of R1, que
we can denote R2, of a very poor quality (voir
Section 2). Par conséquent, back-translations or
forward translations generated this way would
be too noisy to train R1↔R2 NMT systems. Nous
verify this assumption in Section 5.2.1. Plutôt,
as illustrated in Figure 4, we use R1→R2 and
R2→R1 zero-shot NMT to synthesize parallel
data from MR1 and MR2 monolingual data,
respectivement. Because our NMT system uses a
pre-trained language model for R1 and R2, nous
can expect it to generate better translation than

Chiffre 4: Translation of monolingual data MR1 and
MR2 to synthesize PS

R1-R2 parallel data.

a standard NMT system trained only on PL1-L2
parallel data, (c'est à dire., that never saw UGT during
entraînement). As in Section 4.1, the resulting PS
R1-R2
parallel data can be used for fine-tuning or
concatenated with the original PL1-L2 parallel data
for training.

In this work, we only examine the use of
PS
R1-R2 parallel data with their synthetic part on
the source side, as back-translations, because in
our preliminary experiments we have consistently
observed better results than when PS
R1-R2 is used
as forward translations.3 Note also that we do not
filter the synthesized data and use all the data
generated from the monolingual data, in contrast
to another approach presented in Section 4.1.
We could potentially obtain better results by
filtering synthetic parallel data with some existing
methods proposed, par exemple, for filtering back-
translations (Imankulova et al., 2019). We leave
the investigation of such filtering techniques for
future work.

5 Experiments

Dans cette section, we empirically evaluate the
usefulness of the parallel data synthesized by

2Berard et al. (2019un) showed that a large monolingual
corpus of UGT can be successfully back-translated with a
system trained on PR1-R2 parallel data.

3Li and Specia (2019) observed improvements using
forward translations but only in combination with manually
produced PR1-R2 parallel data.

714

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

:
/
/

d
je
r
e
c
t
.

je
t
.

e
d
toi

/
t

un
c
je
/

un
r
t
je
c
e
–
p
d

F
/

d
o

je
/

1
0
1
1
6
2

/
t

un
c
_
un
_
0
0
3
4
1
1
9
2
3
4
5
9

/
t

un
c
_
un
_
0
0
3
4
1
p
d

b
oui
g
toi
e
s
t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

our proposed approaches in training better NMT
systems for translating UGT.

5.1 Données

We conducted experiments for two language pairs,
English–French (en-fr) and English–Japanese
(en-ja), with the MTNT translation tasks (Michel
and Neubig, 2018). The test sets were made from
posts extracted from an online discussion Web
site, Reddit. Translations in the MTNT test sets
were produced by professional translators with
the instructions of keeping the style. Errors in the
source texts were also preserved. In the four test
sets, one for each translation direction, the source
side contains original texts, c'est, our systems
will not have to translate translationese.

For parallel data, we did not use any of the
Reddit parallel data of the MTNT, since our
approach is supposed to be agnostic of manually
produced PR1-R2 translations. To make our settings
comparable with previous work, we used only
the clean parallel data in MTNT as PL1-L2 data
for training and validating our NMT systems.
For the en-fr pair, PL1-L2 data contain 2.2M
sentence pairs consisting of the news-commentary
(news commentaries) and Europarl (parliamentary
debates) corpora provided by WMT15 (Bojar
et coll., 2015). For the en-ja pair, PL1-L2 data
consist of the KFTT (Wikipedia articles), TED
(transcripts of online conference talks), and JESC
(subtitles) corpora, resulting in a total of 3.9 M.
sentence pairs. All PL1-L2 parallel data can be
considered rather clean and/or formal in contrast
to Reddit data.

As monolingual data, ML1 and ML2, we used
the entire News Crawl provided for WMT204
for Japanese, 3.4M lines, and a sample of 25M
lines for English and French. As MR1 and MR2,
we crawled data using the Reddit API and
applied fastText5 for language identification.6 As
preprocessing steps for English and French, nous
first normalized the punctuation of all the data,
except for the reference translations in the test sets,

4http://www.statmt.org/wmt20/translation

-task.html.

5https://fasttext.cc/.
6In our preliminary experiments, we observed large
improvements in translation quality (au-delà 5.0 BLEU
points) with our approaches when the crawled MR1 contains
the source side of the test sets. We rather chose to experiment
without the knowledge of the source side of the test set and
carefully removed it from the monolingual data.

with the Moses (Koehn et al., 2007)7 punctuation
normalizer, and then tokenized all the data with
the Moses tokenizer. Enfin, we truecased the
data with the Moses truecaser trained on the
Reddit monolingual data. As for Japanese, we only
tokenized the data with MeCab.8 We removed all
empty lines and lines longer than 120 tokens
from the monolingual and parallel data. Because
we could crawled plenty of English data (595M.
lines) on Reddit, we only selected its noisiest part,
similarly to Michel and Neubig (2018) when they
built the MTNT dataset. We trained a language
model on the English News Crawl monolingual
data using LMPLZ (Heafield et al., 2013), scored
all lines of English Reddit data with the language
model, normalized the score by the number of
tokens in each line, and kept only the 25M
lines with the lowest score. Because there are
significantly less Japanese and French Reddit data,
0.8M and 1.2M sentences, respectivement, we did
not apply this filtering for these two languages.
English Reddit data are thus much larger and
can also be considered noisier than French and
Japanese Reddit data.

For validation, we used the PL1-L2 validation
data from the MTNT dataset: Newsdiscuss-
dev2015 for en-fr and the concatenation of the
validation data provided with the KFTT, TED,
and JESC corpora.

chrF (Popovi´c, 2015)

For evaluation, we used SacreBLEU (Post,
includes the MTNT test sets. Pour
2018) que
en→ja, we report on scores using the character-
instead
level metric
of BLEU (Papineni et al., 2002)
to avoid
any tokenization mismatch with previous/future
work.9 We tested the significance of our re-
sults via bootstrap re-sampling and approximate
randomization with MultEval (Clark et al.,
2011).10

5.2 Baselines Systems

To train NMT systems, we first segmented
tokens into sub-words using a BPE segmentation
(Sennrich et al., 2016b) with 32k operations

7https://github.com/moses-smt/mosesdecoder.
8https://taku910.github.io/mecab/.
9The sacreBLEU signatures, where xx is among {dans,fr,ja}
are as follows: BLEU+case.mixed+lang.xx-xx+numrefs.1
+smooth.exp+test.mtnt1.1/test+tok.13a+version.1.4.2; chrF2+
case.mixed+lang.en-ja+numchars.6+numrefs.1 +space.False+
test.mtnt1.1/test+version.1.4.2.

10https://github.com/jhclark/multeval.

715

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

:
/
/

d
je
r
e
c
t
.

je
t
.

e
d
toi

/
t

un
c
je
/

un
r
t
je
c
e
–
p
d

F
/

d
o

je
/

1
0
1
1
6
2

/
t

un
c
_
un
_
0
0
3
4
1
1
9
2
3
4
5
9

/
t

un
c
_
un
_
0
0
3
4
1
p
d

b
oui
g
toi
e
s
t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

System

vanilla

+ TBT News
+ TBT Reddit

FT on SNI
+ SNI

fr→en

21.6

25.8∗
22.9∗

23.1∗
22.0

FT on NGBT

0.2

BLEU
en→fr

ja→en

21.7

25.3∗
25.5∗

22.3∗
21.7

17.3

8.1

8.6∗
0.5

8.2∗
8.3∗

0.5

chrF
en→ja

0.174

0.190∗
0.181∗

0.164
0.158

0.021

Tableau 1: Results for the MTNT test sets.
Tagged back-translation systems (TBT) étaient
trained on back-translations of News Crawl or
Reddit monolingual data. ‘‘+’’ indicates that
the generated data were concatenated to the
original PL1-L2 parallel data. ‘‘FT’’ denotes
the fine-tuning of the vanilla NMT system.
‘‘*’’ denotes systems significantly better than
the vanilla NMT system with a p-value < 0.05. jointly learned for each language pair on the Reddit monolingual data. We used the Transformer (Vaswani et al., 2017) implementation in Marian (Junczys-Dowmunt et al., 2018) with standard hyper-parameters: 6 encoder and decoder layers, 512 dimensions for the embeddings and hidden states, 8 attention heads, and 2,048 dimensions for the feed-forward filter. During training, we evaluated the model using a mean cross-entropy score computed on the MTNT PL1-L2 validation data after every 5k mini-batch updates and stopped training when it had not been improved for 5 consecutive times. We selected the model that yields the best BLEU, using the BLEU metric implemented in Marian, on the same validation data. We used the same training procedure for our vanilla NMT systems and all the NMT systems trained on synthetic parallel data. Table 1 reports on the results for our vanilla NMT systems and other baseline systems de- scribed in Sections 5.2.1 and 5.2.2. 5.2.1 Tagged Back-translation We generated back-translations from Reddit monolingual data, tagged (Caswell et al., 2019) and concatenated them to the original PL1-L2 parallel data, and trained a new NMT system from scratch. Because Reddit data are noisy UGT, the generated back-translations may be of a very poor quality and harm the training of NMT. As contrastive experiments, we also evaluated the use of back-translations of News Crawl for which we can expect the system trained on PL1-L2 to generate better but out-of-domain translations. In all experiments, we used as many monolingual sentences as in the PL1-L2 parallel, or all of the Reddit data for French and Japanese since we do not have enough Reddit data to match the size of PL1-L2. As shown in Table 1, back-translations of Reddit are mostly useful, with up to 3.8 BLEU points of improvement, but dramatically failed for ja→en potentially due to the very low quality of the back-translations generated by the en→ja vanilla NMT system. Using back-translations of News Crawl is more helpful, especially for fr→en and ja→en. Berard et al. (2019a) showed improvements when using back-translations of UGT. In contrast, we did not consistently observe improvements without using any manually produced PR1-R2 to train the NMT systems for back-translation. 5.2.2 Synthetic Noise Generation As potential baselines, we also evaluated the methods proposed by Vaibhav et al. (2019) for SNI, because it does not require any manually produced PR1-R2. We applied their method to PL1-L2 using their scripts11 to create a noisy version of parallel data, namely, PS R1-R2. We also evaluated a similar approach to the tagged back- translations proposed by Vaibhav et al. (2019) (see Section 4.1). We used our systems trained on back- translations of Reddit to decode L1 sentences from PL1-L2 parallel data, to which we added the back- translation tags to let the NMT system generate translation of L1 similar to UGT. We denote this noise generation from back-translation ‘‘NGBT.’’ As in Vaibhav et al. (2019), we introduced noise only to the source side of the parallel data performing L1→L2→L1 where the resulting L1 sentences comprise a noisy version of the original L1 sentences. We then replace L1 sentences in the PL1-L2 parallel data with their noisy version. In addition to the use of the resulting PS R1-R2 data for fine-tuning as in Vaibhav et al. (2019), we also evaluated NMT systems trained from scratch on the concatenation of the PS R1-R2 and PL1-L2. As shown in Table 1, fine-tuning our vanilla NMT system on SNI actually improves translation quality for all the tasks, except en→ja. These results are not in accordance with the results in 11h t t p s : / / g i t h u b . c o m / M y s teryVaibhav /robust mtnt. 716 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 3 4 1 1 9 2 3 4 5 9 / / t l a c _ a _ 0 0 3 4 1 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Vaibhav et al. (2019) that show a slight drop of the BLEU score for fr→en.12 We speculate that the difference may come from the use of a different, better, vanilla NMT system for which we used a larger PL1-L2 parallel data than in Vaibhav et al. (2019). Using the PS R1-R2 synthetic parallel data concatenated to the original PS L1-L2 leads to lower BLEU scores than fine-tuning, except for ja→en. As expected, our adaptation of NGBT per- formed very poorly, showing that our systems trained on Reddit back-translations are not good enough to generate a useful noisy version of PL1-L2 parallel data. We do not further explore this configuration in this paper. 5.3 System Settings for our Approaches Our NMT systems used for synthesizing PS R1-R2 parallel data are initialized with XLM (Section 3.2). To train XLM, we used the data presented in Section 5.1 on which we applied the same BPE segmentation used by our vanilla NMT systems. For the MLM objectives, we used the News Crawl corpora as ML1 and ML2 and the Reddit corpora as MR1 and MR2 monolingual data. For the TLM objectives, we used the parallel data used to train our vanilla NMT system as PL1-L2 parallel data. We used the publicly available XLM framework13 with the standard hyperparameters proposed for unsupervised NMT: 6 layers for the encoder and the decoder, 1,024 dimensions for the embeddings, a dropout rate of 0.1, and the GELU activation. We used text streams of 256 tokens and a mini-batch size of 64. The Adam optimizer (Kingma and Ba, 2014) with a linear warm-up (Vaswani et al., 2017) was used. During training, the model was evaluated every 200k sentences on the MTNT validation parallel data for TLM and the monolingual validation data of MTNT for MLM. The training was stopped when the averaged perplexity of MLM and TLM had not been improved for 10 consecutive times. We initialized our zero-shot NMT with XLM and trained it with the AE, BT, and MT objectives presented in Section 3.2, all having the same 12Vaibhav et al. (2019) observed improvements only when used in combination with an manually produced PS R1-R2. 13We refer the reader to the section III given at this URL to retrieve the complete settings of our training for XLM and unsupervised NMT: https://github.com /facebookresearch/XLM. The only difference is that we used our data in different languages, which is also used to train our own BPE vocabulary. 717 System fr→en zero-shot NMT 21.4 vanilla FT on SNI 21.6 23.1 BLEU en→fr 22.4 21.7 22.3 ja→en 3.0 8.1 8.2 #1: PS R1-R2 synthesized from PL1-L2 9.0∗ 24.2∗ 9.5∗ 24.7∗ 22.0 23.1 chrF en→ja 0.126 0.174 0.164 0.174 0.180∗ R1-R2 synthesized from MR2 monolingual data 26.2∗ 26.8∗ 26.5∗ 29.3∗ 9.1∗ 10.0∗ 0.202∗ 0.212∗ R1-R2 FT on PS + PS R1-R2 R1-R2 #2: PS FT on PS + PS R1-R2 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 3 4 1 1 9 2 3 4 5 9 / / t l a c _ a _ 0 0 3 4 1 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 PS R1-R2 synthesized by #1 and #2 10.4∗ 27.2∗ 29.0∗ 0.213∗ + #1 + #2 With the Reddit training parallel data from MTNT FT on MTNT 29.0∗ 27.5∗ 9.9∗ 0.192∗ Table 2: Results for the MTNT test sets using PS R1-R2 synthesized by our approaches. ‘‘zero-shot NMT’’ is the NMT system used for synthesizing PS R1-R2. ‘‘FT on PS R1-R2’’ are configurations for which we sampled 100k sentence pairs from PS R1-R2 to fine-tune the vanilla NMT system. The last row is given for reference: the vanilla NMT system fined-tuned on the official MTNT training parallel data. ‘‘*’’ denotes systems significantly better than the FT on SNI system with a p-value < 0.05. weights, using the same hyperparameters as XLM. We evaluated the model every 200k sentences on the MTNT validation parallel data and stopped training when the average BLEU of L1→L2 and L2→L1 had not been improved for 10 consecutive times. Finally, we synthesized PS R1-R2 data with our approaches using this system and trained final NMT models on the resulting PS R1-R2. 5.4 Results Our results are presented in Table 2. First, we checked the performance of our zero-shot NMT system. Whereas for fr↔en, it was comparable with the vanilla NMT system, for ja↔en, it performed much worse than the vanilla NMT model as expected. This is due to the use of unsupervised MT objectives that were shown to be very difficult to optimize for distant and difficult language pairs (Marie et al., 2019) with almost no shared entries in the respective vocabulary of the two languages. With approach #1, we synthesized PS R1-R2 from PL1-L2 and filtered them with T = 0.5 for en-fr and T = 0.25 for en-ja, respectively, resulting 196,788 and 301,519 sentence pairs.14 As shown in Table 2, fine-tuning on PS R1-R2 brings larger improvements than doing so on SNI, except for fr→en. Despite the small size of the PS R1-R2, concatenating it with PL1-L2 achieves the best BLEU with up to 3.0 BLEU points of improvements. We conclude that our approach successfully alters PL1-L2 into PS useful to train NMT for UGT. R1-R2 We give an analysis of the altered sentences later in Section 9. Our approach #2 to synthesize PS R1-R2 brought even larger improvements. In contrast to the back- translations of Reddit generated by the vanilla NMT system (see Table 1), PS R1-R2 synthesized by our zero-shot NMT systems from MR2 Reddit monolingual data (the same data used to generate ‘‘TBT Reddit’’) lead to larger improvements, especially when concatenated to PL1-L2. For fr→en, for instance, the gain over the vanilla NMT system is 7.7 BLEU points. Note also that further gains may potentially be attainable by exploring upsampling or downsampling strategies to find the optimal ratio between the sizes of PL1-L2 and PS R1-R2. Finally, concatenating PS R1-R2 parallel data synthesized by #1 and #2 provides slightly better results than, or comparable to, the use of only parallel data synthesized by #2. 6 Impact of the Distinction Between L1/L2 and R1/R2 Monolingual Data We empirically verified our assumption that ML1 and MR1, ML2 and MR2, must be distinguished in order to enforce our NMT systems to learn the difference between clean texts and UGT, while it also learns to translate between L1 and L2, and between R1 and R2. To this end, we set up two new configurations, #A and #B, where we have PL1-L2 parallel data and only ML1 and ML2 monolingual data, that is, we do not define MR1 and MR2 monolingual data to train XLM and NMT systems used to synthesize parallel data. #A We replace News Crawl for ML1 and ML2 monolingual data with those for Reddit. System fr→en BLEU en→fr ja→en chrF en→ja #1: synthesized from PL1-L2 parallel data + original + A + B 23.1 21.5∗ 21.9∗ 24.7 21.5∗ 22.0∗ 9.5 8.0∗ 8.3∗ 0.180 0.173∗ 0.182 #2 TBT: generated from Reddit monolingual data + original + A + B 29.3 21.3∗ 22.0∗ 26.8 22.0∗ 22.1∗ 10.0 8.1∗ 8.7∗ 0.212 0.170∗ 0.183∗ Table 3: Results for the MTNT test sets using the configurations #A and #B. ‘‘original’’ denotes the system presented in Section 5.4. ‘‘*’’ denotes systems significantly worse than the the ‘‘original’’ configuration with a p-value < 0.05. #B Because we only have few Reddit monolingual data for French and Japanese, #A is significantly disadvantaged by using much less monolingual data compared with our original system that also used News Crawl. In configuration #B, ML1 and ML2 are the concatenation of News Crawl and Reddit data with French and Japanese Reddit data upsampled to respectively match the size of the French and Japanese News Crawl corpora. With #A and #B, we no longer have zero- shot translation directions for synthesizing PS R1-R2. Instead, we have an NMT system initialized using a pre-trained crosslingual language model also exploiting Reddit monolingual data.15 With these configurations, we assume that the presence of a significant amount of Reddit data in the monolingual data may bias the NMT system in synthesizing Reddit-like texts. The results of NMT systems trained on parallel data synthesized by #A and #B are presented in Table 3. With both our approaches #1 and #2, both configurations #A and #B perform significantly worse than our proposed NMT systems that exploit PS R1-R2 synthesized by zero-shot NMT systems. These results point out the necessity to set zero- shot NMT systems, differentiating clean texts from UGT, to synthesize useful parallel data of UGT. 14In terms of BLEU scores, we observed differences, in the range of 2.0 BLEU points, considering all the thresholds tested. 15We used the same framework used by our zero-shot NMT systems for #A and #B, also using the AE and BT objectives since removing them did not have a positive impact. 718 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 3 4 1 1 9 2 3 4 5 9 / / t l a c _ a _ 0 0 3 4 1 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 7 Ablation Study on Zero-Shot NMT’s Objective Losses fr→en en→fr Zero-Shot NMT We performed an ablation study of the objectives exploited for zero-shot NMT presented in Section 3.2. We compared the following four combinations of objectives: training the AE+BT+MT BT+MT AE+BT BT 21.4 20.4∗ 19.2∗ 0.1∗ 22.4 22.3 21.2∗ 0.4∗ AE+BT+MT: The original combination used to train our zero-shot NMT system. BT+MT: The AE objective is removed. This excludes any random noise in the source sentences. The system is no longer restricted to perform a simple copy of the source when performing round-trip BT. AE+BT: Typical combination of objectives used for unsupervised NMT (Lample et al., 2018). Without the supervised MT objective, we expect a drop of the translation quality. BT: Without AE and MT objectives, we can expect the system to be able to properly model neither languages nor translations. Note that we cannot remove the BT objectives as this is the only objective that trained the system to translate, for instance, from L1 to R2 and from R2 to R1. We evaluated the zero-shot NMT itself and NMT systems exploiting the synthetic parallel data generated by the zero-shot NMT system using our approaches #1 without filtering16 and #2. The results are presented in Table 4. None of the alternative combinations performs better than AE+BT+MT in our original proposal. Removing AE (i.e., BT+MT) has a minimal impact but it is necessary to obtain the best results. In contrast, removing the MT objective (i.e., AE+BT) led to a significant drop of the translation quality as the zero-shot NMT is not supervised at all. Using only the BT objective led to extremely noisy synthetic data that cannot be used to train NMT. 8 Impact on the Robustness of NMT Using extra test suites, we evaluated to what extent our NMT systems trained on synthetic parallel data of UGT are robust to domain/style changes or only adapted to better translate Reddit data. 16We did not filter the data unlike our original proposal, because our goal is only to evaluate the quality of the data given the different systems used to generate them while saving the computational cost of finding a good threshold for filtering. #1: only with PS AE+BT+MT BT+MT AE+BT BT AE+BT+MT BT+MT AE+BT BT #2: PL1-L2 + PS R1-R2 synthesized from PL1-L2 19.0 17.7∗ 6.6∗ 0.0∗ 18.0 17.0∗ 0.7∗ 0.2∗ R1-R2 synthesized from MR2 29.3 28.5∗ 28.3∗ 13.2∗ 26.8 25.9∗ 25.0∗ 12.1∗ Table 4: BLEU scores for the MTNT test sets with some of the objectives deactivated for training the zero-shot NMT system that synthesizes PS R1-R2. The configurations using #1 synthetic data were trained exclusively on this data. ‘‘*’’ denotes systems significantly worse than using all the objectives with a p-value < 0.05. Newstest2014 (en-fr): Translation task of WMT14 containing clean texts of news. Newsdiscuss2015 (en-fr): Translation task of WMT15 containing UGT of discussions on news. Foursquare (en-fr): A corpus of restaurant re- views (Berard et al., 2019a) that is another instance of UGT. JESC, KFTT, and TED (en-ja): Test sets re- leased with their respective training data in the MTNT dataset (see Section 5.1). Twitter (en-ja): We collected 1,400 English tweets from the natural disaster domain and hired a translation firm to translate them into Japanese with specific instructions to preserve the style of the source texts. This test set is particularly noisy because it presents many tokens specific to tweets (user identifiers, hash tags, abbreviations, etc.). For all these translation tasks, we experimented only with the original translation direction to avoid translating translationese, except for the cases 719 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 3 4 1 1 9 2 3 4 5 9 / / t l a c _ a _ 0 0 3 4 1 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 News2014 fr→en en→fr Newsdiscuss2015 en→fr fr→en Foursquare KFTT ja→en fr→en TED en→ja JESC ja→en en→ja Twitter en→ja System vanilla + TBT News + TBT Reddit + #1 + #2 + #1 + #2 29.4 34.2∗ 27.4 29.3 30.4∗ 30.6∗ 32.5 36.3∗ 33.1∗ 32.6 34.2∗ 34.5∗ FT on MTNT 30.6∗ 33.4∗ 29.3 32.3∗ 29.2 29.8∗ 31.6∗ 31.4∗ 32.1∗ 31.3 33.7∗ 33.4∗ 31.6∗ 33.0∗ 33.3∗ 34.1∗ 13.5 17.0∗ 15.4∗ 13.7∗ 17.5∗ 17.7∗ 18.0∗ 22.8 22.2 1.0 22.9∗ 23.1∗ 23.0∗ 0.234 0.226 0.228 0.233 0.236 0.238∗ 15.9 14.1 0.1 16.7∗ 17.5∗ 17.6∗ 0.229 0.217 0.223 0.232∗ 0.240∗ 0.239∗ 22.9 0.228 15.0 0.229 0.120 0.151∗ 0.137∗ 0.140∗ 0.150∗ 0.150∗ 0.127∗ Table 5: BLEU (∗→{en,fr}) and chrf (en→ja) scores obtained on the extra test sets. Best scores are in bold. ‘‘*’’ denotes systems significantly better than the vanilla NMT system with a p-value < 0.05. where the origin of the source texts is unknown or mixed. The results obtained with the same systems presented in Section 5.4 are presented in Table 5. These results point out that our approaches did not only adapt NMT systems to the domain and style of Reddit but also improved them overall. NMT systems trained on the parallel data synthesized by our approaches perform better than the vanilla NMT systems irrespective of the domain and style of the text to translate. In contrast, exploiting the Reddit monolingual data through tagged back-translation consistently led to lower BLEU scores (except for en→fr Newsdiscuss2015), highlighting the ability of our framework in producing better synthetic parallel data. The configuration ‘‘TBT News,’’ which exploits tagged back-translation from News Crawl, is as expected the best system for translating Newstest2014, Newsdiscuss2015, and tweets, since some of the tweets have been posted by news agencies, but performed lower than our system for translating UGT from Foursquare. With these results and the results obtained on the MTNT test sets (see Section 5.4), we conclude that our approaches improve translation quality for UGT in general and did not only adapt the NMT system to translate a specific type of UGT. 9 Analysis of Clean Sentences Altered into UGT This section takes a closer look at the parallel data synthesized by approach #1 to observe how the clean sentences from PL1-L2 parallel data were altered and to better understand why the use of synthetic data leads to a better NMT system for UGT. We first focus on some of the characteristics of the MTNT datasets and compare how well these characteristics are exhibited in PS R1-R2. For this analysis, we mainly relied on the scripts and resources provided by Michel and Neubig (2018).17 We randomly sampled source sentences from PL1-L2 and PS R1-R2 as much as there are in the MTNT test sets, and performed our analysis on them.18 We counted the occurrences of profanities in the English, French, and Japanese. For English, we also counted the number of word contractions19 and Internet slang expressions. We also counted words ending by ‘‘-ise’’ and ‘‘-ize’’ to account for some of the differences between US English and UK English word spellings. Because PL1-L2 is mainly made of Europarl, we can expect that UK English spelling is mainly used, whereas we expect to find a higher ratio of US English spelling in the Reddit data, since Reddit is an American platform. For Japanese, we counted the numbers of formal and informal pronouns, assuming that MTNT datasets contain more informal pronouns than PL1-L2. Michel and Neubig (2018) also coun- ted spelling and grammar errors, and emojis. We did not count spelling and grammar errors, ex- pecting that they are artificially numerous in our synthetic data, since they had been automatically generated. As for the emojis, both PL1-L2 and PS R1-R2 did not contain any. Table 6 demonstrates that according to all the indicators, PS R1-R2 exhibits more of the char- acteristics of MTNT datasets than PL1-L2. For instance, PS R1-R2 is in more US English, contains more Internet slang, and uses significantly more 17https://github.com/pmichel31415/mtnt. 18For this analysis, the sentences sampled from PS R1-R2 are the synthetic versions of the sentences sampled from PL1-L2. 19We searched for the tokens: ’re, ’s, ’t, ’d, ’ll, and ’ve. 720 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 3 4 1 1 9 2 3 4 5 9 / / t l a c _ a _ 0 0 3 4 1 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Dataset Profanities % Slang % Contractions % -ise/-ize Ratio English French Profanities % Japanese Profanities % Formal/Informal Pronouns Ratio MTNT PL1-L2 PS L1-L2 0.27 0.01 0.06 0.21 0.00 0.04 1.90 0.03 0.21 40.00/60.00 92.00/8.00 41.03/58.97 0.90 0.45 0.57 0.01 0.00 0.01 68.75/31.25 96.88/3.12 83.01/16.99 Table 6: Quantitative analysis of the generated data. ‘‘%’’ indicates the number for occurrences per 100 tokens. For English, we compute the statistics on the en-fr data. For the MTNT test sets, the statistics are computed on the source side. RS L1-L2 has been generated by the alteration of PL1-L2 by our approach #1. l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 3 4 1 1 9 2 3 4 5 9 / / t l a c _ a _ 0 0 3 4 1 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Figure 5: Examples of French and English original sentence from the Europarl and News Commentary corpora (ML1) altered by our approach #1 (MR1). Bold indicates the alterations that we want to highlight for each example. We have manually masked a profanity in En4 with ‘‘*******’’. English contractions. This partly explains the usefulness of PS R1-R2 as NMT training data for the MTNT translation tasks, but most indicators show that PS R1-R2 is still far from perfectly matching with the characteristics of Reddit data, suggesting some room for improvement. For a more concrete illustration of our synthetic data, we present in Figure 5 four English and four French example sentences altered by our approach #1. These examples are all instances of a successful alteration of clean texts into UGT. En1 introduces an English contraction ‘‘we’re’’ that is a characteristic of less formal English. En2, En3, and Fr3 show spelling errors (for Fr3, ‘‘Ca’’ should be written ‘‘C¸ a’’) that may guide the system to make itself more robust. En4 introduces an instance of Internet slang with a profanity, as in Fr1 where ‘‘tr`es chiante,’’ a vulgar translation of ‘‘very annoying’’ diverges from the original meaning of ‘‘franche’’ that can be translated by ‘‘frank.’’ Fr2, Fr3, and Fr4 are simplifications that make the sentences less formal: ‘‘en outre’’ and ‘‘impliquent’’ are usually used in texts that perform a formal demonstration, while ‘‘c¸a veux dire’’ is a more familiar turn of phrase for ‘‘impliquent’’ in this context. We also observed many instances of person names written with Reddit syntax for referring to a Reddit user account by prepending ‘‘/u/,’’ e.g., ‘‘Berlusconi’’ becomes ‘‘/u/Berlusconi.’’ All these examples are evidence that our approach successfully generates UGT in the style of Reddit. 721 10 Related Work Several approaches for better translating UGT have been proposed taking advantage of the parallel data of UGT in the MTNT datasets (Michel and Neubig, 2018). Because of their relatively small size, they have been mostly used for fine-tuning (Li et al., 2019) and designing specific pre- and post-processing rules to improve translation quality (Berard et al., 2019b). Vaibhav et al. (2019) also proposed to generate synthetic parallel data of UGT through back-translation by exploiting the parallel data in MTNT. Mono- lingual data of UGT have been exploited to a lesser extent through forward translation (Li and Specia, 2019) or back-translation (Berard et al., 2019a) and always with NMT systems trained on parallel data of UGT. To the best of our knowledge, Vaibhav et al. (2019) proposed the only approach that synthesizes parallel data of UGT without re- lying on existing parallel data of UGT. Having obtained texts in the target style of UGT, they designed editing operations to make existing parallel data in other styles more similar to the targeted style. Another line of work exploits NMT to perform style transfer across texts, that is, applying some characteristics of one text to another, without exploiting any parallel data of UGT, but has never been applied to NMT for UGT. Prabhumoye et al. (2018) performed style transfer through back-translation to preserve the meaning of the text while reducing its stylistic properties and then exploit adversarial generation algorithms to apply the desired style to the back-translated texts, assuming that meaning and style can be disentangled. Their approach also requires a classifier that can accurately predict the style of a given text. Zhang et al. (2018) proposed a three- step pipeline combining unsupervised statistical and neural MT to generate instances of texts in the targeted style that is then evaluated by a given style classifier as in Prabhumoye et al. (2018). 11 Conclusion We described two new methods for synthesizing parallel data to train better NMT systems for UGT. Both methods work through a zero-shot NMT system, initialized with a pre-trained crosslingual language model that exploits monolingual corpora of UGT. Our first method (#1) successfully alters clean parallel data into parallel data that exhibit the characteristics of UGT of the targeted style. Our second method (#2) uses the same zero-shot NMT system to translate monolingual corpora of UGT for synthesizing parallel data useful to train NMT. We showed that both methods, separately or combined, improve translation quality for UGT. For future work, we will study the use of manually produced UGT parallel data to better train our NMT system that synthesizes the parallel data. We will also explore other applications for this framework, such as paraphrase generation. We will also investigate the use of the recently proposed mirror-generative NMT (Zheng et al., 2020), a semi-supervised architecture that ex- ploits jointly large source and target monolingual corpora, such as those of UGT, during training using source and target language models in the same latent space. Acknowledgments We would like to thank the action editor, Philipp Koehn, and reviewers for their useful comments and suggestions. A part of this work was conducted under the program ‘‘Research and Development of Enhanced Multilingual and Multipurpose Speech Translation System’’ of the Ministry of Internal Affairs and Communications (MIC), Japan. Benjamin Marie was partly supported by JSPS KAKENHI grant number 20K19879 and the tenure-track researcher start-up fund in NICT. Atsushi Fujita was partly supported by JSPS KAKENHI grant number 19H05660. References Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, page 597–604. Ann Arbor, MI, USA. Asso- ciation for Computational Linguistics. DOI: https://doi.org/10.3115/1219840 .1219914 Yonatan Belinkov and Yonatan Bisk. 2018. Syn- thetic and natural noise both break neural ma- chine translation. In Proceedings of the 6th International Conference on Learning Repre- sentations. Vancouver, Canada. Alexandre Berard, Ioan Calapodescu, Marc Dymetman, Claude Roux, Jean-Luc Meunier, 722 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 3 4 1 1 9 2 3 4 5 9 / / t l a c _ a _ 0 0 3 4 1 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 restaurant and Vassilina Nikoulina. 2019a. Machine trans- reviews: New corpus lation of for domain adaptation and robustness. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 168–176. Hong Kong, China. Association for Computa- tional Linguistics. DOI: https://doi .org/10.18653/v1/D19-5617 Alexandre Berard, Ioan Calapodescu, and Claude Roux. 2019b. Naver Labs Europe’s systems for the WMT19 machine translation robustness task. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 526–532. Florence, Italy. Association for Computational Linguis- tics. DOI: https://doi.org/10.18653 /v1/W19-5361 Ondˇrej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 workshop on statistical machine translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 1–46. Lisbon, Portugal. Association for Computatio- nal Linguistics. DOI: https://doi.org /10.18653/v1/W15-3001, PMID: 25955892 Isaac Caswell, Ciprian Chelba, and David Grangier. 2019. Tagged back-translation. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pages 53–63. Florence, Italy. Assoc- iation for Computational Linguistics. DOI: https://doi.org/10.18653/v1/W19 -5206 Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Lan- guage Technologies, pages 176–181. Portland, MN, USA. Association for Computational Linguistics. DOI: https://dl.acm.org /doi/10.5555/2002736.2002774 Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In 723 Proceedings of Advances in Neural Informa- tion Processing Systems 32, pages 7057–7067. Vancouver, Canada. Curran Associates, Inc. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186. Minneapolis, MN, USA. Association for Computational Linguistics. DOI: https:// doi.org/10.18653/v1/N19-1423 Johanna Gerlach, Victoria Porro Rodriguez, Pierrette Bouillon, and Sabine Lehmann. 2013. Combining pre-editing and post-editing to improve SMT of user-generated content. In Proceedings of MT Summit XIV Workshop on Post-editing Technology and Practice, pages 45–53. Nice, France. Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable mod- ified Kneser-Ney language model estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguis- tics (Volume 2: Short Papers), pages 690–696. Sofia, Bulgaria. Association for Computational Linguistics. Aizhan Sato, Imankulova, Takayuki and Mamoru Komachi. 2019. Filtered pseudo- parallel corpus improves low-resource neural machine translation. ACM Transactions on Asian and Low-Resource Language Informa- tion Processing, 19(2):24:1–16. DOI: https:// doi.org/10.1145/3341726 Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, Andr´e F. T. Martins, and Alexandra Birch. 2018. Marian: Fast neural machine translation in C++. In Proceedings of ACL 2018, System Demonstrations, pages 116–121. Melbourne, Australia. Association for Com- putational Linguistics. DOI: https://doi .org/10.18653/v1/P18-4020 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 3 4 1 1 9 2 3 4 5 9 / / t l a c _ a _ 0 0 3 4 1 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Vladimir Karpukhin, Omer Levy, Jacob Eisenstein, and Marjan Ghazvininejad. 2019. Training on synthetic noise improves robust- ness to natural noise in machine translation. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pages 42–47. Hong Kong, China. Associa- tion for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D19-5506 Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980. Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondˇrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source statistical machine translation. for toolkit In Proceedings of the 45th Annual Meet- ing of the Association for Computational Linguistics Companion Volume Proceed- the Demo and Poster Sessions, ings of pages 177–180. Prague, Czech Republic. Asso- ciation for Computational Linguistics. DOI: https://dl.acm.org/doi/10.5555 /1557769.1557821 Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2018. Phrase-based & neural unsupervised machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Na- tural Language Processing, pages 5039–5049. Brussels, Belgium. Association for Comput- ational Linguistics. DOI: https://doi .org/10.18653/v1/D18-1549 Xian Li, Paul Michel, Antonios Anastasopoulos, Yonatan Belinkov, Nadir Durrani, Orhan Firat, Philipp Koehn, Graham Neubig, Juan Pino, and Hassan Sajjad. 2019. Findings of the first shared task on machine translation robustness. In Pro- ceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 91–102. Florence, Italy. Asso- ciation for Computational Linguistics. DOI: https://doi.org/10.18653/v1/W19 -5303 Zhenhao Li and Lucia Specia. 2019. Improv- ing neural machine translation robustness via data augmentation: Beyond back-translation. the 5th Workshop on In Proceedings of Noisy User-generated Text (W-NUT 2019), pages 328–336. Hong Kong, China. Asso- ciation for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D19-5543 Chin-Yew Lin and Franz Josef Och. 2004. ORANGE: A method for evaluating automatic evaluation metrics for machine translation. In Proceedings of the 20th International Conference on Computational Linguistics, pages 501–507. Geneva, Switzerland. DOI: https://dl.acm.org/doi/10.3115 /1220355.1220427 Jonathan Mallinson, Rico Sennrich, and Mirella Lapata. 2017. Paraphrasing revisited with In Proceedings neural machine translation. of the European the 15th Conference of the Association for Computa- Chapter of tional Linguistics: Volume 1, Long Papers, pages 881–893. Valencia, Spain. Associa- tion for Computational Linguistics. DOI: https://doi.org/10.18653/v1/E17 -1083 Benjamin Marie, Haipeng Sun, Rui Wang, Kehai Chen, Atsushi Fujita, Masao Utiyama, and Eiichiro Sumita. 2019. NICT’s unsupervised neural and statistical machine translation sys- tems for the WMT19 news translation task. In Proceedings of the Fourth Conference on Ma- chine Translation (Volume 2: Shared Task Papers, Day 1), pages 294–301. Florence, Italy. Association for Computational Linguis- tics. DOI: https://doi.org/10.18653 /v1/W19-5330 Claudia Matos Veliz, Orphee De Clercq, and Veronique Hoste. 2019. Benefits of data augmentation for NMT-based text normaliza- tion of user-generated content. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pages 275–285. Hong Kong, China. Association for Computational Linguistics. DOI: https://doi.org/10 .18653/v1/D19-5536 Paul Michel and Graham Neubig. 2018. MTNT: A testbed for machine translation of noisy text. In Proceedings of the 2018 Conference on Empir- ical Methods in Natural Language Processing, 724 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 3 4 1 1 9 2 3 4 5 9 / / t l a c _ a _ 0 0 3 4 1 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 pages 543–553. Brussels, Belgium. Associ- ation for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D18 -1050, PMID: 29565364 (Volume 1: Long Papers), pages 86–96. Berlin, Germany. Association for Computational Lin- guistics. DOI: https://doi.org/10.18653 /v1/P16-1009 Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Philadelphia, PA, USA. Asso- ciation for Computational Linguistics. DOI: https://doi.org/10.3115/1073083 .1073135 Maja Popovi´c. 2015. chrF: character n-gram f-score for automatic MT evaluation. In Pro- ceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395. Lisbon, Portugal. Association for Computational Lin- guistics. DOI: https://doi.org/10.18653 /v1/W15-3049 Matt Post. 2018. A call for clarity in reporting the Third BLEU scores. In Proceedings of Conference on Machine Translation: Research Papers, pages 186–191. Brussels, Belgium. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1 /W18-6319 through back-translation. Shrimai Prabhumoye, Yulia Tsvetkov, Ruslan Salakhutdinov, and Alan W. Black. 2018. Style transfer In Proceedings of the 56th Annual Meeting of the Association for Computational Linguis- tics (Volume 1: Long Papers), pages 866–876. Melbourne, Australia. Association for Com- putational Linguistics. DOI: https://doi .org/10.18653/v1/P18-1080 Rico Sennrich, Barry Haddow, and Alexan- dra Birch. 2016b. Neural machine trans- rare words with subword units. lation of In Proceedings of the 54th Annual Meet- for Computa- ing tional Linguistics (Volume 1: Long Papers), pages 1715–1725. Berlin, Germany. Asso- ciation for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P16-1162 the Association of Vaibhav Vaibhav, Sumeet Singh, Craig Stewart, and Graham Neubig. 2019. Improving robust- ness of machine translation with synthetic noise. In Proceedings of the 2019 Confer- ence of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1916–1920. Minneapolis, USA. Association for Compu- tational Linguistics. DOI: https://doi .org/10.18653/v1/N19-1190 Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of Advances in Neural Information Processing Systems 30, pages 5998–6008. Long Beach, USA. Curran Associates, Inc. Zhirui Zhang, Shuo Ren, Shujie Liu, Jianyong Wang, Peng Chen, Mu Li, Ming Zhou, and Enhong Chen. 2018. Style transfer as unsupervised machine translation. CoRR, abs /1808.07894. Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine trans- lation models with monolingual data. In Pro- ceedings of the 54th Annual Meeting of the Association for Computational Linguistics Zaixiang Zheng, Hao Zhou, Shujian Huang, Lei Li, Xin-Yu Dai, and Jiajun Chen. 2020. Mirror-generative neural machine translation. In Proceedings of the 8th International Confer- ence on Learning Representations. Virtual. 725 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 3 4 1 1 9 2 3 4 5 9 / / t l a c _ a _ 0 0 3 4 1 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3
Télécharger le PDF