Improving Statistical Machine
Translation by Adapting Translation
Models to Translationese

Gennadi Lembersky∗
University of Haifa, Israel

Noam Ordan∗∗
University of Haifa, Israel

Shuly Wintner†
University of Haifa, Israel

Translation models used for statistical machine translation are compiled from parallel corpora
that are manually translated. The common assumption is that parallel texts are symmetrical:
The direction of translation is deemed irrelevant and is consequently ignored. Much research
in Translation Studies indicates that the direction of translation matters, however, as translated
language (translationese) has many unique properties. It has already been shown that phrase
tables constructed from parallel corpora translated in the same direction as the translation task
outperform those constructed from corpora translated in the opposite direction.

We reconfirm that this is indeed the case, but emphasize the importance of also using texts
translated in the “wrong” direction. We take advantage of information pertaining to the direction
of translation in constructing phrase tables by adapting the translation model to the special
properties of translationese. We explore two adaptation techniques: First, we create a mixture
model by interpolating phrase tables trained on texts translated in the “right” and the “wrong”
directions. The weights for the interpolation are determined by minimizing perplexity. Second,
we define entropy-based measures that estimate the correspondence of target-language phrases
to translationese, thereby eliminating the need to annotate the parallel corpus with information
pertaining to the direction of translation. We show that incorporating these measures as features
in the phrase tables of statistical machine translation systems results in consistent, statistically
significant improvement in the quality of the translation.

∗ Department of Computer Science, University of Haifa, 31905 Haifa, Israel.
E-mail: gennadi.lembersky@nice.com.
∗∗ Department of Computer Science, University of Haifa, 31905 Haifa, Israel.
E-mail: noam.ordan@gmail.com.
† Department of Computer Science, University of Haifa, 31905 Haifa, Israel.
E-mail: shuly@cs.haifa.ac.il.

Submission received: 23 June 2012; revised submission received: 13 November 2012; accepted for publication:
18 January 2013.

doi:10.1162/COLI_a_00159

© 2013 Association for Computational Linguistics


1. Introduction

Much research in translation studies indicates that translated texts have unique
characteristics that set them apart from original texts (Toury 1980; Gellerstam 1986;
Toury 1995). Known as translationese, translated texts (in any language) constitute a
genre, or a dialect, of the target language, which reflects both artifacts of the translation
process and traces of the original language from which the texts were translated.
Among the better-known properties of translationese are simplification and explicitation
(Blum-Kulka and Levenston 1983; Blum-Kulka 1986; Baker 1993): Translated texts tend
to be shorter, to have lower type/token ratio, and to use certain discourse markers
more frequently than original texts. Interestingly, translated texts are so markedly
different from original ones that automatic classification can identify them with very
high accuracy (Baroni and Bernardini 2006; van Halteren 2008; Ilisei et al. 2010; Koppel
and Ordan 2011).

Contemporary statistical machine translation (SMT) systems use parallel corpora to
train translation models that reflect source- and target-language phrase correspondences.
Typically, SMT systems ignore the direction of translation of the parallel corpus. Given
the unique properties of translationese, which operate asymmetrically from source to
target language, it is reasonable to assume that this direction may affect the quality
of the translation. Recently, Kurokawa, Goutte, and Isabelle (2009) showed that this is
indeed the case. They trained a system to translate between French and English (and
vice versa) using a French-translated-to-English parallel corpus, and then an English-
translated-to-French one. They found that in translating into French the latter parallel
corpus yields better results (in terms of higher BLEU scores), whereas for translating
into English it is better to use the former.

Typically, however, parallel corpora are not marked for direction. Therefore,
Kurokawa, Goutte, and Isabelle (2009) trained an SVM-based classifier to predict which
side of a bi-text is the origin and which one is the translation, and trained a translation
model by utilizing only the subset of the corpus that corresponds to the direction of the
task.

We use these results as our departure point, but improve them in two major ways.
First, we demonstrate that the other subset of the corpus, reflecting translation in the
“wrong” direction, is also important for the translation task, and must not be ignored;
second, we show that explicit information on the direction of translation of the par-
allel corpus, whether manually annotated or machine-learned, is not mandatory. This
is achieved by casting the problem in the framework of domain adaptation: We use
domain-adaptation techniques to direct the SMT system toward producing output that
better reflects the properties of translationese. We show that SMT systems adapted to
translationese produce better translations than vanilla systems trained on exactly the
same resources. We confirm these findings using automatic evaluation metrics, as well
as through a qualitative analysis of the results.

After reviewing related work in Section 2, we begin by replicating the results of
Kurokawa, Goutte, and Isabelle (2009) in Section 3. We then (Section 4) explain why
translation quality improves when the parallel corpus is translated in the “right” di-
rection. We do so by showing that the subset of the corpus that was translated in the
direction of the translation task (the “right” direction, henceforth source-to-target, or
S → T) yields phrase tables that are better suited for translation of the original language
than the subset translated in the reverse direction (the “wrong” direction, henceforth
target-to-source, or T → S). We use several statistical measures that indicate the better
quality of the phrase tables in the former case.


We then show (Section 5) that using the entire parallel corpus, including texts that
are translated both in the “right” and in the “wrong” direction, improves the quality
of the results. Next, we investigate several ways to improve the translation quality
by adapting a translation model to the nature of translationese, thereby making the
output of machine translation more similar to actual, human translation. Specifically,
we create two phrase tables, one for the S → T portion of the corpus, and one for the
T → S portion, and combine them into a mixture model using perplexity minimization
(Sennrich 2012) to set the model weights. We show that this combination significantly
outperforms a simple union of the two portions of the parallel corpus.

Furthermore, we show that the direction of translation used for producing the
parallel corpus can be approximated by defining several entropy-based measures that
correlate well with translationese, and, consequently, with translation quality. We use
the entire corpus, create a single, unified phrase table, and then use these measures,
and in particular cross-entropy, as a clue for selecting phrase pairs from this table. The
benefit of this method is that not only does it improve the translation quality, but it
also eliminates the need to directly predict the direction of translation of the parallel
corpus.

The main contribution of this work, therefore, is a methodology that improves
the quality of SMT by building translation models that are adapted to the nature of
translationese.1 To demonstrate the contribution of our methodology, we conduct in
Section 6 a thorough analysis of our results, both quantitatively and qualitatively. We
show that translations produced by our best-performing system indeed reflect some
well-known properties of translationese better than the output of baseline systems.
Furthermore, we provide several examples of SMT outputs that demonstrate in what
ways our adapted system generates better results.

1 This article is a revised and much extended version of Lembersky, Ordan, and Wintner (2012a).
Extensions include experiments with several language pairs, in both directions; better adaptation
techniques; a new mixture model; and a detailed analysis of the results.

2. Related Work

Kurokawa, Goutte, and Isabelle (2009) were the first to address the direction of
translation in the context of SMT. They found that a translation model based on the
S → T portion of the parallel corpus results in much better translation quality than a
translation model based on the T → S portion. We replicate these results here (Section 3),
and view them as a baseline. In taking direction into account, we are faced with two
major challenges. First, using only the “right” portion of the corpus results in discarding
potentially very useful data. In real-world scenarios this can be crucial, because the
proportion between the two portions of the corpus can vary greatly. In the Hansard
corpus, used by Kurokawa, Goutte, and Isabelle (2009), only 20% of the corpus is S → T.
We show that the T → S portion is also important for machine translation and thus
should not be discarded. Using information-theoretic measures, and in particular cross-
entropy, we gain statistically significant improvements in translation quality beyond the
results of Kurokawa, Goutte, and Isabelle (2009). The second challenge is to overcome
the need to (manually or automatically) classify parallel corpora according to direction.
We face this challenge by using an adaptation technique.

In previous work, we investigated the relations between translationese and machine
translation, focusing on the language model (LM) (Lembersky, Ordan, and Wintner
2011, 2012b). We showed that LMs trained on translated texts yield better translation
quality than LMs compiled from original texts. We also showed that perplexity is a
good discriminator between original and translated texts. Importantly, we convincingly
demonstrated that the differences between translated and original texts are indeed due
to effects of translationese, and cannot be attributed to the domain or topic of the texts.
Whereas that work focused on the product of translation, namely, the language model,
the current study focuses on the process of translation, to wit, the translation model.

Our current work is closely related to research in domain adaptation. In a typical
domain adaptation scenario, a system is trained on a large corpus of “general” (out-of-
domain) training material, with a small portion of in-domain training texts. In our case,
the translation model is trained on a large parallel corpus, of which some (generally
unknown) subset is “in-domain” (S → T), and some other subset is “out-of-domain”
(T → S). Most existing adaptation methods focus on selecting in-domain data from a
general domain corpus. In particular, perplexity is used to score the sentences in the
general-domain corpus according to an in-domain language model. Gao et al. (2002) and
Moore and Lewis (2010) apply this method to language modeling, and Foster, Goutte,
and Kuhn (2010) and Axelrod, He, and Gao (2011) apply this method to translation
modeling. Moore and Lewis (2010) suggest a slightly different approach, using cross-
entropy difference as a ranking function.

Domain adaptation methods are usually applied at the corpus level, whereas we
focus on an adaptation of the phrase table used for SMT. In this sense, our work follows
Foster, Goutte, and Kuhn (2010), who weigh out-of-domain phrase pairs according to
their relevance to the target domain. They use multiple features that help to distinguish
between phrase pairs in the general domain and those in the specific domain. We rely
on features that are motivated by the findings of translation studies, having established
their relevance through a comparative analysis of the phrase tables. In particular, we use
measures such as translation model entropy, inspired by Koehn, Birch, and Steinberger
(2009). Additionally, we apply the method suggested by Moore and Lewis (2010) using
perplexity ratio instead of cross-entropy difference.

Koehn and Schroeder (2007) suggest a method for adaptation of translation models.
They pass two phrase tables directly to the decoder using multiple decoding paths. As
we show in Section 5, the application of this method to our scenario does not result
in a clear contribution, and we are able to show better results using our proposed
method.

Finally, Sennrich (2012) proposes perplexity minimization as a way to set the
weights for translation model mixture for domain adaptation. We successfully apply
this method to the problem of adapting translation models to translationese, gaining
statistically significant improvements in translation quality.

3. Baseline Experiments

3.1 Europarl Experiments

The task we focus on in our experiments is translation from French to English (FR-
EN) and from English to French (EN-FR). To establish the robustness of our approach,
we also conduct experiments with other translation tasks, including German–English
(DE-EN), English–German (EN-DE), Italian–English (IT-EN), and English–Italian (EN-
IT). Our corpus is Europarl (Koehn 2005), specifically, portions collected over the years
1996–1999 and 2001–2009. This is a large multilingual corpus, containing sentences
translated from several European languages. In most cases the corpus is annotated with
the original language and the name of the speaker. For each language pair we extract
from the multilingual corpus two subsets, corresponding to the original languages in
which the sentences were produced. For example, in the case of FR-EN we extract
from our corpus all sentences produced in French and translated into English, and all
sentences produced in English and translated into French. All sentences are lowercased
and tokenized using Moses (Koehn et al. 2007). Sentences longer than 80 words are
discarded. Table 1 depicts the size of the subsets whose target language is English.

We use each subset to train two phrase-based statistical machine translation (PB-
SMT) systems (Koehn et al. 2007), translating in both directions between the languages
in each language pair. In other words, we train two PB-SMTs for each translation task,
each based on a parallel corpus produced and translated in a different direction. We
use GIZA++ (Och and Ney 2000) with grow-diag-final alignment, and extract phrases
of length up to 10 words. We prune the resulting phrase tables as in Johnson et al.
(2007), using at most 30 translations per source phrase and discarding singleton phrase
pairs.
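For concreteness, the pruning step can be sketched as follows. This is only a minimal illustration of the two thresholds just mentioned, not the significance-based filter of Johnson et al. (2007) itself, and the tuple representation of the phrase table is an assumption made for the example:

```python
from collections import defaultdict

def prune_phrase_table(entries, max_translations=30):
    """Drop singleton phrase pairs and keep at most `max_translations`
    candidates per source phrase, ranked by p(t|s).  `entries` is an
    iterable of (src, tgt, p_tgt_given_src, count) tuples, a toy
    representation rather than the actual Moses phrase table format."""
    by_source = defaultdict(list)
    for src, tgt, p, count in entries:
        if count > 1:                      # discard singleton phrase pairs
            by_source[src].append((p, tgt))
    pruned = []
    for src, candidates in by_source.items():
        candidates.sort(reverse=True)      # most probable translations first
        for p, tgt in candidates[:max_translations]:
            pruned.append((src, tgt, p))
    return pruned
```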

We use all Europarl corpora between the years 1996–1999 and 2001–2009 to con-
struct English, German, French, and Italian 5-gram language models, using interpolated
modified Kneser-Ney discounting (Chen 1998) and no cut-off on all n-grams. We use a
specific symbol to mark out-of-vocabulary words (OOVs). The OOV rate is low, less
than 0.5%, and very similar in all our experiments. We use the portion of Europarl
collected over the year 2000 for tuning and evaluation. For each translation task we ran-
domly extract 1,000 parallel sentences for the tuning set and another set of 5,000 parallel
sentences for evaluation. These sentences are originally written in the translation task’s
source language and are translated into the translation task’s target language (in real-
world scenarios, the directionality of the test set is typically known). We use the MERT
algorithm (Och 2003) for tuning and BLEU (Papineni et al. 2002) as our evaluation
metric. We test the statistical significance of the differences between the results using
the bootstrap resampling method (Koehn 2004).

A word on notation: We use S → T when the translation direction of the parallel
corpus corresponds to the translation task and T → S when a corpus is translated in the
opposite direction to the translation task. For example, suppose the translation tasks are
English-to-French (E2F) and French-to-English (F2E). We use S → T when the French-
original corpus is used for the F2E task or when the English-original corpus is used for
the E2F task; and T → S when the French-original corpus is used for the E2F task or
when the English-original corpus is used for the F2E task.

Table 1
Europarl corpus size, in sentences and tokens.

         Original language   #Sentences   #Tokens
FR-EN    French                 168,818   4,995,397
         English                134,318   3,441,120
DE-EN    German                 200,037   5,571,202
         English                129,309   3,283,298
IT-EN    Italian                 69,270   2,535,225
         English                125,640   3,389,736

Table 2 depicts the BLEU scores of the SMT systems. The data are consistent with
the findings of Kurokawa, Goutte, and Isabelle (2009): Systems trained on S → T parallel
texts always outperform systems trained on T → S texts. The difference in BLEU score
can be as high as 3 points.

Table 2
BLEU scores of the Europarl baseline systems.

Task     S → T   T → S
FR-EN    33.64   30.88
EN-FR    32.11   30.35
DE-EN    26.53   23.67
EN-DE    16.96   16.17
IT-EN    28.70   26.84
EN-IT    23.81   21.28

3.2 Hansard Experiments

The corpora used in the Europarl experiments are small (up to 200,000 sentences).
Also, the ratio between S → T and T → S materials varies greatly for different language
pairs. To mitigate these issues we use the Hansard corpus, containing transcripts of the
Canadian parliament from 1996–2007, as another source of parallel data. The Hansard is
a bilingual French–English corpus comprising approximately 80% English-original texts
and 20% French-original texts. Crucially, each sentence pair in the corpus is annotated
with the direction of translation.

To address the effect of corpus size, we compile six subsets of different sizes (250K,
500K, 750K, 1M, 1.25M, and 1.5M parallel sentences) from each portion (English-original
and French-original) of the corpus. Additionally, we use the devtest section of the
Hansard corpus to randomly select French-original and English-original sentences that
are used for tuning (1,000 sentences each) and evaluation (5,000 sentences each).

On these corpora we train twelve French-to-English and twelve English-to-French
PB-SMT systems using the Moses toolkit (Koehn et al. 2007). We use the same GIZA++
configuration and phrase table pruning as in the Europarl experiments. We also reuse
the English and French language models. French-to-English MT systems are tuned and
tested on French-original sentences and English-to-French systems on English-original
ones.

Table 3 depicts the BLEU scores of the Hansard systems. The data are consistent
with our previous findings: Systems trained on S → T parallel texts always outperform
systems trained on T → S texts, even when the latter are much larger. For example,
a French-to-English SMT system trained on 250,000 S → T sentences outperforms a
system trained on 1,500,000 T → S sentences.

Table 3
BLEU scores of the Hansard baseline systems.

Task: French-to-English           Task: English-to-French
Corpus subset   S → T   T → S     Corpus subset   S → T   T → S
250K            34.35   31.33     250K            27.74   26.58
500K            35.21   32.38     500K            29.15   27.19
750K            36.12   32.90     750K            29.43   27.63
1M              35.73   33.07     1M              29.94   27.88
1.25M           36.24   33.23     1.25M           30.63   27.84
1.5M            36.43   33.73     1.5M            29.89   27.83

4. Phrase Tables Reflect Facets of Translationese

The baseline results suggest that S → T and T → S phrase tables differ substantially,
presumably due to the different characteristics of original and translated texts. In this
section we explain the better translation quality in terms of the better quality of the
respective phrase tables, as defined by a number of statistical measures. We first relate
these measures to the unique properties of translationese.

Translated texts tend to be simpler than original ones along a number of criteria.
Generally, translated texts are not as rich and variable as original ones, and, in particular,
their type/token ratio is lower. Consequently, we expect S → T phrase tables (which
are based on a parallel corpus whose source is original texts, and whose target is
translationese) to have more unique source phrases and a lower number of translations
per source phrase. A large number of unique source phrases suggests better coverage
of the source text, whereas a small number of translations per source phrase means a
lower phrase table entropy.

These expectations are confirmed, as the data in Table 4 show. We report the total
size of the phrase table in tokens (Total), the number of unique source phrases (Source),
and the average number of translations per source phrase (AvgTran), computed on the
24 phrase tables corresponding to our SMT systems.2 Evidently, S → T phrase tables
have more unique source phrases, but fewer translation options per source phrase.
This holds uniformly for all 24 tables.

2 The phrase tables were pruned, retaining only phrases that are included in the evaluation set.

Table 4
Statistical measures computed on the phrase tables: total size, in tokens (Total), the number of
unique source phrases (Source), and the average number of translations per source phrase
(AvgTran).

Task: French-to-English
        S → T                      T → S
Set     Total   Source   AvgTran   Total   Source   AvgTran
250K    231K    69K      3.35      199K    55K      3.65
500K    360K    86K      4.21      317K    69K      4.56
750K    461K    96K      4.81      405K    78K      5.19
1M      544K    103K     5.27      479K    85K      5.66
1.25M   619K    109K     5.66      545K    90K      6.07
1.5M    684K    114K     6.01      602K    94K      6.43

Task: English-to-French
        S → T                      T → S
Set     Total   Source   AvgTran   Total   Source   AvgTran
250K    224K    49K      4.52      220K    46K      4.75
500K    346K    61K      5.64      334K    57K      5.82
750K    437K    68K      6.39      421K    64K      6.54
1M      513K    74K      6.95      489K    69K      7.10
1.25M   579K    78K      7.42      550K    73K      7.56
1.5M    635K    81K      7.83      603K    76K      7.92

These findings are consistent with our understanding of translationese. Translated
texts are not as rich as original ones; their type-to-token ratio is lower, and the variety
of syntactic structures is more limited. S → T phrase tables capture correspondences
between phrases written in the source language (original) and translated to the target
language (translated). Consequently, more different types in the source language corre-
spond to fewer types in the target language. For example, in the FR-EN S → T lexicon
trained on 1.5M sentences, the French word réduite (reduced) has 77 translations,
whereas in the T → S lexicon the same word has 143 translations. Moreover, in the
S → T lexicon the probability of the best translation, reduced, is 41.2%, whereas in the
T → S lexicon it is only 28.7%.

A well-established tool for assessing the quality of a phrase table involves entropy-
based measures. Phrase table entropy captures the amount of uncertainty involved in
choosing candidate translation phrases (Koehn, Birch, and Steinberger 2009). Given a
source phrase s and a phrase table T with translations t of s whose probabilities are
p(t | s), the entropy H of s is:

\[ H(s) = -\sum_{t \in T} p(t \mid s) \times \log_2 p(t \mid s) \tag{1} \]

To compute the phrase table entropy, Koehn, Birch, and Steinberger (2009) search
through all possible segmentations of the source sentence to find the optimal covering
set of test sentences that minimizes the average entropy of the source phrases in the
covering set. We refer to this measure as covering set entropy, or CovEnt.

We also propose a metric that assesses the quality of the source side of a phrase
table. This metric finds the minimal covering set of a given text in the source language
using source phrases from a particular phrase table, and outputs the average length of
a phrase in the covering set. This measure is referred to as covering set average length,
or CovLen.
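Both covering-set measures can be computed with a simple dynamic program over token positions. The sketch below is illustrative rather than the implementation of Koehn, Birch, and Steinberger (2009): the phrase table is assumed to be a plain dictionary, and the program minimizes the total rather than the average cost of the cover, a simplification noted in the comments.

```python
import math

def phrase_entropy(dist):
    # Entropy of a source phrase (Equation 1); `dist` maps each
    # candidate translation t to p(t | s).
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def cover_sentence(words, table, cost):
    """Dynamic program over token positions that covers the sentence
    with source phrases from `table` (a dict from source phrase to its
    translation distribution), minimizing the total `cost` of the
    phrases used; the per-phrase average is reported at the end.
    Assumes every word can be covered; a real system needs an OOV
    fallback.  Minimizing the total instead of the average is a
    simplification of the covering-set search described above."""
    n = len(words)
    best = [(0.0, [])] + [(math.inf, None)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - 10), end):  # phrases up to 10 tokens
            phrase = " ".join(words[start:end])
            if phrase in table and not math.isinf(best[start][0]):
                total = best[start][0] + cost(table[phrase])
                if total < best[end][0]:
                    best[end] = (total, best[start][1] + [phrase])
    total, cover = best[n]
    return total / len(cover), cover

# CovEnt-style score: pass cost=phrase_entropy.
# CovLen: pass cost=lambda d: 1.0 to find the smallest covering set,
# then average the lengths of the phrases in the returned cover.
```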

Lembersky, Ordan, and Wintner (2011) show that perplexity distinguishes well
between translated and original texts. Moreover, perplexity can reflect the degree of
“relatedness” of a given phrase to original language or to translationese. Motivated
by this observation, we design a cross-entropy-based measure that assesses how well
each phrase table fits the properties of translationese. We then build language models
from translated texts, and compute the cross-entropy of each target phrase in the phrase
tables according to these language models.

Given a language model L, the cross-entropy of a text w = w₁, w₂, …, w_N is:

\[ H(w, L) = -\frac{1}{N} \sum_{i=1}^{N} \log_2 L(w_i) \tag{2} \]

Ideally, we would like to use only Hansard data for the language model, but as we
already used much of the Hansard data for training the translation model, we use
instead an adaptation of an external corpus (Europarl) to the Hansard domain. We
build language models of translated texts as follows. For English translationese, we
extract 170,000 French-original sentences from the English portion of Europarl, and
3,000 English-translated-from-French sentences from the Hansard corpus (disjoint from
the training, development, and test sets, of course). We use each corpus to train a
trigram language model with interpolated modified Kneser-Ney discounting and no
cut-off. All OOV words are mapped to a special token, ⟨unk⟩. Then, we interpolate
the Hansard and Europarl language models to minimize the perplexity of the target
side of the development set (λ = 0.58, the mixture weight of the Hansard corpus).
For French translationese, we use 270,000 sentences from Europarl and 3,000 sentences
from Hansard, λ = 0.81.
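The interpolation weight λ can be estimated with a short expectation-maximization loop over per-token probabilities on the development set, the same computation performed by standard toolkit utilities. A minimal sketch, assuming the two models' token probabilities have already been collected into parallel lists and are strictly positive (OOVs already mapped to ⟨unk⟩):

```python
def best_mixture_weight(p_hansard, p_europarl, iterations=50):
    # EM estimation of the weight lam that minimizes the perplexity of
    # the mixture lam*p_hansard + (1-lam)*p_europarl on held-out text.
    # The two arguments hold each model's probability for every token
    # of the development set.
    lam = 0.5
    for _ in range(iterations):
        # E-step: posterior responsibility of the first model per token
        post = [lam * a / (lam * a + (1 - lam) * b)
                for a, b in zip(p_hansard, p_europarl)]
        # M-step: the new weight is the average responsibility
        lam = sum(post) / len(post)
    return lam
```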

Similarly to covering set entropy, covering set cross-entropy (CovCrEnt) finds the
optimal covering set of test sentences that minimizes the weighted cross-entropy of the
source phrase in the covering set. Given a phrase table T and a language model L,
the weighted cross-entropy W for a source phrase s is:

\[ W(s, L) = \sum_{t \in T} H(t, L) \times p(t \mid s) \tag{3} \]

where H(t, L) is the cross-entropy of t according to a language model L.
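A minimal sketch of these two quantities follows; `logp2` is a hypothetical scorer standing in for a real n-gram language model, and context is ignored for brevity:

```python
import math

def cross_entropy(tokens, logp2):
    # Per-token cross-entropy of a phrase (Equation 2).  `logp2(w)` is
    # assumed to return the language model's log2-probability of token
    # w; a real implementation would condition on the n-gram history.
    return -sum(logp2(w) for w in tokens) / len(tokens)

def weighted_cross_entropy(translations, logp2):
    # Weighted cross-entropy of a source phrase (Equation 3);
    # `translations` maps each target phrase t to p(t | s).
    return sum(p * cross_entropy(t.split(), logp2)
               for t, p in translations.items())
```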

Table 5 lists the entropy-based measures computed on our 24 phrase tables. Again,
the data meet our expectations: S → T phrase tables uniformly and unexceptionally
have lower entropy and cross-entropy, but higher covering set length.

Table 5
Entropy-based measures computed on the phrase tables: covering set entropy (CovEnt),
covering set cross-entropy (CovCrEnt), and covering set average length (CovLen).

Task: French-to-English
        S → T                        T → S
Set     CovEnt   CovCrEnt   CovLen   CovEnt   CovCrEnt   CovLen
250K    0.36     1.64       2.44     0.45     1.87       2.25
500K    0.35     1.30       2.64     0.43     1.52       2.42
750K    0.35     1.10       2.77     0.43     1.35       2.53
1M      0.34     0.99       2.85     0.42     1.21       2.61
1.25M   0.34     0.91       2.92     0.41     1.12       2.67
1.5M    0.33     0.85       2.97     0.41     1.07       2.71

Task: English-to-French
        S → T                        T → S
Set     CovEnt   CovCrEnt   CovLen   CovEnt   CovCrEnt   CovLen
250K    0.63     1.88       2.08     0.63     2.09       2.02
500K    0.59     1.49       2.25     0.60     1.70       2.16
750K    0.57     1.33       2.33     0.58     1.48       2.25
1M      0.55     1.18       2.41     0.57     1.35       2.32
1.25M   0.54     1.09       2.46     0.55     1.25       2.37
1.5M    0.53     1.03       2.50     0.55     1.17       2.41

So far, we have established the hypothesis that S → T phrase tables better reflect the
properties of translationese than T → S ones. But does this necessarily affect the quality
of the generated translations? To verify that, we measure the correlation between the
quality of the translation, as measured by BLEU (Table 3), and each of the entropy-based
metrics. We compute the correlation coefficient R² (the square of Pearson's
product-moment correlation coefficient) by fitting a simple linear regression model.
Table 6 lists the results; clearly, all three measures are strongly correlated with
translation quality. Consequently, we use these measures as indicators of better
translations, more similar to translationese. Crucially, these measures are computed
directly on the phrase table, and do not require reference translations or
meta-information pertaining to the direction of translation of the parallel corpus.

Table 6
Correlation of BLEU scores with phrase table statistical measures.

Measure    R² (FR-EN)   R² (EN-FR)
CovEnt     0.94         0.46
CovCrEnt   0.56         0.54
CovLen     0.75         0.56
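For concreteness, the R² computation can be reproduced along the following lines; the pairing of the twelve French-to-English BLEU scores of Table 3 with the CovEnt values of Table 5 is an illustrative choice of regression inputs:

```python
import numpy as np

# BLEU (Table 3) against CovEnt (Table 5) for the twelve FR-EN systems,
# the six S -> T systems followed by the six T -> S systems.
bleu = np.array([34.35, 35.21, 36.12, 35.73, 36.24, 36.43,
                 31.33, 32.38, 32.90, 33.07, 33.23, 33.73])
cov_ent = np.array([0.36, 0.35, 0.35, 0.34, 0.34, 0.33,
                    0.45, 0.43, 0.43, 0.42, 0.41, 0.41])

slope, intercept = np.polyfit(cov_ent, bleu, 1)   # simple linear regression
residuals = bleu - (slope * cov_ent + intercept)
r2 = 1 - (residuals ** 2).sum() / ((bleu - bleu.mean()) ** 2).sum()
print(f"R^2 = {r2:.2f}")
```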

5. Adaptation of the Translation Model to Translationese

We have thus established the fact that S → T phrase tables have an advantage over
T → S ones that stems directly from the different characteristics of original and trans-
lated texts. We have also identified three statistical measures that explain most of the
variability in translation quality. We now explore ways for taking advantage of the entire
parallel corpus, including translations in both directions, in light of these findings. Our
goal is to establish the best method to address the issue of different translation direction
components in the parallel corpus.

5.1 Baseline

As a simple baseline we take the union of the two subsets of the parallel corpus. This
gives the decoder an opportunity to select phrases from either subset of the corpus,
and MERT can be expected to optimize this selection process. For each translation
task in Section 3.1, we concatenate the S → T and the T → S subsets of the parallel
corpora and use the union to train an SMT system (henceforth UNION). We use the
same language and reordering models, Moses configuration, and the same tuning and
evaluation sets as in Section 3.1. Table 7 reports the results. The UNION systems, which
use twice as much training data as the S → T systems, outperform the S → T systems
for all language pairs except English-to-Italian. Only in three cases out of six (German-
to-English, English-to-German, and Italian-to-English), however, is the gain statistically
significant. Nevertheless, this indicates that the T → S subset contains useful material
that can (and does) contribute to translation quality.

Table 7
Evaluation results of various ways for combining phrase tables.

System        FR-EN   EN-FR   DE-EN   EN-DE   IT-EN   EN-IT
S → T         33.64   32.11   26.53   16.96   28.70   23.81
UNION         33.79   32.24   26.76   17.36   29.12   23.70
MULTI-PATH    33.81   31.95   26.68   17.39   29.11   23.80
PPLMIN-1      33.86   32.47   26.83   17.80   29.23   23.86
PPLMIN-2      33.95   32.65   26.77   17.65   29.44   24.01

We now look at ways to better utilize this portion. First, we train SMT systems
with two phrase tables using multiple decoding paths, and combine them in a log-
linear model, following Koehn and Schroeder (2007). The performance of this approach
(referred to as MULTI-PATH) is either lower or only slightly better than that of the
UNION systems (Table 7).

5.2 Perplexity Minimization

Next, we look at a linear interpolation of the translation models. We need a way to tune
the weights of the translation model components, and we use perplexity minimization,
following Sennrich (2012).

Given n phrase tables, we are looking for a set of n weights λ = λ₁, …, λ_n, such that
∑_{i=1}^{n} λᵢ = 1, where λᵢ is the interpolation weight of phrase table i. Then, given a phrase
pair (s, t), the linear interpolation of the n models is given by:

\[ p(s \mid t; \lambda) = \sum_{i=1}^{n} \lambda_i \, p_i(s \mid t) \tag{4} \]

To adapt an interpolated translation model to a specific (development) corpus, let p̃(s, t)
be the observed, empirical probability of the pair (s, t) in the development corpus. This
is obtained by training a phrase table on the development corpus using the standard
methodology; the probability of the pair (s, t) is then extracted from the phrase table.
The cross-entropy H of a translation model with probabilities p relative to a development
corpus with probabilities p̃ is defined as:

\[ H = -\sum_{s,t} \tilde{p}(s, t) \times \log_2 p(s \mid t) \tag{5} \]

To minimize the cross-entropy, we look for a weight vector λ̂ such that:

\[ \hat{\lambda} = \arg\min_{\lambda} \; -\sum_{s,t} \tilde{p}(s, t) \times \log_2 \left( \sum_{i=1}^{n} \lambda_i \, p_i(s \mid t) \right) \tag{6} \]

Each feature of the standard SMT translation model (the phrase translation proba-
bilities p(t | s) and p(s | t), and the lexical weights lex(t | s) and lex(s | t)) is optimized
independently.
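A minimal sketch of this optimization for the two-table case follows. Sennrich (2012) solves the same objective with numerical optimization; the brute-force grid over λ and the floor probability for unseen pairs are our own simplifications for the example:

```python
import math

def interpolation_weight(dev_probs, table1, table2, steps=100):
    """Grid search for the weight lam minimizing the cross-entropy of
    the two-table mixture on the development phrase pairs (Equation 6
    with n = 2).  `dev_probs` maps (s, t) to its empirical probability
    in the development phrase table; `table1` and `table2` map (s, t)
    to that model's p(s | t)."""
    floor = 1e-7                      # stand-in score for unseen pairs
    best_lam, best_h = 0.5, math.inf
    for k in range(1, steps):
        lam = k / steps
        h = -sum(p_emp * math.log2(lam * table1.get(pair, floor)
                                   + (1 - lam) * table2.get(pair, floor))
                 for pair, p_emp in dev_probs.items())
        if h < best_h:
            best_lam, best_h = lam, h
    return best_lam
```

Under this sketch, the weight search would be run once per phrase table feature, matching the independent per-feature optimization just described.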

This technique is particularly appealing for us for two reasons: first, Lembersky,
Ordan, and Wintner (2011) show that perplexity is a good differentiator between
original and translated texts; second, the perplexity is minimized with respect to some
development set. Consequently, if we use an S → T corpus for this purpose, we directly
adapt the interpolated phrase table to the qualities of the S → T translation models
as described in Section 4. We use this technique to interpolate two pairs of phrase
tables: we interpolate the S → T and the T → S models (we refer to this system as

PPLMIN-1) and we also interpolate the S → T with the UNION models (PPLMIN-2), as
a simple way of upweighting. Table 7 reports the results. In all cases, the interpolated
systems yield higher BLEU scores than the simple UNION systems. Although the
improvements are small (0.2–0.4 BLEU points), they are statistically significant in all
cases, except for German-English. Clearly, the interpolated systems outperform the
S → T systems by 0.2–0.7 BLEU points (statistically significant in all cases). PPLMIN-2
seems to be better than PPLMIN-1 in four out of six systems.

To verify that the improvement in translation quality is due to the adaptation
process rather than a quirk resulting from MERT instability, we use MultEval (Clark et
al. 2011). This is a script that takes machine translation hypotheses from several (in our
case, three) runs of an optimizer (MERT) and reports three popular metric scores: BLEU,
Meteor (Denkowski and Lavie 2011), and TER (Snover et al. 2006). Meteor and BLEU
scores are higher for better translations (↑), whereas TER is a lower-is-better measure (↓).
In addition, MultEval computes the ratio of output length to reference length (closer to
100% is better), as well as p-values (via approximate randomization). We use MultEval
to compare translation hypotheses of the UNION and PPLMIN-2 systems. Table 8
presents the results for French-to-English and English-to-French (other translation tasks
produce similar results). The improvement of the adapted systems is clear and robust.

Table 8
MultEval scores for UNION and PPLMIN-2 systems.

Task: French-to-English
Metric      System     Avg    p-value
BLEU ↑      UNION      33.7
            PPLMIN-2   33.9   0.0001
METEOR ↑    UNION      35.7
            PPLMIN-2   35.8   0.0001
TER ↓       UNION      49.7
            PPLMIN-2   49.5   0.0001
Length      UNION      99.4
            PPLMIN-2   99.5   0.0003

Task: English-to-French
Metric      System     Avg    p-value
BLEU ↑      UNION      32.3
            PPLMIN-2   32.6   0.0001
METEOR ↑    UNION      53.8
            PPLMIN-2   54.0   0.0001
TER ↓       UNION      52.6
            PPLMIN-2   52.5   0.004
Length      UNION      98.7
            PPLMIN-2   98.9   0.0001

5.3 Adaptation without Explicit Information on Directionality

A prerequisite for interpolating translation models, the method we advocate here, is
that the direction of translation of every sentence pair in the parallel corpus be known
in advance. When such information is not available, machine learning can automati-
cally classify texts as original or translated (Baroni and Bernardini 2006; van Halteren
2008; Ilisei et al. 2010; Koppel and Ordan 2011). Naturally, however, the quality of the
interpolation of translation models trained on classified (rather than annotated) data
is expected to decrease. In this section we establish an adaptation technique that does
not rely on explicit information pertaining to the direction of translation, but rather
uses perplexity-based measures to evaluate the "relatedness" of a specific phrase to an
original or a translated language "dialect."

For the following experiments we use the Hansard corpus described in Section 3.2;
FO (French original) refers to subsets of the parallel corpus that were translated from
French to English; EO (English original) refers to texts translated from English to French.
We create three different mixtures of FO and EO: a balanced mix comprising 500K
sentences each of FO and EO (MIX), an EO-biased mix with 500K sentences of FO
and 1M sentences of EO (MIX-EO), and an FO-biased mix with 1M sentences of FO and
500K sentences of EO (MIX-FO). We use these corpora to train French-to-English and
English-to-French MT systems, evaluating their quality on the evaluation sets described
in Section 3.2. We use the same Moses configuration as well as the same language and
reordering models as in Section 3.2.

Now, we adapt the translation models by adding to each phrase pair in the phrase
tables an additional factor, as a measure of its fitness to the genre of translationese. The
factors are used as additional features in the phrase table. We experiment with two such
factors. First, we use the language models described in Section 4 to compute the cross-
entropy of each translation option according to this model. We add cross-entropy as an
additional score of a translation pair that can be tuned by MERT (we refer to this system
as CrEnt). Because cross-entropy is a lower-is-better metric, we adjust the range
of values used by MERT for this score to be negative.

Second, following Moore and Lewis (2010), we define an adapting feature that not
only measures how close phrases are to translated language, but also how far they are
from original language, and use it as a factor in a phrase table (this system is referred to
as PplRatio). We build two additional language models of original texts as follows. For
original English, we extract 135,000 English-original sentences from the English portion
of Europarl, and 2,700 English-original sentences from the Hansard corpus. We train a
trigram language model with interpolated modified Kneser-Ney discounting on each
corpus and we interpolate both models to minimize the perplexity of the source side of
the development set for the English-to-French translation task (λ = 0.49). For original
French, we use 110,000 sentences from Europarl and 2,900 sentences from Hansard,
λ = 0.61. Finally, for each target phrase t in the phrase table we compute the ratio of
the perplexity of t according to the original language model Lo and the perplexity of t
with respect to the translated model Lt (see Section 4). In other words, the factor F is
computed as follows:

\[ F(t) = \frac{H(t, L_o)}{H(t, L_t)} \tag{7} \]
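Both factors can be scored offline, once per phrase-table entry. A minimal sketch, reusing the cross_entropy helper from the sketch after Equation (3); `logp2_orig` and `logp2_trans` are hypothetical log2-probability scorers for the original- and translated-language models:

```python
def translationese_factors(target_phrase, logp2_orig, logp2_trans):
    # The two adaptation factors for a single target phrase: its
    # cross-entropy under the translated-language LM (the CrEnt score)
    # and the ratio of Equation 7 (the PplRatio score).
    h_trans = cross_entropy(target_phrase.split(), logp2_trans)
    h_orig = cross_entropy(target_phrase.split(), logp2_orig)
    return h_trans, h_orig / h_trans
```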

We apply these techniques to the French-to-English and English-to-French phrase
tables built from the concatenated corpora, and use each phrase table to train an SMT
system. We compare the performance of these systems to that of S → T, UNION, and
both PPLMIN systems. Table 9 summarizes the results.

All systems outperform the corresponding UNION systems. CrEnt systems show
significant improvements (p < 0.05) on balanced scenarios (MIX) and on scenarios
biased towards the S → T component (MIX-FO in the French-to-English task, MIX-EO
in English-to-French). PplRatio systems exhibit more consistent behavior, showing
small, but statistically significant improvement (p < 0.05) in all scenarios. Additionally,
the new systems perform quite competitively compared to the interpolated systems,
winning in four out of six cases. Note again that all systems in the same column (except
S → T) are trained on exactly the same corpus and have exactly the same phrase tables.
The only difference is an additional factor in the phrase table that "encourages" the
decoder to select translation options that are closer to translated texts than to original
ones.

Table 9
Adaptation without classification results.

Task: French-to-English
System      MIX     MIX-EO   MIX-FO
S → T       35.21   35.21    35.73
UNION       35.27   35.36    35.94
PPLMIN-1    35.46   35.59    36.26
PPLMIN-2    35.75   35.65    36.20
CrEnt       35.54   35.45    36.75
PplRatio    35.59   35.78    36.22

Task: English-to-French
System      MIX     MIX-FO   MIX-EO
S → T       29.15   29.15    29.94
UNION       29.27   29.44    30.01
PPLMIN-1    29.64   29.94    29.65
PPLMIN-2    29.50   30.45    30.12
CrEnt       29.47   29.45    30.44
PplRatio    29.65   29.62    30.34

6. Analysis

We have demonstrated that SMT systems that are sensitive to the direction of translation
perform better. The superior quality of SMT systems that are adapted to translationese
is reflected in higher BLEU scores, but also in the scores of other automatic measures
for evaluating the quality of machine translation output. In this section we analyze the
better performance of translationese-adapted systems, both quantitatively and qualita-
tively, relating it to established insights in translation studies.

It may be claimed that translationese-aware systems perform better (in terms of
BLEU scores) not because of the properties of translated texts, but due to the closer
domain, genre, or topic of translated texts to those of the reference translations. In our
previous work (Lembersky, Ordan, and Wintner 2011, 2012b), we convincingly
demonstrated that this is not the case, by means of several experiments that abstracted
the texts away from specific words. Although these results are concerned with the
language model, we trust that they also hold for the translation model on which we
focus here.

Furthermore, improvements in BLEU scores that result from attention to transla-
tionese are consistent with human judgments. In other words, a machine translation
system that produces translations with higher BLEU scores by taking into account
the directionality of translation also produces translations that human judges prefer.
Although we have not conducted such experiments here, we have shown this corre-
lation in a previous work (Lembersky, Ordan, and Wintner 2012b) that focused on the
language model rather than on the translation model.
6.1 Quantitative Analysis

Is the output of translationese-adapted systems indeed more similar to translationese?
We begin with a set of properties of translationese that are easy to compute, and
evaluate the output of our translationese-adapted SMT systems in terms of these
properties.

6.1.1 Type–Token Ratio. Translated texts have been shown to have lower type-to-token
ratio (TTR) than original ones (Al-Shabab 1996). Figure 1 compares the TTR of the trans-
lation outputs of S → T, T → S, UNION, and PPLMIN-2 systems. For comparison, we
also add the TTR of the reference translations for each task. To mitigate the effects of the
different morphological systems of the various languages, we compute the TTR in terms
of lemmas, rather than surface forms. Obviously, the TTR of the S → T output is much
lower than that of the T → S system. Recall that S → T systems produce markedly better
translations than T → S ones, so indeed there is a clear correspondence between the
TTR of the outputs and better translation quality. Figure 1 also compares the TTR of
the outputs produced from two combination systems, UNION and PPLMIN-2. The
UNION outputs are arbitrary: Their TTR is sometimes lower than the corresponding
S → T system, but sometimes higher than even the corresponding T → S system. In
contrast, PPLMIN-2 systems (which are the best adapted systems) systematically
produce outputs with the lowest TTR, that is, outputs closest to translationese. As
expected, reference translations exhibit the lowest TTR in four out of the six tasks.

Figure 1: Type–token ratio in SMT translation outputs.

6.1.2 Singletons. A related property of translated texts is that they tend to exhibit a much
lower rate of words that occur only once in a text (hapax legomena) than original texts.
We thus count the number of singletons in the outputs of each of the SMT systems (and,
for comparison, the reference translations). The results, which are depicted in Figure 2,
are not totally conclusive, but are interesting nonetheless. Specifically, in all cases the
PPLMIN-2 system exhibits a lower number of singletons than the UNION system; and
in all systems except the English-Italian one, the number of singletons produced by the
PPLMIN-2 system is lowest. Reference translations exhibit the lowest rate of singletons
in five out of the six tasks.

Figure 2: Numbers of singletons in SMT translation outputs.

6.1.3 Entropy. As another quantitative measure of the contribution of perplexity min-
imization as a method of adaptation, we list in Table 10 the values of the entropy-
based measures discussed in Section 4 on three types of SMT systems: those compiled
from S → T texts only, UNION, and PPLMIN-2 ones. Observe that the covering set
cross-entropy measure, designed to reflect the fitting of a phrase table's target side to
translated texts, is significantly lower in PPLMIN-2 systems than in S → T and UNION
systems. This indicates that perplexity minimization improves the system's fitness to
translationese. Interestingly, the PPLMIN-2 systems have better lexical coverage than
the UNION systems. Table 10 lists data for French-English and English-French, but
other language pairs exhibit similar behavior.

Table 10
Entropy-based measures, computed on phrase tables of baseline and adapted SMT systems.

         System      CovEnt   CovCrEnt   CovLen
FR-EN    S → T       0.43     2.39       2.24
         T → S       0.45     2.77       2.03
         UNION       0.43     2.20       2.34
         PPLMIN-2    0.43     2.14       2.35
EN-FR    S → T       0.64     3.47       2.01
         T → S       0.66     3.52       1.99
         UNION       0.61     3.09       2.17
         PPLMIN-2    0.61     2.99       2.18
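The surface statistics of Sections 6.1.1 and 6.1.2 are straightforward to compute; a minimal sketch, with lemmatization assumed to happen upstream:

```python
from collections import Counter

def surface_stats(lemmas):
    # Type-token ratio (Section 6.1.1) and singleton count
    # (Section 6.1.2) over one lemmatized system output.
    counts = Counter(lemmas)
    ttr = len(counts) / len(lemmas)
    singletons = sum(1 for c in counts.values() if c == 1)
    return ttr, singletons
```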
6.1.4 Mean Occurrence Rate. Original texts are known to be lexically richer than
translated ones; in particular, translationese uses more frequent and common words
(Laviosa 1998). To assess the lexical diversity of a given text we define Mean Occurrence
Rate (MOR). MOR computes the average number of occurrences of tokens in the text with
respect to a large reference corpus. Consequently, sentences containing more frequent
words have higher MOR scores. More formally, given a reference corpus R with n word
types r₁, …, r_n, let C(rᵢ) be the number of occurrences of the word rᵢ in the corpus R.
Then the MOR of a sentence S = s₁ ⋯ s_k is:

\[ \mathrm{MOR}(S) = \frac{1}{k} \sum_{i=1}^{k} \log C(s_i) \tag{8} \]

C(sᵢ) is calculated from the corpus R if sᵢ ∈ R. Otherwise, C(sᵢ) = α, where α is a pre-
defined constant depending on the size of the reference corpus. In all our experiments
we use α = 0.5.

In order to establish the relation between the MOR measure and translation quality,
we compute MOR scores for each sentence of an SMT system output. Then, we sort the
output sentences based on their MOR scores, split the output into two parts, below and
above the MOR median, and calculate the BLEU score for each portion independently.
We perform these calculations on the outputs of the UNION and PPLMIN-2 SMT
systems for all our translation tasks. We use the Europarl corpus (Koehn 2005) as a
reference for a list of occurrences. Table 11 depicts the results. In all cases, the bottom
part (below the median) of SMT outputs has significantly lower BLEU scores (up to
5 BLEU points!) than the upper part, indicating that the MOR measure is a good (post
factum) differentiator between poor and good translations.

Table 11
BLEU scores computed on portions of UNION and PPLMIN-2 system outputs below and above
the MOR median.

         UNION             PPLMIN-2
Task     Bottom   Upper    Bottom   Upper
DE-EN    24.05    28.72    24.10    28.71
EN-DE    16.06    18.78    16.42    18.82
FR-EN    31.48    35.49    31.85    35.49
EN-FR    28.97    34.83    29.30    35.58
IT-EN    26.07    31.75    26.43    31.97
EN-IT    21.57    25.79    21.97    25.99

We now compute the average MOR score on the outputs of all our SMT systems.
Figure 3 shows the results. In all cases (except Italian to English), S → T is better than
T → S; and in all systems except EN-FR, PPLMIN-2 is best.

Figure 3: Mean Occurrence Rate in SMT translation outputs.
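A sketch of the MOR computation (Equation 8) follows; `ref_counts` is assumed to be a precomputed token-count dictionary over the reference corpus (Europarl):

```python
import math

def mor(tokens, ref_counts, alpha=0.5):
    # Mean Occurrence Rate of one sentence (Equation 8): the average
    # log-count of its tokens in a large reference corpus.  Tokens
    # unseen in the reference receive the constant alpha = 0.5 used
    # in the experiments above.
    return sum(math.log(ref_counts.get(w, alpha))
               for w in tokens) / len(tokens)
```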
6.2 Qualitative Analysis

Translation is sometimes described as an attempt to strike a balance between inter-
ference, the so-called inevitable marks left by the source language on the target text,
and standardization, the attempt of the translator to adapt the translation product to
the target language and culture, to break away from the source text towards a more
adequate text (Toury 1995). In order to study the effect of the adaptation qualitatively,
rather than quantitatively, we focus on several concrete examples. We compare transla-
tions produced by the UNION (henceforth baseline) and by the PPLMIN-2 (henceforth
adapted) French-English Europarl systems. We selected 200 sentences from the French-
English evaluation set for manual inspection, focusing on sentences in which the trans-
lations were significantly different from each other. Indeed, we find that the translations
are better adapted along several dimensions. In the following sentences, the baseline
follows a more literal translation, whereas the adapted system creates a more adequate,
standardized translation.

Source: Monsieur le président, chers collègues, les tempêtes qui ont ravagé la france
dans la nuit des 26 et 27 décembre ont fait, on l'a dit, 90 morts, 75 milliards de francs,
soit 11 milliards d'euros, de dégâts.
Baseline: Mr president, ladies and gentlemen, storms that have ravaged france during
the night of 26 and 27 december were, as has been said, 90 people dead, 75 billion
francs, that is, eur 11 billion, damage.
Adapted: Mr president, ladies and gentlemen, the storms which have devastated france
during the night of 26 and 27 december were, as has been said, 90 people dead,
75 billion francs, or eur 11 billion, damage.

Source: Tout d'abord, je tiens à saluer tous mes collègues maires, élus locaux, qui, au
quotidien, ont dû rassurer la population, organiser la solidarité, coopérer avec les
services publics.
Baseline: First of all, I should like to pay tribute to all my colleagues, mayors, local
elected representatives, who, in their daily lives, have had to reassure the
population, organise solidarity, cooperate with public services.
Adapted: First of all, I should like to pay tribute to all my colleagues, mayors, local
elected representatives, who, on a daily basis, have had to reassure the popula-
tion, organise solidarity, cooperate with public services.

Source: Monsieur le président, je vous remercie de me laisser conclure, et je rappellerai
simplement une maxime : "les tueurs en série se font toujours prendre par la police
quand ils accélèrent la cadence de leurs crimes".
Baseline: Mr president, thank you for allowing me to leave conclusion, and I would
like to remind you just a maxim: 'the murderers in series are always take by the
police when they accélèrent the pace of their crimes'.
Adapted: Mr president, thank you for letting me finish, and I would like to remind you
just a maxim: 'the murderers in series are always take by the police when they
accélèrent the pace of their crimes'.

Note that the baseline is not necessarily incomprehensible, nor even "impossible"
in the target language; in the first example, it is clear what is meant by storms that have
ravaged France, and moreover, we find such expressions in a 1.5G-token corpus
(Ferraresi et al. 2008); it is just half as likely as what is offered by the adapted system.
The second example, on the other hand, misses the point altogether, and the third one is
a clear case of interference, where the French laisser conclure is transferred verbatim as
leave conclusion.

Another difference between the two systems is reordering.
Sometimes, as in the two following examples, the inability of the baseline system to
reorder the words correctly stems from interference:

Source: Madame la présidente, mes chers collègues, nous croyions, jusqu'à présent, que
l'union européenne était, selon les dispositions des traités de rome et de paris qui
avaient fondé les communautés, devenues union, une association d'états libres,
indépendants et souverains.
Baseline: Madam president, ladies and gentlemen, we croyions, up to now, that the
european union is, according to the provisions of the treaties of rome and paris
who had based the communities, become union, an association of states free,
independent and sovereign.
Adapted: Madam president, ladies and gentlemen, we croyions, up to now, that the
european union was, according to the provisions of the treaties of paris and rome
who had based communities, become union, an association of free, sovereign
and independent states.

Source: La convention de lomé bénéficie essentiellement à quelques grands groupes
industriels ou financiers qui continuent à piller ces pays et perpétuent leur
dépendance économique, notamment des anciennes puissances coloniales.
Baseline: The lomé convention has mainly to a few large industrial groups or financial
which continue to plunder those countries and perpétuent their economic depen-
dence, in particular the former colonial powers.
Adapted: The romé convention has mainly to a few large financial and industrial
groups which continue to plunder those countries and perpetuate their economic
dependence, in particular the former colonial powers.

Additionally, the adapted system produces much better collocations. Compare the
"natural" expressions pay a high price and express the concern with the baseline system
products:

Source: Ces hommes et ces femmes qui bougent à travers l'europe paient leur voiture,
leurs taxes nationales, leur pot catalytique, leurs taxes sur les carburants, et paient
donc déjà très cher le prix de la magnifique machine et la liberté de circuler.
Baseline: These men and women who are moving across europe are paying their car,
their national taxes, their catalytic converter, their taxes on fuel, and therefore
already pay very dearly for the price of the magnificent machine and freedom
of movement.
Adapted: These men and women who are moving across europe pay their car, their
national taxes, their catalytic converter, their taxes on fuel, and therefore already
pay a high price for the magnificent machine and freedom of movement.

Source: Je veux dire également le souci que j'ai d'une bonne coopération entre interreg
et le fed, notamment pour les caraïbes et l'océan indien.
Baseline: I would like to say to the concern that I have good cooperation between
interreg and the edf, particularly for the caribbean and the indian ocean.
Adapted: I also wish to express the concern that I have good cooperation between
interreg and the edf, particularly for the caribbean and the indian ocean.

Last, there are a few cases of explicitation. Blum-Kulka (1986) observed the tendency
of translations to introduce to the target texts cohesive markers in order to render
implicit utterances more explicit.
Koppel and Ordan (2011), who used function words to discriminate between translated and non-translated texts, found that cohesive markers (words such as in fact, however, and moreover) were among the top markers of translationese, irrespective of source language and domain. Indeed, we find them over-represented in the adapted system as well:

Source: Nous affirmons au contraire la nécessité politique de rééquilibrer les rapports entre l'afrique et l'union européenne.

Baseline: We say the opposite the political necessity to rebalance relations between africa and the european union.

Adapted: on the contrary, we maintain the political necessity of rebalancing relations between africa and the european union.

Source: Cette mention semble alors contredire les explications linguistiques données par l'office et laisse craindre que l'erreur ne revête pas le seul caractère technique que l'on semble vouloir lui donner.

Baseline: This note seems so contradict the explanations given by the language and leaving office fear that the mistake revête do not only technical nature hat we seems to want to give it.

Adapted: This note therefore seems to contradict the linguistic explanations given by the office and fear that leaves the mistake revête do not only technical nature that we seems to want to give it.

In (human) translation circles, translating out of one's mother tongue is considered unprofessional, even unethical (Beeby 2009). Many professional associations in Europe urge translators to work exclusively into their mother tongue (Pavlović 2007). The two kinds of automatic systems built in this article reflect the human situation only partly, but they do so in a crucial way. The S → T systems learn from examples produced by many human translators who follow the decree according to which translation should be made into one's native tongue; the T → S systems flip the direction of the human input and output. The S → T direction proved to be more fluent and accurate. This has to do with the fact that these translators "cover" the source texts more fully, having a better "translation model."

7. Combining Translation and Language Models

When we experimented with translation models, we compiled language models from corpora comprising both original and translated texts. Our previous work (Lembersky, Ordan, and Wintner 2011, 2012b), however, shows that language models compiled from translated texts are better for machine translation than models trained on original texts. In this section we examine whether these findings have a cumulative effect; in other words, we test whether an additional improvement in translation quality can be gained by combining our findings for both the language model and the translation model.

The tasks are translating French-to-English (FR-EN) and English-to-French (EN-FR). We re-use the Europarl-based translation models of Section 3.1. We compile language models from the French-English Hansard-based parallel corpora described in Section 3.2, using subsets of 1 million parallel sentences. We train an original French LM on the source side of the S → T corpus and a translated English LM on the target side of the same corpus. In the same manner we compile the translated French LM and the original English LM from the T → S corpus. All language models are 5-gram models with interpolated modified Kneser-Ney discounting (Chen 1998). The vocabulary is limited to tokens that appear twice or more in the reference set; all unknown words are mapped to a special token.
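This recipe (5-gram order, interpolated modified Kneser-Ney discounting, a count-based vocabulary cutoff, and an unknown-word token) maps directly onto standard language modeling toolkits. The article does not name the toolkit used; the following is a minimal sketch in Python, assuming SRILM's ngram-count is available on the path, with placeholder file names.

import subprocess
from collections import Counter

def build_vocab(reference_path, vocab_path, min_count=2):
    # Keep only tokens occurring at least min_count times in the reference set.
    counts = Counter()
    with open(reference_path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    with open(vocab_path, "w", encoding="utf-8") as out:
        for token, c in counts.items():
            if c >= min_count:
                out.write(token + "\n")

build_vocab("reference.en", "vocab.en")
subprocess.run([
    "ngram-count",
    "-order", "5",
    "-interpolate", "-kndiscount",     # interpolated modified Kneser-Ney
    "-unk",                            # map out-of-vocabulary words to <unk>
    "-vocab", "vocab.en",
    "-text", "hansard.translated.en",  # e.g., target side of the S -> T corpus
    "-lm", "translated.en.arpa",
], check=True)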
We tune and evaluate all SMT systems on two kinds of reference sets: Europarl (Section 3.1) and Hansard (Section 3.2).

First, we use all possible combinations of translation and language models to train four SMT systems for each translation task: T → S TM with original (O) LM, T → S TM with translated (T) LM, S → T TM with O LM, and S → T TM with T LM. All systems are tuned and evaluated on both the Europarl and the Hansard reference sets. Table 12 shows the translation quality of the SMT systems in terms of BLEU.

Table 12
Combining TMs and LMs: SMT system evaluation results (BLEU).

FR-EN (Europarl)   LM O    LM T      FR-EN (Hansard)   LM O    LM T
TM T → S           27.06   27.30     TM T → S          24.41   25.47
TM S → T           30.38   30.65     TM S → T          25.46   26.44

EN-FR (Europarl)   LM O    LM T      EN-FR (Hansard)   LM O    LM T
TM T → S           22.33   22.71     TM T → S          15.88   16.34
TM S → T           25.11   24.94     TM S → T          17.08   17.48

Both the translation model and the language model contribute to the translation quality, but the contribution of the translation model is more significant. Even in the case of the Hansard reference set, in the English-to-French translation task, the S → T TM (compiled from Europarl texts) adds 1.2 BLEU points, whereas the T LM (compiled from Hansard texts) adds only 0.46 BLEU points.

Next, we perform a set of experiments to test whether a combination of the adaptation techniques described in Section 5 for the translation and language models can further improve the translation quality. First, we build a baseline SMT system with a translation model trained on the concatenation of the S → T and T → S parallel corpora and a language model compiled from the concatenation of translated and original texts. Then, we build two other systems, one with an adapted translation model and one with an adapted language model. Finally, we use the adapted translation and language models together to train yet another SMT system. We use the PPLMIN-2 method (Section 5) to adapt the translation model and linear interpolation to adapt the language model (sketched at the end of this section). The SMT systems are then tuned and evaluated on the Europarl and the Hansard reference sets.

Table 13
Adapting TMs and LMs: SMT system evaluation results (BLEU).

FR-EN (Europarl)   LM Concat   LM Adapt    FR-EN (Hansard)   LM Concat   LM Adapt
TM Concat          30.76       30.69       TM Concat         27.65       27.48
TM Adapt           31.06       31.13       TM Adapt          27.76       27.73

EN-FR (Europarl)   LM Concat   LM Adapt    EN-FR (Hansard)   LM Concat   LM Adapt
TM Concat          25.55       25.51       TM Concat         18.69       18.46
TM Adapt           25.64       25.69       TM Adapt          18.65       18.68

The results, shown in Table 13, indicate that SMT systems with an adapted TM usually outperform the baseline systems. LM adaptation alone does not improve the translation quality, but when combined with TM adaptation it produces the best results (although not significantly better than TM adaptation alone).
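For the language model side, the linearly interpolated mixture P(w | h) = λ P_T(w | h) + (1 − λ) P_O(w | h) has a standard EM solution for the weight λ: the value that maximizes the likelihood of a held-out reference set is exactly the one that minimizes its perplexity. The sketch below illustrates the update; the per-token probabilities are toy values standing in for real scores from the translated-text (T) and original-text (O) LMs, and the function names are ours, not the authors'.

import math

def em_interpolation_weight(p_t, p_o, iterations=20, lam=0.5):
    # p_t, p_o: per-token probabilities of the held-out set under each LM.
    for _ in range(iterations):
        # E-step: posterior responsibility of the translated-text LM per token.
        post = [lam * t / (lam * t + (1 - lam) * o)
                for t, o in zip(p_t, p_o)]
        # M-step: the new weight is the average responsibility.
        lam = sum(post) / len(post)
    return lam

def perplexity(p_t, p_o, lam):
    log_sum = sum(math.log(lam * t + (1 - lam) * o)
                  for t, o in zip(p_t, p_o))
    return math.exp(-log_sum / len(p_t))

# Toy probabilities; in practice these come from querying the two LMs.
p_translated = [0.012, 0.034, 0.008, 0.021]
p_original = [0.010, 0.015, 0.009, 0.011]
lam = em_interpolation_weight(p_translated, p_original)
print("lambda = %.3f, perplexity = %.2f"
      % (lam, perplexity(p_translated, p_original, lam)))

The same perplexity-minimization idea underlies the PPLMIN methods for the phrase table (Section 5), where the interpolated quantities are phrase-translation probabilities rather than n-gram probabilities.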
8. Conclusion

Phrase tables trained on parallel corpora that were translated in the same direction as the translation task perform better than those trained on corpora translated in the opposite direction. Nonetheless, even "wrong" phrase tables contribute to the translation quality. We analyze both "correct" and "wrong" phrase tables, uncovering a great deal of difference between them. We use insights from translation studies to explain these differences; we then adapt the translation model to the nature of translationese.

We investigate several approaches to the adaptation problem. First, we use linear interpolation to create a mixture model of the S → T and T → S translation models. We use perplexity minimization and an S → T reference set to determine the weight of each model, thus directly adapting the model to the properties of translationese. We show consistent and statistically significant improvements in translation quality on three different language pairs (six translation tasks) using several automatic evaluation metrics. Furthermore, we incorporate information-theoretic measures that correlate well with translationese into the phrase tables, as additional scores that can be tuned by MERT, and show a statistically significant improvement in translation quality over all baseline systems. We also analyze the results qualitatively, showing that SMT systems adapted to translationese tend to produce more coherent and fluent output than the baseline systems. An additional advantage of our approach is that it does not require annotation of the translation direction of the parallel corpus: It is completely generic and can be applied to any language pair, domain, or corpus.

Whereas this work focuses on improving the process of translation (i.e., the translation model), our previous work (Lembersky, Ordan, and Wintner 2012b) focuses on improving the product of translation (i.e., the language model). We have shown preliminary results in which both models were adapted to translationese; an open challenge is finding the optimal combination of improving both process and product in a single, unified system.

Acknowledgments

We are grateful to Cyril Goutte, George Foster, and Pierre Isabelle for providing us with an annotated version of the Hansard corpus. Alon Lavie has been instrumental in stimulating some of the ideas reported in this article, as well as in his long-term support and advice. We benefitted greatly from several constructive suggestions by the three anonymous Computational Linguistics referees. This research was supported by the Israel Science Foundation (grant no. 137/06) and by a grant from the Israeli Ministry of Science and Technology.

References

Al-Shabab, Omar S. 1996. Interpretation and the Language of Translation: Creativity and Conventions in Translation. Janus, Edinburgh.

Axelrod, Amittai, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 355–362, Edinburgh.

Baker, Mona. 1993. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and Technology: In Honour of John Sinclair. John Benjamins, Amsterdam, pages 233–252.

Baroni, Marco and Silvia Bernardini. 2006. A new approach to the study of translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259–274.
Beeby, Alison. 2009. Direction of translation (directionality). In Mona Baker and Gabriela Saldanha, editors, Routledge Encyclopedia of Translation Studies, 2nd edition. Routledge (Taylor and Francis), New York, pages 84–88.

Blum-Kulka, Shoshana. 1986. Shifts of cohesion and coherence in translation. In Juliane House and Shoshana Blum-Kulka, editors, Interlingual and Intercultural Communication: Discourse and Cognition in Translation and Second Language Acquisition Studies, volume 35. Gunter Narr Verlag, pages 17–35.

Blum-Kulka, Shoshana and Eddie A. Levenston. 1983. Universals of lexical simplification. In Claus Faerch and Gabriele Kasper, editors, Strategies in Interlanguage Communication. Longman, pages 119–139.

Chen, Stanley F. 1998. An empirical study of smoothing techniques for language modeling. Technical Report 10-98, Computer Science Group, Harvard University, Cambridge, MA.

Clark, Jonathan H., Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 176–181, Portland, OR.

Denkowski, Michael and Alon Lavie. 2011. Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 85–91, Edinburgh.

Ferraresi, Adriano, Silvia Bernardini, Giovanni Picci, and Marco Baroni. 2008. Web corpora for bilingual lexicography: A pilot study of English/French collocation extraction and translation. In Proceedings of the International Symposium on Using Corpora in Contrastive and Translation Studies, pages 1–30, Hangzhou.

Foster, George, Cyril Goutte, and Roland Kuhn. 2010. Discriminative instance weighting for domain adaptation in statistical machine translation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 451–459, Stroudsburg, PA.

Gao, Jianfeng, Joshua Goodman, Mingjing Li, and Kai-Fu Lee. 2002. Toward a unified approach to statistical language modeling for Chinese. ACM Transactions on Asian Language Information Processing, 1:3–33.

Gellerstam, Martin. 1986. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia. CWK Gleerup, Lund, pages 88–95.

Ilisei, Iustina, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. 2010. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503–511. Springer.

Johnson, Howard, Joel Martin, George Foster, and Roland Kuhn. 2007. Improving translation quality by discarding most of the phrasetable. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 967–975, Prague.

Koehn, Philipp. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP 2004, pages 388–395, Barcelona.
Koehn, Philipp. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit, pages 79–86, Phuket Island.

Koehn, Philipp, Alexandra Birch, and Ralf Steinberger. 2009. 462 machine translation systems for Europe. In Proceedings of the Twelfth Machine Translation Summit, pages 65–72, Ottawa.

Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague.

Koehn, Philipp and Josh Schroeder. 2007. Experiments in domain adaptation for statistical machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation (StatMT '07), pages 224–227, Stroudsburg, PA.

Koppel, Moshe and Noam Ordan. 2011. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318–1326, Portland, OR.

Kurokawa, David, Cyril Goutte, and Pierre Isabelle. 2009. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81–88, Ottawa.

Laviosa, Sara. 1998. Core patterns of lexical use in a comparable corpus of English narrative prose. Meta, 43(4):557–570.

Lembersky, Gennadi, Noam Ordan, and Shuly Wintner. 2011. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363–374, Edinburgh.

Lembersky, Gennadi, Noam Ordan, and Shuly Wintner. 2012a. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265, Avignon.

Lembersky, Gennadi, Noam Ordan, and Shuly Wintner. 2012b. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825.

Moore, Robert C. and William Lewis. 2010. Intelligent selection of language model training data. In Proceedings of the ACL 2010 Conference, Short Papers, pages 220–224, Stroudsburg, PA.

Och, Franz Josef. 2003. Minimum error rate training in statistical machine translation. In ACL '03: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160–167, Morristown, NJ.

Och, Franz Josef and Hermann Ney. 2000. Improved statistical alignment models. In ACL '00: Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 440–447, Morristown, NJ.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In ACL '02: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Morristown, NJ.

Pavlović, Nataša. 2007. Directionality in translation and interpreting practice: Report on a questionnaire survey in Croatia. Forum, 5(2):79–99.
Sennrich, Rico. 2012. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549, Avignon.

Snover, Matthew, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas (AMTA-2006), pages 223–231, Cambridge, MA.

Toury, Gideon. 1980. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv.

Toury, Gideon. 1995. Descriptive Translation Studies and Beyond. John Benjamins, Amsterdam/Philadelphia.

van Halteren, Hans. 2008. Source language markers in EUROPARL translations. In COLING '08: Proceedings of the 22nd International Conference on Computational Linguistics, pages 937–944, Morristown, NJ.