Online Learning for Statistical
Machine Translation
Daniel Ortiz-Mart´ınez∗
Universitat Polit`ecnica de Val`encia1
We present online learning techniques for statistical machine translation (SMT). The availability
of large training data sets that grow constantly over time is becoming more and more frequent in
the field of SMT—for example, in the context of translation agencies or the daily translation of
government proceedings. When new knowledge is to be incorporated in the SMT models, the use
of batch learning techniques require very time-consuming estimation processes over the whole
training set that may take days or weeks to be executed. By means of the application of online
learning, new training samples can be processed individually in real time. For this purpose, Wir
define a state-of-the-art SMT model composed of a set of submodels, as well as a set of incremental
update rules for each of these submodels. To test our techniques, we have studied two well-known
SMT applications that can be used in translation agencies: post-editing and interactive machine
Übersetzung. In both scenarios, the SMT system collaborates with the user to generate high-
quality translations. These user-validated translations can be used to extend the SMT models
by means of online learning. Empirical results in the two scenarios under consideration show
the great impact of frequent updates in the system performance. The time cost of such updates
was also measured, comparing the efficiency of a batch learning SMT system with that of an
online learning system, showing that online learning is able to work in real time whereas the
time cost of batch retraining soon becomes infeasible. Empirical results also showed that the
performance of online learning is comparable to that of batch learning. Darüber hinaus, the proposed
techniques were able to learn from previously estimated models or from scratch. We also propose
two new measures to predict the effectiveness of online learning in SMT tasks. The translation
system with online learning capabilities presented here is implemented in the open-source Thot
toolkit for SMT.
1. Einführung
Multiplicity of languages is inherent to modern society. Phenomena such as global-
ization and technological development have extraordinarily increased the need for
translating information from one language to another. One possibility to deal with this
growing demand of translations is the use of machine translation (MT) Techniken.
∗ PRHLT Research Center, Universitat Polit`ecnica de Val`encia, 46071, Val`encia, Spanien,
Email: dortiz@prhlt.upv.es.
1 The author is now at Webinterpret.
Einreichung erhalten: 2 Mai 2014; revised version received: 2 Oktober 2015; zur Veröffentlichung angenommen:
20 November 2015.
doi:10.1162/COLI a 00244
© 2016 Verein für Computerlinguistik
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
C
Ö
l
ich
_
A
_
0
0
2
4
4
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Computerlinguistik
Volumen 42, Nummer 1
MT can be formalized under a statistical point of view as the process of finding
the sentence of maximum probability in the target language given the source sentence.
Statistical MT (SMT) requires the availability of parallel texts to estimate the statistical
models involved in the translation. It is also important that such parallel texts belong to
the same domain the system will be used for. These kinds of texts are referred to as in-
domain corpora in the domain adaptation literature. Jedoch, in-domain corpora are
often not available in real translation scenarios, forcing us to estimate the system models
by means of large out-of-domain texts, such as Parliament proceedings. Bedauerlicherweise,
this results in a significant degradation in the translation quality (Irvine et al. 2013).
There are many real translation scenarios in which new training data is inherently
generated over time (z.B., translation agencies or the daily translation of government
Verfahren). The newly generated training data could be used to mitigate the problem
of data scarcity. Jedoch, this situation poses new challenges in the SMT framework,
because the vast majority of the SMT systems described in the literature makes use of
the well-known batch learning paradigm. In the batch learning paradigm, the train-
ing of the SMT system and the translation process are carried out in separate stages.
This implies that all training samples must be available before training takes place,
preventing the statistical models to be extended when the system starts generating
translations. To solve this problem, the online learning paradigm can be applied. Online
learning is a machine learning task that is structured in a series of trials, where each
trial has four steps: (1) the learning algorithm receives an instance, (2) a label for the
instance is predicted, (3) the true label for the instance is presented, Und (4) the learning
algorithm uses the true label to update its parameters. In this paradigm, the training
and prediction stages are no longer separated.
Online learning fits nicely in typical computer-assisted translation (CAT) appli-
cations used in translation agencies. This is because in such applications the system
translation for each source sentence is validated by a human expert and thus can be
used to produce new training pairs. One possible CAT implementation consists of
post-editing (PE) the output of an MT system. In this implementation, the MT system
generates an initial translation that is corrected by the user without further system
intervention. Another instance of CAT is interactive machine translation (IMT), Wo
the user generates each translation in a series of interactions with the system.
Scientific and commercial interest in CAT applications has greatly increased dur-
ing recent years, capturing the attention of internationally renowned research groups
and translation companies. A good example of this is the work carried out in the
TransType (Foster, Isabelle, and Plamondon 1997) and TransType-II (SchlumbergerSema
S.A. et al. 2001) research projects, where the IMT paradigm was developed, und das
CasMaCat (Alabau et al. 2014) and MateCat (Federico et al. 2014) Projekte, where a
substantial part of the effort was focused on developing adaptive learning techniques
for CAT. Literature also offers demonstrations of CAT applications (Koehn 2009; Ortiz-
Mart´ınez et al. 2011) as well as studies involving real users showing the potential
benefits of CAT (Grün, Heer, and Manning 2013; Ortiz-Mart´ınez et al. 2015).
In this work we propose online learning techniques for SMT useful to efficiently
update the statistical models used by the system, avoiding the necessity of execut-
ing costly retraining processes. The properties of the proposed techniques will be
tested in the two CAT scenarios we have mentioned. As noted earlier, in such sce-
narios there is a human translator that supervises each system translation. Jedoch,
it is important to remark that our proposed techniques can also work in scenar-
ios where there are no human experts involved. One example of this can be found
in fully automatic translation tasks where the initial models can be extended from
122
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
C
Ö
l
ich
_
A
_
0
0
2
4
4
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Ortiz-Mart´ınez
Online Learning for Statistical Machine Translation
new blocks of training data obtained from different sources, such as parliamentary
Verfahren.
The rest of the article is organized as follows: Abschnitt 2 describes the statistical
foundations of SMT and its adaptation to the PE and IMT scenarios. Abschnitt 3 explains
the online learning techniques proposed here, including the definition of a log-linear
SMT model as well as a set of incremental update rules for each one of its components.
The content of Section 3 is complemented by Appendix A, which presents an alternative
incremental update rule for word-alignment models. Experimental results as well as
their discussion are shown in Sections 4 Und 5, jeweils. Abschnitt 6 describes related
work on online learning. The work conclusions are given in Section 7.
2. Statistical Framework
In this section we describe the details of the statistical framework adopted in the rest
of the article. For this purpose, we briefly describe the statistical formulation of SMT as
well as the required modifications for its use in two well-known applications of SMT,
nämlich, post editing and interactive machine translation.
2.1 Statistical Machine Translation
In the statistical approach to MT, given a source sentence f J
1 ≡ f1… fj… fJ in the source
language F, we want to find its equivalent target sentence eI
1 ≡ e1…NEIN…eI in the target
language E, where fj and ei note the ith word and the jth word of the sentences f J
1 Und
eI
1, jeweils. From the set of all possible sentences of the target language, we are
interested in that with the highest probability according to the following equation:
ˆI
1 = arg max
ˆe
{Pr(eI
1| f J
1 )}
ICH,eI
1
(1)
1|eI
Early works on SMT decompose Pr(eI
1 ) applying Bayes’ theorem and thus ob-
taining two new distributions, Pr(eI
1|eI
1), which are approximated by means
of parametric statistical models. Konkret, Pr(eI
1) is modeled by means of a language
Modell, and Pr( f J
1) is modeled by means of a translation model.
1| f J
1) and Pr( f J
Statistical language models are typically implemented with n-gram language
Modelle. Regarding the translation models, they are commonly implemented using the
so-called phrase-based models (Koehn, Und, and Marcu 2003). The basic idea of phrase-
based translation is to segment the source sentence into phrases, then to translate each
source phrase into a target phrase, and finally to reorder the translated target phrases
in order to compose the target sentence. The decisions made during the phrase-based
translation process can be summarized by means of the hidden variable ˜aK
1 ≡ ˜a1…˜ak…˜aK,
where ˜ak denotes the index of the target phrase ˜ek that is aligned with the kth source
phrase ˜fk, determining a bisegmentation of the source and target sentences of length K.
Alternative formalizations to the one using Bayes’ theorem have been proposed.
Such formalizations are based on the direct modeling of the posterior probability,
Pr(eI
1 ), by means of the so-called log-linear models for SMT (Oh und Ney 2002). Log-
linear models use a set of feature functions hr( f J
1) each one with its corresponding
weight λr, which are typically estimated by means of the well-known minimum error
rate training (MERT) Algorithmus (Und 2003). Common log-linear model implementations
1| f J
1, eI
123
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
C
Ö
l
ich
_
A
_
0
0
2
4
4
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Computerlinguistik
Volumen 42, Nummer 1
are strongly focused on these phrase-based models, obtaining the best alignment at
phrase level:
ˆI
1 = arg max
ˆe
ICH,eI
1
(cid:26)
R
(cid:88)
r=1
max
K,˜aK
1
(cid:27)
λrhr( f J
1, eI
1, ˜aK
1 )
(2)
where a total of R feature functions are assumed.
State-of-the-art decoders work by exploring the search space determined by Equa-
tion (2) using iterative algorithms that build partial target translations from left to right.
2.2 Post-Editing the Output of Statistical Machine Translation
Post-editing (PE) involves making corrections to machine generated translations
(see TAUS-Project [2010] for a detailed study). PE is used when raw machine translation
is not error-free, a common situation for current MT technology. PE tends to be carried
out via tools built for editing human-generated translations, such as translation mem-
ories (some authors refer to this task as simply editing). Because in the PE scenario,
the user only edits the output of the MT system without further intervention from
the system, there are no differences in the way in which the MT system is designed
and implemented. Somit, the statistical framework for MT described previously can be
adopted without modifications in order to build the PE system.
2.3 Statistical Interactive Machine Translation
One alternative to the serial collaboration model adopted by PE is interactively com-
bining the MT system with a human translator, constituting the interactive machine
Übersetzung (IMT) paradigm (also referred to as interactive translation prediction). Eins
possible IMT implementation uses SMT systems to produce target sentence hypotheses
that can be partially or completely accepted and amended by a human translator
(Barrachina et al. 2009). Each partially corrected text segment, or prefix, is then used
by the SMT system as additional information to achieve improved suggestions.
Figur 1 illustrates a typical IMT session. In interaction-0, the system suggests a
complete translation hypothesis, es, given the source sentence, f J
1. In interaction-1, Die
user moves the mouse to accept the prefix composed of the first eight characters To view
(das ist, the prefix of the sentence the user deems to be correct) and presses the a key
(k), producing the prefix, ep. Then the system suggests completing the sentence with list
of resources (a new es), given the accepted and correct prefix. Interactions-2 and -3 Sind
ähnlich. In the final interaction, the user accepts the current translation suggestion.
In the IMT scenario we have to find an extension es for a user prefix ep:2
ˆes = arg max
es
{P(es | f J
1, ep)}
(3)
If Bayes’ theorem is applied, we then obtain two distributions, P(es | ep) Und
P( f J
1 | ep, es), that are very similar to those obtained for conventional SMT, since epes ≡
eI
1. This allows us to use the same models if the search procedures are adequately
2 Notiz: in Abbildung 1, the prefix ep also includes the keys, k, that are pressed by the user.
124
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
C
Ö
l
ich
_
A
_
0
0
2
4
4
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Ortiz-Mart´ınez
Online Learning for Statistical Machine Translation
interaction-0
interaction-1
interaction-2
source(f J
1):
Referenz(ˆeI
1):
ep
es
ep
k
es
ep
k
es
ep
k
es
ep
interaction-3
acceptance
Para ver la lista de recursos
To view a listing of resources
To
To
view
view
To
view
To
view
To
view
Die
resources
list
A
A
A
A
list
list
list i
list i ng
listing
listing
von
resources
resources
Ö
o f
von
resources
resources
Figur 1
An example of an IMT session to translate a Spanish sentence into English.
modified. Konkret, the search is restricted to generate target sentences compatible
with the user prefix.3 Note that the statistical models are defined at the word level
whereas the IMT interface described in Figure 1 works at the character level. Das ist
not an important issue because the transformations that are required in the statistical
models for their use at character level are trivial. Konkret, the compatibility with the
user prefix is verified by comparing characters instead of words.
3. Online Learning for Statistical Machine Translation
In this section we describe the concept of online learning and its application to SMT.
3.1 Definition of Online Learning
Online learning algorithms proceed in a sequence of trials. Each trial can be decomposed
into four steps:
1.
2.
3.
4.
The learning algorithm receives an instance.
The learning algorithm predicts a label for the instance according to its
current parameters.
The true label of the instance is presented to the learning algorithm.
The learning algorithm uses the true label to update its parameters.
The system uses the true label to measure the prediction error incurred by the learner
and discarded afterwards. The ultimate goal of the online learning algorithm is to
minimize the cumulative prediction error along its run by modifying its parameters.
More formally, given any sequence of training samples x1, x2, … , an online learning
algorithm produces a sequence of parameters: Θ(0), Θ(1), Θ(2), … , such that the algorithm
parameters at trial t, Θ(T), depends only on the previous parameters, Θ(t−1), und das
current sample xt.
3 This also includes the application of the log-linear approach given by Equation (2).
125
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
C
Ö
l
ich
_
A
_
0
0
2
4
4
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Computerlinguistik
Volumen 42, Nummer 1
One important consequence of discarding the training samples after each trial is
that the computational complexity of processing a new sample does not depend on the
number of samples that has been previously seen. Das ist, the computational complexity
of processing a new sample is constant.
The online learning algorithms that discard each new training sample after up-
dating the learner are also referred to as incremental learning algorithms by some
authors (see Anthony and Biggs [1992]). Jedoch, this constraint can be relaxed by
using mini-batches (small sets of samples).
The online learning setting contrasts with the batch learning setting, in which all the
training patterns are presented to the learner before learning takes place and the learner
is no longer updated after the learning stage has concluded.
Batch learning algorithms are appropriate for their use in stationary environments.
In a stationary environment, all instances are drawn from the same underlying proba-
bility distribution. Im Gegensatz, because online learning algorithms continually receive
prediction feedback, they can be used in non-stationary environments.
The design of online learning algorithms raises issues not present in batch learning
settings. Three of them are identified in Giraud-Carrier (2000): (1) Chronology: the order
in which knowledge is acquired is an inherent aspect of online learning, (2) Learning
curve: the learner may start from scratch and gain knowledge from examples given one
at a time over time; as a result, it experiences a sort of learning curve, Und (3) Open-world
assumption: all the data relevant to the problem at hand is not available a priori.
Endlich, online learning is also related to another learning paradigm: active learn-
ing. In this paradigm the system queries the user to obtain the true labels of specific
instances, obtaining greater accuracy using less training data. Active learning can also
be applied in online settings, where the capability of the system to learn in an online or
incremental manner using techniques like those proposed here is crucial. Ein Beispiel
of this is the work presented in Gonz´alez-Rubio, Ortiz-Mart´ınez, and Casacuberta
(2012), where active learning techniques for IMT are proposed.
3.2 Implementing Online Learning
The key aspect to be considered when implementing online learning algorithms is how
to update the system parameters given the previous ones and the new training sample.
If the online learning algorithm is based on statistical models, then we need to maintain
a set of sufficient statistics for these models that can be incrementally updated. A
sufficient statistic for a statistical model is a statistic that captures all the information
that is relevant to estimate this model. If the estimation of the statistical model does not
require the use of the expectation–maximization (EM) Algorithmus (e.g. n-gram language
Modelle), then it is generally easy to incrementally extend the model given a new training
sample. Im Gegensatz, if the EM algorithm is required (z.B., word alignment models), Die
estimation procedure has to be modified, since conventional EM is designed for its use
in batch learning scenarios. To solve this problem, an incremental version of the EM
algorithm is required.
3.3 Predicting the Effectiveness of Online Learning in SMT Tasks
According to the work presented in Irvine et al. (2013), the presence of unknown source
words and known source words with unknown translations explains the majority of
the performance degradation when an SMT system is migrated to a new domain.
The use of online learning can mitigate these two problems, because the system is
126
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
C
Ö
l
ich
_
A
_
0
0
2
4
4
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Ortiz-Mart´ınez
Online Learning for Statistical Machine Translation
now able to efficiently learn translations for new or previously seen words. Jedoch,
the benefits will only be significant when the document to be translated presents a
high internal repetition rate, since this will allow the system to take advantage of the
newly acquired knowledge. This should not be seen as a limitation specific to online
learning. In batch learning scenarios the translation quality is strongly weakened if
the training corpus is not representative of the text to be translated. When we move
to an online setting, we still have the same requirement but now the training and
translation stages are no longer separated. This is why we speak about repetitiveness
instead of representativeness. In any case, sufficiently high repetition rates for test doc-
uments are common, according to the document-internal repetition property defined in
Church and Gale (1995).
Bertoldi, Cettolo, and Federico (2013) propose an automatic measure for assessing
the potential usefulness of online learning: the repetition rate (RR). In this section we
will slightly modify the definition of RR and propose two additional measures.
The RR measure looks at the rate of non-singleton n-grams contained in a given
Text. More specifically, the rates of non-singleton n-grams from n = 1 Zu 4 are calculated
and geometrically averaged, using a sliding window of 1, 000 words to make the rates
comparable across different sized corpora. Here we use a slightly modified version in
which the sliding window calculation is removed, because in real translation scenarios
the text to be translated is available beforehand and should be completely translated.
Daher, we define our modified RR (MRR) measure as follows:
MRR(ICH ) =
(cid:32) 4
(cid:89)
n=1
|In,1+| − |In,1|
|In,1+|
(cid:33)1/4
(4)
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
C
Ö
Wo | · | represents the length of a given set, In,1+ represents the set of different
n-grams contained in the in-domain corpus I, and In,1 represents the set of different
n-grams occurring only once in I.
The MRR measure does not take into account whether a specific n-gram is contained
or not in the out-of-domain corpus that has been used to estimate the SMT models.
According to Irvine et al. (2013), unseen events constitute a major cause of translation
errors when migrating an existing SMT system to a new domain. Daher, it is interesting
to restrict the calculation of the repetition rate to those n-grams that are not contained
in the out-of-domain corpus. We will refer to this measure as the restricted repetition
rate (RRR):
l
ich
_
A
_
0
0
2
4
4
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
RRR(ICH, Ö) =
(cid:32) 4
(cid:89)
n=1
|In,1+ − On,1+| − |In,1 − On,1+|
|In,1+ − On,1+|
(cid:33)1/4
(5)
where On,1+ represents the set of different n-grams contained in the out-of-domain
corpus O.
The RRR measure reflects how frequently unseen n-grams are repeated in the
corpus to be translated. Jedoch, such unseen n-grams constitute only a fraction of
the in-domain corpus. A high value of the RRR is not enough to predict good results
127
Computerlinguistik
Volumen 42, Nummer 1
if the fraction of unseen n-grams is very low. To capture this corpus property, we define
the unseen n-gram fraction (UNF) messen:
UNF(ICH, Ö) =
4
(cid:89)
n=1
(cid:80)
w∈(In,1+−On,1+ )
(cid:80)
w∈In,1+
cI (w)
cI (w)
1/4
(6)
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
C
Ö
l
ich
_
A
_
0
0
2
4
4
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
where cI (w) represents the count of n-gram w in corpus I.
Here we propose to predict the potential usefulness of online learning, paying
attention only to the RRR and the UNF measures. Trotzdem, MRR will be also
reported so as to compare the information provided by the three measures.
3.4 Statistical Phrase-Based Log-Linear Model for Online SMT
As stated in Section 2.1, log-linear models including phrase-based models as feature
functions constitute the state-of-the-art in statistical machine translation. In diesem Abschnitt
we will describe the components of our log-linear model for SMT. Later, in Section 3.5,
the update rules required to extend such components will be presented.
According to Equation (2), we introduce a set of seven feature functions in
our log-linear model: an n-gram language model (h1), a source sentence-length
Modell (h2), inverse and direct phrase-based models (h3 and h4, jeweils), a tar-
get phrase-length model (h5), a source phrase-length model (h6), and a distortion
(or phrase reordering) Modell (h7). All these feature functions, with the exception
of the one related to the direct phrase-based model (h4), can be obtained from a
proper decomposition of the distribution Pr(eI
1 ) (the decomposition is detailed in
Ortiz-Mart´ınez [2011]).
1| f J
This set of feature functions is similar to those incorporated into other state-of-the-
art SMT systems such as the Moses decoder (Koehn et al. 2007). The main difference of
our proposal with existing works is that we have paid special attention to the formal
justification of the features.
We next list the details for each feature function.
(cid:114)
n-gram target language model (h1)4
h1(eI
1) = log
(cid:33)
P(NEIN|ei−1
i−n+1)
(cid:32)I+1
(cid:89)
i=1
4 I is the length of eI
J
i ≡ ei… ej
1, e0 is the begin-of-sentence symbol, e|e|+1 is the end-of-sentence symbol, e
128
Ortiz-Mart´ınez
Online Learning for Statistical Machine Translation
where p(NEIN|ei−1
i−n+1) is defined as follows:
P(NEIN|ei−1
i−n+1) =
max{cX(NEIN
i−n+1) − Dn, 0}
cX(ei−1
i−n+1)
+
Dn
cX(ei−1
i−n+1)
N1+(ei−1
i−n+1•) · P(NEIN|ei−1
i−n+2)
(7)
cn,1+2cn,2
where Dn = cn,1
one and two counts respectively), N1+(ei−1
follows the history ei−1
i−n+1, and cX(NEIN
cX(·) can represent true counts cT(·) or modified counts cM(·).
is a fixed discount (cn,1 and cn,2 are the number of n-grams with
i−n+1•) is the number of unique words that
i−n+1, Wo
i−n+1) is the count of the n-gram ei
True counts are used for the higher order n-grams and modified counts for the
lower order n-grams. Given a certain n-gram, its modified count consists of the
number of different words that precede this n-gram in the training corpus.
Gleichung (7) corresponds to the probability given by an n-gram language model
with an interpolated version of the Kneser-Ney smoothing (Chen and Goodman
1996).
source sentence-length model (h2)
h2( f J
1, eI
1) = log(P( J | ICH)) = log(φI( J + 0.5) − φI( J − 0.5))
where φI(·) is the cumulative distribution function for the normal distribution (Die
cumulative distribution function is used to integrate the normal density function over
an interval of length 1). A specific target sentence length I will be assigned during
decoding time when a new empty hypothesis is created. After that, this hypothesis
will be extended in successive trials, but it will be constrained to have I words when
all the source words are covered. The sentence length model is introduced to avoid
the generation of too short or too long target sentences, which negatively impact
translation quality (other authors use models that simply penalize the number of
target words).
We use a specific normal distribution with mean µI and standard deviation σI for
each target sentence length I.
inverse and direct phrase-based models (h3, h4)
(cid:114)
(cid:114)
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
C
Ö
l
ich
_
A
_
0
0
2
4
4
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
h3( f J
1, eI
1, ˜aK
1 ) = log
(cid:19)
P( ˜fk|˜e˜ak )
(cid:18) K
(cid:89)
k=1
where p( ˜fk|˜e˜ak ) is defined as follows:
P( ˜fk|˜e˜ak ) = β · pphr( ˜fk|˜e˜ak ) + (1 − β) · phmm( ˜fk|˜e˜ak )
(8)
In Equation (8), pphr( ˜fk|˜e˜ak ) denotes the probability given by a statistical phrase-based
dictionary used in regular phrase-based models.
129
Computerlinguistik
Volumen 42, Nummer 1
phmm( ˜fk|˜e˜ak ) is the probability given by a hidden Markov model (HMM)-based (intra-
Phrase) alignment model (see Vogel, Ney, and Tillmann 1996):
phmm( ˜f |˜e) = (cid:15)
| ˜f |
(cid:89)
(cid:88)
j=1
| ˜f |
A
1
P( ˜fj|˜eaj ) · P(aj|aj−1, |˜e|)
(9)
The HMM-based alignment model probability is used here for smoothing purposes.
Analogously, h4 is defined as:
(cid:114)
(cid:114)
(cid:114)
h4( f J
1, eI
1, ˜aK
1 ) = log
(cid:19)
P(˜e˜ak
| ˜fk)
(cid:18) K
(cid:89)
k=1
target phrase-length model (h5)
h5( f J
1, eI
1, ˜aK
1 ) = log
(cid:19)
P(|˜ek|)
(cid:18) K
(cid:89)
k=1
where p(|˜ek|) = δ(1 − δ)|˜ek|.
h5 implements a target phrase-length model by means of a geometric distribution
with probability of success on each trial δ.
The use of a geometric distribution penalizes the length of target phrases.
source phrase-length model (h6)
h6( f J
1, eI
1, ˜aK
1 ) = log
P(| ˜fk| | |˜e˜ak
|)
(cid:19)
,
(cid:18) K
(cid:89)
k=1
where p(| ˜fk| | |˜e˜ak
|) = 1
absolute value function.
1+τ δ(1 − δ)abs(| ˜fk|−|˜e˜ak
|), τ = (cid:80)|˜e˜ak
i=1
|−1
δ(1 − δ)ich, and abs(·) ist der
A geometric distribution (with scaling factor
1
1+τ ) is used to model this feature (Es
penalizes the difference between the source and target phrase lengths).
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
C
Ö
l
ich
_
A
_
0
0
2
4
4
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
distortion model (h7)
h7(˜aK
1 ) = log
(cid:19)
P(˜ak|˜ak−1)
(cid:18) K
(cid:89)
k=1
where p(˜ak|˜ak−1) = 1
, b˜ak denotes the beginning position of the
source phrase covered by ˜ak and l˜ak−1 denotes the last position of the source phrase
covered by ˜ak−1.
2−δ δ(1 − δ)
abs(b˜ak
−l˜ak−1
)
1
2−δ ) is used to model this feature (Es
A geometric distribution (with scaling factor
penalizes the reorderings).
130
Ortiz-Mart´ınez
Online Learning for Statistical Machine Translation
3.5 Online Update Rules
After translating a source sentence f J
1) is available to feed the
SMT system. Um dies zu tun, a set of sufficient statistics that can be incrementally updated is
maintained for the statistical models that implement each feature function hr(·).
1, a new sentence pair ( f J
1, eI
In the following sections, we present the set of sufficient statistics for each model.
Regarding the weights of the log-linear combination, they are not modified because of
the presentation of a new sentence pair to the system. These weights can be adjusted
offline by means of a development corpus and well-known optimization techniques,
such as the Powell algorithm or the downhill simplex algorithm, which are commonly
used in a typical MERT procedure.
3.5.1 Language Model (h1). Feature function h1 implements a language model. Nach
to Equation (7), the following data are to be maintained: ck,1 and ck,2 given any order k,
N1+(·), and cX(·) (see Section 3.4 for the meaning of each symbol).
i−k+1 of eI
Given a new sentence eI
1, and for each k-gram ei
1, Wo 1 ≤ k ≤ n and
1 ≤ i ≤ I + 1, the set of sufficient statistics is modified, as is shown in Algorithm 1. Der
algorithm checks the changes in the counts of the k-grams to update the set of sufficient
Statistiken. For a given k-gram, NEIN
i−k+1, its true count and the corresponding normalizer
are updated at lines 13 Und 14, jeweils. The modified count of the (k − 1)-gram and
its normalizer are updated at lines 7 Und 8, jeweils, only when the k-gram ei
i−k+1
appears for the first time (condition checked at line 2). The value of the N1+(·) statistic
for ei−1
i−k+2 is updated at lines 10 Und 6, jeweils, only if the word ei has
been seen for the first time following these contexts. Endlich, sufficient statistics for Dk
i−k+1 and ei−1
Algorithm 1 Pseudocode for the update suff stats lm algorithm. This algorithm is
used to incrementally update the sufficient statistics of a language model with Kneser-
Ney smoothing. The meaning of the different symbols is explained in Section 3.5.1.
Eingang
: N (higher order), NEIN
S = {∀j(cj,1, cj,2), N1+(·), cX(·)} (current set of sufficient statistics)
i−k+1 (k-gram),
output : S (updated set of sufficient statistics)
1 begin
2
if cT(NEIN
i−k+1) = 0 Dann
if k − 1 ≥ 1 Dann
3
4
5
6
7
8
9
10
11
12
13
14
i−k+2) + 1)
updD(S,k-1,cM(NEIN
if cM(NEIN
i−k+2) = 0 Dann
i−k+2),cM(NEIN
N1+(ei−1
i−k+2) := N1+(ei−1
i−k+2) + 1
i−k+2) + 1
i−k+2) := cM(NEIN
i−k+2) := cM(ei−1
i−k+2) + 1
cM(NEIN
cM(ei−1
if k = n then
N1+(ei−1
i−k+1) := N1+(ei−1
i−k+1) + 1
if k = n then
updD(S,k,cT(NEIN
i−k+1):=cT(NEIN
i−k+1):=cT(ei−1
i−k+1),cT(NEIN
i−k+1) + 1
i−k+1) + 1
cT(NEIN
cT(ei−1
i−k+1) + 1)
131
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
C
Ö
l
ich
_
A
_
0
0
2
4
4
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Computerlinguistik
Volumen 42, Nummer 1
Algorithm 2 Pseudocode for the updD algorithm. This algorithm is used internally by
the update suff stats lm algorithm to update the value of the Dk statistic involved in
the generation of language model probabilities.
Eingang
: S (current set of sufficient statistics),k (Befehl), C (current count),
C(cid:48) (new count)
output : (ck,1, ck,2) (updated sufficient statistics)
1 begin
2
if c = 0 Dann
3
4
5
6
7
8
if c(cid:48) = 1 then ck,1 := ck,1 + 1
if c(cid:48) = 2 then ck,2 := ck,2 + 1
if c = 1 Dann
ck,1 := ck,1 − 1
if c(cid:48) = 2 then ck,2 := ck,2 + 1
if c = 2 then ck,2 := ck,2 − 1
are updated at lines 12 (for higher order n-grams) Und 4 (for lower order n-grams),
following the auxiliary procedure shown in Algorithm 2.
3.5.2 Source Sentence Length Model (h2). Feature function h2 implements a source sentence
length model. h2 requires the incremental calculation of the mean µI and the standard
deviation σI of the normal distribution associated with a target sentence length I. Für
this purpose, the procedure described in Knuth (1981) can be used. In this procedure,
two quantities are maintained for each normal distribution: µI and SI, where SI is an
auxiliary quantity from which the standard deviation can be obtained, as is explained
subsequently. Given the training sample ( f J
1, eI
1) at trial t, the two quantities are updated
according to the following equations:
ich
ich:ich(cid:54)=I = µ(t−1)
µ(T)
I = µ(t−1)
µ(T)
ich:ich(cid:54)=I = S(t−1)
S(T)
I = S(t−1)
S(T)
ICH
ICH
ich
+ ( J − µ(t−1)
ICH
)/C(ICH)
+ ( J − µ(t−1)
ICH
) · ( J − µ(T)
ICH )
(10)
(11)
(12)
(13)
and S(t−1)
ICH
where c(ICH) is the count of the number of sentences of length I that have been seen so far,
and µ(t−1)
is initialized to the source
ICH
sentence length of the first sample and S(0)
is initialized to zero). Endlich, the standard
deviation can be obtained from S(T)
are the quantities previously stored (µ(0)
I as follows: σ(T)
S(T)
ICH /(C(ICH)(T) − 1).
I =
(cid:113)
ICH
ICH
3.5.3 Inverse and Direct Phrase-Based Models (h3 and h4). Feature functions h3 and h4 imple-
ment inverse and direct phrase-based models, jeweils. These phrase-based models
are combined with HMM-based alignment models via linear interpolation. In this work
we have not studied how to incrementally update the weights of the interpolation.
Stattdessen, these weights can be estimated from a development corpus.
132
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
C
Ö
l
ich
_
A
_
0
0
2
4
4
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Ortiz-Mart´ınez
Online Learning for Statistical Machine Translation
Because phrase-based models are symmetric models, only an inverse phrase-based
model is maintained. The inverse phrase model probabilities, P( ˜f |˜e), are estimated from
phrase counts, C( ˜f ,˜e), as follows:
P( ˜f |˜e) =
C( ˜f , ˜e)
˜f (cid:48) C( ˜f (cid:48), ˜e)
(cid:80)
According to the previous equation, the set of sufficient statistics to be stored for
the inverse phrase model consists of a set of phrase counts, C( ˜f , ˜e).
When processing a new sentence pair ( f J
1), the standard phrase-based model
estimation method (see Zens, Und, and Ney [2002] and Koehn, Und, and Marcu [2003]
for a detailed explanation) uses a word alignment matrix, A, between f J
1 to extract
the set of phrase pairs that are consistent with the word alignment matrix: BP ( f J
1).
This consistency relation is formally defined as follows:
1 and eI
1, eI
1, eI
BP ( f J
1, eI
1, A) = {( f j+r
J
, ei+s
ich
|∀(ich(cid:48), J(cid:48)) ∈ A : j ≤ j(cid:48) ≤ j + r ⇐⇒ i ≤ i(cid:48) ≤ i + S}
(14)
Somit, the set of consistent phrase pairs is constituted by those bilingual phrases where
all the words within the source phrase are only aligned to the words of the target phrase
und umgekehrt.
Given the training pair ( f J
1, eI
1) at trial t, and after obtaining the set of consistent
phrase pairs, BP ( f J
1, eI
1, A), the phrase counts are updated as follows:
C( ˜f , ˜e)(T) = c( ˜f , ˜e)(t−1) + C( ˜f , ˜e | BP ( f J
1, eI
1, A))
(15)
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
C
Ö
l
ich
_
A
_
0
0
2
4
4
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
where c( ˜f , ˜e)(T) is the current count of the phrase pair ( ˜f , ˜e), C( ˜f , ˜e)(t−1) is the previous
1, A)) is the count of ( ˜f , ˜e) in BP ( f J
zählen, und C( ˜f , ˜e | BP ( f J
1, A).
1, eI
1, eI
After updating the phrase counts, we need to efficiently compute the phrase trans-
lation probabilities. For this purpose, we maintain in memory both the current phrase
counts and their normalizers.
One problem to be solved when updating the phrase model parameters is the need
to generate word alignment matrices. To solve this problem, we use the direct and
inverse HMM-based alignment models that are included in the formulation of the IMT
System. Konkret, these models are used to obtain word alignments in both translation
directions. The resulting direct and inverse word alignment matrices are combined by
means of the symmetrization alignment operation (Oh und Ney 2003) before extracting
the set of consistent phrase pairs.
In order to obtain an IMT system able to robustly learn from user feedback, Wir
also need to incrementally update the HMM-based alignment models. Im Folgenden
section we show how to efficiently incorporate new knowledge into these models.
3.5.4 Inverse and Direct HMM-Based Alignment Models (h3 and h4). HMM-based alignment
models play a crucial role in log-linear components h3 and h4 because they are used
to smooth phrase-based models and to generate word alignment matrices. HMM-
based alignment models were chosen here because, according to Och and Ney (2003)
and Toutanova, Ilhan, and Manning (2002), they outperform IBM 1 to IBM 4 Ausrichtung
133
Computerlinguistik
Volumen 42, Nummer 1
models while still allowing the exact calculation of the likelihood. Jedoch, our pro-
posal is not restricted to the use of HMM-based alignment models.
The standard estimation procedure for HMM-based alignment models is carried out
by means of the EM algorithm. Jedoch, the standard EM algorithm is not appropriate
to incrementally extend our HMM-based alignment models because it is designed to
work in batch training scenarios. To solve this problem, the incremental view of the EM
Algorithmus (Neal and Hinton 1998) can be applied.
Model Definition. HMM-based alignment models are a class of single-word align-
ment models. Single-word alignment models are based on the concept of alignment
between word positions of the source and the target sentences f J
1. Konkret, Die
alignment is defined as a function a : {1 · · · J} → {0 · · · I}, where aj = i if the jth source
position is aligned with the ith target position. Zusätzlich, aj = 0 notes that the word
position j of f J
1 (or that it has been aligned
with the null word e0). Letting A( f J
1) be the set of all possible alignments between eI
1
and f J
1 has not been aligned with any word position eI
1, eI
1) in terms of the alignment variable as follows:
1, we formulate Pr( f J
1 and eI
1|eI
Pr( f J
1|eI
1) =
(cid:88)
1∈A( f J
aJ
1,eI
1 )
Pr( f J
1, aJ
1|eI
1)
(16)
Under a generative point of view, Pr( f J
1, aJ
1|eI
1) can be decomposed without loss of
generality as follows:
Pr( f J
1, aJ
1|eI
1) = Pr( J|eI
1) ·
J
(cid:89)
j=1
Pr( fj| f j−1
1
, aj
1, eI
1) · Pr(aj| f j−1
1
, aj−1
1
, eI
1)
(17)
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
C
Ö
l
ich
_
A
_
0
0
2
4
4
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
HMM-based alignment models are very similar to IBM models (Brown et al. 1993),
specifically, they only differ in the assumptions made over the alignment probabilities.
HMM-based alignment models use a first-order alignment model p(aj|aj−1, ICH) to approx-
imate the distribution Pr(aj| f j−1
, eI
1) and a word-to-word lexical model p(fj|äh ) Zu
approximate the distribution Pr( fj| f j−1
, aj
1), resulting in the expression
, aj−1
1
1, eI
1
1
P( f J
1, aJ
1|eI
1, Θ) =
J
(cid:89)
j=1
P( fj|äh ) · P(aj|aj−1, ICH)
where we assume that a0 is equal to zero and
Θ =
(cid:26) P( F |e) ∀ f ∈ F and e ∈ E
P(ich|ich(cid:48), ICH) 1 ≤ i ≤ I, 0 ≤ i(cid:48) ≤ I and ∀ I
is the set of hidden parameters.
(18)
(19)
Incremental EM Algorithm. The incremental EM algorithm was introduced by Neal
and Hinton (1998) in batch learning settings. In such settings, the set of training
samples is known before the training process takes place (see Section 3.1). Here we will
134
Ortiz-Mart´ınez
Online Learning for Statistical Machine Translation
instantiate the incremental EM algorithm for a batch learning translation task with a
given set of training pairs, {(f1, e1), …, (fm, em), …, (fM, eM)}. After that, we will present
the application of the algorithm in an online learning setting.
It can be demonstrated that s(F, e, A) = (cid:80)
m sm(fm, em, Bin) constitutes a vector of suf-
ficient statistics for the model parameters, where sm(fm, em, Bin) is the vector of sufficient
statistics for data item m:
sm(fm, em, Bin) =
(cid:26) C( F |e; fm, em, Bin) ∀ f ∈ F and e ∈ E
C(ich|ich(cid:48), ICH; fm, em, Bin) 1 ≤ i ≤ I, 0 ≤ i(cid:48) ≤ I and ∀ I
(20)
with c( F |e; fm, em, Bin) being the number of times that the word e is aligned to the word
f for the sentence pair (fm, em); und C(ich|ich(cid:48), ICH; fm, em, Bin) being the number of times that
the alignment i has been seen after the previous alignment i(cid:48), given a source sentence
composed of I words for the sentence pair (fm, em).
To implement the E step of the incremental EM algorithm, we need to obtain the
expected value at trial t of the sufficient statistics, given the probability distribution
of the hidden alignment variable: ˜s(T)
M , where counts are replaced by expected counts,
C( F |e; fm, em, Bin)(T) und C(ich|ich(cid:48), ICH; fm, em, Bin)(T). If data item m is chosen to update the model
at trial t, then the E step requires the following operations:
ich
for i (cid:54)= m
Set ˜s(T)
Set ˜s(T)
Set ˜s(T) = ˜s(t−1) − ˜s(t−1)
i = s(t−1)
m = {C( F |e; fm, em, Bin)(T) , C(ich|ich(cid:48), ICH; fm, em, Bin)(T)}
M + ˜s(T)
M
(21)
Regarding the M step, we have to obtain the set of parameters that maximizes the
likelihood of the complete data given the expected values of the sufficient statistics,
obtaining the following update equations:
P(F |e)(T) =
P(ich|ich(cid:48), ICH)(T) =
M(cid:80)
m=1
M(cid:80)
m=1
(cid:80)
F (cid:48)∈F
M(cid:80)
m=1
M(cid:80)
m=1
ICH(cid:80)
ich(cid:48)(cid:48)=1
C( F |e; fm, em, Bin)(T)
C(F (cid:48)|e; fm, em, Bin)(T)
C(ich|ich(cid:48), ICH; fm, em, Bin)(T)
C(ich(cid:48)(cid:48)|ich(cid:48), ICH; fm, em, Bin)(T)
(22)
(23)
In the previous equations, the numerator values constitute the cumulative sufficient
statistics ˜s(T) = (cid:80)
m ˜s(T)
m for the model parameters.
Application to an Online Setting. The previous instantiation of the incremental EM
algorithm works in a batch learning setting where the set of training samples is given
a priori. Im Gegensatz, in the online learning paradigm the training samples are not
available a priori but become available over time—specifically, one at a time. Gegeben
135
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
C
Ö
l
ich
_
A
_
0
0
2
4
4
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Computerlinguistik
Volumen 42, Nummer 1
the training sample ( f J
sufficient statistics, ˜s (T), is given by the following expression:
1) at trial t, the incremental update equation for the cumulative
1, eI
˜s(T) = ˜s(t−1) + ˜s(T)
T
(24)
where ˜s(T)
T
represents the sufficient statistics for sample at trial t.
M
It should be noted that now we no longer need to subtract the previous sufficient
statistics for individual samples at trial t − 1: ˜s(t−1)
, as it is requested in the batch
learning case (compare the previous equation with Equation (21)), since the training
samples are discarded after being processed. This implies that only one training epoch
is executed over the training samples, and also that the previous sufficient statistics
for individual samples, ˜s(t−1)
, should not be stored, allowing us to save memory in a
substantial manner (the memory requirements may become prohibitive for large sample
sets). The execution of only one training epoch can be seen as a limitation with respect
to batch training, but empirical results in Section 4.4 show that the online update rule
is competitive with the batch update rule due to the faster convergence of incremental
EM. Zusätzlich, this online update rule can be easily modified to execute multiple
epochs while storing a reduced quantity of sufficient statistics for the last samples, als es
is explained in Appendix A.
M
Andererseits, it is also worth mentioning that the sufficient statistics for a
given sentence pair are nonzero for a small fraction of its components. Infolge, Die
time required to update the parameters of the HMM-based alignment model depends
only on the number of nonzero components.
After updating the sufficient statistics, ˜s(T), we need to efficiently compute the model
Parameter. For this purpose, the normalizer factors for ˜s(T) are also maintained.
The parameters of the direct HMM-based alignment model are estimated analo-
gously to those of the inverse model.
3.5.5 Source Phrase Length, Target Phrase Length, and Distortion Models (h5, h6, and h7). Der
δ parameters of the geometric distributions associated with the feature functions h5, h6,
and h7 are left fixed. Because of this, there are no sufficient statistics to store for these
feature functions.
4. Experimental Results
This section describes the experiments that we carried out to test our proposed online
learning techniques. Our experiments were focused on the PE and IMT scenarios,
because they fit nicely into the online learning paradigm.
Our experiments use the log-linear SMT model with online learning capabilities
described in Section 3.4. The IMT experiments reported here combine this log-linear
SMT model with stochastic error correction models following the technique introduced
in Ortiz-Mart´ınez (2011). This technique uses word graphs to avoid retranslating the
source sentence at each interaction of the IMT process. The incremental language and
phrase-based models involved in the interactive translation process were generated and
accessed by means of the open source Thot toolkit (Ortiz, Garc´ıa-Varea, and Casacuberta
2005; Ortiz-Mart´ınez and Casacuberta 2014). The specific functionality used in this
136
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
C
Ö
l
ich
_
A
_
0
0
2
4
4
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Ortiz-Mart´ınez
Online Learning for Statistical Machine Translation
Tisch 1
XRCE corpus statistics for three different language pairs.
Training
Dev
Test
Sentences
Running words
Vocabulary
Sentences
Running words
Sentences
Running words
MRR
RRR
UNF
English
Spanish
English
French
English
Deutsch
55,761
52,844
49,376
571,960
25,627
657,172
29,565
542,762
24,958
573,170
27,399
506,877
24,899
440,682
37,338
1,012
12,111
13,808
9,480
1,125
994
984
9,801
9,162
8,283
964
996
7,634
9,358
9,572
9,805
10,792
9,823
19.8
13.4
18.0
24.4
14.2
14.9
25.9
16.9
18.8
26.9
17.3
16.5
26.9
15.5
13.7
22.8
14.8
24.6
experimentation has been made freely available in a new version of toolkit.5 We did
not consider the use of the Moses decoder (Koehn et al. 2007) in our experiments
because it is not prepared to work in the IMT framework and it does not implement the
incremental version of the EM algorithm (it implements the stepwise version, welches ist
unstable in online learning settings, according to Blain, Schwenk, and Senellart [2012]).
Jedoch, translation quality results reported in Ortiz-Mart´ınez (2011) show that Thot
is competitive with Moses for corpora of different complexities.
4.1 Corpora
The experiments were performed using the XRCE, the Europarl, and the EMEA cor-
pora. The XRCE corpus (SchlumbergerSema S.A. et al. 2001) consists of translations
of XRCE printer manuals from English to three different languages—namely, Spanish,
French, und Deutsch. Tisch 1 shows the main figures of the XRCE corpora for training,
Entwicklung, and test partitions. The XRCE corpus is included here because it has
been extensively used in the literature to report SMT and IMT results (a complete set
of experiments with this corpus is shown in Barrachina et al. [2009]). This feature will
allow us to compare the results of our proposed system with those obtained by state-of-
the-art systems.
Tisch 1 also shows three different measures to predict the effectiveness of online
learning for this particular translation task, including the modified repetition rate
(MRR), the restricted repetition rate (RRR), and the unseen n-gram fraction (UNF) (sehen
Abschnitt 3.3). In this work we propose to use only RRR and UNF measures to assess the
usefulness of online learning. Jedoch, MRR will also be reported so as to give a better
idea of the accuracy of these two measures. The three XRCE test sets present moderately
high values for the three measures. Slight drops of the RRR measure with respect to the
MRR measure are observed (das ist, there are repeated n-grams in the test corpus that
were already seen in the training set), suggesting that the repetition rate present in the
test sets cannot be fully exploited by online learning.
The Europarl corpus (Koehn 2005) is extracted from the proceedings of the Euro-
pean Parliament, which are written in the different languages of the European Union.
5 It can be downloaded at https://github.com/daormar/thot.
137
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
C
Ö
l
ich
_
A
_
0
0
2
4
4
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Computerlinguistik
Volumen 42, Nummer 1
Tisch 2
Europarl corpus statistics for three different language pairs.
Training
Dev
Test
Sentences
Running words
Vocabulary
Sentences
Running words
Sentences
Running words
MRR
RRR
UNF
English
Spanish
English
French
English
Deutsch
1,547,596
1,525,315
1,601,936
33,125 K
96,741
34,116 K
146,288
31,923 K
95,232
34,571 K
114,417
36,185 K
101,113
34,070 K
318,475
3,003
3,003
3,003
72,988
78,888
72,988
81,800
72,988
72,603
3,000
3,000
3,000
64,809
70,562
64,809
73,664
64,809
63,411
10.7
4.3
20.7
11.4
4.4
19.6
10.7
4.3
20.8
12.5
4.8
18.2
10.7
4.2
20.0
8.0
3.1
26.0
In our experiments we used the version created for the shared task of the ACL 2013
Workshop on Statistical Machine Translation (Bojar et al. 2013). To simplify the experi-
gen, all those sentences whose length in words was greater than 40 were removed
from the training set. Regarding the language pairs under consideration, wieder, Wir
will translate from the English language to Spanish, French, und Deutsch. Tisch 2
shows the main figures of training, Entwicklung, and test sets. The Europarl corpus
constitutes one good example of a complex, real-world translation task that is also
very well known in the MT scientific community. Regarding the measures to predict
the effectiveness of online learning, it should be noted that the MRR measure is much
lower than that observed for the XRCE corpora (siehe Tabelle 1). Darüber hinaus, a significant
drop in the RRR measure with respect to the MRR is observed, indicating that the
vast majority of the repeated n-grams in the test corpus has already been seen in the
training corpus. daher, we expect a limited effectiveness of online learning for this
Aufgabe.
Endlich, we also carried out experiments with the EMEA corpus. The EMEA corpus
consists of documents from the European Medicines Agency, made available with the
OPUS corpora collection (Tiedemann 2009). In this work we extracted specific test
sets of 3,000 sentences from the whole set of parallel sentences. Before doing this, Wir
first removed the duplicate sentence pairs contained in this corpus (they represent a
very high percentage of the total number of sentence pairs). Tisch 3 contains some
statistics of the resulting corpora. The main interest of the EMEA corpus in our proposed
experimentation is that it constitutes an example of an in-domain translation task. Der
models of the SMT system can be estimated from the out-of-domain Europarl corpus
and then used to translate the EMEA corpus, simulating a non-stationary translation
Aufgabe. As it can be seen in Table 3, MRR, RRR, and UNF measures clearly suggest the
potential usefulness of online learning in this task (RRR and UNF were calculated using
the Europarl training corpus as the out-of-domain corpus).
4.2 Assessment Criteria
We evaluated our SMT system with online learning using three evaluation measures:
WER, BLEU, and KSMR. System performance was assessed by comparing the system
138
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
C
Ö
l
ich
_
A
_
0
0
2
4
4
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Ortiz-Mart´ınez
Online Learning for Statistical Machine Translation
Tisch 3
Statistics for a subset of the EMEA corpus selected for testing purposes. Figures are shown for
three different language pairs. RRR and UNF measures have been calculated using the Europarl
training set as the out-of-domain corpus.
English
Spanish
English
French
English German
Test
Sentences
Running words
Vocabulary
MRR
RRR
UNF
3,000
3,000
3,000
46,311
4,716
50,410
5,329
45,260
4,659
53,787
5,135
46,319
4,772
43,887
5,809
34.2
30.3
43.5
33.5
29.3
39.9
34.5
30.8
44.0
34.7
31.1
45.5
34.6
30.6
43.7
29.1
25.3
46.5
translations with the corresponding target language references of the test set. WER and
BLEU measures are intended for its use in the evaluation of the PE scenario:
(cid:114)
(cid:114)
Word Error Rate (WER): the system hypothesis is compared to the
reference translation by computing the minimum number of edit distance
Operationen (substitutions, insertions and deletions) between the hypothesis
and the reference translation, divided by the number of reference words.
Bilingual evaluation understudy (BLEU): The BLEU score (Papineni et al.
2001) computes the geometric mean of the precision of n-grams of various
lengths between a hypothesis and a set of reference translations multiplied
by a factor that penalizes short sentences.
Because we want to evaluate the performance of our proposed SMT system in an
IMT scenario, we need to estimate the effort required by the user to produce correct
translations using the system. Zu diesem Zweck, we use the target references to simulate the
translations that the user has in mind. The first translation hypothesis for each given
source sentence is compared with a single reference translation and the longest common
character prefix (LCCP) is obtained. The first non-matching character is replaced by the
corresponding reference character and then a new system translation is produced. Das
process is iterated until a full match with the reference is obtained. Each computation
of the LCCP would correspond to the user looking for the next error and moving the
pointer to the corresponding position of the translation hypothesis. We refer to a pointer
movement as a mouse-action. Andererseits, each character replacement would
correspond to a keystroke of the user. If the first non-matching character is the first
character of the new system hypothesis in a given interaction, no LCCP computation is
erforderlich; das ist, no pointer movement would be made by the user. Bearing this in mind,
we define the following IMT evaluation measure:
(cid:114)
Keystroke and mouse-action ratio (KSMR): KSMR (Barrachina et al. 2009)
is the number of keystrokes plus the number of mouse-actions divided by
the total number of reference characters.
It is worthy of note that KSMR assumes that both keystrokes and mouse-actions
require the same effort from the user. This constitutes an approximation, since these
two actions are different and require different types of effort (Macklovitch 2006).
139
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
C
Ö
l
ich
_
A
_
0
0
2
4
4
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Computerlinguistik
Volumen 42, Nummer 1
In addition to the WER, BLEU, and KSMR measures, in our experimentation we
additionally report the learning time in seconds after each training sample presentation.
The learning time is important to assess the ability of the learning algorithms to work
in a real time scenario. All the experiments were executed on a Windows PC with a 2.00
Ghz Intel Xeon processor with 1GB of memory.
4.3 Experimentation Protocol
We evaluated our techniques by simulating real users. Because the different corpora
used in the experiments contained source and target translations, we used the latter to
simulate the reference translations that the user has in mind for each source sentence.
This paper studies the application of the online learning paradigm to SMT; daher
the experimentation follows the learning process structured as a sequence of trials that
was described in Section 3.1. During this sequence of trials, online learning systems
experience some sort of learning curve as they gain knowledge after each training
sample presentation. Given that such a learning curve is an important issue when
designing online learning algorithms, some of the results reported here include plots
with the evolution of cumulative error measures.
It is also interesting to clarify the way in which the different corpora described
in Section 4.1 have been used throughout the experimentation. One factor that has
influenced the decisions in this regard is the high computational cost of batch retraining.
Batch retraining is present in different experiments reported in this work because it
provides a valuable reference when assessing the performance of online learning. In
such experiments, we have defined specific subsets of the training corpora in order to
speed up the experiments. More specifically, der Erste 10,000 sentences of the XRCE and
Europarl corpora have been used.
The decisions regarding the use of the different corpora in the experiments can be
summarized as follows. The training sets of the XRCE and Europarl corpora were used
to measure the convergence properties of the incremental EM algorithm (Abschnitt 4.4).
The above-mentioned subset of the training corpora was used to study the impact of
the update frequency in the results (Abschnitt 4.5), to compare the performance of batch
and online learning (Abschnitt 4.6), and to analyze the influence of sentence ordering in the
system performance (Abschnitt 4.7). Endlich, in the experiments to test the capability of our
online learning techniques to learn from previously estimated models (Abschnitt 4.8), Wir
used the training and development sets of the XRCE and Europarl corpora to initialize
the system models, and the test sets to obtain translation results. For the system trained
with the Europarl corpus, the experimentation is complemented with translation results
using the in-domain EMEA corpus.
4.4 EM Algorithm Convergence Experiments
The standard estimation procedure for current phrase-based models relies on the gen-
eration of word alignment matrices. As it was explained in Section 3.5, in our proposal
such alignment matrices are generated by means of HMM-based word alignment mod-
els that are incrementally updated from user feedback. For this purpose, we need to
replace the batch EM algorithm by the incremental EM algorithm. Given the great
importance of generating word alignments in the estimation of phrase-based models
(see Section 3.5.3), we carried out experiments to compare the convergence rates of
batch and incremental EM algorithms for HMM-based word alignment models.
140
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
C
Ö
l
ich
_
A
_
0
0
2
4
4
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Ortiz-Mart´ınez
Online Learning for Statistical Machine Translation
Figur 2
EM convergence experiment comparing the normalized log-likelihood obtained when executing
five training epochs of the batch and incremental versions of the EM algorithm. The experiments
were executed on the XRCE and Europarl English–Spanish training corpora.
Figur 2 shows the normalized log-likelihood that is obtained when executing up
to five training epochs6 of the batch and incremental versions of the EM algorithm
(common training schemes in state-of-the-art SMT systems frequently execute five EM
training epochs to train the different word-alignment models). Plots were obtained
for the XRCE and Europarl training corpora and the three translation directions (aus
English to Spanish, French, und Deutsch). Jedoch, in the figure only the XRCE English
to Spanish (Figure 2a) and the Europarl English to Spanish (Figure 2b) results are
reported7 (very similar results were obtained for the other language pairs).
According to the results presented in Figure 2, the incremental EM algorithm is
able to obtain a greater normalized log-likelihood than that obtained by the batch EM
algorithm for the two corpora under consideration. In addition to this, such a greater
log-likelihood can be obtained with fewer EM training epochs. These observed results
are due to the fact that the incremental EM algorithm executes complete E and M steps
for each training sample, resulting in a much greater rate of model updates per each
training epoch (Neal and Hinton 1998).
Beachten Sie, dass, according to Equation (24), only one training epoch of the incremental
EM algorithm is performed when training HMM-based alignment models (d.h., jede
training sample is processed only once by the learning algorithm and discarded after-
Wächter). This contrasts with the conventional batch training scheme, in which a few
training epochs (typically five) are executed. Somit, to fairly compare batch learning
with our proposed online learning strategy, we should observe the relationship between
the normalized log-likelihood of the incremental EM algorithm at the first training
epoch and that of the fifth training epoch of the batch algorithm. According to the
values shown in Figures 2a and 2b, we can appreciate a very small degradation in the
log-likelihood (<1% for the Europarl corpus) or no degradation at all.
Because the difference in the log-likelihood between batch and incremental EM
algorithms is negligible, we consider that the update rule for HMM-based alignment
models given by Equation (24) is able to obtain word-alignment models comparable
to those that can be obtained using batch learning. This claim will be supported with
additional empirical evidence in Section 4.6.
6 An epoch is a single presentation of all samples in the training set.
7 To speed up the experiments, we took the first 100,000 sentences of the Europarl training corpus.
141
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
c
o
l
i
_
a
_
0
0
2
4
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Computational Linguistics
Volume 42, Number 1
Nevertheless, it is possible to take advantage of the better convergence properties
of the incremental EM algorithm by slightly modifying the conditions imposed by the
online learning framework adopted in this paper (see Section 3.1). In such a framework,
only the last sample presented so far to the learning algorithm can be used to modify
the model parameters at each trial. This constraint can be slightly relaxed, allowing us to
define alternative update rules for the HMM alignment models that execute more than
one EM algorithm iteration over each sample. One example of such alternative update
rules is described in Appendix A. Empirical results also given in the same appendix
show that the obtained log-likelihood and the evaluation measures can be marginally
improved with respect to the strict observation of the online learning framework.
Finally, it is worth pointing out that according to the results presented in Figure 2,
incremental EM could be suitable to replace batch EM in a batch-learning scenario.
However, one disadvantage of applying incremental EM to a batch-learning task is the
necessity of storing the sufficient statistics for the whole data set: s1, s2, ..., sM. For large
data sets, the sufficient statistics may not fit in memory. Nevertheless, this information
can be stored on disk and accessed efficiently, because the algorithm reads the data in a
sequential manner. By contrast, this disadvantage is totally removed when incremental
EM is applied in an online learning scenario, since the sufficient statistics for each
training sample are discarded at the end of each trial, or after a finite number of trials
for the alternative update rule described in Appendix A.
4.5 Impact of Update Frequency
One important aspect to be clarified when designing PE or IMT systems is the influence
of the system update frequency on the obtained performance. It is expected that updat-
ing the system in a sentence-wise manner will produce the best results. However, this
updating strategy poses efficiency problems because of the necessity of executing model
updates in real time. This problem can be alleviated by defining an alternative update
strategy in which the training process is delayed until a certain number of samples have
been gathered. Delaying model updates may cause performance degradation, but it also
constitutes one way to reduce the strong time requirements of a sentence-wise updating
strategy. Specifically, if the time between updates is sufficiently high, the use of batch
learning techniques could be appropriate (e.g., the training process can be executed
overnight), removing the necessity of implementing online learning.
To test the impact of update frequency on system performance, we carried out
PE and IMT experiments using the XRCE and the Europarl corpora in the three dif-
ferent language pairs (from English to Spanish, French, and German). In the experi-
ments, the first 10,000 sentences of the different training sets were translated, using
cumulative WER and KSMR to measure the user effort in the PE and IMT scenarios,
respectively. The system was initialized with empty models, and after that such mod-
els were extended from the user validated translations using three different update
frequencies: every 10, 100, and 1,000 sentences. Five training epochs were executed
in all cases. We did not consider sentence-by-sentence updating because of the huge
computational cost of the retraining. Model updates were performed by means of
conventional batch-learning techniques, that is, the whole set of training samples seen
so far is batch-retrained whenever the model is updated. Additionally, we adopted
default values for the weights of the log-linear model. The results of the experiments are
shown in Figure 3. Again, we only report results for the English–Spanish XRCE corpus
(Figures 3a and 3b) and for the English–Spanish Europarl corpus (Figures 3c and 3d).
Very similar results were obtained for the other language pairs.
142
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
c
o
l
i
_
a
_
0
0
2
4
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Ortiz-Mart´ınez
Online Learning for Statistical Machine Translation
Figure 3
Impact of update frequency when translating the first 10,000 sentences of the English–Spanish
language pair of the XRCE and Europarl corpora. A conventional SMT system executed five
batch-training epochs every 10, 100, and 1,000 sentences. The system was initialized with empty
models and default values for the weights of the log-linear model. Plots show the evolution of
the user effort in PE and IMT scenarios measured in terms of cumulative WER and KSMR,
respectively.
As it can be observed in Figure 3, the user effort in terms of WER and KSMR was
lower when the update frequency was increased. More specifically, batch retraining
every 10 samples (Batch10) consistently outperformed the rest of the systems in all cases
and retraining every 100 samples (Batch100) was also consistently better than retraining
every 1,000 sentences (Batch1000). Sharper curves were obtained when translating the
XRCE corpora, probably reflecting that in this corpus, there are groups of sentences with
highly different translation difficulties from the system point of view.
Note that, in all cases, the initial WER and KSMR measures are not equal to 100%.
This is because of the fact that the system copies to the output all those unknown words
contained in the input. In some cases such copied words (names, dates, etc.) are correct
words contained in the reference translations.
Additionally, it is illustrative to consider the time cost of batch retraining for each
system. Figure 4 shows the time cost in seconds of batch retrainings when we increase
the number of training samples presented to the system. Results are shown for an
update frequency equal to 1,000 when translating the XRCE (XRCE Batch1000) and the
Europarl corpora (Europarl Batch1000). Higher update frequencies produced exactly
the same results but with a higher number of points in the plots. As it was expected, the
training times increase linearly with the number of training samples presented to the
system. Time costs were higher for the more complex Europarl corpus. After processing
143
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
c
o
l
i
_
a
_
0
0
2
4
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Computational Linguistics
Volume 42, Number 1
Figure 4
Batch retraining time in seconds as a function of the number of samples presented to the system.
Results are shown for the first 10,000 sentences of the XRCE and Europarl training sets. Systems
were retrained every 1,000 sentences executing five epochs. Time costs are given in seconds.
10,000 sentences, batch retraining took 19 minutes for the XRCE corpus and 45 minutes
for the Europarl corpora. This gives a clear idea of the infeasibility of batch retraining in
a sentence-wise updating strategy. Moreover, time costs of batch retraining soon become
unaffordable because of their linear growth with the number of translated sentences.
Turning back to the question made at the beginning of this section, experimental
results clearly show that it is not possible to obtain the performance of a sentence-
wise update strategy if the update frequency is decreased. This constitutes a strong
argument in favor of the application of our proposed online learning techniques, which
are specifically designed to learn from individual training samples. By contrast, batch
learning requires the execution of expensive retraining processes whenever a new sam-
ple is presented to the learner. These findings are further supported in the next section,
where the performance of batch and online learning systems are compared.
4.6 Batch versus Online Learning, Learning from Scratch
The great impact of frequent updates in the system performance demonstrated in the
previous section poses the question of the necessity of replacing conventional batch
learning techniques by online learning techniques. EM convergence experiments pro-
vided in Section 4.4 showed that the log-likelihood of HMM-based word alignment
models using the incremental version of the EM algorithm is competitive with that
obtained by using the conventional version. However, it is still unclear if the use of
online learning will cause a degradation in the quality of the translations with respect
to the use of batch learning.
Figure 5 shows the experiments we carried out to demonstrate the effectiveness
of online learning. For this purpose, we compared the performance of a batch system
executing five training epochs every 10 sentences (Batch10) with that of an online
system (Online). Plots show the evolution of the user effort required to obtain correct
translations. This effort is measured in terms of cumulative WER and KSMR for the
PE and IMT scenarios, respectively. Initial models were empty in all cases. We report
the results obtained when translating the first 10,000 sentences of the English–Spanish
XRCE (Figures 5a and 5b) and Europarl training corpora (Figures 5c and 5d). Very
similar results were obtained for English–French and English–German language pairs.
As it can be seen in Figure 5, the performance of online learning is slightly better
than that of batch learning for both corpora and for the two scenarios under con-
sideration: PE and IMT. We think that the reason for this slight improvement is the
144
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
c
o
l
i
_
a
_
0
0
2
4
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Ortiz-Mart´ınez
Online Learning for Statistical Machine Translation
Figure 5
Comparison between batch and online learning when translating the first 10,000 sentences of the
English–Spanish language pair of the XRCE and Europarl corpora. The batch learning system
executed five training epochs every 10 sentences. The system was initialized with empty models
and default values for the log-linear weights. Plots show the evolution of the user effort in the
PE and IMT scenarios measured in terms of cumulative WER and KSMR, respectively.
higher update frequency of online learning, since the SMT models are extended for each
individual training pair. It should be noted that the shape of the curves obtained with
online learning is very similar to that of batch learning. This implies that incremental
EM presents a stable behavior, which contrasts with the instability of the stepwise EM
algorithm reported in Blain, Schwenk, and Senellart (2012). Finally, it is also worthy of
note that the results also show that the system is able to learn from scratch.
It is also illustrative to carry out a descriptive analysis of the learning times per
training sample that were obtained during this experiment. Figure 6 shows a boxplot
summarizing the main statistics of the learning times for the XRCE and Europarl
corpora. As it can be seen, the boxplot clearly show the small time cost of the learning
process for the two different corpora under consideration. Specifically, the learning time
was never greater than 1 second, and the median times were 0.03 and 0.16 seconds for
the XRCE and the Europarl corpora, respectively. The learning time was greater for
the Europarl corpus because of the greater length in words of the sentence pairs with
respect to that of the XRCE corpus.
4.7 Ordering Effects
The order in which knowledge is acquired is an important issue in online learning tasks
(see Section 3.1). When the label of a new sample is presented to the learning algorithm,
145
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
c
o
l
i
_
a
_
0
0
2
4
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Computational Linguistics
Volume 42, Number 1
Figure 6
Boxplots of the time cost per each individual training sample required to train the first 10,000
sentences of the XRCE and Europarl training corpora using our proposed online learning
techniques. Times are measured in seconds.
its parameters are modified to minimize cumulative prediction error. Hopefully, this
modification will allow the system to provide more accurate predictions for similar
samples. However, modifying parameters may also produce lateral effects. A lateral
effect can cause the system to generate a wrong prediction for a given sample because of
undesired changes in the learning algorithm parameters. One possible way to minimize
the number of lateral effects is by processing similar samples in consecutive trials.
Figure 7 shows the experiments we executed to test the influence of corpus or-
dering in both WER and KSMR results when translating the first 10,000 sentences of
the English–Spanish XRCE and Europarl training corpora by means of an online SMT
system. For both tasks we translated the original portion of the training corpus and the
same portion after being randomly shuffled.
As it can be seen in Figure 7, the obtained results were generally better for the
original corpora than for the shuffled ones. The reason for the improved results is due to
the fact that, in the original corpora, similar sentences appear more or less contiguously
(because of the organization of the contents of the printer manuals for the XRCE corpus
or to the chronological order of the parliamentary sessions for the Europarl corpus). This
circumstance increases the accuracy of online learning, since with the original corpora
the number of lateral effects occurred between the translation of similar sentences is
decreased. By contrast, the accuracy was worse for shuffled corpora. Shuffling causes
similar sentences to no longer appear contiguously and thus, the number of lateral
effects that may occur between the translation of similar sentences is increased.
The differences between the results of original and shuffled corpora were much
greater for the XRCE corpus than for the Europarl corpus. One possible explanation
for this phenomenon is the lower repetition rate of the latter corpus. For low repetition
rates, the number of lateral effects between the translation of similar sentences will be
lower, since such sentences appear in a small number.
4.8 Learning from Previously Estimated Models
In the previous sections, we have shown empirical results where the models used
by the SMT system were initially empty. In this section we show experiments in an
alternative learning scenario where the SMT systems learn from previously estimated
models. Under these circumstances, we compared the performance of a conventional
SMT system with that of an online SMT system. More specifically, the conventional
146
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
c
o
l
i
_
a
_
0
0
2
4
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Ortiz-Mart´ınez
Online Learning for Statistical Machine Translation
Figure 7
Impact of sentence ordering when translating the first 10,000 sentences of the English–Spanish
language pair of the XRCE and Europarl corpora. Such sentences were presented in their original
order or randomly shuffled. The system was initialized with empty models and default values
for the weights of the log-linear model. The different plots show the evolution of the user effort
in the PE and IMT scenarios measured in terms of cumulative WER and KSMR, respectively.
SMT system is a system that is not able to take advantage of user feedback after each
translation, whereas the online SMT system uses the new sentence pairs provided by
the user to revise the statistical models. Both systems used log-linear models trained
in batch mode by means of the XRCE or the Europarl training corpora (five training
epochs were executed). The weights of the log-linear model were adjusted for the
corresponding development corpora via MERT.
4.8.1 XRCE Experiments. Table 4 shows the obtained results when translating the XRCE
test corpora from English to Spanish, French, and German using conventional (batch
learning without retraining) and online SMT systems. The table shows the BLEU, WER,
and KSMR measures for both systems (95% confidence intervals are shown in all cases).
The table also shows the average online learning time (LT) for each new sample pre-
sented to the system. All the improvements obtained with the online SMT system for the
different measures were statistically significant. Greater improvements were obtained
when translating to French and German. Such improvements could not be accurately
predicted by means of the MRR measure (see Table 1), because, for instance, English to
German presented a lower MRR value than English to Spanish. However, RRR and UNF
measures were lower for English to Spanish, demonstrating the utility of such measures
147
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
c
o
l
i
_
a
_
0
0
2
4
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Computational Linguistics
Volume 42, Number 1
Table 4
BLEU, WER, and KSMR results for the XRCE test corpora using conventional (batch learning
without retraining) and online SMT systems. Both systems used MERT to adjust log-linear
weights. The average online learning time (LT) in seconds is shown for the online system.
Corpus
SMT system
BLEU
WER
KSMR
LT (sec)
Eng–Spa
conventional
online
58.3±2.4
64.0±2.4
32.5±1.9
28.1±1.7
19.3±1.2
16.6±1.1
Eng–Fre
conventional
online
32.2±2.2
43.7±2.3
63.7±2.2
48.4±2.2
36.7±1.2
30.3±1.2
Eng–Ger
conventional
online
20.5±1.9
28.9±2.1
72.4±1.9
61.7±2.1
42.9±1.1
37.0±1.3
-
0.06
-
0.09
-
0.07
to obtain refined predictions of the impact of online learning in the results. The average
learning times allow the system to be used in a real-time scenario.
Additionally, in Table 5 we show a comparison of the KSMR results obtained by
our proposed online SMT system, with those obtained by different state-of-the-art IMT
systems described in the literature. These IMT systems are based on different trans-
lation approaches, including the alignment templates (AT), the stochastic finite-state
transducer (SFST), and the phrase-based (PB) approaches to IMT (see Barrachina et al.
[2009] for more details). AT and SFST systems follow the word graph-based approach
to generate the IMT suffixes, whereas the PB system retranslates the source sentence
at each interaction of the IMT process. Experiments reported in Barrachina et al. (2009)
showed that word graph–based systems are much faster than systems that retranslate
the source sentence at each interaction, but obtain slightly worse results. Because quick
response times are critical in an IMT scenario, the majority of the IMT systems reported
in the literature, as well as the one proposed here, follow a word graph–based imple-
mentation strategy. Our system significantly outperformed the results obtained by the
state-of-the-art systems, except those of the PB system for English to Spanish. Even in
this case, our system obtained slightly better results.
4.8.2 Europarl Experiments. Table 6 shows the translation results from English to Spanish,
French, and German for the Europarl corpus when using conventional and online SMT
systems. Again, BLEU, WER, and KSMR measures for conventional and online SMT
Table 5
KSMR results of the comparison of our system with online learning and three different
state-of-the-art IMT conventional systems. The experiments were executed on the XRCE corpora.
Best results are shown in bold.
Corpus
AT
PB
SFST
Online
Eng–Spa
Eng–Fre
Eng–Ger
23.2±1.3
40.4±1.4
44.7±1.2
16.7±1.2
35.8±1.3
40.1±1.2
21.8±1.4
43.8±1.6
45.7±1.4
16.6±1.1
30.3±1.2
37.0±1.2
148
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
c
o
l
i
_
a
_
0
0
2
4
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Ortiz-Mart´ınez
Online Learning for Statistical Machine Translation
Table 6
BLEU, WER, and KSMR results for the Europarl test corpora using conventional (batch learning
without retraining) and online SMT systems. Log-linear weights were adjusted via MERT. The
average online learning time (LT) in seconds is shown for the online system.
Corpus
SMT system
BLEU
WER
KSMR
LT (sec)
Eng-Spa
conventional
online
21.0±0.5
22.5±0.6
65.4±0.7
63.4±0.7
45.9±0.4
44.7±0.4
Eng-Fre
conventional
online
21.2±0.6
22.6±0.6
64.4±0.7
63.2±0.7
44.4±0.5
43.2±0.5
Eng-Ger
conventional
online
13.1±0.5
14.1±0.5
73.8±0.7
72.8±0.7
49.2±0.4
48.0±0.4
-
0.2
-
0.2
-
0.2
systems are shown. The table also reports the average LT for the system with online
learning.
As seen in Table 6, online learning allowed us to obtain around one point of
improvement in the three measures under consideration with respect to the conven-
tional system (without retraining). However, the improvements were not statistically
significant in some cases (WER for English to French, BLEU and WER for English to
German). These smaller improvements with respect to those observed for the XRCE
task could be predicted from the lower repetition rates that the Europarl corpus present
(see Table 2), especially for the RRR measure, which reflects how frequently unseen n-
grams are repeated in the corpus to be translated.
The average online learning time was greater than that for the XRCE corpus shown
in Table 4. Despite this, it was small enough for its use in a real-time scenario.
4.8.3 EMEA Experiments. Table 7 shows the obtained results measured in terms of BLEU,
WER, and KSMR when translating the EMEA test corpora from English to Spanish,
French, and German. Conventional and online SMT systems with models estimated
from the Europarl training corpora were used. In this experimentation, we also consid-
ered a third SMT system that used online learning from scratch. The average online LT
for each new sample presented to the system is also reported.
As seen in Table 7, the online SMT system, whether it learned from scratch or from
previously estimated models, significantly outperformed the results obtained by the
conventional SMT system (batch learning without retraining) for the three evaluation
measures. The magnitude of the improvements were greater than that observed for
the XRCE and Europarl corpora, as could be predicted from the MRR, RRR, and UNF
measures provided in Table 3. Note that, although the values of the MRR measure of
EMEA were similar to that obtained for the XRCE corpus, the improvements were
greater because of the higher number of unseen events (explained by the RRR) and
their more-frequent presence in the EMEA corpora (explained by means of the UNF).
The online learning system using previously estimated models consistently produced
improvements of more than 10 points for the three evaluation measures and for each
of the language pairs. The improvements were smaller for the online learning system
from scratch. Despite this, it is worth noting that for this task, it would be better to use
a system with online learning, even if its models were initially empty, than to use a
conventional SMT system with models trained on out-of-domain corpus.
149
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
c
o
l
i
_
a
_
0
0
2
4
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Computational Linguistics
Volume 42, Number 1
Table 7
BLEU, WER, and KSMR for the EMEA test corpora using conventional (batch learning without
retraining), online, and online from scratch SMT systems, with models estimated by means of
the Europarl training corpora. Log-linear weights were trained via MERT. The average online
learning time (LT) in seconds is shown for the online system.
Corpus
SMT system
BLEU
WER
KSMR
LT (sec)
Eng-Spa
Eng-Fre
Eng-Ger
conventional
online
online scratch
conventional
online
online scratch
conventional
online
online scratch
24.0±0.8
45.6±1.2
36.0±1.2
18.3±0.7
39.7±1.3
30.0±0.2
17.7±0.8
35.7±1.3
27.4±1.2
64.0±0.9
49.6±1.4
52.2±1.0
69.4±0.9
53.9±1.3
59.5±1.1
77.8±1.3
61.8±1.5
65.5±1.2
49.1±0.6
32.3±0.8
35.6±0.8
52.3±0.6
33.7±0.8
41.2±0.8
55.3±0.6
37.2±0.8
43.8±0.9
-
0.1
0.1
-
0.1
0.1
-
0.1
0.1
Regarding the average learning times per each new training pair, again, they were
small enough to allow the use of online learning in real-time scenarios.
5. Discussion
The set of experiments presented in the previous section validates the use of incremental
EM to design online learning algorithms for SMT. One common criticism of incremental
EM is its great memory requirements due to the necessity of storing the set of sufficient
statistics for each individual sample. However, this criticism was initially made in the
context of batch learning (see Liang and Klein 2009), and there were no studies on its
application to online learning tasks, with the exception of the work presented in Ortiz-
Mart´ınez, Garc´ıa-Varea, and Casacuberta (2010). Here we have proposed two update
rules that present constant (and very small) memory requirements while maintaining
the same or even better performance than batch retraining. The first proposed update
rule (see Equation (24)) allows us to execute one epoch over the training samples
by storing the sufficient statistics for the last one. This significantly differs from the
memory requirements of incremental EM applied in a batch learning context, where
the sufficient statistics for all of the samples seen so far need to be stored (for a more
detailed explanation, see Equation (24) and the subsequent discussion). The second
update rule (see Appendix A) is a generalization of the first one, executing several
epochs over the training samples by storing the sufficient statistics of only a fixed
quantity of the last samples.
Furthermore, incremental EM presents convergence rates very similar to that of
batch retraining, resulting in a very stable learning algorithm whose behavior contrasts
with the stability problems reported in other works for stepwise EM (Blain, Schwenk,
and Senellart 2012).
The reported experiments also measured the impact of frequent updates in the PE
and IMT scenarios. In both cases, system performance was greatly improved when the
update frequency was increased. This constitutes a strong argument in favor of using
online learning, because of the prohibitive time cost of frequent batch retrainings.
150
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
c
o
l
i
_
a
_
0
0
2
4
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Ortiz-Mart´ınez
Online Learning for Statistical Machine Translation
The experimentation also showed a clear manifestation of one inherent feature of
online learning: the ordering effects of training samples in system performance. We
compared the results that are obtained when the sentences to be translated were chrono-
logically ordered with those obtained with a randomly ordered corpus. The benefits of
online learning are favored by the former situation, since similar sentences appear more
or less contiguously. This would be an example of the document internal repetition
phenomenon mentioned by some authors. Moreover, this is the expected situation in
real translation scenarios.
The online learning techniques proposed in this article allows us to achieve the
goal of learning from scratch in an efficient manner. Learning from scratch is not only
a theoretical scenario that can be proposed in an SMT research context, but a technique
with potential utility in real domain adaptation tasks. It is generally acknowledged
that in-domain corpora are difficult to obtain, and, as a result, SMT system models
are initialized by means of out-of-domain texts (Irvine et al. 2013). Empirical results
presented here show that online learning from scratch can produce significantly better
results than a conventional SMT system with models estimated from out-of-domain
corpora. This would not be the only possible application of online learning in this
scenario. Indeed, online learning from scratch has already been applied to build au-
tomatic post-editing systems designed to work in situations where in-domain corpora
are not available (Lagarda et al. 2015). Another possibility would be to linearly combine
a model trained from out-of-domain data with an initially empty and separate online
learning model. This online learning model could be useful both for learning new
translations as well as for giving preference to the in-domain data. For example, Mirking
et al. (2013) demonstrate the utility of a similar system, but implemented with batch
retraining.
Finally, the presented results also demonstrate that our proposed implementation
of online learning is able to learn from previously existing models. We compared the ob-
tained BLEU, WER, and KSMR measures of a conventional SMT system (batch learning
without retraining) with that of an online system. The improvements were significant in
almost all cases, and very strong for specific tasks and language pairs (their magnitude
was greater than 10 points for the different measures under consideration).
6. Related Work
Online learning has been a main topic of research in the field of machine learning.
However, in the SMT framework, the vast majority of the work has been devoted to the
study of the batch-learning setting. The application of online learning to SMT has been
mostly centered on estimating the feature weights of a log-linear model by means of
discriminative training techniques. Examples of this kind of work can be found in Och
and Ney (2002), Liang et al. (2006), Watanabe et al. (2007), Chiang, Marton, and Resnik
(2008) and Mart´ınez-G ´omez, Sanchis-Trilles, and Casacuberta (2012). These works differ
from the one presented here in that we apply online learning techniques to train the
features of the log-linear model instead of their weights.
To our knowledge, the first work on online learning for SMT focused on training
model features is Cesa-Bianchi, Reverberi, and Szedmak (2008). That paper presents a
very constrained version of online learning applied to a CAT scenario, where the trans-
lation model cannot be extended because of the high computational cost of retraining
the whole model for a new training pair. The literature on online learning for SMT
has tried to solve or alleviate this problem in different ways, as it is discussed in the
following sections, where we identify four different approaches.
151
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
c
o
l
i
_
a
_
0
0
2
4
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Computational Linguistics
Volume 42, Number 1
6.1 Online Learning Constrained by Previously Existing Models
The first approach accounts for work that relies on previously estimated models as
a measure to avoid full model retraining. An early attempt can be found in Nepveu
et al. (2004), where dynamic adaptation of an IMT system via cache-based model exten-
sions to language and translation models is proposed. That work constitutes a domain
adaptation technique and not an online learning technique, since the proposed cache
components require pre-existent models estimated in batch mode. As pointed out by the
authors, one of the most important limitations of their proposal is the inability to process
words that were not seen during the estimation of the pre-existent models. In addition to
this, their IMT system does not use state-of-the-art models. The work presented in Hardt
and Elming (2010) applies a similar strategy to a modern phrase-based SMT system,
using heuristic IBM4-based word alignment techniques to add new phrase pairs to a
local phrase table. Their technique shares similar limitations with the work presented
in Nepveu et al. (2004), since it requires pre-existent models estimated in batch mode.
In addition to this, according to the empirical results that are reported, their proposal
is slow (average learning times per sentence of up to 1 minute) and unable to obtain
the same results as conventional batch retraining because of the heuristic decisions
that are made to incrementally train the phrase models. Bertoldi, Cettolo, and Federico
(2013) present an online learning technique based on cache components, which is also
strongly inspired by the work by Nepveu et al. (2004). In spite of the fact that their
proposed technique is able to extend the translation model by obtaining alignments
at the phrase level, the phrase model that is used to generate such alignments is not
updated from user feedback. Again, this makes the system dependent on previously
existent models estimated in batch mode and presumably less reliable when learning
from new sentence pairs that contain poorly represented or unseen events during
the initial training stage. Finally, W¨aeschle et al. (2013) present a work very similar
to Bertoldi, Cettolo, and Federico (2013) that also includes discriminative training
methods.
Mathur, Cettolo, and Federico (2013) present an alternative approach in which
a new log-linear component for the set of phrase pairs contained in the translation
table is used. This component is updated so as to increase or decrease the score
of specific phrase pairs depending on whether they are present or not in the user
validated translations. The main difference between that work and other techniques
mentioned earlier is that it does not tackle the problem of adding new phrase pairs
to the translation model. Instead, it is deliberately restricted to adapt previously ex-
isting parameters. Another alternative approach is presented in Denkowski, Dyer,
and Lavie (2014), where a translation grammar, a language model, and a set of log-
linear weights are adapted in a post-editing task. In that case, pre-existent word
alignment models estimated in batch mode are required to extend the translation
grammar.
6.2 Online Learning Based on Output to Reference Alignments
Other works try to avoid retraining the phrase model by aligning the output of the de-
coder with the reference given by the user (Blain, Schwenk, and Senellart 2012; Simard
and Foster 2013). Specifically, Blain, Schwenk, and Senellart (2012) propose obtaining
word alignments between the source and reference sentences using the system output
as pivot. Such alignments are obtained by combining the word alignments from source
to system output and from system output to reference. Because the output and the
152
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
c
o
l
i
_
a
_
0
0
2
4
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Ortiz-Mart´ınez
Online Learning for Statistical Machine Translation
reference sentences are (hopefully) very similar, they can be aligned using edit-distance–
based algorithms, which do not require being trained. Regarding the work of Simard
and Foster (2013), it conceives the translation as a two stage process, first the source is
translated by a regular SMT system, and after that, the output is translated again using
an automatic post-editing module. This module is implemented as a phrase translation
system trained from the system translations and their references. To extract the phrase
pairs, word level alignments based on edit distance are used.
The main drawback of the approaches based on alignments between the out-
put and the reference sentences is the strong assumption regarding the similarity of
such sentences. The similarity may be low when the source sentence contains poorly
represented or unseen events during the initial training process, a situation that is
very common in real translation tasks. A low similarity could negatively affect the
resulting word alignments because of the great simplicity of edit-distance algorithms.
As a consequence, the obtained performance will be worse than that obtained by means
of retraining, as is reported in Blain, Schwenk, and Senellart (2012). The authors of
that paper claim that the only alternative to retraining is the application of techniques
based on the so-called stepwise EM algorithm (Capp´e and Moulines 2009), which is
unstable when used in an online learning setting and has a lower performance. By
contrast, in this paper we empirically demonstrate that the incremental version of the
EM algorithm (Neal and Hinton 1998) can also be applied and does not have such
disadvantages.
6.3 Quick Adaptation Based on Retraining
Mirking and Cancedda (2013) propose techniques to achieve quick model updates using
batch retraining. Such techniques are based on maintaining different translation tables
for out- and in-domain data. Separated in-domain models allow the user to quickly
update the models via batch retraining and to give preference to the in-domain table
via tuning of log-linear weights. Unfortunately, such quick model updates cannot be
executed in a sentence-wise manner but in longer lapses of time (e.g., a day). Moreover,
the proposed configurations have the disadvantage of becoming slower over time.
Mirking and Cancedda state the great interest of replacing batch retraining by prin-
cipled incremental training, but they also point out that such a technology is not still
mature or available.
6.4 Pure Online Learning Techniques
Finally, there is another category of techniques that tackle the training problem fol-
lowing a pure online learning approach, removing the constraints on model updates
imposed by the techniques mentioned previously. Levenberg, Callison-Burch, and
Osborne (2010) introduced stream-based adaptation for SMT. This technique is able to
incrementally learn from scratch or from previously estimated models. However, their
approach only captures a restricted notion of online learning, because it is designed to
process large amounts of incoming data. Indeed, their training regime uses the stepwise
EM algorithm, which works by processing large blocks of training data and not in a
sentence-wise manner.
On the other hand, the work presented in Ortiz-Mart´ınez, Garc´ıa-Varea, and
Casacuberta (2010) introduces a state-of-the-art log-linear SMT model using a set of
incremental update rules for the feature functions. The resulting system is able to learn
from scratch or from previously estimated models, and it is applied in an IMT scenario.
153
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
c
o
l
i
_
a
_
0
0
2
4
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Computational Linguistics
Volume 42, Number 1
One key element of the proposal is the use of the incremental EM algorithm. As far as we
know, such work constitutes the first proposal that successfully applies online learning
to SMT, solving the technical limitations encountered in previous works. However,
some important aspects of the proposal were not clarified, such as its performance in
a PE scenario or its effectiveness when compared with batch retraining.
Here, we extend the work by Ortiz-Mart´ınez, Garc´ıa-Varea, and Casacuberta (2010)
both theoretically and empirically. Regarding the theoretical aspects of online learning
in SMT, we provide a much more detailed explanation of the different online update
rules. Additionally, we introduce an alternative update rule using the incremental EM
algorithm, which solves one important limitation of the update rule that was originally
introduced in Ortiz-Mart´ınez, Garc´ıa-Varea, and Casacuberta (2010). Specifically, the
new update rule allows us to execute more than one training epoch over the incoming
data while maintaining constant time and spatial complexity (see Sections 3 and 4.4
as well as Appendix A for more details). In addition to this, we have integrated the
log-linear model proposed in this paper into the IMT technique based on stochastic
error correction models described in Ortiz-Mart´ınez (2011). This IMT technique is able
to generate the IMT suffixes in real time by calculating a word graph for the given
source sentence at the initial interaction of the IMT process. By contrast, the technique
implemented in Ortiz-Mart´ınez, Garc´ıa-Varea, and Casacuberta (2010) is much slower,
since the source sentence is retranslated at each interaction. Finally, we extend the
work of Bertoldi, Cettolo, and Federico (2013) on measuring the effectiveness of online
learning by proposing two new automatic measures.
The empirical study on online learning in SMT has been extended here in many
ways. First, the proposed online learning techniques have been applied to an ad-
ditional CAT scenario (specifically, the PE scenario). Additionally, the experimenta-
tion has been extended to larger corpora. Finally, we clarify some crucial aspects of
online learning that were not studied previously, such as the convergence proper-
ties of the incremental EM algorithm, the impact of the frequency of model updates
in the error measures, or the differences in performance between batch and online
learning.
7. Conclusions and Future Work
We have proposed online learning techniques for SMT. Such techniques allow us to
incrementally extend the statistical models involved in the translation process in an
efficient manner. Our proposal breaks technical limitations encountered in other works,
as is explained in Section 6.
To test our techniques, we carried out experiments in two different CAT scenarios,
namely, PE and IMT, since they fit nicely into the online learning paradigm. However,
it is important to stress that the applicability of the techniques proposed here is not
restricted to such strict online learning scenarios. Another suitable scenario is the one
described by Levenberg et al. (2010), where the incoming data are processed in large
blocks instead of in a sentence-wise manner.
The experiments were executed in three different corpora, namely, the XRCE cor-
pus, which has been extensively used in the literature to report CAT results, the well-
known Europarl corpus, and the EMEA corpus, which is used in this work to simulate
a non-stationary translation task. As summarized in Section 5, the results demonstrate
the appropriateness of the incremental EM algorithm to implement online learning,
the great impact of frequent model updates in translation quality, the importance of
154
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
c
o
l
i
_
a
_
0
0
2
4
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Ortiz-Mart´ınez
Online Learning for Statistical Machine Translation
ordering effects in online learning tasks, and the ability of our system to learn both
from scratch or from previously estimated models.
We have also defined two new measures to predict the effectiveness of online learn-
ing in SMT tasks—the restricted repetition rate (RRR) and the unseen n-gram fraction
(UNF). It has been empirically demonstrated that such measures allow us to refine the
predictions that can be made with the modified repetition rate (MRR) measure, which
is based on the repetition rate introduced in Bertoldi, Cettolo, and Federico (2013).
Additionally, it is important to stress that the SMT system with online learning
capabilities used in this paper is implemented in the freely available Thot toolkit.
In this work we have focused on the incremental estimation of the parameters
of the log-linear model features instead of their weights. We consider that weight
adjustment is not a crucial aspect in our proposal because it typically affects a few
model parameters, whereas estimation of feature parameters (including those of phrase
and language models) may involve millions of parameters. In addition to this, weight
adjustment can be performed offline, since typical MERT procedures use closed de-
velopment corpora of a few thousand sentences (i.e., the training data is bounded and
relatively small). In spite of that, we think that removing any offline training stages from
practical online SMT system implementations could be useful to simplify their usage
and design. Additionally, for future work we plan to incorporate bounds to the data
structures used to store the model parameters, because incoming data is in principle
unbounded.
Finally, we think that the online learning techniques proposed here can be exported
to other natural language processing applications, where the system output is super-
vised by the user. In addition to this, our proposed techniques can also be useful to
implement active learning algorithms for their use in online settings.
Appendix A: Alternative Update Rule for HMM-Based Alignment Models
In this appendix we introduce an alternative update rule for HMM-based alignment
models. This rule allows us to execute more than one training epoch over the in-
coming data in contrast to the rule given by Equation (24). For this purpose, at
each trial we no longer keep only the last training sample but a certain number of
them.
If we want to execute E training epochs over the data, one possible way to im-
plement the update rule is to keep the last E samples at a given trial, executing one
incremental EM iteration for each sample. In spite of the fact that this update rule
allows us to execute E epochs over the training data, it is still very different from
the way in which conventional batch training works. Specifically, batch training only
starts the next training epoch after having processed the whole training corpus in the
previous epoch. This requisite cannot be met in an online learning setting, since the
set of training samples is unbounded. Nevertheless, it is possible to obtain a training
scheme that is more similar to that of batch training. For this purpose, we can store
the last R samples instead of the last E, with R (cid:29) E, processing a total of E samples at
each trial. To increase the lapse of time (or the number of trials) until a given sample
is reprocessed, the samples are processed in an interlaced way, as it is depicted in
Figure A.1. Specifically, the figure shows which samples are processed by means of the
incremental EM algorithm at t = 10 when we want to execute E = 3 epochs, keeping
the last R = 5 samples. Under these circumstances, samples 6, 8, and 10 would be
processed.
155
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
c
o
l
i
_
a
_
0
0
2
4
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Computational Linguistics
Volume 42, Number 1
Figure A.1
Interlaced training scheme example. The squares represent the training samples that appear
when the value of t is increased. Gray squares show the training samples that will be processed
at trial t = 10 when E = 3 epochs are to be executed while storing the last R = 5 samples.
More formally, at a given trial t, if E epochs are to be executed and we choose to
keep the last R samples, then the proposed update rule is as follows:
˜s(t) = ˜s(t−1) −
E
(cid:88)
e=1
˜s(t−1)
(cid:98)t−(e−1)·(R/E)(cid:99) +
E
(cid:88)
e=1
˜s(t)
(cid:98)t−(e−1)·(R/E)(cid:99)
(A.1)
where (cid:98)·(cid:99) is the floor operator.
We will refer to this new rule as the interlaced update rule. Note that the basic
update rule given by Equation (24) is a particular case of the interlaced rule where E
and R are set to 1. In order to measure the performance of interlaced training, we briefly
report here the results of some experiments that we carried out for the XRCE and the
Europarl corpora. Because we obtained very similar results for the different language
pairs involved in the experimentation, namely, from English to Spanish, French, and
German, we only report here the English to Spanish results.
One important measure to be evaluated when assessing the performance of the in-
terlaced update rule is the convergence rate of the EM algorithm. In this case, we cannot
show plots of normalized log-likelihood per training epoch as we did in Section 4.4,
because interlaced updates execute the different training epochs simultaneously for a
given training set. Instead, we report the final normalized log-likelihood that is obtained
Table A.1
Normalized log-likelihood of the EM algorithm for a different number of training epochs and
different implementations of the update rule for the translation model, including batch,
incremental, and interlaced versions. Results executed on the XRCE and Europarl
English–Spanish training corpora are shown.
Norm. Log-likelihood
#Epochs XRCE
Europarl
Batch EM
Incr. EM
Incr. EM (Interlaced, R=10)
Incr. EM (Interlaced, R=100)
Incr. EM (Interlaced, R=1000)
5
1
5
5
5
-41.8
-41.8
-40.3
-40.0
-39.6
-90.9
-91.6
-87.1
-85.6
-84.5
156
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
c
o
l
i
_
a
_
0
0
2
4
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
12345678910R=5E=3t=10
Ortiz-Mart´ınez
Online Learning for Statistical Machine Translation
Figure A.2
Comparison between online and interlaced update rules when translating the first 10,000
sentences of the English–Spanish language pair of the XRCE and Europarl corpora. The
interlaced system was executed for different values of the R parameter (R = 10, 100, or 1,000).
The system was initialized with empty models and default values for the weights of the
log-linear model. The different plots show the evolution of the user effort in the PE and IMT
scenarios measured in terms of cumulative WER and KSMR, respectively.
at the end of the estimation process. Table A.1 shows the normalized log-likelihood
of the EM algorithm with batch, incremental, and interlaced (with R equal to 10, 100,
and 1, 000) updates, using the English–Spanish training set of the XRCE and Europarl
corpora.8 As we can see, interlaced training outperformed batch and incremental EM
algorithms for all values of the R parameter. In addition to this, increasing the value of
R produced improvements in the final normalized log-likelihood.
We also compared the performance in terms of WER and KSMR of the interlaced
update rule with that of batch and online rules. Figure A.2 shows plots with the
evolution of WER and KSMR for online and interlaced (with R equal to 10, 100, and
1, 000) update rules when translating the first 10, 000 sentences of the English–Spanish
language pair of the XRCE and Europarl training corpora. As we can see, the value
of the R parameter had a strong influence in the system performance. Specifically,
the interlaced system with R = 10 clearly underperformed the results obtained by
the online system; increasing the value to R = 100 obtained almost identical results;
and R = 1,000 produced slightly better results. We think that this phenomenon is
linked to the propensity of HMM-based alignment models for overfitting (see Och
and Ney 2003 for more details). Under this point of view, the worse results for R = 10
8 To speed up the experiments, we took the first 100,000 sentences of the Europarl training corpus.
157
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
c
o
l
i
_
a
_
0
0
2
4
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Computational Linguistics
Volume 42, Number 1
Figure A.3
Boxplots of the learning time per sentence required to train the first 10,000 training samples of
the XRCE and Europarl corpora using online and interlaced systems (for R equal to 10, 100, and
1,000). Times are measured in seconds.
would be due to some sort of local overfitting, which is alleviated for greater values
of R.
Finally, we also studied the impact of the interlaced training scheme in the learning
time per sentence. Figure A.3 shows boxplots for the learning time per sentence that was
required to train the first 10, 000 training samples of the XRCE and Europarl English
to Spanish corpora using online and interlaced systems for different values of the R
parameter. The online system executed only one training epoch, whereas the interlaced
systems executed five. All the times are reported in seconds. As we can see, interlaced
updates increased the training time with respect to that of basic online updates (this was
the expected outcome since five samples were processed at each trial instead of one).
However, the time costs of interlaced updates were still affordable for both corpora
(worst case times of a few seconds, median times less than 1 second for the different
values of R).
Acknowledgments
The author wishes to thank Francisco
Casacuberta and Ismael Gar´ıa Varea for their
insightful comments on the article, always
greatly appreciated. Work supported by the
European Union 7th Framework Programme
(FP7/2007-2013) under the CasMaCat project
(grant agreement no 287576) and by the
Generalitat Valenciana under grant
ALMAMATER (PROMETEOII/2014/030).
References
Alabau, V., C. Buck, M. Carl, F. Casacuberta,
M. Garc´ıa-Mart´ınez, U. Germann, J.
Gonz´alez-Rubio, R. Hill, P. Koehn, L. A.
Leiva, B. Mesa-Lao, D. Ortiz-Mart´ınez,
H. Saint-Amand, G. Sanchis-Trilles,
and C. Tsoukala. 2014. Casmacat: A
computer-assisted translation workbench.
In 14th Annual Meeting of the European
Association for Computational Linguistics:
System Demonstrations, pages 25–28,
Gothenburg.
Anthony, M. and N. Biggs. 1992.
Computational Learning Theory: An
Introduction. Cambridge University Press,
New York.
Barrachina, S., O. Bender, F. Casacuberta,
J. Civera, E. Cubel, S. Khadivi, A. L.
Lagarda, H. Ney, J. Tom´as, E. Vidal, and
J. M. Vilar. 2009. Statistical approaches to
computer-assisted translation.
Computational Linguistics, 35(1):3–28.
Bertoldi, N., M. Cettolo, and M. Federico.
2013. Cache-based online adaptation for
machine translation enhanced computer
assisted translation. In Proceedings of the
XIV Machine Translation Summit,
pages 35–42, Nice.
Blain, F., H. Schwenk, and J. Senellart. 2012.
Incremental adaptation using translation
information and post-editing analysis. In
International Workshop on Spoken Language
Translation, pages 234–241, Hong-Kong.
Bojar, O., C. Buck, C. Callison-Burch,
C. Federmann, B. Haddow, P. Koehn,
C. Monz, M. Post, R. Soricut, and L. Specia.
2013. Findings of the 2013 Workshop on
158
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
c
o
l
i
_
a
_
0
0
2
4
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Ortiz-Mart´ınez
Online Learning for Statistical Machine Translation
Statistical Machine Translation. In
Proceedings of the Eighth Workshop on
Statistical Machine Translation, pages 1–44,
Sofia.
Brown, P. F., S. A. Della Pietra, V. J. Della
Pietra, and R. L. Mercer. 1993. The
mathematics of statistical machine
translation: Parameter estimation.
Computational Linguistics, 19(2):263–311.
Capp´e, O. and E. Moulines. 2009. On-line
expectation-maximization algorithm for
latent data models. Journal of the Royal
Statistical Society Series. B 71(1):593–613.
Cesa-Bianchi, N., G. Reverberi, and
S. Szedmak. 2008. Online learning
algorithms for computer-assisted
translation. Deliverable D4.2, SMART: Stat.
Multilingual Analysis for Retrieval and
Translation. University of Southhampton.
Chen, S. F. and J. Goodman. 1996. An
empirical study of smoothing techniques
for language modeling. In Proceedings of the
34th Annual Meeting of the Association for
Computational Linguistics, pages 310–318,
San Francisco.
Chiang, D., Y. Marton, and P. Resnik. 2008.
Online large-margin training of syntactic
and structural translation features. In
Proceedings of the Conference on Empirical
Methods in Natural Language Processing,
pages 224–233, Stroudsburg,
PA.
Church, K. W. and W. A. Gale. 1995. Poisson
mixtures. Natural Language Engineering,
1:163–190.
Denkowski, M., C. Dyer, and A. Lavie. 2014.
Learning from post-editing: Online model
adaptation for statistical machine
translation. In Proceedings of the 14th
Conference of the European Chapter of the
Association for Computational Linguistics,
pages 395–404, Gothenburg.
Federico, M., N. Bertoldi, M. Cettolo, M.
Negri, M. Turchi, M. Trombetti, A.
Cattelan, A. Farina, D. Lupinetti, A.
Martines, A. Massidda, H. Schwenk, L.
Barrault, F. Blain, P. Koehn, C. Buck, and
U. Germann. 2014. The Matecat tool. In
COLING 2014, 25th International Conference
on Computational Linguistics, Proceedings of
the Conference System Demonstrations,
pages 129–132, Dublin.
Foster, G., P. Isabelle, and P. Plamondon.
1997. Target-text mediated interactive
machine translation. Machine Translation,
12(1):175–194.
Giraud-Carrier, C. 2000. A note on the
utility of incremental learning. AI
Communications, 13(4):215–223.
Gonz´alez-Rubio, J., D. Ortiz-Mart´ınez, and
F. Casacuberta. 2012. Active learning for
interactive machine translation. In
Proceedings of the 13th Conference of the
European Chapter of the Association for
Computational Linguistics, pages 245–254,
Avignon.
Green, S., J. Heer, and C. D. Manning. 2013.
The efficacy of human post-editing for
language translation. In Proceedings of the
SIGCHI Conference on Human Factors in
Computing Systems, pages 439–448, Paris.
Hardt, D. and J. Elming. 2010. Incremental
re-training for post-editing SMT. In
Proceedings of the 9th Annual Conference of
the Association for Machine Translation in the
Americas, Denver, CO. Available at
amta2010.amtaweb.org/.
Irvine, A., J. Morgan, M. Carpuat, H. D. III,
and D. S. Munteanu. 2013. Measuring
machine translation errors in new
domains. Transactions of the Association for
Computational Linguistics, 1:429–440.
Knuth, D. E. 1981. Seminumerical Algorithms,
volume 2 of The Art of Computer
Programming. Addison-Wesley, MA, 2nd
edition.
Koehn, P. 2005. Europarl: A parallel corpus
for statistical machine translation. In
Proceedings of the X Machine Translation
Summit, pages 79–86, Phuket.
Koehn, P., H. Hoang, A. Birch,
C. Callison-Burch, M. Federico,
N. Bertoldi, B. Cowan, W. Shen, C. Moran,
R. Zens, C. Dyer, O. Bojar, A. Constantin,
and E. Herbst. 2007. Moses: Open source
toolkit for statistical machine translation.
In Proceedings of the 45th Annual Meeting of
the Association for Computational Linguistics,
pages 177–180, Prague.
Koehn, P., F. J. Och, and D. Marcu. 2003.
Statistical phrase-based translation. In
Proceedings of the Human Language
Technology and North American Association
for Computational Linguistics Conference,
pages 48–54, Edmonton.
Koehn, P. 2009. A process study of
computer-aided translation. Machine
Translation, 23(4):241–263.
Lagarda, A. L., D. Ortiz-Mart´ınez, V. Alabau,
and F. Casacuberta. 2015. Translating
without in-domain corpus: Machine
translation post-editing with online
learning techniques. Computer Speech &
Language, 32(1):109–134.
Levenberg, A., C. Callison-Burch, and
M. Osborne. 2010. Stream-based
translation models for statistical machine
translation. In Proceedings of the North
159
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
c
o
l
i
_
a
_
0
0
2
4
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Computational Linguistics
Volume 42, Number 1
American Chapter of the Association for
Computational Linguistics - Human Language
Technologies, pages 394–402, Los Angeles,
CA.
Liang, P., A. Bouchard-C ˆot´e, D. Klein, and
B. Taskar. 2006. An end-to-end
discriminative approach to machine
translation. In Proceedings of the 44th
Annual Meeting of the Association for
Computational Linguistics, pages 761–768,
Morristown, NJ.
Liang, P. and D. Klein. 2009. Online EM for
unsupervised models. In NAACL ’09:
Proceedings of Human Language Technologies:
The 2009 Annual Conference of the North
American Chapter of the Association for
Computational Linguistics, pages 611–619,
Morristown, NJ.
Macklovitch, E. 2006. Transtype2: The last
word. In Proceedings of LREC 2006, Genoa,
pages 167–172.
Mart´ınez-G ´omez, P., G. Sanchis-Trilles,
and F. Casacuberta. 2012. Online
adaptation strategies for statistical
machine translation in post-editing
scenarios. Pattern Recognition,
45(9):3193–3203.
Mathur, P., M. Cettolo, and M. Federico. 2013.
Online learning approaches in computer
assisted translation. In Proceedings of the
ACL Workshop on Statistical Machine
Translation, pages 301–308, Sofia.
Mirkin, S. and N. Cancedda. 2013. Assessing
quick update methods of statistical
translation models. In Proceedings of the
International Workshop on Spoken Language
Translation (IWSLT), pages 264–271,
Heidelberg.
Neal, R. M. and G. E. Hinton. 1998.
A view of the EM algorithm that justifies
incremental, sparse, and other variants. In
Proceedings of the NATO-ASI on Learning in
graphical models, pages 355–368, Norwell,
MA.
Nepveu, L., G. Lapalme, P. Langlais, and
G. Foster. 2004. Adaptive language and
translation models for interactive machine
translation. In Proceedings of the Conference
on Empirical Methods in Natural Language
Processing, pages 190–197, Barcelona.
Och, F. J. 2003. Minimum error rate training
in statistical machine translation. In
Proceedings of the 41th Annual Conference of
the Association for Computational Linguistics,
pages 160–167, Sapporo.
Och, F. J. and H. Ney. 2002. Discriminative
training and maximum entropy models
for statistical machine translation. In
Proceedings of the 40th Annual Meeting of the
160
Association for Computational Linguistics
(ACL), pages 295–302, Philadelphia, PA.
Och, F. J. and H. Ney. 2003. A systematic
comparison of various statistical
alignment models. Computational
Linguistics, 29(1):19–51.
Ortiz, D., I. Garc´ıa-Varea, and F. Casacuberta.
2005. Thot: A toolkit to train phrase-based
statistical translation models. In
Proceedings of the X Machine Translation
Summit, pages 141–148, Phuket.
Ortiz-Mart´ınez, D. 2011. Advances in
Fully-Automatic and Interactive Phrase-Based
Statistical Machine Translation. Ph.D. thesis,
Universitat Polit`ecnica de Val`encia.
Ortiz-Mart´ınez, D. and F. Casacuberta. 2014.
The new Thot toolkit for fully automatic
and interactive statistical machine
translation. In 14th Annual Meeting of the
European Association for Computational
Linguistics: System Demonstrations,
pages 45–48, Gothenburg.
Ortiz-Mart´ınez, D., I. Garc´ıa-Varea, and
F. Casacuberta. 2010. Online learning for
interactive statistical machine translation.
In Proceedings of the North American Chapter
of the Association for Computational
Linguistics - Human Language Technologies
(NAACL-HLT), pages 546–554, Los
Angeles, CA.
Ortiz-Mart´ınez, D., L. A. Leiva, V. Alabau,
I. Garc´ıa-Varea, and F. Casacuberta. 2011.
An interactive machine translation system
with online learning. In Proceedings of the
Association for Computational Linguistics
Conference (System Demonstrations),
pages 68–73, Portland, OR.
Ortiz-Mart´ınez, D., J. Gonz´alez-Rubio,
V. Alabau, G. Sanchis-Trilles, and
F. Casacuberta. 2015. Integrating online
and active learning in a computer-assisted
translation workbench. In M. Carl, S.
Bangalore, and M. Schaeffer, editors, New
Directions in Empirical Translation Process
Research, Springer, pages 57–76.
Papineni, K. A., S. Roukos, T. Ward, and
W. Zhu. 2001. BLEU: A method for
automatic evaluation of machine
translation. Technical Report RC22176
(W0109-022), IBM Research Division,
Thomas J. Watson Research Center,
Yorktown Heights, NY.
SchlumbergerSema S.A.,
Instituto Tecnol ´ogico de Inform´atica,
Rheinisch Westf¨alische Technische
Hochschule Aachen Lehrstul f ¨ur
Informatik VI, Recherche Appliqu´ee
en Linguistique Informatique Laboratory
University of Montreal, Celer Soluciones,
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
c
o
l
i
_
a
_
0
0
2
4
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Ortiz-Mart´ınez
Online Learning for Statistical Machine Translation
Soci´et´e Gamma, and Xerox
Research Centre Europe. 2001. TT2.
TransType2 - computer assisted
translation. Project technical annex.
Information Society Technologies
(IST) Programme, IST-2001-32091.
Simard, M. and G. Foster. 2013. Pepr:
Post-edit propagation using
phrase-based statistical machine
translation. In Proceedings of the
XIV Machine Translation Summit,
pages 191–198, Nice.
TAUS-Project. 2010. Postediting in practice.
A TAUS Report. Technical report, TAUS –
Enabling better translation.
Tiedemann, J. 2009. News from OPUS - A
collection of multilingual parallel corpora
with tools and interfaces. In N. Nicolov,
K. Bontcheva, G. Angelova, and R. Mitkov,
editors, Recent Advances in Natural
Language Processing, volume V. John
Benjamins, Amsterdam/Philadelphia,
pages 237–248.
Toutanova, K., H. T. Ilhan, and C. Manning.
2002. Extensions to HMM-based statistical
word alignment models. In Proceedings of
the Conference on Empirical Methods in
Natural Language Processing, pages 87–94,
Philadelphia, PA.
Vogel, S., H. Ney, and C. Tillmann. 1996.
HMM-based word alignment in
statistical translation. In Proceedings of
the 16th International Conference on
Computational Linguistics, pages 836–841,
Copenhagen.
W¨aeschle, K., P. Simianer, N. Bertoldi,
S. Riezler, and M. Federico. 2013.
Generative and discriminative methods
for online adaptation in SMT. In
Proceedings of the XIV Machine Translation
Summit, pages 11–18, Nice.
Watanabe, T., J. Suzuki, H. Tsukada, and
H. Isozaki. 2007. Online large-margin
training for statistical machine translation.
In Proceedings of the EMNLP-CONLL joint
conference, pages 764–733, Prague.
Zens, R., F. J. Och, and H. Ney. 2002.
Phrase-based statistical machine
translation. In Advances in Artificial
Intelligence. 25. Annual German Conference
on AI, volume 2479 of LNCS. Springer
Verlag, September, pages 18–32.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
2
1
1
2
1
1
8
0
5
8
2
7
/
c
o
l
i
_
a
_
0
0
2
4
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
161
PDF Herunterladen