Generation of Compound Words in
Statistical Machine Translation into
Compounding Languages

Sara Stymne∗
Uppsala University

Nicola Cancedda∗∗
Xerox Research Centre Europe

Lars Ahrenberg†
Linköping University

In this article we investigate statistical machine translation (SMT) into Germanic languages,
with a focus on compound processing. Our main goal is to enable the generation of novel
compounds that have not been seen in the training data. We adopt a split-merge strategy, where
compounds are split before training the SMT system, and merged after the translation step.
This approach reduces sparsity in the training data, but runs the risk of placing translations of
compound parts in non-consecutive positions. It also requires a postprocessing step of compound
merging, where compounds are reconstructed in the translation output. We present a method for
increasing the chances that components that should be merged are translated into contiguous
positions and in the right order and show that it can lead to improvements both by direct
inspection and in terms of standard translation evaluation metrics. We also propose several new
methods for compound merging, based on heuristics and machine learning, which outperform
previously suggested algorithms. These methods can produce novel compounds and a translation
with at least the same overall quality as the baseline. For all subtasks we show that it is
useful to include part-of-speech based information in the translation process, in order to handle
compounds.

1. Introduction

In many languages including most of the Germanic (German, Swedish, etc.) and Uralic
(Finnish, Hungarian, etc.) language families, so-called closed compounds are used

∗ Department of Linguistics and Philology, Uppsala University, Box 635, 751 26 Uppsala, Sweden.

E-mail: sara.stymne@lingfil.uu.se.

∗∗ Xerox Research Centre Europe, 6 chemin de Maupertuis, 38240 Meylan, France.

E-mail: nicola.cancedda@xrce.xerox.com.

† Department of Computer and Information Science, Linköping University, 58183 Linköping, Sweden.

E-mail: lars.ahrenberg@liu.se.

Submission received: 25 April 2012; revised submission received: 30 November 2012; accepted for
publication: 8 January 2013.

doi:10.1162/COLI_a_00162

© 2013 Association for Computational Linguistics


productively. Closed compounds are written as single words without spaces or other
word boundaries, as in German Goldring. We will refer to these languages as compound-
ing languages. In English, on the other hand, compounds are generally open, that is,
written as two words as in gold ring.

This difference in compound orthography leads to problems for statistical ma-
chine translation (SMT). For translation into a compounding language, often fewer
compounds than in normal texts are produced. This can be due to the fact that the
desired compounds are missing in the training data or that they have not been aligned
correctly. When a compound is the idiomatic word choice in the translation, systems
often produce separate words, genitive or other alternative constructions, or translate
only one part of the compound. For an SMT system to cope with the productivity of the
phenomenon, any effective strategy should be able to correctly process compounds that
have never been seen in the training data as such, although possibly their components
have, either in isolation or within a different compound.

Previous work (e.g., Koehn and Knight 2003) has shown that compound splitting
improves translation from compounding languages into English. In this article we ex-
plore several aspects of the less-researched area of compound treatment for translation
into such languages, using three Germanic languages (German, Swedish, and Danish)
as examples.1 The assumption is that splitting compounds will also improve translation
for this translation direction and lead to more natural translations. The strategy we
adopt is to split compounds in the training data, and to merge them in the translation
output. Our overall goal is to improve translation quality by productively generating
compounds in SMT systems.

The main contributions of the article are as follows:

- Demonstrating improved coalescence (adjacency and order) of compound parts in translation through the use of sequence models based on customized part-of-speech sets and count features

- Designing and evaluating several heuristic methods for compound merging that outperform previously suggested heuristic merging methods

- Designing and evaluating a novel method for compound merging based on sequence labeling

- Demonstrating the ability of these merging methods to generate novel unseen compounds

In addition, we report effects on translation performance from a number of variations
in the methods for compound splitting and merging.

The rest of the article is structured as follows. Section 2 gives an overview of
compound formation in the three target languages used in this work. Section 3 reports
related work on compound processing for machine translation. Section 4 describes the
compound processing strategy we use and Section 5 describes compound splitting.
Section 6 addresses compound coalescence, followed by compound merging in Sec-
tion 7. In Section 8 we present experimental results and in Section 9 we state our
conclusions.

1 This work is a synthesis and extension of Stymne (2008); Stymne and Holmqvist (2008); Stymne,

Holmqvist, and Ahrenberg (2008); Stymne (2009); and Stymne and Cancedda (2011).


2. Closed Compounds in German, Swedish, and Danish

Compounds in German, Swedish, and Danish are generally closed, written without
word boundaries, as exemplified for German in Example (1). Compounds can be made
up of two (1a) or more (1b) parts where parts may also be coordinated (1c). In a few
cases compounds are written with a hyphen (1d), often when one of the parts is a
proper name or an abbreviation. Most compounds are nouns (1a–1e), but they can
also be adjectives (1f), verbs (1g), and adverbs (1h). English translations of compounds
can be written as open compounds with separate words (1a) or with hyphens (1f), as
other constructions, possibly with inserted function words (1g) and reordering (1b),
or as single words (1e). Generally the last part of the compound is the compound
head, that is, it conveys the main meaning of the compound, and determines its part
of speech. The other parts, compound modifiers, modify the meaning of the com-
pound head in some way and need not have the same part of speech as the full
compound.

(1)

a. Regierungskonferenz intergovernmental conference
   Regierung+Konferenz government conference

b. Friedensnobelpreisträger Nobel Peace Prize laureate
   Frieden+Nobel+Preis+Träger peace Nobel prize bearer

c. See- und Binnenhäfen sea and inland ports
   See- und Binnen+Häfen sea and interior ports

d. EU-Mitgliedstaaten EU member states
   EU-Mitglied+Staaten EU member states

e. Jahrtausend millennium
   Jahr+tausend year thousand

f. dunkelblau dark-blue
   dunkel+blau dark blue

g. kennenlernen get to know
   kennen+lernen know learn

h. grösstenteils in most instances
   grössten+teils largest partly

Compound modifiers often have a special form, such as the addition of an “s”
to the base form of Regierung in Example (1a). We will refer to these form variants
as compounding forms. Table 1 exemplifies the type of operations used in Swedish,
German, and Danish to form compounding forms. There are many more alternative
compounding forms in Swedish and German than in Danish. For an overview of
the possible compounding forms in German see Langer (1998) or Kürschner (2003),
in Danish see Kürschner (2003), and in Swedish see Thorell (1981) or Stymne and
Holmqvist (2008). Some compounding forms coincide with paradigmatic forms, such
as German Jahres that can also be genitive, and Stadien that can also be plural, from
Table 1. There are different views on whether to treat these forms as paradigmatic forms
or as compounding forms. We follow Langer (1998) in viewing them as compounding
forms, because they often do not correspond to plural or possessive semantics. Many
individual compound modifiers have more than one possible compounding form. In
Example (2) we provide examples of several possible forms of the modifier Kind (child)
in German compounds.


(2)

0   Kind+phase (child-caring period)
+s  Kinds+lage (fetal position)
+es Kindes+unterhalt (child support)
+er Kinder+film (children's film)
+-  Ein-Kind-Politik (one-child policy)

In some cases concatenating two words would lead to three identical consecutive
consonants. In the Scandinavian languages, there is a spelling rule that does not allow
this, and three identical consonants are reduced to two, as in Example (3SV). This spelling
rule was also used for some German compounds before 1996, when it was changed by
a spelling reform, so that nowadays three identical consecutive consonants are never
reduced to two at compound boundaries in German (Institut für Deutsche Sprache
1998), as shown in Example (3DE).

Table 1
Operations used for forming compounding forms with examples.

Type            Lang  Form      Example
Null operation  DE    0         umweltfreundlich (environmentally-friendly) = Umwelt+freundlich (environment friendly)
                SV    0         naturkatastrof (natural disaster) = natur+katastrof (nature disaster)
                DA    0         håndbagage (hand luggage) = hånd+bagage (hand luggage)
Addition        DE    +es       Jahreswechsel (turn of the year) = Jahr+Wechsel (year change)
                SV    +s        kvalitetstecken (quality mark) = kvalitet+tecken (quality sign)
                DA    +e        spillekonsol (game console) = spill+konsol (game console)
Deletion        DE    -e        Lymphreaktion (lymphatic response) = Lymphe+Reaktion (lymph response)
                SV    -a        flickskola (girls' school) = flicka+skola (girl school)
Combination     DE    -on/+en   Stadienexperte (stadium expert) = Stadion+Experte (stadium expert)
                SV    -e/+s     arbetsolycka (industrial accident) = arbete+olycka (work accident)
                DA    -e/+s     embedsmand (civil servant) = embede+mand (job man)
Umlaut          DE    ¨+er      Völkerrecht (international law) = Volk+Recht (people right)
                SV    ¨-er/+ra  brödrakärlek (brotherly love) = broder+kärlek (brother love)
                DA    ¨+e       børnesko (children's shoe) = barn+sko (child shoe)


(3)

SV tullagstiftning customs legislation

tull+lagstiftning custom legislation

DE Zelllinie cell line

Zell+Linie cell line
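
At merge time, this spelling rule reduces to a simple check at the compound boundary. A minimal Python sketch (a hypothetical helper, not from the article; the vowel set shown is for Swedish):

SV_VOWELS = "aeiouyåäö"

def merge_reduce(modifier, head):
    # Concatenate two compound parts; if this would yield three identical
    # consecutive consonants at the boundary, reduce them to two
    # (Scandinavian spelling rule, Example 3SV).
    if (len(modifier) >= 2 and head
            and modifier[-1] == modifier[-2] == head[0]
            and head[0] not in SV_VOWELS):
        return modifier[:-1] + head
    return modifier + head

assert merge_reduce("tull", "lagstiftning") == "tullagstiftning"
assert merge_reduce("hand", "bok") == "handbok"  # no reduction needed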

Compounding is common and productive; new compounds can be readily formed
and understood. This is confirmed in a number of corpus studies. In German, com-
pounds have been shown to make up 5–7% of tokens and 43–47% of types in news
text (Baroni, Matiasek, and Trost 2002; Schiller 2005). If function words are removed,
an even higher number of the tokens are compounds; in both Swedish and German
10% of the content words in a news text have been found to be compounds (Hedlund
2002). That compounding is productive means that it is likely that a high number of
compounds have a very low frequency in texts. Baroni, Matiasek, and Trost (2002) found
that 83% of the compounds in a large German news corpus occur less than five times.
In Swedish, compounds are the most common type of hapax words, that is, words that
occur only once in a text (Carlberger et al. 2005). The most common type of compound
is the noun+noun compound, which makes up 62% of the compounds in the German
news corpus of Baroni, Matiasek, and Trost.

3. Related Work

The problems arising from differences in compounding strategies in translation from
German into English have been addressed by several authors. The most common archi-
tecture for translation from German is to split compounds in a preprocessing step prior
to training and translation using some automatic method, which has been suggested
both for SMT (Nießen and Ney 2000; Koehn and Knight 2003; Popović, Stein, and Ney
2006; Holmqvist, Stymne, and Ahrenberg 2007) and example-based MT (Brown 2002).
German compounds are split into their component parts in a preprocessing step and the
translation model is then trained between modified German and English. At translation
time, the German source text is also run through a compound splitter. In the studies
cited here, only one splitting option is given as input to the decoder, which can be
problematic in case the splitting is wrong, or if any of the parts are unknown. In Dyer
(2009) several splitting options were given to the decoder in the form of a lattice. It
is, however, not straightforward to use lattices during training, and in order to solve
this, the training corpus was doubled, one part being without splits and the other part
having the best splitting option for each word.

For translation into German, Popović, Stein, and Ney (2006) investigated three
different strategies for compound processing. The first was to split compounds dur-
ing training and after translation merge compound parts back into full compounds,
the second merged English compounds prior to training instead of splitting German
compounds, and the third used compound splitting only to improve word alignment.
The split–merge strategy gave the best results, but using splitting only for alignment
gave similar results. The merging of English compounds led to an improvement over a
baseline without compound processing, but was not as good as the other two strategies.
Popović, Stein, and Ney also presented one of the few previous suggestions for compound
merging. Each word in the translation output was looked up in a list of compound parts,
and merged with the next word if it resulted in a known compound. This method led
to improved overall translation results from English to German. The drawback of this
method is that novel compounds cannot be merged. It might also merge words that
should not be merged, but that happen to coincide with known compounds.


Fraser (2009) merged split German compounds after translation from English, by
applying a second phrase-based SMT (PBSMT) system trained on German with split
compounds and normal German. No separate results were presented for this extension
alone, but in combination with other morphological processing the strategy led to worse
results than the baseline. This method also suffers from the same drawbacks as that of
Popović, Stein, and Ney (2006), that novel compounds cannot be merged and words
that should not be merged can still be.

Koehn, Arun, and Hoang (2008) discussed the treatment of hyphenated compounds
for translation into German. They used a separate mark-up token for hyphenated
compounds, where the hyphen was split into a separate token and marked by a symbol.
The impact on the translation result was small.

Botha, Dyer, and Blunsom (2012) discussed the approach of using customized
language models to target German compounds. They presented hierarchical Pitman-
Yor language models, where the compound head is conditioned on the words pre-
ceding the full compound, and the compound modifiers are modeled by a reverse
compound language model. They used this model as a replacement of a standard
language model for SMT and found that although the perplexity of the language model
was reduced, there were only minor improvements on Bleu for the SMT task. In this
case the SMT pipeline was left unchanged, meaning that novel compounds were not
considered.

Compound merging has also been performed for speech recognition. An example
of this is Berton, Fetter, and Regel-Brietzmann (1996), who extended the word graphs
output by a German speech recognizer with possible compounds by combining edges
of words during a lexical search. The final hypotheses were then identified from the
graph using dynamic programming techniques. Compound merging for speech recog-
nition is a somewhat different problem than for machine translation, however, because
coalescence is not an issue there; in SMT, there is no guarantee that the
order of the parts in the translation output is correct.

Another somewhat related problem to compound merging is that of detection of er-
roneously split compounds in human text, which is faced by grammar checkers. Writing
compounds as separate words, with spaces between parts, is a common writing error
in compounding languages. Carlberger et al. (2005) described a system for Swedish that
used handwritten rules to identify, among other errors, erroneously split compounds.
The rules used parts of speech and morphological features. On a classified gold standard
of writing errors they had a recall of 46% and a precision of 39% for identifying split
compounds, indicating that it is a difficult problem to find split compounds in free,
unmarked text.

There is also work on morphological merging, which is needed when the target
words have been split into morphs in the training corpus. Virpioja et al. (2007) marked
morphs with a symbol and merged all marked words with the next word for translation
between Finnish, Swedish, and Danish, without showing any improvements over an
unmarked baseline. This strategy does have the advantage of being able to merge novel
word forms, but has a drawback in that it can merge parts into non-words if the parts
are misplaced in the translation output.

El-Kahlout and Oflazer (2006) used a similar symbol-based merging strategy for
translation from English into Turkish, but with the addition of morphographemic rules.
They had positive results when performing limited splitting and grouping the split
morphs, but not when splitting all morphs. They reported that there were problems
with the order of morphs in the output. Badr, Zbib, and Glass (2008) reported results for
translation from English to Arabic, where they used a combination merging method


where forms were picked from the corpus for known combinations of morphs and
words, and generated based on handwritten recombination rules otherwise, which
led to improvements over the baseline. They also used the morphs+POS as factors
in a factored translation model (Koehn and Hoang 2007) where surface forms were
generated from this information; this gave a small improvement on a large corpus, but
at the cost of high runtime. El Kholy and Habash (2010) extended the merging scheme of
Badr, Zbib, and Glass (2008) by using the conditional probability and a language model
score to pick the best known merging option. They showed a small effect of this on an
MT task, even though this strategy was the best option in an intrinsic evaluation based
on human reference translations.

Compound splitting has been addressed in many articles, as a separate task (Schiller
2005) or targeted for applications such as information retrieval (Holz and Biemann
2008), speech recognition (Larson et al. 2000), grammar checking (Sjöbergh and Kann
2004), lexicon acquisition (Kokkinakis 2001), word prediction (Baroni, Matiasek, and
Trost 2002), and machine translation (Koehn and Knight 2003; Dyer 2009; Fritzinger
and Fraser 2010; Macherey et al. 2011).

The most successful strategies that address compound processing for MT apply
compound splitting as a preprocessing step before training the translation models.
Koehn and Knight (2003) presented an empirical splitting algorithm targeted at SMT
from German to English. They split words in all possible places, and considered a
splitting option valid if all its parts had been seen as words in a monolingual corpus.
They allowed the addition of -s or -es at all splitting points. If there were several valid
splitting options they chose one based on the number of splits, the geometric mean of
part frequencies, or based on alignment data. They evaluated the splitting algorithms
intrinsically on a gold standard of manually split noun phrases and on machine trans-
lation of noun phrases. The best results for PBSMT were achieved by using either the
geometric mean, or the highest number of splits. There were no correlations between
translation results and the intrinsic evaluation.

Several other researchers have also explored compound splitting for translation
from German to English. Nießen and Ney (2000) used a morpho-syntactic analyzer
for splitting German compounds prior to translation. Popović, Stein, and Ney (2006)
used the geometric mean version from Koehn and Knight (2003) as well as the mor-
phosyntactic algorithm from Nießen and Ney (2000) for splitting, with similar posi-
tive results for both options. Fritzinger and Fraser (2010) combined linguistic analysis
with corpus-driven scoring and showed an improvement compared to using only a
corpus-driven approach. Macherey et al. (2011) described a corpus-driven method that
learns compounding form transformations of a language in addition to just compound
splitting, and showed an improvement for translation from several languages into
English compared to a baseline without compound treatment. Dyer (2010) suggested
a compound splitting method based on sequence labeling, which gave good results for
lattice-based translation from German.

4. Compound Translation

For translation into a compounding language, we adopt the compound processing
strategy suggested by Popović, Stein, and Ney (2006). The process is:

1. Split compounds on the target (compounding language) side of the training corpus.


2. Learn a translation model from source (e.g., English) into decomposed-target (e.g., decomposed-German).

3. At translation time, translate using the learned model from source into decomposed-target.

4. Apply a postprocessing merge step to reconstruct compounds.

The merging step must solve two problems: Identify which words should be merged
into compounds, and choose the correct compounding form for the compound mod-
ifiers. The first problem can become hopelessly difficult if the translation did not put
components nicely side by side and in the correct order. Preliminary to merging, then,
the problem of coalescence needs to be addressed, that is, translations where compound
elements are correctly positioned should be promoted.

Figure 1 gives an overview of the translation process with compound processing.
Compounds are split before training the translation system, and merged after transla-
tion. We use factored decoding (Koehn and Hoang 2007), where features other than just
surface words can be used by the system. In our case we tag the data with parts of
speech and use a factored model with POS-tags on the target side, which allows us to
have a POS-sequence model, beside the standard language model. This configuration
has a very small overhead compared with decoding without factors. As we show, using
part-of-speech tags on the target side helps to improve the coalescence of compound
parts, and can be used to guide the merging process.

Figure 1
The SMT system architecture.

For the tuning step there are two options: either we can split the development set
and tune with a translation with split compounds compared with a reference with split


compounds, or we can merge compounds in the translation output, before performing
the optimization. We found empirically that we got the best results using the second
approach, of performing compound merging during the tuning process. When we
tuned on split texts, the results were more unstable, and especially the number of words
in the translation output varied substantially. We thus use merging also during the
tuning process in all experiments, as shown in Figure 1.

5. Compound Splitting

Our method for compound splitting is based on Koehn and Knight (2003) and Stymne
(2008). For each word all possible segmentations are explored, with the restrictions that
all parts must have at least three characters, and the last part, the compound head, must
have the same part-of-speech tag as the word itself. Hyphens are treated as additions
to compound modifiers, just as +s or +e. Segmentations are scored with the arithmetic
mean of frequencies for each part in the training corpus and the segmentation with the
highest score is chosen.
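
As an illustration, the following is a minimal Python sketch of this frequency-based splitter, assuming a dictionary freq of corpus word frequencies; for brevity it omits the compounding-form additions (+s, hyphens, etc.) and the part-of-speech constraint on the head:

from itertools import combinations

def best_split(word, freq, min_len=3):
    # Return the segmentation of `word` with the highest arithmetic mean
    # of part frequencies; the unsplit word itself is the default.
    n = len(word)
    best_parts, best_score = [word], freq.get(word, 0)
    cut_points = range(min_len, n - min_len + 1)
    for k in range(1, max(1, n // min_len)):       # number of cut points
        for cuts in combinations(cut_points, k):
            bounds = (0,) + cuts + (n,)
            parts = [word[i:j] for i, j in zip(bounds, bounds[1:])]
            if any(len(p) < min_len for p in parts):
                continue                           # some part is too short
            if all(p in freq for p in parts):      # all parts seen as words
                score = sum(freq[p] for p in parts) / len(parts)
                if score > best_score:
                    best_parts, best_score = parts, score
    return best_parts

# best_split("goldring", {"gold": 500, "ring": 400, "goldring": 2})
#   -> ["gold", "ring"]   (mean 450 beats the unsplit frequency 2)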

We have investigated several variants of the basic method and their effects on

compound translation. The variants investigated are:

- Using the geometric or arithmetic mean for choosing the highest-frequency candidate split

- Restricting the length of each split part to either three or four characters

- Restricting the maximum number of parts per compound to two, or allowing any number of compound parts

- Allowing all known compounding forms (cf. Table 1), or restricting them to the most common ones

- Restricting the last part of the compound to be known from the corpus with the same part-of-speech tag as the full compound, or allowing all possible part-of-speech tags

6. Promoting Coalescence of Compounds

In this section we describe our approach to improve the coalescence of compounds,
which is based on POS-sequence models. We first present the representation schemes
we use for compounds, which form the basis of the sequence model approach, and then
describe the sequence modeling approach in more detail. Finally, we discuss how POS-
tags can be used for count features.

6.1 Representation of Compound Parts

As a result of compound splitting the segmentation of words is changed, and we know
from Table 1 that compounding forms often do not coincide with any forms that can be
used as standalone words. This raises several design decisions:

1. Should alternative forms of the same compound modifier be normalized to a canonical form?


2. Should compound modifiers be marked with a special symbol? Or should the separation between compound parts be marked?

3. How should compound parts be tagged in a factored system?

Although apparently innocuous, these decisions do have some influence on the whole
process. In this work we have used three combinations of marking and normalization,
and three different tagsets.

In Example (4a), called the unmarked scheme, compound modifiers are normalized
(kamps->kamp) and the words carry no special marking. In Example (4b), called the
marked scheme, compound modifiers are not normalized, but they are marked with
the symbol “#”. For these two marking schemes, an extended tagset, EPOS, is used. It
contains tags for compound modifiers that also indicate the part of speech of the head
word, such as “N-Modif” if the head is a noun (N) or “ADJ-Modif” if the head is an
adjective (ADJ).

In Example (4c), called the sepmarked scheme, the split itself is marked, by using
the special token “@#@” to mark split points, and compound modifiers are normalized.
In this case we use a standard POS-tagset with the addition of a COMP-tag for the
inserted split-token, and use the POS-tags that were found for the compound modifiers
when they were looked up in the monolingual corpus during splitting. We call this
tagset the SPOS-tagset.

(4)

a. fem+kamps+seger:N (pentathlon (five battle) victory)
   fem:N-Modif kamp:N-Modif seger:N

b. fem+kamps+seger:N (pentathlon (five battle) victory)
   fem#:N-Modif kamps#:N-Modif seger:N

c. fem+kamps+seger:N (pentathlon (five battle) victory)
   fem:NUM @#@:COMP kamp:N @#@:COMP seger:N

As a third alternative tagset we use a modified variant of the EPOS-tagset, where
distinctions among parts of speech that are not relevant to the formation of compounds
are blurred. This reduces the tagset to only a few tags, as in the case where only nouns
are split:

- N-Modif – all parts of a split compound except the last

- N – the last part of the compound (its head) and all other nouns

- X – all other tokens

We call this tagset the reduced POS-tagset (RPOS). The RPOS-tagset could easily be
extended to other types of compounds—for example, by extending it to five tags by
also including ADJ and ADJ-Modif if we want to split adjectives as well. Using the
RPOS-tagset, the POS-based sequence model will only be useful for controlling the
order and form of compound parts. With the EPOS-tagset it also aids in controlling
the order of other words. All POS-based sequence models are trained on POS-tagged
data, where the tags have been modified after applying a compound splitting algorithm.
When referring to either of the three tagsets EPOS, RPOS, or SPOS, we will use the
designation *POS.
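
To make the schemes concrete, here is a small Python sketch (a hypothetical helper, not part of the translation system) that renders a split compound under the unmarked or marked scheme together with its EPOS tags, as in Examples (4a) and (4b):

def mark_split(parts, head_pos, scheme="marked"):
    # `parts` are the surface parts of a split compound in order, the last
    # one being the head; `head_pos` is the POS of the full compound ("N").
    tokens, tags = [], []
    for part in parts[:-1]:
        tokens.append(part + "#" if scheme == "marked" else part)
        tags.append(head_pos + "-Modif")  # modifier tag derived from head POS
    tokens.append(parts[-1])
    tags.append(head_pos)
    return tokens, tags

# Marked scheme keeps compounding forms and adds "#":
# mark_split(["fem", "kamps", "seger"], "N")
#   -> (["fem#", "kamps#", "seger"], ["N-Modif", "N-Modif", "N"])
# Unmarked scheme takes normalized parts and no marking:
# mark_split(["fem", "kamp", "seger"], "N", scheme="unmarked")
#   -> (["fem", "kamp", "seger"], ["N-Modif", "N-Modif", "N"])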


6.2 Part-of-Speech–Based Sequence Models

If compounds are split in the training data, then there is no guarantee that translations
of components will end up in contiguous positions and in the correct order. This
is primarily a language model problem, and we will model it as such by applying
sequence models on the customized part-of-speech sets. These sequence models can
be either standard language models trained on texts where split compound modifiers
are marked with symbols, or they can be POS-sequence models trained on the specially
designed tagsets, such as EPOS. Both these types of models can encourage compound
parts to occur in the correct order; the former, however, is a more lightweight approach,
because it relies on surface words.

A language model solution to compound coalescence could be viewed as a soft
constraint in the decoder, which encourages good sequences of compound parts over
bad sequences. An alternative to this would have been a hard constraint in the decoder
that prohibits compound parts to be placed in an incorrect order. We opted for a
soft constraint approach because we found that it gave sufficiently good results, and
because previous work on soft versus hard constraints for other areas has shown that
soft constraints give more stable and generally better results, for instance, for phrase
cohesion (Cherry 2008).

As described in the previous section, we can add special POS tags and symbols to
identify compound modifiers. Table 2 shows examples of the different representation
schemes. We can then train a *POS n-gram sequence model using any of the *POS-
tagsets, which naturally steers the decoder towards translations with good relative
placement of these components.

A lightweight model that gives some additional information of the order of com-
pound parts compared to a language model on unmarked data is to train language mod-
els on texts where compounds are represented either using the marked or sepmarked
schemes but without parts-of-speech tags. These models are, however, not as strong as
the *POS-based models, since they cannot generalize from surface words, and mainly
can aid in keeping known compounds together.

6.3 Sequence Models as Count Features

We expect a *POS-based n-gram sequence model to learn to discourage sequences
unseen in the training data, such as the sequence of compound parts not followed by
a suitable head. Such a generative LM, however, might also have a tendency to bias

Table 2
Examples of representation schemes for the German phrase die Fremdsprachenkenntnisse
[the knowledge of foreign languages / foreign language knowledge], originally tagged as
DET N(oun).

Scheme       Tokens
Original     die Fremdsprachenkenntnisse
Unmarked     die fremd sprache kenntnisse
Marked       die fremd# sprachen# kenntnisse
Sepmarked    die fremd @#@ sprache @#@ kenntnisse
EPOS         DET N-Modif N-Modif N
RPOS         X N-Modif N-Modif N
SPOS         DET ADJ COMP N COMP N


Table 3
Tag combinations in the translation output.

Combination             Judgment   Boost   Punish
N-Modif N               Good       1       0
N-Modif N-Modif         Good       1       0
N-Modif </s>            Bad        0       1
N-Modif X               Bad        0       1
all other combinations  Neutral    0       0

lexical selection towards translations with fewer compounds, since the corresponding
tag sequences might be more common in text. To compensate for this bias, we experi-
ment with injecting a little dose of a priori knowledge, and add a count feature, which
explicitly counts the number of occurrences of RPOS-sequences that we deem good
and bad in the translation output. Table 3 gives an overview of the possible bigram
combinations, using the three-symbol tagset, plus sentence beginning and end markers,
and their judgment as good, bad, or neutral.

We define two new feature functions: the boost model counting the number of
occurrences of Good sequences, and the punish model, counting the occurrences of
Bad sequences, as indicated in Table 3. In the punish model we want to punish the two
bad combinations, where compound modifiers are placed in isolation without a suitable
head. In the boost model, we want to support the formation of compounds by rewarding
the two good combinations. We definitely want to boost the combination of a compound
part with its head. In addition, we can boost the formation of long compounds by
boosting the combination of two compound modifiers as well. The boost and punish
models can be used either in isolation or combined, with or without a further *POS
n-gram sequence model.

To exemplify the two models, consider the translation hypothesis given in Exam-
ple (5). The word pairs skogs bruks and bruks plan, marked in bold in the example,
constitute good sequences of a modifier followed by another modifier or a head, and
would give the count 2 with a boost model. The word sequence in italics, skogs av,
constitutes a bad sequence, since the preposition av (of ) cannot be a compound head,
and would give the count 1 to a punish model. The other word sequences do not
influence either of these models because they do not involve any compound modifiers.

(5) En|X skogs|N-Modif bruks|N-Modif plan|N ger|X en|X översikt|N skogs|N-Modif av|X
    A    forest        cultivation   plan   gives an  overview    forest        of
7. Compound Merging

Once a translation is generated using a system trained on split compounds, a post-
processing step is required to merge components back into compounds. All methods
we are aware of only consider consecutive tokens for merging: We stick to this assump-
tion, having delegated to the methods described earlier the task to promote a good
relative positioning of component translations. For all pairs of consecutive tokens we
have to decide whether to combine them or not. Depending on the language and on
preprocessing choices, we might also have to decide whether to apply any boundary
transformation such as, for example, inserting -s between components.


Figure 2
Overview of the reverse normalization algorithm, exemplified using the compound
Rechtsstaatsprinzip [rule-of-law principle / right-state principle]. The chosen option for each
strategy is circled.

In this section we first describe reverse normalization, then we describe our new
heuristic merging methods and modifications to existing heuristics. Finally we describe
a novel sequence-labeling formulation for compound merging.

7.1 Reverse Normalization

For compound modifiers that were normalized in the training data the reverse process
(reverse normalization) is needed at merging time to recreate the correct form for the
specific compound. We designed a method for reverse normalization that is based on
corpus frequencies of compound modifiers and compounds that can be collected during
compound splitting.

The strategy is illustrated and exemplified in Figure 2.2 First we look up all known
compounding forms, and their frequencies for all the compound modifiers. In Figure 2,
for instance, we found three options for the word recht [right], with rechts being the most
common option. Next, we try all combinations of forms to form a full compound, and if
any matches are found we pick the most frequent match. If this fails we have two back-
up strategies. First, we try to find known compounding forms for pairs of parts, starting
from left to right, looking up frequencies of the resulting forms from adding two parts.
If that fails we use our second back-up strategy, which is to use the most common form
of each modifier, and concatenate those. For binary compounds, the first back-up strategy is

2 Nouns in German are capitalized. This is normally dealt with by recasing in postprocessing, and is
orthogonal to the problem we deal with here.


superfluous, because there is only one pair of compound parts, and the second back-up
strategy is always used.

7.2 Heuristic Approaches to Compound Merging

We have investigated three major approaches to heuristic compound merging. Our
major contribution is a novel merging method based on part-of-speech matching. We
contrast this method with adaptations of previous merging suggestions based on sym-
bols and word lists. We also suggest improvements to these methods, and combination
methods that combine the strengths of word lists and parts of speech.

7.2.1 POS-Based Merging. The POS-match algorithm uses the fact that it is possible to
have several output factors besides surface form in a factored translation system. It
merges words that are marked as compound modifiers in either the EPOS or RPOS
tagsets if the next POS-tag matches. As described in Section 6.1, the part of speech of a
compound modifier is based on the part of speech of its head word, so a word is consid-
ered matching if the next word is a compound modifier of the same type, or a head with
a matching part of speech. In addition, if the next word does not match, the modifier
could be part of a coordinated compound, which is checked by seeing if the next word is
a conjunction, in which case a hyphen is added to the modifier. We investigated versions
of the algorithm both with and without treatment of coordinated compounds.

If a compound modifier is followed by anything other than a matching part of
speech or a conjunction it has most likely been misplaced in the translation process.
These items are left as they are in the translation output, which is often fine, because
only compound parts that occur as separate words in a corpus are split. When two
matching compound modifiers are merged, the process is iterated to see if the next word
is a matching compound modifier, head, or conjunction. This allows compounds with
an arbitrary number of parts to be merged.

In summary, the POS-based merging algorithm has the following steps:

- Step through each word+POS pair from left to right³
- If a compound POS, X-Modif, is found:
  - Remove the mark-up of the part if present
  - Store the compound part
  - While the next POS is a matching part, X-Modif:
    - Remove the mark-up of the part if present
    - Store the compound part
  - If the next POS is a matching head, X:
    - Store the compound head
  - If at least two parts have been found (either several modifiers or a head):
    - Perform reverse normalization on the stored parts if parts are normalized
    - Merge all parts
    - For Swedish and Danish: remove a consonant if any of the merges resulted in three identical consecutive consonants


  - If the next POS is a conjunction and no head was found:
    - Add a hyphen at the end of the compound part

³ The words that are processed in the inner if-clause are skipped in the outer loop.
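
A minimal Python sketch of the core loop, for the marked scheme with nouns only; reverse normalization, consonant reduction, and coordination handling are left out:

def pos_merge(tokens, tags):
    # Merge compound modifiers with matching modifiers/heads;
    # misplaced modifiers are left as they are in the output.
    out, i = [], 0
    while i < len(tokens):
        if tags[i] == "N-Modif":
            parts, j = [tokens[i].rstrip("#")], i + 1
            while j < len(tokens) and tags[j] == "N-Modif":
                parts.append(tokens[j].rstrip("#"))
                j += 1
            if j < len(tokens) and tags[j] == "N":  # matching head
                parts.append(tokens[j])
                j += 1
            if len(parts) > 1:                      # modifiers and/or a head
                out.append("".join(parts))
                i = j
                continue
        out.append(tokens[i])                       # no match: keep as-is
        i += 1
    return out

# pos_merge(["en", "skogs#", "bruks#", "plan"],
#           ["X", "N-Modif", "N-Modif", "N"])
#   -> ["en", "skogsbruksplan"]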

This heuristic has the advantage over previous merging algorithms that it can form
novel compounds while reducing the risk of erroneous merging through the matching
constraint. It is also the only merging method we are aware of that addresses coordi-
nated compounds. However, it requires a factored decoder that can carry part-of-speech
tags through the translation process. It also requires tagsets where compound modifiers
are marked based on the compound head, such as EPOS or RPOS. In the current form
the POS-match strategy cannot be used for the SPOS-tagset that does not use head-
based tags for compound modifiers, because we cannot enforce the matching constraint
in this way. It would be possible to design a POS-match strategy based on standard tags,
which for instance allowed merging of noun+noun, but not of noun+preposition. Such
a strategy requires linguistic knowledge and customization for each language, however.

7.2.2 Other Merging Heuristics. The symbol-based method is inspired by work on mor-
phology merging (El-Kahlout and Oflazer 2006; Virpioja et al. 2007). It merges words
that are marked with a symbol with the next word in the marked scheme. In the
sepmarked scheme, when a standard symbol is found, the words on both sides of it
are merged. These algorithms have the disadvantage, compared with the POS-match
algorithm, that it is more likely that words are merged into non-compounds, since no
matching check is carried out.

We have also investigated methods based on word lists, proposed by Popović, Stein,
and Ney (2006). These methods use frequency word lists compiled at split time. Three
types of lists were used: lists of compound modifiers, of compounds, and of words. If a
compound modifier is encountered, it is checked whether merging it with the next word
results in either another compound modifier, or a compound or word. Unlike Popović,
Stein, and Ney (2006), we perform this process recursively, to allow compounds with
several parts. Again, reverse normalization is performed when needed. Still, no novel
compounds can be formed, and coordinated compounds are not handled. The method
does not merge words into non-words, but there is another risk, that of merging
words that should be separate in a specific context, but that happen to form a valid
compound or other word when combined, such as the examples in Example (6). It has
the advantage over the other proposed methods that it can be used on output from
any MT decoder, without the use of customized POS-tags or symbols in the output.
We investigate the usage of two types of frequency lists for this strategy, either a list
of compounds, which can be created as a byproduct of the splitting algorithm, like
Popović, Stein, and Ney (2006), or by using a list of all words in a corpus.

(6) DE bei der (at the)
       beider (both)

    SV för små (too small)
       försmå (spurn)

We empirically verified that the list-based heuristics tend to misfire quite often,
leading to too many compounds, such as merging the words in Example (6). We thus
modified them in two ways: (1) by additionally requiring the head word to be a content
word and (2) by requiring the generated compound to be more frequent in a corpus than
the corresponding bigram of isolated words. The head word restriction can block some
erroneous merges such as those in Example (6DE), whereas the frequency restriction can


potentially block both examples in Example (6). Compound and bigram frequencies can
be computed on any available monolingual corpus in the domain of interest. We also
investigated combinations of the heuristics by using either the union or intersection of
merges from two different strategies.
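
For a single pair of adjacent words, the restricted list-based decision can be sketched as follows, assuming freq holds unigram, compound, and space-separated bigram counts from a monolingual corpus, and content_pos is the set of content-word tags:

def should_merge(word1, word2, pos2, freq, content_pos):
    # Word-list heuristic with the two added restrictions:
    # (1) the head must be a content word;
    # (2) the compound must outnumber the corresponding bigram.
    compound = word1 + word2
    if freq.get(compound, 0) == 0:
        return False                      # not a known compound or word
    if pos2 not in content_pos:
        return False                      # restriction (1)
    return freq[compound] > freq.get(word1 + " " + word2, 0)  # restriction (2)

# should_merge("bei", "der", "DET",
#              {"beider": 120, "bei der": 5000}, {"N", "ADJ", "V"})
#   -> False   (blocked by both restrictions)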

7.3 Compound Merging as Sequence Labeling

Besides extending and combining existing heuristics, we propose a novel formulation of
compound merging as a sequence labeling problem. The reverse problem, compound
splitting, has successfully been cast as sequence labeling before (Dyer 2010);
here we apply the formulation in the opposite direction.

Depending on choices made at compound splitting time, this task can be either a
binary or multi-class classification task. If compound parts were kept as-is, the merging
task is a simple concatenation of two words, and each separation point must receive a
binary label encoding whether the two tokens should be merged. If compounds were
normalized at splitting time, the compound form has to be restored before concatenat-
ing the parts. This can be modeled either as a multi-class classifier that has the possible
boundary transformations as its classes or in a two-step process where the form of
words are modeled in a separate step after the binary merging decision. In the latter
case it is possible to use the reverse normalization process described in Section 7.1. In
this work we limited our attention to binary classification.

Consider for instance translating into German the English in Example (7).

(7) Europe should promote the knowledge of foreign languages

Assuming that the training corpus did not contain occurrences of the pair (knowledge
of foreign languages,‘fremdsprachenkenntnisse’) but contained occurrences of (knowl-
edge,‘kenntnisse’), (foreign,‘fremd’), and (languages,‘sprachen’), then the translation
model from English into decomposed German could be able to produce Example (8).

(8) Europa sollte fremd sprachen kenntnisse fördern

We cast the problem of merging compounds as one of making a series of correlated
binary decisions, one for each pair of consecutive words, each deciding whether the
whitespace between the two words should be suppressed (label 1) or not (label 0). In the
example case, the correct labeling for the sentence would be {0,0,1,1,0}, reconstructing
the correct German as shown in Example (9).

(9) Europa sollte fremdsprachenkenntnisse fördern

Although in principle one could address each atomic merging decision indepen-
dently, it seems intuitive that a decision taken at one point should influence merging
decisions in neighboring separation points, especially because compounds can be made
up of more than two parts. For this reason, instead of a simple (binary or n-ary)
classification problem, we prefer a sequence labeling formulation.
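
Applying a predicted label sequence to the decoder output is then straightforward; a small sketch:

def apply_labels(tokens, labels):
    # labels[i] == 1 suppresses the whitespace between tokens[i]
    # and tokens[i+1] (cf. Examples (8) and (9)).
    assert len(labels) == len(tokens) - 1
    out = [tokens[0]]
    for token, label in zip(tokens[1:], labels):
        if label == 1:
            out[-1] += token          # merge with the previous token
        else:
            out.append(token)
    return out

# apply_labels("Europa sollte fremd sprachen kenntnisse fördern".split(),
#              [0, 0, 1, 1, 0])
#   -> ["Europa", "sollte", "fremdsprachenkenntnisse", "fördern"]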

Depending on the choice of the features, this approach has the potential to be truly
productive, that is, to form new compounds in an unrestricted way. As discussed,
in fact, the list-based heuristics can only form compounds that were observed in the
training data or in some suitable monolingual corpus, and are thus not productive.
The POS-match heuristic is more flexible, but is still limited in that it can only form


a compound if a modifying element (non-head) has been observed and tagged as such
in the training data.

The array of sequence labeling algorithms potentially suitable to our problem is
fairly broad, including hidden Markov models (Rabiner 1989), conditional random
fields (CRFs) (Lafferty, McCallum, and Pereira 2001), Semi-CRFs (Sarawagi and
Cohen 2004), structured perceptrons (Collins 2002), structured support vector machines
(Tsochantaridis et al. 2005), Max-Margin Markov networks (Taskar, Guestrin, and Koller
2003), and more. Because the focus of this work is on the application rather than on a
comparison among alternative structured learning approaches, we limited ourselves to
a single implementation. Considering its good scaling capabilities, capability to handle
strongly redundant and overlapping features, and widespread recognition in the NLP
community, we chose to use CRFs.

7.3.1 Features. Each sequence item (i.e., each separation point between words) is repre-
sented by means of a vector of features. Our aim was to include features representing
the knowledge available to the heuristics, such as part-of-speech tags, frequencies for
compounds and bigrams, as well as comparisons between them. Features were also
inspired by previous work on compound splitting, with the intuition that features that
are useful for splitting compounds could also be useful for merging. Character n-grams
have successfully been used for splitting Swedish compounds, as the only knowledge
source by Brodda (1979), and as one of several knowledge sources by Sjöbergh and
Kann (2004). Friberg (2007) tried to normalize letters, besides using the original letters.
Although she was not successful, we still believe in the potential of this feature. Larson
et al. (2000) used frequencies of prefixes and suffixes from a corpus as a basis of their
method for splitting German compounds. We used the following features where -N
refers to the nth position before the merge point, and +N to the nth position after the
merge point:

- Previous tag

- Surface words: word-2, word-1, word+1, bigram word-1–word+1

- Parts of speech: POS-2, POS-1, POS+1, bigram POS-1–POS+1

- Character n-grams around the merge point:
  - three-character suffix of word-1
  - three-character prefix of word+1
  - combinations crossing the merge point: 1+3, 3+1, and 3+3 characters

- Normalized character n-grams around the merge point, where characters are replaced by phonetic approximations and grouped according to phonetic distribution, see Figure 3 (only for Swedish)

- Frequencies from the training corpus, binned by

    binned(f) = 10^⌊log10(f)⌋ if f > 1, and binned(f) = f otherwise

  (a code sketch of the binning is given after this list), for the following items:
  - the bigram word-1,word+1
  - the compound resulting from merging word-1,word+1
  - word-1 as a true prefix of words in the corpus
  - word+1 as a true suffix of words in the corpus


# vowels (soft versus hard)
s/[aou˚a]/a/g;
s/[eiy¨a¨o´e]/e/g;

# consonant combinations and
# spelling alternations
s/ng/N/g;
s/gn/G/g;
s/ck/K/g;
s/^[lhgd]j/J/g;
s/^ge/Je/g;
s/^ske/Se/g;
s/^s[kt]?j/S/g;
s/^s?ch/S/g;
s/^tj/T/g;
s/^ke/Te/g;

#consonants grouping
s/[ptk]/p/g;
s/[bdg]/b/g;
s/[lvw]/l/g;
s/[cqxz]/q/g;

Figure 3
Transformations performed for normalizing Swedish characters (Perl notation).

- Frequency comparisons of two different frequencies in the training corpus, classified into four categories (freq1 = freq2 = 0, freq1 < freq2, freq1 = freq2, freq1 > freq2), for:
  - word-1,word+1 as bigram vs. as compound
  - word-1 as true prefix vs. as single word
  - word+1 as true suffix vs. as single word
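
A small Python sketch of the binning and comparison functions referenced above:

import math

def bin_frequency(f):
    # Bin a raw corpus frequency to its order of magnitude:
    # 10^floor(log10(f)) for f > 1, the raw value otherwise.
    return 10 ** math.floor(math.log10(f)) if f > 1 else f

def compare_frequencies(f1, f2):
    # Classify a pair of frequencies into the four categories.
    if f1 == 0 and f2 == 0:
        return "both-zero"
    if f1 < f2:
        return "less"
    return "equal" if f1 == f2 else "greater"

# bin_frequency(4711) -> 1000;  bin_frequency(1) -> 1
# compare_frequencies(120, 5000) -> "less"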

7.3.2 Training Data for the Sequence Labeler. Because features are strongly lexicalized, a
suitably large training data set is required to prevent overfitting, ruling out the possibil-
ity of manual labeling.

We created our training data automatically, using a subset of the compound merg-
ing heuristics described in Section 7.2.2, plus an additional heuristic enabled by the
availability, when estimating parameters for the CRF, of a reference translation: Merge
if two tokens are observed combined in the reference translation (possibly as a sub-
sequence of a longer word). We compared multiple alternative combinations of heuris-
tics on a validation data set. The validation and test data were created by applying all
heuristics, and then having the positive instances checked manually.
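
The added reference-based heuristic can be sketched as follows (a hypothetical helper; in the actual data creation it is combined with the heuristics of Section 7.2.2):

def reference_merge_labels(hyp_tokens, reference):
    # Label a separation point 1 if the concatenation of the two adjacent
    # tokens occurs in the reference translation, possibly as a
    # subsequence of a longer word.
    ref_words = reference.lower().split()
    toks = [t.lower() for t in hyp_tokens]
    return [1 if any(a + b in w for w in ref_words) else 0
            for a, b in zip(toks, toks[1:])]

# reference_merge_labels(
#     "Europa sollte fremd sprachen kenntnisse fördern".split(),
#     "Europa sollte Fremdsprachenkenntnisse fördern")
#   -> [0, 0, 1, 1, 0]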

A first possibility to automatically generate a training data set consists in applying
the compound splitting preprocessing of choice to the target side of the parallel training
corpus for the SMT system: Separation points where merges should occur are thus
trivially identified. In practice, however, merging decisions will need to be taken on
the noisy output of the SMT system, and not on the clean training data. To acquire
training data that is similar to the test data, we could have held out from SMT training
a large fraction of the training data, used the trained SMT to translate the source side


of it, and then label decision points according to the heuristics. This would, however,
imply making a large fraction of the data unavailable to the training of the SMT system.4
We thus settled for a compromise: We trained the SMT system on the whole training
data, translated the whole source side, and then labeled decision points according to
the heuristics. The translations we obtain are thus biased, of higher quality than those
we should expect to obtain on unseen data. Nevertheless they are substantially more
similar to real SMT output than the reference translations with automatic splits.

8. Experiments

In this section we report experimental results for the strategies proposed in this article.
We first describe, in Section 8.1, the overall experimental set-up. We then report on the
experiments, as follows:

- In Section 8.2 we report on general effects of compound processing.

- Section 8.3 reports effects on coalescence of different representation schemes. Here we do not vary merging or splitting.

- Section 8.4 reports on compound splitting and the way different parameter settings affect translation quality. Here we do not vary representation scheme or merging.

- Section 8.5 reports results from varying the merging process. In particular we compare the novel sequence labeling method with the heuristic methods. Here we do not vary representation scheme or splitting.

- In Section 8.6 we apply the overall best strategies to corpora of different sizes and to out-of-domain data.

8.1 Experimental Set-up

We performed experiments on translation from English into German, Swedish, and
Danish, all of which have closed compounds. We tested all experimental conditions on
Europarl (Koehn 2005) for translation from English to German. We also give contrasting
results on English–Swedish Europarl and for an automotive corpus for translation from
English to Swedish and Danish. The automotive corpus was gathered from translation
memory data. The two corpora are quite different. The automotive corpus is from a
limited domain, and of a homogeneous nature, whereas Europarl is more diverse, and
tends to have a more complex language than the automotive corpus. Table 4 summa-
rizes the sizes for the corpora that we used in Sections 8.2–8.5. The German test set is
the test2007 set from the WMT 2008 workshop.5 We have chosen to use a smaller part
of Europarl, to reduce runtimes and allow more experiments. In Section 8.6 we report
results from scaling experiments and on out-of-domain data.

4 This option can also be extended so that the training data is split into chunks where each chunk is

translated by a system that is trained on the remaining chunks, a strategy that has been successfully used
for parse reranking (Collins and Koo 2005). Because we found that using fresh data did not give any significant improvements on the merging task except with little training data (see Section 8.5.2, Table 22), we chose not to use this approach, since it would be very time consuming and thus impractical.

5 http://www.statmt.org/wmt08.


Table 4
Overview of the experimental settings for the experiments in Sections 8.2–8.5.

Name                         Europarl German  Europarl Swedish  Auto Swedish  Auto Danish

Corpus                       Europarl         Europarl          Automotive    Automotive
Languages                    En→De            En→Sv             En→Sv         En→Da
Training sentences           701,157          701,157           329,090       168,047
Avg. target sentence length  20.5             19.4              9.3           9.2
Dev sentences                500              500               2,000         1,000
Test sentences               2,000            2,000             1,000         1,000

We used factored translation (Koehn and Hoang 2007) in our experiments, with
both surface words and part-of-speech tags on the target side, with a sequence model
on parts-of-speech. For part-of-speech tagging we used TreeTagger (Schmid 1994) for
German, an in-house hidden Markov model tagger based on Cutting et al. (1992) for
Danish and Swedish, and for Swedish also the Granska tagger (Carlberger and Kann
1999). For the majority of experiments we used the Moses decoder (Koehn et al. 2007),
a standard phrase-based statistical decoder that allows factored decoding.
For word alignment, Giza++ (Och and Ney 2003) was used and for language modeling
we used SRILM (Stolcke 2002). For parameter optimization we used minimum error
rate training (Och 2003). For each experiment we ran minimum error rate training three
times in order to reduce the effect of optimizer instability, and report the average result
and standard deviation. In the merging experiments based on sequence labeling we
used the CRF++ toolkit.6 For the merging experiment with sequence labeling, we used
the Matrax decoder (Simard et al. 2005) on the automotive corpus. Matrax is a phrase-
based decoder that allows discontiguous phrases, and parameter optimization based
on gradient descent for smoothed NIST. We extended the original Matrax decoder with
factored decoding on the target side.

Compounds were split before training using the corpus-based method described in
Section 5. Except for the experiments comparing different compound merging methods,
we used the POS-match merging algorithm developed by us.

We report results on three metrics: Bleu (Papineni et al. 2002), NIST (Doddington
2002), and Meteor. For Meteor we use the version tuned on adequacy and fluency
(Lavie and Agarwal 2007) for German, and the original version with default weights
(Banerjee and Lavie 2005) for Swedish and Danish, since there is no tuned version for
those languages. For Bleu and Meteor we use the %-notation. Significance testing was
performed using approximate randomization (Riezler and Maxwell 2005), with 10,000
iterations, on output based on three optimizer runs, as recommended by Clark et al.
(2011).
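For illustration, a sketch of such an approximate randomization test; the real test recomputes the corpus-level metric per shuffle, whereas this sketch averages per-sentence scores only to keep the code short:

use strict; use warnings;
use List::Util qw(sum);

# Illustrative sketch of approximate randomization over per-sentence scores.
sub approx_rand_p {
    my ($a, $b, $iters) = @_;                 # array refs, aligned by sentence
    my $observed = abs(sum(@$a) / @$a - sum(@$b) / @$b);
    my $hits = 0;
    for (1 .. $iters) {
        my (@x, @y);
        for my $i (0 .. $#$a) {               # swap each aligned pair with p = 0.5
            if (rand() < 0.5) { push @x, $a->[$i]; push @y, $b->[$i]; }
            else              { push @x, $b->[$i]; push @y, $a->[$i]; }
        }
        $hits++ if abs(sum(@x) / @x - sum(@y) / @y) >= $observed;
    }
    return ($hits + 1) / ($iters + 1);        # smoothed p-value
}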

8.2 General Effects of the Compound Processing Strategy

One effect of compound splitting is that the number of word types is reduced. Tables 5
and 6 show the number of types (number of unique words) and tokens (total number of
words) and the type/token ratio for Europarl in the different representation schemes.
The type count and the type/token ratio in the baseline system are much higher in

6 http://crfpp.sourceforge.net/.


Table 5
Type and token counts and ratio for the German Europarl corpus.

System               Tokens       Types     Ratio

German   baseline    14,356,051   184,215   1.28%
         marked      15,674,728    93,746   0.60%
         unmarked    15,674,728    81,806   0.52%
         sepmarked   17,007,929    81,808   0.48%

English              15,158,429    63,692   0.42%

Table 6
Type and token counts and ratio for the Swedish Europarl corpus.

System               Tokens       Types     Ratio

Swedish  baseline    13,603,062   182,000   1.34%
         marked      14,401,784   107,047   0.74%
         unmarked    14,401,784   100,492   0.70%

English              15,043,321    67,044   0.45%

German and Swedish than in English, and are drastically reduced after compound splitting in all markup schemes, even though they are still higher than in English. One reason for the type count still being higher in German and Swedish than in English is morphological complexity, which is low in English and higher in German and Swedish. The type count is higher in the marked representation scheme than in the other schemes because compound modifiers are marked, and therefore do not coincide with other words as they do in the other schemes.
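A minimal sketch of how such type/token statistics can be computed from a tokenized corpus (our own code, reading one sentence per line from standard input):

use strict; use warnings;
my %type;
my $tokens = 0;
while (my $line = <STDIN>) {
    for my $w (split ' ', $line) { $type{$w}++; $tokens++; }
}
printf "tokens=%d types=%d ratio=%.2f%%\n",
    $tokens, scalar(keys %type), 100 * keys(%type) / $tokens;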

We believe that the fact that compound splitting leads to both type and token
counts that are more similar to English is a positive thing, because it likely contributes
to making the two languages structurally more similar, which makes the SMT pro-
cess easier. This is supported by Birch, Osborne, and Koehn (2008), who showed that
morphological complexity of a target language correlates negatively with Bleu scores.
A lower type/token ratio translates into fewer out of vocabulary (OOV) words, and
can give more accurate probability estimates for the SMT translation and language
models.

These changes in the number of types and tokens could potentially influence the
translation process. In Tables 7 and 8 we show the effect of splitting on average phrase
length and average phrase length ratio in the phrase table and during translation in the
baseline and in the unmarked system. For the unmarked system we also compensated
for the split compounds by counting phrase length as if compounds were merged using
the POS-match heuristic. For both languages the lengths and ratios are similar between
the baseline and the unmarked system when we compensate for split compounds. This indicates that the types of phrases used are not much affected by the higher number of tokens in the split system. We can also see that the phrases used during translation are on average longer on the source side and shorter on the target side, with a very different ratio than the phrases in the phrase table.


Table 7
Average phrase lengths and English/German length ratios in the phrase table and during
translation for German Europarl. German-m is German with merged compounds, where we
calculated the phrase length as if compounds were merged using the POS-match method.

                              Baseline           Unmarked
                           English  German   English  German  German-m

Phrase table  Phrase length   1.95    2.76     1.91     2.88    2.68
              Phrase ratio            0.86              0.81    0.86
Translation   Phrase length   2.46    2.29     2.48     2.51    2.33
              Phrase ratio            1.19              1.10    1.17

Table 8
Average phrase lengths and English/Swedish length ratios in the phrase table and during
translation for Swedish Europarl. Swedish-m is Swedish with merged compounds, where we
calculated the phrase length as if compounds were merged using the POS-match method.

                              Baseline             Unmarked
                           English  Swedish   English  Swedish  Swedish-m

Phrase table  Phrase length   1.95    2.76      1.96     2.81     2.70
              Phrase ratio            0.86               0.85     0.88
Translation   Phrase length   2.51    2.22      2.61     2.38     2.29
              Phrase ratio            1.21               1.75     1.22

8.3 Promoting Compound Coalescence

In this section we describe three sets of experiments that investigate the use of the *POS-
tagsets in sequence models and as count features. In the first experiment we explore the
effect of different representation schemes for translation into German, and in the second
experiment we do the same, on a smaller scale, for Swedish. In the third experiment
we investigate the effect of the RPOS-tagset and the boost and punish models. In all
experiments merging was performed using the POS-match heuristic (see Section 7.2.1),
except for the sepmarked scheme that used the SPOS-tagset that does not allow this
type of POS-match, and for which the symbol merging algorithm (see Section 7.2.2)
was used.

8.3.1 Experiment 1. In the first experiment we investigated different compound rep-
resentation schemes for German Europarl. Table 9 shows the results using different
representation schemes, comparing them with two baselines, with and without a POS-
sequence model. There is generally a small but significant improvement when a POS-based sequence model is used, compared with the same system without one. Especially for the systems with compound splitting, it was clearly worse not to use *POS-models, and all such systems perform significantly worse than both baselines on most metrics. The representation scheme with sepmarked words and SPOS-tags did not perform well, which may be due both to the weaker mark-up scheme and to the fact that it cannot use the POS-match merging heuristic. The two systems with
EPOS-models do perform well though, and are on par with the factored baseline, and
mostly better than the unfactored baseline. The marked and unmarked systems with
EPOS perform similarly, with no significant differences between them. It is thus hard to


Table 9
Translation results for different representation schemes on German Europarl. Significant
differences (positive or negative) from the baseline are marked *(5% level) and **(1% level), and
differences from baseline+POS are marked similarly with #.

                  Bleu            NIST             Meteor

Baseline          20.0 (0.3)      5.94 (0.04)      27.5 (0.2)
Baseline+POS      20.2 (0.1)**    5.93 (0.02)      27.6 (0.2)
Unmarked          19.7 (0.2)**##  5.90 (0.04)**#   27.4 (0.2)##
Marked            19.7 (0.2)**##  5.91 (0.02)**    27.5 (0.1)
Sepmarked         19.0 (0.1)**##  5.81 (0.02)**##  27.0 (0.0)**##
EPOS-unmarked     20.1 (0.0)**    5.97 (0.03)*     27.8 (0.1)**
EPOS-marked       19.9 (0.1)      5.96 (0.02)*     27.7 (0.1)*
SPOS-sepmarked    19.4 (0.3)**##  5.83 (0.01)**##  27.3 (0.2)*##

say which of these representation schemes is preferable, but at least it is clear that the use of EPOS-tags is useful.

A more detailed analysis was performed of the compound parts in the output. The
outcomes of the merging process were classified into four groups: known compounds;
novel compounds; parts of coordinated compounds; and unmerged, single parts. They
were further classified into good or bad outcomes. Compounds were judged as bad
if they formed non-words or had the wrong form, and compound parts were judged
as bad if they should have been merged with the next word, or did not work as a
standalone word.

Table 10 shows the results of this analysis. The majority of the merged compounds
are known from the training corpus for all representation schemes. There is, again,
a marked difference between the systems that use POS-match merging and the sep-
marked systems that do not have that information. The sepmarked system found the
highest number of novel compounds, but also had the highest error rate, which shows
that it is useful to match POS-tags. The EPOS systems have a higher number of novel
compounds than the marked and unmarked options without an EPOS model. All these
systems had a low error rate for the novel compounds, however. Very few errors were due to reverse normalization; in the EPOS-unmarked system there were only three such errors.

Table 10
Analysis of merged compounds from different representation schemes. The sepmarked systems
do not do any matching, and can thus not leave any single parts.

                    EPOS-     EPOS-    SPOS-
                    unmarked  marked   sepmarked  unmarked  marked  sepmarked

Known               3,339     3,375    3,594      3,747     3,587   3,762
Novel        Good     168       105      176        104        93     245
             Bad       20         8       97         10         7      64
Coordinated  Good      43        42       43         42        44      37
             Bad        9         3        9         22         5       7
Single part  Good       6         5       --        136        33      --
             Bad       11        16       --         52        46      --

Total               3,596     3,554    3,919      4,113     3,815   4,115


Generally, the percentage of bad parts or compounds is lower for the systems with a
*POS-sequence model, which shows that the sequence model is useful for the ordering
of compound parts. The number of single compound parts is also much higher for the
systems without a POS sequence model.

The number of compounds found by the splitting algorithm in the German ref-
erence text was 4,472. All systems produce fewer compounds than this number. The
numbers in Table 10 cannot be directly compared to the baseline system, however,
because we do not know which words in its output are compounds.

An indication of how many compounds there are in a text is the number of long
words. In the reference text there are 231 word types with at least 20 characters and
1,178 word types with at least 15 characters, which we used as the limits for long words.
Other length limits give similar results. Table 11 gives data on absolute numbers of
these long word types and recall, precision, and F-score compared with long words
found in the reference for the different systems. The distribution of long words in the
systems with compound processing generally follows the distribution of compounds
from Table 10, which indicates that this is a good indicator of the number of compounds.
Both baseline systems have fewer long words than all compound processing systems, indicating that the split-merge strategy does indeed help in increasing the number of compounds in the translation output. The sepmarked systems had the highest number of long words; for words of at least 20 characters they even exceed the reference. The EPOS systems match the reference count for words of at least 20 characters, but are somewhat below it for words of at least 15 characters. The EPOS systems also have the highest recall of all systems for both character lengths. The F-scores are similar for the baseline systems and the EPOS systems, with the baselines higher on precision, and the EPOS systems higher on recall. Although the baselines have a higher precision, the absolute number of overlapping words is still higher in many of the systems with compound processing; for instance, for words of at least 15 characters there are 431 overlapping word types in EPOS-unmarked compared with 388 in baseline+POS. It is important to notice, however, that the reference is not a trustworthy gold standard, because there are many possible alternative good translations, and the real quality of long words is therefore likely underestimated. Thus, as can be seen in Table 10, the clear majority of produced compounds in the EPOS systems are judged acceptable, even though the precision of overlap for long words with the reference is at most 47% for the EPOS models.
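A sketch of this long-word evaluation, under the assumption that word types are compared by exact string match (our own code):

# Sketch: precision, recall, and F-score of long word types against the
# reference, for a given minimum length (15 or 20 characters in Table 11).
sub long_word_prf {
    my ($sys_text, $ref_text, $minlen) = @_;
    my (%sys, %ref);
    $sys{$_} = 1 for grep { length($_) >= $minlen } split /\s+/, $sys_text;
    $ref{$_} = 1 for grep { length($_) >= $minlen } split /\s+/, $ref_text;
    my $overlap = grep { $ref{$_} } keys %sys;
    my $p = keys(%sys) ? $overlap / keys(%sys) : 0;
    my $r = keys(%ref) ? $overlap / keys(%ref) : 0;
    my $f = ($p + $r) ? 2 * $p * $r / ($p + $r) : 0;
    return ($p, $r, $f);
}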

Table 11
Number of long word types in the translations, and R(ecall), P(recision), and F(-score) compared
with long word types in the reference.

                 ≥20 chars                 ≥15 chars
                 Number  R    P    F       Number  R    P    F

Reference        231     --   --   --      1,178   --   --   --
Baseline         157     .27  .49  .35     743     .33  .53  .41
Baseline+POS     151     .26  .49  .34     733     .33  .53  .41
Unmarked         192     .27  .40  .32     826     .34  .49  .40
Marked           217     .29  .38  .33     862     .35  .48  .40
Sepmarked        279     .28  .28  .28     1,031   .33  .38  .35
EPOS-unmarked    231     .30  .37  .33     936     .36  .46  .40
EPOS-marked      233     .30  .37  .34     895     .36  .47  .41
SPOS-sepmarked   290     .30  .30  .30     1,043   .34  .39  .36


Table 12
Results for Swedish Europarl. Significant improvements over baseline+POS are marked
**(1% level).

System           Bleu          NIST           Meteor

Baseline+POS     21.6 (0.0)    6.11 (0.00)    57.8 (0.1)
EPOS-unmarked    22.0 (0.1)**  6.14 (0.01)    58.3 (0.1)**
EPOS-marked      21.9 (0.0)    6.21 (0.00)**  58.3 (0.0)**

8.3.2 Experiment 2. To further test whether a compound processing strategy using a
customized tagset is useful, we performed experiments on an additional language pair,
English–Swedish. In this case we always used the successful EPOS-tagset, in combina-
tion with either the marked or unmarked representation of compounds.

Table 12 shows the results for translation into Swedish. Both systems with compound processing have higher scores on all metrics. The difference is significant on Meteor, and on either Bleu or NIST, for the two systems with compound processing. The only significant difference between using marked and unmarked representations is that the marked system is better on NIST.

To investigate compound translation specifically, we manually classified the system
translations corresponding to the first 100 compounds in the reference text that had
a clear counterpart in the English source text. As good translations we counted both translations identical to the reference and alternative translations with the same meaning; the latter we further divided into other compounds, single words, and other constructions. We also had a category for word groups that were translated as separate words but should have been compounded: split compounds.

The result of this evaluation can be seen in Table 13. There are more translations
that are identical to the reference in the two systems with splitting, but the total number
of identical and alternative translations is approximately the same in the three systems.
The number of split compounds is higher in the baseline system. The unmarked system
produces more split compounds and partial translations than the marked system. This
can be seen as an indication of marking having an effect, which, however, is not clear in
the automatic evaluation.

We also investigated the quality of the merged compounds, especially with regard
to the reverse normalization that is needed in the unmarked systems, where compound-
ing forms were normalized. There were no merging errors in the marked system. In the

Table 13
Analysis of the translations of 100 source items yielding compounds in a Swedish reference text.

                  Baseline+POS  EPOS-Unmarked  EPOS-Marked

Identical comp.   48            53             57
Alt. compound     14             9             10
Alt. word         16            16             12
Alt. other         9             8              9
Split compound     7             5              3
Partly transl.     4             7              4
No equivalent      0             0              2
OOV                2             2              3


Table 14
Results of coalescence experiment on the automotive corpus. Scores that are significantly
different from the baseline are marked *(5% level), and differences from Baseline+POS are
marked with #.

              Auto Danish                            Auto Swedish
              Bleu         NIST         Meteor       Bleu         NIST            Meteor

Baseline      81.3 (0.0)   9.74 (0.01)  89.5 (0.1)   67.7 (0.3)   9.96 (0.02)     85.4 (0.1)
Baseline+POS  81.4 (0.1)   9.73 (0.02)  89.5 (0.1)   67.8 (0.1)   9.94 (0.03)     85.1 (0.1)
EPOS          80.6 (0.2)*  9.67 (0.02)  89.1 (0.2)   68.5 (0.2)   10.06 (0.01)*   85.5 (0.2)
RPOS          80.9 (0.2)   9.69 (0.03)  89.4 (0.2)   68.5 (0.2)*  10.07 (0.02)*#  85.8 (0.1)#
boost         81.0 (0.3)   9.72 (0.03)  89.7 (0.2)   68.3 (0.1)   10.05 (0.01)*   85.7 (0.0)#
punish        80.7 (0.1)   9.70 (0.01)  89.4 (0.1)   68.3 (0.2)   10.04 (0.01)*   85.7 (0.1)#
RPOS+boost    81.0 (0.1)   9.73 (0.01)  89.7 (0.2)   68.2 (0.2)   10.04 (0.02)    85.7 (0.2)
RPOS+punish   81.0 (0.1)   9.71 (0.03)  89.6 (0.2)   68.2 (0.2)   10.04 (0.03)*   85.6 (0.2)#

unmarked system there were only two minor errors due to failed reverse normalization.
The first error is a missing insertion of an +s for medlem/+s/länder (member countries), and in the second case, *samhäll/-e+s/politiska (socio-political), a combination change -e/+s is wrong.

8.3.3 Experiment 3. In the third set of experiments with factored translation models we
investigated the RPOS tagsets and the use of count features for the Danish and Swedish
automotive corpora and German Europarl. Compound parts were merged using the
POS-match heuristic.

Results on the two automotive corpora are shown in Table 14. The scores are very
high, which is due to the fact that it is an easy domain with many repetitive sentence
types. In this case there are no significant differences between the two baselines when
using a POS-model. For Swedish, the results are overall higher for the systems with
compound splitting, a difference that is significant for some systems and metrics. Over-
all, the RPOS system performs best for Swedish, with results significantly better than at
least one baseline on all metrics. For Danish, on the other hand, the only significant
difference between any baseline and system with splitting is for the EPOS system,
which is slightly worse than the baseline with POS. For the other systems there are
no significant differences to the baseline.

Table 15 shows results using RPOS for German Europarl. For this corpus neither the
RPOS model nor the boost and punish models works well. They all give significantly
worse results than the baseline. As we have seen before, however, the EPOS model gives
competitive results to the baseline for this corpus.

Table 15
Results of coalescence experiment on German Europarl. Scores that are significantly different
from the Baseline+POS are marked with **(1% level).

System         Bleu          NIST           Meteor

Baseline+POS   20.2 (0.1)    5.93 (0.02)    27.6 (0.2)
EPOS           20.1 (0.1)    5.97 (0.03)    27.7 (0.1)
RPOS           16.5 (0.1)**  5.32 (0.03)**  25.0 (0.2)**
boost          16.3 (0.2)**  5.29 (0.03)**  24.8 (0.3)**
punish         16.6 (0.1)**  5.33 (0.03)**  25.1 (0.2)**


Overall, we found that it was successful to use *POS-sequence models in a factored
translation model in order to handle compound coalescence. Our best models with
compound splitting perform at least on par with the baseline, and sometimes better. We
also showed that there are other advantages to using compound processing, especially that the number of long words is close to that of the reference, which is not the case for our baseline systems, which produce too few long words. We also showed that *POS-
sequence models are essential for the compound processing approach to be successful.
On Europarl the EPOS-tagset was the most successful, whereas the RPOS-tagset and
count features could help for the automotive domain.

8.4 Influence of Splitting Strategies

In the following experiments we investigated how compound splitting strategies of
different quality influenced the suggested compound processing method. In order to do
this we performed an intrinsic evaluation of several splitting methods, and compared
it to translation results using the same splitting strategies. In this experiment we used
German Europarl.

We used a modified version of the splitting strategy of Koehn and Knight (2003) as a basis for the comparison, and applied it to all nouns, adjectives, adverbs, and verbs of at least six characters, splitting them into parts of at least three characters, by allowing all splits where the parts were found in a corpus and were tagged as content words. Parts were allowed to be modified by all compound suffixes from Langer (1998). The best splitting option, which can be no split, was chosen based on the arithmetic mean of the corpus frequencies of the parts. We also imposed the restriction on the compound head that its part-of-speech tag needs to be the same as for the full word. We call this system arith. In other variants of splitting strategies, one feature at a time was changed, based on the arith system:

geom – using the geometric mean instead of the arithmetic mean
eager – choosing the split with the highest number of parts, instead of using the

arithmetic mean

part4 – limiting the length of compound parts to four characters
max2 – limiting the maximum number of parts per compound to two
common – only allowing the common compound suffixes -s, -es, -n, -nen
anypos – not using part-of-speech tags

The corpus frequencies were gathered from the target side of the training corpus. We
compared the systems with compound splitting with a factored baseline system without
any compound processing.
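A simplified sketch of the arith criterion, restricted to binary splits and ignoring suffix variants and POS constraints (all names are ours):

# Simplified sketch of the arith splitting criterion: try all split points,
# score each candidate by the arithmetic mean of the parts' corpus
# frequencies, and keep the whole word when no split scores higher. The real
# method also handles compound suffixes, POS constraints, and more parts.
sub best_binary_split {
    my ($word, $freq) = @_;                  # $freq: hash ref of corpus counts
    return ($word) if length($word) < 6;     # minimum length six characters
    my $best_score = $freq->{$word} // 0;
    my @best = ($word);
    for my $i (3 .. length($word) - 3) {     # parts of at least three characters
        my ($p1, $p2) = (substr($word, 0, $i), substr($word, $i));
        next unless ($freq->{$p1} // 0) && ($freq->{$p2} // 0);
        my $score = ($freq->{$p1} + $freq->{$p2}) / 2;   # arithmetic mean
        if ($score > $best_score) { $best_score = $score; @best = ($p1, $p2); }
    }
    return @best;
}
# best_binary_split('goldring', { goldring => 2, gold => 50, ring => 40 })
#   returns ('gold', 'ring'), since (50 + 40) / 2 = 45 > 2.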

To measure the success of the different compound splitting algorithms we per-
formed an intrinsic evaluation. We used the gold standard from Stymne (2008), created
by manually annotating the first 5,000 words of the test text for one-to-one correspon-
dence with the English reference text, similar to Koehn and Knight (2003). A one-to-
one correspondence occurs when the words in a German compound are translated
as separate words in English. In addition there can be inserted function words. As
an example, Medienfreiheit is in one-to-one correspondence with freedom of the media, since the two German parts Medien and Freiheit correspond to the two separate words media and freedom. The two function words of and the are ignored. Out of the 5,000 words, 174 were compounds in one-to-one correspondence with English. The result of the one-to-one evaluation is shown in Table 16. The same metrics and categories as in Koehn
and Knight (2003) were used:

correct split: words that were correctly split
correct not: words that should not be split and were not
wrong not: words that should be split but were not
wrong faulty: words that were split but in an incorrect way
wrong split: words that should not be split but were
precision: (correct split) / (correct split + wrong faulty + wrong split)
recall: (correct split) / (correct split + wrong faulty + wrong not)
accuracy: (correct) / (correct + wrong)
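For concreteness, with the arith counts from Table 16 this gives precision = 99 / (99 + 52 + 323) = .209, recall = 99 / (99 + 52 + 22) = .572, and accuracy = (99 + 4,504) / 5,000 = .921.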

The splitting options have their strengths on different metrics, with three different
methods having the best results for the three metrics used. Compared to the arith
method it can be seen that both imposing length restrictions and using the geometric
mean increases the results on all three metrics. Limiting the number of parts gives
the highest recall. Using only common compound suffixes gives higher precision, and
not using part-of-speech gives lower precision. The baseline system without splitting
actually has the highest accuracy. To a large extent this is due to the fact that the test set
is taken from running text, with a high number of non-compounds.

Overall, the results for all splitting variations are quite low. However, most of the
strategies tend to split too much rather than too little, which is preferable, because the
phrase-based translation system has the possibility of recovering from over-splitting
by handling compounds that are split into too many parts in consistent phrase pairs.
Evaluation towards a one-to-one gold standard is rather harsh because the splitting
algorithm has no knowledge of the corresponding English structure. If we instead use a
gold standard of linguistically motivated compounds, which are many more (545 for the
5,000 word set), precision is much higher for all systems, with mostly a lower recall. For
the arith system, for instance, the precision is more than doubled at .597 with a slightly
lower recall of .522.

The results for the translation task from English into German are shown in
Table 17. In this experiment we used a different selection of 701K training sentences
from Europarl than in Section 8.3, which resulted in overall lower scores both for the
baseline and for the systems with compound processing. Overall, there are very small
differences between both the baseline system and the systems with different types of
splitting, and most of the small differences between the systems are not significant. The

Table 16
One-to-one correspondence of split compounds compared with a manually annotated gold
standard for the different splitting methods.

           Correct         Wrong                  Metrics
           split  not      not  faulty  split     Precision  Recall  Accuracy

baseline     0    4,826    174    0       0       --         0       .966
arith       99    4,504     22   52     323       .209       .572    .921
geom       109    4,614     33   31     213       .309       .630    .945
eager       43    4,243     18  112     584       .058       .249    .857
part4      120    4,692     36   17     135       .441       .693    .962
max2       133    4,546     29   11     281       .313       .769    .936
common      99    4,714     58   16     113       .434       .572    .963
anypos     100    4,216      8   65     611       .128       .578    .863


Table 17
Translation results using different splitting methods for German Europarl. Significant changes
from the baseline are marked *(5% level) and **(1% level).

               Bleu          NIST           Meteor

baseline+POS   19.1 (0.1)    5.79 (0.02)    26.8 (0.0)
arith          19.0 (0.0)    5.79 (0.01)    26.6 (0.1)
geom           18.9 (0.1)*   5.78 (0.02)    26.5 (0.2)**
eager          18.8 (0.1)**  5.80 (0.01)    26.6 (0.1)
part4          18.8 (0.1)**  5.79 (0.01)    26.5 (0.1)*
max2           18.8 (0.1)    5.80 (0.01)    26.6 (0.1)
common         18.9 (0.1)    5.78 (0.01)    26.6 (0.1)
anypos         18.9 (0.1)    5.75 (0.01)**  26.6 (0.1)

arith, max2, and common systems are on par with the baseline on all metrics, with no
significant differences, whereas the other systems are worse than the baseline on at least
one metric.

On the intrinsic evaluation there were quite large differences between the systems,
which are not found on the translation task. This is similar to previous work by Koehn
and Knight (2003) for translation in the other direction, where systems similar to the ea-
ger and geom system performed similarly on a translation task, while the eager system
was much worse on their intrinsic evaluation. In our evaluation only the eager system
is significantly worse than any other system on Bleu, where it is worse than the baseline
and the arith system. The arith system overall is competitive with both the baseline and
the other split systems, despite the relatively low results on the intrinsic evaluation.

Overall, it seems that the translation task is not very sensitive to the quality of the
splitting strategy. As in previous research (Koehn and Knight 2003; Stymne 2008) there
are no clear relations between the intrinsic evaluation and the MT evaluation. Both
systems with low intrinsic accuracy (such as anypos) and with high accuracy (such as
geom) tend to be worse than other systems on at least some MT metrics, even though the
differences are small. We thus think that for the task of translation into compounding
languages, the MT performance cannot be predicted based on intrinsic evaluations.

In a previous similar study using a smaller corpus (Stymne 2008), we found
somewhat bigger differences between the systems, and in that study most of the
systems with splitting were better than the baseline. In that study, too, the arith system
was among the best performing on the MT task, even though many of the differences
to the other systems with splitting were not significant. Because the arith strategy
has consistently given good results for the MT task, we chose to use it in our other experiments. There would, however, have been other reasonable choices of splitting,
such as using the max2 system.

8.5 Compound Merging

In the following studies we investigated the impact of the compound merging strategy
used. We compared previously suggested strategies to the new strategies we have
developed. We first present results for heuristic merging strategies, and then go on to
compare the best heuristic methods to a sequence labeling merging strategy. For the
heuristic merging we used German Europarl, and for the comparison with sequence
labeling we used German Europarl and the automotive corpus for Swedish and Danish.
In these experiments we used the arith compound splitting method from Section 8.4.


8.5.1 Heuristic Merging. Three main types of merging algorithms were investigated in this study. The first group, inspired by Popović, Stein, and Ney (2006), is based on frequency lists of words or compounds, and of parts, compiled at split-time. The sec-
ond group uses symbols to guide merging, inspired by work on morphology merging
(Virpioja et al. 2007). The third group takes *POS-tags for compounds into account, so
that merging takes place if and only if the part-of-speech tags match. In addition, we
examined combined merging methods, using either the union or intersection of merges
proposed by different methods. We also extended the list- and symbol-based methods
by a restriction that the head of the compound should have a compounding part of
speech, that is, a noun, adjective, or verb. By using these additions and also combina-
tions of the main algorithms, a total of eleven algorithms were explored, as summarized
in Table 18. For all algorithms, compounds can have an arbitrary number of parts.
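As an illustration of the third group, a minimal sketch of POS-match merging; the tag format and function are our own simplification of the *POS-tags described earlier:

# Minimal sketch of POS-match merging over a tagged token sequence. Tags of
# the form COMP-X mark compound modifier parts of word class X; parts are
# merged with what follows only if the tags match.
sub pos_match_merge {
    my @tok = @_;                           # list of [surface, tag] pairs
    my @out;
    my ($buf, $cls) = ('', '');
    for my $t (@tok) {
        my ($w, $tag) = @$t;
        if ($tag =~ /^COMP-(.+)$/) {        # a compound modifier part
            my $c = $1;
            if ($buf ne '' && $c ne $cls) { # class mismatch: leave part single
                push @out, $buf;
                $buf = '';
            }
            $buf .= $w;
            $cls = $c;
            next;
        }
        if ($buf ne '') {
            if ($tag eq $cls) { push @out, $buf . $w }   # tags match: merge
            else              { push @out, $buf, $w }    # mismatch: keep apart
            $buf = '';
        }
        else { push @out, $w }
    }
    push @out, $buf if $buf ne '';          # trailing unmerged modifier
    return @out;
}
# pos_match_merge(['gold', 'COMP-N'], ['ring', 'N'])  yields  ('goldring')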

In these experiments we used the unmarked representation scheme, which means
that compound modifiers were normalized, but not marked with any symbol, and we
used the EPOS-tagset. To handle normalization, we used the reverse normalization pro-
cess in combination with all merging strategies. If there were any compound modifiers
that could not be combined with the next word, in any of the algorithms, that part was
left as a single word. For frequency calculations for the list-based strategies we used the
German part of the SMT training corpus.

Table 19 shows the translation results using the different merging algorithms. The
merging methods based on lists all give significantly worse results than the baseline
on all metrics, except when combined with the symbol method. The compound-list+
method is the best method of the systems that only uses list information; it is signif-
icantly better than word-list and compound-list on all metrics, but still significantly
worse than the baseline, symbol, and POS-match systems. When using the head-pos
restriction, the results get even better, but here POS-tag information is used in addition

Table 18
Merging algorithms.

Name                          Description

word-list                     Merges each token that has been seen as a compound part
                              with the next part if it results in a known word
word-list + head-pos          As word-list, but only words where the last part is a noun,
                              adjective, or verb are merged
compound-list                 As word-list, but for known compounds from split-time,
                              not for all known words
compound-list+                As compound-list, but also includes a check that the
                              frequency of the resulting compound is higher than the
                              frequency of the unmerged bigram
symbol                        Merges all tokens that are marked as compound modifiers
                              with the next token
symbol + head-pos             As symbol, but only merges words where the head is a noun,
                              adjective, or verb
symbol ∧ word-list            Merges marked modifier parts if the result is a known word
POS-match                     Merges tokens with a compound part-of-speech tag with the
                              next token, if their tags match
POS-match + coord             As POS-match, but also adds a hyphen if a compound
                              modifier is followed by the conjunction und [and]
POS-match ∧ compound-list+    Merges tokens if they are selected by both of these methods
POS-match ∨ compound-list+    Merges tokens if they are selected by either of these methods


Table 19
Translation results for German Europarl using an EPOS-model and the unmarked representation
scheme. Significant differences (positive or negative) from the baseline are marked **(1% level).

                             Bleu          NIST           Meteor

baseline+POS                 20.2 (0.1)    5.93 (0.02)    27.6 (0.2)
word-list                    17.9 (0.0)**  5.70 (0.02)**  25.8 (0.1)**
word-list + head-pos         19.3 (0.0)**  5.83 (0.03)**  27.1 (0.1)**
compound-list                18.1 (0.0)**  5.61 (0.02)**  26.0 (0.1)**
compound-list+               18.7 (0.0)**  5.66 (0.03)**  26.6 (0.1)**
symbol                       20.0 (0.0)    5.96 (0.03)    27.7 (0.1)
symbol + head-pos            20.0 (0.0)    5.95 (0.03)    27.8 (0.1)
symbol ∧ word-list           20.0 (0.0)**  5.95 (0.03)    27.8 (0.1)
POS-match                    20.1 (0.1)    5.97 (0.03)    27.7 (0.1)
POS-match + coord            20.1 (0.0)    5.97 (0.03)    27.8 (0.1)
POS-match ∧ compound-list+   19.0 (0.1)**  5.70 (0.03)**  26.7 (0.1)**
POS-match ∨ compound-list+   18.8 (0.0)**  5.74 (0.03)**  26.3 (0.1)**

to only frequency list information. There are no significant differences between the baseline and the systems with symbol or POS-match merging, but the trend on most metrics is that the POS-match systems have slightly higher scores than the other methods. However, as shown in Section 8.3 (Table 10), POS-match does block some erroneous merging. Especially with EPOS the order is so good that the matching constraint is rarely needed. Adding treatment of coordinated compounds to the POS-match algorithm changes scores marginally, but again, as shown in Table 10, it does improve merging for coordinated compounds. The combination of POS-match and compound-list+ was not successful, neither as a union nor as an intersection of the strategies. Table 19 only shows results with an EPOS-model. The result pattern for systems without an EPOS-model is similar, but with lower scores.
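A sketch of the compound-list+ test described in Table 18 (the data structures are our own):

# Sketch of the compound-list+ test: merge two tokens only if their
# concatenation was recorded as a compound at split-time and the compound is
# more frequent in the training corpus than the corresponding bigram.
sub compound_list_plus_ok {
    my ($t1, $t2, $known, $wordfreq, $bigramfreq) = @_;   # hash refs
    my $comp = $t1 . $t2;
    return 0 unless $known->{$comp};                      # known compound?
    return ($wordfreq->{$comp} // 0) > ($bigramfreq->{"$t1 $t2"} // 0);
}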

Table 20 shows the number of merges performed by applying the different algo-
rithms. The word-list–based method produces the highest number of merges, including
many cases that should not be merged. The number of merges is greatly reduced by the
head-pos restriction, or by combining the word-list method with some other merging

Table 20
Number of merges for the different merging algorithms. The numbers can be compared with the
number of compounds found by the splitting algorithm in the reference text, which is 4,472.

                             with EPOS  without EPOS

word-list                    5,275      5,897
word-list + head-pos         4,161      4,752
compound-list                4,460      5,116
compound-list+               3,843      4,427
symbol                       4,431      5,144
symbol + head-pos            4,323      4,832
symbol ∧ word-list           4,178      4,753
POS-match                    4,363      4,867
POS-match + coord            4,361      4,865
POS-match ∧ compound-list+   3,502      4,018
POS-match ∨ compound-list+   4,706      5,281


strategy. An investigation of the output of the word-list–based method shows that it
often merges common words that incidentally form a new word, such as bei [at] and der
[the] to beider [both]. Another type of error is due to errors in the corpus, such as merging
umwelt [environment] and und [and], which exists as a mistyped word in the corpus, but is
not a correct German word. These two error types are often prohibited by the head-pos
restrictions and by the symbol and POS-match algorithms. The extensions of the simple
word-list method avoid these types of errors, resulting in a lower number of merges.
Apart from the intersection method, the compound-list+ method has the lowest number of merges, but it is still the best of the methods based only on frequency lists, without the use of POS or symbols.
With the EPOS-model there are very small differences in number of merges between
the methods based on symbol and POS-match, although this difference is much larger
without the EPOS-model. For instance, the difference is 68 merges between symbol and
POS-match with the EPOS-model, whereas without the EPOS-model the difference is
277. This difference shows that the EPOS-model clearly helps in improving coalescence.
The choice of merging method thus has a large impact on the final translation re-
sults. For merging to be successful we found that some knowledge source that is passed
through the translation process, such as EPOS-tags, is needed. The pure word-list–based
method performed the worst of all systems in most cases. On the automatic metrics, the
POS-match and symbol methods gave about the same results as the baseline; we believe
there are other strengths to these methods as compared with the baseline system, however,
such as the production of a higher number of compounds. A further advantage of the
POS-match strategy is that it can form novel compounds, which is not the case for either
the baseline or the word-list–based methods.

8.5.2 Merging as Sequence Labeling. Finally, we performed experiments where we com-
pared our suggestion of a sequence labeling merging method to heuristic merging.
We chose two heuristic strategies, the POS-match strategy, which gave the overall best
results, and compound-list+, which gave the best results of the word-list based merging
strategies without any POS or symbol information. We also found that the intersection
of these two strategies worked well on some of our corpora, and decided to include that
as well. In addition, we used a reference-based heuristic for creating the CRF training
data, which identified merges resulting in compounds that are present in the reference
data. Such a heuristic cannot be used at translation-time, however, because there is
no reference data then. In this section we will refer to these heuristics as list for the
compound-list+ heuristic, POS for the POS-match strategy, and ref for the reference-
based heuristic.

For these experiments we used three data sets, as summarized in Table 21. For
the training data for the sequence labeler we translated all of the training data, except

Table 21
Overview of the experimental settings for comparing heuristic and sequence labeling merging.
Training sentences are taken from translated SMT training data, except extra data, which was
not used for SMT training.

Corpus                        Europarl German  Auto Swedish  Auto Danish

Compounds split               N, V, Adj        N, V, Adj     N
POS tagsets                   POS              POS           RPOS
Training sentences CRF        249,388          317,398       164,702
Extra training sentences CRF  --               --            163,201


one-word sentences, which were filtered out, for the smaller automotive corpora. For
the Europarl experiment, we limited the training data to the first 250,000 sentences, and again filtered out one-word sentences. For the Danish corpus we also had access to additional data, which we used to control for the effect of reusing
SMT training data. For the machine learning we wanted to use separate development
and test sets. We chose to use the test sets from Table 4 for development testing, where
we used the first 1,000 sentences for German Europarl, and picked new unseen 1,000
sentence test sets for the final evaluation. For frequency calculations of compounds and
compound parts that were needed for compound splitting and some of the compound
merging strategies, we used the target side of the SMT training data.

In these experiments we used the unmarked representation scheme. For German
Europarl we used the unmarked scheme as it is with a binary sequence labeler. Reverse
normalization can be performed afterwards as for the other strategies. For the automo-
tive corpus we used the unmarked scheme without normalization.

We compared alternative combinations of the heuristics on our three validation
data sets (see Figure 4). In order to estimate the amount of false negatives for all three
heuristics, we inspected the first 100 sentences of each validation set, looking for words
that should be merged, but were not marked by any of the heuristics. In no case could
we find any such words, so we assumed that, between them, the heuristics can find the overwhelming majority of all compounds to be merged. The differences across domains, Europarl and automotive, are quite striking, with a much higher number of compounds found by the reference-based heuristic for the automotive corpora.

We conducted a round of preliminary experiments to identify the best combination
of the heuristics available at training time (compound-list+, POS-match, and reference-based) to use to automatically create the training data for the CRF. The best results on
the validation data are obtained by different combinations of heuristics for the three
data sets, as could be expected by the different distribution of errors in Figure 4. In the
following experiments we trained the CRF using for each data set the combination of
heuristics corresponding to leaving out the gray portions of the Venn diagrams. This
sort of preliminary optimization requires hand-labeling a certain amount of data. Based
on our experiments, skipping this optimization and just using ref∨POS (the optimal
configuration for the German–English Europarl corpus) seems to be a reasonable alter-
native.
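For concreteness, a sketch of how the resulting labels can be written out in the column format used by CRF++ for training (features in columns, label last, blank line between sequences); the two toy features here stand in for the full feature set described in Section 7.3, and the data is our own invention:

use strict; use warnings;
my @labeled = (                            # toy data, not from the article
    [ ['gold', 'lower', 'MERGE'],
      ['ring', 'equal', 'KEEP' ] ],
);
for my $sentence (@labeled) {
    for my $token (@$sentence) {
        print join("\t", @$token), "\n";   # e.g. "gold<TAB>lower<TAB>MERGE"
    }
    print "\n";                            # sequence boundary
}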

The validation data was also used to set a frequency cut-off for feature occurrences
(set at 3 in the following experiments) and to tune the regularization parameter in
the CRF objective function. Results are largely insensitive to variations in these hyper-
parameters, especially to the CRF regularization parameter.

For the Danish automotive corpus we had access to training data that had not been
used to train the SMT system, so we could test the performance of the CRF when trained
on that data as compared with training on the possibly biased data that was used to
train the SMT system. Table 22 shows the results with the two types of training data.
For the smallest training size, 20,000 sentences, the unseen training data is significantly
better on recall and F-score. For the other sizes, there are no significant differences when
the same amount of training data is used. When we added the unseen data to the SMT
data, we only saw small non-significant improvements. This indicates that it is feasible
to use translated SMT training data for the sequence labeler. We still decided to use all
available training data for our subsequent experiments on Danish.

The overall merging results of the heuristics and the best sequence labeler are
shown in Table 23. Notice how the list and POS heuristics have complementary sets
of false negatives: when merging on the union of the two heuristics, the number of false


[Figure 4 appears here: three Venn diagrams, one per corpus (Europarl, German;
Automotive, Swedish; Automotive, Danish).]

Figure 4
Evaluation of the different heuristics on validation files from the three corpora. The number
in each region of the Venn diagrams indicates the number of times a certain combination of
heuristics fired (i.e., the number of positives for that combination). The two smaller numbers
below indicate the number of true and false positives, respectively. Venn diagram regions
corresponding to unreliable combinations of heuristics have corresponding figures on a
gray background. OK means that a large fraction of the Venn cell was inspected, and no false
positive was found.

negatives decreases drastically, in general compensating for the inevitable increase in
false positives.

Among the heuristics, the combination of the improved list heuristic and the POS-
based heuristic has a significantly higher recall and F-score than either heuristic alone
in all cases on test data, and in several cases on validation data. The list heuristic alone
performs reasonably well on the Swedish data set, but has a very low recall on the
German and Danish data sets. In all four cases the SMT training data has been used for
the list used by the heuristic, so this is unexpected, especially considering the fact that


Table 22
Experimental results for CRF on Danish Automotive, with different training data and sizes.
SMT is the same data that was used to train the SMT system, and new is additional data.

Data            Size          Precision  Recall  F-score

SMT             20K           .977       .970    .974
SMT             50K           .979       .977    .978
SMT             100K          .977       .980    .978
SMT             165K          .977       .986    .982

new             20K           .977       .984    .981
new             50K           .975       .986    .981
new             100K          .978       .991    .984
new             163K          .978       .991    .984

all SMT + new   165K + 20K    .978       .991    .984
all SMT + new   165K + 50K    .982       .993    .988
all SMT + new   165K + 100K   .980       .991    .985
all SMT + new   165K + 163K   .978       .993    .985

Table 23
Precision, Recall, and F-score for compound merging methods based on heuristics or sequence
labeling on validation data and on held-out test data. The markers in parentheses indicate the
systems that are significantly worse than the system in question (l: list, p: POS, lp: list∨POS,
c: best CRF configuration).

                     Validation data                      Test data
                     Precision   Recall      F-score      Precision   Recall       F-score

German Europarl
list                 .967        .724        .828         .952        .717         .818
POS                  .991 (l,lp) .978 (l)    .984 (l)     .997 (l,lp) .980 (l)     .989 (l)
list∨POS             .967        .999 (l,p,c) .983 (l)    .963 (l)    1 (l,p,c)    .981 (l,p)
CRF (ref∨POS)        .982 (l,lp) .990 (l,p)  .986 (l,p)   .983 (l,lp) .996 (l,p)   .989 (l,lp)

Swedish auto
list                 .989 (p,lp) .994 (p)    .991 (p)     .925        .760         .835
POS                  .976        .963        .969         .981 (l,lp) .964 (l)     .972 (l,lp)
list∨POS             .972        1 (p)       .986 (p)     .925        .986 (l,p)   .955 (l)
CRF (ref∨list)       .987 (p,lp) .998 (p)    .993 (p,lp)  .978 (l,lp) .993 (l,p)   .985 (l,p,lp)

Danish auto
list                 .990        .977        .984         .991 (lp)   .764         .863
POS                  .992 (lp)   .974        .983         .978        .929 (l)     .954 (l)
list∨POS             .982        .998 (l,p,c) .990 (l,p)  .976        .988 (l,p,c) .982 (l,p,c)
CRF (ref∨list∨POS)   .987        .987        .987         .978        .966 (l,p)   .972 (l,p)

the Danish data set is in the same domain as one of the Swedish data sets. The Danish
training data is smaller than the Swedish data though, which might be an influencing
factor. It is possible that this heuristic could perform better on the other data sets given
more data for frequency calculations.

The sequence labeler is competitive with the heuristics; on F-score it is only signif-
icantly worse than any of the heuristics once, for Danish auto test data, and in several
cases it has a significantly higher F-score than some of the heuristics. The sequence
labeler has a higher precision, significantly so in four cases, than the best heuristic, the
combination heuristic, which is positive (because erroneously merged compounds are
usually more disturbing for a reader or post-editor than non-merged compounds).

The differences between the systems on MT metrics are generally small and mostly
not significant, as could be expected, since the differences were small on the intrinsic
evaluation. For German Europarl only 10–739 out of 29,786 words are actually different
between any pair of systems. The only noticeable differences between the systems are
that the list system is worse than the other systems for Danish and German. The list strategy is competitive for Swedish automotive, which is much bigger than the Danish corpus, and it was also competitive in one of our previous experiments with a large Swedish Europarl corpus (Stymne and Cancedda 2011), confirming our intuition that the list strategy tends to work better with large corpora. Even though we do not see much effect on MT metrics, we still think the small differences on the intrinsic evaluation are important. Compound merging is a postprocessing step, which operates directly on the actual translation output. This is quite different from compound splitting, as described in Section 8.4, which is performed as part of the preprocessing, where we cannot be sure of the impact of the preprocessing on the translation process.

The sequence labeler has the advantage over the heuristics that it is able to merge
completely novel compounds, whereas the list strategy can only merge compounds
that it has seen, and the POS-based strategy can create novel compounds, but only
with known modifiers. An inspection of the test data showed that there were some
novel compounds merged by the sequence labeler that were not identified with either
of the heuristics. In the test data we found knap+start (button start) and vand+nedsænkning
(water submersion) for Danish, and kvarts sekel (quarter century) and bostad(s)+ers¨attning
(housing grant) for Swedish. This confirms that the sequence labeler, trained from auto-
matically labeled data based on heuristics, can learn to merge new compounds that the
heuristics themselves cannot find.

8.6 Effects of Corpus Size and Domain

Most of the previous experiments were performed on relatively small data sets, and it
has been shown that effects of preprocessing strategies sometimes are larger on small
corpora (e.g., Nießen and Ney 2004; Popović and Ney 2006). To investigate this issue,
we first present a scaling experiment for small data sets from German Europarl, and
then report results on two large data sets for translation from English to German on
Europarl and news data.

Table 24 shows the results for baseline and EPOS systems trained on 100K and 440K
sentences. In the smaller data condition the results are similar for the baseline and the
EPOS system. With 440K sentences the EPOS system is significantly better than the
baseline on all three metrics. We thus see that our compound processing method works
well on medium-sized data sets as well, and is competitive with the baseline on very
small data sets.

Table 24
Results for translation on German Europarl with small training data sets. Significant
improvements over the respective baseline+POS are marked ** (1% level).

System              Bleu          NIST           Meteor
100K baseline+POS   16.8 (0.0)    5.37 (0.02)    24.6 (0.1)
100K EPOS           16.9 (0.1)    5.39 (0.02)    24.8 (0.1)
440K baseline+POS   19.3 (0.1)    5.76 (0.05)    26.7 (0.2)
440K EPOS           19.7 (0.2)**  5.83 (0.03)**  27.0 (0.2)**

Table 25
Overview of the corpora for the large evaluation sets, and the corpus sizes in sentences.
The names of the corpora refer to those used for the WMT evaluations. For the tuning
data only a subset of the full data was used.

                             WMT08                            WMT10
                     Name/type     Size         Name/type            Size
Translation models   Europarl      1,054,688    Europarl+NewsComm     1,462,401
Language model 1     Europarl      1,412,546    Europarl+NewsComm     1,885,872
Language model 2                                News                 17,449,589
Tuning data          dev2006             600    news-test2008             1,025
Test Set Europarl    test2007           2,000
Test Set News        nc-test2007        2,007   news-test2009             2,525

Table 26
Results on WMT08 data on Europarl and News test sets. Significantly different results from
the baseline are marked with * (5% level) and ** (1% level); significantly different results
from baseline+POS are marked similarly with #.

                              Europarl                                 News
System           Bleu         NIST           Meteor       Bleu           NIST          Meteor
Baseline         19.9 (0.1)   5.89 (0.01)    27.6 (0.0)   14.3 (0.3)     5.61 (0.09)   24.0 (0.5)
Baseline+POS     20.1 (0.1)*  5.90 (0.01)    27.8 (0.2)   14.4 (0.3)*    5.60 (0.09)   23.9 (0.4)
EPOS-unmarked    20.1 (0.0)   5.95 (0.01)*#  27.6 (0.2)   14.7 (0.4)**#  5.60 (0.07)   24.1 (0.6)*

In our final experiments we used larger data sets from the Workshop on Statistical
Machine Translation, years 2008 and 2010, which we will refer to as WMT08 and WMT10.7
The corpora used are presented in Table 25. In these experiments we also investigated
the effects of in-domain and out-of-domain data. For WMT08 we trained and tuned on
Europarl data only, and tested both in-domain, on Europarl, and out-of-domain, on News.
For WMT10 we trained on a mix of data, and tested on News. For these experiments
we used morphologically enriched POS-tagsets. For WMT08 we obtained the
morphology from a commercial dependency parser (Tapanainen and Järvinen 1997),
and for WMT10 we used RFTagger (Schmid and Laws 2008). For the systems with split
compounds we used the EPOS-tagset based on morphological tags. We used the
POS-match merging algorithm.

Table 26 shows the results on the WMT08 evaluation. On Europarl the system with
compound processing and an EPOS-model is significantly better than both baselines on
NIST, and on the News domain it is significantly better than both baselines on Bleu. On
the other metrics the EPOS system is on par with the baseline for both domains. The
differences between the two baselines are mostly small. Overall, the scores are much
lower on the out-of-domain data. Table 27 shows the results for the WMT10 evaluation,
where more data was used to train the systems. The EPOS system is significantly better
than the baseline without POS on all metrics, and also significantly better than the
baseline with POS on Bleu. This shows that the suggested compound processing is also
effective on larger data sets and on out-of-domain data.

7 http://www.statmt.org/wmt08/, http://www.statmt.org/wmt10/.

Table 27
Results on WMT10 data on News test set. Significance shown as in Table 26.

System           Bleu            NIST           Meteor
Baseline         13.5 (0.1)      5.29 (0.04)    21.0 (0.2)
Baseline+POS     13.8 (0.1)*     5.35 (0.01)    21.3 (0.2)
EPOS-unmarked    14.1 (0.3)**##  5.35 (0.07)**  21.4 (0.1)*

9. Conclusion

In this article, we have investigated several methods for splitting and merging com-
pounds when translating into languages with closed compounds, with the goal of
generating novel compounds. We have described methods for promoting coalescence
and for deciding if and how to merge words; these methods are either competitive with,
or superior to, any previously published method. A common trait of all these methods
is that they promote coalescence and allow the formation of novel compounds by using
customized *POS-tagsets. This shows that the integration of light-weight linguistic
information through POS-tags is useful for compound processing when translating
into compounding languages. We believe that some of the techniques, such as
customized POS-sequence models, could be used to target morphological phenomena
other than compounding as well. We also believe that the suggested methods are likely
to be useful for other compounding languages such as Finnish or Japanese.

For promoting compound coalescence we introduced additional LMs based on
customized POS-tagsets, and dedicated SMT model features counting the number
of sequences known a priori to be desirable or undesirable. Experiments showed
consistently good results when using the extended tagset EPOS in sequence models.
The other tagsets were less successful, but on the automotive corpora a restricted
tagset, RPOS, worked well both in sequence models and as count features.
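
As a minimal illustration of such count features (the tag names and the two bigram
sets below are hypothetical placeholders, not the actual tagset definitions used in our
experiments), the two counts can be computed directly from the POS sequence of a
hypothesis:

    # Minimal sketch of the two count features: occurrences of POS bigrams
    # known a priori to be desirable or undesirable. Both sets and the tag
    # names are hypothetical placeholders.
    DESIRABLE = {("NN-MOD", "NN"), ("NN-MOD", "NN-MOD")}  # modifier before a head
    UNDESIRABLE = {("NN-MOD", "VB"), ("NN-MOD", "EOS")}   # stranded modifier

    def count_features(pos_tags):
        """Return the two counts, exposed to the decoder as model features."""
        good = bad = 0
        for bigram in zip(pos_tags, pos_tags[1:] + ["EOS"]):
            if bigram in DESIRABLE:
                good += 1
            elif bigram in UNDESIRABLE:
                bad += 1
        return good, bad

    # count_features(["NN-MOD", "NN", "VB"]) == (1, 0)
    # count_features(["NN-MOD", "VB"])       == (0, 1)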

For compound splitting, previously proposed strategies for the opposite translation
direction could be modified to improve performance when translating into German. We
also confirmed previous results that showed no correlation between intrinsic evalua-
tions of compound splitting and translation results in the other translation direction.

For merging we designed a heuristic algorithm based on part-of-speech match-
ing, which gave consistently good results. We also improved existing heuristics based
on frequency lists, which worked well on some data sets. We furthermore cast the
compound merging problem as a sequence labeling problem, opening it up to solutions
based on a broad array of models and algorithms. We experimented with one such
model, conditional random fields, designed a set of easily computed features reaching
beyond the information accessed by the heuristics, and showed that it gives very
competitive results, even when trained on automatically labeled data.
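
For comparison with the learned model, the list-based heuristic essentially reduces to
a lookup in a word list compiled from the training corpus. The sketch below is a
minimal illustration; the frequency threshold is an assumption for the example, not a
tuned value.

    # Minimal sketch of list-based merging: two adjacent tokens are merged
    # when their concatenation occurs as a word in the training corpus.
    from collections import Counter

    def build_word_list(corpus_tokens, min_count=2):
        """Collect corpus words seen at least min_count times (assumed threshold)."""
        counts = Counter(corpus_tokens)
        return {w for w, c in counts.items() if c >= min_count}

    def list_merge(tokens, known_words):
        out = []
        for tok in tokens:
            # Greedily extend the previous token if the result is a known word.
            if out and out[-1] + tok in known_words:
                out[-1] += tok
            else:
                out.append(tok)
        return out

    # list_merge(["gold", "ring"], {"goldring"}) -> ["goldring"]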

Our goal was to identify methods that can generate novel compounds. Depending
on the choice of features, the sequence labeling approach has the potential to be
truly productive, that is, to form new compounds in an unrestricted way. The list-based
heuristic is not productive: It can only form a compound if that compound was already
observed as such. The POS-match heuristic yields some productivity. Because it uses
special POS-tags for compound modifiers, it can form a compound provided its head
has been seen either alone or as a head, and its modifier(s) have been seen elsewhere,
possibly separately, as modifier(s) of compounds. The sequence labeling approach can
decide to merge two consecutive words even if neither was ever seen before in a
compound.

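A minimal sketch of this merging rule is given below; the tag names, and the "-MOD"
suffix convention marking compound-modifier tags, are illustrative placeholders rather
than the actual tagset.

    # Minimal sketch of POS-match merging: a token carrying a special
    # compound-modifier tag is merged with what follows when the base
    # tags match. The "-MOD" suffix convention is a placeholder.
    def pos_match_merge(tokens, tags):
        out_tokens, out_tags = [], []
        i = 0
        while i < len(tokens):
            tok, tag = tokens[i], tags[i]
            # Absorb a run of modifiers whose base tag matches the next tag.
            while tag.endswith("-MOD") and i + 1 < len(tokens) \
                    and tags[i + 1].startswith(tag[:-4]):
                tok += tokens[i + 1]
                tag = tags[i + 1]
                i += 1
            out_tokens.append(tok)
            out_tags.append(tag)
            i += 1
        return out_tokens, out_tags

    # pos_match_merge(["gold", "ring"], ["NN-MOD", "NN"])
    # -> (["goldring"], ["NN"])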

In the merging step we only considered the order of words as they appear in
the 1-best output. Other options would include attempting to reorder words in the
output in addition to merging them, or using lattice or n-best output and performing
compound merging and reranking in combination. However, we believe that the use
of sequence models for coalescence gives sufficiently good results and minimizes the
potential impact of these more costly options.

In this work we mainly viewed the translation system as a black box. Our mod-
ifications to the PBSMT pipeline were based on pre- and postprocessing and on the
use of *POS-sequence models. It is possible that further improvements could be gained
by integrating compound processing more tightly into the decoder.

The rate of novel compounds is relatively low in the data sets used in our experi-
ments, mainly because most experiments were carried out with in-domain test data.
However, the best methods tend to vary between the domain-restricted automotive
corpora and Europarl. The importance of effective compound splitting and merging
can be expected to grow in domain adaptation settings, which are outside the scope of
this work. Overall we have shown that systems with compound processing give
consistently good results when they use POS-based language models together with
either the POS-match heuristic or sequence labeling for compound merging. The
difference between a baseline using a POS-model but no compound splitting and the
split-merge EPOS-models is small in all experiments and does not seem to increase
with the size of the training corpus. However, only the latter can produce novel
compounds.

Acknowledgments
We would like to thank Maria Holmqvist
and Tamás Gaál for valuable discussions
about different aspects of the work in this
article. We also thank the anonymous
reviewers for their valuable comments on an
earlier draft. The work presented in the
article was mostly conducted while S.
Stymne was at Linköping University and
Xerox Research Centre Europe.

References
Badr, Ibrahim, Rabih Zbib, and James Glass.
2008. Segmentation for English-to-Arabic
statistical machine translation. In
Proceedings of the 46th Annual Meeting of the
ACL: Human Language Technologies, Short
papers, pages 153–156, Columbus, OH.
Banerjee, Satanjeev and Alon Lavie. 2005.
METEOR: An automatic metric for MT
evaluation with improved correlation
with human judgments. In Proceedings
of the Workshop on Intrinsic and Extrinsic
Evaluation Measures for MT and/or
Summarization at ACL’05, pages 65–72,
Ann Arbor, MI.

Baroni, Marco, Johannes Matiasek, and
Harald Trost. 2002. Predicting the
components of German nominal
compounds. In Proceedings of the 15th
European Conference on Artificial Intelligence
(ECAI), pages 470–474, Amsterdam.
Berton, André, Pablo Fetter, and Peter
Regel-Brietzmann. 1996. Compound
words in large-vocabulary German
speech recognition systems. In Proceedings
of the Fourth International Conference
on Spoken Language Processing,
pages 1,165–1,168, Philadelphia, PA.
Birch, Alexandra, Miles Osborne, and
Philipp Koehn. 2008. Predicting success
in machine translation. In Proceedings of
the 2008 Conference on Empirical Methods
in Natural Language Processing,
pages 745–754, Honolulu, HI.

Botha, Jan A., Chris Dyer, and Phil Blunsom.
2012. Bayesian language modelling
of German compounds. In Proceedings
of the 24th International Conference on
Computational Linguistics, pages 341–356,
Mumbai.

Brodda, Benny. 1979. Något om de svenska
ordens fonotax och morfotax: Iakttagelse
med utgångspunkt från experiment
med automatisk morfologisk analys. In PILUS
nr 38. Inst. för lingvistik, Stockholms
universitet, Sweden.

Brown, Ralf D. 2002. Corpus-driven splitting
of compound words. In Proceedings of the
9th International Conference on Theoretical and
Methodological Issues in Machine Translation,
pages 12–21, Keihanna.

Carlberger, Johan, Rickard Domeij, Viggo
Kann, and Ola Knutsson. 2005. The
development and performance of a
grammar checker for Swedish: A language
engineering perspective. In Ola Knutsson,
editor, Developing and Evaluating Language
Tools for Writers and Learners of Swedish.
Ph.D. thesis, Royal Institute of Technology
(KTH), Stockholm, Sweden.

Carlberger, Johan and Viggo Kann. 1999.
Implementing an efficient part-of-speech
tagger. Software Practice and Experience,
29:815–832.

Cherry, Colin. 2008. Cohesive phrase-based
decoding for statistical machine
translation. In Proceedings of the
46th Annual Meeting of the ACL: Human
Language Technologies, pages 72–80,
Columbus, OH.

Clark, Jonathan H., Chris Dyer, Alon Lavie,
and Noah A. Smith. 2011. Better
hypothesis testing for statistical machine
translation: Controlling for optimizer
instability. In Proceedings of the 49th Annual
Meeting of the ACL: Human Language
Technologies, pages 176–181, Portland, OR.

Collins, Michael. 2002. Discriminative
training methods for hidden Markov
models: Theory and experiments with
perceptron algorithms. In Proceedings of the
2002 Conference on Empirical Methods in
Natural Language Processing, pages 1–8,
Philadelphia, PA.

Collins, Michael and Terry Koo. 2005.
Discriminative reranking for natural
language parsing. Computational
Linguistics, 31(1):25–69.

Cutting, Doug, Julian Kupiec, Jan Pedersen,
and Penelope Sibun. 1992. A practical
part-of-speech tagger. In Proceedings
of the Third Conference on Applied Natural
Language Processing, pages 133–140, Trento.

Doddington, George. 2002. Automatic
evaluation of machine translation quality
using n-gram co-occurrence statistics.
In Proceedings of the Second International
Conference on Human Language Technology,
pages 228–231, San Diego, CA.

Dyer, Chris. 2009. Using a maximum entropy
model to build segmentation lattices for
MT. In Proceedings of Human Language

1106

Technologies: The 2009 Annual Conference of
the NAACL, pages 406–414, Boulder, CO.

Dyer, Chris. 2010. A Formal Model of
Ambiguity and its Applications in Machine
Translation. Ph.D. thesis, University of
Maryland, USA.

El-Kahlout, İlknur Durgar and Kemal
Oflazer. 2006. Initial explorations in
English to Turkish statistical machine
translation. In Proceedings of the Workshop
on Statistical Machine Translation,
pages 7–14, New York, NY.

El Kholy, Ahmed and Nizar Habash. 2010.
Techniques for Arabic morphological
detokenization and orthographic
denormalization. In LREC 2010 Workshop
on Language Resources and Human Language
Technology for Semitic Languages,
pages 45–51, Valletta.

Fraser, Alexander. 2009. Experiments in
morphosyntactic processing for translating
to and from German. In Proceedings of the
Fourth Workshop on Statistical Machine
Translation, pages 115–119, Athens.

Friberg, Karin. 2007. Decomposing Swedish
compounds using memory-based learning.
In Proceedings of the 16th Nordic Conference
on Computational Linguistics
(NODALIDA’07), pages 224–230, Tartu.
Fritzinger, Fabienne and Alexander Fraser.
2010. How to avoid burning ducks:
Combining linguistic analysis and
corpus statistics for German compound
processing. In Proceedings of the Joint Fifth
Workshop on Statistical Machine Translation
and MetricsMATR, pages 224–234, Uppsala.

Hedlund, Turid. 2002. Compounds in
dictionary-based cross-language
information retrieval. Information Research,
7(2). Available at http://InformationR.net/ir/7-2/paper128.html. Accessed
January 29, 2013.

Holmqvist, Maria, Sara Stymne, and Lars
Ahrenberg. 2007. Getting to know Moses:
Initial experiments on German–English
factored translation. In Proceedings of the
Second Workshop on Statistical Machine
Translation, pages 181–184, Prague.
Holz, Florian and Chris Biemann. 2008.
Unsupervised and knowledge-free
learning of compound splits and
periphrases. In Proceedings of the
9th International Conference on Intelligent
Text Processing and Computational
Linguistics (CICLING), pages 117–127,
Haifa.

Institut für Deutsche Sprache. 1998.
Rechtschreibreform (Aktualisierte
Ausgabe). IDS Sprachreport,
Extra-Ausgabe Dezember 1998.
Mannheim, Germany.

Koehn, Philipp. 2005. Europarl: A parallel
corpus for statistical machine translation.
In Proceedings of MT Summit X,
pages 79–86, Phuket.

Koehn, Philipp, Abhishek Arun, and Hieu
Hoang. 2008. Towards better machine
translation quality for the German–English
language pairs. In Proceedings of the Third
Workshop on Statistical Machine Translation,
pages 139–142, Columbus, OH.

Koehn, Philipp and Hieu Hoang. 2007.
Factored translation models. In Proceedings
of the 2007 Joint Conference on Empirical
Methods in Natural Language Processing and
Computational Natural Language Learning,
pages 868–876, Prague.

Koehn, Philipp, Hieu Hoang, Alexandra
Birch, Chris Callison-Burch, Marcello
Federico, Nicola Bertoldi, Brooke Cowan,
Wade Shen, Christine Moran, Richard
Zens, Chris Dyer, Ondrej Bojar, Alexandra
Constantin, and Evan Herbst. 2007.
Moses: Open source toolkit for statistical
machine translation. In Proceedings of the
45th Annual Meeting of the ACL, Demo and
Poster Sessions, pages 177–180, Prague.
Koehn, Philipp and Kevin Knight. 2003.
Empirical methods for compound
splitting. In Proceedings of the 10th
Conference of the EACL, pages 187–193,
Budapest.

Kokkinakis, Dimitrios. 2001. A Framework
for the Acquisition of Lexical Knowledge:
Description and Applications. Ph.D. thesis,
Göteborg University, Sweden.
Kürschner, Sebastian. 2003. Von
Volk-s-musik und Sport-ø-geist im
Lemming-ø-land – Af folk-e-musik og
sport-s-ånd i lemming-e-landet:
Fugenelemente im Deutschen und
Dänischen – eine kontrastive
Studie zu einem Grenzfall der
Morphologie. Master's thesis,
Albert-Ludwigs-Universität,
Freiburg, Germany.

Lafferty, John, Andrew McCallum, and
Fernando Pereira. 2001. Conditional
random fields: Probabilistic models for
segmenting and labeling sequence data.
In Proceedings of the 18th International
Conference on Machine Learning,
pages 282–289, Williamstown, MA.
Langer, Stefan. 1998. Zur Morphologie
und Semantik von Nominalkomposita.
In Tagungsband der 4. Konferenz zur
Verarbeitung natürlicher Sprache
(KONVENS), pages 83–97, Bonn.

Larson, Martha, Daniel Willett, Joachim
Köhler, and Gerhard Rigoll. 2000.
Compound splitting and lexical unit
recombination for improved performance
of a speech recognition system for German
parliamentary speeches. In Proceedings
of the Sixth International Conference on
Spoken Language Processing, volume 3,
pages 945–948, Beijing.

Lavie, Alon and Abhaya Agarwal. 2007.
METEOR: An automatic metric for MT
evaluation with high levels of correlation
with human judgments. In Proceedings of
the Second Workshop on Statistical Machine
Translation, pages 228–231, Prague.

Macherey, Klaus, Andrew Dai, David Talbot,
Ashok Popat, and Franz Och. 2011.
Language-independent compound
splitting with morphological operations.
In Proceedings of the 49th Annual Meeting of
the ACL: Human Language Technologies,
pages 1,395–1,404, Portland, OR.
Nießen, Sonja and Hermann Ney. 2000.
Improving SMT quality with
morpho-syntactic analysis. In Proceedings
of the 18th International Conference on
Computational Linguistics, pages
1,081–1,085, Saarbrücken.

Nießen, Sonja and Hermann Ney. 2004.
Statistical machine translation with scarce
resources using morpho-syntactic
information. Computational Linguistics,
30(2):181–204.

Och, Franz Josef. 2003. Minimum error rate
training in statistical machine translation.
In Proceedings of the 41st Annual
Meeting of the ACL, pages 160–167,
Sapporo.

Och, Franz Josef and Hermann Ney. 2003.
A systematic comparison of various
statistical alignment models. Computational
Linguistics, 29(1):19–51.

Papineni, Kishore, Salim Roukos, Todd
Ward, and Wei-Jing Zhu. 2002. BLEU:
A method for automatic evaluation of
machine translation. In Proceedings of the
40th Annual Meeting of the ACL,
pages 311–318, Philadelphia, PA.

Popović, Maja and Hermann Ney. 2006.
Statistical machine translation with a
small amount of bilingual training data.
In 5th LREC SALTMIL Workshop on
Minority Languages, pages 25–29, Genoa.
Popović, Maja, Daniel Stein, and Hermann
Ney. 2006. Statistical machine translation
of German compound words. In
Proceedings of FinTAL – 5th International
Conference on Natural Language Processing,
pages 616–624, Turku.

Rabiner, Lawrence R. 1989. A tutorial on
hidden Markov models and selected
applications in speech recognition.
Proceedings of the IEEE, 77(2):257–286.
Riezler, Stefan and John T. Maxwell.
2005. On some pitfalls in automatic
evaluation and significance testing for
MT. In Proceedings of the Workshop on
Intrinsic and Extrinsic Evaluation Measures
for MT and/or Summarization at ACL’05,
pages 57–64, Ann Arbor, MI.

Sarawagi, Sunita and William W. Cohen.
2004. Semi-Markov conditional random
fields for information extraction.
In Advances in Neural Information
Processing Systems 17 (NIPS),
pages 1,185–1,192, Cambridge, MA.
Schiller, Anne. 2005. German compound

analysis with WFSC. In Proceedings of the
Finite State Methods and Natural Language
Processing, pages 239–246, Helsinki.

Schmid, Helmut. 1994. Probabilistic
part-of-speech tagging using decision
trees. In Proceedings of the International
Conference on New Methods in Language
Processing, pages 44–49, Manchester.
Schmid, Helmut and Florian Laws. 2008.
Estimation of conditional probabilities
with decision trees and an application to
fine-grained POS tagging. In Proceedings
of the 22nd International Conference on
Computational Linguistics, pages 777–784,
Manchester.

Simard, Michel, Nicola Cancedda, Bruno
Cavestro, Marc Dymetman, Eric Gaussier,
Cyril Goutte, Kenji Yamada, Philippe
Langlais, and Arne Mauser. 2005.
Translating with non-contiguous phrases.
In Proceedings of the Human Language
Technology Conference and the Conference on
Empirical Methods in Natural Language
Processing, pages 755–762, Vancouver.

Sjöbergh, Jonas and Viggo Kann. 2004.
Finding the correct interpretation of
Swedish compounds, a statistical
approach. In Proceedings of the 4th
International Conference on Language
Resources and Evaluation (LREC’04),
pages 899–902, Lisbon.

Stolcke, Andreas. 2002. SRILM—an

extensible language modeling toolkit.
In Proceedings of the Seventh International
Conference on Spoken Language Processing,
pages 901–904, Denver, CO.

Stymne, Sara. 2008. German compounds in
factored statistical machine translation.
In Proceedings of GoTAL – 6th International
Conference on Natural Language Processing,
pages 464–475, Gothenburg.

Stymne, Sara. 2009. A comparison of

merging strategies for translation of
German compounds. In Proceedings of the
EACL 2009 Student Research Workshop,
pages 61–69, Athens.

Stymne, Sara and Nicola Cancedda. 2011.
Productive generation of compound
words in statistical machine translation.
In Proceedings of the Sixth Workshop
on Statistical Machine Translation,
pages 250–260, Edinburgh.

Stymne, Sara and Maria Holmqvist.

2008. Processing of Swedish compounds
for phrase-based statistical machine
translation. In Proceedings of the
12th Annual Conference of the European
Association for Machine Translation,
pages 180–189, Hamburg.

Stymne, Sara, Maria Holmqvist, and Lars

Ahrenberg. 2008. Effects of morphological
analysis in translation between German
and English. In Proceedings of the Third
Workshop on Statistical Machine Translation,
pages 135–138, Columbus, OH.
Tapanainen, Pasi and Timo Järvinen.
1997. A non-projective dependency parser.
In Proceedings of the 5th Conference on
Applied Natural Language Processing,
pages 64–71, Washington, DC.

Taskar, Ben, Carlos Guestrin, and Daphne
Koller. 2003. Max-margin Markov
networks. In Advances in Neural
Information Processing Systems 16 (NIPS),
pages 25–32, Vancouver.

Thorell, Olof. 1981. Svensk ordbildningslära.
Esselte Studium, Stockholm, Sweden.
Tsochantaridis, Ioannis, Thorsten Joachims,
Thomas Hofmann, and Yasemin Altun.
2005. Large margin methods for structured
and interdependent output variables.
Journal of Machine Learning Research,
6:1,453–1,484.

Virpioja, Sami, Jaakko J. Väyrynen, Mathias
Creutz, and Markus Sadeniemi. 2007.
Morphology-aware statistical machine
translation based on morphs induced
in an unsupervised manner.
In Proceedings of MT Summit XI,
pages 491–498, Copenhagen.
