A Comparative Study of Minimally
Supervised Morphological Segmentation
Teemu Ruokolainen∗
Aalto University
Oskar Kohonen∗∗
Aalto University
Kairit Sirts†
Tallinn University of Technology
Stig-Arne Grönroos∗
Aalto University
Mikko Kurimo∗
Aalto University
Sami Virpioja∗∗
Aalto University
This article presents a comparative study of a subfield of morphology learning referred to
as minimally supervised morphological segmentation. In morphological segmentation,
word forms are segmented into morphs, the surface forms of morphemes. In the minimally
supervised data-driven learning setting, segmentation models are learned from a small number
of manually annotated word forms and a large set of unannotated word forms. In addition to
providing a literature survey on published methods, we present an in-depth empirical comparison
on three diverse model families, including a detailed error analysis. Based on the literature
survey, we conclude that the existing methodology contains substantial work on generative
morph lexicon-based approaches and methods based on discriminative boundary detection. As for
which approach has been more successful, both the previous work and the empirical evaluation
presented here strongly imply that the current state of the art is yielded by the discriminative
boundary detection methodology.
∗ Department of Signal Processing and Acoustics, Otakaari 5 A, FI-02150 Espoo, Finland.
E-mail: {teemu.ruokolainen, stig-arne.gronroos, mikko.kurimo}@aalto.fi.
∗∗ Department of Information and Computer Science, Konemiehentie 2, FI-02150 Espoo, Finland.
E-mail: {oskar.kohonen, sami.virpioja}@aalto.fi.
† Institute of Cybernetics, Akadeemia tee 21, EE-12618 Tallinn, Estonia. E-mail: sirts@phon.ioc.ee.
Submission received: 15 September 2014; revised version received: 7 October 2015; accepted for publication:
15 November 2015.
doi:10.1162/COLI_a_00243
© 2016 Association for Computational Linguistics
Downloaded from http://direct.mit.edu/coli/article-pdf/42/1/91/1805588/coli_a_00243.pdf by guest on 07 September 2023
Computational Linguistics, Volume 42, Number 1
1. Introduction
This article discusses a subfield of morphology learning referred to as morphological
segmentation, in which word forms are segmented into morphs, the surface forms
of morphemes. For example, consider the English word houses with a corresponding
segmentation house+s, where the segment house corresponds to the word stem and the
suffix -s marks the plural number. Although this is a major simplification of the diverse
morphological phenomena present in languages, this type of analysis has nevertheless
been of substantial interest to computational linguistics, beginning with the pioneering
work on morphological learning by Harris (1955). As for automatic language process-
ing, such segmentations have been found useful in a wide range of applications, in-
cluding speech recognition (Hirsimäki et al. 2006; Narasimhan et al. 2014), information
retrieval (Turunen and Kurimo 2011), machine translation (de Gispert et al. 2009; Green
and DeNero 2012), and word representation learning (Luong, Socher, and Manning
2013; Qiu et al. 2014).
Since the early work of Harris (1955), most research on morphological segmentation
has focused on unsupervised learning, which aims to learn the segmentation from a
list of unannotated (unlabeled) word forms. The unsupervised methods are appealing
as they can be applied to any language for which there exists a sufficiently large set
of unannotated words in electronic form. As a consequence, such methods provide an inex-
pensive means of acquiring a type of morphological analysis for low-resource languages
as motivated, for example, by Creutz and Lagus (2002). The unsupervised approach and
learning setting has received further popularity because of its close relationship with the
unsupervised word segmentation problem, which has been viewed as a realistic setting
for theoretical study of language acquisition (Brent 1999; Goldwater 2006).
Although development of novel unsupervised model formulations has remained
a topic of active research (Poon, Cherry, and Toutanova 2009; Monson, Hollingshead,
and Roark 2010; Spiegler and Flach 2010; Lee, Haghighi, and Barzilay 2011; Sirts
and Goldwater 2013), recent work has also shown a growing interest towards semi-
supervised learning (Poon, Cherry, and Toutanova 2009; Kohonen, Virpioja, and Lagus
2010; Sirts and Goldwater 2013; Grönroos et al. 2014; Ruokolainen et al. 2014). In general,
the aim of semi-supervised learning is to acquire high-performing models utilizing both
unannotated as well as annotated data (Zhu and Goldberg 2009). In morphological seg-
mentation, the annotated data sets are commonly small, on the order of a few hundred
word forms. We refer to this learning setting with such a small amount of supervision as
minimally supervised learning. Consequently, similar to the unsupervised methods,
the minimally supervised techniques can be seen as a means of acquiring a type of
morphological analysis for under-resourced languages.
Individual articles describing novel methods typically contain a comparative dis-
cussion and empirical evaluation between one or two preceding approaches. Therefore,
what is currently lacking from the literature is a summarizing comparative study on the
published methodology as a whole. Moreover, the literature currently lacks discussion
on error analysis. A study on the error patterns produced by varying approaches could
inform us about their potential utility in different tasks. For example, if an application
requires high-accuracy compound splitting, one could choose to apply a model with a
good compound-splitting capability even if its affix accuracy does not reach state of the
art. The purpose of this work is to address these issues.
Our main contributions are as follows. First, we present a literature survey on
morphological segmentation methods applicable in the minimally supervised learning
setting. The considered methods include unsupervised techniques that learn solely from
unannotated data, supervised methods that utilize solely annotated data, and semi-
supervised approaches that utilize both unannotated and annotated data. Second, we
perform an extensive empirical evaluation of three diverse method families, including a
detailed error analysis. The approaches considered in this comparison are variants of the
Morfessor algorithm (Creutz and Lagus 2002, 2005, 2007; Kohonen, Virpioja, and Lagus
2010; Grönroos et al. 2014), the adaptor grammar framework (Sirts and Goldwater 2013),
and the conditional random field method (Ruokolainen et al. 2013, 2014). We hope the
presented discussion and empirical evaluation will be of help for future research on the
considered task.
The rest of the article is organized as follows. In Section 2, we provide an overview
of related studies. We then provide a literature survey of published morphological
segmentation methodology in Section 3. Experimental work is presented in Section 4.
Finally, we provide a discussion on potential directions for future work and conclusions
on the current work in Sections 5 and 6, respectively.
2. Trabajo relacionado
Hammarström and Borin (2011) presented a literature survey on unsupervised learning
of morphology, including methods for learning morphological segmentation. Whereas
the discussion provided by Hammarström and Borin focuses mainly on linguistic
aspects of morphology learning, our work is strongly rooted in machine learning
methodology and empirical evaluation. Moreover, whereas Hammarström and Borin
focus entirely on unsupervised learning, our work considers a broader range of learn-
ing paradigms. Therefore, although related, Hammarström and Borin and our current
presentation are complementary in that they have different focus areas.
In addition to the work of Hammarström and Borin (2011), we note that there exist
some established forums on morphology learning. First, we mention the ACL Special
Interest Group on Computational Morphology and Phonology (SIGMORPHON), cual
has regularly organized workshops on the subject since 2002. As for morphology
learning specifically, we refer to the Morpho Challenge competitions organized since 2005
at Aalto University (formerly known as Helsinki University of Technology). Although
these events have been successful in providing a publication and discussion venue for
researchers interested in the topic, they have not given birth to comparative studies
or survey literature. For example, whereas the publications on the Morpho Challenge
(Kurimo et al. 2009; Kurimo, Virpioja, and Turunen 2010) discuss the competition
results, they nevertheless do not attempt to provide any insight into the fundamental
differences and similarities of the participating methods.
3. Methods
This section provides a detailed review of our methodology. We begin by describing
varying morphological representations, including segmentation, and the minimally
supervised learning setting in Sections 3.1 and 3.2, respectively. We then provide a
literature survey and comparative discussion on a range of methods in Section 3.3.
3.1 On Learning Morphological Representations
In what follows, we briefly characterize morphological segmentation with respect to
alternative morphological representations, particularly the full morphological anal-
ysis. To this end, consider the exemplar segmentations and full analyses for Finnish
Table 1
Morphological segmentation versus full morphological analysis for exemplar Finnish word
forms. The full analysis consists of word lemma (basic form), part-of-speech, and fine-grained
labels.

word form                  full analysis                       segmentation
auto (car)                 auto+N+Sg+Nom                       auto
autossa (in car)           auto+N+Sg+Ine                       auto+ssa
autoilta (from cars)       auto+N+Pl+Abl                       auto+i+lta
autoilta (car evening)     auto+N+Sg+Nom+# ilta+N+Sg+Nom       auto+ilta
maantie (highway)          maantie+N+Sg+Nom                    maantie
                           maa+N+Sg+Gen+# tie+N+Sg+Nom         maa+n+tie
sähköauto (electric car)   sähköauto+N+Sg+Nom                  sähköauto
                           sähkö+N+Sg+Nom+# auto+N+Sg+Nom      sähkö+auto
word forms in Table 1, where the full analyses are provided by the rule-based OMorFi
analyzer developed by Pirinen (2008). Note that it is typical for word forms to have
alternative analyses and/or meanings that cannot be disambiguated without sentential
contexto. Evidently, the level of detail in the full analysis is substantially higher compared
with the segmentation, as it contains lemmatization as well as morphological tagging,
whereas the segmentation consists of only segment boundary positions. As a consequence,
because of this simplification, morphological segmentation has been amenable to un-
supervised machine learning methodology, beginning with the work of Harris (1955).
Meanwhile, the majority of work on learning of full morphological analysis has used
supervised methodology (Chrupala, Dinu, and van Genabith 2008). Lastly, there have
been numerous studies on statistical learning of intermediate forms of segmentation
and full analysis (Lignos 2010; Virpioja, Kohonen, and Lagus 2010) as well as alternative
morphological representations (Yarowsky and Wicentowski 2000; Schone and Jurafsky
2001; Neuvel and Fulop 2002; Johnson and Martin 2003).
As for language processing, learning segmentation can be advantageous com-
pared with learning full analyses. In particular, learning full analysis in a supervised
manner typically requires up to tens of thousands of manually annotated sentences.
A low-cost alternative, therefore, could be to learn morphological segmentation from
unannotated word lists and a handful of annotated examples. Importantly, segmen-
tation analysis has been found useful in a range of applications, such as speech
recognition (Hirsimäki et al. 2006; Narasimhan et al. 2014), information retrieval
(Turunen and Kurimo 2011), machine translation (de Gispert et al. 2009; Green and
DeNero 2012), and word representation learning (Luong, Socher, and Manning 2013;
Qiu et al. 2014).
Despite its intuitiveness, it should be noted that the segmentation representation
is not equally applicable to all languages. To this end, consider the terms isolating
and synthetic languages. In languages with a high amount of isolating morpholog-
ical properties, word forms tend to comprise their own morphemes. Meanwhile, in
heavily synthetic languages, words tend to contain multiple morphemes. Synthetic
languages can be described further according to their agglutinative (concatenative) and
fusional properties. In the former, the morphs tend to have clear boundaries between
them whereas in the latter, the morphs tend to be indistinguishable. For examples
of agglutinative and fusional word formation, consider the English verbs played (past
tense of play) and sang (past tense of sing). Whereas the former can be effortlessly
divided into two segments as play+ed (STEM + PAST TENSE), there are no such distinct
boundaries in the latter. Generally, languages with synthetic properties mix concate-
native and fusional schemes and contain agglutinative properties to varying degrees.
Morphological segmentation can be most naturally applied to highly agglutinative
languages.
Morphologically ambiguous word forms are common especially in highly synthetic
languages. Even without disambiguation based on sentential context, providing all
correct alternatives could be useful for some downstream applications, such as informa-
tion retrieval. Statistical methods can usually provide n-best segmentations; for exam-
ple, Morfessor (Creutz and Lagus 2007) and CRFs (Ruokolainen et al. 2013) by using the
n-best Viterbi algorithm, and adaptor grammars (Sirts and Goldwater 2013) by collecting
the variations in the posterior distribution samples. Although there is no evident way
to decide the correct number of alternatives for a particular word form, n-best lists
might be useful whenever recall (including the correct answers) is more important than
precisión (excluding any incorrect answers). The Morpho Challenge competitions have
allowed providing alternative segmentations for the submitted methods, but no clear
developments have been reported. In fact, even in the reference results based on the gold
standard segmentations, selecting all alternative segmentations has performed slightly
worse in the information retrieval tasks than taking only the first segmentation (Kurimo,
Virpioja, and Turunen 2010).
3.2 Minimally Supervised Learning Settings
In data-driven morphological segmentation, our aim is to learn segmentation models
from training data. Subsequent to training, the models provide segmentations for given
word forms. In the minimally supervised learning setting, as defined here, the models
are estimated from annotated and unannotated word forms. We denote the annotated
data set comprising word forms with their corresponding segmentation as D and the
unannotated data set comprising raw word forms as U. Typically, the raw word forms
can be obtained easily and, as a consequence, U can contain millions of word forms. Mean-
while, acquiring the annotated data D requires manual labor and, therefore, typically
contains merely hundreds or thousands of word forms. For an illustration of D and U,
see Table 2.
Table 2
Examples of annotated and unannotated data, D and U, respectively. Typically, U can contain
hundreds of thousands or millions of word forms, whereas D contains merely hundreds or
thousands of word forms.

D                   U
anarch + ist + s    behavior
attach + ed         bilinguals
conting + ency      community
de + fame           disorders
entitle + ment      equipped
fresh + man         faster
. . .               . . .
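In code, the two resources of Table 2 can be represented very simply; the sketch below is illustrative only (the dictionary/list layout and the helper name are our own, not from any published implementation):

```python
# Annotated data D: word form -> its morph sequence (examples as in Table 2).
D = {
    "anarchists": ["anarch", "ist", "s"],
    "defame": ["de", "fame"],
    "freshman": ["fresh", "man"],
}

# Unannotated data U: raw word forms only; typically orders of magnitude larger.
U = ["behavior", "bilinguals", "community", "disorders", "equipped", "faster"]

def boundaries(morphs):
    """Return the internal segment-boundary positions of a morph sequence."""
    positions, offset = [], 0
    for morph in morphs[:-1]:
        offset += len(morph)
        positions.append(offset)
    return positions
```

Representing a segmentation by its boundary positions (anarch+ist+s becomes [6, 9]) is equivalent to listing the morphs, and is the view taken by the boundary-detection methods discussed later.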
We consider three machine learning approaches applicable in the minimally su-
pervised learning setting, namely, unsupervised, supervised, and semi-supervised
learning. In unsupervised learning, the segmentation models are trained on solely
unannotated data U. Meanwhile, supervised models are trained on solely the anno-
tated data D. Finally, the aim of semi-supervised learning is to utilize both the available
unannotated and annotated data. Because the semi-supervised approach utilizes the
largest amount of data, it is expected to be most suitable for acquiring high segmenta-
tion accuracy in the minimally supervised learning setting.
Lastly, we note that the unsupervised learning framework can be understood in a
strict or non-strict sense, depending on whether the applied methods are allowed to use
annotated data D for hyperparameter tuning. Although the term unsupervised learning
itself suggests that such adjusting is infeasible, this type of tuning is nevertheless
common (Creutz et al. 2007; Çöltekin 2010; Spiegler and Flach 2010; Sirts and Goldwater
2013). Moreover, the minimally supervised learning setting explicitly assumes a small
amount of available annotated word forms. As a consequence, in the remainder of this
article, all discussion on unsupervised methods refers to unsupervised learning in the
non-strict sense.
3.3 Algorithms
Here we provide a literature survey on proposed morphological segmentation methods
applicable in the minimally supervised learning setting. We place particular emphasis
on three method families, namely, the Morfessor algorithm (Creutz and Lagus 2002,
2005, 2007; Kohonen, Virpioja, and Lagus 2010; Grönroos et al. 2014), the adaptor
grammar framework (Sirts and Goldwater 2013), and conditional random fields
(Ruokolainen et al. 2013, 2014). These approaches are the subject of the empirical evalu-
ation presented in Section 4. We present individual method descriptions in Section 3.3.1.
Thereafter, Section 3.3.2 provides a summarizing discussion, the purpose of which
is to gain insight into the fundamental differences and similarities between the varying
approaches.
3.3.1 Descriptions
Morfessor. We begin by describing the original, unsupervised Morfessor method fam-
ily (Creutz and Lagus 2002, 2005, 2007). We then discuss the later, semi-supervised
extensions (Kohonen, Virpioja, and Lagus 2010; Grönroos et al. 2014). In particular,
we review the extension of Morfessor Baseline to semi-supervised learning by using
a weighted generative model (Kohonen, Virpioja, and Lagus 2010), and then discuss
the most recent Morfessor variant, FlatCat (Grönroos et al. 2014). Finally, we discuss
some general results from the literature on semi-supervised learning with generative
modelos.
The unsupervised Morfessor methods are based on a generative probabilistic
model that generates the observed word forms xi ∈ U by concatenating morphs xi =
mi1 ◦ mi2 ◦ · · · ◦ min. The morphs are stored in a morph lexicon, which defines the
probability of each morph P(metro | i) given some parameters θ. The Morfessor learning
problem is to find a morph lexicon that strikes an optimal balance between encoding the
observed word forms concisely and, at the same time, having a concise morph lexicon.
To this end, Morfessor utilizes a prior distribution P(θ) over morph lexicons, derived
from the Minimum Description Length principle (Rissanen 1989), that favors lexicons
that contain fewer, shorter morphs. This leads to the following minimization problem
that seeks to balance the conciseness of the lexicon with the conciseness of the observed
corpus encoded with the lexicon:
θ∗ = arg min_θ L(θ, U) = arg min_θ {− ln P(θ) − ln P(U | θ)},        (1)
The optimization problem in Equation (1) is complicated by the fact that each word
in the corpus U can be generated by different combinations of morphs, definido por
the set of segmentations of that word. This introduces a nuisance parameter z for the
segmentation of each word form, where P(U | θ) = ∑_z P(U | z, θ) P(z). Because of this
summation, the expression cannot be solved analytically, and iterative optimization
must be used instead.
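As a concrete illustration of the two-part objective in Equation (1), the sketch below scores one fixed segmentation of a corpus. It is heavily simplified: the prior charges each distinct morph a per-letter cost, which only approximates Morfessor's actual prior, and all names are our own.

```python
import math
from collections import Counter

def morfessor_cost(segmented_corpus, letter_cost=math.log(28)):
    """-ln P(theta) - ln P(U | theta) for one fixed segmentation.

    segmented_corpus: list of morph lists, one per word token.
    letter_cost: assumed per-letter code length for storing lexicon morphs.
    """
    counts = Counter(m for word in segmented_corpus for m in word)
    total = sum(counts.values())
    # Lexicon prior: each distinct morph costs its length in letters.
    lexicon_cost = sum(len(m) * letter_cost for m in counts)
    # Corpus likelihood: unigram morph probabilities from the counts.
    corpus_cost = -sum(c * math.log(c / total) for c in counts.values())
    return lexicon_cost + corpus_cost
```

Under this toy cost, segmenting houses and cats as house+s and cat+s comes out slightly cheaper than leaving both words whole, because the shared suffix s shrinks the lexicon relative to the corpus it must encode.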
The unsupervised Morfessor variants differ in the following ways: first, whether
all morphs belong to a single category or the categories PREFIX, STEM, and SUFFIX are
used; second, whether the model utilizes a lexicon that is flat or hierarchical. In a flat lexicon,
morphs can only be encoded by combining letters, whereas in a hierarchical lexicon
pre-existing morphs can be used for storing longer morphs. Third, the parameter
estimation and inference methods differ. Parameters are estimated using greedy local
search or iterative batch procedures while inference is performed with either Viterbi
decoding or heuristic procedures.
The earliest Morfessor method, referred to as Morfessor Baseline, has been extended
to semi-supervised learning by Kohonen, Virpioja, and Lagus (2010). In contrast, the
later methods, namely, Categories-ML and Categories-MAP, have not been extended,
as they use either hierarchical lexicons or training procedures that make them less
amenable to semi-supervised learning. However, recently Grönroos et al. (2014) pro-
posed a new Morfessor variant that uses morph categories in combination with a flat
lexicon, and can therefore apply the semi-supervised learning technique of Kohonen,
Virpioja, and Lagus.
We begin the description of the semi-supervised extension to Morfessor Baseline
(Creutz and Lagus 2002, 2007) by reviewing its generative model. Morfessor Baseline
utilizes a model in which word forms are generated by concatenating morphs, todo
of which belong to the same category. It utilizes a flat morph lexicon P(metro | i) eso
is simply a multinomial distribution over morphs m, according to the probabilities
given by the parameter vector θ. The utilized prior penalizes storing long morphs
in the lexicon by assigning each stored morph a cost that depends most strongly on
the morph length in letters. A morph is considered to be stored if the lexicon as-
signs it a nonzero probability. The parameter estimation for θ finds a local optimum
utilizing greedy local search. The search procedure approximates the optimization
problem in Equation (1) by assuming that, for each word form xi, its corresponding
segmentation distribution P(zi) has all its mass concentrated to a single segmentation
zi. The parameter estimation is then performed by locally searching each word for
the segmentation that yields the best value of the cost function in Equation (1). El
process is repeated for all words in random order until convergence. Subsequent to
aprendiendo, the method predicts the segmentation of a word form by selecting the seg-
mentation with the most probable sequence of morphs using an extension of the Viterbi
algorithm.
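The prediction step can be sketched as dynamic programming over a unigram morph lexicon; this is a simplified stand-in for Morfessor's extended Viterbi algorithm (function and variable names are ours, and out-of-lexicon morphs are simply disallowed rather than smoothed or encoded letter by letter):

```python
import math

def viterbi_segment(word, morph_logprob):
    """Most probable segmentation under a unigram morph model.

    morph_logprob: dict mapping morph -> log-probability.
    """
    n = len(word)
    best = [(-math.inf, 0)] * (n + 1)  # (score, backpointer) per position
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(end):
            morph = word[start:end]
            if morph in morph_logprob and best[start][0] > -math.inf:
                score = best[start][0] + morph_logprob[morph]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None  # no segmentation using only known morphs
    morphs, pos = [], n
    while pos > 0:
        start = best[pos][1]
        morphs.append(word[start:pos])
        pos = start
    return morphs[::-1]
```

For instance, with a lexicon giving house and s reasonable probabilities and the whole word houses a very small one, the function returns the split ['house', 's'].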
Semi-supervised learning is in principle trivial for a generative model: For the la-
beled word forms D, the segmentation is fixed to its correct value, and for the unlabeled
forms U the standard parameter estimation procedure is applied. However, Kohonen,
Virpioja, and Lagus (2010) failed to achieve notable improvements in this fashion, and
consequently replaced the minimized function L in Equation (1) with

L(θ, z, U, D) = − ln P(θ) − α × ln P(U | θ) − β × ln P(D | θ).        (2)
Such weighted objectives were used earlier in combination with generative models by,
for example, Nigam et al. (2000). The semi-supervised training procedure then adjusts
the weight values α and β. The absolute values of the weights control the cost of
encoding a morph in the training data with respect to the cost of adding a new morph to
the lexicon, and their ratio controls how much weight is placed on the annotated data
with respect to the unannotated data. When the hyperparameters α and β are fixed,
the lexicon parameters θ can be optimized with the same greedy local search procedure
as in the unsupervised Morfessor Baseline. The weights can then be optimized with a
grid search and by choosing the model with the best evaluation score on a held-out
development set. Although this modification is difficult to justify from the perspective
of generative modeling, Kohonen, Virpioja, and Lagus show that in practice it can
yield performance improvements. From a theoretical point of view, it can be seen
as incorporating discriminative training techniques when working with a generative
model by optimizing for segmentation performance rather than maximum a posteriori
probabilidad. Sin embargo, only the hyperparameters are optimized in this fashion, mientras
the lexicon parameters are still learned within the generative model framework.
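The weighting-and-tuning scheme can be sketched as follows; here train_fn and eval_fn are placeholders for a full Morfessor training run and a held-out segmentation evaluation, so everything in this snippet is illustrative rather than an actual implementation:

```python
import itertools

def weighted_cost(prior_cost, unannot_loglik, annot_loglik, alpha, beta):
    """Weighted objective of Equation (2):
    -ln P(theta) - alpha * ln P(U | theta) - beta * ln P(D | theta)."""
    return prior_cost - alpha * unannot_loglik - beta * annot_loglik

def grid_search(train_fn, eval_fn, alphas, betas):
    """Pick (alpha, beta) maximizing a development-set score.

    train_fn(alpha, beta) -> model; eval_fn(model) -> held-out score.
    """
    best_score, best_weights = -float("inf"), None
    for alpha, beta in itertools.product(alphas, betas):
        score = eval_fn(train_fn(alpha, beta))
        if score > best_score:
            best_score, best_weights = score, (alpha, beta)
    return best_weights, best_score
```

The grid search optimizes the weights for segmentation performance directly, which is what gives the otherwise generative model its discriminative flavor.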
The semi-supervised learning strategy described here is simple to apply if the
objective function in Equation (1) can be factored to parts that encode the morphs
using letters and encode the training corpus using the morphs. For some models of the
Morfessor family this is not possible because of the use of a hierarchical lexicon, dónde
morphs can be generated from other morphs as well as from individual letters. In par-
ticular, this includes the well-performing Categories-MAP variant (Creutz et al. 2007).
In contrast to Morfessor Baseline, the Categories-MAP and the preceding Categories-
ML method use a hidden Markov model to produce the observed words, donde el
states are given by STEM, PREFIX, SUFFIX categories as well as an internal non-morpheme
categoría. A recent development is Morfessor FlatCat by Grönroos et al. (2014), cual
uses the hidden Markov model structure and morph categories in combination with a
flat lexicon, thus allowing semi-supervised learning in the same fashion as for Morfessor
Base.
In general, the key idea behind using the weighted objective function in Equation
(2) for semi-supervised learning is that the hyperparameters α and β can be used to
explicitly control the influence of the unannotated data on the learning. Similar semi-
supervised learning strategies have also been used in other problems. For classification
with generative models, it is known that adding unlabeled data to a model trained with
labeled data can degrade performance (Cozman et al. 2003; Cozman and Cohen 2006).
In particular, this can be the case if the generative model does not match the generating
process, something that is difficult to ensure in practice. Recently, this phenomenon was
analyzed in more detail by Fox-Roberts and Rosten (2014), who show that, although the
unlabeled data can introduce a bias, the bias can be removed by optimizing a weighted
likelihood function where the unlabeled data is raised to the power N_L/N, where N_L is the
number of labeled samples and N is the number of all samples. This corresponds to the
weighting scheme used in Morfessor when setting the ratio α/β = N_L/N.
Adaptor Grammars. Recently, Sirts and Goldwater (2013) presented work on minimally
supervised morphological segmentation using the adaptor grammar (AG) acercarse
(Johnson, Griffiths, and Goldwater 2006). The AGs are a non-parametric Bayesian mod-
eling framework applicable for learning latent tree structures over an input corpus of
strings. They can be used to define morphological grammars of different complexity,
starting from the simplest grammar where each word is just a sequence of morphs and
extending to more complex grammars, where each word consists, for example, of zero
or more prefixes, a stem, and zero or more suffixes.
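As an illustration, such a morphological grammar can be written schematically as rewrite rules. The notation below is simplified and ours: real AG grammars are PCFGs in which selected nonterminals are marked as adapted, meaning their learned expansions are cached and reused.

```
Word   -> Prefix* Stem Suffix*
Prefix -> Chars     # adapted: learned prefix strings are cached
Stem   -> Chars     # adapted
Suffix -> Chars     # adapted
Chars  -> Char | Char Chars
```

Swapping in a different top rule (for example, a flat sequence of undifferentiated morphs) changes the morphological assumptions without touching the learning machinery.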
The actual forms of the morphs are learned from the data and, subsequent to
aprendiendo, used to generate segmentations for new word forms. In this general approach,
AGs are similar to the Morfessor family (Creutz and Lagus 2007). A major difference,
sin embargo, is that the morphological grammar is not hard-coded but instead specified as
an input to the algorithm. This allows different grammars to be explored in a flexible
manner. Prior to the work by Sirts and Goldwater, the AGs were successfully applied
in a related task of segmenting utterances into words (Johnson 2008; Johnson and
Goldwater 2009; Johnson and Demuth 2010).
The second major difference between the Morfessor family and the AG framework
is the contrast between the MAP and fully Bayesian estimation approaches. Whereas
the search procedure of the Morfessor method discussed earlier returns a single model
corresponding to the MAP point-estimate, AGs instead operate with full posterior
distributions over all possible models. Because acquiring the posteriors analytically
is intractable, inference is performed utilizing Markov chain Monte Carlo algorithms
to obtain samples from the posterior distributions of interest (Johnson 2008; Johnson
and Goldwater 2009; Johnson and Demuth 2010; Sirts and Goldwater 2013). However,
as sampling-based models are costly to train on large amounts of data, we adopt the
parsing-based method proposed in Sirts and Goldwater (2013) to use the trained AG
model inductively on test data. One of the byproducts of training the AG model is the
posterior grammar, which in addition to all the initial grammar rules, also contains
the cached subtrees learned by the system. This grammar can be used in any standard
parser to obtain segmentations for new data.
The AG framework was originally designed for the unsupervised learning setting,
but Sirts and Goldwater (2013) introduced two approaches for semi-supervised learn-
ing they call the semi-supervised AG and AG Select methods. The semi-supervised
AG approach is an extension to unsupervised AG, in which the annotated data D is
exploited in a straightforward manner by keeping the annotated parts of parse trees
fixed while inferring latent structures for the unannotated parts. For unannotated word
formas, inference is performed on full trees. Por ejemplo, the grammar may specify that
words are sequences of morphs and each morph is a sequence of submorphs. Typically,
the annotated data only contain morpheme boundaries and submorphs are latent in this
contexto. In this situation the inference for annotated data is performed over submorph
structures only.
Similarly to unsupervised learning, semi-supervised AG requires the morpholog-
ical grammar to be defined manually. Meanwhile, the AG Select approach aims to
automate the grammar development process by systematically evaluating a range of
grammars and finding the best one. AG Select is trained using unsupervised AG with
an uninformative metagrammar so that the resulting parse trees contain many possible
segmentation templates. To find out which template works best for a given language
or data set, each of these templates is evaluated using the annotated data
set D. In this sense, AG Select can be characterized as more of a model selection method
than semi-supervised learning.
Computational Linguistics
Volume 42, Number 1
Conditional Random Fields. The Morfessor and AG algorithms discussed earlier, a pesar de
different in several respects, operate in a similar manner in that they both learn lexicons.
For Morfessor, the lexicon consists of morphs, whereas for AG, the lexical units are
partial parse-trees. Subsequent to learning, new word forms are segmented either by
generating the most likely morph sequences (Morfessor) or by sampling parse trees
from the posterior distribution (AG). In what follows, we consider a different approach
to segmentation using sequence labeling methodology. The key idea in this approach
is to focus the modeling effort on morph boundaries instead of whole segments.
Following the presentation of Ruokolainen et al. (2013, 2014), the morphological seg-
mentation task can be represented as a sequence labeling problem by assigning each
character in a word form to one of three classes, namely:
B   beginning of a multi-character morph
M   middle of a multi-character morph
S   single-character morph
Using this label set, one can represent the segmentation of the Finnish word autoilta
(from cars) (auto+i+lta) as

    a u t o i l t a
    ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
    B M M M S B M M
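The mapping from a morph sequence to this label sequence is mechanical. The following is an illustrative sketch (the function name is ours, not code from the cited work):

```python
def segmentation_to_labels(morphs):
    """Map a morph sequence to per-character B/M/S labels.

    B = beginning of a multi-character morph,
    M = middle (or end) of a multi-character morph,
    S = single-character morph.
    """
    labels = []
    for morph in morphs:
        if len(morph) == 1:
            labels.append("S")
        else:
            labels.append("B")
            labels.extend("M" * (len(morph) - 1))
    return labels

# Finnish autoilta 'from cars', segmented auto+i+lta:
print(segmentation_to_labels(["auto", "i", "lta"]))
# -> ['B', 'M', 'M', 'M', 'S', 'B', 'M', 'M']
```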
Naturally, one can also use other label sets. Essentially, by defining more fine-grained
labels, one captures more expressive structure but begins to overfit the model to the
training data because of increasingly sparse statistics. Having defined the label
set, one can learn a segmentation model using general sequence labeling methods, such
as the well-known conditional random field (CRF) framework (Lafferty, McCallum, and
Pereira 2001).
Denoting the word form and the corresponding label sequence as x and y, respec-
tively, CRFs directly model the conditional probability of the segmentation given the
word form, that is, p(y | x; w). The model parameters w are estimated discriminatively
from the annotated data set D using iterative learning algorithms (Lafferty, McCallum,
and Pereira 2001; Collins 2002). Subsequent to estimation, the CRF model segments
a word form x by using maximum a posteriori (MAP) graph inference, that is, by solving
the optimization problem

    z = arg max_u p(u | x; w)                                    (3)

using the standard Viterbi search (Lafferty, McCallum, and Pereira 2001).
As it turns out, the CRF model can learn to segment words with a surprisingly high
accuracy from a relatively small D, that is, without utilizing any of the available unanno-
tated word forms U. In particular, Ruokolainen et al. (2013) showed that it is sufficient to
use simple left and right substring context features that are naturally accommodated by
the discriminative parameter estimation procedure. Moreover, Ruokolainen et al. (2014)
showed that the CRF-based approach can be successfully extended to the semi-supervised
learning setting in a straightforward manner via feature set expansion, utilizing the
predictions of unsupervised segmentation algorithms. With this approach, the
CRF model learns to associate the output of the unsupervised algorithms, such as the
Morfessor and adaptor grammar methods, with the surrounding substring
context.
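This feature set expansion can be illustrated with a small sketch. The function and feature names below are hypothetical, not taken from the cited implementation: for a candidate boundary position, it combines left/right substring contexts with an indicator for whether an unsupervised segmenter proposes a boundary there.

```python
def boundary_features(word, pos, delta, unsup_boundaries):
    """Left/right substring contexts of widths 1..delta around a
    candidate boundary position `pos`, plus an indicator derived
    from an unsupervised segmenter's predicted boundary set
    (e.g., from Morfessor), as a feature dictionary."""
    feats = {}
    for d in range(1, delta + 1):
        feats["left=" + word[max(0, pos - d):pos]] = 1.0
        feats["right=" + word[pos:pos + d]] = 1.0
    # Feature tying the unsupervised prediction to this position:
    feats["unsup_boundary=" + str(pos in unsup_boundaries)] = 1.0
    return feats

# Candidate boundary between auto and ilta (position 4):
print(boundary_features("autoilta", 4, 2, {4}))
```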
Ruokolainen et al.
Minimally Supervised Morphological Segmentation
Other Work. In addition to the algorithms discussed here, there exist numerous other
segmentation approaches applicable in the minimally supervised learning setting. As
the earliest example of work in this line, consider obtaining segmentations using the
classic letter successor variety (LSV) method of Harris (1955). The LSV method utilizes
the insight that the predictability of successive letters should be high within morph
segments and low at the boundaries. Consequently, a high variety of letters following
a prefix indicates a high probability of a boundary. Whereas the LSV score tracks pre-
dictability given prefixes, the same idea can be applied to suffixes, yielding the letter
predecessor variety (LPV) method. In the minimally supervised learning setting, the
LSV/LPV method can be used most straightforwardly by computing the LSV/LPV scores
from unannotated data and, subsequently, tuning the necessary threshold values on the
annotated data (Çöltekin 2010). Alternatively, one could use the LSV/LPV
values as features for a classification model, in which case the threshold values can be
learned discriminatively from the available annotated data. The latter approach is
essentially realized when the LSV/LPV scores are provided as features for the CRF model
discussed earlier (Ruokolainen et al. 2014).
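As an illustrative sketch of the LSV idea (not the implementation of Harris (1955) or Çöltekin (2010); the toy lexicon is ours), the scores can be computed by counting distinct continuation letters over a lexicon:

```python
from collections import defaultdict

def lsv_scores(word, lexicon):
    """Letter successor variety: for each proper prefix of `word`,
    count the distinct letters that follow that prefix anywhere in
    the lexicon. A high count suggests a morph boundary."""
    successors = defaultdict(set)
    for w in lexicon:
        for i in range(1, len(w)):
            successors[w[:i]].add(w[i])
    return [len(successors[word[:i]]) for i in range(1, len(word))]

lexicon = {"walk", "walks", "walked", "walking", "wall", "want"}
print(lsv_scores("walked", lexicon))
# -> [1, 2, 2, 3, 1]; the peak after the prefix 'walk'
# suggests the boundary walk+ed
```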
As for more recent work, we first refer to the generative log-linear model of Poon,
Cherry, and Toutanova (2009). Similarly to the Morfessor model family, this approach
is based on defining a joint probability distribution over the unannotated word forms
U and the corresponding segmentations S. The distribution is log-linear in form and
is denoted as p(U, S; θ), where θ is the model parameter vector. Again, similarly to
the Morfessor framework, Poon, Cherry, and Toutanova (2009) learn a morph lexicon
that is subsequently used to generate segmentations for new word forms. The learning
is controlled using prior distributions on both the corpus and the lexicon, which penalize an
exceedingly complex morph lexicon (similarly to Morfessor) and an exceedingly segmented
corpus, respectively. The log-linear form of p(U, S; θ) enables the approach to use a wide
range of overlapping features. In particular, Poon, Cherry, and Toutanova (2009) utilize a
morph-context feature set with individual features defined for each morph and morph
substring contexts. In addition to unsupervised learning, they present experiments in
the semi-supervised setting. Specifically, they accomplish this by fixing the segmen-
tations of the annotated words in D according to their gold standard segmentations. Note,
however, that this approach of extending a generative model does not necessarily utilize
the supervision efficiently, as discussed previously regarding the Morfessor method
family.
Finally, we briefly mention a range of recently published methods (Monson,
Hollingshead, and Roark 2010; Spiegler and Flach 2010; Kılıç and Bozşahin 2012; Eger
2013). The Paramor approach presented by Monson, Hollingshead, and Roark (2010)
defines a rule-based system for unsupervised learning of morphological paradigms.
The Promodes system of Spiegler and Flach (2010) defines a family of generative prob-
abilistic models for recovering segment boundaries in an unsupervised fashion. The
algorithm of Kılıç and Bozşahin (2012) is based on a generative hidden Markov model
(HMM), in which the HMM learns to generate morph sequences for given word forms
in a semi-supervised fashion. Lastly, Eger (2013) presents work on fully supervised
segmentation by exhaustive enumeration and a generative Markov model on morphs.
As for the minimally supervised learning setting, the Paramor system learns mainly
from unannotated data U and utilizes annotated data D to adjust the required threshold
value. The Promodes models can be trained either in an unsupervised manner on U or
in a supervised manner on D. The algorithm of Kılıç and Bozşahin (2012) learns mainly
from unannotated data U and incorporates supervision from the annotated corpus in
the form of manually selected statistics: the inclusion of the statistics yields a large
improvement in performance. Lastly, in their work with the supervised enumeration
approach, Eger (2013) assumes a large (on the order of tens of thousands) amount of
annotated word forms available for learning. Thus, it is left for future work to determine
whether the approach could be applied successfully in the minimally supervised learning
setting.
3.3.2 Summary. Here we summarize the fundamental differences and similarities
between the learning approaches discussed in the previous section.
Learning Lexicons versus Detecting Boundaries. We begin by dividing the methods
described earlier into two categories: lexicon-based (Creutz et al. 2007; Poon, Cherry, and
Toutanova 2009; Monson, Hollingshead, and Roark 2010; Kılıç and Bozşahin 2012;
Eger 2013; Sirts and Goldwater 2013) and boundary detection (Harris 1955; Spiegler
and Flach 2010; Ruokolainen et al. 2013). In the former, the model learns
lexical units, whereas in the latter the model learns properties of morph boundaries.
For example, in the case of Morfessor (Creutz et al. 2007) the lexical units correspond
to morphs, whereas in AGs (Sirts and Goldwater 2013) the units are parse trees. Mean-
while, the CRF approach of Ruokolainen et al. (2013) and the classical approach
of Harris (1955) identify morph boundary positions using substring contexts and
letter successor varieties, respectively. In general, whether it is easier to discover morphs
or morph boundaries is largely an empirical question. So far, only the method of Poon,
Cherry, and Toutanova (2009) has explicitly modeled both a morph lexicon and features
describing character n-grams at morpheme boundaries.
Generative versus Discriminative Learning. The second main distinction divides the
models into generative and discriminative approaches. The generative approaches
(Creutz et al. 2007; Poon, Cherry, and Toutanova 2009; Spiegler and Flach 2010; Monson,
Hollingshead, and Roark 2010; Kılıç and Bozşahin 2012; Eger 2013; Sirts and Goldwater
2013) model the joint distribution of word forms and their corresponding segmenta-
tions, whereas the discriminative approaches (Harris 1955; Ruokolainen et al. 2013) directly
estimate a conditional distribution of the segmentation given a word form. In other words,
whereas generative methods generate both word forms and segmentations, discrimi-
native methods generate only segmentations given word forms. Generative
models are naturally applicable to unsupervised learning. Meanwhile, discriminative
modeling always requires some annotated data, thus excluding the possibility of un-
supervised learning. Lastly, it appears that most lexicon-based methods are generative
and most boundary detection methods are discriminative. However, note that this is a
trend rather than a rule, as exemplified by the generative boundary detection method of
Spiegler and Flach (2010).
Semi-Supervised Learning Approaches. Both generative and discriminative models can be
extended to utilize annotated as well as unannotated data in a semi-supervised man-
ner. Sin embargo, the applicable techniques differ. For generative models, semi-supervised
learning is in principle trivial: For the labeled word forms D, the segmentation is fixed
to its correct value, as exemplified by the approaches of Poon, Cherry, and Toutanova
(2009), Spiegler and Flach (2010), and Sirts and Goldwater (2013). On the other hand,
the semi-supervised setting also makes it possible to apply discriminative techniques
to generative models. In particular, model hyperparameters can be selected to optimize
segmentation performance, rather than some generative objective, such as likelihood.
Special cases of hyperparameter selection include the weighted objective function
(Kohonen, Virpioja, and Lagus 2010), data selection (Virpioja, Kohonen, and Lagus 2011;
Sirts and Goldwater 2013), and grammar template selection (Sirts and Goldwater 2013).
As for the weighted objective function and grammar template selection, the weights
and templates are optimized to maximize segmentation accuracy. Meanwhile, data
selection is based on the observation that omitting some of the training data can improve
segmentation accuracy (Virpioja, Kohonen, and Lagus 2011; Sirts and Goldwater 2013).
For discriminative models, possibly the most straightforward semi-supervised
learning technique is adding features derived from the unlabeled data, as exem-
plified by the CRF approach of Ruokolainen et al. (2014). More generally, discriminative
semi-supervised learning is a much researched field with numerous diverse
techniques (Zhu and Goldberg 2009). For example, for the CRF model alone,
there exist several proposed semi-supervised learning approaches (Jiao et al. 2006;
Mann and McCallum 2008; Wang et al. 2009).
On Local Search. In what follows, we discuss a potential pitfall of some algorithms
that utilize local search procedures in the parameter estimation process, as exemplified
by the Morfessor model family (Creutz et al. 2007). As discussed in Section 3.3.1, the
Morfessor algorithm finds a local optimum of the objective function using a local search
procedure. This complicates model development: if two model variants perform
differently in practice, it is uncertain whether this is because one model is truly better or
merely a better fit with the utilized parameter estimation method, as discussed also by
Goldwater (2006, Section 4.2.2.3). In contrast, within the adaptor grammar
framework (Johnson, Griffiths, and Goldwater 2006; Sirts and Goldwater 2013), the
focus has not been on finding a single best model, but rather on finding the posterior
distribution over segmentations of the words. Another approach to the problem of bad
local optima is to start the local search near some known good solution. This approach
is taken in Morfessor FlatCat, for which it was found that initializing the model with
the segmentations produced by the supervised CRF model (which has a convex objective
function) yields improved results (Grönroos et al. 2014).
4. Experiments
In this section, we perform an empirical comparison of segmentation algorithms in
the semi-supervised learning setting. The purpose of the presented experiments is to
extend the current literature by considering a wider range of languages than
previous work, and by providing an in-depth error analysis.
4.1 Data
We perform the experiments on four languages, namely, English, Estonian, Finnish,
and Turkish. The English, Finnish, and Turkish data are from the Morpho Challenge
2009/2010 data set (Kurimo et al. 2009; Kurimo, Virpioja, and Turunen 2010). The
annotated Estonian data set is acquired from a manually annotated, morphologi-
cally disambiguated corpus,1 and the unannotated word forms are gathered from the
Estonian Reference Corpus (Kaalep et al. 2010). Table 3 shows the total number of
instances available for model estimation and testing.
1 Available at http://www.cl.ut.ee/korpused/morfkorpus/index.php?lang=en.
Table 3
Number of word types in the data sets.

                      English     Estonian    Finnish     Turkish
train (unannotated)   384,903     3,908,820   2,206,719   617,298
train (annotated)     1,000       1,000       1,000       1,000
development           694         800         835         763
test                  10×1,000    10×1,000    10×1,000    10×1,000
4.2 Compared Algorithms
We present a comparison of the Morfessor family (Creutz and Lagus 2002, 2005, 2007;
Kohonen, Virpioja, and Lagus 2010; Grönroos et al. 2014), the adaptor grammar frame-
work (Sirts and Goldwater 2013), and conditional random fields (Ruokolainen
et al. 2013, 2014). These methods have freely available implementations for research
purposes.
The log-linear model presented by Poon, Cherry, and Toutanova (2009) is omitted
because it does not have a freely available implementation. However, the model has
previously been compared with CRFs and Morfessor in the semi-supervised learning
setting on Arabic and Hebrew by Ruokolainen et al. (2013). In those experiments, the
model was substantially outperformed on both languages by the CRF method and on
Hebrew by Morfessor.
In order to provide a strong baseline for unsupervised learning results, we per-
formed preliminary experiments using the model presented by Lee, Haghighi, y
Barzilay (2011).2 Their model learns segmentation in an unsupervised manner by
exploiting syntactic context of word forms observed in running text and has shown
promising results for segmentation of Arabic. In practice, we found that when using the
method’s default hyperparameters, it did not yield nearly as good results as the other
unsupervised methods on our studied data sets. Adjusting the hyperparameters turns
out to be complicated by the computational demands of the method. When utilizing the
same computer set-up as for the other models, training the method requires limiting the
maximum word length of analyzed words to 12 in order for the model to fit in memory,
as well as requiring weeks of runtime for a single run. We decided to abandon further
experimentation with the method of Lee, Haghighi, and Barzilay (2011), as optimizing
its hyperparameters was computationally infeasible.
4.3 Evaluation
This section describes the utilized evaluation measures and the performed error
análisis.
4.3.1 Boundary Precision, Recall, and F1-score. The word segmentations are evaluated by
comparison with reference segmentations using boundary precision, boundary recall,
and boundary F1-score. The boundary F1-score, or F1-score for short, equals the har-
monic mean of precision (the percentage of correctly assigned boundaries with respect
2 Implementation is available at http://people.csail.mit.edu/yklee/code.html.
to all assigned boundaries) and recall (the percentage of correctly assigned boundaries
with respect to the reference boundaries):
    Precision = C(correct) / C(proposed),                        (4)

    Recall = C(correct) / C(reference).                          (5)
We follow Virpioja et al. (2011) and use type-based macro-averages. However, we
handle word forms with alternative analyses in a different fashion. Instead of penalizing
algorithms that propose an incorrect number of alternative analyses, we take the best
match over the alternative reference analyses (separately for precision and recall). Este
is because all the methods considered in the experiments provide a single segmentation
per word form.
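A minimal per-word sketch of these measures follows; the function name and the example boundaries are ours, and the actual evaluation additionally macro-averages over word types and takes the best match over alternative reference analyses:

```python
def boundary_prf(predicted, reference):
    """Boundary precision, recall, and F1 for a single word.
    Boundaries are given as sets of positions between characters."""
    pred, ref = set(predicted), set(reference)
    correct = len(pred & ref)
    precision = correct / len(pred) if pred else 1.0
    recall = correct / len(ref) if ref else 1.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Gold auto+i+lta has boundaries after positions 4 and 5;
# a prediction auto+ilta recovers only the first one:
print(boundary_prf({4}, {4, 5}))  # precision 1.0, recall 0.5
```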
Throughout the experiments, we establish statistical significance at confidence
level 0.95, according to the standard one-sided Wilcoxon signed-rank test performed on
10 random subsets of 1,000 word forms drawn from the complete test sets (subsets may
contain overlapping word forms).
Because we apply a different treatment of alternative analyses, the results reported
in this article are not directly comparable to the boundary F1-scores reported for the
Morpho Challenge competitions (Kurimo et al. 2009; Kurimo, Virpioja, and Turunen
2010). However, the best boundary F1-scores for all languages reported in Morpho
Challenge have been achieved with the semi-supervised Morfessor Baseline algorithm
(Kohonen, Virpioja, and Lagus 2010), which is included in the current experiments.
4.3.2 Error Analysis. We next discuss the performed error analysis. The purpose of the
error analysis is to gain a more detailed understanding of what kinds of errors the
methods make, and how the error types affect the overall F1-scores. To this end, we use
a categorization of morphs into the categories PREFIX, STEM, and SUFFIX, in addition to
defining a separate category for DASH. For the English and Finnish sections of the
Morpho Challenge data set, the gold standard segmentation annotations contain addi-
tional information for each morph, such as part of speech for stems and morphological
categories for affixes, which allows us to assign each morph to one of the morph
type categories. In some rare cases the tagging is not specific enough, and we choose
to assign the tag UNKNOWN. However, as we are evaluating segmentations, we lack
the morph category information for the proposed analyses. Consequently, we cannot
apply a straightforward category evaluation metric, such as a category F1-score. In what
follows, we instead show how to use the categorization on the gold standard side to
characterize the segmentation errors.
We first observe that errors come in two kinds: over-segmentation and under-
segmentation. In over-segmentation, boundaries are incorrectly assigned within morph
segments, and in under-segmentation, the segmentation fails to uncover correct morph
boundaries. For example, consider the English compound word form girlfriend with the
correct analysis girl+friend. An under-segmentation error occurs in the event the
model fails to assign a boundary between the segments girl and friend. Meanwhile,
over-segmentation errors take place if any boundaries are assigned within the two
compound segments girl and friend, as in g+irl or fri+end.
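The two error types can be read off directly from boundary position sets. A small illustrative sketch (the function name and example positions are ours):

```python
def segmentation_errors(predicted, reference):
    """Split boundary errors into over-segmentation (spurious
    predicted boundaries) and under-segmentation (missed
    reference boundaries), each as a set of positions."""
    pred, ref = set(predicted), set(reference)
    return {"over": pred - ref, "under": ref - pred}

# girlfriend, gold girl+friend (boundary after position 4);
# a prediction g+irlfriend (boundary after position 1):
print(segmentation_errors({1}, {4}))
# -> {'over': {1}, 'under': {4}}: the boundary inside 'girl' is
# over-segmentation, the missed compound boundary under-segmentation
```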
As for the relationship between these two error types and the precision and re-
call measures in Equations (4) y (5), we note that over-segmentation solely affects
precision, whereas under-segmentation only affects recall. This is evident as the mea-
sures can be written equivalently as:

    Precision = [C(proposed) − C(over-segm.)] / C(proposed)
              = 1 − C(over-segm.) / C(proposed),                 (6)

    Recall = [C(reference) − C(under-segm.)] / C(reference)
           = 1 − C(under-segm.) / C(reference).                  (7)
In the error analysis, we use these equivalent expressions as they allow us to examine
the effect of reduction in precision and recall caused by over-segmentation and under-
segmentation, respectivamente.
The over-segmentation errors occur when a segment that should remain intact
is split. Thus, these errors can be assigned to categories c according to the morph
tags PREFIX, STEM, SUFFIX, and UNKNOWN. The segments in the category DASH cannot
be segmented and do not, therefore, contribute to over-segmentation errors. We then
decompose the precision and recall reductions in Equations (6) and (7) into those caused
by errors in each category indexed by c and d:

    Precision = 1 − Σ_c C(over-segm.(c)) / C(proposed),          (8)

    Recall = 1 − Σ_d C(under-segm.(d)) / C(reference).           (9)

Equation (8) holds because

    C(over-segm.) / C(proposed) = [Σ_c C(over-segm.(c))] / C(proposed)
                                = Σ_c C(over-segm.(c)) / C(proposed),    (10)
where c indexes the over-segmentation error categories. The expression for recall in
Equation (9) can be derived analogously, but it must be noted that the categorization d
by error type differs from that of precision, as each under-segmentation error occurs at
a segment boundary, such as STEM-SUFFIX, STEM-STEM, or PREFIX-STEM, rather than in the
middle of a segment. To simplify the analysis, we have grouped all segment boundaries
in which either the left or the right segment category is DASH into the CONTAINS DASH
category. Boundary types that occur fewer than 100 times in the test data are merged
into the OTHER category.
Table 4 shows the occurrence frequency of each boundary category, averaged
over the alternative analyses. Evidently, we expect the total precision scores to be most
influenced by over-segmentation of the STEM and SUFFIX segment types because of their
high frequencies. Similarly, the overall recall scores are expected to be most impacted
by under-segmentation of STEM-SUFFIX and SUFFIX-SUFFIX boundaries. Finnish is also
substantially influenced by the STEM-STEM boundary, indicating that Finnish uses com-
pounding frequently.
For simplicity, when calculating the error analysis, we forgo the sampling procedure
of taking 10 × 1,000 word forms from the test set, used for the overall F1-score for
statistical significance testing by Virpioja et al. (2011). Rather, we calculate the error
analysis on the union of these sampled sets. As the sampling procedure may introduce
Table 4
Absolute and relative frequencies of the boundary categories in the error analysis. The numbers
are averaged over the alternative analyses in the reference annotation.

Category         English               Finnish
STEM             38,608.8   (82.2%)    72,666.0   (81.3%)
SUFFIX            7,172.9   (15.3%)    15,384.9   (17.2%)
PREFIX            1,152.8    (2.5%)       946.5    (1.1%)
UNKNOWN              54.5    (0.1%)       414.0    (0.5%)
STEM-SUFFIX       5,349.2   (62.6%)     9,889.9   (45.8%)
SUFFIX-SUFFIX     1,481.0   (17.3%)     5,917.5   (27.4%)
STEM-STEM           613.4    (7.2%)     3,538.0   (16.4%)
SUFFIX-STEM           n/a      n/a      1,501.0    (6.9%)
CONTAINS DASH       458.0    (6.5%)       426.0    (2.0%)
PREFIX-STEM         554.3    (5.4%)       235.2    (1.1%)
OTHER                91.0    (1.1%)       105.4    (0.5%)
the same word form in several samples, the error analysis precisions and recalls are not
necessarily identical to the ones reported for the overall results.
In summary, although we cannot apply category F1-scores, we can instead catego-
rize each error by type. These categories then map directly to either reduced precision
or recall. Interpreting precision and recall requires some care, as it is always possible to
reduce over-segmentation errors by segmenting less and, conversely, to reduce under-
segmentation errors by segmenting more. However, if this is taken into account, the
error categorization can be quite informative.
4.4 Model Learning and Implementation Specifics
4.4.1 Morfessor. We use a recently released Python implementation of the Morfessor
method (Virpioja et al. 2013; Smit et al. 2014).3 The package implements both the
unsupervised and semi-supervised Morfessor Baseline (Creutz and Lagus 2002, 2007;
Kohonen, Virpioja, and Lagus 2010). For Morfessor FlatCat we apply the Python imple-
mentation by Grönroos et al. (2014).4
In its original formulation, the unsupervised Morfessor Baseline uses no hyper-
parameters. However, Virpioja, Kohonen, and Lagus (2011) found that perfor-
mance does not improve consistently with growing data because the method segments
less on average for each added training word form. Therefore, we optimize the training
data size by including only the most frequent words, with the following sizes: 10k, 20k,
30k, 40k, 50k, 100k, 200k, 400k, . . . , as well as the full set. We then choose the model
yielding the highest F1-score on the development set.
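This frequency-based truncation of the training vocabulary can be sketched as follows; the helper function and toy counts are illustrative, not part of the Morfessor package:

```python
from collections import Counter

def frequency_truncated_vocabularies(word_counts, sizes):
    """Yield candidate training vocabularies containing only the
    n most frequent word types, for each candidate size n."""
    ranked = [w for w, _ in Counter(word_counts).most_common()]
    for n in sizes:
        yield ranked[:n]

counts = {"auto": 50, "autoilta": 3, "talo": 20, "taloissa": 5}
for vocab in frequency_truncated_vocabularies(counts, [2, 4]):
    print(vocab)
```

Each candidate vocabulary would then be used to train a model, and the one with the highest development-set F1-score kept.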
As for semi-supervised training of Morfessor Baseline, we perform a grid search on
the development set for the hyperparameter β (mira la sección 3.3.1). For each value of β we
use the automatic adaptation of the hyperparameter α provided by the implementation.
The automatic adaptation procedure is applied during model training and is, por lo tanto,
computationally less demanding than grid search. Intuitively, the adaptation
3 Available at https://github.com/aalto-speech/morfessor.
4 Available at https://github.com/aalto-speech/flatcat.
functions as follows. The hyperparameter α affects how much the method segments on
average. Although optimizing it for segmentation performance during training is non-
trivial, one can instead apply the heuristic that the method should neither over-segment
nor under-segment. Therefore, the implementation adjusts α such that the development
set precision and recall become approximately equal.
In the semi-supervised training for Morfessor FlatCat, the segmentations are ini-
tialized to the ones produced by the supervised CRF model trained with the same
amount of labeled training data. As automatic adaptation of the hyperparameter α has
not yet been implemented for Morfessor FlatCat, values for both α and β are found
by a combined grid search on the development set. The computational demands of
the grid search were reduced by using the optimal hyperparameter values for Mor-
fessor Baseline as an initial guess when constructing the grid. We also choose the non-
morpheme removal heuristics used by Morfessor FlatCat for each language separately
using the development set. For English, Estonian, and Finnish the heuristics described
by Grönroos et al. (2014) are beneficial, but they do not fit Turkish morphology as well.
For Turkish we convert non-morphemes into suffixes or stems, without modifying the
segmentation.
4.4.2 Adaptor Grammars. The technical details of the AG model are described by
Johnson, Griffiths, and Goldwater (2006) and the inference details are described by
Johnson, Griffiths, and Goldwater (2007). For unsupervised AG learning, we used the
freely available implementation,5 which was also the basis for the semi-supervised
implementación. Table label resampling was turned on and all hyperparameters were
inferred automatically as described by Johnson and Goldwater (2009). The metagram-
mar for AG Select is the same as described by Sirts and Goldwater (2013). Inductive
learning with the posterior grammar was done with a freely available CKY parser.6 For
both unsupervised and semi-supervised AG, we use a three-level collocation-submorph
grammar in which the final segmentation is parsed out as a sequence of Morphs:
Word → Colloc+
Colloc → Morph+
Morph → SubMorph+
SubMorph → Char+
We experimented with two types of grammars, where the Word non-terminal is
either cached or not. These two grammar versions have no difference when trained
transductively. Sin embargo, when training an inductive model, it may be beneficial to store
the subtrees corresponding to whole words because these trees can be used to parse the
words in the test set that were seen during training with a single rule. All models, ambos
unsupervised and semi-supervised, are trained on the 50k most frequent word types. For
semi-supervised experiments, we upweight the labeled data by an integer number of
times by repeatedly caching the subtrees corresponding to morphemes in the annotated
datos. The additional cached subtrees are rooted in the Morph non-terminal. Similarly to
semi-supervised Morfessor, we experimented with initializing the segmentations with
5 Available at http://web.science.mq.edu.au/~mjohnson/Software.htm.
6 Also obtained from http://web.science.mq.edu.au/~mjohnson/Software.htm.
the output of the supervised CRF model, which in some cases resulted in improved
accuracy over the random initialization. We searched the optimal values for each ex-
periment for the upweighting factor, cached versus non-cached root non-terminal, y
random versus CRF initialization on the development set.
An AG model is stochastic, and each segmentation result is just a single sample from
the posterior. A common approach in such a case is to take several samples and report
the average result. Maximum marginal decoding (MMD) (Johnson and Goldwater 2009;
Stallard et al. 2012), which constructs a marginal distribution from several independent
samples and returns its maximum, has been shown to improve the results of sampling-
based models by about 1–2 percentage points. Although the AG model uses sampling for
training, the MMD is not applicable here because during test time the segmentations
are obtained using parsing. Sin embargo, we propose another way of achieving the gain
in a similar range to the MMD. We train five different models and concatenate their
posterior grammars into a single joint grammar, which is then used as the final model to
decode the test data. Our experiments show that the posterior grammar concatenation,
similarly to the MMD, leads to consistent improvements of 1–2 percentage points over
the mean of the individual samples.
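Schematically, pooling the posterior grammars amounts to summing the rule counts of the independently trained models; the pooled counts then define the joint grammar used for decoding. The rule-to-count mapping below is an illustrative stand-in for the actual AG posterior-grammar format.

```python
from collections import Counter

def concatenate_grammars(grammars):
    """Combine posterior grammars from independently trained AG models
    by pooling their rule counts.

    Each grammar is represented here as a mapping from a rule (string)
    to its count; the summed counts define the joint grammar used as
    the final model when parsing the test data.
    """
    joint = Counter()
    for grammar in grammars:
        joint.update(grammar)  # Counter.update adds counts
    return joint
```

The relative rule probabilities of the joint grammar are thus an average of the five models, which plays a role analogous to averaging over samples in MMD.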
4.4.3 CRFs. The utilized Python implementation of the CRF model follows the presen-
tation of Ruokolainen et al. (2013, 2014).7 As for the left and right substring features
incorporated in the model, we include all substrings that occur in the training data. El
maximum substring length and averaged perceptron learning of CRF model parameters
are optimized on the held-out development sets following Ruokolainen et al. (2013). Para
semi-supervised learning, we utilize log-normalized successor and predecessor variety
scores and binary Morfessor Baseline and AG features following the presentation of
Ruokolainen et al. (2014). The unsupervised Morfessor Baseline and AG models are
optimized on the development set as described earlier. The successor and predecessor
variety scores are estimated from all the available unannotated word forms apart from
words with a corpus frequency of one. The count cutoff is applied as a means of noise
reduction by removing peripheral phenomena, such as misspellings.
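A minimal sketch of Harris-style successor variety, one of the semi-supervised feature sources above, is given below. The function names are illustrative; the exact normalization used by Ruokolainen et al. (2014) may differ, and predecessor variety is obtained by applying the same computation to reversed word forms.

```python
import math
from collections import defaultdict

def successor_variety(words):
    """Number of distinct letters that follow each prefix in the word list.

    A peak in successor variety at a position suggests a morph boundary
    (Harris 1955). Returns a dict mapping each prefix to its count.
    """
    successors = defaultdict(set)
    for word in words:
        for i in range(len(word)):
            successors[word[:i]].add(word[i])
    return {prefix: len(chars) for prefix, chars in successors.items()}

def log_sv_scores(word, sv):
    """Log-normalized successor variety score for each boundary candidate
    of `word` (position i = boundary between word[:i] and word[i:])."""
    return [math.log(sv.get(word[:i], 1)) for i in range(1, len(word))]
```

For example, in a list containing walk, walks, walked, and walker, the prefix "walk" is followed by several distinct letters while "wal" is followed only by "k", so the score peaks at the stem boundary.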
4.5 Resultados
Here we summarize the results obtained using our experimental set-up. We present overall
segmentation accuracies and error analysis in Sections 4.5.1 and 4.5.2, respectively.
We then discuss the results in Section 4.6.
4.5.1 Boundary Precisions, Recalls, and F1-scores. In what follows, we first review unsuper-
vised and supervised results and, subsequently, assess the semi-supervised results.
Segmentation accuracies using unsupervised and supervised methods are pre-
sented in Table 5. As for the supervised learning using the CRF model, nosotros reportamos
segmentation accuracies obtained using 100 y 1,000 annotated word forms. Evidently,
utilizing annotated data provides a distinct advantage over learning from unannotated
datos. Particularly, learning the supervised CRFs using 1,000 annotated word forms
results in substantially higher segmentation accuracies compared with learning in an
unsupervised manner from hundreds of thousands or millions of word forms. De hecho,
using merely 100 annotated instances results in higher accuracies in English and Turkish
7 Available at http://users.ics.aalto.fi/tpruokol/.
Ligüística computacional
Volumen 42, Número 1
Mesa 5
Precision, recordar, and F1-scores for unsupervised and supervised methods.
Método
Tren (ann.)
Tren (unann.)
Pre.
Rec.
F1
Inglés
MORFESSOR BASELINE (USV)
AG (USV)
CRF (SV)
CRF (SV)
Estonian
MORFESSOR BASELINE (USV)
AG (USV)
CRF (SV)
CRF (SV)
Finnish
MORFESSOR BASELINE (USV)
AG (USV)
CRF (SV)
CRF (SV)
Turkish
MORFESSOR BASELINE (USV)
AG (USV)
0
0
100
1,000
0
0
100
1,000
0
0
100
1,000
0
0
384,903
384,903
0
0
3,908,820
3,908,820
0
0
2,206,719
2,206,719
0
0
617,298
617,298
76.3
62.2
86.0
91.6
76.4
78.4
79.2
88.4
70.2
68.1
73.0
88.3
67.9
72.7
76.3
84.4
72.7
81.2
70.4
73.4
59.1
76.7
51.9
68.1
59.4
79.7
65.8
76.5
76.3
71.7
78.8
86.1
73.3
75.8
67.7
82.1
59.7
68.1
65.5
83.8
66.8
74.6
CRF (SV)
CRF (SV)
The columns titled Train (unann.) denote the number of unannotated word forms utilized in
aprendiendo. The columns titled Train (ann.) denote the number of annotated word forms.
100
1,000
71.8
87.3
77.7
88.6
84.6
90.0
0
0
compared with the unsupervised methods. The balance between precision and recall
can be analyzed to assess how well the different methods are tuned to the amount of
segmentation present in the gold standard. As discussed in Section 4.3.2, high precision
in combination with low recall indicates under-segmentation, whereas high recall and
low precision indicates over-segmentation. Morfessor appears to favor precision over
recordar (see Finnish) in the event a trade-off takes place. A diferencia de, the AG heavily favors
recordar (see English). Mientras tanto, the supervised CRF model consistently prefers higher
precision over recall.
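The boundary-level metrics underlying this diagnosis can be computed as in the following sketch, which represents each segmentation as a list of morphs. For simplicity the sketch micro-averages over all boundaries; the figures reported in the article use macro averaging with a special convention for words without boundaries (see the footnote on the baselines).

```python
def boundaries(morphs):
    """Boundary positions implied by a segmentation,
    e.g. ['product', 's'] -> {7}."""
    positions, total = set(), 0
    for morph in morphs[:-1]:
        total += len(morph)
        positions.add(total)
    return positions

def boundary_prf(gold, predicted):
    """Micro-averaged boundary precision, recall, and F1.

    `gold` and `predicted` are parallel lists of segmentations (lists of
    morphs). High precision with low recall indicates under-segmentation;
    high recall with low precision indicates over-segmentation.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        gb, pb = boundaries(g), boundaries(p)
        tp += len(gb & pb)   # correctly placed boundaries
        fp += len(pb - gb)   # spurious boundaries (hurt precision)
        fn += len(gb - pb)   # missed boundaries (hurt recall)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```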
These unsupervised and supervised learning results utilize the available data only
parcialmente. De este modo, we next discuss results obtained using semi-supervised learning—that
es, when utilizing all available annotated and unannotated word forms. The obtained
segmentation accuracies are presented in Table 6. We summarize the results as follows.
Primero, the semi-supervised CRF approach CRF (SSV) yielded highest segmentation ac-
curacies for all considered languages and data set sizes. The improvements over other
models are statistically significant. Compared with the supervised CRF model, the semi-
supervised extension successfully increases the recall while maintaining the high preci-
sión. As for the Morfessor family, MORF.FC (SSV) yields significantly higher F1-scores
compared with MORF.BL (SSV) on all languages. Sin embargo, we found that without the
CRF initialization of MORF.FC (SSV), the performance gap decreases substantially (cf.
similar results reported by Grönroos et al. [2014]). Por otro lado, the variants
appear to behave in a similar manner in that, in the majority of cases, both approaches
increase the obtained precision and recall in a balanced manner compared with the
unsupervised approach MORF. BL (USV). Mientras tanto, the AG variants AG (SSV) y
Mesa 6
Precision, recordar, and F1-scores for semi-supervised methods.
Método
Tren (ann.)
Tren (unann.)
Pre.
Rec.
F1
Inglés
MORFESSOR BASELINE (SSV)
MORFESSOR FLATCAT (SSV)
AG (SSV)
AG SELECT (SSV)
CRF (SSV)
MORFESSOR BASELINE (SSV)
MORFESSOR FLATCAT (SSV)
AG (SSV)
AG SELECT (SSV)
CRF (SSV)
Estonian
MORFESSOR BASELINE (SSV)
MORFESSOR FLATCAT (SSV)
AG (SSV)
AG SELECT (SSV)
CRF (SSV)
MORFESSOR BASELINE (SSV)
MORFESSOR FLATCAT (SSV)
AG (SSV)
AG SELECT (SSV)
CRF (SSV)
Finnish
MORFESSOR BASELINE (SSV)
MORFESSOR FLATCAT (SSV)
AG (SSV)
AG SELECT (SSV)
CRF (SSV)
MORFESSOR BASELINE (SSV)
MORFESSOR FLATCAT (SSV)
AG (SSV)
AG SELECT (SSV)
CRF (SSV)
Turkish
MORFESSOR BASELINE (SSV)
MORFESSOR FLATCAT (SSV)
AG (SSV)
AG SELECT (SSV)
CRF (SSV)
MORFESSOR BASELINE (SSV)
MORFESSOR FLATCAT (SSV)
AG (SSV)
AG SELECT (SSV)
CRF (SSV)
100
100
100
100
100
1,000
1,000
1,000
1,000
1,000
100
100
100
100
100
1,000
1,000
1,000
1,000
1,000
100
100
100
100
100
1,000
1,000
1,000
1,000
1,000
100
100
100
100
100
1,000
1,000
1,000
1,000
1,000
384,903
384,903
384,903
384,903
384,903
384,903
384,903
384,903
384,903
384,903
3,908,820
3,908,820
3,908,820
3,908,820
3,908,820
3,908,820
3,908,820
3,908,820
3,908,820
3,908,820
2,206,719
2,206,719
2,206,719
2,206,719
2,206,719
2,206,719
2,206,719
2,206,719
2,206,719
2,206,719
617,298
617,298
617,298
617,298
617,298
617,298
617,298
617,298
617,298
617,298
81.7
83.6
69.0
75.9
87.6
84.4
86.9
69.8
76.7
89.3
77.0
81.8
71.8
60.9
81.5
80.6
84.7
67.1
62.8
90.2
69.8
77.6
65.5
66.8
80.0
76.0
81.6
69.7
69.4
89.3
76.6
80.2
74.1
69.0
81.3
85.1
84.9
77.0
70.5
89.3
82.8
83.0
85.8
79.4
81.0
83.9
85.2
87.1
82.3
87.0
76.1
74.5
75.5
90.4
82.1
80.7
82.0
88.8
90.3
86.3
70.8
73.6
70.5
73.6
77.4
78.0
80.2
77.6
74.3
87.9
80.5
83.9
82.8
82.3
86.0
89.4
92.2
90.9
80.4
92.0
82.2
83.3
76.5
77.6
84.2
84.1
86.0
77.5
79.4
88.1
76.5
77.9
73.6
72.8
81.8
80.7
83.3
76.4
74.1
88.2
70.3
75.5
67.9
70.0
78.7
77.0
80.9
73.4
71.8
88.6
78.5
82.0
78.2
75.0
83.5
87.2
88.4
83.4
75.1
90.7
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
mi
d
tu
/
C
oh
yo
i
/
yo
a
r
t
i
C
mi
–
pag
d
F
/
/
/
/
4
2
1
9
1
1
8
0
5
5
8
8
/
C
oh
yo
i
_
a
_
0
0
2
4
3
pag
d
.
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
111
Ligüística computacional
Volumen 42, Número 1
AG SELECT (SSV) heavily favor recall over precision, indicating over-segmentation.8
Por último, in contrast with the unsupervised learning results, in the semi-supervised setting
the AG framework is significantly outperformed by the Morfessor variants.
4.5.2 Análisis de errores. Próximo, we examine how different error types contribute to the ob-
tained precision and recall measures, y, como consecuencia, the overall F1-scores. To this
end, we discuss the error analyses for English and Finnish presented in Tables 7 y 8,
respectivamente.
Líneas de base. The first two lines in Tables 7 y 8 present the baseline models WORDS and
LETTERS. The WORDS model corresponds to an approach in which no segmentation is
performed, eso es, all the word forms are kept intact. The LETTERS approach assigns a
segment boundary between all adjacent letters. These approaches maximize precision
(WORDS) and recall (LETTERS) at the cost of the other. En otras palabras, no model can
produce more over-segmentation errors compared with LETTERS, and no model can
produce more under-segmentation errors compared with WORDS.9
Given the baseline results, we observe that the overall precision scores are most
influenced by over-segmentation of STEM and SUFFIX segment types because of their
high frequencies. Similarmente, the overall recall scores are most impacted by under-
segmentation of STEM-SUFFIX and SUFFIX-SUFFIX boundaries. Finnish recall is also
substantially influenced by the STEM-STEM boundary, indicating that Finnish uses
compounding frequently.
Morfessor. Similarly to the baseline (WORDS and LETTERS) resultados, the majority of over-
segmentation errors yielded by the Morfessor variants take place within the STEM and
SUFFIX segments, and most under-segmentation errors occur at the STEM-SUFFIX and
SUFFIX-SUFFIX boundaries. When shifting from unsupervised learning using MORF.BL
(USV) to semi-supervised learning using MORF.BL (SSV) and MORF.FC (SSV), the over-
segmentation problems are alleviated rather substantially, resulting in higher over-
all precision scores. Por ejemplo, consider the word form countermanded, para cual
MORF.BL (SSV) assigns the correct segmentation countermand+ed, but which is severely
oversegmented by MORF.BL (USV) as counter+man+d+ed. One also observes a dramatic
increase in the overall recall scores, indicating a smaller amount of under-segmentation
taking place. Por ejemplo, consider the word form products, for which MORF.BL (SSV)
assigns the correct segmentation product+s, but for which MORF.BL (USV) assigns no
boundaries. Sin embargo, the under-segmentation errors do not decrease consistently:
Although the STEM-SUFFIX and SUFFIX-SUFFIX errors are decreased substantially, uno
additionally observes a decline or no change in the model’s ability to uncover STEM-
STEM and PREFIX-STEM boundaries.
8 Generally, in the presence of annotated training data, under-segmentation and over-segmentation can be
avoided by explicitly tuning the average level of segmentation. Such tuning is performed for Morfessor
with the weighted objective function and for AG by choosing the level in the parse tree from which to
extract the segmentations. By default, the AG segmentations were extracted from the Morph level as this
gave the highest score on the development set. Sin embargo, the Estonian segmentations are extracted from
the Colloc level, which also explains why in the Estonian case the precision is higher than recall. Estos
results suggest that AG (SSV) may benefit from yet another layer in the grammar, which would help to
learn a better balance between precision and recall.
9 Intuitivamente, WORDS should yield zero recall. Sin embargo, when applying macro averaging, a word having a
gold standard analysis with no boundaries yields a zero denominator and is therefore undefined. A
correct for this, we interpret such words as having recall 1, which explains the non-zero recall for WORDS.
Mesa 7
Error analysis for English.
Over-Segmentation
Under-Segmentation
Método
WORDS
LETTERS
MORF.BL (USV)
MORF.BL (SSV)
MORF.FC (SSV)
AG (USV)
AG (SSV)
AG SELECT (SSV)
CRF (SV)
CRF (SSV)
METRO
mi
t
S
0.0
71.1
20.6
14.3
11.2
31.8
27.8
18.4
7.3
9.6
X
I
F
F
Ud.
S
0.0
11.8
2.9
1.3
1.7
5.8
2.1
4.8
0.9
0.8
X
I
F
mi
R
PAG
0.0
1.7
0.0
0.1
0.0
0.1
0.1
0.0
0.1
0.0
norte
W.
oh
norte
k
norte
Ud.
0.0
0.3
0.1
0.0
0.1
0.0
0.1
0.1
0.0
0.1
l
A
t
oh
t
/
mi
R
PAG
100.0
15.1
76.4
84.4
87.1
62.3
70.0
76.6
91.8
89.5
X
I
F
F
Ud.
S
–
METRO
mi
t
S
55.1
0.0
17.0
9.8
8.6
10.6
10.1
8.2
10.4
8.4
X
I
F
F
Ud.
S
–
X
I
F
F
Ud.
S
8.6
0.0
4.7
0.6
0.5
3.5
1.4
1.4
0.5
0.5
METRO
mi
t
S
–
METRO
mi
t
S
5.9
0.0
0.6
2.8
2.2
0.1
0.2
2.2
4.2
1.4
METRO
mi
t
S
–
X
I
F
mi
R
PAG
4.4
0.0
1.1
2.1
2.5
0.7
0.6
4.1
2.9
1.9
h
S
A
D
S
norte
I
A
t
norte
oh
C
2.5
0.0
0.0
0.0
0.1
0.2
0.2
1.5
0.1
0.0
l
A
t
oh
t
/
C
mi
R
23.1
100.0
76.4
84.3
85.5
84.7
87.3
82.2
81.5
87.4
R
mi
h
t
oh
0.6
0.0
0.2
0.4
0.4
0.2
0.2
0.4
0.4
0.4
Over-segmentation and under-segmentation errors reduce precision and recall, respectivamente. Para
ejemplo, the total precision of MORF. BL (USV) is obtained as 100.0 − 20.6 − 2.9 − 0.0 − 0.1 = 76.4.
The lines MORF. BL (USV), MORF. BL (SSV), and MORF. FC (SSV) correspond to the unsupervised
Morfessor Baseline, semi-supervised Morfessor Baseline, and semi-supervised Morfessor FlatCat
modelos, respectivamente.
Mesa 8
Error analysis for Finnish.
Over-Segmentation
Under-Segmentation
Método
WORDS
LETTERS
MORF.BL (USV)
MORF.BL (SSV)
MORF.FC (SSV)
AG (USV)
AG (SSV)
AG SELECT (SSV)
CRF (SV)
CRF (SSV)
METRO
mi
t
S
0.0
65.2
26.6
20.8
15.3
28.3
27.9
24.2
9.3
9.2
X
I
F
F
Ud.
S
0.0
13.8
3.4
2.9
2.9
3.3
2.1
6.1
2.3
1.4
X
I
F
mi
R
PAG
0.0
0.7
0.0
0.0
0.0
0.1
0.1
0.0
0.0
0.0
norte
W.
oh
norte
k
norte
Ud.
0.0
0.6
0.2
0.2
0.1
0.2
0.2
0.1
0.0
0.1
l
A
t
oh
t
/
mi
R
PAG
100.0
19.7
69.7
76.1
81.7
68.1
69.7
69.5
88.3
89.3
X
I
F
F
Ud.
S
–
METRO
mi
t
S
49.2
0.0
28.8
13.6
12.2
19.0
14.7
13.2
10.7
8.0
X
I
F
F
Ud.
S
–
X
I
F
F
Ud.
S
21.8
0.0
17.1
5.9
5.2
11.5
6.5
7.8
2.2
2.3
METRO
mi
t
S
–
METRO
mi
t
S
17.2
0.0
1.7
1.9
1.5
0.7
0.7
2.4
5.8
1.2
METRO
mi
t
S
–
X
I
F
F
Ud.
S
4.8
0.0
0.5
0.5
0.6
0.2
0.2
1.1
1.1
0.4
h
S
A
D
S
norte
I
A
t
norte
oh
C
1.4
0.0
0.0
0.0
0.1
0.3
0.1
0.8
0.1
0.1
METRO
mi
t
S
–
X
I
F
mi
R
PAG
1.0
0.0
0.1
0.1
0.1
0.0
0.1
0.2
0.3
0.1
l
A
t
oh
t
/
C
mi
R
4.1
100.0
51.6
78.0
80.2
68.1
77.6
74.4
79.7
87.8
R
mi
h
t
oh
0.6
0.0
0.2
0.1
0.1
0.2
0.1
0.1
0.2
0.2
Over-segmentation and under-segmentation errors reduce precision and recall, respectivamente. El
lines MORF. BL (USV), MORF. BL (SSV), and MORF. FC (SSV) correspond to the unsupervised
Morfessor Baseline, the semi-supervised Morfessor Baseline, and semi-supervised Morfessor
FlatCat models, respectivamente.
Adaptor Grammars. Similarly to the baseline and Morfessor results, the majority of
over-segmentation errors yielded by the AG variants occur within the STEM and
SUFFIX segments. Compared with the unsupervised AG (USV) acercarse, the first
semi-supervised extension AG (SSV) manages to reduce over-segmentation of the STEM
segments slightly and SUFFIX segments substantially, thus resulting in overall higher
precisión. Mientras tanto, the second extension AG SELECT (SSV) also results in overall
higher precision by reducing over-segmentation of STEM segments substantially—
a pesar de, for Finnish, SUFFIX is oversegmented compared with AG SELECT (USV).
Por otro lado, whereas both AG (SSV) and AG SELECT (SSV) improve recall
on Finnish compared to AG (USV), only AG (SSV) succeeds in improving recall for
Inglés. This is because the AG SELECT (SSV) variant decreases the model’s ability
to capture other than STEM-SUFFIX and SUFFIX-SUFFIX boundaries compared with the
unsupervised AG (USV) acercarse.
Conditional Random Fields. In contrast to the Morfessor and AG frameworks, the error
patterns produced by the CRF approach do not directly follow the baseline approaches.
Particularly, we note that the supervised CRF (SV) approach successfully captures
SUFFIX-SUFFIX boundaries and fails to find STEM-STEM boundaries—that is, behaves in
an opposite manner compared with the baseline results. CRF (SV) also under-segments
the less-frequent PREFIX-STEM and SUFFIX-STEM boundaries for English and Finnish,
respectively. Meanwhile, the semi-supervised extension CRF (SSV) substantially alleviates
the problem of finding STEM-STEM boundaries, resulting in improvement in overall
recordar. Por ejemplo, CRF (SSV) correctly segments compound forms rainstorm and wind-
pipe as rain+storm and wind+pipe, whereas CRF (SV) incorrectly assigns no segmentation
boundaries to either of these forms. Note that improving recall means that CRF (SSV)
is required to segment more compared with CRF (SV). Para ingles, this increased seg-
mentation results in a slight increase in over-segmentation of STEM—that is, el modelo
trades off the increase in recall for precision. Por ejemplo, whereas CRF (SV) correctly
segments ledgers as ledger+s, CRF (SSV) yields an incorrect segmentation led+ger+s.
4.6 Discusión
When increasing the amount of data utilized for learning—that is, when shifting from
fully unsupervised or supervised learning to semi-supervised learning—we naturally
expect the segmentation method families to improve their performance measured using
the F1-score. En efecto, as shown in Tables 5 y 6, this improvement takes place within
all considered approaches. En algunos casos, as exemplified by the CRF model on English,
achieving a higher F1-score may require a trade-off between precision and recall—that
es, the model lowers precision somewhat to gain recall (or vice versa). Sin embargo, por
examining the error analyses in Tables 7 y 8, we also observe the occurrence of a
second kind of trade-off, in which the semi-supervised Morfessor and AG approaches
trade off under-segmentation errors for other under-segmentation errors. Particularly,
although the STEM-SUFFIX and SUFFIX-SUFFIX boundary recall errors are decreased, uno
also observes an increase in the errors at STEM-STEM and PREFIX-STEM boundaries.
This type of behavior indicates an inherent inefficiency in the models’ ability to utilize
increasing amounts of data.
Próximo, we discuss potential explanations for the empirical success of the discrimina-
tively trained CRF approach. Primero, discriminative training has the advantage of directly
optimizing segmentation accuracy with few assumptions about the data generating pro-
impuesto. Mientras tanto, generative models can be expected to perform well only if the model
definition matches the data-generating process adequately. En general, discriminative
approaches should generalize well under the condition that a sufficient amount of train-
ing data is available. Given the empirical results, this condition appears to be fulfilled
for morphological segmentation in the minimally supervised setting. Segundo, the CRFs
aim to detect boundary positions based on rich features describing substring contexts.
Because the substrings are more frequent than lexical units, their use enables more
efficient utilization of sparse data. For example, consider training data consisting of
a single labeled word form kato+lla (on the roof). When segmenting an unseen word form
matolle (onto the rug), with the correct segmentation mato+lle, the CRFs can utilize the
familiar left and right substrings ato and ll, respectively. In contrast, a lexicon-based model
has a lexicon of two morphs {kato, lla}, neither of which match any substring of matolle.
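The shared-substring argument can be made concrete with a minimal sketch of boundary-position feature extraction. The function and feature format below are illustrative, not the authors' implementation, and the maximum substring length (optimized on the development set in the experiments) is fixed to 3 here for brevity.

```python
def substring_features(word, position, max_len=3):
    """Left and right substring features for the boundary candidate
    between word[:position] and word[position:].

    Emits all substrings of up to `max_len` characters that end (left)
    or start (right) at the candidate boundary, as used by
    boundary-detection CRFs.
    """
    features = []
    for length in range(1, max_len + 1):
        if position - length >= 0:
            features.append("L=" + word[position - length:position])
        if position + length <= len(word):
            features.append("R=" + word[position:position + length])
    return features
```

Both the training word katolla (boundary after kato) and the unseen word matolle (boundary after mato) produce the features L=ato and R=ll at their correct boundary positions, so weights learned from the former transfer directly to the latter.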
Finalmente, we discuss how the varying approaches differ when learning to split affixes
and compounds. To this end we first point out that, in the examined English and
Finnish corpora, the suffix class is closed and has only a small number of morphemes
compared with the open prefix and stem categories. En consecuencia, a large coverage
of suffixes should be achievable already with a relatively small annotated data set. Este
observation is supported by the evident success of the fully supervised CRF method
in learning suffix splitting for both considered languages. Por otro lado, a pesar de
more efficient at learning suffix splitting, the supervised CRF approach is apparently
poor at detecting compound boundaries. Intuitivamente, learning compound splitting in a
supervised manner seems infeasible because the majority of stem forms are simply not
present in the available small annotated data set. Mientras tanto, the semi-supervised CRF
extension and the generative Morfessor and AG families, which do utilize the large
unannotated word lists, capture the compound boundaries with appealingly high accuracy.
This result again supports the intuition that in order to learn the open categories,
one is required to utilize large amounts of word forms for learning. Sin embargo, it appears
that the necessary information can be extracted from unannotated word forms.
5. Trabajo futuro
En esta sección, we discuss our findings on potentially fruitful directions for future
investigación.
5.1 On Improving Existing Approaches
Curiosamente, the CRF-based segmentation method achieves its success using mini-
malistic, language-independent features with a simple, basado en características, semi-supervised
learning extension. Por lo tanto, it seems plausible that one could boost the accuracy fur-
ther by designing richer, language-dependent feature extraction schemes. Por ejemplo,
one could potentially exploit features capturing vowel harmony present in Finnish,
Estonian, and Turkish. As for semi-supervised learning, one can utilize unannotated
word lists in a straightforward manner by using the feature set expansion approach as
discussed by Ruokolainen et al. (2014). Similar expansion schemes for CRFs have also
been successfully applied in the related tasks of Chinese word segmentation (Sun and
Xu 2011; Wang et al. 2011) and chunking (Turian, Ratinov, and Bengio 2010). nunca-
menos, there exist numerous other approaches proposed for semi-supervised learning of
CRFs (Jiao et al. 2006; Mann and McCallum 2008; Wang et al. 2009) that could poten-
tially provide an advantage over the feature-based, semi-supervised learning approach.
Naturalmente, one could also examine utilizing these techniques simultaneously with the
expanded feature sets.
As discussed in Section 3.3, it is possible for the generative models to utilize an-
notated data in a straightforward manner by fixing samples to their true values. Este
approach was taken by Poon, Cherry, and Toutanova (2009), Spiegler and Flach (2010),
and Sirts and Goldwater (2013). Por otro lado, as discussed in Section 3.3.1, para
the Morfessor family the fixing approach was outperformed by the weighted objective
función (Kohonen, Virpioja, and Lagus 2010). It has been shown that the weighting
can compensate for a mismatch between the model and the data generating process
(Cozman et al. 2003; Cozman and Cohen 2006; Fox-Roberts and Rosten 2014). Por lo tanto,
it would appear to be advantageous to study weighting schemes in combination with
all the discussed generative models.
5.2 On Potential Novel Approaches
Based on the literature survey presented in Section 3.3.2, one can observe that there
exists substantial work on generative lexicon-based approaches and methods based
on discriminative boundary detection. A diferencia de, there exists little to no research on
models utilizing lexicons and discriminative learning or generative boundary-detection
approaches. Además, as mentioned in Section 3.3.2, so far there has been little work
discussing a combination of lexicon-based and boundary detection approaches. It could
be fruitful to explore these modeling aspects further in the future.
6. Conclusions
We presented a comparative study on data-driven morphological segmentation in a
minimally supervised learning setting. In this setting the segmentation models are
estimated based on a small amount of manually annotated word forms and a large set
of unannotated word forms. In addition to providing a literature survey on published
methods, we presented an in-depth empirical comparison on three diverse model fam-
ilies. The purpose of this work is to extend the existing literature with a summarizing
study on the published methodology as a whole.
Based on the literature survey, we concluded that the existing methodology con-
tains substantial work on generative lexicon-based approaches and methods based on
discriminative boundary detection. As for which approach has been more successful,
both the previous work and the empirical evaluation presented here strongly imply that
the current state of the art is yielded by the discriminative boundary detection method-
ology. En general, our analysis suggested that the models based on generative lexicon
learning are inefficient at utilizing growing amounts of available data. Mientras tanto, el
studied discriminative boundary detection method based on the CRF framework was
successful in gaining consistent reduction in all error types, given increasing amounts of
datos. Por último, there exists little to no research on models utilizing lexicons and discrimi-
native learning or generative boundary-detection approaches. Studying these directions
could be of interest in future work.
Expresiones de gratitud
This work was financially supported
by Langnet (Finnish doctoral programme in
language studies), the Academy of Finland
under the Finnish Centre of Excellence
Program 2012–2017 (grant no. 251170)
and LASTU Programme (grants no. 256887
y 259934), project Multimodally grounded
language technology (grant no. 254104), y
the Tiger University program of the Estonian
Information Technology Foundation for
Educación.
Referencias
Brent, Michael R. 1999. An efficient,
probabilistically sound algorithm for
segmentation and word discovery. Machine
Aprendiendo, 34(1–3):71–105.
Chrupala, Grzegorz, Georgiana Dinu, y
Josef van Genabith. 2008. Aprendiendo
morphology with Morfette. En procedimientos
of the 6th International Conference on
Language Resources and Evaluation (LREC
2008), pages 2362–2367, Marrakech.
collins, Miguel. 2002. Discriminative
training methods for hidden Markov
modelos: Theory and experiments with
perceptron algorithms. En Actas de la
2002 Conference on Empirical Methods in
Natural Language Processing (EMNLP 2002),
volumen 10, pages 1–8, Filadelfia, Pensilvania.
Çöltekin, Çağrı. 2010. Improving successor
variety for morphological segmentation.
LOT Occasional Series, 16:13–28.
Cozman, Fabio and Ira Cohen. 2006. Risks of
semi-supervised learning: How unlabelled
data can degrade performance of
generative classifiers. In Olivier Chapelle,
Bernhard Schölkopf, Alexander Zien,
editores, Semi-supervised learning,
pages 57–72. MIT Press.
Cozman, Fabio Gagliardi, Ira Cohen, y
Marcelo Cesar Cirelo. 2003.
Semi-supervised learning of mixture
modelos. In Proceedings of the 20th
International Conference on Machine Learning
(ICML-2003), pages 99–106, Washington,
corriente continua.
Creutz, Mathias, Teemu Hirsimäki, Mikko
Kurimo, Antti Puurula, Janne Pylkkönen,
Vesa Siivola, Matti Varjokallio, Ebru
Arisoy, Murat Saraçlar, and Andreas
Stolcke. 2007. Morph-based speech
recognition and modeling of
out-of-vocabulary words across languages.
ACM Transactions on Speech and Language
Procesando, 5(1):3:1–3:29.
Creutz, Mathias and Krista Lagus. 2002.
Unsupervised discovery of morphemes. En
Proceedings of the Workshop on Morphological
and Phonological Learning of ACL 2002,
pages 21–30, Filadelfia, Pensilvania.
Creutz, Mathias and Krista Lagus. 2005.
Unsupervised morpheme segmentation
and morphology induction from text
corpora using Morfessor 1.0. Técnico
Report A81, Publications in Computer and
Information Science, Helsinki University
of Technology.
Creutz, Mathias and Krista Lagus. 2007.
Unsupervised models for morpheme
segmentation and morphology learning.
ACM Transactions on Speech and Language
Procesando, 4(1):1–27.
de Gispert, Adrià, Sami Virpioja, Mikko
Kurimo, and William Byrne. 2009.
Minimum Bayes risk combination of
translation hypotheses from alternative
morphological decompositions. En
Proceedings of Human Language Technologies:
El 2009 Annual Conference of the North
American Chapter of the Association for
Ligüística computacional (NAACL HLT
2009), pages 73–76, Roca, CO.
Eger, Steffen. 2013. Sequence segmentation
by enumeration: An exploration. El
Prague Bulletin of Mathematical Linguistics,
100:113–131.
Fox-Roberts, Patrick and Edward Rosten.
2014. Unbiased generative
semi-supervised learning. Diario de
Machine Learning Research, 15(1):367–443.
Goldwater, Sharon. 2006. Nonparametric
Bayesian Models of Lexical Acquisition. Doctor.
tesis, Universidad marrón.
Verde, Spence and John DeNero. 2012. A
class-based agreement model for
generating accurately inflected
translations. In Proceedings of the 50th
Reunión Anual de la Asociación de
Ligüística computacional (LCA 2012),
pages 146–155, Jeju Island.
Grönroos, Stig-Arne, Sami Virpioja, Peter
Smit, and Mikko Kurimo. 2014. Morfessor
FlatCat: An HMM-based method for
unsupervised and semi-supervised
learning of morphology. En procedimientos de
the 25th International Conference on
Ligüística computacional (COLECCIONAR 2014),
pages 1177–1185, Dublín.
Hammarström, Harald and Lars Borin. 2011.
Unsupervised learning of morphology.
Ligüística computacional, 37(2):309–350.
harris, Zellig S. 1955. From phoneme to
morpheme. Idioma, 31(2):190–222.
Hirsimäki, Teemu, Mathias Creutz, Vesa
Siivola, Mikko Kurimo, Sami Virpioja, y
Janne Pylkkönen. 2006. Ilimitado
vocabulary speech recognition with morph
language models applied to Finnish.
Habla y lenguaje informático,
20(4):515–541.
Jiao, feng, Shaojun Wang, Chi-Hoon Lee,
Russell Greiner, and Dale Schuurmans.
2006. Semi-supervised conditional
random fields for improved sequence
segmentation and labeling. En procedimientos
of the 21st International Conference on
Computational Linguistics and the 44th
Reunión Anual de la Asociación de
Ligüística computacional (COLING/ACL
2006), pages 209–216, Sidney.
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
mi
d
tu
/
C
oh
yo
i
/
yo
a
r
t
i
C
mi
–
pag
d
F
/
/
/
/
4
2
1
9
1
1
8
0
5
5
8
8
/
C
oh
yo
i
_
a
_
0
0
2
4
3
pag
d
.
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
Ligüística computacional
Volumen 42, Número 1
Johnson, Howard and Joel Martin. 2003.
Unsupervised learning of morphology for
English and Inuktitut. En Actas de la
2003 Conference of the North American
Chapter of the Association for Computational
Linguistics on Human Language Technology
(NAACL HLT 2003), pages 43–45,
Edmonton.
Johnson, Marca. 2008. Unsupervised word
segmentation for Sesotho using adaptor
grammars. In Proceedings of the 10th
Meeting of the ACL Special Interest Group on
Computational Morphology and Phonology
(SIGMORPHON 2008), pages 20–27,
Columbus, OH.
Johnson, Mark and Katherine Demuth. 2010.
Unsupervised phonemic Chinese word
segmentation using Adaptor Grammars.
In Proceedings of the 23rd International
Congreso sobre Lingüística Computacional,
pages 528–536, Beijing.
Johnson, Mark and Sharon Goldwater. 2009.
Improving nonparameteric Bayesian
inferencia: Experiments on unsupervised
word segmentation with adaptor
grammars. In Proceedings of Human
Language Technologies: El 2009 Annual
Conference of the North American Chapter of
la Asociación de Lingüística Computacional
(NAACL HLT 2009), pages 317–325,
Roca, CO.
Johnson, Marca, Thomas L. Griffiths, y
Sharon Goldwater. 2006. Adaptor
grammars: A framework for specifying
compositional nonparametric Bayesian
modelos. In Advances in Neural Information
Sistemas de procesamiento, pages 641–648,
vancouver.
Johnson, Marca, Thomas L. Griffiths, y
Sharon Goldwater. 2007. Bayesian
inference for PCFGs via Markov chain
Monte Carlo. In Proceedings of Human
Language Technologies 2007: The Conference
of the North American Chapter of the
Asociación de Lingüística Computacional
(NAACL HLT 2007), pages 139–146,
Rochester, Nueva York.
Kaalep, Heiki-Jaan, Kadri Muischnek, Kristel
Uiboaed, and Kaarel Veskis. 2010. El
Estonian reference corpus: Its composition
and morphology-aware user interface. En
Actas de la 2010 Conference on Human
Language Technologies – The Baltic
Perspective: Proceedings of the Fourth
International Conference Baltic (HLT 2010),
pages 143–146, Riga.
Kılıç, Özkan and Cem Bozşahin. 2012.
Semi-supervised morpheme segmentation
without morphological analysis. En
Proceedings of the LREC 2012 Taller
on Language Resources and Technologies
for Turkic Languages, pages 52–56,
Istanbul.
Kohonen, Oskar, Sami Virpioja, and Krista
Lagus. 2010. Semi-supervised learning of
concatenative morphology. En procedimientos
of the 11th Meeting of the ACL Special Interest
Group on Computational Morphology and
Phonology (SIGMORPHON 2010),
pages 78–86, Uppsala.
Kurimo, Mikko, Sami Virpioja, and Ville
Turunen. 2010. Overview and results of
Morpho Challenge 2010. En procedimientos de
the Morpho Challenge 2010 Taller,
pages 7–24, Espoo.
Kurimo, Mikko, Sami Virpioja, Ville
Turunen, Graeme W. Blackwood, y
William Byrne. 2009. Overview and results
of Morpho Challenge 2009. In Working
Notes for the CLEF 2009 Taller,
pages 578–597, Corfu.
Lafferty, John, Andrew McCallum, y
Fernando C. norte. Pereira. 2001. Conditional
random fields: Probabilistic models for
segmenting and labeling sequence data. En
Proceedings of the 18th International
Conference on Machine Learning (ICML
2001), pages 282–289, Williamstown, MAMÁ.
Sotavento, Yoong Keok, Aria Haghighi, and Regina
Barzilay. 2011. Modeling syntactic context
improves morphological segmentation. En
Proceedings of the Fifteenth Conference on
Computational Natural Language Learning
(CONLL 2011), pages 1–9, Portland, O.
Lignos, Constantine. 2010. Learning from
unseen data. In Proceedings of the Morpho
Challenge 2010 Taller, pages 35–38,
Helsinki.
Luong, Minh-Thang, Richard Socher, y
Cristóbal D.. Manning. 2013. Better
word representations with recursive
neural networks for morphology. En
Proceedings of the 17th Conference on
Computational Natural Language Learning
(CoNLL 2013), pages 104–113, Sofia.
Mann, GRAMO. y un. McCallum. 2008.
Generalized expectation criteria for
semi-supervised learning of conditional
random fields. In Proceedings of the 46th
Annual Meeting of the Association for
Ligüística computacional: Human Language
Technologies (ACL HLT 2008),
pages 870–878, Columbus, OH.
Monson, cristiano, Kristy Hollingshead,
and Brian Roark. 2010. Simulating
morphological analyzers with stochastic
taggers for confidence estimation. En
Multilingual Information Access Evaluation
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
mi
d
tu
/
C
oh
yo
i
/
yo
a
r
t
i
C
mi
–
pag
d
F
/
/
/
/
4
2
1
9
1
1
8
0
5
5
8
8
/
C
oh
yo
i
_
a
_
0
0
2
4
3
pag
d
.
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
I – Text Retrieval Experiments, volumen
6241 of Lecture Notes in Computer Science.
Narasimhan, Karthik, Damianos Karakos,
Richard Schwartz, Stavros Tsakalidis,
and Regina Barzilay. 2014. Morphological
segmentation for keyword spotting.
En Actas de la 2014 Conferencia
on Empirical Methods in Natural
Procesamiento del lenguaje (EMNLP 2014),
pages 880–885, Doha.
Neuvel, Sylvain and Sean A. Fulop. 2002.
Unsupervised learning of morphology
without morphemes. En procedimientos
of the 6th Workshop of the ACL Special
Interest Group in Computational
Phonology (SIGPHON 2002),
pages 31–40, Filadelfia, Pensilvania.
Nigam, Kamal, Andrew Kachites
McCallum, Sebastian Thrun, y
Tom Mitchell. 2000. Text classification from
labeled and unlabeled documents using
EM. Machine Learning, 39(2–3):103–134.
Pirinen, Tommi. 2008. Automatic finite
state morphological analysis
of Finnish language using open
source resources [in Finnish].
Tesis de maestría, University of Helsinki.
Poon, Hoifung, Colin Cherry, y
Kristina Toutanova. 2009. Unsupervised
morphological segmentation with
log-linear models. En procedimientos
of Human Language Technologies:
El 2009 Annual Conference of the
North American Chapter of the Association
para Lingüística Computacional (NAACL
HLT 2009), pages 209–217, Roca, CO.
Qiu, Siyu, Qing Cui, Jiang Bian, Bin Gao,
and Tie-Yan Liu. 2014. Co-learning
of word representations and morpheme
representaciones. In Proceedings of the 25th
International Conference on Computational
Lingüística (COLECCIONAR 2014), pages 141–150,
Dublín.
Rissanen, Jorma. 1989. Stochastic Complexity
in Statistical Inquiry, volumen 15. Mundo
Scientific Series in Computer Science,
Singapur.
Ruokolainen, Teemu, Oskar Kohonen,
Sami Virpioja, and Mikko Kurimo.
2013. Supervised morphological
segmentation in a low-resource
learning setting using conditional
random fields. En procedimientos de
the 17th Conference on Computational
Natural Language Learning (CoNLL 2013),
pages 29–37, Sofia.
Ruokolainen, Teemu, Oskar Kohonen, Sami
Virpioja, and Mikko Kurimo. 2014.
Painless semi-supervised morphological
segmentation using conditional random
fields. In Proceedings of the 14th Conference
of the European Chapter of the Association
para Lingüística Computacional (EACL 2014),
pages 84–89, Gothenburg.
Schone, Patrick and Daniel Jurafsky.
2001. Knowledge-free induction of
inflectional morphologies. En procedimientos
of the 2nd Meeting of the North American
Chapter of the Association for Computational
Linguistics on Language Technologies
(NAACL 2001), pages 1–9, pittsburgh,
Pensilvania.
Sirts, Kairit and Sharon Goldwater. 2013.
Minimally-supervised morphological
segmentation using adaptor grammars.
Transactions of the Association for
Ligüística computacional, 1(Puede):255–266.
Smit, Peter, Sami Virpioja, Stig-Arne
Grönroos, and Mikko Kurimo. 2014.
Morfessor 2.0: Toolkit for statistical
morphological segmentation. En
Proceedings of the Demonstrations at the
14th Conference of the European Chapter of
la Asociación de Lingüística Computacional
(EACL 2014), pages 21–24, Gothenburg.
Spiegler, Sebastian and Peter A. Flach.
2010. Enhanced word decomposition by
calibrating the decision threshold of
probabilistic models and using a model
ensemble. In Proceedings of the 48th Annual
Meeting of the Association for Computational
Lingüística (LCA 2010), pages 375–383,
Uppsala.
Stallard, David, Jacob Devlin, Miguel
Kayser, Yoong Keok Lee, and Regina
Barzilay. 2012. Unsupervised morphology
rivals supervised morphology for Arabic
MONTE. In Proceedings of the 50th Annual
Meeting of the Association for Computational
Lingüística (LCA 2012), pages 322–327,
Jeju Island.
Sol, Weiwei and Jia Xu. 2011.
Enhancing Chinese word segmentation
using unlabeled data. En procedimientos
del 2011 Conferencia sobre Empirismo
Métodos en el procesamiento del lenguaje natural
(EMNLP 2011), pages 970–979, Edimburgo.
Turian, Joseph, Lev Ratinov, y yoshua
bengio. 2010. Word representations:
A simple and general method for
semi-supervised learning. En procedimientos de
the 48th Annual Meeting of the Association for
Ligüística computacional (LCA 2010),
pages 384–394, Uppsala.
Turunen, Ville and Mikko Kurimo.
2011. Speech retrieval from unsegmented
Finnish audio using statistical morpheme-
like units for segmentation, recognition,
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
mi
d
tu
/
C
oh
yo
i
/
yo
a
r
t
i
C
mi
–
pag
d
F
/
/
/
/
4
2
1
9
1
1
8
0
5
5
8
8
/
C
oh
yo
i
_
a
_
0
0
2
4
3
pag
d
.
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
Ligüística computacional
Volumen 42, Número 1
and retrieval. ACM Transactions on Speech
and Language Processing, 8(1):1:1–1:25.
Virpioja, Sami, Oskar Kohonen,
and Krista Lagus. 2010. Unsupervised
morpheme analysis with Allomorfessor.
In Multilingual Information Access
Evaluation I. Text Retrieval Experiments: 10th
Workshop of the Cross-Language Evaluation
Forum (CLEF 2009), pages 609–616, Corfu.
Virpioja, Sami, Oskar Kohonen,
and Krista Lagus. 2011. Evaluating the
effect of word frequencies in a probabilistic
generative model of morphology.
In Proceedings of the 18th Nordic
Conference of Computational Linguistics
(NODALIDA 2011), pages 230–237, Riga.
Virpioja, Sami, Peter Smit, Stig-Arne
Grönroos, and Mikko Kurimo. 2013.
Morfessor 2.0: Python implementation and
extensions for Morfessor Baseline. Informe
25/2013 in Aalto University publication
series SCIENCE + TECHNOLOGY,
Department of Signal Processing
and Acoustics, Aalto University,
Helsinki, Finland.
Virpioja, Sami, Ville Turunen, Sebastian
Spiegler, Oskar Kohonen, and Mikko
Kurimo. 2011. Empirical comparison
of evaluation methods for unsupervised
learning of morphology. Traitement
Automatique des Langues, 52(2):45–90.
Wang, Cual, Gholamreza Haffari, Shaojun
Wang, and Greg Mori. 2009. A rate
distortion approach for semi-supervised
conditional random fields. In Advances
en sistemas de procesamiento de información neuronal
(NIPS), pages 2008–2016, vancouver.
Wang, Yiou, Jun'ichi Kazama, Yoshimasa
Tsuruoka, Wenliang Chen, Yujie Zhang,
and Kentaro
Torisawa. 2011. Improving Chinese word
segmentation and POS tagging
with semi-supervised methods using large
auto-analyzed data. En procedimientos
of the 5th International Joint Conference
sobre el procesamiento del lenguaje natural,
pages 309–317, Chiang Mai.
Yarowsky, David and Richard Wicentowski.
2000. Minimally supervised morphological
analysis by multimodal alignment.
In Proceedings of the 38th Meeting of
la Asociación de Lingüística Computacional
(LCA 2000), pages 207–216, Hong Kong.
Zhu, Xiaojin and Andrew B. Goldberg.
2009. Introduction to semi-supervised
aprendiendo. Synthesis Lectures on Artificial
Intelligence and Machine Learning,
3(1):1–130.
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
mi
d
tu
/
C
oh
yo
i
/
yo
a
r
t
i
C
mi
–
pag
d
F
/
/
/
/
4
2
1
9
1
1
8
0
5
5
8
8
/
C
oh
yo
i
_
a
_
0
0
2
4
3
pag
d
.
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3