Morphological and Syntactic Case in
Statistical Dependency Parsing
Wolfgang Seeker∗
University of Stuttgart
Jonas Kuhn∗∗
University of Stuttgart
Most morphologically rich languages with free word order use case systems to mark the grammatical
function of nominal elements, especially for the core argument functions of a verb. The
standard pipeline approach in syntactic dependency parsing assumes a complete disambiguation
of morphological (case) information prior to automatic syntactic analysis. Parsing experiments
on Czech, German, and Hungarian show that this approach is susceptible to propagating
morphological annotation errors when parsing languages displaying syncretism in their mor-
phological case paradigms. We develop a different architecture where we use case as a possibly
underspecified filtering device restricting the options for syntactic analysis. Carefully designed
morpho-syntactic constraints can delimit the search space of a statistical dependency parser and
exclude solutions that would violate the restrictions overtly marked in the morphology of the
words in a given sentence. The constrained system outperforms a state-of-the-art data-driven
pipeline architecture, as we show experimentally, and, in addition, the parser output comes with
guarantees about local and global morpho-syntactic wellformedness, which can be useful for
downstream applications.
1. Introduction
In statistical parsing, many of the first models were developed and optimized for
English. This is not surprising, given that English is the predominant language for
research in both computational linguistics and linguistics proper. By design, the
statistical parsing approach avoids language-specific decisions built into the model
architecture; models should in principle be trainable on any data following the general
treebank representation scheme. At the same time, it is well known from theoretical
and typological work in linguistics that there is a broad multi-dimensional spectrum
of language types, and that English is in a rather “extreme” area in that it marks
grammatical relations (subject, object, etc.) strictly with phrase-structural configurations.
There are only residues of an inflectional morphology left. In other words, one
∗ Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, Pfaffenwaldring 5b, D-70569 Stuttgart,
Germany. E-mail: seeker@ims.uni-stuttgart.de.
∗∗ Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, Pfaffenwaldring 5b, D-70569 Stuttgart,
Germany. E-mail: jonas@ims.uni-stuttgart.de.
Submission received: 30 September 2011; revised submission received: 20 May 2012; accepted for publication:
3 August 2012.
© 2013 Association for Computational Linguistics
Computational Linguistics
Volume 39, Number 1
cannot exclude that architectural or representational modeling decisions established
as empirically useful on English data may be favoring the specific language type
of English. In fact, carrying over successful model architectures from English to
typologically different languages mostly leads to a substantial drop in parsing
accuracy. Linguistically aware representational adjustments can help reduce the
problem significantly, as Collins et al. (1999) showed in their pivotal study adjusting a
statistical (constituent) parsing model to a highly inflectional language with free word
order, Czech in that case, pushing the results more than seven percentage points up
to a final 80% dependency accuracy (as compared with 91% accuracy for the English
“source” parser on the Wall Street Journal). Even in recent years, however, a clear gap
has remained between the top parsing architecture for English and morphologically
rich(er) languages.1 The relative hardness of the parsing task, compared with English,
cuts across statistical parsing approaches (constituent or dependency parsing) and
across morphological subtypes, such as languages with a moderately sized remaining
inflectional system (like German), highly inflected languages (like Czech), and
languages in which interactions with derivational morphology make the segmentation
question non-trivial (such as Turkish or Arabic; compare, for example, Eryiğit, Nivre,
and Oflazer [2008]).
Still, it remains hard to pinpoint systematic architectural or representational factors
that explain the empirical picture, although there is a collection of “recipes” one can
try to tune an approach to a “hard language.” Of course, there are good reasons
for adjusting a well-proven system rather than developing a more general one from
scratch—given that part of the success of statistical parsing in general lies in subtle
ways of exploiting statistical patterns that reflect inaccessible levels of information in an
indirect way.
This article attempts to do justice to the special status of mature data-driven systems
and still contribute to a systematic clarification, by (1) focusing on a clear-cut aspect
of morphological marking relevant to syntactic parsing (namely, case marking of core
arguments); (2) comparing a selection of languages covering part of the typological
spectrum (Czech, German, and Hungarian); (3) using a state-of-the-art data-driven
parser (Bohnet 2009, 2010) to establish how far the technique of representational adjustments
may take us; and (4) performing a problem-oriented comparison with an
alternative architecture, which allows us to add constraints motivated from linguistic
considerations.
In a first experiment, we vary the morphological information available to the parser
and examine the errors of the parser with respect to the case-related functions. It
turns out that although the parser is indeed able to learn the case-function mapping
for all three languages, it is susceptible to errors that are propagated through the
pipeline model when parsing languages that show syncretism2 in their morphological
paradigms, in our case Czech and German (e.g., for neuter nouns, nominative and
accusative case have the same surface form). In contrast, due to its mostly unambiguous
case system, we find a much smaller effect for Hungarian. Although the parser itself
profits much from morphological information as our experiments with gold standard
morphology show, errors in automatically predicted morphological information fre-
quently cause errors in the syntactic analysis.
1 Compare, for example, the various Shared Tasks on parsing multiple languages, such as the CoNLL
Shared Tasks 2006, 2007, 2009 (Buchholz and Marsi 2006; Nivre et al. 2007a; Hajič et al. 2009), or the PaGe
Shared Task on parsing German (Kübler 2008).
2 Two or more different feature values are signaled by the same form.
In order to better handle syncretism in the morphological description, we then
propose a different way of integrating morphology into the parsing process. We develop
an alternative architecture that circumvents the strict separation of morphological and
syntactic analysis in the pipeline model. We adopt the integer linear programming
(henceforth ILP) approach by Martins, Smith, and Xing (2009), which we augment
with a set of linguistically motivated constraints modeling the morpho-syntactic depen-
dencies in the languages. Case is herein interpreted as an underspecified filtering device
that guides a statistical model by restricting the search space of the parser. Due to the
constraints, the output of the ILP parser is guaranteed to obey all syntactic restrictions
that are marked overtly in the morphological form of the words. Although the restric-
tions are implemented as symbolic constraints, they are applied to the parser during
the search for the best tree, which is driven by a statistical model. We show in a second
experiment that restricting the search space in this way improves the performance on
argument functions (indicated by case morphology) considerably on all three languages
while the performance on all other functions stays stable.
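The idea of case as an underspecified filter can be illustrated with a toy example. The following Python sketch is not the ILP formulation used in this article. It only assumes a hypothetical table `LICENSING_CASE` that maps an argument function to the case licensing it, a set of case values still compatible with a word form, and invented scores standing in for the statistical model. Labels that contradict the morphology are removed before the best-scoring one is chosen.

```python
# Hypothetical mapping from argument functions to the case that licenses them.
LICENSING_CASE = {
    "subject": "nom",
    "direct object": "acc",
    "dative object": "dat",
}


def filter_labels(scored_labels, possible_cases):
    """Drop labels whose licensing case is ruled out by the word form.

    Labels without an entry in LICENSING_CASE are left unconstrained.
    """
    return {
        label: score
        for label, score in scored_labels.items()
        if LICENSING_CASE.get(label) is None
        or LICENSING_CASE[label] in possible_cases
    }


def best_label(scored_labels, possible_cases):
    """Pick the highest-scoring label among those the morphology allows."""
    allowed = filter_labels(scored_labels, possible_cases)
    return max(allowed, key=allowed.get)


# Invented scores for a neuter noun phrase whose form is compatible with
# nominative and accusative, but not with dative.
scores = {
    "subject": 1.2,
    "direct object": 0.9,
    "dative object": 1.5,
    "modifier": 0.3,
}
chosen = best_label(scores, {"nom", "acc"})
```

With these invented scores the unconstrained model would pick the dative object, whereas the filtered choice is the subject, because a form that is ambiguous between nominative and accusative still rules out dative.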
We proceed by first discussing the role of case morphology in syntax (Section 2),
followed by a presentation of the parsing architecture of the Bohnet parser with a
discussion of the relevant aspects for our first experiment (Section 3). Next, we compare
the morphological annotation quality of automatic tools with the gold standard across
languages (Section 4). We then turn to the first experiment in this article where we
examine the performance of the parser with respect to core argument functions on
the three languages (Section 5). In the second experiment (Section 6), we apply an
ILP parser to the data sets augmented with a set of linguistic constraints that integrate
morphological information in an underspecified way into the parsing architecture. We
conclude in Section 7.
2. Challenges of Parsing Morphologically Rich Languages
A characteristic property of most languages commonly referred to as morphologically rich
is that they use morphological means at the word level to encode grammatical relations
within the sentence rather than using the phrase-structural configuration. Whereas in
English or Chinese, placement of a word (or phrase) in a particular position relative
to the verbal head marks its function (e.g., as the subject or object), morphologically
rich languages encode grammatical relations largely by changing the morphological
form of the dependent word, the head word, or both. A correlated phenomenon is the
free word order for which many of these languages allow. Because information about
grammatical relations is marked on the words themselves, it stays available regardless
of their relative position, so word order can be used to mark other information such
as topic-focus structure. The richer the morphological system, the freer the word order
tends to be, or, as Bresnan (2001) puts it, morphology competes with syntax. We thus see that
typologically, morphological and syntactic systems are interdependent and influence
each other. Most languages are located somewhere along a continuum between purely
configurational and purely morphological marking.
In principle, data-driven parsing models with word form sensitive features have the
potential to not only pick up configurational patterns for grammatical relation marking,
but also systematic patterns in the observed variation and co-variation of morphological
word forms. It is, however, not only the interaction between syntax and morphology
that adds challenges; the marking patterns are also non-trivial to pick up from
surface data.
One of the linguistic challenges is that there are different, overlapping regimes for
morphological marking. One can distinguish head-marking and dependent-marking of
a grammatical relation, depending on where the inflection occurs. Moreover, Nichols
(1986, page 58) identifies four ways in which inflection markers may play a role in
signaling syntactic dependency:
Example 1
Hebrew, taken from Nichols (1986, page 58)
bēt       sefer
house-of  book
‘school’, lit. ‘book house’
First, the morphological marker simply registers the presence of a syntactic dependency.
In Example (1), the form of the word bēt signals the presence of a dependent,
without specifying the nature of the relation.
Second, the affix marks not only the presence but also the type of the dependency.
A typical example of the dependent-marking kind is nominal case: Accusative case
on a noun marks it not only as a dependent of a verb, but it also marks the type of
relation, namely, direct object. Verb agreement markers in Indo-European languages
are a head-marking kind of example: They indicate that a noun stands specifically
in the subject relation. Third, a morphological marker may, in addition, index certain
lexical or inflectional categories of the dependent on the head (or vice versa). Subject
agreement often indexes the dependent subject’s gender and number properties on
the head verb; attributive adjectives in Czech, for instance, agree with their noun
heads in case, number, and gender. Fourth, for some affixes, there is a paradigm for
indexing internal properties of the head on the head itself (e.g., tense or mood of
a verb) or properties of the dependent on the dependent (e.g., gender marking on
nouns).
An additional linguistic challenge in learning the patterns from data, which we will
discuss in detail in Section 2.2, comes from the fact that the inflectional paradigms may
contain syncretism. This may interfere with the learning of the previously discussed
patterns. Further challenges we do not address in this article include interactions be-
tween syntax and derivational morphology, which for some languages like Turkish and
Arabic can go along with segmentation issues.
The first Workshop on Statistical Parsing of Morphologically Rich Languages has
set the agenda for developing effective systems by identifying three main types of
technical challenges (Tsarfaty et al. 2010, page 2), which we rephrase here from our
system perspective:
Architectural challenges. Should data-driven syntactic parsing be split into subtasks,
and how should they interact? Specifically, should morphological analysis (and likewise
tokenization, part-of-speech tagging, etc.) be performed in a separate (data-driven?)
module and how can error propagation through the pipeline be minimized? Can a joint
model be trained on data that captures two-way interactions between several levels
of representation? Should the same system modularization be used in training and
decoding, or can decoding combine locally trained models, taking into account more
global structural and representational constraints?
Representational challenges. At what technical level should morphological distinc-
tions be represented? Should they (or some of them) be included at the part-of-speech
(POS) level, or at a higher level in the structure? Can some type of representation help
avoid confusions due to syncretisms? What is the most effective set of dependency
labels for capturing morphological marking of grammatical relations?
Lexical challenges. How can lexical probabilities be estimated reliably? The main
problem for morphologically rich languages is the many different forms for one lexeme,
which is amplified by the often limited amount of training data. How can a parser
analyze unseen word forms and use the information profitably?
2.1 Previous Work and Our Approach
The first two types of technical challenges often go hand in hand, as a change in architec-
ture effectively means a change in the interface representations, and vice versa. Collins
et al. (1999) reduce the tag set for the Czech treebank, which consists of a combination of
POS tags and a detailed morphological specification, in order to tackle data sparseness.
A combination of POS and case features turns out to be best for their parsing models.
In statistical constituent parsing, many investigations devise treebank transformations
that allow the parsing models to access morphological information higher in the tree
(Schiehlen 2004; Versley 2005; Versley and Rehbein 2009). These transformations apply
category splits by decorating category symbols with morphological information like
case. Whereas these approaches change traditional models to cope with morphological
information, others approach the problem by devising new models tailored to the
special requirements of morphologically rich languages. Tsarfaty and Sima’an (2008,
2010) introduce an additional layer into the parsing process that directly models the sub-
categorization of a non-terminal symbol without taking word order into consideration.
The parser thus separates the functional subcategorization of a word from its surface
realization, which is not a one-to-one relation in morphologically rich languages with
free word order. In statistical dependency parsing, morphological information is mostly
used as features in the statistical classifier that guides the search for the most probable
tree (Bohnet 2009; Goldberg and Elhadad 2010; Marton, Habash, and Rambow 2010).
The standard way established in the CoNLL Shared Tasks (Nivre et al. 2007a; Hajič et al.
2009) is a pipeline approach where POS and morphological information is predicted as
a preprocessing step to the actual parsing. Although Goldberg and Elhadad (2010) and
Marton, Habash, and Rambow (2010) find improvements for hand-annotated (gold)
morphological features, automatically predicted morphological information has none
or even negative effects on their parsing models. Goldberg and Elhadad (2010) also
show that linguistically grounded, carefully designed features (here agreement be-
tween adjectives and nouns in Hebrew) can contribute a considerable improvement,
however. Finally, the pipeline approach itself can be questioned. Cohen and Smith
(2007), Goldberg and Tsarfaty (2008), and Lee, Naradowsky, and Smith (2011) present
joint models where the processes of predicting morphological information and syn-
tactic information are performed at the same time. All three approaches acknowl-
edge the fact that syntax and morphology are heavily intertwined and interact with
each other.
Our attempt at tackling the technical and linguistic challenges can be characterized
as follows: In Section 6, we propose a system architecture that at the basic level follows
a pipeline approach, where local data-driven models are used to predict the highest
scoring output in each step. But this pipeline is complemented with a knowledge-
based component modeling grammatical knowledge about inflectional paradigms and
morphological marking of grammatical relations. Both parts are combined using a set
of global constraints that model the language-specific morpho-syntactic dependencies
to which a syntactic structure in that language has to adhere. These constraints are
used to weed out linguistically implausible structures among the candidate outputs
of the parser. Our architecture thus resides between a strict pipeline approach where
no step can influence previous results, and a full joint model, where several subtasks
are predicted simultaneously. Using the global constraints we can precisely define the
parts of the structure where an interaction between the morphology and the syntax is
allowed to take place.
The key design tasks are of a representational nature: What are the linguistic units
for which hard constraints can (and should) be enforced in a language? (For example,
within Czech and German nominal phrases, indexing of case, number, and gender
follows a strict regime—the values have to co-vary.) What underspecified interface
representation is appropriate to negotiate between the potentially ambiguous output
of one local component and the assumed input of another component? How can we
restrict them as much as possible without sacrificing the correct solution? As it turns out,
the explicit enforcement of conservative linguistic constraints over morphological and
syntactic structures in decoding leads to significantly improved parsing performance
on case-bearing dependents, and also to improved overall performance over a state-
of-the-art data-driven pipeline approach.
2.2 Case Between Morphology and Syntax
In this article, we concentrate on the case feature, which resides at the interface between
morphology and syntax. The case feature overtly marks (when unambiguous)
the syntactic function of a nominal element in a language. Languages show different
sophistication in their case systems. Where German has four different case values,
Hungarian uses a complex system of about 20 different values. In all languages with
a case system, it is used to distinguish and mark the function of the different arguments
of verbs (Blake 2001). Correctly recognizing the argument structure of verbs is one of the
most important tasks in automatic syntactic analysis because verbs and their arguments
encode the core meaning of a sentence and are therefore essential to every subsequent
semantic analysis step.
The three languages investigated in this article, German, Czech, and Hungarian,
belong to the broad category of morphologically rich languages. Syntactically, they all
use a case system to mark the function of the arguments of a verb (and a preposition).
The morphological realization of these case systems shows important differences,
however, which have a direct influence on syntactic analysis. Czech and German are
both Indo-European languages, Czech from the Slavonic branch and German from
the Germanic branch. Hungarian, on the other hand, is a Finno-Ugric language of
the Ugric branch. Czech and German both are fusional languages, where nominal
inflection suffixes signal gender, number, and case values simultaneously. Hungar-
ian is an agglutinating language, namely, every morphological feature is signaled
by its own morpheme, which is appended to the word. Whereas Hungarian has a
mostly unambiguous case system, Czech and (more so) German show a considerable
amount of syncretism in their nominal inflection. It is this syncretism that makes it so
much harder for a statistical system to learn the morphological marking patterns of a
language.
Table 1 shows examples of declension paradigms for the fusional languages Czech
and German. Note that these are only examples and cannot represent the entire com-
plexity of the systems. We use them here to exemplify the widespread morphological
syncretism in these two languages. In the masculine animate noun of Czech bratr
(‘brother’), ACC/GEN SG, DAT/LOC SG, and ACC/INS PL use the same word forms
Table 1
Examples of nominal declension paradigms in Czech and German. German never distinguishes
gender in plural.

Czech, masc. animate noun ‘brother’
MASC ANI  SG          PL
NOM       bratr       bratři
ACC       bratra      bratry
DAT       bratrovi/u  bratrům
GEN       bratra      bratrů
VOC       bratře      –
LOC       bratrovi/u  bratrech
INS       bratrem     bratry

Czech, neuter noun ‘city’
NEUT  SG       PL
NOM   město    města
ACC   město    města
DAT   městu    městům
GEN   města    měst
VOC   město    –
LOC   městě/u  městech
INS   městem   městy

German, definite determiner ‘the’
      MASC  NEUT  FEM  PL
NOM   der   das   die  die
ACC   den   das   die  die
DAT   dem   dem   der  den
GEN   des   des   der  der

German, masculine noun ‘dog’
      MASC    PL
NOM   Hund    Hunde
ACC   Hund    Hunde
DAT   Hund    Hunden
GEN   Hundes  Hunde

German, feminine noun ‘woman’
      FEM   PL
NOM   Frau  Frauen
ACC   Frau  Frauen
DAT   Frau  Frauen
GEN   Frau  Frauen
respectively.3 In the neuter noun město (‘city’) we find syncretism in NOM/ACC SG and
PL, and DAT/LOC SG. The NOM/ACC syncretism in neuter nouns is a typical property
of Indo-European languages (Blake 2001). Note also that some inflection morphemes
fill different paradigm cells; for instance, bratra is ACC SG, whereas města is NOM PL. To resolve
the ambiguity, gender and number features need to be considered.
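The syncretism just described can be made concrete by inverting a paradigm, so that each word form lists the cells it fills. The Python sketch below uses the bratr cells from Table 1. The function names are ours, and where the table gives two variants (bratrovi/u) only the first is kept for brevity.

```python
from collections import defaultdict

# Paradigm cells copied from Table 1: (case, number) -> word form.
# Simplification: for cells with two variants only the first one is listed.
BRATR = {
    ("NOM", "SG"): "bratr",    ("NOM", "PL"): "bratři",
    ("ACC", "SG"): "bratra",   ("ACC", "PL"): "bratry",
    ("DAT", "SG"): "bratrovi", ("DAT", "PL"): "bratrům",
    ("GEN", "SG"): "bratra",   ("GEN", "PL"): "bratrů",
    ("VOC", "SG"): "bratře",
    ("LOC", "SG"): "bratrovi", ("LOC", "PL"): "bratrech",
    ("INS", "SG"): "bratrem",  ("INS", "PL"): "bratry",
}


def cells_by_form(paradigm):
    """Invert a paradigm: word form -> sorted list of (case, number) cells."""
    index = defaultdict(list)
    for cell, form in paradigm.items():
        index[form].append(cell)
    return {form: sorted(cells) for form, cells in index.items()}


def syncretic_forms(paradigm):
    """Keep only the forms that fill more than one paradigm cell."""
    return {f: c for f, c in cells_by_form(paradigm).items() if len(c) > 1}


if __name__ == "__main__":
    for form, cells in sorted(syncretic_forms(BRATR).items()):
        print(form, cells)
```

A tagger that has to commit to one cell per form must guess among these options, whereas an underspecified representation can keep the whole list.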
Unlike Czech, German has determiners, which are also marked for case and agree
with their head noun in the so-called phi-features (gender, number, case). The declension
patterns of determiners and nouns in German have developed in different ways,
leading to highly case-ambiguous forms for nouns. We see in Table 1 two German
nouns, a masculine one and a feminine one. Although the declension paradigm of the
masculine noun has kept some residual formal marking of case in the GEN SG and the
DAT PL, the declension pattern of the feminine noun does not show case distinction at
all. Both nouns, however, mark the number feature overtly. The paradigm of the determiner
is much less ambiguous in the case dimension, but shows syncretism between
different number and gender features. Eisenberg (2006) calls the distribution of different
kinds of syncretism over different parts of the German noun phrase Funktionsteilung
(‘function sharing’). It makes the morphological agreement between German nouns
and their dependents extremely important because only by agreement can a mutual
disambiguation take place and reduce the morpho-syntactic ambiguity for the noun
phrase. We will show that for the fusional languages Czech and German, automatic
morphological analyzers have problems predicting the correct case, number, and gen-
der values, whereas for the agglutinating language Hungarian, the unambiguous case
paradigm makes case prediction extremely easy.
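The mutual disambiguation through agreement can be sketched as a set intersection. The following Python fragment is an illustration under simplifying assumptions, not part of the parser: every analysis is a (case, gender, number) triple taken from Table 1, and gender is set to None in the plural for both determiners and nouns so that the triples can be compared directly.

```python
# Analyses of individual word forms, read off Table 1.
# Assumption: gender is None in the plural, where German does not mark it.
DER = {
    ("NOM", "MASC", "SG"),
    ("DAT", "FEM", "SG"),
    ("GEN", "FEM", "SG"),
    ("GEN", None, "PL"),
}
DEN = {("ACC", "MASC", "SG"), ("DAT", None, "PL")}
FRAU = {(case, "FEM", "SG") for case in ("NOM", "ACC", "DAT", "GEN")}
HUND = {(case, "MASC", "SG") for case in ("NOM", "ACC", "DAT")}
HUNDEN = {("DAT", None, "PL")}


def agree(determiner, noun):
    """Analyses shared by both words, i.e. those that satisfy agreement."""
    return determiner & noun


def cases(analyses):
    """Project a set of analyses onto its case values."""
    return {case for case, _gender, _number in analyses}


der_frau = cases(agree(DER, FRAU))    # dative or genitive remain
der_hund = cases(agree(DER, HUND))    # only nominative remains
den_hund = cases(agree(DEN, HUND))    # only accusative remains
```

The noun Frau alone is compatible with all four cases and the determiner der alone with three, but together only dative and genitive survive. This is the kind of reduction that a strict one-best morphological tagging cannot pass on to the parser.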
3 NOM: nominative, GEN: genitive, DAT: dative, ACC: accusative, LOC: locative, INS: instrumental, SG:
singular, PL: plural, M/MASC: masculine, F/FEM: feminine, N/NEUT: neuter.
[Figure 1: dependency tree, image not reproduced. Edge labels shown in the tree: Obj4, Atv, Sb, AuxY, Atr.]
Jako  předkapela    se          představí  kapela  Ambivalency
–     nom           acc         –          nom     –
as    support band  themselves  present    band    –
‘The band Ambivalency performs as support band’
Figure 1
A dependency tree from the Czech treebank. Sentence no. 3,159 in the CoNLL 2009 data set.
In order to see the influence of morphology on today’s data-driven systems for
syntactic analysis, we investigate the performance of a state-of-the-art dependency
parser (Bohnet 2009, 2010) on the three languages just described paying special attention
to the handling of the core grammatical functions (i.e., the argument functions of verbs).
Dependency syntax (Hudson 1984; Mel’čuk 1988) models the syntactic structure of a
sentence by directed labeled links between the words (tokens) of a sentence. Figure 1
shows an example tree for a Czech sentence. Every word of the tree is attached to exactly
one other word (its head) by a labeled arc whose label specifies the nature of the relation.
For instance, kapela is labeled as subject (Sb) of the sentence. Morphologically, the subject
is marked with nominative case (nom) whereas the direct object (Obj4) is marked with
accusative case (acc). We see that the object can precede the verb. Syntactically, Czech
allows for all permutations of subject, object, and verb (Janda and Townsend 2000,
page 86). It is thus a free word order language. Another property of free word order
languages is the higher amount of non-projective structures (compared with English).
Non-projective structures are indicated by crossing branches in the tree structure, as
between kapela and předkapela in Figure 1.
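Whether a tree contains crossing branches can be checked mechanically from the head positions alone. The Python sketch below is a generic illustration. The function names and the toy head lists are our own and are not taken from the treebank.

```python
def crossing_arcs(heads):
    """Return pairs of arcs whose spans cross.

    heads[i] is the head position of token i + 1. Positions are 1-based and
    0 stands for the artificial root, as in the CoNLL head columns.
    """
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    crossings = []
    for i, (a, b) in enumerate(arcs):
        for c, d in arcs[i + 1:]:
            # Two arcs cross if exactly one endpoint of one arc lies
            # strictly inside the span of the other.
            if a < c < b < d or c < a < d < b:
                crossings.append(((a, b), (c, d)))
    return crossings


def is_projective(heads):
    """A tree is projective if none of its arcs cross."""
    return not crossing_arcs(heads)


# Toy examples with invented head positions.
projective_tree = [2, 0, 2]         # token 2 is the root of a flat tree
non_projective_tree = [3, 4, 0, 3]  # arc 1-3 crosses arc 2-4
```

The quadratic pairwise comparison is chosen for readability; it is adequate for single sentences but is not meant as an efficient implementation.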
3. Parsing Architecture
In this section, we give a brief description of the parser that we use in the first exper-
iment, where we analyze the performance of the parser with respect to morphological
information. The parser is the state-of-the-art data-driven second-order graph-based
dependency parser presented in Bohnet (2010).4 It is an improved version of the parser
described in Bohnet (2009), which ranked first for German and second for Czech for
syntactic labeled attachment score in the CoNLL 2009 Shared Task (Hajič et al. 2009).
The parser follows the standard pipeline approach. Information about lemma,
POS, and morphology is automatically predicted and fully disambiguated prior to
the parsing step. The CoNLL 2009 Shared Task used a tabbed format where every
token in a sentence is represented by a line of tabulator-separated fields holding
gold standard and predicted information about word position, word form, lemma,
POS, morphology, attachment, and function label. Figure 2 gives an example for the
word se in the sentence in Figure 1. Note that for every type of information, the
4 http://code.google.com/p/mate-tools, version: anna-2.
3 se se se P P SubPOS=7|Num=X|Cas=4 SubPOS=7|Num=X|Cas=4 4 4 Obj4 Obj4
Figure 2
Example of the CoNLL 2009 dependency format for se. Columns are from left to right:
Position, word form, gold lemma, predicted lemma, gold POS, predicted POS, gold
morphology, predicted morphology, gold head position, predicted head position, gold
function label, predicted function label. Semantic information is not displayed. The gold
standard columns are used for evaluation purposes.
human-annotated gold standard and the predicted value by an automatic tool is rep-
resented. The morphology columns contain several morphological features separated
by a vertical bar.
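To make the column layout concrete, the following Python sketch reads the twelve fields listed in the caption of Figure 2. It is a minimal reader written for illustration only: the field names are our own, the semantic columns are ignored, every morphological feature is assumed to have the form attribute=value, and an underscore is assumed to mark an empty field.

```python
from typing import Dict, NamedTuple


class Token(NamedTuple):
    position: int
    form: str
    gold_lemma: str
    pred_lemma: str
    gold_pos: str
    pred_pos: str
    gold_morph: Dict[str, str]
    pred_morph: Dict[str, str]
    gold_head: int
    pred_head: int
    gold_label: str
    pred_label: str


def parse_morph(field: str) -> Dict[str, str]:
    """Split 'SubPOS=7|Num=X|Cas=4' into {'SubPOS': '7', 'Num': 'X', 'Cas': '4'}."""
    if field == "_":  # assumption: '_' marks an empty field
        return {}
    return dict(item.split("=", 1) for item in field.split("|"))


def parse_line(line: str) -> Token:
    """Parse the first twelve tab-separated fields of one token line."""
    f = line.rstrip("\n").split("\t")[:12]
    return Token(int(f[0]), f[1], f[2], f[3], f[4], f[5],
                 parse_morph(f[6]), parse_morph(f[7]),
                 int(f[8]), int(f[9]), f[10], f[11])


# The line from Figure 2, with the fields joined by tabs.
LINE = "\t".join(["3", "se", "se", "se", "P", "P",
                  "SubPOS=7|Num=X|Cas=4", "SubPOS=7|Num=X|Cas=4",
                  "4", "4", "Obj4", "Obj4"])
tok = parse_line(LINE)
```

Keeping gold and predicted values side by side in one record makes it easy to compare, for example, `tok.gold_morph["Cas"]` with `tok.pred_morph["Cas"]` when counting case prediction errors.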
The parser itself consists of two main modules, the decoder and the feature model.
It is a maximum-spanning-tree5 parser (McDonald et al. 2005; McDonald and Pereira
2006) that searches for the best-scoring tree using a chart-based dynamic programming
approach similar to the one proposed by Eisner (1997). The substructures are scored by
a statistical feature model that has been trained on treebank data; the best-scoring tree is
the tree with the highest sum over the scores of all substructures in the tree. The actual
implementation is derived from the decoder by Carreras (2007), which was shown to be
efficient even for very rich feature models (Carreras 2007; Johansson and Nugues 2008;
Bohnet 2009).
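Stated as a formula, the decoding objective described above can be sketched as follows. The notation is ours and not taken from the cited work: x is the input sentence, 𝒴(x) the set of candidate dependency trees for x, p ranges over the substructures of a tree y, and s(x, p) is the score that the feature model assigns to p.

```latex
\hat{y} \;=\; \operatorname*{arg\,max}_{y \in \mathcal{Y}(x)} \; \sum_{p \in y} s(x, p)
```

Because the score of a tree decomposes into a sum over its substructures, the maximization can be carried out by dynamic programming instead of enumerating all trees.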
The features used in the statistical model are combinations of basic features, namely,
word form, lemma, POS, and morphological features. In addition, the distance between
two nodes, the direction of the edge, and the words between head and dependent are
included. Every feature is combined with the function label on the edge. A detailed
description of the feature model is beyond the scope of this article, but the interested
reader can find it in Bohnet (2009, 2010).
Because we are interested in the way the parser handles morphological information,
we will briefly discuss the inclusion of morphological features as described in Bohnet
(2009, page 3). The parser computes morphological features by combining the part-
of-speech tags (pos) of the head and the dependent with the cross-product of their
morphological feature values. For this, the morphological information (see Figure 2:
columns 7 and 8) is split at the vertical bar and every single morphological feature
value is treated as one morphological feature in the statistical model. The cross-product
then pairs the single feature values of dependent and head creating all combinations.
One single feature computed for the edge between an adjective and a noun in Czech
may then look like (UN,N,acc,acc), which states the information that both words have
the accusative case. Other features are created as well, Tuttavia, that might look like
(UN,N,sg,masc), which states that the adjective has singular number and the noun has
masculine gender. So the algorithm does not pay attention to category classes. Further-
more, the whole cross-product is computed for every edge in the tree. All features are
additionally combined with the function label between the head and the dependent,
so in the parsing features, a morphological feature like case is directly combined with
the function label with which it appears together in the treebank. Because of this, the
parser should have direct access to the information about which case value signals
a particular grammatical function. Intuitively, the statistical model should learn that
certain dependent head configurations often occur with certain morphological feature
5 Or graph-based as opposed to transition-based (Nivre et al. 2007b; Bohnet 2011).
Computational Linguistics
Volume 39, Number 1
combinations. For example, a subject edge between a noun and a verb should very often
occur together with morphological features involving nominative case, and a dative
object edge should often occur with a dative feature.
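The cross-product construction described above can be sketched as follows. This is an illustration under assumptions, not the feature extractor of the parser: the function names are hypothetical, the feature values are the simplified ones used in the running example (acc, sg, masc), and `_` is assumed to mark an empty morphology column.

```python
from itertools import product


def split_morph(morph):
    """Split a morphology column such as 'acc|sg|masc' at the vertical bar."""
    return [] if morph in ("", "_") else morph.split("|")


def morph_edge_features(dep_pos, head_pos, dep_morph, head_morph, label):
    """Pair every dependent value with every head value.

    Each pair is combined with the two POS tags and the function label,
    without checking whether the two values belong to the same category.
    """
    pairs = product(split_morph(dep_morph), split_morph(head_morph))
    return [(dep_pos, head_pos, d, h, label) for d, h in pairs]
```

For an adjective and a noun that both carry `acc|sg|masc`, the sketch yields nine features, among them the informative (A,N,acc,acc) pair and the uninformative (A,N,sg,masc) pair, each joined with the edge label.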
The statistical model is a linear multi-class classifier, trained using an on-line learning
procedure (MIRA [Crammer et al. 2003] with a hash kernel [Bohnet 2010]). Learning
is an iterative process where the parser repeatedly tries to recreate the training corpus
sentence by sentence. If the parser makes no mistakes, it proceeds to the next training
instance. Otherwise, the feature weights for the tree that would have been correct and
the feature weights for the tree produced by the parser are compared and the weights
in the feature model are adjusted to favor the correct tree and disfavor the incorrect one.
The parser repeatedly parses the treebank, adjusting its feature model to produce trees
that match the trees in the training data. Because the decoder can only derive projective
trees (without crossing edges), the parser reattaches individual edges in the tree in
a post-processing step to allow for non-projective trees (crossing edges, see Figure 1)
using the algorithm in McDonald and Pereira (2006).
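The training loop above can be made concrete with a small sketch of one update step. It is a simplified single-constraint update in the spirit of MIRA, not the training code of the parser; the function names, the use of CRC32 as the hash function, and the parameter `c` are assumptions made for this illustration.

```python
import zlib
from collections import Counter


def feature_index(feature, size):
    """Hash kernel: map a feature string directly to a weight index."""
    return zlib.crc32(feature.encode("utf-8")) % size


def online_update(weights, gold_feats, pred_feats, loss, c=1.0):
    """Shift weights toward the gold tree and away from the predicted tree.

    gold_feats and pred_feats are lists of weight indices firing in each tree;
    loss is the number of errors in the predicted tree. Returns the step size.
    """
    diff = Counter(gold_feats)
    diff.subtract(Counter(pred_feats))
    margin = sum(weights[i] * v for i, v in diff.items())
    norm = sum(v * v for v in diff.values())
    if norm == 0:  # identical feature vectors: nothing to learn
        return 0.0
    # Smallest step (capped at c) that makes the margin at least the loss.
    step = max(0.0, min(c, (loss - margin) / norm))
    for i, v in diff.items():
        weights[i] += step * v
    return step
```

If the gold tree already outscores the predicted tree by at least the loss, the step size is zero and the weights stay unchanged, which corresponds to proceeding to the next training instance.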
4. Data
Before we turn to our first experiment and its analysis, we briefly describe the data
sets that we used in the experiments and discuss the quality of the morphological
annotation. In a pipeline architecture, where morphological features are fully
disambiguated prior to parsing, low quality in the predicted morphological information will
have considerable impact on the ability of the parser to learn the mapping between
case and grammatical functions that we want it to learn. Furthermore, the errors made
in the morphological preprocessing are the first observable difference between the two
fusional languages and the agglutinating language and directly reflect this typological
difference. We will thus show that whereas the morphological preprocessing for Czech
and German makes mistakes because of the syncretism in the morphological paradigms,
the morphological preprocessing for Hungarian suffers from a different problem.
All the data sets come from the newspaper domain. The Czech data set is the
CoNLL 2009 Shared Task data set consisting of 38,727 sentences from the Prague
Dependency Treebank (Böhmová et al. 2000; Hajič et al. 2006). The German data set
(Seeker and Kuhn 2012) is a semi-automatically corrected recreation of the data set that
was used in the CoNLL 2009 Shared Task (36,017 sentences). It uses the exact6 same raw
(surface) data but contains a different syntactic annotation. It was semi-automatically
derived from the original TIGER treebank (Brants et al. 2002) and some time was spent
on manually correcting incorrect function labels and POS tags. The Hungarian data
consist of the general newspaper subcorpus (10,188 sentences) of the Szeged Treebank
(Csendes, Csirik, and Gyimóthy 2004), which was converted from the original
constituent structure annotation to dependency annotation and manually checked by four
trained linguists (Vincze et al. 2010). For the experiments in the following sections, we
use the training splits for Czech and German, and the whole set for Hungarian.
For the Czech and the Hungarian data, we kept the predicted information for
lemmata, POS, and morphology that was already provided with the data. For both
6 Except for three sentences that for some reason were missing in the 2006 version of the TiGer treebank,
from which this corpus was derived. The original data set in the CoNLL 2009 Shared Task was derived
from the 2005 version, which still contains these three sentences. The 2005 version also contained
spelling errors in the raw data that had been removed in the 2006 version. These errors were manually
reintroduced in order to recreate the data set as exactly as possible.
languages, this information is predicted in a two-step process where a finite-state
analyzer produces a set of possible annotations for a given word form, which is then
disambiguated by a statistical model trained on gold-standard data (for Czech, see
Spoustová et al. [2009]; for Hungarian, see Zsibrita, Vincze, and Farkas [2010]). The
German data set was cross-annotated by applying statistical tools7 trained on the gold
standard annotation. Contrary to the Czech and Hungarian data sets, lemma, POS,
and morphological information were annotated in three steps, each building upon the
preceding one.
In preparation for the experiments, we made two changes to the annotation in the
Czech and the Hungarian treebanks in order to allow for a more fine-grained analysis.
First, we copied the SubPOS feature value8 over to the respective POS column (gold
to gold, predicted to predicted). This helps us in doing a more fine-grained evaluation,
which is based on certain POS tags, but it also allows us to formulate linguistic con-
straints in the ILP parser more precisely, as we will see in Section 6.1. The German POS
tag set is already rather specific. We also changed the object labels (Obj) in the Czech
data set by combining them with the case value in the gold standard morphology (creating
Obj1-7). This gives us a more fine-grained object distinction for our analysis and it also
separates the case-marked objects from the clausal objects, which do not have a case
feature and therefore keep the original Obj label.9
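The two annotation changes can be sketched as simple rewriting functions. The sketch rests on assumptions that the text does not confirm: the feature column is taken to use `Name=Value` pairs with the names `SubPOS` and `Cas`, copying the SubPOS value is taken to mean appending it to the coarse POS tag, and the function names are hypothetical.

```python
PREPOSITION_TAGS = {"RR", "RF", "RV"}  # prepositions, excluded per footnote 9


def parse_feats(feats):
    """Turn 'SubPOS=N|Cas=4' into {'SubPOS': 'N', 'Cas': '4'}."""
    return dict(item.split("=", 1) for item in feats.split("|") if "=" in item)


def refine_pos(pos, feats):
    """Append the SubPOS value to the coarse POS tag (assumed concatenation)."""
    return pos + parse_feats(feats).get("SubPOS", "")


def refine_object_label(label, fine_pos, gold_feats):
    """Combine Obj with the gold case value; all other labels stay unchanged."""
    if label != "Obj" or fine_pos in PREPOSITION_TAGS:
        return label
    # Objects without a case feature (e.g., clausal objects) keep plain Obj.
    return label + parse_feats(gold_feats).get("Cas", "")
```

Under these assumptions, a noun object with `Cas=4` receives the label Obj4, whereas a clausal object without a case feature keeps the original Obj label.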
In order to learn the mapping between case and grammatical functions, the parser
relies on the automatically predicted morphological information in the data sets. When
the parser is trained on predicted morphology, in principle, it has the chance to adapt
to the errors of an automatic morphological analyzer. We will see in Section 5, however,
that this does not seem to happen very often. Therefore, if we want the parser to perform
well, we need to predict morphological information with high quality. Table 2 shows
the prediction quality of the automatic morphological analyzers in the three data sets.
On the left-hand side, precision and recall are shown for the phi-features for the whole
data set; on the right-hand side, only those words were evaluated where the predicted
POS tag matched the gold standard one. We see that Czech and Hungarian achieve
high scores on all three features, with Czech achieving over 95% for each feature, and
Hungarian over 94% recall and almost 98% precision. In contrast, we find a rather
mediocre annotation in the German data set, where only the number feature can be
predicted with comparable quality,10 and gender and case prediction is rather bad. To a
certain extent, the lower performance for German compared to Czech can be explained
by the more informed annotation tool for Czech. The German data set was annotated
by purely statistical tools whereas the Czech annotation tool uses a finite-state lexicon
to support the statistical disambiguator.
Hungarian shows a big gap between precision and recall (97.83% and 94.11% for
case) when evaluating all words, but the performance on the words with the correct
POS tag is almost perfect (99.22% for case!). The reason lies in the POS recognition.
The Hungarian POS tag set uses a category X as a kind of catch-all category where
annotators would put tokens they could not assign anywhere else. The precision for this
class is below 10%, because the tool is classifying a considerable number of proper nouns
(Np) as X. The class X, however, does not get a morphological specification so that about
7 Mate-tools by Bernd Bohnet: http://code.google.com/p/mate-tools.
8 The SubPOS feature distinguishes subcategories inside the main POS categories and is part of the
morphological description (see Figure 2).
9 Prepositional objects headed by prepositions (pos: RR, RF, RV) were also excluded.
10 There are only two values to predict though.
Table 2
Annotation quality of the phi-features (case, gender, and number) for all words and for those
words with a correctly predicted POS tag.
                          all words             correct POS
                      precision  recall     precision  recall
Czech      case         95.73    95.63        96.06    96.06
           gender       97.59    97.45        98.03    98.03
           number       98.18    98.08        98.47    98.47
German     case         88.69    88.51        89.26    89.06
           gender       90.16    89.99        90.95    90.74
           number       96.18    95.63        96.92    96.61
Hungarian  case         97.83    94.11        99.22    99.22
           number       98.64    95.91        99.88    99.88
3,500 out of 12,500 proper nouns do not receive a case and a number value at all. The
reason for the poor morphological annotation in Hungarian is apparently not a problem
of an ambiguous morphology; it is simply a problem of the POS recognition. We already
know that Hungarian is an agglutinating language. The case paradigm of Hungarian,
although comprising about 20 different case values, does not show syncretic forms with
the exception of a regular genitive-dative syncretism. Whereas in Hungarian, getting
the POS correct effectively means getting case and number correct, the results in Table 2
for Czech and German11 are not much better for words with correctly predicted POS
tags than for all words. In Czech and German, this is a problem of the syncretism in the
morphological paradigms.
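The gap between precision and recall for Hungarian can be reproduced on toy data. The sketch below shows one plausible way to compute the two measures for a single feature; the exact counting scheme behind Table 2 is not spelled out in the text, so the definition used here (and the function name) is an assumption.

```python
def feature_precision_recall(gold, pred, feature):
    """Precision and recall of one morphological feature.

    gold and pred are parallel lists of dicts, one dict per token, mapping
    feature names to values. A token without the feature simply lacks the key.
    """
    n_gold = sum(1 for g in gold if feature in g)
    n_pred = sum(1 for p in pred if feature in p)
    correct = sum(1 for g, p in zip(gold, pred)
                  if feature in g and p.get(feature) == g[feature])
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    return precision, recall
```

A token that is mistagged as X and therefore receives no case value lowers recall but leaves precision untouched, which is the pattern observed for Hungarian.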
The low syncretism in the Hungarian case paradigm is due to the agglutinating
nature of its morphological system. Because every feature (e.g., case) is signaled by
its own morpheme, a syncretism in the system would erase the distinction between
the syncretic forms. Because Hungarian uses the same case paradigm for all words,
a regular syncretism would mean that a certain distinction can no longer be made in
the language.12 In fusional languages, an inflection morpheme signals more than one
feature value. Many syncretisms can thus be disambiguated by the other feature values
or by agreement with dependents, as is done in the German noun phrase. We learn
two things from these findings: First, we may need different approaches for handling
morphology in fusional languages like Czech and German than we do for agglutinating
languages like Hungarian. And second, the category morphologically rich encompasses
11 The fact that for German, precision and recall differ is due to the independence of the POS tagger
and the morphological analyzer. In the German data, the morphological analyzer is not bound
to a certain feature template determined by the POS of the word, so that, in principle, it can
assign case to verbs and tense to nouns. This is not the case for the Czech and Hungarian analyzers.
Precision, recall, and F-score measured over all possible values amount to simple accuracy in those
languages.
12 One of the reviewers pointed out to us that Turkish as an agglutinating language also shows much
morphological ambiguity. That is correct and this also holds for Hungarian. The case paradigm itself
seems to have no syncretism in Turkish, however. The ambiguity rather comes from interaction with
vowel harmony and definiteness marking. The syncretism between genitive and dative case in the
Hungarian case system is more of a puzzle. Our best guess is that the distribution of these cases is
so different that the context can disambiguate them relatively easily.
languages that are not only different from English but also show important differences
among each other that we should take into account when devising parsing technology.
5. Experiment 1
Having examined the quality of the predicted morphological information in the data
sets, we can now investigate how the parser deals with this information. We proceed
as follows: We train three different models for each language, one using gold standard
morphology, one using predicted morphology, and one using no morphological
information (henceforth GOLD-M, PRED-M, and NO-M). Comparing the performance of these
three models allows us to see the effect that the morphological information has on the
parsing performance. The model using gold morphology serves as an upper bound
where we can observe the behavior of the parser when it is not disturbed by errors
coming from the automatic morphological analyzers. Note that this model is very unre-
alistic in the sense that syncretisms are fully resolved in the morphological information.
The model using predicted morphology serves as a realistic scenario where we can
observe the problems introduced by imperfect preprocessing and propagated errors
in the pipeline (e.g., due to syncretism). And finally, the model using no morphology
shows us how much non-morphological information contributes to the parsing perfor-
mance. In comparison with the other two models, we can then see the contribution of
morphological information13 to the parsing process. All models use the same predicted
lemma and POS information as discussed in the previous section.
5.1 Experimental Set-up
We performed a five-fold cross annotation14 on the training portions of the data sets of
Czech and German, and on the whole subcorpus of Hungarian, varying the morpho-
logical annotation as described. The overall parsing performance is shown in Table 3,
where the German and the Hungarian scores exclude punctuation and the Czech scores
include them.15
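The cross-annotation scheme can be sketched as a fold generator: each sentence is parsed by a model that was trained on the other folds. How the sentences were actually distributed over the five folds is not stated, so the interleaved assignment below and the function name are assumptions.

```python
def cross_annotation_folds(n_sentences, k=5):
    """Yield (train_ids, test_ids) so that every sentence is tested once."""
    for fold in range(k):
        test = [i for i in range(n_sentences) if i % k == fold]
        train = [i for i in range(n_sentences) if i % k != fold]
        yield train, test
```

Concatenating the parsed test portions of all folds gives an automatically annotated version of the whole data set in which no sentence was seen during the training of the model that parsed it.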
Table 3 gives us the usual picture that has been noticed in several shared tasks
on dependency parsing for multiple languages (e.g., CoNLL-ST 2006, 2007, 2009). The
performance on German is pretty high, although not as high as it would be for English,
and the performance on Czech is rather low. Note the extreme divergence between
labeled (LAS) and unlabeled attachment score (UAS) for Czech.16 For Hungarian, the
performance is comparable to Czech in terms of UAS but the LAS for Hungarian is
better. We also see the expected ordering in performance for the models using
different kinds of morphological information. The gold models always outperform the
models using predicted morphology, which in turn outperform the models using no
morphological information. Note, however, that whereas the performance on German
does not degrade very much when using no morphological information, it is very
13 It should be noted that by morphological information we always mean the complete annotation available
in the treebanks. Although we concentrate in the analysis on the phi-features (gender, number, case), the
models using morphological information always use the whole set, including also, for example, verbal
morphology.
14 The number of iterations during training was set to 10.
15 Punctuation in the Czech data set is sometimes used as the head in coordination.
16 This is due to the way the Czech data label certain phenomena, which makes it difficult for the parser to
decide on the correct label. See Boyd, Dickinson, and Meurers (2008, pages 8–9) for examples.
Table 3
Overall performance of the Bohnet parser on the five-fold cross annotation for every language
and each kind of morphological annotation. All results in percent. LAS = labeled attachment
score; UAS = unlabeled attachment score. Results for German and Hungarian are without
punctuation. Best score for Czech on the CoNLL 2009 Shared Task was by Gesmundo et al.
(2009), best score for German was by Bohnet (2009), best score for Hungarian on the CoNLL
2007 Shared Task was by Nivre et al. (2007a). Best CoNLL 09/07 results were obtained on
different data sets.
                           Czech            German          Hungarian
                        LAS    UAS       LAS    UAS       LAS    UAS
GOLD-M                 82.49  88.61     91.26  93.20     86.70  89.70
PRED-M                 81.41  88.13     89.61  92.18     84.33  88.02
NO-M                   79.00  86.89     89.18  91.97     78.04  86.02
best on CoNLL 09/07    80.38    –       87.48    –       80.27  83.55
harmful for Hungarian to do so (78.04% LAS for NO-M in comparison with 84.33%
LAS for PRED-M). The Czech results lie in between. To give a general impression of
the performance of the parser, the last row shows parsing results for the three languages
reported in the literature. The results have been obtained on different data sets, however,
so a direct comparison would be invalid.
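The two attachment scores reported in Table 3 can be computed as in the following sketch. The data layout (parallel lists of head and label pairs) and the optional punctuation mask are assumptions chosen for this illustration; they are not the evaluation script used for the experiments.

```python
def attachment_scores(gold, pred, is_punct=None):
    """Return (LAS, UAS) in percent.

    gold and pred are parallel lists of (head, label) pairs, one per token.
    is_punct is an optional parallel list of booleans; flagged tokens are
    skipped, as is done here for German and Hungarian.
    """
    total = las = uas = 0
    for i, ((g_head, g_label), (p_head, p_label)) in enumerate(zip(gold, pred)):
        if is_punct is not None and is_punct[i]:
            continue
        total += 1
        if g_head == p_head:
            uas += 1                      # correct head
            if g_label == p_label:
                las += 1                  # correct head and correct label
    if total == 0:
        return 0.0, 0.0
    return 100.0 * las / total, 100.0 * uas / total
```

Because every token that counts for LAS also counts for UAS, LAS can never exceed UAS; a large gap between the two, as for Czech, points to labeling errors on correctly attached tokens.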
5.2 Analysis
Although the scores in Table 3 reflect the quality of the parser on the complete test
data, we would not expect case morphology to influence all of the functions. We will
therefore go into more detail and concentrate on nominal elements (nouns, pronouns,
adjectives, etc.)17 and core grammatical functions (subjects, objects, nominal predicates,
etc.) because in our three languages, nominal elements carry case morphology to mark
their syntactic function. Core grammatical functions are vital to the interpretation of
a sentence because they mark the participants of a situation. We exclude clausal and
prepositional arguments, which can fill the argument slot of a verb but would not
be marked by case morphology. Table 4 shows the encoding of the core grammatical
functions in the three treebanks.
Table 5 shows the performance of the parsing models for each of the three languages
on the core grammatical functions. As described in Section 4, we split the object function
for Czech according to its associated case value. The results are shown for each of the
three models with GOLD-M on the left, PRED-M in the middle, and NO-M on the right.
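The per-function measures can be sketched as follows. The sketch assumes that a predicted function counts as correct only if the head is correct as well, which is how we read the "(LAS)" in the caption of Table 5; the function names and the data layout are hypothetical.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def label_prf(gold, pred, label):
    """Precision, recall, and F-score in percent for one function label.

    gold and pred are parallel lists of (head, label) pairs, one per token.
    """
    n_gold = sum(1 for _, g_label in gold if g_label == label)
    n_pred = sum(1 for _, p_label in pred if p_label == label)
    correct = sum(1 for g, p in zip(gold, pred) if g == p and g[1] == label)
    precision = 100.0 * correct / n_pred if n_pred else 0.0
    recall = 100.0 * correct / n_gold if n_gold else 0.0
    return precision, recall, f_score(precision, recall)
```

As a plausibility check, the precision and recall reported for Czech accusative objects in the NO-M model (73.42 and 72.02) combine to an F-score of about 72.71, the value given in Table 5.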
The results shown for the NO-M models indicate again that morphology plays a
bigger role in Czech and Hungarian for determining the core grammatical functions
than it does for German. The performance on all grammatical functions except the
rather rare genitive object is generally higher for German, showing that to a large
17 We determine a nominal element by its gold standard POS tag:
Czech: AA, AG, AM, AU, C?, Ca, Cd, Ch, Cl, Cn, Cr, Cw, Cy, NN, P1, P4, P5, P6, P7, P8, P9, PD, PE, PH,
PJ, PK, PL, PP, PQ, PS, PW, PZ.
German: ADJA, ART, NE, NN, PDAT, PDS, PIAT, PIS, PPER, PPOSAT, PPOSS, PRELAT, PRELS, PRF,
PWS, PWAT.
Hungarian: Oe, Oi, Md, Py, Oh, Ps, On, Px, Pq, Mf, Pp, Pg, Mo, Pi, Pr, Pd, Mc, Np, Af, Nc.
Table 4
Core argument functions and their encoding in the different treebanks. The different object
labels for Czech have been introduced by us. The original function is simply Obj.
                     PDT 2 (Czech)   TiGer (German)      HunDep (Hungarian)
subject              Sb              SB                  SUBJ
nominal predicate    Pnom            PD                  PRED
object               Obj1-7          OA, OA2, DA, OG     OBJ, DAT
Table 5
Precision, recall, and F-score (LAS) for core grammatical functions marked by case. We omit
locative objects in Czech, and second accusative objects in German, due to their low frequency.
                               GOLD-M                 PRED-M                  NO-M
Czech          freq     prec    rec     F      prec    rec     F      prec    rec     F
subject       38,742   89.29  91.18  90.22    83.96  87.01  85.46    74.10  78.82  76.39
obj (acc)     21,137   92.50  93.35  92.93    85.25  83.01  84.12    73.42  72.02  72.71
predicate      6,478   89.07  87.14  88.09    88.24  86.00  87.11    82.34  78.19  80.21
obj (dat)      3,896   83.18  85.68  84.41    80.21  78.88  79.54    74.29  48.05  58.35
obj (instr)    1,579   71.38  66.50  68.85    67.74  62.51  65.02    58.93  35.53  44.33
obj (gen)      1,053   86.69  77.30  81.73    80.42  62.39  70.26    74.60  48.81  59.01
obj (nom)        167   57.63  40.72  47.72    56.97  29.34  38.74    48.67  32.93  39.29
                               GOLD-M                 PRED-M                  NO-M
German         freq     prec    rec     F      prec    rec     F      prec    rec     F
subject       45,670   95.11  96.05  95.58    89.95  91.23  90.59    88.32  89.86  89.08
obj (acc)     23,830   93.93  94.80  94.36    84.83  84.89  84.86    82.20  83.35  82.77
obj (dat)      3,864   89.56  87.73  88.64    79.17  64.44  71.05    77.09  50.78  61.23
predicate      2,732   78.07  73.35  75.64    75.80  72.91  74.33    76.20  71.01  73.51
obj (gen)        155   80.25  41.93  55.08    60.66  23.87  34.26    52.94  17.42  26.21
                               GOLD-M                 PRED-M                  NO-M
Hungarian      freq     prec    rec     F      prec    rec     F      prec    rec     F
subject       11,816   88.34  91.57  89.93    84.96  88.15  86.53    64.58  66.44  65.50
obj (acc)      9,326   93.63  94.22  93.92    92.36  92.70  92.53    66.23  63.86  65.03
obj (dat)      1,254   80.55  76.95  78.71    75.57  71.53  73.49    58.36  30.62  40.17
predicate        941   81.05  75.45  78.15    77.39  72.37  74.79    72.49  71.41  71.95
extent the parser is able to use information from lexicalization and configurational
information (Seeker and Kuhn 2011). Results for Czech and Hungarian are lower in
the NO-M models. They improve by large margins when switching to predicted mor-
phology. Czech accusative objects improve from 72.71% F-score to 84.12% F-score in
the PRED-M model. In Hungarian, the F-scores for dative objects improve by over 33
percentage points to 73.49% F-score when switching to the PRED-M model. In contrast,
although all the scores improve for German, improvements are generally low when
switching from the NO-M to the PRED-M model. The biggest improvement happens
for dative objects, which increase by about 10 percentage points, but for subjects, the
improvement is just over one percentage point. This is in line with the general idea that
German is a borderline case between morphologically poor configurational languages
like English and morphologically rich non-configurational languages like Czech or
Hungarian. We already saw this general trend in Table 3, but the effect is much larger
if we consider those functions that are directly marked by morphological means in
the language.
If we now turn to the GOLD-M models, we see that in general, German and Czech
benefit more from the gold standard morphological annotation than Hungarian. Know-
ing that Hungarian does not have much form syncretism in its inflectional paradigms,
this is not really surprising. There is, however, still a gain of information because
the effect of the wrong POS tags in Hungarian is eliminated in the GOLD-M model.
An effect that comes out very clearly is the improvement for subjects and accusative
objects for Czech and German when moving from predicted to gold morphology,
because the typical syncretism between nominative and accusative in the neuter
gender in Indo-European languages (cf. Table 1) is correctly disambiguated: Comparing
the performance on subjects (marked by nominative case) and accusative objects, we
see a considerable improvement of between 5 percentage points for Czech subjects and
almost 10 percentage points for German accusative objects when switching to gold
morphology. This improvement does not happen for Hungarian, where there is no such
syncretism. The gold morphology acts as an oracle here and circumvents the ambiguity
problem that a pipeline approach to predicting morphological information prior to
parsing has.
Another interesting observation related to the way the parser works is that for all
languages, predictions are less accurate for the less frequent functions. The general
order for all three languages from most frequent to least frequent is subjects > accusative
objects > predicates/dative objects > instrumental/genitive objects. For all languages, the
parser’s quality of annotation follows this ordering. This effect comes from the statistical
nature of the parsing system, which will in case of doubt resort to the more frequent
function. A clear sign is that for rare objects, the precision is always higher than the
recall. As an example, notice the performance of the parsing models on dative and
genitive objects. The parser annotates genitive objects if it has strong evidence, hence
the high precision, but it frequently fails to find it in the first place, hence the low recall.
Because the NO-M models do not have morphological information, they can only rely
on lexicalization and contextual information to determine the correct grammatical
function. We can see this ranking in all the models regardless of the amount of morphological
information available, although the differences are much smaller for the more informed
models.
Finally, we see that the benefit from morphological information is comparatively
low for nominal predicates. It seems that the non-morphological context already
provides much useful information (e.g., the copular verbs).
5.3 Analysis of Confusion Errors
We now ask ourselves if the parser utilizes the morphological information, in our case
the case morphology, correctly. In principle, there are two possible scenarios: (1) The
feature model of the parser does not integrate the morphological annotation in a useful
way, so that the parser has difficulties learning the association between case values and
the grammatical functions; (2) There is nothing wrong with the feature model, but the
morphological annotation is not good enough and causes problems because the parser
gets incorrect information in the features.
To answer this question, we examine the confusion errors made by the parser.
If the parser uses morphological information correctly, we expect it to confuse labels
that can all be signaled by the same case value. For example, if the parser learns the
association between nominative and subject/predicate properly, we would still expect
it to make errors in confusing these two functions. Because the mapping between case
and grammatical function is one-to-many, knowing the case value reduces the number
of possible functions but the final decision between these functions must be made by
non-morphological information. The effect should be strongest in the GOLD-M models
because the morphological information is correctly disambiguated. Consequently, we
expect the same results for the PRED-M models blurred by additional errors introduced
by an imperfect morphological prediction. If, however, the parser does not learn the
mapping or has no access to morphological information, we expect confusion errors all
across the case paradigms.
To start with the last hypothesis, we examine the confusion errors with subjects
made by the parser using the NO-M models. Subjects are marked by nominative case in
all three languages, with Czech allowing for dative and genitive subjects under special
circumstances. The NO-M models do not have access to morphological information
and should therefore mix up functions regardless of the case value that would usually
distinguish them. Table 6 shows the top five confusion errors made by the NO-M models
on the subject function. The values are split for correct and incorrect head selection to
tease apart simple label classification errors from errors involving label classification
and attachment.
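The split of confusion errors by head selection can be sketched as follows. The sketch assumes that a confusion is a token whose gold label is the target function but whose predicted label is a different one; the function name and the data layout are hypothetical.

```python
from collections import Counter


def confusion_counts(gold, pred, target):
    """Count the labels predicted for tokens whose gold label is `target`.

    gold and pred are parallel lists of (head, label) pairs. Returns two
    Counters: one for tokens with the correct head, one for tokens with a
    wrong head. Tokens that received the target label are not counted.
    """
    correct_head, wrong_head = Counter(), Counter()
    for (g_head, g_label), (p_head, p_label) in zip(gold, pred):
        if g_label != target or p_label == target:
            continue
        if g_head == p_head:
            correct_head[p_label] += 1    # pure label classification error
        else:
            wrong_head[p_label] += 1      # label and attachment error
    return correct_head, wrong_head
```

Calling `most_common(5)` on either counter gives a top-five ranking of the kind reported for the subject function.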
Table 6
Top five functions with which subjects were confused when parsing with the NO-M models.
M marks a coordinated function in Czech.
Czech (NO-M)
rank    correct head         wrong head
        label    freq        label    freq
1       Obj4     4,996       Atr      2,644
2       Pnom     1,261       Obj4       981
3       Adv        811       Sb M       948
4       Obj3       752       Adv        273
5       Obj7       380       Obj M      245
German (NO-M)
rank    correct head         wrong head
        label    freq        label    freq
1       OA       2,680       OA       1,498
2       PD         776       NK         906
3       DA         458       DA         431
4       EP         301       AG         313
5       MO         219       CJ         296
Hungarian (NO-M)
rank    correct head         wrong head
        label    freq        label    freq
1       OBL      3,029       ATT      1,116
2       OBJ      1,505       Exd        574
3       PRED       250       COORD      313
4       ATT        185       OBL        311
5       DAT        152       OBJ        139
The results in Table 6 confirm the expectation that confusion errors appear regard-
less of the case value involved, which is no surprise given that the models do not have
access to morphological information: For Czech, when the head was chosen correctly,
Obj4, Obj3, and Obj7 (accusative, dative, and instrumental objects, respectively) are all
signaled by a different case value and their confusion rates follow their frequency in
the data. Pnom (nominal predicates) are expected because they are also signaled by
nominative case as are subjects. If the head was chosen incorrectly, the parser assigns
Obj4 and coordinated subjects and objects (Sb M, Obj M). Adverbial (Adv) and attribu-
tive functions (Atr) are expected as they mark adjunct functions that can be filled by
nominal elements. For German, we see confusions with the object functions (accusative
OA and dative objects DA), predicates (PD), and the EP function marking expletive
pronouns in subject position. Both are marked by nominative case. Furthermore, the
parser makes confusion errors with MO, NK, and AG, which are the three adjunct
functions that can be filled by nominal elements (e.g., AG marks genitive adjuncts).
CJ finally marks coordinated elements, which is an expected error if the head was
chosen incorrectly, but, unlike in the Czech treebank, we cannot tell by the coordination
label the particular function the element would have if it were not coordinated. In
Hungarian, we also have errors across the board, with argument functions not marked
by nominative case (accusative objects OBJ, dative objects DAT), the predicate function
PRED, and all types of adjuncts (ATT [attributives] and OBL [obliques]). Obliques are
especially interesting in Hungarian because the language has only a small number of
prepositions. Most oblique adjunct functions are realized by a particular case (hence
the about 20 different case values), which for a parsing model using no morphologi-
cal information makes it rather difficult to distinguish them from the core argument
functions. In summary, we find the expected picture of confusion errors across the case
paradigms.
Turning now to the GOLD-M models, we can test whether the parser is able to
learn the mapping between case and its associated functions. If so, we expect confusion
errors with functions that are all compatible with the case value of the correct function.
Table 7 shows the top five confusion errors that the GOLD-M models made on the subject
function. Here, we see a completely different picture compared with the NO-M model
errors in Table 6. In all three languages, regardless of whether the head is correct, we
find confusions only with functions that are compatible with the nominative case. In
Czech, subjects are mostly confused with predicates (Pnom) and coordinated subjects
(Sb M). ExD marks suspended nodes moved because of an elliptical construction. The
label does not tell whether the node would be a subject with regard to the empty
node but it may be, so it is compatible with nominative case. Atr between nominal
elements may mark close appositions like the one in Figure 1, which would be marked
as nominative by default. ObjX marks objects with no annotated case value (mostly for
foreign words). Of all the functions, only Obj4 cannot be signaled by nominative case.
If one checks those 69 cases, only 22 are annotated with accusative case in the gold
standard; the rest consist mostly of various high-frequency numerals in neuter gender
and quantifiers, most of which are ambiguous between nominative and accusative. In
these cases, lexicalization seems to overrule the case feature. We get the same picture
for German and Hungarian, both models making errors that are compatible with the
nominative case value. Of the 112 errors with accusative objects (OA) in German, only 36
have the correct case value in the gold standard. Unlike in the Czech and the Hungarian
treebank, the morphological annotation in TiGer contains a considerable number of
errors. We then conclude that for subjects, the parser indeed has no problem learning
that subjects are marked by nominative case.
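
The confusion counts discussed here can be sketched in a few lines. This is only an illustration under assumptions: a confusion is taken to be a token whose gold label is the target function but whose predicted label differs, split by whether the predicted head is correct. The function name `confusion_counts` and the toy data are hypothetical and not taken from the treebanks.

```python
from collections import Counter

def confusion_counts(gold, pred, target):
    """Count labels predicted instead of `target`, split by head correctness.

    gold, pred: lists of (head, label) pairs, one pair per token.
    """
    correct_head, wrong_head = Counter(), Counter()
    for (g_head, g_label), (p_head, p_label) in zip(gold, pred):
        if g_label != target or p_label == target:
            continue  # only gold-`target` tokens that received another label
        bucket = correct_head if g_head == p_head else wrong_head
        bucket[p_label] += 1
    return correct_head, wrong_head

# Toy data (invented for illustration): three gold subjects.
gold = [(2, "Sb"), (0, "Pred"), (2, "Sb"), (2, "Sb")]
pred = [(2, "Pnom"), (0, "Pred"), (3, "Atr"), (2, "Sb")]
correct, wrong = confusion_counts(gold, pred, "Sb")
```

Calling `most_common(5)` on either counter would give a top-five list in the style of the confusion tables.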
Table 7
Top five functions with which subjects were confused when parsing with the GOLD-M models.
M marks a coordinated function in Czech.

Czech (GOLD-M)
        correct head       wrong head
rank    label    freq      label    freq
1       Pnom      583      Sb M     1,142
2       ObjX      102      Atr        711
3       Adv       102      ExD M      162
4       Obj4       69      ExD        145
5       ExD        45      Pnom        65

German (GOLD-M)
        correct head       wrong head
rank    label    freq      label    freq
1       PD        773      NK         555
2       EP        323      CJ         245
3       MO        117      APP        139
4       OA        112      PNC        129
5       PH         96      PD         127

Hungarian (GOLD-M)
        correct head       wrong head
rank    label    freq      label    freq
1       PRED      264      ATT        678
2       Exd       102      Exd        494
3       OBL        94      COORD      249
4       ATT        90      NE          32
5       OBJ        50      DET         22
Next, we examine the accusative objects and compare the performance of the
GOLD-M models with their respective PRED-M counterparts to assess the effect of
predicted morphological information. Table 8 shows the confusion errors for the
accusative objects. On the left, the GOLD-M errors are shown; on the right we see the
PRED-M errors. For the GOLD-M models, the picture is basically the same as with the
subjects, with the small exception that all three languages show confusion with subjects
among the top five.18 Although the effect is not strong, it shows that the statistical
model can sometimes overrule the morphological features even for the gold standard
morphology.
The most interesting effect, however, happens when switching to predicted
morphological information. The overall number of errors increases, but the biggest
increase occurs for subjects in German (SB) and in Czech (Sb), although the same is not
observable in Hungarian (SUBJ). Of the 2,945 confusion errors in Czech, where the
PRED-M model incorrectly predicts an accusative object, 891 have been classified as
accusative despite being nominative in the gold standard and 1,505 have been classified
as nominative although being accusative. If we check the gender of these instances, we
find the overwhelming majority to be neuter, feminine, or masculine inanimate, exactly
those genders whose inflection paradigms show syncretism between nominative and
accusative forms. We find the same effect in the German errors. The syncretism in
the two languages causes the automatic morphological analyzers to confuse these case
18 The AuxT label in the Czech errors is used to mark certain kinds of reflexive pronouns, which can be in
accusative or dative case. The criterion for deciding whether a reflexive pronoun is labeled AuxT or Obj4
(i.e., accusative object) is whether the governing verb denotes a conscious or unconscious action. This is a
very tough criterion to learn for a dependency parser. In any case, however, AuxT is perfectly compatible
with accusative case.
Table 8
Top five functions with which accusative objects were confused when parsing with the gold
(left) and predicted (right) morphology models. M marks a coordinated function in Czech.

Czech
        GOLD-M                                PRED-M
        correct head      wrong head          correct head      wrong head
rank    label   freq      label   freq        label   freq      label   freq
1       Adv      274      Obj M    750        Sb     2,354      Atr      687
2       AuxT     270      Atr      172        Adv      262      Obj M    660
3       Sb        69      Adv       67        AuxT     256      Sb       594
4       ExD       34      Atv       65        Obj3     137      Sb M     108
5       AuxR      28      ExD M     53        Obj2     109      ExD M     94

German
        GOLD-M                                PRED-M
        correct head      wrong head          correct head      wrong head
rank    label   freq      label   freq        label   freq      label   freq
1       MO       283      NK       357        SB     2,176      SB     1,329
2       SB       112      CJ       191        DA       610      NK       606
3       DA        55      SB       121        MO       308      CJ       365
4       CJ        43      APP       97        CJ        46      AG       137
5       EP        25      MO        55        EP        40      APP      136

Hungarian
        GOLD-M                                PRED-M
        correct head      wrong head          correct head      wrong head
rank    label   freq      label   freq        label   freq      label   freq
1       OBL       90      COORD    119        OBL      119      COORD    140
2       ATT       60      Exd       81        SUBJ      86      Exd      111
3       SUBJ      50      ATT       44        ATT       65      ATT       78
4       Exd       23      OBL       18        Exd       19      ROOT      18
5       MODE      14      ROOT      13        MODE      16      OBL       15

values more often, which subsequently leads to errors in the parser due to the pipeline
architecture. That the parser so frequently falls for incorrect annotation is more proof
that it has learned the mapping between case and its associated grammatical functions.
As expected, we do not find this effect for Hungarian. As we discussed in Section 4, there
is almost no syncretism in the Hungarian case paradigm, which therefore does not lead
to this kind of error propagation. The slight increase in errors in Hungarian is
related to POS errors and their influence on missing morphological information rather
than to the quality of the predicted morphology itself.
For reasons of space and because it would not contribute anything new to the
picture, we will not go into detail for the errors for the remaining grammatical functions.
We conclude that learning the morphological dependencies that hold for a language
(cf. the four types by Nichols [1986]) can be facilitated by a statistical model. When
presented with gold standard morphological information, the parser performance im-
proves considerably over the model without morphological information for all three
languages. The error analysis shows that the parser learns the mapping between case
and grammatical function, which also shows that the feature model of the parser
integrates the information in a useful way. In the more realistic scenario using
predicted morphology, however, the parser starts making more mistakes for Czech and
German that are caused by errors of the automatic morphological predictors, which
are propagated through the pipeline model. This effect does not occur for Hungarian.
The syncretism in the inflectional paradigms in Czech and German makes the task of
learning the morpho-syntactic rules of a language much more difficult for a statistical
parser in a pipeline architecture. With a high amount of syncretism, it is simply not
sensible to fully disambiguate certain morphological properties of a word (e.g., case)
without taking the syntactic context into account.
6. Case as a Filter
From Experiment 1 we learned that one of the problems when parsing morphologi-
cally rich languages like Czech and German is the propagation of annotation errors
in the processing pipeline and the unreliable morphological information. The problem
is that the parser learns a mapping between case values and grammatical functions
but the predicted morphology delivers the wrong case value. As a solution to this
problem, Lee, Naradowsky, and Smith (2011) have proposed a joint architecture where
the morphological information is predicted simultaneously with the syntactic structure,
so that both processes can inform and influence each other. This puts morphological
prediction and syntactic analysis on the same level. We choose a different approach
here: We keep the basic pipeline architecture, because it works very efficiently. We
support the parser, however, with constraints that model the possibly underspecified
morphological restrictions grounded in the surface forms of the words. Especially for
the core argument functions, a morphological feature like case first and foremost serves
as a morpho-syntactic means to support the syntactic analysis by overtly marking
syntactic relations and thus reducing the choice for the parser. For example, if a word
form morphologically cannot be accusative, the parser should not consider grammatical
functions that are signaled by accusative in the language. Case acts here as a filter on
the available functions for the morphologically marked element. Interpreting the role
of case as a filter, we can use the case feature as a formal device to restrict the search
space of the parser. This is different from the joint model, where morphology and syntax
are predicted at the same time, because the parser will not fully disambiguate a token
with respect to its morphology if the syntactic context does not provide the necessary
information. Another thing that we learned from the first experiment is that although
the predicted morphology is not completely reliable, it is still much better than using
none at all, especially for Czech and Hungarian (see difference between PRED-M and
NO-M models in Table 5). In the following, we will therefore still use the predicted
morphology as features in the statistical model in combination with the filter. In this
architecture, the parser gets statistical information from the feature model to prefer a
particular analysis, but the constraints will block this option if it does not comply with
the morphological specification of the words. The parser then needs to choose a different
option.
In order to implement the constrained parser, we use a parsing approach by
Martins, Smith, and Xing (2009) using integer linear programming. It is related to
the Bohnet parser in the sense that it is also a graph-based approach, but it allows
us to elegantly augment the basic decoder with linguistically motivated constraints
(Klenner 2007; Seeker et al. 2010). ILP is a mathematical tool for optimizing linear
functions and was first used in dependency parsing by Riedel and Clarke (2006), who
performed experiments on Dutch using linguistically motivated constraints as we will
do. Martins, Smith, and Xing improved the formulation considerably so that the parser
would output well-formed dependency trees without the need for iterative solving. In
our ILP parser, we use the formulation by Martins, Smith, and Xing extended to labeled
dependency parsing. Like the Bohnet parser, the ILP parser consists of a decoder and
a statistical feature model. Whereas the feature model remains basically the same, the
decoder is implemented using ILP. The formulation represents every possible arc that
might appear in the parse tree as a binary variable (arc indicator), where 1 signals
the presence of the arc in the tree, and 0 signals its absence (see also Figure 3). Each
such arc indicator variable is weighted by a score assigned by the statistical model
that is learned from a treebank. During decoding,19 the parser searches for the highest
scoring combination of arcs that also fulfills the global tree constraints as well as any
other global constraints that may be added to the equations to model, for instance,
linguistic knowledge. The tree constraints ensure that every word in the tree has exactly
one head and that there are no cycles in the tree. Martins, Smith, and Xing use the
single commodity flow formulation by Magnanti and Wolsey (1995) to enforce the tree
structure. The idea is that the root node sends |N| units of flow through the tree (with |N|
being the number of words in the sentence) and every node in the tree consumes one
unit. If every node consumes exactly one unit of flow and every node can have only one
parent node, then the tree must be connected and acyclic.

\begin{align}
\max \quad & \sum_{h \in H} \sum_{d \in N} \sum_{l \in L} \omega^{l}_{dh}\, a^{l}_{dh} \tag{1} \\
& \sum_{h \in H} \sum_{l \in L} a^{l}_{dh} = 1 \qquad \forall d \in N \tag{2} \\
& |N| \sum_{l \in L} a^{l}_{dh} \geq f_{dh} \qquad \forall d \in N,\ \forall h \in H \tag{3} \\
& \sum_{h \in H} f_{dh} - \sum_{g \in N} f_{gd} = 1 \qquad \forall d \in N \tag{4} \\
& \sum_{d \in N} f_{d\,\mathrm{Root}} = |N| \tag{5} \\
& a \in \{0, 1\}, \quad f \in \mathbb{Z} \tag{6}
\end{align}

Let N be the set of words in a sentence, H = N ∪ {Root} is the set of words plus an
artificial root node, and L is the set of function labels. For every sentence, Equations (1)–
(6) constitute the equation system that the constraint solver has to solve in order to find
the highest scoring dependency tree. Equation (1) shows the objective function, which
is simply the sum over all binary arc indicator variables a ∈ A = N × H × L weighted
by their respective score ω. Equation (2) restricts for every dependent d the number
of incoming arcs to exactly one. It thus makes sure that every word will end up with
19 We use the GUROBI constraint solver: www.gurobi.com, version 4.0.
[Figure 3: for the tokens Root (0), John (1), loves (2), Mary (3), two matrices indexed by dependent d = 1..3 and head h = 0..3, one with the arc indicators a_{d,h} and one with the flow variables f_{d,h}. Annotations: Equation (2), single head; Equation (3), flow link (for each pair); Equation (4), flow consumption; Equation (5), root flow, f_{1,0} + f_{2,0} + f_{3,0} = 3. The head candidates for John are a_{1,0}, a_{1,2}, and a_{1,3}.]
Figure 3
Schematic description of the unlabeled first-order model for the example sentence John loves
Mary. The constraints are shown for the dependent (d) John. There are three head (h) candidates,
from which the decoder needs to choose one because of the single head constraint (Equation (2)).
Equations (3), (4), and (5) show as an example how the flow constraints are applied to ensure a
tree structure. Equation (3) links each arc indicator to one flow variable, making sure that only
active arcs (those that are set to 1) carry flow > 0. Equation (5) sends three units of flow from the
root, one for each other token in the tree. Equation (4) finally forces the flow difference between
the incoming arc (horizontal part) of each node (except root) and the flow on all outgoing arcs
(vertical part) to be exactly 1, thus making sure that each node consumes one unit of flow. To find
the optimal tree, the sum over the weights of all arc indicators that are set to one is maximized.
exactly one head. Equations (3)–(5) model the single commodity flow. A set of integer
variables F = N × H is introduced to represent the flow on each arc. Equation (3) links
every flow variable that represents the flow between two nodes to the set of arc indicator
variables that can connect these two nodes. If there is no arc between the two nodes (all
indicator variables are 0), the flow must be 0 as well. If one arc indicator is 1, then
the flow variable can take any integer value between 0 and |N|. Equation (4) enforces
the consumption of one unit of flow at each node by requiring the difference between
incoming and outgoing flow to be exactly one. Equation (5) finally sets the amount of
flow that is sent by the artificial root node to the number of words in the sentence. Note
that this does not force the tree structure to be single-rooted, because the artificial root
node can have multiple dependents. Single-rootedness can be enforced by an additional
constraint that sets the number of dependents for the root node to one. Figure 3 shows
an example for the basic formulation.
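
To make the role of the flow constraints concrete, the following sketch checks a given head assignment instead of solving the ILP. It is not the decoder used in the article (which relies on a constraint solver); it only illustrates, under the assumption that each word already has exactly one head, that a feasible flow exists exactly when the assignment is a tree, in which case the flow on the arc entering a word equals the size of that word's subtree. The function name `flow_for_heads` is hypothetical.

```python
def flow_for_heads(heads):
    """Return flow values f[(d, h)] satisfying Equations (3)-(5), or None.

    heads maps each word index (1..n) to its head index (0 is Root), so the
    single-head constraint, Equation (2), holds by construction.
    """
    n = len(heads)
    children = {h: [] for h in range(n + 1)}
    for d, h in heads.items():
        children[h].append(d)

    # In a tree, the arc entering d must carry one unit for every node in
    # the subtree rooted at d; nodes on a cycle are never reached from Root.
    size = {}

    def subtree(node):
        size[node] = 1 + sum(subtree(c) for c in children[node])
        return size[node]

    subtree(0)
    if len(size) != n + 1:
        return None  # some word is cut off from Root: no feasible flow

    flow = {(d, h): size[d] for d, h in heads.items()}
    # Equation (5): Root sends |N| units.
    assert sum(f for (d, h), f in flow.items() if h == 0) == n
    for d in heads:
        incoming = flow[(d, heads[d])]
        outgoing = sum(flow[(c, d)] for c in children[d])
        assert incoming - outgoing == 1   # Equation (4)
        assert 0 <= incoming <= n         # Equation (3) with an active arc
    return flow

# "John loves Mary": loves (2) attaches to Root, John (1) and Mary (3) to loves.
tree_flow = flow_for_heads({1: 2, 2: 0, 3: 2})
# John and loves head each other: a cycle that is disconnected from Root.
cycle_flow = flow_for_heads({1: 2, 2: 1, 3: 0})
```

For the tree, three units leave Root and one unit arrives at each leaf; for the cyclic assignment no flow can reach John or loves, so the function returns `None`.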
Martins, Smith, and Xing (2009) propose several extensions to the basic model; for
example, second-order features, which introduce new variables for each combination
of two arc indicator variables into the ILP model. For our parser, we implemented
the second-order features that they call all grandchildren and all siblings. They also state
that the use of second-order features in the decoder renders exact decoding intractable,
and they propose several techniques to reduce the complexity, which we also apply
to our parser: (1) Before parsing, the trees are pruned by choosing for each token the
ten most probable heads using a linear classifier that is not restricted by structural
requirements, and (2) The integer constraint is dropped, such that the variables can
now take values between 0 and 1 instead of either 0 or 1. Dropping the integer
constraint can lead to inexact solutions with fractional values. To arrive at a well-formed
dependency tree, we then use the first-order model in Equations (1)–(6) to get the
maximum spanning tree, this time using the fractional values from the actual solution
as arc weights. Two other techniques that we apply are related to the arc labels: (1) We
use an arc filter (Johansson and Nugues 2008) like the Bohnet parser, which blocks
edges that did not appear in the training data based on the POS tags of the dependent
and the head, and the label, and (2) We do not include labels in the second-order
variables.
The feature set of the ILP parser is similar but not identical to the one in the Bohnet
parser. The ILP parser uses loss-augmented MIRA for training (Taskar et al. 2005),
which is similar to the MIRA used in the Bohnet parser. We set the number of training
iterations to 10 as well.
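
The head-pruning step described above can be illustrated as follows. This is a minimal sketch that assumes head scores are already available from some classifier; the function name `prune_heads` and the score values are hypothetical, and the linear classifier itself is not shown.

```python
import heapq

def prune_heads(head_scores, k=10):
    """Keep the k highest-scoring head candidates for each dependent.

    head_scores maps a dependent index to a dict {head index: score}.
    """
    return {
        dep: set(heapq.nlargest(k, cands, key=cands.get))
        for dep, cands in head_scores.items()
    }

# Invented scores for a three-word sentence (0 is the artificial root).
scores = {
    1: {0: 0.1, 2: 2.5, 3: -1.0},
    2: {0: 3.0, 1: -0.5, 3: 0.2},
    3: {0: -0.3, 1: 0.0, 2: 1.7},
}
pruned = prune_heads(scores, k=2)
```

Only arc indicator variables for the surviving head candidates would then need to be created in the equation system.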
6.1 Morpho-Syntax as Constraints
Using case as a filter for the decoder requires an underspecified symbolic representa-
tion of morphological information that we can use to define constraints. This allows
us to have an exact representation of syncretism controlling the search space of the
parser. The case features of a word are represented in the ILP decoder as a set of
binary variables M for which 1 signals the presence of a particular value and 0 signals
its absence. For Hungarian, we only model the different case values, which leads to
one binary variable for each of the values. For Czech and German, we also include
the gender and the number features which then gives, for each case marked word,
a binary variable for every combination of the case, number, and gender values. The
values of the morphological indicator variables are specified by annotating the data
sets with underspecified morphological descriptions that are obtained from finite-state
morphological analyzers.20 If a certain feature value is excluded by the analyzers, the
value of the indicator variable for this feature is fixed at 0, which then means that the
decoder cannot set it to 1. This way, all morphological values that cannot be marked
by the form of the token (according to the morphological analyzer) are blocked and
thereby also all parser solutions that depend on them. Words unknown to the analyzers
are left completely underspecified so that each of the possible values is allowed (none
of the variables are fixed at 0). The symbolic, grammar-based pre-annotations thus set
some of the morphological indicator variables to 0 where the word form gives enough
information while leaving other variables open to be set by the parser, which can use
syntactic context to make a more informed decision.
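
The underspecified representation can be pictured with a small sketch. It assumes a toy lookup table in place of a finite-state analyzer and only four case values; the names `CASES`, `case_indicators`, and the listed analyses are hypothetical. A value of `0` stands for an indicator fixed at 0, and `None` for an indicator left open for the decoder.

```python
CASES = ("nom", "acc", "dat", "gen")  # illustrative subset of case values

def case_indicators(form, analyses, values=CASES):
    """Map each value to None (open, decided by the parser) or 0 (blocked)."""
    allowed = analyses.get(form)
    if allowed is None:  # unknown word: leave everything open
        return {v: None for v in values}
    return {v: (None if v in allowed else 0) for v in values}

# Hypothetical analyzer output for two German forms.
analyses = {"den": {"acc", "dat"}, "Frauen": {"nom", "acc", "dat", "gen"}}
den = case_indicators("den", analyses)
unknown = case_indicators("Blorf", analyses)
```

Here `den` has its nominative and genitive indicators blocked, whereas the unknown form keeps all four values available.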
We now present three types of constraints that model the morpho-syntactic inter-
actions in the three languages. Their purpose is to help the parser during decoding
20 Czech: http://ufal.mff.cuni.cz/pdt/Morphology and Tagging/Morphology/index.html; German:
Schiller (1994); Hungarian: Trón et al. (2006).
to find a linguistically plausible solution. They are inspired by the types of morpho-
syntactic interaction that Nichols (1986) describes and guide the parser by enforcing
them globally in the final structure. It is important to emphasize that these constraints
do not interact with or influence the statistical feature model of the parser. They are
applied during decoding when the parser is searching for the highest-scoring tree and
prevent solutions that violate the constraints.
The first type of constraints that we apply explicitly formulates the mapping be-
tween a function label and the case value that it requires. Equation (7) shows an example
of a case licensing constraint for the DAT label in Hungarian. A dependent d cannot be
attached to a head with label DAT if its morphological indicator variable for dative case
($m^{\mathrm{dat}}_{d}$) is zero.

\begin{equation}
\forall d: \quad \sum_{h \in H} a^{\mathrm{DAT}}_{dh} \leq m^{\mathrm{dat}}_{d} \tag{7}
\end{equation}
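
As a rough illustration of Equation (7), the check below decides for one dependent whether a label is still available given its case indicators. The table `REQUIRED_CASE` is an assumption made for the example (the text only spells out the DAT constraint), and `licensed` is a hypothetical helper, not part of the parser.

```python
# Assumed label-to-case table for Hungarian; only DAT is given in the text.
REQUIRED_CASE = {"DAT": "dat", "OBJ": "acc", "SUBJ": "nom"}

def licensed(label, indicators, required=REQUIRED_CASE):
    """Equation (7) for one dependent: sum_h a[label, d, h] <= m[case, d]."""
    case = required.get(label)
    if case is None:
        return True  # label carries no case requirement
    # An indicator fixed at 0 forces the sum of arcs with this label to 0.
    return indicators.get(case, 1) != 0

# A word form that the analyzer rules out as dative.
indicators = {"nom": 1, "acc": 1, "dat": 0}
```

With these indicators, `licensed("DAT", indicators)` is false, so no DAT arc can enter the word, while OBJ and SUBJ remain possible.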
The second type of constraint models the morphological agreement between
dependents and their heads in noun phrases (Equations (8)–(9)), for instance, determiners
and adjectives with their head noun in the noun phrases in Czech and German. In the
treebanks, the relation is marked by NK for German and Atr for Czech.21 The constraints
set the morphological indicators for an adjective and a noun in the following relation:
As long as there is no arc ($a^{\mathrm{NK}}_{dh}$ is 0) between the adjective (d) and the noun (h), the two
constraints allow for any value in the morphological indicator variables of both words.
If the arc is established ($a^{\mathrm{NK}}_{dh}$ is set to 1), the two constraints form an equivalence forcing
all the morphological indicators to agree on their value (i.e., to be both 1 or both 0). We
additionally require every word to have at least one morphological indicator variable
set to 1. Thus, if there is no solution to the equivalence, the arc between the adjective and
the noun cannot be established with this function label.

\begin{align}
m^{\mathrm{dat\text{-}pl\text{-}fem}}_{h} &\leq m^{\mathrm{dat\text{-}pl\text{-}fem}}_{d} + 1 - a^{\mathrm{NK}}_{dh} \tag{8} \\
m^{\mathrm{dat\text{-}pl\text{-}fem}}_{h} &\geq m^{\mathrm{dat\text{-}pl\text{-}fem}}_{d} - 1 + a^{\mathrm{NK}}_{dh} \tag{9}
\end{align}
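
That Equations (8) and (9) together behave as a conditional equivalence can be verified by enumerating all binary assignments. The sketch below does exactly that for a single morphological value and a single arc; the names `agreement_ok` and `table` are illustrative only.

```python
from itertools import product

def agreement_ok(arc, m_head, m_dep):
    """Equations (8) and (9) for one morphological value and one arc."""
    return (m_head <= m_dep + 1 - arc) and (m_head >= m_dep - 1 + arc)

# Truth table over arc indicator, head indicator, dependent indicator.
table = {
    (arc, m_h, m_d): agreement_ok(arc, m_h, m_d)
    for arc, m_h, m_d in product((0, 1), repeat=3)
}
```

Without the arc, all four indicator combinations pass; with the arc, only the two combinations in which head and dependent carry the same value remain.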
For the third type, Equation (10) shows a constraint that was already proposed
by Riedel and Clarke (2006). It models label uniqueness by forcing label l to appear
at most once on all the dependents of a head (h). Due to the design of the decoder
following Carreras (2007), the Bohnet parser has no means of making sure that a
particular function label is annotated at most once per head. Table 9 shows the number
of times a grammatical function occurs more than once per head in the treebank (TRBK)
and how often it was annotated by the models in the previous experiment. Although
doubly annotated argument functions almost never appear in the treebank, the parser
21 In German and mostly also in Czech, if an adjective is attached to a noun by NK (or Atr), they stand in
an agreement relation. This fortunate circumstance allows us to bind the agreement constraint to these
function labels (and to the involved POS tags). In a (very) small number of cases in the Czech treebank,
however, an adjective is attached to a noun by Atr but there is no agreement. This happens, for example,
if the adjective is actually the head of another noun phrase that stands in attributive relation (Atr) to the
noun. The Atr label was not meant to mark agreement relations; it just happens to coincide for most of
the cases. But it might be worth considering whether morpho-syntactic relations like agreement should
be represented explicitly in syntactic treebanks.
frequently annotates them because it has no way of checking whether the function has
already been annotated (see also Khmylko, Foth, and Menzel [2009]).

\begin{equation}
\forall h\, \forall l: \quad \sum_{d \in N} a^{l}_{dh} \leq 1 \tag{10}
\end{equation}

The global constraint in Equation (10) allows us to restrict the number of argument
functions and thus implements a very conservative version of a subcategorization frame
with which we do not risk coverage problems caused by too restrictive verb frames.
For each language, we automatically counted the number of times a function label
occurred on the direct dependents of each node in the treebank. Labels that occurred
more than once per head with a very low frequency were still counted as appearing
at most once if our linguistic intuition would predict that (see, e.g., German subjects
in Table 9). For each function label l in these lists, the constraint in Equation (10) was
applied.
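
The counting behind the label uniqueness constraint can be sketched as below. The helper `repeated_labels` is hypothetical and works on a single parse given as (dependent, head, label) triples; the toy arcs are invented, and the set of labels treated as unique is an assumption for the example.

```python
from collections import Counter

def repeated_labels(arcs, unique_labels):
    """Return (head, label) pairs that violate Equation (10).

    arcs: iterable of (dependent, head, label) triples of one parse tree.
    """
    counts = Counter((h, l) for _, h, l in arcs if l in unique_labels)
    return sorted(pair for pair, n in counts.items() if n > 1)

# Invented German-style parse in which head 2 receives two subjects.
arcs = [(1, 2, "SB"), (3, 2, "SB"), (4, 2, "MO"), (5, 2, "MO")]
violations = repeated_labels(arcs, {"SB", "OA", "DA"})
```

Adjunct labels such as MO are left out of the unique set, so repeated modifiers on one head are not reported.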
Table 9
Number of times a core grammatical function was annotated more than once in the treebank
(TRBK), by the model using gold morphology (GOLD-M), and by the model using predicted
morphology (PRED-M).

              Czech                      German                     Hungarian
              TRBK   GOLD-M   PRED-M     TRBK   GOLD-M   PRED-M     TRBK   GOLD-M   PRED-M
subjects         0      772    1,723       44    1,170    2,403        0      586      670
predicates       7      174      190        6       92      108        1       17       19
obj (dat.)       0       28       46        0       33       46        0        9        5
obj (acc.)      22      284      602        2      364      912        0      182      189

Each individual constraint already reduces the choices that the parser has available
for the syntactic structure. Through their interaction, however, the constraints exclude
additional incorrect analyses. Figure 4 illustrates the interaction between the three constraints for the
German sentence den Mädchen helfen Frauen, meaning women help the girls. Each
individual word displays a high degree of syncretism. But when the syntactic structure is
decided, many options mutually exclude each other. Constraints (8) and (9) disambiguate
den Mädchen for dative plural feminine. The case licensing (Constraint (7)) then restricts
the labels for Mädchen to dative object (DA), and Constraint (10) ensures uniqueness
by restricting the choice for Frauen. The parser now has to decide whether Frauen is
subject, accusative object, or something else completely. The constraints are applied
online during the decoding process. If the statistical model strongly preferred Mädchen
to be accusative object, the parser could label it with OA. In that case, however, it would
not be able to establish the NK label between den and Mädchen, because the agreement
constraint would be violated. So, the constraints filter out incorrect solutions but the
decoder is still driven by the statistical model.
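
The interaction can be replayed by brute force on the analyses given for this sentence. The sketch below is a simplification under stated assumptions: the NK arc between den and Mädchen is taken as established, both nouns attach to helfen, and only the three labels SB, OA, and DA are considered. The constant and function names are hypothetical.

```python
from itertools import product

# Analyses as listed for the example sentence (case-number-gender).
DEN = {"acc-sg-masc", "dat-pl-masc", "dat-pl-fem", "dat-pl-neut"}
MAEDCHEN = {"nom-pl-fem", "acc-pl-fem", "dat-pl-fem", "gen-pl-fem"}
FRAUEN = {"nom-pl-fem", "acc-pl-fem", "dat-pl-fem", "gen-pl-fem"}
LABEL_CASE = {"SB": "nom", "OA": "acc", "DA": "dat"}

def surviving_label_pairs():
    """Enumerate label pairs for (Mädchen, Frauen) that pass all constraints."""
    # Agreement, Equations (8)-(9): den and Mädchen must share an analysis.
    noun_phrase = DEN & MAEDCHEN
    result = []
    for (l1, c1), (l2, c2) in product(LABEL_CASE.items(), repeat=2):
        if l1 == l2:
            continue  # label uniqueness, Equation (10)
        if not any(a.startswith(c1) for a in noun_phrase):
            continue  # case licensing for Mädchen, Equation (7)
        if not any(a.startswith(c2) for a in FRAUEN):
            continue  # case licensing for Frauen, Equation (7)
        result.append((l1, l2))
    return result

pairs = surviving_label_pairs()
```

Only the analyses with Mädchen as dative object survive, and the choice between subject and accusative object for Frauen is left to the statistical model.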
6.2 Experiment 2
In the second experiment, we now apply the ILP parser to the same data sets that we
used in the first experiment, again with a five-fold cross-annotation. We trained two
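[Figure 4: dependency analysis of den Mädchen helfen Frauen (glosses: the, girls, help, women; 'Women help the girls'). The arc between den and Mädchen is labeled NK. For Mädchen, the labels SB and OA are struck out and DA remains; for Frauen, DA is struck out and SB, OA, and further labels remain. Morphological analyses listed per word: den: acc-sg-masc, dat-pl-masc, dat-pl-fem, dat-pl-neut; Mädchen and Frauen: nom-pl-fem, acc-pl-fem, dat-pl-fem, gen-pl-fem; helfen: none. Analyses excluded by the constraints are struck out.]
Figure 4
Constraint interaction for the German sentence den Mädchen helfen Frauen, meaning women help
the girls.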
Table 10
Overall performance of the Bohnet parser and the ILP parser on the five-fold cross annotation
for every language. All results in percent. LAS = labeled attachment score, UAS = unlabeled
attachment score. Results for German and Hungarian are without punctuation.

             Czech            German           Hungarian
model        LAS     UAS      LAS     UAS      LAS     UAS
GOLD-M       82.49   88.61    91.26   93.20    86.70   89.70
PRED-M       81.41   88.13    89.61   92.18    84.33   88.02
NO-M         79.00   86.89    89.18   91.97    78.04   86.02
ILP NO-C     81.69   88.09    89.30   91.98    84.01   87.12
ILP C        81.91   88.18    89.93   92.25    84.35   87.39

models for each language, one using the constraints (C) and one without the constraints
(NO-C). In both cases, we used the predicted morphology in the feature set. Table 10
shows the parsing results for the ILP parsing models in terms of LAS and UAS in
comparison to the results of the Bohnet parser (repeated from Table 3). Both ILP models
should be compared to the PRED-M model because they have the most similar feature
sets. As can be seen from the results, the ILP parser without constraints performs overall
slightly worse than the Bohnet parser and the ILP parser using constraints performs
overall slightly better or equal. This shows that both parsers perform on a similar
level. For Czech and German, the differences between the ILP C and PRED-M models
are statistically significant.22 The interesting results, however, occur for the argument
functions.
Table 11 shows the performance of the unconstrained (NO-C) and constrained (C) ILP
models and the PRED-M models of the Bohnet parser on the argument functions. Again,
22 According to a two-tailed t-test for related samples with α = 0.05.
Table 11
Parsing results for the unconstrained (NO-C) and the constrained (C) ILP models, and the
Bohnet parser in terms of F-score (LAS) for core grammatical functions marked by case.
We omit locative objects in Czech, and second accusative objects in German because of their
extremely low frequency. * Statistically significant when comparing the performance on a
grammatical function for the C model to the PRED-M model (α = 0.05, two-tailed t-test for
related samples).

                 Czech                       German                      Hungarian
                 NO-C    C        PRED-M     NO-C    C        PRED-M     NO-C    C        PRED-M
subject          85.41   87.23*   85.46      90.02   92.91*   90.59      85.05   87.67*   86.53
predicate        87.13   90.09*   87.11      72.86   80.70*   74.33      74.16   78.88*   74.79
obj (nom)        47.48   53.19*   38.74      –       –        –          –       –        –
obj (gen)        70.15   72.54    70.27      31.41   42.98    34.26      –       –        –
obj (dat)        79.99   80.42    79.54      65.21   77.78*   71.05      75.33   77.92*   73.49
obj (acc)        84.27   86.79*   84.12      83.74   87.96*   84.86      91.96   93.21*   92.53
obj (instr)      67.36   68.76    65.02      –       –        –          –       –        –
all arg funcs    84.33   86.37*   84.21      86.27   90.11*   87.24      86.87   89.04*   87.78
all other        81.37   81.37    81.05      89.79   89.88    89.98      82.73   82.86    83.43

we only evaluated those tokens that actually carry case morphology, as we did in the
first experiment. For each language, the best results are in boldface. In addition to the
results for the different argument functions, a total score is computed over all argument
functions (all arg funcs) and another is computed over all tokens that are not included
in the first score (all other). The latter illustrates the performance of the parsing models
on the functions that are not marked by case morphology.
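The per-function scores described above can be sketched as follows. This is a minimal illustration and not the evaluation script behind Table 11: the function name `labeled_f_scores`, the `(head, label)` token representation, the `has_case` flags, and the labels `SB`, `OA`, and `DA` are all assumptions made for the example. A token counts as correct for a function only if both its head and its label match the gold standard, and tokens without case morphology are skipped.

```python
from collections import Counter


def labeled_f_scores(gold, pred, has_case):
    """Labeled F-score per function label, restricted to case-bearing tokens.

    gold, pred: lists of (head, label) pairs, one pair per token.
    has_case:   list of booleans, True if the token carries case morphology.
    """
    correct, n_gold, n_pred = Counter(), Counter(), Counter()
    for (g_head, g_label), (p_head, p_label), case in zip(gold, pred, has_case):
        if not case:
            continue  # tokens without case morphology are not scored here
        n_gold[g_label] += 1
        n_pred[p_label] += 1
        if g_head == p_head and g_label == p_label:
            correct[g_label] += 1

    scores = {}
    for label in set(n_gold) | set(n_pred):
        precision = correct[label] / n_pred[label] if n_pred[label] else 0.0
        recall = correct[label] / n_gold[label] if n_gold[label] else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


# Hypothetical four-token sentence; the second token (the root) has no case.
gold = [(2, "SB"), (0, "ROOT"), (2, "OA"), (2, "DA")]
pred = [(2, "SB"), (0, "ROOT"), (2, "SB"), (2, "DA")]
has_case = [True, False, True, True]
scores = labeled_f_scores(gold, pred, has_case)
```

In this toy sentence the parser labels two dependents of the same verb as `SB`, so precision on `SB` drops to 0.5 while recall stays at 1.0, and `OA` receives a score of 0.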
For each language, we get the same basic picture: Although the unconstrained ILP
model performs slightly worse than (German, Hungarian) or equally well as (Czech) the
PRED-M model of the Bohnet parser, the constrained ILP model clearly outperforms both
on the argument functions. On each of them, the constrained ILP model improves over
the other two models, raising the score by 1 percentage point for subjects in Hungarian,
for example, and by up to 7 percentage points on dative objects in German (compared with
the PRED-M model). In general, the improvements seem to be
higher on the more infrequent arguments like dative objects and predicates than on the
frequent arguments like subject or accusative object. It is not the case, however, that the
performance of one of the infrequent functions suddenly surpasses the performance of
a more frequent function. Those two effects are to be expected because the ILP parser is
still a data-driven parser. The constraints support it by excluding morpho-syntactically
incorrect analyses, but they do not resolve ambiguous cases, which are still decided by
the statistical model.
The main work done by the constraints is to establish interactions between parts
of the parse graph that are not represented in the statistical model. Because the graph-
based approach (in both parsers) factors the graph into first- (and some second-) order
arcs, and because neither decoder uses second-order features with more than
one label, a constraint like label uniqueness (Equation (10)), which is not even directly
related to morphology, is impossible for the statistical model to learn. This is because
it never sees two sister dependents and their labels together and thus does not know whether
it has already assigned the current function label. Applying the constraints during
the search makes it impossible for the parser to produce an output that does not
obey label uniqueness even though the statistical model does not have access to this
information.
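The effect of label uniqueness on decoding can be illustrated with a small sketch. It is not the ILP formulation of Equation (10): the exhaustive search below stands in for the ILP solver, the heads are taken as fixed, and the function name `best_labels`, the label names, and the scores are hypothetical. The point is only that scoring each dependent independently can yield two subjects under one head, whereas a search restricted by the constraint cannot.

```python
from itertools import product


def best_labels(scores, unique_labels=frozenset()):
    """Choose one label per dependent of a single head, maximizing the total score.

    scores:        list of dicts (one per dependent) mapping label -> score.
    unique_labels: labels that may be assigned at most once under this head.
    """
    best, best_total = None, float("-inf")
    # Brute-force enumeration of all label combinations (fine for a toy example).
    for combo in product(*(s.keys() for s in scores)):
        if any(combo.count(label) > 1 for label in unique_labels):
            continue  # violates label uniqueness, so it is filtered out
        total = sum(s[label] for s, label in zip(scores, combo))
        if total > best_total:
            best, best_total = combo, total
    return best


# Two dependents of the same verb; both prefer "subj" when scored in isolation.
scores = [{"subj": 2.0, "obj": 1.5}, {"subj": 1.8, "obj": 0.4}]
unconstrained = best_labels(scores)
constrained = best_labels(scores, unique_labels={"subj", "obj"})
```

Without the constraint, both dependents are labeled `subj`. With it, the highest-scoring admissible assignment labels the first dependent `obj` and the second `subj`: the statistical scores still decide among the analyses that the constraint leaves open.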
It should be stressed that the ILP models in their statistical model still use the same
predicted and fully disambiguated morphological information from the pipeline archi-
tecture as the Bohnet parser. As we saw in the first experiment, using no morphological
information in the statistical model is very harmful to the performance on Czech and
Hungarian, though not so much for German.
One advantage of the proposed architecture is the fact that the ILP parser is still
mainly driven by the statistical model. Krivanek and Meurers (2011) compared a
data-driven, transition-based dependency parser (Nivre et al. 2007b) and a constraint-based
dependency parser (Foth and Menzel 2006) on learner and newspaper corpora and
found that whereas the former is better on modifier functions (e.g., PP-attachment),
the latter performs better on argument functions. Their explanation is that where
the data-driven parser has access to lots of data and can pick up statistical effects
in the data like semantic or selectional preferences, the constraint-based parser has
access to deep lexical and grammatical information and is thus able to model
argument structure in a better way. In the ILP parser, we can combine both strengths,
letting the statistical model learn preferences but forcing it via constraints to obey hard
grammatical information. The last row in Table 11 shows that compared to the Bohnet
parser, the ILP models perform comparably well on non-argument functions (maybe
with the exception of Hungarian, where the difference is a bit more distinct). At the
same time, they perform clearly better on the argument functions due to the linguistic
constraints.
Foth and Menzel (2006) (see also Khmylko, Foth, and Menzel 2009) are further
relevant to this work in the sense that our architecture mirrors their approach. In
their work, they use a highly sophisticated rule-based parser, which they equip with
statistical components that model various subtasks like POS tagging, supertagging, or
PP-attachment. They demonstrate that a rule-based parser can benefit from statistical
models that model preferences rather than hard constraints. Our approach comes from
the other side: We equip a statistical parser with hard rules that ensure the linguistic
plausibility of the output. Both approaches show that proper statistical models and
linguistically motivated rules can work well together to produce syntactic structures of
high quality.
One advantage of applying constraints over the argument structure is that we can
give a guarantee that certain ill-formed trees will not be produced by the parser. For
example, the constraints make sure that there will not be any parser output where there
are two subjects annotated for the same verb. Although this does not mean that the
subject will be the correct one, the formal requirement of not having two subjects is met,
which we believe can be helpful for subsequent semantic analysis/interpretation or, for
example, relation extraction. In the same sense, the constraints will also ensure that
morphological agreement and case licensing are correct to the degree that the morphological
analyzer was correct. This feature thus implements a tentative notion of grammaticality
for the statistical model.
7. Conclusion
In this article, we investigated the performance of the state-of-the-art statistical
dependency parser by Bohnet (2010) on three morphologically rich languages: Czech,
German, and Hungarian. We concentrated on the core grammatical functions (subject,
object, etc.) that are marked by case morphology in each of the three languages. Our first
experiment shows that apart from small frequency effects due to the statistical nature of
the parser, learning the mapping between a case value and the grammatical functions
signaled by it is not a problem for the parser. We also see, however, that the pipeline
approach, where morphological information is fully disambiguated before being used
by the parser as features in the statistical model, is susceptible to error propagation for
languages that show syncretism in their morphological paradigms. Although we can
show that parsing Hungarian, an agglutinating language without major syncretism in
the case paradigm, is not affected by these problems, parsing the fusional languages
Czech and German frequently suffers from propagated errors due to ambiguous case
morphology. Furthermore, although the predicted morphological information does
not help very much in German, it contributes substantially when parsing Czech and
Hungarian, even if it is not completely reliable.
Handling syncretism requires changes in the processing architecture and the
representation of morphological information. We proposed an augmented pipeline where
the parsing model is restricted by possibly underspecified, morpho-syntactic constraints
exploiting grammatical knowledge about the morphological marking regimes and the
inflectional paradigms. Although the statistical parsing model provides scores for local
substructures during decoding, the symbolic constraints are applied globally to the
entire output structure. A morpho-syntactic feature like case is interpreted as a filter
on the parser output. By modeling phenomena like case-function mapping, agreement,
and function uniqueness as constraints in an ILP decoder for dependency parsing,
we showed in a second experiment that supporting a statistical model with these
constraints helps avoid parsing errors due to incorrect morphological preprocessing.
The advantage of this approach is the combination of local statistical models
and globally enforced hard grammatical knowledge. Whereas some key aspects of the
grammatical structure are ensured by the linguistic knowledge (e.g., overtly marked
case morphology), the underlying data-driven model can still exploit statistical effects
to resolve the remaining ambiguity and model semantic preferences, which are difficult
to model with hard rules.
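The idea of case as an underspecified filter can be sketched in a few lines. The licensing table `LICENSED_BY`, the function name `allowed_functions`, and the case abbreviations are hypothetical and deliberately incomplete. They are not the constraint set used in the experiments. The sketch assumes that a morphological analyzer returns the set of case values a word form is compatible with, rather than a single disambiguated value.

```python
# Hypothetical, simplified licensing table: the case values that license
# each grammatical function (function names follow the rows of Table 11).
LICENSED_BY = {
    "subject": {"nom"},
    "obj (acc)": {"acc"},
    "obj (dat)": {"dat"},
    "obj (gen)": {"gen"},
}


def allowed_functions(possible_cases):
    """Return the functions compatible with at least one possible case value."""
    return {
        function
        for function, licensing_cases in LICENSED_BY.items()
        if licensing_cases & possible_cases
    }


# A syncretic form that is ambiguous between nominative and accusative.
ambiguous = allowed_functions({"nom", "acc"})
# A form that is unambiguously dative.
unambiguous = allowed_functions({"dat"})
```

For the syncretic form, the filter leaves both `subject` and `obj (acc)` available and the statistical model decides between them. For the unambiguous form, only `obj (dat)` survives. No early commitment to a single case value is needed in either situation.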
Morphologically rich languages pose various challenges to the standard parsing
approaches because of their different linguistic properties. As one of them, case systems
are a key device in these languages to encode argument structure and reside at the
boundary between morphology and syntax. Paying attention to the role of case in statistical
parsing results in more appropriate models. Morphologically rich, however, is a broad
category and covers a wide range of languages. Taking up the idea of linguistically informed
restrictions over data-driven system components may lead to further improvements on
other phenomena and for other languages.
Acknowledgments
The research reported in this article was
supported by the German Research
Foundation (DFG) in project D8 of SFB 732
Incremental Specification in Context. We would
like to thank Richárd Farkas and Veronika
Vincze at the University of Szeged for their
help with the Hungarian corpus and language;
Bernd Bohnet for the help with his parser;
and Anders Björkelund, Anett Diesner, and
Kyle Richardson for their comments on
earlier drafts of this work.
References
Blake, Barry J. 2001. Case. Cambridge
University Press, Cambridge, MA,
2nd edition.
Böhmová, Alena, Jan Hajič, Eva Hajičová,
and Barbora Hladká. 2000. The Prague
Dependency Treebank: A three-level
annotation scenario. In A. Abeillé, editor,
Treebanks: Building and Using Syntactically
Annotated Corpora. Kluwer Academic
Publishers, Amsterdam, chapter 1,
pages 103–127.
Bohnet, Bernd. 2009. Efficient parsing of
syntactic and semantic dependency
structures. In Proceedings of the 13th
Conference on Computational Natural
Language Learning: Shared Task,
volume 2007, pages 67–72, Boulder, CO.
Bohnet, Bernd. 2010. Very high accuracy
and fast dependency parsing is not
a contradiction. In Proceedings of the
23rd International Conference on
Computational Linguistics, pages 89–97,
Beijing.
Bohnet, Bernd. 2011. Comparing advanced
graph-based and transition-based
dependency. In Proceedings of the
International Conference on Dependency
Linguistics, pages 282–289, Barcelona.
Boyd, Adriane, Markus Dickinson, and
W. Detmar Meurers. 2008. On detecting
errors in dependency treebanks. Research
on Language and Computation, 6(2):113–137.
Brants, Sabine, Stefanie Dipper, Silvia
Hansen-Shirra, Wolfgang Lezius, and
George Smith. 2002. The TIGER treebank.
In Proceedings of the 1st Workshop on
Treebanks and Linguistic Theories,
20–21 September 2002, Sozopol,
Bulgaria, pages 24–41.
Bresnan, Joan. 2001. Lexical-Functional Syntax.
Blackwell Publishers, Oxford.
Buchholz, Sabine and Erwin Marsi. 2006.
CoNLL-X shared task on multilingual
dependency parsing. In Proceedings of the
10th Conference on Computational Natural
Language Learning, pages 149–164,
New York, NY.
Carreras, Xavier. 2007. Experiments with a
higher-order projective dependency parser.
In Proceedings of the 2007 Joint Conference on
Empirical Methods in Natural Language
Processing and Computational Natural
Language Learning, pages 957–961, Prague.
Cohen, Shay B. and Noah A. Smith. 2007.
Joint morphological and syntactic
disambiguation. In Proceedings of the
2007 Joint Conference on Empirical
Methods in Natural Language Processing
and Computational Natural Language
Learning, pages 208–217, Prague.
Collins, Michael, Jan Hajič, Lance Ramshaw,
and Christoph Tillmann. 1999. A statistical
parser for Czech. In Proceedings of the
37th Annual Meeting of the Association for
Computational Linguistics, pages 505–512,
College Park, MD.
Crammer, Koby, Ofer Dekel, Shai Shalev-
Shwartz, and Yoram Singer. 2003.
Online passive-aggressive algorithms.
In Proceedings of the 16th Annual
Conference on Neural Information Processing
Systems, volume 7, pages 1217–1224,
Cambridge, MA.
Csendes, Dóra, János Csirik, and Tibor
Gyimóthy. 2004. The Szeged Corpus:
A POS tagged and syntactically annotated
Hungarian natural language corpus.
In Proceedings of the 5th International
Workshop on Linguistically Interpreted
Corpora, pages 19–23, Geneva.
Eisenberg, Peter. 2006. Grundriss der deutschen
Grammatik: Der Satz. J.B. Metzler, Stuttgart,
3rd edition.
Eisner, Jason. 1997. Bilexical grammars
and a cubic-time probabilistic parser.
In Proceedings of the 5th International
Conference on Parsing Technologies,
pages 54–65, Cambridge, MA.
Eryiğit, Gülşen, Joakim Nivre, and Kemal
Oflazer. 2008. Dependency parsing of
Turkish. Computational Linguistics,
34(3):357–389.
Foth, Kilian A. and Wolfgang Menzel.
2006. Hybrid parsing: Using probabilistic
models as predictors for a symbolic parser.
In Proceedings of the 21st International
Conference on Computational Linguistics
and the 44th annual meeting of the ACL,
pages 321–328, Sydney.
Gesmundo, Andrea, James Henderson,
Paola Merlo, and Ivan Titov. 2009.
A latent variable model of synchronous
syntactic-semantic parsing for multiple
languages. In Proceedings of the 13th
Conference on Computational Natural
Language Learning: Shared Task,
pages 37–42, Boulder, CO.
Goldberg, Yoav and Michael Elhadad. 2010.
Easy first dependency parsing of modern
Hebrew. In Proceedings of the NAACL HLT
2010 First Workshop on Statistical Parsing
of Morphologically-Rich Languages,
pages 103–107, Los Angeles, CA.
Goldberg, Yoav and Reut Tsarfaty. 2008.
A single generative model for joint
morphological segmentation and syntactic
parsing. In Proceedings of the 46th Annual
Meeting of the Association for Computational
Linguistics, pages 371–379, Columbus, OH.
Hajič, Jan, Massimiliano Ciaramita, Richard
Johansson, Daisuke Kawahara,
Maria Antònia Martí, Lluís Màrquez,
Adam Meyers, Joakim Nivre, Sebastian
Padó, Jan Štěpánek, Pavel Stranák, Mihai
Surdeanu, Nianwen Xue, and Yi Zhang.
2009. The CoNLL-2009 shared task:
Syntactic and semantic dependencies in
multiple languages. In Proceedings of the
13th Conference on Computational Natural
Language Learning: Shared Task, pages 1–18,
Boulder, CO.
Hajič, Jan, Jarmila Panevová, Eva Hajičová,
Petr Sgall, Petr Pajas, Jan Štěpánek, Jiří
Havelka, and Marie Mikulová. 2006. Prague
Dependency Treebank 2.0, Linguistic Data
Consortium, Philadelphia, PA.
Hudson, Richard A. 1984. Word Grammar.
Basil Blackwell, Oxford.
Janda, Laura A. and Charles E. Townsend.
2000. Czech. Lincom Europa, Munich.
Johansson, Richard and Pierre Nugues. 2008.
Dependency-based syntactic-semantic
analysis with PropBank and NomBank.
In Proceedings of the 12th Conference on
Computational Natural Language Learning,
pages 183–187, Manchester.
Khmylko, Lidia, Kilian A. Foth, and
Wolfgang Menzel. 2009. Co-parsing with
competitive models. In Proceedings of the
11th International Conference on Parsing
Technologies, pages 99–107, Paris.
Klenner, Manfred. 2007. Shallow dependency
labeling. In Proceedings of the ACL 2007 Demo
and Poster Sessions, pages 201–204, Prague.
Krivanek, Julia and W. Detmar Meurers.
2011. Comparing rule-based and
data-driven dependency parsing of learner
language. In Proceedings of the International
Conference on Dependency Linguistics,
pages 310–318, Barcelona.
Kübler, Sandra. 2008. The PaGe 2008 shared
task on parsing German. In Proceedings
of the Workshop on Parsing German,
pages 55–63, Morristown, NJ.
Lee, John, Jason Naradowsky, and David A.
Smith. 2011. A discriminative model for
joint morphological disambiguation and
dependency parsing. In Proceedings of the
49th Annual Meeting of the Association for
Computational Linguistics, pages 885–894,
Portland, OR.
Magnanti, Thomas and Laurence Wolsey.
1995. Optimal trees. Handbooks in
Operations Research and Management
Science, 7(April):503–615.
Martins, André F. T., Noah A. Smith,
and Eric P. Xing. 2009. Concise integer
linear programming formulations for
dependency parsing. In Proceedings of the
Joint Conference of the 47th Annual Meeting
of the ACL and the 4th International Joint
Conference on Natural Language Processing
of the AFNLP, pages 342–350, Suntec.
Marton, Yuval, Nizar Habash, and
Owen Rambow. 2010. Improving Arabic
dependency parsing with lexical and
inflectional morphological features.
In Proceedings of the NAACL HLT 2010
First Workshop on Statistical Parsing of
Morphologically-Rich Languages,
pages 13–21, Los Angeles, CA.
McDonald, Ryan and Fernando Pereira.
2006. Online learning of approximate
dependency parsing algorithms. In
Proceedings of the 11th Conference of the
European Chapter of the Association for
Computational Linguistics, pages 81–88,
Trento.
McDonald, Ryan, Fernando Pereira,
Kiril Ribarov, and Jan Hajič. 2005.
Non-projective dependency parsing
using spanning tree algorithms. In
Proceedings of the 2005 Conference on
Human Language Technology and Empirical
Methods in Natural Language Processing,
pages 523–530, Morristown, NJ.
Mel’čuk, Igor. 1988. Dependency Syntax:
Theory and Practice. SUNY Series in
Linguistics. State University Press of
New York.
Nichols, Joanna. 1986. Head-marking and
dependent-marking grammar. Language,
62(1):56–119.
Nivre, Joakim, Johan Hall, Sandra Kübler,
Ryan McDonald, Jens Nilsson, Sebastian
Riedel, and Deniz Yuret. 2007a. The
CoNLL 2007 shared task on dependency
parsing. In Proceedings of the 2007 Joint
Conference on Empirical Methods in Natural
Language Processing and Computational
Natural Language Learning, pages 915–932,
Prague.
Nivre, Joakim, Johan Hall, Jens Nilsson,
Atanas Chanev, Gülşen Eryiğit,
Sandra Kübler, Svetoslav Marinov,
and Erwin Marsi. 2007b. MaltParser:
A language-independent system for
data-driven dependency parsing.
Natural Language Engineering,
13(2):95–135.
Riedel, Sebastian and James Clarke. 2006.
Incremental integer linear programming
for non-projective dependency parsing.
In Proceedings of the 2006 Conference on
Empirical Methods in Natural Language
Processing, pages 129–137, Sydney.
Schiehlen, Michael. 2004. Annotation
strategies for probabilistic parsing
in German. In Proceedings of the 20th
International Conference on Computational
Linguistics, pages 390–397, Geneva.
Schiller, Anne. 1994. Dmor – user’s guide.
Technical report, University of Stuttgart.
Seeker, Wolfgang and Jonas Kuhn. 2012.
Making ellipses explicit in dependency
conversion for a German treebank.
In Proceedings of the 8th International
Conference on Language Resources and
Evaluation, pages 3132–3139, Istanbul.
Seeker, Wolfgang and Jonas Kuhn. 2011.
On the role of explicit morphological
feature representation in syntactic
dependency parsing for German. In
Proceedings of the 12th International
Conference on Parsing Technologies,
pages 58–62, Dublin.
Seeker, Wolfgang, Ines Rehbein, Jonas Kuhn,
and Josef Van Genabith. 2010. Hard
constraints for grammatical function
labelling. In Proceedings of the 48th Annual
Meeting of the Association for Computational
Linguistics, pages 1087–1097, Uppsala.
Spoustová, Drahomíra “Johanka,” Jan Hajič,
Jan Raab, and Miroslav Spousta. 2009.
Semi-supervised training for the averaged
perceptron POS tagger. In Proceedings of the
12th Conference of the European Chapter of the
Association for Computational Linguistics,
pages 763–771, Athens.
Taskar, Ben, Vassil Chatalbashev, Daphne
Koller, and Carlos Guestrin. 2005.
Learning structured prediction models:
A large margin approach. In Proceedings
of the 22nd Annual International Conference
on Machine Learning, pages 896–903,
Bonn.
Trón, Viktor, Péter Halácsy, Péter Rebrus,
András Rung, Péter Vajda, and Eszter
Simon. 2006. Morphdb.hu: Hungarian
lexical database and morphological
grammar. In Proceedings of the 5th
International Conference on Language
Resources and Evaluation, pages 1670–1673,
Genoa, Italy.
Tsarfaty, Reut, Djamé Seddah, Yoav
Goldberg, Sandra Kübler, Marie Candito,
Jennifer Foster, Yannick Versley, Ines
Rehbein, and Lamia Tounsi. 2010.
Statistical parsing of morphologically
rich languages (SPMRL): What, how and
whither. In Proceedings of the NAACL HLT
2010 First Workshop on Statistical Parsing
of Morphologically-Rich Languages,
pages 1–12, Los Angeles, CA.
Tsarfaty, Reut and Khalil Sima’an. 2008.
Relational-realizational parsing. In
Proceedings of the 22nd International
Conference on Computational Linguistics,
pages 889–896, Manchester.
Tsarfaty, Reut and Khalil Sima’an. 2010.
Modeling morphosyntactic agreement in
constituency-based parsing of Modern
Hebrew. In Proceedings of the NAACL HLT
2010 First Workshop on Statistical Parsing
of Morphologically-Rich Languages,
pages 40–48, Los Angeles, CA.
Versley, Yannick. 2005. Parser evaluation
across text types. In Proceedings of the 4th
Workshop on Treebanks and Linguistic
Theories, pages 209–220, Barcelona.
Versley, Yannick and Ines Rehbein. 2009.
Scalable discriminative parsing for
German. In Proceedings of the 11th
International Conference on Parsing
Technologies, pages 134–137, Paris.
Vincze, Veronika, Dóra Szauter, Attila
Almási, György Móra, Zoltán Alexin,
and János Csirik. 2010. Hungarian
Dependency Treebank. In Proceedings
of the 7th Conference on International
Language Resources and Evaluation,
pages 1855–1862, Valletta.
Zsibrita, János, Veronika Vincze, and
Richárd Farkas. 2010. Ismeretlen
kifejezések és a szófaji egyértelmüsítés.
In VII. Magyar Számítógépes Nyelvészeti
Konferencia, pages 275–283, Szeged.