Mildly Non-Projective Dependency Grammar
Marco Kuhlmann*
Uppsala University
Syntactic representations based on word-to-word dependencies have a long-standing tradition
in descriptive linguistics, and receive considerable interest in many applications. Néanmoins,
dependency syntax has remained something of an island from a formal point of view. De plus,
most formalisms available for dependency grammar are restricted to projective analyses, et
thus not able to support natural accounts of phenomena such as wh-movement and cross–serial
dependencies. In this article we present a formalism for non-projective dependency grammar
in the framework of linear context-free rewriting systems. A characteristic property of our
formalism is a close correspondence between the non-projectivity of the dependency trees
admitted by a grammar on the one hand, and the parsing complexity of the grammar on the
other. We show that parsing with unrestricted grammars is intractable. We therefore study two
constraints on non-projectivity, block-degree and well-nestedness. Jointly, these two constraints
define a class of “mildly” non-projective dependency grammars that can be parsed in polynomial
temps. An evaluation on five dependency treebanks shows that these grammars have a good
coverage of empirical data.
1. Introduction
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
c
o
je
je
/
je
un
r
t
je
c
e
–
p
d
F
/
/
/
/
3
9
2
3
5
5
1
8
1
2
0
0
0
/
c
o
Syntactic representations based on word-to-word dependencies have a long-standing
tradition in descriptive linguistics. Since the seminal work of Tesni`ere (1959), ils
have become the basis for several linguistic theories, such as Functional Generative
Description (Sgall, Hajiˇcov´a, and Panevov´a 1986), Meaning–Text Theory (Mel’ˇcuk 1988),
and Word Grammar (Hudson 2007). In recent years they have also been used for a wide
range of practical applications, such as information extraction, machine translation, et
question answering. We ascribe the widespread interest in dependency structures to
their intuitive appeal, their conceptual simplicity, and in particular to the availability of
accurate and efficient dependency parsers for a wide range of languages (Buchholz and
Marsi 2006; Nivre et al. 2007).
Although there exist both a considerable practical interest and an extensive lin-
guistic literature, dependency syntax has remained something of an island from a
formal point of view. En particulier, there are relatively few results that bridge between
dependency syntax and other traditions, such as phrase structure or categorial syntax.
je
je
_
un
_
0
0
1
2
5
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
∗ Department of Linguistics and Philology, Box 635, 751 26 Uppsala, Sweden.
E-mail: marco.kuhlmann@lingfil.uu.se.
Submission received: 17 Décembre 2009; revised submission received: 3 Avril 2012; accepted for publication:
24 May 2012.
est ce que je:10.1162/COLI a 00125
© 2013 Association for Computational Linguistics
Computational Linguistics
Volume 39, Nombre 2
Chiffre 1
Nested dependencies and cross–serial dependencies.
This makes it hard to gauge the similarities and differences between the paradigms,
and hampers the exchange of linguistic resources and computational methods. Un
overarching goal of this article is to bring dependency grammar closer to the mainland
of formal study.
One of the few bridging results for dependency grammar is thanks to Gaifman
(1965), who studied a formalism that we will refer to as Hays–Gaifman grammar, et
proved it to be weakly equivalent to context-free phrase structure grammar. Although
this result is of fundamental importance from a theoretical point of view, its practical
usefulness is limited. En particulier, Hays–Gaifman grammar is restricted to projective
dependency structures, which is similar to the familiar restriction to contiguous con-
stituents. Encore, non-projective dependencies naturally arise in the analysis of natural
langue. One classic example of this is the phenomenon of cross–serial dependencies
in Dutch. In this language, the nominal arguments of verbs that also select an infinitival
complement occur in the same order as the verbs themselves:
(je) dat
que
Jan1 Piet2 Marie3 zag1
Jan Piet Marie saw help
helpen2 lezen3
read
(Dutch)
‘that Jan saw Piet help Marie read’
In German, the order of the nominal arguments instead inverts the verb order:
(ii) dass Jan1 Piet2 Marie3 lesen3 helfen2 sah1
saw
Jan Piet Marie read help
que
(German)
Chiffre 1 shows dependency trees for the two examples.1 The German linearization
gives rise to a projective structure, where the verb–argument dependencies are nested
within each other, whereas the Dutch linearization induces a non-projective structure
with crossing edges. To account for such structures we need to turn to formalisms more
expressive than Hays–Gaifman grammars.
In this article we present a formalism for non-projective dependency grammar
based on linear context-free rewriting systems (LCFRSs) (Vijay-Shanker, Weir, and Joshi
1987; Weir 1988). This framework was introduced to facilitate the comparison of various
1 We draw the nodes of a dependency tree as circles, and the edges as arrows pointing towards the
dependent (away from the root node). Following Hays (1964), we use dotted lines to help us keep
track of the positions of the nodes in the linear order, and to associate nodes with lexical items.
356
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
c
o
je
je
/
je
un
r
t
je
c
e
–
p
d
F
/
/
/
/
3
9
2
3
5
5
1
8
1
2
0
0
0
/
c
o
je
je
_
un
_
0
0
1
2
5
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Kuhlmann
Mildly Non-Projective Dependency Grammar
grammar formalisms, including standard context-free grammar, tree-adjoining gram-
mar (Joshi and Schabes 1997), and combinatory categorial grammar (Steedman and
Baldridge 2011). It also comprises, entre autres, multiple context-free grammars (Seki
et autres. 1991), minimalist grammars (Michaelis 1998), and simple range concatenation
grammars (Boullier 2004).
The article is structured as follows. In Section 2 we provide the technical back-
ground to our work; in particular, we introduce our terminology and notation for linear
context-free rewriting systems. An LCFRS generates a set of terms (formal expressions)
which are interpreted as derivation trees of objects from some domain. Each term also
has a secondary interpretation under which it denotes a tuple of strings, representing
the string yield of the derived object. In Section 3 we introduce the central notion of a
lexicalized linear context-free rewriting system, which is an LCFRS in which each rule
of the grammar is associated with an overt lexical item, representing a syntactic head
(cf. Schabes, Abeill´e, and Joshi 1988 and Schabes 1990). We show that this property gives
rise to an additional interpretation under which each term denotes a dependency tree
on its yield. With this interpretation, lexicalized LCFRSs can be used as dependency
grammars.
In Section 4 we show how to acquire lexicalized LCFRSs from dependency tree-
banks. This works in much the same way as the extraction of context-free grammars
from phrase structure treebanks (cf. Charniak 1996), except that the derivation trees of
dependency trees are not immediately accessible in the treebank. We therefore present
an efficient algorithm for computing a canonical derivation tree for an input depen-
dency tree; from this derivation tree, the rules of the grammar can be extracted in a
straightforward way. The algorithm was originally published by Kuhlmann and Satta
(2009). It produces a restricted type of lexicalized LCFRS that we call “canonical.” In
Section 5 we provide a declarative characterization of this class of grammars, and show
that every lexicalized LCFRS is (strongly) equivalent to a canonical one, in the sense that
it induces the same set of dependency trees.
In Section 6 we present a simple parsing algorithm for LCFRSs. Bien que le
runtime of this algorithm is polynomial in the length of the sentence, the degree of
the polynomial depends on two grammar-specific measures called fan-out and rank.
We show that even in the restricted case of canonical grammars, parsing is an NP-
hard problem. It is important therefore to keep the fan-out and the rank of a grammar
as low as possible, and much of the recent work on LCFRSs has been devoted to
the development of techniques that optimize parsing complexity in various scenarios
G ´omez-Rodr´ıguez and Satta 2009; G ´omez-Rodr´ıguez et al. 2009; Kuhlmann and Satta
2009; Gildea 2010; G ´omez-Rodr´ıguez, Kuhlmann, and Satta 2010; Sagot and Satta 2010;
and Crescenzi et al. 2011).
In this article we explore the impact of non-projectivity on parsing complexity. Dans
Section 7 we present the structural correspondent of the fan-out of a lexicalized LCFRS,
a measure called block-degree (or gap-degree) (Holan et al. 1998). Although there is
no theoretical upper bound on the block-degree of the dependency trees needed for
linguistic analysis, we provide evidence from several dependency treebanks showing
que, from a practical point of view, this upper bound can be put at a value of as low as 2.
In Section 8 we study a second constraint on non-projectivity called well-nestedness
(Bodirsky, Kuhlmann, and M ¨ohl 2005), and show that its presence facilitates tractable
parsing. This comes at the cost of a small loss in coverage on treebank data. Bounded
block-degree and well-nestedness jointly define a class of “mildly” non-projective
dependency grammars that can be parsed in polynomial time.
Section 9 summarizes our main contributions and concludes the article.
357
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
c
o
je
je
/
je
un
r
t
je
c
e
–
p
d
F
/
/
/
/
3
9
2
3
5
5
1
8
1
2
0
0
0
/
c
o
je
je
_
un
_
0
0
1
2
5
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Computational Linguistics
Volume 39, Nombre 2
2. Technical Background
We assume basic familiarity with linear context-free rewriting systems (voir, par exemple., Vijay-
Shanker, Weir, and Joshi 1987 and Weir 1988) and only review the terminology and
notation that we use in this article.
A linear context-free rewriting system (LCFRS) is a structure G = (N, Σ, P., S)
where N is a set of nonterminals, Σ is a set of function symbols, P is a finite set of
production rules, and S ∈ N is a distinguished start symbol. Rules take the form
A0
→ f (A1, . . . , Am)
(1)
where f is a function symbol and the Ai are nonterminals. Rules are used for rewriting
in the same way as in a context-free grammar, with the function symbols acting as
terminals. The outcome of the rewriting process is a set T(G) of terms, tree-formed
expressions built from function symbols. Each term is then associated with a string
yield, more specifically a tuple of strings. For this, every function symbol f comes with
a yield function that specifies how to compute the yield of a term f (t1, . . . , tm) from the
yields of its subterms ti. Yield functions are defined by equations
F ((cid:5)x1,1, . . . , x1,k1
(cid:6), . . . , (cid:5)xm,1, . . . , xm,km
(cid:6)) = (cid:5)α1, . . . , αk0
(cid:6)
(2)
where the tuple on the right-hand side consists of strings over the variables on the
left-hand side and some given alphabet of yield symbols, and contains exactly one
occurrence of each variable. For a yield function f defined by an equation of this form,
→ k0. To guarantee that
→ k0, denoted by f : k1
we say that f is of type k1
the string yield of a term is well-defined, each nonterminal A is associated with a
fan-out ϕ(UN) ≥ 1, and it is required that for every rule (1),
· · · km
· · · km
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
c
o
je
je
/
je
un
r
t
je
c
e
–
p
d
F
/
/
/
/
3
9
2
3
5
5
1
8
1
2
0
0
0
/
c
o
F : ϕ(A1) · · · ϕ(Am) → ϕ(A0)
In Equation (2), the values m and k0 are called the rank and the fan-out of f , respectivement.
The rank and the fan-out of an LCFRS are the maximal rank and fan-out of its yield
les fonctions.
Exemple 1
Chiffre 2 shows an example of an LCFRS for the language { (cid:5)anbncndn(cid:6) | n ≥ 0 }.
Équation (2) is uniquely determined by the tuple on the right-hand side of the
equation. We call this tuple the template of the yield function f , and use it as the
canonical function symbol for f . This gives rise to a compact notation for LCFRSs,
je
je
_
un
_
0
0
1
2
5
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Chiffre 2
An LCFRS that generates the yield language { (cid:5)anbncndn(cid:6) | n ≥ 0 }.
358
Kuhlmann
Mildly Non-Projective Dependency Grammar
illustrated in the right column of Figure 2. In this notation, to save some subscripts,
we use the following shorthands for variables: x and x1 for x1,1; x2 for x1,2; x3 for x1,3;
y and y1 for x2,1; y2 for x2,2; y3 for x2,3.
3. Lexicalized LCFRSs as Dependency Grammars
Recall the following examples for verb–argument dependencies in German and Dutch
from Section 1:
(iii) dass Jan1 Piet2 Marie3 lesen3 helfen2 sah1
saw
Jan Piet Marie read
help
que
(iv) dat
que
Jan1 Piet2 Marie3 zag1
Jan Piet Marie saw help
helpen2 lezen3
read
(German)
(Dutch)
‘that Jan saw Piet help Marie read’
Chiffre 3 shows the production rules of two linear context-free rewriting systems (one for
German, one for Dutch) that generate these examples. The grammars are lexicalized in
the sense that each of their yield functions is associated with a lexical item, such as sah or
zag (cf. Schabes, Abeill´e, and Joshi 1988 and Schabes 1990). Productions with lexicalized
yield functions can be read as dependency rules. Par exemple, the rules
V → (cid:5)x y sah(cid:6)(N, V)
(German)
V → (cid:5)x y1 zag y2
(cid:6)(N, V)
(Dutch)
can be read as stating that the verb to see requires two dependents, one noun (N) et
one verb (V). Based on this reading, every term generated by a lexicalized LCFRS
does not only yield a tuple of strings, but also induces a dependency tree on these
strings: Each parent–child relation in the term represents a dependency between the
associated lexical items (cf. Rambow and Joshi 1997). Thus every lexicalized LCFRS can
be reinterpreted as a dependency grammar. To illustrate the idea, Chiffre 4 shows (le
tree representations of) two terms generated by the grammars G1 and G2, together with
the dependency trees induced by them. Note that these are the same trees that we gave
pour (iii) et (iv) in Figure 1.
Our goal for the remainder of this section is to make the notion of induction formally
precise. To this end we will reinterpret the yield functions of lexicalized LCFRSs as
operations on dependency trees.
Chiffre 3
Lexicalized linear context-free rewriting systems.
359
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
c
o
je
je
/
je
un
r
t
je
c
e
–
p
d
F
/
/
/
/
3
9
2
3
5
5
1
8
1
2
0
0
0
/
c
o
je
je
_
un
_
0
0
1
2
5
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Computational Linguistics
Volume 39, Nombre 2
Chiffre 4
Lexicalized linear context-free rewriting systems induce dependency trees.
3.1 Dependency Trees
By a dependency tree, we mean a pair ( (cid:2)w, D), où (cid:2)w is a tuple of strings, and D is
a tree-shaped graph whose nodes correspond to the occurrences of symbols in (cid:2)w, et
whose edges represent dependency relations between these occurrences. We identify
occurrences in (cid:2)w by pairs (je, j) of integers, where i indexes the component of (cid:2)w that
contains the occurrence, and j specifies the linear position of the occurrence within
that component. We can then formally define a dependency graph for a tuple of
strings
(cid:2)w = (cid:5)a1,1
· · · a1,n1 , . . . , ak,1
· · · ak,nk
(cid:6)
as a directed graph G = (V, E) où
V = { (je, j) | 1 ≤ i ≤ k, 1 ≤ j ≤ ni
}
et
E ⊆ V × V
We use u and v as variables for nodes, and denote edges (toi, v) as u → v. A dependency
tree D for (cid:2)w is a dependency graph for (cid:2)w in which there exists a root node r such
that for any node u, there is exactly one directed path from r to u. A dependency
tree is called simple if (cid:2)w consists of a single string w. Dans ce cas, we write the de-
pendency tree as (w, D), and identify occurrences by their linear positions j in w, avec
1 ≤ j ≤ |w|.
Exemple 2
Chiffre 5 shows examples of dependency trees. In pictures of such structures we use
dashed boxes to group nodes that correspond to occurrences from the same tuple
360
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
c
o
je
je
/
je
un
r
t
je
c
e
–
p
d
F
/
/
/
/
3
9
2
3
5
5
1
8
1
2
0
0
0
/
c
o
je
je
_
un
_
0
0
1
2
5
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Kuhlmann
Mildly Non-Projective Dependency Grammar
Chiffre 5
Dependency trees.
component; cependant, we usually omit the box when there is only one component.
Writing Di as Di = (Vi, Ei) we have:
V1 = {(1, 1)}
V2 = {(1, 1), (1, 2)}
V3 = {(1, 1), (2, 1)}
V4 = {(1, 1), (3, 1)}
E1 = {}
E2 = {(1, 1) → (1, 2)}
E3 = {(1, 1) → (2, 1)}
E4 = {(1, 1) → (3, 1)}
We use standard terminology from graph theory for dependency trees and the
relations between their nodes. En particulier, for a node u, the set of descendants of u,
which we denote by (cid:10)toi(cid:11), is the set of nodes that can be reached from u by following a
directed path consisting of zero or more edges. We write u < v to express that the node u
precedes the node v when reading the yield from left to right. Formally, precedence is
the lexicographical order on occurrences:
(i1, j1) < (i2, j2)
if and only if
either i1 < i2 or (i1 = i2 and j1 < j2)
3.2 Operations on Dependency Trees
A yield function f is called lexicalized if its template contains exactly one yield symbol,
representing a lexical item; this symbol is then called the anchor of f . With every
lexicalized yield function f we associate an operation f
on dependency trees as follows.
Let (cid:2)w1, . . . , (cid:2)wm, (cid:2)w be tuples of strings such that
(cid:2)
f ( (cid:2)w1, . . . , (cid:2)wm) = (cid:2)w
and let Di be a dependency tree for (cid:2)wi. By the definition of yield functions, every
occurrence u in an input tuple (cid:2)wi corresponds to exactly one occurrence in the output
tuple (cid:2)w; we denote this occurrence by ¯u. Let G be the dependency graph for (cid:2)w that
has an edge ¯u → ¯v whenever there is an edge u → v in some Di, and no other edges.
Because f is lexicalized, there is exactly one occurrence r in the output tuple (cid:2)w that does
not correspond to any occurrence in some (cid:2)wi; this is the occurrence of the anchor of f .
Let D be the dependency tree for (cid:2)w that is obtained by adding to the graph G all edges
of the form r → ¯ri, where ri is the root node of Di. By this construction, the occurrence r
of the anchor becomes the root node of D, and the root nodes of the input dependency
trees Di become its dependents. We then define
(cid:2)
f
(( (cid:2)w1, D1), . . . , ( (cid:2)wm, Dm)) = ( (cid:2)w, D)
361
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
3
9
2
3
5
5
1
8
1
2
0
0
0
/
c
o
l
i
_
a
_
0
0
1
2
5
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Computational Linguistics
Volume 39, Number 2
Figure 6
Operations on dependency trees.
Example 3
We consider a concrete application of an operation on dependency trees, illustrated in
Figure 6. In this example we have
f = (cid:5)x1 b, y x2
(cid:6)
(cid:2)w1 = (cid:5)a, e(cid:6)
(cid:2)w2 = (cid:5)c d(cid:6)
(cid:2)w = f ( (cid:2)w1, (cid:2)w2) = (cid:5)a b, c d e(cid:6)
and the dependency trees D1, D2 are defined as
D1 = ({(1, 1), (2, 1)}, {(1, 1) → (2, 1)}) D2 = ({(1, 1), (1, 2)}, {(1, 1) → (1, 2)})
We show that f
(cid:2)
(( (cid:2)w1, D1), ( (cid:2)w2, D2)) = ( (cid:2)w, D), where D = (V, E) with
V = {(1, 1), (1, 2), (2, 1), (2, 2), (2, 3)}
E = {(1, 1) → (2, 3), (1, 2) → (1, 1), (1, 2) → (2, 1), (2, 1) → (2, 2)}
The correspondences between the occurrences u in the input tuples and the occur-
rences ¯u in the output tuple are as follows:
for (cid:2)w1:
(1, 1) = (1, 1) , (2, 1) = (2, 3)
for (cid:2)w2:
(1, 1) = (2, 1) , (1, 2) = (2, 2)
By copying the edges from the input dependency trees, we obtain the intermediate
dependency graph G = (V, E
) for (cid:2)w, where
(cid:2)
(cid:2) = {(1, 1) → (2, 3), (2, 1) → (2, 2)}
E
The occurrence r of the anchor b of f in (cid:2)w is (1, 2); the nodes of G that correspond to
the root nodes of D1 and D2 are ¯r1 = (1, 1) and ¯r2 = (2, 1). The dependency tree D is
obtained by adding the edges r → ¯r1 and r → ¯r2 to G.
4. Extraction of Dependency Grammars
We now show how to extract lexicalized linear context-free rewriting systems from
dependency treebanks. To this end, we adapt the standard technique for extracting
context-free grammars from phrase structure treebanks (Charniak 1996).
Our technique was originally published by Kuhlmann and Satta (2009). In recent
work, Maier and Lichte (2011) have shown how to unify it with a similar technique
for the extraction of range concatenation grammars from discontinuous constituent
structures, due to Maier and Søgaard (2008). To simplify our presentation we restrict
our attention to treebanks containing simple dependency trees.
362
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
3
9
2
3
5
5
1
8
1
2
0
0
0
/
c
o
l
i
_
a
_
0
0
1
2
5
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Kuhlmann
Mildly Non-Projective Dependency Grammar
Figure 7
A dependency tree and one of its construction trees.
To extract a lexicalized LCFRS from a dependency treebank we proceed as follows.
First, for each dependency tree (w, D) in the treebank, we compute a construction tree,
a term t over yield functions that induces (w, D). Then we collect a set of production
rules, one rule for each node of the construction trees. As an example, consider Fig-
ure 7, which shows a dependency tree with one of its construction trees. (The analysis
is taken from K ¨ubler, McDonald, and Nivre [2009].) From this construction tree we
extract the following rules. The nonterminals (in bold) represent linear positions of
nodes.
1 → (cid:5)A(cid:6)
2 → (cid:5)x hearing, y(cid:6)(1, 5)
3 → (cid:5)x1 is y1 x2 y2
(cid:6)(2, 4)
4 → (cid:5)scheduled, x(cid:6)(8)
5 → (cid:5)on x(cid:6)(7)
6 → (cid:5)the(cid:6)
7 → (cid:5)x issue(cid:6)(6)
8 → (cid:5)today(cid:6)
Rules like these can serve as the starting point for practical systems for data-driven,
non-projective dependency parsing (Maier and Kallmeyer 2010).
Because the extraction of rules from construction trees is straightforward, the prob-
lem that we focus on in this section is how to obtain these trees in the first place. Our
procedure for computing construction trees is based on the concept of “blocks.”
4.1 Blocks
Let D be a dependency tree. A segment of D is a contiguous, non-empty sequence
of nodes of D, all of which belong to the same component of the string yield. Thus
a segment contains its endpoints, as well as all nodes between the endpoints in the
precedence order. For a node u of D, a block of u is a longest segment consisting of
descendants of u. This means that the left endpoint of a block of u either is the first node
in its component, or is preceded by a node that is not a descendant of u. A symmetric
property holds for the right endpoint.
Example 4
Consider the node 2 of the dependency tree in Figure 7. The descendants of 2 fall into
two blocks, marked by the dashed boxes: 1 2 and 5 6 7.
363
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
3
9
2
3
5
5
1
8
1
2
0
0
0
/
c
o
l
i
_
a
_
0
0
1
2
5
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Computational Linguistics
Volume 39, Number 2
We use (cid:2)u and (cid:2)v as variables for blocks. Extending the precedence order on nodes,
we say that a block (cid:2)u precedes a block (cid:2)v, denoted by (cid:2)u < (cid:2)v, if the right endpoint of (cid:2)u
precedes the left endpoint of (cid:2)v.
4.2 Computing Canonical Construction Trees
To obtain a canonical construction tree t for a dependency tree (w, D) we label each
node u of D with a yield function f as follows. Let (cid:2)w be the tuple consisting of the blocks
of u, in the order of their precedence, and let (cid:2)w1, . . . , (cid:2)wm be the corresponding tuples for
the children of u. We may view blocks as strings of nodes. Taking this view, we compute
the (unique) yield function g with the property that
g( (cid:2)w1, . . . , (cid:2)wm) = (cid:2)w
The anchor of g is the node u, the rank of g corresponds to the number of children
of u, the variables in the template of g represent the blocks of these children, and the
components of the template represent the blocks of u. To obtain f , we take the template
of g and replace the occurrence of u with the corresponding lexical item.
Example 5
Node 2 of the dependency tree shown in Figure 7 has two children, 1 and 5. We have
(cid:2)w = (cid:5)1 2, 5 6 7(cid:6)
(cid:2)w1 = (cid:5)1(cid:6)
(cid:2)w2 = (cid:5)5 6 7(cid:6)
g = (cid:5)x 2, y(cid:6)
f = (cid:5)x hearing, y(cid:6)
Note that in order to properly define f we need to assume some order on the
children of u. The function g (and hence the construction tree t) is unique up to the
specific choice of this order. In the following we assume that children are ordered from
left to right based on the position of their leftmost descendants.
4.3 Computing the Blocks of a Dependency Tree
The algorithmically most interesting part of our extraction procedure is the computation
of the yield function g. The template of g is uniquely determined by the left-to-right
sequence of the endpoints of the blocks of u and its children. An efficient algorithm that
can be used to compute these sequences is given in Table 1.
4.3.1 Description. We start at a virtual root node ⊥ (line 1) which serves as the parent
of the real root node. For each node next in the precedence order of D, we follow the
shortest path from the current node current to next. To determine this path, we compute
the lowest common ancestor lca of the two nodes (lines 4–5), using a set of markings
on the nodes. At the beginning of each iteration of the for loop in line 2, all ancestors of
current (including the virtual root node ⊥) are marked; therefore, we find lca by going
upwards from next to the first node that is marked. To restore the loop invariant, we
then unmark all nodes on the path from current to lca (lines 6–9). Each time we move
down from a node to one of its children (line 12), we record the information that next
is the left endpoint of a block of current. Symmetrically, each time we move up from a
node to its parent (lines 8 and 17), we record the information that next − 1 is the right
endpoint of a block of current. The while loop in lines 15–18 takes us from the last node
of the dependency tree back to the node ⊥.
364
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
3
9
2
3
5
5
1
8
1
2
0
0
0
/
c
o
l
i
_
a
_
0
0
1
2
5
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Kuhlmann
Mildly Non-Projective Dependency Grammar
Table 1
Computing the blocks of a simple dependency tree.
while current (cid:14)= lca do
push lca to stack; lca ← the parent of lca
lca ← next; stack ← []
while lca is not marked do
current ← ⊥; mark current
for each node next of D from 1 to |w| do
Input: a string w and a simple dependency tree D for w
1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
15: while current (cid:14)= ⊥ do
16:
17:
18:
(cid:2) |w| is the right endpoint of a block of current
(cid:2) move up from current to the parent of current
unmark current; current ← the parent of current
(cid:2) arrive at next; at this point, current = next
while stack is not empty do
current ← pop stack; mark current
(cid:2) move down from the parent of current to current
(cid:2) next is the left endpoint of a block of current
(cid:2) next − 1 is the right endpoint of a block of current
(cid:2) move up from current to the parent of current
unmark current; current ← the parent of current
loop 1
loop 2
loop 3
loop 4
4.3.2 Runtime Analysis. We analyze the runtime of our algorithm. Let m be the total
number of blocks of D. Let us write ni for the total number of iterations of the ith while
loop, and let n = n1 + n2 + n3 + n4. Under the reasonable assumption that every line in
Table 1 can be executed in constant time, the runtime of the algorithm clearly is in O(n).
Because each iteration of loop 2 and loop 4 determines the right endpoint of a block, we
have n2 + n4 = m. Similarly, as each iteration of loop 3 fixes the left endpoint of a block,
we have n3 = m. To determine n1, we note that every node that is pushed to the auxiliary
stack in loop 1 is popped again in loop 3; therefore, n1 = n3 = m. Putting everything
together, we have n = 3m, and we conclude that the runtime of the algorithm is in O(m).
Note that this runtime is asymptotically optimal for the task we are considering.
5. Canonical Grammars
Our extraction technique produces a restricted type of lexicalized linear context-free
rewriting system that we will refer to as “canonical.” In this section we provide a
declarative characterization of these grammars, and show that every lexicalized LCFRS
is equivalent to a canonical one.
5.1 Definition of Canonical Grammars
We are interested in a syntactic characterization of the yield functions that can occur
in extracted grammars. We give such a characterization in terms of four properties,
stated in the following. We use the following terminology and notation. Consider a
yield function
f : k1
· · · km
→ k ,
f = (cid:5)α1, . . . , αk
(cid:6)
For variables x, y we write x
generated by the class of LCFRSs with fan-out k − 1.
It may be worth emphasizing that the one-to-one correspondence between blocks
and tuple components is a consequence of two characteristic properties of extracted
grammars (Properties 3 et 4), and does not hold for non-canonical lexicalized
LCFRSs.
Exemple 7
The following term induces a two-node dependency tree with block-degree 1, mais
contains yield functions with fan-out 2: (cid:5)a x1 x2
(cid:6)((cid:5)b, ε(cid:6)). Note that the yield functions
in this term violate both Property 3 and Property 4.
7.4 Coverage on Dependency Treebanks
In order to assess the consequences of different bounds on the fan-out, we now evaluate
the block-degree of dependency trees in real-world data. Specifically, we look into five
375
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
c
o
je
je
/
je
un
r
t
je
c
e
–
p
d
F
/
/
/
/
3
9
2
3
5
5
1
8
1
2
0
0
0
/
c
o
je
je
_
un
_
0
0
1
2
5
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Computational Linguistics
Volume 39, Nombre 2
dependency treebanks used in the 2006 CoNLL shared task on dependency parsing
(Buchholz and Marsi 2006): the Prague Arabic Dependency Treebank (Hajiˇc et al. 2004),
the Prague Dependency Treebank of Czech (B ¨ohmov´a et al. 2003), the Danish Depen-
dency Treebank (Kromann 2003), the Slovene Dependency Treebank (Dˇzeroski et al.
2006), and the Metu-Sabancı treebank of Turkish (Oflazer et al. 2003). The full data used
in the CoNLL shared task also included treebanks that were produced by conversion
of corpora originally annotated with structures other than dependencies, which is a
potential source of “noise” that one has to take into account when interpreting any
résultats. Ici, we consider only genuine dependency treebanks. More specifically, notre
statistics concern the training sections of the treebanks that were set off for the task. Pour
similar results on other data sets, see Kuhlmann and Nivre (2006), Havelka (2007), et
Maier and Lichte (2011).
Our results are given in Table 3. For each treebank, we list the number of rules
extracted from that treebank, as well as the number of corresponding dependency trees.
We then list the number of rules that we lose if we restrict ourselves to rules with fan-
out = 1, or rules with fan-out ≤ 2, as well as the number of dependency trees that we
lose because their construction trees contain at least one such rule. We count rule tokens,
meaning that two otherwise identical rules are counted twice if they were extracted
from different trees, or from different nodes in the same tree.
By putting the bound at fan-out 1, we lose between 0.74% (Arabic) et 1.75%
(Slovene) of the rules, et entre 11.16% (Arabic) et 23.15% (Czech) of the trees
in the treebanks. This loss is quite substantial. If we instead put the bound at fan-out
≤ 2, then rule loss is reduced by between 94.16% (Turkish) et 99.76% (Arabic), et
tree loss is reduced by between 94.31% (Turkish) et 99.39% (Arabic). This outcome
is surprising. Par exemple, Holan et al. (1998) argue that it is impossible to give a
theoretical upper bound for the block-degree of reasonable dependency analyses of
Czech. Here we find that, if we are ready to accept a loss of as little as 0.02% of the
rules extracted from the Prague Dependency Treebank, and up to 0.5% of the trees, alors
such an upper bound can be set at a block-degree as low as 2.
8. Well-Nestedness
The parsing of LCFRSs is exponential both in the fan-out and in the rank of the
grammars. In this section we study “well-nestedness,” another restriction on the non-
projectivity of dependency trees, and show how enforcing this constraint allows us to
restrict our attention to the class of LCFRSs with rank 2.
Tableau 3
Loss in coverage under the restriction to yield functions with fan-out = 1 and fan-out ≤ 2.
fan-out = 1
fan-out ≤ 2
rules
trees
rules
trees
rules
trees
Arabic
Czech
Danish
Slovene
Turkish
5,839
1,322,111
99,576
30,284
62,507
1,460
72,703
5,190
1,534
4,997
411
22,283
1,229
530
924
163
16,831
811
340
580
1
328
11
14
54
1
312
9
11
33
376
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
c
o
je
je
/
je
un
r
t
je
c
e
–
p
d
F
/
/
/
/
3
9
2
3
5
5
1
8
1
2
0
0
0
/
c
o
je
je
_
un
_
0
0
1
2
5
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Kuhlmann
Mildly Non-Projective Dependency Grammar
8.1 Definition of Well-Nestedness
Let D be a dependency tree, and let u and v be nodes of D. The descendants of u
∈ (cid:10)v(cid:11)
and v overlap, denoted by (cid:10)toi(cid:11) (cid:4) (cid:10)v(cid:11), if there exist nodes ul, ur
such that
∈ (cid:10)toi(cid:11) and vl, vr
ul < vl < ur < vr
or
vl < ul < vr < ur
A dependency tree D is called well-nested if for all pairs of nodes u, v of D
(cid:10)u(cid:11) (cid:4) (cid:10)v(cid:11)
implies that
(cid:10)u(cid:11) ∩ (cid:10)v(cid:11) (cid:14)= ∅
In other words, (cid:10)u(cid:11) and (cid:10)v(cid:11) may overlap only if u is an ancestor of v, or v is an ancestor
of u. If this implication does not hold, then D is called ill-nested.
Example 8
Figure 11 shows three non-projective dependency trees. Both D1 and D2 are well-nested:
D1 does not contain any overlapping sets of descendants at all. In D2, although (cid:10)1(cid:11)
and (cid:10)2(cid:11) overlap, it is also the case that (cid:10)1(cid:11) ⊇ (cid:10)2(cid:11). In contrast, D3 is ill-nested, as
(cid:10)2(cid:11) (cid:4) (cid:10)3(cid:11)
but
(cid:10)2(cid:11) ∩ (cid:10)3(cid:11) = ∅
The following lemma characterizes well-nestedness in terms of blocks.
Lemma 9
A dependency tree is ill-nested if and only if it contains two sibling nodes u, v and blocks
(cid:2)u1, (cid:2)u2 of u and (cid:2)v1, (cid:2)v2 of v such that
(cid:2)u1 < (cid:2)v1 < (cid:2)u2 < (cid:2)v2
(4)
Proof
Let D be a dependency tree. Suppose that D contains a configuration of the form (4).
This configuration witnesses that the sets (cid:10)u(cid:11) and (cid:10)v(cid:11) overlap. Because u, v are siblings,
(cid:10)u(cid:11) ∩ (cid:10)v(cid:11) = ∅. Therefore we conclude that D is ill-nested. Conversely now, suppose
that D is ill-nested. In this case, there exist two nodes u and v such that
(cid:10)u(cid:11) (cid:4) (cid:10)v(cid:11)
and
(cid:10)u(cid:11) ∩ (cid:10)v(cid:11) = ∅
(∗)
Figure 11
Well-nestedness and ill-nestedness.
377
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
3
9
2
3
5
5
1
8
1
2
0
0
0
/
c
o
l
i
_
a
_
0
0
1
2
5
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Computational Linguistics
Volume 39, Number 2
Here, we may assume u and v to be siblings: otherwise, we may replace either u or v
with its parent node, and property (∗) will continue to hold. Because (cid:10)u(cid:11) (cid:4) (cid:10)v(cid:11), there
exist descendants ul, ur
∈ (cid:10)u(cid:11) and vl, vr
∈ (cid:10)v(cid:11) such that
ul < vl < ur < vr
or
vl < ul < vr < ur
Without loss of generality, assume that we have the first case. The nodes ul and ur belong
to different blocks of u, say (cid:2)u1 and (cid:2)u2; and the nodes vl and vr belong to different blocks
of v, say (cid:2)v1 and (cid:2)v2. Then it is not hard to verify Equation (4). (cid:3)
Note that projective dependency trees are always well-nested; in these structures,
every node has exactly one block, so configuration (4) is impossible. For every k > 1,
there are both well-nested and ill-nested dependency trees with block-degree k.
8.2 Testing for Well-Nestedness
Based on Lemma 9, testing whether a dependency tree D is well-nested can be done in
time linear in the number of blocks in D using a simple subsequence test as follows. Nous
run the algorithm given in Table 1, maintaining a stack s[toi] for every node u. The first
time we make a down step to u, we push u to the stack for the parent of u; every other
temps, we pop the stack for the parent until we either find u as the topmost element, or the
stack becomes empty. In the latter case, we terminate the computation and report that D
is ill-nested; if the computation can be completed without any stack ever becoming
vide, we report that D is well-nested.
To show that the algorithm is sound, suppose that some stack s[p] becomes empty
when making a down step to some child v of p. Dans ce cas, the node v must have been
popped from s[p] when making a down step to some other child u of p, and that child
must have already been on the stack before the first down step to v. This witnesses the
existence of a configuration of the form in Equation (4).
8.3 Well-Nestedness in Extracted Grammars
Just like block-degree, well-nestedness can be characterized in terms of yield functions.
Recall the notation x