Data-Driven Parsing using Probabilistic

Data-Driven Parsing using Probabilistic
Linear Context-Free Rewriting Systems

Laura Kallmeyer
Heinrich-Heine-Universit¨at D ¨usseldorf

Wolfgang Maier
Heinrich-Heine-Universit¨at D ¨usseldorf

∗∗

This paper presents the first efficient implementation of a weighted deductive CYK parser for
Probabilistic Linear Context-Free Rewriting Systems (PLCFRSs). LCFRS, an extension of CFG,
can describe discontinuities in a straightforward way and is therefore a natural candidate to be
used for data-driven parsing. To speed up parsing, we use different context-summary estimates

parsing. We evaluate our parser with grammars
of parse items, some of them allowing for A
extracted from the German NeGra treebank. Our experiments show that data-driven LCFRS
parsing is feasible and yields output of competitive quality.

1. Introduction

Recently, the challenges that a rich morphology poses for data-driven parsing have
received growing interest. A direct effect of morphological richness is, for instance, data
sparseness on a lexical level (Candito and Seddah 2010). A rather indirect effect is that
morphological richness often relaxes word order constraints. The principal intuition is
that a rich morphology encodes information that otherwise has to be conveyed by a
particular word order. If, for instance, the case of a nominal complement is not provided
by morphology, it has to be provided by the position of the complement relative to other
complements in the sentence. Example (1) provides an example of case marking and free
word order in German. In turn, in free word order languages, word order can encode
information structure (Hoffman 1995).

(1)

a.

der
the

kleine
little

Jungenom
boy

schickt
sends

seiner
his

Schwesterdat
sister

den
the

Briefacc
letter

b. Other possible word orders:

(i)

(ii)

der kleine Jungenom schickt den Briefacc seiner Schwesterdat

seiner Schwesterdat schickt der kleine Jungenom den Briefacc

(iii) den Briefacc schickt der kleine Jungenom seiner Schwesterdat

∗ Institut f ¨ur Sprache und Information, Universit¨atsstr. 1, D-40225 D ¨usseldorf, Germany.

E-mail: kallmeyer@phil.uni-duesseldorf.de.

∗∗ Institut f ¨ur Sprache und Information, Universit¨atsstr. 1, D-40225 D ¨usseldorf, Germany.

E-mail: maierw@hhu.de.

Submission received: September 29, 2011; revised submission received: May 20, 2012; accepted for publication:
August 3, 2012.

© 2013 Association for Computational Linguistics

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

3
9
1
8
7
1
7
9
9
6
8
1
/
c
o

l
i

_
a
_
0
0
1
3
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

Computational Linguistics

Volume 39, Number 1

It is assumed that this relation between a rich morphology and free word order does
not hold in both directions. Although it is generally the case that languages with a rich
morphology exhibit a high degree of freedom in word order, languages with a free word
order do not necessarily have a rich morphology. Two examples for languages with a
very free word order are Turkish and Bulgarian. The former has a very rich and the
latter a sparse morphology. See M ¨uller (2002) for a survey of the linguistics literature on
this discussion.

With a rather free word order, constituents and single parts of them can be displaced
freely within the sentence. German, for instance, has a rich inflectional system and
allows for a free word order, as we have already seen in Example (1): Arguments can
be scrambled, and topicalizations and extrapositions underlie few restrictions. Conse-
quently, discontinuous constituents occur frequently. This is challenging for syntactic
description in general (Uszkoreit 1986; Becker, Joshi, and Rambow 1991; Bunt 1996;
M ¨uller 2004), and for treebank annotation in particular (Skut et al. 1997).

In this paper, we address the problem of data-driven parsing of discontinuous constit-
uents on the basis of German. In this section, we inspect the type of data we have to deal
with, and we describe the way such data are annotated in treebanks. We briefly discuss
different parsing strategies for the data in question and motivate our own approach.

1.1 Discontinuous Constituents

Consider the sentences in Example (2) as examples for discontinuous constituents
(taken from the German NeGra [Skut et al. 1997] and TIGER [Brants et al. 2002] tree-
banks). Example (2a) shows several instances of discontinuous VPs and Example (2b)
shows a discontinuous NP. The relevant constituent is printed in italics.

(2)

a.

Fronting:

(i) Dar ¨uber
muss
Thereof
must
“One must think of that”

nachgedacht
thought

werden.
be

(NeGra)

(ii) Ohne internationalen Schaden
Without international damage
distanzieren,
distance
“Bonn could not distance itself from the monument without international
damage.”

von dem Denkmal
from the monument

k ¨onne
could

Bonn
Bonn

nicht
not

… (TIGER)

sich
itself

durch die Regelung
through the regulation

(iii) Auch
w ¨urden
Also
would
entstehen”. (TIGER)
emerge”
“Apart from that, the regulation would only constantly produce new old cases.”

“st¨andig
“constantly

Altf¨alle
old cases

neue
new

nur
only

b. Extraposed relative clauses:

Gel¨ande
terrain

auf
on
k ¨onne,
could,

deren
their
der …
which . . .

. . . ob
. . . whether
werden
get
“. . . whether one could build on their premises the type of parking facility,
which . . . ”

Typ von Abstellanlage
type of parking facility

der
the
(NeGra)

gebaut
built

(i)

88

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

3
9
1
8
7
1
7
9
9
6
8
1
/
c
o

l
i

_
a
_
0
0
1
3
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

Kallmeyer and Maier

PLCFRS Parsing

Examples of other such languages are Bulgarian and Korean. Both show discontin-
uous constituents as well. Example (3a) is a Bulgarian example of a PP extracted out of
an NP, taken from the BulTreebank (Osenova and Simov 2004), and Example (3b) is an
example of fronting in Korean, taken from the Penn Korean Treebank (Han, Han, and
Ko 2001).

(3)

a. Na kyshtata

toi
he

popravi
repaired

pokriva.
roof.

Of house-DET
“It is the roof of the house he repairs.”

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

3
9
1
8
7
1
7
9
9
6
8
1
/
c
o

l
i

_
a
_
0
0
1
3
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

b. Gwon.han- ˘ul

ka.ji.go
Authority-OBJ
has
“Who has no authority?”

nu.ga
who

iss.ji?
not?

Discontinuous constituents are by no means limited to languages with freedom
in word order. They also occur in languages with a rather fixed word order such
as English, resulting from, for instance, long-distance movements. Examples (4a) and
(4b) are examples from the Penn Treebank for long extractions resulting in discontin-
uous S categories and for discontinuous NPs arising from extraposed relative clauses,
respectively (Marcus et al. 1994).

(4)

a.

Long Extraction in English:

(i)

Those chains include Bloomingdale’s, which Campeau recently said it
will sell.

(ii) What should I do.

b. Extraposed nominal modifiers (relative clauses and PPs) in English:

(i)

They sow a row of male-fertile plants nearby, which then pollinate the male-
sterile plants.

(ii) Prices fell marginally for fuel and electricity.

1.2 Treebank Annotation and Data-Driven Parsing

Most constituency treebanks rely on an annotation backbone based on Context-Free
Grammar (CFG). Discontinuities cannot be modeled with CFG, because they require a
larger domain of locality than the one offered by CFG. Therefore, the annotation back-
bone based on CFG is generally augmented with a separate mechanism that accounts
for the non-local dependencies. In the Penn Treebank (PTB), for example, trace nodes
and co-indexation markers are used in order to establish additional implicit edges in the
tree beyond the overt phrase structure. In T ¨uBa-D/Z (Telljohann et al. 2012), a German
Treebank, non-local dependencies are expressed via an annotation of topological fields
(H ¨ohle 1986) and special edge labels. In contrast, some other treebanks, among them
NeGra and TIGER, give up the annotation backbone based on CFG and allow annota-
tion with crossing branches (Skut et al. 1997). In such an annotation, non-local depen-
dencies can be expressed directly by grouping all dependent elements under a single
node. Note that both crossing branches and traces annotate long-distance dependencies
in a linguistically meaningful way. A difference is, however, that crossing branches
are less theory-dependent because they do not make any assumptions about the base
positions of “moved” elements.

Examples for the different approaches of annotating discontinuities are given in
Figures 1 and 2. Figure 1 shows the NeGra annotation of Example (2a-i) (left), and an

89

Computational Linguistics

Volume 39, Number 1

Figure 1
A discontinuous constituent. Original NeGra annotation (left) and a T ¨uBa-D/Z-style annotation
(right).

SBARQ

SBARQ

SQ

SBJ

VP

SQ

VP

SBJ

WHNP

NP

*T*

NP

WHNP

NP

What
WP

should
MD

I
PRP

do
VB

*T*
-NONE-

?
.

What
WP

should
MD

I
PRP

do
VB

?
.

Figure 2
A discontinuous wh-movement. Original PTB annotation (left) and NeGra-style annotation
(right).

annotation of the same sentence in the style of the T ¨uBa-D/Z treebank (right). Figure 2
shows the PTB annotation of Example (4a-ii) (on the left, note that the directed edge
from the trace to the WHNP element visualizes the co-indexation) together with a
NeGra-style annotation of the same sentence (right).

In the past, data-driven parsing has largely been dominated by Probabilistic
Context-Free Grammar (PCFG). In order to extract a PCFG from a treebank, the trees
need to be interpretable as CFG derivations. Consequently, most work has excluded
non-local dependencies; either (in PTB-like treebanks) by discarding labeling conven-
tions such as the co-indexation of the trace nodes in the PTB, or (in NeGra/TIGER-like
treebanks) by applying tree transformations, which resolve the crossing branches (e.g.,
K ¨ubler 2005; Boyd 2007). Especially for the latter treebanks, such a transformation is
problematic, because it generally is non-reversible and implies information loss.

Discontinuities are no minor phenomenon: Approximately 25% of all sentences
in NeGra and TIGER have crossing branches (Maier and Lichte 2011). In the Penn
Treebank, this holds for approximately 20% of all sentences (Evang and Kallmeyer
2011). This shows that it is important to properly treat such structures.

1.3 Extending the Domain of Locality

In the literature, different methods have been explored that allow for the use of non-
local information in data-driven parsing. We distinguish two classes of approaches.
The first class consists of approaches that aim at using formalisms which produce
trees without crossing branches but provide a larger domain of locality than CFG—
for instance, through complex labels (Hockenmaier 2003) or through the derivation

90

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

3
9
1
8
7
1
7
9
9
6
8
1
/
c
o

l
i

_
a
_
0
0
1
3
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

Kallmeyer and Maier

PLCFRS Parsing

CFG:

LCFRS:

A

γ1

γ2

γ3

A

γ

Figure 3
Different domains of locality.

mechanism (Chiang 2003). The second class, to which we contribute in this paper,
consists of approaches that aim at producing trees which contain non-local information.
Some methods realize the reconstruction of non-local information in a post- or pre-
processing step to PCFG parsing (Johnson 2002; Dienes 2003; Levy and Manning 2004;
Cai, Chiang, and Goldberg 2011). Other work uses formalisms that accommodate the
direct encoding of non-local information (Plaehn 2004; Levy 2005). We pursue the latter
approach.

Our work is motivated by the following recent developments. Linear Context-Free
Rewriting Systems (LCFRSs) (Vijay-Shanker, Weir, and Joshi 1987) have been estab-
lished as a candidate for modeling both discontinuous constituents and non-projective
dependency trees as they occur in treebanks (Maier and Søgaard 2008; Kuhlmann and
Satta 2009; Maier and Lichte 2011). LCFRSs are a natural extension of CFGs where
the non-terminals can span tuples of possibly non-adjacent strings (see Figure 3). Be-
cause LCFRSs allow for binarization and CYK chart parsing in a way similar to CFGs,
PCFG techniques, such as best-first parsing (Caraballo and Charniak 1998), weighted

parsing (Klein and Manning 2003a) can
deductive parsing (Nederhof 2003), and A
be transferred to LCFRS. Finally, as mentioned before, languages such as German
have recently attracted the interest of the parsing community (K ¨ubler and Penn 2008;
Seddah, K ¨ubler, and Tsarfaty 2010).

We bring together these developments by presenting a parser for Probabilistic
LCFRS (PLCFRS), continuing the promising work of Levy (2005). Our parser pro-
duces trees with crossing branches and thereby accounts for syntactic long-distance
dependencies while not making any additional assumptions concerning the position
of hypothetical traces. We have implemented a CYK parser and we present several
methods for context summary estimation of parse items. The estimates either act as
parsing. A test on
figures-of-merit in a best-first parsing context or as estimates for A
a real-world-sized data set shows that our parser achieves competitive results. To our
knowledge, our parser is the first for the entire class of PLCFRS that has successfully
been used for data-driven parsing.1

The paper is structured as follows. Section 2 introduces probabilistic LCFRS. Sec-
tions 3 and 4 present the binarization algorithm, the parser, and the outside estimates
which we use to speed up parsing. In Section 5 we explain how to extract an LCFRS from
a treebank and we present grammar refinement methods for these specific treebank
grammars. Finally, Section 6 presents evaluation results and Section 7 compares our
work to other approaches.

1 Parts of the results presented in this paper have been presented earlier. More precisely, in Kallmeyer and
Maier (2010), we presented the general architecture of the parser and all outside estimates except the LN
estimate from Section 4.4 which is presented in Maier, Kaeshammer, and Kallmeyer (2012). In Maier and
Kallmeyer (2010) we have presented experiments with the relative clause split from Section 3.2. Finally,
Maier (2010) contains the evaluation of the baseline (together with an evaluation using other metrics).

91

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

3
9
1
8
7
1
7
9
9
6
8
1
/
c
o

l
i

_
a
_
0
0
1
3
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

Computational Linguistics

Volume 39, Number 1

2. Probabilistic Linear Context-Free Rewriting Systems

2.1 Definition of PLCFRS

LCFRS (Vijay-Shanker, Weir, and Joshi 1987) is an extension of CFG in which a non-
terminal can span not only a single string but a tuple of strings of size k ≥ 1. k is thereby
called its fan-out. We will notate LCFRS with the syntax of Simple Range Concate-
nation Grammars (SRCG) (Boullier 1998b), a formalism that is equivalent to LCFRS.
A third formalism that is equivalent to LCFRS is Multiple Context-Free Grammar
(MCFG) (Seki et al. 1991).

Definition 1 (LCFRS)

A Linear Context-Free Rewriting System (LCFRS) is a tuple (cid:4)N, T, V, P, S(cid:5) where

a)

b)

c)

d)

N is a finite set of non-terminals with a function dim: N → N that
determines the fan-out of each A ∈ N;

T and V are disjoint finite sets of terminals and variables;
S ∈ N is the start symbol with dim(S) = 1;

P is a finite set of rules

A(α1, . . . , αdim(A)) → A1(X(1)

1 , . . . , X(1)

dim(A1 )) · · · Am(X(m)

1

, . . . , X(m)

dim(Am ))

for m ≥ 0 where A, A1, . . . , Am
for 1 ≤ i ≤ dim(A). For all r ∈ P, it holds that every
and αi
variable X occurring in r occurs exactly once in the left-hand side and
exactly once in the right-hand side of r.

∈ V for 1 ≤ i ≤ m, 1 ≤ j ≤ dim(Ai)

∈ N, X(i)
j

∈ (T ∪ V)

A rewriting rule describes how the yield of the left-hand side non-terminal can be
computed from the yields of the right-hand side non-terminals. The rules A(ab, cd) → ε
and A(aXb, cYd) → A(X, Y) from Figure 4 for instance specify that (1) (cid:4)ab, cd(cid:5) is in the
yield of A and (2) one can compute a new tuple in the yield of A from an already existing
one by wrapping a and b around the first component and c and d around the second.
A CFG rule A → BC would be written A(XY) → B(X)C(Y) as an LCFRS rule.

Definition 2 (Yield, language)
Let G = (cid:4)N, T, V, P, S(cid:5) be an LCFRS.

1.

For every A ∈ N, we define the yield of A, yield(A) as follows:

a)

For every rule A((cid:1)α) → ε, (cid:1)α ∈ yield(A);

A(ab, cd) → ε

A(aXb, cYd) → A(X, Y)
S(XY) → A(X, Y)

Figure 4
Sample LCFRS for {anbncndn | n ≥ 1}.

92

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

3
9
1
8
7
1
7
9
9
6
8
1
/
c
o

l
i

_
a
_
0
0
1
3
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

Kallmeyer and Maier

PLCFRS Parsing

b)

(i)

(ii)

(iii)

For every rule A(α1, . . . , αdim(A)) → A1(X(1)

, . . . , X(m)

Am(X(m)
(cid:4) f (α1), . . . , f (αdim(A))(cid:5) ∈ yield(A) where f is defined as follows:

dim(Am )) and for all (cid:1)τi

1

dim(A1 )) · · ·
1 , . . . , X(1)
∈ yield(Ai) (1 ≤ i ≤ m):

f (t) = t for all t ∈ T,
f (X(i)

j ) = (cid:1)τi(j) for all 1 ≤ i ≤ m, 1 ≤ j ≤ dim(Ai) and

f (xy) = f (x)f (y) for all x, y ∈ (T ∪ V)+.

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

3
9
1
8
7
1
7
9
9
6
8
1
/
c
o

l
i

_
a
_
0
0
1
3
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

We call f the composition function of the rule.

c)

Nothing else is in yield(A).

2.

The language of G is then L(G) = {w | (cid:4)w(cid:5) ∈ yield(S)}.

As an example, consider again the LCFRS in Figure 4. The last rule tells us that,
given a pair in the yield of A, we can obtain an element in the yield of S by concate-
nating the two components. Consequently, the language generated by this grammar is
{anbncndn | n ≥ 1}.

The terms of grammar fan-out and rank and the properties of monotonicity and
ε-freeness will be referred to later and are therefore introduced in the following defini-
tion. They are taken from the LCFRS/MCFG terminology; the SRCG term for fan-out is
arity and the property of being monotone is called ordered in the context of SRCG.

Definition 3
Let G = (cid:4)N, T, V, P, S(cid:5) be an LCFRS.

1.

2.

3.

4.

The fan-out of G is the maximal fan-out of all non-terminals in G.
Furthermore, the right-hand side length of a rewriting rule r ∈ P is called
the rank of r and the maximal rank of all rules in P is called the rank of G.
G is monotone if for every r ∈ P and every right-hand side non-terminal A
in r and each pair X1, X2 of arguments of A in the right-hand side of r, X1
precedes X2 in the right-hand side iff X1 precedes X2 in the left-hand side.
A rule r ∈ P is called an ε-rule if one of the left-hand side components of r
is ε.
G is ε-free if it either contains no ε-rules or there is exactly one ε-rule
S(ε) → ε and S does not appear in any of the right-hand sides of the rules
in the grammar.

For every LCFRS there exists an equivalent LCFRS that is ε-free (Seki et al. 1991;

Boullier 1998a) and monotone (Michaelis 2001; Kracht 2003; Kallmeyer 2010).

The definition of a probabilistic LCFRS is a straightforward extension of the defini-

tion of PCFG and thus it follows (Levy 2005; Kato, Seki, and Kasami 2006) that:

Definition 4 (PLCFRS)
A probabilistic LCFRS (PLCFRS) is a tuple (cid:4)N, T, V, P, S, p(cid:5) such that (cid:4)N, T, V, P, S(cid:5) is an
LCFRS and p : P → [0..1] a function such that for all A ∈ N:

ΣA((cid:1)x)→(cid:1)Φ∈Pp(A((cid:1)x) → (cid:1)Φ) = 1

93

Computational Linguistics

Volume 39, Number 1

PLCFRS with non-terminals {S, A, B}, terminals {a} and start symbol S:
0.2 : S(X) → A(X)
0.7 : A(aX) → A(X)
0.8 : B(aX, aY) → B(X, Y)
Figure 5
Sample PLCFRS.

0.8 : S(XY) → B(X, Y)
0.3 : A(a) → ε
0.2 : B(a, a) → ε

As an example, consider the PLCFRS in Figure 5. This grammar simply generates
a+. Words with an even number of as and nested dependencies are more probable
than words with a right-linear dependency structure. For instance, the word aa receives
the two analyses in Figure 6. The analysis (a) displaying nested dependencies has
probability 0.16 and (b) (right-linear dependencies) has probability 0.042.

3. Parsing PLCFRS

3.1 Binarization

Similarly to the transformation of a CFG into Chomsky normal form, an LCFRS can be
binarized, resulting in an LCFRS of rank 2. As in the CFG case, in the transformation,
we introduce a non-terminal for each right-hand side longer than 2 and split the rule
into two rules, using this new intermediate non-terminal. This is repeated until all
right-hand sides are of length 2. The transformation algorithm is inspired by G ´omez-
Rodr´ıguez et al. (2009) and it is also specified in Kallmeyer (2010).

3.1.1 General Binarization. In order to give the algorithm for this transformation, we
need the notion of a reduction of a vector (cid:1)α ∈ [(T ∪ V)
]i by a vector (cid:1)x ∈ Vj where all
variables in (cid:1)x occur in (cid:1)α. A reduction is, roughly, obtained by keeping all variables in (cid:1)α
that are not in (cid:1)x. This is defined as follows:

Definition 5 (Reduction)
Let (cid:4)N, T, V, P, S(cid:5) be an LCFRS, (cid:1)α ∈ [(T ∪ V)

]i and (cid:1)x ∈ Vj for some i, j ∈ IN.

Let w = (cid:1)α1$ . . . $(cid:1)αi be the string obtained from concatenating the components of (cid:1)α,

separated by a new symbol $ /∈ (V ∪ T). (cid:4) Let w be the image of w under a homomorphism h defined as follows: h(a) = $ for

all a ∈ T, h(X) = $ for all X ∈ {(cid:1)x1, . . . (cid:1)xj ∈ V+ such that w Let y1, . . . ym } and h(y) = y in all other cases. ∗ ∗ (cid:4) ∈ $
y1$+y2$+ . . . $+ym$

. Then the vector

(cid:4)y1, . . . ym

(cid:5) is the reduction of (cid:1)α by (cid:1)x.

For instance, (cid:4)aX1, X2, bX3
(cid:5) yields (cid:4)X1, X3

duced with (cid:4)X2

(cid:5) reduced with (cid:4)X2
(cid:5) as well.

(cid:5) yields (cid:4)X1, X3

(cid:5) and (cid:4)aX1X2bX3

(cid:5) re-

S

B

a

a

(a)

S

A

(b)

a

A

a

Figure 6
The two derivations of aa.

94

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

3
9
1
8
7
1
7
9
9
6
8
1
/
c
o

l
i

_
a
_
0
0
1
3
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

Kallmeyer and Maier

PLCFRS Parsing

for all rules r = A((cid:1)α) → A0( (cid:1)α0) . . . Am( (cid:1)αm) in P with m > 1 do

remove r from P
R := ∅
pick new non-terminals C1, . . . , Cm−1
add the rule A((cid:1)α) → A0( (cid:1)α0)C1( (cid:1)γ1) to R where (cid:1)γ1 is obtained by reducing (cid:1)α with (cid:1)α0
for all i, 1 ≤ i ≤ m − 2 do

add the rule Ci( (cid:1)γi) → Ai( (cid:1)αi)Ci+1( (cid:1)γi+1) to R where (cid:1)γi+1 is obtained by reducing (cid:1)γi with (cid:1)αi

end for
add the rule Cm−1( (cid:1)γm−2) → Am−1( (cid:1)αm−1)Am( (cid:1)αm) to R
for every rule r

(cid:4) ∈ R do

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

3
9
1
8
7
1
7
9
9
6
8
1
/
c
o

l
i

_
a
_
0
0
1
3
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

replace right-hand side arguments of length > 1 with new variables (in both sides) and
add the result to P

end for

end for

Figure 7
Algorithm for binarizing an LCFRS.

The binarization algorithm is given in Figure 7. As already mentioned, it proceeds
like the CFG binarization algorithm in the sense that for right-hand sides longer than
2, we introduce a new non-terminal that covers the right-hand side without the first
element. Figure 8 shows an example. In this example, there is only one rule with a right-
hand side longer than 2. In a first step, we introduce the new non-terminals and rules
that binarize the right-hand side. This leads to the set R. In a second step, before adding
the rules from R to the grammar, whenever a right-hand side argument contains several
variables, these are collapsed into a single new variable.

The equivalence of the original LCFRS and the binarized grammar is rather straight-

forward. Note, however, that the fan-out of the LCFRS can increase.

The binarization depicted in Figure 7 is deterministic in the sense that for every rule
that needs to be binarized, we choose unique new non-terminals. Later, in Section 5.3.1,
we will introduce additional factorization into the grammar rules that reduces the set
of new non-terminals.

3.1.2 Minimizing Fan-Out and Number of Variables. In LCFRS, in contrast to CFG, the order
of the right-hand side elements of a rule does not matter for the result of a derivation.

Original LCFRS:

S(XYZUVW) → A(X, U)B(Y, V)C(Z, W)
A(aX, aY) → A(X, Y)
B(bX, bY) → B(X, Y)
C(cX, cY) → C(X, Y)

A(a, a) → ε
B(b, b) → ε
C(c, c) → ε

Rule with right-hand side of length > 2: S(XYZUVW) → A(X, U)B(Y, V)C(Z, W)
For this rule, we obtain
R = {S(XYZUVW) → A(X, U)C1(YZ, VW), C1(YZ, VW) → B(Y, V)C(Z, W)}

Equivalent binarized LCFRS:
S(XPUQ) → A(X, U)C1(P, Q)
C1(YZ, VW) → B(Y, V)C(Z, W)
A(aX, aY) → A(X, Y)
B(bX, bY) → B(X, Y)
C(cX, cY) → C(X, Y)

A(a, a) → ε
B(b, b) → ε
C(c, c) → ε

Figure 8
Sample binarization of an LCFRS.

95

Computational Linguistics

Volume 39, Number 1

Therefore, we can reorder the right-hand side of a rule before binarizing it. In the
following, we present a binarization order that yields a minimal fan-out and a minimal
variable number per production and binarization step. The algorithm is inspired by
G ´omez-Rodr´ıguez et al. (2009) and has first been published in this version in Kallmeyer
(2010). We assume that we are only considering partitions of right-hand sides where one
of the sets contains only a single non-terminal.

For a given rule c = A0( (cid:1)x0) → A1( (cid:1)x1) . . . Ak( (cid:1)xk), we define the characteristic string
s(c, Ai) of the Ai-reduction of c as follows: Concatenate the elements of (cid:1)x0, separated with
new additional symbols $ while replacing every component from (cid:1)xi with a $. We then
define the arity of the characteristic string, dim(s(c, Ai)), as the number of maximal sub-
strings x ∈ V+ in s(Ai). Take, for example, a rule c = VP(X, YZU) → VP(X, Z)V(Y)N(U).
Then s(c, VP) =$$Y$U, s(c, V) = X$$ZU.

Figure 9 shows how in a first step, for a given rule r with right-hand side length > 2,
we determine the optimal candidate for binarization based on the characteristic string
s(r, B) of some right-hand side non-terminal B and on the fan-out of B: On all right-
hand side predicates B we check for the maximal fan-out (given by dim(s(r, B))) and the
number of variables (dim(s(r, B)) + dim(B)) we would obtain when binarizing with this
predicate. This check provides the optimal candidate. In a second step we then perform
the same binarization as before, except that we use the optimal candidate now instead
of the first element of the right-hand side.

3.2 The Parser

We can assume without loss of generality that our grammars are ε-free and monotone
(the treebank grammars with which we are concerned all have these properties) and that
they contain only binary and unary rules. Furthermore, we assume POS tagging to be
done before parsing. POS tags are non-terminals of fan-out 1. Finally, according to our
grammar extraction algorithm (see Section 5.1), a separation between two components
always means that there is actually a non-empty gap in between them. Consequently,
two different components in a right-hand side can never be adjacent in the same
component of the left-hand side. The rules are then either of the form A(a) → ε with A a
POS tag and a ∈ T or of the form A((cid:1)x) → B((cid:1)x) or A((cid:1)α) → B((cid:1)x)C((cid:1)y) where (cid:1)α ∈ (V+)dim(A),
(cid:1)x ∈ Vdim(B), (cid:1)y ∈ Vdim(C), that is, only the rules for POS tags contain terminals in their left-
hand sides.

cand = 0
fan-out = number of variables in r
vars = number of variables in r
for all i = 0 to m do

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

3
9
1
8
7
1
7
9
9
6
8
1
/
c
o

l
i

_
a
_
0
0
1
3
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

cand-fan-out = dim(s(r, Ai));
if cand-fan-out < fan-out and dim(Ai) < fan-out then fan-out = max({cand-fan-out, dim(Ai)}); vars = cand-fan-out + dim(Ai); cand = i; fan-out = max({cand-fan-out, dim(Ai)}); vars = cand-fan-out + dim(Ai); cand = i else if cand-fan-out ≤ fan-out, dim(Ai) ≤ fan-out and cand-fan-out + dim(Ai) < vars then end if end for Figure 9 Optimized version of the binarization algorithm, determining binarization order. 96 Kallmeyer and Maier PLCFRS Parsing During parsing we have to link the terminals and variables in our LCFRS rules to portions of the input string. For this purpose we need the notions of ranges, range vectors, and rule instantiations. A range is a pair of indices that characterizes the span of a component within the input. A range vector characterizes a tuple in the yield of a non-terminal. A rule instantiation specifies the computation of an element from the left- hand side yield from elements in the yields of the right-hand side non-terminals based on the corresponding range vectors. Definition 6 (Range) Let w ∈ T ∗ with w = w1 . . . wn where wi ∈ T for 1 ≤ i ≤ n. Pos(w) := {0, . . . , n}. 1. 2. We call a pair (cid:4)l, r(cid:5) ∈ Pos(w) × Pos(w) with l ≤ r a range in w. Its yield 3. 4. (cid:4)l, r(cid:5)(w) is the substring wl+1 . . . wr. For two ranges ρ1 = (cid:4)l1, r1 of ρ1 and ρ2 is ρ1 A (cid:1)ρ ∈ (Pos(w) × Pos(w))k is a k-dimensional range vector for w iff (cid:5) is a range in w for 1 ≤ i ≤ k. (cid:1)ρ = (cid:4)(cid:4)l1, r1 · ρ2 is undefined. (cid:5); otherwise ρ1 (cid:5)(cid:5) where (cid:4)li, ri (cid:5), ρ2 = (cid:4)l2, r2 · ρ2 = (cid:4)l1, r2 (cid:5), . . . , (cid:4)lk, rk (cid:5), if r1 = l2, then the concatenation We now define instantiations of rules with respect to a given input string. This definition follows the definition of clause instantiations from Boullier (2000). An in- stantiated rule is a rule in which variables are consistently replaced by ranges. Because we need this definition only for parsing our specific grammars, we restrict ourselves to ε-free rules containing only variables. Definition 7 (Rule instantiation) Let G = (N, T, V, P, S) be an ε-free monotone LCFRS. For a given rule r = A((cid:1)α) → A1( (cid:1)x1) · · · Am( (cid:1)xm) ∈ P (0 < m) that does not contain any terminals, 1. 2. an instantiation with respect to a string w = t1 . . . tn consists of a function f : V → {(cid:4)i, j(cid:5) | 1 ≤ i ≤ j ≤ |w|} such that for all x, y adjacent in one of the elements of (cid:1)α, f (x) · f (y) must be defined; we then define f (xy) = f (x) · f (y), if f is an instantiation of r, then A( f ((cid:1)α)) → A1( f ( (cid:1)x1)) · · · Am( f ( (cid:1)xm)) is an instantiated rule where f ((cid:4)x1, . . . , xk (cid:5)) = (cid:4) f (x1), . . . , f (xk)(cid:5). We use a probabilistic version of the CYK parser from Seki et al. (1991). The algo- rithm is formulated using the framework of parsing as deduction (Pereira and Warren 1983; Shieber, Schabes, and Pereira 1995; Sikkel 1997), extended with weights (Nederhof 2003). In this framework, a set of weighted items representing partial parsing results is characterized via a set of deduction rules, and certain items (the goal items) represent successful parses. During parsing, we have to match components in the rules we use with portions of the input string. For a given input w, our items have the form [A, (cid:1)ρ] where A ∈ N and (cid:1)ρ is a range vector that characterizes the span of A. Each item has a weight in that encodes the Viterbi inside score of its best parse tree. More precisely, we use the log probability log(p) where p is the probability. The first rule (scan) tells us that the POS tags that we receive as inputs are given. Consequently, they are axioms; their probability is 1 and their weight therefore 0. The 97 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 9 1 8 7 1 7 9 9 6 8 1 / c o l i _ a _ 0 0 1 3 6 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 39, Number 1 Scan: 0 : [A, (cid:4)(cid:4)i, i + 1(cid:5)(cid:5)] A is the POS tag of wi+1 Unary: in : [B, (cid:1)ρ] in + log(p) : [A, (cid:1)ρ] p : A((cid:1)α) → B((cid:1)α) ∈ P inB : [B, (cid:1)ρB], inC : [C, (cid:1)ρC] inB + inC + log(p) : [A, (cid:1)ρA] Binary: Goal: [S, (cid:4)(cid:4)0, n(cid:5)(cid:5)] Figure 10 Weighted CYK deduction system. p : A( (cid:1)ρA ) → B( (cid:1)ρB )C( (cid:1)ρC ) is an instantiated rule second rule, unary, is applied whenever we have found the right-hand side of an instantiation of a unary rule. In our grammar, terminals only occur in rules with POS tags and the grammar is ordered and ε-free. Therefore, the components of the yield of the right-hand side non-terminal and of the left-hand side terminals are the same. The rule binary applies an instantiated rule of rank 2. If we already have the two elements of the right-hand side, we can infer the left-hand side element. In both cases, unary and binary, the probability p of the new rule is multiplied with the probabilities of the antecedent items (which amounts to summing up the antecedent weights and log(p)). We perform weighted deductive parsing, based on the deduction system from Figure 10. We use a chart C and an agenda A, both initially empty, and we proceed as in Figure 11. Because for all our deduction rules, the weight functions f that compute the weight of a consequent item from the weights of the antecedent items are monotone non-increasing in each variable, the algorithm will always find the best parse without the need of exhaustive parsing. All new items that we deduce involve at least one of the agenda items as an antecedent item. Therefore, whenever an item is the best in the agenda, we can be sure that we will never find an item with a better (i.e., higher) weight. Consequently, we can safely store this item in the chart and, if it is a goal item, we have found the best parse. As an example consider the development of the agenda and the chart in Figure 12 when parsing aa with the PLCFRS from Figure 5, transformed into a PLCFRS with pre-terminals and binarization (i.e., with a POS tag Ta and a new binarization non- (cid:4) terminal B ). The new PLCFRS is given in Figure 13. In this example, we find a first analysis for the input (a goal item) when combining an A with span (cid:4)(cid:4)0, 2(cid:5)(cid:5) into an S. This S has however a rather low probability and is therefore not on top of the agenda. Later, when finding the better analysis, the weight add SCAN results to A while A (cid:19)= ∅ remove best item x : I from A add x : I to C if I goal item then stop and output true else for all y : I (cid:4) deduced from x : I and items in C: (cid:4) ∈ C ∪ A (cid:4) if there is no z with z : I then add y : I else if z : I (cid:4) ∈ A for some z (cid:4) then update weight of I to A in A to max(y, z) Figure 11 Weighted deductive parsing. 98 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 9 1 8 7 1 7 9 9 6 8 1 / c o l i _ a _ 0 0 1 3 6 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Kallmeyer and Maier PLCFRS Parsing chart 0 : [Ta, (cid:4)0, 1(cid:5)] 0 : [Ta, (cid:4)0, 1(cid:5)], 0 : [Ta, (cid:4)1, 2(cid:5)] 0 : [Ta, (cid:4)0, 1(cid:5)], 0 : [Ta, (cid:4)1, 2(cid:5)], −0.5 : [A, (cid:4)0, 1(cid:5)] 0 : [Ta, (cid:4)0, 1(cid:5)], 0 : [Ta, (cid:4)1, 2(cid:5)], 0.5 : [A, (cid:4)0, 1(cid:5)], −0.5 : [A, (cid:4)1, 2(cid:5)] 0 : [Ta, (cid:4)0, 1(cid:5)], 0 : [Ta, (cid:4)1, 2(cid:5)], −0.5 : [A, (cid:4)0, 1(cid:5)], −0.5 : [A, (cid:4)1, 2(cid:5)], −0.65 : [A, (cid:4)0, 2(cid:5)] 0 : [Ta, (cid:4)0, 1(cid:5)], 0 : [Ta, (cid:4)1, 2(cid:5)], −0.5 : [A, (cid:4)0, 1(cid:5)], −0.5 : [A, (cid:4)1, 2(cid:5)], −0.65 : [A, (cid:4)0, 2(cid:5)], −0.7 : [B, (cid:4)0, 1(cid:5), (cid:4)1, 2(cid:5)] agenda 0 : [Ta, (cid:4)0, 1(cid:5)], 0 : [Ta, (cid:4)1, 2(cid:5)] 0 : [Ta, (cid:4)1, 2(cid:5)], −0.5 : [A, (cid:4)0, 1(cid:5)] −0.5 : [A, (cid:4)0, 1(cid:5)], −0.5 : [A, (cid:4)1, 2(cid:5)], −0.7 : [B, (cid:4)0, 1(cid:5), (cid:4)1, 2(cid:5)] −0.5 : [A, (cid:4)1, 2(cid:5)], −0.7 : [B, (cid:4)0, 1(cid:5), (cid:4)1, 2(cid:5)], −1.2 : [S, (cid:4)0, 1(cid:5)] −0.65 : [A, (cid:4)0, 2(cid:5)], −0.7 : [B, (cid:4)0, 1(cid:5), (cid:4)1, 2(cid:5)], −1.2 : [S, (cid:4)0, 1(cid:5)], −1.2 : [S, (cid:4)1, 2(cid:5)] −0.7 : [B, (cid:4)0, 1(cid:5), (cid:4)1, 2(cid:5)], −1.2 : [S, (cid:4)0, 1(cid:5)], −1.2 : [S, (cid:4)1, 2(cid:5)], −1.35 : [S, (cid:4)0, 2(cid:5)] −0.8 : [S, (cid:4)0, 2(cid:5)], −1.2 : [S, (cid:4)0, 1(cid:5)], −1.2 : [S, (cid:4)1, 2(cid:5)] Figure 12 Parsing of aa with the grammar from Figure 5. PLCFRS with non-terminals {S, A, B, B }, terminals {a} and start symbol S: (cid:4) , Ta 0.2 : S(X) → A(X) 0.7 : A(XY) → Ta(X)A(Y) (cid:4) 0.8 : B(ZX, Y) → Ta(Z)B 0.2 : B(X, Y) → Ta(X)Ta(Y) Figure 13 Sample binarized PLCFRS (with pre-terminal Ta). 0.8 : S(XY) → B(X, Y) 0.3 : A(X) → Ta(X) 1 : B 1 : Ta(a) → ε (X, Y) (cid:4) (X, UY) → B(X, Y)Ta(U) of the S item in the agenda is updated and then the goal item is the top agenda item and therefore parsing has been successful. Note that, so far, we have only presented the recognizer. In order to extend it to a parser, we do the following: Whenever we generate a new item, we store it not only with its weight but also with backpointers to its antecedent items. Furthermore, whenever we update the weight of an item in the agenda, we also update the backpointers. In order to read off the best parse tree, we have to start from the goal item and follow the backpointers. 4. Outside Estimates So far, the weights we use give us only the Viterbi inside score of an item. In order to speed up parsing, we add the estimate of the costs for completing the item into a goal item to its weight—that is, to the weight of each item in the agenda, we add an estimate of its Viterbi outside score2 (i.e., the logarithm of the estimate). We use context summary estimates. A context summary is an equivalence class of items for which we can compute the actual outside scores. Those scores are then used as estimates. The challenge is to choose the estimate general enough to be efficiently computable and specific enough to be helpful for discriminating items in the agenda. 2 Note that just as Klein and Manning (2003a), we use the terms inside score and outside score to denote the Viterbi inside and outside scores. They are not to be confused with the actual inside or outside probability. 99 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 9 1 8 7 1 7 9 9 6 8 1 / c o l i _ a _ 0 0 1 3 6 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 39, Number 1 Admissibility and monotonicity are two important conditions on estimates. All our outside estimates are admissible (Klein and Manning 2003a), which means that they never underestimate the actual outside score of an item. In other words, they are too optimistic about the costs of completing the item into an S item spanning the entire input. For the full SX estimate described in Section 4.1 and the SX estimate with span and sentence length in Section 4.4, the monotonicity is guaranteed and we can do ∗ parsing as described by Klein and Manning. Monotonicity means that for each true A antecedent item of a rule it holds that its weight is greater than or equal to the weight of the consequent item. The estimates from Sections 4.2 and 4.3 are not monotonic. This means that it can happen that we deduce an item I2 from an item I1 where the weight of I2 is greater than the weight of I1. The parser can therefore end up in a local maximum that is not the global maximum we are searching for. In other words, those estimates are only figures of merit (FOM). All outside estimates are computed off-line for a certain maximal sentence length lenmax. 4.1 Full SX Estimate The full SX estimate is a PLCFRS adaption of the SX estimate of Klein and Manning (2003a) (hence the name). For a given sentence length n, the estimate gives the maximal probability of completing a category X with a span ρ into an S with span (cid:4)(cid:4)0, n(cid:5)(cid:5). For its computation, we need an estimate of the inside score of a category C with a span ρ, regardless of the actual terminals in our input. This inside estimate is computed as shown in Figure 14. Here, we do not need to consider the number of terminals outside the span of C (to the left or right or in the gaps), because they are not relevant for the inside score. Therefore the items have the form [A, (cid:4)l1, . . . , ldim(A) (cid:5)], where A is a non- terminal and li gives the length of its ith component. It holds that Σ1≤i≤dim(A)li ≤ lenmax − dim(A) + 1 because our grammar extraction algorithm ensures that the different components in the yield of a non-terminal are never adjacent. There is always at least one terminal in between two different components that does not belong to the yield of the non-terminal. The first rule in Figure 14 tells us that POS tags always have a single component of length 1; therefore this case has probability 1 (weight 0). The rules unary and binary are roughly like the ones in the CYK parser, except that they combine items with length information. The rule unary for instance tells us that if the log of the probability of building [B,(cid:1)l] is greater or equal to in and if there is a rule that allows to deduce an POS tags: 0 : [A, (cid:4)1(cid:5)] A a POS tag Unary: in : [B,(cid:1)l] in + log(p) : [A,(cid:1)l] p : A((cid:1)α) → B((cid:1)α) ∈ P l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 9 1 8 7 1 7 9 9 6 8 1 / c o l i _ a _ 0 0 1 3 6 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Binary: inB : [B,(cid:1)lB], inC : [C,(cid:1)lC] inB + inC + log(p) : [A,(cid:1)lA] where p : A( (cid:1)αA) → B( (cid:1)αB)C( (cid:1)αC) ∈ P and the following holds: we define B(i) as {1 ≤ j ≤ dim(B) | (cid:1)αB( j) occurs in (cid:1)αA(i)} and C(i) as {1 ≤ j ≤ dim(C) | (cid:1)αC( j) occurs in (cid:1)αA(i)}. Then for all i, 1 ≤ i ≤ dim(A): (cid:1)lA(i) = Σj∈B(i) (cid:1)lB( j) + Σj∈C(i) (cid:1)lC( j). Figure 14 Estimate of the Viterbi inside score. 100 Kallmeyer and Maier PLCFRS Parsing Axiom : 0 : [S, (cid:4)0, len, 0(cid:5)] 1 ≤ len ≤ lenmax Unary: out : [A,(cid:1)l] out + log(p) : [B,(cid:1)l] p : A((cid:1)α) → B((cid:1)α) ∈ P Binary-right: Binary-left: out : [X,(cid:1)lX] (cid:4) out + in(A,(cid:1)l A) + log(p) : [B,(cid:1)lB] out : [X,(cid:1)lX] (cid:4) out + in(B,(cid:1)l B) + log(p) : [A,(cid:1)lA] where, for both binary rules, there is an instantiated rule p : X((cid:1)ρ) → A( (cid:1)ρA)B( (cid:1)ρB) such that (cid:1)lX = lout(ρ), (cid:1)lA = lout(ρA),(cid:1)l (cid:4) A = lin(ρA), (cid:1)lB = lout(ρB),(cid:1)l (cid:4) B = lin(ρB). l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 9 1 8 7 1 7 9 9 6 8 1 / c o l i _ a _ 0 0 1 3 6 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Figure 15 Full SX estimate first version (top–down). A item from [B,(cid:1)l] with probability p, then the log of the probability of [A,(cid:1)l] is greater or equal to in + log(p). For each item, we record its maximal weight (i.e., its maximal probability). The rule binary is slightly more complicated because we have to compute the length vector of the left-hand side of the rule from the right-hand side length vectors. A straightforward extension of the CFG algorithm from Klein and Manning (2003a) for computing the SX estimate is given in Figure 15. Here, the items have the form [A,(cid:1)l] where the vector(cid:1)l tells us about the lengths of the string to the left of the first component, the first component, the string in between the first and second component, and so on. The algorithm proceeds top–down. The outside estimate of completing an S with component length len and no terminals to the left or to the right of the S component (item [S, (cid:4)0, len, 0(cid:5)]) is 0. If we expand with a unary rule (unary), then the outside estimate of the right-hand side item is greater or equal to the outside estimate of the left-hand side item plus the log of the probability of the rule. In the case of binary rules, we have to further add the inside estimate of the other daughter. For this, we need a different length vector (without the lengths of the parts in between the components). (cid:5)(cid:5) and a sentence length n, Therefore, for a given range vector ρ = (cid:4)(cid:4)l1, r1 we distinguish between the inside length vector lin(ρ) = (cid:4)r1 (cid:5) and the outside length vector lout(ρ) = (cid:4)l1, r1 (cid:5), . . . , (cid:4)lk, rk − l1, . . . , rk − r1, . . . , lk − lk, n − rk − rk−1, rk − lk (cid:5). − l1, l2 This algorithm has two major problems: Because it proceeds top–down, in the binary rules we must compute all splits of the antecedent X span into the spans of A and B, which is very expensive. Furthermore, for a category A with a certain number of terminals in the components and the gaps, we compute the lower part of the outside estimate several times, namely, for every combination of number of terminals to the left and to the right (first and last element in the outside length vector). In order to avoid these problems, we now abstract away from the lengths of the part to the left and the right, modifying our items such as to allow a bottom–up strategy. The idea is to compute the weights of items representing the derivations from a certain lower C up to some A (C is a kind of “gap” in the yield of A) while summing up the inside costs of off-spine nodes and the log of the probabilities of the corresponding rules. We use items [A, C, ρA, ρC, shift] where A, C ∈ N and ρA, ρC are range vectors, both with a first component starting at position 0. The integer shift ≤ lenmax tells us how many positions to the right the C span is shifted, compared to the starting position of the A. ρA and ρC represent the spans of C and A while disregarding the number of terminals to the left and the right (i.e., only the lengths of the components and of the gaps are encoded). This means in particular that the length n of the sentence does not play a role here. The right boundary of the last range in the vectors is limited to lenmax. For 101 Computational Linguistics Volume 39, Number 1 any i, 0 ≤ i ≤ lenmax, and any range vector ρ, we define shift(ρ, i) as the range vector one obtains from adding i to all range boundaries in ρ and shift(ρ, −i) as the range vector one obtains from subtracting i from all boundaries in ρ. The weight of [A, C, ρA, ρC, i] estimates the log of the probability of completing a C tree with yield ρC into an A tree with yield ρA such that, if the span of A starts at position j, the span of C starts at position i + j. Figure 16 gives the computation. The value of in(A,(cid:1)l) is the inside estimate of [A,(cid:1)l]. The SX-estimate for some predicate C with span ρ where i is the left boundary of the first component of ρ and with sentence length n is then given by the maximal weight of [S, C, (cid:4)0, n(cid:5), shift(ρ, −i), i]. 4.2 SX with Left, Gaps, Right, Length A problem of the previous estimate is that with a large number of non-terminals (for treebank parsing, approximately 12,000 after binarization and markovization), the com- putation of the estimate requires too much space. We therefore turn to simpler estimates with only a single non-terminal per item. We now estimate the outside score of a non- terminal A with a span of a length length (the sum of the lengths of all the components of the span), with left terminals to the left of the first component, right terminals to the right of the last component, and gaps terminals in between the components of the A span (i.e., filling the gaps). Our items have the form [X, len, left, right, gaps] with X ∈ N, len + left + right + gaps ≤ lenmax, len ≥ dim(X), gaps ≥ dim(X) − 1. Let us assume that, in the rule X((cid:1)α) → A( (cid:1)αA)B( (cid:1)αB), when looking at the vector (cid:1)α, we have leftA variables for A-components preceding the first variable of a B component, rightA variables for A-components following the last variable of a B component, and rightB variables for B-components following the last variable of an A component. (In our grammars, the first left-hand side argument always starts with the first variable from A.) Furthermore, we set gapsA = dim(A) − leftA − rightA and gapsB = dim(B) − rightB. Figure 17 gives the computation of the estimate. It proceeds top–down, as the computation of the full SX estimate in Figure 15, except that now the items are simpler. POS tags: 0 : [C, C, (cid:4)0, 1(cid:5), (cid:4)0, 1(cid:5), 0] C a POS tag Unary: 0 : [B, B, ρB, ρB, 0] log(p) : [A, B, ρB, ρB, 0] p : A((cid:1)α) → B((cid:1)α) ∈ P Binary-right: 0 : [A, A, ρA, ρA, 0], 0 : [B, B, ρB, ρB, 0] in(A, lin(ρA)) + log(p) : [X, B, ρX, ρB, i] l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 9 1 8 7 1 7 9 9 6 8 1 / c o l i _ a _ 0 0 1 3 6 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 where i is such that for shift(ρB, i) = ρ(cid:4) Binary-left: 0 : [A, A, ρA, ρA, 0], 0 : [B, B, ρB, ρB, 0] in(B, lin(ρB)) + log(p) : [X, A, ρX, ρA, i] B p : X(ρX ) → A(ρA)B(ρ(cid:4) B) is an instantiated rule. Starting sub-trees with larger gaps: out : [B, C, ρB, ρC, i] 0 : [B, B, ρB, ρB, 0] Transitive closure of sub-tree combination: out1 : [A, B, ρA, ρB, i], out2 : [B, C, ρB, ρC, j] out1 + out2 : [A, C, ρA, ρC, i + j] Figure 16 Full SX estimate second version (bottom–up). 102 Kallmeyer and Maier PLCFRS Parsing Axiom : 0 : [S, len, 0, 0, 0] 1 ≤ len ≤ lenmax Unary: out : [X, len, l, r, g] out + log(p) : [A, len, l, r, g] p : X((cid:1)α) → A((cid:1)α) ∈ P Binary-right: out : [X, len, l, r, g] out + in(A, len − lenB) + log(p) : [B, lenB, lB, rB, gB] Binary-left: out : [X, len, l, r, g] out + in(B, len − lenA) + log(p) : [A, lenA, lA, rA, gA] where, for both binary rules, p : X((cid:1)α) → A( (cid:1)αA)B( (cid:1)αB) ∈ P. Further side conditions for Binary-right: a) len + l + r + g = lenB + lB + rB + gB, c) if rightA > 0, then rB
Further side conditions for Binary-left:
a) len + l + r + g = lenA + lA + rA + gA,
c) if rightB > 0, then rA

≥ r + rightA, else (rightA = 0), rB = r,

≥ r + rightB, else (rightB = 0), rA = r

b) lB
d) gB

≥ l + leftA,
≥ gapsA.

b) lA = l,
d) gA

≥ gapsB.

Figure 17
SX estimate depending on length, left, right, gaps.

The value in(X, l) for a non-terminal X and a length l, 0 ≤ l ≤ lenmax is an estimate
of the probability of an X category with a span of length l. Its computation is specified
in Figure 18.

The SX-estimate for a sentence length n and for some predicate C with a range
− li) and r =
− r].

characterized by (cid:1)ρ = (cid:4)(cid:4)l1, r1
n − rdim(C) is then given by the maximal weight of the item [C, len, l1, r, n − len − l1

(cid:5)(cid:5) where len = Σdim(C)

(cid:5), . . . , (cid:4)ldim(C), rdim(C)

(ri

i=1

4.3 SX with LR, Gaps, Length

In order to further decrease the space complexity of the computation of the outside
estimate, we can simplify the previous estimate by subsuming the two lengths left and
right in a single length lr. The items now have the form [X, len, lr, gaps] with X ∈ N,
len + lr + gaps ≤ lenmax, len ≥ dim(X), gaps ≥ dim(X) − 1.

The computation is given in Figure 19. Again, we define leftA, gapsA, rightA and
gapsB, rightB for a rule X((cid:1)α) → A( (cid:1)αA)B( (cid:1)αB) as before. Furthermore, in both Binary-left
and Binary-right, we have limited lr in the consequent item to the lr of the antecedent
plus the length of the sister (lenB, resp. lenA). This results in a further reduction of the
number of items while having only little effect on the parsing results.

The SX-estimate for a sentence length n and for some predicate C with a span
− li) and r = n − rdim(C) is then the

(cid:5)(cid:5) where len = Σdim(C)

(cid:5), . . . , (cid:4)ldim(C), rdim(C)

(ri

(cid:1)ρ = (cid:4)(cid:4)l1, r1
maximal weight of [C, len, l1 + r, n − len − l1

i=1
− r].

POS tags:

0 : [A, 1]

A a POS tag

Unary:

in : [B, l]
in + log(p) : [A, l]

p : A((cid:1)α) → B((cid:1)α) ∈ P

Binary:

inB : [B, lB], inC : [C, lC]
inB + inC + log(p) : [A, lB + lC]

where either p : A( (cid:1)αA) → B( (cid:1)αB)C( (cid:1)αC) ∈ P or p : A( (cid:1)αA) → C( (cid:1)αC)B( (cid:1)αB) ∈ P.

Figure 18
Estimate of the inside score with total span length.

103

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

3
9
1
8
7
1
7
9
9
6
8
1
/
c
o

l
i

_
a
_
0
0
1
3
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

Computational Linguistics

Volume 39, Number 1

Axiom :

0 : [S, len, 0, 0]

1 ≤ len ≤ lenmax

Unary:

out : [X, len, lr, g]
out + log(p) : [A, len, lr, g]

p : X((cid:1)α) → A((cid:1)α) ∈ P

Binary-right:

out : [X, len, lr, g]
out + in(A, len − lenB) + log(p) : [B, lenB, lrB, gB]

p : X((cid:1)α) → A( (cid:1)αA )B( (cid:1)αB ) ∈ P

Binary-left:

out : [X, len, lr, g]
out + in(B, len − lenA) + log(p) : [A, lenA, lrA, gA]

p : X((cid:1)α) → A( (cid:1)αA )B( (cid:1)αB ) ∈ P

Further side conditions for Binary-right:
a) len + lr + g = lenB + lrB + gB b) lr < lrB Further side conditions for Binary-left: a) len + lr + g = lenA + lrA + gA b) if rightB = 0 then lr = lrA, else lr < lrA ≥ gapsA c) gB c) gA ≥ gapsB Figure 19 SX estimate depending on length, LR, gaps. 4.4 SX with Span and Sentence Length We will now present a further simplification of the last estimate that records only the span length and the length of the entire sentence. The items have the form [X, len, slen] with X ∈ N, dim(X) ≤ len ≤ slen. The computation is given in Figure 20. This last esti- ∗ mate is actually monotonic and allows for true A parsing. The SX-estimate for a sentence length n and for some predicate C with a span − li) is then the maximal weight (cid:5)(cid:5) where len = Σdim(C) (cid:5), . . . , (cid:4)ldim(C), rdim(C) (ri i=1 (cid:1)ρ = (cid:4)(cid:4)l1, r1 of [C, len, n]. In order to prove that this estimate allows for monotonic weighted deductive pars- ing and therefore guarantees that the best parse will be found, let us have a look at the CYK deduction rules when being augmented with the estimate. Only Unary and Binary are relevant because Scan does not have antecedent items. The two rules, augmented with the outside estimate, are shown in Figure 21. We have to show that for every rule, if this rule has an antecedent item with weight (cid:4) , then w ≥ w (cid:4) w and a consequent item with weight w Let us start with Unary. To show: inB + outB ≥ inB + log(p) + outA. Because of the Unary rule for computing the outside estimate and because of the unary production, . Axiom : 0 : [S, len, len] 1 ≤ len ≤ lenmax Unary: out : [X, lX, slen] out + log(p) : [A, lX, slen] p : X((cid:1)α) → A((cid:1)α) ∈ P Binary-right: out : [X, lX, slen] out + in(A, lX − lB) + log(p) : [B, lB, slen] p : X((cid:1)α) → A( (cid:1)αA )B( (cid:1)αB ) ∈ P Binary-left: out + in(B, lX out : [X, lX, slen] − lA) + log(p) : [A, lA, slen] p : X((cid:1)α) → A( (cid:1)αA )B( (cid:1)αB ) ∈ P Figure 20 SX estimate depending on span and sentence length. 104 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 9 1 8 7 1 7 9 9 6 8 1 / c o l i _ a _ 0 0 1 3 6 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Kallmeyer and Maier PLCFRS Parsing Unary: inB + outB : [B, (cid:1)ρ] inB + log(p) + outA : [A, (cid:1)ρ] p : A((cid:1)α) → B((cid:1)α) ∈ P Binary: inB + outB : [B, (cid:1)ρB], inC + outC : [C, (cid:1)ρC] inB + inC + log(p) + outA : [A, (cid:1)ρA] p : A( (cid:1)ρA ) → B( (cid:1)ρB )C( (cid:1)ρC ) is an instantiated rule (Here, outA, outB, and outC are the respective outside estimates of [A, (cid:1)ρA], [B, (cid:1)ρB] and [C, (cid:1)ρC].) Figure 21 Parsing rules including outside estimate. we obtain that, given the outside estimate outA of [A, (cid:1)ρ] the outside estimate outB of the item [B, (cid:1)ρ] is at least outA + log(p), namely, outB ≥ log(p) + outA. Now let us consider the rule Binary. We treat only the relation between the weight of the C antecedent item and the consequent. The treatment of the antecedent B is ≥ inB + inC + log(p) + outA. Assume that lB is the length symmetric. To show: inC + outC of the components of the B item and n is the sentence length. Then, because of the Binary-right rule in the computation of the outside estimate and because of our in- stantiated rule p : A( (cid:1)ρA) → B( (cid:1)ρB)C( (cid:1)ρC), we have that the outside estimate outC of the C-item is at least outA + in(B, lB) + log(p). Furthermore, in(B, lB) ≥ inB. Consequently outC ≥ inB + log(p) + outA. 4.5 Integration into the Parser Before parsing, the outside estimates of all items up to a certain maximal sentence length lenmax are precomputed. Then, when performing the weighted deductive parsing as explained in Section 3.2, whenever a new item is stored in the agenda, we add its outside estimate to its weight. Because the outside estimate is always greater than or equal to the actual outside score, given the input, the weight of an item in the agenda is always greater than or equal to the log of the actual product of the inside and outside score of the item. In this sense, the outside estimates given earlier are admissible. Additionally, as already mentioned, note that the full SX estimate and the SX esti- ∗ parsing. The other mate with span and sentence length are monotonic and allow for A two estimates, which are both not monotonic, act as FOMs in a best-first parsing context. Consequently, they contribute to speeding up parsing but they decrease the quality of the parsing output. For further evaluation details see Section 6. 5. Grammars for Discontinuous Constituents 5.1 Grammar Extraction The algorithm we use for extracting an LCFRS from a constituency treebank with cross- ing branches has originally been presented in Maier and Søgaard (2008). It interprets the treebank trees as LCFRS derivation trees. Consider for instance the tree in Figure 22. The S node has two daughters, a VMFIN node and a VP node. This yields a rule S → VP VMFIN. The VP is discontinuous with two components that wrap around the yield of the VMFIN. Consequently, the LCFRS rule is S(XYZ) → VP(X, Z) VMFIN(Y). The extraction of an LCFRS from treebanks with crossing branches is almost im- mediate, except for the fan-out of the non-terminal categories: In the treebank, we can have the same non-terminal with different fan-outs, for instance a VP without a gap (fan-out 1), a VP with a single gap (fan-out 2), and so on. In the corresponding LCFRS, 105 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 9 1 8 7 1 7 9 9 6 8 1 / c o l i _ a _ 0 0 1 3 6 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 39, Number 1 S VP VP PROAV VMFIN dar ¨uber about it muß must “It must be thought about it” thought VVPP VAINF nachgedacht werden be Figure 22 A sample tree from NeGra. we have to distinguish these different non-terminals by mapping them to different predicates. · · · Am, a clause A0 The algorithm first creates a so-called lexical clause P(a) → ε for each pre-terminal P dominating some terminal a. Then for all other non-terminals A0 with the children · · · Am A1 is the number of discontinuous parts in their yields. The components of A0 are concate- nations of variables that describe how the discontinuous parts of the yield of A0 are obtained from the yields of its daughters. · · · Am is created. The number of components of the A1 → A1 More precisely, the non-terminals in our LCFRS are all Ak where A is a non-terminal label in the treebank and k is a possible fan-out for A. For a given treebank tree (cid:4)V, E, r, l(cid:5) where V is the set of nodes, E ⊂ V × V the set of immediate dominance edges, r ∈ V the root node, and l : V → N ∪ T the labeling function, the algorithm constructs the following rules. Let us assume that w1, . . . , wn are the terminal labels of the leaves in (cid:4)V, E, r(cid:5) with a linear precedence relation wi ≺ wj for 1 ≤ i < j ≤ n. We introduce a variable Xi for every wi, 1 ≤ i ≤ n. ∈ V with (cid:4)v2, v2 For every pair of nodes v1, v2 l(v1)(l(v2)) → ε to the rules of the grammar. (We omit the fan-out subscript here because pre-terminals are always of fan-out 1.) For every node v ∈ V with l(v) = A0 /∈ T such that there are exactly m (cid:5) ∈ E and l(vi) = Ai /∈ T for all nodes v1, . . . , vm 1 ≤ i ≤ m, we now create a rule (cid:5) ∈ E, l(v2) ∈ T, we add ∈ V (m ≥ 1) with (cid:4)v, vi A0(x(0) 1 , . . . , x(0) → A1(x(1) dim(A0 )) 1 , . . . , x(1) dim(A1 )) . . . Am(x(m) 1 , . . . , x(m) dim(Am )) where for the predicate Ai, 0 ≤ i ≤ m, the following must hold: The concatenation of all arguments of Ai, x(i) concatenation of all X ∈ {Xi Xi precedes Xj if i < j, and a variable Xj with 1 ≤ j < n is the right boundary of an argument of (cid:4) (cid:4) Ai if and only if Xj+1 /∈ {Xi }, that is, an | (cid:4)vi, v i ) = wi with l(v i argument boundary is introduced at each discontinuity. 1 . . . x(i) dim(Ai ) is the (cid:4) i ) = wi with l(v } such that (cid:4) | (cid:4)vi, v i ∗ (cid:5) ∈ E ∗ (cid:5) ∈ E As a further step, in this new rule, all right-hand side arguments of length > 1 are replaced in both sides of the rule with a single new variable.
Finally, all non-terminals A in the rule are equipped with an additional
subscript dim(A), which gives us the final non-terminal in our LCFRS.

(cid:1)

(cid:1)

1.

2.

106

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

3
9
1
8
7
1
7
9
9
6
8
1
/
c
o

l
i

_
a
_
0
0
1
3
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

Kallmeyer and Maier

PLCFRS Parsing

PROAV(Dar ¨uber) → ε
VMFIN(muß) → ε
VVPP(nachgedacht) → ε
VAINF(werden) → ε

S1(X1X2X3) → VP2(X1, X3)VMFIN(X2)
VP2(X1, X2X3) → VP2(X1, X2)VAINF(X3)
VP2(X1, X2) → PROAV(X1)VVPP(X2)

Figure 23
LCFRS rules extracted from the tree in Figure 22.

For the tree in Figure 22, the algorithm produces for instance the rules in Figure 23.
As standard for PCFG, the probabilities are computed using Maximum Likelihood

Estimation.

5.2 Head-Outward Binarization

As previously mentioned, in contrast to CFG the order of the right-hand side elements
of a rule does not matter for the result of an LCFRS derivation. Therefore, we can reorder
the right-hand side of a rule before binarizing it.

The following, treebank-specific reordering results in a head-outward binarization
where the head is the lowest subtree and it is extended by adding first all sisters to its
left and then all sisters to its right. It consists of reordering the right-hand side of the
rules extracted from the treebank such that first, all elements to the right of the head are
listed in reverse order, then all elements to the left of the head in their original order, and
then the head itself. Figure 24 shows the effect this reordering and binarization has on
the form of the syntactic trees. In addition to this, we also use a variant of this reordering

S

VP

Tree in NeGra format:

NN
das
that

NN
man
one

AV
VAINF
jetzt machen
now

VMFIN
muß
must
“One has to do that now”
Rule extracted for the S node: S(XYZU) → VP(X, U)VMFIN(Y)NN(Z)
Reordering for head-outward binarization: S(XYZU) → NN(Z)VP(X, U)VMFIN(Y)
New rules resulting from binarizing this rule:
S(XYZ) → Sbin1(X, Z)NN(Y)
Rule extracted for the VP node: VP(X, YZ) → NN(X)AV(Y)VAINF(Z)
New rules resulting from binarizing this rule:
VP(X, Y) → NN(X)VPbin1(Y) VPbin1(XY) → AV(X)VAINF(Y)

Sbin1(XY, Z) → VP(X, Z)VMFIN(Y)

do

Tree after binarization:

S

Sbin1

VP

VPbin1

NN

VMFIN NN AV VAINF

Figure 24
Sample head-outward binarization.

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

3
9
1
8
7
1
7
9
9
6
8
1
/
c
o

l
i

_
a
_
0
0
1
3
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

107

Computational Linguistics

Volume 39, Number 1

where we add first the sisters to the right and then the ones to the left. This is what Klein
and Manning (2003b) do. To mark the heads of phrases, we use the head rules that the
Stanford parser (Klein and Manning 2003c) uses for NeGra.

In all binarizations, there exists the possibility of adding additional unary rules
when deriving the head. This allows for a further factorization. In the experiments,
however, we do not insert unary rules, neither at the highest nor at the lowest new
binarization non-terminal, because this was neither beneficial for parsing times nor for
the parsing results.

5.3 Incorporating Additional Context

5.3.1 Markovization. As already mentioned in Section 3.1, a binarization that introduces
unique new non-terminals for every single rule that needs to be binarized produces
a large amount of non-terminals and fails to capture certain generalizations. For this
reason, we introduce markovization (Collins 1999; Klein and Manning 2003b).

Markovization is achieved by introducing only a single new non-terminal for the
new rules introduced during binarization and adding vertical and horizontal context
from the original trees to each occurrence of this new non-terminal. As vertical context,
we add the first v labels on the path from the root node of the tree that we want to
binarize to the root of the entire treebank tree. The vertical context is collected during
grammar extraction and then taken into account during binarization of the rules. As
horizontal context, during binarization of a rule A((cid:1)α) → A0( (cid:1)α0) . . . Am( (cid:1)αm), for the new
non-terminal that comprises the right-hand side elements Ai . . . Am (for some 1 ≤ i ≤
m), we add the first h elements of Ai, Ai−1, . . . , A0.

Figure 25 shows an example of a markovization of the tree from Figure 24 with v = 1
and h = 2. Here, the superscript is the vertical context and the subscript the horizontal
context of the new non-terminal X. Note that in this example we have disregarded the
fan-out of the context categories. The VP, for instance, is actually a VP2 because it has
fan-out 2. For the context symbols, one can either use the categories from the original
treebank (without fan-out) or the ones from the LCFRS rules (with fan-out). We chose
the latter approach because it delivered better parsing results.

5.3.2 Further Category Splitting. Grammar annotation (i.e., manual enhancement of an-
notation information through category splitting) has previously been successfully used
in parsing German (Versley 2005). In order to see if such modifications can have a
beneficial effect in PLCFRS parsing as well, we perform different category splits on the
(unbinarized) NeGra constituency data.

We split the category S (“sentence”) into SRC (“relative clause”) and S (all other
categories S). Relative clauses mostly occur in a very specific context, namely, as the

S

XS

VP,NN

VP

XVP

ADV,NN

NN VMFIN NN ADV

VAINF

Figure 25
Sample markovization with v = 1, h = 2.

108

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

3
9
1
8
7
1
7
9
9
6
8
1
/
c
o

l
i

_
a
_
0
0
1
3
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

Kallmeyer and Maier

PLCFRS Parsing

Table 1
NeGra: Properties of the data with crossing branches.

training

test

number of sentences
average sentence length
average tree height
average children per node
sentences without gaps
sentences with one gap
sentences with ≥ 2 gaps
maximum gap degree

16,502

14.56
4.62
2.96
12,481 (75.63%)
3,320 (20.12%)
701 (4.25%)

6

1,833

14.62
4.72
2.94

1,361 (74.25%)
387 (21.11%)
85 (4.64%)
5

right part of an NP or a PP. This splitting should therefore speed up parsing and increase
precision. Furthermore, we distinguish NPs by their case. More precisely, to all nodes
with categories N, we append the grammatical function label to the category label. We
finally experiment with the combination of both splits.

6. Experiments

6.1 Data

Our data source is the NeGra treebank (Skut et al. 1997). We create two different data
sets for constituency parsing. For the first one, we start out with the unmodified NeGra
treebank and remove all sentences with a length of more than 30 words. We pre-process
the treebank following common practice (K ¨ubler and Penn 2008), attaching all nodes
which are attached to the virtual root node to nodes within the tree such that, ideally,
no new crossing edges are created. In a second pass, we attach punctuation which
comes in pairs (parentheses, quotation marks) to the same nodes. For the second data
set we create a copy of the pre-processed first data set, in which we apply the usual
tree transformations for NeGra PCFG parsing (i.e., moving nodes to higher positions
until all crossing branches are resolved). The first 90% of both data sets are used as the
training set and the remaining 10% as test set. The first data set is called NeGraLCFRS
and the second is called NeGraCFG.

Table 1 lists some properties of the training and test (respectively, gold) parts of
NeGraLCFRS, namely, the total number of sentences, the average sentence length, the
average tree height (the height of a tree being the length of the longest of all paths
from the terminals to the root node), and the average number of children per node
(excluding terminals). Furthermore, gap degrees (i.e., the number of gaps in the spans
of non-terminal nodes) are listed (Maier and Lichte 2011).

Our findings correspond to those of Maier and Lichte except for small differences

due to the fact that, unlike us, they removed the punctuation from the trees.

6.2 Parser Implementation

We have implemented the CYK parser described in the previous section in a system
called rparse. The implementation is realized in Java.3

3 rparse is available under the GNU General Public License 2.0 at http://www.phil.hhu.de/rparse.

109

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

3
9
1
8
7
1
7
9
9
6
8
1
/
c
o

l
i

_
a
_
0
0
1
3
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

Computational Linguistics

Volume 39, Number 1

Table 2
NeGraLCFRS: PLCFRS parsing results for different binarizations.

Head-driven

KM L-to-R Optimal Deterministic

LP
LR
LF1
UP
UR
UF1

74.00
74.24
74.12
77.09
77.34
77.22

74.00
74.13
74.07
77.20
77.33
77.26

75.08
74.69
74.88
77.95
77.54
77.75

74.92
74.88
74.90
77.77
77.73
77.75

72.40
71.80
72.10
75.67
75.04
75.35

6.3 Evaluation

For the evaluation of the constituency parses, we use an EVALB-style metric. For a tree
over a string w, a single constituency is represented by a tuple (cid:4)A, (cid:1)ρ(cid:5) with A being a
node label and (cid:1)ρ ∈ (Pos(w) × Pos(w))dim(A). We compute precision, recall, and F1 based
on these tuples from gold and de-binarized parsed test data from which all category
splits have been removed. This metric is equivalent to the corresponding PCFG metric
for dim(A) = 1. Despite the shortcomings of such a measure (Rehbein and van Genabith
2007), it still allows to some extent a comparison to previous work in PCFG parsing (see
also Section 7). Note that we provide the parser with gold POS tags in all experiments.

6.4 Markovization and Binarization

We use the markovization settings v = 1 and h = 2 for all further experiments. The
setting which has been reported to yield the best results for PCFG parsing of NeGra,
v = 2 and h = 1 (Rafferty and Manning 2008), required a parsing time which was too
high.4

Table 2 contains the parsing results for NeGraLCFRS using five different binariza-
tions: Head-driven and KM are the two head-outward binarizations that use a head
chosen on linguistic grounds (described in Section 5.2); L-to-R is another variant in
which we always choose the rightmost daughter of a node as its head.5 Optimal reorders
the left-hand side such that the fan-out of the binarized rules is optimized (described in
Section 3.1.2). Finally, we also try a deterministic binarization (Deterministic) in which
we binarize strictly from left to right (i.e., we do not reorder the right-hand sides of
productions, and choose unique binarization labels).

The results of the head-driven binarizations and the optimal binarization lie close
together; the results for the deterministic binarization are worse. This indicates that the
presence or absence of markovization has more impact on parsing results than the actual
binarization order. Furthermore, the non-optimal binarizations did not yield a binarized
grammar of a higher fan-out than the optimal binarization: For all five binarizations,
the fan-out was 7 (caused by a VP interrupted by punctuation).

4 Older versions of rparse contained a bug that kept the priority queue from being updated correctly
(i.e., during an update, the corresponding node in the priority queue was not moved to its top, and
therefore the best parse was not guaranteed to be found); however, higher parsing speeds were achieved.
The current version of rparse implements the update operation correctly, using a Fibonacci queue to
ensure efficiency (Cormen et al. 2003). Thanks to Andreas van Cranenburgh for pointing this out.

5 The term head is not used in its proper linguistic sense here.

110

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

3
9
1
8
7
1
7
9
9
6
8
1
/
c
o

l
i

_
a
_
0
0
1
3
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

Kallmeyer and Maier

PLCFRS Parsing

700

600

Head-driven

KM

L-to-R

Optimal

500

Deterministic

)
0
0
0
1

n
i
(

s
m
e
t
i

400

300

200

100

0

15

17

19

23

21
sentence length

25

Baseline

NP

S

NP+S

400

350

300

250

200

150

100

50

)
0
0
0
1

n
i
(

s
m
e
t
i

27

29

15

17

19

23

21
sentence length

25

OFF

LR

LN

400

350

300

250

200

150

100

50

)
0
0
0
1

n
i
(

s
m
e
t
i

27

29

15

17

19

23

21
sentence length

25

27

29

Figure 26
NeGraLCFRS: Items for PLCFRS parsing (left-to-right): binarizations, baseline and category splits,
and estimates.

The different binarizations result in different numbers of items, and therefore allow
for different parsing speeds. The respective leftmost graph in Figures 26 and 27 show
a visual representation of the number of items produced by all binarizations, and the
corresponding parsing times. Note that when choosing the head with head rules the
number of items is almost not affected by the choice of adding first the children to
the left of the head and then to the right of the head or vice versa. The optimal bina-
rization produces the best results. Therefore we will use it in all further experiments,
in spite of its higher parsing time.

6.5 Baseline Evaluation and Category Splits

Table 3 presents the constituency parsing results for NeGraLCFRS and NeGraCFG, both
with and without the different category splits. Recall that NeGraLCFRS has crossing
branches and consequently leads to a PLCFRS of fan-out > 1 whereas NeGraCFG does
not contain crossing branches and consequently leads to a 1-PLCFRS—in other words,

Head-driven

KM

L-to-R

Optimal

Deterministic

)
.
c
e
s
(

e
m

i
t

100

10

1

0.1

Baseline

NP

S

NP+S

)
.
c
e
s
(

e
m

i
t

100

10

1

0.1

OFF

LR

LN

100

)
.
c
e
s
(

e
m

i
t

10

1

0.1

15

17

19

21

23

25

27

29

15

17

19

21

23

25

27

29

15

17

19

21

23

25

27

29

sentence length

sentence length

sentence length

Figure 27
NeGraLCFRS: Parsing times for PLCFRS parsing (left-to-right): binarizations, baseline and
category splits, and estimates (log scale).

111

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

3
9
1
8
7
1
7
9
9
6
8
1
/
c
o

l
i

_
a
_
0
0
1
3
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

Computational Linguistics

Volume 39, Number 1

Table 3
NeGraLCFRS and NeGraCFG: baseline and category splits.

w/ category splits

w/ category splits

NeGraLCFRS

NP

S

NP ◦ S NeGraCFG

NP

S

NP ◦ S

LP
LR
LF1
UP
UR
UF1

74.92
74.88
74.90
77.77
78.73
77.75

75.21
74.95
75.08
78.16
77.88
78.02

75.81
75.65
75.73
78.31
78.15
78.23

75.93
75.57
75.75
78.60
78.22
78.41

76.32
76.36
76.34
79.12
79.17
79.14

76.79
77.23
77.01
79.62
80.08
79.85

77.39
77.35
77.37
79.84
79.80
79.82

77.58
77.99
77.79
80.09
80.52
80.30

a PCFG. We evaluate the parser output against the unmodified gold data; that is,
before we evaluate the experiments with category splits, we replace all split labels in
the parser output with the corresponding original labels.

We take a closer look at the properties of the trees in the parser output for
NeGraLCFRS. Twenty-nine sentences had no parse, therefore, the parser output has 1,804
sentences. The average tree height is 4.72, and the average number of children per node
(excluding terminals) is 2.91. These values are almost identical to the values for the gold
data. As for the gap degree, we get 1,401 sentences with no gaps (1,361 in the gold set),
334 with gap degree 1 (387 in the gold set), and 69 with 2 or 3 gaps (85 in the gold set).
Even though the difference is only small, one can see that fewer gaps are preferred. This
is not surprising, since constituents with many gaps are rare events and therefore end
up with a probability which is too low.

We see that the quality of the PLCFRS parser output on NeGraLCFRS (which contains
more information than the output of a PCFG parser) does not lag far behind the quality
of the PCFG parsing results on NeGraCFG. With respect to the category splits, the results
show furthermore that category splitting is indeed beneficial for the quality of the
PLCFRS parser output. The gains in speed are particularly visible for sentences with
a length greater than 20 words (cf. the number of produced items and parsing times in
Figures 26 and 27 [middle]).

6.6 Evaluating Outside Estimates

We compare the parser performance without estimates (OFF) with its performance
with the estimates described in Sections 4.3 (LR) and 4.4 (LN).

Unfortunately, the full estimates seem to be only of theoretical interest because they
were too expensive to compute both in terms of time and space, given the restrictions
imposed by our hardware. We could, however, compute the LN and the LR estimate.

parsing, the LR estimate lets the
Unlike the LN estimate, which allows for true A
quality of the parsing results deteriorate: Compared with the baseline, labeled F1 drops
from 74.90 to 73.76 and unlabeled F1 drops from 77.91 to 76.89. The respective rightmost
graphs in Figures 26 and 27 show the average number of items produced by the
parser and the parsing times for different sentence lengths. The results indicate that the
estimates have the desired effect of preventing unnecessary items from being produced.
This is reflected in a significantly lower parsing time.

The different behavior of the LR and the LN estimate raises the question of the
trade-off between maintaining optimality and obtaining a higher parsing speed. In

112

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

3
9
1
8
7
1
7
9
9
6
8
1
/
c
o

l
i

_
a
_
0
0
1
3
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

Kallmeyer and Maier

PLCFRS Parsing

other words, it raises the question of whether techniques such as pruning or coarse-

parsing. A first
to-fine parsing (Charniak et al. 2006) would probably be superior to A
implementation of a coarse-to-fine approach has been presented by van Cranenburgh
(2012). He generates a CFG from the treebank PLCFRS, based on the idea of Barth´elemy
et al. (2001). This grammar, which can be seen as a coarser version of the actual PLCFRS,
is then used for pruning of the search space. The problem that van Cranenburgh tackles
is specific to PLCFRS: His PCFG stage generalizes over the distinction of labels by their
fan-out. The merit of his work is an enormous increase in efficiency: Sentences with a
length of up to 40 words can now be parsed in a reasonable time. For a comparison of
the results of van Cranenburgh (2012) with our work, the same version of evaluation
parameters would have to be used. The applicability and effectiveness of other coarse-
to-fine approaches (Charniak et al. 2006; Petrov and Klein 2007) on PLCFRS remain to
be seen.

7. Comparison to Other Approaches

Comparing our results with results from the literature is a difficult endeavor, because
PLCFRS parsing of NeGra is an entirely new task that has no direct equivalent in
previous work. In particular, it is a harder task than PCFG parsing. What we can
provide in this section is a comparison of the performance of our parser on NeGraCFG
to the performance of previously presented PCFG parsers on the same data set and
an overview on previous work on parsing which aims at reconstructing crossing
branches.

For the comparison of the performance of our parser on NeGraCFG, we have per-
formed experiments with Helmut Schmid’s LoPar (Schmid 2000) and with the Stanford
Parser (Klein and Manning 2003c) on NeGraCFG.6 For the experiments both parsers
were provided with gold POS tags. Recall that our parser produced labeled precision,
recall, and F1 of 76.32, 76.46, and 76.34, respectively. The plain PCFG provided by LoPar
delivers lower results (LP 72.86, LR 74.43, and LF1 73.63). The Stanford Parser results
(markovization setting v = 2, h = 1 [Rafferty and Manning 2008], otherwise default
parameters) lie in the vicinity of the results of our parser (LP 74.27, LR 76.19, LF1 75.45).
Although the results for LoPar are no surprise, given the similarity of the models
implemented by our parser and the Stanford parser, it remains to be investigated why
the lexicalization component of the Stanford parser does not lead to better results. In
any case the comparison shows that on a data set without crossing branches, our parser
obtains the results one would expect. A further data set to which we can provide a
comparison is the PaGe workshop experimental data (K ¨ubler and Penn 2008).7 Table 4
lists the results of some of the papers in K ¨ubler and Penn (2008) on TIGER, namely,
for Petrov and Klein (2008) (P&K), who use the Berkeley Parser (Petrov and Klein
2007); Rafferty and Manning (2008) (R&M), who use the Stanford parser (see above);
and Hall and Nivre (2008) (H&N), who use a dependency-based approach (see next
paragraph). The comparison again shows that our system produces good results. Again
the performance gap between the Stanford parser and our parser warrants further
investigation.

6 We have obtained the former parser from http://www.ims.uni-stuttgart.de/tcl/SOFTWARE/

LoPar.html and the latter (Version 2.0.1) from http://nlp.stanford.edu/software/lex-parser.shtml.

7 Thanks to Sandra K ¨ubler for providing us with the experimental data.

113

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

3
9
1
8
7
1
7
9
9
6
8
1
/
c
o

l
i

_
a
_
0
0
1
3
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

Computational Linguistics

Volume 39, Number 1

Table 4
PaGe workshop data.

here

P&K R&M H&N

LP
LR
LF1

66.93
60.79
63.71

69.23
70.41
69.81

58.52
57.63
58.07

67.06
58.07
65.18

As for the work that aims to create crossing branches, Plaehn (2004) obtains 73.16
Labeled F1 using Probabilistic Discontinuous Phrase Structure Grammar (DPSG), albeit
only on sentences with a length of up to 15 words. On those sentences, we obtain 83.97.
The crucial difference between DPSG rules and LCFRS rules is that the former explicitly
specify the material that can occur in gaps whereas LCFRS does not. Levy (2005), like us,
proposes to use LCFRS but does not provide any evaluation results of his work. Very
recently, Evang and Kallmeyer (2011) followed up on our work. They transform the
Penn Treebank such that the trace nodes and co-indexations are converted into crossing
branches and parse them with the parser presented in this article, obtaining promising
results. Furthermore, van Cranenburgh, Scha, and Sangati (2011) and van Cranenburgh
(2012) have also followed up on our work, introducing an integration of our approach
with Data-Oriented Parsing (DOP). The former article introduces an LCFRS adaption
of Goodman’s PCFG-DOP (Goodman 2003). For their evaluation, the authors use the
same data as we do in Maier (2010), and obtain an improvement of roughly 1.5 points
F-measure. They are also confronted with the same efficiency issues, however, and
encounter a bottleneck in terms of parsing time. In van Cranenburgh (2012), a coarse-
to-fine approach is presented (see Section 6.6). With this approach much faster parsing
is made possible and sentences with a length of up to 40 words can be parsed. The cost
of the speed, however, is that the results lie well below the baseline results for standard
PLCFRS parsing.

A comparison with non-projective dependency parsers (McDonald et al. 2005;
Nivre et al. 2007) might be interesting as well, given that non-projectivity is the
dependency-counterpart to discontinuity in constituency parsing. A meaningful com-
parison is difficult to do for the following reasons, however. Firstly, dependency parsing
deals with relations between words, whereas in our case words are not considered in
the parsing task. Our grammars take POS tags for a given and construct syntactic trees.
Also, dependency conversion algorithms generally depend on the correct identification
of linguistic head words (Lin 1995). We cannot rely on grammatical function labels, such
as, for example, Boyd and Meurers (2008). Therefore we would have to use heuristics for
the dependency conversion of the parser output. This would introduce additional noise.
Secondly, the resources one obtains from our PLCFRS parser and from dependency
parsers (the probabilistic LCFRS and the trained dependency parser) are quite different
because the former contains non-lexicalized internal phrase structure identifying mean-
ingful syntactic categories such as VP or NP while the latter is only concerned with rela-
tions between lexical items. A comparison would concentrate only on relations between
lexical items and the rich phrase structure provided by a constituency parser would
not be taken into account. To achieve some comparison, one could of course transform
the discontinuous constituency trees into dependency trees with dependencies between
heads and with edge labels that encode enough of the syntactic structure to retrieve
the original constituency tree (Hall and Nivre 2008). The result could then be used for

114

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

3
9
1
8
7
1
7
9
9
6
8
1
/
c
o

l
i

_
a
_
0
0
1
3
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

Kallmeyer and Maier

PLCFRS Parsing

a dependency evaluation. It is not clear what is to gain by this evaluation because
the head-to-head dependencies one would obtain are not necessarily the predicate-
argument dependencies one would aim at when doing direct dependency parsing
(Rambow 2010).8

8. Conclusion

We have presented the first efficient implementation of a weighted deductive CYK
parser for Probabilistic Linear Context-Free Rewriting Systems (PLCFRS), showing
that LCFRS indeed allows for data-driven parsing while modeling discontinuities in
a straightforward way. To speed up parsing, we have introduced different context-
summary estimates of parse items, some acting as figures-of-merit, others allowing for

parsing. We have implemented the parser and we have evaluated it with grammars
A
extracted from the German NeGra treebank. Our experiments show that data-driven
LCFRS parsing is feasible and yields output of competitive quality.
There are three main directions for future work on this subject.

(cid:1)

(cid:1)

On the symbolic side, LCFRS seems to offer more power than necessary.
By removing symbolic expressivity, a lower parsing complexity can be
achieved. One possibility is to disallow the use of so-called ill-nested
LCFRS rules. These are rules where, roughly, the spans of two right-hand
side non-terminals interleave in a cross-serial way. See the parsing
algorithm in G ´omez-Rodr´ıguez, Kuhlmann, and Satta (2010).
Nevertheless, this seems to be too restrictive for linguistic modeling
(Chen-Main and Joshi 2010; Maier and Lichte 2011). Our goal for future
work is therefore to define reduced forms of ill-nested rules with which we
get a lower parsing complexity.
Another possibility is to reduce the fan-out of the extracted grammar. We
have pursued the question whether the fan-out of the trees in the treebank
can be reduced in a linguistically meaningful way in Maier, Kaeshammer,
and Kallmeyer (2012).

On the side of the probabilistic model, there are certain independence
assumptions made in our model that are too strong. The main problem in
respect is that, due to the definition of LCFRS, we have to distinguish
between occurrences of the same category with different fan-outs. For
instance, VP1 (no gaps), VP2 (one gap), and so on, are different
non-terminals. Consequently, the way they expand are considered
independent from each other. This is of course not true, however.
Furthermore, some of these non-terminals are rather rare; we therefore
have a sparse data problem here. This leads to the idea to separate the
development of a category (independent from its fan-out) and the fan-out
and position of gaps. We plan to integrate this into our probabilistic model
in future work.

8 A way to overcome this difference in the content of the dependency annotation would be to use

an evaluation along the lines of Tsarfaty, Nivre, and Andersson (2011); this is not available yet for
annotations with crossing branches, however.

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

3
9
1
8
7
1
7
9
9
6
8
1
/
c
o

l
i

_
a
_
0
0
1
3
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

115

Computational Linguistics

Volume 39, Number 1

(cid:1)

Last, it is clear that a more informative evaluation of the parser output is
still necessary, particularly with respect to its performance at the task of
finding long distance dependencies and with respect to its behavior when
not provided with gold POS tags.

Acknowledgments
We are particularly grateful to Giorgio Satta
for extensive discussions of the details of the
probabilistic treebank model presented in
this paper. Furthermore, we owe a debt to
Kilian Evang who participated in the
implementation of the parser. Thanks to
Andreas van Cranenburgh for helpful
feedback on the parser implementation.
Finally, we are grateful to our three
anonymous reviewers for many valuable
and helpful comments and suggestions.
A part of the work on this paper was funded
by the German Research Foundation DFG
(Deutsche Forschungsgemeinschaft) in the
form of an Emmy Noether Grant and a
subsequent DFG research project.

References
Barth´elemy, Franc¸ois, Pierre Boullier,

Philippe Deschamp, and ´Eric Villemonte
de la Clergerie. 2001. Guided parsing
of range concatenation languages.
In Proceedings of the 39th Annual Meeting
of the Association for Computational
Linguistics, pages 42–49, Toulouse.
Becker, Tilman, Aravind K. Joshi, and

Owen Rambow. 1991. Long-distance
scrambling and tree-adjoining grammars.
In Proceedings of the Fifth Conference of the
European Chapter of the Association for
Computational Linguistics, pages 21–26,
Berlin.

Boullier, Pierre. 1998a. A generalization of
mildly context-sensitive formalisms.
In Proceedings of the Fourth International
Workshop on Tree Adjoining Grammars and
Related Formalisms (TAG+4), pages 17–20,
Philadelphia, PA.

Boullier, Pierre. 1998b. A proposal for a
natural language processing syntactic
backbone. Technical Report 3342, INRIA,
Roquencourt.

Boullier, Pierre. 2000. Range concatenation
grammars. In Proceedings of the Sixth
International Workshop on Parsing
Technologies (IWPT2000), pages 53–64,
Trento.

Boyd, Adriane. 2007. Discontinuity revisited:
An improved conversion to context-free

116

representations. In the Linguistic
Annotation Workshop at ACL 2007,
pages 41–44, Prague.

Boyd, Adriane and Detmar Meurers.

2008. Revisiting the impact of different
annotation schemes on PCFG parsing:
A grammatical dependency evaluation.
In Proceedings of the Workshop on Parsing
German at ACL 2008, pages 24–32,
Columbus, OH.

Brants, Sabine, Stefanie Dipper, Silvia

Hansen, Wolfgang Lezius, and George
Smith. 2002. The TIGER Treebank.
In Proceedings of the 1st Workshop on
Treebanks and Linguistic Theories,
pages 24–42, Sozopol.

Bunt, Harry. 1996. Formal tools for

describing and processing discontinuous
constituency structure. In Harry Bunt and
Arthur van Horck, editors, Discontinuous
Constituency, volume 6 of Natural Language
Processing. Mouton de Gruyter, Berlin,
pages 63–83.

Cai, Shu, David Chiang, and Yoav Goldberg.

2011. Language-independent parsing
with empty elements. In Proceedings of the
49th Annual Meeting of the Association for
Computational Linguistics: Human Language
Technologies, pages 212–216, Portland, OR.

Candito, Marie and Djam´e Seddah. 2010.

Parsing word clusters. In Proceedings of the
First Workshop on Statistical Parsing of
Morphologically-Rich Languages at NAACL
HLT 2010, pages 76–84, Los Angeles, CA.
Caraballo, Sharon A. and Eugene Charniak.
1998. New figures of merit for best-first
probabilistic chart parsing. Computational
Linguistics, 24(2):275–298.

Charniak, Eugene, Mark Johnson, Micha
Elsner, Joseph Austerweil, David Ellis,
Isaac Haxton, Catherine Hill, R. Shrivaths,
Jeremy Moore, Michael Pozar, and Theresa
Vu. 2006. Multilevel coarse-to-fine PCFG
parsing. In Proceedings of the Human
Language Technology Conference of the
NAACL, Main Conference, pages 168–175,
New York, NY.

Chen-Main, Joan and Aravind Joshi. 2010.
Unavoidable ill-nestedness in natural
language and the adequacy of tree
local-MCTAG induced dependency
structures. In Proceedings of the Tenth

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

3
9
1
8
7
1
7
9
9
6
8
1
/
c
o

l
i

_
a
_
0
0
1
3
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

Kallmeyer and Maier

PLCFRS Parsing

International Workshop on Tree Adjoining
Grammar and Related Formalisms (TAG+10),
pages 119–126, New Haven, CT.

Chiang, David. 2003. Statistical parsing with
an automatically extracted tree adjoining
grammar. In Rens Bod, Remko Scha, and
Khalil Sima’an, editors, Data-Oriented
Parsing, CSLI Studies in Computational
Linguistics. CSLI Publications, Stanford,
CA, pages 299–316.

Collins, Michael. 1999. Head-Driven Statistical
Models for Natural Language Parsing. Ph.D.
thesis, University of Pennsylvania.

Cormen, Thomas H., Charles E. Leiserson,

Ronald L. Rivest, and Clifford Stein. 2003.
Introduction to Algorithms. MIT Press,
Cambridge, 2nd edition.

Dienes, P´eter. 2003. Statistical Parsing with
Non-local Dependencies. Ph.D. thesis,
Saarland University.

Evang, Kilian and Laura Kallmeyer. 2011.

PLCFRS parsing of English discontinuous
constituents. In Proceedings of the 12th
International Conference on Parsing
Technologies, pages 104–116, Dublin.

G ´omez-Rodr´ıguez, Carlos, Marco

Kuhlmann, and Giorgio Satta. 2010.
Efficient parsing of well-nested linear
context-free rewriting systems. In Human
Language Technologies: The 2010 Annual
Conference of the North American Chapter of
the Association for Computational Linguistics,
pages 276–284, Los Angeles, CA.

G ´omez-Rodr´ıguez, Carlos, Marco

Kuhlmann, Giorgio Satta, and David Weir.
2009. Optimal reduction of rule length in
linear context-free rewriting systems. In
Proceedings of Human Language Technologies:
The 2009 Annual Conference of the North
American Chapter of the Association for
Computational Linguistics, pages 539–547,
Boulder, CO.

Goodman, Joshua. 2003. Efficient parsing of
DOP with PCFG-reductions. In Rens Bod,
Remko Scha, and Khalil Sima’an, editors,
Data-Oriented Parsing, CSLI Studies in
Computational Linguistics. CSLI
Publications, Stanford, CA, pages 125–146.

Hall, Johan and Joakim Nivre. 2008.
A dependency-driven parser for
German dependency and constituency
representations. In Proceedings of the
Workshop on Parsing German at ACL 2008,
pages 47–54, Columbus, OH.
Han, Chung-hye, Na-Rae Han, and

Eon-Suk Ko. 2001. Bracketing guidelines
for Penn Korean TreeBank. Technical
Report 01-10, IRCS, University of
Pennsylvania, Philadelphia, PA.

Hockenmaier, Julia. 2003. Data and models
for Statistical Parsing with Combinatory
Categorial Grammar. Ph.D. thesis,
University of Edinburgh.

Hoffman, Beryl. 1995. Integrating “free”
word order syntax and information
structure. In Seventh Conference of the
European Chapter of the Association for
Computational Linguistics, pages 245–251,
Dublin.

H ¨ohle, Tilman. 1986. Der Begriff

“Mittelfeld”—Anmerkungen ¨uber die
Theorie der topologischen Felder.
In Akten des Siebten Internationalen
Germanistenkongresses 1985, G ¨ottingen,
Germany.

Johnson, Mark. 2002. A simple

pattern-matching algorithm for recovering
empty nodes and their antecedents. In
Proceedings of the 40th Annual Meeting of the
Association for Computational Linguistics,
pages 136–143, Philadelphia, PA.
Kallmeyer, Laura. 2010. Parsing Beyond

Context-Free Grammars. Springer, Berlin.
Kallmeyer, Laura and Wolfgang Maier. 2010.
Data-driven parsing with probabilistic
linear context-free rewriting systems.
In Proceedings of the 23rd International
Conference on Computational Linguistics
(COLING 2010), pages 537–545, Beijing.

Kato, Yuki, Hiroyuki Seki, and Tadao
Kasami. 2006. Stochastic multiple
context-free grammar for RNA
pseudoknot modeling. In Proceedings
of the Eighth International Workshop
on Tree Adjoining Grammar and Related
Formalisms (TAG+8), pages 57–64,
Sydney.

Klein, Dan and Christopher D. Manning.

2003a. A* Parsing: Fast exact viterbi parse
selection. In Proceedings of the 2003 Human
Language Technology Conference of the North
American Chapter of the Association for
Computational Linguistics, pages 40–47,
Edmonton.

Klein, Dan and Christopher D. Manning.
2003b. Accurate unlexicalized parsing.
In Proceedings of the 41st Annual Meeting of
the Association for Computational Linguistics,
pages 423–430, Sapporo.

Klein, Dan and Christopher D. Manning.

2003c. Fast exact inference with a factored
model for natural language parsing. In
Advances in Neural Information Processing
Systems 15 (NIPS), pages 3–10, Vancouver.

Kracht, Marcus. 2003. The Mathematics
of Language. Number 63 in Studies in
Generative Grammar. Mouton de Gruyter,
Berlin.

117

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

3
9
1
8
7
1
7
9
9
6
8
1
/
c
o

l
i

_
a
_
0
0
1
3
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

Computational Linguistics

Volume 39, Number 1

K ¨ubler, Sandra. 2005. How do treebank
annotation schemes influence parsing
results? Or how not to compare apples
and oranges. In Recent Advances in Natural
Language Processing 2005 (RANLP 2005),
pages 293–300, Borovets.

K ¨ubler, Sandra and Gerald Penn, editors.
2008. Proceedings of the Workshop on
Parsing German at ACL 2008. Association
for Computational Linguistics,
Columbus, OH.

Kuhlmann, Marco and Giorgio Satta.

2009. Treebank grammar techniques for
non-projective dependency parsing.
In Proceedings of the 12th Conference
of the European Chapter of the Association
for Computational Linguistics,
pages 478–486, Athens.

Levy, Roger. 2005. Probabilistic Models of

Word Order and Syntactic Discontinuity.
Ph.D. thesis, Stanford University.

Levy, Roger and Christopher D. Manning.

2004. Deep dependencies from context-free
statistical parsers: Correcting the surface
dependency approximation. In Proceedings
of the 42nd Meeting of the Association for
Computational Linguistics (ACL’04), Main
Volume, pages 328–335, Barcelona.
Lin, Dekang. 1995. A dependency-based
method for evaluating broad-coverage
parsers. In Proceedings of the 14th
International Joint Conference on Artificial
Intelligence (IJCAI 95), pages 1420–1427,
Montreal.

Maier, Wolfgang. 2010. Direct parsing of
discontinuous constituents in German.
In Proceedings of the First Workshop on
Statistical Parsing of Morphologically-Rich
Languages at NAACL HLT 2010,
pages 58–66, Los Angeles, CA.

Maier, Wolfgang, Miriam Kaeshammer,

and Laura Kallmeyer. 2012. Data-driven
PLCFRS parsing revisited: Restricting
the fan-out to two. In Proceedings of the
Eleventh International Conference on Tree
Adjoining Grammars and Related Formalisms
(TAG+11), pages 126–134, Paris.

Maier, Wolfgang and Laura Kallmeyer. 2010.
Discontinuity and non-projectivity: Using
mildly context-sensitive formalisms for
data-driven parsing. In Proceedings of the
Tenth International Workshop on Tree
Adjoining Grammars and Related
Formalisms (TAG+10), New Haven, CT.
Maier, Wolfgang and Timm Lichte. 2011.

Characterizing discontinuity in constituent
treebanks. In Formal Grammar. 14th
International Conference, FG 2009. Bordeaux,
France, July 25-26, 2009. Revised Selected

118

Papers, volume 5591 of Lecture Notes in
Artificial Intelligence, pages 167–182,
Springer-Verlag, Berlin/Heidelberg/
New York.

Maier, Wolfgang and Anders Søgaard. 2008.
Treebanks and mild context-sensitivity.
In Proceedings of the 13th Conference on
Formal Grammar (FG-2008), pages 61–76,
Hamburg.

Marcus, Mitchell, Grace Kim, Mary Ann
Marcinkiewicz, Robert MacIntyre,
Ann Bies, Mark Ferguson, Karen Katz,
and Britta Schasberger. 1994. The Penn
Treebank: Annotating predicate argument
structure. In Proceedings of the Human
Language Technology Conference,
pages 114–119.

McDonald, Ryan, Fernando Pereira,
Kiril Ribarov, and Jan Hajiˇc. 2005.
Non-projective dependency parsing using
spanning tree algorithms. In Proceedings of
Human Language Technology Conference and
Conference on Empirical Methods in Natural
Language Processing (HLT/EMNLP),
pages 523–530, Vancouver.

Michaelis, Jens. 2001. On Formal Properties
of Minimalist Grammars. Ph.D. thesis,
Universit¨at Potsdam.

M ¨uller, Gereon. 2002. Free word order,

morphological case, and sympathy theory.
In Gisbert Fanselow and Caroline Fery,
editors, Resolving Conflicts in Grammars:
Optimality Theory in Syntax, Morphology,
and Phonology. Buske Verlag, Hamburg,
pages 265–397.

M ¨uller, Stefan. 2004. Continuous or

discontinuous constituents? Research on
Language & Computation, 2(2):209–257.

Nederhof, Mark-Jan. 2003. Weighted

deductive parsing and knuth’s algorithm.
Computational Linguistics, 29(1):135–143.

Nivre, Joakim, Johan Hall, Jens Nilsson,

Atanas Chanev, G ¨ulsen Eryigit,
Sandra K ¨ubler, Svetoslav Marinov,
and Erwin Marsi. 2007. MaltParser:
A language-independent system for
data-driven dependency parsing. Natural
Language Engineering, 13(2):95–135.
Osenova, Petya and Kiril Simov. 2004.
BTB-TR05: BulTreebank Stylebook.
Technical Report 05, BulTreeBank
Project, Sofia, Bulgaria.

Pereira, Fernando C. N. and David Warren.
1983. Parsing as deduction. In Proceedings
of the 21st Annual Meeting of the Association
for Computational Linguistics, pages 137–144,
Cambridge, MA.

Petrov, Slav and Dan Klein. 2007. Improved

inference for unlexicalized parsing.

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

3
9
1
8
7
1
7
9
9
6
8
1
/
c
o

l
i

_
a
_
0
0
1
3
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

Kallmeyer and Maier

PLCFRS Parsing

In Human Language Technologies 2007: The
Conference of the North American Chapter of
the Association for Computational Linguistics;
Proceedings of the Main Conference,
pages 404–411, Rochester, NY.

Petrov, Slav and Dan Klein. 2008. Parsing
German with latent variable grammars.
In Proceedings of the Workshop on Parsing
German at ACL 2008, pages 24–32,
Columbus, OH.

Plaehn, Oliver. 2004. Computing the most

probable parse for a discontinuous
phrase-structure grammar. In Harry Bunt,
John Carroll, and Giorgio Satta, editors,
New Developments in Parsing Technology,
volume 23 of Text, Speech And Language
Technology. Kluwer, Dordrecht,
pages 91–106.

Rafferty, Anna and Christopher D. Manning.
2008. Parsing three German treebanks:
Lexicalized and unlexicalized baselines.
In Proceedings of the Workshop on Parsing
German at ACL 2008, pages 40–46,
Columbus, OH.

Rambow, Owen. 2010. The simple

truth about dependency and phrase
structure representations: An opinion
piece. In Human Language Technologies:
The 2010 Annual Conference of the North
American Chapter of the Association for
Computational Linguistics, pages 337–340,
Los Angeles, CA.

Rehbein, Ines and Josef van Genabith.

2007. Evaluating evaluation measures.
In Proceedings of the 16th Nordic
Conference of Computational Linguistics,
pages 372–379, Tartu.

Schmid, Helmut. 2000. LoPar: Design and
implementation. Arbeitspapiere des
Sonderforschungsbereiches 340 149,
IMS, University of Stuttgart, Stuttgart,
Germany.

Seddah, Djame, Sandra K ¨ubler, and Reut
Tsarfaty, editors. 2010. Proceedings of the
First Workshop on Statistical Parsing of
Morphologically-Rich Languages at
NAACL HLT 2010. Association for
Computational Linguistics,
Los Angeles, CA.

Seki, Hiroyuki, Takahashi Matsumura,

Mamoru Fujii, and Tadao Kasami. 1991.
On multiple context-free grammars.
Theoretical Computer Science, 88(2):191–229.

Shieber, Stuart M., Yves Schabes, and

Fernando C. N. Pereira. 1995. Principles
and implementation of deductive parsing.
Journal of Logic Programming, 24(1–2):3–36.

Sikkel, Klaas. 1997. Parsing Schemata. Texts in
Theoretical Computer Science. Springer,
Berlin, Heidelberg, New York.

Skut, Wojciech, Brigitte Krenn, Thorten
Brants, and Hans Uszkoreit. 1997. An
annotation scheme for free word order
languages. In Proceedings of the Fifth
Conference on Applied Natural Language
Processing (ANLP), pages 88–95,
Washington, DC.

Telljohann, Heike, Erhard W. Hinrichs,

Sandra K ¨ubler, Heike Zinsmeister, and
Kathrin Beck. 2012. Stylebook for the
T ¨ubingen Treebank of Written German
(T ¨uBa-D/Z). Technical report, Seminar f ¨ur
Sprachwissenschaft, Universit¨at T ¨ubingen,
T ¨ubingen, Germany. http://www.sfs.
uni.tuebingen.de/resources/tuebadz-
stylebook-1201.pdf.

Tsarfaty, Reut, Joakim Nivre, and Evelina

Andersson. 2011. Evaluating dependency
parsing: Robust and heuristics-free
cross-annotation evaluation. In Proceedings
of the 2011 Conference on Empirical Methods
in Natural Language Processing,
pages 385–396, Edinburgh.

Uszkoreit, Hans. 1986. Linear precedence
in discontinuous constituents: Complex
fronting in German. CSLI report
CSLI-86-47, Center for the Study of
Language and Information, Stanford
University, Stanford, CA.

van Cranenburgh, Andreas. 2012. Efficient

parsing with linear context-free rewriting
systems. In Proceedings of the 13th
Conference of the European Chapter of the
Association for Computational Linguistics,
pages 460–470, Avignon.

van Cranenburgh, Andreas, Remko Scha,

and Federico Sangati. 2011. Discontinuous
data-oriented parsing: A mildly
context-sensitive all-fragments grammar.
In Proceedings of the Second Workshop on
Statistical Parsing of Morphologically Rich
Languages (SPMRL 2011), pages 34–44,
Dublin.

Versley, Yannick. 2005. Parser evaluation
across text types. In Proceedings of the
Fourth Workshop on Treebanks and Linguistic
Theories, pages 209–220, Barcelona, Spain.

Vijay-Shanker, K., David J. Weir, and

Aravind K. Joshi. 1987. Characterizing
structural descriptions produced by
various grammatical formalisms. In
Proceedings of the 25th Annual Meeting of the
Association for Computational Linguistics,
pages 104–111, Stanford, CA.

119

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

3
9
1
8
7
1
7
9
9
6
8
1
/
c
o

l
i

_
a
_
0
0
1
3
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

3
9
1
8
7
1
7
9
9
6
8
1
/
c
o

l
i

_
a
_
0
0
1
3
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3Data-Driven Parsing using Probabilistic image
Data-Driven Parsing using Probabilistic image
Data-Driven Parsing using Probabilistic image
Data-Driven Parsing using Probabilistic image
Data-Driven Parsing using Probabilistic image
Data-Driven Parsing using Probabilistic image
Data-Driven Parsing using Probabilistic image
Data-Driven Parsing using Probabilistic image
Data-Driven Parsing using Probabilistic image
Data-Driven Parsing using Probabilistic image
Data-Driven Parsing using Probabilistic image
Data-Driven Parsing using Probabilistic image
Data-Driven Parsing using Probabilistic image
Data-Driven Parsing using Probabilistic image
Data-Driven Parsing using Probabilistic image
Data-Driven Parsing using Probabilistic image
Data-Driven Parsing using Probabilistic image
Data-Driven Parsing using Probabilistic image
Data-Driven Parsing using Probabilistic image
Data-Driven Parsing using Probabilistic image

Download pdf