Data-Driven Parsing using Probabilistic - 麻省理工学院人工智能研究专业

Data-Driven Parsing using Probabilistic
Linear Context-Free Rewriting Systems

Laura Kallmeyer
Heinrich-Heine-Universit¨at D ¨usseldorf

∗

Wolfgang Maier
Heinrich-Heine-Universit¨at D ¨usseldorf

∗∗

This paper presents the ﬁrst efﬁcient implementation of a weighted deductive CYK parser for
Probabilistic Linear Context-Free Rewriting Systems (PLCFRSs). LCFRS, an extension of CFG,
can describe discontinuities in a straightforward way and is therefore a natural candidate to be
used for data-driven parsing. To speed up parsing, we use different context-summary estimates
∗
解析. We evaluate our parser with grammars
of parse items, some of them allowing for A
extracted from the German NeGra treebank. Our experiments show that data-driven LCFRS
parsing is feasible and yields output of competitive quality.

1. 介绍

最近, the challenges that a rich morphology poses for data-driven parsing have
received growing interest. A direct effect of morphological richness is, 例如, 数据
sparseness on a lexical level (Candito and Seddah 2010). A rather indirect effect is that
morphological richness often relaxes word order constraints. The principal intuition is
that a rich morphology encodes information that otherwise has to be conveyed by a
particular word order. 如果, 例如, the case of a nominal complement is not provided
by morphology, it has to be provided by the position of the complement relative to other
complements in the sentence. 例子 (1) provides an example of case marking and free
word order in German. 反过来, in free word order languages, word order can encode
information structure (Hoffman 1995).

(1)

这
这

kleine
little

Jungenom
boy

schickt
sends

seiner
他的

Schwesterdat
sister

这
这

Briefacc
letter

乙. Other possible word orders:

(我)

(二)

der kleine Jungenom schickt den Briefacc seiner Schwesterdat

seiner Schwesterdat schickt der kleine Jungenom den Briefacc

(三、) den Briefacc schickt der kleine Jungenom seiner Schwesterdat

∗ Institut f ¨ur Sprache und Information, Universit¨atsstr. 1, D-40225 D ¨usseldorf, 德国.

电子邮件: kallmeyer@phil.uni-duesseldorf.de.

∗∗ Institut f ¨ur Sprache und Information, Universit¨atsstr. 1, D-40225 D ¨usseldorf, 德国.

电子邮件: maierw@hhu.de.

提交材料已收到: 九月 29, 2011; revised submission received: 可能 20, 2012; 接受出版:
八月 3, 2012.

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
/
C
哦

我
我
/

我

A
r
t
我
C
e
–
p
d

F
/

3
9
1
8
7
1
7
9
9
6
8
1
/
C
哦

我
我

_
A
_
0
0
1
3
6
p
d

乙
y
G
你
e
s
t

哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3

计算语言学

体积 39, 数字 1

It is assumed that this relation between a rich morphology and free word order does
not hold in both directions. Although it is generally the case that languages with a rich
morphology exhibit a high degree of freedom in word order, languages with a free word
order do not necessarily have a rich morphology. Two examples for languages with a
very free word order are Turkish and Bulgarian. The former has a very rich and the
latter a sparse morphology. See M ¨uller (2002) for a survey of the linguistics literature on
this discussion.

With a rather free word order, constituents and single parts of them can be displaced
freely within the sentence. 德语, 例如, has a rich inﬂectional system and
allows for a free word order, as we have already seen in Example (1): Arguments can
be scrambled, and topicalizations and extrapositions underlie few restrictions. 康塞-
经常地, discontinuous constituents occur frequently. This is challenging for syntactic
description in general (Uszkoreit 1986; Becker, Joshi, and Rambow 1991; Bunt 1996;
M ¨uller 2004), and for treebank annotation in particular (Skut et al. 1997).

在本文中, we address the problem of data-driven parsing of discontinuous constit-
uents on the basis of German. 在这个部分, we inspect the type of data we have to deal
和, and we describe the way such data are annotated in treebanks. We brieﬂy discuss
different parsing strategies for the data in question and motivate our own approach.

1.1 Discontinuous Constituents

Consider the sentences in Example (2) as examples for discontinuous constituents
(taken from the German NeGra [Skut et al. 1997] and TIGER [Brants et al. 2002] 树-
银行). 例子 (2A) shows several instances of discontinuous VPs and Example (2乙)
shows a discontinuous NP. The relevant constituent is printed in italics.

(2)

Fronting:

(我) Dar ¨uber
muss
Thereof
must
“One must think of that”

nachgedacht
想法

werden.
是

(NeGra)

(二) Ohne internationalen Schaden
Without international damage
distanzieren,
distance
“Bonn could not distance itself from the monument without international
damage.”

von dem Denkmal
from the monument

k ¨onne
可以

Bonn
Bonn

nicht
不是

… (TIGER)

sich
本身

durch die Regelung
through the regulation

(三、) Auch
w ¨urden
还
会
entstehen”. (TIGER)
emerge”
“Apart from that, the regulation would only constantly produce new old cases.”

“st¨andig
“constantly

Altf¨alle
old cases

neue
新的

nur
仅有的

乙. Extraposed relative clauses:

Gel¨ande
terrain

auf
在
k ¨onne,
可以,

deren
他们的
这 …
哪个 . . .

. . . ob
. . . 无论
werden
get
“. . . whether one could build on their premises the type of parking facility,
哪个 . . . ”

Typ von Abstellanlage
type of parking facility

这
这
(NeGra)

gebaut
建成

(我)

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
/
C
哦

我
我
/

我

A
r
t
我
C
e
–
p
d

F
/

3
9
1
8
7
1
7
9
9
6
8
1
/
C
哦

我
我

_
A
_
0
0
1
3
6
p
d

乙
y
G
你
e
s
t

哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3

Kallmeyer and Maier

PLCFRS Parsing

Examples of other such languages are Bulgarian and Korean. Both show discontin-
uous constituents as well. 例子 (3A) is a Bulgarian example of a PP extracted out of
an NP, taken from the BulTreebank (Osenova and Simov 2004), and Example (3乙) is an
example of fronting in Korean, taken from the Penn Korean Treebank (Han, Han, 和
Ko 2001).

(3)

A. Na kyshtata

toi
他

popravi
repaired

pokriva.
roof.

Of house-DET
“It is the roof of the house he repairs.”

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
/
C
哦

我
我
/

我

A
r
t
我
C
e
–
p
d

F
/

3
9
1
8
7
1
7
9
9
6
8
1
/
C
哦

我
我

_
A
_
0
0
1
3
6
p
d

乙
y
G
你
e
s
t

哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3

乙. Gwon.han- ˘ul

ka.ji.go
Authority-OBJ
有
“Who has no authority?”

nu.ga
WHO

iss.ji?
不是?

Discontinuous constituents are by no means limited to languages with freedom
in word order. They also occur in languages with a rather ﬁxed word order such
as English, resulting from, 例如, long-distance movements. Examples (4A) 和
(4乙) are examples from the Penn Treebank for long extractions resulting in discontin-
uous S categories and for discontinuous NPs arising from extraposed relative clauses,
分别 (马库斯等人。. 1994).

(4)

Long Extraction in English:

(我)

Those chains include Bloomingdale’s, which Campeau recently said it
will sell.

(二) What should I do.

乙. Extraposed nominal modiﬁers (relative clauses and PPs) in English:

(我)

They sow a row of male-fertile plants nearby, which then pollinate the male-
sterile plants.

(二) Prices fell marginally for fuel and electricity.

1.2 Treebank Annotation and Data-Driven Parsing

Most constituency treebanks rely on an annotation backbone based on Context-Free
语法 (CFG). Discontinuities cannot be modeled with CFG, because they require a
larger domain of locality than the one offered by CFG. 所以, the annotation back-
bone based on CFG is generally augmented with a separate mechanism that accounts
for the non-local dependencies. In the Penn Treebank (PTB), 例如, trace nodes
and co-indexation markers are used in order to establish additional implicit edges in the
tree beyond the overt phrase structure. In T ¨uBa-D/Z (Telljohann et al. 2012), a German
树库, non-local dependencies are expressed via an annotation of topological ﬁelds
(H ¨ohle 1986) and special edge labels. 相比之下, some other treebanks, 他们之中
NeGra and TIGER, give up the annotation backbone based on CFG and allow annota-
tion with crossing branches (Skut et al. 1997). In such an annotation, non-local depen-
dencies can be expressed directly by grouping all dependent elements under a single
node. Note that both crossing branches and traces annotate long-distance dependencies
in a linguistically meaningful way. A difference is, 然而, that crossing branches
are less theory-dependent because they do not make any assumptions about the base
positions of “moved” elements.

Examples for the different approaches of annotating discontinuities are given in
人物 1 和 2. 数字 1 shows the NeGra annotation of Example (2a-i) (左边), 和

计算语言学

体积 39, 数字 1

数字 1
A discontinuous constituent. Original NeGra annotation (左边) and a T ¨uBa-D/Z-style annotation
(正确的).

SBARQ

SBJ

副总裁

SBJ

WHNP

*T*

WHNP

什么
WP

应该
医学博士

我
PRP

做
VB

*T*
-NONE-

?
.

什么
WP

应该
医学博士

我
PRP

做
VB

?
.

数字 2
A discontinuous wh-movement. Original PTB annotation (左边) and NeGra-style annotation
(正确的).

annotation of the same sentence in the style of the T ¨uBa-D/Z treebank (正确的). 数字 2
shows the PTB annotation of Example (4a-ii) (on the left, note that the directed edge
from the trace to the WHNP element visualizes the co-indexation) together with a
NeGra-style annotation of the same sentence (正确的).

在过去, data-driven parsing has largely been dominated by Probabilistic
Context-Free Grammar (PCFG). In order to extract a PCFG from a treebank, the trees
need to be interpretable as CFG derivations. 最后, most work has excluded
non-local dependencies; 任何一个 (in PTB-like treebanks) by discarding labeling conven-
tions such as the co-indexation of the trace nodes in the PTB, 或者 (in NeGra/TIGER-like
treebanks) by applying tree transformations, which resolve the crossing branches (例如,
K ¨ubler 2005; Boyd 2007). Especially for the latter treebanks, such a transformation is
problematic, because it generally is non-reversible and implies information loss.

Discontinuities are no minor phenomenon: Approximately 25% of all sentences
in NeGra and TIGER have crossing branches (Maier and Lichte 2011). In the Penn
树库, this holds for approximately 20% of all sentences (Evang and Kallmeyer
2011). This shows that it is important to properly treat such structures.

1.3 Extending the Domain of Locality

在文献中, different methods have been explored that allow for the use of non-
local information in data-driven parsing. We distinguish two classes of approaches.
The ﬁrst class consists of approaches that aim at using formalisms which produce
trees without crossing branches but provide a larger domain of locality than CFG—
例如, through complex labels (Hockenmaier 2003) or through the derivation

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
/
C
哦

我
我
/

我

A
r
t
我
C
e
–
p
d

F
/

3
9
1
8
7
1
7
9
9
6
8
1
/
C
哦

我
我

_
A
_
0
0
1
3
6
p
d

乙
y
G
你
e
s
t

哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3

Kallmeyer and Maier

PLCFRS Parsing

CFG:

LCFRS:

•

γ1

γ2

γ3

数字 3
Different domains of locality.

机制 (蒋 2003). The second class, to which we contribute in this paper,
consists of approaches that aim at producing trees which contain non-local information.
Some methods realize the reconstruction of non-local information in a post- or pre-
processing step to PCFG parsing (约翰逊 2002; Dienes 2003; Levy and Manning 2004;
Cai, 蒋, and Goldberg 2011). Other work uses formalisms that accommodate the
direct encoding of non-local information (Plaehn 2004; 征收 2005). We pursue the latter
方法.

Our work is motivated by the following recent developments. Linear Context-Free
Rewriting Systems (LCFRSs) (Vijay-Shanker, Weir, 和乔希 1987) have been estab-
lished as a candidate for modeling both discontinuous constituents and non-projective
dependency trees as they occur in treebanks (Maier and Søgaard 2008; Kuhlmann and
Satta 2009; Maier and Lichte 2011). LCFRSs are a natural extension of CFGs where
the non-terminals can span tuples of possibly non-adjacent strings (见图 3). 是-
cause LCFRSs allow for binarization and CYK chart parsing in a way similar to CFGs,
PCFG techniques, such as best-ﬁrst parsing (Caraballo and Charniak 1998), weighted
∗
解析 (Klein and Manning 2003a) 能
deductive parsing (Nederhof 2003), 和一个
be transferred to LCFRS. 最后, as mentioned before, languages such as German
have recently attracted the interest of the parsing community (K ¨ubler and Penn 2008;
Seddah, K ¨ubler, and Tsarfaty 2010).

We bring together these developments by presenting a parser for Probabilistic
LCFRS (PLCFRS), continuing the promising work of Levy (2005). Our parser pro-
duces trees with crossing branches and thereby accounts for syntactic long-distance
dependencies while not making any additional assumptions concerning the position
of hypothetical traces. We have implemented a CYK parser and we present several
methods for context summary estimation of parse items. The estimates either act as
解析. A test on
ﬁgures-of-merit in a best-ﬁrst parsing context or as estimates for A
a real-world-sized data set shows that our parser achieves competitive results. To our
知识, our parser is the ﬁrst for the entire class of PLCFRS that has successfully
been used for data-driven parsing.1

∗

The paper is structured as follows. 部分 2 introduces probabilistic LCFRS. 秒-
系统蒸发散 3 和 4 present the binarization algorithm, the parser, and the outside estimates
which we use to speed up parsing. 在部分 5 we explain how to extract an LCFRS from
a treebank and we present grammar reﬁnement methods for these speciﬁc treebank
语法. 最后, 部分 6 presents evaluation results and Section 7 compares our
work to other approaches.

1 Parts of the results presented in this paper have been presented earlier. 更确切地说, in Kallmeyer and
Maier (2010), we presented the general architecture of the parser and all outside estimates except the LN
estimate from Section 4.4 which is presented in Maier, Kaeshammer, and Kallmeyer (2012). In Maier and
Kallmeyer (2010) we have presented experiments with the relative clause split from Section 3.2. 最后,
Maier (2010) contains the evaluation of the baseline (together with an evaluation using other metrics).

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
/
C
哦

我
我
/

我

A
r
t
我
C
e
–
p
d

F
/

3
9
1
8
7
1
7
9
9
6
8
1
/
C
哦

我
我

_
A
_
0
0
1
3
6
p
d

乙
y
G
你
e
s
t

哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3

计算语言学

体积 39, 数字 1

2. Probabilistic Linear Context-Free Rewriting Systems

2.1 Deﬁnition of PLCFRS

LCFRS (Vijay-Shanker, Weir, 和乔希 1987) is an extension of CFG in which a non-
terminal can span not only a single string but a tuple of strings of size k ≥ 1. k is thereby
called its fan-out. We will notate LCFRS with the syntax of Simple Range Concate-
nation Grammars (SRCG) (Boullier 1998b), a formalism that is equivalent to LCFRS.
A third formalism that is equivalent to LCFRS is Multiple Context-Free Grammar
(MCFG) (Seki et al. 1991).

Deﬁnition 1 (LCFRS)

A Linear Context-Free Rewriting System (LCFRS) is a tuple (西德:4)氮, 时间, V, 磷, S(西德:5) 在哪里

乙)

N is a ﬁnite set of non-terminals with a function dim: N → N that
determines the fan-out of each A ∈ N;

T and V are disjoint ﬁnite sets of terminals and variables;
S ∈ N is the start symbol with dim(S) = 1;

P is a ﬁnite set of rules

A(α1, . . . , αdim(A)) → A1(X(1)

1 , . . . , X(1)

dim(A1 )) · · · Am(X(米)

, . . . , X(米)

dim(是 ))

for m ≥ 0 where A, A1, . . . , 是
为了 1 ≤ i ≤ dim(A). For all r ∈ P, it holds that every
and αi
variable X occurring in r occurs exactly once in the left-hand side and
exactly once in the right-hand side of r.

∈ V for 1 ≤ i ≤ m, 1 ≤ j ≤ dim(Ai)

ε N, X(我)
j

∈ (T ∪ V)

∗

A rewriting rule describes how the yield of the left-hand side non-terminal can be
computed from the yields of the right-hand side non-terminals. The rules A(ab, 光盘) → ε
和一个(aXb, cYd) → A(X, 是) from Figure 4 for instance specify that (1) (西德:4)ab, 光盘(西德:5) is in the
yield of A and (2) one can compute a new tuple in the yield of A from an already existing
one by wrapping a and b around the ﬁrst component and c and d around the second.
A CFG rule A → BC would be written A(XY) → B(X)C(是) as an LCFRS rule.

Deﬁnition 2 (Yield, 语言)
Let G = (西德:4)氮, 时间, V, 磷, S(西德:5) be an LCFRS.

For every A ∈ N, we deﬁne the yield of A, yield(A) as follows:

For every rule A((西德:1)A) → ε, (西德:1)α ∈ yield(A);

A(ab, 光盘) → ε

A(aXb, cYd) → A(X, 是)
S(XY) → A(X, 是)

数字 4
Sample LCFRS for {anbncndn | n ≥ 1}.

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
/
C
哦

我
我
/

我

A
r
t
我
C
e
–
p
d

F
/

3
9
1
8
7
1
7
9
9
6
8
1
/
C
哦

我
我

_
A
_
0
0
1
3
6
p
d

乙
y
G
你
e
s
t

哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3

Kallmeyer and Maier

PLCFRS Parsing

乙)

(我)

(二)

(三、)

For every rule A(α1, . . . , αdim(A)) → A1(X(1)

, . . . , X(米)

是(X(米)
(西德:4) F (α1), . . . , F (αdim(A))(西德:5) ∈ yield(A) where f is deﬁned as follows:

dim(是 )) and for all (西德:1)τi

dim(A1 )) ···
1 , . . . , X(1)
∈ yield(Ai) (1 ≤ i ≤ m):

F (t) = t for all t ∈ T,
F (X(我)

j ) = (西德:1)τi(j) 对全部 1 ≤ i ≤ m, 1 ≤ j ≤ dim(Ai) 和

F (xy) = f (X)F (y) for all x, y ∈ (T ∪ V)+.

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
/
C
哦

我
我
/

我

A
r
t
我
C
e
–
p
d

F
/

3
9
1
8
7
1
7
9
9
6
8
1
/
C
哦

我
我

_
A
_
0
0
1
3
6
p
d

乙
y
G
你
e
s
t

哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3

We call f the composition function of the rule.

Nothing else is in yield(A).

The language of G is then L(G) = {w | (西德:4)w(西德:5) ∈ yield(S)}.

举个例子, consider again the LCFRS in Figure 4. The last rule tells us that,
given a pair in the yield of A, we can obtain an element in the yield of S by concate-
nating the two components. 最后, the language generated by this grammar is
{anbncndn | n ≥ 1}.

The terms of grammar fan-out and rank and the properties of monotonicity and
ε-freeness will be referred to later and are therefore introduced in the following deﬁni-
的. They are taken from the LCFRS/MCFG terminology; the SRCG term for fan-out is
arity and the property of being monotone is called ordered in the context of SRCG.

Deﬁnition 3
Let G = (西德:4)氮, 时间, V, 磷, S(西德:5) be an LCFRS.

The fan-out of G is the maximal fan-out of all non-terminals in G.
此外, the right-hand side length of a rewriting rule r ∈ P is called
the rank of r and the maximal rank of all rules in P is called the rank of G.
G is monotone if for every r ∈ P and every right-hand side non-terminal A
in r and each pair X1, X2 of arguments of A in the right-hand side of r, X1
precedes X2 in the right-hand side iff X1 precedes X2 in the left-hand side.
A rule r ∈ P is called an ε-rule if one of the left-hand side components of r
is ε.
G is ε-free if it either contains no ε-rules or there is exactly one ε-rule
S(ε) → ε and S does not appear in any of the right-hand sides of the rules
in the grammar.

For every LCFRS there exists an equivalent LCFRS that is ε-free (Seki et al. 1991;

Boullier 1998a) and monotone (Michaelis 2001; Kracht 2003; Kallmeyer 2010).

The deﬁnition of a probabilistic LCFRS is a straightforward extension of the deﬁni-

tion of PCFG and thus it follows (征收 2005; Kato, Seki, and Kasami 2006) 那:

Deﬁnition 4 (PLCFRS)
A probabilistic LCFRS (PLCFRS) is a tuple (西德:4)氮, 时间, V, 磷, S, p(西德:5) 这样 (西德:4)氮, 时间, V, 磷, S(西德:5) is an
LCFRS and p : P → [0..1] a function such that for all A ∈ N:

ΣA((西德:1)X)→(西德:1)Φ∈Pp(A((西德:1)X) → (西德:1)Φ) = 1

计算语言学

体积 39, 数字 1

PLCFRS with non-terminals {S, A, 乙}, terminals {A} and start symbol S:
0.2 : S(X) → A(X)
0.7 : A(aX) → A(X)
0.8 : 乙(aX, aY) → B(X, 是)
数字 5
Sample PLCFRS.

0.8 : S(XY) → B(X, 是)
0.3 : A(A) → ε
0.2 : 乙(A, A) → ε

举个例子, consider the PLCFRS in Figure 5. This grammar simply generates
a+. Words with an even number of as and nested dependencies are more probable
than words with a right-linear dependency structure. 例如, the word aa receives
the two analyses in Figure 6. The analysis (A) displaying nested dependencies has
probability 0.16 和 (乙) (right-linear dependencies) has probability 0.042.

3. Parsing PLCFRS

3.1 Binarization

Similarly to the transformation of a CFG into Chomsky normal form, an LCFRS can be
binarized, resulting in an LCFRS of rank 2. As in the CFG case, in the transformation,
we introduce a non-terminal for each right-hand side longer than 2 and split the rule
into two rules, using this new intermediate non-terminal. This is repeated until all
right-hand sides are of length 2. The transformation algorithm is inspired by G ´omez-
Rodr´ıguez et al. (2009) and it is also speciﬁed in Kallmeyer (2010).

3.1.1 General Binarization. In order to give the algorithm for this transformation, 我们
need the notion of a reduction of a vector (西德:1)α ∈ [(T ∪ V)
]i by a vector (西德:1)x ∈ Vj where all
variables in (西德:1)x occur in (西德:1)A. A reduction is, 大致, obtained by keeping all variables in (西德:1)A
that are not in (西德:1)X. This is deﬁned as follows:

∗

Deﬁnition 5 (Reduction)
Let (西德:4)氮, 时间, V, 磷, S(西德:5) be an LCFRS, (西德:1)α ∈ [(T ∪ V)

∗

]i and (西德:1)x ∈ Vj for some i, j ∈ IN.

Let w = (西德:1)α1$ . . . $(西德:1)αi be the string obtained from concatenating the components of (西德:1)A,

separated by a new symbol $ /∈ (V ∪ T). (西德:4) Let w be the image of w under a homomorphism h deﬁned as follows: H(A) = $ 为了

all a ∈ T, H(X) = $ for all X ∈ {(西德:1)x1, . . . (西德:1)xj ∈ V+ such that w Let y1, . . . ym } and h(y) = y in all other cases. ∗ ∗ (西德:4) ∈ $
y1$+y2$+ . . . $+ym$

. Then the vector

(西德:4)y1, . . . ym

(西德:5) is the reduction of (西德:1)α by (西德:1)X.

例如, (西德:4)aX1, X2, bX3
(西德:5) yields (西德:4)X1, X3

duced with (西德:4)X2

(西德:5) reduced with (西德:4)X2
(西德:5) 还有.

(西德:5) yields (西德:4)X1, X3

(西德:5) 和 (西德:4)aX1X2bX3

(西德:5) 关于-

乙

(A)

(乙)

数字 6
The two derivations of aa.

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
/
C
哦

我
我
/

我

A
r
t
我
C
e
–
p
d

F
/

3
9
1
8
7
1
7
9
9
6
8
1
/
C
哦

我
我

_
A
_
0
0
1
3
6
p
d

乙
y
G
你
e
s
t

哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3

Kallmeyer and Maier

PLCFRS Parsing

for all rules r = A((西德:1)A) → A0( (西德:1)α0) . . . 是( (西德:1)αm) in P with m > 1 做

remove r from P
右 := ∅
pick new non-terminals C1, . . . , Cm−1
add the rule A((西德:1)A) → A0( (西德:1)α0)C1( (西德:1)γ1) to R where (西德:1)γ1 is obtained by reducing (西德:1)α with (西德:1)α0
for all i, 1 ≤ i ≤ m − 2 做

add the rule Ci( (西德:1)γi) → Ai( (西德:1)αi)Ci+1( (西德:1)γi+1) to R where (西德:1)γi+1 is obtained by reducing (西德:1)γi with (西德:1)αi

end for
add the rule Cm−1( (西德:1)γm−2) → Am−1( (西德:1)αm−1)是( (西德:1)αm) to R
for every rule r

(西德:4) ∈ R do

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
/
C
哦

我
我
/

我

A
r
t
我
C
e
–
p
d

F
/

3
9
1
8
7
1
7
9
9
6
8
1
/
C
哦

我
我

_
A
_
0
0
1
3
6
p
d

乙
y
G
你
e
s
t

哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3

replace right-hand side arguments of length > 1 with new variables (in both sides) 和
add the result to P

end for

数字 7
Algorithm for binarizing an LCFRS.

The binarization algorithm is given in Figure 7. As already mentioned, it proceeds
like the CFG binarization algorithm in the sense that for right-hand sides longer than
2, we introduce a new non-terminal that covers the right-hand side without the ﬁrst
element. 数字 8 shows an example. 在这个例子中, there is only one rule with a right-
hand side longer than 2. In a ﬁrst step, we introduce the new non-terminals and rules
that binarize the right-hand side. This leads to the set R. In a second step, before adding
the rules from R to the grammar, whenever a right-hand side argument contains several
变量, these are collapsed into a single new variable.

The equivalence of the original LCFRS and the binarized grammar is rather straight-

向前. 笔记, 然而, that the fan-out of the LCFRS can increase.

The binarization depicted in Figure 7 is deterministic in the sense that for every rule
that needs to be binarized, we choose unique new non-terminals. 之后, in Section 5.3.1,
we will introduce additional factorization into the grammar rules that reduces the set
of new non-terminals.

3.1.2 Minimizing Fan-Out and Number of Variables. In LCFRS, in contrast to CFG, the order
of the right-hand side elements of a rule does not matter for the result of a derivation.

Original LCFRS:

S(XYZUVW) → A(X, U)乙(是, V)C(Z, 瓦)
A(aX, aY) → A(X, 是)
乙(bX, bY) → B(X, 是)
C(cX, cY) → C(X, 是)

A(A, A) → ε
乙(乙, 乙) → ε
C(C, C) → ε

Rule with right-hand side of length > 2: S(XYZUVW) → A(X, U)乙(是, V)C(Z, 瓦)
For this rule, we obtain
R = {S(XYZUVW) → A(X, U)C1(YZ, VW), C1(YZ, VW) → B(是, V)C(Z, 瓦)}

Equivalent binarized LCFRS:
S(XPUQ) → A(X, U)C1(磷, 问)
C1(YZ, VW) → B(是, V)C(Z, 瓦)
A(aX, aY) → A(X, 是)
乙(bX, bY) → B(X, 是)
C(cX, cY) → C(X, 是)

A(A, A) → ε
乙(乙, 乙) → ε
C(C, C) → ε

数字 8
Sample binarization of an LCFRS.

计算语言学

体积 39, 数字 1

所以, we can reorder the right-hand side of a rule before binarizing it. 在里面
following, we present a binarization order that yields a minimal fan-out and a minimal
variable number per production and binarization step. The algorithm is inspired by
G ´omez-Rodr´ıguez et al. (2009) and has ﬁrst been published in this version in Kallmeyer
(2010). We assume that we are only considering partitions of right-hand sides where one
of the sets contains only a single non-terminal.

For a given rule c = A0( (西德:1)x0) → A1( (西德:1)x1) . . . 和( (西德:1)xk), we deﬁne the characteristic string
s(C, Ai) of the Ai-reduction of c as follows: Concatenate the elements of (西德:1)x0, separated with
new additional symbols $ while replacing every component from (西德:1)xi with a $. We then
deﬁne the arity of the characteristic string, dim(s(C, Ai)), as the number of maximal sub-
strings x ∈ V+ in s(Ai). 拿, 例如, a rule c = VP(X, YZU) → VP(X, Z)V(是)氮(U).
Then s(C, 副总裁) =$$Y$U, s(C, V) = X$$ZU.

数字 9 shows how in a ﬁrst step, for a given rule r with right-hand side length > 2,
we determine the optimal candidate for binarization based on the characteristic string
s(r, 乙) of some right-hand side non-terminal B and on the fan-out of B: On all right-
hand side predicates B we check for the maximal fan-out (given by dim(s(r, 乙))) 和
number of variables (dim(s(r, 乙)) + dim(乙)) we would obtain when binarizing with this
predicate. This check provides the optimal candidate. In a second step we then perform
the same binarization as before, except that we use the optimal candidate now instead
of the ﬁrst element of the right-hand side.

3.2 The Parser

We can assume without loss of generality that our grammars are ε-free and monotone
(the treebank grammars with which we are concerned all have these properties) 然后
they contain only binary and unary rules. 此外, we assume POS tagging to be
done before parsing. POS tags are non-terminals of fan-out 1. 最后, according to our
grammar extraction algorithm (参见章节 5.1), a separation between two components
always means that there is actually a non-empty gap in between them. 最后,
two different components in a right-hand side can never be adjacent in the same
component of the left-hand side. The rules are then either of the form A(A) → ε with A a
POS tag and a ∈ T or of the form A((西德:1)X) → B((西德:1)X) or A((西德:1)A) → B((西德:1)X)C((西德:1)y) 在哪里 (西德:1)α ∈ (V+)dim(A),
(西德:1)x ∈ Vdim(乙), (西德:1)y ∈ Vdim(C), 那是, only the rules for POS tags contain terminals in their left-
hand sides.

cand = 0
fan-out = number of variables in r
vars = number of variables in r
for all i = 0 to m do

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
/
C
哦

我
我
/

我

A
r
t
我
C
e
–
p
d

F
/

3
9
1
8
7
1
7
9
9
6
8
1
/
C
哦

我
我

_
A
_
0
0
1
3
6
p
d

乙
y
G
你
e
s
t

哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3

cand-fan-out = dim(s(r, Ai));
if cand-fan-out < fan-out and dim(Ai) < fan-out then fan-out = max({cand-fan-out, dim(Ai)}); vars = cand-fan-out + dim(Ai); cand = i; fan-out = max({cand-fan-out, dim(Ai)}); vars = cand-fan-out + dim(Ai); cand = i else if cand-fan-out ≤ fan-out, dim(Ai) ≤ fan-out and cand-fan-out + dim(Ai) < vars then end if end for Figure 9 Optimized version of the binarization algorithm, determining binarization order. 96 Kallmeyer and Maier PLCFRS Parsing During parsing we have to link the terminals and variables in our LCFRS rules to portions of the input string. For this purpose we need the notions of ranges, range vectors, and rule instantiations. A range is a pair of indices that characterizes the span of a component within the input. A range vector characterizes a tuple in the yield of a non-terminal. A rule instantiation speciﬁes the computation of an element from the left- hand side yield from elements in the yields of the right-hand side non-terminals based on the corresponding range vectors. Deﬁnition 6 (Range) Let w ∈ T ∗ with w = w1 . . . wn where wi ∈ T for 1 ≤ i ≤ n. Pos(w) := {0, . . . , n}. 1. 2. We call a pair (cid:4)l, r(cid:5) ∈ Pos(w) × Pos(w) with l ≤ r a range in w. Its yield 3. 4. (cid:4)l, r(cid:5)(w) is the substring wl+1 . . . wr. For two ranges ρ1 = (cid:4)l1, r1 of ρ1 and ρ2 is ρ1 A (cid:1)ρ ∈ (Pos(w) × Pos(w))k is a k-dimensional range vector for w iff (cid:5) is a range in w for 1 ≤ i ≤ k. (cid:1)ρ = (cid:4)(cid:4)l1, r1 · ρ2 is undeﬁned. (cid:5); otherwise ρ1 (cid:5)(cid:5) where (cid:4)li, ri (cid:5), ρ2 = (cid:4)l2, r2 · ρ2 = (cid:4)l1, r2 (cid:5), . . . , (cid:4)lk, rk (cid:5), if r1 = l2, then the concatenation We now deﬁne instantiations of rules with respect to a given input string. This deﬁnition follows the deﬁnition of clause instantiations from Boullier (2000). An in- stantiated rule is a rule in which variables are consistently replaced by ranges. Because we need this deﬁnition only for parsing our speciﬁc grammars, we restrict ourselves to ε-free rules containing only variables. Deﬁnition 7 (Rule instantiation) Let G = (N, T, V, P, S) be an ε-free monotone LCFRS. For a given rule r = A((cid:1)α) → A1( (cid:1)x1) · · · Am( (cid:1)xm) ∈ P (0 < m) that does not contain any terminals, 1. 2. an instantiation with respect to a string w = t1 . . . tn consists of a function f : V → {(cid:4)i, j(cid:5) | 1 ≤ i ≤ j ≤ |w|} such that for all x, y adjacent in one of the elements of (cid:1)α, f (x) · f (y) must be deﬁned; we then deﬁne f (xy) = f (x) · f (y), if f is an instantiation of r, then A( f ((cid:1)α)) → A1( f ( (cid:1)x1)) · · · Am( f ( (cid:1)xm)) is an instantiated rule where f ((cid:4)x1, . . . , xk (cid:5)) = (cid:4) f (x1), . . . , f (xk)(cid:5). We use a probabilistic version of the CYK parser from Seki et al. (1991). The algo- rithm is formulated using the framework of parsing as deduction (Pereira and Warren 1983; Shieber, Schabes, and Pereira 1995; Sikkel 1997), extended with weights (Nederhof 2003). In this framework, a set of weighted items representing partial parsing results is characterized via a set of deduction rules, and certain items (the goal items) represent successful parses. During parsing, we have to match components in the rules we use with portions of the input string. For a given input w, our items have the form [A, (cid:1)ρ] where A ∈ N and (cid:1)ρ is a range vector that characterizes the span of A. Each item has a weight in that encodes the Viterbi inside score of its best parse tree. More precisely, we use the log probability log(p) where p is the probability. The ﬁrst rule (scan) tells us that the POS tags that we receive as inputs are given. Consequently, they are axioms; their probability is 1 and their weight therefore 0. The 97 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 9 1 8 7 1 7 9 9 6 8 1 / c o l i _ a _ 0 0 1 3 6 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 39, Number 1 Scan: 0 : [A, (cid:4)(cid:4)i, i + 1(cid:5)(cid:5)] A is the POS tag of wi+1 Unary: in : [B, (cid:1)ρ] in + log(p) : [A, (cid:1)ρ] p : A((cid:1)α) → B((cid:1)α) ∈ P inB : [B, (cid:1)ρB], inC : [C, (cid:1)ρC] inB + inC + log(p) : [A, (cid:1)ρA] Binary: Goal: [S, (cid:4)(cid:4)0, n(cid:5)(cid:5)] Figure 10 Weighted CYK deduction system. p : A( (cid:1)ρA ) → B( (cid:1)ρB )C( (cid:1)ρC ) is an instantiated rule second rule, unary, is applied whenever we have found the right-hand side of an instantiation of a unary rule. In our grammar, terminals only occur in rules with POS tags and the grammar is ordered and ε-free. Therefore, the components of the yield of the right-hand side non-terminal and of the left-hand side terminals are the same. The rule binary applies an instantiated rule of rank 2. If we already have the two elements of the right-hand side, we can infer the left-hand side element. In both cases, unary and binary, the probability p of the new rule is multiplied with the probabilities of the antecedent items (which amounts to summing up the antecedent weights and log(p)). We perform weighted deductive parsing, based on the deduction system from Figure 10. We use a chart C and an agenda A, both initially empty, and we proceed as in Figure 11. Because for all our deduction rules, the weight functions f that compute the weight of a consequent item from the weights of the antecedent items are monotone non-increasing in each variable, the algorithm will always ﬁnd the best parse without the need of exhaustive parsing. All new items that we deduce involve at least one of the agenda items as an antecedent item. Therefore, whenever an item is the best in the agenda, we can be sure that we will never ﬁnd an item with a better (i.e., higher) weight. Consequently, we can safely store this item in the chart and, if it is a goal item, we have found the best parse. As an example consider the development of the agenda and the chart in Figure 12 when parsing aa with the PLCFRS from Figure 5, transformed into a PLCFRS with pre-terminals and binarization (i.e., with a POS tag Ta and a new binarization non- (cid:4) terminal B ). The new PLCFRS is given in Figure 13. In this example, we ﬁnd a ﬁrst analysis for the input (a goal item) when combining an A with span (cid:4)(cid:4)0, 2(cid:5)(cid:5) into an S. This S has however a rather low probability and is therefore not on top of the agenda. Later, when ﬁnding the better analysis, the weight add SCAN results to A while A (cid:19)= ∅ remove best item x : I from A add x : I to C if I goal item then stop and output true else for all y : I (cid:4) deduced from x : I and items in C: (cid:4) ∈ C ∪ A (cid:4) if there is no z with z : I then add y : I else if z : I (cid:4) ∈ A for some z (cid:4) then update weight of I to A in A to max(y, z) Figure 11 Weighted deductive parsing. 98 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 9 1 8 7 1 7 9 9 6 8 1 / c o l i _ a _ 0 0 1 3 6 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Kallmeyer and Maier PLCFRS Parsing chart 0 : [Ta, (cid:4)0, 1(cid:5)] 0 : [Ta, (cid:4)0, 1(cid:5)], 0 : [Ta, (cid:4)1, 2(cid:5)] 0 : [Ta, (cid:4)0, 1(cid:5)], 0 : [Ta, (cid:4)1, 2(cid:5)], −0.5 : [A, (cid:4)0, 1(cid:5)] 0 : [Ta, (cid:4)0, 1(cid:5)], 0 : [Ta, (cid:4)1, 2(cid:5)], 0.5 : [A, (cid:4)0, 1(cid:5)], −0.5 : [A, (cid:4)1, 2(cid:5)] 0 : [Ta, (cid:4)0, 1(cid:5)], 0 : [Ta, (cid:4)1, 2(cid:5)], −0.5 : [A, (cid:4)0, 1(cid:5)], −0.5 : [A, (cid:4)1, 2(cid:5)], −0.65 : [A, (cid:4)0, 2(cid:5)] 0 : [Ta, (cid:4)0, 1(cid:5)], 0 : [Ta, (cid:4)1, 2(cid:5)], −0.5 : [A, (cid:4)0, 1(cid:5)], −0.5 : [A, (cid:4)1, 2(cid:5)], −0.65 : [A, (cid:4)0, 2(cid:5)], −0.7 : [B, (cid:4)0, 1(cid:5), (cid:4)1, 2(cid:5)] agenda 0 : [Ta, (cid:4)0, 1(cid:5)], 0 : [Ta, (cid:4)1, 2(cid:5)] 0 : [Ta, (cid:4)1, 2(cid:5)], −0.5 : [A, (cid:4)0, 1(cid:5)] −0.5 : [A, (cid:4)0, 1(cid:5)], −0.5 : [A, (cid:4)1, 2(cid:5)], −0.7 : [B, (cid:4)0, 1(cid:5), (cid:4)1, 2(cid:5)] −0.5 : [A, (cid:4)1, 2(cid:5)], −0.7 : [B, (cid:4)0, 1(cid:5), (cid:4)1, 2(cid:5)], −1.2 : [S, (cid:4)0, 1(cid:5)] −0.65 : [A, (cid:4)0, 2(cid:5)], −0.7 : [B, (cid:4)0, 1(cid:5), (cid:4)1, 2(cid:5)], −1.2 : [S, (cid:4)0, 1(cid:5)], −1.2 : [S, (cid:4)1, 2(cid:5)] −0.7 : [B, (cid:4)0, 1(cid:5), (cid:4)1, 2(cid:5)], −1.2 : [S, (cid:4)0, 1(cid:5)], −1.2 : [S, (cid:4)1, 2(cid:5)], −1.35 : [S, (cid:4)0, 2(cid:5)] −0.8 : [S, (cid:4)0, 2(cid:5)], −1.2 : [S, (cid:4)0, 1(cid:5)], −1.2 : [S, (cid:4)1, 2(cid:5)] Figure 12 Parsing of aa with the grammar from Figure 5. PLCFRS with non-terminals {S, A, B, B }, terminals {a} and start symbol S: (cid:4) , Ta 0.2 : S(X) → A(X) 0.7 : A(XY) → Ta(X)A(Y) (cid:4) 0.8 : B(ZX, Y) → Ta(Z)B 0.2 : B(X, Y) → Ta(X)Ta(Y) Figure 13 Sample binarized PLCFRS (with pre-terminal Ta). 0.8 : S(XY) → B(X, Y) 0.3 : A(X) → Ta(X) 1 : B 1 : Ta(a) → ε (X, Y) (cid:4) (X, UY) → B(X, Y)Ta(U) of the S item in the agenda is updated and then the goal item is the top agenda item and therefore parsing has been successful. Note that, so far, we have only presented the recognizer. In order to extend it to a parser, we do the following: Whenever we generate a new item, we store it not only with its weight but also with backpointers to its antecedent items. Furthermore, whenever we update the weight of an item in the agenda, we also update the backpointers. In order to read off the best parse tree, we have to start from the goal item and follow the backpointers. 4. Outside Estimates So far, the weights we use give us only the Viterbi inside score of an item. In order to speed up parsing, we add the estimate of the costs for completing the item into a goal item to its weight—that is, to the weight of each item in the agenda, we add an estimate of its Viterbi outside score2 (i.e., the logarithm of the estimate). We use context summary estimates. A context summary is an equivalence class of items for which we can compute the actual outside scores. Those scores are then used as estimates. The challenge is to choose the estimate general enough to be efﬁciently computable and speciﬁc enough to be helpful for discriminating items in the agenda. 2 Note that just as Klein and Manning (2003a), we use the terms inside score and outside score to denote the Viterbi inside and outside scores. They are not to be confused with the actual inside or outside probability. 99 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 9 1 8 7 1 7 9 9 6 8 1 / c o l i _ a _ 0 0 1 3 6 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 39, Number 1 Admissibility and monotonicity are two important conditions on estimates. All our outside estimates are admissible (Klein and Manning 2003a), which means that they never underestimate the actual outside score of an item. In other words, they are too optimistic about the costs of completing the item into an S item spanning the entire input. For the full SX estimate described in Section 4.1 and the SX estimate with span and sentence length in Section 4.4, the monotonicity is guaranteed and we can do ∗ parsing as described by Klein and Manning. Monotonicity means that for each true A antecedent item of a rule it holds that its weight is greater than or equal to the weight of the consequent item. The estimates from Sections 4.2 and 4.3 are not monotonic. This means that it can happen that we deduce an item I2 from an item I1 where the weight of I2 is greater than the weight of I1. The parser can therefore end up in a local maximum that is not the global maximum we are searching for. In other words, those estimates are only ﬁgures of merit (FOM). All outside estimates are computed off-line for a certain maximal sentence length lenmax. 4.1 Full SX Estimate The full SX estimate is a PLCFRS adaption of the SX estimate of Klein and Manning (2003a) (hence the name). For a given sentence length n, the estimate gives the maximal probability of completing a category X with a span ρ into an S with span (cid:4)(cid:4)0, n(cid:5)(cid:5). For its computation, we need an estimate of the inside score of a category C with a span ρ, regardless of the actual terminals in our input. This inside estimate is computed as shown in Figure 14. Here, we do not need to consider the number of terminals outside the span of C (to the left or right or in the gaps), because they are not relevant for the inside score. Therefore the items have the form [A, (cid:4)l1, . . . , ldim(A) (cid:5)], where A is a non- terminal and li gives the length of its ith component. It holds that Σ1≤i≤dim(A)li ≤ lenmax − dim(A) + 1 because our grammar extraction algorithm ensures that the different components in the yield of a non-terminal are never adjacent. There is always at least one terminal in between two different components that does not belong to the yield of the non-terminal. The ﬁrst rule in Figure 14 tells us that POS tags always have a single component of length 1; therefore this case has probability 1 (weight 0). The rules unary and binary are roughly like the ones in the CYK parser, except that they combine items with length information. The rule unary for instance tells us that if the log of the probability of building [B,(cid:1)l] is greater or equal to in and if there is a rule that allows to deduce an POS tags: 0 : [A, (cid:4)1(cid:5)] A a POS tag Unary: in : [B,(cid:1)l] in + log(p) : [A,(cid:1)l] p : A((cid:1)α) → B((cid:1)α) ∈ P l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 9 1 8 7 1 7 9 9 6 8 1 / c o l i _ a _ 0 0 1 3 6 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Binary: inB : [B,(cid:1)lB], inC : [C,(cid:1)lC] inB + inC + log(p) : [A,(cid:1)lA] where p : A( (cid:1)αA) → B( (cid:1)αB)C( (cid:1)αC) ∈ P and the following holds: we deﬁne B(i) as {1 ≤ j ≤ dim(B) | (cid:1)αB( j) occurs in (cid:1)αA(i)} and C(i) as {1 ≤ j ≤ dim(C) | (cid:1)αC( j) occurs in (cid:1)αA(i)}. Then for all i, 1 ≤ i ≤ dim(A): (cid:1)lA(i) = Σj∈B(i) (cid:1)lB( j) + Σj∈C(i) (cid:1)lC( j). Figure 14 Estimate of the Viterbi inside score. 100 Kallmeyer and Maier PLCFRS Parsing Axiom : 0 : [S, (cid:4)0, len, 0(cid:5)] 1 ≤ len ≤ lenmax Unary: out : [A,(cid:1)l] out + log(p) : [B,(cid:1)l] p : A((cid:1)α) → B((cid:1)α) ∈ P Binary-right: Binary-left: out : [X,(cid:1)lX] (cid:4) out + in(A,(cid:1)l A) + log(p) : [B,(cid:1)lB] out : [X,(cid:1)lX] (cid:4) out + in(B,(cid:1)l B) + log(p) : [A,(cid:1)lA] where, for both binary rules, there is an instantiated rule p : X((cid:1)ρ) → A( (cid:1)ρA)B( (cid:1)ρB) such that (cid:1)lX = lout(ρ), (cid:1)lA = lout(ρA),(cid:1)l (cid:4) A = lin(ρA), (cid:1)lB = lout(ρB),(cid:1)l (cid:4) B = lin(ρB). l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 9 1 8 7 1 7 9 9 6 8 1 / c o l i _ a _ 0 0 1 3 6 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Figure 15 Full SX estimate ﬁrst version (top–down). A item from [B,(cid:1)l] with probability p, then the log of the probability of [A,(cid:1)l] is greater or equal to in + log(p). For each item, we record its maximal weight (i.e., its maximal probability). The rule binary is slightly more complicated because we have to compute the length vector of the left-hand side of the rule from the right-hand side length vectors. A straightforward extension of the CFG algorithm from Klein and Manning (2003a) for computing the SX estimate is given in Figure 15. Here, the items have the form [A,(cid:1)l] where the vector(cid:1)l tells us about the lengths of the string to the left of the ﬁrst component, the ﬁrst component, the string in between the ﬁrst and second component, and so on. The algorithm proceeds top–down. The outside estimate of completing an S with component length len and no terminals to the left or to the right of the S component (item [S, (cid:4)0, len, 0(cid:5)]) is 0. If we expand with a unary rule (unary), then the outside estimate of the right-hand side item is greater or equal to the outside estimate of the left-hand side item plus the log of the probability of the rule. In the case of binary rules, we have to further add the inside estimate of the other daughter. For this, we need a different length vector (without the lengths of the parts in between the components). (cid:5)(cid:5) and a sentence length n, Therefore, for a given range vector ρ = (cid:4)(cid:4)l1, r1 we distinguish between the inside length vector lin(ρ) = (cid:4)r1 (cid:5) and the outside length vector lout(ρ) = (cid:4)l1, r1 (cid:5), . . . , (cid:4)lk, rk − l1, . . . , rk − r1, . . . , lk − lk, n − rk − rk−1, rk − lk (cid:5). − l1, l2 This algorithm has two major problems: Because it proceeds top–down, in the binary rules we must compute all splits of the antecedent X span into the spans of A and B, which is very expensive. Furthermore, for a category A with a certain number of terminals in the components and the gaps, we compute the lower part of the outside estimate several times, namely, for every combination of number of terminals to the left and to the right (ﬁrst and last element in the outside length vector). In order to avoid these problems, we now abstract away from the lengths of the part to the left and the right, modifying our items such as to allow a bottom–up strategy. The idea is to compute the weights of items representing the derivations from a certain lower C up to some A (C is a kind of “gap” in the yield of A) while summing up the inside costs of off-spine nodes and the log of the probabilities of the corresponding rules. We use items [A, C, ρA, ρC, shift] where A, C ∈ N and ρA, ρC are range vectors, both with a ﬁrst component starting at position 0. The integer shift ≤ lenmax tells us how many positions to the right the C span is shifted, compared to the starting position of the A. ρA and ρC represent the spans of C and A while disregarding the number of terminals to the left and the right (i.e., only the lengths of the components and of the gaps are encoded). This means in particular that the length n of the sentence does not play a role here. The right boundary of the last range in the vectors is limited to lenmax. For 101 Computational Linguistics Volume 39, Number 1 any i, 0 ≤ i ≤ lenmax, and any range vector ρ, we deﬁne shift(ρ, i) as the range vector one obtains from adding i to all range boundaries in ρ and shift(ρ, −i) as the range vector one obtains from subtracting i from all boundaries in ρ. The weight of [A, C, ρA, ρC, i] estimates the log of the probability of completing a C tree with yield ρC into an A tree with yield ρA such that, if the span of A starts at position j, the span of C starts at position i + j. Figure 16 gives the computation. The value of in(A,(cid:1)l) is the inside estimate of [A,(cid:1)l]. The SX-estimate for some predicate C with span ρ where i is the left boundary of the ﬁrst component of ρ and with sentence length n is then given by the maximal weight of [S, C, (cid:4)0, n(cid:5), shift(ρ, −i), i]. 4.2 SX with Left, Gaps, Right, Length A problem of the previous estimate is that with a large number of non-terminals (for treebank parsing, approximately 12,000 after binarization and markovization), the com- putation of the estimate requires too much space. We therefore turn to simpler estimates with only a single non-terminal per item. We now estimate the outside score of a non- terminal A with a span of a length length (the sum of the lengths of all the components of the span), with left terminals to the left of the ﬁrst component, right terminals to the right of the last component, and gaps terminals in between the components of the A span (i.e., ﬁlling the gaps). Our items have the form [X, len, left, right, gaps] with X ∈ N, len + left + right + gaps ≤ lenmax, len ≥ dim(X), gaps ≥ dim(X) − 1. Let us assume that, in the rule X((cid:1)α) → A( (cid:1)αA)B( (cid:1)αB), when looking at the vector (cid:1)α, we have leftA variables for A-components preceding the ﬁrst variable of a B component, rightA variables for A-components following the last variable of a B component, and rightB variables for B-components following the last variable of an A component. (In our grammars, the ﬁrst left-hand side argument always starts with the ﬁrst variable from A.) Furthermore, we set gapsA = dim(A) − leftA − rightA and gapsB = dim(B) − rightB. Figure 17 gives the computation of the estimate. It proceeds top–down, as the computation of the full SX estimate in Figure 15, except that now the items are simpler. POS tags: 0 : [C, C, (cid:4)0, 1(cid:5), (cid:4)0, 1(cid:5), 0] C a POS tag Unary: 0 : [B, B, ρB, ρB, 0] log(p) : [A, B, ρB, ρB, 0] p : A((cid:1)α) → B((cid:1)α) ∈ P Binary-right: 0 : [A, A, ρA, ρA, 0], 0 : [B, B, ρB, ρB, 0] in(A, lin(ρA)) + log(p) : [X, B, ρX, ρB, i] l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 9 1 8 7 1 7 9 9 6 8 1 / c o l i _ a _ 0 0 1 3 6 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 where i is such that for shift(ρB, i) = ρ(cid:4) Binary-left: 0 : [A, A, ρA, ρA, 0], 0 : [B, B, ρB, ρB, 0] in(B, lin(ρB)) + log(p) : [X, A, ρX, ρA, i] B p : X(ρX ) → A(ρA)B(ρ(cid:4) B) is an instantiated rule. Starting sub-trees with larger gaps: out : [B, C, ρB, ρC, i] 0 : [B, B, ρB, ρB, 0] Transitive closure of sub-tree combination: out1 : [A, B, ρA, ρB, i], out2 : [B, C, ρB, ρC, j] out1 + out2 : [A, C, ρA, ρC, i + j] Figure 16 Full SX estimate second version (bottom–up). 102 Kallmeyer and Maier PLCFRS Parsing Axiom : 0 : [S, len, 0, 0, 0] 1 ≤ len ≤ lenmax Unary: out : [X, len, l, r, g] out + log(p) : [A, len, l, r, g] p : X((cid:1)α) → A((cid:1)α) ∈ P Binary-right: out : [X, len, l, r, g] out + in(A, len − lenB) + log(p) : [B, lenB, lB, rB, gB] Binary-left: out : [X, len, l, r, g] out + in(B, len − lenA) + log(p) : [A, lenA, lA, rA, gA] where, for both binary rules, p : X((cid:1)α) → A( (cid:1)αA)B( (cid:1)αB) ∈ P. Further side conditions for Binary-right: a) len + l + r + g = lenB + lB + rB + gB, c) if rightA > 0, then rB
Further side conditions for Binary-left:
A) len + 我 + r + g = lenA + lA + rA + gA,
C) if rightB > 0, then rA

≥ r + rightA, 别的 (rightA = 0), rB = r,

≥ r + rightB, 别的 (rightB = 0), rA = r

乙) lB
d) gB

≥ l + leftA,
≥ gapsA.

乙) lA = l,
d) gA

≥ gapsB.

数字 17
SX estimate depending on length, 左边, 正确的, gaps.

The value in(X, 我) for a non-terminal X and a length l, 0 ≤ l ≤ lenmax is an estimate
of the probability of an X category with a span of length l. Its computation is speciﬁed
图中 18.

The SX-estimate for a sentence length n and for some predicate C with a range
− li) and r =
− r].

characterized by (西德:1)ρ = (西德:4)(西德:4)l1, r1
n − rdim(C) is then given by the maximal weight of the item [C, len, l1, r, n − len − l1

(西德:5)(西德:5) where len = Σdim(C)

(西德:5), . . . , (西德:4)ldim(C), rdim(C)

(ri

我=1

4.3 SX with LR, Gaps, Length

In order to further decrease the space complexity of the computation of the outside
estimate, we can simplify the previous estimate by subsuming the two lengths left and
right in a single length lr. The items now have the form [X, len, lr, gaps] with X ∈ N,
len + lr + gaps ≤ lenmax, len ≥ dim(X), gaps ≥ dim(X) - 1.

The computation is given in Figure 19. 再次, we deﬁne leftA, gapsA, rightA and
gapsB, rightB for a rule X((西德:1)A) → A( (西德:1)αA)乙( (西德:1)αB) as before. 此外, in both Binary-left
and Binary-right, we have limited lr in the consequent item to the lr of the antecedent
plus the length of the sister (lenB, resp. lenA). This results in a further reduction of the
number of items while having only little effect on the parsing results.

The SX-estimate for a sentence length n and for some predicate C with a span
− li) and r = n − rdim(C) is then the

(西德:5)(西德:5) where len = Σdim(C)

(西德:5), . . . , (西德:4)ldim(C), rdim(C)

(ri

(西德:1)ρ = (西德:4)(西德:4)l1, r1
maximal weight of [C, len, l1 + r, n − len − l1

我=1
− r].

POS tags:

0 : [A, 1]

A a POS tag

Unary:

在 : [乙, 我]
在 + 日志(p) : [A, 我]

p : A((西德:1)A) → B((西德:1)A) ∈ P

Binary:

inB : [乙, lB], inC : [C, lC]
inB + inC + 日志(p) : [A, lB + lC]

where either p : A( (西德:1)αA) → B( (西德:1)αB)C( (西德:1)αC) ∈ P or p : A( (西德:1)αA) → C( (西德:1)αC)乙( (西德:1)αB) ∈ P.

数字 18
Estimate of the inside score with total span length.

103

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
/
C
哦

我
我
/

我

A
r
t
我
C
e
–
p
d

F
/

3
9
1
8
7
1
7
9
9
6
8
1
/
C
哦

我
我

_
A
_
0
0
1
3
6
p
d

乙
y
G
你
e
s
t

哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3

计算语言学

体积 39, 数字 1

Axiom :

0 : [S, len, 0, 0]

1 ≤ len ≤ lenmax

Unary:

出去 : [X, len, lr, G]
出去 + 日志(p) : [A, len, lr, G]

p : X((西德:1)A) → A((西德:1)A) ∈ P

Binary-right:

出去 : [X, len, lr, G]
出去 + 在(A, len − lenB) + 日志(p) : [乙, lenB, lrB, gB]

p : X((西德:1)A) → A( (西德:1)αA )乙( (西德:1)αB ) ∈ P

Binary-left:

出去 : [X, len, lr, G]
出去 + 在(乙, len − lenA) + 日志(p) : [A, lenA, lrA, gA]

p : X((西德:1)A) → A( (西德:1)αA )乙( (西德:1)αB ) ∈ P

Further side conditions for Binary-right:
A) len + lr + g = lenB + lrB + gB b) lr < lrB Further side conditions for Binary-left: a) len + lr + g = lenA + lrA + gA b) if rightB = 0 then lr = lrA, else lr < lrA ≥ gapsA c) gB c) gA ≥ gapsB Figure 19 SX estimate depending on length, LR, gaps. 4.4 SX with Span and Sentence Length We will now present a further simpliﬁcation of the last estimate that records only the span length and the length of the entire sentence. The items have the form [X, len, slen] with X ∈ N, dim(X) ≤ len ≤ slen. The computation is given in Figure 20. This last esti- ∗ mate is actually monotonic and allows for true A parsing. The SX-estimate for a sentence length n and for some predicate C with a span − li) is then the maximal weight (cid:5)(cid:5) where len = Σdim(C) (cid:5), . . . , (cid:4)ldim(C), rdim(C) (ri i=1 (cid:1)ρ = (cid:4)(cid:4)l1, r1 of [C, len, n]. In order to prove that this estimate allows for monotonic weighted deductive pars- ing and therefore guarantees that the best parse will be found, let us have a look at the CYK deduction rules when being augmented with the estimate. Only Unary and Binary are relevant because Scan does not have antecedent items. The two rules, augmented with the outside estimate, are shown in Figure 21. We have to show that for every rule, if this rule has an antecedent item with weight (cid:4) , then w ≥ w (cid:4) w and a consequent item with weight w Let us start with Unary. To show: inB + outB ≥ inB + log(p) + outA. Because of the Unary rule for computing the outside estimate and because of the unary production, . Axiom : 0 : [S, len, len] 1 ≤ len ≤ lenmax Unary: out : [X, lX, slen] out + log(p) : [A, lX, slen] p : X((cid:1)α) → A((cid:1)α) ∈ P Binary-right: out : [X, lX, slen] out + in(A, lX − lB) + log(p) : [B, lB, slen] p : X((cid:1)α) → A( (cid:1)αA )B( (cid:1)αB ) ∈ P Binary-left: out + in(B, lX out : [X, lX, slen] − lA) + log(p) : [A, lA, slen] p : X((cid:1)α) → A( (cid:1)αA )B( (cid:1)αB ) ∈ P Figure 20 SX estimate depending on span and sentence length. 104 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 9 1 8 7 1 7 9 9 6 8 1 / c o l i _ a _ 0 0 1 3 6 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Kallmeyer and Maier PLCFRS Parsing Unary: inB + outB : [B, (cid:1)ρ] inB + log(p) + outA : [A, (cid:1)ρ] p : A((cid:1)α) → B((cid:1)α) ∈ P Binary: inB + outB : [B, (cid:1)ρB], inC + outC : [C, (cid:1)ρC] inB + inC + log(p) + outA : [A, (cid:1)ρA] p : A( (cid:1)ρA ) → B( (cid:1)ρB )C( (cid:1)ρC ) is an instantiated rule (Here, outA, outB, and outC are the respective outside estimates of [A, (cid:1)ρA], [B, (cid:1)ρB] and [C, (cid:1)ρC].) Figure 21 Parsing rules including outside estimate. we obtain that, given the outside estimate outA of [A, (cid:1)ρ] the outside estimate outB of the item [B, (cid:1)ρ] is at least outA + log(p), namely, outB ≥ log(p) + outA. Now let us consider the rule Binary. We treat only the relation between the weight of the C antecedent item and the consequent. The treatment of the antecedent B is ≥ inB + inC + log(p) + outA. Assume that lB is the length symmetric. To show: inC + outC of the components of the B item and n is the sentence length. Then, because of the Binary-right rule in the computation of the outside estimate and because of our in- stantiated rule p : A( (cid:1)ρA) → B( (cid:1)ρB)C( (cid:1)ρC), we have that the outside estimate outC of the C-item is at least outA + in(B, lB) + log(p). Furthermore, in(B, lB) ≥ inB. Consequently outC ≥ inB + log(p) + outA. 4.5 Integration into the Parser Before parsing, the outside estimates of all items up to a certain maximal sentence length lenmax are precomputed. Then, when performing the weighted deductive parsing as explained in Section 3.2, whenever a new item is stored in the agenda, we add its outside estimate to its weight. Because the outside estimate is always greater than or equal to the actual outside score, given the input, the weight of an item in the agenda is always greater than or equal to the log of the actual product of the inside and outside score of the item. In this sense, the outside estimates given earlier are admissible. Additionally, as already mentioned, note that the full SX estimate and the SX esti- ∗ parsing. The other mate with span and sentence length are monotonic and allow for A two estimates, which are both not monotonic, act as FOMs in a best-ﬁrst parsing context. Consequently, they contribute to speeding up parsing but they decrease the quality of the parsing output. For further evaluation details see Section 6. 5. Grammars for Discontinuous Constituents 5.1 Grammar Extraction The algorithm we use for extracting an LCFRS from a constituency treebank with cross- ing branches has originally been presented in Maier and Søgaard (2008). It interprets the treebank trees as LCFRS derivation trees. Consider for instance the tree in Figure 22. The S node has two daughters, a VMFIN node and a VP node. This yields a rule S → VP VMFIN. The VP is discontinuous with two components that wrap around the yield of the VMFIN. Consequently, the LCFRS rule is S(XYZ) → VP(X, Z) VMFIN(Y). The extraction of an LCFRS from treebanks with crossing branches is almost im- mediate, except for the fan-out of the non-terminal categories: In the treebank, we can have the same non-terminal with different fan-outs, for instance a VP without a gap (fan-out 1), a VP with a single gap (fan-out 2), and so on. In the corresponding LCFRS, 105 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 9 1 8 7 1 7 9 9 6 8 1 / c o l i _ a _ 0 0 1 3 6 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 39, Number 1 S VP VP PROAV VMFIN dar ¨uber about it muß must “It must be thought about it” thought VVPP VAINF nachgedacht werden be Figure 22 A sample tree from NeGra. we have to distinguish these different non-terminals by mapping them to different predicates. · · · Am, a clause A0 The algorithm ﬁrst creates a so-called lexical clause P(a) → ε for each pre-terminal P dominating some terminal a. Then for all other non-terminals A0 with the children · · · Am A1 is the number of discontinuous parts in their yields. The components of A0 are concate- nations of variables that describe how the discontinuous parts of the yield of A0 are obtained from the yields of its daughters. · · · Am is created. The number of components of the A1 → A1 More precisely, the non-terminals in our LCFRS are all Ak where A is a non-terminal label in the treebank and k is a possible fan-out for A. For a given treebank tree (cid:4)V, E, r, l(cid:5) where V is the set of nodes, E ⊂ V × V the set of immediate dominance edges, r ∈ V the root node, and l : V → N ∪ T the labeling function, the algorithm constructs the following rules. Let us assume that w1, . . . , wn are the terminal labels of the leaves in (cid:4)V, E, r(cid:5) with a linear precedence relation wi ≺ wj for 1 ≤ i < j ≤ n. We introduce a variable Xi for every wi, 1 ≤ i ≤ n. ∈ V with (cid:4)v2, v2 For every pair of nodes v1, v2 l(v1)(l(v2)) → ε to the rules of the grammar. (We omit the fan-out subscript here because pre-terminals are always of fan-out 1.) For every node v ∈ V with l(v) = A0 /∈ T such that there are exactly m (cid:5) ∈ E and l(vi) = Ai /∈ T for all nodes v1, . . . , vm 1 ≤ i ≤ m, we now create a rule (cid:5) ∈ E, l(v2) ∈ T, we add ∈ V (m ≥ 1) with (cid:4)v, vi A0(x(0) 1 , . . . , x(0) → A1(x(1) dim(A0 )) 1 , . . . , x(1) dim(A1 )) . . . Am(x(m) 1 , . . . , x(m) dim(Am )) where for the predicate Ai, 0 ≤ i ≤ m, the following must hold: The concatenation of all arguments of Ai, x(i) concatenation of all X ∈ {Xi Xi precedes Xj if i < j, and a variable Xj with 1 ≤ j < n is the right boundary of an argument of (cid:4) (cid:4) Ai if and only if Xj+1 /∈ {Xi }, that is, an | (cid:4)vi, v i ) = wi with l(v i argument boundary is introduced at each discontinuity. 1 . . . x(i) dim(Ai ) is the (cid:4) i ) = wi with l(v } such that (cid:4) | (cid:4)vi, v i ∗ (cid:5) ∈ E ∗ (cid:5) ∈ E As a further step, in this new rule, all right-hand side arguments of length > 1 are replaced in both sides of the rule with a single new variable.
最后, all non-terminals A in the rule are equipped with an additional
subscript dim(A), which gives us the ﬁnal non-terminal in our LCFRS.

(西德:1)

106

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
/
C
哦

我
我
/

我

A
r
t
我
C
e
–
p
d

F
/

3
9
1
8
7
1
7
9
9
6
8
1
/
C
哦

我
我

_
A
_
0
0
1
3
6
p
d

乙
y
G
你
e
s
t

哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3

Kallmeyer and Maier

PLCFRS Parsing

PROAV(Dar ¨uber) → ε
VMFIN(muß) → ε
VVPP(nachgedacht) → ε
VAINF(werden) → ε

S1(X1X2X3) → VP2(X1, X3)VMFIN(X2)
VP2(X1, X2X3) → VP2(X1, X2)VAINF(X3)
VP2(X1, X2) → PROAV(X1)VVPP(X2)

数字 23
LCFRS rules extracted from the tree in Figure 22.

For the tree in Figure 22, the algorithm produces for instance the rules in Figure 23.
As standard for PCFG, the probabilities are computed using Maximum Likelihood

Estimation.

5.2 Head-Outward Binarization

As previously mentioned, in contrast to CFG the order of the right-hand side elements
of a rule does not matter for the result of an LCFRS derivation. 所以, we can reorder
the right-hand side of a rule before binarizing it.

下列, treebank-speciﬁc reordering results in a head-outward binarization
where the head is the lowest subtree and it is extended by adding ﬁrst all sisters to its
left and then all sisters to its right. It consists of reordering the right-hand side of the
rules extracted from the treebank such that ﬁrst, all elements to the right of the head are
listed in reverse order, then all elements to the left of the head in their original order, 和
then the head itself. 数字 24 shows the effect this reordering and binarization has on
the form of the syntactic trees. In addition to this, we also use a variant of this reordering

副总裁

Tree in NeGra format:

NN
das
那

NN
男人
一

AV
VAINF
jetzt machen
现在

VMFIN
muß
must
“One has to do that now”
Rule extracted for the S node: S(XYZU) → VP(X, U)VMFIN(是)NN(Z)
Reordering for head-outward binarization: S(XYZU) → NN(Z)副总裁(X, U)VMFIN(是)
New rules resulting from binarizing this rule:
S(XYZ) → Sbin1(X, Z)NN(是)
Rule extracted for the VP node: 副总裁(X, YZ) → NN(X)AV(是)VAINF(Z)
New rules resulting from binarizing this rule:
副总裁(X, 是) → NN(X)VPbin1(是) VPbin1(XY) → AV(X)VAINF(是)

Sbin1(XY, Z) → VP(X, Z)VMFIN(是)

做

Tree after binarization:

Sbin1

副总裁

VPbin1

VMFIN NN AV VAINF

数字 24
Sample head-outward binarization.

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
/
C
哦

我
我
/

我

A
r
t
我
C
e
–
p
d

F
/

3
9
1
8
7
1
7
9
9
6
8
1
/
C
哦

我
我

_
A
_
0
0
1
3
6
p
d

乙
y
G
你
e
s
t

哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3

107

计算语言学

体积 39, 数字 1

where we add ﬁrst the sisters to the right and then the ones to the left. This is what Klein
and Manning (2003乙) 做. To mark the heads of phrases, we use the head rules that the
Stanford parser (Klein and Manning 2003c) uses for NeGra.

In all binarizations, there exists the possibility of adding additional unary rules
when deriving the head. This allows for a further factorization. In the experiments,
然而, we do not insert unary rules, neither at the highest nor at the lowest new
binarization non-terminal, because this was neither beneﬁcial for parsing times nor for
the parsing results.

5.3 Incorporating Additional Context

5.3.1 Markovization. As already mentioned in Section 3.1, a binarization that introduces
unique new non-terminals for every single rule that needs to be binarized produces
a large amount of non-terminals and fails to capture certain generalizations. 为了这
原因, we introduce markovization (柯林斯 1999; Klein and Manning 2003b).

Markovization is achieved by introducing only a single new non-terminal for the
new rules introduced during binarization and adding vertical and horizontal context
from the original trees to each occurrence of this new non-terminal. As vertical context,
we add the ﬁrst v labels on the path from the root node of the tree that we want to
binarize to the root of the entire treebank tree. The vertical context is collected during
grammar extraction and then taken into account during binarization of the rules. 作为
horizontal context, during binarization of a rule A((西德:1)A) → A0( (西德:1)α0) . . . 是( (西德:1)αm), for the new
non-terminal that comprises the right-hand side elements Ai . . . 是 (对于一些 1 ≤ i ≤
米), we add the ﬁrst h elements of Ai, Ai−1, . . . , A0.

数字 25 shows an example of a markovization of the tree from Figure 24 with v = 1
and h = 2. 这里, the superscript is the vertical context and the subscript the horizontal
context of the new non-terminal X. Note that in this example we have disregarded the
fan-out of the context categories. The VP, 例如, is actually a VP2 because it has
fan-out 2. For the context symbols, one can either use the categories from the original
treebank (without fan-out) or the ones from the LCFRS rules (with fan-out). We chose
the latter approach because it delivered better parsing results.

5.3.2 Further Category Splitting. Grammar annotation (IE。, manual enhancement of an-
notation information through category splitting) has previously been successfully used
in parsing German (Versley 2005). In order to see if such modiﬁcations can have a
beneﬁcial effect in PLCFRS parsing as well, we perform different category splits on the
(unbinarized) NeGra constituency data.

We split the category S (“sentence”) into SRC (“relative clause”) 和S (all other
categories S). Relative clauses mostly occur in a very speciﬁc context, 即, 作为

副总裁,NN

副总裁

XVP

ADV,NN

NN VMFIN NN ADV

VAINF

数字 25
Sample markovization with v = 1, h = 2.

108

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
/
C
哦

我
我
/

我

A
r
t
我
C
e
–
p
d

F
/

3
9
1
8
7
1
7
9
9
6
8
1
/
C
哦

我
我

_
A
_
0
0
1
3
6
p
d

乙
y
G
你
e
s
t

哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3

Kallmeyer and Maier

PLCFRS Parsing

桌子 1
NeGra: Properties of the data with crossing branches.

训练

测试

number of sentences
average sentence length
average tree height
average children per node
sentences without gaps
sentences with one gap
sentences with ≥ 2 gaps
maximum gap degree

16,502

14.56
4.62
2.96
12,481 (75.63%)
3,320 (20.12%)
701 (4.25%)

1,833

14.62
4.72
2.94

1,361 (74.25%)
387 (21.11%)
85 (4.64%)
5

right part of an NP or a PP. This splitting should therefore speed up parsing and increase
precision. 此外, we distinguish NPs by their case. 更确切地说, to all nodes
with categories N, we append the grammatical function label to the category label. 我们
ﬁnally experiment with the combination of both splits.

6. 实验

6.1 数据

Our data source is the NeGra treebank (Skut et al. 1997). We create two different data
sets for constituency parsing. For the ﬁrst one, we start out with the unmodiﬁed NeGra
treebank and remove all sentences with a length of more than 30 字. We pre-process
the treebank following common practice (K ¨ubler and Penn 2008), attaching all nodes
which are attached to the virtual root node to nodes within the tree such that, 理想地,
no new crossing edges are created. In a second pass, we attach punctuation which
comes in pairs (parentheses, quotation marks) to the same nodes. For the second data
set we create a copy of the pre-processed ﬁrst data set, in which we apply the usual
tree transformations for NeGra PCFG parsing (IE。, moving nodes to higher positions
until all crossing branches are resolved). 第一个 90% of both data sets are used as the
training set and the remaining 10% as test set. The ﬁrst data set is called NeGraLCFRS
and the second is called NeGraCFG.

桌子 1 lists some properties of the training and test (分别, gold) parts of
NeGraLCFRS, 即, the total number of sentences, the average sentence length, 这
average tree height (the height of a tree being the length of the longest of all paths
from the terminals to the root node), and the average number of children per node
(excluding terminals). 此外, gap degrees (IE。, the number of gaps in the spans
of non-terminal nodes) are listed (Maier and Lichte 2011).

Our ﬁndings correspond to those of Maier and Lichte except for small differences

due to the fact that, unlike us, they removed the punctuation from the trees.

6.2 Parser Implementation

We have implemented the CYK parser described in the previous section in a system
called rparse. The implementation is realized in Java.3

3 rparse is available under the GNU General Public License 2.0 在http://www.phil.hhu.de/rparse.

109

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
/
C
哦

我
我
/

我

A
r
t
我
C
e
–
p
d

F
/

3
9
1
8
7
1
7
9
9
6
8
1
/
C
哦

我
我

_
A
_
0
0
1
3
6
p
d

乙
y
G
你
e
s
t

哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3

计算语言学

体积 39, 数字 1

桌子 2
NeGraLCFRS: PLCFRS parsing results for different binarizations.

头部驱动

KM L-to-R Optimal Deterministic

LP
LR
LF1
UP
UR
UF1

74.00
74.24
74.12
77.09
77.34
77.22

74.00
74.13
74.07
77.20
77.33
77.26

75.08
74.69
74.88
77.95
77.54
77.75

74.92
74.88
74.90
77.77
77.73
77.75

72.40
71.80
72.10
75.67
75.04
75.35

6.3 评估

For the evaluation of the constituency parses, we use an EVALB-style metric. For a tree
over a string w, a single constituency is represented by a tuple (西德:4)A, (西德:1)ρ(西德:5) with A being a
node label and (西德:1)ρ ∈ (Pos(w) × Pos(w))dim(A). We compute precision, 记起, and F1 based
on these tuples from gold and de-binarized parsed test data from which all category
splits have been removed. This metric is equivalent to the corresponding PCFG metric
for dim(A) = 1. Despite the shortcomings of such a measure (Rehbein and van Genabith
2007), it still allows to some extent a comparison to previous work in PCFG parsing (看
also Section 7). Note that we provide the parser with gold POS tags in all experiments.

6.4 Markovization and Binarization

We use the markovization settings v = 1 and h = 2 for all further experiments. 这
setting which has been reported to yield the best results for PCFG parsing of NeGra,
v = 2 and h = 1 (Rafferty and Manning 2008), required a parsing time which was too
high.4

桌子 2 contains the parsing results for NeGraLCFRS using ﬁve different binariza-
系统蒸发散: Head-driven and KM are the two head-outward binarizations that use a head
chosen on linguistic grounds (节中描述 5.2); L-to-R is another variant in
which we always choose the rightmost daughter of a node as its head.5 Optimal reorders
the left-hand side such that the fan-out of the binarized rules is optimized (described in
部分 3.1.2). 最后, we also try a deterministic binarization (Deterministic) 其中
we binarize strictly from left to right (IE。, we do not reorder the right-hand sides of
productions, and choose unique binarization labels).

The results of the head-driven binarizations and the optimal binarization lie close
一起; the results for the deterministic binarization are worse. This indicates that the
presence or absence of markovization has more impact on parsing results than the actual
binarization order. 此外, the non-optimal binarizations did not yield a binarized
grammar of a higher fan-out than the optimal binarization: For all ﬁve binarizations,
the fan-out was 7 (caused by a VP interrupted by punctuation).

4 Older versions of rparse contained a bug that kept the priority queue from being updated correctly
(IE。, during an update, the corresponding node in the priority queue was not moved to its top, 和
therefore the best parse was not guaranteed to be found); 然而, higher parsing speeds were achieved.
The current version of rparse implements the update operation correctly, using a Fibonacci queue to
ensure efﬁciency (Cormen et al. 2003). Thanks to Andreas van Cranenburgh for pointing this out.

5 The term head is not used in its proper linguistic sense here.

110

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
/
C
哦

我
我
/

我

A
r
t
我
C
e
–
p
d

F
/

3
9
1
8
7
1
7
9
9
6
8
1
/
C
哦

我
我

_
A
_
0
0
1
3
6
p
d

乙
y
G
你
e
s
t

哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3

Kallmeyer and Maier

PLCFRS Parsing

700

600

头部驱动

L-to-R

Optimal

500

Deterministic

)
0
0
0
1

n
我
(

s
米
e
t
我

400

300

200

100

21
sentence length

Baseline

NP+S

400

350

300

250

200

150

100

)
0
0
0
1

n
我
(

s
米
e
t
我

21
sentence length

OFF

400

350

300

250

200

150

100

)
0
0
0
1

n
我
(

s
米
e
t
我

21
sentence length

数字 26
NeGraLCFRS: Items for PLCFRS parsing (left-to-right): binarizations, baseline and category splits,
and estimates.

The different binarizations result in different numbers of items, and therefore allow
for different parsing speeds. The respective leftmost graph in Figures 26 和 27 展示
a visual representation of the number of items produced by all binarizations, 和
corresponding parsing times. Note that when choosing the head with head rules the
number of items is almost not affected by the choice of adding ﬁrst the children to
the left of the head and then to the right of the head or vice versa. The optimal bina-
rization produces the best results. Therefore we will use it in all further experiments,
in spite of its higher parsing time.

6.5 Baseline Evaluation and Category Splits

桌子 3 presents the constituency parsing results for NeGraLCFRS and NeGraCFG, 两个都
with and without the different category splits. Recall that NeGraLCFRS has crossing
branches and consequently leads to a PLCFRS of fan-out > 1 whereas NeGraCFG does
not contain crossing branches and consequently leads to a 1-PLCFRS—in other words,

头部驱动

L-to-R

Optimal

Deterministic

)
.
C
e
s
(

e
米

我
t

100

0.1

Baseline

NP+S

)
.
C
e
s
(

e
米

我
t

100

0.1

OFF

100

)
.
C
e
s
(

e
米

我
t

0.1

sentence length

数字 27
NeGraLCFRS: Parsing times for PLCFRS parsing (left-to-right): binarizations, baseline and
category splits, and estimates (log scale).

111

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
/
C
哦

我
我
/

我

A
r
t
我
C
e
–
p
d

F
/

3
9
1
8
7
1
7
9
9
6
8
1
/
C
哦

我
我

_
A
_
0
0
1
3
6
p
d

乙
y
G
你
e
s
t

哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3

计算语言学

体积 39, 数字 1

桌子 3
NeGraLCFRS and NeGraCFG: baseline and category splits.

w/ category splits

NeGraLCFRS

NP ◦ S NeGraCFG

NP ◦ S

LP
LR
LF1
UP
UR
UF1

74.92
74.88
74.90
77.77
78.73
77.75

75.21
74.95
75.08
78.16
77.88
78.02

75.81
75.65
75.73
78.31
78.15
78.23

75.93
75.57
75.75
78.60
78.22
78.41

76.32
76.36
76.34
79.12
79.17
79.14

76.79
77.23
77.01
79.62
80.08
79.85

77.39
77.35
77.37
79.84
79.80
79.82

77.58
77.99
77.79
80.09
80.52
80.30

a PCFG. We evaluate the parser output against the unmodiﬁed gold data; 那是,
before we evaluate the experiments with category splits, we replace all split labels in
the parser output with the corresponding original labels.

We take a closer look at the properties of the trees in the parser output for
NeGraLCFRS. Twenty-nine sentences had no parse, 所以, the parser output has 1,804
句子. The average tree height is 4.72, and the average number of children per node
(excluding terminals) 是 2.91. These values are almost identical to the values for the gold
数据. As for the gap degree, we get 1,401 sentences with no gaps (1,361 in the gold set),
334 with gap degree 1 (387 in the gold set), 和 69 和 2 或者 3 gaps (85 in the gold set).
Even though the difference is only small, one can see that fewer gaps are preferred. 这
is not surprising, since constituents with many gaps are rare events and therefore end
up with a probability which is too low.

We see that the quality of the PLCFRS parser output on NeGraLCFRS (which contains
more information than the output of a PCFG parser) does not lag far behind the quality
of the PCFG parsing results on NeGraCFG. With respect to the category splits, 结果
show furthermore that category splitting is indeed beneﬁcial for the quality of the
PLCFRS parser output. The gains in speed are particularly visible for sentences with
a length greater than 20 字 (比照. the number of produced items and parsing times in
人物 26 和 27 [中间]).

6.6 Evaluating Outside Estimates

We compare the parser performance without estimates (OFF) with its performance
with the estimates described in Sections 4.3 (LR) 和 4.4 (LN).

很遗憾, the full estimates seem to be only of theoretical interest because they
were too expensive to compute both in terms of time and space, given the restrictions
imposed by our hardware. We could, 然而, compute the LN and the LR estimate.
∗
解析, the LR estimate lets the
Unlike the LN estimate, which allows for true A
quality of the parsing results deteriorate: Compared with the baseline, labeled F1 drops
从 74.90 到 73.76 and unlabeled F1 drops from 77.91 到 76.89. The respective rightmost
graphs in Figures 26 和 27 show the average number of items produced by the
parser and the parsing times for different sentence lengths. The results indicate that the
estimates have the desired effect of preventing unnecessary items from being produced.
This is reﬂected in a signiﬁcantly lower parsing time.

The different behavior of the LR and the LN estimate raises the question of the
trade-off between maintaining optimality and obtaining a higher parsing speed. 在

112

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
/
C
哦

我
我
/

我

A
r
t
我
C
e
–
p
d

F
/

3
9
1
8
7
1
7
9
9
6
8
1
/
C
哦

我
我

_
A
_
0
0
1
3
6
p
d

乙
y
G
你
e
s
t

哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3

Kallmeyer and Maier

PLCFRS Parsing

也就是说, it raises the question of whether techniques such as pruning or coarse-
∗
解析. A ﬁrst
to-ﬁne parsing (Charniak et al. 2006) would probably be superior to A
implementation of a coarse-to-ﬁne approach has been presented by van Cranenburgh
(2012). He generates a CFG from the treebank PLCFRS, based on the idea of Barth´elemy
等人. (2001). This grammar, which can be seen as a coarser version of the actual PLCFRS,
is then used for pruning of the search space. The problem that van Cranenburgh tackles
is speciﬁc to PLCFRS: His PCFG stage generalizes over the distinction of labels by their
fan-out. The merit of his work is an enormous increase in efﬁciency: Sentences with a
length of up to 40 words can now be parsed in a reasonable time. For a comparison of
the results of van Cranenburgh (2012) with our work, the same version of evaluation
parameters would have to be used. The applicability and effectiveness of other coarse-
to-ﬁne approaches (Charniak et al. 2006; 彼得罗夫和克莱因 2007) on PLCFRS remain to
be seen.

7. Comparison to Other Approaches

Comparing our results with results from the literature is a difﬁcult endeavor, 因为
PLCFRS parsing of NeGra is an entirely new task that has no direct equivalent in
previous work. 尤其, it is a harder task than PCFG parsing. What we can
provide in this section is a comparison of the performance of our parser on NeGraCFG
to the performance of previously presented PCFG parsers on the same data set and
an overview on previous work on parsing which aims at reconstructing crossing
分支机构.

For the comparison of the performance of our parser on NeGraCFG, we have per-
formed experiments with Helmut Schmid’s LoPar (Schmid 2000) and with the Stanford
Parser (Klein and Manning 2003c) on NeGraCFG.6 For the experiments both parsers
were provided with gold POS tags. Recall that our parser produced labeled precision,
记起, and F1 of 76.32, 76.46, 和 76.34, 分别. The plain PCFG provided by LoPar
delivers lower results (LP 72.86, LR 74.43, and LF1 73.63). The Stanford Parser results
(markovization setting v = 2, h = 1 [Rafferty and Manning 2008], otherwise default
参数) lie in the vicinity of the results of our parser (LP 74.27, LR 76.19, LF1 75.45).
Although the results for LoPar are no surprise, given the similarity of the models
implemented by our parser and the Stanford parser, it remains to be investigated why
the lexicalization component of the Stanford parser does not lead to better results. 在
any case the comparison shows that on a data set without crossing branches, our parser
obtains the results one would expect. A further data set to which we can provide a
comparison is the PaGe workshop experimental data (K ¨ubler and Penn 2008).7 桌子 4
lists the results of some of the papers in K ¨ubler and Penn (2008) on TIGER, 即,
for Petrov and Klein (2008) (磷&K), who use the Berkeley Parser (彼得罗夫和克莱因
2007); Rafferty and Manning (2008) (右&中号), who use the Stanford parser (see above);
and Hall and Nivre (2008) (H&氮), who use a dependency-based approach (see next
paragraph). The comparison again shows that our system produces good results. 再次
the performance gap between the Stanford parser and our parser warrants further
调查.

6 We have obtained the former parser from http://www.ims.uni-stuttgart.de/tcl/SOFTWARE/

LoPar.html and the latter (Version 2.0.1) from http://nlp.stanford.edu/software/lex-parser.shtml.

7 Thanks to Sandra K ¨ubler for providing us with the experimental data.

113

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
/
C
哦

我
我
/

我

A
r
t
我
C
e
–
p
d

F
/

3
9
1
8
7
1
7
9
9
6
8
1
/
C
哦

我
我

_
A
_
0
0
1
3
6
p
d

乙
y
G
你
e
s
t

哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3

计算语言学

体积 39, 数字 1

桌子 4
PaGe workshop data.

这里

磷&K R&M H&氮

LP
LR
LF1

66.93
60.79
63.71

69.23
70.41
69.81

58.52
57.63
58.07

67.06
58.07
65.18

As for the work that aims to create crossing branches, Plaehn (2004) obtains 73.16
Labeled F1 using Probabilistic Discontinuous Phrase Structure Grammar (DPSG), 甚至
only on sentences with a length of up to 15 字. On those sentences, we obtain 83.97.
The crucial difference between DPSG rules and LCFRS rules is that the former explicitly
specify the material that can occur in gaps whereas LCFRS does not. 征收 (2005), like us,
proposes to use LCFRS but does not provide any evaluation results of his work. Very
最近, Evang and Kallmeyer (2011) followed up on our work. They transform the
Penn Treebank such that the trace nodes and co-indexations are converted into crossing
branches and parse them with the parser presented in this article, obtaining promising
结果. 此外, van Cranenburgh, Scha, and Sangati (2011) and van Cranenburgh
(2012) have also followed up on our work, introducing an integration of our approach
with Data-Oriented Parsing (DOP). The former article introduces an LCFRS adaption
of Goodman’s PCFG-DOP (古德曼 2003). For their evaluation, the authors use the
same data as we do in Maier (2010), and obtain an improvement of roughly 1.5 点
F-measure. They are also confronted with the same efﬁciency issues, 然而, 和
encounter a bottleneck in terms of parsing time. In van Cranenburgh (2012), a coarse-
to-ﬁne approach is presented (参见章节 6.6). With this approach much faster parsing
is made possible and sentences with a length of up to 40 words can be parsed. The cost
of the speed, 然而, is that the results lie well below the baseline results for standard
PLCFRS parsing.

A comparison with non-projective dependency parsers (McDonald et al. 2005;
Nivre et al. 2007) might be interesting as well, given that non-projectivity is the
dependency-counterpart to discontinuity in constituency parsing. A meaningful com-
parison is difﬁcult to do for the following reasons, 然而. Firstly, dependency parsing
deals with relations between words, whereas in our case words are not considered in
the parsing task. Our grammars take POS tags for a given and construct syntactic trees.
还, dependency conversion algorithms generally depend on the correct identiﬁcation
of linguistic head words (林 1995). We cannot rely on grammatical function labels, 这样的
作为, 例如, Boyd and Meurers (2008). Therefore we would have to use heuristics for
the dependency conversion of the parser output. This would introduce additional noise.
第二, the resources one obtains from our PLCFRS parser and from dependency
parsers (the probabilistic LCFRS and the trained dependency parser) are quite different
because the former contains non-lexicalized internal phrase structure identifying mean-
ingful syntactic categories such as VP or NP while the latter is only concerned with rela-
tions between lexical items. A comparison would concentrate only on relations between
lexical items and the rich phrase structure provided by a constituency parser would
not be taken into account. To achieve some comparison, one could of course transform
the discontinuous constituency trees into dependency trees with dependencies between
heads and with edge labels that encode enough of the syntactic structure to retrieve
the original constituency tree (Hall and Nivre 2008). The result could then be used for

114

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
/
C
哦

我
我
/

我

A
r
t
我
C
e
–
p
d

F
/

3
9
1
8
7
1
7
9
9
6
8
1
/
C
哦

我
我

_
A
_
0
0
1
3
6
p
d

乙
y
G
你
e
s
t

哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3

Kallmeyer and Maier

PLCFRS Parsing

a dependency evaluation. It is not clear what is to gain by this evaluation because
the head-to-head dependencies one would obtain are not necessarily the predicate-
argument dependencies one would aim at when doing direct dependency parsing
(Rambow 2010).8

8. 结论

We have presented the ﬁrst efﬁcient implementation of a weighted deductive CYK
parser for Probabilistic Linear Context-Free Rewriting Systems (PLCFRS), 显示
that LCFRS indeed allows for data-driven parsing while modeling discontinuities in
a straightforward way. To speed up parsing, we have introduced different context-
summary estimates of parse items, some acting as ﬁgures-of-merit, others allowing for
∗
解析. We have implemented the parser and we have evaluated it with grammars
A
extracted from the German NeGra treebank. Our experiments show that data-driven
LCFRS parsing is feasible and yields output of competitive quality.
There are three main directions for future work on this subject.

(西德:1)

On the symbolic side, LCFRS seems to offer more power than necessary.
By removing symbolic expressivity, a lower parsing complexity can be
达到了. One possibility is to disallow the use of so-called ill-nested
LCFRS rules. These are rules where, 大致, the spans of two right-hand
side non-terminals interleave in a cross-serial way. See the parsing
algorithm in G ´omez-Rodr´ıguez, Kuhlmann, and Satta (2010).
尽管如此, this seems to be too restrictive for linguistic modeling
(Chen-Main and Joshi 2010; Maier and Lichte 2011). Our goal for future
work is therefore to deﬁne reduced forms of ill-nested rules with which we
get a lower parsing complexity.
Another possibility is to reduce the fan-out of the extracted grammar. 我们
have pursued the question whether the fan-out of the trees in the treebank
can be reduced in a linguistically meaningful way in Maier, Kaeshammer,
and Kallmeyer (2012).

On the side of the probabilistic model, there are certain independence
assumptions made in our model that are too strong. The main problem in
respect is that, due to the deﬁnition of LCFRS, we have to distinguish
between occurrences of the same category with different fan-outs. 为了
实例, VP1 (no gaps), VP2 (one gap), 等等, are different
non-terminals. 最后, the way they expand are considered
independent from each other. This is of course not true, 然而.
此外, some of these non-terminals are rather rare; we therefore
have a sparse data problem here. This leads to the idea to separate the
development of a category (independent from its fan-out) and the fan-out
and position of gaps. We plan to integrate this into our probabilistic model
in future work.

8 A way to overcome this difference in the content of the dependency annotation would be to use

an evaluation along the lines of Tsarfaty, Nivre, and Andersson (2011); this is not available yet for
annotations with crossing branches, 然而.

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
/
C
哦

我
我
/

我

A
r
t
我
C
e
–
p
d

F
/

3
9
1
8
7
1
7
9
9
6
8
1
/
C
哦

我
我

_
A
_
0
0
1
3
6
p
d

乙
y
G
你
e
s
t

哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3

115

计算语言学

体积 39, 数字 1

(西德:1)

最后的, it is clear that a more informative evaluation of the parser output is
still necessary, particularly with respect to its performance at the task of
ﬁnding long distance dependencies and with respect to its behavior when
not provided with gold POS tags.

致谢
We are particularly grateful to Giorgio Satta
for extensive discussions of the details of the
probabilistic treebank model presented in
这篇论文. 此外, we owe a debt to
Kilian Evang who participated in the
implementation of the parser. Thanks to
Andreas van Cranenburgh for helpful
feedback on the parser implementation.
最后, we are grateful to our three
anonymous reviewers for many valuable
and helpful comments and suggestions.
A part of the work on this paper was funded
by the German Research Foundation DFG
(德国研究基金会) 在里面
form of an Emmy Noether Grant and a
subsequent DFG research project.

参考
Barth´elemy, Franc¸ois, Pierre Boullier,

Philippe Deschamp, and ´Eric Villemonte
de la Clergerie. 2001. Guided parsing
of range concatenation languages.
In Proceedings of the 39th Annual Meeting
of the Association for Computational
语言学, pages 42–49, Toulouse.
Becker, Tilman, Aravind K. Joshi, 和

Owen Rambow. 1991. Long-distance
scrambling and tree-adjoining grammars.
In Proceedings of the Fifth Conference of the
European Chapter of the Association for
计算语言学, pages 21–26,
柏林.

Boullier, 皮埃尔. 1998A. A generalization of
mildly context-sensitive formalisms.
In Proceedings of the Fourth International
Workshop on Tree Adjoining Grammars and
Related Formalisms (TAG+4), pages 17–20,
费城, PA.

Boullier, 皮埃尔. 1998乙. A proposal for a
natural language processing syntactic
backbone. 技术报告 3342, INRIA,
Roquencourt.

Boullier, 皮埃尔. 2000. Range concatenation
语法. In Proceedings of the Sixth
International Workshop on Parsing
Technologies (IWPT2000), pages 53–64,
Trento.

Boyd, Adriane. 2007. Discontinuity revisited:
An improved conversion to context-free

116

陈述. In the Linguistic
Annotation Workshop at ACL 2007,
pages 41–44, Prague.

Boyd, Adriane and Detmar Meurers.

2008. Revisiting the impact of different
annotation schemes on PCFG parsing:
A grammatical dependency evaluation.
In Proceedings of the Workshop on Parsing
German at ACL 2008, pages 24–32,
Columbus, 哦.

布兰特, Sabine, Stefanie Dipper, Silvia

汉森, Wolfgang Lezius, and George
史密斯. 2002. The TIGER Treebank.
In Proceedings of the 1st Workshop on
Treebanks and Linguistic Theories,
pages 24–42, Sozopol.

Bunt, Harry. 1996. Formal tools for

describing and processing discontinuous
constituency structure. In Harry Bunt and
Arthur van Horck, 编辑, Discontinuous
Constituency, 体积 6 of Natural Language
加工. Mouton de Gruyter, 柏林,
pages 63–83.

Cai, Shu, David Chiang, and Yoav Goldberg.

2011. Language-independent parsing
with empty elements. 在诉讼程序中
49th Annual Meeting of the Association for
计算语言学: Human Language
Technologies, pages 212–216, Portland, 或者.

Candito, Marie and Djam´e Seddah. 2010.

Parsing word clusters. 在诉讼程序中
First Workshop on Statistical Parsing of
Morphologically-Rich Languages at NAACL
赫勒特 2010, pages 76–84, 天使们, CA.
Caraballo, Sharon A. and Eugene Charniak.
1998. New ﬁgures of merit for best-ﬁrst
probabilistic chart parsing. 计算型
语言学, 24(2):275–298.

查尼亚克, 尤金, Mark Johnson, Micha
Elsner, Joseph Austerweil, David Ellis,
Isaac Haxton, Catherine Hill, 右. Shrivaths,
Jeremy Moore, Michael Pozar, and Theresa
Vu. 2006. Multilevel coarse-to-ﬁne PCFG
解析. In Proceedings of the Human
Language Technology Conference of the
全国AACL, Main Conference, pages 168–175,
纽约, 纽约.

Chen-Main, Joan and Aravind Joshi. 2010.
Unavoidable ill-nestedness in natural
language and the adequacy of tree
local-MCTAG induced dependency
结构. In Proceedings of the Tenth

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
/
C
哦

我
我
/

我

A
r
t
我
C
e
–
p
d

F
/

3
9
1
8
7
1
7
9
9
6
8
1
/
C
哦

我
我

_
A
_
0
0
1
3
6
p
d

乙
y
G
你
e
s
t

哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3

Kallmeyer and Maier

PLCFRS Parsing

International Workshop on Tree Adjoining
Grammar and Related Formalisms (TAG+10),
pages 119–126, 新天堂, CT.

蒋, 大卫. 2003. Statistical parsing with
an automatically extracted tree adjoining
grammar. In Rens Bod, Remko Scha, 和
Khalil Sima’an, 编辑, Data-Oriented
Parsing, CSLI Studies in Computational
语言学. CSLI出版物, 斯坦福大学,
CA, pages 299–316.

柯林斯, 迈克尔. 1999. Head-Driven Statistical
Models for Natural Language Parsing. 博士.
论文, 宾夕法尼亚大学.

Cormen, Thomas H., 查尔斯·E. Leiserson,

Ronald L. Rivest, and Clifford Stein. 2003.
Introduction to Algorithms. 与新闻界,
剑桥, 2nd edition.

Dienes, P´eter. 2003. Statistical Parsing with
Non-local Dependencies. 博士. 论文,
Saarland University.

Evang, Kilian and Laura Kallmeyer. 2011.

PLCFRS parsing of English discontinuous
constituents. 在第 12 届会议记录中
International Conference on Parsing
Technologies, pages 104–116, 都柏林.

G ´omez-Rodr´ıguez, Carlos, Marco

Kuhlmann, and Giorgio Satta. 2010.
Efﬁcient parsing of well-nested linear
context-free rewriting systems. In Human
语言技术: 这 2010 Annual
Conference of the North American Chapter of
the Association for Computational Linguistics,
pages 276–284, 天使们, CA.

G ´omez-Rodr´ıguez, Carlos, Marco

Kuhlmann, Giorgio Satta, and David Weir.
2009. Optimal reduction of rule length in
linear context-free rewriting systems. 在
Proceedings of Human Language Technologies:
这 2009 Annual Conference of the North
American Chapter of the Association for
计算语言学, pages 539–547,
博尔德, 一氧化碳.

古德曼, Joshua. 2003. Efﬁcient parsing of
DOP with PCFG-reductions. In Rens Bod,
Remko Scha, and Khalil Sima’an, 编辑,
Data-Oriented Parsing, CSLI Studies in
计算语言学. CSLI
Publications, 斯坦福大学, CA, pages 125–146.

大厅, Johan and Joakim Nivre. 2008.
A dependency-driven parser for
German dependency and constituency
陈述. 在诉讼程序中
Workshop on Parsing German at ACL 2008,
pages 47–54, Columbus, 哦.
Han, Chung-hye, Na-Rae Han, 和

Eon-Suk Ko. 2001. Bracketing guidelines
for Penn Korean TreeBank. Technical
报告 01-10, IRCS, 大学
宾夕法尼亚州, 费城, PA.

Hockenmaier, 朱莉娅. 2003. Data and models
for Statistical Parsing with Combinatory
Categorial Grammar. 博士. 论文,
爱丁堡大学.

Hoffman, Beryl. 1995. Integrating “free”
word order syntax and information
结构. In Seventh Conference of the
European Chapter of the Association for
计算语言学, pages 245–251,
都柏林.

H ¨ohle, Tilman. 1986. Der Begriff

“Mittelfeld”—Anmerkungen ¨uber die
Theorie der topologischen Felder.
In Akten des Siebten Internationalen
Germanistenkongresses 1985, G ¨ottingen,
德国.

约翰逊, 标记. 2002. A simple

pattern-matching algorithm for recovering
empty nodes and their antecedents. 在
第 40 届年会会议记录
计算语言学协会,
pages 136–143, 费城, PA.
Kallmeyer, Laura. 2010. Parsing Beyond

Context-Free Grammars. 施普林格, 柏林.
Kallmeyer, Laura and Wolfgang Maier. 2010.
Data-driven parsing with probabilistic
linear context-free rewriting systems.
In Proceedings of the 23rd International
Conference on Computational Linguistics
(科林 2010), pages 537–545, 北京.

Kato, Yuki, Hiroyuki Seki, and Tadao
Kasami. 2006. Stochastic multiple
context-free grammar for RNA
pseudoknot modeling. 在诉讼中
of the Eighth International Workshop
on Tree Adjoining Grammar and Related
Formalisms (TAG+8), pages 57–64,
悉尼.

克莱因, Dan and Christopher D. 曼宁.

2003A. A* Parsing: Fast exact viterbi parse
选择. 在诉讼程序中 2003 人类
Language Technology Conference of the North
American Chapter of the Association for
计算语言学, pages 40–47,
Edmonton.

克莱因, Dan and Christopher D. 曼宁.
2003乙. Accurate unlexicalized parsing.
In Proceedings of the 41st Annual Meeting of
the Association for Computational Linguistics,
pages 423–430, Sapporo.

克莱因, Dan and Christopher D. 曼宁.

2003C. Fast exact inference with a factored
model for natural language parsing. 在
神经信息处理的进展
系统 15 (NIPS), pages 3–10, Vancouver.

Kracht, 马库斯. 2003. The Mathematics
of Language. 数字 63 in Studies in
Generative Grammar. Mouton de Gruyter,
柏林.

117

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
/
C
哦

我
我
/

我

A
r
t
我
C
e
–
p
d

F
/

3
9
1
8
7
1
7
9
9
6
8
1
/
C
哦

我
我

_
A
_
0
0
1
3
6
p
d

乙
y
G
你
e
s
t

哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3

计算语言学

体积 39, 数字 1

K ¨ubler, Sandra. 2005. How do treebank
annotation schemes inﬂuence parsing
结果? Or how not to compare apples
and oranges. In Recent Advances in Natural
语言处理 2005 (RANLP 2005),
pages 293–300, Borovets.

K ¨ubler, Sandra and Gerald Penn, 编辑.
2008. Proceedings of the Workshop on
Parsing German at ACL 2008. 协会
for Computational Linguistics,
Columbus, 哦.

Kuhlmann, Marco and Giorgio Satta.

2009. Treebank grammar techniques for
non-projective dependency parsing.
In Proceedings of the 12th Conference
of the European Chapter of the Association
for Computational Linguistics,
pages 478–486, 雅典.

征收, 罗杰. 2005. Probabilistic Models of

Word Order and Syntactic Discontinuity.
博士. 论文, 斯坦福大学.

征收, Roger and Christopher D. 曼宁.

2004. Deep dependencies from context-free
statistical parsers: Correcting the surface
dependency approximation. 在诉讼中
of the 42nd Meeting of the Association for
计算语言学 (ACL’04), 主要的
体积, pages 328–335, 巴塞罗那.
林, Dekang. 1995. A dependency-based
method for evaluating broad-coverage
parsers. In Proceedings of the 14th
International Joint Conference on Artiﬁcial
智力 (IJCAI 95), pages 1420–1427,
蒙特利尔.

Maier, Wolfgang. 2010. Direct parsing of
discontinuous constituents in German.
In Proceedings of the First Workshop on
Statistical Parsing of Morphologically-Rich
Languages at NAACL HLT 2010,
pages 58–66, 天使们, CA.

Maier, Wolfgang, Miriam Kaeshammer,

and Laura Kallmeyer. 2012. Data-driven
PLCFRS parsing revisited: Restricting
the fan-out to two. 在诉讼程序中
Eleventh International Conference on Tree
Adjoining Grammars and Related Formalisms
(TAG+11), pages 126–134, 巴黎.

Maier, Wolfgang and Laura Kallmeyer. 2010.
Discontinuity and non-projectivity: 使用
mildly context-sensitive formalisms for
data-driven parsing. 在诉讼程序中
Tenth International Workshop on Tree
Adjoining Grammars and Related
Formalisms (TAG+10), 新天堂, CT.
Maier, Wolfgang and Timm Lichte. 2011.

Characterizing discontinuity in constituent
treebanks. In Formal Grammar. 14th
International Conference, FG 2009. Bordeaux,
法国, 七月 25-26, 2009. Revised Selected

118

文件, 体积 5591 of Lecture Notes in
人工智能, pages 167–182,
Springer-Verlag, Berlin/Heidelberg/
纽约.

Maier, Wolfgang and Anders Søgaard. 2008.
Treebanks and mild context-sensitivity.
In Proceedings of the 13th Conference on
Formal Grammar (FG-2008), pages 61–76,
汉堡.

马库斯, 米切尔, Grace Kim, Mary Ann
马尔辛凯维奇, Robert MacIntyre,
Ann Bies, Mark Ferguson, Karen Katz,
and Britta Schasberger. 1994. The Penn
树库: Annotating predicate argument
结构. In Proceedings of the Human
Language Technology Conference,
pages 114–119.

麦当劳, Ryan, Fernando Pereira,
Kiril Ribarov, 和扬·哈吉克. 2005.
Non-projective dependency parsing using
spanning tree algorithms. 在诉讼程序中
Human Language Technology Conference and
Conference on Empirical Methods in Natural
语言处理 (HLT/EMNLP),
pages 523–530, Vancouver.

Michaelis, Jens. 2001. On Formal Properties
of Minimalist Grammars. 博士. 论文,
Universit¨at Potsdam.

M ¨uller, Gereon. 2002. Free word order,

morphological case, and sympathy theory.
In Gisbert Fanselow and Caroline Fery,
编辑, Resolving Conﬂicts in Grammars:
Optimality Theory in Syntax, Morphology,
and Phonology. Buske Verlag, 汉堡,
pages 265–397.

M ¨uller, Stefan. 2004. Continuous or

discontinuous constituents? Research on
语言 & 计算, 2(2):209–257.

Nederhof, Mark-Jan. 2003. Weighted

deductive parsing and knuth’s algorithm.
计算语言学, 29(1):135–143.

Nivre, Joakim, Johan Hall, Jens Nilsson,

Atanas Chanev, G ¨ulsen Eryigit,
Sandra K ¨ubler, Svetoslav Marinov,
and Erwin Marsi. 2007. MaltParser:
A language-independent system for
data-driven dependency parsing. Natural
Language Engineering, 13(2):95–135.
Osenova, Petya and Kiril Simov. 2004.
BTB-TR05: BulTreebank Stylebook.
技术报告 05, BulTreeBank
项目, Soﬁa, Bulgaria.

佩雷拉, Fernando C. 氮. and David Warren.
1983. Parsing as deduction. 在诉讼中
of the 21st Annual Meeting of the Association
for Computational Linguistics, pages 137–144,
剑桥, 嘛.

Petrov, Slav and Dan Klein. 2007. 改进

inference for unlexicalized parsing.

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
/
C
哦

我
我
/

我

A
r
t
我
C
e
–
p
d

F
/

3
9
1
8
7
1
7
9
9
6
8
1
/
C
哦

我
我

_
A
_
0
0
1
3
6
p
d

乙
y
G
你
e
s
t

哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3

Kallmeyer and Maier

PLCFRS Parsing

In Human Language Technologies 2007: 这
Conference of the North American Chapter of
the Association for Computational Linguistics;
Proceedings of the Main Conference,
pages 404–411, 罗切斯特, 纽约.

Petrov, Slav and Dan Klein. 2008. Parsing
German with latent variable grammars.
In Proceedings of the Workshop on Parsing
German at ACL 2008, pages 24–32,
Columbus, 哦.

Plaehn, Oliver. 2004. Computing the most

probable parse for a discontinuous
phrase-structure grammar. In Harry Bunt,
John Carroll, and Giorgio Satta, 编辑,
New Developments in Parsing Technology,
体积 23 of Text, Speech And Language
技术. Kluwer, 多德雷赫特,
pages 91–106.

Rafferty, Anna and Christopher D. 曼宁.
2008. Parsing three German treebanks:
Lexicalized and unlexicalized baselines.
In Proceedings of the Workshop on Parsing
German at ACL 2008, pages 40–46,
Columbus, 哦.

Rambow, 欧文. 2010. The simple

truth about dependency and phrase
structure representations: An opinion
片. In Human Language Technologies:
这 2010 Annual Conference of the North
American Chapter of the Association for
计算语言学, pages 337–340,
天使们, CA.

Rehbein, Ines and Josef van Genabith.

2007. Evaluating evaluation measures.
In Proceedings of the 16th Nordic
Conference of Computational Linguistics,
pages 372–379, Tartu.

Schmid, Helmut. 2000. LoPar: 设计与
implementation. Arbeitspapiere des
Sonderforschungsbereiches 340 149,
IMS, University of Stuttgart, Stuttgart,
德国.

Seddah, Djame, Sandra K ¨ubler, and Reut
Tsarfaty, 编辑. 2010. 诉讼程序
First Workshop on Statistical Parsing of
Morphologically-Rich Languages at
NAACL HLT 2010. 协会
计算语言学,
天使们, CA.

Seki, Hiroyuki, Takahashi Matsumura,

Mamoru Fujii, and Tadao Kasami. 1991.
On multiple context-free grammars.
Theoretical Computer Science, 88(2):191–229.

Shieber, Stuart M., Yves Schabes, 和

Fernando C. 氮. 佩雷拉. 1995. Principles
and implementation of deductive parsing.
Journal of Logic Programming, 24(1–2):3–36.

Sikkel, Klaas. 1997. Parsing Schemata. Texts in
Theoretical Computer Science. 施普林格,
柏林, Heidelberg, 纽约.

Skut, Wojciech, Brigitte Krenn, Thorten
布兰特, and Hans Uszkoreit. 1997. 一个
annotation scheme for free word order
语言. In Proceedings of the Fifth
Conference on Applied Natural Language
加工 (ANLP), pages 88–95,
华盛顿, 直流.

Telljohann, Heike, Erhard W. Hinrichs,

Sandra K ¨ubler, Heike Zinsmeister, 和
Kathrin Beck. 2012. Stylebook for the
T ¨ubingen Treebank of Written German
(T ¨uBa-D/Z). Technical report, Seminar f ¨ur
Sprachwissenschaft, Universit¨at T ¨ubingen,
T ¨ubingen, 德国. http://www.sfs.
uni.tuebingen.de/resources/tuebadz-
stylebook-1201.pdf.

Tsarfaty, Reut, 乔金·尼弗尔, and Evelina

安德森. 2011. Evaluating dependency
解析: Robust and heuristics-free
cross-annotation evaluation. 在诉讼中
的 2011 经验方法会议
自然语言处理博士,
pages 385–396, 爱丁堡.

Uszkoreit, Hans. 1986. Linear precedence
in discontinuous constituents: Complex
fronting in German. CSLI report
CSLI-86-47, Center for the Study of
Language and Information, 斯坦福大学
大学, 斯坦福大学, CA.

van Cranenburgh, Andreas. 2012. Efﬁcient

parsing with linear context-free rewriting
系统. 在第 13 届会议记录中
欧洲分会会议
计算语言学协会,
pages 460–470, 阿维尼翁.

van Cranenburgh, Andreas, Remko Scha,

and Federico Sangati. 2011. Discontinuous
data-oriented parsing: A mildly
context-sensitive all-fragments grammar.
In Proceedings of the Second Workshop on
Statistical Parsing of Morphologically Rich
Languages (SPMRL 2011), pages 34–44,
都柏林.

Versley, Yannick. 2005. Parser evaluation
across text types. 在诉讼程序中
Fourth Workshop on Treebanks and Linguistic
Theories, pages 209–220, 巴塞罗那, 西班牙.

Vijay-Shanker, K., David J. Weir, 和

Aravind K. Joshi. 1987. Characterizing
structural descriptions produced by
various grammatical formalisms. 在
Proceedings of the 25th Annual Meeting of the
计算语言学协会,
pages 104–111, 斯坦福大学, CA.

119

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
/
C
哦

我
我
/

我