Calculating the Optimal Step in Shift-Reduce Dependency Parsing:
From Cubic to Linear Time
Mark-Jan Nederhof
School of Computer Science
University of St Andrews, UK
markjan.nederhof@googlemail.com
Abstract
We present a new cubic-time algorithm to
calculate the optimal next step in shift-reduce
dependency parsing, relative to ground truth,
commonly referred to as dynamic oracle. Un-
like existing algorithms, it is applicable if the
training corpus contains non-projective struc-
tures. We then show that for a projective
training corpus, the time complexity can be
improved from cubic to linear.
1 Introduction
A deterministic parser may rely on a classifier that
predicts the next step, given features extracted
from the present configuration (Yamada and
Matsumoto, 2003; Nivre et al., 2004). It was
found that accuracy improves if the classifier is
trained not just on configurations that correspond
to the ground-truth, or ‘‘gold’’, tree, but also
on configurations that a parser would typically
reach when a classifier strays from the optimal
predictions. This is known as a dynamic oracle.1
The effective calculation of the optimal step for
some kinds of parsing relies on ‘arc-decomposability’,
as in the case of Goldberg and Nivre (2012, 2013).
This generally requires a projective training corpus;
an attempt to extend this to non-projective training
corpora had to resort to an approximation (Aufrant
et al., 2018). It is known how to calculate the optimal
step for a number of non-projective parsing algorithms,
however (Gómez-Rodríguez et al., 2014; Gómez-Rodríguez
and Fernández-González, 2015; Fernández-González and
Gómez-Rodríguez, 2018a); see also de Lhoneux
et al. (2017).

1A term we avoid here, as dynamic oracles are neither
oracles nor dynamic, especially in our formulation, which
allows gold trees to be non-projective. Following, for example,
Kay (2000), an oracle informs a parser whether a step may
lead to the correct parse. If the gold tree is non-projective and
the parsing strategy only allows projective trees, then there
are no steps that lead to the correct parse. At best, there is an
optimal step, by some definition of optimality. An algorithm
to compute the optimal step, for a given configuration, would
typically not change over time, and therefore is not dynamic
in any generally accepted sense of the word.
Ordinary shift-reduce dependency parsing has been
known at least since Fraser (1989); see also
Nasr (1995). Nivre (2008) calls it ‘‘arc-standard
parsing.’’ For shift-reduce dependency parsing,
calculation of the optimal step is regarded as
difficult. The best known algorithm is cubic
and is only applicable if the training corpus is
projective (Goldberg et al., 2014). We present a
new cubic-time algorithm that is also applicable
to non-projective training corpora. Moreover, its
architecture is modular, expressible as a generic
tabular algorithm for dependency parsing plus a
context-free grammar that expresses the allowable
transitions of the parsing strategy. This differs
from approaches that require specialized tabular
algorithms for different kinds of parsing
(Gómez-Rodríguez et al., 2008; Huang and Sagae,
2010; Kuhlmann et al., 2011).
The generic tabular algorithm is interesting in
its own right, and can be used to determine the opti-
mal projectivization of a non-projective tree. This
is not to be confused with pseudo-projectivization
(Kahane et al., 1998; Nivre and Nilsson, 2005),
which generally has a different architecture and is
used for a different purpose, namely, to allow
a projective parser to produce non-projective
structures, by encoding non-projectivity into pro-
jective structures before training, and then recon-
structing potential non-projectivity after parsing.
A presentational difference with earlier work is
that we do not define optimality in terms of ‘‘loss’’
or ‘‘cost’’ functions but directly in terms of attain-
able accuracy. This perspective is shared by Straka
et al. (2015), who also relate accuracies of compet-
ing steps, albeit by means of actual parser output
and not in terms of best attainable accuracies.
Transactions of the Association for Computational Linguistics, vol. 7, pp. 283–296, 2019. Action Editor: François Yvon.
Submission batch: 11/2018; Revision batch: 2/2019; Published 5/2019.
© 2019 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
We further show that if the training corpus
is projective, then the time complexity can be
reduced to linear. To achieve this, we develop a
new approach of excluding computations whose
accuracies are guaranteed not to exceed the accu-
racies of the remaining computations. The main
theoretical conclusion is that arc-decomposability
is not a necessary requirement for efficient cal-
culation of the optimal step.
Despite advances in unrestricted non-projective
parsing, as, for example, Fernández-González
and Gómez-Rodríguez (2018b), many state-of-
the-art dependency parsers are projective, as,
for example, Qi and Manning (2017). One main
practical contribution of the current paper is that
it introduces new ways to train projective parsers
using non-projective trees, thereby enlarging the
portion of trees from a corpus that is available for
training. This can be done either after applying
optimal projectivization, or by computing optimal
steps directly for non-projective trees. This can
be expected to lead to more accurate parsers,
especially if a training corpus is small and a large
proportion of it is non-projective.
2 Preliminaries
In this paper, a configuration (for sentence length
n) is a 3-tuple (α, β, T ) consisting of a stack α,
which is a string of integers each between 0 and n,
a remaining input β, which is a suffix of the string
1 · · · n, and a set T of pairs (a, a′) of integers,
with 0 ≤ a ≤ n and 1 ≤ a′ ≤ n. Further, αβ is
a subsequence of 0 1 · · · n, starting with 0.
Integer 0 represents an artificial input position,
not corresponding to any actual token of an input
sentence.
An integer a′ (1 ≤ a′ ≤ n) occurs as the second
element of a pair (a, a′) ∈ T if and only if it
does not occur in αβ. Furthermore, for each a′
there is at most one a such that (a, a′) ∈ T. If
(a, a′) ∈ T then a′ is generally called a dependent
of a, but as we will frequently need concepts from
graph theory in the remainder of this article, we
will consistently call a′ a child of a and a the
parent of a′; if a′ < a then a′ is a left child and if
a < a′ then it is a right child. The terminology is
extended in the usual way to include descendants
and ancestors. Pairs (a, a(cid:48)) will henceforth be
called edges.
For sentence length n, the initial configuration
is (0, 1 2 · · · n, ∅), and a final configuration is
of the form (0, ε, T), where ε denotes the empty
string. The three transitions of shift-reduce
dependency parsing are given in Table 1.

shift:
(α, bβ, T) ⊢ (αb, β, T)
reduce left:
(αa1a2, β, T) ⊢ (αa1, β, T ∪ {(a1, a2)})
reduce right:
(αa1a2, β, T) ⊢ (αa2, β, T ∪ {(a2, a1)}), provided |α| > 0

Table 1: Shift-reduce dependency parsing.

By step we mean the application of a transition
on a particular configuration. By computation we
mean a series of steps, the formal notation of which
uses ⊢∗, the reflexive, transitive closure of ⊢. If
(0, 1 2 · · · n, ∅) ⊢∗ (0, ε, T), then T represents a
tree, with 0 as root element, and T is projective,
which means that for each node, the set of its
descendants (including that node itself) is of the
form {a, a + 1, . . . , a′ − 1, a′}, for some a and a′.
In general, a dependency tree is any tree of nodes
labelled 0, 1, . . . , n, with 0 being the root.
The score of a tree T for a sentence is the
number of edges that it has in common with a
given gold tree Tg for that sentence, or formally
|T ∩ Tg|. The accuracy is the score divided by n.
Note that neither tree need be projective for the
score to be defined, but in this paper the first tree,
T , will normally be projective. Where indicated,
also Tg is assumed to be projective.
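To make this concrete, here is a minimal Python sketch, with names of our own choosing, of the configurations and transitions of Table 1, together with the score against a gold tree:

    def shift(config):
        # (α, bβ, T) ⊢ (αb, β, T)
        stack, buf, edges = config
        return (stack + [buf[0]], buf[1:], edges)

    def reduce_left(config):
        # (α a1 a2, β, T) ⊢ (α a1, β, T ∪ {(a1, a2)})
        stack, buf, edges = config
        a1, a2 = stack[-2], stack[-1]
        return (stack[:-1], buf, edges | {(a1, a2)})

    def reduce_right(config):
        # (α a1 a2, β, T) ⊢ (α a2, β, T ∪ {(a2, a1)}), provided |α| > 0
        stack, buf, edges = config
        assert len(stack) >= 3, "there must be an element below a1"
        a1, a2 = stack[-2], stack[-1]
        return (stack[:-2] + [a2], buf, edges | {(a2, a1)})

    def score(tree, gold):
        # number of edges shared with the gold tree, |T ∩ Tg|
        return len(tree & gold)

    # initial configuration for n = 3, and one complete computation
    c = ([0], [1, 2, 3], set())
    c = shift(c); c = shift(c); c = reduce_left(c)   # 1 takes child 2
    c = shift(c); c = reduce_left(c)                 # 1 takes child 3
    c = reduce_left(c)                               # 0 takes child 1
    assert c == ([0], [], {(1, 2), (1, 3), (0, 1)})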
Assume an arbitrary configuration (α, β, T)
for sentence length n and assume a gold tree
Tg for a sentence of that same length, and assume
three steps (α, β, T) ⊢ (αi, βi, Ti), with
i = 1, 2, 3, obtainable by a shift, reduce left,
or reduce right, respectively. (If β = ε, or
|α| ≤ 2, then naturally some of the three
transitions need to be left out of consideration.)
We now wish to calculate, for each of i = 1, 2, 3,
the maximum value of |T′i ∩ Tg|, for any T′i such
that (αi, βi, Ti) ⊢∗ (0, ε, T′i). For i = 1, 2, 3, let
σi be this maximum value. The absolute scores σi
are strictly speaking irrelevant; the relative values
determine which is the optimal step, or which are
the optimal steps, to reach a tree with the highest
score. Note that |{i | σi = maxj σj}| is either 1,
2, or 3. In the remainder of this article, we find
it more convenient to calculate σi − |T ∩ Tg| for
each i; or, in other words, gold edges that were
previously found are left out of consideration.
We can put restrictions on the set of allowable
computations (α, β, T) ⊢∗ (0, ε, T ∪ T′). The
left-before-right strategy demands that all edges
(a, a′) ∈ T′ with a′ < a are found before any
edges (a, a′) ∈ T′ with a < a′, for each a that is
rightmost in α or that occurs in β. The strict left-
before-right strategy in addition disallows edges
(a, a′) ∈ T′ with a′ < a for each a in α other
than the rightmost element. The intuition is that
a non-strict strategy allows us to correct mistakes
already made: If we have already pushed other
elements on top of a stack element a, then a will
necessarily obtain right children before it occurs
on top of the stack again, when it can take (more)
left children. By contrast, the strict strategy would
not allow these left children.
The definition of the right-before-left strategy is
symmetric to that of the left-before-right strategy,
but there is no independent strict right-before-
left strategy. In this paper we consider all three
strategies in order to emphasize the power of our
framework. It is our understanding that Goldberg
et al. (2014) does not commit to any particular
strategy.
3 Tabular Dependency Parsing
We here consider context-free grammars (CFGs)
of a special form, with nonterminals in N ∪ (Nℓ ×
Nr), for appropriate finite sets N, Nℓ, Nr, which
need not be disjoint. The finite set of terminals is
denoted Σ. There is a single start symbol S ∈ N.
Rules are of one of the forms:

• (B, C) → a,
• A → (B, C),
• (B′, C) → A (B, C),
• (B, C′) → (B, C) A,

where A ∈ N, B, B′ ∈ Nℓ, C, C′ ∈ Nr, a ∈ Σ.
A first additional requirement is that if (B′, C) →
A (B, C) is a rule, then (B′, C′) → A (B, C′),
for any C′ ∈ Nr, is also a rule, and if (B, C′) →
(B, C) A is a rule, then (B′, C′) → (B′, C) A,
for any B′ ∈ Nℓ, is also a rule. This justifies
our notation of such rules in the remainder of
this paper as (B′, ) → A (B, ) and ( , C′) →
( , C) A, respectively. These two kinds of rules
correspond to attachment of left and right children,
respectively, in dependency parsing. Secondly, we
require that there is precisely one rule (B, C) → a
for each a ∈ Σ.
Wℓ(B, i, i) = 1 if (B, C) → ai, and 0 otherwise

Wr(C, i, i) = 1 if (B, C) → ai, and 0 otherwise

Wℓ(B, C, i, j) = ⊕_k Wr(B, i, k) ⊗ Wℓ(C, k + 1, j) ⊗ w(j, i)

Wr(B, C, i, j) = ⊕_k Wr(B, i, k) ⊗ Wℓ(C, k + 1, j) ⊗ w(i, j)

Wℓ(C′, i, j) = ⊕_{A→(D,B), (C′, )→A (C, ), k} Wℓ(D, i, k) ⊗ Wℓ(B, C, k, j)

Wr(B′, i, j) = ⊕_{( ,B′)→( ,B) A, A→(C,D), k} Wr(B, C, i, k) ⊗ Wr(D, k, j)

W = ⊕_{S→(B,C)} Wℓ(B, 0, 0) ⊗ Wr(C, 0, n)
Table 2: Weighted parsing, for an arbitrary semiring, with 0 ≤ i < j ≤ n.
Note that the additional requirements make the
grammar explicitly ‘‘split’’ in the sense of Eisner
and Satta (1999), Eisner (2000), and Johnson
(2007). That is, the two processes of attaching left
and right children, respectively, are independent,
with rules (B, C) → a creating ‘‘initial states’’
B and C, respectively, for these two processes.
Rules of the form A → (B, C) then combine
the end results of these two processes, possibly
placing constraints on allowable combinations of
B and C.
To bring out the relation between our subclass
of CFGs and bilexical grammars, one could explicitly
write (B, C)(a) → a, A(a) → (B, C)(a),
(B′, )(b) → A(a) (B, )(b), and ( , C′)(c) →
( , C)(c) A(a).
Purely symbolic parsing is extended to weighted
parsing much as usual, except that instead of
attaching weights to rules, we attach a score w(i, j)
to each pair (i, j), which is a potential edge. This
can be done for any semiring. In the semiring
we will first use, a value is either a non-negative
integer or −∞. Further, w1 ⊕ w2 = max(w1, w2),
and w1 ⊗ w2 = w1 + w2 if w1 ≠ −∞ and w2 ≠
−∞, and w1 ⊗ w2 = −∞ otherwise. Naturally, the
identity element of ⊕ is 0 = −∞ and the identity
element of ⊗ is 1 = 0.
Tabular weighted parsing can be realized
following Eisner and Satta (1999). We assume
the input is a string a0a1 · · · an ∈ Σ∗, with a0
being the prospective root of a tree. Table 2
presents the cubic-time algorithm in the form
of a system of recursive equations. With the
semiring we chose above, Wℓ(B, i, j) represents
the highest score of any rightmost derivation of
the form (B, ) ⇒ A1 (B1, ) ⇒ A1 A2 (B2, ) ⇒∗
A1 · · · Am (Bm, ) ⇒ A1 · · · Am aj ⇒∗ ai · · · aj,
for some m ≥ 0, and Wr(C, i, j) has symmetric
meaning. Intuitively, Wℓ(B, i, j) considers aj and
its left dependents and Wr(C, i, j) considers ai
and its right dependents. A value Wℓ(B, C, i, j),
or Wr(B, C, i, j), represents the highest score
combining ai and its right dependents and aj and
its left dependents, meeting in the middle at some
k, including also an edge from ai to aj, or from
aj to ai, respectively.

S → (S, S)
(S, S) → a
( , S) → ( , S) S
(S, ) → S (S, )

Table 3: Grammar for projective dependency
parsing, with Σ = {a} and N = Nℓ = Nr = {S}.

Figure 1: Dependency structure and corresponding parse
tree that encodes a computation of a shift-reduce parser.
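As a hedged illustration, the recursions of Table 2 can be instantiated in the max-plus semiring for the single-nonterminal grammar of Table 3, where they collapse to a familiar split-head form; the following Python sketch, under our own naming, runs in the stated cubic time:

    from functools import lru_cache

    def best_projective_score(n, w):
        # Positions are 0..n, with 0 the artificial root; w(i, j) is the
        # weight of the potential edge from parent i to child j.
        @lru_cache(maxsize=None)
        def Wl(i, j):
            # a_j together with its left dependents, spanning i..j
            if i == j:
                return 0.0
            return max(Wl(i, k) + Wl2(k, j) for k in range(i, j))

        @lru_cache(maxsize=None)
        def Wr(i, j):
            # a_i together with its right dependents, spanning i..j
            if i == j:
                return 0.0
            return max(Wr2(i, k) + Wr(k, j) for k in range(i + 1, j + 1))

        @lru_cache(maxsize=None)
        def Wl2(i, j):
            # a_i (with its right dependents) meets the left state of a_j
            # at some k, plus the edge from a_j to a_i
            return max(Wr(i, k) + Wl(k + 1, j) + w(j, i) for k in range(i, j))

        @lru_cache(maxsize=None)
        def Wr2(i, j):
            # as Wl2, but with the edge from a_i to a_j instead
            return max(Wr(i, k) + Wl(k + 1, j) + w(i, j) for k in range(i, j))

        return Wl(0, 0) + Wr(0, n)   # the value W of Table 2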
One may interpret the grammar in Table 3 as
encoding all possible computations of a shift-
reduce parser, and thereby all projective trees. As
there is only one way to instantiate the under-
scores, we obtain rule (S, S) → (S, S) S, which
corresponds to reduce left, and rule (S, S) →
S (S, S), which corresponds to reduce right.
Figure 1 presents a parse tree for the grammar
and the corresponding dependency tree. Note that
if we are not given a particular strategy, such as
left-before-right, then the parse tree underspec-
ifies whether left children or right children are
attached first. This is necessarily the case because
the grammar is split. Therefore, the computation
in this example may consist of three shifts, fol-
lowed by one reduce left, one reduce right,
and one reduce left, or it may consist of two
shifts, one reduce right, one shift, and two
reduce lefts.
P → (P, P)
S → (P, S)
S → (S, S)
(P, P) → p
(S, S) → s
(S, ) → P (S, )
(S, ) → S (S, )
( , S) → ( , P) S
( , S) → ( , S) S
(S, ) → P (P, )

Table 4: Grammar for dependency parsing of
p^k s^{m+1}, representing a stack of length k + 1 and
remaining input of length m, with Σ = {p, s},
N = Nr = Nℓ = {P, S}. The last rule would be
excluded for the strict left-before-right strategy, or
alternatively one can set w(i, j) = −∞ for
j < i < k.
For a given gold tree Tg, which may or may not
be projective, we let w(i, j) = δg(i, j), where we
define δg(i, j) = 1 if (i, j) ∈ Tg and δg(i, j) = 0
otherwise. With the grammar from Table 3, the
value W found by weighted parsing is now the
score of the most accurate projective tree. By
backtracing from W as usual, we can construct
the (or more correctly, a) tree with that highest
accuracy. We have thereby found an effective way
to projectivize a treebank in an optimal way. By
a different semiring, we can count the number
of trees with the highest accuracy, which reflects
the degree of ‘‘choice’’ when projectivizing a
treebank.
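For instance, under the assumptions of the sketch above, optimal projectivization amounts to running it with w(i, j) = δg(i, j); the gold tree below is a made-up example:

    # hypothetical non-projective gold tree over positions 0..4:
    # node 4 dominates exactly {1, 4}, which is not contiguous
    gold = {(0, 2), (2, 3), (2, 4), (4, 1)}
    w = lambda i, j: 1.0 if (i, j) in gold else 0.0
    print(best_projective_score(4, w))   # 3.0: one gold edge must be given up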
4 O(n³) Time Algorithm
In a computation starting from a configuration
(a0 · · · ak, b1 · · · bm, T ), not every projective parse
of the string a0 · · · akb1 · · · bm is achievable. The
structures that are achievable are captured by the
grammar in Table 4, with P for prefix and S for
suffix (also for ‘‘start symbol’’). Nonterminals P
and (P, P ) correspond to a node ai (0 ≤ i < k)
that does not have children. Nonterminal S
corresponds to a node that has either ak or some
bj (1 ≤ j ≤ m) among its descendants. This then
means that the node will appear on top of the stack
at some point in the computation. Nonterminal
(S, S) also corresponds to a node that has one of
the rightmost m + 1 nodes among its descendants,
and, in addition, if it itself is not one of the
rightmost m + 1 nodes, then it must have a left
child.
Nonterminal (P, S) corresponds to a node ai
(0 ≤ i < k) that has ak among its descendants
but that does not have a left child. Nonterminal
(S, P) corresponds to a node ai (0 ≤ i < k) that
has a left child but no right children. For ai to be
given a left child, it is required that it eventually
appear on top of the stack. This requirement is
encoded in the absence of a rule with right-hand
side (S, P). In other words, (S, P) cannot be
part of a successful derivation, unless the rule
(S, S) → (S, P) S is subsequently used, which
then corresponds to giving ai a right child that has
ak among its descendants.

Figure 2: Dependency structure and corresponding
parse tree, for stack of height 4 and remaining input of
length 1.

Figure 2 shows an example. Note that we can
partition a parse tree into ‘‘columns’’, each consisting
of a path starting with a label in N, then
a series of labels in Nℓ × Nr, and ending with a
label in Σ.

A dependency structure that is not achievable,
and that appropriately does not correspond to a
parse tree, for a stack of height 4 and remaining
input of length 1, is:

• • • • | •

Suppose we have a configuration (a0 · · · ak,
b1 · · · bm, T) for sentence length n, which implies
k + m ≤ n. We need to decide whether a shift,
reduce left, or reduce right should be done
in order to achieve the highest accuracy, for given
gold tree Tg. For this, we calculate three values
σ1, σ2, and σ3, and determine which is highest.

The first value σ1 is obtained by investigating
the configuration (a0 · · · ak b1, b2 · · · bm, ∅)
resulting after a shift. We run our generic tabular
algorithm for the grammar in Table 4, for
input p^{k+1} s^m, to obtain σ1 = W. The scores
are obtained by translating indices of a0 · · ·
ak b1 · · · bm = c0 · · · ck+m to indices in the original
input, that is, we let w(i, j) = δg(ci, cj).
However, the shift, which pushes an element on
top of ak, implies that ak will obtain right children
before it can obtain left children. If we assume the
left-before-right strategy, then we should prevent
ak from obtaining left children. We could do that
by refining the grammar, but find it easier to set
w(k, i) = −∞ for all i < k.

For the second value σ2, we investigate the configuration
(a0 · · · ak−1, b1 · · · bm, ∅) resulting after
a reduce left. The same grammar and algorithm
are used, now for input p^{k−1} s^{m+1}. With a0 · · ·
ak−1 b1 · · · bm = c0 · · · ck+m−1, we let w(i, j) =
δg(ci, cj). We let σ2 = W ⊗ δg(ak−1, ak). In
case of a strict left-before-right strategy, we set
w(k − 1, i) = −∞ for i < k − 1, to avoid that
ak−1 obtains left children after having obtained a
right child ak.

If k ≤ 1 then the third value is σ3 = −∞,
as no reduce right is applicable. Otherwise we
investigate (a0 · · · ak−2 ak, b1 · · · bm, ∅). The same
grammar and algorithm are used as before, and
w(i, j) = δg(ci, cj) with a0 · · · ak−2 ak b1 · · · bm =
c0 · · · ck+m−1. Now σ3 = W ⊗ δg(ak, ak−1).
In case of a right-before-left strategy, we set
w(k, i) = −∞ for k < i.
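The following Python sketch outlines the three calculations. It assumes a function table4_score(num_p, num_s, w), passed in as a parameter and not shown here, that runs the recursions of Table 2 with the grammar of Table 4 on input p^num_p s^num_s and returns W; the interface and all names are illustrative assumptions, and the masking encodes the (non-)strict left-before-right strategy as just described:

    NEG_INF = float("-inf")

    def optimal_step_scores(stack, buf, gold, table4_score, strict=False):
        # stack = [a_0, ..., a_k], buf = [b_1, ..., b_m]; the returned values
        # are sigma_1, sigma_2, sigma_3, up to the constant |T ∩ Tg|
        k, m = len(stack) - 1, len(buf)
        dg = lambda x, y: 1.0 if (x, y) in gold else 0.0

        def weights(seq, banned):
            # w(i, j) = delta_g(c_i, c_j), with -inf on banned pairs
            return lambda i, j: NEG_INF if (i, j) in banned else dg(seq[i], seq[j])

        if m == 0:
            sigma1 = NEG_INF                  # no shift possible
        else:
            # (a_0..a_k b_1, b_2..b_m, ∅): forbid left children of a_k
            seq = stack + buf
            sigma1 = table4_score(k + 1, m,
                                  weights(seq, {(k, i) for i in range(k)}))

        if k < 1:
            sigma2 = NEG_INF                  # no reduce left possible
        else:
            # (a_0..a_{k-1}, b_1..b_m, ∅), plus credit for edge (a_{k-1}, a_k)
            seq = stack[:-1] + buf
            banned = {(k - 1, i) for i in range(k - 1)} if strict else set()
            sigma2 = (table4_score(k - 1, m + 1, weights(seq, banned))
                      + dg(stack[-2], stack[-1]))

        if k <= 1:
            sigma3 = NEG_INF                  # no reduce right possible
        else:
            # (a_0..a_{k-2} a_k, b_1..b_m, ∅), plus credit for edge (a_k, a_{k-1})
            seq = stack[:-2] + [stack[-1]] + buf
            sigma3 = (table4_score(k - 1, m + 1, weights(seq, set()))
                      + dg(stack[-1], stack[-2]))

        return sigma1, sigma2, sigma3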
We conclude that the time complexity of calculating
the optimal step is three times the time
complexity of the algorithm of Table 2, hence
cubic in n.

For a proof of correctness, it is sufficient to show
that each parse tree by the grammar in Table 4
corresponds to a computation with the same score,
and conversely that each computation corresponds
to an equivalent parse tree. Our grammar has
spurious ambiguity, just as the shift-reduce parser
from Table 1, and this can be resolved in the
same way, depending on whether the intended
strategy is (non-)strict left-before-right or right-
before-left, and whether the configuration is the
result of a shift, reduce left, or reduce right.
Concretely, we can restrict parse trees to attach
children lower in the tree if they would be attached
earlier in the computation, and thereby we obtain
a bijection between parse trees and computations.
For example, in the middle column of the parse
tree in Figure 2, the (P, S) and its right child occur
below the (S, S) and its left child, to indicate that
the reduce left precedes the reduce right.

Figure 3: A node ν with label in Nℓ × Nr translates to
configuration (d̄1 · · · d̄k′, ē1 · · · ēm′, T), via its shortest
path to the root. The overlined symbols denote the
integers between 0 and n corresponding to d1 · · ·
dk′ e1 · · · em′ ∈ p+s∗.
The proof in one direction assumes a parse
tree, which is traversed to gather the steps of a
computation. This traversal is post-order, from
left to right, but skipping the nodes representing
stack elements below the top of the stack, starting
from the leftmost node labeled s. Each node ν
with a label in Nℓ × Nr corresponds to a step.
If the child of ν is labeled s, then we have a
shift, and if it has a right or left child with a label
in N , then it corresponds to a reduce left or
reduce right, respectively. The configuration
resulting from that step can be constructed as
sketched in Figure 3. We follow the shortest path
from ν to the root. All the leaves to the right of
the path correspond to the remaining input. For
the stack, we gather the leaves in the columns
of the nodes on the path, as well as those of the
left children of nodes on the path. Compare this
with the concept of right-sentential forms in the
theory of context-free parsing.
For a proof in the other direction, we can make
use of existing parsing theory, which tells us
how to translate a computation of the shift-reduce
parser to a dependency structure, which in turn is
easily translated to an undecorated parse tree. It
then remains to show that the nodes in that tree can
be decorated (in fact in a unique way), according
to the rules from Table 4. This is straightforward
given the meanings of P and S described earlier
in this section. Most notably, the absence of a rule
with right-hand side ( , P) P does not prevent
the decoration of a tree that was constructed out
of a computation, because a reduction involving
two nodes within the stack is only possible if the
rightmost of these nodes eventually appears on
top of the stack, which is only possible when the
computation has previously made ak a descendant
of that node, hence we would have S rather
than P.

Figure 4: Components C1, C2, C3 partitioning nodes in
β, and gold edges linking them to α.
5 O(n²) Time Algorithm
Assume a given configuration (α, β, T ) as before,
resulting from a shift or reduction. Let α =
a0 · · · ak, A = {a0, . . . , ak}, and let B be the
set of nodes in β. We again wish to calculate
the maximum value of |T′ ∩ Tg| for any T′ such
that (α, β, ∅) ⊢∗ (0, ε, T′), but now under the
assumption that Tg is projective. Let us call this
value σmax . We define w in terms of δg as in
the previous section, setting w(i, j) = −∞ for
an appropriate subset of pairs (i, j) to enforce
a strategy that is (non-)strict left-before-right or
right-before-left.
The edges in Tg ∩ (B × B) partition the re-
maining input into maximal connected compo-
nents. Within these components, a node b ∈ B
is called critical if it satisfies one or both of the
following two conditions:
• At least one descendant of b (according to
Tg) is not in B.
• The parent of b (according to Tg) is not in B.
Let Bcrit ⊆ B be the set of critical nodes, listed
in order as b1, . . . , bm, and let Bncrit = B \ Bcrit .
Figure 4 sketches three components as well as
edges in Tg ∩ (A × B) and Tg ∩ (B × A). Com-
ponent C1, for example, contains the critical
elements b1, b2, and b3. The triangles under b1,
. . . , b7 represent subtrees consisting of edges
leading to non-critical nodes. For each b ∈ Bcrit ,
|Tg ∩ ({b} × A)| is zero or more, or in words,
critical nodes have zero or more children in the
stack. Further, if (a, b) ∈ Tg ∩ (A × Bcrit ), then
b is the rightmost critical node in a component;
examples are b5 and b7 in the figure.
Let Tmax be any tree such that (α, β, ∅) ⊢∗
(0, ε, Tmax) and |Tmax ∩ Tg| = σmax. Then we can
find another tree T′max that has the same properties
and in addition satisfies:

1. Tg ∩ (B × Bncrit) ⊆ T′max,

2. T′max ∩ (Bncrit × A) = ∅,

3. T′max ∩ (B × Bcrit) ⊆ Tg,
or in words, (1) the subtrees rooted in the critical
nodes are entirely included, (2) no child of a
non-critical node is in the stack, and (3) within
the remaining input, all edges to critical nodes
are gold. Very similar observations were made
before by Goldberg et al. (2014), and therefore
we will not give full proofs here. The structure
of the proof is in each case that all violations
of a property can be systematically removed, by
rearranging the computation, in a way that does
not decrease the score.
We need two more properties:
4. If (a, b) ∈ T′max ∩ (A × Bcrit) \ Tg then
either:

• b is the rightmost critical node in its
component, or

• there is (b, a′) ∈ T′max ∩ Tg, for some
a′ ∈ A, and there is at least one other
critical node b′ to the right of b, but in
the same component, such that (b′, a′′) ∈
T′max ∩ Tg or (a′′, b′) ∈ T′max ∩ Tg, for
some a′′ ∈ A.

5. If (b, a) ∈ T′max ∩ (Bcrit × A) \ Tg then there
is (b, a′) ∈ T′max, for some a′ ∈ A, such that
a′ is a sibling of a immediately to its right.
Figure 5, to be discussed in more detail later,
illustrates property (4) for the non-gold edge from
a4; this edge leads to b4 (which has outgoing
gold edge to a5) rather than to b5 or b6. It further
respects property (4) because of the gold edges
connected to b7 and b8, which occur to the right
of b4 but in the same component. Property (5) is
illustrated for the non-gold edge from b3 to a8,
which has sibling a9 immediately to the right.
The proof that property (4) may be assumed to
hold, without loss of generality, again involves
making local changes to the computation, in
particular replacing the b in an offending non-
gold edge (a, b) ∈ A × Bcrit by another critical
node b′ further to the left or at the right end of
the component. Similarly, for property (5), if we
have an offending non-gold edge (b, a), then we
can rearrange the computation, such that node a is
reduced not into b but into one of the descendants
of b in B that was given children in A. If none of
the descendants of b in B was given children in A,
then a can instead be reduced into its neighbor in
the stack immediately to the left, without affecting
the score.

Figure 5: Counting additional gold edges in A × Bcrit ∪
Bcrit × A. Gold edges are thick, others are thin. Gold
edges that are not created appear dotted.
By properties (1)–(3), we can from here on
ignore non-critical nodes, so that the remaining
task is to calculate σmax − |Bncrit |. In fact, we go
further than that and calculate σmax − M , where
M = |Tg ∩ (B × B)|. In other words, we take
for granted that the score can be at least as much
as the number of gold edges within the remaining
input, which leaves us with the task of counting the
additional gold edges in the optimal computation.
For any given component C we can consider the
sequence of edges that the computation creates
between A and C, in the order in which they are
created:
• for the first gold edge between C and A, we
count +1,
• for each subsequent gold edge between C and
A, we count +1,
• we ignore interspersed non-gold edges from
C to A,
• but following a non-gold edge from A to
C, the immediately next gold edge between
C and A is not counted, because that non-
gold edge implies that another gold edge in
Bcrit × Bcrit cannot be created.
This is illustrated by Figure 5. For (b3, a9) we
count +1, it being the first gold edge connected
to the component. For the subsequent three gold
edges, we count +1 for each, ignoring the non-
gold edge (b3, a8). The non-gold edge (a4, b4)
implies that the parent of b4 is already determined.
One would then perhaps expect us to count −1 for
non-creation of (b5, b4), considering (b5, b4) was
already counted as part of M . Instead, we let this
−1 cancel out against the following (b7, a3), by
letting the latter contribute +0 rather than +1.
The subsequent edge (b7, a2) again contributes
+1, but the non-gold edge (a1, b7) means that the
subsequent (a0, b8) contributes +0. Hence the net
count in this component is 5.
The main motivation for properties (1)–(5) is
that they limit the input positions that can be
relevant for a node that is on top of the stack,
thereby eliminating one factor m from the time
complexity. More specifically,
the gold edges
relate a stack element to a ‘‘current critical node’’
in a ‘‘current component’’. We need to distinguish
however between three possible states:
• N (none): none of the critical nodes from
the current component were shifted on to the
stack yet,
• C (consumed): the current critical node was
‘consumed’ by it having been shifted and
assigned a parent,
• F (fresh): the current critical node was not
consumed, but at least one of the preceding
critical nodes in the same component was
consumed.
For 0 ≤ i ≤ k, we define p(i) to be the index
j such that (bj, ai) ∈ Tg, and if there is no such
j, then p(i) = ⊥, where ⊥ denotes ‘undefined’.
For 0 ≤ i < k, we let p≥(i) = p(i) if p(i) ≠ ⊥,
and p≥(i) = p≥(i + 1) otherwise, and further
p≥(k) = p(k). Intuitively, we seek a critical node
that is the parent of ai, or if there is none, of
ai+1, . . . We define c(i) to be the smallest j such
that (ai, bj) ∈ Tg, or in words, the index of the
leftmost child in the remaining input, and c(i) = ⊥
if there is none.

As representative element of a component with
critical element bj we take the critical element
that is rightmost in that component, or formally,
we define R(j) to be the largest j′ such that bj′
is an ancestor (by Tg ∩ (Bcrit × Bcrit)) of bj.
For completeness, we define R(⊥) = ⊥. We let
P(i) = R(p(i)) and P≥(i) = R(p≥(i)). Note that
R(c(i)) = c(i) for each i. For 0 ≤ i ≤ k and
1 ≤ j ≤ m, we let p′(i, j) = p(i) if P(i) = R(j)
and p′(i, j) = ⊥ otherwise; or in words, p′(i, j)
is the index of the parent of ai in the remaining
input, provided it is in the same component as bj.

score(i, j, q) = 0 if i < 0, otherwise:

score(i, j, q) =
    [nchildren(i) − Δ(c(i) = P≥(j) ∧ q ≠ N)] ⊗ w(i, j) ⊗ score(i − 1, i, τ(i, j, q))
    ⊕ w(j, i) ⊗ score(i − 1, j, q)
    ⊕ [if p(j) = ⊥ ∨ q = C then −∞ else Δ(q = N) ⊗ score′(i, p(j))]

score′(i, j) = 0 if i < 0, otherwise:

score′(i, j) =
    [if p′(i, j) = ⊥ then score′(i − 1, j) else 1 ⊗ score′(i − 1, p′(i, j))]
    ⊕ nchildren(i) ⊗ score(i − 1, i, τ′(i, j))

nchildren(i) = |{j | w(i, j + k) = 1}|

τ(i, j, q) = if q = N ∨ P≥(i) ≠ P≥(j) then N; else if p≥(i) ≠ p≥(j) then F; else q

τ′(i, j) = if P≥(i) ≠ R(j) then N; else if p≥(i) ≠ j then F; else C

Table 5: Quadratic-time algorithm.
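To make these definitions concrete, here is a sketch, under our own naming and with no claim to match the paper's implementation, of how the tables could be precomputed:

    def build_tables(a, b, gold):
        # a[0..k]: stack elements; b[1..m]: critical nodes, left to right
        # (b[0] is unused); gold: the set of gold edges
        k, m = len(a) - 1, len(b) - 1

        # p(i): the j with (b_j, a_i) in Tg, else None (for ⊥);
        # c(i): the smallest j with (a_i, b_j) in Tg, else None
        p = [next((j for j in range(1, m + 1) if (b[j], a[i]) in gold), None)
             for i in range(k + 1)]
        c = [next((j for j in range(1, m + 1) if (a[i], b[j]) in gold), None)
             for i in range(k + 1)]

        # p≥(i) = p(i) if defined, else p≥(i + 1); p≥(k) = p(k)
        p_geq = list(p)
        for i in range(k - 1, -1, -1):
            if p_geq[i] is None:
                p_geq[i] = p_geq[i + 1]

        # parent of b_j among the critical nodes, via Tg ∩ (Bcrit × Bcrit)
        par = [None] + [next((j2 for j2 in range(1, m + 1)
                              if (b[j2], b[j]) in gold), None)
                        for j in range(1, m + 1)]

        def R(j):
            # largest j' such that b_{j'} is an ancestor of b_j
            if j is None:
                return None
            best = j
            while par[j] is not None:
                j = par[j]
                best = max(best, j)
            return best

        P = [R(p[i]) for i in range(k + 1)]
        P_geq = [R(p_geq[i]) for i in range(k + 1)]

        def p_prime(i, j):
            # p(i), provided a_i's parent is in the same component as b_j
            return p[i] if P[i] is not None and P[i] == R(j) else None

        return p, p_geq, c, R, P, P_geq, p_prime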
Table 5 presents the algorithm, expressed as a
system of recursive equations. Here score(i, j, q)
represents the maximum number of gold edges
(in addition to M) in a computation from
(a0 · · · ai aj, bℓ · · · bm, ∅), where ℓ depends on the
state q ∈ {N, C, F}. If q = N, then ℓ is
the smallest number such that R(ℓ) = P≥(j);
critical nodes from the current component were
not yet shifted. If q = C, then ℓ = p≥(j) + 1 or
ℓ = P≥(j) + 1; this can be related to the two cases
distinguished by property (4). If q = F, then
ℓ is greater than the smallest number such that
R(ℓ) = P≥(j), but smaller than or equal to p≥(j),
or equal to P≥(j) + 1. Similarly, score′(i, j)
represents the maximum number of gold edges in
a computation from (a0 · · · ai bj, bj+1 · · · bm, ∅).
For i ≥ 0, the value of score(i, j, q) is the
maximum (by ⊕) of three values. The first
corresponds to a reduction of aj into ai, which
turns the stack into a0 · · · ai−1ai; this would also
include shifts of any remaining right children of
ai, if there are any, and their reduction into ai.
Because there is a new top-of-stack, the state is
updated using τ . The function nchildren counts
the critical nodes that are children of ai. We
define nchildren in terms of w rather than Tg,
as in the case of the right-before-left strategy
after a reduce right, we would preclude right
children of ak by setting w(k, i) = −∞ for k < i.

Figure 6: Graphical representation of the first value in
the definition of score, for the case q = F, assuming
c(i) = P≥(j) = ℓ and ai further has children bℓ+1 and
bℓ+2. Because q = F, there was some other node bℓ′
in the same component that was shifted onto the stack
earlier and given a (non-gold) parent; let us assume
ℓ′ = ℓ − 1. We can add 3 children to the score, but
should subtract Δ(c(i) = P≥(j) ∧ q ≠ N) = 1, to
compensate for the fact that edge (bℓ, bℓ−1) cannot be
constructed, as bℓ−1 can only have one parent. If we
further assume ai has a parent among the critical nodes,
then that parent must be in a different component, and
therefore τ(i, j, q) = N.
The leftmost of the children, at index c(i), is not
counted (or in other words, 1 is subtracted from
the number of children) if it is in the current
component P≥(j) and the state is anything
other than ‘none’; here Δ is the indicator function,
which returns 1 if its Boolean argument evaluates
to true, and 0 otherwise. Figure 6 illustrates one
possible case.
The second value corresponds to a reduction of
ai into aj, which turns the stack into a0 · · · ai−1aj,
leaving the state unchanged as the top of the stack
is unchanged. The third value is applicable if aj
has parent bℓ that has not yet been consumed, and
it corresponds to a shift of bℓ and a reduction of ai
into bℓ (and possibly further shifts and reductions
that are implicit), resulting in stack a0 · · · ai bℓ. If
this creates the first gold edge connected to the
current component, then we add +1.
For i ≥ 0, the value of score′(i, j) is the maximum
of two values. The first value distinguishes
two cases. In the first case, ai does not have a
parent in the same component as bj, and ai is
reduced into bj without counting the (non-gold)
edge. In the second case, ai is reduced into its parent,
which is bj or another critical node that is an
ancestor of bj; in this case we count the gold edge.
The second value in the definition of score′(i, j)
corresponds to a reduction of bj into ai (as well
as shifts of any critical nodes that are children of
ai, and their reduction into ai), resulting in stack
a0 · · · ai−1 ai. The state is updated using τ′, in the
light of the new top-of-stack.

Figure 7: Assuming the thick edges are gold, then the
thin edge cannot be gold as well, as the gold tree is
projective. A score obtained from a stack a0 · · · ai−1 ai
is therefore at least as high as a score obtained from
a stack a0 · · · ai−1 aj, unless all of aℓ′+1, . . . , ai first
become children of aj via a series of reduce right
steps, all producing non-gold edges, and therefore
adding nothing to the score. The κ function implements
such a series of reduce right steps.
The top-level call is score(k − 1, k, N ). As
this does not account for right children of the
top of stack ak, we need to add nchildren(k).
Putting everything together, we have σmax =
M ⊗ score(k − 1, k, N ) ⊗ nchildren(k). The
time complexity is quadratic in k + m ≤ n, given
the quadratically many combinations of i and j in
score(i, j, q) and score(cid:48)(i, j).
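A memoized skeleton of Table 5, in the same hedged style and assuming the tables of the sketch above together with functions w and nchildren as in Table 5, could look as follows:

    from functools import lru_cache

    NEG_INF = float("-inf")
    N, C, F = "N", "C", "F"           # the three states of Section 5

    def sigma_max(k, M, w, nchildren, p, p_geq, c, P_geq, R, p_prime):
        def delta(cond):              # the indicator function Δ
            return 1 if cond else 0

        def tau(i, j, q):
            if q == N or P_geq[i] != P_geq[j]:
                return N
            if p_geq[i] != p_geq[j]:
                return F
            return q

        def tau_prime(i, j):
            if P_geq[i] != R(j):
                return N
            if p_geq[i] != j:
                return F
            return C

        @lru_cache(maxsize=None)
        def score(i, j, q):
            if i < 0:
                return 0.0
            v1 = (nchildren(i)
                  - delta(c[i] is not None and c[i] == P_geq[j] and q != N)
                  + w(i, j) + score(i - 1, i, tau(i, j, q)))
            v2 = w(j, i) + score(i - 1, j, q)
            if p[j] is None or q == C:
                v3 = NEG_INF
            else:
                v3 = delta(q == N) + score_prime(i, p[j])
            return max(v1, v2, v3)

        @lru_cache(maxsize=None)
        def score_prime(i, j):
            if i < 0:
                return 0.0
            if p_prime(i, j) is None:
                v1 = score_prime(i - 1, j)
            else:
                v1 = 1 + score_prime(i - 1, p_prime(i, j))
            v2 = nchildren(i) + score(i - 1, i, tau_prime(i, j))
            return max(v1, v2)

        # sigma_max = M ⊗ score(k − 1, k, N) ⊗ nchildren(k)
        return M + score(k - 1, k, N) + nchildren(k)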
6 O(n) Time Algorithm
Under the same assumption as in the previous section,
namely, that Tg is projective, we can further
reduce the time complexity of computing σmax,
by two observations. First, let us define λ(i, j) to
be true if and only if there is an ℓ < i such that
(aℓ, aj) ∈ Tg or (aj, aℓ) ∈ Tg. If (aj, ai) ∉ Tg
and λ(i, j) is false, then the highest score attainable
from a configuration (a0 · · · ai−1 aj, β, ∅) is
no higher than the highest score attainable from
(a0 · · · ai−1 ai, β, ∅), or, if aj has a parent bj′, from
(a0 · · · ai bj′, β′, ∅), for appropriate suffix β′ of β.
This means that in order to calculate score(i, j, q)
we do not need to calculate score(i − 1, j, q) in
this case.

Secondly, if (aj, ai) ∉ Tg and λ(i, j) is true,
and if there is ℓ′ < i such that (aℓ′, ai) ∈ Tg or
(ai, aℓ′) ∈ Tg, then there are no edges between
aj and ai′ for any i′ with ℓ′ < i′ < i, because
of projectivity of Tg. We therefore do not need
to calculate score(i′, j, q) for such values of i′
in order to find the computation with the highest
score. This is illustrated in Figure 7.
Let us define κ(i) to be the smallest ℓ′ such that
(aℓ′, ai) ∈ Tg or (ai, aℓ′) ∈ Tg, or i − 1 if there
is no such ℓ′. In the definition of score, we may
now replace w(j, i) ⊗ score(i − 1, j, q) by:

[if w(j, i) = 1 then 1 ⊗ score(i − 1, j, q)
 else if w(j, i) = 0 ∧ λ(i, j) then score(κ(i), j, q)
 else −∞]

Similarly, we define λ′(i, j) to be true if and
only if there is an ℓ < i such that (aℓ, bj′) ∈ Tg
or (bj′, aℓ) ∈ Tg for some j′ with R(j′) = R(j).
In the definition of score′, we may now replace
score′(i − 1, j) by:

[if λ′(i, j) then score′(κ(i), j) else −∞]

Thereby the algorithm becomes linear-time,
because the number of values score(i, j, q) and
score′(i, j) that are calculated for any i is
now linear. To see this, consider that for any
i, score(i, j, q) would be calculated only if
j = i + 1, if (ai, aj) ∈ Tg or (aj, ai) ∈ Tg, if
(aj, ai+1) ∈ Tg, or if j is smallest such that there
is ℓ < i with (aℓ, aj) ∈ Tg or (aj, aℓ) ∈ Tg.
Similarly, score′(i, j) would be calculated only if
score(i, j′, q) would be calculated and (bj, aj′) ∈
Tg, if (bj, ai+1) ∈ Tg, or if j is smallest such that
there is ℓ ≤ i with (aℓ, bj′) ∈ Tg or (bj′, aℓ) ∈ Tg
for some j′ such that bj′ is an ancestor of bj in the
same component.
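These refinements might be realized as in the following sketch, again with our own helper names, where Tg_stack holds the gold edges among stack elements, by index; for clarity the tests are computed naively here, whereas a linear-time implementation would precompute them:

    def build_kappa_lambda(Tg_stack):
        def lam(i, j):
            # λ(i, j): some ℓ < i has an edge with a_j
            return any((l, j) in Tg_stack or (j, l) in Tg_stack
                       for l in range(i))

        def kappa(i):
            # κ(i): smallest ℓ' with an edge between a_ℓ' and a_i, else i − 1
            for l in range(i):
                if (l, i) in Tg_stack or (i, l) in Tg_stack:
                    return l
            return i - 1

        return kappa, lam

    # inside score(i, j, q), the value w(j, i) ⊗ score(i − 1, j, q) becomes:
    #     if w(j, i) == 1:
    #         v2 = 1 + score(i - 1, j, q)
    #     elif w(j, i) == 0 and lam(i, j):
    #         v2 = score(kappa(i), j, q)
    #     else:
    #         v2 = NEG_INF
    # and in score_prime(i, j), score_prime(i − 1, j) becomes
    # score_prime(kappa(i), j) if lam_prime(i, j) else NEG_INF.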
7 Towards Constant Time Per Calculation

A typical application would calculate the optimal
step for several or even all configurations within
one computation. Between one configuration and
the next, the stack differs at most in the two
rightmost elements, and the remaining input differs
at most in that it loses its leftmost element.
Therefore, all but a constant number of values
of score(i, j, q) and score′(i, j) can be reused,
to make the time complexity closer to constant
time for each calculation of the optimal step. The
practical relevance of this is limited, however, if
one would typically reload the data structures
containing the relevant values, which are of linear
size. Hence we have not pursued this further.
8 Experiments

Our experiments were run on a laptop with an Intel
i7-7500U processor (4 cores, 2.70 GHz) with 8 GB
of RAM. The implementation language is Java,
with DL4J² for the classifier, realized as a neural
network with a single layer of 256 hidden nodes.
Training is with batch size 100, and 20 epochs.
Features are the (gold) parts of speech and length-
100 word2vec representations of the word forms
of the top-most three stack elements, as well as
of the left-most three elements of the remaining
input, and the left-most and right-most dependency
relations in the top-most two stack elements.

²https://deeplearning4j.org/.
³https://universaldependencies.org/.

Figure 8: Geometric mean of the number of optimal
projectivized trees against sentence length.

8.1 Optimal Projectivization

We need to projectivize our training corpus for the
experiments in Section 8.2, using the algorithm
described at the end of Section 3. As we are
not aware of literature reporting experiments with
optimal projectivization, we briefly describe our
findings here.

Projectivizing all the training sets in Universal
Dependencies v2.2³ took 244 sec in total, or
0.342 ms per tree. As mentioned earlier, there
may be multiple projectivized trees that are
optimal in terms of accuracy, for a single gold
tree. We are not aware of meaningful criteria
that tell us how to choose any particular one of
them, and for our experiments in Section 8.2 we
have chosen an arbitrary one. It is conceivable,
however, that the choices of the projectivized
trees would affect the accuracy of a parser
trained on them. Figure 8 illustrates the degree of
‘‘choice’’ when projectivizing trees. We consider
two languages that are known to differ widely
in the prevalence of non-projectivity, namely
Ancient Greek (PROIEL) and Japanese (BCCWJ),
and we consider one more language, German
(GSD), that falls in between (Straka et al., 2015).
As can be expected, the degree of choice grows
roughly exponentially in sentence length.

      pseudo   optimal
grc   91.41    92.50
de    98.89    98.97
ja    99.99    99.99

Table 6: Accuracy (LAS or UAS, which here
are identical) of pseudo-projectivization and of
optimal projectivization.
Table 6 shows that pseudo-projectivization is
non-optimal. We realized pseudo-projectivization
using MaltParser 1.9.0.⁴
8.2 Computing the Optimal Step
To investigate the run-time behavior of the algo-
rithms, we trained our shift-reduce dependency
parser on the German training corpus, after it
was projectivized as in Section 8.1. In a second
pass over the same corpus, the parser followed
the steps returned by the trained classifier. For
each configuration that was obtained in this way,
the running time was recorded of calculating the
optimal step, with the non-strict left-before-right
strategy. For each configuration, it was verified
that the calculated scores, for shift, reduce left,
and reduce right, were the same between the
three algorithms from Sections 4, 5, and 6.
The two-pass design was inspired by Choi and
Palmer (2011). We chose this design, rather than
online learning, as we found it easiest to imple-
ment. Goldberg and Nivre (2012) discuss the
relation between multi-pass and online learning
approaches.
As Figure 9 shows, the running times of the
algorithms from Sections 5 and 6 grow slowly
as the summed length of stack and remaining
input grows; note the logarithmic scale. The
improvement of the linear-time algorithm over
the quadratic-time algorithm is perhaps less than
one may expect. This is because the calculation of
the critical nodes and the construction of the nec-
essary tables, such as p, p′, and R, is considerable
compared to the costs of the memoized recursive
calls of score and score′.
⁴http://www.maltparser.org/.
Figure 9: Mean running time per step (milliseconds)
against length of input, for projectivized and
unprojectivized trees.
Figure 10: Mean k + m′ against k + m.
Both these algorithms contrast with the algo-
rithm from Section 4, applied on projectivized
trees as above (hence tagged proj in Figure 9),
and with the remaining input simplified to just its
critical nodes. For k + m = 80, the cubic-time
algorithm is slower than the linear-time algorithm
by a factor of about 65. Nonetheless, we find that
the cubic-time algorithm is practically relevant,
even for long sentences.
The decreases at roughly k + m = 88, which
are most visible for Section 4 (proj), are explained
by the fact that the running time is primarily
determined by k + m′, where m′ is the number
of critical nodes. Because k + m is bounded by
the sentence length and the stack height k tends
to be much less than the sentence length, high
values of k + m tend to result from the length m
of the remaining input being large, which in turn
implies that there will be more non-critical nodes
that are removed before the most time-consuming
part of the analyses is entered. This is confirmed
by Figure 10.
                            (1)      (2)      (3)      (4)      (5)
                            subset   all      subset   all      all
                                              Sec. 6   Sec. 6   Sec. 4
de,  13%, 263,804    LAS    71.15    71.33    71.69    72.57    72.55
                     UAS    78.09    78.14    78.96    79.78    79.77
da,  13%,  80,378    LAS    69.11    71.42    69.95    72.25    72.18
                     UAS    75.13    76.98    76.30    78.21    78.00
eu,  34%,  72,974    LAS    54.69    58.27    54.11    57.81    57.49
                     UAS    67.49    70.07    67.71    70.13    70.07
el,  12%,  42,326    LAS    71.62    72.78    70.49    72.34    72.66
                     UAS    77.45    78.34    77.14    78.91    78.36
cu,  20%,  37,432    LAS    56.25    59.09    56.31    59.52    58.78
                     UAS    68.08    69.95    69.07    70.94    70.10
got, 22%,  35,024    LAS    51.96    55.00    53.44    56.20    55.94
                     UAS    64.48    66.58    65.85    68.09    67.85
hu,  26%,  20,166    LAS    52.70    56.20    54.09    57.62    57.37
                     UAS    65.72    68.96    67.55    70.30    70.20

Table 7: Accuracies, with percentage of trees that
are non-projective, and number of tokens. Only
gold computations are considered in a single pass
(1,2), or there is a second pass as well (3,4,5).
The first pass is on the subset of projective trees
(1,3) or on all trees after optimal projectivization
(2,4,5). The second pass is on projectivized trees
(3,4) or on unprojectivized trees (5).
The main advantage of the cubic-time algorithm
is that it is also applicable if the training corpus has
not been projectivized. To explore this we have
run this algorithm on the same corpus again, but
now without projectivization in the second pass
(for training the classifier in the first pass, projec-
tivization was done as before). In this case, we can
no longer remove non-critical nodes (without it
affecting correctness), and now the curve is mono-
tone increasing, as shown by Section 4 (unproj)
in Figure 9. Nevertheless, with mean running
times below 0.25 sec even for input longer than
100 tokens, this algorithm is practically relevant.
8.3 Accuracy
If a corpus is large enough for the parameters of
a classifier to be reliably estimated, or if the vast
majority of trees is projective, then accuracy is
not likely to be much affected by the work in this
paper. We therefore also consider six languages
that have some of the smallest corpora in UD v2.2
in combination with a relatively large proportion
of non-projective trees: Danish, Basque, Greek,
Old Church Slavonic, Gothic, and Hungarian. For
these languages, Table 7 shows that accuracy is
generally higher if training can benefit from all
trees. In a few cases, it appears to be slightly better
to train directly on non-projective trees rather than
on optimally projectivized trees.
9 Conclusions
We have presented the first algorithm to calculate
the optimal step for shift-reduce dependency parsing
that is applicable to non-projective training
corpora. Perhaps even more innovative than its
functionality is its modular architecture, which im-
plies that the same is possible for related kinds of
parsing, as long as the set of allowable transitions
can be described in terms of a split context-free
grammar. The application of the framework to,
among others, arc-eager dependency parsing is to
be reported elsewhere.
We have also shown that calculation of the
optimal step is possible in linear time if the train-
ing corpus is projective. This is the first time this
has been shown for a form of projective, deter-
ministic dependency parsing that does not have
the property of arc-decomposability.
Acknowledgments
The author wishes to thank the reviewers for
comments and suggestions, which led to substan-
tial improvements.
References

Lauriane Aufrant, Guillaume Wisniewski, and
François Yvon. 2018. Exploiting dynamic oracles
to train projective dependency parsers on
non-projective trees. In Conference of the North
American Chapter of the Association for Computational
Linguistics: Human Language Technologies,
volume 2, pages 413–419. New Orleans, LA.

Jinho D. Choi and Martha Palmer. 2011. Getting
the most out of transition-based dependency parsing.
In 49th Annual Meeting of the Association
for Computational Linguistics, Proceedings of
the Conference, pages 687–692. Portland, OR.

Jason Eisner. 2000. Bilexical grammars and their
cubic-time parsing algorithms. In Harry Bunt and
Anton Nijholt, editors, Advances in Probabilistic
and other Parsing Technologies, chapter 3,
pages 29–61. Kluwer Academic Publishers.
Jason Eisner and Giorgio Satta. 1999. Efficient
parsing for bilexical context-free grammars
and head automaton grammars. In 37th Annual
Meeting of the Association for Computational
Linguistics, Proceedings of the Conference,
pages 457–464. Maryland.

Daniel Fernández-González and Carlos Gómez-
Rodríguez. 2018a. A dynamic oracle for linear-
time 2-planar dependency parsing. In Conference
of the North American Chapter of the Association
for Computational Linguistics: Human Language
Technologies, volume 2, pages 386–392.
New Orleans, LA.

Daniel Fernández-González and Carlos Gómez-
Rodríguez. 2018b. Non-projective dependency
parsing with non-local transitions. In Conference
of the North American Chapter of the Association
for Computational Linguistics: Human Language
Technologies, volume 2, pages 693–700.
New Orleans, LA.

Norman Fraser. 1989. Parsing and dependency
grammar. UCL Working Papers in Linguistics,
1:296–319.

Yoav Goldberg and Joakim Nivre. 2012. A dynamic
oracle for arc-eager dependency parsing.
In The 24th International Conference on Computational
Linguistics, pages 959–976. Mumbai.

Yoav Goldberg and Joakim Nivre. 2013. Training
deterministic parsers with non-deterministic
oracles. Transactions of the Association for
Computational Linguistics, 1:403–414.

Yoav Goldberg, Francesco Sartorio, and Giorgio
Satta. 2014. A tabular method for dynamic oracles
in transition-based parsing. Transactions of
the Association for Computational Linguistics,
2:119–130.

Carlos Gómez-Rodríguez, John Carroll, and David
Weir. 2008. A deductive approach to dependency
parsing. In 46th Annual Meeting of the
Association for Computational Linguistics: Human
Language Technologies, pages 968–976.
Columbus, OH.

Carlos Gómez-Rodríguez and Daniel Fernández-
González. 2015. An efficient dynamic oracle
for unrestricted non-projective parsing. In 53rd
Annual Meeting of the Association for Computational
Linguistics and 7th International Joint
Conference on Natural Language Processing,
volume 2, pages 256–261. Beijing.

Carlos Gómez-Rodríguez, Francesco Sartorio,
and Giorgio Satta. 2014. A polynomial-time
dynamic oracle for non-projective dependency
parsing. In Conference on Empirical Methods
in Natural Language Processing, Proceedings
of the Conference, pages 917–927. Doha.

Liang Huang and Kenji Sagae. 2010. Dynamic programming
for linear-time incremental parsing.
In Proceedings of the 48th Annual Meeting of
the Association for Computational Linguistics,
pages 1077–1086. Uppsala.

Mark Johnson. 2007. Transforming projective
bilexical dependency grammars into efficiently-
parsable CFGs with Unfold-Fold. In 45th
Annual Meeting of the Association for Computational
Linguistics, Proceedings of the Conference,
pages 168–175. Prague.

Sylvain Kahane, Alexis Nasr, and Owen Rambow.
1998. Pseudo-projectivity, a polynomially parsable
non-projective dependency grammar. In
36th Annual Meeting of the Association for
Computational Linguistics and 17th International
Conference on Computational Linguistics,
volume 1, pages 646–652. Montreal.

Martin Kay. 2000. Guides and oracles for linear-
time parsing. In Proceedings of the Sixth
International Workshop on Parsing Technologies,
pages 6–9. Trento.

Marco Kuhlmann, Carlos Gómez-Rodríguez, and
Giorgio Satta. 2011. Dynamic programming
algorithms for transition-based dependency parsers.
In 49th Annual Meeting of the Association
for Computational Linguistics, Proceedings of
the Conference, pages 673–682. Portland, OR.

Miryam de Lhoneux, Sara Stymne, and Joakim
Nivre. 2017. Arc-hybrid non-projective dependency
parsing with a static-dynamic oracle.
In 15th International Conference on Parsing
Technologies, pages 99–104. Pisa.

Alexis Nasr. 1995. A formalism and a parser for
lexicalised dependency grammars. In Fourth
International Workshop on Parsing Technologies,
pages 186–195. Prague and Karlovy Vary.
Joakim Nivre. 2008. Algorithms for deterministic
incremental dependency parsing. Computational
Linguistics, 34(4):513–553.

Joakim Nivre, Johan Hall, and Jens Nilsson.
2004. Memory-based dependency parsing. In
Proceedings of the Eighth Conference on
Computational Natural Language Learning,
pages 49–56. Boston, MA.

Joakim Nivre and Jens Nilsson. 2005. Pseudo-
projective dependency parsing. In 43rd Annual
Meeting of the Association for Computational
Linguistics, Proceedings of the Conference,
pages 99–106. Ann Arbor, MI.

Peng Qi and Christopher D. Manning. 2017.
Arc-swift: A novel transition system for dependency
parsing. In 55th Annual Meeting of the
Association for Computational Linguistics,
Proceedings of the Conference, volume 2,
pages 110–117. Vancouver.

Milan Straka, Jan Hajič, Jana Straková, and Jan
Hajič, jr. 2015. Parsing universal dependency
treebanks using neural networks and search-
based oracle. In Proceedings of the Fourteenth
International Workshop on Treebanks and
Linguistic Theories, pages 208–220. Warsaw.

Hiroyasu Yamada and Yuji Matsumoto. 2003.
Statistical dependency analysis with support
vector machines. In 8th International Workshop
on Parsing Technologies, pages 195–206.
LORIA, Nancy.