Going to the Roots of Dependency Parsing
∗
Miguel Ballesteros
Complutense University of Madrid
∗∗
Joakim Nivre
Uppsala University
Dependency trees used in syntactic parsing often include a root node representing a dummy
word prefixed or suffixed to the sentence, a device that is generally considered a mere technical
convenience and is tacitly assumed to have no impact on empirical results. We demonstrate that
this assumption is false and that the accuracy of data-driven dependency parsers can in fact be
sensitive to the existence and placement of the dummy root node. In particular, we show that
a greedy, left-to-right, arc-eager transition-based parser consistently performs worse when the
dummy root node is placed at the beginning of the sentence (following the current convention
in data-driven dependency parsing) than when it is placed at the end or omitted completely.
Control experiments with an arc-standard transition-based parser and an arc-factored graph-
based parser reveal no consistent preferences but nevertheless exhibit considerable variation in
results depending on root placement. We conclude that the treatment of dummy root nodes in
data-driven dependency parsing is an underestimated source of variation in experiments and
may also be a parameter worth tuning for some parsers.
1. introduzione
It is a lesson learned in many studies on natural language processing that choosing the
right linguistic representation can be crucial for obtaining high accuracy on a given task.
In constituency-based parsing, Per esempio, adding or deleting nodes in syntactic trees
can have a substantial impact on the performance of a statistical parser. In dependency
parsing, the syntactic representations used offer less opportunity for transformation,
given that the nodes of a dependency tree are basically determined by the tokens of
the input sentence, except for the possible addition of a dummy word acting as the
root of the tree. In questo articolo, we show that even this seemingly trivial modification
can make a difference, and that the exact placement of the dummy root node can
have a significant impact on the accuracy of a given parser. This suggests that the
placement of the dummy root is a parameter worth tuning for certain parsing systems
as well as a source of variation to be taken into account when interpreting experimental
risultati.
∗ Universidad Complutense de Madrid, Departamento de Ingenier´ıa del Software e Inteligencia Artificial,
C/ Prof. Jos´e Garc´ıa Santesmases, s/n, 28040 Madrid, Spain. E-mail: miballes@fdi.ucm.es.
∗∗ Uppsala University, Department of Linguistics and Philology, Box 635, SE-75126 Uppsala, Sweden.
E-mail: joakim.nivre@lingfil.uu.se.
Invio ricevuto: 25 Luglio 2012; revised submission received: 13 ottobre 2012; accepted for publication:
19 ottobre 2012.
© 2013 Associazione per la Linguistica Computazionale
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
C
o
l
io
/
l
UN
R
T
io
C
e
–
P
D
F
/
/
/
/
3
9
1
5
1
7
9
9
0
1
3
/
C
o
l
io
_
UN
_
0
0
1
3
2
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Linguistica computazionale
Volume 39, Numero 1
2. Dependency Graphs
Dependency-based approaches to syntactic parsing assume that the syntactic structure
of a sentence can be analyzed in terms of binary dependency relations between lexical
units, units that in the simplest case are taken to correspond directly to the tokens
of the sentence. It is very natural to represent this structure by a directed graph,
with nodes representing input tokens and arcs representing dependency relations.
Inoltre, we can add labels to arcs in order to distinguish different dependency
types or grammatical functions (per esempio., subject, object, adverbial). We call such a graph a
dependency graph.
Dependency graphs are normally assumed to satisfy certain formal constraints,
such as the single-head constraint, which forbids more than one incoming arc to a node,
and the acyclicity constraint, ruling out cyclic graphs. Many dependency theories and
annotation schemes further require that the graph should be a tree, with a unique root
token on which all other tokens are transitively dependent, whereas other frameworks
allow more than one token to be a root in the sense of not having any incoming arc. UN
simple and elegant way of reconciling such cross-framework differences and arriving
at a single formalization of dependency structures is to add a dummy root node, a special
node that does not correspond to any input token, and to require that the dependency
graph is a tree rooted at this node. The original tree constraint can then be enforced
by requiring that the special node has exactly one child, but not all frameworks need to
enforce this constraint. An additional advantage of adding a dummy root node is that its
outgoing arcs can be labeled to indicate the functional status of what would otherwise
simply be unlabeled root nodes. With a slight misuse of terminology, we call such labels
informative root labels.1
Because the dummy root node does not correspond to an input token, it has no well-
defined position in the node sequence defined by the word order of a sentence and could
in principle be inserted anywhere (or nowhere at all). One option that can be found in
the literature is to insert it at the end of this sequence, but the more common convention
in contemporary research on dependency parsing is to insert it at the beginning, hence
treating it as a dummy word prefixed to the sentence. This is also the choice implicitly
assumed in the CoNLL data format, used in the CoNLL shared tasks on dependency
parsing in 2006 E 2007 (Buchholz and Marsi 2006; Nivre et al. 2007) and the current
de facto standard for exchange of dependency annotated data.
The question that concerns us here is whether the use of a dummy root node is just
a harmless technicality permitting us to treat different dependency theories uniformly,
and whether its placement in the input sequence is purely arbitrary, or whether both of
these choices may in fact have an impact on the parsing accuracy that can be achieved
with a given parsing model. In order to investigate this question empirically, we define
three different types of dependency graphs that differ only with respect to the existence
and placement of the dummy root node:
1.
2.
3.
None: Only nodes corresponding to tokens are included in the graph.
Primo: A dummy root node is added as the first token in the sentence.
Last: A dummy root node is added as the last token in the sentence.
1 Per esempio, in the Prague Dependency Treebank, which allows multiple children of the dummy root
node, the label may indicate whether the child functions as a main predicate, as the head of a coordinate
structure, or as final punctuation.
6
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
C
o
l
io
/
l
UN
R
T
io
C
e
–
P
D
F
/
/
/
/
3
9
1
5
1
7
9
9
0
1
3
/
C
o
l
io
_
UN
_
0
0
1
3
2
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Ballesteros and Nivre
Going to the Roots of Dependency Parsing
Figura 1 illustrates the three types of dependency graphs with examples taken from
the Penn Treebank of English (Marcus, Santorini, and Marcinkiewicz 1993) converted
to dependency structure using the procedure described in Nivre (2006), and the Prague
Dependency Treebank of Czech (Hajiˇc et al. 2001; B ¨ohmov´a et al. 2003). In the former
case, it is assumed that the dummy root node always has exactly one child, con un
dummy dependency label ROOT. In the latter case, the dummy root node may have
several children and these children have informative root labels indicating their func-
zione (Pred and AuxK in the example). Note also that the Czech dependency graph of
type None is not a tree, but a forest, because it consists of two disjoint trees.
3. Experiments
In order to test the hypothesis that the existence and placement of the dummy root node
can have an impact on parsing accuracy, we performed an experiment using two widely
used data-driven dependency parsers, MaltParser (Nivre, Hall, and Nilsson 2006) E
MSTParser (McDonald 2006), and all the 13 data sets from the CoNLL 2006 shared
task on multilingual dependency parsing (Buchholz and Marsi 2006) as well as the
English Penn Treebank converted to Stanford dependencies (de Marneffe, MacCartney,
and Manning 2006). We created three different versions of each data set, corresponding
to the representation types None, Primo, and Last, and used them to evaluate MaltParser
with two different transition systems—arc-eager (Nivre 2003) and arc-standard (Nivre
2004)—and MSTParser with the arc-factored non-projective algorithm (McDonald
et al. 2005). The results are shown in Table 1.
When creating the data sets, we took the original version from the CoNLL-X shared
task as None, because it does not include the dummy root node as an explicit input
token. In this representation, the tokens of a sentence are indexed from 1 to n and the
dependency graph is specified by giving each word a head index ranging from 0 to n,
Dove 0 signifies that the token is not a dependent on any other token in the sentence.
The First version was created by adding an extra token at the beginning of the sentence
with index 1 and head index 0, increasing all other token and head indices by 1, Senso
that all tokens that previously had a head index of 0 would now be attached to the new
P
OBJ
(cid:3)
(cid:2)
(cid:3)
(cid:2) (cid:3)
NMOD
(cid:2)
(cid:2)
on6
effect5
(cid:2)
NMOD
(cid:2)
little4
P
OBJ
(cid:3)
(cid:2)
(cid:3)
(cid:2) (cid:3)
NMOD
(cid:2)
(cid:2)
on7
effect6
(cid:2)
NMOD
(cid:2)
little5
NMOD
PMOD
(cid:2)
(cid:2)
financial7
(cid:3)
(cid:3)
(cid:2)
markets8
NMOD
PMOD
(cid:2)
(cid:2)
financial8
(cid:3)
(cid:3)
(cid:2)
markets9
NMOD
SBJ
(cid:2)
(cid:2)
Economic1
(cid:2)
(cid:3)
(cid:2)
news2
(cid:2)
ROOT
(cid:2)
(cid:2)
NMOD
SBJ
(cid:2)
(cid:3)
(cid:2)
news3
ROOT1 Economic2
(cid:2)
(cid:2)
(cid:3)
had3
(cid:2)
(cid:3)
(cid:2)
(cid:3)
(cid:2)
had4
(cid:2)
(cid:2)
(cid:2)
OBJ
(cid:3)
ROOT
P
(cid:2)
PMOD
(cid:2)
(cid:2)
Economic1
NMOD
SBJ
(cid:2)
(cid:3)
(cid:2)
news2
(cid:3)
(cid:2)
had3
(cid:3)
(cid:2)
NMOD
(cid:2)
little4
(cid:2) (cid:3)
NMOD
(cid:2)
(cid:2)
on6
effect5
(cid:2)
(cid:2)
financial7
NMOD
markets8
(cid:3)
(cid:3)
(cid:2)
(cid:2)
AuxP
(cid:3)
(cid:2)
(cid:2)
Z1
Atr
(cid:3)
(cid:2)
nich2
(cid:2)
AuxP
Pred
Atr
(cid:2)
(cid:2)
Z2
(cid:3)
(cid:2)
nich3
(cid:2)
(cid:2)
ROOT1
(cid:3)
(cid:2)
AuxP
ROOT10
(cid:2)
(cid:2)
Z1
Atr
(cid:3)
(cid:2)
nich2
(cid:2)
(cid:2)
je3
(cid:2)
(cid:3)
(cid:2)
(cid:2)
je4
(cid:2)
(cid:2)
(cid:2)
(cid:2)
je3
AuxP
(cid:3)
Sb
AuxZ
(cid:2)
(cid:2)
jen4
(cid:3)
(cid:3)
(cid:2)
jedna5
(cid:2)
(cid:2)
na6
Adv
(cid:3)
(cid:2)
kvalitu7
AuxK
(cid:3)
AuxP
(cid:3)
.8
(cid:3)
Sb
AuxZ
(cid:2)
(cid:2)
jen5
(cid:3)
(cid:3)
(cid:2)
jedna6
(cid:3)
(cid:3)
AuxP
Sb
(cid:2)
(cid:2)
jen4
AuxZ
(cid:3)
(cid:2)
jedna5
(cid:2)
(cid:2)
na7
Adv
(cid:3)
(cid:2)
kvalitu8
(cid:2)
.9
Pred
(cid:3)
(cid:3)
(cid:2)
(cid:2)
na6
Adv
(cid:3)
(cid:2)
kvalitu7
(cid:2)
(cid:2)
.8
AuxK
(cid:3)
ROOT9
(cid:3)
(cid:2)
.9
(cid:3)
(cid:2)
.10
(cid:3)
(cid:2)
.9
Figura 1
Dependency graph types None (top), Primo (middle), and Last (bottom) for an English
sentence from the Penn Treebank (left) and a Czech sentence taken from the Prague
Dependency Treebank (right). (Gloss of Czech sentence: Z/Out-of nich/them je/is jen/
only jedna/one-FEM-SG na/to kvalitu/quality ./. = “Only one of them concerns quality.”)
7
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
C
o
l
io
/
l
UN
R
T
io
C
e
–
P
D
F
/
/
/
/
3
9
1
5
1
7
9
9
0
1
3
/
C
o
l
io
_
UN
_
0
0
1
3
2
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Linguistica computazionale
Volume 39, Numero 1
Tavolo 1
Experimental results for arc-eager (AE), arc-standard (AS), and maximum spanning tree
parsing (MST) on all the CoNLL-X data sets plus the English Penn Treebank converted to
Stanford dependencies with three different dependency graph types (None, Primo, Last).
Evaluation metrics are labeled attachment score (LAS), unlabeled attachment score (UAS), root
attachment (or no attachment in the case of None) measured as recall (RR) and precision (RP).
Scores in bold are best in their column (per language); scores in italic are not comparable to the
rest because of informative arc labels that cannot be predicted with the None representation.
AE
RR
RP
74.24
84.75
73.56
94.22
90.20
94.22
93.63
92.25
93.63
80.51
83.33
81.07
91.33
86.69
91.64
65.56
72.18
65.76
88.70
83.99
88.58
94.68
89.64
94.68
92.74
88.15
92.96
92.01
87.85
92.01
83.25
79.70
83.76
93.32
91.77
93.32
73.98
71.94
75.00
90.29
88.77
90.14
86.32
85.09
86.45
69.52
73.75
68.24
90.80
87.56
90.58
88.42
89.55
88.42
74.61
72.66
75.53
88.06
82.11
88.36
66.08
57.88
66.67
81.75
85.42
81.73
85.14
90.40
85.14
85.20
89.20
84.56
85.76
86.35
85.76
76.28
73.36
76.39
90.07
89.47
90.07
63.46
64.83
64.05
92.25
93.00
91.95
81.24
81.11
81.25
UAS
75.17
74.57
74.97
90.80
89.83
90.78
89.68
89.09
89.70
81.14
79.96
81.16
87.88
86.59
87.94
74.51
74.41
74.51
90.47
90.04
90.46
86.64
86.08
86.72
92.10
91.27
92.12
88.60
88.36
88.62
82.51
82.15
82.49
89.60
89.29
89.70
79.56
77.84
79.62
72.18
71.86
72.16
84.35
83.67
84.35
LAS
60.48
64.93
65.29
85.06
85.12
85.16
85.25
85.23
85.17
68.36
74.88
74.28
81.64
81.66
81.52
70.67
71.07
70.65
88.15
88.04
88.16
84.31
84.37
84.35
90.15
89.13
90.01
78.32
83.45
83.47
77.88
77.64
77.72
82.65
82.53
82.65
63.67
69.40
69.42
57.00
56.80
56.88
76.69
78.16
78.20
AS
RR
RP
81.69
83.73
82.71
91.96
91.71
91.96
93.06
92.82
92.82
87.85
86.44
87.01
92.88
92.88
92.88
69.84
64.20
69.84
87.83
86.20
87.91
93.84
93.84
93.84
92.53
87.83
92.64
90.28
90.62
90.62
81.73
81.22
80.71
92.03
91.77
91.77
75.26
73.47
75.26
90.59
89.68
90.59
87.24
86.17
87.18
80.60
81.79
81.06
91.96
91.71
91.96
93.06
92.82
92.82
73.35
88.95
73.16
93.17
93.17
92.31
72.53
82.09
72.38
86.42
86.24
86.61
93.84
93.84
93.84
85.42
90.94
85.60
90.28
90.62
90.62
81.73
81.22
81.12
91.56
91.77
91.30
67.35
78.69
67.66
92.13
94.86
92.13
85.24
88.48
85.18
UAS
77.29
77.09
77.31
90.33
90.33
90.33
90.08
90.10
90.00
81.96
82.52
81.78
87.86
87.86
87.74
74.43
75.23
74.45
90.07
89.95
90.07
87.22
87.24
87.22
92.30
91.57
92.28
87.78
87.82
87.80
81.69
81.51
81.55
89.42
89.38
89.36
79.28
79.42
79.28
72.06
72.12
72.10
84.41
84.44
84.38
LAS
66.73
66.41
66.41
86.30
86.32
86.14
86.88
86.54
86.36
76.70
77.04
77.68
83.39
83.97
83.43
79.05
78.91
78.25
87.55
87.63
87.69
85.64
85.74
85.34
90.45
90.83
90.47
84.87
85.19
84.89
79.40
79.20
79.48
81.36
81.76
81.66
71.44
71.72
71.64
58.49
58.59
58.89
79.88
79.99
79.88
MST
UAS
78.96
78.32
78.32
91.64
91.28
91.28
90.82
90.68
90.52
85.98
86.34
86.70
89.46
89.84
89.42
83.49
83.43
82.95
89.91
90.00
90.06
89.54
89.66
89.50
93.02
93.36
93.06
89.74
90.26
89.26
83.57
83.41
83.53
88.29
88.59
88.35
82.47
82.67
82.33
74.55
74.59
74.83
86.53
86.60
86.44
RR
84.07
83.39
78.31
98.24
97.49
97.24
94.33
94.33
93.87
82.20
85.88
89.55
92.57
94.74
92.26
79.77
74.32
75.10
87.70
90.20
88.45
97.76
97.48
97.76
93.38
92.85
92.21
89.58
91.67
90.62
79.70
83.76
84.77
89.97
91.52
92.03
79.08
79.34
76.79
93.47
92.56
93.02
88.70
89.25
88.71
RP
83.78
90.44
87.83
98.24
97.49
97.24
94.33
94.33
93.97
80.83
84.92
89.55
91.44
94.94
92.26
79.92
83.59
85.21
87.70
90.17
88.38
97.76
97.48
97.76
88.47
91.58
90.66
89.58
91.67
90.31
78.50
84.18
83.50
90.21
91.75
92.03
75.98
79.34
79.21
86.88
94.28
94.45
87.40
90.44
90.17
Language
Type
LAS
None
Primo
Last
None
Primo
Last
None
Primo
Last
None
Primo
Last
None
Primo
Last
None
Primo
Last
None
Primo
Last
None
Primo
Last
None
Primo
Last
None
Primo
Last
None
Primo
Last
None
Primo
Last
None
Primo
Last
None
Primo
Last
None
Primo
Last
60.00
63.63
64.15
85.76
84.64
85.76
85.13
84.59
85.15
68.30
72.98
73.96
82.36
80.60
82.38
71.09
70.81
71.05
88.63
88.00
88.57
83.85
83.29
83.93
89.85
88.79
89.77
79.12
83.77
84.17
78.64
78.14
78.64
83.49
83.13
83.59
64.25
67.73
69.98
56.66
56.48
56.64
76.94
77.61
78.41
Arabic
Bulgarian
Chinese
Czech
Danish
Dutch
English
German
Japanese
Portuguese
Spanish
Swedish
Slovene
Turkish
Average
8
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
C
o
l
io
/
l
UN
R
T
io
C
e
–
P
D
F
/
/
/
/
3
9
1
5
1
7
9
9
0
1
3
/
C
o
l
io
_
UN
_
0
0
1
3
2
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Ballesteros and Nivre
Going to the Roots of Dependency Parsing
dummy root token. The Last version was created by adding an extra token at the end
of the sentence with index n+1, and changing every head index that previously was 0
to n+1. In both First and Last, we made sure that the new dummy token had a unique
word form and unique values for all other features, so that it could not be mistaken for
any real word. For First and Last, we applied an inverse transformation to the parser
output before evaluation.
Both MaltParser and MSTParser by default add a dummy root node at the be-
ginning of the sentence internally before parsing, so we had to modify the parsers
so that they only considered arcs involving nodes corresponding to input tokens. For
MaltParser this only required setting a flag that makes the parser start with an empty
stack instead of a stack containing an extra dummy root node.2 For MSTParser, we
modified the parser implementation so that it extracts a maximum spanning tree that
is still rooted in an extra dummy root node but where the score of a tree is based only
on the scores of arcs connecting real token nodes. Finalmente, because MaltParser with the
arc-eager and arc-standard transition systems can only construct projective dependency
graphs, we projectivized all training sets before training the MaltParser models using
the baseline pseudo-projective transformation of Nivre and Nilsson (2005).3 Except for
these modifications, all parsers were run with out-of-the-box settings.
3.1 Deterministic Arc-Eager Parsing
The arc-eager transition-based parser first described in Nivre (2003) parses a sentence
in a single pass from left to right, using a stack to store partially processed tokens
and greedily choosing the highest-scoring parsing action at each point. The arc-eager
property entails that every arc in the output graph is added at the earliest possible
opportunity, which means that right-dependents are attached to their head before they
have found their own right-dependents. This can be an advantage because the early
attachment neither implies nor precludes the later addition of right-dependents, Ma
it can also be a drawback because it forces the parser to make an early commitment
about right-dependents. In this context, it is especially relevant that the addition of a
dummy root node at the beginning of the sentence (Primo) forces the parser to make
an early commitment regarding dependents of this root node. By contrast, if a dummy
root node is added at the end of a sentence (Last), decisions regarding root dependents
will be postponed until the end. Allo stesso modo, if no root node is added (None), then these
decisions will not be explicitly modeled at all, meaning that whatever nodes remain
on the stack after parsing will be treated as root dependents.
As can be seen from Table 1, the arc-eager parser performs consistently worse under
the First condition, with an average unlabeled attachment score (UAS) Di 83.67 over the
14 languages, to be compared with 84.35 for None and Last. The difference in accuracy
between First and None/Last ranges from 0.10 for Dutch to 1.72/1.78 for Slovene, E
the difference in means is highly statistically significant (P < 0.001, Wilcoxon signed-
rank test). The difference between None and Last is never greater than 0.20 (and very
far from statistically significant), indicating that either postponing or excluding root
attachment decisions leads to very similar performance for the arc-eager parser. The
2 The exact commandline flag is -allow root false. A side effect of this flag is that all unattached tokens
get the dummy label ROOT, meaning that informative root labels cannot be predicted for representations
of type None. See http://maltparser.org for more information.
3 This is done with the MaltParser flag -pp baseline.
9
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
3
9
1
5
1
7
9
9
0
1
3
/
c
o
l
i
_
a
_
0
0
1
3
2
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Computational Linguistics
Volume 39, Number 1
same pattern is found for labeled attachment score (LAS), but here we can only directly
compare First and Last because the Arabic, Czech, Portuguese, and Slovene data sets
contain informative root labels that cannot be predicted under the None condition (cf.
footnote 2). The difference in means between First and Last is 0.80 and again highly
statistically significant (p < 0.001, Wilcoxon signed-rank test). A closer look at the root
accuracy suggests that most of the difference stems from a lower recall on root depen-
dents with the First representation, but this pattern is not completely consistent across
languages and Arabic, Czech, and Dutch actually have higher recall. For Czech and
Dutch this is accompanied by lower precision, but for Arabic the First representation
actually gives the best recall and precision of root dependents (but nevertheless the
worst overall attachment score). It is probably significant that the Arabic data set has
the longest sentences with the root word often appearing early in the sentence. Hence,
an early commitment to root attachment is more likely to be correct in this case, even if
it is more error prone in general.
3.2 Deterministic Arc-Standard Parsing
The arc-standard transition-based parser first described in Nivre (2004) is similar to the
arc-eager parser in that it parses a sentence in a single pass from left to right, using
a stack to store partially processed tokens and greedily choosing the highest-scoring
parsing action at each point. It differs by postponing the attachment of right-dependents
until the complete subtree under the dependent itself has been built. As a consequence,
the dependency tree is built strictly bottom–up, which means that attachment to a
dummy root node will always happen at the end, regardless of whether the dummy
root node is positioned at the beginning or at the end of the sentence. There is therefore
no reason to expect that the placement of the dummy root node should have the same
impact as for the arc-eager parser.
Looking at the results for the arc-standard parser in Table 1 confirms this expecta-
tion, with the three conditions giving very similar mean UAS (84.41 for None, 84.44 for
First, 84.38 for Last) and none of the differences being statistically significant. For LAS,
we can again only directly compare First and Last, but there is practically no difference
in the means here either (78.16 for First vs. 78.20 for Last). Nevertheless, it is worth
noting that, for individual languages, differences in scores can be quite substantial.
Thus, for Dutch, the First condition outperforms the None/Last condition by 0.80/0.78
in UAS and 0.40/0.42 in LAS. Conversely, for Japanese, the None/Last conditions are
better than First by 0.73/0.71 in UAS and 1.02/0.88 in LAS. Although the general trend
is that None and Last give the most similar results, just as for the arc-eager parser,
there are also cases like Chinese where None and First are both slightly better than
Last. Zooming in on root accuracy, we see a clear gain in precision (and marginal drop
in recall) for the First representation, which is probably an effect of the arc-standard
strategy where the attachment of right-dependents often have to be delayed whereas
left-dependents can be attached eagerly.
3.3 Maximum Spanning Tree Parsing
The maximum spanning tree parser described in McDonald et al. (2005) uses a very
different parsing model compared with the two transition-based parsers. Instead of
scoring individual parsing actions, it scores all possible dependency arcs in the sentence
and then uses exact inference to extract the highest-scoring complete dependency tree
10
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
3
9
1
5
1
7
9
9
0
1
3
/
c
o
l
i
_
a
_
0
0
1
3
2
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Ballesteros and Nivre
Going to the Roots of Dependency Parsing
under an arc-factored model, where the score of each tree is the sum of the scores of
its component arcs. Because the parsing algorithm does not impose any ordering at all
on different attachments, we would expect even less impact from the placement of the
dummy root node than for the deterministic arc-standard parser.
The results in Table 1 do not quite confirm this expectation. The mean UAS varies
from 86.44 for the Last condition to 86.60 for the First condition, and the mean LAS
is 79.88 for None and Last but 79.99 for First. Although none of these differences is
statistically significant on the aggregate level, differences can be quite substantial for
individual languages, with First outperforming Last by a whole percentage point in
UAS for Portuguese (but only 0.30 in LAS) and None outperforming both First and Last
by 0.64 for Arabic (and 0.32 in LAS). The fact that LAS differences tend to be smaller
than UAS differences can probably be explained by the fact that MSTParser uses a two-
stage approach, where the second labeling stage is the same for all three conditions.
With respect to root accuracy, the most interesting observation is that both First and
Last seem to given higher precision than None, which suggests that it is an advantage
to represent root attachments explicitly so that features over these arcs can contribute to
the overall score of a parse tree. It is also worth noting that these features are different
for First and Last, despite the arc-factored model, because of the so-called “in-between
features” that record the part-of-speech tags of words occurring between the head
and the dependent of an arc, which in turn explains why these two conditions do not
always give the same results.
4. Discussion
The main conclusion we draw from this experiment is that the addition of a dummy
word prefixed or suffixed to a sentence is not a mere technical convenience without
impact on empirical results. Whether we include a dummy word representing the root
of the dependency tree and, if so, where we place this word in the sequence of input
tokens, can have a non-negligible effect on parsing accuracy for different parsers—in
some cases resulting in statistically significant differences with a magnitude of several
percentage points according to standard evaluation metrics.
The nature and magnitude of the impact definitely depends on the parsing model
used. Whereas the deterministic arc-eager parser gives consistently worse results with a
dummy root node positioned at the beginning of the sentence, neither the deterministic
arc-standard parser nor the maximum spanning tree parser has any clear preference in
this respect. Although the overall patterns emerging when averaging over many data
sets can largely be explained in this way, there is also considerable variation across data
sets that we do not yet fully understand, however. Zooming in on root accuracy has
allowed us to start forming hypotheses, such as the impact of long sentences in com-
bination with predominantly head-initial structures for Arabic, but a full exploration
of the interaction of parsing models and language-specific properties is clearly outside
the scope of this article and has to be left for future research. Another limitation of
the current study is that it only examines three different parsers, and although this
is clearly sufficient to prove the existence of the phenomenon it will be interesting to
see whether the same patterns can be found if we examine more recent state-of-the-art
methods, going from deterministic parsing to beam search for transition-based parsing
and from arc-factored to higher-order models for graph-based parsing. In this context,
it is also relevant to mention previous work, such as Hall et al. (2007) and Attardi
and Dell’Orletta (2009), which have tried to improve parsing accuracy by switching or
11
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
3
9
1
5
1
7
9
9
0
1
3
/
c
o
l
i
_
a
_
0
0
1
3
2
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Computational Linguistics
Volume 39, Number 1
combining parsing directions, which implicitly has the effect of changing the position
of the root node (if present).
In conclusion, we believe there may be two methodological lessons to learn from
our experiments. The first is that, for certain parsing models, the existence and place-
ment of the dummy root node is in fact a parameter worth tuning for best performance.
Thus, for the deterministic arc-eager parser, it seems that we can obtain higher parsing
accuracy by placing the dummy root node at the end of the sentence (or omitting
it completely) instead of placing it at the beginning in the sentence, as is currently
the norm in data-driven dependency parsing. The second lesson is that, because the
differences observed between different conditions are sometimes at least as large as
the differences considered significant when comparing different parsing models, the
status of the dummy root node may be an underestimated source of variation and a
variable that needs to be controlled for in experimental evaluations. The current practice
of consistently placing the root node at the beginning of the sentence is one way of
ensuring comparability of results, but given the arbitrariness of this decision together
with our experimental results, it may be worth exploring other representations as well.
Acknowledgments
Thanks to Ryan McDonald, Yoav Goldberg,
and three anonymous reviewers for useful
comments, and to Ryan also for help in
modifying MSTParser. Miguel Ballesteros is
funded by the Spanish Ministry of Education
and Science (TIN2009-14659-C03-01 Project).
References
Attardi, Giuseppe and Felice Dell’Orletta.
2009. Reverse revision and linear tree
combination for dependency parsing. In
Proceedings of Human Language Technologies:
The 2009 Annual Conference of the North
American Chapter of the Association for
Computational Linguistics (NAACL HLT),
pages 261–264, Boulder, CO.
B ¨ohmov´a, Alena, Jan Hajiˇc, Eva Hajiˇcov´a,
and Barbora Hladk´a. 2003. The Prague
Dependency Treebank: A three-level
annotation scenario. In Anne Abeill´e,
editor, Treebanks: Building and Using
Parsed Corpora. Kluwer, pages 103–127.
Buchholz, Sabine and Erwin Marsi. 2006.
CoNLL-X shared task on multilingual
dependency parsing. In Proceedings of
the 10th Conference on Computational
Natural Language Learning (CoNLL),
pages 149–164, New York, NY.
de Marneffe, Marie-Catherine, Bill
MacCartney, and Christopher D.
Manning. 2006. Generating typed
dependency parses from phrase
structure parses. In Proceedings of
the 5th International Conference on
Language Resources and Evaluation
(LREC), pages 449–454, Genoa.
12
Hajiˇc, Jan, Barbora Vidova Hladka, Jarmila
Panevov´a, Eva Hajiˇcov´a, Petr Sgall, and
Petr Pajas. 2001. Prague Dependency
Treebank 1.0. LDC, 2001T10.
Hall, Johan, Jens Nilsson, Joakim Nivre,
G ¨ulsen Eryi ˘git, Be´ata Megyesi, Mattias
Nilsson, and Markus Saers. 2007. Single
malt or blended? A study in multilingual
parser optimization. In Proceedings of the
CoNLL Shared Task of EMNLP-CoNLL 2007,
pages 933–939, Prague.
Marcus, Mitchell P., Beatrice Santorini, and
Mary Ann Marcinkiewicz. 1993. Building
a large annotated corpus of English: The
Penn Treebank. Computational Linguistics,
19:313–330.
McDonald, Ryan. 2006. Discriminative
Learning and Spanning Tree Algorithms
for Dependency Parsing. Ph.D. thesis,
University of Pennsylvania.
McDonald, Ryan, Fernando Pereira,
Kiril Ribarov, and Jan Hajiˇc. 2005.
Non-projective dependency parsing using
spanning tree algorithms. In Proceedings of
the Human Language Technology Conference
and the Conference on Empirical Methods in
Natural Language Processing (HLT/EMNLP),
pages 523–530, Vancouver.
Nivre, Joakim. 2003. An efficient algorithm
for projective dependency parsing.
In Proceedings of the 8th International
Workshop on Parsing Technologies (IWPT),
pages 149–160, Nancy.
Nivre, Joakim. 2004. Incrementality in
deterministic dependency parsing.
In Proceedings of the Workshop on
Incremental Parsing: Bringing Engineering
and Cognition Together (ACL), pages 50–57,
Barcelona.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
3
9
1
5
1
7
9
9
0
1
3
/
c
o
l
i
_
a
_
0
0
1
3
2
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Ballesteros and Nivre
Going to the Roots of Dependency Parsing
Nivre, Joakim. 2006. Inductive Dependency
Parsing. Springer, Berlin.
Nivre, Joakim, Johan Hall, Sandra K ¨ubler,
Ryan McDonald, Jens Nilsson, Sebastian
Riedel, and Deniz Yuret. 2007. The CoNLL
2007 shared task on dependency parsing.
In Proceedings of the CoNLL Shared Task of
EMNLP-CoNLL 2007, pages 915–932,
Prague.
Nivre, Joakim, Johan Hall, and Jens Nilsson.
2006. Maltparser: A data-driven
parser-generator for dependency parsing.
In Proceedings of the 5th International
Conference on Language Resources and
Evaluation (LREC), pages 2216–2219,
Genoa.
Nivre, Joakim and Jens Nilsson. 2005.
Pseudo-projective dependency parsing.
In Proceedings of the 43rd Annual Meeting
of the Association for Computational
Linguistics (ACL), pages 99–106,
Ann Arbor, MI.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
3
9
1
5
1
7
9
9
0
1
3
/
c
o
l
i
_
a
_
0
0
1
3
2
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
13
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
3
9
1
5
1
7
9
9
0
1
3
/
c
o
l
i
_
a
_
0
0
1
3
2
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Scarica il pdf