Parsing Chinese Sentences with
Grammatical Relations

Weiwei Sun
Peking University
Institute of Computer Science and
Technology and Center for
Chinese Linguistics
ws@pku.edu.cn

Yufei Chen
Peking University
Institute of Computer Science and
Technology
yufei.chen@pku.edu.cn

Xiaojun Wan
Peking University
Institute of Computer Science and
Technology
wanxiaojun@pku.edu.cn

Meichun Liu
City University of Hong Kong
Department of Linguistics and
Translation
meichliu@cityu.edu.hk

We report our work on building linguistic resources and data-driven parsers in the grammatical
relation (GR) analysis for Mandarin Chinese. Chinese, as an analytic language, encodes gram-
matical information in a highly configurational rather than morphological way. Accordingly, it is
possible and reasonable to represent almost all grammatical relations as bilexical dependencies. In
this work, we propose to represent grammatical information using general directed dependency
graphs. Both local and rich long-distance dependencies are explicitly represented. To create
high-quality annotations, we take advantage of an existing TreeBank, namely, Chinese TreeBank
(CTB), which is grounded on the Government and Binding theory. We define a set of linguistic
rules to explore CTB’s implicit phrase structural information and build deep dependency graphs.
The reliability of this linguistically motivated GR extraction procedure is highlighted by manual evaluation.

Submission received: 23 November 2017; revised version received: 19 September 2018; accepted for
publication: 19 November 2018.

doi:10.1162/COLI_a_00343

© 2019 Association for Computational Linguistics
Published under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
(CC BY-NC-ND 4.0) license


Based on the converted corpus, data-driven, including graph- and transition-based,
models are explored for Chinese GR parsing. For graph-based parsing, a new perspective, graph
merging, is proposed for building flexible dependency graphs: constructing complex graphs via
constructing simple subgraphs. Two key problems are discussed in this perspective: (1) how to
decompose a complex graph into simple subgraphs, and (2) how to combine subgraphs into a
coherent complex graph. For transition-based parsing, we introduce a neural parser based on
a list-based transition system. We also discuss several other key problems, including dynamic
oracle and beam search for neural transition-based parsing. Evaluation gauges how successful
GR parsing for Chinese can be by applying data-driven models. The empirical analysis suggests
several directions for future study.

1. Introduction

Dependency grammar is a longstanding tradition that determines syntactic structures on the basis of word-to-word connections. It names a family of approaches to linguistic analysis that all share a commitment to typed relations between ordered pairs of words. Typically, dependencies represent various grammatical relations (GRs), which are exemplified in traditional grammars by the notions of subject, direct/indirect object, etc., and therefore encode rich syntactic information of natural language sentences in an explicit way. In recent years, parsing a sentence to a dependency representation has been well studied and widely applied to many Natural Language Processing (NLP) tasks, for example, Information Extraction and Machine Translation. In particular, data-driven approaches have made great progress during the past two decades (Kübler, McDonald, and Nivre 2009; Koo et al. 2010; Pitler, Kannan, and Marcus 2013; Chen and Manning 2014; Weiss et al. 2015; Dozat and Manning 2016). Various practical parsing systems have been built, not only for English but also for a large number of typologically different languages, for example, Arabic, Basque, Catalan, Chinese, Czech, Greek, Hungarian, Italian, and Turkish (Nivre et al. 2007).

Previous work on dependency parsing mainly focused on structures that can be
represented in terms of directed bilexical dependency trees. Although tree-shaped
graphs have several desirable properties from the computational perspective, they do
not work well for coordinations, long-range dependencies involved in raising, control,
as well as extraction, and many other complicated linguistic phenomena that go beyond
the surface syntax. Some well-established and leading linguistic theories, such as Word
Grammar (Hudson 1990) and Meaning-Text Theory (Mel'čuk 2001), argue that more
general dependency graphs are necessary to represent a variety of syntactic or semantic
phenomena.

In this article, we are concerned with parsing Chinese sentences to an enriched deep
dependency representation, which marks up a rich set of GRs to specify a linguistically-
rich syntactic analysis. Different from the popular surface1 tree-based dependency rep-
resentation, our GR annotations are represented as general directed graphs that express
not only local but also various long-distance dependencies (see Figure 1 for an example).
To enhance the tree-shaped dependency representation, we borrow the key ideas under-
lying Lexical Function Grammar (LFG) (Bresnan and Kaplan [1982], Dalrymple [2001]),
a well-defined and widely-applied linguistic grammar formalism. En particulier, GRs

1 In this work, we arguably take the dependency tree representation as a surface-oriented syntactic analysis.


[Figure 1 appears here: the GR graph for the sentence 浦东 近年 来 颁布 实行 了 涉及 经济 领域 的 法规性 文件 (Pudong recently issue practice involve economic field regulatory document), with arcs labeled root, subj, temp, prt, comp, obj, subj*ldd, nmod, and relative.]

Figure 1
An example: Pudong recently enacted regulatory documents involving the economic field. The symbol
“*ldd” indicates long-distance dependencies; “subj*ldd” between the word “涉及/involve” and
the word “文件/documents” represents a long-range subject–predicate relation. The arguments
and adjuncts of the coordinated verbs, namely “颁布/issue” and “实行/practice,” are
separately yet distributively linked to the two heads.

can be viewed as the lexicalized dependency backbone of an LFG analysis that provides
general linguistic insights.

To acquire a high-quality GR corpus, we propose a linguistically-motivated algo-
rithm to translate a Government and Binding theory (GB) (Chomsky [1981], Carnie
[2007]) grounded phrase structure treebank, namely, Chinese TreeBank (CTB) (Xue et al.
[2005]) to a deep dependency bank where the GRs are explicitly represented. A manual
evaluation highlights the reliability of our linguistically-motivated GR extraction algo-
rithm: The overall dependency-based precision and recall are 99.17% and 98.87%. These
numbers are calculated based on 209 sentences that are randomly selected and manually
corrected. The automatically-converted corpus would be of use for a wide variety of
NLP tasks, for example, parser training and cross-formalism parser evaluation.

By means of the newly created corpus, we study data-driven models for deep
dependency parsing. The key challenge in building GR graphs is the range of relational
flexibility. Different from the surface syntax, GR graphs are not constrained to trees.
However, the tree structure is a fundamental consideration of the design for the majority
of existing parsing algorithms. To deal with this problem, we propose graph merging,
an innovative perspective for building flexible representations. The basic idea is to
decompose a GR graph into several subgraphs, each of which captures most but not
necessarily the complete information. On the one hand, each subgraph is simple enough
to allow efficient construction. On the other hand, the combination of all subgraphs
enables the whole target GR structure to be produced.

Based on this design, a practical parsing system is developed with a two-step
architecture. In the first step, different graph-based models are applied to assign scores
to individual arcs and various tuples of arcs. Each individual model is trained to target
a unique type of subgraph. In the second step, a Lagrangian Relaxation-based joint
decoder is applied to efficiently produce globally optimal GR graphs according to all
graph-based models.

There are two main problems in the graph merging perspective. First, how do we
decompose a complex graph into simple subgraphs in a principled way? To deal with
this problem, we propose modeling the graph decomposition problem as a constrained
optimization problem and leverage appropriate objective functions as well as con-
straints to reflect essential structure-specific properties of the syntactically-motivated

97

Computational Linguistics

Volume 45, Nombre 1

GR graphs. In particular, we consider the reachability property: In a given GR graph,
every node is reachable from the same unique root. This property ensures that a GR
graph can be successfully decomposed into a limited number of forests, which in turn
can be accurately and efficiently built via tree parsing. Secondly, how can we merge
subgraphs into one coherent structure in a principled way? The problem of finding an
optimal graph that consistently combines the subgraphs obtained through individual
models is non-trivial. We also treat this problem as an optimization problem and employ
Lagrangian Relaxation to solve the problem.

Motivated by the recent successes of transition-based approaches to tree parsing,
we also explore the transition-based models for building deep dependency structures.
We utilize a list-based transition system that uses open lists for organizing partially
processed tokens. This transition system is sound and complete with respect to directed
graphs without self-loop. With reference to this system, we implement a data-driven
parser with a neural classifier based on Long Short-Term Memory (LSTM) (Hochreiter
and Schmidhuber [1997]). To improve the parsing quality, we look further into the
impact of dynamic oracle and beam search on training and/or decoding.

Experiments on CTB 6.0 are conducted to profile the two types of parsers. Evalu-
ation gauges how successful GR parsing for Chinese can be by applying data-driven
models. Detailed analysis reveals some important factors that may possibly boost the
performance. In particular, the following non-obvious facts should be highlighted:

1. Both the graph merging and transition-based parsers produce high-quality GR analysis with respect to dependency matching, although they do not use any phrase structure information.

2. When gold-standard POS tags are available, the graph merging model obtains a labeled f-score of 84.57 on the test set when coupled with a global linear scorer, and 86.17 when coupled with a neural model. Empirical analysis indicates the effectiveness of the proposed graph decomposition and merging methods.

3. The transition-based parser can be trained with either the dynamic oracle or the beam search method. According to our implementations, the dynamic oracle method performs better. Coupled with dynamic oracle, the transition-based parser reaches a labeled f-score of 85.51 when gold POS tags are utilized.

4. Gold-standard POS tags play a very important role in achieving the above accuracies. However, the automatic tagging quality of state-of-the-art Chinese POS taggers is still far from satisfactory. To deal with this problem, we apply ELMo (Peters et al. 2018), a contextualized word embedding producer, to obtain adequate lexical information. Experiments show that ELMo is extremely effective in harvesting lexical knowledge from raw texts. With the help of the ELMo vectors but not any POS tagging results, the graph merging parser achieves a labeled f-score of 84.90. This is a realistic set-up to evaluate how accurate the GR parsers could be for real-world Chinese Language Processing applications.

The current study expands on the earlier and preliminary results, which have
been published in Sun et al. (2014) and Sun, Du, and Wan (2017). The new findings
and contributions in this submission include: (1) a detailed comparison between our


GR analysis and other syntactic/semantic analyses, including Universal Dependency,
Semantic Role Labeling, and LFG’s f-structure, (2) the design of a new neural graph
merging parser, (3) the design of a new neural transition-based parser, et (4) plus
comprehensive evaluations and analyses of the neural parsing models.

The implementation of the corpus conversion algorithm, a corpus visualization
tool, and a set of manually labeled sentences are available at http://www.aclweb.org/
anthology/attachments/P/P14/P14-1042.Software.zip. The implementation of the
two neural parsers is available at https://github.com/pkucoli/omg.

2. Background

2.1 Deep Linguistic Processing

Deep linguistic processing is concerned with NLP approaches that aim at modeling the
complexity of natural languages in rich linguistic representations. Such approaches are
typically related to a particular computational linguistic theory, including Combinatory
Categorial Grammar (CCG) (Steedman [1996, 2000]), Lexical Functional Grammar (LFG)
(Bresnan and Kaplan [1982], Dalrymple [2001]), Head-driven Phrase Structure Grammar
(HPSG) (Pollard and Sag [1994]), and Tree-adjoining Grammar (TAG) (Joshi and Schabes
[1997]). These theories provide not only the description of the syntactic structures, mais
also the ways in which meanings are composed, and thus parsing in these formalisms
provides an elegant way to simultaneously obtain both syntactic and semantic analyses,
generating valuable and richer linguistic information. Deep linguistic annotators and
processors are in strong demand in NLP applications, such as machine translation
(Oepen et al. 2007; Wu, Matsuzaki, and Tsujii 2010) and text mining (Miyao et al. 2008).
In general, derivations based on deep grammar formalisms may provide a wider
variety of syntactic and semantic dependencies in more explicit detail. Deep parses
arguably contain the most linguistically satisfactory account of the deep dependen-
cies inherent in many complicated linguistic phenomena, for example, coordination
and extraction. Among the grammatical models, LFG and HPSG encode grammatical
functions directly, and they are adequate for generating predicate–argument analyses
(King et al. 2003; Flickinger, Zhang, and Kordoni 2012). In terms of CCG, Hockenmaier
and Steedman (2007) introduced a co-indexing method to extract (predicate, argument)
pairs from CCG derivations; Clark and Curran (2007un) also utilized a similar method
to derive LFG-style grammatical relations. These bilexical dependencies can be used
to approximate the corresponding semantic structures, such as logic forms in Minimal
Recursion Semantics (Copestake et al. 2005).

In recent years, considerable progress has been made in deep linguistic processing
for realistic texts. Traditionally, deep linguistic processing has been concerned with
grammar development for parsing and generation. During the last decade, techniques
for treebank-oriented grammar development (Cahill et al. 2002; Miyao, Ninomiya, et
Tsujii 2005; Hockenmaier and Steedman 2007) and statistical deep parsing (Clark and
Curran 2007b; Miyao and Tsujii 2008) have been well studied for English. Statistical
models that are estimated on large-scale treebanks play an essential role in boosting the
parsing performance.

It is generally believed that the representation format for parser outputs may greatly
affect its impact on applications. Predicate–argument structures extracted from deep
parses have been shown to be very helpful for NLP tasks such as information extraction (Miyao
et autres. 2008; Yakushiji et al. 2005). Partly due to their importance in information process-
ing, deep dependencies are the de facto standard to evaluate deep parsers (Kaplan et al.


2004; Briscoe and Carroll 2006; Clark and Curran 2007a; Bender et al. 2011), y compris
C&C2 and Enju.3 If the interpretable predicate–argument structures are so valuable,
then the question arises: Why can we not directly build deep dependency graphs?

2.2 Data-Driven Dependency Parsing

Data-driven, grammar-free dependency parsing has received an increasing amount of
attention in the past decade. Such approaches, for example, transition-based (Yamada
and Matsumoto 2003; Nivre 2008) and graph-based (McDonald et al. 2005; McDonald,
Crammer, and Pereira 2005) models have attracted the most attention in dependency
parsing in recent works. Transition-based parsers utilize transition systems to derive
dependency trees together with treebank-induced statistical models for predicting tran-
sitions. This approach was pioneered by Yamada and Matsumoto (2003) and Nivre,
Hall, and Nilsson (2004). Specifically, deep learning techniques have been successfully
applied to enhance the parsing accuracy (Chen and Manning 2014; Weiss et al. 2015;
Andor et al. 2016; Kiperwasser and Goldberg 2016; Dozat and Manning 2016).

Chen and Manning (2014) made the first successful attempt at incorporating deep
learning into a transition-based dependency parser. A number of other researchers have
attempted to address some limitations of their parser by augmenting it with additional
complexity: Later it was enhanced with a more principled decoding algorithm, namely
beam search, as well as a conditional random field loss objective function (Weiss et al.
2015; Watanabe and Sumita 2015; Zhou et al. 2015; Andor et al. 2016).

A graph-based system explicitly parameterizes models over substructures of a
dependency tree, and formulates parsing as a Maximum Spanning Tree problem
(McDonald et al. 2005). Similar to the transition-based approach, it also utilizes treebank
annotations to train strong statistical models to assign scores to word pairs. Many re-
searchers focus on developing principled decoding algorithms to resolve the Maximum
Spanning Tree problem involved, for both projective (McDonald and Pereira 2006; Koo
and Collins 2010) and non-projective structures (Koo et al. 2010; Kuhlmann 2010; Pitler,
Kannan, and Marcus 2013). Deep learning has also been proven to be powerful for
disambiguation and has remained a hot topic in recent research (Kiperwasser and Goldberg
2016; Dozat and Manning 2016).

Kiperwasser and Goldberg (2016) proposed a simple yet effective architecture to
implement neural dependency parsers. En particulier, a bidirectional-LSTM (Bi-LSTM) est
utilized as a powerful feature extractor to assist a dependency parser. Mainstream data-
driven dependency parsers, including both transition- and graph-based ones, can apply
useful word vectors provided by a Bi-LSTM to calculate scores. Following Kiperwasser
and Goldberg (2016)’s experience, we implemented such a parser to evaluate the impact
of empty categories on surface parsing.

Most research concentrated on surface syntactic structures, and the majority of
existing approaches are limited to producing only trees. We notice several exceptions.
Sagae and Tsujii (2008) and Henderson et al. (2013) individually introduced two tran-
sition systems that can generate specific graphs rather than trees. Unfortunately, the
two transition systems only cover 51.0% and 76.5%, respectively, of graphs in our
data, highlighting the high complexity of our graph representation and the inadequacy
of existing transition systems. McDonald and Pereira (2006) presented a graph-based

2 http://svn.ask.it.usyd.edu.au/trac/candc/
3 http://kmcs.nii.ac.jp/enju/


parser that can generate dependency graphs in which a word may depend on multiple
heads. Encouraged by their work, we study new data-driven models to build deep
dependency structures for Mandarin Chinese.

2.3 Initial Research on Chinese Deep Linguistic Processing

In the last few years, the study of deep linguistic processing for Chinese has been initiated.
Treebank annotation for individual formalisms is prohibitively expensive. To
quickly construct deep annotations, corpus-driven grammar engineering has been de-
veloped. Phrase structure trees in CTB have been semi-automatically converted to deep
derivations in the CCG (Tse and Curran 2010), LFG (Guo, van Genabith, and Wang
2007), and HPSG (Yu et al. 2010) formalisms. To semi-automatically extract a large-
scale HPSG grammar, Yu et al. (2010) defined a skeleton, including the structure of
sign, grammatical principles, and schemata, based on which, the CTB trees are con-
verted into HPSG-style trees. The treebank conversion under the CCG formalism is
relatively easier. Tse and Curran (2010) followed the method proposed for English Penn
Treebank (Hockenmaier and Steedman 2007) to process CTB. The main steps include
(1) distinguishing head, argument, and adjunct, (2) binarizing CTB trees, and (3) re-
defining syntactic categories. To more robustly construct f-structure for CTB trees, Guo,
van Genabith, and Wang (2007) proposed a dependency-based model, which extracts
functional information from dependency trees.

With the use of converted fine-grained linguistic annotations, successful English
deep parsers, such as C&C (Clark and Curran 2007b) and Enju (Miyao and Tsujii 2008),
have been evaluated on the Chinese annotations (Yu et al. 2011; Tse and Curran 2012).
Although the discriminative learning architecture of both C&C and Enju parsers makes
them relatively easy to be adapted to solve multilingual parsing, their performance on
Chinese sentences is far from satisfactory. Yu et al. (2011) and Tse and Curran (2012)
analyze the challenges and difficulties in Chinese deep parsing. In particular, some
language-specific properties account for a large number of errors.

3. Representing Deep Linguistic Information Using Dependency Graphs

In this section, we discuss the construction of the GR annotations. Basically, the an-
notations are automatically converted from a GB-grounded phrase structure treebank,
namely CTB. Conceptually, this conversion is similar to the conversions from CTB struc-
tures to representations in deep grammar formalisms (Xia 2001; Guo, van Genabith, et
Wang 2007; Tse and Curran 2010; Yu et al. 2010). However, our work is grounded in GB,
which is the linguistic basis of the construction of CTB. We argue that this theoretical
choice makes the conversion process more compatible with the original annotations and
therefore more accurate. We use directed graphs to explicitly encode bilexical dependen-
cies involved in coordination, raising/control constructions, extraction, topicalization,
and many other complicated phenomena. Chiffre 2 shows the original CTB annotation
of the sentence in Figure 1.

3.1 Linguistic Basis

GRs are encoded in different ways in different languages and languages may employ mixed
or multiple strategies for grammatical function encoding. In some languages, for ex-
ample, Turkish, grammatical function is encoded by means of morphological marking,
and there may be no uniform position where a particular grammatical function must


[Figure 2 appears here: the original CTB phrase structure tree of the running example, with constituent labels such as IP, NP-PN-SBJ, VP, LCP-TMP, VCD, NP-OBJ, CP, WHNP-1, NP-SBJ, DEC, and the empty categories -NONE- *OP* and *T*-1.]

Figure 2
The original CTB annotation of the running example.

appear. In a highly configurational language such as Chinese, however, the grammatical
function of a phrase is heavily determined by its constituent structure position.
Dominant Chomskyan theories, including GB, have defined GRs as configurations at
phrase structures. Following this principle, CTB groups words into constituents through
the use of a limited set of fundamental grammatical functions. Par exemple, one bracket
represents only one grammatical relation in CTB, providing phrase structure annota-
tions with specifications of particular grammatical functions. Transformational grammar
utilizes empty categories to represent long-distance dependencies. In CTB, traces are pro-
vided by relating displaced linguistic material to where it should be interpreted seman-
tically. By exploiting configurational information, traces, and functional tag annotations,
GR information can hopefully be derived from CTB trees with high accuracy.

3.2 The Extraction Algorithm

Our treebank conversion algorithm borrows key insights from LFG. LFG minimally posits two
levels of representation: c(onstituent)-structure and f(unctional)-structure.
C-structure is represented by phrase structure trees, and captures surface syntactic
configurations such as word order, whereas f-structure encodes grammatical functions.
It is easy to extract a dependency backbone that approximates basic predicate–
argument–adjunct structures from f-structures. The construction of the widely used
PARC DepBank (King et al. 2003) is a good example.

LFG relates c-structure and f-structure through f-structure annotations, which com-
positionally map every constituent to a corresponding f-structure. Borrowing this key
idea, we translate CTB trees to dependency graphs by first augmenting each con-
stituency with f-structure annotations, then propagating the head words of the head
or conjunct daughter(s) upwards to their parents, and finally creating a dependency
graph. The details are presented step-by-step as follows:

3.2.1 Tapping Implicit Information. A systematic study to tap the implicit functional
information of CTB has been introduced by Xue (2007). This gives us a very good


[Figure 3 appears here: the CTB tree of the running example augmented with LFG-like functional equations such as ↓=↑, ↓=(↑ SUBJ), ↓=(↑ OBJ), ↓=(↑ COMP), ↓=(↑ PRT), ↓∈(↑ TMP), ↓∈(↑ NMOD), and ↓∈(↑ REL).]

Figure 3
The original CTB annotation augmented with LFG-like functional annotations of the running
example.

start to extract GRs. We slightly modify their method to enrich a CTB tree with
f-structure annotations: Each node in a resulting tree is annotated with one and only
one corresponding equation (see Figure 3 for an example). Comparing the original
and enriched annotations, we can see that the functionality of this step is to explicitly
represent and regulate grammatical functions4 for every constituent. We enrich the CTB
trees with function information by using the conversion tool5 that generated the Chinese
data sets for the CoNLL 2009 Shared Task on multilingual dependency parsing and
semantic role labeling.
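To give a concrete sense of this enrichment step, the following is a minimal illustrative sketch, not the actual conversion tool: it maps a handful of CTB functional tags to LFG-style equations of the kind shown in Figure 3. The tag-to-equation table and the function name are our own and cover only a toy subset of the real rule inventory.

```python
# Illustrative sketch only: a toy mapping from CTB functional tags to
# LFG-style annotations like those in Figure 3. The real conversion tool
# uses a much richer rule inventory; this table is not exhaustive.
FUNCTION_TAG_TO_EQUATION = {
    "SBJ": "↓=(↑ SUBJ)",   # subject, e.g., NP-SBJ
    "OBJ": "↓=(↑ OBJ)",    # object, e.g., NP-OBJ
    "TMP": "↓∈(↑ TMP)",    # temporal adjunct, e.g., LCP-TMP
}

def annotate(constituent_label: str) -> str:
    """Return an LFG-style equation for a CTB label such as 'NP-PN-SBJ'."""
    for tag in constituent_label.split("-")[1:]:
        if tag in FUNCTION_TAG_TO_EQUATION:
            return FUNCTION_TAG_TO_EQUATION[tag]
    return "↓=↑"  # default: the constituent is a head or co-head daughter

if __name__ == "__main__":
    for label in ["NP-PN-SBJ", "LCP-TMP", "NP-OBJ", "VP"]:
        print(label, "->", annotate(label))
```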

3.2.2 Beyond CTB Annotations: Tracing More. Natural languages do not always interpret
linguistic materials locally. In order to obtain accurate and complete GR, predicate–
argument, or logical form representations, a hallmark of deep grammars is that they
usually involve a non-local dependency resolution mechanism. CTB trees utilize empty
categories and co-indexed materials to represent long-distance dependencies. An empty
category is a nominal element that does not have any phonological content and is
therefore unpronounced. There are two main types of empty categories: null pronounce
and trace, and each type can be further classified into different subcategories. Deux
kinds of anaphoric empty categories, c'est, big PRO and trace, are annotated in CTB.
Theoretically speaking, only trace is generated as the result of movement and therefore
annotated with antecedents in CTB. We carefully checked the annotation and found
that a considerable number of antecedents are not labeled, and hence a large amount of
important non-local information is missing. In addition, because the big PRO is also

4 See the Appendix for the list of major grammatical functions defined by our algorithm.
5 www.cs.brandeis.edu/~clp/ctb/ctb.html


anaphoric, it is possible to find co-indexed components sometimes. Such non-local
information is also valuable in marking the dependency relation.

Beyond CTB annotations, we introduce a number of phrase structure patterns to
extract more non-local dependencies. The method heavily leverages linguistic rules
to exploit structural information. We take into account both theoretical assumptions
and analyzing practices to enrich co-indexation information according to phrase struc-
ture patterns. En particulier, we try to link an anaphoric empty category e with its
c-commonders if no non-empty antecedent has already been co-indexed with e. Because
the CTB is influenced deeply by the X-bar syntax, which highly regulates constituent
analyse, the number of linguistic rules we have is quite modest. For the development
of conversion rules, we used the first 9 files of CTB, which contain about 100 phrases.
Readers can refer to the well-documented Perl script for details (voir la figure 3 for an
example). The noun phrase “法规性文件/regulatory documents” is related to the trace
“*T*.” This co-indexation is not labeled in the original annotation.

3.2.3 Passing Head Words and Linking Empty Categories. Based on an enriched tree, our
algorithm propagates the head word of the head daughter upwards to its parent,
links co-indexed units, and finally creates a GR graph. The partial result after head
word passing of the running example is shown in Figure 4. There are two differences
in the head word passing between our GR extraction and a “normal” dependency tree
extraction. First, the GR extraction procedure may pass multiple head words to their
parent, especially in a coordination construction. Secondly, long-distance dependencies
are created by linking empty categories and their co-indexed phrases.
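The following minimal sketch, written for this article with invented data structures, illustrates the upward head-word passing: ordinary nodes pass up the head word of their head daughter, while coordination nodes (such as the VCD node in Figure 4) pass up the head words of all conjuncts.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    label: str                          # constituent label, e.g., "VP" or "VCD"
    word: Optional[str] = None          # terminal word, if this is a leaf
    children: List["Node"] = field(default_factory=list)
    head_child: int = 0                 # index of the head daughter
    coordination: bool = False          # True if the daughters are conjuncts

def head_words(node: Node) -> List[str]:
    """Propagate head words upward; coordination passes up all conjunct heads."""
    if node.word is not None:
        return [node.word]
    if node.coordination:
        heads = []
        for child in node.children:
            heads.extend(head_words(child))
        return heads
    return head_words(node.children[node.head_child])

# Toy fragment of Figure 4: a VCD coordination of 颁布 and 实行 inside a VP.
vcd = Node("VCD", coordination=True,
           children=[Node("VV", word="颁布"), Node("VV", word="实行")])
vp = Node("VP", children=[vcd, Node("AS", word="了")], head_child=0)
print(head_words(vp))   # ['颁布', '实行']: multiple head words passed upward
```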

The coding of long-distance dependency relations is particularly important for
processing grammatical relations in Chinese. Compared to English, Chinese allows
more flexibility in word order and a wider range of “gap-and-filler” relations. Among
the common long-distance dependencies such as co-referential indexing, cleft-focus,
prepositional phrase, and coordination, relativization in Chinese is of particular interest.
Chinese relativization is structurally head-final, with the relative clause, marked by
de 的 (the grammatical marker of nominal modifier), occurring before the head noun.
The head noun may hold any kind of semantic relation with the proceeding relative

[Figure 4 appears here: the partially lexicalized tree after head word passing, with nodes such as IP{颁布,实行}, NP{浦东}, VP{颁布,实行}, LCP{}, AS{}, NP{文件}, CP{}, IP{涉及}, DEC{}, NP{文件*ldd}, and VP{涉及}.]

Figure 4
An example of a lexicalized tree after upward head word passing. Only a partial result is shown.
The long-distance dependency between “涉及/involve” and “文件/document” is created
through copying the dependent to a coindexed anaphoric empty category position.


Table 1
Manual evaluation of 209 sentences.

            Precision   Recall   F1
Unlabeled   99.48       99.17    99.32
Labeled     99.17       98.87    99.02

clause. In other words, the Chinese relative structure will have the “gap” occurring before
the “filler,” and there is little restriction on the semantic roles of the relativized head
noun. Chinese-specific structural peculiarities may give rise to unexpected difficulties
in sentence processing. Applying an augmented version of dependency notations, our
system is able to handle such complicated issues in processing Chinese sentences.

3.3 Manual Evaluation

To have a precise understanding of whether our extraction algorithm works well, we
have selected 20 files that contain 209 sentences in total for manual evaluation. Linguis-
tic experts carefully examine the corresponding GR graphs derived by our extraction
algorithm and correct all errors. Autrement dit, a gold-standard GR annotation set is
created. The measure for comparing two dependency graphs is precision/recall of GR
tokens, which are defined as ⟨wh, wd, l⟩ tuples, where wh is the head, wd is the dependent,
and l is the relation. Labeled precision/recall (LP/LR) is the ratio of tuples correctly
identified by the automatic generator, while unlabeled precision/recall (UP/UR) is the
ratio regardless of l. F-score is a harmonic mean of precision and recall. These measures
correspond to attachment scores (LAS/UAS) in dependency tree parsing. To evaluate
our GR parsing models that will be introduced later, we also report these metrics.
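These measures can be computed directly from two sets of ⟨head, dependent, label⟩ tuples. The sketch below, with our own helper names and toy data, shows labeled and unlabeled precision, recall, and F-score over such sets.

```python
def prf(gold, pred):
    """Precision, recall, and F1 over two sets of GR tokens."""
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def evaluate(gold_arcs, pred_arcs):
    """gold_arcs and pred_arcs are sets of (head, dependent, label) tuples."""
    labeled = prf(set(gold_arcs), set(pred_arcs))
    unlabeled = prf({(h, d) for h, d, _ in gold_arcs},
                    {(h, d) for h, d, _ in pred_arcs})
    return {"LP/LR/LF": labeled, "UP/UR/UF": unlabeled}

if __name__ == "__main__":
    gold = {(4, 1, "subj"), (5, 1, "subj"), (4, 12, "obj")}   # toy gold graph
    pred = {(4, 1, "subj"), (5, 1, "comp"), (4, 12, "obj")}   # toy system output
    print(evaluate(gold, pred))
```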

The overall performance is summarized in Table 1. We can see that the automatic
GR extraction achieves relatively high performance. There are two sources of errors
in treebank conversion: (1) inadequate conversion rules and (2) wrong or inconsistent
original annotations. During the creation of the gold-standard corpus, we find that
rule-based errors are mainly caused by complicated unbounded dependencies and
the lack of internal structure for some phrases. Such problems are very hard to solve
through rules only, if even possible, since original annotations do not provide sufficient
information. The latter problem is more scattered and unpredictable, which requires
manual correction.

3.4 Comparing to the De Facto LFG Analysis

Our extraction algorithm integrates the key idea underlying LFG and the practice of
corpus annotation of CTB. Therefore, the structures obtained are highly comparable
to Dalrymple (2001)’s de facto LFG analysis. Take the running sentence in Figure 1,
for example. LFG separates syntactic analysis into two parallel structures: c-structure
and f-structure. The c-structure provides a basic analysis of the constituent structure. A
standard c-structure for the running example is shown in Figure 5 and this analysis
can be compared to the result—as shown in Figure 3—of the second step of our ex-
traction algorithm. A number of functional schemata are introduced to augment the
phrase structure analysis. By instantiating these schemata, we are able to generate the


[Figure 5 appears here: a de facto LFG c-structure of the running example, annotated with functional schemata such as ↓=↑, ↓=(↑ SUBJ), ↓=(↑ OBJ), ↓∈(↑ ADJ), (↑ ASP)=perfect, (↑ TOPIC PRED)=’PRO’, (↑ TOPIC)=(↑ SUBJ), and (↑ RELPRO)=(↑ TOPIC).]
Figure 5
A de facto c-structure of the running example. The c-structure is augmented by a number of
functional schemata.

corresponding functional description. Finally, we solve the simultaneous equations of
these functional descriptions and then construct the minimal f-structure that satisfies
them. Figure 6 shows the f-structure of the running sentence. According to the linguistic
assumptions under LFG, this f-structure is more linguistically universal and has a
tighter relationship with semantics.

Comparing the two trees in Figures 3 and 5, we can see that most functional schemata

are similar. There are two minor differences between the LFG and our GR analysis.

1. LFG utilizes functional schemata that are assigned to phrase structures rather than empty categories to analyze a number of complex linguistic phenomena. Therefore, in the standard LFG c-structure tree, there are no “-NONE- *T*”-style terminals. Instead, specific constructions or function words take the responsibility. Our corpus is based on the original annotations of CTB, which is based on the GB theory. As a result, our grammatical function-augmented trees include unpronounced nodes, and there are not many functional schemata associated with constructions or function words.

2. The labels that indicate GRs, e.g., subject, objects, adjuncts, are slightly different. This is also due to the annotation practice of CTB. The labels in Figure 3 are designed to maximally reflect implicit functional information of the CTB annotations. Nevertheless, the two labeling systems are highly comparable. For example, the “TMP” label in Figure 3 corresponds to the “ADJ” label in Figure 5 because in most cases a temporal expression is taken as an adjunct.


[Figure 6 appears here: a de facto f-structure of the running example, an attribute–value matrix with features such as PRED, SUBJ, OBJ, ADJ, ASP, TOPIC, and RELPRO, treating 颁布 and 实行 as a serial-verb construction and 涉及 as the predicate of the relative clause.]
Figure 6
A de facto f-structure of the running example.

Some annotations may be different due to theoretical considerations. Take relative
clause for example. The analysis in Figures 5 et 6 is based on the solution provided
by Dalrymple (2001) and Bresnan (2001). The predicate–argument relation between
“涉及/involve” and “文件/document” is not included because LFG treats this as an
anaphoric problem. However, we argue that this relation is triggered by the function
word de “的” as a marker of modification and is therefore put into our GR graph.
Another motivation of this design is to let our GR graph represent more semantic
information. We will continue to discuss this topic in the next subsection.

3.5 Comparing to Other Dependency Representations

In this section, we consider the differences between our GR representation and other
popular dependency-based syntactic and semantic analyses. Chiffre 7 visualizes four
types of cross-representation annotations assigned to the sentence in Figure 1:


1. our GR graph,

2. the CTB-dependency tree,

3. the Universal Dependency, and

4. PropBank-style Semantic Role Labeling.


Two of the dependency representations are based on tree structures that are repre-
sentative of Chinese Language Processing. Both are converted from CTB by applying
heuristic rules. The CTB-dependency tree is the result of transformation introduced by
Xue (2007), with a constituency-to-dependency transformation algorithm that explores
the implicit functional information of CTB. This dependency corpus is used by the

[Figure 7 appears here: four dependency analyses of the running example: (a) the GR graph, (b) the CTB-dependency tree, (c) the Universal Dependency tree, and (d) PropBank-style semantic role labeling.]

Figure 7
Dependency representations in (a) our GR graph, (b) CTB-dependency, (c) Universal
Dependency, and (d) PropBank-style semantic roles.


CoNLL 2009 shared task. Because this transformation algorithm is proposed by the
developer and maintainer, we call it CTB-dependency. The Universal Dependency6
corpus for Chinese is also based on CTB but with different constituency-to-dependency
transformation rules. Both the language- and treebank-specific properties and compa-
rability to resources of other languages are considered. The last representation is mo-
tivated by semantic processing and the dependencies reflect basic predicate–argument
structures of verbs. The annotations are from Chinese PropBank (Xue and Palmer 2009),
a sister resource to CTB. The head words are verbs or their nominalizations, while the
dependent words take semantic roles pertaining to the related verbs.

It can be clearly seen that, compared to the tree-shaped dependency analyses, our
GR representation expresses much more syntactic information through more general
graphs. Furthermore, the syntactic information of our analysis is more transparent with
respect to (compositional) semantics. This property is highlighted by the structural
parallelism between our GR graphs and the semantic role labeling results.

Take, for example, the serial verb construction, a special type of
coordination (颁布实行了 “issued-practiced”). According to the properties
of distributive features, the shared subject, object, and aspect marker hold
the same grammatical relations regarding both conjuncts. Limited by the
single-head constraint, a dependency tree cannot represent all of these
dependencies in an explicit way.
Take the relative clause construction for another example (涉及经济领域的
法规性文件 “the regulatory document that involves economic field”).
There is a long-distance predicate–argument dependency between “涉
及/involve” and “文件/document.” The addition of this dependency to a
tree will bring in a cycle for the CTB-dependency analysis. The Universal
Dependency analysis includes this dependency, because its annotations
are driven by not only syntax but also semantics. Nevertheless, the related
label lacks the information that a subject is displaced.

3.6 A Quantitative Analysis

In the previous two subsections, we presented a phenomenon-by-phenomenon com-
parison to show the similarity and dissimilarity between our GR analysis and other
syntactic/semantic analyses, for example, Universal Dependency, Semantic Role Label-
ing, and LFG’s f-structure. In this subsection, we present a quantitative analysis based
on overall statistics of the derived dependency corpus and a quantitative comparison
with the CTB-dependency corpus. Table 2 shows how many dependencies are shared by
both representations. The majority of grammatical relations involve local dependencies,
and therefore the intersection of both representations is quite large. Nevertheless, a
considerable number of dependencies of the GR representation do not appear in the
tree representations. In principle, the GR representation removes semantically irrelevant
dependencies, and thus it contains fewer arcs. Figure 8 summarizes the distribution of
governable and non-governable GRs with respect to the tree and graph corpora.
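The counts in Table 2 are simple set operations over unlabeled arcs; the following sketch (toy arc sets and our own function name, not the real corpus counts) shows how the intersection and the two differences are obtained from a tree arc set and a graph arc set.

```python
def compare_arc_sets(tree_arcs, graph_arcs):
    """Return Table 2-style statistics for two sets of (head, dependent) arcs."""
    tree, graph = set(tree_arcs), set(graph_arcs)
    return {
        "Tree": len(tree),
        "Graph": len(graph),
        "Graph∩Tree": len(graph & tree),
        "Graph–Tree": len(graph - tree),
        "Tree–Graph": len(tree - graph),
    }

# Toy example with three arcs on each side.
print(compare_arc_sets({(4, 1), (4, 6), (7, 12)}, {(4, 1), (5, 1), (7, 12)}))
```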

6 http://universaldependencies.org


Table 2
The distribution of the unlabeled dependencies. The “Tree” and “Graph” rows indicate how
many dependencies, that is, arcs, in total are included by the CTB-dependency and our graph
representations. Sentences are selected from CTB 6.0 and have been used by the CoNLL 2009
shared task. “Graph∩Tree” means the dependencies in common; “Graph–Tree” means the
dependencies that appear in the GR graphs but not the CTB-dependency trees; “Tree–Graph”
means the dependencies that appear in the CTB-dependency trees but not the GR graphs.

              #Arc
Tree          731,833
Graph         669,910
Graph∩Tree    554,240
Graph–Tree    115,670
Tree–Graph    177,593


[Figure 8 appears here: a bar chart of the percentage of each dependency relation (ADV, AMOD, COMP, LOC, NMOD, OBJ, SBJ, TMP) in the Tree and Graph corpora.]
Figure 8
The distribution of the dependency relations.

4. Graph-Based Parsing via Graph Merging

4.1 The Proposal

The key proposal of this work is to construct a complex structure via constructing
simple partial structures. Each partial structure is simple in the sense that it allows
efficient construction. To construct each partial structure, we can employ mature parsing
techniques. To obtain the final target output, all partial structures are required, as together
they enable the whole target structure to be produced. This article aims to exemplify
the above idea by designing a new parser for obtaining GR graphs. Take the GR graph
in Figure 1 for example. It can be decomposed into two tree-like subgraphs, as shown in
Figure 9. If we can parse the sentence into subgraphs and combine them in a principled
way, we are able to obtain the original GR graph.

Under this premise, we need to develop a principled method to decompose a com-
plex structure into simple structures, which allows us to generate data to train simple
solvers. We also need to develop a principled method to integrate partial structures,


[Figure 9 appears here: a decomposition of the GR graph in Figure 1 into two tree-like subgraphs, drawn above and below the sentence, with arcs labeled root, subj, comp, temp, prt, obj, nmod, relative, and subj[inverse].]

Figure 9
A graph decomposition for the GR graph in Figure 1. The two subgraphs are shown on the two
sides of the sentence, respectively. The subgraph on the upper side of the sentence is exactly a
well-formed tree, while the one on the lower side is slightly different. The edge from the word
“文件/document” to “涉及/involve” is tagged “[inverse]” to indicate that the direction of the
edge in the subgraph is in fact opposite to that in the original graph.

which allows us to produce coherent structures as outputs. The techniques we devel-
oped to solve the two problems are demonstrated in the following sections.
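At a high level, the parser that results from this design can be summarized by the sketch below. Every name here is a placeholder standing in for the subgraph submodels and the joint decoder developed later in this section; the sketch only records the control flow, not the released implementation.

```python
# Control-flow sketch of the graph-merging parser; all names are placeholders.

def parse(sentence, subgraph_models, joint_decode):
    """Parse a sentence into a GR graph by building and merging subgraphs."""
    # Step 1: each submodel scores candidate arcs for its own tree-like subgraph.
    arc_scores = [model.score_arcs(sentence) for model in subgraph_models]
    # Step 2: a joint decoder finds subgraphs that are individually well-formed
    # and whose combination yields one coherent GR graph.
    subgraphs = joint_decode(arc_scores)
    # The output graph is the union of the arcs of all subgraphs.
    graph = set()
    for subgraph in subgraphs:
        graph |= set(subgraph)
    return graph
```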

4.2 Decomposing GR Graphs
4.2.1 Graph Decomposition as Optimization. Given a sentence s = w0w1w2 · · · wn of length
n (where w0 denotes the virtual root), a vector y of length n(n + 1) is used to denote
a graph on the sentence. Indices i and j are then assigned to index the elements in
the vector, y(i, j) ∈ {0, 1}, denoting whether there is an arc from wi to wj (0 ≤ i ≤ n,
1 ≤ j ≤ n).

Given a graph y, there may be m subgraphs y1, . . . , ym, each of which belongs to a
specific class of graphs Gk (k = 1, 2, · · · , m). Each class should allow efficient construction.
For example, we may need a subgraph to be a tree or a non-crossing dependency
graph. The combination of all yk gives enough information to construct y. Furthermore,
the graph decomposition procedure is utilized to generate training data for building
submodels. Therefore, we hope each subgraph yk is informative enough to train a good
scoring model. To do so, for each yk, we define a score function sk that indicates the
“goodness” of yk. Integrating all ideas, we can formalize graph decomposition as an
optimization problem,

    max.  Σk sk(yk)
    s.t.  yk belongs to Gk
          Σk yk(i, j) ≥ y(i, j), ∀i, j

The last condition in this optimization problem ensures that all edges in y appear in at
least one subgraph.

For a specific graph decomposition task, we should define good score functions sk
and graph classes Gk, according to key properties of the target structure y.
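As a concrete illustration of this encoding, the sketch below (our own helper functions, with the vector stored as a matrix) encodes a graph over w0 · · · wn as 0/1 entries y[i, j] and checks the covering constraint Σk yk(i, j) ≥ y(i, j) for a candidate decomposition.

```python
import numpy as np

def to_matrix(arcs, n):
    """Encode a graph over w0..wn as y with y[i, j] = 1 iff there is an arc i -> j."""
    y = np.zeros((n + 1, n + 1), dtype=int)
    for i, j in arcs:
        y[i, j] = 1
    return y

def covers(subgraphs, y):
    """Check the constraint sum_k y_k(i, j) >= y(i, j) for all i, j."""
    total = sum(subgraphs)
    return bool(np.all(total >= y))

n = 4
y = to_matrix({(0, 2), (2, 1), (2, 4), (3, 4)}, n)
y1 = to_matrix({(0, 2), (2, 1), (2, 4)}, n)
y2 = to_matrix({(3, 4)}, n)
print(covers([y1, y2], y))   # True: every arc of y appears in some subgraph
```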

4.2.2 Decomposing GR Graphs into Tree-Like Subgraphs. One key property of GR graphs
is their reachability: Every node is either reachable from a unique root or is, by itself,


an independent connected component. This property allows a GR graph to be decomposed
into a limited number of tree-like subgraphs. By tree-like, we mean that if we
treat a graph on a sentence as undirected, it is either a tree, or a subgraph of some tree
on the sentence. The advantage of tree-like subgraphs is that they can be effectively
built by adapting data-driven tree parsing techniques. Take the sentence in Figure 1,
Par exemple. For every word, there is at least one path linking the virtual root and the
target word. Furthermore, we can decompose the graph into two tree-like subgraphs, as
shown in Figure 9. In such a decomposition, one subgraph is exactly a tree, and the other
is very close to a tree.
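The reachability property itself is easy to verify; the following sketch (a helper written for this article, not taken from the released code) checks that every word touched by at least one arc is reachable from the virtual root w0 along directed arcs.

```python
from collections import defaultdict, deque

def reachable_from_root(arcs, n):
    """Check the reachability property of a GR graph over w0..wn:
    every non-isolated node must be reachable from the virtual root w0."""
    out = defaultdict(list)
    touched = set()
    for i, j in arcs:
        out[i].append(j)
        touched.update((i, j))
    seen, queue = {0}, deque([0])
    while queue:
        u = queue.popleft()
        for v in out[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return all(w in seen for w in touched if w != 0)

# Toy example: 0 -> 2, 2 -> 1, 2 -> 3; word 4 has no arcs and may stay isolated.
print(reachable_from_root({(0, 2), (2, 1), (2, 3)}, 4))   # True
```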

In practice, we restrict the number of subgraphs to 3. The reasoning is that we use
one tree to capture long-distance information and the other two to capture coordination
information.7 In other words, we decompose each given graph y into three tree-like
subgraphs g1, g2, and g3, where each subgraph carries important information of the graph
and the three together cover all edges in y. The optimization problem can be written as

    max.  s1(g1) + s2(g2) + s3(g3)
    s.t.  g1, g2, g3 are tree-like
          g1(i, j) + g2(i, j) + g3(i, j) ≥ y(i, j), ∀i, j

Scoring a Subgraph. We score a subgraph in a first-order, arc-factored way, which first
scores the edges separately and then adds up the scores. Formally, the score function
is sk(g) = Σ ωk(i, j)g(i, j) (k = 1, 2, 3), where ωk(i, j) is the score of the edge from i to j.
Under this score function, we can use the Maximum Spanning Tree (MST) algorithm
(Chu and Liu 1965; Edmonds 1967; Eisner 1996) to decode the tree-like subgraph with
the highest score.

After the score function is defined, extracting a subgraph from a GR graph works in
the following way: We first assign heuristic weights ωk(i, j) (1 ≤ i, j ≤ n) to the potential
edges between all the pairs of words, then compute a best projective tree gk using
Eisner's algorithm (Eisner 1996):

    gk = arg maxg sk(g) = arg maxg Σ ωk(i, j)g(i, j).

gk is not exactly a subgraph of y, because there may be some edges in the tree but
not in the graph. To guarantee a meaningful subgraph of the original graph, we add
labels to the edges in trees to encode the necessary information. We label gk(i, j) with the
original label if y(i, j) = 1; with the original label appended by “∼R” if y(j, i) = 1; and with
“None” otherwise. With this labeling, we can define a function t2g to transform the extracted
trees into tree-like graphs. t2g(gk) is not necessarily the same as the original graph y, but
it must be a subgraph of it.
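To make the labeling rule and the t2g transformation concrete, here is a small sketch with our own data structures: the original graph is a dictionary from arcs to labels, the extracted tree is a list of arcs, and t2g recovers a genuine subgraph by dropping "None" edges and undoing the "~R" reversals.

```python
def relabel(tree_arcs, graph):
    """Label extracted tree arcs so that t2g can recover a subgraph of `graph`.
    `graph` maps (i, j) arcs to the original GR labels."""
    labeled = {}
    for i, j in tree_arcs:
        if (i, j) in graph:
            labeled[(i, j)] = graph[(i, j)]            # same direction as the graph
        elif (j, i) in graph:
            labeled[(i, j)] = graph[(j, i)] + "~R"     # reversed edge
        else:
            labeled[(i, j)] = "None"                   # edge absent from the graph
    return labeled

def t2g(labeled_tree):
    """Turn a labeled tree back into a tree-like subgraph of the original graph."""
    subgraph = {}
    for (i, j), label in labeled_tree.items():
        if label == "None":
            continue
        if label.endswith("~R"):
            subgraph[(j, i)] = label[:-2]
        else:
            subgraph[(i, j)] = label
    return subgraph

# Toy example: the tree keeps (4, 7) but reverses (7, 11) relative to the graph.
graph = {(4, 7): "obj", (11, 7): "relative"}
tree = [(4, 7), (7, 11)]
print(t2g(relabel(tree, graph)))   # {(4, 7): 'obj', (11, 7): 'relative'}
```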

Three Variations of Scoring. With different weight assignments, different trees can be
extracted from a graph, obtaining different subgraphs. We devise three variations of

7 In this article, we employ projective parsers. The minimal number of subgraphs is related to the
pagenumber of GR graphs. The pagenumber of 90.96% of GR graphs is smaller than or equal to 2, while
the pagenumber of 98.18% of GR graphs is at most 3. That means 3 projective trees are perhaps good enough
to handle Chinese sentences, but 2 projective trees are not. According to the empirical results in Table 3, using
three projective trees can handle 99.55% of GR arcs. Therefore, we think three is suitable for our problem.


weight assignment: ω1, ω2, and ω3. Each of the ω’s consists of two parts. One is shared
by all, denoted by S, and the other is different for each, denoted by V. Formally,
ωk(i, j) = S(i, j) + Vk(i, j) (k = 1, 2, 3 and 1 ≤ i, j ≤ n).

Given a graph y, S is defined as S(i, j) = S1(i, j) + S2(i, j) + S3(i, j) + S4(i, j), where

    S1(i, j) = w1 if y(i, j) = 1 or y(j, i) = 1, and 0 otherwise    (1)
    S2(i, j) = w2 if y(i, j) = 1, and 0 otherwise                   (2)
    S3(i, j) = w3 (n − |i − j|)                                     (3)
    S4(i, j) = w4 (n − lp(i, j))                                    (4)

In the definitions above, w1, w2, w3, and w4 are coefficients, satisfying w1 ≫ w2 ≫
w3, and lp is a function of i and j. lp(i, j) is the length of the shortest path from i to j such that either
i is a child of an ancestor of j or j is a child of an ancestor of i. That is to say, the paths
are in the form i ← n1 ← · · · ← nk → j or i ← n1 → · · · → nk → j. If no such path exists,
then lp(i, j) = n. The reasoning behind the design is illustrated below.

S1 indicates whether there is an edge between i and j, and it is meant for optimal effect;
S2 indicates whether the edge is from i to j, and we want the edge with the correct direction to be more likely to be selected;
S3 indicates the distance between i and j, and we prefer the edge with short distance because it is easier to predict;
S4 indicates the length of certain types of path between i and j that reflect c-commanding relationships, and the coefficient remains to be tuned (see the sketch below).
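A minimal sketch of the shared weight S follows; the arc set, the placeholder lp function, and the particular coefficient values are ours and only illustrate the w1 ≫ w2 ≫ w3 ordering, not the tuned values.

```python
def shared_weight(i, j, arcs, n, lp, w=(1000.0, 100.0, 1.0, 1.0)):
    """S(i, j) = S1 + S2 + S3 + S4 for a candidate edge (i, j).
    `arcs` is the set of directed arcs of the GR graph, `lp` the path-length
    function defined above; the coefficients here are illustrative only."""
    w1, w2, w3, w4 = w
    s1 = w1 if (i, j) in arcs or (j, i) in arcs else 0.0  # an edge exists in either direction
    s2 = w2 if (i, j) in arcs else 0.0                    # the direction i -> j is correct
    s3 = w3 * (n - abs(i - j))                            # prefer short-distance edges
    s4 = w4 * (n - lp(i, j))                              # prefer c-commanding-like paths
    return s1 + s2 + s3 + s4

# Example with a trivial lp that always returns n (no suitable path known).
n = 12
arcs = {(4, 1), (4, 12)}
print(shared_weight(4, 1, arcs, n, lambda i, j: n))   # 1000 + 100 + 9 + 0 = 1109.0
```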

The score V is meant to capture different information from the GR graph. In GR graphs, we have an additional piece of information (denoted as "*ldd" in Figure 1) for long-distance dependency edges. Moreover, we notice that conjunction is another important structure, which can be derived from the GR graph. Assume that we tag the edges relating to conjunctions with "*cjt." The three variation scores, that is, V1, V2, and V3, reflect the long-distance and conjunction information in different ways.

V1. First, for edges y(i, j) whose label is tagged with *ldd, we assign V1(i, j) = d.8 Whenever we come across a parent p with a set of conjunction children cjt1, cjt2, · · · , cjtn, we look for the rightmost child gc1r of the leftmost child in conjunction cjt1, and add d to each of V1(p, cjt1) and V1(cjt1, gc1r). The edges in conjunction to which additional d's are added are shown in blue in Figure 10.

V2. Different from V1, for edges y(i, j) whose label is tagged with *ldd, we assign V2(j, i) = d. Then for each conjunction structure with a parent p and a set of conjunction children cjt1, cjt2, · · · , cjtn, we find the leftmost child gcnl of the rightmost child in
8 d is a coefficient to be tuned on validation data.

conjunction cjtn, and add d to each of V2(p, cjtn) and V2(cjtn, gcnl). The concerned edges in conjunction are shown in green in Figure 10.

Figure 10
Examples to illustrate the additional weights. (The original figure draws a parent wp with conjunction children and grandchildren and marks the *ldd and *cjt arcs that receive the additional weight d.)

V3. We do not assign d’s to the edges with tag *ldd. For each conjunction with parent
p and conjunction children cjt1, cjt2, · · · , cjtn, we add d to V3(p, cjt1), V3(p, cjt2), · · · , et
V3(p, cjtn).

4.2.3 Lagrangian Relaxation with Approximation. As soon as we identify three trees g1, g2, and g3, we have three subgraphs t2g(g1), t2g(g2), and t2g(g3). As stated above, each edge in a graph y needs to be covered by at least one subgraph, and the goal is to maximize the sum of the edge weights of all trees. Note that the inequality in the constrained optimization problem above can be replaced by a maximization, written as

max. s1(g1) + s2(g2) + s3(g3)
s.t. g1, g2, g3 are trees
     max{t2g(g1)(i, j), t2g(g2)(i, j), t2g(g3)(i, j)} = y(i, j), ∀i, j

where sk(gk) = Σi,j ωk(i, j) gk(i, j).

Let gm = max{t2g(g1), t2g(g2), t2g(g3)}, where by max{g1, g2, g3} we mean the pointwise maximum of the three vectors. The Lagrangian of the problem is

L(g1, g2, g3; u) = s1(g1) + s2(g2) + s3(g3) + u⊤(gm − y)

where u is the Lagrangian multiplier.

Then the dual is

L(u) = maxg1,g2,g3 L(g1, g2, g3; u)
     = maxg1 (s1(g1) + (1/3) u⊤gm) + maxg2 (s2(g2) + (1/3) u⊤gm) + maxg3 (s3(g3) + (1/3) u⊤gm) − u⊤y

According to the duality principle, we can find the optimal solution to the problem if we can find minu L(u). However, it is very hard to compute L(u), not to mention minu L(u). The challenge is that gm must be consistent across the three maximizations.

Initialization: set u(0) to 0
for k = 0 to K:
    g1 ← arg maxg1 s1(g1) + u(k)⊤g1
    g2 ← arg maxg2 s2(g2) + u(k)⊤g2
    g3 ← arg maxg3 s3(g3) + u(k)⊤g3
    if max{g1, g2, g3} = y then
        return g1, g2, g3
    u(k+1) ← u(k) − α(k)(max{g1, g2, g3} − y)
return g1, g2, g3

Figure 11
The Tree Extraction Algorithm.

The idea is to separate the overall maximization into three maximization problems by approximation. It is observed that g1, g2, and g3 are very close to gm, so we can approximate L(u) by

L′(u) = maxg1,g2,g3 L(g1, g2, g3; u)
      = maxg1 (s1(g1) + (1/3) u⊤g1) + maxg2 (s2(g2) + (1/3) u⊤g2) + maxg3 (s3(g3) + (1/3) u⊤g3) − u⊤y

In this case, the three maximization problems can be decoded separately, and we can try to find the optimal u using the subgradient method.

4.2.4 The Algorithm. Figure 11 gives the tree decomposition algorithm, in which a subgradient method is used to find minu L′(u) iteratively, and K is the maximum number of iterations. In each iteration, we first compute g1, g2, and g3 to evaluate L′(u), then update u, until the graph is covered by the subgraphs. The coefficients 1/3 can be merged into the step sizes α(k), so we omit them. The three separate problems gk ← arg maxgk sk(gk) + u⊤gk (k = 1, 2, 3) can be solved using Eisner's algorithm (Eisner 1996), similar to solving arg maxgk sk(gk). Intuitively, the Lagrangian multiplier u in our algorithm can be regarded as additional weights for the score function. The update of u increases the weights of the edges that are not covered by any tree-like subgraph, so they are more likely to be selected in the next iteration. A sketch of this loop is given below.
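The following minimal Python/NumPy sketch illustrates the loop of Figure 11 under simplifying assumptions (unlabeled 0/1 matrices; `best_tree` stands for an Eisner-style projective decoder and is assumed, not shown; it is not the released implementation):

import numpy as np

def decompose(y, omegas, best_tree, K=50, alpha=0.5):
    """y: n-by-n 0/1 matrix of the GR graph; omegas: three n-by-n weight matrices;
    best_tree(w): returns the 0/1 adjacency matrix of the best projective tree under w."""
    u = np.zeros_like(y, dtype=float)
    trees = [best_tree(w) for w in omegas]
    for _ in range(K):
        trees = [best_tree(w + u) for w in omegas]
        covered = np.maximum.reduce(trees)       # pointwise max of the three trees
        if np.all(covered >= y):                 # every edge of y is covered
            break
        u = u - alpha * (covered - y)            # subgradient step: boost uncovered edges
    return trees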

4.3 Graph Merging

As explained above, the extraction algorithm gives three classes of trees for each graph.
The algorithm is applied to the graph training set to deliver three training tree sets. After that, three parsing models can be trained on the three tree sets. The parsers used in this study to train models and parse trees include Mate (Bohnet 2010), a second-order graph-based dependency parser, and our implementation of the first-order factorization model proposed in Kiperwasser and Goldberg (2016).

If the scores used by the three models are f1, f2, and f3, respectively, then the parsers can find the trees with the highest scores for a sentence. That solves the following optimization problems: arg maxg1 f1(g1), arg maxg2 f2(g2), and arg maxg3 f3(g3). We can
parse a given sentence with the three models, obtain three trees, and then transform them into subgraphs. We combine them together to obtain the graph parse of the sentence by putting all the edges in the three subgraphs together. That is to say, graph y = max{t2g(g1), t2g(g2), t2g(g3)}. This process is called simple merging.

4.3.1 Capturing the Hidden Consistency. However, the simple merging process lacks the
consistency of extracting the three trees from the same graph, thus losing some impor-
tant information. More specifically, when we decompose a graph into three subgraphs,
some edges tend to appear in certain classes of subgraphs at the same time, and this
information is lost in the simple merging process. It is more desirable to retain the co-
occurrence relationship of the edges when doing parsing and merging. To retain the
hidden consistency, instead of decoding the three models separately, we must do joint
decoding.

In order to capture the hidden consistency, we add consistency tags to the labels of the extracted trees to represent the co-occurrence. The basic idea is to use additional tags to encode the relationship of the edges in the three trees. The tag set is T = {0, 1, 2, 3, 4, 5, 6}. Given a tag t ∈ T, the bits t&1, t&2, and t&4 denote whether the edge is contained in g1, g2, and g3, respectively, where "&" is the bitwise AND operator. Since we do not need to consider the first bit of the tags of edges in g1, the second bit in g2, and the third bit in g3, we always assign 0 to these bits. For example, if y(i, j) = 1, g1(i, j) = 1, g2(j, i) = 1, g3(i, j) = 0, and g3(j, i) = 0, we tag g1(i, j) as 2 and g2(j, i) as 1.

When it comes to parsing, it is important to obtain labels with consistency information. Our goal is to guarantee that the tags on the edges of the parse trees for the same sentence are consistent throughout graph merging. Because of the consistency tags, we index the graph and tree vector representations using three indices for convenience. Thus, g(i, j, t) denotes whether there is an edge from word wi to word wj with tag t in graph g.

The joint decoding problem can be written as a constrained optimization problem as

max. f1(g1) + f2(g2) + f3(g3)
s.t. g′1(i, j, 2) + g′1(i, j, 6) ≤ Σt g′2(i, j, t)
     g′1(i, j, 4) + g′1(i, j, 6) ≤ Σt g′3(i, j, t)
     g′2(i, j, 1) + g′2(i, j, 5) ≤ Σt g′1(i, j, t)
     g′2(i, j, 4) + g′2(i, j, 5) ≤ Σt g′3(i, j, t)
     g′3(i, j, 1) + g′3(i, j, 3) ≤ Σt g′1(i, j, t)
     g′3(i, j, 2) + g′3(i, j, 3) ≤ Σt g′2(i, j, t)
     ∀i, j

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi
/
c
o

je
je
/

je

un
r
t
je
c
e

p
d

F
/

/

/

/

4
5
1
9
5
1
8
0
9
7
3
8
/
c
o

je
je

_
un
_
0
0
3
4
3
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

where g′k = t2g(gk) (k = 1, 2, 3).

The inequality constraints in the problem are the consistency constraints. Each of them gives the constraint between two classes of trees. For example, the first inequality says that an edge in g1 with tag t&2 ≠ 0 exists only when the same edge exists in g2. If all of these constraints are satisfied, the subgraphs achieve the desired consistency.

4.3.2 Lagrangian Relaxation with Approximation. To solve the constrained optimization
problem mentioned above, we perform some transformations and then apply the
Lagrangian Relaxation to it with approximation.

Let aaa12(je, j) = ggg1(je, j, 2) + ggg1(je, j, 6); then the first constraint can be written as an equity

constraint

ggg1(:, :, 2) + ggg1(:, :, 6) = aaa12. (

(cid:88)

t

ggg2(:, :, t))

where ":" takes out all the elements in the corresponding dimension, and ".∗" denotes pointwise multiplication. The other inequality constraints can be rewritten in the same way. If we take a12, a13, · · · , a32 as constants, then all the constraints are linear. Thus, the constraints can be written as

A1g1 + A2g2 + A3g3 = 0

where A1, A2, and A3 are matrices that can be constructed from a12, a13, · · · , a32.

The Lagrangian of the optimization problem is

L(g1, g2, g3; u) = f1(g1) + f2(g2) + f3(g3) + u⊤(A1g1 + A2g2 + A3g3)

where u is the Lagrangian multiplier. Then the dual is

L(toi) = max
ggg1,ggg2,ggg3

L(ggg1, ggg2, ggg3; toi)

= max
ggg1

( f1(ggg1) + toi(cid:62)A1ggg1) + maximum
ggg2

(f2(ggg2) + toi(cid:62)A2ggg2) + maximum
ggg3

(f3(ggg3) + toi(cid:62)A3ggg3)

Again, we use the subgradient method to minimize L(u). During the derivation, a12, a13, · · · , a32 are taken as constants, but unfortunately they are not. We propose an approximation for the a's in each iteration: using the a's obtained in the previous iteration instead. This is a reasonable approximation given that the u's in two consecutive iterations are similar, and so are the a's.

4.3.3 The Algorithm. The pseudocode of our algorithm is shown in Figure 12. It is well known that the score functions f1, f2, and f3 each consist of scores that are first-order and higher-order. So they can be written as

fk(g) = s^1st_k(g) + s^h_k(g)

Initialization: set u(0), A1, A2, A3 to 0
for k = 0 to K:
    g1 ← arg maxg1 f1(g1) + u(k)⊤A1g1
    g2 ← arg maxg2 f2(g2) + u(k)⊤A2g2
    g3 ← arg maxg3 f3(g3) + u(k)⊤A3g3
    UPDATE A1, A2, A3
    if A1g1 + A2g2 + A3g3 = 0 then
        return g1, g2, g3
    u(k+1) ← u(k) − α(k)(A1g1 + A2g2 + A3g3)
return g1, g2, g3

Figure 12
The Joint Decoding Algorithm.

where s^1st_k(g) = Σ ωk(i, j) g(i, j) (k = 1, 2, 3). With this property, each individual problem gk ← arg maxgk fk(gk) + u⊤Akgk can be decoded easily, with modifications to the first-order weights of the edges in the three models. Specifically, let wk = u⊤Ak; then we can modify the ωk in sk to ω′k, such that ω′k(i, j, t) = ωk(i, j, t) + wk(i, j, t) + wk(j, i, t).

The update of w1, w2, w3 can be understood in an intuitive way. Consider the following situation: one of the constraints, say the first one for edge y(i, j), is not satisfied, without loss of generality. We know g1(i, j) is tagged to represent that g2(i, j) = 1, but that is not the case. So we increase the weight of that edge with all kinds of tags in g2, and decrease the weight of the edge with the tag representing g2(i, j) = 1 in g1. After the update of the weights, consistency is more likely to be achieved.

4.3.4 Labeled Parsing. For the sake of formal concision, we illustrate our algorithms
omitting the labels. It is straightforward to extend the algorithms to labeled parsing.
In the joint decoding algorithm, we just need to extend the weights w1, w2, w3 for every
label that appears in the three tree sets, and the algorithm can be deduced similarly.

4.4 Global Linear Model Based Scorer

A majority of dependency parsers have explored the framework of global linear mod-
els with encouraging success (Nivre, Hall, and Nilsson 2004; McDonald et al. 2005;
McDonald, Crammer, and Pereira 2005; Torres Martins, Smith, and Xing 2009; Koo et al.
2010). The dependency parsing problem can be formalized as a structured linear model
as follows:

g*(s) = arg maxg∈T(s) SCORE(s, g) = arg maxg∈T(s) θ⊤Φ(s, g)   (5)

In brief, given a sentence s, its parse g*(s) is computed by searching for the highest-scored dependency graph in the set of compatible trees T(s). Scores, namely SCORE(s, g), are assigned using a linear model, where Φ(s, g) is a feature-vector representation of the event that tree g is the analysis of sentence s, and θ is the parameter vector containing the associated weights. In general, performing a direct maximization over the set T(s) is infeasible, and a common solution used in many parsing approaches is to introduce a part-wise factorization:

SCORE(s, g) = Σp∈PART(g) SCOREPART(s, p)   (6)

Considering linear models, we can define a factorization model as follows:

θ⊤Φ(s, g) = Σp∈PART(g) θ⊤φ(s, p)

In the above, we have assumed that the output g can be factored into a set of parts p through a function PART, each of which represents a small substructure of g. For example, g might be factored into the set of its component bilexical dependencies. Each part can be evaluated using a part-wise feature-vector mapping φ(s, p). The factorization establishes implicit independence assumptions among parts, while keeping the search for the best result efficient.

Score(je, j)

Score(je, j + 1)

MLP

MLP

Bi-LSTMs

Embedding

. . .

. . .

Bi-LSTMs

Bi-LSTMs

Embedding

Embedding

. . .

. . .

je

j

j + 1

浦东 NR

实行 VV

了 AS

Chiffre 13
The architecture of the neural network model employed.

4.5 Bi-LSTM Based Scorer

In the above architecture, we can assign the scores, namely SCORE(s, g) in (5), using neural network models. A simple yet effective design is selected from a rich set of choices. Following Kiperwasser and Goldberg's (2016) experience, we employ a bidirectional LSTM (Bi-LSTM)–based neural model to perform data-driven parsing. A vector is associated with each word and POS tag to transform them into continuous and dense representations. The concatenation of the word embedding and POS-tag embedding of each word in a given sentence is used as the input of the Bi-LSTMs to extract context-related feature vectors ri. The two feature vectors of each word pair are scored with a nonlinear transformation:

SCOREPART(g, i, j) = W2 · ReLU(W1,1 · ri + W1,2 · rj + b)   (7)

Figure 13 shows the architecture of this design.

We can see that the local score function explicitly utilizes the word positions of the head and the dependent. It is similar to the first-order factorization defined in the linear model. We use the first-order Eisner algorithm (Eisner 1996) to obtain coherent projective subtrees.
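A minimal NumPy sketch of the scorer in Equation (7) is given below. It is an illustration only: the context vectors ri are assumed to come from Bi-LSTMs over word/POS embeddings (here replaced by random placeholders), and the parameter shapes are assumptions, not those of the trained model.

import numpy as np

rng = np.random.default_rng(0)
d, h = 200, 100                      # context-vector and hidden sizes (assumed)
W1_head, W1_dep = rng.normal(size=(h, d)), rng.normal(size=(h, d))
b, W2 = rng.normal(size=h), rng.normal(size=h)

def score_arc(r_head, r_dep):
    hidden = np.maximum(0.0, W1_head @ r_head + W1_dep @ r_dep + b)   # ReLU
    return float(W2 @ hidden)

r = rng.normal(size=(5, d))          # stand-in Bi-LSTM outputs for a 5-word sentence
print(score_arc(r[1], r[3]))         # score of the candidate arc 1 -> 3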

5. Transition-Based Parsing

5.1 Background Notions

To precisely illustrate our transition-based parser, we employ traditional graph-
theoretic notations to define a transition system. A dependency graph G = (V, UN) is a

labeled directed graph in the standard graph-theoretic sense and consists of nodes, V,
and arcs, UN, such that for sentence x = w0w1 . . . wn and label set R the following holds:

V = {0, 1, . . . , n},

A ⊆ V × V × R.

The vertex −1 denotes a virtual root. The arc set A represents the labeled dependency
relations of the particular analysis G. Specifically, an arc (wi, wj, r) ∈ A represents a
dependency relation from head wi to dependent wj labeled with relation type r. A dependency graph G is thus a set of labeled dependency relations between the words of x.

Following Nivre (2008), we define a transition system for dependency parsing as a

quadruple S = (C, T, cs, Ct), où

1. C is a set of configurations, each of which contains a buffer β of (remaining) words and a set A of arcs,

2. T is a set of transitions, each of which is a (partial) function t : C ↦ C,

3. cs is an initialization function, mapping a sentence x to a configuration with β = [0, . . . , n], and

4. Ct ⊆ C is a set of terminal configurations.

Given a sentence x = w1, . . . , wn and a graph G = (V, A) on it, if there is a sequence of transitions t1, . . . , tm and a sequence of configurations c0, . . . , cm such that c0 = cs(x), ti(ci−1) = ci (i = 1, . . . , m), cm ∈ Ct, and Acm = A, we say the sequence of transitions is an oracle sequence, and we define Āci = A − Aci as the arcs still to be built in ci. In a typical transition-based parsing process, the input words are put into a queue and partially built structures are organized by one or more memory modules. A set of transition actions is performed sequentially to consume words from the queue and update the partial parsing results, organized by the stack.

5.2 A List-Based Transition System

Nivre (2008) proposed a list-based algorithm to produce non-projective dependency trees. This algorithm essentially implements a very simple idea conceptually introduced by Covington (2001): making use of a list to store partially processed tokens. It is straightforward to use this strategy to handle any kind of directed graph. In this work, we use such a system, SL. In fact, it is simpler to produce a graph than a tree: the main difference between Nivre's (2008) tree-parsing system and SL is that at each transition step, SL does not need to check the multi-headed condition. This appears to be simpler and more efficient.

5.2.1 The System. In SL = (C, T, cs, Ct), we take C to be the set of all quadruples c = (λ, λ′, β, A), where λ and λ′ are lists of nodes. The initial configuration cs(x) is ([ ], [ ], [0, . . . , n], { }), and a terminal configuration ct is of the form (λ, λ′, [ ], A) (for any

Transitions
LEFT-ARCl:   (λ1|i, λ2, j|β, A) ⇒ (λ1|i, λ2, j|β, A ∪ {(j, l, i)})
RIGHT-ARCl:  (λ1|i, λ2, j|β, A) ⇒ (λ1|i, λ2, j|β, A ∪ {(i, l, j)})
SHIFT:       (λ1, λ2, j|β, A) ⇒ (λ1.λ2|j, [ ], β, A)
             (the operation λ1.λ2 concatenates the two lists)
NO-ARC:      (λ1|i, λ2, β, A) ⇒ (λ1, i|λ2, β, A)
*SELF-ARCr:  (λ|i, λ′, β, A) ⇒ (λ|i, λ′, β, A ∪ {(i, r, i)})

Figure 14
Transitions of the list-based system.

list and any arc set). T contains five types of transitions, shown in Figure 14 and illustrated as follows:

LEFT-ARCr updates a configuration by adding (j, r, i) to A, where i is the top element of the list λ, j is the front of the buffer, and r is the dependency relation.

RIGHT-ARCr updates a configuration similarly by adding (i, r, j) to A.

SHIFT concatenates λ′ to λ, clears λ′, and then moves the front element of β to the current left list.

NO-ARC removes the rightmost node from λ and adds it at the leftmost position of λ′.

5.2.2 Theoretical Analysis. Theorem 1
SL is sound and complete9 with respect to the class of directed graphs without self-loops.

Proof 1
The soundness of SL is relatively trivial. The completeness of SL is obvious from the construction of the oracle sequence as follows: at each step, we first construct all the arcs in Āci that link the nodes in λci to the front node of βci by applying LEFT-ARCr, RIGHT-ARCr, and NO-ARC. If no other transition is allowed, SHIFT is applied. ∎

5.2.3 Extension. It is easy to extend our system to generate arbitrary directed graphs by
adding a new transition SELF-ARC:

SELF-ARC adds an arc from the top element of λ to itself, but does not
update any list nor the buffer.

9 The notations of soundness and completeness are adopted from Nivre (2008). Let S be a transition system

for dependency parsing.

S is sound for a class G of dependency graphs if and only if, for every sentence x and every
transition sequence c0, . . . , cm for x in S, the parse Gcm ∈ G.
S is complete for a class G of dependency graphs if and only if, for every sentence x and every
dependency graph Gx for x in G, there is a transition sequence c0, . . . , cm for x in S such that
Gcm = Gx.

Transition       λ          λ′        β          A
—                [-1]       []        [0,…,11]   {}
SHIFT            [-1,0]     []        [1,…,11]   {}
SHIFT            [-1,0,1]   []        [2,…,11]   {}
LEFT-ARCCOMP     [-1,0,1]   []        [2,…,11]   A1 = {(2, 1, COMP)}
SHIFT            [-1,…,2]   []        [3,…,11]   A1
LEFT-ARCTEMP     [-1,…,2]   []        [3,…,11]   A2 = A1 ∪ {(3, 2, TEMP)}
NO-ARC           [-1,0,1]   [2]       [3,…,11]   A2
NO-ARC           [-1,0]     [1,2]     [3,…,11]   A2
LEFT-ARCSUBJ     [-1,0]     [1,2]     [3,…,11]   A3 = A2 ∪ {(3, 0, SUBJ)}
SHIFT            [-1,…,3]   []        [4,…,11]   A3
NO-ARC           [-1,…,2]   [3]       [4,…,11]   A3
LEFT-ARCTEMP     [-1,…,2]   [3]       [4,…,11]   A4 = A3 ∪ {(4, 2, TEMP)}
NO-ARC           [-1,0,1]   [2,3]     [4,…,11]   A4
NO-ARC           [-1,0]     [1,2,3]   [4,…,11]   A4
LEFT-ARCSUBJ     [-1,0]     [1,2,3]   [4,…,11]   A5 = A4 ∪ {(4, 0, SUBJ)}
SHIFT            [-1,…,4]   []        [5,…,11]   A5
RIGHT-ARCPRT     [-1,…,4]   []        [5,…,11]   A6 = A5 ∪ {(4, 5, PRT)}
NO-ARC           [-1,…,3]   [4]       [5,…,11]   A6
RIGHT-ARCPRT     [-1,…,3]   [4]       [5,…,11]   A7 = A6 ∪ {(3, 5, PRT)}
SHIFT            [-1,…,5]   []        [6,…,11]   A7

Figure 15
A prefix of the oracle transition sequence for the running example. (The first row is the initial configuration; each subsequent row shows the configuration after the transition in its first column.)

Linguistic dependencies usually exclude self-loops, and therefore the basic list-based system is satisfactory in most cases. We use the basic list-based system, namely SL, as the core engine of our parser.

5.2.4 An Example. Figure 15 shows the first transitions needed to parse the running example of Figure 1. It can be seen from this example that the key step to produce crossing arcs is to temporarily move nodes that block non-adjacent nodes to the secondary memory module, namely λ′. Another key property of the oracle is building arcs as soon as possible to avoid further complication.

5.3 Bi-LSTM Based Scorer

The neural model, which acts as a classifier of actions in this transition system, is similar to the previous neural models. The Bi-LSTMs play the same role, but the feature vectors of the front of the buffer and the top of list λ are used to assign scores to actions. The structure is shown in Figure 16.

Scores = W2 · ReLU(W1 · (rλ-top ⊕ rbuffer-front) + b1) + b2

Two search methods, that is, dynamic oracle (Goldberg and Nivre 2012) and beam search, can be used with this transition system.

Figure 16
The neural network structure for parsing the running sentence. We select the top element of list λ and the front of the buffer as features. (The figure shows the embeddings of the input tokens — e.g., 浦东/NR, 实行/VV, 了/AS — fed into Bi-LSTMs; the vectors of the list top and the buffer front are passed to an MLP that outputs the scores of the actions.)

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi
/
c
o

je
je
/

je

un
r
t
je
c
e

p
d

F
/

/

/

/

4
5
1
9
5
1
8
0
9
7
3
8
/
c
o

je
je

_
un
_
0
0
3
4
3
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

6. Empirical Evaluation

6.1 Experimental Setup

CTB is a segmented, part-of-speech (POS) tagged, and fully bracketed corpus in the constituency formalism, and it is very popularly used to evaluate fundamental NLP tasks, including word segmentation (Sun and Xu 2011), POS tagging (Sun and Uszkoreit 2012), constituent parsing (Wang, Sagae, and Mitamura 2006; Zhang and Clark 2009), and dependency parsing (Zhang and Clark 2008; Huang and Sagae 2010; Li et al. 2011). This corpus was collected during different time periods from different sources with a diverse range of topics. We used CTB 6.0 and defined the training, development, and test sets according to the CoNLL 2009 shared task. Table 3 gives a summary of the data sets for the experiments.

Evaluation on this benchmark data allows us to directly compare our parsers and other parsers in the literature in terms of numeric performance. The measure for comparing two dependency graphs is the precision/recall of bilexical dependencies, which are defined as ⟨wh, wd, l⟩ tuples, where wh is the head, wd is the dependent, and l is the relation. Labeled precision/recall (LP/LR) is the ratio of tuples correctly identified, while the unlabeled metrics (UP/UR) are the corresponding ratios computed regardless of l. The F-score (UF/LF) is the harmonic mean of precision and recall. These measures correspond to attachment scores (LAS/UAS) in dependency tree parsing. To evaluate the ability to recover non-local dependencies, the recall (URNL/LRNL) of such dependencies is reported. We also consider correctness with respect to the whole graph and report unlabeled and labeled complete match (UCM/LCM).
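These metrics are straightforward to compute once both graphs are represented as sets of dependency tuples; a minimal sketch (an illustration, assuming each per-sentence graph is a set of (head, dependent, label) tuples):

def prf(gold, pred):
    """gold, pred: sets of dependency tuples pooled over the corpus."""
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def evaluate(gold_graphs, pred_graphs):
    gold = {(s, h, d, l) for s, g in enumerate(gold_graphs) for (h, d, l) in g}
    pred = {(s, h, d, l) for s, g in enumerate(pred_graphs) for (h, d, l) in g}
    lp, lr, lf = prf(gold, pred)                                     # labeled metrics
    up, ur, uf = prf({t[:3] for t in gold}, {t[:3] for t in pred})   # drop the label
    lcm = sum(g == p for g, p in zip(gold_graphs, pred_graphs)) / len(gold_graphs)
    return dict(UP=up, UR=ur, UF=uf, LP=lp, LR=lr, LF=lf, LCM=lcm)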

Table 3
Data sets for experiments. Columns "Training," "Development," and "Test" give the number of sentences/words/dependencies in the training, development, and test sets.

              Training   Development   Test
#Sentence     22,277     1,762         2,556
#Word         609,060    49,620        73,153
#Dependency   556,551    45,724        67,635

6.2 Main Results of the Graph-Based Parsing
6.2.1 Results of Graph Decomposition. Table 4 shows the results of graph decomposition on the training set. If we use simple decomposition, for example, directly extracting three trees from a graph, we get three subgraphs. On the training set, each of the subgraphs covers around 90% of edges and 30% of sentences. When merged together, they cover almost 97% of edges and more than 70% of sentences. This indicates that the ability of a single tree is limited and that three trees can cover most of the edges. To achieve the best coverage, we need to tune the weights defined in Equations (1–4). This can be done on the development data. When we apply Lagrangian Relaxation to the decomposition process, both the edge coverage and the sentence coverage show a large error reduction, indicating that Lagrangian Relaxation is very effective in the task of decomposition.

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi
/
c
o

je
je
/

je

un
r
t
je
c
e

p
d

F
/

/

/

/

4
5
1
9
5
1
8
0
9
7
3
8
/
c
o

je
je

_
un
_
0
0
3
4
3
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

6.2.2 Main Results of Graph Merging. Table 5 shows the results of graph merging on the
development set when tree parsers based on global linear models are applied. The three
training sets of trees are from the decomposition with Lagrangian Relaxation and the
models are trained from them. In both tables, simple merging (SM) refers to decoding
the three trees for a sentence first, then combining them by putting all the edges together.
As is shown, the merged graph achieves a higher f-score than other single models.
With Lagrangian Relaxation, the performance scores of the merged graph and the three
subgraphs are both improved, due to the capturing of the consistency information.

Table 4
Results of graph decomposition. SD is for Simple Decomposition and LR for Lagrangian Relaxation.

                   Edge     Sentence
SD   Subgraph1     85.52    28.73
     Subgraph2     88.42    28.36
     Subgraph3     90.40    34.37
     Merged        96.93    71.66
LR   Subgraph1     85.66    29.01
     Subgraph2     88.48    28.63
     Subgraph3     90.67    34.72
     Merged        99.55    96.90

Table 5
Accuracies of the graph merging–based parser on the development set. SM is for Simple Merging, and LR for Lagrangian Relaxation. A second-order graph-based parser with a global linear disambiguation model is used for tree parsing.

                  UP     UR     UF     UCM    LP     LR     LF     LCM
SM  Subgraph1     88.63  76.19  81.94  18.09  85.94  73.88  79.46  16.11
    Subgraph2     88.04  78.20  82.83  17.47  85.31  75.77  80.26  15.43
    Subgraph3     88.91  81.12  84.84  20.36  86.57  78.99  82.61  17.30
    Merged        83.23  88.45  85.76  22.97  80.59  85.64  83.04  19.29
LR  Subgraph1     89.76  77.48  83.17  18.60  87.17  75.25  80.77  16.39
    Subgraph2     89.30  79.18  83.93  18.66  86.68  76.85  81.47  16.56
    Subgraph3     89.42  81.55  85.31  20.53  87.09  79.43  83.08  17.81
    Merged        88.07  85.14  86.58  26.32  85.55  82.70  84.10  21.61

When we do simple merging, the recall of each subgraph is much lower than its precision, whereas the merged graph shows the opposite pattern. This is because
the consistency between the three models is not required and the models tend to give
diverse subgraph predictions. When we require high consistency between the three
models, the precision and recall become comparable, and higher f-scores are achieved.
Table 6 shows the results based on the neural tree parsing model. Similar to Table 5,
it is shown that the LR-based graph merging model is effective in improving individual
tree parsers and thus obtains encouraging deep dependency parsing results.

6.3 Main Results of Transition-Based Parsing

Table 7 summarizes the parsing results obtained by the transition-based parser. We com-
pare two strategies, namely, dynamic oracle and beam search, for improving training
and decoding. Dynamic oracle is a very useful training strategy for the improvement
of a transition-based parser, especially for neural models (Goldberg and Nivre 2012;
Kiperwasser and Goldberg 2016). Beam search is a very effective search method that

Table 6
Accuracies of the graph merging–based parser on the development set. A first-order graph-based parser with a neural disambiguation model is used for tree parsing.

                  UP     UR     UF     UCM    LP     LR     LF     LCM
SM  Subgraph1     89.66  78.81  83.88  19.13  87.40  76.82  81.77  17.25
    Subgraph2     89.04  81.31  85.00  19.52  86.82  79.29  82.88  16.97
    Subgraph3     90.50  81.24  85.62  20.20  88.38  79.33  83.61  17.88
    Merged        84.01  90.70  87.23  23.72  81.70  88.20  84.83  20.60
LR  Subgraph1     91.18  79.90  85.17  20.72  88.80  77.82  82.95  18.39
    Subgraph2     90.45  82.29  86.17  20.72  88.16  80.20  83.99  18.16
    Subgraph3     91.29  81.91  86.35  20.66  89.11  79.95  84.28  17.93
    Merged        88.75  87.96  88.36  29.34  86.43  85.66  86.05  25.09

Table 7
Accuracies of the transition-based parser on the development set.

                  UP     UR     UF     UCM    LP     LR     LF     LCM
Dynamic oracle    89.59  85.50  87.49  24.74  87.49  83.49  85.44  21.45
Beam search       84.58  87.38  85.96  20.89  82.17  84.89  83.51  17.99

achieves excellent results for various NLP tasks, for example, machine translation. When coupled with linear models, beam search has been shown to be a useful technique for improving both training and decoding (Zhang and Clark 2011b). However, in the particular neural parsing architecture employed here, beam search performs significantly worse than the dynamic oracle strategy. Nevertheless, do note that beam search and structured learning may be very helpful for neural parsing models of other architectures (Weiss et al. 2015; Andor et al. 2016). Here, the beam size is set to 16.

6.4 Analysis
6.4.1 Precision vs. Recall. A noteworthy fact about the overall performance of the neural transition-based system is that the precision is promising but the recall is low. This difference is consistent with the result obtained by transition-based parsers with linear scoring models (Zhang et al. 2016) and with the result obtained by a shift-reduce CCG parser (Zhang and Clark 2011a). The functor-argument dependencies generated by the CCG parser also have a relatively high precision but considerably lower recall. When building NLP applications (e.g., information extraction systems) upon GR parsing, this property merits attention. A good trade-off between precision and recall may have a great impact on the final results.

The graph merging system coupled with the neural tree parser performs quite differently: its precision and recall are well balanced. Figure 17 and Table 8 present detailed analyses. With respect to dependency length, it is clear that the transition-based parser obtains higher precision, while the graph merging parser obtains higher recall. With respect to the dependency relation, for most types, the precision of the transition-based parser is higher than its recall. Again, we can see a difference for

Figure 17
Labeled precision and recall relative to dependency length. "TR" denotes the transition-based model; "GM" denotes the graph merging model.

Table 8
Accuracies with respect to dependency relations. The results are evaluated on the development data. Only the relation types that appear more than 500 times in the development data are reported.

                       Transition-based       Graph merging
Relation    #          P      R      F        P      R      F
ADV         4,795      89.35  80.15  84.50    87.24  79.96  83.44
AMOD        1,309      88.45  88.31  88.38    89.86  89.38  89.62
AUX         926        88.63  86.72  87.66    88.26  86.07  87.15
COMP        7,930      89.73  86.96  88.33    87.27  90.68  88.94
DET         583        95.59  92.97  94.26    93.48  95.88  94.67
LOC         572        80.29  67.66  73.43    81.04  68.01  73.95
NMOD        8,563      89.45  87.43  88.43    88.09  91.00  89.52
OBJ         5,406      89.66  86.29  87.94    89.94  89.68  89.81
QUANT       897        92.02  92.53  92.27    92.38  94.65  93.50
RELATIVE    1,201      92.06  82.01  86.75    92.12  86.68  89.32
SBJ         6,313      82.15  80.41  81.27    81.50  79.41  80.44
TMP         1,273      78.76  71.64  75.03    79.49  73.68  76.48

the graph merging parser: for most types, the precision is lower than the recall. Note that the overall F-scores of these two systems are roughly equivalent.

6.4.2 Deep vs. Deep. CCG and HPSG parsers also favor dependency-based metrics for evaluation (Clark and Curran 2007b; Miyao and Tsujii 2008). Previous work on Chinese CCG and HPSG parsing unanimously agrees that obtaining deep analyses of Chinese is more challenging (Yu et al. 2011; Tse and Curran 2012). The successful C&C and Enju parsers provide rather inaccurate results for Chinese texts. Although the numbers profiling the quality of deep dependency structures under different formalisms are not directly comparable, all empirical evaluations indicate that the state of the art of deep linguistic processing for Chinese lags far behind.

6.4.3 Impact of POS Tagging. According to our previous study (Sun and Wan 2016), the use of different POS taggers has a great impact on syntactic analysis. This is highly related to a prominent language-specific property of Mandarin Chinese: as an analytic language, Mandarin Chinese lacks morphosyntactic features that explicitly indicate lexical categories. To evaluate the parsing performance in a more realistic setup, we report parsing results based on two different POS taggers introduced in Sun and Wan (2016). Table 9 presents the results. We can see that automatic POS tagging has a great impact on deep dependency parsing. Even when assisted with a state-of-the-art tagger that is highly engineered, the parser still performs rather poorly.

6.5 Improved Results with ELMo

Table 9 indicates the importance of high-quality POS tagging. We take it as a reflection of
the importance of lexical information. Chinese lacks explicit morphosyntactic features,
which makes it hard to infer the grammatical functions from word content only. That
means contextual clues are essential for predicting a dependency relation.

Table 9
Parsing accuracies with different POS taggers. The results are evaluated on the development data. "LGLM" is short for the POS tagger based on a linear-chain global linear model, whereas "SC" denotes the same POS tagger enhanced with structure compilation.

              UP     UR     UF     UCM    LP     LR     LF     LCM
GM  Gold      88.75  87.96  88.36  29.34  86.43  85.66  86.05  25.09
    LGLM      81.99  81.62  81.80  23.50  76.94  76.59  76.77  19.58
    SC        83.65  83.03  83.34  23.78  79.03  78.43  78.73  20.09
TR  Gold      89.59  85.50  87.49  24.74  87.49  83.49  85.44  21.45
    LGLM      83.69  79.22  81.39  19.92  78.92  74.71  76.76  17.03
    SC        85.15  80.73  82.88  20.89  80.72  76.54  78.57  17.76

6.5.1 ELMo: Contextualized Word Embeddings. Quite recently, Peters et al. (2018) proposed ELMo, an unsupervised approach to model words, and especially their syntactic and semantic properties in different linguistic contexts. The key design of ELMo is to train an LSTM-based language model and utilize its hidden vectors as contextualized word embeddings. In particular, every word is assigned a vector according to information from the whole sentence rather than from several neighboring words in a limited context. ELMo has proved very effective in inducing both syntagmatic and paradigmatic knowledge and thus extremely useful for improving various English NLP tasks. Incorporating the word embeddings obtained by ELMo into a parser is very simple: we can take such embeddings as additional input word vectors and keep everything else unchanged.
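Concretely, the contextualized vectors are simply concatenated with the existing input embeddings; the following is a minimal sketch (an illustration only; `elmo_embed` is an assumed stand-in for whatever pretrained bidirectional language model is loaded):

import numpy as np

def build_inputs(words, word_emb, elmo_embed, pos_tags=None, pos_emb=None):
    """Concatenate word embeddings, contextualized vectors, and (optionally) POS-tag
    embeddings for one sentence; the rest of the parser is left unchanged."""
    elmo_vecs = elmo_embed(words)                        # (n, 1024), one vector per token
    parts = [np.stack([word_emb[w] for w in words]), elmo_vecs]
    if pos_tags is not None:
        parts.append(np.stack([pos_emb[t] for t in pos_tags]))
    return np.concatenate(parts, axis=1)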

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi
/
c
o

je
je
/

je

un
r
t
je
c
e

p
d

F
/

/

/

/

4
5
1
9
5
1
8
0
9
7
3
8
/
c
o

6.5.2 Training an ELMo Model for Chinese. In this work, we trained an ELMo model for Chinese and applied it to enhance our GR parsers. The model is trained on Chinese Gigaword, a comprehensive archive of newswire text data that has been acquired over several years by the Linguistic Data Consortium (LDC), choosing the Mandarin news text, that is, Xinhua newswire. This data covers all news published by Xinhua News Agency (the largest news agency in China) from 1991 to 2004, which amounts to about 12 million sentences. Different from English and other Western languages, Chinese is written without explicit word delimiters such as space characters. In our experiments, we employ a supervised segmenter introduced in Sun and Xu (2011) to process the raw texts.

je
je

_
un
_
0
0
3
4
3
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

6.5.3 Main Results. Table 10 presents the results. We concatenate the 1024-dimensional ELMo vectors with the word and (where used) POS-tag embeddings in our neural parsers. The ELMo vectors significantly improve the performance of our parsers, and the gain is more obvious when gold-standard POS tags are removed.

Table 11 reports the graph merging results. When ELMo vectors are added, both the individual and the joint decoders are improved. In addition, Table 12 shows the performance for representative grammatical functions. Note that only the word content and the corresponding ELMo embeddings are utilized in this experiment. This is a realistic setup to evaluate how accurate the GR parsers could be when processing real-world Chinese texts.

Table 10
Parsing accuracies with ELMo. The results are evaluated on the development data. "NoPOS" means training and parsing without POS tags.

                   UP     UR     UF     UCM    LP     LR     LF     LCM
GM  Gold+ELMo      91.01  91.05  91.03  31.93  88.76  88.81  88.79  26.21
    LGLM+ELMo      86.16  86.69  86.42  25.77  80.77  81.27  81.02  20.09
    NoPOS+ELMo     88.86  88.97  88.92  28.93  84.98  85.08  85.03  22.12
TR  Gold+ELMo      86.89  92.92  89.80  26.38  84.65  90.53  87.49  22.35
    LGLM+ELMo      82.01  88.92  85.32  21.40  76.67  83.13  79.77  17.88
    NoPOS+ELMo     84.40  91.24  87.69  22.86  80.59  87.12  83.73  18.15

Table 11
Accuracies of the graph merging–based parser on the development set without POS tag information. A first-order graph-based parser with a neural scoring model is used for tree parsing. ELMo is used.

                  UP     UR     UF     UCM    LP     LR     LF     LCM
SM  Subgraph1     89.74  79.79  84.47  19.80  85.69  76.19  80.66  17.07
    Subgraph2     89.47  81.63  85.37  19.06  85.43  77.94  81.51  15.94
    Subgraph3     90.62  81.50  85.82  20.76  86.60  77.88  82.01  16.28
    Merged        84.40  91.24  87.69  22.86  80.59  87.12  83.73  18.15
LR  Subgraph1     91.13  80.82  85.67  21.10  86.95  77.11  81.73  17.30
    Subgraph2     90.86  82.78  86.63  20.31  86.67  78.97  82.64  16.73
    Subgraph3     91.56  82.38  86.73  21.21  87.49  78.72  82.88  16.79
    Merged        88.86  88.97  88.92  28.93  84.98  85.08  85.03  22.12

Table 12
Accuracies with respect to representative grammatical functions. The results are obtained by the graph merging parser on the development data. The ELMo vectors are utilized but not POS tags.

Relation    P      R      F
ADV         83.08  79.81  81.41
AMOD        82.92  80.76  81.83
COMP        87.17  90.13  88.63
LOC         82.62  70.63  76.15
NMOD        86.41  89.19  87.78
OBJ         88.08  91.00  89.52
SBJ         82.29  80.59  81.43
TMP         78.39  76.36  77.36

6.5.4 Comparing the POS Tags and the ELMo Vectors. Both gold POS tags and ELMo word embeddings improve the GR parser with comparable effects, as shown by the roughly equivalent results. If we consider only the unlabeled parsing quality, however, POS tags and ELMo help the GR parsers in different ways. To evaluate the differences between the two
Table 13
Recalls with respect to dependency types, that is, local or non-local dependencies. The results are evaluated on the development data.

Model          URL    LRL    URNL   LRNL
GM (−ELMo)     88.66  86.68  72.60  63.21
GM (+ELMo)     91.65  89.64  77.84  70.48

models trained with gold POS tags and ELMo, respectively, we define the following metric:

2 |DPOS ∩ DELMo| / (|DPOS| + |DELMo|)

where DPOS/DELMo denotes the set of dependencies for the held-out sentences returned by the POS-/ELMo-enhanced model. When evaluated on the development data, the values of the unlabeled and labeled versions of this metric are 84.46 and 81.05, respectively. These relatively low values highlight the complementary strengths of the POS tags and the ELMo vectors. The complementary strengths are also confirmed by the experiment that uses POS and ELMo together: comparing the results in Tables 9 and 10, we can clearly see a significant improvement.

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi
/
c
o

je
je
/

je

un
r
t
je
c
e

p
d

F
/

/

/

/

4
5
1
9
5
1
8
0
9
7
3
8
/
c
o

6.5.5 Local vs. Non-Local. Although the micro accuracy over all dependencies is considerably good, the ability of current state-of-the-art statistical parsers to find difficult non-local material is far from satisfactory, even for English (Rimell, Clark, and Steedman 2009; Bender et al. 2011). We report the accuracy in terms of local and non-local dependencies, respectively, to show the difficulty of recovering non-local dependencies. Table 13 gives the labeled/unlabeled recall of local (URL/LRL) and non-local (URNL/LRNL) dependencies. We can clearly see that non-local dependency recovery is extremely difficult for Chinese parsing. Another noteworthy observation is that the ELMo word embeddings improve the prediction of non-local dependencies by a very significant margin.

je
je

_
un
_
0
0
3
4
3
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

6.6 Results on the Test Data

Table 14 gives the parsing accuracy on the test set. We evaluate different architectures
and different scoring models, as compared to the results reported in Zhang et al. (2016).
The beam size of our transition-based parser is set to 16, which is the same as Zhang et al.
(2016). Comparing the global linear and LSTM-based models, we can see the advantage
of applying neural models for syntactic parsing as well as the effectiveness of our graph
merging model: It is significantly better than the transition-based parser. Table 15 shows
the results obtained using a realistic setup for real-world applications: The POS tags are
removed while the ELMo vectors are incorporated.

Table 14
Accuracies of different models on the test set. "GM" is short for graph merging; "TR (DO)" is short for transition-based parsing with dynamic oracle; "TR (BS)" is short for transition-based parsing with beam search. The performance of the TR (BS) parser with a linear scoring model is from Zhang et al.'s paper. Gold-standard POS tags are utilized.

                     UP     UR     UF     UCM    LP     LR     LF     LCM
GM        Neural     88.84  88.23  88.53  29.14  86.47  85.87  86.17  24.68
GM        Linear     88.06  85.11  86.56  26.24  86.03  83.16  84.57  22.84
TR (DO)   Neural     89.56  85.75  87.61  25.11  87.41  83.69  85.51  21.86
TR (BS)   Neural     84.77  87.70  86.21  20.61  82.21  85.05  83.61  18.07
TR (BS)   Linear     –      –      –      –      82.28  83.11  82.69  –

Table 15
Accuracies of the GM model on the test set. No POS tags are utilized, whereas the ELMo vectors are used.

POS   ELMo   UP     UR     UF     UCM    LP     LR     LF     LCM
NO    YES    89.18  88.86  89.02  29.25  85.05  84.75  84.90  22.68

7. Conclusion

The availability of large-scale treebanks has contributed to the blossoming of statistical approaches for building accurate surface constituency and dependency parsers. Recent years have witnessed rapid progress on deep linguistic processing of English texts, and initial attempts have been made for Chinese. Our work attempts to strike a balance between traditional CFG parsing, dependency tree parsing, and deep linguistic processing. By integrating the linguistically deep grammatical function information and the key bilexical relations underlying the dependency analysis, we propose a new type of deep dependency analysis for parsing Mandarin Chinese sentences. Based on LFG-like dependency annotations, we developed statistical models for deep dependency parsing in both constituency and dependency formalisms. To construct complex linguistic graphs beyond trees, we propose a new approach, namely graph merging, that builds GR graphs from subgraphs. There are two key issues involved in this approach, that is, graph decomposition and merging. To solve these two problems in a principled way, we treat both as optimization problems and employ combinatorial optimization techniques. Experiments demonstrate the effectiveness of the graph merging framework, which can be adapted to other types of flexible representations, for example, semantic dependency graphs (Oepen et al. 2014, 2015) and abstract meaning representations (Banarescu et al. 2013). The study is significant in demonstrating a linguistically motivated and computationally effective way of parsing Chinese texts.

Appendix

In annotations of dependency structures, typical grammatical relations are subject, object, etc., which denote the role the dependent plays with regard to its head. In addition to phrasal categories, CTB also has functional labels to represent relations. The CTB

utilizes syntactic distribution as the main criterion for distinguishing lexical categories. By exploring CTB's original annotations, we define a rich set of dependency relations. Table 16 lists the major relations. For more details, please refer to the source code.

Table 16
Illustration of major dependency relations in our corpus.

ADV        adverbial adjunct
AMOD       adjectival modifier
AUX        auxiliary
COMP       complement
DET        determiner
LOC        location
NMOD       nominal modifier
OBJ        object
PRT        particle
QUANT      quantification
RELATIVE   relative clause
SUBJ       subject
TMP        temporal modifier

References
Andor, Daniel, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2442–2452, Berlin, August.

Banarescu, Laura, Claire Bonial, Shu Cai,
Madalina Georgescu, Kira Griffitt, Ulf
Hermjakob, Kevin Knight, Philipp Koehn,
Martha Palmer, and Nathan Schneider.
2013. Abstract meaning representation for
sembanking. In Proceedings of the 7th
Linguistic Annotation Workshop and
Interoperability with Discourse,
pages 178–186, Association for
Computational Linguistics, Sofia, Bulgaria,
August.

Bender, Emily M., Dan Flickinger, Stephan Oepen, and Yi Zhang. 2011. Parser evaluation over local and non-local deep dependencies in a large corpus. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 397–408, Edinburgh, July. Association for Computational Linguistics.

Bohnet, Bernd. 2010. Top accuracy and fast

dependency parsing is not a contradiction.
In Proceedings of the 23rd International
Conference on Computational Linguistics
(COLING 2010), pages 89–97, Beijing, August. COLING 2010 Organizing Committee.

Bresnan, Joan, and Ronald Kaplan. 1982.
Introduction: Grammars as mental
representations of language. In J. Bresnan,
editor, The Mental Representation of
Grammatical Relations. MIT Press,
Cambridge, MA, pages xvii–lii.

Bresnan, Joan. 2001. Lexical-Functional Syntax.

Number 16. Blackwell, Malden, MA.

Briscoe, Ted, and John Carroll. 2006.
Evaluating the accuracy of an
unlexicalized statistical parser on the
PARC DepBank. In Proceedings of the
COLING/ACL 2006 Main Conference Poster
Sessions, pages 41–48, Sydney, July.
Association for Computational
Linguistics.

Cahill, Aoife, Mairead McCarthy, Josef Van

Genabith, and Andy Way. 2002. Automatic
annotation of the Penn-treebank with LFG
f-structure information. In Proceedings of
the LREC Workshop on Linguistic Knowledge
Acquisition and Representation: Bootstrapping
Annotated Language Data, Las Palmas,
Canary Islands, pages 8–15.

Carnie, Andrew. 2007. Syntax: A Generative

Introduction, 2nd edition. Blackwell Publishing, Malden, MA.

Chen, Danqi, and Christopher Manning.
2014. A fast and accurate dependency
parser using neural networks. Dans
Actes du 2014 Conference on
Empirical Methods in Natural Language
Processing (EMNLP), pages 740–750, Doha,
Qatar, Association for Computational
Linguistics.

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi
/
c
o

je
je
/

je

un
r
t
je
c
e

p
d

F
/

/

/

/

4
5
1
9
5
1
8
0
9
7
3
8
/
c
o

je
je

_
un
_
0
0
3
4
3
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

Sun et al.

Parsing Chinese Sentences with Grammatical Relations

Chomsky, Noam. 1981. Lectures on

Government and Binding. Foris Publications,
Dordrecht.

Chu, Yoeng-Jin, and Tseng-Hong Liu. 1965.
On the shortest arborescence of a directed
graph. Science Sinica, 14:1396–1400.
Clark, Stephen, and James Curran. 2007un.

Formalism-independent parser evaluation
with CCG and DepBank. In Proceedings of
the 45th Annual Meeting of the Association for
Computational Linguistics, pages 248–255,
Prague, June.

Clark, Stephen, and James R. Curran. 2007b.
Wide-coverage efficient statistical parsing
with CCG and log-linear models.
Computational Linguistics, 33(4):493–552, December.

Copestake, Ann, Dan Flickinger, Carl

Pollard, and Ivan A. Sag. 2005. Minimal
recursion semantics: An introduction.
Research on Language and Computation,
pages 281–332.

Covington, Michael A. 2001. A fundamental
algorithm for dependency parsing. In
Proceedings of the 39th Annual ACM
Southeast Conference, pages 95–102.
Dalrymple, M. 2001. Lexical-Functional Grammar, volume 34 of Syntax and Semantics. Academic Press, New York.

Dozat, Timothy, and Christopher D.

Manning. 2016. Deep biaffine attention for
neural dependency parsing. CoRR,
abs/1611.01734.

Edmonds, Jack. 1967. Optimum branchings.
Journal of Research of the National Bureau of
Standards, 71B:233–240.

Eisner, Jason M. 1996. Three new

probabilistic models for dependency
parsing: An exploration. In Proceedings of
the 16th Conference on Computational
Linguistics – Volume 1, pages 340–345,
Stroudsburg, PA.

Flickinger, Daniel, Yi Zhang, and Valia

Kordoni. 2012. Deepbank: A dynamically
annotated treebank of the Wall Street
Journal. In Proceedings of the Eleventh
International Workshop on Treebanks and
Linguistic Theories, pages 85–96.

Goldberg, Yoav, and Joakim Nivre. 2012. UN
dynamic oracle for arc-eager dependency
parsing. COLING-2012, 2(December):959–976.

Guo, Yuqing, Josef van Genabith, and Haifeng Wang. 2007. Treebank-based acquisition of LFG resources for Chinese. In Proceedings of the LFG07 Conference, pages 214–232, CSLI Publications, Stanford, CA.

Henderson, James, Paola Merlo, Ivan Titov,
and Gabriele Musillo. 2013. Multilingual

joint parsing of syntactic and semantic
dependencies with a latent variable model.
Computational Linguistics, 39(4):949–998.
Hochreiter, Sepp, and Jürgen Schmidhuber.
1997. Long Short-Term Memory.
Neural Computation, 9(8):1735–1780,
November.

Hockenmaier, Julia, and Mark Steedman.

2007. CCGbank: A corpus of CCG
derivations and dependency structures
extracted from the Penn Treebank.
Computational Linguistics, 33(3):355–396.

Huang, Liang, and Kenji Sagae. 2010.

Dynamic programming for linear-time
incremental parsing. In Proceedings of the
48th Annual Meeting of the Association for
Computational Linguistics, pages 1077–1086,
Uppsala, Sweden, July.

Hudson, Richard. 1990. English Word
Grammar. Blackwell, Malden, MA.

Joshi, Aravind K., and Yves Schabes. 1997.

Tree-adjoining grammars. In G. Rozenberg
and A. Salomaa, editors, Handbook of
Formal Languages, Volume 3, Springer,
Berlin, New York, pages 69–124.

Kaplan, Ron, Stefan Riezler, Tracy H. King,
John T. Maxwell III, Alex Vasserman, and
Richard Crouch. 2004. Speed and accuracy
in shallow and deep stochastic parsing. In
Daniel Marcu, Susan Dumais, and Salim
Roukos, editors, HLT-NAACL 2004: Main
Proceedings, pages 97–104, Boston,
May 2–May 7. Association for
Computational Linguistics.

King, Tracy Holloway, Richard Crouch,
Stefan Riezler, Mary Dalrymple, and
Ronald M. Kaplan. 2003. The PARC 700
dependency bank. In Proceedings of the 4th
International Workshop on Linguistically
Interpreted Corpora, pages 1–8.

Kiperwasser, Eliyahu, and Yoav Goldberg.
2016. Simple and accurate dependency
parsing using bidirectional LSTM feature
representations. Transactions of the
Association for Computational Linguistics,
4:313–327.

Koo, Terry, and Michael Collins. 2010.

Efficient third-order dependency parsers.
In Proceedings of the 48th Annual Meeting of
the Association for Computational Linguistics,
pages 1–11, Uppsala, Sweden, July.
Koo, Terry, Alexander M. Rush, Michael
Collins, Tommi Jaakkola, and David
Sontag. 2010. Dual decomposition for
parsing with non-projective head
automata. In Proceedings of the 2010
Conference on Empirical Methods in Natural
Language Processing, pages 1288–1298,
Cambridge, MA, October. Association for
Computational Linguistics.

Kübler, Sandra, Ryan T. McDonald, and
Joakim Nivre. 2009. Dependency Parsing.
Synthesis Lectures on Human Language
Technologies. Morgan & Claypool,
London.

Kuhlmann, Marco. 2010. Dependency

Structures and Lexicalized Grammars: Un
Algebraic Approach. Ph.D. thesis, Saarland
University.

Li, Zhenghua, Min Zhang, Wanxiang Che,

Ting Liu, Wenliang Chen, and Haizhou Li.
2011. Joint models for Chinese POS
tagging and dependency parsing. In
Proceedings of the 2011 Conference on
Empirical Methods in Natural Language
Processing, pages 1180–1191, Edinburgh,
July. Association for Computational
Linguistics.

McDonald, Ryan, Koby Crammer, and
Fernando Pereira. 2005. Online
large-margin training of dependency
parsers. In Proceedings of the 43rd Annual
Meeting of the Association for Computational
Linguistics (ACL’05), pages 91–98,
Ann Arbor, June.

McDonald, Ryan, and Fernando Pereira.
2006. Online learning of approximate
dependency parsing algorithms. In
Proceedings of 11th Conference of the
European Chapter of the Association for
Computational Linguistics, Volume 6,
pages 81–88.

McDonald, Ryan, Fernando Pereira,
Kiril Ribarov, and Jan Hajic. 2005.
Non-projective dependency parsing using
spanning tree algorithms. In Proceedings of
the Conference on Human Language
Technology and Empirical Methods in Natural
Language Processing, pages 523–530,
Vancouver, October. Association for
Computational Linguistics.

Mel’čuk, I. A. 2001. Communicative
Organization in Natural Language: The
Semantic-Communicative Structure of
Sentences. Studies in Language
Companion Series. J. Benjamins,
Amsterdam.

Miyao, Yusuke, Takashi Ninomiya, and
Jun’ichi Tsujii. 2005. Corpus-oriented
grammar development for acquiring a
head-driven phrase structure grammar
from the Penn Treebank. In The Second
International Joint Conference on
Natural Language Processing,
pages 684–693.

Miyao, Yusuke, Rune Sætre, Kenji Sagae,
Takuya Matsuzaki, and Jun’ichi Tsujii.
2008. Task-oriented evaluation of syntactic
parsers and their representations.
In Proceedings of ACL-08: HLT,

pages 46–54, Columbus, Ohio, June.
Association for Computational Linguistics.

Miyao, Yusuke, and Jun’ichi Tsujii. 2008.
Feature forest models for probabilistic
HPSG parsing. Computational Linguistics,
34(1):35–80, March.

Nivre, Joakim. 2008. Algorithms for

deterministic incremental dependency
parsing. Computational Linguistics,
34:513–553, December.

Nivre, Joakim, Johan Hall, Sandra Kübler,
Ryan McDonald, Jens Nilsson, Sebastian
Riedel, and Deniz Yuret. 2007. The CoNLL
2007 shared task on dependency parsing.
In Proceedings of the CoNLL Shared Task
Session of EMNLP-CoNLL 2007,
pages 915–932, Prague, June. Association
for Computational Linguistics.

Nivre, Joakim, Johan Hall, and Jens Nilsson.
2004. Memory-based dependency parsing.
In Hwee Tou Ng and Ellen Riloff, editors,
HLT-NAACL 2004 Workshop: Eighth
Conference on Computational Natural
Language Learning (CoNLL-2004),
pages 49–56, Boston, May 6–May 7.
Association for Computational Linguistics.

Oepen, Stephan, Marco Kuhlmann, Yusuke

Miyao, Daniel Zeman, Silvie Cinkova, Dan
Flickinger, Jan Hajic, and Zdenka Uresova.
2015. SemEval 2015 Task 18:
Broad-coverage semantic dependency
parsing. In Proceedings of the 9th
International Workshop on Semantic
Evaluation (SemEval 2015), pages 915–926,
Denver, Colorado, June. Association for
Computational Linguistics.

Oepen, Stephan, Marco Kuhlmann, Yusuke
Miyao, Daniel Zeman, Dan Flickinger, Jan
Hajic, Angelina Ivanova, and Yi Zhang.
2014. SemEval 2014 Task 8: Broad-coverage
semantic dependency parsing. In
Proceedings of the 8th International Workshop
on Semantic Evaluation, pages 63–72,
Dublin, August. Association for
Computational Linguistics and Dublin
City University.

Oepen, Stephan, Erik Velldal, Jan Tore
Lønning, Paul Meurer, Victoria Rosén, and
Dan Flickinger. 2007. Towards hybrid
quality-oriented machine translation. On
linguistics and probabilities in MT. In
Proceedings of the 10th International
Conference on Theoretical and Methodological
Issues in Machine Translation, pages 144–153.

Peters, Matthew, Mark Neumann, Mohit

Iyyer, Matt Gardner, Christopher Clark,
Kenton Lee, and Luke Zettlemoyer. 2018.
Deep contextualized word representations.
In Proceedings of the 2018 Conference of the
North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, Volume 1 (Long Papers),
pages 2227–2237.

Pitler, Emily, Sampath Kannan, and Mitchell
Marcus. 2013. Finding optimal
1-endpoint-crossing trees. TACL, 1:13–24.

Pollard, Carl, and Ivan A. Sag. 1994.
Head-Driven Phrase Structure Grammar. The
University of Chicago Press, Chicago.
Rimell, Laura, Stephen Clark, and Mark

Steedman. 2009. Unbounded dependency
recovery for parser evaluation. In
Proceedings of the 2009 Conference on
Empirical Methods in Natural Language
Processing, pages 813–821, Singapore,
Association for Computational Linguistics.

Sagae, Kenji, and Jun’ichi Tsujii. 2008.

Shift-reduce dependency DAG parsing.
In Proceedings of the 22nd International
Conference on Computational Linguistics,
pages 753–760, Manchester, UK, August.
Coling 2008 Organizing Committee.
Steedman, M.. 1996. Surface Structure and

Interpretation. Linguistic Inquiry
Monographs. MIT Press, Cambridge, MA.
Steedman, Mark. 2000. The Syntactic Process.

MIT Press, Cambridge, MA.

Sun, Weiwei, Yantao Du, Xin Kou, Shuoyang

Ding, and Xiaojun Wan. 2014.
Grammatical relations in Chinese:
GB-ground extraction and data-driven
parsing. In Proceedings of the 52nd Annual
Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers),
pages 446–456, Baltimore, June.
Association for Computational Linguistics.

Sun, Weiwei, Yantao Du, and Xiaojun Wan.

2017. Parsing for grammatical relations via
graph merging. In Proceedings of the 21st
Conference on Computational Natural
Language Learning (CoNLL 2017),
pages 26–35, Vancouver, August.
Association for Computational Linguistics.

Sun, Weiwei, and Hans Uszkoreit. 2012.

Capturing paradigmatic and syntagmatic
lexical relations: Towards accurate Chinese
part-of-speech tagging. In Proceedings of the
50th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long
Papers), pages 242–252, Jeju Island, Korea,
July.

Sun, Weiwei, and Xiaojun Wan. 2016.

Towards accurate and efficient Chinese
part-of-speech tagging. Computational
Linguistics, 42(3):391–419.

Sun, Weiwei, and Jia Xu. 2011. Enhancing
Chinese word segmentation using
unlabeled data. In Proceedings of the 2011
Conference on Empirical Methods in Natural
Language Processing, pages 970–979,
Edinburgh, July. Association for
Computational Linguistics.

Torres Martins, Andre, Noah Smith, and Eric
Xing. 2009. Concise integer linear
programming formulations for
dependency parsing. In Proceedings of the
Joint Conference of the 47th Annual Meeting of
the Association for Computational Linguistics
and the 4th International Joint Conference on
Natural Language Processing of the AFNLP,
pages 342–350, Suntec, Singapore, August.

Tse, Daniel, and James R. Curran. 2010.
Chinese CCGbank: Extracting CCG
derivations from the Penn Chinese
Treebank. In Proceedings of the 23rd
International Conference on Computational
Linguistics (Coling 2010), pages 1083–1091,
Beijing, August. Coling 2010 Organizing
Committee.

Tse, Daniel, and James R. Curran. 2012. The
challenges of parsing Chinese with
combinatory categorial grammar. In
Proceedings of the 2012 Conference of the
North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, pages 295–304, Montréal,
June. Association for Computational
Linguistics.

Wang, Mengqiu, Kenji Sagae, and Teruko
Mitamura. 2006. A fast, accurate
deterministic parser for Chinese. In
Proceedings of the 21st International
Conference on Computational Linguistics and
44th Annual Meeting of the Association for
Computational Linguistics, pages 425–432,
Sydney, July.

Watanabe, Taro, and Eiichiro Sumita. 2015.

Transition-based neural constituent
parsing. In Proceedings of the 53rd Annual
Meeting of the Association for Computational
Linguistics and the 7th International Joint
Conference on Natural Language Processing
(Volume 1: Long Papers), pages 1169–1179.
Weiss, David, Chris Alberti, Michael Collins,
and Slav Petrov. 2015. Structured training
for neural network transition-based
parsing. In Proceedings of the 53rd Annual
Meeting of the Association for Computational
Linguistics and the 7th International Joint
Conference on Natural Language Processing
(Volume 1: Long Papers), pages 323–333,
Beijing, July.

Wu, Xianchao, Takuya Matsuzaki, and
Jun’ichi Tsujii. 2010. Fine-grained
tree-to-string translation rule extraction. In
Proceedings of the 48th Annual Meeting of the
Association for Computational Linguistics,
pages 325–334, Uppsala, Sweden, July.

Xia, Fei. 2001. Automatic Grammar
Generation from Two Different
Perspectives. Ph.D. thesis, University of
Pennsylvania.

Xue, Naiwen, Fei Xia, Fu-dong Chiou, and
Marta Palmer. 2005. The Penn Chinese
Treebank: Phrase structure annotation of a
large corpus. Natural Language Engineering,
11:207–238, June.

Xue, Nianwen. 2007. Tapping the implicit

information for the PS to DS conversion of
the Chinese Treebank. In Proceedings of the
Sixth International Workshop on Treebanks
and Linguistic Theories, pages 431–438.
Xue, Nianwen, and Martha Palmer. 2009.
Adding semantic roles to the Chinese
treebank. Natural Language Engineering,
15:143–172, January.

Yakushiji, Akane, Yusuke Miyao, Yuka
Tateisi, and Jun’ichi Tsujii. 2005.
Biomedical information extraction with
predicate-argument structure patterns. In
Proceedings of the Eleventh Annual Meeting of
the Association for Natural Language
Processing, pages 60–69.

Yamada, Hiroyasu, and Yuji Matsumoto.

2003. Statistical dependency analysis with
support vector machines. In The 8th
International Workshop of Parsing
Technologies (IWPT2003), pages 195–206.
Yu, Kun, Yusuke Miyao, Takuya Matsuzaki,
Xiangli Wang, and Junichi Tsujii. 2011.
Analysis of the difficulties in Chinese
deep parsing. In Proceedings of the 12th
International Conference on Parsing
Technologies, pages 48–57, Dublin, October.
Association for Computational Linguistics.

Yu, Kun, Yusuke Miyao, Xiangli Wang,

Takuya Matsuzaki, and Junichi Tsujii. 2010.
Semi-automatically developing Chinese
HPSG grammar from the Penn Chinese
Treebank for deep parsing. In Coling 2010:

Posters, pages 1417–1425, Beijing, August.
Coling 2010 Organizing Committee.
Zhang, Xun, Yantao Du, Weiwei Sun, and
Xiaojun Wan. 2016. Transition-based
parsing for deep dependency structures.
Computational Linguistics, 42(3):353–389.
Zhang, Yue, and Stephen Clark. 2008. A tale

of two parsers: Investigating and
combining graph-based and
transition-based dependency parsing. In
Proceedings of the 2008 Conference on
Empirical Methods in Natural Language
Processing, pages 562–571, Honolulu,
October. Association for Computational
Linguistics.

Zhang, Yue, and Stephen Clark. 2009.

Transition-based parsing of the Chinese
Treebank using a global discriminative
model. In Proceedings of the 11th
International Conference on Parsing
Technologies (IWPT’09), pages 162–171,
Paris, October. Association for
Computational Linguistics.

Zhang, Yue, and Stephen Clark. 2011a.

Shift-reduce CCG parsing. In Proceedings of
the 49th Annual Meeting of the Association for
Computational Linguistics: Human Language
Technologies, pages 683–692, Portland,
Oregon, June.

Zhang, Yue, and Stephen Clark. 2011b.

Syntactic processing using the generalized
perceptron and beam search. Computational
Linguistics, 37(1):105–151.

Zhou, Hao, Yue Zhang, Shujian Huang, and
Jiajun Chen. 2015. A neural probabilistic
structured-prediction model for
transition-based dependency parsing. In
Proceedings of the 53rd Annual Meeting of the
Association for Computational Linguistics and
the 7th International Joint Conference on
Natural Language Processing (Volume 1: Long
Papers), pages 1213–1222.
