Integrating Selectional Constraints and
Subcategorization Frames in a
Dependency Parser


Seyed Abolghasem Mirroshandel∗
University of Guilan

Alexis Nasr∗∗
Université Aix-Marseille

Statistical parsers are trained on treebanks that are composed of a few thousand sentences. In
order to prevent data sparseness and computational complexity, such parsers make strong inde-
pendence hypotheses on the decisions that are made to build a syntactic tree. These independence
hypotheses yield a decomposition of the syntactic structures into small pieces, which in turn
prevent the parser from adequately modeling many lexico-syntactic phenomena like selectional
constraints and subcategorization frames. Moreover, treebanks are several orders of magni-
tude too small to observe many lexico-syntactic regularities, such as selectional constraints and
subcategorization frames. In this article, we propose a solution to both problems: how to account
for patterns that exceed the size of the pieces that are modeled in the parser and how to obtain
subcategorization frames and selectional constraints from raw corpora and incorporate them
in the parsing process. The method proposed was evaluated on French and on English. The
experiments on French showed a decrease of 41.6% of selectional constraint violations and a
decrease of 22% of erroneous subcategorization frame assignment. These figures are lower for
English: 16.21% in the first case and 8.83% in the second.

1. Introduction

The fundamental problem we address in this article was formulated by Joshi (1985)
as the following question: How much context-sensitivity is required to provide reasonable
structural descriptions? His answer to this question was the extended domain of locality
principle of Tree Adjoining Grammars, which allows us to represent in a single structure
(an elementary tree1) a predicate and its arguments.

∗ Computer Engineering Department, Faculty of Engineering, University of Guilan, Rasht, Iran.

E-mail: mirroshandel@guilan.ac.ir.

∗∗ Laboratoire d'Informatique Fondamentale, CNRS-Université Aix-Marseille, France.

E-mail: alexis.nasr@univ-amu.fr.

1 In the remainder of this article, we will refer to the elementary object manipulated by the parser as
factors, borrowing this term from graph-based parsing.

Submission received: 10 September 2014; revised version received: 3 August 2015; accepted for publication:
28 September 2015.

doi:10.1162/COLI_a_00242

© 2016 Association for Computational Linguistics


Context sensitivity and the correct size of domain of locality for a reasonable struc-
tural description has always been a major issue for syntactic formalisms and parsers.
Giving an overview of the different solutions that have been proposed to answer this
question is clearly beyond the scope of this article. We will limit our discussion to the
framework of graph-based dependency parsing (Eisner 1996; McDonald and Pereira
2006), in which our study takes place. The basic first-order model makes an extreme
choice with respect to the domain of locality by limiting it to one dependency (the score
of a tree is the sum of the scores of its dependencies). Second-order models slightly
extend the factor sizes to second-order patterns that include a dependency adjacent
to the target dependency. It has been shown experimentally that second-order models
perform better than first-order ones, which tends to prove that second-order factors do
capture important syntactic regularities. More recently, Koo and Collins (2010) proposed
extending factor size to three dependencies and showed that such models yield better
accuracy than second-order models on the same data. However, extending the order
of the models has a high computational cost because the number of factors to be
considered in a sentence grows exponentially with the order of the model. Using such
models in a Dynamic Programming framework, such as Eisner (1996), becomes quickly
intractable in terms of processing time and memory requirements. Moreover, for
reliably estimating the scores of such factors we are quickly confronted with the problem
of data sparseness.

One can also note that high-order factors are not needed for all syntactic phenom-
ena. In fact, most syntactic attachments can be accurately described with first- and
second-order models. This is why first- and second-order models perform quite well:
They reach a labeled accuracy score of 85.36% and 88.88% for first- and second-order
models, respectively, on the French Treebank (Abeillé, Clément, and Toussenel 2003).
The same models trained on the Penn Treebank (Marcus, Marcinkiewicz, and Santorini
1993) reach 88.57% and 91.76% labeled accuracy score, respectively. The extension of
the domain of locality can therefore be limited to some syntactic phenomena. This is
the case in Tree Adjoining Grammars, for which the extended domain of locality only
concerned some aspects of syntax, namely, subcategorization.

The solution we explore in this article, in order to take high-order factors into
account in a parser, is to decompose the parsing process into two sub-processes.
One of them is in charge of local syntactic phenomena and does not need a high-
order model and the other takes care of syntactic phenomena that are beyond the
scope of the first one. We will call the first one parsing and the second one patch-
En g. The patching process is responsible for modeling two important aspects of
syntax, Selectional Constraints (SC) and Subcategorization Frames (SFs), cuales son
usually poorly modeled in parsers, as we will show in Sections 7 y 8. Parsing
and patching differ in two important aspects. First, they rely on different search
algorithms. The first one is based on Dynamic Programming and the other one
uses Integer Linear Programming (ILP). Second, they rely on different data sets. The
first uses a standard treebank whereas the second uses a much larger unannotated
corpus.

The reason for the first difference comes from the different nature of these two
processes. As already mentioned, high-order factors are needed to model some specific
syntactic phenomena and we do not want to systematically increase the order of the
model in order to take them into account. Using factors of different sizes in a Dynamic
Programming parser is not practical and leads to ad hoc adaptations of the parser. This
is the reason why patching is based on ILP, which can more naturally accommodate
constraints defined on structures of different sizes.


The second difference between parsing and patching is the data sets that were
used to train them. The parser is trained on a standard treebank whereas the patching
model is trained on a raw corpus that is several orders of magnitude larger. The
reason for this difference comes from the fact that high-order factors used to model
SFs and SCs contain up to three lexical elements and cannot be accurately modeled
using available treebanks. It has been shown by Gildea (2001) and Bikel (2004) that
bilexical dependencies of the Collins parser (Collins 1997) have a very limited impact
on the performance of the parser, although they were supposed to play a key role for
some important ambiguous syntactic attachments. The reason for this is that treebanks
are not large enough to correctly model bilexical dependencies. We show in Sections 7.2
y 8.2 that current treebanks are much too limited in size to accurately model SCs
and SFs.

Other solutions have been proposed in the literature in order to combine local and
global syntactic phenomena in a parser. Parse reranking is one of them. The idea is to
produce the n-best parses for a sentence and rerank them using higher order features,
such as in Collins and Koo (2005) and Charniak and Johnson (2005). This solution has
the advantage of drastically limiting the search space of the high-order search algorithm
to the k best parses. Although the parsing architecture we propose in this article does
use k-best parse lists, our solution allows us to combine dependencies that appear in
any of the parses of the list. We show in Section 6.3 that combining dependencies that
appear in any parse of a k-best list can yield parses that are more accurate than the best
parse of this k-best list. Our method is closer to forest rescoring (Huang and Chiang
2007) or forest reranking (Huang 2008).

Another solution to combine local and global decisions is to represent the local
and the global constraints as a single ILP program. This solution is possible because
dependency parsing can be framed as an ILP program. Riedel and Clarke (2006)
propose a formulation of nonprojective dependency parsing as an ILP program. Este
program, however, requires an exponential number of variables. Martins, Smith, and
Xing (2009) propose a more concise formulation that only requires a polynomial number
of variables and constraints. Martins, Almeida, and Smith (2013) propose extending the
model to third-order factors and using dual decomposition. All these approaches have
in common their ability to produce nonprojective structures (Lecerf 1961). Our work
departs from these approaches in that it uses a dynamic programming parser combined
with an ILP solver. Además, we do not take nonprojective structures into account.
The structure of this article is as follows. En la sección 2 we describe the type of
parser used in this work. Sección 3 describes the patching process that detects, en un
given sentence, a set of optimal SFs and SCs, using ILP. The exact nature of SFs and
SCs is described in that section. Sección 4 proposes a way to combine the solutions
produced by the two methods using constrained parsing. En la sección 5, we justify the
use of ILP by showing that the patching process is actually NP-complete. Sections 6, 7,
and 8 constitute the experimental part of this work. Section 6 describes the experimental
set-up, and Sections 7 and 8, respectively, give results obtained on French and English.
Section 9 concludes the article.

The material presented in this article has been partially published in Mirroshandel,
Nasr, and Roux (2012) and Mirroshandel, Nasr, and Sagot (2013). In Mirroshandel, Nasr,
and Roux (2012), a method for taking into account selectional constraints in a parser is
presentado, and in Mirroshandel, Nasr, and Sagot (2013) a method of introducing sub-
categorization frames in the parser is described. This paper proposes a general frame-
work for dealing with these two problems, both separately and jointly, using Integer
Linear Programming, whereas the two previous techniques used ad hoc adaptations


and could not treat the two problems jointly. Además, we extended the experimen-
tal part of our work to English, whereas previous work only concerned French.

2. Parsing

In this section, we briefly present the graph-based approach for projective dependency
parsing. We then introduce an extension to this kind of parser, which we call con-
strained parsing, that forces the parser to include some specific patterns in its output.
Then, a simple confidence measure that indicates how confident the parser is about any
subpart of the solution is presented.

2.1 Graph-Based Parsing

Graph-based parsing (McDonald, Crammer, and Pereira 2005a; Kübler, McDonald, and
Nivre 2009) defines a framework for parsing that does not make use of a generative
grammar. In such a framework, given a sentence S = w1 . . . wl, any dependency tree2
for S is a possible syntactic structure for S. The heart of a graph-based parser is the
scoring function, which assigns a score s(T) to every tree T ∈ T(S), where T(S) denotes
the set of all dependency trees of sentence S. Such scores are usually the sum of the
scores of subparts of T. The parser returns the tree that maximizes such a score:

    T̂ = arg max_{T ∈ T(S)}  Σ_{ψ ∈ ψ(T)} s(ψ)                                      (1)

In Equation (1), ψ(T) is the set of all the relevant subparts of tree T and s(ψ) is the
score of subpart ψ. The scores of the subparts are computed using machine learning
algorithms on a treebank, such as McDonald, Crammer, and Pereira (2005b).

As already mentioned in Section 1, several decompositions of the tree into sub-
parts are possible and yield models of increasing complexity. The simplest one
is the first-order model, which simply decomposes a tree into single dependencies
and assigns a score to a dependency, irrespective of its context. The most popular
model is the second-order model, which decomposes a tree into subparts made of two
dependencies.

Solving Equation (1) cannot be done naively for combinatorial reasons: The set T (S)
contains an exponential number of trees. Sin embargo, it can be solved in polynomial time
using dynamic programming (Eisner 2000).

2.2 Constrained Parsing

The graph-based framework allows us to impose structural constraints on the parses
that are produced by the parser in a straightforward way, a feature that will prove to be
valuable in this work.

Consider a sentence S = w1 . . . wl, a parser with a scoring function s, and a dependency
d = (i, r, j), such that i is the position of the governor, j the position of the dependent,

2 A dependency tree for sentence S = w1 . . . wl and the dependency relation set R is a directed labeled tree
(V, A) such that V = {w1, . . . , wl} and A ⊆ V × R × V.


and r the dependency label, with 1 ≤ i, j ≤ l. We can define a new scoring function s+_d
in the following way:

    s+_d(i′, r′, j′) = −∞             if j′ = j and (i′ ≠ i or r′ ≠ r)
                     = s(i′, r′, j′)   otherwise

When running the parser on sentence S with the scoring function s+_d, one can be
sure that dependency d will be part of the solution produced by the parser because the
score of all its competing dependencies (es decir., dependencies of the form (i(cid:48), r(cid:48), j(cid:48)) con
different governors (i(cid:48) (cid:54)= i) or labels (r(cid:48) (cid:54)= r) and the same dependent (j(cid:48) = j)) are set
to −∞. Como resultado, the overall score of trees that have such competing dependencies
cannot be higher than the score of trees that contain our desired dependency. In other words, the
resulting parse tree will contain dependency d. This feature will be used in Section 4
in order to parse a sentence for which a certain number of dependencies are set in
advance.

Function s+_d can be extended in order to force the presence of any set of dependen-
cies in the output of the parser. Suppose D is a set of dependencies; we then define s+_D
the following way:

    s+_D(i, r, j) = s+_d(i, r, j)   if ∃d ∈ D such that d = (·, ·, j)
                  = s(i, r, j)      otherwise

Modifying the scoring function of the parser to force the inclusion or the ex-
clusion of some dependencies has already been used in the literature. It was used
by Koo and Collins (2010) to ignore dependencies whose probability was below a
certain threshold and by Rush and Petrov (2012) in their multi-pass, coarse-to-fine
approach.
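
To make the constrained-parsing mechanism concrete, the following sketch shows how the scoring functions s+_d and s+_D can be wrapped around an existing scoring function. It is a minimal Python illustration assuming a function score(i, r, j) that returns the parser's first-order score for a candidate dependency; it is not the implementation used in the experiments.

    import math

    def make_s_plus_d(score, d):
        """Return s+_d: dependencies competing with d = (i, r, j) get -infinity."""
        i, r, j = d
        def s_plus(i2, r2, j2):
            # A competing dependency has the same dependent j but a different
            # governor or label; its score is forced to -infinity.
            if j2 == j and (i2 != i or r2 != r):
                return -math.inf
            return score(i2, r2, j2)
        return s_plus

    def make_s_plus_D(score, D):
        """Return s+_D for a set D of dependencies that must appear in the parse."""
        by_dependent = {j: (i, r, j) for (i, r, j) in D}
        def s_plus(i2, r2, j2):
            if j2 in by_dependent:
                return make_s_plus_d(score, by_dependent[j2])(i2, r2, j2)
            return score(i2, r2, j2)
        return s_plus

    # Usage: force dependency (2, "SUJ", 1) into the parse.
    toy_score = lambda i, r, j: 1.0
    constrained = make_s_plus_D(toy_score, {(2, "SUJ", 1)})
    print(constrained(3, "OBJ", 1))   # -inf: a competing governor for dependent 1
    print(constrained(2, "SUJ", 1))   # 1.0: the forced dependency keeps its score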

2.3 Confidence Measure

The other extension of graph-based parsers that we use in this work is the computation
of confidence measures on the output of the parser. The confidence measure used is
based on the estimation of posterior probability of a subtree, computed on the k-best list
of parses of a sentence. It is very close to confidence measures used for speech recog-
nition (Wessel et al. 2001) and differs from confidence measures used for parsing, such
as Sánchez-Sáez, Sánchez, and Benedí (2009) or Hwa (2004), which compute confidence
measures for complete parses of a sentence.

Given the k-best parses of a sentence and a subtree ∆ present in at least one
of the k-best parses, let C(∆) be the number of occurrences of ∆ in the k-best
parse set. Then CM(∆), the confidence measure associated with ∆, is computed
as:

    CM(∆) = C(∆) / k


Mirroshandel and Nasr (2011) and Mirroshandel, Nasr, and Roux (2012) provide
more detail on the performance of this measure. Other confidence measures could be
used, such as the edge expectation described in McDonald and Satta (2007), or inside-
outside probabilities.
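
The confidence measure itself is straightforward to compute once a k-best list is available. The short sketch below assumes each parse is represented as a collection of hashable subtrees; this encoding is an assumption of the example, not a detail given in the text.

    def confidence(subtree, kbest_parses):
        """CM(subtree) = C(subtree) / k: fraction of k-best parses containing the subtree."""
        k = len(kbest_parses)
        count = sum(1 for parse in kbest_parses if subtree in parse)
        return count / k

    # Usage: a subtree occurring in 40 of the 100-best parses gets CM = 0.4.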

3. Patching

In this section, we introduce the process of patching a sentence. The general idea of
patching sentence S is to detect in S the occurrences of predefined configurations.
In Section 3.1 we introduce Lexico-Syntactic Configurations, which correspond to the
patches. We describe in Section 3.2 the process of selecting an optimal set of patches
on a sentence (finding an optimal solution is not trivial since patches have scores
and some patches are incompatible). Sections 3.3 and 3.4, respectively, describe two
instances of the patching problem, namely, subcategorization frame selection and se-
lectional constraint satisfaction. Section 3.5 shows how both problems can be solved
together.

3.1 Lexico-Syntactic Configuration

A Lexico-Syntactic Configuration (LSC) is a pair (s, T) where s is a numerical value,
called the score of the LSC, and T is a dependency tree of arbitrary size. Every node of
the dependency tree has five features: a syntactic function, a part of speech, a lemma, a
fledged form, and an index, which represents a position in a sentence. The features of
every node can either be specified or left unspecified. Given an LSC i, s(i) will represent
the score of i and T(i) will represent its tree.

Index features are always unspecified in an LSC. What follows is an example of an
LSC that corresponds to the sentence Jean offre des fleurs à N [Jean offers flowers to N]. The
five features of the dependency tree nodes are represented between square brackets and
are separated by colons.

(0.028, [:V:offrir:offre:](

[SUJ:norte:Jean:Jean:],
[OBJ:norte:fleur:fleurs:],
[A-OBJ:PAG:a:a:](
[OBJ:norte:::])))

An instantiated LSC (ILSC) is an LSC such that all its node features (in particular,
index features) are specified. We will note R(i) as the index of the root of ILSC i and L(i)
as the set of its leaf indices.

Given the sentence Jean offre des fleurs à Marie and the LSC, the instantiation of the

LSC on the sentence gives the following ILSC:

(0.028, [:V:offrir:offre:2](

[SUJ:norte:Jean:Jean:1],
[OBJ:norte:fleur:fleurs:4],
[A-OBJ:PAG:a:a:5](

[OBJ:norte:Marie:Marie:6])))


3.2 Selecting the Optimal Set of ILSCs

Given a sentence S, and a set L of LSCs, we define the set I of all possible instantiations
of elements of L on S. Once the set I has been defined, our aim is to select a subset ˆI
made of compatible3 ILSCs such that

    Î = arg max_{I ⊆ I}  Σ_{c ∈ I} s(c)                                             (2)

As was the case for Equation (1), Equation (2) cannot be solved naively for
combinatorial reasons: The power-set of I contains an exponential number of elements.
Such a problem will be solved using ILP.

In order to solve Equation (2), the set I of all possible instantiations of the LSCs
of set L for a given sentence S must be built. Several methods can be used to compute
such a set, as described in Section 6.3.

We will see in Sections 3.3 and 3.4 two instances of the patching problem. In
Section 3.3, ILSCs will represent SFs, and in Section 3.4, they will represent SCs.

3.3 Subcategorization Frames

A subcategorization frame is a special kind of LSC. Its root is a predicate and its leaves
are the arguments of the predicate.

Here is an example of an SF for the French verb donner [to give]. It is a ditransitive
frame; it has both a direct object and an indirect one introduced by the preposition à, as
in Jean donne un livre à Marie [Jean gives a book to Marie].

(0.028, [:V:donner::] (

[SUJ:norte:::],
[OBJ:norte:::],
[A-OBJ:PAG:a:a:] (
[OBJ:norte:::])))

The score of an SF T must reflect the tendency of a verb V (the root of the SF) to
select the given SF. This score sSF is simply the conditional probability P(T|V), estimated
on the data sets that will be described in Sections 7 and 8.
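
As a rough illustration, this conditional probability can be estimated by relative frequency from the (verb, frame) occurrences extracted from automatically parsed data (the extraction itself is described in Section 6.2). The sketch below assumes the occurrences are available as a simple list of pairs; this input format is an assumption of the example.

    from collections import Counter

    def estimate_sf_scores(occurrences):
        """Estimate sSF(T, V) = P(T | V) by relative frequency.

        `occurrences` is a list of (verb_lemma, frame) pairs extracted from
        parsed text (a hypothetical input format).
        """
        verb_counts = Counter(v for v, _ in occurrences)
        pair_counts = Counter(occurrences)
        return {(v, f): n / verb_counts[v] for (v, f), n in pair_counts.items()}

    # Usage: if "donner" occurs 80 times with SUJ:OBJ:A-OBJ and 20 times with
    # SUJ:OBJ, the ditransitive frame gets score 0.8.
    scores = estimate_sf_scores([("donner", "SUJ:OBJ:A-OBJ")] * 80 + [("donner", "SUJ:OBJ")] * 20)
    print(scores[("donner", "SUJ:OBJ:A-OBJ")])  # 0.8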

3.3.1 Selecting the Optimal Set of Subcategorization Frames. Given a sentence S of length
N and the set I of Instantiated Subcategorization Frames (ISFs) over S, the selection
of the optimal set Î ⊆ I is done by solving the general Equation (2), using ILP, with
constraints that are specific to SF selection. The exact formulation of the ILP program is
now described.

• Definition of the variables

  α^j_i = 1 if word number i is the predicate of ISF number j, and α^j_i = 0
  otherwise.

3 We will not say more at this point about the compatibility of two ILSCs. The exact definition of

incompatibility will be given in Sections 3.3 y 3.4.


  β^j_i = 1 if word number i is an argument of ISF number j, and β^j_i = 0
  otherwise.

  Variables α^j_i and β^j_i are not defined for all values of i and j. A
  variable α^j_i is defined only if word i can be the predicate of ISF j ∈ I.

• Definition of the constraints

  a word is the predicate of at most one ISF:

      ∀i ∈ {1, . . . , N}    Σ_{j ∈ I} α^j_i ≤ 1

  a word cannot be an argument of more than one ISF:

      ∀i ∈ {1, . . . , N}    Σ_{j ∈ I} β^j_i ≤ 1

  for an ISF to be selected, both its predicate and all its arguments
  (i.e., L(j)) must be selected:

      ∀j ∈ {1, . . . , |I|}    |L(j)| α^j_{R(j)} − Σ_{l ∈ L(j)} β^j_l = 0

• Definition of the objective function

      max Σ_{j ∈ I} α^j_{R(j)} sSF(j)

Let us illustrate this process on the sentence Jean rend le livre qu'il a emprunté à la
bibliothèque. [Jean returns the book that he has borrowed from the library.], which contains an
ambiguous prepositional phrase attachment: the prepositional phrase à la bibliothèque
[from the library] can be attached either to the verb rendre [return] or to the verb emprunter
[borrow]. The set I is composed of the four following ISFs:

1 (0.2,[:V:rend:rendre:2](
[SUJ:norte:Jean:Jean:1],
[OBJ:norte:livre:livre:4]))

2 (0.4,[:V:rend:rendre:2](
[SUJ:norte:Jean:Jean:1],
[OBJ:norte:livre:livre:4],
[A-OBJ:PAG:a:a:9](

[OBJ:norte:biblio.:biblio.:11])))

3 (0.3,[:V:emprunte:emprunter:8] (

[SUJ:norte:il:il:6],
[OBJ:norte:qu’:que:5]))

4 (0.6,[:V:emprunte:emprunter:8] (

[SUJ:norte:il:il:6],
[OBJ:norte:qu’:que:5],
[A-OBJ:PAG:a:a:9] (

[OBJ:norte:biblio.:biblio.:11])))


The actual ILP program is the following:

• Definition of the variables

  α^1_2, α^2_2      verb rendre can select ISF 1 or 2
  α^3_8, α^4_8      verb emprunter can select ISF 3 or 4
  β^1_1, β^2_1      noun Jean can be the subject of ISF 1 or 2
  β^1_4, β^2_4      noun livre can be the object of ISF 1 or 2
  β^2_11            noun bibliothèque can be the indirect object of ISF 2
  β^3_5, β^4_5      relative pronoun que can be the object of ISF 3 or 4
  β^3_6, β^4_6      pronoun il can be the subject of ISF 3 or 4
  β^4_11            noun bibliothèque can be the indirect object of ISF 4

• Definition of the constraints

  a word is the predicate of at most one ISF:

      α^1_2 + α^2_2 ≤ 1      verb rendre cannot be the predicate of both ISF 1 and 2
      α^3_8 + α^4_8 ≤ 1      verb emprunter cannot be the predicate of both ISF 3 and 4

  a word cannot be an argument of more than one ISF:

      β^1_1 + β^2_1 ≤ 1      noun Jean cannot be the subject of both ISF 1 and 2
      β^1_4 + β^2_4 ≤ 1      noun livre cannot be the object of both ISF 1 and 2
      β^3_5 + β^4_5 ≤ 1      relative pronoun que cannot be the object of both ISF 3 and 4
      β^3_6 + β^4_6 ≤ 1      pronoun il cannot be the subject of both ISF 3 and 4
      β^2_11 + β^4_11 ≤ 1    noun bibliothèque cannot be the object of both ISF 2 and 4

  for an ISF to be selected, both its predicate and all its arguments
  must be selected:

      2α^1_2 − (β^1_1 + β^1_4) = 0             ISF 1 must have a subject and an object
      3α^2_2 − (β^2_1 + β^2_4 + β^2_11) = 0    ISF 2 must have a subject, an object, and an indirect object
      2α^3_8 − (β^3_5 + β^3_6) = 0             ISF 3 must have a subject and an object
      3α^4_8 − (β^4_5 + β^4_6 + β^4_11) = 0    ISF 4 must have a subject, an object, and an indirect object

• Definition of the objective function

      max(0.2 α^1_2 + 0.4 α^2_2 + 0.3 α^3_8 + 0.6 α^4_8)

The problem admits eight solutions, represented in Figure 1. Each column
corresponds to a variable. Black dots (•) correspond to the value 1 and white dots (◦)
to the value 0. The optimal solution is solution number 7, for which verb rendre selects
ISF 1 and verb emprunter selects ISF 4. In this solution, the prepositional phrase à la
bibliothèque is attached to the verb rendre.
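
For readers who want to reproduce this selection step, the program above can be written down directly with an off-the-shelf ILP library. The sketch below encodes the four ISFs of the example with PuLP, used here as a stand-in for the SCIP solver mentioned in Section 6; variable names follow the text.

    from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary

    # Each ISF: (score, predicate index, argument indices).
    isfs = {
        1: (0.2, 2, [1, 4]),        # rendre + SUJ, OBJ
        2: (0.4, 2, [1, 4, 11]),    # rendre + SUJ, OBJ, A-OBJ
        3: (0.3, 8, [5, 6]),        # emprunter + SUJ, OBJ
        4: (0.6, 8, [5, 6, 11]),    # emprunter + SUJ, OBJ, A-OBJ
    }

    prob = LpProblem("isf_selection", LpMaximize)
    alpha = {j: LpVariable(f"alpha_{j}_{p}", cat=LpBinary) for j, (_, p, _) in isfs.items()}
    beta = {(j, a): LpVariable(f"beta_{j}_{a}", cat=LpBinary)
            for j, (_, _, args) in isfs.items() for a in args}

    # A word is the predicate of at most one ISF.
    for p in {p for (_, p, _) in isfs.values()}:
        prob += lpSum(alpha[j] for j, (_, pj, _) in isfs.items() if pj == p) <= 1

    # A word cannot be an argument of more than one ISF.
    for a in {a for (_, _, args) in isfs.values() for a in args}:
        prob += lpSum(beta[(j, aj)] for j, (_, _, args) in isfs.items()
                      for aj in args if aj == a) <= 1

    # For an ISF to be selected, its predicate and all its arguments are selected.
    for j, (_, _, args) in isfs.items():
        prob += len(args) * alpha[j] - lpSum(beta[(j, a)] for a in args) == 0

    # Objective: total score of the selected ISFs.
    prob += lpSum(score * alpha[j] for j, (score, _, _) in isfs.items())

    prob.solve()
    print([j for j in isfs if alpha[j].value() == 1])  # expected: [1, 4]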

3.4 Selectional Constraints

A selectional constraint (SC) is another kind of LSC. Its dependency tree is either made
of a single dependency or of two dependencies in a (daughter, mother, grandmother)
configuration. An SC models the tendency of two words (the root and the leaf) to co-
occur in a specific syntactic configuration.


      α^1_2 β^1_1 β^1_4  α^2_2 β^2_1 β^2_4 β^2_11  α^3_8 β^3_5 β^3_6  α^4_8 β^4_5 β^4_6 β^4_11  score
  1     ◦     ◦     ◦      ◦     ◦     ◦     ◦       ◦     ◦     ◦      ◦     ◦     ◦     ◦     0.0
  2     •     •     •      ◦     ◦     ◦     ◦       ◦     ◦     ◦      ◦     ◦     ◦     ◦     0.2
  3     ◦     ◦     ◦      •     •     •     •       ◦     ◦     ◦      ◦     ◦     ◦     ◦     0.4
  4     ◦     ◦     ◦      ◦     ◦     ◦     ◦       •     •     •      ◦     ◦     ◦     ◦     0.3
  5     ◦     ◦     ◦      ◦     ◦     ◦     ◦       ◦     ◦     ◦      •     •     •     •     0.6
  6     •     •     •      ◦     ◦     ◦     ◦       •     •     •      ◦     ◦     ◦     ◦     0.5
  7     •     •     •      ◦     ◦     ◦     ◦       ◦     ◦     ◦      •     •     •     •     0.8
  8     ◦     ◦     ◦      •     •     •     •       •     •     •      ◦     ◦     ◦     ◦     0.7

Figure 1
Representation of the eight solutions to the ILP problem.

A set of four SC patterns, described here, has been defined. The first two describe
a subject and an object selectional constraint, respectively. Patterns 3 and 4, respectively,
describe selectional constraints on an indirect object introduced by the preposition de
and à:

([:V:::]([SUJ:norte:::]))
([:V:::]([OBJ:norte:::]))
([:V:::]([DE-OBJ:PAG:de::]([OBJ:norte:::])))
([:V:::]([A-OBJ:PAG:a::]([OBJ:norte:::])))

Three important formal features distinguish SFs and SCs:

• The domain of locality of SFs usually exceeds the domain of locality of SCs.

• SCs model bilexical phenomena (the tendency of two words to co-occur
  in a specific syntactic configuration) whereas SFs model unilexical
  phenomena (the tendency of one word to select a specific SF).

• The number of different SC patterns is set a priori whereas the number of
  SF patterns is data driven.

Note that SC patterns are described at a surface syntactic level and, in the case
of control and raising verbs, the SC pattern will be defined over the subject and the
control/raising verb and not the embedded verb—although the selectional constraint is
between the subject and the embedded verb. In the event of coordination of subject or
object, only one of the coordinated elements will be extracted.

The score of an SC should reflect the tendency of the root lemma l_r and the leaf
lemma l_l to appear together in configuration C. It should be maximal if, whenever l_r oc-
curs as the root of configuration C, the leaf position is occupied by l_l and, symmetrically,
if whenever l_l occurs as the leaf of configuration C, the root position is occupied by l_r.
A function that conforms to such a behavior is the following:

    sSC(C, l_r, l_l) = 1/2 ( C(C, l_r, l_l) / C(C, l_r, ∗) + C(C, l_r, l_l) / C(C, ∗, l_l) )

where C(C, l, ∗) (respectively, C(C, ∗, l)) is the number of occurrences of configuration C
with lemma l as a root (respectively, leaf) and C(C, l_r, l_l) is the number of occurrences of
configuration C with lemma l_r as a root and lemma l_l as a leaf, simultaneously.


This function takes its values between 0 (l_r and l_l never co-occur) and 1 (l_r and l_l
always co-occur). It is close to pointwise mutual information (Church and Hanks 1990)
but takes its values between 0 and 1.
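
The score can be computed from co-occurrence counts collected on automatically parsed text (Section 6.2). The following sketch assumes the counts are stored in a dictionary keyed by (configuration, root lemma, leaf lemma); the counts in the usage example are invented for illustration.

    def s_sc(counts, config, root, leaf):
        """sSC(C, l_r, l_l) = 1/2 * (C(C,l_r,l_l)/C(C,l_r,*) + C(C,l_r,l_l)/C(C,*,l_l))."""
        joint = counts.get((config, root, leaf), 0)
        root_total = sum(n for (c, r, _), n in counts.items() if c == config and r == root)
        leaf_total = sum(n for (c, _, l), n in counts.items() if c == config and l == leaf)
        if joint == 0 or root_total == 0 or leaf_total == 0:
            return 0.0
        return 0.5 * (joint / root_total + joint / leaf_total)

    # Usage (invented counts): emprunter governs bibliothèque in the V-à-N
    # configuration in 3 of its 5 occurrences as a root, and bibliothèque occurs
    # 4 times as a leaf of that configuration, 3 of them with emprunter.
    counts = {
        ("VaN", "emprunter", "bibliotheque"): 3,
        ("VaN", "emprunter", "livre"): 2,
        ("VaN", "rendre", "bibliotheque"): 1,
    }
    print(s_sc(counts, "VaN", "emprunter", "bibliotheque"))  # 0.5 * (3/5 + 3/4) = 0.675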

3.4.1 Selecting the Optimal Set of Selectional Constraints. Given a sentence S of length N
and the set I′ of Instantiated Selectional Constraints (ISCs) over S, the selection of the
optimal set Î′ ⊆ I′ is framed as the following ILP program.

• Definition of the variables

  γ^j_i = 1 if word number i is the root of ISC number j, and γ^j_i = 0
  otherwise.
  δ^j_i = 1 if word number i is the leaf of ISC number j, and δ^j_i = 0
  otherwise.

• Definition of the constraints

  a word cannot be the leaf of more than one ISC:

      ∀i ∈ {1, . . . , N}    Σ_{j ∈ I′} δ^j_i ≤ 1

  for an ISC j to be selected, both its root (i.e., R(j)) and its leaf (i.e., L(j)) must
  be selected:

      ∀j ∈ {1, . . . , |I′|}    γ^j_{R(j)} − δ^j_{L(j)} = 0

• Definition of the objective function

      max Σ_{j ∈ I′} γ^j_{R(j)} sSC(j)

Let us illustrate the selection process on the same sentence (Jean rend le livre qu'il a
emprunté à la bibliothèque). A set of four ISCs is defined:

5 (0.2,[:V:rend:rendre:2]

([SUJ:norte:Jean:Jean:1]))

6 (0.2,[:V:rend:rendre:2]

([OBJ:norte:livre:livre:4]))

7 (0.4,[:V:rend:rendre:2]

([A-OBJ:PAG:a:a:9]

([OBJ:norte:biblio.:biblio.:11])))

8 (0.6,[:V:emprunte:emprunter:8]

([A-OBJ:PAG:a:a:9]

([OBJ:norte:biblio.:biblio.:11])))

The actual ILP program is the following:

• Definition of the variables

  γ^5_2, γ^6_2, γ^7_2    verb rendre can be the root of ISC 5, 6, and 7
  γ^8_8                  verb emprunter can be the root of ISC 8
  δ^5_1                  noun Jean can be the leaf of ISC 5
  δ^6_4                  noun livre can be the leaf of ISC 6
  δ^7_11, δ^8_11         noun bibliothèque can be the leaf of ISC 7 or 8


• Definition of the constraints

  a word cannot be the leaf of more than one ISC:

      δ^5_1 ≤ 1              noun Jean can only be the leaf of ISC 5
      δ^6_4 ≤ 1              noun livre can only be the leaf of ISC 6
      δ^7_11 + δ^8_11 ≤ 1    noun bibliothèque cannot be the leaf of both ISC 7 and 8

  for an ISC to be selected, both its root and its leaf must be selected:

      γ^5_2 − δ^5_1 = 0      ISC 5 must have a root and a leaf
      γ^6_2 − δ^6_4 = 0      ISC 6 must have a root and a leaf
      γ^7_2 − δ^7_11 = 0     ISC 7 must have a root and a leaf
      γ^8_8 − δ^8_11 = 0     ISC 8 must have a root and a leaf

• Definition of the objective function

      max(0.2 γ^5_2 + 0.2 γ^6_2 + 0.4 γ^7_2 + 0.6 γ^8_8)

The problem admits twelve solutions, represented in Figure 2. The optimal solution is
solution number 12, in which the prepositional phrase à la bibliothèque is attached to the
verb emprunter.
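
As with the subcategorization frames, this program can be written with an ILP library; the sketch below encodes ISCs 5-8 with PuLP (again as a stand-in for SCIP) and recovers solution 12.

    from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary

    # Each ISC: (score, root index, leaf index).
    iscs = {
        5: (0.2, 2, 1),     # rendre -> Jean
        6: (0.2, 2, 4),     # rendre -> livre
        7: (0.4, 2, 11),    # rendre -> bibliothèque
        8: (0.6, 8, 11),    # emprunter -> bibliothèque
    }

    prob = LpProblem("isc_selection", LpMaximize)
    gamma = {j: LpVariable(f"gamma_{j}", cat=LpBinary) for j in iscs}
    delta = {j: LpVariable(f"delta_{j}", cat=LpBinary) for j in iscs}

    # A word cannot be the leaf of more than one ISC.
    for leaf in {l for (_, _, l) in iscs.values()}:
        prob += lpSum(delta[j] for j, (_, _, lj) in iscs.items() if lj == leaf) <= 1

    # For an ISC to be selected, both its root and its leaf must be selected.
    for j in iscs:
        prob += gamma[j] - delta[j] == 0

    # Objective: total score of the selected ISCs.
    prob += lpSum(score * gamma[j] for j, (score, _, _) in iscs.items())

    prob.solve()
    print(sorted(j for j in iscs if gamma[j].value() == 1))  # expected: [5, 6, 8]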

3.5 Combining Subcategorization Frames and Selectional Constraints

SF selection and SC satisfaction can be combined together in a single ILP program that
combines the variables and constraints of the two problems and adds a new constraint
that takes care of the incompatibilities of SCs and SFs.

Given a sentence S of length N, the set I of ISFs, and the set I′ of ISCs, the selection
of the optimal set Î′′ ⊆ I ∪ I′ is the solution of the following program:

• Definition of the variables

  α^j_i, β^j_i, γ^j_i, δ^j_i

• Definition of the constraints

  All the constraints of ISF selection and ISC satisfaction.

       γ^5_2 δ^5_1   γ^6_2 δ^6_4   γ^7_2 δ^7_11   γ^8_8 δ^8_11   score
   1     ◦     ◦       ◦     ◦       ◦     ◦        ◦     ◦      0.0
   2     •     •       ◦     ◦       ◦     ◦        ◦     ◦      0.2
   3     ◦     ◦       •     •       ◦     ◦        ◦     ◦      0.2
   4     ◦     ◦       ◦     ◦       •     •        ◦     ◦      0.4
   5     ◦     ◦       ◦     ◦       ◦     ◦        •     •      0.6
   6     •     •       •     •       ◦     ◦        ◦     ◦      0.4
   7     •     •       ◦     ◦       •     •        ◦     ◦      0.6
   8     •     •       ◦     ◦       ◦     ◦        •     •      0.8
   9     ◦     ◦       •     •       •     •        ◦     ◦      0.6
  10     ◦     ◦       •     •       ◦     ◦        •     •      0.8
  11     •     •       •     •       •     •        ◦     ◦      0.8
  12     •     •       •     •       ◦     ◦        •     •      1.0

Figure 2
Representation of the 12 solutions to the ILP problem.


  Incompatible ISFs and ISCs cannot be selected together.
  An ISF j and an ISC j′ are not compatible if they share a common
  leaf (i′) but have different roots (i ≠ i′′). Such a constraint can be
  modeled by the following inequality:

      ∀i, i′, i′′, j, j′ s.t. i ≠ i′′:    2α^j_i + β^j_{i′} + 2γ^{j′}_{i′′} + δ^{j′}_{i′} ≠ 5

• Definition of the objective function

      max Σ_{j ∈ I′} γ^j_{R(j)} sSC(j) + Σ_{j ∈ I} α^j_{R(j)} sSF(j)

The actual ILP program is too large and redundant with the two preceding ones to
be exposed fully here. Its set of variables and constraints is the union of the variables
and the constraints of the preceding ones, plus the two following ones:

      2α^2_2 + β^2_11 + 2γ^8_8 + δ^8_11 ≠ 5    ISF 2 and ISC 8 are incompatible
      2α^4_8 + β^4_11 + 2γ^7_2 + δ^7_11 ≠ 5    ISF 4 and ISC 7 are incompatible

Definition of the objective function:

      max(0.2 γ^5_2 + 0.2 γ^6_2 + 0.4 γ^7_2 + 0.6 γ^8_8 + 0.2 α^1_2 + 0.4 α^2_2 + 0.3 α^3_8 + 0.6 α^4_8)

The problem admits 94 solutions. The optimal solution is made of ISFs 1 and 4, and
ISCs 5, 6, and 8.

4. Combining Parsing and Patching

Two processes have been described in Sections 2 and 3. In the first one, parsing is based
on dynamic programming and produces a complete parse of a sentence S, whereas in
the second, patching is based on ILP and produces a partial parse of S: the set Î′′ made
of ISCs and ISFs.

We would like now to produce a parse for sentence S that combines constraints
coming from parsing and patching. We have exposed in Section 1 the reasons why we
chose to keep these two processes separated and to combine their solutions to produce
a single parse. The whole process is composed of three steps:

1. Sentence S is first parsed using a first-order parser and the set of k-best
   parses is produced in order to generate the set I of ISFs and the set I′ of
   ISCs.

2. Patching is then performed and computes the optimal set Î′′ ⊆ I ∪ I′ of
   ISFs and ISCs.

3. A new scoring function s+_{Î′′} is then computed and a second-order parser
   produces a parse tree T̂ that preserves the ISFs and ISCs computed in
   Step 2.
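
Schematically, the three steps can be glued together as in the sketch below. The four callables are hypothetical names standing for the components described in Sections 2 and 3; only the control flow is meant to be informative.

    def parse_and_patch(sentence, parse_kbest, generate_candidates, patch,
                        parse_with_constraints, k=100):
        """Glue code for the three-step pipeline (hypothetical component names)."""
        # Step 1: first-order k-best parsing and candidate generation (Section 6.3).
        kbest = parse_kbest(sentence, k)
        candidates = generate_candidates(sentence, kbest)   # the set I of ISFs and ISCs
        # Step 2: patching selects the optimal compatible subset using ILP (Section 3).
        selected = patch(sentence, candidates)               # the set I-hat''
        # Step 3: second-order parsing constrained to keep the selected ISFs and ISCs,
        # through the modified scoring function s+ of Section 2.2.
        return parse_with_constraints(sentence, selected)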


This setting is strongly biased in favor of the patching process. Because of the nature
of the scoring function s+_{Î′′}, the ISFs and ISCs computed during the patching step are
kept in the final solution even if considered unlikely by the parser. In practice, this
solution does not perform well.

In order to let the parser influence the solution of the patching process, the confi-
dence measure CM, defined in Section 2, is introduced in the objective function of the
ILP. Recall that CM(∆) indicates how confident the parser is about sub-tree ∆. This
quantity is added to the scores of SFs and SCs introduced in Sections 3.3 and 3.4.

The new scoring functions ŝSF and ŝSC are defined as follows:

    ŝSF(∆) = (1 − µ1) sSF(∆) + µ1 CM(∆)
    ŝSC(∆) = (1 − µ2) sSC(∆) + µ2 CM(∆)

where the values of µ1 and µ2 are determined experimentally on a development set. In
the definition of ŝSF and ŝSC, the first component (sSF and sSC) can be interpreted as a
lexical score and the second one (CM) as a syntactic score.
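
The interpolation itself is a one-liner; the sketch below assumes the lexical score and the confidence measure have already been computed and that µ has been tuned on a development set.

    def interpolate(lexical_score, confidence, mu):
        """Interpolated score (1 - mu) * s(delta) + mu * CM(delta)."""
        return (1 - mu) * lexical_score + mu * confidence

    # Usage: an SF with lexical score 0.4 and parser confidence 0.9, with mu1 = 0.5.
    print(interpolate(0.4, 0.9, 0.5))  # 0.65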

After patching is performed and the set Î′′ is computed, the scoring function of the
second-order parser is modified in order to force the ISFs and ISCs of Î′′ to appear in
the parser's solution.

As explained in Section 2.2, the modification of the scoring function modifies only
the scores of first-order factors (single dependencies). The modification amounts to
setting to −∞ the scores of the dependencies that compete with those dependencies
that occur in Î′′. At this point, the set Î′′ is just considered as a set of first-order
factors, whatever the ILSC they are a member of. The scores of second-order factors
are not modified. This solution works because the second-order parser uses both first-
and second-order factors. There is no need to modify the scoring function of the second-
(or higher) order factors since the first-order factors that they are composed of will be
discarded anyway, because of their modified scores.

Let us illustrate the process on the sentence Jean commande une glace à la serveuse.
[Jean orders an ice-cream from the waitress.]. The parser output for this sentence
wrongly attaches the preposition à to the noun glace whereas it should be attached to the
verb commander. Suppose that, after patching, the set Î′′ is made of the ISF (commander,
SBJ:N OBJ:N AOBJ:N): the verb commander takes an indirect object introduced by prepo-
sition à, and the ISC (VaN, commander, serveuse): the noun serveuse is the indirect object of
the verb commander. At this point Î′′ is viewed as the set of dependencies {(commande,
SUJ, Jean), (commande, OBJ, glace), (commande, AOBJ, à), (à, OBJ, serveuse)}. The scores of
all competing dependencies, which are all dependencies that have either Jean, glace, à, or
serveuse as a dependent, are set to −∞ and will be discarded from the parser solution.

5. Complexity of the Patching Problem

The two-stage architecture proposed in this paper—parsing using dynamic program-
ming; then patching using ILP—suffers from a potential efficiency problem because
of the complexity of ILP, which is NP-complete (Karp 1972) in the general case. Our
system therefore has an exponential worst-case complexity. Two questions then arise:


Could patching be solved using a polynomial algorithm? How does a general ILP solver
perform, in practice, on our data?

We will answer these questions in two steps. First, we will prove in this section that
patching is actually NP-complete: Building the optimal solution requires exponential
time in the worst case. Second, we will show in Section 7 that, due to the size of the
instances at hand, the processing time using a general ILP solver is reasonable.

In order to prove that patching is NP-complete, we will first show that some
instances of patching can be expressed as a Set Packing problem (SP), known to be
NP-complete (Karp 1972). The representation of a patching problem as a SP problem
will be called patching as a Set Packing problem (P-SP). We will then reduce SP to P-SP
by showing that any instance of SP can be represented as an instance of P-SP, que lo hará
prove the NP-completeness of P-SP and therefore the NP-completeness of patching.

The Set Packing decision problem is the following: Given a finite set S and a list U of
subsets of S, are there k subsets in U that are pairwise disjoint? The optimization version
of SP, the Maximum Set Packing problem (MSP), looks for the maximum number of
pairwise disjoint sets in U.

We will show that a subset of the instances of patching can be expressed as an MSP.

More precisely, we restrict ourselves to instances such that:

• the set I′ of Instantiated Selectional Constraints is empty (we only deal with
  a set I of ISFs), and

• the scores of all ISFs in I are equal.

The reason why we limit ourselves to a subset of patching is that our goal is to
reduce any SP problem to a patching problem. Reducing an SP problem to some special
cases of patching is enough to prove NP-completeness of patching. Patching in the
general case might be represented as an MSP problem but this is not our goal.

Recall that patching with ISF is subject to three constraints:

C1: a word cannot be the predicate of more than one ISF

C2: a word cannot be the argument of more than one ISF

C3: for an ISF to be selected, both its predicate and its arguments must be selected

Given a sentence S = w1, . . . , wn and a set I of ISFs on S, let us associate with S the
set S = {1, . . . , norte}. As a first approximation, let us represent an ISF of I as the subset
of S composed of the indices of the arguments of the ISF. The list of ISF I is therefore
represented as a list U of subsets of S. Solving the MSP problem for (S, Ud. ) produces the
optimal solution ˆU that satisfies C2 and partly C3. C2 is satisfied because the elements
of ˆU are pairwise disjoint. C3 is partly satisfied because all the arguments of an ISF are
selected altogether, but not the predicate.

The most straightforward way to take C1 into account is to add the index of an
ISF predicate in the subset corresponding to the ISF. Desafortunadamente, this solution is not
correcto. It will correctly constrain a word to be the predicate of at most one ISF but it will
incorrectly prevent a word from acting as a predicate of an ISF and as an argument of
another ISF. In order to solve this problem, we will associate two integers to a word of
la frase, one representing the word as a predicate and the other representing it as
un argumento. Given a sentence of length n and the word of index i, i will represent
the word as an argument and i + n will represent the word as a predicate. An ISF


having word w_p as a predicate and words w_{a1} . . . w_{ak} as arguments is represented as
the subset {a1, . . . , ak, p + n}. The solution to the MSP problem using this representation
clearly verifies our three constraints.

Given a sentence S = w1, . . . , wn and a set of ISFs I, we can build the set S =
{1, . . . , N}, with N ≥ 2n (see footnote 4), and the list U such that any element of U corresponds to an
ISF of I, using the method described here. The solution to the MSP on input (S, U) is
equivalent to the solution of patching on input (S, I). Such an MSP problem will be
called a P-MSP problem (patching as Maximum Set Packing). Recall that the decision
version of the problem is called P-SP (patching as Set Packing).

In order to prove the NP-completeness of patching, we will reduce any instance
of SP to an instance of P-SP. Given the set S = {1, . . . , n} and the list U = (U1, . . . , Um)
of subsets of S, let us build the set S′ = {1, . . . , n + m} and the list U′ = (U′_1, . . . , U′_m)
such that U′_i = Ui ∪ {i + n}. The pair (S′, U′) is a valid instance of P-SP. The solution
of P-SP on input (S′, U′) corresponds to the solution of SP on input (S, U) (see footnote 5). If there
was a polynomial algorithm to solve patching, there would be a polynomial algorithm
to solve SP. SP would therefore not be NP-complete, which is incorrect. Patching is
therefore NP-complete.
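
The encoding used in the reduction is easy to state in code. The sketch below builds, for each ISF, the set {a1, . . . , ak, p + n} and checks that a candidate selection is pairwise disjoint, which is exactly the Set Packing condition; the (predicate index, argument indices) representation of an ISF is an assumption of the example.

    from itertools import combinations

    def isf_to_set(isf, n):
        """Encode an ISF as {a1, ..., ak, p + n}, with p the predicate index."""
        predicate, arguments = isf
        return frozenset(arguments) | {predicate + n}

    def is_valid_packing(selected_sets):
        """A selection of ISFs is admissible iff the encoded sets are pairwise disjoint."""
        return all(a.isdisjoint(b) for a, b in combinations(selected_sets, 2))

    # Usage, with the sentence of Section 3.3.1 (n = 11 words):
    n = 11
    isf1 = isf_to_set((2, [1, 4]), n)        # rendre + Jean, livre
    isf4 = isf_to_set((8, [5, 6, 11]), n)    # emprunter + que, il, bibliothèque
    isf2 = isf_to_set((2, [1, 4, 11]), n)    # rendre + Jean, livre, bibliothèque
    print(is_valid_packing([isf1, isf4]))    # True  (constraints C1-C3 hold)
    print(is_valid_packing([isf2, isf4]))    # False (word 11 is an argument twice)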

6. Experimental Set-up

We describe in this section and the two following ones the experimental part of this
work. This section concerns language-independent aspects of the experimental set-up,
and Sections 7 and 8 describe the experiments and results obtained, respectively, on
French and English data.

The complete experiments involve two offline and three online processes. The
offline processes are:

1. Training the parser on a treebank.

2. Extracting SFs and SCs from raw data and computing their scores. The
   result of this process is the set L.

The online processes are:

1. Generating the set I of ISFs and ISCs on a sentence S, using L. We will call
   this process candidate generation.

2. Patching sentence S with I, using ILP. The result of this process is the set Î.

3. Parsing S under the constraints defined by Î.

The remaining part of this section is organized as follows: Section 6.1 gives some
details about the parser that we use; Section 6.2 describes some aspects of the extraction
of SCs and SFs from raw data; Section 6.3 focuses on the candidate generation process.


4 At this point, we can impose that N = 2n. We will need the condition N ≥ 2n later in order to prove the
reducibility of SP to P-SP. The condition N ≥ 2n is not harmful: if N > 2n, the range [2n + 1, . . . , N] will
simply not be used when representing a patching instance as a P-SP instance.

5 It is important to note that in the set U′_i = Ui ∪ {i + n}, adding the new element i + n does not introduce
any new constraint because this element only occurs in the set U′_i and will not conflict with any element
of another set U′_k with k ≠ i.


We have decided not to devote specific sections to the processes of patching and
parsing under constraints. Patching has been described in full in Section 3. The ILP
solver we have used is SCIP (Achterberg 2009). Parsing under constraints has been
described in Section 2.2.

6.1 Parser

The parser used for our experiments is the second-order graph-based parser implemen-
tation of Bohnet (2010). We have extended the decoding part of the parser in order to
implement constrained parsing and k-best parses generation. The first-order extension
is based on algorithm 3 of Huang and Chiang (2005), and the second-order extension
relies on a non-optimal greedy search algorithm.

The parser uses the 101 feature templates, described in Table 4 of Bohnet (2010).
Each feature template describes the governor and dependent of a labeled dependency,
called the target dependency, as well as some features of its neighborhood. Every
feature is associated with a score, computed on a treebank. A parse tree score is defined
as the sum of the features it contains and the parser looks for the parse tree with the
highest score.

Feature templates are divided into six sets:

Standard Feature Templates are the standard first-order feature templates. They de-
scribe the governor and the dependent of the target dependency. The governor
and the dependent are described by combinations of their part of speech, lema,
morphological features, and fledged form.

Linear Feature Templates describe the immediate linear neighbors of the governor
and/or the dependent of the target dependency. In such templates, governor,
dependent, and neighbors are represented by their part of speech.

Grandchild Feature Templates are the first type of second-order features. They de-
scribe the structural neighborhood of the target dependency. This neighborhood
is limited to a child of the dependent of the target dependency. Such templates
are suited to model linguistic phenomena such as prepositional attachments.

Linear Grandchild Feature Templates describe the immediate linear neighbors of the

governor, the dependent, and the grandchild of the target dependency.

Sibling Feature Templates are the second type of second-order feature templates. Ellos
describe the structural neighborhood of the target dependency limited to a sibling
of the dependent of the target dependency. Such templates are suited, Por ejemplo,
to prevent the repetition of non-repeatable dependencies.

Linear Sibling Feature Templates describe the immediate linear neighbors of the

governor, and the two siblings.

When run in first-order mode, the parser only uses Standard and Linear features.

All features are used when the parser is run in second-order mode.

Four measures are used to evaluate the parser. The first two are standard whereas
the third and fourth have been specifically defined to monitor the effect of our model
on the performances of the parser:

Labeled Accuracy Score (LAS) of a tree T is the ratio of correct labeled dependencies

in T.


Unlabeled Accuracy Score (UAS) of a tree T is the ratio of correct unlabeled dependen-

cies in T.

Subcategorization Frame Accuracy Score (SFAS) of a tree T is the ratio of verbs in T
that have been assigned their correct subcategorization frame. More precisely,
given the reference parse R for sentence S and the output H of the parser for the
same sentence, all ISFs are extracted from R and H, yielding the two sets RSF and
HSF. SFAS is the recall of HSF with respect to RSF (|RSF ∩ HSF| / |RSF|).

Selectional Constraint Accuracy Score (SCAS) of a tree T is the ratio of correct occur-
rences of SC patterns in T. More precisely, given a gold standard tree R for sentence
S and the output H of the parser for the same sentence, all ISCs are extracted from
R and H, yielding the two sets RSC and HSC. SCAS is the recall of HSC with respect
to RSC.
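
Both SFAS and SCAS are recalls of sets extracted from the hypothesis against sets extracted from the reference, so they share a single computation. The sketch below assumes the extracted ISFs or ISCs are given as hashable tuples; the extraction itself is not shown, and the behavior for sentences with an empty reference set is a convention chosen here.

    def recall(reference_items, hypothesis_items):
        """SFAS / SCAS style score: |reference ∩ hypothesis| / |reference|."""
        reference, hypothesis = set(reference_items), set(hypothesis_items)
        if not reference:
            return 1.0  # convention chosen here for sentences with no reference item
        return len(reference & hypothesis) / len(reference)

    # Usage: two of the three reference ISFs are recovered by the parser.
    print(recall({("rendre", "SUJ:OBJ"), ("emprunter", "SUJ:OBJ:A-OBJ"), ("donner", "SUJ:OBJ")},
                 {("rendre", "SUJ:OBJ"), ("emprunter", "SUJ:OBJ:A-OBJ")}))  # ~0.667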

6.2 Extracting SFs and SCs from Raw Data

The LSC extraction process takes as input raw data and produces a set L composed
of SCs and SFs. The process is straightforward: The corpora are first parsed, then LSC
templates are applied on the parses, using unification, in order to produce SCs and SFs
along with their number of occurrences. These numbers are used to compute selectional
constraints scores (sSC), as defined in Section 3.4, and subcategorization frame scores
(sSF) as described in Section 3.3.

The extraction of lexical co-occurrences from raw data has been a very active direc-
tion of research for many years. Giving an exhaustive description of this body of work is
clearly beyond the aim of this article. The dominant approach for extracting lexical co-
occurrences, as described, Por ejemplo, in Volk (2001), Nakov and Hearst (2005), Pitler
et al. (2010), and Zhou et al. (2011) directly model word co-occurrences on word strings.
Co-occurrences of pairs of words are first collected on raw corpora or n-grams. Basado
on the counts produced, lexical affinity scores are computed. The detection of pairs of
word co-occurrences is generally very simple: It is either based on the direct adjacency
of the words in the string or their co-occurrence in a window of a few words. Bansal
and Klein (2011) and Nakov and Hearst (2005) rely on the same sort of techniques
but use more sophisticated patterns, based on simple paraphrase rules, for identifying
co-occurrences. Our work departs from these approaches by extracting co-occurrences
in specific lexico-syntactic contexts that correspond to our SC patterns. Our approach
allows us to extract more fine-grained phenomena but, being based on automatically
parsed data, it is more error-prone.

Extracting SFs for verbs from raw data has also been an active direction of research
for a long time, dating back at least to the work of Brent (1991) and Manning (1993).
More recently, Messiant et al. (2008) proposed such a system for French verbs. El
method we use for extracting SF is not novel with respect to such work. Our aim was
not to devise new extraction techniques, but merely to evaluate the resource produced
by such techniques for statistical parsing.

6.3 Candidate Generation

The candidate generation process has as input a sentence S and a set L of SCs and SFs.
It produces the set I made of instantiations of elements of L on S. This set constitutes
the input of the ILP solver that will produce the solution Î. The candidate generation


process is very important because it defines the search space of the patching process.
If the search space is not wide enough, it might fail to contain the correct solution. If
it is too wide, the ILP programs generated will be too large to be solved in a reasonable
amount of time.

Several methods can be used in order to perform the candidate generation process.
One could compute I on the linear representation of S enriched with part of speech
tags and lemmas. Bechet and Nasr (2009), Por ejemplo, represent LSCs as finite-state
automata that are matched on S. This method has a tendency to over-generate and
proposes some very unlikely instantiations.

In order to constrain the instantiation process, some syntax can be used. For exam-
ple, LSCs can be matched on a set T of possible parses of S in order to produce I. The
set T itself can be the k-best parses produced by a parser, as in parse reranking (Collins
and Koo 2005; Charniak and Johnson 2005). It can also be built using parse correction
techniques (Hall and Novák 2005; Attardi and Ciaramita 2007; Henestroza and Candito
2011). In the latter approach, a parse tree T of S is first built, using a parser, and parse cor-
rection rules are applied on it in order to build other parses that might correct errors of T.
The solution that we propose uses a parser to produce a list T = {T1, . . . , Tk} of k-
best parses, such as in parse reranking. But this list is merged into a set of dependencies
Dk that is the union of all the dependencies appearing in the k-best trees. The set Dk is
not a tree because a single word can have several governors. This set can be seen as a
superset of the k-best parses. We will call this method sub-parse combination or simply
combination when the context is not ambiguous.
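A minimal sketch of this combination step, assuming each parse is represented as a set of (governor, label, dependent) triples:

    def combine(kbest_parses):
        """Merge a k-best list of dependency trees into the dependency set Dk.

        The result is the plain union of all dependencies, so a word may keep
        several candidate governors; Dk is therefore not a tree.
        """
        dk = set()
        for parse in kbest_parses:
            dk.update(parse)
        return dk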

The main difference between reranking and combination is that the latter can com-
bine dependencies that do not co-occur in a single tree of the k-best list. The search space
that it defines is therefore larger than the k-best list. The set Dk can be compared to the
pruned forest of Boullier, Nasr, and Sagot (2009). Once the set Dk is built, the LSCs of L
are matched on the dependencies of Dk using unification.

In order to compare the search spaces respectively defined by a k-best parse list and
the corresponding Dk set, we define two measures on the k-best parse list. The first one
(Oracle LAS) corresponds to the LAS of the best tree in the k-best list. The second one
(Recall LAS) is the part of labeled dependencies of the reference parse that appear in
Dk. Oracle LAS constitutes an upper bound for reranking and Recall LAS is an upper
bound for combination. The values of these two measures are represented in Figure 3

Figure 3
Oracle and Recall LAS of the k-best trees for first-order (left) and second-order (right) parsing,
computed on FTB TEST.

Figure 4
Recall SFAS and SCAS of the k-best trees for first-order (left) and second-order (right) parsing,
computed on FTB TEST.

for first-order parsing (left) and second-order parsing (right) measured on the test set of
the French Treebank.
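With the same representation of parses as sets of (governor, label, dependent) triples, the two measures can be computed as follows (an illustration of the definitions above, not the evaluation code used to produce the figures):

    def las(candidate, reference):
        """Labeled attachment score of a single tree against the reference parse."""
        return len(set(candidate) & set(reference)) / len(reference)

    def oracle_las(kbest_parses, reference):
        """LAS of the best tree of the k-best list: the upper bound for reranking."""
        return max(las(parse, reference) for parse in kbest_parses)

    def recall_las(kbest_parses, reference):
        """Share of reference dependencies found anywhere in the k-best list,
        i.e. in the combined set Dk: the upper bound for sub-parse combination."""
        dk = set().union(*(set(parse) for parse in kbest_parses))
        return len(dk & set(reference)) / len(reference)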

As one can see in Figure 3, the Recall LAS curve is, of course, always above the
Oracle LAS curve. In the case of first-order parsing, the Oracle curve quickly reaches a
plateau. This is because first-order k-best parses are generally very close to one another,
because of the very local nature of the model. The parses in the k-best list tend to
be combinations of dependencies that appeared in previous parses. The number of
new dependencies created tends to decrease quickly with k. For k = 100, the difference
between the Oracle and the Recall scores is almost equal to 5 puntos. This difference
clearly demonstrates the advantage of using Dk instead of the k-best list.

The results for second-order parsing are less spectacular. The Oracle curve does
not reach an asymptote, as was the case for first-order parsing. New dependencies
are continually being created as k increases. The reason why we have considered first-
order models in this section is that generating a first-order k-best solution can be done
very efficiently; second-order k-best generation is more time-consuming. Performing
candidate generation with a first-order parser is therefore a potentially interesting
método. We will see in the two following sections that first-order k-best generation gave
good results on French but did not perform well on English, where second-order k-best
generation had to be used.

Although Figure 3 gives an estimation of the LAS upper bounds of reranking and
combination, it does not tell us what can be expected in terms of SFAS and SCAS.
Figure 4 answers this question by showing the evolution of Recall SFAS and Recall SCAS
on the same data set. As one can see, for k = 100, with first-order k-best generation,
almost 94% of the ISCs and 93% of the ISFs that appear in the reference can be recovered.
The results for second-order k-best are, respectively, equal to 97% and 96%.

7. Experiments on French

We describe in this section the experiments performed on French and the results ob-
tained. We start, in Section 7.1, with the description of the data used to train the parser
and give an overview of the errors made by the parser. The extraction of SCs and
SFs from raw corpora is described in Section 7.2, which starts with the description
of the corpora that were used, then gives general statistics and coverage results for

Table 1
Decomposition of the French Treebank into training, development, and test sets.

             number of sentences    number of tokens
FTB TRAIN    9,881                  278,083
FTB DEV      1,239                  36,508
FTB TEST     1,235                  36,340

the extracted data. The results of patching and parsing are given in Section 7.3 y
an analysis of the errors is presented in Section 7.4. Section 7.5 analyzes the runtime
performance.

7.1 Parsing

The parser was trained on the French Treebank (Abeillé, Clément, and Toussenel 2003),
transformed into dependency trees by Candito et al. (2009). The size of the treebank and
its decomposition into training, development, and test sets are represented in Table 1.

The parser gave state-of-the-art results for French: the LAS and UAS are, respec-
tively, 88.88% and 90.71%. The SCAS reaches 87.81% and the SFAS is 80.84%, which
means that in 80.84% of the cases, a verb selects its correct subcategorization frame.
These results are for sentences with gold part-of-speech-tags and lemmas.

An analysis of the errors shows that the first cause of errors is prepositional attach-
ments, which account for 23.67% of the errors. A more detailed description of the errors
is given in Table 2. Every line of the table corresponds to one attachment type. The first
column describes the nature of the attachment. The second indicates the frequency of
this type of dependency in the corpus. The third column displays the accuracy for this
type of dependency (the number of dependencies of this type correctly predicted by the
parser divided by the total number of dependencies of this type). The fourth column
shows the contribution of the errors made on this type of dependency to the global
error rate. We have used this error analysis to define the following four SC patterns,
which we call SBJ, OBJ, VdeN, and VaN:

SBJ    ([:V:::]([SUJ:N:::]))
OBJ    ([:V:::]([OBJ:N:::]))
VdeN   ([:V:::]([DE-OBJ:P:de::]([OBJ:N:::])))
VaN    ([:V:::]([A-OBJ:P:a::]([OBJ:N:::])))

The last column of Table 2 indicates the SC pattern that corresponds to a given
type of error. One can see that not all SC patterns are described at the same level of
detail—some of them specify the label of the dependency, if considered important (to
distinguish subject and object, for example), and others specify the lexical nature of the
dependent (for prepositions, for example).
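For illustration, the sketch below instantiates a simplified form of these four patterns on a set of dependencies such as Dk; the actual matching is the field-by-field unification of the LSC formalism, of which only the category, function, and lemma checks that the four patterns constrain are kept here.

    # deps: iterable of (gov, label, dep) position triples; tokens[i]: dict with "pos" and "lemma".

    def instantiate_sc(tokens, deps):
        """Yield (pattern, verb_lemma, noun_lemma) instantiations of SBJ, OBJ, VdeN, and VaN."""
        deps = list(deps)
        for gov, label, dep in deps:
            if tokens[gov]["pos"] == "V" and tokens[dep]["pos"] == "N":
                if label == "SUJ":
                    yield "SBJ", tokens[gov]["lemma"], tokens[dep]["lemma"]
                elif label == "OBJ":
                    yield "OBJ", tokens[gov]["lemma"], tokens[dep]["lemma"]
        for gov, label, prep in deps:
            # verb -> preposition (de / a) -> noun: two chained dependencies
            if tokens[gov]["pos"] == "V" and tokens[prep]["pos"] == "P" and label in ("DE-OBJ", "A-OBJ"):
                for gov2, label2, noun in deps:
                    if gov2 == prep and label2 == "OBJ" and tokens[noun]["pos"] == "N":
                        pattern = "VdeN" if label == "DE-OBJ" else "VaN"
                        yield pattern, tokens[gov]["lemma"], tokens[noun]["lemma"]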

7.2 Extracting SFs and SCs

In order to automatically produce the set L of SCs and SFs, we gathered a raw corpus
that we have parsed using the parser described in Section 7.1. The corpus is actually a
collection of three corpora of slightly different genres. The first one (AFP) is a collection
of news reports of the French press agency Agence France Presse; the second (EST REP)
is a collection of newspaper articles from a local French newspaper: l'Est Républicain.

Table 2
The seven most common error types made by the baseline parser. For each type of error, we
displayed its frequency, the parser accuracy, the impact on global errors, and the SC pattern that
corresponds to the error type.

dependency       freq.   acc.    contrib.   SC pattern
N—mod→N          1.50    72.23   2.91
V—∗→à            0.88    69.11   2.53       VaN
V—suj→N          3.43    93.03   2.53       SBJ
N—coord→CC       0.77    69.78   2.05
N—∗→de           3.70    92.07   2.05
V—∗→de           0.66    74.68   1.62       VdeN
V—obj→N          2.74    90.43   1.60       OBJ

The third (WIKI) is a collection of articles from the French Wikipedia. The sizes of the
different corpora are detailed in Table 3.

The raw corpora were first tokenized, POS-tagged, and lemmatized with the
MACAON tool suite (Nasr et al. 2011), then parsed in order to get the most likely parse
for every sentence. The SFs and SCs were then extracted from the parses and their scores
computed. In the two following sections, we give some statistics on the SFs and SCs
produced.

7.2.1 Extracted SF. As described in Section 3.3, an SF is a special kind of LSC. Its
root is a predicate and its leaves are the arguments of the predicate. Some linguistic
constraints are defined over SF. The first one is the category of the root, which must
be either a finite tense verb (V), an infinitive form (VINF), a past participle (VPP), o
a present participle (VPR). The second is the set of syntactic functions that introduce
arguments of the predicate; they can either be a subject (SUJ), a direct object (OBJ), o un
indirect object introduced by the preposition à (A-OBJ) or de (DE-OBJ).

SFs are defined at a level that is slightly more abstract than the surface syntactic
realization. Two abstractions are in action. The first one is the factoring of several parts
of speech. We have factored pronouns, common nouns, and proper nouns into a single
category N. We have not gathered verbs of different modes into one category because
the mode of the verb influences its syntactic behavior and hence its SF. El segundo
abstraction is the absence of linear order between the arguments.
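A sketch of these two abstractions, assuming the arguments of a verb are given as (function, part-of-speech) pairs; the nominal tags listed are only illustrative stand-ins for the pronoun, common-noun, and proper-noun categories that are factored into N, and the linear order is dropped by sorting:

    NOMINAL_TAGS = {"PRO", "NC", "NPP"}  # illustrative tags for pronouns, common nouns, proper nouns
    ARGUMENT_FUNCTIONS = {"SUJ", "OBJ", "A-OBJ", "DE-OBJ"}

    def abstract_sf(verb_mode, arguments):
        """Build an SF: keep the verb mode, collapse nominal categories into N, drop linear order."""
        args = []
        for function, pos in arguments:
            if function in ARGUMENT_FUNCTIONS:
                args.append((function, "N" if pos in NOMINAL_TAGS else pos))
        return verb_mode, tuple(sorted(args))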

The number of verbal lemmas, number of SFs, and average number of SFs per verb
are shown in Table 4, column A0. As can be observed, the numbers of verbal lemmas and

Table 3
Some statistics computed on the corpora used to collect Subcategorization Frames and
Selectional Constraints.

CORPUS     Number of sentences    Number of tokens
AFP        2,041,146              59,914,238
EST REP    2,998,261              53,913,288
WIKI       1,592,035              33,821,460
TOTAL      5,198,642              147,648,986

Table 4
Number of verbal lemmas, SFs, and average number of SFs per verb for three levels of
thresholding (0, 5, and 10).

                                  A0       A5      A10
number of verbal lemmas           23,915   4,871   3,923
number of different SFs           12,122   2,064   1,355
average number of SFs per verb    14.26    16.16   13.45

SFs are unrealistic. This is because the data from which SFs were extracted are noisy:
they consist of automatically produced syntactic trees that contain errors. There are two
main sources of errors in the parsed data: the pre-processing chain (tokenization, part-
of-speech tagging, and lemmatization), which can consider as a verb a word that is
not; and, of course, parsing errors, which tend to create incorrect SFs. In order to fight
against noise, we have used a simple thresholding: We only collect SFs that occur more
often than a given threshold. The results of the thresholding appear in columns A5 and A10,
where the subscript is the value of the threshold. It is a very crude technique that
eliminates many errors but also discards rare phenomena. As expected, both the number of
verbs and the number of SFs decrease sharply when increasing the value of the threshold. The average
number of SFs per verb is 14.26 without threshold and reaches 16.16 for a threshold
of 5. This phenomenon is mainly due to part of speech tagging errors on the raw
corpus: a large proportion of hapaxes are words mistakenly tagged as verbs. They disappear
with thresholding, and their disappearance tends to increase ambiguity.

We report in Table 5 the coverage of the extracted SFs on FTB DEV. The last column
represents the coverage of the SF extracted from FTB TRAIN. These figures constitute
an interesting reference point when evaluating the coverage of the SFs extracted from
raw data. We have distinguished lexical coverage (ratio of verbal lemmas of FTB DEV
present in the resource) and syntactic coverage (ratio of SFs of FTB DEV present in the
resource). Both measures were computed on types and tokens. Table 5 shows that lexical
coverage is very high, even on A10. It is interesting to note that syntactic coverage of the
automatically built resource is much higher than that of the resource
extracted from FTB TRAIN. This clearly shows that many SFs of FTB DEV were not seen
in FTB TRAIN. This observation supports our hypothesis that treebanks are not large
enough to accurately model the possible SFs of a verb. Comparing coverage of T and
Ax is somewhat unfair because, as we already mentioned, the automatic resources are
noisy and have a tendency to overgenerate, whereas T contains only correct SFs (up to
the corpus annotation errors). Accuracy results of Section 7.3 will give a more balanced

Table 5
Lexical and syntactic coverage on FTB DEV of the extracted SFs. Coverage is computed for three
thresholding levels of the automatically generated resource (A0, A5, and A10) and FTB TRAIN (T).

              A0               A5               A10              T
              types   tokens   types   tokens   types   tokens   types   tokens
Lex. cov.     99.52   99.85    98.56   99.62    98.08   99.50    89.56   96.98
Synt. cov.    95.78   97.13    91.08   93.96    88.84   92.39    62.24   73.54

Table 6
Number of occurrences of the four SC patterns (OBJ, SBJ, VdeN, and VaN) in raw corpora with
three levels of thresholding (0, 5, and 10).

SC pat.   A0          A5        A10
OBJ       422,756     58,495    26,847
SBJ       433,196     55,768    25,291
VdeN      116,519     11,676    4,779
VaN       185,127     23,674    10,729
TOTAL     1,157,598   149,613   67,646

view by computing how many errors are actually corrected using the noisy extracted
resource.
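The type and token coverage figures reported in Tables 5 and 7 follow the definitions given above; a minimal sketch, where the treebank events are verbal lemmas (lexical coverage), [verbal lemma, SF] pairs (syntactic coverage), or SC instantiations, and the resource is the set of items extracted from the raw corpus:

    def coverage(events, resource):
        """Type and token coverage (in %) of a list of treebank events by an extracted resource."""
        types = set(events)
        type_coverage = 100.0 * len(types & resource) / len(types)
        token_coverage = 100.0 * sum(1 for event in events if event in resource) / len(events)
        return type_coverage, token_coverage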

7.2.2 Extracted SCs. The number of different SCs extracted for each of the four SC
configurations is shown in Table 6 for the three thresholds 0, 5, and 10.

The coverage of SCs on FTB DEV is shown in Table 7. It is interesting to note that
SC coverage is much lower than SF coverage. This is due to the different nature of the
two phenomena: SC is a bilexical phenomenon whereas SF is a monolexical one. Ideally,
a larger corpus should be used to extract SCs. One can also note that the coverage of
SCs extracted from FTB TRAIN is very low.

7.3 Results

The results obtained by our method are shown in Table 8, using the four measures
defined in Section 6.1. The accuracy scores of the baseline parser are presented in line 1.
Line 2 shows the results obtained when setting parameters µ1 and µ2 to 1. This setting
corresponds to the situation where SF and SC scores are not used. Patching, in this
case, only combines ILSCs that have a good confidence measure. This experiment is
important in order to ensure that sSF and sSC do carry useful information. As one can
see, setting µ1 and µ2 to 1 slightly increases the accuracy of the parser, which means that
the accuracy of the parser can be increased just by imposing on the final parse sub-parses
that have a good confidence measure.

The optimal values of µ1 and µ2 were computed on FTB DEV; they are both 0.65.
These values are used for the experiments corresponding to lines 3, 4, and 5. Lines 3

Table 7
Coverage for the four SC configurations (OBJ, SBJ, VdeN, VaN). Coverage is computed for three
thresholding levels of the automatically generated resource (A0, A5, and A10) and FTB TRAIN (T).

          A0               A5               A10              T
SC pat.   types   tokens   types   tokens   types   tokens   types   tokens
OBJ       68.31   69.71    58.50   61.51    53.24   56.12    17.94   21.31
SBJ       66.26   68.18    52.87   57.24    47.82   52.94    23.11   24.48
VdeN      46.28   56.61    34.93   49.57    31.66   46.73    11.57   15.77
VaN       62.93   64.69    48.89   52.42    42.68   47.56    20.06   23.34
TOTAL     65.17   67.54    53.10   57.15    47.92   52.78    19.96   22.24

Table 8
Four measures of accuracy on FTB TEST for five settings: baseline second-order parser, patching
with confidence measure only, patching with SFs only, patching with SCs only, and patching with SFs
and SCs.

               LAS     UAS     SCAS    SFAS
Base           88.88   90.71   87.81   80.84
CM only        89.02   90.98   88.03   81.16
SF Selection   89.21   91.05   88.19   84.13
SC Satisf.     89.22   91.09   91.97   82.03
Combined       89.51   91.39   92.89   85.06

and 4 indicate the accuracy for SF selection and SC satisfaction alone, as described in
Sections 3.3 and 3.4. Line 5 provides the result when they are combined, as described in
Section 3.5.

The table shows that, when setting µ1 and µ2 to their optimal values, the three
methods increase both LAS and UAS with respect to the baseline. The best results are
obtained by the combined method. SCAS and SFAS allow us to gain a better under-
standing of what is happening. As could be expected, SF selection has a stronger impact
on SFAS whereas SC satisfaction has a stronger impact on SCAS. But both methods also
have a positive influence on the other measure. This is because SFs and SCs are not
independent phenomena and correcting one can also lead to correcting the other. The
combined method yields the best results on our four measures.

7.4 Error Analysis

There are three main causes for SC and SF errors: insufficient coverage of L, narrowness
of the patching search space I, or inadequacy of the ILP objective function. Given a
sentence S, let us note R its correct parse and H the parse produced by our system.
Let us note ˆIR and ˆIH, respectivamente, the set of ILSCs of R and H. We define an error as
an ILSC x ∈ ˆIR, such that x ∉ ˆIH, and we denote by y the LSC corresponding to x (x is an
instantiation of y). The three error causes are described as follows:

NotInDB y ∉ L. This is the situation where an ILSC cannot be extracted from the k-
best list because its corresponding LSC does not appear in L. A possible cause for
such errors comes from the corpora we have used to extract LSCs. We have shown
en la sección 7.2 that the coverage of our resources is not perfect and some ILSCs
that appear in the reference parse R might not appear in the resource. A solution
to such a problem would be to use larger corpora in order to increase coverage.
Another cause of this error is the processing chain that was used to process the
raw corpora. As already mentioned, any of the modules that compose this chain
can make errors and introduce noise in L.

NotInKB x ∉ I ∪ I′. This is the situation where x does not appear in any of the k-best
árboles. The problem is the limitation of the search space of the ILP. The solution
to this problem is either to increase the size of the search space or to use another
method to produce it, as proposed in Section 3.2. The simplest way to increase
the search space size is to increase the length of the k-best list. Sin embargo, increasing

Table 9
Raw number of errors, error decrease, and error distribution for each SC pattern, average of SC
patterns, and SFs.

Type   err.   decr.   NotInDB   NotInKB   NotInSOL
OBJ    72     37.5%   0.55      0.44      0.22
SBJ    101    47.5%   0.41      0.30      0.47
VdeN   59     27.1%   0.42      0.70      0.21
VaN    132    45.4%   0.24      0.33      0.59
SC     364    41.6%   0.38      0.42      0.41
SF     807    22.0%   0.22      0.24      0.54

the search space yields larger IL programs and therefore the resolution time of
the ILP.

NotInSOL x ∈ I ∪ I′ but x ∉ ˆI. This is the situation where x is in the input of the
ILP program but is not part of the solution. There are two possible causes for this
situation. The first one comes from the scores sSF and sSC computed on the corpora.
A correct ILSC might have been assigned a lower score than an incorrect one. This
situation, in turn, can be due to different attachment preferences in the test and
the raw corpora used to produce L. The second cause is the confidence measure
used to introduce a syntactic dimension to the ILP objective function, as described
in Section 4. In some cases, an ILSC might have a good lexical score but a bad
syntactic score, in which case it is ruled out.
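The three categories can be assigned mechanically to every missed ILSC; a sketch, with resource standing for L, candidates for the input I ∪ I′ of the ILP, and solution for its output:

    def error_category(x, lsc_of, resource, candidates, solution):
        """Classify a reference ILSC x that the system failed to produce."""
        if lsc_of(x) not in resource:
            return "NotInDB"   # its LSC was never extracted from the raw corpora
        if x not in candidates:
            return "NotInKB"   # it is absent from the k-best-based search space
        if x not in solution:
            return "NotInSOL"  # it was available but the ILP did not select it
        return None            # x was actually recovered: not an error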

Table 9 details the distribution of the errors over these three categories. Column 1
describes the nature of the attachment. The first four rows correspond to specific SC
patterns, row 5 is the average of all SCs, and row 6 concerns SF selection. Column 2
indicates the number of errors made by the baseline parser. Column 3 indicates the
decrease of the number of errors and columns 4, 5, and 6 give the distribution of the
errors over the three error types defined earlier.

As one can see from Table 9, our method decreases by 41.6% the number of SC
errors and by 22% the number of SF errors. The difference between these two figures
comes from the fact that SFs are more complex phenomena than SC patterns. SFs can be
composed of a large number of dependencies and an error in one of them leads to an
error in the SF selection.

The SC errors are almost equally distributed along the three error types: In 38% of
the cases, the correct SC is not in L; in 42% of the cases, the correct SC does not appear in
the 100 best parses; and in 41% of the cases, the ILP fails to find the correct SC, although
it is in the search space. There are three possible causes for the latter type of error: Recall
that the objective function of the ILP is based on the scores of ILSCs, which are them-
selves a combination of a lexical and a syntactic score: s(·) = (1 − µ) sL(·) + µ sS(·).
The cause of the error could be the lexical score (sL), the syntactic score (sS), or the
weight (µ).
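For reference, the interpolation itself is straightforward; a one-function sketch of the score entering the objective, with the lexical term standing for sSC or sSF and the syntactic term for the confidence measure:

    def ilsc_score(lexical_score, syntactic_confidence, mu):
        """Score of an ILSC in the ILP objective: s = (1 - mu) * sL + mu * sS."""
        return (1.0 - mu) * lexical_score + mu * syntactic_confidence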

Most SF errors pertain to category NotInSOL: In 54% of the cases, the correct SF is in
the input of the ILP but it is not part of the solution. This might be due to the definition
of the sSF score, which is a simple conditional probability.

Figure 5
Decomposition of runtime on FTB TEST.

7.5 Runtime Analysis

Figure 5 displays the runtime of our system with respect to the length of the sentences
parsed. Runtime has been decomposed into three parts: first-order 100-best generation,
ILP solving, and constrained second-order parsing. The runtime of the instantiation
process is not reported because it is negligible compared with the runtime of the other
processes.

Figure 5 shows that ILP resolution is the most time-consuming part of the whole
process. Although in practice the performance of the whole system is reasonable
(approximately 330 words per second on average), we have shown that the patching problem
is NP-complete and we have no guarantee that the problem will be solved in polynomial
time.

8. Experiments on English

A second series of experiments was performed on English. Although the setting used
is almost the same as for French, there are some notable differences between the two
series of experiments with respect to the data used:

- The treebank used for training the parser is significantly larger.

- The definition of SC and SF patterns is different.

- The raw corpus used to extract ISCs and ISFs is one order of magnitude larger.

8.1 Parsing

The parser was trained and evaluated on the Penn Treebank (Marcus, Marcinkiewicz,
and Santorini 1993). We have used the standard decomposition of the Penn Treebank

Table 10
Decomposition of the Penn Treebank into training, development, and test sets.

                          Number of sentences   Number of tokens
PTB TRAIN (Sec. 02–21)    39,279                958,167
PTB DEV   (Sec. 22)       1,334                 33,368
PTB TEST  (Sec. 23)       2,398                 57,676

into training, development, and test sets, with sizes as reported in Table 10. The main
difference with respect to French comes from the training set, which is 2.44 times larger.
As could be expected, the parser trained and evaluated on the Penn Treebank yields
better results than the same parser trained on the French Treebank. The LAS is 91.76 and
the UAS is 93.75, a relative increase of 3.24% and 3.35% with respect to French.

The English SF and SC definitions differ from the French ones in the choice of the
prepositions used. The nine most frequent prepositions occurring in the Penn Treebank
were taken into account ({of, in, to, for, with, on, at, from, by}) to define both SCs and SFs.
A set of 20 SC patterns is defined: {OBJ, SBJ, VofN, VinN, VtoN, VforN, VwithN, VonN,
VatN, VfromN, VbyN, NofN, NinN, NtoN, NforN, NwithN, NonN, NatN, NfromN,
NbyN}. SFs are defined as the dependents of a verb that are introduced by one of the
nine prepositions, plus the subject and the object.
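The 20 pattern names are simply the two argument patterns plus the cross-product of the two head categories with the nine prepositions; a one-line reconstruction (names only, the underlying LSCs constrain categories and lemmas as in the French case):

    PREPOSITIONS = ["of", "in", "to", "for", "with", "on", "at", "from", "by"]

    SC_PATTERNS = (["OBJ", "SBJ"]
                   + [f"V{p}N" for p in PREPOSITIONS]
                   + [f"N{p}N" for p in PREPOSITIONS])

    assert len(SC_PATTERNS) == 20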

The SCAS and SFAS are, respectively, 91.99 and 88.24: an increase of 4.76% and
9.15% compared to French. These results show that the English parser makes fewer
errors on prepositional attachments. This situation may be due to the fact that prepo-
sitional attachments are more ambiguous in French (4.56 prepositions on average per
sentence in the FTB compared with 3.15 for the PTB).

Table 11 details the error rate for each of the 20 SC patterns defined, sorted with
respect to their contribution to the total error rate. The total contribution of these errors
reaches 18.87%, which is an approximation of the upper limit of the improvement that
we can expect from our method. The mean accuracy for these patterns is almost equal
to 92%. Two situations should be distinguished, however: the subject and direct object
dependencies, on the one hand, and the prepositional attachments, on the other. The
Table 11
Contribution of the 20 SC patterns to the errors made by the parser. For each pattern, we
displayed its frequency, the parser accuracy, and the impact on global errors.

SC pat.   freq.   acc.    contrib.      SC pat.   freq.   acc.    contrib.
VinN      0.88    72.75   2.91          VatN      0.20    78.26   0.53
OBJ       5.48    95.72   2.85          NofN      1.72    97.44   0.53
SBJ       7.71    97.17   2.65          VwithN    0.23    84.38   0.43
VforN     0.32    56.04   1.71          VbyN      0.34    92.82   0.30
NinN      0.59    76.35   1.69          NtoN      0.16    85.71   0.28
VonN      0.32    63.69   1.39          NwithN    0.15    84.52   0.28
VtoN      0.30    78.82   0.77          VofN      0.06    74.29   0.19
NforN     0.35    84.42   0.66          NatN      0.14    89.02   0.19
NonN      0.16    68.48   0.62          NfromN    0.10    85.96   0.17
VfromN    0.14    69.51   0.53          NbyN      0.06    76.47   0.17
                                        TOTAL     19.43   91.99   18.87

prepositional attachments (restricted to the nine most frequent prepositions) represent
6.24% of the dependencies of the test set but their contribution to the total error rate is
equal to 13.37% and the mean accuracy of these attachments is equal to 82.32%, whereas
subject and direct object dependencies correspond to 13.19% of the dependencies but
generate only 5.5% of the errors.

8.2 Extracting SFs and SCs

The ISFs and ISCs were extracted from the Google-parsed Gigaword (Napoles, Gormley,
and Van Durme 2012b, 2012a) (noted as the GIGA corpus in the remainder of the article),
which is a subset of the original English Gigaword, fifth edition (Parker et al. 2011),
parsed with the parser described in Huang, Harper, and Petrov (2010). The origin of the
corpora as well as their sizes are reported in Table 12. As one can see, the size of the
corpus is one order of magnitude larger than the French corpus we gathered to extract
ISFs and ISCs—more precisely, it is 31 times larger.

The number of occurrences of each of the 20 SC patterns extracted is displayed in
Table 13. A total of 33.6 million occurrences of SC patterns have been extracted.

Table 14 shows some statistics on the number of SFs extracted from the data. The
number of different SFs extracted is, on average, three times larger than in French.
This difference is mainly due to the definition of English SFs, which take into account
more prepositions. The mean number of SFs per verb is 15.39 without thresholding and
reaches 17.73 for a threshold of 10.

Tables 15 and 16 display coverage on Penn Treebank (PTB) DEV for SCs and SFs
extracted from the GIGA corpus, with three levels of thresholding, as well as the
coverage for SCs and SFs extracted from PTB TRAIN.

The coverage for SCs is 65.66% on tokens, for a threshold equal to 10, a relative
increase of 24% compared with French. This increase is the expected result of using
more data. The most striking figure of Table 15 is the coverage of the training corpus.
It was 22.24% for French and 54.55% for English. This increase cannot be explained
only by the larger size of the training corpus for English. It seems to indicate a higher
lexicosyntactic variety in the French Treebank than in the Penn Treebank, or a higher
variety between the development and training sets of the French Treebank compared
with the Penn Treebank.

Table 12
Some statistics computed on the corpora used to collect Subcategorization Frames and
Selectional Constraints.

CORPUS                                                Number of sentences   Number of tokens
Agence France-Presse, English Service                 35,200,934            811,564,028
Associated Press Worldstream, English Service         64,893,993            1,346,062,109
Central News Agency of Taiwan, English Service        1,399,398             41,125,514
Los Angeles Times/Washington Post Newswire Service    13,711,844            313,874,339
New York Times Newswire Service                       80,630,849            1,672,223,270
Washington Post/Bloomberg Newswire Service            808,266               20,135,054
Xinhua News Agency, English Service                   15,770,726            382,529,760
TOTAL                                                 212,416,010           4,587,514,074

Table 13
Number of occurrences of the 20 SC patterns in the GIGA corpus with three levels of
thresholding (0, 5, and 10).

SC pat.   A0           A5          A10
OBJ       5,517,917    1,531,695   933,764
SBJ       5,150,122    1,312,531   768,133
VofN      488,475      79,799      37,128
VinN      1,866,766    471,104     270,713
VtoN      1,201,947    275,066     153,676
VforN     1,223,887    263,518     142,346
VwithN    1,591,669    302,467     151,949
VonN      1,148,819    260,671     142,715
VatN      640,019      143,864     80,390
VfromN    816,035      164,210     85,407
VbyN      1,200,835    252,166     131,468
NofN      4,655,268    1,048,782   579,842
NinN      2,028,525    404,431     212,265
NtoN      784,074      128,504     63,935
NforN     1,691,557    296,692     144,884
NwithN    1,066,320    138,280     62,051
NonN      956,125      170,875     85,611
NatN      516,497      93,686      47,476
NfromN    743,002      114,957     54,695
NbyN      347,791      56,020      27,363
TOTAL     33,635,650   7,509,318   4,175,811

Table 14
Number of verbal lemmas, SFs, and average number of SFs per verb for three levels of
thresholding (0, 5, and 10).

                                  A0       A5      A10
number of verbal lemmas           67,186   9,271   6,305
number of different SFs           31,055   6,394   3,816
average number of SFs per verb    15.39    21.52   17.73

The syntactic coverage of the SFs extracted from the GIGA corpus (the ratio of
pairs [verbal lemma, SF] of PTB DEV present in the GIGA corpus) is 77.56% when no
threshold is applied, although it was 97.13% for French. This difference is the result
of the more extended definition of SF in English. But the more surprising figure is the
coverage of the PTB TRAIN, which is 82.7% (it was 73.54% for French). This difference
could be explained, as was the case for SC coverage, by a difference of lexicosyntactic
variety between the French Treebank and the Penn Treebank.

8.3 Results

In our experiments on French, the first step used a first-order k-best parser. In the
English experiment, as already noted, the accuracy of the second-order parser is higher
and the quality of the k-best parses using a first-order parser is not good enough to leave
enough room for improvement over a second-order parser. The situation is detailed in

Table 15
Coverage for the 20 SC patterns on PTB DEV. Coverage is computed for three thresholding levels
of the automatically generated resource (A0, A5, and A10) and PTB TRAIN (T).

          A0               A5               A10              T
SC pat.   types   tokens   types   tokens   types   tokens   types   tokens
OBJ       78.26   71.86    73.24   66.13    70.23   62.07    49.86   58.43
SBJ       79.18   82.66    75.99   78.25    74.03   76.34    68.83   75.97
VofN      86.36   86.96    54.55   56.52    50      52.17    13.64   17.39
VinN      72.86   73.94    67.29   68.66    65.8    67.25    28.62   32.04
VtoN      73.94   74.67    64.08   64.67    59.15   60       26.76   28.67
VforN     66.96   67.2     63.48   64.00    58.26   59.2     26.96   31.20
VwithN    82.35   83.02    72.55   73.58    70.59   71.7     11.76   13.21
VonN      79.55   78.26    73.86   72.83    69.32   68.48    27.27   30.43
VatN      86.05   87.23    83.72   85.11    79.07   80.85    27.91   29.79
VfromN    68.09   68.75    59.57   60.42    55.32   56.25    17.02   16.67
VbyN      59.05   59.63    51.43   52.29    45.71   45.87    9.52    11.01
NofN      68.54   68.07    60.94   60.53    56.42   56.14    27.49   28.77
NinN      58.14   59.12    52.33   53.59    48.26   49.17    26.16   28.73
NtoN      74.51   76.79    54.9    58.93    52.94   57.14    15.69   19.64
NforN     63.56   64.17    53.39   54.17    46.61   47.5     18.64   19.17
NwithN    46.34   47.62    43.9    45.24    36.59   38.1     9.76    11.90
NonN      67.27   66.07    52.73   51.79    45.45   44.64    27.27   28.57
NatN      50.00   51.22    47.5    48.78    37.5    39.02    22.5    21.95
NfromN    48.72   44.19    43.59   39.53    35.9    32.56    17.95   25.58
NbyN      52.17   52.17    39.13   39.13    34.78   34.78    0.00    0.00
TOTAL     74.59   74.8     69.17   69.01    65.95   65.66    46.6    54.55

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

yo

a
r
t
i
C
mi

pag
d

F
/

/

/

/

4
2
1
5
5
1
8
0
5
9
8
6
/
C
oh

Table 16
Lexical and syntactic coverage of the extracted SFs on PTB DEV. Coverage is computed for
three thresholding levels of the automatically generated resource (A0, A5, and A10) and
PTB TRAIN (T).

              A0               A5               A10              T
              types   tokens   types   tokens   types   tokens   types   tokens
Lex. cov.     98.34   99.37    98.23   99.35    95.58   98.77    95.36   98.95
Synt. cov.    72.60   77.56    66.23   72.87    63.16   70.97    70.00   82.70

yo
i

_
a
_
0
0
2
4
2
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Table 17. The table shows that the Oracle accuracy of 100 best parses for a first-order
parser is often below the accuracy of the first-best parse for a second-order parser. El
situation is slightly more favorable for k = 200 but is still dominated for some SCs by
the first-best second-order accuracy. This is the reason why we decided to use a second-
order parser in the candidate generation step. The rest of the process is unchanged.

The results obtained are shown in Table 18. The decrease of labeled error attach-
ments is 3.15% and the decrease of unlabeled error attachments is 4.16%. These results
are lower than what we observed for French, where the decrease is, respectively, 5.66%
and 7.32%. Selectional Constraint violations are decreased by 16.21% and erroneous
Subcategorization Frame assignments by 8.83%. They were equal to 41.6% and 22%,
respectively, for French. These results are disappointing—we expected that the larger

Table 17
Unlabeled accuracy for each of the 20 SC patterns using a second-order and a first-order parser.
For the second-order parser, the first-best and oracle scores for k = 100 are shown. For the
first-order parser, the first-best and the oracle scores for k = 100 and k = 200 are shown.

          o = 2                     o = 1
SC pat.   100 oracle   first-best   100 oracle   200 oracle   first-best
OBJ       98.13        95.72        96.23        96.91        93.3
SBJ       98.93        97.17        96.39        96.71        93.96
VofN      100          74.29        85.71        88.57        77.14
VinN      96.59        72.75        61.32        63.73        56.91
VtoN      98.24        78.82        76.47        76.47        74.12
VforN     93.41        56.04        54.95        60.99        46.7
VwithN    99.22        84.38        85.16        87.5         82.03
VonN      94.41        63.69        60.34        60.89        56.42
VatN      98.26        78.26        57.39        59.13        53.91
VfromN    98.78        69.51        71.95        73.17        69.51
VbyN      98.97        92.82        89.23        89.74        88.21
NofN      99.49        97.44        98.57        98.88        97.03
NinN      96.11        76.35        81.14        82.34        70.66
NtoN      98.90        85.71        94.51        95.6         81.32
NforN     97.49        84.42        91.96        93.47        82.41
NwithN    96.43        84.52        90.48        94.05        77.38
NonN      95.65        68.48        72.83        73.91        64.13
NatN      100          89.02        90.24        91.46        85.37
NfromN    96.49        85.96        91.23        91.23        75.44
NbyN      100          76.47        91.18        97.06        82.35

Table 18
Four measures of accuracy on PTB TEST for five settings: baseline second-order parser, patching
with syntactic score only, patching with SFs only, patching with SCs only, and patching with SFs
and SCs.

               LAS     UAS     SCAS    SFAS
Base           91.76   93.75   91.99   88.24
CM only        91.80   93.79   92.06   88.27
SF Selection   91.97   93.93   92.96   89.20
SC Satisf.     92.02   94.01   93.11   89.14
Combined       92.05   94.05   93.29   89.28

size of the corpus used for extracting ISCs and ISFs would yield a larger decrease of SC
and SF errors.

8.4 Error Analysis

Table 19 allows us to gain a better understanding of the causes of the errors. The first
cause of errors is the ILP objective function. In 53% of the cases for SC errors and 59%
of the cases for SF errors, the ILP failed to find the correct SFs or SCs although they are
in the search space. The inadequacy of the objective function is difficult to interpret
because it has, itself, three causes, as we mentioned in our experiments on French: the
SC or SF scores, the confidence measure, or the weight µ. The second cause of errors

Table 19
Raw number of errors, error decrease, and error distribution for each SC pattern, average of SC
patterns, and SFs.

Type     err.   decr.     NotInDB   NotInKB   NotInSOL
OBJ      133    14.29%    0.39      0.51      0.40
SBJ      124    4.03%     0.24      0.39      0.51
VofN     9      11.11%    0.38      0.00      0.63
VinN     136    30.15%    0.48      0.18      0.46
VtoN     36     30.56%    0.32      0.12      0.60
VforN    80     26.25%    0.29      0.2       0.61
VwithN   20     10.00%    0.44      0.06      0.56
VonN     65     18.46%    0.32      0.19      0.62
VatN     25     16.00%    0.19      0.10      0.71
VfromN   25     8.00%     0.35      0.04      0.65
VbyN     14     −14.29%   0.56      0.13      0.44
NofN     25     12.00%    0.23      0.23      0.68
NinN     79     17.72%    0.4       0.2       0.49
NtoN     13     0.00%     0.38      0.08      0.54
NforN    31     6.45%     0.24      0.17      0.59
NwithN   13     0.00%     0.31      0.23      0.46
NonN     29     24.14%    0.23      0.18      0.59
NatN     9      0.00%     0.44      0.00      0.56
NfromN   8      0.00%     0.13      0.25      0.75
NbyN     8      12.50%    0.43      0.00      0.57
SC       882    16.21%    0.34      0.25      0.53
SF       1314   8.83%     0.41      0.19      0.59

is coverage. Despite the size of the raw corpus used to extract SCs and SFs, the lack of
coverage remains an important source of errors. The quality of the search space of the
ILP is the third cause of errors. This observation shows that candidate generation with
second-order parsing yields better results than first-order parsing.

9. Conclusions and Discussion

We showed in this paper that erroneous Subcategorization Frame assignment and
Selectional Constraint violations can be partially corrected by using weighted lexicosyn-
tactic patterns, called patches, representing Subcategorization Frames and Selectional
Constraints that have been extracted from a large raw corpus. The experiments, per-
formed on the French Treebank, allowed us to decrease by 41.6% Selectional Constraint
violations and by 22% erroneous Subcategorization Frame assignments. These figures
are lower for English: 16.21% in the first case and 8.83% in the second.

The method proposed is based on the decomposition of the parsing process into two
sub-processes. The first one, called patching, identifies, in a sentence, an optimal set of
patches that can be of arbitrary size. The second process is based on constrained graph-
based parsing; it takes as input an optimal set of patches and builds a complete parse of
the sentence that contains these patches.

The proposed architecture is one solution to the general problem of combining local
and global lexicosyntactic patterns in a parser. It has been applied to the problem of
selecting Subcategorization Frames and Selectional Constraints but it could be used to
take into account any type of lexicosyntactic linguistic constraints, such as semantic
frames or discourse structures.

The main limit of our method is the NP-complete nature of patching. In theory,
solving an instance of the patching problem could require exponential time. In practice,
for our task, the resolution time is reasonable, but it could become an issue when dealing
with larger patches. Other approximate methods could be used to perform the patching
task; this is the first direction in which we wish to continue this work.

The second direction that we want to explore concerns the first step of the patching
process, namely, candidate generation. Patches being lexicosyntactic structures, the gen-
eration process needs to have access to the possible syntactic structures of the sentences.
The solution we have proposed uses a first- or second-order k-best list of parses from
which syntactic material will be used to propose possible patches. Another solution
would be to perform the patch generation directly on the parse forest.

The error analysis showed that coverage of the automatically extracted ISCs and
ISFs is an important cause of errors, despite the size of the corpus used to extract ISCs
and ISFs. Collecting larger corpora might not be the right solution to increase coverage.
The third direction we would like to explore is generalizing over the events that are
observed in the data, using distributional semantics.

Acknowledgments
The authors thank Yann Vaxes for his help
in proving the NP-Completeness of the
patching problem and Fiona Torbey,
Jean Torbey, and Balamurali Andiyakkal
Rajendran for proofreading.

References
Abeillé, Anne, Lionel Clément, and François Toussenel. 2003. Building a treebank for French. In Anne Abeillé, editor, Treebanks. Kluwer.

Achterberg, Tobias. 2009. SCIP: Solving constraint integer programs. Mathematical Programming Computation, 1(1):1–41.

Attardi, Giuseppe and Massimiliano Ciaramita. 2007. Tree revision learning for dependency parsing. In HLT-NAACL, pages 388–395, Rochester, New York.

Bansal, Mohit and Dan Klein. 2011. Web-scale features for full-scale parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 693–702, Portland.

Bechet, Frederic and Alexis Nasr. 2009. Robust dependency parsing for spoken language understanding of spontaneous speech. In Proceedings of Interspeech, pages 1039–1042, Brighton, UK.

Bikel, Daniel M. 2004. Intricacies of Collins' parsing model. Computational Linguistics, 30(4):479–511.

Bohnet, Bernd. 2010. Top accuracy and fast dependency parsing is not a contradiction. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), pages 89–97, Beijing.

Boullier, Pierre, Alexis Nasr, and Benoît Sagot. 2009. Constructing parse forests that include exactly the n-best PCFG trees. In 11th Conference on Parsing Technologies, pages 117–128, Paris.

Brent, Michael. 1991. Automatic acquisition of subcategorization frames from untagged text. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, pages 209–214, Berkeley, CA.

Candito, Marie, Benoît Crabbé, Pascal Denis, and François Guérin. 2009. Analyse syntaxique du français : des constituants aux dépendances. In Proceedings of Traitement Automatique des Langues Naturelles, Senlis, France.

Charniak, Eugene and Mark Johnson. 2005. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 173–180, Michigan.

Church, K. W. and P. Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.

Collins, Michael. 1997. Three generative, lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, pages 16–23, Madrid, Spain.

Collins, Michael and Terry Koo. 2005. Discriminative reranking for natural language parsing. Computational Linguistics, 31(1):25–70.

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

yo

a
r
t
i
C
mi

pag
d

F
/

/

/

/

4
2
1
5
5
1
8
0
5
9
8
6
/
C
oh

yo
i

_
a
_
0
0
2
4
2
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Mirroshandel and Nasr

Integrating Selection and Subcategorization in a Dependency Parser

Eisner, Jason. 2000. Bilexical grammars and their cubic-time parsing algorithms. In Advances in Probabilistic and Other Parsing Technologies, pages 29–61, Springer.

Eisner, Jason M. 1996. Three new probabilistic models for dependency parsing: An exploration. In Proceedings of the 16th Conference on Computational Linguistics - Volume 1, pages 340–345, Copenhagen, Denmark.

Gildea, Daniel. 2001. Corpus variation and parser performance. In Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing, pages 167–202, Pittsburgh.

Hall, Keith and Václav Novák. 2005. Corrective modeling for non-projective dependency parsing. In Proceedings of the Ninth International Workshop on Parsing Technology, pages 42–52, Vancouver, Canada.

Henestroza, Enrique Anguiano and Marie Candito. 2011. Parse correction with specialized models for difficult attachment types. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1222–1233, Edinburgh, Scotland.

Huang, L. and D. Chiang. 2005. Better k-best parsing. In Proceedings of the Ninth International Workshop on Parsing Technology, pages 53–64, Vancouver.

Huang, Liang. 2008. Forest reranking: Discriminative parsing with non-local features. In Proceedings of ACL-08: HLT, pages 586–594, Columbus.

Huang, Liang and David Chiang. 2007. Forest rescoring: Faster decoding with integrated language models. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 144–151, Prague.

Huang, Zhongqiang, Mary Harper, and Slav Petrov. 2010. Self-training with products of latent variable grammars. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 12–22, Cambridge, Massachusetts.

Hwa, Rebecca. 2004. Sample selection for statistical parsing. Computational Linguistics, 30(3):253–276.

Joshi, Aravind Krishna. 1985. Tree Adjoining Grammars: How Much Context-Sensitivity is Required to Provide Reasonable Structural Descriptions? Technical report.

Karp, Richard M. 1972. Reducibility among Combinatorial Problems. Springer.

Koo, Terry and Michael Collins. 2010. Efficient third-order dependency parsers. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1–11, Uppsala, Sweden.

Kübler, Sandra, Ryan McDonald, and Joakim Nivre. 2009. Dependency parsing. Synthesis Lectures on Human Language Technologies, 1(1):1–127.

Lecerf, Yves. 1961. Une représentation algébrique de la structure des phrases dans diverses langues naturelles. Compte Rendu de l'Académie des Sciences de Paris, 252:232–234.

Manning, Christopher. 1993. Automatic acquisition of a large subcategorization dictionary from corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 235–242, Columbus.

Marcus, M. P., M. A. Marcinkiewicz, and B. Santorini. 1993. Building a large annotated corpus of English: The Penn treebank. Computational Linguistics, 19(2):313–330.

Martins, André F. T., Miguel Almeida, and Noah A. Smith. 2013. Turning on the turbo: Fast third-order non-projective turbo parsers. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 617–622, Sofia.

Martins, André F. T., Noah A. Smith, and Eric P. Xing. 2009. Concise integer linear programming formulations for dependency parsing. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 342–350, Suntec.

McDonald, R., K. Crammer, and F. Pereira. 2005a. Online large-margin training of dependency parsers. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 91–98, Michigan.

McDonald, Ryan, Koby Crammer, and Fernando Pereira. 2005b. Online large-margin training of dependency parsers. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 91–98, Michigan.

McDonald, Ryan and Giorgio Satta. 2007. On the complexity of non-projective data-driven dependency parsing. In Proceedings of the 10th International Conference on Parsing Technologies, pages 121–132, Prague.

McDonald, Ryan T. and Fernando C. N. Pereira. 2006. Online learning of
yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

yo

a
r
t
i
C
mi

pag
d

F
/

/

/

/

4
2
1
5
5
1
8
0
5
9
8
6
/
C
oh

yo
i

_
a
_
0
0
2
4
2
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Ligüística computacional

Volumen 42, Número 1

approximate dependency parsing algorithms. In EACL, pages 81–88, Trento, Italy.

Messiant, C., A. Korhonen, T. Poibeau, et al. 2008. Lexschem: A large subcategorization lexicon for French verbs. In Proceedings of the Language Resources and Evaluation Conference, Marrakech, Morocco.

Mirroshandel, S.A. and A. Nasr. 2011. Active learning for dependency parsing using partially annotated sentences. In Proceedings of the 12th International Conference on Parsing Technologies, pages 140–149, Dublin.

Mirroshandel, Seyed Abolghasem, Alexis Nasr, and Joseph Le Roux. 2012. Semi-supervised dependency parsing using lexical affinities. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 777–785, Jeju Island, Korea.

Mirroshandel, Seyed Abolghasem, Alexis Nasr, and Benoit Sagot. 2013. Enforcing subcategorization constraints in a parser using sub-parses recombining. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 239–247, Atlanta.

Nakov, P. and M. Hearst. 2005. Using the Web as an implicit training set: application to structural ambiguity resolution. In Proceedings of the Human Language Technology Conference and the Conference on Empirical Methods in Natural Language Processing, pages 835–842, Vancouver, Canada.

Napoles, Courtney, Matthew Gormley, and Benjamin Van Durme. 2012a. Annotated English Gigaword LDC2012T21. Linguistic Data Consortium, Philadelphia, PA.

Napoles, Courtney, Matthew Gormley, and Benjamin Van Durme. 2012b. Annotated Gigaword. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction, pages 95–100, Montréal, Canada.

Nasr, A., F. Béchet, J. F. Rey, B. Favre, and J. Le Roux. 2011. MACAON: An NLP tool suite for processing word lattices. In Proceedings of the ACL-HLT 2011 System Demonstrations, pages 86–91, Portland.
Parker, Robert, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2011. English Gigaword fifth edition, LDC2011T07. Linguistic Data Consortium, Philadelphia, Pennsylvania.

Pitler, E., S. Bergsma, D. Lin, and K. Church. 2010. Using Web-scale N-grams to improve base NP parsing performance. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), pages 886–894, Beijing, China.

Riedel, Sebastian and James Clarke. 2006. Incremental integer linear programming for non-projective dependency parsing. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 129–137, Sydney, Australia.

Rush, Alexander M. and Slav Petrov. 2012. Vine pruning for efficient multi-pass dependency parsing. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 498–507, Jeju Island, Korea.

Sánchez-Sáez, R., J. A. Sánchez, and J. M. Benedí. 2009. Statistical confidence measures for probabilistic parsing. In Proceedings of the International Conference RANLP-2009, pages 388–392, Borovets, Bulgaria.

Volk, M. 2001. Exploiting the WWW as a corpus to resolve PP attachment ambiguities. In Corpus Linguistics, pages 601–606.

Wessel, Frank, Ralf Schlüter, Klaus Macherey, and Hermann Ney. 2001. Confidence measures for large vocabulary continuous speech recognition. IEEE Transactions on Speech and Audio Processing, 9(3):288–298.

Zhou, G., J. Zhao, K. Liu, and L. Cai. 2011. Exploiting Web-derived selectional preference to improve statistical dependency parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1556–1565, Portland.
