Parsing Models for Identifying
Multiword Expressions

Spence Green∗
Stanford University

Marie-Catherine de Marneffe∗∗
Stanford University

Christopher D. Manning†
Stanford University

Multiword expressions lie at the syntax/semantics interface and have motivated alternative
theories of syntax like Construction Grammar. Until now, however, syntactic analysis and
multiword expression identification have been modeled separately in natural language
processing. We develop two structured prediction models for joint parsing and multiword expression
identiﬁcation. The ﬁrst is based on context-free grammars and the second uses tree substitution
grammars, a formalism that can store larger syntactic fragments. Our experiments show that
both models can identify multiword expressions with much higher accuracy than a state-of-the-
art system based on word co-occurrence statistics.

We experiment with Arabic and French, which both have pervasive multiword expressions.
Relative to English, they also have richer morphology, which induces lexical sparsity
in ﬁnite corpora. To combat this sparsity, we develop a simple factored lexical representation
for the context-free parsing model. Morphological analyses are automatically transformed into
rich feature tags that are scored jointly with lexical items. This technique, which we call
a factored lexicon, improves both standard parsing and multiword expression identiﬁcation
accuracy.

1. Introduction

Multiword expressions are groups of words which, taken together, can have
unpredictable semantics. For example, the expression part of speech refers not to some
aspect of speaking, but to the syntactic category of a word. If the expression is
altered in some ways—part of speeches, part of speaking, type of speech—then the
idiomatic meaning is lost. Other modifications, however, are permitted, as in the plural
parts of speech. These characteristics make multiword expressions (MWEs) difﬁcult to
identify and classify. But if they can be identiﬁed, then the incorporation of MWE
knowledge has been shown to improve task accuracy for a range of NLP applications

∗ Department of Computer Science. E-mail: spenceg@stanford.edu.
∗∗ Department of Linguistics. E-mail: mcdm@stanford.edu.
† Departments of Computer Science and Linguistics. E-mail: manning@stanford.edu.

Submission received: October 1, 2011; revised submission received: June 9, 2012; accepted for publication:
August 3, 2012.

No rights reserved. This work was authored as part of the Contributor’s official duties as an Employee of
the United States Government and is therefore a work of the United States Government. In accordance with
17 U.S.C. 105, no copyright protection is available for such works under U.S. law.

Computational Linguistics

Volume 39, Number 1

including dependency parsing (Nivre and Nilsson 2004), supertagging (Blunsom and
Baldwin 2006), sentence generation (Hogan et al. 2007), machine translation (Carpuat
and Diab 2010), and shallow parsing (Korkontzelos and Manandhar 2010).

The standard approach to MWE identiﬁcation is n-gram classiﬁcation. This tech-
nique is simple. Given a corpus, all n-grams are extracted, ﬁltered using heuristics,
and assigned feature vectors. Each coordinate in the feature vector is a real-valued
quantity such as log likelihood or pointwise mutual information. A binary classiﬁer
is then trained to render a MWE/non-MWE decision. All entries into the 2008 MWE
Shared Task (Evert 2008) utilized variants of this technique.

Broadly speaking, n-gram classiﬁcation methods measure word co-occurrence. Sup-
pose that a corpus contains more occurrences of part of speech than parts of speech. Surface
statistics may erroneously predict that only the former is an MWE and the latter is not.
More worrisome is that the statistics for the two n-grams are separate, thus missing an
obvious generalization.
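
The following is a minimal sketch of the kind of surface statistic that n-gram classification relies on, restricted to bigrams for brevity. The function name `bigram_pmi` and the toy corpus are illustrative assumptions, not part of any system described in this article. It shows that the singular and plural variants of an expression receive separate, unrelated scores.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): pointwise mutual information} for adjacent word pairs."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_joint = count / n_bi          # relative frequency of the pair
        p_w1 = unigrams[w1] / n_uni     # relative frequency of each word
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_joint / (p_w1 * p_w2))
    return scores


# Toy corpus, invented for illustration.
corpus = "the part of speech tag and the parts of speech tags".split()
scores = bigram_pmi(corpus)

# "part of" and "parts of" are scored from disjoint counts.
print(scores[("part", "of")], scores[("parts", "of")])
```

Nothing in these counts ties part of speech to parts of speech; any sharing between the two has to come from a model that abstracts over the surface strings.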

In this article, we show that statistical parsing models generalize more effectively
over arbitrary-length multiword expressions. This approach has not been previously
demonstrated. To show its effectiveness, we build two parsing models for MWE iden-
tiﬁcation. The ﬁrst model is based on a context-free grammar (CFG) with manual
rule reﬁnements (Klein and Manning 2003). This parser also includes a novel lexical
model, the factored lexicon, that incorporates morphological features. The second
model is based on tree substitution grammar (TSG), a formalism with greater strong
generative capacity that can store larger structural tree fragments, some of which are
lexicalized.

We apply the models to Modern Standard Arabic (henceforth MSA, or simply
“Arabic”) and French, two morphologically rich languages (MRLs). The lexical sparsity
(in ﬁnite corpora) induced by rich morphology poses a particular challenge for n-gram
classiﬁcation. Relative to English, French has a richer array of morphological features—
such as grammatical gender and verbal conjugation for aspect and voice. Arabic also
has richer morphology including gender and dual number. It has pervasive verb-
initial matrix clauses, although preposed subjects are also possible. For languages like
these it is well known that constituency parsing models designed for English often do
not generalize well. Therefore, we focus on the interplay among language, annotation
choices, and parsing model design for each language (Levy and Manning 2003; Kübler
2005, inter alia), although our methods are ultimately very general.

Our modeling strategy for MWEs is simple: We mark them with ﬂat bracketings
in phrase structure trees. This representation implicitly assumes a locality constraint
on idioms, an assumption with a precedent in linguistics (Marantz 1997, inter alia).
Of course, it is easy to find non-local idioms that do not correspond to surface constituents
or even contiguous strings (O'Grady 1998). Utterances such as All hell seemed
to break loose and The cat got Mary’s tongue are clearly idiomatic, yet the idiomatic
elements are discontiguous. Our models cannot identify these MWEs, but then again,
neither can n-gram classification. However, many common MWE types like nominal
compounds are contiguous and often correspond to constituent boundaries.

Consider again the phrasal compound part of speech,1 which is non-compositional:
The idiomatic meaning “syntactic category” does not derive from any of the component

1 It is common to hyphenate some nominal compounds, e.g., part-of-speech. This practice invites a
words-with-spaces treatment of idioms. However, hyphens are inconsistently used in English.
Hyphenation is more common in French, but totally absent in Arabic.

words. This non-compositionality affects the syntactic environment of the compound as
shown by the addition of an attributive adjective:

(1) a. Noun is a part of speech.
    b. *Noun is a big part of speech.
    c. *Noun is a big part.

(2) a. Liquidity is a part of growth.
    b. Liquidity is a big part of growth.
    c. Liquidity is a big part.

In Example (1a) the copula predicate part of speech as a whole describes Noun. In
Examples (1b) and (1c) big clearly modifies only part and the idiomatic meaning is
lost. The attributive adjective cannot probe arbitrarily into the non-compositional
compound. In contrast, Example (2) contains parallel data without idiomatic semantics.
The conventional syntactic analysis of Example (2a) is identical to that of Example (1a)
except for the lexical items, yet part of growth is not idiomatic. Consequently, many
pre-modifiers are appropriate for part, which is semantically vacuous. In Example (2b), big
clearly modifies part, and of growth is just an optional PP complement, as shown by
Example (2c), which is still grammatical.

This article proposes different phrase structures for examples such as (1a) and
(2a). Figure 1a shows a Penn Treebank (PTB) (Marcus, Marcinkiewicz, and Santorini
1993) parse of Example (1a), and Figure 1b shows the parse of a paraphrase. The
phrasal compound part of speech functions syntactically like a single-word nominal
like category, and indeed Noun is a big category is grammatical. Single-word
paraphrasability is a common, though not mandatory, characteristic of MWEs (Baldwin
and Kim 2010). Starting from the paraphrase parse, we create a representation like
Figure 1c. The MWE is indicated by a label in the predicted structure, which is
flat. This representation explicitly models the idiomatic semantics of the compound
and is context-free, so we can build efficient parsers for it. Crucially, MWE
identification becomes a by-product of parsing as we can trivially extract MWE spans from
full parses.
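
As a rough illustration of how MWE spans can be read off a full parse, the sketch below parses a bracketed tree and collects every node whose label starts with "MW". The helper names `parse_tree` and `mwe_spans`, the bracketed input format, and the pre-terminal tags inside the MWN node are assumptions made for this example; they are not taken from the parsers described in this article.

```python
def parse_tree(text):
    """Parse a bracketed tree such as "(NP (NN part))" into (label, children)."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a terminal word
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return read()


def mwe_spans(tree, prefix="MW"):
    """Return (label, start, end, words) for each node labeled with `prefix`."""
    spans = []

    def walk(node, start):
        if isinstance(node, str):
            return [node]
        label, children = node
        words = []
        for child in children:
            words += walk(child, start + len(words))
        if label.startswith(prefix):
            spans.append((label, start, start + len(words), words))
        return words

    walk(tree, 0)
    return spans


# Flat MWN analysis in the spirit of Figure 1c (inner tags are illustrative).
tree = parse_tree(
    "(S (NP (NN Noun)) (VP (VBZ is) (NP (DT a) (MWN (NN part) (IN of) (NN speech)))))"
)
print(mwe_spans(tree))
```

Token offsets are half-open, so the MWN above covers tokens 3 through 5.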

We convert existing Arabic and French syntactic treebanks to the new MWE
representation. With this representation, the TSG model yields the best MWE
identification results for Arabic (81.9% F1) and competitive results for French (71.3%),
even though its parsing results lag state-of-the-art probabilistic CFG (PCFG)-based
parsers. The TSG model also learns human-interpretable MWE rules. The factored
lexicon model with gold morphological annotations achieves the best MWE
results for French (87.3% F1) and competitive results for Arabic (78.2% F1). For both
languages the factored lexicon model also approaches state-of-the-art basic parsing
accuracy.

The remainder of this article begins with linguistic background on common MWE
types in Arabic and French (Section 2). We then describe two constituency parsing
models that are tuned for MWE identification (Sections 3 and 4). These models are
supervised and can be trained on existing linguistic resources (Section 5). We evaluate
the models for both basic parsing and MWE identification (Section 6). Finally, we
compare our results with a state-of-the-art n-gram classification system (Section 7) and
to prior work (Section 8).

2. Multiword Expressions in Arabic and French

In this section we provide a general deﬁnition and taxonomy of MWEs. Then we discuss
types of MWEs in Arabic and French.

2.1 Deﬁnition of Multiword Expressions

MWEs, a known nuisance for both linguistics and NLP, blur the lines between syntax
and semantics. Jackendoff (1997, page 156) comments that MWEs “are hardly a marginal
part of our use of language,” and estimates that a native speaker knows at least as many
MWEs as single words. A linguistically adequate representation for MWEs remains an
active area of research, however. Baldwin and Kim (2010) define MWEs as follows:

Definition 1
Multiword expressions are lexical items that: (a) can be decomposed into multiple
lexemes; and (b) display lexical, syntactic, semantic, pragmatic, and/or statistical
idiomaticity.

[Parse trees: (a) Standard analysis of Example (1a). (b) Standard analysis of a paraphrase. (c) Flat MWN analysis described in the caption.]

Figure 1
(a) A standard PTB parse of Example (1a). (b) The MWE part of speech functions syntactically
like the ordinary nominal category, as shown by this paraphrase. (c) We incorporate the
presence of the MWE into the syntactic analysis by flattening the tree dominating part of speech
and introducing a new non-terminal label multiword noun (MWN) for the resulting span. The
new representation classifies an MWE according to a global syntactic type and assigns a POS
to each of the internal tokens. It makes no commitment to the internal syntactic structure of
the MWE, however.

Table 1
Semi-fixed MWEs in French and English. The French adverb à terme (‘in the end’) can be
modified by a small set of adjectives, and in turn some of these adjectives can be modified
by an adverb such as très (‘very’). Similar restrictions appear in English.

French                 English
à terme                in the near term
à court terme          in the short term
à très court terme     in the very short term
à moyen terme          in the medium term
à long terme           in the long term
à très long terme      in the very long term

MWEs fall into four broad categories (Sag et al. 2002):

1. Fixed: do not allow morphosyntactic variation or internal modification (in short,
   by and large).

2. Semi-fixed: can be inflected or undergo internal modification (Table 1).

3. Syntactically flexible: undergo syntactic variation such as inflection (e.g.,
   phrasal verbs such as look up and write down).

4. Institutionalized phrases: fully compositional phrases that are statistically
   idiosyncratic (traffic light, Secretary of State).

Statistical parsers are well-suited for coping with lexical, syntactic, and statistical
idiomaticity across all four MWE classes. However, to our knowledge, we are the first
to explicitly tune parsers for MWE identification.

2.2 Arabic MWEs

The most recent and most relevant work on Arabic MWEs was by Ashraf (2012), who
analyzed an 83-million-word Arabic corpus. He developed an empirical taxonomy of
six MWE types, which correspond to syntactic classes. The syntactic class is defined
by the projection of the purported syntactic head of the MWE. MWEs are further
subcategorized by observed POS sequences. For some of these classes, the syntactic
distinctions are debatable. For example, in the verb-object idiom ضرب عصفورين بحجر
Daraba cSfuurayn bi-Hajar (‘he killed two birds with one stone’)2 the composition of
عصفورين (‘two birds’) with حجر (‘stone’) is at least as important as composition with the
verb ضرب (‘he killed’), yet Ashraf (2012) classifies the phrase as a verbal idiom.

The corpus in our experiments only marks three of the six Arabic MWE classes:

Nominal idioms (MWN) consist of proper nouns (Example 3a), noun compounds
(Example 3b), and construct NPs (Example 3c). MWNs typically correspond to NP
bracketings:

(3) a. N N: أبو ظبي abuu Dabii (‘Abu Dhabi’)
    b. D+N D+N: العناية الفائقة al-cnaaya al-faaOqa (‘intensive care unit’)
    c. N D+N: كرة القدم kura al-qudum (‘soccer’)

2 For each Arabic example in this work, we provide native script, transliterations in italics according to the
phonetic scheme in Ryding (2005), and English translations in single quotes.

Prepositional idioms (MWP) include PPs that are commonly used as discourse con-
nectives (Example 4a), function like adverbials in English (Example 4b), or have
been institutionalized (Example 4c). These MWEs are distinguished by a prepositional
syntactic head:

(4) a. P D+N: حتى الآن Hataa al-aan (‘until now’)
    b. P+N: بعنف bi-cnf (‘violently’)
    c. P+D+N D+N: بالتوقيت المحلي bi-al-twqiit al-maHalii (‘local time’)

Adjectival idioms (MWA) are typically the so-called “false” iDaafa constructs in which
the first term is an adjective that acts as a modifier of some other noun. These constructs
often correspond to a hyphenated modifier in English such as Examples (5a) and (5b). Less
frequent are coordinated adjectives that have been institutionalized such as
Examples (5c) and (5d):

(5) a. A D+N: رفيعة المستوى rafiica al-mustuuaa (‘high-level’)
    b. A D+N: سوفياتية الصنع swfiiaatiia al-Sanac (‘Soviet-made’)
    c. D+A C+D+A: الشقيقة والصديقة al-shaqiiqa w-al-Sadiiqa (‘neighborly’)
    d. D+A C+D+A: البرية والبحرية al-bariia w-al-baHariia (‘land and sea’)

These idiom types usually do not cross constituent boundaries, so constituency
parsers are well suited for modeling them. The other three classes of Ashraf (2012),
namely verb-subject, verbal, and adverbial, tend to cross constituent boundaries, so they
are difficult to represent in a PTB-style treebank. Dependency representations may be more
appropriate for these idiom classes.

2.3 French MWEs

In French, there is a lexicographic tradition of compiling MWE lists. For example,
Gross (1986) shows that whereas French dictionaries contain about 1,500 single-word
adverbs there are over 5,000 multiword adverbs.
MWEs occur in every part of speech (POS) category (e.g., noun trousse de secours
(‘first-aid kit’); verb faire main-basse [do hand-low] (‘seize’); adverb comme dans du
beurre [as in butter] (‘easily’); adjective à part entière (‘wholly’)). Motivated by the
prevalence of MWEs in French, Gross (1984) developed a linguistic theory known as
Lexicon-Grammar. In this theory, MWEs are classified according to their global POS tags
(noun, verb, adverb, adjective), and described in terms of the sequence of the POS tags of
the words that constitute the MWE (e.g., “N de N” garde d’enfant [guard of child]
(‘daycare’), pied de guerre [foot of war] (‘at the ready’)) (Gross 1986). In other words,
MWEs are represented by a flat structure. The Lexicon-Grammar distinguishes between
units that are fixed and have to appear as is (en tout et pour tout [in all and for all]
(‘in total’)) and units that accept some syntactic variation such as admitting the
insertion of an adverb or adjective, or the variation of one of the words in the expression
(e.g., a possessive as in from the top of one’s hat). It also notes whether the MWE
displays some selectional preferences (e.g., it has to be preceded by a verb or by an
adjective).

We discuss three of the French MWE categories here, and list the rest in Appendix A.

Nominal idioms (MWN) consist of proper nouns (Example (6a)), foreign common
nouns (6b), and common nouns. The common nouns appear in several syntactically
regular sequences of POS tags (Example (7)). Multiword nouns allow inflection
(singular vs. plural) but no insertion:

(6) a. London Sunday Times, Los Angeles
    b. week-end, mea culpa, joint-venture

(7) a. N A: corps médical (‘medical staff’), dette publique (‘public debt’)
    b. N P N: mode d’emploi (‘instruction manual’)
    c. N N: numéro deux (‘number two’), maison mère [house mother] (‘headquarters’),
       grève surprise (‘sudden strike’)
    d. N P D N: impôt sur le revenu (‘income tax’), ministre de l’économie (‘finance
       minister’)

Adjectival idioms (MWA) appear with different POS sequences (Example (8)). They
include numbers like vingt et unième (‘21st’). Some MWAs allow internal variation.
For example, some adverbs or adjectives can be added to both examples in (8b) (à très
haut risque, de toute dernière minute):

(8) a. P N: d’antan [from before] (‘old’), en question (‘under discussion’)
    b. P A N: à haut risque (‘high-risk’), de dernière minute [from the last minute]
       (‘at the eleventh hour’)
    c. A C A: pur et simple [pure and simple] (‘straightforward’), noir et blanc
       (‘black and white’)

Verbal idioms (MWV) allow number and tense inflections (Example (9)). Some MWVs
containing a noun or an adjective allow the insertion of a modifier (e.g., donner grande
satisfaction (‘give great satisfaction’)), whereas others do not. When an adverb
intervenes between the main verb and its complement, the two parts of the MWV may be
marked discontinuously (e.g., [MWV [V prennent]] [ADV déjà] [MWV [P en] [N cause]]
(‘already take into account’)):

(9) a. V N: avoir lieu (‘take place’), donner satisfaction (‘give satisfaction’)
    b. V P N: mettre en place (‘put in place’), entrer en vigueur (‘to come into effect’)
    c. V P ADV: mettre à mal [put at bad] (‘harm’), être à même [be at same] (‘be able’)
    d. V D N P N: tirer la sonnette d’alarme (‘ring the alarm bell’), avoir le vent en
       poupe (‘to have the wind astern’)

Both Gross (1986) and Ashraf (2012) classify MWEs according to global syntactic role
and internal POS sequence. In a constituency tree, these two features can be modeled
by a span over the MWE composed of a phrasal label indicating the MWE type and
pre-terminal labels indicating the internal POS sequence. MWE identification then
becomes a trivial process of extracting such subtrees from full parses.

Table 2
French grammar development. Incremental effects on grammar size and labeled F1 for each
of the manual grammar features (development set, sentences ≤ 40 words). The baseline is a
parent-annotated grammar. The features trade off between maximizing two objectives:
overall parsing F1 and MWE F1.

Feature       States   Tags   Parse F1   ΔF1    MWE F1
–             4,128    32     77.3              60.7
tagPA         4,360    264    78.4       +1.1   71.4
splitPUNC     4,762    268    78.8       +0.4   71.1
markDe        4,882    284    79.8       +1.0   71.6
markP         4,884    286    79.9       +0.1   71.5
MWADVtype1    4,919    286    79.9       +0.0   71.8
MWADVtype2    4,970    286    79.9       +0.0   71.7
MWNtype1      5,042    286    80.0       +0.1   71.9
MWNtype2      5,098    286    79.9       −0.1   71.9

3. Context-Free Parsing Model: Stanford Parser

In this section and Section 4, we describe constituency parsing models that will be
tuned for MWE identification. The algorithmic details of the parsing models may seem
removed from multiword expressions, but this is by design. MWEs are encoded in
the syntactic representation, allowing the model designer to focus on learning that
representation rather than trying to model semantic phenomena directly.

The Stanford parser (Klein and Manning 2003) is a product model that combines
the outputs of a manually refined PCFG with an arc-factored dependency parser.
Adapting the Stanford parser to a new language requires: (1) feature engineering for the
PCFG grammar, (2) specification of head-finding rules for extracting dependencies, and
(3) development of an unknown word model.3 After adapting the basic parser, we
develop a novel lexical model, which we call a factored lexicon.
The factored lexicon incorporates morphological information that is predicted by a
separate morphological analyzer.

3 The Stanford parser code, head-finding rules, and trained models are available at
http://nlp.stanford.edu/software/lex-parser.shtml.

3.1 Grammar Development

Grammar features consist of manual splits of labels in the training data (e.g., marking
base NPs with the rich label “NP-base”). These features were tuned on a development
set. Some of them have linguistic interpretations, whereas others (e.g., punctuation
splitting) have only empirical justification.

French Grammar Features. Table 2 lists the category splits used in our grammar. Most
of the features are POS splits as many phrasal tag splits did not improve accuracy. This
result may be due to the flat annotation scheme of the FTB.

Parent annotation of POS tags captures information about the external context. For
example, prepositions (P) can introduce a prepositional phrase (PP) or an infinitival
complement (VPinf), but some prepositions will uniquely appear in one context and
not the other (e.g., sur (‘on’) will only occur in a PP environment). The tagPA feature
provides this kind of distributional information. We also split punctuation tags
(splitPUNC) into equivalence classes similar to those present in the PTB.

We tried different features to mark the context of prepositions. markP identifies
prepositions which introduce PPs modifying a noun (NP). Marking other kinds of
prepositional modifiers (e.g., verb) did not help. The feature markDe marks the
preposition de and its variants (du, des, d’), which are very frequent and appear in many
contexts.

The features that help MWE F1 depend on idiom frequency.
We mark MWADVs under S nodes (MWADVtype1), and those with POS sequences that occur
more than 500 times (“P N” en jeu, à peine, or “P D N” dans l’immédiat, à l’inverse)
(MWADVtype2). Similarly, we mark MWNs that occur more than 600 times (e.g., “N P N”
and “N N”) (MWNtype1 and MWNtype2).

Arabic Grammar Features. The Arabic grammar features come from Green and Manning
(2010), which contains an ablation study similar to Table 2. We added one additional
feature, markMWEPOS, which marks POS tags dominated by MWE phrasal categories.

3.2 Head-Finding Rules

For Arabic, we use the head-finding rules from Green and Manning (2010). For French,
we use the head-finding rules of Dybro-Johansen (2004), which yielded an approximately
1% development set improvement over those of Arun (2004).

3.3 Unknown Word Models

For both languages, we create simple unknown word models that substitute word
signatures for rare and unseen word types. The signatures are generated according to
the features in Table 3. For tag t and signature s, the signature parameters p(t|s) are
estimated after collecting counts for 50% of the training data. Then p(s|t) is computed
via Bayes rule with a flat Dirichlet prior.

Table 3
Unknown word model features for Arabic and French.

Arabic Lexical Features

• Presence of the determiner ال al
• Contains digits or punctuation
• Ends with the feminine affix ة ah
• Various verbal (e.g., ت t, وا waa, ون uun) and adjectival suffixes (e.g., ية iiah, ي ii)

French Lexical Features

• Nominal, adjectival, verbal, adverbial, and plural suffixes
• Contains digits or punctuation
• Is capitalized (except the first word in a sentence), or consists entirely of capital
  letters
• If none of the above, deterministically extract one- and two-character suffixes
3.4 Factored Lexicon with Morphological Features

We will apply our models to Arabic and French, yet we have not dealt with the lexical
sparsity induced by rich morphology (see Table 5 for a comparison to English). One
way to combat sparsity is to parse a factored representation of the terminals, where
factors might be the word form, the lemma, or grammatical features such as gender,
number, and person (φ features) (Bilmes and Kirchhoff 2003; Koehn and Hoang 2007,
inter alia).

The basic parser lexicon estimates the generative probability of a word given a
tag p(w|t) from word/tag pairs observed in the training set. In addition, the lexicon
includes parameter estimates p(t|s) for unknown word signatures s produced by the
unknown word models (see Section 3.3). At parsing time, the lexicon scores each input
word type w according to its observed count in the training set c(w). We define the
unsmoothed and smoothed parameter estimates:

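$$p(t \mid w) = \frac{c(t, w)}{c(w)} \tag{1}$$

$$p_{\text{smooth}}(t \mid w) = \frac{c(t, w) + \alpha\, p(t \mid s)}{c(w) + \alpha} \tag{2}$$

We then compute the desired parameter p(w|t) as

$$p(w \mid t) = \begin{cases}
\dfrac{p(t \mid w)\, p(w)}{p(t)} & \text{if } c(w) > \beta \\[2ex]
\dfrac{p_{\text{smooth}}(t \mid w)\, p(w)}{p(t)} & \text{if } c(w) > 0 \\[2ex]
\dfrac{p(t \mid s)\, p(s)}{p(t)} & \text{otherwise}
\end{cases} \tag{3}$$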

We found that α = 1.0 and β = 100 worked well on both development sets.
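
The three cases of Equation (3) can be sketched in a few lines. This is an illustrative reading of Equations (1) through (3), not the parser's code: the function names, the count tables, and the choice to estimate p(w) and p(t) as relative frequencies over the same token total are assumptions, and p(t|s) and p(s) are passed in as plain numbers.

```python
from collections import Counter

ALPHA = 1.0  # smoothing weight reported in the text
BETA = 100   # count threshold reported in the text


def p_tag_given_word(tag, word, c_tw, c_w, p_t_given_s):
    """Estimate p(t|w): Equation (1) for frequent words, Equation (2) for rare ones."""
    count_w = c_w[word]
    if count_w > BETA:
        return c_tw[(tag, word)] / count_w
    if count_w > 0:
        return (c_tw[(tag, word)] + ALPHA * p_t_given_s) / (count_w + ALPHA)
    return p_t_given_s  # unseen word: back off to the signature estimate


def p_word_given_tag(tag, word, c_tw, c_w, c_t, p_t_given_s, p_s):
    """Invert p(t|w) into p(w|t) with Bayes' rule, as in Equation (3)."""
    total = sum(c_t.values())
    p_t = c_t[tag] / total
    # Seen words use p(w); unseen words use the signature probability p(s).
    p_x = c_w[word] / total if c_w[word] > 0 else p_s
    return p_tag_given_word(tag, word, c_tw, c_w, p_t_given_s) * p_x / p_t


# Toy counts, invented for illustration.
c_tw = Counter({("N", "part"): 3, ("V", "part"): 1, ("N", "speech"): 4})
c_w = Counter({"part": 4, "speech": 4})
c_t = Counter({"N": 7, "V": 1})

rare = p_tag_given_word("N", "part", c_tw, c_w, p_t_given_s=0.5)
unseen = p_tag_given_word("N", "parts", c_tw, c_w, p_t_given_s=0.5)
lexical = p_word_given_tag("N", "part", c_tw, c_w, c_t, p_t_given_s=0.5, p_s=0.1)
print(rare, unseen, lexical)
```

With α = 1.0 the signature estimate acts like one extra pseudo-observation of the word, so its influence fades as c(w) grows and vanishes once c(w) exceeds β.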

In the factored lexicon, each token has an associated morphological analysis m,
which is a string describing various grammatical features (e.g., tense, voice,
definiteness). Instead of generating terminals alone, we generate the word and morphological
analysis using a simple product:

$$p(w, m \mid t) = p(w \mid t)\, p(m \mid t) \tag{4}$$

where p(m|t) is estimated using exactly the same procedure as the lexical insertion
probability p(w|t). Because there are only a few hundred unique ⟨t, m⟩ tuples in the
training data for each language, we tend to get sharper parameter estimates; that is,
we usually estimate p(t|m) directly as in Equation (1). Moreover, at test time, even if
the word type w is unknown, the associated morphological analysis m is almost always
known, providing additional evidence for tagging.

We also experimented with an additional lemma factor, but found that it did not
improve accuracy. We thus excluded the lemma factor from our experiments.

For words that have been observed with only one tagging, the factored lexicon is
clearly redundant. Consider, however, the case of the Arabic triliteral قتل qtl which,
in unvocalized text, can be either a verb meaning “he killed” or a nominal meaning
“murder, killing.” If قتل appears as a verb, and we include the tense feature in the
morphological analysis, then all associated nominal tags (e.g., NN) will be assigned
zero probability because nominals never carry tense in the training data.
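
The effect of the product in Equation (4) on this example can be shown with a toy calculation. All probabilities below are invented for illustration, and the tag name VBD, the morphology string "tense=past", and the function `factored_score` are hypothetical stand-ins rather than values or names from the treebanks or the parser.

```python
def factored_score(p_w_given_t, p_m_given_t, word, morph, tag):
    """Score p(w, m | t) = p(w | t) * p(m | t); unseen pairs get probability 0."""
    return p_w_given_t.get((word, tag), 0.0) * p_m_given_t.get((morph, tag), 0.0)


# Toy lexical insertion probabilities: the unvocalized form is ambiguous.
p_w_given_t = {("qtl", "VBD"): 0.02, ("qtl", "NN"): 0.01}

# Toy morphological factor: tense was only ever observed with the verbal tag.
p_m_given_t = {("tense=past", "VBD"): 0.6}

scores = {
    tag: factored_score(p_w_given_t, p_m_given_t, "qtl", "tense=past", tag)
    for tag in ("VBD", "NN")
}
print(scores)
```

The word factor alone leaves both tags viable; the morphological factor removes the nominal reading because (tense=past, NN) was never observed.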

4. Fragment Parsing Model: Dirichlet Process Tree Substitution Grammars

For our task, a shortcoming of CFG-based grammars is that they do not explicitly
capture idiomatic usage. For example, consider the two utterances:

(10)

a. He kicked the bucket.

b. He kicked the pail.

Unless horizontal markovization is applied, PCFGs generate words independently.
Consequently, no phrasal rule parameter in the model differentiates between
Examples (10a) and (10b). Recall, however, that in our representation, Example (10a)
should receive a flat analysis as MWV, whereas Example (10b) should have a conventional
analysis of the transitive verb kicked and its two arguments.

TSGs are weakly equivalent to CFGs, but with greater strong generative capacity
(Joshi and Schabes 1997). TSGs can store lexicalized tree fragments as rules.
Consequently, if we have seen [MWV kicked the bucket] several times before, we can
store that whole lexicalized fragment in the grammar.

We consider the non-parametric probabilistic TSG (PTSG) model of Cohn,
Goldwater, and Blunsom (2009) in which tree fragments are drawn from a Dirichlet
process (DP) prior.4 The DP-TSG can be viewed as a data-oriented parsing (DOP)
model (Scha 1990; Bod 1992) with Bayesian parameter estimation. A PTSG is a 5-tuple
⟨V, Σ, R, ♦, θ⟩ where c ∈ V are non-terminals; t ∈ Σ are terminals; e ∈ R are elementary
trees;5 ♦ ∈ V is a unique start symbol; and θc,e ∈ θ are parameters for each tree
fragment. A PTSG derivation is created by successively applying the substitution
operator to the leftmost frontier node (denoted by c+). All other nodes are internal
(denoted by c−).

In the supervised setting, DP-TSG grammar extraction reduces to a segmentation
problem. We have a treebank T that we segment into the set R, a process that is modeled
with Bayes’ rule:

$p(R \mid T) \propto p(T \mid R)\, p(R)$    (5)

Because the tree fragments completely specify each tree, p(T | R) is either 0 or 1, so all
work is performed by the prior over the set of elementary trees.

The DP-TSG contains a DP prior for each c ∈ V and generates a tree fragment e
rooted at non-terminal c according to:

$\theta_c \mid c, \alpha_c, P_0(\cdot \mid c) \sim \mathrm{DP}(\alpha_c, P_0)$

$e \mid \theta_c \sim \theta_c$

4 Similar models were developed independently by O’Donnell, Tenenbaum, and Goodman (2009) and Post

and Gildea (2009).

5 We use the terms tree fragment and elementary tree interchangeably.


Computational Linguistics, Volume 39, Number 1

Table 4
DP-TSG notation. For consistency, we largely follow the notation of Liang, Jordan, and
Klein (2010).

| Symbol | Description |
|---|---|
| $\alpha_c$ | DP concentration parameter for each non-terminal type c ∈ V |
| $P_0(e \mid c)$ | CFG base distribution |
| $\mathbf{x}$ | Set of all non-terminal nodes in the treebank |
| $\mathcal{S}$ | Set of sampling sites (one for each $x \in \mathbf{x}$) |
| $S$ | A block of sampling sites, where $S \subseteq \mathcal{S}$ |
| $\mathbf{b} = \{b_s\}_{s \in S}$ | Binary variables to be sampled ($b_s = 1$ for frontier nodes) |
| $\mathbf{z}$ | Latent state of the segmented treebank |
| $m$ | Number of sites $s \in S$ s.t. $b_s = 1$ |
| $\mathbf{n} = \{n_{c,e}\}$ | Sufficient statistics of $\mathbf{z}$ |
| $\Delta n^{S:m}$ | Change in counts by setting m sites in S |

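Table 4 defines notation. The data likelihood is given by the latent state z and the
parameters θ:

$p(\mathbf{z} \mid \theta) = \prod_{z \in \mathbf{z}} \theta_{c,e}^{\,n_{c,e}(z)}$

Integrating out the parameters, we have:

$p(\mathbf{z}) = \prod_{c \in V} \frac{\prod_{e} \left(\alpha_c P_0(e \mid c)\right)^{\overline{n_{c,e}(\mathbf{z})}}}{\alpha_c^{\,\overline{n_{c,\cdot}(\mathbf{z})}}}$    (6)

where $x^{\overline{n}} = x(x + 1) \cdots (x + n - 1)$ is the rising factorial.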
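As a numerical illustration of Equation (6), the following sketch evaluates log p(z) from a table of fragment counts. The dictionary layout and the function names are assumptions made for this example; the rising factorial is computed through the log-gamma function, which requires every α_c P0(e|c) to be strictly positive.

```python
import math

def log_rising_factorial(x, n):
    """log of x (x+1) ... (x+n-1); equals 0.0 when n == 0. Requires x > 0."""
    return math.lgamma(x + n) - math.lgamma(x)

def log_p_z(counts, alpha, p0):
    """Log of Equation (6).

    counts[c][e] = n_{c,e}(z), alpha[c] = alpha_c, p0[c][e] = P0(e | c).
    All three containers are hypothetical inputs for this sketch.
    """
    total = 0.0
    for c, frag_counts in counts.items():
        for e, n in frag_counts.items():
            total += log_rising_factorial(alpha[c] * p0[c][e], n)
        total -= log_rising_factorial(alpha[c], sum(frag_counts.values()))
    return total

# Two fragments rooted at NP, each used once, with P0 = 0.5 each and alpha = 1.
counts = {"NP": {"frag_a": 1, "frag_b": 1}}
alpha = {"NP": 1.0}
p0 = {"NP": {"frag_a": 0.5, "frag_b": 0.5}}
prob = math.exp(log_p_z(counts, alpha, p0))
```

In this toy case the result agrees with drawing the two fragments in sequence: the first has probability 0.5 and the second 0.5 / 2, giving 0.125.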

Base Distribution. The base distribution P0 is the same maximum likelihood PCFG used
in the Stanford parser.6 After applying the manual grammar features, we perform sim-
ple right binarization, collapse unary rules, and replace rare words with their signatures
(Petrov et al. 2006).

For each non-terminal type c, we learn a stop probability $q_c \sim \mathrm{Beta}(1, 1)$. Under
P0, the probability of generating a tree fragment A+ → B− C+ composed of non-terminals is

$P_0(A^+ \rightarrow B^-\, C^+) = p_{\mathrm{MLE}}(A \rightarrow B\ C)\; q_B\, (1 - q_C)$    (7)

Unlike Cohn, Goldwater, and Blunsom (2009), we penalize lexical insertion:

$P_0(c \rightarrow t) = p_{\mathrm{MLE}}(c \rightarrow t)\; p(t)$    (8)

where p(t) is equal to the MLE unigram probability of t in the treebank. Lexicalizing a
rule makes it very speciﬁc, so we generally want to avoid lexicalization with rare words.
Empirically, we found that this penalty reduces overﬁtting.
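To make the two cases of the base distribution concrete, here is a small sketch of Equations (7) and (8). The probability tables and their keys are hypothetical toy values, not estimates from either treebank.

```python
def p0_binary(p_mle, q, parent, left, right):
    """Equation (7) for the fragment parent+ -> left- right+."""
    return p_mle[(parent, (left, right))] * q[left] * (1.0 - q[right])

def p0_lexical(p_mle, unigram, tag, word):
    """Equation (8): the MLE lexical rule probability times the unigram probability p(t)."""
    return p_mle[(tag, (word,))] * unigram[word]

# Hypothetical toy parameters.
p_mle = {
    ("NP", ("D", "N")): 0.4,
    ("N", ("bucket",)): 0.01,
    ("N", ("time",)): 0.01,
}
q = {"D": 0.5, "N": 0.25}
unigram = {"bucket": 0.0001, "time": 0.002}

binary = p0_binary(p_mle, q, "NP", "D", "N")
rare = p0_lexical(p_mle, unigram, "N", "bucket")
frequent = p0_lexical(p_mle, unigram, "N", "time")
```

Although both lexical rules have the same MLE probability in this toy table, the rare word receives a much smaller base probability, which is the intended penalty on lexicalizing with rare words.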

Type-Based Inference Algorithm. To learn the parameters θ we use the collapsed, block
Gibbs sampler of Liang, Jordan, and Klein (2010). We sample binary variables bs
associated with each sampling site s in the treebank. The key idea is to select a block

6 The Stanford parser is a product model which scores parses with both a dependency grammar and a

PCFG. We extract the TSG from the manually split PCFG only. Bansal and Klein (2010) also experimented
with manual grammar features in an all-fragments (parametric) TSG for English.


[Figure: training tree over the string “ Jacques Chirac ” with nodes NP+, PUNC−, N+, N−,
and PUNC+; the two PUNC nodes are marked as sampling sites (1) and (2).]

Figure 2
Example of two conflicting sites of the same type in a training tree. Define the type of a
site $t(\mathbf{z}, s) \stackrel{\mathrm{def}}{=} (\Delta n^{s:0}, \Delta n^{s:1})$. Sites (1) and (2) have the same type because
$t(\mathbf{z}, s_1) = t(\mathbf{z}, s_2)$. The two sites conflict, however, because the probabilities of setting
$b_{s_1}$ and $b_{s_2}$ both depend on counts for the tree fragment rooted at NP. Consequently,
sites (1) and (2) are not exchangeable: The probabilities of their assignments depend on
the order in which they are sampled.

of exchangeable sites S of the same type that do not conflict (Figure 2). Because the
sites in S are exchangeable, we can set bS randomly if we know m, the number of sites
with bs = 1. This algorithm is not a contribution of this article, so we refer the interested
reader to Liang, Jordan, and Klein (2010) for details.

After each Gibbs iteration, we sample each stop probability qc directly using
binomial-Beta conjugacy. We also infer the DP concentration parameters αc with the
auxiliary variable procedure of West (1995).

Decoding. To decode, we first create a maximum a posteriori (MAP) grammar in which
tree fragments have fixed estimates according to a single sample from the DP-TSG:

$\theta_{c,e} = \frac{n_{c,e}(\mathbf{z}) + \alpha_c P_0(e \mid c)}{n_{c,\cdot}(\mathbf{z}) + \alpha_c}$    (9)


This MAP grammar has an infinite rule set, however, because elementary trees with
zero count in n have some residual probability under P0. We discard all zero-count trees
except for the zero-count CFG rules in P0. Scores for these rules follow from Equation (9)
with $n_{c,e}(\mathbf{z}) = 0$. This grammar represents most of the probability mass and permits
inference using dynamic programming (Cohn, Goldwater, and Blunsom 2009).
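A minimal sketch of Equation (9) for a single non-terminal follows. The function name, the fragment identifiers, and the toy numbers are assumptions for illustration; the sketch only mirrors the arithmetic, including the case of a retained zero-count CFG rule.

```python
def map_grammar(counts, alpha_c, p0):
    """MAP weights (Equation 9) for fragments rooted at one non-terminal c.

    counts[e] = n_{c,e}(z) for fragments observed in the sample;
    p0[e] = P0(e | c) for observed fragments and retained zero-count CFG rules.
    """
    n_total = sum(counts.values())
    return {
        e: (counts.get(e, 0) + alpha_c * p0[e]) / (n_total + alpha_c)
        for e in p0
    }

# Hypothetical sample statistics: two cached fragments and one unseen CFG rule.
counts = {"frag_a": 3, "frag_b": 7}
p0 = {"frag_a": 0.2, "frag_b": 0.3, "cfg_rule": 0.5}
theta = map_grammar(counts, alpha_c=1.0, p0=p0)
```

In this toy setting P0 sums to one over the three entries, so the weights also sum to one. In the real grammar the discarded zero-count fragments carry away a small amount of mass, which is why the text says the retained grammar represents most, not all, of the probability.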

Because the derivations of a TSG are context-free (Vijay-Shanker and Weir 1993),
we can form a CFG of the derivation sequences and use a synchronous CFG to translate
the most probable CFG parse to its TSG derivation. Consider a unique tree fragment
ei rooted at cj with frontier γ, which is a sequence of terminals and non-terminals. We
encode this fragment as an SCFG rule of the form


$[\, c_j \rightarrow \gamma \;,\; c_j \rightarrow i,\, c_k,\, c_l,\, \ldots \,]$    (10)

where ck, cl, . . . is a ﬁnite-length sequence of the non-terminal frontier nodes in γ.7 The
SCFG translates the input string to a sequence of tree fragment indices. Because the
TSG substitution operator applies to the leftmost frontier node, the best TSG parse can
be deterministically recovered from the sequence of indices.
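The encoding in Equation (10) and the deterministic recovery step can be sketched as follows. This is an illustrative toy, not the cdec rule format: a fragment is a (label, children) pair, a terminal is a string, an open frontier non-terminal is (label, None), and the three fragments are hypothetical.

```python
def frontier(frag):
    """Left-to-right frontier of a fragment: terminals and open non-terminals."""
    label, children = frag
    if children is None:
        return [frag]
    out = []
    for child in children:
        out.extend([child] if isinstance(child, str) else frontier(child))
    return out

def scfg_rule(index, frag):
    """Equation (10): source side is gamma; target side is the index plus open sites."""
    gamma = frontier(frag)
    source = [x if isinstance(x, str) else x[0] for x in gamma]
    target = [index] + [x[0] for x in gamma if not isinstance(x, str)]
    return frag[0], source, target

def rebuild(indices, fragments):
    """Rebuild the tree by always substituting at the leftmost open frontier node."""
    it = iter(indices)

    def expand(frag):
        label, children = frag
        if children is None:
            return expand(fragments[next(it)])
        return (label, [c if isinstance(c, str) else expand(c) for c in children])

    return expand(fragments[next(it)])

FRAGS = {
    0: ("S", [("NP", None), ("VP", None)]),
    1: ("NP", [("PRP", ["He"])]),
    2: ("VP", [("MWV", [("V", ["kicked"]), ("D", ["the"]), ("N", ["bucket"])])]),
}
rule0 = scfg_rule(0, FRAGS[0])
rule2 = scfg_rule(2, FRAGS[2])
tree = rebuild([0, 1, 2], FRAGS)
```

Because each target side lists the fragment index before its open non-terminals, the translated output is a pre-order listing of fragment indices, which is exactly the order in which leftmost substitution consumes them.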

7 This formulation is due to Chris Dyer.


Table 5
Gross corpus statistics for the pre-processed corpora used to train and evaluate our models. We
compare to the WSJ section of the PTB: train (Sections 02–21); dev. (Section 22); test (Section 23).
Due to its flat annotation style, the FTB sentences have fewer constituents per sentence. In the
ATB, morphological variation accounts for the high proportion of word types to sentences.

| Split | Statistic | ATB | FTB | WSJ |
|---|---|---|---|---|
| Train | #sentences | 18,818 | 13,448 | 39,832 |
| Train | #tokens | 597,933 | 397,917 | 950,028 |
| Train | #word types | 37,188 | 26,536 | 44,389 |
| Train | #POS types | 32 | 30 | 45 |
| Train | #phrasal types | 31 | 24 | 27 |
| Train | avg. length | 31.8 | 29.6 | 23.9 |
| Dev. | #sentences | 2,318 | 1,235 | 1,700 |
| Dev. | #tokens | 70,656 | 38,298 | 40,117 |
| Dev. | #word types | 12,358 | 6,794 | 6,840 |
| Dev. | avg. length | 30.5 | 31.0 | 23.6 |
| Dev. | OOV rate | 15.6% | 17.8% | 12.8% |
| Test | #sentences | 2,313 | 1,235 | 2,416 |
| Test | #tokens | 70,065 | 37,961 | 56,684 |

The SCFG formulation has a practical beneﬁt: We can take advantage of the heavily
optimized SCFG decoders for machine translation. We use cdec (Dyer et al. 2010) to ﬁnd
the Viterbi derivation for each input string.

5. Training Data and Morphological Analyzers

We have described two supervised parsing models for Arabic and French. Now we
show how to construct MWE-aware training resources for them.

The corpora used in our experiments are the Penn Arabic Treebank (ATB)
(Maamouri et al. 2004) and the French Treebank (FTB) (Abeillé, Clément, and Kinyon
2003). Prior to parsing, both treebanks require significant pre-processing, which
we perform automatically.8 Because parsing evaluation metrics are sensitive to the
terminal/non-terminal ratio (Rehbein and van Genabith 2007), we only remove
non-terminal labels in the case of unary rewrites of the same category (e.g., NP → NP)
(Johnson 1998). Table 5 compares the pre-processed corpora with the WSJ section of
the PTB. Appendix C compares the annotation consistency of the ATB, FTB, and WSJ.

5.1 Arabic Treebank

We work with parts 1–3 (newswire) of the ATB,9 which contain documents from three
different news agencies. In addition to phrase structure markers, each syntactic tree also
contains per-token morphological analyses.

8 Tree manipulation is automated with Tregex/Tsurgeon (Levy and Andrew 2006). Our pre-processing

package is available at http://nlp.stanford.edu/software/lex-parser.shtml.

9 LDC catalog numbers: LDC2008E61, LDC2008E62, and LDC2008E22.


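Table 6
Frequency distribution of the MWE types in the ATB and FTB training sets.

| Category | Type | ATB count | ATB % | FTB count | FTB % |
|---|---|---|---|---|---|
| MWN | noun | 6,975 | 91.6% | 9,680 | 49.7% |
| MWP | prep. | 623 | 8.18% | 3,526 | 18.1% |
| MWA | adj. | 18 | 0.24% | 324 | 1.66% |
| MWPRO | pron. | – | – | 266 | 1.37% |
| MWC | conj. | – | – | 814 | 4.18% |
| MWADV | adverb | – | – | 3,852 | 19.8% |
| MWV | verb | – | – | 585 | 3.01% |
| MWD | det. | – | – | 328 | 1.69% |
| MWCL | clitic | – | – | 59 | 0.30% |
| MWET | foreign | – | – | 24 | 0.12% |
| MWI | interj. | – | – | 4 | 0.02% |
| Total | | 7,616 | | 19,462 | |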

Tokenization/Segmentation. We retained the default ATB clitic segmentation scheme.

Morphological Analysis. The ATB contains gold per-token morphological analyses, but
no lemmas.

Tag Sets. We used the POS tag set described by Kulick, Gabbard, and Marcus (2006). We
previously showed that the “Kulick” tag set is very effective for basic Arabic parsing
(Green and Manning 2010).

MWE Tagging. The ATB does not mark MWEs. Therefore, we merged an existing Arabic
MWE list (Attia et al. 2010b) with the constituency trees.10 For each string from the MWE
list that was bracketed in the treebank, we flattened the structure over the MWE span
and added a non-terminal label according to the MWE type (Table 6). We ignored MWE
strings that crossed constituent boundaries.
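The merging step can be sketched as below. This is an assumption-laden illustration and not the Tregex/Tsurgeon scripts used for the actual pre-processing: trees are nested lists, a preterminal is [tag, word], the MWE list is a dictionary from word tuples to MWE labels, English placeholder tokens stand in for Arabic, and the sketch replaces the constituent label with the MWE label.

```python
def leaves(tree):
    """Return the preterminal nodes [tag, word] under a tree, in order."""
    if isinstance(tree[1], str):
        return [tree]
    return [leaf for child in tree[1:] for leaf in leaves(child)]

def mark_mwes(tree, mwe_types):
    """Flatten any constituent whose yield is a listed MWE and relabel it.

    Only complete constituent yields are tested, so MWE strings that cross
    constituent boundaries are never matched.
    """
    if isinstance(tree[1], str):
        return tree
    pre = leaves(tree)
    words = tuple(leaf[1] for leaf in pre)
    if words in mwe_types:
        return [mwe_types[words]] + pre
    return [tree[0]] + [mark_mwes(child, mwe_types) for child in tree[1:]]

# Hypothetical MWE list entry and tree (English placeholders).
mwe_types = {("prime", "minister"): "MWN"}
tree = ["S",
        ["NP", ["NN", "prime"], ["NP", ["NN", "minister"]]],
        ["VP", ["VBD", "arrived"]]]
marked = mark_mwes(tree, mwe_types)
```

The nested NP over the second noun disappears in the output, leaving a flat MWN bracket over both preterminals.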

Orthographic Normalization. Orthographic normalization has a significant impact on
parsing accuracy. We removed all diacritics, instances of taTwiil,11 and pro-drop markers.
We also applied alif normalization12 and mapped punctuation and numbers to their
Latin equivalents.
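A rough sketch of the token-level normalization steps follows, assuming the text is in Arabic script encoded as Unicode. The code point ranges are this example's choice and may not match the released pre-processing package; punctuation mapping and removal of pro-drop markers (a tree-level operation) are not covered.

```python
import re

DIACRITICS = re.compile("[\u064B-\u0652]")          # fathatan through sukun
TATWEEL = "\u0640"                                  # elongation character
ALIF_VARIANTS = re.compile("[\u0622\u0623\u0625]")  # madda, hamza above, hamza below
BARE_ALIF = "\u0627"
# Arabic-Indic digits U+0660..U+0669 mapped to ASCII digits.
DIGITS = str.maketrans({chr(0x0660 + i): str(i) for i in range(10)})

def normalize(token):
    """Strip diacritics and tatweel, unify alif variants, and map digits."""
    token = DIACRITICS.sub("", token)
    token = token.replace(TATWEEL, "")
    token = ALIF_VARIANTS.sub(BARE_ALIF, token)
    return token.translate(DIGITS)

# alif-with-hamza + fatha + kaf + tatweel + lam
sample = normalize("\u0623\u064E\u0643\u0640\u0644")
number = normalize("\u0661\u0662")
```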

Corpus Split. We divided the ATB into training/development/test sections according
to the split prepared by Mona Diab for the 2005 Johns Hopkins workshop on parsing
Arabic dialects (Rambow et al. 2005).13

10 The list of 30,277 distinct MWEs is available at: http://sourceforge.net/projects/arabicmwes/.
11 taTwiil (ـ) is an elongation character for justifying text. It has no morphosyntactic function or
phonetic realization.
12 Variants of alif [ا, أ, إ, آ] are inconsistent in Arabic text.
13 The corpus split is available at: http://nlp.stanford.edu/projects/arabic.shtml.


5.2 French Treebank

The FTB14 contains phrase structure trees with morphological analyses and lemmas. In
addition, the FTB explicitly annotates MWEs. POS tags for MWEs are given not only
at the MWE level, but also internally: Most tokens that constitute an MWE also have
a POS tag. Our FTB pre-processing is largely consistent with Lexicon-Grammar, which
defines MWE categories based on the global POS.

Tokenization/Segmentation. We changed the default tokenization for numbers by fusing
adjacent digit tokens. For example, 500 000 is tagged as an MWE composed of two
words, 500 and 000. We fused these into the single token 500000 and removed the MWE
POS. We also merged numbers like “17,9”.
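The digit-fusing step might look like the sketch below. It assumes the input is the token list of a single numeric MWE span, since fusing every pair of adjacent digit tokens in running text could join unrelated numbers; decimal forms like “17,9” are not handled here.

```python
def fuse_digits(tokens):
    """Merge runs of adjacent all-digit tokens into one token."""
    fused = []
    for tok in tokens:
        if fused and tok.isdigit() and fused[-1].isdigit():
            fused[-1] += tok
        else:
            fused.append(tok)
    return fused

span = fuse_digits(["500", "000"])
mixed = fuse_digits(["500", "000", "francs"])
```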

Morphological Analysis. The FTB provides both gold morphological analyses and lemmas
for 86.6% of the tokens. The remaining tokens lack morphological analyses, and in many
cases basic parts of speech. We restored the basic parts of speech by assigning each token
its most frequent POS tag elsewhere in the treebank.15 This technique was too coarse for
missing morphological analyses, which we left empty.
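The most-frequent-tag heuristic can be sketched in a few lines. The data layout (a list of word and tag pairs with None for a missing tag) and the example tags are assumptions for this illustration.

```python
from collections import Counter, defaultdict

def restore_pos(tokens):
    """Fill missing tags (None) with the word's most frequent tag elsewhere."""
    seen = defaultdict(Counter)
    for word, pos in tokens:
        if pos is not None:
            seen[word][pos] += 1
    restored = []
    for word, pos in tokens:
        if pos is None and word in seen:
            pos = seen[word].most_common(1)[0][0]
        restored.append((word, pos))
    return restored

# Hypothetical tokens: "la" is usually a determiner; "insu" is never tagged.
toy = [("la", "D"), ("la", "D"), ("la", "CL"), ("la", None), ("insu", None)]
restored = restore_pos(toy)
```

Word types that never appear with a tag stay unlabeled in this sketch, which corresponds to the cases that were tagged manually according to footnote 15.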

Tag Sets. We transformed the raw POS tags to the CC tag set (Crabbé and Candito
2008), which is now the standard tag set in the French parsing literature. The CC tag
set includes WH markers and verbal mood information.

MWE Tagging. We added the 11 MWE labels shown in Table 6. We mark MWEs with
a flat bracketing in which the phrasal label is the MWE-level POS tag with an MW
prefix, and the preterminals are the internal POS tags for each terminal. The resulting
POS sequences are not always unique to MWEs: They appear in abundance elsewhere
in the corpus. Some MWEs contain normally ungrammatical POS sequences, however
(e.g., adverb à la va vite (‘in a hurry’): P D V ADV [at the goes quick]), and some words
appear only as part of an MWE, such as insu in à l’insu de (‘to the ignorance of’). We also
found that 36 MWE spans still lacked a global POS. To restore these labels, we assigned
the most frequent label for that internal POS sequence elsewhere in the corpus.

Corpus Split. We used the 80/10/10 split described by Crabbé and Candito (2008). They
used a previous release of the treebank with 12,531 trees. Subsequently, 3,391 trees were
added to the FTB. We appended these extra trees to the training set, thus preserving the
original development and test sets.

5.3 Morphological Analysis for Arabic and French

The factored lexicon requires predicted per-token morphological analyses at test time.
We used separately trained, language-specific tools to obtain these analyses (Table 7).

14 Version from June 2010. We used the subset of the FTB with functional annotations, not for those

annotations but because this subset is known to be more consistently annotated. Appendix B compares
our pre-processed version of the FTB to other versions in prior work.

15 Seventy-three of the unlabeled word types did not appear elsewhere in the treebank. All but 11 of these

were nouns. We manually assigned the correct tags, but we would not expect a negative effect by
deterministically labeling all of them as nouns.


Table 7
Linguistic resources required by the factored lexicon. Equivalent resources for Arabic and French
do not presently exist. The ATB lacks gold lemmas, and a French morphological ranker
equivalent to MADA (which can produce the full set of morphosyntactic features specified in
the ATB) has not been developed. Morfette is effectively a discriminative classifier that treats
analyses as atomic labels, whereas MADA utilizes a morphological generator.

| Resource | Arabic (ATB) | French (FTB) |
|---|---|---|
| Gold Morphological Features | Gender, Number, Tense, Person, Mood, Voice, Definiteness | Gender, Number, Tense, Person |
| Gold Lemmas | × | ✓ |
| Morphological Analyzer | ✓ (SAMA) | × |
| Morphological Ranker | ✓ (MADA) | ✓ (Morfette) |
| Lemmatizer | ✓ (MADA) | ✓ (Morfette) |

Arabic. The morphological analyses in the ATB are human-selected outputs of the
Standard Arabic Morphological Analyzer (SAMA),16 a deterministic system that relies
on manually compiled linguistic dictionaries. The latest version of SAMA has complete
lexical coverage of the ATB; thus it does not encounter unseen word types at test time.
To rank the output of SAMA, we use MADA (Habash and Rambow 2005),17 which
makes predictions based on an ensemble of support vector machine (SVM) classifiers.

French. The FTB includes morphological analyses for gender, number, person, tense,
type of pronouns (relative, reflexive, interrogative), type of adverbs (relative or
interrogative), and type of nouns (proper vs. common noun). Morfette (Chrupala, Dinu,
and van Genabith 2008) has been used in previous FTB parsing experiments (Candito
and Seddah 2010; Seddah et al. 2010) to predict these features in addition to lemmas.
Morfette is a discriminative sequence classifier that relies on lexical and greedy left
context features. Because Morfette lacks a morphological generator like SAMA, however, it
is effectively a tagger that must predict a very large tag set. We trained Morfette on our
split of the FTB and evaluated accuracy on the development set: 88.3% (full
morphological tagging); 95.0% (lemmatization); and 86.5% (full tagging and lemmatization).18

6. Experiments

For each language, we ran two experiments: standard parsing and MWE identification.
The evaluation included the Stanford, Stanford+factored lexicon, and DP-TSG
models.

All experiments used gold tokenization/segmentation. Unlike the ATB, the FTB
does not contain the raw source documents, so we could not start from raw text for both

16 LDC catalog number LDC2010L01.
17 We used version 3.1. According to the user manual, the training set for the distributed models overlaps
with our ATB development and test sets. Training scripts/procedures are not distributed with MADA,
however.

18 Morfette training settings: 10 tag and 3 lemma training iterations. We excluded punctuation tokens from

the morphological tagging evaluation because our parsers split punctuation deterministically.


languages. We previously showed that segmentation errors decrease Arabic parsing
accuracy by about 2.0% F1 (Green and Manning 2010).

Morphological analysis accuracy was another experimental resource asymmetry
between the two languages. The morphological analyses were obtained with
significantly different tools: in Arabic, we had a morphological generator/ranker (MADA),
whereas for French we had only a discriminative classifier (Morfette). Consequently,
French analysis quality was lower (Section 5.3).

6.1 Standard Parsing Experiments

Baselines. We included two parsing baselines: a parent-annotated PCFG (PAPCFG) and
a PCFG with the grammar features in the Stanford parser (SplitPCFG). The PAPCFG is
the standard baseline for TSG models (Cohn, Goldwater, and Blunsom 2009).

Berkeley Parser. We previously reported optimal Berkeley parser (Petrov et al. 2006)
parameterizations for both the Arabic (Green and Manning 2010) and French (Green et al.
2011) data sets.19 For Arabic, our pre-processing and parameter settings significantly
improved on the best published Berkeley ATB baseline. Others had used the Berkeley
parser for French, but on an older revision of the FTB. To our knowledge, we are the
first to use the Berkeley parser for MWE identification.

Factored Lexicon Features. We selected features for the factored lexicon on the
development sets. For Arabic, we used gender, number, tense, mood, and definiteness. For
French, we used the grammatical and syntactic features in the CC tag set in addition
to grammatical number. For the experiments in which we evaluated with predicted
morphological analyses, we also trained the parser on predicted analyses.

Evaluation Metrics. We report three evaluation metrics. Evalb is the standard labeled
precision/recall metric.20 Leaf Ancestor measures the cost of transforming guess trees
to the reference (Sampson and Babarczy 2003), and is less biased against flat
treebanks like the FTB (Rehbein and van Genabith 2007). The Leaf Ancestor score ranges
from 0 to 1 (higher is better). We report micro-averaged (Corpus) and macro-averaged
(Sent.) scores. Finally, EX% is the percentage of perfectly parsed sentences according
to Evalb.
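For readers unfamiliar with the metric, here is a simplified sketch of labeled bracket precision, recall, and F1 for one sentence, together with the exact-match test behind EX%. It is not the Evalb reference implementation: the usual parameter-file conventions (for example, punctuation handling) are ignored, and brackets are assumed to be (label, start, end) triples.

```python
from collections import Counter

def labeled_prf(gold_brackets, guess_brackets):
    """Return (precision, recall, F1, exact_match) over labeled brackets."""
    gold, guess = Counter(gold_brackets), Counter(guess_brackets)
    matched = sum((gold & guess).values())   # multiset intersection
    p = matched / sum(guess.values()) if guess else 0.0
    r = matched / sum(gold.values()) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1, gold == guess

# Hypothetical brackets for a three-word sentence.
gold = [("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3), ("NP", 2, 3)]
guess = [("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3), ("PP", 2, 3)]
p, r, f1, exact = labeled_prf(gold, guess)
```

Corpus-level scores would sum matched, gold, and guess bracket counts over all sentences before dividing, and EX% would be the share of sentences for which the exact-match flag is true.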

Sentence Lengths. We report results for sentences of lengths ≤ 40 words. This cutoff
accounts for similar proportions of the ATB and FTB. The DP-TSG grammar extractor
produces very large grammars for Arabic,21 and we found that the grammar constant
was too large for parsing all sentences. For example, the ATB development set contains
a sentence that is 268 tokens long.

19 Berkeley training settings: right binarization, no parent annotation, and six split-merge cycles. Results are

the average of three runs in which the random number generator was seeded with the system time.
20 Available at http://nlp.cs.nyu.edu/evalb/ (v.20080701). We used a Java re-implementation included

in the Stanford parser distribution that is compatible with the reference implementation.

21 Average DP-TSG grammar sizes: Arabic, 89,003 rules; French, 46,515 rules.


Table 8
Arabic standard parsing experiments (test set, sentences ≤ 40 words). SplitPCFG is the same
grammar used in the Stanford parser, but without the dependency model. FactLex uses basic
POS tags predicted by the parser and morphological analyses from MADA. FactLex* uses gold
morphological analyses. Berkeley and DP-TSG results are the average of three independent runs.

| Arabic | Leaf Ancestor (Sent.) | Leaf Ancestor (Corpus) | Evalb LP | Evalb LR | Evalb F1 | EX% |
|---|---|---|---|---|---|---|
| PAPCFG | 0.777 | 0.745 | 69.5 | 64.6 | 66.9 | 12.9 |
| SplitPCFG | 0.821 | 0.797 | 75.6 | 73.4 | 74.5 | 17.8 |
| Berkeley | 0.865 | 0.853 | 83.3 | 82.7 | 83.0 | 24.0 |
| DP-TSG | 0.822 | 0.800 | 75.5 | 75.4 | 75.4 | 17.7 |
| Stanford | 0.851 | 0.835 | 81.3 | 80.7 | 81.0 | 23.5 |
| Stanford+FactLex | 0.849 | 0.835 | 81.2 | 80.8 | 81.0 | 22.8 |
| Stanford+FactLex* | 0.852 | 0.837 | 81.8 | 81.3 | 81.5 | 24.0 |

Results. Tables 8 and 9 show Arabic and French parsing results, respectively. For both
languages, the Berkeley parser produces the best results in terms of Evalb F1. The gold
factored lexicon setting compares favorably in terms of exact match.

6.2 MWE Identiﬁcation Experiments

The predominant approach to MWE identiﬁcation is the combination of lexical associa-
tion measures (surface statistics) with a binary classiﬁer (Pecina 2010). A state-of-the-art,
language-independent package that implements this approach for higher order n-grams
is mwetoolkit (Ramisch, Villavicencio, and Boitet 2010).

mwetoolkit Baseline. We conﬁgured mwetoolkit with the four standard lexical features:
the maximum likelihood estimator, Dice’s coefﬁcient, pointwise mutual information,
and Student’s t-score. We also included POS tags predicted by the Stanford tagger
(Toutanova et al. 2003). We ﬁltered the training instances by removing unigrams and

Table 9
French standard parsing experiments (test set, sentences ≤ 40 words). FactLex uses basic POS
tags predicted by the parser and morphological analyses from Morfette. FactLex* uses gold
morphological analyses.

| French | Leaf Ancestor (Sent.) | Leaf Ancestor (Corpus) | Evalb LP | Evalb LR | Evalb F1 | EX% |
|---|---|---|---|---|---|---|
| PAPCFG | 0.857 | 0.840 | 73.5 | 72.8 | 73.1 | 14.5 |
| SplitPCFG | 0.870 | 0.853 | 77.9 | 77.1 | 77.5 | 16.0 |
| Berkeley | 0.905 | 0.894 | 83.9 | 83.4 | 83.6 | 24.0 |
| DP-TSG | 0.858 | 0.841 | 77.1 | 76.8 | 76.9 | 16.0 |
| Stanford | 0.869 | 0.853 | 78.5 | 79.6 | 79.0 | 17.6 |
| Stanford+FactLex | 0.877 | 0.860 | 79.0 | 79.6 | 79.3 | 19.6 |
| Stanford+FactLex* | 0.890 | 0.874 | 82.8 | 84.0 | 83.4 | 27.4 |


Table 10
Arabic MWE identification per category and overall results (test set, sentences ≤ 40 words).

| Category | #gold | PAPCFG | SplitPCFG | Berkeley | DP-TSG | Stanford | FactLex | FactLex* |
|---|---|---|---|---|---|---|---|---|
| MWA | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| MWP | 34 | 36.9 | 76.9 | 81.2 | 91.8 | 88.2 | 88.2 | 86.6 |
| MWN | 465 | 9.8 | 66.7 | 74.6 | 81.1 | 76.6 | 77.0 | 77.5 |
| Total | 500 | 13.2 | 67.4 | 74.8 | 81.9 | 77.5 | 77.9 | 78.2 |

non-MWE n-grams that occurred only once. For each resulting n-gram, we created real-
valued feature vectors and trained a binary SVM classiﬁer with Weka (Hall et al. 2009)
with an RBF kernel. See Appendix D for further conﬁguration details.
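The four lexical features named above can be sketched for the bigram case. These are the textbook bigram definitions, given as an assumption; mwetoolkit's own formulas for higher order n-grams may differ in detail, and the counts below are invented.

```python
import math

def association_features(c_xy, c_x, c_y, n):
    """Association measures for a bigram (x, y).

    c_xy: bigram count; c_x, c_y: unigram counts; n: corpus size in tokens.
    Requires c_xy > 0.
    """
    expected = c_x * c_y / n  # expected bigram count under independence
    return {
        "mle": c_xy / n,
        "dice": 2 * c_xy / (c_x + c_y),
        "pmi": math.log2(c_xy / expected),
        "t": (c_xy - expected) / math.sqrt(c_xy),
    }

# Hypothetical counts for one candidate bigram.
feats = association_features(c_xy=8, c_x=10, c_y=16, n=1000)
```

A vector of such values, one per candidate n-gram, is the kind of real-valued input that a binary classifier such as the SVM described above would consume.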

Results. Because our parsers mark MWEs as labeled spans, MWE identification is a
by-product of parsing. Our evaluation metric is category-level Evalb for the MWE
non-terminal categories. We report both the per-category scores (Tables 10 and 11), and
a weighted average for all categories. Table 12 shows aggregate MWE identification
results. All parsing models, even the baselines, exceed mwetoolkit by a wide margin.
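Restricting the bracket comparison to MWE categories might be sketched as follows. The label prefix test and the bracket triples are assumptions for illustration, and the way the reported tables weight categories is not reproduced here.

```python
def mwe_f1(gold_brackets, guess_brackets, prefix="MW"):
    """F1 over labeled spans whose category starts with the MWE prefix."""
    gold = {b for b in gold_brackets if b[0].startswith(prefix)}
    guess = {b for b in guess_brackets if b[0].startswith(prefix)}
    matched = len(gold & guess)
    if matched == 0:
        return 0.0
    p, r = matched / len(guess), matched / len(gold)
    return 2 * p * r / (p + r)

# Hypothetical (label, start, end) brackets for one sentence.
gold = [("S", 0, 5), ("MWN", 0, 2), ("MWP", 3, 5)]
guess = [("S", 0, 5), ("MWN", 0, 2), ("NP", 3, 5)]
score = mwe_f1(gold, guess)
only_mwn = mwe_f1(gold, guess, prefix="MWN")
```

Passing a full category name as the prefix gives a per-category score, while the default prefix pools all MWE categories.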

7. Discussion

7.1 MWE Identiﬁcation Results

The main contribution of this article is Table 12, which summarizes MWE identification
results. For both languages, our parsing models yield substantial improvements over
the n-gram classification method represented by mwetoolkit. The best improvements
come from different models: The DP-TSG model achieves a 66.9% absolute F1
improvement for Arabic, and Stanford+FactLex* achieves a 50.0% absolute F1 improvement
for French.

Differences in how the training resources were constructed may account for differ-
ences in the ordering of the models. The Arabic MWE list consists mainly of named
entities and nominal compounds, hence the high concentration of MWN types in the

Table 11
French MWE identification per category and overall results (test set, sentences ≤ 40 words).
MWI and MWCL do not occur in the test set.

| Category | #gold | PAPCFG | SplitPCFG | Berkeley | DP-TSG | Stanford | FactLex | FactLex* |
|---|---|---|---|---|---|---|---|---|
| MWET | 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| MWV | 26 | 6.1 | 56.1 | 54.3 | 56.2 | 57.1 | 44.9 | 83.3 |
| MWA | 8 | 42.9 | 29.6 | 36.7 | 36.0 | 26.1 | 25.0 | 33.3 |
| MWN | 457 | 41.1 | 56.0 | 67.4 | 65.7 | 64.8 | 64.9 | 86.3 |
| MWD | 15 | 60.0 | 70.3 | 74.4 | 65.1 | 68.4 | 64.9 | 70.3 |
| MWPRO | 17 | 83.9 | 70.3 | 87.6 | 75.3 | 72.2 | 72.2 | 81.3 |
| MWADV | 220 | 46.8 | 68.0 | 72.5 | 77.2 | 75.0 | 76.0 | 87.9 |
| MWP | 162 | 49.0 | 78.9 | 81.4 | 79.5 | 81.2 | 81.9 | 92.9 |
| MWC | 47 | 74.2 | 80.7 | 83.7 | 85.8 | 86.3 | 88.2 | 97.9 |
| Total | 955 | 46.0 | 64.2 | 71.4 | 71.3 | 70.5 | 70.5 | 87.3 |


Table 12
MWE identification F1 of the parsing models vs. the mwetoolkit baseline (test set, sentences
≤ 40 words). FactLex* uses gold morphological analyses at test time.

| Model | Arabic F1 | French F1 |
|---|---|---|
| mwetoolkit (baseline) | 15.0 | 37.3 |
| PAPCFG | 13.2 | 46.0 |
| SplitPCFG | 67.4 | 64.2 |
| Berkeley | 74.8 | 71.4 |
| DP-TSG | 81.9 | 71.3 |
| Stanford | 77.5 | 70.5 |
| Stanford+FactLex | 77.8 | 70.5 |
| Stanford+FactLex* | 78.2 | 87.3 |

pre-processed ATB (Table 10). Consequently, this particular Arabic MWE identification
experiment is similar to joint parsing and named entity recognition (NER) (Finkel and
Manning 2009). The DP-TSG is effective at memorizing the entities and re-using them
at test time. It would be instructive to compare the DP-TSG to the discriminative model
of Finkel and Manning (2009), which currently represents the state of the art for joint
parsing and NER.

The Berkeley and DP-TSG models are equally effective at learning French MWE
rules. One explanation for this result could be the CC tag set, which was explicitly tuned
for the Berkeley parser. The CC tag set improved Berkeley MWE identification accuracy
by 1.8% F1 and basic parsing accuracy by 1.2% F1 over the previous version of our work
(Green et al. 2011), in which we used the basic FTB tag set. However, this tag set yielded
only 0.2% F1 and 1.1% F1 improvements, respectively, for the DP-TSG.

Interpretation of the factored lexicon results should account for resource
asymmetries. For French, the extraordinary result with gold analyses (Stanford+FactLex*) is
partly due to annotation errors. Gold morphological analyses are missing for many of
the MWE tokens in the FTB. The factored lexicon thus learns that when a token has no
morphology, it is usually part of an MWE. In the automatic setting (Stanford+FactLex),
however, Morfette tends to assign morphology to the MWE tokens because it has no
semantic knowledge. Consequently, the morphological predictions are less consistent,
and the parsing model falls back to the baseline Stanford result. Certainly more
consistent FTB annotations would help Morfette, which we found to be significantly less
accurate on our version of the FTB than MADA on the ATB (see Habash and Rambow
2005). Another remedy would be to incorporate MWE knowledge into the lexical
analyzer, a strategy that Constant, Sigogne, and Watrin (2012) recently found to be very
effective.

The Arabic factored lexicon results are more realistic. Stanford+FactLex* achieves
a 0.7% F1 improvement over Stanford along with a significant improvement in
exact match (EX%). In the automatic setting, a 0.3% F1 improvement is maintained
for MWE identification. One direction for improvement might be the POS tag set. The
“Kulick” tag set encodes some morphological information (e.g., number, definiteness),
so the factored lexicon can be redundant. Eliminating this overlap might improve
accuracy.

Computational Linguistics

Volume 39, Number 1

Table 13
Sample of human-interpretable Arabic TSG rules. Recursive rules like MWA→A MWA result
from memoryless binarization of n-ary rules. This pre-processing step not only increases parsing
accuracy, but also allows the generation of previously unseen MWEs of a given type.

MWN                     MWP                     MWA
‘military N’            ‘with MWP’              A MWA
‘Los Angeles’           ‘until now’             ‘high-level’
‘Prime Minister’        ‘local time’            ‘Soviet-made’
‘national N council’    ‘on the other hand’

(Arabic-script forms of the rules are not recoverable here; English glosses are shown.)

7.2 Interpretability of DP-TSG MWE Rules

Arabic. Table 13 lists a sample of the TSG rules learned by the DP-TSG model. Fixed
expressions such as names (Los Angeles) and titles (Prime Minister) are cached in the
grammar. The model also generalizes over nominal compounds with rules like military N,
which captures military coup, military council, and so forth. For multiword adjectives,
the model caches several instances of false iDafa in full (high-level, Soviet-made).
Memoryless binarization permits the grammar to capture rules like MWA → A MWA, which
permit generation of previously unseen multiword adjectives. Some of these recursive
rules are lexicalized, as in the multiword preposition rule glossed ‘with MWP’ in Table 13.

French. We find that the DP-TSG model also learns useful generalizations over French
MWEs. A sample of the rules is given in Table 14. Some specific sequences like
“[MWN [coup de N]]” are part of the grammar: such rules can indeed generate quite a few
MWEs, for example, coup de pied (‘kick’), coup de coeur, coup de foudre (‘love at first
sight’), coup de main (‘help’), coup d’état, coup de grâce. Some of these MWEs are unseen
in the training data. For MWV, the grammar contains “V de N” as in avoir de cesse (‘give
no peace’), perdre de vue [lose from sight] (‘forget’), prendre de vitesse [take from
speed] (‘outpace’). For prepositions, the grammar stores full subtrees of MWPs, but also
generalizes over very frequent sequences: “en N de” occurs in many multiword prepositions
(e.g., en compagnie de, en face de, en matière de, en terme de, en cours de, en faveur de,
en raison de, en fonction de). The TSG grammar thus provides a categorization of MWEs
consistent with the Lexicon-Grammar. It also learns verbal phrases which contain
discontiguous MWVs due to the insertion of an adverb or negation such as
“[VN [MWV va] [MWADV d’ailleurs] [MWV bon train]]” [go indeed well],
“[VN [MWV a] [ADV jamais] [MWV été question d’]]” [has never been in question].

Table 14
Sample of human-interpretable French TSG rules.

MWV          MWP               MWN
sous - V     de l’ordre de     sociétés de N
mis en N     y compris         chef de N
V DET N      au N de           coup de N
V de N       en N de           N d’état
V en N       ADV de            N de N
                               N à N

7.3 Basic Parsing Results

The relative rankings of the different models are the same for Arabic and French
(Berkeley > Stanford parser > DP-TSG > PAPCFG), and these rankings correspond to
those observed for English (Cohn, Blunsom, and Goldwater 2010). Although statistical
statements cannot be made about the difficulty of parsing the two languages by comparing
raw evaluation figures, we can compare the differences between PAPCFG and
the best model for each language. From this perspective, manual rule splitting in the
Stanford parser is apparently more effective for the ATB than for the FTB. Differences in
annotation styles may account for this discrepancy. Consider the unbinarized treebanks.
The ATB training set has 8,937 unique non-unary rule types with mean branching factor
m = 2.41 and sample standard deviation SD = 0.984. The FTB has a flat annotation
style, which leads to more rule types (16,159) with a higher branching factor (m = 2.87,
SD = 1.51).
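
The branching-factor statistics quoted above can be illustrated with a minimal sketch. It assumes that a rule type is a (parent, children) pair and that the branching factor of a rule type is the length of its right-hand side; the function name and the toy rule inventory are hypothetical and are not drawn from the ATB or FTB.

```python
from statistics import mean, stdev

def branching_stats(rule_types):
    """Return (count, mean, sample SD) of right-hand-side length over
    unique non-unary rule types. Each rule type is (lhs, rhs)."""
    unique = {(lhs, tuple(rhs)) for lhs, rhs in rule_types}
    lengths = [len(rhs) for _, rhs in unique if len(rhs) > 1]
    return len(lengths), mean(lengths), stdev(lengths)

# Toy rule inventory (hypothetical).
toy_rules = [
    ("NP", ("DET", "N")),
    ("NP", ("DET", "N")),              # duplicate type, counted once
    ("NP", ("DET", "N", "PP")),
    ("VP", ("V", "NP", "PP", "ADV")),
    ("NP", ("N",)),                    # unary, excluded
]
n, m, sd = branching_stats(toy_rules)
print(n, m, sd)
```

A flatter annotation style produces longer right-hand sides, which raises both the mean and the spread computed this way.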

A high branching factor can lead to more brittle grammars, an empirical observa-
tion that motivated memoryless binarization in both the Berkeley parser (Petrov et al.
2006, page 434) and the DP-TSG. The Berkeley parser results also seem to support the
observation that rule reﬁnement is less effective for the FTB. Automatic rule reﬁnement
results in a 16.1% F1 absolute improvement over PAPCFG for Arabic, but only 10.0% F1
for French.
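
To make the role of memoryless binarization concrete, the following sketch right-binarizes an n-ary rule using an intermediate symbol that records only the parent label. The direction of binarization, the "@" naming convention, and the function name are assumptions made for illustration; the exact scheme used by the Berkeley parser and the DP-TSG may differ in detail.

```python
def binarize_memoryless(lhs, rhs):
    """Right-binarize one n-ary rule. The intermediate symbol '@' + lhs
    records only the parent label, not which siblings were already
    generated (no horizontal memory)."""
    rhs = list(rhs)
    if len(rhs) <= 2:
        return [(lhs, tuple(rhs))]
    inter = "@" + lhs
    rules = [(lhs, (rhs[0], inter))]
    for child in rhs[1:-2]:
        rules.append((inter, (child, inter)))   # recursive rule
    rules.append((inter, (rhs[-2], rhs[-1])))
    return rules

# Flat adjective rules of length 4 and 5 yield the same binary rule set,
# so a grammar that saw only one length can still derive the other.
r4 = set(binarize_memoryless("MWA", ["A"] * 4))
r5 = set(binarize_memoryless("MWA", ["A"] * 5))
print(sorted(r4))
```

Because the intermediate symbol carries no sibling history, flat rules of different lengths collapse onto a small shared set of recursive rules, which reduces the sparsity that a high branching factor would otherwise induce.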

Of course, the FTB contains 28.5% fewer sentences than the ATB, so the FTB rule
counts are also sparser. Moreover, we found that the FTB has lower inter-annotator
agreement (IAA) than the ATB (Appendix C), which also negatively affects supervised
models. Finally, Evalb penalizes flat treebanks like the FTB (Rehbein and van
Genabith 2007). To counteract that bias, we also included a Leaf Ancestor evaluation.
However, even Leaf Ancestor showed that, with respect to PAPCFG, the best Arabic
model improved nearly twice as much as the best French model.

The DP-TSG improves over PAPCFG, but does not exceed the Berkeley parser. One
crucial difference between the two models is the decoding objective. The Berkeley parser
maximizes the expected rule count (max-rule-sum) (Petrov and Klein 2007), an objective
that Cohn, Blunsom, and Goldwater (2010) demonstrated could improve the DP-TSG
by 2.0% F1 over Viterbi for English with no changes to the grammar. We decoded with
Viterbi, so our results are likely a lower bound relative to what could be achieved with
objectives that correlate with labeled recall. Because MWE identiﬁcation is a by-product
of parsing, we expect that MWE identiﬁcation accuracy would also improve.

Because the DP-TSG and PAPCFG have the same weak generative capacity, the
improvement must come from relaxing independencies in the grammar rules (by saving
larger tree fragments). This is the same justification given for manual rule refinement
for PCFGs (Johnson 1998, page 614). We observe an 8.5% F1 absolute improvement
for Arabic, but just 3.8% F1 for French. However, we chose this model precisely for
its greater strong generative capacity, which we hypothesized would improve MWE
identification accuracy. The MWE identification results seem to bear out this hypothesis.

8. Related Work

This section contains three parts. First, we review work on MWEs in linguistics and
relate it to parallel developments in NLP. Second, we describe other syntax-based
MWE identification methods. Finally, we enumerate related experiments on Arabic and
French.

8.1 Analysis of MWEs in Linguistics and NLP

An underlying assumption of mainline generative grammatical theory is that words
are the basic units of syntax (Chomsky 1957). Lexical insertion is the process by which
words enter into phrase structure; thus lexical insertion rules have the form [N → dog,
car, apple], and so on. This assumption, however, was questioned not long after it was
proposed, as early work on idiomatic constructions like kick the bucket, which functions
like a multiword verb in syntax, seemed to indicate a conflict (Katz and Postal 1963;
Chafe 1968). Chomsky (1981) briefly engaged kick the bucket in a footnote, but idioms
remained a peripheral issue in mainline generative theory.

To others, the marginal status of idioms and fixed expressions seemed inappropriate
given their pervasiveness cross-linguistically. In their classic work on the English
construction let alone, Fillmore, Kay, and O’Connor (1988) argued that the basic units of
grammar are not Chomskyan rules but constructions, or triples of phonological, syntactic,
and conceptual structures. The subsequent development of Construction Grammar
(Fillmore, Kay, and O’Connor 1988; Goldberg 1995) maintained the central role of
idioms. Jackendoff (1997) has advanced a linguistic theory, the Parallel Architecture,
which includes multiword expressions in the lexicon.

In NLP, concurrent with the development of Construction Grammar, Scha (1990)
conceptualized an alternate model of parsing in which new utterances are built from
previously observed language fragments. In his model, which became known as data-
oriented parsing (DOP) (Bod 1992), “idiomaticity is the rule rather than the exception”
(Scha 1990, page 13). Most DOP work, however, has focused on parameter estimation
issues with a view to improving overall parsing performance rather than explicit mod-
eling of idioms.

Given developments in linguistics, and to a lesser degree DOP, in modeling MWEs,
it is curious that most NLP work on MWE identification has not utilized syntax.
Moreover, the words-with-spaces idea, which Sag et al. (2002) dismissed as unattractive on both
theoretical and computational grounds,22 has continued to appear in NLP evaluations
such as dependency parsing (Nivre and Nilsson 2004), constituency parsing (Arun and
Keller 2005), and shallow parsing (Korkontzelos and Manandhar 2010). In all cases, the
conclusion was drawn that pre-grouping MWEs improves task accuracy. Because the
yields (and thus the labelings) of the evaluation sentences were modified, however,
the experiments were not strictly comparable. Moreover, gold pre-grouping was usually
assumed, as was the case in most FTB parsing evaluations after Arun and Keller (2005).
The words-with-spaces strategy is especially unattractive for MRLs because (1) it
intensifies the sparsity problem in the lexicon; and (2) it is not robust to morphological
and syntactic processes such as inflection and phrasal expansion.

8.2 Syntactic Methods for MWE Identiﬁcation

There is a voluminous literature on MWE identiﬁcation, so we focus on syntax-
based methods. The classic statistical approach to MWE identiﬁcation, Xtract (Smadja
1993), used an incremental parser in the third stage of its pipeline to identify
predicate-argument relationships. Lin (1999) applied information-theoretic measures
to automatically extracted dependency relationships to ﬁnd MWEs. To our knowledge,

22 Sag et al. (2002) showed how to integrate MWE information into a non-probabilistic head-driven phrase
structure grammar for English.

Wehrli (2000) was the first to propose the use of a syntactic parser for multiword
expression identification. No empirical results were provided, however, and the
MWE-augmented scoring function for the output of his symbolic parser was left to future
research. Recently, Seretan (2011) used a symbolic parser for collocation extraction.
Collocations are two-word MWEs. In contrast, our models handle arbitrary-length MWEs.
To our knowledge, only two previous studies considered MWEs in the context
of statistical parsing. Nivre and Nilsson (2004) converted a Swedish corpus into two
versions: one in which MWEs were left as tokens, and one in which they were
grouped (words-with-spaces). They parsed both versions with a transition-based parser,
showing that the words-with-spaces version gave an improvement over the baseline.
Cafferkey (2008) also investigated the words-with-spaces idea along with imposing
chart constraints for pre-bracketed spans. He annotated the PTB using external MWE
lists and an NER system, but his technique did not improve two different constituency
models. At issue in both of these studies is the comparison to the baseline. MWE
pre-grouping changes the number of evaluation units (dependency arcs or bracketed
spans), thus the results are not strictly comparable. From an application perspective,
pre-grouping assumes high-accuracy identification, which may not be available for all
languages.
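
The comparability problem with pre-grouping can be seen in a small sketch. It assumes MWE spans are given as half-open token offsets and that each token contributes one dependency arc (including the arc to the root); the function name and the example sentence are hypothetical.

```python
def pregroup(tokens, mwe_spans):
    """Merge each [start, end) span into one underscore-joined token
    (the words-with-spaces representation)."""
    ends = {start: end for start, end in mwe_spans}
    out, i = [], 0
    while i < len(tokens):
        if i in ends:
            out.append("_".join(tokens[i:ends[i]]))
            i = ends[i]
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = ["il", "gagne", "en", "moyenne", "plus"]
grouped = pregroup(tokens, [(2, 4)])
print(grouped)

# One arc per token: the two versions are scored over different
# numbers of evaluation units.
arcs_ungrouped = len(tokens)
arcs_grouped = len(grouped)
print(arcs_ungrouped, arcs_grouped)
```

Since the grouped version has fewer tokens, any per-arc or per-bracket score is computed over a different denominator than the ungrouped baseline.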

Our goal differs considerably from these two studies, which attempt to improve
parsing via MWE information. In contrast, we tune statistical parsers for MWE
identification.

8.3 Related Experiments on Arabic and French

Arabic Statistical Constituency Parsing. Kulick, Gabbard, and Marcus (2006) were the ﬁrst
to parse the sections of the ATB used in this article. They adapted the Bikel parser (Bikel
2004) and improved accuracy primarily through punctuation equivalence classing and
the Kulick tag set. The ATB was subsequently revised (Maamouri, Bies, and Kulick
2008), and Maamouri, Bies, and Kulick (2009) produced the ﬁrst results on the revision
for our split of the revised corpus. They only reported development set results with
gold POS tags, however. Petrov (2009) adapted the Berkeley parser to the ATB, and we
later provided a parameterization that dramatically improved his baseline (Green and
Manning 2010). We also adapted the Stanford parser to the ATB, and provided the ﬁrst
results for non-gold tokenization. Attia et al. (2010a) developed an Arabic unknown
word model for the Berkeley parser based on signatures, much like those in Table 3.
More recently, Huang and Harper (2011) presented a discriminative lexical model for
Arabic that can encode arbitrary local lexical features.

Arabic MWE Identification. Very little prior work exists on Arabic MWE identification.
Attia (2006) demonstrated a method for integrating MWE knowledge into
a lexical-functional grammar, but gave no experimental results. Siham Boulaknadel
and Aboutajdine (2008) evaluated several lexical association measures in isolation for
MWE identiﬁcation in newswire. More recently, Attia et al. (2010b) compared cross-
lingual projection methods (using Wikipedia and English Wordnet) with standard
n-gram classiﬁcation methods.

French Statistical Constituency Parsing. Abeillé (1988) and Abeillé and Schabes (1989)
identified the linguistic and computational attractiveness of lexicalized grammars for
modeling non-compositional constructions in French well before DOP. They developed
a small tree adjoining grammar (TAG) of 1,200 elementary trees and 4,000 lexical items

that included MWEs. Recent statistical parsing work on French has included stochas-
tic tree insertion grammars (STIG), which are related to TAGs, but with a restricted
adjunction operation.23 Both Seddah, Candito, and Crabbé (2009) and Seddah (2010)
showed that STIGs underperform CFG-based parsers on the FTB. In their experiments,
MWEs were grouped. Appendix B describes additional prior work on CFG-based
FTB parsing.

French MWE Identiﬁcation. Statistical French MWE identiﬁcation has only been investi-
gated recently. We previously reported the ﬁrst results on the FTB using a parser for
MWE identiﬁcation (Green et al. 2011). Contemporaneously, Watrin and Francois (2011)
applied n-gram methods to a French corpus of multiword adverbs (Laporte, Nakamura,
and Voyatzi 2008). Constant and Tellier (2012) used a linear chain conditional random
ﬁelds model (CRF) for joint POS tagging and MWE identiﬁcation. They incorporated
external linguistic resources as features, but reported results for a much older version of
the FTB. Subsequently, Constant, Sigogne, and Watrin (2012) integrated the CRF model
into the Berkeley parser and evaluated on the pre-processed FTB used in this article.
Their best model (with external lexicon features) achieved 77.8% F1.

9. Conclusion

In this article, we showed that parsing models are very effective for identifying
arbitrary-length, contiguous MWEs. We achieved a 66.9% F1 absolute improvement
for Arabic, and a 50.0% F1 absolute improvement for French over n-gram classification
methods. All parsing models discussed in the paper improve MWE identification
over n-gram methods, but the best improvements come from different models. Unlike
n-gram classification methods, parsers provide syntactic subcategorization and do not
require heuristic pre-ﬁltering of the training data. Our techniques can be applied to
any language for which the following linguistic resources exist: a syntactic treebank,
an MWE list, and a morphological analyzer.

More fundamentally, we exploited a connection between syntax and idiomatic
semantics. This connection has been debated in linguistics, yet overlooked in statistical
NLP until now. Although empirical task evaluations do not always reinforce linguistic
theories, our results suggest that syntactic context can help identify idiomatic language,
as posited by some modern grammar theories.

We introduced the factored lexicon for the Stanford parser, a simple extension to
the lexical insertion model that helps combat lexical sparsity in morphologically rich
languages. In the gold setting, the factored lexicon yielded improvements over the basic
lexicon for both standard parsing and MWE identiﬁcation. Results were lower in the
automatic setting, suggesting that it might be helpful to optimize the morphological
analyzers for speciﬁc features included in downstream tasks like parsing. We evaluated
on in-domain data, but we expect that the factored lexicon would be even more useful
on out-of-domain text with higher out-of-vocabulary rates.

We have also provided empirical evidence that TSGs can capture idiomatic usage
as well as or better than a state-of-the-art CFG-based parser. The suitability of TSGs
for idioms has been discussed since the earliest days of DOP (Scha 1990), but it has
never been demonstrated with experiments like ours. Although the DP-TSG, which
is a relatively new parsing model, still lags other parsers in terms of overall labeling

23 Unlike TAG and TIG, TSG does not include an adjunction operator.

accuracy, we have shown that it is already very effective for tasks like MWE
identification. Because we modified the syntactic representation rather than the model
formulation, general improvements to this parsing model should yield improvements
in MWE identification accuracy.

Appendix A: Additional French MWEs

This appendix describes the other French MWE categories annotated in the FTB.

Adverbial idioms (MWADV) often start with a preposition (Example (11)) but can have
very different part-of-speech sequences:

(11)

a. P N: du coup (‘so’), sans doute (‘doubtless’)

b. P D A N: avec un bel ensemble [with a nice ensemble] (‘in harmony’)

c. P ADV P ADV: de plus en plus (‘more and more’)

d. V V: peut-être [can be] (‘maybe’)

e. ADV A: bien sûr [very certain] (‘of course’)

f. ET ET: a priori, grosso modo

Foreign words (MWET) include English nominal idioms, such as cash ﬂow and success
story, which are less integrated in French than words such as T-shirt. Expressions such
as Just do it or struggle for life also fall in this category.

Prepositional idioms (MWP) are mostly fixed (Example (12)), but some permit minimal
variation such as de vs. des or à vs. au:

(12)

a. P N P: en dépit de (‘despite’), à hauteur de (‘at the height of’)

b. P D N P: dans le cadre de (‘in the framework of’), à l’exception de (‘at the
exception of’)

c. P P: afin de (‘to’), jusqu’à (‘as far as’)

d. ADV P: autour de (‘around’), quant à (‘as for’)

e. N V P: compte tenu de (‘taking into account’), exception faite de [exception
made of] (‘at the exception of’)

Pronominal idioms (MWPRO) consist of demonstrative pronouns (celui-ci ‘this one’,
celui-là ‘that one’) and reﬂexive pronouns (lui-même ‘himself’), which vary in gender and
number, as well as a few indeﬁnite pronouns which allow gender inﬂection (d’aucuns
‘no-one’, quelque chose ‘something’, qui que ce soit ‘who ever it is’, n’importe qui ‘anyone’)
and some which are ﬁxed (d’autres ‘others’, la plupart ‘most’, tout un chacun ‘everyone’,
tout le monde ‘everybody’).

Multiword determiners (MWD) consist of expressions such as bien des (‘a lot of’) and
tout le (‘all the’), which display minimal variation in terms of inflection (e.g., la plupart de
vs. la plupart des ‘most of’). Numbers which act as determiners in the sentence (classées en
vingt-huit catégories ‘categorized in twenty-eight categories’) are also classiﬁed as MWD.

Multiword conjunctions (MWC) are a ﬁxed class:

(13)

a. C C: parce que (‘because’)

b. ADV C: même si (‘even so’)

c. V C: pourvu que (‘so long as’)

d. D N C: au moment où (‘at the time when’)

e. CL V A C: il est vrai que (‘it’s true that’)

f. ADV C ADV ADV C: tant et si bien que (‘to such an extent that’)

Multiword interjections (MWI) are a small category with expressions such as mille
sabords (‘blistering barnacles’) and au secours (‘help’).

Appendix B: Comparison to Prior FTB Pre-Processing

Our FTB pre-processing is automatic, unlike all previous methods.

ARUN-CONT and ARUN-EXP. (Arun and Keller 2005) Two versions of the full 20,000-sentence
treebank that differed principally in their treatment of MWEs: (1) CONT, in
which MWE tokens were grouped (en moyenne → en_moyenne); and (2) EXP, in which
MWEs were marked with a flat structure. For both representations, they also gave
results in which coordinated phrase structures were flattened. In the published
experiments, they mistakenly removed half of the corpus, believing that the multi-terminal
(per POS tag) annotations of MWEs were XML errors (Schluter and van Genabith 2007).

MFT. (Schluter and van Genabith 2007) Manual revision to 3,800 sentences. Major
changes included coordination raising, an expanded POS tag set, and the correction
of annotation errors. Like ARUN-CONT, MFT contains concatenated MWEs.

FTB-UC. (Candito and Crabbé 2009) A version of the functionally annotated section that
makes a distinction between MWEs that are “syntactically regular” and those that are
not. Syntactically regular MWEs were given internal structure, whereas all other MWEs
were grouped. For example, nouns followed by adjectives, such as loi agraire (‘land law’)
or Union monétaire et économique (‘monetary and economic Union’) were considered
syntactically regular. They are MWEs because the choice of adjective is arbitrary (loi agraire
and not ∗loi agricole, similarly to ‘coal black’ but not ∗‘crow black’, for example),
but their syntactic structure is not intrinsic to MWEs. In such cases, FTB-UC gives the
MWE a conventional analysis of an NP with internal structure. Such analysis is indeed
sufficient to recover the meaning of these semantically compositional MWEs that are
extremely productive. FTB-UC loses information about MWEs with non-compositional
semantics, however.

Almost all work on the FTB has followed ARUN-CONT and used gold MWE pre-
grouping. Candito, Crabbé, and Denis (2010) were the ﬁrst to acknowledge and address
this issue, but they still used FTB-UC (with some pre-grouped MWEs). Because the
syntax and definition of MWEs is a contentious issue, we take a more agnostic view,
which is consistent with that of the FTB annotators, and leave them ungrouped. This
permits a data-oriented approach to MWE identification that is more robust to changes
to the status of speciﬁc MWE instances.

Although our FTB basic parsing results are lower than those of Seddah (2010), the
experiments are not comparable: The data split and pre-processing were different, and
he grouped MWEs.

Appendix C: Annotation Consistency of Treebanks

Differences in annotation quality among corpora complicate cross-lingual experimental
comparisons. To control for this variable, we performed an annotation consistency
evaluation on the PTB, ATB, and FTB. The conventional wisdom has it that the PTB has
comparatively high inter-annotator agreement (IAA). In the initial release of the ATB,
IAA was inferior to other LDC treebanks, although in subsequent revisions, IAA was
quantiﬁably improved (Maamouri, Bies, and Kulick 2008). The FTB also had signiﬁcant
annotation errors upon release (Arun and Keller 2005), but it, too, has been revised.

To quantify IAA, we extend the variation nucleus method of Dickinson (2005) to
compare annotation error rates. Let C be a set of tuples ⟨s, l, i⟩, where s is a substring at
corpus position i with label l. We consider all substrings in the corpus. If s is bracketed
at position i, then its label is its non-terminal category. Otherwise, s has label l = NIL. To
locate variation nuclei, define L_s as the set of all labels associated with each unique s. If
|L_s| > 1, then s is a variation nucleus.24

Variation nuclei can result from either annotation errors or linguistic ambiguity. Human
evaluation is one way to distinguish between the two cases. Following Dickinson
(2005), we sampled 100 variation nuclei from each corpus and evaluated each sample
for the presence of an annotation error. To control for the number of corpus positions
included in each treebank sample, we used frequency-matched stratified sampling with
bin sizes of 2, 3, 4, 10, 50, and 500.
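
The definition of a variation nucleus given above can be sketched in a few lines. This is a simplified illustration, not the evaluation code used for the experiments: it assumes each sentence is a (tokens, brackets) pair, where brackets maps a half-open token span to a single non-terminal label, so unary chains are collapsed and the trace and POS-tag handling of footnote 25 is ignored. All names and the toy corpus are hypothetical.

```python
from collections import defaultdict

NIL = "NIL"

def variation_nuclei(corpus, max_len=None):
    """corpus: list of (tokens, brackets) pairs, where brackets maps a
    (start, end) token span to its non-terminal label.
    Returns {substring: set of labels} for substrings with > 1 label."""
    labels = defaultdict(set)                 # L_s for every substring s
    for tokens, brackets in corpus:
        n = len(tokens)
        for start in range(n):
            longest = n if max_len is None else min(n, start + max_len)
            for end in range(start + 1, longest + 1):
                s = tuple(tokens[start:end])
                # Unbracketed occurrences receive the label NIL.
                labels[s].add(brackets.get((start, end), NIL))
    return {s: ls for s, ls in labels.items() if len(ls) > 1}

toy_corpus = [
    (["the", "stock", "market", "fell"], {(0, 3): "NP", (0, 4): "S"}),
    (["the", "stock", "market", "rose"],
     {(0, 3): "NP", (1, 3): "NP", (0, 4): "S"}),
]
nuclei = variation_nuclei(toy_corpus)
print(nuclei)
```

In the toy corpus, "stock market" is bracketed as NP in one sentence and left unbracketed in the other, so it is the only substring whose label set has more than one member. Whether such a nucleus reflects an annotation error or genuine ambiguity still requires human judgment.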

The human evaluators were a non-native, fluent Arabic speaker for the ATB (the
first author), a native French speaker for the FTB (the second author), and a native
English speaker for the WSJ (the third author).25 Table C.1 shows the results of the
evaluation, which supports the anecdotal consistency ranking of the three treebanks.26
The FTB averages more than one variation nucleus per sentence and has twice the token-
level error rate of the other two treebanks.
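
The text does not state how the 95% confidence intervals for the type-level error rates in Table C.1 were obtained. As a hedged illustration, a normal-approximation (Wald) interval on a binomial proportion with n = 100 sampled nuclei gives values close to those reported; the choice of this interval and the function name are assumptions.

```python
import math

def wald_interval(errors, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    p = errors / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

# Type-level error counts out of 100 sampled nuclei (from Table C.1).
intervals = {
    name: wald_interval(errors, 100)
    for name, errors in [("PTB", 16), ("ATB", 26), ("FTB", 28)]
}
for name, (lo, hi) in intervals.items():
    print(f"{name}: [{lo:.1%}, {hi:.1%}]")
```

With only 100 nuclei per corpus the intervals are wide, and the ATB and FTB intervals overlap substantially, so the type-level rates alone do not separate those two treebanks.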

Appendix D: mwetoolkit Configuration

We configured mwetoolkit27 with the four standard lexical features: the maximum
likelihood estimator, Dice’s coefficient, pointwise mutual information, and Student’s
t-score. We added the POS sequence for each n-gram as a single feature. We removed
the Web counts features since the parsers do not use auxiliary data.
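
For reference, the four association features can be written out for the bigram case using their common textbook definitions. This is a sketch under stated assumptions: mwetoolkit generalizes these measures to longer n-grams and its exact formulas (for example, the logarithm base) may differ, and the function name and counts below are hypothetical.

```python
import math

def association_scores(c12, c1, c2, n):
    """Association measures for a bigram (w1, w2).
    c12: bigram count; c1, c2: unigram counts; n: total bigrams."""
    expected = c1 * c2 / n          # count expected under independence
    return {
        "mle": c12 / n,                           # relative frequency
        "dice": 2 * c12 / (c1 + c2),              # Dice's coefficient
        "pmi": math.log2(c12 / expected),         # pointwise mutual information
        "t": (c12 - expected) / math.sqrt(c12),   # Student's t-score
    }

scores = association_scores(c12=30, c1=40, c2=60, n=10000)
print(scores)
```

All four measures depend only on co-occurrence counts, which is why a POS-sequence feature is needed to give the classifier any syntactic signal.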

Because MWE n-grams only account for a small fraction of the n-grams in the
corpus, we filtered the training and test sets by removing all n-grams that occurred
once. To further balance the proportion of MWEs, we trained on all valid MWEs
plus 10x randomly selected non-MWE n-grams. This proportion matches the fraction

24 Kulick, Bies, and Mott (2011) extended our method with TAGs to account for nested bracketing errors.
25 Unlike Dickinson (2005), we stripped traces and only considered POS tags when pre-terminals were the
only intervening nodes between the nucleus and its bracketing (e.g., unaries, base NPs). Because our
objective was to compare distributions of bracketing discrepancies, we did not prune the set of nuclei.

26 The total variation nuclei in each corpus were: 22,521 (WSJ), 15,629 (ATB), and 14,803 (FTB).
27 We re-implemented mwetoolkit in Java for compatibility with Weka and our pre-processing routines.

Table C.1
Evaluation of 100 randomly sampled variation nuclei for training splits of the WSJ, ATB, and
FTB. Corpus positions indicates the number of corpus positions in the sample (a variation
nucleus by definition appears in at least two corpus positions). Nuclei per tree is the average
nuclei per syntactic tree in the corpus, a statistic that gives a rough estimate of variability across
the corpus. The type-level error rate indicates the number of variation nuclei for which at least
one error existed. The token-level error rate indicates the ratio of errors to corpus positions. We
computed 95% confidence intervals for the type-level error rate.

              Corpus      Nuclei      Error %           Type 95%
              Positions   Per Tree    Type     Token    Confidence Interval
PTB (2-21)    750         0.565       16.0%    4.10%    [8.80%, 23.2%]
ATB (train)   658         0.830       26.0%    4.00%    [17.4%, 34.6%]
FTB (train)   668         1.10        28.0%    9.13%    [19.2%, 36.8%]

of MWE/non-MWE tokens in the FTB. As we generated a random training set, we
reported the average of three independent training runs.
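
The filtering and balancing steps for the n-gram baseline might look roughly as follows. This is a sketch, not the configuration actually used: it assumes n-gram counts are held in a dictionary, that "valid MWEs" means MWE n-grams surviving the frequency filter, and that negatives are drawn without replacement; all names and the toy data are hypothetical.

```python
import random

def build_training_set(ngram_counts, mwe_ngrams, ratio=10, seed=0):
    """Drop n-grams seen once, keep all remaining MWEs, and sample
    ratio-times as many non-MWE n-grams as negatives."""
    kept = [g for g, c in ngram_counts.items() if c > 1]
    positives = [g for g in kept if g in mwe_ngrams]
    negatives = [g for g in kept if g not in mwe_ngrams]
    rng = random.Random(seed)
    k = min(len(negatives), ratio * len(positives))
    return positives, rng.sample(negatives, k)

# Toy counts: one MWE, one singleton, thirty frequent non-MWE bigrams.
counts = {("en", "moyenne"): 5, ("un", "hapax"): 1}
counts.update({("w", str(i)): 2 for i in range(30)})
mwes = {("en", "moyenne")}

# Three runs with different seeds, whose scores would then be averaged.
runs = [build_training_set(counts, mwes, seed=s) for s in range(3)]
pos, neg = runs[0]
print(len(pos), len(neg))
```

Because the negative sample is random, repeating the procedure with several seeds and averaging the resulting scores reduces the variance introduced by any single draw.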

We created feature vectors for the training n-grams and trained a binary SVM
classiﬁer with Weka (Hall et al. 2009). Although mwetoolkit defaults to a linear kernel,
we achieved higher accuracy on the development set with an RBF kernel.

The FTB is sufficiently large for the corpus-based methods implemented in
mwetoolkit. Ramisch, Villavicencio, and Boitet (2010) experimented with the Genia
corpus, which contains 18k English sentences and 490k tokens, similar to the FTB. Their
test set had 895 sentences, fewer than ours. They reported 30.6% F1 for their task against
an Xtract baseline, which only obtained 7.3% F1. Their best result compares favorably
(in magnitude) to our mwetoolkit baselines for French and Arabic.

Acknowledgments
We thank John Bauer for material
contributions to the MWE identification
experiments, and Claude Reichard for help
with editing this article. We also thank Marie
Candito, Markus Dickinson, Chris Dyer,
Ali Farghaly, Dan Flickinger, Nizar Habash,
Seth Kulick, Beth Levin, Percy Liang, David
McClosky, Carlos Ramisch, Ryan Roth,
Djamé Seddah, Valentin Spitkovsky, and
Reut Tsarfaty for insightful comments on
previous versions of this work. The first
author was supported by a National Science
Foundation Graduate Research Fellowship.
The second author was supported by
a Stanford Interdisciplinary Graduate
Fellowship.

Referencias
Abeillé, A. 1988. Parsing French with Tree
Adjoining Grammar: some linguistic
accounts. In COLING, pages 7–12,
Budapest, Hungary.

Abeillé, A., L. Clément, and A. Kinyon. 2003.
Building a treebank for French. In Anne
Abeillé, editor, Treebanks: Building and
Using Parsed Corpora. Kluwer, chapter 10.

Abeillé, A. and Y. Schabes. 1989. Parsing
idioms in lexicalized TAGs. In EACL,
pages 1–9, Manchester.

Arun, A. 2004. Statistical parsing of the
French treebank. Master's thesis,
University of Edinburgh.

Arun, A. and F. Keller. 2005. Lexicalization
in crosslinguistic probabilistic parsing:
The case of French. In ACL, pages 306–313,
Ann Arbor, MI.

Ashraf, A. 2012. Arabic Idioms: A Corpus-Based
Study. Routledge.

Attia, M. 2006. Accommodating multiword
expressions in an Arabic LFG grammar.
In Advances in Natural Language Processing,
volume 4139. Springer, pages 87–98.

Attia, M., J. Foster, D. Hogan, J. Le Roux,
L. Tounsi, and J. van Genabith. 2010a.
Handling unknown words in statistical
latent-variable parsing models for Arabic,
English and French. In First Workshop on
Statistical Parsing of Morphologically-Rich
Languages (SPMRL), pages 67–75,
Los Angeles, CA.

Attia, M., A. Toral, L. Tounsi, P. Pecina,
and J. van Genabith. 2010b. Automatic
extraction of Arabic multiword
expressions. In Workshop on Multiword
Expressions: From Theory to Applications,
pages 19–27, Beijing.

Baldwin, T. and S. N. Kim. 2010. Multiword
expressions. In N. Indurkhya and F. J.
Damerau, editors, Handbook of Natural
Language Processing. CRC Press, chapter 12,
pages 267–293.

Bansal, M. and D. Klein. 2010. Simple,
accurate parsing with an all-fragments
grammar. In ACL, pages 1098–1107,
Uppsala.

Bikel, D. M. 2004. Intricacies of Collins’
parsing model. Computational Linguistics,
30(4):479–511.

Bilmes, J. and K. Kirchhoff. 2003. Factored
language models and generalized parallel
backoff. In NAACL, pages 4–6, Edmonton.

Blunsom, P. and T. Baldwin. 2006.
Multilingual deep lexical acquisition for
HPSGs via supertagging. In EMNLP,
pages 164–171, Sydney.

Bod, R. 1992. A computational model of
language performance: Data-Oriented
Parsing. In COLING, pages 855–859,
Nantes.

Cafferkey, C. 2008. Exploiting multi-word
units in statistical parsing and generation.
Master's thesis, Dublin City University.

Candito, M. and B. Crabbé. 2009. Improving
generative statistical parsing with
semi-supervised word clustering.
In IWPT, pages 138–141, Paris.

Candito, M., B. Crabbé, and P. Denis.
2010. Statistical French dependency
parsing: Treebank conversion and first
results. In LREC, pages 1840–1847,
Valletta.

Candito, M. and D. Seddah. 2010. Parsing
word clusters. In First Workshop on
Statistical Parsing of Morphologically-Rich
Languages (SPMRL), pages 76–84,
Los Angeles, CA.

Carpuat, M. and M. Diab. 2010. Task-based
evaluation of multiword expressions:
A pilot study in statistical machine
translation. In HLT-NAACL,
pages 242–245, Los Angeles, CA.

Chafe, W. L. 1968. Idiomaticity as an
anomaly in the Chomskyan paradigm.
Foundations of Language, 4(2):109–127.

Chomsky, N. 1957. Syntactic Structures.
Mouton, London.

Chomsky, N. 1981. Lectures on Government
and Binding: The Pisa Lectures. Foris
Publications, Holland.

Chrupała, G., G. Dinu, and J. van Genabith.
2008. Learning morphology with Morfette.
In LREC, pages 2362–2367, Marrakech.

Cohn, T., P. Blunsom, and S. Goldwater. 2010.
Inducing tree-substitution grammars.
JMLR, 11:3053–3096.

Cohn, T., S. Goldwater, and P. Blunsom.
2009. Inducing compact but accurate
tree-substitution grammars. In HLT-NAACL,
pages 548–556, Boulder, CO.

Constant, M., A. Sigogne, and P. Watrin.
2012. Discriminative strategies to integrate
multiword expression recognition and
parsing. In ACL, pages 204–212, Jeju.

Constant, M. and I. Tellier. 2012. Evaluating
the impact of external lexical resources
into a CRF-based multiword segmenter
and part-of-speech tagger. In LREC,
pages 646–650, Istanbul.

Crabbé, B. and M. Candito. 2008. Expériences
d’analyse syntaxique statistique du
français. In TALN, pages 1–10, Avignon.

Dickinson, M. 2005. Error Detection and
Correction in Annotated Corpora. Ph.D.
thesis, The Ohio State University.

Dybro-Johansen, A. 2004. Extraction
automatique de grammaires à partir d’un
corpus français. Master's thesis, Université
Paris 7.

Dyer, C., A. Lopez, J. Ganitkevitch, J. Weese,
F. Ture, P. Blunsom, H. Setiawan,
V. Eidelman, and P. Resnik. 2010. cdec:
A decoder, alignment, and learning
framework for finite-state and context-free
translation models. In ACL System
Demonstrations, pages 7–12, Uppsala.

Evert, S. 2008. The MWE 2008 Shared Task:
Ranking MWE candidates. In Presentation
at 2008 Workshop on Multiword Expressions,
Marrakech.

Fillmore, C. J., P. Kay, and M. C. O’Connor.
1988. Regularity and idiomaticity in
grammatical constructions: The case of
let alone. Language, 64(3):501–538.

Finkel, J. R. and C. D. Manning. 2009. Joint
parsing and named entity recognition. In
HLT-NAACL, pages 326–334, Boulder, CO.

Goldberg, A. 1995. Constructions:
A Construction Grammar Approach to
Argument Structure. University of
Chicago Press, Chicago.

Green, S., M-C. de Marneffe, J. Bauer,
and C. D. Manning. 2011. Multiword
expression identification with Tree
Substitution Grammars: A parsing
tour de force with French. In EMNLP,
pages 725–735, Edinburgh.

Green, S. and C. D. Manning. 2010. Better
Arabic parsing: Baselines, evaluations,
and analysis. In COLING, pages 394–402,
Beijing.

Gross, M. 1984. Lexicon-Grammar and
the syntactic analysis of French.
In COLING-ACL, pages 275–282,
Stanford, CA.

Gross, M. 1986. Lexicon-Grammar: The
representation of compound words.
In COLING, pages 1–6, Bonn.

Habash, N. and O. Rambow. 2005. Arabic
tokenization, part-of-speech tagging and
morphological disambiguation in one
fell swoop. In ACL, pages 573–580,
Ann Arbor, MI.

Hall, M., E. Frank, G. Holmes, B. Pfahringer,
P. Reutemann, and I. H. Witten. 2009. The
WEKA data mining software: An update.
SIGKDD Explorations Newsletter, 11:10–18.

Hogan, D., C. Cafferkey, A. Cahill, and
J. van Genabith. 2007. Exploiting
multi-word units in history-based
probabilistic generation. In EMNLP-CoNLL,
pages 267–276, Prague.

Huang, Z. and M. Harper. 2011. Feature-rich
log-linear lexical model for latent
variable PCFG grammars. In IJCNLP,
pages 219–227, Chiang Mai.

Jackendoff, R. 1997. The Architecture of
the Language Faculty. MIT Press,
Cambridge, MA.

Johnson, M. 1998. PCFG models of linguistic
tree representations. Computational
Linguistics, 24(4):613–632.

Katz, J. J. and P. M. Postal. 1963. Semantic
interpretation of idioms and sentences
containing them. M.I.T. Research Laboratory
of Electronics Quarterly Progress Report,
70:275–282.

Klein, D. and C. D. Manning. 2003.
Accurate unlexicalized parsing.
In ACL, pages 423–430, Sapporo.

Koehn, P. and H. Hoang. 2007. Factored
translation models. In EMNLP-CoNLL,
pages 868–876, Prague.

Korkontzelos, I. and S. Manandhar. 2010.
Can recognising multiword expressions
improve shallow parsing? In HLT-NAACL,
pages 636–644, Los Angeles, CA.

Kübler, S. 2005. How do treebank annotation
schemes influence parsing results?
Or how not to compare apples and
oranges. In RANLP, pages 79–88,
Borovets.

Kulick, S., A. Bies, and J. Mott. 2011. Using
derivation trees for treebank error
detection. In ACL, pages 693–698,
Portland, OR.

Kulick, S., R. Gabbard, and M. Marcus. 2006.
Parsing the Arabic Treebank: Analysis and
improvements. In TLT, pages 31–42,
Prague.

Laporte, E., T. Nakamura, and S. Voyatzi.
2008. A French corpus annotated
for multiword expressions with
adverbial function. In LREC Linguistic
Annotation Workshop, pages 48–51,
Marrakech.

Levy, R. and G. Andrew. 2006. Tregex
and Tsurgeon: Tools for querying and
manipulating tree data structures.
In LREC, pages 2,231–2,234, Genoa.

Levy, R. and C. D. Manning. 2003. Is it
harder to parse Chinese, or the Chinese
treebank? In ACL, pages 439–446,
Sapporo.

Liang, P., M. I. Jordan, and D. Klein. 2010.
Type-based MCMC. In HLT-NAACL,
pages 573–581, Los Angeles, CA.

Lin, D. 1999. Automatic identification of
non-compositional phrases. In ACL,
pages 317–324, College Park, MD.

Maamouri, M., A. Bies, T. Buckwalter,
and W. Mekki. 2004. The Penn Arabic
Treebank: Building a large-scale annotated
Arabic corpus. In NEMLAR, pages 1–8,
Cairo.

Maamouri, M., A. Bies, and S. Kulick.
2008. Enhancing the Arabic Treebank:
A collaborative effort toward new
annotation guidelines. In LREC,
pages 3,192–3,196, Marrakech.

Maamouri, M., A. Bies, and S. Kulick. 2009.
Creating a methodology for large-scale
correction of treebank annotation: The
case of the Arabic Treebank. In MEDAR,
pages 138–144, Cairo.

Marantz, A. 1997. No escape from syntax:
Don’t try morphological analysis in the
privacy of your own lexicon. In 21st
Annual Penn Linguistics Colloquium,
pages 1–15, Philadelphia, PA.

Marcus, M., M. A. Marcinkiewicz, and
B. Santorini. 1993. Building a large
annotated corpus of English: The Penn
Treebank. Computational Linguistics,
19:313–330.

Nivre, J. and J. Nilsson. 2004. Multiword
units in syntactic parsing. In Methodologies
and Evaluation of Multiword Units in
Real-World Applications (MEMURA),
pages 1–8, Lisbon.

O'Donnell, T. J., J. B. Tenenbaum, and N. D.
Goodman. 2009. Fragment grammars:
Exploring computation and reuse in
language. Technical report, MIT Computer
Science and Artificial Intelligence
Laboratory Technical Report Series,
MIT-CSAIL-TR-2009-013.

O'Grady, W. 1998. The syntax of idioms.
Natural Language and Linguistic Theory,
16:279–312.

Pecina, P. 2010. Lexical association measures
and collocation extraction. Language
Resources and Evaluation, 44:137–158.

Petrov, S. 2009. Coarse-to-Fine Natural
Language Processing. Ph.D. thesis,
University of California-Berkeley.

Petrov, S., L. Barrett, R. Thibaux, and
D. Klein. 2006. Learning accurate, compact,
and interpretable tree annotation. In ACL,
pages 433–440, Sydney.

Petrov, S. and D. Klein. 2007. Improved
inference for unlexicalized parsing.
In HLT-NAACL, pages 404–411,
Rochester, NY.

Post, M. and D. Gildea. 2009. Bayesian
learning of a tree substitution grammar.
In ACL-IJCNLP, Short Papers, pages 45–48,
Suntec.

Rambow, O., D. Chiang, M. Diab, N. Habash,
R. Hwa, K. Sima’an, V. Lacey, R. Levy,
C. Nichols, and S. Shareef. 2005. Parsing
Arabic dialects. Technical report. Johns
Hopkins University.

Ramisch, C., A. Villavicencio, and C. Boitet.
2010. mwetoolkit: A framework for
multiword expression identification.
In LREC, pages 662–669, Valletta.

Rehbein, I. and J. van Genabith. 2007.
Treebank annotation schemes and parser
evaluation for German. In EMNLP-CoNLL,
pages 630–639, Prague.

Ryding, K. 2005. A Reference Grammar of
Modern Standard Arabic. Cambridge
University Press.

Sag, I. A., T. Baldwin, F. Bond, A. Copestake,
and D. Flickinger. 2002. Multiword
expressions: A pain in the neck for NLP.
In CICLing, pages 1–15, Mexico City.

Sampson, G. and A. Babarczy. 2003. A test of
the leaf-ancestor metric for parse accuracy.
Natural Language Engineering, 9:365–380.

Scha, R. 1990. Taaltheorie en taaltechnologie:
competence en performance. In Q. A. M.
de Kort and G. L. J. Leerdam, editors,
Computertoepassingen in de Neerlandistiek.
Landelijke Vereniging van Neerlandici
(LVVNjaarboek), pages 7–22.

Schluter, N. and J. van Genabith. 2007.
Preparing, restructuring, and augmenting
a French treebank: Lexicalised parsers or
coherent treebanks? In Pacling, pages 1–10,
Melbourne.

Seddah, D. 2010. Exploring the Spinal-STIG
model for parsing French. In LREC,
pages 1,936–1,943, Valletta.

Seddah, D., M. Candito, and B. Crabbé.
2009. Cross parser evaluation and tagset
variation: A French treebank study.
In IWPT, pages 150–161, Paris.

Seddah, D., G. Chrupała, Ö. Çetinoglu,
J. van Genabith, and M. Candito. 2010.
Lemmatization and lexicalized statistical
parsing of morphologically rich languages:
The case of French. In First Workshop on
Statistical Parsing of Morphologically Rich
Languages (SPMRL), pages 85–93,
Los Angeles, CA.

Seretan, V. 2011. Syntax-Based Collocation
Extraction. Springer.

Siham Boulaknadel, B. D. and
D. Aboutajdine. 2008. A multi-word term
extraction program for Arabic language.
In LREC, pages 1,485–1,488, Marrakech.

Smadja, F. 1993. Retrieving collocations from
text: Xtract. Computational Linguistics,
19:143–177.

Toutanova, K., D. Klein, C. D. Manning, and
Y. Singer. 2003. Feature-rich part-of-speech
tagging with a cyclic dependency network.
In NAACL, pages 173–180, Edmonton.

Vijay-Shanker, K. and D. J. Weir. 1993. The
use of shared forests in tree adjoining
grammar parsing. In EACL, pages 384–393,
Utrecht.

Watrin, P. and T. Francois. 2011. An n-gram
frequency database reference to handle
MWE extraction in NLP applications. In
Workshop on Multiword Expressions: From
Parsing and Generation to the Real World,
pages 83–91, Portland, OR.

Wehrli, E. 2000. Parsing and collocations.
In Natural Language Processing–NLP 2000,
volume 1835 of Lecture Notes in Computer
Science. Springer, pages 272–282.

West, M. 1995. Hyperparameter estimation
in Dirichlet process mixture models.
Technical report. Duke University.
