
Identification of Multiword Expressions
by Combining Multiple Linguistic
Information Sources

Yulia Tsvetkov∗
Carnegie Mellon University

Shuly Wintner∗∗
University of Haifa

We propose a framework for using multiple sources of linguistic information in the task of
identifying multiword expressions in natural language texts. We define various linguistically
motivated classification features and introduce novel ways for computing them. We then man-
ually define interrelationships among the features, and express them in a Bayesian network.
The result is a powerful classifier that can identify multiword expressions of various types
and multiple syntactic constructions in text corpora. Our methodology is unsupervised and
language-independent; it requires relatively few language resources and is thus suitable for a
large number of languages. We report results on English, French, and Hebrew, and demonstrate
a significant improvement in identification accuracy, compared with less sophisticated baselines.

1. Introduction

Multiword expressions (MWEs) are lexical items that consist of multiple orthographic
words (ad hoc, New York, look up). MWEs constitute a significant portion of the lexicon
of any natural language (Jackendoff 1997; Erman and Warren 2000; Sag et al. 2002). They
are a heterogeneous class of constructions with diverse sets of characteristics, distin-
guished by their idiosyncratic behavior. Morphologically, some MWEs allow some of
their constituents to freely inflect while restricting (or preventing) the inflection of other
constituents. In some cases MWEs may allow constituents to undergo non-standard
morphological inflections that they would not undergo in isolation. Syntactically, some
MWEs behave like words and others are phrases; some occur in one rigid pattern (and a
fixed order), and others permit various syntactic transformations. The most characteris-
tic property of MWEs is their semantic opacity, although the compositionality of MWEs
is gradual, and ranges from fully compositional to completely idiomatic (Bannard,
Baldwin, and Lascarides 2003).

∗ Language Technologies Institute, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA
15213-3891. E-mail: ytsvetko@cs.cmu.edu.

∗∗ Department of Computer Science, University of Haifa, Mount Carmel, 31905 Haifa, Israel.
E-mail: shuly@cs.haifa.ac.il.

Submission received: 6 January 2013; revised submission received: 13 June 2013; accepted for publication:
16 August 2013.

doi:10.1162/COLI_a_00177

© 2014 Association for Computational Linguistics

Because of their prevalence and irregularity, MWEs must be stored in lexicons
of natural language processing (NLP) applications. Awareness of MWEs has proven
beneficial for a variety of applications, including information retrieval (Doucet and
Ahonen-Myka 2004), building ontologies (Venkatsubramanyan and Perez-Carballo
2004), text alignment (Venkatapathy and Joshi 2006), and machine translation (Baldwin
and Tanaka 2004; Uchiyama, Baldwin, and Ishizaki 2005; Carpuat and Diab 2010).

We propose a novel architecture for identifying MWEs, of various types and syn-
tactic categories, in monolingual corpora. Unlike much existing work, which focuses
on a particular syntactic construction, our approach addresses MWEs of various types
by zooming in on the general idiosyncratic properties of MWEs rather than on spe-
cific properties of each subclass thereof. Addressing multiple types of MWEs has its
limitations: The task is less well-defined, one cannot rely on specific properties of
a particular construction, and the type of the MWE is not extracted along with the
candidate expression. However, there are clear benefits to such an approach. Certain
applications can benefit from a large, albeit untyped, mixed bag of MWEs; machine
translation is an obvious candidate (Lambert and Banchs 2005; Ren et al. 2009; Bouamor,
Semmar, and Zweigenbaum 2012). Another use, which motivates our current work, is
the construction of computational lexicons. Clearly, manual supervision is required be-
fore MWE candidates are added to a high-precision lexicon, but our approach provides
the lexicographer with a large-scale set of potential candidates.

We focus on bigrams only in this work, that is, on MWEs consisting of two consec-
utive tokens. Many of the features we design, as well as the general architecture, can in
principle be extended to longer MWEs, but we do not address longer (and, in particular,
the harder case of non-contiguous) MWEs here. The architecture uses Bayesian net-
works (Pearl 1985) to express multiple interdependent linguistically motivated features.

First, we automatically generate a small (training) set of MWE and non-MWE
bigrams (positive and negative instances, respectively) from a small parallel corpus.
We then define a set of linguistically motivated features that embody observed char-
acteristics of MWEs. We augment these by features that reflect collocation measures.
Finally, we define dependencies among these features, expressed in the structure of a
Bayesian network model, which we then use for classification. A Bayesian network (BN)
is a directed graph whose nodes express the features used for classification and whose
edges define causal relationships among these features. In this architecture, learning
does not result in a black box, expressed solely as feature weights. Rather, the structure
of the BN allows us to study the impact of different MWE features on the classification.
The result is a new method for identifying MWEs of various types in text corpora. It
combines statistics with an array of linguistically motivated features, organized in an
architecture that reflects interdependencies among the features.

The contribution of this work is manifold.1 First, we use existing approaches to
MWE extraction to automatically generate training material. Specifically, we use our
earlier work (Tsvetkov and Wintner 2012) to extract a set of positive and negative MWE
candidates from a small parallel corpus, and use them for training a BN that can then
extract a new set of MWEs from a potentially much larger monolingual corpus. As a
result, our method is completely unsupervised (more precisely, it does not require
manual annotation; we do need several language resources, see Section 3.2).

1 This article is a thoroughly revised and extended version of Tsvetkov and Wintner (2011). Whereas the
methodology of that paper required minor supervision, we now present a completely unsupervised
approach. We added several linguistically motivated features to the classification task. We demonstrate
results on two new languages, English and French, to emphasize the generality of the method.
Additional extensions include a more complete literature survey and, because new languages are added,
different, more reliable data sets for evaluating our results.


Second, we propose several linguistically motivated features that can be computed
from data and that are demonstrably productive for improving the accuracy of MWE
identification. These features focus on the expression of linguistic idiosyncrasies of var-
ious types, a phenomenon typical of MWEs. Some of these features are commonplace,
but others are new, or are implemented in novel ways. In particular, we account for
the morphological idiosyncrasy of MWEs using a histogram of the number of inflected
formas, in a technique that draws from image processing. We also use frequency his-
tograms to model the semantic contexts of MWEs.

Finally, the methodology we advocate is not language-specific; given relatively few
language resources, it can be easily adapted to new languages. We demonstrate the
generality of our methodology by applying it to three languages: English, French, and
Hebrew. Our evaluation shows that the use of linguistically motivated features results
in a reduction of between one quarter and one third of the errors compared with a
collocation baseline; organizing the knowledge in a Bayesian network reduces the error
rate by an additional 3–9%.

After discussing related work in the next section (borrowing from Tsvetkov and
Wintner [2012]), we motivate in Section 3 the methodology we propose, and list the re-
sources needed for implementing it. Section 4 discusses the linguistically motivated fea-
tures and their implementation; the organization of the Bayesian network is described
in Section 5. We explain how we generate training materials in Section 6. Section 7
provides a thorough evaluation of the results. We conclude with suggestions for future
research.

2. Related Work

Early approaches to MWE identification concentrated on their collocational behavior
(Church and Hanks 1990). One of the first approaches was implemented as Xtract
(Smadja 1993): Here, word pairs that occur with high frequency within a context of
five words in a corpus are first collected, and are then ranked and filtered according
to contextual considerations, including the parts of speech of their neighbors. Pecina
(2008) compares 55 different association measures in ranking German Adj-N and PP-
Verb collocation candidates. He shows that combining different collocation measures
using standard statistical classification methods improves over using a single colloca-
tion measure. Other results (Chang, Danielsson, and Teubert 2002; Villavicencio et al.
2007) suggest that some collocation measures (especially point-wise mutual information
and log-likelihood) are superior to others for identifying MWEs.

Co-occurrence measures alone are probably not enough to identify MWEs, and their
linguistic properties should be exploited as well (Piao et al. 2005). Hybrid methods that
combine word statistics with linguistic information exploit morphological, syntactic,
and semantic idiosyncrasies to extract idiomatic MWEs.

Cook, Fazly, and Stevenson (2007), for example, use prior knowledge about the
overall syntactic behavior of an idiomatic expression to determine whether an instance
of the expression is used literally or idiomatically. They assume that in most cases,
idiomatic usages of an expression tend to occur in a small number of canonical forms
for that idiom; in contrast, the literal usages of an expression are less syntactically
restricted, and are expressed in a greater variety of patterns, involving inflected forms
of the constituents.

Ramisch et al. (2008) evaluate a number of association measures on the task of
identifying English verb-particle constructions and German adjective-noun pairs. They
show that adding linguistic information (mostly POS and POS-sequence patterns) to the
association measure yields a significant improvement in performance over using pure
frequency.

Several works address the lexical fixedness or syntactic fixedness of (certain types
de) MWEs in order to extract them from texts. An expression is considered lexically
fixed if replacing any of its constituents by a semantically (and syntactically) similar
word generally results in an invalid or literal expression. Syntactically fixed expressions
prohibit (or restrict) syntactic variation. For example, Van de Cruys and Villada Moirón
(2007) use lexical fixedness to extract Dutch verb-noun idiomatic combinations (VNICs).
Bannard (2007) uses syntactic fixedness to identify English VNICs. Another work uses
both the syntactic and the lexical fixedness of VNICs in order to distinguish them from
non-idiomatic ones, and eventually to extract them from corpora (Fazly and Stevenson
2006). Recently, Green et al. (2011) use parsing, and in particular Tree Substitution
Grammars, for identifying MWEs in French.

Semantic properties of MWEs can be used to distinguish between compositional
and non-compositional (idiomatic) expressions. Katz and Giesbrecht (2006) and
Baldwin et al. (2003) use Latent Semantic Analysis (LSA) for this purpose. They show
that compositional MWEs appear in contexts more similar to their constituents than
non-compositional MWEs. For example, the co-occurrence measured by LSA between
the expression kick the bucket and the word die is much higher than co-occurrence
of this expression and its component words. The disadvantage of this methodology is
that to distinguish between idiomatic and non-idiomatic usages of the MWE it relies on
the MWE’s known idiomatic meaning, and this information is usually not available. In
addition, this approach fails when only idiomatic or only literal usages of the MWE are
overwhelmingly frequent.

Although these approaches are in line with ours, they require lexical semantic
resources (e.g., a database that determines semantic similarity among words) and
syntactic resources (parsers) that are unavailable for many languages. Our approach
only requires morphological processing and a bilingual dictionary, which are more
readily available for several languages. Note also that these approaches target a specific
syntactic construction, whereas ours is appropriate for various types of MWEs.

Several properties of Hebrew MWEs are described by Al-Haj (2010); Al-Haj and
Wintner (2010) use them in order to construct a support vector machine (SVM) classifier
that can distinguish between MWE and non-MWE noun-noun constructions in Hebrew.
The features of the SVM reflect several morphological and morphosyntactic properties
of such constructions. The resulting classifier performs much better than a naive base-
line, reducing the error rate by over one third. We rely on some of these insights, as we
implement more of the linguistic properties of MWEs. Again, our methodology is not
limited to a particular construction: Indeed, we demonstrate that our general methodol-
ogy, trained on automatically generated, general training data, performs almost as well
as the noun-noun-specific approach of Al-Haj and Wintner (2010) on the very same data
set (Section 7).

Recently, Tsvetkov and Wintner (2010b, 2012) introduced a general methodology
for extracting MWEs from bilingual corpora, and applied it to Hebrew. The results
were a highly accurate set of Hebrew MWEs, of various types, along with their English
translations. A major limitation of this work is that it can only be used to identify MWEs
in the bilingual corpus, and is thus limited in its scope. We use this methodology to
extract both positive and negative instances for our training set in the current work;
but we extrapolate the results much further by extending the method to monolingual
corpora, which are typically much larger than bilingual ones.

Probabilistic graphical models are widely used in statistical machine learning in
general, and natural language processing in particular (Smith 2011). Bayesian networks
are an instance of such models, and have been used for classification in several natural
language applications. For example, BNs have been used for POS tagging of unknown
words (Peshkin, Pfeffer, and Savova 2003), dependency parsing (Savova and Peshkin
2005), and document classification (Lam, Low, and Ho 1997; Calado et al. 2003; Denoyer
and Gallinari 2004). Very recently, Ramisch et al. (2010) used BN for Portuguese MWE
identification. The features used for classification were of two kinds: (1) various colloca-
tion measures; (2) bigrams aligned together by an automatic word aligner applied to a
parallel (Portuguese–English) corpus. A BN was used to combine the predictions of the
various features on the test set, but the structure of the network is not described. The
combined classifier resulted in a much higher accuracy than either of the two methods
alone. However, the use of BN is not central to this work, and its structure does not
reflect any insights or intuitions on the structure of the problem domain or on inter-
dependencies among features.

We, too, acknowledge the importance of combining different sources of knowledge
in the hard task of MWE identification. In particular, we also believe that collocation
measures are highly important for this task, but cannot completely solve the problem:
Linguistically motivated features are crucial in order to improve the accuracy of the
classifier. In this work we focus on various properties of different types of MWEs,
and define general features that may accurately apply to some, but not necessarily all,
of them. An architecture of Bayesian networks is optimal for this task: It enables us
to define weighted dependencies among features, such that certain features are more
significant for identifying some class of MWEs, whereas others are more prominent
in identifying other classes (although we never predefine these classes). As we show
herein, this architecture results in significant improvements over a more naive combi-
nation of features.

3. Methodology

3.1 Motivation

The task we address is identification of MWEs, of various types and syntactic construc-
tions, in monolingual corpora. These include proper names, noun phrases, verb-particle
pairs, and so on. We focus on bigrams (MWEs consisting of two consecutive tokens)
in this work; the methodology, however, can be extended to longer n-grams. Several
properties of MWEs make this task challenging: MWEs exhibit idiosyncrasies on a
variety of levels, orthographic, morphological, syntactic, and of course semantic. Such a
complex task calls for a combination of multiple approaches, and much research indeed
suggests “hybrid” approaches to MWE identification (Duan et al. 2009; Hazelbeck and
Saito 2010; Ramisch et al. 2010; Weller and Fritzinger 2010). We believe that Bayesian
networks provide an optimal architecture for expressing various pieces of knowledge
aimed at MWE identification, for the following reasons (noted, e.g., by Heckerman
1995):
•  In contrast to many other classification methods, Bayesian networks can
   learn (and express) causal relationships between features. This facilitates
   better understanding of the problem domain.

•  Bayesian networks can encode not only statistical data, but also
   prior domain knowledge and human intuitions, in the form of
   interdependencies among features (a possibility that we use here).

In addition, we try in this work to leverage the idiosyncrasy of MWEs and use it as a
tool for identifying them.

Our definition of MWEs is operational: An expression is considered a MWE if it
has to be stored in the lexicon of some NLP application; typically, this is because the
expression exhibits some level of idiosyncratic behavior (semantic, syntactic, morpho-
logical, orthographic, etc.). In order to properly handle such expressions in downstream
applications, the lexicon must store some specific information about the expression. This
working definition motivates and drives our methodology: We leverage the idiosyn-
cratic behavior of MWEs and define (Sección 4) an array of features that capture and
reflect this idiosyncrasy in order to extract MWEs from corpora.

3.2 Resources

Although our approach is in general not language-specific, applying it to any partic-
ular language requires several language resources, which we specify in this section. In
general, we require corpora (both monolingual and bilingual), morphological analyzers
or stemmers, part-of-speech taggers, and bilingual dictionaries. No deeper processing
is assumed (e.g., no parsers or lexical semantic resources are needed). The method we
advocate is thus appropriate for medium-density languages (Varga et al. 2005).

To compute the features discussed in Section 4, we need large monolingual corpora.
For English and French, we use the 10^9 corpora released for WMT-11 (Callison-Burch
et al. 2011); the corpora were syntactically parsed using the Berkeley parser (Petrov
and Klein 2007), but we only use the POS tags in this work. For Hebrew, we use a
monolingual corpus (Itai and Wintner 2008), which we pre-process as in Tsvetkov and
Wintner (2012): We use a morphological analyzer (Itai and Wintner 2008) to segment
word forms (separating prefixes and suffixes) and induce POS tags. Summary statistics
for each corpus are listed in Table 1.

For some features we need access to the lemma of word tokens. In Hebrew, the
MILA morphological analyzer (Itai and Wintner 2008) provides the lemmas, but the
parsed corpora we use in English and French do not. We therefore use the DELA
dictionaries of English and French, available from LADL as part of the Unitex project
(http://www-igm.univ-mlv.fr/~unitex/). The French dictionary lists 683,824 single-
word entries corresponding to 102,073 lemmas, and 108,436 multiword entries corre-
sponding to 83,604 MWEs. The English dictionary is smaller, with 296,606 single-word
forms corresponding to 150,145 lemmas, and 132,990 multiword entries, corresponding
to 69,912 MWEs.

Table 1
Statistics of the monolingual corpora.

                 English        French         Hebrew
Tokens           447,073,250    522,964,336    46,239,285
Types            2,421,181      2,416,269      188,572
Bigram tokens    429,550,149    505,441,224    45,858,152
Bigram types     22,929,768     21,428,007     5,698,581


Table 2
Statistics of the bilingual corpora.

                 English–French           English–Hebrew
Sentences        30,000       30,000      19,626       19,626
Tokens           834,707      895,632     271,787      280,508
Types            22,787       27,880      14,142       12,555
Bigram tokens    804,704      865,632     252,183      280,506
Bigram types     218,108      225,660     128,987      149,688

If the corpus surface form is not listed in the dictionary, we use the
surface form in lieu of its lemma. The multiword entries of the DELA dictionaries are
only used for evaluation.

For some features we also need a bilingual dictionary. For English–Hebrew, we
use a small dictionary consisting of 78,313 translation pairs. Some of the entries are
collected manually, whereas others are produced automatically (Itai and Wintner
2008; Kirschenbaum and Wintner 2010). For English–French, because we are unable to
obtain a good-quality dictionary, we use instead Giza++ (Och and Ney 2000) 1-1 word
alignments computed automatically from the entire WMT-11 parallel corpus.

In order to prepare training material automatically (Section 6), we use small bilin-
gual corpora. For English–French, we use a random sample of 30,000 parallel sentences
from the WMT-11 corpus. For English-Hebrew, we use the parallel corpus of Tsvetkov
and Wintner (2010a). Statistics of the parallel corpora are listed in Table 2.

For evaluation we need lists of MWEs, ideally augmented by lists of non-MWE
bigrams. Such lists are notoriously hard to obtain. As a general method of evaluation,
we run 10-fold cross-validation evaluation using the training materials (which we
generate automatically). In addition, we use three sets of MWEs for evaluation. First,
we extract all the MWE entries from the English WordNet (Miller et al. 1990); we use the
WordNet version that is distributed with NLTK (Bird, Klein, and Loper 2009). Second,
we use the MWEs listed in the DELA dictionaries of English and French (see above).
These sets only include positive examples, of course, so we only report recall results on
them. For Hebrew, we use a small set that was used for evaluation in the past (Al-Haj
and Wintner 2010; Tsvetkov and Wintner 2012). This is a small annotated corpus,
NN, of Hebrew noun-noun constructions. The corpus consists of 413 high-frequency
bigrams of the same syntactic construction; of those, 178 are tagged as MWEs (in this
case, noun compounds) and 235 as non-MWEs. This corpus consolidates the annotation
of three annotators: Only instances on which all three agreed were included. Because it
includes both positive and negative instances, this corpus facilitates a robust evaluation
of precision and recall.

4. Linguistically Motivated Features

We define several linguistically motivated features that are aimed at capturing some
of the unique properties of MWEs. Although many idiosyncratic properties of MWEs
have been previously studied, we introduce novel ways to express these properties as
computable features that inform a classifier. Note that many of the features we describe
in the following are completely language-independent; others are applicable to a wide
range of languages, whereas a few are specific to morphologically rich languages, and can
be exhibited in different ways in different languages. We provide examples in English,
French, and Hebrew, drawn from the resources listed in Section 3.2. The methodology
we advocate, however, is completely general.

A common theme for all these features is idiosyncrasy: They are all aimed at locating
some linguistic property on which MWEs may differ from non-MWEs. We begin by
detailing these properties, along with the features that we define to reflect them. In all
cases, the feature is applied to a candidate MWE, defined here as a bigram of tokens (all
possible bigrams are potential candidates). The features are computed from the large
monolingual corpora described in Section 3.2. In order for a feature to fire, at least five
instances of the candidate MWE have to be present in the corpus.

Orthographic variation. Sometimes, MWEs are written with hyphens instead of inter-
token spaces. Examples include Hebrew2 xd-cddi (one sided) ‘unilateral’, English
elephant-bird, and French aide-soignant (help carer) ‘caregiver’. Of course, this feature
is only relevant for languages that use the hyphen in this way.

We define a binary feature, HYPHEN, whose value is 1 iff the corpus includes
instances of the candidate MWE in which the hyphen character connects the two tokens
of the bigram.

Capitalization. MWEs are often named entities, and in languages such as English and
French a large number of MWEs involve words whose first letter is capitalized. We therefore
define a feature, CAPS, whose value is a binary vector with 1 in the i-th place iff the
i-th word of the MWE candidate is capitalized.3 For example, the White House will
have the value ⟨0, 1, 1⟩. This feature is of course irrelevant for languages that do not use
capitalization.
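
Both orthographic features reduce to simple lookups over the corpus. The sketch below is our own illustration, not the authors' code; it assumes the corpus is available as a list of tokenized sentences, and the helper names are ours.

    from typing import Iterable, List, Tuple

    def hyphen_feature(candidate: Tuple[str, str],
                       sentences: Iterable[List[str]]) -> int:
        """HYPHEN: 1 iff the corpus contains the two tokens joined by a hyphen."""
        hyphenated = (candidate[0] + "-" + candidate[1]).lower()
        for tokens in sentences:
            for tok in tokens:
                if tok.lower() == hyphenated:
                    return 1
        return 0

    def caps_feature(candidate: Tuple[str, str]) -> Tuple[int, ...]:
        """CAPS: a binary vector with 1 in position i iff word i is capitalized."""
        return tuple(int(w[:1].isupper()) for w in candidate)

    # Hypothetical mini-corpus for illustration:
    corpus = [["an", "elephant-bird", "appeared"],
              ["the", "White", "House", "said"]]
    print(hyphen_feature(("elephant", "bird"), corpus))  # -> 1
    print(caps_feature(("White", "House")))              # -> (1, 1)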

Fossil words. MWEs sometimes include constituents that have no usage outside the
particular expression. Examples include Hebrew ird lTmiwn (went-down to-treasury)
‘was lost’, French night club, and English hocus pocus; as far as we know, this is a rather
universal property.

We define a feature, FOSSIL, whose value is a binary vector with 1 in the i-th place
iff the i-th word of the candidate only occurs in this particular bigram; the other words
of the candidate expression can be morphological variants of each other, but must share
the same lemma. For example, the value of FOSSIL for hocus pocus is ⟨1, 1⟩, whereas
for French night club it is ⟨1, 0⟩. In order to filter out potential typos, candidates must
occur at least five times in the corpus in order for this feature to fire.
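
As a rough illustration of FOSSIL (ours, not the authors' implementation; the lemma dictionary is an assumed stand-in for the morphological resources of Section 3.2), one can count how often each constituent's lemma occurs at all versus inside the candidate bigram:

    from collections import Counter
    from typing import Dict, List, Tuple

    def fossil_feature(candidate: Tuple[str, str],
                       sentences: List[List[str]],
                       lemma: Dict[str, str]) -> Tuple[int, int]:
        """FOSSIL: bit i is 1 iff every occurrence of constituent i (by lemma)
        happens inside an occurrence of the candidate bigram."""
        lem1, lem2 = (lemma.get(w, w) for w in candidate)
        total = Counter()    # occurrences of each lemma anywhere in the corpus
        inside = Counter()   # occurrences of each lemma inside the candidate bigram
        bigram_count = 0
        for tokens in sentences:
            lems = [lemma.get(t, t) for t in tokens]
            total.update(lems)
            for a, b in zip(lems, lems[1:]):
                if (a, b) == (lem1, lem2):
                    bigram_count += 1
                    inside[a] += 1
                    inside[b] += 1
        if bigram_count < 5:  # threshold from the text, to filter out potential typos
            return (0, 0)
        return (int(inside[lem1] == total[lem1]), int(inside[lem2] == total[lem2]))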

Frozen form. MWE constituents sometimes occur in one fixed, frozen form, where the
language’s morphology also licenses other forms. For example, spill the beans does
not license spill the bean, although bean is a valid form. Similarly, Hebrew bit xwlim
(house-of sick-people) ‘hospital’ requires that the noun xwlim be in the plural; the
variant bit xwlh (house-of sick-person) ‘a sick person’s house’ only has the literal
meaning. This feature is of use for languages that are not isolating.

2 To facilitate readability we use a transliteration of Hebrew using Roman characters; the letters used,

in Hebrew lexicographic order, are abgdhwzxTiklmnsypcqrˇst.

3 Here and in subsequent examples we do not assume that the length of an MWE is limited to 2.

In the present work, sin embargo, the vector is of length exactly 2.

We define a feature, FROZEN, whose value is a binary vector with 1 in the i-th place
iff the i-th word of the candidate never inflects in the context of this expression. For
example, the value of FROZEN for spill the beans is ⟨0, 1, 1⟩, and for Hebrew bit xwlim
(house-of sick-people) ‘hospital’ it is ⟨0, 1⟩.
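
A minimal sketch of FROZEN under the same assumptions (a corpus of tokenized sentences and a lemma dictionary; helper names are ours):

    from typing import Dict, List, Tuple

    def frozen_feature(candidate_lemmas: Tuple[str, str],
                       sentences: List[List[str]],
                       lemma: Dict[str, str]) -> Tuple[int, ...]:
        """FROZEN: bit i is 1 iff constituent i surfaces in exactly one form
        across all corpus occurrences of the candidate (identified by its lemmas)."""
        forms = (set(), set())  # surface forms observed in each position
        for tokens in sentences:
            for w1, w2 in zip(tokens, tokens[1:]):
                if (lemma.get(w1, w1), lemma.get(w2, w2)) == candidate_lemmas:
                    forms[0].add(w1)
                    forms[1].add(w2)
        if not forms[0]:        # the candidate never occurs
            return (0, 0)
        return tuple(int(len(f) == 1) for f in forms)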

Partial morphological inflection. In some cases, MWE constituents undergo a (strict but
non-empty) subset of the full inflections that they would undergo in isolation. For
example, the Hebrew bit mˇspT (house-of law) ‘court’ occurs in the following inflected
forms: bit hmˇspT ‘the court’ (75%); bit mˇspT ‘a court’ (15%); bti hmˇspT ‘the courts’ (8%);
and bti mˇspT ‘courts’ (2%). Crucially, forms in which the second word, mˇspT ‘law,’ is
in the plural are altogether missing. Our assumption is that the inflection histograms of
non-MWEs are more uniform than the histograms of MWEs, in which some inflections
may be more frequent and others may be altogether missing. Of course, restrictions on
the histogram may stem from the part of speech of the expression; such constraints are
captured by dependencies in the BN structure.

We capture this property, which is again relevant for all non-isolating languages,
with a technique that has been proven useful in the area of image processing (Jain
1989, Section 7.3). We compute a histogram of the distribution in the corpus of all
the possible surface forms of each MWE candidate. Such histograms can compactly
represent distributional information on morphological behavior, in the same way that
histograms of the distribution of gray levels in a picture are used to represent the picture
sí mismo. Por ejemplo, the histogram corresponding to bit mˇspT (house-of law) ‘court’
would be

⟨(bit hmˇspT, 0.75), (bit mˇspT, 0.15), (bti hmˇspT, 0.08), (bti mˇspT, 0.02)⟩

Because each MWE is idiosyncratic in its own way, we do not expect the histograms
of MWEs to have some specific pattern, except non-uniformity. We therefore sort the
columns of each histogram, thereby losing information pertaining to the specific inflec-
tions, and retaining only information about the idiosyncrasy of the histogram. For the
example given, the obtained histogram is ⟨75, 15, 8, 2⟩. In contrast, the non-MWE txwm
mˇspT (domain-of law) ‘domain of the law’, which is syntactically identical, occurs in
nine different inflected forms, and its sorted histogram is ⟨59, 14, 7, 7, 5, 2, 2, 2, 2⟩. The
longer “tail” of the histogram is typical of compositional expressions.

Off-line, we compute the average histogram for positive and negative examples:
The average histogram of MWEs is shorter and less uniform than the average histogram
of non-MWEs. We define a binary feature, HIST, that determines whether the histo-
gram of the candidate is closer, in terms of L1 (Manhattan) distance, to the average
histogram of positive or of negative examples.
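
The following sketch shows one plausible reading of HIST (ours; the exact normalization and the way the average histograms are computed are assumptions on details the text leaves open): sort the surface-variant distribution of the candidate and compare it with the positive and negative averages under L1 distance, padding the shorter vector with zeros.

    from collections import Counter
    from typing import List, Sequence

    def sorted_histogram(variant_counts: Counter) -> List[float]:
        """Relative frequencies of the candidate's surface variants, sorted decreasingly."""
        total = sum(variant_counts.values())
        return sorted((c / total for c in variant_counts.values()), reverse=True)

    def l1_distance(h1: Sequence[float], h2: Sequence[float]) -> float:
        """Manhattan distance; the shorter histogram is padded with zeros."""
        n = max(len(h1), len(h2))
        pad = lambda h: list(h) + [0.0] * (n - len(h))
        return sum(abs(a - b) for a, b in zip(pad(h1), pad(h2)))

    def hist_feature(variant_counts: Counter,
                     avg_positive: Sequence[float],
                     avg_negative: Sequence[float]) -> int:
        """HIST: 1 iff the sorted histogram is closer to the average positive histogram."""
        h = sorted_histogram(variant_counts)
        return int(l1_distance(h, avg_positive) <= l1_distance(h, avg_negative))

    # The example from the text, bit mˇspT 'court' (the article reports the same
    # distribution as percentages, ⟨75, 15, 8, 2⟩):
    court = Counter({"bit hmˇspT": 75, "bit mˇspT": 15, "bti hmˇspT": 8, "bti mˇspT": 2})
    print(sorted_histogram(court))  # [0.75, 0.15, 0.08, 0.02]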

In our corpora, the average histogram of English positive examples has exactly four
elements: 93.62, 5.86, 0.45, and 0.05. This shows a clear tendency (93.62%) of English
MWEs to occur in a single form only; and it also implies that no English MWE occurs in
more than four variants. The English negative instances, in contrast, have a much longer
histogram (12 elements); the first element is 85.83, much lower than the dominating
element of the positive examples. In French, which is morphologically much richer, the
number of elements in the average histogram of positive examples is 32 (the domi-
nating elements are 90.8, 6.9, 1.1, 0.4), whereas the number of elements in the average
histogram of negative examples is 92 (dominated by 75.6, 14.5, 3.9, 2.0).

Context. We hypothesize that MWEs tend to constrain their (semantic) context more
strongly than non-MWEs. We expect words that occur immediately after MWEs to vary
less freely than words that immediately follow other expressions. One motivation for
this hypothesis is the observation that MWEs tend to be less polysemous than free
combinations of words, thereby limiting the possible semantic context in which they
can occur. This seems to us to be a universal property.

We define a feature, CONTEXT, as follows. We first compute a histogram of the
frequencies of words following each candidate MWE. We trim the tail of the histogram
by removing words whose frequency is lower than 0.1% (the expectation is that non-
MWEs would have a much longer tail). Off-line, we compute the same histograms for
positive and negative examples and average them as before. The value of CONTEXT
is 1 iff the histogram of the candidate is closer (in terms of L1 distance) to the positive
average.

For example, the histogram of Hebrew bit mˇspT ‘court’ includes 15 values, dom-
inated by bit mˇspT yliwn ‘supreme court’ (20%) and bit mˇspT mxwzi ‘district court’
(13%), followed by contexts whose frequency ranges between 5% and 0.6%. In contrast,
the non-MWE txwm mˇspT ‘domain-of law’ has a much shorter histogram, namely
(12, 11, 6): Over 70% of the words following this expression occur with frequency lower
than 0.1% and are hence in the trimmed tail.
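
CONTEXT follows the same compare-to-the-averages recipe, applied to the words that immediately follow the candidate and trimmed at 0.1%. A short sketch along the same lines (again our own illustration, not the original implementation):

    from collections import Counter
    from typing import List, Sequence, Tuple

    def l1_distance(h1: Sequence[float], h2: Sequence[float]) -> float:
        n = max(len(h1), len(h2))
        pad = lambda h: list(h) + [0.0] * (n - len(h))
        return sum(abs(a - b) for a, b in zip(pad(h1), pad(h2)))

    def context_histogram(candidate: Tuple[str, str],
                          sentences: List[List[str]],
                          trim: float = 0.001) -> List[float]:
        """Sorted frequencies of words that immediately follow the candidate,
        with the tail (relative frequency below `trim`, i.e. 0.1%) removed."""
        followers = Counter()
        for tokens in sentences:
            for i in range(len(tokens) - 2):
                if (tokens[i], tokens[i + 1]) == candidate:
                    followers[tokens[i + 2]] += 1
        total = sum(followers.values())
        if total == 0:
            return []
        freqs = sorted((c / total for c in followers.values()), reverse=True)
        return [f for f in freqs if f >= trim]

    def context_feature(candidate: Tuple[str, str], sentences: List[List[str]],
                        avg_positive: Sequence[float],
                        avg_negative: Sequence[float]) -> int:
        """CONTEXT: 1 iff the trimmed follower histogram is closer to the positive average."""
        h = context_histogram(candidate, sentences)
        return int(l1_distance(h, avg_positive) <= l1_distance(h, avg_negative))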

Syntactic diversity. MWEs can belong to various part of speech categories. We define as a
feature, POS, the category of the candidate, with values obtained by selecting frequent
tuples of POS tags. For example, English heart attack is Noun-Noun, dark blue is
Adj-Adj, Al Capone is PropN-PropN; French chant funèbre (song funeral) ‘dirge’ is
Noun-Adj, en bas (in low) ‘down’ is Prep-Adj; Hebrew rkbt hrim (train-of mountains)
‘roller-coaster’ is Noun-Noun, and so on.

Translational equivalents. Because MWEs are often idiomatic, they tend to be translated in
a non-literal way, sometimes to a single word. We use a bilingual dictionary to generate
word-by-word translations of candidate MWEs from Hebrew to English, and check
the number of occurrences of the English literal translation in a large English corpus.
For French–English, we check whether the literal translation occurs in the Giza++ (Och
and Ney 2000) alignment results (we use grow-diag-final-and for symmetrization in this
case, to improve the precision). Due to differences in word order between the two
languages, we create two variants for each translation, corresponding to both possible
orders. We expect non-MWEs to have some literal translational equivalent (possibly
with frequency that correlates with their frequency in the source language), whereas for
MWEs we expect no (or few) literal translations. For example, consider Hebrew sprwt
iph (literature pretty) ‘belles lettres’. Literal translation of the expression to English
yields literature pretty and pretty literature; we expect these phrases to occur rarely
in an English corpus. In contrast, the compositional tmwnh iph (picture pretty) ‘pretty
picture’ is much more likely to occur literally in English.

We define a binary feature, TRANS, whose value is 1 iff some literal translation
of the candidate occurs more than five times in the corpus. Although this feature is
not language-specific, we assume that it should work best for pairs of rather distinct
languages.
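
TRANS needs only dictionary lookup and a frequency check. A sketch under our assumptions (the bilingual dictionary maps a source word to a set of English translations, and English bigram counts are precomputed; both are stand-ins for the resources of Section 3.2):

    from collections import Counter
    from typing import Dict, Set, Tuple

    def trans_feature(candidate: Tuple[str, str],
                      dictionary: Dict[str, Set[str]],
                      english_bigram_counts: Counter,
                      threshold: int = 5) -> int:
        """TRANS: 1 iff some literal word-by-word translation of the candidate,
        in either word order, occurs more than `threshold` times in the English corpus."""
        t1 = dictionary.get(candidate[0], set())
        t2 = dictionary.get(candidate[1], set())
        for e1 in t1:
            for e2 in t2:
                for bigram in ((e1, e2), (e2, e1)):  # both orders, for word-order differences
                    if english_bigram_counts[bigram] > threshold:
                        return 1
        return 0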

Collocation. As a baseline statistical association measure, we use pointwise mutual
information (PMI). We define a binary feature, PMI, with two values, low and high,
reflecting an experimentally determined threshold. Clearly, other association measures
(as well as combinations of more than one) could be used (Pecina 2005).
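
For completeness, a standard PMI computation over unigram and bigram counts, thresholded into the low/high values used here; the numeric threshold below is only a placeholder, since the article determines it experimentally.

    import math
    from collections import Counter
    from typing import List, Tuple

    def count_ngrams(sentences: List[List[str]]) -> Tuple[Counter, Counter]:
        unigrams, bigrams = Counter(), Counter()
        for tokens in sentences:
            unigrams.update(tokens)
            bigrams.update(zip(tokens, tokens[1:]))
        return unigrams, bigrams

    def pmi(bigram: Tuple[str, str], unigrams: Counter, bigrams: Counter) -> float:
        """Pointwise mutual information: log2 of P(w1, w2) / (P(w1) * P(w2))."""
        p_bigram = bigrams[bigram] / sum(bigrams.values())
        p1 = unigrams[bigram[0]] / sum(unigrams.values())
        p2 = unigrams[bigram[1]] / sum(unigrams.values())
        return math.log2(p_bigram / (p1 * p2)) if p_bigram > 0 else float("-inf")

    def pmi_feature(bigram: Tuple[str, str], unigrams: Counter, bigrams: Counter,
                    threshold: float = 3.0) -> str:  # placeholder threshold
        """PMI feature with two values, 'low' and 'high'."""
        return "high" if pmi(bigram, unigrams, bigrams) >= threshold else "low"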

Figure 1
The Bayesian network for MWE identification. (Nodes: MWE, POS, CAPS, HYPHEN, FOSSIL,
FRZN, CNTXT, TRANS, HIST, PMI.)


5. Feature Interdependencies Expressed as a Bayesian Network

A Bayesian network (Jensen and Nielsen 2007) is organized as a directed acyclic
graph whose nodes are random variables and whose edges represent interdependencies
among those variables. We use a particular view of BNs, known as causal networks, in
which directed edges lead to a variable from each of its direct causes.4 This facilitates the
expression of domain knowledge (and intuitions, beliefs, etc.) as structural properties of
the network. We use the BN as a classification device: Training amounts to computing
the joint probability distribution of the training set, and classification maximizes the
posterior probability of the particular node (variable) being queried.

For MWE identification we define a BN whose nodes correspond to the features de-
scribed in Section 4. In addition, we define a node, MWE, for the complete classification
task. Over these nodes we impose the structure depicted graphically in Figure 1. This
structure, which we motivate below, is manually defined: It reflects our understanding
of the problem domain and is a result of our linguistic intuition. Having said that, it can of
course be modified in various ways, and, in particular, new nodes can be easily added
to reflect additional features.

All nodes depend on MWE, as all are affected by whether or not the candidate is
an MWE. The POS of an expression influences its morphological inflection, hence the
edges from POS to HIST and to FROZEN. For example, Hebrew noun-noun constructions
allow their constituents to undergo the full inflectional paradigm, but when such a
construction is a MWE, inflection is severely constrained (Al-Haj and Wintner 2010);
similarly, when one of the constituents of a MWE is a conjunction, the entire expression
is very likely to be frozen, as in English by and large and more or less.

4 The direction of edges is from the target to the observable; this is compatible with the use of BNs in

latent-variable generative models.


Fossil words clearly affect all statistical metrics, hence the edge from FOSSIL to
PMI. They also affect the existence of literal translations, because if a word is not in
the lexicon, it does not have a translation, hence the edge from FOSSIL to TRANS. Also,
we assume that there is a correlation between the frequency (and PMI) of a candidate
and whether or not a literal translation of the expression exists, hence the edge from
PMI to TRANS. The edges from PMI and HIST to CONTEXT are justified by the correlation
between the frequency and variability of an expression and the variability of the context
in which it occurs.

Clearly, the process of determining the structure of the graph, and in particular the
direction of some of the edges, is somewhat arbitrary. Having said that, it does give the
designer of the system a clear and explicit way of expressing linguistically motivated
intuitions about dependencies among features.

Once the structure of the network is established, the conditional probabilities of
each dependency have to be determined. We compute the conditional probability tables
from our training data (see Section 6) using Weka (Hall et al. 2009), and obtain values for
P(X | X1, . . . , Xk) for each variable X and all variables Xi, 1 ≤ i ≤ k (parents of X), such
that the graph includes an edge from Xi to X. We then use the network for classification
by maximizing P(Xmwe | X1, . . . , Xk), where Xmwe corresponds to the node MWE, and
X1, . . . , Xk are the variables corresponding to all other nodes in the network. According
to Bayes’ rule, we have

PAG(Xmwe | X1, . . . , Xk)
PAG(X1, . . . , Xk

| Xmwe) × P(Xmwe)

We define the prior, P(Xmwe), to be 0.41: This is the percentage of MWEs in WordNet 1.7
(Fellbaum 1998). This figure is of course rather arbitrary, but several studies indicate that
the percentage of MWEs in the (mental) lexicon is approximately one half (Jackendoff
1997; Erman and Warren 2000; Sag et al. 2002). Post factum, we experimented with
various other values for this parameter. We chose values between 0.3 and 0.55, in
increments of 0.05, and computed the F-score of the system on the task of extracting
English MWEs (see Section 7). As Table 3 shows, the differences are small (and not
statistically significant), meaning that the accuracy of the system seems to be rather
robust to the actual value of the prior. Given a small tuning set, it should be possible to
optimize the choice of the prior more systematically.
The conditional probabilities P(X1, . . . , Xk | Xmwe) are determined by Weka from
the conditional probability tables:

P(X1, . . . , Xk | Xmwe) = Π_{i=1}^{k} P(Xi | pai)

where k is the number of nodes in the BN (other than Xmwe) and pai is the set of parents
of Xi.
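
Classification therefore amounts to evaluating both values of Xmwe under this factorization and picking the larger posterior. The sketch below is our simplification of what Weka computes internally; smoothing of the conditional probability tables is ignored, and the parent sets are transcribed from the description of Figure 1.

    from typing import Dict, List, Tuple

    # A node's CPT maps (node value, tuple of parent values) -> probability.
    CPT = Dict[Tuple[str, Tuple[str, ...]], float]

    def posterior_mwe(evidence: Dict[str, str],
                      parents: Dict[str, List[str]],
                      cpts: Dict[str, CPT],
                      prior_mwe: float = 0.41) -> float:
        """P(MWE = 'yes' | evidence) under P(X1..Xk | Xmwe) = prod_i P(Xi | parents(Xi))."""
        scores = {}
        for mwe_value, prior in (("yes", prior_mwe), ("no", 1.0 - prior_mwe)):
            assignment = dict(evidence, MWE=mwe_value)
            p = prior
            for node, pa in parents.items():
                pa_values = tuple(assignment[x] for x in pa)
                p *= cpts[node][(assignment[node], pa_values)]
            scores[mwe_value] = p
        total = scores["yes"] + scores["no"]
        return scores["yes"] / total if total > 0 else 0.0

    # Parent sets following the description of Figure 1 (MWE is a parent of every node):
    parents = {
        "POS": ["MWE"], "CAPS": ["MWE"], "HYPHEN": ["MWE"], "FOSSIL": ["MWE"],
        "FRZN": ["MWE", "POS"], "HIST": ["MWE", "POS"],
        "PMI": ["MWE", "FOSSIL"], "TRANS": ["MWE", "FOSSIL", "PMI"],
        "CNTXT": ["MWE", "PMI", "HIST"],
    }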

Table 3
F-score as a function of the value of the prior.

P(Xmwe)    0.3      0.35     0.4      0.41     0.45     0.5      0.55
F-score    0.848    0.84     0.833    0.835    0.831    0.836    0.843


Table 4
Sizes of the training sets.

            MWE      non-MWE    Total
English     1,381    2,004      3,385
French      1,445    2,089      3,534
Hebrew      350      504        854

6. Automatic Generation of Training Data

For training we need samples of positive and negative instances of MWEs, each asso-
ciated with a vector of the values of all features discussed in Section 4. We generate
this training material automatically, using the small bilingual corpora described in
Sección 3.2. Each parallel corpus is first word-aligned with IBM Model 4 (Brown et al.
1993), implemented in Giza++ (Och and Ney 2003); we use union for symmetrization
here, to improve the recall. Then, we apply the (completely unsupervised) algorithm of
Tsvetkov and Wintner (2012), which extracts MWE candidates from the aligned corpus
and re-ranks them using statistics computed from a large monolingual corpus.

The core idea behind this algorithm is that MWEs tend to be translated in non-
literal ways; in a parallel corpus, words that are 1:1 aligned typically indicate literal
translations and are hence unlikely constituents of MWEs. The algorithm hence focuses
on misalignments: It trusts the quality of 1:1 alignments (which are further verified with
a bilingual dictionary) and searches for MWEs exactly in the areas that word alignment
failed to properly align, not relying on the alignment in these cases. Specifically, the
algorithm views all words that are not included in 1:1 alignments as potential areas
in which to search for MWEs, independently of how these words were aligned by the
word-aligner. Then, it uses statistics computed from a large monolingual corpus to rank
the MWE candidates; specifically, we use the PMI score of candidates based on counts
from the monolingual corpora. Finally, the algorithm extracts maximally long sequences
of words from the unaligned parallel phrases, in which each bigram has a PMI score
above some threshold (determined experimentally). All bigrams in those sequences are
considered MWEs. See Tsvetkov and Wintner (2012) for more details.
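
A compressed sketch of this extraction step, in the spirit of the description above (our paraphrase, not the original implementation): every maximal span of words left outside the 1:1 alignments is a search area and, for bigrams, keeping the maximal above-threshold sequences comes down to keeping every above-threshold bigram in such a span.

    from typing import Callable, List, Sequence, Set, Tuple

    def unaligned_spans(sentence: Sequence[str],
                        aligned_1to1: Set[int]) -> List[List[str]]:
        """Maximal runs of words that are not part of any 1:1 alignment."""
        spans, current = [], []
        for i, word in enumerate(sentence):
            if i in aligned_1to1:
                if current:
                    spans.append(current)
                current = []
            else:
                current.append(word)
        if current:
            spans.append(current)
        return spans

    def positive_candidates(sentence: Sequence[str],
                            aligned_1to1: Set[int],
                            pmi: Callable[[Tuple[str, str]], float],
                            threshold: float) -> List[Tuple[str, str]]:
        """Bigrams inside unaligned spans whose monolingual PMI clears the threshold."""
        return [(w1, w2)
                for span in unaligned_spans(sentence, aligned_1to1)
                for w1, w2 in zip(span, span[1:])
                if pmi((w1, w2)) > threshold]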

The set of MWEs that is determined in this way constitutes the positive examples
in the training set. For negative examples, we use two sets of bigrams: Those that are 1:1
aligned and have high PMI; and those that are misaligned but have low PMI. To decide
how many negative examples to generate, we rely on the ratio between MWE and non-
MWE entries in WordNet, as mentioned above: PAG(Xmwe) = 0.41. We thus select from the
negative set approximately 50% more negative examples than positive ones, such that
the ratio between the sizes of the sets is 0.41 : 0.59. The sizes of the resulting training
sets are listed in Table 4.

7. Results and Evaluation

We use the training data described in Section 6 for training and evaluation: We perform
10-fold cross validation experiments, reporting accuracy and (balanced) F-score in three
set-ups: One (SVM) in which we train an SVM classifier5 with the features described
in Section 4; one (BN-auto) in which we train a Bayesian network with these features,
but let Weka determine its structure (using the K2 algorithm); and one (BN) in which
we train a Bayesian network whose structure reflects manually crafted linguistically
motivated knowledge, as depicted in Figure 1. The results are listed in Table 5; they are
compared with a PMI baseline, obtained by defining a Bayesian network with only two
nodes, MWE and PMI.

5 We use Weka SMO with the PolyKernel set-up; experimentation with several other kernels yielded

worse results.


Table 5
10-fold cross validation evaluation results.

           Hebrew                   French                   English
           Accuracy (%)   F-score   Accuracy (%)   F-score   Accuracy (%)   F-score
PMI        66.98          0.67      70.88          0.762     74.15          0.737
BN-auto    71.19          0.71      77.45          0.775     82.16          0.822
SVM        74.59          0.75      78.38          0.736     82.95          0.828
BN         76.82          0.77      79.04          0.778     83.52          0.835

The linguistically motivated features defined in Section 4 are clearly helpful in the
classification task: The accuracy of an SVM, informed by these features, is close to 75%
for Hebrew, over 78% for French, and as high as 83% for English, reducing the error rate
of the PMI baseline by 23% (Hebrew) to 34% (English). The contribution of the BN is
also highly significant, reducing 3–9% more errors (with respect to the errors made by
the SVM classifier).6 In total, the best method, BN, reduces the error rate of the PMI-
based classifier by one third. Interestingly, a BN whose structure does not reflect prior
knowledge, but is rather learned automatically, performs worse than these two methods
(but still much better than relying on PMI alone).7 It is the combination of linguistically
motivated features with feature interdependencies reflecting domain knowledge that
contributes to the best performance.

We did not investigate the contribution of each of the features to the classification
task. However, we did analyze the weights assigned by the SVM classifier to specific
features. As expected, the most distinctive feature is PMI. Among the POS features,
the strongest feature is VB NNS, an indication of a negative instance. Capitalization is
also unsurprisingly a very strong feature. We leave a more systematic analysis of the
contribution of each feature to future work.

To further assess the quality of the results, we performed a human evaluation on
the English data set. We first produced the results in the BN set-up, and then sorted
both the (predicted) positive and the (predicted) negative instances by their PMI. We
randomly picked 100 instances of both lists, at the same positions in the ranked lists,
to constitute an evaluation set. We asked three English-speaking annotators to deter-
mine whether the 200 expressions were indeed MWEs. The annotation guidelines are
given in Appendix A. Comparing the three annotators’ labels, we found out that they
agreed on 141 of the 200 (70.5%). This should probably be taken as an upper bound for
the task.

6 The improvement of both BN and SVM over the baseline is highly significant statistically (sign test,
pag < 0.01 in all three cases); the improvement of BN over SVM is significant for English (p < 0.01) but not for French. 7 We are not sure why this is the case. One possible explanation is that our training set contains noisy examples, and as the BN-auto classifier learns the dependencies from noisy data, it performs worse than the SVM classifier. Another possible explanation is that it attempts to learn more dependencies, thereby increasing the parameter space of the model. 462 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 0 2 4 4 9 1 8 0 3 2 1 2 / c o l i _ a _ 0 0 1 7 7 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Tsvetkov and Wintner Identification of Multiword Expressions We then computed the majority label and compared it with our predicted label. Exactly 142 of the predicted labels were annotated as correct; that’s an accuracy of 71%. Of the 141 instances that the three annotators agreed on, our results predict the correct label for 112 instances (79.4%). We take these figures as a strong indication of the accuracy of the results. As an additional evaluation measure, we use the sets of bigrams in the English WordNet, and the bigram MWEs in the DELA dictionaries of English and French (Section 3.2). Because we only have positive instances in these evaluation sets, we can only report recall. We therefore use the Bayesian network classifier to extract MWEs from the large monolingual corpora discussed in Section 3.2. For each evaluation set (WordNet, DELA English, and DELA French), we divide the number of bigrams in the set that are classified as MWEs by the size of the intersection of the evaluation set with the monolingual corpus. In other words, we exclude from the evaluation those MWEs in the evaluation set that never occur in our corpora. The results are listed in Table 6. As examples of correctly identified MWEs, consider English advisory board, air cargo,adoptionagency,airticket,crudeoil, and so on, and French accordinternational ‘international agreement’, acte final ‘final act’, banque centrale ‘central bank’, ce soir ‘tonight’, and so forth, all taken from the DELA dictionaries. The relatively low recall of our method on these dictionaries is to a large extent due to a very liberal definition of MWEs that the dictionaries use. Many entries that are listed as MWEs are actually highly compositional, and hence our method fails to identify them. DELA entries that are not identified by our classifier include examples such as English abnormal behavior, abso- lute necessity, academic research, and so on. The French DELA dictionary is especially extensive, with examples such as action sociale, action antitumorale, action associa- tive, action caritative, action collective, action commerciale, action communautaire, and many more, all listed as MWEs. Our system only recognizes the first of these. The WordNet results are obviously much better. Correctly identified MWEs include ad hoc, outer space, web site, inter alia, road map, and so forth. WordNet MWEs that our system failed to identify include has been, as well, in this, a few, set up, and so on. A more involved error analysis is required in order to propose potential directions for improvement on this set. As a further demonstration of the utility of our approach, we evaluate the algorithm on the set NN of Hebrew noun-noun constructions described in Section 3.2. 
We train a Bayesian network on the training set described in Section 6 and use it to
classify the set NN. We compare the results of this classifier with a PMI baseline, and
also with the classification results reported by Al-Haj and Wintner (2010); the latter
reflects 10-fold cross-validation evaluation using the entire set, so it may be considered
an upper bound for any classifier that uses a general training corpus. The results are
depicted in Table 7.

Table 7
Evaluation results: noun-noun constructions.

        Accuracy    Precision    Recall    F-score
PMI     71.43%      0.71         0.71      0.71
BN      77.00%      0.77         0.77      0.77
AW      80.77%      0.77         0.81      0.79

AW = results from Al-Haj and Wintner (2010)

They clearly demonstrate that the linguistically motivated features we define provide
a significant improvement in classification accuracy over the baseline PMI measure.
Note that our F-score, 0.77, is very close to the best result of 0.79 obtained by Al-Haj
and Wintner (2010) as the average of 10-fold cross validation runs, using only
high-frequency noun-noun constructions for training. We interpret this result as a
further proof of the robustness of our architecture.

Finally, we conduct an analysis of the quality of extracted (Hebrew) MWEs. We
used the trained BN to classify the entire set of bigrams present in the (Hebrew side
of the) Hebrew–English parallel corpus described in Section 3.2. Of the more than
140,000 candidates, only 4,000 are classified as MWEs. We sort this list of potential
MWEs by the probability assigned by the BN to the positive value of the variable Xmwe.
The resulting sorted list is dominated by high-PMI bigrams, especially proper names,
all of which are indeed MWEs. The first non-MWE (false positive) occurs in the 50th
place on the list; it is crpt niqwla ‘France Nicolas’, which is obviously a sub-sequence
of the larger MWE, neia crpt niqwla srqwzi ‘French president Nicolas Sarkozy’. Similar
sub-sequences are also present, but only five are in the top 100. Such false positives can
be reduced when longer MWEs are extracted, as it can be assumed that a sub-sequence
of a longer MWE does not have to be identified. Other false positives in the top 100
include some highly frequent expressions, but over 85 of the top 100 are clearly MWEs.
Although more careful evaluation is required in order to estimate the rate of true
positives in this list, we trust that the vast majority of the positive results are indeed
MWEs.

8. Conclusions and Future Work

We presented a novel architecture for identifying MWEs in text corpora. The main
insights we emphasize are sophisticated computational encoding of linguistic knowl-
edge that focuses on the idiosyncratic behavior of such expressions. This is reflected
in two ways in our work: by defining computable features that reflect different facets
of irregularities; and by framing the features as part of a larger Bayesian network that
accounts for interdependencies among them. We also introduce a method for automat-
ically generating a training set for this task, which renders the classification entirely
unsupervised.
8. Conclusions and Future Work

We presented a novel architecture for identifying MWEs in text corpora. The main insight we emphasize is a sophisticated computational encoding of linguistic knowledge that focuses on the idiosyncratic behavior of such expressions. This is reflected in two ways in our work: by defining computable features that reflect different facets of irregularity; and by framing the features as part of a larger Bayesian network that accounts for interdependencies among them. We also introduce a method for automatically generating a training set for this task, which renders the classification entirely unsupervised. The result is a classifier that can identify MWEs of several types and constructions. Evaluation on three languages (English, French, and Hebrew) shows a significant improvement in the accuracy of the classifier compared with less sophisticated baselines.

The modular architecture of Bayesian networks facilitates easy exploration with more features. We are currently investigating the contribution of various other sources of information to the classification task. For example, Hebrew lacks large-scale lexical semantic resources. However, it is possible to literally translate an MWE candidate to English and rely on the English WordNet for generating synonyms of the literal translation. Such “literal synonyms” can then be back-translated to Hebrew. The assumption is that if a back-translated expression has a low PMI, the original candidate is very likely not an MWE. Although such a feature may contribute little on its own, incorporating it in a well-structured BN may improve performance. Another feature that can easily be implemented in this way is whether the POS of MWE constituents is retained when the expression is translated to another language; we hypothesize that this is much more likely when the expression is compositional.
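As an illustration of how the proposed “literal synonyms” feature might be realized, the sketch below follows the pipeline just described: literal translation, WordNet synonym expansion, back-translation, and a PMI check. The helpers translate_pair, back_translate, and pmi are hypothetical placeholders (the article does not describe an implementation), and returning the maximum PMI over the back-translated variants is only one possible reading of the feature; under the assumption stated above, a low value would suggest that the candidate is not an MWE.

```python
# Sketch of the proposed "literal synonyms" feature.  translate_pair
# (Hebrew bigram -> literal English word pair), back_translate (English
# word pair -> Hebrew bigram, or None if untranslatable), and pmi
# (Hebrew bigram -> PMI in the monolingual corpus) are hypothetical
# placeholders, not components described in the article.
from itertools import product
from nltk.corpus import wordnet as wn  # requires the NLTK WordNet data


def literal_synonyms(word):
    """English synonyms of a literally translated constituent, via WordNet."""
    synonyms = {lemma for synset in wn.synsets(word)
                for lemma in synset.lemma_names()}
    return synonyms | {word}


def back_translated_pmi(candidate, translate_pair, back_translate, pmi):
    """Highest PMI among back-translated literal-synonym variants of the
    candidate; per the assumption stated in the text, a low value suggests
    that the original candidate is not an MWE."""
    w1, w2 = translate_pair(candidate)
    scores = []
    for s1, s2 in product(literal_synonyms(w1), literal_synonyms(w2)):
        variant = back_translate((s1, s2))
        if variant is not None:
            scores.append(pmi(variant))
    # No back-translatable variant: the feature is undefined for this candidate.
    return max(scores) if scores else None
```

The resulting value could then be discretized and added as one more node of the Bayesian network, alongside the existing features.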
Appendix A. Annotation Guidelines

These are the instructions given to the annotators.

The task is to annotate each line as either a multi-word expression, in which case mark 1 in the first field, or not, in which case the value is 0. It’s a hard task, but you are requested to be decisive. Please do not change the file in any other way.

The main criterion for determining whether an expression is an MWE is whether it has to be stored in a computational lexicon. Typically, expressions are stored in lexicons if they exhibit idiosyncratic (irregular) behavior. This could be due to:

- non-compositional meaning. For example, ‘green light’ is an MWE because it is not a light; ‘kill time’ is not a violent action. A good indication of non-compositional meaning is limited reference. For example, if someone gives you a green light, you can’t then refer to it as ‘the light I was given’.

- non-substitutability of elements. For example, ‘breast cancer’ is an MWE because while ‘breast’ and ‘chest’ can often be substituted, ‘breast cancer’ and ‘chest cancer’ cannot.

- fossil words, i.e., words that only occur in the context of the expression. For example, ‘mutatis mutandis’.

- nominalization. If the expression can occur as a single word, or with a connecting hyphen, it is a strong indication that it is an MWE. For example, ‘road map’ can be written ‘roadmap’.

- irregular syntactic and/or morphological behavior. For example, ‘look up’ is an MWE because while ordinarily you can convert ‘I walked up the alley’ to ‘Up the alley I walked’, you can’t convert ‘I looked up that word in a dictionary’ to ‘Up that word I looked’.

- proper names. All proper names are by definition MWEs. This includes people (‘Barack Obama’), places (‘Tel Aviv’), organizations (‘United Nations’), etc.

But really, the best criterion is: if I hadn’t known this expression, would I be able to use it properly simply by knowing its two constituents? Would I understand its meaning, be able to inflect it properly, construct syntactic constructions with it, and in general use it in the right context in the right way?

Acknowledgments
This research was supported by The Israel Science Foundation (grants 137/06 and 1269/07). We are grateful to Gennadi Lembersky for his continuous help, and to the three anonymous Computational Linguistics reviewers for very constructive comments that greatly improved this article. All remaining errors are of course our own.

References

Al-Haj, Hassan. 2010. Hebrew multiword expressions: Linguistic properties, lexical representation, morphological processing, and automatic acquisition. Master’s thesis, University of Haifa.
Al-Haj, Hassan and Shuly Wintner. 2010. Identifying multi-word expressions by leveraging morphological and syntactic idiosyncrasy. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), pages 10–18, Beijing.
Baldwin, Timothy, Colin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL 2003 Workshop on Multiword Expressions, pages 89–96, Sapporo.
Baldwin, Timothy and Takaaki Tanaka. 2004. Translation by machine of complex nominals: Getting it right. In Second ACL Workshop on Multiword Expressions: Integrating Processing, pages 24–31, Barcelona.
Bannard, Colin. 2007. A measure of syntactic flexibility for automatically identifying multiword expressions in corpora. In Proceedings of the Workshop on a Broader Perspective on Multiword Expressions, pages 1–8, Prague.
Bannard, Colin, Timothy Baldwin, and Alex Lascarides. 2003. A statistical approach to the semantics of verb-particles. In Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 65–72, Sapporo.
Bird, Steven, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O’Reilly Media, Sebastopol, CA.
Bouamor, Dhouha, Nasredine Semmar, and Pierre Zweigenbaum. 2012. Identifying bilingual multi-word expressions for statistical machine translation. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), pages 674–679, Istanbul.
Brown, Peter F., Stephen Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.
Calado, Pável, Marco Cristo, Edleno Silva De Moura, Nivio Ziviani, Berthier A. Ribeiro-Neto, and Marcos André Gonçalves. 2003. Combining link-based and content-based methods for web document classification. In Proceedings of CIKM-03, 12th ACM International Conference on Information and Knowledge Management, pages 394–401, New Orleans, LA.
Callison-Burch, Chris, Philipp Koehn, Christof Monz, and Omar F. Zaidan, editors. 2011. Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics, Edinburgh.
Carpuat, Marine and Mona Diab. 2010. Task-based evaluation of multiword expressions: A pilot study in statistical machine translation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 242–245, Los Angeles, CA.
Chang, Baobao, Pernilla Danielsson, and Wolfgang Teubert. 2002. Extraction of translation unit from Chinese-English parallel corpora. In Proceedings of the First SIGHAN Workshop on Chinese Language Processing, pages 1–5, Morristown, NJ.
Church, Kenneth Ward and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.
Cook, Paul, Afsaneh Fazly, and Suzanne Stevenson. 2007. Pulling their weight: Exploiting syntactic forms for the automatic identification of idiomatic expressions in context. In Proceedings of the ACL Workshop on a Broader Perspective on Multiword Expressions (MWE 2007), pages 41–48, Prague.
Denoyer, Ludovic and Patrick Gallinari. 2004. Bayesian network model for semi-structured document classification. Information Processing and Management, 40(5):807–827.
Doucet, Antoine and Helena Ahonen-Myka. 2004. Non-contiguous word sequences for information retrieval. In Second ACL Workshop on Multiword Expressions: Integrating Processing, pages 88–95, Barcelona.
Duan, Jianyong, Mei Zhang, Lijing Tong, and Feng Guo. 2009. A hybrid approach to improve bilingual multiword expression extraction. In Thanaruk Theeramunkong, Boonserm Kijsirikul, Nick Cercone, and Tu-Bao Ho, editors, Advances in Knowledge Discovery and Data Mining, volume 5476 of Lecture Notes in Computer Science. Springer, Berlin and Heidelberg, pages 541–547.
Erman, Britt and Beatrice Warren. 2000. The idiom principle and the open choice principle. Text, 20(1):29–62.
Fazly, Afsaneh and Suzanne Stevenson. 2006. Automatically constructing a lexicon of verb phrase idiomatic combinations. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 337–344, Trento.
Fellbaum, Christiane, editor. 1998. WordNet: An Electronic Lexical Database. Language, Speech and Communication. MIT Press, Cambridge, MA.
Green, Spence, Marie-Catherine de Marneffe, John Bauer, and Christopher D. Manning. 2011. Multiword expression identification with tree substitution grammars: A parsing tour de force with French. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 725–735, Edinburgh.
Hall, Mark, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software: An update. SIGKDD Explorations, 11(1):10–18.
Hazelbeck, Gregory and Hiroaki Saito. 2010. A hybrid approach for functional expression identification in a Japanese reading assistant. In Proceedings of the 2010 Workshop on Multiword Expressions: From Theory to Applications, pages 81–84, Beijing.
Heckerman, David. 1995. A tutorial on learning with Bayesian networks. Technical Report MSR-TR-95-06, Microsoft Research, Redmond, WA.
Itai, Alon and Shuly Wintner. 2008. Language resources for Hebrew. Language Resources and Evaluation, 42(1):75–98.
Jackendoff, Ray. 1997. The Architecture of the Language Faculty. MIT Press, Cambridge, MA.
Jain, Anil K. 1989. Fundamentals of Digital Image Processing. Prentice-Hall, Inc., Upper Saddle River, NJ.
Jensen, Finn V. and Thomas D. Nielsen. 2007. Bayesian Networks and Decision Graphs. Springer, 2nd edition.
Katz, Graham and Eugenie Giesbrecht. 2006. Automatic identification of non-compositional multi-word expressions using latent semantic analysis. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 12–19, Sydney.
Kirschenbaum, Amit and Shuly Wintner. 2010. A general method for creating a bilingual transliteration dictionary. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC’10), pages 273–276, Valletta.
Lam, Wai, Kon F. Low, and Chao Y. Ho. 1997. Using a Bayesian network induction approach for text categorization. In Proceedings of IJCAI-97, 15th International Joint Conference on Artificial Intelligence, pages 745–750, Nagoya.
Lambert, Patrik and Rafael Banchs. 2005. Data inferred multi-word expressions for statistical machine translation. In Proceedings of the MT Summit X, pages 396–403, Phuket.
Miller, George A., Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine Miller. 1990. Five papers on WordNet. International Journal of Lexicography, 3(4):235–312.
Och, Franz Josef and Hermann Ney. 2000. Improved statistical alignment models. In ACL ’00: Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 440–447, Hong Kong.
Och, Franz Josef and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
Pearl, Judea. 1985. Bayesian networks: A model of self-activated memory for evidential reasoning. In Proceedings of the 7th Conference of the Cognitive Science Society, pages 329–334, University of California, Irvine, CA.
Pecina, Pavel. 2005. An extensive empirical study of collocation extraction methods. In Proceedings of the ACL Student Research Workshop, pages 13–18, Ann Arbor, MI.
Pecina, Pavel. 2008. A machine learning approach to multiword expression extraction. In Proceedings of the LREC Workshop Towards a Shared Task for Multiword Expressions, pages 54–57, Marrakech.
Peshkin, Leonid, Avi Pfeffer, and Virginia Savova. 2003. Bayesian nets in syntactic categorization of novel words. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: Companion Volume of the Proceedings of HLT-NAACL 2003, Short Papers, Volume 2, NAACL ’03, pages 79–81, Edmonton.
Petrov, Slav and Dan Klein. 2007. Improved inference for unlexicalized parsing. In Proceedings of HLT-NAACL, pages 404–411, Rochester, NY.
Piao, Scott Songlin, Paul Rayson, Dawn Archer, and Tony McEnery. 2005. Comparing and combining a semantic tagger and a statistical tool for MWE extraction. Computer Speech and Language, 19(4):378–397.
Ramisch, Carlos, Helena de Medeiros Caseli, Aline Villavicencio, André Machado, and Maria Finatto. 2010. A hybrid approach for multiword expression identification. In Thiago Pardo, António Branco, Aldebaro Klautau, Renata Vieira, and Vera de Lima, editors, Computational Processing of the Portuguese Language, volume 6001 of Lecture Notes in Computer Science. Springer, Berlin and Heidelberg, pages 65–74.
Ramisch, Carlos, Paulo Schreiner, Marco Idiart, and Aline Villavicencio. 2008. An evaluation of methods for the extraction of multiword expressions. In Proceedings of the LREC Workshop Towards a Shared Task for Multiword Expressions, pages 50–53, Marrakech.
Ren, Zhixiang, Yajuan Lü, Jie Cao, Qun Liu, and Yun Huang. 2009. Improving statistical machine translation using domain bilingual multiword expressions. In Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications, pages 47–54, Singapore.
Sag, Ivan, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics (CICLING 2002), pages 1–15, Mexico City.
Savova, Virginia and Leonid Peshkin. 2005. Dependency parsing with dynamic Bayesian network. In Proceedings of the 20th National Conference on Artificial Intelligence, Volume 3, pages 1,112–1,117, Pittsburgh, PA.
Smadja, Frank A. 1993. Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143–177.
Smith, Noah A. 2011. Linguistic Structure Prediction. Synthesis Lectures on Human Language Technologies. Morgan and Claypool.
Tsvetkov, Yulia and Shuly Wintner. 2010a. Automatic acquisition of parallel corpora from websites with dynamic content. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC’10), pages 3,389–3,392, Valletta.
Tsvetkov, Yulia and Shuly Wintner. 2010b. Extraction of multi-word expressions from small parallel corpora. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), pages 1,256–1,264, Beijing.
Tsvetkov, Yulia and Shuly Wintner. 2011. Identification of multi-word expressions by combining multiple linguistic information sources. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 836–845, Edinburgh.
Tsvetkov, Yulia and Shuly Wintner. 2012. Extraction of multi-word expressions from small parallel corpora. Natural Language Engineering, 18(4):549–573.
Uchiyama, Kiyoko, Timothy Baldwin, and Shun Ishizaki. 2005. Disambiguating Japanese compound verbs. Computer Speech & Language, 19(4):497–512.
Van de Cruys, Tim and Begoña Villada Moirón. 2007. Semantics-based multiword expression extraction. In Proceedings of the Workshop on a Broader Perspective on Multiword Expressions, pages 25–32, Prague.
Varga, Dániel, Péter Halácsy, András Kornai, Viktor Nagy, László Németh, and Viktor Trón. 2005. Parallel corpora for medium density languages. In Proceedings of RANLP 2005, pages 590–596, Borovets.
Venkatapathy, Sriram and Aravind Joshi. 2006. Using information about multi-word expressions for the word-alignment task. In Proceedings of the COLING/ACL Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 20–27, Sydney.
Venkatsubramanyan, Shailaja and Jose Perez-Carballo. 2004. Multiword expression filtering for building knowledge. In Second ACL Workshop on Multiword Expressions: Integrating Processing, pages 40–47, Barcelona.
Villavicencio, Aline, Valia Kordoni, Yi Zhang, Marco Idiart, and Carlos Ramisch. 2007. Validation and evaluation of automatically acquired multiword expressions for grammar engineering. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 1,034–1,043, Prague.
Weller, Marion and Fabienne Fritzinger. 2010. A hybrid approach for the identification of multiword expressions. In Proceedings of the SLTC 2010 Workshop on Compounds and Multiword Expressions, pages 1–2, Linköping.