Unsupervised Type and Token Identification
of Idiomatic Expressions
Afsaneh Fazly∗
University of Toronto
Paul Cook∗∗
University of Toronto
Suzanne Stevenson†
University of Toronto
Idiomatic expressions are plentiful in everyday language, yet they remain mysterious, as it
is not clear exactly how people learn and understand them. They are of special interest to
linguists, psycholinguists, and lexicographers, mainly because of their syntactic and semantic
idiosyncrasies as well as their unclear lexical status. Despite a great deal of research on the
properties of idioms in the linguistics literature, there is not much agreement on which properties
are characteristic of these expressions. Because of their peculiarities, idiomatic expressions have
mostly been overlooked by researchers in computational linguistics. In this article, we look
into the usefulness of some of the identified linguistic properties of idioms for their automatic
recognition. Specifically, we develop statistical measures that each model a specific property
of idiomatic expressions by looking at their actual usage patterns in text. We use these sta-
tistical measures in a type-based classification task where we automatically separate idiomatic
expressions (expressions with a possible idiomatic interpretation) from similar-on-the-surface
literal phrases (for which no idiomatic interpretation is possible). Furthermore, we use some of
the measures in a token identification task where we distinguish idiomatic and literal usages of
potentially idiomatic expressions in context.
1. Introduction
Idioms form a heterogeneous class, with prototypical examples such as by and large, kick
the bucket, and let the cat out of the bag. It is hard to find a single agreed-upon definition
that covers all members of this class (Glucksberg 1993; Cacciari 1993; Nunberg, Sag,
and Wasow 1994), but they are often defined as sequences of words involving some de-
gree of semantic idiosyncrasy or non-compositionality. That is, an idiom has a different
∗ Department of Computer Science, University of Toronto, 6 King’s College Rd., Toronto, ON M5S 3G4,
Canada. E-mail: afsaneh@cs.toronto.edu.
∗∗ Department of Computer Science, University of Toronto, 6 King’s College Rd., Toronto, ON M5S 3G4,
Canada. E-mail: pcook@cs.toronto.edu.
† Department of Computer Science, University of Toronto, 6 King’s College Rd., Toronto, ON M5S 3G4,
Canada. E-mail: suzanne@cs.toronto.edu.
Submission received: 12 September 2007; revised submission received: 29 February 2008; accepted for
publication: 6 May 2008.
© 2009 Association for Computational Linguistics
meaning from the simple composition of the meaning of its component words. Idioms
are widely and creatively used by speakers of a language to express ideas cleverly, eco-
nomically, or implicitly, and thus appear in all languages and in all text genres (Sag et al.
2002). Many expressions acquire an idiomatic meaning over time (Cacciari 1993); conse-
quently, new idioms come into existence on a daily basis (Cowie, Mackin, and McCaig
1983; Seaton and Macaulay 2002). Automatic tools are therefore necessary for assisting
lexicographers in keeping lexical resources up to date, as well as for creating and ex-
tending computational lexicons for use in natural language processing (NLP) systems.
Though completely frozen idioms, such as by and large, can be represented as
words with spaces (Sag et al. 2002), most idioms are syntactically well-formed phrases
that allow some variability in expression, such as shoot the breeze and hold fire (Gibbs
and Nayak 1989; d’Arcais 1993; Fellbaum 2007). Such idioms allow a varying degree
of morphosyntactic flexibility—for example, held fire and hold one’s fire allow for an
idiomatic reading, whereas typically only a literal interpretation is available for fire was
held and held fires. Clearly, a words-with-spaces approach does not work for phrasal
idioms. Hence, in addition to requiring NLP tools for recognizing idiomatic expressions
(types) to include in a lexicon, methods for determining the allowable and preferred
usages (a.k.a. canonical forms) of such expressions are also needed. Moreover, in many
situations, an NLP system will need to distinguish a usage (token) of a potentially
idiomatic expression as either idiomatic or literal in order to handle a given sequence of
words appropriately. For example, a machine translation system must translate held fire
differently in The army held their fire and The worshippers held the fire up to the idol.
Previous studies focusing on the automatic identification of idiom types have often
recognized the importance of drawing on their linguistic properties, such as their se-
mantic idiosyncrasy or their restricted flexibility, pointed out earlier. Some researchers
have relied on a manual encoding of idiom-specific knowledge in a lexicon (Copestake
et al. 2002; Odijk 2004; Villavicencio et al. 2004), whereas others have presented ap-
proaches for the automatic acquisition of more general (hence less distinctive) knowl-
edge from corpora (Smadja 1993; McCarthy, Keller, and Carroll 2003). Recent work
that looks into the acquisition of the distinctive properties of idioms has been limited,
both in scope and in the evaluation of the methods proposed (Lin 1999; Evert, Heid,
and Spranger 2004). Our goal is to develop unsupervised means for the automatic
acquisition of lexical, syntactic, and semantic knowledge about a broadly documented
class of idiomatic expressions.
Specifically, we focus on a cross-linguistically prominent class of phrasal idioms
which are commonly and productively formed from the combination of a frequent verb
and a noun in its direct object position (Cowie, Mackin, and McCaig 1983; Nunberg,
Sag, and Wasow 1994; Fellbaum 2002), for example, shoot the breeze, make a face, and
push one’s luck. We refer to these as verb+noun idiomatic combinations or VNICs.1
We present a comprehensive analysis of the distinctive linguistic properties of phrasal
idioms, including VNICs (Section 2), and propose statistical measures that capture each
property (Section 3). We provide a multi-faceted evaluation of the measures (Section 4),
showing their effectiveness in the recognition of idiomatic expressions (types)—that is,
separating them from similar-on-the-surface literal phrases—as well as their superiority
to existing state-of-the-art techniques. Drawing on these statistical measures, we also
propose an unsupervised method for the automatic acquisition of an idiom’s canonical
1 We use the abbreviation VNIC and the term expression to refer to a verb+noun type with a potential
idiomatic meaning. We use the terms instance and usage to refer to a token occurrence of an expression.
forms (e.g., shoot the breeze as opposed to shoot a breeze), and show that it can successfully
accomplish the task (Section 5).
It is possible for a single VNIC to have both idiomatic and non-idiomatic (literal)
meanings. For example, make a face is ambiguous between an idiom, as in The little girl
made a funny face at her mother, and a literal combination, as in She made a face on the
snowman using a carrot and two buttons. Despite the common perception that phrases
that can be idioms are mainly used in their idiomatic sense, our analysis of 60 idioms
has shown otherwise. We found that close to half of these also have a clear literal
meaning; and of those with a literal meaning, on average around 40% of their usages
are literal. Distinguishing token phrases as idiomatic or literal combinations of words is
thus essential for NLP tasks, such as semantic parsing and machine translation, which
require the identification of multiword semantic units.
Most recent studies focusing on the identification of idiomatic and non-idiomatic
tokens either assume the existence of manually annotated data for a supervised clas-
sification (Patrick and Fletcher 2005; Katz and Giesbrecht 2006), or rely on manually
encoded linguistic knowledge about idioms (Uchiyama, Baldwin, and Ishizaki 2005;
Hashimoto, Sato, and Utsuro 2006), or even ignore the specific properties of non-
literal language and rely mainly on general purpose methods for the task (Birke and
Sarkar 2006). We propose unsupervised methods that rely on automatically acquired
knowledge about idiom types to identify their token occurrences as idiomatic or literal
(Section 6). More specifically, we explore the hypothesis that the type-based knowledge
we automatically acquire about an idiomatic expression can be used to determine
whether an instance of the expression is used literally or idiomatically (token-based
knowledge). Our experimental results show that the performance of the token-based
idiom identification methods proposed here is comparable to that of existing supervised
techniques (Section 7).
2. Idiomaticity, Semantic Analyzability, and Flexibility
Although syntactically well-formed, phrasal idioms (including VNICs) involve a certain
degree of semantic idiosyncrasy. This means that phrasal idioms are to some extent
nontransparent; that is, even knowing the meaning of the individual component words,
the meaning of the idiom is hard to determine without special context or previous ex-
posure. There is much evidence in the linguistics literature that idiomatic combinations
also have idiosyncratic lexical and syntactic behavior. Here, we first define semantic
analyzability and elaborate on its relation to semantic idiosyncrasy or idiomaticity. We
then expound on the lexical and syntactic behavior of VNICs, pointing out a suggestive
relation between the degree of idiomaticity of a VNIC and the degree of its lexicosyn-
tactic flexibility.
2.1 Semantic Analyzability
Idioms have been traditionally believed to be completely non-compositional (Fraser
1970; Katz 1973). This means that unlike compositional combinations, the meaning
of an idiom cannot be solely predicted from the meaning of its parts. However,
many linguists and psycholinguists argue against such a view, providing evidence
from idioms that show some degree of semantic compositionality (Nunberg, Sag, and
Wasow 1994; Gibbs 1995). The alternative view suggests that many idioms in fact do
have internal semantic structure, while recognizing that they are not compositional in a
simplistic or traditional sense. To explain the semantic behavior of idioms, researchers
who take this alternative view thus use new terms such as semantic decomposability
and/or semantic analyzability in place of compositionality.
To say that an idiom is semantically analyzable to some extent means that the
constituents contribute some sort of independent meaning—not necessarily their literal
semantics—to the overall idiomatic interpretation. Generally, the more semantically
analyzable an idiom is, the easier it is to map the idiom constituents onto their cor-
responding idiomatic referents. En otras palabras, the more semantically analyzable an
idiom is, the easier it is to make predictions about the idiomatic meaning from the
meaning of the idiom parts. Semantic analyzability is thus inversely related to semantic
idiosyncrasy.
Many linguists and psycholinguists conclude that idioms clearly form a heteroge-
neous class, not all of them being truly non-compositional or unanalyzable (Abeillé
1995; Moon 1998; Grant 2005). Rather, semantic analyzability in idioms is a matter of
degree. For example, the meaning of shoot the breeze (“to chat idly”), a highly idiomatic
expression, has nothing to do with either shoot or breeze. A less idiomatic expression,
such as spill the beans (“to reveal a secret”), may be analyzed as spill metaphorically
corresponding to “reveal” and beans referring to “secret(s).” An idiom such as pop the
question is even less idiomatic because the relations between the idiom parts and their
idiomatic referents are more directly established, namely, pop corresponds to “suddenly
ask” and question refers to “marriage proposal.” As we will explain in the following
section, there is evidence that the difference in the degree of semantic analyzability of
idiomatic expressions is also reflected in their lexical and syntactic behavior.
2.2 Lexical and Syntactic Flexibility
Most idioms are known to be lexically fixed, meaning that the substitution of a near syn-
onym (or a closely related word) for a constituent part does not preserve the idiomatic
meaning of the expression. For example, neither shoot the wind nor hit the breeze are valid
variations of the idiom shoot the breeze. Similarly, spill the beans has an idiomatic meaning,
while spill the peas and spread the beans have only literal interpretations. There are, how-
ever, idiomatic expressions that have one (or more) lexical variants. For example, blow
one’s own trumpet and toot one’s own horn have the same idiomatic interpretation (Cowie,
Mackin, and McCaig 1983); also keep one’s cool and lose one’s cool have closely related
meanings (Nunberg, Sag, and Wasow 1994). However, it is not the norm for idioms
to have lexical variants; when they do, there are usually unpredictable restrictions on
the substitutions they allow.
Idiomatic combinations are also syntactically distinct from compositional combi-
nations. Many VNICs cannot undergo syntactic variations and at the same time retain
their idiomatic interpretations. It is important, however, to note that VNICs differ with
respect to the extent to which they can tolerate syntactic operations, that is, the degree
of syntactic flexibility they exhibit. Some are syntactically inflexible for the most part,
whereas others are more versatile, as illustrated in the sentences in Examples (1) y (2):
1. (a) Sam and Azin shot the breeze.
   (b) ?? Sam and Azin shot a breeze.
   (c) ?? Sam and Azin shot the breezes.
   (d) ?? Sam and Azin shot the casual breeze.
   (e) ?? The breeze was shot by Sam and Azin.
   (f) ?? The breeze that Sam and Azin shot was quite refreshing.
   (g) ?? Which breeze did Sam and Azin shoot?
2. (a) Azin spilled the beans.
   (b) ? Azin spilled some beans.
   (c) ?? Azin spilled the bean.
   (d) Azin spilled the Enron beans.
   (e) The beans were spilled by Azin.
   (f) The beans that Azin spilled caused Sam a lot of trouble.
   (g) Which beans did Azin spill?
Linguists have often explained the lexical and syntactic flexibility of idiomatic
combinations in terms of their semantic analyzability (Fellbaum 1993; Gibbs 1993;
Glucksberg 1993; Nunberg, Sag, and Wasow 1994; Schenk 1995). The common belief
is that because the constituents of a semantically analyzable idiom can be mapped onto
their corresponding referents in the idiomatic interpretation, analyzable (less idiomatic)
expressions are often more open to lexical substitution and syntactic variation. Psy-
cholinguistic studies also support this hypothesis: Gibbs and Nayak (1989) and Gibbs
et al. (1989), through a series of psychological experiments, demonstrate that there is
variation in the degree of lexicosyntactic flexibility of idiomatic combinations. (Both
studies narrow their focus to verb phrase idiomatic combinations, mainly of the form
verb+noun.) Moreover, their findings provide evidence that the lexical and syntactic
flexibility of VNICs is not arbitrary, but rather correlates with the semantic analyzability
of these idioms as perceived by the speakers participating in the experiments.
Corpus-based studies such as those by Moon (1998), Riehemann (2001), and Grant
(2005) conclude that idioms are not as fixed as most have assumed. These claims are
often based on observing certain idiomatic combinations in a form other than their so-
called canonical forms. For example, Moon mentions that she has observed both kick
the pail and kick the can as variations of kick the bucket. Also, Grant finds evidence of
variations such as eat one’s heart (out) and eat one’s hearts (out) in the BNC. Riehemann
concludes that in contrast to non-idiomatic combinations of words, “idioms have a
strongly preferred canonical form, but at the same time the occurrence of lexical and
syntactic variations of idioms is too common to be ignored” (page 67). Our understand-
ing of such findings is that idiomatic combinations are not inherently frozen and that it
is possible for them to appear in forms other than their agreed-upon canonical forms.
However, it is important to note that most such observed variations are constrained,
often with unpredictable restrictions.
We are well aware that semantic analyzability is neither a necessary nor a sufficient
condition for an idiomatic combination to be lexically or syntactically flexible. Other
factors, such as communicative intentions and pragmatic constraints, can motivate a
speaker to use a variant in place of a canonical form (Glucksberg 1993). For exam-
ple, journalism is well known for manipulating idiomatic expressions for humor or
cleverness (Grant 2005). The age and the degree of familiarity of an idiom have also
been shown to be important factors that affect its flexibility (Gibbs and Nayak 1989).
Nonetheless, linguists often use observations about lexical and syntactic flexibility of
VNICs in order to make judgments about their degree of idiomaticity (Kytö 1999;
Tanabe 1999). We thus conclude that the lexicosyntactic behavior of a VNIC, although
affected by historical and pragmatic factors, can be at least partially explained in terms
of semantic analyzability or idiomaticity.
3. Automatic Acquisition of Type-Based Knowledge about VNICs
We use the observed connection between idiomaticity and (in)flexibility to devise sta-
tistical measures for automatically distinguishing idiomatic verb+noun combinations
(types) from literal phrases. More specifically, we aim to identify verb–noun pairs such
as ⟨keep, word⟩ as having an associated idiomatic expression (keep one’s word), and
also distinguish these from verb–noun pairs such as ⟨keep, fish⟩ which do not have
an idiomatic interpretation. Although VNICs vary in their degree of flexibility (cf.
Examples (1) y (2)), on the whole they contrast with fully compositional phrases,
which are more lexically productive and appear in a wider range of syntactic forms. We
thus propose to use the degree of lexical and syntactic flexibility of a given verb+noun
combination to determine the level of idiomaticity of the expression.
Note that our assumption here is in line with corpus-linguistic studies on idioms:
we do not claim that it is inherently impossible for VNICs to undergo lexical sub-
stitution or syntactic variation. In fact, for each given idiomatic combination, it may
well be possible to find a specific situation in which a lexical or a syntactic variant of
the canonical form is perfectly plausible. However, the main point of the assumption
here is that VNICs are more likely to appear in fixed forms (known as their canonical
forms), more so than non-idiomatic phrases. Therefore, the overall distribution of a
VNIC in different lexical and syntactic forms is expected to be notably different from
the corresponding distribution of a typical verb+noun combination.
The following subsections describe our proposed statistical measures for idiomatic-
ity, which quantify the degree of lexical, syntactic, and overall fixedness of a given
verb+noun combination (represented as a verb–noun pair).
3.1 Measuring Lexical Fixedness
A VNIC is lexically fixed if the replacement of any of its constituents by a semantically
(and syntactically) similar word does not generally result in another VNIC, but in
an invalid or a literal expression. One way of measuring lexical fixedness of a given
verb+noun combination is thus to examine the idiomaticity of its variants, that is,
expressions generated by replacing one of the constituents by a similar word. This
approach has two main challenges: (i) it requires prior knowledge about the idiomaticity
of expressions (which is what we are developing our measure to determine); (ii) it can
only measure the lexical fixedness of idiomatic combinations, and so could not apply to
literal combinations. We thus interpret this property statistically in the following way:
We expect a lexically fixed verb+noun combination to appear much more frequently
than its variants in general.
Specifically, we examine the strength of association between the verb and the
noun constituent of a combination (the target expression or its lexical variants) as
an indirect cue to its idiomaticity, an approach inspired by Lin (1999). We use the
automatically built thesaurus of Lin (1998) to find words similar to each constituent,
in order to automatically generate variants.2 Variants are generated by replacing either
2 We also replicated our experiments with an automatically built thesaurus created from the British
National Corpus (BNC) in a similar fashion, and kindly provided to us by Diana McCarthy. Results
were similar, hence we do not report them here.
the noun or the verb constituent of a pair with a semantically (and syntactically) similar
word.3
Examples of automatically generated variants for the pair ⟨spill, bean⟩ are ⟨pour,
bean⟩, ⟨stream, bean⟩, ⟨spill, corn⟩, and ⟨spill, rice⟩.
Let Ssim(v) = {vi | 1 ≤ i ≤ Kv} be the set of the Kv most similar verbs to the verb v
of the target pair ⟨v, n⟩, and Ssim(n) = {nj | 1 ≤ j ≤ Kn} be the set of the Kn most similar
nouns to the noun n (according to Lin’s thesaurus). The set of variants for the target pair
is thus:
Ssim(v, norte) = {(cid:3)vi, norte(cid:4)| 1 ≤ i ≤ Kv} ∪
(cid:1)
(cid:3)v, nj(cid:4)| 1 ≤ j ≤ Kn
(cid:2)
.
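To make the variant-generation step concrete, the following is a minimal Python sketch (not the authors' code) that builds Ssim(v, n) from precomputed thesaurus neighbor lists; the neighbor lists and the example pair are hypothetical placeholders, standing in for the output of Lin's (1998) thesaurus.

```python
# Minimal sketch of variant generation for a target verb-noun pair.
# The thesaurus neighbor lists below are hypothetical placeholders; in the
# article they come from Lin's (1998) automatically built thesaurus.

def variant_set(verb, noun, similar_verbs, similar_nouns, k_v, k_n):
    """Return S_sim(v, n): pairs formed by swapping one constituent at a time."""
    verb_variants = {(v_i, noun) for v_i in similar_verbs[:k_v]}
    noun_variants = {(verb, n_j) for n_j in similar_nouns[:k_n]}
    return verb_variants | noun_variants

if __name__ == "__main__":
    sim_verbs = ["pour", "stream", "drip", "splash"]   # hypothetical neighbors of "spill"
    sim_nouns = ["corn", "rice", "pea", "lentil"]      # hypothetical neighbors of "bean"
    variants = variant_set("spill", "bean", sim_verbs, sim_nouns, k_v=2, k_n=2)
    print(sorted(variants))
    # e.g. [('pour', 'bean'), ('spill', 'corn'), ('spill', 'rice'), ('stream', 'bean')]
```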
We calculate the association strength for the target pair and for each of its variants
using an information-theoretic measure called pointwise mutual information or PMI
(Church et al. 1991):
$$\mathrm{PMI}(v_r, n_t) = \log \frac{P(v_r, n_t)}{P(v_r)\,P(n_t)} = \log \frac{N_{v+n}\, f(v_r, n_t)}{f(v_r, *)\, f(*, n_t)} \qquad (1)$$
dónde (cid:3)vr, nt(cid:4) ∈ {(cid:3)v, norte(cid:4)} ∪ Ssim(v, norte); Nv+n is the total number of verb–object pairs in the
cuerpo; F (vr, nt) is the frequency of vr and nt co-occurring as a verb–object pair; F (vr, ∗)
is the total frequency of the target (transitive) verb with any noun as its direct object;
yf (∗, nt) is the total frequency of the noun nt in the direct object position of any verb
in the corpus.
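As an illustration of Equation (1), here is a small Python sketch (not taken from the article) that computes PMI for a verb–object pair from a table of verb–object co-occurrence counts; the toy counts are invented for illustration.

```python
import math
from collections import Counter

def pmi(verb, noun, pair_counts):
    """PMI of a verb-object pair, following Equation (1):
    log( N * f(v, n) / (f(v, *) * f(*, n)) )."""
    n_total = sum(pair_counts.values())                               # N_{v+n}
    f_vn = pair_counts[(verb, noun)]                                  # f(v, n)
    f_v = sum(c for (v, _), c in pair_counts.items() if v == verb)    # f(v, *)
    f_n = sum(c for (_, n), c in pair_counts.items() if n == noun)    # f(*, n)
    return math.log(n_total * f_vn / (f_v * f_n))

if __name__ == "__main__":
    # Invented verb-object counts, standing in for counts extracted from a corpus.
    counts = Counter({("shoot", "breeze"): 30, ("shoot", "bird"): 12,
                      ("feel", "breeze"): 8, ("keep", "word"): 25,
                      ("keep", "fish"): 3})
    print(round(pmi("shoot", "breeze"), 3))
```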
In his work, Lin (1999) assumes that a target expression is non-compositional if and
only if its PMI value is significantly different from that of all the variants. Instead, we
propose a novel technique that brings together the association strengths (PMI values)
of the target and the variant expressions into a single measure reflecting the degree of
lexical fixedness for the target pair. We assume that the target pair is lexically fixed to
the extent that its PMI deviates from the average PMI of its variants. By our measure, the
target pair is considered lexically fixed (i.e., is given a high fixedness score) only if the
difference between its PMI value and that of most of its variants—not necessarily all, as
in the method of Lin (1999)—is high.4 Our measure calculates this deviation, normalized
using the sample’s standard deviation:
$$\mathrm{Fixedness}_{\mathrm{lex}}(v, n) \doteq \frac{\mathrm{PMI}(v, n) - \overline{\mathrm{PMI}}}{s} \qquad (2)$$
3 In an early version of this work (Fazly and Stevenson 2006), only the noun constituent was varied
because we expected replacing the verb constituent with a related verb to be more likely to yield another
VNIC, as in keep/lose one’s cool, give/get the bird, crack/break the ice (Nunberg, Sag, and Wasow 1994; Grant
2005). Later experiments on the development data showed that variants generated by replacing both
constituents, one at a time, produce better results.
4 This way, even if an idiom has a few frequently used variants (e.g., break the ice and crack the ice), it may
still be assigned a high fixedness score if most other variants are uncommon. Note also that it is possible
that some variants of a given idiom are frequently used literal expressions (e.g., make biscuit for take
biscuit). It is thus important to use a flexible formulation that relies on the collective evidence (e.g.,
average PMI) and hence is less sensitive to individual cases.
where $\overline{\mathrm{PMI}}$ is the mean and $s$ the standard deviation of the following sample:
$$\{\, \mathrm{PMI}(v_r, n_t) \mid \langle v_r, n_t \rangle \in \{\langle v, n \rangle\} \cup S_{\mathrm{sim}}(v, n) \,\}$$
PMI can be negative, zero, or positive; thus Fixednesslex(v, norte) ∈ [−∞, +∞], where high
positive values indicate higher degrees of lexical fixedness.
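The following Python sketch illustrates Equation (2) under the assumption that the PMI of the target pair and of its variants has already been computed (for example, with a function like the pmi() sketch above). Whether the standard deviation uses Bessel's correction is not specified in the article; the sample standard deviation used here is an assumption.

```python
import statistics

def fixedness_lex(target_pmi, variant_pmis):
    """Equation (2): deviation of the target pair's PMI from the mean PMI of the
    sample consisting of the target and its lexical variants, normalized by the
    sample's standard deviation."""
    sample = [target_pmi] + list(variant_pmis)   # PMI values of target plus variants
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)                # sample (n-1) std dev: our assumption
    return (target_pmi - mean) / sd

if __name__ == "__main__":
    # Hypothetical PMI values: a high-association target and weakly associated variants.
    target = 4.2
    variants = [0.3, -0.5, 0.8, 0.1, -0.2]
    print(round(fixedness_lex(target, variants), 3))  # large positive => lexically fixed
```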
3.2 Measuring Syntactic Fixedness
Compared to literal (non-idiomatic) verb+noun combinations, VNICs are expected to
appear in more restricted syntactic forms. To quantify the syntactic fixedness of a target
verb–noun pair, we thus need to: (i) identify relevant syntactic patterns, namely, those
that help distinguish VNICs from literal verb+noun combinations; and (ii) translate the
frequency distribution of the target pair in the identified patterns into a measure of
syntactic fixedness.
3.2.1 Identifying Relevant Patterns. Determining a unique set of syntactic patterns appro-
priate for the recognition of all idiomatic combinations is difficult indeed: Exactly which
forms an idiomatic combination can occur in is not entirely predictable (Sag et al. 2002).
However, there are hypotheses about the difference in behavior of VNICs and literal
verb+noun combinations with respect to particular syntactic variations (Nunberg, Sag,
and Wasow 1994). Linguists note that semantic analyzability of VNICs is related to
the referential status of the noun constituent (es decir., the process of idiomatization of a
verb+noun combination is believed to be accompanied by a change from concreteness
to abstractness for the noun). The referential status of the noun is in turn assumed to
be related to the participation of the combination in certain morpho-syntactic forms.
In what follows, we describe three types of syntactic variation that are assumed to be
mostly tolerated by literal combinations, but less tolerated by many VNICs.
Passivization. There is much evidence in the linguistics literature that VNICs often do
not undergo passivization. Linguists mainly attribute this to the fact that in most cases,
only referential nouns appear as the surface subject of a passive construction (Gibbs
and Nayak 1989). Due to the non-referential status of the noun constituent in most
VNICs, we expect that they do not undergo passivization as often as literal verb+noun
combinations do. Another explanation for this assumption is that passives are mainly
used to put focus on the object of a clause or sentence. For most VNICs, no such
communicative purpose can be served by topicalizing the noun constituent through
passivization (Jackendoff 1997). The passive construction is thus considered as one of
the syntactic patterns relevant to measuring syntactic flexibility.5
Determiner type. A strong correlation has been observed between the flexibility of the
determiner preceding the noun in a verb+noun combination and the overall flexibility
of the phrase (Fellbaum 1993; Kearns 2002; Desbiens and Simon 2003). It is however
5 Note that there are idioms that appear primarily in a passivized form, for example, the die is cast (“the
decision is made and will not change”). Our measure can in principle recognize such idioms because we
do not require that an idiom appears mainly in active form; rather, we include voice (passive or active) as
an important part of the syntactic pattern of an idiomatic combination.
important to note that the nature of the determiner is also affected by other factors,
such as the semantic properties of the noun. For this reason, determiner flexibility is
sometimes argued not to be a good predictor of the overall syntactic flexibility of an ex-
pression. Nevertheless, many researchers consider it an important part of the process
of idiomatization of a verb+noun combination (Akimoto 1999; Kytö 1999; Tanabe 1999).
We thus expect a VNIC to mainly appear with one type of determiner.
Pluralization. Although the verb constituent of a VNIC is morphologically flexible, the
morphological flexibility of the noun relates to its referential status (Grant 2005). Again,
one should note that the use of a singular or plural noun in a VNIC may also be affected
by the semantic properties of the noun. Recall that during the idiomatization process,
the noun constituent may become more abstract in meaning. In this process, the noun
may lose some of its nominal features, including number (Akimoto 1999). The non-
referential noun constituent of a VNIC is thus expected to mainly appear in just one of
the singular or plural forms.
Merging the three types of variation results in a pattern set, P, of 11 distinct syntac-
tic patterns that are displayed in Table 1 along with examples for each pattern. When
developing this set of patterns, we have taken into account the linguistic theories about
the syntactic constraints on idiomatic expressions; for example, our choice of patterns
is consistent with the idiom typology developed by Nicolas (1995). Note that we merge
some of the individual patterns into one; for example, we include only one passive
pattern independently of the choice of the determiner or the number of the noun. The
motivation here is to merge low frequency patterns (i.e., those that are expected to
be less common) in order to acquire more reliable evidence on the distribution of a
particular verb–noun pair over the resulting pattern set. In principle, however, the set
can be expanded to include more patterns; it can also be modified to contain different
patterns for different classes of idiomatic combinations.
3.2.2 Devising a Statistical Measure. The second step is to devise a statistical measure
that quantifies the degree of syntactic fixedness of a verb–noun pair, with respect to the
selected set of patterns, P.
Table 1
Patterns used in the syntactic fixedness measure, along with examples for each. A pattern
signature is composed of a verb v in active (vact) or passive (vpass) voice; a determiner (det) that
can be NULL, indefinite (a/an), definite (the), demonstrative (DEM), or possessive (POSS); and a
noun n that can be singular (nsg) or plural (npl).

Pattern No.   Pattern Signature                Example
1             vact  det:NULL   nsg             give money
2             vact  det:a/an   nsg             give a book
3             vact  det:the    nsg             give the book
4             vact  det:DEM    nsg             give this book
5             vact  det:POSS   nsg             give my book
6             vact  det:NULL   npl             give books
7             vact  det:the    npl             give the books
8             vact  det:DEM    npl             give those books
9             vact  det:POSS   npl             give my books
10            vact  det:OTHER  nsg,pl          give many books
11            vpass det:ANY    nsg,pl          a/the/this/my book/books was/were given
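To illustrate how an extracted verb–object instance might be mapped onto the patterns of Table 1, here is a hedged Python sketch. The input representation (voice, determiner class, noun number) is an assumption about what a parser-based extraction step would provide, and the fallback for combinations not listed in Table 1 is our choice; this is not the authors' implementation.

```python
def pattern_number(voice, det, noun_number):
    """Map a verb-object instance to one of the 11 patterns of Table 1.

    voice:       "active" or "passive"
    det:         "NULL", "a/an", "the", "DEM", "POSS", or "OTHER"
    noun_number: "sg" or "pl"
    """
    if voice == "passive":
        return 11                                    # v_pass det:ANY n_sg,pl
    if det == "OTHER":
        return 10                                    # v_act det:OTHER n_sg,pl
    singular = {"NULL": 1, "a/an": 2, "the": 3, "DEM": 4, "POSS": 5}
    plural = {"NULL": 6, "the": 7, "DEM": 8, "POSS": 9}
    table = singular if noun_number == "sg" else plural
    return table.get(det, 10)   # combinations not in Table 1 fall back to OTHER (assumption)

if __name__ == "__main__":
    print(pattern_number("active", "the", "sg"))     # 3: e.g., "shot the breeze"
    print(pattern_number("active", "a/an", "sg"))    # 2: e.g., "shot a breeze"
    print(pattern_number("passive", "the", "sg"))    # 11: e.g., "the breeze was shot"
```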
We propose a measure that compares the syntactic behavior of the target pair with
that of a “typical” verb–noun pair. Syntactic behavior of a typical pair is defined as the
prior probability distribution over the patterns in P. The maximum likelihood estimate
for the prior probability of an individual pattern pt ∈ P is calculated as
$$P(pt) = \frac{\sum_{v_i \in V} \sum_{n_j \in N} f(v_i, n_j, pt)}{\sum_{v_i \in V} \sum_{n_j \in N} \sum_{pt_k \in P} f(v_i, n_j, pt_k)} = \frac{f(*, *, pt)}{f(*, *, *)} \qquad (3)$$
where V is the set of all instances of transitive verbs in the corpus, and N is the set of all
instances of nouns appearing as the direct object of some verb.
The syntactic behavior of the target verb–noun pair ⟨v, n⟩ is defined as the posterior
probability distribution over the patterns, given the particular pair. The maximum like-
lihood estimate for the posterior probability of an individual pattern pt is calculated as
PAG(pt | v, norte) =
F (v, norte, pt)
(cid:3)
F (v, norte, ptk)
ptk∈P
=
F (v, norte, pt)
F (v, norte, ∗)
.
(4)
The degree of syntactic fixedness of the target verb–noun pair is estimated as
the divergence of its syntactic behavior (the posterior distribution over the patterns)
from the typical syntactic behavior (the prior distribution). The divergence of the two
probability distributions is calculated using a standard information-theoretic measure,
the Kullback–Leibler (KL) divergence (Cover and Thomas 1991):
$$\mathrm{Fixedness}_{\mathrm{syn}}(v, n) \doteq D\bigl(P(pt \mid v, n) \,\|\, P(pt)\bigr) = \sum_{pt_k \in P} P(pt_k \mid v, n)\, \log \frac{P(pt_k \mid v, n)}{P(pt_k)} \qquad (5)$$
KL-divergence has proven useful in many NLP applications (Resnik 1999; Dagan,
Pereira, and Lee 1994). KL-divergence is always non-negative and is zero if and only
if the two distributions are exactly the same. Thus, Fixednesssyn(v, n) ∈ [0, +∞], where
large values indicate higher degrees of syntactic fixedness.
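The following Python sketch puts Equations (3)–(5) together: it estimates the prior and posterior pattern distributions from pattern counts and computes their KL-divergence. The counts are invented, and the smoothing of zero counts is our assumption (the article does not specify how zero probabilities are handled).

```python
import math
from collections import Counter

PATTERNS = range(1, 12)          # the 11 syntactic patterns of Table 1

def distribution(counts, smoothing=1e-9):
    """Maximum likelihood estimate over the 11 patterns, with a tiny floor to
    avoid log(0); the smoothing constant is our assumption, not the article's."""
    total = sum(counts.get(pt, 0) + smoothing for pt in PATTERNS)
    return {pt: (counts.get(pt, 0) + smoothing) / total for pt in PATTERNS}

def fixedness_syn(pair_pattern_counts, corpus_pattern_counts):
    """Equation (5): KL-divergence of the pair's posterior pattern distribution
    (Equation (4)) from the corpus-wide prior (Equation (3))."""
    posterior = distribution(pair_pattern_counts)
    prior = distribution(corpus_pattern_counts)
    return sum(posterior[pt] * math.log(posterior[pt] / prior[pt]) for pt in PATTERNS)

if __name__ == "__main__":
    # Hypothetical counts: an idiom-like pair concentrated in pattern 3 ("v the n_sg")
    # versus corpus-wide counts spread over many patterns.
    idiom_pair = Counter({3: 95, 2: 3, 11: 2})
    corpus = Counter({1: 900, 2: 700, 3: 800, 4: 150, 5: 400,
                      6: 500, 7: 450, 8: 90, 9: 200, 10: 300, 11: 350})
    print(round(fixedness_syn(idiom_pair, corpus), 3))   # larger => more syntactically fixed
```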
3.3 A Unified Measure of Fixedness
VNICs are hypothesized to be, in most cases, both lexically and syntactically more fixed
than literal verb+noun combinations (see Section 2). We thus propose a new measure
of idiomaticity to be a measure of the overall fixedness of a given pair. We define
Fixednessoverall (v, norte) as a weighted combination of Fixednesslex and Fixednesssyn:
$$\mathrm{Fixedness}_{\mathrm{overall}}(v, n) \doteq \alpha\, \mathrm{Fixedness}_{\mathrm{syn}}(v, n) + (1 - \alpha)\, \mathrm{Fixedness}_{\mathrm{lex}}(v, n) \qquad (6)$$
where α weights the relative contribution of the measures in predicting idiomaticity.
Recall that Fixednesslex(v, n) ∈ [−∞, +∞], and Fixednesssyn(v, n) ∈ [0, +∞]. To
combine them in the overall fixedness measure, we rescale them so that they fall in
the range [0, 1]. Thus, Fixednessoverall(v, n) ∈ [0, 1], where values closer to 1 indicate a
higher degree of overall fixedness.
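A minimal Python sketch of Equation (6) follows. The article states that the two scores are rescaled to [0, 1] but does not specify how; the min–max rescaling over the set of candidate pairs used below is our assumption, as are the toy scores.

```python
def minmax_rescale(scores):
    """Rescale a list of scores to [0, 1]; this particular rescaling is an
    assumption, since the article only says the scores are rescaled to [0, 1]."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

def fixedness_overall(lex_scores, syn_scores, alpha=0.6):
    """Equation (6): weighted combination of the rescaled lexical and syntactic
    fixedness scores (alpha = 0.6 in the article's experiments)."""
    lex = minmax_rescale(lex_scores)
    syn = minmax_rescale(syn_scores)
    return [alpha * s + (1 - alpha) * l for l, s in zip(lex, syn)]

if __name__ == "__main__":
    # Hypothetical scores for three candidate verb-noun pairs.
    lex_scores = [3.1, 0.2, -0.7]     # Fixedness_lex values
    syn_scores = [2.4, 0.1, 0.05]     # Fixedness_syn values
    print([round(x, 3) for x in fixedness_overall(lex_scores, syn_scores)])
```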
4. VNIC Type Recognition: Evaluation
To evaluate our proposed fixedness measures, we analyze their appropriateness for
determining the degree of idiomaticity of a set of experimental expressions (in the form
of verb–noun pairs, extracted as described in Section 4.1). More specifically, we first use
each measure to assign scores to the experimental pairs. We then use the scores assigned
by each measure to perform two different tasks, and assess the overall goodness of the
measure by looking at its performance in both.
First, we look into the classification performance of each measure by using the
scores to separate idiomatic verb–noun pairs from literal ones in a mixed list. This is
done by setting a threshold, here the median score, where all pairs with scores higher
than the threshold are labeled as idiomatic and the rest as literal.6 For classification, we
report accuracy (Acc), as well as the relative error rate reduction (ERR) over a random
(chance) baseline, referred to as Rand. Second, we examine the retrieval performance
of our fixedness measures by using the scores to rank verb–noun pairs according to
their degree of idiomaticity. For retrieval, we present the precision–recall curves, as
well as the interpolated three-point average precision or IAP—that is, the average of
the interpolated precisions at the recall levels of 20%, 50%, and 80%. The interpolated
average precision and precision–recall curves are commonly used for the evaluation of
information retrieval systems (Manning and Schütze 1999), and reflect the goodness of
a measure in placing the relevant items (here, idioms) before the irrelevant ones (here,
literals).
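To make this evaluation protocol concrete, here is a hedged Python sketch (not the authors' evaluation code) that computes classification accuracy with a median threshold and the three-point interpolated average precision over a ranked list; the scored items are invented.

```python
import statistics

def accuracy_median_threshold(scored_items):
    """scored_items: list of (score, is_idiom). Items scoring above the median
    are labeled idiomatic, the rest literal (as in the classification task)."""
    threshold = statistics.median(score for score, _ in scored_items)
    correct = sum((score > threshold) == is_idiom for score, is_idiom in scored_items)
    return correct / len(scored_items)

def interpolated_average_precision(scored_items, recall_levels=(0.2, 0.5, 0.8)):
    """Three-point IAP: average of interpolated precision at 20%, 50%, 80% recall."""
    ranked = sorted(scored_items, key=lambda x: x[0], reverse=True)
    total_pos = sum(1 for _, is_idiom in ranked if is_idiom)
    points = []   # (recall, precision) after each position in the ranking
    tp = 0
    for i, (_, is_idiom) in enumerate(ranked, start=1):
        tp += is_idiom
        points.append((tp / total_pos, tp / i))
    # Interpolated precision at recall r = max precision at any recall >= r.
    interp = [max(p for rec, p in points if rec >= r) for r in recall_levels]
    return sum(interp) / len(interp)

if __name__ == "__main__":
    # (score, is_idiom) for a small hypothetical mixed list of pairs.
    items = [(4.1, True), (3.2, True), (2.9, False), (2.5, True),
             (1.8, False), (1.2, True), (0.9, False), (0.4, False)]
    print(round(accuracy_median_threshold(items), 3))
    print(round(interpolated_average_precision(items), 3))
```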
Idioms are often assumed to exhibit collocational behavior to some extent, that is,
the components of an idiom are expected to appear together more often than expected
by chance. Hence, some NLP systems have used collocational measures to identify them
(Smadja 1993; Evert and Krenn 2001). However, as discussed in Section 2, idioms have
distinctive syntactic and semantic properties that separate them from simple colloca-
tions. For example, although collocations involve some degree of semantic idiosyncrasy
(strong tea vs. ?powerful tea), compared to idioms, they typically have a more transparent
meaning, and their syntactic behavior is more similar to that of literal expressions. We
thus expect our fixedness measures that draw on the distinctive linguistic properties
of idioms to be more appropriate than measures of collocation for the identification of
idioms. To verify this hypothesis, in both the classification and retrieval tasks, we com-
pare the performance of the fixedness measures with that of two collocation extraction
measures: an informed baseline, PMI, and a position-based fixedness measure proposed
6 We adopt the median for this particular (balanced) data set, understanding that in practice a suitable
threshold would need to be determined, e.g., based on development data.
by Smadja (1993), which we refer to as Smadja. Next, we provide more details on PMI
and Smadja.
PMI is a widely used measure for extracting statistically significant combinations
of words or collocations. It has also been used for the recognition of idioms (Evert and
Krenn 2001), warranting its use as an informed baseline here for comparison.7 As in
Ecuación (1), our calculation of PMI here restricts the counts of the verb–noun pair to
the direct object relation. Smadja (1993) proposes a collocation extraction method which
measures the fixedness of a word sequence (p.ej., a verb–noun pair) by examining the
relative position of the component words across their occurrences together. We replicate
Smadja’s method, where we measure fixedness of a target verb–noun pair as the spread
(variance) of the co-occurrence frequency of the verb and the noun over 10 relative
positions within a five-word window.8
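For comparison, here is a minimal Python sketch of the position-based measure as described in this paragraph: the spread (variance) of a pair's co-occurrence frequency over the 10 relative positions within a five-word window. Counting relative positions directly over a flat token stream, and ignoring part-of-speech tags, are simplifications of our own rather than a faithful reimplementation of Smadja (1993).

```python
import statistics

RELATIVE_POSITIONS = [-5, -4, -3, -2, -1, 1, 2, 3, 4, 5]   # five-word window, both sides

def position_spread(tokens, verb, noun):
    """Variance of the verb-noun co-occurrence frequency over the 10 relative
    positions (noun position minus verb position); a high spread means the two
    words tend to co-occur at one fixed distance."""
    freq = {d: 0 for d in RELATIVE_POSITIONS}
    verb_positions = [i for i, t in enumerate(tokens) if t == verb]
    noun_positions = [i for i, t in enumerate(tokens) if t == noun]
    for i in verb_positions:
        for j in noun_positions:
            d = j - i
            if d in freq:
                freq[d] += 1
    return statistics.pvariance(list(freq.values()))

if __name__ == "__main__":
    # Toy token stream in which "breeze" tends to follow "shoot" at distance 2.
    text = "they shoot the breeze and we shoot the breeze again".split()
    print(position_spread(text, "shoot", "breeze"))
```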
Recall from Section 3.1 that our Fixednesslex measure is intended as an improve-
ment over the non-compositionality measure of Lin (1999). For the sake of completeness,
we also compare the classification performance of our Fixednesslex with that of Lin’s
(1999) measure, which we refer to as Lin.9
We first elaborate on the methodological aspects of our experiments in Section 4.1,
and then present a discussion of the experimental results in Section 4.2.
4.1 Experimental Setup
4.1.1 Corpus and Data Extraction. We use the British National Corpus (BNC; Burnard
2000) to extract verb–noun pairs, along with information on the syntactic patterns they
appear in. We automatically parse the BNC using the Collins parser (Collins 1999), and
augment it with information about verb and noun lemmas, automatically generated
using WordNet (Fellbaum 1998). We further process the corpus using TGrep2 (Rohde
2004) in order to extract syntactic dependencies. For each instance of a transitive verb,
we use heuristics to extract the noun phrase (NP) in either the direct object position
(if the sentence is active), or the subject position (if the sentence is passive). We then
automatically find the head noun of the extracted NP, its number (singular or plural),
and the determiner introducing it.
4.1.2 Experimental Expressions. We select our development and test expressions from
verb–noun pairs that involve a member of a predefined list of transitive verbs, referred
to as basic verbs. Basic verbs, in their literal use, refer to states or acts that are central
to human experience. They are thus frequent, highly polysemous, and tend to combine
with other words to form idiomatic combinations (Cacciari 1993; Claridge 2000; Gentner
and France 2004). An initial list of such verbs was selected from several linguistic and
psycholinguistic studies on basic vocabulary (Ogden 1968; Clark 1978; Nunberg, Sag,
and Wasow 1994; Goldberg 1995; Pauwels 2000; Claridge 2000; Newman and Rice 2004).
We further augmented this initial list with verbs that are semantically related to another
7 PMI has been shown to perform better than or comparable to many other association measures (Inkpen
2003; Mohammad and Hirst, submitted). In our experiments, we also found that PMI consistently
performs better than two other association measures, the Dice coefficient and the log-likelihood measure.
Experiments by Krenn and Evert (2001) showed contradicting results for PMI; however, these
experiments were performed on small-sized corpora, and on data which contained items with very low
frequency.
8 We implement the method as explained in Smadja (1993), taking into account the part-of-speech tags of
the target component words.
9 We implement the method as explained in Lin (1999), using 95% confidence intervals. We thus need to
ignore variants with frequency lower than 4 for which no confidence interval can be formed.
verb already in the list; for example, lose is added in analogy with find. Here is the final
list of the 28 verbs in alphabetical order:
blow, bring, catch, cut, find, get, give, have, hear, hit, hold, keep, kick, lay, lose, make, move,
place, pull, push, put, see, set, shoot, smell, take, throw, touch
From the corpus, we extract all the verb–noun pairs (lemmas) that contain any
of these listed basic verbs, and that appear at least 10 times in the corpus in a direct
object relation (irrespective of any intervening determiners or adjectives). From these,
we select a subset that are idiomatic, and another subset that are literal, as follows: A
verb–noun pair is considered idiomatic if it appears in an idiom listed in a credible
dictionary such as the Oxford Dictionary of Current Idiomatic English (ODCIE; Cowie,
Mackin, and McCaig 1983), or the Collins COBUILD Idioms Dictionary (CCID; Seaton
and Macaulay 2002).10 To decide whether a verb–noun pair has appeared in an idiom,
we look for all idioms containing the verb and the noun in a direct-object relation,
irrespective of any intervening determiners or adjectives, and/or any other arguments.
The pair is considered literal if it involves a physical act or state (i.e., the basic semantics
of the verb) and does not appear in any of the mentioned dictionaries as an idiom (o
part of an idiom). From the set of idiomatic pairs, we then randomly pull out 80 de-
velopment pairs and 100 test pairs, ensuring that we have items of both low and high
frequency. We then double the size of each data set (development and test) by adding
equal numbers of literal pairs, with similar frequency distributions. Some of the idioms
corresponding to the experimental idiomatic pairs are: kick the habit, move mountains, lose
face, and keep one’s word. Examples of literal pairs include: move carriage, lose ticket, and
keep fish.
Development expressions are used in devising the fixedness measures, as well as
in determining the values of their parameters as explained in the next subsection. Test
expressions are saved as unseen data for the final evaluation.
4.1.3 Parameter Settings. Our lexical fixedness measure in Equation (2) involves two
parámetros, Kv and Kn, which determine the number of lexical variants considered in
measuring the lexical fixedness of a given verb–noun pair. We make the least-biased
assumption on the proportion of variants generated by replacing the verb (Kv) y
those generated by replacing the noun (Kn)—that is, we assume Kv = Kn.11 We perform
experiments on the development data, where we set the total number of variants (i.e.,
Kv + Kn) from 10 to 100 by steps of 10. (For simplicity, we refer to the total number
of variants as K.) Figure 1(a) shows the change in performance of Fixednesslex as a
function of K. Recall that Acc is the classification accuracy, and IAP reflects the average
precision of a measure in ranking idiomatic pairs before non-idiomatic ones.
10 Our development data also contains items from several other dictionaries, such as Chambers Idioms
(Kirkpatrick and Schwarz 1982). However, our test data, which is also used in the token-based
experiments, only contains idioms from the two dictionaries ODCIE and CCID. Results
reported in this article are all on test pairs; development pairs are mainly used for the development of the
methods.
11 We also performed experiments on the development data in which we did not restrict the number of
variants, and hence did not enforce the condition Kv = Kn. Instead, we tried using a variety of thresholds
on the similarity scores (from the thesaurus) in order to find the set of most similar words to a given verb
or noun. We found that fixing the number of most similar words is more effective than using a similarity
threshold, perhaps because the actual scores can be very different for different words.
Cifra 1
%IAP and %Acc of Fixednesslex and Fixednessoverall over development data.
According to these results, there is not much variation in the performance of the
measure for K ≥ 20. We thus choose an intermediate value for K that yields the highest accuracy
and a reasonably high precision; specifically, we set K to 50.
The overall fixedness measure defined in Equation (6) also uses a parameter, a,
which determines the relative weights given to the individual fixedness measures in
the linear combination. We experiment on the development data with different values
of α ranging from 0 to 1 by steps of .02; results are shown in Figure 1(b). As can be seen
in the figure, the accuracy of Fixednessoverall is not affected much by the change in the
value of α. The average precision (IAP), sin embargo, shows that the combined measure
performs best when somewhat equal weights are given to the two individual measures,
and performs worst when the lexical fixedness component is completely ignored (es decir.,
α is close to 1). These results also reinforce that a complete evaluation of our fixedness
measures should include both metrics, accuracy and average precision, as they reveal
different aspects of performance. Here, for example, Fixednesssyn (α = 1) has compa-
rable accuracy to Fixednesslex (α = 0), reflecting that the two measures generally give
higher scores to idioms. However, the ranking precision of the latter is much higher
than that of the former, showing that Fixednesslex ranks many of the idioms at the very
top of the list. In all our experiments reported here, we set α to .6, a value for which
Fixednessoverall shows reasonably good performance according to both Acc and IAP.
4.2 Experimental Results and Analysis
En esta sección, we report the results of evaluating our measures on unseen test expres-
sions, with parameters set to the values determined in Section 4.1.3. (Results on devel-
opment data have similar trends to those on test data.) We analyze the classification
performance of the individual lexical and syntactic fixedness measures in Section 4.2.1,
and discuss their effectiveness for retrieval in Section 4.2.2. Section 4.2.3 then looks into
the performance of the overall fixedness measure, and Section 4.2.4 presents a summary
and discussion of the results.
4.2.1 Classification Performance. Here, we look into the performance of the individual
fixedness measures, Fixednesslex and Fixednesssyn, in classifying a mixed set of verb–
noun pairs into idiomatic and literal classes. We compare their performance against the
Table 2
Accuracy and relative error reduction for the two fixedness measures, the two baseline
measures, and Smadja, over all test pairs (TESTall), and test pairs divided by frequency
(TESTflow and TESTfhigh).

                          TESTall               TESTflow              TESTfhigh
Measure             %Acc    (%ERR)        %Acc    (%ERR)        %Acc    (%ERR)
Rand                 50       –             50      –             50      –
PMI                  63      (26)           56     (12)           70     (40)
Smadja               54       (8)           64     (28)           62     (24)
Fixednesslex         68      (36)           70     (40)           70     (40)
Fixednesssyn         71      (42)           72     (44)           82     (64)
two baselines, Rand and PMI, as well as the two state-of-the-art methods, Smadja and
Lin. For analytical purposes, we further divide the set of all test expressions, TESTall,
into two sets corresponding to two frequency bands: TESTflow contains 50 idiomatic
and 50 literal pairs, each with total frequency (across all syntactic patterns under
consideration) between 10 and 40; TESTfhigh consists of 50 idiomatic and 50 literal pairs,
each with total frequency of 40 or greater. Classification performances of all measures
except Lin are given in Table 2. Lin does not assign scores to the test verb–noun pairs,
hence we cannot calculate its classification accuracy the same way we do for the other
methods (i.e., using the median as the threshold). A separate comparison between Lin and
Fixednesslex is provided at the end of this section.
As can be seen in the first two columns of Table 2, the informed baseline, PMI, shows
a large improvement over the random baseline (26% error reduction) on TESTall. This
shows that many VNICs have turned into institutionalized (i.e., statistically significant)
co-occurrences. Hence, one can get relatively good performance by treating verb+noun
idiomatic combinations as collocations. Fixednesslex performs considerably better than
the informed baseline (36% vs. 26% error reduction on TESTall). Fixednesssyn has the best
performance (shown in boldface), with 42% error reduction over the random baseline,
and 21.6% error reduction over PMI. These results demonstrate that lexical and syntactic
fixedness are good indicators of idiomaticity, better than a simple measure of colloca-
tion such as PMI. On TESTall, Smadja performs only slightly better than the random
baseline (8% error reduction), reflecting that a position-based fixedness measure is not
sufficient for identifying idiomatic combinations. These results suggest that looking into
deep linguistic properties of VNICs is necessary for the appropriate treatment of these
expressions.12
PMI is known to perform poorly on low frequency items. To examine the effect of
frequency on the measures, we analyze their performance on the two divisions of the
test data, corresponding to the two frequency bands, TESTflow and TESTfhigh. Results
are given in the four rightmost columns of Table 2, with the best performance shown in
boldface. As expected, the performance of PMI drops substantially for low frequency
items. Interestingly, although it is a PMI-based measure, Fixednesslex has comparable
performance on all data sets. The performance of Fixednesssyn improves quite a bit
when it is applied to high frequency items, while maintaining similar performance on
the low frequency items. These results show that the lexical and syntactic fixedness
measures perform reasonably well on both low and high frequency items.13 Hence they
can be used with a higher degree of confidence, especially when applied to data that is
heterogeneous with regard to frequency. This is important because, while some VNICs
are very common, others have very low frequency, as noted by Grant (2005).
Smadja shows a notable improvement in performance when data is divided by
frequency. This effect is likely due to the fact that fixedness is measured as the spread
of the position-based (raw) co-occurrence frequencies. Nonetheless, on both data sets
the performance of Smadja remains substantially worse than that of our two fixedness
measures (the differences are statistically significant in three out of the four comparisons
at p < .05).
Collectively, these results show that our linguistically motivated fixedness measures
are particularly suited for identifying idiomatic combinations, especially in comparison
with more general collocation extraction techniques, such as PMI or the position-based
fixedness measure of Smadja (1993). In particular, our measures tend to perform well on
low frequency items, perhaps due to their reliance on distinctive linguistic properties
of idioms.
12 Performing the χ2 test of statistical significance, we find that the differences between Smadja and our
lexical and syntactic fixedness measures are statistically significant at p < 0.05. However, the differences
in performance between the fixedness measures and PMI are not statistically significant. Note that this
does not imply that the differences are not substantial, rather that there is not enough evidence in the
observed data to reject the null hypothesis (that the two methods perform the same in general) with high
confidence. Moreover, χ2 is a non-parametric (distribution-free) test and hence it has less power to reject
a null hypothesis. Later, when we take into account the actual scores assigned by the measures, we find
that all differences are statistically significant (see Sections 4.2.2–4.2.3 for more details). All significance
tests are performed using the R (2004) package.
We now compare the classification performance of Fixednesslex to that of Lin.
Unlike Fixednesslex, Lin does not assign continuous scores to the verb–noun pairs, but
rather classifies them as idiomatic or non-idiomatic. Thus, we cannot use the same
threshold (e.g., the median) for the two methods to calculate their classification accuracies
in a comparable way. Recall also from Section 3.1 that the performance of both these
methods depends on the value of K (the number of variants). We thus measure the
classification precision of the methods at equivalent levels of recall, using the same
number of variants K at each recall level for the two measures.
Varying K from 2 to 100 by steps of 4, Lin and Fixednesslex achieve an average
classification precision of 81.5% and 85.8%, respectively. Performing a t-test on the
precisions of the two methods confirms that the difference between the two is statistically
significant at p < .001. In addition, our method has the advantage of assigning a score
to a target verb–noun reflecting its degree of lexical fixedness. Such information can help
a lexicographer decide whether a given verb–noun should be placed in a lexicon.

4.2.2 Retrieval Performance. The classification results suggest that the individual
fixedness measures are overall better than a simple measure of collocation at separating
idiomatic pairs from literal ones. Here, we have a closer look at their performance by
examining their goodness in ranking verb–noun pairs according to their degree of
idiomaticity. Recall that the fixedness measures are devised to reflect the degree of
fixedness and hence the degree of idiomaticity of a target verb–noun pair. Thus, the
result of applying each measure to a list of mixed pairs is a list that is ranked in the
order of idiomaticity. For a measure to be considered good at retrieval, we expect
idiomatic pairs to be very frequent near the top of the ranked list, and to become less
frequent towards the bottom. Precision–recall curves are very indicative of this trend:
The ideal measure will have a precision of 100% for all values of recall, namely, the
measure places all idiomatic pairs at the very top of the ranked list. In reality, although
the precision drops as recall increases, we expect a good measure to keep high precision
at most levels of recall.

Figure 2 depicts the interpolated precision–recall curves for PMI and Smadja, and
for the lexical, syntactic, and overall fixedness measures, over TESTall. Note that the
minimum interpolated precision is 50% due to the equal number of idiomatic and literal
pairs in the test data. In this section, we discuss the retrieval performance of the two
individual fixedness measures; the next section analyzes the performance of the overall
fixedness measure.

The precision–recall curves of Smadja and PMI are nearly flat (with PMI consistently
higher than Smadja), showing that the distribution of idiomatic pairs in the lists
ranked by these two measures is only slightly better than random. A close look at the
precision–recall curve of Fixednesslex reveals that, up to the recall level of 50%, the
precision of this measure is substantially higher than that of PMI. This means that,
compared to PMI, Fixednesslex places more idiomatic pairs at the very top of the list. At
higher recall levels (50% and higher), Fixednesslex still consistently outperforms PMI.

13 In fact, the results show that the performance of both fixedness measures is better when data is divided
by frequency. Although we expect better performance over high frequency items, more investigation is
needed to verify whether the improvement in performance over low frequency items is a meaningful
effect or merely an accident of the data at hand.
Nonetheless, at these recall values, the two measures have relatively low precision
(compared to the other measures), suggesting that both measures also put many
idiomatic pairs near the bottom of the list. In contrast, the precision–recall curve of
Fixednesssyn shows that its performance is consistently much better than that of PMI:
Even at the recall level of 90%, its precision is close to 70% (cf. 55% precision of PMI).

A comparison of the precision–recall curves of the two individual fixedness measures
reveals their complementary nature. Compared to Fixednesslex, Fixednesssyn maintains
higher precision at very high levels of recall, suggesting that the syntactic fixedness
measure places fewer idiomatic pairs at the bottom of the ranked list. In contrast,
Fixednesslex has notably higher precision than Fixednesssyn at recall levels of up to 40%,
suggesting that the former puts more idiomatic pairs at the top of the ranked list.
Statistical significance tests confirm these observations: Using the Wilcoxon Signed Rank
test (1945), we find that both Fixednesslex and Fixednesssyn produce significantly
different rankings from PMI and Smadja (p ≪ .001). Also, the rankings of the items
produced by the two individual fixedness measures are found to be significantly
different at p < .01.

Figure 2
Precision–recall curves for PMI, Smadja, and for the fixedness measures, over TESTall.

Table 3
Classification and retrieval performance of the overall fixedness measure over TESTall.

Measure             %Acc   (%ERR)   %IAP
PMI                  63     (26)    63.5
Smadja               54      (8)    57.2
Fixednesslex         68     (36)    75.3
Fixednesssyn         71     (42)    75.9
Fixednessoverall     74     (48)    84.7

4.2.3 Performance of the Overall Fixedness Measure. We now look at the classification
and retrieval performance of the overall fixedness measure. Table 3 presents %Acc,
%ERR, and %IAP of Fixednessoverall, repeating that of PMI, Smadja, Fixednesslex, and
Fixednesssyn, for comparison. Here again the error reductions are relative to the random
baseline of 50%.

Looking at classification performance (expressed in terms of %Acc and %ERR), we
can see that Fixednessoverall notably outperforms all other measures, including lexical
and syntactic fixedness (18.8% error reduction relative to Fixednesslex, and 10% error
reduction relative to Fixednesssyn). According to the classification results, each of the
lexical and syntactic fixedness measures is good at separating idiomatic from literal
combinations, with syntactic fixedness performing better. Here we demonstrate that
combining them into a single measure of fixedness, while giving more weight to the
better measure, results in a more effective classifier.14 The overall behavior of this
measure as a function of α is displayed in Figure 3.

As can be seen in Table 3, Fixednesslex and Fixednesssyn have comparable IAP:
75.3% and 75.9%, respectively. In comparison, Fixednessoverall has a much higher IAP
of 84.7%, reinforcing the claim that combining evidence from both lexical and syntactic
fixedness is beneficial. Recall from Section 4.2.2 that the two individual fixedness
measures exhibit complementary behavior, as observed in their precision–recall curves
shown in Figure 2.
The precision–recall curve of the overall fixedness measure shows that this measure in
fact combines advantages of the two individual measures: At most recall levels,
Fixednessoverall has a higher precision than both individual measures. Statistical
significance tests that look at the actual scores assigned by the measures confirm that
the observed differences in performance are significant. The Wilcoxon Signed Rank test
shows that the Fixednessoverall measure produces a ranking that is significantly different
from those of the individual fixedness measures, the baseline PMI, and Smadja (at
p ≪ .001).

Figure 3
Classification performance of Fixednessoverall on test data as a function of α.

4.2.4 Summary and Discussion. Overall, the worst performance belongs to the two
collocation extraction methods, PMI and Smadja, both in classifying test pairs as
idiomatic or literal, and in ranking the pairs according to their degree of idiomaticity.
This suggests that although some VNICs are institutionalized, many do not appear with
markedly high frequency, and hence only looking at their frequency is not sufficient for
their recognition. Moreover, a position-based fixedness measure does not seem to
sufficiently capture the syntactic fixedness of VNICs in contrast to the flexibility of
literal phrases. Fixednessoverall is the best performer of all, supporting the hypothesis
that many VNICs are both lexically and syntactically fixed, more so than literal
verb+noun combinations. In addition, these results demonstrate that incorporating such
linguistic properties into statistical measures is beneficial for the recognition of VNICs.

Although we focus on experimental expressions with frequency higher than 10, PMI
still shows great sensitivity to frequency differences, performing especially poorly on
items with frequency between 10 and 40. In contrast, none of the fixedness measures are
as sensitive to such frequency differences. Especially interesting is the consistent
performance of Fixednesslex, which is a PMI-based measure, on low and high frequency
items. These observations put further emphasis on the importance of devising new
methods for extracting multiword expressions with particular syntactic and semantic
properties, such as VNICs.

To further analyze the performance of the fixedness measures, we look at the top
and bottom 20 pairs (10%) in the lists ranked by each fixedness measure.

14 Using a χ2 test, we find a statistically significant difference between the classification performance of
Fixednessoverall and that of Smadja (p < 0.01), and also a marginally significant difference between the
performance of Fixednessoverall and that of PMI (p < .1). Recall from footnote 12 (page 15) that none of
the individual measures' performances significantly differed from that of PMI. Nonetheless, no significant
differences are found between the classification performance of Fixednessoverall and that of the individual
fixedness measures.
Interestingly, the list ranked by Fixednessoverall contains no false positives (fp) in the
top 20 items, and no false negatives (fn) in the bottom 20 items, once again reinforcing
the usefulness of combining evidence from the individual lexical and syntactic fixedness
measures. False positive and false negative errors found in the top and bottom 20 ranked
pairs, respectively, for the syntactic and lexical fixedness measures are given in Table 4.
(Note that fp errors are the non-idiomatic pairs ranked at the top, whereas fn errors are
the idiomatic pairs ranked at the bottom.)

Table 4
Errors found in the top and bottom 20 pairs in the lists ranked by the two individual fixedness
measures; fp stands for false positive, fn stands for false negative.

Measure:      Fixednesssyn                 Fixednesslex
Error Type:   fp             fn            fp             fn
              throw hat      make pile     push barrow    give way
              touch finger   keep secret   blow bridge    keep hand
              lose home                    have moment

We first look at the errors made by Fixednesssyn. The first fp error, throw hat, is an
interesting one: even though the pair is not an idiomatic expression on its own, it is part
of the larger idiomatic phrase throw one’s hat in the ring, and hence exhibits syntactic
fixedness. This shows that our methods can be easily extended to identify other types
of verb phrase idiomatic combinations which exhibit syntactic behavior similar to
VNICs. Looking at the frequency distribution of the occurrence of the other two fp
errors, touch finger and lose home, in the 11 patterns from Table 1, we observe that both
pairs tend to appear mainly in the patterns “vact det:POSS nsg” (touch one’s finger, lose
one’s home) and/or “vact det:POSS npl” (touch one’s fingers). These examples show that
syntactic fixedness is not a sufficient condition for idiomaticity. In other words, it is
possible for non-idiomatic expressions to be syntactically fixed for reasons other than
semantic idiosyncrasy. In these examples, the nouns finger and home tend to be
introduced by a possessive determiner, because they often belong to someone. It is also
important to note that these two patterns have a low prior (i.e., verb–noun pairs do not
typically appear in these patterns). Hence, an expression with a strong tendency to
appear in such patterns will be given a high syntactic fixedness score.

The frequency distribution of the two fn errors for Fixednesssyn reveals that they are
given low scores mainly because their distributions are similar to the prior. Even though
make pile preferably appears in the two patterns “vact det:a/an nsg” and “vact det:NULL
npl,” both patterns have reasonably high prior probabilities. Moreover, because of the
low frequency of make pile (< 40), the evidence is not sufficient to distinguish it from a
typical verb–noun pair. The pair keep secret has a high frequency, but its occurrences are
scattered across all 11 patterns, closely matching the prior distribution. The latter
example shows that syntactic fixedness is not a necessary condition for idiomaticity
either.15

Analyzing the errors made by Fixednesslex is more difficult as many factors may
affect scores given by this measure. Most important is the quality of the automatically
generated variants.
We find that in one case, push barrow, the first 25 distributionally similar nouns (taken
from the automatically built thesaurus) are proper nouns, perhaps because Barrow is a
common last name. In general, it seems that the similar verbs and nouns for a target
verb–noun pair are not necessarily related to the same sense of the target word. Another
possible source of error is that in this measure we use PMI as an indirect clue to
idiomaticity. In the case of give way and keep hand, many of the variants are plausible
combinations with very high frequency of occurrence, for example, give opportunity,
give order, find way for the former, and hold hand, put hand, keep eye for the latter. Whereas
some of these high-frequency variants are literal (e.g., hold hand) or idiomatic (e.g., keep
eye), many have metaphorical interpretations (e.g., give opportunity, find way). In our
ongoing work, we use lexical and syntactic fixedness measures, in combination with
other linguistically motivated features, to distinguish such metaphorical combinations
from both literal and idiomatic expressions (Fazly and Stevenson, to appear).

One way to decrease the likelihood of making any of these errors is to combine
evidence from the lexical and syntactic fixedness of idioms. As can be seen in Table 4,
the two fixedness measures make different errors, and combining them results in a
measure (the overall fixedness) that makes fewer errors. In the future, we intend to also
look into other properties of idioms, such as their semantic non-compositionality, as
extra sources of information.

15 One might argue that keep secret is more semantically analyzable and hence less idiomatic than an
expression such as shoot the breeze. Nonetheless, it is still semantically more idiosyncratic than a fully
literal combination such as keep a pen, and hence should not be ranked at the very bottom of the list.

5. Determining the Canonical Forms of VNICs

Our evaluation of the fixedness measures demonstrates their usefulness for the
automatic recognition of VNICs. Recall from Section 2 that idioms appear in restricted
syntactic forms, often referred to as their canonical forms (Glucksberg 1993; Riehemann
2001; Grant 2005). For example, the idiom pull one’s weight mainly appears in this form
(when used idiomatically). The lexical representation of an idiomatic combination thus
must contain information about its canonical forms. Such information is necessary both
for automatically generating appropriate forms (e.g., in a natural language generation
system or a machine translation system), and for inclusion in dictionaries for learners
(e.g., in the context of computational lexicography).

Because VNICs are syntactically fixed, they are mostly expected to have a small
number of canonical forms. For example, shoot the breeze is listed in many idiom
dictionaries as the canonical form for ⟨shoot, breeze⟩. Also, hold fire and hold one’s fire are
listed in CCID as canonical forms for ⟨hold, fire⟩. We expect a VNIC to occur in its
canonical form(s) with substantially higher frequency than in any other syntactic
patterns.
We thus devise an unsupervised method that discovers the canonical form(s) of a
given idiomatic verb–noun pair by examining its frequency of occurrence in each
syntactic pattern under consideration. Specifically, the set of the canonical form(s) of
the target pair ⟨v, n⟩ is defined as

C(v, n) = {ptk ∈ P | z(v, n, ptk) > Tz}        (7)
Aquí, P is the set of patterns (ver tabla 1), and the condition z(v, norte, ptk) > Tz determines
whether the frequency of the target pair ⟨v, n⟩ in ptk is substantially higher than its
frequency in other patterns; z(v, n, ptk) is calculated using the z-score statistic as in
Ecuación (8), and Tz is a predefined threshold.
z(v, norte, ptk) =
F (v, norte, ptk) − f
s
(8)
where f̄ is the sample mean and s the sample standard deviation.
The statistic z(v, norte, ptk) indicates how far and in which direction the frequency of
occurrence of the target pair ⟨v, n⟩ in a particular pattern ptk deviates from the sample
mean, expressed in units of the sample standard deviation. To decide whether ptk is a
canonical pattern for the target pair, we check whether its z-score, z(v, norte, ptk), is greater
than a threshold Tz. Here, we set Tz to 1, based on the distribution of z and through
examining the development data.
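As a concrete illustration, the sketch below (ours, in Python; the pattern labels and counts are
hypothetical, not taken from the article) computes the z-scores of Equation (8) over a pair's
per-pattern frequencies and applies the threshold of Equation (7):

    import statistics

    def canonical_forms(pattern_counts, t_z=1.0):
        """Return the set of canonical patterns for one verb-noun pair.

        pattern_counts: dict mapping each syntactic pattern (e.g., the 11
        patterns of Table 1) to the pair's frequency in that pattern.
        A pattern ptk is canonical if z(v, n, ptk) > t_z (Equations 7-8).
        """
        freqs = list(pattern_counts.values())
        mean = statistics.mean(freqs)        # sample mean, f-bar
        sd = statistics.stdev(freqs)         # sample standard deviation, s
        if sd == 0:                          # degenerate case: uniform counts
            return set()
        return {pt for pt, f in pattern_counts.items()
                if (f - mean) / sd > t_z}

    # Hypothetical counts for <shoot, breeze> over a few illustrative patterns:
    counts = {"vact det:the nsg": 95,        # "shoot the breeze"
              "vact det:a/an nsg": 3,
              "vact det:NULL npl": 2,
              "vact det:POSS nsg": 1}
    print(canonical_forms(counts, t_z=1.0))  # -> {"vact det:the nsg"}

This is only a sketch under the stated assumptions; the article's implementation operates over the
full pattern set and corpus counts described earlier.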
We evaluate our unsupervised canonical form identification method by verifying
its predicted forms against ODCIE and CCID. Specifically, for each of the 100 idiomatic
pairs in TESTall, we calculate the precision and recall of its predicted canonical forms
(those whose z-scores are above Tz), compared to the canonical forms listed in the two
dictionaries. The average precision across the 100 test pairs is 81.2%, and the average
recall is 88% (with 68 of the pairs having 100% precision and 100% recall). Moreover, we
find that for the overwhelming majority of the pairs, 86%, the predicted canonical form
with the highest z-score appears in the dictionary entry of the pair.
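Concretely, the per-pair comparison against the dictionaries amounts to set precision and recall
over syntactic forms; a minimal sketch (ours; the extra predicted form is an invented example):

    def form_precision_recall(predicted, gold):
        """Precision/recall of predicted canonical forms against dictionary forms."""
        tp = len(predicted & gold)
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(gold) if gold else 0.0
        return precision, recall

    # e.g., for <hold, fire>: the dictionaries list "hold fire" and "hold one's fire"
    gold = {"hold fire", "hold one's fire"}
    predicted = {"hold fire", "hold one's fire", "hold the fire"}  # hypothetical output
    print(form_precision_recall(predicted, gold))  # -> (0.666..., 1.0)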
According to the entries in ODCIE and CCID, 93 out of 100 idiomatic pairs in
TESTall have one canonical form. Our canonical form extraction method on average finds
1.2 canonical forms for these 100 pairs (one canonical form for 79 of them, two for 18,
and three for 3 of these). Generally, our method tends to extract more canonical forms
than listed in the dictionaries. This is a desirable property, because idiom dictionaries
often do not exhaustively list all canonical forms, but only the most dominant ones. Examples
of such cases include: see the sights for which our method also finds see sights as a canon-
ical form, and catch one’s attention for which our method also finds catch the attention.
There are also cases where our method finds canonical forms for a given expression due
to noise resulting from the use of the expression in a non-idiomatic sense. For example,
for hold one’s horses, our method also finds hold the horse and hold the horses as canonical
forms. Similarly, for get the bird, our method also finds get a bird.
In a few cases (4 out of 100), our method finds fewer canonical forms than listed
in the dictionaries. These are catch the/one’s imagination, have a/one’s fling, make a/one’s
mark, and have a/the nerve. For the first two of these, the z-score of the missed pattern
is only slightly lower than our predefined threshold. In other cases (8 out of 100), none
of the canonical forms extracted by our method match those in a dictionary. Some of
these expressions also have a non-idiomatic sense which might be more dominant than
the idiomatic usage. For example, for give the push and give the flick, our method finds
give a push and give a flick, respectively, perhaps due to the common use of the latter
forms as light verb constructions. For make one’s peace, our method finds a different form,
make peace, which seems a plausible canonical form; and moreover, the canonical form
listed in the dictionaries (make one’s peace) has a z-score which is only slightly lower
than our threshold. There is also one case where our method finds a canonical form
that corresponds to a different idiom using the same verb+noun: we find lose touch as
a canonical form, whereas the dictionaries list an idiom with a different canonical form
(lose one’s touch) as the idiom with lose and touch.
In general, canonical forms extracted by our method are reasonably accurate, but
may need to be further analyzed by a lexicographer to filter out incorrectly found
patterns. Moreover, our method extracts new canonical forms for some expressions,
which could be used to augment dictionaries.
6. Automatic Identification of VNIC Tokens
In previous sections, we have provided an analysis of the lexical and syntactic behavior
of idiomatic expressions. We have shown that our proposed techniques that draw on
such properties can successfully distinguish an idiomatic verb+noun combination (a
VNIC type) such as get the sack from a non-idiomatic (literal) one such as get the bag. It is
important, however, to note that a potentially idiomatic expression such as get the sack
can also have a literal interpretation in a given context, as in Joe got the sack from the top
shelf. This is true of many potential idioms, although the relative proportion of literal
usages may differ from one expression to another. For example, an expression such as
see stars is much more likely to have a literal interpretation than get the sack (according to
our findings in the BNC). Identification of idiomatic tokens in context is thus necessary
for a full understanding of text, and this will be the focus of Sections 6 and 7.
Recent studies addressing token identification for idiomatic expressions mainly
perform the task as one of word sense disambiguation, and draw on the local context of
a token to disambiguate it. Such techniques either do not use any information regarding
the linguistic properties of idioms (Birke and Sarkar 2006), or mainly focus on the
property of non-compositionality (Katz and Giesbrecht 2006). Studies that do make
use of deep linguistic information often handcode the knowledge into the systems
(Uchiyama, Baldwin, and Ishizaki 2005; Hashimoto, Sato, and Utsuro 2006). Our goal is
to develop techniques that draw on the specific linguistic properties of idioms for their
identification, without the need for handcoded knowledge or manually labelled train-
ing data. Such unsupervised techniques can also help provide automatically labelled
(noisy) training data to bootstrap (semi-)supervised methods.
In Sections 3 and 4, we showed that the lexical and syntactic fixedness of idioms
is especially relevant to their type-based recognition. We expect such properties to also
be relevant for their token identification. Moreover, we have shown that it is possible to
learn about the fixedness of idioms in an unsupervised manner. Here, we propose unsu-
pervised techniques that draw on the syntactic fixedness of idioms to classify individual
tokens of a potentially idiomatic phrase as literal or idiomatic. We also put forward a
classification technique that combines such information (in the form of noisy training
data) with evidence from the local context of usages of an expression. In Section 6.1,
we elaborate on the underlying assumptions of our token identification techniques.
Section 6.2 then describes our proposed methods that draw on these assumptions to
perform the task.
6.1 Underlying Assumptions
Although there may be fine-grained differences in meaning across the idiomatic us-
ages of an expression, as well as across its literal usages, we assume that the idiomatic
and literal usages correspond to two coarse-grained senses of the expression. We will
refer then to each of the literal and idiomatic designations as a (coarse-grained) mean-
ing of the expression, while acknowledging that each may have multiple fine-grained
senses.
Recall from Section 2 that idioms tend to be somewhat fixed with respect to the
syntactic configurations in which they occur. For example, pull one’s weight tends to
mainly appear in this form when used idiomatically. Other forms of the expression,
such as pull the weights, typically are only used with a literal meaning. In other words,
an idiom tends to have one (or a small number of) canonical form(s), which are its most
preferred syntactic patterns.16 Here we assume that, in most cases, idiomatic usages of
an expression tend to occur in its canonical form(s). We also assume that, in contrast,
the literal usages of an expression are less syntactically restricted, and are expressed
in a greater variety of patterns. Because of their relative unrestrictedness, literal usages
may occur in a canonical form for that expression, but usages in a canonical form are
more likely to be idiomatic. Usages in alternative syntactic patterns for the expression,
which we refer to as the non-canonical forms of the expression, are more likely to be
literal.
Drawing on these assumptions, we develop unsupervised methods that deter-
mine, for each verb+noun token in context, whether it has an idiomatic or a literal
16 As noted previously, 93 out of 100 idiomatic pairs in TESTall have one canonical form, according to the
entries in ODCIE and CCID. Also, our canonical form extraction method on average finds 1.2 canonical
forms for the 100 test idioms.
interpretation. Clearly, the success of our methods depends on the extent to which these
assumptions hold (we will return to these assumptions in Section 7.2.3).
6.2 Proposed Methods
This section elaborates on our proposed methods for identifying the idiomatic and
literal usages of a verb+noun combination: the CFORM method that uses knowledge
of canonical forms only, and the CONTEXT method that also incorporates distributional
evidence about the local context of a token. Both methods draw on our assumptions
described herein, that usages in the canonical form(s) for a potential idiom are more
likely to be idiomatic, and those in other forms are more likely to be literal. Because
our methods need information about canonical forms of an expression, we use the
unsupervised method described in Section 5 to find these automatically. In the following
discussion, we describe each method in more detail.
CFORM. This method classifies an instance (token) of an expression as idiomatic if it
occurs in one of the automatically determined canonical form(s) for that expression
(p.ej., pull one’s weight), and as literal otherwise (p.ej., pull a weight, pull the weights). El
underlying assumption of this method is that information about the canonical form(s) de
an idiom type can provide a reasonably accurate classification of its individual instances
as literal or idiomatic.
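As a minimal sketch (ours, not the authors' code; it assumes the token's syntactic pattern has
already been extracted by the parsing step), the CFORM rule reduces to a set-membership test
against the automatically acquired canonical forms:

    def cform_label(token_pattern, canonical_patterns):
        """Label a single verb+noun token as 'idiomatic' or 'literal'.

        token_pattern: the syntactic pattern the token occurs in
        canonical_patterns: set produced by the canonical-form step (Section 5)
        """
        return "idiomatic" if token_pattern in canonical_patterns else "literal"

    # e.g., with canonical form "vact det:POSS nsg" for <pull, weight>:
    #   "pull his weight"  -> pattern "vact det:POSS nsg" -> idiomatic
    #   "pull the weights" -> pattern "vact det:the npl"  -> literal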
CONTEXT. Recall our assumption that the idiomatic and literal usages of an idiom corre-
spond to two coarse-grained meanings of the expression. It is natural to further assume
that the literal and idiomatic usages have more in common semantically within each
group than between the two groups. Adopting a distributional approach to meaning—
where the meaning of an expression is approximated by the words with which it co-
occurs (Firth 1957)—we would expect the literal and idiomatic usages of an expression
to typically occur with different sets of words.
Indeed, in a supervised setting, Katz and Giesbrecht (2006) show that the local
context of an idiom usage is useful in identifying its sense. Inspired by this work, we
propose an unsupervised method that incorporates distributional information about the
local context of the usages of an idiom, in addition to the (syntactic) knowledge about
its canonical forms, in order to determine if its token usages are literal or idiomatic.
To achieve this, the method compares the context surrounding a test instance of an
expression to “gold-standard” contexts for the idiomatic and literal usages of the expres-
sion, which are taken from noisy training data automatically labelled using canonical
forms.17
For each test instance of an expression, the CONTEXT method thus compares its
co-occurring words to two sets of gold-standard co-occurring words: one typical of
idiomatic usages and one typical of literal usages of the expression (we will shortly
explain precisely how we find these). If the test token is determined to be (on aver-
age) more similar to the idiomatic usages, then it is labelled as idiomatic. Other-
wise, it is labelled as literal. To measure similarity between two sets of words, we use
17 The two CONTEXT methods in our earlier work (Cook, Fazly, and Stevenson 2007) were biased because
they used information about the canonical form of a test token (in addition to context information).
We found that when the bias was removed, the similarity measure used in those techniques was not
as effective, and hence we have developed a different method here.
a standard distributional similarity measure, Jaccard, defined subsequently.18 In the
following equation, A and B represent the two sets of words to be compared:

Jaccard(A, B) = |A ∩ B| / |A ∪ B|        (9)
Now we explain how the CONTEXT method finds typically co-occurring words for each
of the idiomatic and literal meanings of an expression. Note that unlike in a supervised
setting, here we do not assume access to manually annotated training data. We thus use
knowledge of automatically acquired canonical forms to find these.
The CONTEXT method labels usages of an expression in a leave-one-out strategy,
where each test token is labelled by using the other tokens as noisy training (gold-
standard) data. Specifically, to provide gold-standard data for each instance of an
expresión, we first divide the other instances (of the same expression) into likely-
idiomatic and likely-literal groups, where the former group contains usages in canonical
forma(s) and the latter contains usages in non-canonical form(s). We then pick represen-
tative usages from each group by selecting the K instances that are most similar to the
instance being labelled (the test token) according to the Jaccard similarity score.
Recall that we assume canonical form(s) are predictive of the idiomatic usages and
non-canonical form(s) are indicative of the literal usages of an expression. We thus
expect the co-occurrence sets of the selected canonical and non-canonical instances to
reflect the idiomatic and literal meanings of the expression, respectivamente. We take the
average similarity of the test token to the K nearest canonical instances (likely idiomatic)
and the K nearest non-canonical instances (likely literal), and label the test token accord-
ingly.19 In the event that there are less than K canonical or non-canonical form usages
of an expression, we take the average similarity over however many instances there are
of this form. If we have no instances of one of these forms, we classify each token as
idiomatic, the label we expect to be more frequent.
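The following sketch (ours; the construction of the per-token word sets is a simplifying
assumption) shows the leave-one-out CONTEXT decision: the test token's word set is compared,
via the Jaccard score of Equation (9), to its K most similar canonical-form and non-canonical-form
neighbours, and the higher average similarity decides the label.

    def jaccard(a, b):
        """Jaccard similarity of Equation (9) between two word sets."""
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    def context_label(test_words, canonical_contexts, noncanonical_contexts, k=5):
        """Label one token using the other tokens of the same expression.

        canonical_contexts / noncanonical_contexts: lists of word sets for the
        remaining (leave-one-out) instances, split by whether they occur in an
        automatically acquired canonical form (likely idiomatic) or not
        (likely literal).
        """
        def avg_top_k(contexts):
            if not contexts:
                return None
            sims = sorted((jaccard(test_words, c) for c in contexts), reverse=True)
            top = sims[:k]                    # use however many exist if fewer than k
            return sum(top) / len(top)

        idiom_sim = avg_top_k(canonical_contexts)
        literal_sim = avg_top_k(noncanonical_contexts)
        if idiom_sim is None or literal_sim is None:
            return "idiomatic"                # default to the expected majority class
        return "idiomatic" if idiom_sim >= literal_sim else "literal"

This mirrors the description above: averages are taken over the K nearest instances of each group,
fewer instances are used when fewer exist, and the idiomatic label is returned when one group is
empty.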
7. VNIC Token Identification: Evaluation
To evaluate the performance of our proposed token identification methods, we use
each in a classification task, in which the method indicates for each instance of a given
expression whether it has an idiomatic or a literal interpretation. Section 7.1 explains
the details of our experimental setup. Section 7.2 then presents the experimental results
as well as some discussion and analysis.
7.1 Experimental Setup
7.1.1 Experimental Expressions and Annotation. In our token classification experiments,
we use a subset of the 180 idiomatic expressions in the development and test data sets
used in the type-based experiments of Section 4. From the original 180 expressions, we
discard those whose frequency in the BNC is lower than 20, to increase the likelihood
that there are both literal and idiomatic usages of each expression. We also discard any
18 It is possible to incorporate extra knowledge sources, such as WordNet, for measuring similarity
between two sets of words. Sin embargo, our intention is to focus on purely unsupervised, knowledge-lean
approaches.
19 We also tried using the average similarity of the test token to all instances in each group. Sin embargo,
we found that focusing on the most similar instances from each group performs better.
expression that is not from the two dictionaries ODCIE and CCID (see Section 4.1.2
for more details on the original data sets). This process results in the selection of
60 candidate verb–noun pairs.
For each of the selected pairs, 100 sentences containing its usage were randomly ex-
tracted from the automatically parsed BNC, using the method described in Section 4.1.1.
For a pair which occurs less than 100 times in the BNC, all of its usages were extracted.
Two judges were asked to independently label each use of each candidate expression as
literal, idiomatic, or unknown. When annotating a token, the judges had access to only
the sentence in which it occurred, and not the surrounding sentences. If this context was
insufficient to determine the class of the expression, the judge assigned the unknown
label. In an effort to assure high agreement between the judges’ annotations, the judges
were also provided with the dictionary definitions of the idiomatic meanings of the
expresiones.
Idiomaticity is not a binary property; rather it is known to fall on a continuum
from completely semantically transparent, or literal, to entirely opaque, or idiomatic.
The human annotators were required to pick the label, literal or idiomatic, that best fit
the usage in their judgment; they were not to use the unknown label for intermediate
cases. Figurative extensions of literal meanings were classified as literal if their overall
meaning was judged to be fairly transparent, as in You turn right when we hit the road
at the end of this track (taken from the BNC). Sometimes an idiomatic usage, such as have
word in At the moment they only had the word of Nicola’s husband for what had happened
(also taken from the BNC), is somewhat directly related to its literal meaning, which
is not the case for more semantically opaque idioms such as hit the roof. This sentence
was classified as idiomatic because the idiomatic meaning is much more salient than the
literal meaning.
First, our primary judge, a native English speaker and an author of this paper,
annotated each use of each candidate expression. Based on this judge’s annotations, we
removed the 25 expressions with fewer than 5 instances of either of their literal or idi-
omatic meanings, leaving 28 expressions.20 (We will revisit the 25 removed expressions
in Section 7.2.4.) The remaining expressions were then split into development (DEV) and
test (TEST) sets of 14 expressions each. The data was divided such that DEV and TEST
would be approximately equal with respect to the frequency of their expressions, as
well as their proportion of idiomatic-to-literal usages (according to the primary judge’s
annotations). At this stage, DEV and TEST contained a total of 813 and 743 tokens,
respectively.
Our second judge, also a native English-speaking author of this paper, then anno-
tated DEV and TEST sentences. The observed agreement and unweighted kappa score
(Cohen 1960) on TEST were 76% and 0.62, respectively. The judges discussed tokens on
which they disagreed to achieve a consensus annotation. Final annotations were gener-
ated by removing tokens that received the unknown label as the consensus annotation,
leaving DEV and TEST with a total of 573 and 607 tokens, and an average of 41 and 43 to-
kens per expression, respectively. Table 5 shows the DEV and the TEST verb–noun pairs
used in our experiments. The table also contains information on the number of tokens
considered for each pair, as well as the percentage of its usages which are idiomatic.
20 From the original set of 60 expressions, seven were excluded because our primary annotator did not
provide any annotations for them. These include catch one’s breath, cut one’s losses, and push one’s luck (for
which our annotator did not have access to a literal interpretation); and blow one’s (own) horn, pull one’s
hair, give a lift, and get the bird (for which our annotator did not have access to an idiomatic meaning).
Table 5
Experimental DEV and TEST verb–noun pairs, their token frequency (FRQ), and the percentage of
their usages that are idiomatic (%IDM), ordered in decreasing %IDM.
DEV                               TEST
verb–noun       FRQ   %IDM        verb–noun       FRQ   %IDM
find foot        52    90         have word        89    90
make face        30    90         lose thread      20    90
get nod          26    89         get sack         50    86
pull weight      33    82         make mark        85    85
kick heel        38    79         cut figure       43    84
hit road         31    77         pull punch       22    82
take heart       79    73         blow top         28    82
pull plug        65    69         make scene       48    58
blow trumpet     29    66         make hay         17    53
hit roof         17    65         get wind         29    45
lose head        38    55         make hit         14    36
make pile        25    32         blow whistle     78    35
pull leg         51    22         hold fire        23    30
see star         61     8         hit wall         61    11
7.1.2 Líneas de base, Parameters, and Performance Measures. We compare the performance of
our proposed methods, CFORM and CONTEXT, with the baseline of always predicting
an idiomatic interpretation, the most frequent meaning in our development data. We
also compare the unsupervised methods against a supervised method, SUP, which is
similar to CONTEXT, except that it forms the idiomatic and literal co-occurrence sets
from manually annotated data (instead of automatically labelled data using canonical
forms). Like CONTEXT, SUP also classifies tokens in a leave-one-out methodology using
the K idiomatic and literal instances which are most similar to a test token. For both
CONTEXT and SUP, we set the value of K (the number of similar instances used as
gold-standard) to 5, since experiments on DEV indicated that performance did not vary
substantially using a range of values of K.
For all methods, we report the accuracy macro-averaged over all expressions in
TEST. We use the individual accuracies (accuracies for the individual expressions) to
perform t-tests for verifying whether different methods have significantly different
performance. To further analyze the performance of the methods, we also report their
recall and precision on identifying usages from each of the idiomatic and literal classes.
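For reference, a small sketch (ours) of how the macro-averaged accuracy and the paired
significance test could be computed with SciPy; the per-expression accuracy lists are made-up
placeholders, not results from the article:

    from statistics import mean
    from scipy.stats import ttest_rel  # paired t-test over the same expressions

    # Hypothetical per-expression accuracies for two methods on 14 TEST items
    acc_cform   = [0.81, 0.64, 0.90, 0.77, 0.58, 0.70, 0.66,
                   0.88, 0.79, 0.55, 0.61, 0.74, 0.69, 0.82]
    acc_context = [0.75, 0.60, 0.85, 0.70, 0.52, 0.68, 0.61,
                   0.80, 0.72, 0.50, 0.58, 0.70, 0.66, 0.77]

    print("macro-averaged accuracy:", mean(acc_cform), mean(acc_context))
    result = ttest_rel(acc_cform, acc_context)   # pair by expression
    print("paired t-test p-value:", result.pvalue)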
7.2 Experimental Results and Analysis
We first discuss the overall performance of our proposed unsupervised methods in
Section 7.2.1. Results reported in Section 7.2.1 are on TEST (results on DEV have similar
trends, unless noted otherwise). Next, we look into the performance of our methods
on expressions with different proportions of idiomatic-to-literal usages in Section 7.2.2,
which presents results on TEST and DEV combined, as explained subsequently. Sec-
tion 7.2.3 provides an analysis of the errors made because of using canonical forms, and
identifies some possible directions for future work. In Section 7.2.4, we present results
on a new data set containing expressions with highly skewed proportion of idiomatic-
to-literal usages.
Table 6
Macro-averaged accuracy (%Acc) and relative error rate reduction (%ERR) on TEST expressions.
Method                      %Acc   (%ERR)
Baseline                    61.9
Unsupervised  CONTEXT       65.8   (10.2)
              CFORM         72.4   (27.6)
Supervised    SUP           82.7   (54.6)
7.2.1 Overall Performance. Table 6 shows the macro-averaged accuracy on TEST of our
two unsupervised methods, as well as that of the baseline and the supervised method
for comparison. The best unsupervised performance is indicated in boldface.
As the table shows, both of our unsupervised methods as well as the supervised
method outperform the baseline, confirming that the canonical forms of an expression,
and local context, are both informative in distinguishing literal and idiomatic instances
of the expression.21 Moreover, CFORM outperforms CONTEXT (difference is marginally
significant at p < .06), which is somewhat unexpected, as CONTEXT was proposed
as an improvement over CFORM in that it combines contextual information along
with the syntactic information provided by CFORM. We return to these results later
(Section 7.2.3) to offer some reasons as to why this might be the case. However, the
results using CFORM confirm our hypothesis that canonical forms—which reflect the
overall behavior of a verb+noun type—are strongly informative about the class of a
token. Importantly, this is the case even though the canonical forms that we use are
imperfect knowledge obtained automatically through an unsupervised method.
Comparing CFORM with SUP, we observe that even though on average the latter
outperforms the former, the difference is not statistically significant (p > .1). A close
look at the performance of these methods on the individual expressions reveals that
neither consistently outperforms the other on all (or even most) expressions. Moreover,
as we will see in Section 7.2.2, SUP seems to gain most of its advantage over CFORM on
expressions with a low proportion of idiomatic usages, for which canonical forms tend
to have less predictive value (see Section 7.2.3 for details).
Recall that both CONTEXT and SUP label each token by comparing its local context
to those of its K nearest “idiomatic” and its K nearest “literal” usages. The difference is
that CONTEXT uses noisy (automatically) labelled data to identify these nearest usages
for each token, whereas SUP uses manually labelled data. One possible direction for fu-
ture work is thus to investigate whether providing substantially larger amounts of data
alleviates the effect of noise, as is often found to be the case by researchers in the field.
7.2.2 Performance Based on Class Distribution. Recall from Section 6 that both of our un-
supervised techniques for token identification depend on how accurately the canonical
forms of an expression can be acquired. The canonical form acquisition technique which
we use here works well if the idiomatic meaning of an expression is sufficiently frequent
compared to its literal usage. En esta sección, we thus examine the performance of the
21 Performing a paired t-test, we find that the difference between the baseline and CFORM is marginally
significant, pag < .06, whereas the difference between baseline and CONTEXT is not statistically significant.
The difference between the baseline and SUP is significant at p < .01. The trend on DEV is somewhat
similar: baseline and CFORM are significantly different at p < .05; SUP is marginally different from
baseline at p < .06.
Table 7
Macro-averaged accuracy (%Acc) and relative error rate reduction (%ERR) on the 28 expressions
in DT (DEV and TEST combined), divided according to the proportion of idiomatic-to-literal
usages (high and low).
                            DTIhigh               DTIlow
Method                      %Acc   (%ERR)         %Acc   (%ERR)
Baseline                    81.4                  35.0
Unsupervised  CONTEXT       80.6   (−4.3)         44.6   (14.8)
              CFORM         84.7   (17.7)         53.4   (28.3)
Supervised    SUP           84.4   (16.1)         76.8   (64.3)
token identification methods for expressions with different proportions of idiomatic-to-
literal usages.
We merge DEV and TEST (referring to the new set as DT), and then divide the re-
sulting set of 28 expressions according to their proportion of idiomatic-to-literal usages
(as determined by the human annotations) as follows.22 Looking at the proportion of
idiomatic usages of our expressions in Table 5, we can see that there are gaps between
55% and 65% in DEV, and between 58% and 82% in TEST, in terms of proportion
of idiomatic usages. The value of 65% thus serves as a natural lower bound for dominant
idiomatic usage, and the value of 58% as a natural upper bound for non-dominant
idiomatic usage. We therefore split DT into two sets: DTIhigh contains 17 expressions with
65–90% of their usages being idiomatic (i.e., their idiomatic usage is dominant), whereas
DTIlow contains 11 expressions with 8–58% of their occurrences being idiomatic (i.e., their
idiomatic usage is not dominant).
Table 7 shows the average accuracy of all the methods on these two groups of
expressions, with the best performance on each group shown in boldface. We first look
at the performance of our methods on DTIhigh. On these expressions, CFORM outperforms
both the baseline (difference is not statistically significant) and CONTEXT (difference is
statistically significant at p < .05). CFORM also has a comparable performance to the su-
pervised method, reinforcing that for these expressions accurate canonical forms can be
acquired and that such knowledge can be used with high confidence for distinguishing
idiomatic and literal usages in context.
We now look into the performance on expressions in DTIlow. On these, both CFORM
and CONTEXT outperform the baseline, showing that even for expressions whose idi-
omatic meaning is not dominant, automatically acquired canonical forms can help with
their token classification. Nonetheless, both these methods perform substantially worse
than the supervised method, reinforcing that the automatically acquired canonical
forms are noisier, and hence less predictive, than they are for expressions in DTIhigh.
The poor performance of the unsupervised methods on expressions in DTIlow (com-
pared to the supervised performance) is likely to be mostly due to the less predictive
canonical forms extracted for these expressions. In general, we can conclude that when
canonical forms can be extracted with a high accuracy, the performance of the CFORM
method is comparable to that of a supervised method. One possible way of improving
the performance of unsupervised methods is thus to develop more accurate techniques
for the automatic acquisition of canonical forms.
22 We combine the two sets in order to have a sufficient number of expressions in each group after division.
Table 8
Confusion matrix for CFORM on expression blow trumpet. idm = idiomatic class; lit = literal class;
tp = true positive; fp = false positive; fn = false negative; tn = true negative.
                          True Class
                          idm          lit
Predicted    idm          17 = tp      6 = fp
Class        lit           2 = fn      4 = tn
Table 9
Formulas for calculating Sens and PPV (recall and precision for the idiomatic class), and Spec
and NPV (recall and precision for the literal class) from a confusion matrix.
          recall (R)                      precision (P)
idm       Sens = tp / (tp + fn)           PPV = tp / (tp + fp)
lit       Spec = tn / (tn + fp)           NPV = tn / (tn + fn)
Accuracy is often not a sufficient measure for the evaluation of a binary (two-class)
classifier, especially when the number of items in the two classes (here, idiomatic and
literal) differ. Instead, one can have a closer look at the performance of a classifier by
examining its confusion matrix, which compares the labels predicted by the classifier
for each item with its true label. As an example, the confusion matrix of the CFORM
method for the expression blow trumpet is given in Table 8.
Note that the choice of idiomatic as the positive class (and literal as the negative
class) is arbitrary; however, because our ultimate goal is to identify idiomatic usages,
there is a natural reason for this choice. To summarize a confusion matrix, four standard
measures are often used, which are calculated from the cells in the matrix. The measures
are sensitivity (Sens), positive predictive value (PPV), specificity (Spec), and negative
predictive value (NPV), and are calculated as in Table 9. As stated in the table, Sens
and PPV are equivalents of recall and precision for the positive (idiomatic) class, also
referred to as Ridm and Pidm later in the article. Similarly, Spec and NPV are equivalents
of recall and precision for the negative (literal) class, also referred to as Rlit and Plit.23
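A small sketch (ours) of the four measures of Table 9, applied to the blow trumpet confusion
matrix of Table 8; the metric values in the trailing comment are simply the resulting arithmetic:

    def confusion_metrics(tp, fp, fn, tn):
        """Sens/PPV (idiomatic recall/precision) and Spec/NPV (literal recall/precision)."""
        return {
            "Sens (Ridm)": tp / (tp + fn),
            "PPV  (Pidm)": tp / (tp + fp),
            "Spec (Rlit)": tn / (tn + fp),
            "NPV  (Plit)": tn / (tn + fn),
        }

    # Confusion matrix of Table 8 (CFORM on blow trumpet): tp=17, fp=6, fn=2, tn=4
    print(confusion_metrics(tp=17, fp=6, fn=2, tn=4))
    # Sens = 17/19 ≈ 0.89, PPV = 17/23 ≈ 0.74, Spec = 4/10 = 0.40, NPV = 4/6 ≈ 0.67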
Table 10 gives the trimmed mean values of these four performance measures over
expressions in DTIhigh and DTIlow for the baseline, the two unsupervised methods, and the
supervised method.24 (The performance measures on individual expressions are given
in Tables 12, 13, and 14 in the Appendix.) Table 10 shows that, as expected, the baseline
has very high Sens (100% recall on identifying idiomatic usages), but very low Spec (0%
23 We mainly refer to these measures using their standard names in the literature: Sens, PPV, Spec, and
NPV. Alongside the standard names, we use the more expressive names Ridm, Pidm, Rlit, and Plit, to
remind the reader about the semantics of the measures.
24 When averaging interdependent measures, such as precision and recall, one needs to make sure that
the observed trend in the averages is consistent with that in the individual values. Trimmed mean is a
standard statistic used in such cases, which is equivalent to the mean after discarding a percentage (often
between 5 and 25) of the sample data at the high and low ends. Here, we report a 14%-trimmed mean,
which involves removing two data points from each end. The analysis presented here is based on the
trimmed means, as well as the individual values of the performance measures.
Table 10
Detailed classification performance of all methods over DTIhigh and DTIlow . Performance is given
using four measures: Sens or Ridm, PPV or Pidm, Spec or Rlit, and NPV or Plit, macro-averaged
using 14%-trimmed mean.
Data Set    Method      Sens (Ridm)   PPV (Pidm)   Spec (Rlit)   NPV (Plit)
DTIhigh     Baseline    1.00          .82          0.00          0.00
            CONTEXT     .97           .84          .11           .18
            CFORM       .95           .92          .61           .71
            SUP         .99           .86          .22           .53

DTIlow      Baseline    1.00          .36          0.00          0.00
            CONTEXT     .89           .37          .22           .63
            CFORM       .86           .43          .36           .86
            SUP         .44           .62          .88           .80
recall on identifying literal usages). We thus expect a well-performing method to have
lower Sens than the baseline, but higher Spec and also higher PPV and NPV (i.e., higher
precision on both idiomatic and literal usages).
Looking at performance on DTIhigh, we find that all three methods have reasonably
high Sens and PPV, revealing that the methods are good at labeling idiomatic usages.
Performance on literal usages, however, differs across the three methods. CONTEXT has
very low Spec and NPV, showing that it tends to label most tokens—including the literal
ones—as idiomatic. A close look at the performance of this method on the individual
expressions also confirms this tendency: on many expressions (10 out of 17) the Spec
and NPV of CONTEXT are both zero (see Table 13 in the Appendix). As we will see in
Section 7.2.3, this tendency is partly due to the distribution of the idiomatic and literal
usages in canonical and non-canonical forms; because literal usages can also appear in
a canonical form, for many expressions there are often not many non-canonical form
instances. (Recall that, for training, CONTEXT uses instances in canonical form as being
idiomatic and those in non-canonical form as being literal.) Thus, in many cases, it
is a priori more likely that a token is more similar to the K most similar canonical
form instances. Interestingly, CFORM is the method with the highest Spec and NPV,
even higher than those of the supervised method. Nonetheless, even CFORM is overall
much better at identifying idiomatic tokens than literal ones (see Section 7.2.3 for more
discussion on this).
We now turn to performance on DTIlow. CFORM has a high Sens, but a low PPV,
indicating that most idiomatic usages are identified correctly, but many literal usages
are also misclassified as idiomatic (hence a low Spec). CONTEXT shows the same trend
as CFORM, though overall it has poorer performance. Performance of SUP varies across
the expressions in this group: SUP is very good at identifying literal usages of these
expressions (high Spec and NPV for all expressions). Nonetheless, SUP has a low recall
in identifying idiomatic usages (low Sens) for many of these expressions.
7.2.3 Discussion and Error Analysis. In this section, we examine two main issues. First, we
look into the plausibility of our original assumptions regarding the predictive value of
canonical forms (and non-canonical forms). Second, we investigate the appropriateness
of our automatically extracted canonical forms.
To learn more about the predictive value of canonical forms, we examine the per-
formance of CFORM on the 28 expressions under study. More specifically, we look at
the values of Sens, PPV, Spec, and NPV on these expressions, as shown in Table 12
in the Appendix. On expressions in DTIhigh, CFORM has both high Sens and high PPV.
The formulas in Table 9 indicate that if both Sens and PPV are high, then tp ≫ fn and
tp ≫ fp. Thus, most idiomatic usages of expressions in DTIhigh appear in a canonical form,
and most usages in a canonical form are idiomatic. The values of Spec and NPV on the
same expressions are in general lower (compared to Sens and PPV), showing that tn is
not much higher than fp or fn.
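For readers without Table 9 at hand, the four measures follow the standard confusion-matrix definitions, with idiomatic usages taken as the positive class; the small helper below (ours, not part of the original article) spells out the relationships used in this argument.

```python
# Sens = recall on idiomatic usages (R_idm), PPV = precision on idiomatic usages (P_idm),
# Spec = recall on literal usages (R_lit), NPV = precision on literal usages (P_lit),
# computed from the confusion counts with "idiomatic" as the positive class.

def token_measures(tp, fp, tn, fn):
    return {
        "Sens": tp / (tp + fn) if tp + fn else 0.0,  # high Sens implies tp >> fn
        "PPV":  tp / (tp + fp) if tp + fp else 0.0,  # high PPV implies tp >> fp
        "Spec": tn / (tn + fp) if tn + fp else 0.0,  # high Spec implies tn >> fp
        "NPV":  tn / (tn + fn) if tn + fn else 0.0,  # high NPV implies tn >> fn
    }
```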
On expressions in DTIlow, CFORM generally has high Sens but low-to-medium PPV.
This indicates that for these expressions, most idiomatic usages appear in a canonical
form, but not all usages in a canonical form are idiomatic. On these expressions, CFORM
has generally high NPV, but mostly low Spec. These indicate that tn ≫ fn, that is, most
usages in a non-canonical form are literal, and that tn is often lower than fp, that is, many
literal usages also appear in a canonical form. For example, almost all usages of hit wall
in a non-canonical form are literal, but most of its literal usages appear in a canonical
form.
Generally, it seems that, as we expected, literal usages are less restricted in terms
of the syntactic form they appear in; they can appear in both canonical form(s) and
in non-canonical form(s). For an expression with a low proportion of literal usages,
we can thus acquire canonical forms that are both accurate and have high predictive
value for identifying idiomatic usages in context. On the contrary, for expressions
with a relatively high proportion of literal usages, automatically acquired canonical
forms are less accurate and also have low predictive value (i.e., they are not specific
to idiomatic usages). We expected that using contextual information would help in
such cases. However, our CONTEXT method relies on noisy training data automatically
labelled using information about canonical forms. Given these findings, it is not sur-
prising that this method performs substantially worse than a corresponding supervised
method that uses similar contextual information, but manually labelled training data. It
remains to be tested whether providing a larger amount of such noisily labelled data will help. Another
possible future direction is to develop context methods that can better exploit noisy
labelled data.
Now we look at a few cases where our automatically extracted canonical forms are
not sufficiently accurate. For a verb+noun such as make pile (i.e., make a pile of money),
we correctly identify only some of the canonical forms. The automatically determined
canonical forms for make pile are make a pile and make piles. However, we find that idi-
omatic usages of this expression are sometimes of the form make one’s pile. Furthermore,
we find that the frequency of this form is much higher than that of the non-canonical
forms, and not substantially lower than the frequency cut-off for selection as a canonical
form. This indicates that our heuristic for selecting patterns as canonical forms could be
fine-tuned to yield an improvement in performance.
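The selection heuristic itself is defined earlier in the article; purely to illustrate the kind of frequency cut-off under discussion, the sketch below (ours) selects as canonical those syntactic patterns whose frequency stands out against the other patterns observed for the same verb+noun pair, with an illustrative z-score threshold that is our assumption rather than the authors' setting.

```python
# Illustrative frequency-based selection of canonical forms: a pattern is selected
# if its frequency is well above the mean frequency over the patterns observed for
# the verb+noun pair. The threshold value is an assumption made for illustration.

from statistics import mean, stdev

def select_canonical_forms(pattern_freqs, z_threshold=1.0):
    # pattern_freqs: dict mapping a syntactic pattern to its corpus frequency.
    freqs = list(pattern_freqs.values())
    if len(freqs) < 2:
        return set(pattern_freqs)
    mu, sigma = mean(freqs), stdev(freqs)
    if sigma == 0:
        return set(pattern_freqs)
    return {p for p, f in pattern_freqs.items() if (f - mu) / sigma > z_threshold}
```

Under such a cut-off, a moderately frequent pattern like make one's pile can fall just below the threshold, which is exactly the failure mode described above.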
For the expression pull plug, we identify its canonical form as pull the plug, but find a
mixture of literal and idiomatic usages in this form. However, many of the literal usages
are verb-particle constructions using out (pull the plug out), while many of the idiomatic
usages occur with a prepositional phrase headed by on (pull the plug on). This indi-
cates that incorporating information about particles and prepositions could improve
the quality of the canonical forms. Other syntactic categories, such as adjectives, may
also be informative in determining canonical forms for expressions which are typically
used idiomatically with words of a particular syntactic category, as in blow one’s own
trumpet.
Table 11
Macro-averaged accuracy (%Acc) and relative error rate reduction (%ERR) on the 23 expressions
in SKEWED-IDM and on the 37 expressions in the combination of TEST and SKEWED-IDM (ALL).
                                SKEWED-IDM                    ALL
Method                        %Acc      (%ERR)           %Acc      (%ERR)
Baseline                      97.9                       84.3
Unsupervised   CONTEXT        94.2      (−176.2)         83.3      (−6.4)
               CFORM          86.7      (−533.3)         81.3      (−19.1)
Supervised     SUP            97.9      (0.0)            92.1      (49.7)
7.2.4 Performance on Expressions with Skewed Distribution. Recall from Section 7.1.1 that,
from the original set of 60 candidate expressions, we excluded those that had fewer than
5 instances of either of their literal or idiomatic meanings. It is nonetheless important to
see how well our methods perform on such expressions. In this section, we thus report
the performance of our measures on the set of 23 expressions with mostly idiomatic
usages, referred to as SKEWED-IDM. Table 11 presents the macro-averaged accuracy of
our methods on these expressions. This table also shows the accuracy on all unseen test
expressions, that is, the combination of SKEWED-IDM and TEST, referred to as ALL, for
comparison.25
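For reference, the relative error rate reduction reported in Table 11 can be recomputed from the accuracies: it is the proportion of the baseline's error that a method removes (negative when the method introduces additional error). The helper below (ours) reproduces the reported values, for example 100 × (92.1 − 84.3)/(100 − 84.3) ≈ 49.7 for SUP on ALL.

```python
# Macro-averaged accuracy and relative error rate reduction (%ERR), as in Table 11.

def macro_accuracy(per_expression_accuracies):
    # Average of the per-expression accuracies, each expression weighted equally.
    return sum(per_expression_accuracies) / len(per_expression_accuracies)

def error_rate_reduction(acc, baseline_acc):
    # Percentage of the baseline's error rate eliminated by the method.
    return 100.0 * (acc - baseline_acc) / (100.0 - baseline_acc)

# e.g., on ALL: error_rate_reduction(92.1, 84.3) -> ~49.7  (SUP)
#               error_rate_reduction(83.3, 84.3) -> ~-6.4  (CONTEXT)
```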
On SKEWED-IDM, the supervised method performs as well as the baseline, whereas
both unsupervised methods perform worse.26 Note that for 19 out of the 23 expressions
in SKEWED-IDM, all instances are idiomatic, and the baseline accuracy is thus 100%. On
these, SUP also has 100% accuracy because no literal instances are available, and thus
SUP labels every token as idiomatic (same as the baseline). As for the unsupervised
methods, we can see that, unlike on TEST, the CONTEXT method outperforms CFORM
(the difference is statistically significant at p < .001). We saw previously that CONTEXT
tends to label usages as idiomatic. This bias might be partially responsible for the
better performance of CONTEXT on this data set. Moreover, we find that many of these
expressions tend to appear in a highly frequent canonical form, but also in less frequent
syntactic forms which we (perhaps incorrectly) consider as non-canonical forms. When
considering the performance on all unseen test expressions (ALL), neither unsupervised
method performs as well as the baseline, but the supervised method offers a substantial
improvement over the baseline.27
Our annotators pointed out that for many of the expressions in SKEWED-IDM,
either a literal interpretation was almost impossible (as for catch one’s imagination),
or extremely implausible (as for kick the habit). Hence, the annotators could predict
beforehand that the expression would be mainly used with an idiomatic meaning. A
semi-supervised approach that combines expert human knowledge with automatically
extracted corpus-drawn information can thus be beneficial for the task of identifying
idiomatic expressions in context. A human expert (e.g., a lexicographer) could first
filter out expressions for which a literal interpretation is highly unlikely. For the rest
of the expressions, a simple unsupervised method such as CFORM—that relies only on
automatically extracted information—can be used with reasonable accuracy.

25 In terms of percent accuracy, the results obtained with the various methods on the two excluded
expressions that are predominantly used literally are as follows. Baseline: 4.2, Unsupervised CONTEXT: 6.5,
Unsupervised CFORM: 16.2, Supervised: 43.5. However, because there are only two such expressions,
it is difficult to draw conclusions from these results, and we do not consider them further.
26 According to a paired t-test, on SKEWED-IDM, all the observed differences are statistically significant at
p < .05.
27 According to a paired t-test, on ALL, the differences between the supervised method and the three other
methods are statistically significant at p < .01; none of the other differences are statistically significant.
8. Related Work
8.1 Type-Based Recognition of Idioms and Other Multiword Expressions
Our work relates to previous studies on determining the compositionality (the inverse
of idiomaticity) of idioms and other multiword expressions (MWEs). Most previous
work on the compositionality of MWEs either treats them as collocations (Smadja 1993),
or examines the distributional similarity between the expression and its constituents
(Baldwin et al. 2003; Bannard, Baldwin, and Lascarides 2003; McCarthy, Keller, and
Carroll 2003). Others have identified MWEs by looking into specific linguistic cues,
such as the lexical fixedness of non-compositional MWEs (Lin 1999; Wermter and Hahn
2005), or the lexical flexibility of productive noun compounds (Lapata and Lascarides
2003). Venkatapathy and Joshi (2005) combine aspects of this work, by incorporating
lexical fixedness, distributional similarity, and collocation-based measures into a
set of features which are used to rank verb+noun combinations according to their
compositionality. Our work differs from such studies in that it considers various kinds
of fixedness as surface behaviors that are tightly related to the underlying semantic
idiosyncrasy (idiomaticity) of expressions. Accordingly, we propose novel methods
for measuring the degree of lexical, syntactic, and overall fixedness of verb+noun
combinations, and use these as indirect ways of measuring degree of idiomaticity.
Earlier research on the lexical encoding of idiom types mainly relied on the exis-
tence of human annotations, especially for detecting which syntactic variations (e.g.,
passivization) an idiom can undergo (Odijk 2004; Villavicencio et al. 2004). Evert, Heid,
and Spranger (2004) and Ritz and Heid (2006) propose methods for automatically
determining morphosyntactic preferences of idiomatic expressions. However, they treat
individual morphosyntactic markers (e.g., the number of the noun in a verb+noun
combination) as independent features, and rely mainly on the relative frequency of
each possible value for a feature (e.g., plural for number) as an indicator of a preference
for that value. If the relative frequency of a particular value of a feature for a given
combination (or the lower bound of the confidence interval, in the case of Evert, Heid,
and Spranger’s approach) is higher than a certain threshold, then the expression is
said to have a preference for that value. These studies recognize that morphosyntactic
preferences can be employed as clues to the identification of idiomatic combinations;
however, none proposes a systematic approach for such a task. Moreover, only subjec-
tive evaluations of the proposed methods are presented.
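As a schematic rendering (ours, with an assumed threshold) of the relative-frequency test described in the preceding paragraph, a combination is said to prefer a particular value of a morphosyntactic feature if that value accounts for more than a fixed share of the combination's occurrences.

```python
# Schematic relative-frequency preference test; the 0.8 threshold is an assumed
# value for illustration, not one taken from the cited work.

def has_preference(value_counts, value, threshold=0.8):
    # value_counts: dict mapping feature values (e.g., "singular", "plural")
    # to their observed frequencies for a given verb+noun combination.
    total = sum(value_counts.values())
    return total > 0 and value_counts.get(value, 0) / total > threshold
```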
Others have also drawn on the notion of syntactic fixedness for the detection
of idioms and other MWEs. Widdows and Dorow (2005), for example, look into the
fixedness of a highly constrained type of idiom, namely, those of the form “X conj X”
where X is a noun or an adjective, and conj is a conjunction such as and, or, but. Smadja
(1993) also notes the importance of syntactic fixedness in identifying strongly associated
multiword sequences, including collocations and idioms. Nonetheless, in both these
studies, the notion of syntactic fixedness is limited to the relative position of words
within the sequence. Such a general notion of fixedness does not take into account some
of the important syntactic properties of idioms (e.g., the choice of the determiner), and
hence cannot distinguish among different subtypes of MWEs which may differ on such
grounds. Our syntactic fixedness measure looks into a set of linguistically informed
patterns associated with a coherent, though large, class of idiomatic expressions. Results
presented in this article show that the fixedness measures can successfully separate
idioms from literal phrases. Corpus analysis of the measures proves that they can also
be used to distinguish idioms from other MWEs, such as light verb constructions and
collocations (Fazly and Stevenson 2007; Fazly and Stevenson, to appear). Bannard (2007)
proposes an extension of our syntactic fixedness measure—which first appeared in
Fazly and Stevenson (2006)—where he uses different prior distributions for different
syntactic variations.
Work on the identification of MWE types has also looked at evidence from another
language. For example, Melamed (1997a) assumes that non-compositional compounds
(NCCs) are usually not translated word-for-word to another language. He thus pro-
poses to discover NCCs by maximizing the information-theoretic predictive value of
a translation model between two languages. The sample extracted NCCs reveal an
important drawback of the proposed method: It relies on a translation model only,
without taking into account any prior linguistic knowledge about possible NCCs within
a language. Nonetheless, such a technique is capable of identifying many NCCs that are
relevant for a translation task. Villada Moirón and Tiedemann (2006) propose measures
for distinguishing idiomatic expressions from literal ones (in Dutch), by examining
their automatically generated translations into a second language, such as English or
Spanish. Their approach is based on the assumptions that idiomatic expressions tend
to have fewer predictable translations and fewer compositional meanings, compared
to the literal ones. The first property is measured as the diversity in the translations
for the expression, estimated using an entropy-based measure proposed by Melamed
(1997b). The non-compositionality of an expression is measured as the overlap between
the meaning of an expression (i.e., its translations) and those of its component words.
General approaches (such as those explained in the previous paragraph) may be
more easily extended to different domains and languages. Our measures incorporate
language-specific information about idiomatic expressions, thus extra work may be
required to extend and apply them to other languages and other expressions. (Though
see Van de Cruys and Villada Moirón [2007] for an extension of our measures to Dutch
idioms of the form verb plus prepositional phrase.) Nonetheless, because our measures
capture deep linguistic information, they are also expected to acquire more detailed
knowledge—for example, they can be used for identifying other classes of MWEs (Fazly
and Stevenson 2007).
8.2 Token-Based Identification of Idioms and Other Multiword Expressions
A handful of studies have focused on identifying idiomatic and non-idiomatic usages
(tokens) of words or MWEs. Birke and Sarkar (2006) propose a minimally supervised
algorithm for distinguishing between literal and non-literal usages of verbs in context.
Their algorithm uses seed sets of literal and non-literal usages that are automatically
extracted from online resources such as WordNet. The similarity between the context of
a target token and that of each seed set determines the class of the token. The approach is
general in that it uses a slightly modified version of an existing word sense disambigua-
tion algorithm. This is both an advantage and a drawback: The algorithm can be easily
extended to other parts of speech and other languages; however, such a general method
ignores the specific properties of non-literal (metaphorical and/or idiomatic) language.
Similarly, the supervised token classification method of Katz and Giesbrecht (2006)
relies primarily on the local context of a token, and fails to exploit specific linguistic
properties of non-literal language. Our results suggest that such properties are often
more informative than the local context, in determining the class of an MWE token.
The supervised classifier of Patrick and Fletcher (2005) distinguishes between com-
positional and non-compositional usages of English verb-particle constructions. Their
classifier incorporates linguistically motivated features, such as the degree of separation
between the verb and particle. Here, we focus on a different class of English MWEs,
namely, the class of idiomatic verb+noun combinations. Moreover, by making a more
direct use of their syntactic behavior, we develop unsupervised token classification
methods that perform well. The unsupervised token classifier of Hashimoto, Sato, and
Utsuro (2006) uses manually encoded information about allowable and non-allowable
syntactic transformations of Japanese idioms, which are roughly equivalent to our
notions of canonical and non-canonical forms. The rule-based classifier of Uchiyama,
Baldwin, and Ishizaki (2005) incorporates syntactic information about Japanese com-
pound verbs (JCVs), a type of MWE composed of two verbs. In both cases, although the
classifiers incorporate syntactic information about MWEs, their manual development
limits the scalability of the approaches.
Uchiyama, Baldwin, and Ishizaki (2005) also propose a statistical token classifica-
tion method for JCVs. This method is similar to ours, in that it also uses type-based
knowledge to determine the class of each token in context. However, their method is
supervised, whereas our methods are unsupervised. Moreover, Uchiyama, Baldwin,
and Ishizaki only evaluate their methods on a set of JCVs that are mostly monosemous.
Our main focus here is on MWEs that are harder to disambiguate, that is, those that
have two clear idiomatic and literal meanings, and that are frequently used with either
meaning.
9. Conclusions
The significance of the role idioms play in language has long been recognized; however,
due to their peculiar behavior, they have been mostly overlooked by researchers in
computational linguistics. In this work, we focus on a broadly documented and cross-
linguistically frequent class of idiomatic MWEs: those that involve the combination
of a verb and a noun in its direct object position, which we refer to as verb+noun
idiomatic combinations or VNICs. Although a great deal of research has focused on
non-compositionality of MWEs, less attention has been paid to other properties relevant
to their semantic idiosyncrasy, such as lexical and syntactic fixedness. Drawing on such
properties, we have developed techniques for the automatic recognition of VNIC types,
as well as methods for their token identification in context.
We propose techniques for the automatic acquisition and encoding of knowledge
about the lexicosyntactic behavior of idiomatic combinations. More specifically, we
propose novel statistical measures that quantify the degree of lexical, syntactic, and
overall fixedness of a verb+noun combination. We demonstrate that these measures
can be successfully applied to the task of automatically distinguishing idiomatic ex-
pressions (types) from non-idiomatic ones. Our results show that the syntactic and
overall fixedness measures substantially outperform existing measures of collocation
extraction, even when they incorporate some syntactic information. We put forward
an unsupervised means for automatically discovering the set of syntactic variations
that are preferred by a VNIC type (its canonical forms) and that should be included
in its lexical representation. In addition, we show that the canonical form extraction
method can effectively be used in identifying idiomatic and literal usages (tokens) of an
expression in context.
We have annotated a total of 2,465 tokens for 51 VNIC types according to whether
they are a literal or idiomatic usage. We found that for 28 expressions (1,180 tokens),
approximately 40% of the usages were literal. For the remaining 23 expressions (1,285
tokens), almost all usages were idiomatic. These figures indicate that automatically
determining whether a particular instance of an expression is used idiomatically or lit-
erally is of great importance for NLP applications. We have proposed two unsupervised
methods that perform such a task.
Our proposed methods incorporate automatically acquired knowledge about the
overall syntactic behavior of a VNIC type, in order to do token classification. More
specifically, our methods draw on the syntactic fixedness of VNICs—a property which
has been largely ignored in previous studies of MWE tokens. Our results confirm the
usefulness of this property as incorporated into our methods. On the 23 expressions
whose usages are predominantly idiomatic, none of the methods outperforms the baseline
because it is already very high. Nonetheless, as pointed out by our human annotators,
for many of these expressions it can be predicted beforehand that they are mainly
idiomatic and that a literal interpretation is impossible or highly implausible. On the
28 expressions with frequent literal usages, all our methods outperform the baseline of
always predicting the most dominant class (idiomatic). Moreover, on these, the accuracy
of our best unsupervised method is not substantially lower than the accuracy of a
standard supervised approach.
Appendix: Performance on the Individual Expressions
This Appendix contains the values of the four performance measures, Sens, PPV, Spec,
and NPV, for our two unsupervised methods (i.e., CFORM and CONTEXT) as well as for
the supervised method, SUP, on individual expressions in DTIhigh and DTIlow. Expressions
(verb–noun pairs) in each data set are ordered alphabetically.
Table 12
Performance of CFORM on individual expressions in DTIhigh and DTIlow.

Data Set   verb–noun        Sens (Ridm)   PPV (Pidm)   Spec (Rlit)   NPV (Plit)
DTIhigh    blow top         1.00          0.92         0.60          1.00
           blow trumpet     0.89          0.89         0.80          0.80
           cut figure       0.97          0.97         0.86          0.86
           find foot        0.98          0.92         0.20          0.50
           get nod          0.96          1.00         1.00          0.75
           get sack         1.00          0.96         0.71          1.00
           have word        0.56          0.96         0.78          0.17
           hit road         1.00          0.80         0.14          1.00
           hit roof         1.00          0.65         0.00          0.00
           kick heel        1.00          0.81         0.12          1.00
           lose thread      0.94          0.94         0.50          0.50
           make face        0.74          0.95         0.67          0.22
           make mark        0.85          1.00         1.00          0.54
           pull plug        0.89          0.77         0.40          0.62
           pull punch       0.83          0.94         0.75          0.50
           pull weight      1.00          0.93         0.67          1.00
           take heart       1.00          0.97         0.88          1.00
DTIlow     blow whistle     0.93          0.44         0.37          0.90
           get wind         0.85          0.73         0.75          0.86
           hit wall         0.86          0.11         0.09          0.83
           hold fire        1.00          0.37         0.25          1.00
           lose head        0.76          0.62         0.41          0.58
           make hay         1.00          0.56         0.12          1.00
           make hit         1.00          0.71         0.78          1.00
           make pile        0.25          0.14         0.29          0.45
           make scene       0.82          0.68         0.45          0.64
           pull leg         0.64          0.23         0.40          0.80
           see star         0.80          0.10         0.38          0.95
Table 13
Performance of CONTEXT on individual expressions in DTIhigh and DTIlow.

Data Set   verb–noun        Sens (Ridm)   PPV (Pidm)   Spec (Rlit)   NPV (Plit)
DTIhigh    blow top         1.00          0.85         0.20          1.00
           blow trumpet     0.89          0.74         0.40          0.67
           cut figure       1.00          0.84         0.00          0.00
           find foot        1.00          0.90         0.00          0.00
           get nod          1.00          0.88         0.00          0.00
           get sack         1.00          0.86         0.00          0.00
           have word        0.70          0.95         0.67          0.20
           hit road         1.00          0.77         0.00          0.00
           hit roof         1.00          0.65         0.00          0.00
           kick heel        0.97          0.78         0.00          0.00
           lose thread      1.00          0.90         0.00          0.00
           make face        0.85          0.88         0.00          0.00
           make mark        1.00          0.91         0.46          1.00
           pull plug        0.96          0.69         0.05          0.33
           pull punch       0.94          0.89         0.50          0.67
           pull weight      1.00          0.82         0.00          0.00
           take heart       0.90          0.85         0.38          0.50
DTIlow     blow whistle     0.89          0.36         0.18          0.75
           get wind         0.85          0.65         0.62          0.83
           hit wall         1.00          0.11         0.00          0.00
           hold fire        1.00          0.30         0.00          0.00
           lose head        0.90          0.56         0.12          0.50
           make hay         0.78          0.50         0.12          0.33
           make hit         0.60          0.38         0.44          0.67
           make pile        0.50          0.25         0.29          0.56
           make scene       0.96          0.66         0.30          0.86
           pull leg         0.82          0.22         0.20          0.80
           see star         1.00          0.12         0.32          1.00
Table 14
Performance of SUP on individual expressions in DTIhigh and DTIlow.

Data Set   verb–noun        Sens (Ridm)   PPV (Pidm)   Spec (Rlit)   NPV (Plit)
DTIhigh    blow top         1.00          0.85         0.20          1.00
           blow trumpet     0.95          0.72         0.30          0.75
           cut figure       1.00          0.84         0.00          0.00
           find foot        1.00          0.90         0.00          0.00
           get nod          0.91          0.91         0.33          0.33
           get sack         1.00          0.86         0.00          0.00
           have word        1.00          0.90         0.00          0.00
           hit road         1.00          0.80         0.14          1.00
           hit roof         0.82          0.64         0.17          0.33
           kick heel        0.97          0.78         0.00          0.00
           lose thread      1.00          0.95         0.50          1.00
           make face        1.00          0.96         0.67          1.00
           make mark        1.00          0.91         0.46          1.00
           pull plug        0.98          0.90         0.75          0.94
           pull punch       1.00          0.90         0.50          1.00
           pull weight      1.00          0.82         0.00          0.00
           take heart       0.93          0.83         0.25          0.50
DTIlow     blow whistle     0.52          0.78         0.92          0.78
           get wind         0.77          0.71         0.75          0.80
           hit wall         0.00          0.00         1.00          0.89
           hold fire        0.00          0.00         0.88          0.67
           lose head        0.48          0.62         0.65          0.50
           make hay         0.89          0.80         0.75          0.86
           make hit         0.40          1.00         1.00          0.75
           make pile        0.38          0.75         0.94          0.76
           make scene       0.89          0.69         0.45          0.75
           pull leg         0.55          0.75         0.95          0.88
           see star         0.00          0.00         1.00          0.92
Acknowledgments
This article is an extended and updated
combination of two papers that appeared,
respectively, in the proceedings of EACL
2006 and the proceedings of the ACL 2007
Workshop on A Broader Perspective on
Multiword Expressions. We wish to thank
the anonymous reviewers of those papers
for their helpful recommendations. We also
thank the anonymous reviewers of this
article for their insightful comments which
we believe have helped us improve the
quality of the work. We are grateful to Eric
Joanis for providing us with the NP-head
extraction software, and to Afra Alishahi
and Vivian Tsang for proofreading the
manuscript. Our work is financially
supported by the Natural Sciences and
Engineering Research Council of Canada,
the Ontario Graduate Scholarship program,
and the University of Toronto.
References
Abeillé, Anne. 1995. The flexibility of French
idioms: A representation with lexicalized
Tree Adjoining Grammar. In Everaert
et al., editors, Idioms: Structural and
Psychological Perspectives. LEA, Mahwah,
NJ, pages 15–42.
Akimoto, Minoji. 1999. Collocations and
idioms in Late Modern English. In L. J.
Brinton and M. Akimoto. Collocational and
Idiomatic Aspects of Composite Predicates in
the History of English. John Benjamins
Publishing Company, Amsterdam,
pages 207–238.
Baldwin, Timothy, Colin Bannard, Takaaki
Tanaka, and Dominic Widdows. 2003. An
empirical model of multiword expression
decomposability. In Proceedings of the
ACL-SIGLEX Workshop on Multiword
Expressions: Analysis, Acquisition and
Treatment, pages 89–96, Sapporo.
Bannard, Colin. 2007. A measure of syntactic
flexibility for automatically identifying
multiword expressions in corpora. In
Proceedings of the ACL’07 Workshop on a
Broader Perspective on Multiword
Expressions, pages 1–8, Prague.
Bannard, Colin, Timothy Baldwin, and
Alex Lascarides. 2003. A statistical
approach to the semantics of
verb-particles. In Proceedings of the
ACL-SIGLEX Workshop on Multiword
Expressions: Analysis, Acquisition and
Treatment, pages 65–72, Sapporo.
Birke, Julia and Anoop Sarkar. 2006. A
clustering approach for the nearly
unsupervised recognition of nonliteral
language. In Proceedings of the 11th
Conference of the European Chapter of the
Association for Computational Linguistics
(EACL’06), pages 329–336, Trento.
Burnard, Lou. 2000. Reference Guide for the
British National Corpus (World Edition),
second edition. Available at www.natcorp.
ox.ac.uk.
Cacciari, Cristina. 1993. The place of idioms
in a literal and metaphorical world. In C.
Cacciari and P. Tabossi, Idioms: Processing,
Structure, and Interpretation. LEA, Mahwah,
NJ, pages 27–53.
Church, Kenneth, William Gale, Patrick
Hanks, and Donald Hindle. 1991. Using
statistics in lexical analysis. In Uri Zernik,
editor, Lexical Acquisition: Exploiting
On-Line Resources to Build a Lexicon. LEA,
Mahwah, NJ, pages 115–164.
Claridge, Claudia. 2000. Multi-word Verbs in
Early Modern English: A Corpus-based Study.
Editions Rodopi B. V., Amsterdam.
Clark, Eve V. 1978. Discovering what words
can do. Papers from the Parasession on the
Lexicon, 14:34–57.
Cohen, Jacob. 1960. A coefficient of
agreement for nominal scales. Educational
and Psychological Measurement, 20:37–46.
Collins, Michael. 1999. Head-Driven Statistical
Models for Natural Language Parsing. Ph.D.
thesis, University of Pennsylvania.
Cook, Paul, Afsaneh Fazly, and Suzanne
Stevenson. 2007. Pulling their weight:
Exploiting syntactic forms for the
automatic identification of idiomatic
expressions in context. In Proceedings of the
ACL’07 Workshop on a Broader Perspective on
Multiword Expressions, pages 41–48,
Prague.
Copestake, Ann, Fabre Lambeau, Aline
Villavicencio, Francis Bond, Timothy
Baldwin, Ivan A. Sag, and Dan Flickinger.
2002. Multiword expressions: Linguistic
precision and reusability. In Proceedings of
the 4th International Conference on Language
Resources and Evaluation (LREC’02),
pages 1941–47, Las Palmas.
Cover, Thomas M. and Joy A. Thomas. 1991.
Elements of Information Theory. John Wiley
and Sons, Inc., New York.
Cowie, Anthony P., Ronald Mackin, and
Isabel R. McCaig. 1983. Oxford Dictionary of
Current Idiomatic English, volume 2. Oxford
University Press.
Dagan, Ido, Fernando Pereira, and Lillian
Lee. 1994. Similarity-based estimation of
word co-occurrence probabilities. In
Proceedings of the 32nd Annual Meeting of the
Association for Computational Linguistics
(ACL’94), pages 272–278, Las Cruces, NM.
d’Arcais, Giovanni B. Flores. 1993. The
comprehension and semantic
interpretation of idioms. In C. Cacciari and
P. Tabossi, Idioms: Processing, Structure, and
Interpretation. LEA, Mahwah, NJ,
pages 79–98.
Desbiens, Marguerite Champagne and Mara
Simon. 2003. Déterminants et locutions
verbales. Manuscript. Available at
www.er.uqam.ca/nobel/scilang/cesla02/
mara margue.pdf.
Evert, Stefan, Ulrich Heid, and Kristina
Spranger. 2004. Identifying
morphosyntactic preferences in
collocations. In Proceedings of the 4th
International Conference on Language
Resources and Evaluation (LREC’04),
pages 907–910, Lisbon.
Evert, Stefan and Brigitte Krenn. 2001.
Methods for the qualitative evaluation of
lexical association measures. In Proceedings
of the 39th Annual Meeting of the Association
for Computational Linguistics (ACL’01),
pages 188–195, Toulouse.
Fazly, Afsaneh and Suzanne Stevenson. 2006.
Automatically constructing a lexicon of
verb phrase idiomatic combinations. In
Proceedings of the 11th Conference of the
European Chapter of the Association for
Computational Linguistics (EACL’06),
pages 337–344, Trento.
Fazly, Afsaneh and Suzanne Stevenson. 2007.
Distinguishing subtypes of multiword
expressions using linguistically-motivated
statistical measures. In Proceedings of the
ACL’07 Workshop on a Broader Perspective
on Multiword Expressions, pages 9–16,
Prague.
Fazly, Afsaneh and Suzanne Stevenson. A
distributional account of the semantics of
multiword expressions. To appear in the
Italian Journal of Linguistics.
Fellbaum, Christiane. 1993. The determiner
in English idioms. In C. Cacciari and
P. Tabossi, Idioms: Processing, Structure,
and Interpretation. LEA, Mahwah, NJ,
pages 271–295.
Fellbaum, Christiane, editor. 1998. WordNet,
An Electronic Lexical Database. MIT Press,
Cambridge, MA.
Fellbaum, Christiane. 2002. VP idioms in the
lexicon: Topics for research using a very
large corpus. In Proceedings of the
KONVENS 2002 Conference, pages 7–11,
Saarbruecken, Germany.
Fellbaum, Christiane. 2007. The ontological
loneliness of idioms. In Andrea Schalley
and Dietmar Zaefferer, editors,
Ontolinguistics. Mouton de Gruyter, Berlin,
pages 419–434.
Firth, John R. 1957. A synopsis of linguistic
theory 1930–1955. In Studies in Linguistic
Analysis (special volume of the Philological
Society). The Philological Society, Oxford,
pages 1–32.
Fraser, Bruce. 1970. Idioms within a
transformational grammar. Foundations of
Language, 6:22–42.
Gentner, Dedre and Ilene M. France. 2004.
The verb mutability effect: Studies of the
combinatorial semantics of nouns and
verbs. In Steven L. Small, Garrison W.
Cottrell, and Michael K. Tanenhaus,
editors, Lexical Ambiguity Resolution:
Perspectives from Psycholinguistics,
Neuropsychology, and Artificial Intelligence.
Kaufmann, San Mateo, CA, pages 343–382.
Gibbs, Raymond W. Jr. 1993. Why idioms are
not dead metaphors. In C. Cacciari and
P. Tabossi, Idioms: Processing, Structure, and
Interpretation. LEA, Mahwah, NJ,
pages 57–77.
Gibbs, Raymond W. Jr. 1995. Idiomaticity
and human cognition. In Everaert et al.,
editors, Idioms: Structural and Psychological
Perspectives. LEA, Mahwah, NJ,
pages 97–116.
Gibbs, Raymond W. Jr. and Nandini P.
Nayak. 1989. Psychololinguistic studies on
the syntactic behavior of idioms. Cognitive
Psychology, 21:100–138.
Gibbs, Raymond W. Jr., Nandini P. Nayak,
J. Bolton, and M. Keppel. 1989. Speaker’s
assumptions about the lexical flexibility
of idioms. Memory and Cognition,
17:58–68.
Glucksberg, Sam. 1993. Idiom meanings and
allusional content. In C. Cacciari and P.
Tabossi, Idioms: Processing, Structure, and
Interpretation. LEA, Mahwah, NJ,
pages 3–26.
Goldberg, Adele E. 1995. Constructions: A
Construction Grammar Approach to
Argument Structure. The University of
Chicago Press.
Grant, Lynn E. 2005. Frequency of ‘core
idioms’ in the British National Corpus
(BNC). International Journal of Corpus
Linguistics, 10(4):429–451.
Hashimoto, Chikara, Satoshi Sato, and
Takehito Utsuro. 2006. Japanese idiom
recognition: Drawing a line between
literal and idiomatic meanings. In
Proceedings of the 17th International
Conference on Computational Linguistics
and the 36th Annual Meeting of the
Association for Computational Linguistics
(COLING-ACL’06), pages 353–360, Sydney.
Inkpen, Diana. 2003. Building a Lexical
Knowledge-Base of Near-Synonym Differences.
Ph.D. thesis, University of Toronto.
Jackendoff, Ray. 1997. The Architecture of the
Language Faculty. MIT Press, Cambridge,
MA.
Katz, Graham and Eugenie Giesbrecht. 2006.
Automatic identification of
non-compositional multi-word
expressions using Latent Semantic
Analysis. In Proceedings of the ACL’06
Workshop on Multiword Expressions:
Identifying and Exploiting Underlying
Properties, pages 12–19, Sydney.
Katz, Jerrold J. 1973. Compositionality,
idiomaticity, and lexical substitution. In
S. Anderson and P. Kiparsky, editors, A
Festschrift for Morris Halle. Holt, Rinehart
and Winston, New York, pages 357–376.
Kearns, Kate. 2002. Light verbs in English.
Manuscript. Available at www.ling.
canterbury.ac.nz/people/kearns.html.
Kirkpatrick, E. M. and C. M. Schwarz,
editors. 1982. Chambers Idioms. W & R
Chambers Ltd, Edinburgh.
Krenn, Brigitte and Stefan Evert. 2001. Can
we do better than frequency? A case study
on extracting PP-verb collocations. In
Proceedings of the ACL’01 Workshop on
Collocations, pages 39–46, Toulouse.
Kytö, Merja. 1999. Collocational and
idiomatic aspects of verbs in Early Modern
English. In L. J. Brinton and M. Akimoto.
Collocational and Idiomatic Aspects of
Composite Predicates in the History of
English. John Benjamins Publishing
Company, Amsterdam, pages 167–206.
Lapata, Mirella and Alex Lascarides. 2003.
Detecting novel compounds: The role of
distributional evidence. In Proceedings of
the 11th Conference of the European Chapter of
the Association for Computational Linguistics
(EACL’03), pages 235–242, Budapest.
Lin, Dekang. 1998. Automatic retrieval and
clustering of similar words. In Proceedings
of the 17th International Conference on
Computational Linguistics and the 36th
Annual Meeting of the Association for
Computational Linguistics
(COLING-ACL’98), pages 768–774,
Montreal.
Lin, Dekang. 1999. Automatic identification
of non-compositional phrases. In
Proceedings of the 37th Annual Meeting of the
Association for Computational Linguistics
(ACL’99), pages 317–324, College Park,
Maryland.
Manning, Christopher D. and Hinrich
Schütze. 1999. Foundations of Statistical
Natural Language Processing. The MIT
Press, Cambridge, MA.
McCarthy, Diana, Bill Keller, and John
Carroll. 2003. Detecting a continuum
of compositionality in phrasal verbs.
In Proceedings of the ACL-SIGLEX
Workshop on Multiword Expressions:
Analysis, Acquisition and Treatment,
pages 73–80, Sapporo.
Melamed, I. Dan. 1997a. Automatic
discovery of non-compositional
compounds in parallel data. In Proceedings
of the 2nd Conference on Empirical Methods in
Natural Language Processing (EMNLP’97),
pages 97–108, Providence, RI.
Melamed, I. Dan. 1997b. Measuring semantic
entropy. In Proceedings of the ACL-SIGLEX
Workshop on Tagging Text with Lexical
Semantics: Why, What and How,
pages 41–46, Washington, DC.
Mohammad, Saif and Graeme Hirst.
Distributional measures as proxies for
semantic relatedness. Submitted.
Moon, Rosamund. 1998. Fixed Expressions and
Idioms in English: A Corpus-Based Approach.
Oxford University Press.
Newman, John and Sally Rice. 2004. Patterns
of usage for English SIT, STAND, and LIE:
A cognitively inspired exploration in
corpus linguistics. Cognitive Linguistics,
15(3):351–396.
Nicolas, Tim. 1995. Semantics of idiom
modification. In Everaert et al., editors,
Idioms: Structural and Psychological
Perspectives. LEA, Mahwah, NJ,
pages 233–252.
Nunberg, Geoffrey, Ivan A. Sag, and Thomas
Wasow. 1994. Idioms. Language,
70(3):491–538.
Odijk, Jan. 2004. A proposed standard for the
lexical representations of idioms. In
Proceedings of Euralex’04, pages 153–164,
Lorient.
Ogden, Charles Kay. 1968. Basic English,
International Second Language. Harcourt,
Brace, and World, New York.
Patrick, Jon and Jeremy Fletcher. 2005.
Classifying verb-particle constructions
by verb arguments. In Proceedings of
the Second ACL-SIGSEM Workshop on the
Linguistic Dimensions of Prepositions and
their Use in Computational Linguistics
Formalisms and Applications, pages 200–209,
Colchester.
Pauwels, Paul. 2000. Put, Set, Lay and Place: A
Cognitive Linguistic Approach to Verbal
Meaning. LINCOM EUROPA, Munich.
R 2004. Notes on R: A Programming
Environment for Data Analysis and Graphics.
Available at www.r-project.org.
Resnik, Philip. 1999. Semantic similarity in a
taxonomy: An information-based measure
and its application to problems of
ambiguity in natural language. Journal of
Artificial Intelligence Research (JAIR),
(11):95–130.
Riehemann, Susanne. 2001. A Constructional
Approach to Idioms and Word Formation.
Ph.D. thesis, Stanford University.
Ritz, Julia and Ulrich Heid. 2006. Extraction
tools for collocations and their
morphosyntactic specificities. In
Proceedings of the 5th International
Conference on Language Resources and
Evaluation (LREC’06), pages 1925–30,
Genoa.
Rohde, Douglas L. T. 2004. TGrep2 User
Manual. Available at http://tedlab.mit.edu/~dr/Tgrep2.
Sag, Ivan A., Timothy Baldwin, Francis
Bond, Ann Copestake, and Dan Flickinger.
2002. Multiword expressions: A pain in the
neck for NLP. In Proceedings of the 3rd
International Conference on Intelligent Text
Processing and Computational Linguistics
(CICLing’02), pages 1–15, Mexico City.
Schenk, André. 1995. The syntactic behavior
of idioms. In Everaert et al., editors, Idioms:
Structural and Psychological Perspectives.
LEA, Mahwah, NJ, chapter 10,
pages 253–271.
Seaton, Maggie and Alison Macaulay,
editors. 2002. Collins COBUILD Idioms
Dictionary. HarperCollins Publishers,
second edition, New York.
Smadja, Frank. 1993. Retrieving collocations
from text: Xtract. Computational Linguistics,
19(1):143–177.
Tanabe, Harumi. 1999. Composite predicates
and phrasal verbs in The Paston Letters. In
L. J. Brinton and M. Akimoto. Collocational
and Idiomatic Aspects of Composite Predicates
in the History of English. John Benjamins
Publishing Company, Amsterdam,
pages 97–132.
Uchiyama, Kiyoko, Timothy Baldwin, and
Shun Ishizaki. 2005. Disambiguating
Japanese compound verbs. Computer
Speech and Language, 19:497–512.
Van de Cruys, Tim and Begoña
Villada Moirón. 2007. Semantics-based
multiword expression extraction. In
Proceedings of the ACL’07 Workshop on a
Broader Perspective on Multiword
Expressions, pages 25–32, Prague.
Venkatapathy, Sriram and Aravind Joshi. 2005.
Measuring the relative compositionality of
verb-noun (V-N) collocations by
integrating features. In Proceedings of Joint
Conference on Human Language Technology
and Empirical Methods in Natural Language
Processing (HLT-EMNLP’05),
pages 899–906, Vancouver.
Villada Moirón, Begoña and Jörg Tiedemann.
2006. Identifying idiomatic expressions
using automatic word-alignment. In
Proceedings of the EACL’06 Workshop on
Multiword Expressions in a Multilingual
Context, pages 33–40, Trento.
Villavicencio, Aline, Ann Copestake,
Benjamin Waldron, and Fabre Lambeau.
2004. Lexical encoding of multiword
expressions. In Proceedings of the 2nd ACL
Workshop on Multiword Expressions:
Integrating Processing, pages 80–87,
Barcelona.
Wermter, Joachim and Udo Hahn. 2005.
Paradigmatic modifiability statistics for
the extraction of complex multi-word
terms. In Proceedings of Joint Conference on
Human Language Technology and Empirical
Methods in Natural Language Processing
(HLT-EMNLP’05), pages 843–850,
Vancouver.
Widdows, Dominic and Beate Dorow. 2005.
Automatic extraction of idioms using
graph analysis and asymmetric
lexicosyntactic patterns. In Proceedings of
ACL’05 Workshop on Deep Lexical
Acquisition, pages 48–56, Ann Arbor, MI.
Wilcoxon, Frank. 1945. Individual
comparisons by ranking methods.
Biometrics Bulletin, 1(6):80–83.