Word Sense Clustering and Clusterability
Diana McCarthy∗
University of Cambridge
Marianna Apidianaki∗∗
LIMSI, CNRS, Université Paris-Saclay
Katrin Erk†
University of Texas at Austin
Word sense disambiguation and the related field of automated word sense induction tradi-
tionally assume that the occurrences of a lemma can be partitioned into senses. But this seems
to be a much easier task for some lemmas than others. Our work builds on recent work that
proposes describing word meaning in a graded fashion rather than through a strict partition into
senses; in this article we argue that not all lemmas may need the more complex graded analysis,
depending on their partitionability. Although there is plenty of evidence from previous studies
and from the linguistics literature that there is a spectrum of partitionability of word meanings,
this is the first attempt to measure the phenomenon and to couple the machine learning literature
on clusterability with word usage data used in computational linguistics.
We propose to operationalize partitionability as clusterability, a measure of how easy the
occurrences of a lemma are to cluster. We test two ways of measuring clusterability: (1) existing
measures from the machine learning literature that aim to measure the goodness of optimal
k-means clusterings, and (2) the idea that if a lemma is more clusterable, two clusterings
based on two different “views” of the same data points will be more congruent. The two views
that we use are two different sets of manually constructed lexical substitutes for the target
lemma, on the one hand monolingual paraphrases, and on the other hand translations. We apply
automatic clustering to the manual annotations. We use manual annotations because we want
the representations of the instances that we cluster to be as informative and “clean” as possible.
We show that when we control for polysemy, our measures of clusterability tend to correlate
with partitionability, in particular some of the type-(1) clusterability measures, and that these
measures outperform a baseline that relies on the amount of overlap in a soft clustering.
∗ Department of Theoretical and Applied Linguistics, University of Cambridge, UK.
E-mail: diana@dianamccarthy.co.uk.
∗∗ LIMSI, CNRS, Université Paris-Saclay, France. E-mail: marianna.apidianaki@limsi.fr.
† Department of Linguistics, University of Texas at Austin, USA. E-mail: katrin.erk@mail.utexas.edu.
Submission received: 13 June 2014; revised version received: 3 August 2015; accepted for publication:
25 January 2016.
doi:10.1162/COLI_a_00247
© 2016 Association for Computational Linguistics
1. Introduction
In computational linguistics, the field of word sense disambiguation (WSD)—where
a computer selects the appropriate sense from an inventory for a word in a given
context—has received considerable attention.1 Initially, most work focused on manually
constructed inventories such as WordNet (Fellbaum 1998) but there has subsequently
been a great deal of work on the related field of word sense induction (WSI) (Pedersen
2006; Manandhar et al. 2010; Jurgens and Klapaftis 2013) prior to disambiguation. This
article concerns the phenomenon of word meaning and current practice in the fields of
WSD and WSI.
Computational approaches to determining word meaning in context have tradi-
tionally relied on a fixed sense inventory produced by humans or by a WSI system that
groups token instances into hard clusters. Either sense inventory can then be applied
to tag sentences on the premise that there will be one best-fitting sense for each token
instance. However, word meanings do not always take the form of discrete senses but
vary on a continuum between clear-cut ambiguity and vagueness (Tuggy 1993). For
example, the noun crane is a clear-cut case of ambiguity between lifting device and bird,
whereas the exact meaning of the noun thing can only be retrieved via the context of use
rather than via a representation in the mental lexicon of speakers. Cases of polysemy
such as the verb paint, which can mean painting a picture, decorating a room, or painting
a mural on a house, lie somewhere between these two poles. Tuggy highlights the
fact that boundaries between these different categories are blurred. Although specific
context clearly plays a role (Copestake and Briscoe 1995; Passonneau et al. 2010) some
lemmas are inherently much harder to partition than others (Kilgarriff 1998; Cruse
2000). There are recent attempts to address some of these issues by using alternative
characterizations of word meaning that do not involve creating a partition of usages
into senses (McCarthy and Navigli 2009; Erk, McCarthy, and Gaylord 2013), and by
asking WSI systems to produce soft or graded clusterings (Jurgens and Klapaftis 2013)
where tokens can belong to a mixture of the clusters. However, these approaches do not
overtly consider the location of a lemma on the continuum, but doing so should help in
determining an appropriate representation. Whereas the broad senses of the noun crane
could easily be represented by a hard clustering, this would not make any sense for the
noun thing; meanwhile, the verb paint might benefit from a more graded representation.
In this article, we propose the notion of partitionability of a lemma, that is, the ease
with which usages can be grouped into senses. We exploit data from annotation studies
to explore the partitionability of different lemmas and see where on the ambiguity–
vagueness cline a lemma is. This should be useful in helping to determine the appro-
priate computational representation for a word’s meanings—for example, whether a
hard clustering will suffice, whether a soft clustering would be more appropriate, O
whether a clustering representation does not make sense. To our knowledge, there has
been no study on detecting partitionability of word senses.
We operationalize partitionability as clusterability, a measure of how much struc-
ture there is in the data and therefore how easy it is to cluster (Ackerman and Ben-David
2009UN), and test to what extent clusterability can predict partitionability. For deriving a
gold estimate of partitionability, we turn to the Usage Similarity (hereafter Usim) data
set (Erk, McCarthy, and Gaylord 2009), for which annotators have rated the similarity of
1 See McCarthy (2009) for a high-level overview, Navigli (2009) for a detailed summary, and Agirre and
Edmonds (2006) for further background and discussion.
pairs of instances of a word using a graded scale (an example is given in Section 2.2). We
use inter-annotator agreement (IAA) on this data set as an indication of partitionability.
Passonneau et al. (2010) demonstrated that IAA is correlated with sense confusability.
Because this data set consists of similarity judgments on a scale, rather than annotation
with traditional word senses, it gives rise to a second indication of partitionability: we
can use the degree to which annotators have used intermediate points on a scale, which
indicate that two instances are neither identical in meaning nor completely different,
but somewhat related.
We want to know to what extent measures of clusterability of instances can predict
the partitionability of a lemma. As our focus in this article is to test the predictive
power of clusterability measures in the best possible case, we want the representations
of the instances that we cluster to be as informative and “clean” as possible. For this
reason, we represent instances through manually annotated translations (Mihalcea,
Sinha, and McCarthy 2010) and paraphrases (McCarthy and Navigli 2007). Both
translations (Resnik and Yarowsky 2000; Carpuat and Wu 2007; Apidianaki 2008) E
monolingual paraphrases (Yuret 2007; Biemann and Nygaard 2010; Apidianaki, Verzeni,
and McCarthy 2014) have previously been used as a way of inducing word senses, so
they should be well suited for the task. Since the suggestion by Resnik and Yarowsky
(1997) to limit WSD to senses lexicalized in other languages, numerous works have
exploited translations for semantic analysis. Dyvik (1998) discovers word senses and
their relationships through translations in a parallel corpus and Ide, Erjavec, and Tufis¸
(2002) group the occurrences of words into senses by using translation vectors built
from a multilingual corpus. More recent works focus on discovering the relationships
between the translations and grouping them into clusters either automatically (Bannard
and Callison-Burch 2005; Apidianaki 2009; Bansal, DeNero, and Lin 2012) or manually
(Lefever and Hoste 2010). McCarthy (2011) shows that overlap of translations compared
to overlap of paraphrases on sentence pairs for a given lemma are correlated with inter-
annotator agreement of graded lemma usage similarity judgments (Erk, McCarthy, E
Gaylord 2009) but does not attempt to cluster the translation or paraphrase data or
examine the findings in terms of clusterability. In this initial study of the clusterability
phenomenon, we represent instances through translation and paraphrase annotations;
in future work, we will move to automatically generated instance representations.
There is a small amount of work on clusterability in the area of machine learn-
ing theory (Epter, Krishnamoorthy, and Zaki 1999; Zhang 2001; Ostrovsky et al. 2006;
Ackerman and Ben-David 2009a), and all existing measures are based on k-means
clustering. Two of them (variance ratio and worst pair ratio) test how tight the clusters
are and how far different clusters are from each other (Epter, Krishnamoorthy, and Zaki
1999; Zhang 2001), and one (separability) tests how much the value of the objective
function changes as the number k of clusters changes (Ostrovsky et al. 2006). We test
all three of these intra-clustering (hereafter intra-clust) measures of clusterability. In
addition, we test the intuition that for a well-clusterable lemma, the clusterings based
on two different “views” of the same data points—in our case, a clustering based on
monolingual paraphrases and a clustering based on translations—should be similar.
For this inter-clustering (inter-clust) notion of clusterability, we use a simple graphical
method that does not require a prespecified number of clusters.
We use this same graphical clustering to provide the k for our intra-clust measures
because the existing definitions of clusterability from machine learning theory need
the number of clusters to be fixed in advance. There are a vast number of clustering
algorithms with which we could experiment. The clustering algorithm itself is not being
evaluated here. Instead, the hypothesis is that if a data set is more clusterable, then it
should be computationally easier to cluster (Ackerman and Ben-David 2009b) because
the structure in the data is more obvious, so any reasonable algorithm should be able to
partition the data to reflect that structure. We contrast the performance of the three intra-
clust measures and the inter-clust measure with a simplistic baseline that relies on the
amount of overlapping items in a soft clustering of the instance data, since such a
baseline would be immediately available if one applied soft clustering to all lemmas.
We show that when controlling for polysemy, our indicators of higher clusterability
tend to correlate with our two gold standard partitionability estimates. In particular,
clusterability tends to correlate positively with higher inter-annotator agreement and
negatively with a greater proportion of mid-range judgments on a graded scale of
instance similarity. Although all our measures show some positive results, it is the intra-
clust measures (particularly two of these) that are most promising.
2. Characterizing Word Meaning
2.1 The Difficulty of Characterizing Word Meaning
There has been an enormous amount of work in the fields of WSD and WSI relying on
a fixed inventory of senses and on the assumption of a single best sense for a given
instance (for example, see the large body of work described in Navigli [2009]) although
doubts have been expressed about this methodology when looking at the linguistic
data (Kilgarriff 1998; Hanks 2000; Kilgarriff 2006). One major issue arises from the fact
that there is a spectrum of word meaning phenomena (Tuggy 1993) from clear-cut cases
of ambiguity where meanings are distinct and separable, to cases where meanings are
intertwined (highly interrelated) (Cruse 2000; Kilgarriff 1998), to cases of vagueness at
the other extreme where meanings are underspecified. Per esempio, at the ambiguous
end of the spectrum are words like bank (noun) with the distinct senses of financial in-
stitution and side of a river. In such cases, it is relatively straightforward to differentiate
corpus examples and come up with clear definitions for a dictionary or other lexical
resource.2 These clearly ambiguous words are commonplace in articles promoting WSD
because the ambiguity is evident and the need to resolve it is compelling. On the other
end of the spectrum are cases where meaning is unspecified (vague); for example, Tuggy
gives the example that aunt can be father’s sister or mother’s sister. There may be no
contextual evidence to determine the intended reading and this does not trouble hearers
and should not trouble computers (the exact meaning can be left unspecified). Cases of
polysemy are somewhere in between. Examples from Tuggy include the noun set (a
chess set, a set in tennis, a set of dishes, and a set in logic) and the verb break (a stick, a
law, a horse, water, ranks, a code, and a record), each having many connections between
the related senses. Although it is assumed in many cases that one meaning has spawned
the other by a metaphorical process (Lakoff 1987)—for example, the mouth of a river
from the mouth of a person—the process is not always transparent and neither is the
point at which the spawned meaning takes an independent existence.
From the linguistics literature, it seems that the boundaries on this continuum
are not clear-cut and tests aimed at distinguishing the different categories are not
definitive (Cruse 2000). Meanwhile, in computational linguistics, researchers point to
2 Different etymology can help in determining such homonymous cases where several meanings have
coincidentally ended up having the same word form, but there are many cases where etymologically
related meanings are just as distinct to speakers (Ide and Wilks 2006).
there being differences in distinguishing meanings with some words being much harder
than others (Landes, Leacock, and Randee 1998), resulting in differences in inter-tagger
agreement (Passonneau et al. 2010, 2012), issues in manually partitioning the semantic
space (Chen and Palmer 2009), and difficulties in making alignments between lexical
resources (Palmer, Dang, and Rosenzweig 2000; Eom, Dickinson, and Katz 2012). For
example, OntoNotes is a project aimed at producing a sense inventory by iteratively
grouping corpus instances into senses and then ensuring that these senses can be
reliably distinguished by annotators to give an impressive 90% inter-annotator agree-
ment (Hovy et al. 2006). Although the process is straightforward in many cases, for
some lemmas this is not possible even after multiple re-partitionings (Chen and Palmer
2009).
Recent work on graded annotations (Erk, McCarthy, and Gaylord 2009, 2013) E
graded word sense induction (Jurgens and Klapaftis 2013) has aimed to allow word
sense annotations where it is assumed that more than one sense can apply and where
the senses do not have to be equally applicable. In the graded annotation study, IL
annotators are assigned various tasks including two independent sense labeling tasks
where they are given corpus instances of a target lemma and sense definitions (Word-
Net) and are asked to (1) find the most appropriate sense for the context and (2) assign
a score out of 5 as to the applicability of every sense for that lemma. In graded word
sense induction (Jurgens and Klapaftis 2013), computer systems and annotators pre-
paring the gold standard have to assign tokens in context to clusters (WordNet senses)
but each token is assigned to as many senses as deemed appropriate and with a graded
level of applicability on a Likert scale (1–5). This scenario allows for overlapping sense
assignments and sense clusters, which is a more natural fit for lemmas with related
senses, but inter-annotator agreement is highly variable depending on the lemma,
varying between 0.903 and 0.0 on Krippendorff's α (Krippendorff 1980). This concurs
with the variation seen in other annotation efforts, such as the MASC word sense
corpus (Passonneau et al. 2012). Erk, McCarthy, and Gaylord (2009) demonstrated
that annotators produced more categorical decisions (5 – identical vs. 1 – completely
different) for some words and more mid-range decisions (4 – very similar, 3 – similar,
2 – mostly different) for others. This is not solely due to granularity. In a later article
(Erk, McCarthy, and Gaylord 2013), the authors demonstrated that when coarse-grained
inventories are used, there are some words where, unsurprisingly, usages in the same
coarse senses tend to have higher similarity than those in different coarse senses, but
for some lemmas, the reverse happens. Although graded annotations (Erk, McCarthy,
and Gaylord 2009, 2013) and soft clusterings (Jurgens and Klapaftis 2013) allow for
representing subtler relationships between senses, not all words necessitate such a
complicated framework. This article is aimed at finding metrics that can measure how
difficult a word’s meanings are to partition.
2.2 Alternative Word Meaning Characterizations
Several groups have proposed alternative characterizations of word meaning that do
not rely on a partition of instances into senses. We use three of these approaches in
the current article: two to provide instance annotations that we use as the basis for
clustering and one to provide a gold standard indication of partitionability. Crucially,
these three data sets are all produced by adding annotations to samples taken from
the same set of sentences used for the English lexical substitution task (McCarthy and
Tavolo 1
Sentences for post.n from LEXSUB.
s#
701
702
703
704
705
706
707
708
709
LEXSUB sentence
Tuttavia, both posts include a one-year hand over period and consequently the elections need to
be held one year in advance of the end of their terms.
Application Details CLOSING DATE : FRIDAY 2 settembre 2005 (Applications must be
post marked on or before this day – no late applications can be considered.)
So I put fence posts all the way around the clearing.
And I put a second rail around the posts.
Received 2 the other day from the AURA Mansfield to Buller ultra in the post at no charge.
26/8/2004 Base Jumping Goodness Filed in : Sport by Reevo | Link to this post | Comments
( 0 ) There’s nothing quite like spending ten minutes watching base jumpers doing their thing
all around the world, check them out, you won’t regret it.
PRO Centre Manager The Board for this post had taken place and the successful applicant would
be in post in November.
It’s becoming really frustrating and they keep on moving the goal post with regard to what they
require as security.
A consultants post, with a special interest in Otology at St Georges Hospital was advertised in
Febbraio.
710
The next morning we arrived at the border post at 7:30.
Navigli 2007), hereafter LEXSUB. Ten sentences for the target lemma post.n3 are shown
in Table 1, with the corresponding sentence ids (s#) in the LEXSUB data set and the target
token underlined.
In LEXSUB, human annotators saw a target lemma in a given sentence context and
were asked to provide one or more substitutes for the lemma in that context. There
were 10 instances for each lemma, and the lemmas were manually selected by the task
organizers. The cross-lingual lexical substitution task (Mihalcea, Sinha, and McCarthy
2010) (CLLS) is similar, except that whereas in LEXSUB both the original sentence and
the substitutes were in English, CLLS used Spanish substitutes. For both tasks, multiple
annotators provided substitutes for each target instance. Tavolo 2 shows the English
substitutes from LEXSUB alongside the Spanish substitutes from CLLS for the sentences
for post.n displayed in Table 1.
In the Usim annotation (Erk, McCarthy, and Gaylord 2009, 2013), annotators saw
a pair of sentences at a time that both contained an instance of the same target word.
Annotators then provided a graded judgment on a scale of 1–5 of how similar the usage
of the target lemma was in the two sentences. Multiple annotators rated each sentence
pair. Tavolo 3 shows the average judgments for the post.n example between each pair of
sentence ids in Table 1.4
3 We use n, v, a, r suffixes to denote nouns, verbs, adjectives, and adverbs, respectively.
4 There are no judgments for a sentence paired with itself and we do not repeat values where a judgment
has appeared already in the table (Per esempio, 702-701, given that we already have 701-702 displayed).
Tavolo 2
Paraphrases and translations for sentences with the lemma post.n from the LEXSUB and CLLS
dati. The same sentences were used to elicit both substitute sets.
s#
701
702
703
704
705
706
707
708
709
710
LEXSUB
CLLS
position 3; job 2; role 1;
puesto 2; cargo 1; posicion 1; anuncio 1;
mail 2; postal service 1; date 1; post office 1;
enviado 1; mostrando 1; publicado 1; saliendo
1; anuncio 1; correo 1; marcado por correo 1;
pole 3; support 2; stake 1; upright 1;
poste 3; cerco 1; colocando 1; desplegando 1;
support 2; pole 2; stake 2; upright 1;
poste 3; cerco 1; tabla 1;
mail 4; mail carrier 1;
correo 2; posicion 1; puesto 1; pubblicazione 1;
anuncio 1;
message 2; electronic mail 1; mail 1;
announcement 1; electronic bulletin 1;
entrada 4;
position 3; the job 2; employment 1; job 1;
puesto3; cargo 2; posicion 1; publicacion 1;
oficina 1;
support 2; marker 1; target 1; pole 1;
boundary 1; upright 1;
poste 3;
position 3; job 2; appointment 1; situation 1;
role 1;
puesto 3; posicion 2; cargo 1; anuncio 1;
crossing 3; station 2; lookout 1; fence 1;
caseta 2; puesto fronterizo 1; poste 1; correo 1;
frontera 1; cerco 1; puesto 1; caseta fronteriza 1;
The three data sets overlap in the sentences that they cover: Both Usim and CLLS are
drawn from a subset of the data from LEXSUB.5 The overlap between all three data sets
È 45 lemmas each in the context of ten sentences.6 In this article we only use data from
this common subset as it provides us with a gold-standard (Usim) and two different
representations of the instances (LEXSUB and CLLS substitutes). IL 45 lemmas in this
subset include 14 nouns, 14 adjectives, 15 verbs, E 2 adverbs.7
In our experiments herein, we use the Usim data as a gold standard of how difficult
the usages of a lemma are to partition. We use both LEXSUB and CLLS independently as
the basis for intra-clust clusterability experiments. We compare clusterings based on
LEXSUB and CLLS for the inter-clust clusterability experiments.
3. Measuring Clusterability
We present two main approaches to estimating clusterability of word usages using the
translation and paraphrase data from CLLS and LEXSUB. Firstly, we estimate cluster-
ability using intra-clust measures from machine learning. Secondly, our inter-clust
5 Some sentences in LEXSUB did not have two or more responses and for that reason were omitted from the
data.
6 Usim data were collected in two rounds. For the four lemmas where there is both round 1 and round 2
Usim and CLLS data, we use round 2 data only because there are more annotators (8 for round 2 in
contrast to 3 for round 1) (Erk, McCarthy, and Gaylord 2013).
7 These are the lemmas used in our experiments: account.n, call.v, charge.v, check.v, clear.v, coach.n,
dismiss.v, draw.v, dry.a, execution.n, field.n, figure.n, fire.v, flat.a, fresh.a, function.n, hard.r, heavy.a,
hold.v, investigator.n, lead.n, light.a, match.n, new.a, order.v, paper.n, poor.a, post.n, put.v, range.n, raw.a,
right.r, ring.n, rude.a, shade.n, shed.v, skip.v, soft.a, solid.a, special.a, stiff.a, strong.a, tap.v, throw.v,
work.v.
Tavolo 3
Average Usim judgments for post.n.
s#
701
702
703
704
705
706
707
708
709
710
701
702
703
704
705
–
"
"
"
"
1.0
–
"
"
"
1.0
1.0
–
"
"
1.0
1.0
5.0
–
"
1.3
3.0
1.0
1.0
–
1.3
1.7
1.0
1.0
2.0
4.7
1.0
1.0
1.0
1.0
1.3
1.0
3.0
3.0
1.0
4.0
1.0
1.0
1.0
1.0
1.0
1.0
2.3
3.3
1.0
method uses clustering evaluation metrics to compare agreement between two cluster-
ings obtained from CLLS and LEXSUB based on the intuition that less clusterable lemmas
will have lower congruence between solutions from the two data sets (which provide
different views of the same underlying data).
3.1 Intra-Clustering Clusterability Measures
The notion of the general clusterability of a data set (as opposed to the goodness of any
particular clustering) is explored within the field of machine learning by Ackerman and
Ben-David (2009UN). Consider for example the plots in Figure 1, where the data points
on the left should be more clusterable than those on the right because the partitions are
easier to make. All the notions of clusterability that Ackerman and Ben-David consider
are based on k-means and involve optimum clusterings for a fixed k.
We consider three measures of clusterability that all assume a k-means clustering.
Let X be a set of data points; then a k-means k-clustering of X is a partitioning of X into
k sets. We write C = \{X_1, \ldots, X_k\} for a k-clustering of X, with \bigcup_{i=1}^{k} X_i = X. The k-means
loss function for a k-clustering C is the sum of squared distances of all data points from
the centroid of their cluster,

\mathcal{L}(C) = \sum_{i=1}^{k} \sum_{x \in X_i} \lVert x - \mathrm{centroid}(X_i) \rVert^2    (1)
(UN) More clusterable data
(B) Less clusterable data
Figura 1
A more clusterable data set compared with a less clusterable one.
where the centroid, or center of mass, of a set Y of points is

\mathrm{centroid}(Y) = \frac{1}{|Y|} \sum_{y \in Y} y    (2)
A “k-means optimal k-clustering” of the set X is a k-clustering of X that has the minimal
k-means loss of all k-clusterings of X. There may be multiple such clusterings.
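To make these definitions concrete, the following minimal Python sketch (our illustration; the function names are ours, not from the original study) computes the centroid of Equation (2) and the k-means loss of Equation (1) for a given clustering:

```python
import numpy as np

def centroid(points):
    """Center of mass of a set of points (Equation (2))."""
    return np.mean(points, axis=0)

def kmeans_loss(clusters):
    """k-means loss of a k-clustering (Equation (1)): the sum of squared
    Euclidean distances of all points from the centroid of their cluster."""
    return sum(
        float(np.sum((Xi - centroid(Xi)) ** 2))
        for Xi in clusters
    )

# Example: a 2-clustering of six two-dimensional points.
X1 = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1]])
X2 = np.array([[5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
print(kmeans_loss([X1, X2]))
```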
The first measure of clusterability that we consider is variance ratio (VR), introduced
by Zhang (2001). Its underlying intuition is that in a good clustering, points should be
close to the centroid of their cluster, and clusters should be far apart. For a set Y of
points,
\sigma^2(Y) = \frac{1}{|Y|} \sum_{y \in Y} \lVert y - \mathrm{centroid}(Y) \rVert^2    (3)
is the variance of Y. For a k-clustering C of X, we write p_i = |X_i| / |X|, and define the
within-cluster variance W(C) and between-cluster variance B(C) of C as follows:

W(C) = \sum_{i=1}^{k} p_i \, \sigma^2(X_i)
B(C) = \sum_{i=1}^{k} p_i \, \lVert \mathrm{centroid}(X_i) - \mathrm{centroid}(X) \rVert^2    (4)

Then the variance ratio of the data set X for the number k of clusters is

VR(X, k) = \max_{C \in \mathcal{C}_k} \frac{B(C)}{W(C)}    (5)

where \mathcal{C}_k is the set of k-means optimal k-clusterings of X. A higher variance ratio indi-
cates better clusterability because variance ratio rises as the distance between clusters
increases (B(C)) and the distance within clusters decreases (W(C)).
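As an illustration, a sketch of the variance ratio for one candidate clustering follows (ours, with illustrative names; in practice the maximum of Equation (5) is taken over k-means optimal clusterings, approximated as described at the end of this section):

```python
import numpy as np

def variance(Y):
    """Variance of a point set (Equation (3))."""
    c = np.mean(Y, axis=0)
    return float(np.mean(np.sum((Y - c) ** 2, axis=1)))

def variance_ratio(clusters):
    """B(C)/W(C) for a single k-clustering C (Equations (4) and (5)).
    Higher values indicate better clusterability."""
    X = np.vstack(clusters)
    n, global_c = len(X), np.mean(X, axis=0)
    W = sum(len(Xi) / n * variance(Xi) for Xi in clusters)
    B = sum(len(Xi) / n * float(np.sum((np.mean(Xi, axis=0) - global_c) ** 2))
            for Xi in clusters)
    return B / W
```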
Worst pair ratio (WPR) uses a similar intuition to variance ratio, in that it, too,
considers a ratio of a within-cluster measure and a between-cluster measure. But it
focuses on “worst pairs” (Epter, Krishnamoorthy, and Zaki 1999), the closest pair of
points that are in different clusters, and the most distant points that are in the same
cluster. For two data points x, y ∈ X and a k-clustering C of X, we write x ∼C y if x and
y are in the same cluster of C, and x ≁C y otherwise. Then the split of C is the minimum
distance of two data points in different clusters, and the width of C is the maximum
distance of two data points in the same cluster:
\mathrm{split}(C) = \min_{x, y \in X,\; x \not\sim_C y} \lVert x - y \rVert
\mathrm{width}(C) = \max_{x, y \in X,\; x \sim_C y} \lVert x - y \rVert    (6)
We use the variant of worst pair ratio given by Ackerman and Ben-David (2009b), as
their definition is analogous to variance ratio:

WPR(X, k) = \max_{C \in \mathcal{C}_k} \frac{\mathrm{split}(C)}{\mathrm{width}(C)}    (7)
where \mathcal{C}_k is the set of k-means optimal k-clusterings of X. Worst pair ratio is similar to
variance ratio but can be expected to be more affected by noise in the data, as it only
looks at two pairs of data points while variance ratio averages over all data points.
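A corresponding sketch for one candidate clustering (ours; it assumes every cluster contains at least two points so that width(C) is defined):

```python
import numpy as np
from itertools import combinations

def worst_pair_ratio(clusters):
    """split(C)/width(C) for a single k-clustering C (Equations (6) and (7)):
    the smallest between-cluster distance over the largest
    within-cluster distance."""
    split = min(
        np.linalg.norm(x - y)
        for Xi, Xj in combinations(clusters, 2)
        for x in Xi for y in Xj
    )
    width = max(
        np.linalg.norm(x - y)
        for Xi in clusters
        for x, y in combinations(list(Xi), 2)
    )
    return split / width
```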
The third clusterability measure that we use is separability (SEP), due to
Ostrovsky et al. (2006). Its intuition is different from that of variance ratio and worst
pair ratio: It measures the improvement in clustering (in terms of the k-means loss
function) when we move from (k − 1) clusters to k clusters. We write

\mathrm{Opt}_k(X) = \min_{C \text{ a } k\text{-clustering of } X} \mathcal{L}(C)

for the k-means loss of a k-means optimal k-clustering of X. Then a data set X is
(k, ε)-separable if \mathrm{Opt}_k(X) \le \varepsilon \, \mathrm{Opt}_{k-1}(X). Separability-based clusterability is defined by

SEP(X, k) = \text{the smallest } \varepsilon \text{ such that } X \text{ is } (k, \varepsilon)\text{-separable}    (8)
Whereas for variance ratio and worst pair ratio higher values indicate better cluster-
ability, the opposite is true for separability: Lower values of separability signal a larger
drop in k-means loss when moving from (k − 1) to k clusters.8
The clusterability measures that we describe here all rely on k-means optimal
clusterings, as they were all designed to prove properties of clusterings in the area of
clustering theory. To use them to test clusterability of concrete data sets in practice,
we use an external measure to determine k (described in Section 4.3), and we approx-
imate k-means optimality by performing many clusterings of the same data set with
different random starting points, and using the clustering with minimal k-means loss \mathcal{L}.
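The following sketch shows one way to implement this approximation and the separability of Equation (8) with scikit-learn's KMeans (our illustration; the paper does not prescribe a particular implementation, and the number of restarts here is an arbitrary choice):

```python
import numpy as np
from sklearn.cluster import KMeans

def approx_opt_loss(X, k, restarts=100, seed=0):
    """Approximate Opt_k(X): the minimal k-means loss over many runs
    with different random starting points."""
    best = np.inf
    for r in range(restarts):
        km = KMeans(n_clusters=k, n_init=1, random_state=seed + r).fit(X)
        best = min(best, km.inertia_)  # inertia_ is the k-means loss
    return best

def separability(X, k):
    """SEP(X, k) (Equation (8)): the smallest epsilon with
    Opt_k(X) <= epsilon * Opt_{k-1}(X); lower values signal
    better clusterability."""
    return approx_opt_loss(X, k) / approx_opt_loss(X, k - 1)
```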
3.2 Inter-Clustering Clusterability Measures
If the instances of a lemma are highly clusterable, then an instance clustering derived
from monolingual paraphrase substitutes and a second clustering of the same instances
derived from translation substitutes should be relatively similar. We compare two clus-
tering solutions using the SemEval 2010 WSI task (Manandhar et al. 2010) measures:
V-measure (V) (Rosenberg and Hirschberg 2007) and paired F score (pF) (Artiles,
Amigó, and Gonzalo 2009).
V is the harmonic mean of homogeneity and completeness. Homogeneity refers
to the degree that each cluster consists of data points primarily belonging to a single
gold-standard class, and completeness refers to the degree that each gold-standard class
consists of data points primarily assigned to a single cluster. The V measure is noted to
depend on both entropy and number of clusters: Systems that provide more clusters do
better. For this reason, Manandhar et al. (2010) also used the paired F score (pF), che è
the harmonic mean of precision and recall. Precision is the number of common instance
pairs between clustering solution and gold-standard classes divided by the number of
pairs in the clustering solution, and recall is the same numerator but divided by the total
8 Ackerman and Ben-David (2009B) proposed an additional clusterability measure, center perturbation.
Tuttavia, this measure is not scale invariant, in that its clusterability scores depend on the overall
distance between data points in X. As we found this dependency to be very strong, we are not using
center perturbation in our experiments in this article.
number of pairs in the gold-standard. pF penalizes a difference in number of clusters to
the gold-standard in either direction.9
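For illustration, the sketch below compares two clustering solutions over the same instances, using scikit-learn's V-measure and a paired F score implemented directly from the definition above (our illustration; the LEXSUB labels follow the COMPS of Table 4, the CLLS labels are invented toy values):

```python
from itertools import combinations
from sklearn.metrics import v_measure_score

def paired_f_score(labels_a, labels_b):
    """Harmonic mean of precision and recall over the instance pairs
    that share a cluster in each of the two solutions."""
    def co_clustered_pairs(labels):
        return {(i, j) for i, j in combinations(range(len(labels)), 2)
                if labels[i] == labels[j]}
    pairs_a = co_clustered_pairs(labels_a)
    pairs_b = co_clustered_pairs(labels_b)
    common = len(pairs_a & pairs_b)
    if common == 0:
        return 0.0
    precision, recall = common / len(pairs_a), common / len(pairs_b)
    return 2 * precision * recall / (precision + recall)

# Cluster ids per instance, ordered by sentence id 701-710.
lexsub_labels = [3, 0, 1, 1, 0, 0, 3, 1, 3, 2]
clls_labels   = [3, 0, 1, 1, 0, 4, 3, 1, 3, 2]
print(paired_f_score(lexsub_labels, clls_labels))
print(v_measure_score(lexsub_labels, clls_labels))
```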
4. Experimental Design
In our experiments reported here, we test both intra-clust and inter-clust clusterability
measures. All clusterability results are computed on the basis of LEXSUB and CLLS
dati. The clusterings that we use for the intra-clust measures are k-means clusterings.
We use k-means because this is how these measures have been defined in the machine
learning literature; as k-means is a widely used clustering, this is not an onerous restric-
tion. The similarity between sentences used by k-means is defined in Section 4.2. The
k-means method needs the number k of clusters as input; we determine this number
for each lemma by a simple graph-partitioning method that groups all instances that
have a minimum number of substitutes in common (Sezione 4.3). The graph-partitioning
method is also used for the inter-clust approach, since it provides the simplest partition-
ing of the data and determines the number of partitions (clusters) automatically.
In addition to the intra-clust and inter-clust clusterability measures, we test a base-
line measure based on degree of overlap in an overlapping clustering (Sezione 4.4).
We compare the clusterability ratings to two gold standard partitionability esti-
mates, both of which are derived from Usim (Sezione 4.1). We perform two experiments
to measure how well clusterability tracks partitionability (Sezione 4.6).
4.1 The Gold Standard: Estimating Partitionability from Usim
We turn Usim data into partitionability information in two ways. First, we model
partitionability as inter-tagger agreement on Usim (Uiaa): Uiaa is the inter-tagger
agreement for a given lemma taken as the average pairwise Spearman’s correlation
between the ranked judgments of the annotators. Second, we model partitionability
through the proportion of mid-range judgments over all instances for a lemma and all
annotators (Umid). We follow McCarthy (2011) in calculating Umid as follows. Mid-
range judgments are between 2 and 4, that is, not 1 (completely different usages) and
not 5 (the same usage). Let a ∈ A be an annotator from the set A of all annotators, and
j_a ∈ P_l be the judgment of annotator a for a sentence pair for a lemma from all possible
such pairings for that lemma (P_l). Then the Umid score for that lemma is calculated as
\mathit{Umid} = \frac{\sum_{a \in A} \sum_{j_a \in P_l} \mathbf{1}[j_a \in \{2, 3, 4\}]}{|A| \cdot |P_l|}    (9)
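A minimal sketch of this calculation (ours; the data layout is an assumption for illustration):

```python
def umid(judgments):
    """Umid (Equation (9)): the proportion of mid-range (2-4) judgments
    over all annotators and all sentence pairs of a lemma.
    `judgments` maps each annotator to one 1-5 rating per sentence pair."""
    total = sum(len(ratings) for ratings in judgments.values())
    mid = sum(1 for ratings in judgments.values()
              for j in ratings if j in {2, 3, 4})
    return mid / total

# Toy example: two annotators, three sentence pairs each.
print(umid({"a1": [5, 3, 1], "a2": [4, 4, 1]}))  # 3/6 = 0.5
```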
Umid is a more direct indication of partitionability than Uiaa in that one might have
high values of inter-tagger agreement where annotators all agree on mid-range scores.
Uiaa is useful as it demonstrates clearly that these measures can indicate “tricky” lem-
mas that might prove problematic for human annotators and computational linguistic
systems.
9 Because both measures (V and pF) use the harmonic mean, it does not matter whether we use CLLS as the
gold standard against LEXSUB or vice versa: The harmonic mean of homogeneity and completeness, O
precision and recall, is the same regardless of which clustering solution is considered as “gold.”
4.2 Similarity of Sentences Through LEXSUB and CLLS for k-Means Clustering
The LEXSUB data for a sentence, Per esempio, an instance of post.n, is turned into a
vector as follows. Each possible LEXSUB substitute for post.n over all its ten instances
becomes a dimension. For a given sentence, for example sentence 701 in Table 2, IL
value for dimension t is the number of times t was named as a substitute for sentence
701. So the vector for sentence 701 has an entry of 3 in the dimension position, an entry
of 2 in the dimension job, and a value of 1 in the dimension role, and zero in all other
dimensions, and analogously for the other instances. The CLLS data is turned into one
vector per instance in the same way. This results in vectors of the same dimensionality
for all instances of the same lemma, though the instances of different lemmas can be in
different spaces (which does not matter, as they will never be compared). The distance
(dvec) between two instances s, S(cid:48) of the same lemma (cid:96) is calculated as the Euclidean
distance between their vectors. If there are n substitutes overall for (cid:96) across all its
instances, then the distance of s and s(cid:48) È
dvec(S, S(cid:48)) =
(cid:118)
(cid:117)
(cid:117)
(cid:116)
N
(cid:88)
i=1
(si − s(cid:48)
io )2
(10)
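The construction can be sketched as follows (ours; the counts for sentences 701 and 707 are the LEXSUB substitutes of Table 2):

```python
import numpy as np

def substitute_vectors(instances):
    """One vector per instance, with one dimension per substitute type
    of the lemma; the value is the substitute's count for that instance.
    `instances` maps a sentence id to a {substitute: count} dict."""
    dims = sorted({sub for subs in instances.values() for sub in subs})
    return {s: np.array([subs.get(d, 0) for d in dims], dtype=float)
            for s, subs in instances.items()}

def dvec(v, w):
    """Euclidean distance between two instance vectors (Equation (10))."""
    return float(np.linalg.norm(v - w))

vecs = substitute_vectors({
    701: {"position": 3, "job": 2, "role": 1},
    707: {"position": 3, "the job": 2, "employment": 1, "job": 1},
})
print(dvec(vecs[701], vecs[707]))  # sqrt(7)
```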
4.3 Graphical Partitioning
This subsection describes the method that we use for determining the number of clus-
ters (k) for a given lemma needed by the intra-clust approach described in Section 3.1,
and for providing data partitions for the inter-clust measure of clusterability described
in Section 3.2. We adopt a simple graph-based approach to partitioning word usages
according to their distance, following Di Marco and Navigli (2013). Traditionally, graph-
based WSI algorithms reveal a word’s senses by partitioning a co-occurrence graph
built from its contexts into vertex sets that group semantically related words (Véronis
2004). In these experiments we build graphs for the LEXSUB and CLLS target lemmas
and partition them based on the distance of the instances, reflected in the substitute
annotazioni. Although the graphical approach is straightforward and representative of
the sort of WSI methods used in our field, the exact graph partitioning method is not
being evaluated here. Other graph partitioning or clustering algorithms could equally
be used.
For a given lemma l, we build two undirected graphs using the LEXSUB and CLLS
substitutes for l. An instance of l is identified by a sentence id (s#) and is represented by
a vertex in the graph. Each instance is associated with a set of substitutes (from either
LEXSUB or CLLS) as shown in Table 2 for the noun post. Two vertices are linked by an
edge if their distance is found to be low enough.
The graph partitioning method that we describe here uses a different and sim-
pler estimate of distance than the k-means clustering. The distance of two vertices
is estimated based on the overlap of their substitute sets. As the number of substi-
tutes in each set varies, we use the size of the whole sets along with the size of
the intersection for calculating the distance. Let s be an instance (sentence) from a
data set (LEXSUB or CLLS) and T be the set of substitute types10 provided for that
10 We have not used the frequency of each substitute, which is the number of annotators that provided it in
LEXSUB or CLLS, though it would be possible to experiment with this in future work.
Tavolo 4
Hard and overlapping partitions (COMPS and CLIQUES) obtained for post.n from the LEXSUB data.
Partitions
Sentence ids
Elements
COMPS
706, 705, 702
mail carrier, date, post office, electronic mail, mail, electronic bulletin,
message, postal service, announcement
CLIQUES
704, 703, 708
support, target, marker, boundary, stake, pole, upright
710
lookout, station, fence, crossing
701, 709, 707
appointment, position, employment, situation, job, the job, role
705, 702
705, 706
704, 703, 708
701, 709, 707
710
mail carrier, date, post office, postal service, mail
mail carrier, electronic bulletin, message, electronic mail,
announcement, mail
support, target, marker, boundary, stake, pole, upright
appointment, position, employment, situation, job, the job, role
lookout, station, fence, crossing
instance in LEXSUB or CLLS. The distance (d_node) between two instances (nodes) s and
s′ with substitute sets T and T′ corresponds to the number of moves necessary to convert
T into T′. We use the metric proposed by Goldberg, Hayvanovych, and Magdon-Ismail
(2010), which considers the elements that are shared by, and are unique to, each of the
sets.

d_{node}(T, T') = |T| + |T'| - 2\,|T \cap T'|    (11)
We consider two instances as similar enough to be linked by an edge if their intersection
is not empty (cioè., they have at least one common substitute) and their distance is below
a threshold. After observation of the distance results for different lemmas, the threshold
was defined to be equal to 7. A pair of instances with a distance below the threshold is
linked by an edge in the graph. For example, instances 705 and 706 of post.n are linked
in the graph built from the LEXSUB data (cf. Tavolo 2) because their intersection is not
empty (they share mail) and they have a distance of 5. The graph built for a lemma is
partitioned into connected components (hereafter COMP). As the COMPS do not share
any instances, they correspond to a hard (non-overlapping) clustering solution over the
set of instances. Two instances belong to the same component if there is a path between
their vertices. The top part of Table 4 displays the COMPS obtained for post.n from the
LEXSUB data. IL 10 instances of the lemma in Table 2 are grouped into four COMPS.
Instances 705 and 706 that were linked in the graph are found in the same connected
component. On the contrary, 710 shares no substitutes with any other instance as shown
in Table 2, and, as a consequence, does not satisfy either the intersection or the distance
criterion. Instance 710 is thus isolated as it is linked to no other instances, and forms a
separate component.
11 In future work, we intend to explore ways for defining the distance threshold dynamically, on a per
lemma basis.
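A sketch of the graph construction and the extraction of COMPS, using networkx (our illustration; the substitute sets are a small excerpt from Table 2):

```python
import networkx as nx

def build_graph(instances, threshold=7):
    """Link two instances if their substitute sets overlap and their
    distance (Equation (11)) is below the threshold; the COMPS are the
    connected components of the resulting graph."""
    g = nx.Graph()
    g.add_nodes_from(instances)
    for s in instances:
        for t in instances:
            if s < t:
                a, b = instances[s], instances[t]
                if a & b and len(a) + len(b) - 2 * len(a & b) < threshold:
                    g.add_edge(s, t)
    return g

instances = {  # substitute *types* for three post.n instances
    702: {"mail", "postal service", "date", "post office"},
    705: {"mail", "mail carrier"},
    703: {"pole", "support", "stake", "upright"},
}
g = build_graph(instances)
comps = list(nx.connected_components(g))  # e.g., [{702, 705}, {703}]
k = len(comps)  # the k handed to the intra-clust measures
```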
Figura 2
Frequency distribution over number of COMPS: How many lemmas had a given number of
COMPS in the two data sets.
Figura 2 shows the frequency distribution of lemmas over number of COMPS.
4.4 A Baseline Measure Based on Cluster Overlap
Our proposed clusterability measures (both intra- and inter-clust) are applicable to
hard clusterings. WSI in computational linguistics has traditionally focused on a hard
partition of usages into senses but there have been recent attempts to allow for graded
annotation (Erk, McCarthy, and Gaylord 2009, 2013) and soft clustering (Jurgens and
Klapaftis 2013). We wanted to see how well the extent of overlap between clusters
might be used as a measure of clusterability because this information is present for
any soft clustering. If this simple criterion worked well, it would avoid the need for
an independent measure of clusterability. If the amount of overlap is an indicator
of clusterability then soft clustering can be applied and lemmas with clear-cut sense
distinctions will be identified as having little or no overlap between clusters, as depicted
in Figure 3.
For this baseline, we measure overlap from a second set of node groupings of
the graphs described in Section 4.3, where an instance can fall into more than one of
the groups. We refer to this soft grouping solution as CLIQUES. A clique consists of a
(UN) More clusterable data
(B) Less clusterable data
Figura 3
A more clusterable data set compared with a less clusterable one, allowing for cluster overlap.
Figura 4
Illustration of the processing pipeline from input data to clusterability estimation.
maximal set of nodes that are pairwise adjacent.12 They are typically finer grained than
the COMPS because there may be vertices in a component that have a path between them
without being adjacent.13
The lower part of Table 4 contains the CLIQUES obtained for post.n in LEXSUB. IL
two solutions, COMPS and CLIQUES, presented for the lemma in this table are very
similar except that there is a further distinction in the CLIQUES as the first cluster in
the COMPS is subdivided between two different senses of mail (broadly speaking, IL
physical and electronic senses). Note that these two CLIQUES overlap and share
instance 705.
We wish to see if using the extent of overlap in the CLIQUES reflects the partition-
ability numbers derived from the Usim data to the same extent as the clusterability
metrics already presented. If it does, then the overlapping clustering approach itself
could be used to determine how easily the senses partition and clusterability would be
reflected by the extent of instance overlap in the clustering solution. Let Cs be the set of
partitions (CLIQUES) to which a sentence s from the sentences for a given lemma (Sl) È
automatically assigned. Then ncs(l) measures the average number of CLIQUES to which
the sentences for a given lemma are assigned.
ncs(l) = \frac{\sum_{s \in S_l} |C_s|}{|S_l|}    (12)
We assume that lemmas that are less easy to partition will have higher values of ncs
compared with lemmas with a similar number of clusters over all sentences but with
lower values of ncs.
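A sketch of the baseline (ours; maximal cliques are computed with networkx, and the toy graph mirrors the overlap on instance 705 in Table 4):

```python
import networkx as nx

def ncs(graph):
    """ncs (Equation (12)): the average number of CLIQUES (maximal
    cliques of the instance graph) to which each sentence is assigned."""
    counts = {s: 0 for s in graph.nodes}
    for clique in nx.find_cliques(graph):  # maximal cliques
        for s in clique:
            counts[s] += 1
    return sum(counts.values()) / graph.number_of_nodes()

g = nx.Graph([(702, 705), (705, 706)])  # 705 lies in two cliques
print(ncs(g))  # (1 + 2 + 1) / 3 = 1.33...
```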
4.5 Experimental Design Overview
In Figure 4 we give an overview of the whole processing pipeline, from the input data to
the clusterability estimation. The graphs built for each lemma from the LEXSUB and CLLS
12 Cliques are computed directly from a graph, not from the COMPS.
13 Note that two different COMPS and CLIQUES can share substitutes (translations or paraphrases).
Substitutes serve to determine the distance of the instances. If the distance is high, two instances are not
linked in the graph despite their shared substitutes.
Tavolo 5
Overview of gold partitionability estimates and of clusterability measures to be evaluated.
Gold partitionability estimates
Intra-clust clusterability measures
Inter-clust clusterability measures
Linea di base
Umid: proportion of mid-range (2–4) instance similarity
ratings for a lemma
Uiaa: inter-annotator agreement on the Usim data set
(average pairwise Spearman)
VR, WPR, SEP based on k-means clustering
k estimated as COMPS
clustering computed based on either LEXSUB or CLLS
substitutes
comparing COMPS partitioning of CLLS with COMPS
partitioning of LEXSUB
comparison either through V or pF
average number ncs of CLIQUES clusters, computed either
from LEXSUB or CLLS data
data are partitioned twice creating COMPS and CLIQUES. The COMPS serve to define
the k per lemma needed by the intra-clust clusterability metrics (VR, SEP, WPR). IL
inter-clust metrics (V and pF) compare the two sets of COMPS created for a lemma from
the LEXSUB and CLLS data. The overlaps present in the CLIQUES are exploited by the
baseline metric (ncs).
4.6 Evaluation
Tavolo 5 provides a summary of the two gold standard partitionability estimates and
the two types of clusterability measures, along with the baseline clusterability measure
that we test. The partitionability estimates and the clusterability measures vary in their
directions: In some cases, high values denote high partitionability; in other cases high
values indicate low partitionability. Because WPR and VR are predicted to have high
values for more clusterable lemmas and SEP has low values, we expect WPR and VR to
positively correlate with Uiaa and negatively with Umid and the direction of correlation
to be reversed for SEP. Our clustering evaluation metrics (V and pF) should provide
correlations with the gold standards in the same direction as WPR and VR since a high
congruence between the two solutions for a lemma from different annotations of the
same sentences should be indicative of higher clusterability and consequently higher
values of Uiaa and lower values of Umid. As regards the baseline approach based on
cluster overlap, because we assume that lemmas that are less easy to partition will have
higher values of ncs, high values of ncs should be positively correlated with Umid and
negatively correlated with Uiaa (like SEP). Tavolo 6 gives an overview of the expected
directions.
We perform two sets of experiments, which differ in the way in which we control for
polysemy. Partitionability estimates as well as clusterability predictions can be expected
to be influenced by polysemy. Polysemy has an influence on inter-annotator agreement
in that agreement is lower with higher attested polysemy (Passonneau et al. 2010). IL
number of clusters also influences all our measures of clusterability. Manandhar et al.
(2010) note that V and pF are influenced by polysemy. Also, all intra-clust clusterability
measures are influenced by k. Variance ratio and worst pair ratio both improve mono-
tonically with k because the distance of points from the center mass of their cluster
Tavolo 6
Directions of partitionability estimates and clusterability measures: (cid:37) means that high values
denote high partitionability, E (cid:38) means that a high value denotes low partitionability.
Gold partitionability estimates Clusterability measures
Umid: (cid:38)
Uiaa: (cid:37)
(cid:37)
VR:
WPR: (cid:37)
SEP: (cid:38)
(cid:37)
V:
(cid:37)
pF:
ncs: (cid:38)
decreases as the number of clusters rises (this affects the within-cluster variance W(C)
and width(C)). Separability is always lowest for k = n (number of data points), E
almost always second-lowest for k = n − 1.
The first set of experiments measures correlation using Spearman’s ρ between a
ranking of partitionability estimates and a ranking of clusterability predictions. We do
not perform correlation across all lemmas but control for polysemy by grouping lemmas
into polysemy bands, and performing correlations only on lemmas with a polysemy
within the bounds of the same band. Let k be the number of clusters for lemma l, Quale
is the number of COMPS for all clusterability metrics other than ncs, and the number of
CLIQUES for ncs. For the cluster congruence metrics (V and pF), we take the average
number of clusters for a lemma in both LEXSUB and CLLS.14 Then we define three
polysemy bands:
• low: 2 ≤ k < 4.3
• mid: 4.3 ≤ k < 6.6
• high: 6.6 ≤ k < 9
Note that none of the intra-clust clusterability measures are applicable for k = 1,
so in cases where the number of COMPS is one, the lemma is excluded from analysis.
In these cases the clustering algorithm itself decides that the instances are not easy to
partition.
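The first experiment can be sketched as follows (ours; the dictionaries stand in for the per-lemma values computed earlier):

```python
from scipy.stats import spearmanr

def band_correlations(k_per_lemma, clusterability, gold, min_lemmas=5):
    """Spearman's rho between a clusterability measure and a gold
    partitionability estimate, computed separately per polysemy band."""
    bands = {"low": (2, 4.3), "mid": (4.3, 6.6), "high": (6.6, 9)}
    results = {}
    for name, (lo, hi) in bands.items():
        lemmas = [l for l, k in k_per_lemma.items() if lo <= k < hi]
        if len(lemmas) < min_lemmas:
            continue  # too few lemmas for a meaningful correlation
        rho, p = spearmanr([clusterability[l] for l in lemmas],
                           [gold[l] for l in lemmas])
        results[name] = (rho, p, len(lemmas))
    return results
```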
The second set of experiments performs linear regression to link partitionability to
clusterability, using the degree of polysemy k as an additional independent variable.
As we expect polysemy to interfere with all clusterability measures, we are interested
not so much in polysemy as a separate variable but in the interaction polysemy ×
clusterability. This lets us test experimentally whether our prediction that polysemy
influences clusterability is borne out in the data. As the second set of experiments does
not break the lemmas into polysemy bands, we have a single, larger set of data points
undergoing analysis, which gives us a stronger basis for assessing significance.
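This analysis can be sketched with statsmodels (our illustration; the data frame holds random toy stand-ins for the real per-lemma values):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "k": rng.integers(2, 9, size=45),  # polysemy (number of COMPS)
    "clust": rng.random(45),           # a clusterability score per lemma
})
df["gold"] = 0.5 * df["clust"] - 0.05 * df["k"] + rng.normal(0, 0.1, 45)

# `clust * k` expands to clust + k + clust:k, so the model includes the
# interaction polysemy x clusterability alongside both main effects.
model = smf.ols("gold ~ clust * k", data=df).fit()
print(model.summary())
```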
14 Differences in granularity are quite possibly an indication of non-clusterability, but not necessarily. We
have also tried using the difference in the number of clusters between CLLS and LEXSUB as an indicator of
clusterability but the proposed measures allow a more complete estimation of disparity and so far seem
more reliable.
5. Experiments
In this section we provide our main results evaluating the various clusterability mea-
sures against our gold-standard estimates. Section 5.1 discusses the evaluation via
correlation with Spearman’s ρ. In Section 5.2 we present the regression experiments.
In Section 5.3 we provide examples and lemma rankings by two of our best performing
metrics.
5.1 Correlation of Clusterability Measures Using Spearman’s ρ
We calculated Spearman’s correlation coefficient (ρ) for both gold standards (Uiaa and
Umid) against all clusterability measures: intra-clust (VR, WPR, and SEP), inter-clust (V
and pF), and the baseline ncs. For all these measures except the inter-clust, we calculate
ρ using LEXSUB and CLLS separately as our clusterability measure input. The inter-clust
measures rely on two views of the data so we use LEXSUB and CLLS together as input.
We calculate the correlation for lemmas in the polysemy bands (low, mid, and high,
as described above in Section 4.6) subject to the constraint that there are at least five
lemmas within the polysemy range for that band. We provide the details of all trials in
Appendix A and report the main findings here.
Table 7 shows the average Spearman’s ρ over all trials for each clusterability mea-
sure. Although there are a few non-significant results from individual trials that are in
the unanticipated direction (as discussed in the following paragraph), all average ρ are
in the anticipated direction, specified in Table 6; SEP and ncs are positively correlated
with Umid and negatively with Uiaa whereas for all other measures the direction of
correlation is reversed. Some of the metrics show a promising level of correlation but
the performance of the metrics varies. The baseline ncs is particularly weak, highlighting
that the amount of shared sentences in overlapping clusters is not a strong indication of
clusterability. This is important because if this simple baseline had been a good indicator
of clusterability, then a sensible approach to the phenomenon of partionability of word
meaning would be to simply soft cluster a word’s instances and the extent of overlap
would be a direct indication that the meanings are highly intertwined. WPR is also quite
Table 7
The macro-averaged correlation of each clusterability metric with the Usim gold-standard
rankings Uiaa and Umid: all correlations are in the expected direction. Also shown is the
proportion (prop.) of trials from Tables A.1–A.5 in Appendix A with moderate or stronger
correlation (ρ > 0.4) in the correct direction and a statistically significant result (* or **).

measure type   measure      average ρ            prop. ρ > 0.4, * or **
                            Umid      Uiaa       Umid      Uiaa
intra-clust    VR           −0.483     0.365     2/3       2/3
               SEP           0.569    −0.390     2/3       1/3
               WPR          −0.322     0.210     1/3       0/3
inter-clust    pF           −0.318     0.540     0/2       1/2
               V            −0.123     0.493     0/2       0/2
baseline       ncs           0.053    −0.164     0/6       1/6
WPR is also quite weak, which is not unexpected: It only considers the worst pair rather than all data
points, as noted in Section 3.1. Both inter-clust measures (pF and V) have a stronger
correlation with Uiaa than with Umid, whereas for the machine learning measures the
reverse is true and the correlation is stronger for Umid. As mentioned in Section 4.1,
Umid is a more direct gold-standard indicator of partitionability but Uiaa is useful as
a gold standard as it indicates how problematic annotation will be for humans. The
machine learning metric SEP and our proposal for pF as an indication of clusterability
provide the strongest average correlations, though the results for pF are less consistent
over trials.15
Because we are controlling for polysemy, there is less data (lemmas) for each cor-
relation measurement so many individual trials do not give significant results, but all
significant correlations are in the anticipated direction. The final two columns of Table 7
show the proportion of cases that are significant at the 0.05 level or above and have
ρ > 0.416 in the anticipated direction out of all individual trials meeting the constraint
of five or more lemmas in the respective polysemy band for LEXSUB or CLLS input data.
We are limited by the available gold-standard data and need to control for polysemy,
so there are several results with a promising ρ which, however, are not significant and
are therefore scored negatively in this more stringent summary. Nevertheless, from this
summary of the results we can see that the machine learning metrics, particularly VR
(which has a higher proportion of successful trials) and SEP (which has the highest
average correlations) are most consistent in indicating partitionability using either gold-
standard estimate (Umid or Uiaa) with VR achieving 66.7% success (2 out of 3 trials
for each gold-standard ranking). WPR is less promising for the reasons stated above.
Although there are some successful trials for the inter-clust approaches, the results are
not consistent and only one trial showed a (highly) significant correlation. The baseline
approach which measures cluster overlap has only one significant result in all 6 trials,
but more worrisome for this measure is the fact that in 4 out of the 12 trials (2 for each
Umid and Uiaa) the correlation was in the non-anticipated direction. In contrast there
was only one result for WPR (on CLLS) in the non-anticipated direction and one result
for V on the fence (ρ = 0) and all other individual results for the inter and intra-clust
measures were in the anticipated direction.
There were typically more lemmas in the intra-clust trials with LEXSUB than with
CLLS, as shown in Appendix A, because many lemmas in CLLS have only one
component (see Figure 2) and are therefore excluded from the intra-clust clusterability
estimation.17
5.2 Linking Partitionability to Clusterability and Polysemy Through Regression
Our first round of experiments revealed some clear differences between approaches and
implied good performance, particularly for the intra-clust measures VR and SEP. Nel
first round of experiments, Tuttavia, we separated lemmas into polysemy bands and
this resulted in the set of lemmas involved in each individual correlation experiment
being somewhat small. This makes it hard to obtain significant results. Even for the
15 This can be seen in Table A.3 in Appendix A.
16 This is generally considered the lower bound of moderate correlation for Spearman’s and is the
level of inter-annotator agreement achieved in other semantics tasks (for example see Mitchell and
Lapata [2008]).
17 As noted before, none of the intra-clust measures are applicable for the case of k = 1.
overall most successful measures, not all trials came out as significant. In this second
round of experiments, we therefore change the set-up in a way that allows us to test on
all lemmas in a single experiment, to see which clusterability measures will exhibit an
overall significant ability to predict partitionability.
We use linear regression, an analysis closely related to correlation.18 The dependent
variable to be predicted is a partitionability estimate, either Umid or Uiaa. We use two
types of independent variables (predictors). The first is the clusterability measure—
here we call this variable clust. The second is the degree of polysemy, which we
call poly. This way we can model an influence of polysemy on clusterability as an
interaction of variables, and have all lemmas undergo analysis at the same time. This
lets us obtain more reliable results: Previously, a non-significant result could indicate
either a weak predictor or a data set that was too small after controlling for poly-
semy, but now the data set undergoing analysis is much bigger.19 Furthermore, this
experiment demonstrates how clusterability and polysemy can be used together as
predictors.
The variable clust reflects the clusterability predictions of each measure. We use
the actual values, not their rank among the clusterability values for all lemmas. This
way we can test the ability of our clusterability measures to predict partitionability
for individual lemmas, while the rank is always relative to other lemmas that are
being analyzed at the same time. The values of the variable clust are obviously dif-
ferent for each clusterability measure, but the values of poly also vary across clus-
terability measures: For all intra-clust measures poly is the number of COMPS. For
the inter-clust measures, it is the average of the number of COMPS computed from
LEXSUB and the number computed from CLLS. For the ncs baseline it is the number of
CLIQUES. In all cases, poly is the actual number of COMPS or CLIQUES, not the polysemy
band.
We test three different models in our linear regression experiment. The first
model has poly as its sole predictor. It tests to what extent partitionability issues
can be explained solely by a larger number of COMPS or CLIQUES. Our hypothesis
is that this simple model will not suffice. The second model has clust as its sole
predictor, ignoring possible influences from polysemy. The third model uses the in-
teraction poly × clust as a predictor (along with poly and clust as separate vari-
ables). Our hypothesis is that this third model should fare particularly well, given
the influence of polysemy on clusterability measures that we derived theoretically
above.20
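As one possible realization of these three models, here is a sketch using the statsmodels formula API (not necessarily the software used for our experiments); the column values below are hypothetical stand-ins for per-lemma data, and "poly * clust" expands to poly + clust + poly:clust.

    # Fit the three models discussed above: poly only, clust only, interaction.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.DataFrame({
        "umid":  [0.20, 0.35, 0.44, 0.51, 0.60, 0.48, 0.63, 0.30],  # hypothetical
        "clust": [7.2, 4.6, 2.4, 1.1, 0.5, 1.3, 0.7, 4.2],          # e.g. VR values
        "poly":  [2, 5, 3, 3, 2, 3, 2, 4],                          # number of COMPS
    })
    m_poly = smf.ols("umid ~ poly", data=df).fit()
    m_clust = smf.ols("umid ~ clust", data=df).fit()
    m_inter = smf.ols("umid ~ poly * clust", data=df).fit()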
We evaluate the linear regression models in two ways. The first is the F test. Given
a model M predicting Y from predictors X1, . . . , Xm as Y = β0 + β1X1 + . . . + βmXm,
it tests the null hypothesis that β0 = β1 = . . . = βm = 0. Questo è, it tests whether M is
statistically indistinguishable from a model with no predictors.21 Second, we use the
Akaike Information Criterion (AIC) to compare models.
18 The regression coefficient is Pearson's r, a correlation coefficient, scaled by the ratio of the standard
deviations of the two variables.
19 Also, the first round of experiments had to drop some lemmas from the analysis when they were in a
polysemy band with too few members; the second round of experiments does not have this
issue.
20 We also tested a model with predictors poly+clust, without interaction. We do not report on results for
this model here as it did not yield any interesting results. It was basically always between clust and
poly × clust.
21 We will say an F test “reached significance” to mean that the null hypothesis was rejected for some
model.
Tavolo 8
Regression results for the Umid partitionability estimate. Significance of F statistic, and AIC
for the following models: polysemy only (poly), clusterability only (clust), and interaction
(poly × clust). Bolded: model that is best by AIC and has significant F, separately for each
substitute set. We use * for statistical significance with p < 0.05, ** for p < 0.01, and ***
for p < 0.001.
data
cl. measure
poly
F AIC
CLLS
CLLS
CLLS
CLLS
LEXSUB
LEXSUB
LEXSUB
LEXSUB
both
both
VR
SEP
WPR
ncs
VR
SEP
WPR
ncs
pF
V
-
-
-
-
-
-
-
**
-
-
-24.9
-24.9
-24.9
-30.6
-24.8
-24.8
-24.8
-26.5
-24.8
-24.8
clust
F
-
**
-
-
*
**
***
**
-
*
AIC
-25.7
-34.2
-27.5
-25.8
-31.4
-32.6
-34.0
-27.9
-25.3
-28.0
poly × clust
F
AIC
**
*
-
-
***
***
*
*
-
-
-35.1
-30.3
-25.4
-26.8
-34.5
-32.2
-30.2
-28.2
-21.5
-25.7
AIC tests how well a model will likely generalize (rather than overfit) by penalizing models with more predictors.
AIC uses the log likelihood of the model under the data, corrected for model complexity
computed as its number of predictors. Given again a model M predicting Y (in our case,
either Umid or Uiaa) from m predictors, the AIC is
AIC = −2 log p(Y|M) + 2m
The lower the AIC value, the better the generalization of the model. The model preferred
by AIC is the one that minimizes the Kullback-Leibler divergence between the model
and the data. AIC allows us to compare all models that model the same data, that is,
all models predicting Umid can be compared to each other, and likewise all models
predicting Uiaa.
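In code, the comparison is direct; the log-likelihoods below are hypothetical, and the convention for m (whether the intercept is counted) only shifts all AIC values by a constant. Note that a positive log-likelihood yields a negative AIC, as in Tables 8 and 9 (see footnote 25).

    # Direct transcription of the AIC formula above; lower is better.
    def aic(log_likelihood, m):
        return -2.0 * log_likelihood + 2.0 * m

    # Hypothetical log-likelihoods of three models over the same lemmas:
    candidates = {"poly": (14.0, 1), "clust": (16.1, 1), "poly x clust": (20.0, 3)}
    scores = {name: aic(ll, m) for name, (ll, m) in candidates.items()}
    best = min(scores, key=scores.get)  # here 'poly x clust' with AIC = -34.0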
The number of data points in each model depends on the partitioning (as lemmas
with k = 1 cannot enter into intra-clust clusterability analysis), which differs between
CLLS and LEXSUB. AIC depends on the sample size (through p(Y|M)), so in order to be
able to compare all models that model the same partitionability estimate, we compute
AIC only on the subset of lemmas that enters in all analyses.22 In contrast, we compute
the F test on all lemmas where the clusterability measure is valid,23 in order to use the
largest possible set of lemmas to test the viability of a model.24
Table 8 shows the results for models predicting Umid, and Table 9 shows the results
for the prediction of Uiaa. The figures marked † are the best AIC values for each substitute
set (CLLS, LEXSUB, both) where the corresponding F tests reach significance.25
22 This subset comprises 27 lemmas: charge.v, clear.v, draw.v, dry.a, fire.v, flat.a, hard.r, heavy.a, hold.v,
lead.n, light.a, match.n, paper.n, post.n, range.n, raw.a, right.r, ring.n, rude.a, shade.n, shed.v, skip.v,
soft.a, solid.a, stiff.a, tap.v, throw.v.
23 For the intra-clust measures, this is only lemmas where k > 1.
24 We also computed AIC separately for substitute sets LEXSUB, CLLS, and both (for inter-clust). The relative
ordering of models within each substitute set remained mostly the same.
25 Log-likelihood values can be positive, as they are in our case, leading to negative AIC values. See, for
example, http://blog.stata.com/2011/02/16/positive-log-likelihood-values-happen/.
Tavolo 9
Regression results for the Uiaa partitionability estimate. Significance of F statistic, and AIC
for the following models: polysemy only (poly), clusterability only (clust), and interaction
(poly × clust). Bolded: model that is best by AIC and has significant F, separately for each
substitute set. We use * for statistical significance with p < 0.05 and ** for p < 0.01.
data
cl. measure
poly
F AIC
clust
F AIC
poly × clust
F AIC
CLLS
CLLS
CLLS
CLLS
LEXSUB
LEXSUB
LEXSUB
LEXSUB
both
both
VR
SEP
WPR
ncs
VR
SEP
WPR
ncs
pF
V
-
-
-
-
-
-
-
-
-
-
-20.2
-20.2
-20.2
-20.8
-20.4
-20.4
-20.4
-22.7
-20.0
-20.0
-
-
-
-
-
**
*
**
-
-
-21.2
-23.0
-24.1
-20.3
-21.7
-27.7
-29.7
-21.4
-22.1
-24.8
-
-
-
-
*
*
-
**
-
-
-20.7
-20.9
-21.8
-19.3
-26.9
-25.4
-27.0
-24.8
-18.8
-21.9
Confirming the results from our first round of experiments, we obtain the best
results for SEP and VR: The best AIC results in predicting Umid are reached by VR, while
SEP shows a particularly reliable performance. In predicting Umid, all SEP models that
use clust reach significance, and in predicting Uiaa, all SEP models that use clust reach
significance if they are based on LEXSUB substitutes. WPR reaches the best AIC values
on predicting Uiaa, but on the F test, which takes into account more lemmas, its results
are less often significant.
As in the first round of experiments, the performance of the two inter-clust mea-
sures is not as strong as that of the intra-clust measures. Here the inter-clust measures
are in fact often comparable to the ncs baseline. However, as CLLS seems to be harder to
use as a basis than LEXSUB (we comment on this subsequently), the inter-clust measures
may be hampered by problems with the CLLS data.
The baseline ncs measure does not have as dismal a performance here as it did in
the first round of experiments, but its performance is still worse throughout than that of
the intra-clust measures. Interestingly, the poly variable that we use for ncs, which is the
absolute number of CLIQUES for a lemma, is informative to some extent for Umid but
not for Uiaa, and the clust variable is informative to some extent for Uiaa but not for
Umid.
The regression experiments overall confirm the influence of polysemy on the clus-
terability measures. Although clusterability as a predictor on its own (the clust models)
often reaches significance in predicting partitionability, taking polysemy into account
(in the poly × clust models) often strengthens the model in predicting Umid and
achieves the overall best results (the models marked † for CLLS and LEXSUB in Table 8).
For Uiaa the results are more ambivalent: of the four clusterability measures that produce
significant models, two improve when the interaction with polysemy is taken into account and the
two others do not. We also note that COMPS alone (the poly variable for the intra-clust
models) never manages to predict partitionability in any way, for either Umid or Uiaa.
In contrast, the number of CLIQUES (the poly variable of the ncs model) emerges as a
predictor of Umid, though not of Uiaa.
In comparing Umid versus Uiaa, we see that Umid seems to be generally easier to
predict, as it has more models with a significant F test.
Comparing the CLLS and LEXSUB substitutions, we see that the use of LEXSUB leads
to much better predictions than CLLS. Most strikingly, in predicting Uiaa no model
achieves significance using CLLS. We have commented on this issue before: The reason
for this effect is that many lemmas in CLLS have only one component and are therefore
excluded from the intra-clust clusterability estimation.
Clusterability in practice. As this round of experiments used the raw clusterability figures
to predict partitionability, rather than their rank, it points the way to using clusterability
in practice: Given a lemma, collect instance data (for example paraphrases, translations,
or vectors). Estimate the number of clusters, for example using a graph-based clustering
approach. Then use a clusterability measure (SEP or VR recommended) to determine
its degree of clusterability, and use a regression model to predict a partitionability
estimate. It may help to take the interaction of clust and poly into account. If the estimate
is high, then a hard clustering is more likely to be appropriate, and sense tagging for
training or testing should not be difficult. Where the estimate is low it is more likely
that a more complex graded representation is needed, and in extreme cases clustering
should be avoided altogether. Determining where the boundaries are would depend
on the purpose of the lexical representation and is not addressed in this article. Our
contribution is an approach to determine the relative location of lemmas on a continuum
of partitionability.
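The following sketch makes the recipe concrete. KMeans stands in for the clustering step, sklearn's Calinski–Harabasz score stands in for a variance-ratio (VR-style) clusterability measure, and the regression coefficients are hypothetical; none of this is the exact implementation used in this article, and the instance vectors are toy data.

    # From instance representations to a partitionability estimate.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import calinski_harabasz_score

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (10, 5)),     # toy instance vectors for one
                   rng.normal(4, 1, (10, 5))])    # lemma, forming two loose groups

    k = 2  # estimated number of senses, e.g. from a graph-based clustering step
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    clust = calinski_harabasz_score(X, labels)    # higher = more clusterable

    # Apply a previously fitted regression (hypothetical coefficients) that uses
    # the clusterability score, the polysemy k, and their interaction:
    b0, b_clust, b_poly, b_inter = 0.6, -0.002, 0.01, -0.0005
    umid_hat = b0 + b_clust * clust + b_poly * k + b_inter * clust * k
    # A low predicted Umid suggests hard clustering is likely to be appropriate.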
5.3 Lemma Clusterability Rankings and Some Examples
Our clusterability metrics, in particular VR and SEP, are useful for determining the
partitionability of lemmas. In this section we show the rankings for these two metrics
with our lemmas and provide a couple of more detailed examples with the LEXSUB and
CLLS data.
In Table 10 we show the lemmas that have k > 1 when partitioned into COMPS using
the LEXSUB substitutes, their respective gold-standard Umid and Uiaa values, and the
VR and SEP values calculated for them on the basis of the LEXSUB substitutes. Ranking
the lemmas by each of these four columns gives the orderings we compare; the rankings
by Umid and SEP are reversed because these measures are high when clusterability is
low and vice versa, so that in every ranking lemmas with high partitionability fall near
the bottom and lemmas with low partitionability near the top. There are differences,
and all rankings are influenced by polysemy, but on the whole the metrics rank lemmas
similarly to the gold-standard rankings, with highly clusterable lemmas (such as fire.v)
near the bottom and less clusterable lemmas (such as work.v) nearer the top.
We now take a closer look at two example lemmas, fire.v and solid.a. Table 11
provides the COMPS from both the LEXSUB and the CLLS data. Both lemmas have a
polysemy of 2 according to the COMPS clustering. fire.v is an example of a highly
clusterable lemma whereas solid.a is a less-clusterable lemma. Table 12 shows the values
for the clusterability measures. The intra-clust metrics are calculated for both LEXSUB
and CLLS independently whereas the inter-clust metrics (pF and V) compare the two
independent clustering solutions with each other. fire.v is more clusterable as can be seen
by the clusters over the LEXSUB and CLLS data (Table 11), which denote a clear sense
distinction, and by the Uiaa and Umid from the Usim gold standard.
Tavolo 10
Ranking of lemmas (l) by the gold-standards, and by VR and SEP for LEXSUB data.
L by Umid
L by Uiaa
L by VR
L by SEP
raw.a
function.n
throw.v
hold.v
0.67 work.v
strong.a
special.a
throw.v
hard.r
solid.a
put.v
field.n
work.v
raw.a
strong.a
throw.v
put.v
hard.r
put.v
work.v
soft.a
heavy.a
hold.v
strong.a
flat.a
field.n
throw.v
special.a
account.n
execution.n
check.v
hard.r
shed.v
solid.a
skip.v
right.r
ring.n
stiff.a
dismiss.v
match.n
hard.r
function.n
ring.n
put.v
clear.v
match.n
draw.v
lead.n
work.v
raw.a
execution.n
soft.a
paper.n
rude.a
poor.a
tap.v
rude.a
range.n
heavy.a
light.a
function.n
dry.a
dismiss.v
check.v
heavy.a
special.a
function.n
stiff.a
rude.a
draw.v
check.v
stiff.a
shed.v
lead.n
right.r
hold.v
field.n
shade.n
poor.a
hold.v
lead.n
solid.a
light.a
figure.n
draw.v
soft.a
execution.n
dismiss.v
tap.v
clear.v
paper.n
soft.a
flat.a
figure.n
shed.v
ring.n
heavy.a
match.n
dry.a
rude.a
account.n
paper.n
flat.a
range.n
strong.a
figure.n
post.n
special.a
call.v
raw.a
clear.v
right.r
call.v
account.n
field.n
charge.v
shade.n
post.n
skip.v
tap.v
range.n
dry.a
clear.v
light.a
lead.n
execution.n
fire.v
poor.a
stiff.a
account.n
post.n
check.v
shed.v
solid.a
paper.n
charge.v
tap.v
skip.v
right.r
call.v
figure.n
shade.n
flat.a
fire.v
charge.v
dismiss.v
draw.v
fire.v
skip.v
dry.a
light.a
range.n
poor.a
0.73 match.n
0.70
0.68
0.66
0.68
0.76
0.69
ring.n
shade.n
charge.v
post.n
call.v
fire.v
SEP
0.60
0.59
0.44
0.12
0.70
0.54
0.58
0.55
0.68
0.58
0.74
0.68
0.54
0.73
0.51
0.72
0.80
0.69
0.58
0.65
0.71
0.63
0.56
0.75
0.71
0.60
0.67
0.48
0.57
0.65
0.74
0.67
VR
0.67
0.68
1.26
7.18
0.50
0.42
0.85
0.71
0.81
0.48
0.71
0.36
0.46
3.44
1.33
2.80
1.28
0.64
1.27
1.84
2.18
1.03
1.24
2.35
2.48
2.46
3.19
2.10
4.22
7.26
4.74
3.86
2.60
3.36
4.76
5.15
4.87
3.56
2.76
8.50
lemma
k Umid Uiaa
0.39
0.52
0.61
0.17
0.60
0.62
0.48
0.49
0.38
0.44
0.63
0.70
0.64
0.18
0.46
0.39
0.64
0.48
0.33
0.44
0.34
0.33
0.50
0.45
0.44
0.53
0.22
0.53
0.30
0.24
0.38
0.47
0.34
0.70
0.45
0.49
0.36
0.73
0.73
0.53
0.66
0.35
0.52
0.93
0.57
0.34
0.65
0.53
0.70
0.51
0.49
0.32
0.27
0.65
0.78
0.50
0.34
0.47
0.59
0.63
0.43
0.53
0.40
0.70
0.85
0.14
0.69
0.61
0.42
0.68
0.59
0.25
0.74
0.37
0.63
0.47
0.49
0.29
0.31
0.50
2
2
2
2
2
2
2
2
2
2
2
2
2
3
3
3
3
3
3
3
3
3
3
3
4
4
4
4
4
5
5
5
5
5
6
6
6
6
6
7
account.n
check.v
dismiss.v
fire.v
heavy.a
put.v
right.r
shed.v
skip.v
soft.a
solid.a
throw.v
work.v
call.v
execution.n
figure.n
hard.r
hold.v
match.n
paper.n
poor.a
ring.n
stiff.a
tap.v
flat.a
function.n
post.n
rude.a
shade.n
charge.v
dry.a
field.n
range.n
special.a
clear.v
lead.n
light.a
raw.a
strong.a
draw.v
Table 11
COMPS obtained from LEXSUB and CLLS for fire.v and solid.a.

fire.v
  LEXSUB  s# 1857, 1852, 1859, 1855, 1851, 1860
          substitutes: discharge, shoot at, launch, shoot
  LEXSUB  s# 1858, 1856, 1853, 1854
          substitutes: sack, dismiss, lay off
  CLLS    s# 1857, 1852, 1859, 1855, 1851, 1860
          substitutes: balear, lanzar, aparecer, prender fuego, disparar, golpear,
          apuntar, detonar, abrir fuego
  CLLS    s# 1858, 1856, 1853, 1854
          substitutes: correr, dejar ir, delar sin trabajo, despedir, desemplear,
          liquidar, dejar sin trabajo, echar, dejar sin empleo

solid.a
  LEXSUB  s# 1081, 1083, 1087
          substitutes: solid, sound, set, strong, firm, rigid, dry, concrete, hard
  LEXSUB  s# 1090, 1082, 1088, 1085, 1084, 1089, 1086
          substitutes: fixed, secure, substantial, valid, reliable, good, sturdy,
          respectable, convincing, sound, substantive, dependable, strong, genuine,
          cemented, firm, accurate, stable
  CLLS    s# 1084
          substitutes: estable, solido, integro, formal, seguro, firme, real,
          consistente, fuerte, fundado
  CLLS    s# 1090, 1081, 1087, 1086, 1082, 1083, 1085, 1088, 1089
          substitutes: fidedigno, con cuerpo, conciso, estable, macizo, solido,
          con fundamentos, tempano, fundamentado, confiable, real, seguro, firme,
          consistente, fuerte, estricto, congelado, en estado solido, resistente,
          duro, bien fundado, fundado
The measures WPR, VR, V, and pF are all higher for the more clusterable fire.v compared
with solid.a, whereas SEP is lower, as anticipated. The two lemmas were selected as
examples with the same number of COMPS to allow for a comparison of the values. The
overlap measure ncs is higher for solid.a, as anticipated.26
Note that for the highly clusterable lemma fire.v there are no substitutes in common
in the two groupings with either the LEXSUB or CLLS data because there is no substitute
overlap in the sentences, which results in the COMPS and CLIQUES solutions being
equivalent, whereas for solid.a there are several substitutes shared by the groupings for
LEXSUB (per esempio., strong) and CLLS (per esempio., solido).
26 The CLIQUES clustering gives a different number of clusters to the two lemmas, so these two lemmas
would be in different polysemy bands for the correlation experiments on ncs since we control for
polysemy.

Table 12
Values of clusterability metrics for the examples fire.v and solid.a.

intra-clust (on COMPS)            fire.v        solid.a
  LEXSUB    SEP                   0.122         0.584
            VR                    7.178         0.713
            WPR                   1.732         0.845
  CLLS      SEP                   0.179         0.685
            VR                    4.579         0.459
            WPR                   1.795         0.707

inter-clust (LEXSUB and CLLS)     fire.v        solid.a
            pF                    1             0.081
            V                     1             0.590

baseline (on CLIQUES)             fire.v        solid.a
  LEXSUB    ncs                   1.0 (2 #cl)   1.5 (4 #cl)
  CLLS      ncs                   1 (2 #cl)     2.1 (7 #cl)

Gold standard from Usim           fire.v        solid.a
            Uiaa                  0.930         0.490
            Umid                  0.169         0.630

6. Conclusions and Future Work

In this article, we have introduced the theoretical notion of clusterability from machine
learning discussed by Ackerman and Ben-David (2009a) and argued that it is relevant to
WSI since lemmas vary in their degree of partitionability, as highlighted in the linguis-
tics literature (Tuggy 1993) and supported by evidence from annotation studies (Chen
and Palmer 2009; Erk, McCarthy, and Gaylord 2009, 2013). We have demonstrated here
how clustering of translation or paraphrase data can be used with clusterability mea-
sures to estimate how easily a word's usages can be partitioned into discrete senses. In
addition to the intra-clust measures from the machine learning literature, we have also
operationalized clusterability as consistency in clustering across information sources
using clustering solutions from translation and paraphrase data together. We refer to
this second set of measures as inter-clust measures.
We conducted two sets of experiments. In the first we controlled for polysemy by
performing correlations between clusterability estimates and our gold standard on our
lemmas in three polysemy bands, which allows us to look at correlation independent
of polysemy. In the second set of experiments we used linear regression on the data
from all lemmas together, which allows us to see how polysemy and clusterability can
work together as predictors. We find that the machine learning metrics SEP and VR
produce the most promising results. The inter-clust metrics (V and pF) are interesting in
that they consider the congruence of different views of the same underlying usages,
but although there are some promising results, the measures are not as consistent
and in particular in the second set of experiments do not outperform the baseline.
This may be due to their reliance on CLLS, which generally produces weaker results
compared to LEXSUB. Our baseline, which measures the amount of overlap in overlap-
ping clustering solutions, shows consistently weaker performance than the intra-clust
measures.
A variant of the inter-clust measures we would like to explore is a comparison of
results from different clustering algorithms. Because more clusterable data is computa-
tionally easier to cluster (Ackerman and Ben-David 2009a), we assume more clusterable
data should produce closer results across different algorithms operating on the same
input data. We plan to test this empirically in future.
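A sketch of that test, assuming vector representations of a lemma's instances as input: cluster the same data with two different algorithms and score the congruence of the two solutions with V, as in our inter-clust measures. The toy data below are hypothetical.

    # Congruence of two clustering algorithms on the same instance vectors.
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering, KMeans
    from sklearn.metrics import v_measure_score

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (10, 4)), rng.normal(5, 1, (10, 4))])

    a = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    b = AgglomerativeClustering(n_clusters=2).fit_predict(X)
    congruence = v_measure_score(a, b)  # near 1 when the two solutions agree,
                                        # as we would expect for clusterable data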
Clusterability metrics should be useful in planning annotation projects (and esti-
mating their costs) as well as for determining the appropriate lexical representation for
a lemma. A more clusterable lemma is anticipated to be better suited to the traditional
hard-clustering, winner-takes-all WSD methodology, whereas for a less clusterable
lemma a more complex soft-clustering approach should be considered, and more
time and expertise should be anticipated for any annotation and verification tasks. For
some tasks, it may be worthwhile to focus disambiguation efforts only on lemmas with a
reasonable level of partitionability.
We believe that notions of clusterability from machine learning are particularly
relevant to WSI and the field of word meaning representation in general. These notions
might prove useful in other areas of computational linguistics and lexical semantics in
particular. One such area to explore would be clustering predicate-argument data (Sun
and Korhonen 2009; Schulte im Walde 2006).
All the metrics and gold standards measure clusterability on a continuum. We have
yet to address the issue of where the cut-off points on that continuum for alternate
representations might be. There is also the issue that for a given word, there may be
some meanings which are distinct and others that are intertwined. It may in future be
possible to find contiguous regions of the data that are clusterable, even if there are
other regions where the meanings are less distinguishable.
The paraphrase and translation data we have used to examine clusterability metrics
have been produced manually. In future work, the measures could be applied to auto-
matically generated paraphrases and translations or to vector-space or word (or phrase)
embedding representations of the instances. Use of automatically produced data would
allow us to measure clusterability over a larger vocabulary and corpus of instances but
we would need to find an appropriate gold standard. One option might be evidence of
inter-tagger agreement from corpus annotation studies (Passonneau et al. 2012) or data
on ease of word sense alignment (Eom, Dickinson, and Katz 2012).
Appendix A: Individual Spearman’s Correlation Trials
Tables A.1–A.5 provide the details of the individual Spearman's correlation trials of clus-
terability measures against the gold standards reported in Section 5.1. Correlations in
the counter-intuitive direction are noted by opp in the final column; all other correlations
are in the anticipated direction. In the same column, we use *
for statistical significance with p < 0.05 and ** for p < 0.01. We use only those polysemy
bands where there are at least five lemmas within the polysemy range for that band.
The number of lemmas (#) in each band is shown within parentheses.
Table A.1
Correlation of the intra-clust k-means metrics on CLLS against the Usim gold-standard rankings
Uiaa and Umid.

Band (#)   Clusterability measure   Usim measure   ρ         sig/opp
low (22)   VR                       Umid           −0.4349   *
low (22)   VR                       Uiaa            0.4539   *
low (22)   SEP                      Umid            0.6077   **
low (22)   SEP                      Uiaa           −0.2041
low (22)   WPR                      Umid           −0.1187
low (22)   WPR                      Uiaa           −0.0266   opp
Table A.2
Correlation of the intra-clust k-means metrics on LEXSUB with the Usim gold-standard estimates
Uiaa and Umid.

Band (#)   measure1   measure2   ρ         sig/opp
low (29)   VR         Umid       −0.6058   **
mid (10)   VR         Umid       −0.4073
low (29)   VR         Uiaa        0.4049   *
mid (10)   VR         Uiaa        0.2364
low (29)   SEP        Umid        0.359
mid (10)   SEP        Umid        0.7416   *
low (29)   SEP        Uiaa       −0.3038
mid (10)   SEP        Uiaa       −0.6606   *
low (29)   WPR        Umid       −0.4161   *
mid (10)   WPR        Umid       −0.4316
low (29)   WPR        Uiaa        0.2739
mid (10)   WPR        Uiaa        0.3818
Table A.3
Correlation of the inter-clust metrics on LEXSUB-CLLS with the Usim gold-standards Uiaa and
Umid.

Band (#)   measure1   measure2   ρ         sig/opp
low (29)   pF         Umid       −0.1365
mid (5)    pF         Umid       −0.5
low (29)   pF         Uiaa        0.1796
mid (5)    pF         Uiaa        0.9       *
low (29)   V          Umid       −0.2456
mid (5)    V          Umid        0
low (29)   V          Uiaa        0.3849    *
mid (5)    V          Uiaa        0.6
Table A.4
Correlation of the baseline ncs operating on CLLS with the Usim gold-standards Uiaa and Umid.

Band (#)   measure1   measure2   ρ         sig/opp
low (14)   ncs        Umid        0.4381
mid (17)   ncs        Umid        0.0308
high (9)   ncs        Umid       −0.4622   opp
low (14)   ncs        Uiaa       −0.3455
mid (17)   ncs        Uiaa       −0.4948   *
high (9)   ncs        Uiaa        0.3713   opp
Table A.5
Correlation of the baseline ncs operating on LEXSUB with the Usim gold-standards Uiaa and
Umid.

Band (#)    measure1   measure2   ρ         sig/opp
low (14)    ncs        Umid        0.2668
mid (19)    ncs        Umid        0.2204
high (10)   ncs        Umid       −0.179    opp
low (14)    ncs        Uiaa       −0.3327
mid (19)    ncs        Uiaa       −0.2447
high (10)   ncs        Uiaa        0.0617   opp
Acknowledgments
This work was partially supported by National Science Foundation grant IIS-0845925 to K. E. We thank the anonymous reviewers for many helpful comments and suggestions.
References
Ackerman, Margareta and Shai Ben-David. 2009a. Clusterability: A theoretical study. Journal of Machine Learning Research - Proceedings Track, 5:1–8.
Ackerman, Margareta and Shai Ben-David. 2009b. Clusterability: A theoretical study. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1–8, Clearwater Beach, FL.
Agirre, Eneko and Philip Edmonds, editors. 2006. Word Sense Disambiguation, Algorithms and Applications. Springer.
Apidianaki, Marianna. 2008. Translation-oriented word sense induction based on parallel corpora. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), pages 3269–3275, Marrakech.
Apidianaki, Marianna. 2009. Data-driven semantic analysis for multilingual WSD and lexical selection in translation. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL'09), pages 77–85, Athens.
Apidianaki, Marianna, Emilia Verzeni, and Diana McCarthy. 2014. Semantic clustering of pivot paraphrases. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 4270–4275, Reykjavik.
Artiles, Javier, Enrique Amigó, and Julio Gonzalo. 2009. The role of named entities in Web People Search. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP'09), pages 534–542, Singapore.
Bannard, Colin and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 597–604, Ann Arbor, MI.
Bansal, Mohit, John DeNero, and Dekang Lin. 2012. Unsupervised translation sense clustering. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL: HLT), pages 773–782, Montréal.
Biemann, Chris and Valerie Nygaard. 2010. Crowdsourcing WordNet. In Proceedings of the 5th Global WordNet Conference, Mumbai.
Carpuat, Marine and Dekai Wu. 2007. Improving statistical machine translation using word sense disambiguation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 61–72, Prague.
Chen, Jinying and Martha Palmer. 2009. Improving English verb sense disambiguation performance with linguistically motivated features and clear sense distinction boundaries. Journal of Language Resources and Evaluation, Special Issue on SemEval-2007, 43:181–208.
Copestake, Ann and Ted Briscoe. 1995. Semi-productive polysemy and sense extension. Journal of Semantics, 12:15–67.
Cruse, D. A. 2000. Aspects of the microstructure of word meanings. In Yael Ravin and Claudia Leacock, editors, Polysemy: Theoretical and Computational Approaches. Oxford University Press, pages 30–51.
Di Marco, Antonio and Roberto Navigli. 2013. Clustering and diversifying Web search results with graph-based word sense induction. Computational Linguistics, 39(3):709–754.
Dyvik, Helge. 1998. Translations as semantic mirrors: From parallel corpus to Wordnet. In Proceedings of the Workshop Multilinguality in the Lexicon II at the 13th Biennial European Conference on Artificial Intelligence (ECAI'98), pages 24–44, Brighton.
Eom, Soojeong, Markus Dickinson, and Graham Katz. 2012. Using semi-experts to derive judgments on word sense alignment: A pilot study. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 605–611, Istanbul.
Epter, Scott, Mukkai Krishnamoorthy, and Mohammed Zaki. 1999. Clusterability detection and initial seed selection in large data sets. Technical Report 99-6, Rensselaer Polytechnic Institute, Computer Science Department, Troy, NY.
Erk, Katrin, Diana McCarthy, and Nick Gaylord. 2009. Investigations on word senses and word usages. In Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, pages 10–18, Suntec.
Erk, Katrin, Diana McCarthy, and Nick Gaylord. 2013. Measuring word meaning in context. Computational Linguistics, 39(3):511–554.
Fellbaum, Christiane, editor. 1998. WordNet, An Electronic Lexical Database. The MIT Press, Cambridge, MA.
Goldberg, Mark K., Mykola Hayvanovych, and Malik Magdon-Ismail. 2010. Measuring similarity between sets of overlapping clusters. In SocialCom/PASSAT, pages 303–308, Minneapolis, MN.
Hanks, Patrick. 2000. Do word meanings exist? Computers and the Humanities, Senseval Special Issue, 34(1–2):205–215.
Hovy, Eduard, Mitch Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. OntoNotes: The 90% solution. In Proceedings of the 2006 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL: HLT), pages 57–60, New York, NY.
Ide, Nancy, Tomaž Erjavec, and Dan Tufiş. 2002. Sense discrimination with parallel corpora. In Proceedings of the ACL'02 Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, pages 54–60, Philadelphia, PA.
Ide, Nancy and Yorick Wilks. 2006. Making sense about sense. In Eneko Agirre and Phil Edmonds, editors, Word Sense Disambiguation, Algorithms and Applications. Springer, pages 47–73.
Jurgens, David and Ioannis Klapaftis. 2013. SemEval-2013 Task 13: Word sense induction for graded and non-graded senses. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 290–299, Atlanta, GA.
Kilgarriff, Adam. 1998. 'I don't believe in word senses'. Computers and the Humanities, 31(2):91–113.
Kilgarriff, Adam. 2006. Word senses. In Eneko Agirre and Phil Edmonds, editors, Word Sense Disambiguation, Algorithms and Applications. Springer, pages 29–46.
Krippendorff, Klaus. 1980. Content Analysis: An Introduction to Methodology. Sage Publications, Inc., Beverly Hills, CA.
Lakoff, George. 1987. Women, Fire, and Dangerous Things: What Categories Reveal about the Mind. University of Chicago Press.
Landes, Shari, Claudia Leacock, and Randee I. Tengi. 1998. Building semantic concordances. In Christiane Fellbaum, editor, WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, pages 199–237.
Lefever, Els and Veronique Hoste. 2010. SemEval-2010 Task 3: Cross-lingual word sense disambiguation. In Proceedings of the Fifth International Workshop on Semantic Evaluations (SemEval-2010), pages 15–20, Uppsala.
Manandhar, Suresh, Ioannis Klapaftis, Dmitriy Dligach, and Sameer Pradhan. 2010. SemEval-2010 Task 14: Word sense induction and disambiguation. In Proceedings of the Fifth International Workshop on Semantic Evaluation (SemEval-2010), pages 63–68, Uppsala.
McCarthy, Diana. 2009. Word sense disambiguation: An overview. Language and Linguistics Compass, 3(2):537–558.
McCarthy, Diana. 2011. Measuring similarity of word meaning in context with lexical substitutes and translations. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing, CICLing 2011, Pt. I (Lecture Notes in Computer Science, LNTCS 6608). Springer, pages 238–252.
McCarthy, Diana and Roberto Navigli. 2007. SemEval-2007 Task 10: English lexical substitution task. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-2007), pages 48–53, Prague.
McCarthy, Diana and Roberto Navigli. 2009. The English lexical substitution task. Language Resources and Evaluation Special Issue on Computational Semantic Analysis of Language: SemEval-2007 and Beyond, 43(2):139–159.
Mihalcea, Rada, Ravi Sinha, and Diana McCarthy. 2010. SemEval-2010 Task 2: Cross-lingual lexical substitution. In Proceedings of the Fifth International Workshop on Semantic Evaluation (SemEval-2010), pages 9–14, Uppsala.
Mitchell, Jeff and Mirella Lapata. 2008. Vector-based models of semantic composition. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL'08: HLT), pages 236–244, Columbus, OH.
Navigli, Roberto. 2009. Word sense disambiguation: A survey. ACM Computing Surveys, 41(2):1–69.
Ostrovsky, Rafail, Yuval Rabani, Leonard J. Schulman, and Chaitanya Swamy. 2006. The effectiveness of Lloyd-type methods for the k-means problem. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 165–176, Berkeley, CA.
Palmer, Martha, Hoa Trang Dang, and Joseph Rosenzweig. 2000. Semantic tagging for the Penn treebank. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC'00), pages 699–704, Athens.
Passonneau, Rebecca, Ansaf Salleb-Aouissi, Vikas Bhardwaj, and Nancy Ide. 2010. Word sense annotation of polysemous words by multiple annotators. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), pages 3244–3249, Valletta.
Passonneau, Rebecca J., Collin F. Baker, Christiane Fellbaum, and Nancy Ide. 2012. The MASC word sense corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 3025–3030, Istanbul.
Pedersen, Ted. 2006. Unsupervised corpus-based methods for WSD. In Eneko Agirre and Phil Edmonds, editors, Word Sense Disambiguation, Algorithms and Applications. Springer, pages 131–166.
Resnik, Philip and David Yarowsky. 1997. A perspective on word sense disambiguation methods and their evaluation. In Proceedings of the ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How, pages 79–86, Washington, DC.
Resnik, Philip and David Yarowsky. 2000. Distinguishing systems and distinguishing senses: New evaluation methods for word sense disambiguation. Natural Language Engineering, 5(3):113–133.
Rosenberg, Andrew and Julia Hirschberg. 2007. V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 410–420, Prague.
Schulte im Walde, Sabine. 2006. Experiments on the automatic induction of German semantic verb classes. Computational Linguistics, 32(2):159–194.
Sun, Lin and Anna Korhonen. 2009. Improving verb clustering with automatically acquired selectional preferences. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP'09), pages 638–647, Singapore.
Tuggy, David H. 1993. Ambiguity, polysemy and vagueness. Cognitive Linguistics, 4(2):273–290.
Véronis, Jean. 2004. HyperLex: Lexical cartography for information retrieval. Computer Speech & Language, 18(3):223–252.
Yuret, Deniz. 2007. KU: Word sense disambiguation by substitution. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 207–214, Prague.
Zhang, Bin. 2001. Dependence of clustering algorithm performance on clustered-ness of data. Technical Report HP-2001-91, Hewlett-Packard Labs.