Measuring Word Meaning in Context

Katrin Erk∗
University of Texas at Austin

Diana McCarthy∗∗
University of Cambridge

Nicholas Gaylord∗
University of Texas at Austin
Word sense disambiguation (WSD) is an old and important task in computational linguistics
that still remains challenging, to machines as well as to human annotators. Recently there have
been several proposals for representing word meaning in context that diverge from the traditional
use of a single best sense for each occurrence. They represent word meaning in context through
multiple paraphrases, as points in vector space, or as distributions over latent senses. New
methods of evaluating and comparing these different representations are needed.
In this paper we propose two novel annotation schemes that characterize word meaning in
context in a graded fashion. In WSsim annotation, the applicability of each dictionary sense
is rated on an ordinal scale. Usim annotation directly rates the similarity of pairs of usages of
the same lemma, again on a scale. We find that the novel annotation schemes show good inter-
annotator agreement, as well as a strong correlation with traditional single-sense annotation and
with annotation of multiple lexical paraphrases. Annotators make use of the whole ordinal scale,
and give very fine-grained judgments that “mix and match” senses for each individual usage.
We also find that the Usim ratings obey the triangle inequality, justifying models that treat usage
similarity as metric.
There has recently been much work on grouping senses into coarse-grained groups. We
demonstrate that graded WSsim and Usim ratings can be used to analyze existing coarse-grained
sense groupings to identify sense groups that may not match intuitions of untrained native
speakers. In the course of the comparison, we also show that the WSsim ratings are not subsumed
by any static sense grouping.
∗ Linguistics Department. CLA Liberal Arts Building, 305 E. 23rd St. B5100, Austin, TX, USA 78712.
E-mail: katrin.erk@mail.utexas.edu, nlgaylord@utexas.edu.
∗∗ Visiting Scholar, Department of Theoretical and Applied Linguistics, University of Cambridge,
Sidgwick Avenue, Cambridge, CB3 9DA, UK. E-mail: diana@dianamccarthy.co.uk.
Submission received: 3 November 2011; revised version received: 30 April 2012; accepted for publication:
25 June 2012.
doi:10.1162/COLI_a_00142
© 2013 Association for Computational Linguistics
1. Introduction
Word sense disambiguation (WSD) is a task that has attracted much work in computa-
tional linguistics (see Agirre and Edmonds [2007] and Navigli [2009] for an overview),
including a series of workshops, SENSEVAL (Kilgarriff and Palmer 2000; Preiss and
Yarowsky 2001; Mihalcea and Edmonds 2004) and SemEval (Agirre, Màrquez, and
Wicentowski 2007; Erk and Strapparava 2010), which were originally organized
expressly as a forum for shared tasks in WSD. In WSD, polysemy is typically modeled
through a dictionary, where the senses of a word are understood to be mutually disjoint.
The meaning of an occurrence of a word is then characterized through the best-fitting
among its dictionary senses.
The assumption of senses that are mutually disjoint and that have clear bound-
aries has been drawn into doubt by lexicographers (Kilgarriff 1997; Hanks 2000), lin-
guists (Tuggy 1993; Cruse 1995), and psychologists (Kintsch 2007). Hanks (2000) argues
that word senses have uses where they clearly fit, and borderline uses where only a
few of a sense’s identifying features apply. This notion matches results in psychol-
ogy on human concept representation: Mental categories show “fuzzy boundaries,"
and category members differ in typicality and degree of membership (Rosch 1975;
Rosch and Mervis 1975; Hampton 2007). This raises the question of annotation: Is it
possible to collect word meaning annotation that captures degrees to which a sense
applies?
Recently, there have been several proposals for modeling word meaning in context
that can represent different degrees of similarity to a word sense, as well as different
degrees of similarity between occurrences of a word. The SemEval Lexical Substitu-
tion task (McCarthy and Navigli 2009) represents each occurrence through multiple
weighted paraphrases. Other approaches represent meaning in context through a vector
space model (Erk and Pado 2008; Mitchell and Lapata 2008; Thater, F ¨urstenau, E
Pinkal 2010) or through a distribution over latent senses (Dinu and Lapata 2010). Again,
this raises the question of annotation: Can human annotators give fine-grained judg-
ments about degrees of similarity between word occurrences, as these computational
models predict?
The question that we explore in this paper is: Can word meaning be described
through annotation in the form of graded judgments? We want to know whether an-
notators can provide graded meaning annotation in a consistent fashion. Also, we
want to know whether annotators will use the whole graded scale, or whether
they will fall back on binary ratings of either “identical” or “different.” Our question,
however, is not whether annotators can be trained to do this. Rather, our
aim is to describe word meaning as language users perceive it. We want to tap into
the annotators’ intuitive notions of word meaning. As a consequence, we use un-
trained annotators. We view it as an important aim on its own to capture lan-
guage users’ intuitions on word meaning, but it is also instrumental in answering
our first question, of whether word meaning can be described through graded
annotator judgments: Training annotators in depth on how to distinguish pre-
defined hand-crafted senses could influence them to assign those senses in a binary
fashion.
We introduce two novel annotation tasks in which human annotators characterize
word meaning in context. In the first task, they rate the applicability of dictionary
senses on a graded scale. In the second task, they rate the similarity between pairs of
usages of the same word, also on a graded scale. In designing the annotation tasks, we
utilize techniques from psycholinguistic experimentation: Annotators give ratings on a
scale, rather than selecting a single label; we also use multiple annotators for each item,
retaining all annotator judgments.1
The result of this graded annotation can then be used to evaluate computational
models of word meaning: either to evaluate graded models of word meaning, or to
evaluate traditional WSD systems in a graded fashion. They can also be used to ana-
lyze existing word sense inventories, in particular to identify sense distinctions worth
revisiting—we say more on this latter use subsequently.
Our aim is not to improve inter-annotator agreement over traditional sense an-
notation. It is highly unlikely that ratings on a scale would ever achieve higher exact
agreement than binary annotation. Our aim is also not to maximize exact agreement, as
we expect to see individual differences in perceived meaning, and want to capture those
differences. Still it is desirable to have an end product of the annotation that is robust
against such individual differences. In order to achieve this, we average judgments over
multiple annotators after first inspecting pairwise correlations between annotators to
ensure that they are all doing their work diligently and with similar outcomes.
Analyzing the annotation results, we find that the annotators make use of inter-
mediate points on the graded scale and do not treat the task as inherently binary. We
find that there is good inter-annotator agreement, measured as correlation. There is
also a highly significant correlation across tasks and with traditional WSD and lexical
substitution tasks. This indicates that the annotators performed these tasks in a con-
sistent fashion. It also indicates that diverse ways of representing word meaning in
context—single best sense, weighted senses, multiple paraphrases, usage similarity—
yield similar characterizations. We find that annotators frequently give high scores to
more than one sense, in a way that is not remedied by a more coarse-grained sense
inventory. Indeed, the annotations are often inconsistent with disjoint sense partitions.
The work reported here is based on our earlier study in Erk, McCarthy, and
Gaylord (2009). The current paper extends the previous work in three ways.
1. We add extensive new annotation to corroborate our findings from
the previous, smaller study. In this new, second round of annotation,
annotators do the two graded ratings tasks as well as traditional
single-sense annotation and annotation with paraphrases (lexical
substitutes), all on the same data. Each item is rated by eight annotators
in parallel. This setting, with four different types of word meaning
annotation on the same data, allows us to compare annotation results
across tasks more directly than before.2
2. We test whether the similarity ratings on pairs of usages obey the triangle
inequality, and find that they do. This point is interesting for psychological
reasons. Tversky and Gati (Tversky 1977; Tversky and Gati 1982) found
that similarity ratings on words did not obey the triangle inequality—
although, unlike our study, they were dealing with words out of context.
The fact that usage similarity ratings obey the triangle inequality is also
important for modeling and annotation purposes (see the sketch following this list).
1 We do not use as many raters per item as is usual in psycholinguistics, however, as our aim is to cover a
sizeable amount of corpus data.
2 The annotation data from this second round are available at http://www.dianamccarthy.co.uk/
downloads/WordMeaningAnno2012/.
3. We examine the extent to which our graded annotation accords with two
existing coarse-grained sense groupings, and we demonstrate that our
graded annotations can be used to double-check on sense groupings and
find potentially problematic groupings.
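The triangle inequality check mentioned in point 2 can be illustrated with a small sketch. This is our own illustration, not the authors' analysis code; in particular, converting a similarity rating into a dissimilarity by subtracting it from the top of the scale is an assumption we make here.

```python
from itertools import combinations

def triangle_violations(sim, scale_max=5):
    """sim maps frozenset({usage_i, usage_j}) to a (mean) similarity rating on the 1-5 scale.
    Dissimilarity is taken as scale_max minus the rating (an assumed conversion).
    Returns the triples (a, b, c) with d(a, c) > d(a, b) + d(b, c)."""
    usages = sorted({u for pair in sim for u in pair})
    d = {pair: scale_max - rating for pair, rating in sim.items()}
    bad = []
    for a, b, c in combinations(usages, 3):
        # test each of the three sides of the triangle as the "long" side
        for x, y, z in ((a, b, c), (b, a, c), (a, c, b)):
            if d[frozenset((x, z))] > d[frozenset((x, y))] + d[frozenset((y, z))]:
                bad.append((x, y, z))
    return bad

# toy example: three usages of one lemma with hypothetical mean ratings
ratings = {frozenset((1, 2)): 4.0, frozenset((1, 3)): 2.5, frozenset((2, 3)): 3.5}
print(triangle_violations(ratings))   # [] -> no violation for these ratings
```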
2. Background
In this section, we offer an overview of previous word sense annotation efforts, and then
discuss alternative approaches to the annotation and modeling of word meaning.
2.1 Word Sense Annotation
Inter-annotator agreement (also called inter-tagger agreement, or ITA) is one indicator
of the difficulty of the task of manually assigning word senses (Krishnamurthy and
Nicholls 2000). With WordNet, the sense inventory currently most widely used in
word sense annotation, ITA ranges from 67% A 78% (Landes, Leacock, and Tengi 1998;
Mihalcea, Chklovski, and Kilgarriff 2004; Snyder and Palmer 2004), depending on
factors such as degree of polysemy and inter-relatedness of the senses. This problem is
not specific to WordNet. Annotation efforts based on other dictionaries have achieved
similar ITA levels, as shown in Table 1. The first group in that table shows two corpora
in which all open-class words are annotated for word sense, in both cases using
WordNet. The second group consists of two English lexical sample corpora, in which
only some target words are annotated. One of them uses WordSmyth senses for verbs
and WordNet for all other parts of speech, and the other uses HECTOR, with similar
ITA, so the choice of dictionary does not seem to make much difference in this case.3
Next is SALSA, a German corpus using FrameNet frames as senses, then OntoNotes,
again an English lexical sample corpus. Inter-annotator agreement is listed in the last
column of the table; agreement is in general relatively low for the first four corpora,
which use fine-grained sense distinctions, and higher for SALSA and OntoNotes, which
have more coarse-grained senses.
Sense granularity has a clear impact upon levels of inter-annotator agreement
(Palmer, Dang, and Fellbaum 2007). ITA is substantially improved by using coarser-
grained senses, as seen in OntoNotes (Hovy et al. 2006), which uses an ITA of 90% as the
criterion for constructing coarse-grained sense distinctions. Although this strategy does
improve ITA, it does not eliminate the issues seen with more fine-grained annotation
efforts: For some lemmas, such as leave, 90% ITA is not reached even after multiple
re-partitionings of the semantic space (Chen and Palmer 2009). This suggests that the
meanings of at least some words may not be separable into senses distinct enough
for consistent annotation.4 Moreover, sense granularity does not appear to be the only
question influencing ITA differences between lemmas. Passonneau et al. (2010) found
three main factors: sense concreteness, specificity of the context in which the target word
occurs, and similarity between senses. It is worth noting that of these factors, only the
third can be directly addressed by a change in the dictionary.
3 HECTOR senses are described in richer detail than WordNet senses and the resource is strongly
corpus-based. We use WordNet in our work due to its high popularity and free availability.
4 Examples such as this indicate that there is at times a problem with clearly defining consistently
separable senses of a word. There is no clear measure of exactly how frequent such cases are, Tuttavia.
This is due in part to the fact that this question depends so heavily on the data being considered and the
distinctions being posited.
Table 1
Word sense-annotated data, with inter-annotator agreement (ITA).

Corpus                    Dictionary            Corpus reference                              ITA
SemCor                    WordNet               Landes, Leacock, and Tengi (1998)             78.6%
SensEval-3                WordNet               Snyder and Palmer (2004)                      72.5%
SensEval-1 lex. sample    HECTOR                Kilgarriff and Rosenzweig (2000)              66.5%
SensEval-3 lex. sample    WordNet, WordSmyth    Mihalcea, Chklovski, and Kilgarriff (2004)    67.3%
SALSA                     FrameNet              Burchardt et al. (2006)                       86%
OntoNotes                 OntoNotes             Hovy et al. (2006)                            most > 90%
Table 2
Best word sense disambiguation performance in SensEval/SemEval English lexical sample tasks.

Shared task    Shared task overview                          Best precision    Baseline
SensEval-1     Kilgarriff and Rosenzweig (2000)              77%               69%
SensEval-2     Edmonds and Cotton (2001)                     64%               51%
SensEval-3     Mihalcea, Chklovski, and Kilgarriff (2004)    73%               55%
SemEval-1      Pradhan et al. (2007)                         89%               (not given)
ITA levels in word sense annotation tasks are mirrored in the performance of WSD
systems trained on the annotated data. Table 2 shows results for the best systems that
participated in four English lexical sample tasks. With fine-grained sense inventories,
the top-ranking WSD systems participating in the event achieved precision scores of 73%
A 77% (Edmonds and Cotton 2001; Mihalcea, Chklovski, and Kilgarriff 2004). Current
state-of-the-art systems have made modest improvements on this; Per esempio, IL
system described by Zhong and Ng (2010) achieves 65.3% on the English lexical sample
at SENSEVAL-2, though the same system obtains 72.6%, just below Mihalcea, Chklovski,
and Kilgarriff (2004), on the English lexical sample at SENSEVAL-3. Nevertheless, the pic-
ture remains the same with systems getting around three out of four word occurrences
correct. Under a coarse-grained approach, system performance improves considerably
(Palmer, Dang, and Fellbaum 2007; Pradhan et al. 2007), with the best participating
system achieving a precision close to 90%.5 The merits of a coarser-grained approach
are still a matter of debate (Stokoe 2005; Ide and Wilks 2006; Navigli, Litkowski, and
Hargraves 2007; Brown 2010), however.
Although identifying the proper level of granularity for sense repositories has im-
portant implications for improving WSD, we do not focus on this question here. Piuttosto,
we propose novel annotation tasks that allow us to probe the relatedness between
dictionary senses in a flexible fashion, and to explore word meaning in context without
presupposing hard boundaries between usages. The resulting data sets can be used
to compare different inventories, coarse or otherwise. Furthermore, we hope that they
will prove useful for the evaluation of alternative representations of ambiguity in word
5 Zhong, Di, and Chan (2008) report similar results (89.1%) with their state-of-the-art system when
evaluating on the OntoNotes corpus, which is larger than the SENSEVAL data sets.
meaning (Erk and Pado 2008; Mitchell and Lapata 2008; Reisinger and Mooney 2010;
Thater, Fürstenau, and Pinkal 2010; Reddy et al. 2011; Van de Cruys, Poibeau, and
Korhonen 2011).
2.2 Representation of Word Meaning in Word Sense Inventories
One possible factor contributing to the difficulty of manual and automatic word sense
assignment is the design of word sense inventories themselves. As we have seen, come
difficulties are encountered across dictionaries, and it has been argued that there are
problems with the characterization of word meanings as sets of discrete and mutually
exclusive senses (Tuggy 1993; Cruse 1995; Kilgarriff 1997; Hanks 2000; Kintsch 2007).
2.2.1 Criticisms of Enumerative Approaches to Meaning. Dictionaries are practical resources
and the nature of the finished product depends upon the needs of the target audience, COME
well as budgetary and related constraints (cf. Hanks 2000). Consequently, dictionaries
differ in the words that they cover, and also in the word meanings that they distinguish.
Dictionary senses are generalizations over the meanings that a word can take, and these
generalizations themselves are abstractions over collected occurrences of the word in
different contexts (Kilgarriff 1992, 1997, 2006). Regardless of a dictionary’s granularity,
the possibility exists for some amount of detail to be lost as a result of this process.
Kilgarriff (1997) calls into question the possibility of general, all-purpose senses of
a word and argues that sense distinction only makes sense with respect to a given task.
Per esempio, in machine translation, the senses to be distinguished should be those
that lead to different translations in the target language. It has since been demonstrated
that this is in fact the case (Carpuat and Wu 2007a, 2007B). Hanks (2000) questions the
view of senses as disjoint classes defined by necessary and sufficient conditions. He
shows that even with a classic homonym like “bank,” some occurrences are more typical
examples of a particular sense than others. This notion of typicality is also important in
theories of concept representation in psychology (Murphy 2002). Theoretical treatments
of word meaning such as the Generative Lexicon (Pustejovsky 1991) also draw attention
to the subtle, yet reliable, fluctuations of meaning-in-context, and work in this paradigm
also provides evidence that two senses which may appear to be quite distinct can in
fact be quite difficult to distinguish in certain contexts (Copestake and Briscoe 1995,
page 53).
2.2.2 Psychological Research on Lexical and Conceptual Knowledge. Not all members of a
mental category are equal. Some are perceived as more typical than others (Rosch 1975;
Rosch and Mervis 1975; and many others), and even category membership itself is
clearer in some cases than in others (Hampton 1979). These results are about mental
concepts, however, rather than word meanings per se, which raises the question of
the relation between word meanings and conceptual knowledge. Murphy (1991, 2002)
argues that although not every concept is associated with a word, word meanings show
many of the same phenomena as concepts in general—word meaning is “made up of
pieces of conceptual structure” (Murphy 2002, page 391). A body of work in cognitive
linguistics also discusses the relation between word meaning and conceptual structure
(Coleman and Kay 1981; Taylor 2003).
Psycholinguistic studies on word meaning offer insight into the question of the
mental representation of word senses. Unlike homonym meanings, the senses of a
polysemous word are thought to be related, suggesting that the mental representations
of these senses may overlap as well. The psycholinguistic literature on this question
is not wholly clear-cut, but by and large does support the position that polysemous
senses are not entirely discrete in the mental lexicon. Whereas Klein and Murphy (2001,
2002) do provide evidence for discreteness of mental sense representations, it appears
as though these findings may be due in part to the particular senses included in their
studies (Klepousniotou, Titone, and Romero 2008).
Inoltre, many psycholinguistic studies have indeed found evidence for process-
ing differences between homonyms and polysemous words, using a variety of experi-
mental designs, including eye movements and reading times (Frazier and Rayner 1990;
Pickering and Frisson 2001) as well as response times in sensicality and lexical decision
tasks (Williams 1992; Klepousniotou 2002). Brown (2008, 2010) takes the question of
shared vs. separate meaning representations one step further in a semantic priming
study6 in which she shows that intuitive meaning-in-context similarity judgments have
a processing correlate in on-line sentence comprehension. Response time to the target
is a negative linear function of its similarity in meaning to the prime, and response
accuracy is a positive linear function of this similarity. In other words, the more similar
in meaning a prime–target pair was judged to be, the faster and more accurately sub-
jects responded. This provides empirical support for a processing correlate of graded
similarity-in-meaning judgments.
In our work reported here, we take inspiration from work in psychology and look
at ways to model word meaning more continuously. Even though there is still some
controversy, the majority of studies support the view that senses of polysemous words
are linked in their mental representations. In our work we do not make an explicit
distinction between homonymy and polysemy, but the data sets we have produced may
be useful for a future exploration of this distinction.
2.3 Alternative Approaches to Word Meaning
Earlier we suggested that word meaning may be better described without positing
disjoint senses. We now describe some alternatives to word sense inventory approaches
to word meaning, most of which do not rely on disjoint senses.
2.3.1 Substitution-Based Approaches. McCarthy and Navigli (2007) explore the use of
synonym or near-synonym lexical substitutions to characterize the meaning of word
occurrences. In contrast to dictionary senses, substitutes are not taken to partition
a word’s meaning into distinct senses. McCarthy and Navigli gathered their lexical
substitution data using multiple annotators. Annotators were allowed to provide up to
three paraphrases for each item. Data were gathered for 10 sentences per lemma for 210
lemmas, spanning verbs, nouns, adjectives, and adverbs. The annotation took the form
of each occurrence being associated with a multiset of supplied paraphrases, weighted
by the frequency with which each paraphrase was supplied. We make extensive use of
the LEXSUB dataset in our work reported here. An example sentence with substitutes
from the LEXSUB dataset (sentence 451) is given in Table 3.
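To make the form of this annotation concrete, a LEXSUB occurrence can be represented as a frequency-weighted multiset of substitutes (cf. Table 3). The following sketch is our own illustration, not part of the LEXSUB distribution; the variable names and the overlap measure are assumptions.

```python
from collections import Counter

# Substitutes supplied by the annotators for one occurrence of a target word;
# each annotator could propose up to three paraphrases.
occ_a = Counter({"original": 2, "recent": 2, "novel": 2, "different": 1, "additional": 1})
occ_b = Counter({"fresh": 3, "original": 1, "novel": 1})

def substitute_overlap(a, b):
    """A simple Dice-style overlap of two weighted substitute multisets (assumed measure)."""
    shared = sum((a & b).values())   # multiset intersection keeps the minimum counts
    total = sum(a.values()) + sum(b.values())
    return 2 * shared / total if total else 0.0

print(substitute_overlap(occ_a, occ_b))   # ~0.31 for these toy counts
```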
A related approach also characterizes meaning through equivalent terms, but terms
in another language. Resnik and Yarowsky (2000, page 10) suggest “to restrict a word
sense inventory to distinctions that are typically lexicalized cross-linguistically” [emphasis
in original]. They argue that such an approach will avoid being too fine-grained, E
that the distinctions that are made will be independently motivated by crosslinguistic
6 See McNamara (2005) for more information on priming studies.
Table 3
An example of annotation from the lexical substitution data set: sentence 451.

Sentence:      My interest in Europe’s defence policy is nothing new.
Annotation:    original 2; recent 2; novel 2; different 1; additional 1
trends. Although substitution and translation methods are not without their own issues
(Kilgarriff 1992, page 48), they constitute an approach to word meaning that avoids
many of the drawbacks of more traditional sense distinction and annotation. Some
cross-linguistic approaches group translations into disjoint senses (Lefever and Hoste
2010), whereas others do not (Mihalcea, Sinha, and McCarthy 2010).
2.3.2 Distributional Approaches. Recently there have been a growing number of distri-
butional approaches to representing word meaning in context. These models offer an
opportunity to model subtle distinctions in meaning between two occurrences of a word
in different contexts. In particular, they allow comparisons between two occurrences of
a word without having to classify them as having the same sense or different senses.
Some of these approaches compute a distributional representation for a word across all
its meanings, and then adapt this to a given sentence context (Landauer and Dumais
1997; Erk and Pado 2008; Mitchell and Lapata 2008; Thater, Fürstenau, and Pinkal 2010;
Van de Cruys, Poibeau, and Korhonen 2011). Others group distributional contexts into
senses. This can be done on the fly for a given occurrence (Erk and Pado 2010; Reddy
et al. 2011), or beforehand (Dinu and Lapata 2010; Reisinger and Mooney 2010). IL
latter two approaches then represent an occurrence through weights over those senses.
A third group of approaches is based on language models (Deschacht and Moens 2009;
Washtell 2010; Moon and Erk 2012): They infer other words that could be used in the
position of the target word.7
3. Two Novel Annotation Tasks
In this section we introduce two novel annotation schemes that draw on methods
common in psycholinguistic experiments, but uncommon in corpus annotation. Tra-
ditional word sense annotation usually assumes that there is a single correct label
for each markable. Annotators are trained to identify the correct labels consistently,
often with highly specific a priori guidelines. Multiple annotators are often used, Ma
despite the frequently low ITA in word sense annotation, differences between annotator
responses are often treated as the result of annotator error and are not retained in the
final annotation data.
In these respects, traditional word sense annotation tasks differ in design from
many psycholinguistic experiments, such as the ones discussed in the previous section.
Psycholinguistic experiments frequently do not make strong assumptions about how
participants will respond, and in fact are designed to gather data on that very ques-
zione. Participants are given general guidelines for completing the experiment but these
7 Distributional models for phrases have recently received much attention, even more so than models for
word meaning in context (Baroni and Zamparelli 2010; Coecke, Sadrzadeh, and Clark 2010; Mitchell and
Lapata 2010; Grefenstette and Sadrzadeh 2011; Socher et al. 2011). They are less directly relevant to the
current paper, Tuttavia, as we focus on eliciting judgments for individual words in sentence contexts,
rather than whole phrases.
Table 4
Interpretation of the five-point scale given to the annotators. This interpretation is the same for
the Usim and WSsim tasks.

1    completely different
2    mostly different
3    similar
4    very similar
5    identical
guidelines generally stop short of precise procedural detail, to avoid undue influence
over participant responses. All of the psycholinguistic studies discussed earlier used
participants naïve as to the purpose of the experiment, and who were minimally trained.
Responses are often graded in nature, involving ratings on an ordinal scale or in some
cases even a continuously valued dimension (per esempio., as in Magnitude Estimation). Mul-
tiple participants respond to each stimulus, but all participant responses are typically
retained, as there are often meaningful discrepancies in participant responses that are
not ascribable to error. All of the psycholinguistic studies discussed previously collected
data from multiple participants (up to 80 in the case of one experiment by Williams
[1992]).
The annotation tasks we present subsequently draw upon these principles of exper-
imental design. We collected responses using a scale, rather than binary judgments; we
designed the annotation tasks to be accomplishable without prior training and with
minimal guidelines, and we used multiple annotators (up to eight) and retained all
responses in an effort to capture individual differences. In the following, we describe
two different annotation tasks, one with and one without the use of dictionary senses.
Graded Ratings for Dictionary Senses. In our first annotation task, dubbed WSsim (for
Word Sense Similarity), annotators rated the applicability of WordNet dictionary senses,
using a five-point ordinal scale.8 Annotators rated the applicability of every single
WordNet sense for the target lemma, where a rating of 1 indicated that the sense
in question did not apply at all, and a rating of 5 indicated that the sense applied
completely to that occurrence of the lemma. Tavolo 4 shows the descriptions of the five
points on the scale that the annotators were given. By asking annotators to provide
ratings for each individual sense, we strive to eliminate all bias toward either single-
sense or multiple-sense annotation. By asking annotators to provide ratings on a scale,
we allow for the fact that senses may not be perceived in a binary fashion.
Graded Ratings for Usage Similarity. In our second annotation task, dubbed Usim (for
Usage Similarity), we collected annotations of word usages without recourse to dic-
tionary senses, by asking annotators to judge the similarity in meaning of one usage
of a lemma to other usages. Annotators were presented with pairs of contexts that
share a word in common, and were asked to rate how similar in meaning they perceive
those two occurrences to be. Ratings are again on a five-point ordinal scale; a rating of
1 indicated that the two occurrences of the target lemma were completely dissimilar in
meaning, and a rating of 5 indicated that the two occurrences of the target lemma were
identical in meaning. The descriptions of the five points on the scale, shown in Table 4,
8 The use of a five-point scale is a common choice when collecting ordinal ratings, as it allows more
detailed responses than the “yes/no/maybe” provided by a three-point scale.
were identical to those used in the WSsim task. Annotators were able to respond “I don’t
know” if they were unable to gauge the similarity in meaning of the two occurrences.9
Annotation Procedure. All annotation for this project was conducted over the Internet
in specially designed interfaces. In both tasks, all annotator responses were retained,
without resolution of disagreement between annotators. We do not focus on obtaining
a single “correct” annotation, but rather view all responses as valuable sources of
informazione, even when they diverge.
For each item presented, annotators additionally were provided a comment field
should they desire to include a more detailed response regarding the item in question.
They could use this, Per esempio, to comment on problems understanding the sentence.
The annotators were able to revisit previous items in the task. Annotators were not able
to skip forward in the task without rating the current item. If an annotator attempted to
submit an incomplete annotation they were prompted to provide a complete response
before proceeding. They were free to log out and resume later at any point, Tuttavia,
and also could access the instructions whenever they wanted.
Two Rounds of Annotation. We performed two rounds of the annotation experiments,
hereafter referred to as R1 and R2.10 Both annotation rounds included both a WSsim and
a Usim task, labeled in the subsequent discussion as WSsim-1 and Usim-1 for R1, E
WSsim-2 and Usim-2 for R2. An important part of the data analysis is to compare the
new, graded annotation to other types of annotation. We compare it to both traditional
word sense annotation, with a single best sense for each occurrence, and lexical
substitution, which characterizes each occurrence through paraphrases. In R1, we chose
annotation data that had previously been labeled with either traditional single sense
annotation or with lexical substitutions. R2 included two additional annotation tasks,
one involving traditional WSD methodology (WSbest) and a lexical substitution task
(SYNbest). In the SYNbest task, annotators provided a single best lexical substitution,
in contrast to the multiple substitutes annotators provided in the original LEXSUB data.11
Three annotators participated in each task in the R1, and eight annotators partici-
pated in R2. In R1, separate groups of annotators participated in WSsim and Usim an-
notation, whereas in R2 the same group of annotators was used for all annotation, so as
to allow comparison across tasks for the same annotator as well as across annotators. In
R2, Perciò, the same annotators did both traditional word sense annotation (WSbest)
and the graded word sense annotation of the WSsim task. This raises the question of
whether their experience on one task will influence their annotation choice on the other
task. We tested this by varying the order in which annotators did WSsim and WSbest.
R2 annotators were divided into two groups of four annotators with the order of tasks
as follows:

group 1: Usim-2   SYNbest   WSsim-2   WSbest
group 2: Usim-2   SYNbest   WSbest    WSsim-2
Another difference between the two rounds of annotation was that in R2 we per-
mitted the annotators to see one more sentence of context on either side of the target
9 The “I don’t know” option was present only in the Usim interface, and was not available in WSsim.
10 The annotation was conducted in two separate rounds due to funding.
11 Annotation guidelines for R1 are at http://www.katrinerk.com/graded-sense-and-usage-annotation
and guidelines for R2 tasks are at http://www.dianamccarthy.co.uk/downloads/
WordMeaningAnno2012/.
Table 5
Abbreviations used in the text for annotation tasks and rounds.

WSsim      Task: graded annotation of WordNet senses on a five-point scale
Usim       Task: graded annotation of usage similarity on a five-point scale
WSbest     Task: traditional single-sense annotation
SYNbest    Task: lexical substitution
R1         Annotation round 1
R2         Annotation round 2
sentence. In R1 each item was given only one sentence as context. We added more
context in order to reduce the chance that the sentence would be unclear. Table 5
summarizes the annotation tasks and annotation rounds on which we report.
Data Annotated. The data to be annotated in WSsim-1 were taken primarily from
Semcor (Miller et al. 1993) and the Senseval-3 English lexical sample (SE-3) (Mihalcea,
Chklovski, and Kilgarriff 2004). This experiment contained a total of 430 sentences span-
ning 11 lemmas (nouns, verbs, and adjectives). For eight of these lemmas, 50 sentences
were included, 25 randomly sampled from Semcor and 25 randomly sampled from SE-3.
The remaining three lemmas in the experiment had 10 sentences each, from the LEXSUB
data. Each of the three annotators annotated each of the 430 items, providing a response
for each WordNet sense for that lemma. Usim-1 used data from LEXSUB. Thirty-four
lemmas were manually selected, including the three lemmas also used in WSsim-1. We
selected lemmas which exhibited a range of meanings and substitutes in the LEXSUB
data, with as few multiword substitutes as possible. Each lemma is the target in 10
LEXSUB sentences except there were only nine sentences for the lemma bar.n because
of a part-of-speech tagging error in the LEXSUB trial data. For each lemma, annotators
were presented with every pairwise comparison of these 10 sentences. We refer to each
such pair as an SPAIR. There were 45 SPAIRs per lemma (36 for bar.n), adding up to 1,521
comparisons per annotator in Usim-1.
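These counts are just the number of unordered sentence pairs per lemma, n(n − 1)/2 for n sentences; a quick check of the arithmetic (our own, in Python):

```python
from math import comb

print(comb(10, 2))                     # 45 SPAIRs for a lemma with 10 sentences
print(comb(9, 2))                      # 36 SPAIRs for bar.n, which has only 9 sentences
print(33 * comb(10, 2) + comb(9, 2))   # 33 lemmas with 45 SPAIRs plus bar.n -> 1521
```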
In R1, only 30 sentences were included in both WSsim and Usim. Because compar-
ison of annotator responses on this subset of the two tasks yielded promising results,
R2 used the same set of sentences for both Usim and WSsim so as to better compare
these tasks. All data in the second round were taken from LEXSUB, and contained 26
lemmas with 10 sentences for each. We produced the SYNbest annotation, rather than
use the existing LEXSUB annotation, so that we could ensure the same conditions as
with the other annotation tasks, that is, using the same annotators and providing the
extra sentence of context on either side of the original LEXSUB context. We also only
required that the annotators provide one substitute. As such, there were 260 target
lemma occurrences that received graded word sense applicability ratings in WSsim-2,
E 1,170 SPAIRs (pairs of occurrences) to be annotated in Usim-2.
4. Analysis of the Annotation
In this section we present our analysis of the annotated data. We test inter-annotator
agreement, and we test to what extent annotators make use of the added flexibility
of the graded annotation. We also compare the outcome of our graded annotation to
traditional word sense annotation and lexical substitutions for the same data.
4.1 Evaluation Measures
Because both graded annotation tasks, WSsim and Usim, use ratings on five-point
scales rather than binary ratings, we measure agreement in terms of correlation. Because
ratings were not normally distributed, we choose a non-parametric test which uses
ranks rather than absolute values: We use Spearman's rank correlation coefficient (rho),
following Mitchell and Lapata (2008). For assessing inter-tagger agreement on the R2
WSbest task we adopt the standard WSD measure of average pairwise agreement, E
for R2 SYNbest, we use the same pairwise agreement calculation used in LEXSUB.
When comparing graded ratings with single-sense or lexical substitution annota-
zione, we use the mean of all annotator ratings in the WSsim or Usim annotation. Questo
is justified because the inter-annotator agreement is highly significant, with respectable
rho compared with previous work (Mitchell and Lapata 2008).
As the annotation schemes differ between R1 and R2 (as mentioned previously, the
number of annotators and the amount of visible context are different, and R2 annotators
did traditional word sense annotation in the WSbest task in addition to the graded
tasks), we report the results of R1 and R2 separately.12
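As an illustration of how these agreement figures can be computed (a sketch under our own assumptions about data layout; the paper does not publish analysis code), pairwise Spearman's rho can be taken over all annotator pairs and averaged, and each annotator can also be correlated against the average rating of the remaining annotators, as in the "against avg" rows of Tables 9 and 10:

```python
from itertools import combinations
import numpy as np
from scipy.stats import spearmanr

def average_pairwise_rho(ratings):
    """ratings: array of shape (n_annotators, n_items), one row of 1-5 ratings per annotator.
    Returns the mean of Spearman's rho over all annotator pairs."""
    rhos = []
    for i, j in combinations(range(len(ratings)), 2):
        rho, _ = spearmanr(ratings[i], ratings[j])
        rhos.append(rho)
    return float(np.mean(rhos))

def rho_against_average(ratings, annotator):
    """Correlate one annotator's ratings with the mean rating of all the other annotators."""
    others = np.delete(ratings, annotator, axis=0).mean(axis=0)
    rho, _ = spearmanr(ratings[annotator], others)
    return rho

# toy data: three annotators rating six items on the five-point scale
toy = np.array([[1, 5, 3, 2, 4, 5],
                [1, 4, 3, 1, 5, 5],
                [2, 5, 2, 2, 4, 4]])
print(average_pairwise_rho(toy))
print(rho_against_average(toy, 0))
```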
4.2 WSsim: Graded Ratings for WordNet Senses
In the WSsim task, annotators rated the applicability of each sense of the target word on
a five-point scale. We first do a qualitative analysis, then turn to a quantitative analysis
of annotation results.
4.2.1 Qualitative Analysis. Tavolo 6 shows an example of WSsim annotation. The target
is the verb dismiss, which was annotated in R2. The first column gives the WordNet
sense number (sn).13 Note that in the task, the annotators were given the synonyms
and full description, but in this table we only supply part of the description for the
sake of space. As can be seen, three of the annotators chose a single-sense annotation
by giving a rating of 5 to one sense and ratings of 1 to all others. Two annotators gave
ratings of 1 E 2 to all but one sense. The other three annotators gave positive ratings
(ratings of at least 3 [similar], Vedi la tabella 4) to at least two of the senses. All annotators
agree that the first sense fits the usage perfectly, and all annotators agree that senses
3 E 5 do not apply. The second sense, on the other hand, has an interestingly wide
distribution of judgments, ranging from 1 to 4. This is the judicial sense of the verb, as
in ‘this case is dismissed.’ Some annotators consider this sense to be completely distinct
from sense 1, whereas others see a connection. There is disagreement among annotators
about sense 6. This is the sense ‘dismiss, dissolve,’ as in ‘the president dissolved the
parliament.’ Six of the annotators consider this sense completely unrelated to ‘dismiss
our actions as irrelevant,’ whereas two annotators view it as highly related (though
not completely identical). It is noteworthy that each of the two opinions, a rating of 1
12 It is known that when responses are collected on an ordinal scale, the possibility exists for different
individuals to use the scale differently. As such, it is common practice to standardize responses using a
z-score, which maps a response X to z = (X − μ)/σ. The calculation of z-scores makes reference to the mean
and the standard deviation of an annotator’s responses. Because responses were not normally distributed
in our task, a transformation that relies on measures of central tendency is not appropriate. So we do not
use z-scores in this paper. We repeated all analyses with z-score transform anyway, and found the results
to be basically the same as those we report here with the raw values. Overall, using z-scores slightly
strengthened most findings, but there were no differences in statistical significance anywhere.
13 We use WordNet 3.0 for our annotation.
Table 6
WSsim example, R2: Annotator judgments for the different senses of dismiss.

If we see ourselves as separate from the world, it is easy to dismiss our actions as irrelevant
or unlikely to make any difference. (902)

sn    Description                             Ratings by annotator      Mean
1     bar from attention or consideration     5  5  5  5  5  5  5  5    5
2     cease to consider                       1  4  1  3  2  2  1  3    2.125
3     stop associating with                   1  2  1  1  1  2  1  1    1.25
4     terminate the employment of             1  4  1  2  1  1  1  1    1.5
5     cause or permit a person to leave       1  2  1  1  1  1  1  2    1.25
6     declare void                            1  1  1  4  1  1  1  4    1.75
and a rating of 4, was chosen by multiple annotators. Because multiple annotators give
each judgment, these data seem to reflect a genuine difference in perceived sense. We
discuss inter-annotator agreement, both overall and considering individual annotators,
subsequently.
Tavolo 7 gives an example sentence from R1, where the annotated target is the noun
paper. All annotators agree that sense 5, ‘scholarly article,’ applies fully. Sense 2 (‘essay’)
also gets ratings of ≥ 3 from all annotators. The first annotator seems also to have
perceived the ‘physical object’ connotation to apply strongly to this example, and has
expressed this quite consistently by giving high marks to sense 1 as well as 7.
Tavolo 8 shows a sample annotated sentence with an adjective target, neat, annotated
in R2. In this case, only one annotator chose single-sense annotation by marking exclu-
sively sense 4. One annotator gave ratings ≥ 3 (similar) to all senses of the lemma. Tutto
other annotators saw at least two senses as applying (with ratings ≥ 3) and at least one
sense as not applying at all (with a rating of 1). Sense 4 has received positive ratings (that
is, ratings ≥ 3) throughout. Senses 1, 2, and 6 have mixed ratings, and senses 3 and 5
have positive ratings only from the one annotator who marked everything as applying.
Interestingly, ratings for senses 1, 2, and 6 diverge sharply, with some annotators seeing
them as not applying at all, and some giving them ratings in the 3–5 range. Note that the
Table 7
WSsim example, R1: Annotator judgments for the different senses of paper.

This can be justified thermodynamically in this case, and this will be done in a separate
paper which is being prepared. (br-j03, sent. 4)

sn    Description                                                   Ratings    Mean
1     a material made of cellulose pulp                             4  1  1    1.3
2     an essay (especially one written as an assignment)            3  3  5    3.7
3     a daily or weekly publication on folded sheets; contains
      news and articles and advertisements                          2  1  3    2
4     a medium for written communication                            5  3  1    3
5     a scholarly article describing the results of observations
      or stating hypotheses                                         5  5  5    5
6     a business firm that publishes newspapers                     2  1  1    1.3
7     the physical object that is the product of a newspaper
      publisher                                                     4  1  1    1.7
Table 8
WSsim example, R2: Annotator judgments for the different senses of neat.

Over the course of the 20th century scholars have learned that such images tried to make
messy reality neater than it really is (103)

sn    Description                                                Ratings by annotator      Mean
1     free from clumsiness; precisely or deftly executed         1  5  1  4  5  5  5  5    3.375
2     refined and tasteful in appearance or behavior or style    3  4  1  4  4  3  1  3    2.875
3     having desirable or positive qualities especially
      those suitable for a thing specified                       1  3  1  1  1  1  1  1    1.25
4     marked by order and cleanliness in appearance or habits    4  5  5  3  4  5  5  5    4.5
5     not diluted                                                1  4  1  1  1  1  1  1    1.375
6     showing care in execution                                  1  4  1  3  4  1  3  3    2.5
Table 9
Correlation matrix for pairwise correlation agreement for WSsim-1. The last row provides the
agreement of the annotator in that column against the average from the other annotators.

               A       B       C
A              1.00    0.47    0.51
B              0.47    1.00    0.54
C              0.51    0.54    1.00
against avg    0.56    0.58    0.61
annotators who give ratings of 1 are not the same for these three ratings, pointing to dif-
ferent, but quite nuanced, judgments of the ‘make reality neater’ usage in this sentence.
4.2.2 Inter-annotator Agreement. We now turn to a quantitative analysis, starting with
inter-annotator agreement. For the graded WSsim annotation, it does not make sense
to compute the percentage of perfect agreement. As discussed earlier, we report inter-
annotator agreement in terms of correlation, using Spearman’s rho. We calculate pair-
wise agreements and report the average over all pairs. The pairwise correlations are
shown in the matrix in Table 9. We have used capital letters to represent the individ-
uals, preserving the same letter for the same person across tasks. In the last row we
show agreement of each annotator’s judgments against the average judgment from the
other annotators. The pairwise correlations range from 0.47 A 0.54 and all pairwise
correlations were highly significant (p ≪ 0.001), with an average of rho = 0.504. This
is a very reasonable result given that Mitchell and Lapata (2008) report a rho of 0.40
on a graded semantic similarity task.14 The lowest correlation against the average
14 Direct comparison across tasks is not appropriate, but we wish to point out that for graded semantic
judgments this level of correlation is perfectly reasonable. The Mitchell and Lapata (2008) dati
set has been used in an evaluation exercise (GEMS-2011, https://sites.google.com/site/
geometricalmodels/shared-evaluation). Mitchell and Lapata point out that Spearman’s rho
tends to yield lower coefficients compared with parametric alternatives such as Pearson’s.
Table 10
Correlation matrix for pairwise correlation agreement for WSsim-2. The last row provides the
agreement of the annotator in that column against the average from the other annotators.

               A       C       D       F       G       H       I       J
A              1.00    0.55    0.58    0.60    0.61    0.63    0.61    0.59
C              0.55    1.00    0.54    0.66    0.57    0.55    0.65    0.52
D              0.58    0.54    1.00    0.55    0.58    0.52    0.56    0.54
F              0.60    0.66    0.55    1.00    0.62    0.62    0.72    0.59
G              0.61    0.57    0.58    0.62    1.00    0.63    0.62    0.62
H              0.63    0.55    0.52    0.62    0.63    1.00    0.64    0.64
I              0.61    0.65    0.56    0.72    0.62    0.64    1.00    0.58
J              0.59    0.52    0.54    0.59    0.62    0.64    0.58    1.00
against avg    0.70    0.58    0.62    0.64    0.70    0.71    0.66    0.71
from the other annotators was 0.56. We discuss the annotations of individuals in Sec-
zione 4.6, including our decision to retain the judgments of all annotators for our gold
standard.
From the correlation matrix in Table 10 we see that for WSsim-2, pairwise corre-
lations ranged from 0.52 A 0.72. The average value of the pairwise correlations was
rho = 0.60, and again every pair was highly significant (p ≪ 0.001). The lowest correla-
tion against the average from all the other annotators was 0.58.
4.2.3 Choice of Single Sense Versus Multiple Senses. In traditional word sense annotation,
annotators can mark more than one sense as applicable, but annotation guidelines often
encourage them to view the choice of a single sense as the norm. In WSsim, annotators
gave ratings for all senses of the target. So we would expect that in WSsim, there would
be a higher proportion of senses selected as applicable. Indeed we find this to be the
case: Tavolo 11 shows the proportion of sentences where some annotator has assigned
more than one sense with a judgment of 5, the highest value. Both WSsim-1 and WSsim-
2 have a much higher proportion of sentences with multiple senses chosen than the
traditional sense-annotated data sets SemCor and SE-3. Interestingly, we notice that
the percentage for WSsim-1 is considerably higher than for WSsim-2. In principle, this
could be due to differences in the lemmas that were annotated, or differences in the
sense perception of the annotators between R1 and R2. Another potential influencing
Table 11
WSsim annotation: Proportion of sentences where multiple senses received a rating of 5 (highest
judgment) from the same annotator.

                          Proportion
WSsim-1                   46%
WSsim-2                   30%
WSsim-2, WSsim first      36%
WSsim-2, WSbest first     23%
SemCor                    0.3%
SE-3                      8%
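The proportions in Table 11 amount to a simple count over per-annotator rating vectors: a sentence is counted if any annotator gave the highest rating to more than one sense. A sketch of that count (our own formulation, with an assumed data layout):

```python
def multi_sense_proportion(items, top=5):
    """items: one entry per annotated sentence; each entry is a list of per-annotator
    rating lists over the senses of the target lemma (ratings on the 1-5 scale).
    Returns the proportion of sentences where some annotator gave `top` to several senses."""
    hits = sum(1 for item in items
               if any(ratings.count(top) > 1 for ratings in item))
    return hits / len(items)

# toy example: two sentences of a three-sense lemma, two annotators each
items = [
    [[5, 5, 1], [5, 1, 1]],   # first annotator rated two senses with 5 -> counted
    [[5, 1, 1], [4, 3, 1]],   # no annotator gave more than one 5 -> not counted
]
print(multi_sense_proportion(items))   # 0.5
```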
factor is the order of annotation experiments: As described earlier, half of the R2 anno-
tators did WSbest annotation before doing WSsim-2, and half did the two experiments
in the opposite order. As Table 11 shows, those doing the graded task WSsim-2 before
the binary task WSbest had a greater proportion of multiple senses annotated with
the highest response. This demonstrates that annotators in a word meaning task can
be influenced by factors outside of the current annotation task, in this case another
annotation task that they have done previously. We take this as an argument in favor of
using as many annotators as possible in order to counteract factors that contribute noise.
In our case, we counter the influence of previous annotation tasks somewhat by using
multiple annotators and altering the order of the WSsim and WSbest tasks. Another
option would have been to use different annotators for different tasks; by using the
same set of annotators for all four tasks, Tuttavia, we can better control for individual
variation.
4.2.4 Use of the Graded Scale. We next ask whether annotators in WSsim made use of the
whole five-point scale, or whether they mostly chose the extreme ratings of 1 E 5.
If the latter were the case, this could indicate that they viewed the task of word sense
assignment as binary. Figure 1a shows the relative frequency distribution of responses
from all annotators over the five scores for both R1 and R2. Figures 2a and 3a show the
same but for each individual annotator. In both rounds the annotators chose the rating
of 1 (‘completely different,’ see Table 4) most often. This is understandable because each
item is a sentence and sense combination and there will typically be several irrelevant
senses for a given sentence. The second most frequent choice was 5 (‘identical’). Both
rounds had plenty of judgments somewhere between the two poles, so the annotators
do not seem to view the task of assigning word sense as completely binary. Although
the annotators vary, they all use the intermediate categories to some extent and certainly
the intermediate category judgments do not originate from a minority of annotators.
We notice that R2 annotators tended to give more judgments of 1 (‘completely
different’) than the R1 annotators. One possible reason is again that half our annota-
tors did WSbest before WSsim-2. If this were the cause for the lower judgments, we
would expect more ratings of 1 for the annotators who did the traditional word sense
annotazione (WSbest) first. In Table 12 we list the relative frequency of each rating for the
different groups of annotators. We certainly see an increase in the judgments of 1 where
Figure 1
WSsim and Usim R1 and R2 ratings.
Figure 2
WSsim and Usim R1 individual ratings.
Figure 3
WSsim and Usim R2 individual ratings.
WSbest is performed before WSsim-2. Again, this may indicate that annotators were
leaning more towards finding a single exact match because they were influenced by
the WSbest task they had done before. Annotators in that group were also slightly less
inclined to take the middle ground, but this was true of both groups of R2 annotators
compared with the R1 annotators. We think that this difference between the two rounds
may well be due to the lemmas and data.
In Table 18, we show the average range15 and average variance of the judgments
per item for each of the graded annotation tasks. WSsim naturally has less variation
15 As an example, the first two senses (1 and 2) in Table 6 have ranges of 0 and 3, respectively.
Tavolo 12
The relative frequency of the annotations at each judgment from all annotators.
Judgment
Exp
1
2
3
4
5
WSsim-1
WSsim-2
WSsim-2, WSsim first
WSsim-2, WSbest first
Usim-1
Usim-2
0.43
0.696
0.664
0.727
0.360
0.316
0.106
0.081
0.099
0.063
0.202
0.150
0.139
0.067
0.069
0.065
0.165
0.126
0.143
0.048
0.048
0.048
0.150
0.112
0.181
0.109
0.12
0.097
0.123
0.296
compared with Usim because, for any sentence, there are inevitably many WordNet
senses which are irrelevant to the context at hand and which will obtain a judgment
Di 1 from everyone. This is particularly the case for WSsim-2 where the annotators
gave more judgments of 1, as discussed previously. The majority of items have a range
of less than two for WSsim. We discuss the Usim figures further in the following
section.
4.3 Usim: Graded Ratings for Usage Similarity
In Usim annotation, annotators compared pairs of usages of a target word (SPAIRs) E
rated their similarity on the five-point scale given in Table 4. The annotators were also
permitted a response of “don’t know.” Such responses were rare but were used when
the annotators really could not judge usage similarity, perhaps because the meaning
of one sentence was not clear. We removed any pairs where one of the annotators had
given a “don’t know” verdict (9 in R1, 28 in R2). For R1 this meant that we were left with
a total of 1,512 SPAIRs and in R2 we had a resultant 1,142 SPAIRs.
4.3.1 Qualitative Analysis. We again start by inspecting examples of Usim annotation.
Tavolo 13 shows the annotation for an SPAIR of the verb dismiss. The first of the two
sentences talks about “dismissing actions as irrelevant,” the second is about dismissing
a person. È interessante notare, the second usage could be argued to carry both a connotation
of ‘ushering out’ and a connotation of ‘disregarding.’ Annotator opinions on this SPAIR
vary from a 1 (completely different) ad a 5 (identical), but most annotators seem to view
the two usages as related to an intermediate degree. This is adequately reflected in the
average rating of 3.125. Tavolo 14 compares the sentence from Table 8 to another sentence
Tavolo 13
Usim example: Annotator judgments for a pair of usages of dismiss.
Sentences
If we see ourselves as separate from the world, it is easy to dismiss
our actions as irrelevant or unlikely to make any difference.
Simply thank your Gremlin for his or her opinion, dismiss him or
her, and ask your true inner voice to turn up its volume.
Ratings
1, 2, 3, 3,
3, 4, 4, 5
Tavolo 14
Usim example: Annotator judgments for a pair of usages of neat.
Sentences
Over the course of the 20th century scholars have learned that such
images tried to make messy reality neater than it really is.
Strong field patterns created by hedgerows give the landscape a
neat, well structured appearance.
Tavolo 15
Usim example: Annotator judgments for a pair of usages of account.
Sentences
Samba-3 permits use of multiple account data base backends.
Within a week, Scotiabank said that it had frozen some accounts
linked to Washington’s hit list.
Ratings
3, 3, 4, 4,
4, 4, 5, 5
Ratings
1, 2, 3, 3,
3, 4, 4, 4
with the target neat. The first sentence is a metaphorical use (making reality neater),
the second is literal (landscape with neat appearance), but still the SPAIR gets high
ratings of 3–5 throughout for an average of 4.0. Note that the WordNet senses, shown
in Table 8, do not distinguish the literal and metaphorical uses of the adjective, either.
Tavolo 15 shows two uses of the noun account. The first pertains to accounts on a software
system, the second to bank accounts. The spread of annotator ratings shows that these
two uses are not the same, but that some relation exists. The average rating for this
SPAIR is 3.0.
4.3.2 Inter-annotator Agreement. We again calculate inter-annotator agreement as the
average over pairwise Spearman’s correlations. The pairwise correlations are shown
in the matrix in Table 16. In the last row we show agreement of each annotator’s
judgments against the average judgment from the other annotators. For Usim-1 the
range of correlation coefficients is between 0.50 E 0.64 with an average correlation
of rho = 0.548. All the pairs are highly significantly correlated (p < 0.001). The smallest
correlation for any individual against the average is 0.55. The correlation matrix for
Usim-2 is provided in Table 17; the range of correlation coefficients is between 0.42 E
Tavolo 16
Correlation matrix for pairwise correlation agreement for Usim-1. The last row provides the
agreement of the annotator in that column against the average from the other annotators.
UN
D
E
UN
D
E
1.00
0.50
0.64
0.50
1.00
0.50
0.64
0.50
1.00
against avg
0.67
0.55
0.67
Tavolo 17
Correlation matrix for pairwise correlation agreement for Usim-2. The last row provides the
agreement of the annotator in that column against the average from the other annotators.
UN
C
D
F
G
H
IO
J
UN
C
D
F
G
H
IO
J
1.00
0.70
0.52
0.70
0.69
0.72
0.73
0.67
0.70
1.00
0.48
0.72
0.60
0.66
0.71
0.69
0.52
0.48
1.00
0.48
0.49
0.51
0.50
0.42
0.70
0.72
0.48
1.00
0.66
0.71
0.74
0.68
0.69
0.60
0.49
0.66
1.00
0.71
0.65
0.62
0.72
0.66
0.51
0.71
0.71
1.00
0.70
0.65
0.73
0.71
0.50
0.74
0.65
0.70
1.00
0.72
0.67
0.69
0.42
0.68
0.62
0.65
0.72
1.00
against avg
0.82
0.78
0.58
0.80
0.76
0.80
0.81
0.76
0.73. All these correlations are highly significant (p < 0.001), with an average correlation
of rho = 0.62. The lowest agreement between any individual and the average judgment
of the others is 0.58. Again, we note that these are all respectable values for tasks
involving semantic similarity ratings.
Use of the graded scale. Figure 1b shows how annotators made use of the graded scale
in Usim-1 and Usim-2. It graphs the relative frequency of each of the judgments on the
five-point scale. Figures 2b and 3b show the same but for each individual annotator. In
both annotation rounds, the rating 1 (completely different) was chosen most frequently.
There are also, in both annotation rounds, many ratings in the middle of the scale; indeed, we see a larger proportion of mid-range scores for Usim than for WSsim in general, as shown in Table 12. Figures 2b and 3b show that although individuals
differ, all use the mid points to some extent and it is certainly not the case that these
mid-range judgments come from a minority of annotators. In Usim, annotators com-
pared pairs of usages, whereas in WSsim, they compared usages with sense defini-
zioni. The sense definitions suggest a categorization that may bias annotators towards
categorical choices. Comparing the two annotation rounds for Usim, we see that in
Usim-2 there seem to be many more judgments at 5 than in Usim-1. This is similar
to our findings for WSsim, where we also obtained more polar judgments for R2 than
for R1.
There is a larger range on average for Usim-2 compared with the other tasks as
shown earlier by Table 18. This is understandable given that there are eight annotators
Tavolo 18
Average range and average variance of judgments for each of the graded experiments.
avg range
avg variance
WSsim-1
WSsim-2
Usim-1
Usim-2
1.78
1.55
1.41
2.50
1.44
0.71
0.92
1.12
for R2 compared with R1,16 and so a greater chance of a larger range per item. There is
substantial variation by lemma. In Usim-2, fire.v, rough.a, and coach.n have an average
range of 1.33, 1.76, E 1.93, rispettivamente, whereas suffer.v, neat.a, and function.n have
average ranges of 3.14, 3.16, E 3.58, rispettivamente. The variation in range appears to
depend on the lemma rather than POS. This variation can be viewed as a gauge of how
difficult the lemma is. Although the range is larger in Usim-2, the average variance per item (i.e., the variance over the eight annotators) is 1.12, which is lower than that for WSsim-1.
Usim and the triangle inequality. In Euclidean space, the lengths of two sides of a triangle,
taken together, must always be greater than the length of the third side. This is the
triangle inequality:
length(longest) < length(second longest) + length(shortest)
We now ask whether the triangle inequality holds for Usim ratings. If Usim similarities
are metric, that is, if we can view the ratings as proximity in a Euclidean “meaning
space,” then the triangle inequality would have to hold. This question is interesting for
what it says about the psychology of usage similarity judgments. Classic results due
to Tversky and colleagues (Tversky 1977; Tversky and Gati 1982) show that human
judgments of similarity are not always metric. Tversky (1977), varying an example
by William James, gives the following example, which involves words, but explicitly
ignores context:
Consider the similarity between countries: Jamaica is similar to Cuba (because of
geographical proximity); Cuba is similar to Russia (because of their political affinity);
but Jamaica and Russia are not similar at all. [. . . ] the perceived distance of Jamaica to
Russia exceeds the perceived distance of Jamaica to Cuba, plus that of Cuba to
Russia—contrary to the triangle inequality.
Note, however, that Tversky was considering similarity judgments for different words,
whereas we look at different usages of the same word. The question of whether the
triangle inequality holds for Usim ratings is also interesting for modeling reasons.
Several recent approaches model word meaning in context through points in vector
space (Erk and Pado 2008; Mitchell and Lapata 2008; Dinu and Lapata 2010; Reisinger
and Mooney 2010; Thater, F ¨urstenau, and Pinkal 2010; Washtell 2010; Van de Cruys,
Poibeau, and Korhonen 2011). They work on the tacit assumption that similarity of
word usages is metric—an assumption that we can directly test here. Third, the triangle
inequality question is also relevant for future annotation; we will discuss this in more
detail subsequently.
To test whether Usim ratings obey the triangle inequality, we first convert the
similarity ratings that the annotators gave to dissimilarity ratings: Let savg be the mean
similarity rating over all annotators, then we use the dissimilarity rating d = 6 − savg
(as 5 was the highest possible similarity score).
We examine the proportion of sentence triples where the triangle inequality holds
(that is, we consider every triple of sentences that share the same target lemma). In those
16 A likely reason for the larger range in WSsim-1 compared with WSsim-2 is that in WSsim-2 half the
annotators had performed WSbest before WSsim-2 and produced more judgments of 1 compared with
WSsim-1.
cases where the triangle inequality is violated, we also assess the degree to which it is
violated, calculated as the average distance that is missed: Let Tmiss be the set of triples
for which the triangle inequality does not hold, then we compute
m = (1 / |T_miss|) Σ_{t ∈ T_miss} [ length(longest_t) − ( length(second longest_t) + length(shortest_t) ) ]
This is the average amount by which the longest side is “too long.”
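This procedure is straightforward to implement. The following sketch is our own illustration in Python (the containers `sim` and `sentences` are hypothetical, not part of the released data): it applies the conversion d = 6 − s_avg and reports both the proportion of triples that obey the triangle inequality and the average miss m.

```python
from itertools import combinations

def triangle_check(sim, sentences):
    """sim: dict mapping a sorted sentence-id pair (s1, s2) to the mean
    similarity rating (1-5).  Returns the proportion of sentence triples
    obeying the triangle inequality and the average miss m for violations."""
    def dist(a, b):
        return 6.0 - sim[tuple(sorted((a, b)))]  # d = 6 - s_avg

    obey, misses = 0, []
    for triple in combinations(sentences, 3):
        sides = sorted(dist(a, b) for a, b in combinations(triple, 2))
        shortest, second, longest = sides
        if longest <= second + shortest:
            obey += 1
        else:
            misses.append(longest - (second + shortest))
    total = obey + len(misses)
    m = sum(misses) / len(misses) if misses else 0.0
    return obey / total, m
```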
For the first round of annotation, Usim-1, we found that 99.2% of the sentence
triples obey the triangle inequality. For the triples that miss it, the average amount
by which the longest side is too long is m = 0.520. This is half a point on the five-
point rating scale, a low amount. In R2, all sentence triples obey the triangle inequality.
One potential reason for this is that we have eight annotators for R2, and a larger
sample of annotators reduces the variation from individuals. Another reason may
be that the annotators in R2 could view two more sentences of context than those
in R1.
Tables 19 and 20 show results of the triangle inequality analysis, but by individual
annotator. Every annotator has at least 93% of sentence triples obeying the principle. For
the triples that miss it, they tend to miss it by between one and two points. The results
for individuals accord with the triangle inequality principle, though to a lesser extent
compared with the analysis using the average, which reduces the impact of variation
from individuals.
As discussed previously, this result (that the triangle inequality holds for Usim
annotation triples) is interesting because it contrasts with Tversky’s findings (Tversky
1977; Tversky and Gati 1982) that similarity ratings between different words are not
metric. And although we consider similarity ratings for usages of the same word, not
different words, we would argue that our findings point to the importance of consider-
ing the context in which a word is used. It would be interesting to test whether similarity
ratings for different words, when used in context, obey the triangle inequality. To
reference the Tversky example, and borrowing some terminology from Cruse, evoking
the ISLAND facet of Jamaica and Cuba versus the COMMUNIST STATE facet of Cuba and
Russia would account for the non-metricality of the similarity judgments as Tversky
Table 19
Triangle inequality analysis by annotator, Usim-1.

               A        D        E
perc obey     93.8     97.2     97.3
missed by      1.267    1.221    1.167

Table 20
Triangle inequality analysis by annotator, Usim-2.

               A        C        D        F        G        H        I        J
perc obey     94.1     97.5     98.4     97.2     93.6     97.0     97.4     97.4
missed by      1.508    1.405    1.122    1.824    1.477    1.281    1.759    1.338
Table 21
WSbest annotations.

                            sense selected           Proportion with
                            yes         no           multiple choice
WSbest                      19,599      2,401        0.13
WSbest, WSsim-2 first        9,779      1,221        0.15
WSbest, WSbest first         9,820      1,180        0.11
points out, and moreover highlight the lack of an apt comparison between Jamaica and
Russia at all. There is some motivation for this idea in the psychological literature on
structural alignment and alignable differences (Gentner and Markman 1997; Gentner
and Gunn 2001).
In addition, our finding that the triangle inequality holds for Usim annotation
will be useful for future Usim annotation. Usage similarity annotation is costly (and
somewhat tedious) as annotators give ratings for each pair of sentences for a given
target lemma. Given that we can expect usage similarity to be metric, we can eliminate
the need for some of the ratings. Once annotators have rated two usage pairs out of a
triple, their ratings set an upper limit on the distance (and hence a lower limit on the similarity) of the third pair. In the best case, if usages s1 and s2 have a distance of 1 (i.e., a similarity of 5), and s1 and s3 have a distance of 1, then the distance of s2 and s3 can be at most 2. For all usage triples where two
pairs have been judged highly similar, we can thus omit obtaining a rating for the third
pair. A second option for obtaining more Usim annotation is to use crowdsourcing. In
crowdsourcing annotation, quality control is always an issue, and again we can make
use of the triangle inequality to detect spurious annotation: Ratings that grossly violate
the triangle inequality can be safely discarded.
4.4 WSbest
The WSbest task reflects the traditional methodology in word sense annotation where
words are annotated with the best fitting sense. The guidelines17 allow for selecting
more than one sense provided all fit the example equally well. Table 21 shows that,
as one would expect given the number of senses in WordNet, there are more unse-
lected senses than selected. We again find an influence of task order: When annota-
tors did the graded annotation (WSsim-2) before WSbest, there were more multiple
assignments (see the last column) and therefore more senses selected. This difference
is statistically significant (χ2 test, p = 0.02). Regardless of the order of tasks, we no-
tice that the proportion of multiple sense choice is far lower than the equivalent for
WSsim (see Table 11), as is expected due to the different annotation schemes and
guidelines.
We calculated inter-annotator agreement using pairwise agreement, as is standard
in WSD. There are several ways to calculate pairwise agreement in cases of multiple
selection, though these details are not typically given in WSD papers. We use the size
of the intersection of selections divided by the maximum number of selections from
17 See http://www.dianamccarthy.co.uk/downloads/WordMeaningAnno2012/wsbest.html.
Table 22
Inter-annotator agreement without one individual for WSbest and SYNbest R2.

            average    A       C       D       F       G       H       I       J
WSbest      0.574    0.579   0.564   0.605   0.560   0.582   0.566   0.566   0.568
SYNbest     0.261    0.261   0.259   0.285   0.254   0.256   0.245   0.260   0.267
either annotator. This is equivalent to 1 for agreement and 0 for disagreement in cases
where both annotators have selected only one sense. Formally, let i ∈ I be one annotated sentence, let A be the set of annotators, and let PA = {{a, a′} | a, a′ ∈ A} be the set of annotator pairs. Let a_i be the set of senses that annotator a ∈ A has chosen for sentence i. Then pairwise agreement between annotators is calculated as:

ITA_WSbest = (1 / (|PA| · |I|)) Σ_{i ∈ I} Σ_{{a,a′} ∈ PA} |a_i ∩ a′_i| / max(|a_i|, |a′_i|)     (1)
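A minimal sketch of Equation (1) (our own illustration; the `annotations` dictionary is a hypothetical container for the selected sense sets):

```python
from itertools import combinations

def ita_wsbest(annotations, items):
    """Sketch of Equation (1).  annotations: dict annotator -> dict item ->
    set of chosen senses (always non-empty in WSbest); items: list of item ids."""
    pairs = list(combinations(annotations, 2))
    total = 0.0
    for i in items:
        for a, b in pairs:
            sa, sb = annotations[a][i], annotations[b][i]
            total += len(sa & sb) / max(len(sa), len(sb))
    return total / (len(pairs) * len(items))
```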
The average ITA was calculated as 0.574.18 If we restrict the calculation to items
where each annotator only selected one sense (not multiple), the average is 0.626.
For SE-3, ITA was 0.628 on the English Lexical Sample task, not including the
multiword data (Mihalcea, Chklovski, and Kilgarriff 2004). This annotation exercise
used volunteers from the Web (Mihalcea and Chklovski 2003). Like our study, it had
taggers without lexicography background and gave a comparable ITA to our 0.626. We
calculated pairwise agreement for eight annotators. To carry out the experiment under
maximally similar conditions to previous studies, we also calculated ITA for items with
only one response and use only the four annotators who performed WSbest first. This
resulted in an average ITA of 0.638.
We also calculated the agreement for WSbest in R2 as in Equation 1 but with each
individual removed to see the change in agreement. The results are in the first row of
Table 22.
4.5 SYNbest
The SYNbest task is a repetition of the LEXSUB task (McCarthy and Navigli 2007, 2009)
except that annotators were asked to provide one synonym at most. As in LEXSUB,
agreement between a pair of annotators was counted as the proportion of all the
sentences for which the two annotators had given the same response.
As in WSbest, let A be the set of annotators. I is the set of test items, but as in
LEXSUB we only include those where at least two annotators have provided at least
one substitute: If only one annotator can think of a substitute then it is likely to be a
problematic item. As in WSbest, let ai be the set19 of responses (substitutes) for an item
18 Although there is a statistical difference in the number of multiple assignments depending upon whether
WSsim-2 is completed before or after WSbest, ITA on the WSbest task does not significantly differ
between the two sets.
19 Though, in fact, unlike LEXSUB and WSbest, we only collect one substitute per annotator for SYNbest.
i ∈ I for annotator a ∈ A. Let PA again be the set of pairs of annotators from A. Pairwise
agreement between annotators is calculated as in LEXSUB as:
PA = (1 / (|PA| · |I|)) Σ_{i ∈ I} Σ_{{a,a′} ∈ PA} |a_i ∩ a′_i| / |a_i ∪ a′_i|     (2)
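The corresponding sketch for Equation (2) differs from the one for Equation (1) only in the denominator (again our own illustration with hypothetical data structures; items for which fewer than two annotators supplied a substitute are assumed to have been removed beforehand):

```python
from itertools import combinations

def pa_synbest(substitutes, items):
    """Sketch of Equation (2).  substitutes: dict annotator -> dict item ->
    set of provided substitutes (possibly empty for a given annotator)."""
    pairs = list(combinations(substitutes, 2))
    total = 0.0
    for i in items:
        for a, b in pairs:
            sa, sb = substitutes[a][i], substitutes[b][i]
            union = sa | sb
            if union:  # an empty union contributes nothing to the sum
                total += len(sa & sb) / len(union)
    return total / (len(pairs) * len(items))
```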
Note that in contrast to pairwise agreement for traditional word sense annotation
(WSbest), the credit for each item (the intersection of annotations from the annotator
pair) is divided by the union of the responses. For traditional WSD evaluation, it is
divided by the number of responses from either annotator, which is usually one. For
lexical substitution this is important as the annotations are not collected over a predeter-
mined inventory. In LEXSUB, the PA figure was 0.278, whereas we obtain PA = 0.261 on
SYNbest. There were differences in the experimental set-up. We had eight annotators,
compared with five, and for SYNbest each annotator only provided one substitute.
Additionally, our experiment involved only a subset of the data used in LEXSUB. The
figures are not directly comparable, but are reasonably in line.
In our task, out of eight annotators we had at most three people who could not find
a substitute for any given item, so there were always at least five substitutes per item.
In LEXSUB, 16 items out of the full data set of 2,010 sentences were excluded from testing because only one token substitute was provided by the set of annotators.
We also calculated the agreement for SYNbest as in Equation 2 but with each
individual removed to see the change in agreement. The results are in the second row
of Table 22.
4.6 Discussion of the Annotations of Individuals
We do not pose these annotation tasks as having “correct responses.” We wish instead
to obtain the annotators’ opinions, and accept the fact that the judgments will vary.
Nevertheless, we would not wish to conduct our analysis using annotators who were
not taking the task seriously. In the analyses that follow in subsequent sections, we
use the average judgment from our annotators to reduce variation from individuals.
Before doing so, however, we briefly review in this subsection the analysis of the individual annotations provided earlier in this section, in support of our decision to use all annotators for the gold standard.
Although there was variation in the profile of annotations for individuals, all of the
annotators showed reasonable correlation on the graded task and at a level in excess
of that achieved on other graded semantic tasks (Mitchell and Lapata 2008). There will
inevitably be one annotator that has the lowest correlation with the others on any given
task, but we found that this was not the same annotator on every task. For example,
C on WSsim-2 has the lowest correlation with the average, yet concurs with others
much more on Usim-2 and leaving C out would reduce agreement on WSbest and on
SYNbest. D has lower correlation with others on several tasks, though higher than C on
WSsim-2. When we redo the triangle inequality analysis in Section 4.3 individually we
see from Tables 19 and 20 that annotator D is the highest performing annotator in terms
of meeting the triangle inequality principle in R2 and is a close runner-up in R1. These
results indicate that although annotators may use the graded scale in different ways,
their annotations tally to a reasonable extent. We therefore used all annotators for the
gold standard.
4.7 Agreement Between Annotations in Different Frameworks
In this paper we are considering various different annotations of the same underly-
ing phenomenon: word meaning as it appears in context. In doing so, we contrast
traditional WSD methodology (SE-3, SemCor, and WSbest) with graded judgments of
sense applicability (WSsim), usage similarity (Usim), and lexical substitution as in
LEXSUB and SYNbest. In this section we compare the annotations from these different
paradigms where the annotations are performed on the same underlying data. For
WSsim and Usim, we use average ratings as the point of comparison.
4.7.1 Agreement Between WSsim and Traditional Sense Assignment. To compare WSsim
ratings on a five-point scale with traditional sense assignment on the same data, we
convert the traditional word sense assignments to ratings on a five-point scale: Any
sense that is assigned is given a score of 5, and any sense that is not assigned is
given a score of 1. If multiple senses are chosen in the gold standard, then they are
all given scores of 5. We then correlate the converted ratings of the traditional word
sense assignment with the average WSsim ratings using Spearman’s rho.
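As a concrete illustration of this conversion, here is our own sketch using SciPy's spearmanr (the dictionaries are hypothetical stand-ins for the annotation data):

```python
from scipy.stats import spearmanr

def correlate_with_traditional(wssim_avg, gold_senses):
    """wssim_avg: dict (sentence, sense) -> averaged WSsim rating.
    gold_senses: dict sentence -> set of senses in the traditional gold standard.
    Converts the gold standard to 5/1 ratings and returns Spearman's rho."""
    converted, graded = [], []
    for (sentence, sense), rating in wssim_avg.items():
        converted.append(5 if sense in gold_senses[sentence] else 1)
        graded.append(rating)
    return spearmanr(converted, graded)
```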
As described earlier, most of the sentences annotated in WSsim-1 were taken from
either SE-3 or SemCor. The correlation of WSsim-1 and SE-3 is rho = 0.416, and the cor-
relation of WSsim-1 with SemCor is rho = 0.425. Both are highly significant (p < 0.001).
For R2 we can directly contrast WSsim with the traditional sense annotation in
WSbest on the same data. This allows a fuller comparison of traditional and graded
tagging because we have a data set annotated with both methodologies, under the same
conditions, and with the same set of annotators. We use the mode (most common) sense
tag from our eight annotators as the traditional gold standard label for WSbest and
assign a rating of 5 to that sense, and a rating of 1 elsewhere. We again used Spearman’s
rho to measure correlation between WSbest and WSsim and obtained rho = 0.483
(p < 0.001).
4.7.2 Agreement Between WSsim and Usim. WSsim and Usim provide two graded an-
notations of word usage in context. To compare the two, we convert WSsim scores
to usage similarity ratings as in Usim. In WSsim, each sense has a rating (aver-
aged over annotators), so a sentence has a vector of ratings with a “dimension” for
each sense. For example, the vector of average ratings for the sentence in Table 6 is
⟨5, 2.125, 1.25, 1.5, 1.25, 1.75⟩. All sentences with the same target will have vectors in the same space, as they share the same sense list. Accordingly, we can compare a pair u, u′ of sentences that share a target using Euclidean distance:

d(u, u′) = sqrt( Σ_i (u_i − u′_i)² )

where u_i is the ith dimension of the vector of ratings for sentence u. Note that this gives us a dissimilarity rating for u, u′. We can now compare these sentence pair dissimilarities to the similarity ratings of the Usim annotation.
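For concreteness, a small sketch of this comparison (our own illustration; `sense_vectors` and `usim` are hypothetical containers for the averaged ratings):

```python
import math
from scipy.stats import spearmanr

def wssim_vs_usim(sense_vectors, usim):
    """sense_vectors: dict sentence -> list of averaged WSsim sense ratings
    (one entry per WordNet sense of the target, in a fixed order).
    usim: dict (s1, s2) -> averaged Usim similarity rating.
    Correlates the WSsim-derived dissimilarities with the Usim similarities."""
    dissims, sims = [], []
    for (s1, s2), similarity in usim.items():
        u, v = sense_vectors[s1], sense_vectors[s2]
        dissims.append(math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v))))
        sims.append(similarity)
    return spearmanr(dissims, sims)  # a negative rho is expected
```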
In R1 we found a correlation of rho = −0.596 between WSsim and Usim ratings.20
The basis of this comparison is small, at three lemmas with 10 sentences each, giving
20 The negative correlation is due to the comparison of dissimilarity ratings with similarity ratings.
Table 23
Spearman's correlation between lexical paraphrase overlap on the one hand, and Usim
similarity or WSsim dissimilarity on the other hand.

tasks                      rho
Usim-1 vs. LEXSUB          0.590
Usim-2 vs. SYNbest         0.764
WSsim-1 vs. LEXSUB        −0.495
WSsim-2 vs. SYNbest       −0.749
135 sentence pairs in total, because that is all the data available annotated in both
paradigms. For R2 we can perform the analysis on the whole Usim-2 and WSsim-2
data, which gives us 26 lemmas, with 1,142 sentence pairs.21 Correlation on R2 data is
rho = −0.816. The degree of correlation is striking. We conclude that there is a very
strong relationship between the annotations for Usim and WSsim. This bodes well for
using Usim as a resource for evaluating sense inventories, an idea that we will pursue
further in Section 6: It reflects word meaning but is not tied to any given sense inventory.
4.7.3 Agreement of WSsim and Usim with Lexical Substitution. Lexical paraphrases (sub-
stitutes) have been used as a means of evaluating WSD systems in a task where the
inventory is not predefined (McCarthy and Navigli 2007, 2009). Because the R1 an-
notation was done in part on data that had previously been annotated with lexical
substitutions, and R2 included lexical substitution annotation, we can compare para-
phrase annotation with the results of WSsim and Usim. Again, we need to transform
annotations to make the comparison feasible. We convert all annotations to a Usim-
like format using sentence pair similarity or dissimilarity ratings. For WSsim, we use
the transformation described previously, using Euclidean distance between sense rating
vectors. We transform lexical substitution annotation using multiset intersection, as
the lexical substitution annotation of a sentence is a multiset of substitutes22 from all
annotators. If sentences s1, s2 have substitute multisets subs1 and subs2, respectively, and
freqi(w) is the frequency of substitute w in multiset subsi, then we calculate multiset
intersection as
INTER(s1, s2) = (1 / max(|subs1|, |subs2|)) Σ_{w ∈ subs1 ∩ subs2} min( freq1(w), freq2(w) )
Again, as before and in LEXSUB, we only keep sentences for which at least two
annotators could come up with a substitute. We also did not include any items that
were tagged with the wrong POS in LEXSUB.23
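A sketch of the INTER computation using Python's Counter (our own illustration; the substitute lists are hypothetical stand-ins for the pooled annotations, and sentences without at least two substitutes are assumed to have been filtered out already):

```python
from collections import Counter

def inter(subs1, subs2):
    """subs1, subs2: lists of substitutes pooled over all annotators
    (multisets) for two sentences.  Sketch of the INTER overlap measure."""
    c1, c2 = Counter(subs1), Counter(subs2)
    overlap = sum(min(c1[w], c2[w]) for w in c1.keys() & c2.keys())
    return overlap / max(len(subs1), len(subs2))
```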
Table 23 shows correlation, in terms of Spearman’s rho, of Usim and WSsim
annotation with lexical substitution annotation. The values of Usim and WSsim are
based on mean scores averaged over all annotators. The INTER values computed for the
21 This is the number of pairs remaining after we exclude any pairs where one of the annotators provided a
“do not know” response.
22 The frequency of a substitute in a multiset depends on the number of annotators that picked the
substitute for the particular data point.
23 This was relevant only for the trial portion of LEXSUB, as the test portion was manually verified.
lexical substitution annotation yield similarity ratings for sentence pairs; accordingly,
correlations of transformed lexical substitution with Usim are positive, and correlations
of transformed lexical substitution with the WSsim-based sentence dissimilarity ratings
are negative. All correlations are highly significant (p < 0.001). We anticipated a higher
correlation of SYNbest with R2 annotation compared with that obtained using LEXSUB
and R1 annotation: In R2 the set of annotators is larger, the same set of annotators do
all experiments, and the SYNbest annotation focuses on obtaining one substitute per
annotator (whereas LEXSUB allowed annotators to supply up to three paraphrases). This
turned out to be in fact the case, as a comparison of rows 1 and 2 of Table 23 shows,
and likewise a comparison of rows 3 and 4. We notice that the correlation is slightly
stronger for Usim compared with WSsim, for both annotation rounds. One possible
reason for this is that the comparison of lexical substitution data with Usim involves
only one transformation of annotation data (the INTER calculation), whereas the com-
parison with WSsim involves two (INTER and also the Euclidean distance transforma-
tion). We can expect each transformation of annotation data to be “lossy” in the sense
of introducing additional variance. Furthermore, WSsim relies on WordNet, which
may add a layer of structure that does not reflect the overlap in semantic similarity
between usages.
4.7.4 Summary. The Usim framework enables us to compare different annotation
schemes for word meaning, as it is relatively straightforward to map all annotations
to sentence pair (dis-)similarity ratings. We found strong relationships between WSsim
and Usim annotation, and between both graded annotation frameworks on the one
hand and traditional word sense annotation or lexical substitutions on the other hand.
This provides some validation for the novel annotation frameworks. Also, if all labeling
schemes provide comparable results, that opens up opportunities for choosing the best-
fitting labeling scheme for each situation. All these tasks pursue the same endeavor,
although the graded annotations and substitutions strive to capture the more subtle
nuances of meaning that are not adequately represented by the winner takes all ap-
proach of traditional methodology. WSsim is closest to the traditional methodology and
would suit systems needing to output WordNet sense labels, for example because they
want to exploit the semantic relations in WordNet. Usim is application-independent. It
allows for evaluation of systems that relate usages, whether into clusters or simply on
a continuum. It could, for example, be used as a resource-independent gold standard
for word sense induction. Lexical substitution tasks are particularly useful where the
application being considered would benefit from lexical paraphrasing, for example, text
simplification, summarization, or query expansion in information retrieval.
5. Examining Sense Groupings Emerging from WSsim Annotation
Recently there has been considerable work on grouping fine-grained senses, often from
WordNet, into more coarse-grained sense groups (Palmer, Dang, and Fellbaum 2007).
The use of coarse-grained sense groups has been shown to yield considerable improve-
ments in inter-annotator agreement in manual annotation, as well as in the accuracy of
WSD systems (Palmer, Dang, and Fellbaum 2007; Pradhan et al. 2007). In our WSsim
annotation, we have used fine-grained WordNet senses, but we want to check that our
results are not an artifact of this fine-grained inventory. Furthermore, the annotation
results might be useful for identifying senses that could be grouped or for identifying
senses where grouping is not straightforward.
In WSsim, annotators gave ratings for each sense of a target word. If an annotator
perceives two senses of some target word as very similar, they will probably give them
similar ratings, and not just for a single sentence but across all the sentences featuring
the target word in question. So by looking for pairs of senses that tended to receive
similar ratings across all sentences, we can identify sense descriptions that according
to our annotators describe similar senses. Conversely, we expect that unrelated senses
would have dissimilar ratings. If there were many senses that the WSsim annotators
implicitly “grouped” by giving them similar ratings throughout, we would have to
revise our finding that WSsim annotators often perceived more than one sense to be
applicable, as they would have perceived only what could be described as one implicit
sense group.
If a coarse-grained sense grouping is designed with the aim of reflecting sense
distinctions that would be intuitively plausible to an untrained speaker of the language,
then senses in a common group should also be similar according to WSsim annotation.
So when WSsim annotators give very different ratings to senses that are in the same
coarse-grained group, or very similar ratings to senses that are in different groups, this
can point to problems in a coarse-grained sense group.
In this section, first we describe two existing sense groupings (Hovy et al. 2006;
Navigli, Litkowski, and Hargraves 2007). Then we test the extent to which the annotations accord with sense groupings by:

1. comparing judgments against the existing groupings, and re-examining the question of how often WSsim annotators found multiple different WordNet senses highly applicable; and
2. using the WSsim data to examine the extent to which the annotations could be used to induce sense groupings.
5.1 Existing Sense Grouping Efforts
OntoNotes. The OntoNotes project (Hovy et al. 2006; Chen and Palmer 2009) annotates
word sense, along with coreference and semantic roles. The senses that it uses for verbs
are WordNet 2.1 and 2.0, manually grouped based on both syntactic and semantic
criteria. Examples of these criteria include the causative/inchoative distinction, and
semantic features of particular argument positions, like animacy. Once the sense groups
for a lemma are constructed manually, they are used in trial annotation. If an inter-
annotator agreement of approximately 90% is reached, the lemma’s sense groups are
used for annotation; otherwise they are revised. Chen and Palmer report that the sense
groups used in OntoNotes have resulted in a rise in inter-annotator agreement as well
as annotator productivity. The third column of Table 24 shows OntoNotes groups for
the noun account.
5.1.1 The SemEval-2007 English All Words Task (EAW). For the English All Words task
at SemEval-2007, WordNet 2.1 senses were grouped by mapping them to the more
coarse-grained Oxford Dictionary of English senses. For the training data, this mapping
was done automatically; for the test data, the mapping was done by hand (Navigli,
Litkowski, and Hargraves 2007). For our analysis, we used only lemmas that were
included in the test data where the mapping had been produced manually.
Table 24
WordNet 2.1 senses of the noun account, and their groups in OntoNotes (ON) and EAW.

WordNet sense                                                                       WordNet    ON      EAW
                                                                                    sense no.  group   group
business relationship: "he asked to see the executive who handled his account"        3        1.1      5
report: "by all accounts they were a happy couple"                                     8        1.2      2
explanation: "I expected a brief account"                                              4        1.2      2
history, story: "he gave an inaccurate account of the plot [...]"                      1        1.3      2
report, story: "the account of his speech [...] made the governor furious"             2        1.3      2
account statement: "they send me an accounting every month"                            7        1.4      4
bill: "send me an account of what I owe"                                               9        1.4      4
score: "don't do it on my account"                                                     5        1.5      3
importance: "a person of considerable account"                                         6        1.6      3
the quality of taking advantage: "she turned her writing skills to good account"      10        1.7      1
Column (4) of Table 24 shows EAW groups for the noun account.24 The two resources
largely agree in the groupings for account. But whereas EAW groups senses 1, 2, 4, and
8 together, OntoNotes splits those senses into two groups.
5.2 Does WSsim Annotation Conform to Existing Sense Groups?
In the WSsim annotation, we have used the fine-grained senses of WordNet 3.0. But
annotators were free to give high ratings for a sentence to more than one sense. So
it is possible that they implicitly used more coarse-grained sense distinctions. In this
and the following section, we will explore the question of whether, and to what extent,
WSsim annotators used implicit coarse-grained sense groups. In this section, we will
first ask whether their annotation matched the sense groups of either OntoNotes or
EAW. OntoNotes and EAW differ in the lemmas they cover. Also, as we saw earlier,
when they both cover a lemma, they do not always agree in the sense groups that they
propose. So we study the agreement of WSsim annotation with the two sense groupings
separately. We only study the lemmas which are in both the WSsim data and in either
OntoNotes or the EAW test data, listed in Table 25.
Tables 26 and 27 show the results. Table 26 looks at the number of sentences where
two senses both had high ratings, but are in different groupings in either OntoNotes or
EAW. The first row shows how many sentences there were where two senses received
a judgment of ≥ 3, but the two senses were not in a common OntoNotes/EAW group.
The second row shows the same for judgments ≥ 4, and the last row for judgments
of 5 only. In general, the percentages are higher for EAW than for OntoNotes. This is
not due to any difference in granularity between the two resources. The EAW sense
groups encompass on average 2.3 fine-grained senses for the R1 lemmas and 2.6 for the
24 The table shows the EAW groups of the WordNet senses, but the group numbering is our own for ease of
reference as no labels are given in EAW.
Table 25
Lemmas in R1 and R2 WSsim that have coarse-grained mappings in OntoNotes and SemEval
2007 EAW.
Lemmas: account.n, add.v, ask.v, call.v, coach.n, different.a, dismiss.v, fire.v, fix.v, hold.v, lead.n, new.a, order.v, paper.n, rich.a, shed.v, suffer.v, win.v. (For each lemma, check marks in the table indicate the availability of an OntoNotes (ON) and/or EAW mapping in R1 and R2.)
Table 26
Sentences that have positive judgments for senses in different coarse groupings: percentage, and
absolute number in parentheses. J. = WSsim judgment, averaged over annotators.

                 OntoNotes                   EAW
J.          Rd. 1       Rd. 2          Rd. 1        Rd. 2
≥ 3         28% (42)    52% (52)       78% (157)    62% (50)
≥ 4         13% (19)    16% (16)       41% (82)     22% (18)
5            3% (5)      3% (3)         8% (17)      6% (5)
Table 27
Sentences that have widely different judgments for pairs of senses in the same coarse grouping:
percentage, and absolute number in parentheses. J1 = WSsim judgment for the sense with the
lower rating, averaged over annotators; J2 = averaged WSsim judgment for the higher-rated of
the two senses.

                       OntoNotes                   EAW
J1      J2        Rd. 1       Rd. 2          Rd. 1       Rd. 2
≤ 2     ≥ 4       35% (52)    30% (30)       20% (39)    60% (48)
≤ 2     5         11% (16)     4% (4)         2% (4)     15% (12)
R2 lemmas, and for OntoNotes the mean group sizes are 2.3 (R1) and 2.4 (R2). More
likely it is due to the individual lemmas. We observe that ratings of “similar” or higher
are frequent. In all conditions except WSsim-1/OntoNotes, we find percentages over
50%. On the other hand, there are many fewer sentences where two senses received
judgments of “very similar” or “identical” but were not in the same OntoNotes or
EAW group, but these cases do exist. For example, there were five sentences with the
target dismiss.v which in WSsim received an average judgment of 4 or 5 for senses from
two different OntoNotes groups, 1.1 and 1.2. As dismiss is an R2 lemma, for which
only 10 sentences were annotated, this means that this phenomenon was found in
half the sentences annotated. The two sense groups are related: One is a literal, the
other a metaphorical, use of the verb. OntoNotes group 1.1 is defined as ‘refuse to give
consideration to something or someone,’ and group 1.2 is ‘discharge, let go, persuade
to leave, send away.’ One such sentence was the second sentence in Table 13.
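The counts underlying Table 26 can be reproduced by a simple scan over the averaged ratings; the following sketch (our own, with hypothetical data structures) illustrates the idea:

```python
from itertools import combinations

def count_cross_group_sentences(wssim_avg, group_of, threshold):
    """wssim_avg: dict sentence -> dict sense -> averaged WSsim rating.
    group_of: dict sense -> coarse-grained group (OntoNotes or EAW).
    Counts sentences in which two senses from different coarse groups
    both receive a rating >= threshold (cf. Table 26)."""
    count = 0
    for sentence, ratings in wssim_avg.items():
        high = [s for s, r in ratings.items() if r >= threshold and s in group_of]
        if any(group_of[a] != group_of[b] for a, b in combinations(high, 2)):
            count += 1
    return count
```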
Table 27 lists the number of sentences where two senses in the same OntoNotes
or EAW grouping received widely different ratings in the WSsim annotation. The first
row shows how many sentences there were where one sense received a rating of ≤ 2
and another sense from the same OntoNotes or EAW group had a rating of ≥ 4. The
second row shows the same for sense pairs in the same coarse-grained group where
one received a rating of ≤ 2 and the other the highest possible rating of 5. (Note that
the table considers judgments averaged over all annotators, so this row counts only
sentences where all annotators agreed on the highest rating.) An example of such a case
is Rich people manage their money well. In WSsim the first sense in WordNet (possessing
material wealth) received an average score of 5 (i.e. a unanimous verdict), whereas all
other senses received a score of less than 2. This included the third sense (of great worth
or quality; ”a rich collection of antiques”), which had an average of 1.625, and sense 8
(suggestive of or characterized by great expense; ”a rich display”) with an average of 1.125.
Both senses 3 and 8 are in the same group as sense 1 in EAW.
These are sentences where the WSsim annotation suggests a more fine-grained
analysis than the OntoNotes and EAW groups offer. The percentages are substantial:
For the more inclusive analysis in the first row, the numbers are between 20% and 60%
of all sentences, and between 2% and 15% of sentences even fall into the more restrictive
case in the second row. There is no clear trend in whether we see more of this type of
disagreement for OntoNotes or for EAW, or for the first or the second round of WSsim
annotation.
We see that there are a considerable number of sentences where either two senses
from the same OntoNotes or EAW group have received diverging WSsim ratings, or
two senses from different groups have received high ratings. In this way, the WSsim
annotation can be used to scrutinize sense groupings: If one aim of the sense groupings
is to form groups that would match intuitive sense judgments by untrained subjects,
then WSsim annotation would suggest that the senses of dismiss.v that correspond to
‘dismiss a person’ and ‘dismiss an idea’ may be too close together to be placed in
different groups.
5.3 Inducing Sense Relatedness from WSsim Annotation
In the WSsim annotation, annotators have annotated each occurrence of a target word
with a rating for each of the WordNet senses for the target, as illustrated in Tables 6–8.
This allows us, conversely, to chart the ratings that a WordNet sense received across
all sentences. Table 28 shows this chart for two senses of the noun account. In the
table, ratings are averaged across all annotators. In this case, the averaged ratings are
Table 28
WSsim ratings for two senses of the noun account for 10 annotated sentences (averaged over
annotators).

WordNet                              Sentence
sense      1      2      3      4      5      6      7      8      9      10
1         1.00   2.25   1.13   4.25   1.13   1.00   1.13   1.13   1.13   4.25
4         1.50   3.00   1.25   2.88   1.50   1.50   1.63   1.00   1.38   3.88
Figure 4
Correlation between sense pairs: Distribution of rho values (Spearman’s rho).
similar for the two senses: They tend to be high for the same sentences, and low for the
same sentences. In general, senses that are closely related should tend to receive similar
ratings: high on the same sentences, and low on the same sentences, as illustrated for
the two senses in Table 28.
This then means that we can test the correlation on the ratings for two senses to see
if the WSsim annotators perceived them to be similar. We compute correlation for any
pair of senses for a common lemma, again using Spearman’s rho.25 Figure 4 shows the
distribution of rho values obtained for all the sense pairs, as histograms for R1 (left)
and R2 (right). When two senses are strongly positively correlated, this means that
the annotators likely viewed them as similar. When two senses are strongly negatively
correlated, this means they are probably so different that they tend never to be assigned
high ratings for the same sentences. We see that in both rounds, there were roughly as
many positive correlations as negative correlations. In R1, the rho values seem more or
less equally distributed over the range from −1 to 1. In R2, there were more annotators
and the distribution is closer to a normal distribution with more rho values close to 0.
25 We exclude senses that received a uniform rating of 1 on all items. For R1 there were no such cases and
for R2 there were only 14 out of a total of 275 senses.
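A sketch of this pairwise-sense correlation (our own illustration; `ratings_by_sense` is a hypothetical container of the averaged per-sentence ratings for one lemma):

```python
from itertools import combinations
from scipy.stats import spearmanr

def sense_pair_correlations(ratings_by_sense):
    """ratings_by_sense: dict sense -> list of averaged WSsim ratings, one per
    annotated sentence of the lemma (lists aligned by sentence).  Senses with a
    constant rating (e.g., uniformly 1) should be excluded first (cf. footnote 25)."""
    rhos = {}
    for s1, s2 in combinations(ratings_by_sense, 2):
        rho, _ = spearmanr(ratings_by_sense[s1], ratings_by_sense[s2])
        rhos[(s1, s2)] = rho
    return rhos
```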
We have shown the OntoNotes and EAW sense groups for the noun account. We can
now look at the WSsim-derived correlations for the same lemma, shown in Figure 5. The
first row in each box shows the WordNet sense number, and the second row shows the
OntoNotes and EAW sense groups. All three labels are those used in Table 24. Each edge
represents a correlation in the WSsim annotation. To avoid clutter, only correlations
with rho ≥ 0.5 are included, and a sense is only shown if it is correlated with any other
sense. Edge thickness corresponds to the value of the correlation coefficient rho between
each two senses; rho is also annotated on the edges. The first thing to note is that WSsim-
based correlation does not give us sense groups. Correlations are of different strengths,
and different cutoffs would result in different link graphs. Even for the chosen cutoff
of rho = 0.5, the correlations do not induce cliques (in the graph-theoretic sense). For
example, the sixth sense of account shows a correlation of rho ≥ 0.5 with the eighth
sense, but not with any of the other senses to which the eighth sense is linked. The
figure also shows that there are some senses that are strongly correlated in their annota-
tion but are not grouped in one or the other of the existing groupings. For example,
senses 3 (the executive who handles his account) and 7 (account statement) are strongly
correlated, but are in different groups in OntoNotes as well as in EAW. There are also
senses that share the same group in one of the coarse grained inventories, but have
a weak or even a negative correlation based on the WSsim annotation. For example,
Figure 5
Sense correlation in the WSsim annotation for the noun account. Showing all correlations with
rho ≥ 0.5. Upper row in each box: WordNet sense number. Lower row: OntoNotes and EAW
sense groups. Edge thickness expresses strength of correlation.
Figure 6
Overall correlation versus annotation in single sentences: Number of sentences in which two
senses with an overall correlation ≤ ρ have both been annotated with a judgment of ≥ j, for
j = 3, 4, 5. (Judgments averaged over annotators.)
for the lemma paper, senses 1 (a material made from cellulose pulp) and 4 (a medium for
written communication) are in the same EAW group, but have a correlation in WSsim of
rho = −0.52.
In Section 5.2 we asked whether the many cases where WSsim annotators gave
high ratings to more than a single sense could possibly be explained by them im-
plicitly using more coarse-grained senses. We answered this question by comparing
the WSsim annotation to OntoNotes and EAW sense groups, finding a considerable
number of sentences where two senses received a high rating but were not from the
same sense group. Now we can repeat the question, but try to answer it using the
WSsim sense relations obtained from correlation: Is it possible that WSsim annotators
implicitly used more coarse-grained senses, but just not the OntoNotes or EAW sense
groups?
We tested how often annotators gave ratings of at least similar (i.e., ratings ≥ 3) to
senses that were related at a level ≤ rho, for rho ranging from −1 to 1. The question
that we want to answer is: If annotators give high ratings to multiple senses on the
same sentence, is it always to senses that are strongly positively correlated, or do they
sometimes pick multiple senses that are not strongly correlated, or even senses that
are negatively correlated? The results are shown in Figure 6. First, we can see that
there is a sizeable number of sentences where two senses that are negatively correlated
have both received a positive judgment. For R1, the numbers for negatively correlated
senses are 135 ( j ≥ 3), 29 ( j ≥ 4), and 2 ( j = 5). For R2, the numbers of sentences are
lower absolutely and in proportion, with 29 ( j ≥ 3), 7 ( j ≥ 4), and 0 ( j = 5). It is also
interesting to look at a less stringent threshold than rho ≤ 0; we can use the significance
levels p ≤ 0.05 and p ≤ 0.01 for this. If we look at sense pairs that were not positively
correlated at p ≤ 0.05 (p ≤ 0.01), there were 185 (205) sentences in R1 and 54 (88)
sentences in R2 where two such senses both received judgments of 3 or higher. Note
that the significance levels of p ≤ 0.05, p ≤ 0.01 are here just arbitrary thresholds at
which to inspect the data; they are not thresholds that determine the significance of
some hypothesis.26 This brings us back to the question asked above of whether the
WSsim annotators implicitly used more coarse-grained senses. If they had implicitly
used more coarse-grained senses, we would have expected to see very few cases where
unrelated senses got a high rating on the same sentence. What we found instead was
that such cases were relatively frequent, which implies that WSsim annotators in both
rounds “mix and match” senses specifically for each sentence that they evaluate. For
example, the senses 1 (she dismissed his advances) and 5 (I was dismissed after I gave my
report) of dismiss are negatively correlated (rho = −0.61) yet have average judgments of
3.25 and 4.125 on the second example in Table 13.
5.3.1 Summary. In this section we have analyzed the WSsim annotation in comparison
with more coarse-grained sense repositories. One aim was to find out whether anno-
tators really used the fine granularity that the WSsim task offered or whether they
implicitly used more coarse-grained senses. Both by comparing the WSsim annotation
to coarse-grained OntoNotes and EAW sense groups, and by comparing the WSsim
annotation to the sense relations implied by WSsim, we find that annotators did make
use of the ability to combine sense ratings in a way that was particular to each sentence
they annotated. We also conclude that WSsim annotation can be used to evaluate
OntoNotes and EAW groupings with respect to the level to which senses are intuitively
distinguishable to untrained subjects. Here, WSsim annotation can uncover senses in
different groups that WSsim annotators often conflate, or senses in a single coarse group
that WSsim annotators treat differently.
6. Usim and Sense Groupings
One of the major motivations for the Usim task is that it allows us to examine the
meanings of words without recourse to a predefined inventory. We have demonstrated
in this paper that the data from this task can be compared directly to paraphrase
data as well as to data annotated for word sense. In the previous section we have
focused on using our WSsim data to examine existing sense groupings. WSsim is useful
precisely because it has sense annotations from an existing inventory, WordNet, so we
can use the graded annotations to see how these senses relate, and also relationships
between coarser grained inventories with mappings to WordNet. Usim does not capture
this information, nevertheless it might be useful as a resource for examining sense
groupings. We can use it to examine the extent to which sense groupings keep usages
together that have a high usage similarity according to Usim, and keep sentences
with low usage similarity apart. In this analysis, we use the data from R2 because
this has Usim judgments for sentences alongside traditional word sense annotations
(WSbest). As WSbest annotation, we use the mode of the chosen senses27 (as in the
analysis in Section 4.7) for each sentence in R2, and map it to its coarse-grained sense in
EAW and/or OntoNotes. We then compute the average Usim similarity for all pairs of sentences with the same coarse-grained sense, and compare it with the average Usim similarity for sentence pairs with different coarse-grained senses. The results are shown in the first row of figures in Table 29. We see that the OntoNotes and EAW sense groups do indeed partition the sentences such that pairs within the same group have high usage similarity (4 or above) and those in different groups have low usage similarity (2 or below).

26 We are performing multiple tests on the same senses, which increases the likelihood of falsely assuming two senses to be significantly correlated at some significance level (Type I errors). The significance levels are only arbitrary thresholds in our case, however. In addition, our analysis focuses on sense pairs that are not significantly positively correlated. For that reason, Type I errors actually reduce our estimate of the number of sentences in which two non-related senses both received high ratings. Conversely, correcting for multiple testing makes our estimate less conservative: if we count sentences with positive ratings for sense pairs that are not positively correlated at p ≤ 0.05 with Bonferroni correction, the number of sentences rises from 185 to 207 for judgments of 3 or higher.

27 We perform this analysis only on sentences where a single sense was found as the mode and where this sense had a coarse-grained mapping in either the EAW or OntoNotes resources.
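The same/different comparison can be computed directly from the judgments. The following small sketch assumes that `usim` maps an unordered pair of sentence identifiers to its average Usim rating and that `coarse` maps a sentence identifier to the coarse group of its WSbest mode sense; both data structures are assumptions for illustration, not part of the released data format.

# Sketch of the same-group vs. different-group comparison described above.
from statistics import mean

def same_vs_different_usim(usim, coarse):
    same, different = [], []
    for pair, rating in usim.items():
        s1, s2 = tuple(pair)
        if s1 in coarse and s2 in coarse:          # skip sentences without a mapping
            (same if coarse[s1] == coarse[s2] else different).append(rating)
    return (mean(same) if same else None,
            mean(different) if different else None)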
The second part of Table 29 performs the same analysis on the basis of individual
lemmas. A dash (–) means that either there was no coarse mapping, or there were no
sentence pairs in this category. For example, there were no sentence pairs identified
as having different OntoNotes groups or EAW groups for the lemma suffer.v. For the
lemmas call.v and dismiss.v, the two sense inventories give rise to the same groupings of
sentence pairs.
In the table, we see many lemmas where the groupings concur with the Usim
judgments. One example is account.n, where the sentence pairs in the same coarse group
get high average Usim values, whereas sentence pairs with different coarse groups have
low average Usim values. There are, however, a few lemmas where the average Usim values indicate either that the coarse groupings might benefit from another inspection, or that the lemma has meanings with subtle relationships where grouping is not a
straightforward exercise. One example is new.a, which has the same high Usim values
for both same and different categories in EAW. Another is shed.v, where the sentences
annotated with the same OntoNotes groups actually have a lower average Usim value
than those with different groups.
Table 29
Average Usim rating for R2 where WSbest annotations suggested the same or different coarse grouping.

              OntoNotes           EAW
              same    different   same    different
all lemmas    4.0     1.9         4.1     2.0

by lemma
account.n     4.0     1.6         4.0     1.5
call.v        4.3     1.4         4.3     1.4
coach.n       4.6     2.3         –       –
dismiss.v     3.8     2.6         3.8     2.6
fire.v        4.6     1.2         –       –
fix.v         4.2     1.1         –       –
hold.v        4.5     2.0         3.8     1.9
lead.v        –       –           2.9     1.5
new.a         –       –           4.6     4.6
order.v       4.3     1.7         –       –
rich.a        –       –           4.6     2.0
shed.v        2.9     3.3         –       –
suffer.v      4.2     –           4.2     –

We can also use Usim judgments to analyze individual sense groups. This could be useful in determining specific groups that might warrant further revision, or that represent meanings which are simply difficult to distinguish. To demonstrate this, we analyzed all coarse-grained sense groups with at least one sentence pair in R2, that is, all groups that had at least two R2 sentences whose WSbest mode mapped to that coarse group. (Naturally, due to the skewed nature of sense distributions and the fact that
we only have ten sentences for each lemma, some groups do not meet this criterion.)
We find that the majority of groups that were analyzed have an average Usim rating
of over 4. This is the case for 75% of the analyzed EAW groups and 76% of OntoNotes
groups. There were, however, groups with very low values. One example was group 1.1
of shed.v in OntoNotes, with an average Usim rating of 2.9. This group includes both
literal senses (trees shed their leaves) and metaphorical senses (he shed his image as a
pushy boss) of the verb shed. Another example is group 7 of lead.n in EAW, also with
an average Usim of 2.9. This group includes taking the lead as well as lead actor, so quite
a diverse collection of usages. Two example sentences annotated with these two senses
are shown here. This pair had an average Usim value of 1.25.
My students perform a wide variety of music and they can be found singing leading
roles in their high school and college musical productions, singing lead in rock and
wedding bands, winning classical music competitions, singing at the summer
conservatory of The Papermill Playhouse, and learning to sing so they can sing with
local choirs.
And as a result of President Bush’s initiative, which he took as part of the G-8
Presidency, and also the other changes in which the US, UK has been in the lead, not
least in Afghanistan and Iraq, you can now feel the winds of change blowing through
the Arab world.
In the future we hope to obtain more Usim data. When we have more data, we will
investigate whether the groupings that Usim identifies as problematic tend to be the
same ones that require more iterations in inventory construction (Hovy et al. 2006). We
also plan to test whether groupings with low Usim ratings tend to have lower inter-
tagger agreement on traditional WSD annotation.
7. Computational Modeling
The graded meaning annotations from Usim and WSsim can be used to evaluate computational models of word meaning. In this section we summarize existing
work on modeling the R1 data, which has already been made publicly available.
The WSsim data can be used to evaluate graded word meaning models as well
as traditional WSD systems. Instead of evaluating only the highest-confidence sense of a
WSD model, we can take a more nuanced look at a model’s predictions, and give credit if
it proposes multiple appropriate senses. In Erk and McCarthy (2009) we take advantage
of this fact to evaluate and compare two supervised models on the WSsim data: a
traditional WSD model, and a distributional model that forms one prototype vector for
each sense of a given lemma. Both are trained on traditional single-sense annotation, but
the prototype model does not see any negative data during training in order to avoid
spurious negative data. For training, each word occurrence is represented as either a
first-order or a second-order bag-of-words vector of its sentence. In an evaluation using
weighted variants of precision and recall, we find that when the traditional WSD model
is credited for all the senses that it proposes, rather than only the single sense with
the highest confidence value, it does much better on both measures. This shows that
the model does propose multiple appropriate senses, such that its performance may be
underestimated in traditional evaluation. As was to be expected, the prototype models
that do not see negative data during training have much higher recall at lower precision,
for an overall better F-score (again using weighted variants of the evaluation measures).
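For illustration, the sketch below shows one possible way of defining weighted precision and recall against graded gold ratings. It is an illustrative reading only, not the exact definitions used in Erk and McCarthy (2009): gold WSsim ratings act as weights, precision asks how much gold weight the proposed senses carry on average, and recall asks how much of the total gold weight the proposal covers.

# Illustrative (assumed) weighted precision/recall for one word occurrence.
def weighted_precision_recall(proposed, gold_ratings, min_rating=1.0, max_rating=5.0):
    """proposed: set of sense ids the system assigns to this occurrence.
    gold_ratings: dict mapping sense id -> average annotator rating (1-5)."""
    # Rescale ratings to [0, 1] so that a rating of 1 ("does not apply") counts as 0.
    weight = {s: (r - min_rating) / (max_rating - min_rating)
              for s, r in gold_ratings.items()}
    if not proposed or not weight:
        return 0.0, 0.0
    covered = sum(weight.get(s, 0.0) for s in proposed)
    precision = covered / len(proposed)
    total = sum(weight.values())
    recall = covered / total if total > 0 else 0.0
    return precision, recall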
Thater, Fürstenau, and Pinkal (2010) address the WSsim data with an unsupervised
model. It represents a word sense as the sum of the vectors for all synonyms in its synset,
plus the vectors for all hypernyms scaled down by a factor of 10. They also use a more
complex, syntax-based model to derive occurrence representations. Unfortunately their
results are not directly comparable to Erk and McCarthy (2009) because they evaluate
on a subset of the data (verb lemmas only).
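The following sketch illustrates our reading of this sense representation, using NLTK's WordNet interface. The mapping `word_vectors` from words to vectors is assumed, and only direct hypernyms are used, as a simplification.

# Sketch of a sense vector built from a WordNet synset: the sum of the vectors
# of all synonyms in the synset, plus hypernym lemma vectors scaled down by 10.
import numpy as np
from nltk.corpus import wordnet as wn

def sense_vector(synset, word_vectors, dim, hypernym_weight=0.1):
    vec = np.zeros(dim)
    for lemma in synset.lemma_names():            # synonyms in the synset itself
        vec += word_vectors.get(lemma, np.zeros(dim))
    for hypernym in synset.hypernyms():           # direct hypernyms only (simplification)
        for lemma in hypernym.lemma_names():
            vec += hypernym_weight * word_vectors.get(lemma, np.zeros(dim))
    return vec

# Example: one vector per WordNet sense of the verb "dismiss".
# sense_vecs = [sense_vector(s, word_vectors, 300) for s in wn.synsets("dismiss", pos=wn.VERB)]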
The Usim data, which directly describes the similarity of pairs of usages, can be
used to evaluate distributional models of word meaning in context. So far, only one type
of model has been evaluated on this data to the best of our knowledge: the clustering-
based approach of Reisinger and Mooney (2010). They use the Usim data to test to what
extent their clusters correspond to human intuitions on a word’s senses. Their result is
negative, as a low correlation of human judgments and predictions suggests to them
that the induced clusters are not a good match for human senses. The Usim data is particularly interesting because it opens up a new way of evaluating distributional and vector space approaches to word meaning in context. These have so far been evaluated on the tasks of lexical substitution (Erk and Pado 2008; Dinu and Lapata 2010; Thater, Fürstenau, and Pinkal 2010; Van de Cruys, Poibeau, and Korhonen 2011), information retrieval, and word sense disambiguation (Schütze 1998); Usim, in contrast, offers a more direct evaluation perspective.
8. Conclusion
In this paper we have explored the question of whether word meaning can be described
in a graded fashion. Our aim has been to use annotation with graded ratings to capture
untrained speakers’ intuitions on word meaning. Our motivation has been two-fold. On
the one hand we are drawing on current theories of cognition, which hold that mental
concepts have “fuzzy boundaries.” On the other hand we wanted to provide a basis for current computational models of word meaning in context that predict degrees of similarity between word occurrences. We have addressed this question through two novel types of graded annotation of word meaning in context that draw on methods from psycholinguistic experimentation. WSsim obtains word sense annotations from a
given sense inventory but uses graded judgments for each sense. Usim judges similarity
of pairs of usages of the same lemma.
The analysis of annotation results lets us answer our main question in the affir-
mative. Annotators can describe word meaning through graded ratings with good
inter-annotator agreement, measured through pairwise correlation. Even though no in-
depth training on sense distinctions was provided, the pairwise correlations were good
in every single case, indicating that all annotators did the tasks in a similar fashion.
In both tasks, all annotators made use of the full graded scale, and did not treat the
task as binary. The Usim annotation provides us with a means of comparing different
word meaning annotation paradigms. We have used it to demonstrate that there is a strong correlation of these new annotations both with traditional WSD labels and with the overlap of lexical paraphrases. This is as we anticipated, as all of these annotations are
describing the same phenomenon of word meaning in context through different means.
In additional analysis of the WSsim annotation, we found a high proportion of
sentences (between 23% and 46%) in which multiple senses received high positive
judgments from the same annotators. At the same time, annotators used the WSsim
ratings in a nuanced and fine-grained fashion, sometimes assigning high ratings on the
same sentence to two senses that overall patterned very differently. Analyzing Usim
annotation, we found that all annotators’ ratings obey the triangle inequality in almost
all cases. This can be taken as a measure of intra-annotator consistency on the task. It also means that current distributional approaches to word meaning in context are justified in viewing usage similarity as a metric. The triangle inequality can be used to check the validity of future Usim annotation.
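Such a check is straightforward to implement. The sketch below assumes similarity ratings on the 1–5 scale and converts them to distances as d = 5 − s, which is one simple choice on our part rather than a mapping prescribed by the annotation scheme.

# Sketch of a triangle-inequality check over one annotator's Usim ratings.
from itertools import combinations

def triangle_violations(ratings):
    """ratings: dict mapping frozenset({u1, u2}) of usage ids to a 1-5 similarity rating."""
    distance = {pair: 5 - s for pair, s in ratings.items()}   # assumed similarity-to-distance mapping
    usages = sorted({u for pair in ratings for u in pair})
    violations = []
    for a, b, c in combinations(usages, 3):
        ab, bc, ac = frozenset((a, b)), frozenset((b, c)), frozenset((a, c))
        if all(p in distance for p in (ab, bc, ac)):
            if (distance[ac] > distance[ab] + distance[bc]
                    or distance[ab] > distance[ac] + distance[bc]
                    or distance[bc] > distance[ab] + distance[ac]):
                violations.append((a, b, c))
    return violations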
We do not propose that either one of our annotations is a panacea for evaluation
of systems that represent word meaning in context, but we argue that they provide
data sets that better reflect the fluid nature of word meaning and allow us to evaluate
questions related to word meaning in a new fashion. In this paper, we have used
both WSsim and Usim data to analyze existing coarse-grained sense inventories. We
have demonstrated that, depending on the lemma, it is often not straightforward to group sentences into disjoint senses. We have also shown how both WSsim- and Usim-style judgments can be used to identify problematic lemmas, as well as sense groupings that may warrant another inspection to check whether they match naive speakers' intuitive
judgments. The graded annotation can also be used to identify lemmas whose usages
are difficult to group into clear distinct senses. This information can in the future be
used to handle such lemmas differently when making sense inventories, in annotation,
and in computational systems.
An important next question to consider is the use of WSsim and Usim data to eval-
uate computational models of word meaning. As we have shown (Erk and McCarthy
2009), WSsim data can be used to evaluate traditional WSD systems in a graded fashion.
We plan to do a larger-scale evaluation to assess to what extent the performance of current WSD systems is underestimated. Also, fine-grained WSsim annotation can be used for a comparison of fine-grained and coarse-grained traditional WSD systems. We have also shown (Erk and McCarthy 2009) that WSsim can be used to evaluate graded word sense assignment systems. Although we used a supervised setting, we
trained on traditional sense annotation. We plan to collect more WSsim annotation in
order to be able to train word sense assignment systems on graded data, for example,
using a regression model.
In the same vein, we will extend the available Usim data to cover many more
sentences by using crowdsourcing. The use of Usim for supervised training of word
meaning models is particularly interesting as all existing usage similarity models are
unsupervised; given previous results in WSD, we can expect that supervision will
improve the performance of models of usage meaning. One way of using Usim data
in training is to learn a similarity metric. Metric learning (see, e.g., Davis et al. 2007)
induces a distance measure from given constraints stating similarity or dissimilarity of
items.
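As a simple illustration of the idea (not of the information-theoretic method of Davis et al. 2007 itself), the following sketch learns a diagonally weighted distance from pairs of usage vectors labeled as similar or dissimilar; the feature vectors and constraints are assumed inputs.

# Toy metric-learning sketch: learn non-negative per-dimension weights so that
# similar pairs end up close and dissimilar pairs end up at least `margin` apart.
import numpy as np

def learn_diagonal_metric(constraints, dim, margin=1.0, lr=0.01, epochs=100):
    """constraints: list of (x, y, similar) with x, y numpy vectors of length dim
    and similar a boolean label (assumed format)."""
    w = np.ones(dim)                               # diagonal metric weights
    for _ in range(epochs):
        for x, y, similar in constraints:
            diff2 = (x - y) ** 2
            dist = np.sqrt(np.maximum(w, 0.0) @ diff2)
            if similar:
                grad = diff2                       # pull similar pairs closer
            elif dist < margin:
                grad = -diff2                      # push dissimilar pairs apart within the margin
            else:
                grad = np.zeros(dim)
            w = np.maximum(w - lr * grad, 0.0)     # keep weights non-negative
    return w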
Our novel graded annotation frameworks, WSsim and Usim, are validated both through good agreement between those data sets themselves and through good agreement between those data sets and traditional word sense annotation and lexical substitutions. Because all labeling schemes provide comparable results, systems can be evaluated in different ways, each providing a different perspective on system output. Furthermore,
the different paradigms may suit different types of systems. Lexical substitution tasks
(McCarthy 2002; McCarthy and Navigli 2009) are particularly useful where the ap-
plication being considered would benefit from lexical paraphrasing, for example, text
simplification, summarization, or query expansion in information retrieval. WSsim is
closest to the traditional methodology (WSbest) and would suit systems needing to
output WordNet sense labels, for example, because they want to exploit the semantic
relations in WordNet for tasks such as inferencing or producing lexical chains. Unlike
WSbest, it avoids a winner-takes-all approach and allows for more nuanced sense
tagging. Usim is application-independent. It allows for evaluation of systems that relate
usages, whether into clusters or simply on a continuum. It could, for example, be used
as a resource-independent gold standard for word sense induction by calculating the within-class and across-class similarities. Aside from its use as an enabling technology within a natural language processing application, a system that performs well at the Usim task may be useful in its own right. For example, it could be used to enable lexicographers to work on groups of examples that reflect similar meanings, or to find further examples close to the one being scrutinized.
Acknowledgments
The annotation was funded by a UK Royal
Society Dorothy Hodgkin Fellowship to
Diana McCarthy. This work was supported
by National Science Foundation grant
IIS-0845925 for Katrin Erk. We are grateful
to Huw McCarthy for implementing the
interface for round 2 of the annotation. We
thank the anonymous reviewers for many
helpful comments and suggestions.
References
Agirre, Eneko and Philip Edmonds,
editors. 2007. Word Sense Disambiguation:
Algorithms and Applications. Springer,
Dordrecht.
Agirre, Eneko, Lluís Màrquez, and Richard
Wicentowski, editors. 2007. Proceedings
of the Fourth International Workshop on
Semantic Evaluations (SemEval-2007).
Prague.
Baroni, Marco and Roberto Zamparelli. 2010.
Nouns are vectors, adjectives are matrices:
Representing adjective-noun constructions
in semantic space. In Proceedings of the 2010
Conference on Empirical Methods in Natural
Language Processing, pages 1,183–1,193,
Cambridge, MA.
Brown, Susan. 2008. Choosing sense
distinctions for WSD: Psycholinguistic
evidence. In Proceedings of ACL-08: HLT,
Short Papers (Companion Volume),
pages 249–252, Columbus, OH.
Brown, Susan. 2010. Finding Meaning:
Sense Inventories for Improved Word Sense
Disambiguation. Ph.D. thesis, University
of Colorado at Boulder.
Burchardt, Aljoscha, Katrin Erk, Annette
Frank, Andrea Kowalski, Sebastian Pado,
and Manfred Pinkal. 2006. The SALSA
corpus: A German resource for lexical
semantics. In Proceedings of the Fifth
International Conference on Language
Resources and Evaluation (LREC 2006),
pages 969–974, Genoa.
Carpuat, Marine and Dekai Wu. 2007a. How
phrase sense disambiguation outperforms
word sense disambiguation for statistical
machine translation. In Proceedings
of the 11th Conference on Theoretical and
Methodological Issues in Machine Translation
(TMI 2007), pages 43–52, Skovde.
Carpuat, Marine and Dekai Wu. 2007b.
Improving statistical machine translation
using word sense disambiguation. In
Proceedings of the 2007 Joint Conference on
Empirical Methods in Natural Language
Processing and Computational Natural
Language Learning (EMNLP-CoNLL 2007),
pages 61–72, Prague.
Chen, Jinying and Martha Palmer. 2009.
Improving English verb sense
disambiguation performance with
linguistically motivated features and clear
sense distinction boundaries. Journal of
Language Resources and Evaluation (Special
Issue on SemEval-2007), 43:181–208.
Coecke, Bob, Mehrnoosh Sadrzadeh, and
Stephen Clark. 2010. Mathematical
foundations for a compositional
distributed model of meaning. Lambek
Festschrift, Linguistic Analysis, 36:345–384.
Coleman, Linda and Paul Kay. 1981.
Prototype semantics: The English word
“lie.” Language, 57:26–44.
Copestake, Ann and Ted Briscoe. 1995.
Semi-productive polysemy and sense
extension. Journal of Semantics, 12:15–67.
Cruse, D. A. 1995. Polysemy and related
phenomena from a cognitive linguistic
viewpoint. In Philip Saint-Dizier and
Evelyne Viegas, editors, Computational
Lexical Semantics. Cambridge University
Press, pages 33–49.
Davis, Jason, Brian Kulis, Prateek Jain,
Suvrit Sra, and Inderjit Dhillon. 2007.
Information-theoretic metric learning.
In Proceedings of the 24th International
Conference on Machine Learning,
pages 209–216, Corvallis, OR.
Deschacht, Koen and Marie-Francine Moens.
2009. Semi-supervised semantic role
labeling using the Latent Words Language
Model. In Proceedings of EMNLP,
pages 21–29, Singapore.
Dinu, Georgiana and Mirella Lapata. 2010.
Measuring distributional similarity in
context. In Proceedings of the 2010
Conference on Empirical Methods in Natural
Language Processing, pages 1,162–1,172,
Cambridge, MA.
Edmonds, Philip and Scott Cotton, editors. 2001. Proceedings of the SensEval-2 Workshop. Toulouse. See http://www.sle.sharp.co.uk/senseval.
Erk, Katrin and Diana McCarthy. 2009.
Graded word sense assignment. In
Proceedings of the 2009 Conference on
Empirical Methods in Natural Language
Processing, pages 440–449, Singapore.
Erk, Katrin, Diana McCarthy, and Nicholas
Gaylord. 2009. Investigations on word
senses and word usages. In Proceedings
of the Joint Conference of the 47th Annual
Meeting of the ACL and the 4th International
Joint Conference on Natural Language
Processing of the AFNLP, pages 10–18,
Suntec.
Erk, Katrin and Sebastian Pado. 2008. A
structured vector space model for word
meaning in context. In Proceedings of
EMNLP-08, pages 897–906, Waikiki, HI.
Erk, Katrin and Sebastian Pado. 2010.
Exemplar-based models for word meaning
in context. In Proceedings of the ACL 2010
Conference Short Papers, pages 92–97,
Uppsala.
Erk, Katrin and Carlo Strapparava, editors.
2010. Proceedings of the 5th International
Workshop on Semantic Evaluation (SemEval).
Uppsala.
Frazier, Lyn and Keith Rayner. 1990. Taking
on semantic commitments: Processing
multiple meanings vs. multiple senses.
Journal of Memory and Language,
29:181–200.
Gentner, Dedre and Virginia Gunn. 2001.
Structural alignment facilitates the
noticing of differences. Memory and
Cognition, 21:565–577.
Gentner, Dedre and Arthur Markman. 1997.
Structural alignment in analogy and
similarity. American Psychologist, 52:45–56.
Grefenstette, Edward and Mehrnoosh
Sadrzadeh. 2011. Experimental
support for a categorical compositional
distributional model of meaning.
In Proceedings of the 2011 Conference on
Empirical Methods in Natural Language
Processing, pages 1,394–1,404, Edinburgh.
Hampton, James A. 1979. Polymorphous
concepts in semantic memory. Journal
of Verbal Learning and Verbal Behavior,
18:441–461.
Hampton, James A. 2007. Typicality, graded
membership, and vagueness. Cognitive
Science, 31:355–384.
Hanks, Patrick. 2000. Do word meanings exist? Computers and the Humanities, 34(1–2):205–215.
Hovy, Eduard H., Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. OntoNotes: The 90% solution. In Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL (NAACL-2006), pages 57–60, New York.
Ide, Nancy and Yorick Wilks. 2006. Making
sense about sense. In Eneko Agirre and
Philip Edmonds, editors, Word Sense
Disambiguation, Algorithms and Applications.
Springer, Dordrecht, pages 47–73.
Kilgarriff, Adam. 1992. Polysemy. Ph.D.
thesis, University of Sussex.
Kilgarriff, Adam. 1997. I don’t believe in
word senses. Computers and the Humanities,
31(2):91–113.
Kilgarriff, Adam. 2006. Word senses.
In Eneko Agirre and Philip Edmonds,
editors, Word Sense Disambiguation:
Algorithms and Applications. Springer,
Dordrecht, pages 29–46.
Kilgarriff, Adam and Martha Palmer,
editors. 2000. Senseval: Special Issue of the
Journal Computers and the Humanities,
volume 34(1–2). Kluwer, Dordrecht.
Kilgarriff, Adam and Joseph Rosenzweig.
2000. Framework and results for English
Senseval. Computers and the Humanities,
34(1-2):15–48.
Kintsch, Walter. 2007. Meaning in context. In
T. K. Landauer, D. McNamara, S. Dennis,
and W. Kintsch, editors, Handbook of Latent
Semantic Analysis. Erlbaum, Mahwah, NJ,
pages 89–105.
Klein, Devorah and Gregory Murphy. 2001.
The representation of polysemous words.
Journal of Memory and Language,
45:259–282.
Klein, Devorah and Gregory Murphy. 2002.
Paper has been my ruin: Conceptual
relations of polysemous senses. Journal of
Memory and Language, 47:548–570.
Klepousniotou, Ekaterini. 2002. The
processing of lexical ambiguity:
Homonymy and polysemy in the mental
lexicon. Brain and Language, 81:205–223.
Klepousniotou, Ekaterini, Debra Titone, and
Caroline Romero. 2008. Making sense of
word senses: The comprehension of
polysemy depends on sense overlap.
Journal of Experimental Psychology: Learning,
Memory, and Cognition, 34(6):1,534–1,543.
Krishnamurthy, Ramesh and Diane
Nicholls. 2000. Peeling an onion: The
lexicographers’ experience of manual
sense-tagging. Computers and the
Humanities, 34(1-2):85–97.
Landauer, Thomas and Susan Dumais. 1997.
A solution to Plato’s problem: The Latent
Semantic Analysis theory of acquisition,
induction, and representation of
knowledge. Psychological Review,
104:211–240.
Landes, Shari, Claudia Leacock, and
Randee Tengi. 1998. Building semantic
concordances. In Christiane Fellbaum,
editor, WordNet: An Electronic Lexical
Database. The MIT Press, Cambridge, MA.
Lefever, Els and V´eronique Hoste. 2010.
Semeval-2010 task 3: Cross-lingual word
sense disambiguation. In Proceedings of the
5th International Workshop on Semantic
Evaluation, pages 15–20, Uppsala.
McCarthy, Diana. 2002. Lexical substitution
as a task for wsd evaluation. In Proceedings
of the ACL Workshop on Word Sense
Disambiguation: Recent Successes and
Future Directions, pages 109–115,
Philadelphia, PA.
McCarthy, Diana and Roberto Navigli. 2007.
SemEval-2007 task 10: English lexical
substitution task. In Proceedings of the
4th International Workshop on Semantic
Evaluations (SemEval-2007), pages 48–53,
Prague.
McCarthy, Diana and Roberto Navigli. 2009.
The English lexical substitution task.
Language Resources and Evaluation Special
Issue on Computational Semantic Analysis
of Language: SemEval-2007 and Beyond,
43(2):139–159.
McNamara, Timothy P. 2005. Semantic
Priming: Perspectives from Memory and Word
Recognition. Psychology Press, New York.
Mihalcea, Rada and Timothy Chklovski.
2003. Open Mind Word Expert: Creating
large annotated data collections with web
users’ help. In Proceedings of the EACL 2003
Workshop on Linguistically Annotated
Corpora (LINC 2003), pages 53–60,
Budapest.
Mihalcea, Rada, Timothy Chklovski, and
Adam Kilgarriff. 2004. The SENSEVAL-3
English lexical sample task. In Rada
Mihalcea and Phil Edmonds, editors,
Proceedings SENSEVAL-3 Second
International Workshop on Evaluating
Word Sense Disambiguation Systems,
pages 25–28, Barcelona.
Mihalcea, Rada and Phil Edmonds, editors.
2004. Proceedings SENSEVAL-3 Second
International Workshop on Evaluating
Word Sense Disambiguation Systems,
Barcelona.
Mihalcea, Rada, Ravi Sinha, and Diana
McCarthy. 2010. Semeval-2010 task 2:
Cross-lingual lexical substitution. In
Proceedings of the 5th International
Workshop on Semantic Evaluation,
pages 9–14, Uppsala.
Miller, George A., Claudia Leacock,
Randee Tengi, and Ross T. Bunker.
1993. A semantic concordance. In
Proceedings of the ARPA Workshop
on Human Language Technology,
pages 303–308, Plainsboro, NJ.
Mitchell, Jeff and Mirella Lapata. 2008.
Vector-based models of semantic
composition. In Proceedings of ACL-08:
HLT, pages 236–244, Columbus, OH.
Mitchell, Jeff and Mirella Lapata. 2010.
Composition in distributional models
of semantics. Cognitive Science,
34(8):1388–1429.
Moon, Taesun and Katrin Erk. In press.
An inference-based model of word
meaning in context as a paraphrase
distribution. ACM Transactions on
Intelligent Systems and Technology
special issue on paraphrasing.
Murphy, Gregory L. 1991. Meaning and
concepts. In Paula Schwanenflugel, editor,
The Psychology of Word Meanings. Lawrence
Erlbaum Associates, Mahwah, NJ,
pages 11–35.
Murphy, Gregory L. 2002. The Big Book of
Concepts. MIT Press, Cambridge, MA.
Navigli, Roberto. 2009. Word sense
disambiguation: a survey. ACM
Computing Surveys, 41(2):1–69.
Navigli, Roberto, Kenneth C. Litkowski,
and Orin Hargraves. 2007. SemEval-2007
task 7: Coarse-grained English all-words
task. In Proceedings of the 4th International
Workshop on Semantic Evaluations
(SemEval-2007), pages 30–35, Prague.
Palmer, Martha, Hoa Trang Dang, and
Christiane Fellbaum. 2007. Making
fine-grained and coarse-grained sense
distinctions, both manually and
automatically. Natural Language
Engineering, 13:137–163.
Passonneau, Rebecca, Ansaf Salleb-Aouissi,
Vikas Bhardwaj, and Nancy Ide. 2010.
Word sense annotation of polysemous
words by multiple annotators. In
Proceedings of LREC-7, pages 3,244–3,249,
Valleta.
Pickering, Martin and Steven Frisson. 2001.
Processing ambiguous verbs: Evidence
from eye movements. Journal of
Experimental Psychology: Learning,
Memory, and Cognition, 27:556–573.
Pradhan, Sameer, Edward Loper, Dmitriy
Dligach, and Martha Palmer. 2007.
Semeval-2007 task 17: English lexical
sample, SRL and all words. In 4th
International Workshop on Semantic
Evaluations (SemEval-4) at ACL-2007,
pages 87–92, Prague.
Preiss, Judita and David Yarowsky, editors.
2001. Proceedings of Senseval-2 Second
International Workshop on Evaluating Word
Sense Disambiguation Systems, Toulouse.
Pustejovsky, James. 1991. The generative
lexicon. Computational Linguistics,
17(4):409–441.
Reddy, Siva, Ioannis P. Klapaftis, Diana
McCarthy, and Suresh Manandhar. 2011.
Dynamic and static prototype vectors for
semantic composition. In Proceedings of The
5th International Joint Conference on Natural
Language Processing 2011 (IJCNLP 2011),
pages 210–218, Chiang Mai.
Reisinger, Joseph and Raymond J. Mooney.
2010. Multi-prototype vector-space models
of word meaning. In Proceedings of Human
Language Technologies: The 11th Annual
Conference of the North American Chapter of
the Association for Computational Linguistics,
pages 109–117, Los Angeles, CA.
Resnik, Philip and David Yarowsky. 2000.
Distinguishing systems and distinguishing
senses: New evaluation methods for word
sense disambiguation. Natural Language
Engineering, 5(3):113–133.
Rosch, Eleanor. 1975. Cognitive
representations of semantic categories.
Journal of Experimental Psychology: General,
104:192–233.
Rosch, Eleanor and Carolyn B. Mervis. 1975.
Family resemblance: Studies in the internal
structure of categories. Cognitive
Psychology, 7:573–605.
Schütze, Hinrich. 1998. Automatic word
sense discrimination. Computational
Linguistics, 24(1):97–123.
Senseval-2. 2001. Web page:
http://www.sle.sharp.co.uk/senseval2.
Snyder, Benjamin and Martha Palmer.
2004. The English all-words task. In
3rd International Workshop on Semantic
Evaluations (SensEval-3) at ACL-2004,
pages 41–43, Barcelona.
Socher, Richard, Eric H. Huang, Jeffrey Pennington, Andrew Y. Ng, and Christopher D. Manning. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems 24, pages 801–809, Granada.
Stokoe, Christopher. 2005. Differentiating
homonymy and polysemy in information
retrieval. In Proceedings of HLT/EMNLP-05,
pages 403–410, Vancouver.
Taylor, John R. 2003. Linguistic Categorization.
Oxford University Press, New York.
Thater, Stefan, Hagen Fürstenau, and
Manfred Pinkal. 2010. Contextualizing
semantic representations using
syntactically enriched vector models.
In Proceedings of the 48th Annual Meeting
of the Association for Computational
Linguistics, pages 948–957, Uppsala.
Tuggy, David H. 1993. Ambiguity, polysemy
and vagueness. Cognitive Linguistics,
4(2):273–290.
Tversky, Amos. 1977. Features of similarity.
Psychological Review, 84(4):327–352.
Tversky, Amos and Itamar Gati. 1982.
Similarity, separability, and the triangle
inequality. Psychological Review,
89(2):123–154.
Van de Cruys, Tim, Thierry Poibeau, and
Anna Korhonen. 2011. Latent vector
weighting for word meaning in context.
In Proceedings of the 2011 Conference
on Empirical Methods in Natural
Language Processing, pages 1,012–1,022,
Edinburgh.
Washtell, Justin. 2010. Expectation vectors:
A semiotics inspired approach to
geometric lexical-semantic representation.
In Proceedings of the 2010 Workshop on
GEometrical Models of Natural Language
Semantics, pages 45–50, Uppsala.
Williams, John. 1992. Processing polysemous
words in context: Evidence for interrelated
meanings. Journal of Psycholinguistic
Research, 21:193–218.
Zhong, Zhi and Hwee Tou Ng. 2010. It
makes sense: A wide-coverage word
sense disambiguation system for free
text. In Proceedings of the ACL 2010
System Demonstrations, pages 78–83,
Uppsala.
Zhong, Zhi, Hwee Tou Ng, and Yee Seng
Chan. 2008. Word sense disambiguation
using OntoNotes: An empirical study.
In Proceedings of the 2008 Conference
on Empirical Methods in Natural
Language Processing, pages 1,002–1,010,
Honolulu, HI.