Random Walks for Knowledge-Based
Word Sense Disambiguation
∗
Eneko Agirre
IXA NLP group
University of the Basque Country
∗∗
Oier L ´opez de Lacalle
University of Edinburgh
IKERBASQUE
Basque Foundation for Science
†
Aitor Soroa
IXA NLP group
University of the Basque Country
Word Sense Disambiguation (WSD) systems automatically choose the intended meaning of a
word in context. In this article we present a WSD algorithm based on random walks over large
Lexical Knowledge Bases (LKB). We show that our algorithm performs better than other graph-
based methods when run on a graph built from WordNet and eXtended WordNet. Our algorithm
and LKB combination compares favorably to other knowledge-based approaches in the literature
that use similar knowledge on a variety of English data sets and a data set on Spanish. We include
a detailed analysis of the factors that affect the algorithm. The algorithm and the LKBs used are
publicly available, and the results easily reproducible.
1. Einführung
Word Sense Disambiguation (WSD) is a key enabling technology that automatically
chooses the intended sense of a word in context. It has been the focus of intensive
research since the beginning of Natural Language Processing (NLP), and more recently
it has been shown to be useful in several tasks such as parsing (Agirre, Balduin, Und
Martinez 2008; Agirre et al. 2011), maschinelle Übersetzung (Carpuat and Wu 2007; Chan,
Ng, and Chiang 2007), information retrieval (P´erez-Ag ¨uera and Zaragoza 2008; Zhong
and Ng 2012), question answering (Surdeanu, Ciaramita, and Zaragoza 2008), Und
∗ Informatika Fakultatea, Manuel Lardizabal 1, 20018 Donostia, Basque Country. Email: e.agirre@ehu.es.
∗∗ IKERBASQUE, Basque Foundation for Science, 48011, Bilbao, Basque Country.
Email: oier.lopezdelacalle@gmail.com.
† Informatika Fakultatea, Manuel Lardizabal 1, 20018 Donostia, Basque Country. Email: a.soroa@ehu.es.
Einreichung erhalten: 30 Juli 2011; Revised submission received: 15 November 2012; Accepted for publication:
17 Februar 2013.
doi:10.1162/COLI a 00164
© 2014 Verein für Computerlinguistik
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
0
1
5
7
1
8
1
2
6
7
9
/
C
Ö
l
ich
_
A
_
0
0
1
6
4
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Computerlinguistik
Volumen 40, Nummer 1
summarization (Barzilay and Elhadad 1997). WSD is considered to be a key step in
order to approach language understanding beyond keyword matching.
The best performing WSD systems are currently those based on supervised learn-
ing, as attested in public evaluation exercises (Snyder and Palmer 2004; Pradhan et al.
2007), but they need large amounts of hand-tagged data, which is typically very ex-
pensive to produce. Contrary to lexical-sample exercises (where plenty of training and
testing examples for a handful of words are provided), all-words exercises (welche
comprise all words occurring in a running text, and where training data is more scarce)
show that only a few systems beat the most frequent sense (MFS) heuristic, with small
differences. Zum Beispiel, the best system in SensEval-3 scored 65.2 F1, compared to 62.4
(Snyder and Palmer 2004). The best current state-of-the-art WSD system (Zhong and Ng
2010), outperforms the MFS heuristic by 5% Zu 8% in absolute F1 scores on the SensEval
and SemEval fine-grained English all words tasks.
The causes of the small improvement over the MFS heuristic can be found in the
relatively small amount of training data available (sparseness) and the problems that
arise when the supervised systems are applied to different corpora from that used to
train the system (corpus mismatch) (Ng 1997; Escudero, M´arquez, and Rigau 2000).
Note that most of the supervised systems for English are trained over SemCor (Müller
et al. 1993), a half-a-million word subset of the Brown Corpus made available from the
WordNet team, and DSO (Ng and Lee 1996), umfassend 192,800 word occurrences from
the Brown and WSJ corpora corresponding to the 191 most frequent nouns and verbs.
Several researchers have explored solutions to sparseness. Zum Beispiel, Chan and Ng
(2005) present an unsupervised method to obtain training examples from bilingual data,
which was used together with SemCor and DSO to train one of the best performing
supervised systems to date (Zhong and Ng 2010).
In view of the problems of supervised systems, knowledge-based WSD is emerging
as a powerful alternative. Knowledge-based WSD systems exploit the information in
a lexical knowledge base (LKB) to perform WSD. They currently perform below su-
pervised systems on general domain data, but are attaining performance close or above
MFS without access to hand-tagged data (Ponzetto and Navigli 2010). In this sense, Sie
provide a complementary strand of research which could be combined with supervised
Methoden, as shown for instance in Navigli (2008). Zusätzlich, Agirre, L ´opez de Lacalle,
and Soroa (2009) show that knowledge-based WSD systems can outperform supervised
systems in a domain-specific data set, where MFS from general domains also fails. In
dieser Artikel, we will focus our attention on knowledge-based methods.
Early work for knowledge-based WSD was based on measures of similarity
between pairs of concepts. In order to maximize pairwise similarity for a sequence of
n words where each has up to k senses, the algorithms had to consider up to kn sense
Sequenzen. Greedy methods were often used to avoid the combinatorial explosion
(Patwardhan, Banerjee, and Pedersen 2003). As an alternative, graph-based methods
are able to exploit the structural properties of the graph underlying a particular LKB.
These methods are able to consider all possible combinations of occurring senses on a
particular context, and thus offer a way to analyze efficiently the inter-relations among
ihnen, gaining much attention in the NLP community (Mihalcea 2005; Navigli and
Lapata 2007; Sinha and Mihalcea 2007; Agirre and Soroa 2008; Navigli and Lapata 2010).
The nodes in the graph represent the concepts (word senses) in the LKB, and edges
in the graph represent relations between them, such as subclass and part-of. Netzwerk
analysis techniques based on random walks like PageRank (Brin and Page 1998) can
then be used to choose the senses that are most relevant in the graph, and thus output
those senses.
58
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
0
1
5
7
1
8
1
2
6
7
9
/
C
Ö
l
ich
_
A
_
0
0
1
6
4
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Agirre, L ´opez de Lacalle, and Soroa
Random Walks for Knowledge-Based WSD
In order to deal with large knowledge bases containing more than 100,000 con-
cepts (Fellbaum 1998), previous algorithms had to extract subsets of the LKB (Navigli
and Lapata 2007, 2010) or construct ad hoc graphs for each context to be dis-
ambiguated (Mihalcea 2005; Sinha and Mihalcea 2007). An additional reason for the
use of custom-built subsets of ad hoc graphs for each context is that if we were using
a centrality algorithm like PageRank over the whole graph, it would choose the most
important senses in the LKB regardless of context, limiting the applicability of the
Algorithmus. Zum Beispiel, the word coach is ambiguous at least between the “sports
coach” and the “transport service” meanings, as shown in the following examples:
(1) Nadal is sharing a house with his uncle and coach, Toni, and his physical trainer,
Rafael Maymo.
(2) Our fleet comprises coaches from 35 Zu 58 seats.
If we were to run a centrality algorithm over the whole LKB, with no context, Dann
we would always assign coach to the same concept, and we would thus fail to correctly
disambiguate either one of the given examples.
The contributions of this article are the following: (1) A WSD method based on ran-
dom walks over large LKBs. The algorithm outperforms other graph-based algorithms
when using a LKB built from WordNet and eXtended WordNet. The algorithm and
LKB combination compares favorably to the state-of-the-art in knowledge-based WSD
on a wide variety of data sets, including four English and one Spanish data set. (2) A
detailed analysis of the factors that affect the algorithm. (3) The algorithm together with
the corresponding graphs are publicly available1 and can be applied easily to sense
inventories and knowledge bases different from WordNet.
The algorithm for WSD was first presented in Agirre and Soroa (2009). In diesem Artikel,
we present further evaluation on two more recent data sets, analyze the parameters and
options of the system, compare it to the state of the art, and discuss the relation of our
algorithm with PageRank and the MFS heuristic.
2. Related Work
Traditional knowledge-based WSD systems assign a sense to an ambiguous word by
comparing each of its senses with those of the surrounding context. Typically, some se-
mantic similarity metric is used for calculating the relatedness among senses (Lesk 1986;
Patwardhan, Banerjee, and Pedersen 2003). The metric varies between counting word
overlaps between definitions of the words (Lesk 1986) to finding distances between
concepts following the structure of the LKB (Patwardhan, Banerjee, and Pedersen 2003).
Usually the distances are calculated using only hierarchical relations on the LKB (Sussna
1993; Agirre and Rigau 1996). Combining both intuitions, Jiang and Conrath (1997)
present a metric that combines statistics from corpus and a lexical taxonomy structure.
One of the major drawbacks of these approaches stems from the fact that senses are
compared in a pairwise fashion and thus the number of computations grows exponen-
tially with the number of words—that is, for a sequence of n words where each has
up to k senses they need to consider up to kn sense sequences. Although alternatives
like simulated annealing (Cowie, Guthrie, and Guthrie 1992) and conceptual density
1 http://ixa2.si.ehu.es/ukb.
59
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
0
1
5
7
1
8
1
2
6
7
9
/
C
Ö
l
ich
_
A
_
0
0
1
6
4
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Computerlinguistik
Volumen 40, Nummer 1
(Agirre and Rigau 1996) were tried, most of the knowledge-based WSD at the time was
done in a suboptimal word-by-word greedy process, nämlich, disambiguating words
one at a time (Patwardhan, Banerjee, and Pedersen 2003). Trotzdem, some recent work on
finding predominant senses in domains has applied such similarity-based techniques
with success (McCarthy et al. 2007).
Kürzlich, graph-based methods for knowledge-based WSD have gained much at-
tention in the NLP community (Mihalcea 2005; Navigli and Velardi 2005; Navigli and
Lapata 2007; Sinha and Mihalcea 2007; Agirre and Soroa 2008; Navigli and Lapata
2010). These methods use well-known graph-based techniques to find and exploit the
structural properties of the graph underlying a particular LKB. Graph-based techniques
consider all the sense combinations of the words occurring on a particular context at
once, and thus offer a way to analyze the relations among them with respect to the
whole graph. They are particularly suited for disambiguating words in the sequence,
and they manage to exploit the interrelations among the senses in the given context.
In this sense, they provide a principled solution to the exponential explosion problem
mentioned before, with excellent performance.
Graph-based WSD is performed over a graph composed of senses (Knoten) Und
relations between pairs of senses (edges). The relations may be of several types (lexico-
semantic, cooccurrence relations, usw.) and may have some weight attached to them. Alle
the methods reviewed in this section use some version of WordNet as a LKB. Apart
from relations in WordNet, some authors have used semi-automatic and fully auto-
matic methods to enrich WordNet with additional relations. Mihalcea and Moldovan
(2001) disambiguated WordNet glosses in a resource called eXtended WordNet. Der
disambiguated glosses have been shown to improve results of a graph-based system
(Agirre and Soroa 2008), and we have also used them in our experiments. Navigli and
Velardi (2005) enriched WordNet with cooccurrence relations semi-automatically and
showed that those relations are effective in a number of graph-based WSD systems
(Navigli and Velardi 2005; Navigli and Lapata 2007, 2010). More recently, Cuadros and
Rigau (2006, 2007, 2008) learned automatically so-called KnowNets, and showed that
the new provided relations improved WSD performance when plugged into a simple
vector-based WSD system. Endlich, Ponzetto and Navigli (2010) have acquired relations
automatically from Wikipedia, released as WordNet++, and have shown that they are
beneficial in a graph-based WSD algorithm. All of these relations are publicly available
with the exception of Navigli and Velardi (2005), but note that the system is available
on-line.2
Disambiguation is typically performed by applying a ranking algorithm over the
graph, and then assigning the concepts with highest rank to the corresponding words.
Given the computational cost of using large graphs like WordNet, most researchers use
smaller subgraphs built on-line for each target context. The main idea of the subgraph
method is to extract the subgraph whose vertices and relations are particularly relevant
for the set of senses from a given input context. The subgraph is then analyzed and the
most relevant vertices are chosen as the correct senses of the words.
The TextRank algorithm for WSD (Mihalcea 2005) creates a complete weighted
graph (z.B., a graph in which every pair of distinct vertices is connected by a weighted
edge) formed by the synsets of the words in the input context. The weight of the links
joining two synsets is calculated by executing Lesk’s algorithm (Lesk 1986) zwischen
them—that is, by calculating the overlap between the words in the glosses of the
2 http://lcl.uniroma1.it/ssi.
60
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
0
1
5
7
1
8
1
2
6
7
9
/
C
Ö
l
ich
_
A
_
0
0
1
6
4
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Agirre, L ´opez de Lacalle, and Soroa
Random Walks for Knowledge-Based WSD
corresponding senses. Once the complete graph is built, a random walk algorithm
(PageRank) is executed over it and words are assigned to the most relevant synset.
In this sense, PageRank is used as an alternative to simulated annealing to find the
optimal pairwise combinations. This work is extended in Sinha and Mihalcea (2007),
using a collection of semantic similarity measures when assigning a weight to the links
across synsets. They also compare different graph-based centrality algorithms to rank
the vertices of the complete graph. They use different similarity metrics for different
POS types and a voting scheme among the centrality algorithm ranks.
In Navigli and Velardi (2005), the authors develop a knowledge-based WSD method
based on lexical chains called structural semantic interconnections (SSI). Although the
system was first designed to find the meaning of the words in WordNet glosses, Die
authors also apply the method for labeling each word in a text sequence. Given a
text sequence, SSI first identifies monosemous words and assigns the corresponding
synset to them. Dann, it iteratively disambiguates the rest of the terms by selecting
the senses that get the strongest interconnection with the synsets selected so far. Der
interconnection is calculated by searching for paths on the LKB, constrained by some
hand-made rules of possible semantic patterns.
In Navigli and Lapata (2007, 2010), the authors perform a two-stage process for
WSD. Given an input context, the method first explores the whole LKB in order to find
a subgraph that is particularly relevant for the words of the context. The subgraph is
calculated by applying a depth-first search algorithm over the LKB graph for every
word sense occurring in a context. Dann, they study different graph-based centrality
algorithms for deciding the relevance of the nodes on the subgraph. Infolge, jeden
word of the context is attached to the highest ranking concept among its possible senses.
The best results were obtained by a simple algorithm like choosing the concept for each
word with the largest degree (number of edges) and by PageRank (Brin and Page 1998).
We reimplemented their best methods in order to compare our algorithm with theirs on
the same setting (vgl. Abschnitt 6.3). In later work (Ponzetto and Navigli 2010) the authors
apply a subset of their methods to an enriched WordNet with additional relations from
Wikipedia, improving their results for nouns.
Tsatsaronis, Vazirgiannis, and Androutsopoulos (2007) and Agirre and Soroa (2008)
also use such a two-stage process. They build the graph as before, but using breadth-
first search. The first authors apply a spreading activation algorithm over the subgraph
for node ranking, while the second use PageRank. In later work (Tsatsaronis, Varlamis,
and Nørv˚ag 2010) spreading activation is compared with PageRank and other centrality
measures like HITS (Kleinberg 1998), obtaining better results than in their previous
arbeiten.
This work departs from earlier work in its use of the full graph, and its ability to
infuse context information when computing the importance of nodes in the graph. Für
Das, we resort to an extension of the PageRank algorithm (Brin and Page 1998), called
Personalized PageRank (Haveliwala 2002), which tries to bias PageRank using a set of
representative topics and thus capture more accurately the notion of importance with
respect to a particular topic. In unserem Fall, we initialize the random walk with the words
in the context of the target word, and thus we obtain a context-dependent PageRank.
We will show that this method is indeed effective for WSD. Note that in order to use
other centrality algorithms (z.B., HITS [Kleinberg 1998]), previous authors had to build
a subgraph first. In principle, those algorithms could be made context-dependent when
using the full graph and altering their formulae, but we are not aware of such variations.
Random walks over WordNet using Personalized PageRank have been also used
to measure semantic similarity between two words (Hughes and Ramage 2007; Agirre
61
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
0
1
5
7
1
8
1
2
6
7
9
/
C
Ö
l
ich
_
A
_
0
0
1
6
4
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Computerlinguistik
Volumen 40, Nummer 1
et al. 2009). In those papers, the random walks are initialized with a single word,
whereas we use all content words in the context. The results obtained by the authors,
especially in the latter paper, are well above other WordNet-based methods.
Most previous work on knowledge-based WSD has presented results on one or two
general domain corpora for English. We present our results on four general domain
data sets for English and a Spanish data set (M`arquez et al. 2007). Alternativ, manche
researchers have applied knowledge-based WSD to specific domains, using different
methods to adapt the method to the particular test domain. In Agirre, L ´opez de Lacalle,
and Soroa (2009) and Navigli et al. (2011), the authors apply our Personalized PageRank
method to a domain-specific corpus with good results. Ponzetto and Navigli (2010) Auch
apply graph-based algorithms to the same domain-specific corpus.
3. WordNet
Most WSD work uses WordNet as the sense inventory of choice. WordNet (Fellbaum
1998) is a freely available3 lexical database of English, which groups nouns, verbs,
Adjektive, and adverbs into sets of synonyms, each expressing a distinct concept (called
synset in WordNet parlance). Zum Beispiel, coach has five nominal senses and two verbal
senses, which correspond to the following synsets:
In these synsets coach#n1 corresponds to the first nominal sense of coach, coach#v1 corre-
sponds to the first verbal sense, und so weiter. Each of the senses of coach corresponds to a
different synset, and each synset contains several words with different sense numbers.
Zum Beispiel, the first nominal sense of coach has two synonyms: manager in its second
sense and handler in its third sense. As a synset can be identified by any of its words
in a particular sense number, we will use a word and sense number to represent the
full concept. Each synset has a descriptive gloss (z.B., a carriage pulled by four horses with
one driver for coach#n4, or drive a coach for coach#v2). The examples correspond to the
current version of WordNet (3.1), but the sense differences have varied across different
Versionen. There exist automatic mappings across versions (Daude, Padro, and Rigau
2000), but they contain small errors. In this article we will focus on WordNet versions
1.7 Und 2.1, which have been used to tag the evaluation data sets used in this article
(vgl. Abschnitt 6).
The synsets in WordNet are interlinked with conceptual-semantic and lexical rela-
tionen. Examples of conceptual-semantic relations are hypernymy, which corresponds to
the superclass or is-a relation, and holonymy, the part-of relation. Figur 1 shows two
small regions of the graph around three synsets of the word coach, including several
conceptual-semantic relations and lexical relations. Zum Beispiel, the figure shows that
concept trainer#n1 is a coach#n1 (hypernymy relation), and that seat#n1 is a part of
coach#n5 (holonymy relation). The figure only shows a small subset of the relations
3 http://wordnet.princeton.edu.
62
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
0
1
5
7
1
8
1
2
6
7
9
/
C
Ö
l
ich
_
A
_
0
0
1
6
4
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Agirre, L ´opez de Lacalle, and Soroa
Random Walks for Knowledge-Based WSD
Figur 1
Example showing three senses of coach, with links to related concepts.
for three synsets of coach. If we were to show the relations of the rest of the synsets
in WordNet we would end up with a densely connected graph, where one can go from
one synset to another following the semantic relations. In addition to purely conceptual-
semantic relations which hold between synsets, there are also lexical relations which
hold between specific senses. Zum Beispiel, angry#a2 is the antonym of calm#a2 and a
derivation relation exists between handler#n3 and handle#v6, meaning that handler is a
derived form of handle and that the third nominal sense of handler is related to the sixth
verbal sense of handle. Although lexical relations hold only between two senses, we gen-
eralize to the whole synset. This generalization captures the notion that if handler#n3 is
related by derivation to handle#v6, then coach#n1 is also semantically related to handle#v6
(wie in der Abbildung gezeigt 1).
In addition to these relations, we also use the relation between each synset and
the words in the glosses. Most of the words in the glosses have been manually asso-
ciated with their corresponding senses, and we can thus produce a link between the
synset being glossed, and the synsets of each of the words in the gloss. Zum Beispiel,
following one of the given glosses, a gloss relation would be added between coach#v2
and drive#v2. The gloss relations were not available prior to WordNet 3.0, and we
thus used automatically disambiguated glosses for WordNet 1.7 and WordNet 2.1, als
made freely available in the eXtended WordNet (Mihalcea and Moldovan 2001). Notiz
also that the eXtended WordNet provided about 550,000 Beziehungen, whereas the disam-
biguated glosses made available with WordNet 3.0 provide around 339,000 Beziehungen.
We compare the performance of XWN relations and WordNet 3.0 gloss relations in
Abschnitt 6.4.4.
Tisch 1 summarizes the most relevant relations (with less frequent relations
grouped as “other”). The table also lists how we grouped the relations, and the overall
counts. Note that inverse relations are not counted, as their numbers equal those of the
original relation. In Section 6.4.5 we report the impact of the relations in the behavior
of the system. Gesamt, the graph for WordNet 1.7 hat 109, 359 Eckpunkte (concepts) Und
620, 396 edges (relations between concepts). Note that there is some overlap between
XWN and other types of relations. Zum Beispiel, the hypernym of coach#n4 is carriage#n2,
which is also present in its gloss. Note that most of the relation types relate concepts
from the same part of speech, with the exception of derivation and XWN.
63
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
0
1
5
7
1
8
1
2
6
7
9
/
C
Ö
l
ich
_
A
_
0
0
1
6
4
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Computerlinguistik
Volumen 40, Nummer 1
Tisch 1
Relations and their inverses in WordNet 1.7, how we grouped them, and overall counts. XWN
refers to relations from the disambiguated glosses in eXtended WordNet.
relation
inverse
Gruppe
counts
hypernymy
derivation
holonymy
antonymy
andere
xwn
hyponymy
derivation
meronymy
antonymy
andere
xwn
−1
TAX
REL
MER
ANT
REL
XWN
89,078
28,866
21,260
7,558
3,134
551,551
Endlich, we have also used the Spanish WordNet (Atserias, Rigau, and Villarejo
2004). In addition to the native relations, we also added relations from the eXtended
WordNet. All in all, it contains 105, 501 vertices and 623, 316 Beziehungen.
3.1 Representing WordNet as a Graph
An LKB such as WordNet can be seen as a set of concepts and relations among them,
plus a dictionary, which contains the list of words (typically word lemmas) linked to
the corresponding concepts (senses). WordNet can be thus represented as a graph G =
∈ V), and E
(V, E). V is the set of nodes, where each node represents one concept (vi
is the set of edges. Each relation between concepts vi and vj is represented by an edge
∈ E. We ignore the relation type of the edges. If two WordNet relations exist between
NEIN,J
two nodes, we only represent one edge, and ignore the type of the relation. We chose to
use undirected relations between concepts, because most of the relations are symmetric
and have their inverse counterpart (vgl. Abschnitt 3), and in preliminary work we failed to
see any effect using directed relations.
Zusätzlich, we also add vertices for the dictionary words, which are linked to
their corresponding concepts by directed edges (vgl. Figur 1). Note that monosemous
words will be related to just one concept, whereas polysemous words may be attached
to several. Abschnitt 5.2 explains the reason for using directed edges, and also mentions
an alternative to avoid introducing these vertices.
4. PageRank and Personalized PageRank
The PageRank random walk algorithm (Brin and Page 1998) is a method for ranking
the vertices in a graph according to their relative structural importance. The main idea
of PageRank is that whenever a link from vi to vj exists in a graph, a vote from node i
to node j is produced, and hence the rank of node j increases. Zusätzlich, the strength
of the vote from i to j also depends on the rank of node i: The more important node
i is, the more strength its votes will have. Alternativ, PageRank can also be viewed
as the result of a random walk process, where the final rank of node i represents the
probability of a random walk over the graph ending on node i, at a sufficiently large
Zeit.
Let G be a graph with N vertices v1, . . . , vN and di be the outdegree of node i; let M
if a link from i to j exists, Und
be a N × N transition probability matrix, where Mji = 1
di
64
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
0
1
5
7
1
8
1
2
6
7
9
/
C
Ö
l
ich
_
A
_
0
0
1
6
4
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Agirre, L ´opez de Lacalle, and Soroa
Random Walks for Knowledge-Based WSD
zero otherwise. Dann, the calculation of the PageRank Vector P over G is equivalent to
resolving Equation (1).
P = cMP + (1 − c)v
(1)
In the equation, v is a N × 1 stochastic vector and c is the so-called damping
factor, a scalar value between 0 Und 1. The first term of the sum on the equation
models the voting scheme described in the beginning of the section. The second term
represents, loosely speaking, the probability of a surfer randomly jumping to any node
(z.B., without following any paths on the graph). The damping factor, usually set in
Die [0.85..0.95] range, models the way in which these two terms are combined at
each step.
The second term in Equation (1) can also be seen as a smoothing factor that makes
any graph fulfill the property of being aperiodic and irreducible, and thus guarantees
that the PageRank calculation converges to a unique stationary distribution.
In the traditional PageRank formulation the vector v is a stochastic normalized
vector whose element values are all 1
N , thus assigning equal probabilities to all nodes
in the graph in the case of random jumps. Jedoch, as pointed out by Haveliwala
(2002), the vector v can be non-uniform and assign stronger probabilities to certain
kinds of nodes, effectively biasing the resulting PageRank vector to prefer these nodes.
Zum Beispiel, if we concentrate all the probability mass on a unique node i, all random
jumps on the walk will return to i and thus its rank will be high; darüber hinaus, the high rank
of i will make all the nodes in its vicinity also receive a high rank. Daher, the importance
of node i given by the initial distribution of v spreads along the graph on successive
iterations of the algorithm. As a consequence, the P vector can be seen as representing
the relevance of every node in the graph from the perspective of node i.
In diesem Artikel, we will use Static PageRank to refer to the case when a uniform
v vector is used in Equation (1); and whenever a modified v is used, we will call it
Personalized PageRank. The next section shows how we define a modified v.
PageRank is actually calculated by applying an iterative algorithm that computes
Gleichung (1) successively until convergence below a given threshold is achieved, oder
until a fixed number of iterations are executed. Following usual practice, we used a
damping value of 0.85 and finish the calculations after 30 Iterationen (Haveliwala 2002;
Langville and Meyer 2003; Mihalcea 2005). Some preliminary experiments with higher
iteration counts showed that although sometimes the node ranks varied, der Verwandte
order among particular word synsets remained stable after the initial iterations (vgl.
Abschnitt 6.4 for further details). Beachten Sie, dass, in order to discard the effect of dangling
Knoten (d.h., nodes without outlinks) one would need to slightly modify Equation (1)
following Langville and Meyer (2003).4 This modification is not necessary for WordNet,
as it does not have dangling nodes.
5. Random Walks for WSD
We tested two different methods to apply random walks to WSD.
4 The equation becomes P = cMP + (ca + (1 − c)e)v, where ai = 1 if node i is a dangling node,
Und 0 ansonsten, and e is a vector of all ones.
65
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
0
1
5
7
1
8
1
2
6
7
9
/
C
Ö
l
ich
_
A
_
0
0
1
6
4
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Computerlinguistik
Volumen 40, Nummer 1
5.1 Static PageRank, No Context
If we apply traditional PageRank over the whole WordNet, we get a context-
independent ranking of word senses. All concepts in WordNet get ranked according
to their PageRank value. Given a target word, it suffices to check which is the relative
ranking of its senses, and the WSD system would output the one ranking highest. Wir
call this application of PageRank to WSD Static PageRank STATIC for short, as it does
not change with the context, and we use it as a baseline.
As the PageRank measure over undirected graphs for a node is closely related
to the degree of the node, the Static PageRank returns the most predominant sense
according to the number of relations the senses have. We think that this is closely related
to the Most Frequent Sense attested in general corpora, as the lexicon builders would
tend to assign more relations to the most predominant sense. Tatsächlich, our results (vgl.
Abschnitt 6.4.5) show that this is indeed the case for the English WordNet.
5.2 Personalized PageRank, Using Context
Static PageRank is independent of context, but this is not what we want in a WSD
System. Given an input piece of text we want to disambiguate all content words in
the input according to the relationships among them. For this we can use Personalized
PageRank (PPR for short) over the whole WordNet graph.
Given an input text (z.B., a sentence), we extract the list Wi i = 1 . . . m of content
Wörter (d.h., nouns, verbs, Adjektive, and adverbs) that have an entry in the dictionary,
and thus can be related to LKB concepts. As a result of the disambiguation process,
every LKB concept receives a score. Dann, for each target word to be disambiguated, Wir
just choose its associated concept in G with maximum score.
In order to apply Personalized PageRank over the LKB graph, the context words
are first inserted into the graph G as nodes, and linked with directed edges to their
respective concepts. Dann, the Personalized PageRank of the graph G is computed by
concentrating the initial probability mass uniformly over the newly introduced word
Knoten. As the words are linked to the concepts by directed edges, they act as source
nodes injecting mass into the concepts they are associated with, which thus become
relevant nodes, and spread their mass over the LKB graph. daher, das resultierende
Personalized PageRank vector can be seen as a measure of the structural relevance of
LKB concepts in the presence of the input context.
Making the edges from words to concepts directed is important, as the use of
undirected edges will move part of the probability mass in the concepts to the word
Knoten. Note the contrast with the edges representing relations between concepts, welche
are undirected (vgl. Abschnitt 3.1).
Alternativ, we could do without the word nodes, concentrating the initial prob-
ability mass on the senses of the words under consideration. Such an initialization over
the graph with undirected edges between synset nodes is equivalent to initializing
the walk on the words in a graph with undirected edges between synset nodes and
directed nodes from words to synsets. We experimentally checked that the results
of both alternatives are indistinguishable. Although the alternative without nodes is
marginally more efficient, we keep the word nodes as they provide a more intuitive and
appealing formalization.
One problem with Personalized PageRank is that if one of the target words has
two senses that are related by semantic relations, those senses reinforce each other, Und
could thus dampen the effect of the other senses in the context. Although one could
66
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
0
1
5
7
1
8
1
2
6
7
9
/
C
Ö
l
ich
_
A
_
0
0
1
6
4
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Agirre, L ´opez de Lacalle, and Soroa
Random Walks for Knowledge-Based WSD
remove direct edges between competing senses from the graph, it is quite rare that those
senses are directly linked, and usually a path with several edges is involved. With this
observation in mind we devised a variant called word-to-word heuristic (PPRw2w for
short), where we run Personalized PageRank separately for each target word in the
Kontext, das ist, for each target word Wi, we concentrate the initial probability mass
in the senses of the rest of the words in the context of Wi, but not in the senses of
the target word itself, so that context words increase their relative importance in the
graph. The main idea of this approach is to avoid biasing the initial score of concepts
associated with target word Wi, and let the surrounding words decide which concept
associated with Wi has more relevance. Contrary to the previous approach, PPRw2w does
not disambiguate all target words of the context in a single run, which makes it less
efficient (vgl. Abschnitt 6.4).
Figur 2 illustrates the disambiguation of a sample sentence. The STATIC method
(not shown in the figure) would choose the synset coach#n1 for the word coach because
it is related to more concepts than other senses, and because those senses are related
to concepts that have a high degree (zum Beispiel, sport#1). The PPR method (left side
of Figure 2) concentrates the initial mass on the content words in the example. Nach
running the iterative algorithm, the system would return coach#n1 as the result for the
target word coach. Although the words in the sentence clearly indicate that the correct
synset in this sentence corresponds to coach#n5, the fact that teacher#n1 is related to
trainer#n1 in WordNet causes both coach#n2 and coach#n1 to reinforce each other, Und
make their pagerank higher. The right side of Figure 2 depicts the PPRw2w method,
where the word coach is not activated. Daher, there is no reinforcement between coach
senses, and the method would correctly choose coach#n5 as the proper synset.
6. Evaluation
WSD literature has used several measures for evaluation. Precision is the percentage of
correctly disambiguated instances divided by the number of instances disambiguated.
Some systems don’t disambiguate all instances, and thus the precision can be high
even if the system disambiguates a handful of instances. In unserem Fall, when a word has
Figur 2
Portion of WordNet to illustrate the disambiguation of coach in the sentence Our fleet comprises
coaches from 35 Zu 58 seats. Each word in the sentence (shown partially) is linked to all its synsets.
The path between trainer#n1 and teacher#1 is omitted for brevity (siehe Abbildung 1). The left part
shows the PPR method, and the right part shows the PPRw2w method.
67
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
0
1
5
7
1
8
1
2
6
7
9
/
C
Ö
l
ich
_
A
_
0
0
1
6
4
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Computerlinguistik
Volumen 40, Nummer 1
two senses with the same PageRank value, our algorithm does not return anything,
because it abstains from returning a sense in the case of ties. Im Gegensatz, recall measures
the percentage of correctly disambiguated instances divided by the total number of
instances to be disambiguated. This measure penalizes systems that are unable to return
a solution for all instances. Endlich, the harmonic mean between precision and recall
(F1) combines both measures. F1 is our main measure of evaluation, as it provides a
balanced measure between the two extremes. Note that a system that returns a solution
for all instances would have equal precision, recall, and F1 measures.
In our experiments we build a context of at least 20 content words for each sentence
to be disambiguated, taking the sentences immediately before and after it in the case that
the original sentence was too short. The parameters for the PageRank algorithm were
set to 0.85 Und 30 iterations following standard practice (Haveliwala 2002; Langville
and Meyer 2003; Mihalcea 2005). The post hoc impact of those and other parameters
has been studied in Section 6.4.
The general domain data sets used in this work are the SensEval-2 (S2AW) (Snyder
and Palmer 2004), SensEval-3 (S3AW) (Palmer et al. 2001), and SemEval-2007 fine-
grained (S07AW) (Palmer et al. 2001; Snyder and Palmer 2004; Pradhan et al. 2007)
and coarse grained all-words data sets (S07CG) (Navigli, Litkowski, and Hargraves
2007). All data sets have been produced similarly: A few documents were selected for
tagging, at least two annotators tagged nouns, verbs, Adjektive, and adverbs, inter-
tagger agreement was measured, and the discrepancies between taggers were solved.
The first two data sets are labeled with WordNet 1.7 tags, the third uses WordNet
2.1 tags, and the last one uses coarse-grained senses that group WordNet 2.1 senses.
We run our system using WordNet 1.7 relations and senses for the first two data sets,
and WordNet 2.1 for the other two. Abschnitt 6.4.3 explores the use of WordNet 3.0 Und
compares the performance with the use of other versions.
Regarding the coarse senses in S07CG, we used the mapping from WordNet 2.1
senses made available by the authors of the data set. In order to return coarse grained-
senses, we run our algorithm on fine-grained senses, and aggregate the scores for
all senses that map to the same coarse-grained sense. We finally choose the coarse-
grained sense with the highest score.
The data sets used in this article contain polysemous and monosemous words,
as customary; the percentage of monosemous word occurrences in the S2AW, S3AW,
S07AW, and S07CG data sets are 20.7%, 16.9%, 14.4%, Und 29.9%, jeweils.
6.1 Ergebnisse
Tisch 2 shows the results as F1 of our random walk WSD systems over these data
sets. We detail overall results, as well as results per part of speech, and whether there
is any statistical difference with respect to the best result on each column. Statistical
significance is obtained using the paired bootstrap resampling method (Noreen 1989),
P < 0.01.
The table shows that PPRw2w is consistently the best method in three data sets. All
in all the differences are small, and in one data set STATIC obtains the best results. The
differences with respect to the best system overall are always statistically significant.
In fact, it is remarkable that a simple non-contextual measure like STATIC performs so
well, without the need for building subgraphs or any other manipulation. Section 6.4.6
will show that in some circumstances the performance of STATIC is much lower and
analyzes the reasons for this drop. Regarding the use of the word-to-word heuristic, it
consistently provides slightly better results than PPR in all four data sets. An analysis of
68
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
0
1
5
7
1
8
1
2
6
7
9
/
c
o
l
i
_
a
_
0
0
1
6
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Agirre, L ´opez de Lacalle, and Soroa
Random Walks for Knowledge-Based WSD
Table 2
Results on English data sets (F1). Best results in each column in bold.
with respect to the best result in each column.
∗
Statistically significant
S2AW - SensEval-2 All-Words
Method
All
N
V
Adj.
Adv.
PPR
PPRw2w
STATIC
∗
58.7
59.7
∗
58.0
71.8
70.3
66.5
35.0
40.3
40.2
58.9
59.8
59.8
69.8
72.9
72.5
S3AW - SensEval-3 All-Words
Method
All
N
V
Adj.
Adv.
PPR
PPRw2w
STATIC
∗
57.3
57.9
∗
56.5
63.7
65.3
62.5
47.5
47.2
47.1
61.3
63.6
62.8
96.3
96.3
96.3
S07AW - SemEval 2007 All-Words
Method
All
N
V
Adj.
Adv.
PPR
PPRw2w
STATIC
∗
39.7
∗
41.7
43.0
51.6
56.0
56.0
34.6
35.3
37.3
–
–
–
–
–
–
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
0
1
5
7
1
8
1
2
6
7
9
/
c
o
l
i
_
a
_
0
0
1
6
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
S07CG - SemEval 2007 Coarse-grained All-Words
Method
Adv.
Adj.
All
N
V
PPR
PPRw2w
STATIC
∗
78.1
80.1
∗
79.2
78.3
83.6
81.0
73.8
71.1
72.4
84.0
83.1
82.9
78.4
82.3
82.8
the performance according to the POS shows that PPRw2w performs better particularly
on nouns, but there does not seem to be a clear pattern for the rest. In the rest of the
article, we will only show the overall results, omitting those for all POS, in order not to
clutter the result tables.
Our algorithms do not always return an answer, and thus the precision is higher
than the F1 measure. For instance, in S2AW the percentage of instances that get an
answer ranges between 95.4% and 95.6% for PPR, PPRw2w, and STATIC. The precision
for PPRw2w in S2AW is 61.1%, the recall is 58.4%, and F1 is 59.7%. This pattern of slightly
higher values for precisions, lower values for recall, and F1 in between is repeated for
all data sets, POS, and data sets. The percentage of instances that get an answer for the
other data sets is higher, ranging between 98.1% in S3AW and 99.9% in S07CG.
6.2 Comparison to State-of-the-Art Systems
In this section we compare our results with the WSD systems described in Section 2, as
well as the top performing supervised systems at competition time and other unsuper-
vised systems that improved on them. Note that we do not mention all unsupervised
systems participating in the competitions, but we do select the top performing ones. All
results in Table 3 are given as overall F1 for all Parts of Speech, but we also report F1 for
nouns in the case of S07CG, where Ponz10 (Ponzetto and Navigli 2010) reported very
69
Computational Linguistics
Volume 40, Number 1
high results, but only for nouns. Note that the systems reported here and our system
might use different context sizes.
For easier reference, Table 3 uses a shorthand for each system, whereas the text in
this section includes the shorthand and the full reference the first time the shorthand is
used. The shorthand uses the first letters of the first author followed by the year of the
paper, except for systems which participated in SensEval and SemEval, where we use
their acronym. Most systems in the table have been presented in Section 2, with a few
exceptions that will be presented this section.
The results in Table 3 confirm that our system (PPRw2w) performs on the state-of-the-
art of knowledge-based and unsupervised systems, with two exceptions:
(1)
(2)
Nav10 (Navigli and Lapata 2010) obtained better results on S07AW.
We will compare both systems in more detail below, and also include
a reimplementation in the next subsection which shows that, when
using the same LKB, our method obtains better results.
Although not reported in the table, an unsupervised system using
automatically acquired training examples from bilingual data (Chan and
Ng 2005) obtained very good results on S2AW nouns (77.2 F1, compared
with our 70.3 F1 in Table 2). The automatically acquired training examples
are used in addition to hand-annotated data in Zhong10 (Zhong and Ng
2010), also reported in the table (see below).
We report the best unsupervised systems in S07AW and S07CG on the same row.
JU-SKNSB (Naskar and Bandyopadhyay 2007) is a system based on an extended version
Table 3
Comparison with state-of-the-art results (F1). The top rows report knowledge-based and
unsupervised systems, followed by our system (PPRw2w). Below we report systems that use
annotated data to some degree: (1) MFS or counts from hand-annotated corpora, (2) fully
supervised systems, including the best supervised participants in each exercise. Best result
among unsupervised systems in each column is shown in bold. Please see text for references
of each system.
System
S2AW S3AW S07AW S07CG (N)
Mih05
Sinha07
Tsatsa10
Agirre08
Nav10
JU-SKNSB / TKB-UO
Ponz10
54.2
57.6
58.8
52.2
53.6
57.4
56.8
52.9
43.1
40.2
70.2
(70.8)
(79.4)
PPRw2w
59.7
57.9
41.7
80.1
(83.6)
60.1
68.6
68.2
62.3
58.3
60.4
65.2
67.6
51.4
78.9
(77.4)
83.2
(84.1)
59.1
58.3
82.5
82.6
(82.3)
MFS(1)
IRST-DDD-00(1)
Nav05(1) / UOR-SSI(1)
(2)
BESTsup
Zhong10(2)
70
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
0
1
5
7
1
8
1
2
6
7
9
/
c
o
l
i
_
a
_
0
0
1
6
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Agirre, L ´opez de Lacalle, and Soroa
Random Walks for Knowledge-Based WSD
of the Lesk algorithm (Lesk 1986), evaluated on S07AW. TKB-UO (Anaya-S´anchez,
Pons-Porrata, and Berlanga-Llavori 2007), which was evaluated in S07CG, clusters
WordNet senses and uses so-called topic signatures based on WordNet information for
disambiguation. IRST-DDD-00 (Strapparava, Gliozzo, and Giuliano 2004) is a system
based on WordNet domains which leverages on large unannotated corpora. They
obtained excellent results, but their calculation of scores takes into account synset
probabilities from SemCor, and the system can thus be considered to use some degree
of supervision. We consider that systems which make use of information derived from
hand-annotated corpora need to be singled out as having some degree of supervision.
This includes systems using the MFS heuristic, as it is derived from hand-annotated
corpora. In the case of the English WordNet, the use of the first sense also falls in this
category, as the order of senses in WordNet is based on sense counts in hand-annotated
corpora. Note that for wordnets in other languages, hand-annotated corpus is scarce,
and thus our main results do not use this information. Section 6.4.7 analyzes the results
of our system when combined with this information.
Among supervised systems, the best supervised systems at competition time are
reported in a single row (Mihalcea 2002; Decadt et al. 2004; Chan, Ng, and Zhong
2007; Tratz et al. 2007). We also report Zhong10 (Zhong and Ng 2010), which is a freely
available supervised system giving some of the strongest results in WSD.
We will now discuss in detail the systems that are most similar to our own. We first
review the WordNet versions and relations used by each system. Mih05 (Mihalcea 2005)
and Sinha07 (Sinha and Mihalcea 2007) apply several similarity methods, which use
WordNet information from versions 1.7.1 and 2.0, respectively, including all relations
and the text in the glosses.5 Tsatsa10 (Tsatsaronis, Varlamis, and Nørv˚ag 2010) uses
WordNet 2.0. Agirre08 (Agirre and Soroa 2008) experimented with several LKBs formed
by combining relations from different sources and versions, including WordNet 1.7 and
eXtended WordNet. Nav05 and Nav10 (Navigli and Velardi 2005; Navigli and Lapata
2010) use WordNet 2.0, enriched with manually added co-occurrence relations which
are not publicly available.
We can see in Table 3 that the combination of Personalized PageRank and LKB
presented in this article outperforms both Mih05 and Sinha07. In order to factor
out the difference in the WordNet version, we performed experiments using WN2.1
and eXtended WordNet, yielding 58.7 and 56.5 F1 for S2AW and S3AW, respectively.
Although a head-to-head comparison is not possible, the systems use similar informa-
tion: Although they use glosses, our algorithm cannot directly use the glosses, and thus
we use disambiguated glosses as delivered in eXtended WordNet. All in all the results
suggest that analyzing the LKB structure as a graph is preferable to computing pairwise
similarity measures over synsets to build a custom graph and then applying graph
measures. The results of various in-house experiments replicating Mih05 also confirmed
this observation. Note also that our methods are simpler than the combination strategy
used in Sinha07.
Nav05 (Navigli and Velardi 2005) uses a knowledge-based WSD method based on
lexical chains called structural semantic interconnections (SSI). The SSI method was
evaluated on the SensEval-3 data set, as shown in row Nav05 in Table 3. Note that
the method labels an instance with the MFS of the word if the algorithm produces
no output for that instance, which makes comparison to our system unfair, especially
given the fact that the MFS performs better than SSI. In fact, it is not possible to separate
5 Personal communication.
71
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
0
1
5
7
1
8
1
2
6
7
9
/
c
o
l
i
_
a
_
0
0
1
6
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Computational Linguistics
Volume 40, Number 1
the effect of SSI from that of the MFS, and we thus report it as using some degree of
supervision in the table. A variant of the algorithm called UOR-SSI (Navigli, Litkowski,
and Hargraves 2007) (reported in the same row) used a manually added set of 70,000
relations and obtained the best results in S07CG out-of-competition,6 even better than
the best supervised method. Reimplementing SSI is not trivial, so we did not check the
performance of a variant of SSI that does not use MFS and that uses the same LKB as
our method. Section 6.4.7 analyzes the results of our system when combined with MFS
information.
Agirre08 (Agirre and Soroa 2008) uses breadth-first search to extract subgraphs of
the WordNet graph for each context to be disambiguated, and then applies PageRank.
Our better results seem to indicate that using the full graph instead of those subgraphs
would perform better. In order to check whether the better results are due to differences
in the information used, the next subsection presents the results of our reimplementa-
tion of the systems using the same information as our full-graph algorithm.
Tsatsa10 (Tsatsaronis, Vazirgiannis, and Androutsopoulos 2007; Tsatsaronis,
Varlamis, and Nørv˚ag 2010) also builds the graph using breadth-first search, but
weighting each type of edge differently, and using graph-based measures that take into
account those weights. This is in contrast to the experiments performed in this article
where edges have no weight, and is an interesting avenue for future work.
Nav10 (Navigli and Lapata 2010) first builds a subgraph of WordNet composed of
paths between synsets using depth-first search and then applies a set of graph centrality
algorithms. The best results are obtained using the degree of the nodes, and they present
two variants, depending on how they treat ties: Either they return a sense at random, or
they return the most frequent sense. For fair comparison to our system (which does not
use MFS as a back-off), Table 3 reports the former variant as Nav10. This system is better
than ours in one data set and worse in another. They use 60,000 relations that are not
publicly available, but they do not use eXtended WordNet relations. In order to check
whether the difference in performance is due to the relations used or the algorithm,
the next subsection presents a reimplementation of their best graph-based algorithms
using the same LKB as we do. In earlier work (Navigli and Lapata 2007) they test a
similar system on S3AW, but report results only for nouns, verbs, and adjectives (F1 of
61.9, 36.1, and 62.8, respectively), all of which are below the results of our system (cf.
Table 2).
In Ponz10 (Ponzetto and Navigli 2010) the authors apply the same techniques as in
Nav10 to a new resource called WordNet++. They report results for nouns using degree
on subgraphs for the S07CG data set, as shown in Table 3. Their F1 on nouns is 79.4,
lower than our results using our LKB.
6.3 Comparison with Related Algorithms
The previous section shows that our algorithm when applied to a LKB built from
WordNet and eXtended WordNet outperforms other knowledge-based systems in all
cases but one system in one data set. In this section we factor out algorithm and LKB,
and present the results of other graph-based methods for WSD using the same WordNet
versions and relations as in the previous section. As we mentioned in Section 2, ours
is the only method using the full WordNet graph. Navigli and Lapata (2010) and
Ponzetto and Navigli (2010) build a custom graph based on the relations in WordNet
6 The task was co-organized by the authors.
72
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
0
1
5
7
1
8
1
2
6
7
9
/
c
o
l
i
_
a
_
0
0
1
6
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Agirre, L ´opez de Lacalle, and Soroa
Random Walks for Knowledge-Based WSD
as follows: For each sense si of each word in the context, a depth-first search (DFS for
short) is conducted through the WordNet graph starting in si until another sense sj of a
word in the context is found or maximum distance is reached. The maximum distance
was set by the authors to 6. All nodes and edges between si and sj, inclusive, are added to
the subgraph. Graph-based measures are then used to select the output senses for each
target word, with degree and PageRank yielding the best results. In closely related work,
Agirre and Soroa (2008) and Tsatsaronis, Varlamis, and Nørv˚ag (2010) use breadth-first
search (BFS) over the whole graph, and keep all paths connecting senses. Note that
unlike the dfs approach, bfs does not require any threshold. The subgraphs obtained
by each of these methods are slightly different.
We reimplemented both strategies, namely, DFS with threshold 6 and BFS with no
threshold. Table 4 shows the overall results of degree and PageRank for both kinds of
subgraphs. DFS yields slightly better results than BFS but PPRw2w is best in all four data
sets, with statistical significance.
In addition, we run PPR and PPRw2w on DFS and BFS subgraphs, and obtained better
results than degree and PageRank in all data sets. DFS with PPR and DFS with PPRw2w
are best in S3AW and S07AW, respectively, although the differences with PPRw2w are not
statistically significant. PPRw2w on the full graph is best in two data sets, with statistical
significance.
From these results we can conclude that PPR and PPRw2w yield the best results
also for subgraphs. Regarding the use of the full graph with respect to DFS or BFS,
the performances for PPRw2w are very similar, but using the full graph gives a small
advantage. Section 6.4.5 provides an analysis of efficiency.
6.4 Analysis of Performance Factors
The behavior of the WSD system is influenced by a set of parameters that can yield
different results. In our main experiments we did not perform any parameter tuning;
we just used some default values which were found to be useful according to previous
work. In this section we perform a post hoc analysis of several parameters on the general
performance of the system, reporting F1 on a single data set, S2AW.
Table 4
Results for subgraph methods compared with our method (F1). In the Reference column we
∗
mention the reference system that we reimplemented. Best results in each column in bold.
Statistically significant with respect to the best result in each column. 0 No significant difference.
Reference
S2AW S3AW S07AW S07CG
Nav10, Ponz10
DFSdegree
BFSdegree
DFSPageRank Nav10
BFSPageRank Agirre08
DFSPPR
BFSPPR
DFSPPRw2w
BFSPPRw2w
PPRw2w
∗
58.4
∗
57.9
∗
58.2
∗
57.7
∗
59.3
58.80
58.70
∗
58.1
59.7
∗
56.4
∗
56.5
∗
56.4
∗
56.7
58.2
57.50
58.00
57.90
57.90
∗
40.3
∗
39.9
∗
39.9
∗
39.7
41.40
41.20
∗
41.2
41.9
41.70
∗
79.4
∗
79.2
∗
79.6
∗
79.4
∗
78.1
∗
78.8
∗
79.7
∗
79.5
80.1
73
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
0
1
5
7
1
8
1
2
6
7
9
/
c
o
l
i
_
a
_
0
0
1
6
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Computational Linguistics
Volume 40, Number 1
60
59
(cid:2)
(cid:2)
(cid:2)
(cid:2)
(cid:2)
(cid:2)
58
57
56
55
0
5
10
15
20
25
30
Iterations
Figure 3
Convergence according to number of PageRank iterations (F1 on S2AW).
6.4.1 PageRank Parameters. The PageRank algorithm has two main parameters, the
so-called damping factor and the number of iterations (or, conversely, the convergence
threshold), which we set as 0.85 and 30, respectively (cf. Section 4). Figure 3 depicts
the effect of varying the number of iterations. It shows that the algorithm converges
very quickly: One sole iteration yields relatively high performance, and 20 iterations
are enough to achieve convergence. Note also that the performance is in the [58.0, 58.5]
range for iterations over 5. Note that we use the same range of F1 for the y axis of Figures
3, 4, and 5 for easier comparison.
Figure 4 shows the effect of varying the damping factor. Note that a damping
factor of zero means that the PageRank value coincides with the initial probability
distribution. Given the way we initialize the distribution (c.f. Section 5.2), it would mean
that the algorithm is not able to disambiguate the target words. Thus, the initial value
on Figure 4 corresponds to a damping factor of 0.001. On the other hand, a damping
factor of 1 yields to the same results as the STATIC method (c.f. Section 5.1). The best
value is attained with 0.90, with similar values around it (less than 0.5 absolute points in
(cid:2)(cid:2)
(cid:2)
(cid:2)
(cid:2)
(cid:2)
60
59
58
57
56
55
(cid:2)
(cid:2)
(cid:2)
(cid:2)
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Damping factor
Figure 4
Varying the damping factor (F1 on S2AW).
74
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
0
1
5
7
1
8
1
2
6
7
9
/
c
o
l
i
_
a
_
0
0
1
6
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Agirre, L ´opez de Lacalle, and Soroa
Random Walks for Knowledge-Based WSD
(cid:2)
(cid:2)
(cid:2) (cid:2) (cid:2)
(cid:2)
(cid:2)
(cid:2)
60
59
58
57
56
55
5
10 15 20 25 30 35 40 45 50
Context size
Figure 5
Varying the context size (F1 on S2AW).
variation), in agreement with previous results which preferred values in the 0.85...0.95
interval (Haveliwala 2002; Langville and Meyer 2003; Mihalcea 2005).
6.4.2 Size of Context Window. Figure 5 shows the performance of the system when trying
different context windows for the target words. The best context size is for windows
of 20 content words, with less than 0.5 absolute point losses for windows in the [5, 25]
range.
6.4.3 Using Different WordNet Versions. There has been little research on the best strat-
egy to use when dealing with data sets and resources attached to different WordNet
versions. Table 5 shows the results for the four data sets used in this study when using
different WordNet versions. Two of the data sets (S2AW and S3AW) were tagged with
senses from version 1.7, S07AW with senses from version 2.1, and S07CG with coarse
senses built on version 2.1 senses.
Given the fact that WordNet 3.0 is a more recent version that includes more rela-
tions, one would hope that using it would provide the best results (Cuadros and Rigau
2008; Navigli and Lapata 2010). We built a graph analogous to the ones for versions
1.7 and 2.1, but using the hand-disambiguated glosses instead of eXtended WordNet
glosses. We used freely available mappings (Daude, Padro, and Rigau 2000)7 to convert
our eXtended WordNet relations to 3.0, and then the WordNet 3.0 sense results to the
corresponding version. In addition, we also tested WN1.7 on S07AW and S07CG, and
WN2.1 on S2AW and S3AW, also using the mappings from Daude, Padro, and Rigau
(2000).
Table 5 shows that the best results are obtained using our algorithm on the same
WordNet version as used in the respective data set. When testing on data sets tagged
with WordNet 1.7, similar results are obtained using 2.1 or 3.0. When testing on data sets
based on 2.1, 3.0 has a small lead over 1.7. In any case, the differences are small ranging
from 1.4 absolute points to 0.5 points. All in all, it seems that the changes introduced
7 http://nlp.lsi.upc.edu/tools/download-map.php.
75
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
0
1
5
7
1
8
1
2
6
7
9
/
c
o
l
i
_
a
_
0
0
1
6
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Computational Linguistics
Volume 40, Number 1
Table 5
Comparing WordNet versions. Best result in each row in bold.
Data set
version
1.7 + xwn
2.1 + xwn
3.0 + xwn
1.7
S2AW
S3AW
1.7
S07AW 2.1
S07CG
2.1 coarse
59.7
57.9
40.7
79.6
58.7
56.5
41.7
80.1
58.4
56.8
40.9
79.6
by different versions slightly deteriorate the results, and the best strategy is to use the
same WordNet version as was used for tagging.
6.4.4 Using xwn vs. WN3.0 Gloss Relations. WordNet 3.0 was released with an accom-
panying data set comprising glosses where some of the words had been manually
disambiguated. In Table 6 we present the results of using these glosses with the WN3.0
graph, showing that the results are lower than using XWN relations. We also checked
the use of WN3.0 gloss relation with other WordNet versions, and the results using
XWN were always slightly better. We hypothesize that the better results for XWN are
due to the amount of relations, with XWN holding 61% more relations than WN3.0
glosses. Still, the best relations are obtained with the combination of both kinds of gloss
relations.
6.4.5 Analysis of Relations. Previous results were obtained using all the relations
of WordNet and taking eXtended WordNet relations into account. In this section
we analyze the effect of the relation types on the whole process, following the
relation groups presented in Table 1. Table 7 shows the results when using different
combinations over relation types. The eXtended WordNet XWN relations appear the
most valuable when performing random walk WSD, as their performance is as good
as when using the whole graph, and they produce a large drop when ablated from the
graph. Ignoring antonymy relations produces a small improvement, but the differences
between using all the relations, eliminating antonyms, and using XWN relations only
are too small to draw any further conclusions. It seems that given the XWN relations
(the most numerous, cf. Section 3.1), our algorithm is fairly robust to the addition or
deletion of other kinds of relations (less numerous).
6.4.6 Behavior with Respect to STATIC and MFS. The high results of the very simple STATIC
method (PageRank with no context) seems to imply that there is no need to use context
for disambiguation. Our intuition was that the synsets which correspond to the most
Table 6
Comparing XWN and WN3.0 gloss relations, separately and combined. Best result in each row in
bold.
Data set
3.0 + XWN
3.0 + gloss
3.0 + XWN + gloss
58.4
56.8
40.9
79.6
58.1
51.7
38.8
78.9
58.8
56.1
42.2
80.2
S2AW
S3AW
S07AW
S07CG
76
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
0
1
5
7
1
8
1
2
6
7
9
/
c
o
l
i
_
a
_
0
0
1
6
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Agirre, L ´opez de Lacalle, and Soroa
Random Walks for Knowledge-Based WSD
Table 7
Analysis of relation types. The first column shows the performance using just that relation type.
The second shows the combination of TAX and each type. The last column shows all relations
except the corresponding type.
relation
single
+ TAX
ablation
TAX
ANT
MER
REL
XWN
37.4
19.1
23.4
35.4
59.9
–
42.1
36.4
46.1
59.8
Reference system (all relations)
59.9
59.9
59.6
59.6
47.1
59.7
frequent senses would get more relations. We thus computed the correlation between
systems, gold tags, and MFS. In order to make the correlation results comparable to
the figures used on evaluation, we use the number of times both sets of results agree,
divided by the number of results returned by the first system. Table 8 shows such a
matrix of pairwise correlations. If we take the row of gold tags, the results reflect the
performance of each system (the precision). In the case of MFS, the column shows that
STATIC has a slightly larger correlation with the MFS than the other two methods. The
matrix also shows that all our three methods agree more than 80% of the time, with PPR
and STATIC having a relatively smaller agreement.
In contrast, related work using the same techniques over domain-specific words
(Agirre, L ´opez de Lacalle, and Soroa 2009) shows that the results of our Personalized
PageRank models departs significantly from MFS and STATIC. Table 9 shows the results
of the three techniques on the three subcorpora that constitute the evaluation data set
published in Koeling, McCarthy, and Carroll (2005). This data set consists of examples
retrieved from the Sports and Finance sections of the Reuters corpus, and also from the
balanced British National Corpus (BNC), which is used as a general domain contrast
corpus.
Applying PageRank over the entire WordNet graph yields low results, very similar
to those of MFS, and below those of Personalized PageRank. This confirms that STATIC
PageRank is closely related to MFS, as we hypothesized in Section 5.1 and showed in
Table 8 for the other general domain data sets. Whereas the results of PPRw2w are very
similar in the general-domain BNC, PPRw2w departs from STATIC and MFS with 30 and
20 points of difference in the domain-specific Sports and Finance corpora. These results
are highly relevant, because they show that PPR is able to effectively use contextual
information, and depart from the MFS and STATIC baselines.
Table 8
Correlation between systems, gold tags, and MFS.
Gold MFS
100.0
60.1
61.3
100.0
57.4
58.4
56.7
79.8
79.0
81.4
PPR
58.6
79.8
100.0
86.8
82.8
PPRw2w
59.7
79.0
86.8
100.0
86.4
STATIC
57.8
81.3
82.8
86.4
100.0
Gold
MFS
PPR
PPRw2w
STATIC
77
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
0
1
5
7
1
8
1
2
6
7
9
/
c
o
l
i
_
a
_
0
0
1
6
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Computational Linguistics
Volume 40, Number 1
Table 9
Results on three subcorpora as reported in Agirre, L ´opez de Lacalle, and Soroa (2009),
where Sports and Finance are domain-specific. Best results on each column in bold.
System BNC Sports
Finance
MFS
STATIC
PPRw2w
34.9
36.6
37.7
19.6
20.1
51.5
37.1
39.6
59.3
Table 10
Combination with MFS (F1). The first two rows correspond to our system with and without
information from MFS. Below that we report systems that also use MFS. Best results in each
column in bold.
System
S2AW S3AW S07AW S07CG (N)
PPRw2w
PPRw2w MFS
MFS
IRST-DDD-00
Nav05 / UOR-SSI
Ponz10
59.7
62.6
60.1
57.9
63.0
62.3
58.3
60.4
41.7
48.6
51.4
80.1
81.4
(83.6)
(82.1)
78.9
(77.4)
83.2
81.7
(84.1)
(85.5)
6.4.7 Combination with MFS. As mentioned in Section 6.2, we have avoided using any
information regarding sense frequencies from annotated corpora, as this information
is not always available for all wordnets. In this section we report the results of our
algorithm when taking into account prior probabilities of senses taken from sense
counts. We used the sense counts provided with WordNet in the index.sense file.8 In
this setting, the edges linking words and their respective senses are weighted according
to the prior probabilities of those senses, instead of uniform weights as in Section 5.2.
Table 10 shows that results when using priors from MFS improve over the results
of the original PPRw2w in all data sets. The improvement varies across parts of speech,
and, for instance, the results for nouns in S07CG are worse (shown in rightmost column
of Table 10). In addition, the results for PPRw2w when using MFS information improve
over MFS in all cases except for S07AW.
The table also reports the best systems that do use MFS (see Section 6.3 for detailed
explanations). For S2AW and S07AW we do not have references to related systems.
For S3AW we can see that our system performs best. In the case of S07CG, UOR-SSI
reports better results than our system. Finally, the final row reports their system when
combined with MFS information as back-off (Ponzetto and Navigli 2010), which also
attains better results than our system. We tried to use a combination method similar to
theirs, but did not manage to improve results.
6.4.8 Efficiency of Full Graphs vs. Subgraphs. Given the very close results of our algorithm
when using full graphs and subgraphs (cf. Section 6.3), we studied the efficiency of each.
We benchmarked several graph-based methods on the S2AW data set, which comprises
8 http://wordnet.princeton.edu/wordnet/man/senseidx.5WN.html.
78
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
0
1
5
7
1
8
1
2
6
7
9
/
c
o
l
i
_
a
_
0
0
1
6
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Agirre, L ´opez de Lacalle, and Soroa
Random Walks for Knowledge-Based WSD
2,473 instances to be disambiguated. All tests were done on a multicore computer with
16 GB of memory using a single 2.66 GHz processor. When using the full graph PPR
disambiguates full sentences in one go at 684 instances per minute, whereas PPRw2w
disambiguates one word at a time, 70 instances per minute. The DFS subgraphs provide
better performance than PPRw2w, 228 instances per minute when using degree, with
marginally slower performance when using PPRw2w (210 instances per minute). The
BFS subgraph is slowest, with around 20 instances per minute. The memory footprint of
using the full graph algorithm is small, just 270 MB, so several processes can be run on
a multiprocessor machine easily.
All in all, there is a tradeoff in performance and speed, with PPRw2w on the full
graph providing better results at the cost of some speed, and PPR on the full graph
providing the best speed at the cost of worse performance. Using DFS with PPRw2w
lays in between and is also a good alternative, and its speed can be improved using
pre-indexed paths.
6.5 Experiments on Spanish
Our WSD algorithm can be applied over non-English texts, provided that a LKB for this
particular language exists. We have applied our random walk algorithms to the Spanish
WordNet (Atserias, Rigau, and Villarejo 2004), using the SemEval-2007 Task 09 data set
as evaluation gold standard (M`arquez et al. 2007). The data set contains examples of the
150 most frequent nouns in the CESS-ECE corpus, manually annotated with Spanish
WordNet synsets. It is split into a train and test part, and has an “all words” shape
(i.e., input consists of sentences, each one having at least one occurrence of a target
noun). We ran the experiment over the test part (792 instances), and used the train part
for calculating the MFS heuristic. The results in Table 11 are consistent with those for
English, with our algorithms approaching MFS performance, and PPRw2w yielding the
best results. Note that for this data set the supervised algorithm could barely improve
over the MFS, which performs very well, suggesting that in this particular data set the
sense distributions are highly skewed.
Finally, we also show results for the first sense in the Spanish WordNet. In the
Spanish WordNet the order of the senses of a word has been assigned directly by
the lexicographer (Atserias, Rigau, and Villarejo 2004), as there is no information
of sense frequency from hand-annotated corpora. This is in contrast to the English
WordNet, where the senses are ordered according to their frequency in annotated
Table 11
Results as F1 on the Spanish SemEval07 data set, including first sense, MFS, and the best
Statistically significant difference with respect to
supervised system in the competition.
the best of our results (in bold).
∗
Method
PPR
PPRw2w
STATIC
First sense
MFS
BEST
Acc.
∗
78.4
79.3
∗
76.5
∗
66.4
∗
84.6
∗
85.1
79
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
0
1
5
7
1
8
1
2
6
7
9
/
c
o
l
i
_
a
_
0
0
1
6
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Computational Linguistics
Volume 40, Number 1
corpora (Fellbaum 1998), and reflects the status on most other wordnets. In this case,
our algorithm clearly improves over the first sense in the dictionary.
7. Conclusions
In this article we present a method for knowledge-based Word Sense Disambiguation
based on random walks over relations in a LKB. Our algorithm uses the full graph
of WordNet efficiently, and performs better than PageRank or degree on subgraphs
(Navigli and Lapata 2007; Agirre and Soroa 2008; Navigli and Lapata 2010; Ponzetto
and Navigli 2010). We also show that our combination of method and LKB built from
WordNet and eXtended WordNet compares favorably to other knowledge-based
systems using similar information sources (Mihalcea 2005; Sinha and Mihalcea
2007; Tsatsaronis, Vazirgiannis, and Androutsopoulos 2007; Tsatsaronis, Varlamis, and
Nørv˚ag 2010). Our analysis shows that Personalized PageRank yields similar results
when using subgraphs and the full graph, with a trade-off between speed and perfor-
mance, where Personalized PageRank over the full graph is fastest, its word-to-word
variant slowest, and Personalized PageRank over the subgraph lies in between.
We also show that the algorithm can be easily ported to other languages with good
results, with the only requirement of having a wordnet. Our results improve over the
first sense of the Spanish dictionary. This is particularly relevant for wordnets other
than English. For the English WordNet the senses of a word are ordered according to the
frequency of the senses in hand-annotated corpora, and thus the first sense is equivalent
to the Most Frequent Sense, but this information is not always available for languages
that lack large-scale hand-annotated corpora.
We have performed an extensive analysis, showing the behavior according to the
parameters of PageRank, and studying the impact of different relations and WordNet
versions. We have also analyzed the relation between our PPR algorithm, MFS, and
STATIC PageRank. In general domain corpora they get similar results, close to the
performance of the MFS learned from SemCor, but the results reported on domain-
specific data sets (Agirre, L ´opez de Lacalle, and Soroa 2009) show that PPR is able to
move away from the MFS and STATIC and improve over them, indicating that PPR is able
to effectively use contextual information, and depart from MFS and STATIC PageRank.
The experiments in this study are readily reproducible, as the algorithm and the
LKBs are publicly available.9 The system can be applied easily to sense inventories and
knowledge bases different from WordNet.
In the future we would like to explore methods to incorporate global weights of the
edges in the random walk calculations (Tsatsaronis, Varlamis, and Nørv˚ag 2010). Given
the complementarity of the WordNet++ resource (Ponzetto and Navigli 2010) and our
algorithm, it would be very interesting to explore the combination of both, as well as
the contribution of other WordNet related resources (Cuadros and Rigau 2008).
Acknowledgments
We are deeply indebted to the reviewers,
who greatly helped to improve the
article. Our thanks to Rob Koeling and
Diana McCarthy for kindly providing
the data set, thesauri, and assistance,
and to Roberto Navigli and Simone
Ponzetto for clarifying the method to
map to coarse-grained senses. This
work has been partially funded by the
Education Ministry (project KNOW2
TIN2009-15049-C03-01).
9 http://ixa2.si.ehu.es/ukb.
80
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
0
1
5
7
1
8
1
2
6
7
9
/
c
o
l
i
_
a
_
0
0
1
6
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Agirre, L ´opez de Lacalle, and Soroa
Random Walks for Knowledge-Based WSD
References
Agirre, E., E. Alfonseca, K. Hall, J. Kravalova,
M. Pasca, and A. Soroa. 2009. A study
on similarity and relatedness using
distributional and WordNet-based
approaches. In Proceedings of the North
American Chapter of the Association for
Computational Linguistics - Human Language
Technologies conference (NAACL/HLT’09),
pages 19–27, Boulder, CO.
Agirre, E., T. Baldwin, and D. Martinez.
2008. Improving parsing and PP
attachment performance with sense
information. In Proceedings of the
46th Annual Meeting of the Association for
Computational Linguistics (ACL/HLT’08),
pages 317–325, Columbus, OH.
Agirre, E., K. Bengoetxea, K. Gojenola,
and J. Nivre. 2011. Improving
dependency parsing with semantic
classes. In Proceedings of the 49th Annual
Meeting of the Association for Computational
Linguistics: Human Language Technologies
(ACL/HLT’11), pages 699–703,
Portland, OR.
Agirre, E., O. L ´opez de Lacalle, and A. Soroa.
2009. Knowledge-based WSD on specific
domains: Performing better than generic
supervised WSD. In Proceedings of the
19th International Joint Conference on
Artificial Intelligence (IJCAI’09),
pages 1,501–1,506, Pasadena, CA.
Agirre, E. and G. Rigau. 1996. Word sense
disambiguation using conceptual density.
In Proceedings of the 16th International
Conference on Computational Linguistics
(COLING’96), pages 16–22, Copenhagen.
Agirre, E. and A. Soroa. 2008. Using the
multilingual central repository for
graph-based word sense disambiguation.
In Proceedings of the 6th Conference on
Language Resources and Evaluation
(LREC ’08), pages 1,388–1,392, Marrakesh.
Agirre, E. and A. Soroa. 2009. Personalizing
PageRank for word sense disambiguation.
In Proceedings of the 12th Conference of the
European Chapter of the Association for
Computational Linguistics (EACL’09),
pages 33–41, Athens.
Anaya-S´anchez, H., A. Pons-Porrata, and
R. Berlanga-Llavori. 2007. TKB-UO: Using
sense clustering for WSD. In Proceedings of
the 4th International Workshop on Semantic
Evaluations (SemEval-2007) in conjunction
with ACL, pages 322–325, Prague.
Atserias, J., G. Rigau, and L. Villarejo. 2004.
Spanish wordnet 1.6: Porting the Spanish
wordnet across Princeton versions.
In Proceedings of the 4th Conference on
Language Resources and Evaluation
(LREC’04), pages 161–164, Lisbon.
Barzilay, R. and M. Elhadad. 1997. Using
lexical chains for text summarization.
In Proceedings of the ACL Workshop on
Intelligent Scalable Text Summarization,
pages 10–17, New York, NY.
Brin, S. and L. Page. 1998. The anatomy
of a large-scale hypertextual Web search
engine. Computer Networks and ISDN
Systems, 30(1-7):107–117.
Carpuat, M. and D. Wu. 2007. Improving
statistical machine translation using word
sense disambiguation. In Proceedings of
the Joint Conference on Empirical Methods
in Natural Language Processing and
Computational Natural Language Learning
(EMNLP/CoNLL’07), pages 61–72, Prague.
Chan, Y. S. and H. T. Ng. 2005. Scaling
up word sense disambiguation via
parallel texts. In Proceedings of the
20th National Conference on Artificial
Intelligence (AAAI’05), pages 1,037–1,042,
Pittsburgh, PA.
Chan, Y. S., H. T. Ng, and D. Chiang.
2007. Word sense disambiguation
improves statistical machine translation.
In Proceedings of the 45th Annual Meeting of
the Association for Computational Linguistics
(ACL’07), pages 33–40, Prague.
Chan, Y. S., H. T. Ng, and Z. Zhong. 2007.
NUS-PT: Exploiting parallel texts for
word sense disambiguation in the English
all-words tasks. In Proceedings of the
4th International Workshop on Semantic
Evaluations (SemEval-2007) in conjunction
with ACL, pages 253–256, Prague.
Cowie, J., J. Guthrie, and L. Guthrie. 1992.
Lexical disambiguation using simulated
annealing. In Proceedings of the Workshop
on Speech and Natural Language (HLT’91),
pages 238–242, Morristown, NJ.
Cuadros, M. and G. Rigau. 2006. Quality
assessment of large scale knowledge
resources. In Proceedings of Joint SIGDAT
Conference on Empirical Methods in Natural
Language Processing (EMNLP’06),
pages 534–541, Sydney.
Cuadros, M. and G. Rigau. 2007.
Semeval-2007 task 16: Evaluation of
wide coverage knowledge resources.
In Proceedings of the 4th International
Workshop on Semantic Evaluations
(SemEval-2007) in conjunction with ACL,
pages 81–86, Prague.
Cuadros, M. and G. Rigau. 2008. Knownet:
Using topic signatures acquired from the
Web for building automatically highly
dense knowledge bases. In Proceedings
81
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
0
1
5
7
1
8
1
2
6
7
9
/
c
o
l
i
_
a
_
0
0
1
6
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Computational Linguistics
Volume 40, Number 1
of the 22nd International Conference on
Computational Linguistics (COLING’08),
pages 71–84, Manchester.
Daude, J., L. Padro, and G. Rigau. 2000.
Mapping WordNets using structural
information. In Proceedings of the 38th
Annual Meeting of the Association for
Computational Linguistics (ACL’00),
pages 504–511, Hong Kong.
Decadt, B., V. Hoste, W. Daelemans, and
A. Van Den Bosch. 2004. GAMBL, genetic
algorithm optimization of memory-based
WSD. In Proceedings of SENSEVAL-3 Third
International Workshop on Evaluation of
Systems for the Semantic Analysis of Text,
pages 108–112, Barcelona.
Escudero, G., L. M´arquez, and G. Rigau.
2000. An empirical study of the domain
dependence of supervised word sense
disambiguation systems. Proceedings of
the Joint SIGDAT Conference on Empirical
Methods in Natural Language Processing
and Very Large Corpora (EMNLP/VLC’00),
pages 172–180, Hong Kong.
Fellbaum, C., editor. 1998. WordNet: An
Electronic Lexical Database and Some of Its
Applications. MIT Press, Cambridge, MA.
Haveliwala, T. H. 2002. Topic-sensitive
PageRank. In Proceedings of the
11th International Conference on World
Wide Web (WWW’02), pages 517–526,
New York, NY.
Hughes, T. and D. Ramage. 2007. Lexical
semantic relatedness with random graph
walks. In Proceedings of the Joint Conference
on Empirical Methods in Natural Language
Processing and Computational Natural
Language Learning (EMNLP/CoNLL’07),
pages 581–589, Prague.
Jiang, J. J. and D. W. Conrath. 1997. Semantic
similarity based on corpus statistics and
lexical taxonomy. In Proceedings of the
International Conference on Research in
Computational Linguistics, pages 19–33,
Taiwan.
Kleinberg., J. M. 1998. Authoritative
sources in a hyperlinked environment.
In Proceedings of the Ninth Annual
ACM-SIAM Symposium on Discrete
Algorithms (SODA’98), pages 668–677,
Philadelphia, PA.
Koeling, R., D. McCarthy, and J. Carroll.
2005. Domain-specific sense distributions
and predominant sense acquisition.
In Proceedings of the Human Language
Technology Conference and Conference on
Empirical Methods in Natural Language
Processing (HLT/EMNLP’05),
pages 419–426, Ann Arbor, MI.
82
Langville, A. N. and C. D. Meyer. 2003.
Deeper inside PageRank. Internet
Mathematics, 1(3):335–380.
Lesk, M. 1986. Automatic sense
disambiguation using machine readable
dictionaries: How to tell a pine cone from
an ice cream cone. In Proceedings of the
5th Annual International Conference on
Systems Documentation (SIGDOC’86),
pages 24–26, New York, NY.
M`arquez, L., M. A. Villarejo, T. Mart´ı, and
M. Taul´e. 2007. SemEval-2007 Task 09:
Multilevel semantic annotation of
Catalan and Spanish. In Proceedings of the
4th International Workshop on Semantic
Evaluations (SemEval-2007) in conjunction
with ACL, pages 42–47, Prague.
McCarthy, D., R. Koeling, J. Weeds,
and J. Carroll. 2007. Unsupervised
acquisition of predominant word senses.
Computational Linguistics, 33(4):553–590.
Mihalcea, R. 2002. Word sense
disambiguation with pattern learning
and automatic feature selection. Natural
Language Engineering, 8:343–358.
Mihalcea, R. 2005. Unsupervised
Large-vocabulary word sense
disambiguation with graph-based
algorithms for sequence data labeling.
In Proceedings of the Conference on
Human Language Technology and
Empirical Methods in Natural Language
Processing (HLT’05), pages 411–418,
Morristown, NJ.
Mihalcea, R. and D. I. Moldovan. 2001.
eXtended WordNet: Progress report.
In Proceedings of the NAACL Workshop
on WordNet and Other Lexical Resources,
pages 95–100, Pittsburgh, PA.
Miller, G. A., C. Leacock, R. Tengi, and R.
Bunker. 1993. A semantic concordance.
In Proceedings of the Workshop on
Human Language Technology (HLT’93),
pages 303–308, Plainsboro, NJ.
Naskar, S. K. and S. Bandyopadhyay.
2007. JU-SKNSB: Extended WordNet
based WSD on the English all-words
task at SemEval-1. In Proceedings of the
4th International Workshop on Semantic
Evaluations (SemEval-2007) in conjunction
with ACL, pages 203–206, Prague.
Navigli, R. 2008. A structural approach
to the automatic adjudication of word
sense disagreements. Natural Language
Engineering, 14(4):547–573.
Navigli, R. and M. Lapata. 2007. Graph
connectivity measures for unsupervised
word sense disambiguation. In Proceedings
of the 17th International Joint Conference
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
0
1
5
7
1
8
1
2
6
7
9
/
c
o
l
i
_
a
_
0
0
1
6
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Agirre, L ´opez de Lacalle, and Soroa
Random Walks for Knowledge-Based WSD
on Artificial Intelligence (IJCAI’07),
pages 1,683–1,688, Hyderabad.
Navigli, R. and M. Lapata. 2010.
An experimental study of graph
connectivity for unsupervised word
sense disambiguation. IEEE Transactions
on Pattern Analysis and Machine
Intelligence, 32(4):678–692.
Navigli, R., K. C. Litkowski, and
O. Hargraves. 2007. SemEval-2007 Task 07:
Coarse-grained English all-words task.
In Proceedings of the 4th International
Workshop on Semantic Evaluations
(SemEval-2007) in conjunction with ACL,
pages 30–35, Prague.
Navigli, R. and P. Velardi. 2005. Structural
semantic interconnections: A knowledge-
based approach to word sense
disambiguation. IEEE Transactions on
Pattern Analysis and Machine Intelligence,
27(7):1,075–1,086.
Navigli, Roberto, Stefano Faralli, Aitor Soroa,
Oier L ´opez de Lacalle, and Eneko Agirre.
2011. Two birds with one stone: Learning
semantic models for text categorization
and word sense disambiguation.
In Proceedings of CIKM, pages 2,317–2,320,
Glasgow.
Ng, H. T. and H. B. Lee. 1996. Integrating
multiple knowledge sources to
disambiguate word sense: An exemplar-
based approach. In Proceedings of the
34th Annual Meeting of the Association for
Computational Linguistics, ACL ’96,
pages 40–47, Stroudsburg, PA.
Ng, T. H. 1997. Getting serious about
word sense disambiguation.
In Proceedings of the ACL SIGLEX
Workshop on Tagging Text with Lexical
Semantics: Why, What, and How?,
pages 1–7, Washington, DC.
Noreen, E. W. 1989. Computer-Intensive
Methods for Testing Hypotheses.
John Wiley & Sons.
Palmer, M., C. Fellbaum, S. Cotton, L. Delfs,
and H. T. Dang. 2001. English tasks:
All-words and verb lexical sample.
In Proceedings of SENSEVAL-2: Second
International Workshop on Evaluating
Word Sense Disambiguation Systems,
pages 21–24, Toulouse.
P´erez-Ag ¨uera, J. R. and H. Zaragoza. 2008.
UCM-Y!R at CLEF 2008 Robust and WSD
tasks. In Proceedings of the 9th Cross
Language Evaluation Forum Workshop
(CLEF’08), pages 138–145, Aarhus.
Ponzetto, S. P. and R. Navigli. 2010.
Knowledge-rich word sense
disambiguation rivaling supervised
system. In Proceedings of the 48th Annual
Meeting of the Association for Computational
Linguistics (ACL’10), pages 1,522–1,531,
Uppsala.
Pradhan, S., E. Loper, D. Dligach, and
M. Palmer. 2007. SemEval-2007 Task-17:
English lexical sample SRL and all words.
In Proceedings of the 4th International
Workshop on Semantic Evaluations
(SemEval-2007) in conjunction with ACL,
pages 87–92, Prague.
Sinha, R. and R. Mihalcea. 2007.
Unsupervised graph-based word sense
disambiguation using measures of word
semantic similarity. In Proceedings of the
IEEE International Conference on Semantic
Computing (ICSC 2007), pages 363–369,
Irvine, CA.
Snyder, B. and M. Palmer. 2004. The
English all-words task. In Proceedings of
SENSEVAL-3 Third International Workshop
on Evaluation of Systems for the Semantic
Analysis of Text, pages 41–43, Barcelona.
Strapparava, C., A. Gliozzo, and C. Giuliano.
2004. Pattern abstraction and term
similarity for word sense disambiguation:
IRST at SENSEVAL-3. In Proceedings of
SENSEVAL-3 Third International Workshop
on Evaluation of Systems for the Semantic
Analysis of Text, pages 229–234, Barcelona.
Surdeanu, M., M. Ciaramita, and
H. Zaragoza. 2008. Learning to rank
answers on large online QA collections.
In Proceedings of the 46th Annual Meeting
of the Association for Computational
Linguistics (ACL/HLT’08), pages 719–727,
Columbus, OH.
Sussna, M. 1993. Word sense disambiguation
for free-text indexing using a massive
semantic network. In Proceedings of the
Second International Conference on
Information and Knowledge Management
(CIKM’93), pages 67–74, New York, NY.
Patwardhan, S., S. Banerjee, and
Tratz, S., A. Sanfilippo, M. Gregory,
T. Pedersen. 2003. Using measures of
semantic relatedness for word sense
disambiguation. In Proceedings of the
Fourth International Conference on
Intelligent Text Processing and
Computational Linguistics (CICLING-03),
pages 241–257, Mexico City.
A. Chappell, C. Posse, and P. Whitney.
2007. PNNL: A supervised maximum
entropy approach to word sense
disambiguation. In Proceedings of the
4th International Workshop on Semantic
Evaluations (SemEval-2007) in conjunction
with ACL, pages 264–267, Prague.
83
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
0
1
5
7
1
8
1
2
6
7
9
/
c
o
l
i
_
a
_
0
0
1
6
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Computational Linguistics
Volume 40, Number 1
Tsatsaronis, G., I. Varlamis, and K. Nørv˚ag.
2010. An experimental study on
unsupervised graph-based word
sense disambiguation. In Proceedings
of the 11th International Conference on
Computational Linguistics and Intelligent
Text Processing (CICLing’10),
pages 184–198, Iasi.
Tsatsaronis, G., M. Vazirgiannis, and
I. Androutsopoulos. 2007. Word sense
disambiguation with spreading activation
networks generated from thesauri.
In Proceedings of the 17th International
Joint Conference on Artificial Intelligence
(IJCAI’07), pages 1,725–1,730, Hyderabad.
Zhong, Z. and H. T. Ng. 2010. It makes
sense: A wide-coverage word sense
disambiguation system for free text.
In Proceedings of the ACL 2010 System
Demonstrations, pages 78–83, Uppsala.
Zhong, Z. and H. T. Ng. 2012. Word sense
disambiguation improves information
retrieval. In Proceedings of the 50th Annual
Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers),
pages 273–282, Jeju Island.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
0
1
5
7
1
8
1
2
6
7
9
/
c
o
l
i
_
a
_
0
0
1
6
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
84
PDF Herunterladen