Computing Lexical Contrast
Saif M. Mohammad∗
National Research Council Canada

Bonnie J. Dorr∗∗
University of Maryland

Graeme Hirst†
University of Toronto

Peter D. Turney‡
National Research Council Canada
Knowing the degree of semantic contrast between words has widespread application in natural
language processing, including machine translation, information retrieval, and dialogue systems. Manually created lexicons focus on opposites, such as hot and cold. Opposites are of many kinds, such as antipodals, complementaries, and gradable opposites. Existing lexicons often do not classify opposites into the different kinds, however. They also do not explicitly list word pairs that are not opposites and yet have some degree of contrast in meaning, such as warm and cold
or tropical and freezing. We propose an automatic method to identify contrasting word pairs
that is based on the hypothesis that if a pair of words, A and B, are contrasting, then there is
a pair of opposites, C and D, such that A and C are strongly related and B and D are strongly
related. (For example, there exists the pair of opposites hot and cold such that tropical is related
to hot, and freezing is related to cold.) We will call this the contrast hypothesis.
We begin with a large crowdsourcing experiment to determine the amount of human
agreement on the concept of oppositeness and its different kinds. In the process, we flesh out
key features of different kinds of opposites. We then present an automatic and empirical measure
of lexical contrast that relies on the contrast hypothesis, corpus statistics, and the structure of
a Roget-like thesaurus. We show how, using four different data sets, we evaluated our approach
on two different tasks, solving “most contrasting word” questions and distinguishing synonyms
from opposites. The results are analyzed across four parts of speech and across five different kinds
of opposites. We show that the proposed measure of lexical contrast obtains high precision and
large coverage, outperforming existing methods.
∗ National Research Council Canada. E-mail: saif.mohammad@nrc-cnrc.gc.ca.
∗∗ Department of Computer Science and Institute of Advanced Computer Studies, University of Maryland.
E-mail: bonnie@umiacs.umd.edu.
† Department of Computer Science, University of Toronto. E-mail: gh@cs.toronto.edu.
‡ National Research Council Canada. E-mail: peter.turney@nrc-cnrc.gc.ca.
Submission received: 14 January 2010; revised submission received: 26 June 2012; accepted for publication:
16 July 2012.
doi:10.1162/COLI_a_00143
© 2013 Association for Computational Linguistics
1. Introduction
Native speakers of a language intuitively recognize different degrees of lexical contrast—
for example, most people will agree that hot and cold have a higher degree of contrast
than cold and lukewarm, and cold and lukewarm have a higher degree of contrast than
penguin and clown. Automatically determining the degree of contrast between words
has many uses, including:
• Detecting and generating paraphrases (Marton, El Kholy, and Habash 2011) (The dementors caught Sirius Black / Black could not escape the dementors).

• Detecting certain types of contradictions (de Marneffe, Rafferty, and Manning 2008; Voorhees 2008) (Kyoto has a predominantly wet climate / It is mostly dry in Kyoto). This is in turn useful in effectively reranking target language hypotheses in machine translation, and for reranking query responses in information retrieval.

• Understanding discourse structure and improving dialogue systems. Opposites often indicate the discourse relation of contrast (Marcu and Echihabi 2002).

• Detecting humor (Mihalcea and Strapparava 2005). Satire and jokes tend to have contradictions and oxymorons.

• Distinguishing near-synonyms from word pairs that are semantically contrasting in automatically created distributional thesauri. Measures of distributional similarity typically fail to do so.
Detecting lexical contrast is not sufficient by itself to solve most of these problems, but
it is a crucial component.
Lexicons of pairs of words that native speakers consider opposites have been
created for certain languages, but their coverage is limited. Opposites are of many kinds,
such as antipodals, complementaries, and gradable opposites (summarized in Section 3). Existing
lexicons often do not classify opposites into the different kinds, however. Further, the
terminology is inconsistent across different sources. For example, Cruse (1986) defines
antonyms as gradable adjectives that are opposite in meaning, whereas the WordNet
antonymy link connects some verb pairs, noun pairs, and adverb pairs too. In this
article, we will follow Cruse’s terminology, and we will refer to word pairs connected
by WordNet’s antonymy link as opposites, unless referring specifically to gradable
adjectival pairs.
Manually created lexicons also do not explicitly list word pairs that are not
opposites and yet have some degree of contrast in meaning, such as warm and cold or
tropical and cold. Further, contrasting word pairs far outnumber those that are commonly
considered opposites. In our own experiments described later in this article, we find that
more than 90% of the contrasting pairs in GRE “most contrasting word” questions are
not listed as antonyms in WordNet. We should not infer from this that WordNet or any
other lexicographic resource is a poor source for detecting opposites, but rather that
identifying the large number of contrasting word pairs requires further computation,
possibly relying on other semantic relations stored in the lexicographic resource.
Even though a number of computational approaches have been proposed for se-
mantic closeness (Curran 2004; Budanitsky and Hirst 2006), and some for hypernymy–
hyponymy (Hearst 1992), measures of lexical contrast have been less successful. To some
extent, this is because lexical contrast is not as well understood as other classical lexical–
semantic relations.
Over the years, many definitions of semantic contrast and opposites have been
proposed by linguists (Lehrer and Lehrer 1982; Cruse 1986), cognitive scientists (Kagan
1984), psycholinguists (Deese 1965), and lexicographers (Egan 1984), which differ from
each other in various respects. Cruse (1986, page 197) observes that even though people
have a robust intuition of opposites, “the overall class is not a well-defined one.” He
points out that a defining feature of opposites is that they tend to have many common
properties, but differ saliently along one dimension of meaning. We will refer to this
semantic dimension as the dimension of opposition. For example, giant and dwarf are
both living beings; they both eat, they both walk, they are both capable of thinking, and
so on. They are most saliently different, however, along the dimension of height. Cruse
also points out that sometimes it is difficult to identify or articulate the dimension of
opposition (for example, city–farm).
Another way to define opposites is that they are word pairs with a “binary incom-
patible relation” (Kempson 1977, page 84). That is to say that one member entails the
absence of the other, and given one member, the identity of the other member is obvious.
Thus, night and day are good examples of opposites because night is best paraphrased
by not day, rather than the negation of any other term. On the other hand, blue and yellow
make poor opposites because even though they are incompatible, they do not have an
obvious binary relation such that blue is understood to be a negation of yellow. It should
be noted that there is a relation between binary incompatibility and difference along just
one dimension of meaning.
For this article, we define opposites to be term pairs that clearly satisfy either
the property of binary incompatibility or the property of salient difference across a
dimension of meaning. Word pairs may satisfy the two properties to different degrees,
however. We will refer to all word pairs that satisfy either of the two properties to some
degree as contrasting. For example, daylight and darkness are very different along the
dimension of light, and they satisfy the binary incompatibility property to some degree,
but not as strongly as day and night. Thus we will consider both daylight and darkness as
well as day and night as semantically contrasting pairs (the former pair less so than the
latter), but only day and night as opposites. Even though there are subtle differences in
the meanings of the terms contrasting, opposite, and antonym, they have often been used
interchangeably in the literature, dictionaries, and common parlance. Thus, we list here
what we use these terms to mean in this article:
• Opposites are word pairs that have a strong binary incompatibility relation with each other and/or are saliently different across a dimension of meaning.

• Contrasting word pairs are word pairs that have some non-zero degree of binary incompatibility and/or have some non-zero difference across a dimension of meaning. Thus, all opposites are contrasting, but not all contrasting pairs are opposites.

• Antonyms are opposites that are also gradable adjectives.1
1 We follow Cruse’s (1986) definition for antonyms. The WordNet antonymy link, however, also connects
some verb pairs, noun pairs, and adverb pairs.
In this article, we present an automatic method to identify contrasting word pairs
that is based on the following hypothesis:
Contrast Hypothesis: If a pair of words, A and B, are contrasting, then there is a pair of
opposites, C and D, such that A and C are strongly related and B and D are strongly
related.
For example, there exists the pair of opposites night and day such that darkness is related
to night, and daylight is related to day. We then determine the degree of contrast between
two words using this hypothesis:
Degree of Contrast Hypothesis: If a pair of words, A and B, are contrasting, then their
degree of contrast is proportional to their tendency to co-occur in a large corpus.
For example, consider the contrasting word pairs top–low and top–down; because top and
down occur together much more often than top and low, our method concludes that the
pair top–down has a higher degree of lexical contrast than the pair top–low. The degree
of contrast hypothesis is inspired by the idea that opposites tend to co-occur more often
than chance (Charles and Miller 1989; Fellbaum 1995). Murphy and Andrew (1993)
claim that this is because together opposites convey contrast well, which is rhetorically
useful. Thus we hypothesize that the higher the degree of contrast between two words,
the higher the tendency of people to use them together.
Because opposites are a key component of our method, we begin by first under-
standing different kinds of opposites (Section 3). Then we describe a crowdsourced
project on the annotation of opposites into different kinds (Section 4). In Section 5.1,
we examine whether opposites and other highly contrasting word pairs occur together
in text more often than randomly chosen word pairs. This experiment is crucial to
the degree of contrast hypothesis because if our assumption is true, then we should
find that highly contrasting pairs are used together much more often than randomly
chosen word pairs. Section 5.2 examines this question. Section 6 presents our method
to automatically compute the degree of contrast between word pairs by relying on
the contrast hypothesis, the degree of contrast hypothesis, seed opposites, and the
structure of a Roget-like thesaurus. (This method was first described in Mohammad,
Dorr, and Hirst [2008].) Finally we present experiments that evaluate various aspects of
the automatic method (Section 7). Following is a summary of the key research questions
addressed by this article:
(1) On the kinds of opposites:
Research questions: How good are humans at identifying different kinds
of opposites? Can certain term pairs belong to more than one kind of
opposite?
Experiment: In Sections 3 and 4, we describe how we designed a ques-
tionnaire to acquire annotations about opposites. Because the annotations
are done by crowdsourcing, and there is no control over the educational
background of the annotators, we devote extra effort to make sure that the
questions are phrased in a simple, yet clear, manner. We deploy a quality
control method that uses a word-choice question to automatically identify
and discard dubious and outlier annotations.
Findings: We find that humans agree markedly in identifying opposites;
there is significant variation in the agreement for different kinds of
opposites, however. We find that a large number of opposing word pairs
have properties pertaining to more than one kind of opposite.
(2) On the manifestation of opposites and other highly contrasting pairs in text:
Research questions: How often do highly contrasting word pairs co-occur
in text? How strong is this tendency compared with random word pairs,
and compared with near-synonym word pairs?
Experiment: Section 5 describes how we compiled sets of highly contrast-
ing word pairs (including opposites), near-synonym pairs, and random
word pairs, and determined the tendency for pairs in each set to co-occur in
a corpus.
Findings: Highly contrasting word pairs co-occur significantly more often
than both the random word pairs set and also the near-synonyms set.
We also find that the average distributional similarity of highly contrasting
word pairs is higher than that of synonymous words. The standard de-
viations of the distributions for the high-contrast set and the synonyms
set are large, however, and so the tendency to co-occur is not sufficient to
distinguish highly contrasting word pairs from near-synonymous pairs.
(3) On an automatic method for computing lexical contrast:
Research questions: How can the contrast hypothesis and the degree
of contrast hypothesis be used to develop an automatic method for
identifying contrasting word pairs? How can we automatically generate
the list of opposites, which are needed as input for a method relying on
the contrast hypothesis?
Proposed Method: Section 6 describes an empirical method for deter-
mining the degree of contrast between two words by using the contrast
hypothesis, the degree of contrast hypothesis, the structure of a thesaurus,
and seed opposite pairs. The use of affixes to generate seed opposite pairs
is also described. (This method was first proposed in Mohammad, Dorr,
and Hirst [2008].)
(4) On the evaluation of automatic methods of contrast:
Research questions: How accurate are automatic methods at identifying
whether one word pair has a higher degree of contrast than another? What
is the accuracy of this method in detecting opposites (a notable subset of
the contrasting pairs)? How does this accuracy vary for different kinds of
opposites?2 How easy is it for automatic methods to distinguish between
opposites and synonyms? How does the proposed method perform when
compared with other automatic methods?
Experiments: We conduct three experiments (described in Sections 7.1,
7.2, and 7.3) involving three different data sets and two tasks to answer
2 Note that though linguists have classified opposites into different kinds, we know of no work doing so
for contrasts more generally. Thus this particular analysis must be restricted to opposites alone.
these questions. We compare performance of our method with methods
proposed by Lin et al. (2003) and Turney (2008). We automatically
generate a new set of 1,296 “most contrasting word” questions to evaluate
performance of our method on five different kinds of opposites and across
four parts of speech. (The evaluation described in Section 7.1 was first
described in Mohammad, Dorr, and Hirst [2008].)
Findings: We find that the proposed measure of lexical contrast obtains
high precision and large coverage, outperforming existing methods.
Our method performs best on gradable pairs, antipodal pairs, and com-
plementary pairs, but poorly on disjoint opposite pairs. Among different
parts of speech, the method performs best on noun pairs, and relatively
worse on verb pairs.
All of the data created and compiled as part of this research are summarized in Table 18
(Section 8), and are available for download.3
2. Related Work
Charles and Miller (1989) proposed that opposites occur together in a sentence more
often than chance. This is known as the co-occurrence hypothesis. Paradis, Willners,
and Jones (2009) describe further experiments to show how canonical opposites tend
to have high textual co-occurrence. Justeson and Katz (1991) gave evidence in support
of the hypothesis using 35 prototypical opposites (from an original set of 39 opposites
compiled by Deese [1965]) and also with an additional 22 frequent opposites. They also
showed that opposites tend to occur in parallel syntactic constructions. All of these
pairs were adjectives. Fellbaum (1995) conducted similar experiments on 47 noun, verb,
adjective, and adverb pairs (noun–noun, noun–verb, noun–adjective, verb–adverb, etc.)
pertaining to 18 concepts (for example, lose(v)–gain(n) and loss(n)–gain(n), where lose(v)
and loss(n) pertain to the concept of “failing to have/maintain”). Non-opposite semanti-
cally related words also tend to occur together more often than chance, however. Thus,
separating opposites from these other classes has proven to be difficult.
Some automatic methods of lexical contrast rely on lexical patterns in text. For
example, Lin et al. (2003) used patterns such as “from X to Y ” and “either X or Y ”
to separate opposites from distributionally similar pairs. They evaluated their method
on 80 pairs of opposites and 80 pairs of synonyms taken from the Webster’s Collegiate
Thesaurus (Kay 1988). The evaluation set of 160 word pairs was chosen such that it
included only high-frequency terms. This was necessary to increase the probability
of finding sentences in a corpus where the target pair occurred in one of the chosen
patterns. Lobanova, van der Kleij, and Spenader (2010) used a set of Dutch adjective
seed pairs to learn lexical patterns commonly containing opposites. The patterns were
in turn used to create a larger list of Dutch opposites. The method was evaluated by
comparing entries to Dutch lexical resources and by asking human judges to determine
whether an automatically found pair is indeed an opposite. Turney (2008) proposed a
supervised method for identifying synonyms, opposites, hypernyms, and other lexical-
semantic relations between word pairs. The approach learns patterns corresponding to
different relations.
3 http://www.purl.org/net/saif.mohammad/research.
Harabagiu, Hickl, and Lacatusu (2006) detected contrasting word pairs for the
purpose of identifying contradictions by using WordNet chains—synsets connected by
the hypernymy–hyponymy links and exactly one antonymy link. Lucerto, Pinto, and
Jiménez-Salazar (2002) proposed detecting contrasting word pairs using the number
of tokens between two words in text and also cue words such as but, from, and and.
Unfortunately, they evaluated their method on only 18 word pairs. Neither Harabagiu,
Hickl, and Lacatusu nor Lucerto, Pinto, and Jiménez-Salazar determined the degree of
contrast between words, and their methods have not been shown to have substantial
coverage.
Schwab, Lafourcade, and Prince (2002) created an oppositeness vector for a target
word. The closer this vector is to the context vector of the other target word, the
more opposite the two target words are. The oppositeness vectors were created by first
manually identifying possible opposites and then generating suitable vectors for each
using dictionary definitions. The approach was evaluated on only a handful of word
pairs.
There is a large amount of work on sentiment analysis and opinion mining aimed
at determining the polarity of words (Pang and Lee 2008). For example, Pang, Lee, and
Vaithyanathan (2002) detected that adjectives such as dazzling, brilliant, and gripping cast
their qualifying nouns positively whereas adjectives such as bad, cliched, and boring
portray the noun negatively. Many of these gradable adjectives have opposites, but
these approaches, with the exception of that of Hatzivassiloglou and McKeown (1997),
did not attempt to determine pairs of positive and negative polarity words that are
opposites. Hatzivassiloglou and McKeown proposed a supervised algorithm that uses
word usage patterns to generate a graph with adjectives as nodes. An edge between
two nodes indicates that the two adjectives have either the same or opposite polarity. A
clustering algorithm then partitions the graph into two subgraphs such that the nodes in
a subgraph have the same polarity. They used this method to create a lexicon of positive
and negative words, and argued that the method could also be used to detect opposites.
3. The Heterogeneous Nature of Opposites
Opposites, unlike synonyms, can be of different kinds. Many different classifications
have been proposed, one of which is given by Cruse (1986) (Chapters 9, 10, and 11).
It consists of complementaries (open–shut, dead–alive), antonyms (long–short, slow–
fast) (further classified into polar, overlapping, and equipollent opposites), directional
opposites (up–down, north–south) (further classified into antipodals, counterparts, and
reversives), relational opposites (husband–wife, predator–prey), indirect converses (give–
receive, buy–pay), congruence variants (huge–little, doctor–patient), and pseudo opposites
(black–white).
Various lexical relations have also received attention at the Educational Testing
Services, as analogies and “most contrasting word” questions are part of the tests they
conduct. They classify opposites into contradictories (alive–dead, masculine–feminine),
contraries (old–young, happy–sad), reverses (attack–defend, buy–sell), directionals (front–
back, left–right), incompatibles (happy–morbid, frank–hypocritical), asymmetric contraries
(hot–cool, dry–moist), pseudo-opposites (popular–shy, right–bad), and defectives (default–
payment, limp–walk) (Bejar, Chaffin, and Embretson 1991).
Keeping in mind the meanings and subtle distinctions between each of these kinds
of opposites is not easy even if we provide extensive training to annotators. Because
we crowdsource the annotations, and we know that Turkers prefer to spend their
time doing the task (and making money) rather than reading lengthy descriptions, we
focused only on five kinds of opposites that we believed would be easiest to annotate,
and which still captured a majority of the opposites:
• Antipodals (top–bottom, start–finish): Antipodals are opposites in which "one term represents an extreme in one direction along some salient axis, while the other term denotes the corresponding extreme in the other direction" (Cruse 1986, page 225).

• Complementaries (open–shut, dead–alive): The essential characteristic of a pair of complementaries is that "between them they exhaustively divide the conceptual domain into two mutually exclusive compartments, so that what does not fall into one of the compartments must necessarily fall into the other" (Cruse 1986, page 198).

• Disjoint (hot–cold, like–dislike): Disjoint opposites are word pairs that occupy non-overlapping regions in the semantic dimension such that there are regions not covered by either term. This set of opposites includes equipollent adjective pairs (for example, hot–cold) and stative verb pairs (for example, like–dislike). We refer the reader to Sections 9.4 and 9.7 of Cruse (1986) for details about these sub-kinds of opposites.

• Gradable opposites (long–short, slow–fast): Gradable opposites are adjective-pair or adverb-pair opposites that are gradable, that is, "members of the pair denote degrees of some variable property such as length, speed, weight, accuracy, etc." (Cruse 1986, page 204).

• Reversibles (rise–fall, enter–exit): Reversibles are opposite verb pairs such that "if one member denotes a change from A to B, its reversive partner denotes a change from B to A" (Cruse 1986, page 226).
It should be noted that there is no agreed-upon number of kinds of opposites. Different
researchers have proposed various classifications that overlap to a greater or lesser
degree. It is possible that for a certain application or study one may be interested in
a kind of opposite not listed here.
4. Crowdsourcing
We used the Amazon Mechanical Turk (AMT) service to obtain annotations for different
kinds of opposites. We broke the task into small independently solvable units called
HITs (Human Intelligence Tasks) and uploaded them on the AMT Web site.4 Each HIT
had a set of questions, all of which were to be answered by the same person (a Turker, in
AMT parlance). We created HITs for word pairs, taken from WordNet, that we expected
to have some degree of contrast in meaning.
In WordNet, words that are close in meaning are grouped together in a set called a
synset. If one of the words in a synset is an opposite of another word in a different synset,
then the two synsets are called head synsets and WordNet records the two words as
direct antonyms (Gross, Fischer, and Miller 1989)—WordNet regards the terms opposite
and antonym as synonyms. Other word pairs across the two head synsets are called
indirect antonyms. Because we follow Cruse’s definition of antonyms, which requires
antonyms to be gradable adjectives, and because WordNet's direct antonyms include noun, verb, and adverb pairs too, for the rest of the article we will refer to WordNet direct antonyms as direct opposites and WordNet indirect antonyms as indirect opposites. We will refer to the union of the direct and indirect opposites simply as WordNet opposites. Note that the WordNet opposites are highly contrasting term pairs.

We chose as target pairs all the direct or indirect opposites from WordNet that were also listed in the Macquarie Thesaurus. This condition was a mechanism to ignore less-frequent and obscure words, and to focus our resources on words that are more common. Additionally, as we will describe subsequently, we use the presence of the words in the thesaurus to help generate Question 1, which we use for quality control of the annotations. Table 1 gives a breakdown of the 1,556 pairs chosen by part of speech.

Table 1
Target word pairs chosen for annotation. Each term was annotated about eight times.

    part of speech    # of word pairs
    adverbs                185
    adjectives             646
    nouns                  416
    verbs                  309
    all                  1,556

4 https://www.mturk.com/mturk/welcome.
Because we do not have any control over the educational background of the
annotators, we made efforts to phrase questions about the kinds of opposites in a simple
and clear manner. Therefore we avoided definitions and long instructions in favor of
examples and short questions. We believe this strategy is beneficial even in traditional
annotation scenarios.
We created separate questionnaires (HITs) for adjectives, adverbs, nouns, and
verbs. A complete example adjective HIT with directions and questions is shown
in Figure 1. The adverb, noun, and verb questionnaires had similar questions, but
were phrased slightly differently to accommodate differences in part of speech. These
questionnaires are not shown here due to lack of space, but all four questionnaires are
available for download.5 The verb questionnaire had an additional question, shown
in Figure 2. Because nouns and verbs are not considered gradable, the corresponding
questionnaires did not have Q8 and Q9. We requested annotations from eight different
Turkers for each HIT.
4.1 The Word Choice Question: Q1
Q1 is an automatically generated word choice question that has a clear correct answer. It
helps identify outlier and malicious annotations. If this question is answered incorrectly,
then we assume that the annotator does not know the meanings of the target words, and
we ignore responses to the remaining questions. Further, as this question makes the
annotator think about the meanings of the words and about the relationship between
them, we believe it improves the responses for subsequent questions.
The options for Q1 were generated automatically. Each option is a set of four comma-separated words. The words in the answer are close in meaning to both of the target words. In order to create the answer option, we first generated a much larger source pool of all the words that were in the same thesaurus category as either of the two target words. (Words in the same category are closely related.) Words that had the same stem as either of the target words were discarded. For each of the remaining words, we added their Lesk similarities with the two target words (Banerjee and Pedersen 2003). The four words with the highest sum were chosen to form the answer option.

The three distractor options were randomly selected from the pool of correct answers for all other word choice questions. Finally, the answer and distractor options were presented to the Turkers in random order.
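This selection procedure can be sketched as follows. The code is ours, not part of the original study; the thesaurus lookup, Lesk similarity (Banerjee and Pedersen 2003), and stemmer are passed in as assumed helpers standing in for resources that are not distributed with this article.

```python
def answer_option(target1, target2, categories_of, lesk_similarity, stem):
    """Return the four-word answer option for the Q1 word-choice question.

    categories_of(word) -> list of thesaurus categories (each a list of words)
    lesk_similarity(w1, w2) -> float; stem(word) -> str  (all assumed helpers)
    """
    # Pool: all words sharing a thesaurus category with either target word.
    pool = set()
    for word in (target1, target2):
        for category in categories_of(word):
            pool.update(category)
    # Discard words that share a stem with either target word.
    target_stems = {stem(target1), stem(target2)}
    pool = {w for w in pool if stem(w) not in target_stems}
    # Rank candidates by the sum of their Lesk similarities to the two targets.
    ranked = sorted(pool,
                    key=lambda w: lesk_similarity(w, target1) +
                                  lesk_similarity(w, target2),
                    reverse=True)
    return ranked[:4]   # the four highest-sum words form the answer option
```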
5 http://www.purl.org/net/saif.mohammad/research.
Word-pair: musical × dissonant

Q1. Which set of words is most related to the word pair musical:dissonant?
• useless, surgery, ineffectual, institution
• sequence, episode, opus, composition
• youngest, young, youthful, immature
• consequential, important, importance, heavy

Q2. Do musical and dissonant have some contrast in meaning?
• yes  • no
For example, up–down, lukewarm–cold, teacher–student, attack–defend, all have at least some degree of contrast in meaning. On the other hand, clown–down, chilly–cold, teacher–doctor, and attack–rush DO NOT have contrasting meanings.

Q3. Some contrasting words are paired together so often that given one we naturally think of the other. If one of the words in such a pair were replaced with another word of almost the same meaning, it would sound odd. Are musical:dissonant such a pair?
• yes  • no
Examples for "yes": tall–short, attack–defend, honest–dishonest, happy–sad.
Examples for "no": tall–stocky, attack–protect, honest–liar, happy–morbid.

Q5. Do musical and dissonant represent two ends or extremes?
• yes  • no
Examples for "yes": top–bottom, basement–attic, always–never, all–none, start–finish.
Examples for "no": hot–cold (boiling refers to more warmth than hot and freezing refers to less warmth than cold), teacher–student (there is no such thing as more or less teacher and more or less student), always–sometimes (never is fewer times than sometimes).

Q6. If something is musical, would you assume it is not dissonant, and vice versa? In other words, would it be unusual for something to be both musical and dissonant?
• yes  • no
Examples for "yes": happy–sad, happy–morbid, vigilant–careless, slow–stationary.
Examples for "no": happy–calm, stationary–still, vigilant–careful, honest–truthful.

Q7. If something or someone could possibly be either musical or dissonant, is it necessary that it must be either musical or dissonant? In other words, is it true that for things that can be musical or dissonant, there is no third possible state, except perhaps under highly unusual circumstances?
• yes  • no
Examples for "yes": partial–impartial, true–false, mortal–immortal.
Examples for "no": hot–cold (an object can be at room temperature, which is neither hot nor cold), tall–short (a person can be of medium or average height).

Q8. In a typical situation, if two things or two people are musical, then can one be more musical than the other?
• yes  • no
Examples for "yes": quick, exhausting, loving, costly.
Examples for "no": dead, pregnant, unique, existent.

Q9. In a typical situation, if two things or two people are dissonant, can one be more dissonant than the other?
• yes  • no
Examples for "yes": quick, exhausting, loving, costly, beautiful.
Examples for "no": dead, pregnant, unique, existent, perfect, absolute.

Figure 1
Example HIT: Adjective pairs questionnaire.
Note: Perhaps "musical × dissonant" might be better written as "musical versus dissonant," but we have kept "×" here to show the reader exactly what the Turkers were given.
Note: Q4 is not shown here, but can be seen in the on-line version of the questionnaire. It was an exploratory question, and it was not multiple choice. Q4's responses have not been analyzed.
Word-pair: enabling × disabling

Q10. In a typical situation, do the sequence of actions disabling and then enabling bring someone or something back to the original state, AND do the sequence of actions enabling and disabling also bring someone or something back to the original state?
• yes, both ways: the transition back to the initial state makes much sense in both sequences.
• yes, but only one way: the transition back to the original state makes much more sense one way than the other way.
• none of the above

Examples for "yes, both ways": enter–exit, dress–undress, tie–untie, appear–disappear.
Examples for "yes, but only one way": live–die, create–destroy, damage–repair, kill–resurrect.
Examples for "none of the above": leave–exit, teach–learn, attack–defend (attacking and then defending does not bring one back to the original state).

Figure 2
Additional question in the questionnaire for verbs.
4.2 Post-Processing
The response to a HIT by a Turker is called an assignment. We obtained 12,448 assignments in all (1,556 pairs × 8 assignments each). About 7% of the adjective, adverb,
and noun assignments and about 13% of the verb assignments had an incorrect answer
to Q1. These assignments were discarded, leaving 1,506 target pairs with three or more
valid assignments. We will refer to this set of assignments as the master set, and all
further analysis in this article is based on this set. Table 2 gives a breakdown of the
average number of annotations for each of the target pairs in the master set.
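A minimal sketch of this filtering follows, under an assumed representation of assignments as dictionaries keyed by question (our illustration, not the study's code):

```python
from collections import defaultdict

def build_master_set(assignments, correct_q1):
    """assignments: dicts such as {"pair": ("musical", "dissonant"),
    "Q1": "...", "Q2": "yes", ...}; correct_q1 maps a pair to its gold answer."""
    valid = defaultdict(list)
    for a in assignments:
        # Discard an assignment if its word-choice question was answered wrong.
        if a["Q1"] == correct_q1[a["pair"]]:
            valid[a["pair"]].append(a)
    # Keep only target pairs that retain at least three valid assignments.
    return {pair: anns for pair, anns in valid.items() if len(anns) >= 3}
```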
Table 2
Number of word pairs and average number of annotations per word pair in the master set.

    part of speech    # of word pairs    average # of annotations
    adverbs                182                   7.80
    adjectives             631                   8.32
    nouns                  405                   8.44
    verbs                  288                   7.58
    all                  1,506                   8.04

4.3 Prevalence of Different Kinds of Contrasting Pairs

For each question pertaining to every word pair in the master set, we determined the most frequent response by the annotators. Table 3 gives the percentage of word pairs in the master set that received a most frequent response of "yes." The first column in the table lists the question number followed by a brief description of the question. (Note that the Turkers saw only the full forms of the questions, as shown in the example HIT.)
Table 3
Percentage of word pairs that received a response of "yes" for the questions in the questionnaire. adj. = adjectives; adv. = adverbs.

                                                              % of word pairs
    Question                                      answer      adj.   adv.   nouns   verbs
    Q2. Do X and Y have some contrast?            yes         99.5   96.8   97.6    99.3
    Q3. Are X and Y opposites?                    yes         91.2   68.6   65.8    88.8
    Q5. Are X and Y at two ends of a dimension?   yes         81.8   73.5   81.1    94.4
    Q6. Does X imply not Y?                       yes         98.3   92.3   89.4    97.5
    Q7. Are X and Y mutually exhaustive?          yes         85.1   69.7   74.1    89.5
    Q8. Does X represent a point on some scale?   yes         78.5   77.3   –       –
    Q9. Does Y represent a point on some scale?   yes         78.5   70.8   –       –
    Q10. Does X undo Y OR does Y undo X?          one way     –      –      –        3.8
                                                  both ways   –      –      –       90.9
Table 4
Percentage of WordNet source pairs that are contrasting, opposite, and "contrasting but not opposite."

    category                        basis               adj.   adv.   nouns   verbs
    contrasting                     Q2 yes              99.5   96.8   97.6    99.3
    opposites                       Q2 yes and Q3 yes   91.2   68.6   60.2    88.9
    contrasting, but not opposite   Q2 yes and Q3 no     8.2   28.2   37.4    10.4
Observe that most of the word pairs are considered to have at least some contrast
in meaning. This is not surprising because the master set was constructed using words
connected through WordNet’s antonymy relation.6 Responses to Q3 show that not all
contrasting pairs are considered opposite, and this is especially the case for adverb
pairs and noun pairs. The rows in Table 4 show the percentage of words in the master
set that are contrasting (row 1), opposite (row 2), and contrasting but not opposite
(row 3).
Responses to Q5, Q6, Q7, Q8, and Q9 (Table 3) show the prevalence of different
kinds of relations and properties of the target pairs.
Table 5 shows the percentage of contrasting word pairs that may be classified into
the different types discussed in Section 3. Observe that rows for all categories other
than the disjoints have percentages greater than 60%. This means that a number of
contrasting word pairs can be classified into more than one kind. Complementaries
are the most common kind in case of adverbs, nouns, and verbs, whereas antipodals
are most common among adjectives. A majority of the adjective and adverb contrasting
pairs are gradable, but more than 30% of the pairs are not. Most of the verb pairs are
reversives (91.6%). Disjoint pairs are much less common than all the other categories
6 All of the direct antonyms were marked as contrasting by the Turkers. Only a few indirect antonyms
were marked as not contrasting.
considered, and they are most prominent among adjectives (28%), and least among verb pairs (1.7%).

Table 5
Percentage of contrasting word pairs belonging to various subtypes. The subtype "reversives" applies only to verbs. The subtype "gradable" applies only to adjectives and adverbs.

    subtype            basis                     adv.   adj.   nouns   verbs
    Antipodals         Q2 yes, Q5 yes            82.3   75.9   82.5    95.1
    Complementaries    Q2 yes, Q7 yes            85.6   72.0   84.8    98.3
    Disjoint           Q2 yes, Q7 no             14.4   28.0   15.2     1.7
    Gradable           Q2 yes, Q8 yes, Q9 yes    69.6   66.4   –       –
    Reversives         Q2 yes, Q10 both ways     –      –      –       91.6
4.4 Agreement
People do not always agree on linguistic classifications of terms, and one of the goals of
this work was to determine how much people agree on properties relevant to different
kinds of opposites. Table 6 lists the breakdown of agreement by target-pair part of
speech and question, where agreement is the average percentage of the number of
Turkers giving the most-frequent response to a question—the higher the number
of Turkers that vote for the majority answer, the higher is the agreement.
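Concretely, the statistic can be computed as in the following sketch (our formulation of the description above, not code from the study):

```python
from collections import Counter

def average_agreement(responses_per_pair):
    """responses_per_pair: one list of annotator responses per target pair,
    e.g. [["yes", "yes", "no", ...], ...] for a single question."""
    ratios = []
    for responses in responses_per_pair:
        # Size of the majority: how many annotators gave the modal response.
        majority_size = Counter(responses).most_common(1)[0][1]
        ratios.append(majority_size / len(responses))
    # Reported as the average percentage over all annotated pairs.
    return 100.0 * sum(ratios) / len(ratios)
```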
Observe that agreement is highest when asked whether a word pair has some
degree of contrast in meaning (Q2), and that there is a marked drop when asked if the
two words are opposites (Q3). This is true for each of the parts of speech, although the
drop is highest for verbs (94.7% to 75.2%).
For Questions 5 through 9, we see varying degrees of agreement—Q6 obtaining the
highest agreement and Q5 the lowest. There is marked difference across parts of speech
for certain questions. For example, verbs are the easiest to identify (highest agreement
for Q5, Q7, and Q8). For Q6, nouns have markedly lower agreement than all other parts
of speech—not surprising considering that the set of disjoint opposites is traditionally
associated with equipollent adjectives and stative verbs. Adverbs and adjectives have
markedly lower agreement scores for Q7 than nouns and verbs.
Table 6
Breakdown of answer agreement by target-pair part of speech and question: For every target pair, a question is answered by about eight annotators. The majority response is chosen as the answer. The ratio of the size of the majority and the number of annotators is indicative of the amount of agreement. The table shows the average percentage of this ratio.

    question                                       adj.   adv.   nouns   verbs   average
    Q2. Do X and Y have some contrast?             90.7   92.1   92.0    94.7    92.4
    Q3. Are X and Y opposites?                     79.0   80.9   76.4    75.2    77.9
    Q5. Are X and Y at two ends of a dimension?    70.3   66.5   73.0    78.6    72.1
    Q6. Does X imply not Y?                        89.0   90.2   81.8    88.4    87.4
    Q7. Are X and Y mutually exhaustive?           70.4   69.2   78.2    88.3    76.5
    average (Q2, Q3, Q5, Q6, and Q7)               82.3   79.8   80.3    85.0    81.3
    Q8. Does X represent a point on some scale?    77.9   71.5   –       –       74.7
    Q9. Does Y represent a point on some scale?    75.2   72.0   –       –       73.6
    Q10. Does X undo Y OR does Y undo X?           –      –      –       73.0    73.0
5. Manifestation of Highly Contrasting Word Pairs in Text
As pointed out earlier, there is work on a small set of opposites showing that opposites
co-occur more often than chance (Charles and Miller 1989; Fellbaum 1995). Section 5.1
describes experiments on a larger scale to determine whether highly contrasting word
pairs (including opposites) occur together more often than randomly chosen word
pairs of similar frequency. The section also compares co-occurrence associations with
synonyms.
Research in distributional similarity has found that entries in distributional thesauri
tend to also contain terms that are opposite in meaning (Lin 1998; Lin et al. 2003).
Section 5.2 describes experiments to determine whether highly contrasting word pairs
(including opposites) occur in similar contexts as often as randomly chosen pairs of
words with similar frequencies, and whether highly contrasting words occur in similar
contexts as often as synonyms.
5.1 Co-Occurrence
In order to compare the tendencies of highly contrasting word pairs, synonyms, and
random word pairs to co-occur in text, we created three sets of word pairs: the high-
contrast set, the synonyms set, and the control set of random word pairs. The high-contrast set
was created from a pool of direct and indirect opposites (nouns, verbs, and adjectives)
from WordNet. We discarded pairs that did not meet the following conditions: (1) both
members of the pair must be unigrams, (2) both members of the pair must occur in
the British National Corpus (BNC) (Burnard 2000), and (3) at least one member of the
pair must have a synonym in WordNet. A total of 1,358 word pairs remained, and these
form the high-contrast set.
Each of the pairs in the high-contrast set was used to create a synonym pair by
choosing a WordNet synonym of exactly one member of the pair.7 If a word has more
than one synonym, then the most frequent synonym is chosen.8 These 1,358 word pairs
form the synonyms set. Note that for each of the pairs in the high-contrast set, there is a
corresponding pair in the synonyms set, such that the two pairs have a common term.
For example, the pair agitation and calmness in the high-contrast set has a corresponding
pair agitation and ferment in the synonyms set. We will refer to the common terms
(agitation in this example) as the focus words. Because we also wanted to compare
occurrence statistics of the high-contrast set with the random pairs set, we created the
control set of random pairs by taking each of the focus words and pairing them with
another word in WordNet whose frequency of occurrence in the BNC is closest to that of the term contrasting with the focus word. This is to ensure that members of the pairs across the
high-contrast set and the control set have similar unigram frequencies.
We calculated the pointwise mutual information (PMI) (Church and Hanks 1990)
for each of the word pairs in the high-contrast set, the random pairs set, and the
synonyms set using unigram and co-occurrence frequencies in the BNC. If two words
occurred within a window of five adjacent words in a sentence, they were marked as
co-occurring (same window as Church and Hanks [1990] used in their seminal work on
word–word associations). Table 7 shows the average and standard deviation in each set.
7 If both members of a pair have WordNet synonyms, then one is chosen at random, and its synonym is
taken.
8 WordNet lists synonyms in order of decreasing frequency in the SemCor corpus.
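A sketch of the computation follows; corpus handling is deliberately simplified (the paper uses the BNC), but the five-word window and the PMI formula of Church and Hanks (1990) are as described above.

```python
import math
from collections import Counter

def build_pmi(sentences, window=5):
    """sentences: iterable of token lists. Returns a pmi(w1, w2) function."""
    unigrams, pairs, total = Counter(), Counter(), 0
    for sent in sentences:
        unigrams.update(sent)
        total += len(sent)
        for i, w in enumerate(sent):
            # Count words co-occurring within a window of five adjacent words.
            for v in sent[i + 1:i + window]:
                pairs[frozenset((w, v))] += 1

    def pmi(w1, w2):
        joint = pairs[frozenset((w1, w2))]
        if joint == 0 or w1 not in unigrams or w2 not in unigrams:
            return float("-inf")
        # PMI = log2( N * f(w1, w2) / (f(w1) * f(w2)) )
        return math.log2(total * joint / (unigrams[w1] * unigrams[w2]))

    return pmi
```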
Table 7
Pointwise mutual information (PMI) of word pairs. High positive values imply a tendency to co-occur in text more often than random chance.

                          average PMI   standard deviation
    high-contrast set        1.471            2.255
    random pairs set         0.032            0.236
    synonyms set             0.412            1.110
Observe that the high-contrast pairs have a much higher tendency to co-occur than the
random pairs control set, and also the synonyms set. The high-contrast set has a large
standard deviation, however. A two-sample t-test revealed that the high-contrast set
is significantly different from the random set (p < 0.05), and also that the high-contrast
set is significantly different from the synonyms set (p < 0.05).
On average, however, the PMI between a focus word and its contrasting term was
lower than the PMI between the focus word and 3,559 other words in the BNC. These
were often words related to the focus words, but neither contrasting nor synonymous.
Thus, even though a high tendency to co-occur is a feature of highly contrasting pairs,
it is not a sufficient condition for detecting them. We use PMI as part of our method
for determining the degree of lexical contrast (described in Section 6).
5.2 Distributional Similarity
Charles and Miller (1989) proposed that in most contexts, opposites may be inter-
changed. The meaning of the utterance will be inverted, of course, but the sentence
will remain grammatical and linguistically plausible. This came to be known as the
substitutability hypothesis. Their experiments did not support this claim, however. They
found that given a sentence with the target adjective removed, most people did not
confound the missing word with its opposite. Justeson and Katz (1991) later showed
that in sentences that contain both members of an adjectival opposite pair, the target
adjectives do indeed occur in similar syntactic structures at the phrasal level. Jones et al.
(2007) show how the tendency to appear in certain textual constructions such as “from
X to Y” and “either X or Y” are indicative of prototypicalness of opposites. Thus, we can
formulate the distributional hypothesis of highly contrasting pairs: highly contrasting
pairs occur in similar contexts more often than non-contrasting word pairs.
We used the same sets of high-contrast pairs, synonyms, and random pairs de-
scribed in the previous section to gather empirical proof of the distributional hypothesis.
We calculated the distributional similarity between each pair in the three sets using
Lin’s (1998) measure. Table 8 shows the average and standard deviation in each set.
Observe that the high-contrast set has a much higher average distributional similarity
Table 8
Distributional similarity of word pairs. The measure proposed in Lin (1998) was used.

                          average distributional similarity   standard deviation
    high-contrast set                  0.064                        0.071
    random pairs set                   0.036                        0.034
    synonyms set                       0.056                        0.057
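Lin's (1998) measure is defined over dependency triples and is not reproduced here; as a simplified stand-in, the following sketch scores two words by the cosine of their positive-PMI context vectors. It illustrates the kind of distributional comparison being made, not the exact measure used in our experiments.

```python
import math
from collections import Counter, defaultdict

def ppmi_vectors(sentences, window=5):
    """Build positive-PMI context vectors from token lists."""
    vecs, ctx_totals, total = defaultdict(Counter), Counter(), 0
    for sent in sentences:
        for i, w in enumerate(sent):
            for c in sent[max(0, i - window):i] + sent[i + 1:i + window + 1]:
                vecs[w][c] += 1
                ctx_totals[c] += 1
                total += 1
    for w, ctx in vecs.items():
        n_w = sum(ctx.values())
        for c in list(ctx):
            pmi = math.log2(total * ctx[c] / (n_w * ctx_totals[c]))
            if pmi > 0:
                ctx[c] = pmi          # keep positive associations only
            else:
                del ctx[c]
    return vecs

def cosine(u, v):
    dot = sum(weight * v.get(c, 0.0) for c, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```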
than the random pairs control set, and interestingly it is also higher than the synonyms
set. Once again, the high-contrast set has a large standard deviation. A two-sample
t-test revealed that the high-contrast set is significantly different from both the random
set and the synonyms set (p < 0.05). This demonstrates that
relative to other word pairs, high-contrast pairs tend to occur in similar contexts. We
also find that the synonyms set has a significantly higher distributional similarity than
the random pairs set (p < 0.05). This shows that near-synonymous word pairs also occur
in similar contexts (the distributional hypothesis of similarity). Further, a consequence of the large standard deviations in the cases of both high-contrast pairs and synonyms is that distributional similarity alone is not sufficient to determine whether two
words are contrasting or synonymous. An automatic method for recognizing contrast
will require additional cues. Our method uses PMI and other sources of information
described in the next section.
6. Computing Lexical Contrast
In this section, we recapitulate the automatic method for determining lexical con-
trast that we first proposed in Mohammad, Dorr, and Hirst (2008). Additional details
are provided regarding the lexical resources used (Section 6.1) and the method itself
(Section 6.2).
6.1 Lexical Resources
Our method makes use of a published thesaurus and co-occurrence information from
text. Optionally, it can use opposites listed in WordNet if available. We briefly describe
these resources here.
6.1.1 Published Thesauri. Published thesauri, such as Roget’s and Macquarie, divide the
vocabulary of a language into about a thousand categories. Words within a category are
semantically related to each other, and they tend to pertain to a coarse concept. Each
category is represented by a category number (unique ID) and a head word—a word that
best represents the meanings of the words in the category. One may also find opposites
in the same category, but this is rare. Words with more than one meaning may be found
in more than one category; these represent its coarse senses.
Within a category, the words are grouped into finer units called paragraphs. Words
in the same paragraph are closer in meaning than those in differing paragraphs. Each
paragraph has a paragraph head—a word that best represents the meaning of the words
in the paragraph. Words in a thesaurus paragraph belong to the same part of speech. A
thesaurus category may have multiple paragraphs belonging to the same part of speech.
For example, a category may have three noun paragraphs, four verb paragraphs, and
one adjective paragraph. We will take advantage of the structure of the thesaurus in
our approach.
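The structure just described might be represented as follows (the class and field names are ours, for illustration only):

```python
from dataclasses import dataclass, field

@dataclass
class Paragraph:
    head: str                   # paragraph head word
    pos: str                    # all words in a paragraph share one part of speech
    words: list = field(default_factory=list)

@dataclass
class Category:
    number: int                 # unique category ID
    head: str                   # head word representing the coarse concept
    paragraphs: list = field(default_factory=list)

    def all_words(self):
        # A category may hold several paragraphs per part of speech.
        return [w for p in self.paragraphs for w in p.words]
```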
6.1.2 WordNet. As mentioned earlier, WordNet encodes certain opposites. We found
in our experiments (Section 7, subsequently) that more than 90% of contrasting pairs
included in Graduate Record Examination (GRE) “most contrasting word” questions
are not encoded in WordNet, however. Also, neither WordNet nor any other manually
created repository of opposites provides the degree of contrast between word pairs.
Nevertheless, we investigate the usefulness of WordNet as a source of seed opposites
for our approach.
6.2 Proposed Measure of Lexical Contrast
Our method for determining lexical contrast has two parts: (1) determining whether
the target word pair is contrasting or not, and (2) determining the degree of contrast
between the words.
6.2.1 Detecting Whether a Target Word Pair is Contrasting. We use the contrast hypothesis
to determine whether two words are contrasting. The hypothesis is repeated here:
Contrast Hypothesis: If a pair of words, A and B, are contrasting, then there is a pair of
opposites, C and D, such that A and C are strongly related and B and D are strongly
related.
Even if a few exceptions to this hypothesis are found (we are not aware of any), the
hypothesis would remain useful for practical applications. We first determine pairs of
thesaurus categories that have at least one word in each category that are opposites of
each other. We will refer to these categories as contrasting categories and the opposite
connecting the two categories as the seed opposite. Because each thesaurus category
is a collection of closely related terms, all of the word pairs across two contrasting
categories satisfy the contrast hypothesis, and they are considered to be contrasting
word pairs. Note also that words within a thesaurus category may belong to different
parts of speech, and they may be related to the seed opposite word through any of
the many possible semantic relations. Thus a small number of seed opposites can
help identify a large number of contrasting word pairs. We determine whether two
categories are contrasting using the three methods described here, which may be used
alone or in combination with each other:
Method 1: Using word pairs generated from affix patterns.
Opposites such as hot–cold and dark–light occur frequently in text, but in terms of type-
pairs they are outnumbered by those created using affixes, such as un- (clear–unclear)
and dis- (honest–dishonest). Further, this phenomenon is observed in most languages
(Lyons 1977).
Table 9 lists 15 affix patterns that tend to generate opposites in English. They
were compiled by the first author by examining a small list of affixes for the English
language.9 These patterns were applied to all words in the thesaurus that are at least
three characters long. If the resulting term was also a valid word in the thesaurus, then
the word pair was added to the affix-generated seed set. These fifteen rules generated 2,682
word pairs when applied to the words in the Macquarie Thesaurus. Category pairs that
had these opposites were marked as contrasting. Of course, not all of the word pairs
generated through affixes are truly opposites, for example sect–insect and part–impart.
For now, such pairs are sources of error in the system. Manual analysis of these 2,682
word pairs can help determine whether this error is large or small. (We have released
the full set of word pairs.) Evaluation results (Section 7) indicate that these seed pairs
improve the overall accuracy of the system, however.
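The generation procedure can be sketched as follows; the sketch encodes the fifteen substitutions of Table 9 and assumes the thesaurus vocabulary is available as a set of lowercase words. As noted above, it will also produce some spurious pairs such as sect–insect.

```python
# (prefix of word 1, prefix of word 2) substitutions; patterns 1-8 add a
# prefix to the bare word (X -> antiX, disX, imX, inX, malX, misX, nonX, unX).
PREFIX_PATTERNS = [("", "anti"), ("", "dis"), ("", "im"), ("", "in"),
                   ("", "mal"), ("", "mis"), ("", "non"), ("", "un"),
                   ("l", "ill"), ("r", "irr"), ("im", "ex"), ("in", "ex"),
                   ("up", "down"), ("over", "under")]
SUFFIX_PATTERNS = [("less", "ful")]            # pattern 15: Xless -> Xful

def affix_seed_pairs(vocabulary):
    seeds = set()
    for w1 in vocabulary:
        if len(w1) < 3:                        # apply only to words >= 3 letters
            continue
        for p1, p2 in PREFIX_PATTERNS:
            if w1.startswith(p1):
                w2 = p2 + w1[len(p1):]         # e.g., clear -> unclear
                if w2 in vocabulary:
                    seeds.add((w1, w2))
        for s1, s2 in SUFFIX_PATTERNS:
            if w1.endswith(s1):
                w2 = w1[:len(w1) - len(s1)] + s2   # harmless -> harmful
                if w2 in vocabulary:
                    seeds.add((w1, w2))
    return seeds
```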
9 http://www.englishclub.com/vocabulary/prefixes.htm.
pair cover–uncover, and so the system concludes that the two categories have contrasting
meaning. The contrast in meaning is especially strong for the paragraphs cover and
expose because words within these paragraphs are very close in meaning to cover and
uncover, respectively. We will refer to such thesaurus paragraph pairs that have one
word each of a seed pair as prime contrasting paragraphs. We expect the words across
prime contrasting paragraphs to have a high degree of antonymy (for example, mask
and bare), whereas words across other contrasting category paragraphs may have a
smaller degree of antonymy, as the meanings of these words may diverge significantly
from the meanings of the words in the prime contrasting paragraphs (for example,
white lie and disclosure).

Table 9
Fifteen affix patterns used to generate opposites. Here 'X' stands for any sequence of letters
common to both words w1 and w2.

pattern #   word 1   word 2    # word pairs   example pair
 1          X        antiX            41      clockwise–anticlockwise
 2          X        disX            379      interest–disinterest
 3          X        imX             193      possible–impossible
 4          X        inX             690      consistent–inconsistent
 5          X        malX             25      adroit–maladroit
 6          X        misX            142      fortune–misfortune
 7          X        nonX             72      aligned–nonaligned
 8          X        unX             833      biased–unbiased
 9          lX       illX             25      legal–illegal
10          rX       irrX             48      regular–irregular
11          imX      exX              35      implicit–explicit
12          inX      exX              74      introvert–extrovert
13          upX      downX            22      uphill–downhill
14          overX    underX           52      overdone–underdone
15          Xless    Xful             51      harmless–harmful
Total:                            2,682
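The projection of seed opposites onto the thesaurus structure can be sketched as follows; the word_to_categories and word_to_paragraphs indexes are assumed to be precomputed from the thesaurus, and the function name is illustrative.

```python
# Sketch: mark the category pairs, and the prime contrasting paragraph
# pairs, that are connected by at least one seed opposite.
def mark_contrasting(seed_pairs, word_to_categories, word_to_paragraphs):
    contrasting_categories, prime_paragraphs = set(), set()
    for a, b in seed_pairs:
        for ca in word_to_categories.get(a, ()):
            for cb in word_to_categories.get(b, ()):
                if ca != cb:
                    contrasting_categories.add(frozenset((ca, cb)))
        for pa in word_to_paragraphs.get(a, ()):
            for pb in word_to_paragraphs.get(b, ()):
                if pa != pb:
                    prime_paragraphs.add(frozenset((pa, pb)))
    return contrasting_categories, prime_paragraphs
```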
Method 2: Using opposites from WordNet.
We compiled a list of 20,611 pairs that WordNet records as direct and indirect opposites.
(Recall discussion in Section 4 about direct and indirect opposites.) A large number of
these pairs include multiword expressions. Only 10,807 of the 20,611 pairs have both
words in the Macquarie Thesaurus—the vocabulary used for our experiments. We will
refer to them as the WordNet seed set. Category pairs that had these opposites were
marked as contrasting.
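A hedged sketch of this seed extraction using NLTK's WordNet interface is given below; it collects only direct antonyms (indirect antonyms would additionally follow WordNet's similar-to links), and the vocabulary argument stands in for the Macquarie Thesaurus lookup.

```python
# Requires the WordNet data: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def wordnet_seed_pairs(thesaurus_vocab):
    """Collect WordNet direct antonym pairs whose two words both occur
    in the thesaurus vocabulary."""
    seeds = set()
    for synset in wn.all_synsets():
        for lemma in synset.lemmas():
            for ant in lemma.antonyms():
                w1 = lemma.name().replace("_", " ")
                w2 = ant.name().replace("_", " ")
                if w1 in thesaurus_vocab and w2 in thesaurus_vocab:
                    seeds.add(tuple(sorted((w1, w2))))
    return seeds
```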
Method 3: Using word pairs in adjacent thesaurus categories.
Most published thesauri, such as Roget’s, are organized such that categories correspond-
ing to opposing concepts are placed adjacent to each other. For example, in the Macquarie
Thesaurus: category 369 is about honesty and category 370 is about dishonesty; as shown
in Figure 3, category 360 is about hiding and category 361 is about revealing. There are
a number of exceptions to this rule, and often a category may be contrasting in meaning
to several other categories. Because this was an easy enough heuristic to implement,
however, we investigated the usefulness of considering adjacent thesaurus categories
as contrasting. We will refer to this as the adjacency heuristic. Note that this method of
determining contrasting categories does not explicitly identify a seed opposite, but one
can assume the head words of these category pairs to be the seed opposites.

Figure 3
Example contrasting category pair. The system identifies the pair to be contrasting through the
affix-based seed pair cover–uncover. The paragraphs of cover and expose are referred to as prime
contrasting paragraphs. Paragraph heads are shown in bold italic.
To determine how accurate the adjacency heuristic is, the first author manually
inspected adjacent thesaurus categories in the Macquarie Thesaurus to determine which
of them were indeed contrasting. Because a category, on average, has about a hundred
words, the task was made less arduous by representing each category by just the first
ten words listed in it. This way it took only about five hours to manually determine that
209 pairs of the 811 adjacent Macquarie category pairs were contrasting. Twice, it was
found that category number X was contrasting not just with category number X+1 but
also with category number X+2: category 40 (ARISTOCRACY) has a meaning that
contrasts with that of category 41 (MIDDLE CLASS) as well as with that of category 42
(WORKING CLASS); category 542 (PAST) contrasts with category 543 (PRESENT) as well as
with category 544 (FUTURE). Both these X–(X+2) pairs were also added to the list of
manually annotated contrasting categories.
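A minimal sketch of the adjacency heuristic is shown below, assuming categories are numbered consecutively; when the manually verified list is supplied, it replaces the raw heuristic and may include X–(X+2) pairs such as (40, 42) and (542, 544).

```python
def contrasting_category_pairs(num_categories, verified_pairs=None):
    """Adjacency heuristic: treat every (X, X+1) category pair as
    contrasting, unless a manually verified list is available."""
    if verified_pairs is not None:          # e.g., the 209 annotated pairs
        return set(verified_pairs)
    return {(c, c + 1) for c in range(1, num_categories)}
```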
6.2.2 Computing the Degree of Contrast Between Two Words. Charles and Miller (1989)
and Fellbaum (1995) argued that opposites tend to co-occur more often than chance
would predict. Murphy and Andrew (1993) claimed that the greater-than-chance
co-occurrence of opposites arises because together they convey contrast well, which is
rhetorically useful.
We showed earlier in Section 5.1 that highly contrasting pairs (including opposites)
co-occur more often than randomly chosen pairs. All of these support the degree of
contrast hypothesis stated earlier in the introduction:
Degree of Contrast Hypothesis: If a pair of words, A and B, are contrasting, then their
degree of contrast is proportional to their tendency to co-occur in a large corpus.
We used pointwise mutual information (PMI) to capture the tendency of two words to co-occur. We collected
these co-occurrence statistics from the Google n-gram corpus (Brants and Franz 2006),
which was created from a text collection of over one trillion words. Words that occurred
within a window of five words were considered to be co-occurring.
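The PMI computation itself is standard; the sketch below assumes precomputed count tables (pair counts within the five-word window, unigram counts, and a total count) built from the n-gram corpus.

```python
import math

def pmi(w1, w2, cooccur, unigram, total):
    """Pointwise mutual information: log2(p(w1, w2) / (p(w1) * p(w2)))."""
    joint = cooccur.get((w1, w2), 0)
    if joint == 0:
        return float("-inf")                # no co-occurrence evidence
    return math.log2((joint / total) /
                     ((unigram[w1] / total) * (unigram[w2] / total)))
```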
We expected that some features would be more accurate than others. If multiple
features provide conflicting evidence, then it is useful for the system
to know which feature is more reliable. Therefore, we held out some data from the
evaluation data described in Section 7.1 as the development set. Experiments on the
development set showed that contrasting words may be placed in three bins corre-
sponding to the amount of reliability of the source feature: high, medium, or acceptable.
High reliability (Class I): target words that belong to adjacent thesaurus
categories. For example, all the word pairs across categories 360 and 361,
shown in Figure 3. Examples of Class I contrasting word pairs from the
development set include graceful–ungainly, fortunate–hapless, obese–slim,
and effeminate–virile. (Note, there need not be any affix or WordNet seed
pairs across adjacent thesaurus categories for these word pairs to be
marked Class I.) As expected, if we use only those adjacent categories that
were manually identified to be contrasting (as described in Section 6.2.1,
Method 3), then the system obtains even better results than those obtained
using all adjacent thesaurus categories. (Experiments and results are shown
in Section 7.1.)
Medium reliability (Class II): target words that are not Class I contrasting
pairs, but belong to one paragraph each of a prime contrasting paragraph.
For example, all the word pairs across the paragraphs of sympathetic and
indifferent. See Figure 4. Examples of Class II contrasting word pairs
from the development set include altruism–avarice, miserly–munificent,
accept–repudiate, and improper–prim.
Acceptable reliability (Class III): target words that are not Class I or
Class II contrasting pairs, but occur across contrasting category pairs. For
example, all word pairs across categories 423 and 230 except those that
have one word each from the paragraphs of sympathetic and indifferent.
See Figure 4. Examples of Class III contrasting word pairs from the
development set include pandemonium–calm, probity–error, artifice–sincerity,
and hapless–wealthy.
Even with access to very large textual data sets, there is always a long tail of words
that occur so few times that there is not enough co-occurrence information for them.
Co-occurrence statistics alone are therefore not a sufficiently reliable ranking signal.
Thus we assume that all word pairs in Class I have a higher degree of contrast than
all word pairs in Class II, and that all word pairs in Class II have a higher degree of
contrast than the pairs in Class III. If two word pairs belong to the same class, then we
calculate their tendency to co-occur in text to determine which pair is more
contrasting. All experiments in the evaluation section ahead follow this method.
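Putting the pieces together, answer selection can be sketched as below; class_of and pmi_score are stand-ins for the Class I–III tests of Section 6.2.1 and the PMI statistic above.

```python
def most_contrasting(target, alternatives, class_of, pmi_score):
    """class_of(a, b) returns 1, 2, or 3 for Class I-III pairs and None for
    non-contrasting pairs; pmi_score(a, b) returns the pair's PMI. A lower
    class number always outranks a higher one; PMI breaks ties in a class."""
    scored = []
    for alt in alternatives:
        cls = class_of(target, alt)
        if cls is not None:
            scored.append(((cls, -pmi_score(target, alt)), alt))
    return min(scored)[1] if scored else None   # refrain if no evidence
```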
6.2.3 Lexicon of Contrasting Word Pairs. Using the method described in the previous
sections, we generated a lexicon of word pairs pertaining to Class I and Class II. The
lexicon has 6.3 million contrasting word pairs, about 3.5 million of which belong to
Class I and about 2.8 million to Class II. Class III pairs are even more numerous;
given a word pair, our algorithm can check whether it is a Class III pair, but we did not
create a complete set of all Class III contrasting pairs. The Class I and Class II lexicons
are available for download and are summarized in Table 18.

Figure 4
Example contrasting category pair that has Class II and Class III contrasting pairs. The system
identifies the pair to be contrasting through the affix-based seed pair caring (second word in
paragraph 2 of category 423) and uncaring (fourth word in paragraph 3 of category 230). The
paragraphs of sympathetic and indifferent are therefore the prime contrasting paragraphs, and so
all word pairs that have one word each from these two paragraphs are Class II contrasting pairs.
All other pairs formed by taking one word each from the two contrasting categories are the
Class III contrasting pairs. Paragraph heads are shown in bold italic.

Category number: 423          Category head: KINDNESS
  1. nouns:      kindness, considerateness, goodness, niceness, ...
  2. adjectives: sympathetic, caring, consolatory, involved, ...
  3. adverbs:    benevolent, beneficiently, graciously, kindheartedly, ...

Category number: 230          Category head: APATHY
  1. nouns:      apathy, acedia, depression, moppishness, ...
  2. nouns:      nonchalance, insouciance, carelessness, casualness, ...
  3. adjectives: indifferent, detached, irresponsive, uncaring, ...
7. Evaluation
We evaluate our algorithm on two different tasks and four data sets. Section 7.1 de-
scribes experiments on solving existing GRE “choose the most contrasting word” ques-
tions (a recapitulation of the evaluation reported in Mohammad, Dorr, and Hirst [2008]).
Section 7.2 describes experiments on solving newly created “choose the most contrast-
ing word” questions specifically designed to determine performance on different kinds
of opposites. And lastly, Section 7.3 describes experiments on two different data sets
where the goal is to identify whether a given word pair is synonymous or antonymous.
7.1 Solving GRE’s “Choose the Most Contrasting Word” Questions
The GRE is a test taken by thousands of North American graduate school applicants.
The test is administered by the Educational Testing Service (ETS). The Verbal Reasoning
section of the GRE is designed to test verbal skills. Until August 2011, one of its sections had
a set of questions pertaining to word-pair contrast. Each question had a target word and
four or five alternatives, or option words. The objective was to identify the alternative
which was most contrasting with respect to the target. For example, consider:
adulterate:
a. renounce
b. forbid
c. purify
d. criticize
e. correct
Here the target word is adulterate. One of the alternatives provided is correct, which
as a verb has a meaning that contrasts with that of adulterate; purify, however, has
a greater degree of contrast with adulterate than correct does and must be chosen
in order for the instance to be marked as correctly answered. ETS referred to these
questions as “antonym questions,” where the examinees had to “choose the word
most nearly opposite” to the target. Most of the target–answer pairs are not gradable
adjectives, however, and because most of them are not opposites either, we will refer
to these questions as “choose the most contrasting word” questions or contrast questions
for short.
Evaluation on this data set tests whether the automatic method is able to identify
not just opposites but also those pairs that are not opposites but that have some degree
of semantic contrast. Notably, for these questions, the method must be able to identify
that one word pair has a higher degree of contrast than all others, even though that
word pair may not necessarily be an opposite.
7.1.1 Data. A Web search for large sets of contrast questions yielded two independent
sets of questions designed to prepare students for the GRE. The first set consists of
162 questions. We used this set while we were developing our lexical contrast algo-
rithm described in Section 6.2. Therefore, we will refer to it as the development set. The
development set helped determine which features of lexical contrast were more reliable
than others. The second set has 1,208 contrast questions. We discarded questions that
had a multiword target or alternative. After removing duplicates we were left with
790 questions, which we used as the unseen test set. This data set was used (and seen)
only after our algorithm for determining lexical contrast was frozen.
Interestingly, the data contains many instances that have the same target word used
in different senses. For example:
1. obdurate:  a. meager      b. unsusceptible  c. right        d. tender   e. intelligent
2. obdurate:  a. yielding    b. motivated      c. moribund     d. azure    e. hard
3. obdurate:  a. transitory  b. commensurate   c. complaisant  d. similar  e. laconic
In (1), obdurate is used in the sense of HARDENED IN FEELINGS and is most contrasting
with tender. In (2), it is used in the sense of RESISTANT TO PERSUASION and is most
contrasting with yielding. In (3), it is used in the sense of PERSISTENT and is most
contrasting with transitory.
The data sets also contain questions in which one or more of the alternatives is a
near-synonym of the target word. For example:
astute:
a. shrewd
b. foolish
c. callow
d. winning
e. debating
Observe that shrewd is a near-synonym of astute. The word most contrasting with astute
is foolish. A manual check of a randomly selected set of 100 test-set questions revealed
that, on average, one in four had a near-synonym as one of the alternatives.
7.1.2 Results. Table 10 presents results obtained on the development and test data
using two baselines, a re-implementation of the method described in Lin et al. (2003),
and variations of our method. Some of the results are for systems that refrain from
attempting questions for which they do not have sufficient information. We therefore
report precision (P), recall (R), and balanced F-score (F):

P = (# of questions answered correctly) / (# of questions attempted)   (1)

R = (# of questions answered correctly) / (# of questions)   (2)

F = (2 × P × R) / (P + R)   (3)

Table 10
Results obtained on contrast questions. The best performing system and configuration are
shown in bold.

                                                        development data    test data
                                                        P     R     F       P     R     F
Baselines:
  a. random baseline                                    0.20  0.20  0.20    0.20  0.20  0.20
  b. WordNet antonyms                                   0.23  0.23  0.23    0.23  0.23  0.23
Related work:
  a. Lin et al. (2003)                                  0.23  0.23  0.23    0.24  0.24  0.24
Our method:
  a. affix-generated pairs as seeds                     0.72  0.53  0.61    0.71  0.51  0.59
  b. WordNet antonyms as seeds                          0.79  0.52  0.63    0.72  0.49  0.58
  c. both seed sets (a + b)                             0.77  0.65  0.70    0.72  0.58  0.64
  d. adjacency heuristic only                           0.81  0.43  0.56    0.83  0.44  0.57
  e. manual annotation of adjacent categories           0.88  0.41  0.56    0.87  0.41  0.55
  f. affix seed set and adjacency heuristic (a + d)     0.75  0.60  0.67    0.76  0.60  0.67
  g. both seed sets and adjacency heuristic (a + b + d) 0.76  0.66  0.70    0.76  0.63  0.69
  h. affix seed set and annotation of adjacent
     categories (a + e)                                 0.79  0.63  0.70    0.78  0.60  0.68
  i. both seed sets and annotation of adjacent
     categories (a + b + e)                             0.79  0.66  0.72    0.77  0.63  0.69
Baselines. If a system randomly guesses one of the five alternatives with equal probabil-
ity (random baseline), then it obtains an accuracy of 0.2. Our second baseline looks up
the list of WordNet antonyms (10,807 pairs) to solve the contrast questions. It obtained
the correct answer in only 5 instances of the development set (3.09% of the 162 instances)
and 25 instances of the test set (3.17% of the 790 instances).
Even if the system guesses at random for all other instances, it attains only a modest
improvement over the random baseline (see row b, under “Baselines,” in Table 10).
Re-implementation of related work. In order to estimate how well the method of Lin
et al. (2003) performs on this task, we re-implemented their method. For each closest-
antonym question, we determined frequency counts in the Google n-gram corpus for
the phrases “from ⟨target word⟩ to ⟨known correct answer⟩,” “from ⟨known correct
answer⟩ to ⟨target word⟩,” “either ⟨target word⟩ or ⟨known correct answer⟩,” and
“either ⟨known correct answer⟩ or ⟨target word⟩.” We then summed up the four counts
for each contrast question. This resulted in non-zero counts for only 5 of the 162 in-
stances in the development set (3.09%), and 35 of the 790 instances in the test set (4.43%).
Thus, these patterns fail to cover a vast majority of closest-antonyms, and even if the
system guesses at random for all other instances, it attains only a modest improvement
over the baseline (see row a, under “Related work,” in Table 10).
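The re-implementation reduces to a handful of corpus lookups; a sketch follows, where the ngram_count argument is an assumed interface that maps a phrase to its frequency in the Google n-gram corpus.

```python
def lin_pattern_count(w1, w2, ngram_count):
    """Summed corpus counts of the four oppositeness patterns of
    Lin et al. (2003) for a word pair."""
    patterns = [f"from {w1} to {w2}", f"from {w2} to {w1}",
                f"either {w1} or {w2}", f"either {w2} or {w1}"]
    return sum(ngram_count(p) for p in patterns)
```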
Our method. Table 10 presents results obtained on the development and test data using
different combinations of the seed sets and the adjacency heuristic. The best performing
system is marked in bold. It has significantly higher precision and recall than the
method proposed by Lin et al. (2003), with 95% confidence according to the Fisher Exact
Test (Agresti 1990).
We performed experiments on the development set first, using our method with
configurations described in rows a, b, and d. These results showed that marking
adjacent categories as contrasting has the highest precision (0.81), followed by using
WordNet seeds (0.79), followed by the use of affix rules to generate seeds (0.72). This
allowed us to determine the relative reliability of the three features as described in
Section 6.2.2. We then froze all system development and ran the remaining experiments,
including those on the test data.
Observe that all of the results shown in Table 10 are well above the random baseline
of 0.20. Using only the small set of 15 affix rules, the system performs almost as well as
when it uses 10,807 WordNet opposites. Using both the affix-generated and the Word-
Net seed sets, the system obtains markedly improved precision and coverage. Using
only the adjacency heuristic gave high precision (upwards of 0.8) with substantial
coverage (attempting more than half of the questions). Using the manually identified
contrasting adjacent thesaurus categories gave precision values just short of 0.9. The
best results were obtained using both seed sets and the contrasting adjacent thesaurus
categories (F-scores of 0.72 and 0.69 on the development and test set, respectively).
In order to determine whether our method works well with thesauri other than the
Macquarie Thesaurus, we determined the performance of configurations a, b, c, d, f, g, and h
using the 1911 U.S. edition of the Roget’s Thesaurus, which is available freely in the public
domain.10 The results were similar to those obtained using the Macquarie Thesaurus. For
example, configuration g obtained a precision of 0.81, recall of 0.58, and F-score of 0.68
on the test set. It may be possible to obtain even better results by combining multiple
lexical resources; that is left for future work. The remainder of this article reports results
obtained with the Macquarie Thesaurus; the 1911 vocabulary is less suited for practical
use in the 21st century.
7.1.3 Discussion. These results show that our method performs well on questions de-
signed to be challenging for humans. In tasks that require higher precision, using only
the contrasting adjacent categories is best, whereas in tasks that require both precision
and coverage, the seed sets may be included. Even when both seed sets were included,
only four instances in the development set and twenty in the test set had target–answer
pairs that matched a seed opposite pair. For all remaining instances, the approach had
to generalize to determine the most contrasting word. This also shows that even the
seemingly large set of direct and indirect antonyms from WordNet (more than
10,000) is by itself insufficient.
The comparable performance obtained using the affix rules alone suggests that even
in languages that do not have a WordNet-like resource, substantial accuracies may be
obtained. Of course, the improved results when WordNet antonyms are also used suggest
that the information they provide is complementary.
10 http://www.gutenberg.org/ebooks/10681.
Error analysis revealed that at times the system failed to identify that a category
pertaining to the target word contrasted with a category pertaining to the answer.
Additional methods to identify seed opposite pairs will help in such cases. Certain
other errors occurred because one or more alternatives other than the official answer
were also contrasting with the target. For example, one of the questions has chasten as
the target word. One of the alternatives is accept, which has some degree of contrast in
meaning to the target. Another alternative, reward, has an even higher degree of contrast
with the target, however. In this instance, the system erred by choosing accept as the
answer.
7.2 Determining Performance of Automatic Method on Different Kinds of Opposites
The previous section showed the overall performance of our method. The performance
of a method may vary significantly on different subsets of data, however. In order to
determine performance on different kinds of opposites, we generated new contrast
questions from the crowdsourced term pairs described in Section 4. Note that for solving
contrast questions with this data set, again the method must be able to identify that
one word pair has a higher degree of contrast than the other pairs; unlike the previous
section, however, here the correct answer is often an opposite of the target.
7.2.1 Generating Contrast Questions. For each word pair from the list of WordNet oppo-
sites, we chose one word randomly to be the target word, and the other as one of its
candidate options. Four other candidate options were chosen from Lin’s distributional
thesaurus (Lin 1998).11 An entry in the distributional thesaurus has a focus word and
a number of other words that are distributionally similar to the focus word. The words
are listed in decreasing order of similarity. Note that these entries include not just
near-synonymous words but also at times contrasting words because contrasting
words tend to be distributionally similar (Lin et al. 2003).
For each of the target words in our contrast questions, we chose the four distribu-
tionally closest words from Lin’s thesaurus to be the distractors. If a distractor had the
same first three letters as the target word or the correct answer, then it was replaced
with another word from the distributional thesaurus. This ad hoc filtering criterion is
effective at discarding distractors that are morphological variants of the target or the
answer. For example, if the target word is adulterate, then words such as adulterated and
adulterates will not be included as distractors even if they are listed as closely similar
terms in the distributional thesaurus.
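A sketch of the distractor selection with this filter follows, assuming a list of neighbors ranked by decreasing distributional similarity; the function name is illustrative.

```python
def pick_distractors(target, answer, ranked_neighbors, k=4):
    """Take the k distributionally closest neighbors, skipping any word that
    shares its first three letters with the target or the correct answer."""
    distractors = []
    for w in ranked_neighbors:
        if w[:3] == target[:3] or w[:3] == answer[:3]:
            continue                  # filters morphological variants
        distractors.append(w)
        if len(distractors) == k:
            break
    return distractors
```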
We placed the four distractors and the correct answer in random order. Some of the
WordNet opposites were not listed in Lin's thesaurus; in such cases, no question was
generated. In all, 1,269 questions were generated. We created subsets of these
questions corresponding to the different kinds of opposites and also corresponding to
different parts of speech. Because a word pair may be classified as more than one kind
of opposite, the corresponding question may be part of more than one subset.
7.2.2 Experiments and Results. We applied our method of lexical contrast to solve the
complete set of 1,269 questions and also the various subsets. Because this test set is
11 http://webdocs.cs.ualberta.ca/∼lindek/downloads.htm.
Table 11
Performance of the automatic method on contrast questions, where different question sets
correspond to target–answer pairs of different kinds. The automatic method did not use
WordNet seeds for this task. The results shown for 'ALL' are micro-averages; that is, they are
the results for the master set of 1,269 contrast questions.

                   # instances    P     R     F
Antipodals            1,044      0.95  0.84  0.89
Complementaries       1,042      0.95  0.83  0.89
Disjoint                228      0.81  0.59  0.69
Gradable                488      0.95  0.85  0.90
Reversives              203      0.93  0.74  0.82
ALL                   1,269      0.93  0.79  0.85
created from WordNet opposites, we applied the algorithm without the use of WordNet
seeds.
Table 11 shows the precision (P), recall (R), and F-score (F) obtained by the method
on the data sets corresponding to different kinds of opposites. The column ‘# instances’
shows the number of questions in each of the data sets. The performance of our method
on the complete data set is shown in the last row ALL. Observe that the F-score of
0.85 is markedly higher than the score obtained on the GRE-preparatory questions.
This is expected because the GRE questions involved vocabulary from a higher reading
level, and included carefully chosen distractors to confuse the examinee. The automatic
method obtains highest F-score on the data sets of gradable adjectives (0.90), antipodals
(0.89), and complementaries (0.89). The precisions and recalls for these opposites are
significantly higher than those of disjoint opposites. The recall for reversives is also sig-
nificantly lower than that for the gradable adjectives, antipodals, and complementaries,
but precision on reversives is quite good (0.93).
Table 12 shows the precision, recall, and F-score obtained by the method on the
data sets corresponding to different parts of speech. Observe that performance on all
parts of speech is fairly high. The method deals with adverb pairs best (F-score of 0.89),
and the lowest performance is for verbs (F-score of 0.80). The differences in precision
values between various parts of speech are not significant. The recall obtained on the
adverbs is significantly higher than that obtained on adjectives, however, and the recall
on adjectives is significantly higher than that obtained on verbs. The difference between
the recalls on adverbs and nouns is not significant. We used the Fisher Exact Test and a
confidence interval of 95% for all significance testing reported in this section.
Table 12
Performance of the automatic method on contrast questions, where different question sets
correspond to different parts of speech.

              # instances    P     R     F
Adjectives          551     0.92  0.79  0.85
Adverbs             165     0.95  0.84  0.89
Nouns               330     0.93  0.81  0.87
Verbs               226     0.93  0.71  0.80
ALL               1,269     0.93  0.79  0.85
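The significance tests in this section compare the correct/incorrect counts of two systems; a sketch using SciPy's implementation of the Fisher Exact Test is given below.

```python
from scipy.stats import fisher_exact

def significantly_different(correct_a, total_a, correct_b, total_b,
                            alpha=0.05):
    """Two-sided Fisher Exact Test on the 2x2 table of correct and
    incorrect counts for systems A and B."""
    table = [[correct_a, total_a - correct_a],
             [correct_b, total_b - correct_b]]
    _, p_value = fisher_exact(table)
    return p_value < alpha
```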
7.3 Distinguishing Synonyms from Opposites
Our third evaluation follows that of Lin et al. (2003) and Turney (2008). We developed
a system for automatically distinguishing synonyms from opposites, and applied it to
two data sets. The approach and experiments are described herein.
7.3.1 Data. Lin et al. (2003) compiled 80 pairs of synonyms and 80 pairs of opposites
from the Webster’s Collegiate Thesaurus (Kay 1988) such that each word in a pair is
among the 50 words distributionally most similar to the other. (Distributional
similarity was calculated using the algorithm proposed by Lin [1998].) Turney
(2008) compiled 136 pairs of words (89 opposites and 47 synonyms) from various Web
sites for learners of English as a second language; the objective for the learners is to
identify whether the words in a pair are opposites or synonyms of each other. The
goals of this evaluation are to determine whether our automatic method can distinguish
opposites from near-synonyms, and to compare our method with the closest related
work on an evaluation task for which published results are already available.
7.3.2 Method. The core of our method is this:
1. Word pairs that occur in the same thesaurus category are close in meaning
and so are marked as synonyms.
2. Word pairs that occur in contrasting thesaurus categories or paragraphs
(as described in Section 6.2.1 above) are marked as opposites.
Opposites usually occur in different thesaurus categories, but they can sometimes
be found in the same category. For example, the word ascent is listed in
the Macquarie Thesaurus categories of 49 (CLIMBING) and 694 (SLOPE), whereas the word
descent is listed in the categories 40 (ARISTOCRACY), 50 (DROPPING), 538 (PARENTAGE),
and 694 (SLOPE). Observe that ascent and descent are both listed in the same category
694 (SLOPE), which makes sense here because both words are pertinent to the concept of
slope. On the other hand, two separate clues independently inform our system that the
words are opposites of each other: (1) Category 49 has the word upwardness in the same
paragraph as ascent, and category 50 has the word downwardness in the same paragraph
as descent. The 13th affix pattern from Table 9 (upX and downX) indicates that the two
thesaurus paragraphs have contrasting meaning. Thus, ascent and descent occur in prime
contrasting thesaurus paragraphs. (2) One of the ascent categories (49) is adjacent to one
of the descent categories (50), and further this adjacent category pair has been manually
marked as contrasting.
Thus the words in a pair may be deemed both synonyms and opposites simultane-
ously by our methods of determining synonyms and opposites, respectively. Some of
the features we use to determine opposites were found to be more precise (e.g., words
listed in adjacent categories) than others (e.g., categories identified as contrasting based
on affix and WordNet seeds), however. Thus we apply the following rules as a decision
list: If one rule fires, then the subsequent rules are ignored.
1. Rule 1 (high confidence for opposites): If the words in a pair occur in
adjacent thesaurus categories, then they are marked as opposites.

2. Rule 2 (high confidence for synonyms): If both the words in a pair occur
in the same thesaurus category, then they are marked as synonyms.
3. Rule 3 (medium confidence for opposites): If the words in a pair occur in
prime contrasting thesaurus paragraphs, as determined by an affix-based
or WordNet seed set, then they are marked as opposites.
If a word pair is not tagged as synonym or opposite: (a) the system can refrain
from attempting an answer (this will attain high precision), or (b) the system can
randomly guess the lexical relation (this will obtain 50% accuracy for the pairs), or (c) it
could mark all remaining word pairs with the predominant lexical relation in the data
(this will obtain an accuracy proportional to the skew in distribution of opposites and
synonyms). For example, if after step 3 the system finds that 70% of the marked word
pairs were tagged as opposites and 30% as synonyms, then it could mark every hitherto
untagged word pair (a word pair for which it has insufficient information) as an opposite.
We implemented all three variants. Note that option (b) is expected to perform
poorly compared with option (c), but we include it in our evaluation to measure the
usefulness of option (c).
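A minimal sketch of this decision list and its fallback variants follows; the three predicate arguments stand in for the thesaurus tests of Section 6.2.1 and are assumptions of the sketch, not names from our implementation.

```python
import random

def synonym_or_opposite(w1, w2, adjacent_categories, same_category,
                        prime_contrasting, fallback="predominant",
                        predominant="opposite"):
    """Decision list: the first rule that fires determines the label."""
    if adjacent_categories(w1, w2):   # Rule 1: high confidence, opposite
        return "opposite"
    if same_category(w1, w2):         # Rule 2: high confidence, synonym
        return "synonym"
    if prime_contrasting(w1, w2):     # Rule 3: medium confidence, opposite
        return "opposite"
    if fallback == "refrain":         # variant (a)
        return None
    if fallback == "random":          # variant (b)
        return random.choice(["synonym", "opposite"])
    return predominant                # variant (c)
```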
7.3.3 Results and Discussion. Table 13 shows the precision (P), recall (R), and balanced
F-score (F) of various systems and baselines in identifying synonyms and opposites
from the data set described in Lin et al. (2003). We will refer to this data set as LZQZ
(the first letters of the authors’ last names).
If a system guesses at random (random baseline) it will obtain an accuracy of 50%.
Choosing opposites (or synonyms) as the predominant class also obtains an accuracy of
50% because the data set has an equal number of opposites and synonyms. Published
results on LZQZ (Lin et al. 2003) are shown here again for convenience. The results
obtained with our system and the three variations on handling word pairs for which
it does not have enough information are shown in the last three rows. The precision of
our method in configuration (a) is significantly higher than that of Lin et al. (2003), with
95% confidence according to the Fisher Exact Test (Agresti 1990). Because precision and
recall are the same for configurations (b) and (c), as well as for the methods described in
Lin et al. (2003) and Turney (2011), we can also refer to these results simply as accuracy.
Table 13
Results obtained on the synonym-or-opposite questions in LZQZ. The best performing systems
are marked in bold. The differences in precision and recall between the method of Lin et al.
(2003) and our method in configurations (b) and (c) are not statistically significant.

                                           P     R     F
Baselines:
  a. random baseline                       0.50  0.50  0.50
  b. supervised most-frequent baseline†    0.50  0.50  0.50
Related work:
  a. Lin et al. (2003)                     0.90  0.90  0.90
  b. Turney (2011)                         0.82  0.82  0.82
Our method: if no information
  a. refrain from guessing                 0.98  0.78  0.87
  b. make random guess                     0.88  0.88  0.88
  c. mark the predominant class‡           0.87  0.87  0.87

†This data set has an equal number of opposites and synonyms. Results reported are for
choosing opposites as the predominant class.
‡The system concluded that opposites were slightly more frequent than synonyms.
Table 14
Results obtained on the synonym-or-opposite questions in TURN. The best performing systems
are marked in bold.

                                           P     R     F
Baselines:
  a. random baseline                       0.50  0.50  0.50
  b. supervised most-frequent baseline†    0.65  0.65  0.65
Related work:
  a. Turney (2008)                         0.75  0.75  0.75
  b. Lin et al. (2003)                     0.35  0.35  0.35
Our method: if no information
  a. refrain from guessing                 0.97  0.69  0.81
  b. make random guess                     0.84  0.84  0.84
  c. mark the predominant class‡           0.90  0.90  0.90

†About 65.4% of the pairs in this data set are opposites, so this row reports baseline results
when choosing opposites as the predominant class.
‡The system concluded that opposites were much more frequent than synonyms.
We found that the differences in accuracies between the method of Lin et al. (2003) and
our method in configurations (b) and (c) are not statistically significant. The method by
Lin et al. (2003) and our method in configuration (b) have significantly higher accuracy
than the method described in Turney (2011), however. The lexical contrast features used
in configurations (a), (b), and (c) correspond to row i in Table 10. The next subsection
presents an analysis of the usefulness of the different features listed in Table 10.
Observe that when our method refrains from guessing in case of insufficient infor-
mation, it obtains excellent precision (0.98), while still providing very good coverage
(0.78). As expected, the results obtained with (b) and (c) do not differ much from each
other because the data set has an equal number of synonyms and opposites. (Note
that the system was not privy to this information.) After step 3 of the algorithm,
however, the system had marked 65 pairs as opposites and 63 pairs as synonyms,
and so it concluded that opposites are slightly more dominant in this data set and
therefore the guess-predominant-class variant marked all previously unmarked pairs as
opposites.
It should be noted that the LZQZ data set was chosen from a list of high-frequency
terms. This was necessary to increase the probability of finding sentences in a corpus
where the target pair occurred in one of the chosen patterns proposed by Lin et al.
(2003). As shown in Table 10, the Lin et al. (2003) patterns have a very low coverage
otherwise. Further, the test data compiled by Lin et al. only had opposites whereas the
contrast questions had many contrasting word pairs that were not opposites.
Table 14 shows results on the data set described in Turney (2008). We will refer to
this data set as TURN. The supervised baseline of always guessing the most frequent
class (in this case, opposites) obtains an accuracy of 65.4% (P = R = F = 0.654).
Turney (2008) obtains an accuracy of 75% using a supervised method and 10-fold
cross-validation. A re-implementation of the method proposed by Lin et al. (2003), as
described in Section 7.1.2, did not recognize any of the word pairs in TURN as opposites;
that is, none of the word pairs in TURN occurred in the Google n-gram corpus in
patterns used by Lin et al. (2003). Thus it marked all words in TURN as synonyms.
The results obtained with our method are shown in the last three rows. The precision
and recall of our method in configurations (b) and (c) are significantly higher than those
obtained by the methods by Turney (2008) and Lin et al. (2003), with 95% confidence
according to the Fisher Exact Test (Agresti 1990).
Observe that once again our method, especially the variant that refrains from
guessing in case of insufficient information, obtains excellent precision (0.97), while
still providing good coverage (0.69). Also observe that results obtained by guessing the
predominant class (method (c)) are markedly better than those obtained by randomly
guessing in case of insufficient information (method (b)). This is because, as mentioned
earlier, the distribution of opposites and synonyms is somewhat skewed in this data
set (65.4% of the pairs are opposites). Of course, again the system was not privy to this
information, but method (a) marked 58 pairs as opposites and 39 pairs as synonyms.
Therefore, the system concluded that opposites are more dominant and method (c)
marked all previously unmarked pairs as opposites, obtaining an accuracy of 90%.
Recall that in Section 7.3.2 we described how opposite pairs may occasionally be
listed in the same thesaurus category because the category may be pertinent to both
words. For 12 of the word pairs in the Lin et al. data and 3 of the word pairs in the
Turney data, both words occurred together in the same thesaurus category, and yet
the system marked them as opposites because they occurred in adjacent thesaurus
categories (Class I). For 11 of the 12 pairs from LZQZ and for all 3 of the TURN pairs,
this resulted in the correct answer. These pairs are shown in Table 15. By contrast, only
one of the term pairs in this table occurred in one of Lin’s patterns of oppositeness, and
was thus the only one correctly identified by their method as a pair of opposites.
It should also be noted that a word may have multiple meanings such that it may be
synonymous to a word in one sense and opposite to it in another sense. Such pairs are
also expected to be marked as opposites by our system. Two such pairs in the Turney
(2008) data are: fantastic–awful and terrific–terrible. The word awful can mean INSPIRING
AWE (and so close to the meaning of fantastic in some contexts), and also EXTREMELY
DISAGREEABLE (and so opposite to fantastic). The word terrific can mean FRIGHTFUL
(and so close to the meaning of terrible), and also UNUSUALLY FINE (and so opposite
to terrible). Such pairs are probably not the best synonym-or-opposite questions. Faced
with these questions, however, humans probably home in on the dominant senses of
the target words to determine an answer. For example, in modern-day English terrific
is used more frequently in the sense of UNUSUALLY FINE than in the sense of FRIGHTFUL,
and so most people will say that terrific and terrible are opposites (in fact, that is the
solution provided with these data).

Table 15
Pairs from LZQZ and TURN that have at least one category in common but are still marked as
opposites by our method.

LZQZ                                           TURN
word 1     word 2        official solution     word 1      word 2     official solution
amateur    professional  opposite              fantastic   awful      opposite
ascent     descent       opposite              dry         wet        opposite
back       front         opposite              terrific    terrible   opposite
bottom     top           opposite
broadside  salvo         synonym
entrance   exit          opposite
heaven     hell          opposite
inside     outside       opposite
junior     senior        opposite
lie        truth         opposite
majority   minority      opposite
nadir      zenith        opposite
strength   weakness      opposite
Table 16
Results for individual components as well as certain combinations of components on the
synonym-or-opposite questions in LZQZ. The best performing configuration is shown in bold.

                                                          P     R     F
Baselines:
  a. random baseline                                      0.50  0.50  0.50
  b. supervised most-frequent baseline†                   0.50  0.50  0.50
Our methods:
  a. affix-generated seeds only                           0.86  0.54  0.66
  b. WordNet seeds only                                   0.88  0.65  0.75
  c. both seed sets (a + b)                               0.88  0.65  0.75
  d. adjacency heuristic only                             0.95  0.74  0.83
  e. manual annotation of adjacent categories             0.98  0.74  0.84
  f. affix seed set and adjacency heuristic (a + d)       0.95  0.75  0.84
  g. both seed sets and adjacency heuristic (a + b + d)   0.95  0.78  0.86
  h. affix seed set and annotation of adjacent
     categories (a + e)                                   0.98  0.77  0.86
  i. both seed sets and annotation of adjacent
     categories (a + b + e)                               0.98  0.78  0.87

†This data set has an equal number of opposites and synonyms, so either class can be chosen as
predominant. Baseline results shown here are for choosing opposites as the predominant class.
7.3.4 Analysis. We carried out additional experiments to determine how useful individ-
ual components of our method were in solving the synonym-or-opposite questions.
The results on LZQZ are shown in Table 16 and the results on TURN are shown in
Table 17. These results are for the case when the system refrains from guessing in case
of insufficient information. The rows in the tables correspond to the rows in Table 10
shown earlier that gave results on the contrast questions.
Observe that the affix-generated seeds give a marked improvement over the base-
lines, and that knowing which categories are contrasting (either from the adjacency
heuristic or manual annotation of adjacent categories) proves to be the most useful
feature. Also note that even though manual annotation and WordNet seeds eventually
lead to the best results (F = 0.87 for LZQZ and F = 0.81 for TURN), using only the
adjacency heuristic and the affix-generated seeds gives competitive results (F = 0.84 for
the Lin set and F = 0.78 for the Turney set).

We are interested in developing methods
to make the approach cross-lingual, so that we can use a thesaurus from one language
(say, English) to compute lexical contrast in a resource-poor target language.
The precision of our method is very good (>0.95). Thus future work will be aimed
at improving recall. This can be achieved by developing methods to generate more seed
opposites. This is also an avenue through which some of the pattern-based approaches
(such as the methods described by Lin et al. [2003] and Turney [2008]) can be incorpo-
rated into our method. For instance, we could use n-gram patterns such as “either X or
Y” and “from X to Y” to identify pairs of opposites that can be used as additional seeds
in our method.
Recall can also be improved by using affix patterns in other languages to identify
contrasting thesaurus paragraphs in the target language. Thus, constructing a cross-
lingual framework in which words from one language will be connected to thesaurus
categories in another language will be useful not only in computing lexical contrast in a
resource-poor language, but also in using affix information from different languages to
improve results in the target, possibly even resource-rich, language.
8. Conclusions and Future Work
Detecting semantically contrasting word pairs has many applications in natural lan-
guage processing. In this article, we proposed a method for computing lexical contrast
that is based on the hypothesis that if a pair of words, A and B, are contrasting, then
there is a pair of opposites, C and D, such that A and C are strongly related and B and
D are strongly related—the contrast hypothesis. We used pointwise mutual information
to determine the degree of contrast between two contrasting words. The method outper-
formed others on the task of solving a large set of “choose the most contrasting word”
questions wherein the system not only identified whether two words are contrasting
but also distinguished between pairs of contrasting words with differing degrees of
contrast. We further determined performance of the method on five different kinds of
opposites and across four parts of speech. We used our approach to solve synonym-or-
opposite questions described in Turney (2008) and Lin et al. (2003).
Because opposites were central to our methodology, we designed a questionnaire to
better understand different kinds of opposites, which we crowdsourced with Amazon
Mechanical Turk. We devoted extra effort to making sure the questions were phrased
in a simple yet clear manner. Additionally, a quality control method was developed,
using a word-choice question, to automatically identify and discard dubious and outlier
annotations. From these data, we created a data set of different kinds of opposites that
we have made available.
Table 17
Results for individual components as well as certain combinations of components on the
synonym-or-opposite questions in TURN. The best performing configuration is shown in bold.

                                                          P     R     F
Baselines:
  a. random baseline                                      0.50  0.50  0.50
  b. supervised most-frequent baseline†                   0.65  0.65  0.65
Our methods:
  a. affix-generated seeds only                           0.92  0.54  0.68
  b. WordNet seeds only                                   0.93  0.61  0.74
  c. both seed sets (a + b)                               0.93  0.61  0.74
  d. adjacency heuristic only                             0.94  0.60  0.74
  e. manual annotation of adjacent categories             0.96  0.60  0.74
  f. affix seed set and adjacency heuristic (a + d)       0.95  0.67  0.78
  g. both seed sets and adjacency heuristic (a + b + d)   0.95  0.68  0.79
  h. affix seeds and annotation of adjacent categories
     (a + e)                                              0.97  0.68  0.80
  i. both seed sets and annotation of adjacent
     categories (a + b + e)                               0.97  0.69  0.81

†About 65.4% of the pairs in this data set are opposites, so this row reports baseline results
when choosing opposites as the predominant class.
Table 18
A summary of the data created as part of this research on lexical contrast. Available for
download at: http://www.purl.org/net/saif.mohammad/research.

Name                                                  # of items
Affix patterns that tend to generate opposites        15 rules
Contrast questions:
  GRE preparatory questions:
    Development set                                   162 questions
    Test set                                          790 questions
  Newly created questions                             1,269 questions
Data from work on types of opposites:
  Crowdsourced questionnaires                         4 sets (one for every pos)
  Responses to questionnaires                         12,448 assignments (in four files)
Lexicon of opposites generated by the
Mohammad et al. method:
  Class I opposites                                   3.5 million word pairs
  Class II opposites                                  2.5 million word pairs
Manually identified contrasting categories
in the Macquarie Thesaurus                            209 category pairs
Word pairs used in Section 5 experiments:
  WordNet opposites set                               1,358 word pairs
  WordNet random word pairs set                       1,358 word pairs
  WordNet synonyms set                                1,358 word pairs
We determined the amount of agreement among humans in
identifying lexical contrast, and also in identifying different kinds of opposites. We also
showed that a large number of opposing word pairs have properties pertaining to more
than one kind. Table 18 summarizes the data created as part of this research on lexical
contrast, all of which is available for download.
New questions that target other types of lexical contrast not addressed in this paper
may be added in the future. It may be desirable to break a complex question into two
or more simpler questions. For example, if a word pair is considered to be a certain
kind of opposite when it has both properties M and N, then it is best to have two
separate questions asking whether the word pair has properties M and N, respectively.
The crowdsourcing study can be replicated for other languages by asking the same
questions in the target language for words in the target language. Note, however, that
as of February 2012, most of the Mechanical Turk participants are native speakers of
English, certain Indian languages, and some European languages.
Our future goals include porting this approach to a cross-lingual framework to
determine lexical contrast in a resource-poor language by using a bilingual lexicon to
connect the words in that language with words in another resource-rich language. We
can then use the structure of the thesaurus from the resource-rich language as described
in this article to detect contrasting categories of terms. This is similar to the approach
described by Mohammad et al. (2007), who compute semantic distance in a resource-
poor language by using a bilingual lexicon and a sense disambiguation algorithm to
connect text in the resource-poor language with a thesaurus in a different language. This
enables automatic discovery of lexical contrast in a language even if it does not have a
Roget-like thesaurus. The cross-lingual method still requires a bilingual lexicon to map
words between the target language and the language with the thesaurus, however.
Our method used only one Roget-like published thesaurus, but even more gains
may be obtained by combining many dictionaries and thesauri using methods proposed
by Ploux and Victorri (1998) and others.
We modified our algorithm to create lexicons of words associated with positive
and negative sentiment (Mohammad, Dunne, and Dorr 2009). We also used the lexi-
cal contrast algorithm in some preliminary experiments to identify contrast between
sentences and use that information to improve cohesion in automatic summarization
(Mohammad et al. 2008). Since its release, the lexicon of contrasting word pairs has been used
to improve textual paraphrasing and in turn help improve machine translation (Marton,
El Kholy, and Habash 2011). We are interested in using contrasting word pairs as seeds
to identify phrases that convey contrasting meaning. These will be especially helpful
in machine translation where current systems have difficulty separating translation
hypotheses that convey the same meaning as the source sentences, and those that
do not.
Given a particular word, our method computes a single score as the degree of
contrast with another. A word may be more or less contrasting with another word when
used in different contexts (Murphy 2003), however. Just as in the lexical substitution task
(McCarthy and Navigli 2009), where a system has to find the word that can best replace
a target word in context to preserve meaning, one can imagine a lexical substitution
task to generate contradictions where the objective is to replace a given target word
with one that is contrasting so as to generate a contradiction. Our future work includes
developing a context-sensitive measure of lexical contrast that can be used for exactly
such a task.
There is considerable evidence that children are aware of lexical contrast at a
very early age (Murphy and Jones 2008). They rely on it to better understand var-
ious concepts and in order to communicate effectively. Thus we believe that com-
puter algorithms that deal with language can also obtain significant gains through the
ability to detect contrast and the ability to distinguish between differing degrees of
contrast.
Acknowledgments
We thank Tara Small, Smaranda Muresan,
and Siddharth Patwardhan for their valuable
feedback. This work was supported in part
by the National Research Council Canada;
in part by the National Science Foundation
under grant no. IIS-0705832; in part by the
Human Language Technology Center of
Excellence; and in part by the Natural
Sciences and Engineering Research Council
of Canada. Any opinions, findings, and
conclusions or recommendations expressed
in this material are those of the authors and
do not necessarily reflect the views of the
sponsors.
References
Agresti, Alan. 1990. Categorical Data Analysis.
Wiley, New York, NY.
Banerjee, Satanjeev and Ted Pedersen. 2003.
Extended gloss overlaps as a measure
of semantic relatedness. In International
Joint Conference on Artificial Intelligence,
pages 805–810, Acapulco, Mexico.
Bejar, Isaac I., Roger Chaffin, and Susan
Embretson. 1991. Cognitive and
Psychometric Analysis of Analogical Problem
Solving. Springer-Verlag, New York.
Brants, Thorsten and Alex Franz. 2006. Web
1T 5-gram version 1. Linguistic Data
Consortium, Philadelphia, PA.
Budanitsky, Alexander and Graeme Hirst.
2006. Evaluating WordNet-based
measures of semantic distance.
Computational Linguistics, 32(1):13–47.
Burnard, Lou. 2000. Reference Guide for the
British National Corpus (World Edition).
Oxford University Computing Services,
Oxford.
Charles, Walter G. and George A. Miller.
1989. Contexts of antonymous adjectives.
Applied Psycholinguistics, 10:357–375.
Church, Kenneth and Patrick Hanks.
1990. Word association norms, mutual
information and lexicography.
Computational Linguistics, 16(1):22–29.
Cruse, David A. 1986. Lexical Semantics.
Cambridge University Press, Cambridge.
Curran, James R. 2004. From Distributional to
Semantic Similarity. Ph.D. thesis, School of
Informatics, University of Edinburgh,
Edinburgh, UK.
de Marneffe, Marie-Catherine, Anna
Rafferty, and Christopher D. Manning.
2008. Finding contradictions in text. In
Proceedings of the 46th Annual Meeting of the
Association for Computational Linguistics
(ACL-08), pages 1039–1047, Columbus, OH.
Deese, James. 1965. The Structure of
Associations in Language and Thought.
The Johns Hopkins University Press,
Baltimore, MD.
Egan, Rose F. 1984. Survey of the history
of English synonymy. Webster’s New
Dictionary of Synonyms, pages 5a–25a.
Fellbaum, Christiane. 1995. Co-occurrence
and antonymy. International Journal of
Lexicography, 8:281–303.
Gross, Derek, Ute Fischer, and George A.
Miller. 1989. Antonymy and the
representation of adjectival meanings.
Journal of Memory and Language, 28(1):92–106.
Harabagiu, Sanda M., Andrew Hickl,
and Finley Lacatusu. 2006.
Negation, contrast and contradiction in
text processing. In Proceedings of the 21st
National Conference on Artificial Intelligence
(AAAI-06), pages 755–762, Boston, MA.
Hatzivassiloglou, Vasileios and Kathleen R.
McKeown. 1997. Predicting the semantic
orientation of adjectives. In Proceedings of
the Eighth Conference on European Chapter of
the Association for Computational Linguistics,
pages 174–181, Madrid.
Hearst, Marti. 1992. Automatic acquisition
of hyponyms from large text corpora. In
Proceedings of the Fourteenth International
Conference on Computational Linguistics,
pages 539–546, Nantes.
Jones, Steven, Carita Paradis, M. Lynne
Murphy, and Caroline Willners. 2007.
Googling for ‘opposites’: A Web-based
study of antonym canonicity. Corpora,
2(2):129–154.
Justeson, John S. and Slava M. Katz.
1991. Co-occurrences of antonymous
adjectives and their contexts.
Computational Linguistics, 17:1–19.
Kagan, Jerome. 1984. The Nature of the Child.
Basic Books, New York.
Kay, Maire Weir, editor. 1988. Webster’s
Collegiate Thesaurus. Merriam-Webster,
Springfield, MA.
Kempson, Ruth M. 1977. Semantic Theory.
Cambridge University Press, Cambridge.
Lehrer, Adrienne and Keith Lehrer. 1982.
Antonymy. Linguistics and Philosophy,
5:483–501.
Lin, Dekang. 1998. Automatic retrieval and
clustering of similar words. In Proceedings
of the 17th International Conference on
Computational Linguistics, pages 768–773,
Montreal.
Lin, Dekang, Shaojun Zhao, Lijuan Qin,
and Ming Zhou. 2003. Identifying
synonyms among distributionally
similar words. In Proceedings of the 18th
International Joint Conference on Artificial
Intelligence (IJCAI-03), pages 1492–1493,
Acapulco.
Lobanova, Anna, Tom van der Kleij, and
Jennifer Spenader. 2010. Defining
antonymy: A corpus-based study of
opposites by lexico-syntactic patterns.
International Journal of Lexicography,
23(1):19–53.
Lucerto, Cupertino, David Pinto, and
Héctor Jiménez-Salazar. 2002. An
automatic method to identify antonymy.
In Workshop on Lexical Resources and the
Web for Word Sense Disambiguation,
pages 105–111, Puebla.
Lyons, John. 1977. Semantics, volume 1.
Cambridge University Press, Cambridge.
Marcu, Daniel and Abdesammad Echihabi.
2002. An unsupervised approach
to recognizing discourse relations.
In Proceedings of the 40th Annual Meeting
of the Association for Computational
Linguistics (ACL-02), pages 368–375,
Philadelphia, PA.
Marton, Yuval, Ahmed El Kholy, and
Nizar Habash. 2011. Filtering antonymous,
trend-contrasting, and polarity-dissimilar
distributional paraphrases for improving
statistical machine translation. In
Proceedings of the Sixth Workshop on
Statistical Machine Translation,
pages 237–249, Edinburgh.
McCarthy, Diana and Roberto Navigli.
2009. The English lexical substitution
task. Language Resources and Evaluation,
43(2):139–159.
Mihalcea, Rada and Carlo Strapparava.
2005. Making computers laugh:
Investigations in automatic humor
recognition. In Proceedings of the
Conference on Human Language Technology
and Empirical Methods in Natural Language
Processing, pages 531–538, Vancouver.
Mohammad, Saif, Bonnie Dorr, and
Graeme Hirst. 2008. Computing word-pair
antonymy. In Proceedings of the Conference
on Empirical Methods in Natural Language
Processing (EMNLP-2008), pages 982–991,
Waikiki, HI.
Mohammad, Saif, Bonnie J. Dorr,
Melissa Egan, Nitin Madnani, David Zajic,
and Jimmy Lin. 2008. Multiple alternative
sentence compressions and word-pair
antonymy for automatic text
summarization and recognizing textual
entailment. In Text Analysis Conference
(TAC), Gaithersburg, MD.
Mohammad, Saif, Cody Dunne, and
Bonnie Dorr. 2009. Generating
high-coverage semantic orientation
lexicons from overtly marked words
and a thesaurus. In Proceedings of
Empirical Methods in Natural Language
Processing (EMNLP-2009), pages 599–608,
Singapore.
Mohammad, Saif, Iryna Gurevych,
Graeme Hirst, and Torsten Zesch. 2007.
Cross-lingual distributional profiles of
concepts for measuring semantic distance.
In Proceedings of the Joint Conference on
Empirical Methods in Natural Language
Processing and Computational Natural
Language Learning (EMNLP/CoNLL-2007),
pages 571–580, Prague.
Murphy, Gregory L. and Jane M. Andrew.
1993. The conceptual basis of antonymy
and synonymy in adjectives. Journal of
Memory and Language, 32(3):301–319.
Murphy, M. Lynne. 2003. Semantic Relations
and the Lexicon: Antonymy, Synonymy, and
Other Paradigms. Cambridge University
Press, Cambridge.
Murphy, M. Lynne and Steven Jones.
2008. Antonyms in children’s and
child-directed speech. First Language,
28(4):403–430.
Pang, Bo and Lillian Lee. 2008. Opinion
mining and sentiment analysis.
Foundations and Trends in Information
Retrieval, 2(1–2):1–135.
Pang, Bo, Lillian Lee, and Shivakumar
Vaithyanathan. 2002. Thumbs up?:
Sentiment classification using machine
learning techniques. In Proceedings of the
Conference on Empirical Methods in Natural
Language Processing, pages 79–86,
Philadelphia, PA.
Paradis, Carita, Caroline Willners,
and Steven Jones. 2009. Good and
bad opposites: Using textual and
experimental techniques to measure
antonym canonicity. The Mental Lexicon,
4(3):380–429.
Ploux, Sabine and Bernard Victorri. 1998.
Construction d’espaces sémantiques à
l’aide de dictionnaires de synonymes.
TAL, 39(1):161–182.
Schwab, Didier, Mathieu Lafourcade, and
Violaine Prince. 2002. Antonymy and
conceptual vectors. In Proceedings of
the 19th International Conference on
Computational Linguistics (COLING-02),
pages 904–910, Taipei, Taiwan.
Turney, Peter D. 2008. A uniform approach
to analogies, synonyms, antonyms, and
associations. In Proceedings of the 22nd
International Conference on Computational
Linguistics (COLING-08), pages 905–912,
Manchester.
Turney, Peter D. 2011. Analogy perception
applied to seven tests of word
comprehension. Journal of Experimental
and Theoretical Artificial Intelligence –
Psychometric Artificial Intelligence,
23(3):343–362.
Voorhees, Ellen M. 2008. Contradictions and
justifications: Extensions to the textual
entailment task. In Proceedings of the
46th Annual Meeting of the Association for
Computational Linguistics (ACL-08), pages 63–71,
Columbus, OH.