Transactions of the Association for Computational Linguistics, 1 (2013) 353–366. Action Editor: Patrick Pantel.
Submitted 5/2013; Revised 7/2013; Published 10/2013. ©2013 Association for Computational Linguistics.
Distributional Semantics Beyond Words: Supervised Learning of Analogy and Paraphrase

Peter D. Turney
National Research Council Canada
Information and Communications Technologies
Ottawa, Ontario, Canada, K1A 0R6
peter.turney@nrc-cnrc.gc.ca

Abstract

There have been several efforts to extend distributional semantics beyond individual words, to measure the similarity of word pairs, phrases, and sentences (briefly, tuples; ordered sets of words, contiguous or noncontiguous). One way to extend beyond words is to compare two tuples using a function that combines pairwise similarities between the component words in the tuples. A strength of this approach is that it works with both relational similarity (analogy) and compositional similarity (paraphrase). However, past work required hand-coding the combination function for different tasks. The main contribution of this paper is that combination functions are generated by supervised learning. We achieve state-of-the-art results in measuring relational similarity between word pairs (SAT analogies and SemEval 2012 Task 2) and measuring compositional similarity between noun-modifier phrases and unigrams (multiple-choice paraphrase questions).

1 Introduction

Harris (1954) and Firth (1957) hypothesized that words that appear in similar contexts tend to have similar meanings. This hypothesis is the foundation for distributional semantics, in which words are represented by context vectors. The similarity of two words is calculated by comparing the two corresponding context vectors (Lund et al., 1995; Landauer and Dumais, 1997; Turney and Pantel, 2010).

Distributional semantics is highly effective for measuring the semantic similarity between individual words. On a set of eighty multiple-choice synonym questions from the test of English as a foreign language (TOEFL), a distributional approach recently achieved 100% accuracy (Bullinaria and Levy, 2012). However, it has been difficult to extend distributional semantics beyond individual words, to word pairs, phrases, and sentences.

Moving beyond individual words, there are various types of semantic similarity to consider. Here we focus on paraphrase and analogy. Paraphrase is similarity in the meaning of two pieces of text (Androutsopoulos and Malakasiotis, 2010). Analogy is similarity in the semantic relations of two sets of words (Turney, 2008a).

It is common to study paraphrase at the sentence level (Androutsopoulos and Malakasiotis, 2010), but we prefer to concentrate on the simplest type of paraphrase, where a bigram paraphrases a unigram. For example, dog house is a paraphrase of kennel. In our experiments, we concentrate on noun-modifier bigrams and noun unigrams.

Analogies map terms in one domain to terms in another domain (Gentner, 1983). The familiar analogy between the solar system and the Rutherford-Bohr atomic model involves several terms from the domain of the solar system and the domain of the atomic model (Turney, 2008a).

The simplest type of analogy is proportional analogy, which involves two pairs of words (Turney, 2006b). For example, the pair ⟨cook, raw⟩ is analogous to the pair ⟨decorate, plain⟩. If we cook a thing, it is no longer raw; if we decorate a thing, it is no longer plain. The semantic relations between cook and raw are similar to the semantic relations between decorate and plain. In the following experiments, we focus on proportional analogies.
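Before moving on, it may help to make the word-level similarity computation that underlies everything below concrete. The following is a minimal Python sketch, not taken from the paper; the context vectors and their dimensions are invented for illustration, and real vectors would come from a word–context co-occurrence matrix built over a large corpus.

```python
import numpy as np

# Toy context vectors for two words (invented values; in practice each
# dimension would hold a count or PPMI weight for one context).
dog = np.array([12.0, 3.0, 0.0, 7.0, 1.0])
cat = np.array([10.0, 4.0, 1.0, 6.0, 0.0])

def cosine(a, b):
    """Similarity of two words as the cosine of their context vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(dog, cat))  # near 1.0 for distributionally similar words
```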
Erk (2013) distinguished four approaches to extend distributional semantics beyond words. In the first, a single vector space representation for a phrase or sentence is computed from the representations of the individual words (Mitchell and Lapata, 2010; Baroni and Zamparelli, 2010). In the second, two phrases or sentences are compared by combining multiple pairwise similarity values (Socher et al., 2011; Turney, 2012). Third, weighted inference rules integrate distributional similarity and formal logic (Garrette et al., 2011). Fourth, a single space integrates formal logic and vectors (Clarke, 2012).

Taking the second approach, Turney (2012) introduced a dual-space model, with one space for measuring domain similarity (similarity of topic or field) and another for function similarity (similarity of role or usage). Similarities beyond individual words are calculated by functions that combine domain and function similarities of component words.

The dual-space model has been applied to measuring compositional similarity (paraphrase recognition) and relational similarity (analogy recognition). In experiments that tested for sensitivity to word order, the dual-space model performed significantly better than competing approaches (Turney, 2012).

A limitation of past work with the dual-space model is that the combination functions were hand-coded. Our main contribution is to show how hand-coding can be eliminated with supervised learning. For ease of reference, we will call our approach SuperSim (supervised similarity). With no modification of SuperSim for the specific task (relational similarity or compositional similarity), we achieve better results than previous hand-coded models.

Compositional similarity (paraphrase) compares two contiguous phrases or sentences (n-grams), whereas relational similarity (analogy) does not require contiguity. We use tuple to refer to both contiguous and noncontiguous word sequences.

We approach analogy as a problem of supervised tuple classification. To measure the relational similarity between two word pairs, we train SuperSim with quadruples that are labeled as positive and negative examples of analogies. For example, the proportional analogy ⟨cook, raw, decorate, plain⟩ is labeled as a positive example. A quadruple is represented by a feature vector, composed of domain and function similarities from the dual-space model and other features based on corpus frequencies. SuperSim uses a support vector machine (Platt, 1998) to learn the probability that a quadruple ⟨a, b, c, d⟩ consists of a word pair ⟨a, b⟩ and an analogous word pair ⟨c, d⟩. The probability can be interpreted as the degree of relational similarity between the two given word pairs.

We also approach paraphrase as supervised tuple classification. To measure the compositional similarity between an m-gram and an n-gram, we train the learning algorithm with (m + n)-tuples that are positive and negative examples of paraphrases. SuperSim learns to estimate the probability that a triple ⟨a, b, c⟩ consists of a compositional bigram ab and a synonymous unigram c. For instance, the phrase fish tank is synonymous with aquarium; that is, fish tank and aquarium have high compositional similarity. The triple ⟨fish, tank, aquarium⟩ is represented using the same features that we used for analogy. The probability of the triple can be interpreted as the degree of compositional similarity between the given bigram and unigram.

We review related work in Section 2. The general feature space for learning relations and compositions is presented in Section 3. The experiments with relational similarity are described in Section 4, and Section 5 reports the results with compositional similarity. Section 6 discusses the implications of the results. We consider future work in Section 7 and conclude in Section 8.
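As a concrete sketch of the tuple classification setup just described, the following Python fragment (assuming scikit-learn; not part of the original system) trains an SVM with Platt scaling on stand-in feature vectors. The featurization here is a placeholder; the actual features are defined in Section 3.

```python
import numpy as np
from sklearn.svm import SVC

# Stand-in training data: in SuperSim, each row would be the feature
# vector of a labeled quadruple such as ("cook", "raw", "decorate",
# "plain") with label 1 (analogy) or 0 (not an analogy).
rng = np.random.default_rng(0)
X_train = rng.random((40, 16))           # hypothetical feature vectors
y_train = rng.integers(0, 2, size=40)    # hypothetical analogy labels

# probability=True fits a sigmoid to the SVM outputs (Platt scaling),
# so predictions come back as probabilities rather than hard labels.
clf = SVC(kernel="rbf", probability=True)
clf.fit(X_train, y_train)

x_test = rng.random((1, 16))             # features for a new quadruple
print(clf.predict_proba(x_test)[0, 1])   # degree of relational similarity
```

The same machinery handles triples for paraphrase: only the tuples and labels change, not the learner.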
2 Related Work

In SemEval 2012, Task 2 was concerned with measuring the degree of relational similarity between two word pairs (Jurgens et al., 2012) and Task 6 (Agirre et al., 2012) examined the degree of semantic equivalence between two sentences. These two areas of research have been mostly independent, although Socher et al. (2012) and Turney (2012) present unified perspectives on the two tasks. We first discuss some work on relational similarity, then some work on compositional similarity, and lastly work that unifies the two types of similarity.
2.1 Relational Similarity

LRA (latent relational analysis) measures relational similarity with a pair–pattern matrix (Turney, 2006b). Rows in the matrix correspond to word pairs (a, b) and columns correspond to patterns that connect the pairs ("a for the b") in a large corpus. This is a holistic (noncompositional) approach to distributional similarity, since the word pairs are opaque wholes; the component words have no separate representations. A compositional approach to analogy has a representation for each word, and a word pair is represented by composing the representations for each member of the pair. Given a vocabulary of N words, a compositional approach requires N representations to handle all possible word pairs, but a holistic approach requires N^2 representations. Holistic approaches do not scale up (Turney, 2012). LRA required nine days to run.

Bollegala et al. (2008) answered the SAT analogy questions with a support vector machine trained on quadruples (proportional analogies), as we do here. However, their feature vectors are holistic, and hence there are scaling problems. Herdağdelen and Baroni (2009) used a support vector machine to learn relational similarity. Their feature vectors contained a combination of holistic and compositional features.

Measuring relational similarity is closely connected to classifying word pairs according to their semantic relations (Turney and Littman, 2005). Semantic relation classification was the focus of SemEval 2007 Task 4 (Girju et al., 2007) and SemEval 2010 Task 8 (Hendrickx et al., 2010).

2.2 Compositional Similarity

To extend distributional semantics beyond words, many researchers take the first approach described by Erk (2013), in which a single vector space is used for individual words, phrases, and sentences (Landauer and Dumais, 1997; Mitchell and Lapata, 2008; Mitchell and Lapata, 2010). In this approach, given the words a and b with context vectors a and b, we construct a vector for the bigram ab by applying vector operations to a and b. Mitchell and Lapata (2010) experiment with many different vector operations and find that element-wise multiplication performs well. The bigram ab is represented by c = a ⊙ b, where c_i = a_i · b_i. However, element-wise multiplication is commutative, so the bigrams ab and ba map to the same vector c. In experiments that test for order sensitivity, element-wise multiplication performs poorly (Turney, 2012); the sketch at the end of this subsection illustrates the problem.

We can treat the bigram ab as a unit, as if it were a single word, and construct a context vector for ab from occurrences of ab in a large corpus. This holistic approach to representing bigrams performs well when a limited set of bigrams is specified in advance (before building the word–context matrix), but it does not scale up, because there are too many possible bigrams (Turney, 2012).

Although the holistic approach does not scale up, we can generate a few holistic bigram vectors and use them to train a supervised regression model (Guevara, 2010; Baroni and Zamparelli, 2010). Given a new bigram cd, not observed in the corpus, the regression model can predict a holistic vector for cd, if c and d have been observed separately. We show in Section 5 that this idea can be adapted to train SuperSim without manually labeled data.

Socher et al. (2011) take the second approach described by Erk (2013), in which two sentences are compared by combining multiple pairwise similarity values. They construct a variable-sized similarity matrix X, in which the element x_ij is the similarity between the i-th phrase of one sentence and the j-th phrase of the other. Since supervised learning is simpler with fixed-sized feature vectors, the variable-sized similarity matrix is then reduced to a smaller fixed-sized matrix, to allow comparison of pairs of sentences of varying lengths.
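The order-insensitivity of element-wise multiplication noted above is easy to verify. A minimal sketch with toy vectors, invented for illustration:

```python
import numpy as np

# Element-wise (Hadamard) composition of two word vectors, as in
# Mitchell and Lapata (2010): c_i = a_i * b_i.
a = np.array([0.2, 0.7, 0.1])   # toy context vector for word a
b = np.array([0.5, 0.3, 0.9])   # toy context vector for word b

ab = a * b   # composed vector for the bigram "a b"
ba = b * a   # composed vector for the bigram "b a"

# The operation is commutative, so both word orders get the same vector:
print(np.array_equal(ab, ba))   # True -- word order is lost
```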
2.3 Unified Perspectives on Similarity

Socher et al. (2012) represent words and phrases with a pair, consisting of a vector and a matrix. The vector captures the meaning of the word or phrase and the matrix captures how a word or phrase modifies the meaning of another word or phrase when they are combined. They apply this matrix–vector representation to both compositions and relations.

Turney (2012) represents words with two vectors, a vector from domain space and a vector from function space. The domain vector captures the topic or field of the word and the function vector captures the
functional role of the word. This dual-space model is applied to both compositions and relations.

Here we extend the dual-space model of Turney (2012) in two ways: hand-coding is replaced with supervised learning, and two new sets of features augment domain and function space. Moving to supervised learning instead of hand-coding makes it easier to introduce new features.

In the dual-space model, parameterized similarity measures provided the input values for hand-crafted functions. Each task required a different set of hand-crafted functions. The parameters of the similarity measures were tuned using a customized grid search algorithm. The grid search algorithm was not suitable for integration with a supervised learning algorithm. The insight behind SuperSim is that, given appropriate features, a supervised learning algorithm can replace the grid search algorithm and the hand-crafted functions.

3 Features for Tuple Classification

We represent a tuple with four types of features, all based on frequencies in a large corpus. The first type of feature is the logarithm of the frequency of a word. The second type is the positive pointwise mutual information (PPMI) between two words (Church and Hanks, 1989; Bullinaria and Levy, 2007). Third and fourth are the similarities of two words in domain and function space (Turney, 2012).

In the following experiments, we use the PPMI matrix from Turney et al. (2011) and the domain and function matrices from Turney (2012).[1] The three matrices and the word frequency data are based on the same corpus, a collection of web pages gathered from university websites, containing 5 × 10^10 words.[2] All three matrices are word–context matrices, in which the rows correspond to terms (words and phrases) in WordNet.[3] The columns correspond to the contexts in which the terms appear; each matrix involves a different kind of context.

[1] The three matrices and the word frequency data are available on request from the author. The matrix files range from two to five gigabytes when packaged and compressed for distribution.
[2] The corpus was collected by Charles Clarke at the University of Waterloo. It is about 280 gigabytes of plain text.
[3] See http://wordnet.princeton.edu/ for information about WordNet.

Let ⟨x_1, x_2, ..., x_n⟩ be an n-tuple of words. The number of features we use to represent this tuple increases as a function of n.

The first set of features consists of log frequency values for each word x_i in the n-tuple. Let freq(x_i) be the frequency of x_i in the corpus. We define LF(x_i) as log(freq(x_i) + 1). If x_i is not in the corpus, freq(x_i) is zero, and thus LF(x_i) is also zero. There are n log frequency features, one LF(x_i) feature for each word in the n-tuple.

The second set of features consists of positive pointwise mutual information values for each pair of words in the n-tuple. We use the raw PPMI matrix from Turney et al. (2011). Although they computed the singular value decomposition (SVD) to project the row vectors into a lower-dimensional space, we need the original high-dimensional columns for our features. The raw PPMI matrix has 114,501 rows and 139,246 columns with a density of 1.2%. For each term in WordNet, there is a corresponding row in the raw PPMI matrix. For each unigram in WordNet, there are two corresponding columns in the raw PPMI matrix, one marked left and the other right.

Suppose x_i corresponds to the i-th row of the PPMI matrix and x_j corresponds to the j-th column, marked left. The value in the i-th row and j-th column of the PPMI matrix, PPMI(x_i, x_j, left), is the positive pointwise mutual information of x_i and x_j co-occurring in the corpus, where x_j is the first word to the left of x_i, ignoring any intervening stop words (that is, ignoring any words that are not in WordNet). If x_i (or x_j) has no corresponding row (or column) in the matrix, then the PPMI value is set to zero.

Turney et al. (2011) estimated PPMI(x_i, x_j, left) by sampling the corpus for phrases containing x_i and then looking for x_j to the left of x_i in the sampled phrases (and likewise for right). Due to this sampling process, PPMI(x_i, x_j, left) does not necessarily equal PPMI(x_j, x_i, right). For example, suppose x_i is a rare word and x_j is a common word. With PPMI(x_i, x_j, left), when we sample phrases containing x_i, we are relatively likely to find x_j in some of these phrases. With PPMI(x_j, x_i, right), when we sample phrases containing x_j, we are less likely to find any phrases containing x_i. Although, in theory, PPMI(x_i, x_j, left) should equal PPMI(x_j, x_i, right), they are likely to be unequal given a limited sample.
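A minimal sketch of the first two feature sets may help. The corpus frequencies and PPMI values below are hypothetical stand-ins for lookups into the real word frequency data and raw PPMI matrix:

```python
import math

# Hypothetical corpus statistics (stand-ins for the real data).
freq = {"fish": 120000, "tank": 85000, "aquarium": 9000}
ppmi = {("fish", "tank", "left"): 3.2,
        ("tank", "fish", "right"): 2.9}   # sampled values need not match

def LF(word):
    """Log frequency feature; zero for words absent from the corpus."""
    return math.log(freq.get(word, 0) + 1)

def PPMI(x_i, x_j, handedness):
    """PPMI of x_j occurring to the left/right of x_i; zero if missing."""
    return ppmi.get((x_i, x_j, handedness), 0.0)

def pair_features(words):
    """2n(n-1) PPMI features: both handednesses for every ordered pair."""
    return [PPMI(x, y, h)
            for x in words for y in words if x != y
            for h in ("left", "right")]

tuple_words = ("fish", "tank", "aquarium")
features = [LF(w) for w in tuple_words] + pair_features(tuple_words)
print(len(features))   # n + 2n(n-1) = 3 + 12 = 15
```

Note that the two handedness features for a pair are looked up independently, which is how the sampling asymmetry described above can surface in the features.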
From the n-tuple, we select all of the n(n−1) pairs, ⟨x_i, x_j⟩, such that i ≠ j. We then generate two features for each pair, PPMI(x_i, x_j, left) and PPMI(x_i, x_j, right). Thus there are 2n(n−1) PPMI values in the second set of features.

The third set of features consists of domain space similarity values for each pair of words in the n-tuple. Domain space was designed to capture the topic of a word. Turney (2012) first constructed a frequency matrix, in which the rows correspond to terms in WordNet and the columns correspond to nearby nouns. Given a term x_i, the corpus was sampled for phrases containing x_i and the phrases were processed with a part-of-speech tagger, to identify nouns. If the noun x_j was the closest noun to the left or right of x_i, then the frequency count for the i-th row and j-th column was incremented. The hypothesis was that the nouns near a term characterize the topics associated with the term.

The word–context frequency matrix for domain space has 114,297 rows (terms) and 50,000 columns (noun contexts, topics), with a density of 2.6%. The frequency matrix was converted to a PPMI matrix and then smoothed with SVD. The SVD yields three matrices, U, Σ, and V. A term in domain space is represented by a row vector in U_k Σ_k^p. The parameter k specifies the number of singular values in the truncated singular value decomposition; that is, k is the number of latent factors in the low-dimensional representation of the term (Landauer and Dumais, 1997). We generate U_k and Σ_k by deleting the columns in U and Σ corresponding to the smallest singular values. The parameter p raises the singular values in Σ_k to the power p (Caron, 2001). As p goes from one to zero, factors with smaller singular values are given more weight. This has the effect of making the similarity measure more discriminating (Turney, 2012).

The similarity of two words in domain space, Dom(x_i, x_j, k, p), is computed by extracting the row vectors in U_k Σ_k^p that correspond to the words x_i and x_j, and then calculating their cosine. Optimal performance requires tuning the parameters k and p for the task (Bullinaria and Levy, 2012; Turney, 2012). In the following experiments, we avoid directly tuning k and p by generating features with a variety of values for k and p, allowing the supervised learning algorithm to decide which features to use.

Feature set                    Size of set
LF(x_i)                        n
PPMI(x_i, x_j, handedness)     2n(n−1)
Dom(x_i, x_j, k, p)            ½n(n−1)·n_k·n_p
Fun(x_i, x_j, k, p)            ½n(n−1)·n_k·n_p

Table 1: The four sets of features and their sizes (n_k and n_p are the numbers of settings used for k and p).

From the n-tuple, we select all ½n(n−1) pairs, ⟨x_i, x_j⟩, such that i < j.
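The domain similarity computation lends itself to a short sketch. The matrix below is a small random stand-in for the real 114,297 × 50,000 domain-space PPMI matrix, and the term-to-row map is hypothetical; the point is the U_k Σ_k^p projection and the per-(k, p) feature generation:

```python
import numpy as np

rng = np.random.default_rng(0)
ppmi_matrix = rng.random((20, 30))    # stand-in: rows = terms, cols = noun contexts
row_index = {"fish": 0, "tank": 1}    # hypothetical term-to-row map

# Full SVD once; truncation to k factors happens per feature below.
U, s, Vt = np.linalg.svd(ppmi_matrix, full_matrices=False)

def dom(x_i, x_j, k, p):
    """Dom(x_i, x_j, k, p): cosine of the rows of U_k Sigma_k^p for the two words."""
    Uk = U[:, :k] * (s[:k] ** p)      # row vectors in U_k Sigma_k^p
    u, v = Uk[row_index[x_i]], Uk[row_index[x_j]]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Rather than tuning k and p directly, one feature is generated per
# (k, p) setting, and the supervised learner weights them.
features = [dom("fish", "tank", k, p)
            for k in (5, 10) for p in (0.25, 0.5, 0.75, 1.0)]
print(len(features))   # n_k * n_p = 8 features for this one word pair
```

The function space features Fun(x_i, x_j, k, p) in Table 1 are computed in the same way from the function-space matrix.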