Transactions of the Association for Computational Linguistics, 1 (2013) 353–366. Action Editor: Patrick Pantel.
Submitted 5/2013; Revised 7/2013; Published 10/2013. ©2013 Association for Computational Linguistics.
Distributional Semantics Beyond Words: Supervised Learning of Analogy and Paraphrase

Peter D. Turney
National Research Council Canada
Information and Communications Technologies
Ottawa, Ontario, Canada, K1A 0R6
peter.turney@nrc-cnrc.gc.ca

Abstract

There have been several efforts to extend distributional semantics beyond individual words, to measure the similarity of word pairs, phrases, and sentences (briefly, tuples; ordered sets of words, contiguous or noncontiguous). One way to extend beyond words is to compare two tuples using a function that combines pairwise similarities between the component words in the tuples. A strength of this approach is that it works with both relational similarity (analogy) and compositional similarity (paraphrase). However, past work required hand-coding the combination function for different tasks. The main contribution of this paper is that combination functions are generated by supervised learning. We achieve state-of-the-art results in measuring relational similarity between word pairs (SAT analogies and SemEval 2012 Task 2) and measuring compositional similarity between noun-modifier phrases and unigrams (multiple-choice paraphrase questions).

1 Introduction

Harris (1954) and Firth (1957) hypothesized that words that appear in similar contexts tend to have similar meanings. This hypothesis is the foundation for distributional semantics, in which words are represented by context vectors. The similarity of two words is calculated by comparing the two corresponding context vectors (Lund et al., 1995; Landauer and Dumais, 1997; Turney and Pantel, 2010).

Distributional semantics is highly effective for measuring the semantic similarity between individual words. On a set of eighty multiple-choice synonym questions from the test of English as a foreign language (TOEFL), a distributional approach recently achieved 100% accuracy (Bullinaria and Levy, 2012). However, it has been difficult to extend distributional semantics beyond individual words, to word pairs, phrases, and sentences.
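To make the word-level case concrete, the following minimal sketch compares two context vectors by their cosine; the vectors and vocabulary here are hypothetical toy data, not taken from this paper's corpus.

```python
import numpy as np

def cosine(a, b):
    # Cosine of the angle between two context vectors.
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a.dot(b) / norm) if norm > 0 else 0.0

# Hypothetical context vectors: each dimension counts co-occurrences
# of the word with one context word in a corpus.
dog    = np.array([55.0,  2.0, 30.0,  1.0])
kennel = np.array([40.0,  1.0, 25.0,  0.0])
carrot = np.array([ 2.0, 60.0,  1.0, 33.0])

print(cosine(dog, kennel))  # high: the words share contexts
print(cosine(dog, carrot))  # low: the words occur in different contexts
```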
Moving beyond individual words, there are various types of semantic similarity to consider. Here we focus on paraphrase and analogy. Paraphrase is similarity in the meaning of two pieces of text (Androutsopoulos and Malakasiotis, 2010). Analogy is similarity in the semantic relations of two sets of words (Turney, 2008a).

It is common to study paraphrase at the sentence level (Androutsopoulos and Malakasiotis, 2010), but we prefer to concentrate on the simplest type of paraphrase, where a bigram paraphrases a unigram. For example, dog house is a paraphrase of kennel. In our experiments, we concentrate on noun-modifier bigrams and noun unigrams.

Analogies map terms in one domain to terms in another domain (Gentner, 1983). The familiar analogy between the solar system and the Rutherford-Bohr atomic model involves several terms from the domain of the solar system and the domain of the atomic model (Turney, 2008a).

The simplest type of analogy is proportional analogy, which involves two pairs of words (Turney, 2006b). For example, the pair ⟨cook, raw⟩ is analogous to the pair ⟨decorate, plain⟩. If we cook a thing, it is no longer raw; if we decorate a thing, it is no longer plain. The semantic relations between cook and raw are similar to the semantic relations between decorate and plain. In the following experiments, we focus on proportional analogies.

Erk (2013) distinguished four approaches to extend distributional semantics beyond words: In the first, a single vector space representation for a phrase or sentence is computed from the representations of the individual words (Mitchell and Lapata, 2010; Baroni and Zamparelli, 2010). In the second, two phrases or sentences are compared by combining multiple pairwise similarity values (Socher et al., 2011; Turney, 2012). Third, weighted inference rules integrate distributional similarity and formal logic (Garrette et al., 2011). Fourth, a single space integrates formal logic and vectors (Clarke, 2012).

Taking the second approach, Turney (2012) introduced a dual-space model, with one space for measuring domain similarity (similarity of topic or field) and another for function similarity (similarity of role or usage). Similarities beyond individual words are calculated by functions that combine domain and function similarities of component words.

The dual-space model has been applied to measuring compositional similarity (paraphrase recognition) and relational similarity (analogy recognition). In experiments that tested for sensitivity to word order, the dual-space model performed significantly better than competing approaches (Turney, 2012).

A limitation of past work with the dual-space model is that the combination functions were hand-coded. Our main contribution is to show how hand-coding can be eliminated with supervised learning. For ease of reference, we will call our approach SuperSim (supervised similarity). With no modification of SuperSim for the specific task (relational similarity or compositional similarity), we achieve better results than previous hand-coded models.

Compositional similarity (paraphrase) compares two contiguous phrases or sentences (n-grams), whereas relational similarity (analogy) does not require contiguity. We use tuple to refer to both contiguous and noncontiguous word sequences.

We approach analogy as a problem of supervised tuple classification. To measure the relational similarity between two word pairs, we train SuperSim with quadruples that are labeled as positive and negative examples of analogies. For example, the proportional analogy ⟨cook, raw, decorate, plain⟩ is labeled as a positive example. A quadruple is represented by a feature vector, composed of domain and function similarities from the dual-space model and other features based on corpus frequencies. SuperSim uses a support vector machine (Platt, 1998) to learn the probability that a quadruple ⟨a, b, c, d⟩ consists of a word pair ⟨a, b⟩ and an analogous word pair ⟨c, d⟩. The probability can be interpreted as the degree of relational similarity between the two given word pairs.
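As a rough illustration of this setup, the sketch below trains a probabilistic SVM on labeled quadruples. The feature extraction is a stub returning random vectors, standing in for the corpus-based features of Section 3; the data and function names are hypothetical, not the paper's actual implementation.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def features(quadruple):
    # Stub: stands in for the Section 3 features (log frequency,
    # PPMI, domain and function similarities).
    return rng.random(20)

# Hypothetical training quadruples labeled as analogies (1) or not (0).
quads = [("cook", "raw", "decorate", "plain")] * 10 \
      + [("cook", "raw", "decorate", "fancy")] * 10
labels = [1] * 10 + [0] * 10
X = np.array([features(q) for q in quads])

# probability=True fits Platt scaling (Platt, 1998), so the output
# can be read as a degree of relational similarity.
clf = SVC(probability=True).fit(X, labels)
print(clf.predict_proba([features(("hot", "cold", "wet", "dry"))])[0, 1])
```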
We also approach paraphrase as supervised tuple classification. To measure the compositional similarity between an m-gram and an n-gram, we train the learning algorithm with (m+n)-tuples that are positive and negative examples of paraphrases. SuperSim learns to estimate the probability that a triple ⟨a, b, c⟩ consists of a compositional bigram ab and a synonymous unigram c. For instance, the phrase fish tank is synonymous with aquarium; that is, fish tank and aquarium have high compositional similarity. The triple ⟨fish, tank, aquarium⟩ is represented using the same features that we used for analogy. The probability of the triple can be interpreted as the degree of compositional similarity between the given bigram and unigram.

We review related work in Section 2. The general feature space for learning relations and compositions is presented in Section 3. The experiments with relational similarity are described in Section 4, and Section 5 reports the results with compositional similarity. Section 6 discusses the implications of the results. We consider future work in Section 7 and conclude in Section 8.

2 Related Work

In SemEval 2012, Task 2 was concerned with measuring the degree of relational similarity between two word pairs (Jurgens et al., 2012) and Task 6 (Agirre et al., 2012) examined the degree of semantic equivalence between two sentences. These two areas of research have been mostly independent, although Socher et al. (2012) and Turney (2012) present unified perspectives on the two tasks. We first discuss some work on relational similarity, then some work on compositional similarity, and lastly work that unifies the two types of similarity.
2.1 Relational Similarity

LRA (latent relational analysis) measures relational similarity with a pair–pattern matrix (Turney, 2006b). Rows in the matrix correspond to word pairs (a, b) and columns correspond to patterns that connect the pairs ("a for the b") in a large corpus. This is a holistic (noncompositional) approach to distributional similarity, since the word pairs are opaque wholes; the component words have no separate representations. A compositional approach to analogy has a representation for each word, and a word pair is represented by composing the representations for each member of the pair. Given a vocabulary of N words, a compositional approach requires N representations to handle all possible word pairs, but a holistic approach requires N² representations. Holistic approaches do not scale up (Turney, 2012). LRA required nine days to run.

Bollegala et al. (2008) answered the SAT analogy questions with a support vector machine trained on quadruples (proportional analogies), as we do here. However, their feature vectors are holistic, and hence there are scaling problems. Herdağdelen and Baroni (2009) used a support vector machine to learn relational similarity. Their feature vectors contained a combination of holistic and compositional features.

Measuring relational similarity is closely connected to classifying word pairs according to their semantic relations (Turney and Littman, 2005). Semantic relation classification was the focus of SemEval 2007 Task 4 (Girju et al., 2007) and SemEval 2010 Task 8 (Hendrickx et al., 2010).

2.2 Compositional Similarity

To extend distributional semantics beyond words, many researchers take the first approach described by Erk (2013), in which a single vector space is used for individual words, phrases, and sentences (Landauer and Dumais, 1997; Mitchell and Lapata, 2008; Mitchell and Lapata, 2010). In this approach, given the words a and b with context vectors a and b, we construct a vector for the bigram ab by applying vector operations to a and b. Mitchell and Lapata (2010) experiment with many different vector operations and find that element-wise multiplication performs well. The bigram ab is represented by c = a ⊙ b, where c_i = a_i · b_i. However, element-wise multiplication is commutative, so the bigrams ab and ba map to the same vector c. In experiments that test for order sensitivity, element-wise multiplication performs poorly (Turney, 2012).
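The order insensitivity follows directly from commutativity, as this toy sketch shows (the vectors are made up):

```python
import numpy as np

a = np.array([1.0, 2.0, 0.0, 3.0])  # context vector for word a
b = np.array([4.0, 0.5, 2.0, 1.0])  # context vector for word b

ab = a * b  # element-wise multiplication: c_i = a_i * b_i
ba = b * a

# Multiplication is commutative, so the bigrams ab and ba (for
# example, "guard dog" and "dog guard") get identical vectors.
print(np.array_equal(ab, ba))  # True
```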
We can treat the bigram ab as a unit, as if it were a single word, and construct a context vector for ab from occurrences of ab in a large corpus. This holistic approach to representing bigrams performs well when a limited set of bigrams is specified in advance (before building the word–context matrix), but it does not scale up, because there are too many possible bigrams (Turney, 2012).

Although the holistic approach does not scale up, we can generate a few holistic bigram vectors and use them to train a supervised regression model (Guevara, 2010; Baroni and Zamparelli, 2010). Given a new bigram cd, not observed in the corpus, the regression model can predict a holistic vector for cd, if c and d have been observed separately. We show in Section 5 that this idea can be adapted to train SuperSim without manually labeled data.

Socher et al. (2011) take the second approach described by Erk (2013), in which two sentences are compared by combining multiple pairwise similarity values. They construct a variable-sized similarity matrix X, in which the element x_ij is the similarity between the i-th phrase of one sentence and the j-th phrase of the other. Since supervised learning is simpler with fixed-sized feature vectors, the variable-sized similarity matrix is then reduced to a smaller fixed-sized matrix, to allow comparison of pairs of sentences of varying lengths.

2.3 Unified Perspectives on Similarity

Socher et al. (2012) represent words and phrases with a pair, consisting of a vector and a matrix. The vector captures the meaning of the word or phrase and the matrix captures how a word or phrase modifies the meaning of another word or phrase when they are combined. They apply this matrix–vector representation to both compositions and relations.

Turney (2012) represents words with two vectors, a vector from domain space and a vector from function space. The domain vector captures the topic or field of the word and the function vector captures the
functional role of the word. This dual-space model is applied to both compositions and relations.

Here we extend the dual-space model of Turney (2012) in two ways: Hand-coding is replaced with supervised learning and two new sets of features augment domain and function space. Moving to supervised learning instead of hand-coding makes it easier to introduce new features.

In the dual-space model, parameterized similarity measures provided the input values for hand-crafted functions. Each task required a different set of hand-crafted functions. The parameters of the similarity measures were tuned using a customized grid search algorithm. The grid search algorithm was not suitable for integration with a supervised learning algorithm. The insight behind SuperSim is that, given appropriate features, a supervised learning algorithm can replace the grid search algorithm and the hand-crafted functions.

3 Features for Tuple Classification

We represent a tuple with four types of features, all based on frequencies in a large corpus. The first type of feature is the logarithm of the frequency of a word. The second type is the positive pointwise mutual information (PPMI) between two words (Church and Hanks, 1989; Bullinaria and Levy, 2007). Third and fourth are the similarities of two words in domain and function space (Turney, 2012).

In the following experiments, we use the PPMI matrix from Turney et al. (2011) and the domain and function matrices from Turney (2012).¹ The three matrices and the word frequency data are based on the same corpus, a collection of web pages gathered from university websites, containing 5 × 10^10 words.² All three matrices are word–context matrices, in which the rows correspond to terms (words and phrases) in WordNet.³ The columns correspond to the contexts in which the terms appear; each matrix involves a different kind of context.

¹ The three matrices and the word frequency data are available on request from the author. The matrix files range from two to five gigabytes when packaged and compressed for distribution.
² The corpus was collected by Charles Clarke at the University of Waterloo. It is about 280 gigabytes of plain text.
³ See http://wordnet.princeton.edu/ for information about WordNet.

Let ⟨x_1, x_2, ..., x_n⟩ be an n-tuple of words. The number of features we use to represent this tuple increases as a function of n.

The first set of features consists of log frequency values for each word x_i in the n-tuple. Let freq(x_i) be the frequency of x_i in the corpus. We define LF(x_i) as log(freq(x_i) + 1). If x_i is not in the corpus, freq(x_i) is zero, and thus LF(x_i) is also zero. There are n log frequency features, one LF(x_i) feature for each word in the n-tuple.
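A minimal sketch of the log frequency features, assuming a hypothetical dictionary of corpus counts:

```python
import math

# Hypothetical corpus frequencies.
freq = {"fish": 250000, "tank": 120000, "aquarium": 15000}

def lf(word):
    # LF(x_i) = log(freq(x_i) + 1); unseen words have freq 0, so LF = 0.
    return math.log(freq.get(word, 0) + 1)

tuple_words = ("fish", "tank", "aquarium")
print([lf(w) for w in tuple_words])  # n features for an n-tuple
```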
The second set of features consists of positive pointwise mutual information values for each pair of words in the n-tuple. We use the raw PPMI matrix from Turney et al. (2011). Although they computed the singular value decomposition (SVD) to project the row vectors into a lower-dimensional space, we need the original high-dimensional columns for our features.

The raw PPMI matrix has 114,501 rows and 139,246 columns with a density of 1.2%. For each term in WordNet, there is a corresponding row in the raw PPMI matrix. For each unigram in WordNet, there are two corresponding columns in the raw PPMI matrix, one marked left and the other right. Suppose x_i corresponds to the i-th row of the PPMI matrix and x_j corresponds to the j-th column, marked left. The value in the i-th row and j-th column of the PPMI matrix, PPMI(x_i, x_j, left), is the positive pointwise mutual information of x_i and x_j co-occurring in the corpus, where x_j is the first word to the left of x_i, ignoring any intervening stop words (that is, ignoring any words that are not in WordNet). If x_i (or x_j) has no corresponding row (or column) in the matrix, then the PPMI value is set to zero.

Turney et al. (2011) estimated PPMI(x_i, x_j, left) by sampling the corpus for phrases containing x_i and then looking for x_j to the left of x_i in the sampled phrases (and likewise for right). Due to this sampling process, PPMI(x_i, x_j, left) does not necessarily equal PPMI(x_j, x_i, right). For example, suppose x_i is a rare word and x_j is a common word. With PPMI(x_i, x_j, left), when we sample phrases containing x_i, we are relatively likely to find x_j in some of these phrases. With PPMI(x_j, x_i, right), when we sample phrases containing x_j, we are less likely to find any phrases containing x_i. Although, in theory, PPMI(x_i, x_j, left) should equal PPMI(x_j, x_i, right), they are likely to be unequal given a limited sample.
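For reference, the sketch below computes PPMI in the standard way from a co-occurrence count matrix. It is not the sampling-based estimator of Turney et al. (2011) described above, and the counts are invented.

```python
import numpy as np

def ppmi(counts):
    # Standard PPMI: pmi = log(p(w, c) / (p(w) p(c))), clipped at zero.
    total = counts.sum()
    p_wc = counts / total
    p_w = p_wc.sum(axis=1, keepdims=True)  # word (row) marginals
    p_c = p_wc.sum(axis=0, keepdims=True)  # context (column) marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi, 0.0)

# Toy counts: rows are words; columns are contexts such as
# "word x_j immediately to the left" and "... to the right".
counts = np.array([[10.0, 0.0, 3.0],
                   [ 2.0, 8.0, 1.0],
                   [ 0.0, 1.0, 9.0]])
print(ppmi(counts))
```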
From the n-tuple, we select all of the n(n−1) pairs, ⟨x_i, x_j⟩, such that i ≠ j. We then generate two features for each pair, PPMI(x_i, x_j, left) and PPMI(x_i, x_j, right). Thus there are 2n(n−1) PPMI values in the second set of features.

The third set of features consists of domain space similarity values for each pair of words in the n-tuple. Domain space was designed to capture the topic of a word. Turney (2012) first constructed a frequency matrix, in which the rows correspond to terms in WordNet and the columns correspond to nearby nouns. Given a term x_i, the corpus was sampled for phrases containing x_i and the phrases were processed with a part-of-speech tagger, to identify nouns. If the noun x_j was the closest noun to the left or right of x_i, then the frequency count for the i-th row and j-th column was incremented. The hypothesis was that the nouns near a term characterize the topics associated with the term. The word–context frequency matrix for domain space has 114,297 rows (terms) and 50,000 columns (noun contexts, topics), with a density of 2.6%.

The frequency matrix was converted to a PPMI matrix and then smoothed with SVD. The SVD yields three matrices, U, Σ, and V. A term in domain space is represented by a row vector in U_k Σ_k^p. The parameter k specifies the number of singular values in the truncated singular value decomposition; that is, k is the number of latent factors in the low-dimensional representation of the term (Landauer and Dumais, 1997). We generate U_k and Σ_k by deleting the columns in U and Σ corresponding to the smallest singular values. The parameter p raises the singular values in Σ_k to the power p (Caron, 2001). As p goes from one to zero, factors with smaller singular values are given more weight. This has the effect of making the similarity measure more discriminating (Turney, 2012).

The similarity of two words in domain space, Dom(x_i, x_j, k, p), is computed by extracting the row vectors in U_k Σ_k^p that correspond to the words x_i and x_j, and then calculating their cosine. Optimal performance requires tuning the parameters k and p for the task (Bullinaria and Levy, 2012; Turney, 2012). In the following experiments, we avoid directly tuning k and p by generating features with a variety of values for k and p, allowing the supervised learning algorithm to decide which features to use.

Feature set                   | Size of set
LF(x_i)                       | n
PPMI(x_i, x_j, handedness)    | 2n(n−1)
Dom(x_i, x_j, k, p)           | (1/2)n(n−1) n_k n_p
Fun(x_i, x_j, k, p)           | (1/2)n(n−1) n_k n_p

Table 1: The four sets of features and their sizes.

From the n-tuple, we select all (1/2)n(n−1) pairs, ⟨x_i, x_j⟩, such that i < j.
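Under these definitions, Dom(x_i, x_j, k, p) can be sketched as below. The PPMI matrix and row index are hypothetical stand-ins for the domain-space matrix of Turney (2012); in practice the truncated SVD would be computed once, not per query.

```python
import numpy as np

def domain_similarity(ppmi_matrix, row_index, x_i, x_j, k, p):
    # Dom(x_i, x_j, k, p): cosine of the rows of U_k * Sigma_k^p.
    u, s, _ = np.linalg.svd(ppmi_matrix, full_matrices=False)
    projected = u[:, :k] * (s[:k] ** p)  # rows of U_k Sigma_k^p
    a, b = projected[row_index[x_i]], projected[row_index[x_j]]
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a.dot(b) / norm) if norm > 0 else 0.0

# Toy stand-in for the domain-space PPMI matrix.
rng = np.random.default_rng(0)
ppmi_matrix = rng.random((6, 8))
row_index = {"cook": 0, "raw": 1}
print(domain_similarity(ppmi_matrix, row_index, "cook", "raw", k=4, p=0.5))
```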