Transactions of the Association for Computational Linguistics, 1 (2013) 353–366. Action Editor: Patrick Pantel.
Submitted 5/2013; Revised 7/2013; Published 10/2013. ©2013 Association for Computational Linguistics.
Distributional Semantics Beyond Words: Supervised Learning of Analogy and Paraphrase

Peter D. Turney
National Research Council Canada
Information and Communications Technologies
Ottawa, Ontario, Canada, K1A 0R6
peter.turney@nrc-cnrc.gc.ca

Abstract

There have been several efforts to extend distributional semantics beyond individual words, to measure the similarity of word pairs, phrases, and sentences (briefly, tuples; ordered sets of words, contiguous or noncontiguous). One way to extend beyond words is to compare two tuples using a function that combines pairwise similarities between the component words in the tuples. A strength of this approach is that it works with both relational similarity (analogy) and compositional similarity (paraphrase). However, past work required hand-coding the combination function for different tasks. The main contribution of this paper is that combination functions are generated by supervised learning. We achieve state-of-the-art results in measuring relational similarity between word pairs (SAT analogies and SemEval 2012 Task 2) and measuring compositional similarity between noun-modifier phrases and unigrams (multiple-choice paraphrase questions).

1 Introduction

Harris (1954) and Firth (1957) hypothesized that words that appear in similar contexts tend to have similar meanings. This hypothesis is the foundation for distributional semantics, in which words are represented by context vectors. The similarity of two words is calculated by comparing the two corresponding context vectors (Lund et al., 1995; Landauer and Dumais, 1997; Turney and Pantel, 2010).

Distributional semantics is highly effective for measuring the semantic similarity between individual words. On a set of eighty multiple-choice synonym questions from the test of English as a foreign language (TOEFL), a distributional approach recently achieved 100% accuracy (Bullinaria and Levy, 2012). However, it has been difficult to extend distributional semantics beyond individual words, to word pairs, phrases, and sentences.

Moving beyond individual words, there are various types of semantic similarity to consider. Here we focus on paraphrase and analogy. Paraphrase is similarity in the meaning of two pieces of text (Androutsopoulos and Malakasiotis, 2010). Analogy is similarity in the semantic relations of two sets of words (Turney, 2008a).

It is common to study paraphrase at the sentence level (Androutsopoulos and Malakasiotis, 2010), but we prefer to concentrate on the simplest type of paraphrase, where a bigram paraphrases a unigram. For example, dog house is a paraphrase of kennel. In our experiments, we concentrate on noun-modifier bigrams and noun unigrams.

Analogies map terms in one domain to terms in another domain (Gentner, 1983). The familiar analogy between the solar system and the Rutherford-Bohr atomic model involves several terms from the domain of the solar system and the domain of the atomic model (Turney, 2008a).

The simplest type of analogy is proportional analogy, which involves two pairs of words (Turney, 2006b). For example, the pair ⟨cook, raw⟩ is analogous to the pair ⟨decorate, plain⟩. If we cook a thing, it is no longer raw; if we decorate a thing, it
is no longer plain. The semantic relations between cook and raw are similar to the semantic relations between decorate and plain. In the following experiments, we focus on proportional analogies.

Erk (2013) distinguished four approaches to extend distributional semantics beyond words: In the first, a single vector space representation for a phrase or sentence is computed from the representations of the individual words (Mitchell and Lapata, 2010; Baroni and Zamparelli, 2010). In the second, two phrases or sentences are compared by combining multiple pairwise similarity values (Socher et al., 2011; Turney, 2012). Third, weighted inference rules integrate distributional similarity and formal logic (Garrette et al., 2011). Fourth, a single space integrates formal logic and vectors (Clarke, 2012).

Taking the second approach, Turney (2012) introduced a dual-space model, with one space for measuring domain similarity (similarity of topic or field) and another for function similarity (similarity of role or usage). Similarities beyond individual words are calculated by functions that combine domain and function similarities of component words.

The dual-space model has been applied to measuring compositional similarity (paraphrase recognition) and relational similarity (analogy recognition). In experiments that tested for sensitivity to word order, the dual-space model performed significantly better than competing approaches (Turney, 2012).

A limitation of past work with the dual-space model is that the combination functions were hand-coded. Our main contribution is to show how hand-coding can be eliminated with supervised learning. For ease of reference, we will call our approach SuperSim (supervised similarity). With no modification of SuperSim for the specific task (relational similarity or compositional similarity), we achieve better results than previous hand-coded models.

Compositional similarity (paraphrase) compares two contiguous phrases or sentences (n-grams), whereas relational similarity (analogy) does not require contiguity. We use tuple to refer to both contiguous and noncontiguous word sequences.

We approach analogy as a problem of supervised tuple classification. To measure the relational similarity between two word pairs, we train SuperSim with quadruples that are labeled as positive and negative examples of analogies. For example, the proportional analogy ⟨cook, raw, decorate, plain⟩ is labeled as a positive example. A quadruple is represented by a feature vector, composed of domain and function similarities from the dual-space model and other features based on corpus frequencies. SuperSim uses a support vector machine (Platt, 1998) to learn the probability that a quadruple ⟨a, b, c, d⟩ consists of a word pair ⟨a, b⟩ and an analogous word pair ⟨c, d⟩. The probability can be interpreted as the degree of relational similarity between the two given word pairs.

We also approach paraphrase as supervised tuple classification. To measure the compositional similarity between an m-gram and an n-gram, we train the learning algorithm with (m+n)-tuples that are positive and negative examples of paraphrases. SuperSim learns to estimate the probability that a triple ⟨a, b, c⟩ consists of a compositional bigram ab and a synonymous unigram c. For instance, the phrase fish tank is synonymous with aquarium; that is, fish tank and aquarium have high compositional similarity. The triple ⟨fish, tank, aquarium⟩ is represented using the same features that we used for analogy. The probability of the triple can be interpreted as the degree of compositional similarity between the given bigram and unigram.

We review related work in Section 2. The general feature space for learning relations and compositions is presented in Section 3. The experiments with relational similarity are described in Section 4, and Section 5 reports the results with compositional similarity. Section 6 discusses the implications of the results. We consider future work in Section 7 and conclude in Section 8.

2 Related Work

In SemEval 2012, Task 2 was concerned with measuring the degree of relational similarity
between two word pairs (Jurgens et al., 2012) and Task 6 (Agirre et al., 2012) examined the degree of semantic equivalence between two sentences. These two areas of research have been mostly independent, although Socher et al. (2012) and Turney (2012) present unified perspectives on the two tasks. We first discuss some work on relational similarity, then some work on compositional similarity, and lastly work that unifies the two types of similarity.
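As a concrete illustration of the supervised tuple classification described in Section 1, the following minimal sketch trains a probability-producing SVM on labeled quadruples. It is not the paper's implementation: the feature vectors are synthetic stand-ins for the features of Section 3, and scikit-learn's SVC with Platt-scaled probability outputs stands in for the SVM (Platt, 1998) used by SuperSim. All names and data are illustrative.

```python
# Minimal sketch of SuperSim-style analogy classification (illustrative only).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
num_features = 20                                # placeholder; the real count depends on n (Table 1)
X_train = rng.normal(size=(100, num_features))   # synthetic stand-ins for quadruple feature vectors
y_train = rng.integers(0, 2, size=100)           # 1 = analogous quadruple, 0 = not analogous

# probability=True fits a Platt-scaled sigmoid, so outputs are probabilities.
clf = SVC(kernel="rbf", probability=True)
clf.fit(X_train, y_train)

def relational_similarity(quadruple_features: np.ndarray) -> float:
    """Probability that <a, b, c, d> is a proportional analogy."""
    return float(clf.predict_proba(quadruple_features.reshape(1, -1))[0, 1])

print(relational_similarity(rng.normal(size=num_features)))
```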
2.1 Relational Similarity

LRA (latent relational analysis) measures relational similarity with a pair–pattern matrix (Turney, 2006b). Rows in the matrix correspond to word pairs (a, b) and columns correspond to patterns that connect the pairs ("a for the b") in a large corpus. This is a holistic (noncompositional) approach to distributional similarity, since the word pairs are opaque wholes; the component words have no separate representations. A compositional approach to analogy has a representation for each word, and a word pair is represented by composing the representations for each member of the pair. Given a vocabulary of N words, a compositional approach requires N representations to handle all possible word pairs, but a holistic approach requires N^2 representations. Holistic approaches do not scale up (Turney, 2012). LRA required nine days to run.

Bollegala et al. (2008) answered the SAT analogy questions with a support vector machine trained on quadruples (proportional analogies), as we do here. However, their feature vectors are holistic, and hence there are scaling problems. Herdağdelen and Baroni (2009) used a support vector machine to learn relational similarity. Their feature vectors contained a combination of holistic and compositional features.

Measuring relational similarity is closely connected to classifying word pairs according to their semantic relations (Turney and Littman, 2005). Semantic relation classification was the focus of SemEval 2007 Task 4 (Girju et al., 2007) and SemEval 2010 Task 8 (Hendrickx et al., 2010).

2.2 Compositional Similarity

To extend distributional semantics beyond words, many researchers take the first approach described by Erk (2013), in which a single vector space is used for individual words, phrases, and sentences (Landauer and Dumais, 1997; Mitchell and Lapata, 2008; Mitchell and Lapata, 2010). In this approach, given the words a and b with context vectors a and b, we construct a vector for the bigram ab by applying vector operations to a and b. Mitchell and Lapata (2010) experiment with many different vector operations and find that element-wise multiplication performs well. The bigram ab is represented by c = a ⊙ b, where c_i = a_i · b_i. However, element-wise multiplication is commutative, so the bigrams ab and ba map to the same vector c. In experiments that test for order sensitivity, element-wise multiplication performs poorly (Turney, 2012).

We can treat the bigram ab as a unit, as if it were a single word, and construct a context vector for ab from occurrences of ab in a large corpus. This holistic approach to representing bigrams performs well when a limited set of bigrams is specified in advance (before building the word–context matrix), but it does not scale up, because there are too many possible bigrams (Turney, 2012).

Although the holistic approach does not scale up, we can generate a few holistic bigram vectors and use them to train a supervised regression model (Guevara, 2010; Baroni and Zamparelli, 2010). Given a new bigram cd, not observed in the corpus, the regression model can predict a holistic vector for cd, if c and d have been observed separately. We show in Section 5 that this idea can be adapted to train SuperSim without manually labeled data.

Socher et al. (2011) take the second approach described by Erk (2013), in which two sentences are compared by combining multiple pairwise similarity values. They construct a variable-sized similarity matrix X, in which the element x_ij is the similarity between the i-th phrase of one sentence and the j-th phrase of the other. Since supervised learning is simpler with fixed-sized feature vectors, the variable-sized similarity matrix is then reduced to a smaller fixed-sized matrix, to allow comparison of pairs of sentences of varying lengths.

2.3 Unified Perspectives on Similarity

Socher et al. (2012) represent words and phrases with a pair, consisting of a vector and a matrix. The vector captures the meaning of the word or phrase and the matrix captures how a word or phrase modifies the meaning of another word or phrase when they are combined. They apply this matrix–vector representation to both compositions
and relations.

Turney (2012) represents words with two vectors, a vector from domain space and a vector from function space. The domain vector captures the topic or field of the word and the function vector captures the
functional role of the word. This dual-space model is applied to both compositions and relations.

Here we extend the dual-space model of Turney (2012) in two ways: Hand-coding is replaced with supervised learning and two new sets of features augment domain and function space. Moving to supervised learning instead of hand-coding makes it easier to introduce new features.

In the dual-space model, parameterized similarity measures provided the input values for hand-crafted functions. Each task required a different set of hand-crafted functions. The parameters of the similarity measures were tuned using a customized grid search algorithm. The grid search algorithm was not suitable for integration with a supervised learning algorithm. The insight behind SuperSim is that, given appropriate features, a supervised learning algorithm can replace the grid search algorithm and the hand-crafted functions.

3 Features for Tuple Classification

We represent a tuple with four types of features, all based on frequencies in a large corpus. The first type of feature is the logarithm of the frequency of a word. The second type is the positive pointwise mutual information (PPMI) between two words (Church and Hanks, 1989; Bullinaria and Levy, 2007). Third and fourth are the similarities of two words in domain and function space (Turney, 2012).

In the following experiments, we use the PPMI matrix from Turney et al. (2011) and the domain and function matrices from Turney (2012).[1] The three matrices and the word frequency data are based on the same corpus, a collection of web pages gathered from university websites, containing 5 × 10^10 words.[2] All three matrices are word–context matrices, in which the rows correspond to terms (words and phrases) in WordNet.[3] The columns correspond to the contexts in which the terms appear; each matrix involves a different kind of context.

[1] The three matrices and the word frequency data are available on request from the author. The matrix files range from two to five gigabytes when packaged and compressed for distribution.
[2] The corpus was collected by Charles Clarke at the University of Waterloo. It is about 280 gigabytes of plain text.
[3] See http://wordnet.princeton.edu/ for information about WordNet.

Let ⟨x_1, x_2, ..., x_n⟩ be an n-tuple of words. The number of features we use to represent this tuple increases as a function of n.

The first set of features consists of log frequency values for each word x_i in the n-tuple. Let freq(x_i) be the frequency of x_i in the corpus. We define LF(x_i) as log(freq(x_i) + 1). If x_i is not in the corpus, freq(x_i) is zero, and thus LF(x_i) is also zero. There are n log frequency features, one LF(x_i) feature for each word in the n-tuple.

The second set of features consists of positive pointwise mutual information values for each pair of words in the n-tuple. We use the raw PPMI matrix from Turney et al. (2011). Although they computed the singular value decomposition (SVD) to project the row vectors into a lower-dimensional space, we need the original high-dimensional columns for our features. The raw PPMI matrix has 114,501 rows and 139,246 columns with a density of 1.2%. For each term in WordNet, there is a corresponding row in the raw PPMI matrix. For each unigram in WordNet, there are two corresponding columns in the raw PPMI matrix, one marked left and the other right.

Suppose x_i corresponds to the i-th row of the PPMI matrix and x_j corresponds to the j-th column, marked left. The value in the i-th row and j-th column of the PPMI matrix, PPMI(x_i, x_j, left), is the positive pointwise mutual information of x_i and x_j co-occurring in the corpus, where x_j is the first word to the left of x_i, ignoring any intervening stop words (that is, ignoring any words that are not in WordNet). If x_i (or x_j) has no corresponding row (or column) in the matrix, then the PPMI value is set to zero.

Turney et al. (2011) estimated PPMI(x_i, x_j, left) by sampling the corpus for phrases containing x_i and then looking for x_j to the left of x_i in the sampled phrases (and likewise for right). Due to this sampling process, PPMI(x_i, x_j, left) does not necessarily equal PPMI(x_j, x_i, right). For example, suppose x_i is a rare word and x_j is a common word. With PPMI(x_i, x_j, left), when we sample phrases containing x_i, we are relatively likely to find x_j in some of these phrases. With PPMI(x_j, x_i, right), when we sample phrases containing x_j, we are less likely to find any phrases containing x_i. Although, in theory, PPMI(x_i, x_j, left) should equal PPMI(x_j, x_i, right), they are likely to be unequal given a limited sample.
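The following sketch illustrates the PPMI feature lookup just described. It is illustrative only: the toy matrix, the index dictionaries, and the function names are assumptions standing in for the 114,501 × 139,246 raw PPMI matrix of Turney et al. (2011) and whatever data format it actually ships in.

```python
# Sketch of the PPMI feature lookup (toy data; see assumptions above).
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-ins: rows are WordNet terms, columns are (unigram, side) contexts.
row_index = {"fish": 0, "tank": 1, "aquarium": 2}
col_index = {("tank", "left"): 0, ("fish", "right"): 1}
ppmi_matrix = csr_matrix(np.array([[2.1, 0.0],
                                   [0.0, 1.7],
                                   [0.0, 0.0]]))

def ppmi(x_i: str, x_j: str, side: str) -> float:
    """PPMI of x_i with x_j as the nearest non-stop-word on the given side;
    returns 0.0 when the row or column is missing, as in the paper."""
    i = row_index.get(x_i)
    j = col_index.get((x_j, side))
    if i is None or j is None:
        return 0.0
    return float(ppmi_matrix[i, j])

def ppmi_features(words):
    """All 2n(n-1) PPMI features for an n-tuple: both sides of every ordered pair."""
    n = len(words)
    return [ppmi(words[i], words[j], side)
            for i in range(n) for j in range(n) if i != j
            for side in ("left", "right")]

print(ppmi_features(["fish", "tank", "aquarium"]))  # 2 * 3 * 2 = 12 values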
Fromthen-tuple,weselectallofthen(n−1)pares,hxi,xji,suchthati6=j.Wethengener-atetwofeaturesforeachpair,PPMI(xi,xj,izquierda)andPPMI(xi,xj,bien).Thusthereare2n(n−1)PPMIvaluesinthesecondsetoffeatures.Thethirdsetoffeaturesconsistsofdomainspacesimilarityvaluesforeachpairofwordsinthen-tuple.Domainspacewasdesignedtocapturethetopicofaword.Turney(2012)firstconstructedafrequencymatrix,inwhichtherowscorrespondtotermsinWordNetandthecolumnscorrespondtonearbynouns.Givenatermxi,thecorpuswassam-pledforphrasescontainingxiandthephraseswereprocessedwithapart-of-speechtagger,toidentifynouns.Ifthenounxjwastheclosestnountotheleftorrightofxi,thenthefrequencycountforthei-throwandj-thcolumnwasincremented.Thehypoth-esiswasthatthenounsnearatermcharacterizethetopicsassociatedwiththeterm.Theword–contextfrequencymatrixfordomainspacehas114,297rows(terms)and50,000columns(nouncontexts,temas),withadensityof2.6%.ThefrequencymatrixwasconvertedtoaPPMImatrixandthensmoothedwithSVD.TheSVDyieldsthreematrices,Ud.,S,andV.AtermindomainspaceisrepresentedbyarowvectorinUkΣpk.Theparameterkspecifiesthenum-berofsingularvaluesinthetruncatedsingularvaluedecomposition;thatis,kisthenumberoflatentfactorsinthelow-dimensionalrepresentationoftheterm(LandauerandDumais,1997).WegenerateUkandΣkbydeletingthecolumnsinUandΣcorrespondingtothesmallestsingularvalues.TheparameterpraisesthesingularvaluesinΣktothepowerp(Caron,2001).Aspgoesfromonetozero,factorswithsmallersingularvaluesaregivenmoreweight.Thishastheeffectofmakingthesimilaritymeasuremorediscriminating(Turney,2012).Thesimilarityoftwowordsindomainspace,Dom(xi,xj,k,pag),iscomputedbyextractingtherowvectorsinUkΣpkthatcorrespondtothewordsxiandxj,andthencalculatingtheircosine.Optimalper-formancerequirestuningtheparameterskandpforthetask(BullinariaandLevy,2012;Turney,2012).Inthefollowingexperiments,weavoiddirectlytun-ingkandpbygeneratingfeatureswithavarietyofvaluesforkandp,allowingthesupervisedlearningalgorithmtodecidewhichfeaturestouse.FeaturesetSizeofsetLF(xi)nPPMI(xi,xj,handedness)2norte(n−1)Dom(xi,xj,k,pag)12norte(n−1)nknpFun(xi,xj,k,pag)12norte(n−1)nknpTable1:Thefoursetsoffeaturesandtheirsizes.Fromthen-tuple,weselectall12n(n−1)pares,hxi,xji,suchthati