Transactions of the Association for Computational Linguistics, 1 (2013) 353–366. Action Editor: Patrick Pantel.
Submitted 5/2013; Revised 7/2013; Published 10/2013. ©2013 Association for Computational Linguistics.
Distributional Semantics Beyond Words: Supervised Learning of Analogy and Paraphrase

Peter D. Turney
National Research Council Canada
Information and Communications Technologies
Ottawa, Ontario, Canada, K1A 0R6
peter.turney@nrc-cnrc.gc.ca

Abstract

There have been several efforts to extend distributional semantics beyond individual words, to measure the similarity of word pairs, phrases, and sentences (briefly, tuples; ordered sets of words, contiguous or noncontiguous). One way to extend beyond words is to compare two tuples using a function that combines pairwise similarities between the component words in the tuples. A strength of this approach is that it works with both relational similarity (analogy) and compositional similarity (paraphrase). However, past work required hand-coding the combination function for different tasks. The main contribution of this paper is that combination functions are generated by supervised learning. We achieve state-of-the-art results in measuring relational similarity between word pairs (SAT analogies and SemEval 2012 Task 2) and measuring compositional similarity between noun-modifier phrases and unigrams (multiple-choice paraphrase questions).

1 Introduction

Harris (1954) and Firth (1957) hypothesized that words that appear in similar contexts tend to have similar meanings. This hypothesis is the foundation for distributional semantics, in which words are represented by context vectors. The similarity of two words is calculated by comparing the two corresponding context vectors (Lund et al., 1995; Landauer and Dumais, 1997; Turney and Pantel, 2010).

Distributional semantics is highly effective for measuring the semantic similarity between individual words. On a set of eighty multiple-choice synonym questions from the test of English as a foreign language (TOEFL), a distributional approach recently achieved 100% accuracy (Bullinaria and Levy, 2012). However, it has been difficult to extend distributional semantics beyond individual words, to word pairs, phrases, and sentences.

Moving beyond individual words, there are various types of semantic similarity to consider. Here we focus on paraphrase and analogy. Paraphrase is similarity in the meaning of two pieces of text (Androutsopoulos and Malakasiotis, 2010). Analogy is similarity in the semantic relations of two sets of words (Turney, 2008a).

It is common to study paraphrase at the sentence level (Androutsopoulos and Malakasiotis, 2010), but we prefer to concentrate on the simplest type of paraphrase, where a bigram paraphrases a unigram. For example, dog house is a paraphrase of kennel. In our experiments, we concentrate on noun-modifier bigrams and noun unigrams.

Analogies map terms in one domain to terms in another domain (Gentner, 1983). The familiar analogy between the solar system and the Rutherford-Bohr atomic model involves several terms from the domain of the solar system and the domain of the atomic model (Turney, 2008a).

The simplest type of analogy is proportional analogy, which involves two pairs of words (Turney, 2006b). For example, the pair ⟨cook, raw⟩ is analogous to the pair ⟨decorate, plain⟩. If we cook a thing, it
is no longer plain. The semantic relations between cook and raw are similar to the semantic relations between decorate and plain. In the following experiments, we focus on proportional analogies.

Erk (2013) distinguished four approaches to extend distributional semantics beyond words: In the first, a single vector space representation for a phrase or sentence is computed from the representations of the individual words (Mitchell and Lapata, 2010; Baroni and Zamparelli, 2010). In the second, two phrases or sentences are compared by combining multiple pairwise similarity values (Socher et al., 2011; Turney, 2012). Third, weighted inference rules integrate distributional similarity and formal logic (Garrette et al., 2011). Fourth, a single space integrates formal logic and vectors (Clarke, 2012).

Taking the second approach, Turney (2012) introduced a dual-space model, with one space for measuring domain similarity (similarity of topic or field) and another for function similarity (similarity of role or usage). Similarities beyond individual words are calculated by functions that combine domain and function similarities of component words.

The dual-space model has been applied to measuring compositional similarity (paraphrase recognition) and relational similarity (analogy recognition). In experiments that tested for sensitivity to word order, the dual-space model performed significantly better than competing approaches (Turney, 2012).

A limitation of past work with the dual-space model is that the combination functions were hand-coded. Our main contribution is to show how hand-coding can be eliminated with supervised learning. For ease of reference, we will call our approach SuperSim (supervised similarity). With no modification of SuperSim for the specific task (relational similarity or compositional similarity), we achieve better results than previous hand-coded models.

Compositional similarity (paraphrase) compares two contiguous phrases or sentences (n-grams), whereas relational similarity (analogy) does not require contiguity. We use tuple to refer to both contiguous and noncontiguous word sequences.

We approach analogy as a problem of supervised tuple classification. To measure the relational similarity between two word pairs, we train SuperSim with quadruples that are labeled as positive and negative examples of analogies. For example, the proportional analogy ⟨cook, raw, decorate, plain⟩ is labeled as a positive example. A quadruple is represented by a feature vector, composed of domain and function similarities from the dual-space model and other features based on corpus frequencies. SuperSim uses a support vector machine (Platt, 1998) to learn the probability that a quadruple ⟨a, b, c, d⟩ consists of a word pair ⟨a, b⟩ and an analogous word pair ⟨c, d⟩. The probability can be interpreted as the degree of relational similarity between the two given word pairs.

We also approach paraphrase as supervised tuple classification. To measure the compositional similarity between an m-gram and an n-gram, we train the learning algorithm with (m+n)-tuples that are positive and negative examples of paraphrases. SuperSim learns to estimate the probability that a triple ⟨a, b, c⟩ consists of a compositional bigram ab and a synonymous unigram c. For instance, the phrase fish tank is synonymous with aquarium; that is, fish tank and aquarium have high compositional similarity. The triple ⟨fish, tank, aquarium⟩ is represented using the same features that we used for analogy. The probability of the triple can be interpreted as the degree of compositional similarity between the given bigram and unigram.

We review related work in Section 2. The general feature space for learning relations and compositions is presented in Section 3. The experiments with relational similarity are described in Section 4, and Section 5 reports the results with compositional similarity. Section 6 discusses the implications of the results. We consider future work in Section 7 and conclude in Section 8.
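Before turning to related work, a minimal runnable sketch may help fix ideas. It is not the paper's implementation: the feature vectors below are random stand-ins for the Section 3 features, and scikit-learn's SVC with probability=True (Platt-style calibration) stands in for the SVM of Platt (1998).

```python
# A minimal sketch of SuperSim-style tuple classification (see caveats above).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# In SuperSim each row would be the Section 3 feature vector of one labeled
# quadruple <a, b, c, d>; synthetic vectors stand in here so the sketch runs.
X_train = rng.random((40, 12))
y_train = (X_train[:, 0] > 0.5).astype(int)  # 1 = analogous, 0 = not

# probability=True fits a Platt-style sigmoid to the SVM outputs, in the
# spirit of the probabilistic SVM (Platt, 1998) used by SuperSim.
clf = SVC(kernel="rbf", probability=True).fit(X_train, y_train)

# The positive-class probability is interpreted as the degree of relational
# similarity between the two word pairs encoded in a test quadruple.
x_test = rng.random((1, 12))
print(clf.predict_proba(x_test)[0, 1])
```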
2 Related Work

In SemEval 2012, Task 2 was concerned with measuring the degree of relational similarity between two word pairs (Jurgens et al., 2012) and Task 6 (Agirre et al., 2012) examined the degree of semantic equivalence between two sentences. These two areas of research have been mostly independent, although Socher et al. (2012) and Turney (2012) present unified perspectives on the two tasks. We first discuss some work on relational similarity, then some work on compositional similarity, and lastly work that unifies the two types of similarity.
2.1 Relational Similarity

LRA (latent relational analysis) measures relational similarity with a pair–pattern matrix (Turney, 2006b). Rows in the matrix correspond to word pairs (a, b) and columns correspond to patterns that connect the pairs ("a for the b") in a large corpus. This is a holistic (noncompositional) approach to distributional similarity, since the word pairs are opaque wholes; the component words have no separate representations. A compositional approach to analogy has a representation for each word, and a word pair is represented by composing the representations for each member of the pair. Given a vocabulary of N words, a compositional approach requires N representations to handle all possible word pairs, but a holistic approach requires N² representations. Holistic approaches do not scale up (Turney, 2012). LRA required nine days to run.

Bollegala et al. (2008) answered the SAT analogy questions with a support vector machine trained on quadruples (proportional analogies), as we do here. However, their feature vectors are holistic, and hence there are scaling problems. Herdağdelen and Baroni (2009) used a support vector machine to learn relational similarity. Their feature vectors contained a combination of holistic and compositional features.

Measuring relational similarity is closely connected to classifying word pairs according to their semantic relations (Turney and Littman, 2005). Semantic relation classification was the focus of SemEval 2007 Task 4 (Girju et al., 2007) and SemEval 2010 Task 8 (Hendrickx et al., 2010).

2.2 Compositional Similarity

To extend distributional semantics beyond words, many researchers take the first approach described by Erk (2013), in which a single vector space is used for individual words, phrases, and sentences (Landauer and Dumais, 1997; Mitchell and Lapata, 2008; Mitchell and Lapata, 2010). In this approach, given the words a and b with context vectors a and b, we construct a vector for the bigram ab by applying vector operations to a and b. Mitchell and Lapata (2010) experiment with many different vector operations and find that element-wise multiplication performs well. The bigram ab is represented by c = a ⊙ b, where c_i = a_i · b_i. However, element-wise multiplication is commutative, so the bigrams ab and ba map to the same vector c. In experiments that test for order sensitivity, element-wise multiplication performs poorly (Turney, 2012); the short sketch after this subsection illustrates the problem.

We can treat the bigram ab as a unit, as if it were a single word, and construct a context vector for ab from occurrences of ab in a large corpus. This holistic approach to representing bigrams performs well when a limited set of bigrams is specified in advance (before building the word–context matrix), but it does not scale up, because there are too many possible bigrams (Turney, 2012).

Although the holistic approach does not scale up, we can generate a few holistic bigram vectors and use them to train a supervised regression model (Guevara, 2010; Baroni and Zamparelli, 2010). Given a new bigram cd, not observed in the corpus, the regression model can predict a holistic vector for cd, if c and d have been observed separately. We show in Section 5 that this idea can be adapted to train SuperSim without manually labeled data.

Socher et al. (2011) take the second approach described by Erk (2013), in which two sentences are compared by combining multiple pairwise similarity values. They construct a variable-sized similarity matrix X, in which the element x_ij is the similarity between the i-th phrase of one sentence and the j-th phrase of the other. Since supervised learning is simpler with fixed-sized feature vectors, the variable-sized similarity matrix is then reduced to a smaller fixed-sized matrix, to allow comparison of pairs of sentences of varying lengths.
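As the sketch below shows (plain numpy, with invented three-dimensional context vectors), element-wise multiplication cannot distinguish the bigram ab from ba:

```python
import numpy as np

# Invented low-dimensional context vectors for two words.
a = np.array([0.2, 0.7, 0.1])
b = np.array([0.5, 0.3, 0.9])

# Element-wise (Hadamard) composition: c_i = a_i * b_i.
ab = a * b
ba = b * a

# Multiplication is commutative, so "ab" and "ba" get the same vector,
# which is why this composition fails order-sensitivity tests.
print(np.array_equal(ab, ba))  # True
```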
2.3 Unified Perspectives on Similarity

Socher et al. (2012) represent words and phrases with a pair, consisting of a vector and a matrix. The vector captures the meaning of the word or phrase and the matrix captures how a word or phrase modifies the meaning of another word or phrase when they are combined. They apply this matrix–vector representation to both compositions and relations.

Turney (2012) represents words with two vectors, a vector from domain space and a vector from function space. The domain vector captures the topic or field of the word and the function vector captures the functional role of the word. This dual-space model is applied to both compositions and relations.
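For illustration only, the dual-space representation can be pictured as two small lookup tables; the words and vectors below are invented, not taken from the model:

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity, the comparison used in each space.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Each word is represented twice: once in domain space (topic or field)
# and once in function space (role or usage). Toy values throughout.
domain = {"carpenter": np.array([0.9, 0.1, 0.2]),
          "wood":      np.array([0.8, 0.2, 0.1]),
          "mason":     np.array([0.1, 0.9, 0.2])}
function = {"carpenter": np.array([0.7, 0.3]),
            "mason":     np.array([0.7, 0.2])}

# carpenter and wood share a topic (high domain similarity); carpenter and
# mason play similar roles (high function similarity).
print(cosine(domain["carpenter"], domain["wood"]))
print(cosine(function["carpenter"], function["mason"]))
```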
Here we extend the dual-space model of Turney (2012) in two ways: Hand-coding is replaced with supervised learning and two new sets of features augment domain and function space. Moving to supervised learning instead of hand-coding makes it easier to introduce new features.

In the dual-space model, parameterized similarity measures provided the input values for hand-crafted functions. Each task required a different set of hand-crafted functions. The parameters of the similarity measures were tuned using a customized grid search algorithm. The grid search algorithm was not suitable for integration with a supervised learning algorithm. The insight behind SuperSim is that, given appropriate features, a supervised learning algorithm can replace the grid search algorithm and the hand-crafted functions.

3 Features for Tuple Classification

We represent a tuple with four types of features, all based on frequencies in a large corpus. The first type of feature is the logarithm of the frequency of a word. The second type is the positive pointwise mutual information (PPMI) between two words (Church and Hanks, 1989; Bullinaria and Levy, 2007). Third and fourth are the similarities of two words in domain and function space (Turney, 2012).

In the following experiments, we use the PPMI matrix from Turney et al. (2011) and the domain and function matrices from Turney (2012).[1] The three matrices and the word frequency data are based on the same corpus, a collection of web pages gathered from university websites, containing 5 × 10^10 words.[2] All three matrices are word–context matrices, in which the rows correspond to terms (words and phrases) in WordNet.[3] The columns correspond to the contexts in which the terms appear; each matrix involves a different kind of context.

[1] The three matrices and the word frequency data are available on request from the author. The matrix files range from two to five gigabytes when packaged and compressed for distribution.
[2] The corpus was collected by Charles Clarke at the University of Waterloo. It is about 280 gigabytes of plain text.
[3] See http://wordnet.princeton.edu/ for information about WordNet.

Let ⟨x_1, x_2, ..., x_n⟩ be an n-tuple of words. The number of features we use to represent this tuple increases as a function of n.

The first set of features consists of log frequency values for each word x_i in the n-tuple. Let freq(x_i) be the frequency of x_i in the corpus. We define LF(x_i) as log(freq(x_i) + 1). If x_i is not in the corpus, freq(x_i) is zero, and thus LF(x_i) is also zero. There are n log frequency features, one LF(x_i) feature for each word in the n-tuple.

The second set of features consists of positive pointwise mutual information values for each pair of words in the n-tuple. We use the raw PPMI matrix from Turney et al. (2011). Although they computed the singular value decomposition (SVD) to project the row vectors into a lower-dimensional space, we need the original high-dimensional columns for our features. The raw PPMI matrix has 114,501 rows and 139,246 columns with a density of 1.2%. For each term in WordNet, there is a corresponding row in the raw PPMI matrix. For each unigram in WordNet, there are two corresponding columns in the raw PPMI matrix, one marked left and the other right.

Suppose x_i corresponds to the i-th row of the PPMI matrix and x_j corresponds to the j-th column, marked left. The value in the i-th row and j-th column of the PPMI matrix, PPMI(x_i, x_j, left), is the positive pointwise mutual information of x_i and x_j co-occurring in the corpus, where x_j is the first word to the left of x_i, ignoring any intervening stop words (that is, ignoring any words that are not in WordNet). If x_i (or x_j) has no corresponding row (or column) in the matrix, then the PPMI value is set to zero.

Turney et al. (2011) estimated PPMI(x_i, x_j, left) by sampling the corpus for phrases containing x_i and then looking for x_j to the left of x_i in the sampled phrases (and likewise for right). Due to this sampling process, PPMI(x_i, x_j, left) does not necessarily equal PPMI(x_j, x_i, right). For example, suppose x_i is a rare word and x_j is a common word.
With PPMI(x_i, x_j, left), when we sample phrases containing x_i, we are relatively likely to find x_j in some of these phrases. With PPMI(x_j, x_i, right), when we sample phrases containing x_j, we are less likely to find any phrases containing x_i. Although, in theory, PPMI(x_i, x_j, left) should equal PPMI(x_j, x_i, right), they are likely to be unequal given a limited sample.
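To make the first two feature sets concrete, here is a small sketch. The frequency and PPMI tables are invented stand-ins for the corpus statistics described above; PPMI itself is max(0, log(p(x, y) / (p(x) p(y)))) (Church and Hanks, 1989), and here it is simply looked up:

```python
import math
from itertools import permutations

# Invented stand-ins for the corpus statistics described above.
freq = {"fish": 120000, "tank": 45000, "aquarium": 8000}
ppmi = {("fish", "tank", "left"): 3.1,   # sparse table; missing entries are 0
        ("tank", "fish", "right"): 2.8}

def LF(word):
    # Log frequency feature: log(freq(x_i) + 1); zero if the word is unseen.
    return math.log(freq.get(word, 0) + 1)

def PPMI(x_i, x_j, handedness):
    # Lookup in the raw PPMI matrix; zero when the row or column is missing.
    return ppmi.get((x_i, x_j, handedness), 0.0)

def first_two_feature_sets(words):
    # n log-frequency features, then 2n(n-1) PPMI features: one left and
    # one right value for every ordered pair <x_i, x_j> with i != j.
    feats = [LF(w) for w in words]
    for x_i, x_j in permutations(words, 2):
        feats.append(PPMI(x_i, x_j, "left"))
        feats.append(PPMI(x_i, x_j, "right"))
    return feats

# 3 LF features + 2 * 3 * 2 = 12 PPMI features -> 15 features for a triple.
print(first_two_feature_sets(("fish", "tank", "aquarium")))
```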
Fromthen-tuple,weselectallofthen(n−1)pairs,hxi,xji,suchthati6=j.Wethengener-atetwofeaturesforeachpair,PPMI(xi,xj,left)andPPMI(xi,xj,right).Thusthereare2n(n−1)PPMIvaluesinthesecondsetoffeatures.Thethirdsetoffeaturesconsistsofdomainspacesimilarityvaluesforeachpairofwordsinthen-tuple.Domainspacewasdesignedtocapturethetopicofaword.Turney(2012)firstconstructedafrequencymatrix,inwhichtherowscorrespondtotermsinWordNetandthecolumnscorrespondtonearbynouns.Givenatermxi,thecorpuswassam-pledforphrasescontainingxiandthephraseswereprocessedwithapart-of-speechtagger,toidentifynouns.Ifthenounxjwastheclosestnountotheleftorrightofxi,thenthefrequencycountforthei-throwandj-thcolumnwasincremented.Thehypoth-esiswasthatthenounsnearatermcharacterizethetopicsassociatedwiththeterm.Theword–contextfrequencymatrixfordomainspacehas114,297rows(terms)and50,000columns(nouncontexts,topics),withadensityof2.6%.ThefrequencymatrixwasconvertedtoaPPMImatrixandthensmoothedwithSVD.TheSVDyieldsthreematrices,U,Σ,andV.AtermindomainspaceisrepresentedbyarowvectorinUkΣpk.Theparameterkspecifiesthenum-berofsingularvaluesinthetruncatedsingularvaluedecomposition;thatis,kisthenumberoflatentfactorsinthelow-dimensionalrepresentationoftheterm(LandauerandDumais,1997).WegenerateUkandΣkbydeletingthecolumnsinUandΣcorrespondingtothesmallestsingularvalues.TheparameterpraisesthesingularvaluesinΣktothepowerp(Caron,2001).Aspgoesfromonetozero,factorswithsmallersingularvaluesaregivenmoreweight.Thishastheeffectofmakingthesimilaritymeasuremorediscriminating(Turney,2012).Thesimilarityoftwowordsindomainspace,Dom(xi,xj,k,P),iscomputedbyextractingtherowvectorsinUkΣpkthatcorrespondtothewordsxiandxj,andthencalculatingtheircosine.Optimalper-formancerequirestuningtheparameterskandpforthetask(BullinariaandLevy,2012;Turney,2012).Inthefollowingexperiments,weavoiddirectlytun-ingkandpbygeneratingfeatureswithavarietyofvaluesforkandp,allowingthesupervisedlearningalgorithmtodecidewhichfeaturestouse.FeaturesetSizeofsetLF(xi)nPPMI(xi,xj,handedness)2N(n−1)Dom(xi,xj,k,P)12N(n−1)nknpFun(xi,xj,k,P)12N(n−1)nknpTable1:Thefoursetsoffeaturesandtheirsizes.Fromthen-tuple,weselectall12n(n−1)pairs,hxi,xji,suchthati