Transactions of the Association for Computational Linguistics, vol. 3, pp. 59–71, 2015. Action Editor: Hwee Tou Ng.
Submission batch: 10/2014; Revision batch: 12/2014; Revision batch: 1/2015; Published 1/2015.
© 2015 Association for Computational Linguistics.
A Sense-Topic Model for Word Sense Induction with Unsupervised Data Enrichment

Jing Wang∗  Mohit Bansal†  Kevin Gimpel†  Brian D. Ziebart∗  Clement T. Yu∗

∗University of Illinois at Chicago, Chicago, IL, 60607, USA
{jwang69,bziebart,cyu}@uic.edu
†Toyota Technological Institute at Chicago, Chicago, IL, 60637, USA
{mbansal,kgimpel}@ttic.edu

Abstract

Word sense induction (WSI) seeks to automatically discover the senses of a word in a corpus via unsupervised methods. We propose a sense-topic model for WSI, which treats sense and topic as two separate latent variables to be inferred jointly. Topics are informed by the entire document, while senses are informed by the local context surrounding the ambiguous word. We also discuss unsupervised ways of enriching the original corpus in order to improve model performance, including using neural word embeddings and external corpora to expand the context of each data instance. We demonstrate significant improvements over the previous state-of-the-art, achieving the best results reported to date on the SemEval-2013 WSI task.

1 Introduction

Word sense induction (WSI) is the task of automatically discovering all senses of an ambiguous word in a corpus. The inputs to WSI are instances of the ambiguous word with its surrounding context. The output is a grouping of these instances into clusters corresponding to the induced senses. WSI is generally conducted as an unsupervised learning task, relying on the assumption that the surrounding context of a word indicates its meaning. Most previous work assumed that each instance is best labeled with a single sense, and therefore, that each instance belongs to exactly one sense cluster. However, recent work (Erk and McCarthy, 2009; Jurgens, 2013) has shown that more than one sense can be used to interpret certain instances, due to context ambiguity and sense relatedness.

To handle these characteristics of WSI (unsupervised, senses represented by token clusters, multiple senses per instance), we consider approaches based on topic models. A topic model is an unsupervised method that discovers the semantic topics underlying a collection of documents. The most popular is latent Dirichlet allocation (LDA; Blei et al., 2003), in which each topic is represented as a multinomial distribution over words, and each document is represented as a multinomial distribution over topics. One approach would be to run LDA on the instances for an ambiguous word, then simply interpret topics as induced senses (Brody and Lapata, 2009). However, while sense and topic are related, they are distinct linguistic phenomena. Topics are assigned to entire documents and are expressed by all word tokens, while senses relate to a single ambiguous word and are expressed through the local context of that word. One possible approach would be to only keep the local context of each ambiguous word, discarding the global context. However, the topical information contained in the broader context, though it may not determine the sense directly, might still be useful for narrowing down the likely senses of the ambiguous word.

Consider the ambiguous word cold. In the sentence "His reaction to the experiments was cold", the possible senses for cold include cold temperature, a cold sensation, common cold, or a negative emotional reaction. However, if we know that the topic of the document concerns the effects of low temperatures on physical health, then the negative emotional reaction sense should become less likely. Therefore, in this case, knowing the topic helps narrow down the set of plausible senses.
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
1
2
2
1
5
6
6
7
3
6
/
/
t
je
un
c
_
un
_
0
0
1
2
2
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
60
At the same time, knowing the sense can also help determine possible topics. Consider a set of texts that all include the word cold. Without further information, the texts might discuss any of a number of possible topics. However, if the sense of cold is that of cold ischemia, then the most probable topics would be those related to organ transplantation.

In this paper, we propose a sense-topic model for WSI, which treats sense and topic as two separate latent variables to be inferred jointly (§4). When relating the sense and topic variables, a bidirectional edge is drawn between them to represent their cyclic dependence (Heckerman et al., 2001). We perform inference using collapsed Gibbs sampling (§4.2), then estimate the sense distribution for each instance as the solution to the WSI task. We conduct experiments on the SemEval-2013 Task 13 WSI dataset, showing improvements over several strong baselines and task systems (§5).

We also present unsupervised ways of enriching our dataset, including using neural word embeddings (Mikolov et al., 2013) and external Web-scale corpora to enrich the context of each data instance or to add more instances (§6). Each data enrichment method gives further gains, resulting in significant improvements over existing state-of-the-art WSI systems. Overall, we find gains of up to 22% relative improvement in fuzzy B-cubed and 50% relative improvement in fuzzy normalized mutual information (Jurgens and Klapaftis, 2013).

2 Background and Related Work

We discuss the WSI task, then discuss several areas of research that are related to our approach, including applications of topic modeling to WSI as well as other approaches that use word embeddings and clustering algorithms.

WSD and WSI: WSI is related to but distinct from word sense disambiguation (WSD). WSD seeks to assign a particular sense label to each target word instance, where the sense labels are known and usually drawn from an existing sense inventory like WordNet (Miller et al., 1990). Although extensive research has been devoted to WSD, WSI may be more useful for downstream tasks. WSD relies on sense inventories whose construction is time-intensive, expensive, and subject to poor inter-annotator agreement (Passonneau et al., 2010). Sense inventories also impose a fixed sense granularity for each ambiguous word, which may not match the ideal granularity for the task of interest. Finally, they may lack domain-specific senses and are difficult to adapt to low-resource domains or languages. In contrast, senses induced by WSI are more likely to represent the task and domain of interest. Researchers in machine translation and information retrieval have found that predefined senses are often not well-suited for these tasks (Voorhees, 1993; Carpuat and Wu, 2005), while induced senses can lead to improved performance (Véronis, 2004; Vickrey et al., 2005; Carpuat and Wu, 2007).

Topic Modeling for WSI: Brody and Lapata (2009) proposed a topic model that uses a weighted combination of separate LDA models based on different feature sets (e.g. word tokens, parts of speech, and dependency relations). They only used smaller units of text surrounding the ambiguous word, discarding the global context of each instance. Yao and Van Durme (2011) proposed a model based on a hierarchical Dirichlet process (HDP; Teh et al., 2006), which has the advantage that it can automatically discover the number of senses. Lau et al. (2012) described a model based on an HDP with positional word features; it formed the basis for their submission (unimelb, Lau et al., 2013) to the SemEval-2013 WSI task (Jurgens and Klapaftis, 2013).

Our sense-topic model is distinct from this prior work in that we model sense and topic as two separate latent variables and learn them jointly. We compare to the performance of unimelb in §5. For word sense disambiguation, there also exist several approaches that use topic models (Cai et al., 2007; Boyd-Graber and Blei, 2007; Boyd-Graber et al., 2007; Li et al., 2010); space does not permit a full discussion.

Word Representations for WSI: Another approach to solving WSI is to use word representations built by distributional semantic models (DSMs; Sahlgren, 2006) or neural net language models (NNLMs; Bengio et al., 2003; Mnih and Hinton, 2007). Their assumption is that words with similar distributions have similar meanings. Akkaya et al. (2012) use word representations learned from DSMs directly for WSI.
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
1
2
2
1
5
6
6
7
3
6
/
/
t
je
un
c
_
un
_
0
0
1
2
2
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
61
Each word is represented by a co-occurrence vector, and the meaning of an ambiguous word in a specific context is computed through element-wise multiplication applied to the vector of the target word and its surrounding words in the context. Then instances are clustered by hierarchical clustering based on their representations.

Word representations trained by NNLMs, often called word embeddings, capture information via training criteria based on predicting nearby words. They have been useful as features in many NLP tasks (Turian et al., 2010; Collobert et al., 2011; Dhillon et al., 2012; Hisamoto et al., 2013; Bansal et al., 2014). The similarity between two words can be computed using cosine similarity of their embedding vectors. Word embeddings are often also used to build representations for larger units of text, such as sentences, through vector operations (e.g., summation) applied to the vector of each token in the sentence. In our work, we use word embeddings to compute word similarities (for better modeling of our data distribution), to represent sentences (to find similar sentences in external corpora for data enrichment), and in a product-of-embeddings baseline.

Baskaya et al. (2013) represent the context of each ambiguous word by using the most likely substitutes according to a 4-gram LM. They pair the ambiguous word with likely substitutes, project the pairs onto a sphere (Maron et al., 2010), and obtain final senses via k-means clustering. We compare to their SemEval-2013 system AI-KU (§5).

Other Approaches to WSI: Other approaches include clustering algorithms to partition instances of an ambiguous word into sense-based clusters (Schütze, 1998; Pantel and Lin, 2002; Purandare and Pedersen, 2004), or graph-based methods to induce senses (Dorow and Widdows, 2003; Véronis, 2004; Agirre and Soroa, 2007).

3 Problem Setting

In this paper, we induce senses for a set of word types, which we refer to as target words. For each target word, we have a set of instances. Each instance provides context for a single occurrence of the target word.[1] For our experiments, we use the dataset released for SemEval-2013 Task 13 (Jurgens and Klapaftis, 2013), collected from the Open American National Corpus (OANC; Ide and Suderman, 2004).[2] It includes 50 target words: 20 verbs, 20 nouns, and 10 adjectives. There are a total of 4,664 instances across all target words. Each instance contains only one sentence, with a minimum length of 22 and a maximum length of 100. The gold standard for the dataset was prepared by multiple annotators, where each annotator labeled instances based on the sense inventories in WordNet 3.1. For each instance, they rated all senses of a target word on a Likert scale from one to five.

[1] The target word token may occur multiple times in an instance, but only one occurrence is chosen as the target word occurrence.
[2] "Word Sense Induction for Graded and Non-Graded Senses," http://www.cs.york.ac.uk/semeval-2013/task13

Figure 1: Proposed sense-topic model in plate notation. There are M_D instances for the given target word. In an instance, there are N_g global context words (w_g) and N_ℓ local context words (w_ℓ), all of which are observed. There is one latent variable ("topic" t_g) for the w_g and two latent variables ("topic" t_ℓ and "sense" s_ℓ) for the w_ℓ. Each instance has topic mixing proportions θ_t and sense mixing proportions θ_s. For clarity, not all variables are shown. The complete figure with all variables is given in Appendix A. This is a dependency network, not a directed graphical model, as shown by the directed arrows between t_ℓ and s_ℓ; see text for details.

4 A Sense-Topic Model for WSI

We now present our sense-topic model, shown in plate notation in Figure 1. It generates the words in the set of instances for a single target word; we run the model separately for each target word, sharing no parameters across target words. We treat sense and topic as two separate latent variables to be inferred jointly. To differentiate sense and topic, we use a window around the target word in each instance. Word tokens inside the window are local context words (w_ℓ), while tokens outside the window are global context words (w_g). The number of words in the window is fixed to 21 in all experiments (10 words before the target word and 10 after).
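As an illustration of this split, the following is a minimal sketch, assuming whitespace-tokenized instances and a known index of the target occurrence (both simplifications are ours, not details from the paper), that partitions an instance into local and global context words using the ±10-token window described above.

```python
def split_context(tokens, target_idx, window=10):
    """Partition an instance's tokens into local and global context words.

    Local context words fall within `window` tokens on either side of the
    target occurrence; everything else is a global context word.  Whether
    the target token itself is kept among the local words is left open in
    the text; here we exclude it.
    """
    lo = max(0, target_idx - window)
    hi = min(len(tokens), target_idx + window + 1)
    local = [w for i, w in enumerate(tokens) if lo <= i < hi and i != target_idx]
    global_ = [w for i, w in enumerate(tokens) if i < lo or i >= hi]
    return local, global_

# Example with "cold" as the target word in a short instance.
tokens = "his reaction to the experiments was cold and distant".split()
local_words, global_words = split_context(tokens, tokens.index("cold"))
```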
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
1
2
2
1
5
6
6
7
3
6
/
/
t
je
un
c
_
un
_
0
0
1
2
2
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
62
Generating global context words: As shown in the left part of Figure 1, each global context word w_g is generated from a latent topic variable t_g for the instance, which follows the same generative story as LDA. The corresponding probability of the ith global context word w^(i)_g within instance d is:[3]

\[
\Pr(w^{(i)}_g \mid d, \theta_t, \psi_t) = \sum_{j=1}^{T} P_{\psi_{t_j}}(w^{(i)}_g \mid t^{(i)}_g = j)\, P_{\theta_t}(t^{(i)}_g = j \mid d) \quad (1)
\]

where T is the number of topics, P_{ψ_{tj}}(w^(i)_g | t^(i)_g = j) is the multinomial distribution over words for topic j (parameterized by ψ_{tj}) and P_{θ_t}(t^(i)_g = j | d) is the multinomial distribution over topics for instance d (parameterized by θ_t).

Generating local context words: A local context word w_ℓ is generated from a topic variable t_ℓ and a sense variable s_ℓ:

\[
\Pr(w_\ell \mid d, \theta_t, \psi_t, \theta_s, \psi_s, \theta_{s|t}, \theta_{t|s}, \theta_{st}) = \sum_{j=1}^{T}\sum_{k=1}^{S} \Pr(w_\ell \mid t_\ell = j, s_\ell = k)\, \Pr(t_\ell = j, s_\ell = k \mid d) \quad (2)
\]

where S is the number of senses, Pr(w_ℓ | t_ℓ = j, s_ℓ = k) is the probability of generating word w_ℓ given topic j and sense k, and Pr(t_ℓ = j, s_ℓ = k | d) is the joint probability over topics and senses for d.[4]

Unlike in Eq. (1), we do not use multinomial parameterizations for the distributions in Eq. (2). When parameterizing them, we make several departures from purely-generative modeling. All our choices result in distributions over smaller event spaces and/or those that condition on fewer variables. This helps to mitigate data sparsity issues arising from attempting to estimate high-dimensional distributions from small datasets. A secondary benefit is that we can avoid biases caused by particular choices of generative directionality in the model. We later include an empirical comparison to justify some of our modeling choices (§5).

[3] We use Pr() for generic probability distributions without further qualifiers and P_θ() for distributions parameterized by θ.
[4] For clarity, we drop the (i) superscripts in these and the following equations.

First, when relating the sense and topic variables, we avoid making a single decision about generative dependence. Taking inspiration from dependency networks (Heckerman et al., 2001), we use the following factorization:

\[
\Pr(t_\ell = j, s_\ell = k \mid d) = \frac{1}{Z_d}\,\Pr(s_\ell = k \mid d, t_\ell = j)\,\Pr(t_\ell = j \mid d, s_\ell = k) \quad (3)
\]

where Z_d is a normalization constant.

We factorize further by using redundant probabilistic events, then ignore the normalization constants during learning, a concept commonly called deficiency (Brown et al., 1993). Deficient modeling has been found to be useful for a wide range of NLP tasks (Klein and Manning, 2002; May and Knight, 2007; Toutanova and Johnson, 2007). In particular, we factor the conditional probabilities in Eq. (3) into products of multinomial probabilities:

\[
\Pr(s_\ell = k \mid d, t_\ell = j) = \frac{P_{\theta_s}(s_\ell = k \mid d)\, P_{\theta_{s|t_j}}(s_\ell = k \mid t_\ell = j)\, P_{\theta_{st}}(t_\ell = j, s_\ell = k)}{Z_{d,t_j}}
\]
\[
\Pr(t_\ell = j \mid d, s_\ell = k) = \frac{P_{\theta_t}(t_\ell = j \mid d)\, P_{\theta_{t|s_k}}(t_\ell = j \mid s_\ell = k)}{Z_{d,s_k}}
\]

where Z_{d,tj} and Z_{d,sk} are normalization factors and we have introduced new multinomial parameters θ_s, θ_{s|tj}, θ_{st}, and θ_{t|sk}.

We use the same idea to factor the word generation distribution:

\[
\Pr(w_\ell \mid t_\ell = j, s_\ell = k) = \frac{P_{\psi_{t_j}}(w_\ell \mid t_\ell = j)\, P_{\psi_{s_k}}(w_\ell \mid s_\ell = k)}{Z_{t_j,s_k}}
\]

where Z_{tj,sk} is a normalization factor, and we have new multinomial parameters ψ_{sk} for the sense-word distributions. One advantage of this parameterization is that we naturally tie the topic-word distributions across the global and local context words by using the same parameters ψ_{tj}.
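To make the deficient factorization concrete, here is a small sketch (array names and shapes are ours, purely for illustration) that multiplies the redundant multinomial factors of Eqs. (2)-(3) for a single local context word, dropping all Z terms as described above, and then renormalizes over (sense, topic) pairs.

```python
import numpy as np

def local_word_pair_posterior(w, d, theta_t, theta_s, theta_s_given_t,
                              theta_t_given_s, theta_st, psi_t, psi_s):
    """Posterior over (sense, topic) pairs for local word w in instance d.

    Illustrative shapes: theta_t[d, j], theta_s[d, k], theta_s_given_t[j, k],
    theta_t_given_s[k, j], theta_st[k, j], psi_t[j, w], psi_s[k, w].
    Normalization constants are ignored, as in deficient modeling.
    """
    score = (theta_s[d][:, None] * theta_t[d][None, :]       # Pr(s|d), Pr(t|d)
             * theta_s_given_t.T * theta_t_given_s            # Pr(s|t), Pr(t|s)
             * theta_st                                        # Pr(t, s)
             * psi_s[:, w][:, None] * psi_t[:, w][None, :])    # Pr(w|s), Pr(w|t)
    return score / score.sum()   # shape (S, T): one probability per (sense, topic) pair
```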
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
1
2
2
1
5
6
6
7
3
6
/
/
t
je
un
c
_
un
_
0
0
1
2
2
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
63
4.1 Generative Story

We now give the full generative story of our model. We describe it for generating a set of instances of size M_D, where all instances contain the same target word. We use symmetric Dirichlet priors for all multinomial distributions mentioned above, using the same fixed hyperparameter value (α) for all. We use ψ to denote parameters of multinomial distributions over words, and θ to denote parameters of multinomial distributions over topics and/or senses. We leave unspecified the distributions over N_ℓ (number of local words in an instance) and N_g (number of global words in an instance), as we only use our model to perform inference given fixed instances, not to generate new instances.

The generative story first follows the steps described in Algo. 1 to generate parameters that are shared across all instances; then for each instance d, it follows Algo. 2 to generate global and local words.

Algorithm 1: Generative story for instance set
1: for each topic j ← 1 to T do
2:   Choose topic-word params. ψ_{tj} ∼ Dir(α)
3:   Choose topic-sense params. θ_{s|tj} ∼ Dir(α)
4: for each sense k ← 1 to S do
5:   Choose sense-word params. ψ_{sk} ∼ Dir(α)
6:   Choose sense-topic params. θ_{t|sk} ∼ Dir(α)
7: Choose topic/sense params. θ_{st} ∼ Dir(α)

Algorithm 2: Generative story for instance d
1: Choose topic proportions θ_t ∼ Dir(α)
2: Choose sense proportions θ_s ∼ Dir(α)
3: Choose N_g and N_ℓ from unspecified distributions
4: for i ← 1 to N_g do
5:   Choose a topic j ∼ Mult(θ_t)
6:   Choose a word w_g ∼ Mult(ψ_{tj})
7: for i ← 1 to N_ℓ do
8:   repeat
9:     Choose a topic j ∼ Mult(θ_t)
10:    Choose a sense k ∼ Mult(θ_s)
11:    Choose a topic j′ ∼ Mult(θ_{t|sk})
12:    Choose a sense k′ ∼ Mult(θ_{s|tj})
13:    Choose topic/sense ⟨j″, k″⟩ ∼ Mult(θ_{st})
14:  until j = j′ = j″ and k = k′ = k″
15:  repeat
16:    Choose a word w_ℓ ∼ Mult(ψ_{tj})
17:    Choose a word w′_ℓ ∼ Mult(ψ_{sk})
18:  until w_ℓ = w′_ℓ

4.2 Inference

We use collapsed Gibbs sampling (Geman and Geman, 1984) to obtain samples from the posterior distribution over latent variables, with all multinomial parameters analytically integrated out before sampling. Then we estimate the sense distribution θ_s for each instance using maximum likelihood estimation on the samples. These sense distributions are the output of our WSI system.

We note that deficient modeling does not ordinarily affect Gibbs sampling when used for computing posteriors over latent variables, as long as parameters (the θ and ψ) are kept fixed. This is the case during the E step of an EM algorithm, which is the usual setting in which deficiency is used. Only the M step is affected; it becomes an approximate M step by assuming the normalization constants equal 1 (Brown et al., 1993).

However, here we use collapsed Gibbs sampling for posterior inference, and the analytic integration is disrupted by the presence of the normalization constants. To bypass this, we employ the standard approximation of deficient models that all normalization constants are 1, permitting us to use standard formulas for analytic integration of multinomial parameters with Dirichlet priors. Empirically, we found this "collapsed deficient Gibbs sampler" to slightly outperform a more principled approach based on EM, presumably due to the ability of collapsing to accelerate mixing.

During the sampling process, each sampler is run on the full set of instances for a target word, iterating through all word tokens in each instance. If the current word token is a global context word, we sample a new topic for it conditioned on all other latent variables across instances. If the current word is a local context word, we sample a new topic/sense pair for it, again conditioned on all other latent variable values.
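The following sketch shows one possible shape of such a sweep; it is illustrative only, not the authors' implementation. It assumes the count tables of Eqs. (4) and (5) below are kept as numpy arrays in a dict, and, for simplicity, that a single word-id space indexes both the topic-word and sense-word tables.

```python
import numpy as np

def gibbs_sweep(instances, z, counts, T, S, alpha=0.01):
    """One collapsed-Gibbs sweep over all tokens of one target word's instances.

    instances: list (one per instance) of lists of (word_id, is_local) pairs.
    z:         current assignments, mirroring `instances`; an int topic for a
               global token, a (topic, sense) pair for a local token.
    counts:    dict of count arrays C_dt[d,j], C_wt[w,j], C_ds[d,k],
               C_ws[w,k], C_st[k,j] mirroring Eqs. (4)-(5).
    """
    C_dt, C_wt, C_ds, C_ws, C_st = (counts[name] for name in
                                    ("dt", "wt", "ds", "ws", "st"))
    Wt, Ws = C_wt.shape[0], C_ws.shape[0]
    for d, inst in enumerate(instances):
        for i, (w, is_local) in enumerate(inst):
            if is_local:
                j, k = z[d][i]                       # remove the current token's counts
                C_dt[d, j] -= 1; C_wt[w, j] -= 1
                C_ds[d, k] -= 1; C_ws[w, k] -= 1; C_st[k, j] -= 1
                # Unnormalized Pr(t=j, s=k | ...) of Eq. (5): product of seven ratios.
                p_t  = (C_dt[d] + alpha) / (C_dt[d].sum() + T * alpha)
                p_wt = (C_wt[w] + alpha) / (C_wt.sum(0) + Wt * alpha)
                p_s  = (C_ds[d] + alpha) / (C_ds[d].sum() + S * alpha)
                p_ws = (C_ws[w] + alpha) / (C_ws.sum(0) + Ws * alpha)
                p_st1 = (C_st + alpha) / (C_st.sum(0, keepdims=True) + S * alpha)
                p_st2 = (C_st + alpha) / (C_st.sum(1, keepdims=True) + T * alpha)
                p_st3 = (C_st + alpha) / (C_st.sum() + S * T * alpha)
                score = (p_t[None, :] * p_wt[None, :] * p_s[:, None] *
                         p_ws[:, None] * p_st1 * p_st2 * p_st3)   # shape (S, T)
                flat = (score / score.sum()).ravel()
                k, j = divmod(np.random.choice(S * T, p=flat), T)
                z[d][i] = (j, k)                     # restore counts with the new pair
                C_dt[d, j] += 1; C_wt[w, j] += 1
                C_ds[d, k] += 1; C_ws[w, k] += 1; C_st[k, j] += 1
            else:
                j = z[d][i]
                C_dt[d, j] -= 1; C_wt[w, j] -= 1
                # Unnormalized Pr(t=j | ...) of Eq. (4).
                p = ((C_dt[d] + alpha) / (C_dt[d].sum() + T * alpha) *
                     (C_wt[w] + alpha) / (C_wt.sum(0) + Wt * alpha))
                j = np.random.choice(T, p=p / p.sum())
                z[d][i] = j
                C_dt[d, j] += 1; C_wt[w, j] += 1
```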
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
1
2
2
1
5
6
6
7
3
6
/
/
t
je
un
c
_
un
_
0
0
1
2
2
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
64
We write the conditional posterior distribution over topics for global context word token i in instance d as Pr(t^(i)_g = j | d, t_{−i}, s, ·), where t^(i)_g = j is the topic assignment of token i, d is the current instance, t_{−i} is the set of topic assignments of all word tokens aside from i for instance d, s is the set of sense assignments for all local word tokens in instance d, and "·" stands for all other observed or known information, including all words, all Dirichlet hyperparameters, and all latent variable assignments in other instances. The conditional posterior can be computed by:

\[
\Pr(t^{(i)}_g = j \mid d, \mathbf{t}_{-i}, \mathbf{s}, \cdot) \;\propto\;
\underbrace{\frac{C^{DT}_{dj} + \alpha}{\sum_{k=1}^{T} C^{DT}_{dk} + T\alpha}}_{\Pr(t = j \mid d,\, \cdot)}\;
\underbrace{\frac{C^{WT}_{ij} + \alpha}{\sum_{k'=1}^{W_t} C^{WT}_{k'j} + W_t\alpha}}_{\Pr(w^{(i)}_g \mid t = j,\, \cdot)} \quad (4)
\]

where we use the superscript DT as a mnemonic for "instance/topic" when counting topic assignments in an instance and WT for "word/topic" when counting topic assignments for a word. C^{DT}_{dj} contains the number of times topic j is assigned to some word token in instance d, excluding the current word token w^(i)_g; C^{WT}_{ij} is the number of times word w^(i)_g is assigned to topic j, across all instances, excluding the current word token. W_t is the number of distinct word types in the full set of instances. We show the corresponding conditional posterior probabilities underneath each term; the count ratios are obtained using standard Dirichlet-multinomial collapsing.

The conditional posterior distribution over topic/sense pairs for a local context word token w^(i)_ℓ can be computed by:

\[
\Pr(t^{(i)}_\ell = j, s^{(i)}_\ell = k \mid d, \mathbf{t}_{-i}, \mathbf{s}_{-i}, \cdot) \;\propto\;
\underbrace{\frac{C^{DT}_{dj} + \alpha}{\sum_{k'=1}^{T} C^{DT}_{dk'} + T\alpha}}_{\Pr(t = j \mid d,\, \cdot)}\;
\underbrace{\frac{C^{WT}_{ij} + \alpha}{\sum_{k'=1}^{W_t} C^{WT}_{k'j} + W_t\alpha}}_{\Pr(w^{(i)}_\ell \mid t = j,\, \cdot)}\;
\underbrace{\frac{C^{DS}_{dk} + \alpha}{\sum_{k'=1}^{S} C^{DS}_{dk'} + S\alpha}}_{\Pr(s = k \mid d,\, \cdot)}\;
\underbrace{\frac{C^{WS}_{ik} + \alpha}{\sum_{k'=1}^{W_s} C^{WS}_{k'k} + W_s\alpha}}_{\Pr(w^{(i)}_\ell \mid s = k,\, \cdot)}\;
\underbrace{\frac{C^{ST}_{kj} + \alpha}{\sum_{k'=1}^{S} C^{ST}_{k'j} + S\alpha}}_{\Pr(s = k \mid t = j,\, \cdot)}\;
\underbrace{\frac{C^{ST}_{kj} + \alpha}{\sum_{j'=1}^{T} C^{ST}_{kj'} + T\alpha}}_{\Pr(t = j \mid s = k,\, \cdot)}\;
\underbrace{\frac{C^{ST}_{kj} + \alpha}{\sum_{k'=1}^{S}\sum_{j'=1}^{T} C^{ST}_{k'j'} + ST\alpha}}_{\Pr(s = k, t = j \mid \cdot)} \quad (5)
\]

where C^{DS}_{dk} contains the number of times sense k is assigned to some local word token in instance d, excluding the current word token; C^{WS}_{ik} contains the number of times word w^(i)_ℓ is assigned to sense k, excluding the current token; C^{ST}_{kj} contains the number of times sense k and topic j are assigned to some local word token. W_s is the number of distinct local context word types across the collection.

Decoding: After the sampling process, we obtain a fixed-point estimate of the sense distribution (θ_s) for each instance d using the counts from our samples. Where we use θ^k_s to denote the probability of sense k for the instance, this amounts to:

\[
\theta^{k}_{s} = \frac{C^{DS}_{dk}}{\sum_{k'=1}^{S} C^{DS}_{dk'}} \quad (6)
\]

This distribution is considered the final sense assignment distribution for the target word in instance d for the WSI task; the full distribution is fed to the evaluation metrics defined in the next section.

To inspect what the model learned, we similarly obtain the sense-word distribution (ψ_s) from the counts as follows, where ψ^i_{sk} is the probability of word type i given sense k:

\[
\psi^{i}_{sk} = \frac{C^{WS}_{ik}}{\sum_{i'=1}^{W_s} C^{WS}_{i'k}} \quad (7)
\]
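A minimal sketch of this decoding step, assuming the C^{DS} and C^{WS} counts have been accumulated over the saved samples as numpy arrays (the array names are ours):

```python
import numpy as np

def decode(C_ds, C_ws):
    """Fixed-point estimates of Eqs. (6)-(7) from accumulated sample counts.

    C_ds[d, k]: how often a local token in instance d was assigned sense k;
    C_ws[i, k]: how often word type i was assigned sense k.  Rows of theta_s
    are the per-instance sense distributions that form the WSI output;
    columns of psi_s describe each induced sense by its word distribution.
    """
    theta_s = C_ds / C_ds.sum(axis=1, keepdims=True)   # Eq. (6), one row per instance
    psi_s = C_ws / C_ws.sum(axis=0, keepdims=True)      # Eq. (7), one column per sense
    return theta_s, psi_s
```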
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
1
2
2
1
5
6
6
7
3
6
/
/
t
je
un
c
_
un
_
0
0
1
2
2
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
65
5 Experimental Results

In this section, we evaluate our sense-topic model and compare it to several strong baselines and state-of-the-art systems.

Evaluation Metrics: To evaluate WSI systems, Jurgens and Klapaftis (2013) propose two metrics: fuzzy B-cubed and fuzzy normalized mutual information (NMI). They are each computed separately for each target word, then averaged across target words. Fuzzy B-cubed prefers labeling all instances with the same sense, while fuzzy NMI prefers the opposite extreme of labeling all instances with distinct senses. Hence, we report both fuzzy B-cubed (%) and fuzzy NMI (%) in our evaluation. For ease of comparison, we also report the geometric mean of the two metrics, which we denote by AVG.[5]

SemEval-2013 Task 13 also provided a trial dataset (TRIAL) that consists of eight target ambiguous words, each with 50 instances (Erk et al., 2009). We use it for preliminary experiments of our model and for tuning certain hyperparameters, and evaluate final performance on the SemEval-2013 dataset (TEST) with 50 target words.

[5] We do not use an arithmetic mean because the effective range of the two metrics is substantially different.

Hyperparameter Tuning: We use TRIAL to analyze performance of our sense-topic model under different settings for the numbers of senses (S) and topics (T); see Table 1. We always set T = 2S for simplicity. We find that small S values work best, which is unsurprising considering the relatively small number of instances and small size of each instance. When evaluating on TEST, we use S = 3 (which gives the best AVG results on TRIAL). Later, when we add larger context or more instances (see §6), tuning on TRIAL chooses a larger S value.

S    B-cubed (%)   NMI (%)   AVG
2    42.9          4.18      13.39
3    31.9          6.50      14.40
5    22.3          8.60      13.85
7    15.4          8.72      11.61
10   12.5          10.91     11.67

Table 1: Performance on TRIAL for the sense-topic model with different numbers of senses (S). Best score in each column is bold.

During inference, the Gibbs sampler was run for 4,000 iterations for each target word, setting the first 500 iterations as the burn-in period. In order to get a representative set of samples, every 13th sample (after burn-in) is saved to prevent correlations among samples. Due to the randomized nature of the inference procedure, all reported results are average scores over 5 runs. The hyperparameters (α) for all Dirichlet priors in our model are set to the (untuned) value of 0.01, following prior work on topic modeling (Griffiths and Steyvers, 2004; Heinrich, 2005).

Baselines: We include two naïve baselines corresponding to the two extremes (biases) preferred by fuzzy B-cubed and NMI, respectively: 1 sense (label each instance with the same single sense) and all distinct (label each instance with its own sense). We also consider two baselines based on LDA. We run LDA for each target word in TEST, using the set of instances as the set of documents. We treat the learned topics as induced senses. When setting the number of topics (senses), we use the gold-standard number of senses for each target word, making this baseline unreasonably strong. We run LDA both with full context (FULL) and local context (LOCAL), using the same window size as above (10 words before and after the target word).

We also present results for the two best systems in the SemEval-2013 task (according to fuzzy B-cubed and fuzzy NMI, respectively): unimelb and AI-KU. As described in Section 2, unimelb uses hierarchical Dirichlet processes (HDPs). It extracts 50,000 extra instances for each target word as training data from the ukWac corpus, a web corpus of approximately 2 billion tokens.[6] Among all systems in the task, it performs best according to fuzzy B-cubed. AI-KU is based on a lexical substitution method; a language model is built to identify lexical substitutes for target words from the dataset and the ukWac corpus. It performed best among all systems according to fuzzy NMI.

[6] http://wacky.sslmit.unibo.it/doku.php?id=corpora

Results: In Table 2, we present results for these systems and compare them to our basic (i.e., without any data enrichment) sense-topic model with S = 3 (row 9). According to both fuzzy B-cubed and fuzzy NMI, our model outperforms the other WSI systems (LDA, AI-KU, and unimelb). Thus, we are able to achieve state-of-the-art results on the SemEval-2013 task even when only using the single sentence of context given in each instance (while AI-KU and unimelb use large training sets from ukWac). We found similar performance improvements when only tested on instances labeled with a single sense.

     Model                           Data Enrichment            Fuzzy B-cubed (%)   Fuzzy NMI (%)   AVG
1    1 sense                         –                          62.3                0               –
2    all distinct                    –                          0                   7.09            –
3    unimelb                         add 50k instances          48.3                6.0             17.02
4    AI-KU                           add 20k instances          39.0                6.5             15.92
5    LDA (LOCAL)                     none                       47.1                5.93            16.71
6    LDA (FULL)                      none                       47.3                5.79            16.55
7    LDA (FULL)                      add actual context (§6.1)  43.5                6.41            16.70
8    word embedding product (§6.3)   none                       33.3                7.24            15.53
THIS PAPER
9    Sense-Topic Model               none                       53.5                6.96            19.30
10   Sense-Topic Model               add ukWac context (§6.1)   54.5                9.74            23.04
11   Sense-Topic Model               add actual context (§6.1)  59.1                9.39            23.56
12   Sense-Topic Model               add instances (§6.2)       58.9                6.01            18.81
13   Sense-Topic Model               weight by sim. (§6.3)      55.4                7.14            19.89

Table 2: Performance on TEST for baselines and our sense-topic model. Best score in each column is bold.

Bidirectionality Analysis: To measure the impact of the bidirectional dependency between the topic and sense variables in our model, we also evaluate the performance of our sense-topic model when dropping one of the directions. In Table 3, we compare their performance with our full sense-topic model on TEST. Both unidirectional models perform worse than the full model, and dropping t → s hurts more. This result verifies our intuition that topics would help narrow down the set of likely senses, and suggests that bidirectional modeling between topic and sense is desirable for WSI.

Model        B-cubed (%)   NMI (%)   AVG
Drop s → t   52.1          6.84      18.88
Drop t → s   51.1          6.78      18.61
Full         53.5          6.96      19.30

Table 3: Performance on TEST for the sense-topic model with ablation of links between sense and topic variables.

In subsequent sections, we investigate several ways of exploiting additional data to build better-performing sense-topic models.

6 Unsupervised Data Enrichment

The primary signal used by our model is word co-occurrence information across instances. If we enrich the instances, we can have more robust co-occurrence statistics.
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
1
2
2
1
5
6
6
7
3
6
/
/
t
je
un
c
_
un
_
0
0
1
2
2
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
66
The SemEval-2013 dataset may be too small to induce meaningful senses, since there are only about 100 instances for each target word, and each instance only contains one sentence. This is why most shared task systems added instances from external corpora. In this section, we consider three unsupervised ways of enriching data and measure their impact on performance. In §6.1 we augment the context of each instance in our original dataset while keeping the number of instances fixed. In §6.2 we collect more instances of each target word from ukWac, similar to the AI-KU and unimelb systems. In §6.3, we change the distribution of words in each instance based on their similarity to the target word.

Throughout, we make use of word embeddings (see §2). We trained 100-dimensional skip-gram vectors (Mikolov et al., 2013) on English Wikipedia (tokenized/lowercased, resulting in 1.8B tokens of text) using window size 10, hierarchical softmax, and no downsampling.[7]

[7] We used a minimum count cutoff of 20 during training, then only retained vectors for the most frequent 100,000 word types, averaging the rest to get a vector for unknown words.

6.1 Adding Context

The first way we explore of enriching data is to add a broader context for each instance while keeping the number of instances unchanged. This will introduce more word tokens into the set of global context words, while keeping the set of local context words mostly unchanged, as the window size we use is typically smaller than the length of the original instance. With more global context words, the model has more evidence to learn coherent topics, which could also improve the induced senses via the connection between sense and topic.

The ideal way of enriching context for an instance is to add its actual context from the corpus from which it was extracted. To do this for the SemEval-2013 task, we find each instance in the OANC and retrieve three sentences before the instance and three sentences after. While not provided for the SemEval task, it is reasonable to assume this larger context in many real-world applications, such as information retrieval and machine translation of documents.

However, in other settings, the corpus may only have a single sentence containing the target word (e.g., search queries or machine translation of sentences). To address this, we find a semantically-similar sentence from the English ukWac corpus and append it to the instance as additional context. For each instance in the original dataset, we extract its most similar sentence that contains the same target word and add it to increase its set of global context words. To compute similarity, we first represent instances and ukWac sentences by summing the word embeddings across their word tokens, then compute cosine similarity. The ukWac sentence (s*) with the highest cosine similarity to each original instance (d) is appended to that instance:

\[
s^* = \operatorname*{argmax}_{s \in \text{ukWac}} \mathrm{sim}(d, s)
\]
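A minimal sketch of this retrieval step, assuming a dictionary of pre-trained word vectors, an average "unknown word" vector (as in footnote 7), and a pre-filtered list of candidate ukWac sentences that contain the target word; these representation details are our assumptions beyond what is described above.

```python
import numpy as np

def sentence_vector(tokens, emb, unk):
    """Represent a sentence by summing the embeddings of its tokens;
    unknown words fall back to the average `unk` vector."""
    return sum((emb.get(w, unk) for w in tokens), np.zeros_like(unk))

def most_similar_sentence(instance_tokens, candidates, emb, unk):
    """Return the candidate sentence (token list) with the highest cosine
    similarity to the instance, i.e. the s* of the argmax above."""
    v = sentence_vector(instance_tokens, emb, unk)

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    return max(candidates, key=lambda s: cos(v, sentence_vector(s, emb, unk)))
```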
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
1
2
2
1
5
6
6
7
3
6
/
/
t
je
un
c
_
un
_
0
0
1
2
2
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
67
Results: Since the vocabulary has increased, we expect we may need larger values for S and T. On TRIAL, we find best performance for S = 10, so we run on TEST with this value. Performance is shown in Table 2 (rows 10 and 11). These two methods have higher AVG scores than all others. Both their fuzzy B-cubed and NMI improvements over the baselines and previous WSI systems are statistically significant, as measured by a paired bootstrap test (p < 0.01; Efron and Tibshirani, 1994).

It is unsurprising that we find best performance with actual context. Interestingly, however, we can achieve almost the same gains when automatically finding relevant context from a different corpus. Thus, even in real-world settings where we only have a single sentence of context, we can induce substantially better senses by automatically broadening the global context in an unsupervised manner.

As a comparative experiment, we also evaluate the performance of LDA when adding actual context (Table 2, row 7). Compared with LDA with full context (FULL) in row 6, performance is slightly improved, perhaps due to the fact that longer contexts induce more accurate topics. However, those topics are not necessarily related to senses, which is why LDA with only local context actually performs best among all three LDA models. Thus we see that merely adding context does not necessarily help topic models for WSI. Importantly, since our model includes both sense and topic, we are able to leverage the additional context to learn better topics while also improving the quality of the induced senses, leading to our strongest results.

Examples: We present examples to illustrate our sense-topic model's advantage over LDA and the further improvement when adding actual context. Consider instances (1) and (2) below, with target word occurrences in bold:

(1) Nigeria then sent troops to challenge the coup, evidently to restore the president and repair Nigeria's corrupt image abroad. (image%1:07:01::/4)[8]

(2) When asked about the Bible's literal account of creation, as opposed to the attractive concept of divine creation, every major Republican presidential candidate—even Bauer—has squirmed, ducked, and tried to steer the discussion back to "faith," "morals," and the general idea that humans "were created in the image of God." (image%1:06:00::/2 image%1:09:02::/4)

[8] This is the gold standard sense label, where image%1:07:01:: indexes the WordNet senses, and 4 is the score assigned by the annotators. The possible range of a score is [1, 5].

Both instances share the common word stem president. LDA uses this to put these two instances into the same topic (i.e., sense). In our sense-topic model, president is a local context word in instance (1) but a global context word in instance (2). So the effect of sharing words is decreased, and these two instances are assigned to different senses by our model. According to the gold standard, the two instances are annotated with different senses, so our sense-topic model provides the correct prediction.

Next, consider instances (3), (4), and (5):

(3) I have recently deliberately begun to use variations of "kick ass" and "bites X in the ass" because they are colorful, evocative phrases; because, thanks to South Park, ass references are newly familiar and hilarious and because they don't evoke particularly vivid mental image of asses any longer. (image%1:09:00::/4)

(4) Also, playing video games that require rapid mental rotation of visual image enhances the spatial test scores of boys and girls alike. (image%1:06:00::/4)

(5) Practicing and solidifying modes of representation, Piaget emphasized, make it possible for the child to free thought from the here and now; create larger images of reality that take into account past, present, and future; and transform those images mentally in the service of logical thinking. (image%1:09:00::/4)

In the gold standard, instances (3) and (4) have different senses while (3) and (5) have the same sense. However, sharing the local context word "mental" triggers both LDA and our sense-topic model to assign them to the same sense label with high probability.
When augmenting the instances by their real contexts, we have a better understanding about the topics. Instance (3) is about phrase variations, instance (4) is about enhancing boys' spatial skills, while instance (5) discusses the effect of make-believe play for children's development. When LDA is run with the actual context, it leaves (4) and (5) in the same topic (i.e., sense), while assigning (3) into another topic with high probability. This could be because (4) and (5) both relate to child development, and therefore LDA considers them as sharing the same topic. However, topic is not the same as sense, especially when larger contexts are available. Our sense-topic model built on the actual context makes correct predictions, leaving (3) and (5) in the same sense cluster while labeling (4) with a different sense.

6.2 Adding Instances

We also consider a way to augment our dataset with additional instances from an external corpus. We have no gold standard senses for these instances, so we will not evaluate our model on them; they are merely used to provide richer co-occurrence statistics about the target word so that we can perform better on the instances on which we evaluate.

If we added randomly-chosen instances (containing the target word), we would be concerned that the learned topics and senses may not reflect the distributions of the original instance set. So we only add instances that are semantically similar to instances in our original set (Moore and Lewis, 2010; Chambers and Jurafsky, 2011). Also, to avoid changing the original sense distribution by adding too many instances, we only add a single instance for each original instance. As in §6.1, for each instance in the original dataset, we find the most similar sentence in ukWac using word embeddings and add it into the dataset. Therefore, the number of instances is doubled, and we use the enriched dataset for our sense-topic model.

Results: Similarly to §6.1, on TRIAL, we find best performance for S = 10, so we run on TEST with this value. As shown in Table 2 (row 12), this improves fuzzy B-cubed by 5.4%, but fuzzy NMI is lower, making the AVG worse than the original model. A possible reason for this is that the sense distribution in the added instances disturbs that in the original set of instances, even though we picked the most semantically similar ones to add.

6.3 Weighting by Word Similarity

Another approach is inspired by the observation that each local context token is treated equally in terms of its contribution to the sense. However, our intuition is that certain tokens are more indicative than others. Consider the target word window. Since glass evokes a particular sense of window, we would like to weight it more highly than, say, day.

To measure word relatedness, we use cosine similarity of word embeddings. We (softly) replicate each local context word according to its exponentiated cosine similarity to the target word.[9] The result is that the local context in each instance has been modified to contain fewer occurrences of unrelated words and more occurrences of related words. If each cosine similarity is 0, we obtain our original sense-topic model.

[9] Cosine similarities range from -1 to 1, so we use exponentiation to ensure we always use positive counts.

During inference, the posterior sense distribution for instance d is now given by:

\[
\Pr(s = k \mid d, \cdot) = \frac{\sum_{w \in d_\ell} \exp(\mathrm{sim}(w, w^*))\, \mathbf{1}_{s_w = k} + \alpha}{\sum_{w' \in d_\ell} \exp(\mathrm{sim}(w', w^*)) + S\alpha} \quad (8)
\]

where d_ℓ is the set of local context tokens in d, sim(w, w*) is the cosine similarity between w and target word w*, and 1_{s_w = k} is an indicator returning 1 when w is assigned to sense k and 0 otherwise. The posterior distribution of sampling a token of word w_i from sense k becomes:

\[
\frac{C^{WS}_{ik}\, \exp(\mathrm{sim}(w_i, w^*)) + \alpha}{\sum_{i'=1}^{W_s} C^{WS}_{i'k}\, \exp(\mathrm{sim}(w_{i'}, w^*)) + W_s\alpha} \quad (9)
\]

where C^{WS}_{ik} counts the number of times w_i is assigned to sense k.
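To make the soft replication concrete, the following sketch (array names are ours) computes the similarity-weighted ratio of Eq. (9) for one word type under every sense, with each sense-word count scaled by the exponentiated cosine similarity of its word to the target word.

```python
import numpy as np

def weighted_sense_word_posterior(C_ws, w_i, sims, alpha=0.01):
    """Eq. (9): probability of word type w_i under each sense, with counts
    softly replicated by exp(cosine similarity to the target word).

    C_ws[i, k]: count of word type i assigned to sense k (current token
    excluded, as during sampling); sims[i]: cosine similarity of word type i
    to the target word.
    """
    Ws = C_ws.shape[0]
    scaled = C_ws * np.exp(sims)[:, None]      # soft replication of each count
    num = scaled[w_i] + alpha
    den = scaled.sum(axis=0) + Ws * alpha
    return num / den                           # one probability per sense k
```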
Results: We again use TRIAL to tune S (and still use T = 2S). We find best TRIAL performance at S = 3; this is unsurprising since this approach does not change the vocabulary. In Table 2, we present results on TEST with S = 3 (row 13). We also report an additional baseline: "word embedding product" (row 8), where we represent each instance by multiplying (element-wise) the word vectors of all local context words, and then feed the instance vectors into the fuzzy c-means clustering algorithm (Pal and Bezdek, 1995), c = 3. Compared to this baseline, our approach improves 4.36% on average; compared with results for the original sense-topic model (row 9), this approach improves 0.69% on average.

In Table 4 we show the top-5 terms for each sense induced for image, both for the original sense-topic model and when additionally weighting by similarity. We find that the original model provides less distinguishable senses, as it is difficult to derive separate senses from these top terms. In contrast, senses learned from the model with weighted similarities are more distinct. Sense 1 relates to mental representation; sense 2 is about visual representation produced on a surface; and sense 3 is about the general impression that something presents to the public.

Sense   Top-5 terms per sense
Sense-Topic Model
1       include, depict, party, paint, visual
2       zero, manage, company, culture, figure
3       create, clinton, people, american, popular
+ weight by similarity (§6.3)
1       depict, create, culture, mental, include
2       picture, visual, pictorial, matrix, movie
3       public, means, view, american, story

Table 4: Top 5 terms for each sense induced for the noun image by the sense-topic model and when weighting local context words by similarity. S = 3 for both.

7 Conclusions and Future Work

We presented a novel sense-topic model for the problem of word sense induction. We considered sense and topic as distinct latent variables, defining a model that generates global context words using topic variables and local context words using both topic and sense variables. Sense and topic are related using a bidirectional dependency with a robust parameterization based on deficient modeling.

We explored ways of enriching data using word embeddings from neural language models and external corpora. We found enriching context to be most effective, even when the original context of the instance is not available. Evaluating on the SemEval-2013 WSI dataset, we demonstrate that our model yields significant improvements over current state-of-the-art systems, giving 59.1% fuzzy B-cubed and 9.39% fuzzy NMI in our best setting. Moreover, we find that modeling both sense and topic is critical to enable us to effectively exploit broader context, showing that LDA does not improve when each instance is enriched by actual context.

In future work, we plan to further explore the space of sense-topic models, including non-deficient models. One possibility is to use "switching variables" (Paul and Girju, 2009) to choose whether to generate each word from a topic or sense, with a stronger preference to generate from senses closer to the target word. Another possibility is to use locally-normalized log-linear distributions and include features pairing words with particular senses and topics, rather than redundant generative steps.

Appendix A

The plate diagram for the complete sense-topic model is shown in Figure 2.

Figure 2: Plate notation for the proposed sense-topic model with all variables (except α, the fixed Dirichlet hyperparameter used as prior for all multinomial distributions). Each instance has topic mixing proportions θ_t and sense mixing proportions θ_s. The instance set shares sense/topic parameter θ_st, topic-sense distribution θ_{s|t}, sense-topic distribution θ_{t|s}, topic-word distribution ψ_t, and sense-word distribution ψ_s.

Acknowledgments

We thank the editor and the anonymous reviewers for their helpful comments. This research was partially supported by NIH LM010817. The opinions expressed in this work are those of the authors and do not necessarily reflect the views of the funding agency.
References

E. Agirre and A. Soroa. 2007. SemEval-2007 Task 02: Evaluating word sense induction and discrimination systems. In Proc. of SemEval, pages 7–12.
C. Akkaya, J. Wiebe, and R. Mihalcea. 2012. Utilizing semantic composition in distributional semantic models for word sense discrimination and word sense disambiguation. In Proc. of ICSC, pages 45–51.
M. Bansal, K. Gimpel, and K. Livescu. 2014. Tailoring continuous word representations for dependency parsing. In Proc. of ACL, pages 809–815.
O. Baskaya, E. Sert, V. Cirik, and D. Yuret. 2013. AI-KU: Using substitute vectors and co-occurrence modeling for word sense induction and disambiguation. In Proc. of SemEval, pages 300–306.
Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. 2003. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155.
D. M. Blei, A. Y. Ng, and M. I. Jordan. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022.
J. Boyd-Graber and D. M. Blei. 2007. PUTOP: Turning predominant senses into a topic model for word sense disambiguation. In Proc. of SemEval, pages 277–281.
J. Boyd-Graber, D. M. Blei, and X. Zhu. 2007. A topic model for word sense disambiguation. In Proc. of EMNLP-CoNLL, pages 1024–1033.
S. Brody and M. Lapata. 2009. Bayesian word sense induction. In Proc. of EACL, pages 103–111.
P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, and R. L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.
J. F. Cai, W. S. Lee, and Y. W. Teh. 2007. Improving word sense disambiguation using topic features. In Proc. of EMNLP-CoNLL, pages 1015–1023.
M. Carpuat and D. Wu. 2005. Word sense disambiguation vs. statistical machine translation. In Proc. of ACL, pages 387–394.
M. Carpuat and D. Wu. 2007. Improving statistical machine translation using word sense disambiguation. In Proc. of EMNLP-CoNLL, pages 61–72.
N. Chambers and D. Jurafsky. 2011. Template-based information extraction without the templates. In Proc. of ACL, pages 976–986.
R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. 2011. Natural language processing (almost) from scratch. J. Mach. Learn. Res., 12:2493–2537.
P. Dhillon, J. Rodu, D. Foster, and L. Ungar. 2012. Two Step CCA: A new spectral method for estimating vector models of words. In ICML, pages 1551–1558.
B. Dorow and D. Widdows. 2003. Discovering corpus-specific word senses. In Proc. of EACL, pages 79–82.
B. Efron and R. J. Tibshirani. 1994. An introduction to the bootstrap, volume 57. CRC Press.
K. Erk and D. McCarthy. 2009. Graded word sense assignment. In Proc. of EMNLP, pages 440–449.
K. Erk, D. McCarthy, and N. Gaylord. 2009. Investigations on word senses and word usages. In Proc. of ACL, pages 10–18.
S. Geman and D. Geman. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell., 6(6):721–741.
T. L. Griffiths and M. Steyvers. 2004. Finding scientific topics. Proc. of the National Academy of Sciences of the United States of America, 101(Suppl 1):5228–5235.
D. Heckerman, D. M. Chickering, C. Meek, R. Rounthwaite, and C. Kadie. 2001. Dependency networks for inference, collaborative filtering, and data visualization. J. Mach. Learn. Res., 1:49–75.
G. Heinrich. 2005. Parameter estimation for text analysis. Technical report.
S. Hisamoto, K. Duh, and Y. Matsumoto. 2013. An empirical investigation of word representations for parsing the web. In ANLP.
N. Ide and K. Suderman. 2004. The American National Corpus first release. In Proc. of LREC, pages 1681–1684.
D. Jurgens and I. Klapaftis. 2013. SemEval-2013 Task 13: Word sense induction for graded and non-graded senses. In Proc. of SemEval, pages 290–299.
D. Jurgens. 2013. Embracing ambiguity: A comparison of annotation methodologies for crowdsourcing word sense labels. In Proc. of NAACL, pages 556–562.
D. Klein and C. D. Manning. 2002. A generative constituent-context model for improved grammar induction. In Proc. of ACL, pages 128–135.
J. H. Lau, P. Cook, D. McCarthy, D. Newman, and T. Baldwin. 2012. Word sense induction for novel sense detection. In Proc. of EACL, pages 591–601.
J. H. Lau, P. Cook, and T. Baldwin. 2013. unimelb: Topic modelling-based word sense induction. In Proc. of SemEval, pages 307–311.
L. Li, B. Roth, and C. Sporleder. 2010. Topic models for word sense disambiguation and token-based idiom detection. In Proc. of ACL, pages 1138–1147.
Y. Maron, E. Bienenstock, and M. James. 2010. Sphere embedding: An application to part-of-speech induction. In Advances in NIPS 23.
J. May and K. Knight. 2007. Syntactic re-alignment models for machine translation. In Proc. of EMNLP-CoNLL, pages 360–368.
T. Mikolov, K. Chen, G. Corrado, and J. Dean. 2013. Efficient estimation of word representations in vector space. In Proc. of ICLR.
G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller. 1990. WordNet: An on-line lexical database. International Journal of Lexicography, 3(4).
A. Mnih and G. Hinton. 2007. Three new graphical models for statistical language modelling. In Proc. of ICML, pages 641–648.
R. C. Moore and W. Lewis. 2010. Intelligent selection of language model training data. In Proc. of ACL, pages 220–224.
N. R. Pal and J. C. Bezdek. 1995. On cluster validity for the fuzzy c-means model. Trans. Fuz. Sys., 3:370–379.
P. Pantel and D. Lin. 2002. Discovering word senses from text. In Proc. of KDD, pages 613–619.
R. J. Passonneau, A. Salleb-Aoussi, V. Bhardwaj, and N. Ide. 2010. Word sense annotation of polysemous words by multiple annotators. In Proc. of LREC.
M. Paul and R. Girju. 2009. Cross-cultural analysis of blogs and forums with mixed-collection topic models. In Proc. of EMNLP, pages 1408–1417.
A. Purandare and T. Pedersen. 2004. Word sense discrimination by clustering contexts in vector and similarity spaces. In Proc. of CoNLL, pages 41–48.
M. Sahlgren. 2006. The word-space model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. Ph.D. dissertation, Stockholm University.
H. Schütze. 1998. Automatic word sense discrimination. Comput. Linguist., 24(1):97–123.
Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. 2006. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101:1566–1581.
K. Toutanova and M. Johnson. 2007. A Bayesian LDA-based model for semi-supervised part-of-speech tagging. In Advances in NIPS 20.
J. Turian, L. Ratinov, and Y. Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proc. of ACL, pages 384–394.
J. Véronis. 2004. Hyperlex: lexical cartography for information retrieval. Computer Speech & Language, 18(3):223–252.
D. Vickrey, L. Biewald, M. Teyssier, and D. Koller. 2005. Word-sense disambiguation for machine translation. In Proc. of HLT-EMNLP, pages 771–778.
E. M. Voorhees. 1993. Using WordNet to disambiguate word senses for text retrieval. In Proc. of SIGIR, pages 171–180.
X. Yao and B. Van Durme. 2011. Nonparametric Bayesian word sense induction. In Proc. of TextGraphs-6: Graph-based Methods for Natural Language Processing, pages 10–14.