Transactions of the Association for Computational Linguistics, vol. 6, pp. 483–495, 2018. Action Editor: Hinrich Schütze.
Submission batch: 11/2017; Revision batch: 3/2018; Published 7/2018.
© 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Linear Algebraic Structure of Word Senses, with Applications to Polysemy

Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, Andrej Risteski
Computer Science Department, Princeton University
35 Olden St, Princeton, NJ 08540
{arora,yuanzhil,yingyul,tengyu,risteski}@cs.princeton.edu

Abstract

Word embeddings are ubiquitous in NLP and information retrieval, but it is unclear what they represent when the word is polysemous. Here it is shown that multiple word senses reside in linear superposition within the word embedding, and simple sparse coding can recover vectors that approximately capture the senses. The success of our approach, which applies to several embedding methods, is mathematically explained using a variant of the random walk on discourses model (Arora et al., 2016). A novel aspect of our technique is that each extracted word sense is accompanied by one of about 2000 "discourse atoms" that gives a succinct description of which other words co-occur with that word sense. Discourse atoms can be of independent interest, and make the method potentially more useful. Empirical tests are used to verify and support the theory.

1 Introduction

Word embeddings are constructed using Firth's hypothesis that a word's sense is captured by the distribution of other words around it (Firth, 1957). Classical vector space models (see the survey by Turney and Pantel (2010)) use simple linear algebra on the matrix of word-word co-occurrence counts, whereas recent neural network and energy-based models such as word2vec use an objective that involves a nonconvex (thus, also nonlinear) function of the word co-occurrences (Bengio et al., 2003; Mikolov et al., 2013a; Mikolov et al., 2013b). This nonlinearity makes it hard to discern how these modern embeddings capture the different senses of a polysemous word. The monolithic view of embeddings, with the internal information extracted only via inner product, is felt to fail in capturing word senses (Griffiths et al., 2007; Reisinger and Mooney, 2010; Iacobacci et al., 2015). Researchers have instead sought to capture polysemy using more complicated representations, e.g., by inducing separate embeddings for each sense (Murphy et al., 2012; Huang et al., 2012). These embedding-per-sense representations grow naturally out of classic Word Sense Induction or WSI (Yarowsky, 1995; Schutze, 1998; Reisinger and Mooney, 2010; Di Marco and Navigli, 2013) techniques that perform clustering on neighboring words.

The current paper goes beyond this monolithic view, by describing how multiple senses of a word actually reside in linear superposition within the standard word embeddings (e.g., word2vec (Mikolov et al., 2013a) and GloVe (Pennington et al., 2014)). By this we mean the following: consider a polysemous word, say tie, which can refer to an article of clothing, or a drawn match, or a physical act. Let's take the usual viewpoint that tie is a single token that represents monosemous words tie1, tie2, .... The theory and experiments in this paper strongly suggest that word embeddings computed using modern techniques such as GloVe and word2vec satisfy:

    v_tie ≈ α_1 v_tie1 + α_2 v_tie2 + α_3 v_tie3 + ...   (1)

where the coefficients α_i are nonnegative and v_tie1, v_tie2, etc., are the hypothetical embeddings of
the different senses, namely those that would have been induced in the thought experiment where all occurrences of the different senses were hand-labeled in the corpus. This Linearity Assertion, whereby linear structure appears out of a highly nonlinear embedding technique, is explained theoretically in Section 2, and then empirically tested in a couple of ways in Section 4.

Section 3 uses the Linearity Assertion to show how to do WSI via sparse coding, which can be seen as a linear algebraic analog of the classic clustering-based approaches, albeit with overlapping clusters. On standard testbeds it is competitive with earlier embedding-for-each-sense approaches (Section 6).

A novelty of our WSI method is that it automatically links different senses of different words via our atoms of discourse (Section 3). This can be seen as an answer to the suggestion in (Reisinger and Mooney, 2010) to enhance one-embedding-per-sense methods so that they can automatically link together senses for different words, e.g., recognize that the "article of clothing" sense of tie is connected to shoe, jacket, etc.

This paper is inspired by the solution of word analogies via linear algebraic methods (Mikolov et al., 2013b), and the use of sparse coding on word embeddings to get useful representations for many NLP tasks (Faruqui et al., 2015). Our theory builds conceptually upon the random walk on discourses model of Arora et al. (2016), although we make a small but important change to explain empirical findings regarding polysemy. Our WSI procedure applies (with minor variation in performance) to canonical embeddings such as word2vec and GloVe as well as older vector space methods such as PMI (Church and Hanks, 1990). This is not surprising since these embeddings are known to be interrelated (Levy and Goldberg, 2014; Arora et al., 2016).

2 Justification for Linearity Assertion

Since word embeddings are solutions to nonconvex optimization problems, at first sight it appears hopeless to reason about their finer structure. But it becomes possible to do so using a generative model for language (Arora et al., 2016), a dynamic version of the log-linear topic model of (Mnih and Hinton, 2007), which we now recall. It posits that at every point in the corpus there is a micro-topic ("what is being talked about") called discourse that is drawn from the continuum of unit vectors in ℜ^d. The parameters of the model include a vector v_w ∈ ℜ^d for each word w. Each discourse c defines a distribution over words Pr[w | c] ∝ exp(c · v_w). The model assumes that the corpus is generated by the slow geometric random walk of c over the unit sphere in ℜ^d: when the walk is at c, a few words are emitted by i.i.d. samples from the distribution (2), which, due to its log-linear form, strongly favors words close to c in cosine similarity. Estimates for learning the parameters v_w using MLE and moment methods correspond to standard embedding methods such as GloVe and word2vec (see the original paper).

To study how word embeddings capture word senses, we'll need to understand the relationship between a word's embedding and those of words it co-occurs with. In the next subsection, we propose a slight modification to the above model and show how to infer the embedding of a word from the embeddings of other words that co-occur with it. This immediately leads to the Linearity Assertion, as shown in Section 2.2.

2.1 Gaussian Walk Model

As alluded to before, we modify the random walk model of (Arora et al., 2016) to the Gaussian random walk model. Again, the parameters of the model include a vector v_w ∈ ℜ^d for each word w. The model assumes the corpus is generated as follows. First, a discourse vector c is drawn from a Gaussian with mean 0 and covariance Σ. Then, a window of n words w_1, w_2, ..., w_n is generated from c by:

    Pr[w_1, w_2, ..., w_n | c] = ∏_{i=1}^{n} Pr[w_i | c],   (2)
    Pr[w_i | c] = exp(c · v_{w_i}) / Z_c,   (3)

where Z_c = ∑_w exp(⟨v_w, c⟩) is the partition function. We also assume the partition function concentrates, in the sense that Z_c ≈ Z exp(∥c∥^2) for some constant Z. This is a direct extension of (Arora et al., 2016, Lemma 2.1) to discourse vectors with norm other than 1, and causes the additional term exp(∥c∥^2).[1]

[1] The formal proof of (Arora et al., 2016) still applies in this setting. The simplest way to informally justify this assumption is to assume the v_w are random vectors; then Z_c can be shown to concentrate around exp(∥c∥^2). Such a condition enforces the word vectors to be isotropic to some extent, and makes the covariance of the discourse identifiable.
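To make the generative process of the Gaussian walk model concrete, the following is a minimal sketch in Python; the dimension, vocabulary size, window length, word vectors, and covariance are illustrative assumptions for the sketch, not quantities estimated in the paper.

import numpy as np

rng = np.random.default_rng(0)
d, vocab_size, n = 50, 1000, 10           # illustrative sizes

# Word vectors v_w and a discourse covariance Sigma (both assumed for this sketch).
V = rng.normal(size=(vocab_size, d)) / np.sqrt(d)
Sigma = np.eye(d) / d

# Draw a discourse vector c ~ N(0, Sigma), then emit a window of n words
# i.i.d. from Pr[w | c] = exp(c . v_w) / Z_c, as in equations (2)-(3).
c = rng.multivariate_normal(np.zeros(d), Sigma)
logits = V @ c
probs = np.exp(logits - logits.max())
probs /= probs.sum()
window = rng.choice(vocab_size, size=n, p=probs)
print(window)                             # indices of the emitted words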
Theorem 1. Assume the above generative model, and let s denote the random variable of a window of n words. Then there is a linear transformation A such that

    v_w ≈ A E[ (1/n) ∑_{w_i ∈ s} v_{w_i} | w ∈ s ].

Proof. Let c_s be the discourse vector for the whole window s. By the law of total expectation, we have

    E[c_s | w ∈ s] = E[ E[c_s | s = w_1 ... w_{j−1} w w_{j+1} ... w_n] | w ∈ s ].   (4)

We evaluate the two sides of the equation. First, by Bayes' rule and the assumptions on the distribution of c and the partition function, we have:

    p(c | w) ∝ p(w | c) p(c)
             ∝ (1/Z_c) exp(⟨v_w, c⟩) · exp(−(1/2) c^⊤ Σ^{−1} c)
             ≈ (1/Z) exp(⟨v_w, c⟩ − c^⊤ ((1/2)Σ^{−1} + I) c).

So c | w is a Gaussian distribution with mean

    E[c | w] ≈ (Σ^{−1} + 2I)^{−1} v_w.   (5)

Next, we compute E[c | w_1, ..., w_n]. Again using Bayes' rule and the assumptions on the distribution of c and the partition function,

    p(c | w_1, ..., w_n) ∝ p(w_1, ..., w_n | c) p(c)
                         ∝ p(c) ∏_{i=1}^{n} p(w_i | c)
                         ≈ (1/Z^n) exp( ∑_{i=1}^{n} v_{w_i}^⊤ c − c^⊤ ((1/2)Σ^{−1} + nI) c ).

So c | w_1 ... w_n is a Gaussian distribution with mean

    E[c | w_1, ..., w_n] ≈ (Σ^{−1} + 2nI)^{−1} ∑_{i=1}^{n} v_{w_i}.   (6)

Now plugging equations (5) and (6) into equation (4), we conclude that

    (Σ^{−1} + 2I)^{−1} v_w ≈ (Σ^{−1} + 2nI)^{−1} E[ ∑_{i=1}^{n} v_{w_i} | w ∈ s ].

Rearranging the equation completes the proof, with A = n(Σ^{−1} + 2I)(Σ^{−1} + 2nI)^{−1}.

Note: Interpretation. Theorem 1 shows that there exists a linear relationship between the vector of a word and the vectors of the words in its contexts. Consider the following thought experiment. First, choose a word w. Then, for each window s containing w, take the average of the vectors of the words in s and denote it as v_s. Now, take the average of v_s over all the windows s containing w, and denote this average as u. Theorem 1 says that u can be mapped to the word vector v_w by a linear transformation that does not depend on w. This linear structure may also have connections to some other phenomena related to linearity, e.g., Gittens et al. (2017) and Tian et al. (2017). Exploring such connections is left for future work.

The linear transformation is closely related to Σ, which describes the distribution of the discourses. If we choose a coordinate system such that Σ is a diagonal matrix with diagonal entries λ_i, then A will also be a diagonal matrix, with diagonal entries (n + 2nλ_i)/(1 + 2nλ_i). This smooths the spectrum and essentially shrinks the directions corresponding to large λ_i relative to the other directions. Such directions are for common discourses and thus common words. Empirically, we indeed observe that A shrinks the directions of common words. For example, its last right singular vector has, as nearest neighbors, the vectors for words like "with", "as", and "the." Note that empirically A is not a diagonal matrix, since the word vectors are not in the coordinate system mentioned above.

Note: Implications for GloVe and word2vec. Repeating the calculation in Arora et al. (2016) for our new generative model, we can show that the solutions to the GloVe and word2vec training objectives solve for the following vectors: v̂_w = (Σ^{−1} + 4I)^{−1/2} v_w. Since these other embeddings are the same as the v_w's up to a linear transformation, Theorem 1 (and the Linearity Assertion) still holds for them. Empirically, we find that (Σ^{−1} + 4I)^{−1/2} is close to a scaled identity matrix (since ∥Σ^{−1}∥_2 is small), so the v̂_w's can be used as a surrogate for the v_w's.
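For concreteness, here is a short sketch that builds the transformation A = n(Σ^{−1} + 2I)(Σ^{−1} + 2nI)^{−1} from the proof of Theorem 1 and applies it to an averaged context vector. The function name, the synthetic covariance, and the stand-in context average are assumptions made for illustration, not part of the paper's pipeline.

import numpy as np

def discourse_to_word_transform(Sigma, n):
    # A = n (Sigma^{-1} + 2I)(Sigma^{-1} + 2nI)^{-1}, as derived in the proof of Theorem 1.
    d = Sigma.shape[0]
    Sigma_inv = np.linalg.inv(Sigma)
    return n * (Sigma_inv + 2 * np.eye(d)) @ np.linalg.inv(Sigma_inv + 2 * n * np.eye(d))

rng = np.random.default_rng(0)
d, n = 50, 10
Sigma = np.diag(rng.uniform(0.5, 2.0, size=d))   # assumed diagonal discourse covariance
A = discourse_to_word_transform(Sigma, n)

# Stand-in for E[(1/n) sum_{w_i in s} v_{w_i} | w in s]; in practice this would be the
# average (or SIF-weighted average) of context word vectors over windows containing w.
avg_context = rng.normal(size=d)
v_w_estimate = A @ avg_context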
Experimental note: Using better sentence embeddings (SIF embeddings). Theorem 1 implicitly uses the average of the neighboring word vectors as an estimate (MLE) for the discourse vector. This estimate is of course also a simple sentence embedding, very popular in empirical NLP work and also reminiscent of word2vec's training objective. In practice, this naive sentence embedding can be improved by taking a weighted combination (often tf-idf) of adjacent words. The paper (Arora et al., 2017) uses a simple twist to the generative model in (Arora et al., 2016) to provide a better estimate of the discourse c, called the SIF embedding, which is better for downstream tasks and surprisingly competitive with sophisticated LSTM-based sentence embeddings. It is a weighted average of word embeddings in the window, with smaller weights for more frequent words (reminiscent of tf-idf). This weighted average is the MLE estimate of c if the above generative model is changed to:

    p(w | c) = α p(w) + (1 − α) exp(v_w · c) / Z_c,

where p(w) is the overall probability of word w in the corpus and α > 0 is a constant (hyperparameter). The theory in the current paper works with SIF embeddings as an estimate of the discourse c; in other words, in Theorem 1 we replace the average word vector with the SIF vector of that window. Empirically we find that this leads to similar results when testing our theory (Section 4) and better results in downstream WSI applications (Section 6). Therefore, SIF embeddings are adopted in our experiments.

2.2 Proof of Linearity Assertion

Now we use Theorem 1 to show how the Linearity Assertion follows. Recall the thought experiment considered there. Suppose word w has two distinct senses s1 and s2. Compute a word embedding v_w for w. Then hand-replace each occurrence of a sense of w by one of the new tokens s1, s2, depending upon which one is being used. Next, train separate embeddings for s1, s2 while keeping the other embeddings fixed. (NB: the classic clustering-based sense induction (Schutze, 1998; Reisinger and Mooney, 2010) can be seen as an approximation to this thought experiment.)

Theorem 2 (Main). Assuming the model of Section 2.1, the embeddings in the thought experiment above will satisfy ∥v_w − v̄_w∥_2 → 0 as the corpus length tends to infinity, where

    v̄_w ≈ α v_{s1} + β v_{s2}, for α = f_1/(f_1 + f_2), β = f_2/(f_1 + f_2),

where f_1 and f_2 are the numbers of occurrences of s1 and s2 in the corpus, respectively.

Proof. Suppose we pick a random sample of N windows containing w in the corpus. For each window, compute the average of the word vectors and then apply the linear transformation in Theorem 1. The transformed vectors are i.i.d. estimates for v_w, but with high probability about an f_1/(f_1 + f_2) fraction of the occurrences used sense s1 and an f_2/(f_1 + f_2) fraction used sense s2, and the corresponding estimates for those two subpopulations converge to v_{s1} and v_{s2}, respectively. Thus by construction, the estimate for v_w is a linear combination of those for v_{s1} and v_{s2}.

Note. Theorem 1 (and hence the Linearity Assertion) holds already for the original model in Arora et al. (2016), but with A = I, where I is the identity transformation. In practice, we find that inducing the word vector requires a non-identity A, which is the reason for the modified model of Section 2.1. This also helps to address a nagging issue hiding in older clustering-based approaches such as Reisinger and Mooney (2010) and Huang et al. (2012), which identified senses of a polysemous word by clustering the sentences that contain it. One imagines a good representation of the sense of an individual cluster is simply the cluster center. This turns out to be false: the closest words to the cluster center sometimes are not meaningful for the sense that is being captured; see Table 1. Indeed, the authors of Reisinger and Mooney (2010) seem aware of this, because they mention "We do not assume that clusters correspond to traditional word senses. Rather, we only rely on clusters to capture meaningful variation in word usage." We find that applying A to cluster centers makes them meaningful again. See also Table 1.

3 Towards WSI: Atoms of Discourse

Now we consider how to do WSI using only word embeddings and the Linearity Assertion. Our approach is fully unsupervised, and tries to induce senses for all words in one go, together with a vector representation for each sense.
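The WSI procedure below relies on the Linearity Assertion, and a small synthetic simulation can make the convex-combination behavior of Theorem 2 concrete; the sense vectors, noise level, and occurrence counts below are made up for illustration and do not come from a real corpus.

import numpy as np

rng = np.random.default_rng(0)
d = 100
f1, f2 = 8000, 2000          # occurrence counts of the two senses (illustrative)

v_s1 = rng.normal(size=d); v_s1 /= np.linalg.norm(v_s1)
v_s2 = rng.normal(size=d); v_s2 /= np.linalg.norm(v_s2)

# Each occurrence of the word contributes a noisy estimate of the sense vector in use,
# i.e. the transformed average context vector from Theorem 1.
noise = 0.1
est_1 = v_s1 + noise * rng.normal(size=(f1, d))
est_2 = v_s2 + noise * rng.normal(size=(f2, d))

# Averaging all estimates, as in the proof of Theorem 2, gives approximately
# alpha * v_s1 + beta * v_s2 with alpha = f1/(f1+f2) and beta = f2/(f1+f2).
v_w = np.vstack([est_1, est_2]).mean(axis=0)
alpha, beta = f1 / (f1 + f2), f2 / (f1 + f2)
print(np.linalg.norm(v_w - (alpha * v_s1 + beta * v_s2)))   # close to zero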
            before                          after
center 1    and, provide, providing, a      providing, provide, opportunities, provision
center 2    and, a, to, the                 access, accessible, allowing, provide

Table 1: Four nearest words for some cluster centers that were computed for the word "access" by applying 5-means on the estimated discourse vectors (see Section 2.1) of 1000 random windows from Wikipedia containing "access". After applying the linear transformation of Theorem 1 to the center, the nearest words become meaningful.

Given embeddings for all words, it seems unclear at first sight how to pin down the senses of tie using only (1), since v_tie can be expressed in infinitely many ways as such a combination, and this is true even if the α_i's were known (and they aren't). To pin down the senses we will need to interrelate the senses of different words, for example, relate the "article of clothing" sense tie1 with shoe, jacket, etc.

To do so we rely on the generative model of Section 2.1, according to which each unit vector in the embedding space corresponds to a micro-topic or discourse. Empirically, discourses c and c′ tend to look similar to humans (in terms of nearby words) if their inner product is larger than 0.85, and quite different if the inner product is smaller than 0.5. So in the discussion below, a discourse should really be thought of as a small region rather than a point.

One imagines that the corpus has a "clothing" discourse that has a high probability of outputting the tie1 sense, and also of outputting related words such as shoe, jacket, etc. By (2) the probability of being output by a discourse is determined by the inner product, so one expects that the vector for the "clothing" discourse has a high inner product with all of shoe, jacket, tie1, etc., and thus can stand as a surrogate for v_tie1 in (1)! Thus it may be sufficient to consider the following global optimization:

Given word vectors {v_w} in ℜ^d and two integers k, m with k