Transactions of the Association for Computational Linguistics, vol. 4, pp. 417–430, 2016. Action Editor: Hal Daume III.
Submission batch: 3/2016; Published 7/2016.
© 2016 Association for Computational Linguistics. Distributed under a CC-BY 4.0 Licence.
Encoding Prior Knowledge with Eigenword Embeddings

Dominique Osborne
Department of Mathematics and Statistics, University of Strathclyde, Glasgow, G1 1XH, UK
dominique.osborne.13@uni.strath.ac.uk

Shashi Narayan and Shay B. Cohen
School of Informatics, University of Edinburgh, Edinburgh, EH8 9LE, UK
{snaraya2,scohen}@inf.ed.ac.uk

Abstract

Canonical correlation analysis (CCA) is a method for reducing the dimension of data represented using two views. It has been previously used to derive word embeddings, where one view indicates a word, and the other view indicates its context. We describe a way to incorporate prior knowledge into CCA, give a theoretical justification for it, and test it by deriving word embeddings and evaluating them on a myriad of datasets.

1 Introduction

In recent years there has been an immense interest in representing words as low-dimensional continuous real-vectors, namely word embeddings. Word embeddings aim to capture lexico-semantic information such that regularities in the vocabulary are topologically represented in a Euclidean space. Such word embeddings have achieved state-of-the-art performance on many natural language processing (NLP) tasks, e.g., syntactic parsing (Socher et al., 2013), word or phrase similarity (Mikolov et al., 2013b), dependency parsing (Bansal et al., 2014), unsupervised learning (Parikh et al., 2014) and others. Since the discovery that word embeddings are useful as features for various NLP tasks, research on word embeddings has taken on a life of its own, with a vibrant community searching for better word representations in a variety of problems and datasets.

These word embeddings are often induced from large raw text capturing distributional co-occurrence information via neural networks (Bengio et al., 2003; Mikolov et al., 2013b; Mikolov et al., 2013c) or spectral methods (Deerwester et al., 1990; Dhillon et al., 2015). While these general purpose word embeddings have achieved significant improvement in various tasks in NLP, it has been discovered that further tuning of these continuous word representations for specific tasks improves their performance by a larger margin. For example, in dependency parsing, word embeddings could be tailored to capture similarity in terms of context within syntactic parses (Bansal et al., 2014), or they could be refined using semantic lexicons such as WordNet (Miller, 1995), FrameNet (Baker et al., 1998) and the Paraphrase Database (Ganitkevitch et al., 2013) to improve various similarity tasks (Yu and Dredze, 2014; Faruqui et al., 2015; Rothe and Schütze, 2015). This paper proposes a method to encode prior semantic knowledge in spectral word embeddings (Dhillon et al., 2015).

Spectral learning algorithms are of great interest for their speed, scalability, theoretical guarantees and performance in various NLP applications. These algorithms are no strangers to word embeddings either. In latent semantic analysis (LSA; Deerwester et al., 1990; Landauer et al., 1998), word embeddings are learned by performing SVD on the word-by-document matrix. Recently, Dhillon et al. (2015) have proposed to use canonical correlation analysis (CCA) as a method to learn low-dimensional real vectors, called Eigenwords. Unlike LSA-based methods, CCA-based methods are scale invariant and can capture multiview information such as the left and right contexts of the words. As a result, the eigenword embeddings of Dhillon et al. (2015), learned using simple linear methods, give accuracies comparable to or better than state of the art when compared with highly non-linear deep learning based approaches (Collobert and Weston, 2008; Mnih and Hinton, 2007; Mikolov et al., 2013b; Mikolov et al., 2013c).

The main contribution of this paper is a technique
to incorporate prior knowledge into the derivation of canonical correlation analysis. In contrast to previous work where prior knowledge is introduced in the off-the-shelf embeddings as a post-processing step (Faruqui et al., 2015; Rothe and Schütze, 2015), our approach introduces prior knowledge in the CCA derivation itself. In this way it preserves the theoretical properties of spectral learning algorithms for learning word embeddings. The prior knowledge is based on lexical resources such as WordNet, FrameNet and the Paraphrase Database.

Our derivation of CCA to incorporate prior knowledge is not limited to eigenwords and can be used with CCA for other problems. It follows a similar idea to the one proposed by Koren and Carmel (2003) for improving the visualization of principal vectors with principal component analysis (PCA). Our derivation represents the solution to CCA as that of an optimization problem which maximizes the distance between the two view projections of training examples, while weighting these distances using the external source of prior knowledge. As such, our approach applies to other uses of CCA in the NLP literature, such as the one of Jagarlamudi and Daumé (2012), who used CCA for transliteration, or the one of Silberer et al. (2013), who used CCA for semantically representing visual attributes.

2 Background and Notation

For an integer n, we denote by [n] the set of integers {1, ..., n}. We assume the existence of a vocabulary of words, usually taken from a corpus. This set of words is denoted by H = {h_1, ..., h_|H|}. For a square matrix A, we denote by diag(A) a diagonal matrix B which has the same dimensions as A such that B_ii = A_ii for all i. For a vector v ∈ R^d, we denote its ℓ2 norm by ||v||, i.e. ||v|| = sqrt(Σ_{i=1}^d v_i^2). We also denote by v_j or [v]_j the jth coordinate of v. For a pair of vectors u and v, we denote their dot product by ⟨u, v⟩.

We define a word embedding as a function f from H to R^m for some (relatively small) m. For example, in our experiments we vary m between 50 and 300. The word embedding function maps the word to some real-vector representation, with the intention to capture regularities in the vocabulary that are topologically represented in the corresponding Euclidean space. For example, all vocabulary words that correspond to city names could be grouped together in that space.

Research on the derivation of word embeddings that capture various regularities has greatly accelerated in recent years. Various methods used for this purpose range from low-rank approximations of co-occurrence statistics (Deerwester et al., 1990; Dhillon et al., 2015) to neural networks jointly learning a language model (Bengio et al., 2003; Mikolov et al., 2013a) or models for other NLP tasks (Collobert and Weston, 2008).

3 Canonical Correlation Analysis for Deriving Word Embeddings

One recent approach to derive word embeddings, developed by Dhillon et al. (2015), is through the use of canonical correlation analysis, resulting in so-called "eigenwords." CCA is a technique for multiview dimensionality reduction. It assumes the existence of two views for a set of data, similarly to co-training (Yarowsky, 1995; Blum and Mitchell, 1998), and then projects the data in the two views in a way that maximizes the correlation between the projected views.

Dhillon et al. (2015) used CCA to derive word embeddings through the following procedure. They first break each document in a corpus of documents into n sequences of words of a fixed length 2k+1, where k is a window size. For example, if k = 2, the short document "Harry Potter has been a best-seller" would be broken into "Harry Potter has been a" and "Potter has been a best-seller." In each such sequence, the middle word is identified as a pivot. This leads to the construction of the following training set from a set of documents:

{(w_1^(i), ..., w_k^(i), w^(i), w_{k+1}^(i), ..., w_{2k}^(i)) | i ∈ [n]}.

With abuse of notation, this is a multiset, as certain words are expected to appear in certain contexts multiple times. Each w^(i) is a pivot word, and the rest of the elements are words in the sequence called "the context words."

With this training set in mind, the two views for CCA are defined as following. We define the first view through a sparse "context matrix" C ∈ R^{n×2k|H|} such that each row in the matrix is a vector, consisting of 2k one-hot vectors, each of length |H|. Each such one-hot vector
corresponds to a word that fired in a specific index in the context. In addition, we also define a second view through a matrix W ∈ R^{n×|H|} such that W_ij = 1 if w^(i) = h_j. We present both views of the training set in Figure 1.

Figure 1: The word and context views represented as matrix W and C. Each row in W is a vector of length |H|, corresponding to a one-hot vector for the word in the example indexed by the row. Each row in C is a vector of length 2k|H|, divided into sub-vectors each of length |H|. Each such sub-vector is a one-hot vector for one of the 2k context words in the example indexed by the row.

Note that now the matrix M = W^⊤ C is in R^{|H|×(2k|H|)} such that each element M_ij gives the count of times that h_i appeared with the corresponding context word and context index encoded by j. Similarly, we define a matrix D_1 = diag(W^⊤ W) and D_2 = diag(C^⊤ C). Finally, to get the word embeddings, we perform singular value decomposition (SVD) on the matrix D_1^{-1/2} M D_2^{-1/2}. Note that in its original form, CCA requires use of W^⊤ W and C^⊤ C in their full form, and not just the corresponding diagonal matrices D_1 and D_2; however, in practice, inverting these matrices can be quite intensive computationally and can lead to memory issues. As such, we approximate CCA by using the diagonal matrices D_1 and D_2.

From the SVD step, we get two projections U ∈ R^{|H|×m} and V ∈ R^{2k|H|×m} such that

D_1^{-1/2} M D_2^{-1/2} ≈ U Σ V^⊤,

where Σ ∈ R^{m×m} is a diagonal matrix with Σ_ii > 0 being the ith largest singular value of D_1^{-1/2} M D_2^{-1/2}. In order to get the final word embeddings, we calculate D_1^{-1/2} U ∈ R^{|H|×m}. Each row in this matrix corresponds to an m-dimensional vector for the corresponding word in the vocabulary. This means that f(h_i) for h_i ∈ H is the ith row of the matrix D_1^{-1/2} U. The projection V can be used to get "context embeddings." See more about this in Dhillon et al. (2015).

This use of CCA to derive word embeddings follows the usual distributional hypothesis (Harris, 1957) that most word embedding techniques rely on. In the case of CCA, this hypothesis is translated into action in the following way. CCA finds projections for the contexts and for the pivot words which are most correlated. This means that if a word co-occurs in a specific context many times (either directly, or transitively through similarity to other words), then this context is expected to be projected to a point "close" to the point to which the word is projected. As such, if two words occur in a specific context many times, these two words are expected to be projected to points which are close to each other.

For the next section, we denote X = W D_1^{-1/2} and Y = C D_2^{-1/2}. To refer to the dimensions of X and Y generically, we denote d = |H| and d′ = 2k|H|. In addition, we refer to the column vectors of U and V as u_1, ..., u_m and v_1, ..., v_m.

Mathematical Intuition Behind CCA  The procedure that CCA follows finds a projection of the two views in a shared space, such that the correlation between the two views is maximized at each coordinate, and there is minimal redundancy between the coordinates of each view. This means that CCA solves the following sequence of optimization problems for j ∈ [m], where a_j ∈ R^{1×d} and b_j ∈ R^{1×d′}:

argmax_{a_j, b_j}  corr(a_j W^⊤, b_j C^⊤)
such that  corr(a_j W^⊤, a_k W^⊤) = 0,  k < j
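For concreteness, the construction above can be sketched with a sparse truncated SVD. The sketch below is a simplified illustration only, not the SWELL implementation used by Dhillon et al. (2015); the examples and vocab inputs, the guard against zero counts, and the use of scipy are assumptions made for the example.

# Minimal sketch of the eigenword procedure of Section 3 (not the SWELL package).
# Assumes `examples` is a list of (context_words, pivot_word) pairs, where
# context_words has 2k words, and `vocab` maps each word to an index in [0, |H|).
import numpy as np
from scipy.sparse import lil_matrix, diags
from scipy.sparse.linalg import svds

def eigenwords(examples, vocab, k=2, m=50):
    n, H = len(examples), len(vocab)
    W = lil_matrix((n, H))          # word view: one-hot pivot per row
    C = lil_matrix((n, 2 * k * H))  # context view: 2k stacked one-hot blocks per row
    for i, (context, pivot) in enumerate(examples):
        W[i, vocab[pivot]] = 1.0
        for pos, w in enumerate(context):                 # positions 0 .. 2k-1
            C[i, pos * H + vocab[w]] = 1.0
    W, C = W.tocsr(), C.tocsr()
    M = W.T @ C                                           # co-occurrence counts, |H| x 2k|H|
    d1 = np.asarray(W.multiply(W).sum(axis=0)).ravel()    # diag(W^T W)
    d2 = np.asarray(C.multiply(C).sum(axis=0)).ravel()    # diag(C^T C)
    D1_inv_sqrt = diags(1.0 / np.sqrt(np.maximum(d1, 1e-12)))  # guard for unseen words
    D2_inv_sqrt = diags(1.0 / np.sqrt(np.maximum(d2, 1e-12)))
    U, S, Vt = svds(D1_inv_sqrt @ M @ D2_inv_sqrt, k=m)   # top-m singular triplets
    return D1_inv_sqrt @ U                                # rows are word embeddings f(h_i)

The m columns returned by the truncated SVD play the role of the projection U above; scaling the rows by D_1^{-1/2} gives the embedding f(h_i) for each vocabulary word.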
Figure 2: Introducing prior knowledge in CCA. W ∈ R^{n×d} and C ∈ R^{n×d′} denote the word and context views respectively. L ∈ R^{n×n} is a Laplacian matrix encoded with the prior knowledge about the distances between the projections of W and C.

The last utility lemma we describe shows that interjecting the Laplacian between the two views can be expressed as a weighted sum of the distances between the projections of the two views (these distances are given in Eq. 2), where the weights come from the Laplacian.

Lemma 3. Let u_1, ..., u_m and v_1, ..., v_m be two sets of vectors of length d and d′ respectively. Let L ∈ R^{n×n} be a Laplacian and X ∈ R^{n×d} and Y ∈ R^{n×d′}. Then:

Σ_{k=1}^m (X u_k)^⊤ L (Y v_k) = Σ_{i,j} −L_ij (d^m_ij)^2,

where

d^m_ij = sqrt( (1/2) Σ_{k=1}^m ([X u_k]_i − [Y v_k]_j)^2 ).   (2)

The following proposition is our main result for this section.

Proposition 4. The matrices U ∈ R^{d×m} and V ∈ R^{d′×m} that CCA computes are the m-dimensional projections that maximize

Σ_{i,j} (d^m_ij)^2 − n Σ_{i=1}^n (d^m_ii)^2,   (3)

where d^m_ij is defined as in Eq. 2 for u_1, ..., u_m being the columns of U and v_1, ..., v_m being the columns of V.

Proof. According to Lemma 3, the objective in Eq. 3 equals Σ_{k=1}^m (X u_k)^⊤ L (Y v_k) where L is defined as in Eq. 1. Therefore, maximizing Eq. 3 corresponds to maximization of Σ_{k=1}^m (X u_k)^⊤ L (Y v_k) under the constraints that the U and V matrices have orthonormal vectors. Using Lemma 2, it can be shown that this maximization is solved by doing singular value decomposition on X^⊤ L Y. According to Lemma 1, this corresponds to finding U and V by doing singular value decomposition on X^⊤ Y, because a multiplicative constant does not change the value of the right/left singular vectors.

The above proposition shows that CCA tries to find projections of both views such that the distances between the two views for pairs of examples with indices i ≠ j are maximized (first term in Eq. 3), while
minimizing the distance between the projections of the two views for a specific example (second term in Eq. 3). Therefore, CCA tries to project a context and a word in that context to points that are close to each other in a shared space, while maximizing the distance between a context and a word which do not often co-occur together.

As long as L is a Laplacian, Proposition 4 is still true, only with the maximization of the objective

Σ_{i,j} −L_ij (d^m_ij)^2,   (4)

where L_ij ≤ 0 for i ≠ j and L_ii ≥ 0. This result lends itself to a generalization of CCA, in which we use predefined weights for the Laplacian that encode some prior knowledge about the distances that the projections of two views should satisfy.

If the weight −L_ij is large for a specific (i, j), then we will try harder to maximize the distance between one view of example i and the other view of example j (i.e. we will try to project the word w^(i) and the context of example j into distant points in the space). This means that in the current formulation, −L_ij plays the role of a dissimilarity indicator between pairs of words. The more dissimilar words are, the larger the weight, and then the more distant the projections are for the contexts and the words.

4.2 From CCA with Dissimilarities to CCA with Similarities

It is often more convenient to work with similarity measures between pairs of words. To do that, we can retain the same formulation as before with the Laplacian, where −L_ij now denotes a measure of similarity. Now, instead of maximizing the objective in Eq. 4, we are required to minimize it. It can be shown that such a mirror formulation can be done with an algorithm similar to CCA, leading to a proposition in the style of Proposition 4. To solve this minimization formulation, we just need to choose the singular vectors associated with the smallest m singular values (instead of the largest).

Once we change the CCA algorithm with the Laplacian to choose these projections, we can define L, for example, based on a similarity graph. The graph is an undirected graph that has |H| nodes, for each word in the vocabulary, and there is an edge between a pair of words whenever the two words are similar to each other based on some external source of information, such as WordNet (for example, if they are synonyms). We then define the Laplacian L such that L_ij = −1 if i and j are adjacent in the graph (and i ≠ j), L_ii is the degree of the node i, and L_ij = 0 in all other cases.

By using this variant of CCA, we strive to maximize the distance of the two views between words which are adjacent in the graph (or continuing the example above, maximize the distance between words which are not synonyms). In addition, the fewer adjacent nodes a word has (or the more synonyms it has), the less important it is to minimize the distance between the two views of that given word. The resulting algorithm is summarized in Figure 3 below.

Inputs: Set of examples {(w_1^(i), ..., w_k^(i), w^(i), w_{k+1}^(i), ..., w_{2k}^(i)) | i ∈ [n]}, an integer m, an α ∈ (0, 1], an undirected graph G over H, an integer N.
Data structures: A matrix M of size |H| × (2k|H|) (cross-covariance matrix), a matrix U corresponding to the word embeddings.
Algorithm:
(Cross-covariance estimation) For all i, j ∈ [n] such that |i − j| ≤ N:
• If i = j, increase M_rs by 1 for r denoting the index of word w^(i) and for all s denoting the context indices of words w_1^(i), ..., w_k^(i) and w_{k+1}^(i), ..., w_{2k}^(i).
• If i ≠ j and word w^(i) is connected to word w^(j) in G, increase M_rs by α for r denoting the index of word w^(i) and for all s denoting the context indices of words w_1^(j), ..., w_k^(j) and w_{k+1}^(j), ..., w_{2k}^(j).
• Calculate D_1 and D_2 as specified in §3.
(Singular value decomposition step)
• Perform singular value decomposition on D_1^{-1/2} M D_2^{-1/2} to get a matrix U ∈ R^{|H|×m}.
(Word embedding projection)
• For each word h_i for i ∈ [|H|] return the word embedding that corresponds with the ith row of U.

Figure 3: The CCA-like algorithm that returns word embeddings with prior knowledge encoded based on a similarity graph.
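A minimal sketch of the procedure in Figure 3 follows, assuming a list of windowed examples, a vocabulary index, and an edge set for the graph G. It is not the authors' modified SWELL code, and it uses row and column sums of M as a simple stand-in for the diagonal count matrices D_1 and D_2 of §3.

# Sketch of the Figure 3 algorithm: accumulate the cross-covariance matrix M,
# adding alpha-weighted counts for graph-adjacent pivots within an N-sized window.
import numpy as np
from scipy.sparse import lil_matrix, diags
from scipy.sparse.linalg import svds

def cca_prior_embeddings(examples, vocab, edges, k=2, m=300, alpha=0.5, N=12):
    # examples: list of (context_words, pivot_word); edges: set of (word, word) pairs in G.
    H = len(vocab)
    M = lil_matrix((H, 2 * k * H))

    def add(pivot, context, weight):
        r = vocab[pivot]
        for pos, w in enumerate(context):
            M[r, pos * H + vocab[w]] += weight

    for i, (ctx_i, piv_i) in enumerate(examples):
        add(piv_i, ctx_i, 1.0)                           # the i = j case
        for j in range(max(0, i - N), min(len(examples), i + N + 1)):
            if j == i:
                continue
            ctx_j, piv_j = examples[j]
            if (piv_i, piv_j) in edges or (piv_j, piv_i) in edges:
                add(piv_i, ctx_j, alpha)                 # graph-adjacent pivots, weight alpha

    M = M.tocsr()
    d1 = np.asarray(M.sum(axis=1)).ravel()               # stand-in for diag(W^T W)
    d2 = np.asarray(M.sum(axis=0)).ravel()               # stand-in for diag(C^T C)
    D1_inv_sqrt = diags(1.0 / np.sqrt(np.maximum(d1, 1e-12)))
    D2_inv_sqrt = diags(1.0 / np.sqrt(np.maximum(d2, 1e-12)))
    U, S, Vt = svds(D1_inv_sqrt @ M @ D2_inv_sqrt, k=m)
    return D1_inv_sqrt @ U                               # ith row: embedding of word h_i

With the α-weighted updates skipped, this reduces to the plain eigenword sketch of §3.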
4.3 Final Algorithm

In order to use an arbitrary Laplacian matrix with CCA, we require that the data is centered, i.e. that the average over all examples of each of the coordinates of the word and context vectors is 0. However, such a prerequisite would make the matrices C and W dense (with many non-zero values), hard to maintain in memory, and would also make singular value decomposition inefficient.

As such, we do not center the data to keep it sparse, and instead use a matrix L which is not strictly a Laplacian, but that behaves better in practice.[1] Given the graph mentioned in §4, which is extracted from an external source of information, we use L such that L_ij = α for an α ∈ (0, 1), treated as a smoothing factor for the graph (see below the choices of α), if i and j are not adjacent in the graph, L_ij = 0 if i ≠ j are adjacent, and finally L_ii = 1 for all i ∈ [n]. Therefore, this matrix is symmetric, and the only constraint it does not satisfy is that of rows and columns summing to 0.

[1] We note that other decompositions, such as PCA, also require centering of the data, but in case of a sparse data matrix, this step is not performed.

Scanning the documents and calculating the statistic matrix with the Laplacian is computationally infeasible with a large number of tokens given as input: it is quadratic in that number. As such, we make another modification to the algorithm, and calculate a "local" Laplacian. The modification requires an integer N as input (we use N = 12), and then it makes updates to pairs of word tokens only if they are within an N-sized window of each other. The final algorithm we use is described in Figure 3. The algorithm works by directly computing the co-occurrence matrix M (instead of maintaining W and C). It does so by increasing by 1 any cells corresponding to word-context co-occurrence in the documents and by α any cells corresponding to words and contexts that are connected in the graph.

5 Experiments

In this section we describe our experiments.

5.1 Experimental Setup

Training Data  We used three datasets, WIKI1, WIKI2 and WIKI5, all based on the first 1, 2 and 5 billion words from Wikipedia respectively.[2] Each dataset is broken into chunks of length 13 (window sizes of 6), corresponding to a document. The above Laplacian L is calculated within each document separately. This means that −L_ij is 1 only if i and j denote two words that appear in the same document. This is done to make the calculations computationally feasible. We calculate word embeddings for the top most frequent 200K words.

[2] We downloaded the data from https://dumps.wikimedia.org/, and preprocessed it using the tool available at http://mattmahoney.net/dc/textdata.html.

Prior Knowledge Resources  We consider three sources of prior knowledge: WordNet (Miller, 1995), the Paraphrase Database of Ganitkevitch et al. (2013), abbreviated as PPDB,[3] and FrameNet (Baker et al., 1998). Since FrameNet and WordNet index words in their base form, we use WordNet's stemmer to identify the base form for the text in our corpora whenever we calculate the Laplacian graph. For WordNet, we have an edge in the graph if one word is a synonym, hypernym or hyponym of the other. For PPDB, we have an edge if one word is a paraphrase of the other, according to the database. For FrameNet, we connect two words in the graph if they appear in the same frame.

[3] We use the XL subset of the PPDB.

System Implementation  We modified the implementation of the SWELL Java package[4] of Dhillon et al. (2015). Specifically, we needed to modify the loop that iterates over words in each document to a nested loop that iterates over pairs of words, in order to compute a sum of the form Σ_{ij} X_ri L_ij Y_js.[5] Dhillon et al. (2015) use window size k = 2, which we retain in our experiments.[6]

[4] https://github.com/paramveerdhillon/swell.
[5] Our implementation and the word embeddings that we calculated are available at http://cohort.inf.ed.ac.uk/cohort/eigen/.
[6] We also use the square-root transformation as mentioned in Dhillon et al. (2015), which controls the variance in the counts accumulated from the corpus. See a justification for this transform in Stratos et al. (2015).
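As an illustration of how the WordNet similarity graph described above might be assembled, the following sketch uses NLTK's WordNet interface. The edge criterion (synonym, hypernym or hyponym) follows the description above, but the helper itself, the lemma normalization, and the use of NLTK rather than the authors' pipeline are assumptions made for the example.

# Hypothetical helper: build an undirected edge set over a vocabulary using WordNet,
# connecting two words if one is a synonym, hypernym or hyponym of the other.
from nltk.corpus import wordnet as wn

def wordnet_edges(vocab_words):
    vocab = set(vocab_words)
    edges = set()
    for word in vocab_words:
        related = set()
        for synset in wn.synsets(word):
            related.update(l.name() for l in synset.lemmas())        # synonyms
            for rel in synset.hypernyms() + synset.hyponyms():
                related.update(l.name() for l in rel.lemmas())       # hypernyms/hyponyms
        for other in related:
            other = other.lower().replace('_', ' ')
            if other in vocab and other != word:
                edges.add(tuple(sorted((word, other))))              # undirected edge
    return edges

The resulting edge set can then serve as the graph G in the Figure 3 sketch above.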
5.2 Baselines

Off-the-shelf Word Embeddings  We compare our word embeddings with existing state-of-the-art word embeddings, such as Glove (Pennington et al., 2014), Skip-Gram (Mikolov et al., 2013b), Global Context (Huang et al., 2012) and Multilingual (Faruqui and Dyer, 2014). We also compare our word embeddings with the Eigenword embeddings of Dhillon et al. (2015) without any prior knowledge.

                           Word similarity average      Geographic analogies         NP bracketing
                           NPK   WN    PD    FN         NPK   WN    PD    FN         NPK   WN    PD    FN
Retrofitting (blocks A-C)
  Glove                    59.7  63.1  64.6  57.5       94.8  75.3  80.4  94.8       78.1  79.5  79.4  78.7
  Skip-Gram                64.1  65.5  68.6  62.3       87.3  72.3  70.5  87.7       79.9  80.4  81.5  80.5
  Global Context           44.4  50.0  50.4  47.3        7.3   4.5  18.2   7.3       79.4  79.1  80.5  80.2
  Multilingual             62.3  66.9  68.2  62.8       70.7  46.2  53.7  72.7       81.9  81.8  82.7  82.0
  Eigen (CCA)              59.5  62.2  63.6  61.4       89.9  79.2  73.5  89.9       81.3  81.7  81.2  80.7
CCA Prior (blocks D-F)
  α = 0.1                   -    59.1  59.6  59.5        -    88.9  88.7  89.9        -    81.0  82.4  81.0
  α = 0.2                   -    59.9  60.6  60.0        -    89.1  91.3  90.1        -    81.0  81.3  80.7
  α = 0.5                   -    59.9  59.7  59.6        -    86.9  89.3  89.3        -    81.8  81.4  80.9
  α = 0.7                   -    60.7  59.3  59.5        -    86.9  89.3  92.9        -    80.3  81.2  80.8
  α = 0.9                   -    60.6  59.6  58.9        -    89.1  93.2  92.5        -    81.3  80.7  81.0
CCA Prior + RF (blocks G-I)
  α = 0.1                   -    61.9  63.6  61.5        -    76.0  71.9  89.9        -    81.4  81.7  81.2
  α = 0.2                   -    62.6  64.9  61.6        -    78.0  69.3  90.1        -    81.7  81.1  80.6
  α = 0.5                   -    62.7  63.7  61.4        -    74.9  67.3  92.9        -    81.9  81.4  80.0
  α = 0.7                   -    63.3  63.0  61.0        -    77.4  65.6  90.3        -    81.0  80.8  80.4
  α = 0.9                   -    62.0  63.3  60.4        -    77.3  66.2  92.5        -    81.0  80.7  80.4

Table 1: Results for the word similarity datasets, geographic analogies and NP bracketing. The first upper blocks (A-C) present the results with retrofitting. NPK stands for no prior knowledge (no retrofitting is used), WN for WordNet, PD for PPDB and FN for FrameNet. Glove, Skip-Gram, Global Context, Multilingual and Eigen are the word embeddings of Pennington et al. (2014), Mikolov et al. (2013b), Huang et al. (2012), Faruqui and Dyer (2014) and Dhillon et al. (2015) respectively. The second middle blocks (D-F) show the results of our eigenword embeddings encoded with prior knowledge using our method. Each row in the block corresponds to a specific use of an α value (smoothing factor), as described in Figure 3. In the lower blocks (G-I) we take the word embeddings from the second block, and retrofit them using the method of Faruqui et al. (2015). Best results in each block are in bold.

Retrofitting for Prior Knowledge  We compare our approach of incorporating prior knowledge into the derivation of CCA against the previous works where prior knowledge is introduced in the off-the-shelf embeddings as a post-processing step (Faruqui et al., 2015; Rothe and Schütze, 2015). In this paper, we focus on the retrofitting approach of Faruqui et al. (2015). Retrofitting works by optimizing an objective function which has two terms: one that tries to keep the distance between the word vectors close to the original distances, and the other which enforces the vectors of words which are adjacent in the prior knowledge graph to be close to each other in the new embedding space. We use the retrofitting package[7] to compare our results in different settings against the results of retrofitting of Faruqui et al. (2015).

[7] https://github.com/mfaruqui/retrofitting.

5.3 Evaluation Benchmarks

We evaluated the quality of our eigenword embeddings on three different tasks: word similarity, geographic analogies and NP bracketing.

Word Similarity  For the word similarity task we experimented with 11 different widely used benchmarks. The WS-353-ALL dataset (Finkelstein et al., 2002) consists of 353 pairs of English words with their human similarity ratings. Later, Agirre et al. (2009) re-annotated WS-353-ALL for similarity (WS-353-SIM) and relatedness (WS-353-REL) with specific distinctions between them. The SimLex-999 dataset (Hill et al., 2015) was built to measure how well models capture similarity, rather than relatedness or association. The MEN-TR-3000 dataset (Bruni et al., 2014) consists of 3000 word pairs
sampled from words that occur at least 700 times in a large web corpus. The datasets MTurk-287 (Radinsky et al., 2011) and MTurk-771 (Halawi et al., 2012) were scored by Amazon Mechanical Turk workers for relatedness of English word pairs. The YP-130 (Yang and Powers, 2005) and Verb-143 (Baker et al., 2014) datasets were developed for verb similarity predictions. The last two datasets, MC-30 (Miller and Charles, 1991) and RG-65 (Rubenstein and Goodenough, 1965), consist of 30 and 65 noun pairs respectively.

For each dataset, we calculate the cosine similarity between the vectors of word pairs and measure Spearman's rank correlation coefficient between the scores produced by the embeddings and human ratings. We report the average of the correlations on all 11 datasets. Each word similarity task in the above list represents a different aspect of word similarity, and as such, averaging the results points to the quality of the word embeddings on several tasks. We later analyze specific datasets.

Geographic Analogies  Mikolov et al. (2013c) created a test set of analogous word pairs such as a:b  c:d, raising the analogy question of the form "a is to b as c is to __" where d is unknown. We report results on a subset of this dataset which focuses on finding capitals of common countries, e.g., Greece is to Athens as Iraq is to __. This dataset consists of 506 word pairs. For given word pairs a:b  c:d where d is unknown, we use the vector offset method (Mikolov et al., 2013b), i.e., we compute a vector v = v_b − v_a + v_c where v_a, v_b and v_c are vector representations of the words a, b and c respectively; we then return the word d with the greatest cosine similarity to v.

NP Bracketing  Here the goal is to identify the correct bracketing of a three-word noun (Lazaridou et al., 2013). For example, the bracketing of annual (price growth) is "right," while the bracketing of (entry level) machine is "left." Similarly to Faruqui and Dyer (2015), we concatenate the word vectors of the three words, and use this vector for binary classification into left or right.

Since most of the datasets that we evaluate on in this paper are not standardly separated into development and test sets, we report all results we calculated (with respect to hyperparameter differences) and do not select just a subset of the results.

5.4 Evaluation

Preliminary Experiments  In our first set of experiments, we vary the dimension of the word embedding vectors. We try m ∈ {50, 100, 200, 300}. Our experiments showed that the results consistently improve when the dimension increases for all the different datasets. For example, for m = 50 and WIKI1, we get an average of 46.4 on the word similarity tasks, 50.1 for m = 100, 53.4 for m = 200 and 54.2 for m = 300. The more data are available, the more likely a larger dimension will improve the quality of the word embeddings. Indeed, for WIKI5, we get an average of 49.4, 54.9, 57.0 and 59.5 for each of the dimensions. The improvements with respect to the dimension are consistent across all of our results, so we fix m at 300.

We also noticed a consistent improvement in accuracy when using more data from Wikipedia. For example, for m = 300, using WIKI1 gives an average of 54.1, while using WIKI2 gives an average of 54.9 and finally, using WIKI5 gives an average of 59.5. We fix the dataset we use to be WIKI5.

Results  Table 1 describes the results from our first set of experiments. (Note that the table is divided into 9 distinct blocks, labeled A through I.) In general, adding prior knowledge to eigenword embeddings does improve the quality of word vectors for the word similarity, geographic analogies and NP bracketing tasks on several occasions (blocks D-F compared to the last row in blocks A-C). For example, our eigenword vectors encoded with prior knowledge (CCA Prior) consistently perform better than the eigenword vectors that do not have any prior knowledge for the word similarity task (59.5, Eigen in the first row under the NPK column, versus block D). The only exceptions are for α = 0.1 with WordNet (59.1), for α = 0.7 with PPDB (59.3) and for α = 0.9 with FrameNet (58.9), where α denotes the smoothing factor.

In several cases, running the retrofitting algorithm of Faruqui et al. (2015) on top of our word embeddings helps further, as if "adding prior knowledge twice is better than once." Results for these word embeddings (CCA Prior + RF) are shown in Table 1.
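For concreteness, the two evaluation protocols of §5.3 that produce most of Table 1 (cosine similarity scored with Spearman's rank correlation, and the vector offset method for the analogy task) can be sketched as follows; the embedding lookup emb and the benchmark format are hypothetical placeholders.

# Sketch of the word similarity and analogy evaluations of Section 5.3.
# `emb` maps a word to a NumPy vector; `pairs` is a list of (word1, word2, human_score)
# triples; `vocab_vectors` is a dict mapping each vocabulary word to its vector.
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def word_similarity_correlation(emb, pairs):
    model_scores = [cosine(emb[w1], emb[w2]) for w1, w2, _ in pairs]
    human_scores = [s for _, _, s in pairs]
    return spearmanr(model_scores, human_scores).correlation

def analogy(emb, vocab_vectors, a, b, c):
    # Vector offset method: return the word d maximizing cos(v_b - v_a + v_c, v_d).
    target = emb[b] - emb[a] + emb[c]
    candidates = ((w, cosine(target, v)) for w, v in vocab_vectors.items()
                  if w not in (a, b, c))
    return max(candidates, key=lambda x: x[1])[0]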
Adding retrofitting to our encoding of prior knowledge often performs better for the word similarity and NP bracketing tasks (block D versus G and block F versus I).

Interestingly, CCA Prior + RF embeddings also often perform better than the eigenword vectors (Eigen) of Dhillon et al. (2015) when those are retrofitted using the method of Faruqui et al. (2015). For example, in the word similarity task, eigenwords retrofitted with WordNet get an accuracy of 62.2, whereas encoding prior knowledge using both CCA and retrofitting gets a maximum accuracy of 63.3. We see the same pattern for PPDB, with 63.6 for "Eigen" and 64.9 for "CCA Prior + RF". We hypothesize that the reason for these changes is that the two methods for encoding prior knowledge maximize different objective functions.

The performance with FrameNet is weaker, in some cases leading to worse performance (e.g., with Glove and SG vectors). We believe that FrameNet does not perform as well as the other lexicons because it groups words based on very abstract concepts; often words with seemingly distantly related meanings (e.g., push and growth) can evoke the same frame. This also supports the findings of Faruqui et al. (2015), who noticed that the use of FrameNet as a prior knowledge resource for improving the quality of word embeddings is not as helpful as other resources such as WordNet and PPDB.

We note that CCA works especially well for the geographic analogies dataset. The quality of eigenword embeddings (and the other embeddings) degrades when we encode prior knowledge using the method of Faruqui et al. (2015). Our method improves the quality of eigenword embeddings.

Global Picture of the Results  When comparing retrofitting to CCA with prior knowledge, there is a noticeable difference. Retrofitting performs well or badly, depending on the dataset, while the results with CCA are more stable. We attribute this to the difference between how our algorithm and retrofitting work. Retrofitting makes a direct use of the source of prior knowledge, by adding a regularization term that enforces words which are similar according to the prior knowledge to be closer in the embedding space. Our algorithm, on the other hand, makes a more indirect use of the source of prior knowledge, by changing the co-occurrence matrix on which we do singular value decomposition.

Specifically, we believe that our algorithm is more stable in cases in which words for the task at hand are unknown words with respect to the source of prior knowledge. This is demonstrated with the geographical analogies task: in that case, retrofitting lowers the results in most cases. The city and country names do not appear in the sources of prior knowledge we used.

Further Analysis  We further inspected the results on the word similarity tasks for the RG-65 and WS-353-ALL datasets. Our goal was to find cases in which either CCA embeddings by themselves outperform other types of embeddings or encoding prior knowledge into CCA the way we describe significantly improves the results.

For the WS-353-ALL dataset, the eigenword embeddings get a correlation of 69.6. The next best performing word embeddings are the multilingual word embeddings (68.0) and skip-gram (58.3). Interestingly enough, the multilingual word embeddings also use CCA to project words into a low-dimensional space using a linear transformation, suggesting that linear projections are a good fit for the WS-353-ALL dataset. The dataset itself includes pairs of common words with a corresponding similarity score. The words that appear in the dataset are actually expected to occur in similar contexts, a property that CCA directly encodes when deriving word embeddings.

The best performance on the RG-65 dataset is with the Glove word embeddings (76.6). CCA embeddings give an accuracy of 69.7 on that dataset. However, with this dataset, we observe significant improvement when encoding prior knowledge using our method. For example, using WordNet with this dataset improves the results by 4.2 points (73.9). Using the method of Faruqui et al. (2015) (with WordNet) on top of our CCA word embeddings improves the results even further by 8.7 points (78.4).

The Role of Prior Knowledge  We also designed an experiment to test whether using distributional information is necessary for having well-performing word embeddings, or whether it is sufficient to rely on the prior knowledge resource. In order to test this, we created a sparse matrix that corresponds to the graph based on the external resource graph. We then follow up with singular value decomposition on
that graph, and get embeddings of size 300. Table 2 gives the results when using these embeddings. We see that the results are consistently lower than the results that appear in Table 1, implying that the use of prior knowledge comes hand in hand with the use of distributional information. When using the retrofitting method by Faruqui et al. on top of these word embeddings, the results barely improved.

Resource    Word Sim    NP Bracketing
WordNet     35.9        73.6
PPDB        37.5        77.9
FrameNet    19.9        74.5

Table 2: Results on the word similarity dataset (average over 11 datasets) and NP bracketing. The word embeddings are derived by using SVD on the similarity graph extracted from the prior knowledge source (WordNet, PPDB and FrameNet).

6 Related Work

Our ideas in this paper for encoding prior knowledge in eigenword embeddings relate to three main threads in existing literature.

One of the threads focuses on modifying the objective of word vector training algorithms. Yu and Dredze (2014), Xu et al. (2014), Fried and Duh (2015) and Bian et al. (2014) augment the training objective in neural language models of Mikolov et al. (2013a) to encourage semantically related word vectors to come closer to each other. Wang et al. (2014) propose a method for jointly embedding entities (from FreeBase, a large community-curated knowledge base) and words (from Wikipedia) into the same continuous vector space. Chen and de Melo (2015) propose a similar joint model to improve the word embeddings, but rather than using structured knowledge sources their model focuses on discovering stronger semantic connections in specific contexts in a text corpus.

Another research thread relies on post-processing steps to encode prior knowledge from semantic lexicons in off-the-shelf word embeddings. The main intuition behind this trend is to update word vectors by running belief propagation on a graph extracted from the relation information in semantic lexicons. The retrofitting approach of Faruqui et al. (2015) uses such techniques to obtain higher quality semantic vectors using WordNet, FrameNet, and the Paraphrase Database. They report on how retrofitting helps improve the performance of various off-the-shelf word vectors such as Glove, Skip-Gram, Global Context, and Multilingual, on various word similarity tasks. Rothe and Schütze (2015) also describe how standard word vectors can be extended to various data types in semantic lexicons, e.g., synsets and lexemes in WordNet.

Most of the standard word vector training algorithms use co-occurrence within window-based contexts to measure relatedness among words. Several studies question the limitations of defining relatedness in this way and investigate whether the word co-occurrence matrix can be constructed to encode prior knowledge directly to improve the quality of word vectors. Wang et al. (2015) investigate the notion of relatedness in embedding models by incorporating syntactic and lexicographic knowledge. In spectral learning, Yih et al. (2012) augment the word co-occurrence matrix on which LSA operates with relational information such that synonyms will tend to have positive cosine similarity, and antonyms will tend to have negative similarities. Their vector space representation successfully projects synonyms and antonyms on opposite sides in the projected space. Chang et al. (2013) further generalize this approach to encode multiple relations (and not just opposing relations, such as synonyms and antonyms) using multi-relational LSA. In spectral learning, most of the studies on incorporating prior knowledge in word vectors focus on LSA-based word embeddings (Yih et al., 2012; Chang et al., 2013; Turney and Littman, 2005; Turney, 2006; Turney and Pantel, 2010).

From the technical perspective, our work is also related to that of Jagarlamudi et al. (2011), who showed how to generalize CCA so that it uses locality preserving projections (He and Niyogi, 2004). They also assume the existence of a weight matrix in a multi-view setting that describes the distances between pairs of points in the two views.

More generally, CCA is an important component for spectral learning algorithms in the unsupervised setting and with latent variables (Cohen et al., 2014; Narayan and Cohen, 2016; Stratos et al., 2016). Our method for incorporating prior knowledge into CCA could potentially be transferred to these algorithms.
7 Conclusion

We described a method for incorporating prior knowledge into CCA. Our method requires a relatively simple change to the original canonical correlation analysis, where extra counts are added to the matrix on which singular value decomposition is performed. We used our method to derive word embeddings in the style of eigenwords, and tested them on a set of datasets. Our results demonstrate several advantages of encoding prior knowledge into eigenword embeddings.

Acknowledgements

The authors would like to thank Paramveer Dhillon for his help with running the SWELL package. The authors would also like to thank Manaal Faruqui and Sujay Kumar Jauhar for their help and technical assistance with the retrofitting package and the word embedding evaluation suite. Thanks also to Ankur Parikh for early discussions on this project. This work was completed while the first author was an intern at the University of Edinburgh, as part of the Equate Scotland program. This research was supported by an EPSRC grant (EP/L02411X/1) and an EU H2020 grant (688139/H2020-ICT-2015; SUMMA).

Appendix A: Proofs

Proof of Lemma 1. The proof is similar to the one that appears in Koren and Carmel (2003) for Lemma 3.1. The only difference is the use of two views. Note that [X^⊤ L Y]_ij = Σ_{k,k′} X_ki L_{kk′} Y_{k′j}. As such,

[X^⊤ L Y]_ij = Σ_{k,k′} (n δ_{kk′} − 1) X_ki Y_{k′j}
            = n Σ_{k=1}^n X_ki Y_kj − (Σ_{k=1}^n X_ki)(Σ_{k′=1}^n Y_{k′j})
            = n [X^⊤ Y]_ij,

where δ_{kk′} = 1 iff k = k′ and 0 otherwise, the two sums in the subtracted product each equal 0, and that equality relies on the assumption of the data being centered.

Proof of Lemma 2. Without loss of generality, assume d ≤ d′. Let u′_1, ..., u′_d be the left singular vectors of A and v′_1, ..., v′_{d′} be the right ones, and σ_1, ..., σ_d be the singular values. Therefore A = Σ_{j=1}^d σ_j u′_j (v′_j)^⊤. In addition, the objective equals (after substituting A):

Σ_{i=1}^m Σ_{j=1}^d σ_j ⟨u_i, u′_j⟩ ⟨v_i, v′_j⟩ = Σ_{j=1}^d σ_j ( Σ_{i=1}^m ⟨u_i, u′_j⟩ ⟨v_i, v′_j⟩ ).   (5)

Note that by the Cauchy-Schwarz inequality:

Σ_{j=1}^d Σ_{i=1}^m ⟨u_i, u′_j⟩ ⟨v_i, v′_j⟩ = Σ_{i=1}^m Σ_{j=1}^d ⟨u_i, u′_j⟩ ⟨v_i, v′_j⟩
  ≤ Σ_{i=1}^m sqrt( Σ_{j=1}^d |⟨u_i, u′_j⟩|^2 ) sqrt( Σ_{j=1}^d |⟨v_i, v′_j⟩|^2 ) ≤ m.

In addition, note that if we choose u_i = u′_i and v_i = v′_i, then the inequality above becomes an equality, and in addition, the objective in Eq. 5 will equal the sum of the m largest singular values Σ_{j=1}^m σ_j. As such, this assignment to u_i and v_i maximizes the objective.

Proof of Lemma 3. First, by definition of matrix multiplication,

Σ_{k=1}^m (X u_k)^⊤ L (Y v_k) = Σ_{i,j} L_ij ( Σ_{k=1}^m [X u_k]_i [Y v_k]_j ).   (6)

Also,

(d^m_ij)^2 = (1/2) Σ_{k=1}^m ( [X u_k]_i^2 − 2 [X u_k]_i [Y v_k]_j + [Y v_k]_j^2 ).

Therefore,

2 Σ_{i,j} −L_ij (d^m_ij)^2 = Σ_{i,j} −L_ij ( Σ_{k=1}^m −2 [X u_k]_i [Y v_k]_j ) + Σ_{i,j} −L_ij ( Σ_{k=1}^m [X u_k]_i^2 + [Y v_k]_j^2 )
                         = 2 Σ_{i,j} L_ij Σ_{k=1}^m [X u_k]_i [Y v_k]_j,   (7)

where the squared terms disappear because of the definition of the Laplacian (its rows and columns sum to 0). The comparison of Eq. 6 to Eq. 7 gives us the necessary result.
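Lemma 3 can also be sanity-checked numerically: for any matrix L whose rows and columns sum to zero, the bilinear form on the left equals the weighted sum of squared distances on the right. The sketch below, with randomly generated data and a random graph Laplacian, is an illustrative check and not part of the original derivation.

# Numerical check of Lemma 3: sum_k (X u_k)^T L (Y v_k) == sum_{i,j} -L_ij (d^m_ij)^2
# for any L with rows and columns summing to zero (e.g. a graph Laplacian).
import numpy as np

rng = np.random.default_rng(0)
n, d, d_prime, m = 8, 5, 7, 3
X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, d_prime))
U = rng.normal(size=(d, m))         # columns u_1 .. u_m
V = rng.normal(size=(d_prime, m))   # columns v_1 .. v_m

A = (rng.random((n, n)) < 0.3).astype(float)
A = np.triu(A, 1); A = A + A.T              # random undirected adjacency matrix
L = np.diag(A.sum(axis=1)) - A              # graph Laplacian: rows/columns sum to 0

lhs = sum((X @ U[:, k]) @ L @ (Y @ V[:, k]) for k in range(m))

XU, YV = X @ U, Y @ V                       # projections of the two views, n x m each
D2 = 0.5 * ((XU[:, None, :] - YV[None, :, :]) ** 2).sum(axis=-1)  # (d^m_ij)^2
rhs = (-L * D2).sum()

print(np.allclose(lhs, rhs))                # prints True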
References

Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Paşca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and wordnet-based approaches. In Proceedings of HLT-NAACL.
Francis Bach and Michael Jordan. 2005. A probabilistic interpretation of canonical correlation analysis. Tech Report 688, Department of Statistics, University of California, Berkeley.
Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of ACL.
Simon Baker, Roi Reichart, and Anna Korhonen. 2014. An unsupervised model for instance level subcategorization acquisition. In Proceedings of EMNLP.
Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2014. Tailoring continuous word representations for dependency parsing. In Proceedings of ACL.
Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. 2006. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399-2434.
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137-1155.
Jiang Bian, Bin Gao, and Tie-Yan Liu. 2014. Knowledge-powered deep learning for word embedding. In Machine Learning and Knowledge Discovery in Databases, volume 8724 of Lecture Notes in Computer Science, pages 132-148.
Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of COLT.
Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49:1-47.
Kai-Wei Chang, Wen-tau Yih, and Christopher Meek. 2013. Multi-relational latent semantic analysis. In Proceedings of EMNLP.
Jiaqiang Chen and Gerard de Melo. 2015. Semantic information extraction for improved word embeddings. In Proceedings of the NAACL Workshop on Vector Space Modeling for NLP.
Shay B. Cohen, K. Stratos, Michael Collins, Dean P. Foster, and Lyle Ungar. 2014. Spectral learning of latent-variable PCFGs: Algorithms and sample complexity. Journal of Machine Learning Research.
Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of ICML.
Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407.
Paramveer S. Dhillon, Dean P. Foster, and Lyle H. Ungar. 2015. Eigenwords: Spectral word embeddings. Journal of Machine Learning Research, 16:3035-3078.
Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proceedings of EACL.
Manaal Faruqui and Chris Dyer. 2015. Non-distributional word vector representations. In Proceedings of ACL.
Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL.
Lev Finkelstein, Gabrilovich Evgenly, Matias Yossi, Rivlin Ehud, Solan Zach, Wolfman Gadi, and Ruppin Eytan. 2002. Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20(1):116-131.
Daniel Fried and Kevin Duh. 2015. Incorporating both distributional and relational semantics in word representations. In Proceedings of ICLR.
Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The paraphrase database. In Proceedings of NAACL.
Guy Halawi, Gideon Dror, Evgeniy Gabrilovich, and Yehuda Koren. 2012. Large-scale learning of word relatedness with constraints. In Proceedings of ACM SIGKDD.
Zellig S. Harris. 1957. Co-occurrence and transformation in linguistic structure. Language, 33(3):283-340.
Xiaofei He and Partha Niyogi. 2004. Locality preserving projections. In Proceedings of NIPS.
Felix Hill, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665-695.
Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of ACL.
Jagadeesh Jagarlamudi and Hal Daumé. 2012. Regularized interlingual projections: Evaluation on multilingual transliteration. In Proceedings of EMNLP-CoNLL.
Jagadeesh Jagarlamudi, Raghavendra Udupa, and Hal Daumé. 2011. Generalization of CCA via spectral embedding. In Proceedings of the Snowbird Learning Workshop of AISTATS.
Yehuda Koren and Liran Carmel. 2003. Visualization of labeled data using linear transformations. In Proceedings of the IEEE Conference on Information Visualization.
Thomas K. Landauer, Peter W. Foltz, and Darrell Laham. 1998. An introduction to latent semantic analysis. Discourse Processes, 25:259-284.
Angeliki Lazaridou, Eva Maria Vecchi, and Marco Baroni. 2013. Fish transporters and miracle homes: How compositional distributional semantics can help NP parsing. In Proceedings of EMNLP.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In Proceedings of the ICLR Workshop.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS.
Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013c. Linguistic regularities in continuous space word representations. In Proceedings of NAACL-HLT.
George A. Miller and Walter G. Charles. 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1-28.
George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39-41.
Andriy Mnih and Geoffrey Hinton. 2007. Three new graphical models for statistical language modelling. In Proceedings of ICML.
Shashi Narayan and Shay B. Cohen. 2016. Optimizing spectral learning for parsing. In Proceedings of ACL.
Ankur P. Parikh, Shay B. Cohen, and Eric Xing. 2014. Spectral unsupervised parsing with additive tree metrics. In Proceedings of ACL.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of EMNLP.
Kira Radinsky, Eugene Agichtein, Evgeniy Gabrilovich, and Shaul Markovitch. 2011. A word at a time: Computing word relatedness using temporal semantic analysis. In Proceedings of ACM WWW.
Sascha Rothe and Hinrich Schütze. 2015. AutoExtend: Extending word embeddings to embeddings for synsets and lexemes. In Proceedings of ACL-IJCNLP.
Herbert Rubenstein and John B. Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM, 8(10):627-633.
Carina Silberer, Vittorio Ferrari, and Mirella Lapata. 2013. Models of semantic representation with visual attributes. In Proceedings of ACL.
Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013. Parsing with compositional vector grammars. In Proceedings of ACL.
Karl Stratos, Michael Collins, and Daniel Hsu. 2015. Model-based word embeddings from decompositions of count matrices. In Proceedings of ACL.
Karl Stratos, Michael Collins, and Daniel Hsu. 2016. Unsupervised part-of-speech tagging with anchor hidden markov models. Transactions of the Association for Computational Linguistics, 4:245-257.
Peter D. Turney and Michael L. Littman. 2005. Corpus-based learning of analogies and semantic relations. Machine Learning, 60(1-3):251-278.
Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141-188.
Peter D. Turney. 2006. Similarity of semantic relations. Computational Linguistics, 32(3):379-416.
Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph and text jointly embedding. In Proceedings of EMNLP.
Tong Wang, Abdelrahman Mohamed, and Graeme Hirst. 2015. Learning lexical embeddings with syntactic and lexicographic knowledge. In Proceedings of ACL-IJCNLP.
Chang Xu, Yalong Bai, Jiang Bian, Bin Gao, Gang Wang, Xiaoguang Liu, and Tie-Yan Liu. 2014. RC-NET: A general framework for incorporating knowledge into word representations. In Proceedings of the ACM CIKM.
Dongqiang Yang and David M. W. Powers. 2005. Measuring semantic similarity in the taxonomy of WordNet. In Proceedings of the Australasian Conference on Computer Science.
David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of ACL.
Wen-tau Yih, Geoffrey Zweig, and John Platt. 2012. Polarity inducing latent semantic analysis. In Proceedings of EMNLP-CoNLL.
Mo Yu and Mark Dredze. 2014. Improving lexical embeddings with semantic knowledge. In Proceedings of ACL.