Transactions of the Association for Computational Linguistics, vol. 3, pp. 227–242, 2015. Action Editor: Joakim Nivre.
Submission batch: 2/2015; Revision batch: 4/2015; Published 5/2015.
© 2015 Association for Computational Linguistics. Distributed under a CC-BY-NC-SA 4.0 license.
Learning Composition Models for Phrase Embeddings

Mo Yu
Machine Intelligence & Translation Lab
Harbin Institute of Technology
Harbin, China
gflfof@gmail.com

Mark Dredze
Human Language Technology Center of Excellence
Center for Language and Speech Processing
Johns Hopkins University
Baltimore, Maryland 21218
mdredze@cs.jhu.edu

Abstract

Lexical embeddings can serve as useful representations for words for a variety of NLP tasks, but learning embeddings for phrases can be challenging. While separate embeddings are learned for each word, this is infeasible for every phrase. We construct phrase embeddings by learning how to compose word embeddings using features that capture phrase structure and context. We propose efficient unsupervised and task-specific learning objectives that scale our model to large datasets. We demonstrate improvements on both language modeling and several phrase semantic similarity tasks with various phrase lengths. We make the implementation of our model and the datasets available for general use.

1 Introduction

Word embeddings learned by neural language models (Bengio et al., 2003; Collobert and Weston, 2008; Mikolov et al., 2013b) have been successfully applied to a range of tasks, including syntax (Collobert and Weston, 2008; Turian et al., 2010; Collobert, 2011) and semantics (Huang et al., 2012; Socher et al., 2013b; Hermann et al., 2014). However, phrases are critical for capturing lexical meaning for many tasks. For example, Collobert and Weston (2008) showed that word embeddings yielded state-of-the-art systems on word-oriented tasks (POS, NER), but performance on phrase-oriented tasks, such as SRL, lags behind.

We propose a new method for compositional semantics that learns to compose word embeddings into phrases. In contrast to a common approach to phrase embeddings that uses pre-defined composition operators (Mitchell and Lapata, 2008), e.g., component-wise sum/multiplication, we learn composition functions that rely on phrase structure and context. Other work on learning compositions relies on matrices/tensors as transformations (Socher et al., 2011; Socher et al., 2013a; Hermann and Blunsom, 2013; Baroni and Zamparelli, 2010; Socher et al., 2012; Grefenstette et al., 2013). However, this work suffers from two primary disadvantages. First, these methods have high computational complexity for dense embeddings: O(d^2) or O(d^3) for composing every two components with d dimensions. The high computational complexity restricts these methods to very low-dimensional embeddings (25 or 50). While low-dimensional embeddings perform well for syntax (Socher et al., 2013a) and sentiment (Socher et al., 2013b) tasks, they do poorly on semantic tasks. Second, because of the complexity, they use supervised training with small task-specific datasets. An exception is the unsupervised objective of recursive auto-encoders (Socher et al., 2011). Yet this work cannot utilize contextual features of phrases and still poses scaling challenges.

In this work we propose a novel compositional transformation called the Feature-rich Compositional Transformation (FCT) model. FCT produces phrases from their word components. In contrast to previous work, our approach to phrase composition can efficiently utilize high dimensional embeddings (e.g. d = 200) with an unsupervised objective, both of which are critical to doing well on semantics tasks.
Our composition function is parameterized to allow the inclusion of features based on the phrase structure and contextual information, including positional indicators of the word components. The phrase composition is a weighted summation of embeddings of component words, where the summation weights are defined by the features, which allows for fast composition.

We discuss a range of training settings for FCT. For tasks with labeled data, we utilize task-specific training. We begin with embeddings trained on raw text and then learn compositional phrase parameters as well as fine-tune the embeddings for the specific task's objective. For tasks with unlabeled data (e.g. most semantic tasks) we can train on a large corpus of unlabeled data. For tasks with both labeled and unlabeled data, we consider a joint training scheme. Our model's efficiency ensures we can incorporate large amounts of unlabeled data, which helps mitigate over-fitting and increases vocabulary coverage.

We begin with a presentation of FCT (§2), including our proposed features for the model. We then present three training settings (§3) that cover language modeling (unsupervised), task-specific training (supervised), and joint (semi-supervised) settings. The remainder of the paper is devoted to evaluation of each of these settings.

2 Feature-rich Compositional Transformations from Words to Phrases

We learn transformations for composing phrase embeddings from the component words based on extracted features from a phrase, where we assume that the phrase boundaries are given. The resulting phrase embedding is based on a per-dimension weighted average of the component phrases. Consider the example of base noun phrases (NP), a common phrase type which we want to compose. Base NPs often have flat structures – all words modify the head noun – which means that our transformation should favor the head noun in the composed phrase embedding. For each of the N words w_i in phrase p we construct the embedding:

e_p = \sum_{i}^{N} \lambda_i \odot e_{w_i}    (1)

where e_{w_i} is the embedding for word i and \odot refers to point-wise product. \lambda_i is a weight vector that is constructed based on the features of p and the model parameters:

\lambda_{ij} = \sum_{k} \alpha_{jk} f_k(w_i, p) + b_{ij}    (2)

where f_k(w_i, p) is a feature function that considers word w_i in phrase p and b_{ij} is a bias term. This model is fast to train since it has only linear transformations: the only operations are vector summation and inner product. Therefore, we learn the model parameters \alpha together with the embeddings. We call this the Feature-rich Compositional Transformation (FCT) model.

Consider some example phrases and associated features. The phrase "the museum" should have an embedding nearly identical to "museum" since "the" has minimal impact on the phrase's meaning. This can be captured through part-of-speech (POS) tags, where a tag of DT on "the" will lead to \lambda_i \approx \vec{0}, removing its impact on the phrase embedding. In some cases, words will have specific behaviors. In the phrase "historic museum", the word "historic" should impact the phrase embedding to be closer to "landmark". To capture this behavior we add smoothed lexical features, where smoothing reduces data sparsity effects. These features can be based on word clusters, themselves induced from pre-trained word embeddings.

Our feature templates are shown in Table 1. Phrase boundaries, tags and heads are identified using existing parsers or from Annotated Gigaword (Napoles et al., 2012) as described in Section 5. In Eq. (1), we do not limit phrase structure, though the features in Table 1 tend to assume a flat structure. However, with additional features the model could handle longer phrases with hierarchical structures, and adding these features does not change our model or training objectives. Following the semantic tasks used for evaluation we experimented with base NPs (including both bigram NPs and longer ones). We leave explorations of features for complex structures to future work.

FCT has two sets of parameters: one is the feature weights (α, b), the other is word embeddings (e_w). We could directly use the word embeddings learned by neural language models. However, our experiments show that those word embeddings are often not suited for FCT. Therefore we propose to
Table 1: Feature templates. Simple Features / Compound Features; e.g., POS tags: t(w_{i−1}), t(w_i), t(w_{i+1}), …
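To make Eqs. (1)–(2) concrete, the sketch below composes a bigram NP embedding from per-word features and the parameter matrix α. This is a minimal illustration of ours, not the released implementation: it uses only a POS indicator feature rather than the full templates of Table 1, and all function and variable names are assumptions.

```python
import numpy as np

# Minimal sketch of FCT composition (Eqs. 1-2). The real model uses the richer
# feature templates of Table 1 (POS, head, word-cluster, compound features);
# here only a one-hot POS indicator per word is used. All names are ours.

def pos_features(pos_tag, feature_index):
    """A toy f(w_i, p): a one-hot POS indicator of length K."""
    f = np.zeros(len(feature_index))
    if "pos=" + pos_tag in feature_index:
        f[feature_index["pos=" + pos_tag]] = 1.0
    return f

def compose_phrase(words, pos_tags, embeddings, alpha, b, feature_index):
    """e_p = sum_i lambda_i (.) e_{w_i}, with lambda_i = alpha f(w_i, p) + b_i."""
    e_p = np.zeros(alpha.shape[0])
    for i, (w, tag) in enumerate(zip(words, pos_tags)):
        f = pos_features(tag, feature_index)   # K-dim feature vector f(w_i, p)
        lam = alpha.dot(f) + b[i]              # d-dim weight vector lambda_i (Eq. 2)
        e_p += lam * embeddings[w]             # point-wise product and sum (Eq. 1)
    return e_p

# Toy usage: after training, a DT determiner should receive near-zero weights,
# so "the museum" stays close to "museum".
d = 4
feature_index = {"pos=DT": 0, "pos=NN": 1, "pos=JJ": 2}
rng = np.random.default_rng(0)
embeddings = {"the": rng.normal(size=d), "museum": rng.normal(size=d)}
alpha = rng.normal(scale=0.1, size=(d, len(feature_index)))
b = np.zeros((2, d))
print(compose_phrase(["the", "museum"], ["DT", "NN"], embeddings, alpha, b, feature_index))
```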
Data Set | Input | Output: (1) PPDB: medicinal products → drugs; (2) SemEval 2013: …
baselines: we re-implemented the recursive neural network model (RNN) (Socher et al., 2013a) and the Dual VSM algorithm in Turney (2012)6 so that they can be trained on our dataset. We also include results for fine-tuning word embeddings in SUM and Weighted SUM with TASK-SPEC objectives, which demonstrate improvements over the corresponding methods without fine-tuning. As before, word embeddings are pre-trained with word2vec.

RNNs serve as another way to model the compositionality of bigrams. We run an RNN on bigrams and associated sub-trees, the same setting FCT uses, and train it on our TASK-SPEC objectives with the technique described in Section 3.4. As in Socher et al. (2013a), we refine the matrix W in Eq. (5) according to the POS tags of the component words.7 For example, for a bigram NP like new/ADJ trial/NN, we use a matrix W_{ADJ-NN} to transform the two word embeddings to the phrase embedding. In the experiments we have 60 different matrices in total for bigram NPs. The number is larger than that in Socher et al. (2013a) due to incorrect tags in automatic parses.

Since the RNN model has time complexity O(n^2), we compare RNNs with different sized embeddings. The first one uses embeddings with 50 dimensions, which has the same size as the embeddings used in Socher et al. (2013a), and has similar complexity to our model with 200 dimension embeddings. The second model uses the same 200 dimension embeddings as our model but is significantly more computationally expensive. For all models, we normalize the embeddings so that the L-2 norm equals 1, which is important in measuring semantic similarity via inner product.

6.1 Results: Bigram Phrases

PPDB. Our first task is to measure phrase similarity on PPDB. Training uses the TASK-SPEC objective (Eq. (4) with NCE training) where data are phrase-word pairs. The goal is to select B from a set of candidates given A, where pair similarity is measured using inner product. We use candidate sets of size 1k/10k/100k from the most frequent N words in NYT and report mean reciprocal rank (MRR). We report results with the baseline methods (SUM, Weighted SUM, RNN). For FCT we report training with the TASK-SPEC objective, the joint objective (FCT-J) and the pipeline approach (FCT-P). To ensure that the TASK-SPEC objective has a stronger influence in FCT-Joint, we weighted each training instance of LM by 0.01, which is equivalent to setting the learning rate of the LM objective equal to η/100 and that of the TASK-SPEC objective as η. Training makes the same number of passes with the same learning rate as training with the TASK-SPEC objective only. For each method we report results with and without fine-tuning the word embeddings on the labeled data. We run FCT on the PPDB training data for 5 epochs with learning rate η = 0.05, which are both selected from the development set.

6 We did not include results for a holistic model as in Turney (2012), since most of the phrases (especially those in PPDB) in our experiments are common phrases, making the vocabulary too large to train. One solution would be to only train holistic embeddings for phrases in the test data, but examination of a test set before training is not a realistic assumption.

7 We do not compare the performance between using a single matrix and several matrices since, as discussed in Socher et al. (2013a), Ws refined with POS tags work much better than using a single W. That also supports the argument in this paper, that it is important to determine the transformation with more features.

Figure 1: Performance on the PPDB task (test set). (a) MRR of models with fixed word embeddings; (b) MRR of models with fine-tuning. x-axis: candidate vocabulary size (10^3 to 10^5); y-axis: MRR on test set (%); curves: SUM, RNN50, RNN200, FCT (and FCT-pipeline, FCT-joint in (b)).
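The MRR numbers in Fig. 1 and Table 6 follow the protocol just described: L2-normalized embeddings, inner-product similarity against a candidate set, mean reciprocal rank of the gold word. Below is a small sketch of that evaluation; it is our own illustration, not the released evaluation code, and the candidate-set construction and data layout are assumptions.

```python
import numpy as np

# Sketch of the ranking evaluation: L2-normalize embeddings, score
# phrase-word pairs by inner product, and report mean reciprocal rank (MRR)
# over a candidate set. Function names and data layout are assumptions.

def normalize_rows(M):
    return M / np.linalg.norm(M, axis=1, keepdims=True)

def mean_reciprocal_rank(phrase_embs, gold_ids, candidate_embs):
    """phrase_embs: (Q, d); candidate_embs: (C, d); gold_ids[q] indexes the
    candidate that should rank first for query q."""
    P = normalize_rows(phrase_embs)
    C = normalize_rows(candidate_embs)
    scores = P @ C.T                                  # inner-product similarities
    gold_scores = scores[np.arange(len(gold_ids)), gold_ids][:, None]
    ranks = 1 + (scores > gold_scores).sum(axis=1)    # 1 = gold ranked first
    return float(np.mean(1.0 / ranks))

# Toy usage with a 10k-word candidate set.
rng = np.random.default_rng(1)
d, C, Q = 50, 10000, 5
cands = rng.normal(size=(C, d))
gold = rng.integers(0, C, size=Q)
queries = cands[gold] + 0.1 * rng.normal(size=(Q, d))  # queries near their gold words
print("MRR@10k:", mean_reciprocal_rank(queries, gold, cands))
```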
Fig. 1 shows the overall MRR results on different candidate vocabulary sizes (1k, 10k and 100k), and Table 6 highlights the results on the vocabulary using the top 10k words.

Table 6: Performance on the PPDB task (test data).
Model | Objective | Fine-tuning Word Emb | MRR@10k
SUM | – | – | 41.19
SUM | TASK-SPEC | Y | 45.01
WSum | TASK-SPEC | Y | 45.43
RNN50 | TASK-SPEC | N | 37.81
RNN50 | TASK-SPEC | Y | 39.25
RNN200 | TASK-SPEC | N | 41.13
RNN200 | TASK-SPEC | Y | 40.50
FCT | TASK-SPEC | N | 41.96
FCT | TASK-SPEC | Y | 46.99
FCT | LM | Y | 42.63
FCT-P | TASK-SPEC+LM | Y | 49.44
FCT-J | TASK-SPEC+LM | joint | 51.65

Overall, FCT with TASK-SPEC training improves over all the baseline methods in each setting. Fine-tuning word embeddings improves all methods except RNN (d = 200). We note that the RNN performs poorly, possibly because it uses a complex transformation from word embeddings to phrase embeddings, making the learned transformation difficult to generalize well to new phrases and words when the task-specific labeled data is small. As a result, there is no guarantee of comparability between new pairs of phrase and word embeddings. The phrase embeddings may end up in a different part of the subspace from the word embeddings.

Compared to SUM and Weighted SUM, FCT is capable of using features providing critical contextual information, which is the source of FCT's improvement. Additionally, since the RNNs also used POS tags and parsing information yet achieved lower scores than FCT, our results show that FCT more effectively uses these features. To better show this advantage, we train FCT models with only POS tag features, which achieve 46.37/41.20 on MRR@10k with/without fine-tuning word embeddings, still better than RNNs. See Section 6.3 for a full ablation study of the features in Table 1.

Semi-supervised Results: Table 6 also highlights the improvement from semi-supervised learning. First, the fully unsupervised method (LM) improves over SUM, showing that improvements in language modeling carry over to semantic similarity tasks. This correlation between the LM objective and the target task ensures the success of semi-supervised training. As a result, both semi-supervised methods, FCT-J and FCT-P, improve over the supervised methods, and FCT-J achieves the best results of all methods, including FCT-P. This demonstrates the effectiveness of including large amounts of unlabeled data while learning with a TASK-SPEC objective. We believe that by adding the LM objective, we can propagate the semantic information of embeddings to the words that do not appear in the labeled data (see the differences between vocabulary sizes in Table 2).

The improvement of FCT-J over FCT-P also indicates that the joint training strategy can be more effective than the traditional pipeline-based pre-training. As discussed in Section 3.3, the pipeline method, although commonly used in the deep learning literature, does not suit NLP applications well because of the sparsity in word embeddings. Therefore, our results suggest an alternative solution to a wide range of NLP problems where labeled data has low coverage of the vocabulary. For future work, we will further investigate the idea of joint training on more tasks and compare with the pipeline method.

Results on SemEval 2013 and Turney 2012. We evaluate the same methods on SemEval 2013 and the Turney 2012 5- and 10-choice tasks, which both provide training and test splits. The same baselines as in the PPDB experiments, as well as the Dual Space method of Turney (2012) and the recursive auto-encoder (RAE) from Socher et al. (2011), are used for comparison. Since the tasks did not provide any development data, we used cross-validation (5 folds) for tuning the parameters, and finally set the training epochs to be 20 and η = 0.01. For joint training, the weight of the LM objective is 0.005 (i.e. with a learning rate equal to 0.005η) since the training sets for these two tasks are much smaller. For convenience, we also include results for Dual Space as reported in Turney (2012), though they are not comparable here since Turney (2012) used a much larger training set.
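Before turning to Table 7, the sketch below illustrates the joint (TASK-SPEC + LM) training used for FCT-J, where each LM instance is down-weighted (0.01 on PPDB, 0.005 on these smaller tasks), which amounts to scaling the LM learning rate. This is a minimal SGD-style sketch of ours under that reading; the gradient functions are placeholders, not the paper's actual NCE/ranking objectives.

```python
import numpy as np

# Sketch of the joint (semi-supervised) training: TASK-SPEC updates use
# learning rate eta, while each LM instance is down-weighted by lm_weight,
# i.e. an effective LM learning rate of eta * lm_weight. The gradient
# functions below are placeholders of ours, not the paper's objectives.

def grad_task_spec(params, labeled_example):
    return {k: np.zeros_like(v) for k, v in params.items()}   # placeholder

def grad_lm(params, unlabeled_ngram):
    return {k: np.zeros_like(v) for k, v in params.items()}   # placeholder

def joint_sgd(params, labeled_data, unlabeled_data, eta=0.05, lm_weight=0.01, epochs=5):
    """Interleave TASK-SPEC and LM stochastic updates, LM term down-weighted."""
    for _ in range(epochs):
        for labeled, unlabeled in zip(labeled_data, unlabeled_data):
            g = grad_task_spec(params, labeled)
            for k in params:
                params[k] -= eta * g[k]                        # TASK-SPEC step, rate eta
            g = grad_lm(params, unlabeled)
            for k in params:
                params[k] -= eta * lm_weight * g[k]            # LM step, rate eta/100 when lm_weight=0.01
    return params
```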
Table 7: Performance on SemEval 2013 and Turney 2012 semantic similarity tasks. DualSpace1: our reimplementation of the method in Turney (2012). DualSpace2: the result reported in Turney (2012). RAE is the recursive auto-encoder in Socher et al. (2011), which is trained with the reconstruction-based objective of the auto-encoder.

Model | Objective | Fine-tuning Word Emb | SemEval2013 Test | Turney2012 Acc(5) | Turney2012 Acc(10) | Turney2012 MRR@10k
SUM | – | – | 65.46 | 39.58 | 19.79 | 12.00
SUM | TASK-SPEC | Y | 67.93 | 48.15 | 24.07 | 14.32
WeightedSum | TASK-SPEC | Y | 69.51 | 52.55 | 26.16 | 14.74
RNN (d=50) | TASK-SPEC | N | 67.20 | 39.64 | 25.35 | 1.39
RNN (d=50) | TASK-SPEC | Y | 70.36 | 41.96 | 27.20 | 1.46
RNN (d=200) | TASK-SPEC | N | 71.50 | 40.95 | 27.20 | 3.89
RNN (d=200) | TASK-SPEC | Y | 72.22 | 42.84 | 29.98 | 4.03
DualSpace1 | – | – | 52.47 | 27.55 | 16.36 | 2.22
DualSpace2 | – | – | – | 58.3 | 41.5 | –
RAE | auto-encoder | – | 51.75 | 22.99 | 14.81 | 0.16
FCT | TASK-SPEC | N | 68.84 | 41.90 | 33.80 | 8.50
FCT | TASK-SPEC | Y | 70.36 | 52.31 | 38.66 | 13.19
FCT | LM | – | 67.22 | 42.59 | 27.55 | 14.07
FCT-P | TASK-SPEC+LM | Y | 70.64 | 53.09 | 39.12 | 14.17
FCT-J | TASK-SPEC+LM | joint | 70.65 | 53.31 | 39.12 | 14.25

Table 7 shows similar trends as PPDB. One difference here is that RNNs do better with 200 dimensional embeddings on SemEval 2013, though at a dimensionality with similar computational complexity to FCT (d = 50), FCT improves. Additionally, on the 10-choice task of Turney 2012, both the FCT and the RNN models, either with or without fine-tuning word embeddings, significantly outperform SUM, showing that both models capture the word order information. Fine-tuning gives smaller gains on RNNs, likely because the limited number of training examples is insufficient for the complex RNN model. The LM objective leads to improvements on all three tasks, while RAE does not perform significantly better than random guessing. These results are perhaps attributable to the lack of assumptions in the objective about the relations between word embeddings and phrase embeddings, making the learned phrase embeddings not comparable to word embeddings.

6.2 Dimensionality and Complexity

A benefit of FCT is that it is computationally efficient, allowing it to easily scale to embeddings of 200 dimensions. By contrast, RNN models typically use smaller sized embeddings (d = 25 proved best in Socher et al., 2013a) and cannot scale up to large datasets when larger dimensionality embeddings are used. For example, when training on the PPDB data, FCT with d = 200 processes 2.33 instances per ms, while the RNN with the same dimensionality processes 0.31 instances/ms. Training an RNN with d = 50 is of comparable speed to FCT with d = 200. Figure 2(a-b) shows the MRR on PPDB for 1k and 10k candidate sets for both the SUM baseline and FCT with a TASK-SPEC objective and full features, as compared to RNNs with different sized embeddings. Both FCT and RNN use fine-tuned embeddings. With a small number of embedding dimensions, RNNs achieve better results. However, FCT can scale to much higher dimensionality embeddings, which easily surpasses the results of RNNs. This is especially important when learning a large number of embeddings: the 25-dimensional space may not be sufficient to capture the semantic diversity, as evidenced by the poor performance of RNNs with lower dimensionality.

Similar trends observed on the PPDB data also appear on the tasks of Turney 2012 and SemEval 2013. Figure 2(c-f) shows the performances on these two tasks. On the Turney 2012 task, FCT even outperforms the RNN model using embeddings with the same dimensionality. One possible reason is overfitting of the more complex RNN models on these small training sets. Figure 2(d) shows that the performances of FCT on the 10-choice task are less affected by the dimensions of embeddings. That is because the composition models can well handle the word order information,
Figure 2: Effects of embedding dimension on the semantic similarity tasks. x-axis: dimension of embeddings (0–500), with RNN points at d = 25/50/200; curves: SUM, FCT, RNN. Panels: (a) MRR@1k on PPDB dev set; (b) MRR@10k on PPDB dev set; (c) accuracy on the 5-choice task in Turney 2012; (d) accuracy on the 10-choice task in Turney 2012; (e) MRR@10k on Turney 2012; (f) accuracy on SemEval 2013.
Table 9: Ablation study on dev set of the PPDB ngram-to-ngram task (MRR@10k).
Feature Set | MRR@10k
FCT | 79.68
-clus | 76.82
-POS | 77.67
-Compound | 79.40
-Head | 77.50
-Distance | 78.86
WSum | 75.37
SUM | 74.30

by the quality of single word semantics. Therefore, we expect larger gains from FCT on tasks where single word embeddings are less important, such as relation extraction (long distance dependencies) and question understanding (intentions are largely dependent on interrogatives).

Finally, we demonstrate the efficacy of different features in FCT (Table 1) with an ablation study (Table 9). Word cluster features contribute most, because the point-wise product between a word embedding and its context word cluster representation is actually an approximation of the word-word interaction, which is believed important for phrase compositions. Head features, though few, also make a big difference, reflecting the importance of syntactic information. Compound features do not have much of an impact, possibly because the simpler features capture enough information.

7 Related Work

Compositional semantic models aim to build distributional representations of a phrase from its component word representations. A traditional approach for composition is to form a point-wise combination of single word representations with compositional operators either pre-defined (e.g. element-wise sum/multiplication) or learned from data (Le and Mikolov, 2014). However, these approaches ignore the inner structure of phrases, e.g. the order of words in a phrase and its syntactic tree, and the point-wise operations are usually less expressive. One solution is to apply a matrix transformation (possibly followed by a non-linear transformation) to the concatenation of component word representations (Zanzotto et al., 2010). For longer phrases, matrix multiplication can be applied recursively according to the associated syntactic trees (Socher et al., 2010). However, because the input of the model is the concatenation of word representations, matrix transformations cannot capture interactions between a word and its contexts, or between component words.

There are three ways to restore these interactions. The first is to use word-specific/tensor transformations to force the interactions between component words in a phrase. In these methods, word-specific transformations, which are usually matrices, are learned for a subset of words according to their syntactic properties (e.g. POS tags) (Baroni and Zamparelli, 2010; Socher et al., 2012; Grefenstette et al., 2013; Erk, 2013). Composition between a word in this subset and another word becomes the multiplication between the matrix associated with one word and the embedding of the other, producing a new embedding for the phrase. Using one tensor (not word-specific) to compose two embedding vectors, which has not been tested on phrase similarity tasks (Bordes et al., 2014; Socher et al., 2013b), is a special case of this approach, where a "word-specific transformation matrix" is derived by multiplying the tensor and the word embedding. Additionally, word-specific matrices can only capture the interaction between a word and one of its context words; others have considered extensions to multiple words (Grefenstette et al., 2013; Dinu and Baroni, 2014). The primary drawback of these approaches is the high computational complexity, limiting their usefulness for semantics (Section 6.2).

A second approach draws on the concept of contextualization (Erk and Padó, 2008; Dinu and Lapata, 2010; Thater et al., 2011), which sums embeddings of multiple words in a linear combination. For example, Cheung and Penn (2013) apply contextualization to word compositions in a generative event extraction model. However, this is an indirect way to capture interactions (the transformations are still unaware of interactions between components), and thus has not been a popular choice for composition.

The third approach is to refine word-independent compositional transformations with annotation features. FCT falls under this approach. The primary advantage is that composition can rely on richer linguistic features from the context.
While the embeddings of component words still cannot interact, they can interact with other information (i.e. features) of their context words, and even the global features. Recent research has created novel features based on combining word embeddings and contextual information (Nguyen and Grishman, 2014; Roth and Woodsend, 2014; Kiros et al., 2014; Yu et al., 2014; Yu et al., 2015). Yu et al. (2015) further proposed converting the contextual features into a hidden layer called feature embeddings, which is similar to the α matrix in this paper. Examples of applications to phrase semantics include Socher et al. (2013a) and Hermann and Blunsom (2013), who enhanced RNNs by refining the transformation matrices with phrase types and CCG supertags. However, these models are only able to use limited information (usually one property for each compositional transformation), whereas FCT exploits multiple features.

Finally, our work is related to recent work on low-rank tensor approximations. When we use the phrase embedding e_p in Eq. (1) to predict a label y, the score of y given phrase p will be s(y, p) = U_y^\top e_p = \sum_{i}^{N} U_y^\top (\lambda_i \odot e_{w_i}) in log-linear models, where U_y is the parameter vector for y. This is equivalent to using a parameter tensor T to evaluate the score with s'(y, p) = \sum_{i}^{N} T \times_1 y \times_2 f(w_i, p) \times_3 e_{w_i}, while forcing the tensor to have a low-rank form as T \approx U \otimes \alpha \otimes e_w. Here \times_k indicates tensor multiplication of the kth view, and \otimes indicates matrix outer product (Kolda and Bader, 2009). From this point of view, our work is closely related to the discriminative training methods for low-rank tensors in NLP (Cao and Khudanpur, 2014; Lei et al., 2014), while it can handle more complex ngram-to-ngram tasks, where the label y also has its embedding composed from basic word embeddings. Therefore our model can capture the above work as special cases. Moreover, we have a different method of decomposing the inputs, which results in views of lexical parts and non-lexical features. As we show in this paper, this input decomposition allows us to benefit from pre-trained word embeddings and feature weights.

8 Conclusion

We have presented FCT, a new composition model for deriving phrase embeddings from word embeddings. Compared to existing phrase composition models, FCT is very efficient and can utilize high dimensional word embeddings, which are crucial for semantic similarity tasks. We have demonstrated how FCT can be utilized in a language modeling setting, as well as tuned with task-specific data. Fine-tuning embeddings on task-specific data can further improve FCT, but combining both LM and TASK-SPEC objectives yields the best results. We have demonstrated improvements on both language modeling and several semantic similarity tasks. Our implementation and datasets are publicly available.8

While our results demonstrate improvements for longer phrases, we still only focus on flat phrase structures. In future work we plan to combine FCT with the idea of recursively building representations. This would allow the utilization of hierarchical structure while restricting compositions to a small number of components.

Acknowledgments

We thank Matthew R. Gormley for his input and anonymous reviewers for their comments. Mo Yu is supported by the China Scholarship Council and by NSFC 61173073.

8 https://github.com/Gorov/FCT_PhraseSim_TACL

References

Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Empirical Methods in Natural Language Processing (EMNLP), pages 1183–1193.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. The Journal of Machine Learning Research (JMLR), 3:1137–1155.

Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. 2014. A semantic matching energy function for learning with multi-relational data. Machine Learning, 94(2):233–259.

Yuan Cao and Sanjeev Khudanpur. 2014. Online learning in tensor space. In Association for Computational Linguistics (ACL), pages 666–675.

Jackie Chi Kit Cheung and Gerald Penn. 2013. Probabilistic domain modelling with contextualized distributional semantic vectors. In Association for Computational Linguistics (ACL), pages 392–401.
Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In International Conference on Machine Learning (ICML), pages 160–167.

Ronan Collobert. 2011. Deep learning for efficient discriminative parsing. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 224–232.

Georgiana Dinu and Marco Baroni. 2014. How to make words with vectors: Phrase generation in distributional semantics. In Association for Computational Linguistics (ACL), pages 624–633.

Georgiana Dinu and Mirella Lapata. 2010. Measuring distributional similarity in context. In Empirical Methods in Natural Language Processing (EMNLP), pages 1162–1172.

Katrin Erk and Sebastian Padó. 2008. A structured vector space model for word meaning in context. In Empirical Methods in Natural Language Processing (EMNLP), pages 897–906.

Katrin Erk. 2013. Towards a semantics for distributional representations. In International Conference on Computational Semantics (IWCS 2013), pages 95–106.

Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The paraphrase database. In North American Chapter of the Association for Computational Linguistics (NAACL), pages 758–764.

Edward Grefenstette, Georgiana Dinu, Yao-Zhong Zhang, Mehrnoosh Sadrzadeh, and Marco Baroni. 2013. Multi-step regression learning for compositional distributional semantics. arXiv:1301.6939.

Karl Moritz Hermann and Phil Blunsom. 2013. The role of syntax in vector space models of compositional semantics. In Association for Computational Linguistics (ACL), pages 894–904.

Karl Moritz Hermann, Dipanjan Das, Jason Weston, and Kuzman Ganchev. 2014. Semantic frame identification with distributed word representations. In Association for Computational Linguistics (ACL), pages 1448–1458.

Eric H Huang, Richard Socher, Christopher D Manning, and Andrew Y Ng. 2012. Improving word representations via global context and multiple word prototypes. In Association for Computational Linguistics (ACL), pages 873–882.

Ryan Kiros, Richard Zemel, and Ruslan R Salakhutdinov. 2014. A multiplicative model for learning distributed text-based attribute representations. In Advances in Neural Information Processing Systems (NIPS), pages 2348–2356.

Tamara G Kolda and Brett W Bader. 2009. Tensor decompositions and applications. SIAM Review, 51(3):455–500.

Ioannis Korkontzelos, Torsten Zesch, Fabio Massimo Zanzotto, and Chris Biemann. 2013. SemEval-2013 Task 5: Evaluating phrasal semantics. In Joint Conference on Lexical and Computational Semantics (*SEM), pages 39–47.

Quoc V Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053.

Tao Lei, Yu Xin, Yuan Zhang, Regina Barzilay, and Tommi Jaakkola. 2014. Low-rank tensors for scoring dependency structures. In Association for Computational Linguistics (ACL), pages 1381–1391.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013b. Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546.

Jeff Mitchell and Mirella Lapata. 2008. Vector-based models of semantic composition. In Association for Computational Linguistics (ACL), pages 236–244.

Jeff Mitchell and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science, 34(8):1388–1429.

Courtney Napoles, Matthew Gormley, and Benjamin Van Durme. 2012. Annotated Gigaword. In ACL Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction, pages 95–100.

Thien Huu Nguyen and Ralph Grishman. 2014. Employing word representations and regularization for domain adaptation of relation extraction. In Association for Computational Linguistics (ACL), pages 68–74.

Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2011. English Gigaword fifth edition, June. Linguistic Data Consortium, LDC2011T07.

Michael Roth and Kristian Woodsend. 2014. Composition of word representations improves semantic role labelling. In Empirical Methods in Natural Language Processing (EMNLP), pages 407–413.

Richard Socher, Christopher D Manning, and Andrew Y Ng. 2010. Learning continuous phrase representations and syntactic parsing with recursive neural networks. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, pages 1–9.

Richard Socher, Jeffrey Pennington, Eric H Huang, Andrew Y Ng, and Christopher D Manning. 2011. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Empirical Methods in Natural Language Processing (EMNLP), pages 151–161.
Richard Socher, Brody Huval, Christopher D Manning, and Andrew Y Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Empirical Methods in Natural Language Processing (EMNLP), pages 1201–1211.

Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013a. Parsing with compositional vector grammars. In Association for Computational Linguistics (ACL), pages 455–465.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013b. Recursive deep models for semantic compositionality over a sentiment treebank. In Empirical Methods in Natural Language Processing (EMNLP), pages 1631–1642.

Stefan Thater, Hagen Fürstenau, and Manfred Pinkal. 2011. Word meaning in context: A simple and effective vector model. In International Joint Conference on Natural Language Processing (IJCNLP), pages 1134–1143.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Association for Computational Linguistics (ACL), pages 384–394.

Peter D Turney. 2012. Domain and function: A dual-space model of semantic relations and compositions. Journal of Artificial Intelligence Research (JAIR), 44:533–585.

Mo Yu and Mark Dredze. 2014. Improving lexical embeddings with semantic knowledge. In Association for Computational Linguistics (ACL), pages 545–550.

Mo Yu, Matthew Gormley, and Mark Dredze. 2014. Factor-based compositional embedding models. In NIPS Workshop on Learning Semantics.

Mo Yu, Matthew R. Gormley, and Mark Dredze. 2015. Combining word embeddings and feature embeddings for fine-grained relation extraction. In North American Chapter of the Association for Computational Linguistics (NAACL).

Fabio Massimo Zanzotto, Ioannis Korkontzelos, Francesca Fallucchi, and Suresh Manandhar. 2010. Estimating linear models for compositional distributional semantics. In International Conference on Computational Linguistics (COLING), pages 1263–1271.