Transactions of the Association for Computational Linguistics, vol. 3, pp. 227–242, 2015. Action Editor: Joakim Nivre.
Submission batch: 2/2015; Revision batch: 4/2015; Published 5/2015.
© 2015 Association for Computational Linguistics. Distributed under a CC-BY-NC-SA 4.0 license.
Learning Composition Models for Phrase Embeddings

Mo Yu
Machine Intelligence & Translation Lab
Harbin Institute of Technology
Harbin, China
gflfof@gmail.com

Mark Dredze
Human Language Technology Center of Excellence
Center for Language and Speech Processing
Johns Hopkins University
Baltimore, Maryland 21218
mdredze@cs.jhu.edu

Abstract

Lexical embeddings can serve as useful representations for words for a variety of NLP tasks, but learning embeddings for phrases can be challenging. While separate embeddings are learned for each word, this is infeasible for every phrase. We construct phrase embeddings by learning how to compose word embeddings using features that capture phrase structure and context. We propose efficient unsupervised and task-specific learning objectives that scale our model to large datasets. We demonstrate improvements on both language modeling and several phrase semantic similarity tasks with various phrase lengths. We make the implementation of our model and the datasets available for general use.

1 Introduction

Word embeddings learned by neural language models (Bengio et al., 2003; Collobert and Weston, 2008; Mikolov et al., 2013b) have been successfully applied to a range of tasks, including syntax (Collobert and Weston, 2008; Turian et al., 2010; Collobert, 2011) and semantics (Huang et al., 2012; Socher et al., 2013b; Hermann et al., 2014). However, phrases are critical for capturing lexical meaning for many tasks. For example, Collobert and Weston (2008) showed that word embeddings yielded state-of-the-art systems on word-oriented tasks (POS, NER), but performance on phrase-oriented tasks, such as SRL, lags behind.

We propose a new method for compositional semantics that learns to compose word embeddings into phrases. In contrast to a common approach to phrase embeddings that uses pre-defined composition operators (Mitchell and Lapata, 2008), e.g., component-wise sum/multiplication, we learn composition functions that rely on phrase structure and context. Other work on learning compositions relies on matrices/tensors as transformations (Socher et al., 2011; Socher et al., 2013a; Hermann and Blunsom, 2013; Baroni and Zamparelli, 2010; Socher et al., 2012; Grefenstette et al., 2013). However, this work suffers from two primary disadvantages. First, these methods have high computational complexity for dense embeddings: O(d^2) or O(d^3) for composing every two components with d dimensions. The high computational complexity restricts these methods to very low-dimensional embeddings (25 or 50). While low-dimensional embeddings perform well for syntax (Socher et al., 2013a) and sentiment (Socher et al., 2013b) tasks, they do poorly on semantic tasks. Second, because of the complexity, they use supervised training with small task-specific datasets. An exception is the unsupervised objective of recursive auto-encoders (Socher et al., 2011). Yet this work cannot utilize contextual features of phrases and still poses scaling challenges.

In this work we propose a novel compositional transformation called the Feature-rich Compositional Transformation (FCT) model. FCT produces phrases from their word components. In contrast to previous work, our approach to phrase composition can efficiently utilize high dimensional embeddings (e.g. d = 200) with an unsupervised objective, both of which are critical to doing well on semantics tasks.
Our composition function is parameterized to allow the inclusion of features based on the phrase structure and contextual information, including positional indicators of the word components. The phrase composition is a weighted summation of embeddings of component words, where the summation weights are defined by the features, which allows for fast composition.

We discuss a range of training settings for FCT. For tasks with labeled data, we utilize task-specific training. We begin with embeddings trained on raw text and then learn compositional phrase parameters as well as fine-tune the embeddings for the specific task's objective. For tasks with unlabeled data (e.g. most semantic tasks) we can train on a large corpus of unlabeled data. For tasks with both labeled and unlabeled data, we consider a joint training scheme. Our model's efficiency ensures we can incorporate large amounts of unlabeled data, which helps mitigate over-fitting and increases vocabulary coverage.

We begin with a presentation of FCT (§2), including our proposed features for the model. We then present three training settings (§3) that cover language modeling (unsupervised), task-specific training (supervised), and joint (semi-supervised) settings. The remainder of the paper is devoted to evaluation of each of these settings.

2 Feature-rich Compositional Transformations from Words to Phrases

We learn transformations for composing phrase embeddings from the component words based on extracted features from a phrase, where we assume that the phrase boundaries are given. The resulting phrase embedding is based on a per-dimension weighted average of the component phrases. Consider the example of base noun phrases (NP), a common phrase type which we want to compose. Base NPs often have flat structures – all words modify the head noun – which means that our transformation should favor the head noun in the composed phrase embedding. For each of the N words w_i in phrase p we construct the embedding:

e_p = \sum_{i}^{N} \lambda_i \odot e_{w_i}    (1)

where e_{w_i} is the embedding for word i and \odot refers to point-wise product. \lambda_i is a weight vector that is constructed based on the features of p and the model parameters:

\lambda_{ij} = \sum_{k} \alpha_{jk} f_k(w_i, p) + b_{ij}    (2)

where f_k(w_i, p) is a feature function that considers word w_i in phrase p and b_{ij} is a bias term. This model is fast to train since it has only linear transformations: the only operations are vector summation and inner product. Therefore, we learn the model parameters \alpha together with the embeddings. We call this the Feature-rich Compositional Transformation (FCT) model.

Consider some example phrases and associated features. The phrase "the museum" should have an embedding nearly identical to "museum" since "the" has minimal impact on the phrase's meaning. This can be captured through part-of-speech (POS) tags, where a tag of DT on "the" will lead to \lambda_i \approx \vec{0}, removing its impact on the phrase embedding. In some cases, words will have specific behaviors. In the phrase "historic museum", the word "historic" should impact the phrase embedding to be closer to "landmark". To capture this behavior we add smoothed lexical features, where smoothing reduces data sparsity effects. These features can be based on word clusters, themselves induced from pre-trained word embeddings.

Our feature templates are shown in Table 1. Phrase boundaries, tags and heads are identified using existing parsers or from Annotated Gigaword (Napoles et al., 2012) as described in Section 5. In Eq. (1), we do not limit phrase structure, though the features in Table 1 tend to assume a flat structure. However, with additional features the model could handle longer phrases with hierarchical structures, and adding these features does not change our model or training objectives. Following the semantic tasks used for evaluation we experimented with base NPs (including both bigram NPs and longer ones). We leave explorations of features for complex structures to future work.

FCT has two sets of parameters: one is the feature weights (α, b), the other is word embeddings (e_w). We could directly use the word embeddings learned by neural language models. However, our experiments show that those word embeddings are often not suited for FCT. Therefore we propose to
Table 1: Feature templates. Simple Features / Compound Features; e.g., POS tags: t(w_{i−1}), t(w_i), t(w_{i+1}), …
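To make Eqs. (1)–(2) concrete, the sketch below composes a bigram NP embedding from per-word features and the parameter matrix α. This is a minimal illustration of ours, not the released implementation: it uses only a POS indicator feature rather than the full templates of Table 1, and all function and variable names are assumptions.

```python
import numpy as np

# Minimal sketch of FCT composition (Eqs. 1-2). The real model uses the richer
# feature templates of Table 1 (POS, head, word-cluster, compound features);
# here only a one-hot POS indicator per word is used. All names are ours.

def pos_features(pos_tag, feature_index):
    """A toy f(w_i, p): a one-hot POS indicator of length K."""
    f = np.zeros(len(feature_index))
    if "pos=" + pos_tag in feature_index:
        f[feature_index["pos=" + pos_tag]] = 1.0
    return f

def compose_phrase(words, pos_tags, embeddings, alpha, b, feature_index):
    """e_p = sum_i lambda_i (.) e_{w_i}, with lambda_i = alpha f(w_i, p) + b_i."""
    e_p = np.zeros(alpha.shape[0])
    for i, (w, tag) in enumerate(zip(words, pos_tags)):
        f = pos_features(tag, feature_index)   # K-dim feature vector f(w_i, p)
        lam = alpha.dot(f) + b[i]              # d-dim weight vector lambda_i (Eq. 2)
        e_p += lam * embeddings[w]             # point-wise product and sum (Eq. 1)
    return e_p

# Toy usage: after training, a DT determiner should receive near-zero weights,
# so "the museum" stays close to "museum".
d = 4
feature_index = {"pos=DT": 0, "pos=NN": 1, "pos=JJ": 2}
rng = np.random.default_rng(0)
embeddings = {"the": rng.normal(size=d), "museum": rng.normal(size=d)}
alpha = rng.normal(scale=0.1, size=(d, len(feature_index)))
b = np.zeros((2, d))
print(compose_phrase(["the", "museum"], ["DT", "NN"], embeddings, alpha, b, feature_index))
```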
Data Set | Input | Output: (1) PPDB: medicinal products → drugs; (2) SemEval 2013: …
baselines: we re-implemented the recursive neural network model (RNN) (Socher et al., 2013a) and the Dual VSM algorithm in Turney (2012)6 so that they can be trained on our dataset. We also include results for fine-tuning word embeddings in SUM and Weighted SUM with TASK-SPEC objectives, which demonstrate improvements over the corresponding methods without fine-tuning. As before, word embeddings are pre-trained with word2vec.

RNNs serve as another way to model the compositionality of bigrams. We run an RNN on bigrams and associated sub-trees, the same setting FCT uses, and train it on our TASK-SPEC objectives with the technique described in Section 3.4. As in Socher et al. (2013a), we refine the matrix W in Eq. (5) according to the POS tags of the component words.7 For example, for a bigram NP like new/ADJ trial/NN, we use a matrix W_{ADJ-NN} to transform the two word embeddings to the phrase embedding. In the experiments we have 60 different matrices in total for bigram NPs. The number is larger than that in Socher et al. (2013a) due to incorrect tags in automatic parses.

Since the RNN model has time complexity O(n^2), we compare RNNs with different sized embeddings. The first one uses embeddings with 50 dimensions, which has the same size as the embeddings used in Socher et al. (2013a), and has similar complexity to our model with 200 dimension embeddings. The second model uses the same 200 dimension embeddings as our model but is significantly more computationally expensive. For all models, we normalize the embeddings so that the L-2 norm equals 1, which is important in measuring semantic similarity via inner product.

6.1 Results: Bigram Phrases

PPDB. Our first task is to measure phrase similarity on PPDB. Training uses the TASK-SPEC objective (Eq. (4) with NCE training) where data are phrase-word pairs. The goal is to select B from a set of candidates given A, where pair similarity is measured using inner product. We use candidate sets of size 1k/10k/100k from the most frequent N words in NYT and report mean reciprocal rank (MRR). We report results with the baseline methods (SUM, Weighted SUM, RNN). For FCT we report training with the TASK-SPEC objective, the joint objective (FCT-J) and the pipeline approach (FCT-P). To ensure that the TASK-SPEC objective has a stronger influence in FCT-Joint, we weighted each training instance of LM by 0.01, which is equivalent to setting the learning rate of the LM objective equal to η/100 and that of the TASK-SPEC objective as η. Training makes the same number of passes with the same learning rate as training with the TASK-SPEC objective only. For each method we report results with and without fine-tuning the word embeddings on the labeled data. We run FCT on the PPDB training data for 5 epochs with learning rate η = 0.05, which are both selected from the development set.

6 We did not include results for a holistic model as in Turney (2012), since most of the phrases (especially those in PPDB) in our experiments are common phrases, making the vocabulary too large to train. One solution would be to only train holistic embeddings for phrases in the test data, but examination of a test set before training is not a realistic assumption.

7 We do not compare the performance between using a single matrix and several matrices since, as discussed in Socher et al. (2013a), Ws refined with POS tags work much better than using a single W. That also supports the argument in this paper, that it is important to determine the transformation with more features.

Figure 1: Performance on the PPDB task (test set). (a) MRR of models with fixed word embeddings; (b) MRR of models with fine-tuning. x-axis: candidate vocabulary size (10^3 to 10^5); y-axis: MRR on test set (%); curves: SUM, RNN50, RNN200, FCT (and FCT-pipeline, FCT-joint in (b)).
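The MRR numbers in Fig. 1 and Table 6 follow the protocol just described: L2-normalized embeddings, inner-product similarity against a candidate set, mean reciprocal rank of the gold word. Below is a small sketch of that evaluation; it is our own illustration, not the released evaluation code, and the candidate-set construction and data layout are assumptions.

```python
import numpy as np

# Sketch of the ranking evaluation: L2-normalize embeddings, score
# phrase-word pairs by inner product, and report mean reciprocal rank (MRR)
# over a candidate set. Function names and data layout are assumptions.

def normalize_rows(M):
    return M / np.linalg.norm(M, axis=1, keepdims=True)

def mean_reciprocal_rank(phrase_embs, gold_ids, candidate_embs):
    """phrase_embs: (Q, d); candidate_embs: (C, d); gold_ids[q] indexes the
    candidate that should rank first for query q."""
    P = normalize_rows(phrase_embs)
    C = normalize_rows(candidate_embs)
    scores = P @ C.T                                  # inner-product similarities
    gold_scores = scores[np.arange(len(gold_ids)), gold_ids][:, None]
    ranks = 1 + (scores > gold_scores).sum(axis=1)    # 1 = gold ranked first
    return float(np.mean(1.0 / ranks))

# Toy usage with a 10k-word candidate set.
rng = np.random.default_rng(1)
d, C, Q = 50, 10000, 5
cands = rng.normal(size=(C, d))
gold = rng.integers(0, C, size=Q)
queries = cands[gold] + 0.1 * rng.normal(size=(Q, d))  # queries near their gold words
print("MRR@10k:", mean_reciprocal_rank(queries, gold, cands))
```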
Fig. 1 shows the overall MRR results on different candidate vocabulary sizes (1k, 10k and 100k), and Table 6 highlights the results on the vocabulary using the top 10k words.

Table 6: Performance on the PPDB task (test data).
Model | Objective | Fine-tuning Word Emb | MRR@10k
SUM | – | – | 41.19
SUM | TASK-SPEC | Y | 45.01
WSum | TASK-SPEC | Y | 45.43
RNN50 | TASK-SPEC | N | 37.81
RNN50 | TASK-SPEC | Y | 39.25
RNN200 | TASK-SPEC | N | 41.13
RNN200 | TASK-SPEC | Y | 40.50
FCT | TASK-SPEC | N | 41.96
FCT | TASK-SPEC | Y | 46.99
FCT | LM | Y | 42.63
FCT-P | TASK-SPEC+LM | Y | 49.44
FCT-J | TASK-SPEC+LM | joint | 51.65

Overall, FCT with TASK-SPEC training improves over all the baseline methods in each setting. Fine-tuning word embeddings improves all methods except RNN (d = 200). We note that the RNN performs poorly, possibly because it uses a complex transformation from word embeddings to phrase embeddings, making the learned transformation difficult to generalize well to new phrases and words when the task-specific labeled data is small. As a result, there is no guarantee of comparability between new pairs of phrase and word embeddings. The phrase embeddings may end up in a different part of the subspace from the word embeddings.

Compared to SUM and Weighted SUM, FCT is capable of using features providing critical contextual information, which is the source of FCT's improvement. Additionally, since the RNNs also used POS tags and parsing information yet achieved lower scores than FCT, our results show that FCT more effectively uses these features. To better show this advantage, we train FCT models with only POS tag features, which achieve 46.37/41.20 on MRR@10k with/without fine-tuning word embeddings, still better than RNNs. See Section 6.3 for a full ablation study of the features in Table 1.

Semi-supervised Results: Table 6 also highlights the improvement from semi-supervised learning. First, the fully unsupervised method (LM) improves over SUM, showing that improvements in language modeling carry over to semantic similarity tasks. This correlation between the LM objective and the target task ensures the success of semi-supervised training. As a result, both semi-supervised methods, FCT-J and FCT-P, improve over the supervised methods, and FCT-J achieves the best results of all methods, including FCT-P. This demonstrates the effectiveness of including large amounts of unlabeled data while learning with a TASK-SPEC objective. We believe that by adding the LM objective, we can propagate the semantic information of embeddings to the words that do not appear in the labeled data (see the differences between vocabulary sizes in Table 2).

The improvement of FCT-J over FCT-P also indicates that the joint training strategy can be more effective than the traditional pipeline-based pre-training. As discussed in Section 3.3, the pipeline method, although commonly used in the deep learning literature, does not suit NLP applications well because of the sparsity in word embeddings. Therefore, our results suggest an alternative solution to a wide range of NLP problems where labeled data has low coverage of the vocabulary. For future work, we will further investigate the idea of joint training on more tasks and compare with the pipeline method.

Results on SemEval 2013 and Turney 2012. We evaluate the same methods on SemEval 2013 and the Turney 2012 5- and 10-choice tasks, which both provide training and test splits. The same baselines as in the PPDB experiments, as well as the Dual Space method of Turney (2012) and the recursive auto-encoder (RAE) from Socher et al. (2011), are used for comparison. Since the tasks did not provide any development data, we used cross-validation (5 folds) for tuning the parameters, and finally set the training epochs to be 20 and η = 0.01. For joint training, the weight of the LM objective is 0.005 (i.e. with a learning rate equal to 0.005η) since the training sets for these two tasks are much smaller. For convenience, we also include results for Dual Space as reported in Turney (2012), though they are not comparable here since Turney (2012) used a much larger training set.
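Before turning to Table 7, the sketch below illustrates the joint (TASK-SPEC + LM) training used for FCT-J, where each LM instance is down-weighted (0.01 on PPDB, 0.005 on these smaller tasks), which amounts to scaling the LM learning rate. This is a minimal SGD-style sketch of ours under that reading; the gradient functions are placeholders, not the paper's actual NCE/ranking objectives.

```python
import numpy as np

# Sketch of the joint (semi-supervised) training: TASK-SPEC updates use
# learning rate eta, while each LM instance is down-weighted by lm_weight,
# i.e. an effective LM learning rate of eta * lm_weight. The gradient
# functions below are placeholders of ours, not the paper's objectives.

def grad_task_spec(params, labeled_example):
    return {k: np.zeros_like(v) for k, v in params.items()}   # placeholder

def grad_lm(params, unlabeled_ngram):
    return {k: np.zeros_like(v) for k, v in params.items()}   # placeholder

def joint_sgd(params, labeled_data, unlabeled_data, eta=0.05, lm_weight=0.01, epochs=5):
    """Interleave TASK-SPEC and LM stochastic updates, LM term down-weighted."""
    for _ in range(epochs):
        for labeled, unlabeled in zip(labeled_data, unlabeled_data):
            g = grad_task_spec(params, labeled)
            for k in params:
                params[k] -= eta * g[k]                        # TASK-SPEC step, rate eta
            g = grad_lm(params, unlabeled)
            for k in params:
                params[k] -= eta * lm_weight * g[k]            # LM step, rate eta/100 when lm_weight=0.01
    return params
```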
Table 7: Performance on SemEval 2013 and Turney 2012 semantic similarity tasks. DualSpace1: our reimplementation of the method in Turney (2012). DualSpace2: the result reported in Turney (2012). RAE is the recursive auto-encoder in Socher et al. (2011), which is trained with the reconstruction-based objective of the auto-encoder.

Model | Objective | Fine-tuning Word Emb | SemEval2013 Test | Turney2012 Acc(5) | Turney2012 Acc(10) | Turney2012 MRR@10k
SUM | – | – | 65.46 | 39.58 | 19.79 | 12.00
SUM | TASK-SPEC | Y | 67.93 | 48.15 | 24.07 | 14.32
WeightedSum | TASK-SPEC | Y | 69.51 | 52.55 | 26.16 | 14.74
RNN (d=50) | TASK-SPEC | N | 67.20 | 39.64 | 25.35 | 1.39
RNN (d=50) | TASK-SPEC | Y | 70.36 | 41.96 | 27.20 | 1.46
RNN (d=200) | TASK-SPEC | N | 71.50 | 40.95 | 27.20 | 3.89
RNN (d=200) | TASK-SPEC | Y | 72.22 | 42.84 | 29.98 | 4.03
DualSpace1 | – | – | 52.47 | 27.55 | 16.36 | 2.22
DualSpace2 | – | – | – | 58.3 | 41.5 | –
RAE | auto-encoder | – | 51.75 | 22.99 | 14.81 | 0.16
FCT | TASK-SPEC | N | 68.84 | 41.90 | 33.80 | 8.50
FCT | TASK-SPEC | Y | 70.36 | 52.31 | 38.66 | 13.19
FCT | LM | – | 67.22 | 42.59 | 27.55 | 14.07
FCT-P | TASK-SPEC+LM | Y | 70.64 | 53.09 | 39.12 | 14.17
FCT-J | TASK-SPEC+LM | joint | 70.65 | 53.31 | 39.12 | 14.25

Table 7 shows similar trends as PPDB. One difference here is that RNNs do better with 200 dimensional embeddings on SemEval 2013, though at a dimensionality with similar computational complexity to FCT (d = 50), FCT improves. Additionally, on the 10-choice task of Turney 2012, both the FCT and the RNN models, either with or without fine-tuning word embeddings, significantly outperform SUM, showing that both models capture the word order information. Fine-tuning gives smaller gains on RNNs, likely because the limited number of training examples is insufficient for the complex RNN model. The LM objective leads to improvements on all three tasks, while RAE does not perform significantly better than random guessing. These results are perhaps attributable to the lack of assumptions in the objective about the relations between word embeddings and phrase embeddings, making the learned phrase embeddings not comparable to word embeddings.

6.2 Dimensionality and Complexity

A benefit of FCT is that it is computationally efficient, allowing it to easily scale to embeddings of 200 dimensions. By contrast, RNN models typically use smaller sized embeddings (d = 25 proved best in Socher et al., 2013a) and cannot scale up to large datasets when larger dimensionality embeddings are used. For example, when training on the PPDB data, FCT with d = 200 processes 2.33 instances per ms, while the RNN with the same dimensionality processes 0.31 instances/ms. Training an RNN with d = 50 is of comparable speed to FCT with d = 200. Figure 2(a-b) shows the MRR on PPDB for 1k and 10k candidate sets for both the SUM baseline and FCT with a TASK-SPEC objective and full features, as compared to RNNs with different sized embeddings. Both FCT and RNN use fine-tuned embeddings. With a small number of embedding dimensions, RNNs achieve better results. However, FCT can scale to much higher dimensionality embeddings, which easily surpasses the results of RNNs. This is especially important when learning a large number of embeddings: the 25-dimensional space may not be sufficient to capture the semantic diversity, as evidenced by the poor performance of RNNs with lower dimensionality.

Similar trends observed on the PPDB data also appear on the tasks of Turney 2012 and SemEval 2013. Figure 2(c-f) shows the performances on these two tasks. On the Turney 2012 task, FCT even outperforms the RNN model using embeddings with the same dimensionality. One possible reason is overfitting of the more complex RNN models on these small training sets. Figure 2(d) shows that the performances of FCT on the 10-choice task are less affected by the dimensions of embeddings. That is because the composition models can well handle the word order information,
Figure 2: Effects of embedding dimension on the semantic similarity tasks. x-axis: dimension of embeddings (0–500), with RNN points at d = 25/50/200; curves: SUM, FCT, RNN. Panels: (a) MRR@1k on PPDB dev set; (b) MRR@10k on PPDB dev set; (c) accuracy on the 5-choice task in Turney 2012; (d) accuracy on the 10-choice task in Turney 2012; (e) MRR@10k on Turney 2012; (f) accuracy on SemEval 2013.
Table 9: Ablation study on dev set of the PPDB ngram-to-ngram task (MRR@10k).
Feature Set | MRR@10k
FCT | 79.68
-clus | 76.82
-POS | 77.67
-Compound | 79.40
-Head | 77.50
-Distance | 78.86
WSum | 75.37
SUM | 74.30

by the quality of single word semantics. Therefore, we expect larger gains from FCT on tasks where single word embeddings are less important, such as relation extraction (long distance dependencies) and question understanding (intentions are largely dependent on interrogatives).

Finally, we demonstrate the efficacy of different features in FCT (Table 1) with an ablation study (Table 9). Word cluster features contribute most, because the point-wise product between a word embedding and its context word cluster representation is actually an approximation of the word-word interaction, which is believed important for phrase compositions. Head features, though few, also make a big difference, reflecting the importance of syntactic information. Compound features do not have much of an impact, possibly because the simpler features capture enough information.

7 Related Work

Compositional semantic models aim to build distributional representations of a phrase from its component word representations. A traditional approach for composition is to form a point-wise combination of single word representations with compositional operators either pre-defined (e.g. element-wise sum/multiplication) or learned from data (Le and Mikolov, 2014). However, these approaches ignore the inner structure of phrases, e.g. the order of words in a phrase and its syntactic tree, and the point-wise operations are usually less expressive. One solution is to apply a matrix transformation (possibly followed by a non-linear transformation) to the concatenation of component word representations (Zanzotto et al., 2010). For longer phrases, matrix multiplication can be applied recursively according to the associated syntactic trees (Socher et al., 2010). However, because the input of the model is the concatenation of word representations, matrix transformations cannot capture interactions between a word and its contexts, or between component words.

There are three ways to restore these interactions. The first is to use word-specific/tensor transformations to force the interactions between component words in a phrase. In these methods, word-specific transformations, which are usually matrices, are learned for a subset of words according to their syntactic properties (e.g. POS tags) (Baroni and Zamparelli, 2010; Socher et al., 2012; Grefenstette et al., 2013; Erk, 2013). Composition between a word in this subset and another word becomes the multiplication between the matrix associated with one word and the embedding of the other, producing a new embedding for the phrase. Using one tensor (not word-specific) to compose two embedding vectors, which has not been tested on phrase similarity tasks (Bordes et al., 2014; Socher et al., 2013b), is a special case of this approach, where a "word-specific transformation matrix" is derived by multiplying the tensor and the word embedding. Additionally, word-specific matrices can only capture the interaction between a word and one of its context words; others have considered extensions to multiple words (Grefenstette et al., 2013; Dinu and Baroni, 2014). The primary drawback of these approaches is the high computational complexity, limiting their usefulness for semantics (Section 6.2).

A second approach draws on the concept of contextualization (Erk and Padó, 2008; Dinu and Lapata, 2010; Thater et al., 2011), which sums embeddings of multiple words in a linear combination. For example, Cheung and Penn (2013) apply contextualization to word compositions in a generative event extraction model. However, this is an indirect way to capture interactions (the transformations are still unaware of interactions between components), and thus has not been a popular choice for composition.

The third approach is to refine word-independent compositional transformations with annotation features. FCT falls under this approach. The primary advantage is that composition can rely on richer linguistic features from the context.
While the embeddings of component words still cannot interact, they can interact with other information (i.e. features) of their context words, and even the global features. Recent research has created novel features based on combining word embeddings and contextual information (Nguyen and Grishman, 2014; Roth and Woodsend, 2014; Kiros et al., 2014; Yu et al., 2014; Yu et al., 2015). Yu et al. (2015) further proposed converting the contextual features into a hidden layer called feature embeddings, which is similar to the α matrix in this paper. Examples of applications to phrase semantics include Socher et al. (2013a) and Hermann and Blunsom (2013), who enhanced RNNs by refining the transformation matrices with phrase types and CCG supertags. However, these models are only able to use limited information (usually one property for each compositional transformation), whereas FCT exploits multiple features.

Finally, our work is related to recent work on low-rank tensor approximations. When we use the phrase embedding e_p in Eq. (1) to predict a label y, the score of y given phrase p will be s(y, p) = U_y^\top e_p = \sum_{i}^{N} U_y^\top (\lambda_i \odot e_{w_i}) in log-linear models, where U_y is the parameter vector for y. This is equivalent to using a parameter tensor T to evaluate the score with s'(y, p) = \sum_{i}^{N} T \times_1 y \times_2 f(w_i, p) \times_3 e_{w_i}, while forcing the tensor to have a low-rank form as T \approx U \otimes \alpha \otimes e_w. Here \times_k indicates tensor multiplication of the kth view, and \otimes indicates matrix outer product (Kolda and Bader, 2009). From this point of view, our work is closely related to the discriminative training methods for low-rank tensors in NLP (Cao and Khudanpur, 2014; Lei et al., 2014), while it can handle more complex ngram-to-ngram tasks, where the label y also has its embedding composed from basic word embeddings. Therefore our model can capture the above work as special cases. Moreover, we have a different method of decomposing the inputs, which results in views of lexical parts and non-lexical features. As we show in this paper, this input decomposition allows us to benefit from pre-trained word embeddings and feature weights.

8 Conclusion

We have presented FCT, a new composition model for deriving phrase embeddings from word embeddings. Compared to existing phrase composition models, FCT is very efficient and can utilize high dimensional word embeddings, which are crucial for semantic similarity tasks. We have demonstrated how FCT can be utilized in a language modeling setting, as well as tuned with task-specific data. Fine-tuning embeddings on task-specific data can further improve FCT, but combining both LM and TASK-SPEC objectives yields the best results. We have demonstrated improvements on both language modeling and several semantic similarity tasks. Our implementation and datasets are publicly available.8

While our results demonstrate improvements for longer phrases, we still only focus on flat phrase structures. In future work we plan to combine FCT with the idea of recursively building representations. This would allow the utilization of hierarchical structure while restricting compositions to a small number of components.

Acknowledgments

We thank Matthew R. Gormley for his input and anonymous reviewers for their comments. Mo Yu is supported by the China Scholarship Council and by NSFC 61173073.

8 https://github.com/Gorov/FCT_PhraseSim_TACL

References

Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Empirical Methods in Natural Language Processing (EMNLP), pages 1183–1193.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. The Journal of Machine Learning Research (JMLR), 3:1137–1155.

Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. 2014. A semantic matching energy function for learning with multi-relational data. Machine Learning, 94(2):233–259.

Yuan Cao and Sanjeev Khudanpur. 2014. Online learning in tensor space. In Association for Computational Linguistics (ACL), pages 666–675.

Jackie Chi Kit Cheung and Gerald Penn. 2013. Probabilistic domain modelling with contextualized distributional semantic vectors. In Association for Computational Linguistics (ACL), pages 392–401.
Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In International Conference on Machine Learning (ICML), pages 160–167.

Ronan Collobert. 2011. Deep learning for efficient discriminative parsing. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 224–232.

Georgiana Dinu and Marco Baroni. 2014. How to make words with vectors: Phrase generation in distributional semantics. In Association for Computational Linguistics (ACL), pages 624–633.

Georgiana Dinu and Mirella Lapata. 2010. Measuring distributional similarity in context. In Empirical Methods in Natural Language Processing (EMNLP), pages 1162–1172.

Katrin Erk and Sebastian Padó. 2008. A structured vector space model for word meaning in context. In Empirical Methods in Natural Language Processing (EMNLP), pages 897–906.

Katrin Erk. 2013. Towards a semantics for distributional representations. In International Conference on Computational Semantics (IWCS 2013), pages 95–106.

Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The paraphrase database. In North American Chapter of the Association for Computational Linguistics (NAACL), pages 758–764.

Edward Grefenstette, Georgiana Dinu, Yao-Zhong Zhang, Mehrnoosh Sadrzadeh, and Marco Baroni. 2013. Multi-step regression learning for compositional distributional semantics. arXiv:1301.6939.

Karl Moritz Hermann and Phil Blunsom. 2013. The role of syntax in vector space models of compositional semantics. In Association for Computational Linguistics (ACL), pages 894–904.

Karl Moritz Hermann, Dipanjan Das, Jason Weston, and Kuzman Ganchev. 2014. Semantic frame identification with distributed word representations. In Association for Computational Linguistics (ACL), pages 1448–1458.

Eric H Huang, Richard Socher, Christopher D Manning, and Andrew Y Ng. 2012. Improving word representations via global context and multiple word prototypes. In Association for Computational Linguistics (ACL), pages 873–882.

Ryan Kiros, Richard Zemel, and Ruslan R Salakhutdinov. 2014. A multiplicative model for learning distributed text-based attribute representations. In Advances in Neural Information Processing Systems (NIPS), pages 2348–2356.

Tamara G Kolda and Brett W Bader. 2009. Tensor decompositions and applications. SIAM Review, 51(3):455–500.

Ioannis Korkontzelos, Torsten Zesch, Fabio Massimo Zanzotto, and Chris Biemann. 2013. SemEval-2013 Task 5: Evaluating phrasal semantics. In Joint Conference on Lexical and Computational Semantics (*SEM), pages 39–47.

Quoc V Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053.

Tao Lei, Yu Xin, Yuan Zhang, Regina Barzilay, and Tommi Jaakkola. 2014. Low-rank tensors for scoring dependency structures. In Association for Computational Linguistics (ACL), pages 1381–1391.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013b. Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546.

Jeff Mitchell and Mirella Lapata. 2008. Vector-based models of semantic composition. In Association for Computational Linguistics (ACL), pages 236–244.

Jeff Mitchell and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science, 34(8):1388–1429.

Courtney Napoles, Matthew Gormley, and Benjamin Van Durme. 2012. Annotated Gigaword. In ACL Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction, pages 95–100.

Thien Huu Nguyen and Ralph Grishman. 2014. Employing word representations and regularization for domain adaptation of relation extraction. In Association for Computational Linguistics (ACL), pages 68–74.

Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2011. English Gigaword fifth edition, June. Linguistic Data Consortium, LDC2011T07.

Michael Roth and Kristian Woodsend. 2014. Composition of word representations improves semantic role labelling. In Empirical Methods in Natural Language Processing (EMNLP), pages 407–413.

Richard Socher, Christopher D Manning, and Andrew Y Ng. 2010. Learning continuous phrase representations and syntactic parsing with recursive neural networks. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, pages 1–9.

Richard Socher, Jeffrey Pennington, Eric H Huang, Andrew Y Ng, and Christopher D Manning. 2011. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Empirical Methods in Natural Language Processing (EMNLP), pages 151–161.
Richard Socher, Brody Huval, Christopher D Manning, and Andrew Y Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Empirical Methods in Natural Language Processing (EMNLP), pages 1201–1211.

Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013a. Parsing with compositional vector grammars. In Association for Computational Linguistics (ACL), pages 455–465.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013b. Recursive deep models for semantic compositionality over a sentiment treebank. In Empirical Methods in Natural Language Processing (EMNLP), pages 1631–1642.

Stefan Thater, Hagen Fürstenau, and Manfred Pinkal. 2011. Word meaning in context: A simple and effective vector model. In International Joint Conference on Natural Language Processing (IJCNLP), pages 1134–1143.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Association for Computational Linguistics (ACL), pages 384–394.

Peter D Turney. 2012. Domain and function: A dual-space model of semantic relations and compositions. Journal of Artificial Intelligence Research (JAIR), 44:533–585.

Mo Yu and Mark Dredze. 2014. Improving lexical embeddings with semantic knowledge. In Association for Computational Linguistics (ACL), pages 545–550.

Mo Yu, Matthew Gormley, and Mark Dredze. 2014. Factor-based compositional embedding models. In NIPS Workshop on Learning Semantics.

Mo Yu, Matthew R. Gormley, and Mark Dredze. 2015. Combining word embeddings and feature embeddings for fine-grained relation extraction. In North American Chapter of the Association for Computational Linguistics (NAACL).

Fabio Massimo Zanzotto, Ioannis Korkontzelos, Francesca Fallucchi, and Suresh Manandhar. 2010. Estimating linear models for compositional distributional semantics. In International Conference on Computational Linguistics (COLING), pages 1263–1271.