Transactions of the Association for Computational Linguistics, vol. 4, pp. 445–461, 2016. Action Editor: Noah Smith.
Submission batch: 11/2015; Revision batch: 2/2016; Published 8/2016.
© 2016 Association for Computational Linguistics. Distributed under a CC-BY 4.0 Licence.
Easy-First Dependency Parsing with Hierarchical Tree LSTMs

Eliyahu Kiperwasser
Computer Science Department
Bar-Ilan University
Ramat-Gan, Israel
elikip@gmail.com

Yoav Goldberg
Computer Science Department
Bar-Ilan University
Ramat-Gan, Israel
yoav.goldberg@gmail.com

Abstract

We suggest a compositional vector representation of parse trees that relies on a recursive combination of recurrent-neural network encoders. To demonstrate its effectiveness, we use the representation as the backbone of a greedy, bottom-up dependency parser, achieving very strong accuracies for English and Chinese, without relying on external word embeddings. The parser's implementation is available for download at the first author's webpage.

1 Introduction

Dependency-based syntactic representations of sentences are central to many language processing tasks (Kübler et al., 2009). Dependency parse-trees encode not only the syntactic structure of a sentence but also many aspects of its semantics.

A recent trend in NLP is concerned with encoding sentences as vectors ("sentence embeddings"), which can then be used for further prediction tasks. Recurrent neural networks (RNNs) (Elman, 1990), and in particular methods based on the LSTM architecture (Hochreiter and Schmidhuber, 1997), work very well for modeling sequences, and consistently obtain state-of-the-art results on both language-modeling and prediction tasks (see, e.g., Mikolov et al. (2010)).

Several works attempt to extend recurrent neural networks to work on trees (see Section 8 for a brief overview), giving rise to the so-called recursive neural networks (Goller and Kuchler, 1996; Socher et al., 2010). However, recursive neural networks do not cope well with trees with arbitrary branching factors: most works require the encoded trees to be binary-branching, or to have a fixed maximum arity. Other attempts allow arbitrary branching factors, at the expense of ignoring the order of the modifiers.

In contrast, we propose a tree encoding that naturally supports trees with arbitrary branching factors, making it particularly appealing for dependency trees. Our tree encoder uses recurrent neural networks as a building block: we model the left and right sequences of modifiers using RNNs, which are composed in a recursive manner to form a tree (Section 3).

We use our tree representation for encoding the partially-built parse trees in a greedy, bottom-up dependency parser which is based on the easy-first transition-system of Goldberg and Elhadad (2010). Using the Hierarchical Tree LSTM representation, and without using any external embeddings, our parser achieves parsing accuracies of 92.6 UAS and 90.2 LAS on the PTB (Stanford dependencies) and 86.1 UAS and 84.4 LAS on the Chinese treebank, while relying on greedy decoding.

To the best of our knowledge, this is the first work to demonstrate competitive parsing accuracies for full-scale parsing while relying solely on recursive, compositional tree representations, and without using a reranking framework. We discuss related work in Section 8.

While the parsing experiments demonstrate the suitability of our representation for capturing the structural elements in the parse tree that are useful for predicting parsing decisions, we are interested in exploring the use of the RNN-based compositional vector representation of parse trees also for
semantic tasks such as sentiment analysis (Socher et al., 2013b; Tai et al., 2015), sentence similarity judgements (Marelli et al., 2014) and textual entailment (Bowman et al., 2015).

2 Background and Notation

2.1 Dependency-based Representation

A dependency-based syntactic representation is centered around syntactic modification relations between head words and modifier words. The result is a tree in which each node is a word in the sentence, and every node except for one designated root node has a parent node. A dependency tree over a sentence with n words w_1,...,w_n can be represented as a list of n pairs of the form (h, m), where 0 ≤ h ≤ n and 1 ≤ m ≤ n. Each such pair represents an edge in the tree in which h is the index of a head word (including the special ROOT node 0), and m is the index of a modifier word. In order for the dependency trees to be useful for actual downstream language processing tasks, each edge is labeled with a syntactic relation. The tree representation then becomes a list of triplets (h, m, ℓ), where 1 ≤ ℓ ≤ L is the index of a dependency relation out of a designated set of L syntactic relations.

Dependency trees tend to be relatively shallow, with some nodes having many children. Looking at trees in the PTB training set we find that 94% of the trees have a height of at most 10, and 49% of the trees a height of at most 6. In terms of width, 93% of the trees have at least one node with an arity of 4 or more, and 56% of the trees have at least one node with an arity of 6 or more.

2.2 Recurrent Networks and LSTMs

Recurrent neural networks (RNNs), first proposed by Elman (1990), are statistical learners for modeling sequential data. In this work, we use the RNN abstraction as a building block, and recursively combine several RNNs to obtain our tree representation. We briefly describe the RNN abstraction below. For further detail on RNNs, the reader is referred to sources such as Goldberg (2015), Bengio and Courville (2016), and Cho (2015).

The RNN abstraction is a function RNN that takes in a sequence of input vectors x_1,...,x_n (x_i ∈ R^{d_in}), and produces a sequence of state vectors (also called output vectors) y_1,...,y_n (y_i ∈ R^{d_out}). Each y_i is conditioned on all the inputs x_1,...,x_i preceding it. Ignoring the intermediate outputs y_1,...,y_{n−1}, the RNN can be thought of as encoding the sequence x_1,...,x_n into a final state y_n. Our notation in this paper follows this view. The RNN is defined recursively using two functions:[1]

  RNN(s_0, x_1, ..., x_n) = y_n = O(s_n)
  s_i = N(s_{i−1}, x_i)

Here, a function N takes as input a vector x_i and a state vector s_{i−1} and returns as output a new state s_i. One can then extract an output vector y_i from s_i using the function O (the function O is usually the identity function, or a function that returns a subset of the elements in s_i).

[1] We follow the notation of Goldberg (2015), with the exception of taking the output of the RNN to be a single vector rather than a sequence, and renaming R to N.

Taking an algorithmic perspective, one can view the RNN as a state object with three operations: s = RNN.initial() returns a new initial state, s.advance(x) takes an input vector and returns a new state, and s.output() returns the output vector for the current state. When clear from the context, we abbreviate and use the state's name (s) instead of s.output() to refer to the output vector at the state.

The functions N and O defining the RNN are parameterized by parameters θ (matrices and vectors), which are trained from data. Specifically, one is usually interested in using some of the outputs y_i for making predictions. The RNN is trained such that the encoding y_i is good for the prediction task. That is, the RNN learns which aspects of the sequence x_1,...,x_i are informative for the prediction.

We use subscripts (i.e. RNN_L, RNN_R) to indicate different RNNs, that is, RNNs that have different sets of parameters.

Specific instantiations of N and O yield different recurrent network mechanisms. In this work we use the Long Short-Term Memory (LSTM) variant (Hochreiter and Schmidhuber, 1997), which is shown to be a very capable sequence learner. However, our algorithm and encoding method do not rely on any specific property of the LSTM architecture, and the LSTM can be transparently switched for any other RNN variant.
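To make the state-object abstraction concrete, the following is a minimal Python sketch that uses a plain Elman-style transition in place of the LSTM cells used in this paper; the names SimpleRNN and encode_sequence are illustrative and are not part of the paper's implementation.

import numpy as np

class SimpleRNN:
    """Toy RNN exposing the interface described above: initial(), advance(), output()."""
    def __init__(self, d_in, d_out, seed=0):
        rng = np.random.default_rng(seed)
        # Parameters theta: input-to-state, state-to-state, bias.
        self.W_x = rng.normal(scale=0.1, size=(d_out, d_in))
        self.W_s = rng.normal(scale=0.1, size=(d_out, d_out))
        self.b = np.zeros(d_out)
        self.d_out = d_out

    def initial(self):
        # A fresh zero state s_0.
        return np.zeros(self.d_out)

    def advance(self, s_prev, x):
        # N(s_{i-1}, x_i): compute the next state.
        return np.tanh(self.W_x @ x + self.W_s @ s_prev + self.b)

    def output(self, s):
        # O(s_i): here simply the identity function.
        return s

def encode_sequence(rnn, xs):
    # Encode x_1,...,x_n into the final output y_n = O(s_n).
    s = rnn.initial()
    for x in xs:
        s = rnn.advance(s, x)
    return rnn.output(s)

# Usage: encode a sequence of three 4-dimensional vectors into an 8-dimensional state.
rnn = SimpleRNN(d_in=4, d_out=8)
y_n = encode_sequence(rnn, [np.ones(4), np.zeros(4), np.ones(4)])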
3 Tree Representation

We now describe our method for representing a tree as a d-dimensional vector. We assume trees in which the children are ordered and there are k_l ≥ 0 children before the parent node (left children) and k_r ≥ 0 children after it (right children). Such trees correspond well to dependency tree structures. We refer to the parent node as a head, and to its children as modifiers. For a node t, we refer to its left modifiers as t.l_1, t.l_2, ..., t.l_{k_l} and its right modifiers as t.r_1, t.r_2, ..., t.r_{k_r}. The indices of the modifiers are always from the parent outward, that is, t.l_1 is the left modifier closest to the head t.

[Inline figure omitted: a head node t with left modifiers t.l_1, t.l_2, t.l_3 and right modifiers t.r_1, t.r_2, t.r_3, t.r_4.]

The gist of the idea is to treat the modifiers of a node as a sequence, and encode this sequence using an RNN. We separate left-modifiers from right-modifiers, and use two RNNs: the first RNN encodes the sequence of left-modifiers from the head outwards, and the second RNN the sequence of right-modifiers from the head outwards. The first input to each RNN is the vector representation of the head word, and the last input is the vector representation of the left-most or the right-most modifier. The node's representation is then a concatenation of the RNN encoding of the left-modifiers with the RNN encoding of the right-modifiers. The encoding is recursive: the representation for each of the modifier nodes is computed in a similar fashion.

[Inline figure omitted: enc(t) is obtained by running RNN_L over t, t.l_1, ..., t.l_3 and RNN_R over t, t.r_1, ..., t.r_4, then concatenating and compressing the two end states.]

More formally, consider a node t. Let i(t) be the sentence index of the word corresponding to the head node t, and let v_i be a vector corresponding to the ith word in the sentence (this vector captures information such as the word form and its part of speech tag, and will be discussed shortly). The vector encoding of a node enc(t) ∈ R^{d_enc} is then defined as follows:

  enc(t) = g(W_e · (e_l(t) ◦ e_r(t)) + b_e)
  e_l(t) = RNN_L(v_{i(t)}, enc(t.l_1), ..., enc(t.l_{k_l}))
  e_r(t) = RNN_R(v_{i(t)}, enc(t.r_1), ..., enc(t.r_{k_r}))

First, the sequences consisting of the head-vector v_{i(t)} followed by the left-modifiers and the head-vector followed by the right-modifiers are encoded using two RNNs, RNN_L and RNN_R, resulting in RNN states e_l(t) ∈ R^{d_out} and e_r(t) ∈ R^{d_out}. Then, the RNN states are concatenated, resulting in a 2·d_out-dimensional vector (e_l(t) ◦ e_r(t)), which is reduced back to d dimensions using a linear transformation followed by a non-linear activation function g. The recursion stops at leaf nodes, for which:

  enc(leaf) = g(W_e · (e_l(leaf) ◦ e_r(leaf)) + b_e)
  e_l(leaf) = RNN_L(v_{i(leaf)})
  e_r(leaf) = RNN_R(v_{i(leaf)})

Figure 1 shows the network used for encoding the sentence "the black fox who really likes apples did not jump over a lazy dog yesterday".

3.1 Representing words

In the discussion above we assume a vector representation v_i associated with the ith sentence word. What does v_i look like? A sensible approach would be to take v_i to be a function of the word-form and the part-of-speech (POS) tag of the ith word, that is:

  v_i = g(W_v · (w_i ◦ p_i) + b_v)

where w_i and p_i are the embedded vectors of the word-form and POS-tag of the ith word.

This encodes each word in isolation, disregarding its context. The context of a word can be very informative regarding its meaning. One way of incorporating context is the Bidirectional RNN (Schuster and Paliwal, 1997). Bidirectional RNNs are shown to be an effective representation for sequence tagging (Irsoy and Cardie, 2014). Bidirectional RNNs represent a word in the sentence using a concatenation of the end-states of two RNNs, one running from the beginning of the sentence to the word and the other running from the end to the word. The result is a vector representation for each word which captures not only the word but also its context.
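Before the word vectors are enriched with sentential context (below), the recursive node encoding enc(t) and the context-independent word vectors v_i defined above can be sketched as follows. The sketch reuses the illustrative SimpleRNN/encode_sequence helpers from the Section 2.2 sketch and assumes parameter matrices of compatible shapes; it is not the paper's implementation, which uses LSTMs trained end-to-end.

import numpy as np

class Node:
    def __init__(self, index, left=None, right=None):
        self.index = index        # sentence position i(t) of the head word
        self.left = left or []    # left modifiers, closest to the head first
        self.right = right or []  # right modifiers, closest to the head first

def word_vector(w_emb, p_emb, W_v, b_v):
    # v_i = g(W_v · (w_i ◦ p_i) + b_v), with g = tanh.
    return np.tanh(W_v @ np.concatenate([w_emb, p_emb]) + b_v)

def enc(node, v, rnn_l, rnn_r, W_e, b_e):
    """enc(t) = g(W_e · (e_l(t) ◦ e_r(t)) + b_e); the recursion bottoms out at leaves."""
    head = v[node.index]
    left_in = [head] + [enc(c, v, rnn_l, rnn_r, W_e, b_e) for c in node.left]
    right_in = [head] + [enc(c, v, rnn_l, rnn_r, W_e, b_e) for c in node.right]
    e_l = encode_sequence(rnn_l, left_in)    # head-outward left sequence
    e_r = encode_sequence(rnn_r, right_in)   # head-outward right sequence
    return np.tanh(W_e @ np.concatenate([e_l, e_r]) + b_e)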
[Figure 1: Network for encoding the sentence "the black fox who really likes apples did not jump over a lazy dog yesterday". Top: the network structure; boxed nodes represent LSTM cells, where L are cells belonging to the left-modifiers sequence model RNN_L, and R to the right-modifiers sequence model RNN_R. Circle nodes represent a concatenation followed by a linear transformation and a non-linearity. Bottom: the dependency parse of the sentence.]

We adopt the Bidirectional LSTM scheme to enrich our node vector representation, and for an n-word sentence compute the vector representations v_i as follows:

  v'_i = g(W_v · (w_i ◦ p_i) + b_v)
  f_i = LSTM_F(v'_1, v'_2, ..., v'_i)
  b_i = LSTM_B(v'_n, v'_{n−1}, ..., v'_i)
  v_i = (f_i ◦ b_i)

We plug this word representation in as the word vectors, allowing each word vector v_i to capture information regarding the word form and POS-tag, as well as the sentential context it appears in. The BI-LSTM encoder is trained jointly with the rest of the network towards the parsing objective, using back-propagation.

Embedding vectors  The word and POS embeddings w_i and p_i are also trained together with the network. For the word embeddings, we experiment with random initialization, as well as with initialization using pre-trained word embeddings. Our main goal in this work is not to provide top parsing accuracies, but rather to evaluate the ability of the proposed compositional architecture to learn and capture the structural cues that are needed for accurate parsing. Thus, we are most interested in the random initialization setup: what can the network learn from the training corpus alone, without relying on external resources. However, the ability to perform semi-supervised learning by initializing the word-embeddings with vectors that are pre-trained on large amounts of unannotated data is an appealing property of the neural-network approaches, and we evaluate our parser also in this semi-supervised setup. When using pre-trained word embeddings, we follow Dyer et al. (2015) and use embedding vectors which are trained using positional context (Ling et al., 2015), as these were shown to work better than traditional skip-gram vectors for syntactic tasks such as part-of-speech tagging and parsing.
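The BI-LSTM word representation above can be sketched as follows, again substituting the illustrative SimpleRNN-style objects of the Section 2.2 sketch for real LSTM cells; the function names are assumptions, not the paper's API.

import numpy as np

def embed_word(word_vec, pos_vec, W_v, b_v):
    # v'_i = g(W_v · (w_i ◦ p_i) + b_v), with g = tanh.
    return np.tanh(W_v @ np.concatenate([word_vec, pos_vec]) + b_v)

def bilstm_representations(v_primes, fwd_rnn, bwd_rnn):
    """Return v_i = (f_i ◦ b_i) for every position i (0-based)."""
    n = len(v_primes)
    # Forward states f_i: encode v'_1,...,v'_i left to right.
    f, s = [], fwd_rnn.initial()
    for v in v_primes:
        s = fwd_rnn.advance(s, v)
        f.append(fwd_rnn.output(s))
    # Backward states b_i: encode v'_n,...,v'_i right to left.
    b, s = [None] * n, bwd_rnn.initial()
    for i in range(n - 1, -1, -1):
        s = bwd_rnn.advance(s, v_primes[i])
        b[i] = bwd_rnn.output(s)
    return [np.concatenate([f[i], b[i]]) for i in range(n)]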
3.2 A note on the head-outward generation

Why did we choose to encode the children from the head outward, and not the other way around? The head-outward generation order is needed to facilitate incremental tree construction and allow for efficient parsing, as we show in Section 4 below. Besides the efficiency considerations, using the head-outward encoding puts more emphasis on the outermost dependents, which are known to be the most informative for predicting parse structure.[2] We rely on the RNN's capability of extracting information from arbitrary positions in the sequence to incorporate information about the head word itself, which appears in the beginning of the sequence. This seems to work well, which is expected considering that the average maximal number of siblings in one direction in the PTB is 4.1, and LSTMs were demonstrated to capture much longer-range interactions. Still, when using the tree encoding in a situation where the tree is fully specified in advance, i.e. for sentence classification, sentence similarity or translation tasks, using a head-inward generation order (or even a bi-directional RNN) may prove to work better. We leave this line of inquiry to future work.

[2] Features in transition-based dependency parsers often look at the current left-most and right-most dependents of a given node, and almost never look further than the second left-most or second right-most dependents. Second-order graph-based dependency parsers (McDonald, 2006; Eisner, 2000) also condition on the current outermost dependent when generating its sibling.

The head-outward modifier generation approach has a long history in the parsing literature, and goes back to at least Eisner (1996) and Collins (1997). In contrast to previous work, in which each modifier could condition only on a fixed small number of modifiers preceding it, and in which the left- and right-sequences of modifiers were treated as independent from one another for computational efficiency reasons, our approach allows the model to access information from the entirety of both the left and the right sequences jointly.

4 Parsing Algorithm

We now turn to explain how to parse using the tree encoder defined above. We begin by describing our bottom-up parsing algorithm, and then show how the encoded vector representation can be built and maintained throughout the parsing process.

4.1 Bottom-up Parsing

We follow a (projective) bottom-up parsing strategy, similar to the easy-first parsing algorithm of Goldberg and Elhadad (2010). The main data-structure in the parser is a list of partially-built parse trees we call pending. For a sentence with words w_1,...,w_n, the pending list is initialized with n nodes, where pending[i] corresponds to word w_i. The algorithm then chooses two neighbouring trees in the pending list, pending[i] and pending[i+1], and either attaches the root of pending[i+1] as the right-most modifier of the root of pending[i], or attaches the root of pending[i] as the left-most modifier of the root of pending[i+1]. The tree which was treated as modifier is then removed from the pending list, shortening it by one. The process ends after n−1 steps, when a single tree remains in the pending list, which is taken to be the output parse tree. The parsing process is described in Algorithm 1.

Algorithm 1 Parsing
 1: Input: Sentence w = w_1,...,w_n
 2: for i ∈ 1,...,n do
 3:   pend[i].id ← i
 4: arcs ← []
 5: while |pend| > 1 do
 6:   A ← {(i, d) | 1 ≤ i < |pend|, d ∈ {l, r}}
 7:   i, d ← select(A)
 8:   if d = l then
 9:     m, h ← pend[i], pend[i+1]
10:     pend.remove(i)
11:   else
12:     h, m ← pend[i], pend[i+1]
13:     pend.remove(i+1)
14:   arcs.append(h.id, m.id)
15: return arcs

This parsing algorithm is both sound and complete with respect to the class of projective dependency trees (Goldberg and Elhadad, 2010).
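A minimal runnable sketch of the loop in Algorithm 1 is given below; the select argument stands in for the learned scorer of Section 4.4 (the toy selector in the usage example always picks the left-most pair with direction r, which is not the paper's policy), and 0-based indexing is used instead of the 1-based indexing of the pseudocode.

def parse(n, select):
    """Bottom-up easy-first parsing loop (cf. Algorithm 1).
    n: number of words; select: function choosing (i, d) among the candidates."""
    pend = list(range(1, n + 1))   # pend holds the ids of the partial-tree roots
    arcs = []
    while len(pend) > 1:
        # Candidate actions over neighbouring pairs: d='l' attaches pend[i] under
        # pend[i+1]; d='r' attaches pend[i+1] under pend[i].
        actions = [(i, d) for i in range(len(pend) - 1) for d in ('l', 'r')]
        i, d = select(actions, pend)
        if d == 'l':
            head, mod = pend[i + 1], pend[i]
            del pend[i]
        else:
            head, mod = pend[i], pend[i + 1]
            del pend[i + 1]
        arcs.append((head, mod))
    return arcs

# Toy usage with a trivial selector:
print(parse(4, lambda actions, pend: (0, 'r')))   # -> [(1, 2), (1, 3), (1, 4)]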
The algorithm depends on non-deterministic choices of an index in the pending list and an attachment direction (line 7). When parsing in practice, the non-deterministic choice is replaced by a trained classifier that assigns a score to each index-direction pair, and the highest scoring pair is selected. We discuss the scoring function in Section 4.4, and the training algorithm in Section 5.

4.2 Bottom-up Tree-Encoding

We would like the scoring function to condition on the vector encodings of the subtrees it aims to connect. Algorithm 2 shows how to maintain the vector encodings together with the parsing algorithm, so that at every stage in the parsing process each item pending[i] is associated with a vector encoding of the corresponding tree.

Algorithm 2 Parsing while maintaining tree representations
 1: Input: Sentence w = w_1,...,w_n
 2: Input: Vectors v_i corresponding to words w_i
 3: arcs ← []
 4: for i ∈ 1,...,n do
 5:   pend[i].id ← i
 6:   pend[i].el ← RNN_L.init().append(v_i)
 7:   pend[i].er ← RNN_R.init().append(v_i)
 8: while |pend| > 1 do
 9:   A ← {(i, d) | 1 ≤ i < |pend|, d ∈ {l, r}}
10:   i, d ← select(A)
11:   if d = l then
12:     m, h ← pend[i], pend[i+1]
13:     m.c ← m.el ◦ m.er
14:     m.enc ← g(W(m.c) + b)
15:     h.el.append(m.enc)
16:     pend.remove(i)
17:   else
18:     h, m ← pend[i], pend[i+1]
19:     m.c ← m.el ◦ m.er
20:     m.enc ← g(W(m.c) + b)
21:     h.er.append(m.enc)
22:     pend.remove(i+1)
23:   arcs.add(h.id, m.id)
24: return arcs

4.3 Labeled Tree Representation

The tree representation described above does not account for the relation labels ℓ the parsing algorithm assigns to each edge. In case the tree is fully specified in advance, the relation of each word to its head can be added to the word representations v_i. However, in the context of parsing, the labels become known only when the modifier is attached to its parent. We thus extend the tree representation by concatenating the node vector representation with a vector representation assigned to the label connecting the subtree to its parent. Formally, only the final enc(t) equation changes:

  enc(t) = g(W_e · (e_l ◦ e_r ◦ ℓ) + b_e)

where ℓ is a learned embedding vector associated with the given label.

4.4 Scoring Function

The parsing algorithm relies on a function select(A) for choosing the action to take at each stage. We model this function as:

  select(A) = argmax_{(i,d,ℓ)∈A} Score(pend, i, d, ℓ)

where Score(·) is a learned function whose job is to assign scores to possible actions to reflect their quality. Ideally, it will not only score correct actions above incorrect ones, but also more confident (easier) actions above less confident ones, in order to minimize error propagation in the greedy parsing process.

When scoring a possible attachment between a head h and a modifier m with relation ℓ, the scoring function should attempt to reflect the following pieces of information:

• Are the head words of h and m compatible under relation ℓ?
• Is the modifier m compatible with the already existing modifiers of h? In other words, is m a good subtree to connect as an outer-most modifier in the subtree h?
• Is m complete, in the sense that it already acquired all of its own modifiers?

To this end, the scoring function looks at a window of k subtrees to each side of the head-modifier pair (pend[i−k],...,pend[i+1+k]), where the neighbouring subtrees are used for providing hints regarding possible additional modifiers of m and h that are yet to be acquired.
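The window construction can be sketched as follows; the zero-vector padding used for positions that fall outside the pending list is an assumption for illustration and is not specified in the paper.

import numpy as np

def window_features(pend_encodings, i, dim, k=2):
    """x_i: concatenation of the encodings of pend[i-k..i+1+k] around the
    candidate pair (pend[i], pend[i+1]); out-of-range positions are padded
    with zero vectors."""
    parts = []
    for j in range(i - k, i + 2 + k):
        if 0 <= j < len(pend_encodings):
            parts.append(pend_encodings[j])
        else:
            parts.append(np.zeros(dim))   # hypothetical padding choice
    return np.concatenate(parts)          # fed to MLP_U / MLP_L, defined next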
We use k = 2 in our experiments, for a total of 6 subtrees. This window approach is also used in the Easy-First parser of Goldberg and Elhadad (2010) and works that extend it (Tratz and Hovy, 2011; Ma et al., 2012; Ma et al., 2013). However, unlike the previous work, which made use of extensive feature engineering and rich feature functions aiming at extracting the many relevant linguistic sub-structures from the 6 subtrees and their interactions, we provide the scoring function solely with the vector encodings of the 6 subtrees in the window.

Modeling the labeled attachment score is more difficult than modeling the unlabeled score and is prone to more errors. Moreover, picking a wrong label for an attachment causes less cascading error than picking a wrong attachment, which will necessarily preclude the parser from reaching the correct tree structure. In order to partially overcome this issue, our scoring function is a sum of two auxiliary scoring functions, one scoring unlabeled and the other scoring labeled attachments. The unlabeled attachment score term in the sum functions as a fall-back which makes it easier for the parser to predict the attachment direction even when there is not sufficient certainty as to the label.

  Score(pend, i, d, ℓ) = Score_U(pend, i, d) + Score_L(pend, i, d, ℓ)

Each of Score_U and Score_L is modeled as a multi-layer perceptron:

  Score_U(pend, i, d) = MLP_U(x_i)[d]
  Score_L(pend, i, d, ℓ) = MLP_L(x_i)[(d, ℓ)]
  x_i = pend[i−2].c ◦ ··· ◦ pend[i+3].c

where MLP_U and MLP_L are standard multi-layer perceptron classifiers with one hidden layer (MLP_X(x) = W_2 · g(W_1 · x + b_1) + b_2) and output layers of size 2 and 2L respectively, [·] is an indexing operation, and we assume the values of d and (d, ℓ) are mapped to integer values.

4.5 Computational Complexity

The Easy-First parsing algorithm works in O(n log n) time (Goldberg and Elhadad, 2010). The parser in this work differs in three aspects: running a BI-LSTM encoder prior to parsing (O(n)); maintaining the tree representation during parsing (lines 11–22 in Algorithm 2), which takes constant time at each parsing step; and local scoring using an MLP rather than a linear classifier (again, a constant-time operation). Thus, the parser maintains the O(n log n) complexity of the Easy-First parser.

5 Training Algorithm

5.1 Loss and Parameter Updates

At each step of the parsing process we select the highest scoring action (i, d, ℓ). The goal of training is to set the Score function such that correct actions are scored above incorrect ones. We use a margin-based objective, aiming to maximize the margin between the highest scoring correct action and the set of incorrect actions. Formally, we define a hinge loss for each parsing step as follows:

  max{0, 1 − max_{(i,d,ℓ)∈G} Score(pend, i, d, ℓ) + max_{(i',d',ℓ')∈A\G} Score(pend, i', d', ℓ')}

where A is the set of all possible actions and G is the set of correct actions at the current stage.

As the scoring function depends on vector encodings of all the trees in the window, and each tree encoding depends on the network's parameters, each parameter update will invalidate all the vector encodings, requiring a re-computation of the entire network. We thus sum the local losses throughout the parsing process, and update the parameters with respect to the sum of the losses at sentence boundaries. Since we are using a hinge loss, the gradients will become sparser as the training progresses. Fewer non-zero gradients could translate to unreliable updates. In order to increase gradient stability and training speed, we use a variation of mini-batching in which we update the parameters only after 50 errors were made. This assures us a sufficient number of gradients for every update, thus minimizing the effect of gradient instability. The gradients of the entire network with respect to the sum of the losses are calculated using the backpropagation algorithm. Initial experiments with an SGD optimizer showed very unstable results. We settled instead on using the ADAM optimizer (Kingma and Ba, 2015), which worked well without requiring fiddling with learning rates.
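The per-step hinge loss can be sketched as follows; this is an illustrative fragment that assumes a plain scores mapping computed by the MLPs above, whereas in the actual system the losses are symbolic expressions that are summed and back-propagated through the whole network, with an ADAM update applied only after roughly 50 erroneous steps have accumulated.

def step_hinge_loss(scores, gold):
    """scores: dict mapping each action (i, d, label) to its model score;
    gold: the set of correct actions at this parsing step."""
    best_gold = max(scores[a] for a in gold)
    wrong_scores = [s for a, s in scores.items() if a not in gold]
    if not wrong_scores:          # nothing to contrast against
        return 0.0
    return max(0.0, 1.0 - best_gold + max(wrong_scores))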
5.2 Error-Exploration and Dynamic Oracle Training

At each stage in the training process, the parser assigns scores to all the possible actions (i, d, ℓ) ∈ A. It then selects an action, applies it, and moves to the next step. Which action should be chosen? A sensible option is to define G as the set of actions that can lead to the gold tree, and follow the highest scoring action in this set. However, training in this manner tends to suffer from error propagation at test time. The parser sees only states that result from following correct actions. The lack of examples containing errors in the training phase makes it hard for the parser to infer the best action given partly erroneous trees. In order to cope with this, we follow the error-exploration training strategy, in which we let the parser follow the highest scoring action in A during training even if this action is incorrect, exposing it to states that result from erroneous decisions. This strategy requires defining the set G such that the correct actions to take are well-defined also for states that cannot lead to the gold tree. Such a set G is called a dynamic oracle. Error-exploration and dynamic oracles were introduced by Goldberg and Nivre (2012).

The Dynamic Oracle  A dynamic oracle for the easy-first parsing system we use is presented in (Goldberg and Nivre, 2013). Briefly, the dynamic-oracle version of G defines the set of gold actions as the set of actions which do not increase the number of erroneous attachments more than the minimum possible (given previous erroneous actions). The number of erroneous attachments is increased in three cases: (1) connecting a modifier to its head prematurely: once the modifier is attached it is removed from the pending list and therefore can no longer acquire any of its own modifiers; (2) connecting a modifier to an erroneous head, when the correct head is still on the pending list; (3) connecting a modifier to a correct head, but with an incorrect label. Dealing with cases (2) and (3) is trivial. To deal with (1), we consider as correct only actions in which the modifier is complete. To efficiently identify complete modifiers we hold a counter for each word which is initialized to the number of modifiers the word has in the gold tree. When applying an attachment, the counter of the modifier's gold head word is decreased. When the counter reaches 0, the sub-tree rooted at that word has no pending modifiers, and is considered complete.

Aggressive Exploration  We found that even when using error-exploration, after one iteration the model remembers the training set quite well, and does not make enough errors to make error-exploration effective. In order to expose the parser to more errors, we employ a cost-augmentation scheme: we sometimes follow incorrect actions also if they score below correct actions. Specifically, when the score of the correct action is greater than that of the wrong action but the difference is smaller than the margin constant, we choose to follow the wrong action with probability p_aug (we use p_aug = 0.1 in our experiments). Pseudocode for the entire training algorithm is given in the appendix.
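The action choice under aggressive exploration can be sketched as follows; the names are illustrative, and the scores and margin constant come from the model and the objective described above.

import random

def choose_action(best_gold, best_wrong, score_g, score_w, p_aug=0.1, margin=1.0):
    """best_gold/best_wrong: highest-scoring correct / incorrect actions;
    score_g/score_w: their scores."""
    if score_g - score_w < 0:
        return best_wrong       # error exploration: follow the model's (wrong) choice
    if score_g - score_w < margin and random.random() < p_aug:
        return best_wrong       # aggressive exploration within the margin
    return best_gold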
5.3 Out-of-vocabulary items and word-dropout

Due to the sparsity of natural language, we are likely to encounter at test time a substantial number of words that did not appear in the training data (OOV words). OOV words are likely even when pre-training the word representations on a large unannotated corpus. A common approach is to designate a special "unknown-word" symbol, whose associated vector will be used as the word representation whenever an OOV word is encountered at test time. In order to train the unknown-word vector, a possible approach is to replace all the words appearing in the training corpus less than a certain number of times with the unknown-word symbol. This approach gives a good vector representation for unknown words, but at the expense of ignoring many of the words from the training corpus.

We instead propose a variant of the word-dropout approach (Iyyer et al., 2015). During training, we replace a word with the unknown-word symbol with a probability that is inversely proportional to the frequency of the word. Formally, we replace a word w appearing #(w) times in the training corpus with the unknown symbol with probability:

  p_unk(w) = α / (#(w) + α)

Using this approach we learn a vector representation for unknown words with minimal impact on the training of sparse words.

6 Implementation Details

Our Python implementation will be made available at the first author's website. We use the PyCNN wrapper of the CNN library[3] for building the computation graph of the network, computing the gradients using automatic differentiation, and performing parameter updates. We noticed that the error on the development set does not improve after 20 iterations over the training set; therefore, we ran the training for 20 iterations. The sentences were shuffled between iterations. Non-projective sentences were skipped during training. We use the default parameter initialization, step sizes and regularization values provided by the PyCNN toolkit. The hyper-parameters of the final networks used for all the reported experiments are detailed in Table 1.

[3] https://github.com/clab/cnn/tree/master/pycnn

Table 1: Hyper-parameter values used in experiments
  Word embedding dimension          100
  POS tag embedding dimension       25
  Relation embedding dimension      25
  Hidden units in Score_U           100
  Hidden units in Score_L           100
  LSTM dimensions (tree)            200
  LSTM layers (tree)                2
  BI-LSTM dimensions                100 + 100
  BI-LSTM layers                    2
  Mini-batch size                   50
  α (for word dropout)              0.25
  p_aug (for exploration training)  0.1
  g                                 tanh

Weiss et al. (2015) stress the importance of careful hyperparameter tuning for achieving top accuracy in neural network based parsers. We did not follow this advice and made very few attempts at hyper-parameter tuning, using manual hill climbing until something seemed to work with reasonable accuracy, and then sticking with it for the rest of the experiments.

7 Experiments and Results

We evaluated our parsing model on English and Chinese data. For comparison purposes we followed the setup of Dyer et al. (2015).

Data  For English, we used the Stanford Dependencies (SD) (de Marneffe and Manning, 2008) conversion of the Penn Treebank (Marcus et al., 1993), using the standard train/dev/test splits with the same predicted POS-tags as used in (Dyer et al., 2015; Chen and Manning, 2014). This dataset contains a few non-projective trees. Punctuation symbols are excluded from the evaluation. For Chinese, we use the Penn Chinese Treebank 5.1 (CTB5), using the train/test/dev splits of (Zhang and Clark, 2008; Dyer et al., 2015) with gold part-of-speech tags, also following (Dyer et al., 2015; Chen and Manning, 2014). When using external word embeddings, we also use the same data as Dyer et al. (2015).[4]

Experimental configurations  We evaluated the parser in several configurations. BOTTOMUPPARSER is the baseline parser, not using the tree encoding, and instead representing each item in pending solely by the vector representation (word and POS) of its head word. BOTTOMUPPARSER+HTLSTM uses our Hierarchical Tree LSTM representation. BOTTOMUPPARSER+HTLSTM+BI-LSTM is the Hierarchical Tree LSTM where we additionally use a BI-LSTM encoding for the head words. Finally, we added external, pre-trained word embeddings to the BOTTOMUPPARSER+HTLSTM+BI-LSTM setup. We also evaluated the final parsers in a –POS setup, in which we did not feed the parser any POS-tags.

Results  Results for English and Chinese are presented in Tables 2 and 3 respectively. For comparison, we also show the results of the Stack-LSTM transition-based parser model of Dyer et al. (2015), which we consider to be a state-of-the-art greedy model which is also very competitive with search-based models, with and without pre-trained embeddings, and with and without POS-tags.
[4] We thank Dyer et al. for sharing their data with us.

Table 2: English parsing results (SD)
                                  Dev            Test
                                  UAS    LAS     UAS    LAS
  BOTTOMUPPARSER                  83.3   79.0    82.7   78.6
  +HTLSTM                         92.4   90.1    92.0   89.8
  +BI-LSTM input                  93.0   90.5    92.6   90.2
  +external embeddings            93.3   90.8    93.0   90.9
  Dyer et al. (2015) no external  92.7   90.4    92.4   90.0
  Dyer et al. (2015) w/ external  93.2   90.9    93.1   90.9
  C&M (2014) w/ external          92.2   89.7    91.8   89.6
  BOTTOMUP+ALL –POS               92.9   90.5    92.9   90.6
  Dyer et al. (2015) –POS         93.1   90.4    92.7   90.3

Table 3: Chinese parsing results (CTB5)
                                  Dev            Test
                                  UAS    LAS     UAS    LAS
  BOTTOMUPPARSER                  79.3   77.1    78.8   76.3
  +HTLSTM                         86.2   84.5    86.2   84.7
  +BI-LSTM                        86.2   84.5    86.1   84.4
  +external embeddings            87.2   85.7    87.1   85.5
  Dyer et al. (2015) no external  86.3   84.7    85.7   84.1
  Dyer et al. (2015) w/ external  87.2   85.9    87.2   85.7
  C&M (2014) no external          84.0   82.4    83.9   82.4
  BOTTOMUP+ALL –POS               82.9   80.0    82.6   79.5
  Dyer et al. (2015) –POS         82.8   79.8    82.2   79.1

The trends are consistent across the two languages. The baseline Bottom-Up parser performs very poorly. This is expected, as only the head-word of each subtree is used for prediction. When adding the tree encoding, results jump to near state-of-the-art accuracy, suggesting that the composed vector representation is indeed successful in capturing predictive structural information. Replacing the head-words with their BI-LSTM encodings results in another increase in accuracy for English, outperforming the Dyer et al. (S-LSTM, no external) models on the test set. Adding the external pre-trained embeddings further improves the results for both our parser and Dyer et al.'s model, closing the gap between them. When POS-tags are not provided as input, the numbers for both parsers drop. The drop is small for English and large for Chinese, and our parser seems to suffer a little less than the Dyer et al. model.

Importance of the dynamic oracle  We also evaluate the importance of using the dynamic oracle and error-exploration training, and find that they are indeed important for achieving high parsing accuracies with our model (Table 4).

Table 4: Effect of the error-exploration training (dynamic oracle) on dev set accuracy in English and Chinese. RAND: random initialization. EXT: pre-trained external embeddings.
               English         Chinese
               UAS    LAS      UAS    LAS
  RAND         93.0   90.5     86.2   84.5
  RAND-NODYN   92.2   89.8     85.7   84.1
  EXT          93.3   90.8     87.2   85.7
  EXT-NODYN    92.7   90.4     86.6   85.1

When training without error-exploration (that is, when the parser follows only correct actions during training and does not use the dynamic aspect of the oracle), accuracies on unseen sentences drop by between 0.4 and 0.8 accuracy points (average 0.58). This is consistent with previous work on training with error-exploration and dynamic oracles (Goldberg and Nivre, 2013), showing that the technique is not restricted to sparse linear models.

Comparison to other state-of-the-art parsers  Our main point of comparison is the model of Dyer et al., which was chosen because (a) it is a very strong parsing model; and (b) it is the closest to ours in the literature: a greedy parsing model making heavy use of LSTMs. To this end, we tried to make the comparison to Dyer et al. as controlled as possible, using the same dependency annotation schemes, as well as the same predicted POS-tags and the same pre-trained embeddings (when applicable).

It is also informative to position our results with respect to other state-of-the-art parsing results reported in the literature, as we do in Table 5. Here, some of the comparisons are less direct: some of the results use different dependency annotation schemes,[5] as well as different predicted POS-tags and different pre-trained word embeddings. While the numbers are not directly comparable, they do give a good reference as to the expected range of
state-of-the-art parsing results. Our system's English parsing results are in the range of the state of the art, and the Chinese parsing results surpass it. These numbers are achieved while using a greedy, bottom-up parsing method without any search, and while relying solely on the compositional tree representations.

[5] Our English parsing experiments use the Stanford Dependencies scheme, while other works use less informative dependency relations which are based on the Penn2Malt converter, using the Yamada and Matsumoto head rules. From our experience, this conversion is somewhat easier to parse, resulting in numbers which are about 0.3–0.4 points higher than Stanford Dependencies.

8 Related Work

We survey two lines of related work: methods for encoding trees as vectors, and methods for parsing with vector representations.

The popular approach for encoding trees as vectors is using recursive neural networks (Goller and Kuchler, 1996; Socher et al., 2010; Tai et al., 2015). Recursive neural networks represent the vector of a parent node in a tree as a function of its children nodes. However, the functions are usually restricted to having a fixed maximum arity (usually two) (Socher et al., 2010; Tai et al., 2015; Socher, 2014). While trees can be binarized to cope with the arity restriction, doing so results in deep trees, which in turn leads to the vanishing gradient problem when training. To cope with the vanishing gradients, Tai et al. (2015) enrich the composition function with a gating mechanism similar to that of the LSTM, resulting in the so-called Tree-LSTM model. Another approach is to allow arbitrary arities but ignore the sequential nature of the modifiers, e.g. by using a bag-of-modifiers representation or a convolutional layer (Tai et al., 2015; Zhu et al., 2015). In contrast, our tree encoding method naturally allows for arbitrary branching trees by relying on the well-established LSTM sequence model, and using it as a black box. Very recently, Zhang et al. (2015) proposed an RNN-based tree encoding which is similar to ours in encoding the sequence of modifiers as an RNN. Unlike our bottom-up encoder, their method works top-down, and is therefore not readily applicable for parsing. On the other hand, the top-down approach is well suited for generation. In future work, it could be interesting to combine the bottom-up and top-down approaches in an encoder-decoder framework (Sutskever et al., 2014; Kiros et al., 2015). Work by Dyer et al. (2016), which was submitted in parallel to ours, introduces a similar LSTM-based representation of syntactic constituents in the context of phrase-grammar parsing.

In terms of parsing with vector representations, there are four dominant approaches: search-based parsers that use local features that are fed to a neural-network classifier (Pei et al., 2015; Durrett and Klein, 2015); greedy transition-based parsers that use local features that are fed into a neural-network classifier (Chen and Manning, 2014; Weiss et al., 2015), sometimes coupled with a node composition function (Dyer et al., 2015; Watanabe and Sumita, 2015); bottom-up parsers that rely solely on recursively combined vector encodings of subtrees (Socher et al., 2010; Stenetorp, 2013; Socher et al., 2013a); and parse-reranking approaches that first produce a k-best list of parses using a traditional parsing technique, and then score the trees based on a recursive vector encoding of each node (Le and Zuidema, 2014; Le and Zuidema, 2015; Zhu et al., 2015).

Our parser is a greedy, bottom-up parser that relies on compositional vector encodings of subtrees as its sole set of features. Unlike the re-ranking approaches, we do not rely on an external parser to provide k-best lists. Unlike the bottom-up parser in (Socher et al., 2010) that only parses sentences of up to 15 words and the parser of (Stenetorp, 2013) that achieves very low parsing accuracies, we
parse arbitrary sentences with near state-of-the-art accuracy. Unlike the bottom-up parser in (Socher et al., 2013a), we do not make use of a grammar. The parser of (Weiss et al., 2015) obtains exceptionally high results using local features and no composition function. The greedy version of their parser uses extensive tuning of hyper-parameters and network depth in order to squeeze every possible bit of accuracy; adding beam search on top of that further improves results. Due to our much more limited resources, we did not perform a methodical search over hyper-parameters, explored only a tiny space of the possible hyper-parameters, and our parser does not perform search. Finally, perhaps closest to our approach is the greedy, transition-based parser of (Dyer et al., 2015), which also works in a bottom-up fashion, and incorporates an LSTM encoding of the input tokens and hierarchical vector composition into its scoring mechanism. Indeed, that parser obtains similar scores to ours, although we obtain somewhat better results when not using pre-trained embeddings. We differ from the parser of Dyer et al. by having a more elaborate vector-composition function, relying solely on the compositional representations, and performing fully bottom-up parsing without being guided by a stack-and-buffer control structure.

Table 5: Parsing results (UAS) of various state-of-the-art parsing systems on the English and Chinese datasets.
  System        Method               Representation                 Emb  PTB-YM  PTB-SD  CTB    Runtime
  ZhangNivre11  transition (beam)    large feature set (sparse)     –    92.9    –       86.0   O(n)+
  Martins13     graph, 3rd order+    large feature set (sparse)     –    92.8    93.07   –      O(n^4)
  Pei15         graph, 2nd order     large feature set (dense)      –    92.99   –       –      O(n^3)
  This Work     EasyFirst (greedy)   Rec-LSTM encoding              –    –       92.6    86.1   O(n log n)
  Weiss15       transition (greedy)  large feature set (dense)      YES  –       93.19   –      O(n)
  Weiss15       transition (beam)    large feature set (dense)      YES  –       93.99   –      O(n)+
  Pei15         graph, 2nd order     large feature set (dense)      YES  93.29   –       –      O(n^3)
  LeZuidema14   reranking/blend      inside-outside recursive net   YES  93.12   93.84   –      O(n^3)
  Zhu15         reranking/blend      recursive conv-net             YES  93.83   –       85.71  O(n)+
  This Work     EasyFirst (greedy)   Rec-LSTM encoding              YES  –       93.0    87.1   O(n log n)
The systems that use embeddings use different pre-trained embeddings. English results use predicted POS tags (different systems use different taggers), while Chinese results use gold POS tags. PTB-YM: English PTB, Yamada and Matsumoto head rules. PTB-SD: English PTB, Stanford Dependencies (different systems may use different versions of the Stanford converter). CTB: Chinese Treebank. "reranking/blend" in the method column indicates a reranking system where the reranker score is interpolated with the base-parser's score. The reranking systems' runtimes are those of the base parsers they use. O(n)+ indicates a linear-time system with a large multiplicative constant. The different systems and the numbers reported from them are taken from: ZhangNivre11: (Zhang and Nivre, 2011); Martins13: (Martins et al., 2013); Weiss15: (Weiss et al., 2015); Pei15: (Pei et al., 2015); LeZuidema14: (Le and Zuidema, 2014); Zhu15: (Zhu et al., 2015).

9 Conclusions and Future Work

We suggest a compositional vector representation of parse trees that relies on a recursive combination of recurrent-neural network encoders, and demonstrate its effectiveness by integrating it in a bottom-up easy-first parser. Future extensions in terms of parsing include the addition of beam search, handling of unknown words using character embeddings, and adapting the algorithm to constituency trees. We also plan to establish the effectiveness of our Hierarchical Tree-LSTM encoder by applying it to more semantic vector representation tasks, i.e. training tree representations for capturing sentiment (Socher et al., 2013b; Tai et al., 2015), semantic sentence similarity
(Marelli et al., 2014) or textual inference (Bowman et al., 2015).

Acknowledgements  This research is supported by the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI) and the Israeli Science Foundation (grant number 1555/15).

References

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. Book in preparation for MIT Press, http://www.deeplearningbook.org.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal, September. Association for Computational Linguistics.

Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750, Doha, Qatar, October. Association for Computational Linguistics.

Kyunghyun Cho. 2015. Natural language understanding with distributed representation. CoRR, abs/1511.07916.

Michael Collins. 1997. Three generative, lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, pages 16–23, Madrid, Spain, July. Association for Computational Linguistics.

Marie-Catherine de Marneffe and Christopher D. Manning. 2008. Stanford dependencies manual. Technical report, Stanford University.

Greg Durrett and Dan Klein. 2015. Neural CRF parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 302–312,
Beijing, China, July. Association for Computational Linguistics.

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 334–343, Beijing, China, July. Association for Computational Linguistics.

Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 199–209, San Diego, California, June. Association for Computational Linguistics.

Jason Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In 16th International Conference on Computational Linguistics, Proceedings of the Conference, COLING 1996, Center for Sprogteknologi, Copenhagen, Denmark, August 5-9, 1996, pages 340–345.

Jason Eisner. 2000. Bilexical grammars and their cubic-time parsing algorithms. Advances in Probabilistic and Other Parsing Technologies.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Yoav Goldberg and Michael Elhadad. 2010. An efficient algorithm for easy-first non-directional dependency parsing. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 742–750, Los Angeles, California, June. Association for Computational Linguistics.

Yoav Goldberg and Joakim Nivre. 2012. A dynamic oracle for arc-eager dependency parsing. In Proceedings of COLING 2012, pages 959–976, Mumbai, India, December. The COLING 2012 Organizing Committee.

Yoav Goldberg and Joakim Nivre. 2013. Training deterministic parsers with non-deterministic oracles. Transactions of the Association for Computational Linguistics, 1:403–414.

Yoav Goldberg. 2015. A primer on neural network models for natural language processing. CoRR, abs/1510.00726.

Christoph Goller and Andreas Kuchler. 1996. Learning task-dependent distributed representations by backpropagation through structure. In Neural Networks, 1996., IEEE International Conference on, volume 1, pages 347–352. IEEE.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Ozan Irsoy and Claire Cardie. 2014. Opinion mining with deep recurrent neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 720–728, Doha, Qatar, October. Association for Computational Linguistics.

Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1681–1691, Beijing, China, July. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference for Learning Representations, San Diego, California.

Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 3294–3302.

Sandra Kübler, Ryan T. McDonald, and Joakim Nivre. 2009. Dependency Parsing. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.

Phong Le and Willem Zuidema. 2014. The inside-outside recursive neural network model for dependency parsing. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (
EMNLP), pages 729–739, Doha, Qatar, October. Association for Computational Linguistics.

Phong Le and Willem Zuidema. 2015. The forest convolutional network: Compositional distributional semantics with a neural chart and without binarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1155–1164, Lisbon, Portugal, September. Association for Computational Linguistics.

Wang Ling, Chris Dyer, Alan W. Black, and Isabel Trancoso. 2015. Two/too simple adaptations of word2vec for syntax problems. In NAACL HLT 2015, The 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, USA, May 31 - June 5, 2015, pages 1299–1304.

Ji Ma, Tong Xiao, Jingbo Zhu, and Feiliang Ren. 2012. Easy-first Chinese POS tagging and dependency parsing. In Proceedings of COLING 2012, pages 1731–1746, Mumbai, India, December. The COLING 2012 Organizing Committee.

Ji Ma, Jingbo Zhu, Tong Xiao, and Nan Yang. 2013. Easy-first POS tagging and dependency parsing with beam search. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 110–114, Sofia, Bulgaria, August. Association for Computational Linguistics.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Marco Marelli, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto Zamparelli. 2014. SemEval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 1–8, Dublin, Ireland, August. Association for Computational Linguistics and Dublin City University.

Andre Martins, Miguel Almeida, and Noah A. Smith. 2013. Turning on the turbo: Fast third-order non-projective turbo parsers. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 617–622, Sofia, Bulgaria, August. Association for Computational Linguistics.

Ryan McDonald. 2006. Discriminative Training and Spanning Tree Algorithms for Dependency Parsing. Ph.D. thesis, University of Pennsylvania.

Tomas Mikolov, Martin Karafiát, Lukás Burget, Jan Cernocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, pages 1045–1048.

Wenzhe Pei, Tao Ge, and Baobao Chang. 2015. An effective neural network model for graph-based dependency parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 313–322, Beijing, China, July. Association for Computational Linguistics.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Trans. Signal Processing, 45(11):2673–2681.

Richard Socher, Christopher Manning, and Andrew Ng. 2010. Learning continuous phrase representations and syntactic parsing with recursive neural networks. In Proceedings of the Deep Learning and Unsupervised Feature Learning Workshop of (NIPS) 2010, pages 1–9.

Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013a. Parsing with compositional vector grammars. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013, 4-9 August 2013, Sofia, Bulgaria, Volume 1: Long Papers, pages 455–465.

Richard S
ocher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013b. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA, October. Association for Computational Linguistics.

Richard Socher. 2014. Recursive Deep Learning For Natural Language Processing and Computer Vision. Ph.D. thesis, Stanford University, August.

Pontus Stenetorp. 2013. Transition-based dependency parsing using recursive neural networks. In Deep Learning Workshop at the 2013 Conference on Neural Information Processing Systems (NIPS), Lake Tahoe, Nevada, USA, December.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3104–3112.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1556–1566, Beijing, China, July. Association for Computational Linguistics.

Stephen Tratz and Eduard Hovy. 2011. A fast, accurate, non-projective, semantically-enriched parser. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1257–1268, Edinburgh, Scotland, UK, July. Association for Computational Linguistics.

Taro Watanabe and Eiichiro Sumita. 2015. Transition-based neural constituent parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1169–1179, Beijing, China, July. Association for Computational Linguistics.

David Weiss, Chris Alberti, Michael Collins, and Slav Petrov. 2015. Structured training for neural network transition-based parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 323–333, Beijing, China, July. Association for Computational Linguistics.

Yue Zhang and Stephen Clark. 2008. A tale of two parsers: Investigating and combining graph-based and transition-based dependency parsing. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 562–571, Honolulu, Hawaii, October. Association for Computational Linguistics.

Yue Zhang and Joakim Nivre. 2011. Transition-based dependency parsing with rich non-local features. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 188–193, Portland, Oregon, USA, June. Association for Computational Linguistics.

Xingxing Zhang, Liang Lu, and Mirella Lapata. 2015. Tree recurrent neural networks with application to language modeling. CoRR, abs/1511.00060.

Chenxi Zhu, Xipeng Qiu, Xinchi Chen, and Xuanjing Huang. 2015. A re-ranking model for dependency parser with recursive convolutional neural network. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1159–1168, Beijing, China, July. Association for Computational Linguistics.
Appendix: Training Algorithm Pseudocode

Algorithm 3 Training on an annotated corpus
 1: Input: Sentences w^1,...,w^m
 2: Input: Tree annotations T^1,...,T^m
 3: Input: Number of epochs to train
 4: V ← InitializeVectors()
 5: Loss ← []
 6: for epoch ∈ {1,...,Epochs} do
 7:   for S, T ∈ {(w^1, T^1),...,(w^m, T^m)} do
 8:     Loss ← TrainSentence(S, V[w_1,...,w_n], T, Loss)
 9:     if |Loss| > 50 then
10:       SumLoss ← sum(Loss)
11:       Call ADAM to minimize SumLoss
12:       Loss ← []

(See Algorithm 4, training of a single sentence, below.)
Algorithm 4 Training on a single sentence with the dynamic-oracle algorithm
 1: function TRAINSENTENCE(w, v, T, Loss)
 2: Input: Sentence w = w_1,...,w_n
 3: Input: Vectors v_i corresponding to inputs w_i
 4: Input: Annotated tree T in the form of (h, m, rel) triplets
 5: Input: List Loss to which loss expressions are added
 6: for i ∈ 1,...,n do
 7:   unassigned[i] ← |Children(w_i)|
 8:   pend[i].id ← i
 9:   pend[i].el ← RNN_L.init().append(v_i)
10:   pend[i].er ← RNN_R.init().append(v_i)
11: while |pend| > 1 do
12:   G, W ← {}, {}
13:   for (i, d, rel) ∈ {1 ≤ i < |pend|, d ∈ {l, r}, rel ∈ Relations} do
14:     if d = l then m, h ← pend[i], pend[i+1]
15:     else m, h ← pend[i+1], pend[i]
16:     if unassigned[m.id] ≠ 0 ∨ ∃ℓ ≠ rel : (h, m, ℓ) ∈ T then
17:       W.append((h, m, rel))
18:     else G.append((h, m, rel))
19:   hG, mG, relG ← argmax_{(h,m,rel)∈G} Score(pend, h, m, rel)
20:   hW, mW, relW ← argmax_{(h,m,rel)∈W} Score(pend, h, m, rel)
21:   scoreG ← Score(hG, mG, relG)
22:   scoreW ← Score(hW, mW, relW)
23:   if scoreG − scoreW < 0 then
24:     h, m, rel, score ← hW, mW, relW, scoreW
25:   else if scoreG − scoreW > 1 ∨ random()