Transactions of the Association for Computational Linguistics, vol. 5, pp. 45–58, 2017. Action Editor: Brian Roark.
Submission batch: 5/2016; Revision batch: 9/2016; Published 1/2017.
© 2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Shift-Reduce Constituent Parsing with Neural Lookahead Features

Jiangming Liu and Yue Zhang
Singapore University of Technology and Design, 8 Somapah Road, Singapore, 487372
{jiangmingliu, yuezhang}@sutd.edu.sg

Abstract

Transition-based models can be fast and accurate for constituent parsing. Compared with chart-based models, they leverage richer features by extracting history information from a parser stack, which consists of a sequence of non-local constituents. On the other hand, during incremental parsing, constituent information on the right-hand side of the current word is not utilized, which is a relative weakness of shift-reduce parsing. To address this limitation, we leverage a fast neural model to extract lookahead features. In particular, we build a bidirectional LSTM model, which leverages full sentence information to predict the hierarchy of constituents that each word starts and ends. The results are then passed to a strong transition-based constituent parser as lookahead features. The resulting parser gives 1.3% absolute improvement on WSJ and 2.3% on CTB compared to the baseline, giving the highest reported accuracies for fully-supervised parsing.

1 Introduction

Transition-based constituent parsers are fast and accurate, performing incremental parsing using a sequence of state transitions in linear time. Pioneering models rely on a classifier to make local decisions, searching greedily for local transitions to build a parse tree (Sagae and Lavie, 2005). Zhu et al. (2013) use a beam search framework, which preserves the linear time complexity of greedy search, while alleviating the disadvantage of error propagation. The model gives state-of-the-art accuracies at a speed of 89 sentences per second on the standard WSJ benchmark (Marcus et al., 1993).

Zhu et al. (2013) exploit rich features by extracting history information from a parser stack, which consists of a sequence of non-local constituents. However, due to the incremental nature of shift-reduce parsing, the right-hand side constituents of the current word cannot be used to guide the action at each step. Such lookahead features (Tsuruoka et al., 2011) correspond to the outside scores in chart parsing (Goodman, 1998), which have been effective for obtaining improved accuracies.

To leverage such information for improving shift-reduce parsing, we propose a novel neural model to predict the constituent hierarchy related to each word before parsing. Our idea is inspired by the work of Roark and Hollingshead (2009) and Zhang et al. (2010b), which shows that shallow syntactic information gathered over the word sequence can be utilized for pruning chart parsers, improving chart parsing speed without sacrificing accuracies. For example, Roark and Hollingshead (2009) predict constituent boundary information on words as a pre-processing step, and use such information to prune the chart. Since such information is much lighter-weight compared to full parsing, it can be predicted relatively accurately using sequence labellers.

Different from Roark and Hollingshead (2009), we collect lookahead constituent information for shift-reduce parsing, rather than pruning information for chart parsing. Our main concern is improving the accuracy rather than improving the speed. Accordingly, our model should predict the constituent hierarchy for each word rather than simple boundary information. For example, in Figure 1(a), the constituent hierarchy that the word "The" starts is "S → NP", and the constituent hierarchy that the word "table" ends is "S → VP → NP → PP → NP".

Figure 1: Example constituent hierarchies for the sentence "The past and present students like this book on the table". (a) parse tree; (b) constituent hierarchies on words:

Word       s-type          e-type
The        [s: S NP]       [e: Ø]
past       [s: ADJP]       [e: Ø]
and        [s: Ø]          [e: Ø]
present    [s: Ø]          [e: ADJP]
students   [s: Ø]          [e: NP]
like       [s: VP]         [e: Ø]
this       [s: NP NP]      [e: Ø]
book       [s: Ø]          [e: NP]
on         [s: PP]         [e: Ø]
the        [s: NP]         [e: Ø]
table      [s: Ø]          [e: S VP NP PP NP]

For each word, we predict both the constituent hierarchy it starts and the constituent hierarchy it ends, using them as lookahead features.

The task is challenging. First, it is significantly more difficult compared to simple sequence labelling, since two sequences of constituent hierarchies must be predicted for each word in the input sequence. Second, for high accuracies, global features from the full sentence are necessary since constituent hierarchies contain rich structural information. Third, to retain high speed for shift-reduce parsing, lookahead feature prediction must be executed efficiently. It is highly difficult to build such a model using manual discrete features and structured search.

Fortunately, sequential recurrent neural networks (RNNs) are remarkably effective models to encode the full input sentence. We leverage RNNs for building our constituent hierarchy predictor. In particular, an LSTM (Hochreiter and Schmidhuber, 1997) is used to learn global features automatically from the input words. For each word, a second LSTM is then used to generate the constituent hierarchies greedily using features from the hidden layer of the first LSTM, in the same way a neural language model decoder generates output sentences for machine translation (Bahdanau et al., 2015). The resulting model solves all three challenges raised above. For fully-supervised learning, we learn word embeddings as part of the model parameters.

In the standard WSJ (Marcus et al., 1993) and CTB 5.1 tests (Xue et al., 2005), our parser gives 1.3 F1 and 2.3 F1 improvement, respectively, over the baseline of Zhu et al. (2013), resulting in an accuracy of 91.7 F1 for English and 85.5 F1 for Chinese, which are the best for fully-supervised models in the literature. We release our code, based on ZPar (Zhang and Clark, 2011; Zhu et al., 2013), at https://github.com/SUTDNLP/LookAheadConparser.

Figure 2: Deduction system for the baseline shift-reduce parsing process.

Initial State:  [φ, 0, false, 0]
Final State:    [S, n, true, m : 2n ≤ m ≤ 4n]
Induction Rules:
SHIFT:          [S, i, false, k]        ⇒  [S|w, i+1, false, k+1]
REDUCE-L/R-X:   [S|s1 s0, i, false, k]  ⇒  [S|X, i, false, k+1]
UNARY-X:        [S|s0, i, false, k]     ⇒  [S|X, i, false, k+1]
FINISH:         [S, n, false, k]        ⇒  [S, n, true, k+1]
IDLE:           [S, n, true, k]         ⇒  [S, n, true, k+1]

2 Baseline System

We adopt the parser of Zhu et al. (2013) for a baseline, which is based on the shift-reduce process of Sagae and Lavie (2005) and the beam search strategy of Zhang and Clark (2011) with global perceptron training.
2.1 The Shift-Reduce System

Shift-reduce parsers process an input sentence incrementally from left to right. A stack is used to maintain partial phrase-structures, while the incoming words are ordered in a buffer. At each step, a transition action is applied to consume an input word or construct a new phrase-structure. The set of transition actions are:

- SHIFT: pop the front word off the buffer, and push it onto the stack.
- REDUCE-L/R-X: pop the top two constituents off the stack (L/R means that the head is the left constituent or the right constituent, respectively), combine them into a new constituent with label X, and push the new constituent onto the stack.
- UNARY-X: pop the top constituent off the stack, raise it to a new constituent X, and push the new constituent onto the stack.
- FINISH: pop the root node off the stack and end parsing.
- IDLE: no-effect action on a completed state, without changing items on the stack or buffer, used to ensure that the same number of actions are in each item in beam search (Zhu et al., 2013).

The deduction system for the process is shown in Figure 2, where a state is represented as [stack, buffer front index, completion mark, action index], and n is the number of words in the input. For example, given the sentence "They like apples", the action sequence "SHIFT, SHIFT, SHIFT, REDUCE-L-VP, REDUCE-R-S" gives its syntax "(S They (VP like apples))".

2.2 Search and Training

Beam-search is used for decoding, with the k best state items at each step being kept in the agenda. During initialization, the agenda contains only the initial state [φ, 0, false, 0]. At each step, each state in the agenda is popped and expanded by applying all valid transition actions, and the top k resulting states are put back onto the agenda (Zhu et al., 2013). The process repeats until the agenda is empty, and the best completed state is taken as output.

The score of a state is the total score of the transition actions that have been applied to build it:

C(\alpha) = \sum_{i=1}^{N} \Phi(\alpha_i) \cdot \vec{\theta}    (1)

Here Φ(α_i) represents the feature vector for the ith action α_i in the state item α. N is the total number of actions in α. The model parameter vector θ is trained online using the averaged perceptron algorithm with the early-update strategy (Collins and Roark, 2004).

Table 1: Baseline feature templates, where si represents the ith item on the top of the stack and qi denotes the ith item in the front of the buffer. The symbol w denotes the lexical head of an item; the symbol c denotes the constituent label of an item; the symbol t is the POS of a lexical head; u denotes a unary child; si ll denotes the left child of si's left child.

UNIGRAM:  s0tc, s0wc, s1tc, s1wc, s2tc, s2wc, s3tc, s3wc, q0wt, q1wt, q2wt, q3wt, s0lwc, s0rwc, s0uwc, s1lwc, s1rwc, s1uwc
BIGRAM:   s0ws1w, s0ws1c, s0cs1w, s0cs1c, s0wq0w, s0wq0t, s0cq0w, s0cq0t, q0wq1w, q0wq1t, q0tq1w, q0tq1t, s1wq0w, s1wq0t, s1cq0w, s1cq0t
TRIGRAM:  s0cs1cs2c, s0ws1cs2c, s0cs1wq0t, s0cs1cs2w, s0cs1cq0t, s0ws1cq0t, s0cs1wq0t, s0cs1cq0w
Extended: s0llwc, s0lrwc, s0luwc, s0rlwc, s0rrwc, s0ruwc, s0ulwc, s0urwc, s0uuwc, s1llwc, s1lrwc, s1luwc, s1rlwc, s1rrwc, s1ruwc

2.3 Baseline Features

Our baseline features are taken from Zhu et al. (2013). As shown in Table 1, they include the UNIGRAM, BIGRAM, TRIGRAM features of Zhang and Clark (2009) and the extended features of Zhu et al. (2013).
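To make the transition system of Section 2.1 concrete, the following is a minimal Python sketch that replays the example action sequence for "They like apples". It is an illustrative re-implementation under our own tree encoding (nested tuples), not the released ZPar-based C++ parser, and it omits beam search and the IDLE padding.

```python
# Minimal sketch of the shift-reduce constituent transition system (Section 2.1).
# Illustrative only; not the authors' released implementation.

def parse(words, actions):
    stack, buffer = [], list(words)           # partial trees / remaining words
    for action in actions:
        if action == "SHIFT":
            stack.append(buffer.pop(0))        # move the front word onto the stack
        elif action.startswith("REDUCE"):      # e.g. "REDUCE-L-VP" or "REDUCE-R-S"
            _, _head, label = action.split("-")
            right = stack.pop()
            left = stack.pop()
            stack.append((label, left, right)) # combine the top two items under label X
        elif action.startswith("UNARY"):       # e.g. "UNARY-NP"
            _, label = action.split("-")
            stack.append((label, stack.pop())) # raise the top item to a new constituent X
        elif action == "FINISH":
            break                              # the root is the single remaining stack item
        # IDLE leaves a completed state unchanged and is omitted in this sketch
    return stack[0]

def bracket(tree):
    # Render a nested tuple as a bracketed string, e.g. (S They (VP like apples)).
    if isinstance(tree, str):
        return tree
    label, *children = tree
    return "(" + label + " " + " ".join(bracket(c) for c in children) + ")"

actions = ["SHIFT", "SHIFT", "SHIFT", "REDUCE-L-VP", "REDUCE-R-S"]
print(bracket(parse(["They", "like", "apples"], actions)))
# -> (S They (VP like apples))
```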
3 Global Lookahead Features

The baseline features suffer two limitations, as mentioned in the introduction. First, they are relatively local to the state, considering only the neighbouring nodes of s0 (top of stack) and q0 (front of buffer). Second, they do not consider lookahead information beyond s3, or the syntactic structure of the buffer sequence. We use an LSTM to capture full sentential information in linear time, representing such global information that is fed into the baseline parser as a constituent hierarchy for each word. Lookahead features are extracted from the constituent hierarchy to provide top-down guidance for bottom-up parsing.

3.1 Constituent Hierarchy

In a constituency tree, each word can start or end a constituent hierarchy. As shown in Figure 1, the word "The" starts a constituent hierarchy "S → NP". In particular, it starts a constituent S in the top level, dominating a constituent NP. The word "table" ends a constituent hierarchy "S → VP → NP → PP → NP". In particular, it ends a constituent hierarchy with a constituent S on the top level, dominating a VP (starting from the word "like"), and then an NP (starting from the noun phrase "this book"), and then a PP (starting from the word "on"), and finally an NP (starting from the word "the"). The extraction of constituent hierarchies for each word is based on unbinarized grammars, reflecting the unbinarized trees that the word starts or ends. The constituent hierarchy is empty (denoted as φ) if the corresponding word does not start or end a constituent. The constituent hierarchies are added into the shift-reduce parser as soft features (Section 3.2).

Formally, a constituent hierarchy is defined as

[type: c_1 → c_2 → ... → c_z],

where c is a constituent label (e.g. NP), "→" represents the top-down hierarchy, and type can be s or e, denoting that the current word starts or ends the constituent hierarchy, respectively, as shown in Figure 1.

Compared with full parsing, the constituent hierarchies associated with each word have no forced structural dependencies between each other, and therefore can be modelled more easily, for each word individually. Being soft lookahead features rather than hard constraints, inter-dependencies are not crucial for the main parser.
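As a concrete illustration of the definition in Section 3.1, the sketch below derives the s-type and e-type hierarchies of Figure 1 from an unbinarized bracketed tree. The nested-list tree encoding and function names are our own assumptions, and POS-level pre-terminals are omitted since the hierarchies in Figure 1 contain only phrasal labels; this is not the authors' released code.

```python
# Sketch: extract s-type / e-type constituent hierarchies (Section 3.1) from an
# unbinarized tree. Trees are nested lists [label, child1, ...]; leaves are words.

def hierarchies(tree):
    words, starts, ends = [], [], []

    def visit(node):
        if isinstance(node, str):                  # leaf: a word
            words.append(node)
            starts.append([])
            ends.append([])
            i = len(words) - 1
            return i, i
        label, *children = node
        spans = [visit(child) for child in children]
        first, last = spans[0][0], spans[-1][1]
        # Children are processed first, so inserting at the front keeps the
        # top-down order of Figure 1 (outermost constituent first).
        starts[first].insert(0, label)
        ends[last].insert(0, label)
        return first, last

    visit(tree)
    return words, starts, ends

# The Figure 1 tree, with POS-level nodes dropped for brevity.
tree = ["S",
        ["NP", "The", ["ADJP", "past", "and", "present"], "students"],
        ["VP", "like",
         ["NP", ["NP", "this", "book"],
                ["PP", "on", ["NP", "the", "table"]]]]]

words, starts, ends = hierarchies(tree)
for w, s, e in zip(words, starts, ends):
    print(w, "[s:", " ".join(s) or "Ø", "] [e:", " ".join(e) or "Ø", "]")
# e.g. "The [s: S NP ] [e: Ø ]"  and  "table [s: Ø ] [e: S VP NP PP NP ]"
```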
3.2 Lookahead Features

The lookahead feature templates are defined in Table 2. In order to ensure parsing efficiency, only simple feature templates are taken into consideration. The lookahead features of a state are instantiated for the top two items on the stack (i.e., s0 and s1) and buffer (i.e., q0 and q1).

Table 2: Lookahead feature templates, where si represents the ith item on the top of the stack and qi denotes the ith item in the front of the buffer. The symbols gs and ge denote the next level constituent in the s-type hierarchy and e-type hierarchy, respectively.

Templates: s0gs, s0ge, s1gs, s1ge, q0gs, q0ge, q1gs, q1ge

A new function Φ' is defined to output the lookahead feature vector. The scoring of a state in our model is based on Formula (1) but with a new term Φ'(α_i) · θ':

C'(\alpha) = \sum_{i=1}^{N} \left( \Phi(\alpha_i) \cdot \vec{\theta} + \Phi'(\alpha_i) \cdot \vec{\theta}\,' \right)

For each word, the lookahead feature represents the next level constituent in the top-down hierarchy, which can guide bottom-up parsing.

Figure 3: Two intermediate states for parsing the sentence "The past and present students like this book on the table". Each item on the stack or buffer has two constituent hierarchies: s-type (left) and e-type (right), respectively, in the corresponding box. Note that the e-type constituent hierarchy of the word "students" is incorrectly predicted, yet used as soft constraints (i.e., features) in our model.

For example, Figure 3 shows two intermediate states during parsing. In Figure 3(a), the s-type and e-type lookahead features of s1 (i.e., the word "The") are extracted from the constituent hierarchy in the bottom level, namely NP and NULL, respectively. On the other hand, in Figure 3(b), the s-type lookahead feature of s1 is extracted from the s-type constituent hierarchy of the same word "The", but it is S, based on the current hierarchical level. The e-type lookahead feature, on the other hand, is extracted from the e-type constituent hierarchy of the end word "students" of the VP constituent, which is NULL in the next level. Lookahead features for items on the buffer are extracted in the same way.

The lookahead features are useful for guiding shift-reduce decisions given the current state. For example, given the intermediate state in Figure 3(a), s0 has an s-type lookahead feature ADJP, and q1 in the buffer has an e-type lookahead feature ADJP. This indicates that the two items are likely reduced into the same constituent. Further, s0 cannot end a constituent because of the empty e-type constituent hierarchy. As a result, the final shift-reduce parser will assign a higher score to the SHIFT decision.

4 Constituent Hierarchy Prediction

We propose a novel neural model for constituent hierarchy prediction. Inspired by the encoder-decoder framework for neural machine translation (Bahdanau et al., 2015; Cho et al., 2014), we use an LSTM to capture full sentence features, and another LSTM to generate the constituent hierarchies for each word. Compared with a CRF-based sequence labelling model (Roark and Hollingshead, 2009), the proposed model has three advantages. First, the global features can be automatically represented. Second, it can avoid the exponentially large number of labels if constituent hierarchies are treated as unique labels. Third, the model size is relatively small, and does not have a large effect on the final parser model.

As shown in Figure 4, the neural network consists of three main layers, namely the input layer, the encoder layer and the decoder layer. The input layer represents each word using its characters and token information; the encoder hidden layer uses a bidirectional recurrent neural network structure to learn global features from the sentence; and the decoder layer predicts constituent hierarchies according to the encoder layer features, by using the attention mechanism (Bahdanau et al., 2015) to compute the contribution of each hidden unit of the encoder.

4.1 Input Layer

The input layer generates a dense vector representation of each input word. We use character embeddings to alleviate OOV problems in word embeddings (Ballesteros et al., 2015; Santos and Zadrozny, 2014; Kim et al., 2016), concatenating the character embeddings of a word with its word embedding. Formally, the input representation x_i of the word w_i is computed by:

x_i = [x_{w_i}; c_i^{att}]
c_i^{att} = \sum_j \alpha_{ij} c'_{ij},

where x_{w_i} is a word embedding vector of the word w_i according to an embedding lookup table, c_i^{att} is a character embedding form of the word w_i, c_{ij} is the embedding of the jth character in w_i, c'_{ij} is the character window representation centered at c_{ij}, and α_{ij} is the contribution of c'_{ij} to c_i^{att}, which is computed by:

\alpha_{ij} = \frac{e^{f(x_{w_i},\, c'_{ij})}}{\sum_k e^{f(x_{w_i},\, c'_{ik})}}

f is a non-linear transformation function.
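The input layer of Section 4.1 can be sketched in numpy as follows. The paper only states that f is a non-linear transformation, so the feed-forward scorer below (matrix W and vector v) is an assumption of ours; embedding and window sizes follow Table 3 (Section 5.1), and all parameters are randomly initialized purely for illustration.

```python
# Sketch of the input layer (Section 4.1): word embedding concatenated with an
# attention-pooled character representation. The scorer f is an assumed MLP.
import numpy as np

rng = np.random.default_rng(0)
WORD_DIM, CHAR_DIM, CHAR_WIN = 50, 30, 2           # sizes from Table 3

def char_windows(char_vecs, win=CHAR_WIN):
    # c'_ij: concatenation of the character embeddings in a window around position j.
    padded = np.vstack([np.zeros((win, CHAR_DIM)), char_vecs, np.zeros((win, CHAR_DIM))])
    return np.stack([padded[j:j + 2 * win + 1].reshape(-1) for j in range(len(char_vecs))])

def input_representation(word_vec, char_vecs, W, v):
    # alpha_ij = softmax_j f(x_wi, c'_ij);   c_i^att = sum_j alpha_ij * c'_ij
    windows = char_windows(char_vecs)
    scores = np.array([v @ np.tanh(W @ np.concatenate([word_vec, w])) for w in windows])
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    c_att = alpha @ windows
    return np.concatenate([word_vec, c_att])        # x_i = [x_wi ; c_i^att]

# Toy usage: one word of four characters, randomly initialized embeddings.
word_vec = rng.normal(size=WORD_DIM)
char_vecs = rng.normal(size=(4, CHAR_DIM))
window_dim = (2 * CHAR_WIN + 1) * CHAR_DIM
W = rng.normal(scale=0.1, size=(60, WORD_DIM + window_dim))  # 60: character hidden size (Table 3)
v = rng.normal(scale=0.1, size=60)
x = input_representation(word_vec, char_vecs, W, v)
print(x.shape)   # (200,) = 50-dim word embedding plus 150-dim pooled character windows
```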
Figure 4: Structure of the constituent hierarchy prediction model. \overrightarrow{h}_i denotes the left-to-right encoder hidden units; \overleftarrow{h}_i denotes the right-to-left encoder hidden units; s denotes the decoder hidden state vector; and y_{ij} is the jth label of the word w_i.

4.2 Encoder Layer

The encoder first uses a window strategy to represent input nodes with their corresponding local context nodes. Formally, a word window representation takes the form

x'_i = [x_{i-win}; \ldots; x_i; \ldots; x_{i+win}].

Second, the encoder scans the input sentence and generates hidden units for each input word using a recurrent neural network (RNN), which represents features of the word from the global sequence. Formally, given the windowed input nodes x'_1, x'_2, ..., x'_n for the sentence w_1, w_2, ..., w_n, the RNN layer calculates a hidden node sequence h_1, h_2, ..., h_n.

Long Short-Term Memory (LSTM) mitigates the vanishing gradient problem in RNN training, by introducing gates (i.e., input i, forget f and output o) and a cell memory vector c. We use the variation of Graves and Schmidhuber (2008). Formally, the values in the LSTM hidden layers are computed as follows:

i_i = \sigma(W_1 x'_i + W_2 h_{i-1} + W_3 \odot c_{i-1} + b_1)
f_i = 1 - i_i
\tilde{c}_i = \tanh(W_4 x'_i + W_5 h_{i-1} + b_2)
c_i = f_i \odot c_{i-1} + i_i \odot \tilde{c}_i
o_i = \sigma(W_6 x'_i + W_7 h_{i-1} + W_8 \odot c_i + b_3)
h_i = o_i \odot \tanh(c_i),

where ⊙ is pair-wise multiplication. Further, in order to collect features for x_i from both x'_1, ..., x'_{i-1} and x'_{i+1}, ..., x'_n, we use a bidirectional variation (Schuster and Paliwal, 1997; Graves et al., 2013). As shown in Figure 4, the hidden units are generated by concatenating the corresponding hidden layers of a left-to-right LSTM \overrightarrow{h}_i and a right-to-left LSTM \overleftarrow{h}_i, where \overleftrightarrow{h}_i = [\overrightarrow{h}_i; \overleftarrow{h}_i] for each word w_i.
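The encoder of Section 4.2 can be sketched in numpy as below. It follows the coupled-gate LSTM equations above (forget gate f_i = 1 − i_i, with W3 and W8 acting element-wise on the cell vector) and the bidirectional concatenation; weight shapes, initialization, and the toy input dimensions are illustrative assumptions, not the trained model.

```python
# Sketch of the encoder layer (Section 4.2): coupled-gate LSTM run in both directions.
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    # Equations of Section 4.2; W3 and W8 are element-wise (diagonal) cell terms.
    i = sigmoid(p["W1"] @ x + p["W2"] @ h_prev + p["W3"] * c_prev + p["b1"])
    f = 1.0 - i
    c_tilde = np.tanh(p["W4"] @ x + p["W5"] @ h_prev + p["b2"])
    c = f * c_prev + i * c_tilde
    o = sigmoid(p["W6"] @ x + p["W7"] @ h_prev + p["W8"] * c + p["b3"])
    return o * np.tanh(c), c

def run_lstm(xs, p, hidden):
    h, c, hs = np.zeros(hidden), np.zeros(hidden), []
    for x in xs:
        h, c = lstm_step(x, h, c, p)
        hs.append(h)
    return hs

def bi_encoder(xs, p_fwd, p_bwd, hidden):
    fwd = run_lstm(xs, p_fwd, hidden)
    bwd = run_lstm(xs[::-1], p_bwd, hidden)[::-1]
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]   # h_i = [->h_i ; <-h_i]

def init(in_dim, hidden):
    p = {k: rng.normal(scale=0.1, size=(hidden, in_dim)) for k in ("W1", "W4", "W6")}
    p.update({k: rng.normal(scale=0.1, size=(hidden, hidden)) for k in ("W2", "W5", "W7")})
    p.update({k: rng.normal(scale=0.1, size=hidden) for k in ("W3", "W8", "b1", "b2", "b3")})
    return p

# Toy usage: a 5-word sentence of windowed inputs x'_i, encoded into
# 100-dimensional hidden states per direction (Table 3).
xs = [rng.normal(size=250) for _ in range(5)]
H = 100
states = bi_encoder(xs, init(250, H), init(250, H), H)
print(len(states), states[0].shape)   # 5 (200,)
```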
4.3 Decoder Layer

The decoder hidden layer uses two different LSTMs to generate the s-type and e-type sequences of constituent labels from each encoder hidden output, respectively, as shown in Figure 4. Each constituent hierarchy is generated bottom-up recurrently. In particular, a sequence of state vectors is generated recurrently, with each state yielding an output constituent label. The process starts with a zero state vector and ends when a NULL constituent is generated. The recurrent state transition process is achieved using an LSTM model, with the hidden vectors of the encoder layer being used for context features.

Formally, for word w_i, the value of the jth state unit s_{ij} of the LSTM is computed by:

s_{ij} = f(s_{i,j-1}, a_{ij}, \overleftrightarrow{h}_i),  [1]

where the context a_{ij} is computed by:

a_{ij} = \sum_k \beta_{ijk} \overleftrightarrow{h}_k
\beta_{ijk} = \frac{e^{f(s_{i,j-1},\, \overleftrightarrow{h}_k)}}{\sum_{k'} e^{f(s_{i,j-1},\, \overleftrightarrow{h}_{k'})}}

[1] Here, different from typical MT models (Bahdanau et al., 2015), the chain is predicted sequentially in a feed-forward way with no feedback of the prediction made. We found that this fast alternative gives similar results.

Here \overleftrightarrow{h}_k refers to the encoder hidden vector for w_k. The weights of contribution β_{ijk} are computed using the attention mechanism (Bahdanau et al., 2015). The constituent labels are generated from each state unit s_{ij}, where each constituent label y_{ij} is the output of a SOFTMAX function,

p(y_{ij} = l) = \frac{e^{s_{ij}^{\top} W_l}}{\sum_k e^{s_{ij}^{\top} W_k}}

y_{ij} = l denotes that the jth label of the ith word is l (l ∈ L). As shown in Figure 4, the SOFTMAX functions are applied to the state units of the decoder, generating hierarchical labels bottom-up, until the default label NULL is predicted.

4.4 Training

We use two separate models to assign the s-type and e-type labels, respectively. For training each constituent hierarchy predictor, we minimize the following training objective:

L(\theta) = -\sum_{i=1}^{T} \sum_{j=1}^{Z_i} \log p_{ijo} + \frac{\lambda}{2} \|\theta\|^2,

where T is the length of the sentence, Z_i is the depth of the constituent hierarchy of the word w_i, and p_{ijo} stands for p(y_{ij} = o), which is given by the SOFTMAX function, and o is the gold label. We apply back-propagation, using momentum stochastic gradient descent (Sutskever et al., 2013) with a learning rate of η = 0.01 for optimization and a regularization parameter λ = 10^{-6}.

5 Experiments

5.1 Experiment Settings

Our English data are taken from the Wall Street Journal (WSJ) sections of the Penn Treebank (Marcus et al., 1993). We use sections 2-21 for training, section 24 for system development, and section 23 for final performance evaluation. Our Chinese data are taken from version 5.1 of the Penn Chinese Treebank (CTB) (Xue et al., 2005). We use articles 001-270 and 440-1151 for training, articles 301-325 for system development, and articles 271-300 for final performance evaluation. For both English and Chinese data, we adopt ZPar [2] for POS tagging, and use ten-fold jackknifing to assign POS tags automatically to the training data. In addition, we use ten-fold jackknifing to assign constituent hierarchies automatically to the training data for training the parser using the constituent hierarchy predictor.

We use the F1 score to evaluate constituent hierarchy prediction. For example, if the prediction is "S→S→VP→NP" and the gold is "S→NP→NP", the evaluation process matches the two hierarchies bottom-up. The precision is 2/4 = 0.5, the recall is 2/3 = 0.66 and the F1 score is 0.57. A label is counted as correct if and only if it occurs at the correct position. We use EVALB to evaluate parsing performance, including labelled precision (LP), labelled recall (LR), and bracketing F1. [3]

Table 3: Hyper-parameter settings.

hyper-parameter               value
Word embedding size           50
Word window size              2
Character embedding size      30
Character window size         2
LSTM hidden layer size        100
Character hidden layer size   60

Table 4: Performance of the constituent hierarchy predictor and the corresponding parser on the WSJ dev dataset. n-layer denotes an LSTM model with n hidden layers.

          s-type   e-type   parser
1-layer   93.39    81.50    90.43
2-layer   93.76    83.37    90.72
3-layer   93.84    83.42    90.80

5.2 Model Settings

For training the constituent hierarchy prediction model, gold constituent labels are derived from labelled constituency trees in the training data. The hyper-parameters are chosen according to development tests, and the values are shown in Table 3.

For the shift-reduce constituency parser, we set the beam size to 16 for both training and decoding, which achieves a good tradeoff between efficiency and accuracy (Zhu et al., 2013). The optimal training iteration number is determined on the development sets.

[2] https://github.com/SUTDNLP/ZPar
[3] http://nlp.cs.nyu.edu/evalb
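The bottom-up matching used to score constituent hierarchy prediction in Section 5.1 can be made concrete with a short sketch; the function name is ours, and the example reproduces the precision 2/4, recall 2/3 and F1 0.57 computation given above.

```python
# Sketch of the hierarchy-prediction F1 from Section 5.1: hierarchies are matched
# bottom-up, and a label counts as correct only if it appears at the same position.
def hierarchy_prf(pred, gold):
    correct = sum(p == g for p, g in zip(reversed(pred), reversed(gold)))
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

# Worked example from Section 5.1 (hierarchies written top-down, as in the paper).
pred = ["S", "S", "VP", "NP"]
gold = ["S", "NP", "NP"]
print(hierarchy_prf(pred, gold))   # (0.5, 0.666..., 0.571...)
```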

5.3 Results of Constituent Hierarchy Prediction

Table 4 shows the results of constituent hierarchy prediction, where word and character embeddings are randomly initialized, and fine-tuned during training. The third column shows the development parsing accuracies when the labels are used for lookahead features. As Table 4 shows, when the number of hidden layers increases, both s-type and e-type constituent hierarchy prediction improve. The accuracy of e-type prediction is relatively lower due to right-branching in the treebank, which makes e-type hierarchies longer than s-type hierarchies. In addition, a 3-layer LSTM does not give significant improvements compared to a 2-layer LSTM. For a better tradeoff between efficiency and accuracy, we choose the 2-layer LSTM as our constituent hierarchy predictor.

Table 5: Performance of the constituent hierarchy predictor and the corresponding parser on the WSJ dev dataset. all denotes the proposed model without ablation. wins denotes input windows. chars denotes character-based attention.

                       s-type   e-type   parser
all                    93.76    83.37    90.72
all w/o wins           93.62    83.34    90.58
all w/o chars          93.51    83.21    90.33
all w/o chars & wins   93.12    82.36    89.18

Table 5 shows ablation results for constituent hierarchy prediction given by different reduced architectures, which include an architecture without character embeddings and an architecture with neither character embeddings nor input windows. We find that the original architecture achieves the highest performance on constituent hierarchy prediction, compared to the two baselines. The baseline only without character embeddings has relatively small influence on constituent hierarchy prediction. On the other hand, the baseline only without input word windows has relatively smaller influence on constituent hierarchy prediction. Nevertheless, both of these two ablation architectures lead to lower parsing accuracies. The baseline removing both the character embeddings and the input word windows has a relatively low F-score.

Table 6: Comparison of related work on the WSJ test set. * denotes neural parsing; † denotes methods using a shift-reduce framework.

Parser                           LR     LP     F1
Fully-supervised
Ratnaparkhi (1997)               86.3   87.5   86.9
Charniak (2000)                  89.5   89.9   89.5
Collins (2003)                   88.1   88.3   88.2
Sagae and Lavie (2005)†          86.1   86.0   86.0
Sagae and Lavie (2006)†          87.8   88.1   87.9
Petrov and Klein (2007)          90.1   90.2   90.1
Carreras et al. (2008)           90.7   91.4   91.1
Shindo et al. (2012)             N/A    N/A    91.1
Zhu et al. (2013)†               90.2   90.7   90.4
Socher et al. (2013)*            N/A    N/A    90.4
Vinyals et al. (2015)*           N/A    N/A    88.3
Cross and Huang (2016)*†         N/A    N/A    91.3
Dyer et al. (2016)*†             N/A    N/A    91.2
This work                        91.3   92.1   91.7
Ensemble
Shindo et al. (2012)             N/A    N/A    92.4
Vinyals et al. (2015)*           N/A    N/A    90.5
Rerank
Charniak and Johnson (2005)      91.2   91.8   91.5
Huang (2008)                     92.2   91.2   91.7
Dyer et al. (2016)*†             N/A    N/A    93.3
Semi-supervised
McClosky et al. (2006)           92.1   92.5   92.3
Huang and Harper (2009)          91.1   91.6   91.3
Huang et al. (2010)              91.4   91.8   91.6
Zhu et al. (2013)†               91.1   91.5   91.3
Durrett and Klein (2015)*        N/A    N/A    91.1

5.4 Final Results

For English, we compare the final results with previous related work on the WSJ test sets. As shown in Table 6 [4], our model achieves 1.3% F1 improvement compared to the baseline parser with fully-supervised learning (Zhu et al., 2013). Our model outperforms the state-of-the-art fully-supervised systems (Carreras et al., 2008; Shindo et al., 2012) by 0.6% F1. In addition, our fully-supervised model also catches up with many state-of-the-art semi-supervised models (Zhu et al., 2013;

[4] We treat the methods as semi-supervised if they use pre-trained word embeddings, word clusters (e.g., Brown clusters) or extra resources.

Huang and Harper, 2009; Huang et al., 2010; Durrett and Klein, 2015) by achieving 91.7% F1 on the WSJ test set. The size of our model is much smaller than the semi-supervised model of Zhu et al. (2013), which contains rich features from a large automatically parsed corpus. In contrast, our model is about the same in size compared to the baseline parser.

We carry out Chinese experiments with the same models, and compare the final results with previous related work on the CTB test set. As shown in Table 7, our model achieves 2.3% F1 improvement compared to the state-of-the-art baseline system with fully-supervised learning (Zhu et al., 2013), which is by far the best result in the literature. In addition, our fully-supervised model is also comparable to many state-of-the-art semi-supervised models (Zhu et al., 2013; Wang and Xue, 2014; Wang et al., 2015; Dyer et al., 2016) by achieving 85.5% F1 on the CTB test set. Wang and Xue (2014) and Wang et al. (2015) do joint POS tagging and parsing.

Table 7: Comparison of related work on the CTB 5.1 test set. * denotes neural parsing; † denotes methods using a shift-reduce framework; ‡ denotes joint POS tagging and parsing.

Parser                           LR     LP     F1
Fully-supervised
Charniak (2000)                  79.6   82.1   80.8
Bikel (2004)                     79.3   82.0   80.6
Petrov and Klein (2007)          81.9   84.8   83.3
Zhu et al. (2013)†               82.1   84.3   83.2
Wang et al. (2015)‡              N/A    N/A    83.2
Dyer et al. (2016)*†             N/A    N/A    84.6
This work                        85.2   85.9   85.5
Rerank
Charniak and Johnson (2005)      80.8   83.8   82.3
Dyer et al. (2016)*†             N/A    N/A    86.9
Semi-supervised
Zhu et al. (2013)†               84.4   86.8   85.6
Wang and Xue (2014)‡             N/A    N/A    86.3
Wang et al. (2015)‡              N/A    N/A    86.6

5.5 Comparison of Speed

Table 8 shows the running times of various parsers on the test sets, on an Intel 2.2 GHz processor with 16G memory. Our parser is much faster than the related parsers with the same shift-reduce framework (Sagae and Lavie, 2005; Sagae and Lavie, 2006). Compared to the baseline parser, our parser gives significant improvement on accuracies (90.4% to 91.7% F1) at the speed of 79.2 sentences per second [5], in contrast to 89.5 sentences per second on the standard WSJ benchmark.

[5] The constituent hierarchy prediction is excluded, which processes an average of 150 sentences per second on a single CPU. The cost of this step is far less than the cost of parsing, and can be essentially eliminated by pipelining the constituent hierarchy prediction and the shift-reduce decoder, by launching the constituent hierarchy predictor first, and then starting parsing in parallel as soon as the lookahead output is available for the first sentence, since the lookahead will outpace the parsing from that point forward.

Table 8: Comparison of running times on the test set, where the time for loading models is excluded. The running times of related parsers are taken from Zhu et al. (2013).

Parser                      #Sent/Second
Ratnaparkhi (1997)          Unk
Collins (2003)              3.5
Charniak (2000)             5.7
Sagae and Lavie (2005)      3.7
Sagae and Lavie (2006)      2.2
Petrov and Klein (2007)     6.2
Carreras et al. (2008)      Unk
Zhu et al. (2013)           89.5
This work                   79.2

6 Error Analysis

We conduct error analysis by measuring parsing accuracies against different phrase types, constituents of different span lengths, and different sentence lengths.

6.1 Phrase Type

Table 9 shows the accuracies of the baseline and the final parsers with lookahead features on 9 common phrase types.

Table 9: Comparison between the parsers with lookahead features on different phrase types, with the corresponding constituent hierarchy predictor performances.

                              NP      VP      S       PP      SBAR    ADVP    ADJP    WHNP    QP
baseline                      92.06   90.63   90.28   87.93   86.93   84.83   74.12   95.03   89.32
with lookahead features       93.10   92.45   91.78   88.84   88.59   85.64   74.50   96.18   89.63
improvement                   +1.04   +1.82   +1.50   +0.91   +1.66   +0.81   +0.38   +1.15   +0.31
constituent hierarchy s-type  95.18   97.51   93.37   98.01   92.14   88.94   79.88   96.18   91.70
constituent hierarchy e-type  91.98   76.82   80.72   84.80   66.82   85.01   71.16   95.13   91.02

As the results show, while the parser with lookahead features achieves improvements on all of the frequent phrase types, there are relatively higher improvements on VP, S, SBAR and WHNP. The constituent hierarchy predictor has relatively better performance on s-type labels for the constituents VP, WHNP and PP, which are prone to errors by the baseline system. The constituent hierarchy can give guidance to the constituent parser for tackling the issue. Compared to the s-type constituent hierarchy, the e-type constituent hierarchy

is relatively more difficult to predict, particularly for the constituents with long spans such as VP, S and SBAR. Despite this, the e-type constituent hierarchies with relatively low accuracies also benefit prediction of constituents with long spans.

Figure 5: Comparison with the baseline on spans of different lengths.

6.2 Span Length

Figure 5 shows the F1-scores of the two parsers on constituents with different span lengths. As the results show, lookahead features are helpful on both large spans and small spans, and the performance gap between the two parsers is larger as the size of span increases. This reflects the usefulness of long-range information captured by the constituent hierarchy predictor and lookahead features.

6.3 Sentence Length

Figure 6 shows the F1-scores of the two parsers on sentences of different lengths. As the results show, the parser with lookahead features outperforms the baseline system on both short sentences and long sentences. Also, the performance gap between the two parsers is larger as the length of sentence increases.

The constituent hierarchy predictors generate hierarchical constituents for each input word using global information. For longer sentences, the predictors yield deeper constituent hierarchies, offering corresponding lookahead features. As a result, compared to the baseline parser, the performance of the parser with lookahead features decreases more slowly as the length of the sentences increases.

Figure 6: Comparison with the baseline on sentences of different lengths. Sentences with length [0, 10) fall in the bin 10.

7 Related Work

Our lookahead features are similar in spirit to the pruners of Roark and Hollingshead (2009) and Zhang et al. (2010b), which infer the maximum length of constituents that a particular word can start or end. However, our method is different in three main ways. First, rather than using a CRF with sparse local word window features, a neural network is used for dense global features on the sentence. Second, not only the size of constituents but also the constituent hierarchy is identified for each word. Third, the results are added into a transition-based parser as soft features, rather than being used as hard constraints for a chart parser.

Our concept of constituent hierarchies is similar to supertags in the sense that both are shallow parses. For lexicalized grammars such as Combinatory Categorial Grammar (CCG), Tree-Adjoining Grammar (TAG) and Head-Driven Phrase Structure Grammar (HPSG), each word in the input sentence is assigned one or more supertags, which are used to identify the syntactic role of the word to constrain parsing (Clark, 2002; Clark and Curran, 2004; Carreras et al., 2008; Ninomiya et al., 2006; Dridan et al., 2008; Faleńska et al., 2015). For a lexicalized grammar, supertagging can benefit the parsing in both accuracy and efficiency by offering almost-parsing information. In particular, Carreras et al. (2008) used the concept of spine for TAG (Schabes, 1992; Vijay-Shanker and Joshi, 1988), which is similar to our constituent hierarchy. However, there are three differences. First, the spine is defined to describe the main syntactic tree structure with a series

of unary projections, while the constituent hierarchy is defined to describe how words can start or end hierarchical constituents (it can be empty if the word cannot start or end constituents). Second, spines are extracted from gold trees and used to prune the search space of parsing as hard constraints. In contrast, we use constituent hierarchies as soft features. Third, Carreras et al. (2008) use spines to prune chart parsing, while we use constituent hierarchies to improve a linear shift-reduce parser.

For lexicalized grammars, supertags can benefit parsing significantly since they contain rich syntactic information as almost parsing (Bangalore and Joshi, 1999). Recently, there has been a line of work on better supertagging. Zhang et al. (2010a) proposed efficient methods to obtain supertags for HPSG parsing using dependency information. Xu et al. (2015) and Vaswani et al. (2016) leverage recurrent neural networks for supertagging for CCG parsing. In contrast, our models predict the constituent hierarchy instead of a single supertag for each word in the input sentence.

Our constituent hierarchy predictor is also related to sequence-to-sequence learning (Sutskever et al., 2014), which has been successfully used in neural machine translation (Bahdanau et al., 2015). The neural model encodes the source-side sentence into dense vectors, and then uses them to generate the target side word by word. There has also been work that directly applies sequence-to-sequence models for constituent parsing, which generates constituent trees given raw sentences (Vinyals et al., 2015; Luong et al., 2015). Compared to Vinyals et al. (2015), who predict a full parse tree from the input, our predictors tackle a much simpler task, by predicting the constituent hierarchies of each word separately. In addition, the outputs of the predictors are used for soft lookahead features in bottom-up parsing, rather than being taken as output structures directly.

By integrating a neural constituent hierarchy predictor, our parser is related to neural network models for parsing, which have given competitive accuracies for both constituency parsing (Dyer et al., 2016; Cross and Huang, 2016; Watanabe and Sumita, 2015) and dependency parsing (Chen and Manning, 2014; Zhou et al., 2015; Dyer et al., 2015). In particular, our parser is more closely related to neural models that integrate discrete manual features (Socher et al., 2013; Durrett and Klein, 2015). Socher et al. (2013) use neural features to rerank a sparse baseline parser; Durrett and Klein (2015) directly integrate sparse features into neural layers in a chart parser. In contrast, we integrate neural information into sparse features in the form of lookahead features.

There has also been work on lookahead features for parsing. Tsuruoka et al. (2011) run a baseline parser for a few future steps, and use the output actions to guide the current action. In contrast to their model, our model leverages full sentential information, yet is significantly faster.

Previous work investigated more efficient parsing without loss of accuracy, which is required by real-time applications such as web parsing. Zhang et al. (2010b) introduced a chart pruner to accelerate a CCG parser. Kummerfeld et al. (2010) proposed a self-training method focusing on increasing the speed of a CCG parser rather than its accuracy.

8 Conclusion

We proposed a novel constituent hierarchy predictor based on recurrent neural networks, aiming to capture global sentential information. The resulting constituent hierarchies are fed to a baseline shift-reduce parser as lookahead features, addressing limitations of shift-reduce parsers in not leveraging

right-hand side syntax for local decisions, yet maintaining the same model size and speed. The resulting fully-supervised parser outperforms the state-of-the-art baseline parser by achieving 91.7% F1 on the standard WSJ evaluation and 85.5% F1 on the standard CTB evaluation.

Acknowledgments

We thank the anonymous reviewers for their detailed and constructive comments, and the co-Editor-in-Chief Lillian Lee for her extremely detailed copy editing. This work is supported by T2MOE201301 of the Singapore Ministry of Education. Yue Zhang is the corresponding author.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2015. Improved transition-based parsing by modeling characters instead of words with LSTMs. In EMNLP, pages 349–359.
Srinivas Bangalore and Aravind K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics, 25(2):237–265.
Daniel M. Bikel. 2004. On the parameter space of generative lexicalized statistical parsing models. PhD thesis, University of Pennsylvania.
Xavier Carreras, Michael Collins, and Terry Koo. 2008. TAG, dynamic programming, and the perceptron for efficient, feature-rich parsing. In CoNLL, pages 9–16.
Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In ACL.
Eugene Charniak. 2000. A maximum-entropy-inspired parser. In ANLP, pages 132–139.
Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In EMNLP, pages 740–750.
Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, pages 1724–1734.
Stephen Clark and James R. Curran. 2004. The importance of supertagging for wide-coverage CCG parsing. In COLING, pages 282–288.
Stephen Clark. 2002. Supertagging for combinatory categorial grammar. In Proceedings of the Sixth International Workshop on Tree Adjoining Grammar and Related Frameworks, pages 101–106.
Michael Collins and Brian Roark. 2004. Incremental parsing with the perceptron algorithm. In ACL.
Michael Collins. 2003. Head-driven statistical models for natural language parsing. Computational Linguistics, 29(4):589–637.
James Cross and Liang Huang. 2016. Span-based constituency parsing with a structure-label system and provably optimal dynamic oracles. In EMNLP.
Rebecca Dridan, Valia Kordoni, and Jeremy Nicholson. 2008. Enhancing performance of lexicalised grammars. In ACL.
Greg Durrett and Dan Klein. 2015. Neural CRF parsing. In ACL, pages 302–312.
Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In ACL-IJCNLP, pages 334–343.
Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In NAACL, pages 199–209.
Agnieszka Faleńska, Anders Björkelund, Özlem Çetinoğlu, and Wolfgang Seeker. 2015. Stacking or supertagging for dependency parsing – what's the difference? In Proceedings of the 14th International Conference on Parsing Technologies.
Joshua Goodman. 1998. Parsing inside-out. PhD thesis, Harvard University.
Alex Graves and Jürgen Schmidhuber. 2008. Offline handwriting recognition with multidimensional recurrent neural networks. In NIPS, pages 545–552.
Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. 2013. Hybrid speech recognition with deep bidirectional LSTM. In IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 273–278.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
Zhongqiang Huang and Mary P. Harper. 2009. Self-training PCFG grammars with latent annotations across languages. In EMNLP, pages 832–841.

Zhongqiang Huang, Mary P. Harper, and Slav Petrov. 2010. Self-training with products of latent variable grammars. In EMNLP, pages 12–22.
Liang Huang. 2008. Forest reranking: Discriminative parsing with non-local features. In ACL, pages 586–594.
Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In AAAI.
Jonathan K. Kummerfeld, Jessika Roesner, Tim Dawborn, James Haggerty, James R. Curran, and Stephen Clark. 2010. Faster parsing by supertagger adaptation. In ACL, pages 345–355.
Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2015. Multi-task sequence to sequence learning. In ICLR.
Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In HLT-NAACL, pages 152–159.
Takashi Ninomiya, Takuya Matsuzaki, Yoshimasa Tsuruoka, Yusuke Miyao, and Jun'ichi Tsujii. 2006. Extremely lexicalized models for accurate and fast HPSG parsing. In EMNLP, pages 155–163.
Slav Petrov and Dan Klein. 2007. Improved inference for unlexicalized parsing. In HLT-NAACL, pages 404–411.
Adwait Ratnaparkhi. 1997. A linear observed time statistical parser based on maximum entropy models. In EMNLP.
Brian Roark and Kristy Hollingshead. 2009. Linear complexity context-free parsing pipelines via chart constraints. In HLT-NAACL, pages 647–655.
Kenji Sagae and Alon Lavie. 2005. A classifier-based parser with linear run-time complexity. In IWPT, pages 125–132.
Kenji Sagae and Alon Lavie. 2006. Parser combination by reparsing. In HLT-NAACL, pages 129–132.
Cicero D. Santos and Bianca Zadrozny. 2014. Learning character-level representations for part-of-speech tagging. In ICML, pages 1818–1826.
Yves Schabes. 1992. Stochastic tree-adjoining grammars. In Proceedings of the Workshop on Speech and Natural Language, pages 140–145.
Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.
Hiroyuki Shindo, Yusuke Miyao, Akinori Fujino, and Masaaki Nagata. 2012. Bayesian symbol-refined tree substitution grammars for syntactic parsing. In ACL, pages 440–448.
Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013. Parsing with compositional vector grammars. In ACL, pages 455–465.
Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey E. Hinton. 2013. On the importance of initialization and momentum in deep learning. In ICML, pages 1139–1147.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112.
Yoshimasa Tsuruoka, Yusuke Miyao, and Jun'ichi Kazama. 2011. Learning with lookahead: Can history-based models rival globally optimized models? In CoNLL, pages 238–246.
Ashish Vaswani, Yonatan Bisk, and Kenji Sagae. 2016. Supertagging with LSTMs. In NAACL.
K. Vijay-Shanker and Aravind K. Joshi. 1988. A study of tree adjoining grammars. Citeseer.
Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey E. Hinton. 2015. Grammar as a foreign language. In NIPS, pages 2773–2781.
Zhiguo Wang and Nianwen Xue. 2014. Joint POS tagging and transition-based constituent parsing in Chinese with non-local features. In ACL, pages 733–742.
Zhiguo Wang, Haitao Mi, and Nianwen Xue. 2015. Feature optimization for constituent parsing via neural networks. In ACL-IJCNLP, pages 1138–1147.
Taro Watanabe and Eiichiro Sumita. 2015. Transition-based neural constituent parsing. In ACL, pages 1169–1179.
Wenduan Xu, Michael Auli, and Stephen Clark. 2015. CCG supertagging with a recurrent neural network. In ACL-IJCNLP, pages 250–255.
Naiwen Xue, Fei Xia, Fu-dong Chiou, and Martha Palmer. 2005. The Penn Chinese Treebank: Phrase structure annotation of a large corpus. Natural Language Engineering, 11(2):207–238.

Yue Zhang and Stephen Clark. 2009. Transition-based parsing of the Chinese treebank using a global discriminative model. In IWPT, pages 162–171.
Yue Zhang and Stephen Clark. 2011. Syntactic processing using the generalized perceptron and beam search. Computational Linguistics, 37(1):105–151.
Yaozhong Zhang, Takuya Matsuzaki, and Jun'ichi Tsujii. 2010a. A simple approach for HPSG supertagging using dependency information. In NAACL-HLT, pages 645–648.
Yue Zhang, Byung-Gyu Ahn, Stephen Clark, Curt Van Wyk, James R. Curran, and Laura Rimell. 2010b. Chart pruning for fast lexicalised-grammar parsing. In COLING, pages 1471–1479.
Hao Zhou, Yue Zhang, Shujian Huang, and Jiajun Chen. 2015. A neural probabilistic structured-prediction model for transition-based dependency parsing. In ACL, pages 1213–1222.
Muhua Zhu, Yue Zhang, Wenliang Chen, Min Zhang, and Jingbo Zhu. 2013. Fast and accurate shift-reduce constituent parsing. In ACL, pages 434–443.