Transactions of the Association for Computational Linguistics, vol. 4, pp. 313–327, 2016. Action Editor: Marco Kuhlmann.
Submission batch: 2/2016; Published 7/2016.

© 2016 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations

Eliyahu Kiperwasser, Computer Science Department, Bar-Ilan University, Ramat-Gan, Israel. elikip@gmail.com
Yoav Goldberg, Computer Science Department, Bar-Ilan University, Ramat-Gan, Israel. yoav.goldberg@gmail.com

Abstract

We present a simple and effective scheme for dependency parsing which is based on bidirectional-LSTMs (BiLSTMs). Each sentence token is associated with a BiLSTM vector representing the token in its sentential context, and feature vectors are constructed by concatenating a few BiLSTM vectors. The BiLSTM is trained jointly with the parser objective, resulting in very effective feature extractors for parsing. We demonstrate the effectiveness of the approach by applying it to a greedy transition-based parser as well as to a globally optimized graph-based parser. The resulting parsers have very simple architectures, and match or surpass the state-of-the-art accuracies on English and Chinese.

1 Introduction

The focus of this paper is on feature representation for dependency parsing, using recent techniques from the neural-networks ("deep learning") literature. Modern approaches to dependency parsing can be broadly categorized into graph-based and transition-based parsers (Kübler et al., 2009). Graph-based parsers (McDonald, 2006) treat parsing as a search-based structured prediction problem in which the goal is learning a scoring function over dependency trees such that the correct tree is scored above all other trees. Transition-based parsers (Nivre, 2004; Nivre, 2008) treat parsing as a sequence of actions that produce a parse tree, and a classifier is trained to score the possible actions at each stage of the process and guide the parsing process. Perhaps the simplest graph-based parsers are arc-factored (first order) models (McDonald, 2006), in which the scoring function for a tree decomposes over the individual arcs of the tree. More elaborate models look at larger (overlapping) parts, requiring more sophisticated inference and training algorithms (Martins et al., 2009; Koo and Collins, 2010). The basic transition-based parsers work in a greedy manner, performing a series of locally-optimal decisions, and boast very fast parsing speeds. More advanced transition-based parsers introduce some search into the process using a beam (Zhang and Clark, 2008) or dynamic programming (Huang and Sagae, 2010).

Regardless of the details of the parsing framework being used, a crucial step in parser design is choosing the right feature function for the underlying statistical model. Recent work (see Section 2.2 for an overview) attempts to alleviate parts of the feature-function design problem by moving from linear to non-linear models, enabling the modeler to focus on a small set of "core" features and leaving it up to the machine-learning machinery to come up with good feature combinations (Chen and Manning, 2014; Pei et al., 2015; Lei et al., 2014; Taub-Tabib et al., 2015). However, the need to carefully define a set of core features remains. For example, the work of Chen and Manning (2014) uses 18 different elements in its feature function, while the work of Pei et al. (2015) uses 21 different elements. Other works, notably Dyer et al. (2015) and Le and Zuidema (2014), propose more sophisticated feature representations, in which the feature engineering is replaced with architecture engineering. In this work, we suggest an approach which is much simpler in terms of both feature engineering and architecture engineering.

Our proposal (Section 3) is centered around BiRNNs (Irsoy and Cardie, 2014; Schuster and Paliwal, 1997), and more specifically BiLSTMs (Graves, 2008), which are strong and trainable sequence models (see Section 2.3). The BiLSTM excels at representing elements in a sequence (i.e., words) together with their contexts, capturing the element and an "infinite" window around it. We represent each word by its BiLSTM encoding, and use a concatenation of a minimal set of such BiLSTM encodings as our feature function, which is then passed to a non-linear scoring function (multi-layer perceptron). Crucially, the BiLSTM is trained with the rest of the parser in order to learn a good feature representation for the parsing problem. If we set aside the inherent complexity of the BiLSTM itself and treat it as a black box, our proposal results in a pleasingly simple feature extractor.

We demonstrate the effectiveness of the approach by using the BiLSTM feature extractor in two parsing architectures, transition-based (Section 4) as well as graph-based (Section 5). In the graph-based parser, we jointly train a structured-prediction model on top of a BiLSTM, propagating errors from the structured objective all the way back to the BiLSTM feature-encoder. To the best of our knowledge, we are the first to perform such end-to-end training of a structured prediction model and a recurrent feature extractor for non-sequential outputs.¹

Aside from the novelty of the BiLSTM feature extractor and the end-to-end structured training, we rely on existing models and techniques from the parsing and structured prediction literature. We stick to the simplest parsers in each category – greedy inference for the transition-based architecture, and a first-order, arc-factored model for the graph-based architecture. Despite the simplicity of the parsing architectures and the feature functions, we achieve near state-of-the-art parsing accuracies in both English (93.1 UAS) and Chinese (86.6 UAS), using a first-order parser with two features and while training solely on Treebank data, without relying on semi-supervised signals such as pre-trained word embeddings (Chen and Manning, 2014), word-clusters (Koo et al., 2008), or techniques such as tri-training (Weiss et al., 2015). When also including pre-trained word embeddings, we obtain further improvements, with accuracies of 93.9 UAS (English) and 87.6 UAS (Chinese) for a greedy transition-based parser with 11 features, and 93.6 UAS (En) / 87.4 (Ch) for a greedy transition-based parser with 4 features.

¹ Structured training of sequence tagging models over RNN-based representations was explored by Chiu and Nichols (2016) and Lample et al. (2016).

2 Background and Notation

Notation: We use x1:n to denote a sequence of n vectors x1, ..., xn. Fθ(·) is a function parameterized with parameters θ. We write FL(·) as shorthand for FθL – an instantiation of F with a specific set of parameters θL. We use ◦ to denote the vector concatenation operation, and v[i] to denote an indexing operation taking the ith element of a vector v.

2.1 Feature Functions in Dependency Parsing

Traditionally, state-of-the-art parsers rely on linear models over hand-crafted feature functions. The feature functions look at core components (e.g. "word on top of stack", "leftmost child of the second-to-top word on the stack", "distance between the head and the modifier words"), and are comprised of several templates, where each template instantiates a binary indicator function over a conjunction of core elements (resulting in features of the form "word on top of stack is X and leftmost child is Y and ..."). The design of the feature function – which components to consider and which combinations of components to include – is a major challenge in parser design. Once a good feature function is proposed in a paper it is usually adopted in later works, and sometimes tweaked to improve performance. Examples of good feature functions are the feature-set proposed by Zhang and Nivre (2011) for transition-based parsing (including roughly 20 core components and 72 feature templates), and the feature-set proposed by McDonald et al. (2005) for graph-based parsing, with the paper listing 18 templates for a first-order parser, while the first-order feature extractor in the actual implementation's code (MSTParser²) includes roughly a hundred feature templates.

² http://www.seas.upenn.edu/~strctlrn/MSTParser/MSTParser.html
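To make the template mechanism concrete, the following sketch (ours; the core elements and template names are illustrative and not taken from MSTParser or any published feature set) shows how a few core elements are conjoined into sparse binary indicator features:

```python
# Sketch of traditional template-based feature extraction. The core
# elements and templates are illustrative, not from any actual parser.

def extract_features(core):
    """Return the active binary indicator features for a configuration,
    given a dict of its core elements."""
    s0w = core.get("s0_word", "<null>")    # word on top of stack
    s0t = core.get("s0_tag", "<null>")     # its POS tag
    b0w = core.get("b0_word", "<null>")    # first word in buffer
    s0lc = core.get("s0_lc", "<null>")     # leftmost child of s0
    return {
        # single-element templates
        "s0w=" + s0w,
        "s0t=" + s0t,
        # conjunction templates, e.g. "word on top of stack is X
        # and its leftmost child is Y"
        "s0w=%s+s0lc=%s" % (s0w, s0lc),
        "s0w=%s+s0t=%s+b0w=%s" % (s0w, s0t, b0w),
    }

print(extract_features({"s0_word": "jumped", "s0_tag": "VBD",
                        "b0_word": "over", "s0_lc": "fox"}))
```

Each active string acts as one dimension of a huge sparse feature vector; real feature sets instantiate dozens of such templates.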

The core features in a transition-based parser usually look at information such as the word-identity and part-of-speech (POS) tags of a fixed number of words on top of the stack, a fixed number of words on the top of the buffer, the modifiers (usually left-most and right-most) of items on the stack and on the buffer, the number of modifiers of these elements, parents of words on the stack, and the length of the spans spanned by the words on the stack. The core features of a first-order graph-based parser usually take into account the word and POS of the head and modifier items, as well as POS-tags of the items around the head and modifier, POS tags of items between the head and modifier, and the distance and direction between the head and modifier.

2.2 Related Research Efforts

Coming up with a good feature-set for a parser is a hard and time consuming task, and many researchers attempt to reduce the required manual effort. The work of Lei et al. (2014) suggests a low-rank tensor representation to automatically find good feature combinations. Taub-Tabib et al. (2015) suggest a kernel-based approach to implicitly consider all possible feature combinations over sets of core-features. The recent popularity of neural networks prompted a move from templates of sparse, binary indicator features to dense core feature encodings fed into non-linear classifiers. Chen and Manning (2014) encode each core feature of a greedy transition-based parser as a dense low-dimensional vector, and the vectors are then concatenated and fed into a non-linear classifier (multi-layer perceptron) which can potentially capture arbitrary feature combinations. Weiss et al. (2015) showed further gains using the same approach coupled with a somewhat improved set of core features, a more involved network architecture with skip-layers, beam search-decoding, and careful hyper-parameter tuning. Pei et al. (2015) apply a similar methodology to graph-based parsing. While the move to neural-network classifiers alleviates the need for hand-crafting feature combinations, the need to carefully define a set of core features remains. For example, the feature representation in Chen and Manning (2014) is a concatenation of 18 word vectors, 18 POS vectors and 12 dependency-label vectors.³

The above works tackle the effort in hand-crafting effective feature combinations. A different line of work attacks the feature-engineering problem by suggesting novel neural-network architectures for encoding the parser state, including intermediately-built subtrees, as vectors which are then fed to non-linear classifiers. Titov and Henderson (2007) encode the parser state using incremental sigmoid-belief networks. In the work of Dyer et al. (2015), the entire stack and buffer of a transition-based parser are encoded as stack-LSTMs, where each stack element is itself based on a compositional representation of parse trees. Le and Zuidema (2014) encode each tree node as two compositional representations capturing the inside and outside structures around the node, and feed the representations into a reranker. A similar reranking approach, this time based on convolutional neural networks, is taken by Zhu et al. (2015). Finally, in Kiperwasser and Goldberg (2016) we present an Easy-First parser based on a novel hierarchical-LSTM tree encoding. In contrast to these, the approach we present in this work results in much simpler feature functions, without resorting to elaborate network architectures or compositional tree representations.

Work by Vinyals et al. (2015) employs a sequence-to-sequence with attention architecture for constituency parsing. Each token in the input sentence is encoded in a deep-BiLSTM representation, and then the tokens are fed as input to a deep-LSTM that predicts a sequence of bracketing actions based on the already predicted bracketings as well as the encoded BiLSTM vectors. A trainable attention mechanism is used to guide the parser to relevant BiLSTM vectors at each stage. This architecture shares with ours the use of BiLSTM encoding and end-to-end training. The sequence of bracketing actions can be interpreted as a sequence of Shift and Reduce operations of a transition-based parser.

³ In all of these neural-network based approaches, the vector representations of words were initialized using pre-trained word-embeddings derived from a large corpus external to the training data. This puts the approaches in the semi-supervised category, making it hard to tease apart the contribution of the automatic feature-combination component from that of the semi-supervised component.

However, while the parser of Vinyals et al. relies on a trainable attention mechanism for focusing on specific BiLSTM vectors, the parsers in the transition-based family we use in Section 4 use a human-designed stack and buffer mechanism to manually direct the parser's attention. While the effectiveness of the trainable attention approach is impressive, the stack-and-buffer guidance of transition-based parsers results in more robust learning. Indeed, work by Cross and Huang (2016), published while we were working on the camera-ready version of this paper, shows that the same methodology as ours is highly effective also for greedy, transition-based constituency parsing, surpassing the beam-based architecture of Vinyals et al. (88.3 F vs. 89.8 F points) when trained on the Penn Treebank dataset and without using orthogonal methods such as ensembling and up-training.

2.3 Bidirectional Recurrent Neural Networks

Recurrent neural networks (RNNs) are statistical learners for modeling sequential data. An RNN allows one to model the ith element in a sequence based on the past – the elements x1:i up to and including it. The RNN model provides a framework for conditioning on the entire history x1:i without resorting to the Markov assumption which is traditionally used for modeling sequences. RNNs were shown to be capable of learning to count, as well as to model line lengths and complex phenomena such as bracketing and code indentation (Karpathy et al., 2015). Our proposed feature extractors are based on a bidirectional recurrent neural network (BiRNN), an extension of RNNs that takes into account both the past x1:i and the future xi:n. We use a specific flavor of RNN called a long short-term memory network (LSTM). For brevity, we treat RNN as an abstraction, without getting into the mathematical details of the implementation of the RNNs and LSTMs. For further details on RNNs and LSTMs, the reader is referred to Goldberg (2015) and Cho (2015).

The recurrent neural network (RNN) abstraction is a parameterized function RNNθ(x1:n) mapping a sequence of n input vectors x1:n, xi ∈ R^d, into a sequence of n output vectors h1:n, hi ∈ R^dout. Each output vector hi is conditioned on all the input vectors x1:i, and can be thought of as a summary of the prefix x1:i of x1:n. In our notation, we ignore the intermediate vectors h1:n−1 and take the output of RNNθ(x1:n) to be the vector hn.

A bidirectional RNN is composed of two RNNs, RNN_F and RNN_R, one reading the sequence in its regular order, and the other reading it in reverse. Concretely, given a sequence of vectors x1:n and a desired index i, the function BIRNNθ(x1:n, i) is defined as:

BIRNNθ(x1:n, i) = RNN_F(x1:i) ◦ RNN_R(xn:i)

The vector vi = BIRNN(x1:n, i) is then a representation of the ith item in x1:n, taking into account both the entire history x1:i and the entire future xi:n by concatenating the matching RNNs. We can view the BiRNN encoding of an item i as representing the item i together with a context of an infinite window around it.

Computational Complexity: Computing the BiRNN vector encoding the ith element of a sequence x1:n requires O(n) time for computing the two RNNs and concatenating their outputs. A naive approach of computing the bidirectional representation of all n elements would thus result in O(n²) computation. However, it is trivial to compute the BiRNN encoding of all sequence items in linear time by pre-computing RNN_F(x1:n) and RNN_R(xn:1), keeping the intermediate representations, and concatenating the required elements as needed.
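The linear-time computation can be made concrete with a short sketch (ours). We treat the RNN as an abstraction, here a simple tanh recurrence in NumPy with illustrative, randomly initialized parameters; the parser itself uses an LSTM, but the point is only that one forward pass and one backward pass yield all n bidirectional encodings:

```python
import numpy as np

def rnn_states(xs, W, U, b):
    """All hidden states h_1..h_n of a simple tanh RNN over xs. This
    stands in for the RNN abstraction; the parser uses an LSTM."""
    h, states = np.zeros(U.shape[0]), []
    for x in xs:
        h = np.tanh(W @ x + U @ h + b)
        states.append(h)
    return states

def birnn_encode_all(xs, fwd, bwd):
    """BiRNN encodings of all n positions in linear time: one forward
    pass gives RNN_F(x_1:i) for every i, one backward pass over the
    reversed sequence gives RNN_R(x_n:i) for every i; then concatenate."""
    f = rnn_states(xs, *fwd)
    r = rnn_states(xs[::-1], *bwd)[::-1]
    return [np.concatenate(p) for p in zip(f, r)]

rng = np.random.default_rng(0)
d_in, d_h = 4, 5                     # illustrative dimensions
make = lambda: (0.1 * rng.normal(size=(d_h, d_in)),
                0.1 * rng.normal(size=(d_h, d_h)), np.zeros(d_h))
xs = [rng.normal(size=d_in) for _ in range(6)]
vs = birnn_encode_all(xs, make(), make())
print(len(vs), vs[0].shape)          # 6 positions, each of dimension 2*d_h
```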
BiRNN Training: Initially, the BiRNN encodings vi do not capture any particular information. During training, the encoded vectors vi are fed into further network layers, until at some point a prediction is made, and a loss is incurred. The back-propagation algorithm is used to compute the gradients of all the parameters in the network (including the BiRNN parameters) with respect to the loss, and an optimizer is used to update the parameters according to the gradients. The training procedure causes the BiRNN function to extract from the input sequence x1:n the information relevant to the task at hand.

Going deeper: We use a variant of deep bidirectional RNN (or k-layer BiRNN) which is composed of k BiRNN functions BIRNN1, ..., BIRNNk that feed into each other: the output BIRNNℓ(x1:n, 1), ..., BIRNNℓ(x1:n, n) of BIRNNℓ becomes the input of BIRNNℓ+1.

Stacking BiRNNs in this way has been empirically shown to be effective (Irsoy and Cardie, 2014). In this work, we use BiRNNs and deep-BiRNNs interchangeably, specifying the number of layers when needed.

Historical Notes: RNNs were introduced by Elman (1990), and extended to BiRNNs by Schuster and Paliwal (1997). The LSTM variant of RNNs is due to Hochreiter and Schmidhuber (1997). BiLSTMs were recently popularized by Graves (2008), and deep BiRNNs were introduced to NLP by Irsoy and Cardie (2014), who used them for sequence tagging. In the context of parsing, Lewis et al. (2016) and Vaswani et al. (2016) use a BiLSTM sequence tagging model to assign a CCG supertag to each token in the sentence. Lewis et al. (2016) feed the resulting supertag sequence into an A* CCG parser. Vaswani et al. (2016) add an additional LSTM layer which receives the BiLSTM representation together with the k-best supertags for each word and outputs the most likely supertag given previous tags, and then feed the predicted supertags to a discriminatively trained parser. In both works, the BiLSTM is trained to produce accurate CCG supertags, and is not aware of the global parsing objective.

3 Our Approach

We propose to replace the hand-crafted feature functions with minimally-defined feature functions which make use of automatically learned Bidirectional LSTM representations. Given an n-word input sentence with words w1, ..., wn together with the corresponding POS tags t1, ..., tn,⁴ we associate each word wi and POS tag ti with embedding vectors e(wi) and e(ti), and create a sequence of input vectors x1:n in which each xi is a concatenation of the corresponding word and POS vectors:

xi = e(wi) ◦ e(ti)

The embeddings are trained together with the model. This encodes each word in isolation, disregarding its context. We introduce context by representing each input element as its (deep) BiLSTM vector, vi:

vi = BILSTM(x1:n, i)

Our feature function φ is then a concatenation of a small number of BiLSTM vectors. The exact feature function is parser dependent and will be discussed together with the corresponding parsers. The resulting feature vectors are then scored using a non-linear function, namely a multi-layer perceptron with one hidden layer (MLP):

MLPθ(x) = W2 · tanh(W1 · x + b1) + b2

where θ = {W1, W2, b1, b2} are the model parameters. Besides using the BiLSTM-based feature functions, we make use of standard parsing techniques. Crucially, the BiLSTM is trained jointly with the rest of the parsing objective. This allows it to learn representations which are suitable for the parsing task.

Consider a concatenation of two BiLSTM vectors (vi ◦ vj) scored using an MLP. The scoring function has access to the words and POS-tags of vi and vj, as well as the words and POS-tags of the words in an infinite window surrounding them. As LSTMs are known to capture length and sequence position information, it is very plausible that the scoring function can be sensitive also to the distance between i and j, their ordering, and the sequential material between them.

Parsing-time Complexity: Once the BiLSTM is trained, parsing is performed by first computing the BiLSTM encoding vi for each word in the sentence (a linear time operation).⁵ Then, parsing proceeds as usual, where the feature extraction involves a concatenation of a small number of the pre-computed vi vectors.

⁴ In this work the tag sequence is assumed to be given, and in practice is predicted by an external model. Future work will address relaxing this assumption.

⁵ While the BiLSTM computation is quite efficient as it is, as demonstrated by Lewis et al. (2016), if using a GPU implementation the BiLSTM encoding can be efficiently performed over many sentences in parallel, making its computation cost almost negligible.
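Putting the pieces together, a minimal sketch of the feature function and scorer might look as follows (ours; the dimensions are illustrative, randomly initialized parameters stand in for trained ones, and the `bilstm` argument is any encoder returning one vector per position, e.g. the linear-time BiRNN sketched in Section 2.3):

```python
import numpy as np

rng = np.random.default_rng(1)
WDIM, TDIM, VDIM, HDIM, NOUT = 100, 25, 250, 100, 3   # illustrative sizes

# Embedding tables, trained jointly with the model in the real parser.
E_w = {w: 0.1 * rng.normal(size=WDIM) for w in ("the", "fox", "jumped", "<unk>")}
E_t = {t: 0.1 * rng.normal(size=TDIM) for t in ("DT", "NN", "VBD")}

def input_vectors(words, tags):
    """x_i = e(w_i) ◦ e(t_i): each word encoded in isolation."""
    return [np.concatenate((E_w.get(w, E_w["<unk>"]), E_t[t]))
            for w, t in zip(words, tags)]

# MLP_θ(x) = W2·tanh(W1·x + b1) + b2, over a concatenation of two v_i.
W1, b1 = 0.1 * rng.normal(size=(HDIM, 2 * VDIM)), np.zeros(HDIM)
W2, b2 = 0.1 * rng.normal(size=(NOUT, HDIM)), np.zeros(NOUT)

def mlp(x):
    return W2 @ np.tanh(W1 @ x + b1) + b2

def score(xs, i, j, bilstm):
    """Score the feature vector φ = v_i ◦ v_j, with v_i = BiLSTM(x_1:n, i)."""
    return mlp(np.concatenate((bilstm(xs, i), bilstm(xs, j))))

# Stand-in encoder: a fixed random projection of x_i, just to fix shapes.
P = 0.1 * rng.normal(size=(VDIM, WDIM + TDIM))
toy_bilstm = lambda xs, i: P @ xs[i]

xs = input_vectors(["the", "fox", "jumped"], ["DT", "NN", "VBD"])
print(score(xs, 1, 2, toy_bilstm))   # e.g. one score per transition
```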

4 Transition-based Parser

We begin by integrating the feature extractor in a transition-based parser (Nivre, 2008). We follow the notation in Goldberg and Nivre (2013).

[Figure 1: Illustration of the neural model scheme of the transition-based parser when calculating the scores of the possible transitions in a given configuration, for the sentence "the brown fox jumped over the lazy dog". The configuration (stack and buffer) is depicted on the top. Each transition is scored using an MLP that is fed the BiLSTM encodings of the first word in the buffer and the three words at the top of the stack (the colors of the words correspond to colors of the MLP inputs), and a transition is picked greedily. Each xi is a concatenation of a word and a POS vector, and possibly an additional external embedding vector for the word. The figure depicts a single-layer BiLSTM, while in practice we use two layers. When parsing a sentence, we iteratively compute scores for all possible transitions and apply the best scoring action until the final configuration is reached.]

The transition-based parsing framework assumes a transition system, an abstract machine that processes sentences and produces parse trees. The transition system has a set of configurations and a set of transitions which are applied to configurations. When parsing a sentence, the system is initialized to an initial configuration based on the input sentence, and transitions are repeatedly applied to this configuration. After a finite number of transitions, the system arrives at a terminal configuration, and a parse tree is read off the terminal configuration. In a greedy parser, a classifier is used to choose the transition to take in each configuration, based on features extracted from the configuration itself. The parsing algorithm is presented in Algorithm 1 below.

Algorithm 1 Greedy transition-based parsing
1: Input: sentence s = w1, ..., wn, t1, ..., tn, parameterized function SCOREθ(·) with parameters θ.
2: c ← INITIAL(s)
3: while not TERMINAL(c) do
4:   t̂ ← argmax_{t∈LEGAL(c)} SCOREθ(φ(c), t)
5:   c ← t̂(c)
6: return tree(c)

Given a sentence s, the parser is initialized with the configuration c (line 2). Then, a feature function φ(c) represents the configuration c as a vector, which is fed to a scoring function SCORE assigning scores to (configuration, transition) pairs. SCORE scores the possible transitions t, and the highest scoring transition t̂ is chosen (line 4). The transition t̂ is applied to the configuration, resulting in a new parser configuration. The process ends when reaching a final configuration, from which the resulting parse tree is read and returned (line 6).

Transition systems differ by the way they define configurations, and by the particular set of transitions available to them.
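Algorithm 1 translates almost line-for-line into code. In the sketch below (ours), `initial`, `terminal`, `legal` and `apply` are assumed to come from the transition system (e.g. the arc-hybrid system defined below), and `score`/`phi` from the trained model:

```python
def greedy_parse(sentence, initial, terminal, legal, apply, score, phi):
    """Algorithm 1: greedy transition-based parsing. The transition
    system and the trained scorer are passed in as functions."""
    c = initial(sentence)                                     # line 2
    while not terminal(c):                                    # line 3
        feats = phi(c)                                        # φ(c)
        t_hat = max(legal(c), key=lambda t: score(feats, t))  # line 4
        c = apply(c, t_hat)                                   # line 5
    return c                                                  # line 6: tree(c)
```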

A parser is determined by the choice of a transition system, a feature function φ, and a scoring function SCORE. Our choices are detailed below.

The Arc-Hybrid System: Many transition systems exist in the literature. In this work, we use the arc-hybrid transition system (Kuhlmann et al., 2011), which is similar to the more popular arc-standard system (Nivre, 2004), but for which an efficient dynamic oracle is available (Goldberg and Nivre, 2012; Goldberg and Nivre, 2013). In the arc-hybrid system, a configuration c = (σ, β, T) consists of a stack σ, a buffer β, and a set T of dependency arcs. Both the stack and the buffer hold integer indices pointing to sentence elements. Given a sentence s = w1, ..., wn, t1, ..., tn, the system is initialized with an empty stack, an empty arc set, and β = 1, ..., n, ROOT, where ROOT is a special root index. Any configuration c with an empty stack and a buffer containing only ROOT is terminal, and the parse tree is given by the arc set Tc of c. The arc-hybrid system allows 3 possible transitions, SHIFT, LEFT_ℓ and RIGHT_ℓ, defined as:

SHIFT[(σ, b0|β, T)] = (σ|b0, β, T)
LEFT_ℓ[(σ|s1|s0, b0|β, T)] = (σ|s1, b0|β, T ∪ {(b0, s0, ℓ)})
RIGHT_ℓ[(σ|s1|s0, β, T)] = (σ|s1, β, T ∪ {(s1, s0, ℓ)})

The SHIFT transition moves the first item of the buffer (b0) to the stack. The LEFT_ℓ transition removes the first item on top of the stack (s0) and attaches it as a modifier to b0 with label ℓ, adding the arc (b0, s0, ℓ). The RIGHT_ℓ transition removes s0 from the stack and attaches it as a modifier to the next item on the stack (s1), adding the arc (s1, s0, ℓ).
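As a concrete (unlabeled, for brevity) rendering of these definitions, the sketch below (ours) implements the three transitions over (σ, β, T) configurations and replays a derivation by hand; in the parser, the choice among legal transitions at each step is made by the scorer, as in the greedy loop sketched above:

```python
# Unlabeled arc-hybrid transitions over configurations (stack, buffer,
# arcs). Stack and buffer hold word indices; the buffer ends with the
# special ROOT index. Arcs are (head, modifier) pairs.

def shift(stack, buf, arcs):        # (σ, b0|β, T) -> (σ|b0, β, T)
    return stack + [buf[0]], buf[1:], arcs

def left_arc(stack, buf, arcs):     # pops s0, adds the arc (b0, s0)
    return stack[:-1], buf, arcs | {(buf[0], stack[-1])}

def right_arc(stack, buf, arcs):    # pops s0, adds the arc (s1, s0)
    return stack[:-1], buf, arcs | {(stack[-2], stack[-1])}

# "the fox jumped" with indices 0..2 and ROOT = 3; each word modifies
# the word to its right, so the derivation is (SHIFT, LEFT) three times.
config = ([], [0, 1, 2, 3], set())
for t in (shift, left_arc, shift, left_arc, shift, left_arc):
    config = t(*config)
print(config)   # ([], [3], {(1, 0), (2, 1), (3, 2)}): terminal, T is the tree
```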
Scoring Function: Traditionally, the scoring function SCOREθ(x, t) is a discriminative linear model of the form SCORE_W(x, t) = (W · x)[t]. The linearity of SCORE required the feature function φ(·) to encode non-linearities in the form of combination features. We follow Chen and Manning (2014) and replace the linear scoring model with an MLP:

SCOREθ(x, t) = MLPθ(x)[t]

Simple Feature Function: The feature function φ(c) is typically complex (see Section 2.1). Our feature function is the concatenation of the BiLSTM vectors of the top 3 items on the stack and the first item on the buffer. I.e., for a configuration c = (...|s2|s1|s0, b0|..., T) the feature extractor is defined as:

φ(c) = vs2 ◦ vs1 ◦ vs0 ◦ vb0
vi = BILSTM(x1:n, i)

This feature function is rather minimal: it takes into account the BiLSTM representations of s1, s0 and b0, which are the items affected by the possible transitions being scored, as well as one extra stack context, s2.⁶ Figure 1 depicts transition scoring with our architecture and this feature function. Note that, unlike previous work, this feature function does not take into account T, the already-built structure. The high parsing accuracies in the experimental sections suggest that the BiLSTM encoding is capable of estimating a lot of the missing information based on the provided stack and buffer elements and the sequential content between them.

While not explored in this work, relying on only four word indices for scoring an action results in very compact state signatures, making our proposed feature representation very appealing for use in transition-based parsers that employ dynamic-programming search (Huang and Sagae, 2010; Kuhlmann et al., 2011).

Extended Feature Function: One of the benefits of the greedy transition-based parsing framework is precisely its ability to look at arbitrary features from the already-built tree. If we allow a somewhat less minimal feature function, we can add the BiLSTM vectors corresponding to the right-most and left-most modifiers of s0, s1 and s2, as well as the left-most modifier of b0, reaching a total of 11 BiLSTM vectors. We refer to this as the extended feature set. As we'll see in Section 6, using the extended set does indeed improve parsing accuracies when using pre-trained word embeddings, but has a minimal effect in the fully-supervised case.⁷

⁶ An additional buffer context is not needed, as b1 is by definition adjacent to b0, a fact that we expect the BiLSTM encoding of b0 to capture. In contrast, b0, s0, s1 and s2 are not necessarily adjacent to each other in the original sentence.

⁷ We did not experiment with other feature configurations. It is well possible that not all of the additional 7 child encodings are needed for the observed accuracy gains, and that a smaller feature set would yield similar or even better improvements.

4.1 Details of the Training Algorithm

The training objective is to set the scores of correct transitions above the scores of incorrect transitions. We use a margin-based objective, aiming to maximize the margin between the highest scoring correct action and the highest scoring incorrect action. The hinge loss at each parsing configuration c is defined as:

max(0, 1 − max_{t_o∈G} MLP(φ(c))[t_o] + max_{t_p∈A\G} MLP(φ(c))[t_p])

where A is the set of possible transitions and G is the set of correct (gold) transitions at the current stage. At each stage of the training process the parser scores the possible transitions A, incurs a loss, selects a transition to follow, and moves to the next configuration based on it. The local losses are summed throughout the parsing process of a sentence, and the parameters are updated with respect to the sum of the losses at sentence boundaries.⁸ The gradients of the entire network (including the MLP and the BiLSTM) with respect to the sum of the losses are calculated using the backpropagation algorithm. As usual, we perform several training iterations over the training corpus, shuffling the order of sentences in each iteration.

⁸ To increase gradient stability and training speed, we simulate mini-batch updates by only updating the parameters when the sum of local losses contains at least 50 non-zero elements. Sums of fewer elements are carried across sentences. This assures us a sufficient number of gradient samples for every update, thus minimizing the effect of gradient instability.

Error-Exploration and Dynamic-Oracle Training: We follow Goldberg and Nivre (2012; 2013) in using error-exploration training with a dynamic oracle, which we briefly describe below. At each stage in the training process, the parser assigns scores to all the possible transitions t ∈ A. It then selects a transition, applies it, and moves to the next step. Which transition should be followed? A common approach follows the highest scoring transition that can lead to the gold tree. However, when training in this way the parser sees only configurations that result from following correct actions, and as a result tends to suffer from error propagation at test time. Instead, in error-exploration training the parser follows the highest scoring action in A during training even if this action is incorrect, exposing it to configurations that result from erroneous decisions. This strategy requires defining the set G such that the correct actions to take are well-defined also for states that cannot lead to the gold tree. Such a set G is called a dynamic oracle. We perform error-exploration training using the dynamic oracle defined by Goldberg and Nivre (2013).

Aggressive Exploration: We found that even when using error-exploration, after one iteration the model remembers the training set quite well, and does not make enough errors for error-exploration to be effective. In order to expose the parser to more errors, we follow an aggressive-exploration scheme: we sometimes follow incorrect transitions even if they score below correct ones. Specifically, when the score of the correct transition is greater than that of the wrong transition but the difference is smaller than a margin constant, we choose to follow the incorrect action with probability p_agg (we use p_agg = 0.1 in our experiments).

Summary: The greedy transition-based parser follows standard techniques from the literature (margin-based objective, dynamic oracle training, error exploration, MLP-based non-linear scoring function). We depart from the literature by replacing the hand-crafted feature function over carefully selected components of the configuration with a concatenation of BiLSTM representations of a few prominent items on the stack and the buffer, and training the BiLSTM encoder jointly with the rest of the network.
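The per-configuration loss and the exploration policy can be sketched directly from these definitions (ours; `scores` would come from the MLP over φ(c), and the gold index set G from the dynamic oracle):

```python
import random

def hinge_loss(scores, gold):
    """max(0, 1 - max_{t∈G} scores[t] + max_{t∈A\\G} scores[t]) for one
    configuration; `gold` is the set of correct transition indices."""
    wrong = [s for t, s in enumerate(scores) if t not in gold]
    return max(0.0, 1.0 - max(scores[t] for t in gold) + max(wrong))

def transition_to_follow(scores, gold, margin=1.0, p_agg=0.1):
    """Error exploration: follow the model's argmax even when it is
    incorrect. Aggressive exploration: when the best correct action wins
    by less than `margin`, follow the best wrong one with prob. p_agg."""
    best_gold = max(gold, key=lambda t: scores[t])
    best_wrong = max((t for t in range(len(scores)) if t not in gold),
                     key=lambda t: scores[t])
    if scores[best_wrong] >= scores[best_gold]:
        return best_wrong                 # the model's argmax is wrong
    if scores[best_gold] - scores[best_wrong] < margin and random.random() < p_agg:
        return best_wrong                 # aggressive exploration
    return best_gold

scores = [0.5, -0.2, 0.9]        # e.g. (LEFT, RIGHT, SHIFT)
print(hinge_loss(scores, {2}))   # 1 - 0.9 + 0.5 = 0.6
print(transition_to_follow(scores, {2}))
```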
5 Graph-based Parser

Graph-based parsing follows the common structured prediction paradigm (Taskar et al., 2005; McDonald et al., 2005):

predict(s) = argmax_{y∈Y(s)} score_global(s, y)
score_global(s, y) = Σ_{part∈y} score_local(s, part)

Given an input sentence s (and the corresponding sequence of vectors x1:n) we look for the highest-scoring parse tree y in the space Y(s) of valid dependency trees over s.

[Figure 2: Illustration of the neural model scheme of the graph-based parser when calculating the score of a given parse tree over the sentence "the brown fox jumped". The parse tree is depicted below the sentence. Each dependency arc in the sentence is scored using an MLP that is fed the BiLSTM encodings of the words at the arc's endpoints (the colors of the arcs correspond to colors of the MLP inputs), and the individual arc scores are summed to produce the final score. All the MLPs share the same parameters. The figure depicts a single-layer BiLSTM, while in practice we use two layers. When parsing a sentence, we compute scores for all possible n² arcs, and find the best scoring tree using a dynamic-programming algorithm.]

In order to make the search tractable, the scoring function is decomposed into a sum of local scores for each part independently. In this work, we focus on the arc-factored graph-based approach presented in McDonald et al. (2005). Arc-factored parsing decomposes the score of a tree into the sum of the scores of its head-modifier arcs (h, m):

parse(s) = argmax_{y∈Y(s)} Σ_{(h,m)∈y} score(φ(s, h, m))

Given the scores of the arcs, the highest scoring projective tree can be efficiently found using Eisner's decoding algorithm (1996). McDonald et al. and most subsequent work estimate the local score of an arc by a linear model parameterized by a weight vector w, and a feature function φ(s, h, m) assigning a sparse feature vector to an arc linking modifier m to head h. We follow Pei et al. (2015) and replace the linear scoring function with an MLP.

The feature extractor φ(s, h, m) is usually complex, involving many elements (see Section 2.1). In contrast, our feature extractor uses merely the BiLSTM encoding of the head word and the modifier word:

φ(s, h, m) = BIRNN(x1:n, h) ◦ BIRNN(x1:n, m)

The final model is:

parse(s) = argmax_{y∈Y(s)} score_global(s, y)
         = argmax_{y∈Y(s)} Σ_{(h,m)∈y} score(φ(s, h, m))
         = argmax_{y∈Y(s)} Σ_{(h,m)∈y} MLP(vh ◦ vm)

where vi = BIRNN(x1:n, i). The architecture is illustrated in Figure 2.
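Scoring a sentence therefore amounts to applying one shared MLP to every head-modifier pair of BiLSTM vectors. A minimal sketch (ours; random stand-in weights and vectors in place of trained ones, with the tree-decoding step left abstract):

```python
import numpy as np

rng = np.random.default_rng(2)
VDIM, HDIM = 6, 8                      # illustrative sizes

# One MLP, shared across all arcs: MLP(v_h ◦ v_m) -> a single arc score.
W1, b1 = 0.1 * rng.normal(size=(HDIM, 2 * VDIM)), np.zeros(HDIM)
w2, b2 = 0.1 * rng.normal(size=HDIM), 0.0

def arc_score(vh, vm):
    return w2 @ np.tanh(W1 @ np.concatenate((vh, vm)) + b1) + b2

def score_matrix(vs):
    """scores[h, m] for all n^2 head-modifier pairs; a decoder (e.g.
    Eisner's algorithm for projective trees) then picks the best tree."""
    n = len(vs)
    return np.array([[arc_score(vs[h], vs[m]) for m in range(n)]
                     for h in range(n)])

vs = [rng.normal(size=VDIM) for _ in range(4)]   # stand-in BiLSTM vectors
print(score_matrix(vs).shape)                    # (4, 4)
```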

Training: The training objective is to set the score function such that the correct tree y is scored above incorrect ones. We use a margin-based objective (McDonald et al., 2005; LeCun et al., 2006), aiming to maximize the margin between the score of the gold tree y and the highest scoring incorrect tree y′. We define the hinge loss with respect to a gold tree y as:

max(0, 1 − Σ_{(h,m)∈y} MLP(vh ◦ vm) + max_{y′≠y} Σ_{(h,m)∈y′} MLP(vh ◦ vm))

Each of the tree scores is calculated by activating the MLP on the arc representations. The entire loss can be viewed as a sum of multiple neural networks, and is sub-differentiable. We calculate the gradients of the entire network (including the BiLSTM encoder and word embeddings) with respect to this loss.

Labeled Parsing: Up to now, we described unlabeled parsing. A possible approach for adding labels is to score the combination of an unlabeled arc (h, m) and its label ℓ by considering the label as part of the arc, (h, m, ℓ). This results in |Labels| × |Arcs| parts that need to be scored, leading to slow parsing speeds and arguably a harder learning problem. Instead, we choose to first predict the unlabeled structure using the model given above, and then predict the label of each resulting arc. Using this approach, the number of parts stays small, enabling fast parsing. The labeling of an arc (h, m) is performed using the same feature representation φ(s, h, m) fed into a different MLP predictor:

label(h, m) = argmax_{ℓ∈labels} MLP_LBL(vh ◦ vm)[ℓ]

As before, we use a margin-based hinge loss. The labeler is trained on the gold trees.⁹ The BiLSTM encoder responsible for producing vh and vm is shared with the arc-factored parser: the same BiLSTM encoder is used in the parser and the labeler. This sharing of parameters can be seen as an instance of multi-task learning (Caruana, 1997). As we show in Section 6, the sharing is effective: training the BiLSTM feature encoder to be good at predicting arc labels significantly improves the parser's unlabeled accuracy.

⁹ When training the labeled parser, we calculate the structure loss and the labeling loss for each training sentence, and sum the losses prior to computing the gradients.

Loss-Augmented Inference: In initial experiments, the network learned quickly and overfit the data. In order to remedy this, we found it useful to use loss-augmented inference (Taskar et al., 2005). The intuition behind loss-augmented inference is to update against trees which have high model scores and are also very wrong. This is done by augmenting the score of each part not belonging to the gold tree by adding a constant to its score. Formally, the loss transforms as follows:

max(0, 1 − score(x, y) + max_{y′≠y} Σ_{part∈y′} (score_local(x, part) + 1_{part∉y}))

Speed Improvements: The arc-factored model requires the scoring of n² arcs. Scoring is performed using an MLP with one hidden layer, resulting in n² matrix-vector multiplications from the input to the hidden layer, and n² multiplications from the hidden to the output layer. The first n² multiplications involve larger-dimensional input and output vectors, and are the most time consuming. Fortunately, these can be reduced to 2n multiplications and n² vector additions, by observing that the multiplication W · (vh ◦ vm) can be written as W1 · vh + W2 · vm, where W1 and W2 are the first and second halves of the matrix W, and reusing the products across the different pairs.
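The rearrangement is easy to verify numerically. In the sketch below (ours, with illustrative sizes), precomputing W1·vh and W2·vm once per word replaces the n² large matrix-vector products with 2n products and n² cheap additions:

```python
import numpy as np

rng = np.random.default_rng(3)
VDIM, HDIM, n = 6, 8, 5
W = 0.1 * rng.normal(size=(HDIM, 2 * VDIM))
W1, W2 = W[:, :VDIM], W[:, VDIM:]          # first / second half of W
vs = [rng.normal(size=VDIM) for _ in range(n)]

# Naive: n^2 multiplications by the full matrix W.
naive = {(h, m): W @ np.concatenate((vs[h], vs[m]))
         for h in range(n) for m in range(n)}

# Fast: 2n multiplications, then n^2 vector additions.
head = [W1 @ v for v in vs]                # W1·v_h, once per word
mod = [W2 @ v for v in vs]                 # W2·v_m, once per word
fast = {(h, m): head[h] + mod[m] for h in range(n) for m in range(n)}

print(all(np.allclose(naive[k], fast[k]) for k in naive))   # True
```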

Summary: The graph-based parser is a straightforward first-order parser, trained with a margin-based hinge loss and loss-augmented inference. We depart from the literature by replacing the hand-crafted feature function with a concatenation of BiLSTM representations of the head and modifier words, and training the BiLSTM encoder jointly with the structured objective. We also introduce a novel multi-task learning approach for labeled parsing by training a second-stage arc-labeler sharing the same BiLSTM encoder with the unlabeled parser.

6 Experiments and Results

We evaluated our parsing models on English and Chinese data. For comparison purposes we follow the setup of Dyer et al. (2015).

Data: For English, we used the Stanford Dependency (SD) (de Marneffe and Manning, 2008) conversion of the Penn Treebank (Marcus et al., 1993), using the standard train/dev/test splits with the same predicted POS-tags as used in Dyer et al. (2015) and Chen and Manning (2014). This dataset contains a few non-projective trees. Punctuation symbols are excluded from the evaluation. For Chinese, we use the Penn Chinese Treebank 5.1 (CTB5), using the train/test/dev splits of Zhang and Clark (2008) and Dyer et al. (2015) with gold part-of-speech tags, also following Dyer et al. (2015) and Chen and Manning (2014). When using external word embeddings, we also use the same data as Dyer et al. (2015).¹⁰

¹⁰ We thank Dyer et al. for sharing their data with us.

System | Method | Representation | Emb | PTB-YM UAS | PTB-SD UAS | PTB-SD LAS | CTB UAS | CTB LAS
This work | graph, 1st order | 2 BiLSTM vectors | – | – | 93.1 | 91.0 | 86.6 | 85.1
This work | transition (greedy, dyn-oracle) | 4 BiLSTM vectors | – | – | 93.1 | 91.0 | 86.2 | 85.0
This work | transition (greedy, dyn-oracle) | 11 BiLSTM vectors | – | – | 93.2 | 91.2 | 86.5 | 84.9
ZhangNivre11 | transition (beam) | large feature set (sparse) | – | 92.9 | – | – | 86.0 | 84.4
Martins13 (TurboParser) | graph, 3rd order+ | large feature set (sparse) | – | 92.8 | 93.1 | – | – | –
Pei15 | graph, 2nd order | large feature set (dense) | – | 93.0 | – | – | – | –
Dyer15 | transition (greedy) | Stack-LSTM + composition | – | – | 92.4 | 90.0 | 85.7 | 84.1
Ballesteros16 | transition (greedy, dyn-oracle) | Stack-LSTM + composition | – | – | 92.7 | 90.6 | 86.1 | 84.5
This work | graph, 1st order | 2 BiLSTM vectors | YES | – | 93.0 | 90.9 | 86.5 | 84.9
This work | transition (greedy, dyn-oracle) | 4 BiLSTM vectors | YES | – | 93.6 | 91.5 | 87.4 | 85.9
This work | transition (greedy, dyn-oracle) | 11 BiLSTM vectors | YES | – | 93.9 | 91.9 | 87.6 | 86.1
Weiss15 | transition (greedy) | large feature set (dense) | YES | – | 93.2 | 91.2 | – | –
Weiss15 | transition (beam) | large feature set (dense) | YES | – | 94.0 | 92.0 | – | –
Pei15 | graph, 2nd order | large feature set (dense) | YES | 93.3 | – | – | – | –
Dyer15 | transition (greedy) | Stack-LSTM + composition | YES | – | 93.1 | 90.9 | 87.1 | 85.5
Ballesteros16 | transition (greedy, dyn-oracle) | Stack-LSTM + composition | YES | – | 93.6 | 91.4 | 87.6 | 86.2
LeZuidema14 | reranking/blend | inside-outside recursive net | YES | 93.1 | 93.8 | 91.5 | – | –
Zhu15 | reranking/blend | recursive conv-net | YES | 93.8 | – | – | 85.7 | –

Table 1: Test-set parsing results of various state-of-the-art parsing systems on the English (PTB) and Chinese (CTB) datasets. The systems that use embeddings may use different pre-trained embeddings. English results use predicted POS tags (different systems use different taggers), while Chinese results use gold POS tags. PTB-YM: English PTB, Yamada and Matsumoto head rules. PTB-SD: English PTB, Stanford Dependencies (different systems may use different versions of the Stanford converter). CTB: Chinese Treebank. reranking/blend in the Method column indicates a reranking system where the reranker score is interpolated with the base-parser's score. The different systems and the numbers reported from them are taken from: ZhangNivre11 (Zhang and Nivre, 2011); Martins13 (Martins et al., 2013); Weiss15 (Weiss et al., 2015); Pei15 (Pei et al., 2015); Dyer15 (Dyer et al., 2015); Ballesteros16 (Ballesteros et al., 2016); LeZuidema14 (Le and Zuidema, 2014); Zhu15 (Zhu et al., 2015).

Implementation Details: The parsers are implemented in Python, using the PyCNN toolkit¹¹ for neural network training. The code is available at the github repository https://github.com/elikip/bist-parser. We use the LSTM variant implemented in PyCNN, and optimize using the Adam optimizer (Kingma and Ba, 2015). Unless otherwise noted, we use the default values provided by PyCNN (e.g. for random initialization, learning rates, etc.).

¹¹ https://github.com/clab/cnn/tree/master/pycnn

The word and POS embeddings e(wi) and e(ti) are initialized to random values and trained together with the rest of the parsers' networks. In some experiments, we also introduce pre-trained word embeddings. In those cases, the vector representation of a word is a concatenation of its randomly-initialized vector embedding with its pre-trained word vector. Both are tuned during training. We use the same word vectors as in Dyer et al. (2015).

During training, we employ a variant of word dropout (Iyyer et al., 2015), and replace a word with the unknown-word symbol with a probability that is inversely proportional to the frequency of the word. A word w appearing #(w) times in the training corpus is replaced with the unknown symbol with probability p_unk(w) = α / (#(w) + α). If a word was dropped, the external embedding of the word is also dropped, with probability 0.5.
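As a small illustration of this rule (a sketch; the words and counts are toy values, with α = 0.25 as in Table 2), rare words are dropped, and hence trained as the unknown symbol, far more often than frequent ones:

```python
import random
from collections import Counter

def word_dropout(words, counts, alpha=0.25):
    """Replace w by <unk> with probability p_unk(w) = alpha/(#(w)+alpha),
    so rare words are dropped (and trained as <unk>) more often."""
    return [w if random.random() >= alpha / (counts[w] + alpha) else "<unk>"
            for w in words]

counts = Counter(["the"] * 500 + ["fox"] * 2)     # toy training counts
print(word_dropout(["the", "fox", "jumped"], counts))
```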
We train the parsers for up to 30 iterations, and choose the best model according to the UAS accuracy on the development set.

Hyperparameter Tuning: We performed a very minimal hyper-parameter search with the graph-based parser, and use the same hyper-parameters for both parsers.

The hyper-parameters of the final networks used for all the reported experiments are detailed in Table 2.

Word embedding dimension: 100
POS tag embedding dimension: 25
Hidden units in MLP: 100
Hidden units in MLP_LBL: 100
BiLSTM layers: 2
BiLSTM dimensions (hidden/output): 125 / 125
α (for word dropout): 0.25
p_agg (for exploration training): 0.1

Table 2: Hyper-parameter values used in experiments.

Main Results: Table 1 lists the test-set accuracies of our best parsing models, compared to other state-of-the-art parsers from the literature.¹² It is clear that our parsers are very competitive, despite using very simple parsing architectures and minimal feature extractors. When not using external embeddings, the first-order graph-based parser with 2 features outperforms all other systems that do not use external resources, including the third-order TurboParser. The greedy transition-based parser with 4 features also matches or outperforms most other parsers, including the beam-based transition parser with heavily engineered features of Zhang and Nivre (2011) and the Stack-LSTM parser of Dyer et al. (2015), as well as the same parser when trained using a dynamic oracle (Ballesteros et al., 2016). Moving from the simple (4 features) to the extended (11 features) feature set leads to some gains in accuracy for both English and Chinese.

¹² Unfortunately, many papers still report English parsing results on the deficient Yamada and Matsumoto head rules (PTB-YM) rather than the more modern Stanford Dependencies (PTB-SD). We note that the PTB-YM and PTB-SD results are not strictly comparable, and in our experience the PTB-YM results are usually about half a UAS point higher.

Interestingly, when adding external word embeddings the accuracy of the graph-based parser degrades. We are not sure why this happens, and leave the exploration of effective semi-supervised parsing with the graph-based model for future work. The greedy parser does manage to benefit from the external embeddings, and using them we also see gains from moving from the simple to the extended feature set. Both feature sets result in very competitive results, with the extended feature set yielding the best reported results for Chinese, and ranking second for English, after the heavily-tuned beam-based parser of Weiss et al. (2015).

Additional Results: We perform some ablation experiments in order to quantify the effect of the different components on our best models (Table 3).

 | PTB UAS | PTB LAS | CTB UAS | CTB LAS
Graph (no ext. emb) | 93.3 | 91.0 | 87.0 | 85.4
 – POS | 92.9 | 89.8 | 80.6 | 76.8
 – Arc Labeler | 92.7 | – | 86.2 | –
 – Loss Aug. | 81.3 | 79.4 | 52.6 | 51.7
Greedy (ext. emb) | 93.8 | 91.5 | 87.8 | 86.0
 – POS | 93.4 | 91.2 | 83.4 | 81.6
 – Dyn Oracle | 93.5 | 91.4 | 87.5 | 85.9

Table 3: Ablation experiment results (dev set) for the graph-based parser without external embeddings and the greedy parser with external embeddings and the extended feature set.

Loss-augmented inference is crucial for the success of the graph-based parser, and the multi-task learning scheme for the arc-labeler contributes nicely to the unlabeled scores. Dynamic oracle training yields nice gains for both English and Chinese.

7 Conclusion

We presented a pleasingly effective approach for feature extraction for dependency parsing based on a BiLSTM encoder that is trained jointly with the parser, and demonstrated its effectiveness by integrating it into two simple parsing models: a greedy transition-based parser and a globally optimized first-order graph-based parser, yielding very competitive parsing accuracies in both cases.

Acknowledgements: This research is supported by the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI) and the Israeli Science Foundation (grant number 1555/15). We thank Lillian Lee for her important feedback and the efforts she invested in editing this paper. We also thank the reviewers for their valuable comments.

References

Miguel Ballesteros, Yoav Goldberg, Chris Dyer, and Noah A. Smith. 2016. Training with exploration improves a greedy stack-LSTM parser. CoRR, abs/1603.03793.

Rich Caruana. 1997. Multitask learning. Machine Learning, 28:41–75, July.

Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750, Doha, Qatar, October. Association for Computational Linguistics.

Jason P. C. Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4. To appear.

Kyunghyun Cho. 2015. Natural language understanding with distributed representation. CoRR, abs/1511.07916.

James Cross and Liang Huang. 2016. Incremental parsing with minimal features using bi-directional LSTM. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, August. Association for Computational Linguistics.

Marie-Catherine de Marneffe and Christopher D. Manning. 2008. Stanford dependencies manual. Technical report, Stanford University.

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 334–343, Beijing, China, July. Association for Computational Linguistics.

Jason Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In 16th International Conference on Computational Linguistics, Proceedings of the Conference, COLING 1996, Center for Sprogteknologi, Copenhagen, Denmark, August 5–9, 1996, pages 340–345.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Yoav Goldberg and Joakim Nivre. 2012. A dynamic oracle for arc-eager dependency parsing. In Proceedings of COLING 2012, pages 959–976, Mumbai, India, December. The COLING 2012 Organizing Committee.

Yoav Goldberg and Joakim Nivre. 2013. Training deterministic parsers with non-deterministic oracles. Transactions of the Association for Computational Linguistics, 1:403–414.

Yoav Goldberg. 2015. A primer on neural network models for natural language processing. CoRR, abs/1510.00726.

Alex Graves. 2008. Supervised sequence labelling with recurrent neural networks. Ph.D. thesis, Technical University Munich.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Liang Huang and Kenji Sagae. 2010. Dynamic programming for linear-time incremental parsing. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1077–1086, Uppsala, Sweden, July. Association for Computational Linguistics.

Ozan Irsoy and Claire Cardie. 2014. Opinion mining with deep recurrent neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 720–728, Doha, Qatar, October. Association for Computational Linguistics.

Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1681–1691, Beijing, China, July. Association for Computational Linguistics.

Andrej Karpathy, Justin Johnson, and Fei-Fei Li. 2015. Visualizing and understanding recurrent networks. CoRR, abs/1506.02078.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference for Learning Representations, San Diego, California.

Eliyahu Kiperwasser and Yoav Goldberg. 2016. Easy-first dependency parsing with hierarchical tree LSTMs. Transactions of the Association for Computational Linguistics, 4. To appear.

Terry Koo and Michael Collins. 2010. Efficient third-order dependency parsers. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1–11, Uppsala, Sweden, July. Association for Computational Linguistics.

Terry Koo, Xavier Carreras, and Michael Collins. 2008. Simple semi-supervised dependency parsing. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, pages 595–603, Columbus, Ohio, June. Association for Computational Linguistics.

Sandra Kübler, Ryan T. McDonald, and Joakim Nivre. 2009. Dependency Parsing. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.

Marco Kuhlmann, Carlos Gómez-Rodríguez, and Giorgio Satta. 2011. Dynamic programming algorithms for transition-based dependency parsers. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 673–682, Portland, Oregon, USA, June. Association for Computational Linguistics.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260–270, San Diego, California, June. Association for Computational Linguistics.

Phong Le and Willem Zuidema. 2014. The inside-outside recursive neural network model for dependency parsing. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 729–739, Doha, Qatar, October. Association for Computational Linguistics.

Yann LeCun, Sumit Chopra, Raia Hadsell, Marc'Aurelio Ranzato, and Fu Jie Huang. 2006. A tutorial on energy-based learning. Predicting Structured Data, 1.

Tao Lei, Yu Xin, Yuan Zhang, Regina Barzilay, and Tommi Jaakkola. 2014. Low-rank tensors for scoring dependency structures. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1381–1391, Baltimore, Maryland, June. Association for Computational Linguistics.

Mike Lewis, Kenton Lee, and Luke Zettlemoyer. 2016. LSTM CCG parsing. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 221–231, San Diego, California, June. Association for Computational Linguistics.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Andre Martins, Noah A. Smith, and Eric Xing. 2009. Concise integer linear programming formulations for dependency parsing. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 342–350, Suntec, Singapore, August. Association for Computational Linguistics.

Andre Martins, Miguel Almeida, and Noah A. Smith. 2013. Turning on the turbo: Fast third-order non-projective turbo parsers. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 617–622, Sofia, Bulgaria, August. Association for Computational Linguistics.

Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005. Online large-margin training of dependency parsers. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL '05), pages 91–98, Ann Arbor, Michigan, June. Association for Computational Linguistics.

Ryan McDonald. 2006. Discriminative Training and Spanning Tree Algorithms for Dependency Parsing. Ph.D. thesis, University of Pennsylvania.

Joakim Nivre. 2004. Incrementality in deterministic dependency parsing. In Frank Keller, Stephen Clark, Matthew Crocker, and Mark Steedman, editors, Proceedings of the ACL Workshop Incremental Parsing: Bringing Engineering and Cognition Together, pages 50–57, Barcelona, Spain, July. Association for Computational Linguistics.

Joakim Nivre. 2008. Algorithms for deterministic incremental dependency parsing. Computational Linguistics, 34(4):513–553.

Wenzhe Pei, Tao Ge, and Baobao Chang. 2015. An effective neural network model for graph-based dependency parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 313–322, Beijing, China, July. Association for Computational Linguistics.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.

Benjamin Taskar, Vassil Chatalbashev, Daphne Koller, and Carlos Guestrin. 2005. Learning structured prediction models: A large margin approach. In Machine Learning, Proceedings of the Twenty-Second International Conference (ICML 2005), Bonn, Germany, August 7–11, 2005, pages 896–903.

Hillel Taub-Tabib, Yoav Goldberg, and Amir Globerson. 2015. Template kernels for dependency parsing. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1422–1427, Denver, Colorado, May–June. Association for Computational Linguistics.

Ivan Titov and James Henderson. 2007. A latent variable model for generative dependency parsing. In Proceedings of the Tenth International Conference on Parsing Technologies, pages 144–155, Prague, Czech Republic, June. Association for Computational Linguistics.

Ashish Vaswani, Yonatan Bisk, Kenji Sagae, and Ryan Musa. 2016. Supertagging with LSTMs. In Proceedings of the 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics (Short Papers), San Diego, California, June.

Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey E. Hinton. 2015. Grammar as a foreign language. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7–12, 2015, Montreal, Quebec, Canada, pages 2773–2781.

David Weiss, Chris Alberti, Michael Collins, and Slav Petrov. 2015. Structured training for neural network transition-based parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 323–333, Beijing, China, July. Association for Computational Linguistics.

Yue Zhang and Stephen Clark. 2008. A tale of two parsers: Investigating and combining graph-based and transition-based dependency parsing. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 562–571, Honolulu, Hawaii, October. Association for Computational Linguistics.

Yue Zhang and Joakim Nivre. 2011. Transition-based dependency parsing with rich non-local features. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 188–193, Portland, Oregon, USA, June. Association for Computational Linguistics.

Chenxi Zhu, Xipeng Qiu, Xinchi Chen, and Xuanjing Huang. 2015. A re-ranking model for dependency parser with recursive convolutional neural network. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1159–1168, Beijing, China, July. Association for Computational Linguistics.
