Transactions of the Association for Computational Linguistics, 1 (2013) 37–48. Action Editor: Ryan McDonald.

Submitted 11/2012; Revised 2/2013; Published 3/2013. c
(cid:13)

2013 Association for Computational Linguistics.

BranchandBoundAlgorithmforDependencyParsingwithNon-localFeaturesXianQianandYangLiuComputerScienceDepartmentTheUniversityofTexasatDallas{qx,yangl}@hlt.utdallas.eduAbstractGraphbaseddependencyparsingisinefﬁcientwhenhandlingnon-localfeaturesduetohighcomputationalcomplexityofinference.Inthispaper,weproposedanexactandefﬁ-cientdecodingalgorithmbasedontheBranchandBound(B&B)frameworkwherenon-localfeaturesareboundedbyalinearcombi-nationoflocalfeatures.Dynamicprogram-mingisusedtosearchtheupperbound.Ex-perimentsareconductedonEnglishPTBandChineseCTBdatasets.Weachievedcompeti-tiveUnlabeledAttachmentScore(UAS)whennoadditionalresourcesareavailable:93.17%forEnglishand87.25%forChinese.Parsingspeedis177wordspersecondforEnglishand97wordspersecondforChinese.Ouralgo-rithmisgeneralandcanbeadaptedtonon-projectivedependencyparsingorothergraph-icalmodels.1IntroductionForgraphbasedprojectivedependencyparsing,dy-namicprogramming(DP)ispopularfordecodingduetoitsefﬁciencywhenhandlinglocalfeatures.Itperformscubictimeparsingforarc-factoredmod-els(Eisner,1996;McDonaldetal.,2005a)andbi-quadratictimeforhigherordermodelswithrichersiblingandgrandchildfeatures(Carreras,2007;KooandCollins,2010).However,formodelswithgen-eralnon-localfeatures,DPisinefﬁcient.Therehavebeennumerousstudiesonglobalin-ferencealgorithmsforgeneralhigherorderparsing.Onepopularapproachisreranking(Collins,2000;CharniakandJohnson,2005;Hall,2007).Ittypi-callyhastwosteps:thelowlevelclassiﬁergener-atesthetopkhypothesesusinglocalfeatures,thenthehighlevelclassiﬁerreranksthesecandidatesus-ingglobalfeatures.Sincethererankingqualityisboundedbytheoracleperformanceofcandidates,someworkhascombinedcandidategenerationandrerankingstepsusingcubepruning(Huang,2008;ZhangandMcDonald,2012)toachievehigheror-acleperformance.Theyparseasentenceinbottomuporderandkeepthetopkderivationsforeachs-panusingkbestparsing(HuangandChiang,2005).Aftermergingthetwospans,non-localfeaturesareusedtoreranktopkcombinations.Thisapproachisveryefﬁcientandﬂexibletohandlevariousnon-localfeatures.Thedisadvantageisthatittendstocomputenon-localfeaturesasearlyaspossiblesothatthedecodercanutilizethatinformationatinter-nalspans,henceitmaymisslonghistoricalfeaturessuchaslongdependencychains.SmithandEisnermodeleddependencyparsingusingMarkovRandomFields(MRFs)withglob-alconstraintsandappliedloopybeliefpropaga-tion(LBP)forapproximatelearningandinference(SmithandEisner,2008).SimilarworkwasdoneforCombinatorialCategorialGrammar(CCG)pars-ing(AuliandLopez,2011).Theyusedposteriormarginalbeliefsforinferencetosatisfythetreecon-straint:foreachfactor,onlylegalmessages(satisfy-ingglobalconstraints)areconsideredinthepartitionfunction.Asimilarlineofresearchinvestigatedtheuseofintegerlinearprogramming(ILP)basedparsing(RiedelandClarke,2006;Martinsetal.,2009).This

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

:
/
/

d
i
r
e
c
t
.

i
t
.

e
d
u

/
t

a
c
l
/

a
r
t
i
c
e
–
p
d

f
/

d
o

i
/

1
0
1
1
6
2

/
t

a
c
_
a
_
0
0
2
0
8
1
5
6
6
6
3
3

/
t

a
c
_
a
_
0
0
2
0
8
p
d

b
y
g
u
e
s
t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

methodisveryexpressive.Itcanhandlearbitrarynon-localfeaturesdeterminedorboundedbylinearinequalitiesoflocalfeatures.Forlocalmodels,LPislessefﬁcientthanDP.Thereasonisthat,DPworksonasmallnumberofdimensionsineachrecursion,whileforLP,thepopularrevisedsimplexmethodneedstosolveamdimensionallinearsystemineachiteration(NocedalandWright,2006),wheremisthenumberofconstraints,whichisquadraticinsentencelengthforprojectivedependencypars-ing(Martinsetal.,2009).DualDecomposition(DD)(Rushetal.,2010;Kooetal.,2010)isaspecialcaseofLagrangianre-laxation.Itreliesonstandarddecodingalgorithmsasoraclesolversforsub-problems,togetherwithasimplemethodforforcingagreementbetweenthedifferentoracles.Thismethoddoesnotneedtocon-siderthetreeconstraintexplicitly,asitresortstody-namicprogrammingwhichguaranteesitssatisfac-tion.Itworkswellifthesub-problemscanbewelldeﬁned,especiallyforjointlearningtasks.Howev-er,forthetaskofdependencyparsing,usingvariousnon-localfeaturesmayresultinmanyoverlappedsub-problems,henceitmaytakealongtimetoreachaconsensus(Martinsetal.,2011).Inthispaper,weproposeanovelBranchandBound(B&B)algorithmforefﬁcientparsingwithvariousnon-localfeatures.B&B(LandandDoig,1960)isgenerallyusedforcombinatorialoptimiza-tionproblemssuchasILP.ThedifferencebetweenourmethodandILPisthatthesub-probleminILPisarelaxedLP,whichrequiresanumericalsolution,whileoursboundsthenon-localfeaturesbyalin-earcombinationoflocalfeaturesandusesDPfordecodingaswellascalculatingtheupperboundoftheobjectivefunction.Anexactsolutionisachievediftheboundistight.Thoughintheworstcase,timecomplexityisexponentialinsentencelength,itispracticallyefﬁcientespeciallywhenadoptingapruningstrategy.ExperimentsareconductedonEnglishPennTreeBankandChineseTreeBank5(CTB5)withstan-dardtrain/develop/testsplit.Weachieved93.17%UnlabeledAttachmentScore(UAS)forEnglishataspeedof177wordspersecondand87.25%forChi-neseataspeedof97wordspersecond.2GraphBasedParsing2.1ProblemDeﬁnitionGivenasentencex=x1,x2,…,xnwherexiistheithwordofthesentence,dependencyparsingas-signsexactlyoneheadwordtoeachword,sothatdependenciesfromheadwordstomodiﬁersformatree.Therootofthetreeisaspecialsymbolde-notedbyx0whichhasexactlyonemodiﬁer.Inthispaper,wefocusonunlabeledprojectivedependencyparsingbutouralgorithmcanbeadaptedforlabeledornon-projectivedependencyparsing(McDonaldetal.,2005b).Theinferenceproblemistosearchtheoptimalparsetreey∗y∗=argmaxy∈Y(x)ϕ(x,y)whereY(x)isthesetofallcandidateparsetreesofsentencex.ϕ(x,y)isagivenscorefunctionwhichisusuallydecomposedintosmallpartsϕ(x,y)=∑c⊆yϕc(x)(1)wherecisasubsetofedges,andiscalledafactor.Forexample,intheallgrandchildmodel(KooandCollins,2010),thescorefunctioncanberepresentedasϕ(x,y)=∑ehm∈yϕehm(x)+∑egh,ehm∈yϕegh,ehm(x)wheretheﬁrsttermisthesumofscoresofalledgesxh→xm,andthesecondtermisthesumofthescoresofalledgechainsxg→xh→xm.Indiscriminativemodels,thescoreofaparsetreeyistheweightedsumoftheﬁredfeaturefunctions,whichcanberepresentedbythesumofthefactorsϕ(x,y)=wTf(x,y)=∑c⊆ywTf(x,c)=∑c⊆yϕc(x)wheref(x,c)isthefeaturevectorthatdependsonc.Forexample,wecoulddeﬁneafeatureforgrand-childc={egh,ehm}f(x,c)=1ifxg=would∧xh=be∧xm=happy∧cisselected0otherwise

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

:
/
/

d
i
r
e
c
t
.

i
t
.

e
d
u

/
t

a
c
l
/

a
r
t
i
c
e
–
p
d

f
/

d
o

i
/

1
0
1
1
6
2

/
t

a
c
_
a
_
0
0
2
0
8
1
5
6
6
6
3
3

/
t

a
c
_
a
_
0
0
2
0
8
p
d

b
y
g
u
e
s
t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

2.2DynamicProgrammingforLocalModelsInﬁrstordermodels,allfactorscinEq(1)containasingleedge.TheoptimalparsetreecanbederivedbyDPwithrunningtimeO(n3)(Eisner,1996).Thealgorithmhastwotypesofstructures:completes-pan,whichconsistsofaheadwordanditsdescen-dantsononeside,andincompletespan,whichcon-sistsofadependencyandtheregionbetweentheheadandmodiﬁer.Itstartsatsinglewordspans,andmergesthespansinbottomuporder.Forsecondordermodels,thescorefunctionϕ(x,y)addsthescoresofsiblings(adjacentedgeswithacommonhead)andgrandchildrenϕ(x,y)=∑ehm∈yϕehm(x)+∑egh,ehm∈yϕehm,egh(x)+∑ehm,ehs∈yϕehm,ehs(x)Therearetwoversionsofsecondordermodels,usedrespectivelybyCarreras(2007)andKooetal.(2010).ThedifferenceisthatCarreras’onlycon-siderstheoutermostgrandchildren,whileKooandCollin’sallowsallgrandchildfeatures.BothmodelspermitO(n4)runningtime.Third-ordermodelsscoreedgetriplessuchasthreeadjacentsiblingmodiﬁers,orgrand-siblingsthatscoreaword,itsmodiﬁeranditsadjacentgrand-children,andtheinferencecomplexityisO(n4)(KooandCollins,2010).Inthispaper,forallthefactors/featuresthatcanbehandledbyDP,wecallthemthelocalfac-tors/features.3TheProposedMethod3.1BasicIdeaForgeneralhighordermodelswithnon-localfea-tures,weproposetouseBranchandBound(B&B)algorithmtosearchtheoptimalparsetree.AB&Balgorithmhastwosteps:branchingandbounding.Thebranchingsteprecursivelysplitsthesearchs-paceY(x)intotwodisjointsubspacesY(x)=Y1∪Y2byﬁxingassignmentofoneedge.ForeachsubspaceYi,theboundingstepcalculatestheupperboundoftheoptimalparsetreescoreinthesub-space:UBYi≥maxy∈Yiϕ(x,y).IfthisboundisnomorethananyobtainedparsetreescoreUBYi≤ϕ(x,y′),thenallparsetreesinsubspaceYiarenomoreoptimalthany′,andYicouldbeprunedsafely.TheefﬁciencyofB&Bdependsonthebranchingstrategyandupperboundcomputation.Forexam-ple,Sunetal.(2012)usedB&BforMRFs,wheretheyproposedtwobranchingstrategiesandanoveldatastructureforefﬁcientupperboundcomputation.KlennerandAilloud(2009)proposedavariationofBalasalgorithm(Balas,1965)forcoreferencereso-lution,wherecandidatebranchingvariablesaresort-edbytheirweights.Ourboundingstrategyistoﬁndanupperboundforthescoreofeachnon-localfactorccontainingmultipleedges.Theboundisthesumofnewscoresofedgesinthefactorplusaconstantϕc(x)≤∑e∈cψe(x)+αcBasedonthenewscores{ψe(x)}andconstants{αc},wedeﬁnethenewscoreofparsetreeyψ(x,y)=∑c⊆y(∑e∈cψe(x)+αc)Thenwehaveψ(x,y)≥ϕ(x,y),∀y∈Y(x)Theadvantageofsuchaboundisthat,itisthesumofnewedgescores.Hence,itsoptimumtreemaxy∈Y(x)ψ(x,y)canbefoundbyDP,whichistheupperboundofmaxy∈Y(x)ϕ(x,y),asforanyy∈Y(x),ψ(x,y)≥ϕ(x,y).3.2TheUpperBoundFunctionInthissection,wederivetheupperboundfunctionψ(x,y)describedabove.Tosimplifynotation,wedropxthroughouttherestofthepaper.Letzcbeabinaryvariableindicatingwhetherfactorcisse-lectedintheparsetree.WereformulatethescorefunctioninEq(1)asϕ(y)≡ϕ(z)=∑cϕczc(2)

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

:
/
/

d
i
r
e
c
t
.

i
t
.

e
d
u

/
t

a
c
l
/

a
r
t
i
c
e
–
p
d

f
/

d
o

i
/

1
0
1
1
6
2

/
t

a
c
_
a
_
0
0
2
0
8
1
5
6
6
6
3
3

/
t

a
c
_
a
_
0
0
2
0
8
p
d

b
y
g
u
e
s
t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Correspondingly,thetreeconstraintisreplacedbyz∈Z.Thentheparsingtaskisz∗=argmaxz∈Zϕczc(3)Noticethat,foranyzc,wehavezc=mine∈czewhichmeansthatfactorcappearsinparsetreeifandonlyifallitsedges{e|e∈c}areselectedinthetree.Herezeisshortforz{e}forsimplicity.Ourboundingmethodisbasedonthefollowingfact:foraset{a1,a2,…ar}(ajdenotesthejthel-ement),itsminimummin{aj}=minp∈∆∑jpjaj(4)where∆isprobabilitysimplex∆={p|pj≥0,∑jpj=1}Wediscusstheboundforϕczcintwocases:ϕc≥0andϕc<0.Ifϕc≥0,wehaveϕczc=ϕcmine∈cze=ϕcminpc∈∆∑e∈cpecze=minpc∈∆∑e∈cϕcpeczeThesecondequationcomesfromEq(4).Forsim-plicity,letgc(pc,z)=∑e∈cϕcpeczewithdomaindomgc={pc∈∆;ze∈{0,1},∀e∈c}.Thenwehaveϕczc=minpcgc(pc,z)(5)Ifϕc<0,wehavetwoupperbounds.OneiscommonlyusedinILPwhenallthevariablesarebi-narya∗=minj{aj}rj=1⇔a∗≤aja∗≥∑jaj−(r−1)Accordingtothelastinequality,wehavetheupperboundfornegativescoredfactorsϕczc≤ϕc(∑e∈cze−(rc−1))(6)wherercisthenumberofedgesinc.Forsimplicity,weusethenotationσc(z)=ϕc(∑e∈cze−(rc−1))Theotherupperboundwhenϕc<0issimpleϕczc≤0(7)Noticethat,foranyparsetree,oneoftheupperboundsmustbetight.Eq(6)istightifcappearsintheparsetree:zc=1,otherwiseEq(7)istight.Thereforeϕczc=min{σc(z),0}Lethc(pc,z)=p1cσc(z)+p2c·0withdomhc={pc∈∆;ze∈{0,1},∀e∈c}.AccordingtoEq(4),wehaveϕczc=minpchc(pc,z)(8)Letψ(p,z)=∑c,ϕc≥0gc(pc,z)+∑c,ϕc<0hc(pc,z)Minimizeψwithrespecttop,wehaveminpψ(p,z)=minp∑c,ϕc≥0gc(pc,z)+∑c,ϕc<0hc(pc,z)=∑c,ϕc≥0minpcgc(pc,z)+∑c,ϕc<0minpchc(pc,z)=∑c,ϕc≥0ϕczc+∑c,ϕc<0ϕczc=ϕ(z)Thesecondequationholdssince,foranytwofac-tors,candc′,gc(orhc)andgc′(orhc′)areseparable.ThethirdequationcomesfromEq(5)andEq(8).Basedonthis,wehavethefollowingproposition: l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 2 0 8 1 5 6 6 6 3 3 / / t l a c _ a _ 0 0 2 0 8 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 41 Proposition1.Foranyp,pc∈∆,andz∈Z,ψ(p,z)≥ϕ(z).Therefore,ψ(p,z)isanupperboundfunctionofϕ(z).Furthermore,ﬁxingp,ψ(p,z)isalinearfunc-tionofze,seeEq(5)andEq(8),variableszcforlargefactorsareeliminated.Hencez′=argmaxzψ(p,z)canbesolvedefﬁcientlybyDP.Becauseψ(p,z′)≥ψ(p,z∗)≥ϕ(z∗)≥ϕ(z′)afterobtainingz′,wegettheupperboundandlowerboundofϕ(z∗):ψ(p,z′)andϕ(z′).Theupperboundisexpectedtobeastightaspos-sible.Usingmin-maxinequality,wegetmaxz∈Zϕ(z)=maxz∈Zminpψ(p,z)≤minpmaxz∈Zψ(p,z)whichprovidesthetightestupperboundofϕ(z∗).Sinceψisnotdifferentiablew.r.tp,projectedsub-gradient(CalamaiandMor´e,1987;Rushetal.,2010)isusedtosearchthesaddlepoint.Morespeciﬁcally,ineachiteration,weﬁrstﬁxpandsearchzusingDP,thenweﬁxzandupdatepbypnew=P∆(p+∂ψ∂pα)whereα>0isthestepsizeinlinesearch,functionP∆(q)denotestheprojectionofqontotheproba-bilitysimplex∆.Inthispaper,weuseEuclideanprojection,thatisP∆(q)=minp∈∆∥p−q∥2whichcanbesolvedefﬁcientlybysorting(Duchietal.,2008).3.3BranchandBoundBasedParsingAsdiscussedinSection3.1,theB&Brecursivepro-cedureyieldsabinarytreestructurecalledBranchandBoundtree.EachnodeoftheB&Btreehassomeﬁxedze,specifyingsomemust-selectedgesandmust-removeedges.TherootoftheB&Btreehasnoconstraints,soitcanproduceallpossibleparsetreesincludingz∗.Eachnodehastwochil-dren.Oneaddsaconstraintze=1forafreeedgez=e1010101z=e2ψφ=9=4ψLB,weneedtochooseagoodbranchingvariabletogener-atethechildnodes.LetG(z′)=ψ(p′,z′)−ϕ(z′)denotethegapbetweentheupperboundandlowerbound.Thisgapisactuallytheaccumulatedgapsofallfactorsc.LetGcbethegapofcGc={gc(p′c,z′)−ϕcz′cifϕc≥0hc(p′c,z′)−ϕcz′cifϕc<0 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 2 0 8 1 5 6 6 6 3 3 / / t l a c _ a _ 0 0 2 0 8 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 42 Wechoosethebranchingvariableheuristically:foreachedgee,wedeﬁneitsgapasthesumofthegapsoffactorsthatcontainitGe=∑c,e∈cGcTheedgewiththemaximumgapisselectedasthebranchingvariable.SupposethereareNnodesonalevelofB&Btree,andcorrespondingly,wegetNbranchingvari-ables,amongwhich,wechoosetheonewiththehighestlowerboundasitlikelyreachestheoptimalvaluefaster.3.4LowerBoundInitializationAlargelowerboundiscriticalforefﬁcientpruning.Inthissection,wediscussanalternativewaytoini-tializethelowerboundLB.Weapplythesimilartricktogetthelowerboundfunctionofϕ(z).SimilartoEq(8),forϕc≥0,wehaveϕczc=max{ϕc(∑e∈cze−(rc−1)),0}=max{σc(z),0}Usingthefactthatmax{aj}=maxp∈∆∑jpjajwehaveϕczc=maxpc∈∆p1cσc(z)+p2c·0=maxpchc(pc,z)Forϕc<0,wehaveϕczc=maxe∈c{ϕcze}=maxpc∈∆∑e∈cpecϕcze=maxpcgc(pc,z)Putthetwocasestogether,wegetthelowerboundfunctionπ(p,z)=∑c,ϕc≥0hc(pc,z)+∑c,ϕc<0gc(pc,z)Algorithm1BranchandBoundbasedparsingRequire:{ϕc}Ensure:Optimalparsetreez∗Solvep∗,z∗=argmaxp,zπ(p,z)InitializeS={Z},LB=π(p∗,z∗)whileS̸=∅doSetS′=∅{nodesthatsurvivefrompruning}foreachS∈SSolveminpmaxzψ(p,z)togetLBS,UBSLB=max{LB,LBS∈S},updatez∗foreachS∈S,addStoS′,ifUBS>LBSelectabranchingvariableze.ClearS=∅foreachS∈S′AddS1={z|z∈S,ze=1}toSAddS2={z|z∈S,ze=0}toS.endwhileForanyp,pc∈∆,z∈Zπ(p,z)≤ϕ(z)π(p,z)isnotconcave,however,wecouldalterna-tivelyoptimizezandptogetagoodapproximation,whichprovidesalowerboundforϕ(z∗).3.5SummaryWesummarizeourB&BalgorithminAlgorithm1.Itisworthpointingoutthatsofarintheabovedescription,wehaveusedtheassumptionthatthebackboneDPusesﬁrstordermodels,however,thebackboneDPcanbethesecondorthirdorderver-sion.Thedifferenceisthat,forhigherorderDP,higherorderfactorssuchasadjacentsiblings,grand-childrenaredirectlyhandledaslocalfactors.Intheworstcase,alltheedgesareselectedforbranching,andthecomplexitygrowsexponentiallyinsentencelength.However,inpractice,itisquiteefﬁcient,aswewillshowinthenextsection.4Experiments4.1ExperimentalSettingsThedatasetsweusedaretheEnglishPennTreeBank(PTB)andChineseTreeBank5.0(CTB5).Weusethestandardtrain/develop/testsplitasdescribedinTable1.WeextracteddependenciesusingJoakimNivre’sPenn2Malttoolwithstandardheadrules:YamadaandMatsumoto’s(YamadaandMatsumoto,2003)

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

:
/
/

d
i
r
e
c
t
.

i
t
.

e
d
u

/
t

a
c
l
/

a
r
t
i
c
e
–
p
d

f
/

d
o

i
/

1
0
1
1
6
2

/
t

a
c
_
a
_
0
0
2
0
8
1
5
6
6
6
3
3

/
t

a
c
_
a
_
0
0
2
0
8
p
d

b
y
g
u
e
s
t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

TrainDevelopTestPTBsec.2-21sec.22sec.23CTB5sec.001-815sec.886-931sec.816-8851001-11361148-11511137-1147Table1:DatasplitinourexperimentforEnglish,andZhangandClark’s(ZhangandClark,2008)forChinese.Unlabeledattachments-core(UAS)isusedtoevaluateparsingquality1.TheB&BparserisimplementedwithC++.Alltheex-perimentsareconductedontheplatformIntelCorei5-2500CPU3.30GHz.4.2Baseline:DPBasedSecondOrderParserWeusethedynamicprogrammingbasedsecondor-derparser(Carreras,2007)asthebaseline.Aver-agedstructuredperceptron(Collins,2002)isusedforparameterestimation.Wedeterminethenumberofiterationsonthevalidationset,whichis6forbothcorpora.ForEnglish,wetrainthePOStaggerusinglinearchainperceptronontrainingset,andpredictPOStagsforthedevelopmentandtestdata.TheparseristrainedusingtheautomaticPOStagsgeneratedby10foldcrossvalidation.ForChinese,weusethegoldstandardPOStags.Weuseﬁvetypesoffeatures:unigramfeatures,bigramfeatures,in-betweenfeatures,adjacentsib-lingfeaturesandoutermostgrand-childfeatures.TheﬁrstthreetypesoffeaturesareﬁrstlyintroducedbyMcDonaldetal.(2005a)andthelasttwotype-soffeaturesareusedbyCarreras(2007).Allthefeaturesaretheconcatenationofsurroundingwords,lowercasedwords(Englishonly),wordlength(Chi-neseonly),preﬁxesandsufﬁxesofwords(Chineseonly),POStags,coarsePOStagswhicharederivedfromPOStagsusingasimplemappingtable,dis-tancebetweenheadandmodiﬁer,directionofedges.ForEnglish,weused674featuretemplatestogener-atelargeamountsoffeatures,andﬁnallygot86.7Mnon-zeroweightedfeaturesaftertraining.Thebase-lineparsergot92.81%UASonthetestingset.ForChinese,weused858featuretemplates,andﬁnallygot71.5Mnon-zeroweightedfeaturesaftertrain-1ForEnglish,wefollowKooandCollins(2010)andignoreanywordwhosegold-standardPOStagisoneof{“”:,.}.ForChinese,weignoreanywordwhosePOStagisPU.ing.Thebaselineparsergot86.89%UASonthetestingset.4.3B&BBasedParserwithNon-localFeaturesWeusethebaselineparserasthebackboneofourB&Bparser.Wetrieddifferenttypesofnon-localfeaturesaslistedbelow:•Allgrand-childfeatures.Noticethatthisfea-turecanbehandledbyKoo’ssecondordermodel(KooandCollins,2010)directly.•Allgreatgrand-childfeatures.•Allsiblingfeatures:allthepairsofedgeswithcommonhead.AnexampleisshowninFig-ure2.•Alltri-siblingfeatures:allthe3-tuplesofedgeswithcommonhead.•Combfeatures:foranywordwithmorethan3consecutivemodiﬁers,thesetofalltheedgesfromthewordtothemodiﬁersformacomb.2•Handcraftedfeatures:Weperformcrossval-idationonthetrainingdatausingthebaselineparser,anddesignedfeaturesthatmaycorrec-tthemostcommonerrors.Wedesigned13hand-craftfeaturesforEnglishintotal.Oneex-ampleisshowninFigure3.ForChinese,wedidnotaddanyhand-craftfeatures,astheer-rorsinthecrossvalidationresultvaryalot,andwedidnotﬁndgeneralpatternstoﬁxthem.4.4ImplementationDetailsTospeedupthesolutionofthemin-maxsubprob-lem,foreachnodeintheB&Btree,weinitializepwiththeoptimalsolutionofitsparentnode,sincethechildnodeﬁxesonlyoneadditionaledge,itsop-timalpointislikelytobeclosedtoitsparent’s.FortherootnodeofB&Btree,weinitializepec=1rcforfactorswithnon-negativeweightsandp1c=0for2Infact,ouralgorithmcandealwithnon-consecutivemod-iﬁers;however,insuchcases,factordetection(detectregularexpressionslikex1.∗x2.∗…)requiresthelongestcom-monsubsequencealgorithm(LCS),whichistime-consumingifmanycombfeaturesaregenerated.Similarproblemsariseforsub-treefeatures,whichmaycontainmanynon-consecutivewords.

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

:
/
/

d
i
r
e
c
t
.

i
t
.

e
d
u

/
t

a
c
l
/

a
r
t
i
c
e
–
p
d

f
/

d
o

i
/

1
0
1
1
6
2

/
t

a
c
_
a
_
0
0
2
0
8
1
5
6
6
6
3
3

/
t

a
c
_
a
_
0
0
2
0
8
p
d

b
y
g
u
e
s
t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

c0c1c2c3c0hc0c1c0hc0c2c0hc0c3c0hc1c2hc2c3hc1c3hsecond orderhigher orderFigure2:Anexampleofallsiblingfeatures.Top:asub-tree;Bottom:extractedsiblingfeatures.Ex-istinghigherorderDPsystemscannothandlethesiblingsonbothsidesofhead.regulation occurs through inaction , rather than through …Figure3:Anexampleofhand-craftfeature:forthewordsequenceA…ratherthanA,whereAisapreposition,theﬁrstAistheheadofthan,thanistheheadofratherandthesecondA.negativeweightedfactors.Stepsizeαisinitializedwithmaxc,ϕc̸=0{1|ϕc|},asthevectorpisboundedinaunitbox.αisupdatedusingthesamestrategyasRushetal.(2010).Twostoppingcriteriaareused.Oneis0≤ψold−ψnew≤ϵ,whereϵ>0isagivenprecision3.Theotherchecksiftheboundistight:UB=LB.Becauseallfeaturesareboolean(notethattheycanbeinteger),theirweightsareintegerduringeachperceptronupdate,hencethescoresofparsetreesarediscrete.Theminimalgapbetweendifferentscoresis1N×Tafteraveraging,whereNisthenumberoftrainingsamples,andTistheitera-tionnumberforperceptrontraining.ThereforetheupperboundcanbetightenedasUB=⌊NTψ⌋NT.Duringtesting,weusethepre-pruningmethodasusedinMartinsetal.(2009)forbothdatasetstobal-anceparsingqualityandspeed.Thismethodusesasimpleclassiﬁertoselectthetopkcandidatehead-sforeachwordandexcludetheotherheadsfromsearchspace.Inourexperiment,wesetk=10.3weuseϵ=10−8inourimplementationSystemPTBCTBOurbaseline92.8186.89B&B+allgrand-child92.9787.02+allgreatgrand-child92.7886.77+allsibling93.0087.05+alltri-sibling92.7986.81+comb92.8686.91+handcraft92.89N/A+allgrand-child+allsibling+com-b+handcraft93.1787.253rdorderre-impl.93.0387.07TurboParser(reported)92.62N/ATurboParser(ourrun)92.8286.05KooandCollins(2010)93.04N/AZhangandMcDonald(2012)93.0686.87ZhangandNivre(2011)92.9086.00SystemintegrationBohnetandKuhn(2012)93.3987.5SystemsusingadditionalresourcesSuzukietal.(2009)93.79N/AKooetal.(2008)93.5N/AChenetal.(2012)92.76N/ATable2:Comparisonbetweenoursystemandthe-state-of-artsystems.4.5MainResultExperimentalresultsarelistedinTable2.Forcom-parison,wealsoincluderesultsofrepresentativestate-of-the-artsystems.Forthethirdorderpars-er,were-implementedModel1(KooandCollins,2010),andremovedthelongestsentenceintheCTBdataset,whichcontains240words,duetotheO(n4)spacecomplexity4.ForILPbasedparsing,weusedTurboParser5,aspeed-optimizedparsertoolkit.Wetrainedfullmodels(whichuseallgrandchildfea-tures,allsiblingfeaturesandheadbigramfeatures(Martinsetal.,2011))forbothdatasetsusingitsde-faultsettings.WealsolisttheperformanceinitsdocumentationonEnglishcorpus.Theobservationisthat,theall-siblingfeaturesaremosthelpfulforourparser,assomegoodsiblingfeaturescannotbeencodedinDPbasedparser.Forexample,amatchedpairofparenthesesarealwayssiblings,buttheirheadmayliebetweenthem.An-4Infact,Koo’salgorithmrequiresonlyO(n3)space.OurimplementationisO(n4)becausewestorethefeaturevectorsforfasttraining.5http://www.ark.cs.cmu.edu/TurboParser/

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

:
/
/

d
i
r
e
c
t
.

i
t
.

e
d
u

/
t

a
c
l
/

a
r
t
i
c
e
–
p
d

f
/

d
o

i
/

1
0
1
1
6
2

/
t

a
c
_
a
_
0
0
2
0
8
1
5
6
6
6
3
3

/
t

a
c
_
a
_
0
0
2
0
8
p
d

b
y
g
u
e
s
t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

otherobservationisthatallgreatgrandchildfeaturesandalltri-siblingfeaturesslightlyhurttheperfor-manceandweexcludedthemfromtheﬁnalsystem.Whennoadditionalresourceisavailable,ourparserachievedcompetitiveperformance:93.17%UnlabeledAttachmentScore(UAS)forEnglishataspeedof177wordspersecondand87.25%forChineseataspeedof97wordspersecond.High-erUASisreportedbyjointtaggingandparsing(BohnetandNivre,2012)orsystemintegration(BohnetandKuhn,2012)whichbeneﬁtsfrombothtransitionbasedparsingandgraphbasedparsing.Previousworkshowsthatcombinationofthetwoparsingtechniquescanlearntoovercometheshort-comingsofeachnon-integratedsystem(NivreandMcDonald,2008;ZhangandClark,2008).Sys-temcombinationwillbeaninterestingtopicforourfutureresearch.ThehighestreportedperformanceonEnglishcorpusis93.79%,obtainedbysemi-supervisedlearningwithalargeamountofunla-beleddata(Suzukietal.,2009).4.6TradeoffBetweenAccuracyandSpeedInthissection,westudythetradeoffbetweenac-curacyandspeedusingdifferentpre-pruningsetups.InTable3,weshowtheparsingaccuracyandin-ferencetimeintestingstagewithdifferentnumbersofcandidateheadskinpruningstep.Wecanseethat,onEnglishdataset,whenk≥10,ourpars-ercouldgain2−3timesspeedupwithoutlosingmuchparsingaccuracy.Thereisafurtherincreaseofthespeedwithsmallerk,atthecostofsomeac-curacy.ComparedwithTurboParser,ourparserislessefﬁcientbutmoreaccurate.ZhangandMcDon-ald(2012)isastate-of-the-artsystemwhichadoptscubepruningforefﬁcientparsing.Noticethat,theydidnotusepruningwhichseemstoincreaseparsingspeedwithlittlehitinaccuracy.Moreover,theydidlabeledparsing,whichalsomakestheirspeednotdirectlycomparable.ForeachnodeofB&Btree,ourparsingalgorithmusesprojectedsub-gradientmethodtoﬁndthesad-dlepoint,whichrequiresanumberofcallstoaDP,hencetheefﬁciencyofAlgorithm1ismainlydeter-minedbythenumberofDPcalls.Figure4andFig-ure5showtheaveragedparsingtimeandnumberofcallstoDPrelativetothesentencelengthwithdiffer-entpruningsettings.ParsingtimegrowssmoothlyPTBCTBSystemUASw/sUASw/sOurs(noprune)93.185287.2873Ours(k=20)93.1710587.2876Ours(k=10)93.1717787.2597Ours(k=5)93.1026486.94108Ours(k=3)92.6849385.76128TurboParser(full)92.8240286.05192TurboParser(standard)92.6863885.80283TurboParser(basic)90.97467082.282736ZhangandMcDon-ald(2012)†93.0622086.87N/ATable3:Tradeoffbetweenparsingaccuracy(UAS)andspeed(wordspersecond)withdifferentpre-pruningsettings.kdenotesthenumberofcandi-dateheadsofeachwordpreservedforB&Bparsing.†Theirspeedisnotdirectlycomparableastheyper-formslabeledparsingwithoutpruning.whensentencelength≤40.Thereissomeﬂuctua-tionforthelongsentences.Thisisbecausethereareveryfewsentencesforaspeciﬁclonglength(usual-ly1or2sentences),andthestatisticsarenotstableormeaningfulforthesmallsamples.Withoutpruning,thereareintotal132,161callstoparse2,416Englishsentences,thatis,eachsen-tencerequires54.7callsonaverage.ForChinese,thereare84,645callsfor1,910sentences,i.e.,44.3callsforeachsentenceonaverage.5Discussion5.1PolynomialNon-localFactorsOurboundingstrategycanhandleafamilyofnon-localfactorsthatcanbeexpressedasapolynomialfunctionoflocalfactors.Toseethis,supposezc=∑iαi∏e∈EizeForeachi,weintroducenewvariablezEi=mine∈Eize.Becausezeisbinary,zEi=∏e∈Eize.Inthisway,wereplacezcbyseveralzEithatcanbehandledbyourboundingstrategy.Wegivetwoexamplesofthesepolynomialnon-localfactors.FirstistheORoflocalfactors:zc=max{ze,z′e},whichcanbeexpressedbyzc=ze+z′e−zez′e.Thesecondisthefactorofvalencyfeature

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

:
/
/

d
i
r
e
c
t
.

i
t
.

e
d
u

/
t

a
c
l
/

a
r
t
i
c
e
–
p
d

f
/

d
o

i
/

1
0
1
1
6
2

/
t

a
c
_
a
_
0
0
2
0
8
1
5
6
6
6
3
3

/
t

a
c
_
a
_
0
0
2
0
8
p
d

b
y
g
u
e
s
t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

01020304050600510parsing time (sec.)sentence lengthk=3k=5k=10k=20no prune(a)PTBcorpus020406080100120140020406080parsing time (sec.)sentence lengthk=3k=5k=10k=20no prune(b)CTBcorpusFigure4Averagedparsingtime(seconds)relativetosentencelengthwithdifferentpruningsettings,kdenotesthenumberofcandidateheadsofeachwordinpruningstep.01020304050600100200Calls to DPsentence lengthk=3k=5k=10k=20no prune(a)PTBcorpus02040608010012014005001000Calls to DPsentence lengthk=3k=5k=10k=20no prune(b)CTBcorpusFigure5AveragednumberofCallstoDPrelativetosentencelengthwithdifferentpruningsettings,kdenotesthenumberofcandidateheadsofeachwordinpruningstep.(Martinsetal.,2009).Letbinaryvariablevikindi-catewhetherwordihaskmodiﬁers.Given{ze}fortheedgeswithheadi,then{vik|k=1,…,n−1}canbesolvedby∑kkjvik=(∑eze)j0≤j≤n−1Theleftsideoftheequationisthelinearfunctionofvik.Therightsideoftheequationisapolynomialfunctionofze.Hence,vikcouldbeexpressedasapolynomialfunctionofze.5.2kBestParsingThoughourB&Balgorithmisabletocaptureava-rietyofnon-localfeatures,itisstilldifﬁculttohan-dlemanykindsoffeatures,suchasthedepthoftheparsetree.Hence,arerankingapproachmaybeuse-fulinordertoincorporatesuchinformation,wherekparsetreescanbegeneratedﬁrstandthenasecondpassmodelisusedtorerankthesecandidatesbasedonmoreglobalornon-localfeatures.Inaddition,k-bestparsingmaybeneededinmanyapplicationstouseparseinformationandespeciallyutilizeinfor-mationfrommultiplecandidatestooptimizetask-speciﬁcperformance.Wehavenotconductedanyexperimentforkbestparsing,henceweonlydis-cussthealgorithm.Accordingtoproposition1,wehaveProposition2.GivenpandsubsetS⊆Z,letzkdenotethekthbestsolutionofmaxz∈Sψ(p,z).Ifaparsetreez′∈Ssatisﬁesϕ(z′)≥ψ(p,zk),thenz′isoneofthekbestparsetreesinsubsetS.Proof.Sincezkisthekthbestsolutionofψ(p,z),forzj,j>k,wehaveψ(p,zk)≥ψ(p,zj)≥ϕ(zj).Sincethesizeoftheset{zj|j>k}is|S|−k,hencethereareatleast|S|−kparsetreeswhosescoresϕ(zj)arelessthanψ(p,zk).Becauseϕ(z′)≥ψ(p,zk),hencez′isatleastthekthbestparsetreeinsubsetS.Therefore,wecansearchthekbestparsetreesinthisway:foreachsub-problem,weuseDPtoderivethekbestparsetrees.Foreachparsetreez,ifϕ(z)≥ψ(p,zk),thenzisselectedintothekbestset.Algorithmterminatesuntilthekthboundistight.

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

:
/
/

d
i
r
e
c
t
.

i
t
.

e
d
u

/
t

a
c
l
/

a
r
t
i
c
e
–
p
d

f
/

d
o

i
/

1
0
1
1
6
2

/
t

a
c
_
a
_
0
0
2
0
8
1
5
6
6
6
3
3

/
t

a
c
_
a
_
0
0
2
0
8
p
d

b
y
g
u
e
s
t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

6ConclusionInthispaperweproposedanewparsingalgorithmbasedonaBranchandBoundframework.Themo-tivationistousedynamicprogrammingtosearchforthebound.ExperimentalresultsonPTBandCTB5datasetsshowthatourmethodiscompetitiveintermsofbothperformanceandefﬁciency.Ourmethodcanbeadaptedtonon-projectivedependen-cyparsing,aswellasthekbestMSTalgorithm(Hall,2007)toﬁndthekbestcandidates.AcknowledgmentsWe’dliketothankHaoZhang,AndreMartinsandZhenghuaLifortheirhelpfuldiscussions.Weal-sothankRyanMcDonaldandthreeanonymousre-viewersfortheirvaluablecomments.ThisworkispartlysupportedbyDARPAunderContractNo.HR0011-12-C-0016andFA8750-13-2-0041.AnyopinionsexpressedinthismaterialarethoseoftheauthorsanddonotnecessarilyreﬂecttheviewsofDARPA.ReferencesMichaelAuliandAdamLopez.2011.AcomparisonofloopybeliefpropagationanddualdecompositionforintegratedCCGsupertaggingandparsing.InProc.ofACL-HLT.EgonBalas.1965.Anadditivealgorithmforsolvinglinearprogramswithzero-onevariables.OperationsResearch,39(4).BerndBohnetandJonasKuhn.2012.Thebestofbothworlds–agraph-basedcompletionmodelfortransition-basedparsers.InProc.ofEACL.BerndBohnetandJoakimNivre.2012.Atransition-basedsystemforjointpart-of-speechtaggingandla-belednon-projectivedependencyparsing.InProc.ofEMNLP-CoNLL.PaulCalamaiandJorgeMor´e.1987.Projectedgradien-tmethodsforlinearlyconstrainedproblems.Mathe-maticalProgramming,39(1).XavierCarreras.2007.Experimentswithahigher-orderprojectivedependencyparser.InProc.ofEMNLP-CoNLL.EugeneCharniakandMarkJohnson.2005.Coarse-to-ﬁnen-bestparsingandmaxentdiscriminativererank-ing.InProc.ofACL.WenliangChen,MinZhang,andHaizhouLi.2012.U-tilizingdependencylanguagemodelsforgraph-baseddependencyparsingmodels.InProc.ofACL.MichaelCollins.2000.Discriminativererankingfornat-urallanguageparsing.InProc.ofICML.MichaelCollins.2002.Discriminativetrainingmethodsforhiddenmarkovmodels:Theoryandexperimentswithperceptronalgorithms.InProc.ofEMNLP.JohnDuchi,ShaiShalev-Shwartz,YoramSinger,andTusharChandra.2008.Efﬁcientprojectionsontothel1-ballforlearninginhighdimensions.InProc.ofICML.JasonM.Eisner.1996.Threenewprobabilisticmodelsfordependencyparsing:anexploration.InProc.ofCOLING.KeithHall.2007.K-bestspanningtreeparsing.InProc.ofACL.LiangHuangandDavidChiang.2005.Betterk-bestparsing.InProc.ofIWPT.LiangHuang.2008.Forestreranking:Discriminativeparsingwithnon-localfeatures.InProc.ofACL-HLT.ManfredKlennerand´EtienneAilloud.2009.Opti-mizationincoreferenceresolutionisnotneeded:Anearly-optimalalgorithmwithintensionalconstraints.InProc.ofEACL.TerryKooandMichaelCollins.2010.Efﬁcientthird-orderdependencyparsers.InProc.ofACL.TerryKoo,XavierCarreras,andMichaelCollins.2008.Simplesemi-superviseddependencyparsing.InProc.ofACL-HLT.TerryKoo,AlexanderM.Rush,MichaelCollins,TommiJaakkola,andDavidSontag.2010.Dualdecomposi-tionforparsingwithnon-projectiveheadautomata.InProc.ofEMNLP.AilsaH.LandandAlisonG.Doig.1960.Anautomat-icmethodofsolvingdiscreteprogrammingproblems.Econometrica,28(3):497–520.AndreMartins,NoahSmith,andEricXing.2009.Con-ciseintegerlinearprogrammingformulationsforde-pendencyparsing.InProc.ofACL.AndreMartins,NoahSmith,MarioFigueiredo,andPe-droAguiar.2011.Dualdecompositionwithmanyoverlappingcomponents.InProc.ofEMNLP.RyanMcDonald,KobyCrammer,andFernandoPereira.2005a.Onlinelarge-margintrainingofdependencyparsers.InProc.ofACL.RyanMcDonald,FernandoPereira,KirilRibarov,andJanHajic.2005b.Non-projectivedependencypars-ingusingspanningtreealgorithms.InProc.ofHLT-EMNLP.JoakimNivreandRyanMcDonald.2008.Integratinggraph-basedandtransition-baseddependencyparsers.InProc.ofACL-HLT.JorgeNocedalandStephenJ.Wright.2006.NumericalOptimization.Springer,2ndedition.

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

:
/
/

d
i
r
e
c
t
.

i
t
.

e
d
u

/
t

a
c
l
/

a
r
t
i
c
e
–
p
d

f
/

d
o

i
/

1
0
1
1
6
2

/
t

a
c
_
a
_
0
0
2
0
8
1
5
6
6
6
3
3

/
t

a
c
_
a
_
0
0
2
0
8
p
d

b
y
g
u
e
s
t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

SebastianRiedelandJamesClarke.2006.Incrementalintegerlinearprogrammingfornon-projectivedepen-dencyparsing.InProc.ofEMNLP.AlexanderMRush,DavidSontag,MichaelCollins,andTommiJaakkola.2010.Ondualdecompositionandlinearprogrammingrelaxationsfornaturallanguageprocessing.InProc.ofEMNLP.DavidSmithandJasonEisner.2008.Dependencypars-ingbybeliefpropagation.InProc.ofEMNLP.MinSun,MuraliTelaprolu,HonglakLee,andSilvioSavarese.2012.EfﬁcientandexactMAP-MRFin-ferenceusingbranchandbound.InProc.ofAISTATS.JunSuzuki,HidekiIsozaki,XavierCarreras,andMichaelCollins.2009.Anempiricalstudyofsemi-supervisedstructuredconditionalmodelsfordependencyparsing.InProc.ofEMNLP.HiroyasuYamadaandYujiMatsumoto.2003.Statisticaldependencyanalysiswithsupportvectormachines.InProc.ofIWPT.YueZhangandStephenClark.2008.Ataleoft-woparsers:Investigatingandcombininggraph-basedandtransition-baseddependencyparsing.InProc.ofEMNLP.HaoZhangandRyanMcDonald.2012.Generalizedhigher-orderdependencyparsingwithcubepruning.InProc.ofEMNLP.YueZhangandJoakimNivre.2011.Transition-baseddependencyparsingwithrichnon-localfeatures.InProc.ofACL-HLT.
Download pdf