Transactions of the Association for Computational Linguistics, vol. 3, pp. 489–501, 2015. Action Editor: Sebastian Riedel.
Submission batch: 4/2015; Published 8/2015.

© 2015 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Approximation-Aware Dependency Parsing by Belief Propagation

Matthew R. Gormley    Mark Dredze    Jason Eisner
Human Language Technology Center of Excellence
Center for Language and Speech Processing
Department of Computer Science
Johns Hopkins University, Baltimore, MD
{mrg,mdredze,jason}@cs.jhu.edu

Abstract

We show how to train the fast dependency parser of Smith and Eisner (2008) for improved accuracy. This parser can consider higher-order interactions among edges while retaining O(n³) runtime. It outputs the parse with maximum expected recall—but for speed, this expectation is taken under a posterior distribution that is constructed only approximately, using loopy belief propagation through structured factors. We show how to adjust the model parameters to compensate for the errors introduced by this approximation, by following the gradient of the actual loss on training data. We find this gradient by back-propagation. That is, we treat the entire parser (approximations and all) as a differentiable circuit, as others have done for loopy CRFs (Domke, 2010; Stoyanov et al., 2011; Domke, 2011; Stoyanov and Eisner, 2012). The resulting parser obtains higher accuracy with fewer iterations of belief propagation than one trained by conditional log-likelihood.

1 Introduction

Recent improvements to dependency parsing accuracy have been driven by higher-order features. Such a feature can look beyond just the parent and child words connected by a single edge to also consider siblings, grandparents, etc. By including increasingly global information, these features provide more information for the parser—but they also complicate inference. The resulting higher-order parsers depend on approximate inference and decoding procedures, which may prevent them from predicting the best parse.

For example, consider the dependency parser we will train in this paper, which is based on the work of Smith and Eisner (2008). Ostensibly, this parser finds the minimum Bayes risk (MBR) parse under a probability distribution defined by a higher-order dependency parsing model. In reality, it achieves O(n³ tmax) runtime by relying on three approximations during inference: (1) variational inference by loopy belief propagation (BP) on a factor graph, (2) truncating inference after tmax iterations prior to convergence, and (3) a first-order pruning model to limit the number of edges considered in the higher-order model. Such parsers are traditionally trained as if the inference had been exact.1

In contrast, we train the parser such that the approximate system performs well on the final evaluation function. We treat the entire parsing computation as a differentiable circuit, and backpropagate the evaluation function through our approximate inference and decoding methods to improve its parameters by gradient descent. The system also learns to cope with model misspecification, where the model couldn't perfectly fit the distribution even absent the approximations. For standard graphical models, Stoyanov and Eisner (2012) call this approach ERMA, for "empirical risk minimization under approximations." For objectives besides empirical risk, Domke (2011) refers to it as "learning with truncated message passing."

Our primary contribution is the application of this approximation-aware learning method in the parsing setting, for which the graphical model involves a global constraint. Smith and Eisner (2008) previously showed how to run BP in this setting (by calling the inside-outside algorithm as a subroutine). We must backpropagate the downstream objective function through their algorithm so that we can follow its gradient.

1 For perceptron training, utilizing inexact inference as a drop-in replacement for exact inference can badly mislead the learner (Kulesza and Pereira, 2008; Huang et al., 2012).

We carefully define an empirical risk objective function (à la ERMA) to be smooth and differentiable, yet equivalent to accuracy of the minimum Bayes risk (MBR) parse in the limit. Finding this difficult to optimize, we introduce a new simpler objective function based on the L2 distance between the approximate marginals and the "true" marginals from the gold data.

The goal of this work is to account for the approximations made by a system rooted in structured belief propagation. Taking such approximations into account during training enables us to improve the speed and accuracy of inference at test time. We compare our training method with the standard approach of conditional log-likelihood (CLL) training. We evaluate our parser on 19 languages from the CoNLL-2006 (Buchholz and Marsi, 2006) and CoNLL-2007 (Nivre et al., 2007) Shared Tasks as well as the English Penn Treebank (Marcus et al., 1993). On English, the resulting parser obtains higher accuracy with fewer iterations of BP than CLL. On the CoNLL languages, we find that on average it yields higher accuracy parsers than CLL, particularly when limited to few BP iterations.

2 Dependency Parsing by Belief Propagation

This section describes the parser that we will train.

Model  A factor graph (Frey et al., 1997; Kschischang et al., 2001) defines the factorization of a probability distribution over a set of variables {Y1, Y2, ...}. It is a bipartite graph between variables Yi and factors α. Edges connect each factor α to a subset of the variables {Yα1, Yα2, ...}, called its neighbors. Each factor defines a potential function ψ_α, which assigns a nonnegative score to each configuration of its neighbors y_α = {yα1, yα2, ...}. We define the probability of a given assignment y = {y1, y2, ...} to be proportional to the product of all factors' potential functions: p(y) = (1/Z) ∏_α ψ_α(y_α).

Smith and Eisner (2008) define a factor graph for dependency parsing of a given n-word sentence: n² binary variables indicate which of the directed arcs are included (y_i = ON) or excluded (y_i = OFF) in the dependency parse. One of the factors plays the role of a hard global constraint: ψ_PTREE(y) is 1 or 0 according to whether the assignment encodes a projective dependency tree. Another n² factors (one per variable) evaluate the individual arcs given the sentence, so that p(y) describes a first-order dependency parser. A higher-order parsing model is achieved by also including higher-order factors, each scoring configurations of two or more arcs, such as grandparent and sibling configurations. Higher-order factors tend to create cycles in the factor graph. See Figure 1 for an example factor graph.

[Figure 1: Factor graph for dependency parsing of a 4-word sentence; $ is the root of the dependency graph. The boolean variable Y_{h,m} encodes whether the edge from parent h to child m is present. The unary factor (black) connected to this variable scores the edge in isolation (given the sentence). The PTREE factor (red) coordinates all variables to ensure that the edges form a tree. The drawing shows a few higher-order factors (purple for grandparents, green for arbitrary siblings); these are responsible for the graph being cyclic ("loopy").]

We define each potential function to have a log-linear form: ψ_α(y_α) = exp(θ · f_α(y_α, x)). Here x is the assignment to the observed variables such as the sentence and its POS tags; f_α extracts a vector of features; and θ is our vector of model parameters. We write the resulting probability distribution over parses as p_θ(y | x), to indicate that it depends on θ.

Loss  For dependency parsing, our loss function is the number of missing edges in the predicted parse ŷ, relative to the reference (or "gold") parse y*:

  ℓ(ŷ, y*) = Σ_{i: ŷ_i = OFF} I(y*_i = ON)    (1)

I is the indicator function. Because ŷ and y* each specify exactly one parent per word token, ℓ(ŷ, y*) equals the directed dependency error: the number of word tokens whose parent is predicted incorrectly.
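To make the loss concrete, here is a minimal Python sketch (illustrative, not the paper's code) of the directed dependency error of Eq. (1). Representing a parse as an array of head indices is an assumption made only for this example.

    def directed_dependency_error(pred_heads, gold_heads):
        """Loss of Eq. (1): the number of gold edges missing from the predicted
        parse.  Each parse is represented here (an illustrative choice) as a list
        whose m-th entry is the index of the predicted head of word m, with 0
        denoting the root $.  Since every word has exactly one head, this count
        equals the number of word tokens whose parent is predicted incorrectly."""
        assert len(pred_heads) == len(gold_heads)
        return sum(1 for p, g in zip(pred_heads, gold_heads) if p != g)

    # Example: a 4-word sentence in which one head is predicted incorrectly.
    print(directed_dependency_error([2, 0, 2, 3], [2, 0, 2, 2]))  # prints 1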

Decoder  To obtain a single parse as output, we use a minimum Bayes risk (MBR) decoder, which returns the tree with minimum expected loss under the model's distribution (Bickel and Doksum, 1977; Goodman, 1996). Our ℓ gives the decision rule:

  h_θ(x) = argmin_ŷ E_{y ∼ p_θ(·|x)}[ℓ(ŷ, y)]    (2)
         = argmax_ŷ Σ_{i: ŷ_i = ON} p_θ(y_i = ON | x)    (3)

Here ŷ ranges over well-formed parses. Thus, our parser seeks a well-formed parse h_θ(x) whose individual edges have a high probability of being correct according to p_θ (since it lacks knowledge y* of which edges are truly correct). MBR is the principled way to take a loss function into account under a probabilistic model. By contrast, maximum a posteriori (MAP) decoding does not consider the loss function. It would return the single highest-probability parse even if that parse, and its individual edges, were unlikely to be correct.2

All systems in this paper use MBR decoding to consider the loss function at test time. This implies that the ideal training procedure would be to find the true p_θ so that its marginals can be used in (3). Our baseline system attempts this. Yet in practice, we will not be able to find the true p_θ (model misspecification) nor exactly compute the marginals of p_θ (computational intractability). Thus, this paper proposes a training procedure that compensates for the system's approximations, adjusting θ to reduce the actual loss of h_θ(x) as measured at training time.

To find the MBR parse, we first run inference to compute the marginal probability p_θ(y_i = ON | x) for each edge. Then we maximize (3) by running a first-order dependency parser with edge scores equal to those probabilities.3 When our inference algorithm is approximate, we replace the exact marginal with its approximation—the belief from BP, given by b_i(ON) in (6) below.

Inference  Loopy belief propagation (BP) (Murphy et al., 1999) computes approximations to the variable marginals p_θ(y_i | x) = Σ_{y': y'_i = y_i} p_θ(y' | x), as needed by (3), as well as the factor marginals p_θ(y_α | x) = Σ_{y': y'_α = y_α} p_θ(y' | x). The algorithm proceeds by iteratively sending messages from variables, y_i, to factors, α:

  m^(t)_{i→α}(y_i) ∝ ∏_{β ∈ N(i)\α} m^(t−1)_{β→i}(y_i)    (4)

and from factors to variables:

  m^(t)_{α→i}(y_i) ∝ Σ_{y_α ∼ y_i} ψ_α(y_α) ∏_{j ∈ N(α)\i} m^(t−1)_{j→α}(y_j)    (5)

where N(i) and N(α) denote the neighbors of y_i and α respectively, and where y_α ∼ y_i is standard notation to indicate that y_α ranges over all assignments to the variables participating in the factor α provided that the ith variable has value y_i. Note that the messages at time t are computed from those at time (t−1). Messages at the final time tmax are used to compute the beliefs at each factor and variable:

  b_i(y_i) ∝ ∏_{α ∈ N(i)} m^(tmax)_{α→i}(y_i)    (6)
  b_α(y_α) ∝ ψ_α(y_α) ∏_{i ∈ N(α)} m^(tmax)_{i→α}(y_i)    (7)

We assume each of the messages and beliefs given in (4)–(7) are scaled to sum-to-one. For example, b_i is normalized such that Σ_{y_i} b_i(y_i) = 1 and approximates the marginal distribution over y_i values. Messages continue to change indefinitely if the factor graph is cyclic, but in the limit, the messages may converge. Although the equations above update all messages in parallel, convergence is much faster if only one message is updated per time step, in some well-chosen serial order.4

For the PTREE factor, the summation over variable assignments required for m^(t)_{α→i}(y_i) in Eq. (5) equates to a summation over exponentially many projective parse trees. However, we can use an inside-outside variant of Eisner (1996)'s algorithm to compute this in polynomial time (we describe this as hypergraph parsing in §3).

2 If we used a simple 0-1 loss function within (2), then MBR decoding would reduce to MAP decoding.
3 Prior work (Smith and Eisner, 2008; Bansal et al., 2014) used the log-odds ratio log p_θ(y_i = ON) / p_θ(y_i = OFF) as the edge scores for decoding, but this yields a parse different from the MBR parse.
4 Following Dreyer and Eisner (2009, footnote 22), we choose an arbitrary directed spanning tree of the factor graph rooted at the PTREE factor. We visit the nodes in topologically sorted order (from leaves to root) and update any message from the node being visited to a node that is later in the order. We then reverse this order and repeat, so that every message has been passed once. This constitutes one iteration of BP.
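For readers who want to see the message-passing updates spelled out, the following illustrative Python sketch (not the paper's implementation) applies the parallel updates of Eqs. (4)–(7) to a small factor graph by enumerating factor configurations explicitly; it is therefore only practical for small factors, and the PTREE factor would instead use the inside-outside trick of §5.3. The dictionary-based representation of variables, factors, and messages is an assumption of the example.

    import itertools

    def normalize(msg):
        z = sum(msg.values())
        return {v: s / z for v, s in msg.items()}

    def loopy_bp(variables, factors, t_max):
        """Parallel loopy BP as in Eqs. (4)-(7).  `variables` maps a variable name
        to its domain (e.g. ["ON", "OFF"]); `factors` maps a factor name to a pair
        (neighbors, potential), where `potential` takes one value per neighbor and
        returns a nonnegative score.  Returns the normalized variable beliefs."""
        edges = [(i, a) for a, (nbrs, _) in factors.items() for i in nbrs]
        msg_vf = {(i, a): {y: 1.0 for y in variables[i]} for i, a in edges}  # m_{i->a}
        msg_fv = {(a, i): {y: 1.0 for y in variables[i]} for i, a in edges}  # m_{a->i}
        for _ in range(t_max):
            # Eq. (4): each variable multiplies the old messages from its other factors.
            new_vf = {}
            for i, a in edges:
                m = {y: 1.0 for y in variables[i]}
                for b, j in msg_fv:
                    if j == i and b != a:
                        for y in m:
                            m[y] *= msg_fv[(b, j)][y]
                new_vf[(i, a)] = normalize(m)
            # Eq. (5): each factor sums over its configurations consistent with y_i.
            new_fv = {}
            for a, (nbrs, potential) in factors.items():
                for i in nbrs:
                    m = {y: 0.0 for y in variables[i]}
                    for config in itertools.product(*(variables[j] for j in nbrs)):
                        score = potential(*config)
                        for j, yj in zip(nbrs, config):
                            if j != i:
                                score *= msg_vf[(j, a)][yj]
                        m[config[nbrs.index(i)]] += score
                    new_fv[(a, i)] = normalize(m)
            msg_vf, msg_fv = new_vf, new_fv
        # Eq. (6): beliefs are products of the final incoming factor-to-variable messages.
        beliefs = {}
        for i, dom in variables.items():
            b = {y: 1.0 for y in dom}
            for (a, j), m in msg_fv.items():
                if j == i:
                    for y in b:
                        b[y] *= m[y]
            beliefs[i] = normalize(b)
        return beliefs

As the text notes, a well-chosen serial update order typically converges faster than this fully parallel schedule.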

The resulting "structured BP" inference procedure—detailed by Smith and Eisner (2008)—is exact for first-order dependency parsing. When higher-order factors are incorporated, it is approximate but remains fast, whereas exact inference would be slow.5

3 Approximation-aware Learning

We aim to find the parameters θ* that minimize a regularized objective function over the training sample of (sentence, parse) pairs {(x^(d), y^(d))}, d = 1, ..., D.

  θ* = argmin_θ (1/D) ( Σ_{d=1}^D J(θ; x^(d), y^(d)) + (λ/2) ||θ||²₂ )    (8)

where λ > 0 is the regularization coefficient and J(θ; x, y*) is a given differentiable function, possibly nonconvex. We locally minimize this objective using ℓ2-regularized AdaGrad with Composite Mirror Descent (Duchi et al., 2011)—a variant of stochastic gradient descent that uses mini-batches, an adaptive learning rate per dimension, and sparse lazy updates from the regularizer.6

Objective Functions  The standard choice for J is the negative conditional log-likelihood (§6). However, as in Stoyanov et al. (2011), our aim is to minimize expected loss on the true data distribution over sentence/parse pairs (X, Y):

  θ* = argmin_θ E[ℓ(h_θ(X), Y)]    (9)

Since the true data distribution is unknown, we substitute the expected loss over the training sample, and regularize our objective in order to reduce sampling variance. Specifically, we aim to minimize the regularized empirical risk, given by (8) with J(θ; x^(d), y^(d)) set to ℓ(h_θ(x^(d)), y^(d)). Note that this loss function would not be differentiable—a key issue we will take up below. This is the "ERMA" method of Stoyanov and Eisner (2012). We will also consider simpler choices of J—akin to the loss functions used by Domke (2011).

Gradient Computation  To compute the gradient ∇_θ J(θ; x, y*) of the loss on a single sentence (x, y*) = (x^(d), y^(d)), we apply automatic differentiation (AD) in the reverse mode (Griewank and Corliss, 1991). This yields the same type of "backpropagation" algorithm that has long been used for training neural networks (Rumelhart et al., 1986). It is important to note that the resulting gradient computation algorithm is exact up to floating-point error, and has the same asymptotic complexity as the original decoding algorithm, requiring only about twice the computation. The AD method applies provided that the original function is indeed differentiable with respect to θ. In principle, it is possible to compute the gradient with minimal additional coding. There exists AD software (some listed at autodiff.org) that could be used to derive the necessary code automatically. Another option would be to use the perturbation method of Domke (2010). However, we implemented the gradient computation directly, and we describe it here.

Inference, Decoding, and Loss as a Feedforward Circuit  The backpropagation algorithm is often applied to neural networks, where the topology of a feedforward circuit is statically specified and can be applied to any input. Our BP algorithm, decoder, and loss function similarly define a feedforward circuit that computes our function J. The circuit's depth depends on the number of BP time steps, tmax. Its topology is defined dynamically (per sentence x^(d)) by "unrolling" the computation into a graph. Figure 2 shows this topology. The high level modules consist of (A) computing potential functions, (B) initializing messages, (C) sending messages, (D) computing beliefs, and (E) decoding and computing the loss. We zoom in on two submodules: the first computes messages from the PTREE factor efficiently (C.1–C.3); the second computes a softened version of our loss function (E.1–E.3). Both of these submodules are made efficient by the inside-outside algorithm. The next two sections describe in greater detail how we define the function J (the forward pass) and how we compute its gradient (the backward pass).

5 How slow is exact inference for dependency parsing? For certain choices of higher-order factors, polynomial time is possible via dynamic programming (McDonald et al., 2005; Carreras, 2007; Koo and Collins, 2010). However, BP will typically be asymptotically faster (for a fixed number of iterations) and faster in practice. In some other settings, exact inference is NP-hard. In particular, non-projective parsing becomes NP-hard with even second-order factors (McDonald and Pereira, 2006). BP can handle this case in polynomial time by replacing the PTREE factor with a TREE factor that allows edges to cross.
6 θ is initialized to 0 when not otherwise specified.
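For concreteness, the sketch below shows the shape of an outer training loop for the objective in Eq. (8). It is a simplification: it uses a plain diagonal-AdaGrad step and folds the ℓ2 regularizer into each stochastic gradient, whereas the paper's optimizer (AdaGrad with Composite Mirror Descent, mini-batches, and lazy regularizer updates) is more involved. The function grad_J stands in for the backpropagation through inference, decoding, and loss described in the next two sections.

    import numpy as np

    def train(grad_J, examples, dim, epochs=5, eta=0.1, lam=1.0, eps=1e-8):
        """Simplified sketch of minimizing the regularized objective of Eq. (8).
        `grad_J(theta, x, y_gold)` is assumed to return the gradient of the
        per-sentence loss J(theta; x, y_gold) computed by backpropagation."""
        theta = np.zeros(dim)          # parameters start at 0, as in footnote 6
        grad_sq_sum = np.zeros(dim)    # per-dimension sum of squared gradients
        n = len(examples)
        for _ in range(epochs):
            for x, y_gold in examples:
                # Stochastic gradient of (1/D)(sum_d J + (lambda/2)||theta||^2).
                g = grad_J(theta, x, y_gold) + (lam / n) * theta
                grad_sq_sum += g * g
                theta -= eta * g / (np.sqrt(grad_sq_sum) + eps)
        return theta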

Backpropagation through the circuit from Figure 2 poses several challenges. Eaton and Ghahramani (2009), Stoyanov et al. (2011), and Domke (2011) showed how to backpropagate through the basic BP algorithm, and we reiterate the key details below (§5.2). The remaining challenges form the primary technical contribution of this paper:

1. Our true loss function ℓ(h_θ(x), y*) by way of the decoder h_θ contains an argmax (3) over trees and is therefore not differentiable. We show how to soften this decoder (by substituting a softmax), making it differentiable (§4.1).

2. Empirically, we find the above objective difficult to optimize. To address this, we substitute a simpler L2 loss function (commonly used in neural networks). This is easier to optimize and yields our best parsers in practice (§4.2).

3. We show how to run backprop through the inside-outside algorithm on a hypergraph (§5.4) for use in two modules: the softened decoder (§5.1) and computation of messages from the PTREE factor (§5.3). This allows us to go beyond Stoyanov et al. (2011) and train structured BP in an approximation-aware and loss-aware fashion.

4 Differentiable Objective Functions

4.1 Annealed Risk

Minimizing the test-time loss is the appropriate goal for training an approximate system like ours. That loss is estimated by the empirical risk on a large amount of in-domain supervised training data. Alas, this risk is nonconvex and piecewise constant, so we turn to deterministic annealing (Smith and Eisner, 2006) and clever initialization.

Directed dependency error, ℓ(h_θ(x), y*), is not differentiable due to the argmax in the decoder h_θ. So we redefine J(θ; x, y*) to be a new differentiable loss function, the annealed risk R^{1/T}_θ(x, y*), which approaches the loss ℓ(h_θ(x), y*) as the temperature T → 0. Our first step is to define a distribution over parses, which takes the marginals p_θ(y_i = ON | x) as input, or in practice, their BP approximations b_i(ON):

  q^{1/T}_θ(ŷ | x) ∝ exp( Σ_{i: ŷ_i = ON} p_θ(y_i = ON | x) / T )    (10)

[Figure 2: Feed-forward topology of inference, decoding, and loss. The modules are (A) compute potentials ψ_α(y_α) = exp(θ · f(y_α, x)); (B) initial messages; (C) messages at times t = 1, ..., tmax, with submodules (C.1) message ratios, (C.2) inside-outside, and (C.3) outgoing messages for the PTREE factor; (D) beliefs; and (E) decode and loss, with submodules (E.1) anneal beliefs, (E.2) inside-outside, and (E.3) expected recall. (E.1–E.3) show the annealed risk, one of the objective functions we consider.]

Using this distribution, we can replace our non-differentiable decoder h_θ with a differentiable one (at training time). Imagine that our new decoder stochastically returns a parse ŷ sampled from this distribution. We define the annealed risk as the expected loss of that decoder:

  R^{1/T}_θ(x, y*) = E_{ŷ ∼ q^{1/T}_θ(·|x)}[ℓ(ŷ, y*)]    (11)

As T → 0 ("annealing"), the decoder almost always chooses the MBR parse,7 so our risk approaches the loss of the actual MBR decoder that will be used at test time. However, as a function of θ, it remains differentiable (though not convex) for any T > 0. To compute the annealed risk, observe that it simplifies to R^{1/T}_θ(x, y*) = −Σ_{i: y*_i = ON} q^{1/T}_θ(ŷ_i = ON | x). This is the negated expected recall of a parse ŷ ∼ q^{1/T}_θ. We obtain the required marginals q^{1/T}_θ(ŷ_i = ON | x) from (10) by running inside-outside where the edge weight for edge i is given by exp(p_θ(y_i = ON | x)/T).

7 Recall from (3) that the MBR parse is the tree ŷ that maximizes the sum Σ_{i: ŷ_i = ON} p_θ(y_i = ON | x). As T → 0, the right-hand side of (10) grows fastest for this ŷ, so its probability under q^{1/T}_θ approaches 1 (or 1/k if there is a k-way tie for MBR parse).
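As a small illustration of §4.1 (hypothetical code, not the paper's), the sketch below shows the two pieces specific to the annealed risk: the temperature-scaled edge weights that would be handed to inside-outside to obtain the q-marginals of Eq. (10), and the negated expected recall of Eq. (11). The dictionary-of-edges representation and the assumption that the q-marginals are computed elsewhere (by inside-outside, as in the hypergraph sketch of §5.4) are simplifications made for the example.

    import math

    def annealed_edge_weights(edge_beliefs, temperature):
        """Edge weights exp(b_i(ON)/T) fed to inside-outside to obtain the
        marginals q(y_i = ON | x) of Eq. (10).  `edge_beliefs` maps each candidate
        edge (head, modifier) to its BP belief b_i(ON)."""
        return {edge: math.exp(b_on / temperature)
                for edge, b_on in edge_beliefs.items()}

    def annealed_risk(q_edge_marginals, gold_edges):
        """Annealed risk of Eq. (11) in its simplified form, the negated expected
        recall: minus the sum over gold edges of q(y_i = ON | x).  The q-marginals
        are taken as given here."""
        return -sum(q_edge_marginals[edge] for edge in gold_edges)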

Whether our test-time system computes the marginals of p_θ exactly or does so approximately via BP, our new training objective approaches (as T → 0) the true empirical risk of the test-time parser that performs MBR decoding from the computed marginals. Empirically, however, we will find that it is not the most effective training objective (§7.2). Stoyanov et al. (2011) postulate that the nonconvexity of empirical risk may make it a difficult function to optimize, even with annealing. Our next two objectives provide alternatives.

4.2 L2 Distance

We can view our inference, decoder, and loss as defining a form of deep neural network, whose topology is inspired by our linguistic knowledge of the problem (e.g., the edge variables should define a tree). This connection to deep learning allows us to consider training methods akin to supervised layer-wise training (Bengio et al., 2007). We temporarily remove the top layers of our network (i.e. the decoder and loss module, Fig. 2 (E)) so that the output layer of our "deep network" consists of the variable beliefs b_i(y_i) from BP. We can then define a supervised loss function directly on these beliefs. We don't have supervised data for this layer of beliefs, but we can create it artificially. Use the supervised parse y* to define "target beliefs" by b*_i(y_i) = I(y_i = y*_i) ∈ {0, 1}. To find parameters θ that make BP's beliefs close to these targets, we can minimize an L2 distance loss function:

  J(θ; x, y*) = Σ_i Σ_{y_i} (b_i(y_i) − b*_i(y_i))²    (12)

We can use this L2 distance objective function for training, adding the MBR decoder and loss evaluation back in only at test time.

4.3 Layer-wise Training

Just as in layer-wise training of neural networks, we can take a two-stage approach to training. First, we train to minimize the L2 distance. Then, we use the resulting θ as initialization to optimize the annealed risk, which does consider the decoder and loss function (i.e. the top layers of Fig. 2). Stoyanov et al. (2011) found mean squared error (MSE) to give a smoother training objective, though still nonconvex, and used it to initialize empirical risk. Though their variant of the L2 objective did not completely dispense with the decoder as ours does, it is a similar approach to our proposed layer-wise training.

5 Gradients by Backpropagation

Backpropagation computes the derivative of any given function specified by an arbitrary circuit consisting of elementary differentiable operations (e.g. +, −, ×, ÷, log, exp). This is accomplished by repeated application of the chain rule. Backpropagating through an algorithm proceeds by similar application of the chain rule, where the intermediate quantities are determined by the topology of the circuit—just as in Figure 2. Running backwards through the circuit, backprop computes the partial derivatives of the objective J(θ; x, y*) with respect to each intermediate quantity u—or more concisely the adjoint of u: ðu = ∂J(θ; x, y*)/∂u. This section gives a summary of the adjoint computations we require. Due to space constraints, we direct the reader to the extended version of this paper (Gormley et al., 2015a) for full details of all the adjoints.

5.1 Backpropagation of Decoder/Loss

The adjoint of the objective itself ðJ(θ; x, y*) is always 1. So the first adjoints we must compute are those of the beliefs: ðb_i(y_i) and ðb_α(y_α). This corresponds to the backward pass through Figure 2 (E). Consider the simple case where J is L2 distance from (12): the variable belief adjoint is ðb_i(y_i) = 2(b_i(y_i) − b*_i(y_i)) and trivially ðb_α(y_α) = 0. If J is annealed risk from (11), we compute ðb_i(y_i) by applying backpropagation recursively to our algorithm for J from §4.1. This sub-algorithm defines a sub-circuit depicted in Figure 2 (E.1–E.3). The computations of the annealed beliefs and the expected recall are easily differentiable. The main challenge is differentiating the function computed by the inside-outside algorithm; we address this in §5.4.
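The L2 objective and the first backward step it induces are simple enough to write down directly. The sketch below (illustrative Python, with edges keyed by (head, modifier) pairs as an assumed representation) computes the loss of Eq. (12) from the ON-beliefs and returns the belief adjoints ðb_i(y_i) = 2(b_i(y_i) − b*_i(y_i)); the factor-belief adjoints are simply zero for this objective.

    def l2_loss_and_belief_adjoints(beliefs, gold_heads):
        """L2 objective of Eq. (12) and the first backward step of Section 5.1.
        `beliefs` maps each candidate edge (head, modifier) to its variable belief
        b_i(ON); the target belief is 1 if the edge is in the gold parse and 0
        otherwise.  Returns the loss and the adjoints dJ/db_i(y_i) for ON and OFF."""
        loss = 0.0
        adjoints = {}
        for (head, mod), b_on in beliefs.items():
            b_off = 1.0 - b_on                       # beliefs are normalized
            t_on = 1.0 if gold_heads[mod] == head else 0.0
            t_off = 1.0 - t_on
            loss += (b_on - t_on) ** 2 + (b_off - t_off) ** 2
            adjoints[(head, mod)] = {"ON": 2 * (b_on - t_on),
                                     "OFF": 2 * (b_off - t_off)}
        return loss, adjoints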

5.2 Backpropagation through Structured BP

Given the adjoints of the beliefs, we next backpropagate through structured BP—extending prior work which did the same for regular BP (Eaton and Ghahramani, 2009; Stoyanov et al., 2011; Domke, 2011). Except for the messages sent from the PTREE factor, each step of BP computes some value from earlier values using the update equations (4)–(7). Backpropagation differentiates these elementary expressions. First, using the belief adjoints, we compute the adjoints of the final messages (ðm^(tmax)_{j→α}(y_j), ðm^(tmax)_{β→i}(y_i)) by applying the chain rule to Eqs. (6) and (7). This is the backward pass through Fig. 2 (D). Recall that the messages at time t were computed from messages at time t−1 and the potential functions ψ_α in the forward pass via Eqs. (4) and (5). Backprop works in the opposite order, updating the adjoints of the messages at time t−1 and the potential functions (ðm^(t−1)_{j→α}(y_j), ðm^(t−1)_{β→i}(y_i), ðψ_α(y_α)) only after it has computed the adjoints of the messages at time t. Repeating this through time steps {t, t−1, ..., 1} constitutes the backward pass through Fig. 2 (C). The backward pass through Fig. 2 (B) does nothing, since the messages were initialized to a constant. The final step of backprop uses ðψ_α(y_α) to compute ðθ_j—the backward pass through Fig. 2 (A). For the explicit formula of these adjoints, see Gormley et al. (2015a) or Appendix A.1 of Stoyanov et al. (2011). The next section handles the special case of ðm^(t)_{j→PTREE}(y_j).

5.3 BP and Backpropagation with PTREE

The PTREE factor has a special structure that we exploit for efficiency during BP. Smith and Eisner (2008) give a more efficient way to implement Eq. (5), which computes the message from a factor α to a variable y_i, in the special case where α = PTREE. They first run the inside-outside algorithm where the edge weights are given by the ratios of the messages to PTREE: m^(t)_{i→α}(ON) / m^(t)_{i→α}(OFF). Then they multiply each resulting edge marginal given by inside-outside by the product of all the OFF messages ∏_i m^(t)_{i→α}(OFF) to get the marginal factor belief b_α(y_i). Finally they divide the belief by the incoming message m^(t)_{i→α}(ON) to get the corresponding outgoing message m^(t+1)_{α→i}(ON). These steps are shown in Figure 2 (C.1–C.3), and are repeated each time we send a message from the PTREE factor.

Similarly, we exploit the structure of this algorithm to compute the adjoints ðm^(t)_{j→PTREE}(y_j). The derivatives of the message ratios and products mentioned here are simple. In the next subsection, we explain how to backpropagate through the inside-outside algorithm. Though we focus here on projective dependency parsing, our techniques are also applicable to non-projective parsing and the TREE factor; we leave this to future work.

5.4 Backprop of Hypergraph Inside-Outside

Both the annealed risk loss function (§4.1) and the computation of messages from the PTREE factor (§5.3) use the inside-outside algorithm for dependency parsing. Here we describe inside-outside and the accompanying backpropagation algorithm over a hypergraph. This general treatment (Klein and Manning, 2001; Li and Eisner, 2009) enables our method to be applied to other tasks such as constituency parsing, HMM forward-backward, and hierarchical machine translation. In the case of dependency parsing, the structure of the hypergraph is given by the dynamic programming algorithm of Eisner (1996).

For the forward pass of the inside-outside module, the input variables are the hyperedge weights w_e ∀e and the outputs are the marginal probabilities p_w(i) ∀i of each node i in the hypergraph. The latter are a function of the inside β_i and outside α_i probabilities. We initialize α_root = 1.

  β_i = Σ_{e ∈ I(i)} w_e ∏_{j ∈ T(e)} β_j    (13)
  α_i = Σ_{e ∈ O(i)} w_e α_{H(e)} ∏_{j ∈ T(e): j ≠ i} β_j    (14)
  p_w(i) = α_i β_i / β_root    (15)

For each node i, we define the set of incoming edges I(i) and outgoing edges O(i). The antecedents of the edge are T(e), the parent of the edge is H(e), and its weight is w_e.

For the backward pass of the inside-outside module, the inputs are ðp_w(i) ∀i and the outputs are ðw_e ∀e. We also compute the adjoints of the intermediate quantities ðβ_j, ðα_i. We first compute ðα_i bottom-up. Next ðβ_j are computed top-down. The adjoints ðw_e are then computed in any order.

  ðα_i = ðp_w(i) ∂p_w(i)/∂α_i + Σ_{e ∈ I(i)} Σ_{j ∈ T(e)} ðα_j ∂α_j/∂α_i    (16)
  ðβ_root = Σ_{i ≠ root} ðp_w(i) ∂p_w(i)/∂β_root    (17)
  ðβ_j = ðp_w(j) ∂p_w(j)/∂β_j + Σ_{e ∈ O(j)} ðβ_{H(e)} ∂β_{H(e)}/∂β_j + Σ_{e ∈ O(j)} Σ_{k ∈ T(e): k ≠ j} ðα_k ∂α_k/∂β_j    ∀j ≠ root    (18)
  ðw_e = ðβ_{H(e)} ∂β_{H(e)}/∂w_e + Σ_{j ∈ T(e)} ðα_j ∂α_j/∂w_e    (19)

The partial derivatives required for the above adjoints are given in the extended version of this paper (Gormley et al., 2015a). This backpropagation method is used for both Figure 2 (C.2) and (E.2).
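A minimal sketch of the forward pass of Eqs. (13)–(15) on an explicit hypergraph is given below (illustrative Python; the node/hyperedge encoding and the treatment of nodes with no incoming hyperedge as axioms with inside score 1 are assumptions of the example). The backward pass of Eqs. (16)–(19) differentiates exactly these computations and can be written analogously, or obtained by reverse-mode automatic differentiation.

    from math import prod

    def inside_outside(nodes, hyperedges, root):
        """Forward pass of hypergraph inside-outside, Eqs. (13)-(15).
        `nodes` must be listed in topological order (leaves first, `root` last).
        Each hyperedge is a tuple (head, tail, weight) with `tail` a list of
        antecedent nodes.  Returns p_w(i) = alpha_i * beta_i / beta_root."""
        incoming = {i: [] for i in nodes}
        for head, tail, w in hyperedges:
            incoming[head].append((tail, w))

        beta = {}                                          # inside scores, Eq. (13)
        for i in nodes:                                    # leaves -> root
            beta[i] = (sum(w * prod(beta[j] for j in tail) for tail, w in incoming[i])
                       if incoming[i] else 1.0)            # axiom nodes (assumption)

        alpha = {i: 0.0 for i in nodes}                    # outside scores, Eq. (14)
        alpha[root] = 1.0
        for head in reversed(nodes):                       # root -> leaves
            for tail, w in incoming[head]:
                for j in tail:
                    alpha[j] += w * alpha[head] * prod(beta[k] for k in tail if k != j)

        return {i: alpha[i] * beta[i] / beta[root] for i in nodes}   # Eq. (15)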

6 Other Learning Settings

Loss-aware Training with Exact Inference  Backpropagating through inference, decoder, and loss need not be restricted to approximate inference algorithms. Li and Eisner (2009) optimize Bayes risk with exact inference on a hypergraph for machine translation. Each of our differentiable loss functions (§4) can also be coupled with exact inference. For a first-order parser, BP is exact. Yet, in place of modules (B), (C), and (D) in Figure 2, we can use a standard dynamic programming algorithm for dependency parsing, which is simply another instance of inside-outside on a hypergraph (§5.4). The exact marginals from inside-outside (15) are then fed forward into the decoder/loss module (E).

Conditional and Surrogate Log-likelihood  The standard approach to training is conditional log-likelihood (CLL) maximization (Smith and Eisner, 2008) without taking inexact inference into account: J(θ; x, y*) = −log p_θ(y | x). When inference is exact, this baseline computes the true gradient of CLL. When inference is approximate, this baseline uses the factor beliefs b_α(y_α) from BP in place of the exact marginals in the gradient. The literature refers to this approximation-unaware training method as surrogate likelihood training since it returns the "wrong" parameters even under the assumption of infinite training data drawn from the model being used (Wainwright, 2006). Despite this, the surrogate likelihood objective is commonly used to train CRFs. CLL and approximation-aware training are not mutually exclusive. Training a standard factor graph with ERMA and a log-likelihood objective recovers CLL exactly (Stoyanov et al., 2011).

7 Experiments

7.1 Setup

Features  As the focus of this work is on a novel approach to training, we look to prior work for model and feature design (§2). We add O(n³) second-order grandparent and arbitrary-sibling factors as in Riedel and Smith (2010) and Martins et al. (2010). We use standard feature sets for first-order (McDonald et al., 2005) and second-order (Carreras, 2007) parsing. Following Rush and Petrov (2012), we also include a version of each part-of-speech (POS) tag feature, with the coarse tags from Petrov et al. (2012). We use feature hashing (Ganchev and Dredze, 2008; Weinberger et al., 2009) and restrict to at most 20 million features. We leave the incorporation of third-order features to future work.

Pruning  To reduce the time spent on feature extraction, we enforce the type-specific dependency length bounds from Eisner and Smith (2005) as used by Rush and Petrov (2012): the maximum allowed dependency length for each tuple (parent tag, child tag, direction) is given by the maximum observed length for that tuple in the training data. Following Koo and Collins (2010), we train a first-order model with CLL and for each token prune any parents for which the marginal probability is less than 0.0001 times the maximum parent marginal for that token. On a per-token basis, we further restrict to the ten parents with highest marginal probability as in Martins et al. (2009) (but we avoid pruning the fully right-branching tree, so that some parse always exists).8 This lets us simplify the factor graph, removing variables y_i corresponding to pruned edges and specializing their factors to assume y_i = OFF. We train the full model's parameters to work well on this pruned graph.

8 The pruning model uses a simpler feature set as in Rush and Petrov (2012). Pruning is likely the least impactful of our approximations: it obtains 99.46% oracle UAS for English.
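As a concrete illustration of the pruning step above, the following sketch (hypothetical code, with per-token first-order marginals supplied as a dictionary) keeps a candidate head only if its marginal is at least 0.0001 times the best head's marginal, and at most the ten best heads. The additional rule that the fully right-branching tree is never pruned away is not shown here.

    def prune_parents(parent_marginals, keep=10, ratio=1e-4):
        """Per-token pruning rule described above.  `parent_marginals` maps each
        candidate head index to p(edge ON) under the first-order pruning model;
        returns the surviving candidate heads for this token."""
        best = max(parent_marginals.values())
        kept = [h for h, p in parent_marginals.items() if p >= ratio * best]
        kept.sort(key=lambda h: parent_marginals[h], reverse=True)
        return kept[:keep]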

Data  We consider 19 languages from the CoNLL-2006 (Buchholz and Marsi, 2006) and CoNLL-2007 (Nivre et al., 2007) Shared Tasks. We also convert the English Penn Treebank (PTB) (Marcus et al., 1993) to dependencies using the head rules from Yamada and Matsumoto (2003) (PTB-YM). We evaluate unlabeled attachment accuracy (UAS) using gold POS tags for the CoNLL languages, and predicted tags from TurboTagger (Martins et al., 2013) for the PTB. Unlike most prior work, we hold out 10% of each CoNLL training dataset as development data for regularization by early stopping.9

Some of the CoNLL languages contain non-projective edges, but our system is built using a probability distribution over projective trees only. ERMA can still be used with such a badly misspecified model—one of its advantages—but no amount of training can raise CLL's objective above −∞, since any non-projective gold tree will always have probability 0. Thus, for CLL only, we replace each gold tree in training data with a minimum-loss projective tree (Carreras, 2007).10 This resembles ERMA's goal of training the system to find a low-loss projective tree. At test time, we always evaluate the system's projective output trees against the possibly non-projective gold trees, as in prior work.

Learning Settings  We compare three learning settings. The first, our baseline, is conditional log-likelihood training (CLL) (§6). As is common in the literature, we conflate two distinct learning settings (conditional log-likelihood / surrogate log-likelihood) under the single name "CLL," allowing the inference method (exact/inexact) to differentiate them. The second learning setting is approximation-aware learning (§3) with either our L2 distance objective (L2) (§4.2) or our layer-wise training method (L2+AR) which takes the L2-trained model as an initializer for our annealed risk (§4.3). The annealed risk objective requires an annealing schedule: over the course of training, we linearly anneal from initial temperature T = 0.1 to T = 0.0001, updating T at each step of stochastic optimization. The third learning setting uses the same two objectives, L2 and L2+AR, but with exact inference (§6). The ℓ2-regularizer weight in (8) is λ = 1. Each method is trained by AdaGrad for 5 epochs with early stopping (i.e. the model with the highest score on dev data is returned). Across CoNLL, the average epoch chosen for CLL was 2.02 and for L2 was 3.42. The learning rate for each training run is dynamically tuned on a sample of the training data.

9 In dev experiments, we found L2 distance to be less sensitive to the ℓ2-regularizer weight than CLL. So we added additional regularization by early stopping to improve CLL.
10 We also ran a controlled experiment with L2 and not just CLL trained on these projectivized trees: the average margin of improvement for our method widened very slightly.

[Figure 3: Speed/accuracy tradeoff of English PTB-YM UAS vs. the total number of BP iterations tmax for standard conditional likelihood training (CLL) and our approximation-aware training with either an L2 objective (L2) or a staged training of L2 followed by annealed risk (L2+AR). Note that the x-axis shows the number of iterations used for both training and testing. We use a 2nd-order model with Grand.+Sib. factors.]

[Figure 4: English PTB-YM UAS vs. the types of 2nd-order factors included in the model for approximation-aware training and standard conditional likelihood training. All models include 1st-order factors (Unary). The 2nd-order models include grandparents (Grand.), arbitrary siblings (Sib.), or both (Grand.+Sib.)—and use 4 iterations of BP.]

7.2 Results

Our goal is to demonstrate that our approximation-aware training method leads to improved parser accuracy as compared with the standard training approach of conditional log-likelihood (CLL) maximization (Smith and Eisner, 2008), which does not take inexact inference into account.

The two key findings of our experiments are that our learning approach is more robust to (1) decreasing the number of iterations of BP and (2) adding additional cycles to the factor graph in the form of higher-order factors. In short: our approach leads to faster inference and creates opportunities for more accurate parsers.

Speed-Accuracy Tradeoff  Our first experiment is on English dependencies. For English PTB-YM, Figure 3 shows accuracy as a function of the number of BP iterations for our second-order model with both arbitrary sibling and grandparent factors on English. We find that our training methods (L2 and L2+AR) obtain higher accuracy than standard training (CLL), particularly when a small number of BP iterations are used and the inference is a worse approximation. Notice that with just two iterations of BP, the parsers trained by our approach obtain accuracy greater than or equal to those by CLL with any number of iterations (1 to 8). Contrasting the two objectives for our approximation-aware training, we find that our simple L2 objective performs very well. In fact, in only two cases, at 3 and 5 iterations, does risk annealing (L2+AR) further improve performance on test data. In our development experiments, we also evaluated AR without using L2 for initialization and we found that it performed worse than either of CLL and L2 alone. That AR performs only slightly better than L2 (and not worse) in the case of L2+AR is likely due to early stopping on dev data, which guards against selecting a worse model.

Increasingly Cyclic Models  Figure 4 contrasts accuracy with the type of 2nd-order factors (grandparent, sibling, or both) included in the model for English, for a fixed budget of 4 BP iterations. Adding higher-order factors introduces more loops, making the loopy BP approximation more problematic for standard CLL training. By contrast, under approximation-aware training, enriching the model with more factors always helps performance, as desired, rather than hurting it.

Notice that our advantage is not restricted to the case of loopy graphs. Even when we use a 1st-order model, for which BP inference is exact, our approach yields higher-accuracy parsers than CLL training. We speculate that this improvement is due to our method's ability to better deal with model misspecification—a first-order model is quite misspecified! Note the following subtle point: when inference is exact, the CLL estimator is actually a special case of our approximation-aware learner—that is, CLL computes the same gradient that our training by backpropagation would if we used log-likelihood as the objective.

    TRAIN   INFERENCE    DEV UAS   TEST UAS
    CLL     Exact         91.99     91.62
    CLL     BP 4 iters    91.37     91.25
    L2      Exact         91.91     91.66
    L2      BP 4 iters    91.83     91.63

Table 1: The impact of exact vs. approximate inference on a 2nd-order model with grandparent factors only. Results are for the development (§22) and test (§23) sections of PTB-YM.

Exact Inference with Grandparents  §2 noted that since we always do MBR decoding, the ideal strategy is to fit the true distribution with a good model. Consider a "good model" that includes unary and grandparent factors. Exact inference is possible here in O(n⁴) time by dynamic programming (Koo and Collins, 2010, Model 0). Table 1 shows that CLL training with exact inference indeed does well on test data—but that accuracy falls if we substitute fast approximate inference (4 iterations of BP). Our proposed L2 training is able to close the gap, just as intended. That is, we successfully train a few iterations of an approximate O(n³) algorithm to behave as well as an exact O(n⁴) algorithm.

Other Languages  Our final experiments train and test our parser on 19 languages from CoNLL-2006/2007 (Table 2). We find that, on average across languages, approximation-aware training with an L2 objective obtains higher UAS than CLL training. This result holds for both our poorest model (1st-order) and our richest one (2nd-order with grandparent and sibling factors), using 1, 2, 4, or 8 iterations of BP. Notice that the approximation-aware training doesn't always outperform CLL training—only in the aggregate. Again, we see the trend that our training approach yields larger gains when BP is restricted to a small number of maximum iterations. It is possible that larger training sets would also favor our approach, by providing a clearer signal of how to reduce the objective (8).

              1ST-ORDER       2ND-ORDER (WITH GIVEN NUM. BP ITERATIONS)
                                   1               2               4               8
    LANG.   CLL    L2-CLL   CLL    L2-CLL   CLL    L2-CLL   CLL    L2-CLL   CLL    L2-CLL
    AR      77.63  -0.26    73.39  +2.21    77.05  -0.17    77.20  +0.02    77.16  -0.07
    BG      90.38  -0.76    89.18  -0.45    90.44  +0.04    90.73  +0.25    90.63  -0.19
    CA      90.47  +0.30    88.90  +0.17    90.79  +0.38    91.21  +0.78    91.49  +0.66
    CS      84.69  -0.07    79.92  +3.78    82.08  +2.27    83.02  +2.94    81.60  +4.42
    DA      87.15  -0.12    86.31  -1.07    87.41  +0.03    87.65  -0.11    87.68  -0.10
    DE      88.55  +0.81    88.06   0.00    89.27  +0.46    89.85  -0.05    89.87  -0.07
    EL      82.43  -0.54    80.02  +0.29    81.97  +0.09    82.49  -0.16    82.66  -0.04
    EN      88.31  +0.32    85.53  +1.44    87.67  +1.82    88.63  +1.14    88.85  +0.96
    ES      81.49  -0.09    79.08  -0.37    80.73  +0.14    81.75  -0.66    81.52  +0.02
    EU      73.69  +0.11    71.45  +0.85    74.16  +0.24    74.92  -0.32    74.94  -0.38
    HU      78.79  -0.52    76.46  +1.24    79.10  +0.03    79.07  +0.60    79.28  +0.31
    IT      84.75  +0.32    84.14  +0.04    85.15  +0.01    85.66  -0.51    85.81  -0.59
    JA      93.54  +0.19    93.01  +0.44    93.71  -0.10    93.75  -0.26    93.47  +0.07
    NL      76.96  +0.53    74.23  +2.08    77.12  +0.53    78.03  -0.27    77.83  -0.09
    PT      86.31  +0.38    85.68  -0.01    87.01  +0.29    87.34  +0.08    87.30  +0.17
    SL      79.89  +0.30    78.42  +1.50    79.56  +1.02    80.91  +0.03    80.80  +0.34
    SV      87.22  +0.60    86.14  -0.02    87.68  +0.74    88.01  +0.41    87.87  +0.37
    TR      78.53  -0.30    77.43  -0.64    78.51  -1.04    78.80  -1.06    78.91  -1.13
    ZH      84.93  -0.39    82.62  +1.43    84.27  +0.95    84.79  +0.68    84.77  +1.14
    AVG.    83.98  +0.04    82.10  +0.68    83.88  +0.41    84.41  +0.19    84.34  +0.31

Table 2: Results on 19 languages from CoNLL-2006/2007. For languages appearing in both datasets, the 2006 version was used, except for Chinese (ZH). Evaluation follows the 2006 conventions and excludes punctuation. We report absolute UAS for the baseline (CLL) and the improvement in UAS for L2 over CLL (L2−CLL); in the original typeset table, positive/negative differences are shown in blue/red. The average UAS and average difference across all languages (AVG.) is given.

8 Discussion

The purpose of this work was to explore ERMA and related training methods for models which incorporate structured factors. We applied these methods to a basic higher-order dependency parsing model, because that was the simplest and first instance of structured BP (Smith and Eisner, 2008). In future work, we hope to explore further models with structured factors—particularly those which jointly account for multiple linguistic strata (e.g. syntax, semantics, and topic). Another natural extension of this work is to explore other types of factors: here we considered only log-linear potential functions (commonly used in CRFs), but any differentiable function would be appropriate, such as a neural network (Durrett and Klein, 2015; Gormley et al., 2015b).

Our primary contribution is approximation-aware training for structured BP. We have specifically presented message-passing formulas for any factor whose belief's partition function can be computed as the total weight of all hyperpaths in a weighted hypergraph. This would suffice to train the structured BP systems that have been built for projective dependency parsing (Smith and Eisner, 2008), CNF grammar parsing (Naradowsky et al., 2012), TAG (Auli and Lopez, 2011), ITG-constraints for phrase extraction (Burkett and Klein, 2012), and graphical models over strings (Dreyer and Eisner, 2009).

9 Conclusions

We introduce a new approximation-aware learning framework for belief propagation with structured factors. We present differentiable objectives for both empirical risk minimization (à la ERMA) and a novel objective based on L2 distance between the inferred beliefs and the true edge indicator functions. Experiments on the English Penn Treebank and 19 languages from CoNLL-2006/2007 show that our estimator is able to train more accurate dependency parsers with fewer iterations of belief propagation than standard conditional log-likelihood training, by taking approximations into account. For additional details, see the tech report version of this paper (Gormley et al., 2015a). Our code is available in a general-purpose library for structured BP, hypergraphs, and backprop (Gormley, 2015).

Acknowledgments  This research was funded by the Human Language Technology Center of Excellence at Johns Hopkins University. Thanks to the anonymous reviewers for their insightful comments.

References

Michael Auli and Adam Lopez. 2011. A comparison of loopy belief propagation and dual decomposition for integrated CCG supertagging and parsing. In Proceedings of ACL.

Mohit Bansal, David Burkett, Gerard de Melo, and Dan Klein. 2014. Structured learning for taxonomy induction with belief propagation. In Proceedings of ACL.

Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. 2007. Greedy layer-wise training of deep networks. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19.

Peter J. Bickel and Kjell A. Doksum. 1977. Mathematical Statistics: Basic Ideas and Selected Topics. Holden-Day Inc., Oakland, CA, USA.

Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of CoNLL.

David Burkett and Dan Klein. 2012. Fast inference in phrase extraction models with belief propagation. In Proceedings of NAACL-HLT.

Xavier Carreras. 2007. Experiments with a higher-order projective dependency parser. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007.

Justin Domke. 2010. Implicit differentiation by perturbation. In Advances in Neural Information Processing Systems.

Justin Domke. 2011. Parameter learning with truncated message-passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Markus Dreyer and Jason Eisner. 2009. Graphical models over multiple strings. In Proceedings of EMNLP.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research.

Greg Durrett and Dan Klein. 2015. Neural CRF parsing. In Proceedings of ACL.

Frederik Eaton and Zoubin Ghahramani. 2009. Choosing a variable to clamp. In Proceedings of AISTATS.

Jason Eisner and Noah A. Smith. 2005. Parsing with soft and hard constraints on dependency length. In Proceedings of the International Workshop on Parsing Technologies (IWPT).

Jason Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In Proceedings of COLING.

Brendan J. Frey, Frank R. Kschischang, Hans-Andrea Loeliger, and Niclas Wiberg. 1997. Factor graphs and algorithms. In Proceedings of the Annual Allerton Conference on Communication Control and Computing, volume 35.

Kuzman Ganchev and Mark Dredze. 2008. Small statistical models by random feature mixing. In Proceedings of the ACL08 HLT Workshop on Mobile Language Processing.

Joshua Goodman. 1996. Efficient algorithms for parsing the DOP model. In Proceedings of EMNLP.

Matthew R. Gormley, Mark Dredze, and Jason Eisner. 2015a. Approximation-aware dependency parsing by belief propagation (extended version). Technical report available from arXiv.org as arXiv:1508.02375.

Matthew R. Gormley, Mo Yu, and Mark Dredze. 2015b. Improved relation extraction with feature-rich compositional embedding models. In Proceedings of EMNLP.

Matthew R. Gormley. 2015. Pacaya—a graphical models and NLP library. Available from https://github.com/mgormley/pacaya.

Andreas Griewank and George F. Corliss, editors. 1991. Automatic Differentiation of Algorithms: Theory, Implementation, and Application. SIAM, Philadelphia, PA.

Liang Huang, Suphan Fayong, and Yang Guo. 2012. Structured perceptron with inexact search. In Proceedings of NAACL-HLT.

Dan Klein and Christopher D. Manning. 2001. Parsing and hypergraphs. In Proceedings of the International Workshop on Parsing Technologies (IWPT).

Terry Koo and Michael Collins. 2010. Efficient third-order dependency parsers. In Proceedings of ACL.

Frank R. Kschischang, Brendan J. Frey, and Hans-Andrea Loeliger. 2001. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2).

Alex Kulesza and Fernando Pereira. 2008. Structured learning with approximate inference. In Advances in Neural Information Processing Systems.

Zhifei Li and Jason Eisner. 2009. First- and second-order expectation semirings with applications to minimum-risk training on translation forests. In Proceedings of EMNLP.

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2).

André F. T. Martins, Noah A. Smith, and Eric P. Xing. 2009. Concise integer linear programming formulations for dependency parsing. In Proceedings of ACL-IJCNLP.

André F. T. Martins, Noah A. Smith, Eric P. Xing, Pedro M. Q. Aguiar, and Mário A. T. Figueiredo. 2010. Turbo parsers: Dependency parsing by approximate variational inference. In Proceedings of EMNLP.

André F. T. Martins, Miguel B. Almeida, and Noah A. Smith. 2013. Turning on the turbo: Fast third-order non-projective turbo parsers. In Proceedings of ACL.

Ryan McDonald and Fernando Pereira. 2006. Online learning of approximate dependency parsing algorithms. In Proceedings of EACL.

Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005. Online large-margin training of dependency parsers. In Proceedings of ACL.

Kevin P. Murphy, Yair Weiss, and Michael I. Jordan. 1999. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of UAI.

Jason Naradowsky, Tim Vieira, and David A. Smith. 2012. Grammarless parsing for joint inference. In Proceedings of COLING.

Joakim Nivre, Johan Hall, Sandra Kübler, Ryan McDonald, Jens Nilsson, Sebastian Riedel, and Deniz Yuret. 2007. The CoNLL 2007 shared task on dependency parsing. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007.

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Proceedings of LREC.

Sebastian Riedel and David A. Smith. 2010. Relaxed marginal inference and its application to dependency parsing. In Proceedings of NAACL-HLT.

David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1986. Learning internal representations by error propagation. In David E. Rumelhart and James L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 1. MIT Press.

Alexander M. Rush and Slav Petrov. 2012. Vine pruning for efficient multi-pass dependency parsing. In Proceedings of NAACL-HLT.

David A. Smith and Jason Eisner. 2006. Minimum-risk annealing for training log-linear models. In Proceedings of COLING-ACL.

David A. Smith and Jason Eisner. 2008. Dependency parsing by belief propagation. In Proceedings of EMNLP.

Veselin Stoyanov and Jason Eisner. 2012. Minimum-risk training of approximate CRF-based NLP systems. In Proceedings of NAACL-HLT.

Veselin Stoyanov, Alexander Ropson, and Jason Eisner. 2011. Empirical risk minimization of graphical model parameters given approximate inference, decoding, and model structure. In Proceedings of AISTATS.

Martin J. Wainwright. 2006. Estimating the "wrong" graphical model: Benefits in the computation-limited setting. The Journal of Machine Learning Research, 7.

Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. 2009. Feature hashing for large scale multitask learning. In Proceedings of ICML.

Hiroyasu Yamada and Yuji Matsumoto. 2003. Statistical dependency analysis with support vector machines. In Proceedings of the International Workshop on Parsing Technologies (IWPT), volume 3.
