Transactions of the Association for Computational Linguistics, 1 (2013) 207–218. Action Editor: Ben Taskar.
Submitted 10/2012; Revised 3/2013; Published 5/2013. © 2013 Association for Computational Linguistics.
Dual Coordinate Descent Algorithms for Efficient Large Margin Structured Prediction

Ming-Wei Chang    Wen-tau Yih
Microsoft Research
Redmond, WA 98052, USA
{minchang,scottyih}@microsoft.com

Abstract

Due to the nature of complex NLP problems, structured prediction algorithms have been important modeling tools for a wide range of tasks. While there exists evidence showing that the linear Structural Support Vector Machine (SSVM) algorithm performs better than the structured Perceptron, the SSVM algorithm is still less frequently chosen in the NLP community because of its relatively slow training speed. In this paper, we propose a fast and easy-to-implement dual coordinate descent algorithm for SSVMs. Unlike algorithms such as Perceptron and stochastic gradient descent, our method keeps track of dual variables and updates the weight vector more aggressively. As a result, this training process is as efficient as existing online learning methods, and yet derives consistently better models, as evaluated on four benchmark NLP datasets for part-of-speech tagging, named-entity recognition and dependency parsing.

1 Introduction

Complex natural language processing tasks are inherently structured. From sequence labeling problems like part-of-speech tagging and named entity recognition to tree construction tasks like syntactic parsing, strong dependencies exist among the labels of individual components. By modeling such relations in the output space, structured output prediction algorithms have been shown to significantly outperform simple binary or multi-class classifiers (Lafferty et al., 2001; Collins, 2002; McDonald et al., 2005).

Among the existing structured output prediction algorithms, the linear Structural Support Vector Machine (SSVM) algorithm (Tsochantaridis et al., 2004; Joachims et al., 2009) has shown outstanding performance in several NLP tasks, such as bilingual word alignment (Moore et al., 2007), constituency and dependency parsing (Taskar et al., 2004b; Koo et al., 2007), sentence compression (Cohn and Lapata, 2009) and document summarization (Li et al., 2009). Nevertheless, as a learning method for NLP, the SSVM algorithm has been less popular than algorithms such as the structured Perceptron (Collins, 2002). This may be due to the fact that current SSVM implementations often suffer from several practical issues. First, state-of-the-art implementations of SSVM such as cutting plane methods (Joachims et al., 2009) are typically complicated.¹ Second, while methods like stochastic gradient descent are simple to implement, tuning learning rates can be difficult. Finally, while SSVM models can achieve superior accuracy, this often requires long training time.

¹ Our algorithm is easy to implement mainly because we use the squared hinge loss function.

In this paper, we propose a novel optimization method for efficiently training linear SSVMs. Our method not only is easy to implement, but also has excellent training speed, competitive with both structured Perceptron (Collins, 2002) and MIRA (Crammer et al., 2005). When evaluated on several NLP tasks, including POS tagging, NER and dependency parsing, this optimization method also outperforms other approaches in terms of prediction accuracy. Our final algorithm is a dual coordinate
descent (DCD) algorithm for solving a structured output SVM problem with a 2-norm hinge loss function. The algorithm consists of two main components. One component behaves analogously to online learning methods and updates the weight vector immediately after inference is performed. The other component is similar to the cutting plane method and updates the dual variables (and the weight vector) without running inference. Conceptually, this hybrid approach operates at a balanced trade-off point between inference and weight update, performing better than with either component alone.

Our contributions in this work can be summarized as follows. Firstly, our proposed algorithm shows that even for structured output prediction, an SSVM model can be trained as efficiently as a structured Perceptron one. Secondly, we conducted a careful experimental study on three NLP tasks using four different benchmark datasets. When compared with previous methods for training SSVMs (Joachims et al., 2009), our method achieves similar performance using less training time. When compared to commonly used learning algorithms such as Perceptron and MIRA, the model trained by our algorithm performs consistently better when given the same amount of training time. We believe our method can be a powerful tool for many different NLP tasks.

The rest of our paper is organized as follows. We first describe our approach by formally defining the problem and notation in Sec. 2, where we also review some existing, closely-related structured-output learning algorithms and optimization techniques. We introduce the detailed algorithmic design in Sec. 3. The experimental comparisons of variations of our approach and the existing methods on several NLP benchmark tasks and datasets are reported in Sec. 4. Finally, Sec. 5 concludes the paper.

2 Background and Related Work

We first introduce the notation used throughout this paper. An input example is denoted by x and an output structure is denoted by y. The feature vector Φ(x, y) is a function defined over an input-output pair (x, y). We focus on linear models with predictions made by solving the decoding problem:

\arg\max_{y \in Y(x_i)} w^T \Phi(x_i, y).    (1)

The set Y(x_i) represents all possible (exponentially many) structures that can be generated from the example x_i. Let y_i be the true structured label of x_i. The difference between the feature vectors of the correct label y_i and y is denoted as Φ_{y_i,y}(x_i) ≡ Φ(x_i, y_i) − Φ(x_i, y). We define ∆(y_i, y) as a distance function between two structures.

2.1 Perceptron and MIRA

Structured Perceptron: First introduced by Collins (2002), the structured Perceptron algorithm runs two steps iteratively: first, it finds the best structured prediction y for an example with the current weight vector using Eq. (1); then the weight vector is updated according to the difference between the feature vectors of the true label and the prediction: w ← w + Φ_{y_i,y}(x_i). Inspired by Freund and Schapire (1999), Collins (2002) also proposed the averaged structured Perceptron, which maintains an averaged weight vector throughout the training procedure. This technique has been shown to improve the generalization ability of the model.

MIRA: The Margin Infused Relaxed Algorithm (MIRA), which was introduced by Crammer et al. (2005), explicitly uses the notion of margin to update the weight vector. MIRA updates the weight vector by calculating the step size using

\min_w \frac{1}{2} \|w - w_0\|^2 \quad \text{s.t.} \quad w^T \Phi_{y_i,y}(x_i) \ge \Delta(y, y_i), \; \forall y \in H_k,

where H_k is a set containing the best-k structures according to the weight vector w_0. MIRA is a very popular method in the NLP community and has been applied to NLP tasks like word segmentation and part-of-speech tagging (Kruengkrai et al., 2009), NER and chunking (Mejer and Crammer, 2010) and dependency parsing (McDonald et al., 2005).
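To make the Perceptron update above concrete, the following minimal Python sketch (not the authors' code) runs one structured Perceptron training pass. The helpers decode(w, x), which solves the argmax of Eq. (1), and the joint feature map phi(x, y) are assumed to be supplied by the task; both names are placeholders.

```python
import numpy as np

def perceptron_epoch(examples, phi, decode, w):
    """One pass of the structured Perceptron (Collins, 2002).

    examples: list of (x, y_gold) pairs.
    phi:      assumed joint feature map, phi(x, y) -> np.ndarray.
    decode:   assumed solver for Eq. (1), argmax_y w^T phi(x, y).
    w:        weight vector (np.ndarray), updated in place.
    """
    for x, y_gold in examples:
        y_hat = decode(w, x)                     # best structure under current w
        if y_hat != y_gold:                      # mistake-driven update
            w += phi(x, y_gold) - phi(x, y_hat)  # w <- w + Phi_{y_i,y}(x_i)
    return w
```

The averaged variant mentioned above additionally accumulates a running sum of w after every example and returns the average at the end of training.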
2.2 Structural SVM

Structural SVM (SSVM) is a maximum margin model for the structured output prediction setting. Training an SSVM is equivalent to solving the following global optimization problem:

\min_w \frac{\|w\|^2}{2} + C \sum_{i=1}^{l} L(x_i, y_i, w),    (2)

where l is the number of labeled examples and

L(x_i, y_i, w) = \ell\Big( \max_y \big[ \Delta(y_i, y) - w^T \Phi_{y_i,y}(x_i) \big] \Big).

The typical choice of ℓ is ℓ(a) = a^t. If t = 2 is used, we refer to the SSVM defined in Eq. (2) as the L2-Loss SSVM. If the hinge loss (t = 1) is used in Eq. (2), we refer to it as the L1-Loss SSVM. Note that the function ∆ is not only necessary,² but also enables us to use more information on the differences between the structures in the training phase. For example, using Hamming distance for sequence labeling is a reasonable choice, as the model can express finer distinctions between structures y_i and y.

² Without ∆(y, y_i) in Eq. (2), the optimal w would be zero.

When training an SSVM model, we often need to solve the loss-augmented inference problem,

\arg\max_{y \in Y(x_i)} \big[ w^T \Phi(x_i, y) + \Delta(y_i, y) \big].    (3)

Note that this is a different inference problem than the decoding problem in Eq. (1).
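For sequence labeling with a Hamming-distance ∆, the loss term in Eq. (3) decomposes over token positions, so it can be folded into the per-position scores of a standard Viterbi pass. The sketch below is one illustrative way to do this in Python; it assumes precomputed emission and transition scores and is not the implementation used in the paper.

```python
import numpy as np

def loss_augmented_viterbi(emission, transition, gold):
    """Solve Eq. (3) for a first-order sequence model with Hamming-distance Delta.

    emission:   (N, K) array of per-token, per-tag scores under the current w.
    transition: (K, K) array of tag-to-tag transition scores.
    gold:       list of N gold tag ids (the structure y_i).
    Returns the tag sequence maximizing model score plus Hamming loss.
    """
    N, K = emission.shape
    # Hamming distance adds 1 whenever a position's tag differs from gold,
    # so it can be absorbed into the per-position scores.
    aug = emission + 1.0
    aug[np.arange(N), gold] -= 1.0

    score = np.zeros((N, K))
    back = np.zeros((N, K), dtype=int)
    score[0] = aug[0]
    for i in range(1, N):
        cand = score[i - 1][:, None] + transition   # previous tag x current tag
        back[i] = np.argmax(cand, axis=0)
        score[i] = cand[back[i], np.arange(K)] + aug[i]

    y = [int(np.argmax(score[-1]))]
    for i in range(N - 1, 0, -1):
        y.append(int(back[i, y[-1]]))
    return y[::-1]
```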
Algorithms for training SSVM: Cutting plane (CP) methods (Tsochantaridis et al., 2004; Joachims et al., 2009) have been the dominant method for learning the L1-Loss SSVM. Eq. (2) contains an exponential number of constraints. The CP methods iteratively select a subset of active constraints for each example, then solve a sub-problem which contains the active constraints to improve the model. CP has proven useful for solving SSVMs. For instance, Yu and Joachims (2009) proposed using CP methods to solve a 1-slack variable formulation, and showed that solving the 1-slack variable formulation is much faster than solving the l-slack variable one (Eq. (2)). Chang et al. (2010) also proposed a variant of the cutting plane method for solving the L2-Loss SSVM. This method uses a dual coordinate descent algorithm to solve the sub-problems. We call their approach the CPD method.

Several other algorithms also aim at solving the L1-Loss SSVM. Stochastic gradient descent (SGD) (Bottou, 2004; Shalev-Shwartz et al., 2007) is a technique for optimizing general convex functions and has been applied to solving the L1-Loss SSVM (Ratliff et al., 2007). Taskar et al. (2004a) proposed a structured SMO algorithm. Because that algorithm solves the dual formulation of the L1-Loss SSVM, it requires picking a violation pair for each update. In contrast, because each dual variable can be updated independently in our DCD algorithm, the implementation is relatively simple. The extragradient algorithm has also been applied to solving the L1-Loss SSVM (Taskar et al., 2005). Unlike our DCD algorithm, the extragradient method requires the learning rate to be specified.

The connections between dual methods and online algorithms have been previously discussed. Specifically, Shalev-Shwartz and Singer (2006) connect the dual methods to a wide range of online learning algorithms. In (Martins et al., 2010), the authors apply similar techniques to L1-Loss SSVMs and show that the proposed algorithm can be faster than the SGD algorithm.

Exponentiated Gradient (EG) descent (Kivinen and Warmuth, 1995; Collins et al., 2008) has recently been applied to solving the L1-Loss SSVM. Compared to other SSVM learners, EG requires manual tuning of the step size. In addition, EG requires solving the sum-product inference problem, which can be more expensive than solving Eq. (3) (Taskar et al., 2006). Very recently, Lacoste-Julien et al. (2013) proposed a block-coordinate descent algorithm for the L1-Loss SSVM based on the Frank-Wolfe algorithm (FW-Struct), which has been shown to outperform the EG algorithm significantly. Similar to our DCD algorithm, FW-Struct calculates the optimal learning rate when updating the dual variables.

The Sequential Dual Method (SDM) (Shevade et al., 2011) is probably the most related to this paper. SDM solves the L1-Loss SSVM problem using multiple updating policies, which is similar to our approach. However, there are several important differences in the detailed algorithmic design. As will become clear in Sec. 3, our dual coordinate descent (DCD) algorithm is very simple, while SDM (which is not a DCD algorithm) uses a complicated procedure to balance different update policies. By targeting the L2-Loss SSVM formulation, our methods can update the weight vector more efficiently, since there are no equality constraints in the dual.
3 Dual Coordinate Descent Algorithms for Structural SVM

In this work, we focus on solving the dual of the linear L2-Loss SSVM, which can be written as follows:

\min_{\alpha_{i,y} \ge 0} \; \frac{1}{2} \Big\| \sum_{i,y} \alpha_{i,y} \Phi_{y_i,y}(x_i) \Big\|^2 + \frac{1}{4C} \sum_i \Big( \sum_{y \in Y(x_i)} \alpha_{i,y} \Big)^2 - \sum_{i,y} \Delta(y, y_i)\, \alpha_{i,y}.    (4)

In the above equation, a dual variable α_{i,y} is associated with a structure y ∈ Y(x_i). Therefore, the total number of dual variables can be quite large: its upper bound is lB, where B = max_i |Y(x_i)|. The connection between the dual variables and the weight vector w at optimal solutions is through the following equation:

w = \sum_{i=1}^{l} \sum_{y \in Y(x_i)} \alpha_{i,y}\, \Phi_{y_i,y}(x_i).    (5)

Advantages of L2-Loss SSVM: The use of the 2-norm hinge loss function eliminates the need for equality constraints³; only non-negativity constraints (α_{i,y} ≥ 0) remain. This is important because now each dual variable can be updated without changing the values of the other dual variables. We can then update one single dual variable at a time. As a result, this dual formulation allows us to design a simple and principled dual coordinate descent (DCD) optimization method. DCD algorithms consist of two iterative steps:

1. Pick a dual variable α_{i,y}.
2. Update the dual variable and the weight vector. Go to 1.

In the normal binary classification case, how to select dual variables to solve is not an issue, as choosing them randomly works effectively in practice (Hsieh et al., 2008). However, this is not a practical scheme for training SSVM models, given that the number of dual variables in Eq. (4) can be very large because of the exponentially many legitimate output structures. To address this issue, we introduce the concept of working set below.

³ For the L1-Loss SSVM, there are the equality constraints: Σ_{y∈Y(x_i)} α_{i,y} = C, ∀i.

Working Set: The number of non-zero variables in the optimal α can be small when solving Eq. (4). Hence, it is often feasible to use a small working set W_i for each example to keep track of the structures with non-zero α's. More formally, W_i = {y ∈ Y(x_i) | α_{i,y} > 0}. Intuitively, the working set W_i records the output structures that are similar to the true structure y_i. We set all dual variables to be zero initially (therefore, w = 0 as well), so W_i = ∅ for all i. The algorithm then starts to build the working set during the training procedure. After training, the weight vector is computed using the dual variables in the working set and is thus equal to

w = \sum_{i=1}^{l} \sum_{y \in W_i} \alpha_{i,y}\, \Phi_{y_i,y}(x_i).    (6)

Connections to Structured Perceptron: The process of updating a dual variable is in fact very similar to the update rule used in Perceptron and MIRA. Take structured Perceptron for example: its weight vector can be determined using the following equation:

w_{perc} = \sum_{i=1}^{l} \sum_{y \in \Gamma(x_i)} \beta_{i,y}\, \Phi_{y_i,y}(x_i),    (7)

where Γ(x_i) is the set containing all structures Perceptron predicts for x_i during training, and β_{i,y} is the number of times Perceptron predicts y for x_i during training. By comparing Eq. (6) and Eq. (7), it is clear that SSVM is just a more principled way to update the weight vector, as the α's are computed based on the notion of margin.⁴

⁴ Of course, W_i could be very different from Γ(x_i); the construction of the working sets will be discussed in Sec. 3.1.
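To make the working-set bookkeeping concrete, the small Python sketch below recovers w from the working-set dual variables as in Eq. (6) and evaluates the dual objective of Eq. (4) restricted to those variables. The dictionary layout (alpha, diff_feat, delta) is our own choice for illustration, not the authors' data structures.

```python
import numpy as np

def weight_from_duals(alpha, diff_feat, dim):
    """Eq. (6): w = sum_i sum_{y in W_i} alpha_{i,y} * Phi_{y_i,y}(x_i).

    alpha:     dict i -> {y: dual value}, keeping only working-set entries.
    diff_feat: dict (i, y) -> np.ndarray holding Phi_{y_i,y}(x_i).
    dim:       feature dimensionality.
    """
    w = np.zeros(dim)
    for i, duals in alpha.items():
        for y, a in duals.items():
            w += a * diff_feat[(i, y)]
    return w

def dual_objective(alpha, diff_feat, delta, C, dim):
    """Objective of Eq. (4), restricted to the dual variables in the working sets."""
    w = weight_from_duals(alpha, diff_feat, dim)
    obj = 0.5 * w.dot(w)
    for i, duals in alpha.items():
        obj += sum(duals.values()) ** 2 / (4.0 * C)              # (1/4C)(sum_y alpha_{i,y})^2
        obj -= sum(delta[(i, y)] * a for y, a in duals.items())  # - sum_y Delta(y, y_i) alpha_{i,y}
    return obj
```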
Updating Dual Variables and Weights: After picking a dual variable α_{i,ȳ}, we first show how to update it optimally. Recall that a dual variable α_{i,ȳ} is associated with the i-th example and a structure ȳ. The optimal update size d for α_{i,ȳ} can be calculated analytically from the following optimization problem (derived from Eq. (4)):

\min_{d \ge -\alpha_{i,\bar{y}}} \; \frac{1}{2} \|w + d\, \Phi_{y_i,\bar{y}}(x_i)\|^2 + \frac{1}{4C} \Big( d + \sum_{y \in W_i} \alpha_{i,y} \Big)^2 - d\, \Delta(y_i, \bar{y}),    (8)

where w is defined in Eq. (6). Compared to stochastic gradient descent, DCD algorithms keep track of dual variables and do not need to tune the learning rate.

Instead of updating one dual variable at a time, our algorithm updates all dual variables in the working set at once. This step is important for the convergence of the DCD algorithms.⁵ The exact update algorithm is presented in Algorithm 1.

⁵ Specifically, updating all of the structures in the working set is a necessary condition for our algorithms to converge.

Algorithm 1 UPDATEWEIGHT(i, w): Update the weight vector w and the dual variables in the working set of the i-th example. C is the regularization parameter defined in Eq. (2).
1: Shuffle the elements in W_i (but retain the newest member of the working set to be updated first; see Theorem 1 for the reasons)
2: for ȳ ∈ W_i do
3:    d ← (∆(ȳ, y_i) − w^T Φ_{y_i,ȳ}(x_i) − Σ_{y∈W_i} α_{i,y} / (2C)) / (‖Φ_{y_i,ȳ}(x_i)‖² + 1/(2C))
4:    α′ ← max(α_{i,ȳ} + d, 0)
5:    w ← w + (α′ − α_{i,ȳ}) Φ_{y_i,ȳ}(x_i)
6:    α_{i,ȳ} ← α′
7: end for

Line 3 calculates the optimal step size (the analytical solution to the above optimization problem). Line 4 makes sure that dual variables are non-negative. Lines 5 and 6 update the weight vector and the dual variables. Note that every update ensures the objective value of Eq. (4) does not increase.
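Algorithm 1 translates almost line by line into code. The following Python sketch reuses the dictionary layout of the earlier sketch (alpha, diff_feat and delta are our own illustrative structures) and is a paraphrase of the pseudocode, not the authors' implementation.

```python
import random
import numpy as np

def update_weight(i, w, alpha, diff_feat, delta, C):
    """Algorithm 1: update w and the dual variables in the working set of example i.

    alpha[i]:  dict y -> alpha_{i,y} for the working set W_i (insertion-ordered).
    diff_feat: dict (i, y) -> Phi_{y_i,y}(x_i) as np.ndarray.
    delta:     dict (i, y) -> Delta(y, y_i).
    C:         regularization parameter of Eq. (2).
    """
    if not alpha[i]:                     # empty working set: nothing to update
        return w
    work = list(alpha[i].keys())
    newest = work[-1]                    # assumes the newest structure was added last
    random.shuffle(work)
    work.remove(newest)
    work.insert(0, newest)               # Line 1: update the newest member first

    for y in work:
        f = diff_feat[(i, y)]
        numer = delta[(i, y)] - w.dot(f) - sum(alpha[i].values()) / (2.0 * C)
        denom = f.dot(f) + 1.0 / (2.0 * C)
        d = numer / denom                          # Line 3: optimal step size
        new_a = max(alpha[i][y] + d, 0.0)          # Line 4: keep duals non-negative
        w += (new_a - alpha[i][y]) * f             # Line 5: update the weight vector
        alpha[i][y] = new_a                        # Line 6: store the new dual value
    return w
```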
3.1 Two DCD Optimization Algorithms

Now we are ready to present two novel DCD algorithms for the L2-Loss SSVM: DCD-Light and DCD-SSVM.

3.1.1 DCD-Light

The basic idea of DCD-Light is just like that of online learning algorithms. Instead of doing inference for the whole batch of examples before updating the weight vector in each iteration, as done in CPD and the 1-slack variable formulation of SVM-Struct, DCD-Light updates the model weights after solving the inference problem for each individual example. Algorithm 2 depicts the detailed steps. In Line 5, the loss-augmented inference (Eq. (3)) is performed; then the weight vector is updated in Line 9, where all of the structures and dual variables in the working set are used to update the weight vector. Note that there is a δ parameter in Line 6 to control how precisely we would like to solve this SSVM problem. As suggested in (Hsieh et al., 2008), we shuffle the examples in each iteration (Line 3), as it helps the algorithm converge faster.

Algorithm 2 DCD-Light: The lightweight dual coordinate descent algorithm for optimizing Eq. (4).
1: w ← 0, W_i ← ∅, ∀i
2: for t = 1 ... T do
3:    Shuffle the order of the training examples
4:    for i = 1 ... l do
5:       ȳ ← argmax_y w^T Φ(x_i, y) + ∆(y, y_i)
6:       if ∆(ȳ, y_i) − w^T Φ_{y_i,ȳ}(x_i) − Σ_{y∈W_i} α_{i,y} / (2C) ≥ δ then
7:          W_i ← W_i ∪ {ȳ}
8:       end if
9:       UPDATEWEIGHT(i, w)  {Algo. 1}
10:   end for
11: end for

DCD-Light has several noticeable differences when compared to the most popular online learning method, the averaged Perceptron. First, DCD-Light performs the loss-augmented inference (Eq. (3)) at Line 5 instead of the argmax inference (Eq. (1)). Second, the algorithm updates the weight vector with all structures in the working set. Finally, DCD-Light does not average the weight vectors.

3.1.2 DCD-SSVM

Observing that DCD-Light does not fully utilize the saved dual variables in the working set, we propose a hybrid approach called DCD-SSVM, which combines ideas from DCD-Light and cutting plane methods. In short, after running the updates on a batch of examples, we refine the model by solving the dual variables further in the current working sets. The key advantage of keeping track of these dual variables is that it allows us to update the saved dual variables without performing any inference, which is often an expensive step in structured prediction algorithms.

DCD-SSVM is summarized in Algorithm 3. Lines 10 to 16 are from DCD-Light. In Lines 3 to 8, we take the idea from cutting plane methods and update the weight vector using the saved dual variables in the working sets without any inference (note that Lines 3 to 8 do not have any effect at the first iteration). By revisiting the dual variables, we can derive a better intermediate model, resulting in running the inference procedure less frequently. Similar to DCD-Light, we also shuffle the examples in each iteration.

Algorithm 3 DCD-SSVM: A hybrid dual coordinate descent algorithm that combines ideas from DCD-Light and cutting plane algorithms.
1: w ← 0, W_i ← ∅, ∀i
2: for t = 1 ... T do
3:    for j = 1 ... r do
4:       Shuffle the order of the training examples
5:       for i = 1 ... l do
6:          UPDATEWEIGHT(i, w)  {Algo. 1}
7:       end for
8:    end for
9:    Shuffle the order of the training examples
10:   for i = 1 ... l do
11:      ȳ ← argmax_y w^T Φ(x_i, y) + ∆(y, y_i)
12:      if ∆(ȳ, y_i) − w^T Φ_{y_i,ȳ}(x_i) − Σ_{y∈W_i} α_{i,y} / (2C) ≥ δ then
13:         W_i ← W_i ∪ {ȳ}
14:      end if
15:      UPDATEWEIGHT(i, w)  {Algo. 1}
16:   end for
17: end for

DCD algorithms are similar to column generation algorithms for linear programming (Desrosiers and Lübbecke, 2005), where the master problem is to solve the dual problem that focuses on the variables in the working sets, and the subproblem is to find new variables for the working sets. In Sec. 4, we will demonstrate the importance of balancing these two problems by comparing DCD-SSVM and DCD-Light.

3.2 Convergence Analysis

We now present the theoretical analysis of both DCD-Light and DCD-SSVM, and address two main topics: (1) whether the working sets will grow exponentially and (2) the convergence rate. Due to the lack of space, we show only the main theorems.

Leveraging Theorem 5 in (Joachims et al., 2009), we can prove that the DCD algorithms only add a limited number of variables into the working sets, and have the following theorem.

Theorem 1. The number of times that DCD-Light or DCD-SSVM adds structures into working sets is bounded by O( 2(R² + 1/(2C)) l C ∆² / δ² ), where R² is defined as max_{i,ȳ} ‖Φ_{y_i,ȳ}(x_i)‖², and ∆ is the upper bound of ∆(y_i, y′), ∀ y_i, y′ ∈ Y(x_i).

We discuss next the convergence rates of our DCD algorithms under two different conditions: when the working sets are fixed, and the general case. If the working sets are fixed in DCD algorithms, they become cyclic dual coordinate descent methods. In this case, we denote the minimization problem of Eq. (4) as F(α). For fixed working sets {W_i}, we denote by F_S(α) the minimization problem that focuses on the dual variables in the working sets only. By applying the results from (Luo and Tseng, 1993; Wang and Lin, 2013) to the L2-Loss SSVM, we have the following theorem.

Theorem 2. For any given non-empty working sets {W_i}, if the DCD algorithms do not extend the working sets (i.e., Lines 6-8 in Algorithm 2 are not executed), then the DCD algorithms will obtain an ε-optimal solution for F_S(α) in O(log(1/ε)) iterations.

Based on Theorem 1 and Theorem 2, we have the following theorem.

Theorem 3. DCD-SSVM obtains an ε-optimal solution in O((1/ε²) log(1/ε)) iterations.

To the best of our knowledge, this is the first convergence analysis result for the L2-Loss SSVM. Compared to other theoretical analysis results for the L1-Loss SSVM, a tighter bound might exist given a better theoretical analysis. We leave this for future work.⁶

⁶ Noticeably, the use of working sets complicates the theoretical analysis significantly. Also note that Theorem 2 shows that if we put all possible structures in the working sets (i.e., F(α) = F_S(α)), then the DCD algorithms can obtain an ε-optimal solution in O(log(1/ε)) iterations.
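Putting the pieces together, the outer loop of Algorithm 3 can be sketched as follows. The helper loss_augmented_decode stands for any solver of Eq. (3) (for sequence tasks, the Viterbi variant sketched in Sec. 2), update_weight is the Algorithm 1 sketch given earlier, and the data layout, default parameters and names are our own illustrative choices rather than the released implementation.

```python
import random
import numpy as np

def dcd_ssvm(examples, phi, loss_augmented_decode, loss, dim,
             C=0.1, T=25, r=5, tol=1e-3):
    """Schematic DCD-SSVM trainer (Algorithm 3) for the L2-Loss SSVM.

    examples: list of (x, y_gold); output structures y are assumed hashable.
    phi(x, y) -> np.ndarray joint features; loss(y, y_gold) is Delta;
    loss_augmented_decode(w, x, y_gold) solves Eq. (3);
    update_weight is the Algorithm 1 sketch from earlier in this section.
    """
    w = np.zeros(dim)
    alpha = {i: {} for i in range(len(examples))}   # working-set dual variables
    diff_feat, dlt = {}, {}                         # cached Phi_{y_i,y} and Delta values

    for t in range(T):
        for _ in range(r):                          # Lines 3-8: dual-only refinement passes
            order = list(range(len(examples)))
            random.shuffle(order)
            for i in order:
                update_weight(i, w, alpha, diff_feat, dlt, C)
        order = list(range(len(examples)))          # Lines 9-16: inference pass
        random.shuffle(order)
        for i in order:
            x, y_gold = examples[i]
            y_bar = loss_augmented_decode(w, x, y_gold)
            f = phi(x, y_gold) - phi(x, y_bar)
            viol = loss(y_bar, y_gold) - w.dot(f) - sum(alpha[i].values()) / (2 * C)
            if viol >= tol and y_bar not in alpha[i]:
                alpha[i][y_bar] = 0.0               # grow the working set W_i
                diff_feat[(i, y_bar)] = f
                dlt[(i, y_bar)] = loss(y_bar, y_gold)
            update_weight(i, w, alpha, diff_feat, dlt, C)
    return w
```

Dropping the inner refinement loop (r = 0) essentially recovers DCD-Light, and, as suggested in Sec. 4.3, structures whose dual value returns to zero can additionally be pruned from the working set to save time.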
Figure 1: We plot the testing performance (top row: (a) test F1 vs. training time in NER-CoNLL, (b) test accuracy vs. time in POS, (c) test accuracy vs. time in DP-WSJ) and the primal objective value (bottom row: (d) NER-CoNLL, (e) POS, (f) DP-WSJ) versus training time for three optimization methods (DCD-SSVM, DCD-Light and CPD) for learning the L2-Loss SSVM. In general, DCD-SSVM is the best algorithm for both the objective function and the testing performance.

4 Experiments

In order to verify the effectiveness of the proposed algorithm, we conduct a set of experiments on different optimization and learning algorithms. Before going to the experimental results, we first introduce the tasks and settings used in the experiments.

4.1 Tasks and Data

We evaluated our method and existing structured output learning approaches on named entity recognition (NER), part-of-speech tagging (POS) and dependency parsing (DP) on four benchmark datasets.

NER-MUC7: The MUC-7 data contains a subset of the North American News Text Corpora annotated with many types of entities. We followed the settings in (Ratinov and Roth, 2009) and consider three main entity categories: PER, LOC and ORG. We evaluated the results using phrase-level F1.

NER-CoNLL: This is the English dataset from the CoNLL 2003 shared task (T. K. Sang and De Meulder, 2003). The dataset labels sentences from the Reuters Corpus, Volume 1 (Lewis et al., 2004) with four different entity types: PER, LOC, ORG and MISC. We evaluated the results using phrase-level F1.

POS-WSJ: The standard set for evaluating the performance of a part-of-speech tagger. The training, development and test sets consist of sections 0-18, 19-21 and 22-24 of the Penn Treebank data (Marcus et al., 1993), respectively. We evaluated the results by token-level accuracy.

DP-WSJ: We took sections 02-21 of the Penn Treebank as the training set, section 00 as the development set and section 23 as the test set. We implemented a simple version of hash kernel to speed up the training procedure for this task (Bohnet, 2010). We report the unlabeled attachment accuracy for this task (McDonald et al., 2005).
Figure 2: Comparisons between the testing performance of DCD-SSVM, FW-Struct and SVM-Struct ((a) test F1 vs. training time in NER-MUC7, (b) test F1 vs. time in NER-CoNLL, (c) test accuracy vs. time in POS-WSJ). Note that DCD-SSVM often obtains a better model with much less training time when compared to SVM-Struct.

4.2 Features and Inference Algorithms

For the sequence labeling tasks, NER and POS, we followed the discriminative HMM settings used in (Joachims et al., 2009) and defined the features as

\Phi(x, y) = \sum_{i=1}^{N} \big[ \Phi_{emi}(x_i, y_i);\; [y_i = 1][y_{i-1} = 1];\; [y_i = 1][y_{i-1} = 2];\; \ldots;\; [y_i = k][y_{i-1} = k] \big],

where Φ_emi is the feature vector dedicated to the i-th token (the emission features), N represents the number of tokens in this sequence, y_i represents the label of the i-th token in the sequence y, [y_i = 1] is the indicator variable and k is the number of tags. The inference problems are solved by the Viterbi algorithm.

The emission features used in both POS and NER are the standard ones, including word features, word-shape features, etc. For NER, we used additional simple gazetteer features⁷ and word cluster features (Turian et al., 2010).

⁷ Adding Wikipedia gazetteers would likely increase the performance significantly (Ratinov and Roth, 2009).

For dependency parsing, we followed the setting described in (McDonald et al., 2005) and used similar features. The decoding algorithm is the first-order Eisner's algorithm (Eisner, 1997).
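As a concrete reading of the feature definition above, the joint vector is a sum over token positions of an emission block (the token's features conjoined with its tag) and a block of tag-pair transition indicators. The Python sketch below assumes sparse token features given as index lists and a particular block layout of our own choosing; it is illustrative, not necessarily the layout used in the experiments.

```python
import numpy as np

def joint_features(token_feats, tags, k, emi_dim):
    """Phi(x, y) for a first-order discriminative HMM.

    token_feats: list of N lists of active emission-feature indices (each < emi_dim).
    tags:        list of N tag ids in {0, ..., k-1}, i.e. the label sequence y.
    Returns a dense vector: k * emi_dim tag-conjoined emission counts,
    followed by k * k transition-indicator counts.
    """
    phi = np.zeros(k * emi_dim + k * k)
    for i, (feats, tag) in enumerate(zip(token_feats, tags)):
        for f in feats:                    # emission block: token features x current tag
            phi[tag * emi_dim + f] += 1.0
        if i > 0:                          # transition block: [y_{i-1} = s][y_i = t]
            prev = tags[i - 1]
            phi[k * emi_dim + prev * k + tag] += 1.0
    return phi
```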
4.3 Algorithms and Implementation Details

For all SSVM algorithms (including SGD), C was chosen from the set {0.01, 0.05, 0.1, 0.5, 1, 5} according to the accuracy/F1 on the development set. For each task, the same features were used by all algorithms. For NER-MUC7, NER-CoNLL and POS-WSJ, we ran the online algorithms and DCD-SSVM for 25 iterations. For DP-WSJ, we only let the algorithms run for 10 iterations as the inference procedure is computationally very expensive.

The algorithms in the experiments are:

DCD: Our dual coordinate descent method on the L2-Loss SSVM. For DCD-SSVM, r is set to 5. For both DCD-Light and DCD-SSVM, we follow the suggestion in (Joachims et al., 2009): if the value of a dual variable becomes zero, its corresponding structure will be removed from the working set to improve the speed.

SVM-Struct: We used the latest version (v3.10) of SVM-HMM.⁸ This version uses the cutting plane method on a 1-slack variable formulation (Joachims et al., 2009) for the L1-Loss SSVM. SVM-Struct is implemented in C and all the other algorithms are implemented in C#. We did not apply SVM-Struct to DP-WSJ because there is no native implementation.

⁸ http://www.cs.cornell.edu/People/tj/svm_light/svm_hmm.html

Perceptron: This refers to the averaged structured Perceptron method introduced by Collins (2002). To speed up the convergence rate, we shuffle the training examples at each iteration.

MIRA: The Margin Infused Relaxed Algorithm (MIRA) (Crammer et al., 2005) is the online learning algorithm that explicitly uses the notion of margin to update the weight vector. We use 1-best MIRA in our experiments. To increase the convergence speed, we shuffle the training examples at each iteration. Following (McDonald et al., 2005), we did not tune the C parameter for the MIRA algorithm.

SGD: Stochastic gradient descent (SGD) (Bottou, 2004) is a technique for optimizing general convex functions. In this paper, we use SGD as an alternative baseline for optimizing the L1-Loss SSVM objective function (Eq. (2) with hinge loss).⁹ When using SGD, the learning rate must be carefully tuned. Following (Bottou, 2004), the learning rate is obtained by η₀ / (1.0 + η₀T/C)^{0.75}, where C is the regularization parameter, T is the number of updates so far and η₀ is the initial step size. The parameter η₀ was selected from the set {2⁻¹, 2⁻², 2⁻³, 2⁻⁴, 2⁻⁵} by running the SGD algorithm on a set of 1000 randomly sampled examples, and then choosing the η₀ with the lowest primal objective value on these examples.

⁹ To compare with SGD using its best setting, we report only the results of SGD on the L1-Loss SSVM, as we found tuning the step size for the L2-Loss SSVM is more difficult.

FW-Struct: FW-Struct represents the Frank-Wolfe algorithm for the L1-Loss SSVM (Lacoste-Julien et al., 2013).

In order to improve the training speed, we cached all the feature vectors generated by the gold labeled data once computed. This applied to all algorithms except SVM-Struct, which has its own caching mechanism. We report the performance of the averaged weight vectors for Perceptron and MIRA.

4.4 Results

We present the experimental results below, comparing different dual coordinate descent methods, as well as comparing our main algorithm, DCD-SSVM, with other structured learning approaches.

4.4.1 Comparisons of DCD Methods

We compared three DCD methods: DCD-Light, DCD-SSVM and CPD. CPD is a cutting plane method proposed by Chang et al. (2010), which uses a dual coordinate descent algorithm to solve the internal sub-problems. We specifically included CPD as it also targets the L2-Loss SSVM.

Because different optimization strategies will reach the same objective values eventually, comparing them on prediction accuracy of the final models is not meaningful. Instead, here we compare how fast each algorithm converges, as shown in Figure 1. Each marker on the lines in this figure represents one iteration of the corresponding algorithm. Generally speaking, CPD improves the model very slowly in the early stages, but much faster after several iterations. In comparison, DCD-Light often behaves much better initially, and DCD-SSVM is generally the most efficient algorithm here.

The reason behind the slow performance of CPD is clear. During early rounds of the algorithm, the weight vector is far from optimal, so it spends too much time using "bad" weight vectors to find the most violated structures. On the other hand, DCD-Light updates the weight vector more frequently, so it behaves much better in general. DCD-SSVM spends more time on updating models during each batch, but keeps the same amount of time doing inference as DCD-Light. As a result, it finds a better trade-off between inference and learning.

4.4.2 DCD-SSVM, SVM-Struct and FW-Struct

Joachims et al. (2009) proposed a 1-slack variable method for the L1-Loss SSVM. They showed that solving the 1-slack variable formulation is an order of magnitude faster than solving the original formulation (the l-slack variable formulation). Nevertheless, from Figure 2, we can see the clear advantage of DCD-SSVM over SVM-Struct. Although using the 1-slack variable formulation has improved the learning speed, SVM-Struct still converges more slowly than DCD-SSVM. In addition, the performance of models trained by SVM-Struct in the early stages is quite unstable, which makes early stopping an ineffective strategy in practice when training time is limited.

We also compared our algorithms to FW-Struct. Our results agree with (Lacoste-Julien et al., 2013), which shows that FW-Struct outperforms SVM-Struct. In our experiments, we found that our DCD algorithms were competitive, and sometimes converged faster than FW-Struct.
Figure 3: Comparisons between DCD-SSVM and popular online learning algorithms (Perceptron, MIRA and SGD): (a) test F1 vs. training time in NER-MUC7, (b) test accuracy vs. time in POS-WSJ, (c) test accuracy vs. time in DP-WSJ. Note that the results diverge when comparing Perceptron and MIRA. In general, DCD-SSVM is the most stable algorithm.

Task/Data    DCD    Percep   MIRA   SGD
NER-MUC7     79.4   78.5     78.8   77.8
NER-CoNLL    85.6   85.3     85.1   84.2
POS-WSJ      97.1   96.9     96.9   96.9
DP-WSJ       90.8   90.3     90.2   90.9

Table 1: Performance of online learning algorithms and the DCD-SSVM algorithm on the testing sets. NER is measured by F1, the others by accuracy.

4.4.3 DCD-SSVM, MIRA, Perceptron and SGD

As in binary classification, where large-margin methods like SVMs often perform better than algorithms like Perceptron and SGD (Hsieh et al., 2008; Shalev-Shwartz and Zhang, 2013), here we observe similar behaviors in the structured output domain. Table 1 shows the final test accuracy numbers or F1 scores of models trained by algorithms including Perceptron, MIRA and SGD, compared to those of the SSVM models trained by DCD-SSVM. Among the benchmark datasets and tasks we have experimented with, DCD-SSVM derived the most accurate models, except for DP-WSJ when compared to SGD.

Perhaps a more interesting comparison is on the training speed, which can be observed in Figure 3. Compared to other online algorithms, DCD-SSVM can take advantage of cached dual variables and structures. We show that the training speed of DCD-SSVM can be competitive with that of the online learning algorithms, unlike SVM-Struct. Note that SGD is not very stable for NER-MUC7, even though we tuned the step size very carefully.

5 Conclusion

In this paper, we present a novel approach for learning the L2-Loss SSVM model. By combining the ideas of dual coordinate descent and cutting plane methods, the hybrid approach, DCD-SSVM, outperforms other SSVM training methods both in terms of objective value reduction and testing error rate reduction. As demonstrated in our experiments on several NLP tasks, our approach also tends to learn more accurate models than commonly used structured learning algorithms, including structured Perceptron, MIRA and SGD. Perhaps more interestingly, our SSVM learning method is very efficient: the model training time is competitive with online learning algorithms such as structured Perceptron and MIRA. These unique qualities make DCD-SSVM an excellent choice for solving a variety of complex NLP problems.

In the future, we would like to compare our algorithm to other structured prediction approaches, such as conditional random fields (Lafferty et al., 2001) and exponentiated gradient descent methods (Collins et al., 2008). Expediting the learning process further by leveraging approximate inference is also an interesting direction to investigate.

Acknowledgments

We sincerely thank John Platt, Lin Xiao and Kaiwei Chang for the discussions and feedback. We are grateful to Po-Wei Wang and Chih-Jen Lin for providing their work on convergence rate analysis of feasible descent methods. We also thank the reviewers for their detailed comments on this paper.
References

B. Bohnet. 2010. Very high accuracy and fast dependency parsing is not a contradiction. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING).

L. Bottou. 2004. Stochastic learning. In Olivier Bousquet and Ulrike von Luxburg, editors, Advanced Lectures on Machine Learning, Lecture Notes in Artificial Intelligence, LNAI 3176, pages 146-168. Springer Verlag, Berlin.

M. Chang, V. Srikumar, D. Goldwasser, and D. Roth. 2010. Structured output learning with indirect supervision. In Proceedings of the International Conference on Machine Learning (ICML).

T. Cohn and M. Lapata. 2009. Sentence compression as tree transduction. Journal of AI Research, 34:637-674, April.

M. Collins, A. Globerson, T. Koo, X. Carreras, and P. L. Bartlett. 2008. Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. Journal of Machine Learning Research, 9.

M. Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the Conference on Empirical Methods for Natural Language Processing (EMNLP).

K. Crammer, R. McDonald, and F. Pereira. 2005. Scalable large-margin online learning for structured classification. Technical report, Department of Computer and Information Science, University of Pennsylvania.

J. Desrosiers and M. E. Lübbecke. 2005. A primer in column generation. In Column Generation, pages 1-32. Springer.

J. M. Eisner. 1997. Three new probabilistic models for dependency parsing: An exploration. In Proceedings of the International Conference on Computational Linguistics (COLING), pages 340-345.

Y. Freund and R. Schapire. 1999. Large margin classification using the Perceptron algorithm. Machine Learning, 37(3):277-296.

C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. 2008. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the International Conference on Machine Learning (ICML), New York, NY, USA. ACM.

T. Joachims, T. Finley, and Chun-Nam Yu. 2009. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27-59.

J. Kivinen and M. K. Warmuth. 1995. Exponentiated gradient versus gradient descent for linear predictors. In ACM Symposium on the Theory of Computing.

T. Koo, A. Globerson, X. Carreras, and M. Collins. 2007. Structured prediction models via the matrix-tree theorem. In Proceedings of the 2007 Joint Conference of EMNLP-CoNLL, pages 141-150.

C. Kruengkrai, K. Uchimoto, J. Kazama, Y. Wang, K. Torisawa, and H. Isahara. 2009. An error-driven word-character hybrid model for joint Chinese word segmentation and POS tagging. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 513-521.

S. Lacoste-Julien, M. Jaggi, M. W. Schmidt, and P. Pletscher. 2013. Stochastic block-coordinate Frank-Wolfe optimization for structural SVMs. In Proceedings of the International Conference on Machine Learning (ICML).

J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning (ICML).

D. D. Lewis, Y. Yang, T. Rose, and F. Li. 2004. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361-397.

L. Li, K. Zhou, G.-R. Xue, H. Zha, and Y. Yu. 2009. Enhancing diversity, coverage and balance for summarization through structure learning. In Proceedings of the 18th International World Wide Web Conference, pages 71-80, New York, NY, USA. ACM.

Z.-Q. Luo and P. Tseng. 1993. Error bounds and convergence analysis of feasible descent methods: A general approach. Annals of Operations Research, 46(1):157-178.

M. P. Marcus, B. Santorini, and M. Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330, June.

A. F. Martins, K. Gimpel, N. A. Smith, E. P. Xing, M. A. Figueiredo, and P. M. Aguiar. 2010. Learning structured classifiers with dual coordinate ascent. Technical report, CMU-ML-10-109.

R. McDonald, K. Crammer, and F. Pereira. 2005. Online large-margin training of dependency parsers. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 91-98, Ann Arbor, Michigan.

A. Mejer and K. Crammer. 2010. Confidence in structured-prediction using confidence-weighted models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 971-981.
R. C. Moore, W. Yih, and A. Bode. 2007. Improved discriminative bilingual word alignment. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

L. Ratinov and D. Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proceedings of the Annual Conference on Computational Natural Language Learning (CoNLL), June.

N. Ratliff, J. Andrew (Drew) Bagnell, and M. Zinkevich. 2007. (Online) subgradient methods for structured prediction. In Proceedings of the International Workshop on Artificial Intelligence and Statistics, March.

S. Shalev-Shwartz and Y. Singer. 2006. Online learning meets optimization in the dual. In Proceedings of the Annual ACM Workshop on Computational Learning Theory (COLT).

S. Shalev-Shwartz and T. Zhang. 2013. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14:567-599.

S. Shalev-Shwartz, Y. Singer, and N. Srebro. 2007. Pegasos: primal estimated sub-gradient solver for SVM. In Zoubin Ghahramani, editor, Proceedings of the International Conference on Machine Learning (ICML), pages 807-814. Omnipress.

S. Shevade, P. Balamurugan, S. Sundararajan, and S. Keerthi. 2011. A sequential dual method for structural SVMs. In IEEE International Conference on Data Mining (ICDM).

E. F. T. K. Sang and F. De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Walter Daelemans and Miles Osborne, editors, Proceedings of CoNLL-2003, pages 142-147. Edmonton, Canada.

B. Taskar, C. Guestrin, and D. Koller. 2004a. Max-margin Markov networks. In The Conference on Advances in Neural Information Processing Systems (NIPS).

B. Taskar, D. Klein, M. Collins, D. Koller, and C. Manning. 2004b. Max-margin parsing. In Proceedings of the Conference on Empirical Methods for Natural Language Processing (EMNLP).

B. Taskar, S. Lacoste-Julien, and M. I. Jordan. 2005. Structured prediction via the extragradient method. In The Conference on Advances in Neural Information Processing Systems (NIPS).

B. Taskar, S. Lacoste-Julien, and M. I. Jordan. 2006. Structured prediction, dual extragradient and Bregman projections. Journal of Machine Learning Research, 7:1627-1653.

I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. 2004. Support vector machine learning for interdependent and structured output spaces. In Proceedings of the International Conference on Machine Learning (ICML).

J. Turian, L. Ratinov, and Y. Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 384-394, Stroudsburg, PA, USA. Association for Computational Linguistics.

Po-Wei Wang and Chih-Jen Lin. 2013. Iteration complexity of feasible descent methods for convex optimization. Technical report, National Taiwan University.

C. Yu and T. Joachims. 2009. Learning structural SVMs with latent variables. In Proceedings of the International Conference on Machine Learning (ICML).