Transactions of the Association for Computational Linguistics, 2 (2014) 393–404. Action Editor: Robert C. Moore.
Submitted 2/2014; Revised 6/2014; Published 10/2014. c
(cid:13)
2014 Association for Computational Linguistics.
LocallyNon-LinearLearningforStatisticalMachineTranslationviaDiscretizationandStructuredRegularizationJonathanH.Clark∗ChrisDyer†AlonLavie†*MicrosoftResearch†CarnegieMellonUniversityRedmond,WA98052,USAPittsburgh,PA15213,USAjonathan.clark@microsoft.com{cdyer,alavie}@cs.cmu.eduAbstractLinearmodels,whichsupportefficientlearn-ingandinference,aretheworkhorsesofstatis-ticalmachinetranslation;however,linearde-cisionrulesarelessattractivefromamodelingperspective.Inthiswork,weintroduceatech-niqueforlearningarbitrary,rule-local,non-linearfeaturetransformsthatimprovemodelexpressivity,butdonotsacrificetheefficientinferenceandlearningassociatedwithlinearmodels.Todemonstratethevalueofourtech-nique,wediscardthecustomarylogtransformoflexicalprobabilitiesanddropthephrasaltranslationprobabilityinfavorofrawcounts.Weobservethatouralgorithmlearnsavari-ationofalogtransformthatleadstobettertranslationqualitycomparedtotheexplicitlogtransform.Weconcludethatnon-linearre-sponsesplayanimportantroleinSMT,anob-servationthatwehopewillinformtheeffortsoffeatureengineers.1IntroductionLinearmodelsusinglog-transformedprobabilitiesasfeatureshaveemergedasthedominantmodelinMTsystems.ThispracticecanbetracedbacktotheIBMnoisychannelmodels(Brownetal.,1993),whichdecomposedecodingintotheproductofatranslationmodel(TM)andalanguagemodel(LM),motivatedbyBayes’Rule.WhenOchandNey(2002)introducedalog-linearmodelfortrans-lation(alinearsumoflog-spacefeatures),theynotedthatthenoisychannelmodelwasaspecialcaseoftheirmodelusinglogprobabilities.This∗Thisworkwasconductedaspartofthefirstauthor’sPh.D.workatCarnegieMellonUniversity.sameformulationpersistedevenaftertheintroduc-tionofMERT(Och,2003),whichoptimizesalin-earmodel;again,usingtwologprobabilityfea-tures(TMandLM)withequalweightrecoveredthenoisychannelmodel.Yetsystemsnowusemanymorefeatures,someofwhicharenotevenprobabil-ities.Wenolongerbelievethatequalweightsbe-tweentheTMandLMprovidesoptimaltranslationquality;theprobabilitiesintheTMdonotobeythechainrulenorBayes’rule,nullifyingseveralthe-oreticalmathematicaljustificationsformultiplyingprobabilities.Thestoryofmultiplyingprobabilitiesmayjustamounttoheavilypenalizingsmallvalues.Thecommunityhasabandonedtheoriginalmo-tivationsforalinearinterpolationoftwolog-transformedfeatures.Isthereempiricalevidencethatweshouldcontinueusingthisparticulartrans-formation?Dowehaveanyreasontobelieveitisbetterthanothernon-lineartransformations?Toan-swerthese,weexploretheissueofnon-linearityinmodelsforMT.Intheprocess,wewilldiscusstheimpactoflinearityonfeatureengineeringandde-velopageneralmechanismforlearningaclassofnon-lineartransformationsofreal-valuedfeatures.Applyinganon-lineartransformationsuchaslogtofeaturesisonewayofachievinganon-linearresponsefunction,evenifthosefeaturesareaggre-gatedinalinearmodel.Alternatively,wecouldachieveanon-linearresponseusinganativelynon-linearmodelsuchasaSVM(Wangetal.,2007)orRankBoost(Sokolovetal.,2012).However,MTisastructuredpredictionproblem,inwhichafullhypothesisiscomposedofpartialhypotheses.MTdecoderstakeadvantageofthefactthatthemodel
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
1
9
1
1
5
6
6
9
3
9
/
/
t
l
a
c
_
a
_
0
0
1
9
1
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
394
scoredecomposesasalinearsumoverbothlocalfeaturesandpartialhypothesestoefficientlyper-forminferenceinthesestructuredspaces(§2)–cur-rently,therearenoscalablesolutionstointegratingthehypothesis-levelnon-linearfeaturetransformstypicallyassociatedwithkernelmethodswhilestillmaintainingpolynomialtimesearch.Anotheralter-nativeisincorporatingarecurrentneuralnetwork(Schwenk,2012;Aulietal.,2013;KalchbrennerandBlunsom,2013)oranadditiveneuralnetwork(Liuetal.,2013a).Whilethesemodelshaveshownpromiseasmethodsofaugmentingexistingmod-els,theyhavenotyetofferedapathforreplacingortransformingexistingreal-valuedfeatures.Inthisarticle,wediscussbackground(§2),de-scribelocaldiscretization,ourapproachtolearningnon-lineartransformationsofindividualfeatures,compareitwithgloballynon-linearmodels(§3),presentourexperimentalsetup(§5),empiricallyver-ifytheimportanceofnon-linearfeaturetransforma-tionsinMTanddemonstratethatdiscretizationcanbeusedtorecovernon-lineartransformations(§6),discussrelatedwork(§7),andconclude(§8).2BackgroundandDefinitions2.1FeatureLocality&StructuredHypothesesDecodingagivensourcesentencefcanbeex-pressedassearchovertargethypothesese,eachwithanassociatedcompletederivationD.Tofindthebest-scoringhypothesisˆe(f),alinearmodelappliesasetofweightswtoacompletehypothesis’featurevectorH:ˆe(f)=argmaxe,D|H|Xi=0wiHi(f,e,D)(1)However,thishidesmanyoftherealitiesofperform-inginferenceinmoderndecoders.Traditionalin-ferencewouldbeintractableifeveryfeaturewereallowedaccesstotheentirederivationDanditsas-sociatedtargethypothesise.Decoderstakeadvan-tageofthefactthatfeaturesdecomposeoverpar-tialderivationsd.ForacompletederivationD,theglobalfeaturesH(D)areanefficientsummationoverlocalfeaturesh(d):ˆe(f)=argmaxe,D|H|Xi=0wiXd∈Dhi(d)|{z}Hi(D)(2)Thiscontrastswithnon-localfeaturessuchasthelanguagemodel(LM),whichcannotbeexactlycal-culatedgivenanarbitrarypartialhypothesis,whichmaylackbothleftandrightcontext.1Suchfeaturesrequirespecialhandlingincludingfuturecostesti-mation.Inthisstudy,welimitourselvestolocalfeatures,leavingthetraditionalnon-localLMfea-tureunchanged.Ingeneral,featurelocalityisrel-ativetoaparticularstructuredhypothesisspace,andisunrelatedtothestructuredfeaturesdescribedinSection4.2.2.2FeatureNon-LinearityandSeparabilityUnlikemodelsthatrelyprimarilyonalargenumberofsparseindicatorfeatures,state-of-the-artmachinetranslationsystemsrelyheavilyonasmallnumberofdensereal-valuedfeatures.However,unlikeindi-catorfeatures,real-valuedfeaturesmaybenefitfromnon-lineartransformationstoallowalinearmodeltobetterfitthedata.Decodersusealinearmodeltorankhypotheses,selectingthehighest-rankedderivation.Sincetheabsolutescoreofthemodelisirrelevant,non-linearresponsesareusefulonlyincaseswheretheyelicitnovelrankings.Inthissection,wewilldiscussthesecasesintermsofseparability.Here,wearesepa-ratingthecorrectlyrankedpairsofhypothesesfromtheincorrectintheimplicitpairwiserankingsde-finedbythetotalorderingonhypothesesprovidedbyourmodel.Whenthelocalfeaturevectorshofeachoracle-best2hypothesis(orhypotheses)aredistinctfromthoseofallothercompetinghypotheses,wesaythattheinputsareoracleseparablegiventhefeatureset.Ifthereexistsaweightvectorthatdistinguishestheoracle-bestrankingfromallotherrankingsunderalinearmodel,wesaythattheinputsarelinearlysep-arablegiventhefeatureset.Iftheinputsareora-cleseparablebutnotlinearlyseparable,wesaythattherearenon-linearitiesthatareunexplainedbythefeatureset.Forexample,thiscanhappenifafeatureispositivelyrelatedtoqualityinsomeregionsbutnegativelyrelatedinotherregions.Asweaddmoresentencestoourcorpus,sepa-rabilitybecomesincreasinglydifficult.Foragiven1Thisisespeciallyproblematicforchart-baseddecoders.2Wedefinetheoracle-besthypothesisintermsofsomeex-ternalqualitymeasuresuchasBLEU
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
1
9
1
1
5
6
6
9
3
9
/
/
t
l
a
c
_
a
_
0
0
1
9
1
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
395
corpus,ifallhypothesesareoracleseparable,wecanalwaysproducetheoracletranslation–assum-inganoptimal(andpotentiallyverycomplex)modelandweightvector.Ifourhypothesisspacealsocon-tainsallreferencetranslations,wecanalwaysre-coverthereference.Inpractice,bothofthesecondi-tionsaretypicallyviolatedtoacertaindegree.How-ever,ifwemodifyourfeaturesetsuchthatsomelower-rankedhigher-qualityhypothesiscanbesepa-ratedfromallhigher-rankedlower-qualityhypothe-ses,thenwecanimprovetranslationquality.Forthisreason,webelievethatseparabilityremainsaninformativetoolforthinkingaboutmodelinginMT.Currently,non-linearitiesinnovelreal-valuedfea-turesaretypicallyaddressedviamanualfeatureen-gineeringinvolvingagooddealoftrialanderror(GimpelandSmith,2009)3orbymanuallydiscretiz-ingfeatures(e.g.indicatorfeaturesforcount=N).Wewillexploreonetechniqueforautomaticallyavoidingnon-linearitiesinSection3.2.3LearningwithLargeFeatureSetsWhileMERThasproventobeastrongbaseline,itdoesnotscaletolargerfeaturesetsintermsofbothinefficiencyandoverfitting.WhileMIRA(Chiangetal.,2008),Rampion(GimpelandSmith,2012),andHOLS(Flaniganetal.,2013)havebeenshowntobeeffectiveoverlargerfeaturesets,theyaredif-ficulttoexplicitlyregularize–thiswillbecomeim-portantinSection4.2.Therefore,weusethePROoptimizer(HopkinsandMay,2011)asourbaselinelearnersinceithasbeenshowntoperformcompa-rablytoMERTforasmallnumberoffeatures,andtosignificantlyoutperformMERTforalargenum-beroffeatures(HopkinsandMay,2011;Ganitke-vitchetal.,2012).OtherveryrecentMToptimiz-erssuchasthelinearstructuredSVM(CherryandFoster,2012),AROW(Chiang,2012)andregular-izedMERT(Galleyetal.,2013)arealsocompatiblewiththediscretizationandstructuredregularizationtechniquesdescribedinthisarticle.43Gimpeletal.eventuallyusedrawprobabilitiesintheirmodelratherthanlog-probabilities.4Sincewedispensewithnearlyalloftheoriginaldensefea-turesandourstructuredregularizerisscalesensitive,onewouldneedtousethe‘1-renormalizedvariantofregularizedMERT.3DiscretizationandFeatureInductionInthissection,weproposeafeatureinductiontech-niquebasedondiscretizationthatproducesafeaturesetthatislesspronetonon-linearities(see§2.2).WedefinefeatureinductionasafunctionΦ(y)thattakestheresultofthefeaturefunctiony=h(x)∈Randreturnsatuplehy0,jiwherey0∈Risatransformedfeaturevalueandjisthetransformedfeatureindex.5Buildingonequation2,wecanapplyfeatureinductionasfollows:ˆe(f)=argmaxe,DXd∈D|H|Xi=0hy0,ji=Φi(hi(d))w0jy0|{z}H0(f,e,D)(3)Atfirstglance,onemightbetemptedtosim-plychoosesomenon-linearfunctionforΦ(e.g.log(x),exp(x),sin(x),xn).However,evenifweweretorestrictourselvestosome“standard”setofnon-linearfunctions,manyofthesefunctionshavehyperparametersthatarenotdirectlytunablebycon-ventionaloptimizers(e.g.periodandamplitudeforsin,ninxn).LearningHLearningwOriginal Linear Model: w • HHFeature Inductionw’Induced Linear Model: w’ • H’H’Figure1:Top:Atraditionallearningprocedure,assign-ingasetofweightstoafixedfeatureset.Bottom:Dis-cretization,ourfeatureinductiontechnique,expandsthefeaturesetaspartoflearning,whilestillproducingalin-earmodelforinference,albeitwithmorefeatures.Discretizationallowsustoavoidmanynon-linearities(§2.2)whilepreservingthefastinferenceprovidedbyfeaturelocality(§2.1).Wefirstdis-cretizereal-valuedfeaturesintoasetofindicator5OnecouldalsoimagineafeaturetransformationfunctionΦthatreturnsavectorofbinsforasinglevaluereturnedbyafeaturefunctionhoratransformationthathasaccesstovaluesfrommultiplefeaturefunctionsatonce.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
1
9
1
1
5
6
6
9
3
9
/
/
t
l
a
c
_
a
_
0
0
1
9
1
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
396
featuresandthenuseaconventionaloptimizertolearnaweightforeachindicatorfeature(Figure1).Thistechniqueissometimesreferredtoasbinningandiscloselyrelatedtoquantization.Effectively,discretizationallowsustore-shapeafeaturefunc-tion(Figure2).Infact,givenaninfinitenumberofbins,wecanperformanynon-lineartransformationoftheoriginalfunction.1.00.5h01.00.5h1h2h3h4RFigure2:Left:Areal-valuedfeature.Bolddotsrepre-sentpointswherewecouldimaginebinsbeingplaced.However,sincewemayonlyadjustw0,these“bins”willberigidlyfixedalongthefeaturefunction’svalue.Right:Afterdiscretizingthefeatureinto4bins,wemaynowadjust4weightsindependently,toachieveanon-linearre-shapingofthefunction.Forindicatordiscretization,wedefineΦiintermsofabinningfunctionBINi(x)∈R→N:Φi(x)=h1,i_BINi(x)i(4)wherethe_operatorindicatesconcatenationofafeatureidentifierwithabinidentifiertoformanew,uniquefeatureidentifier.3.1LocalDiscretizationUnlikeotherapproachestonon-linearlearninginMT,weperformnon-lineartransformationonpar-tialhypothesesasinequation3wherediscretiza-tionisappliedasΦi(hi(d)),whichallowslocallynon-lineartransformations,insteadofapplyingΦtocompletehypothesesasinΦi(Hi(D)),whichwouldallowgloballynon-lineartransformations.Thisen-ablesourtransformedmodeltoproducenon-linearresponseswithregardtotheinitialfeaturesetHwhileinferenceremainslinearwithregardtotheoptimizedparametersw0.Importantly,ourtrans-formedfeaturesetrequiresnoadditionalnon-localinformationforinference.Byperformingtransformationswithinalocalcontext,weeffectivelyreinterpretthefeatureset.Forexample,thefamiliartargetwordcountfeaturefoundinmanymodernMTsystemsisoftenconcep-tualizedas“whatisthecountoftargetwordsinthecompletehypothesis?”Ahypothesis-levelviewofdiscretizationwouldviewthisas“Didthishypoth-esishave5targetwords?”.Onlyonesuchfeaturewillfireforeachhypothesis.However,localdis-cretizationreinterpretsthisfeatureas“Howmanyphrasesinthecompletehypothesishave1targetword?”Manysuchfeaturesarelikelytofireforeachhypothesis.WeprovideafurtherexampleofthistechniqueinFigure3.hTM=0.1hTM=0.2hCount=2hTM=0.113hTM_0.1=1hTM_0.2=1hTM_0.1=1hCount_2=1el gato come furtivamenteFigure3:Weperformdiscretizationlocallyoneachgrammarruleorphrasepair,operatingonthelocalfea-turevectorsh.Inthisexample,theoriginalreal-valuedfeaturesarecrossedoutwithasolidgraylineandtheirdiscretizedindicatorfeaturesarewrittenabove.Whenformingacompletehypothesisfrompartialhypotheses,wesumthecountsoftheseindicatorfeaturestoob-tainthecompletefeaturevectorH.Inthisexample,H={HTM0.1:2,HTM0.2:1,HCount2:1}Intermsofpredictivepower,thistransformationcanprovidethelearnedmodelwithincreasedabil-itytodiscriminatebetweenhypotheses.Thisispri-marilyaresultofmovingtoahigher-dimensionalfeaturespace.Asweintroducenewparameters,weexpectthatsomehypothesesthatwerepreviouslyin-distinguishableunderHbecomeseparableunderH0(§2.2).Weshowspecificexamplescomparinglin-ear,locallynon-linear,andgloballynon-linearmod-elsinFigures4-6.Asseenintheseexamples,lo-callynon-linearmodels(Eq.3,4)arenotanapprox-imationnorasubsetofgloballynon-linearmodels,butratheradifferentclassofmodels.3.2BinningAlgorithmToinitializethelearningprocedure,weconstructthebinningfunctionBINusedbytheindicatordi-
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
1
9
1
1
5
6
6
9
3
9
/
/
t
l
a
c
_
a
_
0
0
1
9
1
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
397
LinearGloballyNon-LinearLocallyNon-LinearRanking∗S13heh:1.0saysh:1.0={H:2.0}S23sheh:2.0saidh:2.0={H:4.0}S14smallh:2.0kittenh:2.0={H:4.0}∗S24bigh:3.0lionh:3.0={H:6.0}∗S13heh:1.0saysh:1.0={H2:1}S23sheh:2.0saidh:2.0={H4:1}S14smallh:2.0kittenh:2.0={H4:1}∗S24bigh:3.0lionh:3.0={H6:1}∗S13heh1:1saysh1:1={H1:2}S23sheh2:1saidh2:1={H2:2}S14smallh2:1kittenh2:1={H2:2}∗S24bigh3:1lionh3:1={H3:2}Pairs(S13,S23){∆H:-2.0}⊕(S23,S13){∆H:2.0}(cid:9)(S24,S14){∆H:-2.0}(cid:9)(S14,S24){∆H:2.0}⊕(S13,S23){∆H2:1,∆H4:-1}⊕(S23,S13){∆H2:-1,∆H4:1}(cid:9)(S24,S14){∆H4:1,∆H6:-1}(cid:9)(S14,S24){∆H4:-1,∆H6:1}⊕(S13,S23){∆H1:2,∆H2:-2}⊕(S23,S13){∆H1:-2,∆H2:2}(cid:9)(S24,S14){∆H2:2,∆H3:-2}(cid:9)(S14,S24){∆H2:-2,∆H3:2}⊕PairwiseRankingΔH-202⊖⊕⊕⊖InseparableΔH2-11⊕-11ΔH4⊕H6:1H6:-1⊖⊖SeparableΔH1-11⊕-11ΔH2⊕H3:1H3:-1⊖⊖SeparableFigure4:AnexampleshowingacollinearityovermultipleinputsentencesS3,S4inwhichtheoracle-besthypothesisis“trapped”alongalinewithotherlowerqualityhypothesesinthelinearmodel’soutputspace.Rankingshowshowthehypotheseswouldappearinak-bestlistwitheachpartialderivationhavingitspartialfeaturevectorhunderit;thecompletefeaturevectorHisshowntotherightofeachhypothesisandtheoracle-besthypothesisisnotatedwitha∗.Pairsexplicatestheimplicitpairwiserankings.PairwiseRankinggraphsthosepairsinordertovisualizewhetherornotthehypothesesareseparable.(⊕indicatesthatthepairofhypothesesisrankedcorrectlyaccordingtotheextrinsicmetricand(cid:9)indicatesthepairisrankedincorrectly.Inthepairwiserankingrow,some⊕and(cid:9)pointsareannotatedwiththeirpositionsalongthethirdaxisH3(omittedforclarity).Collinearitycanalsooccurwithasingleinputhavingatleast3hypotheses.LinearGloballyNon-LinearLocallyNon-LinearRanking∗S12someh:2.0thingsh:2.0={H:4.0}S22somethingh:4.0={H:4.0}∗S12someh:2.0thingsh:2.0={H4:1}S22somethingh:4.0={H4:1}∗S12someh2:1thingsh2:1={H2:2}S22somethingh4:1={H4:1}Pairs(S12,S22){∆H:0.0}⊕(S22,S12){∆H:0.0}(cid:9)(S12,S22){∆H4:0}⊕(S22,S12){∆H4:0}(cid:9)(S12,S22){∆H2:2,∆H4:-1}⊕(S22,S12){∆H2:-2,∆H4:1}(cid:9)PairwiseRankingInseparableInseparableSeparableFigure5:Anexampleshowingatrivial“collision”inwhichtwohypothesesofdifferingqualityreceivethesamemodelscoreuntillocaldiscretizationisapplied.ThetwohypothesesareindistinguishableunderalinearmodelwiththefeaturesetH,asshownbythezero-differenceinthe“pairs”row.Whileagloballynon-lineartransformationdoesnotyieldanyimprovement,localdiscretizationallowsthehypothesestobeproperlyrankedduetothehigher-dimensionalfeaturespaceH2,H4.SeeFigure4foranexplanationofnotation.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
1
9
1
1
5
6
6
9
3
9
/
/
t
l
a
c
_
a
_
0
0
1
9
1
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
398
LinearLocallyNon-LinearRanking∗hB:1.0={HB:1.0}hA:1.0hA:1.0={HA:2.0}∗hA:1.0hA:1.0hB:1.0={HA:2.0,HB:1.0}hA:0.0={}∗hB:-4.0hB:1.0hB:1.0hB:1.0={HB:-1.0}hA:1.0={HA:1.0}∗hB:-4.0hA:1.0={HA:1.0,HB:-4.0}hB:1.0hB:1.0={HB:2.0}∗hB1:1={HB1:1}hA1:1hA1:1={HA1:2}∗hA1:1hA1:1hB1:1={HA1:2,HB1:1}hA1:0={}∗hB−4:1hB1:1hB1:1hB1:1={HB−4:1,HB1:3}hA1:1={HA1:1}∗hB−4:1hA1:1={HA1:1,HB−4:1}hB1:1hB1:1={HB1:2}PairwiseRankingΔHA-66⊕-66ΔHB⊖⊖⊖⊕⊕⊕⊖InseparableΔHA-33⊕-33ΔHB⊖⊖⊖⊕⊕⊕⊖11HB:1-4HB:-1-4HB:1-4HB:-1-4SeparableFigure6:Anexampledemonstratinganon-lineardecisionboundaryinducedbydiscretization.Thenon-linearnatureofthedecisionboundarycanbeseenclearlywhentheinducedfeaturesetHA1,HB1,HB−4(right)isconsideredintheoriginalfeaturespaceHA,HB(left).Inthepairwiserankingrow,twoaxes(HA1,HB1)areplottedwhilethethirdaxisHB−4isindicatedonlyasstand-offannotationsforclarity.Givenalargernumberofhypotheses,suchsituationscouldalsoarisewithinasinglesentence.SeeFigure4foranexplanationofnotation.cretizerΦ.Wehavetwodesiderata:(1)anymono-tonictransformationofafeatureshouldnotaffecttheinducedbinningsinceweshouldnotrequirefeatureengineerstodeterminetheoptimalfeaturetransformationand(2)nobin’sdatashouldbesosparsethattheoptimizercannotreliablyestimateaweightforeachbin.Therefore,weconstructbinsthatare(i)populateduniformlysubjectto(ii)eachbincontainingnomorethanonefeaturevalue.Wecallthisapproachuniformpopulationfeaturebin-ning.Whileonecouldconsiderthepredictivepowerofthefeatureswhendeterminingbinboundaries,thiswouldsuggestthatweshouldjointlyoptimizeanddeterminebinboundaries,whichisbeyondthescopeofthiswork.ThisproblemhasrecentlybeenconsideredforNLPbySuzukiandNagata(2013)andforMTbyLiuetal.(2013b),thoughthelatterinvolvesdecodingtheentiretrainingdata.LetXbethelistoffeaturevaluestobinwhereiindexesfeaturevaluesxi∈Xandtheirassoci-atedfrequenciesfi.Wewanteachbintohaveauniformsizeu.Forthesakeofsimplifyingourfi-nalalgorithm,wefirstcreateadjustedfrequenciesf0isothatveryfrequentfeaturevalueswillnotoc-cupymorethan100%ofabinviathefollowingal-gorithm,whichiteratesoverk:uk=1|X||X|Xi=1fki(5)fk+1i=min(fki,uk)(6)whichreturnsu0=ukwhenfki
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
1
9
1
1
5
6
6
9
3
9
/
/
t
l
a
c
_
a
_
0
0
1
9
1
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
403
AcknowledgmentsThisworkwassupportedbyGoogleFacultyRe-searchgrants2011R2705and2012R210andbytheNSF-sponsoredXSEDEcomputingresourcesprogramundergrantTG-CCR110017.ReferencesMichaelAuli,MichelGalley,ChrisQuirk,andGeoffreyZweig.2013.JointLanguageandTranslationMod-elingwithRecurrentNeuralNetworks.InEmpiricalMethodsinNaturalLanguageProcessing,numberOc-tober,pages1044–1054.R.E.Barlow,D.Bartholomew,J.M.Bremner,andH.D.Brunk.1972.Statisticalinferenceunderorderrestric-tions;thetheoryandapplicationofisotonicregres-sion.Wiley.OndejBojar,ZdenˇekˇZabokrtsk´y,OndejDuˇsek,Pe-traGaluˇsˇc´akov´a,MartinMajliˇs,DavidMareˇcek,Ji´ıMarˇs´ık,MichalNov´ak,MartinPopel,andAleˇsTam-chyna.2012.TheJoyofParallelismwithCzEng1.0.InProceedingsofLREC2012,Istanbul,Turkey.Euro-peanLanguageResourcesAssociation.PeterEBrown,StephenADellaPietra,VincentJDellaPietra,andRobertLMercer.1993.TheMathematicsofStatisticalMachineTranslation:ParameterEstima-tion.ComputationalLinguistics,10598.EugeneCharniak.2010.Top-DownNearly-Context-SensitiveParsing.InEmpiricalMethodsinNaturalLanguageProcessing,numberOctober,pages674–683.ColinCherryandGeorgeFoster.2012.BatchTuningStrategiesforStatisticalMachineTranslation.InPro-ceedingsoftheNorthAmericanAssociationforCom-putationalLinguistics,pages427–436.DavidChiang,YuvalMarton,andPhilipResnik.2008.Onlinelarge-margintrainingofsyntacticandstruc-turaltranslationfeatures.InProceedingsoftheCon-ferenceonEmpiricalMethodsinNaturalLanguageProcessing-EMNLP’08,pages224–233,Morris-town,NJ,USA.AssociationforComputationalLin-guistics.DavidChiang.2007.HierarchicalPhrase-BasedTrans-lation.ComputationalLinguistics,33(2):201–228,June.DavidChiang.2012.Hopeandfearfordiscriminativetrainingofstatisticaltranslationmodels.JournalofMachineLearningResearch,13:1159–1187.JonathanHClark,ChrisDyer,AlonLavie,andNoahASmith.2011.BetterHypothesisTestingforStatisticalMachineTranslation:ControllingforOptimizerInsta-bility.InAssociationforComputationalLinguistics.CorinnaCortes,MehryarMohri,andAfshinRos-tamizadeh.2009.LearningNon-LinearCombinationsofKernels.InAdvancesinNeuralInformationPro-cessingSystems(NIPS2009),pages1–9,Vancouver,Canada.JamesDougherty,RonKohavi,andMehranSahami.1995.SupervisedandUnsupervisedDiscretizationofContinuousFeatures.InProceedingsoftheTwelfthInternationalConferenceonMachineLearning,pages194–202,SanFrancisco,CA.ChrisDyer,JonathanWeese,AdamLopez,VladimirEi-delman,PhilBlunsom,andPhilipResnik.2010.cdec:ADecoder,Alignment,andLearningFrameworkforFinite-StateandContext-FreeTranslationModels.InAssociationforComputationalLinguistics,numberJuly,pages7–12.ThomasFinleyandThorstenJoachims.2008.TrainingstructuralSVMswhenexactinferenceisintractable.InProceedingsoftheInternationalConferenceonMa-chineLearning,pages304–311,NewYork,NewYork,USA.ACMPress.JeffreyFlanigan,ChrisDyer,andJaimeCarbonell.2013.Large-ScaleDiscriminativeTrainingforStatisticalMachineTranslationUsingHeld-OutLineSearch.InNorthAmericanAssociationforComputationalLin-guistics,numberJune,pages248–258.MichelGalley,ChrisQuirk,ColinCherry,andKristinaToutanova.2013.RegularizedMinimumErrorRateTraining.InEmpiricalMethodsinNaturalLanguageProcessing.JuriGanitkevitch,YuanCao,JonathanWeese,MattPost,andChrisCallison-Burch.2012.Joshua4.0:Pack-ing,PRO,andParaphrases.InWorkshoponStatisticalMachineTranslation,pages283–291.Jes´usGim´enezandLlu´ısM`arquez.2007.Context-awareDiscriminativePhraseSelectionforStatisticalMachineTranslation.InWorkshoponStatisticalMa-chineTranslation,numberJune,pages159–166.KevinGimpelandNoahASmith.2009.Feature-RichTranslationbyQuasi-SynchronousLatticeParsing.InEmpiricalMethodsinNaturalLanguageProcessing.KevinGimpelandNoahASmith.2012.StructuredRampLossMinimizationforMachineTranslation.InNorthAmericanAssociationforComputationalLin-guistics.XiaodongHeandLiDeng.2012.MaximumExpectedBLEUTrainingofPhraseandLexiconTranslationModels.InProceedingsoftheAssociationforCom-putationalLinguistics,JejuIsland,Korea.MicrosoftResearch.MarkHopkinsandJonathanMay.2011.TuningasRanking.ComputationalLinguistics,pages1352–1362.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
1
9
1
1
5
6
6
9
3
9
/
/
t
l
a
c
_
a
_
0
0
1
9
1
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
404
FrederickJelinek,JohnLafferty,DavidMagerman,RobertMercer,AdwaitRatnaparkhi,andSalimRoukos.1994.DecisionTreeParsingusingaHiddenDerivationModel.InWorkshoponHumanLanguageTechnologies(HLT).NalKalchbrennerandPhilBlunsom.2013.RecurrentContinuousTranslationModels.InEmpiricalMeth-odsinNaturalLanguageProcessing.SotirisKotsiantisandDimitrisKanellopoulos.2006.DiscretizationTechniques:Arecentsurvey.InGESTSInternationalTransactionsonComputerSci-enceandEngineering,volume32,pages47–58.LemaoLiu,TaroWatanabe,EiichiroSumita,andTiejunZhao.2013a.AdditiveNeuralNetworksforStatisticalMachineTranslation.InProceedingsoftheAssocia-tionforComputationalLinguistics.LemaoLiu,TiejunZhao,TaroWatanabe,andEiichiroSumita.2013b.TuningSMTwithALargeNumberofFeaturesviaOnlineFeatureGrouping.InProceed-ingsoftheInternationalJointConferenceonNaturalLanguageProcessing.AdamLopez.2008.Tera-ScaleTranslationModelsviaPatternMatching.InAssociationforComputa-tionalLinguisticsComputationalLinguistics,numberAugust,pages505–512.DavidMMagerman.1995.StatisticalDecision-TreeModelsforParsing.InAssociationforComputationalLinguistics,pages276–283.AnilNelakanti,CedricArchambeau,JulienMairal,Fran-cisBach,andGuillaumeBouchard.2013.StructuredPenaltiesforLog-linearLanguageModels.InEmpiri-calMethodsinNaturalLanguageProcessing,Seattle,WA.PatrickNguyen,MilindMahajan,XiaodongHe,andMi-crosoftWay.2007.TrainingNon-ParametricFeaturesforStatisticalMachineTranslation.InAssociationforComputationalLinguistics.FranzJosefOchandHermannNey.2002.Discrimi-nativetrainingandmaximumentropymodelsforsta-tisticalmachinetranslation.InProceedingsoftheAssociationforComputationalLinguistics,numberJuly,page295,Morristown,NJ,USA.AssociationforComputationalLinguistics.FranzJOch.2003.MinimumErrorRateTraininginSta-tisticalMachineTranslation.InAssociationforCom-putationalLinguistics,numberJuly,pages160–167.KishorePapineni,SalimRoukos,ToddWard,andWei-jingZhu.2002.BLEU:aMethodforAutomaticEvaluationofMachineTranslation.InComputationalLinguistics,numberJuly,pages311–318.TimRobertson,F.T.Wright,andR.L.Dykstra.1988.OrderRestrictedStatisticalInference.Wiley.STedSandler.2010.RegularizedLearningwithFeatureNetworks.Ph.D.thesis,UniversityofPennsylvania.HolgerSchwenk.2012.ContinuousSpaceTranslationModelsforPhrase-BasedStatisticalMachineTransla-tion.InInternationalConferenceonComputationalLinguistics(COLING),numberDecember2012,pages1071–1080,Mumbai,India.MervynJ.SilvapulleandPranabK.Sen.2005.Con-strainedStatisticalInference:Order,Inequality,andShapeConstraints.Wiley.ArtemSokolov,GuillaumeWisniewski,andFranc¸oisYvon.2012.Non-linearN-bestListRerankingwithFewFeatures.InAssociationforMachineTranslationintheAmericas.JunSuzukiandMasaakiNagata.2013.SupervisedModelLearningwithFeatureGroupingbasedonaDiscreteConstraint.InProceedingsoftheAssociationforComputationalLinguistics,pages18–23.BenTaskar,CarlosGuestrin,andDaphneKoller.2003.Max-MarginMarkovNetworks.InNeuralInforma-tionProcessingSystems.RyanJTibshirani,HolgerHoefling,andRobertTibshi-rani.2011.Nearly-IsotonicRegression.Technomet-rics,53(1):54–61.KristinaToutanovaandByung-GyuAhn.2013.Learn-ingNon-linearFeaturesforMachineTranslationUs-ingGradientBoostingMachines.InProceedingsoftheAssociationforComputationalLinguistics.IoannisTsochantaridis,ThomasHofmann,ThorstenJoachims,andYaseminAltun.2004.SupportVectorMachineLearningforInterdependentandStructuredOutputSpaces.InInternationalConferenceonMa-chineLearning(ICML).ZhuoranWang,JohnShawe-Taylor,andSandorSzed-mak.2007.KernelRegressionBasedMachineTrans-lation.InNorthAmericanAssociationforCompu-tationalLinguistics,numberApril,pages185–188,Rochester,N.DekaiWu,WeifengSu,andMarineCarpuat.2004.AKernelPCAMethodforSuperiorWordSenseDisam-biguation.InAssociationforComputationalLinguis-tics,Barcelona.PengXuandFrederickJelinek.2004.RandomForestsinLanguageModeling.InEmpiricalMethodsinNaturalLanguageProcessing.Chun-NamJohnYuandThorstenJoachims.2009.LearningstructuralSVMswithlatentvariables.InProceedingsofthe26thAnnualInternationalConfer-enceonMachineLearning-ICML’09,pages1–8,NewYork,NewYork,USA.ACMPress.