Transactions of the Association for Computational Linguistics, 2 (2014) 393–404. Action Editor: Robert C. Moore.

Submitted 2/2014; Revised 6/2014; Published 10/2014. c
(cid:13)

2014 Association for Computational Linguistics.

LocallyNon-LinearLearningforStatisticalMachineTranslationviaDiscretizationandStructuredRegularizationJonathanH.Clark∗ChrisDyer†AlonLavie†*MicrosoftResearch†CarnegieMellonUniversityRedmond,WA98052,USAPittsburgh,PA15213,USAjonathan.clark@microsoft.com{cdyer,alavie}@cs.cmu.eduAbstractLinearmodels,whichsupportefﬁcientlearn-ingandinference,aretheworkhorsesofstatis-ticalmachinetranslation;cependant,linearde-cisionrulesarelessattractivefromamodelingperspective.Inthiswork,weintroduceatech-niqueforlearningarbitrary,rule-local,non-linearfeaturetransformsthatimprovemodelexpressivity,butdonotsacriﬁcetheefﬁcientinferenceandlearningassociatedwithlinearmodels.Todemonstratethevalueofourtech-nique,wediscardthecustomarylogtransformoflexicalprobabilitiesanddropthephrasaltranslationprobabilityinfavorofrawcounts.Weobservethatouralgorithmlearnsavari-ationofalogtransformthatleadstobettertranslationqualitycomparedtotheexplicitlogtransform.Weconcludethatnon-linearre-sponsesplayanimportantroleinSMT,anob-servationthatwehopewillinformtheeffortsoffeatureengineers.1IntroductionLinearmodelsusinglog-transformedprobabilitiesasfeatureshaveemergedasthedominantmodelinMTsystems.ThispracticecanbetracedbacktotheIBMnoisychannelmodels(Brownetal.,1993),whichdecomposedecodingintotheproductofatranslationmodel(TM)andalanguagemodel(LM),motivatedbyBayes’Rule.WhenOchandNey(2002)introducedalog-linearmodelfortrans-lation(alinearsumoflog-spacefeatures),theynotedthatthenoisychannelmodelwasaspecialcaseoftheirmodelusinglogprobabilities.This∗Thisworkwasconductedaspartoftheﬁrstauthor’sPh.D.workatCarnegieMellonUniversity.sameformulationpersistedevenaftertheintroduc-tionofMERT(Och,2003),whichoptimizesalin-earmodel;again,usingtwologprobabilityfea-tures(TMandLM)withequalweightrecoveredthenoisychannelmodel.Yetsystemsnowusemanymorefeatures,someofwhicharenotevenprobabil-ities.Wenolongerbelievethatequalweightsbe-tweentheTMandLMprovidesoptimaltranslationquality;theprobabilitiesintheTMdonotobeythechainrulenorBayes’rule,nullifyingseveralthe-oreticalmathematicaljustiﬁcationsformultiplyingprobabilities.Thestoryofmultiplyingprobabilitiesmayjustamounttoheavilypenalizingsmallvalues.Thecommunityhasabandonedtheoriginalmo-tivationsforalinearinterpolationoftwolog-transformedfeatures.Isthereempiricalevidencethatweshouldcontinueusingthisparticulartrans-formation?Dowehaveanyreasontobelieveitisbetterthanothernon-lineartransformations?Toan-swerthese,weexploretheissueofnon-linearityinmodelsforMT.Intheprocess,wewilldiscusstheimpactoflinearityonfeatureengineeringandde-velopageneralmechanismforlearningaclassofnon-lineartransformationsofreal-valuedfeatures.Applyinganon-lineartransformationsuchaslogtofeaturesisonewayofachievinganon-linearresponsefunction,evenifthosefeaturesareaggre-gatedinalinearmodel.Alternatively,wecouldachieveanon-linearresponseusinganativelynon-linearmodelsuchasaSVM(Wangetal.,2007)orRankBoost(Sokolovetal.,2012).Cependant,MTisastructuredpredictionproblem,inwhichafullhypothesisiscomposedofpartialhypotheses.MTdecoderstakeadvantageofthefactthatthemodel

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

:
/
/

d
je
r
e
c
t
.

je
t
.

e
d
toi

/
t

un
c
je
/

un
r
t
je
c
e
–
p
d

F
/

d
o

je
/

1
0
1
1
6
2

/
t

un
c
_
un
_
0
0
1
9
1
1
5
6
6
9
3
9

/
t

un
c
_
un
_
0
0
1
9
1
p
d

b
oui
g
toi
e
s
t

o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3

394

scoredecomposesasalinearsumoverbothlocalfeaturesandpartialhypothesestoefﬁcientlyper-forminferenceinthesestructuredspaces(§2)–cur-rently,therearenoscalablesolutionstointegratingthehypothesis-levelnon-linearfeaturetransformstypicallyassociatedwithkernelmethodswhilestillmaintainingpolynomialtimesearch.Anotheralter-nativeisincorporatingarecurrentneuralnetwork(Schwenk,2012;Aulietal.,2013;KalchbrennerandBlunsom,2013)oranadditiveneuralnetwork(Liuetal.,2013a).Whilethesemodelshaveshownpromiseasmethodsofaugmentingexistingmod-els,theyhavenotyetofferedapathforreplacingortransformingexistingreal-valuedfeatures.Inthisarticle,wediscussbackground(§2),de-scribelocaldiscretization,ourapproachtolearningnon-lineartransformationsofindividualfeatures,compareitwithgloballynon-linearmodels(§3),presentourexperimentalsetup(§5),empiricallyver-ifytheimportanceofnon-linearfeaturetransforma-tionsinMTanddemonstratethatdiscretizationcanbeusedtorecovernon-lineartransformations(§6),discussrelatedwork(§7),andconclude(§8).2BackgroundandDeﬁnitions2.1FeatureLocality&StructuredHypothesesDecodingagivensourcesentencefcanbeex-pressedassearchovertargethypothesese,eachwithanassociatedcompletederivationD.Toﬁndthebest-scoringhypothesisˆe(F),alinearmodelappliesasetofweightswtoacompletehypothesis’featurevectorH:ˆe(F)=argmaxe,D|H|Xi=0wiHi(F,e,D)(1)Cependant,thishidesmanyoftherealitiesofperform-inginferenceinmoderndecoders.Traditionalin-ferencewouldbeintractableifeveryfeaturewereallowedaccesstotheentirederivationDanditsas-sociatedtargethypothesise.Decoderstakeadvan-tageofthefactthatfeaturesdecomposeoverpar-tialderivationsd.ForacompletederivationD,theglobalfeaturesH(D)areanefﬁcientsummationoverlocalfeaturesh(d):ˆe(F)=argmaxe,D|H|Xi=0wiXd∈Dhi(d)|{z}Hi(D)(2)Thiscontrastswithnon-localfeaturessuchasthelanguagemodel(LM),whichcannotbeexactlycal-culatedgivenanarbitrarypartialhypothesis,whichmaylackbothleftandrightcontext.1Suchfeaturesrequirespecialhandlingincludingfuturecostesti-mation.Inthisstudy,welimitourselvestolocalfeatures,leavingthetraditionalnon-localLMfea-tureunchanged.Ingeneral,featurelocalityisrel-ativetoaparticularstructuredhypothesisspace,andisunrelatedtothestructuredfeaturesdescribedinSection4.2.2.2FeatureNon-LinearityandSeparabilityUnlikemodelsthatrelyprimarilyonalargenumberofsparseindicatorfeatures,state-of-the-artmachinetranslationsystemsrelyheavilyonasmallnumberofdensereal-valuedfeatures.However,unlikeindi-catorfeatures,real-valuedfeaturesmaybeneﬁtfromnon-lineartransformationstoallowalinearmodeltobetterﬁtthedata.Decodersusealinearmodeltorankhypotheses,selectingthehighest-rankedderivation.Sincetheabsolutescoreofthemodelisirrelevant,non-linearresponsesareusefulonlyincaseswheretheyelicitnovelrankings.Inthissection,wewilldiscussthesecasesintermsofseparability.Here,wearesepa-ratingthecorrectlyrankedpairsofhypothesesfromtheincorrectintheimplicitpairwiserankingsde-ﬁnedbythetotalorderingonhypothesesprovidedbyourmodel.Whenthelocalfeaturevectorshofeachoracle-best2hypothesis(orhypotheses)aredistinctfromthoseofallothercompetinghypotheses,wesaythattheinputsareoracleseparablegiventhefeatureset.Ifthereexistsaweightvectorthatdistinguishestheoracle-bestrankingfromallotherrankingsunderalinearmodel,wesaythattheinputsarelinearlysep-arablegiventhefeatureset.Iftheinputsareora-cleseparablebutnotlinearlyseparable,wesaythattherearenon-linearitiesthatareunexplainedbythefeatureset.Forexample,thiscanhappenifafeatureispositivelyrelatedtoqualityinsomeregionsbutnegativelyrelatedinotherregions.Asweaddmoresentencestoourcorpus,sepa-rabilitybecomesincreasinglydifﬁcult.Foragiven1Thisisespeciallyproblematicforchart-baseddecoders.2Wedeﬁnetheoracle-besthypothesisintermsofsomeex-ternalqualitymeasuresuchasBLEU

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

:
/
/

d
je
r
e
c
t
.

je
t
.

e
d
toi

/
t

un
c
je
/

un
r
t
je
c
e
–
p
d

F
/

d
o

je
/

1
0
1
1
6
2

/
t

un
c
_
un
_
0
0
1
9
1
1
5
6
6
9
3
9

/
t

un
c
_
un
_
0
0
1
9
1
p
d

b
oui
g
toi
e
s
t

o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3

395

corpus,ifallhypothesesareoracleseparable,wecanalwaysproducetheoracletranslation–assum-inganoptimal(andpotentiallyverycomplex)modelandweightvector.Ifourhypothesisspacealsocon-tainsallreferencetranslations,wecanalwaysre-coverthereference.Inpractice,bothofthesecondi-tionsaretypicallyviolatedtoacertaindegree.How-ever,ifwemodifyourfeaturesetsuchthatsomelower-rankedhigher-qualityhypothesiscanbesepa-ratedfromallhigher-rankedlower-qualityhypothe-ses,thenwecanimprovetranslationquality.Forthisreason,webelievethatseparabilityremainsaninformativetoolforthinkingaboutmodelinginMT.Currently,non-linearitiesinnovelreal-valuedfea-turesaretypicallyaddressedviamanualfeatureen-gineeringinvolvingagooddealoftrialanderror(GimpelandSmith,2009)3orbymanuallydiscretiz-ingfeatures(e.g.indicatorfeaturesforcount=N).Wewillexploreonetechniqueforautomaticallyavoidingnon-linearitiesinSection3.2.3LearningwithLargeFeatureSetsWhileMERThasproventobeastrongbaseline,itdoesnotscaletolargerfeaturesetsintermsofbothinefﬁciencyandoverﬁtting.WhileMIRA(Chiangetal.,2008),Rampion(GimpelandSmith,2012),andHOLS(Flaniganetal.,2013)havebeenshowntobeeffectiveoverlargerfeaturesets,theyaredif-ﬁculttoexplicitlyregularize–thiswillbecomeim-portantinSection4.2.Therefore,weusethePROoptimizer(HopkinsandMay,2011)asourbaselinelearnersinceithasbeenshowntoperformcompa-rablytoMERTforasmallnumberoffeatures,andtosigniﬁcantlyoutperformMERTforalargenum-beroffeatures(HopkinsandMay,2011;Ganitke-vitchetal.,2012).OtherveryrecentMToptimiz-erssuchasthelinearstructuredSVM(CherryandFoster,2012),AROW(Chiang,2012)andregular-izedMERT(Galleyetal.,2013)arealsocompatiblewiththediscretizationandstructuredregularizationtechniquesdescribedinthisarticle.43Gimpeletal.eventuallyusedrawprobabilitiesintheirmodelratherthanlog-probabilities.4Sincewedispensewithnearlyalloftheoriginaldensefea-turesandourstructuredregularizerisscalesensitive,onewouldneedtousethe‘1-renormalizedvariantofregularizedMERT.3DiscretizationandFeatureInductionInthissection,weproposeafeatureinductiontech-niquebasedondiscretizationthatproducesafeaturesetthatislesspronetonon-linearities(see§2.2).WedeﬁnefeatureinductionasafunctionΦ(oui)thattakestheresultofthefeaturefunctiony=h(X)∈Randreturnsatuplehy0,jiwherey0∈Risatransformedfeaturevalueandjisthetransformedfeatureindex.5Buildingonequation2,wecanapplyfeatureinductionasfollows:ˆe(F)=argmaxe,DXd∈D|H|Xi=0hy0,ji=Φi(Salut(d))w0jy0|{z}H0(F,e,D)(3)Atﬁrstglance,onemightbetemptedtosim-plychoosesomenon-linearfunctionforΦ(e.g.log(X),exp(X),sin(X),xn).Cependant,evenifweweretorestrictourselvestosome“standard”setofnon-linearfunctions,manyofthesefunctionshavehyperparametersthatarenotdirectlytunablebycon-ventionaloptimizers(e.g.periodandamplitudeforsin,ninxn).LearningHLearningwOriginal Linear Model: w • HHFeature Inductionw’Induced Linear Model: w’ • H’H’Figure1:Top:Atraditionallearningprocedure,assign-ingasetofweightstoaﬁxedfeatureset.Bottom:Dis-cretization,ourfeatureinductiontechnique,expandsthefeaturesetaspartoflearning,whilestillproducingalin-earmodelforinference,albeitwithmorefeatures.Discretizationallowsustoavoidmanynon-linearities(§2.2)whilepreservingthefastinferenceprovidedbyfeaturelocality(§2.1).Weﬁrstdis-cretizereal-valuedfeaturesintoasetofindicator5OnecouldalsoimagineafeaturetransformationfunctionΦthatreturnsavectorofbinsforasinglevaluereturnedbyafeaturefunctionhoratransformationthathasaccesstovaluesfrommultiplefeaturefunctionsatonce.

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

:
/
/

d
je
r
e
c
t
.

je
t
.

e
d
toi

/
t

un
c
je
/

un
r
t
je
c
e
–
p
d

F
/

d
o

je
/

1
0
1
1
6
2

/
t

un
c
_
un
_
0
0
1
9
1
1
5
6
6
9
3
9

/
t

un
c
_
un
_
0
0
1
9
1
p
d

b
oui
g
toi
e
s
t

o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3

396

featuresandthenuseaconventionaloptimizertolearnaweightforeachindicatorfeature(Figure1).Thistechniqueissometimesreferredtoasbinningandiscloselyrelatedtoquantization.Effectively,discretizationallowsustore-shapeafeaturefunc-tion(Figure2).Infact,givenaninﬁnitenumberofbins,wecanperformanynon-lineartransformationoftheoriginalfunction.1.00.5h01.00.5h1h2h3h4RFigure2:Gauche:Areal-valuedfeature.Bolddotsrepre-sentpointswherewecouldimaginebinsbeingplaced.However,sincewemayonlyadjustw0,these“bins”willberigidlyﬁxedalongthefeaturefunction’svalue.Right:Afterdiscretizingthefeatureinto4bins,wemaynowadjust4weightsindependently,toachieveanon-linearre-shapingofthefunction.Forindicatordiscretization,wedeﬁneΦiintermsofabinningfunctionBINi(X)∈R→N:Φi(X)=h1,i_BINi(X)je(4)wherethe_operatorindicatesconcatenationofafeatureidentiﬁerwithabinidentiﬁertoformanew,uniquefeatureidentiﬁer.3.1LocalDiscretizationUnlikeotherapproachestonon-linearlearninginMT,weperformnon-lineartransformationonpar-tialhypothesesasinequation3wherediscretiza-tionisappliedasΦi(Salut(d)),whichallowslocallynon-lineartransformations,insteadofapplyingΦtocompletehypothesesasinΦi(Hi(D)),whichwouldallowgloballynon-lineartransformations.Thisen-ablesourtransformedmodeltoproducenon-linearresponseswithregardtotheinitialfeaturesetHwhileinferenceremainslinearwithregardtotheoptimizedparametersw0.Importantly,ourtrans-formedfeaturesetrequiresnoadditionalnon-localinformationforinference.Byperformingtransformationswithinalocalcontext,weeffectivelyreinterpretthefeatureset.Forexample,thefamiliartargetwordcountfeaturefoundinmanymodernMTsystemsisoftenconcep-tualizedas“whatisthecountoftargetwordsinthecompletehypothesis?”Ahypothesis-levelviewofdiscretizationwouldviewthisas“Didthishypoth-esishave5targetwords?”.Onlyonesuchfeaturewillﬁreforeachhypothesis.However,localdis-cretizationreinterpretsthisfeatureas“Howmanyphrasesinthecompletehypothesishave1targetword?”Manysuchfeaturesarelikelytoﬁreforeachhypothesis.WeprovideafurtherexampleofthistechniqueinFigure3.hTM=0.1hTM=0.2hCount=2hTM=0.113hTM_0.1=1hTM_0.2=1hTM_0.1=1hCount_2=1el gato come furtivamenteFigure3:Weperformdiscretizationlocallyoneachgrammarruleorphrasepair,operatingonthelocalfea-turevectorsh.Inthisexample,theoriginalreal-valuedfeaturesarecrossedoutwithasolidgraylineandtheirdiscretizedindicatorfeaturesarewrittenabove.Whenformingacompletehypothesisfrompartialhypotheses,wesumthecountsoftheseindicatorfeaturestoob-tainthecompletefeaturevectorH.Inthisexample,H={HTM0.1:2,HTM0.2:1,HCount2:1}Intermsofpredictivepower,thistransformationcanprovidethelearnedmodelwithincreasedabil-itytodiscriminatebetweenhypotheses.Thisispri-marilyaresultofmovingtoahigher-dimensionalfeaturespace.Asweintroducenewparameters,weexpectthatsomehypothesesthatwerepreviouslyin-distinguishableunderHbecomeseparableunderH0(§2.2).Weshowspeciﬁcexamplescomparinglin-ear,locallynon-linear,andgloballynon-linearmod-elsinFigures4-6.Asseenintheseexamples,lo-callynon-linearmodels(Eq.3,4)arenotanapprox-imationnorasubsetofgloballynon-linearmodels,butratheradifferentclassofmodels.3.2BinningAlgorithmToinitializethelearningprocedure,weconstructthebinningfunctionBINusedbytheindicatordi-

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

:
/
/

d
je
r
e
c
t
.

je
t
.

e
d
toi

/
t

un
c
je
/

un
r
t
je
c
e
–
p
d

F
/

d
o

je
/

1
0
1
1
6
2

/
t

un
c
_
un
_
0
0
1
9
1
1
5
6
6
9
3
9

/
t

un
c
_
un
_
0
0
1
9
1
p
d

b
oui
g
toi
e
s
t

o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3

397

LinearGloballyNon-LinearLocallyNon-LinearRanking∗S13heh:1.0saysh:1.0={H:2.0}S23sheh:2.0saidh:2.0={H:4.0}S14smallh:2.0kittenh:2.0={H:4.0}∗S24bigh:3.0lionh:3.0={H:6.0}∗S13heh:1.0saysh:1.0={H2:1}S23sheh:2.0saidh:2.0={H4:1}S14smallh:2.0kittenh:2.0={H4:1}∗S24bigh:3.0lionh:3.0={H6:1}∗S13heh1:1saysh1:1={H1:2}S23sheh2:1saidh2:1={H2:2}S14smallh2:1kittenh2:1={H2:2}∗S24bigh3:1lionh3:1={H3:2}Pairs(S13,S23){∆H:-2.0}⊕(S23,S13){∆H:2.0}(cid:9)(S24,S14){∆H:-2.0}(cid:9)(S14,S24){∆H:2.0}⊕(S13,S23){∆H2:1,∆H4:-1}⊕(S23,S13){∆H2:-1,∆H4:1}(cid:9)(S24,S14){∆H4:1,∆H6:-1}(cid:9)(S14,S24){∆H4:-1,∆H6:1}⊕(S13,S23){∆H1:2,∆H2:-2}⊕(S23,S13){∆H1:-2,∆H2:2}(cid:9)(S24,S14){∆H2:2,∆H3:-2}(cid:9)(S14,S24){∆H2:-2,∆H3:2}⊕PairwiseRankingΔH-202⊖⊕⊕⊖InseparableΔH2-11⊕-11ΔH4⊕H6:1H6:-1⊖⊖SeparableΔH1-11⊕-11ΔH2⊕H3:1H3:-1⊖⊖SeparableFigure4:AnexampleshowingacollinearityovermultipleinputsentencesS3,S4inwhichtheoracle-besthypothesisis“trapped”alongalinewithotherlowerqualityhypothesesinthelinearmodel’soutputspace.Rankingshowshowthehypotheseswouldappearinak-bestlistwitheachpartialderivationhavingitspartialfeaturevectorhunderit;thecompletefeaturevectorHisshowntotherightofeachhypothesisandtheoracle-besthypothesisisnotatedwitha∗.Pairsexplicatestheimplicitpairwiserankings.PairwiseRankinggraphsthosepairsinordertovisualizewhetherornotthehypothesesareseparable.(⊕indicatesthatthepairofhypothesesisrankedcorrectlyaccordingtotheextrinsicmetricand(cid:9)indicatesthepairisrankedincorrectly.Inthepairwiserankingrow,some⊕and(cid:9)pointsareannotatedwiththeirpositionsalongthethirdaxisH3(omittedforclarity).Collinearitycanalsooccurwithasingleinputhavingatleast3hypotheses.LinearGloballyNon-LinearLocallyNon-LinearRanking∗S12someh:2.0thingsh:2.0={H:4.0}S22somethingh:4.0={H:4.0}∗S12someh:2.0thingsh:2.0={H4:1}S22somethingh:4.0={H4:1}∗S12someh2:1thingsh2:1={H2:2}S22somethingh4:1={H4:1}Pairs(S12,S22){∆H:0.0}⊕(S22,S12){∆H:0.0}(cid:9)(S12,S22){∆H4:0}⊕(S22,S12){∆H4:0}(cid:9)(S12,S22){∆H2:2,∆H4:-1}⊕(S22,S12){∆H2:-2,∆H4:1}(cid:9)PairwiseRankingInseparableInseparableSeparableFigure5:Anexampleshowingatrivial“collision”inwhichtwohypothesesofdifferingqualityreceivethesamemodelscoreuntillocaldiscretizationisapplied.ThetwohypothesesareindistinguishableunderalinearmodelwiththefeaturesetH,asshownbythezero-differenceinthe“pairs”row.Whileagloballynon-lineartransformationdoesnotyieldanyimprovement,localdiscretizationallowsthehypothesestobeproperlyrankedduetothehigher-dimensionalfeaturespaceH2,H4.SeeFigure4foranexplanationofnotation.

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

:
/
/

d
je
r
e
c
t
.

je
t
.

e
d
toi

/
t

un
c
je
/

un
r
t
je
c
e
–
p
d

F
/

d
o

je
/

1
0
1
1
6
2

/
t

un
c
_
un
_
0
0
1
9
1
1
5
6
6
9
3
9

/
t

un
c
_
un
_
0
0
1
9
1
p
d

b
oui
g
toi
e
s
t

o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3

398

LinearLocallyNon-LinearRanking∗hB:1.0={HB:1.0}hA:1.0hA:1.0={HA:2.0}∗hA:1.0hA:1.0hB:1.0={HA:2.0,HB:1.0}hA:0.0={}∗hB:-4.0hB:1.0hB:1.0hB:1.0={HB:-1.0}hA:1.0={HA:1.0}∗hB:-4.0hA:1.0={HA:1.0,HB:-4.0}hB:1.0hB:1.0={HB:2.0}∗hB1:1={HB1:1}hA1:1hA1:1={HA1:2}∗hA1:1hA1:1hB1:1={HA1:2,HB1:1}hA1:0={}∗hB−4:1hB1:1hB1:1hB1:1={HB−4:1,HB1:3}hA1:1={HA1:1}∗hB−4:1hA1:1={HA1:1,HB−4:1}hB1:1hB1:1={HB1:2}PairwiseRankingΔHA-66⊕-66ΔHB⊖⊖⊖⊕⊕⊕⊖InseparableΔHA-33⊕-33ΔHB⊖⊖⊖⊕⊕⊕⊖11HB:1-4HB:-1-4HB:1-4HB:-1-4SeparableFigure6:Anexampledemonstratinganon-lineardecisionboundaryinducedbydiscretization.Thenon-linearnatureofthedecisionboundarycanbeseenclearlywhentheinducedfeaturesetHA1,HB1,HB−4(droite)isconsideredintheoriginalfeaturespaceHA,HB(gauche).Inthepairwiserankingrow,twoaxes(HA1,HB1)areplottedwhilethethirdaxisHB−4isindicatedonlyasstand-offannotationsforclarity.Givenalargernumberofhypotheses,suchsituationscouldalsoarisewithinasinglesentence.SeeFigure4foranexplanationofnotation.cretizerΦ.Wehavetwodesiderata:(1)anymono-tonictransformationofafeatureshouldnotaffecttheinducedbinningsinceweshouldnotrequirefeatureengineerstodeterminetheoptimalfeaturetransformationand(2)nobin’sdatashouldbesosparsethattheoptimizercannotreliablyestimateaweightforeachbin.Therefore,weconstructbinsthatare(je)populateduniformlysubjectto(ii)eachbincontainingnomorethanonefeaturevalue.Wecallthisapproachuniformpopulationfeaturebin-ning.Whileonecouldconsiderthepredictivepowerofthefeatureswhendeterminingbinboundaries,thiswouldsuggestthatweshouldjointlyoptimizeanddeterminebinboundaries,whichisbeyondthescopeofthiswork.ThisproblemhasrecentlybeenconsideredforNLPbySuzukiandNagata(2013)andforMTbyLiuetal.(2013b),thoughthelatterinvolvesdecodingtheentiretrainingdata.LetXbethelistoffeaturevaluestobinwhereiindexesfeaturevaluesxi∈Xandtheirassoci-atedfrequenciesfi.Wewanteachbintohaveauniformsizeu.Forthesakeofsimplifyingourﬁ-nalalgorithm,weﬁrstcreateadjustedfrequenciesf0isothatveryfrequentfeaturevalueswillnotoc-cupymorethan100%ofabinviathefollowingal-gorithm,whichiteratesoverk:uk=1|X||X|Xi=1fki(5)fk+1i=min(fki,uk)(6)whichreturnsu0=ukwhenfki0∀kdefR(j)=|X|−(N−j−1)BRemainingfrequencymasswithinidealbounddefC(j)=j·u0−Pjkbki←1BCurrentfeaturevalueforj∈[1,N]dowhilei≤R(j)andfi≤C(j)dobj←bj∪{xi}i←i+1endwhileBHandlevaluethatstraddlesidealboundariesbyminimizingitsviolationoftheidealifi≤R(j)andfi−C(j)fi<0.5thenbj←bj∪{xi}i←i+1endifendforreturnB4StructuredRegularizationUnfortunately,choosingtherightnumberofbinscanhaveimportanteffectsonthemodel,including:Fidelity.Ifwechoosetoofewbins,weriskdegrad-ingthemodel’sperformancebydiscardingimpor-tantdistinctionsencodedinﬁnedifferencesbetweenthefeaturevalues.Intheextreme,wecouldreduceareal-valuedfeaturetoasingleindicatorfeature.Sparsity.Ifwechoosetoomanybins,weriskmak-ingeachindicatorfeaturetoosparse,whichislikelytoresultintheoptimizeroverﬁttingsuchthatwegeneralizepoorlytounseendata.Whileonemaybetemptedtosimplythrowmoredataormillionsofsparsefeaturesattheproblem,weelecttomorestrategicallyuseexistingdata,since(1)largein-domaintuningdataisnotalwaysavailable,and(2)whenitisavailable,itcanaddconsiderablecomputationalexpense.Inthissection,weexploremethodsformitigatingdatasparsitybyembeddingmoreknowledgeintothelearningprocedure.4.1OverlappingBinsOneverysimplisticwaywecouldcombatsparsityistoextendtheedgesofeachbinsuchthattheycovertheirneighbors’values(seeEquation4):Φ0i(x)=h1,i_BINi(x)iifx∈∪i+1k=i−1BINk(8)Thisway,eachbinwillhavemoredatapointstoestimateitsweight,reducingdatasparsity,andthebinswillmutuallyconstraineachother,reducingtheabilitytooverﬁt.Weincludethistechniqueasacon-trastivebaselineforstructuredregularization.4.2LinearNeighborRegularizationRegularizationhaslongbeenusedtodiscourageop-timizationsolutionsthatgivetoomuchweighttoanyonefeature.Thisencodesourpriorknowl-edgethatsuchsolutionsareunlikelytogeneralize.Regularizationtermssuchasthe‘pnormarefre-quentlyusedingradient-basedoptimizersincludingourbaselineimplementationofPRO.Unregularizeddiscretizationispotentiallybrittlewithregardtothenumberofbinschosen.Primar-ily,itsuffersfromsparsity.Atthesametime,wenotethatweknowmuchmoreaboutdiscretizedfeaturesthaninitialfeaturessincewecontrolhowtheyareformed.Thesefeaturesmakeupastruc-turedfeaturespace.Withthesethingsinmind,weproposelinearneighborregularization,astructuredregularizerthatembedsasmallamountofknowl-edgeintotheobjectivefunction:thattheindicatorfeaturesresultingfromthediscretizationofasin-glereal-valuedfeaturearespatiallyrelated.Weex-pectsimilarweightstobegiventotheindicatorfea-turesthatrepresentneighboringvaluesoftheorigi-nalreal-valuedfeaturesuchthattheresultingtrans-formationappearssomewhatsmooth.Toincorporatethisknowledgeofnearbybins,thelinearneighborregularizerRLNRpenalizeseachfea-ture’sweightbythesquaredamountitdiffersfromitsneighbors’midpoint:RLNR(w,j)=(cid:18)12(wj−1+wj+1)−wj(cid:19)2(9)RLNR(w)=β|h|−1Xj=2RLNR(w,j)(10)Thisisaspecialcaseofthefeaturenetworkreg-ularizerofSandler(2010).Unliketraditionalregu-larizers,wedonothopetoreducetheactivefeaturecount.WiththePROlosslanda‘2regularizaterR2,ourﬁnallossfunctioninternaltoeachiterationofPROis:L(w)=l(x,y;w)+R2(w)+RLNR(w)(11) l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 1 9 1 1 5 6 6 9 3 9 / / t l a c _ a _ 0 0 1 9 1 p d . f b y g u e s t t o n 0 9 S e p e m b e r 2 0 2 3 400 4.3MonotoneNeighborRegularizationHowever,asβ→∞,thelinearneighborregular-izerRLNRforcesalineararrangementofweights–thisviolatesourpremisethatweshouldbeagnos-tictonon-lineartransformations.WenowdescribeastructuredregularizerRMNRwhoselimitingsolutionisanymonotonearrangementofweights.Weaug-mentRLNRwithasmoothdampingtermD(w,j),whichhastheshapeofabathtubcurvewithsteep-nessγ:D(w,j)=tanh2γ12(wj−1+wj+1)−wj12(wj−1−wj+1)(12)RMNR(w)=β|h|−1Xj=2D(w,j)RLNR(w,j)(13)Disnearlyzerowhilewj∈[wj−1,wj+1]andnearlyoneotherwise.Brieﬂy,thenumeratormeasureshowfarwjisfromthemidpointofwj−1andwj+1whilethedenominatorscalesthatdistancebytheradiusfromthemidpointtotheneighboringweight.5ExperimentalSetup6Formalism:Inourexperiments,weuseahierarchi-calphrase-basedtranslationmodel(Chiang,2007).Acorpusofparallelsentencesisﬁrstword-aligned,andthenphrasetranslationsareextractedheuristi-cally.Inaddition,hierarchicalgrammarrulesareextractedwherephrasesarenested.Ingeneral,ourchoiceofformalismisratherunimportant–ourtechniquesshouldapplytomostcommonphrase-basedandchart-basedparadigmsincludingHieroandsyntacticsystems.Decoder:Fordecoding,wewillusecdec(Dyeretal.,2010),amulti-passdecoderthatsupportssyn-tactictranslationmodelsandsparsefeatures.Optimizer:OptimizationisperformedusingPRO(HopkinsandMay,2011)asimplementedbythecdecdecoder.WerunPROfor30iterationsassug-gestedbyHopkinsandMay(2011).ThePROopti-mizerinternallyusesaL-BFGSoptimizerwiththedefault‘2regularizationimplementedincdec.Anyadditionalregularizationisexplicitlynoted.BaselineFeatures:WeusethebaselinefeaturesproducedbyLopez’sufﬁxarraygrammarextrac-tor(Lopez,2008),whichisdistributedwithcdec.6Allcodeathttp://github.com/jhclark/cdecBidirectionallexicallog-probabilities,thecoher-entphrasaltranslationlog-probability,targetwordcount,gluerulecount,sourceOOVcount,tar-getOOVcount,andtargetlanguagemodellog-probability.Notethatthesefeaturesmaybesim-pliﬁedorremovedasspeciﬁedineachexperimentalcondition.Zh→EnAr→EnCz→EnTrain303K5.4M1MWeightTune166417973000HyperTune108510562000Test135713132000Table1:Corpusstatistics:numberofparallelsentences.ChineseResources:FortheChinese→Englishex-periments,includingthecompletedworkpresentedinthisproposal,wetrainontheForeignBroadcastInformationService(FBIS)corpusofapproximately300,000sentencepairswithabout9.4millionEn-glishwords.WetuneweightsontheNISTMT2006dataset,tunehyperparametersonNISTMT05,andtestonNISTMT2008.ArabicResources:WebuildanArabic→Englishsystem,trainingonthelargeNISTMT2009con-strainedtrainingcorpusofapproximately5mil-lionsentencepairswithabout181millionEnglishwords.WetuneweightsontheNISTMT2006dataset,tunehyperparametersonNISTMT2005,andtestonNISTMT2008.Czechresources:WealsoconstructaCzech→EnglishsystembasedontheCzEng1.0data(Bojaretal.,2012).First,welowercasedandperformedsentence-leveldeduplicationofthedata.7Then,weuniformlysampledatrainingsetof1Msentences(sections1–97)alongwithaweight-tuningset(section98),hyperparameter-tuning(section99),andtestset(section99)fromtheparawebdomaincontainedofCzEng.8Sentenceslessthan5wordswerediscardedduetonoise.Evaluation:Wequantifyincreasesintranslationqualityusingcase-insensitiveBLEU(Papinenietal.,2002).Wecontrolfortestsetvariationandopti-mizerinstabilitybyaveragingovermultipleopti-mizerreplicas(Clarketal.,2011).97CzEngisdistributeddeduplicatedatthedocumentlevel,leadingtoveryhighsentence-leveloverlap.8ThesectionsplitsrecommendedbyBojaretal.(2012).9MultEval0.5.1:github.com/jhclark/multeval l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 1 9 1 1 5 6 6 9 3 9 / / t l a c _ a _ 0 0 1 9 1 p d . f b y g u e s t t o n 0 9 S e p e m b e r 2 0 2 3 401 Bits4812Features101130212,910TestBLEU36.436.636.8Table2:TranslationqualityforCz→Ensystemwithvaryingbitsfordiscretization.Forallotherexperiments,wetunethenumberofbitsonheld-outdata.ConditionZh→EnAr→EnCz→EnP20.8?(-2.7)44.3?(-3.6)36.5?(-1.1)logP23.5†47.9†37.6†DiscP23.4†(-0.1)47.2†(-0.7)36.8?(-0.8)Over.P20.7?(-2.8)44.6?(-3.3)36.6?(-1.0)LNRP23.1?†(-0.4)48.0†(+0.1)37.3(-0.3)MNRP23.8†(+0.3)48.7?†(+0.8)37.6†(±)MNRC23.6†(±)48.7?†(+0.8)37.4†(-0.2)Table3:Top:Translationqualityforsystemswithandwithoutthetypicallogtransform.Bottom:Transla-tionqualityforsystemsusingdiscretizationandstruc-turedregularizationwithprobabilitiesPorcountsCastheinputofdiscretization.MNRPconsistentlyrecoversoroutperformsastate-of-the-artsystem,butwithoutanyassumptionsabouthowtotransformtheinitialfeatures.Allscoresareaveragedover3end-to-endoptimizerrepli-cations.?denotessigniﬁcantlydifferentthanlogprobs(row2)withp(CHANCE)<0.01underClarketal.(2011)and†islikewiseusedwithregardtoP(row1).6Results6.1DoesNon-LinearityMatter?Inourﬁrstsetofexperiments,weseektoanswer“Doesnon-linearitymatter?”bystartingwithourbaselinesystemof7typicalfeatures(thelogProb-abilitysystem)andwethenremovethelogtrans-formfromallofthelogprobabilityfeaturesinourgrammar(theProbs.system).TheresultsareshowninTable3(rows1,2).Ifana¨ıvefeatureengi-neerweretoremovethenon-linearlogtransform,thesystemswoulddegradebetween1.1BLEUand3.6BLEU.Fromthis,weconcludethatnon-linearitydoesaffecttranslationquality.Thisisapotentialpit-fallforanyreal-valuedfeatureincludingprobabilityfeatures,countfeatures,similaritymeasures,etc.6.2LearningNon-LinearTransformationsNext,weevaluatetheeffectsofdiscretization(Disc),overlappingbins(Over.),linearneighborregularization(LNR),andmonotoneneighborreg-ularization(MNR)onthreelanguagepairs:asmallZh→Ensystem,alargeAr→EnsystemandalargeCz→Ensystem.IntheﬁrstrowofTable3,weuserawprobabilitiesratherthanlogprobabilitiesforpcoherent(t|s),plex(t|s),andplex(s|t).Inrows3–7,alltranslationmodelfeatures(withoutthelog-transformedfeatures)arethendiscretizedintoindi-catorfeatures.10Thenumberofbinsandthestruc-turedregularizationstrengthweretunedonthehy-perparametertuningset.Discretizationalonedoesnotconsistentlyrecovertheperformanceofthelogtransformedfeatures(row3).Thena¨ıveoverlapstrategyinfactdegradesperformance(row4).Linearneighborregularization(row5)behavesmoreconsistentlythandiscretiza-tionalone,butisconsistentlyoutperformedbythemonotoneneighborregularizer(row6),whichisabletomeetorsigniﬁcantlyexceedtheperformanceofthelogtransformedsystem.Importantly,thisisdonewithoutanyknowledgeofthecorrectnon-lineartransformation.Intheﬁnalrow,wegoastepfurtherbyremovingpcoherent(t|s)altogetherandreplacingitwithsimplecountfeatures:c(s)andc(s,t),withslighttonodegradationinquality.11Wetakethisasevidencethatafeatureengineerdevelop-inganewreal-valuedfeaturemayﬁnddiscretizationandmonotoneneighborregularizationuseful.Wealsoobservethatdifferentdatasetsbeneﬁtfromnon-linearfeaturetransformationintodiffer-entdegrees(Table3,rows1,2).Wenoticedthatdis-cretizationwithmonotoneneighborregularizationisabletoimproveoveralogtransform(rows2,6)inproportiontotheimprovementofalogtransformoverprobability-basedfeatures(rows1,2).Toprovideinsightintohowtranslationqualitycanbeaffectedbythenumberofbitsfordiscretization,weofferTable2.InFigure7,wepresenttheweightslearnedbytheAr→Ensystemforprobability-basedfeatures.Weseethatevenwithoutabiastowardalogtransform,alog-likeshapestillemergesformanySMTfea-turesbasedonlyonthecriteriaofoptimizingBLEUandapreferenceformonotonicity.However,theop-timizerchoosessomeimportantvariationsonthelogcurve,especiallyforlowprobabilities,thatleadto10Wealsokeepareal-valuedcopyofthewordpenaltytohelpnormalizethelanguagemodel.11Thesefeaturescansingle-outruleswithc(s)=1,c(s,t)=1,subsumingseparatelow-countfeatures l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 1 9 1 1 5 6 6 9 3 9 / / t l a c _ a _ 0 0 1 9 1 p d . f b y g u e s t t o n 0 9 S e p e m b e r 2 0 2 3 402 improvementsintranslationquality.0.00.20.40.60.81.00.00.20.40.60.81.00.020.040.060.080.10Original probability feature value0.020.07Weight050100150200250300Original raw count feature value0.080.07WeightFigure7:Plotsofweightslearnedforthediscretizedpcoherent(e|f)(top)andc(f)(bottom)fortheAr→Ensys-temwith4bitsandmonotoneneighborregularization.p(e|f)>0.11isomittedforexpositionasvalueswereconstantafterthispoint.Thegraylineﬁtsalogcurvetotheweights.Thesystemlearnsashapethatdeviatesfromtheloginseveralregions.Eachnon-monotonicsegmentrepresentsthelearnerchoosingtobetterﬁtthedatawhilepayingastrongregularizationpenalty.7RelatedWorkPreviousworkonfeaturediscretizationinmachinelearninghasfocusedontheconversionofreal-valuedfeaturesintodiscretevaluesforlearnersthatareeitherincapableofhandlingreal-valuedinputsorperformsuboptimallygivenreal-valuedinputs(Doughertyetal.,1995;KotsiantisandKanellopou-los,2006).Decisiontreesandrandomforestshavebeensuccessfullyusedinlanguagemodeling(Je-lineketal.,1994;XuandJelinek,2004)andparsing(Charniak,2010;Magerman,1995).Kernelmethodssuchassupportvectormachines(SVMs)areoftenconsideredwhennon-linearinter-actionsbetweenfeaturesaredesiredsincetheyal-lowforeasyusageofnon-linearkernels.Wuetal.(2004)showedimprovementsusingnon-linearker-nelPCAforwordsensedisambiguation.Taskaretal.(2003)describesamethodforincorporatingkernelsintostructuredMarkovnetworks.Tsochan-taridisetal.(2004)thenproposedastructuredSVMforgrammarlearning,named-entityrecognition,textclassiﬁcation,andsequencealignment.ThiswasfollowedbyastructuredSVMwithinexactin-ference(FinleyandJoachims,2008)andthelatentstructuredSVM(YuandJoachims,2009).Evenwithinkernelmethods,learningnon-linearmap-pingswithkernelsremainsanopenareaofresearch;Forexample,Cortesetal.(2009)investigatedlearn-ingnon-linearcombinationsofkernels.InMT,Gim´enezandM`arquez(2007)usedaSVMtoan-notateaphrasetablewithbinaryfeaturesindicatingwhetherornotaphrasetranslationwasappropriateincontext.Nguyenetal.(2007)alsoappliednon-linearfeaturesforSMTn-bestreranking.ToutanovaandAhn(2013)useaformofregres-siondecisiontreestoinducelocallynon-linearfea-turesinan-bestrerankingframework.HeandDeng(2012)directlyoptimizethelexicalandphrasalfea-turesusingexpectedBLEU.Nelakantietal.(2013)usetree-structured‘pregularizerstotrainlanguagemodelsandimproveperplexityoverKneser-Ney.Learningparametersunderweakorderrestric-tionshasalsobeenstudiedforregression.Isotonicregression(Barlowetal.,1972;Robertsonetal.,1988;SilvapulleandSen,2005)ﬁtsacurvetoasetofdatapointssuchthateachpointintheﬁttedcurveisgreaterthanorequaltothepreviouspointinthecurve.Nearlyisotonicregressionallowsviolationsinmonotonicity(Tibshiranietal.,2011).8ConclusionIntheabsenceofhighlyreﬁnedknowledgeaboutafeature,discretizationwithstructuredregulariza-tionenableshigherqualityimpactofnewfeaturesetsthatcontainnon-linearities.Inourexperiments,weobservedthatdiscretizationout-performedna¨ıvefeatureslackingagoodnon-lineartransformationbyupto4.4BLEUandthatitcanoutperformabaselinebyupto0.8BLEUwhiledroppingthelogtransformofthelexicalprobabilitiesandremovingthephrasalprobabilitiesinfavorofcounts.Lookingbeyondthisbasicfeatureset,non-lineartransformationscouldbethedifferencebetweenshowingqualityimprove-mentsornotfornovelfeatures.Asresearchersin-cludemorereal-valuedfeaturesincludingcounts,similaritymeasures,andseparately-trainedmodelswithmillionsoffeatures,wesuspectthiswillbe-comeanincreasinglyrelevantissue.Weconcludethatnon-linearresponsesplayanimportantroleinSMT,evenforacommonly-usedfeatureset,anob-servationthatwehopewillinformfeatureengineers.

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

:
/
/

d
je
r
e
c
t
.

je
t
.

e
d
toi

/
t

un
c
je
/

un
r
t
je
c
e
–
p
d

F
/

d
o

je
/

1
0
1
1
6
2

/
t

un
c
_
un
_
0
0
1
9
1
1
5
6
6
9
3
9

/
t

un
c
_
un
_
0
0
1
9
1
p
d

b
oui
g
toi
e
s
t

o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3

403

AcknowledgmentsThisworkwassupportedbyGoogleFacultyRe-searchgrants2011R2705and2012R210andbytheNSF-sponsoredXSEDEcomputingresourcesprogramundergrantTG-CCR110017.ReferencesMichaelAuli,MichelGalley,ChrisQuirk,andGeoffreyZweig.2013.JointLanguageandTranslationMod-elingwithRecurrentNeuralNetworks.InEmpiricalMethodsinNaturalLanguageProcessing,numberOc-tober,pages1044–1054.R.E.Barlow,D.Bartholomew,J.M.Bremner,andH.D.Brunk.1972.Statisticalinferenceunderorderrestric-tions;thetheoryandapplicationofisotonicregres-sion.Wiley.OndejBojar,ZdenˇekˇZabokrtsk´y,OndejDuˇsek,Pe-traGaluˇsˇc´akov´a,MartinMajliˇs,DavidMareˇcek,Ji´ıMarˇs´ık,MichalNov´ak,MartinPopel,andAleˇsTam-chyna.2012.TheJoyofParallelismwithCzEng1.0.InProceedingsofLREC2012,Istanbul,Turkey.Euro-peanLanguageResourcesAssociation.PeterEBrown,StephenADellaPietra,VincentJDellaPietra,andRobertLMercer.1993.TheMathematicsofStatisticalMachineTranslation:ParameterEstima-tion.ComputationalLinguistics,10598.EugeneCharniak.2010.Top-DownNearly-Context-SensitiveParsing.InEmpiricalMethodsinNaturalLanguageProcessing,numberOctober,pages674–683.ColinCherryandGeorgeFoster.2012.BatchTuningStrategiesforStatisticalMachineTranslation.InPro-ceedingsoftheNorthAmericanAssociationforCom-putationalLinguistics,pages427–436.DavidChiang,YuvalMarton,andPhilipResnik.2008.Onlinelarge-margintrainingofsyntacticandstruc-turaltranslationfeatures.InProceedingsoftheCon-ferenceonEmpiricalMethodsinNaturalLanguageProcessing-EMNLP’08,pages224–233,Morris-town,New Jersey,USA.AssociationforComputationalLin-guistics.DavidChiang.2007.HierarchicalPhrase-BasedTrans-lation.ComputationalLinguistics,33(2):201–228,June.DavidChiang.2012.Hopeandfearfordiscriminativetrainingofstatisticaltranslationmodels.JournalofMachineLearningResearch,13:1159–1187.JonathanHClark,ChrisDyer,AlonLavie,andNoahASmith.2011.BetterHypothesisTestingforStatisticalMachineTranslation:ControllingforOptimizerInsta-bility.InAssociationforComputationalLinguistics.CorinnaCortes,MehryarMohri,andAfshinRos-tamizadeh.2009.LearningNon-LinearCombinationsofKernels.InAdvancesinNeuralInformationPro-cessingSystems(NIPS2009),pages1–9,Vancouver,Canada.JamesDougherty,RonKohavi,andMehranSahami.1995.SupervisedandUnsupervisedDiscretizationofContinuousFeatures.InProceedingsoftheTwelfthInternationalConferenceonMachineLearning,pages194–202,SanFrancisco,CA.ChrisDyer,JonathanWeese,AdamLopez,VladimirEi-delman,PhilBlunsom,andPhilipResnik.2010.cdec:ADecoder,Alignment,andLearningFrameworkforFinite-StateandContext-FreeTranslationModels.InAssociationforComputationalLinguistics,numberJuly,pages7–12.ThomasFinleyandThorstenJoachims.2008.TrainingstructuralSVMswhenexactinferenceisintractable.InProceedingsoftheInternationalConferenceonMa-chineLearning,pages304–311,NewYork,NewYork,USA.ACMPress.JeffreyFlanigan,ChrisDyer,andJaimeCarbonell.2013.Large-ScaleDiscriminativeTrainingforStatisticalMachineTranslationUsingHeld-OutLineSearch.InNorthAmericanAssociationforComputationalLin-guistics,numberJune,pages248–258.MichelGalley,ChrisQuirk,ColinCherry,andKristinaToutanova.2013.RegularizedMinimumErrorRateTraining.InEmpiricalMethodsinNaturalLanguageProcessing.JuriGanitkevitch,YuanCao,JonathanWeese,MattPost,andChrisCallison-Burch.2012.Joshua4.0:Pack-ing,PRO,andParaphrases.InWorkshoponStatisticalMachineTranslation,pages283–291.Jes´usGim´enezandLlu´ısM`arquez.2007.Context-awareDiscriminativePhraseSelectionforStatisticalMachineTranslation.InWorkshoponStatisticalMa-chineTranslation,numberJune,pages159–166.KevinGimpelandNoahASmith.2009.Feature-RichTranslationbyQuasi-SynchronousLatticeParsing.InEmpiricalMethodsinNaturalLanguageProcessing.KevinGimpelandNoahASmith.2012.StructuredRampLossMinimizationforMachineTranslation.InNorthAmericanAssociationforComputationalLin-guistics.XiaodongHeandLiDeng.2012.MaximumExpectedBLEUTrainingofPhraseandLexiconTranslationModels.InProceedingsoftheAssociationforCom-putationalLinguistics,JejuIsland,Korea.MicrosoftResearch.MarkHopkinsandJonathanMay.2011.TuningasRanking.ComputationalLinguistics,pages1352–1362.

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

:
/
/

d
je
r
e
c
t
.

je
t
.

e
d
toi

/
t

un
c
je
/

un
r
t
je
c
e
–
p
d

F
/

d
o

je
/

1
0
1
1
6
2

/
t

un
c
_
un
_
0
0
1
9
1
1
5
6
6
9
3
9

/
t

un
c
_
un
_
0
0
1
9
1
p
d

b
oui
g
toi
e
s
t

o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3

404

FrederickJelinek,JohnLafferty,DavidMagerman,RobertMercer,AdwaitRatnaparkhi,andSalimRoukos.1994.DecisionTreeParsingusingaHiddenDerivationModel.InWorkshoponHumanLanguageTechnologies(HLT).NalKalchbrennerandPhilBlunsom.2013.RecurrentContinuousTranslationModels.InEmpiricalMeth-odsinNaturalLanguageProcessing.SotirisKotsiantisandDimitrisKanellopoulos.2006.DiscretizationTechniques:Arecentsurvey.InGESTSInternationalTransactionsonComputerSci-enceandEngineering,volume32,pages47–58.LemaoLiu,TaroWatanabe,EiichiroSumita,andTiejunZhao.2013a.AdditiveNeuralNetworksforStatisticalMachineTranslation.InProceedingsoftheAssocia-tionforComputationalLinguistics.LemaoLiu,TiejunZhao,TaroWatanabe,andEiichiroSumita.2013b.TuningSMTwithALargeNumberofFeaturesviaOnlineFeatureGrouping.InProceed-ingsoftheInternationalJointConferenceonNaturalLanguageProcessing.AdamLopez.2008.Tera-ScaleTranslationModelsviaPatternMatching.InAssociationforComputa-tionalLinguisticsComputationalLinguistics,numberAugust,pages505–512.DavidMMagerman.1995.StatisticalDecision-TreeModelsforParsing.InAssociationforComputationalLinguistics,pages276–283.AnilNelakanti,CedricArchambeau,JulienMairal,Fran-cisBach,andGuillaumeBouchard.2013.StructuredPenaltiesforLog-linearLanguageModels.InEmpiri-calMethodsinNaturalLanguageProcessing,Seattle,WA.PatrickNguyen,MilindMahajan,XiaodongHe,andMi-crosoftWay.2007.TrainingNon-ParametricFeaturesforStatisticalMachineTranslation.InAssociationforComputationalLinguistics.FranzJosefOchandHermannNey.2002.Discrimi-nativetrainingandmaximumentropymodelsforsta-tisticalmachinetranslation.InProceedingsoftheAssociationforComputationalLinguistics,numberJuly,page295,Morristown,New Jersey,USA.AssociationforComputationalLinguistics.FranzJOch.2003.MinimumErrorRateTraininginSta-tisticalMachineTranslation.InAssociationforCom-putationalLinguistics,numberJuly,pages160–167.KishorePapineni,SalimRoukos,ToddWard,andWei-jingZhu.2002.BLEU:aMethodforAutomaticEvaluationofMachineTranslation.InComputationalLinguistics,numberJuly,pages311–318.TimRobertson,F.T.Wright,andR.L.Dykstra.1988.OrderRestrictedStatisticalInference.Wiley.STedSandler.2010.RegularizedLearningwithFeatureNetworks.Ph.D.thesis,UniversityofPennsylvania.HolgerSchwenk.2012.ContinuousSpaceTranslationModelsforPhrase-BasedStatisticalMachineTransla-tion.InInternationalConferenceonComputationalLinguistics(COLING),numberDecember2012,pages1071–1080,Mumbai,India.MervynJ.SilvapulleandPranabK.Sen.2005.Con-strainedStatisticalInference:Order,Inequality,andShapeConstraints.Wiley.ArtemSokolov,GuillaumeWisniewski,andFranc¸oisYvon.2012.Non-linearN-bestListRerankingwithFewFeatures.InAssociationforMachineTranslationintheAmericas.JunSuzukiandMasaakiNagata.2013.SupervisedModelLearningwithFeatureGroupingbasedonaDiscreteConstraint.InProceedingsoftheAssociationforComputationalLinguistics,pages18–23.BenTaskar,CarlosGuestrin,andDaphneKoller.2003.Max-MarginMarkovNetworks.InNeuralInforma-tionProcessingSystems.RyanJTibshirani,HolgerHoeﬂing,andRobertTibshi-rani.2011.Nearly-IsotonicRegression.Technomet-rics,53(1):54–61.KristinaToutanovaandByung-GyuAhn.2013.Learn-ingNon-linearFeaturesforMachineTranslationUs-ingGradientBoostingMachines.InProceedingsoftheAssociationforComputationalLinguistics.IoannisTsochantaridis,ThomasHofmann,ThorstenJoachims,andYaseminAltun.2004.SupportVectorMachineLearningforInterdependentandStructuredOutputSpaces.InInternationalConferenceonMa-chineLearning(ICML).ZhuoranWang,JohnShawe-Taylor,andSandorSzed-mak.2007.KernelRegressionBasedMachineTrans-lation.InNorthAmericanAssociationforCompu-tationalLinguistics,numberApril,pages185–188,Rochester,N.DekaiWu,WeifengSu,andMarineCarpuat.2004.AKernelPCAMethodforSuperiorWordSenseDisam-biguation.InAssociationforComputationalLinguis-tics,Barcelona.PengXuandFrederickJelinek.2004.RandomForestsinLanguageModeling.InEmpiricalMethodsinNaturalLanguageProcessing.Chun-NamJohnYuandThorstenJoachims.2009.LearningstructuralSVMswithlatentvariables.InProceedingsofthe26thAnnualInternationalConfer-enceonMachineLearning-ICML’09,pages1–8,NewYork,NewYork,USA.ACMPress. Transactions of the Association for Computational Linguistics, 2 (2014) 393–404. Action Editor: Robert C. Moore. image

Télécharger le PDF