Transactions of the Association for Computational Linguistics, vol. 5, pp. 87–99, 2017. Action Editor: Chris Quirk.
Submission batch: 6/2016; Revision batch: 10/2016; Published 3/2017.
© 2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Context Gates for Neural Machine Translation

Zhaopeng Tu†  Yang Liu‡  Zhengdong Lu†  Xiaohua Liu†  Hang Li†
†Noah's Ark Lab, Huawei Technologies, Hong Kong
{tu.zhaopeng, lu.zhengdong, liuxiaohua3, hangli.hl}@huawei.com
‡Department of Computer Science and Technology, Tsinghua University, Beijing
liuyang2011@tsinghua.edu.cn

Abstract

In neural machine translation (NMT), generation of a target word depends on both source and target contexts. We find that source contexts have a direct impact on the adequacy of a translation while target contexts affect the fluency. Intuitively, generation of a content word should rely more on the source context and generation of a functional word should rely more on the target context. Due to the lack of effective control over the influence from source and target contexts, conventional NMT tends to yield fluent but inadequate translations. To address this problem, we propose context gates which dynamically control the ratios at which source and target contexts contribute to the generation of target words. In this way, we can enhance both the adequacy and fluency of NMT with more careful control of the information flow from contexts. Experiments show that our approach significantly improves upon a standard attention-based NMT system by +2.3 BLEU points.

1 Introduction

Neural machine translation (NMT) (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015) has made significant progress in the past several years. Its goal is to construct and utilize a single large neural network to accomplish the entire translation task. One great advantage of NMT is that the translation system can be completely constructed by learning from data without human involvement (cf., feature engineering in statistical machine translation (SMT)). The encoder-decoder architecture is widely employed (Cho et al., 2014; Sutskever et al., 2014), in which the encoder summarizes the source sentence into a vector representation, and the decoder generates the target sentence word by word from the vector representation. The representation of the source sentence and the representation of the partially generated target sentence (translation) at each position are referred to as source context and target context, respectively. The generation of a target word is determined jointly by the source context and target context.

Table 1: Source and target contexts are highly correlated to translation adequacy and fluency, respectively. ½src and ½tgt denote halving the contributions from the source and target contexts when generating the translation, respectively.

input | jīnnián qián liǎng yuè guǎngdōng gāoxīn jìshù chǎnpǐn chūkǒu 37.6 yì měiyuán
NMT   | in the first two months of this year, the export of new high level technology product was UNK - billion us dollars
½src  | china 's guangdong hi-tech exports hit 58 billion dollars
½tgt  | china 's export of high and new hi-tech exports of the export of the export of the export of the export of the export of the export of the export of the export of ···

Several techniques in NMT have proven to be very effective, including gating (Hochreiter and Schmidhuber, 1997; Cho et al., 2014) and attention (Bahdanau et al., 2015), which can model long-distance dependencies and complicated alignment relations in the translation process.
Using an encoder-decoder framework that incorporates gating and attention techniques, it has been reported that the performance of NMT can surpass the performance of traditional SMT as measured by BLEU score (Luong et al., 2015).

Despite this success, we observe that NMT usually yields fluent but inadequate translations. (Fluency measures whether the translation is fluent, while adequacy measures whether the translation is faithful to the original sentence (Snover et al., 2009).) We attribute this to a stronger influence of target context on generation, which results from a stronger language model than that used in SMT. One question naturally arises: what will happen if we change the ratio of influences from the source or target contexts?

Table 1 shows an example in which an attention-based NMT system (Bahdanau et al., 2015) generates a fluent yet inadequate translation (e.g., missing the translation of "guǎngdōng"). When we halve the contribution from the source context, the result further loses its adequacy by missing the partial translation "in the first two months of this year". One possible explanation is that the target context takes a higher weight and thus the system favors a shorter translation. In contrast, when we halve the contribution from the target context, the result completely loses its fluency by repeatedly generating the translation of "chūkǒu" (i.e., "the export of") until the generated translation reaches the maximum length. Therefore, this example indicates that source and target contexts in NMT are highly correlated to translation adequacy and fluency, respectively.

In fact, conventional NMT lacks effective control on the influence of source and target contexts. At each decoding step, NMT treats the source and target contexts equally, and thus ignores the different needs of the contexts. For example, content words in the target sentence are more related to the translation adequacy, and thus should depend more on the source context. In contrast, function words in the target sentence are often more related to the translation fluency (e.g., "of" after "is fond"), and thus should depend more on the target context.

In this work, we propose to use context gates to control the contributions of source and target contexts to the generation of target words (decoding) in NMT. Context gates are non-linear gating units which can dynamically select the amount of context information in the decoding process. Specifically, at each decoding step, the context gate examines both the source and target contexts, and outputs a ratio between zero and one to determine the percentages of information to utilize from the two contexts. In this way, the system can balance the adequacy and fluency of the translation with regard to the generation of a word at each position.

Figure 1: Architecture of decoder RNN.

Experimental results show that introducing context gates leads to an average improvement of +2.3 BLEU points over a standard attention-based NMT system (Bahdanau et al., 2015). An interesting finding is that we can replace the GRU units in the decoder with conventional RNN units and in the meantime utilize context gates. The translation performance is comparable with the standard NMT system with GRU, but the system enjoys a simpler structure (i.e., uses only a single gate and half of the parameters) and a faster decoding (i.e., requires only half the matrix computations for decoding).

2 Neural Machine Translation

Suppose that x = x_1, …, x_j, …, x_J represents a source sentence and y = y_1, …, y_i, …, y_I a target sentence. NMT directly models the probability of translation from the source sentence to the target sentence word by word:

P(y|x) = ∏_{i=1}^{I} P(y_i | y_{<i}, x)

In the implementation at github.com/nyu-dl/dl4mt-tutorial, t_{i−1} and y_{i−1} are combined together with a GRU before being fed into the decoder, which can boost translation performance. We follow this practice and treat both of them as target context.
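To make the factorization above concrete, here is a small illustrative computation (the per-word probabilities are hypothetical, not taken from the paper):

```python
import math

# Hypothetical per-word conditionals P(y_i | y_<i, x) for a three-word translation
word_probs = [0.4, 0.7, 0.9]

# log P(y|x) = sum_i log P(y_i | y_<i, x): the sentence probability factorizes
# into one conditional per generated target word
log_p = sum(math.log(p) for p in word_probs)
print(math.exp(log_p))  # 0.252 = 0.4 * 0.7 * 0.9
```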
To investigate their effects, we introduce coefficients a and b that scale the contributions from the source and target contexts:

t_i = f(b ⊗ (W e(y_{i−1}) + U t_{i−1}) + a ⊗ C s_i)

For example, the pair (1.0, 0.5) means fully leveraging the effect of source context while halving the effect of target context. Reducing the effect of target context (i.e., the lines (1.0, 0.8) and (1.0, 0.5)) results in longer translations, while reducing the effect of source context (i.e., the lines (0.8, 1.0) and (0.5, 1.0)) leads to shorter translations. When halving the effect of the target context, most of the generated translations reach the maximum length, which is three times the length of the source sentence in this work.

Figure 2(b) shows the results of manual evaluation on 200 source sentences randomly sampled from the test sets. Reducing the effect of source context (i.e., (0.8, 1.0) and (0.5, 1.0)) leads to more fluent yet less adequate translations. On the other hand, reducing the effect of target context (i.e., (1.0, 0.5) and (1.0, 0.8)) is expected to yield more adequate but less fluent translations. In this setting, the source words are translated (i.e., higher adequacy) while the translations are in the wrong order (i.e., lower fluency). In practice, however, we observe the side effect that some source words are translated repeatedly until the translation reaches the maximum length (i.e., lower fluency), while others are left untranslated (i.e., lower adequacy). The reason is twofold:

1. NMT lacks a mechanism that guarantees that each source word is translated. The decoding state implicitly models the notion of "coverage" by recurrently reading the time-dependent source context s_i. Lowering its contribution weakens the "coverage" effect and encourages the decoder to regenerate phrases multiple times to achieve the desired translation length. (The recently proposed coverage-based technique can alleviate this problem (Tu et al., 2016). In this work, we consider another approach, which is complementary to the coverage mechanism.)

2. The translation is incomplete. As shown in Table 1, NMT can get stuck in an infinite loop repeatedly generating a phrase due to the overwhelming influence of the source context. As a result, generation terminates early because the translation reaches the maximum length allowed by the implementation, even though the decoding procedure is not finished.

Figure 3: Architecture of context gate.

The quantitative (Figure 2) and qualitative (Table 1) results confirm our hypothesis, i.e., source and target contexts are highly correlated to translation adequacy and fluency. We believe that a mechanism that can dynamically select information from the source context and target context would be useful for NMT models, and this is exactly the approach we propose.

3 Context Gates

3.1 Architecture

Inspired by the success of gated units in RNN (Hochreiter and Schmidhuber, 1997; Cho et al., 2014), we propose using context gates to dynamically control the amount of information flowing from the source and target contexts and thus balance the fluency and adequacy of NMT at each decoding step.

Intuitively, at each decoding step i, the context gate looks at input signals from both the source (i.e., s_i) and target (i.e., t_{i−1} and y_{i−1}) sides, and outputs a number between 0 and 1 for each element in the input vectors, where 1 denotes "completely transferring this" while 0 denotes "completely ignoring this". The corresponding input signals are then processed with an element-wise multiplication before being fed to the activation layer to update the decoding state.
Formally, a context gate consists of a sigmoid neural network layer and an element-wise multiplication operation, as illustrated in Figure 3. The context gate assigns an element-wise weight to the input signals, computed by

z_i = σ(W_z e(y_{i−1}) + U_z t_{i−1} + C_z s_i)    (4)

Here σ(·) is a logistic sigmoid function, and W_z ∈ R^{n×m}, U_z ∈ R^{n×n}, C_z ∈ R^{n×n′} are the weight matrices. Again, m, n and n′ are the dimensions of the word embedding, the decoding state, and the source representation, respectively. Note that z_i has the same dimensionality as the transferred input signals (e.g., C s_i), and thus each element in the input vectors has its own weight.

Figure 4: Architectures of NMT with various context gates, which either scale only one side of the translation contexts (i.e., the source context in (a) Context Gate (source) and the target context in (b) Context Gate (target)) or control the effects of both sides (i.e., (c) Context Gate (both)).

3.2 Integrating Context Gates into NMT

Next, we consider how to integrate context gates into an NMT model.

The context gate can decide the amount of context information used in generating the next target word at each step of decoding. For example, after obtaining the partial translation "... new high level technology product", the gate looks at the translation contexts and decides to depend more heavily on the source context. Accordingly, the gate assigns higher weights to the source context and lower weights to the target context and then feeds them into the decoding activation layer. This could correct inadequate translations, such as the missing translation of "guǎngdōng", due to greater influence from the target context.

We have three strategies for integrating context gates into NMT that either affect one of the translation contexts or both contexts, as illustrated in Figure 4. The first two strategies are inspired by the output gates in LSTMs (Hochreiter and Schmidhuber, 1997), which control the amount of memory content utilized. In these kinds of models, z_i only affects either the source context (i.e., s_i) or the target context (i.e., y_{i−1} and t_{i−1}):

• Context Gate (source):
  t_i = f(W e(y_{i−1}) + U t_{i−1} + z_i ◦ C s_i)

• Context Gate (target):
  t_i = f(z_i ◦ (W e(y_{i−1}) + U t_{i−1}) + C s_i)

where ◦ is an element-wise multiplication, and z_i is the context gate calculated by Equation 4. This is also essentially similar to the reset gate in the GRU, which decides what information to forget from the previous decoding state before transferring that information to the decoding activation layer. The difference is that here the "reset" gate resets the context vector rather than the previous decoding state.

The last strategy is inspired by the concept of the update gate in the GRU, which takes a linear sum between the previous state t_{i−1} and the candidate new state ˜t_i. In our case, we take a linear interpolation between the source and target contexts:

• Context Gate (both):
  t_i = f((1 − z_i) ◦ (W e(y_{i−1}) + U t_{i−1}) + z_i ◦ C s_i)
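As an illustration of Equation 4 and the Context Gate (both) update, the following is a minimal PyTorch sketch, not the authors' released implementation; the module and variable names are ours, and the activation f is taken to be tanh as in the vanilla decoder:

```python
import torch
import torch.nn as nn

class ContextGateBoth(nn.Module):
    """Sketch of Context Gate (both):
    z_i = sigmoid(W_z e(y_{i-1}) + U_z t_{i-1} + C_z s_i)
    t_i = f((1 - z_i) o (W e(y_{i-1}) + U t_{i-1}) + z_i o C s_i)
    """
    def __init__(self, emb_dim, hidden_dim, src_dim):
        super().__init__()
        # Gate parameters W_z, U_z, C_z (Equation 4)
        self.W_z = nn.Linear(emb_dim, hidden_dim, bias=False)
        self.U_z = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.C_z = nn.Linear(src_dim, hidden_dim, bias=False)
        # Decoder state parameters W, U, C with a vanilla tanh activation f
        self.W = nn.Linear(emb_dim, hidden_dim, bias=False)
        self.U = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.C = nn.Linear(src_dim, hidden_dim, bias=False)

    def forward(self, y_prev_emb, t_prev, s_i):
        # z: element-wise ratio in (0, 1) balancing source vs. target contexts
        z = torch.sigmoid(self.W_z(y_prev_emb) + self.U_z(t_prev) + self.C_z(s_i))
        target_part = self.W(y_prev_emb) + self.U(t_prev)  # target context signal
        source_part = self.C(s_i)                          # source context signal
        # Context Gate (both): linear interpolation between the two contexts
        return torch.tanh((1.0 - z) * target_part + z * source_part)
```

For the source-only or target-only variants, z would multiply only source_part or target_part, respectively, with the other term left unscaled.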
Figure 5: Comparison to the Gating Scalar proposed by Xu et al. (2015): (a) Gating Scalar, (b) Context Gate.

4 Related Work

Comparison to Xu et al. (2015): Context gates are inspired by the gating scalar model proposed by Xu et al. (2015) for the image caption generation task. The essential difference lies in the task requirement:

• In image caption generation, the source side (i.e., the image) contains more information than the target side (i.e., the caption). Therefore, they employ a gating scalar to scale only the source context.

• In machine translation, both languages should contain equivalent information. Our model jointly controls the contributions from the source and target contexts. A direct interaction between input signals from both sides is useful for balancing the adequacy and fluency of NMT.

Other differences in the architecture include:

1. Xu et al. (2015) use a scalar that is shared by all elements in the source context, while we employ a gate with a distinct weight for each element. The latter offers the gate a more precise control of the context vector, since different elements retain different information.

2. We add peephole connections to the architecture, by which the source context controls the gate. It has been shown that peephole connections make precise timings easier to learn (Gers and Schmidhuber, 2000).

3. Our context gate also considers the previously generated word y_{i−1} as input. The most recently generated word can help the gate to better estimate the importance of the target context, especially for the generation of function words in translations that may not have a corresponding word in the source sentence (e.g., "of" after "is fond").

Experimental results (Section 5.4) show that these modifications consistently improve translation quality.

Comparison to Gated RNN: State-of-the-art NMT models (Sutskever et al., 2014; Bahdanau et al., 2015) generally employ a gated unit (e.g., GRU or LSTM) as the activation function in the decoder. One might suspect that the context gate proposed in this work is somewhat redundant, given the existing gates that control the amount of information carried over from the previous decoding state t_{i−1} (e.g., the reset gate in GRU). We argue that they are in fact complementary: the context gate regulates the contextual information flowing into the decoding state, while the gated unit captures long-term dependencies between decoding states. Our experiments confirm the correctness of our hypothesis: the context gate not only improves translation quality when compared to a conventional RNN unit (e.g., an element-wise tanh), but also when compared to a gated unit of GRU, as shown in Section 5.2.

Comparison to Coverage Mechanism: Recently, Tu et al. (2016) propose adding a coverage mechanism into NMT to alleviate over-translation and under-translation problems, which directly affect translation adequacy. They maintain a coverage vector to keep track of which source words have been translated. The coverage vector is fed to the attention model to help adjust future attention. This guides NMT to focus on the un-translated source words while avoiding repetition of source content. Our approach is complementary: the coverage mechanism produces a better source context representation, while our context gate controls the effect of the source context based on its relative importance. Experiments in Section 5.2 show that combining the two methods can further improve translation performance.
There is another difference as well: the coverage mechanism is only applicable to attention-based NMT models, while the context gate is applicable to all NMT models.

Comparison to Exploiting Auxiliary Contexts in Language Modeling: A thread of work in language modeling (LM) attempts to exploit auxiliary sentence-level or document-level context in an RNN LM (Mikolov and Zweig, 2012; Ji et al., 2015; Wang and Cho, 2016). Independent of our work, Wang and Cho (2016) propose "early fusion" models of RNNs where additional information from an inter-sentence context is "fused" with the input to the RNN. Closely related to Wang and Cho (2016), our approach aims to dynamically control the contributions of the required source and target contexts for machine translation, while theirs focuses on integrating auxiliary corpus-level contexts for language modelling to better approximate the corpus-level probability. In addition, we employ a gating mechanism to produce a dynamic weight at different decoding steps to combine source and target contexts, while they do a linear combination of intra-sentence and inter-sentence contexts with static weights. Experiments in Section 5.2 show that our gating mechanism significantly outperforms linear interpolation when combining contexts.

Comparison to Handling Null-Generated Words in SMT: In machine translation, there are certain syntactic elements of the target language that are missing in the source (i.e., null-generated words). In fact this was the preliminary motivation for our approach: current attention models lack a mechanism to control the generation of words that do not have a strong correspondence on the source side. The model structure of NMT is quite similar to the traditional word-based SMT (Brown et al., 1993). Therefore, techniques that have proven effective in SMT may also be applicable to NMT. Toutanova et al. (2002) extend the calculation of translation probabilities to include null-generated target words in word-based SMT. These words are generated based on both the special source token null and the neighbouring word in the target language by a mixture model. We have simplified and generalized their approach: we use context gates to dynamically control the contribution of the source context. When producing null-generated words, the context gate can assign lower weights to the source context, by which the source-side information has less influence. In a sense, the context gate relieves the need for a null state in attention.

5 Experiments

5.1 Setup

We carried out experiments on Chinese-English translation. The training dataset consisted of 1.25M sentence pairs extracted from LDC corpora (LDC2002E18, LDC2003E07, LDC2003E14, the Hansards portion of LDC2004T07, LDC2004T08, and LDC2005T06), with 27.9M Chinese words and 34.5M English words respectively. We chose the NIST 2002 (MT02) dataset as the development set, and the NIST 2005 (MT05), 2006 (MT06) and 2008 (MT08) datasets as the test sets. We used the case-insensitive 4-gram NIST BLEU score (Papineni et al., 2002) as the evaluation metric, and the sign-test (Collins et al., 2005) for the statistical significance test.

For efficient training of the neural networks, we limited the source and target vocabularies to the most frequent 30K words in Chinese and English, covering approximately 97.7% and 99.3% of the data in the two languages respectively. All out-of-vocabulary words were mapped to a special token UNK. We trained each model on sentences of length up to 80 words in the training data. The word embedding dimension was 620 and the size of a hidden layer was 1000. We trained our models until the BLEU score on the development set stopped improving.

We compared our method with representative SMT and NMT models (there is some recent progress on aggregating multiple models or enlarging the vocabulary, e.g., in (Jean et al., 2015), but here we focus on the generic models):

• Moses (Koehn et al., 2007): an open source phrase-based translation system with default configuration and a 4-gram language model trained on the target portion of the training data;

• GroundHog (Bahdanau et al., 2015): an open source attention-based NMT model with default setting. We have two variants that differ in the activation function used in the decoder RNN: 1) GroundHog (vanilla) uses a simple tanh function as the activation function, and 2) GroundHog (GRU) uses a sophisticated gate function GRU;
• GroundHog-Coverage (Tu et al., 2016) (https://github.com/tuzhaopeng/NMT-Coverage): an improved attention-based NMT model with a coverage mechanism.

Table 2: Evaluation of translation quality measured by case-insensitive BLEU score. "GroundHog (vanilla)" and "GroundHog (GRU)" denote attention-based NMT (Bahdanau et al., 2015) using a simple tanh function or a sophisticated gate function GRU, respectively, as the activation function in the decoder RNN. "GroundHog-Coverage" denotes attention-based NMT with a coverage mechanism to indicate whether a source word is translated or not (Tu et al., 2016). "*" indicates a statistically significant difference (p < 0.01) from the corresponding NMT variant. "2 + Context Gate (both)" denotes integrating "Context Gate (both)" into the baseline system in Row 2 (i.e., "GroundHog (vanilla)").

# | System                      | #Parameters | MT05   | MT06   | MT08   | Ave.
1 | Moses                       | –           | 31.37  | 30.85  | 23.01  | 28.41
2 | GroundHog (vanilla)         | 77.1M       | 26.07  | 27.34  | 20.38  | 24.60
3 | 2 + Context Gate (both)     | 80.7M       | 30.86* | 30.85* | 24.71* | 28.81
4 | GroundHog (GRU)             | 84.3M       | 30.61  | 31.12  | 23.23  | 28.32
5 | 4 + Context Gate (source)   | 87.9M       | 31.96* | 32.29* | 24.97* | 29.74
6 | 4 + Context Gate (target)   | 87.9M       | 32.38* | 32.11* | 23.78  | 29.42
7 | 4 + Context Gate (both)     | 87.9M       | 33.52* | 33.46* | 24.85* | 30.61
8 | GroundHog-Coverage (GRU)    | 84.4M       | 32.73  | 32.47  | 25.23  | 30.14
9 | 8 + Context Gate (both)     | 88.0M       | 34.13* | 34.83* | 26.22* | 31.73

5.2 Translation Quality

Table 2 shows the translation performances in terms of BLEU scores. We carried out experiments on multiple NMT variants. For example, "2 + Context Gate (both)" in Row 3 denotes integrating "Context Gate (both)" into the baseline in Row 2 (i.e., GroundHog (vanilla)). For the baselines, we found that the gated unit (i.e., GRU, Row 4) indeed surpasses its vanilla counterpart (i.e., tanh, Row 2), which is consistent with the results in other work (Chung et al., 2014). Clearly the proposed context gates significantly improve the translation quality in all cases, although there are still considerable differences among the variants:

Parameters  Context gates introduce a few new parameters. The newly introduced parameters include W_z ∈ R^{n×m}, U_z ∈ R^{n×n}, C_z ∈ R^{n×n′} in Equation 4. In this work, the dimensionality of the decoding state is n = 1000, the dimensionality of the word embedding is m = 620, and the dimensionality of the context representation is n′ = 2000. The context gates only introduce 3.6M additional parameters, which is quite small compared to the number of parameters in the existing models (e.g., 84.3M in "GroundHog (GRU)").

Over GroundHog (vanilla)  We first carried out experiments on a simple decoder without a gating function (Rows 2 and 3), to better estimate the impact of context gates. As shown in Table 2, the proposed context gate significantly improved translation performance by 4.2 BLEU points on average. It is worth emphasizing that the context gate even outperforms a more sophisticated gating function (i.e., GRU in Row 4). This is very encouraging, since our model only has a single gate with half of the parameters (i.e., 3.6M versus 7.2M) and less computation (i.e., half the matrix computations to update the decoding state). (We only need to calculate the context gate once via Equation 4 and then apply it when updating the decoding state. In contrast, GRU requires the calculation of an update gate, a reset gate, a proposed updated decoding state, and an interpolation between the previous state and the proposed state. Please refer to (Cho et al., 2014) for more details.)
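As a quick sanity check of the parameter count stated above, here is a back-of-the-envelope calculation (not an excerpt from the implementation):

```python
# Dimensions reported above: decoding state n, word embedding m, context representation n'
n, m, n_prime = 1000, 620, 2000

# W_z (n x m), U_z (n x n), C_z (n x n') from Equation 4
extra_params = n * m + n * n + n * n_prime
print(extra_params)  # 3,620,000 -> roughly the 3.6M additional parameters reported
```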
Table 3: Subjective evaluation of translation adequacy and fluency (GroundHog vs. GroundHog + Context Gate; "<" means the GroundHog translation is worse, "=" equal, ">" better).

            | Adequacy               | Fluency
            | <      =       >       | <      =       >
evaluator 1 | 30.0%  54.0%   16.0%   | 28.5%  48.5%   23.0%
evaluator 2 | 30.0%  50.0%   20.0%   | 29.5%  54.5%   16.0%

Over GroundHog (GRU)  We then investigated the effect of the context gates on a standard NMT with GRU as the decoding activation function (Rows 4-7). Several observations can be made. First, context gates also boost performance beyond the GRU in all cases, demonstrating our claim that context gates are complementary to the reset and update gates in GRU. Second, jointly controlling the information from both translation contexts consistently outperforms its single-side counterparts, indicating that a direct interaction between input signals from the source and target contexts is useful for NMT models.

Over GroundHog-Coverage (GRU)  We finally tested on a stronger baseline, which employs a coverage mechanism to indicate whether or not a source word has already been translated (Tu et al., 2016). Our context gate still achieves a significant improvement of 1.6 BLEU points on average, reconfirming our claim that the context gate is complementary to the improved attention model that produces a better source context representation. Finally, our best model (Row 9) outperforms the SMT baseline system using the same data (Row 1) by 3.3 BLEU points.

From here on, we refer to "GroundHog" for "GroundHog (GRU)", and "Context Gate" for "Context Gate (both)" if not otherwise stated.

Subjective Evaluation  We also conducted a subjective evaluation of the benefit of incorporating context gates. Two human evaluators were asked to compare the translations of 200 source sentences randomly sampled from the test sets without knowing which system produced each translation. Table 3 shows the results of the subjective evaluation. The two human evaluators made similar judgments: in adequacy, around 30% of GroundHog translations are worse, 52% are equal, and 18% are better; while in fluency, around 29% are worse, 52% are equal, and 19% are better.

Table 4: Evaluation of alignment quality. The lower the score, the better the alignment quality.

System              | SAER  | AER
GroundHog           | 67.00 | 54.67
+ Context Gate      | 67.43 | 55.52
GroundHog-Coverage  | 64.25 | 50.50
+ Context Gate      | 63.80 | 49.40

5.3 Alignment Quality

Table 4 lists the alignment performances. Following Tu et al. (2016), we used the alignment error rate (AER) (Och and Ney, 2003) and its variant SAER to measure the alignment quality:

SAER = 1 − (|M_A × M_S| + |M_A × M_P|) / (|M_A| + |M_S|)

where A is a candidate alignment, and S and P are the sets of sure and possible links in the reference alignment respectively (S ⊆ P). M denotes the alignment matrix, and for both M_S and M_P we assign the elements that correspond to the existing links in S and P probability 1 and the other elements probability 0. In this way, we are able to better evaluate the quality of the soft alignments produced by attention-based NMT.

We find that context gates do not improve alignment quality when used alone. When combined with the coverage mechanism, however, they produce better alignments, especially one-to-one alignments obtained by selecting the source word with the highest alignment probability per target word (i.e., the AER score). One possible reason is that better estimated decoding states (from the context gate) and coverage information help to produce more concentrated alignments, as shown in Figure 6.
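The following is a minimal sketch of how SAER could be computed from these matrices, assuming × denotes element-wise multiplication and |·| sums the entries of a matrix; this is our reading of the formula, not the authors' evaluation script:

```python
import numpy as np

def saer(M_A, M_S, M_P):
    """SAER = 1 - (|M_A x M_S| + |M_A x M_P|) / (|M_A| + |M_S|)."""
    # Element-wise products, entries summed (assumed reading of x and |.|)
    num = (M_A * M_S).sum() + (M_A * M_P).sum()
    den = M_A.sum() + M_S.sum()
    return 1.0 - num / den

# Dummy 2x2 soft alignment and reference link matrices (illustrative values only)
M_A = np.array([[0.9, 0.1], [0.2, 0.8]])
M_S = np.array([[1.0, 0.0], [0.0, 1.0]])
M_P = np.array([[1.0, 0.0], [0.0, 1.0]])
print(saer(M_A, M_S, M_P))  # 0.15
```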
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
mi
d
tu
/
t
a
C
yo
/
yo
a
r
t
i
C
mi
–
pag
d
F
/
d
oh
i
/
.
1
0
1
1
6
2
/
t
yo
a
C
_
a
_
0
0
0
4
8
1
5
6
7
4
4
4
/
/
t
yo
a
C
_
a
_
0
0
0
4
8
pag
d
.
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
9
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
96
Figure 6: Example alignments: (a) GroundHog-Coverage (SAER = 50.80), (b) + Context Gate (SAER = 47.35). Incorporating the context gate produces more concentrated alignments.

Table 5: Analysis of the model architectures measured in BLEU scores. "Gating Scalar" denotes the model proposed by Xu et al. (2015) in the image caption generation task, which looks at only the previous decoding state t_{i−1} and scales the whole source context s_i at the vector level. To investigate the effect of each component, we list the results of context gate variants with different inputs (e.g., the previously generated word y_{i−1}). "*" indicates a statistically significant difference (p < 0.01) from "GroundHog".

# | System                      | Gate Inputs              | MT05   | MT06   | MT08   | Ave.
1 | GroundHog                   | –                        | 30.61  | 31.12  | 23.23  | 28.32
2 | 1 + Gating Scalar           | t_{i−1}                  | 31.62* | 31.48  | 23.85  | 28.98
3 | 1 + Context Gate (source)   | t_{i−1}                  | 31.69* | 31.63  | 24.25* | 29.19
4 | 1 + Context Gate (both)     | t_{i−1}                  | 32.15* | 32.05* | 24.39* | 29.53
5 |                             | t_{i−1}, s_i             | 31.81* | 32.75* | 25.66* | 30.07
6 |                             | t_{i−1}, s_i, y_{i−1}    | 33.52* | 33.46* | 24.85* | 30.61

5.4 Architecture Analysis

Table 5 shows a detailed analysis of architecture components measured in BLEU scores. Several observations can be made:

• Operation Granularity (Rows 2 and 3): Element-wise multiplication (i.e., Context Gate (source)) outperforms the vector-level scalar (i.e., Gating Scalar), indicating that precise control of each element in the context vector boosts translation performance.

• Gate Strategy (Rows 3 and 4): When only fed with the previous decoding state t_{i−1}, Context Gate (both) consistently outperforms Context Gate (source), showing that jointly controlling information from both the source and target sides is important for judging the importance of the contexts.

• Peephole connections (Rows 4 and 5): Peepholes, by which the source context s_i controls the gate, play an important role in the context gate, improving the performance by 0.57 BLEU points.

• Previously generated word (Rows 5 and 6): The previously generated word y_{i−1} provides a more explicit signal for the gate to judge the importance of the contexts, leading to a further improvement in translation performance.

5.5 Effects on Long Sentences

We follow Bahdanau et al. (2015) and group sentences of similar lengths together.
Figure 7: Performance of translations on the test set with respect to the lengths of the source sentences. The context gate improves performance by alleviating inadequate translations of long sentences.

Figure 7 shows the BLEU score and the averaged length of translations for each group. GroundHog performs very well on short source sentences, but degrades on long source sentences (i.e., > 30), which may be due to the fact that the source context is not fully interpreted. Context gates can alleviate this problem by balancing the source and target contexts, and thus improve decoder performance on long sentences. In fact, incorporating context gates boosts translation performance on all source sentence groups.

We confirm that the context gate weight z_i correlates well with translation performance. In other words, translations that contain higher z_i (i.e., the source context contributes more than the target context) at many time steps are better in translation performance. We used the mean of the sequence z_1, …, z_i, …, z_I as the gate weight of each sentence. We calculated the Pearson correlation between the sentence-level gate weight and the corresponding improvement in translation performance (i.e., BLEU, adequacy, and fluency scores), as shown in Table 6. (We use the average of correlations on the subjective evaluation metrics (i.e., adequacy and fluency) by the two evaluators.) We observed that the context gate weight is positively correlated with the translation performance improvement and that the correlation is higher on long sentences.

Table 6: Correlation between context gate weight and improvement of translation performance. "Length" denotes the length of the source sentence. "BLEU", "Adequacy", and "Fluency" denote different metrics measuring the translation performance improvement of using context gates.

Length | BLEU  | Adequacy | Fluency
< 30   | 0.024 | 0.071    | 0.040
> 30   | 0.076 | 0.121    | 0.168

As an example, consider this source sentence from the test set:

zhōuliù zhèngshì yīngguó mínzhòng dào chāoshì cǎigòu de gāofēng shíkè, dāngshí 14 jiā chāoshì de guānbì lìng yīngguó zhè jiā zuì dà de liánsuǒ chāoshì sǔnshī shùbǎi wàn yīngbàng de xiāoshòu shōurù.

GroundHog translates it into:

twenty-six london supermarkets were closed at a peak hour of the british population in the same period of time.

which almost misses all the information of the source sentence. Integrating context gates improves the translation adequacy:
this is exactly the peak days British people buying the supermarket. the closure of the 14 supermarkets of the 14 supermarkets that the largest chain supermarket in england lost several million pounds of sales income.

The coverage mechanism further improves the translation by rectifying over-translation (e.g., "of the 14 supermarkets") and under-translation (e.g., "saturday" and "at that time"):

saturday is the peak season of british people's purchases of the supermarket. at that time, the closure of 14 supermarkets made the biggest supermarket of britain lose millions of pounds of sales income.

6 Conclusion

We find that source and target contexts in NMT are highly correlated to translation adequacy and fluency, respectively. Based on this observation, we propose using context gates in NMT to dynamically control the contributions from the source and target contexts in the generation of a target sentence, to enhance the adequacy of NMT. By providing NMT with the ability to choose the appropriate amount of information from the source and target contexts, one can alleviate many translation problems from which NMT suffers. Experimental results show that NMT with context gates achieves consistent and significant improvements in translation quality over different NMT models.

Context gates are in principle applicable to all sequence-to-sequence learning tasks in which information from the source sequence is transformed to the target sequence (corresponding to adequacy) and the target sequence is generated (corresponding to fluency). In the future, we will investigate the effectiveness of context gates in other tasks, such as dialogue and summarization. It is also necessary to validate the effectiveness of our approach on more language pairs and other NMT architectures (e.g., using LSTM as well as GRU, or multiple layers).

Acknowledgement

This work is supported by China National 973 project 2014CB340301. Yang Liu is supported by the National Natural Science Foundation of China (No. 61522204) and the 863 Program (2015AA015407). We thank action editor Chris Quirk and three anonymous reviewers for their insightful comments.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR 2015.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP 2014.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv.

Michael Collins, Philipp Koehn, and Ivona Kučerová. 2005. Clause restructuring for statistical machine translation. In ACL 2005.

Felix A. Gers and Jürgen Schmidhuber. 2000. Recurrent nets that time and count. In IJCNN 2000. IEEE.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation.

Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In ACL 2015.

Yangfeng Ji, Trevor Cohn, Lingpeng Kong, Chris Dyer, and Jacob Eisenstein. 2015. Document context language models. In ICLR 2015.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In EMNLP 2013.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: open source toolkit for statistical machine translation. In ACL 2007.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In EMNLP 2015.

Tomas Mikolov and Geoffrey Zweig. 2012. Context dependent recurrent neural network language model. In SLT 2012.
Franz J. Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL 2002.

Matthew Snover, Nitin Madnani, Bonnie J. Dorr, and Richard Schwartz. 2009. Fluency, adequacy, or HTER?: exploring different human judgments with a tunable MT metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 259–268.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS 2014.

Kristina Toutanova, H. Tolga Ilhan, and Christopher D. Manning. 2002. Extensions to HMM-based statistical word alignment models. In EMNLP 2002.

Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. In ACL 2016.

Tian Wang and Kyunghyun Cho. 2016. Larger-context language modelling with recurrent neural network. In ACL 2016.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML 2015.