Transactions of the Association for Computational Linguistics, vol. 5, pp. 87–99, 2017. Action Editor: Chris Quirk.

Submission batch: 6/2016; Revision batch: 10/2016; Published 3/2017.

© 2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Context Gates for Neural Machine Translation

Zhaopeng Tu†  Yang Liu‡  Zhengdong Lu†  Xiaohua Liu†  Hang Li†
†Noah's Ark Lab, Huawei Technologies, Hong Kong
{tu.zhaopeng,lu.zhengdong,liuxiaohua3,hangli.hl}@huawei.com
‡Department of Computer Science and Technology, Tsinghua University, Beijing
liuyang2011@tsinghua.edu.cn

Abstract

In neural machine translation (NMT), generation of a target word depends on both source and target contexts. We find that source contexts have a direct impact on the adequacy of a translation while target contexts affect the fluency. Intuitively, generation of a content word should rely more on the source context and generation of a functional word should rely more on the target context. Due to the lack of effective control over the influence from source and target contexts, conventional NMT tends to yield fluent but inadequate translations. To address this problem, we propose context gates which dynamically control the ratios at which source and target contexts contribute to the generation of target words. In this way, we can enhance both the adequacy and fluency of NMT with more careful control of the information flow from contexts. Experiments show that our approach significantly improves upon a standard attention-based NMT system by +2.3 BLEU points.

input   jīnnián qián liǎng yuè guǎngdōng gāoxīn jìshù chǎnpǐn chūkǒu 37.6 yì měiyuán
NMT     in the first two months of this year, the export of new high level technology product was UNK-billion us dollars
⇓src    china's guangdong hi-tech exports hit 58 billion dollars
⇓tgt    china's export of high and new hi-tech exports of the export of the export of the export of the export of the export of the export of the export of the export of ···

Table 1: Source and target contexts are highly correlated to translation adequacy and fluency, respectively. ⇓src and ⇓tgt denote halving the contributions from the source and target contexts when generating the translation, respectively.

1 Introduction

Neural machine translation (NMT) (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015) has made significant progress in the past several years. Its goal is to construct and utilize a single large neural network to accomplish the entire translation task. One great advantage of NMT is that the translation system can be completely constructed by learning from data without human involvement (cf. feature engineering in statistical machine translation (SMT)). The encoder-decoder architecture is widely employed (Cho et al., 2014; Sutskever et al., 2014), in which the encoder summarizes the source sentence into a vector representation, and the decoder generates the target sentence word by word from the vector representation. The representation of the source sentence and the representation of the partially generated target sentence (translation) at each position are referred to as the source context and the target context, respectively. The generation of a target word is determined jointly by the source context and target context.
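As a minimal illustration of this idea (our sketch, not the paper's equations; all variable names and dimensions are assumptions), the snippet below shows a toy decoding step in which a source context vector and a target context vector jointly determine the distribution over the next target word.

    # Toy decoding step: source and target contexts jointly produce P(y_i | ...).
    # All names and dimensions are illustrative assumptions, not the paper's model.
    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, ctx_dim = 100, 8

    source_context = rng.normal(size=ctx_dim)   # summary of the source sentence
    target_context = rng.normal(size=ctx_dim)   # summary of the partial translation

    W_src = rng.normal(size=(vocab_size, ctx_dim))
    W_tgt = rng.normal(size=(vocab_size, ctx_dim))

    # Mix both contexts, then turn the scores into a word distribution (softmax).
    logits = W_src @ source_context + W_tgt @ target_context
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                        # P(y_i | source context, target context)
    next_word_id = int(probs.argmax())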

Several techniques in NMT have proven to be very effective, including gating (Hochreiter and Schmidhuber, 1997; Cho et al., 2014) and attention (Bahdanau et al., 2015), which can model long-distance dependencies and complicated alignment relations in the translation process. Using an encoder-decoder framework that incorporates gating and attention techniques, it has been reported that NMT can surpass traditional SMT as measured by BLEU score (Luong et al., 2015).

Despite this success, we observe that NMT usually yields fluent but inadequate translations.[1] We attribute this to a stronger influence of the target context on generation, which results from a stronger language model than that used in SMT. One question naturally arises: what will happen if we change the ratio of influences from the source or target contexts?

Table 1 shows an example in which an attention-based NMT system (Bahdanau et al., 2015) generates a fluent yet inadequate translation (e.g., missing the translation of "guǎngdōng"). When we halve the contribution from the source context, the result further loses its adequacy by missing the partial translation "in the first two months of this year". One possible explanation is that the target context takes a higher weight and thus the system favors a shorter translation. In contrast, when we halve the contribution from the target context, the result completely loses its fluency by repeatedly generating the translation of "chūkǒu" (i.e., "the export of") until the generated translation reaches the maximum length. This example therefore indicates that source and target contexts in NMT are highly correlated to translation adequacy and fluency, respectively.

In fact, conventional NMT lacks effective control over the influence of source and target contexts. At each decoding step, NMT treats the source and target contexts equally, and thus ignores the different needs of the contexts. For example, content words in the target sentence are more related to translation adequacy, and thus should depend more on the source context. In contrast, function words in the target sentence are often more related to translation fluency (e.g., "of" after "is fond"), and thus should depend more on the target context.

In this work, we propose to use context gates to control the contributions of source and target contexts to the generation of target words (decoding) in NMT. Context gates are non-linear gating units which can dynamically select the amount of context information used in the decoding process. Specifically, at each decoding step, the context gate examines both the source and target contexts, and outputs a ratio between zero and one to determine the percentages of information to utilize from the two contexts. In this way, the system can balance the adequacy and fluency of the translation with regard to the generation of a word at each position.

Experimental results show that introducing context gates leads to an average improvement of +2.3 BLEU points over a standard attention-based NMT system (Bahdanau et al., 2015). An interesting finding is that we can replace the GRU units in the decoder with conventional RNN units and at the same time utilize context gates. The translation performance is comparable with that of the standard NMT system with GRU, but the system enjoys a simpler structure (i.e., it uses only a single gate and half of the parameters) and faster decoding (i.e., it requires only half of the matrix computations for decoding).

[1] Fluency measures whether the translation is fluent, while adequacy measures whether the translation is faithful to the original sentence (Snover et al., 2009).

Figure 1: Architecture of decoder RNN.
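The sketch below is only a hedged illustration of this behavior, not the paper's exact parameterization: it assumes a sigmoid gate computed from the previous decoding state t_{i-1}, the previously generated word y_{i-1}, and the source context s_i, which element-wise weights the source context by z_i and the target-side information by (1 - z_i). The weight names (W_z, U_z, C_z), the dimensions, and the way the gated terms are combined are assumptions made for illustration.

    # Hedged sketch of a context gate: z_i in (0, 1)^dim weights the source
    # context, (1 - z_i) weights the target-side information. Illustrative only.
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(1)
    dim = 8
    W_z = rng.normal(size=(dim, dim))  # acts on the previous decoder state t_{i-1}
    U_z = rng.normal(size=(dim, dim))  # acts on the previous word embedding y_{i-1}
    C_z = rng.normal(size=(dim, dim))  # acts on the source context s_i

    def context_gate_step(t_prev, y_prev, s_i):
        # The gate looks at both contexts and outputs a per-element ratio.
        z_i = sigmoid(W_z @ t_prev + U_z @ y_prev + C_z @ s_i)
        gated_source = z_i * s_i
        gated_target = (1.0 - z_i) * (t_prev + y_prev)  # toy target-side summary
        return gated_source + gated_target

    t_prev, y_prev, s_i = rng.normal(size=dim), rng.normal(size=dim), rng.normal(size=dim)
    mixed_context = context_gate_step(t_prev, y_prev, s_i)

In a full decoder, the gated source and target information would typically feed the recurrent state update; the direct sum here only keeps the sketch self-contained.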
2 Neural Machine Translation

Suppose that x = x_1, ..., x_j, ..., x_J represents a source sentence and y = y_1, ..., y_i, ..., y_I a target sentence. NMT directly models the probability of translating the source sentence into the target sentence word by word:

P(y|x) = ∏_{i=1}^{I} P(y_i | y_{<i}, x)

               Adequacy                    Fluency
               <        =        >         <        =        >
evaluator 1    30.0%    54.0%    16.0%     28.5%    48.5%    23.0%
evaluator 2    30.0%    50.0%    20.0%     29.5%    54.5%    16.0%

Table 3: Subjective evaluation of translation adequacy and fluency.

Over GroundHog (GRU)  We then investigated the effect of the context gates on a standard NMT system with GRU as the decoding activation function (Rows 4–7). Several observations can be made. First, context gates also boost performance beyond the GRU in all cases, demonstrating our claim that context gates are complementary to the reset and update gates in GRU. Second, jointly controlling the information from both translation contexts consistently outperforms its single-side counterparts, indicating that a direct interaction between input signals from the source and target contexts is useful for NMT models.

Over GroundHog-Coverage (GRU)  We finally tested on a stronger baseline, which employs a coverage mechanism to indicate whether or not a source word has already been translated (Tu et al., 2016). Our context gate still achieves a significant improvement of 1.6 BLEU points on average, reconfirming our claim that the context gate is complementary to the improved attention model that produces a better source context representation. Finally, our best model (Row 7) outperforms the SMT baseline system using the same data (Row 1) by 3.3 BLEU points. From here on, we refer to "GroundHog (GRU)" as "GroundHog" and "Context Gate (both)" as "Context Gate" if not otherwise stated.

Subjective Evaluation  We also conducted a subjective evaluation of the benefit of incorporating context gates. Two human evaluators were asked to compare the translations of 200 source sentences randomly sampled from the test sets without knowing which system produced each translation. Table 3 shows the results of the subjective evaluation. The two human evaluators made similar judgments: in adequacy, around 30% of GroundHog translations are worse, 52% are equal, and 18% are better; in fluency, around 29% are worse, 52% are equal, and 19% are better.

System                 SAER      AER
GroundHog              67.00     54.67
  + Context Gate       67.43     55.52
GroundHog-Coverage     64.25     50.50
  + Context Gate       63.80     49.40

Table 4: Evaluation of alignment quality. The lower the score, the better the alignment quality.

5.3 Alignment Quality

Table 4 lists the alignment performance. Following Tu et al. (2016), we used the alignment error rate (AER) (Och and Ney, 2003) and its variant SAER to measure alignment quality:

SAER = 1 − (|M_A × M_S| + |M_A × M_P|) / (|M_A| + |M_S|)

where A is a candidate alignment, and S and P are the sets of sure and possible links in the reference alignment, respectively (S ⊆ P). M denotes the alignment matrix, and for both M_S and M_P we assign the elements that correspond to the existing links in S and P probability 1 and the other elements probability 0. In this way, we are able to better evaluate the quality of the soft alignments produced by attention-based NMT.

We find that context gates do not improve alignment quality when used alone. When combined with the coverage mechanism, however, they produce better alignments, especially one-to-one alignments obtained by selecting the source word with the highest alignment probability per target word (i.e., the AER score). One possible reason is that better estimated decoding states (from the context gate) and coverage information help to produce more concentrated alignments, as shown in Figure 6.
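As one concrete reading of the SAER formula, the sketch below assumes that "×" denotes element-wise multiplication, that |·| sums all matrix entries, and that M_A is the soft alignment (attention) matrix while M_S and M_P are 0/1 matrices of the sure and possible reference links; the classical AER over hard alignments is included for comparison. The toy matrices are illustrative only.

    # SAER over soft alignments and AER over hard alignments, under the reading
    # of the formula stated above. Toy data, for illustration only.
    import numpy as np

    def saer(M_A, M_S, M_P):
        overlap = (M_A * M_S).sum() + (M_A * M_P).sum()
        return 1.0 - overlap / (M_A.sum() + M_S.sum())

    def aer(A, S, P):
        # Classical AER (Och and Ney, 2003) over sets of (source, target) links.
        return 1.0 - (len(A & S) + len(A & P)) / (len(A) + len(S))

    # Toy 2x2 example: one sure link (0, 0) and one additional possible link (1, 1).
    M_S = np.array([[1.0, 0.0], [0.0, 0.0]])
    M_P = np.array([[1.0, 0.0], [0.0, 1.0]])
    M_A = np.array([[0.9, 0.1], [0.2, 0.8]])   # soft attention weights
    print(saer(M_A, M_S, M_P))
    print(aer({(0, 0), (1, 1)}, {(0, 0)}, {(0, 0), (1, 1)}))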

Figure 6: Example alignments. Incorporating the context gate produces more concentrated alignments. (a) GroundHog-Coverage (SAER = 50.80); (b) + Context Gate (SAER = 47.35).

#   System                      Gate Inputs              MT05     MT06     MT08     Ave.
1   GroundHog                   –                        30.61    31.12    23.23    28.32
2   1 + Gating Scalar           t_{i−1}                  31.62*   31.48    23.85    28.98
3   1 + Context Gate (source)   t_{i−1}                  31.69*   31.63    24.25*   29.19
4   1 + Context Gate (both)     t_{i−1}                  32.15*   32.05*   24.39*   29.53
5                               t_{i−1}, s_i             31.81*   32.75*   25.66*   30.07
6                               t_{i−1}, s_i, y_{i−1}    33.52*   33.46*   24.85*   30.61

Table 5: Analysis of the model architectures measured in BLEU scores. "Gating Scalar" denotes the model proposed by Xu et al. (2015) for the image caption generation task, which looks at only the previous decoding state t_{i−1} and scales the whole source context s_i at the vector level. To investigate the effect of each component, we list the results of context gate variants with different inputs (e.g., the previously generated word y_{i−1}). "*" indicates a statistically significant difference (p < 0.01) from "GroundHog".

5.4 Architecture Analysis

Table 5 shows a detailed analysis of the architecture components measured in BLEU scores. Several observations can be made:

• Operation Granularity (Rows 2 and 3): Element-wise multiplication (i.e., Context Gate (source)) outperforms the vector-level scalar (i.e., Gating Scalar), indicating that precise control of each element in the context vector boosts translation performance.

• Gate Strategy (Rows 3 and 4): When fed with only the previous decoding state t_{i−1}, Context Gate (both) consistently outperforms Context Gate (source), showing that jointly controlling information from both the source and target sides is important for judging the importance of the contexts.

• Peephole connections (Rows 4 and 5): Peepholes, by which the source context s_i controls the gate, play an important role in the context gate, improving performance by 0.57 BLEU points.

• Previously generated word (Rows 5 and 6): The previously generated word y_{i−1} provides a more explicit signal for the gate to judge the importance of the contexts, leading to a further improvement in translation performance.

5.5 Effects on Long Sentences
Figure 7: Performance of translations on the test set with respect to the lengths of the source sentences. The context gate improves performance by alleviating inadequate translations of long sentences.

We follow Bahdanau et al. (2015) and group sentences of similar lengths together. Figure 7 shows the BLEU score and the averaged length of translations for each group. GroundHog performs very well on short source sentences, but degrades on long source sentences (i.e., > 30 words), which may be due to the fact that the source context is not fully interpreted. Context gates can alleviate this problem by balancing the source and target contexts, and thus improve decoder performance on long sentences. In fact, incorporating context gates boosts translation performance on all source sentence groups.

We confirm that the context gate weight z_i correlates well with translation performance. In other words, translations that contain higher z_i (i.e., the source context contributes more than the target context) at many time steps are better in translation performance. We used the mean of the sequence z_1, ..., z_i, ..., z_I as the gate weight of each sentence. We calculated the Pearson correlation between the sentence-level gate weight and the corresponding improvement in translation performance (i.e., BLEU, adequacy, and fluency scores),[9] as shown in Table 6. We observed that the context gate weight is positively correlated with translation performance improvement and that the correlation is higher on long sentences.

[9] We use the average of correlations on the subjective evaluation metrics (i.e., adequacy and fluency) by the two evaluators.

Length    BLEU     Adequacy    Fluency
< 30      0.024    0.071       0.040
> 30      0.076    0.121       0.168

Table 6: Correlation between context gate weight and improvement of translation performance. "Length" denotes the length of the source sentence. "BLEU", "Adequacy", and "Fluency" denote different metrics measuring the translation performance improvement from using context gates.
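The sentence-level statistic described above is straightforward to compute. The sketch below uses synthetic data for illustration: the gate weight of a sentence is the mean of its per-step gate activations, and numpy's corrcoef gives the Pearson correlation with a per-sentence improvement score (the data and variable names here are placeholders, not the paper's measurements).

    # Mean-of-z_i gate weight per sentence, correlated with a per-sentence gain.
    # Synthetic data; illustrative only.
    import numpy as np

    rng = np.random.default_rng(2)

    def sentence_gate_weight(z_per_step):
        # z_per_step: per-step gate values (e.g., the mean of the gate vector z_i
        # at each decoding step i); the sentence weight is their mean.
        return float(np.mean(z_per_step))

    # Hypothetical corpus: 100 sentences, each with ~20 decoding steps.
    gate_weights = np.array(
        [sentence_gate_weight(rng.uniform(0.0, 1.0, size=20)) for _ in range(100)]
    )
    bleu_gains = rng.normal(size=100)  # placeholder per-sentence BLEU improvements

    pearson_r = np.corrcoef(gate_weights, bleu_gains)[0, 1]
    print(f"Pearson correlation: {pearson_r:.3f}")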

As an example, consider this source sentence from the test set:

  zhōuliù zhèngshì yīngguó mínzhòng dào chāoshì cǎigòu de gāofēng shíkè, dāngshí 14 jiā chāoshì de guānbì lìng yīngguó zhè jiā zuìdà de liánsuǒ chāoshì sǔnshī shùbǎiwàn yīngbàng de xiāoshòu shōurù.

GroundHog translates it into:

  twenty-six london supermarkets were closed at a peak hour of the british population in the same period of time.

which misses almost all the information of the source sentence. Integrating context gates improves the translation adequacy:

  this is exactly the peak days British people buying the supermarket. the closure of the 14 supermarkets of the 14 supermarkets that the largest chain supermarket in england lost several million pounds of sales income.

The coverage mechanism further improves the translation by rectifying over-translation (e.g., "of the 14 supermarkets") and under-translation (e.g., "saturday" and "at that time"):

  saturday is the peak season of british people's purchases of the supermarket. at that time, the closure of 14 supermarkets made the biggest supermarket of britain lose millions of pounds of sales income.

6 Conclusion

We find that source and target contexts in NMT are highly correlated to translation adequacy and fluency, respectively. Based on this observation, we propose using context gates in NMT to dynamically control the contributions from the source and target contexts in the generation of a target sentence, to enhance the adequacy of NMT. By providing NMT with the ability to choose the appropriate amount of information from the source and target contexts, one can alleviate many translation problems from which NMT suffers. Experimental results show that NMT with context gates achieves consistent and significant improvements in translation quality over different NMT models.

Context gates are in principle applicable to all sequence-to-sequence learning tasks in which information from the source sequence is transformed to the target sequence (corresponding to adequacy) and the target sequence is generated (corresponding to fluency). In the future, we will investigate the effectiveness of context gates on other tasks, such as dialogue and summarization. It is also necessary to validate the effectiveness of our approach on more language pairs and other NMT architectures (e.g., using LSTM as well as GRU, or multiple layers).

Acknowledgement

This work is supported by China National 973 project 2014CB340301. Yang Liu is supported by the National Natural Science Foundation of China (No. 61522204) and the 863 Program (2015AA015407). We thank action editor Chris Quirk and three anonymous reviewers for their insightful comments.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR 2015.

Peter E. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP 2014.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv.

Michael Collins, Philipp Koehn, and Ivona Kučerová. 2005. Clause restructuring for statistical machine translation. In ACL 2005.

Felix A. Gers and Jürgen Schmidhuber. 2000. Recurrent nets that time and count. In IJCNN 2000. IEEE.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation.

Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In ACL 2015.

Yangfeng Ji, Trevor Cohn, Lingpeng Kong, Chris Dyer, and Jacob Eisenstein. 2015. Document context language models. In ICLR 2015.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In EMNLP 2013.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In ACL 2007.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In EMNLP 2015.

Tomas Mikolov and Geoffrey Zweig. 2012. Context dependent recurrent neural network language model. In SLT 2012.

Franz J. Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In ACL 2002.

Matthew Snover, Nitin Madnani, Bonnie J. Dorr, and Richard Schwartz. 2009. Fluency, adequacy, or HTER? Exploring different human judgments with a tunable MT metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 259–268.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS 2014.

Kristina Toutanova, H. Tolga Ilhan, and Christopher D. Manning. 2002. Extensions to HMM-based statistical word alignment models. In EMNLP 2002.

Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. In ACL 2016.

Tian Wang and Kyunghyun Cho. 2016. Larger-context language modelling with recurrent neural network. In ACL 2016.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML 2015.
