Transactions of the Association for Computational Linguistics, vol. 5, pp. 365–378, 2017. Action Editor: Adam Lopez.
Submission batch: 11/2016; Revision batch: 2/2017; Published 10/2017.
© 2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Fully Character-Level Neural Machine Translation without Explicit Segmentation

Jason Lee* (ETH Zürich), jasonlee@inf.ethz.ch
Kyunghyun Cho (New York University), kyunghyun.cho@nyu.edu
Thomas Hofmann (ETH Zürich), thomas.hofmann@inf.ethz.ch

* The majority of this work was completed while the author was visiting New York University.

Abstract

Most existing machine translation systems operate at the level of words, relying on explicit segmentation to extract tokens. We introduce a neural machine translation (NMT) model that maps a source character sequence to a target character sequence without any segmentation. We employ a character-level convolutional network with max-pooling at the encoder to reduce the length of source representation, allowing the model to be trained at a speed comparable to subword-level models while capturing local regularities. Our character-to-character model outperforms a recently proposed baseline with a subword-level encoder on WMT'15 DE-EN and CS-EN, and gives comparable performance on FI-EN and RU-EN. We then demonstrate that it is possible to share a single character-level encoder across multiple languages by training a model on a many-to-one translation task. In this multilingual setting, the character-level encoder significantly outperforms the subword-level encoder on all the language pairs. We observe that on CS-EN, FI-EN and RU-EN, the quality of the multilingual character-level translation even surpasses the models specifically trained on that language pair alone, both in terms of the BLEU score and human judgment.

1 Introduction

Nearly all previous work in machine translation has been at the level of words. Aside from our intuitive understanding of the word as a basic unit of meaning (Jackendoff, 1992), one reason behind this is that sequences are significantly longer when represented in characters, compounding the problem of data sparsity and modeling long-range dependencies. This has driven NMT research to be almost exclusively word-level (Bahdanau et al., 2015; Sutskever et al., 2014).

Despite their remarkable success, word-level NMT models suffer from several major weaknesses. For one, they are unable to model rare, out-of-vocabulary words, making them limited in translating languages with rich morphology such as Czech, Finnish and Turkish. If one uses a large vocabulary to combat this (Jean et al., 2015), the complexity of training and decoding grows linearly with respect to the target vocabulary size, leading to a vicious cycle.

To address this, we present a fully character-level NMT model that maps a character sequence in a source language to a character sequence in a target language. We show that our model outperforms a baseline with a subword-level encoder on DE-EN and CS-EN, and achieves a comparable result on FI-EN and RU-EN. A purely character-level NMT model with a basic encoder was proposed as a baseline by Luong and Manning (2016), but training it was prohibitively slow. We were able to train our model at a reasonable speed by drastically reducing the length of the source sentence representation using a stack of convolutional, pooling and highway layers.
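To make the length-reduction idea concrete, below is a minimal PyTorch sketch of such a front-end: character embeddings pass through a 1-D convolution, max-pooling over time shrinks the sequence by the pooling stride, and a highway layer mixes the pooled features. All names and sizes here (CharConvEncoderFrontEnd, embed_dim, pool_stride, and so on) are illustrative assumptions, not the paper's exact hyperparameters; the actual model stacks several convolutions and highway layers.

```python
# A minimal sketch of a character-level, length-reducing encoder front-end.
# Sizes are illustrative assumptions, not the paper's configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharConvEncoderFrontEnd(nn.Module):
    def __init__(self, vocab_size=300, embed_dim=128,
                 num_filters=256, kernel_width=5, pool_stride=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Convolution over characters captures local regularities
        # (morphemes, frequent character n-grams).
        self.conv = nn.Conv1d(embed_dim, num_filters,
                              kernel_width, padding=kernel_width // 2)
        # Max-pooling with stride s shrinks the sequence length by a
        # factor of s, which is what makes character-level training
        # tractable compared to running an RNN over raw characters.
        self.pool_stride = pool_stride
        # A single highway layer (the paper stacks several).
        self.h = nn.Linear(num_filters, num_filters)
        self.t = nn.Linear(num_filters, num_filters)

    def forward(self, char_ids):                     # (batch, T_chars)
        x = self.embed(char_ids)                     # (batch, T_chars, embed_dim)
        x = F.relu(self.conv(x.transpose(1, 2)))    # (batch, filters, T_chars)
        x = F.max_pool1d(x, self.pool_stride)        # (batch, filters, T_chars/s)
        x = x.transpose(1, 2)                        # (batch, T', filters)
        gate = torch.sigmoid(self.t(x))              # highway: gated mix of
        return gate * F.relu(self.h(x)) + (1 - gate) * x  # transform and carry

enc = CharConvEncoderFrontEnd()
out = enc(torch.randint(0, 300, (2, 100)))   # 100 characters in
print(out.shape)                             # torch.Size([2, 20, 256]): 20 positions out
```

The final call shows the payoff: a 100-character input is compressed to 20 encoder positions, so the recurrent layers that follow operate on a far shorter sequence.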
One advantage of character-level models is that they are better suited for multilingual translation than their word-level counterparts, which require a separate word vocabulary for each language. We verify this by training a single model to translate four languages (German, Czech, Finnish and Russian) to English. Our multilingual character-level model outperforms the subword-level baseline by a considerable margin in all four language pairs, strongly indicating that a character-level model is more flexible in assigning its capacity to different language pairs. Furthermore, we observe that our multilingual character-level translation even exceeds the quality of bilingual translation in three out of four language pairs, both in the BLEU score metric and human evaluation. This demonstrates excellent parameter efficiency of character-level translation in a multilingual setting. We also showcase our model's ability to handle intra-sentence code-switching while performing language identification on the fly.

The contributions of this work are twofold: we empirically show that (1) we can train character-to-character NMT models without any explicit segmentation; and (2) we can share a single character-level encoder across multiple languages to build a multilingual translation system without increasing the model size.

2 Background: Attentional Neural Machine Translation

Neural machine translation (NMT) is a recently proposed approach to machine translation that builds a single neural network which takes as input a source sentence $X = (x_1, \ldots, x_{T_X})$ and generates its translation $Y = (y_1, \ldots, y_{T_Y})$, where $x_t$ and $y_{t'}$ are source and target symbols (Bahdanau et al., 2015; Sutskever et al., 2014; Luong et al., 2015; Cho et al., 2014a). Attentional NMT models have three components: an encoder, a decoder and an attention mechanism.

Encoder. Given a source sentence $X$, the encoder constructs a continuous representation that summarizes its meaning with a recurrent neural network (RNN). A bidirectional RNN is often implemented as proposed in Bahdanau et al. (2015). A forward encoder reads the input sentence from left to right: $\overrightarrow{h}_t = \overrightarrow{f}_{\mathrm{enc}}\big(E_x(x_t), \overrightarrow{h}_{t-1}\big)$. Similarly, a backward encoder reads it from right to left: $\overleftarrow{h}_t = \overleftarrow{f}_{\mathrm{enc}}\big(E_x(x_t), \overleftarrow{h}_{t+1}\big)$, where $E_x$ is the source embedding lookup table, and $\overrightarrow{f}_{\mathrm{enc}}$ and $\overleftarrow{f}_{\mathrm{enc}}$ are recurrent activation functions such as long short-term memory units (LSTMs) (Hochreiter and Schmidhuber, 1997) or gated recurrent units (GRUs) (Cho et al., 2014b). The encoder constructs a set of continuous source sentence representations $C$ by concatenating the forward and backward hidden states at each time step: $C = \{h_1, \ldots, h_{T_X}\}$, where $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$.

Attention. First introduced in Bahdanau et al. (2015), the attention mechanism lets the decoder attend more to different source symbols for each target symbol. More concretely, it computes the context vector $c_{t'}$ at each decoding time step $t'$ as a weighted sum of the source hidden states: $c_{t'} = \sum_{t=1}^{T_X} \alpha_{t't} h_t$. Similarly to Chung et al. (2016) and Firat et al. (2016a), each attentional weight $\alpha_{t't}$ represents how relevant the $t$-th source token $x_t$ is to the $t'$-th target token $y_{t'}$, and is computed as:

$$\alpha_{t't} = \frac{1}{Z} \exp\Big(\mathrm{score}\big(E_y(y_{t'-1}),\, s_{t'-1},\, h_t\big)\Big), \qquad (1)$$

where $Z = \sum_{k=1}^{T_X} \exp\big(\mathrm{score}(E_y(y_{t'-1}), s_{t'-1}, h_k)\big)$ is the normalization constant. $\mathrm{score}()$ is a feed-forward neural network with a single hidden layer that scores how well the source symbol $x_t$ and the target symbol $y_{t'}$ match. $E_y$ is the target embedding lookup table and $s_{t'}$ is the target hidden state at time $t'$.

Decoder. Given a source context vector $c_{t'}$, the decoder computes its hidden state at time $t'$ as: $s_{t'} = f_{\mathrm{dec}}\big(E_y(y_{t'-1}), s_{t'-1}, c_{t'}\big)$. Then, a parametric function $\mathrm{out}_k()$ returns the conditional probability of the next target symbol being $k$: $p(y_{t'} = k \mid y_{<t'}, X)$
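To ground these definitions, here is a minimal PyTorch sketch of one attentional decoding step: a bidirectional GRU produces $C$, a single-hidden-layer score() network yields the weights of Eq. (1), and a GRU cell computes $s_{t'}$. All dimensions are illustrative assumptions, and for brevity the output layer here conditions only on the new hidden state $s_{t'}$, a simplification of the full model.

```python
# A minimal sketch of encoder, attention (Eq. 1) and one decoder step.
# Dimensions are illustrative assumptions; the real model is larger
# and trained end to end.
import torch
import torch.nn as nn

emb, hid, vocab = 64, 128, 1000
E_x = nn.Embedding(vocab, emb)          # source embedding lookup table
E_y = nn.Embedding(vocab, emb)          # target embedding lookup table
f_enc = nn.GRU(emb, hid, bidirectional=True, batch_first=True)
f_dec = nn.GRUCell(emb + 2 * hid, hid)  # consumes [E_y(y_{t'-1}); c_{t'}]
score = nn.Sequential(                  # feed-forward net, one hidden layer
    nn.Linear(emb + hid + 2 * hid, hid), nn.Tanh(), nn.Linear(hid, 1))
out = nn.Linear(hid, vocab)             # scores each candidate next symbol k

x = torch.randint(0, vocab, (1, 12))    # a source sentence, T_X = 12
C, _ = f_enc(E_x(x))                    # C = {h_1,...,h_TX}, h_t = [fwd; bwd]

y_prev = torch.tensor([0])              # previously generated target symbol
s_prev = torch.zeros(1, hid)            # previous decoder hidden state

# Attention: alpha_{t't} = softmax_t(score(E_y(y_{t'-1}), s_{t'-1}, h_t));
# the softmax supplies the 1/Z normalization of Eq. (1).
e_y = E_y(y_prev)                                        # (1, emb)
inputs = torch.cat([e_y.expand(12, -1),                  # one row per h_t
                    s_prev.expand(12, -1),
                    C.squeeze(0)], dim=-1)
alpha = torch.softmax(score(inputs).squeeze(-1), dim=0)  # (T_X,)
c = alpha @ C.squeeze(0)                                 # context vector c_{t'}

# Decoder step: s_{t'} = f_dec(E_y(y_{t'-1}), s_{t'-1}, c_{t'})
s = f_dec(torch.cat([e_y, c.unsqueeze(0)], dim=-1), s_prev)
p_next = torch.softmax(out(s), dim=-1)  # p(y_{t'} = k | y_{<t'}, X)
print(p_next.shape)                     # torch.Size([1, 1000])
```

At generation time this step is repeated, feeding each sampled (or argmax) symbol back in as y_prev until an end-of-sequence symbol is produced.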