Transactions of the Association for Computational Linguistics, vol. 5, pp. 365–378, 2017. Action Editor: Adam Lopez.
Submission batch: 11/2016; Revision batch: 2/2017; Published 10/2017.
© 2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Fully Character-Level Neural Machine Translation without Explicit Segmentation

Jason Lee∗ (ETH Zürich) jasonlee@inf.ethz.ch
Kyunghyun Cho (New York University) kyunghyun.cho@nyu.edu
Thomas Hofmann (ETH Zürich) thomas.hofmann@inf.ethz.ch

∗ The majority of this work was completed while the author was visiting New York University.

Abstract

Most existing machine translation systems operate at the level of words, relying on explicit segmentation to extract tokens. We introduce a neural machine translation (NMT) model that maps a source character sequence to a target character sequence without any segmentation. We employ a character-level convolutional network with max-pooling at the encoder to reduce the length of source representation, allowing the model to be trained at a speed comparable to subword-level models while capturing local regularities. Our character-to-character model outperforms a recently proposed baseline with a subword-level encoder on WMT'15 DE-EN and CS-EN, and gives comparable performance on FI-EN and RU-EN. We then demonstrate that it is possible to share a single character-level encoder across multiple languages by training a model on a many-to-one translation task. In this multilingual setting, the character-level encoder significantly outperforms the subword-level encoder on all the language pairs. We observe that on CS-EN, FI-EN and RU-EN, the quality of the multilingual character-level translation even surpasses the models specifically trained on that language pair alone, both in terms of the BLEU score and human judgment.

1 Introduction

Nearly all previous work in machine translation has been at the level of words. Aside from our intuitive understanding of word as a basic unit of meaning (Jackendoff, 1992), one reason behind this is that sequences are significantly longer when represented in characters, compounding the problem of data sparsity and modeling long-range dependencies. This has driven NMT research to be almost exclusively word-level (Bahdanau et al., 2015; Sutskever et al., 2014).

Despite their remarkable success, word-level NMT models suffer from several major weaknesses. For one, they are unable to model rare, out-of-vocabulary words, making them limited in translating languages with rich morphology such as Czech, Finnish and Turkish. If one uses a large vocabulary to combat this (Jean et al., 2015), the complexity of training and decoding grows linearly with respect to the target vocabulary size, leading to a vicious cycle.

To address this, we present a fully character-level NMT model that maps a character sequence in a source language to a character sequence in a target language. We show that our model outperforms a baseline with a subword-level encoder on DE-EN and CS-EN, and achieves a comparable result on FI-EN and RU-EN. A purely character-level NMT model with a basic encoder was proposed as a baseline by Luong and Manning (2016), but training it was prohibitively slow. We were able to train our model at a reasonable speed by drastically reducing the length of the source sentence representation using a stack of convolutional, pooling and highway layers.

One advantage of character-level models is that they are better suited for multilingual translation than their word-level counterparts, which require a separate word vocabulary for each language.
We verify this by training a single model to translate four languages (German, Czech, Finnish and Russian) to English. Our multilingual character-level model outperforms the subword-level baseline by a considerable margin in all four language pairs, strongly indicating that a character-level model is more flexible in assigning its capacity to different language pairs. Furthermore, we observe that our multilingual character-level translation even exceeds the quality of bilingual translation in three out of four language pairs, both in terms of BLEU score and human evaluation. This demonstrates the excellent parameter efficiency of character-level translation in a multilingual setting. We also showcase our model's ability to handle intra-sentence code-switching while performing language identification on the fly.

The contributions of this work are twofold: we empirically show that (1) we can train a character-to-character NMT model without any explicit segmentation; and (2) we can share a single character-level encoder across multiple languages to build a multilingual translation system without increasing the model size.

2 Background: Attentional Neural Machine Translation

Neural machine translation (NMT) is a recently proposed approach to machine translation that builds a single neural network which takes as input a source sentence $X = (x_1, \dots, x_{T_X})$ and generates its translation $Y = (y_1, \dots, y_{T_Y})$, where $x_t$ and $y_{t'}$ are source and target symbols (Bahdanau et al., 2015; Sutskever et al., 2014; Luong et al., 2015; Cho et al., 2014a). Attentional NMT models have three components: an encoder, a decoder and an attention mechanism.

Encoder: Given a source sentence $X$, the encoder constructs a continuous representation that summarizes its meaning with a recurrent neural network (RNN). A bidirectional RNN is often implemented as proposed in Bahdanau et al. (2015). A forward encoder reads the input sentence from left to right: $\overrightarrow{h}_t = \overrightarrow{f}_{\mathrm{enc}}\big(E_x(x_t), \overrightarrow{h}_{t-1}\big)$. Similarly, a backward encoder reads it from right to left: $\overleftarrow{h}_t = \overleftarrow{f}_{\mathrm{enc}}\big(E_x(x_t), \overleftarrow{h}_{t+1}\big)$, where $E_x$ is the source embedding lookup table, and $\overrightarrow{f}_{\mathrm{enc}}$ and $\overleftarrow{f}_{\mathrm{enc}}$ are recurrent activation functions such as long short-term memory units (LSTMs) (Hochreiter and Schmidhuber, 1997) or gated recurrent units (GRUs) (Cho et al., 2014b). The encoder constructs a set of continuous source sentence representations $C$ by concatenating the forward and backward hidden states at each time step: $C = \{h_1, \dots, h_{T_X}\}$, where $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$.

Attention: First introduced in Bahdanau et al. (2015), the attention mechanism lets the decoder attend more to different source symbols for each target symbol. More concretely, it computes the context vector $c_{t'}$ at each decoding time step $t'$ as a weighted sum of the source hidden states: $c_{t'} = \sum_{t=1}^{T_X} \alpha_{t't} h_t$. Similarly to Chung et al. (2016) and Firat et al. (2016a), each attentional weight $\alpha_{t't}$ represents how relevant the $t$-th source token $x_t$ is to the $t'$-th target token $y_{t'}$, and is computed as:

$$\alpha_{t't} = \frac{1}{Z} \exp\Big( \mathrm{score}\big(E_y(y_{t'-1}), s_{t'-1}, h_t\big) \Big), \qquad (1)$$

where $Z = \sum_{k=1}^{T_X} \exp\big(\mathrm{score}(E_y(y_{t'-1}), s_{t'-1}, h_k)\big)$ is the normalization constant. $\mathrm{score}(\cdot)$ is a feedforward neural network with a single hidden layer that scores how well the source symbol $x_t$ and the target symbol $y_{t'}$ match. $E_y$ is the target embedding lookup table and $s_{t'}$ is the target hidden state at time $t'$.

Decoder: Given a source context vector $c_{t'}$, the decoder computes its hidden state at time $t'$ as: $s_{t'} = f_{\mathrm{dec}}\big(E_y(y_{t'-1}), s_{t'-1}, c_{t'}\big)$. Then, a parametric function $\mathrm{out}_k(\cdot)$ returns the conditional probability of the next target symbol being $k$: $p(y_{t'} = k \mid y_{<t'}, X)$.
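To make the notation above concrete, the following is a minimal NumPy sketch of one step of such an attentional decoder: it builds the concatenated source representations $C$, computes the attentional weights of Eq. (1) with a single-hidden-layer score network, forms the context vector, and updates the decoder state. All dimensions, parameter names, and the plain tanh recurrence standing in for an LSTM/GRU are illustrative assumptions, not the configuration used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not the paper's actual hyperparameters).
T_x   = 6    # source length
d     = 8    # per-direction encoder hidden size
d_emb = 8    # target embedding size
d_dec = 16   # decoder hidden size
d_hid = 12   # hidden size of the score network

# Encoder output: C = {h_1, ..., h_Tx} with h_t = [h_fwd_t ; h_bwd_t], shape (T_x, 2d).
h_fwd = rng.standard_normal((T_x, d))
h_bwd = rng.standard_normal((T_x, d))
C = np.concatenate([h_fwd, h_bwd], axis=1)

# Illustrative parameters: score network (one hidden layer) and decoder recurrence.
W_score = 0.1 * rng.standard_normal((d_emb + d_dec + 2 * d, d_hid))
v_score = 0.1 * rng.standard_normal(d_hid)
W_dec   = 0.1 * rng.standard_normal((d_emb + d_dec + 2 * d, d_dec))

def score(e_prev, s_prev, h_t):
    """One-hidden-layer feedforward score(E_y(y_{t'-1}), s_{t'-1}, h_t) of Eq. (1)."""
    return np.tanh(np.concatenate([e_prev, s_prev, h_t]) @ W_score) @ v_score

def decode_step(e_prev, s_prev):
    """One attentional decoding step: weights alpha_{t't}, context c_{t'}, state s_{t'}."""
    logits = np.array([score(e_prev, s_prev, C[t]) for t in range(T_x)])
    alpha = np.exp(logits - logits.max())   # max-subtraction only for numerical stability
    alpha /= alpha.sum()                    # divide by the normalization constant Z
    c = alpha @ C                           # context vector: weighted sum of source states
    # s_{t'} = f_dec(E_y(y_{t'-1}), s_{t'-1}, c_{t'}); a plain tanh recurrence stands
    # in here for the GRU/LSTM used in practice.
    s = np.tanh(np.concatenate([e_prev, s_prev, c]) @ W_dec)
    return alpha, c, s

# Example: a single step with a random previous target embedding and a zero decoder state.
alpha, c, s = decode_step(rng.standard_normal(d_emb), np.zeros(d_dec))
print(alpha.sum())   # attention weights sum to 1
```

In an actual system the new state $s_{t'}$ would then be fed, together with $c_{t'}$ and the previous target embedding, to the output function $\mathrm{out}_k(\cdot)$ to produce $p(y_{t'} = k \mid y_{<t'}, X)$; that output layer is omitted from the sketch.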