Transactions of the Association for Computational Linguistics, vol. 5, pp. 365–378, 2017. Action Editor: Adam Lopez.
Submission batch: 11/2016; Revision batch: 2/2017; Published 10/2017.
© 2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Fully Character-Level Neural Machine Translation without Explicit Segmentation

Jason Lee∗ (ETH Zürich, jasonlee@inf.ethz.ch), Kyunghyun Cho (New York University, kyunghyun.cho@nyu.edu), Thomas Hofmann (ETH Zürich, thomas.hofmann@inf.ethz.ch)

∗ The majority of this work was completed while the author was visiting New York University.

Abstract

Most existing machine translation systems operate at the level of words, relying on explicit segmentation to extract tokens. We introduce a neural machine translation (NMT) model that maps a source character sequence to a target character sequence without any segmentation. We employ a character-level convolutional network with max-pooling at the encoder to reduce the length of the source representation, allowing the model to be trained at a speed comparable to subword-level models while capturing local regularities. Our character-to-character model outperforms a recently proposed baseline with a subword-level encoder on WMT'15 DE-EN and CS-EN, and gives comparable performance on FI-EN and RU-EN. We then demonstrate that it is possible to share a single character-level encoder across multiple languages by training a model on a many-to-one translation task. In this multilingual setting, the character-level encoder significantly outperforms the subword-level encoder on all the language pairs. We observe that on CS-EN, FI-EN and RU-EN, the quality of the multilingual character-level translation even surpasses that of the models trained specifically on each language pair alone, both in terms of the BLEU score and human judgment.

1 Introduction

Nearly all previous work in machine translation has been at the level of words. Aside from our intuitive understanding of the word as a basic unit of meaning (Jackendoff, 1992), one reason behind this is that sequences are significantly longer when represented in characters, compounding the problems of data sparsity and of modeling long-range dependencies. This has driven NMT research to be almost exclusively word-level (Bahdanau et al., 2015; Sutskever et al., 2014).

Despite their remarkable success, word-level NMT models suffer from several major weaknesses. For one, they are unable to model rare, out-of-vocabulary words, which limits them in translating languages with rich morphology such as Czech, Finnish and Turkish. If one uses a large vocabulary to combat this (Jean et al., 2015), the complexity of training and decoding grows linearly with respect to the target vocabulary size, leading to a vicious cycle.

To address this, we present a fully character-level NMT model that maps a character sequence in a source language to a character sequence in a target language. We show that our model outperforms a baseline with a subword-level encoder on DE-EN and CS-EN, and achieves a comparable result on FI-EN and RU-EN. A purely character-level NMT model with a basic encoder was proposed as a baseline by Luong and Manning (2016), but training it was prohibitively slow. We were able to train our model at a reasonable speed by drastically reducing the length of the source sentence representation using a stack of convolutional, pooling and highway layers, as the sketch below illustrates.
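As a rough illustration of this length-reduction idea, the following minimal numpy sketch passes character embeddings through a convolution, a strided max-pooling, and a single highway layer. Every dimension, the single filter width, and the pooling stride below are placeholder assumptions chosen for brevity, not the hyperparameters used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (placeholders, not the settings used in the paper).
n_chars   = 100   # character vocabulary size
emb_dim   = 32    # character embedding dimension
n_filters = 64    # convolution output channels
width     = 3     # convolution filter width
stride    = 5     # max-pooling stride: the sequence shrinks ~5x

E  = rng.normal(0, 0.1, (n_chars, emb_dim))          # embedding lookup table
W  = rng.normal(0, 0.1, (n_filters, width, emb_dim)) # conv filters
Wh = rng.normal(0, 0.1, (n_filters, n_filters))      # highway transform
Wt = rng.normal(0, 0.1, (n_filters, n_filters))      # highway gate

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encode(char_ids):
    """Map a character sequence to a roughly stride-times-shorter representation."""
    x = E[char_ids]                                   # (T, emb_dim)
    T = len(char_ids)
    # 1. Narrow 1-D convolution over characters, with ReLU.
    conv = np.stack([np.maximum(0, np.einsum('fwd,wd->f', W, x[t:t + width]))
                     for t in range(T - width + 1)])  # (T', n_filters)
    # 2. Max-pooling over time with a stride: this is what shortens the
    #    sequence and makes the recurrent layers above it cheap to train.
    pooled = np.stack([conv[i:i + stride].max(axis=0)
                       for i in range(0, len(conv) - stride + 1, stride)])
    # 3. One highway layer: a gated mix of a transform and the identity.
    g = sigmoid(pooled @ Wt)
    return g * np.tanh(pooled @ Wh) + (1 - g) * pooled

segment = rng.integers(0, n_chars, size=50)           # 50 characters in
print(encode(segment).shape)                          # ~10 time steps out
```

A recurrent encoder then runs over this much shorter sequence rather than over raw characters, which is where the training-speed advantage comes from.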
One advantage of character-level models is that they are better suited for multilingual translation than their word-level counterparts, which require a separate word vocabulary for each language. We verify this by training a single model to translate four languages (German, Czech, Finnish and Russian) to English. Our multilingual character-level model outperforms the subword-level baseline by a considerable margin in all four language pairs, strongly indicating that a character-level model is more flexible in allocating its capacity across language pairs. Furthermore, we observe that our multilingual character-level translation even exceeds the quality of bilingual translation in three out of four language pairs, both in the BLEU score metric and in human evaluation. This demonstrates the excellent parameter efficiency of character-level translation in a multilingual setting. We also showcase our model's ability to handle intra-sentence code-switching while performing language identification on the fly.

The contributions of this work are twofold: we empirically show that (1) we can train a character-to-character NMT model without any explicit segmentation; and (2) we can share a single character-level encoder across multiple languages to build a multilingual translation system without increasing the model size.

2 Background: Attentional Neural Machine Translation

Neural machine translation (NMT) is a recently proposed approach to machine translation that builds a single neural network which takes as input a source sentence $X = (x_1, \dots, x_{T_X})$ and generates its translation $Y = (y_1, \dots, y_{T_Y})$, where $x_t$ and $y_{t'}$ are source and target symbols (Bahdanau et al., 2015; Sutskever et al., 2014; Luong et al., 2015; Cho et al., 2014a). Attentional NMT models have three components: an encoder, a decoder and an attention mechanism.

Encoder. Given a source sentence $X$, the encoder constructs a continuous representation that summarizes its meaning with a recurrent neural network (RNN). A bidirectional RNN is often used, as proposed in Bahdanau et al. (2015). A forward encoder reads the input sentence from left to right: $\overrightarrow{h}_t = \overrightarrow{f}_{\text{enc}}\big(E_x(x_t), \overrightarrow{h}_{t-1}\big)$. Similarly, a backward encoder reads it from right to left: $\overleftarrow{h}_t = \overleftarrow{f}_{\text{enc}}\big(E_x(x_t), \overleftarrow{h}_{t+1}\big)$, where $E_x$ is the source embedding lookup table, and $\overrightarrow{f}_{\text{enc}}$ and $\overleftarrow{f}_{\text{enc}}$ are recurrent activation functions such as long short-term memory units (LSTMs) (Hochreiter and Schmidhuber, 1997) or gated recurrent units (GRUs) (Cho et al., 2014b). The encoder constructs a set of continuous source sentence representations $C$ by concatenating the forward and backward hidden states at each time step: $C = \{h_1, \dots, h_{T_X}\}$, where $h_t = \big[\overrightarrow{h}_t; \overleftarrow{h}_t\big]$.

Attention. First introduced in Bahdanau et al. (2015), the attention mechanism lets the decoder attend more to different source symbols for each target symbol. More concretely, it computes the context vector $c_{t'}$ at each decoding time step $t'$ as a weighted sum of the source hidden states: $c_{t'} = \sum_{t=1}^{T_X} \alpha_{t't} h_t$. Similarly to Chung et al. (2016) and Firat et al. (2016a), each attentional weight $\alpha_{t't}$ represents how relevant the $t$-th source token $x_t$ is to the $t'$-th target token $y_{t'}$, and is computed as:

$$\alpha_{t't} = \frac{1}{Z} \exp\Big(\text{score}\big(E_y(y_{t'-1}), s_{t'-1}, h_t\big)\Big), \quad (1)$$

where $Z = \sum_{k=1}^{T_X} \exp\big(\text{score}(E_y(y_{t'-1}), s_{t'-1}, h_k)\big)$ is the normalization constant. $\text{score}()$ is a feedforward neural network with a single hidden layer that scores how well the source symbol $x_t$ and the target symbol $y_{t'}$ match. $E_y$ is the target embedding lookup table and $s_{t'}$ is the target hidden state at time $t'$.

Decoder. Given a source context vector $c_{t'}$, the decoder computes its hidden state at time $t'$ as $s_{t'} = f_{\text{dec}}\big(E_y(y_{t'-1}), s_{t'-1}, c_{t'}\big)$. Then, a parametric function $\text{out}_k(\cdot)$ returns the conditional probability of the next target symbol being $k$: $p(y_{t'} = k \mid y_{<t'}, X)$.
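To make the above concrete, the following minimal numpy sketch runs a single attention-plus-decoder step over a toy set of encoder states, following Eq. (1). The dimensions, the random stand-ins for embeddings and hidden states, and the plain tanh recurrence used in place of $f_{\text{dec}}$ are illustrative assumptions only, not the implementation used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d   = 8          # toy hidden size (placeholder)
T_x = 6          # source length

# Encoder states h_t (random stand-ins for the concatenated [fwd; bwd] states).
H      = rng.normal(size=(T_x, 2 * d))   # C = {h_1, ..., h_Tx}
s_prev = rng.normal(size=d)              # decoder state s_{t'-1}
y_prev = rng.normal(size=d)              # target embedding E_y(y_{t'-1})

# score(): a feedforward network with a single hidden layer, as in Eq. (1).
W_in  = rng.normal(0, 0.1, (4 * d, d))   # input dim = |y_prev| + |s_prev| + |h_t|
w_out = rng.normal(0, 0.1, d)

def score(y_emb, s, h):
    z = np.concatenate([y_emb, s, h])
    return w_out @ np.tanh(z @ W_in)

# Attention weights alpha_{t',t}: a softmax over source positions (Eq. 1).
e = np.array([score(y_prev, s_prev, H[t]) for t in range(T_x)])
alpha = np.exp(e - e.max())
alpha /= alpha.sum()                     # Z normalizes over t = 1..T_x

# Context vector: the weighted sum of source hidden states.
c = alpha @ H                            # c_{t'} = sum_t alpha_{t',t} h_t

# Decoder update s_{t'} = f_dec(E_y(y_{t'-1}), s_{t'-1}, c_{t'}), shown here
# with a plain tanh recurrence standing in for a GRU or LSTM.
W_dec = rng.normal(0, 0.1, (4 * d, d))
s_new = np.tanh(np.concatenate([y_prev, s_prev, c]) @ W_dec)

print(alpha.round(3))                    # one weight per source position
print(s_new.shape)                       # next decoder state
```

In practice $f_{\text{dec}}$ is a gated recurrent unit such as a GRU or an LSTM, as noted above, and the output layer normalizes $\text{out}_k(\cdot)$ over the target vocabulary to produce the distribution over the next symbol.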