Transactions of the Association for Computational Linguistics, vol. 5, pp. 365–378, 2017. Action Editor: Adam Lopez.

Submission batch: 11/2016; Revision batch: 2/2017; Published 10/2017.

© 2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Fully Character-Level Neural Machine Translation without Explicit Segmentation

Jason Lee∗
ETH Zürich
jasonlee@inf.ethz.ch

Kyunghyun Cho
New York University
kyunghyun.cho@nyu.edu

Thomas Hofmann
ETH Zürich
thomas.hofmann@inf.ethz.ch

∗ The majority of this work was completed while the author was visiting New York University.

Abstract

Most existing machine translation systems operate at the level of words, relying on explicit segmentation to extract tokens. We introduce a neural machine translation (NMT) model that maps a source character sequence to a target character sequence without any segmentation. We employ a character-level convolutional network with max-pooling at the encoder to reduce the length of source representation, allowing the model to be trained at a speed comparable to subword-level models while capturing local regularities. Our character-to-character model outperforms a recently proposed baseline with a subword-level encoder on WMT'15 DE-EN and CS-EN, and gives comparable performance on FI-EN and RU-EN. We then demonstrate that it is possible to share a single character-level encoder across multiple languages by training a model on a many-to-one translation task. In this multilingual setting, the character-level encoder significantly outperforms the subword-level encoder on all the language pairs. We observe that on CS-EN, FI-EN and RU-EN, the quality of the multilingual character-level translation even surpasses the models specifically trained on that language pair alone, both in terms of the BLEU score and human judgment.

1 Introduction

Nearly all previous work in machine translation has been at the level of words. Aside from our intuitive understanding of word as a basic unit of meaning (Jackendoff, 1992), one reason behind this is that sequences are significantly longer when represented in characters, compounding the problem of data sparsity and modeling long-range dependencies. This has driven NMT research to be almost exclusively word-level (Bahdanau et al., 2015; Sutskever et al., 2014).

Despite their remarkable success, word-level NMT models suffer from several major weaknesses. For one, they are unable to model rare, out-of-vocabulary words, making them limited in translating languages with rich morphology such as Czech, Finnish and Turkish. If one uses a large vocabulary to combat this (Jean et al., 2015), the complexity of training and decoding grows linearly with respect to the target vocabulary size, leading to a vicious cycle.

To address this, we present a fully character-level NMT model that maps a character sequence in a source language to a character sequence in a target language. We show that our model outperforms a baseline with a subword-level encoder on DE-EN and CS-EN, and achieves a comparable result on FI-EN and RU-EN. A purely character-level NMT model with a basic encoder was proposed as a baseline by Luong and Manning (2016), but training it was prohibitively slow. We were able to train our model at a reasonable speed by drastically reducing the length of the source sentence representation using a stack of convolutional, pooling and highway layers.
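To make the length-reduction idea concrete, the PyTorch sketch below embeds characters, applies a 1-D convolution, max-pools over time with a stride, and passes the result through a highway layer. All layer sizes, the kernel width, the stride, and the use of a single convolution are simplifying assumptions for illustration, not the paper's exact architecture:

```python
# A minimal sketch of the encoder front end: character embeddings ->
# 1-D convolution -> strided max-pooling over time -> highway layer.
# Hyperparameters here are illustrative assumptions, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharEncoderFrontEnd(nn.Module):
    def __init__(self, n_chars=300, emb_dim=128, n_filters=256,
                 kernel_width=5, pool_stride=5):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb_dim)
        # Convolution over character positions captures local regularities
        # (morpheme-like, frequent character n-grams).
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_width,
                              padding=kernel_width // 2)
        # Strided max-pooling shortens the sequence, so any recurrent
        # layers stacked on top run over ~T / pool_stride positions.
        self.pool = nn.MaxPool1d(pool_stride, stride=pool_stride)
        # Highway layer: gated mix of a nonlinear transform and identity.
        self.transform = nn.Linear(n_filters, n_filters)
        self.gate = nn.Linear(n_filters, n_filters)

    def forward(self, char_ids):                   # (batch, T) char ids
        x = self.embed(char_ids)                   # (batch, T, emb_dim)
        x = F.relu(self.conv(x.transpose(1, 2)))   # (batch, n_filters, T)
        x = self.pool(x).transpose(1, 2)           # (batch, T // 5, n_filters)
        g = torch.sigmoid(self.gate(x))
        return g * F.relu(self.transform(x)) + (1 - g) * x

enc = CharEncoderFrontEnd()
chars = torch.randint(0, 300, (2, 100))            # 2 sequences of 100 chars
print(enc(chars).shape)                            # torch.Size([2, 20, 256])
```

The pooling stride is what buys the speedup: the expensive recurrent encoder then processes a sequence five times shorter than the raw character sequence.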

One advantage of character-level models is that they are better suited for multilingual translation than their word-level counterparts, which require a separate word vocabulary for each language. We verify this by training a single model to translate four languages (German, Czech, Finnish and Russian) to English. Our multilingual character-level model outperforms the subword-level baseline by a considerable margin in all four language pairs, strongly indicating that a character-level model is more flexible in assigning its capacity to different language pairs. Furthermore, we observe that our multilingual character-level translation even exceeds the quality of bilingual translation in three out of four language pairs, both in the BLEU score metric and human evaluation. This demonstrates the excellent parameter efficiency of character-level translation in a multilingual setting. We also showcase our model's ability to handle intra-sentence code-switching while performing language identification on the fly.

The contributions of this work are twofold: we empirically show that (1) we can train a character-to-character NMT model without any explicit segmentation; and (2) we can share a single character-level encoder across multiple languages to build a multilingual translation system without increasing the model size.

2 Background: Attentional Neural Machine Translation

Neural machine translation (NMT) is a recently proposed approach to machine translation that builds a single neural network which takes as an input a source sentence $X = (x_1, \ldots, x_{T_X})$ and generates its translation $Y = (y_1, \ldots, y_{T_Y})$, where $x_t$ and $y_{t'}$ are source and target symbols (Bahdanau et al., 2015; Sutskever et al., 2014; Luong et al., 2015; Cho et al., 2014a). Attentional NMT models have three components: an encoder, a decoder and an attention mechanism.

Encoder. Given a source sentence $X$, the encoder constructs a continuous representation that summarizes its meaning with a recurrent neural network (RNN). A bidirectional RNN is often implemented as proposed in Bahdanau et al. (2015). A forward encoder reads the input sentence from left to right: $\overrightarrow{h}_t = \overrightarrow{f}_{\mathrm{enc}}\big(E_x(x_t), \overrightarrow{h}_{t-1}\big)$. Similarly, a backward encoder reads it from right to left: $\overleftarrow{h}_t = \overleftarrow{f}_{\mathrm{enc}}\big(E_x(x_t), \overleftarrow{h}_{t+1}\big)$, where $E_x$ is the source embedding lookup table, and $\overrightarrow{f}_{\mathrm{enc}}$ and $\overleftarrow{f}_{\mathrm{enc}}$ are recurrent activation functions such as long short-term memory units (LSTMs) (Hochreiter and Schmidhuber, 1997) or gated recurrent units (GRUs) (Cho et al., 2014b). The encoder constructs a set of continuous source sentence representations $C$ by concatenating the forward and backward hidden states at each time step: $C = \{h_1, \ldots, h_{T_X}\}$, where $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$.

Attention. First introduced in Bahdanau et al. (2015), the attention mechanism lets the decoder attend more to different source symbols for each target symbol. More concretely, it computes the context vector $c_{t'}$ at each decoding time step $t'$ as a weighted sum of the source hidden states: $c_{t'} = \sum_{t=1}^{T_X} \alpha_{t't} h_t$. Similarly to Chung et al. (2016) and Firat et al. (2016a), each attentional weight $\alpha_{t't}$ represents how relevant the $t$-th source token $x_t$ is to the $t'$-th target token $y_{t'}$, and is computed as:

$$\alpha_{t't} = \frac{1}{Z} \exp\Big(\mathrm{score}\big(E_y(y_{t'-1}), s_{t'-1}, h_t\big)\Big), \qquad (1)$$

where $Z = \sum_{k=1}^{T_X} \exp\big(\mathrm{score}(E_y(y_{t'-1}), s_{t'-1}, h_k)\big)$ is the normalization constant. $\mathrm{score}()$ is a feed-forward neural network with a single hidden layer that scores how well the source symbol $x_t$ and the target symbol $y_{t'}$ match. $E_y$ is the target embedding lookup table and $s_{t'}$ is the target hidden state at time $t'$.

Decoder. Given a source context vector $c_{t'}$, the decoder computes its hidden state at time $t'$ as: $s_{t'} = f_{\mathrm{dec}}\big(E_y(y_{t'-1}), s_{t'-1}, c_{t'}\big)$. Then, a parametric function $\mathrm{out}_k()$ returns the conditional probability of the next target symbol being $k$: $P(y_{t'} = k \mid y_{<t'}, X)$.
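As a concrete illustration of one decoding step, the PyTorch sketch below computes the attention weights of Eq. (1) with a single-hidden-layer score network, forms the context vector, and updates the decoder state. The hidden sizes, the GRU cell, and all variable names are illustrative assumptions, not the paper's configuration:

```python
# One attentional decoding step (Eq. 1), for a single example.
# Dimensions and the GRU cell are assumptions for illustration.
import torch
import torch.nn as nn

emb_dim, hid = 64, 128
T_x = 7                                     # source length

E_y = nn.Embedding(1000, emb_dim)           # target embedding lookup table
# score(): one hidden layer over [E_y(y_{t'-1}); s_{t'-1}; h_t]
score = nn.Sequential(nn.Linear(emb_dim + hid + 2 * hid, hid),
                      nn.Tanh(),
                      nn.Linear(hid, 1))
f_dec = nn.GRUCell(emb_dim + 2 * hid, hid)  # decoder consumes [E_y(y); c]

h = torch.randn(T_x, 2 * hid)               # C = {h_1..h_Tx}, fwd/bwd concat
s_prev = torch.zeros(hid)                   # s_{t'-1}
y_prev = torch.tensor([3])                  # previous target symbol y_{t'-1}

# alpha_{t',t}: softmax over source positions implements the 1/Z
# normalization of Eq. (1).
e_y = E_y(y_prev).squeeze(0)
inputs = torch.cat([e_y.expand(T_x, -1),
                    s_prev.expand(T_x, -1), h], dim=1)
alpha = torch.softmax(score(inputs).squeeze(1), dim=0)    # (T_x,)

# c_{t'} = sum_t alpha_{t',t} h_t
c = (alpha.unsqueeze(1) * h).sum(dim=0)                   # (2*hid,)

# s_{t'} = f_dec(E_y(y_{t'-1}), s_{t'-1}, c_{t'})
s = f_dec(torch.cat([e_y, c]).unsqueeze(0), s_prev.unsqueeze(0))
print(alpha.shape, c.shape, s.shape)
```

Note that nothing in this mechanism depends on what the source symbols are: whether $x_t$ is a word, a subword, or a character only changes the encoder beneath it, which is what makes the fully character-level setting a drop-in change.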
