Transactions of the Association for Computational Linguistics, vol. 6, pp. 451–465, 2018. Action Editor: Brian Roark.
Submission batch: 12/2017; Revision batch: 5/2018; Published 7/2018.
© 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Language Modeling for Morphologically Rich Languages: Character-Aware Modeling for Word-Level Prediction

Daniela Gerz1, Ivan Vulić1, Edoardo Ponti1, Jason Naradowsky3, Roi Reichart2, Anna Korhonen1
1 Language Technology Lab, DTAL, University of Cambridge
2 Faculty of Industrial Engineering and Management, Technion, IIT
3 Johns Hopkins University
1{dsg40,iv250,ep490,alk23}@cam.ac.uk  2roiri@ie.technion.ac.il  3narad@jhu.edu

Abstract

Neural architectures are prominent in the construction of language models (LMs). However, word-level prediction is typically agnostic of subword-level information (characters and character sequences) and operates over a closed vocabulary, consisting of a limited word set. Indeed, while subword-aware models boost performance across a variety of NLP tasks, previous work did not evaluate the ability of these models to assist next-word prediction in language modeling tasks. Such subword-level informed models should be particularly effective for morphologically-rich languages (MRLs) that exhibit high type-to-token ratios. In this work, we present a large-scale LM study on 50 typologically diverse languages covering a wide variety of morphological systems, and offer new LM benchmarks to the community, while considering subword-level information. The main technical contribution of our work is a novel method for injecting subword-level information into semantic word vectors, integrated into the neural language modeling training, to facilitate word-level prediction. We conduct experiments in the LM setting where the number of infrequent words is large, and demonstrate strong perplexity gains across our 50 languages, especially for morphologically-rich languages. Our code and datasets are publicly available.

1 Introduction

Language Modeling (LM) is a key NLP task, serving as an important component for applications that require some form of text generation, such as machine translation (Vaswani et al., 2013), speech recognition (Mikolov et al., 2010), dialogue generation (Serban et al., 2016), or summarisation (Filippova et al., 2015). A traditional recurrent neural network (RNN) LM setup operates on a limited closed vocabulary of words (Bengio et al., 2003; Mikolov et al., 2010). The limitation arises due to the model learning parameters exclusive to single words. A standard training procedure for neural LMs gradually modifies the parameters based on contextual/distributional information: each occurrence of a word token in training data contributes to the estimate of a word vector (i.e., model parameters) assigned to this word type. Low-frequency words therefore often have incorrect estimates, not having moved far from their random initialisation. A common strategy for dealing with this issue is to simply exclude the low-quality parameters from the model (i.e., to replace them with the
in the full vocabulary setup (Adams et al., 2017), it is of crucial importance to construct and enable techniques that can obtain these parameters in alternative ways. One solution is to draw information from additional sources, such as characters and character sequences. As a consequence, such character-aware models should facilitate LM word-level prediction in a real-life LM setup which deals with a large amount of low-frequency or unseen words. Efforts in this direction have yielded exciting results, primarily on the input side of neural LMs.

A standard RNN LM architecture relies on two word representation matrices learned during training for its input and next-word prediction. This effectively means that there are two sets of per-word specific parameters that need to be trained. Recent work shows that it is possible to generate a word representation on-the-fly based on its constituent characters, thereby effectively solving the problem for the parameter set on the input side of the model (Kim et al., 2016; Luong and Manning, 2016; Miyamoto and Cho, 2016; Ling et al., 2015). However, it is not straightforward how to advance these ideas to the output side of the model, as this second set of word-specific parameters is directly responsible for the next-word prediction: it has to encode a much wider range of information, such as topical and semantic knowledge about words, which cannot be easily obtained from its characters alone (Jozefowicz et al., 2016). While one solution is to directly output characters instead of words (Graves, 2013; Miyamoto and Cho, 2016), a recent work from Jozefowicz et al. (2016) suggests that such purely character-based architectures, which do not reserve parameters for information specific to single words, cannot attain state-of-the-art LM performance on word-level prediction.

In this work, we combine the two worlds and propose a novel LM approach which relies on both word-level (i.e., contextual) and subword-level knowledge. In addition to training word-specific parameters for word-level prediction using a regular LM objective, our method encourages the parameters to also reflect subword-level patterns by injecting knowledge about morphology. This information is extracted in an unsupervised manner based on already available information in convolutional filters from earlier network layers. The proposed method leads to large improvements in perplexity across a wide spectrum of languages: 22 in English, 144 in Hebrew, 378 in Finnish, 957 in Korean on our LM benchmarks. We also show that the gains extend to another multilingual LM evaluation set, compiled recently for 7 languages by Kawakami et al. (2017).

We conduct a systematic LM study on 50 typologically diverse languages, sampled to represent a variety of morphological systems. We discuss the implications of typological diversity on the LM task, both theoretically in Section 2 and empirically in Section 7; we find a clear correspondence between performance of state-of-the-art LMs and structural linguistic properties. Further, the consistent perplexity gains across the large sample of languages suggest wide applicability of our novel method.

Finally, this article can also be read as a comprehensive multilingual analysis of current LM architectures on a set of languages which is much larger than the ones used in recent LM work (Botha and Blunsom, 2014; Vania and Lopez, 2017; Kawakami et al., 2017). We hope that this article with its new datasets, methodology and models, all available online at http://people.ds.cam.ac.uk/dsg40/lmmrl.html, will pave the way for true multilingual research in language modeling.

2 LM Data and Typological Diversity

A language model defines a probability distribution over sequences of tokens, and is typically trained to maximise the likelihood of token input sequences. Formally, the LM objective is expressed as follows:

P(t_1, ..., t_n) = \prod_i P(t_i | t_1, ..., t_{i-1}).    (1)

t_i is a token with the index i in the sequence. For word-level prediction a token corresponds to one word, whereas for character-level (also termed char-level) prediction it is one character.
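To make Eq. (1) and the perplexity metric used later for evaluation (Section 6) concrete, here is a minimal Python sketch; the per-token probabilities are invented toy values and the function names are ours, not part of the paper's released code.

```python
import math

def sequence_log_prob(step_probs):
    """Chain rule of Eq. (1): log P(t_1..t_n) = sum_i log P(t_i | t_1..t_{i-1}).

    `step_probs` holds the model's probability for each observed token
    given its history, i.e. P(t_1), P(t_2 | t_1), ...
    """
    return sum(math.log(p) for p in step_probs)

def perplexity(step_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    n = len(step_probs)
    return math.exp(-sequence_log_prob(step_probs) / n)

# Toy example: a 4-token sequence with made-up per-token probabilities.
probs = [0.2, 0.1, 0.5, 0.05]
print(sequence_log_prob(probs))   # joint log-probability of the sequence
print(perplexity(probs))          # lower is better; 1.0 would be a perfect model
```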
LMs are most commonly tested on Western European languages. Standard LM benchmarks in English include the Penn Treebank (PTB) (Marcus et al., 1993), the 1 Billion Word Benchmark (BWB) (Chelba et al., 2014), and the Hutter Prize data (Hutter, 2012). English datasets extracted from BBC News (Greene and Cunningham, 2006) and IMDB Movie Reviews (Maas et al., 2011) are also used for LM evaluation (Wang and Cho, 2016; Miyamoto and Cho, 2016; Press and Wolf, 2017).

Regarding multilingual LM evaluation, Botha and Blunsom (2014) extract datasets for other languages from the sets provided by the 2013 Workshop on Statistical Machine Translation (WMT) (Bojar et al., 2013): they experiment with Czech, French, Spanish, German and Russian. A recent work of Kim et al. (2016) reuses these datasets and adds Arabic. Ling et al. (2015) evaluate on English, Portuguese, Catalan, German and Turkish datasets extracted from Wikipedia. Verwimp et al. (2017) use a subset of the Corpus of Spoken Dutch (Oostdijk, 2000) for Dutch LM. Kawakami et al. (2017) evaluate on 7 European languages using Wikipedia data, including Finnish.

Perhaps the largest and most diverse set of languages used for multilingual LM evaluation so far is the one of Vania and Lopez (2017). Their study includes 10 languages in total representing several morphological types (fusional, e.g., Russian, and agglutinative, e.g., Finnish), as well as languages with particular morphological phenomena (root-and-pattern in Hebrew and reduplication in Malay). In this work, we provide LM evaluation datasets for 50 typologically diverse languages, with their selection guided by structural properties.

Language Selection. Aiming for a comprehensive multilingual LM evaluation, we include languages for all possible types of morphological systems. Our starting point is the Polyglot Wikipedia (PW) (Al-Rfou et al., 2013). While at first PW seems comprehensive and quite large already (covering 40 languages), the majority of the PW languages are similar from both a genealogical perspective (26/40 are Indo-European) and a geographic perspective (28/40 Western European). As a consequence, they share many patterns and are not a representative sample of the world's languages.

In order to quantitatively analyse global trends and cross-linguistic generalisations across a large set of languages, we propose to test on all PW languages and source additional data from the same domain, Wikipedia¹, considering candidates in descending order of corpus size and morphological type.

¹ Chinese, Japanese, and Thai are sourced from Wikipedia and processed with the Polyglot tokeniser since we found their preprocessing in the PW is not adequate for language modeling.

Figure 1: An illustration of the Char-CNN-LSTM LM and our fine-tuning post-processing method. After each epoch we adapt word-level vectors in the softmax embedding Mw using samples based on features from the char-level convolutional filters. The figure follows the model flow bottom to the top.

Traditionally, languages have been grouped into the four main types: isolating, fusional, introflexive and agglutinative, based on their position along a spectrum measuring their preference to break up concepts into many words (on one extreme) or rather to compose them into single words (on the other extreme). However, even languages belonging to the same type display different out-of-vocabulary rates and type-token ratios. This happens because languages specify different subsets of grammatical categories (such as tense for verbs, or number for nouns) and values (such as future for tense, plural for number). The amount of grammatical categories expressed in a language determines its inflectional synthesis (Bickel and Nichols, 2013).

In our final sample of languages, we select languages belonging to morphological types different from the fusional one, which is over-represented in the PW. In particular, we include new isolating (Min Nan, Burmese, Khmer), agglutinative (Basque, Georgian, Kannada, Tamil, Mongolian, Javanese), and introflexive languages (Amharic).
3 Underlying LM: Char-CNN-LSTM

As the underlying model we opt for the state-of-the-art neural LM architecture of Kim et al. (2016): it has been shown to work across a number of languages and in a large-scale setup (Jozefowicz et al., 2016). It already provides a solution for the input side parameters of the model by building word vectors based on the word's constituent character sequences. However, its output side still operates with a standard word-level matrix within the closed and limited vocabulary assumption. We refer to this model as Char-CNN-LSTM and describe its details in the following. Figure 1 (left) illustrates the model architecture.

Char-CNN-LSTM constructs input word vectors based on the characters in each word using a convolutional neural network (CNN) (LeCun et al., 1989), then processes the input word-level using an LSTM (Hochreiter and Schmidhuber, 1997). The next word is predicted using word embeddings, a large number of parameters which have to be trained specifically to represent the semantics of single words. We refer to this space of word representations as Mw.

Formally, for the input layer the model trains a look-up matrix C ∈ R^{|Vc| × dc}, corresponding to one dc-dimensional vector per character c in the char vocabulary Vc. For each input, it takes a sequence of characters of a fixed length m, [c_1, ..., c_m], where m is the maximum length of all words in the word vocabulary Vw, and the length of each word is l ≤ m. Looking up all characters of a word yields a sequence of char representations in R^{dc × l}, which is zero-padded to fit the fixed length m. For each word one gets a sequence of char representations Cw ∈ R^{dc × m}, passed through a 1D convolution:

f_i^w = tanh(⟨Cw, H_i⟩ + b).    (2)

H_i ∈ R^{df,i × si} is a filter or kernel of size/width si, and ⟨A, B⟩ = Tr(AB^T) is the Frobenius inner product. The model has multiple filters H_i, with kernels of different width si and dimensionality df,i; i is used to index filters. Since the model performs a convolution over char embeddings, si corresponds to the char window the convolution is operating on: e.g., a filter of width si = 3 and df,i = 150 could be seen as learning 150 features for detecting 3-grams. By learning kernels of different width si, the model can learn subword-level features for character sequences of different lengths. f_i^w is the output of taking the convolution with filter H_i for word w. Since f_i^w can get quite large, its dimensionality is reduced using max-over-time (1D) pooling: y_i^w = max_j f_i^w[j]. Here, j indexes the dimensions df,i of the filter f_i^w, and y_i^w ∈ R^{df,i}. This corresponds to taking the maximum value for each feature of H_i, with the intuition that the most informative feature would have the highest activation. The output of all max-pooling operations y_i^w is concatenated to form a word vector yw ∈ R^{dp}, where dp is the number of all features for all H_i:

yw = concat([y_1^w, ..., y_i^w]).    (3)

This vector is passed through a highway network (Srivastava et al., 2015) to give the network the possibility to reweight or transform the features: hw = Highway(yw). So far all transformations were done per word; after the highway transformation word representations are processed in a sequence by an LSTM (Hochreiter and Schmidhuber, 1997):

o_t^w = LSTM([h_1^w, ..., h_{t-1}^w]).    (4)

The LSTM yields one output vector o_t^w per word in the sequence, given all previous time steps [y_1^w, ..., y_{t-1}^w]. To predict the next word w_{t+1}, one takes the dot product of the vector o_t^w ∈ R^{1 × dl} with a lookup matrix Mw ∈ R^{dl × |Vw|}, where dl corresponds to the LSTM hidden state size. The vector p_{t+1} ∈ R^{1 × |Vw|} is normalised to contain values between 0 and 1, representing a probability distribution over the next word. This corresponds to calculating the softmax function for every word k in Vw:

p(w_{t+1} = k | o_t) = e^{o_t · m_k} / Σ_{k' ∈ Vw} e^{o_t · m_{k'}}    (5)

where P(w_{t+1} = k | o_t) is the probability of the next word w_{t+1} being k given o_t, and m_k is the output embedding vector taken from Mw.
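The following NumPy sketch mirrors Eqs. (2)–(3) for a single word, to make the per-width convolution and the max-over-time pooling concrete; it omits the highway layer and the LSTM, and all sizes, filter widths and character ids are toy values chosen for the example (the paper's actual models are implemented in TensorFlow, see Section 6).

```python
import numpy as np

rng = np.random.default_rng(0)

d_c, m = 15, 10                  # char embedding size and padded word length (toy values)
V_c = 40                         # toy character vocabulary size
C = rng.normal(size=(V_c, d_c))  # char look-up matrix C of Eq. (2)

# Filter banks: width s_i -> number of features d_{f,i} (toy sizes; the paper uses widths 1..7).
filter_specs = {1: 5, 2: 8, 3: 10}
H = {s: rng.normal(size=(d, d_c, s)) for s, d in filter_specs.items()}
b = {s: np.zeros(d) for s, d in filter_specs.items()}

def char_cnn_word_vector(char_ids):
    """Eqs. (2)-(3): embed the characters, convolve with each filter bank,
    apply tanh, max-pool over time, and concatenate the pooled features."""
    Cw = C[char_ids].T                                   # d_c x l char representations
    Cw = np.pad(Cw, ((0, 0), (0, m - Cw.shape[1])))      # zero-pad to fixed length m
    pooled = []
    for s, Hs in H.items():
        n_pos = m - s + 1
        # f_i^w[j] = tanh(<C_w[:, j:j+s], H_i> + b): Frobenius inner product per position
        f = np.stack([np.tanh(np.tensordot(Hs, Cw[:, j:j + s],
                                           axes=([1, 2], [0, 1])) + b[s])
                      for j in range(n_pos)], axis=1)    # shape d_{f,i} x n_pos
        pooled.append(f.max(axis=1))                     # y_i^w: max over time per feature
    return np.concatenate(pooled)                        # y_w with d_p = 5 + 8 + 10 dims

word = [3, 17, 5, 22]                                    # toy character ids for one word
print(char_cnn_word_vector(word).shape)                  # (23,)
```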
Word-Level Vector Space: Mw. The model parameters in Mw can be seen as the bottleneck of the model, as they need to be trained specifically for single words, leading to unreliable estimates for infrequent words. As an analysis of the corpus statistics later in Section 7 reveals, the Zipfian effect and its influence on word vector estimation cannot be fully resolved even with a large corpus, especially taking into account how flexible MRLs are in terms of word formation and combination. Yet, having a good estimate for the parameters in Mw is essential for the final LM performance, as they are directly responsible for the next-word prediction.

Therefore, our aim is to improve the quality of representations in Mw, focusing on infrequent words. To achieve this, we turn to another source of information: character patterns. In other words, since Mw does not have any information about character patterns from lower layers, we seek a way to: a) detect words with similar subword structures (i.e., "morpheme"-level information), and b) let these words share their semantic information.

4 Character-Aware Vector Space

The CNN part of Char-CNN-LSTM, see Eq. (3), in fact provides information about such subword-level patterns: the model constructs a word vector yw on-the-fly based on the word's constituent characters. We let the model construct yw for all words in the vocabulary, resulting in a character-aware word vector space Mc ∈ R^{|Vw| × dp}. The construction of the space is completely unsupervised and independent of the word's context; only the first (CNN) network layers are activated. Our core idea is to leverage this information obtained from Mc to influence the output matrix Mw, and consequently the network prediction, and to extend the model to handle unseen words.

We first take a closer look at the character-aware space Mc, and then describe how to improve and expand the semantic space Mw based on the information contained in Mc (Section 5). Each vocabulary entry in Mc encodes character n-gram patterns about the represented word, for 1 ≤ n ≤ 7. The n-gram patterns arise through filters of different lengths, and their maximum activation is concatenated to form each individual vector yw. The matrix Mc is of dimensionality |Vw| × 1100, where each of the 1,100 dimensions corresponds to the activation of one kernel feature. In practice, dimensions [0 : 50] correspond to single-character features, [50 : 150] to character 2-grams, and [150 : 300] to 3-grams. The higher-order n-grams get assigned 200 dimensions each, up to dimensions [900 : 1100] for 7-grams.

Drawing an analogy to work in computer vision (Zeiler and Fergus, 2014; Chatfield et al., 2014), we delve deeper into the filter activations and analyse the key properties of the vector space Mc. The qualitative analysis reveals that many features are interpretable by humans, and indeed correspond to frequent subword patterns, as illustrated in Table 1.

Lang | si | Pattern | Max Activations
ZH | 1 | 更, 不 | 更为, 更改, 更名, ..., 不满, 不明, 不易
ZH | 1 | 今, 代 | 今日, 今人, 至少, ..., 如何, 现代, 当代
TR | 1 | Caps | In, Ebru, VIC, ..., FAT, MW, MITTR
TR | 3 | mu- | ..., mutfağının, muharebe, muhtelif
TR | 6 | Üniversite | ..., Üniversitesi'nin, üniversitelerde

Table 1: Each CNN filter tends to have high activations for a small number of subword patterns. si denotes the filter size.

For instance, tokenised Chinese data favours short words: consequently, short filters activate strongly for one or two characters. The first two filters (width 1) are highly active for two common single characters each: one filter is active for 更 (again, more) and 不 (no), and the other for 今 (now) and 代 (time period). Larger filters (width 5-7) do not show interpretable patterns in Chinese, since the vocabulary largely consists of short words (length 1-4).

Agglutinative languages show a tendency towards long words. We find that medium-sized filters (width 3-5) are active for morphemes or short common subword units, and the long filters are activated for different surface realisations of the same root word. In Turkish, one filter is highly active on various forms of the word üniversite (university). Further, in MRLs with the Latin alphabet short filters are typically active on capitalisation or special chars.

Table 2 shows examples of nearest neighbours based on the activations in Mc.

Lang | Word | Nearest Neighbours
DE | Ursprünglichkeit | ursprüngliche, Urstoff, ursprünglichen
DE | Mittelwert | Mittelwerten, Regelwerkes, Mittelweser
DE | effektiv | Effekt, Perfekt, Effekte, perfekten, Respekt
JA | 大学 | 大金, 大石, 大震災, 大空, 大野
JA | ハイク | ハイム, バイク, メイク, ハッサク
EN | 1725 | 1825, 1625, 1524mm, 1728
EN | Magenta | Maplet, Maya, Management

Table 2: Nearest neighbours for vocabulary words, based on the character-aware vector space Mc.
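The nearest-neighbour analysis behind Table 2 can be reproduced with a few lines of NumPy: encode every vocabulary word with the character-level encoder only, stack the resulting vectors into Mc, and rank words by cosine similarity. The encoder below is only a stand-in (a bag of hashed character bigrams) for the char-CNN of Section 3, and the vocabulary of character-id sequences is invented for the example.

```python
import numpy as np

def toy_char_encoder(char_ids, dim=16):
    """Stand-in for the char-CNN word vector y_w of Section 3:
    a bag of hashed character bigrams, used here only for illustration."""
    v = np.zeros(dim)
    for a, b in zip(char_ids, char_ids[1:]):
        v[(a * 31 + b) % dim] += 1.0
    return v

def build_Mc(vocab_char_ids, encode):
    """Stack the character-aware vectors of all vocabulary words into Mc (|Vw| x dp)."""
    return np.stack([encode(ids) for ids in vocab_char_ids])

def nearest_neighbours(Mc, idx, k=3):
    """Rank vocabulary entries by cosine similarity to word `idx` in Mc."""
    X = Mc / np.linalg.norm(Mc, axis=1, keepdims=True)
    sims = X @ X[idx]
    return [int(i) for i in np.argsort(-sims) if i != idx][:k]

# Toy vocabulary: words 0 and 1 share a character prefix, as do words 2 and 3.
vocab = [[3, 17, 5], [3, 17, 5, 22], [8, 9, 30], [8, 9, 30, 7], [25, 1, 2, 14]]
Mc = build_Mc(vocab, toy_char_encoder)
print(nearest_neighbours(Mc, idx=0, k=2))   # word 1 shares the most bigrams with word 0
```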
The space seems to be arranged according to shared subword patterns based on the CNN features. It does not rely only on a simple character overlap, but also captures shared morphemes. This property is exploited to influence the LM output word embedding matrix Mw in a completely unsupervised way, as illustrated on the right side of Figure 1.

5 Fine-Tuning the LM Prediction

While the output vector space Mw captures word-level semantics, Mc arranges words by subword features. A model which relies solely on character-level knowledge (similar to the information stored in Mc) for word-level prediction cannot fully capture word-level semantics and even hurts LM performance (Jozefowicz et al., 2016). However, shared subword units still provide useful evidence of shared semantics (Cotterell et al., 2016; Vulić et al., 2017): injecting this into the space Mw to additionally reflect shared subword-level information should lead to improved word vector estimates, especially for MRLs.

5.1 Fine-Tuning and Constraints

We inject this information into Mw by adapting recent fine-tuning (often termed retrofitting or specialisation) methods for vector space post-processing (Faruqui et al., 2015; Wieting et al., 2015; Mrkšić et al., 2017; Vulić et al., 2017, i.a.). These models enrich initial vector spaces by encoding external knowledge, provided in the form of simple linguistic constraints (i.e., word pairs), into the initial vector space.

There are two fundamental differences between our work and previous work on specialisation. First, previous models typically use rich hand-crafted lexical resources such as WordNet (Fellbaum, 1998) or the Paraphrase Database (Ganitkevitch et al., 2013), or manually defined rules (Vulić et al., 2017) to extract the constraints, while we generate them directly using the implicit knowledge coded in Mc. Second, our method is integrated into a language model: it performs updates after each epoch of the LM training.²

² We have also experimented with a variant which performs only a post-hoc single update of the Mw matrix after the LM training, but a variant which performs continuous per-epoch updates is more beneficial for the final LM performance.

In Section 5.2, we describe our model for fine-tuning Mw based on the information provided in Mc. Our fine-tuning approach relies on constraints: positive and negative word pairs (x_i, x_j), where x_i, x_j ∈ Vw. Iterating over each cue word x_w ∈ Vw we find a set of positive word pairs Pw and negative word pairs Nw: their extraction is based on their (dis)similarity with x_w in Mc. Positive pairs (x_w, x_p) contain words x_p yielding the highest cosine similarity to x_w (= nearest neighbors) in Mc. Negative pairs (x_w, x_n) are constructed by randomly sampling words x_n from the vocabulary. Since Mc gets updated during the LM training, we (re)generate the sets Pw and Nw after each epoch.

5.2 Attract-Preserve

We now present a method for fine-tuning the output matrix Mw within the Char-CNN-LSTM LM framework. As said, the fine-tuning procedure runs after each epoch of the standard log-likelihood LM training (see Figure 1). We adapt a variant of a state-of-the-art post-processing specialisation procedure (Wieting et al., 2015; Mrkšić et al., 2017). The idea of the fine-tuning method, which we label Attract-Preserve (AP), is to pull the positive pairs closer together in the output word-level space, while pushing the negative pairs further away.

Let v_i denote the word vector of the word x_i. The AP cost function has two parts: attract and preserve. In the attract term, using the extracted sets Pw and Nw, we push the vector of x_w to be closer to x_p by a similarity margin δ than to its negative sample x_n:

attr(Pw, Nw) = Σ_{(x_w, x_p) ∈ Pw, (x_w, x_n) ∈ Nw} ReLU(δ + v_w v_n − v_w v_p).

ReLU(x) is the standard rectified linear unit (Nair and Hinton, 2010). The δ margin is set to 0.6 in all experiments as in prior work (Mrkšić et al., 2017), without any subsequent fine-tuning. The preserve cost acts as a regularisation pulling the "fine-tuned" vector back to its initial value:

pres(Pw, Nw) = Σ_{x_w ∈ Vw} λ_reg ||v̂_w − v_w||².    (6)

λ_reg = 10⁻⁹ is the L2-regularisation constant (Mrkšić et al., 2017); v̂_w is the original word vector before the procedure. This term tries to preserve the semantic content present in the original vector space, as long as this information does not contradict the knowledge injected by the constraints. The final cost function adds the two costs: cost = attr + pres.
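Below is a small NumPy sketch of one Attract-Preserve update over a toy output matrix Mw, implementing the attract and preserve terms above with δ = 0.6 and λ_reg = 10⁻⁹. The constraint triples are invented, and the plain gradient step stands in for the Adagrad optimiser (learning rate 0.05) that the paper actually uses (Section 6).

```python
import numpy as np

DELTA, LAMBDA_REG = 0.6, 1e-9   # margin and L2 constant quoted in the paper

def ap_cost_and_grad(Mw, Mw_init, triples):
    """cost = attr + pres for constraint triples (cue x_w, positive x_p, negative x_n).

    attr: ReLU(delta + v_w.v_n - v_w.v_p) pulls x_w towards x_p and away from x_n.
    pres: lambda_reg * ||v_hat_w - v_w||^2 pulls every vector back to its initial value.
    Returns the scalar cost and the gradient w.r.t. Mw.
    """
    grad = np.zeros_like(Mw)
    cost = 0.0
    for w, p, n in triples:
        vw, vp, vn = Mw[w], Mw[p], Mw[n]
        margin = DELTA + vw @ vn - vw @ vp
        if margin > 0:                        # ReLU is active: non-zero contribution
            cost += margin
            grad[w] += vn - vp
            grad[p] += -vw
            grad[n] += vw
    diff = Mw - Mw_init                       # preserve term over the whole space
    cost += LAMBDA_REG * np.sum(diff ** 2)
    grad += 2 * LAMBDA_REG * diff
    return cost, grad

# Toy usage: 6 word vectors, one attract-preserve update with a fixed step size.
rng = np.random.default_rng(1)
Mw_init = rng.normal(size=(6, 4))
Mw = Mw_init.copy()
triples = [(0, 1, 4), (2, 3, 5)]              # invented (cue, positive, negative) ids
cost, grad = ap_cost_and_grad(Mw, Mw_init, triples)
Mw -= 0.05 * grad                             # the paper uses Adagrad with lr 0.05
print(cost)
```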
6 Experiments

Datasets. We use the Polyglot Wikipedia (Al-Rfou et al., 2013) for all available languages except for Japanese, Chinese, and Thai, and add these and further languages using Wikipedia dumps. The Wiki dumps were cleaned and preprocessed by the Polyglot tokeniser. We construct similarly-sized datasets by extracting 46K sentences for each language from the beginning of each dump, filtered to contain only full sentences, and split into train (40K), validation (3K), and test (3K). The final list of languages along with standard language codes (ISO 639-1 standard, used throughout the paper) and statistics on vocabulary and token counts are provided in Table 4.

Evaluation Setup. We report perplexity scores (Jurafsky and Martin, 2017, Chapter 4.2.1) using the full vocabulary of the respective LM dataset. This means that we explicitly decide to retain also infrequent words in the modelled data. Replacing infrequent words by a placeholder token
Mw space, but we only allow words more frequent than 5 as cue words x_w (see Section 5 again), while there are no restrictions on x_p and x_n.³ Our preliminary analysis on the influence of the number of nearest neighbours in Mc shows that this parameter has only a moderate effect on the final LM scores. We thus fix it to 3 positive and 3 negative samples for each x_w without any tuning. AP is optimised with Adagrad (Duchi et al., 2011) and a learning rate of 0.05; the gradients are clipped to ±2.⁴ A full summary of all hyper-parameters and their values is provided in Table 3.

³ This choice has been motivated by the observation that rare words tend to have other rare words as their nearest neighbours. Note that vectors of words from positive and negative examples, and not only cue words, also get updated by the AP method.
⁴ All scores with neural models are produced with our own implementations in TensorFlow (Abadi et al., 2016).

(Base) Language Models. The availability of LM evaluation sets in a large number of diverse languages, described in Section 2, now provides an opportunity to perform a full-fledged multilingual analysis of representative LM architectures. At the same time, these different architectures serve as the baselines for our novel model which fine-tunes the output matrix Mw. As mentioned, the traditional LM setup is to use words both on the input and on the output side (Goodman, 2001; Bengio et al., 2003; Deschacht and Moens, 2009), relying on n-gram word sequences. We evaluate a strong model from the n-gram family of models from the KenLM package (https://github.com/kpu/kenlm): it is based on 5-grams with extended Kneser-Ney smoothing (KN5) (Kneser and Ney, 1995; Heafield et al., 2013).⁵ The rationale behind including this non-neural model is to also probe the limitations of such n-gram-based LM architectures on a diverse set of languages.

⁵ We evaluate the default setup for this model using the option -interpolate_unigrams=1 which avoids assigning zero-probability to unseen words.

Recurrent neural networks (RNNs), especially Long Short-Term Memory networks (LSTMs), have taken over the LM universe recently (Mikolov et al., 2010; Sundermeyer et al., 2015; Chen et al., 2016, i.a.). These LMs map a sequence of input words to embedding vectors using a look-up matrix. The embeddings are passed to the LSTM as input, and the model is trained in an autoregressive fashion to predict the next word from the pre-defined vocabulary given the current context. As a strong baseline from this LM family, we train a standard LSTM LM (LSTM-Word) relying on the setup from Zaremba et al. (2015) (see Table 3).

Finally, a recent strand of LM work uses characters on the input side while retaining word-level prediction on the output side. A representative architecture from this group, also serving as the basis in our work (Section 3), is Char-CNN-LSTM (Kim et al., 2016).

All neural models operate on exactly the same vocabulary and treat out-of-vocabulary (OOV) words in exactly the same way. As mentioned, we include KN5 as a strong (non-neural) baseline to give perspective on how this more traditional model performs across 50 typologically diverse languages. We have selected the setup for the KN5 model to be as close as possible to that of the neural LMs. However, due to the different nature of the models, we note that the results between KN5 and the other models are not comparable. In KN5, discounts are added for low-frequency words, and unseen words at test time are regarded as outliers and assigned low probability estimates. In contrast, for all neural models we sample unseen word vectors to lie in the space of trained vectors (see before). We find the latter setup to better reflect our intuition that especially in MRLs unseen words are not outliers but often arise due to morphological complexity.

7 Results and Discussion

In this section, we present the main empirical findings of our work. The focus is on: a) the results of our novel language model with the AP fine-tuning procedure, and its comparison to the other language models in our comparison; b) the analysis of the LM results in relation to typological features and corpus statistics.

Table 4 lists all 50 test languages along with their language codes and provides the key statistics of our 50 LM evaluation benchmarks. The statistics include the number of word types in training data, the number of word types occurring in test data but unseen in training, as well as the total number of word tokens in both training and test data, and type-to-token ratios. Table 4 also shows the results for KN5, LSTM-Word, Char-CNN-LSTM, and our model with the AP fine-tuning.
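The paper does not spell out how unseen word vectors are "sampled to lie in the space of trained vectors"; the sketch below shows one plausible reading, sampling each dimension uniformly within the range spanned by the trained rows of Mw. The function and its parameters are our own illustration, not a confirmed detail of the authors' implementation.

```python
import numpy as np

def sample_unseen_vectors(Mw_trained, n_unseen, seed=0):
    """Draw vectors for unseen test-time words so they lie inside the region
    occupied by trained word vectors: each dimension is sampled uniformly
    between the per-dimension minimum and maximum observed in Mw_trained.
    (One possible interpretation of the paper's setup, not a confirmed detail.)"""
    rng = np.random.default_rng(seed)
    lo, hi = Mw_trained.min(axis=0), Mw_trained.max(axis=0)
    return rng.uniform(lo, hi, size=(n_unseen, Mw_trained.shape[1]))

# Toy usage: extend a trained 100 x 32 output matrix with 5 unseen-word rows.
rng = np.random.default_rng(42)
Mw = rng.normal(size=(100, 32))
Mw_full = np.vstack([Mw, sample_unseen_vectors(Mw, n_unseen=5)])
print(Mw_full.shape)   # (105, 32)
```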
Language (code) | Vocab Size (Train) | New Test Vocab | Tokens (Train) | Tokens (Test) | Type/Token (Train) | KN5 | LSTM | Char-CNN-LSTM | +AP | ∆ +AP
Amharic (am) | 89749 | 4805 | 511K | 39.2K | 0.18 | 1252 | 1535 | 981 | 817 | 164
Arabic (ar) | 89089 | 5032 | 722K | 54.7K | 0.12 | 2156 | 2587 | 1659 | 1604 | 55
Bulgarian (bg) | 71360 | 3896 | 670K | 49K | 0.11 | 610 | 651 | 415 | 409 | 6
Catalan (ca) | 61033 | 2562 | 788K | 59.4K | 0.08 | 358 | 318 | 241 | 238 | 3
Czech (cs) | 86783 | 4300 | 641K | 49.6K | 0.14 | 1658 | 2200 | 1252 | 1131 | 121
Danish (da) | 72468 | 3618 | 663K | 50.3K | 0.11 | 668 | 710 | 466 | 442 | 24
German (de) | 80741 | 4045 | 682K | 51.3K | 0.12 | 930 | 903 | 602 | 551 | 51
Greek (el) | 76264 | 3767 | 744K | 56.5K | 0.10 | 607 | 538 | 405 | 389 | 16
English (en) | 55521 | 2480 | 783K | 59.5K | 0.07 | 533 | 494 | 371 | 349 | 22
Spanish (es) | 60196 | 2721 | 781K | 57.2K | 0.08 | 415 | 366 | 275 | 270 | 5
Estonian (et) | 94184 | 3907 | 556K | 38.6K | 0.17 | 1609 | 2564 | 1478 | 1388 | 90
Basque (eu) | 81177 | 3365 | 647K | 47.3K | 0.13 | 560 | 533 | 347 | 309 | 38
Farsi (fa) | 52306 | 2041 | 738K | 54.2K | 0.07 | 355 | 263 | 208 | 205 | 3
Finnish (fi) | 115579 | 6489 | 585K | 44.8K | 0.20 | 2611 | 4263 | 2236 | 1858 | 378
French (fr) | 58539 | 2575 | 769K | 57.1K | 0.08 | 350 | 294 | 231 | 220 | 11
Hebrew (he) | 83217 | 3862 | 717K | 54.6K | 0.12 | 1797 | 2189 | 1519 | 1375 | 144
Hindi (hi) | 50384 | 2629 | 666K | 49.1K | 0.08 | 473 | 426 | 326 | 299 | 27
Croatian (hr) | 86357 | 4371 | 620K | 48.1K | 0.14 | 1294 | 1665 | 1014 | 906 | 108
Hungarian (hu) | 101874 | 5015 | 672K | 48.7K | 0.15 | 1151 | 1595 | 929 | 819 | 110
Indonesian (id) | 49125 | 2235 | 702K | 52.2K | 0.07 | 454 | 359 | 286 | 263 | 23
Italian (it) | 70194 | 2923 | 787K | 59.3K | 0.09 | 567 | 493 | 349 | 350 | -1
Japanese (ja) | 44863 | 1768 | 729K | 54.6K | 0.06 | 169 | 156 | 136 | 125 | 11
Javanese (jv) | 65141 | 4292 | 622K | 52K | 0.10 | 1387 | 1443 | 1158 | 1003 | 155
Georgian (ka) | 80211 | 3738 | 580K | 41.1K | 0.14 | 1370 | 1827 | 1097 | 939 | 158
Khmer (km) | 37851 | 1303 | 579K | 37.4K | 0.07 | 586 | 637 | 522 | 535 | -13
Kannada (kn) | 94660 | 4604 | 434K | 29.4K | 0.22 | 2315 | 5310 | 2558 | 2265 | 293
Korean (ko) | 143794 | 8275 | 648K | 50.6K | 0.22 | 5146 | 10063 | 4778 | 3821 | 957
Lithuanian (lt) | 81501 | 3791 | 554K | 41.7K | 0.15 | 1155 | 1415 | 854 | 827 | 27
Latvian (lv) | 75294 | 4564 | 587K | 45K | 0.13 | 1452 | 1967 | 1129 | 969 | 160
Malay (ms) | 49385 | 2824 | 702K | 54.1K | 0.07 | 776 | 725 | 525 | 513 | 12
Mongolian (mn) | 73884 | 4171 | 629K | 50K | 0.12 | 1392 | 1716 | 1165 | 1091 | 74
Burmese (my) | 20574 | 755 | 576K | 46.1K | 0.04 | 209 | 212 | 182 | 180 | 2
Min Nan (nan) | 33238 | 1404 | 1.2M | 65.6K | 0.03 | 61 | 43 | 39 | 38 | 1
Dutch (nl) | 60206 | 2626 | 708K | 53.8K | 0.08 | 397 | 340 | 267 | 248 | 19
Norwegian (no) | 69761 | 3352 | 674K | 47.8K | 0.10 | 534 | 513 | 379 | 346 | 33
Polish (pl) | 97325 | 4526 | 634K | 47.7K | 0.15 | 1741 | 2641 | 1491 | 1328 | 163
Portuguese (pt) | 56167 | 2394 | 780K | 59.3K | 0.07 | 342 | 272 | 214 | 202 | 12
Romanian (ro) | 68913 | 3079 | 743K | 52.5K | 0.09 | 384 | 359 | 256 | 247 | 9
Russian (ru) | 98097 | 3987 | 666K | 48.4K | 0.15 | 1128 | 1309 | 812 | 715 | 97
Slovak (sk) | 88726 | 4521 | 618K | 45K | 0.14 | 1560 | 2062 | 1275 | 1151 | 124
Slovene (sl) | 83997 | 4343 | 659K | 49.2K | 0.13 | 1114 | 1308 | 776 | 733 | 43
Serbian (sr) | 81617 | 3641 | 628K | 46.7K | 0.13 | 790 | 961 | 582 | 547 | 35
Swedish (sv) | 77499 | 4109 | 688K | 50.4K | 0.11 | 843 | 832 | 583 | 543 | 40
Tamil (ta) | 106403 | 6017 | 507K | 39.6K | 0.21 | 3342 | 6234 | 3496 | 2768 | 728
Thai (th) | 30056 | 1300 | 628K | 49K | 0.05 | 233 | 241 | 206 | 199 | 7
Tagalog (tl) | 72416 | 3791 | 972K | 66.3K | 0.07 | 379 | 298 | 219 | 211 | 8
Turkish (tr) | 90840 | 4608 | 627K | 45K | 0.14 | 1724 | 2267 | 1350 | 1290 | 60
Ukrainian (uk) | 89724 | 4983 | 635K | 47K | 0.14 | 1639 | 1893 | 1283 | 1091 | 192
Vietnamese (vi) | 32055 | 1160 | 754K | 61.9K | 0.04 | 197 | 190 | 158 | 165 | -7
Chinese (zh) | 43672 | 1653 | 746K | 56.8K | 0.06 | 1064 | 826 | 797 | 762 | 35
Isolating (avg) | 40930 | 1825 | 759K | 54K | 0.05 | 440 | 392 | 326 | 318 | 8
Fusional (avg) | 73499 | 3532 | 689K | 51.3K | 0.11 | 842 | 969 | 618 | 566 | 52
Introflexive (avg) | 87352 | 4566 | 650K | 49.5K | 0.14 | 1735 | 2104 | 1386 | 1265 | 121
Agglutinative (avg) | 91051 | 4687 | 603K | 45K | 0.16 | 1898 | 3164 | 1727 | 1473 | 254

Table 4: Test perplexities for 50 languages (ISO 639-1 codes sorted alphabetically) in the full-vocabulary prediction LM setup. Left: basic statistics of our evaluation data. Middle: results with the baseline LMs. Note that the absolute scores in the KN5 column are not comparable to the scores obtained with neural models (see Section 6). Right: results with Char-CNN-LSTM and our AP fine-tuning strategy. ∆ indicates the difference in performance over the original Char-CNN-LSTM model.

Furthermore, a visualisation of the Char-CNN-LSTM+AP model as a function of type/token ratio is shown in Figure 2.

7.1 Fine-Tuning the Output Matrix

First, we test the impact of our AP fine-tuning method. As the main finding, the inclusion of fine-tuning into Char-CNN-LSTM (this model is termed
+AP) yields improvements on a large number of test languages. The model is better than both strong neural baseline language models for 47/50 languages, and it improves over the original Char-CNN-LSTM LM for 47/50 languages. The largest gains are indicated for the subset of agglutinative MRLs (e.g., 950 perplexity points in Korean, with large gains also marked for FI, HE, EL, HU, TA, ET). We also observe large gains for the three introflexive languages included in our study (Amharic, Arabic, Hebrew).

While these large absolute gains may be partially attributed to the exponential nature of the perplexity measure, one cannot ignore the substantial relative gains achieved by our models: e.g., EU (∆PPL = 38) improves more than a fusional language like DA (∆PPL = 24) even with a lower baseline perplexity. This suggests that injecting subword-level information is more straightforward for the former: in agglutinative languages, the mapping between morphemes and meanings is less ambiguous. Moreover, the number of words that benefit from the injection of character-based information is larger for agglutinative languages, because they also tend to display the highest inflectional synthesis.

For the opposite reasons, we do not surpass Char-CNN-LSTM in a few fusional (IT) and isolating languages (KM, VI). We also observe improvements for Slavic languages with rich morphology (RU, HR, PL). The gains are also achieved for some isolating and fusional languages with smaller vocabularies and a smaller number of rare words, e.g., in Tagalog, English, Catalan, and Swedish. This suggests that our method for fine-tuning the LM prediction is not restricted to MRLs only, and has the ability to improve the estimation for rare words in multiple typologically diverse languages.

7.2 Language Models, Typological Features, and Corpus Statistics

In the next experiment, we estimate the correlation strength of all perplexity scores with a series of independent variables. The variables are 1) type-token ratio in the train data; 2) new word types in the test data; 3) the morphological type of the language among isolating, fusional, introflexive, and agglutinative, capturing different aspects related to the morphological richness of a language.

Results with Pearson's ρ (numerical) and η² in one-way ANOVA (categorical) are shown in Table 5. Significance tests show p-values < 10⁻³ for all combinations of models and independent variables, demonstrating that all of them are good performance predictors. Our main finding indicates that linguistic categories and data statistics both correlate well (≈0.35 and ≈0.82, respectively) with the performance of language models.

Figure 2: Perplexity results with Char-CNN-LSTM+AP (y-axis) in relation to type/token ratio (x-axis). For language codes, see Table 4.

For the categorical variables we compare the mean values per category with the numerical dependent variable. As such, η² can be interpreted as the amount of variation explained by the model; the resulting high correlations suggest that perplexities tend to be homogeneous for languages of a same morphological type, especially so for state-of-the-art models.

This is intuitively evident in Figure 2, where perplexity scores of Char-CNN-LSTM+AP are plotted against type/token ratio. Isolating languages are placed on the left side of the spectrum as expected, with low type/token ratio and good performance (e.g., VI, ZH). As for fusional languages, sub-groups behave differently. We find that Romance and Germanic languages display roughly the same level of performance as isolating languages, despite their overall larger type/token ratio. Balto-Slavic languages (e.g., CS, LV) instead show both higher perplexities and higher type/token ratio. These differences may be explained in terms of different inflectional synthesis.
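For reference, the two statistics reported in Table 5 can be computed as below. The eta_squared() function follows the standard one-way ANOVA effect size (between-group sum of squares over total sum of squares), which we assume is what the paper uses; the toy perplexities, type/token ratios and morphological labels are illustrative only.

```python
import numpy as np

def pearson_rho(x, y):
    """Pearson correlation between two numerical variables."""
    return float(np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1])

def eta_squared(values, groups):
    """One-way ANOVA effect size: SS_between / SS_total, i.e. the share of the
    variance in `values` explained by the categorical `groups`."""
    values, groups = np.asarray(values, float), np.asarray(groups)
    grand_mean = values.mean()
    ss_total = ((values - grand_mean) ** 2).sum()
    ss_between = sum(
        values[groups == g].size * (values[groups == g].mean() - grand_mean) ** 2
        for g in np.unique(groups)
    )
    return float(ss_between / ss_total)

# Toy usage with a handful of illustrative (perplexity, type/token, type) points.
ppl   = np.array([318.0, 566.0, 1265.0, 1473.0, 349.0, 1858.0])
ttr   = np.array([0.05, 0.11, 0.14, 0.16, 0.07, 0.20])
types = np.array(["isolating", "fusional", "introflexive", "agglutinative",
                  "fusional", "agglutinative"])
print(pearson_rho(ttr, ppl))    # analogue of the Pearson's rho rows in Table 5
print(eta_squared(ppl, types))  # analogue of the one-way ANOVA eta^2 rows
```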
Independent Variable | Dependent | Statistical Test | KN5 | LSTM | +Char-CNN | ++AP
Train type/token | PPL | Pearson's ρ | 0.833 | 0.813 | 0.823 | 0.831
Test new types | PPL | Pearson's ρ | 0.860 | 0.803 | 0.818 | 0.819
Morphology | PPL | one-way ANOVA η² | 0.354 | 0.338 | 0.369 | 0.374

Independent Variable | Dependent | Statistical Test | LSTM vs +CharCNN | +CharCNN vs ++AP
Train type/token | ∆PPL | Pearson's ρ | 0.729 | 0.778
Morphology | ∆PPL | one-way ANOVA η² | 0.308 | 0.284

Table 5: Correlations between model performance and language typology as well as with corpus statistics (type/token ratio and new word types in test data). All variables are good performance predictors.

Introflexive and agglutinative languages can be found mostly on the right side of the spectrum in terms of performance (see Figure 2). Although the languages with the highest absolute perplexity scores are certainly classified as agglutinative (e.g., Dravidian languages such as KN and TA), we also find some outliers among the agglutinative languages (EU) with remarkably low perplexity scores.

7.3 Corpus Size and Type/Token Ratio

Building on the strong correlation between type/token ratio and model performance from Section 7.2, we now further analyse the results in light of corpus size and type/token statistics. The LM datasets for our 50 languages are similar in size to the widely used English PTB dataset (Marcus et al., 1993). As such, we hope that these evaluation datasets can help guide multilingual language modeling research across a wide spectrum of languages.

However, our goal now is to verify that type/token ratio, and not absolute corpus size, is the deciding factor when unraveling the limitations of standard LM architectures across different languages. To this end, we conduct additional experiments on all languages of the recent Multilingual Wikipedia Corpus (MWC) (Kawakami et al., 2017) for language modeling, using the same setup as before (see Table 3). The corpus provides datasets for 7 languages from the same domain as our benchmarks (Wikipedia), and comes in two sizes. We choose the larger corpus variant for each language, which provides about 3-5 times as many tokens as contained in our datasets from Table 4.

The results on the MWC evaluation data along with corpus statistics are summarised in Table 6. As one important finding, we observe that the gains in perplexity using our fine-tuning AP method extend also to these larger evaluation datasets. In particular, we find improvements of the same magnitude as in the PTB-sized datasets over the strongest baseline model (Char-CNN-LSTM) for all MWC languages. For instance, perplexity is reduced from 1781 to 1578 for Russian, and from 365 to 352 for English. We also observe a gain for French and Spanish, with perplexity reduced from 282 to 272 and from 255 to 243, respectively.

In addition, we test on samples of the Europarl corpus (Koehn, 2005; Tiedemann, 2012) which contains approximately 10 times more tokens than our PTB-sized evaluation data: we use 400K sentences from Europarl for training and testing. However, this data comes from a much narrower domain of parliamentary proceedings: this property yields a very low type/token ratio, as visible from Table 6. In fact, we find the type/token ratio in this corpus to be on the same level or even smaller than isolating languages (compare with the scores in Table 4): 0.02 for Dutch and 0.03 for Czech. This leads to similar perplexities with and without +AP for these two selected test languages. The third EP test language, Finnish, has a slightly higher type/token ratio. Consequently, we do observe an improvement of 10 points in perplexity. A more detailed analysis of this phenomenon follows.

Table 7 displays the overall type/token ratio in the training set of these corpora. We observe that the MWC has comparable or even higher type/token ratios than the smaller sets despite its increased size. The corpus has been constructed by sampling the data from a variety of different Wikipedia categories (Kawakami et al., 2017): it can therefore be regarded as more diverse and challenging to model.
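Type/token ratio, the statistic this analysis revolves around, is simple to compute; the sketch below also traces how the ratio evolves as a corpus grows, which is the behaviour plotted in Figure 3. Whitespace tokenisation and the tiny repetitive corpus are simplifications for the example (the paper tokenises with Polyglot and measures up to 800K sentences).

```python
def type_token_ratio(tokens):
    """Number of distinct word types divided by the number of word tokens."""
    return len(set(tokens)) / len(tokens)

def ttr_curve(sentences, step):
    """Type/token ratio measured on growing prefixes of a corpus
    (the quantity plotted against corpus size in Figure 3)."""
    tokens, curve = [], []
    for i, sent in enumerate(sentences, start=1):
        tokens.extend(sent.split())           # whitespace tokenisation for the sketch
        if i % step == 0:
            curve.append((i, type_token_ratio(tokens)))
    return curve

# Toy usage: a tiny repetitive corpus; the real curves use Wikipedia/Europarl data.
corpus = ["the cat sat on the mat", "the dog sat on the mat", "a cat saw a dog"] * 100
print(ttr_curve(corpus, step=100))
```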
Lang | Corpus | Vocab (Train) | Vocab (Test) | Tokens (Train) | Tokens (Test) | Type/Token (Train) | Char-CNN-LSTM | +AP
nl | EP | 197K | 200K | 10M | 255K | 0.02 | 62 | 63
cs | EP | 265K | 268K | 7.9M | 193K | 0.03 | 180 | 186
en | MWC | 310K | 330K | 5.0M | 0.5M | 0.06 | 365 | 352
es | MWC | 258K | 277K | 3.7M | 0.4M | 0.07 | 255 | 243
fr | MWC | 260K | 278K | 4.0M | 0.5M | 0.07 | 282 | 272
fi | EP | 459K | 465K | 6.8M | 163K | 0.07 | 515 | 505
de | MWC | 394K | 420K | 3.8M | 0.3M | 0.10 | 710 | 665
ru | MWC | 372K | 399K | 2.5M | 0.3M | 0.15 | 1781 | 1578
cs | MWC | 241K | 258K | 1.5M | 0.2M | 0.16 | 2396 | 2159
fi | MWC | 320K | 343K | 1.5M | 0.1M | 0.21 | 5300 | 4911

Table 6: Results on the larger MWC dataset (Kawakami et al., 2017) and on a subset of the Europarl (EP) corpus. Improvements with +AP are not dependent on corpus size, but rather they strongly correlate with the type/token ratio of the corpus.

Language | Our Data | MWC | Europarl
Czech | 0.13 | 0.16 | 0.03
German | 0.12 | 0.10 | -
English | 0.06 | 0.06 | -
Spanish | 0.07 | 0.07 | -
Finnish | 0.20 | 0.21 | 0.07
French | 0.07 | 0.07 | -
Russian | 0.14 | 0.15 | -
Dutch | 0.09 | - | 0.02

Table 7: Comparison of type/token ratios in the corpora used for evaluation. The ratio is not dependent only on the corpus size but also on the language and domain of the corpus.

Europarl on the other hand shows substantially lower type/token ratios, presumably due to its narrower domain and more repetitive nature.

In general, we find that although the type/token ratio decreases with increasing corpus size, the decreasing rate slows down dramatically at a certain point (Herdan, 1960; Heaps, 1978). This depends on the typology of the language and the domain of the corpus. Figure 3 shows the empirical proof of this intuition: we show the variation of type/token ratios in Wikipedia and Europarl with increasing corpus size. We can see that in a very large corpus of 800K sentences, the type/token ratio in MRLs such as Korean or Finnish stays close to 0.1, a level where we still expect an improvement in perplexity with the proposed AP fine-tuning method applied on top of Char-CNN-LSTM.

Figure 3: Type/token ratio values vs. corpus size. A domain-specific corpus (Europarl) has a lower type/token ratio than a more general corpus (Wikipedia), regardless of the absolute corpus size.

In order to isolate and verify the effect of the type/token ratio, we now present results on synthetically created datasets where the ratio is controlled explicitly. We experiment with subsets of the German Wikipedia with an equal number of sentences (25K)⁶, a comparable number of tokens, but varying type/token ratio. We generate these controlled datasets by clustering sparse bag-of-words sentence vectors with the k-means algorithm, sampling from different clusters, and then selecting the final combinations according to their type/token ratio and the number of tokens.

⁶ We split the data into 20K training, 2.5K validation and 2.5K test sentences.
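A sketch of the controlled-subset construction just described: sentences are represented as sparse bag-of-words vectors, clustered with k-means, and candidate subsets are assembled from chosen cluster combinations so that their token counts and type/token ratios can be compared (as in Table 8). scikit-learn is used here purely for illustration; the paper does not name the implementation it used, and the sentences and cluster combinations below are invented.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

def type_token_ratio(sentences):
    tokens = " ".join(sentences).split()
    return len(set(tokens)) / len(tokens)

def controlled_subsets(sentences, n_clusters, combinations, seed=0):
    """Cluster sparse bag-of-words sentence vectors with k-means and build one
    candidate subset per cluster combination; combinations can then be kept or
    discarded according to their type/token ratio and token count."""
    X = CountVectorizer().fit_transform(sentences)        # sparse bag-of-words vectors
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    subsets = {}
    for combo in combinations:
        subset = [s for s, lab in zip(sentences, labels) if lab in combo]
        subsets[combo] = (len(subset), type_token_ratio(subset))
    return subsets

# Toy usage with invented sentences; the paper uses 25K German Wikipedia sentences
# and combinations such as (2,), (2, 4), (2, 4, 5, 9).
sents = ["the match ended in a draw", "the team won the cup final",
         "the enzyme binds the substrate", "proteins fold into structures",
         "stock prices fell sharply", "markets rallied after the report"] * 20
print(controlled_subsets(sents, n_clusters=3, combinations=[(0,), (0, 1)]))
```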
Corpora statistics along with corresponding perplexity scores are shown in Table 8, and plotted in Figure 4.

Clusters | Vocab (Train) | Vocab (Test) | Tokens (Train) | Tokens (Test) | Type/Token (Train) | Char-CNN-LSTM | +AP
2 | 48K | 52K | 382K | 47K | 0.13 | 225 | 217
2, 4 | 69K | 75K | 495K | 62K | 0.14 | 454 | 420
2, 4, 5, 9 | 78K | 84K | 494K | 62K | 0.16 | 605 | 547
5, 9 | 84K | 91K | 492K | 62K | 0.17 | 671 | 612
5 | 66K | 72K | 372K | 46K | 0.18 | 681 | 598

Table 8: Results on German with datasets of comparable size and increasing type/token ratio.

Figure 4: Visualisation of results from Table 8. The AP method is especially helpful for corpora with high type/token ratios.

These results clearly demonstrate and verify that the effectiveness of the AP method increases for corpora with higher type/token ratios. This finding also further supports the usefulness of the proposed method for morphologically-rich languages in particular, where such high type/token ratios are expected.

8 Conclusion

We have presented a comprehensive language modeling study over a set of 50 typologically diverse languages. The languages were carefully selected to represent a wide spectrum of different morphological systems that are found among the world's languages. Our comprehensive study provides new benchmarks and language modeling baselines which should guide the development of next-generation language models focused on the challenging multilingual setting.

One particular LM challenge is an effective learning of parameters for infrequent words, especially for morphologically-rich languages (MRLs). The methodological contribution of this work is a new neural approach which enriches word vectors at the LM output with subword-level information to capture similar character sequences and consequently to facilitate word-level LM prediction. Our method has been implemented as a fine-tuning step which gradually refines word vectors during the LM training, based on subword-level knowledge extracted in an unsupervised manner from character-aware CNN layers. Our approach yields gains for 47/50 languages in the challenging full-vocabulary setup, with largest gains reported for MRLs such as Korean or Finnish. We have also demonstrated that the gains extend to larger training corpora, and are well correlated with the type-to-token ratio in the training data.

In future work we plan to deal with the open vocabulary LM setup and extend our framework to also handle unseen words at test time. One interesting avenue might be to further fine-tune the LM prediction based on additional evidence beyond purely contextual information. In summary, we hope that this article will encourage further research into learning semantic representations for rare and unseen words, and steer further developments in multilingual language modeling across a large number of diverse languages. Code and data are available online: http://people.ds.cam.ac.uk/dsg40/lmmrl.html.

Acknowledgments

This work is supported by the ERC Consolidator Grant LEXICAL (648909). We thank all editors and reviewers for their helpful feedback and suggestions.
References

Martin Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Jon Shlens, Benoit Steiner, Ilya Sutskever, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Oriol Vinyals, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. CoRR, abs/1603.04467.

Oliver Adams, Adam Makarucha, Graham Neubig, Steven Bird, and Trevor Cohn. 2017. Cross-lingual word embeddings for low-resource language modeling. In Proceedings of EACL, pages 937–947.

Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual NLP. In Proceedings of CoNLL, pages 183–192.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.

Balthasar Bickel and Johanna Nichols. 2013. Inflectional Synthesis of the Verb. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Ondřej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2013. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the 8th Workshop on Statistical Machine Translation, pages 1–44.

Jan A. Botha and Phil Blunsom. 2014. Compositional morphology for word representations and language modelling. In Proceedings of ICML, pages 1899–1907.

Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Return of the devil in the details: Delving deep into convolutional nets. In Proceedings of BMVC.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, and Phillipp Koehn. 2014. One billion word benchmark for measuring progress in statistical language modeling. In Proceedings of INTERSPEECH, pages 2635–2639.

Xie Chen, Xunying Liu, Yanmin Qian, M. J. F. Gales, and Philip C. Woodland. 2016. CUED-RNNLM: An open-source toolkit for efficient training and evaluation of recurrent neural network language models. In Proceedings of ICASSP, pages 6000–6004.

Ryan Cotterell, Hinrich Schütze, and Jason Eisner. 2016. Morphological smoothing and extrapolation of word embeddings. In Proceedings of ACL, pages 1651–1660.

Koen Deschacht and Marie-Francine Moens. 2009. Semi-supervised semantic role labeling using the latent words language model. In Proceedings of EMNLP, pages 21–29.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159.

Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard H. Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL-HLT, pages 1606–1615.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. Bradford Books.

Katja Filippova, Enrique Alfonseca, Carlos A. Colmenares, Lukasz Kaiser, and Oriol Vinyals. 2015. Sentence compression by deletion with LSTMs. In Proceedings of EMNLP, pages 360–368.

Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The Paraphrase Database. In Proceedings of NAACL-HLT, pages 758–764.

Joshua T. Goodman. 2001. A bit of progress in language modeling. Computer Speech & Language, 15(4):403–434.

Edouard Grave, Moustapha Cissé, and Armand Joulin. 2017. Unbounded cache model for online language modeling with open vocabulary. In Proceedings of NIPS, pages 6044–6054.

Alex Graves. 2013. Generating sequences with recurrent neural networks. CoRR, abs/1308.0850.

Derek Greene and Padraig Cunningham. 2006. Practical solutions to the problem of diagonal dominance in kernel document clustering. In Proceedings of ICML, pages 377–384.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable modified Kneser-Ney language model estimation. In Proceedings of ACL, pages 690–696.

Harold Stanley Heaps. 1978. Information Retrieval, Computational and Theoretical Aspects. Academic Press.

Gustav Herdan. 1960. Type-token Mathematics, volume 4. Mouton.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation, 9(8):1735–1780.

Marcus Hutter. 2012. The human knowledge compression contest.

Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. In Proceedings of ICML.

Dan Jurafsky and James H. Martin. 2017. Speech and Language Processing, volume 3. Pearson.
Kazuya Kawakami, Chris Dyer, and Phil Blunsom. 2017. Learning to create and reuse words in open-vocabulary neural language modeling. In Proceedings of ACL, pages 1492–1502.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In Proceedings of AAAI, pages 2741–2749.

Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for M-gram language modeling. In Proceedings of ICASSP, pages 181–184.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the 10th Machine Translation Summit, pages 79–86.

Yann LeCun, Bernhard E. Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne E. Hubbard, and Lawrence D. Jackel. 1989. Handwritten digit recognition with a back-propagation network. In Proceedings of NIPS, pages 396–404.

Wang Ling, Tiago Luís, Luís Marujo, Ramón Fernández Astudillo, Silvio Amir, Chris Dyer, Alan W. Black, and Isabel Trancoso. 2015. Finding function in form: Compositional character models for open vocabulary word representation. In Proceedings of EMNLP, pages 1520–1530.

Minh-Thang Luong and Christopher D. Manning. 2016. Achieving open vocabulary neural machine translation with hybrid word-character models. In Proceedings of ACL, pages 1054–1063.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of ACL, pages 142–150.

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proceedings of INTERSPEECH, pages 1045–1048.

Yasumasa Miyamoto and Kyunghyun Cho. 2016. Gated word-character recurrent language model. In Proceedings of EMNLP, pages 1992–1997.

Nikola Mrkšić, Ivan Vulić, Diarmuid Ó Séaghdha, Ira Leviant, Roi Reichart, Milica Gašić, Anna Korhonen, and Steve Young. 2017. Semantic specialisation of distributional word vector spaces using monolingual and cross-lingual constraints. Transactions of the ACL, 5:309–324.

Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of ICML, pages 807–814.

Nelleke Oostdijk. 2000. The spoken Dutch corpus. Overview and first evaluation. In Proceedings of LREC, pages 887–894.

Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models. In Proceedings of EACL, pages 157–163.

Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of AAAI, pages 3776–3784.

Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Highway networks. In Proceedings of the ICML Deep Learning Workshop.

Martin Sundermeyer, Hermann Ney, and Ralf Schluter. 2015. From feedforward to recurrent LSTM neural networks for language modeling. IEEE Transactions on Audio, Speech and Language Processing, 23(3):517–529.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of LREC, pages 2214–2218.

Clara Vania and Adam Lopez. 2017. From characters to words to in between: Do we capture morphology? In Proceedings of ACL, pages 2016–2027.

Ashish Vaswani, Yinggong Zhao, Victoria Fossum, and David Chiang. 2013. Decoding with large-scale neural language models improves translation. In Proceedings of EMNLP, pages 1387–1392.

Lyan Verwimp, Joris Pelemans, Hugo Van hamme, and Patrick Wambacq. 2017. Character-word LSTM language models. In Proceedings of EACL, pages 417–427.

Ivan Vulić, Nikola Mrkšić, Roi Reichart, Diarmuid Ó Séaghdha, Steve Young, and Anna Korhonen. 2017. Morph-fitting: Fine-tuning word vector spaces with simple language-specific rules. In Proceedings of ACL, pages 56–68.

Tian Wang and Kyunghyun Cho. 2016. Larger-context language modelling with recurrent neural network. In Proceedings of ACL, pages 1319–1329.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. From paraphrase database to compositional paraphrase model and back. Transactions of the ACL, 3:345–358.

Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2015. Recurrent neural network regularization. In Proceedings of ICLR.

Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In Proceedings of ECCV, pages 818–833.

George Kingsley Zipf. 1949. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Martino Fine Books.