Transactions of the Association for Computational Linguistics, vol. 6, pp. 451–465, 2018. Action Editor: Brian Roark.
Submission batch: 12/2017; Revision batch: 5/2018; Published 7/2018.
© 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Language Modeling for Morphologically Rich Languages: Character-Aware Modeling for Word-Level Prediction

Daniela Gerz1, Ivan Vulić1, Edoardo Ponti1, Jason Naradowsky3, Roi Reichart2, Anna Korhonen1
1 Language Technology Lab, DTAL, University of Cambridge
2 Faculty of Industrial Engineering and Management, Technion, IIT
3 Johns Hopkins University
1{dsg40,iv250,ep490,alk23}@cam.ac.uk  2roiri@ie.technion.ac.il  3narad@jhu.edu

Abstract

Neural architectures are prominent in the construction of language models (LMs). However, word-level prediction is typically agnostic of subword-level information (characters and character sequences) and operates over a closed vocabulary, consisting of a limited word set. Indeed, while subword-aware models boost performance across a variety of NLP tasks, previous work did not evaluate the ability of these models to assist next-word prediction in language modeling tasks. Such subword-level informed models should be particularly effective for morphologically-rich languages (MRLs) that exhibit high type-to-token ratios. In this work, we present a large-scale LM study on 50 typologically diverse languages covering a wide variety of morphological systems, and offer new LM benchmarks to the community, while considering subword-level information. The main technical contribution of our work is a novel method for injecting subword-level information into semantic word vectors, integrated into the neural language modeling training, to facilitate word-level prediction. We conduct experiments in the LM setting where the number of infrequent words is large, and demonstrate strong perplexity gains across our 50 languages, especially for morphologically-rich languages. Our code and datasets are publicly available.

1 Introduction

Language Modeling (LM) is a key NLP task, serving as an important component for applications that require some form of text generation, such as machine translation (Vaswani et al., 2013), speech recognition (Mikolov et al., 2010), dialogue generation (Serban et al., 2016), or summarisation (Filippova et al., 2015). A traditional recurrent neural network (RNN) LM setup operates on a limited closed vocabulary of words (Bengio et al., 2003; Mikolov et al., 2010). The limitation arises due to the model learning parameters exclusive to single words. A standard training procedure for neural LMs gradually modifies the parameters based on contextual/distributional information: each occurrence of a word token in training data contributes to the estimate of a word vector (i.e., model parameters) assigned to this word type. Low-frequency words therefore often have incorrect estimates, not having moved far from their random initialisation. A common strategy for dealing with this issue is to simply exclude the low-quality parameters from the model (i.e., to replace them with the
in the full vocabulary setup (Adams et al., 2017), it is of crucial importance to construct and enable techniques that can obtain these parameters in alternative ways. One solution is to draw information from additional sources, such as characters and character sequences. As a consequence, such character-aware models should facilitate LM word-level prediction in a real-life LM setup which deals with a large amount of low-frequency or unseen words. Efforts in this direction have yielded exciting results, primarily on the input side of neural LMs.

A standard RNN LM architecture relies on two word representation matrices learned during training for its input and next-word prediction. This effectively means that there are two sets of per-word specific parameters that need to be trained. Recent work shows that it is possible to generate a word representation on-the-fly based on its constituent characters, thereby effectively solving the problem for the parameter set on the input side of the model (Kim et al., 2016; Luong and Manning, 2016; Miyamoto and Cho, 2016; Ling et al., 2015). However, it is not straightforward how to advance these ideas to the output side of the model, as this second set of word-specific parameters is directly responsible for the next-word prediction: it has to encode a much wider range of information, such as topical and semantic knowledge about words, which cannot be easily obtained from its characters alone (Jozefowicz et al., 2016). While one solution is to directly output characters instead of words (Graves, 2013; Miyamoto and Cho, 2016), a recent work from Jozefowicz et al. (2016) suggests that such purely character-based architectures, which do not reserve parameters for information specific to single words, cannot attain state-of-the-art LM performance on word-level prediction.

In this work, we combine the two worlds and propose a novel LM approach which relies on both word-level (i.e., contextual) and subword-level knowledge. In addition to training word-specific parameters for word-level prediction using a regular LM objective, our method encourages the parameters to also reflect subword-level patterns by injecting knowledge about morphology. This information is extracted in an unsupervised manner based on already available information in convolutional filters from earlier network layers. The proposed method leads to large improvements in perplexity across a wide spectrum of languages: 22 in English, 144 in Hebrew, 378 in Finnish, 957 in Korean on our LM benchmarks. We also show that the gains extend to another multilingual LM evaluation set, compiled recently for 7 languages by Kawakami et al. (2017).

We conduct a systematic LM study on 50 typologically diverse languages, sampled to represent a variety of morphological systems. We discuss the implications of typological diversity on the LM task, both theoretically in Section 2 and empirically in Section 7; we find a clear correspondence between performance of state-of-the-art LMs and structural linguistic properties. Further, the consistent perplexity gains across the large sample of languages suggest wide applicability of our novel method.

Finally, this article can also be read as a comprehensive multilingual analysis of current LM architectures on a set of languages which is much larger than the ones used in recent LM work (Botha and Blunsom, 2014; Vania and Lopez, 2017; Kawakami et al., 2017). We hope that this article with its new datasets, methodology and models, all available online at http://people.ds.cam.ac.uk/dsg40/lmmrl.html, will pave the way for true multilingual research in language modeling.

2 LM Data and Typological Diversity

A language model defines a probability distribution over sequences of tokens, and is typically trained to maximise the likelihood of token input sequences. Formally, the LM objective is expressed as follows:

P(t_1, ..., t_n) = \prod_i P(t_i | t_1, ..., t_{i-1}).    (1)

t_i is a token with the index i in the sequence. For word-level prediction a token corresponds to one word, whereas for character-level (also termed char-level) prediction it is one character.
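To make Eq. (1) and the perplexity metric used later for evaluation (Section 6) concrete, here is a minimal Python sketch; the per-token probabilities are invented toy values and the function names are ours, not part of the paper's released code.

```python
import math

def sequence_log_prob(step_probs):
    """Chain rule of Eq. (1): log P(t_1..t_n) = sum_i log P(t_i | t_1..t_{i-1}).

    `step_probs` holds the model's probability for each observed token
    given its history, i.e. P(t_1), P(t_2 | t_1), ...
    """
    return sum(math.log(p) for p in step_probs)

def perplexity(step_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    n = len(step_probs)
    return math.exp(-sequence_log_prob(step_probs) / n)

# Toy example: a 4-token sequence with made-up per-token probabilities.
probs = [0.2, 0.1, 0.5, 0.05]
print(sequence_log_prob(probs))   # joint log-probability of the sequence
print(perplexity(probs))          # lower is better; 1.0 would be a perfect model
```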
LMs are most commonly tested on Western European languages. Standard LM benchmarks in English include the Penn Treebank (PTB) (Marcus et al., 1993), the 1 Billion Word Benchmark (BWB) (Chelba et al., 2014), and the Hutter Prize data (Hutter, 2012). English datasets extracted from BBC News (Greene and Cunningham, 2006) and IMDB Movie Reviews (Maas et al., 2011) are also used for LM evaluation (Wang and Cho, 2016; Miyamoto and Cho, 2016; Press and Wolf, 2017).

Regarding multilingual LM evaluation, Botha and Blunsom (2014) extract datasets for other languages from the sets provided by the 2013 Workshop on Statistical Machine Translation (WMT) (Bojar et al., 2013): they experiment with Czech, French, Spanish, German and Russian. A recent work of Kim et al. (2016) reuses these datasets and adds Arabic. Ling et al. (2015) evaluate on English, Portuguese, Catalan, German and Turkish datasets extracted from Wikipedia. Verwimp et al. (2017) use a subset of the Corpus of Spoken Dutch (Oostdijk, 2000) for Dutch LM. Kawakami et al. (2017) evaluate on 7 European languages using Wikipedia data, including Finnish.

Perhaps the largest and most diverse set of languages used for multilingual LM evaluation so far is the one of Vania and Lopez (2017). Their study includes 10 languages in total representing several morphological types (fusional, e.g., Russian, and agglutinative, e.g., Finnish), as well as languages with particular morphological phenomena (root-and-pattern in Hebrew and reduplication in Malay). In this work, we provide LM evaluation datasets for 50 typologically diverse languages, with their selection guided by structural properties.

Language Selection. Aiming for a comprehensive multilingual LM evaluation, we include languages for all possible types of morphological systems. Our starting point is the Polyglot Wikipedia (PW) (Al-Rfou et al., 2013). While at first PW seems comprehensive and quite large already (covering 40 languages), the majority of the PW languages are similar from both a genealogical perspective (26/40 are Indo-European) and a geographic perspective (28/40 Western European). As a consequence, they share many patterns and are not a representative sample of the world's languages.

In order to quantitatively analyse global trends and cross-linguistic generalisations across a large set of languages, we propose to test on all PW languages and source additional data from the same domain, Wikipedia¹, considering candidates in descending order of corpus size and morphological type.

¹ Chinese, Japanese, and Thai are sourced from Wikipedia and processed with the Polyglot tokeniser since we found their preprocessing in the PW is not adequate for language modeling.

Figure 1: An illustration of the Char-CNN-LSTM LM and our fine-tuning post-processing method. After each epoch we adapt word-level vectors in the softmax embedding Mw using samples based on features from the char-level convolutional filters. The figure follows the model flow bottom to the top.

Traditionally, languages have been grouped into the four main types: isolating, fusional, introflexive and agglutinative, based on their position along a spectrum measuring their preference to break up concepts into many words (on one extreme) or rather to compose them into single words (on the other extreme). However, even languages belonging to the same type display different out-of-vocabulary rates and type-token ratios. This happens because languages specify different subsets of grammatical categories (such as tense for verbs, or number for nouns) and values (such as future for tense, plural for number). The amount of grammatical categories expressed in a language determines its inflectional synthesis (Bickel and Nichols, 2013).

In our final sample of languages, we select languages belonging to morphological types different from the fusional one, which is over-represented in the PW. In particular, we include new isolating (Min Nan, Burmese, Khmer), agglutinative (Basque, Georgian, Kannada, Tamil, Mongolian, Javanese), and introflexive languages (Amharic).
3 Underlying LM: Char-CNN-LSTM

As the underlying model we opt for the state-of-the-art neural LM architecture of Kim et al. (2016): it has been shown to work across a number of languages and in a large-scale setup (Jozefowicz et al., 2016). It already provides a solution for the input side parameters of the model by building word vectors based on the word's constituent character sequences. However, its output side still operates with a standard word-level matrix within the closed and limited vocabulary assumption. We refer to this model as Char-CNN-LSTM and describe its details in the following. Figure 1 (left) illustrates the model architecture.

Char-CNN-LSTM constructs input word vectors based on the characters in each word using a convolutional neural network (CNN) (LeCun et al., 1989), then processes the input word-level using an LSTM (Hochreiter and Schmidhuber, 1997). The next word is predicted using word embeddings, a large number of parameters which have to be trained specifically to represent the semantics of single words. We refer to this space of word representations as Mw.

Formally, for the input layer the model trains a look-up matrix C ∈ R^{|Vc| × dc}, corresponding to one dc-dimensional vector per character c in the char vocabulary Vc. For each input, it takes a sequence of characters of a fixed length m, [c_1, ..., c_m], where m is the maximum length of all words in the word vocabulary Vw, and the length of each word is l ≤ m. Looking up all characters of a word yields a sequence of char representations in R^{dc × l}, which is zero-padded to fit the fixed length m. For each word one gets a sequence of char representations Cw ∈ R^{dc × m}, passed through a 1D convolution:

f_i^w = tanh(⟨Cw, H_i⟩ + b).    (2)

H_i ∈ R^{df,i × si} is a filter or kernel of size/width si, and ⟨A, B⟩ = Tr(AB^T) is the Frobenius inner product. The model has multiple filters H_i, with kernels of different width si and dimensionality df,i; i is used to index filters. Since the model performs a convolution over char embeddings, si corresponds to the char window the convolution is operating on: e.g., a filter of width si = 3 and df,i = 150 could be seen as learning 150 features for detecting 3-grams. By learning kernels of different width si, the model can learn subword-level features for character sequences of different lengths. f_i^w is the output of taking the convolution with filter H_i for word w. Since f_i^w can get quite large, its dimensionality is reduced using max-over-time (1D) pooling: y_i^w = max_j f_i^w[j]. Here, j indexes the dimensions df,i of the filter f_i^w, and y_i^w ∈ R^{df,i}. This corresponds to taking the maximum value for each feature of H_i, with the intuition that the most informative feature would have the highest activation. The output of all max-pooling operations y_i^w is concatenated to form a word vector yw ∈ R^{dp}, where dp is the number of all features for all H_i:

yw = concat([y_1^w, ..., y_i^w]).    (3)

This vector is passed through a highway network (Srivastava et al., 2015) to give the network the possibility to reweight or transform the features: hw = Highway(yw). So far all transformations were done per word; after the highway transformation word representations are processed in a sequence by an LSTM (Hochreiter and Schmidhuber, 1997):

o_t^w = LSTM([h_1^w, ..., h_{t-1}^w]).    (4)

The LSTM yields one output vector o_t^w per word in the sequence, given all previous time steps [y_1^w, ..., y_{t-1}^w]. To predict the next word w_{t+1}, one takes the dot product of the vector o_t^w ∈ R^{1 × dl} with a lookup matrix Mw ∈ R^{dl × |Vw|}, where dl corresponds to the LSTM hidden state size. The vector p_{t+1} ∈ R^{1 × |Vw|} is normalised to contain values between 0 and 1, representing a probability distribution over the next word. This corresponds to calculating the softmax function for every word k in Vw:

p(w_{t+1} = k | o_t) = e^{o_t · m_k} / Σ_{k' ∈ Vw} e^{o_t · m_{k'}}    (5)

where P(w_{t+1} = k | o_t) is the probability of the next word w_{t+1} being k given o_t, and m_k is the output embedding vector taken from Mw.
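The following NumPy sketch mirrors Eqs. (2)–(3) for a single word, to make the per-width convolution and the max-over-time pooling concrete; it omits the highway layer and the LSTM, and all sizes, filter widths and character ids are toy values chosen for the example (the paper's actual models are implemented in TensorFlow, see Section 6).

```python
import numpy as np

rng = np.random.default_rng(0)

d_c, m = 15, 10                  # char embedding size and padded word length (toy values)
V_c = 40                         # toy character vocabulary size
C = rng.normal(size=(V_c, d_c))  # char look-up matrix C of Eq. (2)

# Filter banks: width s_i -> number of features d_{f,i} (toy sizes; the paper uses widths 1..7).
filter_specs = {1: 5, 2: 8, 3: 10}
H = {s: rng.normal(size=(d, d_c, s)) for s, d in filter_specs.items()}
b = {s: np.zeros(d) for s, d in filter_specs.items()}

def char_cnn_word_vector(char_ids):
    """Eqs. (2)-(3): embed the characters, convolve with each filter bank,
    apply tanh, max-pool over time, and concatenate the pooled features."""
    Cw = C[char_ids].T                                   # d_c x l char representations
    Cw = np.pad(Cw, ((0, 0), (0, m - Cw.shape[1])))      # zero-pad to fixed length m
    pooled = []
    for s, Hs in H.items():
        n_pos = m - s + 1
        # f_i^w[j] = tanh(<C_w[:, j:j+s], H_i> + b): Frobenius inner product per position
        f = np.stack([np.tanh(np.tensordot(Hs, Cw[:, j:j + s],
                                           axes=([1, 2], [0, 1])) + b[s])
                      for j in range(n_pos)], axis=1)    # shape d_{f,i} x n_pos
        pooled.append(f.max(axis=1))                     # y_i^w: max over time per feature
    return np.concatenate(pooled)                        # y_w with d_p = 5 + 8 + 10 dims

word = [3, 17, 5, 22]                                    # toy character ids for one word
print(char_cnn_word_vector(word).shape)                  # (23,)
```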
Word-Level Vector Space: Mw. The model parameters in Mw can be seen as the bottleneck of the model, as they need to be trained specifically for single words, leading to unreliable estimates for infrequent words. As an analysis of the corpus statistics later in Section 7 reveals, the Zipfian effect and its influence on word vector estimation cannot be fully resolved even with a large corpus, especially taking into account how flexible MRLs are in terms of word formation and combination. Yet, having a good estimate for the parameters in Mw is essential for the final LM performance, as they are directly responsible for the next-word prediction.

Therefore, our aim is to improve the quality of representations in Mw, focusing on infrequent words. To achieve this, we turn to another source of information: character patterns. In other words, since Mw does not have any information about character patterns from lower layers, we seek a way to: a) detect words with similar subword structures (i.e., "morpheme"-level information), and b) let these words share their semantic information.

4 Character-Aware Vector Space

The CNN part of Char-CNN-LSTM, see Eq. (3), in fact provides information about such subword-level patterns: the model constructs a word vector yw on-the-fly based on the word's constituent characters. We let the model construct yw for all words in the vocabulary, resulting in a character-aware word vector space Mc ∈ R^{|Vw| × dp}. The construction of the space is completely unsupervised and independent of the word's context; only the first (CNN) network layers are activated. Our core idea is to leverage this information obtained from Mc to influence the output matrix Mw, and consequently the network prediction, and to extend the model to handle unseen words.

We first take a closer look at the character-aware space Mc, and then describe how to improve and expand the semantic space Mw based on the information contained in Mc (Section 5). Each vocabulary entry in Mc encodes character n-gram patterns about the represented word, for 1 ≤ n ≤ 7. The n-gram patterns arise through filters of different lengths, and their maximum activation is concatenated to form each individual vector yw. The matrix Mc is of dimensionality |Vw| × 1100, where each of the 1,100 dimensions corresponds to the activation of one kernel feature. In practice, dimensions [0 : 50] correspond to single-character features, [50 : 150] to character 2-grams, and [150 : 300] to 3-grams. The higher-order n-grams get assigned 200 dimensions each, up to dimensions [900 : 1100] for 7-grams.

Drawing an analogy to work in computer vision (Zeiler and Fergus, 2014; Chatfield et al., 2014), we delve deeper into the filter activations and analyse the key properties of the vector space Mc. The qualitative analysis reveals that many features are interpretable by humans, and indeed correspond to frequent subword patterns, as illustrated in Table 1.

Lang | si | Pattern | Max Activations
ZH | 1 | 更, 不 | 更为, 更改, 更名, ..., 不满, 不明, 不易
ZH | 1 | 今, 代 | 今日, 今人, 至少, ..., 如何, 现代, 当代
TR | 1 | Caps | In, Ebru, VIC, ..., FAT, MW, MITTR
TR | 3 | mu- | ..., mutfağının, muharebe, muhtelif
TR | 6 | Üniversite | ..., Üniversitesi'nin, üniversitelerde

Table 1: Each CNN filter tends to have high activations for a small number of subword patterns. si denotes the filter size.

For instance, tokenised Chinese data favours short words: consequently, short filters activate strongly for one or two characters. The first two filters (width 1) are highly active for two common single characters each: one filter is active for 更 (again, more) and 不 (no), and the other for 今 (now) and 代 (time period). Larger filters (width 5-7) do not show interpretable patterns in Chinese, since the vocabulary largely consists of short words (length 1-4).

Agglutinative languages show a tendency towards long words. We find that medium-sized filters (width 3-5) are active for morphemes or short common subword units, and the long filters are activated for different surface realisations of the same root word. In Turkish, one filter is highly active on various forms of the word üniversite (university). Further, in MRLs with the Latin alphabet short filters are typically active on capitalisation or special chars.

Table 2 shows examples of nearest neighbours based on the activations in Mc.

Lang | Word | Nearest Neighbours
DE | Ursprünglichkeit | ursprüngliche, Urstoff, ursprünglichen
DE | Mittelwert | Mittelwerten, Regelwerkes, Mittelweser
DE | effektiv | Effekt, Perfekt, Effekte, perfekten, Respekt
JA | 大学 | 大金, 大石, 大震災, 大空, 大野
JA | ハイク | ハイム, バイク, メイク, ハッサク
EN | 1725 | 1825, 1625, 1524mm, 1728
EN | Magenta | Maplet, Maya, Management

Table 2: Nearest neighbours for vocabulary words, based on the character-aware vector space Mc.
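The nearest-neighbour analysis behind Table 2 can be reproduced with a few lines of NumPy: encode every vocabulary word with the character-level encoder only, stack the resulting vectors into Mc, and rank words by cosine similarity. The encoder below is only a stand-in (a bag of hashed character bigrams) for the char-CNN of Section 3, and the vocabulary of character-id sequences is invented for the example.

```python
import numpy as np

def toy_char_encoder(char_ids, dim=16):
    """Stand-in for the char-CNN word vector y_w of Section 3:
    a bag of hashed character bigrams, used here only for illustration."""
    v = np.zeros(dim)
    for a, b in zip(char_ids, char_ids[1:]):
        v[(a * 31 + b) % dim] += 1.0
    return v

def build_Mc(vocab_char_ids, encode):
    """Stack the character-aware vectors of all vocabulary words into Mc (|Vw| x dp)."""
    return np.stack([encode(ids) for ids in vocab_char_ids])

def nearest_neighbours(Mc, idx, k=3):
    """Rank vocabulary entries by cosine similarity to word `idx` in Mc."""
    X = Mc / np.linalg.norm(Mc, axis=1, keepdims=True)
    sims = X @ X[idx]
    return [int(i) for i in np.argsort(-sims) if i != idx][:k]

# Toy vocabulary: words 0 and 1 share a character prefix, as do words 2 and 3.
vocab = [[3, 17, 5], [3, 17, 5, 22], [8, 9, 30], [8, 9, 30, 7], [25, 1, 2, 14]]
Mc = build_Mc(vocab, toy_char_encoder)
print(nearest_neighbours(Mc, idx=0, k=2))   # word 1 shares the most bigrams with word 0
```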
The space seems to be arranged according to shared subword patterns based on the CNN features. It does not rely only on a simple character overlap, but also captures shared morphemes. This property is exploited to influence the LM output word embedding matrix Mw in a completely unsupervised way, as illustrated on the right side of Figure 1.

5 Fine-Tuning the LM Prediction

While the output vector space Mw captures word-level semantics, Mc arranges words by subword features. A model which relies solely on character-level knowledge (similar to the information stored in Mc) for word-level prediction cannot fully capture word-level semantics and even hurts LM performance (Jozefowicz et al., 2016). However, shared subword units still provide useful evidence of shared semantics (Cotterell et al., 2016; Vulić et al., 2017): injecting this into the space Mw to additionally reflect shared subword-level information should lead to improved word vector estimates, especially for MRLs.

5.1 Fine-Tuning and Constraints

We inject this information into Mw by adapting recent fine-tuning (often termed retrofitting or specialisation) methods for vector space post-processing (Faruqui et al., 2015; Wieting et al., 2015; Mrkšić et al., 2017; Vulić et al., 2017, i.a.). These models enrich initial vector spaces by encoding external knowledge, provided in the form of simple linguistic constraints (i.e., word pairs), into the initial vector space.

There are two fundamental differences between our work and previous work on specialisation. First, previous models typically use rich hand-crafted lexical resources such as WordNet (Fellbaum, 1998) or the Paraphrase Database (Ganitkevitch et al., 2013), or manually defined rules (Vulić et al., 2017) to extract the constraints, while we generate them directly using the implicit knowledge coded in Mc. Second, our method is integrated into a language model: it performs updates after each epoch of the LM training.²

² We have also experimented with a variant which performs only a post-hoc single update of the Mw matrix after the LM training, but a variant which performs continuous per-epoch updates is more beneficial for the final LM performance.

In Section 5.2, we describe our model for fine-tuning Mw based on the information provided in Mc. Our fine-tuning approach relies on constraints: positive and negative word pairs (x_i, x_j), where x_i, x_j ∈ Vw. Iterating over each cue word x_w ∈ Vw we find a set of positive word pairs Pw and negative word pairs Nw: their extraction is based on their (dis)similarity with x_w in Mc. Positive pairs (x_w, x_p) contain words x_p yielding the highest cosine similarity to x_w (= nearest neighbors) in Mc. Negative pairs (x_w, x_n) are constructed by randomly sampling words x_n from the vocabulary. Since Mc gets updated during the LM training, we (re)generate the sets Pw and Nw after each epoch.

5.2 Attract-Preserve

We now present a method for fine-tuning the output matrix Mw within the Char-CNN-LSTM LM framework. As said, the fine-tuning procedure runs after each epoch of the standard log-likelihood LM training (see Figure 1). We adapt a variant of a state-of-the-art post-processing specialisation procedure (Wieting et al., 2015; Mrkšić et al., 2017). The idea of the fine-tuning method, which we label Attract-Preserve (AP), is to pull the positive pairs closer together in the output word-level space, while pushing the negative pairs further away.

Let v_i denote the word vector of the word x_i. The AP cost function has two parts: attract and preserve. In the attract term, using the extracted sets Pw and Nw, we push the vector of x_w to be closer to x_p by a similarity margin δ than to its negative sample x_n:

attr(Pw, Nw) = Σ_{(x_w, x_p) ∈ Pw, (x_w, x_n) ∈ Nw} ReLU(δ + v_w v_n − v_w v_p).

ReLU(x) is the standard rectified linear unit (Nair and Hinton, 2010). The δ margin is set to 0.6 in all experiments as in prior work (Mrkšić et al., 2017), without any subsequent fine-tuning. The preserve cost acts as a regularisation pulling the "fine-tuned" vector back to its initial value:

pres(Pw, Nw) = Σ_{x_w ∈ Vw} λ_reg ||v̂_w − v_w||².    (6)

λ_reg = 10⁻⁹ is the L2-regularisation constant (Mrkšić et al., 2017); v̂_w is the original word vector before the procedure. This term tries to preserve the semantic content present in the original vector space, as long as this information does not contradict the knowledge injected by the constraints. The final cost function adds the two costs: cost = attr + pres.
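Below is a small NumPy sketch of one Attract-Preserve update over a toy output matrix Mw, implementing the attract and preserve terms above with δ = 0.6 and λ_reg = 10⁻⁹. The constraint triples are invented, and the plain gradient step stands in for the Adagrad optimiser (learning rate 0.05) that the paper actually uses (Section 6).

```python
import numpy as np

DELTA, LAMBDA_REG = 0.6, 1e-9   # margin and L2 constant quoted in the paper

def ap_cost_and_grad(Mw, Mw_init, triples):
    """cost = attr + pres for constraint triples (cue x_w, positive x_p, negative x_n).

    attr: ReLU(delta + v_w.v_n - v_w.v_p) pulls x_w towards x_p and away from x_n.
    pres: lambda_reg * ||v_hat_w - v_w||^2 pulls every vector back to its initial value.
    Returns the scalar cost and the gradient w.r.t. Mw.
    """
    grad = np.zeros_like(Mw)
    cost = 0.0
    for w, p, n in triples:
        vw, vp, vn = Mw[w], Mw[p], Mw[n]
        margin = DELTA + vw @ vn - vw @ vp
        if margin > 0:                        # ReLU is active: non-zero contribution
            cost += margin
            grad[w] += vn - vp
            grad[p] += -vw
            grad[n] += vw
    diff = Mw - Mw_init                       # preserve term over the whole space
    cost += LAMBDA_REG * np.sum(diff ** 2)
    grad += 2 * LAMBDA_REG * diff
    return cost, grad

# Toy usage: 6 word vectors, one attract-preserve update with a fixed step size.
rng = np.random.default_rng(1)
Mw_init = rng.normal(size=(6, 4))
Mw = Mw_init.copy()
triples = [(0, 1, 4), (2, 3, 5)]              # invented (cue, positive, negative) ids
cost, grad = ap_cost_and_grad(Mw, Mw_init, triples)
Mw -= 0.05 * grad                             # the paper uses Adagrad with lr 0.05
print(cost)
```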
6 Experiments

Datasets. We use the Polyglot Wikipedia (Al-Rfou et al., 2013) for all available languages except for Japanese, Chinese, and Thai, and add these and further languages using Wikipedia dumps. The Wiki dumps were cleaned and preprocessed by the Polyglot tokeniser. We construct similarly-sized datasets by extracting 46K sentences for each language from the beginning of each dump, filtered to contain only full sentences, and split into train (40K), validation (3K), and test (3K). The final list of languages along with standard language codes (ISO 639-1 standard, used throughout the paper) and statistics on vocabulary and token counts are provided in Table 4.

Evaluation Setup. We report perplexity scores (Jurafsky and Martin, 2017, Chapter 4.2.1) using the full vocabulary of the respective LM dataset. This means that we explicitly decide to retain also infrequent words in the modelled data. Replacing infrequent words by a placeholder token
Mw space, but we only allow words more frequent than 5 as cue words x_w (see Section 5 again), while there are no restrictions on x_p and x_n.³ Our preliminary analysis on the influence of the number of nearest neighbours in Mc shows that this parameter has only a moderate effect on the final LM scores. We thus fix it to 3 positive and 3 negative samples for each x_w without any tuning. AP is optimised with Adagrad (Duchi et al., 2011) and a learning rate of 0.05; the gradients are clipped to ±2.⁴ A full summary of all hyper-parameters and their values is provided in Table 3.

³ This choice has been motivated by the observation that rare words tend to have other rare words as their nearest neighbours. Note that vectors of words from positive and negative examples, and not only cue words, also get updated by the AP method.
⁴ All scores with neural models are produced with our own implementations in TensorFlow (Abadi et al., 2016).

(Base) Language Models. The availability of LM evaluation sets in a large number of diverse languages, described in Section 2, now provides an opportunity to perform a full-fledged multilingual analysis of representative LM architectures. At the same time, these different architectures serve as the baselines for our novel model which fine-tunes the output matrix Mw. As mentioned, the traditional LM setup is to use words both on the input and on the output side (Goodman, 2001; Bengio et al., 2003; Deschacht and Moens, 2009), relying on n-gram word sequences. We evaluate a strong model from the n-gram family of models from the KenLM package (https://github.com/kpu/kenlm): it is based on 5-grams with extended Kneser-Ney smoothing (KN5) (Kneser and Ney, 1995; Heafield et al., 2013).⁵ The rationale behind including this non-neural model is to also probe the limitations of such n-gram-based LM architectures on a diverse set of languages.

⁵ We evaluate the default setup for this model using the option -interpolate_unigrams=1 which avoids assigning zero-probability to unseen words.

Recurrent neural networks (RNNs), especially Long Short-Term Memory networks (LSTMs), have taken over the LM universe recently (Mikolov et al., 2010; Sundermeyer et al., 2015; Chen et al., 2016, i.a.). These LMs map a sequence of input words to embedding vectors using a look-up matrix. The embeddings are passed to the LSTM as input, and the model is trained in an autoregressive fashion to predict the next word from the pre-defined vocabulary given the current context. As a strong baseline from this LM family, we train a standard LSTM LM (LSTM-Word) relying on the setup from Zaremba et al. (2015) (see Table 3).

Finally, a recent strand of LM work uses characters on the input side while retaining word-level prediction on the output side. A representative architecture from this group, also serving as the basis in our work (Section 3), is Char-CNN-LSTM (Kim et al., 2016).

All neural models operate on exactly the same vocabulary and treat out-of-vocabulary (OOV) words in exactly the same way. As mentioned, we include KN5 as a strong (non-neural) baseline to give perspective on how this more traditional model performs across 50 typologically diverse languages. We have selected the setup for the KN5 model to be as close as possible to that of the neural LMs. However, due to the different nature of the models, we note that the results between KN5 and the other models are not comparable. In KN5, discounts are added for low-frequency words, and unseen words at test time are regarded as outliers and assigned low probability estimates. In contrast, for all neural models we sample unseen word vectors to lie in the space of trained vectors (see before). We find the latter setup to better reflect our intuition that especially in MRLs unseen words are not outliers but often arise due to morphological complexity.

7 Results and Discussion

In this section, we present the main empirical findings of our work. The focus is on: a) the results of our novel language model with the AP fine-tuning procedure, and its comparison to the other language models in our comparison; b) the analysis of the LM results in relation to typological features and corpus statistics.

Table 4 lists all 50 test languages along with their language codes and provides the key statistics of our 50 LM evaluation benchmarks. The statistics include the number of word types in training data, the number of word types occurring in test data but unseen in training, as well as the total number of word tokens in both training and test data, and type-to-token ratios. Table 4 also shows the results for KN5, LSTM-Word, Char-CNN-LSTM, and our model with the AP fine-tuning.
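The paper does not spell out how unseen word vectors are "sampled to lie in the space of trained vectors"; the sketch below shows one plausible reading, sampling each dimension uniformly within the range spanned by the trained rows of Mw. The function and its parameters are our own illustration, not a confirmed detail of the authors' implementation.

```python
import numpy as np

def sample_unseen_vectors(Mw_trained, n_unseen, seed=0):
    """Draw vectors for unseen test-time words so they lie inside the region
    occupied by trained word vectors: each dimension is sampled uniformly
    between the per-dimension minimum and maximum observed in Mw_trained.
    (One possible interpretation of the paper's setup, not a confirmed detail.)"""
    rng = np.random.default_rng(seed)
    lo, hi = Mw_trained.min(axis=0), Mw_trained.max(axis=0)
    return rng.uniform(lo, hi, size=(n_unseen, Mw_trained.shape[1]))

# Toy usage: extend a trained 100 x 32 output matrix with 5 unseen-word rows.
rng = np.random.default_rng(42)
Mw = rng.normal(size=(100, 32))
Mw_full = np.vstack([Mw, sample_unseen_vectors(Mw, n_unseen=5)])
print(Mw_full.shape)   # (105, 32)
```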
Language (code) | Vocab Size (Train) | New Test Vocab | Tokens (Train) | Tokens (Test) | Type/Token (Train) | KN5 | LSTM | Char-CNN-LSTM | +AP | ∆ +AP
Amharic (am) | 89749 | 4805 | 511K | 39.2K | 0.18 | 1252 | 1535 | 981 | 817 | 164
Arabic (ar) | 89089 | 5032 | 722K | 54.7K | 0.12 | 2156 | 2587 | 1659 | 1604 | 55
Bulgarian (bg) | 71360 | 3896 | 670K | 49K | 0.11 | 610 | 651 | 415 | 409 | 6
Catalan (ca) | 61033 | 2562 | 788K | 59.4K | 0.08 | 358 | 318 | 241 | 238 | 3
Czech (cs) | 86783 | 4300 | 641K | 49.6K | 0.14 | 1658 | 2200 | 1252 | 1131 | 121
Danish (da) | 72468 | 3618 | 663K | 50.3K | 0.11 | 668 | 710 | 466 | 442 | 24
German (de) | 80741 | 4045 | 682K | 51.3K | 0.12 | 930 | 903 | 602 | 551 | 51
Greek (el) | 76264 | 3767 | 744K | 56.5K | 0.10 | 607 | 538 | 405 | 389 | 16
English (en) | 55521 | 2480 | 783K | 59.5K | 0.07 | 533 | 494 | 371 | 349 | 22
Spanish (es) | 60196 | 2721 | 781K | 57.2K | 0.08 | 415 | 366 | 275 | 270 | 5
Estonian (et) | 94184 | 3907 | 556K | 38.6K | 0.17 | 1609 | 2564 | 1478 | 1388 | 90
Basque (eu) | 81177 | 3365 | 647K | 47.3K | 0.13 | 560 | 533 | 347 | 309 | 38
Farsi (fa) | 52306 | 2041 | 738K | 54.2K | 0.07 | 355 | 263 | 208 | 205 | 3
Finnish (fi) | 115579 | 6489 | 585K | 44.8K | 0.20 | 2611 | 4263 | 2236 | 1858 | 378
French (fr) | 58539 | 2575 | 769K | 57.1K | 0.08 | 350 | 294 | 231 | 220 | 11
Hebrew (he) | 83217 | 3862 | 717K | 54.6K | 0.12 | 1797 | 2189 | 1519 | 1375 | 144
Hindi (hi) | 50384 | 2629 | 666K | 49.1K | 0.08 | 473 | 426 | 326 | 299 | 27
Croatian (hr) | 86357 | 4371 | 620K | 48.1K | 0.14 | 1294 | 1665 | 1014 | 906 | 108
Hungarian (hu) | 101874 | 5015 | 672K | 48.7K | 0.15 | 1151 | 1595 | 929 | 819 | 110
Indonesian (id) | 49125 | 2235 | 702K | 52.2K | 0.07 | 454 | 359 | 286 | 263 | 23
Italian (it) | 70194 | 2923 | 787K | 59.3K | 0.09 | 567 | 493 | 349 | 350 | -1
Japanese (ja) | 44863 | 1768 | 729K | 54.6K | 0.06 | 169 | 156 | 136 | 125 | 11
Javanese (jv) | 65141 | 4292 | 622K | 52K | 0.10 | 1387 | 1443 | 1158 | 1003 | 155
Georgian (ka) | 80211 | 3738 | 580K | 41.1K | 0.14 | 1370 | 1827 | 1097 | 939 | 158
Khmer (km) | 37851 | 1303 | 579K | 37.4K | 0.07 | 586 | 637 | 522 | 535 | -13
Kannada (kn) | 94660 | 4604 | 434K | 29.4K | 0.22 | 2315 | 5310 | 2558 | 2265 | 293
Korean (ko) | 143794 | 8275 | 648K | 50.6K | 0.22 | 5146 | 10063 | 4778 | 3821 | 957
Lithuanian (lt) | 81501 | 3791 | 554K | 41.7K | 0.15 | 1155 | 1415 | 854 | 827 | 27
Latvian (lv) | 75294 | 4564 | 587K | 45K | 0.13 | 1452 | 1967 | 1129 | 969 | 160
Malay (ms) | 49385 | 2824 | 702K | 54.1K | 0.07 | 776 | 725 | 525 | 513 | 12
Mongolian (mn) | 73884 | 4171 | 629K | 50K | 0.12 | 1392 | 1716 | 1165 | 1091 | 74
Burmese (my) | 20574 | 755 | 576K | 46.1K | 0.04 | 209 | 212 | 182 | 180 | 2
Min Nan (nan) | 33238 | 1404 | 1.2M | 65.6K | 0.03 | 61 | 43 | 39 | 38 | 1
Dutch (nl) | 60206 | 2626 | 708K | 53.8K | 0.08 | 397 | 340 | 267 | 248 | 19
Norwegian (no) | 69761 | 3352 | 674K | 47.8K | 0.10 | 534 | 513 | 379 | 346 | 33
Polish (pl) | 97325 | 4526 | 634K | 47.7K | 0.15 | 1741 | 2641 | 1491 | 1328 | 163
Portuguese (pt) | 56167 | 2394 | 780K | 59.3K | 0.07 | 342 | 272 | 214 | 202 | 12
Romanian (ro) | 68913 | 3079 | 743K | 52.5K | 0.09 | 384 | 359 | 256 | 247 | 9
Russian (ru) | 98097 | 3987 | 666K | 48.4K | 0.15 | 1128 | 1309 | 812 | 715 | 97
Slovak (sk) | 88726 | 4521 | 618K | 45K | 0.14 | 1560 | 2062 | 1275 | 1151 | 124
Slovene (sl) | 83997 | 4343 | 659K | 49.2K | 0.13 | 1114 | 1308 | 776 | 733 | 43
Serbian (sr) | 81617 | 3641 | 628K | 46.7K | 0.13 | 790 | 961 | 582 | 547 | 35
Swedish (sv) | 77499 | 4109 | 688K | 50.4K | 0.11 | 843 | 832 | 583 | 543 | 40
Tamil (ta) | 106403 | 6017 | 507K | 39.6K | 0.21 | 3342 | 6234 | 3496 | 2768 | 728
Thai (th) | 30056 | 1300 | 628K | 49K | 0.05 | 233 | 241 | 206 | 199 | 7
Tagalog (tl) | 72416 | 3791 | 972K | 66.3K | 0.07 | 379 | 298 | 219 | 211 | 8
Turkish (tr) | 90840 | 4608 | 627K | 45K | 0.14 | 1724 | 2267 | 1350 | 1290 | 60
Ukrainian (uk) | 89724 | 4983 | 635K | 47K | 0.14 | 1639 | 1893 | 1283 | 1091 | 192
Vietnamese (vi) | 32055 | 1160 | 754K | 61.9K | 0.04 | 197 | 190 | 158 | 165 | -7
Chinese (zh) | 43672 | 1653 | 746K | 56.8K | 0.06 | 1064 | 826 | 797 | 762 | 35
Isolating (avg) | 40930 | 1825 | 759K | 54K | 0.05 | 440 | 392 | 326 | 318 | 8
Fusional (avg) | 73499 | 3532 | 689K | 51.3K | 0.11 | 842 | 969 | 618 | 566 | 52
Introflexive (avg) | 87352 | 4566 | 650K | 49.5K | 0.14 | 1735 | 2104 | 1386 | 1265 | 121
Agglutinative (avg) | 91051 | 4687 | 603K | 45K | 0.16 | 1898 | 3164 | 1727 | 1473 | 254

Table 4: Test perplexities for 50 languages (ISO 639-1 codes sorted alphabetically) in the full-vocabulary prediction LM setup. Left: basic statistics of our evaluation data. Middle: results with the baseline LMs. Note that the absolute scores in the KN5 column are not comparable to the scores obtained with neural models (see Section 6). Right: results with Char-CNN-LSTM and our AP fine-tuning strategy. ∆ indicates the difference in performance over the original Char-CNN-LSTM model.

Furthermore, a visualisation of the Char-CNN-LSTM+AP model as a function of type/token ratio is shown in Figure 2.

7.1 Fine-Tuning the Output Matrix

First, we test the impact of our AP fine-tuning method. As the main finding, the inclusion of fine-tuning into Char-CNN-LSTM (this model is termed
+AP) yields improvements on a large number of test languages. The model is better than both strong neural baseline language models for 47/50 languages, and it improves over the original Char-CNN-LSTM LM for 47/50 languages. The largest gains are indicated for the subset of agglutinative MRLs (e.g., 950 perplexity points in Korean, with large gains also marked for FI, HE, EL, HU, TA, ET). We also observe large gains for the three introflexive languages included in our study (Amharic, Arabic, Hebrew).

While these large absolute gains may be partially attributed to the exponential nature of the perplexity measure, one cannot ignore the substantial relative gains achieved by our models: e.g., EU (∆PPL = 38) improves more than a fusional language like DA (∆PPL = 24) even with a lower baseline perplexity. This suggests that injecting subword-level information is more straightforward for the former: in agglutinative languages, the mapping between morphemes and meanings is less ambiguous. Moreover, the number of words that benefit from the injection of character-based information is larger for agglutinative languages, because they also tend to display the highest inflectional synthesis.

For the opposite reasons, we do not surpass Char-CNN-LSTM in a few fusional (IT) and isolating languages (KM, VI). We also observe improvements for Slavic languages with rich morphology (RU, HR, PL). The gains are also achieved for some isolating and fusional languages with smaller vocabularies and a smaller number of rare words, e.g., in Tagalog, English, Catalan, and Swedish. This suggests that our method for fine-tuning the LM prediction is not restricted to MRLs only, and has the ability to improve the estimation for rare words in multiple typologically diverse languages.

7.2 Language Models, Typological Features, and Corpus Statistics

In the next experiment, we estimate the correlation strength of all perplexity scores with a series of independent variables. The variables are 1) type-token ratio in the train data; 2) new word types in the test data; 3) the morphological type of the language among isolating, fusional, introflexive, and agglutinative, capturing different aspects related to the morphological richness of a language.

Results with Pearson's ρ (numerical) and η² in one-way ANOVA (categorical) are shown in Table 5. Significance tests show p-values < 10⁻³ for all combinations of models and independent variables, demonstrating that all of them are good performance predictors. Our main finding indicates that linguistic categories and data statistics both correlate well (≈0.35 and ≈0.82, respectively) with the performance of language models.

Figure 2: Perplexity results with Char-CNN-LSTM+AP (y-axis) in relation to type/token ratio (x-axis). For language codes, see Table 4.

For the categorical variables we compare the mean values per category with the numerical dependent variable. As such, η² can be interpreted as the amount of variation explained by the model; the resulting high correlations suggest that perplexities tend to be homogeneous for languages of a same morphological type, especially so for state-of-the-art models.

This is intuitively evident in Figure 2, where perplexity scores of Char-CNN-LSTM+AP are plotted against type/token ratio. Isolating languages are placed on the left side of the spectrum as expected, with low type/token ratio and good performance (e.g., VI, ZH). As for fusional languages, sub-groups behave differently. We find that Romance and Germanic languages display roughly the same level of performance as isolating languages, despite their overall larger type/token ratio. Balto-Slavic languages (e.g., CS, LV) instead show both higher perplexities and higher type/token ratio. These differences may be explained in terms of different inflectional synthesis.
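For reference, the two statistics reported in Table 5 can be computed as below. The eta_squared() function follows the standard one-way ANOVA effect size (between-group sum of squares over total sum of squares), which we assume is what the paper uses; the toy perplexities, type/token ratios and morphological labels are illustrative only.

```python
import numpy as np

def pearson_rho(x, y):
    """Pearson correlation between two numerical variables."""
    return float(np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1])

def eta_squared(values, groups):
    """One-way ANOVA effect size: SS_between / SS_total, i.e. the share of the
    variance in `values` explained by the categorical `groups`."""
    values, groups = np.asarray(values, float), np.asarray(groups)
    grand_mean = values.mean()
    ss_total = ((values - grand_mean) ** 2).sum()
    ss_between = sum(
        values[groups == g].size * (values[groups == g].mean() - grand_mean) ** 2
        for g in np.unique(groups)
    )
    return float(ss_between / ss_total)

# Toy usage with a handful of illustrative (perplexity, type/token, type) points.
ppl   = np.array([318.0, 566.0, 1265.0, 1473.0, 349.0, 1858.0])
ttr   = np.array([0.05, 0.11, 0.14, 0.16, 0.07, 0.20])
types = np.array(["isolating", "fusional", "introflexive", "agglutinative",
                  "fusional", "agglutinative"])
print(pearson_rho(ttr, ppl))    # analogue of the Pearson's rho rows in Table 5
print(eta_squared(ppl, types))  # analogue of the one-way ANOVA eta^2 rows
```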
Independent Variable | Dependent | Statistical Test | KN5 | LSTM | +Char-CNN | ++AP
Train type/token | PPL | Pearson's ρ | 0.833 | 0.813 | 0.823 | 0.831
Test new types | PPL | Pearson's ρ | 0.860 | 0.803 | 0.818 | 0.819
Morphology | PPL | one-way ANOVA η² | 0.354 | 0.338 | 0.369 | 0.374

Independent Variable | Dependent | Statistical Test | LSTM vs +CharCNN | +CharCNN vs ++AP
Train type/token | ∆PPL | Pearson's ρ | 0.729 | 0.778
Morphology | ∆PPL | one-way ANOVA η² | 0.308 | 0.284

Table 5: Correlations between model performance and language typology as well as with corpus statistics (type/token ratio and new word types in test data). All variables are good performance predictors.

Introflexive and agglutinative languages can be found mostly on the right side of the spectrum in terms of performance (see Figure 2). Although the languages with the highest absolute perplexity scores are certainly classified as agglutinative (e.g., Dravidian languages such as KN and TA), we also find some outliers among the agglutinative languages (EU) with remarkably low perplexity scores.

7.3 Corpus Size and Type/Token Ratio

Building on the strong correlation between type/token ratio and model performance from Section 7.2, we now further analyse the results in light of corpus size and type/token statistics. The LM datasets for our 50 languages are similar in size to the widely used English PTB dataset (Marcus et al., 1993). As such, we hope that these evaluation datasets can help guide multilingual language modeling research across a wide spectrum of languages.

However, our goal now is to verify that type/token ratio, and not absolute corpus size, is the deciding factor when unraveling the limitations of standard LM architectures across different languages. To this end, we conduct additional experiments on all languages of the recent Multilingual Wikipedia Corpus (MWC) (Kawakami et al., 2017) for language modeling, using the same setup as before (see Table 3). The corpus provides datasets for 7 languages from the same domain as our benchmarks (Wikipedia), and comes in two sizes. We choose the larger corpus variant for each language, which provides about 3-5 times as many tokens as contained in our datasets from Table 4.

The results on the MWC evaluation data along with corpus statistics are summarised in Table 6. As one important finding, we observe that the gains in perplexity using our fine-tuning AP method extend also to these larger evaluation datasets. In particular, we find improvements of the same magnitude as in the PTB-sized datasets over the strongest baseline model (Char-CNN-LSTM) for all MWC languages. For instance, perplexity is reduced from 1781 to 1578 for Russian, and from 365 to 352 for English. We also observe a gain for French and Spanish, with perplexity reduced from 282 to 272 and from 255 to 243, respectively.

In addition, we test on samples of the Europarl corpus (Koehn, 2005; Tiedemann, 2012) which contains approximately 10 times more tokens than our PTB-sized evaluation data: we use 400K sentences from Europarl for training and testing. However, this data comes from a much narrower domain of parliamentary proceedings: this property yields a very low type/token ratio, as visible from Table 6. In fact, we find the type/token ratio in this corpus to be on the same level or even smaller than isolating languages (compare with the scores in Table 4): 0.02 for Dutch and 0.03 for Czech. This leads to similar perplexities with and without +AP for these two selected test languages. The third EP test language, Finnish, has a slightly higher type/token ratio. Consequently, we do observe an improvement of 10 points in perplexity. A more detailed analysis of this phenomenon follows.

Table 7 displays the overall type/token ratio in the training set of these corpora. We observe that the MWC has comparable or even higher type/token ratios than the smaller sets despite its increased size. The corpus has been constructed by sampling the data from a variety of different Wikipedia categories (Kawakami et al., 2017): it can therefore be regarded as more diverse and challenging to model.
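Type/token ratio, the statistic this analysis revolves around, is simple to compute; the sketch below also traces how the ratio evolves as a corpus grows, which is the behaviour plotted in Figure 3. Whitespace tokenisation and the tiny repetitive corpus are simplifications for the example (the paper tokenises with Polyglot and measures up to 800K sentences).

```python
def type_token_ratio(tokens):
    """Number of distinct word types divided by the number of word tokens."""
    return len(set(tokens)) / len(tokens)

def ttr_curve(sentences, step):
    """Type/token ratio measured on growing prefixes of a corpus
    (the quantity plotted against corpus size in Figure 3)."""
    tokens, curve = [], []
    for i, sent in enumerate(sentences, start=1):
        tokens.extend(sent.split())           # whitespace tokenisation for the sketch
        if i % step == 0:
            curve.append((i, type_token_ratio(tokens)))
    return curve

# Toy usage: a tiny repetitive corpus; the real curves use Wikipedia/Europarl data.
corpus = ["the cat sat on the mat", "the dog sat on the mat", "a cat saw a dog"] * 100
print(ttr_curve(corpus, step=100))
```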
Lang | Corpus | Vocab (Train) | Vocab (Test) | Tokens (Train) | Tokens (Test) | Type/Token (Train) | Char-CNN-LSTM | +AP
nl | EP | 197K | 200K | 10M | 255K | 0.02 | 62 | 63
cs | EP | 265K | 268K | 7.9M | 193K | 0.03 | 180 | 186
en | MWC | 310K | 330K | 5.0M | 0.5M | 0.06 | 365 | 352
es | MWC | 258K | 277K | 3.7M | 0.4M | 0.07 | 255 | 243
fr | MWC | 260K | 278K | 4.0M | 0.5M | 0.07 | 282 | 272
fi | EP | 459K | 465K | 6.8M | 163K | 0.07 | 515 | 505
de | MWC | 394K | 420K | 3.8M | 0.3M | 0.10 | 710 | 665
ru | MWC | 372K | 399K | 2.5M | 0.3M | 0.15 | 1781 | 1578
cs | MWC | 241K | 258K | 1.5M | 0.2M | 0.16 | 2396 | 2159
fi | MWC | 320K | 343K | 1.5M | 0.1M | 0.21 | 5300 | 4911

Table 6: Results on the larger MWC dataset (Kawakami et al., 2017) and on a subset of the Europarl (EP) corpus. Improvements with +AP are not dependent on corpus size, but rather they strongly correlate with the type/token ratio of the corpus.

Language | Our Data | MWC | Europarl
Czech | 0.13 | 0.16 | 0.03
German | 0.12 | 0.10 | -
English | 0.06 | 0.06 | -
Spanish | 0.07 | 0.07 | -
Finnish | 0.20 | 0.21 | 0.07
French | 0.07 | 0.07 | -
Russian | 0.14 | 0.15 | -
Dutch | 0.09 | - | 0.02

Table 7: Comparison of type/token ratios in the corpora used for evaluation. The ratio is not dependent only on the corpus size but also on the language and domain of the corpus.

Europarl on the other hand shows substantially lower type/token ratios, presumably due to its narrower domain and more repetitive nature.

In general, we find that although the type/token ratio decreases with increasing corpus size, the decreasing rate slows down dramatically at a certain point (Herdan, 1960; Heaps, 1978). This depends on the typology of the language and the domain of the corpus. Figure 3 shows the empirical proof of this intuition: we show the variation of type/token ratios in Wikipedia and Europarl with increasing corpus size. We can see that in a very large corpus of 800K sentences, the type/token ratio in MRLs such as Korean or Finnish stays close to 0.1, a level where we still expect an improvement in perplexity with the proposed AP fine-tuning method applied on top of Char-CNN-LSTM.

Figure 3: Type/token ratio values vs. corpus size. A domain-specific corpus (Europarl) has a lower type/token ratio than a more general corpus (Wikipedia), regardless of the absolute corpus size.

In order to isolate and verify the effect of the type/token ratio, we now present results on synthetically created datasets where the ratio is controlled explicitly. We experiment with subsets of the German Wikipedia with an equal number of sentences (25K)⁶, a comparable number of tokens, but varying type/token ratio. We generate these controlled datasets by clustering sparse bag-of-words sentence vectors with the k-means algorithm, sampling from different clusters, and then selecting the final combinations according to their type/token ratio and the number of tokens.

⁶ We split the data into 20K training, 2.5K validation and 2.5K test sentences.
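A sketch of the controlled-subset construction just described: sentences are represented as sparse bag-of-words vectors, clustered with k-means, and candidate subsets are assembled from chosen cluster combinations so that their token counts and type/token ratios can be compared (as in Table 8). scikit-learn is used here purely for illustration; the paper does not name the implementation it used, and the sentences and cluster combinations below are invented.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

def type_token_ratio(sentences):
    tokens = " ".join(sentences).split()
    return len(set(tokens)) / len(tokens)

def controlled_subsets(sentences, n_clusters, combinations, seed=0):
    """Cluster sparse bag-of-words sentence vectors with k-means and build one
    candidate subset per cluster combination; combinations can then be kept or
    discarded according to their type/token ratio and token count."""
    X = CountVectorizer().fit_transform(sentences)        # sparse bag-of-words vectors
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    subsets = {}
    for combo in combinations:
        subset = [s for s, lab in zip(sentences, labels) if lab in combo]
        subsets[combo] = (len(subset), type_token_ratio(subset))
    return subsets

# Toy usage with invented sentences; the paper uses 25K German Wikipedia sentences
# and combinations such as (2,), (2, 4), (2, 4, 5, 9).
sents = ["the match ended in a draw", "the team won the cup final",
         "the enzyme binds the substrate", "proteins fold into structures",
         "stock prices fell sharply", "markets rallied after the report"] * 20
print(controlled_subsets(sents, n_clusters=3, combinations=[(0,), (0, 1)]))
```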
Corpora statistics along with corresponding perplexity scores are shown in Table 8, and plotted in Figure 4.

Clusters | Vocab (Train) | Vocab (Test) | Tokens (Train) | Tokens (Test) | Type/Token (Train) | Char-CNN-LSTM | +AP
2 | 48K | 52K | 382K | 47K | 0.13 | 225 | 217
2, 4 | 69K | 75K | 495K | 62K | 0.14 | 454 | 420
2, 4, 5, 9 | 78K | 84K | 494K | 62K | 0.16 | 605 | 547
5, 9 | 84K | 91K | 492K | 62K | 0.17 | 671 | 612
5 | 66K | 72K | 372K | 46K | 0.18 | 681 | 598

Table 8: Results on German with datasets of comparable size and increasing type/token ratio.

Figure 4: Visualisation of results from Table 8. The AP method is especially helpful for corpora with high type/token ratios.

These results clearly demonstrate and verify that the effectiveness of the AP method increases for corpora with higher type/token ratios. This finding also further supports the usefulness of the proposed method for morphologically-rich languages in particular, where such high type/token ratios are expected.

8 Conclusion

We have presented a comprehensive language modeling study over a set of 50 typologically diverse languages. The languages were carefully selected to represent a wide spectrum of different morphological systems that are found among the world's languages. Our comprehensive study provides new benchmarks and language modeling baselines which should guide the development of next-generation language models focused on the challenging multilingual setting.

One particular LM challenge is an effective learning of parameters for infrequent words, especially for morphologically-rich languages (MRLs). The methodological contribution of this work is a new neural approach which enriches word vectors at the LM output with subword-level information to capture similar character sequences and consequently to facilitate word-level LM prediction. Our method has been implemented as a fine-tuning step which gradually refines word vectors during the LM training, based on subword-level knowledge extracted in an unsupervised manner from character-aware CNN layers. Our approach yields gains for 47/50 languages in the challenging full-vocabulary setup, with largest gains reported for MRLs such as Korean or Finnish. We have also demonstrated that the gains extend to larger training corpora, and are well correlated with the type-to-token ratio in the training data.

In future work we plan to deal with the open vocabulary LM setup and extend our framework to also handle unseen words at test time. One interesting avenue might be to further fine-tune the LM prediction based on additional evidence beyond purely contextual information. In summary, we hope that this article will encourage further research into learning semantic representations for rare and unseen words, and steer further developments in multilingual language modeling across a large number of diverse languages. Code and data are available online: http://people.ds.cam.ac.uk/dsg40/lmmrl.html.

Acknowledgments

This work is supported by the ERC Consolidator Grant LEXICAL (648909). We thank all editors and reviewers for their helpful feedback and suggestions.
References

Martin Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Jon Shlens, Benoit Steiner, Ilya Sutskever, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Oriol Vinyals, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. CoRR, abs/1603.04467.

Oliver Adams, Adam Makarucha, Graham Neubig, Steven Bird, and Trevor Cohn. 2017. Cross-lingual word embeddings for low-resource language modeling. In Proceedings of EACL, pages 937–947.

Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual NLP. In Proceedings of CoNLL, pages 183–192.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.

Balthasar Bickel and Johanna Nichols. 2013. Inflectional Synthesis of the Verb. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Ondřej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2013. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the 8th Workshop on Statistical Machine Translation, pages 1–44.

Jan A. Botha and Phil Blunsom. 2014. Compositional morphology for word representations and language modelling. In Proceedings of ICML, pages 1899–1907.

Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Return of the devil in the details: Delving deep into convolutional nets. In Proceedings of BMVC.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, and Phillipp Koehn. 2014. One billion word benchmark for measuring progress in statistical language modeling. In Proceedings of INTERSPEECH, pages 2635–2639.

Xie Chen, Xunying Liu, Yanmin Qian, M. J. F. Gales, and Philip C. Woodland. 2016. CUED-RNNLM: An open-source toolkit for efficient training and evaluation of recurrent neural network language models. In Proceedings of ICASSP, pages 6000–6004.

Ryan Cotterell, Hinrich Schütze, and Jason Eisner. 2016. Morphological smoothing and extrapolation of word embeddings. In Proceedings of ACL, pages 1651–1660.

Koen Deschacht and Marie-Francine Moens. 2009. Semi-supervised semantic role labeling using the latent words language model. In Proceedings of EMNLP, pages 21–29.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159.

Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard H. Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL-HLT, pages 1606–1615.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. Bradford Books.

Katja Filippova, Enrique Alfonseca, Carlos A. Colmenares, Lukasz Kaiser, and Oriol Vinyals. 2015. Sentence compression by deletion with LSTMs. In Proceedings of EMNLP, pages 360–368.

Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The Paraphrase Database. In Proceedings of NAACL-HLT, pages 758–764.

Joshua T. Goodman. 2001. A bit of progress in language modeling. Computer Speech & Language, 15(4):403–434.

Edouard Grave, Moustapha Cissé, and Armand Joulin. 2017. Unbounded cache model for online language modeling with open vocabulary. In Proceedings of NIPS, pages 6044–6054.

Alex Graves. 2013. Generating sequences with recurrent neural networks. CoRR, abs/1308.0850.

Derek Greene and Padraig Cunningham. 2006. Practical solutions to the problem of diagonal dominance in kernel document clustering. In Proceedings of ICML, pages 377–384.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable modified Kneser-Ney language model estimation. In Proceedings of ACL, pages 690–696.

Harold Stanley Heaps. 1978. Information Retrieval, Computational and Theoretical Aspects. Academic Press.

Gustav Herdan. 1960. Type-token Mathematics, volume 4. Mouton.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation, 9(8):1735–1780.

Marcus Hutter. 2012. The human knowledge compression contest.

Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. In Proceedings of ICML.

Dan Jurafsky and James H. Martin. 2017. Speech and Language Processing, volume 3. Pearson.
Kazuya Kawakami, Chris Dyer, and Phil Blunsom. 2017. Learning to create and reuse words in open-vocabulary neural language modeling. In Proceedings of ACL, pages 1492–1502.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In Proceedings of AAAI, pages 2741–2749.

Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for M-gram language modeling. In Proceedings of ICASSP, pages 181–184.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the 10th Machine Translation Summit, pages 79–86.

Yann LeCun, Bernhard E. Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne E. Hubbard, and Lawrence D. Jackel. 1989. Handwritten digit recognition with a back-propagation network. In Proceedings of NIPS, pages 396–404.

Wang Ling, Tiago Luís, Luís Marujo, Ramón Fernández Astudillo, Silvio Amir, Chris Dyer, Alan W. Black, and Isabel Trancoso. 2015. Finding function in form: Compositional character models for open vocabulary word representation. In Proceedings of EMNLP, pages 1520–1530.

Minh-Thang Luong and Christopher D. Manning. 2016. Achieving open vocabulary neural machine translation with hybrid word-character models. In Proceedings of ACL, pages 1054–1063.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of ACL, pages 142–150.

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proceedings of INTERSPEECH, pages 1045–1048.

Yasumasa Miyamoto and Kyunghyun Cho. 2016. Gated word-character recurrent language model. In Proceedings of EMNLP, pages 1992–1997.

Nikola Mrkšić, Ivan Vulić, Diarmuid Ó Séaghdha, Ira Leviant, Roi Reichart, Milica Gašić, Anna Korhonen, and Steve Young. 2017. Semantic specialisation of distributional word vector spaces using monolingual and cross-lingual constraints. Transactions of the ACL, 5:309–324.

Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of ICML, pages 807–814.

Nelleke Oostdijk. 2000. The spoken Dutch corpus. Overview and first evaluation. In Proceedings of LREC, pages 887–894.

Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models. In Proceedings of EACL, pages 157–163.

Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of AAAI, pages 3776–3784.

Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Highway networks. In Proceedings of the ICML Deep Learning Workshop.

Martin Sundermeyer, Hermann Ney, and Ralf Schluter. 2015. From feedforward to recurrent LSTM neural networks for language modeling. IEEE Transactions on Audio, Speech and Language Processing, 23(3):517–529.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of LREC, pages 2214–2218.

Clara Vania and Adam Lopez. 2017. From characters to words to in between: Do we capture morphology? In Proceedings of ACL, pages 2016–2027.

Ashish Vaswani, Yinggong Zhao, Victoria Fossum, and David Chiang. 2013. Decoding with large-scale neural language models improves translation. In Proceedings of EMNLP, pages 1387–1392.

Lyan Verwimp, Joris Pelemans, Hugo Van hamme, and Patrick Wambacq. 2017. Character-word LSTM language models. In Proceedings of EACL, pages 417–427.

Ivan Vulić, Nikola Mrkšić, Roi Reichart, Diarmuid Ó Séaghdha, Steve Young, and Anna Korhonen. 2017. Morph-fitting: Fine-tuning word vector spaces with simple language-specific rules. In Proceedings of ACL, pages 56–68.

Tian Wang and Kyunghyun Cho. 2016. Larger-context language modelling with recurrent neural network. In Proceedings of ACL, pages 1319–1329.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. From paraphrase database to compositional paraphrase model and back. Transactions of the ACL, 3:345–358.

Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2015. Recurrent neural network regularization. In Proceedings of ICLR.

Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In Proceedings of ECCV, pages 818–833.

George Kingsley Zipf. 1949. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Martino Fine Books.