Transactions of the Association for Computational Linguistics, vol. 6, pp. 529–541, 2018. Action Editor: Holger Schwenk.
Submission batch: 8/2017; Revision batch: 1/2018; Published 8/2018.
© 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 Licence.
Neural Lattice Language Models

Jacob Buckman
Language Technologies Institute
Carnegie Mellon University
jacobbuckman@gmail.com

Graham Neubig
Language Technologies Institute
Carnegie Mellon University
gneubig@cs.cmu.edu

Abstract

In this work, we propose a new language modeling paradigm that has the ability to perform both prediction and moderation of information flow at multiple granularities: neural lattice language models. These models construct a lattice of possible paths through a sentence and marginalize across this lattice to calculate sequence probabilities or optimize parameters. This approach allows us to seamlessly incorporate linguistic intuitions – including polysemy and the existence of multi-word lexical items – into our language model. Experiments on multiple language modeling tasks show that English neural lattice language models that utilize polysemous embeddings are able to improve perplexity by 9.95% relative to a word-level baseline, and that a Chinese model that handles multi-character tokens is able to improve perplexity by 20.94% relative to a character-level baseline.

1 Introduction

Neural network models have recently contributed towards a great amount of progress in natural language processing. These models typically share a common backbone: recurrent neural networks (RNN), which have proven themselves to be capable of tackling a variety of core natural language processing tasks (Hochreiter and Schmidhuber, 1997; Elman, 1990). One such task is language modeling, in which we estimate a probability distribution over sequences of tokens that corresponds to observed sentences (§2). Neural language models, particularly models conditioned on a particular input, have many applications including in machine translation (Bahdanau et al., 2016), abstractive summarization (Chopra et al., 2016), and speech processing (Graves et al., 2013).

[Figure 1: Lattice decomposition of a sentence ("dogs chased the small cat") and its corresponding lattice language model probability calculation.]

Similarly, state-of-the-art language models are almost universally based on RNNs, particularly long short-term memory (LSTM) networks (Jozefowicz et al., 2016; Inan et al., 2017; Merity et al., 2016). While powerful, LSTM language models usually do not explicitly model many commonly-accepted linguistic phenomena. As a result, standard models lack linguistically informed inductive biases, potentially limiting their accuracy, particularly in low-data scenarios (Adams et al., 2017; Koehn and Knowles, 2017). In this work, we present a novel modification to the standard LSTM language modeling framework that allows us to incorporate some varieties of these linguistic intuitions seamlessly: neural lattice language models (§3.1). Neural lattice language models define a lattice over possible paths through a sentence, and maximize the marginal probability over all paths that lead to generating the reference sentence, as shown in Fig. 1. Depending on how we define these paths, we can incorporate different assumptions about how language should be modeled.

In the particular instantiations of neural lattice language models covered by this paper, we focus on two properties of language that could potentially be of use in language modeling: the existence of multi-word lexical units (Zgusta, 1967) (§4.1) and polysemy (Ravin and Leacock, 2000) (§4.2).
Neural lattice language models allow the model to incorporate these aspects in an end-to-end fashion by simply adjusting the structure of the underlying lattices.

We run experiments to explore whether these modifications improve the performance of the model (§5). Additionally, we provide qualitative visualizations of the model to attempt to understand what types of multi-token phrases and polysemous embeddings have been learned.

2 Background

2.1 Language Models

Consider a sequence $X$ for which we want to calculate its probability. Assume we have a vocabulary from which we can select a unique list of $|X|$ tokens $x_1, x_2, \ldots, x_{|X|}$ such that $X = [x_1; x_2; \ldots; x_{|X|}]$, i.e. the concatenation of the tokens (with an appropriate delimiter). These tokens can be either on the character level (Hwang and Sung, 2017; Ling et al., 2015) or word level (Inan et al., 2017; Merity et al., 2016). Using the chain rule, language models generally factorize $p(X)$ in the following way:

$$p(X) = p(x_1, x_2, \ldots, x_{|X|}) = \prod_{t=1}^{|X|} p(x_t \mid x_1, x_2, \ldots, x_{t-1}). \quad (1)$$

Note that this factorization is exact only in the case where the segmentation is unique. In character-level models, it is easy to see that this property is maintained, because each token is unique and non-overlapping. In word-level models, this also holds, because tokens are delimited by spaces, and no word contains a space.

2.2 Recurrent Neural Networks

Recurrent neural networks have emerged as the state-of-the-art approach to approximating $p(X)$. In particular, the LSTM cell (Hochreiter and Schmidhuber, 1997) is a specific RNN architecture which has been shown to be effective on many tasks, including language modeling (Press and Wolf, 2017; Jozefowicz et al., 2016; Merity et al., 2016; Inan et al., 2017).¹ LSTM language models recursively calculate the hidden and cell states ($h_t$ and $c_t$ respectively) given the input embedding $e_{t-1}$ corresponding to token $x_{t-1}$:

$$h_t, c_t = \mathrm{LSTM}(h_{t-1}, c_{t-1}, e_{t-1}, \theta), \quad (2)$$

then calculate the probability of the next token given the hidden state, generally by performing an affine transform parameterized by $W$ and $b$, followed by a softmax:

$$p(x_t \mid h_t) := \mathrm{softmax}(W h_t + b). \quad (3)$$

¹In this work, we utilize an LSTM with linked input and forget gates, as proposed by Greff et al. (2016).

3 Neural Lattice Language Models

3.1 Language Models with Ambiguous Segmentations

To reiterate, the standard formulation of language modeling in the previous section requires splitting sentence $X$ into a unique set of tokens $x_1, \ldots, x_{|X|}$. Our proposed method generalizes the previous formulation to remove the requirement of uniqueness of segmentation, similar to that used in non-neural n-gram language models such as Dupont and Rosenfeld (1997) and Goldwater et al. (2007).

First, we define some terminology. We use the term "token", designated by $x_i$, to describe any indivisible item in our vocabulary that has no other vocabulary item as its constituent part. We use the term "chunk", designated by $k_i$ or $x_i^j$, to describe a sequence of one or more tokens that represents a portion of the full string $X$, containing the unit tokens $x_i$ through $x_j$: $x_i^j = [x_i; x_{i+1}; \ldots; x_j]$. We also refer to the "token vocabulary", which is the subset of the vocabulary containing only tokens, and to the "chunk vocabulary", which similarly contains all chunks.

Note that we can factorize the probability of any sequence of chunks $K$ using the chain rule, in precisely the same way as sequences of tokens:

$$p(K) = p(k_1, k_2, \ldots, k_{|K|}) = \prod_{t=1}^{|K|} p(k_t \mid k_1, k_2, \ldots, k_{t-1}). \quad (4)$$
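As a concrete illustration of Eqs. (2)–(4), the snippet below sketches how a single, fixed chunk sequence could be scored by an LSTM language model with one embedding per chunk. It is a minimal sketch rather than the paper's implementation: the class and variable names are invented for illustration, PyTorch is assumed purely for convenience, and a standard LSTM cell stands in for the linked-gate variant mentioned in footnote 1.

```python
# Minimal sketch of Eqs. (2)-(4): scoring one fixed chunk sequence K with an
# LSTM language model over chunk embeddings. All names are illustrative; this
# is not the paper's implementation.
import torch
import torch.nn as nn

class ChunkLSTMLM(nn.Module):
    def __init__(self, chunk_vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(chunk_vocab_size, emb_dim)  # one vector per chunk
        self.cell = nn.LSTMCell(emb_dim, hidden_dim)           # Eq. (2)
        self.out = nn.Linear(hidden_dim, chunk_vocab_size)     # W and b of Eq. (3)

    def sequence_logp(self, chunk_ids, bos_id=0):
        """Return log p(K) for one chunk sequence, following Eq. (4)."""
        h = torch.zeros(1, self.cell.hidden_size)
        c = torch.zeros(1, self.cell.hidden_size)
        prev = torch.tensor([bos_id])                    # start-of-sentence chunk
        total = torch.zeros(())
        for k in chunk_ids:
            h, c = self.cell(self.embed(prev), (h, c))   # h_t, c_t
            log_p = torch.log_softmax(self.out(h), -1)   # log p(k_t | h_t)
            total = total + log_p[0, k]                  # accumulate log-probability
            prev = torch.tensor([k])
        return total
```

Summing the per-chunk log-probabilities gives the probability of one segmentation only; the remainder of this section is concerned with marginalizing that quantity over all valid segmentations.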
We can factorize the overall probability of a token list $X$ in terms of its chunks by using the chain rule, and marginalizing over all segmentations. For any particular token list $X$, we define a set of valid segmentations $S(X)$, such that for every sequence $s \in S(X)$, $X = [x_{s_0}^{s_1-1}; x_{s_1}^{s_2-1}; \ldots; x_{s_{|s|-1}}^{s_{|s|}-1}]$. The factorization is:

$$p(X) = \sum_{S} p(X, S) = \sum_{S} p(X \mid S)\, p(S) = \sum_{S \in S(X)} p(S) = \sum_{S \in S(X)} \prod_{t=1}^{|S|} p\!\left(x_{s_{t-1}}^{s_t - 1} \,\middle|\, x_{s_0}^{s_1-1}, x_{s_1}^{s_2-1}, \ldots, x_{s_{t-2}}^{s_{t-1}-1}\right). \quad (5)$$

Note that, by definition, there exists a unique segmentation of $X$ such that $x_1, x_2, \ldots$ are all tokens, in which case $|S| = |X|$. When only that one unique segmentation is allowed per $X$, $S$ contains only that one element, so the summation drops out, and therefore for standard character-level and word-level models, Eq. (5) reduces to Eq. (4), as desired. However, for models that license multiple segmentations per $X$, computing this marginalization directly is generally intractable. For example, consider segmenting a sentence using a vocabulary containing all words and all 2-word expressions. The size of $S$ would grow exponentially with the number of words in $X$, meaning we would have to marginalize over trillions of unique segmentations for even modestly-sized sentences.

3.2 Lattice Language Models

To avoid this, it is possible to re-organize the computations in a lattice, which allows us to dramatically reduce the number of computations required (Dupont and Rosenfeld, 1997; Neubig et al., 2010). All segmentations of $X$ can be expressed as the edges of paths through a lattice over token-level prefixes of $X$: $x_{<1}, x_{<2}, \ldots, X$. The infimum is the empty prefix $x_{<1}$; the supremum is $X$; an edge from prefix $x_{<i}$ to prefix $x_{<j}$ corresponds to the chunk $x_i^{j-1}$.
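To make this reorganization concrete, the sketch below computes Eq. (5) by a forward pass over prefixes, under the simplifying assumption that each chunk's probability depends only on the chunk and the preceding tokens (a placeholder `chunk_logp` function) rather than on a recurrent state carried along the path; conditioning on an LSTM state, as the full model does, is what makes the lattice computation non-trivial. The function and parameter names (`sentence_logp`, `max_len`) are illustrative only.

```python
import math

def sentence_logp(tokens, chunk_vocab, chunk_logp, max_len=2):
    """log p(X), marginalized over all segmentations of `tokens` into chunks
    of up to `max_len` tokens that appear in `chunk_vocab` (Eq. 5)."""
    n = len(tokens)
    # alpha[j] = log of the total probability of all paths covering tokens[:j],
    # i.e. the prefix containing the first j tokens; alpha[0] is the empty prefix.
    alpha = [float("-inf")] * (n + 1)
    alpha[0] = 0.0
    for j in range(1, n + 1):
        scores = []
        for i in range(max(0, j - max_len), j):
            chunk = tuple(tokens[i:j])
            if chunk in chunk_vocab and math.isfinite(alpha[i]):
                # one lattice edge: prefix of length i --chunk--> prefix of length j
                scores.append(alpha[i] + chunk_logp(tokens[:i], chunk))
        if scores:  # log-sum-exp over all incoming edges of this node
            m = max(scores)
            alpha[j] = m + math.log(sum(math.exp(s - m) for s in scores))
    return alpha[n]
```

With a chunk vocabulary containing, say, all words and all two-word expressions (`max_len = 2`), the number of edge evaluations grows linearly in $|X|$ rather than requiring enumeration of the exponentially many segmentations described above.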
4.2.2 Modeling Strategy

For our polysemy experiments, the underlying lattices are multi-lattices: lattices which are also multigraphs, and can have any number of edges between any given pair of nodes (Fig. 2, d). Lattices set up in this manner allow us to incorporate multiple embeddings for each word. Within a single sentence, any pair of nodes corresponds to the start and end of a particular subsequence of the full sentence, and is thus associated with a specific token. Each edge between them is a unique embedding for that token. While many strategies for choosing the number of embeddings exist in the literature (Neelakantan et al., 2014), in this work, we choose a number of embeddings $E$ and assign that many embeddings to each word. This ensures that the maximum in-degree of any node in the lattice, $D$, is no greater than $E$, giving us the time bound $O(E|X|)$.

In this work, we do not explore models that include both chunk vocabularies and multiple embeddings. However, combining these two techniques, as well as exploring other, more complex lattice structures, is an interesting avenue for future work.
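As a small illustration of the multi-lattice described above, the sketch below enumerates the edges of such a lattice for one sentence, with $E$ parallel edges (one per embedding) between each pair of adjacent prefix nodes; the names are invented for illustration, and the hidden-state computation over these edges is not shown.

```python
# Minimal sketch of the multi-lattice used in the polysemy setting: each token
# contributes E parallel edges (one per embedding) between consecutive prefix
# nodes, so the in-degree of any node is at most E and the total number of
# edges is E * |X|, matching the O(E|X|) bound. Names are illustrative.
from collections import namedtuple

Edge = namedtuple("Edge", ["src", "dst", "token", "embedding_id"])

def build_multi_lattice(tokens, num_embeddings):
    """Edges of a multi-lattice over prefix nodes 0 .. len(tokens)."""
    edges = []
    for i, tok in enumerate(tokens):
        for e in range(num_embeddings):
            edges.append(Edge(src=i, dst=i + 1, token=tok, embedding_id=e))
    return edges

# For example, with E = 2 embeddings per word:
lattice = build_multi_lattice(["dogs", "chased", "the", "small", "cat"], num_embeddings=2)
assert len(lattice) == 2 * 5  # E * |X| edges; every node has in-degree <= 2
```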
5 Experiments

5.1 Data

We perform experiments on two languages: English and Chinese, which provide an interesting contrast in linguistic features.⁴

In English, the most common recent benchmark for language modeling is the Penn Treebank, specifically the version preprocessed by Tomáš Mikolov (2010). However, this corpus is limited by being relatively small, containing only approximately 45,000 sentences, which we found to be insufficient to effectively train lattice language models.⁵ Thus, we instead used the Billion Word Corpus (Chelba et al., 2014). Past experiments on the BWC typically modeled every word without restricting the vocabulary, which results in a number of challenges regarding the modeling of open vocabularies that are orthogonal to this work. Thus, we create a preprocessed version of the data in the same manner as Mikolov, lowercasing the words, replacing numbers with

⁴Code to reproduce datasets and experiments is available at: http://github.com/jbuckman/neural-lattice-language-models
⁵Experiments using multi-word units resulted in overfitting, regardless of normalization and hyperparameter settings.

[Figure 3: Segmentation of three sentences randomly sampled from the test corpus, using L = 2. Green numbers show the probability assigned to token sizes; for example, the first three words of the first sentence have a 59% and 41% chance of two different segmentations of "please let me". Boxes around words show the greedy segmentation.]

[Table 4: Comparison of randomly-selected contexts of several words selected from the vocabulary of the Billion Word Corpus (e.g., rock1 and rock2), in which the model preferred one embedding over the other.]
text in which they appear. In some cases, like profile and edition, one of the two embeddings simply captures an idiosyncrasy of the training data.

Additionally, for some words, such as rodham in Table 4, the system always prefers one embedding. This is promising, because it means that in future work it may be possible to further improve accuracy and training efficiency by assigning more embeddings to polysemous words, instead of assigning the same number of embeddings to all words.

6 Related Work

Past work that utilized lattices in neural models for natural language processing centers around using these lattices in the encoder portion of machine translation. Su et al. (2016) utilized a variation of the Gated Recurrent Unit (GRU) that operated over lattices, and preprocessed lattices over Chinese characters that allowed it to effectively encode multiple segmentations. Additionally, Sperber et al. (2017) proposed a variation of the TreeLSTM with the goal of creating an encoder over speech lattices in speech-to-text. Our work tackles language modeling rather than encoding, and thus addresses the issue of marginalization over the lattice.

Another recent work which marginalized over multiple paths through a sentence is Ling et al. (2016). The authors tackle the problem of code generation, where some components of the code can be copied from the input, via a neural network. Our work expands on this by handling multi-word tokens as input to the neural network, rather than passing in one token at a time.

Neural lattice language models improve accuracy by helping the gradient flow over smaller paths, preventing vanishing gradients. Many hierarchical neural language models have been proposed with a similar objective (Koutnik et al., 2014; Zhou et al., 2017). Our work is distinguished from these by the use of latent token-level segmentations that capture meaning directly, rather than simply being high-level mechanisms to encourage gradient flow.

Chan et al. (2017) propose a model for predicting characters at multiple granularities in the decoder segment of a machine translation system. Our work expands on theirs by considering the entire lattice at once, rather than considering only a single path through the lattice via ancestral sampling. This allows us to train end-to-end without the model collapsing to a local minimum, with no exploration bonus needed. Additionally, we propose a broader class of models, including those incorporating polysemous words, and apply our model to the task of word-level language modeling, rather than character-level transcription.

Concurrently to this work, van Merriënboer et al. (2017) have proposed a neural language model that can similarly handle multiple scales. Our work is differentiated in that it is more general: utilizing an open multi-token vocabulary, proposing multiple techniques for hidden state calculation, and handling polysemy using multi-embedding lattices.

7 Future Work

In the future, we would like to experiment with utilizing neural lattice language models in extrinsic evaluation, such as machine translation and speech recognition. Additionally, in the current model, the non-compositional embeddings must be selected a priori, and may be suboptimal. We are exploring techniques to store fixed embeddings dynamically, so that the non-compositional phrases can be selected as part of the end-to-end training.

8 Conclusion

In this work, we have introduced the idea of a neural lattice language model, which allows us to marginalize over all segmentations of a sentence in an end-to-end fashion. In our experiments on the Billion Word Corpus and Chinese GigaWord corpus, we demonstrated that the neural lattice language model beats an LSTM-based baseline at the task of language modeling, both when it is used to incorporate multiple-word phrases and multiple-embedding words. Qualitatively, we observed that the latent segmentations generated by the model correspond well to human intuition about multi-word phrases, and that the varying usage of words with multiple embeddings also seems to be sensible.

Acknowledgements

The authors would like to thank Holger Schwenk, Kristina Toutanova, Cindy Robinson, and all the reviewers of this work for their invaluable feedback.
References

Oliver Adams, Adam Makarucha, Graham Neubig, Steven Bird, and Trevor Cohn. 2017. Cross-lingual word embeddings for low-resource language modeling. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 937–947.
Ben Athiwaratkun and Andrew Wilson. 2017. Multimodal word distributions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1645–1656.
Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio. 2016. End-to-end attention-based large-vocabulary speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 4945–4949. IEEE.
Colin Bannard and Danielle Matthews. 2008. Stored word sequences in language learning: The effect of familiarity on children's repetition of four-word combinations. Psychological Science, 19(3):241–248.
William Chan, Yu Zhang, Quoc Le, and Navdeep Jaitly. 2017. Latent sequence decompositions. 5th International Conference on Learning Representations.
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2014. One billion word benchmark for measuring progress in statistical language modeling. Interspeech.
Xinchi Chen, Xipeng Qiu, Jingxiang Jiang, and Xuanjing Huang. 2015. Gaussian mixture embeddings for multiple word prototypes. CoRR, abs/1511.06246.
Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 93–98.
Pierre Dupont and Ronald Rosenfeld. 1997. Lattice based language models. Technical report, DTIC Document.
Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 199–209.
Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.
Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1019–1027.
Sharon Goldwater, Thomas L. Griffiths, Mark Johnson, et al. 2007. Distributional cues to word boundaries: Context is important. In H. Caunt-Nulton, S. Kulatilake, and I. Woo, editors, BUCLD 31: Proceedings of the 31st Annual Boston University Conference on Language Development, pages 239–250. Somerville, Massachusetts: Cascadilla Press.
Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645–6649. IEEE.
Klaus Greff, Rupesh K. Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. 2016. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
Kyuyeon Hwang and Wonyong Sung. 2017. Character-level language modeling with hierarchical recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 5720–5724. IEEE.
Hakan Inan, Khashayar Khosravi, and Richard Socher. 2017. Tying word vectors and word classifiers: A loss framework for language modeling. 5th International Conference on Learning Representations.
Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical reparameterization with Gumbel-Softmax. 5th International Conference on Learning Representations.
Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. arXiv:1602.02410.
Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39.
Jan Koutnik, Klaus Greff, Faustino Gomez, and Juergen Schmidhuber. 2014. A clockwork RNN. Proceedings of Machine Learning Research.
Wang Ling, Isabel Trancoso, Chris Dyer, and Alan W. Black. 2015. Character-based neural machine translation. CoRR, abs/1511.04586.
Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, Andrew Senior, Fumin Wang, and Phil Blunsom. 2016. Latent predictor networks for code generation. Association for Computational Linguistics.
Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. 2017. The concrete distribution: A continuous relaxation of discrete random variables. 5th International Conference on Learning Representations.
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. 4th International Conference on Learning Representations.
Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2014. Efficient nonparametric estimation of multiple embeddings per word in vector space. In Proceedings of EMNLP.
Graham Neubig, Masato Mimura, Shinsuke Mori, and Tatsuya Kawahara. 2010. Learning a language model from continuous speech. In INTERSPEECH, pages 1053–1056.
Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, et al. 2017. DyNet: The dynamic neural network toolkit. arXiv preprint arXiv:1701.03980.
Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models. 5th International Conference on Learning Representations.
Yael Ravin and Claudia Leacock. 2000. Polysemy: Theoretical and Computational Approaches. OUP Oxford.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. Association for Computational Linguistics.
Anna Siyanova-Chanturia, Kathy Conklin, and Norbert Schmitt. 2011. Adding more fuel to the fire: An eye-tracking study of idiom processing by native and non-native speakers. Second Language Research, 27(2):251–272.
Matthias Sperber, Graham Neubig, Jan Niehues, and Alex Waibel. 2017. Neural lattice-to-sequence models for uncertain inputs. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1380–1389.
Jinsong Su, Zhixing Tan, Deyi Xiong, and Yang Liu. 2016. Lattice-based recurrent neural network encoders for neural machine translation. CoRR, abs/1609.07730, ver. 2.
Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. Association for Computational Linguistics.
Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Honza Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. Proceedings of the 11th Annual Conference of the International Speech Communication Association, pages 1045–1048.
Bart van Merriënboer, Amartya Sanyal, Hugo Larochelle, and Yoshua Bengio. 2017. Multiscale sequence modeling with a learned dictionary. arXiv preprint arXiv:1707.00762.
Ladislav Zgusta. 1967. Multiword lexical units. Word, 23(1-3):578–587.
Hao Zhou, Zhaopeng Tu, Shujian Huang, Xiaohua Liu, Hang Li, and Jiajun Chen. 2017. Chunk-based bi-scale decoder for neural machine translation. Association for Computational Linguistics.

A Large-Scale Experiments

To verify that our findings scale to state-of-the-art language models, we also compared a baseline model, dense lattices of size 1 and 2, and a multi-lattice with 2 embeddings per word on the full byte-pair encoded Billion Word Corpus.

In this set of experiments, we take the full Billion Word Corpus and apply byte-pair encoding as described by Sennrich et al. (2015) to construct a vocabulary of 10,000 sub-word tokens. Our model consists of three LSTM layers, each with 1500 hidden units. We train the model for a single epoch over the corpus, using the Adam optimizer with learning rate .0001 on a P100 GPU. We use a batch size of 40, and variational dropout of 0.1. The 10,000 sub-word embeddings each had dimension 600. For lattice models, chunk vocabularies were selected by taking the 10,000 sub-words in the vocabulary and adding the most common 10,000 n-grams with 1