Transactions of the Association for Computational Linguistics, vol. 4, pp. 477–490, 2016. Action Editor: Brian Roark.
Submission batch: 1/2016; Revision batch: 6/2016; Published 9/2016.

© 2016 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees

Ehsan Shareghi,♭ Matthias Petri,♮ Gholamreza Haffari♭ and Trevor Cohn♮
♭ Faculty of Information Technology, Monash University
♮ Computing and Information Systems, The University of Melbourne
first.last@{monash.edu,unimelb.edu.au}

Abstract

Efficient methods for storing and querying are critical for scaling high-order m-gram language models to large corpora. We propose a language model based on compressed suffix trees, a representation that is highly compact and can be easily held in memory, while supporting queries needed in computing language model probabilities on-the-fly. We present several optimisations which improve query runtimes up to 2500×, despite only incurring a modest increase in construction time and memory usage. For large corpora and high Markov orders, our method is highly competitive with the state-of-the-art KenLM package. It imposes much lower memory requirements, often by orders of magnitude, and has runtimes that are either similar (for training) or comparable (for querying).

1 Introduction

Language models (LMs) are fundamental to many NLP tasks, including machine translation and speech recognition. Statistical LMs are probabilistic models that assign a probability to a sequence of words w_1^N, indicating how likely the sequence is in the language. m-gram LMs are popular, and prove to be accurate when estimated using large corpora. In these LMs, the probabilities of m-grams are often precomputed and stored explicitly.

Although widely successful, current m-gram LM approaches are impractical for learning high-order LMs on large corpora, due to their poor scaling properties in both training and query phases. Prevailing methods (Heafield, 2011; Stolcke et al., 2011) precompute all m-gram probabilities, and consequently need to store and access as many as hundreds of billions of m-grams for a typical moderate-order LM. Recent research has attempted to tackle scalability issues through the use of efficient data structures such as tries and hash-tables (Heafield, 2011; Stolcke et al., 2011), lossy compression (Talbot and Osborne, 2007; Levenberg and Osborne, 2009; Guthrie and Hepple, 2010; Pauls and Klein, 2011; Church et al., 2007), compact data structures (Germann et al., 2009; Watanabe et al., 2009; Sorensen and Allauzen, 2011), and distributed computation (Heafield et al., 2013; Brants et al., 2007). Fundamental to all the widely used methods is the precomputation of all probabilities, hence they do not provide an adequate trade-off between space and time for high m, both during training and querying. Exceptions are Kennington et al. (2012) and Zhang and Vogel (2006), who use a suffix-tree or suffix-array over the text for computing the sufficient statistics on-the-fly.

In our previous work (Shareghi et al., 2015), we extended this line of research using a Compressed Suffix Tree (CST) (Ohlebusch et al., 2010), which provides a considerably more compact searchable means of storing the corpus than an uncompressed suffix array or suffix tree. This approach showed favourable scaling properties with m and had only a modest memory requirement. However, the method only supported Kneser-Ney smoothing, not its modified variant (Chen and Goodman, 1999), which overall performs better and has become the de-facto standard. Additionally, querying was significantly slower than for leading LM toolkits, making the method impractical for widespread use.

In this paper we extend Shareghi et al. (2015) to support modified Kneser-Ney smoothing, and present new optimisation methods for fast construction and querying.¹

¹ https://github.com/eehsan/cstlm

Critical to our approach are:

• Precomputation of several modified counts, which would be very expensive to compute at query time. To orchestrate this, a subset of the CST nodes is selected based on the cost of computing their modified counts (which relates to the branching factor of a node). The precomputed counts are then stored in a compressed data structure supporting efficient memory usage and lookup.

• Re-use of CST nodes within m-gram probability computation as a sentence gets scored left-to-right, thus saving many expensive lookups.

Empirical comparison against our earlier work (Shareghi et al., 2015) shows the significance of each of these optimisations. The strengths of our method are apparent when applied to very large training datasets (≥ 16 GiB) and for high-order models, m ≥ 5. In this setting, while our approach is more memory efficient than the leading KenLM model in both the construction (training) and querying (testing) phases, we are highly competitive in terms of the runtimes of both phases. When memory is a limiting factor at query time, our approach is orders of magnitude faster than the state of the art. Moreover, our method allows for efficient querying with an unlimited Markov order, m → ∞, without resorting to approximations or heuristics.

2 Modified Kneser-Ney Language Model

In an m-gram language model, the probability of a sentence is decomposed into ∏_{i=1}^{N} P(w_i | w_{i−m+1}^{i−1}), where P(w_i | w_{i−m+1}^{i−1}) is the conditional probability of the next word given its finite history. Smoothing techniques are employed to deal with sparsity when estimating the parameters of P(w_i | w_{i−m+1}^{i−1}). A comprehensive comparison of different smoothing techniques is provided in Chen and Goodman (1999). We focus on interpolated Modified Kneser-Ney (MKN) smoothing, which is widely regarded as a state-of-the-art technique and is supported by popular language modelling toolkits, e.g. SRILM (Stolcke, 2002) and KenLM (Heafield, 2011). The recursion is given in Figure 1:

  P_m(w|ux) = [c(uxw) − D_m(c(uxw))]+ / c(ux) + γ_m(ux)/c(ux) · P̄_{m−1}(w|x)

  P̄_k(w|ux) = [N1+(·uxw) − D_k(N1+(·uxw))]+ / N1+(·ux·) + γ_k(ux)/N1+(·ux·) · P̄_{k−1}(w|x)

  P̄_0(w|ε) = [N1+(·w) − D_1(N1+(·w))]+ / N1+(··) + γ(ε)/N1+(··) × 1/σ

  γ_k(ux) = Σ_{i∈{1,2,3+}} D_k(i) N_i(ux·)   if k = m
  γ_k(ux) = Σ_{i∈{1,2,3+}} D_k(i) N′_i(ux·)  if k < m

Figure 1: Interpolated Modified Kneser-Ney smoothing, where [x]+ denotes max(x, 0). The discounts are estimated from the count-of-counts statistics following Chen and Goodman (1999), D_k(i) = i − (i+1) Y_k n_{i+1}(k)/n_i(k) with Y_k = n_1(k)/(n_1(k) + 2n_2(k)), where n_i(k) is the number of distinct k-grams occurring exactly i times (and n̄_i(k) its analogue under continuation counts, used for the lower-order distributions).
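To make the recursion concrete, the following is a minimal self-contained Python sketch of the Figure 1 formulation (our illustration, not the paper's CST-based implementation): the counts are gathered by brute force over a toy token list, and the discount table D and vocabulary size sigma are made-up inputs.

```python
from collections import Counter, defaultdict

def ngram_stats(tokens, max_m):
    """Brute-force statistics: raw counts c(g), distinct left neighbours
    N1+(.g), distinct right neighbours N1+(g.), and distinct surrounding
    pairs N1+(.g.). k = 0 yields the empty context, so pairs[()] is the
    number of distinct bigrams, N1+(..)."""
    n = len(tokens)
    c = Counter()
    left, right, pairs = defaultdict(set), defaultdict(set), defaultdict(set)
    for k in range(0, max_m + 1):
        for i in range(n - k + 1):
            g = tuple(tokens[i:i + k])
            c[g] += 1
            if i > 0:
                left[g].add(tokens[i - 1])
            if i + k < n:
                right[g].add(tokens[i + k])
            if 0 < i and i + k < n:
                pairs[g].add((tokens[i - 1], tokens[i + k]))
    return c, left, right, pairs

def prob_mkn(w, history, m, stats, D, sigma):
    """P(w | history) following the Figure 1 recursion, grown upwards from
    the unigram level (cf. Algorithm 3). D[k][i] holds D_k(1), D_k(2),
    D_k(3+); sigma is the vocabulary size."""
    c, left, right, pairs = stats
    def bucket3(ctx, value):
        # N_{1,2,3+}(ctx .)-style statistics over the successors of ctx
        ns = [0, 0, 0]
        for x in right[ctx]:
            v = value(ctx + (x,))
            if v > 0:
                ns[min(v, 3) - 1] += 1
        return ns
    def disc(k, count):
        return D[k][min(count, 3)] if count > 0 else 0.0
    # level 0: continuation unigram interpolated with the uniform 1/sigma
    num, den = len(left[(w,)]), len(pairs[()])
    ns = bucket3((), lambda g: len(left[g]))      # N'_{1,2,3+}(eps .)
    gamma = sum(D[1][i + 1] * ns[i] for i in range(3))
    p = (max(num - disc(1, num), 0) + gamma / sigma) / den
    # grow the context one word at a time until the m-gram is covered
    for k in range(2, m + 1):
        ctx = tuple(history[len(history) - k + 1:])
        if len(ctx) < k - 1 or c[ctx] == 0:
            break                                 # unseen context: keep longest match
        if k == m:                                # top level: raw counts
            num, den = c[ctx + (w,)], c[ctx]
            ns = bucket3(ctx, lambda g: c[g])
        else:                                     # lower levels: continuation counts
            num, den = len(left[ctx + (w,)]), len(pairs[ctx])
            ns = bucket3(ctx, lambda g: len(left[g]))
        if den == 0:
            break
        gamma = sum(D[k][i + 1] * ns[i] for i in range(3))
        p = (max(num - disc(k, num), 0) + gamma * p) / den
    return p

tokens = "# a b c a b c a b d b b c #".split()
stats = ngram_stats(tokens, max_m=4)
D = {k: {1: 0.5, 2: 0.75, 3: 0.9} for k in (1, 2, 3)}   # flat, made-up discounts
print(prob_mkn("c", ["a", "b"], 3, stats, D, sigma=6))
```

A real implementation would estimate D from the count-of-counts as in Figure 1; flat values are used here only to keep the example short.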
Here c(α) denotes the frequency of pattern α in the training corpus; N1+(·α) = |{w : c(wα) > 0}| counts the distinct words preceding α, with N1+(α·) and N1+(·α·) defined analogously for following words and for surrounding (left, right) word pairs; N_i(α·) is the number of distinct words w with c(αw) = i; and N′_i(α·) is its continuation-count analogue, counting words w with N1+(·αw) = i, for i ∈ {1, 2, 3+}. Table 1 shows the different types of quantities required for computing an example 4-gram MKN probability.

k   num. count                 denom. count          γ
4   c(Force is strong with)    c(Force is strong)    N{1,2,3+}(Force is strong ·)
3   N1+(·is strong with)       N1+(·is strong ·)     N′{1,2,3+}(is strong ·)
2   N1+(·strong with)          N1+(·strong ·)        N′{1,2,3+}(strong ·)
1   N1+(·with)                 N1+(··)               N′{1,2,3+}(ε ·)

Table 1: The main quantities required for computing P(with | Force, is, strong) under MKN.

Efficient computation of these quantities is challenging with limited memory and time resources, particularly when the order of the language model m is high and/or the training corpus is large. In this paper, we make use of advanced data structures to efficiently obtain the required quantities to answer probabilistic queries as they arrive. Our solution involves precomputing and caching the expensive quantities N1+(·α·), N1+(·α), N{1,2,3+}(α·) and N′{1,2,3+}(α·), which we explain in §4. We start in §3 by providing a review of the approach of Shareghi et al. (2015), upon which we base our work.

3 KN with Compressed Suffix Trees

3.1 Compressed Data Structures

Shareghi et al. (2015) proposed a method for Kneser-Ney (KN) language modelling based on on-the-fly probability computation from a compressed suffix tree (CST) (Ohlebusch et al., 2010). The CST emulates the functionality of the suffix tree (ST) (Weiner, 1973) using substantially less space. The suffix tree is a classical search index consisting of a rooted labelled search tree constructed from a text T of length n drawn from an alphabet of size σ. Each root-to-leaf path in the suffix tree corresponds to a suffix of T. The leaves, considered in left-to-right order, define the suffix array (SA) (Manber and Myers, 1993) such that the suffix T[SA[i], n−1] is lexicographically smaller than T[SA[i+1], n−1] for i ∈ [0, n−2]. Searching for a pattern α of length m in T can be achieved by finding the "highest" node v in the ST such that the path from the root to v is prefixed by α. All leaf nodes in the subtree starting at v correspond to the locations of α in T. This translates to finding the specific range SA[lb, rb] such that T[SA[j], SA[j]+m−1] = α for j ∈ [lb, rb], as illustrated in the ST and SA of Figure 2 (left).

While searching using the ST or the SA is efficient in theory, it requires large amounts of main memory. A CST reduces the space requirements of the ST by utilizing the compressibility of the Burrows-Wheeler transform (BWT) (Burrows and Wheeler, 1994). The BWT corresponds to a reversible permutation of the text, used in data compression tools such as BZIP2 to increase the compressibility of the input. The transform is defined as

  BWT[i] = T[(SA[i] − 1) mod n]   (1)

and is the core component of the FM-Index (Ferragina and Manzini, 2000), which is a subcomponent of a CST providing efficient search for locating arbitrary length patterns (m-grams), determining occurrence frequencies, etc. The key functionality provided by the FM-Index is the ability to efficiently determine the range SA[lb, rb] matching a given pattern α, as described above, without the need to store the ST or SA explicitly. This is achieved by iteratively processing α in reverse order using the BWT, which is usually referred to as backward-search.

The backward-search procedure utilizes the duality between the BWT and SA to iteratively determine SA[lb, rb] for suffixes of α. Suppose SA[sp_j, ep_j] corresponds to all suffixes in T matching α[j, m−1]. The range SA[sp_{j−1}, ep_{j−1}] matching α[j−1, m−1], with c = α[j−1], can be expressed as

  sp_{j−1} = C[c] + RANK(BWT, sp_j, c)
  ep_{j−1} = C[c] + RANK(BWT, ep_j + 1, c) − 1

where C[c] refers to the starting position in SA of all suffixes prefixed by c, and RANK(BWT, i, c) determines the number of occurrences of symbol c in the prefix BWT[0, i).

The operation RANK(BWT, i, c) (and its inverse operation SELECT(BWT, i, c)²) can be performed efficiently using a wavelet tree (Grossi et al., 2003) representation of the BWT. A wavelet tree is a versatile, space-efficient representation of a sequence which can efficiently support a variety of operations (Navarro, 2014).

² SELECT(BWT, i, c) returns the position of the ith occurrence of symbol c in BWT.
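The following self-contained sketch (ours, for illustration) runs backward-search over the Figure 2 example text, building the SA, BWT and C array naively; RANK is brute force here, standing in for the O(1)-per-level wavelet tree operation. Note that Python's default character ordering may sort the sentinels differently from the figure, but the resulting ranges coincide.

```python
from collections import Counter

def suffix_array(T):
    # naive O(n^2 log n) construction; fine for a toy text
    return sorted(range(len(T)), key=lambda i: T[i:])

def bwt(T, SA):
    # BWT[i] = T[(SA[i] - 1) mod n], as in eq. (1)
    return [T[(SA[i] - 1) % len(T)] for i in range(len(T))]

def c_array(T):
    # C[c] = number of symbols in T strictly smaller than c
    counts, C, total = Counter(T), {}, 0
    for sym in sorted(counts):
        C[sym], total = total, total + counts[sym]
    return C

def rank(S, i, c):
    # occurrences of c in the prefix S[0, i)
    return S[:i].count(c)

def backward_search(n, B, C, alpha):
    """SA range [lb, rb] of suffixes prefixed by alpha, processing alpha
    right to left; returns None if alpha does not occur."""
    lb, rb = 0, n - 1
    for ch in reversed(alpha):
        if ch not in C:
            return None
        lb = C[ch] + rank(B, lb, ch)
        rb = C[ch] + rank(B, rb + 1, ch) - 1
        if lb > rb:
            return None
    return lb, rb

T = "#abcabcabdbbc#$"                 # the running example of Figure 2
SA = suffix_array(T)
B = bwt(T, SA)
C = c_array(T)
print(backward_search(len(T), B, C, "b"))    # -> (6, 10), as in the text
print(backward_search(len(T), B, C, "ab"))   # -> (3, 5), i.e. SA[3, 5]
```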
The structure of the wavelet tree is derived by recursively decomposing the alphabet into subsets. At each level the alphabet is split into two subsets, based on which symbols in the sequence are assigned to the left and right child nodes respectively. Using compressed bitvector representations and Huffman codes to define the alphabet partitioning, the space usage of the wavelet tree and the associated RANK structures of the BWT is bounded by H_k(T)n + o(n log σ) bits (Grossi et al., 2003). Thus the space usage is proportional to the order-k entropy, H_k(T), of the text.

Figure 2: (a) Character-level suffix tree, suffix array (SA), and Burrows-Wheeler transform (BWT) for "#abcabcabdbbc#$" (as formulated in eq. 1). (b) The wavelet tree of the BWT and the computation of RANK(BWT, 8, a). The ordered alphabet symbols and their codewords are {$:000, #:001, a:01, b:100, c:101, d:11}, and the symbols "#" and "$" mark sentence and file boundaries. The red bounding boxes and digits signify the path for computing RANK(BWT, 8, a).

Figure 2 (right) shows a sample wavelet tree representation. Using the wavelet tree structure, RANK over a sequence drawn from an alphabet of size σ can be reduced to log σ binary RANK operations, which can be answered efficiently in constant time (Jacobson, 1989). The range SA[lb, rb] corresponding to a pattern α can thus be determined in O(m log σ) time using a wavelet tree of the BWT. In addition to the FM-index, a CST efficiently stores the tree topology of the ST to emulate tree operations efficiently (Ohlebusch et al., 2010).

3.2 Computing KN modified counts

Shareghi et al. (2015) showed how the requisite counts for a KN-LM, namely c(α), N1+(·α), N1+(·α·) and N1+(α·), can be computed directly from the CST. Consider the example in Figure 2: the number of occurrences of b corresponds to counting the number of leaves, size(v), in the subtree rooted at v. This can be computed in O(1) time by computing the size of the range [lb, rb] implicitly associated with each node. The number of unique right contexts of b can be determined using degree(v) (again O(1), but requiring bit operations on the succinct tree representation of the ST). That is, N1+(b·) = degree(v) = 3.

Determining the number of left-contexts and surrounding contexts is more involved. Computing N1+(·α) relies on the BWT. Recall that BWT[i] corresponds to the symbol preceding the suffix starting at SA[i]. For example, computing N1+(·b) first requires finding the interval of suffixes starting with b in SA, namely lb = 6 and rb = 10, and then counting the number of unique symbols in BWT[6,10] = {d, b, a, a, a}, i.e., 3. Determining all unique symbols in BWT[i, j] can be performed efficiently (independently of the size of the range) using the wavelet tree encoding of the BWT. The set of symbols preceding pattern α, denoted by P(α), can be computed in O(|P(α)| log σ) by visiting all unique leaves in the wavelet tree which correspond to symbols in BWT[i, j]. This is usually referred to as the interval-symbols procedure (Schnattinger et al., 2010), and it uses RANK operations to find the set of symbols s ∈ P(α) and the corresponding ranges for sα in SA.
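A minimal pointer-based wavelet tree with RANK looks as follows (our toy sketch: it splits the sorted alphabet in half at each level instead of using the Huffman-shaped codewords of Figure 2, and stores plain lists rather than compressed bitvectors). Here rank(i, c) counts occurrences strictly before position i, so rank(9, "a") = 1 corresponds to the figure's RANK(BWT, 8, a), which counts up to and including position 8.

```python
class WaveletTree:
    """Toy stand-in for the compressed wavelet tree of Grossi et al. (2003)."""
    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        if len(self.alphabet) <= 1:
            self.bits = None                      # leaf: all symbols identical
            return
        mid = len(self.alphabet) // 2
        left_syms = set(self.alphabet[:mid])
        # bit 0 sends a symbol to the left child, bit 1 to the right child
        self.bits = [0 if s in left_syms else 1 for s in seq]
        self.left = WaveletTree([s for s in seq if s in left_syms],
                                self.alphabet[:mid])
        self.right = WaveletTree([s for s in seq if s not in left_syms],
                                 self.alphabet[mid:])

    def rank(self, i, c):
        """Occurrences of symbol c in seq[0, i); log(sigma) binary ranks."""
        if c not in self.alphabet:
            return 0
        if self.bits is None:
            return i                              # leaf: every position is c
        mid = len(self.alphabet) // 2
        b = 0 if c in self.alphabet[:mid] else 1
        # binary rank: how many of the first i bits equal b
        i = sum(1 for bit in self.bits[:i] if bit == b)
        return (self.left if b == 0 else self.right).rank(i, c)

bwt_seq = "#c$#ccdbaaabbbb"                       # the BWT from Figure 2
wt = WaveletTree(list(bwt_seq))
print(wt.rank(9, "a"))                            # -> 1, cf. RANK(BWT, 8, a)
```

The answer matches the figure even though this balanced tree has a different shape from the Huffman-coded one, since rank is shape-independent.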
In the above example, identifying the SA range of ab requires finding the lb, rb in the SA for suffixes starting with a (SA[3,5]) and then adjusting the bounds to cover only the suffixes starting with ab. This last step is done by computing the rank of the three a symbols in BWT[8,10] using the wavelet tree; see Figure 2 (right) for RANK(BWT, 8, a). As illustrated, answering RANK(BWT, 8, a) corresponds to processing the first digit of the codeword at the root level, which translates into RANK(WTroot, 8, 0) = 4, followed by a RANK(WT1, 4, 1) = 1 on the left branch. Once the ranks are computed, lb, rb are refined accordingly to SA[3+(1−1), 3+(3−1)].

Finally, for N1+(·α·) all patterns which can follow α are enumerated, and for each of these extended patterns the number of preceding symbols is computed using interval-symbols. This proved to be the most expensive operation in their approach.

Given these quantities, Shareghi et al. (2015) show how m-gram probabilities can be computed on demand, using an iterative algorithm to search for matching nodes in the suffix tree for the required k-gram (k ≤ m) patterns in the numerator and denominator of the KN recursive equations, which are then used to compute the probabilities. We refer the reader to Shareghi et al. (2015) for further details. Overall their approach showed promise, in that it allowed unlimited order KN-LMs to be evaluated with a modest memory footprint; however, it was many orders of magnitude slower for smaller m than leading LM toolkits.

To illustrate the cost of querying, see Figure 3 (top), which shows per-sentence query time for KN based on the approach of Shareghi et al. (2015) (also shown is MKN, through an extension of their method as described in §4). It is clear that the runtimes for KN are much too long for practical use, about 5 seconds per sentence, with the majority of this time spent computing N1+(·α·). Clearly optimisation is warranted, and the gains from it are spectacular (see Figure 3 bottom, which uses the precomputation method described in §4.2).

Figure 3: Time breakdown for querying, average per sentence, shown without runtime precomputation of expensive contextual counts (above) vs. with precomputation (below). The left and right bars in each group denote KN and MKN, respectively. Trained on the German portion of the Europarl corpus and tested over the first 10K sentences from the News Commentary corpus.

4 Extending to MKN

4.1 Computing MKN modified counts

A central requirement for extending Shareghi et al. (2015) to support MKN are algorithms for computing N{1,2,3+}(α·) and N′{1,2,3+}(α·), which we now expound upon. Algorithm 1 computes both of these quantities, taking as input a CST t, a node v matching the pattern α, the pattern itself, and a flag is-prime denoting which of the N and N′ variants is required.

Algorithm 1: N{1,2,3+}(α·) or N′{1,2,3+}(α·)
 1: function N123PFRONT(t, v, α, is-prime)  ▷ t is a CST, v is the node matching pattern α
 2:   N1, N2, N3+ ← 0
 3:   for u ← children(v) do
 4:     if is-prime then
 5:       f ← interval-symbols(t, [lb(u), rb(u)])
 6:     else
 7:       f ← size(u)
 8:     if 1 ≤ f ≤ 2 then
 9:       N_f ← N_f + 1
10:   N3+ ← degree(v) − N1 − N2
11:   return N1, N2, N3+

This method enumerates the children of the node (line 3) and calculates either the frequency of each child (line 7) or the modified count N1+(·αx) for each child u, where x is the first symbol on the edge vu (line 5). Lines 8 and 9 accumulate the number of these values equal to one or two, and finally in line 10, N3+ is computed as the difference between N1+(α·) = degree(v) and the already counted events N1 + N2.

For example, computing N{1,2,3+}(b·) in Figure 2 corresponds to enumerating over its three children. Two of v's children are leaf nodes, {10, 8}, and one child has three leaf descendants, {11, 2, 5}, hence N1 and N2 are 2 and 0 respectively, and N3+ is 1.
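The counting logic is compact enough to mirror directly in Python (an illustration only: the per-child statistics that the CST provides via size and interval-symbols are passed in as plain numbers here):

```python
def n123_front(child_stats, use_prime):
    """Mirror of Algorithm 1. child_stats is a list of (size, left_ext)
    pairs, one per child u of v, where size = size(u) and left_ext is the
    number of distinct symbols preceding the extended pattern alpha x."""
    n1 = n2 = 0
    for size, left_ext in child_stats:
        f = left_ext if use_prime else size      # line 5 vs. line 7
        if f == 1:
            n1 += 1
        elif f == 2:
            n2 += 1
    degree = len(child_stats)                    # N1+(alpha .) = degree(v)
    return n1, n2, degree - n1 - n2              # line 10

# Node v matching "b" in Figure 2: children whose edges start with b, c, d.
# Their sizes are 1, 3, 1 and left-extension counts are 1, 2, 1 (footnote 3).
print(n123_front([(1, 1), (3, 2), (1, 1)], use_prime=False))  # -> (2, 0, 1)
print(n123_front([(1, 1), (3, 2), (1, 1)], use_prime=True))   # -> (2, 1, 0)
```

The two outputs reproduce the worked example: N{1,2,3+}(b·) = (2, 0, 1) and N′{1,2,3+}(b·) = (2, 1, 0).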
Further, consider computing N′{1,2,3+}(b·) in Figure 2, which again enumerates over the child nodes (whose path labels start with symbols b, c and d) and computes the number of preceding symbols for the extended patterns.³ Accordingly, N′1(b·) = 2, N′2(b·) = 1 and N′3+(b·) = 0.

³ That is, N1+(·bb) = 1, N1+(·bc) = 2, N1+(·bd) = 1.

While roughly similar in approach, computing N′{1,2,3+}(α·) is in practice slower than N{1,2,3+}(α·), since it requires calling interval-symbols (line 5) instead of the constant-time size operation (line 7). This gives rise to a time complexity of O(d |P(α)| log σ) for N′{1,2,3+}(α·), where d is the number of children of v.

As illustrated in Figure 3 (top), the modified counts (§2) combined are responsible for 99% of the query time. Moreover, the already expensive runtime of KN is considerably worsened in MKN, due to the additional counts required. These facts motivate optimisation, which we achieve by precomputing values, resulting in a 2500× speedup in query time as shown in Figure 3 (bottom).

4.2 Efficient Precomputation

Language modelling toolkits such as KenLM and SRILM precompute real-valued probabilities and backoff weights at training time, such that querying becomes largely a problem of retrieval. We might consider taking a similar route in optimising our language model; however, we would face the problem that floating point numbers cannot be compressed very effectively. Even with quantisation, which can have a detrimental effect on modelling perplexity, we would not expect good compression, and thus this technique would limit the scaling potential of our approach.

For these reasons, we instead store the most expensive count data, targeting those counts which have the greatest effect on runtime (see Figure 3 top). We expect these integer values to compress well: as highlighted by Figure 4, most counts have low values, and accordingly a variable-byte compression scheme will be able to realise high compression rates. Removing the need to compute these values at query time leaves only pattern search and a few floating point operations in order to compute language model probabilities (see §4.3), which can be done cheaply.

Figure 4: Left: distribution of the values prestored for Europarl German (fraction of stored values smaller than or equal to a given magnitude). Right: space usage of the prestored values (N1+(·α), N{1,2}(α·), N′{1,2}(α·), N1+(·α·) and the bitvector bv) relative to the total index size, for different storage thresholds m̂.

Our first consideration is how to structure the cache. Given that each precomputed value is computed using a CST node, v (with the pattern as its path-label), we structure the cache as a mapping between unique node identifiers id(v) and the precomputed values.⁴ Next we consider which values to cache: while it is possible to precompute values for every node in the CST, many nodes are unlikely to be accessed at query time. Moreover, these rare patterns are likely to be cheap to process using the on-the-fly methods, given they occur in few contexts. Consequently, precomputing these values would bring minimal speed benefits while still incurring a memory cost. For this reason we precompute the values only for nodes corresponding to k-grams up to length m̂ (for our word-level experiments m̂ = 10), which are most likely to be accessed at runtime.⁵

⁴ Each node can be uniquely identified by the order in which it is visited in a DFS traversal of the suffix tree. This corresponds to the RANK of the opening parenthesis of the node in the balanced parenthesis representation of the tree topology of the CST, which can be determined in O(1) time (Ohlebusch et al., 2010).
⁵ We did not test other selection criteria. Other methods may be more effective, such as selecting nodes for precomputation based on the frequency of their corresponding patterns in the training set.
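To see why skewed integer counts compress so well, consider a classic variable-byte code (a simpler cousin of the DAC scheme actually used for our cache, shown purely for intuition): every value below 128 costs a single byte.

```python
def vbyte_encode(values):
    """Encode non-negative ints, 7 bits per byte; the high bit marks the
    final byte of each value."""
    out = bytearray()
    for v in values:
        while v >= 128:
            out.append(v & 0x7F)      # low 7 bits, more bytes follow
            v >>= 7
        out.append(v | 0x80)          # final byte carries the stop bit
    return bytes(out)

def vbyte_decode(data):
    values, v, shift = [], 0, 0
    for b in data:
        if b & 0x80:                  # stop bit set: last byte of this value
            values.append(v | ((b & 0x7F) << shift))
            v, shift = 0, 0
        else:
            v |= b << shift
            shift += 7
    return values

counts = [1, 1, 2, 1, 3, 1, 1, 907]   # skewed, like the counts in Figure 4
enc = vbyte_encode(counts)
assert vbyte_decode(enc) == counts
print(len(enc), "bytes for", len(counts), "counts")  # 9: only 907 needs two
```

DAC arranges the same byte chunks level-by-level so that any element can be decoded in isolation, which is what makes the constant-time random access of the cache possible.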
The precomputation method is outlined in Algorithm 2, showing how a compressed cache is created for the quantities x ∈ {N1+(·α), N1+(·α·), N{1,2}(α·), N′{1,2}(α·)}. The algorithm visits the suffix tree nodes in depth-first-search (DFS) order, and selects a subset of nodes for precomputation (line 7), such that the remaining nodes are either rare or trivial to handle on-the-fly (i.e., leaf nodes).

Algorithm 2: Precomputing the expensive counts N{1,2}(α·), N1+(·α·), N1+(·α), N′{1,2}(α·)
 1: function PRECOMPUTE(t, m̂)
 2:   bv_l ← 0 ∀ l ∈ [0, nodes(t)−1]
 3:   i(x)_l ← 0 ∀ l ∈ [0, nodes(t)−1], x ∈ count types
 4:   j ← 0
 5:   for v ← descendants(root(t)) do  ▷ DFS
 6:     d ← string-depth(v)
 7:     if not is-leaf(v) ∧ d ≤ m̂ then
 8:       l ← id(v)  ▷ unique id
 9:       bv_l ← 1
10:      call N1PFRONTBACK1(t, v, ·)
11:      call N123PFRONT(t, v, ·, 0)
12:      call N123PFRONT(t, v, ·, 1)
13:      i(x)_j ← counts from above, for each output x
14:      j ← j + 1
15:   bv_rrr ← compress-rrr(bv)
16:   i ← compress-dac({i(x) ∀ x})
17:   write-to-disk(bv_rrr, i)

A node included in the cache is marked by storing a 1 in the bitvector bv (lines 8-9) at index l, where l is the node identifier. For each selected node we precompute the expensive counts in lines 10-12: N1+(·α·) and N1+(·α) via⁶ N1PFRONTBACK1(t, v, ·), N{1,2}(α·) via N123PFRONT(t, v, ·, 0), and N′{1,2}(α·) via N123PFRONT(t, v, ·, 1), which are stored in the integer vectors i(x), one for each count type x (line 13). The integer vectors are streamed to disk and then compressed (lines 15-17) in order to limit memory usage. The final steps in lines 15 and 16 compress the integer and bit vectors. The integer vectors i(x) are compressed using a variable length encoding, namely Directly Addressable Variable-Length Codes (DAC; Brisaboa et al. (2009)), which allows for efficient storage of integers while providing efficient random access. As the overwhelming majority of our precomputed values are small (see Figure 4 left), this gives rise to a dramatic compression rate of only ≈ 5.2 bits per integer. The bitvector bv, of size O(n) where n is the number of nodes in the suffix tree, is compressed using the scheme of Raman et al. (2002), which supports constant time RANK operations over very large bitvectors.

⁶ The function N1PFRONTBACK1 is defined as Algorithm 5 in Shareghi et al. (2015).

This encoding allows for efficient retrieval of the precomputed counts at query time. The compressed vectors are loaded into memory, and when an expensive count is required for node v, the precomputed quantities can be fetched in constant time via LOOKUP(v, bv, i(x)) = i(x)[RANK(bv, id(v), 1)]. We use RANK to determine the number of 1s preceding v's position in the bitvector bv. This corresponds to v's index in the compressed integer vectors i(x), from which its precomputed count can be fetched. This strategy only applies for precomputed nodes; for other nodes, the values are computed on-the-fly.

Figure 3 compares the query time breakdown for on-the-fly count computation (top) versus precomputation (bottom), for both KN and MKN and with different Markov orders, m. Note that query speed improves dramatically, by a factor of about 2500×, for precomputed cases. This improvement comes at a modest cost in construction space: precomputing for CST nodes of depth ≤ m̂ = 10 resulted in 20% of the nodes being selected for precomputation, and the space used by the precomputed values accounts for 20% of the total space usage (see Figure 4 right). Index construction time increased by 70%.
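The LOOKUP mechanism is easy to picture with plain Python lists standing in for the compressed RRR bitvector and DAC vectors (all names here are illustrative; the real structures answer rank in constant time):

```python
class PrecomputedCache:
    """Toy version of the precomputed-count cache: nodes selected by
    Algorithm 2 are marked in bv (indexed by DFS node id), and their
    counts live, in the same order, in dense integer vectors i_x."""
    def __init__(self, num_nodes, selected, counts_by_type):
        self.bv = [0] * num_nodes
        for node_id in selected:
            self.bv[node_id] = 1
        self.i_x = counts_by_type        # e.g. {"N1+(.a.)": [...], ...}

    def rank1(self, pos):
        # number of 1s in bv[0, pos); O(1) with an RRR-compressed bitvector
        return sum(self.bv[:pos])

    def lookup(self, node_id, count_type):
        """Return the precomputed count, or None if the node was skipped
        (the caller then falls back to on-the-fly computation)."""
        if not self.bv[node_id]:
            return None
        return self.i_x[count_type][self.rank1(node_id)]

cache = PrecomputedCache(
    num_nodes=8, selected=[0, 2, 3, 7],
    counts_by_type={"N1+(.a.)": [5, 1, 2, 1]})
print(cache.lookup(3, "N1+(.a.)"))   # -> 2: node 3 is the third selected node
print(cache.lookup(4, "N1+(.a.)"))   # -> None: compute on-the-fly
```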
4.3 Computing MKN Probability

Having established a means of computing the requisite counts for MKN and an efficient precomputation strategy, we now turn to the algorithm for computing the language model probability. This is presented in Algorithm 3, which is based on Shareghi et al. (2015)'s single-CST approach for computing the KN probability (reported in their paper as Algorithm 4). Similar to their method, our approach implements the recursive m-gram probability formulation as an iterative loop (here using MKN). The core of the algorithm are the two nodes vfull and v, which correspond to the nodes matching the full k-gram and its (k−1)-gram context, respectively. Although Shareghi et al. (2015)'s method also features a similar right-to-left pattern lookup, in addition we optimise the computation of a full sentence probability by sliding a window of width m over the sequence from left to right, adding one new word at a time.⁷ This allows the re-use of the nodes in one window matching the full k-gram, vfull, as the nodes matching the context in the subsequent window, denoted v.

⁷ Pauls and Klein (2011) propose a similar algorithm for trie-based LMs.

Algorithm 3: MKN probability P(w_i | w_{i−(m−1)}^{i−1})
 1: function PROBMKN(t, w_{i−m+1}^{i}, m, [v_k]_{k=0}^{m−1})
 2:   Assumption: v_k is the matching node for w_{i−k}^{i−1}
 3:   vfull_0 ← root(t)  ▷ tracks the match for w_{i−k}^{i}
 4:   p ← 1/σ
 5:   for k ← 1 to m do
 6:     if v_{k−1} does not match then
 7:       break out of loop
 8:     vfull_k ← back-search([lb(vfull_{k−1}), rb(vfull_{k−1})], w_{i−k+1})
 9:     D_k(1), D_k(2), D_k(3+) ← discounts for k-grams
10:    if k = m then
11:      c ← size(vfull_k)
12:      d ← size(v_{k−1})
13:      N1, N2, N3+ ← N123PFRONT(t, v_{k−1}, w_{i−k+1}^{i−1}, 0)
14:    else
15:      c ← N1PBACK1(t, vfull_k, w_{i−k+1}^{i−1})
16:      d ← N1PFRONTBACK1(t, v_{k−1}, w_{i−k+1}^{i−1})
17:      N1, N2, N3+ ← N123PFRONT(t, v_{k−1}, w_{i−k+1}^{i−1}, 1)
18:    if 1 ≤ c ≤ 2 then
19:      c ← c − D_k(c)
20:    else
21:      c ← c − D_k(3+)
22:    γ ← D_k(1)N1 + D_k(2)N2 + D_k(3+)N3+
23:    p ← (1/d)(c + γp)
24:  return (p, [vfull_k]_{k=0}^{m−1})

For example, in the sentence "The Force is strong with this one.", computing the 4-gram probability of "The Force is strong" requires matches into the CST for "strong", "is strong", etc. As illustrated in Table 1, for the next 4-gram resulting from sliding the window to include "with", the denominator terms require exactly these nodes; see Figure 5. Practically, this is achieved by storing the matching vfull nodes computed in line 8, and passing this vector as the input argument [v_k]_{k=0}^{m−1} to the next call to PROBMKN (line 1). This saves half the calls to backward-search, which, as shown in Figure 3, represent a significant fraction of the querying cost, resulting in a 30% improvement in query runtime.

The algorithm starts by considering the unigram probability, and grows the context to its left by one word at a time until the m-gram is fully covered (line 5). This best suits the use of backward-search in a CST, which proceeds from right to left over the search pattern.
At each stage the search for vfull_k uses the span from the previous match, vfull_{k−1}, along with the BWT, to efficiently locate the matching node.

Figure 5: Example MKN probability computation for a 4-gram LM applied to "The Force is strong with" (each word abbreviated to its first character), showing in the two left columns the suffix matches required for the 4-gram FISW and the elements which can be reused from the previous 4-gram computation, TFIS (gray shading). Elements on the right denote the count and occurrence statistics derived from the suffix matches, as linked by blue lines.

Once the nodes matching the full sequence and its context are retrieved, the procedure is fairly straightforward: the discounts are loaded on line 9 and applied in lines 18-21, while the numerator, denominator and smoothing quantities required for computing P and P̄ are calculated in lines 10-13 and 15-17, respectively.⁸ Note that the calls to the functions N123PFRONT, N1PBACK1 and N1PFRONTBACK1 are avoided if the corresponding node is amongst the selected nodes of the precomputation step; instead the LOOKUP function is called. Finally, the smoothing weight γ is computed in line 22 and the conditional probability on line 23. The loop terminates when we reach the length limit k = m, or when we cannot match the context, i.e., w_{i−k}^{i−1} is not in the training corpus, in which case the probability value p for the longest match is returned.

⁸ N1PBACK1 and N1PFRONTBACK1 are defined in Shareghi et al. (2015); see also §3 for an overview.

We now turn to the discount parameters, D_k(j), k ≤ m, j ∈ {1, 2, 3+}, which are functions of the corpus statistics, as outlined in Figure 1. While these could be computed based on raw m-gram statistics, this approach is very inefficient for large m ≥ 5; instead these values can be computed efficiently from the compressed data structures. Algorithm 4 outlines how the D_k(i) values can be computed directly from the CST.
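Once Algorithm 4 (below) has accumulated the count-of-counts n_i(k), turning them into discounts follows the standard closed form of Chen and Goodman (1999) quoted in Figure 1; this sketch and its input values are illustrative only, and the same formula is applied per order k (with n̄_i(k) for the lower-order continuation distributions):

```python
def mkn_discounts(n):
    """n[i] = number of distinct k-grams of the current order occurring
    exactly i times, for i in 1..4; returns (D_k(1), D_k(2), D_k(3+))."""
    Y = n[1] / (n[1] + 2 * n[2])                  # Chen & Goodman (1999)
    return tuple(i - (i + 1) * Y * n[i + 1] / n[i] for i in (1, 2, 3))

# count-of-counts n_1..n_4 for some order k (hypothetical values):
print(mkn_discounts({1: 500, 2: 180, 3: 90, 4: 55}))
```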
Algorithm 4: Compute discounts
 1: function COMPUTEDISCOUNTS(t, m̄, bv0, SA)
 2:   n_i(k) ← 0, n̄_i(k) ← 0 ∀ i ∈ [1,4], k ∈ [1, m̄]
 3:   N1+(··) ← 0
 4:   for v ← descendants(root(t)) do  ▷ DFS
 5:     dP ← string-depth(parent(v))
 6:     d ← string-depth(v)
 7:     dS ← depth-next-sentinel(SA, bv0, lb(v))
 8:     i ← size(v)  ▷ frequency
 9:     c ← interval-symbols(t, [lb(v), rb(v)])  ▷ left occurrences
10:    for k ← dP+1 to min(d, m̄, dS−1) do
11:      if k = 2 then
12:        N1+(··) ← N1+(··) + 1
13:      if 1 ≤ i ≤ 4 then
14:        n_i(k) ← n_i(k) + 1
15:      if 1 ≤ c ≤ 4 then
16:        n̄_c(k) ← n̄_c(k) + 1
17:   D_k(i) ← computed using the formula in Figure 1
18:   return D_k(i), k ∈ [1, m̄], i ∈ {1, 2, 3+}

This method iterates over the nodes in the suffix tree, and for each node considers the k-grams encoded in the edge label, where each k-gram is taken to start at the root node (to avoid duplicate counting, we consider k-grams only contained on the given edge but not in the parent edges, i.e., by bounding k based on the string depths of the parent and current nodes, dP < k ≤ d). For each k-gram we record its count, i (line 8), and the number of unique symbols to its left, c (line 9), which are accumulated in an array for each k-gram size, for values between 1 and 4 (lines 13-14 and 15-16, respectively). We also record the number of unique bigrams by incrementing a counter during the traversal (lines 11-12).

Special care is required to exclude edge labels that span sentence boundaries, by detecting the special sentinel symbols (line 7) that separate each sentence or conclude the corpus. This check could be done by repeatedly calling edge(v, k) to find the kth symbol on the given edge and checking it for sentinels; however, this is a slow operation, as it requires multiple backward-search calls. Instead we precalculate a bitvector, bv0, of size equal to the number of tokens in the corpus, n, in which the sentinel locations in the text are marked by 1 bits. Coupled with this, we use the suffix array SA, such that

  depth-next-sentinel(SA, bv0, ℓ) = SELECT(bv0, RANK(bv0, SA_ℓ, 1) + 1, 1) − SA_ℓ

where SA_ℓ returns the offset into the text for index ℓ, and the SA is stored uncompressed to avoid the expensive cost of recovering these values.⁹ This function can be understood as finding the first occurrence of the pattern in the text (using SA_ℓ), then finding the location of the next 1 in the bitvector using constant time RANK and SELECT operations. This locates the next sentinel in the text, after which the distance to the start of the pattern is computed. Using this method in place of explicit edge calls improved the training runtime substantially, by up to 41×.

We precompute the discount values for k ≤ m̄-grams. For querying with m > m̄ (including ∞) we reuse the discounts for the largest m̄-grams.¹⁰

⁹ Although the SA can be very large, we need not store it in memory. The DFS traversal in Algorithm 4 (lines 4-16) means that the calls to SA_ℓ occur in increasing order of ℓ. Hence, we use on-disk storage for the SA with a small memory-mapped buffer, thereby incurring a negligible memory overhead.
¹⁰ It is possible to compute the discounts for all patterns of the text using our algorithm, with complexity linear in the length of the text. However, the discounts appear to converge by pattern length m̄ = 10. This limit also helps to avoid problems of wild fluctuations in discounts for very long patterns, arising from noise for low count events.
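The depth-next-sentinel computation is a two-operation affair once the sentinel bitvector exists. Below is a naive Python rendering of the formula above (RANK and SELECT are linear scans here, but constant time over the compressed bitvector in the real index):

```python
def rank1(bv, i):
    # number of 1s in bv[0, i)
    return sum(bv[:i])

def select1(bv, j):
    # position of the j-th 1 (1-based); assumes it exists
    seen = 0
    for pos, bit in enumerate(bv):
        seen += bit
        if seen == j:
            return pos
    raise ValueError("fewer than j ones in bitvector")

def depth_next_sentinel(SA, bv0, l):
    """Distance from text position SA[l] to the next sentinel at or after it."""
    start = SA[l]
    return select1(bv0, rank1(bv0, start) + 1) - start

T = "#abcabcabdbbc#$"
bv0 = [1 if ch in "#$" else 0 for ch in T]       # mark sentinel positions
SA = sorted(range(len(T)), key=lambda i: T[i:])
# the suffix "abdbbc..." starts at position 7; the next sentinel is at 13:
l = SA.index(7)
print(depth_next_sentinel(SA, bv0, l))           # -> 6
```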
5 Experiments

To evaluate our approach we measure memory and time usage, along with the predictive perplexity of word-level LMs, on a number of different corpora varying in size and domain. For all of our word-level LMs, we use m̄, m̂ ≤ 10. We also demonstrate the positive impact on character-level LM perplexity of increasing the limits m̄, m̂ from 10 to 50. The SDSL library (Gog et al., 2014) is used to implement our data structures. The benchmarking experiments were run on a single core of an Intel Xeon E5-2687 v3 3.10GHz server with 500 GiB of RAM.

In our word-level experiments, we use the German subset of Europarl (Koehn, 2005) as a small corpus, which is 382 MiB in size measuring the raw uncompressed text. We also evaluate on much larger corpora, training on 32 GiB subsets of the deduplicated English, Spanish, German, and French Common Crawl corpus (Buck et al., 2014). As test sets, we used newstest-2014 for all languages except Spanish, for which we used newstest-2013.¹¹

¹¹ http://www.statmt.org/wmt{13,14}/test.tgz

In our benchmarking experiments we used the bottom 1M sentences (not used in training) of the German Common Crawl corpus. We used the preprocessing script of Buck et al. (2014), then removed sentences with ≤ 2 words, and replaced rare words¹² (c ≤ 9) in the training data with a special token. In our character-level experiments, we used the training and test data of the benchmark 1-billion-words corpus (Chelba et al., 2013).

¹² Running with the full vocabulary increased the memory requirement by 40% for construction and 5% for querying with our model, and by 10% and 30%, resp., for KenLM. Construction times for both approaches were 15% slower, but query runtime was 20% slower for our model versus 80% for KenLM.

Figure 6: Memory consumption and total runtime for the CST with and without precomputation, KenLM (trie and probing), and SRILM (default) with m ∈ [2, 10]; we also include m = ∞ for the CST methods. Trained on the Europarl German corpus and tested over the bottom 1M sentences from the German Common Crawl corpus.

Small data: German Europarl. First, we compare the time and memory consumption of the SRILM and KenLM toolkits and the CST on the small German corpus. Figure 6 shows that the memory usage for construction and querying of the CST-based methods (with or without precomputation) is independent of m, but grows substantially with m for the SRILM and KenLM benchmarks. To make our results comparable to those reported in Shareghi et al. (2015), for query time measurements we report the loading and query time combined. The construction cost is modest, requiring less memory than the benchmark systems for m ≥ 3, and running in a similar time¹³ (despite our method supporting queries of unlimited size). Precomputation adds to the construction time, which rose from 173 to 299 seconds, but yielded speed improvements of several orders of magnitude for querying (218k down to 98 seconds for the 10-gram model). In querying, the CST-precompute method is 2-4× slower than both SRILM and KenLM for large m ≥ 5, with the exception of m = 10, where it outperforms SRILM. A substantial fraction of the query time is loading the structures from disk; when this cost is excluded, our approach is between 8-13× slower than the benchmark toolkits. Note that the perplexity computed by the CST closely matched KenLM (differences ≤ 0.1).

¹³ For all timings reported in the paper we manually flushed the system cache between each operation (both for construction and querying) to remove the effect of caching on runtime. To query KenLM, we used the speed-optimised populate method. (We also compare the memory-optimised lazy method in Figure 7.) To train and query SRILM we used the default method, which is optimised for speed but had slightly worse memory usage than the compact method.

Big data: Common Crawl. Table 2 reports the perplexity results for training on 32 GiB subsets of the English, Spanish, French, and German Common Crawl corpus.

      size (M)    perplexity
      tokens      m=2     m=3     m=5     m=7     m=10    m=∞
EN    6470        321.6   183.8   154.3   152.7   152.5   152.3
ES    6276        231.3   133.2   111.7   109.7   109.3   109.2
FR    6100        215.8   109.2   85.2    83.1    82.6    82.4
DE    5540        588.3   336.6   292.8   288.1   287.8   287.8

Table 2: Perplexities on the English, French and German newstests 2014, and the Spanish newstest 2013, when trained on 32 GiB chunks of the English, Spanish, French, and German Common Crawl corpus.

Note that with such large datasets, perplexity improves with increasing m, with substantial gains available moving above the widely used m = 5. This highlights the importance of our approach being independent of m, in that we can evaluate for any m, including ∞, at low cost.

Heterogeneous data. To illustrate the effects of domain shift, corpus size and language model capacity on modelling accuracy, we now evaluate the system using a variety of different training corpora. Table 3 reports the perplexity for German when training over datasets ranging from the small Europarl up to 32 GiB of the Common Crawl corpus. Note that the test set is from the same domain as the News Crawl, which explains the vast difference in perplexities.

                size (M)         perplexity
Training        tokens   sents   m=3      m=5     m=10
Europarl        55       2.2     1004.8   973.3   971.4
Commentary      6        0.2     941.8    928.6   928.1
NCrawl 2007     37       2.0     514.8    493.5   488.9
NCrawl 2008     126      6.8     427.7    404.8   400.0
NCrawl 2009     119      6.5     433.4    408.9   404.7
NCrawl 2010     54       3.0     472.7    450.9   446.8
NCrawl 2011     298      16.3    362.6    335.9   327.9
NCrawl 2012     377      20.9    343.9    315.6   307.3
NCrawl 2013     641      35.1    268.9    229.8   225.6
NCrawl 2014     845      46.3    247.6    195.2   189.3
All combined    2560     139.3   211.8    158.9   151.5
CCrawl 1G       173      14.2    542.8    515.5   512.7
CCrawl 2G       346      28.5    493.2    462.3   459.2
CCrawl 4G       692      56.2    446.5    412.2   408.8
CCrawl 8G       1390     110.2   402.1    364.4   360.9
CCrawl 16G      2770     216.7   364.9    323.9   319.6
CCrawl 32G      5540     426.6   336.6    292.8   287.8

Table 3: Perplexity of German newstest 2014 with different datasets (Europarl, News-Commentary, News Crawl 2007-2014, Common Crawl 1-32 GiB chunks) and m.

The domain effect is strong enough to eliminate the impact of using much larger corpora; compare the 10-gram perplexities for training on the smaller News Crawl 2007 corpus versus Europarl. However, 'big data' is still useful: in all cases the perplexity improves as we provide more data from the same source. Moreover, the magnitude of the gain in perplexity when increasing m is influenced by the data size: with more training data, higher order m-grams provide richer models; hence, the scalability of our method to large datasets is crucially important.

Benchmarking against KenLM. Next we compare our model against the state-of-the-art method, KenLM trie. The perplexity difference between the CST and KenLM was less than 0.003 in all experiments.

Construction cost. Figure 7a compares the peak memory usage of our CST models and KenLM. KenLM is given a target memory usage equal to the peak usage of our CST models.¹⁴ The construction phase for the CST required more time for lower order models (see Figure 7c) but was comparable for larger m, roughly matching KenLM for m = 10.¹⁵ For the 32 GiB dataset, the CST model took 14 hours to build, compared to KenLM's 13.5 and 4 hours for the 10-gram and 5-gram models, respectively.

¹⁴ Using the memory budget option, -S. Note that KenLM often used more memory than specified. Allowing KenLM use of 80% of the available RAM reduced training time by a factor of between 2 and 4.
¹⁵ The CST method uses a single thread for construction, while KenLM uses several threads. Most stages of construction for our method could be easily parallelised.

Query cost. As shown in Figure 7b, the memory requirements for querying with the CST method were consistently lower than KenLM for m ≥ 4: for m = 10 the memory consumption of KenLM was 277 GiB compared to our 27 GiB, a 10× improvement. This closely matches the file sizes of the stored models on disk. Figure 7d reports the query runtimes, showing that KenLM becomes substantially slower with increasing dataset size and increasing language model order. In contrast, the runtime of our CST approach is much less affected by data size or model order. Our approach is faster than KenLM with the memory-optimised lazy option for m ≥ 3, often by several orders of magnitude. For the faster KenLM populate, our model is still highly competitive, growing to 4× faster for the largest data size.¹⁶ The loading time is still a significant part of the runtime; without this cost, our model is 5× slower than KenLM populate for m = 10 on the largest dataset. Running our model with m = ∞ on the largest data size did not change the memory usage and only had a minor effect on runtime, taking 645s.

¹⁶ KenLM benefits significantly from caching, which can occur between runs or as more queries are issued (from m-gram repetition in our large 1 million sentence test set), whereas the CST approach does not benefit noticeably (as it does not incorporate any caching functionality).

Character-level modelling. To demonstrate the full potential of our approach, we now consider character based language modelling, evaluated on the large benchmark 1-billion-words language modelling corpus, a 3.9 GiB (training) dataset with 768M words and 4 billion characters.¹⁷ Table 4 shows the test perplexity results for our models, using the full training vocabulary. Note that perplexity improves with m for the character based model, but plateaus at m = 10 for the word based model; one reason for this is the limited discount computation, m̄ ≤ 10, for the word model, which may not be a good parameterisation for m > m̄.
¹⁷ http://www.statmt.org/lm-benchmark/

Figure 7: Memory and runtime statistics for the CST and KenLM for construction and querying, with different amounts of German Common Crawl training data and different Markov orders, m. We compare the query runtimes against the versions of KenLM optimised for memory (lazy) and speed (populate). For clarity, in the figure we only show CST numbers for m = 10; the results for other settings of m are very similar. KenLM was trained to match the construction memory requirements of the CST-precompute method.

unit        time (s)   mem (GiB)   m=5     m=10    m=20    m=∞
word        816        46.29       73.45   68.66   68.76   68.80
character   17935      18.58       3.93    2.69    2.37    2.33

Table 4: Perplexity results for the 1 billion word benchmark corpus, showing word based and character based MKN models, for different m. Timings and peak memory usage are reported for construction. The word model computed discounts and precomputed counts up to m̄, m̂ = 10, while the character model used thresholds m̄, m̂ = 50. Timings were measured on a single core.

Despite the character based model (implicitly) having a massive parameter space, estimating this model was tractable with our approach: the construction time was a modest 5 hours (and 2.3 hours for the word based model). For the same dataset, Chelba et al. (2013) report that training an MKN 5-gram model took 3 hours using a cluster of 100 CPUs; our algorithm is faster than this, despite only using a single CPU core.¹⁸ Queries were also fast: 0.72-0.87 ms and 15 ms per sentence for the word and character based models, respectively.

¹⁸ Chelba et al. (2013) report a better perplexity of 67.6, but they pruned the training vocabulary, whereas we did not. Also, we use a stringent treatment of OOV, following Heafield (2013).

6 Conclusions

We proposed a language model based on compressed suffix trees, a representation that is highly compact and can be easily held in memory, while supporting the queries needed in computing language model probabilities on the fly. We presented several optimisations to accelerate this process, with only a modest increase in construction time and memory usage, yet improving query runtimes up to 2500×. In benchmarking against the state-of-the-art KenLM package on large corpora, our method has superior memory usage and highly competitive runtimes for both querying and training. Our approach allows easy experimentation with high order language models, and our results provide evidence that such high orders are most useful when using large training sets. We posit that further perplexity gains can be realised using richer smoothing techniques, such as a non-parametric Bayesian prior (Teh, 2006; Wood et al., 2011). Our ongoing work will explore this avenue, as well as integrating our language model into the Moses machine translation system, and improving the querying time by caching the lower order probabilities (e.g., m < 4), which we believe can improve query time substantially while maintaining a modest memory footprint.

Acknowledgements

This research was supported by the Australian Research Council (FT130101105), National ICT Australia (NICTA) and a Google Faculty Research Award.
References

Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. Large language models in machine translation. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

Nieves R. Brisaboa, Susana Ladra, and Gonzalo Navarro. 2009. Directly addressable variable-length codes. In Proceedings of the International Symposium on String Processing and Information Retrieval.

Christian Buck, Kenneth Heafield, and Bas van Ooyen. 2014. N-gram counts and language models from the Common Crawl. In Proceedings of the Language Resources and Evaluation Conference.

Michael Burrows and David Wheeler. 1994. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation Systems Research Center.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.

Stanley F. Chen and Joshua Goodman. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4):359–393.

Kenneth Church, Ted Hart, and Jianfeng Gao. 2007. Compressing trigram language models with Golomb coding. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Paolo Ferragina and Giovanni Manzini. 2000. Opportunistic data structures with applications. In Proceedings of the Annual Symposium on Foundations of Computer Science.

Ulrich Germann, Eric Joanis, and Samuel Larkin. 2009. Tightly packed tries: How to fit large models into memory, and make them load fast, too. In Proceedings of the Workshop on Software Engineering, Testing, and Quality Assurance for Natural Language Processing.

Simon Gog, Timo Beller, Alistair Moffat, and Matthias Petri. 2014. From theory to practice: Plug and play with succinct data structures. In Proceedings of the International Symposium on Experimental Algorithms.

R. Grossi, A. Gupta, and J. S. Vitter. 2003. High-order entropy-compressed text indexes. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms.

David Guthrie and Mark Hepple. 2010. Storing the web in memory: Space efficient language models with constant time retrieval. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable modified Kneser-Ney language model estimation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Workshop on Statistical Machine Translation.

Kenneth Heafield. 2013. Efficient Language Modeling Algorithms with Applications to Statistical Machine Translation. Ph.D. thesis, Carnegie Mellon University.

Guy Jacobson. 1989. Space-efficient static trees and graphs. In Proceedings of the Annual Symposium on Foundations of Computer Science.

Casey Redd Kennington, Martin Kay, and Annemarie Friedrich. 2012. Suffix trees as language models. In Proceedings of the Conference on Language Resources and Evaluation.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Machine Translation Summit.

Abby Levenberg and Miles Osborne. 2009. Stream-based randomised language models for SMT. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Udi Manber and Eugene W. Myers. 1993. Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948.

Gonzalo Navarro. 2014. Wavelet trees for all. Journal of Discrete Algorithms, 25:2–20.

Enno Ohlebusch, Johannes Fischer, and Simon Gog. 2010. CST++. In Proceedings of the International Symposium on String Processing and Information Retrieval.

Adam Pauls and Dan Klein. 2011. Faster and smaller n-gram language models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.
Rajeev Raman, Venkatesh Raman, and S. Srinivasa Rao. 2002. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms.

Thomas Schnattinger, Enno Ohlebusch, and Simon Gog. 2010. Bidirectional search in a string with wavelet trees. In Proceedings of the Annual Symposium on Combinatorial Pattern Matching.

Ehsan Shareghi, Matthias Petri, Gholamreza Haffari, and Trevor Cohn. 2015. Compact, efficient and unlimited capacity: Language modeling with compressed suffix trees. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Jeffrey Sorensen and Cyril Allauzen. 2011. Unary data structures for language models. In Proceedings of INTERSPEECH.

Andreas Stolcke, Jing Zheng, Wen Wang, and Victor Abrash. 2011. SRILM at sixteen: Update and outlook. In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop.

Andreas Stolcke. 2002. SRILM – an extensible language modeling toolkit. In Proceedings of the International Conference of Spoken Language Processing.

David Talbot and Miles Osborne. 2007. Randomised language modelling for statistical machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Yee Whye Teh. 2006. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Taro Watanabe, Hajime Tsukada, and Hideki Isozaki. 2009. A succinct n-gram language model. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Peter Weiner. 1973. Linear pattern matching algorithms. In Proceedings of the Annual Symposium on Switching and Automata Theory.

Frank Wood, Jan Gasthaus, Cédric Archambeau, Lancelot James, and Yee Whye Teh. 2011. The sequence memoizer. Communications of the ACM, 54(2):91–98.

Ying Zhang and Stephan Vogel. 2006. Suffix array and its applications in empirical natural language processing. Technical report, CMU, Pittsburgh, PA.