Transactions of the Association for Computational Linguistics, vol. 4, pp. 477–490, 2016. Action Editor: Brian Roark.
Submission batch: 1/2016; Revision batch: 6/2016; Published 9/2016.
© 2016 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees

Ehsan Shareghi,♭ Matthias Petri,♮ Gholamreza Haffari♭ and Trevor Cohn♮
♭ Faculty of Information Technology, Monash University
♮ Computing and Information Systems, The University of Melbourne
first.last@{monash.edu,unimelb.edu.au}

Abstract

Efficient methods for storing and querying are critical for scaling high-order m-gram language models to large corpora. We propose a language model based on compressed suffix trees, a representation that is highly compact and can be easily held in memory, while supporting queries needed in computing language model probabilities on-the-fly. We present several optimisations which improve query runtimes up to 2500×, despite only incurring a modest increase in construction time and memory usage. For large corpora and high Markov orders, our method is highly competitive with the state-of-the-art KenLM package. It imposes much lower memory requirements, often by orders of magnitude, and has runtimes that are either similar (for training) or comparable (for querying).

1 Introduction

Language models (LMs) are fundamental to many NLP tasks, including machine translation and speech recognition. Statistical LMs are probabilistic models that assign a probability to a sequence of words w_1^N, indicating how likely the sequence is in the language. m-gram LMs are popular, and prove to be accurate when estimated using large corpora. In these LMs, the probabilities of m-grams are often precomputed and stored explicitly. Although widely successful, current m-gram LM approaches are impractical for learning high-order LMs on large corpora, due to their poor scaling properties in both training and query phases. Prevailing methods (Heafield, 2011; Stolcke et al., 2011) precompute all m-gram probabilities, and consequently need to store and access as many as hundreds of billions of m-grams for a typical moderate-order LM.

Recent research has attempted to tackle scalability issues through the use of efficient data structures such as tries and hash-tables (Heafield, 2011; Stolcke et al., 2011), lossy compression (Talbot and Osborne, 2007; Levenberg and Osborne, 2009; Guthrie and Hepple, 2010; Pauls and Klein, 2011; Church et al., 2007), compact data structures (Germann et al., 2009; Watanabe et al., 2009; Sorensen and Allauzen, 2011), and distributed computation (Heafield et al., 2013; Brants et al., 2007). Fundamental to all the widely used methods is the precomputation of all probabilities, hence they do not provide an adequate trade-off between space and time for high m, both during training and querying. Exceptions are Kennington et al. (2012) and Zhang and Vogel (2006), who use a suffix-tree or suffix-array over the text for computing the sufficient statistics on-the-fly. In our previous work (Shareghi et al., 2015), we extended this line of research using a Compressed Suffix Tree (CST) (Ohlebusch et al., 2010), which provides a considerably more compact searchable means of storing the corpus than an uncompressed suffix array or suffix tree. This approach showed favourable scaling properties with m and had only a modest memory requirement. However, the method only supported Kneser-Ney smoothing, not its modified variant (Chen and Goodman, 1999), which overall performs better and has become the de-facto standard. Additionally, querying was significantly slower than for leading LM toolkits, making the method impractical for widespread use. In this paper we extend Shareghi et al. (2015) to support modified Kneser-Ney smoothing, and
present new optimisation methods for fast construction and querying.[1] Critical to our approach are:

• Precomputation of several modified counts, which would be very expensive to compute at query time. To orchestrate this, a subset of the CST nodes is selected based on the cost of computing their modified counts (which relates to the branching factor of a node). The precomputed counts are then stored in a compressed data structure supporting efficient memory usage and lookup.

• Re-use of CST nodes within m-gram probability computation as a sentence gets scored left-to-right, thus saving many expensive lookups.

Empirical comparison against our earlier work (Shareghi et al., 2015) shows the significance of each of these optimisations. The strengths of our method are apparent when applied to very large training datasets (≥ 16 GiB) and for high order models, m ≥ 5. In this setting, while our approach is more memory efficient than the leading KenLM model, both in the construction (training) and querying (testing) phases, we are highly competitive in terms of runtimes of both phases. When memory is a limiting factor at query time, our approach is orders of magnitude faster than the state of the art. Moreover, our method allows for efficient querying with an unlimited Markov order, m → ∞, without resorting to approximations or heuristics.

[1] https://github.com/eehsan/cstlm

2 Modified Kneser-Ney Language Model

In an m-gram language model, the probability of a sentence is decomposed into ∏_{i=1}^{N} P(w_i | w_{i−m+1}^{i−1}), where P(w_i | w_{i−m+1}^{i−1}) is the conditional probability of the next word given its finite history. Smoothing techniques are employed to deal with sparsity when estimating the parameters of P(w_i | w_{i−m+1}^{i−1}). A comprehensive comparison of different smoothing techniques is provided in (Chen and Goodman, 1999). We focus on interpolated Modified Kneser-Ney (MKN) smoothing, which is widely regarded as a state-of-the-art technique and is supported by popular language modelling toolkits, e.g. SRILM (Stolcke, 2002) and KenLM (Heafield, 2011). The interpolated MKN formulation is:

P_m(w|ux) = \frac{[c(uxw) - D_m(c(uxw))]_+}{c(ux)} + \frac{\gamma_m(ux)}{c(ux)} \bar{P}_{m-1}(w|x)

\bar{P}_k(w|ux) = \frac{[N_{1+}(\cdot uxw) - D_k(N_{1+}(\cdot uxw))]_+}{N_{1+}(\cdot ux\,\cdot)} + \frac{\gamma_k(ux)}{N_{1+}(\cdot ux\,\cdot)} \bar{P}_{k-1}(w|x)

\bar{P}_0(w|\epsilon) = \frac{[N_{1+}(\cdot w) - D_1(N_{1+}(\cdot w))]_+}{N_{1+}(\cdot\cdot)} + \frac{\gamma(\epsilon)}{N_{1+}(\cdot\cdot)} \times \frac{1}{\sigma}

\gamma_k(ux) = \begin{cases} \sum_{j\in\{1,2,3+\}} D_k(j)\, N_j(ux\,\cdot), & \text{if } k = m \\ \sum_{j\in\{1,2,3+\}} D_k(j)\, N'_j(ux\,\cdot), & \text{if } k < m \end{cases}
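For concreteness, the recursion can be read off the equations above. The following is a minimal sketch, assuming the raw counts c(·), continuation counts N_1+(·), discounts D_k and back-off mass γ_k are supplied by some index (in the paper they are computed on the fly from the CST); the `stats` object and its method names are hypothetical, not the paper's API.

```python
# Illustrative statistics interface (all names are assumptions, not the paper's code):
#   stats.count(g)         -> c(g), raw count of the n-gram g
#   stats.n1plus_front(g)  -> N1+(. g), distinct left extensions of g
#   stats.n1plus_both(g)   -> N1+(. g .); n1plus_both(()) gives N1+(. .)
#   stats.discount(k, c)   -> D_k(c), the modified discounts
#   stats.gamma(k, ctx)    -> sum_j D_k(j) N_j(ctx .)  (or N'_j for lower orders)

def prob_mkn(w, history, m, stats, sigma):
    """P(w | history) under interpolated modified Kneser-Ney of order m.
    `sigma` is the vocabulary size used in the uniform base distribution."""

    def p(k, ctx):
        # Level k conditions on the k-1 words in ctx.
        if k <= 1:
            # Empty context: unigram continuation counts interpolated with 1/sigma.
            num = stats.n1plus_front((w,))
            denom = stats.n1plus_both(())
            return (max(num - stats.discount(1, num), 0) / denom
                    + stats.gamma(1, ()) / denom * (1.0 / sigma))
        if k == m:
            num = stats.count(ctx + (w,))          # highest order: raw counts
            denom = stats.count(ctx)
        else:
            num = stats.n1plus_front(ctx + (w,))   # lower orders: continuation counts
            denom = stats.n1plus_both(ctx)
        if denom == 0:
            return p(k - 1, ctx[1:])               # unseen context: back off directly
        return (max(num - stats.discount(k, num), 0) / denom
                + stats.gamma(k, ctx) / denom * p(k - 1, ctx[1:]))

    ctx = tuple(history[-(m - 1):]) if m > 1 else ()
    return p(len(ctx) + 1, ctx)
```

Scoring a sentence left-to-right and summing the log probabilities gives the quantities behind the reported perplexities; the optimisations described above (precomputed modified counts, CST node re-use across adjacent m-grams) target exactly these repeated statistics lookups.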
Figure 6: Memory consumption and total runtime for the CST with and without precomputation, KenLM (trie), and SRILM (default) with m ∈ [2, 10], while we also include m = ∞ for CST methods. Trained on the Europarl German corpus and tested over the bottom 1M sentences from the German Common Crawl corpus. (Panels: construction and load+query; axes: Memory [GiB] against Time [sec]; systems: CST on-the-fly, CST precompute, KenLM (trie), KenLM (probing), SRILM.)

For the benchmarking experiments we used the bottom 1M sentences (not used in training) of the German Common Crawl corpus. We used the preprocessing script of Buck et al. (2014), then removed sentences with ≤ 2 words, and replaced rare words[12] (c ≤ 9) in the training data with a special token. In our character-level experiments, we used the training and test data of the benchmark 1-billion-words corpus (Chelba et al., 2013).

Small data: German Europarl. First, we compare the time and memory consumption of both the SRILM and KenLM toolkits, and the CST on the small German corpus. Figure 6 shows that the memory usage for construction and querying for CST-based methods w/o precomputation is independent of m, but grows substantially with m for the SRILM and KenLM benchmarks. To make our results comparable to those reported in (Shareghi et al., 2015), for query time measurements we reported the loading and query time combined. The construction cost is modest, requiring less memory than the benchmark systems for m ≥ 3, and running in a similar time[13] (despite our method supporting queries of unlimited size). Precomputation adds to the construction time, which rose from 173 to 299 seconds, but yielded speed improvements of several orders of magnitude for querying (218k to 98 seconds for 10-gram). In querying, the CST-precompute method is 2-4× slower than both SRILM and KenLM for large m ≥ 5, with the exception of m = 10 where it outperforms SRILM. A substantial fraction of the query time is loading the structures from disk; when this cost is excluded, our approach is between 8-13× slower than the benchmark toolkits. Note that perplexity computed by the CST closely matched KenLM (differences ≤ 0.1).

lang    tokens (M)   m=2     m=3     m=5     m=7     m=10    m=∞
EN      6470         321.6   183.8   154.3   152.7   152.5   152.3
ES      6276         231.3   133.2   111.7   109.7   109.3   109.2
FR      6100         215.8   109.2   85.2    83.1    82.6    82.4
DE      5540         588.3   336.6   292.8   288.1   287.8   287.8

Table 2: Perplexities on English, French, German newstests 2014, and Spanish newstest 2013 when trained on 32 GiB chunks of English, Spanish, French, and German Common Crawl corpus.

Big Data: Common Crawl. Table 2 reports the perplexity results for training on 32 GiB subsets of the English, Spanish, French, and German Common Crawl corpus. Note that with such large datasets, perplexity improves with increasing m, with substantial gains available moving above the widely used m = 5. This highlights the importance of our approach being independent from m, in that we can evaluate for any m, including ∞, at low cost.

Heterogeneous Data. To illustrate the effects of domain shift, corpus size and language model capacity on modelling accuracy, we now evaluate the system using a variety of different training corpora. Table 3 reports the perplexity for German when training over datasets ranging from the small Europarl up to 32 GiB of the Common Crawl corpus. Note that the test set is from the same domain as the News Crawl, which explains the vast difference in perplexities.

[12] Running with the full vocabulary increased the memory requirement by 40% for construction and 5% for querying with our model, and 10% and 30%, resp. for KenLM. Construction times for both approaches were 15% slower, but query runtime was 20% slower for our model versus 80% for KenLM.
[13] For all timings reported in the paper we manually flushed the system cache between each operation (both for construction and querying) to remove the effect of caching on runtime. To query KenLM, we used the speed optimised populate method. (We also compare the memory optimised lazy method in Figure 7.) To train and query SRILM we used the default method which is optimised for speed, but had slightly worse memory usage than the compact method.
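As a point of reference for the perplexity numbers reported here and below, perplexity is the exponential of the negative mean log-probability of the test tokens. The following is a minimal sketch of the computation, assuming a per-word scorer such as the prob_mkn sketch above; the explicit end-of-sentence token is an assumption for illustration, not a detail taken from the paper.

```python
import math

def perplexity(sentences, score, m):
    """perplexity = exp( -(1/N) * sum_i log P(w_i | w_{i-m+1} .. w_{i-1}) )."""
    total_logprob, n_tokens = 0.0, 0
    for sent in sentences:
        tokens = sent + ["</s>"]                   # assume a sentence-end symbol is scored
        for i, w in enumerate(tokens):
            history = tokens[max(0, i - (m - 1)):i]
            total_logprob += math.log(score(w, history))
            n_tokens += 1
    return math.exp(-total_logprob / n_tokens)
```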
Training       tokens (M)   sents (M)   m=3      m=5     m=10
Europarl       55           2.2         1004.8   973.3   971.4
Commentary     6            0.2         941.8    928.6   928.1
NCrawl 2007    37           2.0         514.8    493.5   488.9
NCrawl 2008    126          6.8         427.7    404.8   400.0
NCrawl 2009    119          6.5         433.4    408.9   404.7
NCrawl 2010    54           3.0         472.7    450.9   446.8
NCrawl 2011    298          16.3        362.6    335.9   327.9
NCrawl 2012    377          20.9        343.9    315.6   307.3
NCrawl 2013    641          35.1        268.9    229.8   225.6
NCrawl 2014    845          46.3        247.6    195.2   189.3
All combined   2560         139.3       211.8    158.9   151.5
CCrawl 1G      173          14.2        542.8    515.5   512.7
CCrawl 2G      346          28.5        493.2    462.3   459.2
CCrawl 4G      692          56.2        446.5    412.2   408.8
CCrawl 8G      1390         110.2       402.1    364.4   360.9
CCrawl 16G     2770         216.7       364.9    323.9   319.6
CCrawl 32G     5540         426.6       336.6    292.8   287.8

Table 3: Perplexity of German newstest 2014 with different datasets (Europarl, News-Commentary, News Crawl 2007-2014, Common Crawl 1-32 GiB chunks) and m. Sizes are in millions (M) of tokens and sentences.

The domain effect is strong enough to eliminate the impact of using much larger corpora; compare 10-gram perplexities for training on the smaller News Crawl 2007 corpus versus Europarl. However 'big data' is still useful: in all cases the perplexity improves as we provide more data from the same source. Moreover, the magnitude of the gain in perplexity when increasing m is influenced by the data size: with more training data higher order m-grams provide richer models; hence, the scalability of our method to large datasets is crucially important.

Benchmarking against KenLM. Next we compare our model against the state-of-the-art method, KenLM trie. The perplexity difference between CST and KenLM was less than 0.003 in all experiments.

Construction Cost. Figure 7a compares the peak memory usage of our CST models and KenLM. KenLM is given a target memory usage of the peak usage of our CST models.[14] The construction phase for the CST required more time for lower order models (see Figure 7c) but was comparable for larger m, roughly matching KenLM for m = 10.[15] For the 32 GiB dataset, the CST model took 14 hours to build, compared to KenLM's 13.5 and 4 hours for the 10-gram and 5-gram models, respectively.

Query Cost. As shown in Figure 7b, the memory requirements for querying with the CST method were consistently lower than KenLM for m ≥ 4: for m = 10 the memory consumption of KenLM was 277 GiB compared to our 27 GiB, a 10× improvement. This closely matches the file sizes of the stored models on disk. Figure 7d reports the query runtimes, showing that KenLM becomes substantially slower with increasing dataset size and increasing language model order. In contrast, the runtime of our CST approach is much less affected by data size or model order. Our approach is faster than KenLM with the memory optimised lazy option for m ≥ 3, often by several orders of magnitude. For the faster KenLM populate, our model is still highly competitive, growing to 4× faster for the largest data size.[16] The loading time is still a significant part of the runtime; without this cost, our model is 5× slower than KenLM populate for m = 10 on the largest dataset. Running our model with m = ∞ on the largest data size did not change the memory usage and only had a minor effect on runtime, taking 645s.

Character-level modelling. To demonstrate the full potential of our approach, we now consider character based language modelling, evaluated on the large benchmark 1-billion-words language modelling corpus, a 3.9 GiB (training) dataset with 768M words and 4 billion characters.[17] Table 4 shows the test perplexity results for our models, using the full training vocabulary. Note that perplexity improves with m for the character based model, but plateaus at m = 10 for the word based model; one reason for this is the limited discount computation, m̄ ≤ 10, for the word model, which may not be a good parameterisation for m > m̄.

[14] Using the memory budget option, -S. Note that KenLM often used more memory than specified. Allowing KenLM use of 80% of the available RAM reduced training time by a factor of between 2 and 4.
[15] The CST method uses a single thread for construction, while KenLM uses several threads. Most stages of construction for our method could be easily parallelised.
[16] KenLM benefits significantly from caching which can occur between runs or as more queries are issued (from m-gram repetition in our large 1 million sentence test set), whereas the CST approach does not benefit noticeably (as it does not incorporate any caching functionality).
[17] http://www.statmt.org/lm-benchmark/
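The query-time behaviour discussed above comes from computing the required counts on the fly from an index over the training text, rather than from precomputed m-gram tables. As a simplified stand-in for the CST operations (which also provide the N_1+ continuation counts and occupy far less space than a plain index), the sketch below answers c(·) for an arbitrary-length pattern with a toy suffix array and binary search; it is illustrative only, not the paper's implementation.

```python
def build_suffix_array(tokens):
    """Toy suffix array over a token sequence (quadratic construction; fine for a sketch)."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def ngram_count(tokens, sa, pattern):
    """c(pattern): occurrences of an arbitrary-length n-gram, found via two binary
    searches locating the contiguous block of suffixes that start with `pattern`."""
    pattern = list(pattern)
    m = len(pattern)

    def first_index(strictly_greater):
        # First position whose length-m prefix is >= pattern (or > pattern).
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = tokens[sa[mid]:sa[mid] + m]
            too_small = prefix <= pattern if strictly_greater else prefix < pattern
            if too_small:
                lo = mid + 1
            else:
                hi = mid
        return lo

    return first_index(True) - first_index(False)

# Example on a tiny corpus: "the cat" occurs twice.
corpus = "the cat sat on the mat the cat ran".split()
sa = build_suffix_array(corpus)
assert ngram_count(corpus, sa, ["the", "cat"]) == 2
```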
Figure 7: Memory and runtime statistics for CST and KenLM for construction and querying with different amounts of German Common Crawl training data and different Markov orders, m. We compare the query runtimes against the optimised version of KenLM for memory (lazy) and speed (populate). For clarity, in the figure we only show CST numbers for m = 10; the results for other settings of m are very similar. KenLM was trained to match the construction memory requirements of the CST-precompute method. (Panels (a)-(d): Memory [GiB] and Time [sec] for construction and load+query against input size [GiB]; m ∈ {2, 3, 4, 5, 8, 10}-gram; methods: ken (pop.), ken (lazy), cst.)

unit        time (s)   mem (GiB)   m=5     m=10    m=20    m=∞
word        8164       6.29        73.45   68.66   68.76   68.80
character   17935      18.58       3.93    2.69    2.37    2.33

Table 4: Perplexity results for the 1 billion word benchmark corpus, showing word based and character based MKN models, for different m. Timings and peak memory usage are reported for construction. The word model computed discounts and precomputed counts up to m̄, m̂ = 10, while the character model used thresholds m̄, m̂ = 50. Timings measured on a single core.

Despite the character based model (implicitly) having a massive parameter space, estimating this model was tractable with our approach: the construction time was a modest 5 hours (and 2.3 hours for the word based model). For the same dataset, Chelba et al. (2013) report that training a MKN 5-gram model took 3 hours using a cluster of 100 CPUs; our algorithm is faster than this, despite only using a single CPU core.[18] Queries were also fast: 0.72-0.87ms and 15ms per sentence for word and character based models, respectively.

[18] Chelba et al. (2013) report a better perplexity of 67.6, but they pruned the training vocabulary, whereas we did not. Also we use a stringent treatment of OOV, following Heafield (2013).

6 Conclusions

We proposed a language model based on compressed suffix trees, a representation that is highly compact and can be easily held in memory, while supporting queries needed in computing language model probabilities on the fly. We presented several optimisations to accelerate this process, with only a modest increase in construction time and memory usage, yet improving query runtimes up to 2500×. In benchmarking against the state-of-the-art KenLM package on large corpora, our method has superior memory usage and highly competitive runtimes for both querying and training. Our approach allows easy experimentation with high order language models, and our results provide evidence that such high orders are most useful when using large training sets. We posit that further perplexity gains can be realised using richer smoothing techniques, such as a non-parametric Bayesian prior (Teh, 2006; Wood et al., 2011). Our ongoing work will explore this avenue, as well as integrating our language model into the Moses machine translation system, and improving the querying time by caching the lower order probabilities (e.g., m < 4), which we believe can improve query time substantially while maintaining a modest memory footprint.

Acknowledgements

This research was supported by the Australian Research Council (FT130101105), National ICT Australia (NICTA) and a Google Faculty Research Award.
References

Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. Large language models in machine translation. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

Nieves R. Brisaboa, Susana Ladra, and Gonzalo Navarro. 2009. Directly addressable variable-length codes. In Proceedings of the International Symposium on String Processing and Information Retrieval.

Christian Buck, Kenneth Heafield, and Bas van Ooyen. 2014. N-gram counts and language models from the common crawl. In Proceedings of the Language Resources and Evaluation Conference.

Michael Burrows and David Wheeler. 1994. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation Systems Research Center.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.

Stanley F. Chen and Joshua Goodman. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4):359–393.

Kenneth Church, Ted Hart, and Jianfeng Gao. 2007. Compressing trigram language models with Golomb coding. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Paolo Ferragina and Giovanni Manzini. 2000. Opportunistic data structures with applications. In Proceedings of the Annual Symposium on Foundations of Computer Science.

Ulrich Germann, Eric Joanis, and Samuel Larkin. 2009. Tightly packed tries: How to fit large models into memory, and make them load fast, too. In Proceedings of the Workshop on Software Engineering, Testing, and Quality Assurance for Natural Language Processing.

Simon Gog, Timo Beller, Alistair Moffat, and Matthias Petri. 2014. From theory to practice: Plug and play with succinct data structures. In Proceedings of the International Symposium on Experimental Algorithms.

R. Grossi, A. Gupta, and J. S. Vitter. 2003. High-order entropy-compressed text indexes. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms.

David Guthrie and Mark Hepple. 2010. Storing the web in memory: Space efficient language models with constant time retrieval. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable modified Kneser-Ney language model estimation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Workshop on Statistical Machine Translation.

Kenneth Heafield. 2013. Efficient Language Modeling Algorithms with Applications to Statistical Machine Translation. Ph.D. thesis, Carnegie Mellon University.

Guy Jacobson. 1989. Space-efficient static trees and graphs. In Proceedings of the Annual Symposium on Foundations of Computer Science.

Casey Redd Kennington, Martin Kay, and Annemarie Friedrich. 2012. Suffix trees as language models. In Proceedings of the Conference on Language Resources and Evaluation.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Machine Translation Summit.

Abby Levenberg and Miles Osborne. 2009. Stream-based randomised language models for SMT. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Udi Manber and Eugene W. Myers. 1993. Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948.

Gonzalo Navarro. 2014. Wavelet trees for all. Journal of Discrete Algorithms, 25:2–20.

Enno Ohlebusch, Johannes Fischer, and Simon Gog. 2010. CST++. In Proceedings of the International Symposium on String Processing and Information Retrieval.

Adam Pauls and Dan Klein. 2011. Faster and smaller n-gram language models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.

Rajeev Raman, Venkatesh Raman, and S. Srinivasa Rao. 2002. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms.

Thomas Schnattinger, Enno Ohlebusch, and Simon Gog. 2010. Bidirectional search in a string with wavelet trees. In Proceedings of the Annual Symposium on Combinatorial Pattern Matching.

Ehsan Shareghi, Matthias Petri, Gholamreza Haffari, and Trevor Cohn. 2015. Compact, efficient and unlimited capacity: Language modeling with compressed suffix trees. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
Jeffrey Sorensen and Cyril Allauzen. 2011. Unary data structures for language models. In Proceedings of INTERSPEECH.

Andreas Stolcke, Jing Zheng, Wen Wang, and Victor Abrash. 2011. SRILM at sixteen: Update and outlook. In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop.

Andreas Stolcke. 2002. SRILM – an extensible language modeling toolkit. In Proceedings of the International Conference of Spoken Language Processing.

David Talbot and Miles Osborne. 2007. Randomised language modelling for statistical machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Yee Whye Teh. 2006. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Taro Watanabe, Hajime Tsukada, and Hideki Isozaki. 2009. A succinct n-gram language model. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Peter Weiner. 1973. Linear pattern matching algorithms. In Proceedings of the Annual Symposium on Switching and Automata Theory.

Frank Wood, Jan Gasthaus, Cédric Archambeau, Lancelot James, and Yee Whye Teh. 2011. The sequence memoizer. Communications of the ACM, 54(2):91–98.

Ying Zhang and Stephan Vogel. 2006. Suffix array and its applications in empirical natural language processing. Technical report, CMU, Pittsburgh, PA.