Transactions of the Association for Computational Linguistics, 2 (2014) 465–476. Action Editor: Kristina Toutanova.

Transactions of the Association for Computational Linguistics, 2 (2014) 465–476. Action Editor: Kristina Toutanova.

Submitted 11/2013; Revised 5/2014; Revised 9/2014; Published 10/2014. c
(cid:13)

2014 Association for Computational Linguistics.

465

OnlineAdaptorGrammarswithHybridInferenceKeZhaiComputerScienceandUMIACSUniversityofMarylandCollegePark,MDUSAzhaike@cs.umd.eduJordanBoyd-GraberComputerScienceUniversityofColoradoBoulder,COUSAjordan.boyd.graber@colorado.eduShayB.CohenSchoolofInformaticsUniversityofEdinburghEdinburgh,Scotland,UKscohen@inf.ed.ac.ukAbstractAdaptorgrammarsareaflexible,powerfulformalismfordefiningnonparametric,un-supervisedmodelsofgrammarproductions.Thisflexibilitycomesatthecostofexpensiveinference.Weaddressthedifficultyofinfer-encethroughanonlinealgorithmwhichusesahybridofMarkovchainMonteCarloandvariationalinference.Weshowthatthisin-ferencestrategyimprovesscalabilitywithoutsacrificingperformanceonunsupervisedwordsegmentationandtopicmodelingtasks.1IntroductionNonparametricBayesianmodelsareeffectivetoolstodiscoverlatentstructureindata(M¨ullerandQuin-tana,2004).Thesemodelshavehadgreatsuccessintextanalysis,especiallysyntax(Shindoetal.,2012).Nonparametricdistributionsprovidesupportoveracountablyinfinitelong-taileddistributionscommoninnaturallanguage(Goldwateretal.,2011).Wefocusonadaptorgrammars(Johnsonetal.,2006),syntacticnonparametricmodelsbasedonprobabilisticcontext-freegrammars.Adaptorgram-marsweakenthestrongstatisticalindependenceas-sumptionsPCFGsmake(Section2).Theweakerstatisticalindependenceassumptionsthatadaptorgrammarsmakecomeatthecostofex-pensiveinference.Adaptorgrammarsarenotaloneinthistrade-off.Forexample,nonparametricexten-sionsoftopicmodels(Tehetal.,2006)havesubstan-tiallymoreexpensiveinferencethantheirparametriccounterparts(Yaoetal.,2009).Acommonapproachtoaddressthiscompu-tationalbottleneckisthroughvariationalinfer-ence(WainwrightandJordan,2008).Oneoftheadvantagesofvariationalinferenceisthatitcanbeeasilyparallelized(Nallapatietal.,2007)ortrans-formedintoanonlinealgorithm(Hoffmanetal.,2010),whichoftenconvergesinfeweriterationsthanbatchvariationalinference.Pastvariationalinferencetechniquesforadap-torgrammarsassumeapreprocessingstepthatlooksatallavailabledatatoestablishthesupportofthesenonparametricdistributions(Cohenetal.,2010).Thus,thesepastapproachesarenotdirectlyamenabletoonlineinference.MarkovchainMonteCarlo(MCMC)inference,analternativetovariationalinference,doesnothavethisdisadvantage.MCMCiseasiertoimplement,anditdiscoversthesupportofnonparametricmod-elsduringinferenceratherthanassumingitapriori.Weapplystochastichybridinference(Mimnoetal.,2012)toadaptorgrammarstogetthebestofbothworlds.WeinterleaveMCMCinferenceinsidevari-ationalinference.Thispreservesthescalabilityofvariationalinferencewhileaddingthesparsestatis-ticsandimprovedexplorationMCMCprovides.OurinferencealgorithmforadaptorgrammarsstartswithavariationalalgorithmsimilartoCohenetal.(2010)andaddshybridsamplingwithinvaria-tionalinference(Section3).Thisobviatestheneedforexpensivepreprocessingandisanecessarysteptocreateanonlinealgorithmforadaptorgrammars.Ouronlineextension(Section4)processesexam-plesinsmallbatchestakenfromastreamofdata.Asdataarrive,thealgorithmdynamicallyextendstheunderlyingapproximateposteriordistributionsasmoredataareobserved.Thismakesthealgo-rithmflexible,scalable,andamenabletodatasetsthatcannotbeexaminedexhaustivelybecauseoftheirsize—e.g.,terabytesofsocialmediadataap-peareverysecond—ortheirnature—e.g.,speechac-quisition,wherealanguagelearnerislimitedtothebandwidthofthehumanperceptualsystemandcan-notacquiredatainamonolithicbatch(B¨orschingerandJohnson,2012).Weshowourapproach’sscalabilityandeffective-

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u

/
t

a
c
l
/

l

a
r
t
i
c
e

p
d

f
/

d
o

i
/

.

1
0
1
1
6
2

/
t

l

a
c
_
a
_
0
0
1
9
6
1
5
6
6
9
0
9

/

/
t

l

a
c
_
a
_
0
0
1
9
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3

466

nessbyapplyingourinferenceframeworkinSec-tion5ontwotasks:unsupervisedwordsegmenta-tionandinfinite-vocabularytopicmodeling.2BackgroundInthissection,wereviewprobabilisticcontext-freegrammarsandadaptorgrammars.2.1ProbabilisticContext-freeGrammarsProbabilisticcontext-freegrammars(PCFG)de-fineprobabilitydistributionsoverderivationsofacontext-freegrammar.WedefineaPCFGGtobeatuplehW,N,R,S,θi:asetofterminalsW,asetofnonterminalsN,productionsR,startsym-bolS∈Nandavectorofruleprobabilitiesθ.TherulesthatrewritenonterminalcisR(c).ForamorecompletedescriptionofPCFGs,seeManningandSch¨utze(1999).PCFGstypicallyusenonterminalswithasyntacticinterpretation.Asequenceofterminals(theyield)isgeneratedbyrecursivelyrewritingnonterminalsassequencesofchildsymbols(eitheranonterminalorasymbol).Thisbuildsahierarchicalphrase-treestructureforeveryyield.Forexample,anonterminalVPrepresentsaverbphrase,whichprobabilisticallyrewritesintoase-quenceofnonterminalsV,N(correspondingtoverbandnoun)usingtheproductionruleVP→VN.Bothnonterminalscanbefurtherrewritten.Eachnonterminalhasamultinomialdistributionoverex-pansions;forexample,amultinomialfornonter-minalNwouldrewriteas“cake”,withprobabilityθN→cake=0.03.Rewritingterminateswhenthederivationhasreachedaterminalsymbolsuchas“cake”(whichdoesnotrewrite).WhilePCFGsareusedbothinthesupervisedset-tingandintheunsupervisedsetting,inthispaperweassumeanunsupervisedsetting,inwhichonlyterminalsareobserved.Ourgoalistopredicttheunderlyingphrase-structuretree.2.2AdaptorGrammarsPCFGsassumethattherewritingoperationsarein-dependentgiventhenonterminal.Thiscontext-freenessassumptionoftenistoostrongformodelingnaturallanguage.Adaptorgrammarsbreakthisindependenceas-sumptionbytransformingaPCFG’sdistributionoverAlgorithm1GenerativeProcess1:Fornonterminalsc∈N,drawruleprobabilitiesθc∼Dir(αc)forPCFGG.2:foradaptednonterminalcinc1,…,c|M|do3:DrawgrammatonHc∼PYGEM(ac,bc,Gc)accordingtoEquation1,whereGcisdefinedbythePCFGrulesR.4:Fori∈{1,…,D},generateaphrase-structuretreetS,iusingthePCFGrulesR(e)atnon-adaptednonterminaleandthegrammatonsHcatadaptednonterminalsc.5:Theyieldsoftreest1,…,tDareobservationsx1,…,xD.treesGcrootedatnonterminalcintoaricherdistri-butionHcoverthetreesheadedbyanonterminalc,whichisoftenreferredtoasthegrammaton.APitman-YorAdaptorgrammar(PYAG)formstheadaptedtreedistributionsHcusingaPitman-Yorprocess(PitmanandYor,1997,PY),ageneralizationoftheDirichletprocess(Ferguson,1973,DP).1AdrawHc≡(πc,zc)isformedbythestickbreak-ingprocess(SudderthandJordan,2008,PYGEM)parametrizedbyscaleparametera,discountfactorb,andbasedistributionGc:π0k∼Beta(1−b,a+kb),zk∼Gc,πk≡π0kQk−1j=1(1−π0j),H≡Pkπkδzk.(1)Intuitively,thedistributionHcisadiscreterecon-structionoftheatomssampledfromGc—hence,reweightsGc.GrammatonHcassignsnon-zerostick-breakingweightsπtoacountablyinfinitenumberofparsetreesz.WedescribelearningthesegrammatonsinSection3.Moreformally,aPYAGisaquintupleA=hG,M,a,b,αiwith:aPCFGG;asetofadaptednonterminalsM⊆N;Pitman-Yorprocessparam-etersac,bcateachadaptorc∈MandDirichletparametersαcforeachnonterminalc∈N.Wealsoassumeanorderontheadaptednonterminals,c1,…,c|M|suchthatcjisnotreachablefromciinaderivationifj>i.2Algorithm1describesthegenerativeprocessofanadaptorgrammaronasetofDobservedsen-tencesx1,…,xD.1Adaptorgrammars,intheirgeneralform,donothavetousethePitman-YorprocessbuthaveonlybeenusedwiththePitman-Yorprocess.2Thisispossiblebecauseweassumethatrecursivenonter-minalsarenotadapted.

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u

/
t

a
c
l
/

l

a
r
t
i
c
e

p
d

f
/

d
o

i
/

.

1
0
1
1
6
2

/
t

l

a
c
_
a
_
0
0
1
9
6
1
5
6
6
9
0
9

/

/
t

l

a
c
_
a
_
0
0
1
9
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3

467

GivenaPYAGA,thejointprobabilityforasetofsentencesXanditscollectionoftreesTisp(X,T,π,θ,z|A)=Qc∈Mp(πc|ac,bc)p(zc|Gc)·Qc∈Np(θc|αc)Qxd∈Xp(xd,td|θ,π,z),wherexdandtdrepresentthedthobservedstringanditscorrespondingparse.ThemultinomialPCFGparameterθcisdrawnfromaDirichletdistributionatnonterminalc∈N.Ateachadaptednontermi-nalc∈M,thestick-breakingweightsπcaredrawnfromaPYGEM(Equation1).Eachweighthasanas-sociatedatomzc,ifrombasedistributionGc,asub-treerootedatc.Theprobabilityp(xd,td|θ,π,z)isthePCFGlikelihoodofyieldxdwithparsetreetd.AdaptorgrammarsrequireabasePCFGsuchthatitdoesnothaverecursiveadaptednonterminals,i.e.,therecannotbeapathinaderivationfromagivenadaptednonterminaltoasecondappearanceofthatadaptednonterminal.3HybridVariational-MCMCInferenceDiscoveringthelatentvariablesofthemodel—trees,adaptedprobabilities,andPCFGrules—isaproblemofposteriorinferencegivenobserveddata.Previ-ousapproachesuseMCMC(Johnsonetal.,2006)orvariationalinference(Cohenetal.,2010).MCMCdiscoversthesupportofnonparametricmodelsduringtheinference,butdoesnotscaletolargerdatasets(duetotightcouplingofvariables).Variationalinference,however,isinherentlyparal-lelandeasilyamendabletoonlineinference,butre-quirespreprocessingtodiscovertheadaptedproduc-tions.Wecombinethebestofbothworldsandpro-poseahybridvariational-MCMCinferencealgorithmforadaptorgrammars.Variationalinferencepositsavariationaldistribu-tionoverthelatentvariablesinthemodel;thisinturninducesan“evidencelowerbound”(ELBO,L)asafunctionofavariationaldistributionq,alowerboundonthemarginallog-likelihood.Variationalinferenceoptimizesthisobjectivefunctionwithre-specttotheparametersthatdefineq.Inthissection,wederivecoordinate-ascentup-datesforthesevariationalparameters.Akeymath-ematicalcomponentistakingexpectationswithre-specttothevariationaldistributionq.Westrategi-callyuseMCMCsamplingtocomputetheexpecta-tionofqoverparsetreesz.Insteadofexplicitlycomputingthevariationaldistributionforallparam-eters,onecansamplefromit.Thisproducesasparseapproximationofthevariationaldistribution,whichimprovesbothscalabilityandperformance.Sparsedistributionsareeasiertostoreandtransmitinim-plementations,whichimprovesscalability.Mimnoetal.(2012)alsoshowthatsparserepresentationsimproveperformance.Moreover,becauseitcanflexiblyadjustitssupport,itisanecessaryprereq-uisitetoonlineinference(Section4).3.1VariationalLowerBoundWepositamean-fieldvariationaldistribution:q(π,θ,T|γ,ν,φ)=Qc∈MQ∞i=1q(π0c,i|ν1c,i,ν2c,i)·Qc∈Nq(θc|γc)Qxd∈Xq(td|φd),(2)whereπ0c,iisdrawnfromavariationalBetadistri-butionparameterizedbyν1c,i,ν2c,i;andθcisfromavariationalDirichletpriorγc∈R|R(c)|+.Indexirangesoverapossiblyinfinitenumberofadaptedrules.Theparseforthedthobservation,tdismod-eledbyamultinomialφd,whereφd,iistheproba-bilitygeneratingtheithphrase-structuretreetd,i.ThevariationaldistributionoverlatentvariablesinducesthefollowingELBOonthelikelihood:L(z,π,θ,T,D;a,b,α)=H[q(θ,π,T)]+Pc∈NEq[logp(θc|αc)](3)+Pc∈MP∞i=1Eq[logp(π0c,i|ac,bc)]+Pc∈MP∞i=1Eq[logp(zc,i|π,θ)]+Pxd∈XEq[logp(xd,td|π,θ,z)],whereH[•]istheentropyfunction.Tomakethislowerboundtractable,wetruncatethedistributionoverπtoafiniteset(BleiandJor-dan,2005)foreachadaptednonterminalc∈M,i.e.,π0c,Kc≡1forsomeindexKc.BecausetheatomweightsπkaredeterministicallydefinedbyEqua-tion1,thisimpliesthatπc,iiszerobeyondindexKc.Eachweightπc,iisassociatedwithanatomzc,i,asubtreerootedatc.Wecalltheorderedsetofzc,ithetruncatednonterminalgrammaton(TNG).Eachadaptednonterminalc∈MhasitsownTNGc.TheithsubtreeinTNGcisdenotedTNGc(i).

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u

/
t

a
c
l
/

l

a
r
t
i
c
e

p
d

f
/

d
o

i
/

.

1
0
1
1
6
2

/
t

l

a
c
_
a
_
0
0
1
9
6
1
5
6
6
9
0
9

/

/
t

l

a
c
_
a
_
0
0
1
9
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3

468

Intherestofthissection,wedescribeapproxi-mateinferencetomaximizeL.Themostimpor-tantupdateisφd,i,whichweupdateusingstochasticMCMCinference(Section3.2).Pastvariationalap-proachesforadaptorgrammars(Cohenetal.,2010)relyonapreprocessingstepandheuristicstodefineastaticTNG.Incontrast,ourmodeldynamicallydiscoverstrees.TheTNGgrowsasthemodelseesmoredata,allowingonlineupdates(Section4).Theremainingvariationalparametersareopti-mizedusingexpectedcountsofadaptorgrammarrules.TheseexpectedcountsaredescribedinSec-tion3.3,andthevariationalupdatesforthevari-ationalparametersexcludingφd,iaredescribedinSection3.4.3.2StochasticMCMCInferenceEachobservationxdhasanassociatedvariationalmultinomialdistributionφdovertreestdthatcanyieldobservationxdwithprobabilityφd,i.Hold-ingallothervariationalparametersfixed,thecoordinate-ascentupdate(Mimnoetal.,2012;Bishop,2006)forφd,iisφd,i∝exp{E¬φdq[logp(td,i|xd,π,θ,z)]},(4)whereφd,iistheprobabilitygeneratingtheithphrase-structuretreetd,iandE¬φdq[•]istheexpec-tationwithrespecttothevariationaldistributionq,excludingthevalueofφd.Insteadofcomputingthisexpectationexplicitly,weturntostochasticvariationalinference(Mimnoetal.,2012;Hoffmanetal.,2013)tosamplefromthisdistribution.Thisproducesasetofsampledtreesσd≡{σd,1,…,σd,k}.Fromthissetoftreeswecanapproximateourvariationaldistributionovertreesφusingtheempiricaldistributionσd,i.e.,φd,i∝I[σd,j=td,i,∀σd,j∈σd].(5)Thisleadstoasparseapproximationofvariationaldistributionφ.3Previousinferencestrategies(Johnsonetal.,2006;B¨orschingerandJohnson,2012)foradaptorgrammarshaveusedsampling.Theadaptorgram-marinferencemethodsuseanapproximatePCFGtoemulatethemarginalizedPitman-Yordistributions3Inourexperiments,weusetensamples.ateachnonterminal.GiventhisapproximatePCFG,wecanthensampleaderivationzforstringxfromthepossibletrees(Johnsonetal.,2007).SamplingrequiresaderivedPCFGG0thatapprox-imatesthedistributionovertreederivationscondi-tionedonayield.ItincludestheoriginalPCFGrulesR={c→β}thatdefinethebasedistributionandthenewadaptedproductionsR0={c⇒z,z∈TNGc}.UnderG0,theprobabilityθ0ofadaptedpro-ductionc⇒zislogθ0c⇒z=Eq[logπc,i],ifTNGc(i)=zEq[logπc,Kc]+Eq[logθc⇒z],otherwise(6)whereKcisthetruncationlevelofTNGcandπc,Kcrepresentstheleft-overstickweightsinthestick-breakingprocessforadaptorc∈M.θc⇒zrepre-sentstheprobabilityofgeneratingtreec⇒zunderthebasedistribution.SeealsoCohen(2011).TheexpectationofthePitman-Yormultinomialπc,iunderthetruncatedvariationalstick-breakingdistributionisEq[logπa,i]=Ψ(ν1a,i)−Ψ(ν1a,i+ν2a,i)(7)+Pi−1j=1(Ψ(ν2a,j)−Ψ(ν1a,j+ν2a,j)),andtheexpectationofgeneratingthephrase-structuretreea⇒zbasedonPCFGproductionsunderthevariationalDirichletdistributionisEq[logθa⇒z]=Pc→β∈a⇒z(cid:0)Ψ(γc→β)(8)−Ψ(Pc→β0∈Rcγc→β0)(cid:1)whereΨ(•)isthedigammafunction,andc→β∈a⇒zrepresentsallPCFGproductionsinthephrase-structuretreea⇒z.ThisPCFGcancomposearbitrarysubtreesandthusdiscovernewtreesthatbetterdescribethedata,evenifthosetreesarenotpartoftheTNG.Thisisequivalenttocreatinga“newtable”inMCMCin-ferenceandprovidestruncation-freevariationalup-dates(WangandBlei,2012)bysamplingaunseensubtreewithadaptednonterminalc∈Mattheroot.Thisfreesourmodelfrompreprocessingtoinitial-izetruncatedgrammatonsinCohenetal.(2010).Thisstochasticapproachhastheadvantageofcreat-ingsparsedistributions(WangandBlei,2012):fewuniquetreeswillberepresented.

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u

/
t

a
c
l
/

l

a
r
t
i
c
e

p
d

f
/

d
o

i
/

.

1
0
1
1
6
2

/
t

l

a
c
_
a
_
0
0
1
9
6
1
5
6
6
9
0
9

/

/
t

l

a
c
_
a
_
0
0
1
9
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3

469

S→ABB→{a,b,c}A→BBaBbGrammarSeating Assignments (nonterminal A)YieldParseCountscaBcASBaNew SeatingBaBbBch(A →c) +=1g(B →c) +=1g(B →a) +=1abBaASBbBaBbBah(A →a) +=1g(B →a) +=1g(B →b) +=1baBbASBaBaBbf(A →b) +=1g(B →a) +=1Figure1:Givenanadaptorgrammar,wesamplederivationsgivenanapproximatePCFGandshowhowtheseaffectcounts.ThesampledderivationscanbeunderstoodviatheChineserestaurantmetaphor(Johnsonetal.,2006).Existingcachedrules(elementsintheTNG)canbethoughtofasoccupiedta-bles;thishappensinthecaseoftheyield“ba”,whichincreasescountsforunadaptedrulesgandforentriesinTNGA,f.Fortheyield“ca”,thereisnoappropriateentryintheTNG,soitmustusethebasedistribution,whichcorrespondstosittingatanewtable.Thisgeneratescountsforg,asitusestheunadaptedruleandforh,whichrepresentsentriesthatcouldbeincludedintheTNGinthefuture.Thefinalyield,“ab”,showsthatevenwhencompatibleentriesareintheTNG,itmightstillcreateanewtable,changingtheunderlyingbasedistribution.ParallelizationAsnotedinCohenetal.(2010),theinside-outsidealgorithmdominatestheruntimeofeveryiteration,bothforsamplingandvariationalinference.However,unlikeMCMC,variationalin-ferenceishighlyparallelizableandrequiresfewersynchronizationsperiteration(Zhaietal.,2012).Inourapproach,bothinsidealgorithmsandsamplingprocesscanbedistributed,andthosecountscanbeaggregatedafterwards.Inourimplementation,weusemultiplethreadstoparallelizetreesampling.3.3CalculatingExpectedRuleCountsForeveryobservationxd,thehybridapproachpro-ducesasetofsampledtrees,eachofwhichcontainsthreetypesofproductions:adaptedrules,originalPCFGrules,andpotentiallyadaptedrules.Thelastsetismostimportant,asthesearenewrulesdis-coveredbythesampler.TheseareexplainedusingtheChineserestaurantmetaphorinFigure1.ThemultisetofalladaptedproductionsisM(td,i)andthemultisetofnon-adaptedproductionsthatgener-atetreetd,iisN(td,i).Wecomputethreecounts:1:fistheexpectednumberofproductionswithintheTNG.Itisthesumovertheprobabilityofatreetd,ktimesthenumberoftimesanadaptedproductionappearedintd,k,fd(a⇒za,i)=Pk(cid:0)φd,k|a⇒za,i:a⇒za,i∈M(td,k)||{z}countofrulea⇒za,iintreetd,k(cid:1).2:gistheexpectedcountsofPCFGproductionsRthatdefinesthebasedistributionoftheadaptorgrammar,gd(a→β)=Pk(φd,k|a→β:a→β∈N(td,k)|).3:Finally,athirdsetofproductionsarenewlydis-coveredbythesamplerandnotintheTNG.Thesesubtreesarerulesthatcouldbeadapted,withexpectedcountshd(c⇒zc,i)=Pk(φd,k|c⇒zc,i:c⇒zc,i/∈M(td,k)|).Thesesubtrees—listsofPCFGrulessampledfromEquation6—correspondtoadaptedpro-ductionsnotyetpresentintheTNG.3.4VariationalUpdatesGiventhesparsevectorsφsampledfromthehybridMCMCstep,weupdateallvariationalparametersasγa→β=αa→β+Pxd∈Xgd(a→β)+Pb∈MPKbi=1n(a→β,zb,i),ν1a,i=1−ba+Pxd∈Xfd(a⇒za,i)+Pb∈MPKbk=1n(a⇒za,i,zb,k),ν2a,i=aa+iba+Pxd∈XPKaj=1fd(a⇒za,j)+Pb∈MPKbk=1PKaj=1n(a⇒za,j,zb,k),wheren(r,t)istheexpectednumberoftimespro-ductionrisintreet,estimatedduringsampling.HyperparameterUpdateWeupdateourPCFGhyperparameterα,PYGEMhyperparametersaandbasinCohenetal.(2010).4OnlineVariationalInferenceOnlineinferenceforprobabilisticmodelsrequiresustoupdateourposteriordistributionasnewobserva-tionsarrive.Unlikebatchinferencealgorithms,wedonotassumewealwayshaveaccesstotheentire

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u

/
t

a
c
l
/

l

a
r
t
i
c
e

p
d

f
/

d
o

i
/

.

1
0
1
1
6
2

/
t

l

a
c
_
a
_
0
0
1
9
6
1
5
6
6
9
0
9

/

/
t

l

a
c
_
a
_
0
0
1
9
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3

470

dataset.Instead,weassumethatobservationsarriveinsmallgroupscalledminibatches.Theadvantageofonlineinferenceisthreefold:a)itdoesnotre-quireretainingthewholedatasetinmemory;b)eachonlineupdateisfast;andc)themodelusuallycon-vergesfaster.Allofthesemakeadaptorgrammarsscalabletolargerdatasets.Ourapproachisbasedonthestochasticvaria-tionalinferencefortopicmodels(Hoffmanetal.,2013).Thisinferencestrategyusesaformofstochasticgradientdescent(Bottou,1998):usingthegradientoftheELBO,itfindsthesufficientstatisticsnecessarytoupdatevariationalparameters(whicharemostlyexpectedcountscalculatedusingtheinside-outsidealgorithm),andinterpolatestheresultwiththecurrentmodel.WeassumedataarriveinminibatchesB(asetofsentences).Weaccumulateexpectedcounts˜f(l)(a⇒za,i)=(1−(cid:15))·˜f(l−1)(a⇒za,i)(9)+(cid:15)·|X||Bl|Pxd∈Blfd(a⇒za,i),˜g(l)(a→β)=(1−(cid:15))·˜g(l−1)(a→β)(10)+(cid:15)·|X||Bl|Pxd∈Blgd(a→β),withdecayfactor(cid:15)∈(0,1)toguaranteeconver-gence.Wesetitto(cid:15)=(τ+l)−κ,wherelistheminibatchcounter.Thedecayinertiaτpreventspre-matureconvergence,anddecayrateκcontrolsthespeedofchangeinsufficientstatistics(Hoffmanetal.,2010).WerecoverbatchvariationalapproachwhenB=Dandκ=0.Thevariables˜f(l)and˜g(l)areaccumulatedsuffi-cientstatisticsofadaptedandunadaptedproductionsafterprocessingminibatchBl.Theyupdatetheap-proximategradient.Theupdatesforvariationalpa-rametersbecomeγa→β=αa→β+˜g(l)(a→β)(11)+Pb∈MPKbi=1n(a→β,zb,i),ν1a,i=1−ba+˜f(l)(a⇒za,i)(12)+Pb∈MPKbk=1n(a⇒za,i,zb,k),ν2a,i=aa+iba+PKaj=1˜f(l)(a⇒za,j)(13)+Pb∈MPKbk=1PKaj=1n(a⇒za,j,zb,k),whereKaisthesizeoftheTNGatadaptora∈M.4.1RefiningtheTruncationAsweobservemoredataduringinference,ourTNGsneedtochange.Newrulesshouldbeadded,uselessrulesshouldberemoved,andderivationsforexistingrulesshouldbeupdated.Inthissection,wedescribeheuristicsforperformingeachoftheseoperations.AddingProductionsSamplingcanidentifypro-ductionsthatarenotadaptedbutwereinsteaddrawnfromthebasedistribution.ThesearecandidatesfortheTNG.Foreverynonterminala,weaddthesepotentiallyadaptedproductionstoTNGaaftereachminibatch.Thecountassociatedwithcandidatepro-ductionsisnowassociatedwithanadaptedproduc-tion,i.e.,thehcountcontributestotherelevantfcount.ThismechanismdynamicallyexpandsTNGa.SortingandRemovingProductionsOurmodeldoesnotrequireapreprocessingsteptoinitializetheTNGs,rather,itconstructsandexpandsallTNGsonthefly.TopreventtheTNGfromgrowingunwieldy,wepruneTNGaftereveryuminibatches.Asare-sult,weneedtoimposeanorderingoveralltheparsetreesintheTNG.TheunderlyingPYGEMdistribu-tionimplicitlyplacesanrankingoveralltheatomsaccordingtotheircorrespondingsufficientstatis-tics(Kuriharaetal.,2007),asshowninEquation9.Itmeasuresthe“usefulness”ofeveryadaptedpro-ductionthroughoutinferenceprocess.Inadditiontoaccumulatedsufficientstatistics,Cohenetal.(2010)addasecondarytermtodiscour-ageshortconstituents(Mochihashietal.,2009).Weimposearewardtermforlongerphrasesinadditionto˜fandsortalladaptedproductionsinTNGausingtherankingscoreΛ(a⇒za,i)=˜f(l)(a⇒za,i)·log((cid:15)·|s|+1),where|s|isthenumberofyieldsinproductiona⇒za,i.Because(cid:15)decreaseseachminibatch,therewardforlongphrasesdiminishes.ThisissimilartoanannealedversionofCohenetal.(2010)—wheretherewardforlongphrasesisfixed,seealsoMochihashietal.(2009).Aftersorting,weremoveallbutthetopKaadaptedproductions.RederivingAdaptedProductionsForMCMCin-ference,JohnsonandGoldwater(2009)observethatatomsalreadyassociatedwithayieldmayhavetrees

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u

/
t

a
c
l
/

l

a
r
t
i
c
e

p
d

f
/

d
o

i
/

.

1
0
1
1
6
2

/
t

l

a
c
_
a
_
0
0
1
9
6
1
5
6
6
9
0
9

/

/
t

l

a
c
_
a
_
0
0
1
9
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3

471

Algorithm2Onlineinferenceforadaptorgrammars1:Randominitializeallvariationalparameters.2:forminibatchofl=1,2,…do3:ConstructapproximatePCFGθ0ofA(Equation6).4:forinputsentenced=1,2,…,Dldo5:AccumulateinsideprobabilitiesfromapproximatePCFGθ0.6:Samplephrase-structuretreesσandupdatethetreedistributionφ(Equation5).7:Foreveryadaptednonterminalc,appendadaptedpro-ductionstoTNGc.8:Accumulatesufficientstatistics(Equations9and10).9:Updateγ,ν1,andν2(Equations11-13).10:Refineandprunethetruncationeveryuminibatches.thatdonotexplaintheiryieldwell.Theyproposeta-blelabelresamplingtorederiveyields.Inourapproachthisisequivalentto“mutating”somederivationsinaTNG.Afterpruningrulesev-eryuminibatches,weperformtablelabelresam-plingforadaptednonterminalsfromgeneraltospe-cific(i.e.,atopologicalsort).Thisprovidesbetterexpectedcountsn(r,•)forrulesusedinphrase-structuresubtrees.Empirically,wefindtablela-belresamplingonlymarginallyimprovestheword-segmentationresult.InitializationOurinferencebeginswithrandomvariationalDirichletsandemptyTNGs,whichobvi-atesthepreprocessingstepinCohenetal.(2010).OurmodelconstructsandexpandsallTNGsonthefly.ItmimicstheincrementalinitializationofJohn-sonandGoldwater(2009).Algorithm2summarizesthepseudo-codeofouronlineapproach.4.2ComplexityInsideandoutsidecallsdominateexecutiontimeforadaptorgrammarinference.Variationalap-proachescomputeinside-outsidealgorithmsandes-timatetheexpectedcountsforeverypossibletreederivation(Cohenetal.,2010).ForadatasetwithDobservations,variationalinferencerequiresO(cid:0)DI(cid:1)callstoinside-outsidealgorithm,whereIisthenumberofiterations,typicallyinthetens.Incontrast,MCMConlyneedstoaccumulatein-sideprobabilities,andthensampleatreederiva-tion(ChappelierandRajman,2000).Thesamplingstepisnegligibleinprocessingtimecomparedtotheinsidealgorithm.MCMCinferencerequiresO(cid:0)DI(cid:1)callstotheinsidealgorithm—henceeveryiterationcollocationSENT7→COLLOCSENT7→COLLOCSENTCOLLOC7→WORDSunigramWORDS7→WORDWORDS7→WORDWORDSWORD7→CHARSCHARS7→CHARCHARS7→CHARCHARSCHAR7→?InfVocLDASENT7→DOCjj=1,2,…DDOCj7→−jTOPICii=1,2,…KTOPICi7→WORDWORD7→CHARSCHARS7→CHARCHARS7→CHARCHARSCHAR7→?Table1:Grammarsusedinourexperiments.ThenonterminalCHARisanon-adaptedrulethatexpandstoallcharactersusedinthedata,sometimescalledpre-terminals.Adaptednonter-minalsareunderlined.Fortheunigramgrammar,onlynonter-minalWORDisadapted;whereasforthecollocationgrammar,bothnonterminalsWORDandCOLLOCareadapted.FortheIN-FVOCLDAgrammar,DisthetotalnumberofdocumentsandKisthenumberoftopics.Therefore,jrangesover{1,…,D}andirangesover{1,…,K}.ismuchfasterthanvariationalapproach—butIisusuallyontheorderofthousands.Likewise,ourhybridapproachalsoonlyneedsthelessexpensiveinsidealgorithmtosampletrees.Andwhileeachiterationislessexpensive,ourap-proachcanachievereasonableresultswithonlyasinglepassthroughthedata.AndthusonlyrequiresO(D)callstotheinsidealgorithm.Becausetheinside-outsidealgorithmisfunda-mentaltoeachofthesealgorithms,weuseitasacommonbasisforcomparisonacrossdifferentim-plementations.Thisisover-generoustovariationalapproaches,asthefullinside-outsidecomputationismoreexpensivethantheinsidealgorithmrequiredforsamplinginMCMCandourhybridapproach.5ExperimentsandDiscussionWeimplementouronlineadaptorgrammarmodel(ONLINE)inPython4andcompareitagainstbothMCMC(JohnsonandGoldwater,2009,MCMC)andthevariationalinference(Cohenetal.,2010,VARI-ATIONAL).WeusethelatestimplementationofMCMCsamplerforadaptorgrammars5andsimulatethevariationalapproachusingourimplementation.ForMCMCapproach,weusethebestsettingsre-portedinJohnsonandGoldwater(2009)withincre-mentalinitializationandtablelabelresampling.4Availableathttp://www.umiacs.umd.edu/˜zhaike/.5http://web.science.mq.edu.au/˜mjohnson/code/py-cfg-2013-02-25.tgz

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u

/
t

a
c
l
/

l

a
r
t
i
c
e

p
d

f
/

d
o

i
/

.

1
0
1
1
6
2

/
t

l

a
c
_
a
_
0
0
1
9
6
1
5
6
6
9
0
9

/

/
t

l

a
c
_
a
_
0
0
1
9
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3

472

ModelandSettingsctb7pkucityuunigramcollocationunigramcollocationunigramcollocationMCMC500iter72.70(2.81)50.53(2.82)72.01(2.82)49.06(2.81)74.19(3.55)63.14(3.53)1000iter72.65(2.83)62.27(2.79)71.81(2.81)62.47(2.77)74.37(3.54)70.62(3.51)1500iter72.17(2.80)69.65(2.77)71.46(2.80)70.20(2.73)74.22(3.54)72.33(3.50)2000iter71.75(2.79)71.66(2.76)71.04(2.79)72.55(2.70)74.01(3.53)73.15(3.48)ONLINEκτKWord=30kKColloc=100kKWord=40kKColloc=120kKWord=50kKColloc=150K0.63270.17(2.84)68.43(2.77)69.93(2.89)68.09(2.71)72.59(3.62)69.27(3.61)12872.98(2.72)65.20(2.81)72.26(2.63)65.57(2.83)74.73(3.40)64.83(3.62)51272.76(2.78)56.05(2.85)71.99(2.74)58.94(2.94)73.68(3.60)60.40(3.70)0.83271.10(2.77)70.84(2.76)70.31(2.78)70.91(2.71)73.12(3.60)71.89(3.50)12872.79(2.64)70.93(2.63)72.08(2.62)72.02(2.63)74.62(3.45)72.28(3.51)51272.82(2.58)68.53(2.76)72.14(2.58)70.07(2.69)74.71(3.37)72.58(3.49)1.03269.98(2.87)70.71(2.63)69.42(2.84)71.45(2.67)73.18(3.59)72.42(3.45)12871.84(2.72)71.29(2.58)71.29(2.67)72.56(2.61)73.23(3.39)72.61(3.41)51272.68(2.62)70.67(2.60)71.86(2.63)71.39(2.66)74.45(3.41)72.88(3.38)VARIATIONAL69.83(2.85)67.78(2.75)67.82(2.80)66.97(2.75)70.47(3.72)69.06(3.69)Table2:WordsegmentationaccuracymeasuredbywordtokenF1scoresandnegativelog-likelihoodonheld-outtestdatasetinthebrackets(lowerthebetter,onthescaleof106)forourONLINEmodelagainstMCMCapproach(Johnsonetal.,2006)onvariousdatasetusingtheunigramandcollocationgrammar.740506070801e+031e+041e+051e+061e+07# of inside−outside function callsf−1 scoremodelmcmconlinevariational(a)unigramgrammar.3040506070801e+031e+041e+051e+061e+07# of inside−outside function callsf−1 scoremodelmcmconlinevariational(b)collocationgrammar.Figure2:WordsegmentationaccuracymeasuredbywordtokenF1scoresonbrentcorpusofthreeapproachesagainstnumberofinside-outsidefunctioncallusingunigram(upper)andcollo-cation(lower)grammarsinTable1.66OurONLINEsettingsarebatchsizeB=20,decayinertiaτ=128,decayrateκ=0.6forunigramgrammar;andmini-batchsizeB=5,decayinertiaτ=256,decayrateκ=0.8forcollocationgrammar.TNGsarerefinedatintervalu=50.TruncationsizeissettoKWord=1.5kandKColloc=3k.Thesettingsarechosenfromcrossvalidation.Weobservesimi-larbehaviorunderκ={0.7,0.9,1.0},τ={32,64,512},B={10,50}andu={10,20,100}.7ForONLINEinference,weparallelizeeachminibatchwithfourthreadswithsettings:batchsizeB=100andTNGrefine-mentintervalu=100.ONLINEapproachrunnsfortwopassesoverdatasets.VARIATIONALrunsfiftyiterations,withthesametruncationlevelasinONLINE.Fornegativelog-likelihoodeval-uation,wetrainthemodelonarandom70%ofthedata,andholdouttherestfortesting.Weobservesimilarbehaviorfor5.1WordSegmentationWeevaluateouronlineadaptorgrammaronthetaskofwordsegmentation,whichfocusesonidentify-ingwordboundariesfromasequenceofcharacters.ThisisespeciallythecaseforChinese,sincechar-actersarewritteninsequencewithoutwordbound-aries.Wefirstevaluateallthreemodelsonthestan-dardBrentversionoftheBernstein-Ratnercor-pus(Bernstein-Ratner,1987;BrentandCartwright,1996,brent).Thedatasetcontains10ksentences,1.3kdistinctwords,and72distinctcharacters.Wecomparetheresultsonbothunigramandcolloca-tiongrammarsintroducedinJohnsonandGoldwater(2009)aslistedinTable1.Figure2illustratesthewordsegmentationac-curacyintermsofwordtokenF1-scoresonbrentagainstthenumberofinside-outsidefunctioncallsforallthreeapproachesusingunigramandcolloca-tiongrammars.Inbothcases,ourONLINEapproachconvergesfasterthanMCMCandVARIATIONALap-proaches,yetyieldscomparableorbetterperfor-mancewhenseeingmoredata.Inadditiontothebrentcorpus,wealsoevalu-atethreeapproachesonthreeotherChinesedatasetscompiledbyXueetal.(2005)andEmerson(2005):8•ChineseTreebank7.0(ctb7):162ksentences,57kdistinctwords,4.5kdistinctcharacters;ourmodelunderκ={0.7,0.9}andτ={64,256}.8Weuseallpunctuationasnaturaldelimiters(i.e.,wordscannotcrosspunctuation).

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u

/
t

a
c
l
/

l

a
r
t
i
c
e

p
d

f
/

d
o

i
/

.

1
0
1
1
6
2

/
t

l

a
c
_
a
_
0
0
1
9
6
1
5
6
6
9
0
9

/

/
t

l

a
c
_
a
_
0
0
1
9
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3

473

•PekingUniversity(pku):183ksentences,53kdistinctwords,4.6kdistinctcharacters;and•CityUniversityofHongKong(cityu):207ksentences,64kdistinctwords,and5kdistinctcharacters.WecompareourinferencemethodagainstotherapproachesonF1score.Whileotherunsupervisedwordsegmentationsystemsareavailable(Mochi-hashietal.(2009),interalia),9ourfocusisonadi-rectcomparisonofinferencetechniquesforadaptorgrammar,whichachievecompetitive(ifnotstate-of-the-art)performance.Table2showsthewordtokenF1-scoresandneg-ativelikelihoodonheld-outtestdatasetofourmodelagainstMCMCandVARIATIONAL.Werandomlysample30%ofthedatafortestingandtherestfortraining.Wecomputetheheld-outlikelihoodofthemostlikelysampledparsetreesoutofeachmodel.10OurONLINEapproachconsistentlybettersegmentswordsthanVARIATIONALandachievescomparableorbetterresultsthanMCMC.ForMCMC,JohnsonandGoldwater(2009)showthatincrementalinitialization—oronlineupdatesingeneral—resultsinmoreaccuratewordsegmenta-tion,eventhoughthetreeshavelowerposteriorprobability.Similarly,ourONLINEapproachinitial-izesandlearnsthemonthefly,insteadofinitializingthegrammatonsandparsetreesforalldataupfrontasforVARIATIONAL.Thisuniformlyoutperformsbatchinitializationonthewordsegmentationtasks.5.2InfiniteVocabularyTopicModelingTopicmodelsoftencanbereplicatedusingacare-fullycraftedPCFG(Johnson,2010).Thesepow-erfulextensionscancapturetopicalcollocationsandstickytopics;theseembelishmentscouldfur-therimproveNLPapplicationsofsimpleunigramtopicmodelssuchaswordsensedisambigua-tion(Boyd-GraberandBlei,2007),partofspeech9Theirresultsarenotdirectlycomparable:theyusedifferentsubsetsandassumedifferentpreprocessing.10Notethatthisisonlyanapproximationtothetrueheld-outlikelihood,sinceitisimpossibletoenumerateallthepossibleparsetreesandhencecomputethelikelihoodforagivensen-tenceunderthemodel.11Wetrainallmodelswith5topicswithsettings:TNGre-finementintervalu=100,truncationsizeKTopic=3k,andthemini-batchsizeB=50.Weobserveasimilarbehaviorunderκ∈{0.7,0.9}andτ∈{64,256}.321285125060705060705060700.60.81110100100011010010001101001000# of passes over the datasetpmiinferenceinfvocmcmconlinevariational⌧:⌧:⌧:706050706050706050321285125060705060705060700.60.81110100100011010010001101001000# of passes over the datasetpmiinferenceinfvocmcmconlinevariationalcoherenceFigure3:Theaveragecoherencescoreoftopicsonde-newsdatasetsagainstINFVOCapproachandotherinferencetech-niques(MCMC,VARIATIONAL)underdifferentsettingsofde-cayrateκanddecayinertiaτusingtheInfVocLDAgrammarinTable1.Thehorizontalaxisshowsthenumberofpassesovertheentiredataset.11tagging(ToutanovaandJohnson,2008)ordialoguemodeling(ZhaiandWilliams,2014).However,ex-pressingtopicmodelsinadaptorgrammarsismuchslowerthantraditionaltopicmodels,forwhichfastonlineinference(Hoffmanetal.,2010)isavailable.ZhaiandBoyd-Graber(2013)arguethatonlineinferenceandtopicmodelsviolateafundamentalassumptioninonlinealgorithms:newwordsareintroducedasmoredataarestreamedtothealgo-rithm.ZhaiandBoyd-Graber(2013)thenintroduceaninferenceframework,INFVOC,todiscoverwordsfromaDirichletprocesswithacharactern-grambasedistribution.Weshowthattheircomplicatedmodelandon-lineinferencecanbecapturedandextendedviaanappropriatePCFGgrammarandouronlineadap-torgrammarinferencealgorithm.OurextensiontoINFVOCgeneralizestheirstaticcharactern-grammodel,learningthebasedistribution(i.e.,howwordsarecomposedfromcharacters)fromdata.Incontrast,theirbasedistributionwaslearnedfromadictionaryasapreprocessingstepandheldfixed.Thisisanattractivetestbedforouronlineinfer-ence.Withinatopic,wecanverifythatthewordswediscoverarerelevanttothetopicandthatnewwordsriseinimportanceinthetopicovertimeiftheyarerelevant.Fortheseexperiments,wetreateachtoken(withitsassociateddocumentpseudo-word−j)asasinglesentence,andeachminibatchcontainsonlyonesentence(token).12TheplotisgeneratedwithtruncationsizeKTopic=2k,mini-batchsizeB=1,truncationpruningintervalu=50,decayinertiaτ=256,anddecayrateκ=0.8.AllPYhyper-parametersareoptimized.

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u

/
t

a
c
l
/

l

a
r
t
i
c
e

p
d

f
/

d
o

i
/

.

1
0
1
1
6
2

/
t

l

a
c
_
a
_
0
0
1
9
6
1
5
6
6
9
0
9

/

/
t

l

a
c
_
a
_
0
0
1
9
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3

474

new words added at corresponding minibatchminibatch-3k…2-union3-wage…16-minist…18-year…21-bill…32-increas33-tax…48-reform…58-lower…82-percent…95-committe…180-pension…minibatch-8k1-year2-minist3-tax4-pension5-reform…10-committe…12-percent…16-lower…19-increas…25-bill…42-union43-wage…181-schroeder…436-deduct…minibatch-19k1-deduct2-tax3-year4-pension5-reform…7-minist…9-increas…11-committe…13-schroeder14-percent…17-lower…23-bill…49-union…92-wage…minibatch-20k1-tax2-year3-reform4-pension5-minist6-increas…9-schroeder…11-committe…19-percent…31-lower…49-bill…51-union…53-deduct…127-wage…minibatch-10k1-tax2-reform3-pension4-year5-minist6-increas…8-lower…13-percent…30-committe…33-bill…106-wage…115-union…120-schroeder…530-deduct…minibatch-1k1-reform…5-increas…10-union…13-wage…47-percent…53-year…67-tax…70-minist…108-bill…164-lowerpensioncommitte…schroederaffair…minibatch-4k…4-percent5-tax6-reform…12-year13-increas…16-wage…19-minist…22-union…49-lower…82-schroeder…90-bill…106-committe…229-pension…deductshop…recess…primarili…minibatch-15k1-tax2-schroeder3-year4-reform5-minist6-pension…8-increas…13-lower…16-percent17-committe…20-union…235-bill…272-wage…306-deduct…recipi…minibatch-17k1-tax2-year3-reform4-schroeder5-increas6-minist…9-pension…11-percent…15-lower…19-bill…28-committe…51-union…78-wage…382-deduct…alloc…club…Figure4:Theevolutionofonetopic—concerningtaxpolicy—outoffivetopicslearnedusingonlineadaptorgrammarinferenceonthede-newsdataset.Eachminibatchrepresentsawordprocessedbythisonlinealgorithm;timeprogressesfromlefttoright.Asthealgorithmencountersnewwords(bottom)theycanmaketheirwayintothetopic.Thenumbersnexttowordsrepresenttheiroverallrankinthetopic.Forexample,theword“pension”firstappearedinmini-batch100,wasrankedat229afterminibatch400andbecameoneofthetop10wordsinthistopicafter2000minibatches(tokens).12Quantitatively,weevaluatethreedifferentinfer-enceschemesandtheINFVOCapproach13onacol-lectionofEnglishdailynewssnippets(de-news).14WeusedtheInfVocLDAgrammar(Table1).Forallapproaches,wetrainthemodelwithfivetopics,andevaluatetopiccoherence(Newmanetal.,2009),whichcorrelateswellwithhumanratingsoftopicinterpretability(Changetal.,2009).Wecollecttheco-occurrencecountsfromWikipediaandcomputetheaveragepairwisepointwisemutualinformation(PMI)scorebetweenthetop10rankedwordsofev-erytopic.Figure3illustratesthePMIscore,andourapproachyieldscomparableorbetterresultsagainstallotherapproachesundermostconditions.Qualitatively,Figure4showsanexampleofatopicevolutionusingonlineadaptorgrammarforthede-newsdataset.Thetopicisabout“taxpol-icy”.Thetopicimprovesovertime;wordslike“year”,“tax”and“minist(er)”becomemorepromi-nent.Moreimportantly,theonlineapproachdiscov-ersnewwordsandincorporatesthemintothetopic.13Availableathttp://www.umiacs.umd.edu/˜zhaike/.14Thede-newsdatasetisrandomlyselectedsubsetof2.2kEnglishdocumentsfromhttp://homepages.inf.ed.ac.uk/pkoehn/publications/de-news/.Itcontains6.5kuniquetypesandover200kwordtokens.TokenizationandstemmingprovidedbyNLTK(Birdetal.,2009).Forexample,“schroeder”(formerGermanchancel-lor)firstappearedinminibatch300,wassuccess-fullypickedupbyourmodel,andbecameoneofthetoprankedwordsinthetopic.6ConclusionProbabilisticmodelingisausefultoolinunderstand-ingunstructureddataordatawherethestructureislatent,likelanguage.However,developingthesemodelsisoftenadifficultprocess,requiringsignifi-cantmachinelearningexpertise.Adaptorgrammarsofferaflexibleandquickwaytoprototypeandtestnewmodels.Despiteex-pensiveinference,theyhavebeenusedfortopicmodeling(Johnson,2010),discoveringperspec-tive(Hardistyetal.,2010),segmentation(JohnsonandGoldwater,2009),andgrammarinduction(Co-henetal.,2010).Wehavepresentedanewonline,hybridinferenceschemeforadaptorgrammars.Unlikepreviousap-proaches,itdoesnotrequireextensivepreprocess-ing.Itisalsoabletofasterdiscoverusefulstructureintext;withfurtherdevelopment,thesealgorithmscouldfurtherspeedthedevelopmentandapplicationofnewnonparametricmodelstolargedatasets.

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u

/
t

a
c
l
/

l

a
r
t
i
c
e

p
d

f
/

d
o

i
/

.

1
0
1
1
6
2

/
t

l

a
c
_
a
_
0
0
1
9
6
1
5
6
6
9
0
9

/

/
t

l

a
c
_
a
_
0
0
1
9
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3

475

AcknowledgmentsWewouldliketothanktheanonymousreviewers,KristinaToutanova,MarkJohnson,andKeWuforinsightfuldiscussions.ThisworkwassupportedbyNSFGrantCCF-1018625.Boyd-GraberisalsosupportedbyNSFGrantIIS-1320538.Anyopin-ions,findings,conclusions,orrecommendationsex-pressedherearethoseoftheauthorsanddonotnec-essarilyreflecttheviewofthesponsor.ReferencesNanBernstein-Ratner.1987.Thephonologyofparentchildspeech.Children’slanguage,6:159–174.StevenBird,EwanKlein,andEdwardLoper.2009.Nat-uralLanguageProcessingwithPython.O’ReillyMe-dia.ChristopherM.Bishop.2006.PatternRecognitionandMachineLearning.Springer-VerlagNewYork,Inc.,Secaucus,NJ,USA.DavidM.BleiandMichaelI.Jordan.2005.VariationalinferenceforDirichletprocessmixtures.JournalofBayesianAnalysis,1(1):121–144.BenjaminB¨orschingerandMarkJohnson.2012.Usingrejuvenationtoimproveparticlefilteringforbayesianwordsegmentation.InProceedingsoftheAssociationforComputationalLinguistics.L´eonBottou.1998.Onlinealgorithmsandstochasticapproximations.InOnlineLearningandNeuralNet-works.CambridgeUniversityPress,Cambridge,UK.JordanBoyd-GraberandDavidM.Blei.2007.PUTOP:TurningpredominantsensesintoatopicmodelforWSD.In4thInternationalWorkshoponSemanticEvaluations.MichaelR.BrentandTimothyA.Cartwright.1996.Dis-tributionalregularityandphonotacticconstraintsareusefulforsegmentation.volume61,pages93–125.JonathanChang,JordanBoyd-Graber,andDavidM.Blei.2009.Connectionsbetweenthelines:Augment-ingsocialnetworkswithtext.InKnowledgeDiscoveryandDataMining.Jean-C´edricChappelierandMartinRajman.2000.Monte-CarlosamplingforNP-hardmaximizationproblemsintheframeworkofweightedparsing.InNaturalLanguageProcessing,pages106–117.ShayB.Cohen,DavidM.Blei,andNoahA.Smith.2010.Variationalinferenceforadaptorgrammars.InConferenceoftheNorthAmericanChapteroftheAs-sociationforComputationalLinguistics.ShayB.Cohen.2011.ComputationalLearningofProb-abilisticGrammarsintheUnsupervisedSetting.Ph.D.thesis,CarnegieMellonUniversity.ThomasEmerson.2005.Thesecondinternationalchi-nesewordsegmentationbakeoff.InFourthSIGHANWorkshoponChineseLanguage,Jeju,Korea.ThomasS.Ferguson.1973.ABayesiananalysisofsomenonparametricproblems.TheAnnalsofStatis-tics,1(2).SharonGoldwater,ThomasL.Griffiths,andMarkJohn-son.2011.Producingpower-lawdistributionsanddampingwordfrequencieswithtwo-stagelanguagemodels.JournalofMachineLearningResearch,pages2335–2382,July.EricHardisty,JordanBoyd-Graber,andPhilipResnik.2010.Modelingperspectiveusingadaptorgrammars.InProceedingsofEmpericalMethodsinNaturalLan-guageProcessing.MatthewHoffman,DavidM.Blei,andFrancisBach.2010.OnlinelearningforlatentDirichletallocation.InProceedingsofAdvancesinNeuralInformationProcessingSystems.MatthewHoffman,DavidM.Blei,ChongWang,andJohnPaisley.2013.Stochasticvariationalinference.InJournalofMachineLearningResearch.MarkJohnsonandSharonGoldwater.2009.ImprovingnonparametericBayesianinference:experimentsonunsupervisedwordsegmentationwithadaptorgram-mars.InConferenceoftheNorthAmericanChapteroftheAssociationforComputationalLinguistics.MarkJohnson,ThomasL.Griffiths,andSharonGoldwa-ter.2006.Adaptorgrammars:Aframeworkforspeci-fyingcompositionalnonparametricBayesianmodels.InProceedingsofAdvancesinNeuralInformationProcessingSystems.MarkJohnson,ThomasL.Griffiths,andSharonGoldwa-ter.2007.BayesianinferenceforPCFGsviaMarkovchainMonteCarlo.InConferenceoftheNorthAmeri-canChapteroftheAssociationforComputationalLin-guistics.MarkJohnson.2010.PCFGs,topicmodels,adaptorgrammarsandlearningtopicalcollocationsandthestructureofpropernames.InProceedingsoftheAs-sociationforComputationalLinguistics.KenichiKurihara,MaxWelling,andYeeWhyeTeh.2007.CollapsedvariationalDirichletprocessmixturemodels.InInternationalJointConferenceonArtifi-cialIntelligence.ChristopherD.ManningandHinrichSch¨utze.1999.FoundationsofStatisticalNaturalLanguageProcess-ing.TheMITPress,Cambridge,MA.DavidMimno,MatthewHoffman,andDavidBlei.2012.SparsestochasticinferenceforlatentDirichletalloca-tion.InProceedingsoftheInternationalConferenceofMachineLearning.

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u

/
t

a
c
l
/

l

a
r
t
i
c
e

p
d

f
/

d
o

i
/

.

1
0
1
1
6
2

/
t

l

a
c
_
a
_
0
0
1
9
6
1
5
6
6
9
0
9

/

/
t

l

a
c
_
a
_
0
0
1
9
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3

476

DaichiMochihashi,TakeshiYamada,andNaonoriUeda.2009.Bayesianunsupervisedwordsegmentationwithnestedpitman-yorlanguagemodeling.InProceedingsoftheAssociationforComputationalLinguistics.PeterM¨ullerandFernandoA.Quintana.2004.Non-parametricBayesiandataanalysis.StatisticalScience,19(1).RameshNallapati,WilliamCohen,andJohnLafferty.2007.ParallelizedvariationalEMforlatentDirichletallocation:Anexperimentalevaluationofspeedandscalability.InICDMW.DavidNewman,SarvnazKarimi,andLawrenceCave-don.2009.Externalevaluationoftopicmodels.InProceedingsoftheAurstralasianDocumentComput-ingSymposium.J.PitmanandM.Yor.1997.Thetwo-parameterPoisson-Dirichletdistributionderivedfromastablesubordina-tor.AnnalsofProbability,25(2):855–900.HiroyukiShindo,YusukeMiyao,AkinoriFujino,andMasaakiNagata.2012.Bayesiansymbol-refinedtreesubstitutiongrammarsforsyntacticparsing.InPro-ceedingsoftheAssociationforComputationalLin-guistics.ErikB.SudderthandMichaelI.Jordan.2008.Sharedsegmentationofnaturalscenesusingdepen-dentPitman-Yorprocesses.InProceedingsofAd-vancesinNeuralInformationProcessingSystems.YeeWhyeTeh,MichaelI.Jordan,MatthewJ.Beal,andDavidM.Blei.2006.HierarchicalDirichletpro-cesses.JournaloftheAmericanStatisticalAssocia-tion,101(476):1566–1581.KristinaToutanovaandMarkJohnson.2008.ABayesianLDA-basedmodelforsemi-supervisedpart-of-speechtagging.InProceedingsofAdvancesinNeuralInformationProcessingSystems,pages1521–1528.MartinJ.WainwrightandMichaelI.Jordan.2008.Graphicalmodels,exponentialfamilies,andvaria-tionalinference.FoundationsandTrendsinMachineLearning,1(1–2):1–305.ChongWangandDavidM.Blei.2012.Truncation-freeonlinevariationalinferenceforBayesiannonparamet-ricmodels.InProceedingsofAdvancesinNeuralIn-formationProcessingSystems.NaiwenXue,FeiXia,Fu-dongChiou,andMartaPalmer.2005.ThePennChineseTreeBank:Phrasestructureannotationofalargecorpus.NaturalLanguageEngi-neering.LiminYao,DavidMimno,andAndrewMcCallum.2009.Efficientmethodsfortopicmodelinferenceonstreamingdocumentcollections.InKnowledgeDis-coveryandDataMining.KeZhaiandJordanBoyd-Graber.2013.OnlinelatentDirichletallocationwithinfinitevocabulary.InPro-ceedingsoftheInternationalConferenceofMachineLearning.KeZhaiandJasonD.Williams.2014.Discoveringlatentstructureintask-orienteddialogues.InProceedingsoftheAssociationforComputationalLinguistics.KeZhai,JordanBoyd-Graber,NimaAsadi,andMo-hamadAlkhouja.2012.Mr.LDA:Aflexiblelargescaletopicmodelingpackageusingvariationalinfer-enceinmapreduce.InProceedingsofWorldWideWebConference.Transactions of the Association for Computational Linguistics, 2 (2014) 465–476. Action Editor: Kristina Toutanova. image

Download pdf