Transactions of the Association for Computational Linguistics, 2 (2014) 181–192. Action Editor: Eric Fosler-Lussier.


Submitted 10/2013; Revised 2/2014; Published 4/2014. © 2014 Association for Computational Linguistics.

Dynamic Language Models for Streaming Text

Dani Yogatama∗  Chong Wang∗  Bryan R. Routledge†  Noah A. Smith∗  Eric P. Xing∗
∗School of Computer Science  †Tepper School of Business
Carnegie Mellon University, Pittsburgh, PA 15213, USA
∗{dyogatama,chongw,nasmith,epxing}@cs.cmu.edu, †routledge@cmu.edu

Abstract

We present a probabilistic language model that captures temporal dynamics and conditions on arbitrary non-linguistic context features. These context features serve as important indicators of language changes that are otherwise difficult to capture using text data by itself. We learn our model in an efficient online fashion that is scalable for large, streaming data. With five streaming datasets from two different genres—economics news articles and social media—we evaluate our model on the task of sequential language modeling. Our model consistently outperforms competing models.

1 Introduction

Language models are a key component in many NLP applications, such as machine translation and exploratory corpus analysis. Language models are typically assumed to be static—the word-given-context distributions do not change over time. Examples include n-gram models (Jelinek, 1997) and probabilistic topic models like latent Dirichlet allocation (Blei et al., 2003); we use the term "language model" to refer broadly to probabilistic models of text.

Recently, streaming datasets (e.g., social media) have attracted much interest in NLP. Since such data evolve rapidly based on events in the real world, assuming a static language model becomes unrealistic. In general, more data is seen as better, but treating all past data equally runs the risk of distracting a model with irrelevant evidence. On the other hand, cautiously using only the most recent data risks overfitting to short-term trends and missing important time-insensitive effects (Blei and Lafferty, 2006; Wang et al., 2008). Thus, in this paper, we take steps toward methods for capturing long-range temporal dynamics in language use.

Our model also exploits observable context variables to capture temporal variation that is otherwise difficult to capture using only text. Specifically, for the applications we consider, we use stock market data as exogenous evidence on which the language model depends. For example, when an important company's price moves suddenly, the language model should be based not on the very recent history, but should be similar to the language model for a day when a similar change happened, since people are likely to say similar things (either about that company, or about conditions relevant to the change). Non-linguistic contexts such as stock price changes provide useful auxiliary information that might indicate the similarity of language models across different time steps.

We also turn to a fully online learning framework (Cesa-Bianchi and Lugosi, 2006) to deal with non-stationarity and dynamics in the data that necessitate adaptation of the model to data in real time. In online learning, streaming examples are processed only when they arrive. Online learning also eliminates the need to store large amounts of data in memory. Strictly speaking, online learning is distinct from stochastic learning, which for language models built on massive datasets has been explored by Hoffman et al. (2013) and Wang et al. (2011); those techniques are still for static modeling. Language modeling for streaming datasets in the context of machine translation was considered by Levenberg and Osborne (2009) and Levenberg et al. (2010). Goyal et al. (2009) introduced a streaming algorithm for large-scale language modeling by approximating n-gram frequency counts. We propose a general online learning algorithm for language modeling that draws inspiration from regret minimization in sequential predictions (Cesa-Bianchi and Lugosi, 2006) and online variational algorithms (Sato, 2001; Honkela and Valpola, 2003).


To our knowledge, our model is the first to bring together temporal dynamics, conditioning on non-linguistic context, and scalable online learning suitable for streaming data and extensible to include topics and n-gram histories. The main idea of our model is independent of the choice of the base language model (e.g., unigrams, bigrams, topic models, etc.). In this paper, we focus on unigram and bigram language models in order to evaluate the basic idea on well understood models, and to show how it can be extended to higher-order n-grams. We leave extensions to topic models for future work.

We propose a novel task to evaluate our proposed language model. The task is to predict economics-related text at a given time, taking into account the changes in stock prices up to the corresponding day. This can be seen as an inverse of the setup considered by Lavrenko et al. (2000), where news is assumed to influence stock prices. We evaluate our model on economics news in various languages (English, German, and French), as well as Twitter data.

2 Background

In this section, we first discuss the background for sequential predictions, then describe how to formulate online language modeling as sequential predictions.

2.1 Sequential Predictions

Let w_1, w_2, ..., w_T be a sequence of response variables, revealed one at a time. The goal is to design a good learner to predict the next response, given previous responses and additional evidence, which we denote by x_t ∈ R^M (at time t). Throughout this paper, we use the term features for x. Specifically, at each round t, the learner receives x_t and makes a prediction ŵ_t by choosing a parameter vector α_t ∈ R^M. In this paper, we refer to α as feature coefficients.

There has been an enormous amount of work on online learning for sequential predictions, much of it building on convex optimization. For a sequence of loss functions ℓ_1, ℓ_2, ..., ℓ_T (parameterized by α), an online learning algorithm is a strategy to minimize the regret with respect to the best fixed α* in hindsight.[1] Regret guarantees assume a Lipschitz condition on the loss function ℓ that can be prohibitive for complex models. See Cesa-Bianchi and Lugosi (2006), Rakhlin (2009), Bubeck (2011), and Shalev-Shwartz (2012) for in-depth discussion and review.

[1] Formally, the regret is defined as $\text{Regret}_T(\alpha^*) = \sum_{t=1}^{T} \ell_t(x_t, \alpha_t, w_t) - \inf_{\alpha^*} \sum_{t=1}^{T} \ell_t(x_t, \alpha^*, w_t)$.

There has also been work on online and stochastic learning for Bayesian models (Sato, 2001; Honkela and Valpola, 2003; Hoffman et al., 2013), based on variational inference. The goal is to approximate posterior distributions of latent variables when examples arrive one at a time. In this paper, we will use both kinds of techniques to learn language models for streaming datasets.

2.2 Problem Formulation

Consider an online language modeling problem, in the spirit of sequential predictions. The task is to build a language model that accurately predicts the texts generated on day t, conditioned on observable features up to day t, x_{1:t}. Every day, after the model makes a prediction, the actual texts w_t are revealed and we suffer a loss. The loss is defined as the negative log likelihood of the model, $\ell_t = -\log p(w_t \mid \alpha, \beta_{1:t-1}, x_{1:t-1}, n_{1:t-1})$, where α and β_{1:T} are the model parameters and n is a background distribution (details are given in §3.2). We can then update the model and proceed to day t+1. Notice the similarity to the sequential prediction described above. Importantly, this is a realistic setup for building evolving language models from large-scale streaming datasets.
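The protocol of §2.2 amounts to a simple daily loop. The sketch below (Python) is purely illustrative; the model object and its predict/update methods are hypothetical stand-ins for the model and learning procedure described in §3 and §4, not an API from the paper.

import numpy as np

def run_stream(model, stream):
    """Online language modeling loop (Section 2.2).
    `stream` yields (x_t, w_t): the day's feature vector and word-count vector."""
    losses = []
    for x_t, w_t in stream:
        p_t = model.predict(x_t)                  # predictive distribution over the vocabulary
        losses.append(-np.dot(w_t, np.log(p_t)))  # loss: negative log-likelihood of day t's text
        model.update(x_t, w_t)                    # adapt the model before day t+1
    return losses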


3 Model

3.1 Notation

We index time steps by t ∈ {1, ..., T} and word types by v ∈ {1, ..., V}; both are always given as subscripts. We denote vectors in boldface and use 1:T as a shorthand for {1, 2, ..., T}. We assume words of the form {w_t}_{t=1}^{T} for w_t ∈ R^V, which is the vector of word frequencies at time step t. Non-linguistic context features are {x_t}_{t=1}^{T} for x_t ∈ R^M. The goal is to learn the parameters α and β_{1:T}, which will be described in detail next.

3.2 Generative Story

The main idea of our model is illustrated by the following generative story for the unigram language model. (We will discuss the extension to higher-order language models later.) A graphical representation of our proposed model is given in Figure 1.

1. Draw feature coefficients α ∼ N(0, λI).[2] Here α is a vector in R^M, where M is the dimensionality of the feature vector.

2. For each time step t:
   (a) Observe non-linguistic context features x_t.
   (b) Draw
       $\beta_t \sim \mathcal{N}\!\left(\sum_{k=1}^{t-1} \frac{\delta_k \exp(\alpha^\top f(x_t, x_k))}{\sum_{j=1}^{t-1} \delta_j \exp(\alpha^\top f(x_t, x_j))}\, \beta_k,\ \varphi I\right).$
       Here, β_t is a vector in R^V, where V is the size of the word vocabulary, ϕ is the variance parameter, and δ_k is a fixed hyperparameter; we discuss them below.
   (c) For each word w_{t,v}, draw
       $w_{t,v} \sim \text{Categorical}\!\left(\frac{\exp(n_{1:t-1,v} + \beta_{t,v})}{\sum_{j \in V} \exp(n_{1:t-1,j} + \beta_{t,j})}\right).$

[2] Feature coefficients α can also be drawn from other distributions such as α ∼ Laplace(0, λ).

In the last step, β_t and n are mapped to the V-dimensional simplex, forming a distribution over words. n_{1:t−1} ∈ R^V is a background (log) distribution, inspired by a similar idea in Eisenstein et al. (2011). In this paper, we set n_{1:t−1,v} to be the log-frequency of v up to time t−1. We can interpret β as a time-dependent deviation from the background log-frequencies that incorporates world-context. This deviation comes in the form of a weighted average of earlier deviation vectors.

The intuition behind the model is that the probability of a word appearing on day t depends on the background log-frequencies, the deviation coefficients of the word at previous time steps β_{1:t−1}, and the similarity of current conditions of the world (based on observable features x) to previous time steps through f(x_t, x_k). That is, f is a function that takes d-dimensional feature vectors at two time steps, x_t and x_k, and returns a similarity vector f(x_t, x_k) ∈ R^M (see §6.1.1 for an example of f that we use in our experiments). The similarity is parameterized by α and decays over time with rate δ_k. In this work, we assume a fixed window size c (i.e., we consider the c most recent time steps), so that δ_{1:t−c−1} = 0 and δ_{t−c:t−1} = 1. This allows up to c-th order dependencies.[3] Setting δ this way allows us to bound the number of past vectors β that need to be kept in memory. We set β_0 to 0.

[3] In online Bayesian learning, it is known that forgetting inaccurate estimates from earlier time steps is important (Sato, 2001; Honkela and Valpola, 2003). Since we set δ_{1:t−c−1} = 0, at every time step t, δ_k leads to forgetting older examples.

Figure 1: Graphical representation of the model. The subscript indices q, r, s are shorthands for the previous time steps t−3, t−2, t−1. Only four time steps are shown here. There are arrows from previous β_{t−4}, β_{t−5}, ..., β_{t−c} to β_t, where c is the window size as described in §3.2. They are not shown here, for readability.

Although the generative story described above is for unigram language models, extensions can be made to more complex models (e.g., mixture of unigrams, topic models, etc.) and to longer n-gram contexts. In the case of topic models, the model will be related to dynamic topic models (Blei and Lafferty, 2006) augmented by context features, and the learning procedure in §4 can be used to perform online learning of dynamic topic models. However, our model captures longer-range dependencies than dynamic topic models, and can condition on non-linguistic features or metadata. In the case of higher-order n-grams, one simple way is to draw more β, one for each history. For example, for a bigram model, β is in R^{V²}, rather than R^V as in the unigram model. We consider both unigram and bigram language models in our experiments in §6. However, the main idea presented in this paper is largely independent of the base model.

Related work. Mimno and McCallum (2008) and Eisenstein et al. (2010) similarly conditioned text on observable features (e.g., author, publication venue, geography, and other document-level metadata), but conducted inference in a batch setting, thus their approaches are not suitable for streaming data. It is not immediately clear how to generalize their approaches to dynamic settings. Algorithmically, our work comes closest to the online dynamic topic model of Iwata et al. (2010), except that we also incorporate context features.
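To make the generative story of §3.2 concrete, here is a small simulation sketch of the unigram model. It assumes a toy similarity function and toy dimensions; the names are illustrative and nothing here is taken from the authors' implementation.

import numpy as np

rng = np.random.default_rng(0)
V, M, c, lam, phi = 50, 4, 7, 0.5, 1.0   # toy sizes: vocab, feature dim, window, prior variances

alpha = rng.normal(0.0, np.sqrt(lam), size=M)        # step 1: draw feature coefficients

def f(x_t, x_k):
    # Toy similarity features: 1 where the two context vectors move in the same direction.
    return (np.sign(x_t) == np.sign(x_k)).astype(float)

def draw_beta_t(betas_past, x_t, xs_past):
    # Step 2(b): beta_t is Gaussian around a similarity-weighted average of past deviations.
    scores = np.array([alpha @ f(x_t, x_k) for x_k in xs_past])
    w = np.exp(scores - scores.max()); w /= w.sum()
    mean = np.tensordot(w, np.stack(betas_past), axes=1)
    return rng.normal(mean, np.sqrt(phi))

def draw_words(beta_t, n_bg, n_tokens):
    # Step 2(c): softmax of background log-frequencies plus deviation, then sample word counts.
    logits = n_bg + beta_t
    p = np.exp(logits - logits.max()); p /= p.sum()
    return rng.multinomial(n_tokens, p)

# Tiny simulation: a window of c previous days, then one new day.
xs = [rng.normal(size=M) for _ in range(c + 1)]
betas = [np.zeros(V)]                                  # beta_0 = 0
n_bg = np.zeros(V)                                     # flat background log-frequencies
for t in range(1, c + 1):
    betas.append(draw_beta_t(betas[-c:], xs[t], xs[max(0, t - c):t]))
w_counts = draw_words(betas[-1], n_bg, n_tokens=200)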


4 Learning and Inference

The goal of the learning procedure is to minimize the overall negative log likelihood,
$-\log L(\mathcal{D}) = -\log \int d\beta_{1:T}\; p(\beta_{1:T} \mid \alpha, x_{1:T})\, p(w_{1:T} \mid \beta_{1:T}, n).$
However, this quantity is intractable. Instead, we derive an upper bound for this quantity and minimize that upper bound. Using Jensen's inequality, the variational upper bound on the negative log likelihood is:
$-\log L(\mathcal{D}) \le -\int d\beta_{1:T}\; q(\beta_{1:T} \mid \gamma_{1:T}) \log \frac{p(\beta_{1:T} \mid \alpha, x_{1:T})\, p(w_{1:T} \mid \beta_{1:T}, n)}{q(\beta_{1:T} \mid \gamma_{1:T})}. \quad (4)$
Specifically, we use mean-field variational inference, where the variables in the variational distribution q are completely independent. We use Gaussian distributions as our variational distributions for β, denoted by γ in the bound in Eq. 4. We denote the parameters of the Gaussian variational distribution for β_{t,v} (word v at time step t) by µ_{t,v} (mean) and σ_{t,v} (variance).

Figure 2 shows the functional form of the variational bound that we seek to minimize, denoted by B̂. The two main steps in the optimization of the bound are inferring β_t and updating the feature coefficients α. We next describe each step in detail.

4.1 Learning

The goal of the learning procedure is to minimize the upper bound in Figure 2 with respect to α. However, since the data arrive in an online fashion, and speed is very important for processing streaming datasets, the model needs to be updated at every time step t (in our experiments, daily). Notice that at time step t, we only have access to x_{1:t} and w_{1:t}, and we perform learning at every time step after the text for the current time step, w_t, is revealed. We do not know x_{t+1:T} and w_{t+1:T}. Nonetheless, we want to update our model so that it can make a better prediction at t+1. Therefore, we can only minimize the bound until time step t.

Let $C_k \triangleq \frac{\exp(\alpha^\top f(x_t, x_k))}{\sum_{j=t-c}^{t-1} \exp(\alpha^\top f(x_t, x_j))}$. Our learning algorithm is a variational Expectation-Maximization algorithm (Wainwright and Jordan, 2008).

E-step  Recall that we use variational inference and the variational parameters for β are µ and σ. As shown in Figure 2, since the log-sum-exp in the last term of B is problematic, we introduce additional variational parameters ζ to simplify B and obtain B̂ (Eqs. 2–3). The E-step deals with all the local variables µ, σ, and ζ. Fixing the other variables, taking the derivative of the bound B̂ w.r.t. ζ_t, and setting it to zero, we obtain the closed-form update for ζ_t:
$\zeta_t = \sum_{v \in V} \exp(n_{1:t-1,v}) \exp\!\left(\mu_{t,v} + \tfrac{\sigma_{t,v}}{2}\right).$
To minimize with respect to µ_t and σ_t, we apply gradient-based methods since there are no closed-form solutions. The derivative w.r.t. µ_{t,v} is:
$\frac{\partial \hat{B}}{\partial \mu_{t,v}} = \frac{\mu_{t,v} - \sum_{k=t-c}^{t-1} C_k \mu_{k,v}}{\varphi} - n_{t,v} + \frac{n_t}{\zeta_t} \exp(n_{1:t-1,v}) \exp\!\left(\mu_{t,v} + \tfrac{\sigma_{t,v}}{2}\right),$
where n_{t,v} is the count of word v at time step t and $n_t = \sum_{v \in V} n_{t,v}$. The derivative w.r.t. σ_{t,v} is:
$\frac{\partial \hat{B}}{\partial \sigma_{t,v}} = -\frac{1}{2\sigma_{t,v}} + \frac{1}{2\varphi} + \frac{n_t}{2\zeta_t} \exp(n_{1:t-1,v}) \exp\!\left(\mu_{t,v} + \tfrac{\sigma_{t,v}}{2}\right).$
Although we require iterative methods in the E-step, we find it to be reasonably fast in practice.[4] Specifically, we use the L-BFGS quasi-Newton algorithm (Liu and Nocedal, 1989).

[4] Approximately 16.5 seconds/day (wall time) to learn the model on the EN:NA dataset on a 2.40 GHz CPU with 24 GB memory.

We can further improve the bound by updating the variational parameters for time steps 1:t−1, i.e., µ_{1:t−1} and σ_{1:t−1}, as well. However, this will require storing the texts from previous time steps. Additionally, this will complicate the M-step update described below. Therefore, for each s < t, we do not reestimate µ_s and σ_s after they are first estimated at time step s.
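The E-step updates above can be written down directly. The sketch below is a rough rendering of the closed-form ζ update and the gradients for µ_t and σ_t as dense NumPy operations; it is an illustration only (the authors feed such gradients to L-BFGS, which is omitted here, and all names are ours).

import numpy as np

def zeta_update(n_bg, mu_t, sigma_t):
    # Closed form: zeta_t = sum_v exp(n_{1:t-1,v}) * exp(mu_{t,v} + sigma_{t,v}/2)
    return np.sum(np.exp(n_bg) * np.exp(mu_t + 0.5 * sigma_t))

def grad_mu(mu_t, sigma_t, mu_past, C, counts_t, n_bg, phi):
    # dB_hat/dmu_{t,v}; mu_past has shape (c, V), C has shape (c,), counts_t has shape (V,).
    zeta = zeta_update(n_bg, mu_t, sigma_t)
    n_t = counts_t.sum()
    prior = (mu_t - C @ mu_past) / phi
    return prior - counts_t + (n_t / zeta) * np.exp(n_bg) * np.exp(mu_t + 0.5 * sigma_t)

def grad_sigma(mu_t, sigma_t, counts_t, n_bg, phi):
    # dB_hat/dsigma_{t,v}
    zeta = zeta_update(n_bg, mu_t, sigma_t)
    n_t = counts_t.sum()
    return -0.5 / sigma_t + 0.5 / phi + (n_t / (2.0 * zeta)) * np.exp(n_bg) * np.exp(mu_t + 0.5 * sigma_t)

# Toy usage: window c = 3, vocabulary of 8 words.
rng = np.random.default_rng(0)
c, V, phi = 3, 8, 1.0
n_bg = np.zeros(V)
mu_t, sigma_t = np.zeros(V), np.ones(V)
mu_past = rng.normal(size=(c, V))
C = np.full(c, 1.0 / c)
counts_t = rng.integers(0, 5, size=V).astype(float)
g_mu = grad_mu(mu_t, sigma_t, mu_past, C, counts_t, n_bg, phi)
g_sigma = grad_sigma(mu_t, sigma_t, counts_t, n_bg, phi)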


$B = -\sum_{t=1}^{T} \mathbb{E}_q[\log p(\beta_t \mid \beta_k, \alpha, x_t)] - \sum_{t=1}^{T} \mathbb{E}_q[\log p(w_t \mid \beta_t, n_t)] - H(q) \quad (1)$

$\phantom{B} = \sum_{t=1}^{T} \left[ -\frac{1}{2}\sum_{j \in V}\log\frac{\sigma_{t,j}}{\varphi} + \mathbb{E}_q\!\left[\frac{\big(\beta_t - \sum_{k=t-c}^{t-1} C_k \beta_k\big)^2}{2\varphi}\right] - \mathbb{E}_q\!\left[\sum_{v \in w_t} n_{1:t-1,v} + \beta_{t,v} - \log\sum_{j \in V}\exp(n_{1:t-1,j} + \beta_{t,j})\right] \right] \quad (2)$

$\phantom{B} \le \sum_{t=1}^{T} \left[ -\frac{1}{2}\sum_{j \in V}\log\frac{\sigma_{t,j}}{\varphi} + \frac{\big(\mu_t - \sum_{k=t-c}^{t-1} C_k \mu_k\big)^2}{2\varphi} + \frac{\sigma_t + \sum_{k=t-c}^{t-1} C_k^2 \sigma_k}{2\varphi} - \sum_{v \in w_t}\left(\mu_{t,v} - \log\zeta_t - \frac{1}{\zeta_t}\sum_{j \in V}\exp(n_{1:t-1,j})\exp\!\big(\mu_{t,j} + \tfrac{\sigma_{t,j}}{2}\big)\right) \right] + \text{const} \quad (3)$

Figure 2: The variational bound that we seek to minimize, B. H(q) is the entropy of the variational distribution q. The derivation from line 1 to line 2 is done by replacing the probability distributions p(β_t | β_k, α, x_t) and p(w_t | β_t, n_t) by their respective functional forms. Notice that in line 3 we compute the expectations under the variational distributions and further bound B by introducing additional variational parameters ζ, using Jensen's inequality on the log-sum-exp in the last term. We denote the new bound B̂.

M-step  The M-step updates the feature coefficients α, which enter the bound B̂ through the weights $C_s = \frac{\exp(\alpha^\top f(x_t, x_s))}{\sum_{s'=t-c}^{t-1} \exp(\alpha^\top f(x_t, x_{s'}))}$ for each s ∈ {t−c, ..., t−1}. We follow the convex optimization strategy and simply perform a stochastic gradient update: $\alpha_{t+1} = \alpha_t - \eta_t \frac{\partial \hat{B}}{\partial \alpha_t}$ (Zinkevich, 2003).[5] While the variational bound B̂ is not convex, given the local variables µ_{1:t} and σ_{1:t}, optimizing α at time step t without knowing the future becomes a convex problem.[6] Since we do not reestimate µ_{1:t−1} and σ_{1:t−1} in the E-step, the choice to perform online gradient descent instead of iteratively performing batch optimization at every time step is theoretically justified. Notice that our overall learning procedure is still to minimize the variational upper bound B̂. All these choices are made to make the model suitable for learning in real time from large streaming datasets. Preliminary experiments showed that performing more than one EM iteration per day does not considerably improve performance, so in our experiments we perform one EM iteration per day.

[5] In our implementation, we augment α with a squared L2 regularization term (i.e., we assume that α is drawn from a normal distribution with mean zero and variance λ) and use the FOBOS algorithm (Duchi and Singer, 2009). The derivative of the regularization term is simple and is not shown here. Of course, other regularizers (e.g., the L1-norm, which we use for other parameters, or the L1/∞-norm) can also be explored.

[6] As a result, our algorithm is Hannan consistent w.r.t. the best fixed α (for B̂) in hindsight; i.e., the average regret goes to zero as T goes to ∞.

To learn the parameters of the model, we rely on approximations and optimize an upper bound B̂. We have opted for this approach over alternatives (such as MCMC methods) because of our interest in the online, large-data setting. Our experiments show that we are still able to learn reasonable parameter estimates by optimizing B̂. Like online variational methods for other latent-variable models such as LDA (Sato, 2001; Hoffman et al., 2013), open questions remain about the tightness of such approximations and the identifiability of model parameters. We note, however, that our model does not include latent mixtures of topics and may be generally easier to estimate.
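In the M-step, only the weights C_s depend on α, so one online round reduces to a small gradient step on the α-dependent part of B̂. The sketch below approximates that gradient with central finite differences for brevity (the paper uses the analytic gradient with FOBOS regularization); all function names and the toy usage at the end are illustrative assumptions, not the authors' code.

import numpy as np

def weights_C(alpha, x_t, xs_past, f):
    # Normalized similarity weights C_s over the window (Section 4.1).
    scores = np.array([alpha @ f(x_t, x_s) for x_s in xs_past])
    w = np.exp(scores - scores.max())
    return w / w.sum()

def alpha_objective(alpha, x_t, xs_past, f, mu_t, mu_past, sigma_t, sigma_past, phi):
    # The alpha-dependent part of B_hat at time t (the Gaussian "prior" terms in Eq. 3).
    C = weights_C(alpha, x_t, xs_past, f)
    quad = np.sum((mu_t - C @ mu_past) ** 2) / (2.0 * phi)
    var = np.sum(sigma_t + (C ** 2) @ sigma_past) / (2.0 * phi)
    return quad + var

def mstep_alpha(alpha, eta, *args):
    # One stochastic gradient step alpha <- alpha - eta * dB_hat/dalpha,
    # with the gradient approximated numerically for brevity.
    grad, eps = np.zeros_like(alpha), 1e-5
    for m in range(alpha.size):
        e = np.zeros_like(alpha); e[m] = eps
        grad[m] = (alpha_objective(alpha + e, *args) - alpha_objective(alpha - e, *args)) / (2 * eps)
    return alpha - eta * grad

# Toy usage: V = 20 words, M = 3 features, window c = 4.
rng = np.random.default_rng(1)
V, M, c, phi = 20, 3, 4, 1.0
f = lambda a, b: (np.sign(a) == np.sign(b)).astype(float)
xs = [rng.normal(size=M) for _ in range(c + 1)]
mu_past, sigma_past = rng.normal(size=(c, V)), np.full((c, V), 0.5)
mu_t, sigma_t = rng.normal(size=V), np.full(V, 0.5)
alpha = mstep_alpha(np.zeros(M), 0.1, xs[-1], xs[:-1], f, mu_t, mu_past, sigma_t, sigma_past, phi)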


5 Prediction

As described in §2.2, our model is evaluated by the loss suffered at every time step, where the loss is defined as the negative log likelihood of the model on the text at time step t, w_t. Therefore, at each time step t, we need to predict (the distribution of) w_t. In order to do this, for each word v ∈ V, we simply compute the deviation means β_{t,v} as weighted combinations of previous means, where the weights are determined by the world-context similarity encoded in x:
$\mathbb{E}_q[\beta_{t,v} \mid \mu_{t,v}] = \sum_{k=t-c}^{t-1} \frac{\exp(\alpha^\top f(x_t, x_k))}{\sum_{j=t-c}^{t-1} \exp(\alpha^\top f(x_t, x_j))}\, \mu_{k,v}.$
Recall that the word distribution that we use for prediction is obtained by applying the operator π that maps β_t and n to the V-dimensional simplex, forming a distribution over words:
$\pi(\beta_t, n_{1:t-1})_v = \frac{\exp(n_{1:t-1,v} + \beta_{t,v})}{\sum_{j \in V} \exp(n_{1:t-1,j} + \beta_{t,j})},$
where n_{1:t−1} ∈ R^V is a background distribution (n_{1:t−1,v} is the log-frequency of word v observed up to time t−1).

6 Experiments

In our experiments, we consider the problem of predicting economy-related text appearing in news and microblogs, based on observable features that reflect current economic conditions in the world at a given time. In the following, we describe our dataset in detail, then show experimental results on text prediction. In all experiments, we set the window size c = 7 (one week) or c = 14 (two weeks), λ = 1/(2|V|) (V is the size of the vocabulary of the dataset under consideration), and ϕ = 1.

6.1 Dataset

Our data contain metadata and text corpora. The metadata are used as our features, whereas the text corpora are used for learning language models and for predictions. The dataset (excluding Twitter) can be downloaded at http://www.ark.cs.cmu.edu/DynamicLM.

6.1.1 Metadata

We use end-of-day stock prices gathered from finance.yahoo.com for each stock included in the Standard & Poor's 500 index (S&P 500). The index includes large (by market value) companies listed on US stock exchanges.[7] We calculate daily (continuously compounded) returns for each stock o: r_{o,t} = log P_{o,t} − log P_{o,t−1}, where P_{o,t} is the closing stock price.[8] We make a simplifying assumption that text for day t is generated after P_{o,t} is observed.[9] In general, stocks trade Monday to Friday (except for federal holidays and natural disasters). For days when stocks do not trade, we set r_{o,t} = 0 for all stocks since any price change is not observed.

We transform returns into similarity values as follows: f(x_{o,t}, x_{o,k}) = 1 iff sign(r_{o,t}) = sign(r_{o,k}), and 0 otherwise. While this limits the model by ignoring the magnitude of price changes, it is still reasonable to capture the similarity between two days.[10] There are 500 stocks in the S&P 500, so x_t ∈ R^500 and f(x_t, x_k) ∈ R^500.

[7] For a list of companies listed in the S&P 500 as of 2012, see http://en.wikipedia.org/wiki/List_of_S%26P_500_companies. This set was fixed during the time periods of all our experiments.
[8] We use the "adjusted close" on Yahoo that includes interim dividend cash flows and also adjusts for "splits" (changes in the number of outstanding shares).
[9] This is done in order to avoid having to deal with hourly time steps. In addition, intraday price data is only available through commercial data providers.
[10] Note that daily stock returns are equally likely to be positive or negative and display little serial correlation.

6.1.2 Text Data

We have five streams of text data. The first four corpora are news streams tracked through Reuters.[11] Two of them are written in English, North American Business Report (EN:NA) and Japanese Investment News (EN:JP). The remaining two are German Economic News Service (DE, in German) and French Economic News Service (FR, in French). For all four of the Reuters streams, we collected news data over a period of thirteen months (392 days), 2012-05-26 to 2013-06-21. See Table 1 for descriptive statistics of these datasets. Numerical terms are mapped to a single word, and all letters are downcased.

[11] http://www.reuters.com

The last text stream comes from the Decahose/Gardenhose stream from Twitter. We collected public tweets that contain ticker symbols (i.e., symbols that are used to denote stocks of a particular company in a stock market), preceded by the dollar sign $ (e.g., $GOOG, $MSFT, $AAPL, etc.). These tags are generally used to indicate tweets about the stock market. We look at tweets from the period 2011-01-01 to 2012-09-30 (639 days). As a result, we have approximately 100–800 tweets per day. We tokenized the tweets using the CMU ARK TweetNLP tools;[12] numerical terms are mapped to a single word, and all letters are downcased.

[12] https://www.ark.cs.cmu.edu/TweetNLP
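Combining the prediction rule of §5 with the sign-based similarity features of §6.1.1 gives a short prediction routine. The sketch below uses made-up array names and toy sizes; it is an illustration of the formulas, not the released code.

import numpy as np

def similarity_features(returns_t, returns_k):
    # f(x_t, x_k): one indicator per stock, 1 iff the daily returns have the same sign.
    return (np.sign(returns_t) == np.sign(returns_k)).astype(float)

def predict_word_distribution(alpha, returns_t, past_returns, past_mu, n_bg):
    # E_q[beta_t] = sum_k softmax_k(alpha . f(x_t, x_k)) * mu_k   (Section 5)
    scores = np.array([alpha @ similarity_features(returns_t, r_k) for r_k in past_returns])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    beta_t = w @ past_mu                       # (c,) @ (c, V) -> (V,)
    # pi(beta_t, n_{1:t-1}): softmax of background log-frequencies plus deviation.
    logits = n_bg + beta_t
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Toy usage: 500 stocks, vocabulary of 10 words, window c = 7.
rng = np.random.default_rng(0)
c, n_stocks, V = 7, 500, 10
alpha = rng.normal(size=n_stocks) * 0.01
past_returns = [rng.normal(scale=0.02, size=n_stocks) for _ in range(c)]
returns_t = rng.normal(scale=0.02, size=n_stocks)
past_mu = rng.normal(scale=0.1, size=(c, V))
n_bg = np.log(rng.integers(1, 100, size=V).astype(float))
p_t = predict_word_distribution(alpha, returns_t, past_returns, past_mu, n_bg)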


Dataset   Total #Doc.   Avg. #Doc./Day   #Days   Unigram Tokens   Unigram Vocab.   Bigram Tokens   Bigram Vocab.
EN:NA     86,683        223              392     28,265,550       10,000           11,804,201      5,000
EN:JP     70,807        182              392     16,026,380       10,000           7,047,095       5,000
FR        62,355        160              392     11,942,271       10,000           3,773,517       5,000
DE        51,515        132              392     9,027,823        10,000           3,499,965       5,000
Twitter   214,794       336              639     1,660,874        10,000           551,768         5,000

Table 1: Statistics about the datasets. The average number of documents (third column) is per day.

We perform two experiments, using unigram and bigram language models as the base models. For each dataset, we consider the top 10,000 unigrams after removing corpus-specific stop words (the top 100 words with highest frequencies). For the bigram experiments, we only use 5,000 words to limit the number of unique bigrams, so that we can simulate experiments for the entire time horizon in a reasonable amount of time. In standard open-vocabulary language modeling experiments, the treatment of unknown words deserves care. We have opted for a controlled, closed-vocabulary experiment, since standard smoothing techniques will almost surely interact with temporal dynamics and context in interesting ways that are out of scope in the present work.

6.2 Baselines

Since this is a forecasting task, at each time step we only have access to data from previous time steps. Our model assumes that all words in all documents in a corpus come from a single multinomial distribution. Therefore, we compare our approach to the corresponding base models (standard unigram and bigram language models) over the same vocabulary (for each stream). The first one maintains counts of every word and updates the counts at each time step. This corresponds to a base model that uses all of the available data up to the current time step ("base all"). The second one replaces counts of every word with the counts from the previous time step ("base one"). Additionally, we also compare with a base model whose counts decay exponentially ("base exp"). That is, the counts from previous time steps decay by exp(−γs), where s is the distance between previous time steps and the current time step and γ is the decay constant. We set the decay constant γ = 1. We put a symmetric Dirichlet prior on the counts ("add-one" smoothing); this is analogous to our treatment of the background frequencies n in our model.

Note that our model, similar to "base all," uses all available data up to time step t−1 when making predictions for time step t. The window size c only determines which previous time steps' models can be chosen for making a prediction today. The past models themselves are estimated from all available data up to their respective time steps.

We also compare with two strong baselines: a linear interpolation of "base one" models for the past week ("int. week") and a linear interpolation of "base all" and "base one" ("int. one all"). The interpolation weights are learned online using the normalized exponentiated gradient algorithm (Kivinen and Warmuth, 1997), which has been shown to enjoy a stronger regret guarantee compared to standard online gradient descent for learning a convex combination of weights.

6.3 Results

We use per-word predictive perplexity on unseen data to evaluate the performance of our model:
$\text{perplexity} = \exp\!\left(\frac{-\sum_{t=1}^{T} \log p(w_t \mid \alpha, x_{1:t}, n_{1:t-1})}{\sum_{t=1}^{T} \sum_{j \in V} w_{t,j}}\right).$
Note that the denominator is the number of tokens up to time step T. Lower perplexity is better.
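A minimal sketch of the per-word predictive perplexity above, assuming the per-day predictive log-likelihoods have already been computed (the helper name and inputs are illustrative, not from the paper):

import numpy as np

def predictive_perplexity(daily_log_probs, daily_counts):
    """daily_log_probs[t]: log p(w_t | ...) for day t (a scalar);
    daily_counts[t]: total number of tokens observed on day t."""
    total_log_prob = np.sum(daily_log_probs)
    total_tokens = np.sum(daily_counts)
    return np.exp(-total_log_prob / total_tokens)

# Example: three days with 1,000 tokens each.
print(predictive_perplexity([-6907.8, -6905.0, -6910.2], [1000, 1000, 1000]))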


Dataset   base all   base one   base exp   int. week   int. one all   c=7      c=14
EN:NA     3,341      3,677      3,486      3,403       3,271          3,262*   3,285
EN:JP     2,802      3,212      2,750      2,949       2,708          2,656*   2,689
FR        3,603      3,910      3,678      3,625       3,416          3,404*   3,438
DE        3,789      4,199      3,979      3,926       3,634*         3,649    3,687
Twitter   3,880      6,168      5,133      5,859       4,047          3,801*   3,819

Table 2: Perplexity results for our five data streams in the unigram experiments. The base models in "base all," "base one," and "base exp" are unigram language models. "int. week" is a linear interpolation of "base one" models from the past week. "int. one all" is a linear interpolation of "base one" and "base all". The rightmost two columns are versions of our model. The best result in each row is marked with an asterisk.

Dataset   base all   base one   base exp   int. week   int. one all   c=7
EN:NA     242        2,229      1,880      2,200       244            223*
EN:JP     185        2,101      1,726      2,050       189            167*
FR        159        2,084      1,707      2,068       166            139*
DE        268        2,634      2,267      2,644       282            243*
Twitter   756        4,245      4,253      5,859       4,046          739*

Table 3: Perplexity results for our five data streams in the bigram experiments. The base models in "base all," "base one," and "base exp" are bigram language models. "int. week" is a linear interpolation of "base one" models from the past week. "int. one all" is a linear interpolation of "base one" and "base all". The rightmost column is a version of our model with c = 7. The best result in each row is marked with an asterisk.

Table 2 and Table 3 show the perplexity results for each of the datasets in the unigram and bigram experiments, respectively. Our model outperformed the other competing models in all cases but one. Recall that we only define the similarity function of world context as: f(x_{o,t}, x_{o,k}) = 1 iff sign(r_{o,t}) = sign(r_{o,k}), and 0 otherwise. A better similarity function (e.g., one that takes into account the market size of the company and the magnitude of increase or decrease in the stock price) might be able to improve the performance further. We leave this for future work. Furthermore, the variations can be captured using models from the past week. We discuss why increasing c from 7 to 14 did not improve performance of the model in more detail in §6.4.

We can also see how the models performed over time. Figure 4 traces perplexity for the four Reuters news stream datasets.[13] We can see that in some cases the performance of the "base all" model degraded over time, whereas our model is more robust to temporal shifts.

[13] In both experiments, in order to manage the time and space complexities of updating β, we apply a sparsity shrinkage technique by using OWL-QN (Andrew and Gao, 2007) when maximizing it, with the regularization constant set to 1. Intuitively, this is equivalent to encouraging the deviation vector to be sparse (Eisenstein et al., 2011).

In the bigram experiments, we only ran our model with c = 7, since we need to maintain β in R^{V²}, instead of R^V as in the unigram model. The goal of this experiment is to determine whether our method still adds benefit to more expressive language models. Note that the weights of the linear interpolation models are also learned in an online fashion, since there are no classical training, development, and test sets in our setting. Since the "base one" model performed poorly in this experiment, the performance of the interpolated models also suffered. For example, the "int. one all" model needed time to learn that the "base one" model has to be downweighted (we started with all interpolated models having uniform weights), so it was not able to outperform even the "base all" model.

6.4 Analysis and Discussion

It should not be surprising that conditioning on world-context reduces perplexity (Cover and Thomas, 1991). A key attraction of our model, we believe, lies in the ability to inspect its parameters.
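The interpolation baselines' weights are learned online with normalized exponentiated gradient (Kivinen and Warmuth, 1997). The paper does not spell out the exact update it uses, so the following is only a generic normalized-EG step for a convex combination of component language models, with hypothetical names:

import numpy as np

def eg_update(weights, component_probs, counts_t, eta=0.1):
    """One normalized exponentiated gradient step on interpolation weights.
    component_probs: (K, V) word distributions of the K interpolated models for day t;
    counts_t: (V,) observed word counts for day t."""
    mixture = weights @ component_probs                       # (V,) mixture distribution
    # Gradient of the day's negative log-likelihood w.r.t. each weight.
    grad = -np.array([np.sum(counts_t * p_k / mixture) for p_k in component_probs])
    new_w = weights * np.exp(-eta * grad)
    return new_w / new_w.sum()                                # renormalize onto the simplex

# Toy usage: two components ("base all" and "base one") over a 5-word vocabulary.
w = np.array([0.5, 0.5])
P = np.array([[0.2, 0.2, 0.2, 0.2, 0.2],
              [0.5, 0.2, 0.1, 0.1, 0.1]])
counts = np.array([10, 2, 1, 0, 3])
w = eg_update(w, P, counts)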


Figure 3: Deviation coefficients β over time for Google- and Microsoft-related words on Twitter with the unigram base model (c = 7). Significant changes (increases or decreases) in the returns of Google and Microsoft stocks are usually followed by increases in β of related words. (Left panel, "Twitter: Google": goog, #goog, @google, google+, and the return series rGOOG; right panel, "Twitter: Microsoft": microsoft, msft, #microsoft, and rMSFT.)

Deviation coefficients. Inspecting the model allows us to gain insight into temporal trends. We investigate the deviations learned by our model on the Twitter dataset. Examples are shown in Figure 3. The left plot shows β for four words related to Google: goog, #goog, @google, google+. For comparison, we also show the return of Google stock for the corresponding time step (scaled by 50 and centered at 0.5 for readability, smoothed using loess (Cleveland, 1979), denoted by rGOOG in the plot). We can see that significant changes in the return of Google stock (e.g., the rGOOG spikes between time steps 50–100, 150–200, and 490–550 in the plot) occurred alongside an increase in β of Google-related words. Similar trends can also be observed for Microsoft-related words in the right plot. The most significant loss in the return of Microsoft stock (the downward spike near time step 500 in the plot) is followed by a sudden sharp increase in β of the words #microsoft and microsoft.

Feature coefficients. We can also inspect the learned feature coefficients α to investigate which stocks have higher associations with the text that is generated. Our feature coefficients are designed to reflect which changes (or lack of changes) in stock prices influence the word distribution more, not which stocks are talked about more often. We find that the feature coefficients do not correlate with obvious company characteristics like market capitalization (firm size). For example, on the Twitter dataset with bigram base models, the five stocks with the highest weights are: ConAgra Foods Inc., Intel Corp., Bristol-Myers Squibb, Frontier Communications Corp., and Amazon.com Inc. Strongly negative weights tended to align with streams with less activity, suggesting that these were being used to smooth across all c days of history. A higher weight for stock o implies an increase in the probability of choosing models from previous time steps s when the state of the world for the current time step t and time step s is the same (as represented by our similarity function) with respect to stock o (all other things being equal), and a decrease in probability for a lower weight.

Figure 5: Distributions of the selection probabilities of models from the previous c = 14 time steps, on the EN:NA dataset with the unigram base model. For simplicity, we show E-step modes. (x-axis: time lags 1–14; y-axis: frequency.) The histogram shows that the model tends to favor models from days closer to the current date.

Selected models. Besides feature coefficients, our model captures temporal shift by modeling similarity across the most recent c days. During inference, our model weights different word distributions from the past. The similarity is encoded in the pairwise features f(x_t, x_k) and the parameters α. Figure 5 shows the distributions of the strongest-posterior models from previous time steps, based on how far in the past they are at the time of use, aggregated across rounds on the EN:NA dataset, for window size c = 14.


Figure 4: Perplexity over time for the four Reuters news streams (c = 7) with bigram base models. (Panels: EN:NA, EN:JP, FR, DE; each panel compares "base all," "int. one all," and the complete model.)


It shows that the model tends to favor models from days closer to the current date, with the t−1 models selected the most, perhaps because the state of the world today is more similar to dates closer to today compared to more distant dates. The plot also explains why increasing c from 7 to 14 did not improve performance of the model, since most of the variation in our datasets can be captured with models from the past week.

Topics. Latent topic variables have often figured heavily in approaches to dynamic language modeling. In preliminary experiments incorporating single-membership topic variables (i.e., each document belongs to a single topic, as in a mixture of unigrams), we saw no benefit to perplexity. Incorporating topics also increases computational cost, since we must maintain and estimate one language model per topic, per time step. It is straightforward to design models that incorporate topics with single or mixed membership as in LDA (Blei et al., 2003), an interesting future direction.

Potential applications. Dynamic language models like ours can be potentially useful in many applications, either as a standalone language model (e.g., for predictive text input, whose performance may depend on the temporal dimension) or as a component in applications like machine translation or speech recognition. Additionally, the model can be seen as a step towards enhancing text understanding with numerical, contextual data.

7 Conclusion

We presented a dynamic language model for streaming datasets that allows conditioning on observable real-world context variables, exemplified in our experiments by stock market data. We showed how to perform learning and inference in an online fashion for this model. Our experiments showed the predictive benefit of such conditioning and online learning by comparing to similar models that ignore temporal dimensions and observable variables that influence the text.

Acknowledgements

The authors thank several anonymous reviewers for helpful feedback on earlier drafts of this paper and Brendan O'Connor for help with collecting Twitter data. This research was supported in part by Google, by computing resources at the Pittsburgh Supercomputing Center, by National Science Foundation grant IIS-1111142, AFOSR grant FA95501010247, ONR grant N000140910758, and by the Intelligence Advanced Research Projects Activity via Department of Interior National Business Center contract number D12PC00347. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.

References

Galen Andrew and Jianfeng Gao. 2007. Scalable training of L1-regularized log-linear models. In Proc. of ICML.

David M. Blei and John D. Lafferty. 2006. Dynamic topic models. In Proc. of ICML.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Sébastien Bubeck. 2011. Introduction to online optimization. Technical report, Department of Operations Research and Financial Engineering, Princeton University.

Nicolò Cesa-Bianchi and Gábor Lugosi. 2006. Prediction, Learning, and Games. Cambridge University Press.

William S. Cleveland. 1979. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74(368):829–836.

Thomas M. Cover and Joy A. Thomas. 1991. Elements of Information Theory. John Wiley & Sons.

John Duchi and Yoram Singer. 2009. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10(7):2899–2934.

Jacob Eisenstein, Brendan O'Connor, Noah A. Smith, and Eric P. Xing. 2010. A latent variable model for geographic lexical variation. In Proc. of EMNLP.
Jacob Eisenstein, Amr Ahmed, and Eric P. Xing. 2011. Sparse additive generative models of text. In Proc. of ICML.

Amit Goyal, Hal Daume III, and Suresh Venkatasubramanian. 2009. Streaming for large scale NLP: Language modeling. In Proc. of HLT-NAACL.


Matt Hoffman, David M. Blei, Chong Wang, and John Paisley. 2013. Stochastic variational inference. Journal of Machine Learning Research, 14:1303–1347.

Antti Honkela and Harri Valpola. 2003. On-line variational Bayesian learning. In Proc. of ICA.

Tomoharu Iwata, Takeshi Yamada, Yasushi Sakurai, and Naonori Ueda. 2010. Online multiscale dynamic topic models. In Proc. of KDD.

Frederick Jelinek. 1997. Statistical Methods for Speech Recognition. MIT Press.

Jyrki Kivinen and Manfred K. Warmuth. 1997. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132:1–63.

Victor Lavrenko, Matt Schmill, Dawn Lawrie, Paul Ogilvie, David Jensen, and James Allan. 2000. Mining of concurrent text and time series. In Proc. of KDD Workshop on Text Mining.

Abby Levenberg and Miles Osborne. 2009. Stream-based randomised language models for SMT. In Proc. of EMNLP.

Abby Levenberg, Chris Callison-Burch, and Miles Osborne. 2010. Stream-based translation models for statistical machine translation. In Proc. of HLT-NAACL.

Dong C. Liu and Jorge Nocedal. 1989. On the limited memory BFGS method for large scale optimization. Mathematical Programming B, 45(3):503–528.

David Mimno and Andrew McCallum. 2008. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In Proc. of UAI.

Alexander Rakhlin. 2009. Lecture notes on online learning. Technical report, Department of Statistics, The Wharton School, University of Pennsylvania.

Masaaki Sato. 2001. Online model selection based on the variational Bayes. Neural Computation, 13(7):1649–1681.

Shai Shalev-Shwartz. 2012. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194.

Martin J. Wainwright and Michael I. Jordan. 2008. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305.

Chong Wang, David M. Blei, and David Heckerman. 2008. Continuous time dynamic topic models. In Proc. of UAI.

Chong Wang, John Paisley, and David M. Blei. 2011. Online variational inference for the hierarchical Dirichlet process. In Proc. of AISTATS.

Martin Zinkevich. 2003. Online convex programming and generalized infinitesimal gradient ascent. In Proc. of ICML.