Transactions of the Association for Computational Linguistics, vol. 5, pp. 529–542, 2017. Action Editor: Diana McCarthy.
Submission batch: 7/2017; Published 12/2017.
© 2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Anchored Correlation Explanation: Topic Modeling with Minimal Domain Knowledge

Ryan J. Gallagher^{1,2}, Kyle Reing^1, David Kale^1, and Greg Ver Steeg^1
^1 Information Sciences Institute, University of Southern California
^2 Vermont Complex Systems Center, Computational Story Lab, University of Vermont
ryan.gallagher@uvm.edu, {reing, kale, gregv}@isi.edu

Abstract

While generative models such as Latent Dirichlet Allocation (LDA) have proven fruitful in topic modeling, they often require detailed assumptions and careful specification of hyperparameters. Such model complexity issues only compound when trying to generalize generative models to incorporate human input. We introduce Correlation Explanation (CorEx), an alternative approach to topic modeling that does not assume an underlying generative model, and instead learns maximally informative topics through an information-theoretic framework. This framework naturally generalizes to hierarchical and semi-supervised extensions with no additional modeling assumptions. In particular, word-level domain knowledge can be flexibly incorporated within CorEx through anchor words, allowing topic separability and representation to be promoted with minimal human intervention. Across a variety of datasets, metrics, and experiments, we demonstrate that CorEx produces topics that are comparable in quality to those produced by unsupervised and semi-supervised variants of LDA.

1 Introduction

The majority of topic modeling approaches utilize probabilistic generative models, models which specify mechanisms for how documents are written in order to infer latent topics. These mechanisms may be explicitly stated, as in Latent Dirichlet Allocation (LDA) (Blei et al., 2003), or implicitly stated, as with matrix factorization techniques (Hofmann, 1999; Ding et al., 2008; Buntine and Jakulin, 2006). The core generative mechanisms of LDA, in particular, have inspired numerous generalizations that account for additional information, such as authorship (Rosen-Zvi et al., 2004), document labels (McAuliffe and Blei, 2008), or hierarchical structure (Griffiths et al., 2004). However, these generalizations come at the cost of increasingly elaborate and unwieldy generative assumptions. While these assumptions allow topic inference to be tractable in the face of additional metadata, they progressively constrain topics to a narrower view of what a topic can be. Such assumptions are undesirable in contexts where one wishes to minimize model complexity and learn topics without preexisting notions of how those topics originated.

For these reasons, we propose topic modeling by way of Correlation Explanation (CorEx),^1 an information-theoretic approach to learning latent topics over documents. Unlike LDA, CorEx does not assume a particular data generating model, and instead searches for topics that are "maximally informative" about a set of documents. By learning informative topics rather than generated topics, we avoid specifying the structure and nature of topics ahead of time.

In addition, the lightweight framework underlying CorEx is versatile and naturally extends to hierarchical and semi-supervised variants with no additional modeling assumptions. More specifically, we

^1 Open source, documented code for the CorEx topic model is available at https://github.com/gregversteeg/corex_topic.
may flexibly incorporate word-level domain knowledge within the CorEx topic model. Topic models are often susceptible to portraying only dominant themes of documents. Injecting a topic model, such as CorEx, with domain knowledge can help guide it towards otherwise underrepresented topics that are of importance to the user. By incorporating relevant domain words, we might encourage our topic model to recognize a rare disease that would otherwise be missed in clinical health notes, focus more attention on topics from news articles that can guide relief workers in distributing aid more effectively, or disambiguate aspects of a complex social issue.

Our contributions are as follows: first, we frame CorEx as a topic model and derive an efficient alteration to the CorEx algorithm to exploit sparse data, such as word counts in documents, for dramatic speedups. Second, we show how domain knowledge can be naturally integrated into CorEx through "anchor words" and the information bottleneck. Third, we demonstrate that CorEx and anchored CorEx produce topics of comparable quality to unsupervised and semi-supervised variants of LDA over several datasets and metrics. Finally, we carefully detail several anchoring strategies that highlight the versatility of anchored CorEx on a variety of tasks.

2 Methods

2.1 CorEx: Correlation Explanation

Here we review the fundamentals of Correlation Explanation (CorEx), and adopt the notation used by Ver Steeg and Galstyan in their original presentation of the model (2014). Let X be a discrete random variable that takes on a finite number of values, indicated with lowercase x. Furthermore, if we have n such random variables, let X_G denote a subcollection of them, where G \subseteq \{1, \ldots, n\}. The probability of observing X_G = x_G is written as p(X_G = x_G), which is typically abbreviated to p(x_G). The entropy of X is written as H(X), and the mutual information of two random variables X_1 and X_2 is given by I(X_1 : X_2) = H(X_1) + H(X_2) - H(X_1, X_2).

The total correlation, or multivariate mutual information, of a group of random variables X_G is expressed as

    TC(X_G) = \sum_{i \in G} H(X_i) - H(X_G)                               (1)
            = D_{KL}\left( p(x_G) \,\|\, \prod_{i \in G} p(x_i) \right).   (2)

We see that Eq. 1 does not quantify "correlation" in the modern sense of the word, and so it can be helpful to conceptualize total correlation as a measure of total dependence. Indeed, Eq. 2 shows that total correlation can be expressed using the Kullback-Leibler divergence and, therefore, it is zero if and only if the joint distribution of X_G factorizes, or, in other words, there is no dependence between the random variables.

The total correlation can be written when conditioning on another random variable Y, TC(X_G | Y) = \sum_{i \in G} H(X_i | Y) - H(X_G | Y). So, we can consider the reduction in the total correlation when conditioning on Y:

    TC(X_G ; Y) = TC(X_G) - TC(X_G | Y)                      (3)
                = \sum_{i \in G} I(X_i : Y) - I(X_G : Y)     (4)

The quantity expressed in Eq. 3 acts as a lower bound of TC(X_G) (Ver Steeg and Galstyan, 2015), as readily verified by noting that TC(X_G) and TC(X_G | Y) are always non-negative. Also note, the joint distribution of X_G factorizes conditional on Y if and only if TC(X_G | Y) = 0. If this is the case, then TC(X_G ; Y) is maximized, and Y explains all of the dependencies in X_G.

In the context of topic modeling, X_G represents a group of word types and Y represents a topic to be learned. Since we are always interested in grouping multiple sets of words into multiple topics, we will denote the binary latent topics as Y_1, \ldots, Y_m and their corresponding groups of word types as X_{G_j} for j = 1, \ldots, m respectively. The CorEx topic model seeks to maximally explain the dependencies of words in documents through latent topics by maximizing TC(X ; Y_1, \ldots, Y_m). To do this, we maximize the following lower bound on this expression:

    \max_{G_j,\, p(y_j | x_{G_j})} \sum_{j=1}^{m} TC(X_{G_j} ; Y_j).   (5)
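To make the total correlation in Eq. 1 concrete, it can be estimated for a few binary variables by plugging empirical frequencies into each entropy term. The sketch below uses synthetic data; the function names and the naive plug-in estimator are illustrative assumptions, not the authors' optimized implementation:

```python
import numpy as np
from collections import Counter

def plugin_entropy(samples):
    """Plug-in entropy (in nats) of hashable samples (e.g. bits or tuples of bits)."""
    counts = Counter(samples)
    probs = np.array(list(counts.values())) / len(samples)
    return float(-np.sum(probs * np.log(probs)))

def total_correlation(X):
    """TC(X_G) = sum_i H(X_i) - H(X_G), as in Eq. 1, for a samples-by-variables array."""
    marginals = sum(plugin_entropy(tuple(col)) for col in X.T)
    joint = plugin_entropy([tuple(row) for row in X])
    return marginals - joint

rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, size=2000)  # a roughly fair binary "word" indicator
x2 = x1.copy()                      # perfectly dependent duplicate word
x3 = rng.integers(0, 2, size=2000)  # independent word

# Dependence is detected: TC([x1, x2]) is near H(X1) ~ log 2 nats,
# while TC([x1, x3]) is near zero.
print(total_correlation(np.column_stack([x1, x2])))
print(total_correlation(np.column_stack([x1, x3])))
```

Brute-force estimation of the joint entropy is exponential in the number of variables; in CorEx this is never done directly, since the latent factors Y_j make the lower bound in Eq. 5 tractable.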
As we describe in the following section, this objective can be efficiently approximated, despite the search occurring over an exponentially large probability space (Ver Steeg and Galstyan, 2014). Since each topic explains a certain portion of the overall total correlation, we may choose the number of topics by observing diminishing returns to the objective. Furthermore, since the CorEx implementation depends on a random initialization (as described shortly), one may restart the CorEx topic model several times and choose the one that explains the most total correlation.

The latent factors, Y_j, are optimized to be informative about dependencies in the data and do not require generative modeling assumptions. Note that the discovered factors, Y, can be used as inputs to construct new latent factors, Z, and so on, leading to a hierarchy of topics. Although this extension is quite natural, we focus our analysis on the first level of topic representations for easier interpretation and evaluation.

2.2 CorEx Implementation

We summarize the implementation of CorEx as presented by Ver Steeg and Galstyan (2014) in preparation for innovations introduced in the subsequent sections. The numerical optimization for CorEx begins with a random initialization of parameters and then proceeds via an iterative update scheme similar to EM. For computational tractability, we subject the optimization in Eq. 5 to the constraint that the groups, G_j, do not overlap, i.e. we enforce single-membership of words within topics. The optimization entails a combinatorial search over groups, so instead we look for a form that is more amenable to smooth optimization. We rewrite the objective using the alternate form in Eq. 4 while introducing indicator variables \alpha_{i,j} which are equal to 1 if and only if word X_i appears in topic Y_j (i.e. i \in G_j):

    \max_{\alpha_{i,j},\, p(y_j | x)} \sum_{j=1}^{m} \left( \sum_{i=1}^{n} \alpha_{i,j} I(X_i : Y_j) - I(X : Y_j) \right)
    \text{s.t. } \alpha_{i,j} = \mathbb{I}\big[ j = \arg\max_{\bar{j}} I(X_i : Y_{\bar{j}}) \big].   (6)

Note that the constraint on non-overlapping groups now becomes a constraint on \alpha. To make the optimization smooth we relax the constraint so that \alpha_{i,j} \in [0, 1]. To do so, we replace the second line with a softmax function. The update for \alpha at iteration t becomes

    \alpha^t_{i,j} = \exp\left( \lambda^t \left( I(X_i : Y_j) - \max_{\bar{j}} I(X_i : Y_{\bar{j}}) \right) \right).

Now \alpha \in [0, 1] and the parameter \lambda controls the sharpness of the softmax function. Early in the optimization we use a small value of \lambda, then increase it later in the optimization to enforce a hard constraint. The objective in Eq. 6 only lower bounds total correlation in the hard max limit. The constraint on \alpha forces competition among latent factors to explain certain words, while setting \lambda = 0 results in all factors learning the same thing.

Holding \alpha fixed, taking the derivative of the objective with respect to the variables p(y_j | x), and setting it equal to zero leads to a fixed point equation. We use this fixed point to define update equations at iteration t:

    p_t(y_j) = \sum_{\bar{x}} p_t(y_j | \bar{x}) \, p(\bar{x})                                       (7)
    p_t(x_i | y_j) = \sum_{\bar{x}} p_t(y_j | \bar{x}) \, p(\bar{x}) \, \mathbb{I}[\bar{x}_i = x_i] / p_t(y_j)
    \log p_{t+1}(y_j | x^\ell) = \log p_t(y_j) + \sum_{i=1}^{n} \alpha^t_{i,j} \log \frac{p_t(x^\ell_i | y_j)}{p(x^\ell_i)} - \log Z_j(x^\ell)   (8)

The first two lines just define the marginals in terms of the optimization parameter, p_t(y_j | x). We take p(x) to be the empirical distribution defined by some observed samples, x^\ell, \ell = 1, \ldots, N. The third line updates p_t(y_j | x^\ell), the probabilistic labels for each latent factor, Y_j, for a given sample, x^\ell. Note that an easily calculated constant, Z_j(x^\ell), appears to ensure the normalization of p_t(y_j | x^\ell) for each sample. We iterate through these updates until convergence.

After convergence, we use the mutual information terms I(X_i : Y_j) to rank which words are most informative for each factor. The objective is a sum of terms for each latent factor, and this allows us to rank the contribution of each factor toward our lower bound on the total correlation. The expected log of the normalization constant, often called the free energy, E[\log Z_j(X)], plays an important role since it provides a free estimate of the j-th term in the objective (Ver Steeg and Galstyan, 2015), as
can be seen by taking the expectation of Eq. 8 at convergence and comparing it to Eq. 6. Because our sample estimate of the objective is just the mean of contributions from individual sample points, x^\ell, we refer to \log Z_j(x^\ell) as the pointwise total correlation explained by factor j for sample \ell. Pointwise TC can be used to localize which samples are particularly informative about specific latent factors.

2.3 Sparsity Optimization

2.3.1 Derivation

To alter the CorEx optimization procedure to exploit sparsity in the data, we now assume that all variables, x_i, y_j, are binary and x is a binary vector where x^\ell_i = 1 if word i occurs in document \ell and x^\ell_i = 0 otherwise. Since all variables are binary, the marginal distribution, p(x_i | y_j), is just a two-by-two table of probabilities and can be estimated efficiently. The time-consuming part of training is the subsequent update of the document labels in Eq. 8 for each document \ell. The computation of the log likelihood ratio for all n words over all documents is not efficient, as most words do not appear in a given document. We rewrite the logarithm in the interior of the sum:

    \log \frac{p_t(x^\ell_i | y_j)}{p(x^\ell_i)} = \log \frac{p_t(X_i = 0 | y_j)}{p(X_i = 0)} + x^\ell_i \log\left( \frac{p_t(X_i = 1 | y_j) \, p(X_i = 0)}{p_t(X_i = 0 | y_j) \, p(X_i = 1)} \right)   (9)

Note, when the word does not appear in the document, only the leading term of Eq. 9 will be nonzero. However, when the word does appear, everything but \log p_t(X_i = 1 | y_j) / p(X_i = 1) cancels out. So, we have taken advantage of the fact that the CorEx topic model binarizes documents to assume by default that a word does not appear in the document, and then correct the contribution to the update if the word does appear.

Thus, when substituting back into Eq. 8, the sum becomes a matrix multiplication between a matrix with dimensions of the number of variables by the number of documents, with entries x^\ell_i, that is assumed to be sparse, and a dense matrix with dimensions of the number of variables by the number of latent factors. Given n variables, N samples, and \rho nonzero entries in the data matrix, the asymptotic scaling for CorEx goes from O(Nn) to O(N) + O(n) + O(\rho) by exploiting sparsity. Latent tree modeling approaches are quadratic in n or worse, so we expect CorEx's computational advantage to increase for larger datasets.

Figure 1: Speed comparisons to a fixed number of iterations as the number of documents and words vary. New York Times articles and PubMed abstracts were collected from the UCI Machine Learning Repository (Lichman, 2013). The disaster relief articles are described in Section 4.1, and represented simply as bags of words, not phrases.

2.3.2 Optimization Evaluation

We perform experiments comparing the running time of CorEx before and after implementing the improvements which exploit sparsity. We also compare with Scikit-Learn's simple batch implementation of LDA using the variational Bayes algorithm (Hoffman et al., 2013). Experiments were performed on a four-core Intel i5 chip running at 4 GHz with 32 GB RAM. We show runtime when varying the data size in terms of the number of word types and the number of documents. We used 50 topics for all runs and set the number of iterations for each run to 10 iterations for LDA and 50 iterations for CorEx. Results are shown in Figure 1. We see that CorEx exploiting sparsity is orders of magnitude faster than the
naive version and is generally comparable to LDA as the number of documents scales. The slope on the log-log plot suggests a linear dependence of running time on the dataset size, as expected.

2.4 Anchor Words via the Bottleneck

The information bottleneck formulates a trade-off between compressing data X into a representation Y, and preserving the information in X that is relevant to Z (typically labels in a supervised learning task) (Tishby et al., 1999; Friedman et al., 2001). More formally, the information bottleneck is expressed as

    \max_{p(y|x)} \beta I(Z : Y) - I(X : Y),   (10)

where \beta is a parameter controlling the trade-off between compressing X and preserving information about the relevance variable, Z.

To see the connection with CorEx, we compare the CorEx objective as written in Eq. 6 with the bottleneck in Eq. 10. We see that we have exactly the same compression term for each latent factor, I(X : Y_j), but the relevance variables now correspond to Z \equiv X_i. If we want to learn representations that are more relevant to specific keywords, we can simply anchor a word type X_i to topic Y_j by constraining our optimization so that \alpha_{i,j} = \beta_{i,j}, where \beta_{i,j} \geq 1 controls the anchor strength. Otherwise, the updates on \alpha remain the same. This schema is a natural extension of the CorEx optimization and it is flexible, allowing for multiple word types to be anchored to one topic, for one word type to be anchored to multiple topics, or for any combination of these semi-supervised anchoring strategies.

3 Related Work

With respect to integrating domain knowledge into topic models, we draw inspiration from Arora et al. (2012), who used anchor words in the context of non-negative matrix factorization. Using an assumption of separability, these anchor words act as high precision markers of particular topics and, thus, help discern the topics from one another. Although the original algorithm proposed by Arora et al. (2012), and subsequent improvements to their approach, find these anchor words automatically (Arora et al., 2013; Lee and Mimno, 2014), recent adaptations allow manual insertion of anchor words and other metadata (Nguyen et al., 2014; Nguyen et al., 2015). Our work is similar to the latter, where we treat anchor words as fuzzy logic markers and embed them into the topic model in a semi-supervised fashion. In this sense, our work is closest to Halpern et al. (2014; 2015), who have also made use of domain expertise and semi-supervised anchored words in devising topic models.

There is an adjacent line of work that has focused on incorporating word-level information into LDA-based models. Jagarlamudi et al. (2012) proposed SeededLDA, a model that seeds words into given topics and guides, but does not force, these topics towards these integrated words. Andrzejewski and Zhu (2009) presented a model that makes use of "z-labels," words that are known to pertain to specific topics and that are restricted to appearing in some subset of all the possible topics. Although the z-labels can be leveraged to place different senses of a word into different topics, it requires additional effort to determine when these different senses occur. Our anchoring approach allows a user to more easily anchor one word to multiple topics, allowing CorEx to naturally find topics that revolve around different senses of a word.

Andrzejewski et al. (2009) presented a second model which allows specification of Must-Link and Cannot-Link relationships between words that help partition otherwise muddled topics. These logical constraints help enforce topic separability, though these mechanisms less directly address how to anchor a single word or set of words to help a topic emerge. More generally, the Must/Cannot link and z-label topic models have been expressed in a powerful first-order-logic framework that allows the specification of arbitrary domain knowledge through logical rules (Andrzejewski et al., 2011). Others have built off this first-order-logic approach to automatically learn rule weights (Mei et al., 2014) and incorporate additional latent variable information (Foulds et al., 2015).

Mathematically, CorEx topic models most closely resemble topic models based on latent tree reconstruction (Chen et al., 2016). In Chen et al.'s (2016) analysis, their own latent tree approach and CorEx both report significantly better perplexity than hierarchical topic models based on the hierarchical Dirichlet process and the Chinese restaurant process. CorEx has also been investigated as a way to find "surprising" documents (Hodas et al., 2015).

4 Data and Evaluation Methods

4.1 Data

We use two challenging datasets with corresponding domain knowledge lexicons to evaluate anchored CorEx. Our first dataset consists of 504,000 humanitarian assistance and disaster relief (HA/DR) articles covering 21 disaster types collected from ReliefWeb, an HA/DR news article aggregator sponsored by the United Nations. To mitigate overwhelming label imbalances during anchoring, we both restrict ourselves to documents in English with one label, and randomly subsample 2,000 articles from each of the largest disaster type labels. This leaves us with a corpus of 18,943 articles.^2

We accompany these articles with an HA/DR lexicon of approximately 34,000 words and phrases. The lexicon was curated by first gathering 40–60 seed terms per disaster type from HA/DR domain experts and CrisisLex. This term list was then expanded by creating word embeddings for each disaster type, and taking terms within a specified cosine similarity of the seed words. These lists were then filtered by removing names, places, non-ASCII characters, and terms with fewer than three characters. Finally, the extracted terms were audited using CrowdFlower, where users rated the relevance of the terms on a Likert scale. Low relevance terms were dropped from the lexicon. Of these terms, 11,891 types appear in the HA/DR articles.

Our second dataset consists of 1,237 deidentified clinical discharge summaries from the Informatics for Integrating Biology and the Bedside (i2b2) 2008 Obesity Challenge.^3 These summaries are labeled by clinical experts with 15 conditions frequently associated with obesity. For these documents, we leverage a text pipeline that extracts common medical terms and phrases (Dai et al., 2008; Chapman et al., 2001), which yields 3,231 such term types.

For both sets of documents, we use their respective lexicons to break the documents down into bags of words and phrases. We also make use of the 20 Newsgroups dataset, as provided and preprocessed in the Scikit-Learn library (Pedregosa et al., 2011).

^2 HA/DR articles and accompanying lexicon available at http://dx.doi.org/10.7910/DVN/TGOPRU
^3 Data available upon data use agreement at https://www.i2b2.org/NLP/Obesity/

4.2 Evaluation

CorEx does not explicitly attempt to learn a generative model and, thus, traditional measures such as perplexity are not appropriate for model comparison against LDA. Furthermore, it is well-known that perplexity and held-out log-likelihood do not necessarily correlate with human evaluation of semantic topic quality (Chang et al., 2009). Therefore, we measure semantic topic quality using Mimno et al.'s (2011) UMass automatic topic coherence score, which correlates with human judgments.

We also evaluate the models in terms of multi-class logistic regression document classification (Pedregosa et al., 2011), where the feature set of each document is its topic distribution. We perform all document classification tasks using a 60/40 training-test split.

Finally, we measure how well each topic model does at clustering documents. We obtain a clustering by assigning each document to the topic that occurs with the highest probability. We then measure the quality within clusters (homogeneity) and across clusters (adjusted mutual information). The highest possible value for both measures is one. We do not report clustering metrics on the clinical health notes because the documents are multi-label and, in that case, the metrics are not well-defined.

4.3 Choosing Anchor Words

We wish to systematically test the effect of anchor words given the domain-specific lexicons. To do so, we follow the approach used by Jagarlamudi et al. (2012) to automatically generate anchor words: for each label in a dataset, we find the words that have the highest mutual information with the label. For word w and label L, this is computed as

    I(L : w) = H(L) - H(L | w),   (11)

where for each document of label L we consider whether the word w appears or not.
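The selection rule in Eq. 11 can be sketched with plug-in entropies over binary occurrence indicators. This is an illustrative reimplementation on a toy corpus, not the paper's code; the function names (`label_word_mi`, `top_anchors`) and the miniature data are assumptions:

```python
import numpy as np

def bernoulli_entropy(p):
    """Entropy (nats) of Bernoulli(p), vectorized, treating 0 log 0 as 0."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def label_word_mi(doc_word, labels, label):
    """I(L : w) = H(L) - H(L | w), Eq. 11, for every word column.
    L indicates whether a document carries `label`; w indicates word occurrence."""
    L = (labels == label).astype(float)   # binary label indicator per document
    W = (doc_word > 0).astype(float)      # documents-by-words occurrence matrix
    h_L = bernoulli_entropy(L.mean())
    p_w = W.mean(axis=0)
    # p(L = 1 | w = 1) and p(L = 1 | w = 0), guarding empty conditionals
    p_L_w1 = (W.T @ L) / np.maximum(W.sum(axis=0), 1.0)
    p_L_w0 = ((1 - W).T @ L) / np.maximum((1 - W).sum(axis=0), 1.0)
    h_L_given_w = p_w * bernoulli_entropy(p_L_w1) + (1 - p_w) * bernoulli_entropy(p_L_w0)
    return h_L - h_L_given_w

def top_anchors(doc_word, labels, label, vocab, k=5):
    """Return the k words with the highest mutual information with the label."""
    mi = label_word_mi(doc_word, labels, label)
    return [vocab[i] for i in np.argsort(mi)[::-1][:k]]

# Toy corpus: two "flood" and two "fire" documents.
vocab = ["water", "rain", "smoke", "the"]
doc_word = np.array([[1, 1, 0, 1],
                     [1, 0, 0, 1],
                     [0, 0, 1, 1],
                     [0, 1, 1, 1]])
labels = np.array(["flood", "flood", "fire", "fire"])
print(top_anchors(doc_word, labels, "flood", vocab, k=2))
```

Note that mutual information is symmetric in this respect: a word occurring only in other labels (here "smoke") scores as high for "flood" as a perfectly indicative word, which is one reason ambiguous anchor words are filtered in Section 5.2.1.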
Figure 2: Baseline comparison of CorEx to LDA with respect to topic coherence and document classification and clustering on three different datasets as the number of topics varies. Points are the average of 30 runs of a topic model. Confidence intervals are plotted but are so small that they are not distinguishable. CorEx is trained using binary data, while LDA is trained on count data. Homogeneity is not well-defined on the multi-label clinical health notes, so it is omitted.

Rank | Disaster Relief Topic
1 | drought, farmers, harvest, crop, livestock, planting, grain, maize, rainfall, irrigation
3 | eruption, volcanic, lava, crater, eruptions, volcanos, slopes, volcanic activity, evacuated, lava flows
8 | winter, snow, snowfall, temperatures, heavy snow, heating, freezing, warm clothing, severe winter, avalanches
23 | military, armed, civilians, soldiers, aircraft, weapons, rebel, planes, bombs, military personnel

Rank | 20 Newsgroups Topic
3 | team, game, season, player, league, hockey, play, teams, nhl
14 | car, bike, cars, engine, miles, road, ride, riding, bikes, ground
26 | nasa, launch, orbit, shuttle, mission, satellite, government, jpl, orbital, solar
39 | medical, disease, doctor, patients, treatment, medicine, health, hospital, doctors, pain

Rank | Clinical Health Notes Topic
12 | vomiting, nausea, abdominal pain, diarrhea, fever, dehydration, chill, clostridium difficile, intravenous fluid, compazine
19 | anxiety state, insomnia, ativan, neurontin, depression, lorazepam, gabapentin, trazodone, fluoxetine, headache
27 | pain, oxycodone, tylenol, percocet, ibuprofen, morphine, osteoarthritis, hernia, motrin, bleeding

Table 1: Examples of topics learned by the CorEx topic model. Words are ranked according to mutual information with the topic, and topics are ranked according to the amount of total correlation they explain. Topic models were run with 50 topics on the ReliefWeb and 20 Newsgroups datasets, and 30 topics on the clinical health notes.

5 Results

5.1 LDA Baseline Comparison

We compare CorEx to LDA in terms of topic coherence, document classification, and document clustering across three datasets. CorEx is trained on binary data, while LDA is trained on count data. While not reported here, CorEx consistently outperformed LDA trained on binary data. In doing these comparisons, we use the Gensim implementation of LDA (Řehůřek and Sojka, 2010). The results of comparing CorEx to LDA as a function of the number of topics are presented in Figure 2.

Across all three datasets, we find that the topics produced by CorEx yield document classification results that are on par with or better than those produced by LDA topics. In terms of clustering, CorEx consistently produces document clusters of higher homogeneity than LDA. On the disaster relief articles, the CorEx clusters are nearly twice as homogeneous as the LDA clusters.

CorEx outperforms LDA in terms of topic coherence on two out of three of the datasets. While LDA
produces more coherent topics for the clinical health notes, it is particularly striking that CorEx is able to produce high quality topics while only leveraging binary count data. Examples of these topics are shown in Table 1. Despite the binary counts limitation, CorEx still finds meaningfully coherent and competitive structure in the data.

Figure 3: Comparison of anchored CorEx to other semi-supervised topic models in terms of document clustering and topic coherence. For each dataset, the number of topics is fixed to the number of document labels. Each dot is the average of 30 runs. Confidence intervals are plotted but are so small that they are not distinguishable.

5.2 Anchored CorEx Analysis

We now examine the effects and benefits of guiding CorEx through anchor words. In doing so, we also compare anchored CorEx to other semi-supervised topic models.

5.2.1 Anchoring for Topic Separability

We are first interested in how anchoring can be used to encourage topic separability so that documents cluster well. We focus on the HA/DR articles and 20 Newsgroups datasets, since traditional clustering metrics are not well-defined on the multi-label clinical health notes. For both datasets, we fix the number of topics to be equal to the number of document labels. It is in this context that we compare anchored CorEx to two other semi-supervised topic models: z-labels LDA and must/cannot link LDA.

Rank | Anchored Disaster Relief Topic
1 | harvest, locus, drought, food crisis, farmers, crops, crop, malnutrition, food aid, livestock
4 | tents, quake, international federation, red crescent, red cross, blankets, earthquake, richter scale, societies, aftershocks
12 | climate, impacts, warming, climate change, irrigation, consumption, household, droughts, livelihoods, interventions
19 | storms, weather, winds, coastal, tornado, meteorological, tornadoes, strong winds, tropical, roofs

Rank | Anchored 20 Newsgroups Topic
5 | government, congress, clinton, state, national, economic, general, states, united, order
6 | bible, christian, god, jesus, christians, believe, life, faith, world, man
15 | use, used, high, circuit, power, work, voltage, need, low, end
20 | baseball, pitching, braves, mets, hitter, pitcher, cubs, dl, sox, jays

Table 2: Examples of topics learned by CorEx when simultaneously anchoring many topics with anchoring parameter \beta = 2. Anchor words are shown in bold. Words are ranked according to mutual information with the topic, and topics are ranked according to the amount of total correlation they explain. Topic models were run with 21 topics on the ReliefWeb articles and 20 topics on the 20 Newsgroups dataset.

Using the method described in Section 4.3, we automatically retrieve the top five anchors for each disaster type and newsgroup. We then filter these lists of any words that are ambiguous, i.e. words that are anchor words for more than one document label. For anchored CorEx and z-labels LDA we simultaneously assign each set of anchor words to exactly one topic each. For must/cannot link LDA, we create must-links within the words of the same anchor
group, and create cannot-links between words of different anchor groups. Since we are simultaneously anchoring to many topics, we use a weak anchoring parameter \beta = 2 for anchored CorEx. Using the notation from their original papers, we use \eta = 1 for z-labels LDA, and \eta = 1000 for must/cannot link LDA. For both LDA variants, we use \alpha = 0.5, \beta = 0.1, take 2,000 samples, and estimate the models using code implemented by the original authors.

The results of this comparison are shown in Figure 3, and examples of anchored CorEx topics are shown in Table 2. Across all measures CorEx and anchored CorEx outperform LDA. We find that anchored CorEx always improves cluster quality versus CorEx in terms of homogeneity and adjusted mutual information. Compared to CorEx, multiple simultaneous anchoring neither harms nor benefits the topic coherence of anchored CorEx. Together these metrics suggest that anchored CorEx is finding topics that are of equivalent coherence to CorEx, but more relevant to the document labels, since gains are seen in terms of document clustering.

Against the other semi-supervised topic models, anchored CorEx compares favorably. The document clustering of anchored CorEx is similar to, or better than, that of z-labels LDA and must/cannot link LDA. Across the disaster relief articles, anchored CorEx finds less coherent topics than the two LDA variants, while it finds similarly coherent topics as must/cannot link LDA on the 20 Newsgroups dataset.

5.2.2 Anchoring for Topic Representation

We now turn to studying how domain knowledge can be anchored to a single topic to help an otherwise dominated topic emerge, and how the anchoring parameter \beta affects that emergence. To discern this effect, we focus just on anchored CorEx along with the HA/DR articles and clinical health notes, datasets for which we have a domain expert lexicon.

We devise the following experiment: first, we determine the top five anchor words for each document label using the methodology described in Section 4.3. Unlike in the previous section, we do not filter these lists of ambiguous anchor words. Second, for each document label, we run an anchored CorEx topic model with that label's anchor words anchored to exactly one topic. We compare this anchored topic model to an unsupervised CorEx topic model using the same random seeds, thus creating a matched pair where the only difference is the treatment of anchor words. Finally, this matched pairs process is repeated 30 times, yielding a distribution for each metric over each label.

We use 50 topics when modeling the ReliefWeb articles and 30 topics when modeling the i2b2 clinical health notes. These values were chosen by observing diminishing returns to the total correlation explained by additional topics.

Figure 4: Effect of anchoring words to a single topic for one document label at a time as a function of the anchoring parameter \beta. Light gray lines indicate the trajectory of the metric for a given disaster or disease label. Thick red lines indicate the pointwise average across all labels for a fixed value of \beta.

In Figure 4 we show how the results of this experiment vary as a function of the anchoring parameter \beta for each disaster and disease type in the two datasets. Since there is heavy variance across document labels for each metric, we also examine a more detailed cross section of these results in Figure 5, where we set \beta = 5 for the clinical health notes and \beta = 10 for the disaster relief articles. As we show momentarily, disaster and disease types that benefit the most from anchoring were underrepresented pre-anchoring. Document labels that were well-represented prior to anchoring achieve only marginal gain. This results in the variance seen in Figure 4.

Figure 5: Cross-section results of the anchoring metrics from fixing \beta = 5 for the clinical health notes, and \beta = 10 for the disaster relief articles. Disaster and disease types are sorted by frequency, with the most frequent document labels appearing at the top. Error bars indicate 95% confidence intervals. The color bars provide context for each metric: topic overlap pre-anchoring, proportion of topic model runs where the anchored topic was the most predictive topic, and F1 score pre-anchoring.

A priori we do not know that anchoring will cause the anchor words to appear at the top of topics. So, we first measure how the topic overlap, the proportion of the top ten mutual information words that appear within the top ten words of the topics, changes before and after anchoring. From Figure 4 (row 1) we see that as \beta increases, more of these relevant words consistently appear within the topics. For the disaster relief articles, many disaster types see about two more words introduced, while in the clinical health notes the overlap increases by up to four words. Analyzing the cross section in Figure 5 (column 1), we see many of these gains come from disaster and disease types that appeared less in the topics pre-anchoring. Thus, we can sway the topic model towards less dominant themes through anchoring. Document labels that occur the most frequently are those for which the topic overlap changes the least.

Next, we examine whether these anchored topics are more coherent topics. To do so, we compare the coherence of the anchored topic with that of the most predictive topic pre-anchoring, i.e. the topic with the largest corresponding coefficient in magnitude of the logistic regression, when the anchored topic itself is most predictive. From Figure 4 (row 2), we see these results have more variance, but largely the anchored topics are more coherent. In some cases, the coherence is 1.5 to 2 times that of pre-anchoring. Furthermore, by the colors of the central panel of Figure 5, we find that the anchored topics are, in fact, often the most predictive topics for each document label. Similar to topic overlap, the labels that see the least improvement are those that appear the most and are already well-represented in the topic model.

Finally, we find that the anchored, more coherent topics can lead to modest gains in document classification. For the disaster relief articles, Figure 4 (row 3) shows that there are mixed results in terms of F1 score improvement, with some disaster types performing consistently better, and others performing consistently worse. The results are more consistent for the clinical health notes, where there is an average increase of about 0.1 in the F1 score, and
some disease types see an increase of up to 0.3 in F1. Given that we are only anchoring 5 words to the topic model, these are significant gains in predictive power.

Unlike the gains in topic overlap and coherence, the F1 score increases do not simply correlate with which document labels appeared most frequently. For example, we see in Figure 5 (column 3) that Tropical Cyclone exhibits the largest increase in predictive performance, even though it is also one of the most frequently appearing document labels. Similarly, some of the major gains in F1 for the disease types, and major losses in F1 for the disaster types, do not come from the most or least frequent document labels. Thus, when anchoring single topics within CorEx for document classification, it is important to examine how the anchoring affects prediction for individual document labels.

5.2.3 Anchoring for Topic Aspects

Finding topics that revolve around a word (such as a name or location) or a group of words can aid in understanding how a particular subject or event has been framed. We finish with a qualitative experiment where we disambiguate aspects of a topic by anchoring a set of words to multiple topics within the CorEx topic model. Note, must/cannot-link LDA cannot be used in this manner, and z-labels LDA would require us to know these aspects beforehand.

We consider tweets containing #Ferguson (case-insensitive), which detail reactions to the shooting of Black teenager Michael Brown by White police officer Darren Wilson on August 9th, 2014 in Ferguson, Missouri. These tweets were collected from the Twitter Gardenhose, a 10% random sample of all tweets, over the period August 9th, 2014 to November 30th, 2014. Since CorEx will seek maximally informative topics by exploiting redundancies, we removed duplicates of retweets, leaving us with 869,091 tweets. We filter these tweets of punctuation, stopwords, hyperlinks, usernames, and the 'RT' retweet symbol, and use the top 20,000 word types.

In the wake of both the shooting and the eventual non-indictment of Darren Wilson, several protests occurred. Some onlookers supported and encouraged such protests, while others characterized the protests as violent "riots." To disambiguate these different depictions, we train a CorEx topic model with 55 topics, anchoring "protest" and "protests" together to five topics, and "riot" and "riots" together to five topics with β = 2. These anchored topics are presented in Table 3.

Topic aspects of "protest"
1. protest, protests, peaceful, violent, continue, night, island, photos, staten, nights
2. protest, protests, #hiphopmoves, #cole, hiphop, nationwide, moves, for, anheuser, boeing
3. protest, protests, st, louis, guard, national, county, patrol, highway, city
4. protest, protests, paddy, covering, beverly, walmart, wagon, hills, passionately, including
5. protest, protests, solidarity, march, square, rally, #oakland, downtown, nyc, #nyc

Topic aspects of "riot"
6. riot, riots, unheard, language, inciting, accidentally, jokingly, watts, waving, dies
7. riot, black, riots, white, #tcot, blacks, men, whites, race, #pjnet
8. riot, riots, looks, like, sounds, acting, act, animals, looked, treated
9. riot, riots, store, looting, businesses, burning, fire, looted, stories, business
10. gas, riot, tear, riots, gear, rubber, bullets, military, molotov, armored

Table 3: Topic aspects around "protest" and "riot" from running a CorEx topic model with 55 topics and anchoring "protest" and "protests" together to five topics and "riot" and "riots" together to five topics with β = 2. Anchor words are shown in bold. Note, topics are not ordered by total correlation.

The anchored topics reflect different aspects of the framing of the "protests" and "riots," and are generally interpretable, despite the typical difficulty of extracting coherent topics from short documents using LDA (Tang et al., 2014). The "protest" topic aspects describe protests in St. Louis, Oakland, Beverly Hills, and parts of New York City (topics 1, 3, 4, 5), resistance by law enforcement (topics 3 and 4), and discussion of whether the protests were peaceful (topic 1). Topic 2 revolves around hip-hop artists who marched in solidarity with protesters.
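As a concrete illustration, the tweet filtering described above (deduplicating retweets, stripping hyperlinks, usernames, the 'RT' symbol, punctuation, and stopwords, then keeping the most frequent word types) can be sketched as follows. The regular expressions and the small stopword list are our own illustrative assumptions, not the paper's exact rules:

```python
import re
from collections import Counter

# Illustrative stopword list; the paper's actual list is not specified.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are"}

def clean_tweet(text):
    """Strip hyperlinks, @usernames, the 'RT' retweet symbol, and
    punctuation (keeping hashtags), then lowercase and drop stopwords."""
    text = re.sub(r"https?://\S+", " ", text)   # hyperlinks
    text = re.sub(r"@\w+", " ", text)           # usernames
    text = re.sub(r"\bRT\b", " ", text)         # retweet symbol
    text = re.sub(r"[^\w#\s]", " ", text)       # punctuation, keep '#'
    return [t for t in text.lower().split() if t not in STOPWORDS]

def build_vocab(tweets, max_types=20000):
    """Drop exact duplicate tweets (retweet copies), then keep the
    most frequent word types, as in the paper's setup."""
    unique = list(dict.fromkeys(tweets))        # dedupe, preserving order
    counts = Counter(tok for tw in unique for tok in clean_tweet(tw))
    vocab = [w for w, _ in counts.most_common(max_types)]
    return unique, vocab

tweets = [
    "RT @user: Protests continue in #Ferguson http://t.co/abc",
    "RT @user: Protests continue in #Ferguson http://t.co/abc",  # duplicate
    "Solidarity march downtown #Ferguson",
]
unique, vocab = build_vocab(tweets)
print(len(unique))               # → 2 tweets after deduplication
print(clean_tweet(tweets[0]))    # → ['protests', 'continue', '#ferguson']
```

Keeping hashtags intact matters here, since tokens like #tcot and #oakland carry much of the framing signal that surfaces in the anchored topics.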
The "riot" topic aspects discuss racial dynamics of the protests (topic 7) and suggest the demonstrations are dangerous (topics 8 and 9). Topic 10 describes the "riot" gear used in the militarized response to the Ferguson protesters, and topic 7 also hints at aspects of conservatism through the hashtags #tcot (Top Conservatives on Twitter) and #pjnet (Patriot Journalist Network).

As we see, anchored CorEx finds several interesting, non-trivial aspects around "protest" and "riot" that could spark additional qualitative investigation. Retrieving topic aspects through anchor words in this manner allows the user to explore different frames of complex issues, events, or discussions within documents. As with the other anchoring strategies, this has the potential to supplement qualitative research done by researchers within the social sciences and digital humanities.

6 Discussion

We have introduced an information-theoretic topic model, CorEx, that does not rely on any of the generative assumptions of LDA-based topic models. This topic model seeks maximally informative topics as encoded by their total correlation. We also derived a flexible method for anchoring word-level domain knowledge in the CorEx topic model through the information bottleneck. Anchored CorEx guides the topic model towards themes that do not naturally emerge, and often produces more coherent and predictive topics. Both CorEx and anchored CorEx consistently produce topics that are of comparable quality to LDA-based methods, despite only making use of binarized word counts.

Anchored CorEx is more flexible than previous attempts at integrating word-level information into topic models. Topic separability can be enforced by lightly anchoring disjoint groups of words to separate topics, topic representation can be promoted by assertively anchoring a group of words to a single topic, and topic aspects can be unveiled by anchoring a single group of words to multiple topics. The flexibility of anchoring through the information bottleneck lends itself to many other possible creative anchoring strategies that could guide the topic model in different ways. Different goals may call for different anchoring strategies, and domain experts can shape these strategies to their needs.

While we have demonstrated several advantages of the CorEx topic model over LDA, it does have some technical shortcomings. Most notably, CorEx relies on binary count data in its sparsity optimization, rather than the standard count data that is used as input into LDA and other topic models. While we have demonstrated that CorEx performs at the level of LDA despite this limitation, its effect would be more noticeable on longer documents. This can be partly overcome if one chunks such longer documents into shorter subdocuments prior to running the topic model. Our implementation also requires that each word appears in only one topic. These limitations are not fundamental limitations of the theory, but a matter of computational efficiency. In future work, we hope to remove these restrictions while preserving the speed of the sparse CorEx topic modeling algorithm.

As we have demonstrated, the information-theoretic approach provided via CorEx has rich potential for finding meaningful structure in documents, particularly in a way that can help domain experts guide topic models with minimal intervention to capture otherwise eclipsed themes. The lightweight and versatile framework of anchored CorEx leaves open possibilities for theoretical extensions and novel applications within the realm of topic modeling.

Acknowledgments

We would like to thank the Machine Intelligence and Data Science (MINDS) research group at the Information Sciences Institute for their help and insight during the course of this research. We also thank the Vermont Advanced Computing Core (VACC) for its computational resources. Finally, we thank the anonymous reviewers and the TACL action editors Diana McCarthy and Kristina Toutanova for their time and effort in helping us improve our work. Ryan J. Gallagher was a visiting research assistant at the Information Sciences Institute while performing this research. Ryan J. Gallagher and Greg Ver Steeg were supported by DARPA award HR0011-15-C-0115, and David Kale was supported by the Alfred E. Mann Innovation in Engineering Doctoral Fellowship.
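A minimal sketch of the chunk-and-binarize workaround mentioned above, assuming simple fixed-length windows over pre-tokenized text (the window size and vocabulary here are illustrative, not values from the paper):

```python
def chunk_document(tokens, window=200):
    """Split a long token sequence into subdocuments of at most `window` tokens."""
    return [tokens[i:i + window] for i in range(0, len(tokens), window)]

def binarize_counts(chunks, vocab):
    """Build a binary document-word matrix: entry (d, w) is 1 iff vocab
    word w appears at least once in chunk d. CorEx's sparsity optimization
    consumes exactly this kind of binarized data."""
    index = {w: j for j, w in enumerate(vocab)}
    X = []
    for chunk in chunks:
        row = [0] * len(vocab)
        for tok in chunk:
            j = index.get(tok)
            if j is not None:
                row[j] = 1
        X.append(row)
    return X

doc = ("storm flood rescue " * 100).split()   # a 300-token "long" document
chunks = chunk_document(doc, window=200)      # two subdocuments: 200 + 100 tokens
X = binarize_counts(chunks, ["storm", "flood", "rescue", "drought"])
print(X)   # → [[1, 1, 1, 0], [1, 1, 1, 0]]
```

Chunking recovers some of the count information lost to binarization: a word that occurs many times in a long document will appear in many of its subdocuments, so its frequency is partially reflected in the binary matrix.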
References

David Andrzejewski and Xiaojin Zhu. 2009. Latent Dirichlet Allocation with topic-in-set knowledge. In Proceedings of the NAACL HLT 2009 Workshop on Semi-Supervised Learning for Natural Language Processing, pages 43–48. Association for Computational Linguistics.

David Andrzejewski, Xiaojin Zhu, and Mark Craven. 2009. Incorporating domain knowledge into topic modeling via Dirichlet forest priors. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 25–32.

David Andrzejewski, Xiaojin Zhu, Mark Craven, and Benjamin Recht. 2011. A framework for incorporating general domain knowledge into latent Dirichlet allocation using first-order logic. In Proceedings of the International Joint Conference on Artificial Intelligence, volume 22, page 1171.

Sanjeev Arora, Rong Ge, and Ankur Moitra. 2012. Learning topic models – going beyond SVD. In 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science (FOCS), pages 1–10. IEEE.

Sanjeev Arora, Rong Ge, Yonatan Halpern, David M. Mimno, Ankur Moitra, David Sontag, Yichen Wu, and Michael Zhu. 2013. A practical algorithm for topic modeling with provable guarantees. In Proceedings of the International Conference on Machine Learning, pages 280–288.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

Wray Buntine and Aleks Jakulin. 2006. Discrete component analysis. In Subspace, Latent Structure and Feature Selection, pages 1–33. Springer.

Jonathan Chang, Sean Gerrish, Chong Wang, Jordan L. Boyd-Graber, and David M. Blei. 2009. Reading tea leaves: How humans interpret topic models. In Advances in Neural Information Processing Systems, pages 288–296.

Wendy W. Chapman, Will Bridewell, Paul Hanbury, Gregory F. Cooper, and Bruce G. Buchanan. 2001. A simple algorithm for identifying negated findings and diseases in discharge summaries. Journal of Biomedical Informatics, 34(5):301–310.

Peixian Chen, Nevin L. Zhang, Leonard K. M. Poon, and Zhourong Chen. 2016. Progressive EM for latent tree models and hierarchical topic detection. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 1498–1504.

Manhong Dai, Nigam H. Shah, Wei Xuan, Mark A. Musen, Stanley J. Watson, Brian D. Athey, Fan Meng, et al. 2008. An efficient solution for mapping free text to ontology terms. AMIA Summit on Translational Bioinformatics, 21.

Chris Ding, Tao Li, and Wei Peng. 2008. On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing. Computational Statistics & Data Analysis, 52(8):3913–3927.

James Foulds, Shachi Kumar, and Lise Getoor. 2015. Latent topic networks: A versatile probabilistic programming framework for topic models. In Proceedings of the International Conference on Machine Learning, pages 777–786.

Nir Friedman, Ori Mosenzon, Noam Slonim, and Naftali Tishby. 2001. Multivariate information bottleneck. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 152–161.

Thomas L. Griffiths, Michael I. Jordan, Joshua B. Tenenbaum, and David M. Blei. 2004. Hierarchical topic models and the nested Chinese restaurant process. In Advances in Neural Information Processing Systems, pages 17–24.

Yoni Halpern, Youngduck Choi, Steven Horng, and David Sontag. 2014. Using anchors to estimate clinical state without labeled data. In AMIA Annual Symposium Proceedings. American Medical Informatics Association.

Yoni Halpern, Steven Horng, and David Sontag. 2015. Anchored discrete factor analysis. arXiv preprint arXiv:1511.03299.

Nathan Hodas, Greg Ver Steeg, Joshua Harrison, Satish Chikkagoudar, Eric Bell, and Courtney Corley. 2015. Disentangling the lexicons of disaster response in Twitter. In The 3rd International Workshop on Social Web for Disaster Management (SWDM'15).

Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. 2013. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347.

Thomas Hofmann. 1999. Probabilistic latent semantic analysis. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 289–296.

Jagadeesh Jagarlamudi, Hal Daumé III, and Raghavendra Udupa. 2012. Incorporating lexical priors into topic models. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 204–213. Association for Computational Linguistics.

Moontae Lee and David Mimno. 2014. Low-dimensional embeddings for interpretable anchor-based topic inference. In Proceedings of Empirical Methods in Natural Language Processing, pages 1319–1328.

Moshe Lichman. 2013. UC Irvine Machine Learning Repository.
Jon D. McAuliffe and David M. Blei. 2008. Supervised topic models. In Advances in Neural Information Processing Systems, pages 121–128.

Shike Mei, Jun Zhu, and Jerry Zhu. 2014. Robust RegBayes: Selectively incorporating first-order logic domain knowledge into Bayesian models. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 253–261.

David Mimno, Hanna M. Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. 2011. Optimizing semantic coherence in topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 262–272. Association for Computational Linguistics.

Thang Nguyen, Yuening Hu, and Jordan L. Boyd-Graber. 2014. Anchors regularized: Adding robustness and extensibility to scalable topic-modeling algorithms. In Proceedings of the Association for Computational Linguistics, pages 359–369.

Thang Nguyen, Jordan Boyd-Graber, Jeffrey Lund, Kevin Seppi, and Eric Ringger. 2015. Is your anchor going up or down? Fast and accurate supervised topic models. In Proceedings of the North American Chapter of the Association for Computational Linguistics.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modeling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50.

Kyle Reing, David C. Kale, Greg Ver Steeg, and Aram Galstyan. 2016. Toward interpretable topic discovery via anchored correlation explanation. ICML Workshop on Human Interpretability in Machine Learning.

Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. 2004. The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 487–494.

Jian Tang, Zhaoshi Meng, Xuanlong Nguyen, Qiaozhu Mei, and Ming Zhang. 2014. Understanding the limiting factors of topic modeling via posterior contraction analysis. In Proceedings of the International Conference on Machine Learning, pages 190–198.

Naftali Tishby, Fernando C. Pereira, and William Bialek. 1999. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368–377.

Greg Ver Steeg and Aram Galstyan. 2014. Discovering structure in high-dimensional data through correlation explanation. In Advances in Neural Information Processing Systems, pages 577–585.

Greg Ver Steeg and Aram Galstyan. 2015. Maximally informative hierarchical representations of high-dimensional data. In Artificial Intelligence and Statistics, pages 1004–1012.