Operazioni dell'Associazione per la Linguistica Computazionale, 2 (2014) 1–14. Redattore di azioni: Lillian Lee.
Submitted 3/2013; Revised 6/2013; Pubblicato 2/2014. C
2014 Associazione per la Linguistica Computazionale.
(cid:13)
HeterogeneousNetworksandTheirApplications:Scientometrics,NameDisambiguation,andTopicModelingBenKing,RahulJhaDepartmentofEECSUniversityofMichiganAnnArbor,MI{benking,rahuljha}@umich.eduDragomirR.RadevDepartmentofEECSSchoolofInformationUniversityofMichiganAnnArbor,MIradev@umich.eduAbstractWepresentheterogeneousnetworksasawaytounifylexicalnetworkswithrelationaldata.WebuildaunifiedACLAnthologynetwork,tyingtogetherthecitation,authorcollaboration,andterm-cooccurencenetworkswithaffiliationandvenuerelations.Thisrepresentationprovestobeconvenientandallowsproblemssuchasnamedisambiguation,topicmodeling,andthemea-surementofscientificimpacttobeeasilysolvedusingonlythisnetworkandoff-the-shelfgraphalgorithms.1IntroductionGraph-basedmethodshavebeenusedtogreatef-fectinNLP,onproblemssuchaswordsensedisam-biguation(Mihalcea,2005),summarization(ErkanandRadev,2004),anddependencyparsing(McDon-aldetal.,2005).Mostpreviousstudiesofnetworksconsidernetworkswithonlyasingletypeofnode,andinsomecasesusinganetworkwithasingletypeofnodecanbeanoversimplifiedviewifitignoresothertypesofrelationships.Inthispaperwewilldemonstrateheterogeneousnetworks,networkswithmultipledifferenttypesofnodesandedges,alongwithseveralapplicationsofthem.Theapplicationsinthispaperarenotpre-sentedsomuchasrobustattemptstoout-performthecurrentstate-of-the-art,butratherattemptsatbeingcompetitiveagainsttopmethodswithlittleeffortbe-yondtheconstructionoftheheterogeneousnetwork.Throughoutthispaper,wewillusethedatafromtheACLAnthologyNetwork(AAN)(Birdetal.,2008;Radevetal.,2013),whichcontainsadditionalmetadatarelationshipsnotfoundintheACLAnthol-ogy,asatypicalheterogeneousnetwork.Theresultsinthispapershouldbegenerallyapplicabletootherheterogeneousnetworks.1.1HeterogeneousAANschemaWebuildaheterogeneousgraphG(V,E)fromAAN,whereVisthesetofverticesandEisthesetofedgesconnectingvertices.Avertexcanbeoneoffivesemantictypes:{paper,author,venue,institution,term}.Anedgecanalsobeoneoffivetypes,eachconnectingdifferenttypesofvertices:•author—[writes]—paper•paper—[cites]—paper•paper—[publishedin]—venue1•author—[affiliatedwith]—institution2•paper—[contains]—termAllofthisdata,exceptfortheterms,isavailableforallpapersinthe2009releaseofAAN.TermsareextractedfromtitlesbyrunningTextRank(Mihal-ceaandTarau,2004)onNP-chunksfromtitlesandmanuallyfilteringoutbadterms.Weshowtheusefulnessofthisrepresentationinseveralapplications:themeasurementofscien-tificimpact(Section2),namedisambiguation(Sec-tion3),andtopicmodeling(Section4).Thehetero-geneousnetworkrepresentationprovidesasimpleframeworkforcombininglexicalnetworks(likethetermco-occurencenetwork)withmetadatarelationsfromasourcelikeAANandallowsustobegintodevelopNLP-awaremethodsforproblemslikesci-entometricsandnamedisambiguation,whicharenotusuallyframedinanNLPperspective.1ForajointmeetingofvenuesAandBpublishingapaperx,twoedges(X,UN)E(X,B)arecreated.2Author-affiliationedgesareweightedaccordingtothenumberofpapersanauthorhaspublishedfromaninstitution.
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
1
6
1
1
5
6
6
8
2
4
/
/
T
l
UN
C
_
UN
_
0
0
1
6
1
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
9
S
e
P
e
M
B
e
R
2
0
2
3
2
2ScientificImpactMeasurementThestudyofscientometrics,whichattemptstoquantifythescientificimpactofpapers,authors,etc.hasreceivedmuchattentionrecently,evenwithintheNLPcommunity.Inthepastfewyears,therehavebeenmanyproposedmeasuresofscientificim-pactbasedonrelationshipsbetweenentities.Intu-itively,amodelthatcantakeintoaccountmanydif-ferenttypesofrelationshipsbetweenentitiesshouldbeabletomeasurescientificimpactmoreaccu-ratelythansimplermeasureslikecitationcountsorh-index.WeproposeusingPagerankontheheterogeneousAAN(Pageetal.,1999)tomeasurescientificimpact.Sincechangesinthenetworkschemacanaffecttherelativerankingsbetweendifferenttypesofentities,thismethodisprobablynotappropriateforcompar-ingentitiesoftwodifferenttypesagainsteachother.Butbetweennodesofthesametype,thismeasureisanappropriate(andaswewillshow,accurate)waytocompareimpacts.Weseethismethodasafirstlogicalstepinthedirectionofheterogeneousnetwork-basedsciento-metrics.Thismethodcouldeasilybeextendedtouseadirectedschema(KurlandandLee,2005)oraschemathatisawareofthelexicalcontentofcitationsentences,suchassentiment-basedsignednetworks(Hassanetal.,2012).Determiningtheintrinsicqualityofscientificim-pactmeasurescanbedifficultsincethereisnowaytocollectgoldstandardmeasurementsforreal-worldentities.Previousstudieshaveattemptedtoshowthattheirmeasuresgivehighscorestoafewknownhigh-impactentities,e.g.Nobelprizewin-ners(Hirsch,2005),orhaveperformedastatisticalcomponentanalysistofindthemostimportantmea-suresinagroupofrelatedstatistics(Bollenetal.,2009).Ourapproach,instead,istogeneratereal-isticdatafromsyntheticentitieswhoseimpactsareknown.Wehadconsideredalternativeformulationsthatdidnotrelyonsyntheticdata,buteachofthempresentedproblems.WhenweattemptedmanualprominenceannotationforAANdata,theinter-judgeagreement(measuredbySpearmancorrela-tion)inourexperimentsrangedfromdecent(0.9inthecaseofinstitutions)topoor(0.3forauthors)tonearlyrandom(0.03forterms),fartoolowtouseinmostcases.Wealsoconsideredevaluatingprominencemeasuresbytheirabilitytopredictfu-turecitationstoanentity.Citationsareoftenusedasaproxyforimpact,butourmeasurementshavefoundthatcorrelationbetweenpastcitationsandfu-turecitationsistoohighforcitationpredictiontobeameaningfulevaluation3.2.1CreatingasyntheticAANInnetworktheory,acommontechniquefortestingnetworkalgorithmswhenjudgmentsofreal-worlddataareexpensiveorimpossibletoobtainistotestthealgorithmonasyntheticnetwork.Tocreatesuchasyntheticnetwork,theauthorsdefineasimple,butrealisticgenerativeprocessbywhichthereal-worldnetworksofinterestmayarise.Thepropertiesofthenetworkaremeasuredtoensurethatitreplicatescertainobservablebehaviorsofthereal-worldnet-work.Theycanthentestnetworkalgorithmstoseehowwelltheyareabletorecoverthehiddenparam-etersthatgeneratedthesyntheticnetwork.(Pastor-SatorrasandVespignani,2001;Clausetetal.,2009;KarrerandNewman,2011)Wetakeatwo-stepapproachtogeneratingthissyntheticdata,firstgeneratingentitieswithknownimpacts,andsecond,linkingtheseentitiestogetheraccordingtotheirlatentimpacts.Ourheuristicisthathighimpactentitiesshouldbelinkedtootherhighimpactentitiesandvice-versa.Asinthenet-worktheoryliterature,wemustshowthatthisdatareflectsimportantpropertiesobservedinthetrueAAN.Onesuchpropertyisthatthenumberofcitationsperpaperfollowsapowerlawdistribution(Redner,1998).WeobservethisbehaviorinAANalongwithseveralothersmall-worldbehaviors,suchasasmalldiameter,asmallaverageshortestpathlength,andahighclusteringcoefficientinthecoauthorshipgraph.Westrivetoreplicatethesepropertiesinoursyn-theticdata.3Mostexistingimpactmeasurementsrequireaccesstoatleastoneyear’sworthofcitationinformation.TheSpearmancorrelationbetweenthenumberofcitationsreceivedafteroneyearandafterfiveyearsis0.79withcorrelationbetweensuc-cessiveyearsashighas0.99.Practicallythismeansthatthemeasuresthatbestcorrelatewithcitationsafterfiveyearsareexactlythosethatbestcorrelatewithcitationsafteroneyear.
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
1
6
1
1
5
6
6
8
2
4
/
/
T
l
UN
C
_
UN
_
0
0
1
6
1
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
9
S
e
P
e
M
B
e
R
2
0
2
3
3
Sincescientificimpactmeasuresattempttoquan-tifythetrueimpactofentities,wecanusethesemea-surestohelpunderstandhowthetrueimpactmea-suresaredistributedacrossdifferententities.Infact,citationcounts,beingagoodestimateofimpact,canbeusedtogeneratetheselatentimpactvariablesforeachentity.Foreachtypeofentity(papers,authors,istituzioni,venues,andterms),wecreatealatentimpactbysamplingfromtheappropriatecitationcountdistribution.Aftersampling,alltheimpactsarenormalizedtofallinthe[0,1]interval,withthehighest-impactentityofeachtypehavingalatentimpactof1.Additivesmoothingisusedtoavoidhavinganimpactof0.Oncewehavecreatedtheentities,ourmethodforplacingedgesismostsimilartotheErd˝os-R´eynimethodforcreatingrandomgraphs(Erd˝osandR´enyi,1960),inwhichedgesaredistributeduniformlyatrandombetweenpairsofvertices.In-steadofdistributinglinksuniformly,linksbetweenentitiesaresampledproportionallytoI(UN)IO(B)(1−(IO(UN)−I(B))2),whereI(X)isthelatentimpactofentityx.WetriedseveralotherformulaethatfailedtoreplicatethepropertiesoftherealAAN.TheI(UN)IO(B)partoftheformulaabovereflectsapref-erencefornodesofanytypetoconnectwithhigh-impactentities(e.g.,majorconferencesreceivemanysubmissionseventhoughmostsubmissionswillberejected),butthe1−(IO(UN)−I(B))2partalsoreflectstherealitythatentitiesofsimilarpromi-nencearemostlikelytoattachtoeachother(e.g.,well-knownauthorspublishinmajorconferences,whilelesswell-knownauthorsmaypublishmostlyinlesser-knownworkshops).Usingthisdistribution,werandomlysamplelinksbetweenpapersandauthors;authorsandinstitu-tions;papersandvenues;andpapersandterms.Theonlyexceptiontothiswaspaper-to-papercitationlinks,forwhichwedidnotexpectthissamebe-haviortoapply,aslow-impactpapersregularlycitehigh-impactpapers,butnotvice-versa.Tomodelci-tations,weselectedcitingpapersuniformlyatran-domandcitedpapersinproportiontotheirimpacts.(AlbertandBarab´asi,2002)Finalmente,wegeneratedanetworkequalinsizetoAAN,thatis,withtheexactsamenumbersofpa-pers,authors,etc.andtheexactsamenumberofRelationshipTruevalueSynth.valuePaper-citationspowerlawcoeff.1.822.12Diameter98Avg.shortestpath4.274.05Collaborationnetworkclusteringcoeff.0.340.26Table1:NetworkpropertiesofthesyntheticAANcomparedwiththetrueAAN.paper-authorlinks,paper-venuelinks,etc.Table1comparestheobservedpropertiesofthetrueAANwiththeobservedpropertiesofthissyntheticversionofAAN.Noneofthestatisticsareexactmatches,butwhenbuildingrandomgraphs,itisnotuncommonformeasurestodifferbymanyordersofmagnitude,soamodelthathasmeasuresthatareonthesameorderofmagnitudeastheobserveddataisgenerallyconsideredtobeadecentmodel(NewmanandPark,2003).2.2MeasuringimpactonthesyntheticAANThisrandomnetworkis,ofcourse,stillimperfectinsomeregards.Firstofall,ithasnotimeaspect,soitisnotpossibleforimpacttochangeovertime,whichmeanswecannottestagainstsomeimpactmeasuresthathaveatimecomponentlikeCiteR-ank(MaslovandRedner,2008).Secondo,therearesomeconstraintspresentintherealworldthatarenotenforcedhere.Becausetheedgesarerandomlyselected,somepapershavenovenues,whileothershavemultiplevenues.Thereisalsonothingtoen-forcecertainconsistencies,suchasauthorspublish-ingmanypapersfromrelativelyfewinstitutions,orrepeatedlycollaboratingwiththesameauthors.WehadalsoconsideredusingexistingrandomgraphmodelssuchastheBarab´asi-Albertmodel(Barab´asiandAlbert,1999),whichareknowntoproducegraphsthatexhibitpowerlawbehavior.Thesemodels,Tuttavia,donotprovideawaytore-spectthelatentimpactsoftheentities,astheyaddlinksinproportiononlytothenumberofexistinglinksanodehas.Wemeasurethequalityofimpactmeasuresbycomparingrankedlists:theorderingoftheentities
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
1
6
1
1
5
6
6
8
2
4
/
/
T
l
UN
C
_
UN
_
0
0
1
6
1
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
9
S
e
P
e
M
B
e
R
2
0
2
3
4
PapermeasureAgreementHeterogeneousnetworkPagerank0.773CitationnetworkPagerank0.558Citationcount0.642AuthormeasureAgreementHeterogeneousnetworkPagerank0.461CoauthorshipnetworkPagerank0.244h-index(Hirsch,2005)0.292Aggregatedcitationcount0.236i10-index0.235InstitutionmeasureAgreementHeterogeneousnetworkPagerank0.373h-index(Mitra,2006)0.334Aggregatedcitationcount0.327VenuemeasureAgreementHeterogeneousnetworkPagerank0.449h-index(Braunetal.,2006)0.425Aggregatedcitationcount0.370Impactfactor0.092VenuecitationnetworkPagerank(Bollenetal.,2006)0.366Table2:Agreementofvariousimpactmeasureswiththetruelatentimpact.bytheirtrue(buthidden)impactagainsttheirorder-ingaccordingtotheimpactmeasure.Theagree-mentbetweentheselistsismeasuredbyKendall’sTau.Table2comparesseveralwell-knownimpactmeasureswithourimpactmeasure,Pagerankcen-tralityontheheterogeneousAANnetwork.Wefindthatsomepopularmethods,suchash-index(Hirsch,2005)aretoocoarsetoaccuratelycapturemuchoftheunderlyingvariation.ThereisaversionofKendall’sTauthataccountsforties,andwhilethismetricslightlyhelpsthecoarsermeasures,Pagerankontheheterogeneousnetworkisstilltheclearwin-ner.Whencomparingdifferentorderingmethods,itisnaturaltowonderwhichofentitiestheorderingsdisagreeon.Ingeneral,non-heterogeneousmea-sureslikeh-indexorcollaborationnetworkPager-ank,whichonlyfocusononetypeofrelationshipcansufferwhentheentityinquestionhasanimpor-tantrelationshipofanothertype.Forexample,ifanauthorishighlycited,butmostlyworksalone,his19851990199520002005201020406080100120RelativePagerankACLEMNLPCOLINGNAACLFigure1:Evolutionofconferenceimpacts.They-axismeasuresrelativePagerank,theentity’sPager-ankrelativetotheaveragePagerankinthatyear.contributionwouldbeundervaluedinthecollabo-rationnetwork,butwouldbemoreaccurateintheheterogeneousnetwork.Themajorityofthedifferencesbetweentheim-pactmeasures,Anche se,tendtobeinhowtheyhan-dleentitiesoflowprominence.Itseemsthat,forthemostpart,thereisrelativelylittledisagreementintheorderingsofhigh-impactentitiesbetweendiffer-entimpactmeasures.Thatis,mosthighlyprominententitiestendtobehighlyratedbymostmeasures.Butwhenanauthororapaper,forexample,onlyhasoneortwocitations,itcanbeadvantageoustolookatmoretypesofrelationshipsthanjustcitations.Thepapermaybewrittenbyanotherwiseprominentauthor,orpublishedatawell-knownvenue,andhav-ingmanytypesofrelationsatitsdisposalcanhelpamethodlikeheterogeneousnetworkPagerankbetterdistinguishbetweentwolow-prominenceentities.2.3Top-rankedentitiesaccordingtoheterogeneousnetworkPageRankTable3showsthepapers,authors,istituzioni,venues,andtermsthatreceivedthehighestPager-ankintheheterogeneousAAN.Itisobviousthatthetop-rankedentitiesinthisnetworkarenotsimplythemosthighlycitedentities.Thisrankingalsodoesnothaveanytimebiastowardtheentitiesthatarecurrentlyprominent,assomeofthetopauthorsweremoreprolificinprevi-ousdecadesthanatthecurrenttime.WealsoseethiseffectwithCOLING,whichformanyoftheearlyyears,istheonlyvenueintheACLAnthology.
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
1
6
1
1
5
6
6
8
2
4
/
/
T
l
UN
C
_
UN
_
0
0
1
6
1
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
9
S
e
P
e
M
B
e
R
2
0
2
3
5
TopPapersTopAuthorsTopInstitutionsTopVenuesTopTerms−BuildingALargeAnnotatedCorpusOfEnglish:ThePennTreebank415Jun’ichiTsujii48CarnegieMellonUniversity41COLING−translation−TheMathematicsOfStatisticalMachineTranslation:ParameterEstimation47AravindK.Joshi41UniversityofEdinburgh51ACL43speech−Attention,Intentions,AndTheStructureOfDiscourse418RalphGrishman52UniversityofPennsylvania42HLT51parsing−AMaximumEntropyApproachToNaturalLanguageProcessing475HitoshiIsahara52MassachusettsInstituteofTechnology44EACL51machinetranslation−BLEU:aMethodforAutomaticEvaluationofMachineTranslation420YujiMatsumoto412SaarlandUniversity47LREC43generation−AMaximum-Entropy-InspiredParser47KathleenR.McKeown52IBMT.J.WatsonResearchCenter−NAACL43evaluation42AStochasticPartsProgramAndNounPhraseParserForUnrestrictedText413EduardHovy439CNRS53EMNLP46grammar51ASystematicComparisonofVariousStatisticalAlignmentModels410ChristopherD.Manning426UniversityofTokyo55ComputationalLinguistics416dialogue44Transformation-BasedError-DrivenLearningandNaturalLanguageProcessing:aCaseStudyinPart-of-SpeechTagging493YorickWilks54StanfordUniversity44IJCNLP410knowl-edge41AMaximumEntropyModelforPart-of-SpeechTagging59HermannNey43BBNTechnologies41WorkshoponSpeechandNaturalLanguage41discourseTable3:TheentitiesofeachtypereceivingthehighestscoresfromtheheterogeneousnetworkPagerankimpactmeasurealongwiththeirrespectivechangesinrankingwhencomparedtoasimplecitationcountmeasure.Onepossiblewaytoaddressthisistouseanarrowertimewindowwhencreatingthegraph,suchasonlyincludingedgesfromthepreviousfiveyears.Weapplythistechniqueinthefollowingsection.2.4EntityimpactevolutionTheheterogeneousgraphformalismalsoprovidesanaturalwaytostudytheevolutionofimpactovertime,asin(Halletal.,2008),butatamuchfinergranularity.Halletal.measuredtheyear-by-yearprominenceofstatisticaltopics,butwecanmeasureyear-by-yearprominenceforanyentityinthegraph.Tomeasuretheevolutionofimpactsovertheyears,weiterativelycreateyear-by-yearversionsoftheheterogeneousAAN.Eachofthesegraphscon-tainsallentitiesalongwithalledgesoccurringinafiveyearwindow.Duetospace,wecannotcom-prehensivelyexhibitthistechniqueandthedataitproduces,butasabriefexample,inFigure1,weshowhowtheimpactsofsomemajorNLPconfer-enceschangesovertime.ThegraphshowsthatNAACLandEMNLPhavebeensteadilygainingprominencesincetheirintro-ductions,butalsoshowsthatACLhashadtomakeupalotofgroundsince1990tosurpassCOLING.Wealsonoticethatallthemajorconferenceshavegrowninimpactsince2005,andbelievethatasthefieldcontinuestogrow,themajorconferenceswillcontinuetobecomemoreandmoreimportant.3NameDisambiguationWeframenetworknamedisambiguationinalinkpredictionsetting(Taskaretal.,2003;Liben-NowellandKleinberg,2007).Theproblemsofnamedis-ambiguationandlinkpredictionsharemanychar-acteristics,andwehavefoundthatiftwoambigu-ousnamenodesarecloseenoughtobeselectedbyalink-predictionmethod,thentheylikelycorrespondtothesamereal-worldauthor.Weintendtoshowthattheheterogeneousbiblio-graphicnetworkcanbeusedtobetterdisambiguateauthornamesthantheauthorcollaborationnetwork.Theheterogeneousnetworkforthisproblemcon-tainspapers,authors,terms,venues,andinstitutions.Wecompareseveralwell-knownnetworksimilaritymeasuresfromlinkpredictionbytransformingthe
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
1
6
1
1
5
6
6
8
2
4
/
/
T
l
UN
C
_
UN
_
0
0
1
6
1
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
9
S
e
P
e
M
B
e
R
2
0
2
3
6
NetworkDistanceMeasurePrecisionRecallF1-scoreRandindexPurityNMIHeterogeneousTruncatedCommuteTime0.590.780.630.630.710.43HeterogeneousShortestPath0.900.790.830.870.940.76HeterogeneousPropFlow0.890.830.840.870.930.77CoauthorshipTruncatedCommuteTime0.470.800.540.470.600.18CoauthorshipShortestPath0.540.730.600.610.670.31CoauthorshipPropFlow0.570.760.640.660.710.43CoauthorshipGHOST0.890.600.690.810.940.63Table4:Performanceofdifferentnetworksanddistancemeasuresontheauthornamedisambiguationtask.Theperformancemeasuresareaveragedoverthesetsoftwo,three,andfourauthors.Randindexisfrom(Rand,1971)andNMIisanabbreviationfornormalizedmutualinformation(StrehlandGhosh,2003)similaritiestodistancesandinducingclustersofau-thorsbasedonthesedistances.Wecomparethreedistancemeasures:shortestpath,truncatedcommutetime(Sarkaretal.,2008),andPropFlow(Lichtenwalteretal.,2010).Short-estpathdistancecanbeausefulmetricforauthordisambiguationbecauseitissmallwhentwoam-biguousnodesareneighborsinthegraphorshareaneighbor.Itsdownsideisthatitonlyconsidersonepathbetweennodes,theshortest,andcannottakeadvantageofthefactthattheremaybemanyshortpathsbetweentwonodes.Truncatedcommutetimeisavariantofcommutetimewhereallpathslongerthansomethresholdaretruncated.Thetruncationthresholdlshouldbesetsuchthatnosemanticallymeaningfulpathistrun-cated.Weuseavalueoftenforlintheheteroge-neousgraphandthreeinthecoauthorshipgraph4.Theadvantageoftruncatedcommutetimeoveror-dinarycommutetimeissimplercalculation,asnopathslongerthanlneedbeconsidered.Thedown-sideofthismethodisthatlargebranchingfactorstendtoleadtolessagreementbetweencommutetimeandtruncatedcommutetime.PropFlowisaquantitythatmeasurestheproba-bilitythatanon-intersectingrandomwalkstartingatnodeareachesnodebinlstepsorfewer,wherelisagainathreshold.Asbefore,lshouldbeaboundonthelengthofsemanticallymeaningfulpaths,soweusethesamevaluesforlaswithtruncatedcommutetime.Ofcourse,PropFlowisnotametric,whichis4Thisisastandardcoauthorshipgraphwiththeedgeweightsequaltothenumberofpublicationssharedbetweenauthors.Theheterogeneousnetworkdoesnothaveauthor-to-authorlinks,asauthorsarelinkedbypapernodes.requiredforsomeclusteringmethods.WeusethefollowingequationtotransformPropFlowtoamet-ric:D(UN,B)=1PropFlow(UN,B)−1.Witheachofthedistancemeasures,weapplythesameclusteringmethod:partitioningaroundmedoids,withthenumberofclustersautomaticallydeterminedusingthegapstatisticmethod(Tibshi-ranietal.,2001).Wecreatethenulldistributionneededforthegapstatisticmethodbymanyitera-tionsofrandomlysamplingdistancesfromthecom-pletedistancematrixbetweenallnodesinthegraph.Thegapstatisticmethodautomaticallyselectsthenumberofclustersfromtwo,three,orfourauthorclusters.WecompareourmethodsagainstGHOST(Fanetal.,2011),ahigh-performanceauthordisambigua-tionmethodbasedonthecoauthorshipgraph.3.1DataTogeneratenamedisambiguationdata,weusethepseudowordmethodof(Galeetal.,1992).Specif-ically,wechoosetwoormorecompletelyrandomauthorsandconflatethembygivingallinstancesofbothauthorsthesamename.Weleteachpaperwrittenbythispseudoauthorbeaninstancetobeclustered.Theclustersproducedbyanyauthordis-ambiguationmethodcanthenbecomparedagainstthepapersactuallywrittenbyeachofthetwoau-thors.Thismethod,ofcourse,reliesonhavingalloftheunderlyingauthorscompletelydisambiguated,whichAANprovides.Thismethodisusedtocreate100distambiguationsetswithtwoauthors,100forthreeauthors,and100forfourauthors.
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
1
6
1
1
5
6
6
8
2
4
/
/
T
l
UN
C
_
UN
_
0
0
1
6
1
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
9
S
e
P
e
M
B
e
R
2
0
2
3
7
3.2ResultsTable4showstheperformanceofauthornamedis-ambiguationwithdifferentnetworksanddistancemetrics.F1-scoreisthemeasurethatismostof-tenusedtocompareauthordisambiguationmethods.BothPropFlowandshortestpathsimilarityontheheterogeneousnetworkperformquitewellaccord-ingthismeasure,aswellastheotherreportedmea-sures.Whilecomparablerecallcanbeachievedus-ingonlythecoauthorshipgraph,theheterogeneousgraphallowsformuchhigherprecision.4RandomwalktopicmodelHerewepresentatopicmodelbasedentirelyongraphrandomwalks.Thismethodisnottrulyastatisticalmodelastherearenostatisticalparame-tersbeinglearned,butratheratopic-discoveryand-assignmentmethod,attemptingtosolvethesameproblemasstatisticaltopicmodelssuchasproba-bilisticlatentsemanticanalysis(pLSA)(Hofmann,1999)orlatentDirichletallocation(LDA)(Bleietal.,2003).Intheabsenceofbetterterminology,weusethenamerandomwalktopicmodel.Whilethismethoddoesnothavetherobustmath-ematicalfoundationthatstatisticaltopicmodelspos-sess,initsfavorithasmodularity,simplicity,andinterpretability.Thislanguagemodelismodularasitcompletelyseparatesthediscoveryoftopicsfromtheassociationoftopicswithentities.Itissim-plebecauseitrequiresonlyaclusteringalgorithmandrandomwalkalgorithms,insteadofcomplexin-ferencealgorithms.Themethodalsodoesnotre-quireanymodificationifthetopologyofthenet-workchanges,whereasstatisticalmodelsmayneedanentirelydifferentinferenceprocedureif,e.g.,au-thortopicsaredesiredinadditiontopapertopics.Thirdlythismethodiseasilyinterpretablewithtop-icsprovidedbyclusteringintheword-relatednessgraphandtopicassociationbasedonrandomwalksfromentitiestotopics.4.1TopicsfromwordgraphclusteringFromthesetofACLanthologytitles,wecreatetwographs:(1)awordrelatednessgraphbycre-atingaweightedlinkbetweeneachpairofwordscorrespondingtothePropFlow(Lichtenwalteretal.,2010)measurebetweenthemonthefullheteroge-neousgraphand(2)awordco-occurencegraphbycreatingaweightedlinkbetweeneachpairofwordscorrespondingtothenumberoftitlesinwhichbothwordsoccur.BothofthesegraphsarethenclusteredusingGraphFactorizationClustering(GFC).GFCisasoftclusteringalgorithmforgraphsthatmodelsgraphedgesasamixtureoflatentnode-clusterassociationvariables.(Yuetal.,2006)GivenawordgraphGwithverticesVandad-jacencymatrix[w]ij,GFCattemptstofitabipar-titegraphK(V,U)withadjacencymatrix[B]ijontothisdata,withthemnodesofUrepresentingtheclusters.WhereasinG,similaritybetweentwowordsiandjcanbemeasuredwithwij,wecansimilarlymeasuretheirsimilarityinKwithw0ij=Pmp=1bipbjpλpwhereλp=Pni=1bipisthedegreeofvertexp∈U.Essentiallythebipartitegraphattemptstoapprox-imatethetransitionprobabilitybetweeniandjinGwiththesumoftransitionprobabilitiesfromitojthroughanyofthemnodesinU.Yu,etal.(2006)presentanalgorithmforminimizingthedivergencedistance‘(X,Y)=Pij(xijlogxijyij−xij+yij)be-tween[w]ijand[w0]ij.WerunGFCwiththisdistancemetricandm=100clustersonthewordgraphuntilconvergence(changeinlog-likelihood<0.1%).Afterconver-gence,thenodesinUbecometheclustersandtheweightsbip(constrainedtosumto1foreachclus-ter)becomethetopic-wordassociationscores.ExamplesofsometopicsfoundbythismethodareshowninTable5.Frommanualinspectionofthesetopics,wefoundthemtobeverymuchliketopicscreatedbystatisticaltopicmodels.Wefindinstancesofallthetypesoftopicslistedin(Mimnoetal.,2011):chained,intruded,random,andunbal-anced.ForanevaluationofthesetopicsseeSec-tion4.3.1.4.2Entity-topicassociationToassociateentitieswithtopics,wefirstcreatetheheterogeneousnetworkasinprevioussections,addinglinksbetweenpapersandtheirtitlewords,alongwithlinksbetweenwordsandthetopicsthatwerediscoveredintheprevioussection.Word-topiclinksarealsoweightedaccordingtotheweights
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
1
6
1
1
5
6
6
8
2
4
/
/
t
l
a
c
_
a
_
0
0
1
6
1
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
8
WordsenseinductionsensedisambiguationwordinductionunsupervisedclusteringsensesbasedsimilaritychineseCRFs+theirapplicationsentitynamedrecognitionrandomconditionalfieldschineseentitiesbiomedicalsegmentationDependencyparsingparsingdependencyprojectiveprobabilisticincrementaldeterministicalgorithmdatasyntactictreesTaggingmodelstaggingmodellatentmarkovconditionalrandomparsingunsupervisedsegmentationMulti-docsummarizationsummarizationmultidocumenttexttopicbasedqueryextractivefocusedsummariesChinesewordsegmentationwordsegmentationchinesebasedalignmentcharactertaggingbakeoffmodelcrfLexicalsemanticslexicalsemanticdistributionalsimilaritywordnetresourceslexiconacquistionsemanticsrepresentationCross-lingualIRcrosslingualretrievaldocumentlanguagelinguisticmultipersonmultilingualcoreferenceGenerationforsummar.sentencebasedcompressiontextsummarizationorderingapproachrankinggenerationSpokenlanguagespeechrecognitionautomaticprosodictaggingspontaneousnewsbroadcastunderstandingconversationalFrenchfunctionwordsdeladudesleautomatiqueanalyseuneenpourQuestionansweringquestionansweringsystemanswerdomainretrievalwebbasedopensystemsUnsupervisedlearningunsuperviseddiscoverylearninginductionknowledgegraphacquisitionconceptclusteringpatternSVMsforNLPsupportvectormachineserrorsspaceclassificationcorrectingwordparsingdetectingMaxEntmodelsentropymaximumapproachbasedattachmentmodelmodelsphraseprepositionaldisambiguationDialoguesystemsdialoguespokensystemshumanconversationalmultiinteractiondialoguesutterancesmultimodalSemanticrole-labelingsemanticrolelabelingparsingsyntacticfeaturesilldependencyformedframenetSMTbasedtranslationmachinestatisticalphraseenglishapproachlearningreorderingmodelCoreferenceresolutionresolutioncoreferenceanaphorareferencepronounellipsisambiguityresolvingapproachpronominalSemi-andweak-supervisionlearningsupervisedsemiclassificationactivedataclusteringapproachgraphweaklyInformationretrievalbasedretrievalsimilaritymodelssemanticspacemodeldistancemeasuresdocumentDiscoursediscourserelationsstructurerhetoricalcoherencetemporalrepresentationtextconnectivestheoryCFGparsingcontextfreegrammarsparsinglinearprobabilisticrewritinggrammarsystemsoptimalMin.risktrain.anddecod.minimumefficienttrainingerrorratetranslationriskbayesdecodingstatisticalPhonologyphonemeconversionletterphonologicalgraphemerulesapplyingtransliterationsyllablesoundSentimentsentimentopinionreviewsclassificationminingpolarityanalysispredictingproductfeaturesNeuralnetspeechrecog.speechrobustrecognitionrealnetworktimeneuralnetworkslanguageenvironmentsFinitestatemethodsstatefinitetransducersautomataweightedtranslationparsingincrementalminimalconstructionMechanicalTurkmechanicalturkautomaticevaluationamazontechniquesdataarticlesimagescientificTable5:Top10wordsforseveraltopicscreatedbytheco-occurencerandomwalktopicmodel.Theleftcolumnisamanuallabel.Topic59Topic82translation0.1953parsing0.1715machine0.1802dependency0.1192statistical0.0784projective0.0138MachineTranslation0.0018K-bestSpanningTreeParsing0.0025BetterHypothesisTestingforStatisticalMachineTranslation:ControllingforOptimizerInstability0.0016Pseudo-ProjectiveDependencyParsing0.0024FilteringAntonymous,Trend-Contrasting,andPolarity-DissimilarDistributionalParaphrasesforImprovingStatisticalMachineTranslation0.0015Shift-ReduceDependencyDAGParsing0.0017Knight,Kevin0.0083Nivre,Joakim0.0120Koehn,Philipp0.0074Johnson,Mark0.0085Ney,Hermann0.0072Nederhof,Mark-Jan0.0064RWTHAachenUniversity0.0212VaxjoUniversity0.0113CarnegieMellonUniversity0.0183BrownUniversity0.0107UniversityofSouthernCalifornia0.0177UniversityofAmsterdam0.0094WorkshoponStatisticalMachineTranslation0.0590ACL0.0512EMNLP0.0270EMNLP0.0259COLING0.0173CoNLL0.0223Table6:Examplesofentitiesassociatedwithselectedtopics.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
1
6
1
1
5
6
6
8
2
4
/
/
t
l
a
c
_
a
_
0
0
1
6
1
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
9
determinedbyGCF.Wethensimplytakerandomwalksfromtopicstoentitiesandmeasurethepro-portionatwhichtherandomwalkarrivesateachen-tityofinterest.Theseproportionsbecometheentity-topicassociationscores.Forexample,ifwewantedtofindtheauthorsmostassociatedwithtopic12,wewouldtakeanum-berofrandomwalks(say50,000)startingattopic12andterminatingassoonastherandomwalkfirstreachesanauthornode.Measuringtheproportionatwhichrandomwalksarriveateachallowsustocomputeanassociationscorebetweentopic12andeachauthor.Acommonprobleminrandomwalksonlargegraphsisthatthewalkcaneasilyget“lost”betweentwonodesthatshouldbeverynearbytakingajustafewstepsinthewrongdirection.Tokeeptheran-domwalksfromtakingthesewrongsteps,weadjustthetopologyofthenetworkusingdirectedlinkstokeeptherandomwalksmovinginthe“right”direc-tion.Wedesignthegraphsuchthatifwedesirearandomwalkfromnodesoftypestonodesoftypet,therandomwalkwillneverbeabletofollowanout-goinglinkthatdoesnotdecreaseitsdistancefromthenodesoft.Asshowninsection2.3,therearecertainnodesatwhicharandomwalk(likePagerank)arrivesatmoreoftenthanotherssimplybecauseoftheirpositionsinthegraph.Thissuggeststhattheremaybestationaryrandomwalkdistributionsoverentities,whichwewouldneedtoadjustforinordertofindthemostsignificantentitiesforatopic.Indeedthisiswhatwedofind.Asanexample,ifwesampletopicsuniformlyandtakerandomwalkstoauthornodes,bychanceweendupatJun’ichiTsujiion0.3%ofrandomwalks,EduardHovyon0.2%ofwalks,etc.Thesevaluesareabout1000timesgreaterthanwouldbeexpectedatrandom.Toadjustforthiseffect,whenwetakearandomwalkfromatopicxtoanentitytypet,wesubtractoutthisstationarydistributionfort,whichcorre-spondstotheproportionofrandomwalksthatendatanyparticularentityoftypetbychance,andnotbyvirtueofthefactthatthewalkstartedattopicx.Theresultingdistributionyieldstheentitiesoftthataremostsignificantlyassociatedwithtopicx.Ta-ble6givesexamplesofthemostsignificantentitiesforacoupleoftopics.−200−150−100−50RW-coocRW-simRTMLDACoherenceFigure2:Distributionoftopiccoherencesforthefourtopicmodels.4.3TopicModelEvaluationWeprovidetwoseparateevaluationsinthissection,oneofthetopicsalone,andoneextrinsticevaluationoftheentirepaper-topicmodel.Thevariantsofran-domwalktopicmodelsarecomparedagainstLDAandtherelationaltopicmodel(RTM),eachwith100topics(ChangandBlei,2010).AsRTMallowsonlyasingletypeofrelationshipbetweendocuments,weusecitationsastheinter-documentrelationships.4.3.1TopicCoherenceThecoherenceofatopicisevaluatedusingtheco-herencemetricintroducedin(Mimnoetal.,2011).GiventhetopMwordsV(t)=(v(t)1,...,v(t)M)foratopict,thecoherenceofthattopiccanbecalculatedwiththefollowingformula:C(t;V(t))=MXm=2m−1Xl=1log D(v(t)m,v(t)l)+1D(v(t)l)!,whereD(v)isthenumberofdocumentscontain-ingvandD(v,v0)isthenumberofdocumentscon-tainingbothvandv0.Thismeasureofcoherenceishighlycorrelatedwithmanualannotationsoftopicquality,withahighercoherencescorecorrespondingtoamoreco-herent,higherqualitytopic.Aftercalculatingtheco-herenceforeachofthe100topicsforRTMandtherandom-walktopicmodel,theaveragecoherenceforRTMtopicswas-135.2andtheaveragecoherenceforword-similarityrandomwalktopicswas-122.2,withstatisticalsignificanceatp<0.01.Figure2demonstratesthis,showingthatthewordsimilarity-basedrandomwalkmethodgeneratesseveralhighlycoherenttopics.TheaveragecoherencefortheLDAandtheco-occurencerandomwalkmodelweresig-nificantlylower.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
1
6
1
1
5
6
6
8
2
4
/
/
t
l
a
c
_
a
_
0
0
1
6
1
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
10
4.3.2ExtrinsicEvaluationOnedifficultyinevaluatingthisrandom-walktopicmodelintrinsicallyagainstastatisticaltopicmodellikeRTMisthatexistingevaluationmeasuresassumecertainstatisticalpropertiesofthetopic,forexample,thatthetopicsaregeneratedaccordingtoaDirichletprior.Becauseofthis,wechooseinsteadtoevaluatethistopicmodelextrinsicallywithadown-streamapplication.Wechooseaninformationre-trievalapplication,returningarankedlistofsimilardocuments,givenareferencedocument.Weevaluatefivedifferentmethods:citation-RTM,LDA,thetwoversionsoftherandom-walktopicmodel,andasimplewordvectorsimilaritybaseline.Similaritybetweendocumentswiththetopicmodelsaredeterminedbycosinesimilaritybe-tweenthetopicvectorsofthetwodocuments.Wordvectorsimilaritydeterminesthesimilaritybetweendocumentsbytakingthecosinesimilarityoftheirwordvectors.Fromthesesimilarityscores,arankedlistisproduced.Thedocumentsetforthistaskisthesetofallpa-persappearingatACLbetween2000and2011.Thetop10resultsreturnedbyeachmethodarepooledandmanuallyevaluatedwitharelevancescorebe-tween1and10.Thirtysuchresultsetsweremanu-allyannotated.Wethenevaluateeachmethodac-cordingtoitsdiscountedcumulativegain(DCG)(J¨arvelinandKek¨al¨ainen,2000).PerformanceofthesemethodsissummarizedinTable7.Theco-occurence-basedrandomwalktopicmodelperformedcomparablywiththebestper-formeratthistask,LDA,andtherewasnosignifi-cantdifferencebetweenthetwoatp<0.05.Goingforward,animportantproblemistorec-onciletheco-occurence-andword-similarity-basedformulationsofthistopicmodel,asthetwoformu-lationsperformverydifferentlyinourtwoevalua-tions.Heuristically,theco-occurencemodelseemstocreategoodhuman-readabletopics,whiletheword-similaritymodelcreatestopicsthataremoremathematically-coherent,butlesshuman-readable.5RelatedWorkHeterogeneousnetworkshavebeenstudiedinanumberofdifferentfields,suchasbiology(Sio-son,2005),transportationnetworks(LozanoandMethodDCGWordvector1.345±0.007LDA3.302±0.008RTM3.058±0.011Random-walk(cooc)3.295±0.006Random-walk(sim)2.761±0.007Table7:DCGPerformanceofthevarioustopicmodelsandbaselinesontherelateddocumentfind-ingtask.A95%confidenceintervalisprovided.Storchi,2002),socialnetworks(LambiotteandAus-loos,2006),andbibliographicnetworks(Sunetal.,2011).Thesenetworksarealsosometimesknownbythenamecomplexnetworksormultimodalnet-works,butboththesetermshaveotherconnotations.Weprefer“heterogeneousnetworks”asusedbySunetal.(2009).Therehasalsobeensomestudyofthesenetworksingeneral,incommunitydetection(Murata,2010),clustering(Longetal.,2008;Sunetal.,2012),anddatamining(Muthukrishnanetal.,2010),buttherehasnotyetbeenanycomprehensivestudy.Recently,NLPhasseenseveralusesofheterogeneousnet-works(thoughnotbythatname)forusewithlabelpropagationalgorithms(DasandPetrov,2011;Spe-riosuetal.,2011)andrandomwalks(Toutanovaetal.,2004;KokandBrockett,2010).Severalauthorshaveproposedtheideaofusingnetworkcentralitymeasurestoranktheimpactsofjournals,authors,papers,etc.(Bollenetal.,2006;Bergstrometal.,2008;Chenetal.,2007;Liuetal.,2005),andithasevenbeenproposedthatcentral-itycanbeapplicableinbipartitenetworks(Zhouetal.,2007).WeproposethatPagerankonanygen-eralheterogeneousnetworkisappropriateforcreat-ingrankedlistsforeachtypeofentity.Mostprevi-ouspapersalsolackarobustevaluation,demonstrat-ingagreementwithpreviousmethodsorwithsomeexternalawardsorrecognitions.Weusearandomgraphthatreplicatesthepropertiesofthereal-worldnetworktoshowthatPagerankontheheterogeneousnetworkoutperformsothermethods.Namedisambiguationhasbeenstudiedinanum-berofdifferentsettings,includinggraph-basedset-tings.Itiscommontousethecoauthorshipgraph(Kangetal.,2009;Fanetal.,2011),butauthors
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
1
6
1
1
5
6
6
8
2
4
/
/
t
l
a
c
_
a
_
0
0
1
6
1
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
11
havealsousedlexicalsimilaritygraphs(OnandLee,2007),citationgraphs(McRae-SpencerandShad-bolt,2006),orsocialnetworks(Malin,2005).Al-mostallgraphmethodsareunsupervised.Therehavebeensometopicmodelsdevelopedspecificallyforrelationaldata(Wangetal.,2006;Airoldietal.,2008),butbothofthesemodelshavelimitationsinthetypesofrelationaldatatheyareabletomodel.Thegrouptopicmodeldescribedin(Wangetal.,2006)isabletocreatestrongertopicsbyconsideringassociationsbetweenwords,events,andentities,butisverycoarseinthewayithan-dlesthebehaviorofentities,anddoesnotgeneralizetomultipledifferenttypesofentities.Thestochas-ticblockmodelof(Airoldietal.,2008)cancreateblocksofsimilarentitiesinagraphandisgeneralinthetypesofgraphsitcanhandle,butproduceslessmeaningfulresultsongraphsthathavespecificschemas.6ConclusionandFutureDirectionsInthispaper,wepresentaheterogeneousnet-worktreatmentoftheACLAnthologyNetworkanddemonstrateseveralapplicationsofit.Usingonlyoff-the-shelfgraphalgorithmswithasingledatarep-resentation,theheterogeneousAAN,weareabletoveryeasilybuildascientificimpactmeasurethatismoreaccuratethanexistingmeasures,anauthordis-ambiguationsystembetterthanexistinggraph-basedauthordisambiguationsystems,andarandom-walk-basedtopicmodelthatiscompetitivewithstatisticaltopicmodels.Whiletherearemanyothertasks,suchascitation-basedsummarization,thatcouldlikelybeap-proachedusingthisframeworkwiththeappropri-ateadditionofnewtypesofnodesintothehetero-geneousAANnetwork,thereareevensomepoten-tialsynergiesbetweenthetasksdescribedinthispa-perthathaveyettobeexplored.Forexample,wemayconsiderthatthemethodsoftheauthordisam-biguationortopicmodelingtaskscouldbetofindthehighest-impactpapersassociatedwithaterm(forsurveygeneration,perhaps)orhigh-impactauthorsassociatedwithaworkshop’stopic(toselectgoodreviewersforit).Webelievethatheterogeneousgraphsareaflexibleframeworkthatwillallowre-searcherstofindsimple,flexiblesolutionsforava-rietyofproblems.AcknowledgmentsThisresearchissupportedbytheIntelligenceAdvancedResearchProjectsActivity(IARPA)viaDepartmentofInteriorNationalBusinessCenter(DoI/NBC)contractnumberD11PC20153.TheU.S.Governmentisautho-rizedtoreproduceanddistributereprintsforGovernmen-talpurposesnotwithstandinganycopyrightannotationthereon.Disclaimer:Theviewsandconclusionscon-tainedhereinarethoseoftheauthorsandshouldnotbeinterpretedasnecessarilyrepresentingtheofficialpoli-ciesorendorsements,eitherexpressedorimplied,ofIARPA,DoI/NBC,ortheU.S.Government.ReferencesEdoardoM.Airoldi,DavidM.Blei,StephenE.Fienberg,andEricP.Xing.2008.Mixedmembershipstochasticblockmodels.TheJournalofMachineLearningRe-search,9:1981–2014.R´ekaAlbertandAlbert-L´aszl´oBarab´asi.2002.Statisti-calmechanicsofcomplexnetworks.Reviewsofmod-ernphysics,74(1):47.A.L.Barab´asiandR.Albert.1999.Emergenceofscal-inginrandomnetworks.Science,286(5439):509–512.CarlT.Bergstrom,JevinD.West,andMarcA.Wiseman.2008.Theeigenfactormetrics.TheJournalofNeuro-science,28(45):11433–11434.StevenBird,RobertDale,BonnieJDorr,BryanGib-son,MarkJoseph,Min-YenKan,DongwonLee,BrettPowley,DragomirRRadev,andYeeFanTan.2008.TheACLanthologyreferencecorpus:Areferencedatasetforbibliographicresearchincomputationallin-guistics.InProc.ofthe6thInternationalConferenceonLanguageResourcesandEvaluationConference(LREC08),pages1755–1759.D.M.Blei,A.Y.Ng,andM.I.Jordan.2003.Latentdirichletallocation.theJournalofmachineLearningresearch,3:993–1022.JohanBollen,MarkoA.Rodriguez,andHerbertVandeSompel.2006.Journalstatus.CoRR,abs/cs/0601030.JohanBollen,HerbertVandeSompel,AricHagberg,andRyanChute.2009.Aprincipalcomponentanalysisof39scientificimpactmeasures.PloSone,4(6):e6022.TiborBraun,WolfgangGl¨anzel,andAndr´asSchubert.2006.Ahirsch-typeindexforjournals.Scientomet-rics,69(1):169–173.JonathanChangandDavidMBlei.2010.Hierarchicalrelationalmodelsfordocumentnetworks.TheAnnalsofAppliedStatistics,4(1):124–150.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
1
6
1
1
5
6
6
8
2
4
/
/
t
l
a
c
_
a
_
0
0
1
6
1
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
12
PengChen,HuafengXie,SergeiMaslov,andSidRedner.2007.Findingscientificgemswithgooglespagerankalgorithm.JournalofInformetrics,1(1):8–15.AaronClauset,CosmaRohillaShalizi,andMarkEJNewman.2009.Power-lawdistributionsinempiricaldata.SIAMreview,51(4):661–703.DipanjanDasandSlavPetrov.2011.Unsupervisedpart-of-speechtaggingwithbilingualgraph-basedprojec-tions.InProceedingsofthe49thAnnualMeetingoftheAssociationforComputationalLinguistics:Hu-manLanguageTechnologies,pages600–609.PaulErd˝osandAlfr´edR´enyi.1960.Ontheevolutionofrandomgraphs.MagyarTud.Akad.Mat.Kutat´oInt.K¨ozl,5:17–61.G¨unesErkanandDragomirR.Radev.2004.Lexrank:Graph-basedlexicalcentralityassalienceintextsum-marization.J.Artif.Intell.Res.(JAIR),22:457–479.XiaomingFan,JianyongWang,XuPu,LizhuZhou,andBingLv.2011.Ongraph-basednamedisambigua-tion.J.DataandInformationQuality,2(2):10:1–10:23,February.WilliamA.Gale,KennethW.Church,andDavidYarowsky.1992.Workonstatisticalmethodsforwordsensedisambiguation.InWorkingNotesoftheAAAIFallSymposiumonProbabilisticApproachestoNatu-ralLanguage,volume54,page60.DavidHall,DanielJurafsky,andChristopherD.Man-ning.2008.Studyingthehistoryofideasusingtopicmodels.InProceedingsoftheConferenceonEmpir-icalMethodsinNaturalLanguageProcessing,pages363–371.ACL.AhmedHassan,AmjadAbu-Jbara,andDragomirRadev.2012.Extractingsignedsocialnetworksfromtext.TextGraphs-7,page6.JorgeE.Hirsch.2005.Anindextoquantifyanindi-vidual’sscientificresearchoutput.ProceedingsoftheNationalAcademyofSciencesoftheUnitedstatesofAmerica,102(46):16569.ThomasHofmann.1999.Probabilisticlatentsemanticindexing.InProceedingsofthe22ndannualinterna-tionalACMSIGIRconferenceonResearchanddevel-opmentininformationretrieval,pages50–57.ACM.KalervoJ¨arvelinandJaanaKek¨al¨ainen.2000.IRevalua-tionmethodsforretrievinghighlyrelevantdocuments.InProceedingsofthe23rdannualinternationalACMSIGIRconferenceonResearchanddevelopmentinin-formationretrieval,pages41–48.ACM.In-SuKang,Seung-HoonNa,SeungwooLee,HanminJung,PyungKim,Won-KyungSung,andJong-HyeokLee.2009.Onco-authorshipforauthordisam-biguation.InformationProcessing&Management,45(1):84–97.BrianKarrerandMarkEJNewman.2011.Stochas-ticblockmodelsandcommunitystructureinnetworks.PhysicalReviewE,83(1):016107.StanleyKokandChrisBrockett.2010.Hittingtherightparaphrasesingoodtime.InHumanLanguageTech-nologies:The2010AnnualConferenceoftheNorthAmericanChapteroftheAssociationforComputa-tionalLinguistics,pages145–153.ACL.OrenKurlandandLillianLee.2005.Pagerankwithouthyperlinks:Structuralrerankingusinglinksinducedbylanguagemodels.InSIGIR’05.RenaudLambiotteandMarcelAusloos.2006.Collabo-rativetaggingasatripartitenetwork.ComputationalScience–ICCS2006,pages1114–1117.DavidLiben-NowellandJonKleinberg.2007.Thelink-predictionproblemforsocialnetworks.JournaloftheAmericansocietyforinformationscienceandtechnol-ogy,58(7):1019–1031.R.N.Lichtenwalter,J.T.Lussier,andN.V.Chawla.2010.Newperspectivesandmethodsinlinkprediction.InProceedingsofthe16thACMSIGKDDinternationalconferenceonKnowledgediscoveryanddatamining,pages243–252.ACM.XiaomingLiu,JohanBollen,MichaelL.Nelson,andHerbertVandeSompel.2005.Co-authorshipnet-worksinthedigitallibraryresearchcommunity.Infor-mationprocessing&management,41(6):1462–1480.BoLong,ZhongfeiZhang,andTianbingXu.2008.Clusteringoncomplexgraphs.InProc.the23rdConf.AAAI2008.AngelicaLozanoandGiovanniStorchi.2002.Shortestviablehyperpathinmultimodalnetworks.Transporta-tionResearchPartB:Methodological,36(10):853–874.BradleyMalin.2005.Unsupervisednamedisambigua-tionviasocialnetworksimilarity.InWorkshoponLinkAnalysis,Counterterrorism,andSecurity,vol-ume1401,pages93–102.SergeiMaslovandSidneyRedner.2008.Promiseandpitfallsofextendinggoogle’spagerankalgorithmtocitationnetworks.TheJournalofNeuroscience,28(44):11103–11105.RyanMcDonald,FernandoPereira,KirilRibarov,andJanHajiˇc.2005.Non-projectivedependencypars-ingusingspanningtreealgorithms.InProceedingsoftheconferenceonHumanLanguageTechnologyandEmpiricalMethodsinNaturalLanguageProcessing,pages523–530.ACL.DuncanM.McRae-SpencerandNigelR.Shadbolt.2006.Alsobythesameauthor:Aktiveauthor,acita-tiongraphapproachtonamedisambiguation.InPro-ceedingsofthe6thACM/IEEE-CSjointconferenceonDigitallibraries,pages53–54.ACM.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
1
6
1
1
5
6
6
8
2
4
/
/
t
l
a
c
_
a
_
0
0
1
6
1
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
13
RadaMihalceaandPaulTarau.2004.Textrank:Bring-ingorderintotexts.InProceedingsofEMNLP,vol-ume4,pages404–411.Barcelona,Spain.RadaMihalcea.2005.Unsupervisedlarge-vocabularywordsensedisambiguationwithgraph-basedalgo-rithmsforsequencedatalabeling.InProceedingsofHLT-EMNLP,pages411–418.ACL.DavidMimno,HannaM.Wallach,EdmundTalley,MiriamLeenders,andAndrewMcCallum.2011.Op-timizingsemanticcoherenceintopicmodels.InPro-ceedingsoftheConferenceonEmpiricalMethodsinNaturalLanguageProcessing,pages262–272.ACL.PanchananMitra.2006.Hirsch-typeindicesforrank-inginstitutionsscientificresearchoutput.CurrentSci-ence,91(11):1439.TsuyoshiMurata.2010.Detectingcommunitiesfromtripartitenetworks.InProceedingsofthe19thinter-nationalconferenceonWorldwideweb,pages1159–1160.ACM.PradeepMuthukrishnan,DragomirRadev,andQiaozhuMei.2010.Edgeweightregularizationovermul-tiplegraphsforsimilaritylearning.InDataMining(ICDM),2010IEEE10thInternationalConferenceon,pages374–383.IEEE.MarkE.J.NewmanandJuyongPark.2003.Whysocialnetworksaredifferentfromothertypesofnetworks.PhysicalReviewE,68(3):036122.Byung-WonOnandDongwonLee.2007.Scalablenamedisambiguationusingmulti-levelgraphpartition.InProceedingsofthe7thSIAMinternationalconferenceondatamining,pages575–580.LawrencePage,SergeyBrin,RajeevMotwani,andTerryWinograd.1999.Thepagerankcitationranking:bringingordertotheweb.RomualdoPastor-SatorrasandAlessandroVespignani.2001.Epidemicspreadinginscale-freenetworks.Physicalreviewletters,86(14):3200–3203.DragomirR.Radev,PradeepMuthukrishnan,VahedQazvinian,andAmjadAbu-Jbara.2013.TheACLanthologynetworkcorpus.LanguageResourcesandEvaluation,pages1–26.WilliamM.Rand.1971.Objectivecriteriafortheeval-uationofclusteringmethods.JournaloftheAmericanStatisticalassociation,66(336):846–850.S.Redner.1998.Howpopularisyourpaper?anempir-icalstudyofthecitationdistribution.TheEuropeanPhysicalJournalB-CondensedMatterandComplexSystems,4(2):131–134.P.Sarkar,A.W.Moore,andA.Prakash.2008.Fastincre-mentalproximitysearchinlargegraphs.InProceed-ingsofthe25thinternationalconferenceonMachinelearning,pages896–903.ACM.AllanA.Sioson.2005.Multimodalnetworksinbiology.Ph.D.thesis,VirginiaPolytechnicInstituteandStateUniversity.MichaelSperiosu,NikitaSudan,SidUpadhyay,andJa-sonBaldridge.2011.Twitterpolarityclassificationwithlabelpropagationoverlexicallinksandthefol-lowergraph.InProceedingsoftheFirstworkshoponUnsupervisedLearninginNLP,pages53–63,Edin-burgh,Scotland,July.ACL.AlexanderStrehlandJoydeepGhosh.2003.Clusterensembles—aknowledgereuseframeworkforcom-biningmultiplepartitions.TheJournalofMachineLearningResearch,3:583–617.YizhouSun,JiaweiHan,PeixiangZhao,ZhijunYin,HongCheng,andTianyiWu.2009.Rankclus:inte-gratingclusteringwithrankingforheterogeneousin-formationnetworkanalysis.InProceedingsofthe12thInternationalConferenceonExtendingDatabaseTechnology:AdvancesinDatabaseTechnology,pages565–576.ACM.YizhouSun,RickBarber,ManishGupta,andJiaweiHan.2011.Co-authorrelationshippredictioninheteroge-neousbibliographicnetworks.YizhouSun,CharuC.Aggarwal,andJiaweiHan.2012.Relationstrength-awareclusteringofheterogeneousinformationnetworkswithincompleteattributes.Pro-ceedingsoftheVLDBEndowment,5(5):394–405.BenTaskar,Ming-FaiWong,PieterAbbeel,andDaphneKoller.2003.Linkpredictioninrelationaldata.InNeuralInformationProcessingSystems,volume15.RobertTibshirani,GuentherWalther,andTrevorHastie.2001.Estimatingthenumberofclustersinadatasetviathegapstatistic.JournaloftheRoyalSta-tisticalSociety:SeriesB(StatisticalMethodology),63(2):411–423.KristinaToutanova,ChristopherDManning,andAn-drewYNg.2004.Learningrandomwalkmodelsforinducingworddependencydistributions.InPro-ceedingsofthetwenty-firstinternationalconferenceonMachinelearning,page103.ACM.XueruiWang,NatashaMohanty,andAndrewMcCallum.2006.Groupandtopicdiscoveryfromrelationsandtheirattributes.Technicalreport,DTICDocument.KaiYu,ShipengYu,andVolkerTresp.2006.Softclusteringongraphs.AdvancesinNeuralInformationProcessingSystems,18:1553.DingZhou,SergeyA.Orshanskiy,HongyuanZha,andC.LeeGiles.2007.Co-rankingauthorsanddocu-mentsinaheterogeneousnetwork.InDataMining,2007.ICDM2007.SeventhIEEEInternationalCon-ferenceon,pages739–744.IEEE.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
1
6
1
1
5
6
6
8
2
4
/
/
t
l
a
c
_
a
_
0
0
1
6
1
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
14
Scarica il pdf