Transactions of the Association for Computational Linguistics, vol. 3, pp. 15–28, 2015. Action Editor: Hwee Tou Ng.

Submission batch: 9/2014; Revision batch: 11/2014; Published: 1/2015.

© 2015 Association for Computational Linguistics.


Cross-Document Co-Reference Resolution using Sample-Based Clustering with Knowledge Enrichment

Sourav Dutta
Max Planck Institute for Informatics
Saarbrücken, Germany
sdutta@mpi-inf.mpg.de

Gerhard Weikum
Max Planck Institute for Informatics
Saarbrücken, Germany
weikum@mpi-inf.mpg.de

Abstract

Identifying and linking named entities across information sources is the basis of knowledge acquisition and at the heart of Web search, recommendations, and analytics. An important problem in this context is cross-document co-reference resolution (CCR): computing equivalence classes of textual mentions denoting the same entity, within and across documents. Prior methods employ ranking, clustering, or probabilistic graphical models using syntactic features and distant features from knowledge bases. However, these methods exhibit limitations regarding run-time and robustness. This paper presents the CROCS framework for unsupervised CCR, improving the state of the art in two ways. First, we extend the way knowledge bases are harnessed, by constructing a notion of semantic summaries for intra-document co-reference chains using co-occurring entity mentions belonging to different chains. Second, we reduce the computational cost by a new algorithm that embeds sample-based bisection, using spectral clustering or graph partitioning, in a hierarchical clustering process. This allows scaling up CCR to large corpora. Experiments with three datasets show significant gains in output quality, compared to the best prior methods, and the run-time efficiency of CROCS.

1 Introduction

1.1 Motivation and Problem Statement

We are witnessing another revolution in Web search, user recommendations, and data analytics: transitioning from documents and keywords to data, knowledge, and entities. Examples of this mega-trend are the Google Knowledge Graph and its applications, and the IBM Watson technology for deep question answering. To a large extent, these advances have been enabled by the construction of huge knowledge bases (KB's) such as DBpedia, Yago, or Freebase; the latter forming the core of the Knowledge Graph. Such semantic resources provide huge collections of entities: people, places, companies, celebrities, movies, etc., along with rich knowledge about their properties and relationships.

Perhaps the most important value-adding component in this setting is the recognition and disambiguation of named entities in Web and user contents. Named Entity Disambiguation (NED) (see, e.g., (Cucerzan, 2007; Milne & Witten, 2008; Cornolti et al., 2013)) maps a mention string (e.g., a person name like "Bolt" or a noun phrase like "lightning bolt") onto its proper entity if present in a KB (e.g., the sprinter Usain Bolt).

A related but different task of co-reference resolution (CR) (see, e.g., (Haghighi & Klein, 2009; Ng, 2010; Lee et al., 2013)) identifies all mentions in a given text that refer to the same entity, including anaphoras such as "the president's wife", "the first lady", or "she". This task, when extended to process an entire corpus, is known as cross-document co-reference resolution (CCR) (Singh et al., 2011). It takes as input a set of documents with entity mentions, and computes as output a set of equivalence classes over the entity mentions. This does not involve mapping mentions to the entities of a KB. Unlike NED, CCR can deal with long-tail or emerging entities that are not captured in the KB or are present merely in very sparse form.

State of the Art and its Limitations. CR methods, for co-references within a document, are generally based on rules or supervised learning using different kinds of linguistic features like syntactic paths between mentions, the distances between them, and their semantic compatibility as derived from co-occurrences in news and Web corpora (Haghighi & Klein, 2009; Lee et al., 2013). Some methods additionally use distant labels from knowledge bases (KB's). Cluster-ranking and multi-sieve methods incrementally expand groups of mentions and exploit relatedness features derived from semantic types, alias names, and Wikipedia categories (Rahman & Ng, 2011a; Ratinov & Roth, 2012).

The CCR task, computing equivalence classes across documents, is essentially a clustering problem using a similarity metric between mentions with features like those discussed above. However, standard clustering (e.g., k-means or EM variants, CLUTO, etc.) lacks awareness of the transitivity of co-reference equivalence classes and requires prior knowledge of model dimensions. Probabilistic graphical models like Markov Logic networks (Richardson & Domingos, 2006; Domingos et al., 2007; Domingos & Lowd, 2009) or factor graphs (Loeliger, 2008; Koller & Friedman, 2009) take into consideration constraints such as transitivity, while spectral clustering methods (Luxburg, 2007) implicitly consider transitivity in the underlying eigenspace decomposition, but suffer from high computational complexity. In particular, all methods need to precompute features for the data points and similarity values between all pairs of data points. The latter may be alleviated by pruning heuristics, but only at the risk of degrading output quality.

Note that CCR cannot be addressed by simply applying local CR to a "super-document" that concatenates all documents in the corpus. Within a document, identical mentions typically refer to the same entity, while in different documents, identical mentions can have different meanings. Although a cross-document view gives the opportunity to spot joint cues from different contexts for an entity, documents vary in their styles of referring to entities, and merely combining the local co-reference chains into a super-group might introduce substantial noise. In addition, CR methods are not designed for scaling to huge "super-documents" corresponding to millions of web pages or news articles.

Problem Statement. We aim to overcome the above limitations by proposing a CCR method that makes rich use of distant KB features, considers transitivity, and is computationally efficient.

1.2 Approach and Contribution

In this paper, we efficiently tackle the CCR problem by considering co-occurring mentions and rich features from external knowledge bases, and by using a transitivity-aware sampling-based hierarchical clustering approach. We developed the CROCS (CROss-document Co-reference reSolution) framework with unsupervised hierarchical clustering by repeated bisection using spectral clustering or graph partitioning.

CROCS harnesses semantic features derived from KB's by constructing a notion of semantic summaries (semsum's) for the intra-document co-reference chains. In addition to incorporating KB labels as features for the co-referring mentions, we also consider co-occurring mentions belonging to other entities and utilize their features. Consider the text: "Hillary lived in the White House and backed Bill despite his affairs." containing 3 mention groups: {"Hillary"}, {"Bill"}, and {"White House"}. If we merely obtain distant KB features for the first mention group, the sparse information leads to high ambiguity; e.g., it may refer to the mountaineer Sir Edmund Hillary. But by also obtaining features from the KB for "White House" (a co-occurring mention), we obtain much stronger cues towards the correct solution.

CROCS adopts a bisection-based clustering method and invokes it repeatedly in a top-down hierarchical procedure with an information-theoretic stopping criterion for cluster splitting. We escape the quadratic run-time complexity of pair-wise similarity computations by using a sampling technique for the spectral eigenspace decomposition or for graph partitioning. This is inspired by the recent work of (Krishnamurty et al., 2012; Wauthier et al., 2012) on active clustering techniques. Similarity computations between mention groups are performed lazily on-demand for the dynamically selected samples.

In a nutshell, the novel contributions are:

• CROCS, a framework for cross-document co-reference resolution using sample-based spectral clustering or graph partitioning embedded in a hierarchical bisection process;

• semsum's, a method for incorporating distant features from KB's, also considering the coupling between co-occurring mentions in different co-reference chains;


• experimental evaluation with benchmark corpora demonstrating substantial gains over prior methods in accuracy and run-time.

2 Computational Framework

The CROCS model assumes an input set of text documents D = {d1, d2, …}, with markup of entity mentions M = {m11, m12, …, m21, m22, …}, mij ∈ dj, present in the documents. CROCS computes an equivalence relation over M with equivalence classes Cj, where Cj ∩ Ck = ∅ for j ≠ k and ∪j Cj = M. The number of desired classes is a priori unknown; it needs to be determined by the algorithm. Detecting the mentions and marking their boundaries within the text is a problem by itself, referred to as NER (Named Entity Recognition). This paper does not address this issue and relies on established methods.

The CROCS framework consists of 4 stages:

1. Intra-document CR: Given an input corpus D with mentions M, we initially perform intra-document co-reference resolution.

2. Knowledge enrichment: For each of the local mention groups ({mij}) obtained in the previous step, we combine the sentences of the mentions to determine the best matching entity in a KB and retrieve its features. Analogous steps are performed for co-occurring mentions (of {mij}) and their features are included. We term this feature set of {mij} the semantic summary (semsum's).

3. Similarity computation: We compute similarity scores between mention groups based on the features extracted above. These are computed on-demand, and only for a sampled subset of mentions (avoiding quadratic computation cost).

4. Sampling-based clustering: We perform spectral clustering or balanced graph partitioning (using the similarity metric) in a hierarchical fashion to compute the cross-document co-reference equivalence classes of mentions.

3 Intra-Document CR

CROCS initially pre-processes input documents to cast them into plain text (using standard tools like (https://code.google.com/p/boilerpipe/), (www.jsoup.org), etc.). It then uses the Stanford CoreNLP tool suite to detect mentions and anaphors (http://nlp.stanford.edu/software/). The detected mentions are also tagged with coarse-grained lexical types (person, organization, location, etc.) by the Stanford NER Tagger (Finkel et al., 2005). This forms the input to the intra-document CR step, where we use the state-of-the-art open-source CR tool (based on the multi-pass sieve algorithm) from Stanford to compute the local mention co-reference chains (Raghunathan et al., 2010; Lee et al., 2011; Lee et al., 2013). The tagged texts and the local co-reference chains are then passed to the second stage.

This local CR step may produce errors (e.g., incorrect chaining of mentions or omissions) which propagate to the later stages. However, improving intra-document CR is orthogonal to our problem and thus out of the scope of this paper. Our experiments later show that CROCS is robust and produces high-quality output even with moderate errors encountered during the local-CR stage.

4 Knowledge Enrichment

The knowledge enrichment phase starts with the local co-reference chains per document. Assume that we have obtained mention groups (chains) {Michelle, she, first lady} and {the president's wife, first lady} from two documents. To assess whether these two chains should be combined, i.e., whether they both refer to the same entity, we compute semantic features by tapping into knowledge bases (KB's). Specifically, we harness labels and properties from freebase.com entries, for possibly matching entities, to enrich the features of a mention group. The KB features form a part of the semantic summary or semsum's for each local mention group. Features derived from the constructed semsum's are later used to compare different mention groups via a similarity measure (described in Section 5).

Formally, a mention m is a text string at a particular position in a document. m belongs to a mention group M(m) consisting of all equivalent mentions, with the same string (at different positions) or different strings. For a given m, the basic semsum of m, Sbasic(m), is defined as

Sbasic(m) = {t ∈ sentence(m′) | m′ ∈ M(m)} ∪ {t ∈ label(m′) | m′ ∈ M(m)}

where t are text tokens (words or phrases), sentence(m′) is the sentence in which mention m′ occurs, and label(m′) is the semantic label for m′ obtained from the KB. Note that Sbasic(m) is a bag of tokens, as different mentions in M(m) can obtain the same tokens or labels, and there could be multiple occurrences of the same mention string in M(m) anyway.

Prior works on CR (e.g., (Rahman & Ng, 2011a; Ratinov & Roth, 2012; Hajishirzi et al., 2013; Zheng et al., 2013)) and NED (e.g., (Cucerzan, 2007; Milne & Witten, 2008; Ratinov et al., 2011; Hoffart et al., 2011; Hajishirzi et al., 2013)) have considered such forms of distant features. CROCS extends these previous methods by also considering distant features for co-occurring mention groups, and not just the group at hand. We now introduce a general framework for knowledge enrichment in our CCR setting.

Strategies for knowledge enrichment involve decision making along the following dimensions:

• Target: items (single mentions, local mention groups, or global mention groups across documents) for which semantic features are obtained.

• Source: the resource from which semantic features are extracted. Existing methods consider a variety of choices: i) input corpora, ii) an external text corpus, e.g., Wikipedia, and iii) knowledge bases such as Freebase, DBpedia, or Yago.

• Scope: the neighborhood of the target considered for enrichment. It can either be restricted to the target itself or can consider co-occurring items (other mention groups connected to the target).

• Match: involves mapping the target to one or more relevant items in the source, and can involve anything from simple name queries to full-fledged NED based on relevance or score confidence.

Existing methods generally consider individual mentions or local mention groups as target. Extended scopes like co-occurring entities based on automatic NER and IE techniques have been proposed (Mann & Yarowsky, 2003; Niu et al., 2004; Chen & Martin, 2007; Baron & Freedman, 2008), but use only the input corpus as the enrichment source. Recent methods (Rahman & Ng, 2011a; Ratinov & Roth, 2012; Hajishirzi et al., 2013; Zheng et al., 2013) harness KB's, but consider only local mention groups. Also, these methods rely on high-quality NED for mapping mentions to KB entries. In contrast, CROCS considers extended scopes that include mention groups along with co-occurring mention groups when tapping into KB's. We make only weak assumptions on matching mentions against KB entities, by filtering on confidence and merely treating semsum's as features rather than relying on perfectly mapped entities.

Specifically, our CROCS method handles the four dimensions of knowledge enrichment as follows:

Enrichment Target: We use per-document mention groups, after the local CR step, as target. In principle, we could repeat the enrichment during the iterations of the CCR algorithm. However, as CROCS performs top-down splitting of groups rather than bottom-up merging, there is no added value.

Enrichment Source: We include all the sentences of a mention group in its semsum's, thus drawing on the input document itself. The main enrichment harnesses entity-structured KB's like Freebase or Yago by querying them with phrases derived from the mention groups' summaries. The features that are extracted from the best-matching entity include semantic types or categories (e.g., "politician", "award nominee"), alias names (e.g., "Michelle Robinson"), titles (e.g., "First Lady of the United States"), and gender of people. These features are appended to the semsum's and form the core of a mention group's semantic summary.

Enrichment Scope: CROCS includes co-occurring mention groups as additional targets for semantic features. Consider the 4 example sentences in Figure 1. Suppose the local CR finds 4 mention groups as shown. The mentions and the sentences in which they occur are represented as a bipartite graph depicting their connections (right side of Fig. 1). Consider the mention group of "president's wife" (m11) and "first lady" (m21). Together with their immediate sentence neighbors in the bipartite graph, these mentions form what we call the basic scope for knowledge enrichment, i.e., {m11, s1, m21, s2}. The sentences of this mention group contain other mentions which can be in mention groups spanning further sentences. We utilize this co-occurrence as additional cues for characterizing the mention group at hand. The union of the current scope with that of all the two-hop neighbors in the bipartite graph forms the extended scope. For the group {m11, s1, m21, s2}, the two-hop mention neighbors are {m12, m22, m23, m31}. Hence, we include the scopes of these groups, the mentions and sentences, yielding the extended scope {m11, s1, m21, s2, m22, m23, m31, s3}.

[Figure 1: Example of local mention groups. The four example sentences are — s1: "The president's wife lives in the White House." s2: "The first lady helps her husband with the duties in the president's residence." s3: "The White House is located in Washington DC and is the home of the US president." s4: "The American president and his wife live in Washington." The right side of the figure shows the bipartite graph connecting mentions m11, …, m43 to sentences s1, …, s4.]

Formally, for mention m in mention group M(m), its extended semsum Sextended(m) is:

Sextended(m) = Sbasic(m) ∪ ( ∪m′ { Sbasic(m′) | ∃ s : m′ ∈ s ∧ m ∈ s } )

where s is a sentence in which both m and m′ occur. In principle, we could consider even more aggressive expansions, like 4-hop neighbors or transitive closures. However, our experiments show that the 2-hop extension is a sweet spot that gains substantial benefits over the basic scope.

Enrichment Matching: For each local mention group, CROCS first inspects the coarse-grained types (person, organization, location) as determined by the Stanford NER Tagger. We consider pronouns to derive additional cues for person mentions. If all tags in a group agree, we mark the group by this tag; otherwise the group as a whole is not type-tagged.

To match a mention group against a KB entity, we trigger a phrase query comprising tagged phrases from the mention group to the KB interface¹. We remove non-informative words from the phrases, dropping articles, stop-words, etc. For example, the first mention group {m11, m21} in Fig. 1 leads to the query "president wife first lady". The query results are filtered by matching the result type-tag with the type tag of the mention group. For the extended scope, we construct analogous queries for the co-occurring mentions: "White House US president residence" and "husband" in the example.

The results are processed as follows. We primarily rely on the KB service itself to rank the matching entities by confidence and/or relevance/importance. We simply accept the top-ranked entity and its KB properties, and extend the semsum's on this basis. This is also done for the co-occurring mention groups, leading to the extended scope of the original mention group considered.

To avoid dependency on the ranking of the KB, we can alternatively obtain the top-k results for each query and also the KB's confidence for the entity matching. We then re-rank the candidates by our similarity measures and prune out candidates with low confidence. We introduce a confidence threshold, θ, such that all candidates having matching confidence below the threshold are ignored, i.e., the entire mention group is disregarded in the semsum construction. This makes the extended scope robust to noise. For example, the mention group {husband}, having low confidence, would likely degrade the semsum's quality and is thus dropped.

Algorithm 1: Extended Knowledge Enrichment
Require: Text T, set G of mention groups (from Stanford CoreNLP), KB match threshold θ, knowledge base KB
Ensure: semsum for each group in G
1: for each mention group M ∈ G do
2:   Basic scope: semsumM ← sentences from T containing mentions in M
3:   Extract and add KB features for mentions and phrases in semsumM (Sbasic(M))
4:   Extended scope: append context of 2-hop co-occurring mentions (from bipartite graph) to semsumM
5:   Matching: extract phrases from semsumM for query generation to KB
6:   Retrieve highest-ranked KB result entity e
7:   if match confidence of e > θ then
8:     Extract set of features Le for e from KB
9:     Append Le to semsumM (Sextended(M))
10:  end if
11: end for
12: Output semsumM for all M ∈ G

Feature Vector: The semsum's of the mention groups comprise sentences and bags of phrases. For the example mention group {m11, m21}, we include the sentences {s1, s2, s3} during the extended-scope enrichment, and obtain phrases from the KB like: "Michelle Obama", "First Lady of United States", "capital of the United States", etc. Algorithm 1 shows the pseudo-code for constructing semsum's. CROCS next casts each semsum into two forms, (i) a bag of words, and (ii) a bag of keyphrases, and uses both for constructing a feature vector.

¹ For example, (https://gate.d5.mpi-inf.mpg.de/webyagospotlx/WebInterface) or (www.freebase.com/query)
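For illustration, the extended-scope construction of Algorithm 1 can be sketched as follows. This is a minimal sketch, not the CROCS implementation: the data layout (mention groups as sets of (sentence-id, string) pairs), the stubbed `kb_lookup` function, and all names are our own assumptions; the real system queries a KB service and keeps bag semantics, i.e., duplicate tokens are retained.

```python
from collections import defaultdict

def build_semsums(groups, sentences, kb_lookup, theta=0.9):
    """Sketch of Algorithm 1: extended-scope semantic summaries.

    groups:    dict group_id -> set of (sentence_id, mention_string)
    sentences: dict sentence_id -> sentence text
    kb_lookup: function(query) -> (features, confidence); stubbed here
    theta:     KB match confidence threshold
    """
    # Bipartite adjacency: which mention groups touch which sentences.
    sent_to_groups = defaultdict(set)
    for gid, mentions in groups.items():
        for sid, _ in mentions:
            sent_to_groups[sid].add(gid)

    semsums = {}
    for gid, mentions in groups.items():
        own_sents = {sid for sid, _ in mentions}
        # Basic scope: sentences containing the group's own mentions.
        semsum = [sentences[sid] for sid in sorted(own_sents)]
        # Extended scope: 2-hop neighbours in the bipartite graph,
        # i.e. co-occurring mention groups and their sentences.
        for sid in own_sents:
            for other in sent_to_groups[sid] - {gid}:
                semsum += [sentences[s] for s, _ in groups[other]]
        # Query the KB with the group's phrases; keep the features of
        # the top-ranked entity only if the match is confident enough.
        features, conf = kb_lookup(" ".join(m for _, m in mentions))
        if conf > theta:
            semsum += features
        semsums[gid] = semsum
    return semsums
```

On the Figure 1 example, the group for "president's wife"/"first lady" would pull in the sentences of the co-occurring "White House" group via the shared sentence, exactly the 2-hop extension described above.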


5 Similarity Computation

CROCS compares mention groups by a similarity measure to infer whether they denote the same entity or not. The similarity is based on the feature vectors of mention groups (constructed as in Section 4). Each feature in a mention group's vector is weighted using IR-style measures according to the bag-of-words (BoW) model or the keyphrases (KP) model for the semsum's. Empirically, the best approach is a mixture of both the words and keyphrases models, which is employed by CROCS. Similarity comparisons are computed on-demand and only for a small sampled set of mention groups, as required during the hierarchical clustering procedure (see Section 6). The similarity of two mention groups G1, G2 is

sim(G1, G2) = α × simBoW(G1, G2) + (1 − α) × simKP(G1, G2)

where α is a tunable hyper-parameter. Whenever two mention groups are to be combined (referring to the same entity), their feature vectors are combined by computing a bag union of their words and/or phrases, and then recomputing the weights. Without loss of generality, our default setting is α = 0.5.

Bag-of-Words Model (BoW): For this model, we compute the term frequency tf(w) for each word w in the semsum's, and also the inverse document frequency idf(w) of the word across all semsum's (i.e., all mention groups from all input documents). The weight of w is wgt(w) = tf(w) × idf(w). As the semsum's are short, we use the simple product rather than dampening tf values or other variations. Alternatively, more advanced IR weighting models such as Okapi BM25 or statistical language models can be used. However, the classical tf × idf measure works quite well. CROCS computes the similarity of two feature vectors by their cosine distance.

Keyphrases Model (KP): The keyphrases of a mention group are obtained by extracting proper names, titles, alias names, locations, organizations, etc., from its semsum's. Similar to the BoW model, CROCS supports tf × idf style weights for entire keyphrases. For computing the similarity of keyphrases between two mention groups G1 and G2, CROCS matches the keyphrases of G1 in the semsum's of G2, and vice versa. However, entire phrases rarely match exactly. For example, the keyphrase "Peace Nobel" matches only partially in the text "Nobel prize for Peace". To consider such partial matches and reward both high overlap of words and short distances between matching words (locality), we adopt the scoring model of (Taneva et al., 2011). The score for a partial match of keyphrase p in text x is

score(p|x) = ( #matching words / length of cov(p|x) ) × ( Σ_{w ∈ cov(p)} wgt(w) / Σ_{w ∈ p} wgt(w) )^(1+γ)

where the cover cov(p|x) of p in x is the shortest word span (in x) containing all the words of p present in x (with a bound of 10-20 words). For the example above, the cover of p = "Peace Nobel" in the text x is "Nobel prize for Peace" (both words matching, with cover length 4). The parameter γ (0 < γ < 1) serves to tune the progression of penalizing missing words. In our experiments, γ was set to 0.5, and stop-words such as "a", "the", etc. were removed, with only keywords being considered. For mention groups G1 and G2, we compute

sim(G1|G2) = Σ_{p ∈ KP(G1)} wgt(p) × score(p | semsum's(G2))

Finally, we resolve the asymmetry in the similarity measure due to the ordering of the two groups by setting

sim(G1, G2) = max{ sim(G1|G2), sim(G2|G1) }

6 Clustering Algorithm

The final stage of CROCS takes the mention groups and the semsum's as input. It performs a top-down hierarchical bisection process, based on similarity scores among entities, to cluster together co-referring mention groups at each splitting level. Initially all mention groups are placed in a single cluster, and they are then recursively split until a stopping criterion finalizes a cluster as a leaf. At each level, cluster splitting is performed by using either spectral clustering (Luxburg, 2007) or balanced graph partitioning (Karypis & Kumar, 1998). Both these methods implicitly consider transitivity, which is essential as the equivalence classes of mentions should be transitively closed. The challenge of this seemingly simple procedure lies in (i) judiciously choosing and optimizing the details (model selection and stopping criterion), and (ii) reducing the computational cost. The latter is crucial as spectral clustering has cubic complexity, graph partitioning heuristics are computationally expensive, and CCR (unlike CR) needs to cope with Web-scale inputs consisting of millions of documents and entities.
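The α-mixture above, with tf·idf-weighted cosine similarity for the BoW part, can be sketched as below. This is a simplified illustration under our own naming; the keyphrase score simKP (the cover-based partial-match model) is left as a pluggable function, and only the symmetrization by taking the max of both directions follows the definition above.

```python
import math
from collections import Counter

def tfidf_vectors(bags):
    """tf-idf weight vectors for a list of token bags (one per mention group)."""
    n = len(bags)
    df = Counter()                      # document frequency over all semsum's
    for bag in bags:
        df.update(set(bag))
    idf = {w: math.log(n / df[w]) for w in df}
    return [{w: tf * idf[w] for w, tf in Counter(bag).items()} for bag in bags]

def cosine(u, v):
    """Cosine similarity of two sparse weight vectors (dicts)."""
    dot = sum(wt * v.get(w, 0.0) for w, wt in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def sim(g1, g2, sim_kp, alpha=0.5):
    """sim(G1,G2) = alpha * simBoW + (1 - alpha) * simKP, where the
    keyphrase score is symmetrised by taking the max of both directions."""
    bow = cosine(g1["vec"], g2["vec"])
    kp = max(sim_kp(g1, g2), sim_kp(g2, g1))
    return alpha * bow + (1 - alpha) * kp
```

The default α = 0.5 weighs both models equally, matching the setting stated above.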
F b y gramo tu mi s t t oh norte 0 7 S mi pag mi metro b mi r 2 0 2 3 21 Clusteringisinvokedforeachofthecoarse-grainedentitytypesseparately(asobtainedfromStanfordNERtagger):gente,lugares,andorganiza-tions.Thebenefitistwofold:gainingefficiencyandimprovingaccuracy,astwodifferententitytypeswouldnotco-referinreality.However,theriskisthattwodifferentlytaggedmentiongroupsmightactuallyrefertothesameentity,withatleastonetagbeingincorrect.Ourexperimentsshowthatthebenefitsclearlyoutweighthisrisk.Withoutlossofgenerality,weonlyconsiderchainsthataretaggedintooneoftheabovetypes,andotherco-referencechainsareignored.Althoughthismightleadtocer-tainmentionsbeingoverlooked,improvingtheac-curacyandrecallofNERtaggingapproachesareor-thogonaltoourcurrentscopeofwork.Activespectralclustering:Spectralcluster-ing(Luxburg,2007)usestheeigenspaceofthesim-ilaritygraph’sLaplacianmatrixtocomputegraphpartitionsasclusters.CROCSadoptstherecentlyproposedActiveSpectralClusteringtechnique(Kr-ishnamurtyetal.,2012;Wauthieretal.,2012),whichapproximatestheeigenspaceofaLaplacianwithasmallsubsetofsampleddatapoints(mentiongroupsinCROCS).ForndatapointsandsamplesizesintheorderofO(logn),thistechniquere-ducesthecostofspectralclusteringfromO(n3)toO(log3n)(withboundederror).CROCSinitializeseachbisectionstepbyselectingsmentiongroupsfromaclusterandcomputesallpair-wisesimilari-tiesamongthesampledgroups.Spectralclusteringisthenperformedonthissubsettoobtainasplitinto2clusters.Thenon-sampledmentiongroupsareas-signedtotheclosestclusterintermsofaveragedis-tancetoclustercentroids.Thechildrenclustersareiterativelysplitfurtheratnextlevelsuntilthestop-pingcriterionfires.Balancedgraphpartitioning:Balancedgraphpartitioningassignstheverticesofagraphintocom-ponentsofnearlythesamesizehavingfewedgesacrosscomponents.TheproblemisNP-complete,andseveralapproximationalgorithmshavebeenproposed(Bulucetal.,2013).CROCSusestheMETISsoftware(http://glaros.dtc.umn.edu/gkhome/metis/metis/overview)toobtainmentiongrouppartitioningateachlevelofthehier
archicalclustering.Theunderlyingmentionsimilaritygraphiscon-structedbysamplingsmentiongroups,andsim-ilaritiesamongthemrepresentedasedgeweights.Formentiongroupsnotselectedinthesample,sim-ilaritiestoonlythessamplepointsarecomputedandcorrespondingedgescreated.ThegraphisthenpartitionedusingMETIS(Karypis&Kumar,1998)(multi-levelrecursiveprocedure)tominimizetheedge-cutstherebypartitioningdissimilarmen-tiongroups.SpecificsofCROCS:Activespectralclustering(Krishnamurtyetal.,2012)usesrandomsampling,choosesthenumberoffinalclusters,kbasedoneigengap,andenforcesabalancingconstraintforthekclusterstobeofsimilarsizes.CROCSjudi-ciouslydeviatesfromthedesignof(Krishnamurtyetal.,2012)como:•Modelselection:Wechooseafixednumberofpartitionskateachcluster-splittingstepofthehierarchicalprocess.Weuseasmallkvalue,typicallyk=2.Thisavoidsselectingmodeldi-mensionparameters,allowingthestoppingcri-teriontodecidethefinalnumberofclusters.•Formofgraphcut:CROCSusesbalancednor-malizedcutforgraphpartitioning(Karypis&Kumar,1998).Sin 
embargo,unbalancedclustersizeswithseveralsingletonclusters(havingonlyonementiongroup)mightbeformed.InourCCRsetting,thisisactuallyanaturaloutcomeasmanylong-tailentitiesoccuronlyonceinthecorpus.Suchmentiongroupssignificantlydifferinsemanticandcontextualfeaturescomparedtotheothermentiongroups.Hence,singletonclus-termentionshavelowsimilarityscore(basedonsemsum’s)withothermentionsgroups.Thistranslatestolowedgeweightsintheunderlyingsimilaritygraphstructure(betweenmentions),thusformingfavorablecandidatesintheinitialphasesofclustersplittingusingminimumedge-cutbasedgraphpartitioning.Therefore,CROCSinherentlyincorporatesearlypartition(duringtheclusteringphase)ofsuchpossiblysingletonmentionclustersfromthe“maindata”,therebyhelpinginde-noisingandefficiency.•Sampling:Insteadofsamplingdatapointsuni-formlyrandomly,weusebiasedsamplingsim-ilartoinitializationusedink-meansclustering.Startingwitharandompoint,weaddpointstothesamplesetsuchthattheiraveragesimilaritytothealreadyincludedpointsisminimized,thusmaximizingthediversityamongthesamples.StoppingcriterionofCROCS:Thesample-based l D o w n o a d e d f r o m h t t p : / / d i r mi C t . metro i t . mi d tu / t a C yo / yo a r t i C mi - pag d F / d oh i / . 1 0 1 1 6 2 / t yo a C _ a _ 0 0 1 1 9 1 5 6 6 7 4 4 / / t yo a C _ a _ 0 0 1 1 9 pag d . 
F b y gramo tu mi s t t oh norte 0 7 S mi pag mi metro b mi r 2 0 2 3 22 hierarchicalclusteringprocessoperateswithoutanypriorknowledgeofthenumberofclusters(entidades)presentinthecorpus.WeusetheBayesianInfor-mationCriteria(BIC)(Schwarz,1978;Hourdakisetal.,2010)asthestoppingcriteriontodecidewhetheraclustershouldbefurthersplitorfinalized.BICisaBayesianvariantoftheMinimumDescriptionLength(MDL)principle(Grunwald,2007),assum-ingthepointsinaclustertobeGaussiandistributed.TheBICscoreofaclusterCwiths(muestreado)datapoints,xiandclustercentroid¯Cis:BIC(C)=Xi=1,···,slog2(xi−¯C)2+log2sTheBICscoreforasetofclustersisthemicro-averagedBICoftheclusters.CROCSsplitsaclus-terCintosub-clustersC1,...,CkiffthecombinedBICvalueofthechildrenisgreaterthanthatoftheparent,elseCismarkedasleaf.7ExperimentalEvaluationBenchmarkDatasets:Weperformedexperimentswiththefollowingthreepubliclyavailablebench-markingdatasets,therebycomparingtheperfor-manceofCROCSagainststate-of-the-artbaselinesundervariousinputcharacteristics.•JohnSmithcorpus:theclassicalbenchmarkforCCR(Bagga&Baldwin,1998)comprising197articlesselectedfromtheNewYorkTimes.Itincludesmentionsof35different“JohnSmith”personentities.AllmentionspertainingtoJohnSmithwithinadocumentrefertothesameper-son.Thisprovidesasmall-scalebutdemandingsettingforCCR,asmostJohnSmithsarelong-tailentitiesunknowntoWikipediaoranyKB.•WePS-2collection:asetof4,500WebpagesusedintheWebPeopleSearch2competition(Artilesetal.,2009).Thedocumentscomprisethetop150Websearchresults(usingYahoo!buscar(asof2008))foreachof30differentpeople(obtainedfromWikipedia,ACL’08,andUSCensus),coveringbothprominententities(e.g.,IvanTitov,computerscienceresearcher)andlong-tailedentities(e.g.,IvanTitov,actor).•NewYorkTimes(NYT)archivo:asetofaround1.8millionnewsarticlefromthearchivesofthenewspaper(Sandhaus,2008)extractedbe-tweenJanuary1987andJune2007.Weran-domlyselect220,000articlesfromthetimerangeofJanuary1,2004throughJune19,2007,whichcontainabout3.71millionmentions,or-ganizedinto1.57millionlocalmentio
nchainsaftertheintra-documentCRstep.Inourexperiments,weconsidermentionsofpersonentitiesasthisisthemostpredominantanddemand-ingclassofentitiesinthedatasets.TheJohnSmithandWePS-2datasetshaveexplicitgroundtruthan-notations,whiletheNYTcontainseditorialannota-tionsforentitiespresentineacharticle.Forknowl-edgeenrichment,weusedFreebase;althoughsensi-tivitystudiesexplorealternativesetupswithYago.EvaluationMeasures:Weusetheestablishedmea-surestoassessoutputqualityofCCRmethods:•B3F1score(Bagga&Baldwin,1998):mea-surestheF1scoreasaharmonicmeanofpreci-sionandrecallofthefinalequivalenceclasses.Precisionisdefinedastheratioofthenum-berofcorrectlyreportedco-references(foreachmention)tothetotalnumber;whilerecallcom-putesthefractionofactualco-referencescor-rectlyidentified.Boththefinalprecisionandre-callarecomputedbyaveragingoverallmentiongroups.•φ3-CEAFscore(luo,2005):analternatewayofcomputingprecision,recordar,andF1scoresus-ingthebest1-to-1mappingbetweentheequiv-alenceclassesobtainedandthoseinthegroundtruth.Thebestmappingofground-truthtoout-putclassesexhibitsthehighestmentionoverlap.Allexperimentswereconductedona4coreInteli52.50GHzprocessorwith8GBRAMrunningUbuntu12.04.7.1ParameterTuningTheuseofexternalfeaturesextractedfromKB’s(formentiongroups)formsanintegralpartinthework-ingofCROCS,andisrepresentedbythechoiceofthethreshold,θ.Givenaninputcorpus,wenowdis-cussthetuningofθbasedonsplittingtheavailabledataintotrainingandtestingsubsets.Werandomlypartitiontheinputdatainto3parts(assumedtobelabeledasA,B,andC).Oneofthedatapartsisthetrainingdataandtheothertwopartsarethetestdata.Usingthegoldannotationsofthetrainingdataset,weempiricallylearnthevalueofθ,thatprovidesthebestB3F1scoreforCCR,usingasimplelinesearch.Initially,θissetto1(noKBusage)andissubsequentlydecreasedusing0.01 l D o w n o a d e d f r o m h t t p : / / d i r mi C t . metro i t . mi d tu / t a C yo / yo a r t i C mi - pag d F / d oh i / . 1 0 1 1 6 2 / t yo a C _ a _ 0 0 1 1 9 1 5 6 6 7 4 4 / / t yo a C _ a _ 0 0 1 1 9 pag d . 
As soon as the performance of CROCS is seen to degrade (compared to the previous iteration), the procedure is terminated and the previous value of θ is taken as the learned parameter value. The final results we report are averaged over 3 independent runs, each considering a different data partition (among A, B, and C) as the training data. Although more advanced learning algorithms might also be used, this simple approach is observed to work well. Learning of the θ value might converge to a local maximum, or may be distorted due to the presence of noise in the training data. However, we later show (in Section 7.5) that the performance of CROCS is robust to small variations of θ.

7.2 John Smith Corpus: Long-Tail Entities

Table 1 compares CROCS with two state-of-the-art methods achieving the best published results for this benchmark. 66 randomly selected documents were used as the training set (while the rest formed the test set), and the θ value subsequently learned (as described in Section 7.1) was 0.96. Since the corpus contains mostly long-tail entities not present in any KB (only 5-6 of the 35 different John Smiths are in Wikipedia, e.g., the explorer John Smith), the KB matches were too unreliable and led to the introduction of noise. Hence, a high value of θ was obtained (i.e., KB features were mostly disregarded).

Method                    P (%)    R (%)    F1 (%)
CROCS                     78.54    72.12    75.21
Stream (Rao, 2010)        84.7     59.2     69.7
Inference (Singh, 2011)   -        -        66.4

Table 1: B3 F1 results on the John Smith dataset.

CROCS (using sample-based spectral clustering) achieves an F1 score of 75.21%, while Stream (Rao et al., 2010) and Inference (Singh et al., 2011) reach only 69.7% and 66.4%, respectively. CROCS also has a high φ3-CEAF score of 69.89%, exhibiting substantial gains over prior methods.² Our novel notion of semsum's with extended scope (mentions and co-occurring mention groups) proved essential for outperforming the existing methods (see Section 7.6). The runtime of CROCS was only around 6 seconds.

² Data and detailed CROCS output results are available at www.dropbox.com/s/1grribug15yghys/John_Smith_Dataset.zip?dl=0

7.3 WePS-2 Corpus: Web Contents

We compared sampled spectral clustering based CROCS on the WePS-2 corpus against the best methods reported in (Artiles et al., 2009). We empirically obtained the KB match parameter θ = 0.68 according to the train-test setup described earlier (with 1,500 training documents). CROCS achieves a B3-based F1 score of 83.48% and a φ3-CEAF score of 74.02% (Table 2), providing an improvement of about 1.5 F1 score points.³

Method                    P (%)    R (%)    F1 (%)
CROCS                     85.3     81.75    83.48
PolyUHK (Artiles, 2009)   87       79       82
UVA_1 (Artiles, 2009)     85       80       81

Table 2: B3 F1 results on the WePS-2 dataset.

The gain observed is not as high as that for the John Smith dataset, as in the WePS-2 corpus documents are longer, giving richer context with fewer ambiguous entity mentions. Thus, simpler methods also perform fairly well. The runtime of CROCS on the WePS-2 corpus was about 90 seconds.

³ Data and detailed CROCS output results are available at www.dropbox.com/s/1i9ot4seavcfdyc/WePS-2_Dataset.zip?dl=0

7.4 New York Times Corpus: Web Scale

The previous two datasets, John Smith and WePS-2, are too small to assess the robustness of CROCS for handling large data. We therefore ran CROCS (with sample-based spectral clustering) on a random sample of 220,000 NYT news articles. The knowledge enrichment threshold θ was learned to be 0.45 with 73K training documents.

CROCS achieved a B3 F1 score of 59.17% (with P = 56.18% and R = 62.49%) and a φ3-CEAF score of 50.0%. No prior methods report F1 performance figures for this large dataset. However, the factor-graph based approach of (Singh et al., 2010) measures the mention co-reference accuracy for a sample of 1,000 documents. Accuracy is defined as the ratio of document clusters assigned to an entity to the ground-truth annotations. We also sampled 1,000 documents, considering only mentions with multiple entity candidates. CROCS achieved an accuracy of 81.71%, as compared to 69.9% for (Singh et al., 2010).

As for run-time, CROCS took 14.3 hours to process around 150,000 news articles selected as the test corpus. We also compared this result against alternative algorithms within our framework (see Section 7.6). Hence, CROCS efficiently handles Web-scale input data.

7.5 Sensitivity Studies

The CROCS framework involves a number of tunable hyper-parameters adjusting the precise performance of the components. We now study the robustness of CROCS (sample-based spectral clustering variant) for varying parameter values.

Knowledge Enrichment Scope: CROCS supports several levels of knowledge enrichment for semsum's construction: i) including only sentences of a mention group (disregarding the KB), ii) using distant KB labels for the given mention group only (basic scope), and iii) adding distant KB labels for co-occurring mention groups (extended scope). We compared these configurations among each other and also against a state-of-the-art NED method alone. The results are shown in Table 3. We used the AIDA (Hoffart et al., 2011) open-source software (https://github.com/yago-naga/aida) for NED, and combined mentions mapped to the same KB entity. We use the trained value of θ obtained previously (for the respective datasets) for constructing the basic and extended scope of semsum's, and report the best B3 F1 scores. Note that the Sentences only and NED configurations are independent of the choice of θ.

CROCS configuration   WePS-2   NYT
Sentences only        50.35    39.52
Basic scope           64.14    53.88
Extended scope        83.48    59.17
NED baseline          61.25    59.62

Table 3: B3 F1 scores for CROCS enrichment variants.

Real-life Web articles contain a mixture of prominent entities, ambiguous names, and long-tail entities; hence sole reliance on NED for CCR fares poorly. The extended-scope semsum's construction produces superior results compared to the other models.

Knowledge Enrichment Matching Threshold: To study the influence of different degrees of distant KB feature extraction, we varied the enrichment matching threshold θ from 0.0 (accept all KB matches) to 1.0 (no import from the KB). The John Smith dataset, largely containing long-tail entities, uses θ ∼ 1 (trained value) and operates on semsum's containing practically no feature inclusion from external KB's. Hence, we only consider the scenario where the KB is completely disregarded (i.e., θ = 1.0) and obtain a B3 F1 score of 76.47%. For the other two datasets, the B3 F1 results for varying θ are shown in Table 4.

Dataset   θ = 0.0   0.25   0.5    0.65   0.75   0.9    1.0
WePS-2    76.9      77.3   82.4   83.9   75.7   68.9   63.5
NYT       60.5      61.5   62.2   62.2   60.0   52.1   48.4

Table 4: B3 F1 scores (%) for different choices of θ.

We observe that KB features help the CCR process, and the best results are obtained for θ between 0.6 and 0.7. We also observe that the exact choice of θ is not a sensitive issue: any choice between 0.25 and 0.75 yields fairly good F1 scores (within 10% of the empirically optimal F1 results). Hence, our approach is robust regarding parameter tuning.

The trained values of θ (obtained previously) for both the WePS-2 and the NYT datasets are close to the optimal settings, as seen from Table 4, and provide nearly similar F1 score performance. Therefore, we set θ = 0.65 and consider the entire input corpora as the test set for the remainder of our experiments. To reconfirm the robustness of CROCS to θ value ranges, we use the KB threshold trained on the WePS-2 dataset and test it on the NYT dataset (and vice versa). From Table 5 we observe that CROCS renders comparable performance in the presence of errors during the θ learning phase.

Dataset   θ used   P (%)    R (%)   F1 (%)
WePS-2    0.45     83.46    80.21   81.9
NYT       0.68     59.42    64.2    61.8

Table 5: θ error sensitivity of CROCS.

Clustering Hyper-Parameters: We study the effect of varying k, the number of sub-clusters for the bisection procedure invoked at each level of the hierarchical clustering. By default, this is set to 2 (i.e., bisection). Table 6 shows the B3 F1 scores for different choices of k, for the three datasets (with θ = 1.0 for John Smith and θ = 0.65 for the other two datasets). We observe that k = 2 performs best in all cases. The output quality monotonically drops with increasing k, as this forces even similar mention groups to form separate clusters. Hence, bisection allows the hierarchical process to adjust the model selection at the global level.

Alternative KB: To assess the impact of the dependency on Freebase (feature extraction from the best-matching entity), we
ran alternative experiments on the WePS-2 and NYT datasets with the Yago KB (www.yago-knowledge.org). We obtain all approximate matches for a mention group and rank them based on the key-phrase similarity model (Section 5), using sentences of the mention group and extracted features (from the Yago hasLabel property and infoboxes in Wikipedia pages linked via sameAs). Results in Table 7 show similar performance, indicating no preference of CROCS for any particular KB.

Dataset      k=2     k=3     k=4     k=5
John Smith   76.47   73.24   65.29   60.7
WePS-2       83.92   82.61   78.37   73.19
NYT          62.24   59.34   52.60   46.64

Table 6: B3 F1 scores (%) for different numbers of sub-clusters k.

          Freebase: P (%)  R (%)  F1      Yago: P (%)  R (%)  F1
WePS-2    86.3   82.1   83.9              86.6   82.5   84.0
NYT       59.8   64.9   62.2              61.3   60.8   61.0

Table 7: CROCS B3 F1 scores with Freebase vs. Yago.

7.6 Algorithmic Variants

The CROCS framework supports a variety of algorithmic building blocks, most notably clustering methods (e.g., k-means) or graph partitioning for the bisection steps, and, most importantly, sampling-based methods versus methods that fully process all data points. The comparative results for the three different datasets are presented in Table 8.

For the John Smith corpus (with θ = 1.0), all algorithms except sample-based k-means achieved similar performance in accuracy and runtime. The best method was the full-fledged spectral clustering, with about 2% F1 score improvement.

With the WePS-2 dataset, we obtain a similar picture w.r.t. output quality. However, this dataset is large enough to bring out the run-time differences. Sampling-based methods, including CROCS, were about 4× faster than their full-fledged counterparts, albeit with a meager loss of about 2% in F1 score.

The NYT dataset finally portrays the scenario on huge datasets. Here, only the sample-based methods ran to completion, while all the full-fledged methods were terminated after 20 hours. The fastest of them, the simple k-means method, had processed only about 5% of the data at this point (needing about 400 hours on extrapolation). In contrast, CROCS, using sample-based spectral clustering or graph partitioning, needed about 19.6 hours for the 220,000 documents. The sampling-based k-means competitor was slightly faster (17.8 hours), but lost dramatically on output quality: only about 42% F1 score compared to 62% F1 score for CROCS. Hence, we observe that CROCS is indeed well designed for scalable sampling-based CCR, whereas other, simpler methods like k-means, lacking transitivity awareness, fail to deliver good output quality.

Dataset   Clustering Method             P (%)    R (%)    F1 (%)   φ3 (%)   Run-time
John      Spectral clustering           79.6     80.1     79.85    73.52    8.11 sec
Smith     k-means clustering            71.27    83.83    77.04    71.94    8.01 sec
          Balanced graph partition      75.83    79.56    77.65    70.63    7.83 sec
          Sampled k-means               63.57    65.52    64.53    59.61    5.12 sec
          Sampled spectral clustering   79.53    73.64    76.47    70.25    6.5 sec
          Sampled graph partitioning    71.42    77.83    74.49    68.36    6.86 sec
WePS-2    Spectral clustering           88.2     85.61    86.88    77.91    331 sec
          k-means clustering            85.7     84.01    84.85    76.45    296.56 sec
          Balanced graph partition      86.56    82.78    84.63    77.73    324.64 sec
          Sampled k-means               72.67    68.56    70.56    66.92    72 sec
          Sampled spectral clustering   86.2     82.11    83.92    74.7     85.8 sec
          Sampled graph partitioning    85.3     82.2     83.72    74.5     83.65 sec
New York  k-means clustering            39.34*   49.17*   43.72*   31.45*   >20 hrs*
Times     Sampled k-means               40.45    45.34    42.76    40.61    17.8 hrs
          Sampled spectral clustering   59.78    64.92    62.24    51.02    19.6 hrs
          Sampled graph partitioning    61.45    62.71    62.07    50.88    19.7 hrs
*results after run terminated at 20 hrs (∼5% mentions processed)

Table 8: Accuracy and scalability of various algorithms embedded in CROCS (B3 measure and φ3 measure).

8 Related Work

Co-reference Resolution (CR): Existing intra-document CR methods combine syntactic with semantic features for identifying the best antecedent (preceding name or phrase) for a given mention (name, phrase, or pronoun). Syntactic features are usually derived from deep parsing of sentences and noun-group parsing. Semantic features are obtained by mapping mentions to background knowledge resources such as Wikipedia. An overview of CR methods is given in (Ng, 2010). Recent methods adopt the paradigm of multi-phase sieves, applying a cascade of rules to narrow down the choice of antecedents for a mention (e.g., (Haghighi & Klein, 2009; Raghunathan et al., 2010; Ratinov & Roth, 2012)). The cluster-ranking family of methods (e.g., (Rahman & Ng, 2011b)) extends this paradigm for connecting mentions with a cluster of preceding mentions. Person name disambiguation in CR deals with only person names, titles, nicknames, and surface-form variations (Chen & Martin, 2007).

Distant Knowledge Labels for CR: To obtain semantic features, additional knowledge resources such as Wikipedia, the Yago ontology, and the FrameNet corpus have been considered (Suchanek et al., 2007; Rahman & Ng, 2011a; Baker, 2012). To identify the entity candidate(s) that a mention (group) should use for distant supervision, CR methods such as (Ratinov & Roth, 2012; Lee et al., 2013) use matching heuristics based on the given mention alone to identify a single entity, or all matching entities with confidence above some threshold. Zheng et al. (2013) generalize this by maintaining a ranked list of entities for distant labeling as mention groups are updated. Unlike CROCS, prior methods utilize only the candidates for the given mention (group) itself; distant knowledge features for co-occurring mentions are not considered.

Cross-Document CR (CCR): Early works on CCR (Gooi & Allan, 2004), introduced by (Bagga & Baldwin, 1998), used IR-style similarity measures (tf×idf cosine, KL divergence, etc.) on features similar to intra-document CR. Recent works such as (Culotta et al., 2007; Singh et al., 2010; Singh et al., 2011) are based on probabilistic graphical models for jointly learning the mappings of all mentions into equivalence classes. The features for this learning task are essentially like the ones in local CR. Baron and Freedman (2008) proposed a CCR method involving full clustering coupled with statistical learning of parameters. However, this method does not scale to large corpora, making it unsuitable for Web contents. A more light-weight online method by (Rao et al., 2010) performs well on large benchmark corpora. It is based on a streaming clustering algorithm, which incrementally adds mentions to clusters or merges mention groups into single clusters, and has linear time complexity, albeit with inferior clustering quality compared to advanced methods like spectral clustering. Several CCR methods have harnessed co-occurring entity mentions, especially for the task of disambiguating person names (Mann & Yarowsky, 2003; Niu et al., 2004; Chen & Martin, 2007; Baron & Freedman, 2008). However, these methods do not utilize knowledge bases, but use information extraction (IE) methods on the input corpus itself, thus facing substantial noise due to IE quality variance on stylistically diverse documents like Web articles.

Spectral Clustering: (Luxburg, 2007) provides a detailed study of spectral clustering models and algorithms. Yan et al. (2009) proposed two approximation algorithms, based on the k-means technique and random projections, reducing the O(n^3) time complexity to O(k^3) + O(kn), where k is the number of clusters. In CCR, the number of clusters (truly distinct entities) can be huge and is typically unknown; hence (Shamir & Tishby, 2011; Krishnamurty et al., 2012; Wauthier et al., 2012) developed active spectral clustering, where the expensive clustering step is based on data samples and other data points are merely "folded in". The term "active" refers to the active-learning flavor of choosing the samples (notwithstanding that these methods mostly adopt uniform random sampling).

9 Conclusions

We have presented the CROCS framework for cross-document co-reference resolution (CCR). It performs sample-based spectral clustering or graph partitioning in a hierarchical bisection process to obtain the mention equivalence classes, thereby avoiding model-selection parameters and the high cost of clustering or partitioning. CROCS constructs features for mention groups by considering co-occurring mentions and obtaining distant semantic labels from KB's (for semsum's). Feature generation from multiple KB's and catering to streaming scenarios (e.g., news feeds or social media) are directions of future work.
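The core clustering loop summarized above (hierarchical bisection with a sample-based split and the BIC stopping test of Section 6) can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the authors' implementation: mention groups are plain feature vectors, the spectral/graph-partitioning step is replaced by a dependency-free 2-means split on a uniform random sample with the remaining points folded in by nearest centroid, and the BIC comparison is interpreted as "the children's combined score improves on (is lower than) the parent's".

```python
import math
import random

def bic(points):
    """BIC score of one cluster, following Section 6:
    sum_i log2 ||x_i - centroid||^2 + log2(s) over the s (sampled) points.
    A tiny epsilon guards against log2(0) for points at the centroid."""
    s, dim = len(points), len(points[0])
    centroid = [sum(p[d] for p in points) / s for d in range(dim)]
    def sqdist(p):
        return sum((p[d] - centroid[d]) ** 2 for d in range(dim))
    return sum(math.log2(sqdist(p) + 1e-9) for p in points) + math.log2(s)

def bisect(points, sample_size, rng):
    """Sample-based 2-way split: 2-means on a uniform random sample,
    then every point is 'folded in' to the nearer of the two centroids."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    sample = rng.sample(points, min(sample_size, len(points)))
    # Seed the two centroids with the farthest pair in the sample.
    c0, c1 = max(((a, b) for a in sample for b in sample),
                 key=lambda ab: d2(ab[0], ab[1]))
    for _ in range(10):  # a few Lloyd iterations, restricted to the sample
        parts = ([], [])
        for p in sample:
            parts[d2(p, c0) > d2(p, c1)].append(p)
        if parts[0] and parts[1]:
            c0 = [sum(xs) / len(parts[0]) for xs in zip(*parts[0])]
            c1 = [sum(xs) / len(parts[1]) for xs in zip(*parts[1])]
    halves = ([], [])
    for p in points:  # fold-in over all points, not just the sample
        halves[d2(p, c0) > d2(p, c1)].append(p)
    return halves

def hierarchical_clusters(points, sample_size=10, min_size=4, rng=None):
    """Recursive bisection without a preset number of clusters: a cluster
    is split only while the combined BIC of the two children improves on
    the parent's score; otherwise it becomes a leaf (an entity)."""
    rng = rng or random.Random(0)
    if len(points) < min_size:
        return [points]
    left, right = bisect(points, sample_size, rng)
    if not left or not right:
        return [points]
    if bic(left) + bic(right) < bic(points):
        return (hierarchical_clusters(left, sample_size, min_size, rng)
                + hierarchical_clusters(right, sample_size, min_size, rng))
    return [points]
```

Only the sampled points take part in the expensive split computation; the fold-in step is a single pass over the data, which is what makes the sample-based variants in Table 8 scale to the NYT corpus.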
edmeth-odslikespectralclustering.SeveralCCRmethodshaveharnessedco-occurringentitymentions,espe-ciallyforthetaskofdisambiguatingpersonnames(Mann&Yarowsky,2003;Niuetal.,2004;Chen&Martín,2007;Baron&Freedman,2008).How-ever,thesemethodsdonotutilizeknowledgebases,butuseinformationextraction(IE)methodsontheinputcorpusitself;thusfacingsubstantialnoiseduetoIEqualityvarianceonstylisticallydiversedocu-mentslikeWebarticles.SpectralClustering:(Luxburg,2007)providesadetailedstudyonspectralclusteringmodelsandal-gorithms.Yanetal.(2009)proposedtwoapproxi-mationalgorithms,basedonthek-meanstechniqueandrandomprojections,reducingtheO(n3)timecomplexitytoO(k3)+oh(kn)wherekisthenum-berofclusters.InCCR,thenumberofclusters(trulydistinctentities)canbehugeandtypicallyunknown;hence(Shamir&Tishby,2011;Krishnamurtyetal.,2012;Wauthieretal.,2012)developedactivespectralclustering,wheretheexpensiveclusteringstepisbasedondatasamplesandotherdatapointsaremerely“foldedin”.Theterm“active”referstotheactivelearningflavorofchoosingthesamples(notwithstandingthatthesemethodsmostlyadoptuniformrandomsampling).9ConclusionsWehavepresentedtheCROCSframeworkforcross-documentco-referenceresolution(CCR).Itperformssample-basedspectralclusteringorgraphpartitioninginahierarchicalbisectionprocesstoob-tainthementionequivalenceclasses,therebyavoid-ingmodel-selectionparametersandthehighcostofclusteringorpartitioning.CROCSconstructsfeaturesformentiongroupsbyconsideringco-occurringmentionsandobtainingdistantsemanticlabelsfromKB’s(forsemsum’s).FeaturegenerationfrommultipleKB’sandcater-ingtostreamingscenarios(e.g.,newsfeedsorsocialmedia)aredirectionsoffuturework.

References

J. Artiles, J. Gonzalo, S. Sekine: 2009. WePS2 Evaluation Campaign: Overview of the Web People Search Clustering Task. WWW 2009.
A. Bagga, B. Baldwin: 1998. Entity-Based Cross-Document Coreferencing Using the Vector Space Model. In COLING-ACL, pages 79–85.
C. F. Baker: 2012. FrameNet, Current Collaborations & Future Goals. LREC, 46(2):269–286.
A. Baron, M. Freedman: 2008. Who is Who & What is What: Experiments in Cross-Document Co-Reference. In EMNLP, pages 274–283.
A. Buluc, H. Meyerhenke, I. Safro, P. Sanders, C. Schulz: 2013. Recent Advances in Graph Partitioning. Karlsruhe Institute of Technology, Technical Report.
J. Cai, M. Strube: 2010. Evaluation Metrics for End-to-End Coreference Resolution Systems. In SIGDIAL, pages 28–36.
Y. Chen, J. Martin: 2007. Towards Robust Unsupervised Personal Name Disambiguation. In EMNLP-CoNLL, pages 190–198.
M. Cornolti, P. Ferragina, M. Ciaramita: 2013. A Framework for Benchmarking Entity-Annotation Systems. In WWW, pages 249–260.
S. Cucerzan: 2007. Large-Scale Named Entity Disambiguation Based on Wikipedia Data. In EMNLP-CoNLL, pages 708–716.
A. Culotta, M. L. Wick, A. McCallum: 2007. First-Order Probabilistic Models for Coreference Resolution. In HLT-NAACL, pages 81–88.
P. Domingos, S. Kok, D. Lowd, H. Poon, M. Richardson, P. Singla: 2007. Markov Logic. Probabilistic ILP. Springer-Verlag, pages 92–117.
P. Domingos, D. Lowd: 2009. Markov Logic: An Interface Layer for Artificial Intelligence. Morgan and Claypool Publishers.
J. R. Finkel, T. Grenager, C. D. Manning: 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In ACL, pages 363–370.
C. H. Gooi, J. Allan: 2004. Cross-Document Coreference on a Large Scale Corpus. In HLT-NAACL, pages 9–16.
P. D. Grünwald: 2007. The Minimum Description Length Principle. MIT University Press.
A. Haghighi, D. Klein: 2009. Simple Coreference Resolution with Rich Syntactic and Semantic Features. In EMNLP, pages 1152–1161.
A. Haghighi, D. Klein: 2010. Coreference Resolution in a Modular, Entity-Centered Model. In HLT-NAACL, pages 385–393.
H. Hajishirzi, L. Zilles, D. S. Weld, L. S. Zettlemoyer: 2013. Joint Coreference Resolution and Named-Entity Linking with Multi-Pass Sieves. In EMNLP, pages 289–299.
J. Hoffart, M. A. Yosef, I. Bordino, H. Fürstenau, M. Pinkal, M. Spaniol, B. Taneva, S. Thater, G. Weikum: 2011. Robust Disambiguation of Named Entities in Text. In EMNLP, pages 782–792.
N. Hourdakis, M. Argyriou, G. M. Petrakis, E. E. Milios: 2010. Hierarchical Clustering in Medical Document Collections: the BIC-Means Method. Journal of Digital Information Management, 8(2):71–77.
G. Karypis, V. Kumar: 1998. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. Journal on Scientific Computing, 20(1):359–392.
B. W. Kernighan, S. Lin: 1970. An Efficient Heuristic Procedure for Partitioning Graphs. Bell System Technical Journal.
D. Koller, N. Friedman: 2009. Probabilistic Graphical Models: Principles and Techniques. MIT Press.
A. Krishnamurty, S. Balakrishnan, M. Xu, A. Singh: 2012. Efficient Active Algorithms for Hierarchical Clustering. In ICML, pages 887–894.
H. Lee, Y. Peirsman, A. Chang, N. Chambers, M. Surdeanu, D. Jurafsky: 2011. Stanford's Multi-Pass Sieve Coreference Resolution System at the CoNLL-2011 Shared Task. In CoNLL, pages 28–34.
H. Lee, A. Chang, Y. Peirsman, N. Chambers, M. Surdeanu, D. Jurafsky: 2013. Deterministic Coreference Resolution based on Entity-centric, Precision-ranked Rules. Computational Linguistics Journal, 39(4):885–916.
H. Lee, M. Recasens, A. X. Chang, M. Surdeanu, D. Jurafsky: 2012. Joint Entity and Event Coreference Resolution across Documents. In EMNLP-CoNLL, pages 489–500.
H. A. Loeliger: 2008. An Introduction to Factor Graphs. In MLSB.
X. Luo: 2005. On Coreference Resolution Performance Metrics. In HLT-EMNLP, pages 25–32.
U. von Luxburg: 2007. A Tutorial on Spectral Clustering. Statistics and Computing Journal, 17(4):395–416.
G. S. Mann, D. Yarowsky: 2003. Unsupervised Personal Name Disambiguation. In CoNLL, HLT-NAACL, pages 33–40.
D. N. Milne, I. H. Witten: 2008. Learning to Link with Wikipedia. In CIKM, pages 509–518.
V. Ng: 2010. Supervised Noun Phrase Coreference Research: The First Fifteen Years. In ACL, pages 1396–1411.
C. Niu, W. Li, R. K. Srihari: 2004. Weakly Supervised Learning for Cross-document Person Name Disambiguation Supported by Information Extraction. In ACL, article 597.

K. Raghunathan, H. Lee, S. Rangarajan, N. Chambers, M. Surdeanu, D. Jurafsky, C. Manning: 2010. A Multi-Pass Sieve for Coreference Resolution. In EMNLP, pages 492–501.
A. Rahman, V. Ng: 2011a. Coreference Resolution with World Knowledge. In ACL, pages 814–824.
A. Rahman, V. Ng: 2011b. Ensemble-Based Coreference Resolution. In IJCAI, pages 1884–1889.
D. Rao, P. McNamee, M. Dredze: 2010. Streaming Cross Document Entity Coreference Resolution. In COLING, pages 1050–1058.
L. A. Ratinov, D. Roth, D. Downey, M. Anderson: 2011. Local and Global Algorithms for Disambiguation to Wikipedia. In ACL, pages 1375–1384.
L. A. Ratinov, D. Roth: 2012. Learning-based Multi-Sieve Co-reference Resolution with Knowledge. In EMNLP-CoNLL, pages 1234–1244.
M. Richardson, P. Domingos: 2006. Markov Logic Networks. Journal of Machine Learning, 62(1-2):107–136.
E. Sandhaus: 2008. The New York Times Annotated Corpus Overview. Linguistic Data Consortium.
G. E. Schwarz: 1978. Estimating the Dimension of a Model. Annals of Statistics, 6(2):461–464.
O. Shamir, N. Tishby: 2011. Spectral Clustering on a Budget. Journal of Machine Learning Research - Proceedings Track, 15:661–669.
S. Singh, M. L. Wick, A. McCallum: 2010. Distantly Labeling Data for Large Scale Cross-Document Coreference. CoRR abs/1005.4298.
S. Singh, A. Subramanya, F. Pereira, A. McCallum: 2011. Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models. In ACL, pages 793–803.
F. M. Suchanek, G. Kasneci, G. Weikum: 2007. YAGO: a Core of Semantic Knowledge. In WWW, pages 697–706.
B. Taneva, M. Kacimi, G. Weikum: 2011. Finding Images of Difficult Entities in the Long Tail. In CIKM, pages 189–194.
F. L. Wauthier, N. Jojic, M. I. Jordan: 2012. Active Spectral Clustering via Iterative Uncertainty Reduction. In KDD, pages 1339–1347.
D. Yan, L. Huang, M. I. Jordan: 2009. Fast Approximate Spectral Clustering. In KDD, pages 907–916.
J. Zheng, L. Vilnis, S. Singh, J. D. Choi, A. McCallum: 2013. Dynamic Knowledge-Base Alignment for Coreference Resolution. In CoNLL, pages 153–162.