Documentación - IA de Investigación especializada en el MIT

¿Sobre qué tema necesitas documentación??

Transacciones de la Asociación de Lingüística Computacional, volumen. 3, páginas. 15–28, 2015. Editor de acciones: Hwee Tou Ng.

Transacciones de la Asociación de Lingüística Computacional, volumen. 3, páginas. 15–28, 2015. Editor de acciones: Hwee Tou Ng. Lote de envío: 9/2014; Lote de revisión 11/2014; Publicado 1/2015. C (cid:13) 2015 Asociación de Lingüística Computacional. 15 Cross-DocumentCo-ReferenceResolutionusingSample-BasedClusteringwithKnowledgeEnrichmentSouravDuttaMaxPlanckInstituteforInformaticsSaarbr¨ucken,Germanysdutta@mpi-inf.mpg.deGerhardWeikumMaxPlanckInstituteforInformaticsSaarbr¨ucken,Germanyweikum@mpi-inf.mpg.deAbstractIdentifyingandlinkingnamedentitiesacrossinformationsourcesisthebasisofknowledgeacquisitionandattheheartofWebsearch,rec-ommendations,andanalytics.Animportantprobleminthiscontextiscross-documentco-referenceresolution(CCR):computingequiv-alenceclassesoftextualmentionsdenotingthesameentity,withinandacrossdocuments.Priormethodsemployranking,agrupación,orprobabilisticgraphicalmodelsusingsyntacticfeaturesanddistantfeaturesfromknowledgebases.However,thesemethodsexhibitlimita-tionsregardingrun-timeandrobustness.ThispaperpresentstheCROCSframeworkforunsupervisedCCR,improvingthestateoftheartintwoways.First,weextendthewayknowledgebasesareharnessed,bycon-structinganotionofsemanticsummariesforintra-documentco-referencechainsusingco-occurringentitymentionsbelongingtodiffer-entchains.Second,wereducethecomputa-tionalcostbyanewalgorithmthatembedssample-basedbisection,usingspectralclus-teringorgraphpartitioning,inahierarchi-calclusteringprocess.ThisallowsscalingupCCRtolargecorpora.Experimentswiththreedatasetsshowsigniﬁcantgainsinoutputqual-ity,comparedtothebestpriormethods,andtherun-timeefﬁciencyofCROCS.1Introduction1.1MotivationandProblemStatementWearewitnessinganotherrevolutioninWebsearch,userrecommendations,anddataanalytics:tran-sitioningfromdocumentsandkeywordstodata,conocimiento,andentities.Examplesofthismega-trendaretheGoogleKnowledgeGraphanditsap-plications,andtheIBMWatsontechnologyfordeepquestionanswering.Toalargeextent,thesead-vanceshavebeenenabledbytheconstructionofhugeknowledgebases(KB’s)suchasDBpedia,Yago,orFreebase;thelatterformingthecoreoftheKnowledgeGraph.Suchsemanticresourcesprovidehugecollectionsofentities:gente,lugares,compa-nies,celebrities,cine,etc.,alongwithrichknowl-edgeabouttheirpropertiesandrelationships.Perhapsthemostimportantvalue-addingcom-ponentinthissettingistherecognitionanddis-ambiguationofnamedentitiesinWebandusercontents.NamedEntityDisambiguation(NED)(ver,p.ej.,(Cucerzan,2007;milne&Witten,2008;Cornoltietal.,2013))mapsamentionstring(e.g.,apersonnamelike“Bolt”oranounphraselike“light-ningbolt”)ontoitsproperentityifpresentinaKB(e.g.,thesprinterUsainBolt).Arelatedbutdifferenttaskofco-referencereso-lution(CR)(ver,p.ej.,(Haghighi&Klein,2009;Ng,2010;Leeetal.,2013))identiﬁesallmentionsinagiventextthatrefertothesameentity,includinganaphorassuchas“thepresident’swife”,“theﬁrstlady”,or“she”.Thistaskwhenextendedtoprocessanentirecorpusisthenknownascross-documentco-referenceresolution(CCR)(Singhetal.,2011).Ittakesasinputasetofdocumentswithentitymen-tions,andcomputesasoutputasetofequivalenceclassesovertheentitymentions.Thisdoesnotin-volvemappingmentionstotheentitiesofaKB.Un-likeNED,CCRcandealwithlong-tailoremergingentitiesthatarenotcapturedintheKBoraremerelyinverysparseform.StateoftheArtanditsLimitations.CRmethods,forco-referenceswithinadocument,aregenerallybasedonrulesorsupervisedlearningusingdiffer- l D o w n o a d e d f r o m h t t p : / / directo . m i t