Transactions of the Association for Computational Linguistics, 1 (2013) 367–378. Action Editor: Kristina Toutanova.

Submitted 7/2013; Revised 8/2013; Published 10/2013. © 2013 Association for Computational Linguistics.

Modeling Missing Data in Distant Supervision for Information Extraction

Alan Ritter
Machine Learning Department, Carnegie Mellon University
rittera@cs.cmu.edu

Luke Zettlemoyer, Mausam
Computer Sci. & Eng., University of Washington
{lsz,mausam}@cs.washington.edu

Oren Etzioni
Vulcan Inc., Seattle, WA
orene@vulcan.com

Abstract

Distant supervision algorithms learn information extraction models given only large, readily available databases and text collections. Most previous work has used heuristics for generating labeled data, for example assuming that facts not contained in the database are not mentioned in the text, and that facts in the database must be mentioned at least once. In this paper, we propose a new latent-variable approach that models missing data. This provides a natural way to incorporate side information, for instance modeling the intuition that text will often mention rare entities which are likely to be missing in the database. Despite the added complexity introduced by reasoning about missing data, we demonstrate that a carefully designed local search approach to inference is very accurate and scales to large datasets. Experiments demonstrate improved performance for binary and unary relation extraction when compared to learning with heuristic labels, including on average a 27% increase in area under the precision-recall curve in the binary case.

Figure 1: A small hypothetical database and heuristically labeled training data for the EMPLOYER relation.

    Person           EMPLOYER
    Bibb Latané      UNC Chapel Hill
    Tim Cook         Apple
    Susan Wojcicki   Google

    True Positive:  "Bibb Latané, a professor at the University of North Carolina at Chapel Hill, published the theory in 1981."
    False Positive: "Tim Cook praised Apple's record revenue"
    False Negative: "John P. McNamara, a professor at Washington State University's Department of Animal Sciences"

1 Introduction

This paper addresses the issue of missing data (Little and Rubin, 1986) in the context of distant supervision. The goal of distant supervision is to learn to process unstructured data, for instance to extract binary or unary relations from text (Bunescu and Mooney, 2007; Snyder and Barzilay, 2007; Wu and Weld, 2007; Mintz et al., 2009; Collins and Singer, 1999), using a large database of propositions as a distant source of supervision. In the case of binary relations, the intuition is that any sentence which mentions a pair of entities (e1 and e2) that participate in a relation r is likely to express the proposition r(e1, e2), so we can treat it as a positive training example of r. Figure 1 presents an example of this process.

One question which has received little attention in previous work is how to handle the situation where information is missing, either from the text corpus or from the database. As an example, suppose the pair of entities (John P. McNamara, Washington State University) is absent from the EMPLOYER relation. In this case, the sentence in Figure 1 (and others which mention the entity pair) is effectively treated as a negative example of the relation.

This is an issue of practical concern, as most databases of interest are highly incomplete; that is, after all, the reason we need to extend them by extracting information from text in the first place.

We need to be cautious in how we handle missing data in distant supervision, because this is a case where data is not missing at random (NMAR). Whether a proposition is observed or missing in the text or database depends heavily on its truth value: given that it is true we have some chance of observing it, whereas false propositions are never observed. To address this challenge, we propose a joint model of extraction from text and of the process by which propositions are observed or missing in both the database and the text. Our approach provides a natural way to incorporate side information in the form of a missing data model. For instance, popular entities such as Barack Obama already have good coverage in Freebase, so new extractions are more likely to be errors than those involving rare entities with poor coverage.

Our approach to missing data is general and can be combined with various IE solutions. As a proof of concept, we extend MultiR (Hoffmann et al., 2011), a recent model for distantly supervised information extraction, to explicitly model missing data. These extensions complicate the MAP inference problem which is used as a subroutine in learning. This motivated us to explore a variety of approaches to inference in the joint extraction and missing data model. We explore both exact inference based on A* search and efficient approximate inference using local search. Our experiments demonstrate that with a carefully designed set of search operators, local search produces optimal solutions in most cases.

Experimental results demonstrate large performance gains over the heuristic labeling strategy on both binary relation extraction and weakly supervised named entity categorization. For example, our model obtains a 27% increase in area under the precision-recall curve on the sentence-level relation extraction task.

2 Related Work

There has been much interest in distantly supervised[1] training of relation extractors using databases. For example, Craven and Kumlien (1999) build a heuristically labeled dataset, using the Yeast Protein Database to label PubMed abstracts with the subcellular-localization relation. Wu and Weld (2007) heuristically annotate Wikipedia articles with facts mentioned in the infoboxes, enabling automated infobox generation for articles which do not yet contain them. Benson et al. (2011) use a database of music events taking place in New York City as a source of distant supervision to train event extractors from Twitter. Mintz et al. (2009) used a set of relations from Freebase as a distant source of supervision to learn to extract information from Wikipedia. Riedel et al. (2010), Hoffmann et al. (2011), and Surdeanu et al. (2012) presented a series of models casting distant supervision as a multiple-instance learning problem (Dietterich et al., 1997).

[1] Also referred to as weakly supervised.

Recent work has begun to address the challenge of noise in heuristically labeled training data generated by distant supervision, and has proposed a variety of strategies for correcting erroneous labels. Takamatsu et al. (2012) present a generative model of the labeling process, which is used as a preprocessing step for improving the quality of labels before training relation extractors. Independently, Xu et al. (2013) analyze a random sample of 1834 sentences from the New York Times, demonstrating that most entity pairs expressing a Freebase relation correspond to false negatives. They apply pseudo-relevance feedback to add missing entries to the knowledge base before applying the MultiR model (Hoffmann et al., 2011). Min et al. (2013) extend the MIML model of Surdeanu et al. (2012) using a semi-supervised approach assuming a fixed proportion of true positives for each entity pair.

The Min et al. (2013) approach is perhaps the most closely related of the recent approaches for distant supervision. However, there are a number of key differences: (1) they impose a hard constraint on the proportion of true positive examples for each entity pair, whereas we jointly model relation extraction and missing data in the text and KB; (2) they only handle the case of missing information in the database and not in the text; (3) their model, based on Surdeanu et al. (2012), uses hard discriminative EM to tune parameters, whereas we use perceptron-style updates; and (4) we evaluate various inference strategies for exact and approximate inference.

The issue of missing data has been extensively studied in the statistical literature (Little and Rubin, 1986; Gelman et al., 2003). Most methods for handling missing data assume that variables are missing at random (MAR): whether a variable is observed does not depend on its value. In situations where the MAR assumption is violated (for example, distantly supervised information extraction), ignoring the missing data mechanism will introduce bias. In this case it is necessary to jointly model the process of interest (e.g., information extraction) in addition to the missing data mechanism.

Another line of related work is iterative semantic bootstrapping (Brin, 1999; Agichtein and Gravano, 2000). Carlson et al. (2010) exploit constraints between relations to reduce semantic drift in the bootstrapping process; such constraints are potentially complementary to our approach of modeling missing data.

3 A Latent Variable Model for Distantly Supervised Relation Extraction

In this section we review the MultiR model (due to Hoffmann et al. (2011)) for distant supervision in the context of extracting binary relations. This model is extended to handle missing data in Section 4. We focus on binary relations to keep the discussion concrete; unary relation extraction is also possible.

Given a set of sentences, s = s1, s2, ..., sn, which mention a specific pair of entities (e1 and e2), our goal is to correctly predict which relation is mentioned in each sentence, or "NA" if none of the relations under consideration is mentioned. Unlike the standard supervised learning setup, we do not observe the latent sentence-level relation mention variables, z = z1, z2, ..., zn.[2] Instead we only observe aggregate binary variables for each relation, d = d1, d2, ..., dk, which indicate whether the proposition rj(e1, e2) is present in the database (Freebase). Of course the question which arises is: how do we relate the aggregate-level variables, dj, to the sentence-level relation mentions, zi? A sensible answer to this question is a simple deterministic-OR function. The deterministic OR states that if there exists at least one i such that zi = j, then dj = 1. For example, if at least one sentence mentions that "Barack Obama was born in Honolulu", then that fact is true in aggregate; if none of the sentences mentions the relation, then the fact is assumed false. The model also makes the converse assumption: if Freebase contains the relation BIRTHLOCATION(Barack Obama, Honolulu), then we must extract it from at least one sentence. A summary of this model, which is due to Hoffmann et al. (2011), is presented in Figure 2.

[2] These variables indicate which relation is mentioned between e1 and e2 in each sentence.

3.1 Learning

To learn the parameters of the sentence-level relation mention classifier, θ, we maximize the likelihood of the facts observed in Freebase conditioned on the sentences in our text corpus:

$$\theta^* = \arg\max_\theta P(d \mid s; \theta) = \arg\max_\theta \prod_{e_1,e_2} \sum_{z} P(z, d \mid s; \theta)$$

Here the conditional likelihood of a given entity pair is defined as follows:

$$P(z, d \mid s; \theta) = \prod_{i=1}^{n} \phi(z_i, s_i; \theta) \times \prod_{j=1}^{k} \omega(z, d_j) = \prod_{i=1}^{n} e^{\theta \cdot f(z_i, s_i)} \times \prod_{j=1}^{k} \mathbb{1}\left[\neg d_j \oplus \exists i : z_i = j\right]$$

where 1[x] is an indicator which takes the value 1 if x is true and 0 otherwise, the ω(z, dj) factors are hard constraints corresponding to the deterministic-OR function, and f(zi, si) is a vector of features extracted from sentence si and relation zi.

An iterative gradient-ascent based approach is used to tune θ using a latent-variable perceptron-style additive update scheme (Collins, 2002; Liang et al., 2006; Zettlemoyer and Collins, 2007). The gradient of the conditional log likelihood, for a single pair of entities, e1 and e2, is as follows:[3]

$$\frac{\partial \log P(d \mid s; \theta)}{\partial \theta} = \mathbb{E}_{P(z \mid s, d; \theta)}\Big[\sum_j f(s_j, z_j)\Big] - \mathbb{E}_{P(z, d \mid s; \theta)}\Big[\sum_j f(s_j, z_j)\Big]$$

[3] For details see Koller and Friedman (2009), Chapter 20.

These expectations are too difficult to compute in practice, so instead they are approximated as maximizations. Computing this approximation to the gradient requires solving two inference problems corresponding to the two maximizations:

$$z^*_{DB} = \arg\max_z P(z \mid s, d; \theta), \qquad z^* = \arg\max_z P(z, d \mid s; \theta)$$
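In code, one pass of this learning scheme for a single entity pair might be organized as follows. This is only an illustrative sketch, not the authors' implementation: `map_constrained`, `map_unconstrained`, and `features` are assumed helper functions standing in for the two MAP inference problems above and for the summed feature vector of an assignment, and the weights are kept in a plain dictionary.

```python
from collections import defaultdict

def perceptron_update(theta, sentences, db_facts,
                      map_constrained, map_unconstrained, features,
                      learning_rate=1.0):
    """One approximate-gradient (perceptron-style) update for a single entity pair.

    theta:             dict mapping feature name -> weight
    sentences:         the sentences s_1 .. s_n mentioning the pair
    db_facts:          set of relations recorded for the pair in the database
    map_constrained:   returns z maximizing P(z | s, d; theta), respecting the database facts
    map_unconstrained: returns z maximizing P(z, d | s; theta), maximizing over z and d jointly
    features:          returns the summed feature vector of an assignment, as a dict
    """
    z_db = map_constrained(theta, sentences, db_facts)
    z_free = map_unconstrained(theta, sentences)

    # Approximate gradient: features of the constrained MAP assignment
    # minus features of the unconstrained MAP assignment.
    gradient = defaultdict(float)
    for feat, value in features(sentences, z_db).items():
        gradient[feat] += value
    for feat, value in features(sentences, z_free).items():
        gradient[feat] -= value

    # Additive, gradient-ascent style update of the weights.
    for feat, value in gradient.items():
        theta[feat] = theta.get(feat, 0.0) + learning_rate * value
    return theta
```

The interesting work is hidden inside the two MAP solvers passed in as arguments, which are discussed next.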

Figure 2: MultiR (Hoffmann et al., 2011). Sentences s1, ..., sn are connected to relation mention variables z1, ..., zn by local extractors, P(zi = r | si) ∝ exp(θ · f(si, r)); the mentions are tied to the aggregate relation variables d1, ..., dk (e.g., Born-In, Lived-In, Children for the entity pair (Barack Obama, Honolulu)) by a deterministic OR.

Figure 3: DNMAR. The sentence-level mentions z1, ..., zn determine aggregate variables t1, ..., tk (whether each fact is mentioned in the text), which are connected by pairwise potentials to d1, ..., dk (whether each fact is mentioned in the database).

The MAP solution for the second term is easy to compute: because d and z are deterministically related, we can simply find the highest scoring relation, r, for each sentence, si, according to the sentence-level factors, φ, independently. The first term is more difficult, however, as this requires finding the best assignment to the sentence-level hidden variables z = z1, ..., zn conditioned on the observed sentences and the facts in the database. Hoffmann et al. (2011) show how this reduces to a well-known weighted edge cover problem which can be solved exactly in polynomial time.

4 Modeling Missing Data

The model presented in Section 3 makes two assumptions which correspond to hard constraints:

1. If a fact is not found in the database, it cannot be mentioned in the text.
2. If a fact is in the database, it must be mentioned in at least one sentence.

These assumptions drive the learning; however, if there is information missing from either the text or the database, this leads to errors in the training data (false positives and false negatives, respectively).

In order to gracefully handle the problem of missing data, we propose to extend the model presented in Section 3 by splitting the aggregate-level variables, d, into two parts: t, which represents whether a fact is mentioned in the text (in at least one sentence), and d′, which represents whether the fact is mentioned in the database. We introduce pairwise potentials ψ(tj, dj) which penalize disagreement between tj and dj, that is:

$$\psi(t_j, d_j) = \begin{cases} -\alpha_{MIT} & \text{if } t_j = 0 \text{ and } d_j = 1 \\ -\alpha_{MID} & \text{if } t_j = 1 \text{ and } d_j = 0 \\ 0 & \text{otherwise} \end{cases}$$

where αMIT (Missing In Text) and αMID (Missing In Database) are parameters of the model which can be understood as penalties for missing information in the text and database respectively. We refer to this model as DNMAR (for Distant Supervision with Data Not Missing At Random). A graphical model representation is presented in Figure 3.

This model can be understood as relaxing the two hard constraints mentioned above into soft constraints. As we show in Section 7, simply relaxing these hard constraints into soft constraints and setting the two parameters αMIT and αMID by hand on development data results in a large improvement in precision at comparable recall over MultiR on two different applications of distant supervision: binary relation extraction and named entity categorization.

Inference in this model becomes more challenging, however, because the constrained inference problem no longer reduces to a weighted edge cover problem as before. In Section 5, we present an inference technique for the new model which is time and memory efficient and almost always finds an exact MAP solution.
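For concreteness, the aggregate penalty induced by these pairwise potentials can be written in a few lines of code. The sketch below is ours, not the paper's implementation; it assumes z is a list of per-sentence relation labels with "NA" standing for no relation, and that the deterministic OR still links z to t.

```python
def soft_constraint_penalty(z, db_facts, relations, alpha_mit, alpha_mid):
    """Sum of the pairwise potentials psi(t_j, d_j) for one entity pair.

    z:         per-sentence relation labels (with "NA" meaning no relation)
    db_facts:  relations recorded for this entity pair in the database
    relations: all relations under consideration
    """
    mentioned_in_text = {r for r in z if r != "NA"}   # t_j = 1 iff some z_i = j
    penalty = 0.0
    for r in relations:
        in_text, in_db = r in mentioned_in_text, r in db_facts
        if in_db and not in_text:
            penalty -= alpha_mit    # fact in the database but missing from the text
        elif in_text and not in_db:
            penalty -= alpha_mid    # fact extracted from the text but missing from the database
    return penalty
```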

The learning proceeds analogously to what was described in Section 3.1, with the exception that we now maximize over the additional aggregate-level hidden variables t which have been introduced. As before, MAP inference is a subroutine in learning, both for the unconstrained case corresponding to the second term (which is again trivial to compute) and for the constrained case, which is more challenging as it no longer reduces to a weighted edge cover problem.

5 MAP Inference

The only difference in the new inference problem is the addition of t; z and t are deterministically related, so we can simply find a MAP assignment to z, from which t follows. The resulting inference problem can be viewed as optimization under soft constraints, where the objective includes a term, −αMID, for each fact extracted from the text which is not in Freebase, and an effective reward, αMIT, for extracting a fact which is contained in Freebase. The solution to the MAP inference problem is the value of z which maximizes the following objective:

$$z^*_{DB} = \arg\max_z P(z \mid s, d; \theta, \alpha) = \arg\max_z \sum_{i=1}^{n} \theta \cdot f(z_i, s_i) + \sum_{j=1}^{k} \Big( \alpha_{MIT}\, \mathbb{1}[d_j \wedge \exists i : z_i = j] - \alpha_{MID}\, \mathbb{1}[\neg d_j \wedge \exists i : z_i = j] \Big) \qquad (1)$$

Whether we choose to set the parameters αMIT and αMID to fixed values (Section 4) or incorporate side information through a missing data model (Section 6), inference becomes more challenging than in the model where facts observed in Freebase are treated as hard constraints (Section 3); the hard constraints are equivalent to setting αMID = αMIT = ∞.

We now present exact and approximate approaches to inference. Standard search methods such as A* and branch and bound have high computation and memory requirements and are therefore only feasible on problems with few variables;[4] they are, however, guaranteed to find an optimal solution. Approximate methods scale to large problem sizes, but we lose the guarantee of finding an optimal solution. After showing how to find guaranteed exact solutions for small problem sizes (e.g., up to 200 variables), we present an inference algorithm based on local search which is empirically shown to find optimal solutions in almost every case, by comparing its solutions to those found by A*.

[4] Each entity pair defines an inference problem where the number of variables is equal to the number of sentences which mention the pair.

5.1 A* Search

We cast exact MAP inference in the DNMAR model as an application of A* search. Each partial hypothesis, h, in the search space corresponds to a partial assignment of the first m variables in z; to expand a hypothesis, we generate k new hypotheses, where k is the total number of relations. Each new hypothesis h′ contains the same partial assignment to z1, ..., zm as h, with each h′ having a different value of zm+1 = r.

A* operates by maintaining a priority queue of hypotheses to expand, with each hypothesis' priority determined by an admissible heuristic. The heuristic represents an upper bound on the score of the best solution with h's partial variable assignment under the objective from Equation 1. In general, a tighter upper bound corresponds to a better heuristic and faster solutions. To upper bound our objective, we start with the φ(zi, si) factors from the partial assignment. Unassigned variables (i > m) are independently set to their maximum possible value, max_r φ(r, si). Next, to account for the effect of the aggregate ψ(tj, dj) factors on the unassigned variables, we consider independently changing each unassigned zi variable for each ψ(tj, dj) factor to improve the overall score. This approach can lead to inconsistencies, but it provides us with a good upper bound on the best possible solution with a partial assignment to z1, ..., zm.
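Both A* and the local search procedure described in the next subsection score candidate assignments with the objective in Equation 1. A minimal sketch of such a scorer is shown below; it is illustrative only, with `sentence_score(theta, s, r)` standing in for the sentence-level factor θ · f(s, r).

```python
def objective(z, sentences, db_facts, theta, alpha_mit, alpha_mid, sentence_score):
    """Score of a full assignment z under Equation 1.

    Note: this differs from adding the psi potentials of Section 4 only by a
    constant (alpha_mit for every fact already in the database), so it has the
    same argmax over z.
    """
    total = sum(sentence_score(theta, s, r) for s, r in zip(sentences, z))
    extracted = {r for r in z if r != "NA"}       # relations mentioned at least once
    for r in extracted:
        if r in db_facts:
            total += alpha_mit                    # reward: extracted fact appears in Freebase
        else:
            total -= alpha_mid                    # penalty: extracted fact is absent from Freebase
    return total
```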

The neighborhood of z is defined by a set of search operators. If none of the neighboring solutions has a higher score, then we have reached a (local) maximum, at which point the algorithm terminates with the current solution, which may or may not correspond to a global maximum. This process is repeated using a number of random restarts, and the best local maximum is returned as the solution.

Search Operators: We start with a standard search operator, which considers changing each relation-mention variable, zi, individually to maximize the overall score. At each iteration, all zi are considered, and the one which produces the largest improvement to the overall score is changed to form the neighboring solution, z′. Unfortunately, this definition of the solution neighborhood is prone to poor local optima, because it is often necessary to traverse many low-scoring states before changing one of the aggregate variables, tj, and achieving a higher score from the associated aggregate factor, ψ(tj, dj). For example, consider a case where the proposition r(e1, e2) is not in Freebase but is mentioned many times in the text, and imagine the current solution contains no mention zi = r. Any neighboring solution which assigns a mention to r will include the penalty αMID, which could outweigh the benefit from changing any individual zi to r: φ(r, si) − φ(zi, si). If multiple mentions were changed to r, however, together they could outweigh the penalty for extracting a fact not in Freebase and produce an overall higher score.

To avoid the problem of getting stuck in local optima, we propose an additional search operator which considers changing all variables, zi, which are currently assigned to a specific relation r, to a new relation r′, resulting in an additional (k − 1)² possible neighbors, in addition to the n × (k − 1) neighbors which come from the standard search operator. This aggregate-level search operator allows for more global moves which help to avoid local optima, similar to the type-level sampling approach for MCMC (Liang et al., 2010).

At each iteration, we consider all n × (k − 1) + (k − 1)² possible neighboring solutions generated by both search operators and pick the one with the biggest overall improvement, or terminate the algorithm if no improvement can be made over the current solution. 20 random restarts were used for each inference problem. We found this approach to almost always find an optimal solution: in over 100,000 problems with 200 or fewer variables from the New York Times dataset used in Section 7, an optimal solution was missed in only 3 cases, as verified by comparing against the optimal solutions found using A*. Without the aggregate-level search operator, local search almost always gets stuck in a local maximum.
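A schematic version of this local search procedure is sketched below, reusing the `objective` function sketched in Section 5. It is a simplified illustration rather than the exact implementation: for clarity it rescores every neighbor from scratch, whereas a real implementation would update scores incrementally.

```python
import random

def hill_climb(sentences, db_facts, relations, theta,
               alpha_mit, alpha_mid, sentence_score, restarts=20):
    """Greedy hill climbing over z using the standard and aggregate-level operators."""
    labels = list(relations) + ["NA"]

    def score(z):
        return objective(z, sentences, db_facts, theta,
                         alpha_mit, alpha_mid, sentence_score)

    best_z, best_score = None, float("-inf")
    for _ in range(restarts):
        z = [random.choice(labels) for _ in sentences]          # random restart
        current = score(z)
        improved = True
        while improved:
            improved = False
            neighbors = []
            # Standard operator: change a single z_i.
            for i in range(len(z)):
                for r in labels:
                    if r != z[i]:
                        neighbors.append(z[:i] + [r] + z[i + 1:])
            # Aggregate-level operator: change every mention of r to r'.
            for r in set(z):
                for r_new in labels:
                    if r_new != r:
                        neighbors.append([r_new if zi == r else zi for zi in z])
            if not neighbors:
                break
            best_neighbor = max(neighbors, key=score)
            if score(best_neighbor) > current:
                z, current, improved = best_neighbor, score(best_neighbor), True
        if current > best_score:
            best_z, best_score = z, current
    return best_z
```

The point being illustrated is the operator structure: single-variable moves plus aggregate relabeling of all mentions of one relation, which is what allows the search to escape the local optima described above.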
6 Incorporating Side Information

In Section 4, we relaxed the hard constraints made by MultiR, which allows for missing information in either the text or the database, enabling errors in the distantly supervised training data to be naturally corrected as a side effect of learning. We made the simplifying assumption, however, that all facts are equally likely to be missing from the text or the database, which is encoded in the choice of two fixed parameters, αMIT and αMID. Is it possible to improve performance by incorporating side information in the form of a missing data model (Little and Rubin, 1986), taking into account how likely each fact is to be observed in the text and the database conditioned on its truth value? In our setting, the missing data model corresponds to choosing the values of αMIT and αMID dynamically, based on the entities and relations involved.

Popular Entities: Consider two entities: Barack Obama, the 44th president of the United States, and Donald Parry, a professional rugby league footballer of the 1980s.[5] Since Obama is much more well known than Parry, we would not be very surprised to see information missing from Freebase about Parry, but it would seem odd if true propositions were missing about Obama. We can encode these intuitions by choosing entity-specific values of αMID:

$$\alpha^{(e_1,e_2)}_{MID} = -\gamma \min(c(e_1), c(e_2))$$

where c(ei) is the number of times ei appears in Freebase, which is used as an estimate of its coverage.

[5] http://en.wikipedia.org/wiki/Donald_Parry
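As a rough sketch of this idea (ours, not the paper's code): `freebase_count` is a placeholder for a lookup of c(e), and `gamma` is the constant from the formula above, tuned on development data; the intent, per the text, is that entity pairs with high Freebase coverage are penalized more heavily for extractions that are absent from the database.

```python
def alpha_mid_for_pair(e1, e2, freebase_count, gamma):
    """Entity-pair-specific value of alpha_MID.

    Mirrors alpha_MID^(e1, e2) = -gamma * min(c(e1), c(e2)) from the text, where
    c(e) = freebase_count(e) is how often entity e appears in Freebase, used as a
    proxy for its coverage; gamma is a constant tuned on development data.
    """
    return -gamma * min(freebase_count(e1), freebase_count(e2))
```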

Well Aligned Relations: Given that a pair of entities, e1 and e2, participating in a Freebase relation r appear together in a sentence si, the chance that si expresses r varies greatly depending on r. For example, if a sentence mentions a pair of entities which participate in both the COUNTRY CAPITAL relation and the LOCATION CONTAINS relation (for example, Moscow and Russia), it is more likely that a random sentence will express LOCATION CONTAINS than COUNTRY CAPITAL.

We can encode this preference for matching certain relations over others by setting αMIT on a per-relation basis. We choose a different value of αMIT for each relation based on quick inspection of the data, estimating the number of true positives. Relations such as contains, place lived, and nationality, which contain a large number of true positive matches, are assigned a large value, γ_large; those with a medium number, such as capital, place of death, and administrative divisions, were assigned a medium value, γ_medium; and those relations with few matches were assigned a small value, γ_small. These three parameters were tuned on held-out development data.

7 Experiments

In Section 5, we presented a scalable approach to inference in the DNMAR model which almost always finds an optimal solution. Of course the real question is: does modeling missing data improve performance at extracting information from text? In this section we present experimental results showing large improvements in both precision and recall on two distantly supervised learning tasks: binary relation extraction and named entity categorization.

7.1 Binary Relation Extraction

For the sake of comparison to previous work, we evaluate performance on the New York Times text, features, and Freebase relations developed by Riedel et al. (2010), which were also used by Hoffmann et al. (2011). This dataset is constructed by extracting named entities from 1.8 million New York Times articles, which are then matched against entities in Freebase. Sentences which contain pairs of entities participating in one or more relations are then used as training examples for those relations. The sentence-level features include word sequences appearing in context with the pair of entities, in addition to part-of-speech sequences and dependency paths from the Malt parser (Nivre et al., 2004).

7.1.1 Baseline

To evaluate the effect of modeling missing data in distant supervision, we compare against the MultiR model for distant supervision (Hoffmann et al., 2011), a state-of-the-art approach for binary relation extraction which is the most similar previous work and which models facts in Freebase as hard constraints, disallowing the possibility of missing information in either the text or the database. To make our experiment as controlled as possible and rule out differences in performance due to implementation details, we compare against our own re-implementation of MultiR, which reproduces Hoffmann et al.'s performance and shares as much code as possible with the DNMAR model.

7.1.2 Experimental Setup

We evaluate binary relation extraction using two evaluations. We first evaluate on a sentence-level extraction task using a manually annotated dataset provided by Hoffmann et al. (2011).[6] This dataset consists of sentences paired with human judgments on whether each expresses a specific relation. Secondly, we perform an automatic evaluation which compares propositions extracted from text against held-out data from Freebase.

[6] http://raphaelhoffmann.com/mr/

7.1.3 Results

Sentential Extraction: Figure 4 presents precision and recall curves for the sentence-level relation extraction task on the same manually annotated data presented by Hoffmann et al. (2011). By explicitly modeling the possibility of missing information in both the text and the database, we achieve a 17% increase in area under the precision-recall curve. Incorporating additional side information in the form of a missing data model, as described in Section 6, produces even better performance, yielding a 27% increase over the baseline in area under the curve.

We also compare against the system described by Xu et al. (2013) (hereafter called Xu13). To do this, we trained our implementation of MultiR on the labels predicted by their pseudo-relevance feedback model.[7]

[7] We thank Wei Xu for making this data available.

Figure 4: Overall precision and recall on the sentence-level extraction task, comparing against human judgments (curves for MultiR, Xu13, DNMAR, and DNMAR*). DNMAR* incorporates side information as discussed in Section 6.

The differences between each pair of systems, except DNMAR and Xu13,[8] are significant with a p-value less than 0.05 according to a paired t-test assuming a normal distribution.

[8] DNMAR has a 1.3% increase in AUC over Xu13, though this difference is not significant according to a paired t-test. DNMAR* achieves a 10% increase in AUC over Xu13, which is significant.

Per-relation precision and recall curves are presented in Figure 6. For certain relations, for instance /location/us_state/capital, there simply isn't enough overlap between the information contained in Freebase and the facts mentioned in the text to learn anything useful. For these relations, entity pair matches are unlikely to actually express the relation; for instance, in the following sentence from the data:

    NHPF, which has its Louisiana office in Baton Rouge, gets the funds

although Baton Rouge is the capital of Louisiana, the /location/us_state/capital relation is not expressed. Another interesting observation which we can make from Figure 6 is that the benefit from modeling missing data varies from one relation to another. Some relations, for instance /people/person/place_of_birth, have relatively good coverage in both Freebase and the text, and therefore we do not see as much gain from modeling missing data. Other relations, such as /location/location/contains and /people/person/place_lived, have poorer coverage, making our missing data model very beneficial.

Figure 5: Aggregate-level automatic evaluation comparing against held-out data from Freebase (curves for MultiR, DNMAR, and DNMAR*). DNMAR* incorporates side information as discussed in Section 6.

Aggregate Extraction: Following previous work, we evaluate precision and recall against held-out data from Freebase in Figure 5. As mentioned by Mintz et al. (2009), this automatic evaluation underestimates precision, because many facts correctly extracted from the text are missing from the database and are therefore judged as incorrect. Riedel et al. (2013) further argue that this evaluation is biased because frequent entity pairs are more likely to have facts in Freebase, so systems which rank extractions involving popular entities higher will achieve better performance independently of how accurate their predictions are. Indeed, in Figure 5 we see that the precision of our system which models missing data is generally lower than that of the system which assumes no data is missing from Freebase, although we roughly double the recall.

Figure 6: Per-relation precision and recall on the sentence-level relation extraction task. The dashed line corresponds to MultiR, DNMAR is the solid line, and DNMAR*, which incorporates side information, is represented by the dotted line. (Panels: business_company_founders, business_person_company, location_country_administrative_divisions, location_country_capital, location_location_contains, location_neighborhood_neighborhood_of, location_us_state_capital, people_deceased_person_place_of_death, people_person_children, people_person_nationality, people_person_place_lived, people_person_place_of_birth.)

By better modeling missing data, we achieve lower precision on this automatic held-out evaluation, as the system using hard constraints is explicitly trained to predict facts which occur in Freebase (not those which are mentioned in the text but are unlikely to appear in the database).

7.2 Named Entity Categorization

As mentioned previously, the problem of missing data in distant (weak) supervision is a very general issue; so far we have investigated this problem in the context of extracting binary relations using distant supervision. We now turn to the problem of weakly supervised named entity recognition (Collins and Singer, 1999; Talukdar and Pereira, 2010).

7.2.1 Experimental Setup

To demonstrate the effect of modeling missing data on the distantly supervised named entity categorization task, we adapt the MultiR and DNMAR models to the Twitter named entity categorization dataset presented by Ritter et al. (2011). The models described so far are applied unchanged: rather than modeling a set of relations in Freebase between a pair of entities, e1 and e2, we now model a set of possible Freebase categories associated with a single entity e. This is a natural extension of distant supervision from binary to unary relations. The unlabeled data and features described by Ritter et al. (2011) are used for training the model, and their manually annotated Twitter named entity dataset is used for evaluation.

7.2.2 Results

Precision and recall at weakly supervised named entity categorization, comparing MultiR against DNMAR, is presented in Figure 7. We observe a substantial improvement in precision at comparable recall by explicitly modeling the possibility of missing information in the text and database. The missing data model leads to a 107% increase in area under the precision-recall curve (from 0.16 to 0.34), but still falls short of the results presented by Ritter et al. (2011). Intuitively this makes sense, because the model used by Ritter et al. is based on latent Dirichlet allocation, which is better suited to this highly ambiguous unary relation data.

Figure 7: Precision and recall on the named entity categorization task (curves for NER_MultiR and NER_DNMAR).

8 Conclusions

In this paper we have investigated the problem of missing data in distant supervision; we introduced a joint model of information extraction and missing data which relaxes the hard constraints used in previous work to generate heuristic labels, and which provides a natural way to incorporate side information through a missing data model. Efficient inference is no longer straightforward in the new model, so we presented an approach based on A* search which is guaranteed to find exact solutions; however, exact inference is not computationally tractable for large problems. To address the challenge of large problem sizes, we proposed a scalable inference algorithm based on local search, which includes a set of aggregate search operators allowing for long-distance jumps in the solution space to avoid local maxima; this approach was experimentally demonstrated to find exact solutions in almost every case. Finally, we evaluated the performance of our model on the tasks of binary relation extraction and named entity categorization, showing large performance gains in each case.

In future work we would like to apply our approach to modeling missing data to additional models, for instance the models of Surdeanu et al. (2012) and Ritter et al. (2011), and also to explore new missing data models.

Acknowledgements

The authors would like to thank Dan Weld, Chris Quirk, Raphael Hoffmann, and the anonymous reviewers for helpful comments. Thanks to Wei Xu for providing data. This research was supported in part by ONR grant N00014-11-1-0294, DARPA contract FA8750-09-C-0179, a gift from Google, and a gift from Vulcan Inc., and was carried out at the University of Washington's Turing Center.

References

Eugene Agichtein and Luis Gravano. 2000. Snowball: Extracting relations from large plain-text collections. In Proceedings of the Fifth ACM Conference on Digital Libraries.

Edward Benson, Aria Haghighi, and Regina Barzilay. 2011. Event discovery in social media feeds. In Proceedings of ACL.

Sergey Brin. 1999. Extracting patterns and relations from the world wide web. In The World Wide Web and Databases.

Razvan Bunescu and Raymond Mooney. 2007. Learning to extract relations from the web using minimal supervision. In ACL.

Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr., and Tom M. Mitchell. 2010. Toward an architecture for never-ending language learning. In Proceedings of AAAI.

Michael Collins and Yoram Singer. 1999. Unsupervised models for named entity classification. In EMNLP.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of EMNLP.

Mark Craven, Johan Kumlien, et al. 1999. Constructing biological knowledge bases by extracting information from text sources. In ISMB.

Thomas G. Dietterich, Richard H. Lathrop, and Tomás Lozano-Pérez. 1997. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence.

Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin. 2003. Bayesian Data Analysis. CRC Press.

Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of ACL-HLT.

D. Koller and N. Friedman. 2009. Probabilistic Graphical Models: Principles and Techniques. MIT Press.

Percy Liang, Alexandre Bouchard-Côté, Dan Klein, and Ben Taskar. 2006. An end-to-end discriminative approach to machine translation. In Proceedings of ACL.

Percy Liang, Michael I. Jordan, and Dan Klein. 2010. Type-based MCMC. In Proceedings of ACL.

Roderick J. A. Little and Donald B. Rubin. 1986. Statistical Analysis with Missing Data. John Wiley & Sons, Inc., New York, New York, USA.

Bonan Min, Ralph Grishman, Li Wan, Chang Wang, and David Gondek. 2013. Distant supervision for relation extraction with an incomplete knowledge base. In Proceedings of NAACL-HLT.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of ACL-IJCNLP.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2004. Memory-based dependency parsing. In Proceedings of CoNLL.

Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Proceedings of ECML/PKDD.

Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In Proceedings of NAACL-HLT.

Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named entity recognition in tweets: An experimental study. In Proceedings of EMNLP.

Stuart J. Russell, Peter Norvig, John F. Candy, Jitendra M. Malik, and Douglas D. Edwards. 1996. Artificial Intelligence: A Modern Approach.

Benjamin Snyder and Regina Barzilay. 2007. Database-text alignment via structured multilabel classification. In Proceedings of IJCAI.

Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. 2012. Multi-instance multi-label learning for relation extraction. In Proceedings of EMNLP-CoNLL.

Shingo Takamatsu, Issei Sato, and Hiroshi Nakagawa. 2012. Reducing wrong labels in distant supervision for relation extraction. In Proceedings of ACL.

Partha Pratim Talukdar and Fernando Pereira. 2010. Experiments in graph-based semi-supervised learning methods for class-instance acquisition. In Proceedings of ACL.

Fei Wu and Daniel S. Weld. 2007. Autonomously semantifying Wikipedia. In Proceedings of CIKM.

Wei Xu, Raphael Hoffmann, Le Zhao, and Ralph Grishman. 2013. Filling knowledge base gaps for distant supervision of relation extraction. In Proceedings of ACL.

Luke S. Zettlemoyer and Michael Collins. 2007. Online learning of relaxed CCG grammars for parsing to logical form. In EMNLP-CoNLL.
