Transactions of the Association for Computational Linguistics, vol. 3, pp. 529–543, 2015. Action Editor: Sebastian Riedel.
Submission batch: 5/2015; Revision batch: 8/2015; Published 10/2015.
2015 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
c
(cid:13)
Large-ScaleInformationExtractionfromTextualDefinitionsthroughDeepSyntacticandSemanticAnalysisClaudioDelliBovi,LucaTelescaandRobertoNavigliDepartmentofComputerScienceSapienzaUniversityofRome{dellibovi,navigli}@di.uniroma1.itluca.telesca@gmail.comAbstractWepresentDEFIE,anapproachtolarge-scaleInformationExtraction(IE)basedonasyntactic-semanticanalysisoftextualdefini-tions.Givenalargecorpusofdefinitionsweleveragesyntacticdependenciestoreducedatasparsity,thendisambiguatetheargumentsandcontentwordsoftherelationstrings,andfi-nallyexploittheresultinginformationtoorga-nizetheacquiredrelationshierarchically.TheoutputofDEFIEisahigh-qualityknowledgebaseconsistingofseveralmillionautomati-callyacquiredsemanticrelations.11IntroductionTheproblemofknowledgeacquisitionliesatthecoreofNaturalLanguageProcessing.Recentyearshavewitnessedthemassiveexploitationofcollabo-rative,semi-structuredinformationastheidealmid-dlegroundbetweenhigh-quality,fully-structuredresourcesandthelargeramountofcheaper(butnoisy)unstructuredtext(Hovyetal.,2013).Col-laborativeprojects,likeFreebase(Bollackeretal.,2008)andWikidata(Vrandeˇci´c,2012),havebeenbeingdevelopedformanyyearsandarecontinu-ouslybeingimproved.Agreatdealofresearchalsofocusesonenrichingavailablesemi-structuredre-sources,mostnotablyWikipedia,therebycreatingtaxonomies(PonzettoandStrube,2011;Flatietal.,2014),ontologies(Mahdisoltanietal.,2015)andse-manticnetworks(NavigliandPonzetto,2012;Nas-taseandStrube,2013).Thesesolutions,however,1http://lcl.uniroma1.it/defieareinherentlyconstrainedtosmallandoftenpre-specifiedsetsofrelations.AmoreradicalapproachisadoptedinsystemslikeTEXTRUNNER(Etzionietal.,2008)andREVERB(Faderetal.,2011),whichdevelopedfromtheOpenInformationExtraction(OIE)paradigm(Etzionietal.,2008)andfocusedontheunconstrainedextractionofalargenumberofrelationsfrommassiveunstructuredcorpora.Ul-timately,alltheseendeavorsweregearedtowardsaddressingtheknowledgeacquisitionproblemandtacklinglong-standingchallengesinthefield,suchasMachineReading(Mitchell,2005).WhileearlierOIEapproachesreliedmostlyondependenciesatthelevelofsurfacetext(Etzionietal.,2008;Faderetal.,2011),morerecentworkhasfocusedondeeperlanguageunderstandingatthelevelofbothsyntaxandsemantics(Nakasholeetal.,2012;MoroandNavigli,2013)andtackledchal-lenginglinguisticphenomenalikesynonymyandpolysemy.However,theseissueshavenotyetbeenaddressedintheirentirety.Relationstringsarestillboundtosurfacetext,lackingactualsemanticcon-tent.Furthermore,mostOIEsystemsdonothaveaclearandunifiedontologicalstructureandre-quireadditionalprocessingsteps,suchasstatisti-calinferencemappings(Duttaetal.,2014),graph-basedalignmentsofrelationalphrases(GrycnerandWeikum,2014),orknowledgebaseunificationpro-cedures(DelliBovietal.,2015),inorderfortheirpotentialtobeexploitableinrealapplications.InDEFIEthekeyideaistoleveragethelinguisticanalysisofrecentsemantically-enhancedOIEtech-niqueswhilemovingfromopentexttosmallercor-poraofdenseprescriptiveknowledge.Theaimis
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
1
5
6
1
5
6
6
8
0
2
/
/
t
l
a
c
_
a
_
0
0
1
5
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
530
Figure1:Syntactic-semanticgraphconstructionfromatextualdefinitionthentoextractasmuchinformationaspossiblebyunifyingsyntacticanalysisandstate-of-the-artdis-ambiguationandentitylinking.Usingthisstrategy,fromaninputcorpusoftextualdefinitions(shortandconcisedescriptionsofagivenconceptorentity)weareabletoharvestfullydisambiguatedrelationin-stancesonalargescale,andintegratethemauto-maticallyintoahigh-qualitytaxonomyofseman-ticrelations.Asaresultalargeknowledgebaseisproducedthatshowscompetitiveaccuracyandcov-erageagainststate-of-the-artOIEsystemsbasedonmuchlargercorpora.Ourcontributionscanbesum-marizedasfollows:•WeproposeanapproachtoIEthattiestogethersyntacticdependenciesandunifiedentitylink-ing/wordsensedisambiguation,designedtodiscoversemanticrelationsfromarelativelysmallcorpusoftextualdefinitions;•Wecreatealargeknowledgebaseoffullydisambiguatedrelationinstances,rangingovernamedentitiesandconceptsfromavailablere-sourceslikeWordNetandWikipedia;•Weexploitoursemantifiedrelationpatternstoautomaticallybuildarich,high-qualityrelationtaxonomy,showingcompetitiveresultsagainststate-of-the-artapproaches.Ourapproachcomprisesthreestages.First,weextractfromourinputcorpusaninitialsetofseman-ticrelations(Section2);eachrelationisthenscoredandaugmentedwithsemantictypesignatures(Sec-tion3);finally,theaugmentedrelationsareusedtobuildarelationtaxonomy(Section4).2RelationExtractionHerewedescribethefirststageofourapproach,whereasetofsemanticrelationsisextractedfromtheinputcorpus.Inthefollowing,werefertoare-lationinstanceasatriplet=hai,r,ajiwithaiandajbeingtheargumentsandrtherelationpattern.FromeachrelationpatternrktheassociatedrelationRkisidentifiedbythesetofallrelationinstanceswherer=rk.Inordertoextractalargesetoffullydisambiguatedrelationinstanceswebringtogethersyntacticandsemanticanalysisonacorpusofplaintextualdefinitions.Eachdefinitionisfirstparsedanddisambiguated(Figure1a-b,Section2.1);syntacticandsemanticinformationiscombinedintoastruc-turedgraphrepresentation(Figure1c,Section2.2)andrelationpatternsarethenextractedasshortestpathsbetweenconceptpairs(Section2.3).ThesemanticsofourrelationsdrawsonBabelNet(NavigliandPonzetto,2012),awide-coveragemul-tilingualsemanticnetworkobtainedfromtheauto-maticintegrationofWordNet,Wikipediaandotherresources.Thischoiceisnotmandatory;however,inasmuchasitisasupersetoftheseresources,Ba-belNetbringstogetherlexicographicandencyclope-dicknowledge,enablingustoreachhighercoveragewhilestillbeingabletoaccommodatedifferentdis-ambiguationstrategies.Foreachrelationinstancetextracted,bothai,ajandthecontentwordsappear-inginrarelinkedtotheBabelNetinventory.IntheremainderofthepaperweidentifyBabelNetcon-ceptsorentitiesusingasubscript-superscriptnota-tionwhere,forinstance,bandibnreferstothei-thBabelNetsensefortheEnglishwordband.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
1
5
6
1
5
6
6
8
0
2
/
/
t
l
a
c
_
a
_
0
0
1
5
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
531
2.1TextualDefinitionProcessingThefirststepoftheprocessistheautomaticextractionofsyntacticinformation(typeddepen-dencies)andsemanticinformation(wordsensesandnamedentitymentions)fromeachtextualdefinition.Eachdefinitionundergoesthefollowingsteps:SyntacticAnalysis.Eachtextualdefini-tiondisparsedtoobtainadependencygraphGd(Figure1a).ParsingiscarriedoutusingC&C(ClarkandCurran,2007),alog-linearparserbasedonCombinatoryCategorialGrammar(CCG).Althoughouralgorithmseamlesslyworkswithanysyntacticformalism,CCGrulesareespeciallysuitedtolongerdefinitionsandlinguisticphenomenalikecoordinatingconjunctions(Steedman,2000).SemanticAnalysis.SemanticanalysisisbasedonBabelfy(Moroetal.,2014),ajoint,state-of-the-artapproachtoentitylinkingandwordsensedisambiguation.Givenalexicalizedsemanticnet-workasunderlyingstructure,Babelfyusesadensesubgraphalgorithmtoidentifyhigh-coherencesemanticinterpretationsofwordsandmulti-wordexpressionsacrossaninputtext.WeapplyBabelfytoeachdefinitiond,obtainingasensemappingSdfromsurfacetext(wordsandentitymentions)towordsensesandnamedentities(Figure1b).Asamatteroffact,anydisambiguationorentitylinkingstrategycanbeusedatthisstage.However,aknowledge-basedunifiedapproachlikeBabelfyisbestsuitedtooursetting,wherecontextislimitedandexploitingdefinitionalknowledgeasmuchaspossibleiskeytoattaininghigh-coverageresults(asweshowinSection6.4).2.2Syntactic-SemanticGraphConstructionTheinformationextractedbyparsinganddis-ambiguatingagivendefinitiondisunifiedintoasyntactic-semanticgraphGsemdwhereconceptsandentitiesidentifiedindarearrangedinagraphstruc-tureencodingtheirsyntacticdependencies(Figure1c).WestartfromthedependencygraphGd,asprovidedbythesyntacticanalysisofdinSection2.1.SemanticinformationfromthesensemappingsSdcanbeincorporateddirectlyintheverticesofGdbyattachingavailablematchesbetweenwordsandsensestothecorrespondingvertices.Dependencygraphs,however,encodedependenciessolelyonawordbasis,whileoursensemappingsmayincludemulti-wordexpressions(e.g.PinkFloyd1bn).Inordertoextractconsistentinformation,subsetsofverticesreferringtothesameconceptorentityaremergedtoasinglesemanticnode,whichreplacesthesubgraphcoveredintheoriginaldependencystructure.ConsidertheexampleinFigure1:anentitylikePinkFloyd1bncoverstwodistinctandconnectedverticesinthedependencygraphGd,oneforthenounFloydandoneforitsmodifierPink.Intheactualsemanticsofthesentence,asencodedinGsemd(Figure1c),thesetwoverticesaremergedtoasinglenodereferringtotheentityPinkFloyd1bn(theEnglishrockband),insteadofbeingassignedindividualwordinterpretations.OurprocedureforbuildingGsemdtakesasinputatypeddependencygraphGdandasensemappingSd,bothextractedfromagivendefinitiond.GsemdisfirstpopulatedwiththeverticesofGdreferringtodisambiguatedcontentwords,mergingthosever-ticescoveredbythesamesenses∈Sdintoasin-glenode(likePinkFloyd1bnandAtomHeartMother1bninFigure1c).Then,theremainingver-ticesandedgesareaddedasinGd,discardingnon-disambiguatedadjunctsandmodifiers(liketheandfifthinFigure1).2.3RelationPatternIdentificationAtthisstage,alltheinformationinagivendefi-nitiondhasbeenextractedandencodedinthecor-respondinggraphGsemd(Section2.2).Wenowcon-siderthosepathsconnectingentitypairsacrossthegraphandextracttherelationpatternrbetweentwoentitiesand/orconceptsastheshortestpathbetweenthetwocorrespondingverticesinGsemd.Thisen-ablesustoexcludelessrelevantinformation(typ-icallycarriedbyadjunctsormodifiers)andreducedatasparsityintheoverallextractionprocess.Ouralgorithmworksasfollows:givenatextualdefinitiond,weconsidereverypairofidentifiedconceptsorentitiesandcomputethecorrespondingshortestpathinGsemdusingtheFloyd-Warshallal-gorithm(Floyd,1962).Theonlyconstraintween-forceisthatresultingpathsmustincludeatleastoneverbnode.Thisconditionfiltersoutmeaninglesssingle-nodepatterns(e.g.twoconceptsconnected
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
1
5
6
1
5
6
6
8
0
2
/
/
t
l
a
c
_
a
_
0
0
1
5
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
532
Algorithm1RelationExtractionprocedureEXTRACTRELATIONSFROM(D)1:R:=∅2:foreachdinDdo3:Gd:=dependencyParse(d)4:Sd:=disambiguate(d)5:Gsemd:=buildSemanticGraph(Gd,Sd)6:foreachhsi,sjiinSddo7:hsi,rij,sji:=shortestPath(si,sj)8:R:=R∪{hsi,rij,sji}9:filterPatterns(R,ρ)returnR;withapreposition)and,giventheprescriptivenatureofd,isunlikelytodiscardsemanticallyrelevantat-tributescompactedinnounphrases.Asanexample,considerthetwosentences“Mutteristhethirdal-bumbyGermanbandRammstein”and“AtomHeartMotheristhefifthalbumbyEnglishbandPinkFloyd”.Inbothcases,twovalidshortest-pathpat-ternsareextracted.Thefirstextractedshortest-pathpatternis:X→is→album1bn→by→Ywithai=Mutter3bn,aj=Rammstein1bnforthefirstsentenceandai=AtomHeartMother1bn,aj=PinkFloyd1bnforthesecondone.Thesec-ondextractedshortest-pathpatternis:X→is→Ywithai=Mutter3bn,aj=album1bnforthefirstsentenceandai=AtomHeartMother1bn,aj=album1bnforthesecondone.Infact,ourextractionprocessseamlesslydiscoversgeneralknowledge(e.g.thatMutter3bnandAtomHeartMother1bnareinstancesoftheconceptalbum1bn)andfacts(e.g.thattheentitiesRammstein1bnandPinkFloyd1bnhaveanisAlbumByrelationwiththetworecordings).Apseudo-codefortheentireextractionalgorithmisshowninAlgorithm1:givenasetoftextualdefinitionsD,asetofrelationsisgeneratedoverextractionsR,witheachrelationR⊂RcomprisingrelationinstancesextractedfromD.Eachd∈Disfirstparsedanddisambiguatedtoproduceasyntactic-semanticgraphGsemd(Sections2.1-2.2);thenalltheconceptpairshsi,sjiareexaminedtodetectrelationinstancesasshortestpaths.Finally,wefilteroutfromtheresultingsetallrelationsforwhichthenumberofextractedinstancesisbelowafixedthresholdρ.2Theoverallalgorithmextractsover20millionrelationinstancesinourexperimentalsetup(Section5)withalmost256,000distinctrelations.3RelationTypeSignaturesandScoringWefurthercharacterizethesemanticsofourre-lationsbycomputingsemantictypesignaturesforeachR⊂R,i.e.byattachingapropersemanticclasstobothitsdomainandrange(thesetsofar-gumentsoccurringontheleftandrightofthepat-tern).AseveryelementinthedomainandrangeofRisdisambiguated,weretrievethecorrespondingsensesandcollecttheirdirecthypernyms.Thenweselectthehypernymcoveringthelargestsubsetofargumentsastherepresentativesemanticclassforthedomain(orrange)ofR.WeextracthypernymsusingBabelNet,wheretaxonomicinformationcov-ersbothgeneralconcepts(fromtheWordNettaxon-omy(Fellbaum,1998))andnamedentities(fromtheWikipediaBitaxonomy(Flatietal.,2014)).FromthedistributionofdirecthypernymsoverdomainandrangeargumentsofRweestimatethequalityofRandassociateaconfidencevaluewithitsrelationpatternr.Intuitivelywewanttoassignhigherconfidencetorelationswherethecorrespond-ingdistributionshavelowentropy.Forinstance,ifbothsetshaveasinglehypernymcoveringallargu-ments,thenRarguablycapturesawell-definedse-manticrelationandshouldbeassignedhighconfi-dence.ForeachrelationR,wecompute:HR=−nXi=1p(hi)log2p(hi)(1)wherehi(i=1,…,n)areallthedistinctargumenthypernymsoverthedomainandrangeofRandprobabilitiesp(hi)areestimatedfromthepropor-tionofargumentscoveredinsuchsets.ThelowerHR,thebettersemantictypesofRaredefined.Asamatteroffact,however,somevalidbutover-generalrelations(e.g.XisaY,XisusedforY)haveinher-entlyhighvaluesofHR.Toobtainabalancedscore,2InalltheexperimentsofSection6wesetρ=10.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
1
5
6
1
5
6
6
8
0
2
/
/
t
l
a
c
_
a
_
0
0
1
5
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
533
PatternScoreEntropyXdirectedbyY4025.801.74XknownforY2590.703.65Xiselectiondistrict1bnofY110.490.83Xiscomposer1bnfromY39.922.08Xisstreet1bnnamedafterY1.912.24Xisvillage2bnfoundedin1912inY0.910.18Table1:ExamplesofrelationscoresFigure2:Precisionagainstscore(R)(a)andHR(b)wethereforeconsidertwoadditionalfactors,i.e.thenumberofextractedinstancesforRandthelengthoftheassociatedpatternr,obtainingthefollowingempiricalmeasure:score(R)=|SR|(HR+1)length(r)(2)withSRbeingthesetofextractedrelationinstancesforR.The+1termaccountsforcaseswhereHR=0.AsshownintheexamplesofTable1,relationswithrathergeneralpatterns(suchasXknownforY)achievehigherscorescomparedtoveryspecificones(likeXisvillage2bnfoundedin1912inY)de-spitehigherentropyvalues.Wevalidatedourmea-sureonthesamplesofSection6.1,computingtheoverallprecisionfordifferentscorethresholds.ThemonotonicdecreaseofsampleprecisioninFigure2ashowsthatourmeasurecapturesthequalityofextractedpatternsbetterthanHR(Figure2b).4RelationTaxonomizationInthelaststageofourapproachoursetofex-tractedrelationsisarrangedautomaticallyinarela-tiontaxonomy.Theprocessiscarriedoutbycom-paringrelationspairwise,lookingforhypernymy-hyponymyrelationshipsbetweenthecorrespondingrelationpatterns;wethenbuildourtaxonomybyconnectingwithanedgethoserelationpairsforwhichsucharelationshipisfound.BoththerelationFigure3:Hypernym(a)andsubstring(b)generalizationstaxonomizationproceduresdescribedhereexaminenounnodesacrosseachrelationpatternr,andcon-siderfortaxonomizationonlythoserelationswhosepatternsareidenticalexceptforasinglenounnode.34.1HypernymGeneralizationAdirectwayofidentifyinghypernym/hyponymnounnodesacrossrelationpatternsistoanalyzethesemanticinformationattachedtothem.Giventworelationpatternsriandrj,differingonlyinrespectofthenounnodesniandnj,wefirstlookattheas-sociatedconceptsorentities,ciandcj,andretrievethecorrespondinghypernymsets,H(ci)andH(cj).HypernymsetsareobtainedbyiterativelycollectingthesuperclassesofciandcjfromthesemanticnetworkofBabelNet,uptoafixedheight.Forinstance,givenci=album1bn,H(ci)={workofart1bn,creation2bn,artifact1bn},andgivencj=Rammstein1bn,H(cj)={band2bn,musicalensemble1bn,organization1bn}.OncewehaveH(ci)andH(cj),wejustcheckwhethercj∈H(ci)orci∈H(cj)(Figure3a).Accordingtowhichisthecase,weconcludethatrjisageneralizationofri,orthatriisageneralizationofrj.4.2SubstringGeneralizationThesecondprocedurefocusesonthenoun(orcompound)representedbythenode.Giventwore-lationpatterns,riandrj,weapplythefollowingheuristic:fromoneofthetwonouns,beitni,anyadjunctormodifierisremoved,retainingthesoleheadwordˆni.Then,ˆniiscomparedwithnjand,ifˆni=nj,weassumethattherelationrjisagen-eralizationofri(Figure3b).3Thesimplifyingassumptionhereisthattwogivenrelationpatternsmaybeinahypernymy-hyponymyrelationshiponlywhentheirplainsyntacticstructureisequivalent(e.g.isN1byandisN2by,withN1andN2beingtwodistinctnounnodes).
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
1
5
6
1
5
6
6
8
0
2
/
/
t
l
a
c
_
a
_
0
0
1
5
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
534
DEFIENELLPATTYREVERBWISENETFreebaseDBpediaDistinctrelations255881298163153166474624593518941368Distinctrelations(disambiguated)240618——Averageextractionsperrelation81.687013.039.6822.169.24127727.9924451.48Distinctrelationinstances2035290320898831580294614728268227180724189788233449631Distinctconcepts/entitiesinvolved239898219960211087907332742516363076698823210338501Table2:Comparativestatisticsontherelationextractionprocess5ExperimentalSetupInput.TheinputcorpususedfortherelationextractionprocedureisthefullsetofEnglishtextualdefinitionsinBabelNet2.5(NavigliandPonzetto,2012).4Infact,anysetoftextualdefinitionscanbeprovidedasinputtoDEFIE,rangingfromexistingdictionaries(likeWordNetorWiktionary)tothesetoffirstsentencesofWikipediaarticles.5Asitisamergerforvariousdifferentresourcesofthiskind,BabelNetprovidesalargeheterogeneoussetcom-prisingdefinitionsfromWordNet,Wikipedia,Wik-tionary,WikidataandOmegaWiki.Tothebestofourknowledge,thissetconstitutesthelargestavail-ablecorpusofdefinitionalknowledge.Wethereforeworkedonatotalof4,357,327textualdefinitionsfromtheEnglishsynsetsofBabelNet’sknowledgebase.WethenusedthesameversionofBabelNetastheunderlyingsemanticnetworkstructurefordis-ambiguatingwithBabelfy.6Statistics.ComparativestatisticsareshowninTable2.DEFIEextracts20,352,903relationin-stances,outofwhich13,753,133featureafullydis-ambiguatedpattern,yieldinganaverageof3.15dis-ambiguatedrelationinstancesextractedfromeachdefinition.Aftertheextractionprocess,ourknowl-edgebasecomprises255,881distinctsemanticre-lations,94%ofwhichalsohavedisambiguatedcontentwordsintheirpatterns.DEFIEextractsaconsiderablylargeramountofrelationinstancescomparedtosimilarapproaches,despitethemuchsmalleramountoftextused.Forexample,weman-agedtoharvestover5millionrelationinstancesmorethanPATTY,usingamuchsmallercorpus(sin-4babelnet.org5AccordingtotheWikipediaguidelines,anarticleshouldbeginwithashortdeclarativesentence,definingwhat(orwho)isthesubjectandwhyitisnotable.6babelfy.orgglesentencesasopposedtofullWikipediaarticles)andgeneratinganumberofdistinctrelationsthatwassixtimeslessthanPATTY’s.Asaresult,weobtainedanaveragenumberofextractionsthatwassubstantiallyhigherthanthoseofourOIEcompeti-tors.ThissuggeststhatDEFIEisabletoexploitthenatureoftextualdefinitionseffectivelyandgeneral-izeoverrelationpatterns.Furthermore,oursemanticanalysiscaptured2,398,982distinctarguments(ei-therconceptornamedentities),outperformingal-mostallopen-textsystemsexamined.Evaluation.AlltheevaluationscarriedoutinSection6werebasedonmanualassessmentbytwohumanjudges,withaninter-annotatoragreement,asmeasuredbyCohen’skappacoefficient,above70%inallcases.IntheseevaluationswecomparedDE-FIEwiththefollowingOIEapproaches:•NELL(Carlsonetal.,2010)withknowledgebasebeliefsupdatedtoNovember2014;•PATTY(Nakasholeetal.,2012)withFree-basetypesandpatternsynsetsfromtheEnglishWikipediadumpofJune2011;•REVERB(Faderetal.,2011),usingthesetofnormalizedrelationinstancesfromtheClueWeb09dataset;•WISENET(MoroandNavigli,2012;MoroandNavigli,2013)withrelationalphrasesfromtheEnglishWikipediadumpofDecember2012.Inaddition,wealsocomparedourknowledgebasewithup-to-datehuman-contributedresources,namelyFreebase(Bollackeretal.,2008)andDBpe-dia(Lehmannetal.,2014),bothfromthedumpsofApril/May2014.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
1
5
6
1
5
6
6
8
0
2
/
/
t
l
a
c
_
a
_
0
0
1
5
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
535
Top100Top250Rand100Rand250DEFIE0.93±0.010.91±0.020.79±0.020.81±0.08PATTY0.93±0.05N/A0.80±0.08N/ATable3:PrecisionofrelationpatternsNELLPATTYREVERBWISENETFreebaseDBpediaTop100.571.238.214.155.571.461Rand100.942.711.596.635.904.880Table4:Noveltyoftheextractedinformation6Experiments6.1QualityofRelationsWefirstassessedthequalityandthesemanticconsistencyofourrelationsusingmanualevalua-tion.Werankedourrelationsaccordingtotheirscore(Section3)andthencreatedtwosamples(ofsize100and250respectively)ofthetopscoringrelations.Inordertoevaluatethelongtailoflessconfidentrelations,wecreatedanothertwosam-plesofthesamesizewithrandomlyextractedre-lations.Wepresentedthesesamplestoourhumanjudges,accompanyingeachrelationwithasetof50argumentpairsandthecorrespondingtextualdefini-tionsfromBabelNet.Foreachiteminthesampleweaskedwhetheritrepresentedameaningfulrela-tionandwhethertheextractedargumentpairswereconsistentwiththisrelationandthecorrespondingdefinitions.Iftheanswerwaspositive,therela-tionwasconsideredascorrect.Finallyweesti-matedtheoverallprecisionofthesampleastheproportionofcorrectitems.ResultsarereportedinTable3andcomparedtothoseobtainedbyourclosestcompetitor,PATTY,inthesettingofSec-tion5.InPATTYtheconfidenceofagivenpatternwasestimatedfromitsstatisticalstrength(Nakas-holeetal.,2012).AsshowninTable3,DEFIEachievedacomparablelevelofaccuracyineverysample.Anerroranalysisidentifiedmosterrorsasrelatedtothevaguenessofsomeshortandgeneralpatterns,e.g.XtakeY,XmakeY.Otherswerere-latedtoparsing(e.g.inlabelingtheheadwordofcomplexnounphrases)ordisambiguation.Inad-dition,weusedthesamesamplestoestimatethenoveltyoftheextractedinformationincompari-sontocurrentlyavailableresources.WeexaminedeachcorrectrelationpatternandlookedmanuallyforanequivalentrelationintheknowledgebasesGoldStandardDEFIEWISENETPATTY163131129126REVERBFreebaseDBpedia1226939Table5:CoverageofsemanticrelationsofbothourOIEcompetitorsandhuman-contributedresources.Forinstance,giventherelationXborninY,NELLandREVERBhavetheequivalentrela-tionspersonborninlocationandisbornin,whileFreebaseandDBpediahavePlaceofbirthandbirthPlacerespectively.Wethencomputedtheproportionof‘new’relationsamongthosepreviouslylabeledascorrectbyourhumanjudges.ResultsareshowninTable4forboththetop100sampleandtherandomsample.Thehighproportionofrelationsnotappearinginexistingre-sources(especiallyacrosstherandomsamples)sug-geststhatDEFIEiscapableofdiscoveringinforma-tionnotobtainablefromavailableknowledgebases,includingveryspecificrelations(XisblizzardinY,XisMayanlanguagespokenbyY,Xisgovernment-ownedcorporationinY),aswellasgeneralbutun-usualones(XusedbywriterofY).6.2CoverageofRelationsToassessthecoverageofDEFIEwefirsttestedourextractedrelationsonapublicdatasetde-scribedin(Nakasholeetal.,2012)andconsist-ingof163semanticrelationsmanuallyannotatedfromfiveWikipediapagesaboutmusicians.Fol-lowingthelineofpreviousworks(Nakasholeetal.,2012;MoroandNavigli,2013),foreachan-notationwesoughtarelationinourknowledgebasecarryingthesamesemantics.Resultsarere-portedinTable5.ConsistentlywiththeresultsinTable4,theproportionofnovelinformationplacesDEFIEinlinewithitsclosestcompetitors,achievingacoverageof80.3%withrespecttothegoldstandard.Examplesofrelationsnotcov-eredbyourcompetitorsarehasFatherInLawandhasDaughterInLaw.Furthermore,relationsholdingbetweenentitiesandgeneralconcepts(e.g.critizedFor,praisedFor,sentencedTo),arecapturedonlybyDEFIEandREVERB(which,however,lacksanyargumentsemantics).Wealsoassessedthecoverageofresourcesbased
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
1
5
6
1
5
6
6
8
0
2
/
/
t
l
a
c
_
a
_
0
0
1
5
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
536
FreebaseDBpediaNELLRandom10083%81%89%Table6:CoverageofmanuallycuratedresourcesPATTYWISENETRandom10066%69%Table7:CoverageofindividualrelationinstancesHyp.Gen.Substr.Gen.PATTY(Top)PATTY(Rand)Precision0.87±0.030.90±0.020.85±0.070.62±0.09#Edges4441220339Density1.89×10−67.64×10−9Table8:Precisionandcoverageoftherelationtaxonomyonhuman-definedsemanticrelations:weextractedthreerandomsamplesof100relationsfromFree-base,DBpediaandNELLandlookedforseman-ticallyequivalentrelationsinourknowledgebase.AsshowninTable6,DEFIEreportsacoveragebe-tween81%and89%dependingontheresource,fail-ingtocovermostlyrelationsthatrefertonumericalproperties(e.g.numberOfMembers).Finally,wetestedthecoverageofDEFIEoverin-dividualrelationinstances.Weselectedarandomsampleof100triplesfromthetwoclosestcom-petitorsexploitingtextualcorpora,i.e.PATTYandWISENET.Foreachselectedtriplehai,r,aji,wesoughtanequivalentrelationinstanceinourknowl-edgebase,i.e.onecomprisingaiandajandare-lationpatternexpressingthesamesemanticrelationofr.ResultsinTable7showacoveragegreaterthan65%overbothsamples.Giventhedramaticre-ductionofcorpussizeandthehighprecisionoftheitemsextracted,thesefiguresdemonstratethatdef-initionalknowledgeisextremelyvaluableforrela-tionextractionapproaches.Thismightsuggestthat,eveninlarge-scaleOIE-basedresources,asubstan-tialamountofknowledgeislikelytocomefromarathersmallersubsetofdefinitionalsentenceswithinthesourcecorpus.6.3QualityofRelationTaxonomizationWeevaluatedourrelationtaxonomybymanuallyassessingtheaccuracyofourtaxonomizationheuris-tics.ThenwecomparedourresultsagainstPATTY,theonlysystemamongourclosestcompetitorsthatgeneratesataxonomyofrelations.ThesettingforthisevaluationwasthesameofthatofSection6.1.However,aswelackedaconfidencemeasureinthiscase,wejustextractedarandomsampleof200hy-pernymedgesforeachgeneralizationprocedure.Wepresentedthesesamplestoourhumanjudgesand,foreachhypernymedge,weaskedwhetherthecor-respondingpairofrelationsrepresentedacorrectgeneralization.Wethenestimatedtheoverallpreci-sionastheproportionofedgesregardedascorrect.ResultsarereportedinTable8,alongwithPATTY’sresultsinthesettingofSection5;asPATTY’sedgesarerankedbyconfidence,weconsid-eredbothitstopconfident100subsumptionsandarandomsampleofthesamesize.AsshowninTable8,DEFIEoutperformsPATTYintermsofprecision,andgeneratesmorethantwicethenumberofedgesoverall.HARPY(GrycnerandWeikum,2014)en-richesPATTY’staxonomywith616,792hypernymedges,butitsalignmentalgorithm,inthesettingofSection5,alsoincludestransitiveedgesandstillyieldsasparsertaxonomycomparedtoours,withagraphdensityof2.32×10−7.Generalizationerrorsinourtaxonomyaremostlyrelatedtodisambigua-tionerrorsorflawsintheWikipediaBitaxonomy(e.g.theconceptTitularChurch1bnmarkedashyponymofCardinal1bn).6.4QualityofEntityLinkingandDisambiguationWeevaluatedthedisambiguationstageofDEFIE(Section2.1)bycomparingBabelfyagainstotherstate-of-the-artentitylinkingsystems.Inordertocomparedifferentdisambiguationoutputswese-lectedarandomsampleof60,000glossesfromtheinputcorpusoftextualdefinitions(Section5)andrantherelationextractionalgorithm(Sections2.1-2.3)usingadifferentcompetitorinthedisambigua-tionstepeachtime.Weeventuallyusedthemap-pingsinBabelNettoexpresseachoutputusingacommondictionaryandsenseinventory.Thecoverageobtainedbyeachcompetitorwasas-sessedbylookingatthenumberofdistinctrelationsextractedintheprocess,thetotalnumberofrelationinstancesextracted,thenumberofdistinctconceptsorentitiesinvolved,andtheaveragenumberofse-manticnodeswithintherelationpatterns.Foreachcompetitor,wealsoassessedtheprecisionobtainedbyevaluatingthequalityandsemanticconsistencyoftherelationpatterns,inthesamemannerasin
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
1
5
6
1
5
6
6
8
0
2
/
/
t
l
a
c
_
a
_
0
0
1
5
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
537
#Relations#Triples#EntitiesAverageSem.NodesBabelfy96434233517799982.37TagME2.088638226905893181.67WAT2408356503381470.39DBpediaSpotlight67377140711382541.45WikipediaMiner3954788777370360.96Table9:CoveragefordifferentdisambiguationsystemsRelationsRelationinstancesBabelfy82.3%76.6%TagME2.076.0%62.0%WAT84.6%72.6%DBpediaSpotlight70.5%62.6%WikipediaMiner71.7%56.0%Table10:PrecisionfordifferentdisambiguationsystemsSection6.1,bothatthelevelofsemanticrelations(onthetop150relationpatterns)andatthelevelofindividualrelationinstances(onarandomlyex-tractedsampleof150triples).ResultsareshowninTables9and10forBabelfyandthefollowingsys-tems:•TagME2.07(FerraginaandScaiella,2012),whichlinkstextfragmentstoWikipediabasedonmeasureslikesensecommonnessandkeyphraseness(MihalceaandCsomai,2007);•WAT(PiccinnoandFerragina,2014),anen-tityannotatorthatimprovesoverTagMEandfeaturesare-designedspotting,disambiguationandpruningpipeline;•DBpediaSpotlight8(Mendesetal.,2011),whichannotatestextdocumentswithDBpediaURIsusingscoressuchasprominence,topicalrelevanceandcontextualambiguity;•WikipediaMiner9(MilneandWitten,2013),whichcombinesparallelizedprocessingofWikipediadumps,relatednessmeasuresandannotationfeatures.AsshowninTable9,Babelfyoutperformsallitscompetitorsintermsofcoverageand,duetoitsunifiedwordsensedisambiguationandentitylink-ingapproach,extractssemanticallyricherpatterns7tagme.di.unipi.it8spotlight.dbpedia.org9wikipediadataminer.cms.waikato.ac.nz#DefinitionsProportion(%)Wikipedia389908789.50Wikidata3644848.35WordNet413560.95Wiktionary393830.90OmegaWiki130170.30Table11:Compositionoftheinputcorpusbysource#Relations#RelationinstancesAvg.ExtractionsWikipedia2519541945599277.58Wikidata54141033732191.01WordNet226012820056.73Wiktionary286314399050.52OmegaWiki11684581839.45Table12:Impactofeachsourceontheextractionstepwith2.37semanticnodesontheaveragepersen-tence.Thisreflectsonthequalityofsemanticrela-tions,reportedinTable10,withanoverallincreaseofprecisionbothintermsofrelationsandintermsofindividualinstances;eventhoughWATshowsslightlyhigherprecisionoverrelations,itsconsid-erablylowercoverageyieldssemanticallypoorpat-terns(0.39semanticnodesontheaverage)andim-pactsontheoverallqualityofrelations,wheresomeambiguityisnecessarilyretained.Asanexample,thepatternXisstationinY,extractedfromWAT’sdisambiguationoutput,coversbothrailwaystationsandradiobroadcasts.Babelfyproduces,instead,twodistinctrelationpatternsforeachsense,tag-gingstationasrailwaystation1bnforthefor-merandstation5bnforthelatter.6.5ImpactofDefinitionSourcesWecarriedoutanempiricalanalysisovertheinputcorpusinourexperimentalsetup,studyingtheimpactofeachsourceoftextualdefinitionsinisolation.Infact,asexplainedinSection5,BabelNet’stextualdefinitionscomefromvariousresources:WordNet,Wikipedia,Wikidata,Wik-tionaryandOmegaWiki.Table11showsthecom-positionoftheinputcorpuswithrespecttoeachofthesedefinitionsources.Thedistributionisratherskewed,withthevastmajorityofdefinitionscomingfromWikipedia(almost90%oftheinputcorpus).Werantherelationextractionalgorithm(Sections2.1-2.3)oneachsubsetoftheinputcorpus.Asinpreviousexperiments,wereportthenumberofre-lationinstancesextracted,thenumberofdistinctre-
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
1
5
6
1
5
6
6
8
0
2
/
/
t
l
a
c
_
a
_
0
0
1
5
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
538
#Wikipages#Sentences#ExtractionsPrecisionAll140722258673968461.8%Top100103341617691368759.0%Table13:Extractionresultsovernon-definitionaltext#Relationinstances#Relations#EdgesPATTY(definitions)3212065415934785PATTY(Wikipedia)15802946163153120339Oursystem2080773225588144412Table14:PerformanceofPATTYondefinitionaldatalations,andtheaveragenumberofextractionsforeachrelation.Results,asshowninTable12,areconsistentwiththecompositionoftheinputcor-pusinTable11:byrelyingsolelyonWikipedia’sfirstsentences,theextractionalgorithmdiscovered98%ofallthedistinctrelationsidentifiedacrossthewholeinputcorpus,and93%ofthetotalnum-berofextractedinstances.Wikidataprovidesmorethan1millionextractions(5%ofthetotal)butdef-initionsarerathershortandmostofthem(44.2%)generateonlyis-arelationinstances.Theremain-ingsources(WordNet,Wiktionary,OmegaWiki)ac-countforlessthan2%oftheextractions.6.6ImpactoftheApproachvs.ImpactoftheDataDEFIE’srelationextractionalgorithmisexplic-itlydesignedtotargettextualdefinitions.Hence,theresultitachievesisduetothemutualcontributionoftwokeyfeatures:anOIEapproachandtheuseofdefinitionaldata.Inordertodecouplethesetwofactorsandstudytheirrespectiveim-pacts,wecarriedouttwoexperiments:firstweappliedDEFIEtoasampleofnon-definitionaltext;thenweappliedourclosestcompetitor,PATTY,onthesamedefinitioncorpusdescribedinSection5.Extractionfromnon-definitionaltext.WeselectedarandomsampleofWikipediapagesfromtheEnglishWikipediadumpofOctober2012.WeprocessedeachsentenceasinSections2.1-2.2andextractedinstancesofthoserelationsproducedbyDEFIEintheoriginaldefinitionalsetting(Section5);wethenautomaticallyfilteredoutthoseinstanceswherethearguments’hypernymsdidnotagreewiththesemantictypesoftherelation.WeevaluatedmanuallythequalityofextractionsonasampleofSourceLabelTargetenzyme1bncatalyzesreaction1bnofchemical1bnalbum1bnrecordedbyrockgroup1bnofficier1bncommandedbrigade1bnofarmyunit1bnbridge1bncrossesoverriver1bnacademicjournal1bncoversresearch1bninscience1bnorganization1bnhasheadquarters3bnincity1bnTable15:Examplesofaugmentedsemanticedges100items(asinSection6.1)forboththefullsetofextractedinstancesandforthesubsetofextractionsfromthetop100scoringrelations.ResultsarereportedinTable13:inbothcases,precisionfiguresshowthatextractionqualitydropsconsistentlyincomparisontoSection6.1,suggestingthatourextractionapproachbyitselfislessaccuratewhenmovingtomorecomplexsentences(with,e.g.,subordinateclausesorcoreferences).PATTYontextualdefinitions.Sincenoopen-sourceimplementationofPATTYisavailable,weimplementedaversionofthealgorithmwhichusesBABELFYfornamedentitydisambiguation.WethenranitonourcorpusofBabelNetdefinitionsandcomparedtheresultsagainstthoseoriginallyob-tainedbyPATTY(ontheentireWikipediacorpus)andthoseobtainedbyDEFIE.FiguresarereportedinTable14intermsofnumberofextractedrelationinstances,distinctrelationsandhypernymedgesintherelationtaxonomy.Resultsshowthatthedra-maticreductionofcorpussizeaffectsthesupportsetsofPATTY’srelations,worseningbothcoverageandgeneralizationcapability.6.7PreliminaryStudy:ResourceEnrichmentTofurtherinvestigatethepotentialofourap-proach,weexploredtheapplicationofDEFIEtotheenrichmentofexistingresources.WefocusedonBabelNetasacasestudy.InBabelNet’sseman-ticnetwork,nodesrepresentingconceptsanden-titiesareonlyconnectedvialexicograhicrelation-shipsfromWordNet(hypernymy,meronymy,etc.)orunlabelededgesderivedfromWikipediahyper-links.Ourextractionalgorithmhasthepotentialtoprovideusefulinformationtobothaugmentunla-belededgeswithlabelsandexplicitsemanticcon-tent,andcreateadditionalconnectionsbasedonse-manticrelations.ExamplesareshowninTable15.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
1
5
6
1
5
6
6
8
0
2
/
/
t
l
a
c
_
a
_
0
0
1
5
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
539
#Conceptpairs#Unlabeled#LabeledTypesignatures140329990Relationinstances84935883401677551331Table16:ConceptpairsandassociatededgesinBabelNetWecarriedoutapreliminaryanalysisoveralldis-ambiguatedrelationswithatleast10extractedin-stances.Foreachrelationpatternr,wefirstexam-inedtheconceptpairsassociatedwithitstypesigna-turesandlookedinBabelNetforanunlabelededgeconnectingthepair.ThenweexaminedthewholesetofextractedrelationinstancesinRandlookedinBabelNetforanunlabelededgeconnectingtheargu-mentsaiandaj.ResultsinTable16showthatonly27.7%oftheconceptpairsrepresentingrelationtypesignaturesareconnectedinBabelNet,andmostoftheseconnectionsareunlabeled.Bythesametoken,morethan4milliondistinctargumentpairs(53.5%)donotshareanyedgeinthesemanticnetworkand,amongthosethatdo,lessthan14%havealabeledrelationship.Theseproportionssuggestthatourre-lationsprovideapotentialenrichmentoftheunder-lyingknowledgebaseintermsofbothconnectivityandlabelingofexistingedges.InBabelNet,ourcasestudy,cross-resourcemappingsmightalsopropa-gatethisinformationacrossotherknowledgebasesandrephrasesemanticrelationsintermsof,e.g.,au-tomaticallygeneratedWikipediahyperlinks.7RelatedWorkFromtheearliestdays,OIEsystemshadtocopewiththedimensionandheterogeneityofhugeun-structuredsourcesoftext.Thefirstsystemsem-ployedstatisticaltechniquesandreliedheavilyoninformationredundancy.Then,assoonassemi-structuredresourcescameintoplay(Hovyetal.,2013),researchersstarteddevelopinglearningsys-temsbasedonself-supervision(WuandWeld,2007)anddistantsupervision(Mintzetal.,2009;Krauseetal.,2012).Crucialissuesindistantsupervision,likenoisytrainingdata,havebeenaddressedinvar-iousways:probabilisticgraphicalmodels(Riedeletal.,2010;Hoffmannetal.,2011),sophisticatedmulti-instancelearningalgorithms(Surdeanuetal.,2012),matrixfactorizationtechniques(Riedeletal.,2013),labeleddatainfusion(Pershinaetal.,2014)orcrowd-basedhumancomputing(Kondreddietal.,2014).Adifferentstrategyconsistsofmovingfromopentextextractiontomoreconstrainedsettings.Forinstance,theKNOWLEDGEVAULT(Dongetal.,2014)combinesWeb-scaleextractionwithpriorknowledgefromexistingknowledgebases;BIPER-PEDIA(Guptaetal.,2014)reliesonschema-levelattributesfromthequerystreaminordertocreateanontologyofclass-attributepairs;RENOUN(Yahyaetal.,2014)inturnexploitsBIPERPEDIAtoextractfactsexpressedasnounphrases.DEFIEfocuses,in-stead,onsmalleranddensercorporaofprescriptiveknowledge.Althoughearlyworks,suchasMindNet(Richardsonetal.,1998),hadalreadyhighlightedthepotentialoftextualdefinitionsforextractingre-liablesemanticinformation,noOIEapproachtothebestofourknowledgehasexploiteddefinitionaldatatoextractanddisambiguatealargeknowledgebaseofsemanticrelations.Thedirectionofmostpapers(especiallyintherecentOIEliterature)seemsrathertheopposite,namely,totargetWeb-scalecorpora.Incontrast,wemanagetoextractalargeamountofhigh-qualityinformationbycombininganOIEun-supervisedapproachwithdefinitionaldata.Adeeperlinguisticanalysisconstitutesthefo-cusofmanyOIEapproaches.Syntacticdependen-ciesareusedtoconstructgeneralrelationpatterns(Nakasholeetal.,2012),ortoimprovethequal-ityofsurfacepatternrealizations(MoroandNav-igli,2013).Phenomenalikesynonymyandpoly-semyhavebeenaddressedwithkernel-basedsimi-laritymeasuresandsoftclusteringtechniques(Minetal.,2012;MoroandNavigli,2013),orexploitingthesemantictypesofrelationarguments(Nakasholeetal.,2012;MoroandNavigli,2012).Anappro-priatemodelingofsemantictypes(e.g.selectionalpreferences)constitutesalineofresearchbyitself,rootedinearlierworkslike(Resnik,1996)andfo-cusedoneitherclass-based(ClarkandWeir,2002),orsimilarity-based(Erk,2007),approaches.How-ever,thesemethodsareusedtomodeltheseman-ticsofverbsratherthanarbitrarypatterns.Morere-centlysomestrategiesbasedontopicmodelinghavebeenproposed,eithertoinferlatentrelationseman-tictypesfromOIErelations(Ritteretal.,2010),ortodirectlylearnanontologicalstructurefromastart-ingsetofrelationinstances(Movshovitz-AttiasandCohen,2015).However,theknowledgegeneratedisoftenhardtointerpretandintegratewithexisting
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
1
5
6
1
5
6
6
8
0
2
/
/
t
l
a
c
_
a
_
0
0
1
5
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
540
knowledgebaseswithouthumanintervention(Rit-teretal.,2010).Inthisrespect,thesemanticpredi-catesproposedbyFlatiandNavigli(2013)seemtobemorepromising.Anoveltyinourapproachisthatissueslikepoly-semyandsynonymyareexplicitlyaddressedwithaunifiedentitylinkinganddisambiguationalgorithm.Byincorporatingexplicitsemanticcontentinourre-lationpatterns,notonlydowemakerelationslessambiguous,butwealsoabstractawayfromspecificlexicalizationsofthecontentwordsandmergeto-gethermanypatternsconveyingthesamesemantics.Ratherthanusingplaindependencieswealsoinjectexplicitsemanticcontentintothedependencygraphtogenerateaunifiedsyntactic-semanticrepresenta-tion.Previousworks(Moroetal.,2013)usedsimi-larsemanticgraphrepresentationstoproducefilter-ingrulesforrelationextraction,buttheyrequiredastartingsetofrelationpatternsanddidnotexploitsyntacticinformation.Ajointapproachofsyntactic-semanticanalysisoftextwasusedinworkssuchas(Laoetal.,2012),buttheyaddressedasubstan-tiallydifferenttask(inferenceforknowledgebasecompletion)andassumedaradicallydifferentset-ting,withapredefinedstartingsetofsemanticre-lationsfromagivenknowledgebase.Asween-forceanOIEapproach,wedonothavesuchrequire-mentsanddirectlyprocesstheinputtextviaparsinganddisambiguation.ThisenablesDEFIEtogener-aterelationsalreadyintegratedwithresourceslikeWordNetandWikipedia,withoutadditionalalign-mentsteps(GrycnerandWeikum,2014),orseman-tictypepropagations(Linetal.,2012).AsshowninSection6.3,explicitsemanticcontentwithinre-lationpatternsunderpinsarichandhigh-qualityre-lationtaxonomy,whereasgeneralizationin(Nakas-holeetal.,2012)islimitedtosupportsetinclusionandleadstosparserandlessaccurateresults.8ConclusionandFutureWorkWepresentedDEFIE,anapproachtoOIEthat,thankstoanovelunifiedsyntactic-semanticanaly-sisoftext,harvestsinstancesofsemanticrelationsfromacorpusoftextualdefinitions.DEFIEex-tractsknowledgeonalargescale,reducingdatasparsityanddisambiguatingbothargumentsandre-lationpatternsatthesametime.Unlikeprevioussemantically-enhancedapproaches,mostlyrelyingonthesemanticsofargumenttypes,DEFIEisabletosemantifyrelationphrasesaswell,byprovidingexplicitlinkstotheunderlyingknowledgebase.Weleveragedaninputcorpusof4.3milliondefinitionsandextractedover20millionrelationinstances,withmorethan250,000distinctrelationsandalmost2.4millionconceptsandentitiesinvolved.Fromtheserelationsweautomaticallyconstructedahigh-qualityrelationtaxonomybyexploitingtheexplicitsemanticcontentoftherelationpatterns.Intheresultingknowledgebaseconceptsandentitiesarelinkedtoexistingresources,suchasWordNetandWikipedia,viatheBabelNetsemanticnetwork.WeevaluatedDEFIEintermsofprecision,coverage,noveltyofinformationincomparisontoexistingre-sourcesandqualityofdisambiguation,andwecom-paredourrelationtaxonomyagainststate-of-the-artsystemsobtaininghighlycompetitiveresults.Akeyfeatureofourapproachisitsdeepsyntactic-semanticanalysistargetedtotextualdef-initions.Incontrasttoourcompetitors,wheresyn-tacticconstraintsarenecessaryinordertokeeppre-cisionhighwhendealingwithnoisydata,DEFIEshowscomparable(orgreater)performancesbyex-ploitingadense,noise-freedefinitionalsetting.DE-FIEgeneratesalargeknowledgebase,inlinewithcollaboratively-builtresourcesandstate-of-the-artOIEsystems,butusesamuchsmalleramountofin-putdata:ourcorpusofdefinitionscompriseslessthan83milliontokensoverall,whileotherOIEsys-temsexploitmassivecorporalikeWikipedia(typi-callymorethan1.5billiontokens),ClueWeb(morethan33billiontokens),ortheWebitself.Fur-thermore,oursemanticanalysisbasedonBabelfyenablesthediscoveryofsemanticconnectionsbe-tweenbothgeneralconceptsandnamedentities,withthepotentialtoenrichexistingstructuredandsemi-structuredresources,asweshowedinapre-liminarystudyonBabelNet(cf.Section6.7).Asthenextstep,weplantoapplyDEFIEtoopentextandintegrateitwithdefinitionextractionandautomaticglossfindingalgorithms(NavigliandVe-lardi,2010;Dalvietal.,2015).Also,byfurtherex-ploitingtheunderlyingknowledgebase,inferenceandlearningtechniques(Laoetal.,2012;Wangetal.,2015)canbeappliedtocomplementourmodel,generatingnewtriplesorcorrectingwrongones.Fi-
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
1
5
6
1
5
6
6
8
0
2
/
/
t
l
a
c
_
a
_
0
0
1
5
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
541
nally,anotherfutureperspectiveistoleveragetheincreasinglylargevarietyofmultilingualresources,likeBabelNet,andmovetowardsthemodelingoflanguage-independentrelations.AcknowledgmentsTheauthorsgratefullyacknowledgethesupportoftheERCStartingGrantMultiJEDINo.259234.ThisresearchwasalsopartiallysupportedbyGooglethroughaFacultyResearchAwardgrantedinJuly2012.ReferencesKurtBollacker,ColinEvans,PraveenParitosh,TimSturge,andJamieTaylor.2008.Freebase:ACollab-orativelyCreatedGraphDatabaseForStructuringHu-manKnowledge.InProceedingsofSIGMOD,pages1247–1250.AndrewCarlson,JustinBetteridge,BryanKisiel,BurrSettles,EstevamR.HruschkaJr.,andTomM.Mitchell.2010.TowardanArchitectureforNever-EndingLanguageLearning.InProceedingsofAAAI,pages1306–1313.StephenClarkandJamesR.Curran.2007.Wide-coverageEfficientStatisticalParsingwithCCGandLog-LinearModels.ComputationalLinguistics,33(4):493–552.StephenClarkandDavidWeir.2002.Class-BasedProb-abilityEstimationUsingaSemanticHierarchy.Com-putationalLinguistics,28(2):187–206.BhavanaDalvi,EinatMinkov,ParthaP.Talukdar,andWilliamW.Cohen.2015.AutomaticGlossFindingforaKnowledgeBaseusingOntologicalConstraints.InProceedingsofWSDM,pages369–378.ClaudioDelliBovi,LuisEspinosaAnke,andRobertoNavigli.2015.KnowledgeBaseUnificationviaSenseEmbeddingsandDisambiguation.InProceedingsofEMNLP,pages726–736.XinDong,EvgeniyGabrilovich,GeremyHeitz,WilkoHorn,NiLao,KevinMurphy,ThomasStrohmann,ShaohuaSun,andWeiZhang.2014.KnowledgeVault:aWeb-ScaleApproachtoProbabilisticKnowl-edgeFusion.InProceedingsofKDD,pages601–610.ArnabDutta,ChristianMeilicke,andSimonePaoloPonzetto.2014.AProbabilisticApproachforInte-gratingHeterogeneousKnowledgeSources.InPro-ceedingsofESWC,pages286–301.KatrinErk.2007.ASimple,Similarity-basedModelforSelectionalPreferences.InProceedingsofACL,page216–223.OrenEtzioni,MicheleBanko,StephenSoderland,andDanielS.Weld.2008.OpenInformationExtractionfromtheWeb.Commun.ACM,51(12):68–74.AnthonyFader,StephenSoderland,andOrenEtzioni.2011.IdentifyingRelationsforOpenInformationExtraction.InProceedingsofEMNLP,pages1535–1545.ChristianeFellbaum.1998.WordNet:AnElectronicLexicalDatabase.BradfordBooks.PaoloFerraginaandUgoScaiella.2012.FastandAccu-rateAnnotationofShortTextswithWikipediaPages.IEEESoftware,29(1):70–75.TizianoFlatiandRobertoNavigli.2013.SPred:Large-scaleHarvestingofSemanticPredicates.InProceed-ingsofACL,pages1222–1232.TizianoFlati,DanieleVannella,TommasoPasini,andRobertoNavigli.2014.TwoIsBigger(andBetter)ThanOne:theWikipediaBitaxonomyProject.InPro-ceedingsofACL,pages945–955.RobertW.Floyd.1962.Algorithm97:ShortestPath.CommunicationsoftheACM,5(6):345–345.AdamGrycnerandGerhardWeikum.2014.HARPY:HypernymsandAlignmentofRelationalParaphrases.InProceedingsofCOLING,pages2195–2204.RahulGupta,AlonHalevy,XuezhiWang,StevenEui-jongWhang,andFeiWu.2014.Biperpedia:AnOntologyforSearchApplications.InProceedingsofVLDB,pages505–516.RaphaelHoffmann,CongleZhang,XiaoLing,LukeZettlemoyer,andDanielS.Weld.2011.Knowledge-basedWeakSupervisionforInformationExtractionofOverlappingRelations.InProceedingsofNAACLHLT,pages541–540.EduardHovy,RobertoNavigli,andSimonePaoloPonzetto.2013.Collaborativelybuiltsemi-structuredcontentandArtificialIntelligence:Thestorysofar.ArtificialIntelligence,194:2–27.SarathKumarKondreddi,PeterTriantafillou,andGer-hardWeikum.2014.CombiningInformationExtrac-tionandHumanComputingforCrowdsourcedKnowl-edgeAcquisition.InProceedingsofICDE,pages988–999.SebastianKrause,HongLi,HansUszkoreit,andFeiyuXu.2012.Large-ScaleLearningofRelation-ExtractionRuleswithDistantSupervisionfromtheWeb.InProceedingsofISWC.NiLao,AmarnagSubramanya,FernandoPereira,andWilliamW.Cohen.2012.ReadingtheWebwithLearnedSyntactic-SemanticInferenceRules.InPro-ceedingsofEMNLP-CoNLL,pages1017–1026.JensLehmann,RobertIsele,MaxJakob,AnjaJentzsch,DimitrisKontokostas,PabloN.Mendes,SebastianHellmann,MohamedMorsey,PatrickvanKleef,
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
1
5
6
1
5
6
6
8
0
2
/
/
t
l
a
c
_
a
_
0
0
1
5
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
542
SörenAuer,andChristianBizer.2014.DBpedia-ALarge-scale,MultilingualKnowledgeBaseExtractedfromWikipedia.SemanticWebJournal,pages1–29.ThomasLin,Mausam,andOrenEtzioni.2012.NoNounPhraseLeftBehind:DetectingandTypingUn-linkableEntities.InProceedingsofEMNLP-CoNLL,pages893–903.FarzanehMahdisoltani,JoannaBiega,andFabianM.Suchanek.2015.YAGO3:AKnowledgeBasefromMultilingualWikipedias.InCIDR.PabloN.Mendes,MaxJakob,AndrésGarcía-Silva,andChristianBizer.2011.DBPediaSpotlight:SheddingLightontheWebofDocuments.InProceedingsofI-Semantics,pages1–8.RadaMihalceaandAndrasCsomai.2007.Wikify!:LinkingDocumentstoEncyclopedicKnowledge.InProceedingsofCIKM,pages233–242.DavidMilneandIanH.Witten.2013.AnOpen-SourceToolkitforMiningWikipedia.ArtificialIntelligence,194:222–239.BonanMin,ShumingShi,RalphGrishman,andChin-YewLin.2012.EnsembleSemanticsforLarge-scaleUnsupervisedRelationExtraction.InProceedingsofEMNLP-CoNLL,pages1027–1037.MikeMintz,StevenBills,RionSnow,andDanJuraf-sky.2009.DistantSupervisionforRelationExtrac-tionWithoutLabeledData.InProceedingsofACL-IJCNLP,pages1003–1011.TomM.Mitchell.2005.ReadingtheWeb:ABreak-throughGoalforAI.AIMagazine.AndreaMoroandRobertoNavigli.2012.WiSeNet:BuildingaWikipedia-basedSemanticNetworkwithOntologizedRelations.InProceedingsofCIKM,pages1672–1676.AndreaMoroandRobertoNavigli.2013.IntegratingSyntacticandSemanticAnalysisintotheOpenInfor-mationExtractionParadigm.InProceedingsofIJCAI,pages2148–2154.AndreaMoro,HongLi,SebastianKrause,FeiyuXu,RobertoNavigli,andHansUszkoreit.2013.SemanticRuleFilteringforWeb-ScaleRelationExtraction.InProceedingsofISWC,pages347–362.AndreaMoro,AlessandroRaganato,andRobertoNav-igli.2014.EntityLinkingmeetsWordSenseDisam-biguation:aUnifiedApproach.TACL,2:231–244.DanaMovshovitz-AttiasandWilliamW.Cohen.2015.KB-LDA:JointlyLearningaKnowledgeBaseofHi-erarchy,Relations,andFacts.InProceedingsofACL.NdapandulaNakashole,GerhardWeikum,andFabianM.Suchanek.2012.PATTY:ATaxonomyofRela-tionalPatternswithSemanticTypes.InProceedingsofEMNLP-CoNLL,pages1135–1145.ViviNastaseandMichaelStrube.2013.Transform-ingWikipediaintoaLargeScaleMultilingualConceptNetwork.ArtificialIntelligence,194:62–85.RobertoNavigliandSimonePaoloPonzetto.2012.Ba-belNet:TheAutomaticConstruction,EvaluationandApplicationofaWide-CoverageMultilingualSeman-ticNetwork.ArtificialIntelligence,193:217–250.RobertoNavigliandPaolaVelardi.2010.LearningWord-classLatticesforDefinitionandHypernymEx-traction.InProceedingsofACL,pages1318–1327.MariaPershina,BonanMin,WeiXu,andRalphGrish-man.2014.InfusionofLabeledDataintoDistantSu-pervisionforRelationExtraction.InProceedingsofACL,pages732–738.FrancescoPiccinnoandPaoloFerragina.2014.FromTagMEtoWAT:aNewEntityAnnotator.InProceed-ingsofERD,pages55–62.SimonePaoloPonzettoandMichaelStrube.2011.Tax-onomyInductionBasedonaCollaborativelyBuiltKnowledgeRepository.ArtificialIntelligence,175(9-10):1737–1756.PhilipResnik.1996.SelectionalConstraints:AnInformation-TheoreticModelanditsComputationalRealization.Cognition,61(1-2):127–159.StephenD.Richardson,WilliamB.Dolan,andLucyVan-derwende.1998.MindNet:AcquiringandStructur-ingSemanticInformationfromText.InProceedingsofACL,pages1098–1102.SebastianRiedel,LiminYao,andAndrewMcCallum.2010.ModelingRelationsandTheirMentionswith-outLabeledText.InProceedingsofECML-PKDD,pages148–163.SebastianRiedel,LiminYao,AndrewMcCallum,andBenjaminM.Marlin.2013.RelationExtractionwithMatrixFactorizationandUniversalSchemas.InPro-ceedingsofNAACLHLT,pages74–84.AlanRitter,Mausam,andOrenEtzioni.2010.ALa-tentDirichletAllocationMethodforSelectionalPref-erences.InProceedingsofACL,pages424–434.MarkSteedman.2000.TheSyntacticProcess.MITPress,Cambridge,MA,USA.MihaiSurdeanu,JulieTibshirani,RameshNallapati,andChristopherD.Manning.2012.Multi-instanceMulti-labelLearningforRelationExtraction.InProceedingsofEMNLP-CoNLL,pages455–465.DennyVrandeˇci´c.2012.Wikidata:ANewPlatformforCollaborativeDataCollection.InProceedingsofWWW,pages1063–1064.WilliamYangWang,KathrynMazaitis,NiLao,TomM.Mitchell,andWilliamW.Cohen.2015.EfficientIn-ferenceandLearninginaLargeKnowledgeBase-ReasoningwithExtractedInformationusingaLocallyGroundableFirst-OrderProbabilisticLogic.MachineLearning,100(1):101–126.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
1
5
6
1
5
6
6
8
0
2
/
/
t
l
a
c
_
a
_
0
0
1
5
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
543
FeiWuandDanielS.Weld.2007.AutonomouslySemantifyingWikipedia.InProceedingsofCIKM,pages41–50.MohamedYahya,StevenEuijongWhang,RahulGupta,andAlonHalevy.2014.ReNoun:FactExtractionforNominalAttributes.InProceedingsofEMNLP,pages325–335.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
1
5
6
1
5
6
6
8
0
2
/
/
t
l
a
c
_
a
_
0
0
1
5
6
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
544