Transactions of the Association for Computational Linguistics, vol. 6, pp. 287–302, 2018. Action Editor: Katrin Erk.

Submission batch: 10/2017; Revision batch: 2/2018; Published 5/2018.

© 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Constructing Datasets for Multi-hop Reading Comprehension Across Documents

Johannes Welbl¹, Pontus Stenetorp¹, Sebastian Riedel¹,²
¹University College London, ²Bloomsbury AI
{j.welbl,p.stenetorp,s.riedel}@cs.ucl.ac.uk

Abstract

Most Reading Comprehension methods limit themselves to queries which can be answered using a single sentence, paragraph, or document. Enabling models to combine disjoint pieces of textual evidence would extend the scope of machine comprehension methods, but currently no resources exist to train and test this capability. We propose a novel task to encourage the development of models for text understanding across multiple documents and to investigate the limits of existing methods. In our task, a model learns to seek and combine evidence – effectively performing multi-hop, alias multi-step, inference. We devise a methodology to produce datasets for this task, given a collection of query-answer pairs and thematically linked documents. Two datasets from different domains are induced (footnote 1), and we identify potential pitfalls and devise circumvention strategies. We evaluate two previously proposed competitive models and find that one can integrate information across documents. However, both models struggle to select relevant information; and providing documents guaranteed to be relevant greatly improves their performance. While the models outperform several strong baselines, their best accuracy reaches 54.5% on an annotated test set, compared to human performance at 85.0%, leaving ample room for improvement.

Footnote 1: Available at http://qangaroo.cs.ucl.ac.uk

1 Introduction

Devising computer systems capable of answering questions about knowledge described using text has been a longstanding challenge in Natural Language Processing (NLP). Contemporary end-to-end Reading Comprehension (RC) methods can learn to extract the correct answer span within a given text and approach human-level performance (Kadlec et al., 2016; Seo et al., 2017a). However, for existing datasets, relevant information is often concentrated locally within a single sentence, emphasizing the role of locating, matching, and aligning information between query and support text. For example, Weissenborn et al. (2017) observed that a simple binary word-in-query indicator feature boosted the relative accuracy of a baseline model by 27.9%.

We argue that, in order to further the ability of machine comprehension methods to extract knowledge from text, we must move beyond a scenario where relevant information is coherently and explicitly stated within a single document. Methods with this capability would aid Information Extraction (IE) applications, such as discovering drug-drug interactions (Gurulingappa et al., 2012) by connecting protein interactions reported across different publications. They would also benefit search (Carpineto and Romano, 2012) and Question Answering (QA) applications (Lin and Pantel, 2001) where the required information cannot be found in a single location.

Figure 1 shows an example from WIKIPEDIA, where the goal is to identify the country property of the Hanging Gardens of Mumbai. This cannot be inferred solely from the article about them without additional background knowledge, as the answer is not stated explicitly. However, several of the linked articles mention the correct answer India (and other countries), but cover different topics (e.g. Mumbai, Arabian Sea, etc.). Finding the answer requires multi-hop reasoning: figuring out that the Hanging Gardens are located in Mumbai, and then, from a second document, that Mumbai is a city in India.

Figure 1: A sample from the WIKIHOP dataset where it is necessary to combine information spread across multiple documents to infer the correct answer.
    The Hanging Gardens, in [Mumbai], also known as Pherozeshah Mehta Gardens, are terraced gardens … They provide sunset views over the [Arabian Sea] …
    Mumbai (also known as Bombay, the official name until 1995) is the capital city of the Indian state of Maharashtra. It is the most populous city in India …
    The Arabian Sea is a region of the northern Indian Ocean bounded on the north by Pakistan and Iran, on the west by northeastern Somalia and the Arabian Peninsula, and on the east by India …
    Q: (Hanging gardens of Mumbai, country, ?)
    Options: {Iran, India, Pakistan, Somalia, …}

We define a novel RC task in which a model should learn to answer queries by combining evidence stated across documents. We introduce a methodology to induce datasets for this task and derive two datasets. The first, WIKIHOP, uses sets of WIKIPEDIA articles where answers to queries about specific properties of an entity cannot be located in the entity's article. In the second dataset, MEDHOP, the goal is to establish drug-drug interactions based on scientific findings about drugs and proteins and their interactions, found across multiple MEDLINE abstracts. For both datasets we draw upon existing Knowledge Bases (KBs), WIKIDATA and DRUGBANK, as ground truth, utilizing distant supervision (Mintz et al., 2009) to induce the data – similar to Hewlett et al. (2016) and Joshi et al. (2017).

We establish that for 74.1% and 68.0% of the samples, the answer can be inferred from the given documents by a human annotator. Still, constructing multi-document datasets is challenging; we encounter and prescribe remedies for several pitfalls associated with their assembly – for example, spurious co-locations of answers and specific documents.

For both datasets we then establish several strong baselines and evaluate the performance of two previously proposed competitive RC models (Seo et al., 2017a; Weissenborn et al., 2017). We find that one can integrate information across documents, but neither excels at selecting relevant information from a larger documents set, as their accuracy increases significantly when given only documents guaranteed to be relevant. The best model reaches 54.5% on an annotated test set, compared to human performance at 85.0%, indicating ample room for improvement.

In summary, our key contributions are as follows: Firstly, proposing a cross-document multi-step RC task, as well as a general dataset induction strategy. Secondly, assembling two datasets from different domains and identifying dataset construction pitfalls and remedies. Thirdly, establishing multiple baselines, including two recently proposed RC models, as well as analysing model behaviour in detail through ablation studies.

2 Task and Dataset Construction Method

We will now formally define the multi-hop RC task, and a generic methodology to construct multi-hop RC datasets. Later, in Sections 3 and 4 we will demonstrate how this method is applied in practice by creating datasets for two different domains.

Task Formalization. A model is given a query q, a set of supporting documents S_q, and a set of candidate answers C_q – all of which are mentioned in S_q. The goal is to identify the correct answer a* ∈ C_q by drawing on the support documents S_q. Queries could potentially have several true answers when not constrained to rely on a specific set of support documents – e.g., queries about the parent of a certain individual. However, in our setup each sample has only one true answer among C_q and S_q. Note that even though we will utilize background information during dataset assembly, such information will not be available to a model: the document set will be provided in random order and without any metadata. While certainly beneficial, this would distract from our goal of fostering end-to-end RC methods that infer facts by combining separate facts stated in text.

Dataset Assembly. We assume that there exists a document corpus D, together with a KB containing fact triples (s, r, o) – with subject entity s, relation r, and object entity o. For example, one such fact could be (Hanging Gardens of Mumbai, country, India). We start with individual KB facts and transform them into query-answer pairs by leaving the object slot empty, i.e. q = (s, r, ?) and a* = o. Next, we define a directed bipartite graph, where


vertices on one side correspond to documents in D, and vertices on the other side are entities from the KB – see Figure 2 for an example. A document node d is connected to an entity e if e is mentioned in d, though there may be further constraints when defining the graph connectivity. For a given (q, a*) pair, the candidates C_q and support documents S_q ⊆ D are identified by traversing the bipartite graph using breadth-first search; the documents visited will become the support documents S_q.

As the traversal starting point, we use the node belonging to the subject entity s of the query q. As traversal end points, we use the set of all entity nodes that are type-consistent answers to q (footnote 2). Note that whenever there is another fact (s, r, o′) in the KB, i.e. a fact producing the same q but with a different a*, we will not include o′ into the set of end points for this sample. This ensures that precisely one of the end points corresponds to a correct answer to q.

When traversing the graph starting at s, several of the end points will be visited, though generally not all; those visited define the candidate set C_q. If however the correct answer a* is not among them we discard the entire (q, a*) pair. The documents visited to reach the end points will define the support document set S_q. That is, S_q comprises chains of documents leading not only from the query subject to the correct answer candidate, but also to type-consistent false answer candidates.

With this methodology, relevant textual evidence for (q, a*) will be spread across documents along the chain connecting s and a* – ensuring that multi-hop reasoning goes beyond resolving co-reference within a single document. Note that including other type-consistent candidates alongside a* as end points in the graph traversal – and thus into the support documents – renders the task considerably more challenging (Jia and Liang, 2017). Models could otherwise identify a* in the documents by simply relying on type-consistency heuristics. It is worth pointing out that by introducing alternative candidates we counterbalance a type-consistency bias, in contrast to Hermann et al. (2015) and Hill et al. (2016) who instead rely on entity masking.

Footnote 2: To determine entities which are type-consistent for a query q, we consider all entities which are observed as object in a fact with r as relation type – including the correct answer.

Figure 2: A bipartite graph connecting entities and documents mentioning them. Bold edges are those traversed for the first fact in the small KB on the right; yellow highlighting indicates documents in S_q and candidates in C_q. Check and cross indicate correct and false candidates. (Figure labels: Documents; Entities s, o, o′, o″; KB facts (s, r, o), (s, r, o′), (s′, r, o″).)

3 WIKIHOP

WIKIPEDIA contains an abundance of human-curated, multi-domain information and has several structured resources such as infoboxes and WIKIDATA (Vrandečić, 2012) associated with it. WIKIPEDIA has thus been used for a wealth of research to build datasets posing queries about a single sentence (Morales et al., 2016; Levy et al., 2017) or article (Yang et al., 2015; Hewlett et al., 2016; Rajpurkar et al., 2016). However, no attempt has been made to construct a cross-document multi-step RC dataset based on WIKIPEDIA.

A recently proposed RC dataset is WIKIREADING (Hewlett et al., 2016), where WIKIDATA tuples (item, property, answer) are aligned with the WIKIPEDIA articles regarding their item. The tuples define a slot filling task with the goal of predicting the answer, given an article and property. One problem with using WIKIREADING as an extractive RC dataset is that 54.4% of the samples do not state the answer explicitly in the given article (Hewlett et al., 2016). However, we observed that some of the articles accessible by following hyperlinks from the given article often state the answer, alongside other plausible candidates.

3.1 Assembly

We now apply the methodology from Section 2 to create a multi-hop dataset with WIKIPEDIA as the document corpus and WIKIDATA as structured knowledge triples. In this setup, (item, property, answer) WIKIDATA tuples correspond to (s, r, o) triples, and the item and property of each sample


together form our query q – e.g., (Hanging Gardens of Mumbai, country, ?). Similar to Yang et al. (2015) we only use the first paragraph of an article, as relevant information is more often stated in the beginning. Starting with all samples in WIKIREADING, we first remove samples where the answer is stated explicitly in the WIKIPEDIA article about the item (footnote 3).

The bipartite graph is structured as follows: (1) for edges from articles to entities: all articles mentioning an entity e are connected to e; (2) for edges from entities to articles: each entity e is only connected to the WIKIPEDIA article about the entity. Traversing the graph is then equivalent to iteratively following hyperlinks to new articles about the anchor text entities.

For a given query-answer pair, the item entity is chosen as the starting point for the graph traversal. A traversal will always pass through the article about the item, since this is the only document connected from there. The end point set includes the correct answer alongside other type-consistent candidate expressions, which are determined by considering all facts belonging to WIKIREADING training samples, selecting those triples with the same property as in q and keeping their answer expressions. As an example, for the WIKIDATA property country, this would be the set {France, Russia, …}. We executed graph traversal up to a maximum chain length of 3 documents. To not pose unreasonable computational constraints, samples with more than 64 different support documents or 100 candidates are removed, discarding ≈1% of the samples.

Footnote 3: We thus use a disjoint subset of WIKIREADING compared to Levy et al. (2017) to construct WIKIHOP.

3.2 Mitigating Dataset Biases

Dataset creation is always fraught with the risk of inducing unintended errors and biases (Chen et al., 2016; Schwartz et al., 2017). As Hewlett et al. (2016) only carried out limited analysis of their WIKIREADING dataset, we present an analysis of the downstream effects we observe on WIKIHOP.

Candidate Frequency Imbalance. A first observation is that there is a significant bias in the answer distribution of WIKIREADING. For example, in the majority of the samples the property country has the United States of America as the answer. A simple majority class baseline would thus prove successful, but would tell us little about multi-hop reasoning. To combat this issue, we subsampled the dataset to ensure that samples of any one particular answer candidate make up no more than 0.1% of the dataset, and omitted articles about the United States.

Document-Answer Correlations. A problem unique to our multi-document setting is the possibility of spurious correlations between candidates and documents induced by the graph traversal method. In fact, if we were not to address this issue, a model designed to exploit these regularities could achieve 74.6% accuracy (detailed in Section 6).

Concretely, we observed that certain documents frequently co-occur with the correct answer, independently of the query. For example, if the article about London is present in S_q, the answer is likely to be the United Kingdom, independent of the query type or entity in question.

We designed a statistic to measure this effect and then used it to sub-sample the dataset. The statistic counts how often a candidate c is observed as the correct answer when a certain document is present in S_q across training set samples. More formally, for a given document d and answer candidate c, let cooccurrence(d, c) denote the total count of how often d co-occurs with c in a sample where c is also the correct answer. We use this statistic to filter the dataset, by discarding samples with at least one document-candidate pair (d, c) for which cooccurrence(d, c) > 20.

4 MEDHOP

Following the same general methodology, we next construct a second dataset for the domain of molecular biology – a field that has been undergoing exponential growth in the number of publications (Cohen and Hunter, 2004). The promise of applying NLP methods to cope with this increase has led to research efforts in IE (Hirschman et al., 2005; Kim et al., 2011) and QA for biomedical text (Hersh et al., 2007; Nentidis et al., 2017). There are a plethora of manually curated structured resources (Ashburner et al., 2000; The UniProt Consortium, 2017) which can either serve as ground truth or to induce training data using distant supervision (Craven and Kumlien, 1999; Bobic et al., 2012). Existing RC datasets are


either severely limited in size (Hersh et al., 2007) or cover a very diverse set of query types (Nentidis et al., 2017), complicating the application of neural models that have seen successes for other domains (Wiese et al., 2017).

A task that has received significant attention is detecting Drug-Drug Interactions (DDIs). Existing DDI efforts have focused on explicit mentions of interactions in single sentences (Gurulingappa et al., 2012; Percha et al., 2012; Segura-Bedmar et al., 2013). However, as shown by Peng et al. (2017), cross-sentence relation extraction increases the number of available relations. It is thus likely that cross-document interactions would further improve recall, which is of particular importance considering interactions that are never stated explicitly – but rather need to be inferred from separate pieces of evidence. The promise of multi-hop methods is finding and combining individual observations that can suggest previously unobserved DDIs, aiding the process of making scientific discoveries, yet not directly from experiments, but by inferring them from established public knowledge (Swanson, 1986).

DDIs are caused by Protein-Protein Interaction (PPI) chains, forming biomedical pathways. If we consider PPI chains across documents, we find examples like in Figure 3. Here the first document states that the drug Leuprolide causes GnRH receptor-induced synaptic potentiations, which can be blocked by the protein Progonadoliberin-1. The last document states that another drug, Triptorelin, is a superagonist of the same protein. It is therefore likely to affect the potency of Leuprolide, describing a way in which the two drugs interact. Besides the true interaction there is also a false candidate Urofollitropin for which, although mentioned together with GnRH receptor within one document, there is no textual evidence indicating interactions with Leuprolide.

Figure 3: A sample from the MEDHOP dataset.
    Leuprolide … elicited a long-lasting potentiation of excitatory postsynaptic currents… [GnRH receptor]-induced synaptic potentiation was blocked … by [Progonadoliberin-1], a specific [GnRH receptor] antagonist…
    Analyses of gene expression demonstrated a dynamic response to the Progonadoliberin-1 superagonist Triptorelin.
    … our research to study the distribution, co-localization of Urofollitropin and its receptor[,] and co-localization of Urofollitropin and GnRH receptor…
    Q: (Leuprolide, interacts_with, ?)
    Options: {Triptorelin, Urofollitropin}

4.1 Assembly

We construct MEDHOP using DRUGBANK (Law et al., 2014) as structured knowledge resource and research paper abstracts from MEDLINE as documents. There is only one relation type for DRUGBANK facts, interacts_with, that connects pairs of drugs – an example of a MEDHOP query would thus be (Leuprolide, interacts_with, ?). We start by processing the 2016 MEDLINE release using the preprocessing pipeline employed for the BioNLP 2011 Shared Task (Stenetorp et al., 2011). We restrict the set of entities in the bipartite graph to drugs in DRUGBANK and human proteins in SWISS-PROT (Bairoch et al., 2004). That is, the graph has drugs and proteins on one side, and MEDLINE abstracts on the other.

The edge structure is as follows: (1) There is an edge from a document to all proteins mentioned in it. (2) There is an edge between a document and a drug, if this document also mentions a protein known to be a target for the drug according to DRUGBANK. This edge is bidirectional, i.e. it can be traversed both ways, since there is no canonical document describing each drug – thus one can “hop” to any document mentioning the drug and its target. (3) There is an edge from a protein p to a document mentioning p, but only if the document also mentions another protein p′ which is known to interact with p according to REACTOME (Fabregat et al., 2016). Given our distant supervision assumption, these additionally constraining requirements err on the side of precision.

As a mention, similar to Percha et al. (2012), we consider any exact match of a name variant of a drug or human protein in DRUGBANK or SWISS-PROT. For a given DDI (drug1, interacts_with, drug2), we then select drug1 as the starting point for the graph traversal. As possible end points, we consider any other drug, apart from drug1 and those interacting with drug1 other than drug2. Similar to WIKIHOP, we exclude samples with more than 64 support documents and impose a maximum document length of 300 tokens plus title.

Document Sub-sampling. The bipartite graph for MEDHOP is orders of magnitude more densely connected than for WIKIHOP. This can lead to potentially large support document sets S_q, to a degree where it becomes computationally infeasible for a majority of existing RC models. After the traversal has finished, we subsample documents by first adding a set of documents that connects the drug in the query with its answer. We then iteratively add documents to connect alternative candidates until we reach the limit of 64 documents – while ensuring that all candidates have the same number of paths through the bipartite graph.

Mitigating Candidate Frequency Imbalance. Some drugs interact with more drugs than others – Aspirin for example interacts with 743 other drugs, but Isotretinoin with only 34. This leads to similar candidate frequency imbalance issues as with WIKIHOP – but due to its smaller size MEDHOP is difficult to sub-sample. Nevertheless we can successfully combat this issue by masking entity names, detailed in Section 6.2.

5 Dataset Analysis

Table 1 shows the dataset sizes. Note that WIKIHOP inherits the train, development, and test set splits from WIKIREADING – i.e., the full dataset creation, filtering, and sub-sampling pipeline is executed on each set individually. Also note that sub-sampling according to document-answer correlation significantly reduces the size of WIKIHOP from ≈528K training samples to ≈44K. While in terms of samples, both WIKIHOP and MEDHOP are smaller than other large-scale RC datasets, such as SQuAD and WIKIREADING, the supervised learning signal available per sample is arguably greater. One could, for example, re-frame the task as binary path classification: given two entities and a document path connecting them, determine whether a given relation holds. For such a case, WIKIHOP and MEDHOP would have more than 1M and 150K paths to be classified, respectively. Instead, in our formulation, this corresponds to each single sample containing the supervised learning signal from an average of 19.5 and 59.8 unique document paths.

Table 1: Dataset sizes for our respective datasets.

             Train     Dev    Test    Total
    WIKIHOP  43,738  5,129   2,451   51,318
    MEDHOP    1,620    342     546    2,508

Table 2 shows statistics on the number of candidates and documents per sample on the respective training sets. For MEDHOP, the majority of samples have 9 candidates, due to the way documents are selected up until a maximum of 64 documents is reached. Few samples have fewer than 9 candidates, and samples would have far more false candidates if more than 64 support documents were included. The number of query types in WIKIHOP is 277, whereas in MEDHOP there is only one: interacts_with.

Table 2: Candidates and documents per sample and document length statistics. WH: WIKIHOP; MH: MEDHOP.

                      min    max     avg  median
    # cand. – WH        2     79    19.8      14
    # docs. – WH        3     63    13.7      11
    # tok/doc – WH      4  2,046   100.4      91
    # cand. – MH        2      9     8.9       9
    # docs. – MH        5     64    36.4      29
    # tok/doc – MH      5    458   253.9     264

5.1 Qualitative Analysis

To establish the quality of the data and analyze potential distant supervision errors, we sampled and annotated 100 samples from each development set.

WIKIHOP. Table 3 lists characteristics along with the proportion of samples that exhibit them. For 45%, the true answer either uniquely follows from multiple texts directly or is suggested as likely. For 26%, more than one candidate is plausibly supported by the documents, including the correct answer. This is often due to hypernymy, where the appropriate level of granularity for the answer is difficult to predict – e.g. (west suffolk, administrative_entity, ?) with candidates suffolk and england. This is a direct consequence of including type-consistent false answer candidates from WIKIDATA, which can lead to questions with several true answers. For 9% of the cases a single document suffices; these samples contain a document that states enough information about item and answer together. For example, the query (Louis Auguste, father, ?) has the correct answer Louis XIV of France, and French king Louis XIV is mentioned within the same document as Louis Auguste. Finally, although our task is significantly more complex than most previous tasks where distant supervision has been applied, the distant supervision assumption is only violated for 20% of the samples – a proportion similar to previous work (Riedel et al., 2010). These cases can either be due to conflicting information between WIKIDATA and WIKIPEDIA (8%), e.g. when the date of birth for a person differs between WIKIDATA and what is stated in the WIKIPEDIA article, or because the answer is consistent but cannot be inferred from the support documents (12%). When answering 100 questions, the annotator knew the answer prior to reading the documents for 9%, and produced the correct answer after reading the document sets for 74% of the cases. On 100 questions of a validated portion of the Dev set (see Section 5.3), 85% accuracy was reached.

Table 3: Qualitative analysis of WIKIHOP samples.

    Unique multi-step answer.               36%
    Likely multi-step unique answer.         9%
    Multiple plausible answers.             15%
    Ambiguity due to hypernymy.             11%
    Only single document required.           9%
    Answer does not follow.                 12%
    WIKIDATA/WIKIPEDIA discrepancy.          8%

MEDHOP. Since both document complexity and number of documents per sample were significantly larger compared to WIKIHOP, it was not feasible to ask an annotator to read all support documents for 100 samples. We thus opted to verify the dataset quality by providing only the subset of documents relevant to support the correct answer, i.e., those traversed along the path reaching the answer. The annotator was asked if the answer to the query “follows”, “is likely”, or “does not follow”, given the relevant documents. 68% of the cases were considered as “follows” or as “is likely”. The majority of cases violating the distant supervision assumption were errors due to the lack of a necessary PPI in one of the connecting documents.

5.2 Crowdsourced Human Annotation

We asked human annotators on Amazon Mechanical Turk to evaluate samples of the WIKIHOP development set. Similar to our qualitative analysis of MEDHOP, annotators were shown the query-answer pair as a fact and the chain of relevant documents leading to the answer. They were then instructed to answer (1) whether they knew the fact before; (2) whether the fact follows from the texts (with options “fact follows”, “fact is likely”, and “fact does not follow”); and (3) whether a single or several of the documents are required. Each sample was shown to three annotators and a majority vote was used to aggregate the annotations. Annotators were familiar with the fact 4.6% of the time; prior knowledge of the fact is thus not likely to be a confounding effect on the other judgments. Inter-annotator agreement as measured by Fleiss' kappa is 0.253 in (2), and 0.281 in (3) – indicating a fair overall agreement, according to Landis and Koch (1977). Overall, 9.5% of samples have no clear majority in (2).

Among samples with a majority judgment, 59.8% are cases where the fact “follows”, for 14.2% the fact is judged as “likely”, and as “not follow” for 25.9%. This again provides good justification for the distant supervision strategy.

Among the samples with a majority vote for (2) of either “follows” or “likely”, 55.9% were marked with a majority vote as requiring multiple documents to infer the fact, and 44.1% as requiring only a single document. The latter number is larger than initially expected, given the construction of samples through graph traversal. However, when inspecting cases judged as “single” more closely, we observed that many indeed provide a clear hint about the correct answer within one document, but without stating it explicitly. For example, for the fact (witold cichy, country_of_citizenship, poland) with documents d1: Witold Cichy (born March 15, 1986 in Wodzisław Śląski) is a Polish footballer […] and d2: Wodzisław Śląski […] is a town in Silesian Voivodeship, southern Poland […], the information provided in d1 suffices for a human given the background knowledge that Polish is an attribute related to Poland, removing the need for d2 to infer the answer.

5.3 Validated Test Sets

While training models on distantly supervised data is useful, one should ideally evaluate methods on a manually validated test set. We thus identified subsets of the respective test sets for which the correct

294

answer can be inferred from the text. This is in contrast to prior work such as Hermann et al. (2015), Hill et al. (2016), and Hewlett et al. (2016), who evaluate only on distantly supervised samples. For WIKIHOP, we applied the same annotation strategy as described in Section 5.2. The validated test set consists of those samples labeled by a majority of annotators (at least 2 of 3) as “follows”, and requiring “multiple” documents. While desirable, crowdsourcing is not feasible for MEDHOP since it requires specialist knowledge. In addition, the number of document paths is ≈3x larger, which along with the complexity of the documents greatly increases the annotation time. We thus manually annotated 20% of the MEDHOP test set and identified the samples for which the text implies the correct answer and where multiple documents are required.

6 Experiments

This section describes experiments on WIKIHOP and MEDHOP with the goal of establishing the performance of several baseline models, including recent neural RC models. We empirically demonstrate the importance of mitigating dataset biases, probe whether multi-step behavior is beneficial for solving the task, and investigate if RC models can learn to perform lexical abstraction. Training will be conducted on the respective training sets, and evaluation on both the full test set and validated portion (Section 5.3) allowing for a comparison between the two.

6.1 Models

Random. Selects a random candidate; note that the number of candidates differs between samples.

Max-mention. Predicts the most frequently mentioned candidate in the support documents S_q of a sample – randomly breaking ties.

Majority-candidate-per-query-type. Predicts the candidate c ∈ C_q that was most frequently observed as the true answer in the training set, given the query type of q. For WIKIHOP, the query type is the property p of the query; for MEDHOP there is only the single query type – interacts_with.

TF-IDF. Retrieval-based models are known to be strong QA baselines if candidate answers are provided (Clark et al., 2016; Welbl et al., 2017). They search for individual documents based on keywords in the question, but typically do not combine information across documents. The purpose of this baseline is to see if it is possible to identify the correct answer from a single document alone through lexical correlations. The model forms its prediction as follows: For each candidate c, the concatenation of the query q with c is fed as an OR query into the whoosh text retrieval engine. It then predicts the candidate with the highest TF-IDF similarity score:

\[ \arg\max_{c \in C_q} \Big[ \max_{s \in S_q} \text{TF-IDF}(q + c, s) \Big] \tag{1} \]

Document-cue. During dataset construction we observed that certain document-answer pairs appear more frequently than others, to the effect that the correct candidate is often indicated solely by the presence of certain documents in S_q. This baseline captures how easy it is for a model to exploit these informative document-answer co-occurrences. It predicts the candidate with highest score across C_q:

\[ \arg\max_{c \in C_q} \Big[ \max_{d \in S_q} \text{cooccurrence}(d, c) \Big] \tag{2} \]

Extractive RC models: FastQA and BiDAF. In our experiments we evaluate two recently proposed LSTM-based extractive QA models: the Bidirectional Attention Flow model (BiDAF, Seo et al. (2017a)), and FastQA (Weissenborn et al., 2017), which have shown a robust performance across several datasets. These models predict an answer span within a single document. We adapt them to a multi-document setting by sequentially concatenating all d ∈ S_q in random order into a superdocument, adding document separator tokens. During training, the first answer mention in the concatenated document serves as the gold span (footnote 4). At test time, we measured accuracy based on the exact match between the prediction and answer, both lowercased, after removing articles, trailing white spaces and punctuation, in the same way as Rajpurkar et al. (2016). To rule out any signal stemming from the order of documents in the superdocument, this order is randomized both at training and test time. In a preliminary experiment we also trained models using different random document order permutations, but found that performance did not change significantly.

Footnote 4: We also tested assigning the gold span randomly to any one of the mentions of the answer, with insignificant changes.
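As an illustration of the Document-cue baseline in Equation (2), the following is a minimal Python sketch, not the authors' implementation. It assumes each sample is a dictionary with hypothetical keys "supports" (a list of document identifiers), "candidates", and "answer"; these names and the use of identifiers instead of document text are assumptions made for the example. The statistic cooccurrence(d, c) is counted over training samples in which c is the correct answer, as defined in Section 3.2.

```python
from collections import Counter


def build_cooccurrence(train_samples):
    """Count cooccurrence(d, c): how often document d appears in the support
    set of a training sample whose correct answer is c."""
    counts = Counter()
    for sample in train_samples:
        # Use a set so a document repeated within one sample is counted once.
        for doc_id in set(sample["supports"]):
            counts[(doc_id, sample["answer"])] += 1
    return counts


def document_cue_predict(sample, counts):
    """Equation (2): pick the candidate with the highest co-occurrence count
    over all support documents of the sample (first candidate wins ties)."""
    def score(candidate):
        return max((counts[(d, candidate)] for d in sample["supports"]), default=0)

    return max(sample["candidates"], key=score)


# Toy data (hypothetical identifiers, not taken from the datasets).
train = [
    {"supports": ["london", "thames"], "candidates": ["uk", "france"], "answer": "uk"},
    {"supports": ["london", "bbc"], "candidates": ["uk", "india"], "answer": "uk"},
    {"supports": ["paris"], "candidates": ["uk", "france"], "answer": "france"},
]
counts = build_cooccurrence(train)
query = {"supports": ["london", "paris"], "candidates": ["france", "uk"]}
prediction = document_cue_predict(query, counts)
```

In this toy example the document "london" co-occurs twice with the answer "uk", which outweighs the single co-occurrence of "paris" with "france", so the sketch predicts "uk" without looking at the query at all. This is the kind of shortcut the baseline is meant to expose.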

295

For BiDAF, the default hyperparameters from the implementation of Seo et al. (2017a) are used, with pretrained GloVe (Pennington et al., 2014) embeddings. However, we restrict the maximum document length to 8,192 tokens and hidden size to 20, and train for 5,000 iterations with batch size 16 in order to fit the model into memory (footnote 5). For FastQA we use the implementation provided by the authors, also with pre-trained GloVe embeddings, no character-embeddings, no maximum support length, hidden size 50, and batch size 64 for 50 epochs.

While BiDAF and FastQA were initially developed and tested on single-hop RC datasets, their usage of bidirectional LSTMs and attention over the full sequence theoretically gives them the capacity to integrate information from different locations in the (super-)document. In addition, BiDAF employs iterative conditioning across multiple layers, potentially making it even better suited to integrate information found across the sequence.

Footnote 5: The superdocument has a larger number of tokens compared to e.g. SQuAD, thus the additional memory requirements.

6.2 Lexical Abstraction: Candidate Masking

The presence of lexical regularities among answers is a problem in RC dataset assembly – a phenomenon already observed by Hermann et al. (2015). When comprehending a text, the correct answer should become clear from its context – rather than from an intrinsic property of the answer expression. To evaluate the ability of models to rely on context alone, we created masked versions of the datasets: we replace any candidate expression randomly using 100 unique placeholder tokens, e.g. “Mumbai is the most populous city in MASK7.” Masking is consistent within one sample, but generally different for the same expression across samples. This not only removes answer frequency cues, it also removes statistical correlations between frequent answer strings and support documents. Models consequently cannot base their prediction on intrinsic properties of the answer expression, but have to rely on the context surrounding the mentions.

6.3 Results and Discussion

Table 5 shows the experimental outcomes for WIKIHOP and MEDHOP, together with results for the masked setting; we will first discuss the former. A first observation is that candidate mention frequency does not produce better predictions than a random guess. Predicting the answer most frequently observed at training time achieves strong results: as much as 38.8% / 44.2% and 58.4% / 67.3% on the two datasets, for the full and validated test sets respectively. That is, a simple frequency statistic together with answer type constraints alone is a relatively strong predictor, and the strongest overall for the “unmasked” version of MEDHOP.

The TF-IDF retrieval baseline clearly performs better than random for WIKIHOP, but is not very strong overall. That is, the question tokens are helpful to detect relevant documents, but exploiting only this information compares poorly to the other baselines. On the other hand, as no co-mention of an interacting drug pair occurs within any single document in MEDHOP, the TF-IDF baseline performs worse than random. We conclude that lexical matching with a single support document is not enough to build a strong predictive model for both datasets.

Table 4: Accuracy comparison for simple baseline models on WIKIHOP before and after filtering.

    Model             Unfiltered  Filtered
    Document-cue            74.6      36.7
    Maj. candidate          41.2      38.8
    TF-IDF                  43.8      25.6
    Train set size       527,773    43,738

The Document-cue baseline can predict more than a third of the samples correctly, for both datasets, even after sub-sampling frequent document-answer pairs for WIKIHOP. The relative strength of this and other baselines proves to be an important issue when designing multi-hop datasets, which we addressed through the measures described in Section 3.2. In Table 4 we compare the two relevant baselines on WIKIHOP before and after applying filtering measures. The absolute strength of these baselines before filtering shows how vital addressing this issue is: 74.6% accuracy could be reached through exploiting the cooccurrence(d, c) statistic alone. This underlines the paramount importance of investigating and addressing dataset biases that otherwise would confound seemingly strong RC model performance. The relative drop demonstrates that

the measures undertaken successfully mitigate the issue. A downside to aggressive filtering is a significantly reduced dataset size, rendering it infeasible for smaller datasets like MEDHOP.

                                             WIKIHOP                          MEDHOP
                                     standard         masked          standard         masked
Model                               test  test*     test  test*     test  test*     test  test*
Random                              11.5   12.2     12.2   13.0     13.9   20.4     14.1   22.4
Max-mention                         10.6   15.9     13.9   20.1      9.5   16.3      9.2   16.3
Majority-candidate-per-query-type   38.8   44.2     12.0   13.7     58.4   67.3     10.4    6.1
TF-IDF                              25.6   36.7     14.4   24.2      9.0   14.3      8.8   14.3
Document-cue                        36.7   41.7      7.4   20.3     44.9   53.1     15.2   16.3
FastQA                              25.7   27.2     35.8   38.0     23.1   24.5     31.3   30.6
BiDAF                               42.9   49.7     54.5   59.8     47.8   61.2     33.7   42.9

Table 5: Test accuracies for the WIKIHOP and MEDHOP datasets, both in standard (unmasked) and masked setup. Columns marked with asterisk are for the validated portion of the dataset.

                         WIKIHOP                          MEDHOP
                 standard       gold chain        standard       gold chain
Model           test  test*     test  test*     test  test*     test  test*
BiDAF           42.9   49.7     57.9   63.4     47.8   61.2     86.4   89.8
BiDAF mask      54.5   59.8     81.2   85.7     33.7   42.9     99.3  100.0
FastQA          25.7   27.2     44.5   53.5     23.1   24.5     54.6   59.2
FastQA mask     35.8   38.0     65.3   70.0     31.3   30.6     51.8   55.1

Table 6: Test accuracy comparison when only using documents leading to the correct answer (gold chain). Columns with asterisk hold results for the validated samples.

Among the two neural models, BiDAF is overall strongest across both datasets; this is in contrast to the reported results for SQuAD where their performance is nearly indistinguishable. This is possibly due to the iterative latent interactions in the BiDAF architecture: we hypothesize that these are of increased importance for our task, where information is distributed across documents. It is worth emphasizing that unlike the other baselines, both FastQA and BiDAF predict the answer by extracting a span from the support documents without relying on the candidate options C_q.

In the masked setup all baseline models reliant on lexical cues fail in the face of the randomized answer expressions, since the same answer option has different placeholders in different samples. Especially on MEDHOP, where dataset sub-sampling is not a viable option, masking proves to be a valuable alternative, effectively circumventing spurious statistical correlations that RC models can learn to exploit.

Both neural RC models are able to largely retain or even improve their strong performance when answers are masked: they are able to leverage the textual context of the candidate expressions. To understand differences in model behavior between WIKIHOP and MEDHOP, it is worth noting that drug mentions in MEDHOP are normalized to a unique single-word identifier, and performance drops under masking. In contrast, for the open-domain setting of WIKIHOP, a reduction of the answer vocabulary to 100 random single-token mask expressions clearly helps the model in selecting a candidate span, compared to the multi-token candidate expressions in the unmasked setting. Overall, although both neural RC models clearly outperform the other baselines, they still have large room for improvement compared to human performance at 74%/85% for WIKIHOP.

Comparing results on the full and validated test
sets, we observe that the results consistently improve on the validated sets. This suggests that the training set contains the signal necessary to make inference on valid samples at test time, and that noisy samples are harder to predict.

                    WIKIHOP           MEDHOP
Model           test  test*     test  test*
BiDAF           54.5   59.8     33.7   42.9
BiDAF rem       44.6   57.7     30.4   36.7
FastQA          35.8   38.0     31.3   30.6
FastQA rem      38.0   41.2     28.6   24.5

Table 7: Test accuracy (masked) when only documents containing answer candidates are given (rem).

6.4 Using only relevant documents

We conducted further experiments to examine the RC models when presented with only the relevant documents in S_q, i.e., the chain of documents leading to the correct answer. This allows us to investigate the hypothetical performance of the models if they were able to select and read only relevant documents: Table 6 summarizes these results. Models improve greatly in this gold chain setup, with up to 81.2%/85.7% on WIKIHOP in the masked setting for BiDAF. This demonstrates that RC models are capable of identifying the answer when few or no plausible false candidates are mentioned, which is particularly evident for MEDHOP, where documents tend to discuss only single drug candidates. In the masked gold chain setup, models can then pick up on what the masking template looks like and achieve almost perfect scores. Conversely, these results also show that the models’ answer selection process is not robust to the introduction of unrelated documents with type-consistent candidates. This indicates that learning to intelligently select relevant documents before RC may be among the most promising directions for future model development.

6.5 Removing relevant documents

To investigate if the neural RC models can draw upon information requiring multi-step inference we designed an experiment where we discard all documents that do not contain candidate mentions, including the first documents traversed. Table 7 shows the results: we can observe that performance drops across the board for BiDAF. There is a significant drop of 3.3%/6.2% on MEDHOP, and 10.0%/2.1% on WIKIHOP, demonstrating that BiDAF is able to leverage cross-document information. FastQA shows a slight increase of 2.2%/3.2% for WIKIHOP and a decrease of 2.7%/4.1% on MEDHOP. While these results are inconclusive, they suggest that FastQA, with fewer latent interactions than BiDAF, has problems integrating cross-document information.

7 Related Work

Related Datasets. End-to-end text-based QA has witnessed a surge in interest with the advent of large-scale datasets, which have been assembled based on FREEBASE (Berant et al., 2013; Bordes et al., 2015), WIKIPEDIA (Yang et al., 2015; Rajpurkar et al., 2016; Hewlett et al., 2016), web search queries (Nguyen et al., 2016), news articles (Hermann et al., 2015; Onishi et al., 2016), books (Hill et al., 2016; Paperno et al., 2016), science exams (Welbl et al., 2017), and trivia (Boyd-Graber et al., 2012; Dunn et al., 2017). Besides TriviaQA (Joshi et al., 2017), all these datasets are confined to single documents, and RC typically does not require a combination of multiple independent facts. In contrast, WIKIHOP and MEDHOP are specifically designed for cross-document RC and multi-step inference. There exist other multi-hop RC resources, but they are either very limited in size, such as the FraCaS test suite, or based on synthetic language (Weston et al., 2016). TriviaQA partly involves multi-step reasoning, but the complexity largely stems from parsing compositional questions. Our datasets center around compositional inference from comparatively simple queries and the cross-document setup ensures that multi-step inference goes beyond resolving co-reference.

Compositional Knowledge Base Inference. Combining multiple facts is common for structured knowledge resources which formulate facts using first-order logic. KB inference methods include Inductive Logic Programming (Quinlan, 1990; Pazzani et al., 1991; Richards and Mooney, 1991) and probabilistic relaxations to logic like Markov Logic (Richardson and Domingos, 2006; Schoenmackers et al., 2008). These approaches suffer from
limited coverage and inefficient inference, though efforts to circumvent sparsity have been undertaken (Schoenmackers et al., 2008; Schoenmackers et al., 2010). A more scalable approach to composite rule learning is the Path Ranking Algorithm (Lao and Cohen, 2010; Lao et al., 2011), which performs random walks to identify salient paths between entities. Gardner et al. (2013) circumvent these sparsity problems by introducing synthetic links via dense latent embeddings. Several other methods have been proposed, using composition functions such as vector addition (Bordes et al., 2014), RNNs (Neelakantan et al., 2015; Das et al., 2017), and memory networks (Jain, 2016).

All of these previous approaches center around learning how to combine facts from a KB, i.e., in a structured form with pre-defined schema. That is, they work as part of a pipeline, and either rely on the output of a previous IE step (Banko et al., 2007), or on direct human annotation (Bollacker et al., 2008) which tends to be costly and biased in coverage. However, recent neural RC methods (Seo et al., 2017a; Shen et al., 2017) have demonstrated that end-to-end language understanding approaches can infer answers directly from text, sidestepping intermediate query parsing and IE steps. Our work aims to evaluate whether end-to-end multi-step RC models can indeed operate on raw text documents only, while performing the kind of inference most commonly associated with logical inference methods operating on structured knowledge.

Text-Based Multi-Step Reading Comprehension. Fried et al. (2015) have demonstrated that exploiting information from other related documents based on lexical semantic similarity is beneficial for re-ranking answers in open-domain non-factoid QA. Jansen et al. (2017) chain textual background resources for science exam QA and provide multi-sentence answer explanations. Beyond these, a rich collection of neural models tailored towards multi-step RC has been developed. Memory networks (Weston et al., 2015; Sukhbaatar et al., 2015; Kumar et al., 2016) define a model class that iteratively attends over textual memory items, and they show promising performance on synthetic tasks requiring multi-step reasoning (Weston et al., 2016). One common characteristic of neural multi-hop models is their rich structure that enables matching and interaction between question, context, answer candidates and combinations thereof (Peng et al., 2015; Weissenborn, 2016; Xiong et al., 2017; Liu and Perez, 2017), which is often iterated over several times (Sordoni et al., 2016; Neumann et al., 2016; Seo et al., 2017b; Hu et al., 2017) and may contain trainable stopping mechanisms (Graves, 2016; Shen et al., 2017). All these methods show promise for single-document RC, and by design should be capable of integrating multiple facts across documents. However, thus far they have not been evaluated for a cross-document multi-step RC task, as in this work.

Learning Search Expansion. Other research addresses expanding the document set available to a QA system, either in the form of web navigation (Nogueira and Cho, 2016), or via query reformulation techniques, which often use neural reinforcement learning (Narasimhan et al., 2016; Nogueira and Cho, 2017; Buck et al., 2018). While related, this line of work ultimately aims at reformulating queries to better acquire evidence documents, and not at answering queries through combining facts.

8 Conclusions and Future Work

We have introduced a new cross-document multi-hop RC task, devised a generic dataset derivation strategy and applied it to two separate domains. The resulting datasets test RC methods in their ability to perform composite reasoning, something thus far limited to models operating on structured knowledge resources. In our experiments we found that contemporary RC models can leverage cross-document information, but a sizeable gap to human performance remains. Finally, we identified the selection of relevant document sets as the most promising direction for future research.

Thus far, our datasets center around factoid questions about entities, and as extractive RC datasets, it is assumed that the answer is mentioned verbatim. While this limits the types of questions one can ask, these assumptions can facilitate both training and evaluation, and future work, once free-form abstractive answer composition has advanced, should move beyond them. We hope that our work will foster research on cross-document information integration, working towards these long term goals.
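
As an illustration of the candidate masking described in Section 6.2, the following Python sketch replaces each candidate expression of one sample with a randomly drawn placeholder token. It is a minimal sketch under stated assumptions, not the procedure used to build the released datasets: the function name mask_sample, the list-of-strings sample layout, and the plain case-sensitive substring matching are all hypothetical choices made for this example.

```python
import random
import re


def mask_sample(candidates, supports, num_placeholders=100, rng=None):
    """Mask the candidate expressions of ONE sample (hypothetical helper).

    candidates: list of unique candidate strings for the query.
    supports:   list of support document strings.
    Returns (masked_candidates, masked_supports, mapping).
    """
    rng = rng or random.Random()
    if not candidates:
        return [], list(supports), {}
    if len(candidates) > num_placeholders:
        raise ValueError("more candidates than placeholder tokens")

    # Draw distinct placeholder ids for this sample only, so the same
    # expression generally gets a different token in another sample.
    ids = rng.sample(range(num_placeholders), len(candidates))
    mapping = {cand: f"MASK{i}" for cand, i in zip(candidates, ids)}

    # Try longer candidates first so that a candidate containing another
    # candidate is replaced as a whole.
    ordered = sorted(candidates, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(c) for c in ordered))

    # Assumption: exact, case-sensitive substring matching. This also hits
    # candidates inside longer words; real mention matching may differ.
    masked_supports = [
        pattern.sub(lambda m: mapping[m.group(0)], doc) for doc in supports
    ]
    masked_candidates = [mapping[c] for c in candidates]
    return masked_candidates, masked_supports, mapping
```

Because the placeholder ids are redrawn for every sample, the mapping is consistent within a sample but carries no information across samples, which is the property that removes answer frequency cues.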

Acknowledgments

We would like to thank the reviewers and the action editor for their thoughtful and constructive suggestions, as well as Matko Bošnjak, Tim Dettmers, Pasquale Minervini, Jeff Mitchell, and Sebastian Ruder for several helpful comments and feedback on drafts of this paper. This work was supported by an Allen Distinguished Investigator Award, a Marie Curie Career Integration Award, the EU H2020 SUMMA project (grant agreement number 688139), and an Engineering and Physical Sciences Research Council scholarship.

References

Michael Ashburner, Catherine A. Ball, Judith A. Blake, David Botstein, Heather Butler, J. Michael Cherry, Allan P. Davis, Kara Dolinski, Selina S. Dwight, Janan T. Eppig, Midori A. Harris, David P. Hill, Laurie Issel-Tarver, Andrew Kasarskis, Suzanna Lewis, John C. Matese, Joel E. Richardson, Martin Ringwald, Gerald M. Rubin, and Gavin Sherlock. 2000. Gene ontology: tool for the unification of biology. Nature Genetics, 25(1):25.
Amos Bairoch, Brigitte Boeckmann, Serenella Ferro, and Elisabeth Gasteiger. 2004. Swiss-Prot: Juggling between evolution and stability. Briefings in Bioinformatics, 5(1):39–55.
Michele Banko, Michael J. Cafarella, Stephen Soderland, Matt Broadhead, and Oren Etzioni. 2007. Open information extraction from the web. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI’07, pages 2670–2676.
Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544.
Tamara Bobic, Roman Klinger, Philippe Thomas, and Martin Hofmann-Apitius. 2012. Improving distantly supervised extraction of drug-drug and protein-protein interactions. In Proceedings of the Joint Workshop on Unsupervised and Semi-Supervised Learning in NLP, pages 35–43.
Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD 08 Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1247–1250.
Antoine Bordes, Sumit Chopra, and Jason Weston. 2014. Question answering with subgraph embeddings. In Empirical Methods for Natural Language Processing (EMNLP), pages 615–620.
Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. CoRR, abs/1506.02075.
Jordan Boyd-Graber, Brianna Satinoff, He He, and Hal Daumé, III. 2012. Besting the quiz master: Crowdsourcing incremental classification games. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL ’12, pages 1290–1301.
Christian Buck, Jannis Bulian, Massimiliano Ciaramita, Andrea Gesmundo, Neil Houlsby, Wojciech Gajewski, and Wei Wang. 2018. Ask the right questions: Active question reformulation with reinforcement learning. International Conference on Learning Representations (ICLR).
Claudio Carpineto and Giovanni Romano. 2012. A survey of automatic query expansion in information retrieval. ACM Comput. Surv., 44(1):1:1–1:50, January.
Danqi Chen, Jason Bolton, and Christopher D. Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2358–2367.
Peter Clark, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Turney, and Daniel Khashabi. 2016. Combining retrieval, statistics, and inference to answer elementary science questions. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, pages 2580–2586.
Kevin Bretonnel Cohen and Lawrence Hunter. 2004. Natural language processing and systems biology. Artificial Intelligence Methods and Tools for Systems Biology, pages 147–173.
Mark Craven and Johan Kumlien. 1999. Constructing biological knowledge bases by extracting information from text sources. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, pages 77–86.
Rajarshi Das, Arvind Neelakantan, David Belanger, and Andrew McCallum. 2017. Chains of reasoning over entities, relations, and text using recurrent neural networks. European Chapter of the Association for Computational Linguistics (EACL), pages 132–141.
Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Güney, Volkan Cirik, and Kyunghyun Cho. 2017. SearchQA: A new Q&A dataset augmented with context from a search engine. CoRR, abs/1704.05179.
Antonio Fabregat, Konstantinos Sidiropoulos, Phani Garapati, Marc Gillespie, Kerstin Hausmann, Robin Haw, Bijay Jassal, Steven Jupe, Florian Korninger, Sheldon McKay, Lisa Matthews, Bruce May, Marija Milacic, Karen Rothfels, Veronica Shamovsky, Marissa Webber, Joel Weiser, Mark Williams, Guanming Wu, Lincoln Stein, Henning Hermjakob, and Peter D’Eustachio. 2016. The Reactome pathway knowledgebase. Nucleic Acids Research, 44(D1):D481–D487.
Daniel Fried, Peter Jansen, Gustave Hahn-Powell, Mihai Surdeanu, and Peter Clark. 2015. Higher-order lexical semantic models for non-factoid answer reranking. Transactions of the Association for Computational Linguistics, 3:197–210.
Matt Gardner, Partha Pratim Talukdar, Bryan Kisiel, and Tom M. Mitchell. 2013. Improving learning and inference in a large knowledge-base using latent syntactic cues. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 833–838.
Alex Graves. 2016. Adaptive computation time for recurrent neural networks. CoRR, abs/1603.08983.
Harsha Gurulingappa, Abdul Mateen Rajput, Angus Roberts, Juliane Fluck, Martin Hofmann-Apitius, and Luca Toldo. 2012. Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. Journal of Biomedical Informatics, 45(5):885–892. Text Mining and Natural Language Processing in Pharmacogenomics.
Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693–1701.
William Hersh, Aaron Cohen, Lynn Ruslen, and Phoebe Roberts. 2007. TREC 2007 genomics track overview. In NIST Special Publication.
Daniel Hewlett, Alexandre Lacoste, Llion Jones, Illia Polosukhin, Andrew Fandrianto, Jay Han, Matthew Kelcey, and David Berthelot. 2016. WIKIREADING: A novel large-scale language understanding task over Wikipedia. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), pages 1535–1545.
Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2016. The goldilocks principle: Reading children’s books with explicit memory representations. ICLR.
Lynette Hirschman, Alexander Yeh, Christian Blaschke, and Alfonso Valencia. 2005. Overview of BioCreAtIvE: Critical assessment of information extraction for biology. BMC Bioinformatics, 6(1):S1, May.
Minghao Hu, Yuxing Peng, and Xipeng Qiu. 2017. Mnemonic reader for machine comprehension. CoRR, abs/1705.02798.
Sarthak Jain. 2016. Question answering over knowledge base using factual memory networks. In Proceedings of NAACL-HLT, pages 109–115.
Peter Jansen, Rebecca Sharp, Mihai Surdeanu, and Peter Clark. 2017. Framing QA as building and ranking intersentence answer justifications. Computational Linguistics, 43(2):407–449.
Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Empirical Methods in Natural Language Processing (EMNLP).
Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, July.
Rudolf Kadlec, Martin Schmid, Ondrej Bajgar, and Jan Kleindienst. 2016. Text understanding with the attention sum reader network. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 908–918.
Jin-Dong Kim, Yue Wang, Toshihisa Takagi, and Akinori Yonezawa. 2011. Overview of Genia event task in BioNLP shared task 2011. In Proceedings of BioNLP Shared Task 2011 Workshop, pages 7–15.
Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, Ishaan Gulrajani, James Bradbury, Victor Zhong, Romain Paulus, and Richard Socher. 2016. Ask me anything: Dynamic memory networks for natural language processing. International Conference on Machine Learning, 48:1378–1387.
J. Richard Landis and Gary G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, pages 159–174.
Ni Lao and William W Cohen. 2010. Relational retrieval using a combination of path-constrained random walks. Machine learning, 81(1):53–67.
Ni Lao, Tom Mitchell, and William W. Cohen. 2011. Random walk inference and learning in a large scale knowledge base. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 529–539.
Vivian Law, Craig Knox, Yannick Djoumbou, Tim Jewison, An Chi Guo, Yifeng Liu, Adam Maciejewski, David Arndt, Michael Wilson, Vanessa Neveu, Alexandra Tang, Geraldine Gabriel, Carol Ly, Sakina Adamjee, Zerihun T. Dame, Beomsoo Han, You Zhou, and David S. Wishart. 2014. DrugBank 4.0: Shedding new light on drug metabolism. Nucleic Acids Research, 42(D1):D1091–D1097.
Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 333–342, August.
Dekang Lin and Patrick Pantel. 2001. Discovery of inference rules for question-answering. Nat. Lang. Eng., 7(4):343–360, December.
Fei Liu and Julien Perez. 2017. Gated end-to-end memory networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Volume 1: Long Papers, pages 1–10.
Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 1003–1011.
Alvaro Morales, Varot Premtoon, Cordelia Avery, Sue Felshin, and Boris Katz. 2016. Learning to answer questions from Wikipedia infoboxes. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1930–1935.
Karthik Narasimhan, Adam Yala, and Regina Barzilay. 2016. Improving information extraction by acquiring external evidence with reinforcement learning. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, pages 2355–2365.
Arvind Neelakantan, Benjamin Roth, and Andrew McCallum. 2015. Compositional vector space models for knowledge base completion. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 156–166.
Anastasios Nentidis, Konstantinos Bougiatiotis, Anastasia Krithara, Georgios Paliouras, and Ioannis Kakadiaris. 2017. Results of the fifth edition of the BioASQ challenge. In BioNLP 2017, pages 48–57.
Mark Neumann, Pontus Stenetorp, and Sebastian Riedel. 2016. Learning to reason with adaptive computation. In Interpretable Machine Learning for Complex Systems at the 2016 Conference on Neural Information Processing Systems (NIPS), Barcelona, Spain, December.
Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. CoRR, abs/1611.09268.
Rodrigo Nogueira and Kyunghyun Cho. 2016. WebNav: A new large-scale task for natural language based sequential decision making. CoRR, abs/1602.02261.
Rodrigo Nogueira and Kyunghyun Cho. 2017. Task-oriented query reformulation with reinforcement learning. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 574–583.
Takeshi Onishi, Hai Wang, Mohit Bansal, Kevin Gimpel, and David A. McAllester. 2016. Who did what: A large-scale person-centered cloze dataset. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, pages 2230–2235.
Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernandez. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1525–1534.
Michael Pazzani, Clifford Brunk, and Glenn Silverstein. 1991. A knowledge-intensive approach to learning relational concepts. In Proceedings of the Eighth International Workshop on Machine Learning, pages 432–436, Evanston, IL.
Baolin Peng, Zhengdong Lu, Hang Li, and Kam-Fai Wong. 2015. Towards neural network-based reasoning. CoRR, abs/1508.05508.
Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. 2017. Cross-sentence N-ary relation extraction with graph LSTMs. Transactions of the Association for Computational Linguistics, 5:101–115.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
Bethany Percha, Yael Garten, and Russ B Altman. 2012. Discovery and explanation of drug-drug interactions via text mining. In Pacific symposium on biocomputing, page 410. NIH Public Access.
John Ross Quinlan. 1990. Learning logical definitions from relations. Machine Learning, 5:239–266.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2383–2392.
Bradley L. Richards and Raymond J. Mooney. 1991. First-order theory revision. In Proceedings of the Eighth International Workshop on Machine Learning, pages 447–451, Evanston, IL.
Matthew Richardson and Pedro Domingos. 2006. Markov logic networks. Mach. Learn., 62(1-2):107–136.
Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Proceedings of the 2010 European Conference on Machine Learning and Knowledge Discovery in Databases: Part III, ECML PKDD’10, pages 148–163.
Stefan Schoenmackers, Oren Etzioni, and Daniel S. Weld. 2008. Scaling textual inference to the web. In EMNLP ’08: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 79–88.
Stefan Schoenmackers, Oren Etzioni, Daniel S. Weld, and Jesse Davis. 2010. Learning first-order horn clauses from web text. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP ’10, pages 1088–1098.
Roy Schwartz, Maarten Sap, Ioannis Konstas, Leila Zilles, Yejin Choi, and Noah A. Smith. 2017. The effect of different writing tasks on linguistic style: A case study of the ROC story cloze task. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 15–25.
Isabel Segura-Bedmar, Paloma Martínez, and María Herrero Zazo. 2013. SemEval-2013 Task 9: Extraction of drug-drug interactions from biomedical texts (DDIExtraction 2013). In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 341–350.
Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017a. Bidirectional attention flow for machine comprehension. In The International Conference on Learning Representations (ICLR).
Minjoon Seo, Sewon Min, Ali Farhadi, and Hannaneh Hajishirzi. 2017b. Query-reduction networks for question answering. ICLR.
Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. 2017. ReasoNet: Learning to stop reading in machine comprehension. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’17, pages 1047–1055.
Alessandro Sordoni, Phillip Bachman, and Yoshua Bengio. 2016. Iterative alternating neural attention for machine reading. CoRR, abs/1606.02245.
Pontus Stenetorp, Goran Topić, Sampo Pyysalo, Tomoko Ohta, Jin-Dong Kim, and Jun’ichi Tsujii. 2011. BioNLP shared task 2011: Supporting resources. In Proceedings of BioNLP Shared Task 2011 Workshop, pages 112–120.
Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2440–2448.
Don R. Swanson. 1986. Undiscovered public knowledge. The Library Quarterly, 56(2):103–118.
The UniProt Consortium. 2017. UniProt: the universal protein knowledgebase. Nucleic Acids Research, 45(D1):D158–D169.
Denny Vrandečić. 2012. Wikidata: A new platform for collaborative data collection. In Proceedings of the 21st International Conference on World Wide Web, WWW ’12 Companion, pages 1063–1064.
Dirk Weissenborn, Georg Wiese, and Laura Seiffe. 2017. Making neural QA as simple as possible but not simpler. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 271–280. Association for Computational Linguistics.
Dirk Weissenborn. 2016. Separating answers from queries for neural reading comprehension. CoRR, abs/1607.03316.
Johannes Welbl, Nelson F. Liu, and Matt Gardner. 2017. Crowdsourcing multiple choice science questions. In Proceedings of the Third Workshop on Noisy User-generated Text, pages 94–106.
Jason Weston, Sumit Chopra, and Antoine Bordes. 2015. Memory networks. ICLR.
Jason Weston, Antoine Bordes, Sumit Chopra, and Tomas Mikolov. 2016. Towards AI-complete question answering: A set of prerequisite toy tasks. ICLR.
Georg Wiese, Dirk Weissenborn, and Mariana Neves. 2017. Neural question answering at BioASQ 5B. In Proceedings of the BioNLP 2017, pages 76–79.
Caiming Xiong, Victor Zhong, and Richard Socher. 2017. Dynamic coattention networks for question answering. ICLR.
Yi Yang, Wen-tau Yih, and Christopher Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2013–2018.
