Transactions of the Association for Computational Linguistics, vol. 6, pp. 317–328, 2018. Action Editor: Katrin Erk.

Submission batch: 6/2017; Revision batch: 9/2017; Published 5/2018.

© 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 Licence.

The NarrativeQA Reading Comprehension Challenge

Tomáš Kočiský†‡  Jonathan Schwarz†  Phil Blunsom†‡  Chris Dyer†  Karl Moritz Hermann†  Gábor Melis†  Edward Grefenstette†
†DeepMind  ‡University of Oxford
{tkocisky,schwarzjn,pblunsom,cdyer,kmh,melisgl,etg}@google.com

Abstract

Reading comprehension (RC)—in contrast to information retrieval—requires integrating information and reasoning about events, entities, and their relations across a full document. Question answering is conventionally used to assess RC ability, in both artificial agents and children learning to read. However, existing RC datasets and tasks are dominated by questions that can be solved by selecting answers using superficial information (e.g., local context similarity or global term frequency); they thus fail to test for the essential integrative aspect of RC. To encourage progress on deeper comprehension of language, we present a new dataset and set of tasks in which the reader must answer questions about stories by reading entire books or movie scripts. These tasks are designed so that successfully answering their questions requires understanding the underlying narrative rather than relying on shallow pattern matching or salience. We show that although humans solve the tasks easily, standard RC models struggle on the tasks presented here. We provide an analysis of the dataset and the challenges it presents.

1 Introduction

Natural language understanding seeks to create models that read and comprehend text. A common strategy for assessing the language understanding capabilities of comprehension models is to demonstrate that they can answer questions about documents they read, akin to how reading comprehension is tested in children when they are learning to read. After reading a document, a reader usually cannot reproduce the entire text from memory, but often can answer questions about underlying narrative elements of the document: the salient entities, events, places, and the relations between them. Thus, testing understanding requires the creation of questions that examine high-level abstractions instead of just facts occurring in one sentence at a time.

Title: Ghostbusters II
Question: How is Oscar related to Dana?
Answer: her son
Summary snippet: …Peter's former girlfriend Dana Barrett has had a son, Oscar
Story snippet:
  DANA (setting the wheel brakes on the buggy) Thank you, Frank. I'll get the hang of this eventually.
  She continues digging in her purse while Frank leans over the buggy and makes funny faces at the baby, OSCAR, a very cute nine-month-old boy.
  FRANK (to the baby) Hiya, Oscar. What do you say, slugger?
  FRANK (to Dana) That's a good-looking kid you got there, Ms. Barrett.

Figure 1: Example question–answer pair. The snippets here were extracted by humans from summaries and the full text of movie scripts or books, respectively, and are not provided to the model as supervision or at test time. Instead, the model will need to read the full text and locate salient snippets based solely on the question and its reading of the document in order to generate the answer.


Unfortunately, superficial questions about a document may often be answered successfully (by both humans and machines) using shallow pattern-matching strategies or guessing based on global salience. In the following section, we survey existing QA datasets, showing that they are either too small or answerable by shallow heuristics (Section 2). On the other hand, questions which are not about the surface form of the text, but rather about the underlying narrative, require the formation of more abstract representations about the events and relations expressed in the course of the document. Answering such questions requires that readers integrate information which may be distributed across several statements throughout the document, and generate a cogent answer on the basis of this integrated information. That is, they test that the reader comprehends language, not just that it can pattern match.

We present a new task and dataset, which we call NarrativeQA, which will test and reward artificial agents approaching this level of competence (Section 3), and make available online.¹ The dataset consists of stories, which are books and movie scripts, with human written questions and answers based solely on human-generated abstractive summaries. For the RC tasks, questions may be answered using just the summaries or the full story text. We give a short example of a sample movie script from this dataset in Figure 1.

Fictional stories have a number of advantages as a domain (Schank and Abelson, 1977). First, they are largely self-contained: beyond the basic fundamental vocabulary of English, all of the information about salient entities and concepts required to understand the narrative is present in the document, with the expectation that a reasonably competent language user would be able to understand it.² Second, story summaries are abstractive and generally written by independent authors who know the work only as a reader.

¹ http://deepmind.com/publications
² For example, new names and words may be coined by the author (e.g. "muggle" in the Harry Potter novels) but the reader need only appeal to the book itself to understand the meaning of these concepts, and their place in the narrative. This ability to form new concepts based on the contexts of a text is a crucial aspect of reading comprehension, and is in part tested as part of the question answering tasks we present.

2 Review of Reading Comprehension Datasets and Models

There are a large number of datasets and associated tasks available for the training and evaluation of reading comprehension models. We summarize the key features of a collection of popular recent datasets in Table 1. In this section, we briefly discuss the nature and limitations of these datasets and their associated tasks.

MCTest (Richardson et al., 2013) is a collection of short stories, each with multiple questions. Each such question has a set of possible answers, one of which is labelled as correct. While this could be used as a QA task, the MCTest corpus is in fact intended as an answer selection corpus. The data is human generated, and the answers can be phrases or sentences. The main limitation of this dataset is that it serves more as an evaluation challenge than as the basis for end-to-end training of models, due to its relatively small size.

In contrast, CNN/Daily Mail (Hermann et al., 2015), the Children's Book Test (CBT) (Hill et al., 2016), and BookTest (Bajgar et al., 2016) each provide large numbers of question–answer pairs. Questions are Cloze-form (predict the missing word) and are produced from either short abstractive summaries (CNN/Daily Mail) or from the next sentence in the document the context was taken from (CBT and BookTest). The tasks associated with these datasets all involve selecting an answer from a set of options, which is explicitly provided for CBT and BookTest, and is implicit for CNN/Daily Mail, as the answers are always entities from the document. This significantly favors models that operate by pointing to a particular token (or type). Indeed, the most successful models on these datasets, such as the Attention Sum Reader (AS Reader) (Kadlec et al., 2016), exploit precisely this bias in the data. However, these models are inappropriate for questions requiring the synthesis of a new answer. This bias towards answers that are shallowly salient is a more serious limitation of the CNN/Daily Mail dataset, since its context documents are news stories which usually contain a small number of salient entities and focus on a single event.

The Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) and NewsQA (Trischler et al., 2016) offer a different challenge. A large number of questions and answers are provided for a set of documents, where the answers are spans of the context document, i.e. contiguous sequences of words from the document.


Table 1: Comparison of datasets.

Dataset | Documents | Questions | Answers
MCTest (Richardson et al., 2013) | 660 short stories, grade school level | 2640 human generated, based on the document | multiple choice
CNN/Daily Mail (Hermann et al., 2015) | 93K + 220K news articles | 387K + 997K Cloze-form, based on highlights | entities
Children's Book Test (CBT) (Hill et al., 2016) | 687K of 20 sentence passages from 108 children's books | Cloze-form, from the 21st sentence | multiple choice
BookTest (Bajgar et al., 2016) | 14.2M, similar to CBT | Cloze-form, similar to CBT | multiple choice
SQuAD (Rajpurkar et al., 2016) | 23K paragraphs from 536 Wikipedia articles | 108K human generated, based on the paragraphs | spans
NewsQA (Trischler et al., 2016) | 13K news articles from the CNN dataset | 120K human generated, based on headline, highlights | spans
MS MARCO (Nguyen et al., 2016) | 1M passages from 200K+ documents retrieved using the queries | 100K search queries | human generated, based on the passages
SearchQA (Dunn et al., 2017) | 6.9M passages retrieved from a search engine using the queries | 140K human generated Jeopardy! questions | human generated Jeopardy! answers
NarrativeQA (this paper) | 1,572 stories (books, movie scripts) & human generated summaries | 46,765 human generated, based on summaries | human generated, based on summaries

Although the answers are not just single word/entity answers, many plausible questions for assessing RC cannot be asked because no document span would contain their answer. While these datasets provide a large number of questions, the questions come from a relatively small number of documents, which are themselves fairly short, thereby limiting the lexical and topical diversity of models trained on this data. While the answers are multi-word phrases, the spans are generally short and rarely cross sentence boundaries. Simple models scoring and/or extracting candidate spans conditioned on the question and superficial signal from the rest of the document do well, e.g., Seo et al. (2016). These models will not trivially generalize to problems where the answers are not spans in the document, supervision for spans is not provided, or several discontinuous spans are needed to generate a correct answer. This restricts the scalability and applicability of models doing well on SQuAD or NewsQA to more complex problems.

The MS MARCO dataset (Nguyen et al., 2016) presents a bolder challenge: questions are paired with sets of snippets ("context passages") that contain the information necessary to answer the question, and answers are free-form human generated text. However, as no restriction was placed on annotators to prevent them from copying answers from source documents, many answers are in fact verbatim copies of short spans from the context passages. Models that do well on SQuAD (e.g., Wang and Jiang (2016), Weissenborn et al. (2017)), extracting spans or pointing, do well here too, and the same concerns about the general applicability of solutions to this particular dataset to larger reading comprehension problems apply here also, as above.

SearchQA (Dunn et al., 2017) is a recent dataset in which the context for each question is a set of documents retrieved by a search engine using the question as the query. However, in contrast with previous datasets, neither questions nor answers were produced by annotating the context documents; rather, the context documents were retrieved after collecting pre-existing question–answer pairs. As such, it is not open to the same annotation bias as the datasets discussed above. However, upon examining answers in the Jeopardy data used to construct this dataset, one finds that 80% of answers are bigrams or unigrams, and 99% are 5 tokens or fewer. Of a sample of 100 answers, 72% are named entities, and all are short noun phrases.

Summary of Limitations. We see several limitations of the scope and depth of the RC problems in existing datasets. First, several datasets are small (MCTest) or not overly naturalistic (bAbI (Weston et al., 2015)). Second, in more naturalistic documents, a majority of questions require only a single sentence to locate supporting information for answering (Chen et al., 2016; Rajpurkar et al., 2016).


This, we suspect, is largely an artifact of the question generation methodology, in which annotators have created questions from a context document, or where context documents that explicitly answer a question are identified using a search engine. Although the factoid-like Jeopardy questions of SearchQA were not produced from context documents, they also appear to favor questions answerable with local context. Finally, we see further evidence of the superficiality of the questions in the architectures that have evolved to solve them, which tend to exploit span selection based on representations derived from local context and the query (Seo et al., 2016; Wang et al., 2017).

3 NarrativeQA: A New Dataset

In this section, we introduce our new dataset, NarrativeQA, which addresses many of the limitations identified in existing datasets.

3.1 Desiderata

From the features and limitations discussed above, we define our desiderata as follows. We wish to construct a dataset with a large number of question–answer pairs based either on a large number of supporting documents or on a smaller collection of large documents. This permits the training of neural network-based models over word embeddings and provides decent lexical coverage and diversity. The questions and answers should be natural, unconstrained, and human generated; and answering questions should frequently require reference to several parts or a larger span of the context document rather than superficial representations of local context. Furthermore, we want annotators to express, in their own words, higher-level relations between entities, places, and events, rather than copy short spans of the document. We also want to evaluate models both on the fluency and correctness of generated free-form answers, and as an answer selection problem, which requires the provision of sensible distractors to the correct answer. Finally, the scope and complexity of the QA problem should be such that current models struggle, while humans are capable of solving the task correctly, so as to motivate further research into the development of models seeking human reading comprehension ability.

3.2 Data Collection Method

We consider complex, self-contained narratives as our documents/stories. To make the annotation tractable and lead annotators towards asking non-localized questions, we only provide them human written summaries of the stories for generating the question–answer pairs.

We present both books and movie scripts as stories in our dataset. Books were collected from Project Gutenberg³ and movie scripts were scraped from the web.⁴ We matched our stories with plot summaries from Wikipedia using titles and verified the matching with help from human annotators. The annotators were asked to determine if both the story and the summary refer to a movie or a book (as some books are made into movies), or if they are the same part in a series produced in the same year. In this way we obtained 1,567 stories. This provides us with a smaller set of documents, compared to the other datasets, but the documents are long, which gives good lexical coverage and diversity. The bottleneck for obtaining a larger number of publicly available stories was finding corresponding summaries.

Annotators on Amazon Mechanical Turk were instructed to write 10 question–answer pairs each based solely on a given summary. Reading and annotating summaries is tractable, unlike writing questions and answers based on the full stories; moreover, as the annotators never see the full stories, we are much less likely to get questions and answers which are extracted from a localized context.

Annotators were instructed to imagine that they are writing questions to test students who have read the full stories but not the summaries. We required questions that are specific enough, given the length and complexity of the narratives, and asked for a diverse set of questions about characters, events, why this happened, and so on. Annotators were encouraged to use their own words and we prevented them from copying.⁵ We asked for answers that are grammatical, complete sentences, and explicitly allowed short answers (one word, a few-word phrase, or a short sentence) as we think that answering with a full sentence is frequently perceived as artificial when asking about factual information.

³ http://www.gutenberg.org/
⁴ Mainly from http://www.imsdb.com/, but also http://www.dailyscript.com/ and http://www.awesomefilm.com/.
⁵ This was done both through instructions and hard JavaScript limitations on the annotation site.


Annotators were asked to avoid extra, unnecessary information in the question or the answer, and to avoid yes/no questions or questions about the author or the actors. About 30 question–answer pairs per summary were obtained. The result is a collection of human written natural questions and answers. As we have multiple questions per summary/story, this allows us to consider answer selection (from among the 30) as a simpler version of the QA task rather than answer generation from scratch. Answer selection (Hewlett et al., 2016) and multiple-choice question answering (Richardson et al., 2013; Hill et al., 2016) are frequently used.

We additionally collected a second reference answer for each question by asking annotators to judge whether a question is answerable, given the summary, and to provide an answer if it was. All but 2.3% of the questions were judged as answerable.

3.3 Core Statistics

We collected 1,567 stories, evenly split between books and movie scripts. We partitioned the dataset into non-overlapping training, validation, and test portions, along stories/summaries. See Table 2 for detailed statistics. The dataset contains 46,765 question–answer pairs.

The questions are grammatical questions written by human annotators that average 9.8 tokens in length and are mostly formed as 'WH'-questions (see Table 3). We categorized a sample of 300 questions in Table 4 and observed a good variety of question types. An interesting category are questions which ask for something related to, or occurring together with, before, or after an event; these make up about 15% of the sample.

Answers in the dataset are human written natural answers that are short, averaging 4.73 tokens, but are not restricted to spans from the documents. Answers appear as spans of the summaries and the stories in 44.05% and 29.57% of cases, respectively. As expected, a lower proportion of answers are spans of the stories compared to the summaries on which they were constructed.

3.4 Tasks

We present tasks varying in their scope and complexity: we consider either the summary or the story as context, and for each we evaluate answer generation and answer selection.

The task of answering questions based on summaries is similar in scope to previous datasets. However, summaries contain more complex relationships and timelines than news articles or short paragraphs from the web and thus provide a task different in nature. We hope that NarrativeQA will motivate the design of architectures capable of modeling such relationships. This setting is similar to the previous tasks in that the questions and answers were constructed based on these supporting documents.

The full version of NarrativeQA requires reading and understanding entire stories (i.e., books and movie scripts). At present, this task is intractable for existing neural models out of the box. We further discuss the challenges and possible approaches in the following sections.

We require the use of metrics for generated text. We evaluate using BLEU-1, BLEU-4 (Papineni et al., 2002), Meteor (Denkowski and Lavie, 2011), and ROUGE-L (Lin, 2004), using two references for each question,⁶ except for the human baseline, where we evaluate one reference against the other. We also evaluate our models using a ranking metric. This allows us to evaluate how good our model is at reading comprehension regardless of how good it is at generating answers. We rank answers for questions associated with the same summary/story and compute the mean reciprocal rank (MRR).⁷

⁶ We lowercase both the candidates and the references and remove the end of sentence marker and the final full stop.
⁷ MRR is the mean over examples of 1/r, where r ∈ {1, 2, …} is the rank of the correct answer among candidates.

4 Baselines and Oracles

In this section, we show that NarrativeQA presents a challenging problem for current approaches to reading comprehension by evaluating several baselines based on information retrieval (IR) techniques and neural models. Since neural models use quite different processes for generating answers (e.g., predicting a single word or entity, selecting a span of the document context, or open generation of the answer sequence), we present results on each. We also report the human performance by scoring the second reference answer against the first.
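To make the ranking evaluation concrete, the sketch below shows one way the answer normalization from footnote 6 and the MRR from footnote 7 could be implemented. It is a minimal illustration under our own assumptions (the function names and the toy candidate lists are invented for the example), not the released evaluation code.

```python
# Minimal sketch of the ranking evaluation (MRR) described above.
# Candidates for each question are all answers associated with the same
# summary/story, assumed here to be pre-sorted by a model's score.

def normalize(answer: str) -> str:
    """Lowercase and strip the final full stop, as done before scoring answers."""
    answer = answer.strip().lower()
    return answer[:-1].strip() if answer.endswith(".") else answer

def mean_reciprocal_rank(ranked_candidates: list[list[str]], references: list[str]) -> float:
    """MRR = mean over questions of 1/r, where r is the rank of the correct answer."""
    total = 0.0
    for candidates, gold in zip(ranked_candidates, references):
        gold = normalize(gold)
        rank = next(i for i, c in enumerate(candidates, start=1) if normalize(c) == gold)
        total += 1.0 / rank
    return total / len(references)

# Two toy questions; candidates are already ranked by an (imaginary) model.
ranked = [["her son", "her nephew", "her brother"],
          ["a bayonet wound", "Her son.", "a gunshot"]]
gold = ["Her son.", "her son"]
print(mean_reciprocal_rank(ranked, gold))  # (1/1 + 1/2) / 2 = 0.75
```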


Table 2: NarrativeQA dataset statistics.

 | train | valid | test
# documents | 1,102 | 115 | 355
  books | 548 | 58 | 177
  movie scripts | 554 | 57 | 178
# question–answer pairs | 32,747 | 3,461 | 10,557
Avg. # tok. in summaries | 659 | 638 | 654
Max # tok. in summaries | 1,161 | 1,189 | 1,148
Avg. # tok. in stories | 62,528 | 62,743 | 57,780
Max # tok. in stories | 430,061 | 418,265 | 404,641
Avg. # tok. in questions | 9.83 | 9.69 | 9.85
Avg. # tok. in answers | 4.73 | 4.60 | 4.72

Table 3: Frequency of first token of the question in the training set.

First token | Frequency
What | 38.04%
Who | 23.37%
Why | 9.78%
How | 8.85%
Where | 7.53%
Which | 2.21%
How many/much | 1.80%
When | 1.67%
In | 1.19%
OTHER | 5.57%

Table 4: Question categories on a sample of 300 questions from the validation set.

Category | Frequency
Person | 30.54%
Description | 24.50%
Location | 9.73%
Why/reason | 9.40%
How/method | 8.05%
Event | 4.36%
Entity | 4.03%
Object | 3.36%
Numeric | 3.02%
Duration | 1.68%
Relation | 1.34%

4.1 Simple IR Baselines

We consider basic IR baselines which retrieve an answer by selecting a span of tokens from the context document based on a similarity measure between the candidate span and a query. We compare two queries: the question and (as an oracle) the gold standard answer. The answer oracle provides an upper bound on the performance of span retrieval models, including the neural models discussed below. When using the question as the query, we obtain generalization results for IR methods. Test set results are computed by extracting either 4-gram, 8-gram, or full-sentence spans according to the best performance on the validation set.⁸ We consider three similarity metrics for extracting spans: BLEU-1, ROUGE-L, and the cosine similarity between bag-of-words embeddings of the query and the candidate span using pre-trained GloVe word embeddings (Pennington et al., 2014).

⁸ Note that we do not consider the span's context when computing the MRR for IR baselines, as the candidate spans (i.e. all answers to questions on the story) are given and simply ranked by their similarity to the query.

4.2 Neural Benchmarks

As a first benchmark we consider a simple bi-directional LSTM sequence to sequence (Seq2Seq) model (Sutskever et al., 2014) predicting the answer directly from the query. Importantly, we provide no context information from either summary or story. Such a model might classify the question and predict an answer of a similar topic or category.

Previous reading comprehension tasks such as CNN/Daily Mail motivated models constrained to predicting a single token from the input sequence. The AS Reader (Attention Sum Reader (Kadlec et al., 2016)) considers the entire context and predicts a distribution over unique word types. We adapt the model for sequence prediction by using an LSTM sequence decoder and choosing a token from the input at each step of the output sequence.

As a span-prediction model we consider a simplified version of the Bi-Directional Attention Flow network (Seo et al., 2016). We omit the character embedding layer and learn a mapping from words to a vector space rather than making use of pre-trained embeddings; and we use a single layer bi-directional LSTM to model interactions among context words conditioned on the query (modelling layer). As proposed, we adopt the output layer tailored for span prediction and leave the rest unchanged. It was not our aim to use the state-of-the-art model for other datasets, but rather to provide a strong benchmark.

Span prediction models can be trained by obtaining supervision on the training set from the oracle IR model. We use start and end indices of the span achieving the highest ROUGE-L score with respect to the reference answers as labels on the training set. The model is then trained to predict these spans by maximizing the probability of the indices.
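As an illustration of how such span labels could be derived, the sketch below scores every fixed-length document span against a reference answer with an LCS-based ROUGE-L and returns the start and end indices of the best one. It is a simplified stand-in (fixed span length, single reference, F1 scoring), not the authors' exact labelling code.

```python
# Minimal sketch of deriving span supervision from the oracle IR model:
# label a training question with the start/end token indices of the document
# span that scores highest under ROUGE-L against the reference answer.

def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate: list[str], reference: list[str]) -> float:
    """ROUGE-L F1 between a candidate span and a reference answer."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(candidate), lcs / len(reference)
    return 2 * precision * recall / (precision + recall)

def best_span(doc_tokens: list[str], answer_tokens: list[str], span_len: int) -> tuple[int, int]:
    """Return (start, end) indices of the highest-scoring span of a fixed length."""
    scores = [rouge_l(doc_tokens[i:i + span_len], answer_tokens)
              for i in range(len(doc_tokens) - span_len + 1)]
    start = max(range(len(scores)), key=scores.__getitem__)
    return start, start + span_len

doc = "dana s son oscar is a nine month old boy".split()
print(best_span(doc, "her son oscar".split(), span_len=3))  # (1, 4): the span "s son oscar"
```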


4.3 Neural Benchmarks on Stories

The design of the NarrativeQA dataset makes the straightforward application of existing neural architectures computationally infeasible, as this would require running a recurrent neural network on sequences of hundreds of thousands of time steps, or computing a distribution over the entire input for attention, as is common.

We split the task into two steps: first, we retrieve a small number of relevant passages from the story using an IR system; second, we apply one of the neural models on the resulting document. The question becomes the query for retrieval. This IR problem is much harder than traditional document retrieval, as the documents (the passages here) are very similar to each other, the question is short, and the entities mentioned likely occur many times in the story.

Our retrieval system considers chunks of 200 words from the story and computes representations for all chunks and the query. We then select a varying number of such chunks based on their similarity to the query. We experiment with different representations and similarity measures in Section 5. Finally, we concatenate the selected chunks in the correct temporal order and insert delimiters between them to obtain a much shorter document. For span prediction models, we then further select a span from the retrieved chunks as described in Section 4.2.

Table 5: Experiments on summaries (validation/test). Higher is better for all metrics. Sections 4.1 and 4.2 explain the IR and neural models, respectively.

Model | BLEU-1 | BLEU-4 | Meteor | ROUGE-L | MRR
IR Baselines
BLEU-1 given question (1 sentence) | 10.48/10.75 | 3.02/3.34 | 11.93/12.33 | 14.34/14.90 | 0.176/0.171
ROUGE-L given question (8-gram) | 11.74/11.01 | 2.18/1.99 | 7.05/6.50 | 12.58/11.74 | 0.168/0.161
Cosine given question (1 sentence) | 7.49/7.51 | 1.88/1.97 | 10.18/10.35 | 12.01/12.28 | 0.170/0.171
Random rank | — | — | — | — | 0.133/0.133
Neural Benchmarks
Seq2Seq (no context) | 16.10/15.89 | 1.40/1.26 | 4.22/4.08 | 13.29/13.15 | 0.211/0.202
Attention Sum Reader | 23.54/23.20 | 5.90/6.39 | 8.02/7.77 | 23.28/22.26 | 0.269/0.259
Span Prediction | 33.45/33.72 | 15.69/15.53 | 15.68/15.38 | 36.74/36.30 | —
Oracle IR Models
BLEU-1 given answer (ans. length) | 54.60/55.55 | 26.71/27.78 | 31.32/32.08 | 58.90/59.77 | 1.000/1.000
ROUGE-L given answer (ans. length) | 52.94/54.14 | 27.18/28.18 | 30.81/31.50 | 59.09/59.92 | 1.000/1.000
Cosine given answer (ans. length) | 46.69/47.95 | 24.25/25.25 | 27.02/27.81 | 44.64/45.66 | 0.836/0.838
Human (given summaries) | 44.24/44.43 | 18.17/19.65 | 23.87/24.14 | 57.17/57.02 | —

5 Experiments

In this section, we describe the data preparation methodology we used, and the experimental results on the summary-reading task as well as the full story task.

5.1 Data Preparation

The provided narratives contain a large number of named entities (such as names of characters or places). Inspired by Hermann et al. (2015), we replace such entities with markers, such as @entity42. These markers are permuted during training and testing so that none of their embeddings learns a specific entity's representation. This allows us to build representations for entities from stories that were never seen in training, since they are given a specific identifier (to differentiate them from other entities in the document) from a set of generic identifiers re-used across documents. Entities are replaced according to a simple heuristic based on a capital first character and the respective word not appearing in lowercase.
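A minimal sketch of this anonymization heuristic follows. The marker-assignment scheme and the toy sentence are illustrative assumptions; in particular, the paper permutes markers during training and testing so that no embedding ties to a specific entity, which is only approximated here by a seeded shuffle.

```python
# Sketch of the entity anonymization heuristic: tokens with a capitalized
# first character that never occur in lowercase elsewhere in the document
# are treated as entity mentions and replaced with generic markers such
# as @entity42.

import random

def anonymize_entities(tokens: list[str], seed: int = 0) -> list[str]:
    lowercase_vocab = {t for t in tokens if t.islower()}
    # Candidate entities: capitalized tokens whose lowercase form never appears.
    candidates = {t for t in tokens if t[:1].isupper() and t.lower() not in lowercase_vocab}
    # Assign each entity a marker drawn from a shuffled pool of generic ids.
    ids = list(range(len(candidates)))
    random.Random(seed).shuffle(ids)
    marker = {ent: f"@entity{ids[i]}" for i, ent in enumerate(sorted(candidates))}
    return [marker.get(t, t) for t in tokens]

tokens = "Dana thanks Frank while Oscar sits in the buggy and the buggy rolls".split()
print(anonymize_entities(tokens))
# Dana, Frank, and Oscar are replaced by @entity markers; all other tokens are unchanged.
```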


5.2 Reading Summaries Only

Reading comprehension of summaries is similar to a number of previous reading comprehension tasks where questions were constructed based on the context document. However, plot summaries tend to contain more intricate event timelines and a larger number of characters, and in this sense are more complex to follow than news articles or paragraphs from Wikipedia. See Table 5 for the results.

Given that questions were constructed based on the summaries, we expected that both neural models and span-selection models would perform well. This is indeed the case, with the neural span prediction model significantly outperforming all other proposed methods. However, significant room remains for improvement when compared with the oracle and human scores.

Both the plain sequence-to-sequence model and the AS Reader, successfully applied to the CNN/Daily Mail reading comprehension task, also performed well on this task. We observe that the AS Reader tends to copy subsequent tokens from the context, thus behaving like a span prediction model. An additional inductive bias results in higher performance for the span prediction model. Similar observations between the AS Reader and span models have also been made by Wang and Jiang (2016).

Note that we have tuned each model separately on the development set twice: once selecting the best model based on ROUGE-L, reporting the first four metrics, and a second time selecting based on the MRR.

5.3 Reading Full Stories Only

Table 6: Experiments on full stories (validation/test). Each chunk contains 200 tokens. Higher is better for all metrics. Sections 4.1 and 4.2 explain the IR and neural models, respectively. Note that the human scores are based on answering questions given summaries, same as in Table 5.

Model | BLEU-1 | BLEU-4 | Meteor | ROUGE-L | MRR
IR Baselines
BLEU-1 given question (8-gram) | 6.73/6.52 | 0.30/0.34 | 3.58/3.35 | 6.73/6.45 | 0.176/0.171
ROUGE-L given question (1 sentence) | 5.78/5.69 | 0.25/0.32 | 3.71/3.64 | 6.36/6.26 | 0.168/0.161
Cosine given question (8-gram) | 6.40/6.33 | 0.28/0.29 | 3.54/3.28 | 6.50/6.43 | 0.171/0.171
Random rank | — | — | — | — | 0.133/0.133
Neural Benchmarks
Attention Sum Reader given 1 chunk | 16.95/16.08 | 1.26/1.08 | 3.84/3.56 | 12.12/11.94 | 0.164/0.161
Attention Sum Reader given 2 chunks | 18.54/17.76 | 0.0/1.1 | 4.2/4.01 | 13.5/12.83 | 0.169/0.169
Attention Sum Reader given 5 chunks | 18.91/18.36 | 1.37/1.64 | 4.48/4.24 | 14.47/13.4 | 0.171/0.173
Attention Sum Reader given 10 chunks | 20.0/19.09 | 2.23/1.81 | 4.45/4.29 | 14.47/14.03 | 0.182/0.177
Attention Sum Reader given 20 chunks | 19.79/19.06 | 1.79/2.11 | 4.6/4.37 | 14.86/14.02 | 0.182/0.179
Span Prediction | 5.82/5.68 | 0.22/0.25 | 3.84/3.72 | 6.33/6.22 | —
Oracle IR Models
BLEU-1 given answer (ans. length) | 41.81/42.37 | 7.03/7.70 | 19.10/19.52 | 46.40/47.15 | 1.000/1.000
ROUGE-L given answer (ans. length) | 39.17/39.50 | 7.81/8.46 | 18.13/18.55 | 48.91/49.94 | 1.000/1.000
Cosine given answer (4-gram) | 38.21/38.92 | 7.78/8.43 | 12.58/12.60 | 31.24/31.70 | 0.842/0.845
Human (given summaries) | 44.24/44.43 | 18.17/19.65 | 23.87/24.14 | 57.17/57.02 | —

Table 6 summarizes the results on the full NarrativeQA task, where the context documents are full stories. As expected (and desired), we observe a decline in performance of the span-selection oracle IR model, compared to the results on summaries. This is unsurprising as the questions were constructed on summaries, and it confirms the initial motivation for designing this task. As previously, we considered all spans of a given length across the entire story for this model. For short answers of one or two words—typically main characters in a story—the candidate (i.e., the closest span to the reference answer) is easily found due to being mentioned throughout the text. For longer answers it becomes much less likely, compared to the summaries, that a high-scoring span can be found in the story. Note that this distinguishes NarrativeQA from many of the reviewed datasets.


In our IR plus neural two-step approach to the task, we first retrieve relevant chunks of the stories and then apply existing reading comprehension models. We use the questions to guide the IR system for chunk extraction, with the results of the standalone IR baselines giving an indication of the difficulty of this aspect of the task. The retrieval quality has a direct effect on the performance of all neural models—a challenge which models on summaries are not presented with. We considered several approaches to chunk selection: retrieving chunks containing the highest ROUGE-L or BLEU-1 scoring span with respect to the question; comparing topic distributions of questions and chunks from an LDA model (Blei et al., 2003) according to their symmetric Kullback–Leibler divergence; and, finally, the cosine similarity of TF-IDF representations. We found that this last approach led to the best performance of the subsequently applied model on the validation set, irrespective of the number of chunks. Note that we used the answer as the query for training, and the question for validation and test.
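The retrieval step itself can be sketched as follows, assuming the TF-IDF variant: split the story into 200-token chunks, represent chunks and the question as TF-IDF vectors, and keep the top-k chunks by cosine similarity, concatenated in their original (temporal) order. The scikit-learn calls and the "<chunk>" delimiter are our own assumptions for the illustration; the paper does not prescribe a particular implementation.

```python
# Sketch of the two-step approach's retrieval stage: TF-IDF chunk scoring
# against the question, keeping the best chunks in story order.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_chunks(story_tokens: list[str], question: str,
                    chunk_size: int = 200, top_k: int = 5) -> str:
    chunks = [" ".join(story_tokens[i:i + chunk_size])
              for i in range(0, len(story_tokens), chunk_size)]
    vectorizer = TfidfVectorizer()
    chunk_vectors = vectorizer.fit_transform(chunks)
    question_vector = vectorizer.transform([question])
    scores = cosine_similarity(question_vector, chunk_vectors).ravel()
    # Keep the best-scoring chunks, but restore story order before concatenating.
    best = sorted(sorted(range(len(chunks)), key=lambda i: -scores[i])[:top_k])
    return " <chunk> ".join(chunks[i] for i in best)

# Toy usage: the question should pull in the chunks mentioning Oscar and Dana.
story = ("dana pushes the buggy through the park . " * 60 +
         "oscar is her son and frank waves at the baby . " * 60).split()
print(retrieve_chunks(story, "how is oscar related to dana ?", top_k=2)[:120])
```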
Given the retrieved chunks, we experimented with several neural models using them as context. The AS Reader, which was the better-performing model on the summaries task, underperforms the simple no-context Seq2Seq baseline (shown in Table 5) in terms of MRR. While it does slightly better on the other metrics, it clearly fails to make use of the retrieved context to gain a distinctive margin over the no-context Seq2Seq model. Increasing the number of retrieved chunks, and thereby recall of possibly relevant parts of the story, had only a minor positive effect. The span prediction model—which here also uses selected chunks for context—does especially poorly in this setup. While this model provided the best neural results on the summaries task, we suspect that its performance was particularly badly hurt by the fact that there is so little lexical and grammatical overlap between the source of the questions (summaries) and the context provided (stories). As with the AS Reader, we observed no significant differences for varying numbers of chunks.

These results leave a large gap to human performance, highlighting the success of our design objective to build a task that is realistic and straightforward for humans while very difficult for current reading comprehension models.

Title: Armageddon 2419 A.D.
Question: In what year did Rogers awaken from his deep slumber?
Answer: 2419
Summary snippet: …Rogers remained in sleep for 492 years. He awakes in 2419 and, …
Story snippet: I should state therefore, that I, Anthony Rogers, am, so far as I know, the only man alive whose normal span of eighty-one years of life has been spread over a period of 573 years. To be precise, I lived the first twenty-nine years of my life between 1898 and 1927; the other fifty-two since 2419. The gap between these two, a period of nearly five hundred years, I spent in a state of suspended animation, free from the ravages of katabolic processes, and without any apparent effect on my physical or mental faculties. When I began my long sleep, man had just begun his real conquest of the air

Figure 2: Example question–answer pair with snippets from the summary and the story.

6 Qualitative Analysis and Challenges

We find that the proposed dataset meets the desiderata we set out in Section 3.1. In particular, we constructed a dataset with a number of long documents, characterized by good lexical coverage and diversity. The questions and answers are human generated and natural sounding; and, based on a small manual examination of 'Ghostbusters II', 'Airplane', and 'Jacob's Ladder', only a small number of questions and answers are shallow paraphrases of sentences in the full document. Most questions require reading segments at least several paragraphs long, and in some cases even multiple segments spread throughout the story.

Computational challenges identified in Section 5.3 naturally suggest a retrieval procedure as the first step. We found that the retrieval is challenging, even for humans not familiar with the presented narrative. In particular, the task often requires referring to larger parts of the story, in addition to knowing at least some background about entities. This makes the search procedure, based on only a short question, a challenging and interesting task in itself.

We show example question–answer pairs in Figures 1, 2, and 3. These examples were chosen from a small set of manually annotated question–answer pairs to be representative of this collection.


In particular, the examples show that larger parts of the story are required to answer questions. Figure 3 shows that while the relevant paragraph depicting the injury appears early on, it is not until the next snippet (which appears at the end of the narrative) that the lethal consequences of the injury are revealed. This illustrates an iterative reasoning process as well as extremely long temporal dependencies that we encountered during manual annotation. As shown in Figure 1, reading comprehension on movie scripts requires an understanding of the written dialogue. This is a challenge as dialogue is typically non-descriptive, whereas the questions were asked based on descriptive summaries, requiring models to "read between the lines".

Title: Jacob's Ladder
Question: What is the fatal injury that Jacob sustains which ultimately leads to his death?
Answer: A bayonet stabbing to his gut.
Summary snippet: A terrified Jacob flees into the jungle, only to be bayoneted in the gut by an unseen assailant. […] In a wartime triage tent in 1971, military doctors fruitlessly treating Jacob reluctantly declare him dead
Story snippet: As he spins around one of the attackers jams all eight inches of his bayonet blade into Jacob's stomach. Jacob screams. It is a loud and piercing wail. […]
  Int. Vietnam Field Hospital - Day
  A doctor leans his head in front of the lamp and removes his mask. His expression is somber. He shakes his head. His words are simple and final.
  DOCTOR
  He's gone.
  Cut to Jacob Singer
  The doctor steps away. A nurse rudely pulls a green sheet up over his head. The doctor turns to one of the aides and throws up his hands in defeat.

Figure 3: Example question–answer pair with snippets from the summary and the story.

We expect that understanding narratives as complex as those presented in NarrativeQA will require transferring text understanding capability from other supervised learning tasks.

7 Related Work

This paper presents the first large-scale question answering dataset on full-length books and movie scripts. However, although we are the first to look at the QA task, learning to understand books through other modeling objectives has become an important sub-problem in NLP. These range from high-level plot understanding through clustering of novels (Frermann and Szarvas, 2017) or summarization of movie scripts (Gorinski and Lapata, 2015), to more fine-grained processing by inducing character types (Bamman et al., 2014b; Bamman et al., 2014a), understanding relationships between characters (Iyyer et al., 2016; Chaturvedi et al., 2017), or understanding plans, goals, and narrative structure in terms of abstract narratives (Schank and Abelson, 1977; Wilensky, 1978; Black and Wilensky, 1979; Chambers and Jurafsky, 2009). In computer vision, the MovieQA dataset (Tapaswi et al., 2016) fulfills a similar role as NarrativeQA. It seeks to test the ability of models to comprehend movies via question answering, and part of the dataset includes full-length scripts.

8 Conclusion

We have introduced a new dataset and a set of tasks for training and evaluating reading comprehension systems, borne from an analysis of the limitations of existing datasets and tasks. While our QA task resembles tasks provided by existing datasets, it exposes new challenges because of its domain: fiction. Fictional stories—in contrast to news stories—are self-contained and describe a richer set of entities, events, and the relations between them. We have a range of tasks, from simple (which require models to read summaries of books and movie scripts, and generate or rank fluent English answers to human-generated questions) to more complex (which require models to read the full stories to answer the questions, with no access to the summaries).

In addition to the issue of scaling neural models to large documents, the larger tasks are significantly more difficult, as questions formulated based on one or two sentences of a summary might require appealing to possibly discontiguous sentences or paragraphs from the source text.


This requires potential solutions to these tasks to jointly model the process of searching for information (possibly in several steps) to serve as support for generating an answer, alongside the process of generating the answer entailed by said support. End-to-end mechanisms for searching for information, such as attention, do not scale beyond selecting words or n-grams in short contexts such as sentences and small documents. Likewise, neural models for mapping documents to answers, or determining entailment between supporting evidence and a hypothesis, typically operate on the scale of sentences rather than sets of paragraphs.

We have provided baseline and benchmark results for both sets of tasks, demonstrating that while existing models give sensible results out of the box on summaries, they do not get any traction on the book-scale tasks. Having given a quantitative and qualitative analysis of the difficulty of the more complex tasks, we suggest research directions that may help bridge the gap between existing models and human performance. Our hope is that this dataset will serve not only as a challenge for the machine reading community, but as a driver for the development of a new class of neural models which will take a significant step beyond the level of complexity which existing datasets and tasks permit.

References

Ondrej Bajgar, Rudolf Kadlec, and Jan Kleindienst. 2016. Embracing data abundance: BookTest dataset for reading comprehension. CoRR, arXiv:1610.00956.

David Bamman, Brendan O'Connor, and Noah A. Smith. 2014a. Learning latent personas of film characters. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, page 352.

David Bamman, Ted Underwood, and Noah A. Smith. 2014b. A Bayesian mixed effects model of literary character. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 370–379.

John B. Black and Robert Wilensky. 1979. An evaluation of story grammars. Cognitive Science, 3(3):213–229.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

Nathanael Chambers and Dan Jurafsky. 2009. Unsupervised learning of narrative schemas and their participants. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, ACL '09, pages 602–610.

Snigdha Chaturvedi, Mohit Iyyer, and Hal Daumé III. 2017. Unsupervised learning of evolving relationships between literary characters. In Association for the Advancement of Artificial Intelligence.

Danqi Chen, Jason Bolton, and Christopher D. Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Michael Denkowski and Alon Lavie. 2011. Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In Proceedings of the EMNLP 2011 Workshop on Statistical Machine Translation, pages 85–91.

Matthew Dunn, Levent Sagun, Mike Higgins, Ugur Guney, Volkan Cirik, and Kyunghyun Cho. 2017. SearchQA: A new Q&A dataset augmented with context from a search engine. arXiv:1704.05179v2.

Lea Frermann and György Szarvas. 2017. Inducing semantic micro-clusters from deep multi-view representations of novels. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1874–1884.

Philip John Gorinski and Mirella Lapata. 2015. Movie script summarization as graph-based scene extraction. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1066–1076.

Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693–1701.

Daniel Hewlett, Alexandre Lacoste, Llion Jones, Illia Polosukhin, Andrew Fandrianto, Jay Han, Matthew Kelcey, and David Berthelot. 2016. WikiReading: A novel large-scale language understanding task over Wikipedia. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1535–1545.

Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2016. The Goldilocks principle: Reading children's books with explicit memory representations. In Proceedings of ICLR.

Mohit Iyyer, Anupam Guha, Snigdha Chaturvedi, Jordan Boyd-Graber, and Hal Daumé III. 2016. Feuding families and former friends: Unsupervised learning for dynamic fictional relationships. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1534–1544.


Rudolf Kadlec, Martin Schmid, Ondřej Bajgar, and Jan Kleindienst. 2016. Text understanding with the attention sum reader network. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 908–918.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the ACL Workshop on Text Summarization Branches Out.

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. CoRR, arXiv:1611.09268.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP).

Matthew Richardson, Christopher J. C. Burges, and Erin Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 193–203.

Roger C. Schank and Robert P. Abelson. 1977. Scripts, Plans, Goals and Understanding: An Inquiry into Human Knowledge Structures. L. Erlbaum, Hillsdale, NJ.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. arXiv:1611.01603.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2016. MovieQA: Understanding stories in movies through question-answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4631–4640.

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2016. NewsQA: A machine comprehension dataset. CoRR, arXiv:1611.09830.

Shuohang Wang and Jing Jiang. 2016. Machine comprehension using match-LSTM and answer pointer. arXiv:1608.07905.

Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017. Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 189–198.

Dirk Weissenborn, Georg Wiese, and Laura Seiffe. 2017. FastQA: A simple and efficient neural architecture for question answering. CoRR, arXiv:1703.04816v1.

Jason Weston, Antoine Bordes, Sumit Chopra, and Tomas Mikolov. 2015. Towards AI-complete question answering: A set of prerequisite toy tasks. CoRR, arXiv:1502.05698.

R. Wilensky. 1978. Why John married Mary: Understanding stories involving recurring goals. Cognitive Science, 2(3):235–266.