Transactions of the Association for Computational Linguistics, vol. 6, pp. 241–252, 2018. Action Editor: Brian Roark.

Submission batch: 9/2017; Revision batch: 12/2017; Published 4/2018.

© 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Questionable Answers in Question Answering Research: Reproducibility and Variability of Published Results

Matt Crane
David R. Cheriton School of Computer Science, University of Waterloo
matt.crane@uwaterloo.ca

Abstract

"Based on theoretical reasoning it has been suggested that the reliability of findings published in the scientific literature decreases with the popularity of a research field" (Pfeiffer and Hoffmann, 2009). As we know, deep learning is very popular and the ability to reproduce results is an important part of science. There is growing concern within the deep learning community about the reproducibility of results that are presented. In this paper we present a number of controllable, yet unreported, effects that can substantially change the effectiveness of a sample model, and thus the reproducibility of those results. Through these environmental effects we show that the commonly held belief that distributing source code is all that is needed for reproducibility is mistaken. Source code without a reproducible environment does not mean anything at all. In addition, the range of results produced from these effects can be larger than the majority of incremental improvements reported.

1 Introduction

The recent "reproducibility crisis" (Baker, 2016) in various scientific fields (particularly Psychology and Social Sciences) indicates that some introspection is needed in all fields, particularly those that are experimental by nature. The efforts of Collberg's repeatability studies highlight the state of affairs within the computer systems research community (Moraila et al., 2014; Collberg et al., 2015). [Footnote 1: http://reproducibility.cs.arizona.edu] Other fields have also begun to push for more stringent presentation of results. For example, the information retrieval community has been aware for some time of the issues surrounding weak baselines (Armstrong et al., 2009) and, more recently, reproducibility (Arguello et al., 2016; Lin et al., 2016).

The issue of reproducibility in the deep-learning community has also become a growing concern, with the need for replicable and reproducible results being included in a list of challenges for the ACL (Nivre, 2017). In reinforcement learning, Henderson et al. (2017) showed that there are a number of effects that would change the results obtained by published authors, and called for more rigorous testing, and reporting, of state-of-the-art methods. There is also an ongoing project by OpenAI to provide baselines in reinforcement learning that are reproduced from published descriptions, but even they admit that their scores are only "roughly on par with the scores in published papers." [Footnote 2: https://blog.openai.com/openai-baselines-dqn/] Reimers and Gurevych (2017) investigated over 50,000 combinations of hyper-parameter settings, such as word embedding sources and the optimizer, across five different NLP tasks and found that these settings have a significant impact on both the variability and the relative effectiveness of models.

In this paper we present a number of controllable environment settings that often go unreported, and illustrate that these are factors that can cause irreproducibility of results as presented in the literature. These environmental factors have an effect on the effectiveness of neural networks due to the non-convexity of the optimization surface, meaning that
even minor changes in computation can lead the network to fall into one of a multitude of local minima. That these effect sizes are comparable to the largest incremental improvements that have been reported calls into question those improvements, and the associated claims of progress.

2 Experimental Setup

In order to limit the scope of this paper, we specifically focus our efforts on a single natural language processing task, answer selection within question answering, elaborated upon in Section 2.1. We further limit our discussion to how these environmental effects manifest in a single implementation of a single model, described in Section 2.2. These restrictions, however, do not mean that our results are only applicable to this model on this task; rather, our discussion generalizes to all neural network based research.

To isolate the effect of each environmental factor, all other settings related to the network are fixed; that is, the hyper-parameters are static across all experiments, and only the environmental variable of interest is manipulated. Along with each of the presented factors we include suggestions on how to respond to it in order to best ensure that the work, as presented, is reproducible.

2.1 Exemplar Task

Answer selection is one important aspect of open-domain question answering. Given a question, q, and a set of candidate sentences, A, the answer selection task is to rank the sentences contained in A such that those candidates that answer the question are ranked at the top of the list. From this ranked list and assessments of whether each candidate contains an answer to the question, the common information retrieval metrics average precision (AP) and reciprocal rank (RR) can be calculated to assess the effectiveness of the system. These are the de facto metrics for evaluating answer selection, and as such they are the metrics reported within this paper. Descriptions of these metrics are easily found in the literature.

Worryingly, it is becoming increasingly common in the literature to not conduct statistical significance testing; rather, a higher metric value is taken as evidence that the model performs better. Due to the nature of this paper, we only perform significance testing between results in the same condition, and not across conditions. Within each condition we identify a "baseline"/default setting to compare against. Conducting this many significance tests would normally call for a correction method to be applied, but we do not do so, as we only wish to indicate that selecting the higher number may result in an absolute difference, but not necessarily a statistically significant one. To calculate significance we use a paired Wilcoxon signed rank test.

[Figure 1: Exemplar model architecture diagram.]

2.2 Exemplar Model

To perform our experiments we utilize the model released by Sequiera et al. (2017), a simplified PyTorch implementation of the model proposed by Severyn and Moschitti (2015). The model was chosen because of its simplicity, because it is quick to train, which supports fast iteration of experiments, and because it has been reimplemented with similar effectiveness additional times (Rao et al., 2017). Figure 1 shows a diagram of the model, which adopts a "Siamese" structure with two sub-networks to process the question and candidate sentence.

We emphasize that this model was only selected to serve as an exemplar; the effects that are observed in relative performance will also be present in other models. Indeed, because of the simplicity of this model, it is likely that the environmental effects described will have a more substantial impact on the effectiveness of more complicated models.

2.3 Datasets

The experiments reported in this paper are all performed against the TrecQA dataset that was first released by Wang et al. (2007) and further elaborated upon by Yao et al. (2013), as well as the WikiQA dataset released by Yang et al. (2015). Both datasets consist of pre-determined training, development and test sets. For each question, each candidate answer is labelled positive if it contains the answer to the question, otherwise negative. The ratios of these labels and the sizes of the splits for both datasets are shown in Table 1.

                                    Answers
Dataset  Split        Questions  Positive  Negative
TrecQA   Train            1,229     6,403    47,014
         Development         82       222       926
         Test               100       284     1,233
         Total            1,411     6,906    49,173
WikiQA   Train              873     1,040     7,632
         Development        126       140       990
         Test               243       293     2,058
         Total            1,242     1,473    10,680

Table 1: Dataset summaries.

The TrecQA dataset has further diverged into two versions, named RAW and CLEAN. The CLEAN version has removed those questions that had no positive labelled answers. Results on these two variants are not directly comparable to each other (Rao et al., 2017), and experiments in this paper are performed against the RAW variant. Similar manipulation of the WikiQA dataset has also been performed, although no analysis of the comparability of the results has been conducted.

3 Weak Baselines

As observed by Armstrong et al. (2009) in the information retrieval field, the use of weak baselines is a factor that should be considered when discussing results, as using a weak baseline shows a greater improvement than could otherwise be claimed. Table 2 shows the state-of-the-art results in answer selection on the TrecQA dataset, as replicated from the ACL Wiki; Table 3 likewise shows the (potentially incomplete) state-of-the-art results on the WikiQA dataset, sourced by inspection of relevant papers. Table 2 contains an additional row that presents a simple baseline, the sum of IDF weights for terms in both the question and candidate sentence, that performs no learning of any sort.

Model                            AP     RR     ∆AP     ∆RR
RAW DATASET
Punyakanok et al. (2004)         0.419  0.494
Cui et al. (2005)                0.427  0.526   0.008   0.032
Wang et al. (2007)               0.603  0.685   0.176   0.159
Heilman and Smith (2010)         0.609  0.692   0.006   0.007
Wang and Manning (2010)          0.595  0.695  -0.014   0.003
Yao et al. (2013)                0.631  0.748   0.022   0.053
Severyn and Moschitti (2013)     0.678  0.736   0.047  -0.012
Shnarch (2013)                   0.686  0.754   0.008   0.006
IDF-Weighted Sum                 0.701  0.769
Yih et al. (2013)                0.709  0.770   0.023   0.016
Yu et al. (2014)                 0.711  0.785   0.002   0.015
Wang and Nyberg (2015)           0.713  0.792   0.002   0.007
Feng et al. (2015)               0.711  0.800  -0.002   0.008
Severyn and Moschitti (2015)     0.746  0.808   0.033   0.008
Yang et al. (2016)               0.750  0.811   0.004   0.003
He et al. (2015)                 0.762  0.830   0.012   0.019
He and Lin (2016)                0.758  0.822  -0.004  -0.008
Rao et al. (2016)                0.780  0.834   0.018   0.004
Chen et al. (2017b)              0.782  0.837   0.002   0.003
CLEAN DATASET
Wang and Ittycheriah (2015)      0.746  0.820
Tan et al. (2015)                0.728  0.832  -0.018   0.012
dos Santos et al. (2016)         0.753  0.851   0.007   0.019
Wang et al. (2016b)              0.771  0.845   0.018  -0.006
He et al. (2015)                 0.777  0.836   0.006  -0.015
He and Lin (2016)                0.801  0.877   0.024   0.026
(?)                              0.802  0.875   0.001  -0.002

Table 2: State-of-the-art (replicated from ACL Wiki (ACL, 2017)) results on the TrecQA dataset versions, annotated with improvement over prior state-of-the-art results and a simple baseline.

\[ S_{q,a} = \sum_{t \in q \cap a} \log \frac{|D|}{|\{d \in D : t \in d\}|} \tag{1} \]

Equation 1 shows the function that produces these results, where D is the document collection, d is a document from this collection, q is the query, a is the candidate sentence, and t is a term. This calculation is done after removal of stopwords. [Footnote 3: The stopword list contains 127 English words, sourced from the nltk Python library.] This baseline outperforms a number of the older state-of-the-art methods. Therefore these older results, and some results afterwards, are comparing against a weak baseline. In fairness, this baseline was first reported in the literature by Yih et al. (2013), but is
a result that is frequently overlooked in the literature that followed. However, we also note that the results for this simple baseline differ between our reported value and that of Yih et al. (2013), which is substantially lower: 0.6531 AP, 0.7071 RR. For these reasons we repeat this result here.

Model                        AP      RR      ∆AP     ∆RR
Yu et al. (2014)             0.6190  0.6281
Yang et al. (2015)           0.6520  0.6652  0.0330  0.0371
dos Santos et al. (2016)     0.6886  0.6957  0.0366  0.0305
Miao et al. (2016)           0.6886  0.7069  0.0000  0.0112
Yin et al. (2016)            0.6921  0.7108  0.0035  0.0039
Rao et al. (2016)            0.701   0.718   0.0080  0.0072
Wang et al. (2016b)          0.7058  0.7226  0.0048  0.0046
He and Lin (2016)            0.7090  0.7234  0.0032  0.0008
Yin and Schütze (2017)       0.7124  0.7237  0.0034  0.0003
Chen et al. (2017a)          0.7212  0.7312  0.0088  0.0075
Wang et al. (2016a)          0.7341  0.7418  0.0129  0.0106
Wang and Jiang (2016)        0.7433  0.7545  0.0092  0.0127

Table 3: State-of-the-art (gathered by manual inspection) results on the WikiQA dataset, annotated with improvement over prior state-of-the-art results.

4 Confounding Variables

In this section we document a number of confounding variables that often go unreported in the literature, and can have a substantial effect on whether a result would be considered state-of-the-art or not, and on the reproducibility of that result. These range from controllable factors to factors that are not controllable but need to be reported. To aid discussion, the state-of-the-art tables have been recreated and annotated with the change in AP and RR over the then state-of-the-art result (Table 2 and Table 3).

Unless otherwise stated, all experiments are performed using Docker containers that are derived from common, shared, base images. This substantially eases the fixing of all environment and versioning issues that can be observed when running under a native environment. All the data that is required to reproduce the results in this paper is publicly available, including Docker images, scripts to create and use those images, and the resulting pre-trained model files. [Footnote 4: https://github.com/snapbug/questionable-qa]

4.1 Software Versions

There are several layers of software at which the version being used can substantially impact the end results. These are the model definition, the framework software, and the libraries that the framework uses.

4.1.1 Model Definition

We refer to the code that is used to define the model and to run the experiments as the model definition. These are changing artifacts, and when the software is made available to researchers it must be accompanied by the version of that software being used. A cursory glance at the commit history of some of these repositories shows a non-zero number of bug-fixing commits. Because of the nature of deep learning, these bugs may actually improve effectiveness, as anecdotal evidence suggests. [Footnote 5: https://twitter.com/soumithchintala/status/910339781019791360] Whether the commits fix bugs or add features, the models being compared are inherently different and can result in different outcomes.

               TrecQA              WikiQA
Version        AP       RR        AP       RR
cf0e269        0.7495   0.8122    0.6732   0.6953
1f894ba
171fee4        0.7495   0.8122    0.6732   0.6953
715502b        0.7495   0.8122    0.6732   0.6953
d99990b        0.7495   0.8122    0.6732   0.6953
70d7a03*       0.7495   0.8122    0.6732   0.6953
6d9d98f*+      0.7587   0.8225    0.6858   0.7065
5ef19a9*+      0.6741‡  0.7519‡   0.5374‡  0.5422‡
196f0aa*+      0.6742‡  0.7519‡   0.5376‡  0.5424‡
95ea349*+      0.6713‡  0.7409†   0.5543‡  0.5579‡

Table 4: Effect of the version of the model being used on model results. Only versions that modified the py files are included. A * indicates that the model at that change-set does not run under the created Docker environment, and results are taken from a native host, and a + indicates that the results from this version are themselves not reproducible, changing between runs. Version 1f894ba does not complete due to a bug. A ‡ indicates that the result was statistically significantly different at the p<0.01 level, and † at the p<0.05 level, compared to cf0e269.

Authors should specify which version of the code is being run to obtain the results presented. Table 4 shows the effect
of changing the version of the code on the model's effectiveness. As can be seen, there is a considerable shift in results, and until authors specify the exact version of the model that is used for experimentation their results are non-reproducible. The version used for all further experiments in this paper is cf0e269.

4.1.2 Framework Version

Specifying the framework that is being used is an important first step. Different framework versions could give different results for the same model code. To illustrate this we ran the sample model over a range of different versions of PyTorch.

               TrecQA              WikiQA
PyTorch        AP       RR        AP       RR
0.2.0          0.7234†  0.7866    0.6773   0.6980
0.1.12         0.7495   0.8122    0.6732   0.6953
0.1.11         0.7495   0.8122    0.6732   0.6953
0.1.10         0.7495   0.8122    0.6732   0.6953
0.1.9          0.7495   0.8122    0.6732   0.6953

Table 5: Effect of the version of PyTorch being used on model results. Version 0.1.8 and earlier would not run the sample model due to API changes. A † indicates that the results are statistically significantly different to 0.1.12 at the p<0.05 level.

Table 5 shows the impact of changing the version of PyTorch used in the training of the sample model. It shows that newer (0.2.0) is not necessarily better, although this depends on the dataset. The version used throughout the rest of the paper is 0.1.12, as this is the version that was used in prior work for this model. Version 0.1.8 and earlier would not run the sample model due to the use of features introduced in 0.1.9. The results are stable for 0.1.x versions across datasets. One possible cause could be that the underlying libraries PyTorch relies on were pinned to specific versions across PyTorch versions. Alternatively, the model code may not be using features of PyTorch that were changing across these versions.

4.1.3 Framework Dependencies

While fixing the framework version is a good step, these frameworks often themselves rely on other libraries. Of particular interest to the neural network community is the math library that underpins all the matrix and vector operations. By default PyTorch installs a version of the library that is linked against Intel's Math Kernel Library (MKL). When running the sample model on different hardware, we observe that the effectiveness of the model changes.

Library/Platform                  AP      RR
TrecQA
Intel MKL on Intel i7-6800K       0.7495  0.8122
Intel MKL on AMD FX-8370E         0.7487  0.8136
OpenBLAS on either                0.7307  0.8029
WikiQA
Intel MKL on Intel i7-6800K       0.6732  0.6953
Intel MKL on AMD FX-8370E         0.6772  0.6981
OpenBLAS on either                0.6773  0.6980

Table 6: Effect of changing math library and architecture on model results, using PyTorch 0.1.12. None of the results are statistically significantly different to the i7-6800K.

Table 6 shows the results of running the MKL-backed version on Intel and AMD hardware, compared to an OpenBLAS-backed version, which results in the same answers regardless of hardware. Intel also notes that the results of the same floating point calculation may be different across their own hardware. [Footnote 6: http://intel.ly/1b8Qrq6] It should not surprise the reader that Intel's math library gives different results on different architectures; after all, Intel knows with great detail the architecture of Intel chipsets and is not necessarily inclined to produce optimal code for competing platforms. The sensitivity of the network to the backing math library is dependent on the dataset. This difference in effectiveness is likely due to the relative non-convexity of the optimization surface for the two datasets, where the TrecQA surface has a large number of local minima.

Changing the library, or even the backend within the same library, in which the model is implemented can substantially change the effectiveness of the model. For example, Simon (2017) observed a 16% increase of test accuracy for the same model (from 0.5438 to 0.6197) by changing the computation backend from Tensorflow to MXNet.
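The software and hardware details discussed in Section 4.1 can be captured mechanically at training time. The following is a minimal sketch, not part of the released model code: the function name describe_environment and the default package list are hypothetical, and only the Python standard library is assumed. It records installed package versions and basic platform details so that they can be reported alongside results. It does not detect which math library a framework is linked against, which would still need to be noted separately.

```python
import json
import platform
import sys
from importlib import metadata


def describe_environment(packages=("torch", "numpy")):
    """Collect version and platform details to report alongside results.

    The package names are illustrative defaults (an assumption); pass the
    distributions that your own experiment actually depends on.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the absence explicitly instead of failing.
            versions[name] = None
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "processor": platform.processor(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Print a JSON record that can be stored next to the trained model.
    print(json.dumps(describe_environment(), indent=2, sort_keys=True))
```

Storing this record next to each trained model makes it possible to check later whether two results were in fact obtained under the same framework version and on the same architecture.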
               TrecQA              WikiQA
Threads        AP       RR        AP       RR
1              0.7495   0.8122    0.6732   0.6953
2              0.7485   0.8145    0.6802   0.7022
3              0.7495   0.8122    0.6732   0.6953
4              0.7477   0.8096    0.6771   0.6983
5              0.7495   0.8122    0.6732   0.6953
6              0.7489   0.8162    0.6778   0.6992

Table 7: Effect of number of threads on model results using MKL-backed PyTorch v0.1.12 on an Intel i7-6800K processor. None of the results are statistically significantly different to a single thread.

4.2 Threads

Threading introduces a number of possibilities for non-reproducible results, as results from threads can be returned in differing orders. This matters because floating point arithmetic is non-associative, so the order in which partial results are combined can change the outcome; however, these effects can be controlled by using the appropriate functions and settings in the library. Training the sample model repeatedly achieves the same results, suggesting that these settings are being utilized inside the PyTorch library, although we implore readers to verify this for their library of choice.

However, while threading itself does not impact the results within PyTorch, the number of threads used does. Other than by never varying the number of threads used, this effect cannot be controlled for. The reason for this is related to the non-associativity of floating point maths. For example, given the mathematical relations a+b=e and c+d=f, the floating point specification does not ensure that the mathematical equality a+b+c+d=e+f holds. A result calculated on two threads may perform the e+f calculation, while on four threads the a+b+c+d calculation may be performed, resulting in potential differences.

For these experiments we use PyTorch v0.1.12 with Intel's MKL library on an Intel i7-6800K processor. Using the OMP_NUM_THREADS and MKL_NUM_THREADS environment variables, as well as the set_num_threads function in PyTorch, we can control the number of threads used in training. We range this from 1–6 on our machine, as this is the number of hardware cores on the CPU, and therefore the maximum number of threads that OpenMP will spawn. Table 7 shows the results of this experiment. Interestingly, the results are consistent within datasets when using an odd number of threads, although this is most likely coincidental. The range of differences is small, but is, again, larger than some of the incremental improvements reported in the literature.

The exact environment variables, or code settings, that need to be modified will depend on the framework being used. There is no solution to this given the non-associative nature of floating-point arithmetic and the splitting of workload among differing numbers of threads. The only recommendation is that authors report the number of threads used for training, although we do suggest a smaller number to err on the side of caution, as OpenMP will not create more threads than there are hardware cores.

4.3 GPU Computation

The variation of GPUs available for deep learning research is arguably larger than that of CPUs. There are many models, and each manufacturer is free to deviate from the reference models provided by nVidia or AMD, although it is unclear just how many choose to do so. There are also more uncontrollable factors; for instance, the number of threads that are used by the GPU is uncontrollable, meaning that results are unlikely to be the same across different GPUs, unlike CPU training.

Table 8 shows the results of enabling GPU computation on the sample model. We report on both enabling the cuDNN backend, as this is the default, as well as disabling it globally. The cuDNN backend is known to contain some non-deterministic kernels. [Footnote 7: http://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/#reproducibility] In addition, nVidia provides a white-paper that describes some of the implementation details and compliance issues of the IEEE 754 floating point specification and their impact on nVidia GPUs (Whitehead and Fit-Florea, 2011). The paper also presents examples where compiling for a 32-bit x86 architecture and a 64-bit x86-64 architecture can yield different results.

In addition to running the experiment on our own GPU, an Asus branded nVidia GeForce 1080 GTX (revision a1), we also repeated the experiment on
an Amazon EC2 p2.xlarge instance. This instance comes equipped with a single nVidia Tesla K80 GPU. Other instances come equipped with multiple GPUs, but as the model is both small and does not take advantage of multiple GPUs, experiments were not performed on these instances. Of note is that the presence or absence of the cuDNN library has no effect on the K80, but does on the 1080 GTX GPU. We suspect that the reason for this is that the K80 is designed as a compute card, while the 1080 GTX is primarily designed for graphics processing. This design difference could manifest itself in different instructions that can be taken advantage of by cuDNN kernels.

                                            TrecQA            WikiQA
Computation  Hardware                       AP      RR       AP      RR
CPU          Intel i7-6800K                 0.7495  0.8122   0.6732  0.6953
GPU          GeForce 1080 GTX cuDNN         0.7277  0.7788   0.6604  0.6804
             GeForce 1080 GTX               0.7474  0.8044   0.6873  0.7054
             Tesla K80 cuDNN                0.7527  0.8115   0.6852  0.7046
             Tesla K80                      0.7527  0.8115   0.6852  0.7046

Table 8: Effect of the computation hardware on model results. None of the results are statistically significantly different when compared to the Intel i7-6800K.

Even with just two GPUs and using the cuDNN backend, there is already evidence that the performance of the network depends on both the dataset and the underlying hardware. There is a clear dependence on the dataset for the relative performance. Further GPU results are reported on the GeForce 1080 GTX with cuDNN disabled. Although this is not the default, it maximizes reproducibility by avoiding non-reproducible kernels.

4.4 Random Seed

Perhaps the most obvious feature of machine learning that can impact the effectiveness is the random seed. Thus far the experiments in this paper have used a fixed seed, and like most prior research this was only implied rather than explicitly stated. The seed in question was 1234.

Randomness is a crucial part of machine learning and values from the random generator are widely used. For example, random values are used for the initial values of weights, for selecting which nodes to drop in drop-out layers, and for setting the embeddings of terms that have no associated embeddings. As Goldberg (2017, Section 5.3.2 p59) rightly notes: "When debugging, and for reproducibility of results [emphasis added], it is advised to used a fixed random seed." Figure 2 shows the variance in AP and RR when specifying different seeds, for 200 randomly chosen seeds (selected using the bash (version 4.3.48(1)) RANDOM built-in, itself initialized/seeded to 1234 prior to performing runs).

Noting the generator of the random numbers is important, as different languages and libraries may use different generators. Most languages default to a pseudo-random generator for performance reasons, which carries the additional benefit that sequences can be reconstructed from a given start state, commonly referred to as a seed. For example, the bash version used to generate the seeds for Figure 2 uses a linear congruential generator (Park and Miller, 1988). A more commonly used generator is MT19937, a Mersenne Twister based on the Mersenne prime 2^19937 − 1, the standard implementation of which uses a 32-bit word length. Another implementation, MT19937-64, uses a 64-bit word length and generates a different sequence. To specify the generator used it is often enough to specify the language version and platform being used. PyTorch and dependent libraries use the aforementioned MT19937 generator.

The spread of results shows that the results are either marginally worse than prior work (which would likely mean the result from this model would not be
published), or better than work that was reported on afterward (meaning these latter results may not have been published). Significance testing across these models was not performed.

[Figure 2: Variance in AP and RR due to the random seed specified when training on both the CPU and GPU. Panels: CPU WikiQA, GPU WikiQA, CPU TrecQA, GPU TrecQA; axes: AP and RR; legend: Seeds, State-of-the-art, S&M. Version cf0e269 of the model was used, with version 0.1.12 of PyTorch, and the Intel MKL math library. CPU training was performed on an Intel i7-6800K processor using one thread, and GPU training on an Asus branded nVidia 1080 GTX (revision a1) with CUDA version 8.0.61. The 200 seeds were selected by the bash (version 4.3.48(1)) RANDOM builtin, itself initialized to 1234. The seeds selected were identical for all training conditions. The colours and shapes represent whether the model is the sample model (blue circles), the S&M model that is being reimplemented (red square), or another state-of-the-art result (yellow triangles). The TrecQA dataset is shown on the top, and the WikiQA dataset on the bottom.]

Table 9 shows the agreements, calculated using Kendall's τ and Spearman's ρ, on rankings for each of the datasets, comparing the two metrics and the two computational backends used; all results shown are statistically significant at the p<0.01 level. On the TrecQA dataset, these variations in AP and RR show only moderate agreement in rankings of the model when the same training computational backend is used, and only weak agreement across computation backends. For the TrecQA dataset, the CPU results covered a range of 0.0393 for AP and 0.0599 for RR, while the GPU results covered 0.0379 AP and 0.0492 RR.

The WikiQA dataset exhibits stronger agreement about model rankings on the same computation backend, but similarly weak agreement when comparing across computational backends. The range of AP and RR values on this dataset is even larger, covering 0.0712 AP and 0.0727 RR on the GPU, and 0.0705 AP and 0.0755 RR on the CPU.

These ranges in AP and RR values are greater than a large proportion of incremental improvements reported in prior answer selection research (see Table 2 and Table 3), and indeed are an order of magnitude larger than a typically reported improvement in either metric on the WikiQA dataset. In these cases the model was trained to target AP, another setting that is not commonly reported. While some software for models made available specifies a seed, this detail is often omitted from the paper, making replication-from-paper efforts nigh on impossible.

TrecQA
KENDALL'S τ    RR CPU   AP GPU   RR GPU
AP CPU         0.5514   0.2871   0.2069
RR CPU                  0.2148   0.2894
AP GPU                           0.5315
SPEARMAN'S ρ   RR CPU   AP GPU   RR GPU
AP CPU         0.7409   0.4125   0.3304
RR CPU                  0.3126   0.4205
AP GPU                           0.7171

WikiQA
KENDALL'S τ    RR CPU   AP GPU   RR GPU
AP CPU         0.8842   0.3238   0.3358
RR CPU                  0.3096   0.3330
AP GPU                           0.9068
SPEARMAN'S ρ   RR CPU   AP GPU   RR GPU
AP CPU         0.9783   0.4622   0.4762
RR CPU                  0.4392   0.4690
AP GPU                           0.9868

Table 9: Kendall's τ and Spearman's ρ based on the rankings of model effectiveness on different seeds across metrics and training computation backend. All values are statistically significant at the p<0.01 level.

Reagen et al. (2017, Chapter 4) discuss this variance in results from seeding, calling it Iso-Training Noise. They use this concept to frame discussion over whether optimizations, such as using fixed point arithmetic over floating point, are safe to perform. They define an optimization as safe if the results are within one standard deviation of the mean of the results observed from multiple seeded runs.

We suggest that specifying the random seed used in training is the bare minimum, necessary step that should be taken, although given the potential for different pseudo-random generators, and differences in implementation, this may not be enough. Indeed, the best approach is to stop reporting single-value results, and instead report the distribution of results from a range of seeds. Doing so allows for a fairer comparison across models, by discarding potential comparisons of lucky and unlucky seeds. In addition, these result populations can be statistically compared for significance, allowing for stronger claims on improvement.

4.5 Interactions

Thus far this paper has presented a number of effects that can affect the results of a neural network. Each of these has been presented in isolation, after fixing the prior effects. These effects clearly have potential for interaction and the interaction is unpredictable. In this section we briefly examine one of these interactions, namely the seed selection combined with either CPU or GPU training.

The results presented in Table 8 show that for a given seed the models exhibit different effectiveness based on the hardware used for training. In Section 4.4 it was shown that the seed has a significant impact on the relative effectiveness of the model regardless of this computational backend. The correlation coefficients across devices presented in Table 9 lead us to suspect that there can be substantial changes in effectiveness when switching the backend from CPU to GPU and vice versa.

Figure 3 shows the effect of changing from CPU training to GPU training, using the same 200 seeds that were used in Section 4.4. The relationship observed in Figure 2 between AP and RR is still present, but there is no telling, given a fixed seed, whether training on GPU or CPU would result in better effectiveness. In addition, these deltas can be larger than a substantial number of incremental improvements reported. For example, a middling result on the CPU may be transformed to either a top or bottom result if switching to GPU training, with everything else fixed.

[Figure 3: Change in AP and RR when switching from training the model on CPU to training on GPU. Axes: ∆AP and ∆RR. Each dot represents a training run with the same seed provided to each of the training processes. The TrecQA dataset is shown on the left, and WikiQA on the right.]

By reporting results as single numbers the variation due to the hardware on which the training is performed is hidden, and this could lead authors to conclude that their model is a substantial improvement on state-of-the-art. The changes in AP and RR that are observed are representative of even the larger improvements in state-of-the-art. However, when comparing the distributions of the scores across the backends by visual inspection of Figure 2 there is clearly not any difference in the populations. Statistical significance testing (p≫0.05 in a paired t-test, both two- and single-tailed) bears out this intuition. Using these population based results would then lead authors to a different conclusion than if the seed was "lucky" for the training hardware. This is a concrete example of the differences between reporting result distributions compared with single values.

4.6 Reporting Rounding

The final aspect of result reporting that can be controlled for is the rounding of results. For example, when using the default install options of the sample model, and fixing the other versions and settings, our sample model gives two observed separate results with CPU training: on the TrecQA dataset either an AP of 0.7485 or 0.7487 is obtained. While this difference of 0.0002 is small, there is a newer trend (present in the latter three papers in Table 2) of reporting results to three decimal places. In this case even such a minor difference can decide whether a result is state-of-the-art or not, statistical significance notwithstanding. For example, a result of 0.7484 would round down, while 0.7486 would round up, overemphasizing the difference by a factor of 5. We concede, however, that the same argument can be applied regardless of which decimal cut-off is used, although we observe that trec_eval, the de facto tool used to calculate AP and RR, reports to four.

We recommend that reviewers be skeptical of such minor improvements on state-of-the-art when single results are reported. The recommendation here follows that of Section 4.4, in that ideally multiple seeds are used, and testing is performed on the population of results to determine improvement.

5 Conclusions

In this paper we have demonstrated a number of factors that are present during training of a model and affect the results of said model. These parameters, and their settings, often go unreported in the literature. The result is that a large amount of prior work in answer selection is inherently irreproducible. Furthermore, the differences in results illustrated by these effects can be much larger than the majority of improvements reported as gains in the literature.

The effects that we presented are not stand-alone effects. Interaction between effects also has an additional impact, one of which was discussed in Section 4.5. Other results presented in this paper do not consider this interaction. For example, Table 6 suggested that a model trained using OpenBLAS produces worse results for the TrecQA dataset than one trained using Intel's MKL library, which is true... for that version of the model code, for that version of the framework, for that random seed, when trained on a single thread on that CPU, for that dataset. We reserve investigating the interaction effects of these individual effects for future work.

It is simply no longer adequate to report a single value when evaluating results from neural networks, especially without the presence of statistical testing on those results. By far the largest source of variability in the experiments presented in this paper was when the network was seeded with different random starting points. The results produced cover ranges that can be an order of magnitude larger than typically reported improvements.

As well as repeating experiments for multiple seeds, the specifications of the hardware on which the experiments were performed should be reported alongside the results, as changing the hardware can change the results by an order of magnitude. Additionally, the number of threads and the math library used have an impact on the results and should be reported.

Finally, beyond the hardware effects, the software that is used to both run the model, and define the model, has an impact. For this reason both the model
definition and library versions, as well as all the required dependencies, should be pinned to a specified version. These issues are easily avoidable by the use of common packaging tools such as Docker, which also provides opportunities to fix most of the non-versioning environmental issues as well.

In cases where authors are unable to provide a Docker image, or equivalent, then making the trained models available is one alternative. Loading pre-trained models is an action that is supported by a number of frameworks. PyTorch, for example, provides functions to load a model from a URL. The pre-trained models appear to provide consistent results even when the inference pass is performed using settings that would have provided different results in training.

A sentence is all that it takes to describe the environment used for training. For example: “our model was written against PyTorch v0.1.12, and training was conducted on an Intel i7-6800K using a single thread and Intel’s Math Kernel Library”. Beyond this we implore reviewers to be wary of such minor reported improvements in the light of these issues.

Acknowledgements

The author wishes to acknowledge the input and advice of (in alphabetical order) Gaurav Baruah, Jimmy Lin, Adam Roegiest, Royal Sequiera, and Michael Tu. Finally thanks to the reviewers and editors for their comments and suggestions to improve the paper.

References

ACL. 2017. Question Answering (State of the art). https://aclweb.org/w/index.php?title=Question_Answering_(State_of_the_art). Accessed: 7 Sept. 2017.
Jaime Arguello, Matt Crane, Fernando Diaz, Jimmy Lin, and Andrew Trotman. 2016. Report on the SIGIR 2015 workshop on reproducibility, inexplicability, and generalizability of results (RIGOR). 49(2):107–116.
Timothy G. Armstrong, Alistair Moffat, William Webber, and Justin Zobel. 2009. Improvements that don’t add up: Ad-hoc retrieval results since 1998. In SIGIR, pages 601–610.
Monya Baker. 2016. 1,500 scientists lift the lid on reproducibility. Nature, 533(7604):452–454.
Qin Chen, Qinmin Hu, Jimmy Xiangji Huang, Liang He, and Weijie An. 2017a. Enhancing recurrent neural networks with positional attention for question answering. In SIGIR, pages 993–996.
Ruey-Cheng Chen, Evi Yulianti, Mark Sanderson, and W. Bruce Croft. 2017b. On the benefit of incorporating external features in a neural architecture for answer sentence selection. In SIGIR, pages 1017–1020.
Christian Collberg, Todd Proebsting, and Alex M. Warren. 2015. Repeatability and benefaction in computer systems research. University of Arizona TR 14.
Hang Cui, Renxu Sun, Keya Li, Min-Yan Kan, and Tat-Seng Chua. 2005. Question answering passage retrieval using dependency relations. In SIGIR, pages 400–407.
Cícero Nogueira dos Santos, Ming Tan, Bing Xiang, and Bowen Zhou. 2016. Attentive pooling networks. arXiv, abs/1602.03609v1.
Minwei Feng, Bing Xiang, Michael R. Glass, Lidan Wang, and Bowen Zhou. 2015. Applying deep learning to answer selection: A study and an open task. In ASRU, pages 813–820.
Yoav Goldberg. 2017. Neural Network Methods for Natural Language Processing. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.
Hua He and Jimmy Lin. 2016. Pairwise word interaction modeling with deep neural networks for semantic similarity measurement. In HLT-NAACL, pages 937–948.
Hua He, Kevin Gimpel, and Jimmy Lin. 2015. Multi-perspective sentence similarity modeling with convolutional neural networks. In EMNLP, pages 1576–1586.
Michael Heilman and Noah A. Smith. 2010. Tree edit models for recognizing textual entailments, paraphrases, and answers to questions. In HLT-NAACL, pages 1011–1019.
Peter Henderson, Riashat Islam, Phillip Bachman, Joelle Pineau, Doina Precup, and David Meger. 2017. Deep Reinforcement Learning that Matters. arXiv, abs/1709.06560v1.
Jimmy Lin, Matt Crane, Andrew Trotman, Jamie Callan, Ishan Chattopadhyaya, John Foley, Grant Ingersoll, Craig Macdonald, and Sebastiano Vigna. 2016. Toward reproducible baselines: The open-source IR reproducibility challenge. In ECIR, pages 408–420.
Yishu Miao, Lei Yu, and Phil Blunsom. 2016. Neural variational inference for text processing. In ICML, pages 1727–1736.
Gina Moraila, Akash Shankaran, Zuoming Shi, and Alex M. Warren. 2014. Measuring reproducibility in computer systems research. Technical report, University of Arizona.
Joakim Nivre. 2017. Challenges for ACL: ACL Presidential Address 2017. https://www.slideshare.net/aclanthology/joakim-nivre-2017-presidential-address-acl-2017-challenges-for-acl/. Accessed: 20 Sept. 2017.
Stephen K. Park and Keith W. Miller. 1988. Random number generators: good ones are hard to find. Communications of the ACM, 31(10):1192–1201.
Thomas Pfeiffer and Robert Hoffmann. 2009. Large-scale assessment of the effect of popularity on the reliability of research. PLOS One, 4(6):e5996.
Vasin Punyakanok, Dan Roth, and Wen-tau Yih. 2004. Mapping dependencies trees: An application to question answering. In Proceedings of AI&Math 2004, pages 1–10.
Jinfeng Rao, Hua He, and Jimmy Lin. 2016. Noise-contrastive estimation for answer selection with deep neural networks. In CIKM, pages 1913–1916.
Jinfeng Rao, Hua He, and Jimmy Lin. 2017. Experiments with convolutional neural network models for answer selection. In SIGIR, pages 1217–1220.
Brandon Reagen, Robert Adolf, Paul N. Whatmough, Gu-Yeon Wei, and David M. Brooks. 2017. Deep Learning for Computer Architects. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers.
Nils Reimers and Iryna Gurevych. 2017. Optimal hyperparameters for deep LSTM-networks for sequence labeling tasks. arXiv, abs/1707.06799v1.
Royal Sequiera, Gaurav Baruah, Zhucheng Tu, Salman Mohammed, Jinfeng Rao, Haotian Zhang, and Jimmy Lin. 2017. Exploring the effectiveness of convolutional neural networks for answer selection in end-to-end question answering. arXiv, abs/1707.07804v1.
Aliaksei Severyn and Alessandro Moschitti. 2013. Automatic feature engineering for answer selection and extraction. In EMNLP, volume 13, pages 458–467.
Aliaksei Severyn and Alessandro Moschitti. 2015. Learning to rank short text pairs with convolutional deep neural networks. In SIGIR, pages 373–382.
Eyal Shnarch. 2013. Probabilistic Models for Lexical Inference. Bar Ilan University.
Julien Simon. 2017. Keras shoot-out: TensorFlow vs MXNet. https://medium.com/@julsimon/keras-shoot-out-tensorflow-vs-mxnet-51ae2b30a9c0. Accessed: 5 Sept. 2017.
Ming Tan, Bing Xiang, and Bowen Zhou. 2015. LSTM-based deep learning models for non-factoid answer selection. arXiv, abs/1511.04108v4.
Zhiguo Wang and Abraham Ittycheriah. 2015. FAQ-based question answering via word alignment. arXiv, abs/1507.02628v1.
Shuohang Wang and Jing Jiang. 2016. A compare-aggregate model for matching text sequences. arXiv, abs/1611.01747v1.
Mengqiu Wang and Christopher D. Manning. 2010. Probabilistic tree-edit models with structured latent variables for textual entailment and question answering. In COLING, pages 1164–1172.
Di Wang and Eric Nyberg. 2015. A long short-term memory model for answer sentence selection in question answering. In ACL, pages 707–712.
Mengqiu Wang, Noah A. Smith, and Teruko Mitamura. 2007. What is the jeopardy model? A quasi-synchronous grammar for QA. In EMNLP-CoNLL, volume 7, pages 22–32.
Bingning Wang, Kang Liu, and Jun Zhao. 2016a. Inner attention based recurrent neural networks for answer selection. In ACL, pages 1288–1297.
Zhiguo Wang, Haitao Mi, and Abraham Ittycheriah. 2016b. Sentence similarity learning by lexical decomposition and composition. In COLING, pages 1340–1349.
Nathan Whitehead and Alex Fit-Florea. 2011. Precision & performance: Floating point and IEEE 754 compliance for nVidia GPUs. Accessed: 7 Sept. 2017, from https://developer.nvidia.com/sites/default/files/akamai/cuda/files/NVIDIA-CUDA-Floating-Point.pdf.
Yi Yang, Wen-tau Yih, and Christopher Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. In EMNLP, pages 2013–2018.
Liu Yang, Qingyao Ai, Jiafeng Guo, and W. Bruce Croft. 2016. aNMM: Ranking short answer texts with attention-based neural matching model. In CIKM, pages 287–296.
Xuchen Yao, Benjamin Van Durme, Chris Callison-Burch, and Peter Clark. 2013. Answer extraction as sequence tagging with tree edit distance. In HLT-NAACL, pages 858–867.
Scott Wen-tau Yih, Ming-Wei Chang, Chris Meek, and Andrzej Pastusiak. 2013. Question answering using enhanced lexical semantic models. In ACL, pages 1744–1753.
Wenpeng Yin and Hinrich Schütze. 2017. Task-specific attentive pooling of phrase alignments contributes to sentence matching. In EACL, pages 699–709.
Wenpeng Yin, Hinrich Schütze, Bing Xiang, and Bowen Zhou. 2016. ABCNN: Attention-based convolutional neural network for modeling sentence pairs. TACL, 4(1):259–272.
Lei Yu, Karl Moritz Hermann, Phil Blunsom, and Stephen Pulman. 2014. Deep learning for answer sentence selection. arXiv, abs/1412.1632v1.
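
The rounding effect described in Section 4.6 can be checked with a short Python sketch. The helper names `round_score` and `inflation_factor` are hypothetical and introduced only for illustration, and half-up rounding is an assumption about how reported figures are rounded. The two scores are the ones used as the example in the text (0.7484 and 0.7486).

```python
from decimal import Decimal, ROUND_HALF_UP


def round_score(score: str, places: int) -> Decimal:
    """Round a score (given as a decimal string) to `places` decimal places.

    Half-up rounding is an assumption made for this illustration.
    """
    quantum = Decimal(1).scaleb(-places)  # places=3 gives Decimal("0.001")
    return Decimal(score).quantize(quantum, rounding=ROUND_HALF_UP)


def inflation_factor(a: str, b: str, places: int) -> Decimal:
    """Ratio of the gap between two rounded scores to the gap before rounding."""
    true_gap = abs(Decimal(a) - Decimal(b))
    if true_gap == 0:
        raise ValueError("scores are identical; the factor is undefined")
    reported_gap = abs(round_score(a, places) - round_score(b, places))
    return reported_gap / true_gap


if __name__ == "__main__":
    low, high = "0.7484", "0.7486"
    # At three decimal places the scores become 0.748 and 0.749.
    print(round_score(low, 3), round_score(high, 3))
    # The reported gap (0.001) is five times the unrounded gap (0.0002).
    print(inflation_factor(low, high, 3))
    # At four decimal places (the precision trec_eval reports) nothing is inflated.
    print(inflation_factor(low, high, 4))
```

Scores are passed as strings and handled with `Decimal` so that binary floating-point representation does not blur the comparison. The same check can be run on any pair of reported scores before reading a third-decimal difference as an improvement.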