Transactions of the Association for Computational Linguistics, vol. 4, pp. 259–272, 2016. Action Editor: Brian Roark.
Submission batch: 12/2015; Revision batch: 3/2016; Published 6/2016.
© 2016 Association for Computational Linguistics. Distributed under a CC-BY 4.0 Licence.
ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs

Wenpeng Yin, Hinrich Schütze
Center for Information and Language Processing, LMU Munich, Germany
wenpeng@cis.lmu.de

Bing Xiang, Bowen Zhou
IBM Watson, Yorktown Heights, New York, USA
bingxia,zhou@us.ibm.com

Abstract

How to model a pair of sentences is a critical issue in many NLP tasks such as answer selection (AS), paraphrase identification (PI) and textual entailment (TE). Most prior work (i) deals with one individual task by fine-tuning a specific system; (ii) models each sentence's representation separately, rarely considering the impact of the other sentence; or (iii) relies fully on manually designed, task-specific linguistic features. This work presents a general Attention Based Convolutional Neural Network (ABCNN) for modeling a pair of sentences. We make three contributions. (i) The ABCNN can be applied to a wide variety of tasks that require modeling of sentence pairs. (ii) We propose three attention schemes that integrate mutual influence between sentences into CNNs; thus, the representation of each sentence takes into consideration its counterpart. These interdependent sentence pair representations are more powerful than isolated sentence representations. (iii) ABCNNs achieve state-of-the-art performance on AS, PI and TE tasks. We release code at: https://github.com/yinwenpeng/Answer_Selection.

1 Introduction

How to model a pair of sentences is a critical issue in many NLP tasks such as answer selection (AS) (Yu et al., 2014; Feng et al., 2015), paraphrase identification (PI) (Madnani et al., 2012; Yin and Schütze, 2015a), textual entailment (TE) (Marelli et al., 2014a; Bowman et al., 2015a) etc.

AS   s0:  how much did Waterboy gross?
     s+1: the movie earned $161.5 million
     s−1: this was Jerry Reed's final film appearance
PI   s0:  she struck a deal with RH to pen a book today
     s+1: she signed a contract with RH to write a book
     s−1: she denied today that she struck a deal with RH
TE   s0:  an ice skating rink placed outdoors is full of people
     s+1: a lot of people are in an ice skating park
     s−1: an ice skating rink placed indoors is full of people
Figure 1: Positive (s+1) and negative (s−1) examples for AS, PI and TE.
Feng et al. (2015) test various setups of a bi-CNN architecture on an insurance domain QA dataset. Tan et al. (2016) explore bidirectional LSTMs on the same dataset. Our approach is different because we do not model the sentences by two independent neural networks in parallel, but instead as an interdependent sentence pair, using attention.

For PI, Blacoe and Lapata (2012) form sentence representations by summing up word embeddings. Socher et al. (2011) use recursive autoencoders (RAEs) to model representations of local phrases in sentences, then pool similarity values of phrases from the two sentences as features for binary classification. Yin and Schütze (2015a) similarly replace an RAE with a CNN. In all three papers, the representation of one sentence is not influenced by the other – in contrast to our attention-based model.

For TE, Bowman et al. (2015b) use recursive neural networks to encode entailment on SICK (Marelli et al., 2014b). Rocktäschel et al. (2016) present an attention-based LSTM for the Stanford natural language inference corpus (Bowman et al., 2015a). Our system is the first CNN-based work on TE.

Some prior work aims to solve a general sentence matching problem. Hu et al. (2014) present two CNN architectures, ARC-I and ARC-II, for sentence matching. ARC-I focuses on sentence representation learning while ARC-II focuses on matching features on phrase level. Both systems were tested on PI, sentence completion (SC) and tweet-response matching. Yin and Schütze (2015b) propose the MultiGranCNN architecture to model general sentence matching based on phrase matching on multiple levels of granularity and get promising results for PI and SC. Wan et al. (2016) try to match two sentences in AS and SC by multiple sentence representations, each coming from the local representations of two LSTMs. Our work is the first one to investigate attention for the general sentence matching task.

Attention-Based DL in Non-NLP Domains. Even though there is little if any work on attention mechanisms in CNNs for NLP, attention-based CNNs have been used in computer vision for visual question answering (Chen et al., 2015), image classification (Xiao et al., 2015), caption generation (Xu et al., 2015), image segmentation (Hong et al., 2016) and object localization (Cao et al., 2015).

symbol      description
s, s0, s1   sentence or sentence length
v           word
w           filter width
di          dimensionality of input to layer i+1
W           weight matrix
Table 1: Notation

Mnih et al. (2014) apply attention in recurrent neural networks (RNNs) to extract information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution. Gregor et al. (2015) combine a spatial attention mechanism with RNNs for image generation. Ba et al. (2015) investigate attention-based RNNs for recognizing multiple objects in images. Chorowski et al. (2014) and Chorowski et al. (2015) use attention in RNNs for speech recognition.

Attention-Based DL in NLP. Attention-based DL systems have been applied to NLP after their success in computer vision and speech recognition. They mainly rely on RNNs and end-to-end encoder-decoders for tasks such as machine translation (Bahdanau et al., 2015; Luong et al., 2015) and text reconstruction (Li et al., 2015; Rush et al., 2015). Our work takes the lead in exploring attention mechanisms in CNNs for NLP tasks.

3 BCNN: Basic Bi-CNN

We now introduce our basic (non-attention) CNN that is based on the Siamese architecture (Bromley et al., 1993), i.e., it consists of two weight-sharing CNNs, each processing one of the two sentences, and a final layer that solves the sentence pair task. See Figure 2. We refer to this architecture as the BCNN. The next section will then introduce the ABCNN, an attention architecture that extends the BCNN. Table 1 gives our notational conventions.

In our implementation and also in the mathematical formalization of the model given below, we pad the two sentences to have the same length s = max(s0, s1). However, in the figures we show different lengths because this gives a better intuition of how the model works.

We now describe the BCNN's four types of layers: input, convolution, average pooling and output.

Input layer. In the example in the figure, the two input sentences have 5 and 7 words, respectively.
Figure 2: BCNN: ABCNN without Attention

Each word is represented as a d0-dimensional precomputed word2vec (Mikolov et al., 2013) embedding, d0 = 300. As a result, each sentence is represented as a feature map of dimension d0 × s.

Convolution layer. Let v_1, v_2, ..., v_s be the words of a sentence and c_i ∈ R^{w·d0}, 0 < i < s + w, the concatenated embeddings of v_{i−w+1}, ..., v_i, where embeddings of out-of-range words (i < 1 or i > s) are set to zero. We then generate the representation p_i ∈ R^{d1} for the phrase v_{i−w+1}, ..., v_i using the convolution weights W ∈ R^{d1×w·d0} as follows:

p_i = tanh(W · c_i + b)

where b ∈ R^{d1} is the bias.

Average pooling layer. Pooling (including min, max, average pooling) is commonly used to extract robust features from convolution. In this paper, we introduce attention weighting as an alternative, but use average pooling as a baseline as follows.

For the output feature map of the last convolution layer, we do column-wise averaging over all columns, denoted as all-ap. This generates a representation vector for each of the two sentences, shown as the top "Average pooling (all-ap)" layer below "Logistic regression" in Figure 2. These two vectors are the basis for the sentence pair decision.

For the output feature map of non-final convolution layers, we do column-wise averaging over windows of w consecutive columns, denoted as w-ap; shown as the lower "Average pooling (w-ap)" layer in Figure 2. For filter width w, a convolution layer transforms an input feature map of s columns into a new feature map of s + w − 1 columns; average pooling transforms this back to s columns. This architecture supports stacking an arbitrary number of convolution-pooling blocks to extract increasingly abstract features. Input features to the bottom layer are words, input features to the next layer are short phrases and so on. Each level generates more abstract features of higher granularity.

The last layer is an output layer, chosen according to the task; e.g., for binary classification tasks, this layer is logistic regression (see Figure 2). Other types of output layers are introduced below.

We found that in most cases, performance is boosted if we provide the output of all pooling layers as input to the output layer. For each non-final average pooling layer, we perform w-ap (pooling over windows of w columns) as described above, but we also perform all-ap (pooling over all columns) and forward the result to the output layer. This improves performance because representations from different layers cover the properties of the sentences at different levels of abstraction and all of these levels can be important for a particular sentence pair.
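To make the layer definitions above concrete, the following is a minimal NumPy sketch of one BCNN convolution-pooling block (wide convolution followed by w-ap, plus all-ap for the sentence-level vector). The helper names, the toy random embeddings and the random initialization of W and b are illustrative assumptions; in the paper the embeddings are 300-dimensional word2vec vectors and W, b are learned.

```python
import numpy as np

rng = np.random.default_rng(0)

def input_feature_map(words, emb, d0, s):
    """Input layer: a d0 x s feature map whose columns are word embeddings,
    zero-padded on the right up to the shared length s = max(s0, s1)."""
    F = np.zeros((d0, s))
    for i, word in enumerate(words):
        F[:, i] = emb[word]
    return F

def wide_convolution(F, W, b, w):
    """Convolution layer: an s-column map becomes an (s + w - 1)-column map.
    Column i is p_i = tanh(W c_i + b), where c_i concatenates the embeddings
    of v_{i-w+1}..v_i and out-of-range positions are zero."""
    d0, s = F.shape
    padded = np.hstack([np.zeros((d0, w - 1)), F, np.zeros((d0, w - 1))])
    cols = []
    for i in range(s + w - 1):
        c_i = padded[:, i:i + w].flatten(order="F")   # concatenated embeddings c_i
        cols.append(np.tanh(W @ c_i + b))
    return np.stack(cols, axis=1)                     # d1 x (s + w - 1)

def w_ap(F, w):
    """Average pooling over windows of w consecutive columns: back to s columns."""
    d, n = F.shape
    return np.stack([F[:, j:j + w].mean(axis=1) for j in range(n - w + 1)], axis=1)

def all_ap(F):
    """Column-wise averaging over all columns: one sentence representation vector."""
    return F.mean(axis=1)

# Toy usage with random vectors standing in for word2vec embeddings.
d0, d1, w = 300, 50, 3
sent0 = "how much did Waterboy gross ?".split()
sent1 = "the movie earned $161.5 million".split()
emb = {t: rng.normal(size=d0) for t in set(sent0) | set(sent1)}
s = max(len(sent0), len(sent1))
W, b = rng.normal(scale=0.1, size=(d1, w * d0)), np.zeros(d1)

F0 = input_feature_map(sent0, emb, d0, s)          # d0 x s
C0 = wide_convolution(F0, W, b, w)                 # d1 x (s + w - 1)
block_output, sent_vec = w_ap(C0, w), all_ap(C0)   # d1 x s, d1
```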
Figure 3: Three ABCNN architectures. (a) One block in ABCNN-1. (b) One block in ABCNN-2. (c) One block in ABCNN-3.
4 ABCNN: Attention-Based BCNN

We now describe three architectures based on the BCNN, the ABCNN-1, the ABCNN-2 and the ABCNN-3, that each introduces an attention mechanism for modeling sentence pairs; see Figure 3.

ABCNN-1. The ABCNN-1 (Figure 3(a)) employs an attention feature matrix A to influence convolution. Attention features are intended to weight those units of s_i more highly in convolution that are relevant to a unit of s_{1−i} (i ∈ {0, 1}); we use the term "unit" here to refer to words on the lowest level and to phrases on higher levels of the network. Figure 3(a) shows two unit representation feature maps in red: this part of the ABCNN-1 is the same as in the BCNN (see Figure 2). Each column is the representation of a unit, a word on the lowest level and a phrase on higher levels.

We first describe the attention feature matrix A informally (layer "Conv input", middle column, in Figure 3(a)). A is generated by matching units of the left representation feature map with units of the right representation feature map such that the attention values of row i in A denote the attention distribution of the i-th unit of s0 with respect to s1, and the attention values of column j in A denote the attention distribution of the j-th unit of s1 with respect to s0. A can be viewed as a new feature map of s0 (resp. s1) in row (resp. column) direction because each row (resp. column) is a new feature vector of a unit in s0 (resp. s1). Thus, it makes sense to combine this new feature map with the representation feature maps and use both as input to the convolution operation. We achieve this by transforming A into the two blue matrices in Figure 3(a) that have the same format as the representation feature maps. As a result, the new input of convolution has two feature maps for each sentence (shown in red and blue). Our motivation is that the attention feature map will guide the convolution to learn "counterpart-biased" sentence representations.

More formally, let F_{i,r} ∈ R^{d×s} be the representation feature map of sentence i (i ∈ {0, 1}). Then we define the attention matrix A ∈ R^{s×s} as follows:

A_{i,j} = match-score(F_{0,r}[:, i], F_{1,r}[:, j])    (1)

The function match-score can be defined in a variety of ways. We found that 1/(1 + |x − y|) works well where |·| is Euclidean distance.

Given attention matrix A, we generate the attention feature map F_{i,a} for s_i as follows:

F_{0,a} = W_0 · A^T,  F_{1,a} = W_1 · A

The weight matrices W_0 ∈ R^{d×s}, W_1 ∈ R^{d×s} are parameters of the model to be learned in training. [Footnote 1: The weights of the two matrices are shared in our implementation to reduce the number of parameters of the model.]

We stack the representation feature map F_{i,r} and the attention feature map F_{i,a} as an order 3 tensor and feed it into convolution to generate a higher-level representation feature map for s_i (i ∈ {0, 1}). In Figure 3(a), s0 has 5 units, s1 has 7. The output of convolution (shown in the top layer, filter width w = 3) is a higher-level representation feature map with 7 columns for s0 and 9 columns for s1.
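A small NumPy sketch of the ABCNN-1 attention computation under the definitions above. W0 and W1 are drawn randomly here purely for illustration (in the model they are learned, and shared in the authors' implementation), and the stacking into an order-3 tensor is shown with np.stack; the toy shapes are assumptions.

```python
import numpy as np

def match_score(x, y):
    """match-score(x, y) = 1 / (1 + |x - y|) with |.| the Euclidean distance."""
    return 1.0 / (1.0 + np.linalg.norm(x - y))

def attention_matrix(F0_r, F1_r):
    """Equation (1): A[i, j] compares unit i of s0 with unit j of s1."""
    s0, s1 = F0_r.shape[1], F1_r.shape[1]
    A = np.zeros((s0, s1))
    for i in range(s0):
        for j in range(s1):
            A[i, j] = match_score(F0_r[:, i], F1_r[:, j])
    return A

def abcnn1_convolution_input(F0_r, F1_r, W0, W1):
    """ABCNN-1: attention feature maps F_{0,a} = W0 A^T and F_{1,a} = W1 A,
    each stacked with the representation feature map as a two-channel input."""
    A = attention_matrix(F0_r, F1_r)
    F0_a, F1_a = W0 @ A.T, W1 @ A
    return np.stack([F0_r, F0_a]), np.stack([F1_r, F1_a])   # 2 x d x s each

# Toy usage: d-dimensional unit representations, both sentences padded to s columns.
rng = np.random.default_rng(1)
d, s = 50, 7
F0_r, F1_r = rng.normal(size=(d, s)), rng.normal(size=(d, s))
W0 = rng.normal(scale=0.1, size=(d, s))
W1 = W0                     # weight sharing, as in the paper's implementation
X0, X1 = abcnn1_convolution_input(F0_r, F1_r, W0, W1)
```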
ABCNN-2. The ABCNN-1 computes attention weights directly on the input representation with the aim of improving the features computed by convolution. The ABCNN-2 (Figure 3(b)) instead computes attention weights on the output of convolution with the aim of reweighting this convolution output. In the example shown in Figure 3(b), the feature maps output by convolution for s0 and s1 (layer marked "Convolution" in Figure 3(b)) have 7 and 9 columns, respectively; each column is the representation of a unit. The attention matrix A compares all units in s0 with all units of s1. We sum all attention values for a unit to derive a single attention weight for that unit. This corresponds to summing all values in a row of A for s0 ("col-wise sum", resulting in the column vector of size 7 shown) and summing all values in a column for s1 ("row-wise sum", resulting in the row vector of size 9 shown).

More formally, let A ∈ R^{s×s} be the attention matrix, a_{0,j} = Σ A[j, :] the attention weight of unit j in s0, a_{1,j} = Σ A[:, j] the attention weight of unit j in s1 and F^c_{i,r} ∈ R^{d×(s_i+w−1)} the output of convolution for s_i. Then the j-th column of the new feature map F^p_{i,r} generated by w-ap is derived by:

F^p_{i,r}[:, j] = Σ_{k=j:j+w} a_{i,k} · F^c_{i,r}[:, k],   j = 1 ... s_i

Note that F^p_{i,r} ∈ R^{d×s_i}, i.e., ABCNN-2 pooling generates an output feature map of the same size as the input feature map of convolution. This allows us to stack multiple convolution-pooling blocks to extract features of increasing abstraction.
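And a corresponding sketch of ABCNN-2 attention-based average pooling applied to the convolution outputs; the vectorized attention-matrix computation is just a compact equivalent of Equation (1), and the toy shapes below are assumptions for illustration.

```python
import numpy as np

def abcnn2_pooling(Fc0, Fc1, w):
    """ABCNN-2: attention-weighted w-ap over convolution outputs Fc_i (d x (s_i + w - 1)).
    The j-th output column is sum_{k=j..j+w-1} a_k * Fc[:, k], giving back s_i columns."""
    # A[i, j] = 1 / (1 + ||Fc0[:, i] - Fc1[:, j]||): vectorized match-score of Eq. (1)
    A = 1.0 / (1.0 + np.linalg.norm(Fc0[:, :, None] - Fc1[:, None, :], axis=0))
    # a0[j]: sum of row j of A (weight of unit j in s0);
    # a1[j]: sum of column j of A (weight of unit j in s1)
    a0, a1 = A.sum(axis=1), A.sum(axis=0)

    def pool(Fc, a):
        d, n = Fc.shape                       # n = s_i + w - 1
        return np.stack([Fc[:, j:j + w] @ a[j:j + w] for j in range(n - w + 1)], axis=1)

    return pool(Fc0, a0), pool(Fc1, a1)       # d x s0 and d x s1

# Toy usage: convolution outputs with 7 and 9 columns (s0 = 5, s1 = 7, w = 3).
rng = np.random.default_rng(2)
d, w = 50, 3
Fc0, Fc1 = rng.normal(size=(d, 7)), rng.normal(size=(d, 9))
Fp0, Fp1 = abcnn2_pooling(Fc0, Fc1, w)        # shapes (50, 5) and (50, 7)
```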
There are three main differences between the ABCNN-1 and the ABCNN-2. (i) Attention in the ABCNN-1 impacts convolution indirectly while attention in the ABCNN-2 influences pooling through direct attention weighting. (ii) The ABCNN-1 requires the two matrices W_i to convert the attention matrix into attention feature maps; and the input to convolution has two times as many feature maps. Thus, the ABCNN-1 has more parameters than the ABCNN-2 and is more vulnerable to overfitting. (iii) As pooling is performed after convolution, pooling handles larger-granularity units than convolution; e.g., if the input to convolution has word level granularity, then the input to pooling has phrase level granularity, the phrase size being equal to filter size w. Thus, the ABCNN-1 and the ABCNN-2 implement attention mechanisms for linguistic units of different granularity. The complementarity of the ABCNN-1 and the ABCNN-2 motivates us to propose the ABCNN-3, a third architecture that combines elements of the two.

ABCNN-3 (Figure 3(c)) combines the ABCNN-1 and the ABCNN-2 by stacking them; it combines the strengths of the ABCNN-1 and -2 by allowing the attention mechanism to operate (i) both on the convolution and on the pooling parts of a convolution-pooling block and (ii) both on the input granularity and on the more abstract output granularity.

5 Experiments

We test the proposed architectures on three tasks: answer selection (AS), paraphrase identification (PI) and textual entailment (TE).

Common Training Setup. Words are initialized by 300-dimensional word2vec embeddings and not changed during training. A single randomly initialized embedding is created for all unknown words by uniform sampling from [-.01, .01]. We employ Adagrad (Duchi et al., 2011) and L2 regularization.

Network Configuration. Each network in the experiments below consists of (i) an initialization block b1 that initializes words by word2vec embeddings, (ii) a stack of k−1 convolution-pooling blocks b2, ..., bk, computing increasingly abstract features, and (iii) one final LR layer (logistic regression layer) as shown in Figure 2.

The input to the LR layer consists of kn features – each block provides n similarity scores, e.g., n cosine similarity scores. Figure 2 shows the two sentence vectors output by the final block bk of the stack ("sentence representation 0", "sentence representation 1"); this is the basis of the last n similarity scores. As we explained in the final paragraph of Section 3, we perform all-ap pooling for all blocks, not just for bk. Thus we get one sentence representation each for s0 and s1 for each block b1, ..., bk. We compute n similarity scores for each block (based on the block's two sentence representations). Thus, we compute a total of kn similarity scores and these scores are input to the LR layer.

          #CL |  AS: lr  w   L2     |  PI: lr  w   L2     |  TE: lr  w   L2
ABCNN-1    1  |  .08    4   .0004   |  .08    3   .0002   |  .08    3   .0006
ABCNN-1    2  |  .085   4   .0006   |  .085   3   .0003   |  .085   3   .0006
ABCNN-2    1  |  .05    4   .0003   |  .085   3   .0001   |  .09    3   .00065
ABCNN-2    2  |  .06    4   .0006   |  .085   3   .0001   |  .085   3   .0007
ABCNN-3    1  |  .05    4   .0003   |  .05    3   .0003   |  .09    3   .0007
ABCNN-3    2  |  .06    4   .0006   |  .055   3   .0005   |  .09    3   .0007
Table 2: Hyperparameters. lr: learning rate. #CL: number of convolution layers. w: filter width. The number of convolution kernels di (i > 0) is 50 throughout.

Depending on the task, we use different methods for computing the similarity score: see below.

Layerwise Training. In our training regime, we first train a network consisting of just one convolution-pooling block b2. We then create a new network by adding a block b3, initialize its b2 block with the previously learned weights for b2 and train b3 keeping the previously learned weights for b2 fixed. We repeat this procedure until all k−1 convolution-pooling blocks are trained. We found that this training regime gives us good performance and shortens training times considerably. Since similarity scores of lower blocks are kept unchanged once they have been learned, this also has the nice effect that "simple" similarity scores (those based on surface features) are learned first and subsequent training phases can focus on complementary scores derived from more complex abstract features.

Classifier. We found that performance increases if we do not use the output of the LR layer as the final decision, but instead train a linear SVM or a logistic regression with default parameters [Footnote 2: http://scikit-learn.org/stable/ for both.] directly on the input to the LR layer (i.e., on the kn similarity scores that are generated by the k-block stack after network training is completed). Direct training of SVMs/LR seems to get closer to the global optimum than gradient descent training of CNNs.

Table 2 shows hyperparameters, tuned on dev.
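The sketch below illustrates this classifier setup: the kn per-block similarity scores (plus any task-specific features) are collected into one feature vector per sentence pair, and a linear SVM or a logistic regression with scikit-learn default parameters is trained directly on these features once network training is finished. The feature-assembly helpers and their names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

def cosine(u, v):
    """Cosine similarity between two all-ap sentence vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def pair_features(reps0, reps1, extra=()):
    """One similarity score per block b1..bk (here n = 1 cosine per block),
    concatenated with any task-specific features (e.g., WordCnt, sentence lengths)."""
    sims = [cosine(r0, r1) for r0, r1 in zip(reps0, reps1)]
    return np.array(sims + list(extra))

def train_final_classifier(X, y, use_svm=True):
    """Train a linear SVM (or logistic regression) with default parameters on the
    assembled feature vectors; it replaces the network's LR layer at test time."""
    clf = LinearSVC() if use_svm else LogisticRegression()
    return clf.fit(X, y)
```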
We use addition and LSTMs as two shared baselines for all three tasks, i.e., for AS, PI and TE. We now describe these two shared baselines. (i) Addition. We sum up word embeddings element-wise to form each sentence representation. The classifier input is then the concatenation of the two sentence representations. (ii) A-LSTM. Before this work, most attention mechanisms in NLP were implemented in recurrent neural networks for text generation tasks such as machine translation (e.g., Bahdanau et al. (2015), Luong et al. (2015)). Rocktäschel et al. (2016) present an attention-LSTM for natural language inference. Since this model is the pioneering attention based RNN system for sentence pair classification, we consider it as a baseline system ("A-LSTM") for all our three tasks. The A-LSTM has the same configuration as our ABCNNs in terms of word initialization (300-dimensional word2vec embeddings) and the dimensionality of all hidden layers (50).

5.1 Answer Selection

We use WikiQA [Footnote 3: http://aka.ms/WikiQA (Yang et al., 2015)], an open domain question-answer dataset. We use the subtask that assumes that there is at least one correct answer for a question. The corresponding dataset consists of 20,360 question-candidate pairs in train, 1,130 pairs in dev and 2,352 pairs in test where we adopt the standard setup of only considering questions with correct answers in test. Following Yang et al. (2015), we truncate answers to 40 tokens.

The task is to rank the candidate answers based on their relatedness to the question. Evaluation measures are mean average precision (MAP) and mean reciprocal rank (MRR).

Task-Specific Setup. We use cosine similarity as the similarity score for AS. In addition, we use sentence lengths, WordCnt (count of the number of non-stopwords in the question that also occur in the answer) and WgtWordCnt (reweight the counts by the IDF values of the question words). Thus, the final input to the LR layer has size k+4: one cosine for each of the k blocks and the four additional features.
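A small sketch of the two count features; the exact tokenization, the stopword list and the IDF estimation are not specified in detail in the paper, so the choices below (a plain token match and idf = log(N / df)) are assumptions.

```python
import math
from collections import Counter

def wordcnt(question, answer, stopwords):
    """WordCnt: number of non-stopword question tokens that also occur in the answer."""
    answer_tokens = set(answer)
    return sum(1 for t in question if t not in stopwords and t in answer_tokens)

def wgt_wordcnt(question, answer, stopwords, idf):
    """WgtWordCnt: the same count, reweighted by the IDF values of the question words."""
    answer_tokens = set(answer)
    return sum(idf.get(t, 0.0) for t in question if t not in stopwords and t in answer_tokens)

def idf_from_corpus(documents):
    """Assumed IDF estimate over a token-list corpus: idf(t) = log(N / df(t))."""
    df = Counter(t for doc in documents for t in set(doc))
    n_docs = len(documents)
    return {t: math.log(n_docs / c) for t, c in df.items()}
```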
We compare with seven baselines. The first three are considered by Yang et al. (2015): (i) WordCnt; (ii) WgtWordCnt; (iii) CNN-Cnt (the state-of-the-art system): combine CNN with (i) and (ii). Apart from the baselines considered by Yang et al. (2015), we compare with two Addition baselines and two LSTM baselines. Addition and A-LSTM are the shared baselines described before. We also combine both with the four extra features; this gives us two additional baselines that we refer to as Addition(+) and A-LSTM(+).

method                     MAP      MRR
Baselines  WordCnt         0.4891   0.4924
           WgtWordCnt      0.5099   0.5132
           CNN-Cnt         0.6520   0.6652
           Addition        0.5021   0.5069
           Addition(+)     0.5888   0.5929
           A-LSTM          0.5347   0.5483
           A-LSTM(+)       0.6381   0.6537
BCNN       one-conv        0.6629   0.6813
           two-conv        0.6593   0.6738
ABCNN-1    one-conv        0.6810∗  0.6979∗
           two-conv        0.6855∗  0.7023∗
ABCNN-2    one-conv        0.6885∗  0.7054∗
           two-conv        0.6879∗  0.7068∗
ABCNN-3    one-conv        0.6914∗  0.7127∗
           two-conv        0.6921∗  0.7108∗
Table 3: Results on WikiQA. Best result per column is bold. Significant improvements over state-of-the-art baselines (underlined) are marked with ∗ (t-test, p < .05).

Results. Table 3 shows performance of the baselines, of the BCNN and of the three ABCNNs. For CNNs, we test one (one-conv) and two (two-conv) convolution-pooling blocks.

The non-attention network BCNN already performs better than the baselines. If we add attention mechanisms, then the performance further improves by several points. Comparing the ABCNN-2 with the ABCNN-1, we find the ABCNN-2 is slightly better even though the ABCNN-2 is the simpler architecture. If we combine the ABCNN-1 and the ABCNN-2 to form the ABCNN-3, we get further improvement. [Footnote 4: If we limit the input to the LR layer to the k similarity scores in the ABCNN-3 (two-conv), results are .660 (MAP) / .677 (MRR).]

This can be explained by the ABCNN-3's ability to take attention of finer-grained granularity into consideration in each convolution-pooling block while the ABCNN-1 and the ABCNN-2 consider attention only at convolution input or only at pooling input, respectively. We also find that stacking two convolution-pooling blocks does not bring consistent improvement and therefore do not test deeper architectures.

5.2 Paraphrase Identification

We use the Microsoft Research Paraphrase (MSRP) corpus (Dolan et al., 2004). The training set contains 2753 true / 1323 false and the test set 1147 true / 578 false paraphrase pairs. We randomly select 400 pairs from train and use them as dev; but we still report results for training on the entire training set. For each triple (label, s0, s1) in the training set, we also add (label, s1, s0) to the training set to make best use of the training data. Systems are evaluated by accuracy and F1.

Task-Specific Setup. In this task, we add the 15 MT features from (Madnani et al., 2012) and the lengths of the two sentences. In addition, we compute ROUGE-1, ROUGE-2 and ROUGE-SU4 (Lin, 2004), which are scores measuring the match between the two sentences on (i) unigrams, (ii) bigrams and (iii) unigrams and skip-bigrams (maximum skip distance of four), respectively. In this task, we found transforming Euclidean distance into similarity score by 1/(1 + |x − y|) performs better than cosine similarity. Additionally, we use dynamic pooling (Yin and Schütze, 2015a) of the attention matrix A in Equation (1) and forward pooled values of all blocks to the classifier. This gives us better performance than only forwarding sentence-level matching features.

We compare our system with representative DL approaches: (i) A-LSTM; (ii) A-LSTM(+): A-LSTM plus handcrafted features; (iii) RAE (Socher et al., 2011), recursive autoencoder; (iv) Bi-CNN-MI (Yin and Schütze, 2015a), a bi-CNN architecture; and (v) MPSSM-CNN (He et al., 2015), the state-of-the-art NN system for PI, and the following four non-DL systems: (vi) Addition; (vii) Addition(+): Addition plus handcrafted features; (viii) MT (Madnani et al., 2012), a system that combines machine translation metrics [Footnote 5: For better comparability of approaches in our experiments, we use a simple SVM classifier, which performs slightly worse than Madnani et al. (2012)'s more complex meta-classifier.]; (ix) MF-TF-KLD (Ji and Eisenstein, 2013), the state-of-the-art non-NN system.

Results. Table 4 shows that the BCNN is slightly worse than the state-of-the-art whereas the ABCNN-1 roughly matches it. The ABCNN-2 is slightly above the state-of-the-art. The ABCNN-3 outperforms the state-of-the-art in accuracy and F1. [Footnote 6: Improvement of .3 (acc) and .1 (F1) over state-of-the-art is not significant. The ABCNN-3 (two-conv) without "linguistic" features (i.e., MT and ROUGE) achieves 75.1/82.7.] Two convolution layers only bring small improvements over one.

method                      acc    F1
Baselines  majority voting  66.5   79.9
           RAE              76.8   83.6
           Bi-CNN-MI        78.4   84.6
           MPSSM-CNN        78.6   84.7
           MT               76.8   83.8
           MF-TF-KLD        78.6   84.6
           Addition         70.8   80.9
           Addition(+)      77.3   84.1
           A-LSTM           69.5   80.1
           A-LSTM(+)        77.1   84.0
BCNN       one-conv         78.1   84.1
           two-conv         78.3   84.3
ABCNN-1    one-conv         78.5   84.5
           two-conv         78.5   84.6
ABCNN-2    one-conv         78.6   84.7
           two-conv         78.8   84.7
ABCNN-3    one-conv         78.8   84.8
           two-conv         78.9   84.8
Table 4: Results for PI on MSRP

5.3 Textual Entailment

SemEval 2014 Task 1 (Marelli et al., 2014a) evaluates system predictions of textual entailment (TE) relations on sentence pairs from the SICK dataset (Marelli et al., 2014b). The three classes are entailment, contradiction and neutral. The sizes of SICK train, dev and test sets are 4439, 495 and 4906 pairs, respectively. We call this dataset ORIG.

We also create NONOVER, a copy of ORIG in which words occurring in both sentences are removed. A sentence in NONOVER is denoted by a special token.
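A sketch of the NONOVER construction; the paper does not spell out how repeated tokens are treated, so this version simply drops every token type shared by the two sentences, which reproduces the first two ORIG/NONOVER pairs shown in Table 5 below.

```python
def nonover(sentence0, sentence1):
    """NONOVER view of a pair: drop every token that occurs in both sentences.
    (Handling of repeated tokens is an assumption; the paper only says that
    words occurring in both sentences are removed.)"""
    shared = set(sentence0) & set(sentence1)
    return ([t for t in sentence0 if t not in shared],
            [t for t in sentence1 if t not in shared])

# Example (pair 1 of Table 5): only the non-shared words survive.
s0 = "three boys are jumping in the leaves".split()
s1 = "three kids are jumping in the leaves".split()
print(nonover(s0, s1))   # (['boys'], ['kids'])
```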
     ORIG                                               NONOVER
0    children in red shirts are playing in the leaves   children red shirts playing
     three kids are sitting in the leaves               three kids sitting
1    three boys are jumping in the leaves               boys
     three kids are jumping in the leaves               kids
2    a man is jumping into an empty pool                an empty
     a man is jumping into a full pool                  a full
Table 5: SICK data: Converting the original sentences (ORIG) into the NONOVER format

We use the following linguistic features. Negation is important for detecting contradiction. Feature NEG is set to 1 if either sentence contains "no", "not", "nobody", "isn't" and to 0 otherwise. Following Lai and Hockenmaier (2014), we use WordNet (Miller, 1995) to detect nyms: synonyms, hypernyms and antonyms in the pairs. But we do this on NONOVER (not on ORIG) to focus on what is critical for TE. Specifically, feature SYN is the number of word pairs in s0 and s1 that are synonyms. HYP0 (resp. HYP1) is the number of words in s0 (resp. s1) that have a hypernym in s1 (resp. s0). In addition, we collect all potential antonym pairs (PAP) in NONOVER. We identify the matched chunks that occur in contradictory and neutral, but not in entailed pairs. We exclude synonyms and hypernyms and apply a frequency filter of n = 2. In contrast to Lai and Hockenmaier (2014), we constrain the PAP pairs to cosine similarity above 0.4 in word2vec embedding space as this discards many noise pairs. Feature ANT is the number of matched PAP antonyms in a sentence pair. As before we use sentence lengths, both for ORIG (LEN0O: length s0, LEN1O: length s1) and for NONOVER (LEN0N: length s0, LEN1N: length s1).

On the whole, we have 24 extra features: 15 MT metrics, NEG, SYN, HYP0, HYP1, ANT, LEN0O, LEN1O, LEN0N and LEN1N.
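A rough sketch of the NEG, SYN and HYP features using NLTK's WordNet interface. The paper follows Lai and Hockenmaier (2014) and does not give exact matching rules, so the synset-overlap test for synonymy and the hypernym-closure test below are plausible readings rather than the authors' implementation, and the corpus-dependent ANT/PAP feature is omitted.

```python
from nltk.corpus import wordnet as wn   # assumes NLTK and its WordNet data are installed

NEGATION_WORDS = {"no", "not", "nobody", "isn't"}

def neg_feature(sent0, sent1):
    """NEG: 1 if either sentence contains a negation word, else 0."""
    return int(bool(NEGATION_WORDS & set(sent0)) or bool(NEGATION_WORDS & set(sent1)))

def are_synonyms(w0, w1):
    """Rough synonymy test: the two words share at least one WordNet synset."""
    return bool(set(wn.synsets(w0)) & set(wn.synsets(w1)))

def syn_feature(nonover0, nonover1):
    """SYN: number of word pairs across the two NONOVER sentences that are synonyms."""
    return sum(are_synonyms(a, b) for a in nonover0 for b in nonover1)

def has_hypernym_in(word, other_sentence):
    """True if some sense of `word` has a hypernym whose lemma occurs in the other sentence."""
    others = set(other_sentence)
    for synset in wn.synsets(word):
        for hyper in synset.closure(lambda s: s.hypernyms()):
            if others & set(hyper.lemma_names()):
                return True
    return False

def hyp_features(nonover0, nonover1):
    """HYP0 / HYP1: words in s0 (resp. s1) that have a hypernym in s1 (resp. s0)."""
    hyp0 = sum(has_hypernym_in(w, nonover1) for w in nonover0)
    hyp1 = sum(has_hypernym_in(w, nonover0) for w in nonover1)
    return hyp0, hyp1
```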
Apart from the Addition and LSTM baselines, we further compare with the top-3 systems in SemEval and TrRNTN (Bowman et al., 2015b), a recursive neural network developed for this SICK task.

Results. Table 6 shows that our CNNs outperform A-LSTM (with or without linguistic features added) and the top three SemEval systems. Comparing ABCNNs with the BCNN, attention mechanisms consistently improve performance. The ABCNN-1 has performance comparable to the ABCNN-2 while the ABCNN-3 is better still: a boost of 1.6 points compared to the previous state of the art. [Footnote 7: If we run the ABCNN-3 (two-conv) without the 24 linguistic features, performance is 84.6.]

method                                        acc
SemEval Top 3  (Jimenez et al., 2014)         83.1
               (Zhao et al., 2014)            83.6
               (Lai and Hockenmaier, 2014)    84.6
TrRNTN (Bowman et al., 2015b)                 76.9
Addition       no features                    73.1
               plus features                  79.4
A-LSTM         no features                    78.0
               plus features                  81.7
BCNN           one-conv                       84.8
               two-conv                       85.0
ABCNN-1        one-conv                       85.6
               two-conv                       85.8
ABCNN-2        one-conv                       85.7
               two-conv                       85.8
ABCNN-3        one-conv                       86.0∗
               two-conv                       86.2∗
Table 6: Results on SICK. Significant improvements over (Lai and Hockenmaier, 2014) are marked with ∗ (test of equal proportions, p < .05).

Visual Analysis. Figure 4 visualizes the attention matrices for one TE sentence pair in the ABCNN-2 for blocks b1 (unigrams), b2 (first convolutional layer) and b3 (second convolutional layer). Darker shades of blue indicate stronger attention values.

Figure 4: Attention visualization for TE. Top: unigrams, b1. Middle: conv1, b2. Bottom: conv2, b3. (The pair shown is s0 "people are walking outside the building that has several murals on it" and s1 "several people are in front of a colorful building".)

In Figure 4 (top), each word corresponds to exactly one row or column. We can see that words in s_i with semantic equivalents in s_{1−i} get high attention while words without semantic equivalents get low attention, e.g., "walking" and "murals" in s0 and "front" and "colorful" in s1. This behavior seems reasonable for the unigram level.

Rows/columns of the attention matrix in Figure 4 (middle) correspond to phrases of length three since filter width w = 3. High attention values generally correlate with close semantic correspondence: the phrase "people are" in s0 matches "several people are" in s1; both "are walking outside" and "walking outside the" in s0 match "are in front" in s1; "the building that" in s0 matches "a colorful building" in s1. More interestingly, looking at the bottom right corner, both "on it" and "it" in s0 match "building" in s1; this indicates that ABCNNs are able to detect some coreference across sentences. "building" in s1 has two places in which higher attentions appear, one is with "it" in s0, the other is with "the building that" in s0. This may indicate that ABCNNs recognize that "building" in s1 and "the building that" / "it" in s0 refer to the same object. Hence, coreference resolution across sentences as well as within a sentence both are detected. For the attention vectors on the left and the top, we can see that attention has focused on the key parts: "people are walking outside the building that" in s0, "several people are in" and "of a colorful building" in s1.

Rows/columns of the attention matrix in Figure 4 (bottom, second layer of convolution) correspond to phrases of length 5 since filter width w = 3 in both convolution layers (5 = 1 + 2 ∗ (3 − 1)). We use "..." to denote words in the middle if a phrase like "several ... front" has more than two words. We can see that attention distribution in the matrix has focused on some local regions. As granularity of phrases is larger, it makes sense that the attention values are smoother. But we still can find some interesting clues: at the two ends of the main diagonal, higher attentions hint that the first part of s0 matches well with the first part of s1; "several murals on it" in s0 matches well with "of a colorful building" in s1, which satisfies the intuition that these two phrases are crucial for making a decision on TE in this case. This again shows the potential strength of our system in figuring out which parts of the two sentences refer to the same object. In addition, in the central part of the matrix, we can see that the long phrase "people are walking outside the building" in s0 matches well with the long phrase "are in front of a colorful building" in s1.

6 Summary

We presented three mechanisms to integrate attention into CNNs for general sentence pair modeling tasks.

Our experiments on AS, PI and TE show that attention-based CNNs perform better than CNNs without attention mechanisms. The ABCNN-2 generally outperforms the ABCNN-1 and the ABCNN-3 surpasses both.

In all tasks, we did not find any big improvement of two layers of convolution over one layer. This is probably due to the limited size of training data. We expect that, as larger training sets become available, deep ABCNNs will show even better performance.

In addition, linguistic features contribute in all three tasks: improvements by 0.0321 (MAP) and 0.0338 (MRR) for AS, improvements by 3.8 (acc) and 2.1 (F1) for PI and an improvement by 1.6 (acc) for TE. But our ABCNNs can still reach or surpass state-of-the-art even without those features in AS and TE tasks. This indicates that ABCNNs are generally strong NN systems.

Attention-based LSTMs are especially successful in tasks with a strong generation component like machine translation (discussed in Sec. 2). CNNs have not been used for this type of task. This is an interesting area of future work for attention-based CNNs.
Acknowledgments

We gratefully acknowledge the support of Deutsche Forschungsgemeinschaft (DFG): grant SCHU 2246/8-2. We would like to thank the anonymous reviewers for their helpful comments.

References

Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. 2015. Multiple object recognition with visual attention. In Proceedings of ICLR.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.

Matthew W. Bilotti, Paul Ogilvie, Jamie Callan, and Eric Nyberg. 2007. Structured retrieval for question answering. In Proceedings of SIGIR, pages 351–358.

William Blacoe and Mirella Lapata. 2012. A comparison of vector-based representations for semantic composition. In Proceedings of EMNLP-CoNLL, pages 546–556.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015a. A large annotated corpus for learning natural language inference. In Proceedings of EMNLP, pages 632–642.

Samuel R. Bowman, Christopher Potts, and Christopher D. Manning. 2015b. Recursive neural networks can learn logical semantics. In Proceedings of CVSC Workshop, pages 12–21.

Jane Bromley, James W. Bentz, Léon Bottou, Isabelle Guyon, Yann LeCun, Cliff Moore, Eduard Säckinger, and Roopak Shah. 1993. Signature verification using a "siamese" time delay neural network. IJPRAI, 7(4):669–688.

Chunshui Cao, Xianming Liu, Yi Yang, Yinan Yu, Jiang Wang, Zilei Wang, Yongzhen Huang, Liang Wang, Chang Huang, Wei Xu, Deva Ramanan, and Thomas S. Huang. 2015. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. In Proceedings of ICCV, pages 2956–2964.

Ming-Wei Chang, Dan Goldwasser, Dan Roth, and Vivek Srikumar. 2010. Discriminative learning over constrained latent representations. In Proceedings of NAACL-HLT, pages 429–437.

Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, and Ram Nevatia. 2015. ABC-CNN: An attention based convolutional neural network for visual question answering. CoRR, abs/1511.05960.

Jan Chorowski, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. End-to-end continuous speech recognition using attention-based recurrent NN: First results. In Proceedings of Deep Learning and Representation Learning Workshop, NIPS.

Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. 2015. Attention-based models for speech recognition. In Proceedings of NIPS, pages 577–585.

Bill Dolan, Chris Quirk, and Chris Brockett. 2004. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of COLING, pages 350–356.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121–2159.

Minwei Feng, Bing Xiang, Michael R. Glass, Lidan Wang, and Bowen Zhou. 2015. Applying deep learning to answer selection: A study and an open task. In Proceedings of IEEE ASRU Workshop, pages 813–820.

Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. 2015. DRAW: A recurrent neural network for image generation. In Proceedings of ICML, pages 1462–1471.

Hua He, Kevin Gimpel, and Jimmy Lin. 2015. Multi-perspective sentence similarity modeling with convolutional neural networks. In Proceedings of EMNLP, pages 1576–1586.

Michael Heilman and Noah A. Smith. 2010. Tree edit models for recognizing textual entailments, paraphrases, and answers to questions. In Proceedings of NAACL-HLT, pages 1011–1019.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Seunghoon Hong, Junhyuk Oh, Honglak Lee, and Bohyung Han. 2016. Learning transferrable knowledge for semantic segmentation with deep convolutional neural network. In Proceedings of CVPR.

Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional neural network architectures for matching natural language sentences. In Proceedings of NIPS, pages 2042–2050.

Yangfeng Ji and Jacob Eisenstein. 2013. Discriminative improvements to distributional sentence similarity. In Proceedings of EMNLP, pages 891–896.

Sergio Jimenez, George Dueñas, Julia Baquero, and Alexander Gelbukh. 2014. UNAL-NLP: Combining soft cardinality features for semantic textual similarity, relatedness and entailment. In Proceedings of SemEval, pages 732–742.
Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of ACL, pages 655–665.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of EMNLP, pages 1746–1751.

Alice Lai and Julia Hockenmaier. 2014. Illinois-LH: A denotational and distributional approach to semantics. In Proceedings of SemEval, pages 329–334.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324.

Jiwei Li, Minh-Thang Luong, and Dan Jurafsky. 2015. A hierarchical neural autoencoder for paragraphs and documents. In Proceedings of ACL, pages 1106–1115.

Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Proceedings of the ACL Text Summarization Workshop.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of EMNLP, pages 1412–1421.

Nitin Madnani, Joel Tetreault, and Martin Chodorow. 2012. Re-examining machine translation metrics for paraphrase identification. In Proceedings of NAACL-HLT, pages 182–190.

Marco Marelli, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto Zamparelli. 2014a. Semeval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In Proceedings of SemEval, pages 1–8.

Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014b. A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of LREC, pages 216–223.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS, pages 3111–3119.

George A. Miller. 1995. WordNet: A lexical database for english. Commun. ACM, 38(11):39–41.

Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. 2014. Recurrent models of visual attention. In Proceedings of NIPS, pages 2204–2212.

Dan Moldovan, Christine Clark, Sanda Harabagiu, and Daniel Hodges. 2007. Cogex: A semantically and contextually enriched logic prover for question answering. Journal of Applied Logic, 5(1):49–69.

Vasin Punyakanok, Dan Roth, and Wen-tau Yih. 2004. Mapping dependencies trees: An application to question answering. In Proceedings of AI&Math 2004 (Special session: Intelligent Text Processing).

Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. 2016. Reasoning about entailment with neural attention. In Proceedings of ICLR.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of EMNLP, pages 379–389.

Dan Shen and Mirella Lapata. 2007. Using semantic roles to improve question answering. In Proceedings of EMNLP-CoNLL, pages 12–21.

Richard Socher, Eric H. Huang, Jeffrey Pennington, Andrew Y. Ng, and Christopher D. Manning. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Proceedings of NIPS, pages 801–809.

Ming Tan, Bing Xiang, and Bowen Zhou. 2016. LSTM-based deep learning models for non-factoid answer selection. In Proceedings of ICLR Workshop.

Shengxian Wan, Yanyan Lan, Jiafeng Guo, Jun Xu, Liang Pang, and Xueqi Cheng. 2016. A deep architecture for semantic matching with multiple positional sentence representations. In Proceedings of AAAI, pages 2835–2841.

Mengqiu Wang, Noah A. Smith, and Teruko Mitamura. 2007. What is the jeopardy model? A quasi-synchronous grammar for QA. In Proceedings of EMNLP-CoNLL, pages 22–32.

Tianjun Xiao, Yichong Xu, Kuiyuan Yang, Jiaxing Zhang, Yuxin Peng, and Zheng Zhang. 2015. The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In Proceedings of CVPR, pages 842–850.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of ICML, pages 2048–2057.

Yi Yang, Wen-tau Yih, and Christopher Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of EMNLP, pages 2013–2018.

Xuchen Yao, Benjamin Van Durme, Chris Callison-Burch, and Peter Clark. 2013a. Semi-markov phrase-based monolingual alignment. In Proceedings of EMNLP, pages 590–600.

Xuchen Yao, Benjamin Van Durme, and Peter Clark. 2013b. Automatic coupling of answer extraction and information retrieval. In Proceedings of ACL, pages 159–165.
Wen-tau Yih, Ming-Wei Chang, Christopher Meek, and Andrzej Pastusiak. 2013. Question answering using enhanced lexical semantic models. In Proceedings of ACL, pages 1744–1753.

Wenpeng Yin and Hinrich Schütze. 2015a. Convolutional neural network for paraphrase identification. In Proceedings of NAACL-HLT, pages 901–911.

Wenpeng Yin and Hinrich Schütze. 2015b. MultiGranCNN: An architecture for general matching of text chunks on multiple levels of granularity. In Proceedings of ACL-IJCNLP, pages 63–73.

Lei Yu, Karl Moritz Hermann, Phil Blunsom, and Stephen Pulman. 2014. Deep learning for answer sentence selection. In Proceedings of Deep Learning and Representation Learning Workshop, NIPS.

Jiang Zhao, Tiantian Zhu, and Man Lan. 2014. ECNU: One stone two birds: Ensemble of heterogenous measures for semantic relatedness and textual entailment. In Proceedings of SemEval, pages 271–277.