Transactions of the Association for Computational Linguistics, vol. 4, pp. 259–272, 2016. Action Editor: Brian Roark.
Submission batch: 12/2015; Revision batch: 3/2016; Published 6/2016.
© 2016 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs

Wenpeng Yin, Hinrich Schütze
Center for Information and Language Processing, LMU Munich, Germany
wenpeng@cis.lmu.de

Bing Xiang, Bowen Zhou
IBM Watson, Yorktown Heights, New York, USA
bingxia, zhou@us.ibm.com

Abstract

How to model a pair of sentences is a critical issue in many NLP tasks such as answer selection (AS), paraphrase identification (PI) and textual entailment (TE). Most prior work (i) deals with one individual task by fine-tuning a specific system; (ii) models each sentence's representation separately, rarely considering the impact of the other sentence; or (iii) relies fully on manually designed, task-specific linguistic features. This work presents a general Attention Based Convolutional Neural Network (ABCNN) for modeling a pair of sentences. We make three contributions. (i) The ABCNN can be applied to a wide variety of tasks that require modeling of sentence pairs. (ii) We propose three attention schemes that integrate mutual influence between sentences into CNNs; thus, the representation of each sentence takes into consideration its counterpart. These interdependent sentence pair representations are more powerful than isolated sentence representations. (iii) ABCNNs achieve state-of-the-art performance on AS, PI and TE tasks. We release code at: https://github.com/yinwenpeng/Answer_Selection.

1 Introduction

How to model a pair of sentences is a critical issue in many NLP tasks such as answer selection (AS) (Yu et al., 2014; Feng et al., 2015), paraphrase identification (PI) (Madnani et al., 2012; Yin and Schütze, 2015a), textual entailment (TE) (Marelli et al., 2014a; Bowman et al., 2015a) etc.

AS
  s0:  how much did Waterboy gross?
  s+1: the movie earned $161.5 million
  s−1: this was Jerry Reed's final film appearance
PI
  s0:  she struck a deal with RH to pen a book today
  s+1: she signed a contract with RH to write a book
  s−1: she denied today that she struck a deal with RH
TE
  s0:  an ice skating rink placed outdoors is full of people
  s+1: a lot of people are in an ice skating park
  s−1: an ice skating rink placed indoors is full of people
Figure 1: Positive (s+1) and negative (s−1) examples for AS, PI and TE tasks. RH = Random House

Most prior work derives each sentence's representation separately, rarely considering the impact of the other sentence. This neglects the mutual influence of the two sentences in the context of the task. It also contradicts what humans do when comparing two sentences. We usually focus on key parts of one sentence by extracting parts from the other sentence that are related by identity, synonymy, antonymy and other relations. Thus, human beings model the two sentences together, using the content of one sentence to guide the representation of the other.

Figure 1 demonstrates that each sentence of a pair partially determines which parts of the other sentence we must focus on. For AS, correctly answering s0 requires attention on "gross": s+1 contains a corresponding unit ("earned") while s−1 does not. For PI, focus should be removed from "today" to correctly recognize (s0, s+1) as paraphrases and (s0, s−1) as non-paraphrases. For TE, we need to focus on "full of people" (to recognize TE for (s0, s+1)) and on "outdoors"/"indoors" (to recognize non-TE for (s0, s−1)). These examples show the need for an architecture that computes different representations of si for different s1−i (i ∈ {0,1}).
Convolutional Neural Networks (CNNs) (LeCun et al., 1998) are widely used to model sentences (Kalchbrenner et al., 2014; Kim, 2014) and sentence pairs (Socher et al., 2011; Yin and Schütze, 2015a), especially in classification tasks. CNNs are supposed to be good at extracting robust and abstract features of input. This work presents the ABCNN, an attention-based convolutional neural network, that has a powerful mechanism for modeling a sentence pair by taking into account the interdependence between the two sentences. The ABCNN is a general architecture that can handle a wide variety of sentence pair modeling tasks.

Some prior work proposes simple mechanisms that can be interpreted as controlling varying attention; e.g., Yih et al. (2013) employ word alignment to match related parts of the two sentences. In contrast, our attention scheme based on CNNs models relatedness between two parts fully automatically. Moreover, attention at multiple levels of granularity, not only at word level, is achieved as we stack multiple convolution layers that increase abstraction.

Prior work on attention in deep learning (DL) mostly addresses long short-term memory networks (LSTMs) (Hochreiter and Schmidhuber, 1997). LSTMs achieve attention usually in a word-to-word scheme, and word representations mostly encode the whole context within the sentence (Bahdanau et al., 2015; Rocktäschel et al., 2016). It is not clear whether this is the best strategy; e.g., in the AS example in Figure 1, it is possible to determine that "how much" in s0 matches "$161.5 million" in s1 without taking the entire sentence contexts into account. This observation was also investigated by Yao et al. (2013b) where an information retrieval system retrieves sentences with tokens labeled as DATE by named entity recognition or as CD by POS tagging if there is a "when" question. However, labels or POS tags require extra tools. CNNs benefit from incorporating attention into representations of local phrases detected by filters; in contrast, LSTMs encode the whole context to form attention-based word representations – a strategy that is more complex than the CNN strategy and (as our experiments suggest) performs less well for some tasks.

Apart from these differences, it is clear that attention has as much potential for CNNs as it does for LSTMs. As far as we know, this is the first NLP paper that incorporates attention into CNNs. Our ABCNNs get state-of-the-art results in AS and TE tasks and competitive performance in PI, and obtain further improvements over all three tasks when linguistic features are used.

2 Related Work

Non-DL on Sentence Pair Modeling. Sentence pair modeling has attracted lots of attention in the past decades. Many tasks can be reduced to a semantic text matching problem. Due to the variety of word choices and inherent ambiguities in natural language, bag-of-word approaches with simple surface-form word matching tend to produce brittle results with poor prediction accuracy (Bilotti et al., 2007). As a result, researchers put more emphasis on exploiting syntactic and semantic structure. Representative examples include methods based on deeper semantic analysis (Shen and Lapata, 2007; Moldovan et al., 2007), tree edit-distance (Punyakanok et al., 2004; Heilman and Smith, 2010) and quasi-synchronous grammars (Wang et al., 2007) that match the dependency parse trees of the two sentences. Instead of focusing on the high-level semantic representation, Yih et al. (2013) turn their attention to improving the shallow semantic component, lexical semantics, by performing semantic matching based on a latent word-alignment structure (cf. Chang et al. (2010)). Lai and Hockenmaier (2014) explore finer-grained word overlap and alignment between two sentences using negation, hypernym, synonym and antonym relations. Yao et al. (2013a) extend word-to-word alignment to phrase-to-phrase alignment by a semi-Markov CRF. However, such approaches often require more computational resources. In addition, employing syntactic or semantic parsers – which produce errors on many sentences – to find the best match between the structured representations of two sentences is not trivial.

DL on Sentence Pair Modeling. To address some of the challenges of non-DL work, much recent work uses neural networks to model sentence pairs for AS, PI and TE.

For AS, Yu et al. (2014) present a bigram CNN to model question and answer candidates. Yang et al. (2015) extend this method and get state-of-the-art performance on the WikiQA dataset (Section 5.1).

Feng et al. (2015) test various setups of a bi-CNN architecture on an insurance domain QA dataset. Tan et al. (2016) explore bidirectional LSTMs on the same dataset. Our approach is different because we do not model the sentences by two independent neural networks in parallel, but instead as an interdependent sentence pair, using attention.

For PI, Blacoe and Lapata (2012) form sentence representations by summing up word embeddings. Socher et al. (2011) use recursive autoencoders (RAEs) to model representations of local phrases in sentences, then pool similarity values of phrases from the two sentences as features for binary classification. Yin and Schütze (2015a) similarly replace an RAE with a CNN. In all three papers, the representation of one sentence is not influenced by the other – in contrast to our attention-based model.

For TE, Bowman et al. (2015b) use recursive neural networks to encode entailment on SICK (Marelli et al., 2014b). Rocktäschel et al. (2016) present an attention-based LSTM for the Stanford natural language inference corpus (Bowman et al., 2015a). Our system is the first CNN-based work on TE.

Some prior work aims to solve a general sentence matching problem. Hu et al. (2014) present two CNN architectures, ARC-I and ARC-II, for sentence matching. ARC-I focuses on sentence representation learning while ARC-II focuses on matching features on phrase level. Both systems were tested on PI, sentence completion (SC) and tweet-response matching. Yin and Schütze (2015b) propose the MultiGranCNN architecture to model general sentence matching based on phrase matching on multiple levels of granularity and get promising results for PI and SC. Wan et al. (2016) try to match two sentences in AS and SC by multiple sentence representations, each coming from the local representations of two LSTMs. Our work is the first one to investigate attention for the general sentence matching task.

Attention-Based DL in Non-NLP Domains. Even though there is little if any work on attention mechanisms in CNNs for NLP, attention-based CNNs have been used in computer vision for visual question answering (Chen et al., 2015), image classification (Xiao et al., 2015), caption generation (Xu et al., 2015), image segmentation (Hong et al., 2016) and object localization (Cao et al., 2015).

symbol      description
s, s0, s1   sentence or sentence length
v           word
w           filter width
di          dimensionality of input to layer i+1
W           weight matrix
Table 1: Notation

Mnih et al. (2014) apply attention in recurrent neural networks (RNNs) to extract information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution. Gregor et al. (2015) combine a spatial attention mechanism with RNNs for image generation. Ba et al. (2015) investigate attention-based RNNs for recognizing multiple objects in images. Chorowski et al. (2014) and Chorowski et al. (2015) use attention in RNNs for speech recognition.

Attention-Based DL in NLP. Attention-based DL systems have been applied to NLP after their success in computer vision and speech recognition. They mainly rely on RNNs and end-to-end encoder-decoders for tasks such as machine translation (Bahdanau et al., 2015; Luong et al., 2015) and text reconstruction (Li et al., 2015; Rush et al., 2015). Our work takes the lead in exploring attention mechanisms in CNNs for NLP tasks.

3 BCNN: Basic Bi-CNN

We now introduce our basic (non-attention) CNN that is based on the Siamese architecture (Bromley et al., 1993), i.e., it consists of two weight-sharing CNNs, each processing one of the two sentences, and a final layer that solves the sentence pair task. See Figure 2. We refer to this architecture as the BCNN. The next section will then introduce the ABCNN, an attention architecture that extends the BCNN. Table 1 gives our notational conventions.

In our implementation, and also in the mathematical formalization of the model given below, we pad the two sentences to have the same length s = max(s0, s1). However, in the figures we show different lengths because this gives a better intuition of how the model works.

We now describe the BCNN's four types of layers: input, convolution, average pooling and output.

Input layer. In the example in the figure, the two input sentences have 5 and 7 words, respectively.

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
0
9
7
1
5
6
7
4
1
2

/

/
t

je

un
c
_
un
_
0
0
0
9
7
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

262

Figure 2: BCNN: ABCNN without attention

Each word is represented as a d0-dimensional precomputed word2vec (Mikolov et al., 2013) embedding, d0 = 300. As a result, each sentence is represented as a feature map of dimension d0 × s.

Convolution layer. Let v1, v2, ..., vs be the words of a sentence and ci ∈ R^(w·d0), 0 < i < s + w, the concatenated embeddings of vi−w+1, ..., vi, where embeddings for vj are set to zero when j < 1 or j > s. We then generate the representation pi ∈ R^d1 for the phrase vi−w+1, ..., vi using the convolution weights W ∈ R^(d1 × w·d0) as follows:

    pi = tanh(W · ci + b)

where b ∈ R^d1 is the bias.

Average pooling layer. Pooling (including min, max, average pooling) is commonly used to extract robust features from convolution. In this paper, we introduce attention weighting as an alternative, but use average pooling as a baseline as follows.

For the output feature map of the last convolution layer, we do column-wise averaging over all columns, denoted as all-ap. This generates a representation vector for each of the two sentences, shown as the top "Average pooling (all-ap)" layer below "Logistic regression" in Figure 2. These two vectors are the basis for the sentence pair decision.

For the output feature map of non-final convolution layers, we do column-wise averaging over windows of w consecutive columns, denoted as w-ap; shown as the lower "Average pooling (w-ap)" layer in Figure 2. For filter width w, a convolution layer transforms an input feature map of s columns into a new feature map of s + w − 1 columns; average pooling transforms this back to s columns. This architecture supports stacking an arbitrary number of convolution-pooling blocks to extract increasingly abstract features. Input features to the bottom layer are words, input features to the next layer are short phrases and so on. Each level generates more abstract features of higher granularity.

The last layer is an output layer, chosen according to the task; e.g., for binary classification tasks, this layer is logistic regression (see Figure 2). Other types of output layers are introduced below.

We found that in most cases, performance is boosted if we provide the output of all pooling layers as input to the output layer. For each non-final average pooling layer, we perform w-ap (pooling over windows of w columns) as described above, but we also perform all-ap (pooling over all columns) and forward the result to the output layer. This improves performance because representations from different layers cover the properties of the sentences at different levels of abstraction and all of these levels can be important for a particular sentence pair.
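
The following is a minimal NumPy sketch of one BCNN building block as described above (wide convolution with zero padding, w-ap and all-ap average pooling). The function names are our own, not from the released code, and learning of W and b by backpropagation is omitted; it is meant only to make the column bookkeeping concrete.

    import numpy as np

    def wide_convolution(F, W, b, w):
        """F: input feature map (d0 x s); W: (d1 x w*d0); b: (d1,).
        Returns a d1 x (s + w - 1) feature map (zero-padded wide convolution)."""
        d0, s = F.shape
        # pad w-1 zero columns on each side so every window v_{i-w+1..i} exists
        Fpad = np.concatenate([np.zeros((d0, w - 1)), F, np.zeros((d0, w - 1))], axis=1)
        cols = []
        for i in range(s + w - 1):
            c_i = Fpad[:, i:i + w].T.reshape(-1)   # concatenated embeddings c_i
            cols.append(np.tanh(W @ c_i + b))      # p_i = tanh(W c_i + b)
        return np.stack(cols, axis=1)

    def w_ap(F, w):
        """Average pooling over windows of w consecutive columns:
        maps a d x (s + w - 1) feature map back to d x s."""
        d, n = F.shape
        s = n - w + 1
        return np.stack([F[:, j:j + w].mean(axis=1) for j in range(s)], axis=1)

    def all_ap(F):
        """Column-wise averaging over all columns -> one sentence vector."""
        return F.mean(axis=1)

Stacking wide_convolution followed by w_ap keeps the number of columns at s, so convolution-pooling blocks can be stacked; applying all_ap to each block's output yields the per-block sentence vectors that feed the output layer.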

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
0
9
7
1
5
6
7
4
1
2

/

/
t

je

un
c
_
un
_
0
0
0
9
7
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

263

(un)OneblockinABCNN-1(b)OneblockinABCNN-2(c)OneblockinABCNN-3Figure3:ThreeABCNNarchitectures

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
0
9
7
1
5
6
7
4
1
2

/

/
t

je

un
c
_
un
_
0
0
0
9
7
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

264

4 ABCNN: Attention-Based BCNN

We now describe three architectures based on the BCNN, the ABCNN-1, the ABCNN-2 and the ABCNN-3, that each introduces an attention mechanism for modeling sentence pairs; see Figure 3.

ABCNN-1. The ABCNN-1 (Figure 3(a)) employs an attention feature matrix A to influence convolution. Attention features are intended to weight those units of si more highly in convolution that are relevant to a unit of s1−i (i ∈ {0,1}); we use the term "unit" here to refer to words on the lowest level and to phrases on higher levels of the network. Figure 3(a) shows two unit representation feature maps in red: this part of the ABCNN-1 is the same as in the BCNN (see Figure 2). Each column is the representation of a unit, a word on the lowest level and a phrase on higher levels.

We first describe the attention feature matrix A informally (layer "Conv input", middle column, in Figure 3(a)). A is generated by matching units of the left representation feature map with units of the right representation feature map such that the attention values of row i in A denote the attention distribution of the i-th unit of s0 with respect to s1, and the attention values of column j in A denote the attention distribution of the j-th unit of s1 with respect to s0. A can be viewed as a new feature map of s0 (resp. s1) in row (resp. column) direction because each row (resp. column) is a new feature vector of a unit in s0 (resp. s1). Thus, it makes sense to combine this new feature map with the representation feature maps and use both as input to the convolution operation. We achieve this by transforming A into the two blue matrices in Figure 3(a) that have the same format as the representation feature maps. As a result, the new input of convolution has two feature maps for each sentence (shown in red and blue). Our motivation is that the attention feature map will guide the convolution to learn "counterpart-biased" sentence representations.

More formally, let Fi,r ∈ R^(d×s) be the representation feature map of sentence i (i ∈ {0,1}). Then we define the attention matrix A ∈ R^(s×s) as follows:

    Ai,j = match-score(F0,r[:, i], F1,r[:, j])    (1)

The function match-score can be defined in a variety of ways. We found that 1/(1 + |x − y|) works well, where |·| is Euclidean distance.

Given attention matrix A, we generate the attention feature map Fi,a for si as follows:

    F0,a = W0 · A^T,   F1,a = W1 · A

The weight matrices W0 ∈ R^(d×s), W1 ∈ R^(d×s) are parameters of the model to be learned in training. (The weights of the two matrices are shared in our implementation to reduce the number of parameters of the model.)

We stack the representation feature map Fi,r and the attention feature map Fi,a as an order 3 tensor and feed it into convolution to generate a higher-level representation feature map for si (i ∈ {0,1}). In Figure 3(a), s0 has 5 units, s1 has 7. The output of convolution (shown in the top layer, filter width w = 3) is a higher-level representation feature map with 7 columns for s0 and 9 columns for s1.

ABCNN-2. The ABCNN-1 computes attention weights directly on the input representation with the aim of improving the features computed by convolution. The ABCNN-2 (Figure 3(b)) instead computes attention weights on the output of convolution with the aim of reweighting this convolution output. In the example shown in Figure 3(b), the feature maps output by convolution for s0 and s1 (layer marked "Convolution" in Figure 3(b)) have 7 and 9 columns, respectively; each column is the representation of a unit. The attention matrix A compares all units in s0 with all units of s1. We sum all attention values for a unit to derive a single attention weight for that unit. This corresponds to summing all values in a row of A for s0 ("col-wise sum", resulting in the column vector of size 7 shown) and summing all values in a column for s1 ("row-wise sum", resulting in the row vector of size 9 shown).

More formally, let A ∈ R^(s×s) be the attention matrix, a0,j = Σ A[j, :] the attention weight of unit j in s0, a1,j = Σ A[:, j] the attention weight of unit j in s1 and Fc_i,r ∈ R^(d×(si+w−1)) the output of convolution for si. Then the j-th column of the new feature map Fp_i,r generated by w-ap is derived by:

    Fp_i,r[:, j] = Σ_{k=j:j+w} ai,k · Fc_i,r[:, k],   j = 1 ... si

Note that Fp_i,r ∈ R^(d×si), i.e., ABCNN-2 pooling generates an output feature map of the same size as the input feature map of convolution. This allows us to stack multiple convolution-pooling blocks to extract features of increasing abstraction.
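
As an illustration of Equation (1) and the two attention mechanisms just defined, here is a small NumPy sketch. The function names and the plain loops are our own simplifications; in the actual model W0, W1 are learned jointly with the convolution weights, and both sentences are padded to the common length s so that A is s × s.

    import numpy as np

    def match_score(x, y):
        # 1 / (1 + Euclidean distance), the choice reported in the paper
        return 1.0 / (1.0 + np.linalg.norm(x - y))

    def attention_matrix(F0, F1):
        """F0, F1: representation feature maps (d x s).
        Returns A with A[i, j] = match-score(F0[:, i], F1[:, j])."""
        n0, n1 = F0.shape[1], F1.shape[1]
        A = np.zeros((n0, n1))
        for i in range(n0):
            for j in range(n1):
                A[i, j] = match_score(F0[:, i], F1[:, j])
        return A

    def abcnn1_attention_maps(A, W0, W1):
        """ABCNN-1: attention feature maps stacked with F0, F1 as an extra
        input channel of the convolution. W0, W1: (d x s), learned in training."""
        F0_a = W0 @ A.T     # F_{0,a} = W0 · A^T, one column per unit of s0
        F1_a = W1 @ A       # F_{1,a} = W1 · A,   one column per unit of s1
        return F0_a, F1_a

    def abcnn2_pooling(Fc, a, w):
        """ABCNN-2: attention-weighted w-ap pooling.
        Fc: convolution output (d x (s + w - 1)); a: per-column attention
        weights (row or column sums of A); returns a d x s feature map."""
        d, n = Fc.shape
        s = n - w + 1
        cols = [(Fc[:, j:j + w] * a[j:j + w]).sum(axis=1) for j in range(s)]
        return np.stack(cols, axis=1)

For the ABCNN-2 one would compute attention_matrix on the two convolution outputs, take A.sum(axis=1) as the weights for s0 and A.sum(axis=0) as the weights for s1, and pool each side with abcnn2_pooling.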
There are three main differences between the ABCNN-1 and the ABCNN-2. (i) Attention in the ABCNN-1 impacts convolution indirectly while attention in the ABCNN-2 influences pooling through direct attention weighting. (ii) The ABCNN-1 requires the two matrices Wi to convert the attention matrix into attention feature maps; and the input to convolution has two times as many feature maps. Thus, the ABCNN-1 has more parameters than the ABCNN-2 and is more vulnerable to overfitting. (iii) As pooling is performed after convolution, pooling handles larger-granularity units than convolution; e.g., if the input to convolution has word-level granularity, then the input to pooling has phrase-level granularity, the phrase size being equal to filter size w.

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
0
9
7
1
5
6
7
4
1
2

/

/
t

je

un
c
_
un
_
0
0
0
9
7
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

265

Thus, the ABCNN-1 and the ABCNN-2 implement attention mechanisms for linguistic units of different granularity. The complementarity of the ABCNN-1 and the ABCNN-2 motivates us to propose the ABCNN-3, a third architecture that combines elements of the two.

ABCNN-3 (Figure 3(c)) combines the ABCNN-1 and the ABCNN-2 by stacking them; it combines the strengths of the ABCNN-1 and -2 by allowing the attention mechanism to operate (i) both on the convolution and on the pooling parts of a convolution-pooling block and (ii) both on the input granularity and on the more abstract output granularity.

5 Experiments

We test the proposed architectures on three tasks: answer selection (AS), paraphrase identification (PI) and textual entailment (TE).

Common Training Setup. Words are initialized by 300-dimensional word2vec embeddings and not changed during training. A single randomly initialized embedding is created for all unknown words by uniform sampling from [-.01, .01]. We employ Adagrad (Duchi et al., 2011) and L2 regularization.

Network Configuration. Each network in the experiments below consists of (i) an initialization block b1 that initializes words by word2vec embeddings, (ii) a stack of k − 1 convolution-pooling blocks b2, ..., bk, computing increasingly abstract features, and (iii) one final LR layer (logistic regression layer) as shown in Figure 2.

The input to the LR layer consists of kn features – each block provides n similarity scores, e.g., n cosine similarity scores. Figure 2 shows the two sentence vectors output by the final block bk of the stack ("sentence representation 0", "sentence representation 1"); this is the basis of the last n similarity scores. As we explained in the final paragraph of Section 3, we perform all-ap pooling for all blocks, not just for bk. Thus we get one sentence representation each for s0 and s1 for each block b1, ..., bk. We compute n similarity scores for each block (based on the block's two sentence representations). Thus, we compute a total of kn similarity scores and these scores are input to the LR layer.

              AS                  PI                  TE
         #CL  lr    w  L2        lr    w  L2         lr    w  L2
ABCNN-1   1   .08   4  .0004     .08   3  .0002      .08   3  .0006
ABCNN-1   2   .085  4  .0006     .085  3  .0003      .085  3  .0006
ABCNN-2   1   .05   4  .0003     .085  3  .0001      .09   3  .00065
ABCNN-2   2   .06   4  .0006     .085  3  .0001      .085  3  .0007
ABCNN-3   1   .05   4  .0003     .05   3  .0003      .09   3  .0007
ABCNN-3   2   .06   4  .0006     .055  3  .0005      .09   3  .0007
Table 2: Hyperparameters. lr: learning rate. #CL: number of convolution layers. w: filter width. The number of convolution kernels di (i > 0) is 50 throughout.

Depending on the task, we use different methods for computing the similarity score: see below.

Layerwise Training. In our training regime, we first train a network consisting of just one convolution-pooling block b2. We then create a new network by adding a block b3, initialize its b2 block with the previously learned weights for b2 and train b3 keeping the previously learned weights for b2 fixed. We repeat this procedure until all k − 1 convolution-pooling blocks are trained. We found that this training regime gives us good performance and shortens training times considerably. Since similarity scores of lower blocks are kept unchanged once they have been learned, this also has the nice effect that "simple" similarity scores (those based on surface features) are learned first and subsequent training phases can focus on complementary scores derived from more complex abstract features.
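
A minimal sketch of this layerwise regime is given below. The helpers build_network, train_block, freeze and the block objects' get_weights/set_weights methods are hypothetical stand-ins: the paper does not prescribe an API, so this only mirrors the procedure described in the paragraph above.

    def train_layerwise(k, data, build_network, train_block, freeze):
        """Grow the ABCNN one convolution-pooling block at a time.
        build_network(num_blocks) -> model with convolution-pooling blocks model.blocks;
        train_block(model, data)  -> optimizes only the parameters that are not frozen;
        freeze(block)             -> excludes a block's weights from further updates."""
        trained = []                               # convolution-pooling blocks learned so far
        model = None
        for depth in range(1, k):                  # add b2, b3, ..., bk one at a time
            model = build_network(num_blocks=depth)
            for i, old in enumerate(trained):      # reuse and fix previously learned weights
                model.blocks[i].set_weights(old.get_weights())
                freeze(model.blocks[i])
            train_block(model, data)               # only the newest block's parameters move
            trained = list(model.blocks)
        return model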
Table 2 shows hyperparameters, tuned on dev.

Classifier. We found that performance increases if we do not use the output of the LR layer as the final decision, but instead train a linear SVM or a logistic regression with default parameters (we use scikit-learn, http://scikit-learn.org/stable/, for both) directly on the input to the LR layer (i.e., on the kn similarity scores that are generated by the k-block stack after network training is completed). Direct training of SVMs/LR seems to get closer to the global optimum than gradient descent training of CNNs.
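
A sketch of this post-hoc classification step is shown below, assuming the per-block sentence vectors have already been extracted from the trained network; block_similarities is our own helper name, and here n = 1 cosine score per block (the task-specific sections below also add Euclidean-based scores and handcrafted features to X).

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC

    def block_similarities(reps0, reps1):
        """reps0, reps1: lists of all-ap sentence vectors, one pair per block.
        Returns the k cosine similarity scores used as classifier features."""
        feats = []
        for r0, r1 in zip(reps0, reps1):
            cos = np.dot(r0, r1) / (np.linalg.norm(r0) * np.linalg.norm(r1) + 1e-8)
            feats.append(cos)
        return np.array(feats)

    def train_final_classifier(X, y, use_svm=False):
        """X: (num_pairs, k*n) similarity scores (plus optional extra features);
        y: gold labels. Default parameters, as in the paper."""
        clf = LinearSVC() if use_svm else LogisticRegression()
        clf.fit(X, y)
        return clf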

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
0
9
7
1
5
6
7
4
1
2

/

/
t

je

un
c
_
un
_
0
0
0
9
7
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

266

We use addition and LSTMs as two shared baselines for all three tasks, i.e., for AS, PI and TE. We now describe these two shared baselines. (i) Addition. We sum up word embeddings element-wise to form each sentence representation. The classifier input is then the concatenation of the two sentence representations. (ii) A-LSTM. Before this work, most attention mechanisms in NLP were implemented in recurrent neural networks for text generation tasks such as machine translation (e.g., Bahdanau et al. (2015), Luong et al. (2015)). Rocktäschel et al. (2016) present an attention-LSTM for natural language inference. Since this model is the pioneering attention-based RNN system for sentence pair classification, we consider it as a baseline system ("A-LSTM") for all our three tasks. The A-LSTM has the same configuration as our ABCNNs in terms of word initialization (300-dimensional word2vec embeddings) and the dimensionality of all hidden layers (50).

5.1 Answer Selection

We use WikiQA (http://aka.ms/WikiQA; Yang et al., 2015), an open domain question-answer dataset. We use the subtask that assumes that there is at least one correct answer for a question. The corresponding dataset consists of 20,360 question-candidate pairs in train, 1,130 pairs in dev and 2,352 pairs in test, where we adopt the standard setup of only considering questions with correct answers in test. Following Yang et al. (2015), we truncate answers to 40 tokens.

The task is to rank the candidate answers based on their relatedness to the question. Evaluation measures are mean average precision (MAP) and mean reciprocal rank (MRR).

Task-Specific Setup. We use cosine similarity as the similarity score for AS. In addition, we use sentence lengths, WordCnt (count of the number of non-stopwords in the question that also occur in the answer) and WgtWordCnt (reweight the counts by the IDF values of the question words). Thus, the final input to the LR layer has size k + 4: one cosine for each of the k blocks and the four additional features.

We compare with seven baselines. The first three are considered by Yang et al. (2015): (i) WordCnt; (ii) WgtWordCnt; (iii) CNN-Cnt (the state-of-the-art system): combine CNN with (i) and (ii). Apart from the baselines considered by Yang et al. (2015), we compare with two Addition baselines and two LSTM baselines. Addition and A-LSTM are the shared baselines described before. We also combine both with the four extra features; this gives us two additional baselines that we refer to as Addition (+) and A-LSTM (+).

method                    MAP      MRR
Baselines
  WordCnt                 0.4891   0.4924
  WgtWordCnt              0.5099   0.5132
  CNN-Cnt                 0.6520   0.6652
  Addition                0.5021   0.5069
  Addition (+)            0.5888   0.5929
  A-LSTM                  0.5347   0.5483
  A-LSTM (+)              0.6381   0.6537
BCNN      one-conv        0.6629   0.6813
          two-conv        0.6593   0.6738
ABCNN-1   one-conv        0.6810*  0.6979*
          two-conv        0.6855*  0.7023*
ABCNN-2   one-conv        0.6885*  0.7054*
          two-conv        0.6879*  0.7068*
ABCNN-3   one-conv        0.6914*  0.7127*
          two-conv        0.6921*  0.7108*
Table 3: Results on WikiQA. Best result per column is bold. Significant improvements over state-of-the-art baselines (underlined) are marked with * (t-test, p < .05).

Results. Table 3 shows performance of the baselines, of the BCNN and of the three ABCNNs. For CNNs, we test one (one-conv) and two (two-conv) convolution-pooling blocks.

The non-attention network BCNN already performs better than the baselines. If we add attention mechanisms, then the performance further improves by several points. Comparing the ABCNN-2 with the ABCNN-1, we find the ABCNN-2 is slightly better even though the ABCNN-2 is the simpler architecture. If we combine the ABCNN-1 and the ABCNN-2 to form the ABCNN-3, we get further improvement. (If we limit the input to the LR layer to the k similarity scores in the ABCNN-3 (two-conv), results are .660 (MAP) / .677 (MRR).) This can be explained by the ABCNN-3's ability to take attention of finer-grained granularity into consideration in each convolution-pooling block while the ABCNN-1 and the ABCNN-2 consider attention only at convolution input or only at pooling input, respectively. We also find that stacking two convolution-pooling blocks does not bring consistent improvement and therefore do not test deeper architectures.

5.2 Paraphrase Identification

We use the Microsoft Research Paraphrase (MSRP) corpus (Dolan et al., 2004). The training set contains 2753 true / 1323 false and the test set 1147 true / 578 false paraphrase pairs. We randomly select 400 pairs from train and use them as dev; but we still report results for training on the entire training set. For each triple (label, s0, s1) in the training set, we also add (label, s1, s0) to the training set to make best use of the training data. Systems are evaluated by accuracy and F1.

Task-Specific Setup. In this task, we add the 15 MT features from (Madnani et al., 2012) and the lengths of the two sentences. In addition, we compute ROUGE-1, ROUGE-2 and ROUGE-SU4 (Lin, 2004), which are scores measuring the match between the two sentences on (i) unigrams, (ii) bigrams and (iii) unigrams and skip-bigrams (maximum skip distance of four), respectively. In this task, we found that transforming Euclidean distance into a similarity score by 1/(1 + |x − y|) performs better than cosine similarity. Additionally, we use dynamic pooling (Yin and Schütze, 2015a) of the attention matrix A in Equation (1) and forward pooled values of all blocks to the classifier. This gives us better performance than only forwarding sentence-level matching features.

We compare our system with representative DL approaches: (i) A-LSTM; (ii) A-LSTM (+): A-LSTM plus handcrafted features; (iii) RAE (Socher et al., 2011), recursive autoencoder; (iv) Bi-CNN-MI (Yin and Schütze, 2015a), a bi-CNN architecture; and (v) MPSSM-CNN (He et al., 2015), the state-of-the-art NN system for PI; and the following four non-DL systems: (vi) Addition; (vii) Addition (+): Addition plus handcrafted features; (viii) MT (Madnani et al., 2012), a system that combines machine translation metrics (for better comparability of approaches in our experiments, we use a simple SVM classifier, which performs slightly worse than Madnani et al. (2012)'s more complex meta-classifier); (ix) MF-TF-KLD (Ji and Eisenstein, 2013), the state-of-the-art non-NN system.

Results. Table 4 shows that the BCNN is slightly worse than the state-of-the-art whereas the ABCNN-1 roughly matches it. The ABCNN-2 is slightly above the state-of-the-art. The ABCNN-3 outperforms the state-of-the-art in accuracy and F1. (The improvement of .3 (acc) and .1 (F1) over state-of-the-art is not significant. The ABCNN-3 (two-conv) without "linguistic" features, i.e., MT and ROUGE, achieves 75.1/82.7.) Two convolution layers only bring small improvements over one.

method                    acc    F1
Baselines
  majority voting         66.5   79.9
  RAE                     76.8   83.6
  Bi-CNN-MI               78.4   84.6
  MPSSM-CNN               78.6   84.7
  MT                      76.8   83.8
  MF-TF-KLD               78.6   84.6
  Addition                70.8   80.9
  Addition (+)            77.3   84.1
  A-LSTM                  69.5   80.1
  A-LSTM (+)              77.1   84.0
BCNN      one-conv        78.1   84.1
          two-conv        78.3   84.3
ABCNN-1   one-conv        78.5   84.5
          two-conv        78.5   84.6
ABCNN-2   one-conv        78.6   84.7
          two-conv        78.8   84.7
ABCNN-3   one-conv        78.8   84.8
          two-conv        78.9   84.8
Table 4: Results for PI on MSRP

5.3 Textual Entailment

SemEval 2014 Task 1 (Marelli et al., 2014a) evaluates system predictions of textual entailment (TE) relations on sentence pairs from the SICK dataset (Marelli et al., 2014b). The three classes are entailment, contradiction and neutral. The sizes of SICK train, dev and test sets are 4439, 495 and 4906 pairs, respectively. We call this dataset ORIG.

We also create NONOVER, a copy of ORIG in which words occurring in both sentences are removed. A sentence in NONOVER is denoted by a special token if all words are removed. Table 5 shows three pairs from ORIG and their transformation in NONOVER. We observe that focusing on the non-overlapping parts provides clearer hints for TE than ORIG. In this task, we run two copies of each network, one for ORIG, one for NONOVER; these two networks have a single common LR layer.

Like Lai and Hockenmaier (2014), we train our final system (after fixing hyperparameters) on train and dev (4934 pairs). Eval measure is accuracy.

Task-Specific Setup. We found that for this task forwarding two similarity scores from each block (instead of just one) is helpful. We use cosine similarity and Euclidean distance. As we did for paraphrase identification, we add the 15 MT features for each sentence pair for this task as well; our motivation is that entailed sentences resemble paraphrases more than contradictory sentences do.
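
The NONOVER transformation is simple to reproduce; the sketch below is our own reading of it (the paper does not spell out tokenization, casing or how repeated words are handled, so this type-level variant is one plausible choice).

    def to_nonover(sent0, sent1):
        """Remove every word that occurs in both sentences (type-level overlap)."""
        toks0, toks1 = sent0.lower().split(), sent1.lower().split()
        overlap = set(toks0) & set(toks1)
        return ([t for t in toks0 if t not in overlap],
                [t for t in toks1 if t not in overlap])

    # Row 0 of Table 5:
    # to_nonover("children in red shirts are playing in the leaves",
    #            "three kids are sitting in the leaves")
    # -> (['children', 'red', 'shirts', 'playing'], ['three', 'kids', 'sitting'])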

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
0
9
7
1
5
6
7
4
1
2

/

/
t

je

un
c
_
un
_
0
0
0
9
7
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

268

    ORIG                                                NONOVER
0   children in red shirts are playing in the leaves   children red shirts playing
    three kids are sitting in the leaves                three kids sitting
1   three boys are jumping in the leaves                boys
    three kids are jumping in the leaves                kids
2   a man is jumping into an empty pool                 an empty
    a man is jumping into a full pool                   a full
Table 5: SICK data: Converting the original sentences (ORIG) into the NONOVER format

We use the following linguistic features. Negation is important for detecting contradiction. Feature NEG is set to 1 if either sentence contains "no", "not", "nobody", "isn't" and to 0 otherwise. Following Lai and Hockenmaier (2014), we use WordNet (Miller, 1995) to detect nyms: synonyms, hypernyms and antonyms in the pairs. But we do this on NONOVER (not on ORIG) to focus on what is critical for TE. Specifically, feature SYN is the number of word pairs in s0 and s1 that are synonyms. HYP0 (resp. HYP1) is the number of words in s0 (resp. s1) that have a hypernym in s1 (resp. s0). In addition, we collect all potential antonym pairs (PAP) in NONOVER. We identify the matched chunks that occur in contradictory and neutral, but not in entailed pairs. We exclude synonyms and hypernyms and apply a frequency filter of n = 2. In contrast to Lai and Hockenmaier (2014), we constrain the PAP pairs to cosine similarity above 0.4 in word2vec embedding space as this discards many noise pairs. Feature ANT is the number of matched PAP antonyms in a sentence pair. As before we use sentence lengths, both for ORIG (LEN0O: length s0, LEN1O: length s1) and for NONOVER (LEN0N: length s0, LEN1N: length s1). On the whole, we have 24 extra features: 15 MT metrics, NEG, SYN, HYP0, HYP1, ANT, LEN0O, LEN1O, LEN0N and LEN1N.

Apart from the Addition and LSTM baselines, we further compare with the top-3 systems in SemEval and TrRNTN (Bowman et al., 2015b), a recursive neural network developed for this SICK task.

Results. Table 6 shows that our CNNs outperform A-LSTM (with or without linguistic features added) and the top three SemEval systems. Comparing ABCNNs with the BCNN, attention mechanisms consistently improve performance. The ABCNN-1 has performance comparable to the ABCNN-2 while the ABCNN-3 is better still: a boost of 1.6 points compared to the previous state of the art. (If we run the ABCNN-3 (two-conv) without the 24 linguistic features, performance is 84.6.)

method                                   acc
SemEval Top 3
  (Jimenez et al., 2014)                 83.1
  (Zhao et al., 2014)                    83.6
  (Lai and Hockenmaier, 2014)            84.6
TrRNTN (Bowman et al., 2015b)            76.9
Addition     no features                 73.1
             plus features               79.4
A-LSTM       no features                 78.0
             plus features               81.7
BCNN         one-conv                    84.8
             two-conv                    85.0
ABCNN-1      one-conv                    85.6
             two-conv                    85.8
ABCNN-2      one-conv                    85.7
             two-conv                    85.8
ABCNN-3      one-conv                    86.0*
             two-conv                    86.2*
Table 6: Results on SICK. Significant improvements over (Lai and Hockenmaier, 2014) are marked with * (test of equal proportions, p < .05).

Visual Analysis. Figure 4 visualizes the attention matrices for one TE sentence pair in the ABCNN-2 for blocks b1 (unigrams), b2 (first convolutional layer) and b3 (second convolutional layer). Darker shades of blue indicate stronger attention values.

In Figure 4 (top), each word corresponds to exactly one row or column. We can see that words in si with semantic equivalents in s1−i get high attention while words without semantic equivalents get low attention, e.g., "walking" and "murals" in s0 and "front" and "colorful" in s1. This behavior seems reasonable for the unigram level.

Rows/columns of the attention matrix in Figure 4 (middle) correspond to phrases of length three since filter width w = 3. High attention values generally correlate with close semantic correspondence: the phrase "people are" in s0 matches "several people are" in s1; both "are walking outside" and "walking outside the" in s0 match "are in front" in s1; "the building that" in s0 matches "a colorful building" in s1. More interestingly, looking at the bottom right corner, both "on it" and "it" in s0 match "building" in s1; this indicates that ABCNNs are able to detect some coreference across sentences.
Figure 4: Attention visualization for TE. Top: unigrams, b1. Middle: conv1, b2. Bottom: conv2, b3. (Sentence pair shown: s0 = "people are walking outside the building that has several murals on it", s1 = "several people are in front of a colorful building"; the rows and columns of the heat maps are the words and, in the convolution layers, the phrases of the two sentences.)

"building" in s1 has two places in which higher attentions appear, one is with "it" in s0, the other is with "the building that" in s0. This may indicate that ABCNNs recognize that "building" in s1 and "the building that" / "it" in s0 refer to the same object. Hence, coreference resolution across sentences as well as within a sentence both are detected. For the attention vectors on the left and the top, we can see that attention has focused on the key parts: "people are walking outside the building that" in s0, "several people are in" and "of a colorful building" in s1.

Rows/columns of the attention matrix in Figure 4 (bottom, second layer of convolution) correspond to phrases of length 5 since filter width w = 3 in both convolution layers (5 = 1 + 2 * (3 − 1)). We use "..." to denote words in the middle if a phrase like "several ... front" has more than two words. We can see that the attention distribution in the matrix has focused on some local regions. As the granularity of phrases is larger, it makes sense that the attention values are smoother. But we still can find some interesting clues: at the two ends of the main diagonal, higher attentions hint that the first part of s0 matches well with the first part of s1; "several murals on it" in s0 matches well with "of a colorful building" in s1, which satisfies the intuition that these two phrases are crucial for making a decision on TE in this case. This again shows the potential strength of our system in figuring out which parts of the two sentences refer to the same object. In addition, in the central part of the matrix, we can see that the long phrase "people are walking outside the building" in s0 matches well with the long phrase "are in front of a colorful building" in s1.

6 Summary

We presented three mechanisms to integrate attention into CNNs for general sentence pair modeling tasks. Our experiments on AS, PI and TE show that attention-based CNNs perform better than CNNs without attention mechanisms. The ABCNN-2 generally outperforms the ABCNN-1 and the ABCNN-3 surpasses both.

In all tasks, we did not find any big improvement of two layers of convolution over one layer. This is probably due to the limited size of training data. We expect that, as larger training sets become available, deep ABCNNs will show even better performance.

In addition, linguistic features contribute in all three tasks: improvements by 0.0321 (MAP) and 0.0338 (MRR) for AS, improvements by 3.8 (acc) and 2.1 (F1) for PI and an improvement by 1.6 (acc) for TE. But our ABCNNs can still reach or surpass state-of-the-art even without those features in the AS and TE tasks. This indicates that ABCNNs are generally strong NN systems.

Attention-based LSTMs are especially successful in tasks with a strong generation component like machine translation (discussed in Sec. 2). CNNs have not been used for this type of task. This is an interesting area of future work for attention-based CNNs.

Acknowledgments

We gratefully acknowledge the support of Deutsche Forschungsgemeinschaft (DFG): grant SCHU 2246/8-2. We would like to thank the anonymous reviewers for their helpful comments.

References

Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. 2015. Multiple object recognition with visual attention. In Proceedings of ICLR.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.
Matthew W. Bilotti, Paul Ogilvie, Jamie Callan, and Eric Nyberg. 2007. Structured retrieval for question answering. In Proceedings of SIGIR, pages 351–358.
William Blacoe and Mirella Lapata. 2012. A comparison of vector-based representations for semantic composition. In Proceedings of EMNLP-CoNLL, pages 546–556.
Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015a. A large annotated corpus for learning natural language inference. In Proceedings of EMNLP, pages 632–642.
Samuel R. Bowman, Christopher Potts, and Christopher D. Manning. 2015b. Recursive neural networks can learn logical semantics. In Proceedings of CVSC Workshop, pages 12–21.
Jane Bromley, James W. Bentz, Léon Bottou, Isabelle Guyon, Yann LeCun, Cliff Moore, Eduard Säckinger, and Roopak Shah. 1993. Signature verification using a "siamese" time delay neural network. IJPRAI, 7(4):669–688.
Chunshui Cao, Xianming Liu, Yi Yang, Yinan Yu, Jiang Wang, Zilei Wang, Yongzhen Huang, Liang Wang, Chang Huang, Wei Xu, Deva Ramanan, and Thomas S. Huang. 2015. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. In Proceedings of ICCV, pages 2956–2964.
Ming-Wei Chang, Dan Goldwasser, Dan Roth, and Vivek Srikumar. 2010. Discriminative learning over constrained latent representations. In Proceedings of NAACL-HLT, pages 429–437.
Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, and Ram Nevatia. 2015. ABC-CNN: An attention based convolutional neural network for visual question answering. CoRR, abs/1511.05960.
Jan Chorowski, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. End-to-end continuous speech recognition using attention-based recurrent NN: First results. In Proceedings of Deep Learning and Representation Learning Workshop, NIPS.
Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. 2015. Attention-based models for speech recognition. In Proceedings of NIPS, pages 577–585.
Bill Dolan, Chris Quirk, and Chris Brockett. 2004. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of COLING, pages 350–356.
John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121–2159.
Minwei Feng, Bing Xiang, Michael R. Glass, Lidan Wang, and Bowen Zhou. 2015. Applying deep learning to answer selection: A study and an open task. In Proceedings of IEEE ASRU Workshop, pages 813–820.
Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. 2015. DRAW: A recurrent neural network for image generation. In Proceedings of ICML, pages 1462–1471.
Hua He, Kevin Gimpel, and Jimmy Lin. 2015. Multi-perspective sentence similarity modeling with convolutional neural networks. In Proceedings of EMNLP, pages 1576–1586.
Michael Heilman and Noah A. Smith. 2010. Tree edit models for recognizing textual entailments, paraphrases, and answers to questions. In Proceedings of NAACL-HLT, pages 1011–1019.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
Seunghoon Hong, Junhyuk Oh, Honglak Lee, and Bohyung Han. 2016. Learning transferrable knowledge for semantic segmentation with deep convolutional neural network. In Proceedings of CVPR.
Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional neural network architectures for matching natural language sentences. In Proceedings of NIPS, pages 2042–2050.
Yangfeng Ji and Jacob Eisenstein. 2013. Discriminative improvements to distributional sentence similarity. In Proceedings of EMNLP, pages 891–896.
Sergio Jimenez, George Dueñas, Julia Baquero, and Alexander Gelbukh. 2014. UNAL-NLP: Combining soft cardinality features for semantic textual similarity, relatedness and entailment. In Proceedings of SemEval, pages 732–742.
Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of ACL, pages 655–665.
Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of EMNLP, pages 1746–1751.
Alice Lai and Julia Hockenmaier. 2014. Illinois-LH: A denotational and distributional approach to semantics. In Proceedings of SemEval, pages 329–334.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324.
Jiwei Li, Minh-Thang Luong, and Dan Jurafsky. 2015. A hierarchical neural autoencoder for paragraphs and documents. In Proceedings of ACL, pages 1106–1115.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the ACL Text Summarization Workshop.
Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of EMNLP, pages 1412–1421.
Nitin Madnani, Joel Tetreault, and Martin Chodorow. 2012. Re-examining machine translation metrics for paraphrase identification. In Proceedings of NAACL-HLT, pages 182–190.
Marco Marelli, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto Zamparelli. 2014a. SemEval-2014 Task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In Proceedings of SemEval, pages 1–8.
Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014b. A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of LREC, pages 216–223.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS, pages 3111–3119.
George A. Miller. 1995. WordNet: A lexical database for English. Commun. ACM, 38(11):39–41.
Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. 2014. Recurrent models of visual attention. In Proceedings of NIPS, pages 2204–2212.
Dan Moldovan, Christine Clark, Sanda Harabagiu, and Daniel Hodges. 2007. Cogex: A semantically and contextually enriched logic prover for question answering. Journal of Applied Logic, 5(1):49–69.
Vasin Punyakanok, Dan Roth, and Wen-tau Yih. 2004. Mapping dependencies trees: An application to question answering. In Proceedings of AI&Math 2004 (Special session: Intelligent Text Processing).
Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. 2016. Reasoning about entailment with neural attention. In Proceedings of ICLR.
Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of EMNLP, pages 379–389.
Dan Shen and Mirella Lapata. 2007. Using semantic roles to improve question answering. In Proceedings of EMNLP-CoNLL, pages 12–21.
Richard Socher, Eric H. Huang, Jeffrey Pennington, Andrew Y. Ng, and Christopher D. Manning. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Proceedings of NIPS, pages 801–809.
Ming Tan, Bing Xiang, and Bowen Zhou. 2016. LSTM-based deep learning models for non-factoid answer selection. In Proceedings of ICLR Workshop.
Shengxian Wan, Yanyan Lan, Jiafeng Guo, Jun Xu, Liang Pang, and Xueqi Cheng. 2016. A deep architecture for semantic matching with multiple positional sentence representations. In Proceedings of AAAI, pages 2835–2841.
Mengqiu Wang, Noah A. Smith, and Teruko Mitamura. 2007. What is the jeopardy model? A quasi-synchronous grammar for QA. In Proceedings of EMNLP-CoNLL, pages 22–32.
Tianjun Xiao, Yichong Xu, Kuiyuan Yang, Jiaxing Zhang, Yuxin Peng, and Zheng Zhang. 2015. The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In Proceedings of CVPR, pages 842–850.
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of ICML, pages 2048–2057.
Yi Yang, Wen-tau Yih, and Christopher Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of EMNLP, pages 2013–2018.
Xuchen Yao, Benjamin Van Durme, Chris Callison-Burch, and Peter Clark. 2013a. Semi-markov phrase-based monolingual alignment. In Proceedings of EMNLP, pages 590–600.
Xuchen Yao, Benjamin Van Durme, and Peter Clark. 2013b. Automatic coupling of answer extraction and information retrieval. In Proceedings of ACL, pages 159–165.
Wen-tau Yih, Ming-Wei Chang, Christopher Meek, and Andrzej Pastusiak. 2013. Question answering using enhanced lexical semantic models. In Proceedings of ACL, pages 1744–1753.
Wenpeng Yin and Hinrich Schütze. 2015a. Convolutional neural network for paraphrase identification. In Proceedings of NAACL-HLT, pages 901–911.
Wenpeng Yin and Hinrich Schütze. 2015b. MultiGranCNN: An architecture for general matching of text chunks on multiple levels of granularity. In Proceedings of ACL-IJCNLP, pages 63–73.
Lei Yu, Karl Moritz Hermann, Phil Blunsom, and Stephen Pulman. 2014. Deep learning for answer sentence selection. In Proceedings of Deep Learning and Representation Learning Workshop, NIPS.
Jiang Zhao, Tiantian Zhu, and Man Lan. 2014. ECNU: One stone two birds: Ensemble of heterogenous measures for semantic relatedness and textual entailment. In Proceedings of SemEval, pages 271–277.