Transactions of the Association for Computational Linguistics, 2 (2014) 207–218. Action Editor: Alexander Clark.
Submitted 10/2013; Revised 3/2014; Published 4/2014. © 2014 Association for Computational Linguistics.

Grounded Compositional Semantics for Finding and Describing Images with Sentences

Richard Socher, Andrej Karpathy, Quoc V. Le*, Christopher D. Manning, Andrew Y. Ng
Stanford University, Computer Science Department, *Google Inc.
richard@socher.org, karpathy@cs.stanford.edu, qvl@google.com, manning@stanford.edu, ang@cs.stanford.edu

Abstract

Previous work on Recursive Neural Networks (RNNs) shows that these models can produce compositional feature vectors for accurately representing and classifying sentences or images. However, the sentence vectors of previous models cannot accurately represent visually grounded meaning. We introduce the DT-RNN model which uses dependency trees to embed sentences into a vector space in order to retrieve images that are described by those sentences. Unlike previous RNN-based models which use constituency trees, DT-RNNs naturally focus on the action and agents in a sentence. They are better able to abstract from the details of word order and syntactic expression. DT-RNNs outperform other recursive and recurrent neural networks, kernelized CCA and a bag-of-words baseline on the tasks of finding an image that fits a sentence description and vice versa. They also give more similar representations to sentences that describe the same image.

1 Introduction

Single word vector spaces are widely used (Turney and Pantel, 2010) and successful at classifying single words and capturing their meaning (Collobert and Weston, 2008; Huang et al., 2012; Mikolov et al., 2013). Since words rarely appear in isolation, the task of learning compositional meaning representations for longer phrases has recently received a lot of attention (Mitchell and Lapata, 2010; Socher et al., 2010; Socher et al., 2012; Grefenstette et al., 2013). Similarly, classifying whole images into a fixed set of classes also achieves very high performance (Le et al., 2012; Krizhevsky et al., 2012). However, similar to words, objects in images are often seen in relationships with other objects which are not adequately described by a single label.

In this work, we introduce a model, illustrated in Fig. 1, which learns to map sentences and images into a common embedding space in order to be able to retrieve one from the other. We assume word and image representations are first learned in their respective single modalities but finally mapped into a jointly learned multimodal embedding space.

Our model for mapping sentences into this space is based on ideas from Recursive Neural Networks (RNNs) (Pollack, 1990; Costa et al., 2003; Socher et al., 2011b). However, unlike all previous RNN models which are based on constituency trees (CT-RNNs), our model computes compositional vector representations inside dependency trees. The compositional vectors computed by this new dependency tree RNN (DT-RNN) capture more of the meaning of sentences, where we define meaning in terms of similarity to a "visual representation" of the textual description. DT-RNN induced vector representations of sentences are more robust to changes in the syntactic structure or word order than related models such as CT-RNNs or Recurrent Neural Networks since they naturally focus on a sentence's action and its agents.

We evaluate and compare DT-RNN induced representations on their ability to use a sentence such as "A man wearing a helmet jumps on his bike near a beach." to find images that show such a scene. The goal is to learn sentence representations that capture the visual scene described and to find appropriate images in the learned, multi-modal sentence-image space.


[Figure 1: The DT-RNN learns vector representations for sentences based on their dependency trees. We learn to map the outputs of convolutional neural networks applied to images into the same space and can then compare both sentences and images. This allows us to query images with a sentence and give sentence descriptions to images. The illustration shows compositional sentence vectors for sentences such as "A man wearing a helmet jumps on his bike near a beach.", "Two airplanes parked in an airport.", "A man jumping his downhill bike." and "A small child sits on a cement wall near white flower.", image vector representations, and the joint multi-modal representations.]

Conversely, when given a query image, we would like to find a description that goes beyond a single label by providing a correct sentence describing it, a task that has recently garnered a lot of attention (Farhadi et al., 2010; Ordonez et al., 2011; Kuznetsova et al., 2012). We use the dataset introduced by Rashtchian et al. (2010) which consists of 1000 images, each with 5 descriptions. On all tasks, our model outperforms baselines and related models.

2 Related Work

The presented model is connected to several areas of NLP and vision research, each with a large amount of related work to which we can only do some justice given space constraints.

Semantic Vector Spaces and Their Compositionality. The dominant approach in semantic vector spaces uses distributional similarities of single words. Often, co-occurrence statistics of a word and its context are used to describe each word (Turney and Pantel, 2010; Baroni and Lenci, 2010), such as tf-idf. Most of the compositionality algorithms and related datasets capture two-word compositions. For instance, Mitchell and Lapata (2010) use two-word phrases and analyze similarities computed by vector addition, multiplication and others. Compositionality is an active field of research with many different models and representations being explored (Grefenstette et al., 2013), among many others. We compare to supervised compositional models that can learn task-specific vector representations such as constituency tree recursive neural networks (Socher et al., 2011b; Socher et al., 2011a), chain structured recurrent neural networks and other baselines. Another alternative would be to use CCG trees as a backbone for vector composition (K. M. Hermann, 2013).

Multimodal Embeddings. Multimodal embedding methods project data from multiple sources such as sound and video (Ngiam et al., 2011) or images and text. Socher and Fei-Fei (2010) project words and image regions into a common space using kernelized canonical correlation analysis to obtain state of the art performance in annotation and segmentation. Similar to our work, they use unsupervised large text corpora to learn semantic word representations. Among other recent work is that by Srivastava and Salakhutdinov (2012) who developed multimodal Deep Boltzmann Machines. Similar to their work, we use techniques from the broad field of deep learning to represent images and words. Recently, single word vector embeddings have been used for zero-shot learning (Socher et al., 2013c). Mapping images to word vectors enabled their system to classify images as depicting objects such as "cat" without seeing any examples of this class. Related work has also been presented at NIPS (Socher et al., 2013b; Frome et al., 2013). This work moves zero-shot learning beyond single categories per image and extends it to unseen phrases and full length sentences, making use of similar ideas of semantic spaces grounded in visual knowledge.


Detailed Image Annotation. Interaction between images and text is a growing research field. Early work in this area includes generating single words or fixed phrases from images (Duygulu et al., 2002; Barnard et al., 2003) or using contextual information to improve recognition (Gupta and Davis, 2008; Torralba et al., 2010). Apart from a large body of work on single object image classification (Le et al., 2012), there is also work on attribute classification and other mid-level elements (Kumar et al., 2009), some of which we hope to capture with our approach as well.

Our work is close in spirit with recent work in describing images with more detailed, longer textual descriptions. In particular, Yao et al. (2010) describe images using hierarchical knowledge and humans in the loop. In contrast, our work does not require human interactions. Farhadi et al. (2010) and Kulkarni et al. (2011), on the other hand, use a more automatic method to parse images. For instance, the former approach uses a single triple of objects estimated for an image to retrieve sentences from a collection written to describe similar images. It forms representations to describe 1 object, 1 action, and 1 scene. Kulkarni et al. (2011) extend their method to describe an image with multiple objects. None of these approaches have used a compositional sentence vector representation and they require specific language generation techniques and sophisticated inference methods. Since our model is based on neural networks, inference is fast and simple. Kuznetsova et al. (2012) use a very large parallel corpus to connect images and sentences. Feng and Lapata (2013) use a large dataset of captioned images and experiment with both extractive (search) and abstractive (generation) models.

Most related is the very recent work of Hodosh et al. (2013). They too evaluate using a ranking measure. In our experiments, we compare to kernelized Canonical Correlation Analysis which is the main technique in their experiments.

3 Dependency-Tree Recursive Neural Networks

In this section we first focus on the DT-RNN model that computes compositional vector representations for phrases and sentences of variable length and syntactic type. In section 5 the resulting vectors will then become multimodal features by mapping images that show what the sentence describes to the same space and learning both the image and sentence mapping jointly.

The most common way of building representations for longer phrases from single word vectors is to simply linearly average the word vectors. While this bag-of-words approach can yield reasonable performance in some tasks, it gives all the words the same weight and cannot distinguish important differences in simple visual descriptions such as The bike crashed into the standing car. vs. The car crashed into the standing bike.

RNN models (Pollack, 1990; Goller and Küchler, 1996; Socher et al., 2011b; Socher et al., 2011a) provided a novel way of combining word vectors for longer phrases that moved beyond simple averaging. They combine vectors with an RNN in binary constituency trees which have potentially many hidden layers. While the induced vector representations work very well on many tasks, they also inevitably capture a lot of syntactic structure of the sentence. However, the task of finding images from sentence descriptions requires us to be more invariant to syntactic differences. One such example is active-passive constructions, which can collapse words such as "by" in some formalisms (de Marneffe et al., 2006), relying instead on the semantic relationship of "agent". For instance, The mother hugged her child. and The child was hugged by its mother. should map to roughly the same visual space. Current Recursive and Recurrent Neural Networks do not exhibit this behavior and even bag of words representations would be influenced by the words was and by. The model we describe below focuses more on recognizing actions and agents and has the potential to learn representations that are invariant to active-passive differences.

3.1 DT-RNN Inputs: Word Vectors and Dependency Trees

In order for the DT-RNN to compute a vector representation for an ordered list of m words (a phrase or sentence), we map the single words to a vector space and then parse the sentence.


[Figure 2: Example of a full dependency tree for a longer sentence, "A man wearing a helmet jumps on his bike near a beach" (arc labels: det, nsubj, partmod, det, dobj, root, prep, poss, pobj, prep, det, pobj). The DT-RNN will compute vector representations at every word that represent that word and an arbitrary number of child nodes. The final representation is computed at the root node, here at the verb jumps. Note that more important activity and object words are higher up in this tree structure.]

First, we map each word to a d-dimensional vector. We initialize these word vectors with the unsupervised model of Huang et al. (2012) which can learn single word vector representations from both local and global contexts. The idea is to construct a neural network that outputs high scores for windows and documents that occur in a large unlabeled corpus and low scores for window-document pairs where one word is replaced by a random word. When such a network is optimized via gradient descent the derivatives backpropagate into a word embedding matrix A which stores word vectors as columns. In order to predict correct scores the vectors in the matrix capture co-occurrence statistics. We use d = 50 in all our experiments. The embedding matrix X is then used by finding the column index i of each word: [w] = i and retrieving the corresponding column x_w from X. Henceforth, we represent an input sentence s as an ordered list of (word, vector) pairs: s = ((w_1, x_{w1}), ..., (w_m, x_{wm})).

Next, the sequence of words (w_1, ..., w_m) is parsed by the dependency parser of de Marneffe et al. (2006). Fig. 2 shows an example. We can represent a dependency tree d of a sentence s as an ordered list of (child, parent) indices: d(s) = {(i, j)}, where every child word in the sequence i = 1, ..., m is present and has any word j ∈ {1, ..., m} ∪ {0} as its parent. The root word has as its parent 0 and we notice that the same word can be a parent between zero and m number of times. Without loss of generality, we assume that these indices form a tree structure. To summarize, the input to the DT-RNN for each sentence is the pair (s, d): the words and their vectors and the dependency tree.

3.2 Forward Propagation in DT-RNNs

Given these two inputs, we now illustrate how the DT-RNN computes parent vectors. We will use the following sentence as a running example: Students_1 ride_2 bikes_3 at_4 night_5. Fig. 3 shows its tree and computed vector representations.

[Figure 3: Example of a DT-RNN tree structure for computing a sentence representation in a bottom up fashion (word vectors x_1, ..., x_5 and hidden vectors h_1, ..., h_5 for "Students ride bikes at night").]

The dependency tree for this sentence can be summarized by the following set of (child, parent) edges: d = {(1, 2), (2, 0), (3, 2), (4, 2), (5, 4)}.

The DT-RNN model will compute parent vectors at each word that include all the dependent (children) nodes in a bottom up fashion using a compositionality function g_θ which is parameterized by all the model parameters θ. To this end, the algorithm searches for nodes in a tree that have either (i) no children or (ii) whose children have already been computed and then computes the corresponding vector.

In our example, the words x_1, x_3, x_5 are leaf nodes and hence, we can compute their corresponding hidden nodes via:

    h_c = g_θ(x_c) = f(W_v x_c)   for c = 1, 3, 5,   (1)

where we compute the hidden vector at position c via our general composition function g_θ. In the case of leaf nodes, this composition function becomes simply a linear layer, parameterized by W_v ∈ R^{n×d}, followed by a nonlinearity. We cross-validate over using no nonlinearity (f = id), tanh, sigmoid or rectified linear units (f = max(0, x)), but generally find tanh to perform best.
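To make the input representation and the leaf-node case of Eq. 1 concrete, here is a minimal Python sketch. It is not the authors' code: the hidden size n, the random stand-in word vectors and the initialization scale are illustrative assumptions.

```python
import numpy as np

# Illustrative dimensions: d = 50 word vectors as in the paper; n is the hidden size (assumed 50 here).
d, n = 50, 50
rng = np.random.default_rng(0)

# Running example "Students ride bikes at night", indexed 1..5 as in the paper.
words = ["Students", "ride", "bikes", "at", "night"]
x = {i + 1: rng.normal(scale=0.1, size=d) for i in range(len(words))}  # stand-in word vectors

# Dependency tree d(s) as (child, parent) pairs; parent 0 marks the root word ("ride").
dep_edges = [(1, 2), (2, 0), (3, 2), (4, 2), (5, 4)]

# Leaf-node composition (Eq. 1): h_c = f(W_v x_c), with f = tanh.
W_v = rng.normal(scale=0.01, size=(n, d))
parents = {p for _, p in dep_edges}
leaves = [c for c, _ in dep_edges if c not in parents]   # here: words 1, 3 and 5
h = {c: np.tanh(W_v @ x[c]) for c in leaves}
print(sorted(h))  # [1, 3, 5]
```

In the actual model the word vectors come from the Huang et al. (2012) embeddings and the edge list from the Stanford dependency parser; the random vectors above are only placeholders.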


The final sentence representation we want to compute is at h_2; however, since we still do not have h_4, we compute that one next:

    h_4 = g_θ(x_4, h_5) = f(W_v x_4 + W_{r1} h_5),   (2)

where we use the same W_v as before to map the word vector into hidden space but we now also have a linear layer that takes as input h_5, the only child of the fourth node. The matrix W_{r1} ∈ R^{n×n} is used because node 5 is the first child node on the right side of node 4. Generally, we have multiple matrices for composing with hidden child vectors from the right and left sides: W_{r·} = (W_{r1}, ..., W_{r k_r}) and W_{l·} = (W_{l1}, ..., W_{l k_l}). The number of needed matrices is determined by the data by simply finding the maximum numbers of left k_l and right k_r children any node has. If at test time a child appeared at an even larger distance (this does not happen in our test set), the corresponding matrix would be the identity matrix.

Now that all children of h_2 have their hidden vectors, we can compute the final sentence representation via:

    h_2 = g_θ(x_2, h_1, h_3, h_4) = f(W_v x_2 + W_{l1} h_1 + W_{r1} h_3 + W_{r2} h_4).   (3)

Notice that the children are multiplied by matrices that depend on their location relative to the current node.

Another modification that improves the mean rank by approximately 6 in image search on the dev set is to weight nodes by the number of words underneath them and normalize by the sum of words under all children. This encourages the intuitive desideratum that nodes describing longer phrases are more important. Let ℓ(i) be the number of leaf nodes (words) under node i and C(i, y) be the set of child nodes of node i in dependency tree y. The final composition function for a node vector h_i becomes:

    h_i = f( (1 / ℓ(i)) ( W_v x_i + Σ_{j ∈ C(i)} ℓ(j) W_{pos(i,j)} h_j ) ),   (4)

where by definition ℓ(i) = 1 + Σ_{j ∈ C(i)} ℓ(j) and pos(i, j) is the relative position of child j with respect to node i, e.g. l1 or r2 in Eq. 3.

3.3 Semantic Dependency Tree RNNs

An alternative is to condition the weight matrices on the semantic relations given by the dependency parser. We use the collapsed tree formalism of the Stanford dependency parser (de Marneffe et al., 2006). With such a semantic untying of the weights, the DT-RNN makes better use of the dependency formalism and could give active-passive reversals similar semantic vector representations. The equation for this semantic DT-RNN (SDT-RNN) is the same as the one above except that the matrices W_{pos(i,j)} are replaced with matrices based on the dependency relationship. There are a total of 141 unique such relationships in the dataset. However, most are very rare. For examples of semantic relationships, see Fig. 2 and the model analysis section 6.7.

This forward propagation can be used for computing compositional vectors and in Sec. 5 we will explain the objective function in which these are trained.
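Continuing the sketch above, the following hedged Python example runs the full bottom-up forward pass of Eqs. 1-4 on the running sentence, selecting a positionally untied matrix W_{pos(i,j)} per child and applying the leaf-count weighting of Eq. 4. It is illustrative only: the helper names, initialization and the block-identity-plus-noise matrices are assumptions, not the authors' implementation.

```python
import numpy as np

def dtrnn_forward(x, dep_edges, W_v, W_pos, f=np.tanh):
    """Bottom-up DT-RNN forward pass (Eq. 4).

    x         : dict node -> word vector of shape (d,)
    dep_edges : list of (child, parent) pairs, parent 0 marks the root
    W_v       : (n, d) word-to-hidden matrix
    W_pos     : dict such as {"l1": ..., "r1": ..., "r2": ...} of (n, n) matrices
    Returns the hidden vector at the root node (the sentence representation).
    """
    children = {i: sorted(c for c, p in dep_edges if p == i) for i in x}
    leaf_count = {}

    def ell(i):  # number of words underneath node i: ell(i) = 1 + sum_j ell(j)
        if i not in leaf_count:
            leaf_count[i] = 1 + sum(ell(j) for j in children[i])
        return leaf_count[i]

    def pos(i, j):  # relative position of child j w.r.t. node i, e.g. "l1" or "r2"
        side = "l" if j < i else "r"
        same_side = [c for c in children[i] if (c < i) == (j < i)]
        rank = sorted(same_side, key=lambda c: abs(c - i)).index(j) + 1
        return f"{side}{rank}"

    h = {}
    def compute(i):  # compute all children first, then node i
        total = W_v @ x[i]
        for j in children[i]:
            compute(j)
            total = total + ell(j) * (W_pos[pos(i, j)] @ h[j])
        h[i] = f(total / ell(i))

    root = next(c for c, p in dep_edges if p == 0)
    compute(root)
    return h[root]

# Tiny usage example on "Students ride bikes at night".
d = n = 50
rng = np.random.default_rng(0)
x = {i: rng.normal(scale=0.1, size=d) for i in range(1, 6)}
dep_edges = [(1, 2), (2, 0), (3, 2), (4, 2), (5, 4)]
W_v = rng.normal(scale=0.01, size=(n, d))
W_pos = {k: np.eye(n) + rng.normal(scale=0.01, size=(n, n)) for k in ["l1", "r1", "r2"]}
sentence_vector = dtrnn_forward(x, dep_edges, W_v, W_pos)
print(sentence_vector.shape)  # (50,)
```

For leaves the weighted sum reduces to Eq. 1, and the root computation reproduces Eq. 3 scaled by 1/ℓ(i) as in Eq. 4. For the SDT-RNN of Sec. 3.3, W_pos would instead be keyed by the dependency relation returned by the parser (e.g. nsubj, dobj) rather than by the child's position.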


3.4 Comparison to Previous RNN Models

The DT-RNN has several important differences to previous RNN models of Socher et al. (2011a) and (Socher et al., 2011b; Socher et al., 2011c). These constituency tree RNNs (CT-RNNs) use the following composition function to compute a hidden parent vector h from exactly two child vectors (c_1, c_2) in a binary tree: h = f(W [c_1; c_2]), where W ∈ R^{d×2d} is the main parameter to learn. This can be rewritten to show the similarity to the DT-RNN as h = f(W_{l1} c_1 + W_{r1} c_2). However, there are several important differences.

Note first that in previous RNN models the parent vectors were of the same dimensionality to be recursively compatible and be used as input to the next composition. In contrast, our new model first maps single words into a hidden space and then parent nodes are composed from these hidden vectors. This allows a higher capacity representation which is especially helpful for nodes that have many children.

Secondly, the DT-RNN allows for n-ary nodes in the tree. This is an improvement that is possible even for constituency tree CT-RNNs but it has not been explored in previous models.

Third, due to computing parent nodes in constituency trees, previous models had the problem that words that are merged last in the tree have a larger weight or importance in the final sentence representation. This can be problematic since these are often simple non-content words, such as a leading 'But,'. While such single words can be important for tasks such as sentiment analysis, we argue that for describing visual scenes the DT-RNN captures the more important effects: the dependency tree structures push the central content words such as the main action or verb and its subject and object to be merged last and hence, by construction, the final sentence representation is more robust to less important adjectival modifiers, word order changes, etc.

Fourth, we allow some untying of weights depending on either how far away a constituent is from the current word or what its semantic relationship is.

Now that we can compute compositional vector representations for sentences, the next section describes how we represent images.

4 Learning Image Representations with Neural Networks

The image features that we use in our experiments are extracted from a deep neural network, replicated from the one described in Le et al. (2012). The network was trained using both unlabeled data (random web images) and labeled data to classify 22,000 categories in ImageNet (Deng et al., 2009). We then used the features at the last layer, before the classifier, as the feature representation in our experiments. The dimension of the feature vector of the last layer is 4,096. The details of the model and its training procedures are as follows.

[Figure 4: The architecture of the visual model. This model has 3 sequences of filtering, pooling and local contrast normalization layers. The learnable parameters are the filtering layers. The filters are not shared, i.e., the network is nonconvolutional.]

The architecture of the network can be seen in Figure 4. The network takes 200x200 pixel images as inputs and has 9 layers. The layers consist of three sequences of filtering, pooling and local contrast normalization (Jarrett et al., 2009). The pooling function is L2 pooling of the previous layer (taking the square of the filtering units, summing them up in a small area in the image, and taking the square root). The local contrast normalization takes inputs in a small area of the lower layer, subtracts the mean and divides by the standard deviation.

The network was first trained using an unsupervised objective: trying to reconstruct the input while keeping the neurons sparse. In this phase, the network was trained on 20 million images randomly sampled from the web. We resized a given image so that its short dimension has 200 pixels. We then cropped a fixed size 200x200 pixel image right at the center of the resized image. This means we may discard a fraction of the long dimension of the image.

After unsupervised training, we used ImageNet (Deng et al., 2009) to adjust the features in the entire network. The ImageNet dataset has 22,000 categories and 14 million images. The number of images in each category is equal across categories. The 22,000 categories are extracted from WordNet.

To speed up the supervised training of this network, we made a simple modification to the algorithm described in Le et al. (2012): adding a "bottleneck" layer in between the last layer and the classifier to reduce the number of connections. We added one "bottleneck" layer which has 4,096 units in between the last layer of the network and the softmax layer. This newly-added layer is fully connected to the previous layer and has a linear activation function. The total number of connections of this network is approximately 1.36 billion.
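As a rough, hedged illustration of the L2 pooling and local contrast normalization stages described above (a simplified sketch on a single 2-D feature map; the window size and epsilon are assumed values, not taken from the paper):

```python
import numpy as np

def l2_pool(feature_map, size=3):
    """L2 pooling: square the units, sum them in a small area, take the square root."""
    H, W = feature_map.shape
    out = np.zeros((H - size + 1, W - size + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = feature_map[i:i + size, j:j + size]
            out[i, j] = np.sqrt(np.sum(window ** 2))
    return out

def local_contrast_norm(feature_map, size=3, eps=1e-5):
    """Local contrast normalization: subtract the local mean, divide by the local std."""
    H, W = feature_map.shape
    out = np.zeros_like(feature_map)
    for i in range(H):
        for j in range(W):
            i0, i1 = max(0, i - size // 2), min(H, i + size // 2 + 1)
            j0, j1 = max(0, j - size // 2), min(W, j + size // 2 + 1)
            window = feature_map[i0:i1, j0:j1]
            out[i, j] = (feature_map[i, j] - window.mean()) / (window.std() + eps)
    return out

fmap = np.random.default_rng(0).normal(size=(8, 8))
print(local_contrast_norm(l2_pool(fmap)).shape)  # (6, 6)
```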


The network was trained again using the supervised objective of classifying the 22,000 classes in ImageNet. Most features in the networks are local, which allows model parallelism. Data parallelism by asynchronous SGD was also employed as in Le et al. (2012). The entire training, both unsupervised and supervised, took 8 days on a large cluster of machines. This network achieves 18.3% precision@1 on the full ImageNet dataset (Release Fall 2011).

We will use the features at the bottleneck layer as the feature vector z of an image. Each scaled and cropped image is presented to our network. The network then performs a feedforward computation to compute the values of the bottleneck layer. This means that every image is represented by a fixed length vector of 4,096 dimensions. Note that during training, no aligned sentence-image data was used and the ImageNet classes do not fully intersect with the words used in our dataset.

5 Multimodal Mappings

The previous two sections described how we can map sentences into a d = 50-dimensional space and how to extract high quality image feature vectors of 4096 dimensions. We now define our final multimodal objective function for learning joint image-sentence representations with these models. Our training set consists of N images and their feature vectors z_i and each image has 5 sentence descriptions s_{i1}, ..., s_{i5} for which we use the DT-RNN to compute vector representations. See Fig. 5 for examples from the dataset.

For training, we use a max-margin objective function which intuitively trains pairs of correct image and sentence vectors to have high inner products and incorrect pairs to have low inner products. Let v_i = W_I z_i be the mapped image vector and y_{ij} = DTRNN_θ(s_{ij}) the composed sentence vector. We define S to be the set of all sentence indices and S(i) the set of sentence indices corresponding to image i. Similarly, I is the set of all image indices and I(j) is the image index of sentence j. The set P is the set of all correct image-sentence training pairs (i, j). The ranking cost function to minimize is then:

    J(W_I, θ) = Σ_{(i,j) ∈ P} Σ_{c ∈ S \ S(i)} max(0, Δ − v_i^T y_j + v_i^T y_c)
              + Σ_{(i,j) ∈ P} Σ_{c ∈ I \ I(j)} max(0, Δ − v_i^T y_j + v_c^T y_j),   (5)

where θ are the language composition matrices, and both second sums are over other sentences coming from different images and vice versa. The hyperparameter Δ is the margin. The margin is found via cross validation on the dev set and is usually around 1.

The final objective also includes the regularization term λ (||θ||_2^2 + ||W_I||_F). Both the visual model and the word vector learning require a very large amount of training data and both have a huge number of parameters. Hence, to prevent overfitting, we assume their weights are fixed and only train the DT-RNN parameters and W_I. If larger training corpora become available in the future, training both jointly becomes feasible and would present a very promising direction. We use a modified version of AdaGrad (Duchi et al., 2011) for optimization of both W_I and the DT-RNN as well as the other baselines (except kCCA). AdaGrad has achieved good performance previously in neural network models (Dean et al., 2012; Socher et al., 2013a). We modify it by resetting all squared gradient sums to 1 every 5 epochs. With both images and sentences in the same multimodal space, we can easily query the model for similar images or sentences by finding the nearest neighbors in terms of negative inner products.

An alternative objective function is based on the squared loss J(W_I, θ) = Σ_{(i,j) ∈ P} ||v_i − y_j||_2^2. This requires an alternating minimization scheme that first trains only W_I, then fixes W_I and trains the DT-RNN weights θ and then repeats this several times. We find that the performance with this objective function (paired with finding similar images using Euclidean distances) is worse for all models than the margin loss of Eq. 5. In addition kCCA also performs much better using inner products in the multimodal space.
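Below is a small, hedged numpy sketch of the ranking loss in Eq. 5. It is illustrative only: variable names and the toy data are assumptions, and in the actual model this loss would be backpropagated into W_I and the composition parameters θ.

```python
import numpy as np

def ranking_loss(V, Y, img_of_sentence, delta=1.0):
    """Max-margin ranking loss of Eq. 5.

    V               : (N, n) mapped image vectors, v_i = W_I z_i
    Y               : (M, n) composed sentence vectors, y_j = DTRNN_theta(s_j)
    img_of_sentence : length-M array, img_of_sentence[j] = index of the image sentence j describes
    delta           : margin (found via cross-validation in the paper, usually around 1)
    """
    scores = V @ Y.T                                # (N, M) inner products v_i^T y_j
    loss = 0.0
    for j, i in enumerate(img_of_sentence):         # every correct pair (i, j) in P
        correct = scores[i, j]
        # first sum of Eq. 5: rank the correct sentence above sentences of other images
        for c in range(Y.shape[0]):
            if img_of_sentence[c] != i:
                loss += max(0.0, delta - correct + scores[i, c])
        # second sum of Eq. 5: rank the correct image above all other images for this sentence
        for c in range(V.shape[0]):
            if c != i:
                loss += max(0.0, delta - correct + scores[c, j])
    return loss

# Toy usage: 3 images, 5 sentences per image, a 50-dimensional multimodal space.
rng = np.random.default_rng(0)
V = rng.normal(size=(3, 50))
Y = rng.normal(size=(15, 50))
img_of_sentence = np.repeat(np.arange(3), 5)
print(ranking_loss(V, Y, img_of_sentence, delta=3.0))
```

The experiments below report that the best model uses a margin of Δ = 3 together with the modified AdaGrad that resets the accumulated squared gradients every 5 epochs.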


6 Experiments

We use the dataset of Rashtchian et al. (2010) which consists of 1000 images, each with 5 sentences. See Fig. 5 for examples. We evaluate and compare the DT-RNN in three different experiments. First, we analyze how well the sentence vectors capture similarity in visual meaning. Then we analyze Image Search with Query Sentences: to query each model with a sentence in order to find an image showing that sentence's visual 'meaning.' The last experiment Describing Images by Finding Suitable Sentences does the reverse search where we query the model with an image and try to find the closest textual description in the embedding space.

[Figure 5: Examples from the dataset of images and their sentence descriptions (Rashtchian et al., 2010). Sentence length varies greatly and different objects can be mentioned first. Hence, models have to be invariant to word ordering. One example image is described by five sentences ranging from "A woman and her dog watch the cameraman in their living with wooden floors." to "White and black small dog walks toward the camera while woman sits on couch, desk and computer seen in the background as well as a pillow, teddy bear and moggie toy on the wood floor."; another by five sentences such as "Two men next to a red sports car in a parking lot." and "Two men stand beside a red sports car."]

In our comparison to other methods we focus on those models that can also compute fixed, continuous vectors for sentences. In particular, we compare to the RNN model on constituency trees of Socher et al. (2011a), a standard recurrent neural network, and a simple bag-of-words baseline which averages the words. All models use the word vectors provided by Huang et al. (2012) and do not update them as discussed above. Models are trained with their corresponding gradients and backpropagation techniques. A standard recurrent model is used where the hidden vector at word index t is computed from the hidden vector at the previous time step and the current word vector: h_t = f(W_h h_{t−1} + W_x x_t). During training, we take the last hidden vector of the sentence chain and propagate the error into that. It is also this vector that is used to represent the sentence.

Other possible comparisons are to the very different models mentioned in the related work section. These models use a lot more task-specific engineering, such as running object detectors with bounding boxes, attribute classifiers, scene classifiers, CRFs for composing the sentences, etc. Another line of work uses large sentence-image aligned resources (Kuznetsova et al., 2012), whereas we focus on easily obtainable training data of each modality separately and a rather small multimodal corpus.

In our experiments we split the data into 800 training, 100 development and 100 test images. Since there are 5 sentences describing each image, we have 4000 training sentences and 500 testing sentences. The dataset has 3020 unique words, half of which only appear once. Hence, the unsupervised, pre-trained semantic word vector representations are crucial. Word vectors are not fine-tuned during training. Hence, the main parameters are the DT-RNN's W_{l·}, W_{r·} or the semantic matrices of which there are 141 and the image mapping W_I. For both DT-RNNs the weight matrices are initialized to block identity matrices plus Gaussian noise. Word vectors and hidden vectors are set to length 50. Using the development split, we found λ = 0.08 and the learning rate of AdaGrad to 0.0001. The best model uses a margin of Δ = 3.

Inspired by Socher and Fei-Fei (2010) and Hodosh et al. (2013) we also compare to kernelized Canonical Correlation Analysis (kCCA). We use the average of word vectors for describing sentences and the same powerful image vectors as before. We use the code of Socher and Fei-Fei (2010). Technically, one could combine the recently introduced deep CCA of Andrew et al. (2013) and train the recursive neural network architectures with the CCA objective. We leave this to future work. With linear kernels, kCCA does well for image search but is worse for sentence self-similarity and describing images with sentences close-by in embedding space. All other models are trained by replacing the DT-RNN function in Eq. 5.
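For concreteness, here is a hedged sketch of the two simplest comparison models described above: the bag-of-words average and the chain-structured recurrent network whose last hidden state represents the sentence. The initialization and names are assumptions for illustration, not the authors' training code.

```python
import numpy as np

def bow_sentence_vector(word_vectors):
    """Bag-of-words baseline: the sentence vector is the average of its word vectors."""
    return np.mean(word_vectors, axis=0)

def recurrent_sentence_vector(word_vectors, W_h, W_x, f=np.tanh):
    """Chain recurrent baseline: h_t = f(W_h h_{t-1} + W_x x_t); the last h_t represents the sentence."""
    h = np.zeros(W_h.shape[0])
    for x_t in word_vectors:
        h = f(W_h @ h + W_x @ x_t)
    return h

# Toy usage with random 50-dimensional word vectors for a 5-word sentence.
rng = np.random.default_rng(0)
words = rng.normal(size=(5, 50))
W_h = np.eye(50) + rng.normal(scale=0.01, size=(50, 50))
W_x = rng.normal(scale=0.01, size=(50, 50))
print(bow_sentence_vector(words).shape, recurrent_sentence_vector(words, W_h, W_x).shape)
```

As the results below note, the last-word dominance of the recurrent chain is one reason it underperforms the tree-structured models on these retrieval tasks.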


6.1 Similarity of Sentences Describing the Same Image

In this experiment, we first map all 500 sentences from the test set into the multi-modal space. Then for each sentence, we find the nearest neighbor sentences in terms of inner products. We then sort these neighbors and record the rank or position of the nearest sentence that describes the same image. If all the images were very unique and the visual descriptions close-paraphrases and consistent, we would expect a very low rank. However, usually a handful of images are quite similar (for instance, there are various images of airplanes flying, parking, taxiing or waiting on the runway) and sentence descriptions can vary greatly in detail and specificity for the same image.

Table 1 (left) shows the results. We can see that averaging the high quality word vectors already captures a lot of similarity. The chain structure of a standard recurrent neural net performs worst since its representation is dominated by the last words in the sequence which may not be as important as earlier words.

Table 1: Left: Comparison of methods for sentence similarity judgments. Lower numbers are better since they indicate that sentences describing the same image rank more highly (are closer). The ranks are out of the 500 sentences in the test set. Center: Comparison of methods for image search with query sentences. Shown is the average rank of the single correct image that is being described. Right: Average rank of a correct sentence description for a query image.

Model          Sentence Similarity   Image Search   Describing Images
               (Mean Rank)           (Mean Rank)    (Mean Rank)
Random         101.1                 52.1           92.1
BoW            11.8                  14.6           21.1
CT-RNN         15.8                  16.1           23.9
Recurrent NN   18.5                  19.2           27.1
kCCA           10.7                  15.9           18.0
DT-RNN         11.1                  13.6           19.2
SDT-RNN        10.5                  12.5           16.9

6.2 Image Search with Query Sentences

This experiment evaluates how well we can find images that display the visual meaning of a given sentence. We first map a query sentence into the vector space and then find images in the same space using simple inner products. As shown in Table 1 (center), the new DT-RNN outperforms all other models.

6.3 Describing Images by Finding Suitable Sentences

Lastly, we repeat the above experiments but with roles reversed. For an image, we search for suitable textual descriptions again simply by finding close-by sentence vectors in the multi-modal embedding space. Table 1 (right) shows that the DT-RNN again outperforms related models. Fig. 6 shows example sentences assigned to images. The average ranking of 25.3 for a correct sentence description is out of 500 possible sentences. A random assignment would give an average ranking of 100.

Table 2: Results of multimodal ranking when models are trained with a squared error loss and using Euclidean distance in the multimodal space. Better performance is reached for all models when trained with a max-margin loss and using inner products as in the previous table.

Model          Image Search (mRank)   Describing Images (mRank)
BoW            24.7                   30.7
CT-RNN         22.2                   29.4
Recurrent NN   28.4                   31.4
kCCA           13.7                   38.0
DT-RNN         13.3                   26.8
SDT-RNN        15.8                   37.5

6.4 Analysis: Squared Error Loss vs. Margin Loss

We analyze the influence of the multimodal loss function on the performance. In addition, we compare using Euclidean distances instead of inner products. Table 2 shows that performance is worse for all models in this setting.
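The mean-rank numbers in Tables 1 and 2, as well as the recall@k numbers discussed next, can all be read off the same image-sentence score matrix. The hedged sketch below shows one plausible way to compute them for the image search direction; conventions such as 1-based ranks are assumptions rather than details given in the paper.

```python
import numpy as np

def image_search_metrics(V, Y, img_of_sentence, ks=(1, 5, 10)):
    """Mean rank and recall@k for retrieving the correct image given a query sentence.

    V, Y            : image and sentence vectors in the shared multimodal space
    img_of_sentence : img_of_sentence[j] = index of the image that sentence j describes
    Similarity is the inner product, as in the margin-loss setup; ranks are 1-based.
    """
    scores = Y @ V.T                        # (num_sentences, num_images)
    ranks = []
    for j, i in enumerate(img_of_sentence):
        order = np.argsort(-scores[j])      # images sorted by decreasing similarity to sentence j
        ranks.append(int(np.where(order == i)[0][0]) + 1)
    ranks = np.array(ranks)
    recalls = {k: float(np.mean(ranks <= k)) for k in ks}
    return float(ranks.mean()), recalls

# Toy usage with random vectors: 100 test images, 5 sentences each.
rng = np.random.default_rng(0)
V = rng.normal(size=(100, 50))
Y = rng.normal(size=(500, 50))
img_of_sentence = np.repeat(np.arange(100), 5)
mean_rank, recalls = image_search_metrics(V, Y, img_of_sentence)
print(mean_rank, recalls)
```

The describing-images direction is symmetric: rank all sentences for each query image and record the rank of the closest correct description.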


6.5 Analysis: Recall at n vs Mean Rank

Hodosh et al. (2013) and other related work use recall at n as an evaluation measure. Recall at n captures how often one of the top n closest vectors were a correct image or sentence and gives a good intuition of how a model would perform in a ranking task that presents n such results to a user. Below, we compare three commonly used and high performing models: bag of words, kCCA and our SDT-RNN on this different metric. Table 3 shows that the measures do correlate well and the SDT-RNN also performs best on the multimodal ranking tasks when evaluated with this measure.

[Figure 6: Images and their sentence descriptions assigned by the DT-RNN. Example assigned descriptions include "A gray convertible sports car is parked in front of the trees.", "A jockey rides a brown and white horse in a dirt corral.", "White propeller plane parked in middle of grassy field.", "A green steam train running down the tracks." and "A group of people sitting around a table on a porch."]

Table 3: Evaluation comparison between mean rank of the closest correct image or sentence (lower is better, ↓) and recall at different thresholds (higher is better, ↑). With one exception (R@5, bottom table), the SDT-RNN outperforms the other two models and all other models we did not include here.

Image Search
Model      mRank ↓   R@1 ↑   R@5 ↑   R@10 ↑
BoW        14.6      15.8    42.2    60.0
kCCA       15.9      16.4    41.4    58.0
SDT-RNN    12.5      16.4    46.6    65.6

Describing Images
Model      mRank ↓   R@1 ↑   R@5 ↑   R@10 ↑
BoW        21.1      19.0    38.0    57.0
kCCA       18.0      21.0    47.0    61.0
SDT-RNN    16.9      23.0    45.0    63.0

6.6 Error Analysis

In order to understand the main problems with the composed sentence vectors, we analyze the sentences that have the worst nearest neighbor rank between each other. We find that the main failure mode of the SDT-RNN occurs when a sentence that should describe the same image does not use a verb but the other sentences of that image do include a verb. For example, the following sentence pair has vectors that are very far apart from each other even though they are supposed to describe the same image:

1. A blue and yellow airplane flying straight down while emitting white smoke
2. Airplane in dive position

Generally, as long as both sentences either have a verb or do not, the SDT-RNN is more robust to different sentence lengths than bag of words representations.

6.7 Model Analysis: Semantic Composition Matrices

The best model uses composition matrices based on semantic relationships from the dependency parser. We give some insights into what the model learns by listing the composition matrices with the largest Frobenius norms. Intuitively, these matrices have learned larger weights that are being multiplied with the child vector in the tree and hence that child will have more weight in the final composed parent vector. In decreasing order of Frobenius norm, the relationship matrices are: nominal subject, possession modifier (e.g. their), passive auxiliary, preposition at, preposition in front of, passive auxiliary, passive nominal subject, object of preposition, preposition in and preposition on. The model learns that nouns are very important as well as their spatial prepositions and adjectives.

7 Conclusion

We introduced a new recursive neural network model that is based on dependency trees. For evaluation, we use the challenging task of mapping sentences and images into a common space for finding one from the other. Our new model outperforms baselines and other commonly used models that can compute continuous vector representations for sentences. In comparison to related models, the DT-RNN is more invariant and robust to surface changes such as word order.


References

G. Andrew, R. Arora, K. Livescu, and J. Bilmes. 2013. Deep canonical correlation analysis. In ICML, Atlanta, Georgia.
K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, and M. Jordan. 2003. Matching words and pictures. JMLR.
M. Baroni and A. Lenci. 2010. Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4):673–721.
R. Collobert and J. Weston. 2008. A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of ICML, pages 160–167.
F. Costa, P. Frasconi, V. Lombardo, and G. Soda. 2003. Towards incremental parsing of natural language using recursive neural networks. Applied Intelligence.
M. de Marneffe, B. MacCartney, and C. D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In LREC.
J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng. 2012. Large scale distributed deep networks. In NIPS.
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR.
J. Duchi, E. Hazan, and Y. Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12, July.
P. Duygulu, K. Barnard, N. de Freitas, and D. Forsyth. 2002. Object recognition as machine translation. In ECCV.
A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. 2010. Every picture tells a story: Generating sentences from images. In ECCV.
Y. Feng and M. Lapata. 2013. Automatic caption generation for news images. IEEE Trans. Pattern Anal. Mach. Intell., 35.
A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov. 2013. DeViSE: A deep visual-semantic embedding model. In NIPS.
C. Goller and A. Küchler. 1996. Learning task-dependent distributed representations by backpropagation through structure. In Proceedings of the International Conference on Neural Networks.
E. Grefenstette, G. Dinu, Y.-Z. Zhang, M. Sadrzadeh, and M. Baroni. 2013. Multi-step regression learning for compositional distributional semantics. In IWCS.
A. Gupta and L. S. Davis. 2008. Beyond nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers. In ECCV.
M. Hodosh, P. Young, and J. Hockenmaier. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. J. Artif. Intell. Res. (JAIR), 47:853–899.
E. H. Huang, R. Socher, C. D. Manning, and A. Y. Ng. 2012. Improving Word Representations via Global Context and Multiple Word Prototypes. In ACL.
K. Jarrett, K. Kavukcuoglu, M. A. Ranzato, and Y. LeCun. 2009. What is the best multi-stage architecture for object recognition? In ICCV.
K. M. Hermann and P. Blunsom. 2013. The role of syntax in vector space models of compositional semantics. In ACL.
A. Krizhevsky, I. Sutskever, and G. E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In NIPS.
G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. 2011. Baby talk: Understanding and generating image descriptions. In CVPR.
N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. 2009. Attribute and simile classifiers for face verification. In ICCV.
P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Yejin Choi. 2012. Collective generation of natural image descriptions. In ACL.
Q. V. Le, M. A. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean, and A. Y. Ng. 2012. Building high-level features using large scale unsupervised learning. In ICML.
T. Mikolov, W. Yih, and G. Zweig. 2013. Linguistic regularities in continuous space word representations. In HLT-NAACL.
J. Mitchell and M. Lapata. 2010. Composition in distributional models of semantics. Cognitive Science, 34(8):1388–1429.
J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. 2011. Multimodal deep learning. In ICML.
V. Ordonez, G. Kulkarni, and T. L. Berg. 2011. Im2Text: Describing images using 1 million captioned photographs. In NIPS.
J. B. Pollack. 1990. Recursive distributed representations. Artificial Intelligence, 46, November.
C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier. 2010. Collecting image annotations using Amazon's Mechanical Turk. In Workshop on Creating Speech and Language Data with Amazon's MTurk.
R. Socher and L. Fei-Fei. 2010. Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora. In CVPR.
R. Socher, C. D. Manning, and A. Y. Ng. 2010. Learning continuous phrase representations and syntactic parsing with recursive neural networks. In Proceedings of the NIPS-2010 Deep Learning and Unsupervised Feature Learning Workshop.


R. Socher, E. H. Huang, J. Pennington, A. Y. Ng, and C. D. Manning. 2011a. Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection. In NIPS.
R. Socher, C. Lin, A. Y. Ng, and C. D. Manning. 2011b. Parsing Natural Scenes and Natural Language with Recursive Neural Networks. In ICML.
R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, and C. D. Manning. 2011c. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions. In EMNLP.
R. Socher, B. Huval, C. D. Manning, and A. Y. Ng. 2012. Semantic Compositionality Through Recursive Matrix-Vector Spaces. In EMNLP.
R. Socher, J. Bauer, C. D. Manning, and A. Y. Ng. 2013a. Parsing With Compositional Vector Grammars. In ACL.
R. Socher, M. Ganjoo, C. D. Manning, and A. Y. Ng. 2013b. Zero-Shot Learning Through Cross-Modal Transfer. In NIPS.
R. Socher, M. Ganjoo, H. Sridhar, O. Bastani, C. D. Manning, and A. Y. Ng. 2013c. Zero-shot learning through cross-modal transfer. In Proceedings of the International Conference on Learning Representations (ICLR, Workshop Track).
N. Srivastava and R. Salakhutdinov. 2012. Multimodal learning with deep Boltzmann machines. In NIPS.
A. Torralba, K. P. Murphy, and W. T. Freeman. 2010. Using the forest to see the trees: exploiting context for visual object detection and localization. Communications of the ACM.
P. D. Turney and P. Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.
B. Yao, X. Yang, L. Lin, M. W. Lee, and S.-C. Zhu. 2010. I2T: image parsing to text description. IEEE Xplore.