Transactions of the Association for Computational Linguistics, 2 (2014) 207–218. Action Editor: Alexander Clark.
Submitted 10/2013; Revised 3/2014; Published 4/2014. © 2014 Association for Computational Linguistics.
Grounded Compositional Semantics for Finding and Describing Images with Sentences

Richard Socher, Andrej Karpathy, Quoc V. Le*, Christopher D. Manning, Andrew Y. Ng
Stanford University, Computer Science Department, *Google Inc.
richard@socher.org, karpathy@cs.stanford.edu, qvl@google.com, manning@stanford.edu, ang@cs.stanford.edu

Abstract

Previous work on Recursive Neural Networks (RNNs) shows that these models can produce compositional feature vectors for accurately representing and classifying sentences or images. However, the sentence vectors of previous models cannot accurately represent visually grounded meaning. We introduce the DT-RNN model which uses dependency trees to embed sentences into a vector space in order to retrieve images that are described by those sentences. Unlike previous RNN-based models which use constituency trees, DT-RNNs naturally focus on the action and agents in a sentence. They are better able to abstract from the details of word order and syntactic expression. DT-RNNs outperform other recursive and recurrent neural networks, kernelized CCA and a bag-of-words baseline on the tasks of finding an image that fits a sentence description and vice versa. They also give more similar representations to sentences that describe the same image.

1 Introduction

Single word vector spaces are widely used (Turney and Pantel, 2010) and successful at classifying single words and capturing their meaning (Collobert and Weston, 2008; Huang et al., 2012; Mikolov et al., 2013). Since words rarely appear in isolation, the task of learning compositional meaning representations for longer phrases has recently received a lot of attention (Mitchell and Lapata, 2010; Socher et al., 2010; Socher et al., 2012; Grefenstette et al., 2013). Similarly, classifying whole images into a fixed set of classes also achieves very high performance (Le et al., 2012; Krizhevsky et al., 2012). However, similar to words, objects in images are often seen in relationships with other objects which are not adequately described by a single label.

In this work, we introduce a model, illustrated in Fig. 1, which learns to map sentences and images into a common embedding space in order to be able to retrieve one from the other. We assume word and image representations are first learned in their respective single modalities but finally mapped into a jointly learned multimodal embedding space. Our model for mapping sentences into this space is based on ideas from Recursive Neural Networks (RNNs) (Pollack, 1990; Costa et al., 2003; Socher et al., 2011b). However, unlike all previous RNN models which are based on constituency trees (CT-RNNs), our model computes compositional vector representations inside dependency trees. The compositional vectors computed by this new dependency tree RNN (DT-RNN) capture more of the meaning of sentences, where we define meaning in terms of similarity to a "visual representation" of the textual description. DT-RNN induced vector representations of sentences are more robust to changes in the syntactic structure or word order than related models such as CT-RNNs or Recurrent Neural Networks since they naturally focus on a sentence's action and its agents.

We evaluate and compare DT-RNN induced representations on their ability to use a sentence such as "A man wearing a helmet jumps on his bike near a beach." to find images that show such a scene. The goal is to learn sentence representations that capture the visual scene described and to find appropriate images in the learned, multi-modal sentence-image space. Conversely, when given a query image, we would like to find a description that goes beyond a single label by providing a correct sentence describing it, a task that has recently garnered a lot of attention (Farhadi et al., 2010; Ordonez et al., 2011; Kuznetsova et al., 2012). We use the dataset introduced by Rashtchian et al. (2010) which consists of 1000 images, each with 5 descriptions. On all tasks, our model outperforms baselines and related models.
Figure 1: The DT-RNN learns vector representations for sentences based on their dependency trees. We learn to map the outputs of convolutional neural networks applied to images into the same space and can then compare both sentences and images. This allows us to query images with a sentence and give sentence descriptions to images.

2 Related Work

The presented model is connected to several areas of NLP and vision research, each with a large amount of related work to which we can only do some justice given space constraints.

Semantic Vector Spaces and Their Compositionality. The dominant approach in semantic vector spaces uses distributional similarities of single words. Often, co-occurrence statistics of a word and its context are used to describe each word (Turney and Pantel, 2010; Baroni and Lenci, 2010), such as tf-idf. Most of the compositionality algorithms and related datasets capture two-word compositions. For instance, Mitchell and Lapata (2010) use two-word phrases and analyze similarities computed by vector addition, multiplication and others. Compositionality is an active field of research with many different models and representations being explored (Grefenstette et al., 2013), among many others. We compare to supervised compositional models that can learn task-specific vector representations such as constituency tree recursive neural networks (Socher et al., 2011b; Socher et al., 2011a), chain structured recurrent neural networks and other baselines. Another alternative would be to use CCG trees as a backbone for vector composition (K. M. Hermann, 2013).

Multimodal Embeddings. Multimodal embedding methods project data from multiple sources such as sound and video (Ngiam et al., 2011) or images and text. Socher and Fei-Fei (2010) project words and image regions into a common space using kernelized canonical correlation analysis to obtain state of the art performance in annotation and segmentation. Similar to our work, they use unsupervised large text corpora to learn semantic word representations. Among other recent work is that by Srivastava and Salakhutdinov (2012) who developed multimodal Deep Boltzmann Machines. Similar to their work, we use techniques from the broad field of deep learning to represent images and words.

Recently, single word vector embeddings have been used for zero shot learning (Socher et al., 2013c). Mapping images to word vectors enabled their system to classify images as depicting objects such as "cat" without seeing any examples of this class. Related work has also been presented at NIPS (Socher et al., 2013b; Frome et al., 2013). This work moves zero-shot learning beyond single categories per image and extends it to unseen phrases and full length sentences, making use of similar ideas of semantic spaces grounded in visual knowledge.
Detailed Image Annotation. Interactions between images and texts are a growing research field. Early work in this area includes generating single words or fixed phrases from images (Duygulu et al., 2002; Barnard et al., 2003) or using contextual information to improve recognition (Gupta and Davis, 2008; Torralba et al., 2010).

Apart from a large body of work on single object image classification (Le et al., 2012), there is also work on attribute classification and other mid-level elements (Kumar et al., 2009), some of which we hope to capture with our approach as well.

Our work is close in spirit with recent work in describing images with more detailed, longer textual descriptions. In particular, Yao et al. (2010) describe images using hierarchical knowledge and humans in the loop. In contrast, our work does not require human interactions. Farhadi et al. (2010) and Kulkarni et al. (2011), on the other hand, use a more automatic method to parse images. For instance, the former approach uses a single triple of objects estimated for an image to retrieve sentences from a collection written to describe similar images. It forms representations to describe 1 object, 1 action, and 1 scene. Kulkarni et al. (2011) extend their method to describe an image with multiple objects. None of these approaches have used a compositional sentence vector representation and they require specific language generation techniques and sophisticated inference methods. Since our model is based on neural networks, inference is fast and simple. Kuznetsova et al. (2012) use a very large parallel corpus to connect images and sentences. Feng and Lapata (2013) use a large dataset of captioned images and experiment with both extractive (search) and abstractive (generation) models.

Most related is the very recent work of Hodosh et al. (2013). They too evaluate using a ranking measure. In our experiments, we compare to kernelized Canonical Correlation Analysis which is the main technique in their experiments.

3 Dependency-Tree Recursive Neural Networks

In this section we first focus on the DT-RNN model that computes compositional vector representations for phrases and sentences of variable length and syntactic type. In section 5 the resulting vectors will then become multimodal features by mapping images that show what the sentence describes to the same space and learning both the image and sentence mapping jointly.

The most common way of building representations for longer phrases from single word vectors is to simply linearly average the word vectors. While this bag-of-words approach can yield reasonable performance in some tasks, it gives all the words the same weight and cannot distinguish important differences in simple visual descriptions such as The bike crashed into the standing car. vs. The car crashed into the standing bike.

RNN models (Pollack, 1990; Goller and Küchler, 1996; Socher et al., 2011b; Socher et al., 2011a) provided a novel way of combining word vectors for longer phrases that moved beyond simple averaging. They combine vectors with an RNN in binary constituency trees which have potentially many hidden layers. While the induced vector representations work very well on many tasks, they also inevitably capture a lot of syntactic structure of the sentence. However, the task of finding images from sentence descriptions requires us to be more invariant to syntactic differences. One such example is active-passive constructions, which can collapse words such as "by" in some formalisms (de Marneffe et al., 2006), relying instead on the semantic relationship of "agent". For instance, The mother hugged her child. and The child was hugged by its mother. should map to roughly the same visual space. Current Recursive and Recurrent Neural Networks do not exhibit this behavior and even bag of words representations would be influenced by the words was and by. The model we describe below focuses more on recognizing actions and agents and has the potential to learn representations that are invariant to active-passive differences.

3.1 DT-RNN Inputs: Word Vectors and Dependency Trees

In order for the DT-RNN to compute a vector representation for an ordered list of m words (a phrase or sentence), we map the single words to a vector space and then parse the sentence.
First, we map each word to a d-dimensional vector. We initialize these word vectors with the unsupervised model of Huang et al. (2012) which can learn single word vector representations from both local and global contexts. The idea is to construct a neural network that outputs high scores for windows and documents that occur in a large unlabeled corpus and low scores for window-document pairs where one word is replaced by a random word. When such a network is optimized via gradient descent the derivatives backpropagate into a word embedding matrix X which stores word vectors as columns. In order to predict correct scores the vectors in the matrix capture co-occurrence statistics. We use d = 50 in all our experiments.

The embedding matrix X is then used by finding the column index i of each word: [w] = i and retrieving the corresponding column x_w from X. Henceforth, we represent an input sentence s as an ordered list of (word, vector) pairs: s = ((w_1, x_{w_1}), ..., (w_m, x_{w_m})).

Next, the sequence of words (w_1, ..., w_m) is parsed by the dependency parser of de Marneffe et al. (2006). Fig. 2 shows an example.

Figure 2: Example of a full dependency tree for a longer sentence. The DT-RNN will compute vector representations at every word that represent that word and an arbitrary number of child nodes. The final representation is computed at the root node, here at the verb jumps. Note that more important activity and object words are higher up in this tree structure.

We can represent a dependency tree d of a sentence s as an ordered list of (child, parent) indices: d(s) = {(i, j)}, where every child word in the sequence i = 1, ..., m is present and has any word j ∈ {1, ..., m} ∪ {0} as its parent. The root word has 0 as its parent and we notice that the same word can be a parent between zero and m times. Without loss of generality, we assume that these indices form a tree structure. To summarize, the input to the DT-RNN for each sentence is the pair (s, d): the words and their vectors and the dependency tree.

3.2 Forward Propagation in DT-RNNs

Given these two inputs, we now illustrate how the DT-RNN computes parent vectors. We will use the following sentence as a running example: Students_1 ride_2 bikes_3 at_4 night_5. Fig. 3 shows its tree and computed vector representations.

Figure 3: Example of a DT-RNN tree structure for computing a sentence representation in a bottom up fashion.

The dependency tree for this sentence can be summarized by the following set of (child, parent) edges: d = {(1, 2), (2, 0), (3, 2), (4, 2), (5, 4)}.

The DT-RNN model will compute parent vectors at each word that include all the dependent (children) nodes in a bottom up fashion using a compositionality function g_θ which is parameterized by all the model parameters θ. To this end, the algorithm searches for nodes in a tree that have either (i) no children or (ii) children whose vectors have already been computed, and then computes the corresponding vector.

In our example, the words x_1, x_3, x_5 are leaf nodes and hence, we can compute their corresponding hidden nodes via:

h_c = g_θ(x_c) = f(W_v x_c)   for c = 1, 3, 5,    (1)

where we compute the hidden vector at position c via our general composition function g_θ. In the case of leaf nodes, this composition function becomes simply a linear layer, parameterized by W_v ∈ R^{n×d}, followed by a nonlinearity. We cross-validate over using no nonlinearity (f = id), tanh, sigmoid or rectified linear units (f = max(0, x)), but generally find tanh to perform best.
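To make the input representation concrete, the following is a minimal numpy sketch (not the authors' code) of how the pair (s, d) and the leaf hidden vectors of Eq. 1 could be built; the tiny vocabulary, random embedding matrix and random weights below are illustrative placeholders rather than the pre-trained Huang et al. (2012) vectors.

import numpy as np

rng = np.random.default_rng(0)
d = 50        # word vector dimension (d = 50 in the paper)
n = 50        # hidden dimension

# Hypothetical tiny vocabulary with an embedding matrix X that stores
# word vectors as columns (stand-in for the pre-trained vectors).
vocab = ["students", "ride", "bikes", "at", "night"]
word_to_index = {w: i for i, w in enumerate(vocab)}
X = rng.normal(scale=0.1, size=(d, len(vocab)))

# The running example sentence with its dependency edges d(s) = {(child, parent)},
# using 1-based word indices and 0 for the root.
words = ["students", "ride", "bikes", "at", "night"]
dep_edges = [(1, 2), (2, 0), (3, 2), (4, 2), (5, 4)]

# s = ((w_1, x_{w_1}), ..., (w_m, x_{w_m})): ordered (word, vector) pairs.
s = [(w, X[:, word_to_index[w]]) for w in words]

# Leaf nodes (words that never appear as a parent) are composed as in Eq. 1:
# h_c = f(W_v x_c), here with f = tanh.
Wv = rng.normal(scale=0.01, size=(n, d))
parents = {p for _, p in dep_edges}
leaf_hidden = {c: np.tanh(Wv @ s[c - 1][1])
               for c in range(1, len(words) + 1) if c not in parents}
# leaf_hidden now holds h_1, h_3 and h_5 for this example.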
The final sentence representation we want to compute is at h_2; however, since we still do not have h_4, we compute that one next:

h_4 = g_θ(x_4, h_5) = f(W_v x_4 + W_{r1} h_5),    (2)

where we use the same W_v as before to map the word vector into hidden space but we now also have a linear layer that takes as input h_5, the only child of the fourth node. The matrix W_{r1} ∈ R^{n×n} is used because node 5 is the first child node on the right side of node 4. Generally, we have multiple matrices for composing with hidden child vectors from the right and left sides: W_{r·} = (W_{r1}, ..., W_{r k_r}) and W_{l·} = (W_{l1}, ..., W_{l k_l}). The number of needed matrices is determined by the data by simply finding the maximum numbers of left (k_l) and right (k_r) children any node has. If at test time a child appeared at an even larger distance (this does not happen in our test set), the corresponding matrix would be the identity matrix.

Now that all children of h_2 have their hidden vectors, we can compute the final sentence representation via:

h_2 = g_θ(x_2, h_1, h_3, h_4) = f(W_v x_2 + W_{l1} h_1 + W_{r1} h_3 + W_{r2} h_4).    (3)

Notice that the children are multiplied by matrices that depend on their location relative to the current node.

Another modification that improves the mean rank by approximately 6 in image search on the dev set is to weight nodes by the number of words underneath them and normalize by the sum of words under all children. This encourages the intuitive desideratum that nodes describing longer phrases are more important. Let ℓ(i) be the number of leaf nodes (words) under node i and C(i, y) be the set of child nodes of node i in dependency tree y. The final composition function for a node vector h_i becomes:

h_i = f( (1 / ℓ(i)) ( W_v x_i + Σ_{j∈C(i)} ℓ(j) W_{pos(i,j)} h_j ) ),    (4)

where by definition ℓ(i) = 1 + Σ_{j∈C(i)} ℓ(j) and pos(i, j) is the relative position of child j with respect to node i, e.g. l1 or r2 in Eq. 3.

3.3 Semantic Dependency Tree RNNs

An alternative is to condition the weight matrices on the semantic relations given by the dependency parser. We use the collapsed tree formalism of the Stanford dependency parser (de Marneffe et al., 2006). With such a semantic untying of the weights, the DT-RNN makes better use of the dependency formalism and could give active-passive reversals similar semantic vector representations. The equation for this semantic DT-RNN (SDT-RNN) is the same as the one above except that the matrices W_{pos(i,j)} are replaced with matrices based on the dependency relationship. There are a total of 141 unique such relationships in the dataset. However, most are very rare. For examples of semantic relationships, see Fig. 2 and the model analysis section 6.7.

This forward propagation can be used for computing compositional vectors and in Sec. 5 we will explain the objective function in which these are trained.
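The following is a small sketch, under our own assumptions about data structures, of the bottom-up forward pass of Eq. 4. The dictionary key used to pick a child matrix is the relative position (l1, r1, r2, ...) for the DT-RNN; swapping it for the dependency relation name yields the SDT-RNN variant of Section 3.3. Function and variable names are illustrative, not taken from the paper.

import numpy as np

def dt_rnn_forward(word_vecs, dep_edges, Wv, W_child, f=np.tanh):
    """Sketch of DT-RNN forward propagation following Eq. 4.

    word_vecs: dict node -> word vector x_i (1-based node indices)
    dep_edges: list of (child, parent) pairs, parent 0 marks the root
    Wv:        n x d matrix mapping word vectors into hidden space
    W_child:   dict key -> n x n matrix; key is a position label such as 'l1'/'r2'
               (DT-RNN) or a dependency relation name (SDT-RNN)
    Returns the hidden vector at the root, i.e. the sentence representation.
    """
    children = {}                          # parent -> list of child nodes
    for c, p in dep_edges:
        children.setdefault(p, []).append(c)

    num_leaves = {}                        # l(i): number of words under node i
    hidden = {}

    def key_for(parent, child):
        # Position-based untying: r1, r2, ... for children to the right of the
        # parent, l1, l2, ... for children to the left.
        kids = children[parent]
        right = sorted(c for c in kids if c > parent)
        left = sorted((c for c in kids if c < parent), reverse=True)
        return ('r%d' % (right.index(child) + 1)) if child > parent \
            else ('l%d' % (left.index(child) + 1))

    def compose(i):
        total = Wv @ word_vecs[i]
        count = 1
        for j in children.get(i, []):
            compose(j)                     # children are composed first (bottom up)
            total = total + num_leaves[j] * (W_child[key_for(i, j)] @ hidden[j])
            count += num_leaves[j]
        num_leaves[i] = count              # l(i) = 1 + sum_j l(j)
        hidden[i] = f(total / count)       # Eq. 4
        return hidden[i]

    root = next(c for c, p in dep_edges if p == 0)
    return compose(root)

# Example usage with the running example above (keys needed: 'l1', 'r1', 'r2'):
#   sentence_vec = dt_rnn_forward({i + 1: x for i, (_, x) in enumerate(s)},
#                                 dep_edges, Wv, W_child)
# which returns h_2, the representation at the root verb "ride".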
3.4 Comparison to Previous RNN Models

The DT-RNN has several important differences to previous RNN models of Socher et al. (2011a) and (Socher et al., 2011b; Socher et al., 2011c). These constituency tree RNNs (CT-RNNs) use the following composition function to compute a hidden parent vector h from exactly two child vectors (c_1, c_2) in a binary tree: h = f(W [c_1; c_2]), where W ∈ R^{d×2d} is the main parameter to learn. This can be rewritten to show the similarity to the DT-RNN as h = f(W_{l1} c_1 + W_{r1} c_2). However, there are several important differences.

Note first that in previous RNN models the parent vectors were of the same dimensionality to be recursively compatible and be used as input to the next composition. In contrast, our new model first maps single words into a hidden space and then parent nodes are composed from these hidden vectors. This allows a higher capacity representation which is especially helpful for nodes that have many children.

Secondly, the DT-RNN allows for n-ary nodes in the tree. This is an improvement that is possible even for constituency tree CT-RNNs but it has not been explored in previous models.

Third, due to computing parent nodes in constituency trees, previous models had the problem that words that are merged last in the tree have a larger weight or importance in the final sentence representation. This can be problematic since these are often simple non-content words, such as a leading "But,". While such single words can be important for tasks such as sentiment analysis, we argue that for describing visual scenes the DT-RNN captures the more important effects: the dependency tree structures push the central content words such as the main action or verb and its subject and object to be merged last and hence, by construction, the final sentence representation is more robust to less important adjectival modifiers, word order changes, etc.

Fourth, we allow some untying of weights depending on either how far away a constituent is from the current word or what its semantic relationship is.

Now that we can compute compositional vector representations for sentences, the next section describes how we represent images.

4 Learning Image Representations with Neural Networks

The image features that we use in our experiments are extracted from a deep neural network, replicated from the one described in (Le et al., 2012). The network was trained using both unlabeled data (random web images) and labeled data to classify 22,000 categories in ImageNet (Deng et al., 2009). We then used the features at the last layer, before the classifier, as the feature representation in our experiments. The dimension of the feature vector of the last layer is 4,096. The details of the model and its training procedures are as follows.

The architecture of the network can be seen in Figure 4.

Figure 4: The architecture of the visual model. This model has 3 sequences of filtering, pooling and local contrast normalization layers. The learnable parameters are the filtering layers. The filters are not shared, i.e., the network is nonconvolutional.

The network takes 200x200 pixel images as inputs and has 9 layers. The layers consist of three sequences of filtering, pooling and local contrast normalization (Jarrett et al., 2009). The pooling function is L2 pooling of the previous layer (taking the square of the filtering units, summing them up in a small area in the image, and taking the square-root). The local contrast normalization takes inputs in a small area of the lower layer, subtracts the mean and divides by the standard deviation.

The network was first trained using an unsupervised objective: trying to reconstruct the input while keeping the neurons sparse. In this phase, the network was trained on 20 million images randomly sampled from the web. We resized a given image so that its short dimension has 200 pixels. We then cropped a fixed size 200x200 pixel image right at the center of the resized image. This means we may discard a fraction of the long dimension of the image.

After unsupervised training, we used ImageNet (Deng et al., 2009) to adjust the features in the entire network. The ImageNet dataset has 22,000 categories and 14 million images. The number of images in each category is equal across categories. The 22,000 categories are extracted from WordNet.

To speed up the supervised training of this network, we made a simple modification to the algorithm described in Le et al. (2012): adding a "bottleneck" layer in between the last layer and the classifier to reduce the number of connections. We added one "bottleneck" layer which has 4,096 units in between the last layer of the network and the softmax layer. This newly-added layer is fully connected to the previous layer and has a linear activation function. The total number of connections of this network is approximately 1.36 billion.

The network was trained again using the supervised objective of classifying the 22,000 classes in ImageNet. Most features in the network are local, which allows model parallelism. Data parallelism by asynchronous SGD was also employed as in Le et al. (2012). The entire training, both unsupervised and supervised, took 8 days on a large cluster of machines. This network achieves 18.3% precision@1 on the full ImageNet dataset (Release Fall 2011).

We will use the features at the bottleneck layer as the feature vector z of an image. Each scaled and cropped image is presented to our network. The network then performs a feedforward computation to compute the values of the bottleneck layer. This means that every image is represented by a fixed length vector of 4,096 dimensions. Note that during training, no aligned sentence-image data was used and the ImageNet classes do not fully intersect with the words used in our dataset.
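As a rough illustration of the two per-layer operations described in this section, the sketch below implements L2 pooling and local contrast normalization for a single 2-D feature map; the window sizes, strides and padding are illustrative choices of ours, not those of the original 9-layer network.

import numpy as np

def l2_pool(feature_map, size=3, stride=2):
    """L2 pooling as described above: square the units, sum them over a
    small area, and take the square root. feature_map is one 2-D map."""
    h, w = feature_map.shape
    sq = feature_map ** 2
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = sq[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = np.sqrt(patch.sum())
    return out

def local_contrast_norm(feature_map, size=3, eps=1e-5):
    """Local contrast normalization: subtract the local mean and divide by
    the local standard deviation computed over a small neighbourhood."""
    h, w = feature_map.shape
    pad = size // 2
    padded = np.pad(feature_map, pad, mode='reflect')
    out = np.empty_like(feature_map)
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + size, j:j + size]
            out[i, j] = (feature_map[i, j] - patch.mean()) / (patch.std() + eps)
    return out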
5 Multimodal Mappings

The previous two sections described how we can map sentences into a d = 50-dimensional space and how to extract high quality image feature vectors of 4,096 dimensions. We now define our final multimodal objective function for learning joint image-sentence representations with these models. Our training set consists of N images and their feature vectors z_i and each image has 5 sentence descriptions s_{i1}, ..., s_{i5} for which we use the DT-RNN to compute vector representations. See Fig. 5 for examples from the dataset.

For training, we use a max-margin objective function which intuitively trains pairs of correct image and sentence vectors to have high inner products and incorrect pairs to have low inner products. Let v_i = W_I z_i be the mapped image vector and y_{ij} = DTRNN_θ(s_{ij}) the composed sentence vector. We define S to be the set of all sentence indices and S(i) the set of sentence indices corresponding to image i. Similarly, I is the set of all image indices and I(j) is the image index of sentence j. The set P is the set of all correct image-sentence training pairs (i, j). The ranking cost function to minimize is then:

J(W_I, θ) = Σ_{(i,j)∈P} Σ_{c∈S\S(i)} max(0, Δ − v_i^T y_j + v_i^T y_c) + Σ_{(i,j)∈P} Σ_{c∈I\I(j)} max(0, Δ − v_i^T y_j + v_c^T y_j),    (5)

where θ are the language composition matrices, the first inner sum is over sentences coming from different images and the second over other images for a given sentence. The hyperparameter Δ is the margin. The margin is found via cross validation on the dev set and is usually around 1.

The final objective also includes the regularization term λ(‖θ‖_2^2 + ‖W_I‖_F). Both the visual model and the word vector learning require a very large amount of training data and both have a huge number of parameters. Hence, to prevent overfitting, we assume their weights are fixed and only train the DT-RNN parameters and W_I. If larger training corpora become available in the future, training both jointly becomes feasible and would present a very promising direction. We use a modified version of AdaGrad (Duchi et al., 2011) for optimization of both W_I and the DT-RNN as well as the other baselines (except kCCA). AdaGrad has achieved good performance previously in neural network models (Dean et al., 2012; Socher et al., 2013a). We modify it by resetting all squared gradient sums to 1 every 5 epochs. With both images and sentences in the same multimodal space, we can easily query the model for similar images or sentences by finding the nearest neighbors in terms of negative inner products.

An alternative objective function is based on the squared loss J(W_I, θ) = Σ_{(i,j)∈P} ‖v_i − y_j‖_2^2. This requires an alternating minimization scheme that first trains only W_I, then fixes W_I and trains the DT-RNN weights θ, and then repeats this several times. We find that the performance with this objective function (paired with finding similar images using Euclidean distances) is worse for all models than the margin loss of Eq. 5. In addition, kCCA also performs much better using inner products in the multimodal space.
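For reference, here is a minimal sketch of the ranking objective of Eq. 5, without the regularization term and without the modified AdaGrad update; v holds the mapped image vectors v_i = W_I z_i, y the composed sentence vectors, and pairs the correct image-sentence index pairs. This is an illustration of the loss, not the authors' implementation.

import numpy as np

def ranking_loss(v, y, pairs, delta=1.0):
    """Max-margin ranking objective of Eq. 5 (without regularization).

    v:     (num_images, k) mapped image vectors v_i = W_I z_i
    y:     (num_sentences, k) composed sentence vectors y_j
    pairs: iterable of correct (image index i, sentence index j) pairs
    delta: the margin (around 1 on the dev set, per the text above)
    """
    pair_set = set(pairs)
    scores = v @ y.T                     # inner products v_i^T y_j
    loss = 0.0
    for i, j in pair_set:
        correct = scores[i, j]
        # rank the correct sentence above sentences describing other images
        for c in range(y.shape[0]):
            if (i, c) not in pair_set:
                loss += max(0.0, delta - correct + scores[i, c])
        # rank the correct image above all other images for this sentence
        for c in range(v.shape[0]):
            if (c, j) not in pair_set:
                loss += max(0.0, delta - correct + scores[c, j])
    return loss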
Figure 5: Examples from the dataset of images and their sentence descriptions (Rashtchian et al., 2010). Sentence length varies greatly and different objects can be mentioned first. Hence, models have to be invariant to word ordering.

6 Experiments

We use the dataset of Rashtchian et al. (2010) which consists of 1000 images, each with 5 sentences. See Fig. 5 for examples.

We evaluate and compare the DT-RNN in three different experiments. First, we analyze how well the sentence vectors capture similarity in visual meaning. Then we analyze Image Search with Query Sentences: querying each model with a sentence in order to find an image showing that sentence's visual "meaning." The last experiment, Describing Images by Finding Suitable Sentences, does the reverse search where we query the model with an image and try to find the closest textual description in the embedding space.

In our comparison to other methods we focus on those models that can also compute fixed, continuous vectors for sentences. In particular, we compare to the RNN model on constituency trees of Socher et al. (2011a), a standard recurrent neural network, and a simple bag-of-words baseline which averages the words. All models use the word vectors provided by Huang et al. (2012) and do not update them as discussed above. Models are trained with their corresponding gradients and backpropagation techniques. A standard recurrent model is used where the hidden vector at word index t is computed from the hidden vector at the previous time step and the current word vector: h_t = f(W_h h_{t−1} + W_x x_t). During training, we take the last hidden vector of the sentence chain and propagate the error into that. It is also this vector that is used to represent the sentence.

Other possible comparisons are to the very different models mentioned in the related work section. These models use a lot more task-specific engineering, such as running object detectors with bounding boxes, attribute classifiers, scene classifiers, CRFs for composing the sentences, etc. Another line of work uses large sentence-image aligned resources (Kuznetsova et al., 2012), whereas we focus on easily obtainable training data of each modality separately and a rather small multimodal corpus.

In our experiments we split the data into 800 training, 100 development and 100 test images. Since there are 5 sentences describing each image, we have 4000 training sentences and 500 testing sentences. The dataset has 3020 unique words, half of which only appear once. Hence, the unsupervised, pre-trained semantic word vector representations are crucial. Word vectors are not fine tuned during training. Hence, the main parameters are the DT-RNN's W_{l·}, W_{r·} or the semantic matrices, of which there are 141, and the image mapping W_I. For both DT-RNNs the weight matrices are initialized to block identity matrices plus Gaussian noise. Word vectors and hidden vectors are set to length 50. Using the development split, we found λ = 0.08 and an AdaGrad learning rate of 0.0001. The best model uses a margin of Δ = 3.

Inspired by Socher and Fei-Fei (2010) and Hodosh et al. (2013) we also compare to kernelized Canonical Correlation Analysis (kCCA). We use the average of word vectors for describing sentences and the same powerful image vectors as before. We use the code of Socher and Fei-Fei (2010). Technically, one could combine the recently introduced deep CCA (Andrew et al., 2013) and train the recursive neural network architectures with the CCA objective. We leave this to future work. With linear kernels, kCCA does well for image search but is worse for sentence self similarity and for describing images with sentences that are close-by in embedding space. All other models are trained by replacing the DT-RNN function in Eq. 5.
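For completeness, here are minimal sketches of the two simplest baselines described above: the chain-structured recurrent network with h_t = f(W_h h_{t−1} + W_x x_t), whose last hidden vector represents the sentence, and the bag-of-words average. Parameter shapes and initialization are left to the caller and the function names are ours.

import numpy as np

def recurrent_sentence_vector(word_vectors, Wh, Wx, f=np.tanh):
    """Recurrent baseline: h_t = f(Wh h_{t-1} + Wx x_t); the last hidden
    vector represents the sentence and receives the error during training."""
    h = np.zeros(Wh.shape[0])
    for x in word_vectors:            # word_vectors: list of d-dimensional vectors
        h = f(Wh @ h + Wx @ x)
    return h

def bow_sentence_vector(word_vectors):
    """Bag-of-words baseline: the average of the word vectors."""
    return np.mean(word_vectors, axis=0)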
6.1 Similarity of Sentences Describing the Same Image

In this experiment, we first map all 500 sentences from the test set into the multi-modal space. Then for each sentence, we find the nearest neighbor sentences in terms of inner products. We then sort these neighbors and record the rank or position of the nearest sentence that describes the same image. If all the images were very unique and the visual descriptions close paraphrases and consistent, we would expect a very low rank. However, usually a handful of images are quite similar (for instance, there are various images of airplanes flying, parking, taxiing or waiting on the runway) and sentence descriptions can vary greatly in detail and specificity for the same image.

Model          Sentence Similarity   Image Search   Describing Images
Random         101.1                 52.1           92.1
BoW            11.8                  14.6           21.1
CT-RNN         15.8                  16.1           23.9
Recurrent NN   18.5                  19.2           27.1
kCCA           10.7                  15.9           18.0
DT-RNN         11.1                  13.6           19.2
SDT-RNN        10.5                  12.5           16.9

Table 1: Left: comparison of methods for sentence similarity judgments. Lower numbers are better since they indicate that sentences describing the same image rank more highly (are closer); the ranks are out of the 500 sentences in the test set. Center: comparison of methods for image search with query sentences; shown is the average rank of the single correct image that is being described. Right: average rank of a correct sentence description for a query image.

Table 1 (left) shows the results. We can see that averaging the high quality word vectors already captures a lot of similarity. The chain structure of a standard recurrent neural net performs worst since its representation is dominated by the last words in the sequence, which may not be as important as earlier words.

6.2 Image Search with Query Sentences

This experiment evaluates how well we can find images that display the visual meaning of a given sentence. We first map a query sentence into the vector space and then find images in the same space using simple inner products. As shown in Table 1 (center), the new DT-RNN outperforms all other models.

6.3 Describing Images by Finding Suitable Sentences

Lastly, we repeat the above experiments but with roles reversed. For an image, we search for suitable textual descriptions again simply by finding close-by sentence vectors in the multi-modal embedding space. Table 1 (right) shows that the DT-RNN again outperforms related models. Fig. 6 shows examples of sentences assigned to images. The average ranking of 25.3 for a correct sentence description is out of 500 possible sentences. A random assignment would give an average ranking of 100.

6.4 Analysis: Squared Error Loss vs. Margin Loss

We analyze the influence of the multimodal loss function on the performance. In addition, we compare using Euclidean distances instead of inner products. Table 2 shows that performance is worse for all models in this setting.

Model          Image Search   Describing Images
BoW            24.7           30.7
CT-RNN         22.2           29.4
Recurrent NN   28.4           31.4
kCCA           13.7           38.0
DT-RNN         13.3           26.8
SDT-RNN        15.8           37.5

Table 2: Results of multimodal ranking (mean rank) when models are trained with a squared error loss and using Euclidean distance in the multimodal space. Better performance is reached for all models when trained with a max-margin loss and using inner products as in the previous table.

6.5 Analysis: Recall at n vs Mean Rank

Hodosh et al. (2013) and other related work use recall at n as an evaluation measure. Recall at n captures how often one of the top n closest vectors was a correct image or sentence and gives a good intuition of how a model would perform in a ranking task that presents n such results to a user. Below, we compare three commonly used and high performing models, bag of words, kCCA and our SDT-RNN, on this different metric.
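The two evaluation measures can be computed as in the following sketch (our own illustration, assuming image and sentence vectors are stored row-wise and that image_of_sentence maps each test sentence to the index of its correct image):

import numpy as np

def mean_rank_and_recall(image_vecs, sent_vecs, image_of_sentence, ns=(1, 5, 10)):
    """For every query sentence, rank all test images by inner product and
    record the rank of the single correct image; report the mean rank and
    recall@n for the thresholds in ns."""
    scores = sent_vecs @ image_vecs.T            # one row of image scores per sentence
    ranks = []
    for j, correct in enumerate(image_of_sentence):
        order = np.argsort(-scores[j])           # best-scoring images first
        ranks.append(int(np.where(order == correct)[0][0]) + 1)
    ranks = np.array(ranks)
    recalls = {n: float(np.mean(ranks <= n)) for n in ns}
    return ranks.mean(), recalls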
Figure 6: Images and their sentence descriptions assigned by the DT-RNN.

Image Search
Model      mRank   R@1    R@5    R@10
BoW        14.6    15.8   42.2   60.0
kCCA       15.9    16.4   41.4   58.0
SDT-RNN    12.5    16.4   46.6   65.6

Describing Images
Model      mRank   R@1    R@5    R@10
BoW        21.1    19.0   38.0   57.0
kCCA       18.0    21.0   47.0   61.0
SDT-RNN    16.9    23.0   45.0   63.0

Table 3: Evaluation comparison between the mean rank of the closest correct image or sentence (lower is better) and recall at different thresholds (higher is better). With one exception (R@5, bottom table), the SDT-RNN outperforms the other two models and all other models we did not include here.

Table 3 shows that the measures do correlate well and the SDT-RNN also performs best on the multimodal ranking tasks when evaluated with this measure.

6.6 Error Analysis

In order to understand the main problems with the composed sentence vectors, we analyze the sentences that have the worst nearest neighbor rank between each other. We find that the main failure mode of the SDT-RNN occurs when a sentence that should describe the same image does not use a verb but the other sentences of that image do include a verb. For example, the following sentence pair has vectors that are very far apart from each other even though they are supposed to describe the same image:

1. A blue and yellow airplane flying straight down while emitting white smoke
2. Airplane in dive position

Generally, as long as both sentences either have a verb or do not, the SDT-RNN is more robust to different sentence lengths than bag of words representations.

6.7 Model Analysis: Semantic Composition Matrices

The best model uses composition matrices based on semantic relationships from the dependency parser. We give some insights into what the model learns by listing the composition matrices with the largest Frobenius norms. Intuitively, these matrices have learned larger weights that are being multiplied with the child vector in the tree and hence that child will have more weight in the final composed parent vector. In decreasing order of Frobenius norm, the relationship matrices are: nominal subject, possession modifier (e.g. their), passive auxiliary, preposition at, preposition in front of, passive auxiliary, passive nominal subject, object of preposition, preposition in and preposition on. The model learns that nouns are very important, as well as their spatial prepositions and adjectives.

7 Conclusion

We introduced a new recursive neural network model that is based on dependency trees. For evaluation, we use the challenging task of mapping sentences and images into a common space for finding one from the other. Our new model outperforms baselines and other commonly used models that can compute continuous vector representations for sentences. In comparison to related models, the DT-RNN is more invariant and robust to surface changes such as word order.
References

G. Andrew, R. Arora, K. Livescu, and J. Bilmes. 2013. Deep canonical correlation analysis. In ICML, Atlanta, Georgia.

K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, and M. Jordan. 2003. Matching words and pictures. JMLR.

M. Baroni and A. Lenci. 2010. Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4):673–721.

R. Collobert and J. Weston. 2008. A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of ICML, pages 160–167.

F. Costa, P. Frasconi, V. Lombardo, and G. Soda. 2003. Towards incremental parsing of natural language using recursive neural networks. Applied Intelligence.

M. de Marneffe, B. MacCartney, and C. D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In LREC.

J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng. 2012. Large scale distributed deep networks. In NIPS.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In CVPR.

J. Duchi, E. Hazan, and Y. Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12, July.

P. Duygulu, K. Barnard, N. de Freitas, and D. Forsyth. 2002. Object recognition as machine translation. In ECCV.

A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. 2010. Every picture tells a story: Generating sentences from images. In ECCV.

Y. Feng and M. Lapata. 2013. Automatic caption generation for news images. IEEE Trans. Pattern Anal. Mach. Intell., 35.

A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov. 2013. DeViSE: A deep visual-semantic embedding model. In NIPS.

C. Goller and A. Küchler. 1996. Learning task-dependent distributed representations by backpropagation through structure. In Proceedings of the International Conference on Neural Networks.

E. Grefenstette, G. Dinu, Y.-Z. Zhang, M. Sadrzadeh, and M. Baroni. 2013. Multi-step regression learning for compositional distributional semantics. In IWCS.

A. Gupta and L. S. Davis. 2008. Beyond nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers. In ECCV.

K. M. Hermann and P. Blunsom. 2013. The role of syntax in vector space models of compositional semantics. In ACL.

M. Hodosh, P. Young, and J. Hockenmaier. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. J. Artif. Intell. Res. (JAIR), 47:853–899.

E. H. Huang, R. Socher, C. D. Manning, and A. Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In ACL.

K. Jarrett, K. Kavukcuoglu, M. A. Ranzato, and Y. LeCun. 2009. What is the best multi-stage architecture for object recognition? In ICCV.

A. Krizhevsky, I. Sutskever, and G. E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In NIPS.

G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. 2011. Baby talk: Understanding and generating image descriptions. In CVPR.

N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. 2009. Attribute and simile classifiers for face verification. In ICCV.

P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi. 2012. Collective generation of natural image descriptions. In ACL.

Q. V. Le, M. A. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean, and A. Y. Ng. 2012. Building high-level features using large scale unsupervised learning. In ICML.

T. Mikolov, W. Yih, and G. Zweig. 2013. Linguistic regularities in continuous space word representations. In HLT-NAACL.

J. Mitchell and M. Lapata. 2010. Composition in distributional models of semantics. Cognitive Science, 34(8):1388–1429.

J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. 2011. Multimodal deep learning. In ICML.

V. Ordonez, G. Kulkarni, and T. L. Berg. 2011. Im2Text: Describing images using 1 million captioned photographs. In NIPS.

J. B. Pollack. 1990. Recursive distributed representations. Artificial Intelligence, 46, November.

C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier. 2010. Collecting image annotations using Amazon's Mechanical Turk. In Workshop on Creating Speech and Language Data with Amazon's MTurk.

R. Socher and L. Fei-Fei. 2010. Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora. In CVPR.

R. Socher, C. D. Manning, and A. Y. Ng. 2010. Learning continuous phrase representations and syntactic parsing with recursive neural networks. In Proceedings of the NIPS-2010 Deep Learning and Unsupervised Feature Learning Workshop.
R. Socher, E. H. Huang, J. Pennington, A. Y. Ng, and C. D. Manning. 2011a. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In NIPS.

R. Socher, C. Lin, A. Y. Ng, and C. D. Manning. 2011b. Parsing natural scenes and natural language with recursive neural networks. In ICML.

R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, and C. D. Manning. 2011c. Semi-supervised recursive autoencoders for predicting sentiment distributions. In EMNLP.

R. Socher, B. Huval, C. D. Manning, and A. Y. Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In EMNLP.

R. Socher, J. Bauer, C. D. Manning, and A. Y. Ng. 2013a. Parsing with compositional vector grammars. In ACL.

R. Socher, M. Ganjoo, C. D. Manning, and A. Y. Ng. 2013b. Zero-shot learning through cross-modal transfer. In NIPS.

R. Socher, M. Ganjoo, H. Sridhar, O. Bastani, C. D. Manning, and A. Y. Ng. 2013c. Zero-shot learning through cross-modal transfer. In Proceedings of the International Conference on Learning Representations (ICLR, Workshop Track).

N. Srivastava and R. Salakhutdinov. 2012. Multimodal learning with deep Boltzmann machines. In NIPS.

A. Torralba, K. P. Murphy, and W. T. Freeman. 2010. Using the forest to see the trees: exploiting context for visual object detection and localization. Communications of the ACM.

P. D. Turney and P. Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.

B. Yao, X. Yang, L. Lin, M. W. Lee, and S.-C. Zhu. 2010. I2T: image parsing to text description. IEEE Xplore.