Transactions of the Association for Computational Linguistics, vol. 6, pp. 133–144, 2018. Action Editor: Stefan Riezler.
Submission batch: 6/2017; Revision batch: 9/2017; Published 2/2018.
© 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Learning Representations Specialized in Spatial Knowledge: Leveraging Language and Vision

Guillem Collell
Department of Computer Science, KU Leuven
3001 Heverlee, Belgium
gcollell@kuleuven.be

Marie-Francine Moens
Department of Computer Science, KU Leuven
3001 Heverlee, Belgium
sien.moens@cs.kuleuven.be

Abstract

Spatial understanding is crucial in many real-world problems, yet little progress has been made towards building representations that capture spatial knowledge. Here, we move one step forward in this direction and learn such representations by leveraging a task consisting of predicting continuous 2D spatial arrangements of objects given object-relationship-object instances (e.g., "cat under chair") and a simple neural network model that learns the task from annotated images. We show that the model succeeds in this task and, furthermore, that it is capable of predicting correct spatial arrangements for unseen objects if either CNN features or word embeddings of the objects are provided. The differences between visual and linguistic features are discussed. Next, to evaluate the spatial representations learned in the previous task, we introduce a task and a dataset consisting of a set of crowdsourced human ratings of spatial similarity for object pairs. We find that both CNN (convolutional neural network) features and word embeddings predict human judgments of similarity well and that these vectors can be further specialized in spatial knowledge if we update them when training the model that predicts spatial arrangements of objects. Overall, this paper paves the way towards building distributed spatial representations, contributing to the understanding of spatial expressions in language.

1 Introduction

Representing spatial knowledge is instrumental in any task involving text-to-scene conversion, such as robot understanding of natural language commands (Guadarrama et al., 2013; Moratz and Tenbrink, 2006) or a number of robot navigation tasks. Despite recent advances in building specialized representations in domains such as sentiment analysis (Tang et al., 2014), semantic similarity/relatedness (Kiela et al., 2015) or dependency parsing (Bansal et al., 2014), little progress has been made towards building distributed representations (a.k.a. embeddings) specialized in spatial knowledge.

Intuitively, one may reasonably expect that the more attributes two objects share (e.g., size, functionality, etc.), the more likely they are to exhibit similar spatial arrangements with respect to other objects. Leveraging this intuition, we foresee that visual and linguistic representations can be spatially informative about unseen objects, as they encode features/attributes of objects (Collell and Moens, 2016). For instance, without having ever seen an "elephant" before, but only a "horse", one would probably picture the "elephant" carrying the "human" rather than the other way around, just by considering their size attribute. Similarly, one can infer that a "tablet" and a "book" will show similar spatial patterns (usually on a table, in someone's hands, etc.) although they barely show any visual resemblance, yet they are similar in size and functionality. In this paper we systematically study how informative visual and linguistic features, in the form of convolutional neural network (CNN) features and word embeddings, are about the spatial behavior of objects.

An important goal of this work is to learn distributed representations specialized in spatial knowledge.

As a vehicle to learn spatial representations, we leverage the task of predicting the 2D spatial arrangement of two objects under a relationship expressed by either a preposition (e.g., "below" or "on") or a verb (e.g., "riding", "jumping", etc.). For that, we make use of images where both objects are annotated with bounding boxes. For instance, in an image depicting (horse, jumping, fence) we reasonably expect to find the "horse" above the "fence". To learn the task, we employ a feedforward network that represents objects as continuous (spatial) features in an embedding layer and guides the learning with a distance-based supervision on the objects' coordinates. We show that the model fares well in this task and that, by informing it with either word embeddings or CNN features, it is able to output accurate predictions about unseen objects, e.g., predicting the spatial arrangement of (man, riding, bike) without having ever been exposed to a "bike" before. This result suggests that the semantic and visual knowledge carried by the visual and linguistic features correlates to a certain extent with the spatial properties of words, thus providing predictive power for unseen objects.

To evaluate the quality of the spatial representations learned in the previous task, we introduce a task consisting of a set of 1,016 human ratings of spatial similarity between object pairs. It is thus desirable for spatial representations that "spatially similar" objects (i.e., objects that are arranged spatially similarly in most situations and relative to other objects) have similar embeddings. On these ratings we show, first, that both CNN features and word embeddings are good predictors of human judgments, and second, that these vectors can be further specialized in spatial knowledge if we update them by backpropagation when learning the model in the task of predicting spatial arrangements of objects.

The rest of the paper is organized as follows. In Sect. 2 we review related research. In Sect. 3 we describe two spatial tasks and a model. In Sect. 4 we describe our experimental setup. In Sect. 5 we present and discuss our results. Finally, in Sect. 6 we summarize our contributions.

2 Related Work

Contrary to earlier rule-based approaches to spatial understanding (Kruijff et al., 2007; Moratz and Tenbrink, 2006), Malinowski and Fritz (2014) propose a learning-based method that learns the parameters of "spatial templates" (or regions of acceptability of an object under a spatial relation) using a pooling approach. They show improved performance in image retrieval and image annotation (i.e., retrieving sentences given a query image) over previous rule-based systems and methods that rely on handcrafted templates. Contrary to us, they restrict to relationships expressed by explicit spatial prepositions (e.g., "on" or "below") while we also consider actions (e.g., "jumping"). Moreover, they do not build spatial representations for objects.

Other approaches have shown the value of properly integrating spatial information into a variety of tasks. For example, Shiang et al. (2017) improve over state-of-the-art object recognition by leveraging previous knowledge of object co-occurrences and relative positions of objects (which they mine from text and the web) in order to rank possible object detections. In a similar fashion, Lin and Parikh (2015) leverage common sense visual knowledge (e.g., object locations and co-occurrences) in two tasks: fill-in-the-blank and visual paraphrasing. They compute the likelihood of a scene to identify the most likely answer to multiple-choice textual scene descriptions. In contrast, we focus solely on spatial information rather than semantic plausibility. Moreover, our primary target is to build (spatial) representations.

Alternatively, Elliott and Keller (2013) annotate geometric relationships between objects in images (e.g., they add an "on" link between "man" and "bike" in an image of a "man" "riding" a "bike") to better infer the action present in the image. For instance, if the "man" is next to the bike one can infer that the action "repairing" is more likely than "riding" in this image. Accounting for this extra spatial structure allows them to outperform bag-of-features methods in an image captioning task. In contrast with those works, which restrict to a small domain of 10 actions (e.g., "taking a photo", "riding", etc.), our goal is to generalize to any unseen/rare objects and actions by learning from frequent spatial configurations and objects, and critically, by leveraging representations of objects. Recent work (Collell et al., 2018) tackles the research question of whether relative spatial arrangements can be predicted equally well from actions (e.g., "riding") as from spatial prepositions (e.g., "below"), and how to interpret the learned weights of the network. In contrast, our research questions concern spatial representations.


Crucially, none of the studies above have considered or attempted to learn distributed spatial representations of objects, nor studied how much spatial knowledge is contained in visual and linguistic representations.

The existence of quantitative, continuous spatial representations of objects has been formerly discussed, yet to our knowledge, not systematically investigated before. For instance, Forbus et al. (1991) conjectured that "there is no purely qualitative, general purpose representation of spatial properties", further emphasizing that the quantitative component is strictly necessary. It is also worth commenting on early work aimed at enhancing the understanding of natural spatial language such as the L0 project (Feldman et al., 1996). In the context of this project, Regier (1996) proposed a connectionist model that learns to predict a few spatial prepositions ("above", "below", "left", "right", "in", "out", "on", and "off") from low-resolution videos containing a limited set of toy objects (circle, square, etc.). In contrast, we consider an unlimited vocabulary of real-world objects, and we do not restrict to spatial prepositions but include actions as well. Hence, Regier's (1996) setting does not seem plausible for dealing with actions given that, in contrast to the spatial prepositions that they use, which are mutually exclusive (an object cannot be "above" and simultaneously "below" another object), actions are not. In particular, actions exhibit large spatial overlap and, therefore, attempting to predict thousands of different actions from the relative locations of the objects seems infeasible. Additionally, Regier's (1996) architecture does not allow meaningfully extracting representations of objects from the visual input, which yields rather visual features.

Here, we propose an ad hoc setting for both learning and evaluating spatial representations. In particular, instead of learning to predict spatial relations from visual input as in Regier's (1996) work, we learn the reverse direction, i.e., to map the relation (and two objects) to their visual spatial arrangement. By backpropagating the embeddings of the objects while learning the task, we enable learning spatial representations. As a core finding, we show in an ad hoc task, namely our collected human ratings of spatial similarity, that the learned features are more specialized in spatial knowledge than the CNN features and word embeddings that were used to initialize the parameters of the embeddings.

3 Tasks and Model

Here, we first describe the Prediction task and the model that we use to learn the spatial representations. We subsequently present the spatial Similarity task, which is employed to evaluate the quality of the learned representations.

3.1 Prediction Task

To evaluate the ability of a model or embeddings to learn spatial knowledge, we employ the task of predicting the spatial location of an Object ("O") relative to a Subject ("S") under a Relationship ("R"). Let O^c = (O^c_x, O^c_y) denote the coordinates of the center ("c") of the Object's bounding box, where O^c_x ∈ R and O^c_y ∈ R are its x and y components. Let O^b = (O^b_x, O^b_y) be one half of the vertical (O^b_y ∈ R) and horizontal (O^b_x ∈ R) sizes of the Object's bounding box ("b"). A similar notation applies to the Subject (i.e., S^c and S^b), and we denote model predictions with a hat, Ô^c, Ô^b. The task is to learn a mapping from the structured textual input (Subject, Relationship, Object), abbreviated by (S, R, O), to the output consisting of the Object's center coordinates O^c and its size O^b (see Fig. 1). We notice that a "Subject" is not necessarily a syntactic subject but simply a convenient notation to accommodate the case where the Relationship (R) is an action (e.g., "riding" or "wearing"), while when R is a spatial preposition (e.g., "below" or "on") the Subject simply denotes the referent object. Similarly, the Object is not necessarily a direct object.¹

¹ We prefer to adhere to the terminology used to express entity-relationships in the Visual Genome dataset, but are aware of annotation schemes for spatial semantics (Pustejovsky et al., 2012). However, a one-to-one mapping of the Visual Genome terminology to these annotation schemes is not always possible.
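To make this notation concrete, the following short Python sketch (our illustration; the helper name and the corner-based input format are assumptions, not part of the paper's code release) converts a bounding box given by its pixel corners into the center/half-size encoding (O^c, O^b).

def box_to_center_halfsize(x_min, y_min, x_max, y_max):
    # Center of the box: O^c = (O^c_x, O^c_y) (analogously S^c for the Subject).
    cx = (x_min + x_max) / 2.0
    cy = (y_min + y_max) / 2.0
    # Half of the horizontal and vertical sizes: O^b = (O^b_x, O^b_y).
    bx = (x_max - x_min) / 2.0
    by = (y_max - y_min) / 2.0
    return (cx, cy), (bx, by)

# Example: a 100 x 60 pixel box with top-left corner (10, 20)
# yields center (60.0, 50.0) and half-sizes (50.0, 30.0).

The same conversion applied to the Subject's box yields (S^c, S^b).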


3.2 Regression Model

[Figure 1: Overview of the model (right) and the image pre-processing setting (left). Left: the original image (with bounding boxes) is pre-processed by re-scaling the coordinates and mirroring the image (if needed). Right: the input structured text Subj.-Rel.-Obj. (e.g., "man walking dog") goes through an embedding layer (vis. or lang.), is concatenated with the Subject's center [S^c_x, S^c_y] and size [S^b_x, S^b_y], and passes through composition layers to the output coordinates and size [O^c_x, O^c_y, O^b_x, O^b_y] (learning by regression).]

Following the task above (Sect. 3.1), we consider a model (Fig. 1) that takes a triplet of words (S, R, O) as input and maps their one-hot² vectors w_S, w_R, w_O to d-dimensional dense vectors w_S W_S, w_R W_R, w_O W_O via dot product with their respective embedding matrices W_S ∈ R^(d×|V_S|), W_R ∈ R^(d×|V_R|), W_O ∈ R^(d×|V_O|), where |V_S|, |V_R|, |V_O| are the vocabulary sizes. The embedding layer models our intuition that spatial properties of objects can be, to a certain extent, encoded with a vector of continuous features. In this work we test two types of embeddings, visual and linguistic. The next layer simply concatenates the three embeddings together with the Subject's size S^b and Subject center S^c. The inclusion of the Subject's size is aimed at providing a reference size to the model in order to predict the size of the Object O^b.³ The resulting concatenated vector [w_S W_S, w_R W_R, w_O W_O, S^c, S^b] is then fed into a hidden layer(s) which acts as a composition function for the triplet (S, R, O):

z = f(W_h [w_S W_S, w_R W_R, w_O W_O, S^c, S^b] + b_h)

where f(·) is the non-linearity and W_h and b_h are the parameters of the layer. These "composition layers" allow distinguishing between, e.g., (man, walks, horse), which is spatially distinct from (man, rides, horse). We find that adding more layers generally improves performance, so the output z above can simply be composed with more layers, i.e., f(W_h2 z + b_h2). Finally, a linear output layer tries to match the ground truth targets y = (O^c, O^b) using a mean squared error (MSE) loss function:

Loss(y, ŷ) = ||ŷ − y||²

where ŷ = (Ô^c, Ô^b) is the model prediction and ||·|| denotes the Euclidean norm. Critically, unlike CNNs, the model does not make use of the pixels (which are discarded during the image pre-processing; see Fig. 1 and Sect. 3.2.1), but learns exclusively from image coordinates, yielding a simpler model focused solely on spatial information.

² One-hot vectors (a.k.a. one-of-k) are sparse vectors with 0 everywhere except for a 1 at the position of the k-th word.

³ We find that without inputting S^b to the model, it still learns to predict an "average size" for each Object due to the MSE penalty. However, to keep the design cleaner and intuitive, here we provide the size of the Subject to the model.
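For concreteness, the following is a minimal sketch of this regression model in the Keras functional API (the paper reports a Keras implementation, but this sketch uses the present-day tf.keras API; the variable names, the example vocabulary sizes and the exact wiring are our assumptions, not the authors' released code). It concatenates the three embeddings with the Subject's center and size, applies two ReLU composition layers of 100 units and a linear output layer, and trains with the MSE loss and RMSprop as described in Sect. 4.5.

from tensorflow.keras import Model, layers, optimizers

V_S, V_R, V_O = 7812, 2214, 7812   # illustrative vocabulary sizes (cf. Sect. 4.1)
d = 300                            # embedding dimensionality (e.g., GloVe)

s_in = layers.Input(shape=(1,), dtype="int32", name="subject_id")
r_in = layers.Input(shape=(1,), dtype="int32", name="relationship_id")
o_in = layers.Input(shape=(1,), dtype="int32", name="object_id")
sc_in = layers.Input(shape=(2,), name="subject_center")   # S^c = (S^c_x, S^c_y)
sb_in = layers.Input(shape=(2,), name="subject_size")     # S^b = (S^b_x, S^b_y)

# The Embedding layers play the role of W_S, W_R, W_O; they can be initialized
# with pre-trained vectors (INI) or random ones (RND), and set trainable (U) or frozen (NU).
emb_s = layers.Flatten()(layers.Embedding(V_S, d, name="W_S")(s_in))
emb_r = layers.Flatten()(layers.Embedding(V_R, d, name="W_R")(r_in))
emb_o = layers.Flatten()(layers.Embedding(V_O, d, name="W_O")(o_in))

# Concatenate [w_S W_S, w_R W_R, w_O W_O, S^c, S^b] and compose it with two
# hidden layers of 100 ReLU units (the "composition layers").
h = layers.Concatenate()([emb_s, emb_r, emb_o, sc_in, sb_in])
h = layers.Dense(100, activation="relu")(h)
h = layers.Dense(100, activation="relu")(h)
y_hat = layers.Dense(4, activation="linear")(h)   # (O^c_x, O^c_y, O^b_x, O^b_y)

model = Model([s_in, r_in, o_in, sc_in, sb_in], y_hat)
model.compile(optimizer=optimizers.RMSprop(learning_rate=1e-4), loss="mse")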


3.2.1 Image Pre-Processing

We perform the following pre-processing steps on the images before feeding them to the model.

(i) Normalize the image coordinates by the number of pixels of each axis (vertical and horizontal). This step guarantees that coordinates are independent of the resolution of the image and always lie within the [0,1] × [0,1] square, i.e., S^c, O^c ∈ [0,1]².

(ii) Mirror the image (when necessary). We notice that the distinction between right and left is arbitrary in images since a mirrored image completely preserves its spatial meaning. For instance, a "man" "feeding" an "elephant" can be arbitrarily at either side of the "elephant", while a "man" "riding" an "elephant" cannot arbitrarily be below or above the "elephant". This left/right arbitrariness has also been acknowledged in prior work (Singhal et al., 2003). Thus, to enable a more meaningful learning, we mirror the image when (and only when) the Object is at the left-hand side of the Subject.⁴ The choice of leaving the Object always to the right-hand side is arbitrary and does not entail a loss of generality, i.e., we can consider left/right symmetrically reflected predictions as equiprobable. Mirroring thus provides a more realistic performance evaluation in the Prediction task and enables learning representations independent of the right/left distinction, which is irrelevant for the spatial semantics.

⁴ The only conflicting case for the "mirroring" transformation is when the Relationship (R) is either "left" or "right", e.g., (man, left of, car), yet these only account for an insignificant proportion of instances (<0.1%) and thus we leave them out.
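A minimal sketch of these two steps in Python (our own illustration; the function names are hypothetical). It assumes the center/half-size encoding of Sect. 3.1 and image coordinates whose origin is the top-left corner.

def normalize_coords(center, halfsize, img_width, img_height):
    # Step (i): rescale pixel coordinates so that they lie in [0,1] x [0,1].
    cx, cy = center
    bx, by = halfsize
    return (cx / img_width, cy / img_height), (bx / img_width, by / img_height)

def mirror_if_needed(subj_center, obj_center):
    # Step (ii): if the Object lies to the left of the Subject, reflect both
    # x-coordinates around the vertical axis of the (normalized) image so that
    # the Object always ends up on the right-hand side. Half-sizes are unchanged.
    if obj_center[0] < subj_center[0]:
        subj_center = (1.0 - subj_center[0], subj_center[1])
        obj_center = (1.0 - obj_center[0], obj_center[1])
    return subj_center, obj_center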
3.3 Spatial Similarity Task

To evaluate how well our embeddings match human mental representations of spatial knowledge about objects, we collect ratings for 1,016 word pairs (w1, w2), asking annotators to rate them by their spatial similarity. That is, objects that exhibit similar locations in most situations and are placed similarly relative to other objects would receive a high score, and a lower one otherwise. For example, (cap, sunglasses) would receive a high score as they are usually at the top of the human body, while following a similar logic, (cap, shoes) would receive a lower score. Our collected ratings establish the spatial counterpart to other existing similarity ratings such as semantic similarity (Silberer and Lapata, 2014), visual similarity (Silberer and Lapata, 2014) or general relatedness (Agirre et al., 2009). A few exemplars of ratings are shown in Tab. 1. Following standard practices (Pennington et al., 2014), we compute the prediction of similarity between two embeddings s_w1 and s_w2 (representing words w1 and w2) with their cosine similarity:

cos(s_w1, s_w2) = (s_w1 · s_w2) / (||s_w1|| ||s_w2||)

We notice that this spatial Similarity task does not involve learning and its main purpose is to evaluate the quality of the representations learned in the Prediction task (Sect. 3.1) and the spatial informativeness of visual and linguistic features.

Word pair            Rating    Word pair            Rating
(snowboard, feet)    7.2       (horns, backpack)    1.8
(ears, eye)          8.3       (baby, bag)          7
(cockpit, table)     2.4       (hair, laptop)       1.8
(cap, hair)          9         (earring, racket)    2
(frisbee, food)      2.4       (ears, hat)          5.6

Table 1: Examples of our collected similarity ratings.
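The cosine prediction of similarity amounts to a couple of lines of NumPy (a sketch of the standard computation, not code from the paper):

import numpy as np

def cosine_similarity(s_w1, s_w2):
    # cos(s_w1, s_w2) = (s_w1 . s_w2) / (||s_w1|| ||s_w2||)
    return float(np.dot(s_w1, s_w2) / (np.linalg.norm(s_w1) * np.linalg.norm(s_w2)))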
4 Experimental Setup

In this section we describe the experimental settings employed in the tasks and the model.

4.1 Visual Genome Data Set

We obtain our annotated data from Visual Genome (Krishna et al., 2017). This dataset contains 108,077 images and over 1.5M human-annotated object-relationship-object instances (S, R, O) with their corresponding bounding boxes for the Object and Subject. We keep only those examples for which we have embeddings available (see Sect. 4.3). This yields ∼1.1M instances of the form (S, R, O), 7,812 unique image objects and 2,214 unique Relationships (R) for our linguistic embeddings; and ∼920K (S, R, O) instances, 4,496 unique image objects and 1,831 unique Relationships for our visual embeddings. We notice that visual representations do not exist for Relationships R (i.e., either prepositions or verbs) and therefore we only require visual embeddings for the pair (S, O) instead of the complete triplet (S, R, O) required in language. Notice that since we do not restrict to any particular domain (e.g., furniture or landscapes) the combinations (S, R, O) are markedly sparse, which makes learning our Prediction task especially challenging.

4.2 Evaluation Sets in the Prediction Task

In the Prediction task, we consider the following subsets of Visual Genome (Sect. 4.1) for evaluation purposes:

(i) Original set: a test split from the original data which contains instances unseen at training time. That is, the test combinations (S, R, O) might have been seen at training time, yet in different instances (e.g., in different images). This set contains a large number of noisy combinations such as (people, walk, funny) or (metal, white, chandelier).

(ii) Unseen Words set: We randomly select a list of 25 objects (e.g., "wheel", "camera", "elephant", etc.) among the 100 most frequent objects in Visual Genome.⁵ We choose them among the most frequent ones in order to avoid meaningless objects such as "gate2", "number40" or "2:10pm", which are not infrequent in Visual Genome. We then take all instances of combinations that contain any of these words, yielding ∼123K instances. For example, since "cap" is in our list, (girl, wears, cap) is included in this set. When we enforce "unseen" conditions, we remove all these instances from the training set, using them only for testing.

⁵ The complete list of objects is: [leaves, foot, wheel, t-shirt, ball, handle, skirt, stripe, trunk, face, camera, socks, tail, pants, elephant, ear, helmet, vest, shoe, eye, coat, skateboard, apple, cap, motorcycle].

4.3 Visual and Linguistic Features

As our linguistic representations, we employ 300-dimensional GloVe vectors (Pennington et al., 2014) trained on the Common Crawl corpus with 840B tokens and a 2.2M-word vocabulary.⁶

We use the publicly available visual representations from Collell et al. (2017).⁷ They extract 128-dimensional visual features with the forward pass of a VGG-128 (Visual Geometry Group) CNN model (Chatfield et al., 2014) pre-trained on ImageNet (Russakovsky et al., 2015). The representation of a word is the averaged feature vector (centroid) of all images in ImageNet for this concept. They only keep words with at least 50 images available. We notice that although we employ visual features from an external source (ImageNet), these could be alternatively obtained in the Visual Genome data, although ImageNet generally provides a larger number of images per concept.

⁶ http://nlp.stanford.edu/projects/glove
⁷ http://liir.cs.kuleuven.be/software.php
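As an illustration of how the centroid visual representations of Sect. 4.3 can be obtained, the sketch below averages per-image CNN feature vectors of a concept (the VGG-128 forward pass that produces the per-image features is omitted; the function itself is our assumption, not the authors' code, and the 50-image threshold follows Collell et al. (2017)).

import numpy as np

def centroid_embedding(image_features, min_images=50):
    # image_features: array of shape (n_images, 128) with one VGG-128 vector per image.
    feats = np.asarray(image_features, dtype=np.float32)
    if feats.shape[0] < min_images:
        return None                    # concept discarded: too few images available
    return feats.mean(axis=0)          # centroid = averaged feature vector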
4.4 Method Comparison

We consider two types of models, those that update the parameters of the embeddings (U ∼ "Update") and those that keep them fixed (NU ∼ "No Update") when learning the Prediction task. For each type (U and NU) we consider two conditions, embeddings initialized with pre-trained vectors (INI) and random embeddings (RND) randomly drawn from a component-wise normal distribution with mean and standard deviation equal to those of the original embeddings. For example, U-RND corresponds to a model with updated, random embeddings. For the INI methods we also add a subindex indicating whether the embeddings are visual (vis) or linguistic (lang), as described in Sect. 4.3.⁸ For the NU type we additionally consider one-hot embeddings (1H). We also include a control method (rand-pred) that outputs random uniform predictions.

⁸ Given that visual representations are not available for the Relationships (i.e., verbs and prepositions), the models with vis embeddings employ one-hot embeddings for the Relationships and visual embeddings for Object and Subject. This is a rather neutral choice that enables the vis models to use Relationships.

4.5 Implementation Details and Validation

To validate results in our Prediction task we employ a 10-fold cross-validation (CV) scheme. That is, we split the data into 10 parts and employ 90% of the data for training and 10% for testing. This yields 10 embeddings (for each "U" method), which are then evaluated in our Similarity task. In both tasks, we report results averaged across the 10 folds.

Model hyperparameters are first selected by cross-validation in 10 initial splits and results are reported in 10 new splits. All models employ a learning rate of 0.0001 and are trained for 10 epochs by backpropagation with the RMSprop optimizer. The dimensionality of the embeddings is the original one, i.e., d = 300 for GloVe and d = 128 for VGG-128 (Sect. 4.3), which is preserved for the random-embedding methods RND (Sect. 4.4). Models employ 2 hidden layers with 100 Rectified Linear Units (ReLU), followed by an output layer with a linear activation. Early stopping is employed as a regularizer. We implement our models with the Keras deep learning framework in Python 2.7 (Chollet et al., 2015).

4.6 Spatial Similarity Task

To build the word pairs, we randomly select a list of objects from Visual Genome and from these we randomly choose 1,016 non-repeated word pairs (w1, w2). Ratings are collected with the Crowdflower⁹ platform and correspond to averages of at least 5 reliable annotators¹⁰ that provided ratings on a discrete scale from 1 to 10. The median similarity rating is 3.3 and the mean variance between annotators per word pair is ∼1.2.

⁹ https://www.crowdflower.com/
¹⁰ Reliable annotators are those with performance over 70% on the test questions (16 in our case) that the crowdsourcing platform allows us to introduce in order to test annotators' accuracy.

4.7 Evaluation Metrics

4.7.1 Prediction Task

We evaluate model predictions with the following metrics.

(I) Regression metrics.

(i) Mean Squared Error (MSE) between the predicted ŷ = (Ô^c, Ô^b) and the true y = (O^c, O^b) Object center coordinates and Object size. Notice that since O^c, O^b are within [0,1]², the MSE is easily interpretable, ranging between 0 and 1.

(ii) Pearson Correlation (r) between the predicted Ô^c and the true O^c Object center coordinates. We consider the vertical (r_y) and horizontal (r_x) components separately (i.e., O^c_x and O^c_y).

(iii) Coefficient of Determination (R²) of the predictions ŷ = (Ô^c, Ô^b) and the target y = (O^c, O^b). R² is employed to evaluate the goodness of fit of a regression model and is related to the percentage of variance of the target explained by the predictions. The best possible score is 1 and it can be arbitrarily negative for bad predictions. A model that outputs either random or constant predictions would obtain scores close to 0 and exactly 0, respectively.

(II) Classification. Additionally, given the semantic distinction between the vertical and horizontal axes noted above (Sect. 3.2.1), we consider the classification problem of predicting above/below relative locations. That is, if the predicted y-coordinate of the Object center Ô^c_y falls below the y-coordinate of the Subject center S^c_y and the actual Object center O^c_y is below the Subject center S^c_y, we count it as a correct prediction, and as incorrect otherwise. Likewise for above predictions. We compute both macro-averaged¹¹ accuracy (acc_y) and macro-averaged F1 (F1_y) metrics.

¹¹ Macro-averaged accuracy equals the average of per-class accuracies, with classes {above, below}. Similarly for F1.

(III) Intersection over Union (IoU). We consider the bounding box overlap (IoU) from the VOC detection task (Everingham et al., 2015): IoU = area(B̂_O ∩ B_O) / area(B̂_O ∪ B_O), where B̂_O and B_O are the predicted and ground truth Object boxes, respectively. A prediction is counted as correct if the IoU is larger than 50%. Crucially, we notice that our setting and results are not comparable to object detection as we employ text instead of images as input and thus we cannot leverage the pixels to locate the Object, unlike in detection.

4.7.2 Similarity Task

Following standard practices (Pennington et al., 2014), the performance of the predictions of (cosine) similarity from the embeddings (described in Sect. 3.3) is evaluated with the Spearman correlation ρ against the crowdsourced human ratings.
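A sketch of these metrics with NumPy/SciPy/scikit-learn (our own helper functions under the conventions of Sect. 3.1 and 3.2.1; the authors' exact evaluation code is not shown in the paper). Image y-coordinates are assumed to grow downwards, so "below" corresponds to a larger y value.

import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import mean_squared_error, r2_score

def prediction_metrics(y_true, y_pred, subj_cy):
    # y_true, y_pred: arrays of shape (n, 4) with columns (Ocx, Ocy, Obx, Oby);
    # subj_cy: array of shape (n,) with the Subject center y-coordinate S^c_y.
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    r_x = pearsonr(y_pred[:, 0], y_true[:, 0])[0]
    r_y = pearsonr(y_pred[:, 1], y_true[:, 1])[0]
    true_below = y_true[:, 1] > subj_cy
    pred_below = y_pred[:, 1] > subj_cy
    # Macro-averaged accuracy: average of the per-class (above / below) accuracies.
    acc_y = np.mean([np.mean(pred_below[true_below == c] == c) for c in (True, False)])
    return {"MSE": mse, "R2": r2, "rx": r_x, "ry": r_y, "accy": acc_y}

def iou(pred_box, true_box):
    # Boxes given as (cx, cy, bx, by) with half-sizes; a prediction counts as
    # correct when iou(...) > 0.5.
    (pcx, pcy, pbx, pby), (tcx, tcy, tbx, tby) = pred_box, true_box
    iw = max(0.0, min(pcx + pbx, tcx + tbx) - max(pcx - pbx, tcx - tbx))
    ih = max(0.0, min(pcy + pby, tcy + tby) - max(pcy - pby, tcy - tby))
    inter = iw * ih
    union = 4 * pbx * pby + 4 * tbx * tby - inter
    return inter / union if union > 0 else 0.0

def similarity_metric(predicted_sims, human_ratings):
    # Sect. 4.7.2: Spearman correlation against the crowdsourced ratings.
    return spearmanr(predicted_sims, human_ratings).correlation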
5 Results and Discussion

We consider the notation of the methods from Sect. 4.4 and the evaluation subsets described in Sect. 4.2 for the Prediction task. To test statistical significance we employ a Friedman rank test and post hoc Nemenyi tests on the results of the 10 folds.

5.1 Prediction Task

Table 2 shows that the INI and RND¹² methods perform similarly in the Original test set, arguably because a large part of the learning takes place in the parameters of the layers subsequent to the embedding layer. However, in the next section we show that this is no longer the case when unseen words are present. We also observe that the one-hot embeddings NU-1H perform slightly better than the rest of the methods when no unseen words are present (Tab. 2 and Tab. 3, right).

¹² For simplicity, we do not add any subindex (vis or lang) to RND, yet these vectors are drawn from two different distributions, i.e., from either vis or lang embeddings (Sect. 4.4). Additionally, results tables show two blocks of methods since vis and lang do not share all the instances (see Sect. 4.1).

Method        MSE     R2       accy    F1y     rx      ry      IoU
U-INIlang     0.011   0.654    0.773   0.773   0.849   0.832   0.283
U-RNDlang     0.011   0.646    0.770   0.770   0.847   0.827   0.279
NU-INIlang    0.011   0.651    0.770   0.770   0.848   0.829   0.275
NU-RNDlang    0.011   0.636    0.766   0.766   0.845   0.822   0.268
NU-1H         0.010   0.659    0.777   0.778   0.850   0.833   0.297
rand-pred     0.794   -27.61   0.533   0.516   0.000   0.001   0.010
U-INIvis      0.011   0.627    0.766   0.766   0.841   0.820   0.266
U-RNDvis      0.012   0.612    0.762   0.762   0.836   0.810   0.244
NU-INIvis     0.012   0.611    0.765   0.763   0.837   0.813   0.246
NU-RNDvis     0.012   0.607    0.767   0.766   0.835   0.808   0.237
NU-1H         0.011   0.657    0.788   0.788   0.848   0.833   0.308
rand-pred     0.789   -27.51   0.534   0.519   0.000   0.000   0.010

Table 2: Results in the Original test set (Sect. 4.2). Boldface indicates best performance within the corresponding block of methods (lang above, and vis below).

It is also worth noting that the results of the Prediction task are, in fact, conservative. First, the Original test data contains a considerable number of meaningless (e.g., (giraffe, a, animal)) and irrelevant combinations (e.g., (clock, has, numbers) or (sticker, identifies, apple)). Second, even when only meaningful examples are considered, we are inevitably penalizing plausible predictions. For instance, in (man, watching, man) we expect both men to be reasonably separated on the x-axis, yet the one with the highest y coordinate is generally not predictable as it depends on their height and their distance to the camera. This lowers the above/below classification performance and the correlations. Regardless, all methods (except rand-pred) exhibit reasonably high performance on all measures.

5.1.1 Evaluation on Unseen Words

Table 3 evidences that both visual and linguistic embeddings (INIvis and INIlang) significantly outperform their random-embedding counterparts RND by a large margin when unseen words are present. The improvement occurs for both updated (U) and non-updated (NU) embeddings, although it is expected that the updated methods perform slightly worse than the non-updated ones, since the original embeddings will have "moved" during training and therefore an unseen embedding (which has not been updated) might no longer be close to other semantically similar vectors in the updated space.

Besides statistical significance, it is worth mentioning that the INI methods consistently outperformed both their RND counterparts and NU-1H in each of the 10 folds (not shown here) by a steadily large margin. In fact, results are markedly stable across folds, in part due to the large size of the training and test sets (>0.9M and >120K examples, respectively). Moreover, to ensure that "unseen" results are not dependent on our particular list of objects, we repeated the experiment with two additional lists of randomly selected objects, obtaining very similar results.

Remarkably, the INI methods experience only a small performance drop under unseen conditions (Tab. 3, left) compared to when we allow them to train with these words (Tab. 3, right), and this difference might be partially attributed to the reduction of the training data under "unseen" conditions, where at least 10% of the training data are left out.

Altogether, these results on unseen words show that semantic and visual similarities between concepts, as encoded by word and visual embeddings, can be leveraged by the model in order to predict spatial knowledge about unseen words.¹³

¹³ In this Prediction task we have additionally considered the concatenation of visual and linguistic representations (not shown), which did not show any relevant improvement over the unimodal representations.


             Unseen words condition                                          Seen words condition
Method       MSE      R2       accy     F1y      rx       ry       IoU        MSE     R2      accy    F1y     rx      ry      IoU
U-INIlang    0.011*◇  0.584*◇  0.712*◇  0.710*◇  0.877*◇  0.770*   0.131*◇    0.007   0.736*  0.810   0.810   0.901   0.876   0.223
U-RNDlang    0.015    0.422    0.603    0.601    0.863    0.624    0.090      0.007   0.730   0.806   0.806   0.899   0.874   0.223
NU-INIlang   0.009*◇  0.663*◇  0.770*◇  0.770*◇  0.888*◇  0.835*◇  0.164*◇    0.007   0.734*  0.805   0.805   0.900   0.875   0.221
NU-RNDlang   0.016    0.405    0.600    0.598    0.864    0.617    0.101      0.007   0.721   0.803   0.803   0.898   0.871   0.212
NU-1H        0.015    0.465    0.608    0.607    0.867    0.642    0.098      0.007   0.740   0.814   0.813   0.901   0.877   0.243
rand-pred    0.843    -38.32   0.524    0.501    0.000    0.000    0.012      0.845   -38.42  0.524   0.500   -0.002  -0.001  0.012
U-INIvis     0.010*◇  0.599*◇  0.775*◇  0.774*◇  0.887*◇  0.801*◇  0.123*◇    0.007   0.726   0.816   0.816   0.904   0.874   0.200
U-RNDvis     0.017    0.360    0.581    0.578    0.867    0.513    0.082      0.008   0.711   0.812   0.811   0.901   0.864   0.174
NU-INIvis    0.010*◇  0.602*◇  0.777*◇  0.775*◇  0.887*◇  0.803*◇  0.123*◇    0.007   0.711   0.817   0.815   0.902   0.868   0.186
NU-RNDvis    0.017    0.366    0.574    0.572    0.867    0.536    0.085      0.008   0.706   0.820   0.819   0.901   0.862   0.171
NU-1H        0.015    0.437    0.618    0.617    0.867    0.601    0.078      0.006   0.760   0.841   0.841   0.910   0.885   0.256
rand-pred    0.840    -40.34   0.524    0.507    -0.001   0.000    0.012      0.840   -40.37  0.524   0.507   -0.002  -0.001  0.012

Table 3: Results in the Unseen Words set (Sect. 4.2). Left block (unseen words condition): results of enforcing "unseen" conditions, i.e., leaving out all words of the Unseen Words set from our training data. Right block (seen words condition): the models are evaluated on the same set but we allow them to train with the words from this set. Asterisks (*) on an INI method indicate significantly better performance (p < 0.05) than its RND counterpart (i.e., U-INI(emb type) is compared against U-RND, and NU-INI(emb type) against NU-RND). Diamonds (◇) indicate significantly better performance than NU-1H.

5.1.2 Qualitative Insight

Visual inspection of model predictions is instructive in order to gain insight into the spatial informativeness of visual and linguistic representations on unseen words. Figure 2 shows heatmaps of low (black) and high (white) probability regions for the objects. The "heat" for the Object is assumed to be normally distributed with mean (µ) equal to the predicted Object center Ô^c and standard deviation (σ) equal to the predicted Object size Ô^b (assuming independence of the x and y components, which yields the product of two Gaussians, one for each component x and y). The "heat" for the Subject is computed similarly, although with µ and σ equal to the actual Subject center S^c and size S^b, respectively.

[Figure 2: Heatmaps of predictions of the NU-INIlang, U-INIvis and NU-RND methods for the triplets (person, holding, apple), (zebra, has, tail) and (man, holding, camera). The unseen Objects are underlined (top of the image) and their corresponding four (cosine-based) nearest neighbors are shown below with their respective cosine similarities.]

The INI methods in Figure 2 illustrate the contribution of the embeddings to the spatial understanding of unseen objects. In general, both visual and linguistic embeddings enabled predicting meaningful spatial arrangements, yet for the sake of space we have only included three examples where: vis performs better than lang (third column), lang performs better than vis (second column), and both perform well (first column). We notice that the embeddings enable the model to infer that, e.g., since "camera" (unseen) is similar to "camcorder" (seen at training time), both must behave spatially similarly. Likewise, the embeddings enable predicting correctly the relative sizes of unseen objects. We also observe that when the embeddings are not informative enough, model predictions become less accurate. For instance, in NU-INIlang, some unrelated objects (e.g., "ipod") have embeddings similar to "apple", and analogously for NU-INIvis and "tail". We finally notice that predictions on unseen objects using random embeddings (RND) are markedly bad.
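The heat maps described above can be rendered with a few lines of NumPy; the following sketch (our illustration, not the authors' plotting code) evaluates the product of the two Gaussians on a grid over the normalized [0,1] × [0,1] image.

import numpy as np

def heat_map(center, size, grid=100):
    # center: predicted (or actual) center (c_x, c_y); size: half-sizes (b_x, b_y)
    # used as the standard deviations of the two independent Gaussians.
    xs = np.linspace(0.0, 1.0, grid)
    ys = np.linspace(0.0, 1.0, grid)
    gx = np.exp(-0.5 * ((xs - center[0]) / size[0]) ** 2)
    gy = np.exp(-0.5 * ((ys - center[1]) / size[1]) ** 2)
    # Outer product = product of the two Gaussians; rows index y, columns index x.
    # Normalization constants are omitted since only relative "heat" matters here.
    return np.outer(gy, gx)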
5.2 Spatial Similarity Task

Table 4 shows the results of evaluating the embeddings, including those learned in the Prediction task, against the human ratings of spatial similarity (Sect. 3.3). Hence, only the "updated" methods (U) are shown and we additionally include the concatenation of visual and linguistic embeddings CONC(GloVe+VGG-128) and the concatenation of the corresponding updated embeddings CONC(U-INIlang+U-INIvis).

Method                     LANG              V&L
GloVe                      0.543             0.535
VGG-128                    -                 0.459
CONC(GloVe+VGG-128)        -                 0.582
U-INIlang                  0.557 ± 0.0015*   0.558 ± 0.002*
U-INIvis                   -                 0.48 ± 0.0012*
CONC(U-INIlang+U-INIvis)   -                 0.6 ± 0.0015*
U-RND                      0.15 ± 0.0075     0.174 ± 0.0078
#word pairs                1016              839

Table 4: Spearman correlations between model predictions and human ratings. Standard errors across folds are shown for the methods that involve learning (second block). Columns correspond to the word pairs for which both embeddings (vis and lang) are available (V&L) and those for which only the linguistic embeddings are available (LANG). Asterisk (*) indicates significant improvement (p < 0.05) of a U-INI method of the second block (U-INIvis and U-INIlang) over its corresponding untrained embedding (i.e., VGG-128 or GloVe, respectively) from the first block.

The first thing to notice in Tab. 4 is that both visual and linguistic embeddings show good correlations with human spatial ratings (ρ > 0.45 and ρ > 0.53, respectively), suggesting that visual and linguistic features carry significant knowledge about spatial properties of objects. In particular, linguistic features seem to be more spatially informative than visual features.


Crucially, we observe a significant improvement of the U-INIvis over the original visual vectors (VGG-128) (p < 0.05) and of the U-INIlang over the original linguistic embeddings (GloVe) (p < 0.05), which evidence the effectiveness of training in the Prediction task as a method to further specialize embeddings in spatial knowledge. It is worth mentioning that these improvements are consistent in each of the 10 folds (not shown here) and markedly stable (see standard errors in Tab. 4).

We additionally observe that the concatenation of visual and linguistic embeddings CONC(GloVe+VGG-128) outperforms all unimodal embeddings by a margin, suggesting that the fusion of visual and linguistic features provides a more complete description of spatial properties of objects. Remarkably, the improvement is even larger for the concatenation of the embeddings updated during training, CONC(U-INIlang+U-INIvis), which obtains the highest performance overall.

Figure 3 illustrates the progressive specialization of our embeddings in spatial knowledge as we train them in our Prediction task. We notice that all embeddings improve, yet U-INIlang seem to worsen in quality when we over-train them, likely due to overfitting, as we do not use any regularizer besides early stopping. We also observe that although the random embeddings (RND) are the ones that benefit the most from the training, their performance is still far from that of U-INIvis and U-INIlang, suggesting the importance of visual and linguistic features to represent spatial properties of objects.

[Figure 3: Correlation between human ratings and embedding cosine similarities at each number of epochs. Spearman correlation (0 to 0.6) vs. number of epochs (0 to 30) for U-INIlang, U-INIvis, U-RND, GloVe and VGG-128.]

It is relevant to mention that in a pilot study we crowdsourced a different list of 1,016 object pairs where we employed 3 instead of 5 annotators per row. Results stayed remarkably consistent with those presented here; the improvement for the updated embeddings was in fact even larger.
Limitations of the current approach and future work

In order to keep the design clean in this first paper on distributed spatial representations, we employ a fully supervised setup. However, we notice that methods to automatically parse images (e.g., object detectors) and sentences are available.

A second limitation is the 2D simplification of the actual 3D world that our approach and the current spatial literature generally employ. Even though methods that infer 3D structure from 2D images exist, this is beyond the scope of this paper, which shows that a 2D treatment already enhances the learned spatial representations. It is also worth noting that the proposed regression setting trivially generalizes to 3D if suitable data are available, and in fact, we believe that the learned representations could further benefit from such an extension.

6 Conclusions

Altogether, this paper sheds light on the problem of learning distributed spatial representations of objects. To learn spatial representations we have leveraged the task of predicting the continuous 2D relative spatial arrangement of two objects under a relationship, and a simple embedding-based neural model that learns this task from annotated images. In the same Prediction task we have shown that both word embeddings and CNN features endow the model with great predictive power when it is presented with unseen objects. Next, in order to assess the spatial content of distributed representations, we have collected a set of 1,016 object pairs rated by spatial similarity. We have shown that both word embeddings and CNN features are good predictors of human spatial judgments. More specifically, we find that word embeddings (ρ = 0.535) tend to perform better than visual features (ρ ∼ 0.46), and that their combination (ρ ∼ 0.6) outperforms both modalities separately. Crucially, on the same ratings we have shown that by training the embeddings in the Prediction task we can further specialize them in spatial knowledge, making them more akin to human spatial judgments. To benchmark the task, we make the Similarity dataset and our trained spatial representations publicly available.¹⁴

¹⁴ https://github.com/gcollell/spatial-representations

Lastly, this paper contributes to the automatic understanding of spatial expressions in language. The lack of common sense knowledge has been recurrently argued as one of the main reasons why machines fail at exhibiting more "human-like" behavior in tasks (Lin and Parikh, 2015). Here, we have provided a means of compressing and encoding such common sense spatial knowledge about objects into distributed representations, further showing that these specialized representations correlate well with human judgments. In future work, we will also explore the application of our trained spatial embeddings in extrinsic tasks in which representing spatial knowledge is essential, such as robot navigation or robot understanding of natural language commands (Guadarrama et al., 2013; Moratz and Tenbrink, 2006). Robot navigation tasks such as assisting people with special needs (blind, elderly, etc.) are in fact becoming increasingly necessary (Ye et al., 2015) and require great understanding of spatial language and spatial connotations of objects.

Acknowledgments

This work has been supported by the CHIST-ERA EU project MUSTER¹⁵ and by the KU Leuven grant RUN/15/005. We additionally thank our anonymous reviewers for their insightful comments, which helped to improve the overall quality of the paper, and the action editors for their helpful assistance.

¹⁵ http://www.chistera.eu/projects/muster
References

Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Paşca, and Aitor Soroa. 2009. A Study on Similarity and Relatedness Using Distributional and WordNet-Based Approaches. In NAACL, pages 19–27. ACL.

Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2014. Tailoring Continuous Word Representations for Dependency Parsing. In ACL, pages 809–815.

Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Return of the Devil in the Details: Delving Deep into Convolutional Nets. In BMVC.

François Chollet et al. 2015. Keras. https://github.com/fchollet/keras.

Guillem Collell and Marie-Francine Moens. 2016. Is an Image Worth More than a Thousand Words? On the Fine-Grain Semantic Differences between Visual and Linguistic Representations. In COLING, pages 2807–2817. ACL.

Guillem Collell, Teddy Zhang, and Marie-Francine Moens. 2017. Imagined Visual Representations as Multimodal Embeddings. In AAAI, pages 4378–4384. AAAI.

Guillem Collell, Luc Van Gool, and Marie-Francine Moens. 2018. Acquiring Common Sense Spatial Knowledge through Implicit Spatial Templates. In AAAI. AAAI.

Desmond Elliott and Frank Keller. 2013. Image Description Using Visual Dependency Representations. In EMNLP, volume 13, pages 1292–1302.

Mark Everingham, S. M. Ali Eslami, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. 2015. The PASCAL Visual Object Classes Challenge: A Retrospective. International Journal of Computer Vision, 111(1):98–136.

Jerome Feldman, George Lakoff, David Bailey, Srini Narayanan, Terry Regier, and Andreas Stolcke. 1996. L0 - The First Five Years of an Automated Language Acquisition Project. Integration of Natural Language and Vision Processing, 10:205.

Kenneth D. Forbus, Paul Nielsen, and Boi Faltings. 1991. Qualitative spatial reasoning: The CLOCK project. Artificial Intelligence, 51(1-3):417–471.

Sergio Guadarrama, Lorenzo Riano, Dave Golland, Daniel Go, Yangqing Jia, Dan Klein, Pieter Abbeel, Trevor Darrell, et al. 2013. Grounding Spatial Relations for Human-Robot Interaction. In IROS, pages 1640–1647. IEEE.

Douwe Kiela, Felix Hill, and Stephen Clark. 2015. Specializing Word Embeddings for Similarity or Relatedness. In EMNLP, pages 2044–2048.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73.
Geert-Jan M. Kruijff, Hendrik Zender, Patric Jensfelt, and Henrik I. Christensen. 2007. Situated Dialogue and Spatial Organization: What, Where and Why? International Journal of Advanced Robotic Systems, 4(1):16.

Xiao Lin and Devi Parikh. 2015. Don't Just Listen, Use your Imagination: Leveraging Visual Common Sense for Non-Visual Tasks. In CVPR, pages 2984–2993.

Mateusz Malinowski and Mario Fritz. 2014. A Pooling Approach to Modelling Spatial Relations for Image Retrieval and Annotation. arXiv preprint arXiv:1411.5190v2.

Reinhard Moratz and Thora Tenbrink. 2006. Spatial Reference in Linguistic Human-Robot Interaction: Iterative, Empirically Supported Development of a Model of Projective Relations. Spatial Cognition and Computation, 6(1):63–107.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In EMNLP, volume 14, pages 1532–1543.

James Pustejovsky, Jessica Moszkowicz, and Marc Verhagen. 2012. A Linguistically Grounded Annotation Language for Spatial Information. Traitement Automatique des Langues, 53(2):87–113.

Terry Regier. 1996. The Human Semantic Potential: Spatial Language and Constrained Connectionism. MIT Press.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3):211–252.

Sz-Rung Shiang, Stephanie Rosenthal, Anatole Gershman, Jaime Carbonell, and Jean Oh. 2017. Vision-Language Fusion for Object Recognition. In AAAI, pages 4603–4610. AAAI.

Carina Silberer and Mirella Lapata. 2014. Learning Grounded Meaning Representations with Autoencoders. In ACL, pages 721–732.

Amit Singhal, Jiebo Luo, and Weiyu Zhu. 2003. Probabilistic Spatial Context Models for Scene Content Understanding. In CVPR, volume 1, pages 235–241. IEEE.

Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. 2014. Learning Sentiment-Specific Word Embedding for Twitter Sentiment Classification. In ACL, pages 1555–1565.

Cang Ye, Soonhac Hong, and Amirhossein Tamjidi. 2015. 6-DOF Pose Estimation of a Robotic Navigation Aid by Tracking Visual and Geometric Features. IEEE Transactions on Automation Science and Engineering, 12(4):1169–1180.