Transactions of the Association for Computational Linguistics, 2 (2014) 351–362. Action Editor: Hal Daume III.
Submitted 2/2014; Revised 5/2014; Published 10/2014. © 2014 Association for Computational Linguistics.
TREETALK: Composition and Compression of Trees for Image Descriptions

Polina Kuznetsova†, Vicente Ordonez‡, Tamara L. Berg‡, Yejin Choi††
† Stony Brook University, Stony Brook, NY. pkuznetsova@cs.stonybrook.edu
‡ UNC Chapel Hill, Chapel Hill, NC. {vicente,tlberg}@cs.unc.edu
†† University of Washington, Seattle, WA. yejin@cs.washington.edu

Abstract

We present a new tree-based approach to composing expressive image descriptions that makes use of naturally occurring web images with captions. We investigate two related tasks: image caption generalization and generation, where the former is an optional subtask of the latter. The high-level idea of our approach is to harvest expressive phrases (as tree fragments) from existing image descriptions, then to compose a new description by selectively combining the extracted (and optionally pruned) tree fragments. Key algorithmic components are tree composition and compression, both integrating tree structure with sequence structure. Our proposed system attains significantly better performance than previous approaches for both image caption generalization and generation. In addition, our work is the first to show the empirical benefit of automatically generalized captions for composing natural image descriptions.

1 Introduction

The web is increasingly visual, with hundreds of billions of user contributed photographs hosted online. A substantial portion of these images have some sort of accompanying text, ranging from keywords, to free text on web pages, to textual descriptions directly describing depicted image content (i.e. captions). We tap into the last kind of text, using naturally occurring pairs of images with natural language descriptions to compose expressive descriptions for query images via tree composition and compression.

Such automatic image captioning efforts could potentially be useful for many applications: from automatic organization of photo collections, to facilitating image search with complex natural language queries, to enhancing web accessibility for the visually impaired. On the intellectual side, by learning to describe the visual world from naturally existing web data, our study extends the domains of language grounding to the highly expressive language that people use in their everyday online activities.

There has been a recent spike in efforts to automatically describe visual content in natural language (Yang et al., 2011; Kulkarni et al., 2011; Li et al., 2011; Farhadi et al., 2010; Krishnamoorthy et al., 2013; Elliott and Keller, 2013; Yu and Siskind, 2013; Socher et al., 2014). This reflects the long standing understanding that encoding the complexities and subtleties of image content often requires more expressive language constructs than a set of tags. Now that visual recognition algorithms are beginning to produce reliable estimates of image content (Perronnin et al., 2012; Deng et al., 2012a; Deng et al., 2010; Krizhevsky et al., 2012), the time seems ripe to begin exploring higher level semantic tasks.

There have been two main complementary directions explored for automatic image captioning. The first focuses on describing exactly those items (e.g., objects, attributes) that are detected by vision recognition, which subsequently confines what should be described and how (Yao et al., 2010; Kulkarni et al., 2011; Kojima et al., 2002). Approaches in this direction could be ideal for various practical applications such as image description for the visually impaired. However, it is not clear whether the semantic expressiveness of these approaches can eventually scale up to the casual, but highly expressive language people naturally use in their online activities. In Figure 1, for example, it would be hard to compose "I noticed that this funny cow was staring at me" or "You can see these beautiful hills only in the countryside" in a purely bottom-up manner based on the exact content detected. The key technical bottleneck is that the range of describable content (i.e., objects, attributes, actions) is ultimately confined by the set of items that can be reliably recognized by state-of-the-art vision techniques.

[Figure 1: Harvesting phrases (as tree fragments) for the target image based on (partial) visual match. Panels: Target Image, Object, Action, Stuff, Scene. Example captions: "A cow standing in the water"; "I noticed that this funny cow was staring at me"; "A bird hovering in the grass"; "You can see these beautiful hills only in the countryside".]

The second direction, in a complementary avenue to the first, has explored ways to make use of the rich spectrum of visual descriptions contributed by online citizens (Kuznetsova et al., 2012; Feng and Lapata, 2013; Mason, 2013; Ordonez et al., 2011). In these approaches, the set of what can be described can be substantially larger than the set of what can be recognized, where the former is shaped and defined by the data, rather than by humans. This allows the resulting descriptions to be substantially more expressive, elaborate, and interesting than what would be possible in a purely bottom-up manner. Our work contributes to this second line of research.

One challenge in utilizing naturally existing multimodal data, however, is the noisy semantic alignment between images and text (Dodge et al., 2012; Berg et al., 2010). Therefore, we also investigate a related task of image caption generalization (Kuznetsova et al., 2013), which aims to improve the semantic image-text alignment by removing bits of text from existing captions that are less likely to be transferable to other images.

The high-level idea of our system is to harvest useful bits of text (as tree fragments) from existing image descriptions using detected visual content similarity, and then to compose a new description by selectively combining these extracted (and optionally pruned) tree fragments. This overall idea of composition based on extracted phrases is not new in itself (Kuznetsova et al., 2012); however, we make several technical and empirical contributions.

First, we propose a novel stochastic tree composition algorithm based on extracted tree fragments that integrates both tree structure and sequence cohesion into structural inference. Our algorithm permits a substantially higher level of linguistic expressiveness, flexibility, and creativity than those based on rules or templates (Kulkarni et al., 2011; Yang et al., 2011; Mitchell et al., 2012), while also addressing long-distance grammatical relations in a more principled way than those based on hand-coded constraints (Kuznetsova et al., 2012).

Second, we address image caption generalization as an optional subtask of image caption generation, and propose a tree compression algorithm that performs a light-weight parsing to search for the optimal set of tree branches to prune. Our work is the first to report empirical benefits of automatically compressed captions for image captioning.

The proposed approaches attain significantly better performance for both image caption generalization and generation tasks over competitive baselines and previous approaches. Our work results in an improved image caption corpus with automatic generalization, which is publicly available.¹

¹ http://ilp-cky.appspot.com/

2 Harvesting Tree Fragments

Given a query image, we retrieve images that are visually similar to the query image, then extract potentially useful segments (i.e., phrases) from their corresponding image descriptions. We then compose a new image description using these retrieved text fragments (§3). Extraction of useful phrases is guided by both visual similarity and the syntactic parse of the corresponding textual description. This extraction strategy, originally proposed by Kuznetsova et al. (2012), attempts to make the best use of linguistic regularities with respect to objects, actions, and scenes, making it possible to obtain richer textual descriptions than what current state-of-the-art vision techniques can provide in isolation. In all of our experiments we use the captioned image corpus of Ordonez et al. (2011), first pre-processing the corpus for relevant content by running deformable part model object detectors (Felzenszwalb et al., 2010). For our study, we run detectors for 89 object classes and set a high confidence threshold for detection.

As illustrated in Figure 1, for a query image detection, we extract four types of phrases (as tree fragments). First, we retrieve relevant noun phrases from images with visually similar object detections. We use color, texture (Leung and Malik, 1999), and shape (Dalal and Triggs, 2005; Lowe, 2004) based features encoded in a histogram of vector quantized responses to measure visual similarity. Second, we extract verb phrases for which the corresponding noun phrase takes the subject role. Third, from those images with "stuff" detections, e.g. "water", or "sky" (typically mass nouns), we extract prepositional phrases based on similarity of both visual appearance and relative spatial relationships between detected objects and "stuff". Finally, we use global "scene" similarity² to extract prepositional phrases referring to the overall scene, e.g., "at the conference," or "in the market".

² L2 distance between classification score vectors (Xiao et al., 2010).

We perform this phrase retrieval process for each detected object in the query image and generate one sentence for each object. All sentences are then combined together to produce the final description. Optionally, we apply image caption generalization (via compression) (§4) to all captions in the corpus prior to the phrase extraction and composition.

3 Tree Composition

We model tree composition as constraint optimization. The input to our algorithm is the set of retrieved phrases (i.e., tree fragments), as illustrated in §2. Let $P = \{p_0, \ldots, p_{L-1}\}$ be the set of all phrases across the four phrase types (objects, actions, stuff and scene). We assume a mapping function $pt : [0, L) \to T$, where $T$ is the set of phrase types, so that the
phrase type of $p_i$ is $pt(i)$. In addition, let $R$ be the set of PCFG production rules and $NT$ be the set of nonterminal symbols of the PCFG. The goal is to find and combine a good sequence of phrases $G$, $|G| \le |T| = N = 4$, drawn from $P$, into a final sentence. More concretely, we want to select and order a subset of phrases (at most one phrase of each phrase type) while considering both the parse structure and n-gram cohesion across phrasal boundaries.

Figure 2 shows a simplified example of a composed sentence with its corresponding parse structure. For brevity, the figure shows only one phrase for each phrase type, but in actuality there would be a set of candidate phrases for each type. Figure 3 shows the CKY-style representation of the internal mechanics of constraint optimization for the example composition from Figure 2. Each cell $ij$ of the CKY matrix corresponds to $G_{ij}$, a subsequence of $G$ starting at position $i$ and ending at position $j$. If a cell in the CKY matrix is labeled with a nonterminal symbol $s$, it means that the corresponding tree of $G_{ij}$ has $s$ as its root.

Although we visualize the operation using a CKY-style representation in Figure 3, note that composition requires more complex combinatorial decisions than CKY parsing due to two additional considerations. We are: (1) selecting a subset of candidate phrases, and (2) re-ordering the selected phrases (hence making the problem NP-hard). Therefore, we encode our problem using Integer Linear Programming (ILP) (Roth and tau Yih, 2004; Clarke and Lapata, 2008) and use the CPLEX (ILOG, Inc, 2006) solver.

3.1 ILP Variables

Variables for Sequence Structure: Variables $\alpha$ encode phrase selection and ordering:

$\alpha_{ik} = 1$ iff phrase $i \in P$ is selected for position $k \in [0, N)$    (1)

where $k$ is one of the $N = 4$ positions in a sentence.³ Additionally, we define variables for each pair of adjacent phrases to capture sequence cohesion.

³ The number of positions is equal to the number of phrase types, since we select at most one from each type.
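To make the size of this search space concrete, the following minimal sketch enumerates every selection and ordering of at most one phrase per type and scores each one, as a brute-force stand-in for the ILP. It is not the formulation used in this work: the candidate phrases, the `cohesion` callable, and the names `candidate_sequences` and `best_sequence` are hypothetical, and the sketch scores only sequence cohesion, ignoring parse structure.

```python
from itertools import combinations, permutations, product

PHRASE_TYPES = ("object", "action", "stuff", "scene")  # at most N = 4 positions

def candidate_sequences(phrases_by_type):
    """Yield every ordered selection with at most one phrase per type.

    Each yielded tuple plays the role of one assignment of the alpha_ik
    variables: element k of the tuple is the phrase placed at position k.
    """
    for size in range(1, len(PHRASE_TYPES) + 1):
        for types in combinations(PHRASE_TYPES, size):
            pools = [phrases_by_type.get(t, []) for t in types]
            for picked in product(*pools):       # one phrase from each chosen type
                yield from permutations(picked)  # every ordering of the selection

def best_sequence(phrases_by_type, cohesion):
    """Return the ordered phrase tuple with the highest summed pairwise cohesion.

    `cohesion(a, b)` is a hypothetical score for placing phrase b right after a.
    """
    def score(seq):
        return sum(cohesion(a, b) for a, b in zip(seq, seq[1:]))
    return max(candidate_sequences(phrases_by_type), key=score)
```

Even this toy version makes the combinatorial growth visible: the number of candidates multiplies with the number of phrases per type and with the number of orderings, which is why an ILP solver is used instead of enumeration.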
[Figures 2 and 3: parse structure and CKY-style matrix for the example composition "A cow in the countryside was staring at me in the grass"; node labels NP, PP, VP, PP, NP, S; indices i = 0, j = 2, k = 1; positions 0–3.]

… level and each node of that level, the algorithm has to decide which parse tag to choose. This process is represented by assignment of a particular tag to a matrix cell. The chosen tag must be a head of a rule; for example, cell 12 is assigned tag VP, corresponding to rule VP → VP PP. This rule connects leafs "going out to sea" and "in the ocean". The problem is to find a tag assignment for each cell of the matrix, given that some cells can be empty if they do not connect children cells. … latter correspond to children branches of the tree and belong to the previous diagonal in the left-to-right order. Also, we do not try all possible pairs of children from the previous diagonal. We use a technique similar to the one used in the CKY parsing approach. Matrix cell pairs corresponding to …
… variables (Equations 2, 4, 5). Constraints for a product of two variables have been discussed by Clarke and Lapata (2008). For Equation 2, we add the following constraints (similar constraints are also added for Equations 4, 5):

$\forall i,j,k:\quad \alpha_{ijk} \le \alpha_{ik}, \qquad \alpha_{ijk} \le \alpha_{j(k+1)}, \qquad \alpha_{ijk} + (1 - \alpha_{ik}) + (1 - \alpha_{j(k+1)}) \ge 1$    (7)

Consistency between Tree Leafs and Sequences: The ordering of phrases implied by $\alpha_{ijk}$ must be consistent with the ordering of phrases implied by the $\beta$ variables. This can be achieved by aligning the leaf cells (i.e., $\beta_{kks}$) in the CKY-style matrix with $\alpha$ variables as follows:

$\forall i,k:\quad \alpha_{ik} \le \sum_{s \in NT_i} \beta_{kks}$    (8)

$\forall k:\quad \sum_{i} \alpha_{ik} = \sum_{s \in NT} \beta_{kks}$    (9)

where $NT_i$ refers to the set of PCFG nonterminals that are compatible with a phrase type $pt(i)$ of $p_i$. For example, $NT_i = \{NN, NP, \ldots\}$ if $p_i$ corresponds to an "object" (noun-phrase). Thus, Equation 8 enforces the correspondence between phrase types and nonterminal symbols at the tree leafs. Equation 9 enforces the constraint that the number of selected phrases and instantiated tree leafs must be the same.

Tree Congruence Constraints: To ensure that each CKY cell has at most one symbol we require

$\forall i,j:\quad \sum_{s \in NT} \beta_{ijs} \le 1$    (10)

We also require that

$\forall i,\ j > i,\ h:\quad \beta_{ijh} = \sum_{k=i}^{j-1} \sum_{r \in R_h} \beta_{ijkr}$    (11)

where $R_h = \{r \in R : r = h \to pq\}$. We enforce these constraints only for non-leafs. This constraint forbids instantiations where a nonterminal symbol $h$ is selected for cell $ij$ without selecting a corresponding PCFG rule.

We also ensure that we produce a valid tree structure. For instance, if we select 3 phrases as shown in Figure 3, we must have the root of the tree at the corresponding cell 02.

$\forall k \in [1, N):\quad \sum_{s \in NT} \beta_{kks} \le \sum_{t=k}^{N-1} \sum_{s \in NT} \beta_{0ts}$    (12)

We also require cells that are not selected for the resulting parse structure to be empty:

$\forall i,j:\quad \sum_{k} \gamma_{ijk} \le 1$    (13)

Additionally, we penalize solutions without the S tag at the parse root as a soft-constraint.

Miscellaneous Constraints: Finally, we include several constraints to avoid degenerate solutions or to otherwise enhance the composed output. We: (1) enforce that a noun-phrase is selected (to ensure semantic relevance to the image content), (2) allow at most one phrase of each type, (3) do not allow multiple phrases with identical headwords (to avoid redundancy), (4) allow at most one scene phrase for all sentences in the description. We find that handling of sentence boundaries is important if the ILP formulation is based only on sequence structure, but with the integration of tree-based structure, we do not need to specifically handle sentence boundaries.

3.4 Discussion

An interesting aspect of description generation explored in this paper is using tree fragments as the building blocks of composition rather than individual words. There are three practical benefits: (1) syntactic and semantic expressiveness, (2) correctness, and (3) computational efficiency. Because we extract phrases from human written captions, we are able to use expressive language, and are less likely to make syntactic or semantic errors. Our phrase extraction process can be viewed at a high level as visually-grounded or visually-situated paraphrasing. Also, because the unit of operation is tree fragments, the ILP formulation encoded in this work is computationally lightweight. If the unit of composition was words, the ILP instances would be significantly more computationally intensive, and more likely to suffer from grammatical and semantic errors.

4 Tree Compression

As noted by recent studies (Mason and Charniak, 2013; Kuznetsova et al., 2013; Jamieson et al., 2010), naturally existing image captions often include contextual information that does not directly describe visual content, which ultimately hinders their usefulness for describing other images. Therefore, to improve the fidelity of the generated descriptions, we explore image caption generalization as an
optional pre-processing step. Figure 4 illustrates a concrete example of image caption generalization in the context of image caption generation.

[Figure 4: Compressed captions (on the left) are more applicable for describing new images (on the right). Original caption: "Late in the day, after my sunset shot attempts, my cat strolled along the fence and posed for this classic profile". After generalization: "Late in the day", "cat", "posed for this profile". Further captions shown: "This bridge stands late in the day, after my sunset shot attempts"; "A cat strolled along the fence and posed for this classic profile".]

We cast caption generalization as sentence compression. We encode the problem as tree pruning via lightweight CKY parsing, while also incorporating several other considerations such as leaf-level ngram cohesion scores and visually informed content selection. Figure 5 shows an example compression, and Figure 6 shows the corresponding CKY matrix.

At a high level, the compression operation resembles bottom-up CKY parsing, but in addition to parsing, we also consider deletion of parts of the trees. When deleting parts of the original tree, we might need to re-parse the remainder of the tree. Note that we consider re-parsing only with respect to the original parse tree produced by a state-of-the-art parser, hence it is only a light-weight parsing.⁵

⁵ Integrating full parsing into the original sentence would be a straightforward extension conceptually, but may not be an empirically better choice when parsing for compression is based on vanilla unlexicalized parsing.

4.1 Dynamic Programming

Input to the algorithm is a sentence, represented as a vector $x = x_0 \ldots x_{n-1} = x[0:n-1]$, and its PCFG parse $\pi(x)$ obtained from the Stanford parser. For simplicity of notation, we assume that both the parse tree and the word sequence are encoded in $x$. Then, the compression can be formalized as:

$\hat{y} = \arg\max_{y} \prod_{i} \phi_i(x, y)$    (14)

where each $\phi_i$ is a potential function, corresponding to a criterion of the desired compression:

$\phi_i(x, y) = \exp(\theta_i \cdot f_i(x, y))$    (15)

where $\theta_i$ is the weight for a particular criterion (described in §4.2), whose scoring function is $f_i$.

We solve the decoding problem (Equation 14) using dynamic programming. For this, we need to solve the compression sub-problems for sequences $x[i:j]$, which can be viewed as branches $\hat{y}[i:j]$ of the final tree $\hat{y}[0:n-1]$. For example, in Figure 5, the final solution is $\hat{y}[0:7]$, while a sub-solution of $x[4:7]$ corresponds to a tree branch PP. Notice that sub-solution $\hat{y}[3:7]$ represents the same branch as $\hat{y}[4:7]$ due to branch deletion. Some computed sub-solutions, e.g., $\hat{y}[1:4]$, get dropped from the final compressed tree.

We define a matrix of scores $D[i,j,h]$ (Equation 17), where $h$ is one of the nonterminal symbols being considered for a cell indexed by $i,j$, i.e. a candidate for the root symbol of a branch $\hat{y}[i:j]$. When all values $D[i,j,h]$ are computed, we take

$\hat{h} = \arg\max_{h} D[0, n-1, h]$    (16)

and backtrack to reconstruct the final compression (the exact solution to Equation 14).

$D[i,j,h] = \max_{k \in [i,j),\ r \in R_h} \begin{cases} (1)\ D[i,k,p] + D[k+1,j,q] + \Delta\phi[r,ij] \\ (2)\ D[i,k,p] + \Delta\phi[r,ij] \\ (3)\ D[k+1,j,p] + \Delta\phi[r,ij] \end{cases}$    (17)

where $R_h = \{r \in R : r = h \to pq \ \lor\ r = h \to p\}$. Index $k$ determines a split point for child branches of a subtree $\hat{y}[i:j]$. For example, in Figure 5 the split point for children of the subtree $\hat{y}[0:7]$ is $k = 2$. The three cases ((1)–(3)) of the above equation correspond to the following tree pruning cases:

Pruning Case (1): None of the children of the current node is deleted. For example, in Figures 5 and 6, the PCFG rule PP → IN PP, corresponding to the sequence "in black and white", is retained. Another situation that can be encountered is tree re-parsing.
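The three-way choice in Equation 17 can be sketched as a recursion over a binarized parse tree. This is a deliberately simplified illustration rather than the algorithm as implemented: it recurses over the nodes of the original tree instead of filling the span-indexed table $D[i,j,h]$, it assumes every internal node has exactly two children, and the `Node` class and the `gain` callable (standing in for $\Delta\phi$) are hypothetical. Which one-child case corresponds to deleting the left rather than the right child is also an assumption here, so the constants are named after the child that is kept.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Node:
    label: str                       # nonterminal or POS tag
    word: Optional[str] = None       # set for leaves only
    left: Optional["Node"] = None    # internal nodes are assumed binary
    right: Optional["Node"] = None

# The three options of the recurrence, named after what is kept.
KEEP_BOTH, KEEP_LEFT, KEEP_RIGHT = 1, 2, 3

def compress(node: Node, gain: Callable[[Node, int], float]) -> Tuple[float, List[str]]:
    """Return (best score, kept words) for the subtree rooted at `node`.

    `gain(node, case)` is a hypothetical stand-in for the Delta-phi term:
    the score gained by applying the given case at this node.
    """
    if node.word is not None:        # leaf: no pruning decision to make
        return 0.0, [node.word]
    left_score, left_words = compress(node.left, gain)
    right_score, right_words = compress(node.right, gain)
    options = [
        # both child scores plus the local gain (no deletion)
        (left_score + right_score + gain(node, KEEP_BOTH), left_words + right_words),
        # only the left child's score is used, so the right branch is dropped
        (left_score + gain(node, KEEP_LEFT), left_words),
        # only the right child's score is used, so the left branch is dropped
        (right_score + gain(node, KEEP_RIGHT), right_words),
    ]
    return max(options, key=lambda option: option[0])
```

Because each subtree is solved once and its best score reused by its parent, the recursion mirrors the bottom-up table filling, with backtracking replaced by carrying the kept words alongside the score.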
[Figure 5: CKY compression. Both the chosen rules and phrases (blue bold font and blue solid arrows) and not chosen rules and phrases (red italic smaller font and red dashed lines) are shown. Example sentence: "Vintage motorcycle shot done in black and white", word positions 0–7, split point k = 2; node labels include JJ, NN, VBN, IN, CC, NP, CC-JJ, VP, PP, S; annotations: deletion probability, rule probability, vision confidence, ngram cohesion; deletion cases 1 and 2 are marked.]

Pruning Case (2)/(3): Deletion of the left/right child respectively. There are two types of deletion, as illustrated in Figures 5 and 6. The first corresponds to deletion of a child node. For example, the second child NN of rule NP → NP NN is deleted, which yields deletion of "shot". The second type is a special case of propagating a node to a higher level of the tree. In Figure 6, this situation occurs when deleting JJ "Vintage", which causes the propagation of NN from cell 11 to cell 01. For this purpose, we expand the set of rules $R$ with additional special rules of the form $h \to h$, e.g., NN → NN, which allows propagation of tree nodes to higher levels of the compressed tree.⁶

⁶ We assign probabilities of these special propagation rules to 1 so that they will not affect the final parse tree score. Turner and Charniak (2005) handled propagation cases similarly.

4.2 Modeling Compression Criteria

The $\Delta\phi$ term⁷ in Equation 17 denotes the sum of log of potential functions for each criterion $q$:

$\Delta\phi[r, ij] = \sum_{q} \theta_q \cdot \Delta f_q(r, ij)$    (18)

⁷ We use $\Delta$ to distinguish the potential value for the whole sentence from the gain of the potential during a single step of the algorithm.

Note that $\Delta\phi$ depends on the current rule $r$, along with the historical information before the current step $ij$, such as the original rule $r_{ij}$, and ngrams on the border between left and right child branches of rule $r_{ij}$. We use the following four criteria $f_q$ in our model, which are demonstrated in Figures 5 and 6.

I. Tree Structure: We capture PCFG rule probabilities estimated from the corpus as $\Delta f_{pcfg} = \log P_{pcfg}(r)$.

II. Sequence Structure: We incorporate ngram cohesion scores only across the border between two branches of a subtree.

III. Branch Deletion Probabilities: We compute probabilities of deletion for children as:

$\Delta f_{del} = \log P(r_t \mid r_{ij}) = \log \frac{count(r_t, r_{ij})}{count(r_{ij})}$    (19)

where $count(r_t, r_{ij})$ is the frequency in which $r_{ij}$ is transformed to $r_t$ by deletion of one of the children. We estimate this probability from a training corpus, described in §4.3. $count(r_{ij})$ is the count of $r_{ij}$ in uncompressed sentences.

IV. Vision Detection (Content Selection): We want to keep words referring to actual objects in the image. Thus, we use $V(x_j)$, a visual similarity score, as our confidence of an object corresponding to word $x_j$. This similarity is obtained from the visual recognition predictions of Deng et al. (2012b).

Note that some test instances include rules that we have not observed during training. We default to the original caption in those cases. The weights $\theta_i$ are set using a tuning dataset. We control over-compression by setting the weight for $f_{del}$ to a small value relative to the other weights.

[Figure 6: CKY compression. Both the chosen rules and phrases (blue bold font and blue solid arrows) and not chosen rules and phrases (red italic smaller font and red dashed lines) are shown. CKY matrix over "Vintage motorcycle shot done in black and white" with row index i and column index j; cells 00, 11 and 01 are marked; cell labels include JJ, NN, NP, VBN, VP, IN, PP, CC, CC-JJ, S; annotations: rule probability, ngram cohesion, deletion probability, vision confidence.]

4.3 Human Compressed Captions

Although we model image caption generalization as sentence compression, in practical applications we may want the outputs of these two tasks to be different. For example, there may be differences in what should be deleted (named entities in newswire summaries could be important to keep, while they may
be extraneous for image caption generalization). To learn the syntactic patterns for caption generalization, we collect a small set of example compressed captions (380 in total) using Amazon Mechanical Turk (AMT) (Snow et al., 2008). For each image, we asked 3 turkers to first list all visible objects in an image and then to write a compressed caption by removing not visually verifiable bits of text. We then align the original and compressed captions to measure rule deletion probabilities, excluding misalignments, similar to Knight and Marcu (2000). Note that we remove this dataset from the 1M caption corpus when we perform description generation.

[Figure 7: Caption generalization: good/bad examples.
Orig: Note the pillows, they match the chair that goes with it, plus the table in the picture is included. SeqCompression: The table in the picture. TreePruning: The chair with the table in the picture.
Orig: Only in wintertime we see these birds here in the river. SeqCompression: See these birds in the river. TreePruning: These birds in the river.
Orig: The world's most powerful lighthouse sitting beside the house with the world's thickest curtains. SeqCompression: Sitting beside the house. TreePruning: Powerful lighthouse beside the house with the curtains.
Relevance problem. Orig: Orange cloud on street light - near Lanakila Street (phone camera). SeqCompression: Orange street. TreePruning: Phone camera.
Grammar errors. Orig: There's something about having 5 trucks parked in front of my house that makes me feel all important-like. SeqCompression: Front of my house. TreePruning: Trucks in front my house.]

5 Experiments

We use the 1M captioned image corpus of Ordonez et al. (2011). We reserve 1K images as a test set, and use the rest of the corpus for phrase extraction. We experiment with the following approaches:

Proposed Approaches:
• TREEPRUNING: Our tree compression approach as described in §4.
• SEQ+TREE: Our tree composition approach as described in §3.
• SEQ+TREE+PRUNING: SEQ+TREE using compressed captions of TREEPRUNING as building blocks.

Baselines for Composition:
• SEQ+LINGRULE: The closest equivalent to the older sequence-driven system (Kuznetsova et al., 2012). Uses a few minor enhancements, such as sentence-boundary statistics, to improve grammaticality.
• SEQ: The §3 system without tree models and without the mentioned enhancements of SEQ+LINGRULE.
• SEQ+PRUNING: SEQ using compressed captions of TREEPRUNING as building blocks.

We also experiment with the compression of human written captions, which are used to generate image descriptions for the new target images.

Baselines for Compression:
• SEQCOMPRESSION (Kuznetsova et al., 2013): Inference operates over the sequence structure. Although optimization is subject to constraints derived from dependency parse, parsing is not an explicit part of the inference structure. Example outputs are shown in Figure 7.

Table 1: Automatic Evaluation

Method              BLEU w/ (w/o) penalty   METEOR P   METEOR R   METEOR M
SEQ+LINGRULE        0.152 (0.152)           0.13       0.17       0.095
SEQ                 0.138 (0.138)           0.12       0.18       0.094
SEQ+TREE            0.149 (0.149)           0.13       0.14       0.082
SEQ+PRUNING         0.177 (0.177)           0.15       0.16       0.101
SEQ+TREE+PRUNING    0.140 (0.189)           0.16       0.12       0.088

5.1 Automatic Evaluation

We perform automatic evaluation using two measures widely used in machine translation: BLEU (Papineni et al., 2002)⁸ and METEOR (Denkowski and Lavie, 2011).⁹ We remove all punctuation and convert captions to lower case. We use 1K test images from the captioned image corpus,¹⁰ and assume the original captions as the gold standard captions to compare against.

⁸ We use the unigram NIST implementation: ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v13a-20091001.tar.gz
⁹ With equal weight between precision and recall in Table 1.
¹⁰ Except for those for which image URLs are broken, or CPLEX did not return a solution.

The results in Table 1
show that both the integration of the tree structure (+TREE) and the generalization of captions using tree compression (+PRUNING) improve the BLEU score without brevity penalty significantly,¹¹ while improving METEOR only moderately (due to an improvement on precision with a decrease in recall).

¹¹ While 4-gram BLEU with brevity penalty is found to correlate better with human judges by recent studies (Elliott and Keller, 2014), we found that this is not the case for our task. This may be due to the differences in the gold standard captions. We use naturally existing ones, which include a wider range of content and style than crowd-sourced captions.

Table 2: Human Evaluation: posed as a binary question "which of the two options is better?" with respect to Relevance (Rel), Grammar (Gmar), and Overall (All). According to Pearson's χ² test, all results are statistically significant. Values are the rate (%) at which Method-1 is preferred over Method-2.

Method-1            Method-2          Criteria   all turkers   turkers w/ κ > 0.55   turkers w/ κ > 0.6
Image Description Generation
SEQ+TREE            SEQ               Rel        72            72                    72
SEQ+TREE            SEQ               Gmar       83            83                    83
SEQ+TREE            SEQ               All        68            69                    66
SEQ+TREE+PRUNING    SEQ+TREE          Rel        68            72                    72
SEQ+TREE+PRUNING    SEQ+TREE          Gmar       41            38                    41
SEQ+TREE+PRUNING    SEQ+TREE          All        63            64                    66
SEQ+TREE            SEQ+LINGRULE      All        62            64                    62
SEQ+TREE+PRUNING    SEQ+LINGRULE      All        67            75                    77
SEQ+TREE+PRUNING    SEQ+PRUNING       All        73            75                    75
SEQ+TREE+PRUNING    HUMAN             All        24            19                    19
Image Caption Generalization
TREEPRUNING         SEQCOMPRESSION∗   Rel        65            65                    66

5.2 Human Evaluation

Neither BLEU nor METEOR directly measures grammatical correctness over long distances, and both may not correspond perfectly to human judgments. Therefore, we supplement automatic evaluation with human evaluation. For human evaluations, we present two options generated from two competing systems, and ask turkers to choose the one that is better with respect to: relevance, grammar, and overall. Results are shown in Table 2 with 3 turker ratings per image. We filter out turkers based on a control question. We then compute the selection rate (%) of preferring method-1 over method-2. The agreement among turkers is a frequent concern. Therefore, we vary the set of dependable users based on their Cohen's kappa score (κ) against other users. It turns out that filtering users based on κ does not make a big difference in determining the winning method.

As expected, tree-based systems significantly outperform sequence-based counterparts. For example, SEQ+TREE is strongly preferred over SEQ, with a selection rate of 83%. Somewhat surprisingly, improved grammaticality also seems to improve relevance scores (72%), possibly because it is harder to appreciate the semantic relevance of automatic captions when they are less comprehensible. Also as expected, compositions based on pruned tree fragments significantly improve relevance (68–72%), while slightly deteriorating grammar (38–41%).

Notably, the captions generated by our system are preferred over the original (owner generated) captions 19–24% of the time. One such example is included in Figure 8: "The butterflies are attracted to the colorful flowers."

[Figure 8: An example of a description preferred over human gold standard. Image description is improved due to caption generalization.
Image Description Generation. Seq: A butterfly to the car was spotted by my nine year old cousin. Seq+Pruning: The butterflies are attracted to the colourful flowers to the car. Seq+Tree: The butterflies are attracted to the colourful flowers in Hope Gardens. Seq+Tree+Pruning: The butterflies are attracted to the colourful flowers.
Caption Generalization. Orig: The butterflies are attracted to the colourful flowers in Hope Gardens. SeqCompression: The colourful flowers. TreePruning: The butterflies are attracted to the colourful flowers.]

Additional examples (good and bad) are provided in Figures 9 and 10. Many of these captions are highly expressive while remaining semantically
[Figure residue, truncated in the source. Human: "Some flower at a bar in a hotel in Grapevine, TX." Seq+Tree+Pruning: "The flower was so vivid and attrac…"]