Transactions of the Association for Computational Linguistics, 2 (2014) 351–362. Action Editor: Hal Daumé III.
Submitted 2/2014; Revised 5/2014; Published 10/2014. © 2014 Association for Computational Linguistics.
TREETALK: Composition and Compression of Trees for Image Descriptions

Polina Kuznetsova†, Vicente Ordonez‡, Tamara L. Berg‡, Yejin Choi††
† Stony Brook University, Stony Brook, NY, pkuznetsova@cs.stonybrook.edu
‡ UNC Chapel Hill, Chapel Hill, NC, {vicente,tlberg}@cs.unc.edu
†† University of Washington, Seattle, WA, yejin@cs.washington.edu

Abstract

We present a new tree based approach to composing expressive image descriptions that makes use of naturally occurring web images with captions. We investigate two related tasks: image caption generalization and generation, where the former is an optional subtask of the latter. The high-level idea of our approach is to harvest expressive phrases (as tree fragments) from existing image descriptions, then to compose a new description by selectively combining the extracted (and optionally pruned) tree fragments. Key algorithmic components are tree composition and compression, both integrating tree structure with sequence structure. Our proposed system attains significantly better performance than previous approaches for both image caption generalization and generation. In addition, our work is the first to show the empirical benefit of automatically generalized captions for composing natural image descriptions.

1 Introduction

The web is increasingly visual, with hundreds of billions of user contributed photographs hosted online. A substantial portion of these images have some sort of accompanying text, ranging from keywords, to free text on web pages, to textual descriptions directly describing depicted image content (i.e., captions). We tap into the last kind of text, using naturally occurring pairs of images with natural language descriptions to compose expressive descriptions for query images via tree composition and compression.

Such automatic image captioning efforts could potentially be useful for many applications: from automatic organization of photo collections, to facilitating image search with complex natural language queries, to enhancing web accessibility for the visually impaired. On the intellectual side, by learning to describe the visual world from naturally existing web data, our study extends the domains of language grounding to the highly expressive language that people use in their everyday online activities.

There has
been a recent spike in efforts to automatically describe visual content in natural language (Yang et al., 2011; Kulkarni et al., 2011; Li et al., 2011; Farhadi et al., 2010; Krishnamoorthy et al., 2013; Elliott and Keller, 2013; Yu and Siskind, 2013; Socher et al., 2014). This reflects the long standing understanding that encoding the complexities and subtleties of image content often requires more expressive language constructs than a set of tags. Now that visual recognition algorithms are beginning to produce reliable estimates of image content (Perronnin et al., 2012; Deng et al., 2012a; Deng et al., 2010; Krizhevsky et al., 2012), the time seems ripe to begin exploring higher level semantic tasks.

There have been two main complementary directions explored for automatic image captioning. The first focuses on describing exactly those items (e.g., objects, attributes) that are detected by vision recognition, which subsequently confines what should be described and how (Yao et al., 2010; Kulkarni et al., 2011; Kojima et al., 2002). Approaches in this direction could be ideal for various practical applications such as image description for the visually impaired. However, it is not clear whether the semantic expressiveness of these approaches can eventually scale up to the casual, but highly expressive language people naturally use in their online activities. In Figure 1, for example, it would be hard to compose "I noticed that this funny cow was staring at me" or "You can see these beautiful hills only in the countryside" in a purely bottom-up manner based on the exact content detected. The key technical bottleneck is that the range of describable content (i.e., objects, attributes, actions) is ultimately confined by the set of items that can be reliably recognized by state-of-the-art vision techniques.

[Figure 1: Harvesting phrases (as tree fragments) for the target image based on (partial) visual match. Example phrases, by type (Object, Action, Stuff, Scene): "A cow standing in the water"; "I noticed that this funny cow was staring at me"; "A bird hovering in the grass"; "You can see these beautiful hills only in the countryside".]

The second direction, in a complementary avenue to the first, has explored ways to make use of the rich spectrum of visual descriptions contributed by online citizens (Kuznetsova et al., 2012; Feng and Lapata, 2013; Mason, 2013; Ordonez et al., 2011). In these approaches, the set of what can be described can be substantially larger than the set of what can be recognized, where the former is shaped and defined by the data, rather than by humans. This allows the resulting descriptions to be substantially more expressive, elaborate, and interesting than what would be possible in a purely bottom-up manner. Our work contributes to this second line of research.

One challenge in utilizing naturally existing multimodal data, however, is the noisy semantic alignment between images and text (Dodge et al., 2012; Berg et al., 2010). Therefore, we also investigate a related task of image caption generalization (Kuznetsova et al., 2013), which aims to improve the semantic image-text alignment by removing bits of text from existing captions that are less likely to be transferable to other images.

The high-level idea of our system is to harvest useful bits of text (as tree fragments) from existing image descriptions using detected visual content similarity, and then to compose a new description by selectively combining these extracted (and optionally pruned) tree fragments. This overall idea of composition based on extracted phrases is not new in itself (Kuznetsova et al., 2012); however, we make several technical and empirical contributions.

First, we propose a novel stochastic tree composition algorithm based on extracted tree fragments that integrates both tree structure and sequence cohesion into structural inference. Our algorithm permits a substantially higher level of linguistic expressiveness, flexibility, and creativity than those based on rules or templates (Kulkarni et al., 2011; Yang et al., 2011; Mitchell et al., 2012), while also addressing long-distance grammatical relations in a more principled way than those based on hand-coded constraints (Kuznetsova et al., 2012).

Second, we address image caption generalization as an optional subtask of image caption generation, and propose a tree compression algorithm that performs a light-weight parsing to search for the optimal set of tree branches to prune. Our work is the first to report empirical benefits of automatically compressed captions for image captioning.

The proposed approaches attain significantly better performance for both image caption generalization and generation tasks over competitive baselines and previous approaches. Our work results in an improved image caption corpus with automatic generalization, which is publicly available.¹

¹ http://ilp-cky.appspot.com/

2 Harvesting Tree Fragments

Given a query image, we retrieve images that are visually similar to the query image, then extract potentially useful segments (i.e., phrases) from their corresponding image descriptions. We then compose a new image description using these retrieved text fragments (§3). Extraction of useful phrases is guided by both visual similarity and the syntactic parse of the corresponding textual description. This extraction strategy, originally proposed by Kuznetsova et al. (2012), attempts to make the best use of linguistic regularities with respect to objects, actions, and scenes, making it possible to obtain richer textual descriptions than what current state-of-the-art vision techniques can provide in isolation.

In all of our experiments we use the captioned image corpus of Ordonez et al. (2011), first pre-processing the corpus for relevant content by running deformable part model object detectors (Felzenszwalb et al., 2010). For our study, we run detectors for 89 object classes, with a high confidence threshold for detection.

As illustrated in Figure 1, for a query image detection, we extract four types of phrases (as tree fragments). First, we retrieve relevant noun phrases from images with visually similar object detections. We use color, texture (Leung and Malik, 1999), and shape (Dalal and Triggs, 2005; Lowe, 2004) based features encoded in a histogram of vector quantized responses to measure visual similarity. Second, we extract verb phrases for which the corresponding noun phrase takes the subject role. Third, from those images with "stuff" detections, e.g., "water" or "sky" (typically mass nouns), we extract prepositional phrases based on similarity of both visual appearance and relative spatial relationships between detected objects and "stuff". Finally, we use global "scene" similarity² to extract prepositional phrases referring to the overall scene, e.g., "at the conference" or "in the market".

² L2 distance between classification score vectors (Xiao et al., 2010)

We perform this phrase retrieval process for each detected object in the query image and generate one sentence for each object. All sentences are then combined together to produce the final description. Optionally, we apply image caption generalization (via compression) (§4) to all captions in the corpus prior to the phrase extraction and composition.

3 Tree Composition

We model tree composition as constraint optimization. The input to our algorithm is the set of retrieved phrases (i.e., tree fragments), as illustrated in §2. Let P = {p_0, ..., p_{L−1}} be the set of all phrases across the four phrase types (objects, actions, stuff and scene). We assume a mapping function pt : [0, L) → T, where T is the set of phrase types, so that the phrase type of p_i is pt(i). In addition, let R be the set of PCFG production rules and NT be the set of nonterminal symbols of the PCFG. The goal is to find and combine a good sequence of phrases G, |G| ≤ |T| = N = 4, drawn from P, into a final sentence. More concretely, we want to select and order a subset of phrases (at most one phrase of each phrase type) while considering both the parse structure and n-gram cohesion across phrasal boundaries.

Figure 2 shows a simplified example of a composed sentence with its corresponding parse structure. For brevity, the figure shows only one phrase for each phrase type, but in actuality there would be a set of candidate phrases for each type. Figure 3 shows the CKY-style representation of the internal mechanics of constraint optimization for the example composition from Figure 2. Each cell ij of the CKY matrix corresponds to G_ij, a subsequence of G starting at position i and ending at position j. If a cell in the CKY matrix is labeled with a nonterminal symbol s, it means that the corresponding tree of G_ij has s as its root.

Although we visualize the operation using a CKY-style representation in Figure 3, note that composition requires more complex combinatorial decisions than CKY parsing due to two additional considerations. We are: (1) selecting a subset of candidate phrases, and (2) re-ordering the selected phrases (hence making the problem NP-hard). Therefore, we encode our problem using Integer Linear Programming (ILP) (Roth and Yih, 2004; Clarke and Lapata, 2008) and use the CPLEX (ILOG, Inc, 2006) solver.

3.1 ILP Variables

Variables for Sequence Structure: Variables α encode phrase selection and ordering:

  α_ik = 1 iff phrase i ∈ P is selected for position k ∈ [0, N)    (1)

where k is one of the N = 4 positions in a sentence.³ Additionally, we define variables for each pair of adjacent phrases to capture sequence cohesion.

³ The number of positions is equal to the number of phrase types, since we select at most one from each type.
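The combinatorial core of §3 (select at most one phrase per type, then order the selected phrases) is small enough at N = 4 to check by brute force. The sketch below enumerates subsets and orderings under a simple additive relevance-plus-cohesion score; the scoring functions are hypothetical stand-ins, not the paper's full ILP objective, which also scores the parse structure.

```python
from itertools import combinations, permutations

def compose(phrases, relevance, cohesion):
    """Brute-force reference for phrase selection + ordering.

    phrases   : dict phrase_type -> list of candidate phrase strings
    relevance : dict phrase -> (assumed) visual relevance score
    cohesion  : function (left_phrase, right_phrase) -> cohesion score
    Returns the best-scoring ordered tuple of phrases, at most one per type.
    """
    candidates = [(t, p) for t, plist in phrases.items() for p in plist]
    best, best_score = None, float("-inf")
    for size in range(1, len(phrases) + 1):
        for subset in combinations(candidates, size):
            types = [t for t, _ in subset]
            if len(set(types)) < len(types):   # at most one phrase per type
                continue
            for order in permutations(subset):
                seq = [p for _, p in order]
                # additive objective: phrase relevance + pairwise cohesion
                score = sum(relevance[p] for p in seq)
                score += sum(cohesion(a, b) for a, b in zip(seq, seq[1:]))
                if score > best_score:
                    best, best_score = tuple(seq), score
    return best, best_score
```

The ILP of §3.1 searches the same space, but scales better and lets the tree-structure constraints of §3.2 be expressed declaratively rather than by enumeration.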
[Figure 2: A composed sentence, "A cow in the countryside was staring at me in the grass," with its parse structure (NP PP VP PP combined under S).]

[Figure 3: The CKY-style matrix for the example composition of Figure 2, with cells indexed by i, j (e.g., i = 0, j = 2) and split points k (e.g., k = 1).]

… for each level and each node of that level, the algorithm has to decide which parse tag to choose. This process is represented by the assignment of a particular tag to a matrix cell. The chosen tag must be the head of a rule; for example, cell 12 is assigned tag VP, corresponding to the rule VP → VP PP, which connects the leaves "going out to sea" and "in the ocean". The problem is to find a tag assignment for each cell of the matrix, given that some cells can be empty if they do not connect children cells. The latter correspond to children branches of the tree and belong to the previous diagonal in the left-to-right order. Also, we do not try all possible pairs of children from the previous diagonal; we use a technique similar to the one used in CKY parsing. Matrix cell pairs corresponding to …
… variables (Equations 2, 4, 5). Constraints for a product of two variables have been discussed by Clarke and Lapata (2008). For Equation 2, we add the following constraints (similar constraints are also added for Equations 4, 5):

  ∀ ijk:  α_ijk ≤ α_ik
          α_ijk ≤ α_j(k+1)
          α_ijk + (1 − α_ik) + (1 − α_j(k+1)) ≥ 1    (7)

Consistency between Tree Leafs and Sequences: The ordering of phrases implied by α_ijk must be consistent with the ordering of phrases implied by the β variables. This can be achieved by aligning the leaf cells (i.e., β_kk^s) in the CKY-style matrix with α variables as follows:

  ∀ ik:  α_ik ≤ Σ_{s ∈ NT_i} β_kk^s    (8)

  ∀ k:  Σ_i α_ik = Σ_{s ∈ NT} β_kk^s    (9)

where NT_i refers to the set of PCFG nonterminals that are compatible with the phrase type pt(i) of p_i. For example, NT_i = {NN, NP, ...} if p_i corresponds to an "object" (noun-phrase). Thus, Equation 8 enforces the correspondence between phrase types and nonterminal symbols at the tree leafs. Equation 9 enforces the constraint that the number of selected phrases and instantiated tree leafs must be the same.

Tree Congruence Constraints: To ensure that each CKY cell has at most one symbol we require

  ∀ ij:  Σ_{s ∈ NT} β_ij^s ≤ 1    (10)

We also require that

  ∀ i, j > i, h:  β_ij^h = Σ_{k=i}^{j−1} Σ_{r ∈ R_h} β_ij^{kr}    (11)

where R_h = {r ∈ R : r = h → pq}. We enforce these constraints only for non-leafs. This constraint forbids instantiations where a nonterminal symbol h is selected for cell ij without selecting a corresponding PCFG rule.

We also ensure that we produce a valid tree structure. For instance, if we select 3 phrases as shown in Figure 3, we must have the root of the tree at the corresponding cell 02:

  ∀ k ∈ [1, N):  Σ_{s ∈ NT} β_kk^s ≤ Σ_{t=k}^{N−1} Σ_{s ∈ NT} β_0t^s    (12)

We also require cells that are not selected for the resulting parse structure to be empty:

  ∀ ij:  Σ_k γ_ij^k ≤ 1    (13)

Additionally, we penalize solutions without the S tag at the parse root as a soft-constraint.

Miscellaneous Constraints: Finally, we include several constraints to avoid degenerate solutions or to otherwise enhance the composed output. We: (1) enforce that a noun-phrase is selected (to ensure semantic relevance to the image content), (2) allow at most one phrase of each type, (3) do not allow multiple phrases with identical head words (to avoid redundancy), (4) allow at most one scene phrase for all sentences in the description. We find that handling of sentence boundaries is important if the ILP formulation is based only on sequence structure, but with the integration of tree-based structure, we do not need to specifically handle sentence boundaries.

3.4 Discussion

An interesting aspect of description generation explored in this paper is using tree fragments as the building blocks of composition rather than individual words. There are three practical benefits: (1) syntactic and semantic expressiveness, (2) correctness, and (3) computational efficiency. Because we extract phrases from human written captions, we are able to use expressive language, and are less likely to make syntactic or semantic errors. Our phrase extraction process can be viewed at a high level as visually-grounded or visually-situated paraphrasing. Also, because the unit of operation is tree fragments, the ILP formulation encoded in this work is computationally lightweight. If the unit of composition was words, the ILP instances would be significantly more computationally intensive, and more likely to suffer from grammatical and semantic errors.

4 Tree Compression

As noted by recent studies (Mason and Charniak, 2013; Kuznetsova et al., 2013; Jamieson et al., 2010), naturally existing image captions often include contextual information that does not directly describe visual content, which ultimately hinders their usefulness for describing other images. Therefore, to improve the fidelity of the generated descriptions, we explore image caption generalization as an
optional pre-processing step. Figure 4 illustrates a concrete example of image caption generalization in the context of image caption generation.

[Figure 4: Compressed captions (on the left) are more applicable for describing new images (on the right). Original: "Late in the day, after my sunset shot attempts, my cat strolled along the fence and posed for this classic profile"; generalized: "Late in the day cat posed for this profile"; example transfers: "This bridge stands late in the day, after my sunset shot attempts"; "A cat strolled along the fence and posed for this classic profile".]

We cast caption generalization as sentence compression. We encode the problem as tree pruning via lightweight CKY parsing, while also incorporating several other considerations such as leaf-level ngram cohesion scores and visually informed content selection. Figure 5 shows an example compression, and Figure 6 shows the corresponding CKY matrix.

At a high level, the compression operation resembles bottom-up CKY parsing, but in addition to parsing, we also consider deletion of parts of the trees. When deleting parts of the original tree, we might need to re-parse the remainder of the tree. Note that we consider re-parsing only with respect to the original parse tree produced by a state-of-the-art parser, hence it is only a light-weight parsing.⁵

⁵ Integrating full parsing into the original sentence would be a straightforward extension conceptually, but may not be an empirically better choice when parsing for compression is based on vanilla unlexicalized parsing.

4.1 Dynamic Programming

Input to the algorithm is a sentence, represented as a vector x = x_0 ... x_{n−1} = x[0 : n−1], and its PCFG parse π(x) obtained from the Stanford parser. For simplicity of notation, we assume that both the parse tree and the word sequence are encoded in x. Then, the compression can be formalized as:

  ŷ = argmax_y Π_i φ_i(x, y)    (14)

where each φ_i is a potential function corresponding to a criterion of the desired compression:

  φ_i(x, y) = exp(θ_i · f_i(x, y))    (15)

where θ_i is the weight for a particular criterion (described in §4.2), whose scoring function is f_i.

We solve the decoding problem (Equation 14) using dynamic programming. For this, we need to solve the compression sub-problems for sequences x[i : j], which can be viewed as branches ŷ[i : j] of the final tree ŷ[0 : n−1]. For example, in Figure 5, the final solution is ŷ[0 : 7], while a sub-solution of x[4 : 7] corresponds to a tree branch PP. Notice that sub-solution ŷ[3 : 7] represents the same branch as ŷ[4 : 7] due to branch deletion. Some computed sub-solutions, e.g., ŷ[1 : 4], get dropped from the final compressed tree.

We define a matrix of scores D[i, j, h] (Equation 17), where h is one of the nonterminal symbols being considered for a cell indexed by i, j, i.e., a candidate for the root symbol of a branch ŷ[i : j]. When all values D[i, j, h] are computed, we take

  ĥ = argmax_h D[0, n−1, h]    (16)

and backtrack to reconstruct the final compression (the exact solution to Equation 14).

  D[i, j, h] = max_{k ∈ [i, j), r ∈ R_h} of:
    (1) D[i, k, p] + D[k+1, j, q] + ∆φ[r, ij]
    (2) D[i, k, p] + ∆φ[r, ij]
    (3) D[k+1, j, p] + ∆φ[r, ij]    (17)

where R_h = {r ∈ R : r = h → pq ∨ r = h → p}. Index k determines a split point for child branches of a subtree ŷ[i : j]. For example, in Figure 5 the split point for children of the subtree ŷ[0 : 7] is k = 2. The three cases ((1)–(3)) of the above equation correspond to the following tree pruning cases:

Pruning Case (1): None of the children of the current node is deleted. For example, in Figures 5 and 6, the PCFG rule PP → IN PP, corresponding to the sequence "in black and white", is retained. Another situation that can be encountered is tree re-parsing.
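The recursion in Equation 17 can be illustrated with a stripped-down version that works directly on a binarized parse tree: at each node we either keep both children or delete one of them, propagating the survivor upward, and we score kept leaves with a per-word keep score plus a fixed deletion penalty. This is a sketch of the pruning cases only, under assumed scores; the full model also re-parses against the PCFG and adds the ngram cohesion and vision confidence criteria of §4.2.

```python
def prune(tree, keep_score, delete_penalty=-0.5):
    """Return (score, pruned_tree) for the best pruning of `tree`.

    tree       : a word string (leaf) or a (left, right) pair (binary node)
    keep_score : dict word -> assumed benefit of keeping that word
    The three options mirror Equation 17: (1) keep both children,
    (2) delete the left child, (3) delete the right child.
    """
    if isinstance(tree, str):                     # leaf = a single word
        return keep_score.get(tree, 0.0), tree
    left, right = tree
    ls, lt = prune(left, keep_score, delete_penalty)
    rs, rt = prune(right, keep_score, delete_penalty)
    options = [
        (ls + rs, (lt, rt)),                      # case (1): keep both
        (rs + delete_penalty, rt),                # case (2): drop left child
        (ls + delete_penalty, lt),                # case (3): drop right child
    ]
    return max(options, key=lambda o: o[0])
```

Because the subtrees are scored independently here, the bottom-up recursion is an exact dynamic program for this simplified objective, just as Equation 17 is for the full one.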
[Figure 5: CKY compression of "Vintage motorcycle shot done in black and white" (word positions 0–7; split point k = 2), annotated with deletion probability, rule probability, vision confidence, and ngram cohesion, with deletion cases (1) and (2) marked. Both the chosen rules and phrases (blue bold font and blue solid arrows) and the rules and phrases not chosen (red italic smaller font and red dashed lines) are shown.]

Pruning Cases (2)/(3): Deletion of the left/right child, respectively. There are two types of deletion, as illustrated in Figures 5 and 6. The first corresponds to deletion of a child node. For example, the second child NN of rule NP → NP NN is deleted, which yields deletion of "shot". The second type is a special case of propagating a node to a higher level of the tree. In Figure 6, this situation occurs when deleting JJ "Vintage", which causes the propagation of NN from cell 11 to cell 01. For this purpose, we expand the set of rules R with additional special rules of the form h → h, e.g., NN → NN, which allows propagation of tree nodes to higher levels of the compressed tree.⁶

⁶ We assign probabilities of these special propagation rules to 1 so that they will not affect the final parse tree score. Turner and Charniak (2005) handled propagation cases similarly.

4.2 Modeling Compression Criteria

The ∆φ term⁷ in Equation 17 denotes the sum of logs of potential functions for each criterion q:

  ∆φ[r, ij] = Σ_q θ_q · ∆f_q(r, ij)    (18)

Note that ∆φ depends on the current rule r, along with the historical information before the current step ij, such as the original rule r_ij, and ngrams on the border between left and right child branches of rule r_ij. We use the following four criteria f_q in our model, which are demonstrated in Figures 5 and 6.

⁷ We use ∆ to distinguish the potential value for the whole sentence from the gain of the potential during a single step of the algorithm.

I. Tree Structure: We capture PCFG rule probabilities estimated from the corpus as ∆f_pcfg = log P_pcfg(r).
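Criterion I, like the deletion probabilities of criterion III below (Equation 19), is a count-based maximum-likelihood estimate. A minimal sketch with hypothetical counts follows; the rule inventories and numbers are illustrative, not taken from the paper's grammar:

```python
import math
from collections import Counter

# Hypothetical observed rule expansions from a parsed corpus.
rule_counts = Counter({
    ("NP", "NP NN"): 30,
    ("NP", "DT NN"): 60,
    ("PP", "IN NP"): 90,
})
lhs_counts = Counter()
for (lhs, _), c in rule_counts.items():
    lhs_counts[lhs] += c

def pcfg_logprob(lhs, rhs):
    """Criterion I: log P_pcfg(r) = log count(lhs -> rhs) / count(lhs)."""
    return math.log(rule_counts[(lhs, rhs)] / lhs_counts[lhs])

# Hypothetical aligned (original rule, rule after child deletion) pairs,
# as harvested from the human-compressed captions of Section 4.3.
deletion_counts = Counter({(("NP", "NP NN"), ("NP", "NP")): 12})

def deletion_logprob(orig, reduced):
    """Criterion III (Equation 19): log count(rt, rij) / count(rij)."""
    return math.log(deletion_counts[(orig, reduced)] / rule_counts[orig])
```

In practice both estimates would need smoothing for unseen rules; the paper instead falls back to the original caption when an unobserved rule is encountered (§4.2).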
[Figure 6: The CKY matrix (cells indexed by i, j; e.g., cells 00, 11, 01) for the compression in Figure 5, annotated with rule probability, ngram cohesion, deletion probability, and vision confidence. Both the chosen rules and phrases and those not chosen are shown.]

II. Sequence Structure: We incorporate ngram cohesion scores only across the border between two branches of a subtree.

III. Branch Deletion Probabilities: We compute probabilities of deletion for children as:

  ∆f_del = log P(r_t | r_ij) = log ( count(r_t, r_ij) / count(r_ij) )    (19)

where count(r_t, r_ij) is the frequency with which r_ij is transformed to r_t by deletion of one of the children. We estimate this probability from a training corpus, described in §4.3. count(r_ij) is the count of r_ij in uncompressed sentences.

IV. Vision Detection (Content Selection): We want to keep words referring to actual objects in the image. Thus, we use V(x_j), a visual similarity score, as our confidence of an object corresponding to word x_j. This similarity is obtained from the visual recognition predictions of Deng et al. (2012b).

Note that some test instances include rules that we have not observed during training. We default to the original caption in those cases. The weights θ_i are set using a tuning dataset. We control over-compression by setting the weight for f_del to a small value relative to the other weights.

4.3 Human Compressed Captions

Although we model image caption generalization as sentence compression, in practical applications we may want the outputs of these two tasks to be different. For example, there may be differences in what should be deleted (named entities in newswire summaries could be important to keep, while they may
be extraneous for image caption generalization). To learn the syntactic patterns for caption generalization, we collect a small set of example compressed captions (380 in total) using Amazon Mechanical Turk (AMT) (Snow et al., 2008). For each image, we asked 3 turkers to first list all visible objects in an image and then to write a compressed caption by removing not visually verifiable bits of text. We then align the original and compressed captions to measure rule deletion probabilities, excluding misalignments, similar to Knight and Marcu (2000). Note that we remove this dataset from the 1M caption corpus when we perform description generation.

[Figure 7: Caption generalization: good/bad examples.
Orig: "Note the pillows, they match the chair that goes with it, plus the table in the picture is included." SeqCompression: "The table in the picture." TreePruning: "The chair with the table in the picture."
Orig: "Only in wintertime we see these birds here in the river." SeqCompression: "See these birds in the river." TreePruning: "These birds in the river."
Orig: "The world's most powerful lighthouse sitting beside the house with the world's thickest curtains." SeqCompression: "Sitting beside the house." TreePruning: "Powerful lighthouse beside the house with the curtains."
Relevance problem: Orig: "Orange cloud on street light C near Lanakila Street (phone camera)." SeqCompression: "Orange street." TreePruning: "Phone camera."
Grammar mistakes: Orig: "There's something about having 5 trucks parked in front of my house that makes me feel all important-like." SeqCompression: "Front of my house." TreePruning: "Trucks in front my house."]

5 Experiments

We use the 1M captioned image corpus of Ordonez et al. (2011). We reserve 1K images as a test set, and use the rest of the corpus for phrase extraction. We experiment with the following approaches:

Proposed Approaches:
• TREEPRUNING: Our tree compression approach as described in §4.
• SEQ+TREE: Our tree composition approach as described in §3.
• SEQ+TREE+PRUNING: SEQ+TREE using compressed captions of TREEPRUNING as building blocks.

Baselines for Composition:
• SEQ+LINGRULE: The most equivalent to the older sequence-driven system (Kuznetsova et al., 2012). Uses a few minor enhancements, such as sentence-boundary statistics, to improve grammaticality.
• SEQ: The §3 system without tree models and the mentioned enhancements of SEQ+LINGRULE.
• SEQ+PRUNING: SEQ using compressed captions of TREEPRUNING as building blocks.

We also experiment with the compression of human written captions, which are used to generate image descriptions for the new target images.

Baselines for Compression:
• SEQCOMPRESSION (Kuznetsova et al., 2013): Inference operates over the sequence structure. Although optimization is subject to constraints derived from the dependency parse, parsing is not an explicit part of the inference structure. Example outputs are shown in Figure 7.

Table 1: Automatic Evaluation

Method             BLEU w/ (w/o) penalty   METEOR P   METEOR R   METEOR M
SEQ+LINGRULE       0.152 (0.152)           0.13       0.17       0.095
SEQ                0.138 (0.138)           0.12       0.18       0.094
SEQ+TREE           0.149 (0.149)           0.13       0.14       0.082
SEQ+PRUNING        0.177 (0.177)           0.15       0.16       0.101
SEQ+TREE+PRUNING   0.140 (0.189)           0.16       0.12       0.088

5.1 Automatic Evaluation

We perform automatic evaluation using two measures widely used in machine translation: BLEU (Papineni et al., 2002)⁸ and METEOR (Denkowski and Lavie, 2011).⁹ We remove all punctuation and convert captions to lowercase. We use 1K test images from the captioned image corpus,¹⁰ and assume the original captions as the gold standard captions to compare against.

⁸ We use the unigram NIST implementation: ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v13a-20091001.tar.gz
⁹ With equal weight between precision and recall in Table 1.
¹⁰ Except for those for which image URLs are broken, or CPLEX did not return a solution.

The results in Table 1
show that both the integration of the tree structure (+TREE) and the generalization of captions using tree compression (+PRUNING) improve the BLEU score without brevity penalty significantly,¹¹ while improving METEOR only moderately (due to an improvement in precision with a decrease in recall).

¹¹ While 4-gram BLEU with brevity penalty is found to correlate better with human judges by recent studies (Elliott and Keller, 2014), we found that this is not the case for our task. This may be due to the differences in the gold standard captions. We use naturally existing ones, which include a wider range of content and style than crowd-sourced captions.

Table 2: Human Evaluation: posed as a binary question "which of the two options is better?" with respect to Relevance (Rel), Grammar (Gmar), and Overall (All). According to Pearson's χ² test, all results are statistically significant.

Method-1            vs Method-2          Criteria   Method-1 preferred over Method-2 (%)
                                                    all turkers   turkers w/ κ > 0.55   turkers w/ κ > 0.6
Image Description Generation
SEQ+TREE            vs SEQ               Rel        72            72                    72
SEQ+TREE            vs SEQ               Gmar       83            83                    83
SEQ+TREE            vs SEQ               All        68            69                    66
SEQ+TREE+PRUNING    vs SEQ+TREE          Rel        68            72                    72
SEQ+TREE+PRUNING    vs SEQ+TREE          Gmar       41            38                    41
SEQ+TREE+PRUNING    vs SEQ+TREE          All        63            64                    66
SEQ+TREE            vs SEQ+LINGRULE      All        62            64                    62
SEQ+TREE+PRUNING    vs SEQ+LINGRULE     All        67            75                    77
SEQ+TREE+PRUNING    vs SEQ+PRUNING       All        73            75                    75
SEQ+TREE+PRUNING    vs HUMAN             All        24            19                    19
Image Caption Generalization
TREEPRUNING         vs SEQCOMPRESSION∗   Rel        65            65                    66

5.2 Human Evaluation

Neither BLEU nor METEOR directly measures grammatical correctness over long distances, and they may not correspond perfectly to human judgments. Therefore, we supplement automatic evaluation with human evaluation. For human evaluations, we present two options generated from two competing systems, and ask turkers to choose the one that is better with respect to: relevance, grammar, and overall. Results are shown in Table 2 with 3 turker ratings per image. We filter out turkers based on a control question. We then compute the selection rate (%) of preferring method-1 over method-2. The agreement among turkers is a frequent concern. Therefore, we vary the set of dependable users based on their Cohen's kappa score (κ) against other users. It turns out that filtering users based on κ does not make a big difference in determining the winning method.

As expected, tree-based systems significantly outperform sequence-based counterparts. For example, SEQ+TREE is strongly preferred over SEQ, with a selection rate of 83%. Somewhat surprisingly, improved grammaticality also seems to improve relevance scores (72%), possibly because it is harder to appreciate the semantic relevance of automatic captions when they are less comprehensible. Also as expected, compositions based on pruned tree fragments significantly improve relevance (68–72%), while slightly deteriorating grammar (38–41%). Notably, the captions generated by our system are preferred over the original (owner generated) captions 19–24% of the time. One such example is included in Figure 8: "The butterflies are attracted to the colourful flowers."

[Figure 8: An example of a description preferred over the human gold standard; the image description is improved due to caption generalization.
Caption generalization: Orig: "The butterflies are attracted to the colourful flowers in Hope Gardens." SeqCompression: "The colourful flowers." TreePruning: "The butterflies are attracted to the colourful flowers."
Image description generation: Seq: "A butterfly to the car was spotted by my nine year old cousin." Seq+Pruning: "The butterflies are attracted to the colourful flowers to the car." Seq+Tree: "The butterflies are attracted to the colourful flowers in Hope Gardens." Seq+Tree+Pruning: "The butterflies are attracted to the colourful flowers."]

Additional examples (good and bad) are provided in Figures 9 and 10. Many of these captions are highly expressive while remaining semantically …
360
Human:”Some”flower”on”a”bar”in”a”hotel”in”Grapevine,”TX.””&Seq+Tree+Pruning:”The”flower”was”so”vivid”and”a:rac