Transactions of the Association for Computational Linguistics, 2 (2014) 351–362. Action Editor: Hal Daume III.
Submitted 2/2014; Revised 5/2014; Published 10/2014. © 2014 Association for Computational Linguistics.
TREETALK: Composition and Compression of Trees for Image Descriptions

Polina Kuznetsova†, Vicente Ordonez‡, Tamara L. Berg‡, Yejin Choi††
†Stony Brook University, Stony Brook, NY; pkuznetsova@cs.stonybrook.edu
‡UNC Chapel Hill, Chapel Hill, North Carolina; {vicente,tlberg}@cs.unc.edu
††University of Washington, Seattle, WA; yejin@cs.washington.edu

Abstract

We present a new tree-based approach to composing expressive image descriptions that makes use of naturally occurring web images with captions. We investigate two related tasks: image caption generalization and generation, where the former is an optional subtask of the latter. The high-level idea of our approach is to harvest expressive phrases (as tree fragments) from existing image descriptions, then to compose a new description by selectively combining the extracted (and optionally pruned) tree fragments. Key algorithmic components are tree composition and compression, both integrating tree structure with sequence structure. Our proposed system attains significantly better performance than previous approaches for both image caption generalization and generation. In addition, our work is the first to show the empirical benefit of automatically generalized captions for composing natural image descriptions.

1 Introduction

The web is increasingly visual, with hundreds of billions of user contributed photographs hosted online. A substantial portion of these images have some sort of accompanying text, ranging from keywords, to free text on web pages, to textual descriptions directly describing depicted image content (i.e., captions). We tap into the last kind of text, using naturally occurring pairs of images with natural language descriptions to compose expressive descriptions for query images via tree composition and compression.

Such automatic image captioning efforts could potentially be useful for many applications: from automatic organization of photo collections, to facilitating image search with complex natural language queries, to enhancing web accessibility for the visually impaired. On the intellectual side, by learning to describe the visual world from naturally existing web data, our study extends the domains of language grounding to the highly expressive language that people use in their everyday online activities.

There has been a recent spike in efforts to automatically describe visual content in natural language (Yang et al., 2011; Kulkarni et al., 2011; Li et al., 2011; Farhadi et al., 2010; Krishnamoorthy et al., 2013; Elliott and Keller, 2013; Yu and Siskind, 2013; Socher et al., 2014). This reflects the longstanding understanding that encoding the complexities and subtleties of image content often requires more expressive language constructs than a set of tags. Now that visual recognition algorithms are beginning to produce reliable estimates of image content (Perronnin et al., 2012; Deng et al., 2012a; Deng et al., 2010; Krizhevsky et al., 2012), the time seems ripe to begin exploring higher level semantic tasks.

There have been two main complementary directions explored for automatic image captioning. The first focuses on describing exactly those items (e.g., objects, attributes) that are detected by vision recognition, which subsequently confines what should be described and how (Yao et al., 2010; Kulkarni et al., 2011; Kojima et al., 2002). Approaches in this direction could be ideal for various practical applications such as image description for the visually impaired. However, it is not clear whether the semantic expressiveness of these approaches can eventually scale up to the casual, but highly expressive, language
people naturally use in their online activities.

[Figure 1: Harvesting phrases (as tree fragments) for the target image based on (partial) visual match. Example phrases for a target image: Object: "A cow standing in the water"; Action: "I noticed that this funny cow was staring at me"; Stuff: "A bird hovering in the grass"; Scene: "You can see these beautiful hills only in the countryside".]

In Figure 1, for example, it would be hard to compose "I noticed that this funny cow was staring at me" or "You can see these beautiful hills only in the countryside" in a purely bottom-up manner based on the exact content detected. The key technical bottleneck is that the range of describable content (i.e., objects, attributes, actions) is ultimately confined by the set of items that can be reliably recognized by state-of-the-art vision techniques.

The second direction, in a complementary avenue to the first, has explored ways to make use of the rich spectrum of visual descriptions contributed by online citizens (Kuznetsova et al., 2012; Feng and Lapata, 2013; Mason, 2013; Ordonez et al., 2011). In these approaches, the set of what can be described can be substantially larger than the set of what can be recognized, where the former is shaped and defined by the data, rather than by humans. This allows the resulting descriptions to be substantially more expressive, elaborate, and interesting than what would be possible in a purely bottom-up manner. Our work contributes to this second line of research.

One challenge in utilizing naturally existing multimodal data, however, is the noisy semantic alignment between images and text (Dodge et al., 2012; Berg et al., 2010). Therefore, we also investigate a related task of image caption generalization (Kuznetsova et al., 2013), which aims to improve the semantic image-text alignment by removing bits of text from existing captions that are less likely to be transferable to other images.

The high-level idea of our system is to harvest useful bits of text (as tree fragments) from existing image descriptions using detected visual content similarity, and then to compose a new description by selectively combining these extracted (and optionally pruned) tree fragments. This overall idea of composition based on extracted phrases is not new in itself (Kuznetsova et al., 2012); however, we make several technical and empirical contributions.

First, we propose a novel stochastic tree composition algorithm based on extracted tree fragments that integrates both tree structure and sequence cohesion into structural inference. Our algorithm permits a substantially higher level of linguistic expressiveness, flexibility, and creativity than those based on rules or templates (Kulkarni et al., 2011; Yang et al., 2011; Mitchell et al., 2012), while also addressing long-distance grammatical relations in a more principled way than those based on hand-coded constraints (Kuznetsova et al., 2012).

Second, we address image caption generalization as an optional subtask of image caption generation, and propose a tree compression algorithm that performs a light-weight parsing to search for the optimal set of tree branches to prune. Our work is the first to report empirical benefits of automatically compressed captions for image captioning.

The proposed approaches attain significantly better performance for both image caption generalization and generation tasks over competitive baselines and previous approaches. Our work results in an improved image caption corpus with automatic generalization, which is publicly available.¹

¹http://ilp-cky.appspot.com/

2 Harvesting Tree Fragments

Given a query image, we retrieve images that are visually similar to the query image, then extract potentially useful segments (i.e., phrases) from their corresponding image descriptions. We then compose a new image description using these retrieved text fragments (§3). Extraction of useful phrases is guided by both visual similarity and the syntactic parse of the corresponding textual description.
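To make the retrieval step concrete, the following is a minimal sketch of this harvesting loop. It is not the authors' implementation: visual similarity is reduced here to cosine similarity over precomputed feature histograms, and the corpus layout (pairs of feature vectors and pre-extracted phrases) is a hypothetical simplification of the detector-driven pipeline described in this section.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two feature histograms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def harvest_phrases(query_feats, corpus, phrase_type, k=10):
    """Retrieve phrases of one type from the k visually closest captioned images.

    query_feats : histogram of vector-quantized color/texture/shape responses
    corpus      : list of (feats, phrases) pairs, where `phrases` maps a phrase
                  type ("object", "action", "stuff", "scene") to the tree
                  fragment pre-extracted from the caption's parse
    phrase_type : which of the four phrase types to harvest
    """
    ranked = sorted(corpus, key=lambda item: -cosine_sim(query_feats, item[0]))
    return [phrases[phrase_type] for _, phrases in ranked[:k]
            if phrase_type in phrases]
```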
This extraction strategy, originally proposed by Kuznetsova et al. (2012), attempts to make the best use of linguistic regularities with respect to objects, actions, and scenes, making it possible to obtain richer textual descriptions than what current state-of-the-art vision techniques can provide in isolation. In all of our experiments we use the captioned image corpus of Ordonez et al. (2011), first pre-processing the corpus for relevant content by running deformable part model object detectors (Felzenszwalb et al., 2010). For our study, we run detectors for 89 object classes and set a high confidence threshold for detection.

As illustrated in Figure 1, for a query image detection, we extract four types of phrases (as tree fragments). First, we retrieve relevant noun phrases from images with visually similar object detections. We use color, texture (Leung and Malik, 1999), and shape (Dalal and Triggs, 2005; Lowe, 2004) based features, encoded in a histogram of vector quantized responses, to measure visual similarity. Second, we extract verb phrases for which the corresponding noun phrase takes the subject role. Third, from those images with "stuff" detections, e.g., "water" or "sky" (typically mass nouns), we extract prepositional phrases based on similarity of both visual appearance and relative spatial relationships between detected objects and "stuff". Finally, we use global "scene" similarity² to extract prepositional phrases referring to the overall scene, e.g., "at the conference" or "in the market".

²L2 distance between classification score vectors (Xiao et al., 2010)

We perform this phrase retrieval process for each detected object in the query image and generate one sentence for each object. All sentences are then combined together to produce the final description. Optionally, we apply image caption generalization (via compression) (§4) to all captions in the corpus prior to the phrase extraction and composition.

3 Tree Composition

We model tree composition as constraint optimization. The input to our algorithm is the set of retrieved phrases (i.e., tree fragments), as illustrated in §2. Let P = {p_0, ..., p_{L−1}} be the set of all phrases across the four phrase types (objects, actions, stuff and scene). We assume a mapping function pt : [0, L) → T, where T is the set of phrase types, so that the phrase type of p_i is pt(i). In addition, let R be the set of PCFG production rules and NT be the set of nonterminal symbols of the PCFG. The goal is to find and combine a good sequence of phrases G, |G| ≤ |T| = N = 4, drawn from P, into a final sentence. More concretely, we want to select and order a subset of phrases (at most one phrase of each phrase type) while considering both the parse structure and n-gram cohesion across phrasal boundaries.

Figure 2 shows a simplified example of a composed sentence with its corresponding parse structure. For brevity, the figure shows only one phrase for each phrase type, but in actuality there would be a set of candidate phrases for each type. Figure 3 shows the CKY-style representation of the internal mechanics of constraint optimization for the example composition from Figure 2. Each cell ij of the CKY matrix corresponds to G_ij, a subsequence of G starting at position i and ending at position j. If a cell in the CKY matrix is labeled with a nonterminal symbol s, it means that the corresponding tree of G_ij has s as its root.

Although we visualize the operation using a CKY-style representation in Figure 3, note that composition requires more complex combinatorial decisions than CKY parsing due to two additional considerations. We are: (1) selecting a subset of candidate phrases, and (2) re-ordering the selected phrases (hence making the problem NP-hard). Therefore, we encode our problem using Integer Linear Programming (ILP) (Roth and Yih, 2004; Clarke and Lapata, 2008) and use the CPLEX (ILOG, Inc., 2006) solver.

3.1 ILP Variables

Variables for Sequence Structure: Variables α encode phrase selection and ordering:

    α_ik = 1 iff phrase i ∈ P is selected for position k ∈ [0, N)    (1)

where k is one of the N = 4 positions in a sentence.³ Additionally, we define variables for each pair of adjacent phrases to capture sequence
cohesion:

    α_ijk = 1 iff α_ik = 1 and α_j(k+1) = 1    (2)

i.e., α_ijk indicates that phrase j is selected for the position immediately following phrase i.

³The number of positions is equal to the number of phrase types, since we select at most one phrase from each type.
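As a concrete illustration of this variable layer, here is a minimal sketch that builds the α_ik variables of Equation 1 and their basic uniqueness constraints with the open-source PuLP modeler. This is only a skeleton under simplifying assumptions: the paper's actual model is solved with CPLEX and additionally contains the adjacency variables α_ijk, the tree variables β and γ, and an objective over n-gram cohesion and parse scores, all omitted here.

```python
import pulp

def build_alpha_layer(phrases, phrase_type, N=4):
    """Sketch of the alpha_{ik} selection/ordering variables (Equation 1).

    phrases     : list of candidate phrase strings, indexed by i
    phrase_type : phrase_type[i] = pt(i), one of the four phrase types
    """
    prob = pulp.LpProblem("tree_composition", pulp.LpMaximize)
    L = len(phrases)

    # alpha[i][k] = 1 iff phrase i is selected for sentence position k.
    alpha = pulp.LpVariable.dicts("alpha", (range(L), range(N)), cat="Binary")

    # Each position holds at most one phrase ...
    for k in range(N):
        prob += pulp.lpSum(alpha[i][k] for i in range(L)) <= 1
    # ... and each phrase occupies at most one position.
    for i in range(L):
        prob += pulp.lpSum(alpha[i][k] for k in range(N)) <= 1
    # At most one phrase of each type (see Miscellaneous Constraints, §3.3).
    for t in set(phrase_type):
        prob += pulp.lpSum(alpha[i][k]
                           for i in range(L) if phrase_type[i] == t
                           for k in range(N)) <= 1
    return prob, alpha
```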
[Figures 2 and 3: an example composed sentence "A cow in the countryside was staring at me in the grass", built from the phrases "A cow" (NP), "in the countryside" (PP), "was staring at me" (VP), and "in the grass" (PP) under a root S, together with the corresponding CKY-style matrix over positions 0–3 and cell indices such as i = 0, j = 2, k = 1.]

For each level, and each node of that level, the algorithm has to decide which parse tag to choose. This process is represented by the assignment of a particular tag to a matrix cell. The chosen tag must be the head of a rule; for example, cell 12 is assigned the tag VP, corresponding to the rule VP → VP PP, which connects the leafs "going out to sea" and "in the ocean". The problem is to find a tag assignment for each cell of the matrix, given that some cells can be empty if they do not connect children cells. The latter correspond to children branches of the tree and belong to the previous diagonal in the left-to-right order. Also, we do not try all possible pairs of children from the previous diagonal; we use a technique similar to the one used in CKY parsing. Matrix cell pairs corresponding to
variables (Equations 2, 4, 5). Constraints for a product of two variables have been discussed by Clarke and Lapata (2008). For Equation 2, we add the following constraints (similar constraints are also added for Equations 4, 5):

    ∀i, j, k:  α_ijk ≤ α_ik
               α_ijk ≤ α_j(k+1)                                   (7)
               α_ijk + (1 − α_ik) + (1 − α_j(k+1)) ≥ 1

Consistency between Tree Leafs and Sequences: The ordering of phrases implied by α_ijk must be consistent with the ordering of phrases implied by the β variables. This can be achieved by aligning the leaf cells (i.e., the β_kks) in the CKY-style matrix with the α variables as follows:

    ∀i, k:  α_ik ≤ Σ_{s ∈ NT_i} β_kks    (8)

    ∀k:  Σ_i α_ik = Σ_{s ∈ NT} β_kks    (9)

where NT_i refers to the set of PCFG nonterminals that are compatible with the phrase type pt(i) of p_i. For example, NT_i = {NN, NP, ...} if p_i corresponds to an "object" (noun phrase). Thus, Equation 8 enforces the correspondence between phrase types and nonterminal symbols at the tree leafs. Equation 9 enforces the constraint that the number of selected phrases and instantiated tree leafs must be the same.

Tree Congruence Constraints: To ensure that each CKY cell has at most one symbol, we require

    ∀i, j:  Σ_{s ∈ NT} β_ijs ≤ 1    (10)

We also require that

    ∀i, j > i, h:  β_ijh = Σ_{k=i}^{j−1} Σ_{r ∈ R_h} β_ijkr    (11)

where R_h = {r ∈ R : r = h → pq}. We enforce these constraints only for non-leafs. This constraint forbids instantiations where a nonterminal symbol h is selected for cell ij without selecting a corresponding PCFG rule.

We also ensure that we produce a valid tree structure. For instance, if we select 3 phrases as shown in Figure 3, we must have the root of the tree at the corresponding cell 02:

    ∀k ∈ [1, N):  Σ_{s ∈ NT} β_kks ≤ Σ_{t=k}^{N−1} Σ_{s ∈ NT} β_0ts    (12)

We also require cells that are not selected for the resulting parse structure to be empty:

    ∀i, j:  Σ_k γ_ijk ≤ 1    (13)

In addition, we penalize solutions without the S tag at the parse root as a soft constraint.

Miscellaneous Constraints: Finally, we include several constraints to avoid degenerate solutions or to otherwise enhance the composed output. We: (1) enforce that a noun phrase is selected (to ensure semantic relevance to the image content), (2) allow at most one phrase of each type, (3) do not allow multiple phrases with identical head words (to avoid redundancy), and (4) allow at most one scene phrase for all sentences in the description. We find that handling of sentence boundaries is important if the ILP formulation is based only on sequence structure, but with the integration of tree-based structure, we do not need to specifically handle sentence boundaries.
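The product linearization used in Equation 7 is a standard ILP device, and its correctness is easy to check exhaustively. The following self-contained snippet (an illustration, not the paper's code) verifies that the three inequalities force an auxiliary binary variable z to equal the product x·y:

```python
from itertools import product

def check_product_linearization():
    """Verify the Equation 7 pattern: for binary x, y, z the constraints
        z <= x,   z <= y,   z + (1 - x) + (1 - y) >= 1
    are satisfiable exactly when z == x * y."""
    for x, y, z in product((0, 1), repeat=3):
        feasible = (z <= x) and (z <= y) and (z + (1 - x) + (1 - y) >= 1)
        assert feasible == (z == x * y)
    print("linearization matches the product for all 8 assignments")

check_product_linearization()
```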
3.4 Discussion

An interesting aspect of description generation explored in this paper is the use of tree fragments, rather than individual words, as the building blocks of composition. There are three practical benefits: (1) syntactic and semantic expressiveness, (2) correctness, and (3) computational efficiency. Because we extract phrases from human written captions, we are able to use expressive language, and are less likely to make syntactic or semantic errors. Our phrase extraction process can be viewed at a high level as visually-grounded, or visually-situated, paraphrasing. Also, because the unit of operation is tree fragments, the ILP formulation encoded in this work is computationally lightweight. If the unit of composition were words, the ILP instances would be significantly more computationally intensive, and more likely to suffer from grammatical and semantic errors.

4 Tree Compression

As noted by recent studies (Mason and Charniak, 2013; Kuznetsova et al., 2013; Jamieson et al., 2010), naturally existing image captions often include contextual information that does not directly describe visual content, which ultimately hinders their usefulness for describing other images. Therefore, to improve the fidelity of the generated descriptions, we explore image caption generalization as an optional pre-processing step. Figure 4 illustrates a concrete example of image caption generalization in the context of image caption generation.

[Figure 4: Compressed captions (on the left) are more applicable for describing new images (on the right). Example: "Late in the day, under my sunset shot attempts, my cat strolled along the fence and posed for this classic profile" is generalized to "Late in the day cat posed for this profile"; also shown are the transferred descriptions "This bridge stands late in the day, under my sunset shot attempts" and "A cat strolled along the fence and posed for this classic profile".]

We cast caption generalization as sentence compression. We encode the problem as tree pruning via lightweight CKY parsing, while also incorporating several other considerations such as leaf-level ngram cohesion scores and visually informed content selection. Figure 5 shows an example compression, and Figure 6 shows the corresponding CKY matrix. At a high level, the compression operation resembles bottom-up CKY parsing, but in addition to parsing, we also consider deletion of parts of the trees. When deleting parts of the original tree, we might need to re-parse the remainder of the tree. Note that we consider re-parsing only with respect to the original parse tree produced by a state-of-the-art parser, hence it is only a light-weight parsing.⁵

⁵Integrating full parsing into the original sentence would be a straightforward extension conceptually, but may not be an empirically better choice when parsing for compression is based on vanilla unlexicalized parsing.

4.1 Dynamic Programming

Input to the algorithm is a sentence, represented as a vector x = x_0 ... x_{n−1} = x[0 : n−1], and its PCFG parse π(x) obtained from the Stanford parser. For simplicity of notation, we assume that both the parse tree and the word sequence are encoded in x. Then, the compression can be formalized as:

    ŷ = argmax_y Π_i φ_i(x, y)    (14)

where each φ_i is a potential function corresponding to a criterion of the desired compression:

    φ_i(x, y) = exp(θ_i · f_i(x, y))    (15)

where θ_i is the weight for a particular criterion (described in §4.2), whose scoring function is f_i.

We solve the decoding problem (Equation 14) using dynamic programming. For this, we need to solve the compression sub-problems for sequences x[i : j], which can be viewed as branches ŷ[i : j] of the final tree ŷ[0 : n−1]. For example, in Figure 5, the final solution is ŷ[0 : 7], while a sub-solution of x[4 : 7] corresponds to a tree branch PP. Notice that sub-solution ŷ[3 : 7] represents the same branch as ŷ[4 : 7] due to branch deletion. Some computed sub-solutions, e.g., ŷ[1 : 4], get dropped from the final compressed tree.

We define a matrix of scores D[i, j, h] (Equation 17), where h is one of the nonterminal symbols being considered for the cell indexed by i, j, i.e., a candidate for the root symbol of a branch ŷ[i : j]. When all values D[i, j, h] are computed, we take

    ĥ = argmax_h D[0, n−1, h]    (16)

and backtrack to reconstruct the final compression (the exact solution to Equation 14).

    D[i, j, h] = max_{k ∈ [i, j), r ∈ R_h} of
        (1) D[i, k, p] + D[k+1, j, q] + Δφ[r, ij]
        (2) D[i, k, p] + Δφ[r, ij]                     (17)
        (3) D[k+1, j, p] + Δφ[r, ij]

where R_h = {r ∈ R : r = h → pq ∨ r = h → p}. Index k determines a split point for child branches of a subtree ŷ[i : j]. For example, in Figure 5 the split point for children of the subtree ŷ[0 : 7] is k = 2. The three cases (1)–(3) of the above equation correspond to the following tree pruning cases:

Pruning Case (1): None of the children of the current node is deleted. For example, in Figures 5 and 6, the PCFG rule PP → IN PP, corresponding to the sequence "in black and white", is retained. Another situation that can be encountered is tree re-parsing.
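Before turning to the deletion cases (2)/(3), detailed next, here is a minimal sketch of the full recursion under simplifying assumptions: only binary rules are modeled (the unary propagation rules h → h introduced below are omitted), Δφ is passed in as a black-box scoring callable, and backtracking to recover the pruned tree is left out. It illustrates the three cases of Equation 17, not the authors' actual implementation; in this binary-only setting the survivor of a left-child deletion is labeled with the right child's symbol.

```python
import math
from functools import lru_cache

def tree_compress_scores(n, rules, leaf_score, delta_phi):
    """Sketch of the D[i, j, h] recursion (Equation 17).

    n          : number of words in the sentence
    rules      : dict mapping a head symbol h to binary rules (h, p, q)
    leaf_score : dict (i, h) -> score of labeling word i with symbol h
    delta_phi  : callable (rule, i, j) -> summed criteria gain (Equation 18)
    """
    @lru_cache(maxsize=None)
    def D(i, j, h):
        if i == j:                                   # leaf cell
            return leaf_score.get((i, h), -math.inf)
        best = -math.inf
        for rule in rules.get(h, []):
            _, p, q = rule
            for k in range(i, j):                    # split point
                gain = delta_phi(rule, i, j)
                # case (1): both children kept
                best = max(best, D(i, k, p) + D(k + 1, j, q) + gain)
                # case (2): right child deleted, words x[k+1 : j] pruned
                best = max(best, D(i, k, p) + gain)
                # case (3): left child deleted, words x[i : k] pruned
                best = max(best, D(k + 1, j, q) + gain)
        return best

    # Equation 16: pick the best root symbol for the whole sentence.
    return max((D(0, n - 1, h), h) for h in rules)
```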
[Figure 5: CKY compression of "Vintage motorcycle shot done in black and white" (POS tags JJ NN NN VBN IN JJ CC JJ, word positions 0–7, split point k = 2), annotated with the criteria used: rule probability, ngram cohesion, deletion probability, and vision confidence. Both the chosen rules and phrases (blue bold font and blue solid arrows) and the rules and phrases not chosen (red italic smaller font and red dashed lines) are shown.]

Pruning Cases (2)/(3): Deletion of the left/right child, respectively. There are two types of deletion, as illustrated in Figures 5 and 6. The first corresponds to deletion of a child node. For example, the second child NN of the rule NP → NP NN is deleted, which yields deletion of "shot". The second type is a special case: propagating a node to a higher level of the tree. In Figure 6, this situation occurs when deleting the JJ "Vintage", which causes the propagation of NN from cell 11 to cell 01. For this purpose, we expand the set of rules R with additional special rules of the form h → h, e.g., NN → NN, which allow propagation of tree nodes to higher levels of the compressed tree.⁶

⁶We assign the probabilities of these special propagation rules to 1 so that they will not affect the final parse tree score. Turner and Charniak (2005) handled propagation cases similarly.

4.2 Modeling Compression Criteria

The Δφ term⁷ in Equation 17 denotes the sum of the logs of the potential functions for each criterion q:

    Δφ[r, ij] = Σ_q θ_q · Δf_q(r, ij)    (18)

⁷We use Δ to distinguish the potential value for the whole sentence from the gain of the potential during a single step of the algorithm.

Note that Δφ depends on the current rule r, along with the historical information before the current step ij, such as the original rule r_ij, and the ngrams on the border between the left and right child branches of rule r_ij. We use the following four criteria f_q in our model, which are demonstrated in Figures 5 and 6.

[Figure 6: The CKY matrix for the compression in Figure 5, with cell indices i, j (e.g., cells 00, 11, 01) and the same four criteria: rule probability, ngram cohesion, deletion probability, and vision confidence.]

I. Tree Structure: We capture PCFG rule probabilities estimated from the corpus as Δf_pcfg = log P_pcfg(r).

II. Sequence Structure: We incorporate ngram cohesion scores only across the border between two branches of a subtree.

III. Branch Deletion Probabilities: We compute probabilities of deletion for children as:

    Δf_del = log P(r_t | r_ij) = log [ count(r_t, r_ij) / count(r_ij) ]    (19)

where count(r_t, r_ij) is the frequency with which r_ij is transformed to r_t by deletion of one of its children, and count(r_ij) is the count of r_ij in uncompressed sentences. We estimate this probability from a training corpus, described in §4.3.

IV. Vision Detection (Content Selection): We want to keep words referring to actual objects in the image. Thus, we use V(x_j), a visual similarity score, as our confidence of an object corresponding to word x_j. This similarity is obtained from the visual recognition predictions of Deng et al. (2012b).

Note that some test instances include rules that we have not observed during training. We default to the original caption in those cases. The weights θ_i are set using a tuning dataset. We control over-compression by setting the weight for f_del to a small value relative to the other weights.

4.3 Human Compressed Captions

Although we model image caption generalization as sentence compression, in practical applications we may want the outputs of these two tasks to be different.
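Criterion III reduces to counting rule transformations in the aligned training pairs. A minimal sketch of that estimation (with a hypothetical rule encoding, and with count(r_ij) approximated by its count over the aligned pairs; not the authors' code) is:

```python
import math
from collections import Counter

def deletion_log_probs(aligned_rules):
    """Estimate the branch deletion criterion of Equation 19.

    aligned_rules : list of (r_ij, r_t) pairs observed by aligning original
                    and compressed captions, where r_t is the rule that r_ij
                    became after deleting one child. Rules can be any
                    hashable encoding, e.g. ("NP", ("NP", "NN")).
    """
    joint = Counter(aligned_rules)                    # count(r_t, r_ij)
    marginal = Counter(r for r, _ in aligned_rules)   # count(r_ij), here
                                                      # taken over the pairs
    return {(r_ij, r_t): math.log(c / marginal[r_ij])
            for (r_ij, r_t), c in joint.items()}
```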
For example, there may be differences in what should be deleted (named entities in newswire summaries could be important to keep, while they may be extraneous for image caption generalization). To learn the syntactic patterns for caption generalization, we collect a small set of example compressed captions (380 in total) using Amazon Mechanical Turk (AMT) (Snow et al., 2008). For each image, we asked 3 turkers to first list all visible objects in the image and then to write a compressed caption by removing the bits of text that are not visually verifiable. We then align the original and compressed captions to measure rule deletion probabilities, excluding misalignments, similar to Knight and Marcu (2000). Note that we remove this dataset from the 1M caption corpus when we perform description generation.

5 Experiments

We use the 1M captioned image corpus of Ordonez et al. (2011). We reserve 1K images as a test set, and use the rest of the corpus for phrase extraction. We experiment with the following approaches:

Proposed Approaches:
• TREEPRUNING: Our tree compression approach as described in §4.
• SEQ+TREE: Our tree composition approach as described in §3.
• SEQ+TREE+PRUNING: SEQ+TREE using the compressed captions of TREEPRUNING as building blocks.

Baselines for Composition:
• SEQ+LINGRULE: The closest equivalent to the older sequence-driven system (Kuznetsova et al., 2012). Uses a few minor enhancements, such as sentence-boundary statistics, to improve grammaticality.
• SEQ: The §3 system without tree models and without the enhancements of SEQ+LINGRULE.
• SEQ+PRUNING: SEQ using the compressed captions of TREEPRUNING as building blocks.

We also experiment with the compression of human written captions, which are used to generate image descriptions for the new target images.

Baselines for Compression:
• SEQCOMPRESSION (Kuznetsova et al., 2013): Inference operates over the sequence structure. Although optimization is subject to constraints derived from the dependency parse, parsing is not an explicit part of the inference structure. Example outputs are shown in Figure 7.

[Figure 7: Caption generalization: good/bad examples.
- Orig: "Note the pillows, they match the chair that goes with it, plus the table in the picture is included." / Seq Compression: "The table in the picture." / Tree Pruning: "The chair with the table in the picture."
- Orig: "Only in wintertime we see these birds here in the river." / Seq Compression: "See these birds in the river." / Tree Pruning: "These birds in the river."
- Orig: "The world's most powerful lighthouse sitting beside the house with the world's thickest curtains." / Seq Compression: "Sitting beside the house" / Tree Pruning: "Powerful lighthouse beside the house with the curtains."
- Orig: "Orange cloud over street light, near Lanakila Street (phone camera)." / Seq Compression: "Orange street" / Tree Pruning: "Phone camera." (relevance problem)
- Orig: "There's something about having 5 trucks parked in front of my house that makes me feel all important-like." / Seq Compression: "Front of my house." / Tree Pruning: "Trucks in front my house." (grammar mistakes)]

5.1 Automatic Evaluation

We perform automatic evaluation using two measures widely used in machine translation: BLEU (Papineni et al., 2002)⁸ and METEOR (Denkowski and Lavie, 2011).⁹ We remove all punctuation and convert captions to lowercase. We use the 1K test images from the captioned image corpus,¹⁰ and take the original captions as the gold standard captions to compare against.

⁸We use the unigram NIST implementation: ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v13a-20091001.tar.gz
⁹With equal weight between precision and recall in Table 1.
¹⁰Except for those for which image URLs are broken, or CPLEX did not return a solution.

Table 1: Automatic Evaluation

Method            | BLEU w/ (w/o) brevity penalty | METEOR P | METEOR R | METEOR M
SEQ+LINGRULE      | 0.152 (0.152)                 | 0.13     | 0.17     | 0.095
SEQ               | 0.138 (0.138)                 | 0.12     | 0.18     | 0.094
SEQ+TREE          | 0.149 (0.149)                 | 0.13     | 0.14     | 0.082
SEQ+PRUNING       | 0.177 (0.177)                 | 0.15     | 0.16     | 0.101
SEQ+TREE+PRUNING  | 0.140 (0.189)                 | 0.16     | 0.12     | 0.088
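For reference, the unigram BLEU with brevity penalty reported in Table 1 amounts to the following computation; this is a textbook re-implementation against a single reference (the authors ran the NIST mteval script), assuming pre-lowercased, punctuation-stripped token lists.

```python
import math
from collections import Counter

def unigram_bleu(candidate, reference, with_penalty=True):
    """Unigram BLEU of one candidate caption against one reference.

    candidate, reference : lists of lowercased tokens, punctuation removed
    """
    if not candidate:
        return 0.0
    cand, ref = Counter(candidate), Counter(reference)
    clipped = sum(min(n, ref[w]) for w, n in cand.items())   # clipped matches
    score = clipped / len(candidate)                         # unigram precision
    if with_penalty and len(candidate) < len(reference):     # brevity penalty
        score *= math.exp(1 - len(reference) / len(candidate))
    return score

print(unigram_bleu("a cow in the grass".split(),
                   "the cow was staring at me in the grass".split()))
```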
The results in Table 1 show that both the integration of the tree structure (+TREE) and the generalization of captions using tree compression (+PRUNING) improve the BLEU score without brevity penalty significantly,¹¹ while improving METEOR only moderately (due to an improvement in precision with a decrease in recall).

¹¹While 4-gram BLEU with brevity penalty is found to correlate better with human judges by recent studies (Elliott and Keller, 2014), we found that this is not the case for our task. This may be due to the differences in the gold standard captions. We use naturally existing ones, which include a wider range of content and style than crowd-sourced captions.

5.2 Human Evaluation

Neither BLEU nor METEOR directly measures grammatical correctness over long distances, and neither may correspond perfectly to human judgments. Therefore, we supplement automatic evaluation with human evaluation. For human evaluations, we present two options generated from two competing systems, and ask turkers to choose the one that is better with respect to: relevance, grammar, and overall quality. Results are shown in Table 2, with 3 turker ratings per image. We filter out turkers based on a control question. We then compute the selection rate (%) of preferring method-1 over method-2. The agreement among turkers is a frequent concern; therefore, we vary the set of dependable users based on their Cohen's kappa score (κ) against other users. It turns out that filtering users based on κ does not make a big difference in determining the winning method.

Table 2: Human Evaluation: posed as a binary question "which of the two options is better?" with respect to Relevance (Rel), Grammar (Gmar), and Overall (All). According to Pearson's χ² test, all results are statistically significant.

Method-1         | Method-2        | Criteria | all turkers | turkers w/ κ > 0.55 | turkers w/ κ > 0.6
(Image Description Generation)
SEQ+TREE         | SEQ             | Rel      | 72 | 72 | 72
SEQ+TREE         | SEQ             | Gmar     | 83 | 83 | 83
SEQ+TREE         | SEQ             | All      | 68 | 69 | 66
SEQ+TREE+PRUNING | SEQ+TREE        | Rel      | 68 | 72 | 72
SEQ+TREE+PRUNING | SEQ+TREE        | Gmar     | 41 | 38 | 41
SEQ+TREE+PRUNING | SEQ+TREE        | All      | 63 | 64 | 66
SEQ+TREE         | SEQ+LINGRULE    | All      | 62 | 64 | 62
SEQ+TREE+PRUNING | SEQ+LINGRULE    | All      | 67 | 75 | 77
SEQ+TREE+PRUNING | SEQ+PRUNING     | All      | 73 | 75 | 75
SEQ+TREE+PRUNING | HUMAN           | All      | 24 | 19 | 19
(Image Caption Generalization)
TREEPRUNING      | SEQCOMPRESSION∗ | Rel      | 65 | 65 | 66

As expected, tree-based systems significantly outperform their sequence-based counterparts. For example, SEQ+TREE is strongly preferred over SEQ, with a selection rate of 83%. Somewhat surprisingly, improved grammaticality also seems to improve relevance scores (72%), possibly because it is harder to appreciate the semantic relevance of automatic captions when they are less comprehensible. Also as expected, compositions based on pruned tree fragments significantly improve relevance (68–72%), while slightly deteriorating grammar (38–41%). Notably, the captions generated by our system are preferred over the original (owner generated) captions 19–24% of the time. One such example is included in Figure 8: "The butterflies are attracted to the colorful flowers."

[Figure 8: An example of a description preferred over the human gold standard; the image description is improved due to caption generalization. Image description generation: Seq: "A butterfly at the car was spotted by my nine year old cousin."; Seq+Pruning: "The butterflies are attracted to the colourful flowers at the car."; Seq+Tree: "The butterflies are attracted to the colourful flowers in Hope Gardens."; Seq+Tree+Pruning: "The butterflies are attracted to the colourful flowers." Caption generalization: Orig: "The butterflies are attracted to the colourful flowers in Hope Gardens."; Seq Compression: "The colourful flowers."; Tree Pruning: "The butterflies are attracted to the colourful flowers."]

Additional examples (good and bad) are provided in Figures 9 and 10. Many of these captions are highly expressive while remaining semantically relevant.
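The κ thresholds in Table 2 are ordinary Cohen's kappa scores between raters. For reference, a minimal sketch of that agreement computation over two turkers' choices (not the authors' exact filtering code) is:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa between two raters' label sequences.

    rater_a, rater_b : equal-length lists of labels, e.g. two turkers'
                       per-image "method-1 vs. method-2" choices
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[l] * counts_b[l] for l in counts_a) / (n * n)
    if expected == 1.0:  # degenerate case: both raters always give one label
        return 1.0
    return (observed - expected) / (1 - expected)

# e.g., keep only turkers whose kappa against the other raters is > 0.55
print(cohens_kappa([1, 1, 2, 2, 1], [1, 2, 2, 2, 1]))  # ~0.615
```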
[Figure 9: Human: "Some flower on a bar in a hotel in Grapevine, TX."; Seq+Tree+Pruning: "The flower was so vivid and attrac…]