Transactions of the Association for Computational Linguistics, 1 (2013) 25–36. Action Editor: Hal Daumé III.
Submitted 10/2012; Published 3/2013. © 2013 Association for Computational Linguistics.
Grounding Action Descriptions in Videos

Michaela Regneri∗, Marcus Rohrbach♦, Dominikus Wetzel∗, Stefan Thater∗, Bernt Schiele♦ and Manfred Pinkal∗

∗ Department of Computational Linguistics, Saarland University, Saarbrücken, Germany
(regneri|dwetzel|stth|pinkal)@coli.uni-saarland.de
♦ Max Planck Institute for Informatics, Saarbrücken, Germany
(rohrbach|schiele)@mpi-inf.mpg.de

Abstract

Recent work has shown that the integration of visual information into text-based models can substantially improve model predictions, but so far only visual information extracted from static images has been used. In this paper, we consider the problem of grounding sentences describing actions in visual information extracted from videos. We present a general purpose corpus that aligns high quality videos with multiple natural language descriptions of the actions portrayed in the videos, together with an annotation of how similar the action descriptions are to each other. Experimental results demonstrate that a text-based model of similarity between actions improves substantially when combined with visual information from videos depicting the described actions.

1 Introduction

The estimation of semantic similarity between words and phrases is a basic task in computational semantics. Vector-space models of meaning are one standard approach. Following the distributional hypothesis, frequencies of context words are recorded in vectors, and semantic similarity is computed as a proximity measure in the underlying vector space. Such distributional models are attractive because they are conceptually simple, easy to implement and relevant for various NLP tasks (Turney and Pantel, 2010). At the same time, they provide a substantially incomplete picture of word meaning, since they ignore the relation between language and extra-linguistic information, which is constitutive for linguistic meaning. In the last few years, a growing amount of work has been devoted to the task of grounding meaning in visual information, in particular by extending the distributional approach to jointly cover texts and images (Feng and Lapata, 2010; Bruni et al., 2011). As a clear result, visual information improves the quality of distributional models. Bruni et al. (2011) show that visual information drawn from images is particularly relevant for concrete common nouns and adjectives.

A natural next step is to integrate visual information from videos into a semantic model of event and action verbs. Psychological studies have shown the connection between action semantics and videos (Glenberg, 2002; Howell et al., 2005), but to our knowledge, we are the first to provide a suitable data source and to implement such a model. The contribution of this paper is three-fold:

• We present a multimodal corpus containing textual descriptions aligned with high-quality videos. Starting from the video corpus of Rohrbach et al. (2012b), which contains high-resolution video recordings of basic cooking tasks, we collected multiple textual descriptions of each video via Mechanical Turk. We also provide an accurate sentence-level alignment of the descriptions with their respective videos. We expect the corpus to be a valuable resource for computational semantics, and moreover helpful for a variety of purposes, including video understanding and generation of text from videos.

• We provide a gold-standard dataset for the evaluation of similarity models for action verbs and phrases. The dataset has been designed as analogous to the Usage Similarity dataset of
Erk et al. (2009) and contains pairs of natural-language action descriptions plus their associated video segments. Each of the pairs is annotated with a similarity score based on several manual annotations.

• We report an experiment on similarity modeling of action descriptions based on the video corpus and the gold standard annotation, which demonstrates the impact of scene information from videos. Visual similarity models outperform text-based models; the performance of combined models approaches the upper bound indicated by inter-annotator agreement.

The paper is structured as follows: We first place ourselves in the landscape of related work (Sec. 2), then we introduce our corpus (Sec. 3). Sec. 4 reports our action similarity annotation experiment and Sec. 5 introduces the similarity measures we apply to the annotated data. We outline the results of our evaluation in Sec. 6, and conclude the paper with a summary and directions for future work (Sec. 7).

2 Related Work

A large multimodal resource combining language and visual information resulted from the ESP game (von Ahn and Dabbish, 2004). The dataset contains many images tagged with several one-word labels. The Microsoft Video Description Corpus (Chen and Dolan, 2011, MSVD) is a resource providing textual descriptions of videos. It consists of multiple crowd-sourced textual descriptions of short video snippets. The MSVD corpus is much larger than our corpus, but most of the videos are of relatively low quality and therefore too challenging for state-of-the-art video processing to extract relevant information. The videos are typically short and summarized with a single sentence. Our corpus contains coherent textual descriptions of longer video sequences, where each sentence is associated with a time frame.

Gupta et al. (2009) present another useful resource: their model learns the alignment of predicate-argument structures with videos and uses the result for action recognition in videos. However, the corpus contains no natural language texts.

The connection between natural language sentences and videos has so far been mostly explored by the computer vision community, where different methods for improving action recognition by exploiting linguistic data have been proposed (Gupta and Mooney, 2010; Motwani and Mooney, 2012; Cour et al., 2008; Tzoukermann et al., 2011; Rohrbach et al., 2012b, among others). Our resource is intended to be used for action recognition as well, but in this paper, we focus on the inverse effect of visual data on language processing.

Feng and Lapata (2010) were the first to enrich topic models for newspaper articles with visual information, by incorporating features from article illustrations. They achieve better results when incorporating the visual information, providing an enriched model that pairs a single text with a picture. Bruni et al. (2011) used the ESP game data to create a visually grounded semantic model. Their results outperform purely text-based models using visual information from pictures for the task of modeling noun similarities. They model single words, and mostly visual features lead only to moderate improvements, which might be due to the mixed quality and random choice of the images. Dodge et al. (2012) recently investigated which words can actually be grounded in images at all, producing an automatic classifier for visual words.

An interesting in-depth study by Mathe et al. (2008) automatically learnt the semantics of motion verbs as abstract features from videos. The study captures 4 actions with 8-10 videos for each of the actions, and would need a perfect object recognition from a visual classifier to scale up.

Steyvers (2010) and later Silberer and Lapata (2012) present an alternative approach to incorporating visual information directly: they use so-called feature norms, which consist of human associations for many given words, as a proxy for general perceptual information. Because this model is trained and evaluated on those feature norms, it is not directly comparable to our approach.

The Restaurant Game by Orkin and Roy (2009) grounds written chat dialogues in actions carried out in a computer game. While this work is outstanding from the social learning perspective, the actions that ground the dialogues are clicks on a screen rather than real-world actions. The dataset has successfully been used to model determiner meaning (Reckman et al., 2011) in the context of the Restaurant Game,
but it is unclear how this approach could scale up to content words and other domains.

3 The TACOS Corpus

We build our corpus on top of the "MPII Cooking Composite Activities" video corpus (Rohrbach et al., 2012b, MPII Composites), which contains videos of different activities in the cooking domain, e.g., preparing carrots or separating eggs. We extend the existing corpus with multiple textual descriptions collected by crowd-sourcing via Amazon Mechanical Turk (MTurk, mturk.com). To facilitate the alignment of sentences describing activities with their proper video segments, we also obtained approximate timestamps, as described in Sec. 3.2.

MPII Composites comes with timed gold-standard annotation of low-level activities and participating objects (e.g. OPEN [HAND, DRAWER] or TAKE OUT [HAND, KNIFE, DRAWER]). By adding textual descriptions (e.g., The person takes a knife from the drawer) and aligning them on the sentence level with videos and low-level annotations, we provide a rich multimodal resource (cf. Fig. 2), the "Saarbrücken Corpus of Textually Annotated Cooking Scenes" (TACOS). In particular, the TACOS corpus provides:

• A collection of coherent textual descriptions for video recordings of activities of medium complexity, as a basis for empirical discourse-related research, e.g., the selection and granularity of action descriptions in context

• A high-quality alignment of sentences with video segments, supporting the grounding of action descriptions in visual information

• Collections of paraphrases describing the same scene, which result as a by-product from the text-video alignment and can be useful for text generation from videos (among other things)

• The alignment of textual activity descriptions with sequences of low-level activities, which may be used to study the decomposition of action verbs into basic activity predicates

We expect that our corpus will encourage and enable future work on various topics in natural language and video processing. In this paper, we will make use of the second aspect only, demonstrating the usefulness of the corpus for the grounding task.

After a more detailed description of the basic video corpus and its annotation (Sec. 3.1), we describe the collection of textual descriptions with MTurk (Sec. 3.2), and finally show the assembly and some benchmarks of the final corpus (Sec. 3.3).

3.1 The video corpus

MPII Composites contains 212 high resolution video recordings of 1-23 minutes length (4.5 min. on average). 41 basic cooking tasks such as cutting a cucumber were recorded, each between 4 and 8 times. The selection of cooking tasks is based on those proposed at "Jamie's Home Cooking Skills" (www.jamieshomecookingskills.com). The corpus is recorded in a kitchen environment with a total of 22 subjects. Each video depicts a single task executed by an individual subject.

The dataset contains expert annotations of low-level activity tags. Annotations are provided for segments containing a semantically meaningful cooking-related movement pattern. The action must go beyond single body part movements (such as move arm up) and must have the goal of changing the state or location of an object. 60 different activity labels are used for annotation (e.g. PEEL, STIR, TRASH). Each low-level activity tag consists of an activity label (PEEL), a set of associated objects (CARROT, DRAWER, ...), and the associated time frame (starting and ending points of the activity). Associated objects are the participants of an activity, namely tools (e.g. KNIFE), patient (CARROT) and location (CUTTING-BOARD). We provide the coarse-grained role information for patient, location and tool in the corpus data, but we did not use this information in our experiments. The dataset contains a total of 8818 annotated segments, on average 42 per video.
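To make the two annotation layers concrete, the following is a minimal sketch of how one video's data could be represented; the class and field names are our own illustration rather than the corpus's actual file format, and the example values are illustrative. The second layer corresponds to the MTurk descriptions introduced in Sec. 3.2 below.

from dataclasses import dataclass
from typing import List

@dataclass
class LowLevelActivity:
    """One expert-annotated segment: activity label, participating objects,
    and the time frame it covers (cf. Sec. 3.1)."""
    label: str            # e.g. "TAKE OUT"
    objects: List[str]    # e.g. ["HAND", "KNIFE", "DRAWER"]
    start: int            # starting point of the activity
    end: int              # ending point of the activity

@dataclass
class TimedDescription:
    """One MTurk sentence with the approximate ending timestamp recorded
    when the video playback paused (cf. Sec. 3.2)."""
    text: str             # e.g. "He takes out a knife."
    end: int              # approximate ending time of the described action

# one video pairs both layers:
low_level = [LowLevelActivity("TAKE OUT", ["HAND", "KNIFE", "DRAWER"], 1431, 1647)]
descriptions = [TimedDescription("He takes out a knife.", 1500)]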
3.2 Collecting textual video descriptions

We collected textual descriptions for a subset of the videos in MPII Composites, restricting collection to tasks that involve manipulation of cooking ingredients. We also excluded tasks with fewer than four video recordings in the corpus, leaving 26 tasks to be described. We randomly selected five videos from each task, except the three tasks for which only four videos are available. This resulted in a total of 127 videos. For each video, we collected 20 different textual descriptions, leading to 2540 annotation assignments. We published these assignments (HITs) on MTurk, using an adapted version of the annotation tool Vatic (Vondrick et al., 2012; github.com/marcovzla/vatic/tree/bolt).

In each assignment, the subject saw one video specified with the task title (e.g. How to prepare an onion), and then was asked to enter at least five and at most 15 complete English sentences to describe the events in the video. The annotation instructions contained example annotations from a kitchen task not contained in our actual dataset.

Annotators were encouraged to watch each video several times, skipping backward and forward as they wished. They were also asked to take notes while watching, and to sketch the annotation before entering it. Once familiarized with the video, subjects did the final annotation by watching the entire video from beginning to end, without the possibility of further non-sequential viewing. Subjects were asked to enter each sentence as soon as the action described by the sentence was completed. The video playback paused automatically at the beginning of the sentence input. We recorded pause onset for each sentence annotation as an approximate ending timestamp of the described action. The annotators resumed the video manually.

The tasks required a HIT approval rate of 75% and were open only to workers in the US, in order to increase the general language quality of the English annotations. Each task paid 1.20 USD. Before paying we randomly inspected the annotations and manually checked for quality. The total costs of collecting the annotations amounted to 3,353 USD. The data was obtained within a time frame of 3.5 weeks.

3.3 Putting the TACOS corpus together

Our corpus is a combination of the MTurk data and MPII Composites, created by filtering out inappropriate material and computing a high-quality alignment of sentences and video segments. The alignment is done by matching the approximate timestamps of the MTurk data to the accurate timestamps in MPII Composites.

We discarded text instances if people did not time the sentences properly, taking the association of several (or even all) sentences to a single timestamp as an indicator. Whenever we found a timestamp associated with two or more sentences, we discarded the whole instance. Overall, we had to filter out 13% of the text instances, which left us with 2206 textual video descriptions.

For the alignment of sentence annotations and video segments, we assign a precise time frame to each sentence in the following way: We take the time frames given by the low-level annotation in MPII Composites as a gold standard micro-event segmentation of the video, because they mark all distinct frames that contain activities of interest. We call them elementary frames. The sequence of elementary frames is not necessarily continuous, because idle time is not annotated.

The MTurk sentences have endpoints that constitute a coarse-grained, noisy video segmentation, assuming that each sentence spans the time between the end of the previous sentence and its own ending point. We refine those noisy time frames to gold frames as shown in Fig. 1: Each elementary frame (l1-l5) is mapped to a sentence (s1-s3) if its noisy time frame covers at least half of the elementary frame. We define the final gold sentence frame then as the time span between the starting point of the first and the ending point of the last elementary frame.

Figure 1: Aligning action descriptions with the video.
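The refinement rule just described reduces to a few lines of code. The sketch below is a minimal re-implementation of that rule under the assumption that both frame lists are sorted and given as (start, end) pairs; it is not the released alignment code.

def align_sentences(elementary_frames, noisy_spans):
    """elementary_frames: sorted (start, end) pairs from the low-level annotation.
    noisy_spans: sorted (start, end) pairs, one per sentence, spanning from the
    end of the previous sentence to the sentence's own ending timestamp.
    Returns one gold (start, end) frame per sentence (None if nothing mapped)."""
    assigned = [[] for _ in noisy_spans]
    for ef_start, ef_end in elementary_frames:
        for i, (s_start, s_end) in enumerate(noisy_spans):
            overlap = min(ef_end, s_end) - max(ef_start, s_start)
            # map the elementary frame to the sentence whose noisy span
            # covers at least half of it
            if overlap >= 0.5 * (ef_end - ef_start):
                assigned[i].append((ef_start, ef_end))
                break
    # gold frame: start of the first and end of the last assigned elementary frame
    return [(frames[0][0], frames[-1][1]) if frames else None
            for frames in assigned]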
The alignment of descriptions with low-level activities results in a table as given in Fig. 3. Columns contain the textual descriptions of the videos; rows correspond to low-level actions, and each sentence is aligned with the last of its associated low-level actions. As a side effect, we also obtain multiple paraphrases for each sentence, by considering all sentences with the same associated time frame as equivalent realizations of the same action.

The corpus contains 17,334 action descriptions (tokens), realizing 11,796 different sentences (types). It consists of 146,771 words (tokens), 75,210 of which are content word instances (i.e. nouns, verbs and adjectives). The verb vocabulary comprises 28,292 verb tokens, realizing 435 lemmas. Since verbs occurring in the corpus typically describe actions, we can note that the linguistic variance for the 58 different low-level activities is quite large. Fig. 4 gives an impression of the action realizations in the corpus, listing the most frequent verbs from the textual data, and the most frequent low-level activities.

Top 10 Verbs       cut, take, get, put, wash, place, rinse, remove, *pan, peel
Top 10 Activities  move, take out, cut, wash, take apart, add, shake, screw, put in, peel

Figure 4: 10 most frequent verbs and low-level actions in the TACOS corpus. *pan is probably often mis-tagged.

On average, each description covers 2.7 low-level activities, which indicates a clear difference in granularity. 38% of the descriptions correspond to exactly one low-level activity, about a quarter (23%) covers two of them; 16% have 5 or more low-level elements, 2% more than 10. The corpus shows how humans vary the granularity of their descriptions, measured in time or number of low-level activities, and it shows how they vary the linguistic realization of the same action. For example, Fig. 3 contains dice and chop into small pieces as alternative realizations of the low-level activity sequence SLICE - SCRATCH OFF - SLICE.

The descriptions are of varying length (9 words on average), reaching from two-word phrases to detailed descriptions of 65 words. Most sentences are short, consisting of a reference to the person in the video, a participant and an action verb (The person rinses the carrot, He cuts off the two edges). People often specified an instrument (from the faucet), or the resulting state of the action (chop the carrots in small pieces). Occasionally, we find more complex constructions (support verbs, coordinations).

As Fig. 3 indicates, the timestamp-based alignment is pretty accurate; occasional errors occur, like He starts chopping the carrot ... in NL Sequence 3. The data contains some typos and ungrammatical sentences (He washed carrot), but for our own experiments, the small number of such errors did not lead to any processing problems.

4 The Action Similarity Dataset

In this section, we present a gold standard dataset as a basis for the evaluation of visually grounded models of action similarity. We call it the "Action Similarity Dataset" (ASim) in analogy to the Usage Similarity dataset (USim) of Erk et al. (2009) and Erk et al. (2012). Similarly to USim, ASim contains a collection of sentence pairs with numerical similarity scores assigned by human annotators. We asked the annotators to focus on the similarity of the activities described rather than on assessing semantic similarity in general. We use sentences from the TACOS corpus and record their timestamps. Thus each sentence comes with the video segment which it describes (these were not shown to the annotators).

4.1 Selecting action description pairs

Random selection of annotated sentences from the corpus would lead to a large majority of pairs which are completely dissimilar, or difficult to grade (e.g., He opens the drawer – The person cuts off the ends of the carrot). We constrained the selection process in two ways: First, we consider only sentences describing activities of manipulating an ingredient. The low-level annotation of the video corpus helps us identify candidate descriptions. We exclude rare and special activities, ending up with CUT, SLICE, CHOP, PEEL, TAKE APART, and WASH, which occur reasonably frequently, with a wide distribution over different scenarios. We restrict the candidate set to those sentences whose time span includes one of these activities. This results in a conceptually more focussed repertoire of descriptions, and at the same time admits full linguistic variation (wash an apple under the faucet – rinse an apple, slice the cucumber – cut the cucumber into slices).
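As an illustration of this first selection step, the sketch below filters gold-aligned sentences down to those whose time span contains one of the six target activities. It assumes the record layout from the earlier sketches and interprets "includes" as full containment of the low-level activity's time frame; it is not the authors' actual selection script.

TARGET_ACTIVITIES = {"CUT", "SLICE", "CHOP", "PEEL", "TAKE APART", "WASH"}

def candidate_sentences(aligned_sentences, low_level):
    """aligned_sentences: (sentence_text, (start, end)) pairs with gold frames.
    low_level: LowLevelActivity records for the same video.
    Keeps sentences whose gold time span includes a target activity."""
    candidates = []
    for text, (start, end) in aligned_sentences:
        covered = {a.label for a in low_level
                   if start <= a.start and a.end <= end}
        if covered & TARGET_ACTIVITIES:
            candidates.append(text)
    return candidates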
Figure 2: Corpus overview. Videos of basic kitchen tasks are paired with low-level annotations (timestamps, actions and objects; manual low-level annotation) and with natural language descriptions carrying the ending times of the actions (Mechanical Turk data collection), linked by the timestamp-based alignment.

Figure 3: Excerpt from the corpus for a video on PREPARING A CARROT. Example frames and the low-level annotation (Action and Participants) are shown along with three of the MTurk sequences (NL Sequence 1-3).
Second, we required the pairs to share some lexical material, either the head verb or the manipulated ingredient (or both). (We refer to the latter with the term object; we don't require the ingredient term to be the actual grammatical object in the action descriptions, but rather use "object" in its semantic role sense, as the entity affected by an action.) More precisely, we composed the ASim dataset from three different subsets:

Different activity, same object: This subset contains pairs describing different types of actions carried out on the same type of object (e.g. The man washes the carrot. – She dices the carrot.). Its focus is on the central task of modeling the semantic relation between actions (rather than the objects involved in the activity), since the object head nouns in the descriptions are the same, and the respective video segments show the same type of object.

Same activity, same object: Description pairs of this subset will in many cases, but not always, agree in their head verbs. The dataset is useful for exploring the degree to which action descriptions are underspecified with respect to the precise manner of their practical realization. For example, peeling an onion will mostly be done in a rather uniform way, while cut applied to carrot can mean that the carrot is chopped up, or sliced, or cut in halves.

Same activity & verb, different object: Description pairs in this subset share head verb and low-level activity, but have different objects (e.g. The man washes the carrot. – A girl washes an apple under the faucet.). This dataset enables the exploration of the objects' meaning contribution to the complete action, established by the variation of equivalent actions that are done to different objects.

We assembled 900 action description pairs for annotation: 480 pairs share the object, 240 of which have different activities, and the other 240 pairs share the same activity. We included paraphrases describing the same video segment, but we excluded pairs of identical sentences. 420 additional pairs share their head verb, but have different objects.

4.2 Manual annotation

Three native speakers of English were asked to judge the similarity of the action pairs with respect to how they are carried out, rating each sentence pair with a score from 1 (not similar at all) to 5 (the same or nearly the same). They did not see the respective videos, but we noted the relevant kitchen task (i.e. which vegetable was prepared). We asked the annotators explicitly to ignore the actor of the action (e.g. whether it is a man or a woman) and score the similarities of the underlying actions rather than their verbalizations. Each subject rated all 900 pairs, which were shown to them in completely random order, with a different order for each subject.

We compute inter-annotator agreement (and the forthcoming evaluation scores) using Spearman's rank correlation coefficient (ρ), a non-parametric test which is widely used for similar evaluation tasks (Mitchell and Lapata, 2008; Bruni et al., 2011; Erk and McCarthy, 2009). Spearman's ρ evaluates how the samples are ranked relative to each other rather than the numerical distance between the rankings.

Part of Gold Standard            Sim     σ      ρ
DIFF. ACTIVITY, SAME OBJECT     2.20   1.07   0.73
SAME ACTIVITY, SAME OBJECT      4.19   1.04   0.73
ALL WITH SAME OBJECT            3.20   1.44   0.84
SAME VERB, DIFF. OBJECT         3.34   0.69   0.43
COMPLETE DATASET                3.27   1.15   0.73

Figure 5: Average similarity ratings (Sim), their standard deviation (σ) and annotator agreement (ρ) for ASim.

Fig. 5 shows the average similarity ratings in the different settings and the inter-annotator agreement. The average inter-rater agreement was ρ = 0.73 (averaged over pairwise rater agreements), with pairwise results of ρ = 0.77, 0.72, and 0.69, respectively, which are all highly significant at p < 0.001. As expected, pairs with the same activity and object are rated very similar (4.19) on average, while the similarity of different activities on the same object is the lowest (2.2). For both subsets, inter-rater agreement is high (ρ = 0.73), and even higher for both SAME OBJECT subsets together (0.84). Pairs with identical head verbs and different objects have a small standard deviation, at 0.69. The inter-annotator agreement on this set is much lower than for pairs from the SAME OBJECT set. This indicates that similarity assessment for different variants of the same activity is a hard task even for humans.
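The agreement figures can be reproduced with a few lines once the three annotators' scores are stored as equal-length lists over the 900 pairs; the sketch below uses scipy's spearmanr, which is our choice of tool rather than necessarily the one used by the authors.

from itertools import combinations
from scipy.stats import spearmanr

def average_pairwise_agreement(ratings):
    """ratings: one list of similarity scores per annotator, all over the
    same items in the same order. Returns the mean pairwise Spearman rho."""
    rhos = []
    for scores_a, scores_b in combinations(ratings, 2):
        rho, _ = spearmanr(scores_a, scores_b)
        rhos.append(rho)
    return sum(rhos) / len(rhos)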
5 Models of Action Similarity

In the following, we demonstrate that visual information contained in videos of the kind provided by the TACOS corpus (Sec. 3) substantially contributes to the semantic modeling of action-denoting expressions. In Sec. 6, we evaluate several methods for predicting action similarity on the task provided by the ASim dataset. In this section, we describe the models considered in the evaluation. We use two different models based on visual information, and in addition two text-based models. We will also explore the effect of combining linguistic and visual information and investigate which mode is most suitable for which kinds of similarity.

5.1 Text-based models

We use two different models of textual similarity to predict action similarity: a simple word-overlap measure (Jaccard coefficient) and a state-of-the-art model based on "contextualized" vector representations of word meaning (Thater et al., 2011).

Jaccard coefficient. The Jaccard coefficient gives the ratio between the number of (distinct) words common to two input sentences and the total number of (distinct) words in the two sentences. Such simple surface-oriented measures of textual similarity are often used as baselines in related tasks such as recognizing textual entailment (Dagan et al., 2005) and are known to deliver relatively strong results.

Vector model. We use the vector model of Thater et al. (2011), which "contextualizes" vector representations for individual words based on the particular sentence context in which the target word occurs. The basic intuition behind this approach is that the words in the syntactic context of the target word in a given input sentence can be used to refine or disambiguate its vector. Intuitively, this allows us to discriminate between different actions that a verb can refer to, based on the different objects of the action.

We first experimented with a version of this vector model which predicts action similarity scores of two input sentences by computing the cosine similarity of the contextualized vectors of the verbs in the two sentences only. We achieved better performance with a variant of this model which computes vectors for the two sentences by summing over the contextualized vectors of all constituent content words. In the experiments reported below, we only use the second variant. We use the same experimental setup as Thater et al. (2011), as well as the parameter settings that are reported to work best in that paper.

5.2 Video-based models

We distinguish two approaches to compute the similarity between two video segments. In the first, unsupervised approach we extract a video descriptor and compute similarities between these raw features (Wang et al., 2011). The second approach builds upon the first by additionally learning higher level attribute classifiers (Rohrbach et al., 2012b) on a held out training set. The similarity between two segments is then computed between the classifier responses. In the following we detail both approaches.

Raw visual features. We use the state-of-the-art video descriptor Dense Trajectories (Wang et al., 2011), which extracts visual video features, namely histograms of oriented gradients, flow, and motion boundary histograms, around densely sampled and tracked points. This approach is especially suited for this data as it ignores non-moving parts in the video: we are interested in activities and manipulation of objects, and this type of feature implicitly uses only information in relevant image locations. For our setting this feature representation has been shown to be superior to human pose-based approaches (Rohrbach et al., 2012a). Using a bag-of-words representation we encode the features using a 16,000 dimensional codebook. Features and codebook are provided with the publicly available video dataset. We compute the similarity between two encoded features by computing the intersection of the two (normalized) histograms.
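Both the word-overlap baseline and the histogram intersection used for the raw video descriptors reduce to very small functions; the sketch below also includes the cosine used to compare vector representations. This is an illustrative reimplementation, not the evaluation code used in the paper.

import numpy as np

def jaccard(sentence_a, sentence_b):
    """Ratio of shared distinct words to all distinct words in both sentences."""
    words_a = set(sentence_a.lower().split())
    words_b = set(sentence_b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)

def histogram_intersection(hist_a, hist_b):
    """Similarity of two L1-normalized bag-of-words video descriptors."""
    hist_a = hist_a / hist_a.sum()
    hist_b = hist_b / hist_b.sum()
    return float(np.minimum(hist_a, hist_b).sum())

def cosine(vec_a, vec_b):
    """Cosine similarity between two vectors (sentence or classifier vectors)."""
    return float(vec_a @ vec_b / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))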
Visual classifiers. Visual raw features tend to have several dimensions in the feature space which provide unreliable, noisy values and thus degrade the strength of the similarity measure. Intermediate level attribute classifiers can learn which feature dimensions are distinctive and thus significantly improve performance over raw features. Rohrbach et al. (2012b) showed that using such an attribute classifier representation can significantly improve performance for composite activity recognition. The relevant attributes are all activities and objects annotated in the video data (cf. Section 3.1). For the experiments reported below we use the same setup as Rohrbach et al. (2012b) and use all videos in MPII Composites and MPII Cooking (Rohrbach et al., 2012a), excluding the 127 videos used during evaluation. The real-valued SVM-classifier output provides a confidence for how likely a certain attribute appeared in a given video segment. This results in a 218-dimensional vector of classifier outputs for each video segment. To compute the similarity between two vectors we compute the cosine between them.
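Conceptually, each of the 218 annotated activities and objects gets its own classifier over the bag-of-words descriptors, and a segment is then represented by the vector of real-valued classifier confidences. The sketch below illustrates this with scikit-learn's linear SVM; the original work uses the setup of Rohrbach et al. (2012b), so the specific classifier, its parameters, and the helper names here are our own assumptions.

import numpy as np
from sklearn.svm import LinearSVC

def train_attribute_classifiers(train_features, attribute_labels):
    """train_features: (n_segments, 16000) bag-of-words descriptors.
    attribute_labels: (n_segments, 218) binary matrix marking which
    annotated activities/objects occur in each training segment."""
    classifiers = []
    for k in range(attribute_labels.shape[1]):
        clf = LinearSVC(C=1.0)  # illustrative parameter choice
        clf.fit(train_features, attribute_labels[:, k])
        classifiers.append(clf)
    return classifiers

def attribute_vector(classifiers, segment_features):
    """Represent one video segment by its 218 real-valued SVM confidences."""
    segment_features = segment_features.reshape(1, -1)
    return np.array([clf.decision_function(segment_features)[0]
                     for clf in classifiers])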
6 Evaluation

We evaluate the different similarity models introduced in Sec. 5 by calculating their correlation with the gold-standard similarity annotations of ASim (cf. Sec. 4). For all correlations, we use Spearman's ρ as a measure. We consider the two textual measures (JACCARD and TEXTUAL VECTORS) and their combination, as well as the two visual models (VISUAL RAW VECTORS and VISUAL CLASSIFIER) and their combination. We also combined textual and visual features, in two variants: The first includes all models (ALL COMBINED), the second only the unsupervised components, omitting the visual classifier (ALL UNSUPERVISED). To combine multiple similarity measures, we simply average their normalized scores (using z-scores).
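The combination scheme is a plain average of z-normalized scores. A minimal sketch, assuming each measure's scores are given as an array over the same ordered list of sentence pairs:

import numpy as np

def combine_measures(score_lists):
    """score_lists: one array of per-pair similarity scores per measure.
    Each measure is z-normalized, then the normalized scores are averaged."""
    z_scores = []
    for scores in score_lists:
        scores = np.asarray(scores, dtype=float)
        z_scores.append((scores - scores.mean()) / scores.std())
    return np.mean(z_scores, axis=0)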
Figure 6 shows the scores for all of these measures on the complete ASim dataset (OVERALL), along with the two subparts, where description pairs share either the object (SAME OBJECT) or the head verb (SAME VERB). In addition to the model results, the table also shows the average human inter-annotator agreement as UPPER BOUND.

MODEL                         SAME OBJECT   SAME VERB   OVERALL
TEXT   JACCARD                       0.28        0.25      0.25
       TEXTUAL VECTORS               0.30        0.25      0.27
       TEXT COMBINED                 0.39        0.35      0.36
VIDEO  VISUAL RAW VECTORS            0.53       -0.08      0.35
       VISUAL CLASSIFIER             0.60        0.03      0.44
       VIDEO COMBINED                0.61       -0.04      0.44
MIX    ALL UNSUPERVISED              0.58        0.32      0.48
       ALL COMBINED                  0.67        0.28      0.55
       UPPER BOUND                   0.84        0.43      0.73

Figure 6: Evaluation results in Spearman's ρ. All values > 0.11 are significant at p < 0.001.

On the complete set, both visual and textual measures have a highly significant correlation with the gold standard, whereas the combination of both clearly leads to the best performance (0.55). The results on the SAME OBJECT and SAME VERB subsets shed light on the division of labor between the two information sources. While the textual measures show a comparable performance over the two subsets, there is a dramatic difference in the contribution of visual information: On the SAME OBJECT set, the visual models clearly outperform the textual ones, whereas the visual information has no positive effect on the SAME VERB set. This is clear evidence that the visual model does not capture the similarity of the participating objects but rather genuine action similarity, which the visual features (Wang et al., 2011) we employ were designed for. A direction for future work is to learn dedicated visual object detectors to recognize and capture similarities between objects more precisely.

The numbers shown in Figure 7 support this hypothesis, showing the two groups in the SAME OBJECT class: For sentence pairs that share the same activity, the textual models seem to be much more suitable than the visual ones. In general, visual models perform better on actions with different activity types, textual models on closely related activities.

MODEL (SAME OBJECT)           same action   diff. action
TEXT   JACCARD                       0.44           0.14
       TEXT VECTORS                  0.42           0.05
       TEXT COMBINED                 0.52           0.14
VIDEO  VIS. RAW VECTORS              0.21           0.23
       VIS. CLASSIFIER               0.21           0.45
       VIDEO COMBINED                0.26           0.38
MIX    ALL UNSUPERVISED              0.49           0.24
       ALL COMBINED                  0.48           0.41
       UPPER BOUND                   0.73           0.73

Figure 7: Results for sentences with the same object, with either the same or different low-level activity.

Overall, the supervised classifier contributes a good part to the final results. However, the supervision is not strictly necessary to arrive at a significant correlation; the raw visual features alone are sufficient for the main performance gain seen with the integration of visual information.

7 Conclusion

We presented the TACOS corpus, which provides coherent textual descriptions for high-quality video recordings, plus accurate alignments of text and video on the sentence level. We expect the corpus to be beneficial for a variety of research activities in natural-language and visual processing.

In this paper, we focused on the task of grounding the meaning of action verbs and phrases. We designed the ASim dataset as a gold standard and evaluated several text- and video-based semantic similarity models on the dataset, both individually and in different combinations.

We are the first to provide semantic models for action-describing expressions which are based on information extracted from videos. Our experimental results show that these models are of considerable quality, and that predictions based on a combination of visual and textual information even approach the upper bound given by the agreement of human annotators.

In this work we used existing similarity models that had been developed for different applications. We applied these models without any special training or optimization for the current task, and we combined them in the most straightforward way. There is room for improvement by tuning the models to the task, or by using more sophisticated approaches to combine modality-specific information (Silberer and Lapata, 2012).

We built our work on an existing corpus of high-quality video material, which is restricted to the cooking domain. As a consequence, the corpus covers only a limited inventory of activity types and action verbs. Note, however, that our models are fully unsupervised (except the Visual Classifier model), and thus can be applied without modification to arbitrary domains and action verbs, given that they are about observable activities. Also, corpora containing information comparable to the TACOS corpus but with wider coverage (and perhaps a bit noisier) can be obtained with a moderate amount of effort. One needs videos of reasonable quality and some sort of alignment with action descriptions. In some cases such alignments even come for free, e.g. via subtitles, or descriptions of short video clips that depict just a single action.

For future work, we will further investigate the compositionality of action-describing phrases. We also want to leverage the multimodal information provided by the TACOS corpus for the improvement of high-level video understanding, as well as for generation of natural-language text from videos.

The TACOS corpus and all other data described in this paper (videos, low-level annotation, aligned textual descriptions, the ASim dataset and visual features) are publicly available at http://www.coli.uni-saarland.de/projects/smile/page.php?id=tacos.

Acknowledgements

We'd like to thank Asad Sayeed, Alexis Palmer and Prashant Rao for their help with the annotations. We're indebted to Carl Vondrick and Marco Antonio Valenzuela Escárcega for their extensive support with the video annotation tool. Further we thank Alexis Palmer and in particular three anonymous reviewers for their helpful comments on this paper. This work was funded by the Cluster of Excellence "Multimodal Computing and Interaction" of the German Excellence Initiative and the DFG project SCHI 989/2-2.
References

Luis von Ahn and Laura Dabbish. 2004. Labeling images with a computer game. In Proceedings of SIGCHI 2004.

Elia Bruni, Giang Binh Tran, and Marco Baroni. 2011. Distributional semantics from text and images. In Proceedings of GEMS 2011.

David L. Chen and William B. Dolan. 2011. Collecting highly parallel data for paraphrase evaluation. In Proceedings of ACL 2011.

Timothee Cour, Chris Jordan, Eleni Miltsakaki, and Ben Taskar. 2008. Movie/script: Alignment and parsing of video and text transcription. In Computer Vision – ECCV 2008, volume 5305 of Lecture Notes in Computer Science, pages 158–171. Springer Berlin Heidelberg.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Proceedings of MLCW 2005.

Jesse Dodge, Amit Goyal, Xufeng Han, Alyssa Mensch, Margaret Mitchell, Karl Stratos, Kota Yamaguchi, Yejin Choi, Hal Daumé III, Alexander C. Berg, and Tamara L. Berg. 2012. Detecting visual text. In HLT-NAACL, pages 762–772.

Katrin Erk and Diana McCarthy. 2009. Graded word sense assignment. In Proceedings of EMNLP 2009.

Katrin Erk, Diana McCarthy, and Nicholas Gaylord. 2009. Investigations on word senses and word usages. In Proceedings of ACL/AFNLP 2009.

Katrin Erk, Diana McCarthy, and Nick Gaylord. 2012. Measuring word meaning in context. CL.

Yansong Feng and Mirella Lapata. 2010. Visual information in semantic representation. In Proceedings of HLT-NAACL 2010.

A. M. Glenberg. 2002. Grounding language in action. Psychonomic Bulletin & Review.

Sonal Gupta and Raymond J. Mooney. 2010. Using closed captions as supervision for video activity recognition. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-2010), pages 1083–1088, Atlanta, GA, July.

Abhinav Gupta, Praveen Srinivasan, Jianbo Shi, and Larry S. Davis. 2009. Understanding videos, constructing plots: learning a visually grounded storyline model from annotated videos. In Proceedings of CVPR 2009.

Steve R. Howell, Damian Jankowicz, and Suzanna Becker. 2005. A model of grounded language acquisition: Sensorimotor features improve lexical and grammatical learning. JML.

S. Mathe, A. Fazly, S. Dickinson, and S. Stevenson. 2008. Learning the abstract motion semantics of verbs from captioned videos. Pages 1–8.

Jeff Mitchell and Mirella Lapata. 2008. Vector-based models of semantic composition. In Proceedings of ACL 2008.

Tanvi S. Motwani and Raymond J. Mooney. 2012. Improving video activity recognition using object recognition and text mining. In Proceedings of the 20th European Conference on Artificial Intelligence (ECAI-2012), pages 600–605, August.

Jeff Orkin and Deb Roy. 2009. Automatic learning and generation of social behavior from collective human gameplay. In Proceedings of AAMAS 2009.

Hilke Reckman, Jeff Orkin, and Deb Roy. 2011. Extracting aspects of determiner meaning from dialogue in a virtual world environment. In Proceedings of CCS 2011, IWCS '11.

Marcus Rohrbach, Sikandar Amin, Mykhaylo Andriluka, and Bernt Schiele. 2012a. A database for fine grained activity detection of cooking activities. In Proceedings of CVPR 2012.

Marcus Rohrbach, Michaela Regneri, Micha Andriluka, Sikandar Amin, Manfred Pinkal, and Bernt Schiele. 2012b. Script data for attribute-based recognition of composite activities. In Proceedings of ECCV 2012.

Carina Silberer and Mirella Lapata. 2012. Grounded models of semantic representation. In Proceedings of EMNLP-CoNLL 2012.

Mark Steyvers. 2010. Combining feature norms and text data with topic models. Acta Psychologica, 133(3):234–243.

Stefan Thater, Hagen Fürstenau, and Manfred Pinkal. 2011. Word meaning in context: A simple and effective vector model. In Proceedings of IJCNLP 2011.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models for semantics. JAIR.

E. Tzoukermann, J. Neumann, J. Kosecka, C. Fermuller, I. Perera, F. Ferraro, B. Sapp, R. Chaudhry, and G. Singh. 2011. Language models for semantic extraction and filtering in video action recognition. In AAAI Workshop on Language-Action Tools for Cognitive Artificial Agents.

Carl Vondrick, Donald Patterson, and Deva Ramanan. 2012. Efficiently scaling up crowdsourced video annotation. IJCV.

Heng Wang, Alexander Kläser, Cordelia Schmid, and Cheng-Lin Liu. 2011. Action recognition by dense trajectories. In Proceedings of CVPR 2011.