HAL Id: inria-00459653
https://hal.inria.fr/inria-00459653
Submitted on 24 Feb 2010
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
A Survey of Vision-Based Methods for Action Representation, Segmentation and Recognition
Daniel Weinland, Rémi Ronfard, Edmond Boyer
To cite this version:
Daniel Weinland, Rémi Ronfard, Edmond Boyer. A Survey of Vision-Based Methods for Action Representation, Segmentation and Recognition. [Research Report] RR-7212, INRIA. 2010, pp.54.
⟨inria-00459653⟩
ISSN 0249-6399 ISRN INRIA/RR--7212--FR+ENG
Vision, Perception and Multimedia Understanding
A Survey of Vision-Based Methods for Action Representation, Segmentation and Recognition
Daniel Weinland — Remi Ronfard — Edmond Boyer
No. 7212
February 2010
Centre de recherche INRIA Grenoble – Rhône-Alpes
Daniel Weinland∗, Rémi Ronfard†, Edmond Boyer‡
Theme: Vision, Perception and Multimedia Understanding
Perception, Cognition, Interaction
Project-Teams Lear and Perception
Research Report No. 7212, February 2010, 54 pages
Abstract: Action recognition has become a very important topic in computer vision, with many fundamental applications in robotics, video surveillance, human-computer interaction, and multimedia retrieval, among others. The number of published works is steadily increasing, and action recognition is now represented by numerous publications at recent conferences. A large variety of approaches have been described. The purpose of this survey is to give an overview and categorization of the approaches used. We concentrate on approaches that aim at the classification of full-body motions, such as kicking, punching, waving, etc., and we categorize them according to how they represent the spatial and temporal structure of actions; how they segment actions from an input stream of visual data; and how they learn a view-invariant representation of actions.
Key-words: computer vision, action recognition
∗ Deutsche Telekom Laboratories, TU Berlin, Germany
† INRIA Team Lear, Grenoble, France
‡ INRIA Team Perception, Grenoble, France
Résumé: Action recognition is an important problem in computer vision, with many fundamental applications in robotics, video surveillance, human-machine interaction and multimedia indexing, among others. The number of publications on this subject is steadily increasing at the conferences of the field. A large variety of approaches have been described. The goal of this report is to review the state of the art of the field and to propose a classification of the approaches used. We focus on the problem of classifying actions that involve the whole body, such as sitting down, standing up, clapping hands, kicking or punching, etc. We classify the different approaches to the problem according to the spatial and temporal representations they give of actions; the way in which they allow actions to be segmented in a continuous visual stream; and their ability to learn viewpoint-independent models.
Keywords: action recognition, computer vision
Contents
1 Introduction
2 Spatial Action Representations
  2.1 Body models
  2.2 Image models
  2.3 Sparse features
3 Temporal Action Representations
  3.1 Grammars
  3.2 Templates
  3.3 Keyframes
4 Action Segmentation
  4.1 Boundary Detection
  4.2 Sliding Windows
  4.3 Higher-Level Grammars
  4.4 Action primitives
5 View-Independent Action Recognition
  5.1 Normalization
    5.1.1 Normalization in 2D
    5.1.2 Normalization in 3D
  5.2 View Invariance
    5.2.1 View Invariance in 2D
    5.2.2 View Invariance in 3D
  5.3 Exhaustive Search
    5.3.1 Exhaustive Search using Multiple 2D Views
    5.3.2 Exhaustive Search using a 3D Model
6 Datasets
  6.1 The KTH Dataset
  6.2 The Weizmann dataset
  6.3 The IXMAS dataset
  6.4 Other datasets
7 Conclusion
1 Introduction
Action recognition is a very active research topic in computer vision with many important applications, including human-computer interfaces, content-based video indexing, full-video search, video surveillance, robotics, and programming by demonstration, among others. Historically, visual action recognition has been divided into sub-topics such as gesture recognition (especially hand gestures) for human-computer interfaces [36, 122], facial expression recognition [204], and movement behavior recognition for video surveillance [66]. However, full-body actions usually include different motions and require a unified approach for recognition, encompassing facial actions, hand actions and feet actions.
Action recognition is the process of naming actions, usually in the simple form of an action verb, using sensory observations. Technically, an action is a sequence of movements generated by a human agent during the performance of a task. As such, it is a four-dimensional object, which may be further decomposed into spatial and temporal parts. In this paper, we are only concerned with visual observations, typically by means of one or more video cameras, but it should be noted that actions can of course also be recognized from other sensory channels, including audio. An action label is a name such that an average human agent can understand and perform the named action. The task of action recognition is to name actions, i.e. determine the action label that best describes an action instance, even when performed by different agents under different viewpoints, and in spite of large differences in manner and speed. A typical set-up for testing and evaluating action recognition systems consists in sending instructions to the actors, using simple action verb imperatives, and comparing them with the recognized action names.
To reach that goal, the various approaches typically employ a combination of vision and machine learning tools. Vision techniques attempt to extract action-discriminative features from the video sequences, while providing appropriate robustness to distracting cues. Machine learning attempts to learn statistical models from those features, and to classify new features based on the learned models. Two issues of particular importance are dealing with changing viewpoints and segmenting the observed motions into semantically meaningful instances of actions.
Note that our definition of an action is more restrictive than the one proposed by Pinhanez [127] when he states that actions are sequences of movements performed in a given context (action = context + movement), with the example of typing or playing piano, which involve the same quick movements of the fingers in the different contexts of a computer desk or a concert hall. For our purpose, they are one and the same action of quickly moving one's fingers, and this action can be executed as part of different tasks, such as playing the piano or typing. The importance of context for visual action recognition is the focus of an excellent recent survey on the meaning of action [85]. Here, we concentrate on the structure of action by reviewing vision-based techniques that can be used for analyzing, segmenting and classifying movements in order to recognize actions independently of the task and context in which they are performed.
Generic action recognition has already been surveyed in [25, 2, 49, 106, 107] in the context of motion capture and body tracking, and in [66] in the context of surveillance. High-level analysis of activities was recently surveyed in [174]. In contrast, our survey focuses exclusively on action recognition, and it is the first work investigating the three related issues of representing, segmenting and recognizing actions.
Figure 1: A typical data flow for a generic action recognition system comprises inter-dependent stages of feature extraction, learning, segmentation and classification.
Figure 1 illustrates the major components of a generic action recognition system and their typical arrangement.
Feature extraction is the main vision task in action recognition and consists in extracting posture and motion cues from the video that are discriminative with respect to human actions. Very different representations can be used, ranging from complex body models to simple silhouette images. In either case, issues such as person location, robustness to partial occlusion, background clutter, shadows and different illumination need to be addressed. Furthermore, representations should provide some insensitivity to different types of clothing and physiques.
Action learning and classification are the steps of learning statistical models from the extracted features, and using those models to classify new feature observations. A major challenge is to deal with the large variability that an action class can exhibit, in particular if performed by different subjects of different gender and size, and with different speed and style. Action categories which might seem clearly defined to us, such as kicking, punching, or waving, can have very large variability when performed in practice. It is thus a particular challenge to design an action model which identifies for each action the characteristic attitudes, while maintaining appropriate adaptability to all forms of variation.
Action segmentation is necessary to cut streams of motions into single action instances that are consistent with the set of initial training sequences used to learn the models. Closely related are the questions: how to choose such initial segmentations; and is there something like an elementary vocabulary of primitive motions in action articulation and perception?
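The data flow of Figure 1 can be illustrated with a deliberately small sketch. Everything in it is an assumption made for illustration and not a method from the surveyed literature: per-frame features are plain lists of numbers, a clip is summarized by the mean of its frame features, learning stores one mean descriptor per action label, segmentation uses fixed-length sliding windows, and classification is nearest-mean. The names `clip_descriptor`, `learn_models`, `sliding_windows` and `classify` are hypothetical.

```python
from math import dist


def clip_descriptor(frames):
    """Summarize a clip (list of per-frame feature vectors) by its mean vector."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def learn_models(training_clips):
    """Learning: store one mean descriptor per action label.

    training_clips maps a label to a list of clips (assumed toy format).
    """
    models = {}
    for label, clips in training_clips.items():
        descriptors = [clip_descriptor(clip) for clip in clips]
        models[label] = clip_descriptor(descriptors)  # mean of clip descriptors
    return models


def sliding_windows(stream, length, step):
    """Segmentation: cut a frame stream into fixed-length candidate segments."""
    for start in range(0, len(stream) - length + 1, step):
        yield start, stream[start:start + length]


def classify(segment, models):
    """Classification: return the label whose model is nearest to the segment."""
    descriptor = clip_descriptor(segment)
    return min(models, key=lambda label: dist(descriptor, models[label]))


# Toy 2-D "features" (hypothetical values, not real image measurements).
training = {
    "wave": [[[1.0, 0.0], [0.9, 0.1]], [[1.1, 0.0], [1.0, 0.1]]],
    "kick": [[[0.0, 1.0], [0.1, 0.9]], [[0.0, 1.1], [0.1, 1.0]]],
}
models = learn_models(training)

# An unsegmented stream: two frames resembling "wave", then two resembling "kick".
stream = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.9]]
labels = [(start, classify(segment, models))
          for start, segment in sliding_windows(stream, length=2, step=2)]
print(labels)
```

A mean descriptor discards all temporal ordering, so this sketch corresponds only to the simplest end of the design space; it is meant to show how the stages connect, not how any of them should be implemented.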
Vision-based techniques for representing, segmenting and recognizing human actions can be classified according to many different criteria, e.g. the body parts involved (facial expressions, hand gestures, upper-body gestures, full-body gestures, etc.); the selected image features (interest points, landmarks, edges, optical flow, etc.); the class of statistical models used for learning and recognition (nearest neighbors, discriminant analysis, Markov models, Bayesian networks, conditional random fields, etc.). The classification we have found to be the most useful is how the different methods proposed in the literature represent the spatial and temporal structure of actions. Indeed, our analysis of the recent literature in computer vision reveals a large variety of approaches in both the temporal and the spatial dimensions, which can be summarized as follows. In the spatial domain, action recognition can be based on global image features, aligned to the geometry of the scene or camera; or on parametric image features, aligned to the geometry of the human body; or on local image features, without structure. We review those three important classes in Section 2. In the temporal domain, action recognition can be based on global temporal signatures, such as stacked features, that represent an entire action from start to finish; or on grammatical models that represent how the moments of actions are organized sequentially, usually with several states and transitions between those states; or on sparse and unstructured observations, such as isolated key-frames. We review those three important classes in Section 3. By combining the three main spatial classes with the three main temporal classes, we end up with a synoptic classification of action recognition into nine basic classes, shown in Table 1.
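The nine classes are simply the cross product of the three spatial and the three temporal classes. The short Python sketch below enumerates them; the cell names are those of Table 1, while the key strings, the dictionary layout and the variable names are illustrative assumptions.

```python
from itertools import product

# Spatial classes (Section 2) and temporal classes (Section 3).
# The key strings are shorthand chosen for this sketch.
SPATIAL = ["body model", "image model", "spatial bag of features"]
TEMPORAL = ["grammar", "template", "bag of features"]

# Cell names as given in Table 1, indexed by (spatial, temporal).
CELL_NAMES = {
    ("body model", "grammar"): "Body Grammar",
    ("body model", "template"): "Body Template",
    ("body model", "bag of features"): "Bag of Postures",
    ("image model", "grammar"): "Image Grammar",
    ("image model", "template"): "Image Template",
    ("image model", "bag of features"): "Bag of Keyframes",
    ("spatial bag of features", "grammar"): "Feature Grammar",
    ("spatial bag of features", "template"): "Feature Template",
    ("spatial bag of features", "bag of features"): "Bag of ST-Features",
}

# Enumerate every (spatial, temporal) combination with its cell name.
taxonomy = [(s, t, CELL_NAMES[(s, t)]) for s, t in product(SPATIAL, TEMPORAL)]

for spatial, temporal, name in taxonomy:
    print(f"{spatial:>24} x {temporal:<16} -> {name}")
```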
Additional difficulties are introduced when we allow actions to be observed from different and changing views. In such unconstrained realistic settings, a single pose or motion can result in an almost infinite number of possible observations. An appropriate representation thus needs to account for such changes. To this aim, view-independent approaches have been introduced. Because of the importance of that issue and because of the large variety of different approaches that have been proposed, we discuss those approaches in a separate section.
The paper is therefore organized as follows. First, we present a general overview of action recognition methods, based on how they represent the spatial structure of actions in Section 2, and the temporal structure of actions in Section 3. Then, we review the special topics of action segmentation in Section 4 and view-invariant action recognition in Section 5. We close this survey with a discussion of available datasets and experimental evaluation.
2 Spatial Action Representations
We begin this survey with a review of the spatial representations used to discriminate actions from visual data. As mentioned previously, a first step in action recognition is the extraction of image features that are discriminative with respect to posture and motion of the human body. Various representations have been suggested. They mainly differ in the amount of high-level information they represent versus how efficiently they can be extracted in practice. For the purpose of this survey, we classify them into three main groups: body models, image models, and unstructured features. Body models are based on a parametric representation of the human body recovered from images using body-part detection and tracking. Image models are based on dense image features computed over a regular grid. Sparse features are based on sparse image features computed at specifically detected interest regions and loosely organized into a spatial bag-of-features.
Table 1: Classification of Action Recognition Methods based on Spatial (vertical axis) and Temporal Representations (horizontal axis). Only some of the more recent approaches are listed in each cell.

Temporal classes (columns): Parametric, Action Grammar | Global, Action Template | Local, Bag of Features

Spatial class (row): Parametric, Body Model
  Body Grammar, e.g.: Wang [183], Kojima [84], Zhao [203], Park [121], Ramanan [133], Green [54], Nguyen [111], Guerra-Filho [57], Parameswaran [118], Peursum [125], Kitani [82], Lv [99], Wang [184], Ali [4], Ikizler [67], Morency [108]
  Body Template, e.g.: Guo [58], Niyogi [114], Gavrila [47], Seitz [153], Yacoob [195], Ben-Arie [8], Rao [136], Gritai [55], Alon [5], Sheikh [156], Yilmaz [200], Shen [157]
  Bag of Postures, e.g.:

Spatial class (row): Global, Image Model
  Image Grammar, e.g.: Brand [17], Elgammal [35], Cuzzolin [28], Ogale [116], Robertson [138], Sminchisescu [161], Ahmad [3], Lv [100], Turaga [176], Weinland [187], Natarajan [110], Vitaladevuni [180]
  Image Template, e.g.: Pierobon [126], Roh [141], Weinland [190], Kim [81], Laptev [89], Meng [105], Wang [181], Farhadi [37], Fathi [39], Holte [64], Jia [72], Jiang [73], Junejo [76], Rodriguez [139], Souvenir [164], Yan [197]
  Bag of Keyframes, e.g.: Carlsson [24], Efros [33], Jhuang [71], Thurau [169], Wang [185], Schindler [149], Weinland [186], Zhang [202]

Spatial class (row): Local, Spatial Bag of Features
  Feature Grammar, e.g.: Shi [158]
  Feature Template, e.g.: Laptev [86], Ke [80]
  Bag of ST-Features, e.g.: Schuldt [151], Boiman [16], Dollar [32], Niebles [112], Ikizler [68], Niebles [113], Nowozin [115], Scovanner [152], Wong [193], Filipovych [42], Gilbert [51], Klaser [83], Laptev [87], Liu [96]
Figure 2: Illustration of moving light displays, taken from [74]. Johansson showed that humans can recognize actions merely from the motion of a few light displays attached to the human body. (Awaiting publisher permission.)
2.1 Body models
In this section, we review methods that represent the spatial structure of actions with reference to the human body. In each frame of the observed video stream, the pose of a human body is recovered from a variety of available image features, and action recognition is performed based on such pose estimates. This is an intuitive and biologically plausible approach to action recognition, which is supported by psychophysical work on visual interpretation of biological motion [74].
Johansson showed that humans can recognize actions merely from the motion of a few moving light displays (MLD) attached to the human body (Figure 2). Over several decades his experiments inspired approaches in action recognition that used similar representations based on the motion of landmark points on the human body. His experiments were also the origin of the unresolved controversy on whether humans actually recognize actions directly from 2D motion patterns, or whether they first compute a 3D reconstruction from the motion of the patterns. The observation that upside-down recordings of MLDs are usually not recognized by humans can be interpreted as evidence for the presence of a strong prior model in human perception [166, 52], i.e. humans expect people to walk upright and cannot easily adapt to strong transformations.
In the context of machine vision, both approaches have been advocated, resulting in two main classes of methods [107]: 1) recognition by reconstruction of 3D body models and 2) direct recognition from 2D body models.
Recognition by reconstruction divides the task of action recognition into two well-separated stages: a motion capture stage, which estimates a 3D model of the human body, typically represented as a kinematic joint model; and an action recognition stage, which operates on joint trajectories. Two major difficulties are the large number of degrees of freedom of the human body and the high variability of body shapes. As a result, a parametric model of the human body must be carefully selected and calibrated to support action recognition and generalization. A large variety of parametric models have been proposed over the years and we can only mention some of them. See Figure 3 for some examples.