A Survey of Vision-Based Methods for Action Representation, Segmentation and Recognition


HAL Id: inria-00459653

https://hal.inria.fr/inria-00459653

Submitted on 24 Feb 2010

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.



Daniel Weinland, Rémi Ronfard, Edmond Boyer

To cite this version:

Daniel Weinland, Rémi Ronfard, Edmond Boyer. A Survey of Vision-Based Methods for Action Representation, Segmentation and Recognition. [Research Report] RR-7212, INRIA. 2010, pp.54.

⟨inria-00459653⟩

Research Report

ISSN 0249-6399 ISRN INRIA/RR--7212--FR+ENG

Vision, Perception and Multimedia Understanding


N° 7212

February 2010

Centre de recherche INRIA Grenoble – Rhône-Alpes

Daniel Weinland∗, Remi Ronfard†, Edmond Boyer‡

Theme: Vision, Perception and Multimedia Understanding

Perception, Cognition, Interaction

Project-Teams Lear and Perception

Research Report No. 7212, February 2010, 54 pages

Abstract: Action recognition has become a very important topic in computer vision, with many fundamental applications in robotics, video surveillance, human-computer interaction, and multimedia retrieval, among others. The number of published works is steadily increasing, and action recognition is now represented by numerous publications at recent conferences. A large variety of approaches have been described. The purpose of this survey is to give an overview and categorization of the approaches used. We concentrate on approaches that aim at the classification of full-body motions, such as kicking, punching, waving, etc., and we categorize them according to how they represent the spatial and temporal structure of actions; how they segment actions from an input stream of visual data; and how they learn a view-invariant representation of actions.

Key-words: computer vision, action recognition

∗ Deutsche Telekom Laboratories, TU Berlin, Germany

† INRIA Team Lear, Grenoble, France

‡ INRIA Team Perception, Grenoble, France

Summary (translated from the French résumé): Action recognition is an important problem in computer vision, with many fundamental applications in robotics, video surveillance, human-machine interaction and multimedia indexing, among others. The number of publications on this subject is steadily increasing at the conferences of the field. A large variety of approaches have been described. The goal of this report is to survey the state of the art and to propose a classification of the approaches used. We focus on the problem of classifying actions that involve the whole body, such as sitting down, standing up, clapping hands, kicking or punching, etc. We classify the different approaches according to the spatial and temporal representations they give of actions; the way they allow actions to be segmented from a continuous visual stream; and their ability to learn viewpoint-independent models.

Keywords (French version): action recognition, computer vision


Contents

1 Introduction

2 Spatial Action Representations
  2.1 Body models
  2.2 Image models
  2.3 Sparse features

3 Temporal Action Representations
  3.1 Grammars
  3.2 Templates
  3.3 Keyframes

4 Action Segmentation
  4.1 Boundary Detection
  4.2 Sliding Windows
  4.3 Higher-Level Grammars
  4.4 Action primitives

5 View-Independent Action Recognition
  5.1 Normalization
    5.1.1 Normalization in 2D
    5.1.2 Normalization in 3D
  5.2 View Invariance
    5.2.1 View Invariance in 2D
    5.2.2 View Invariance in 3D
  5.3 Exhaustive Search
    5.3.1 Exhaustive Search using Multiple 2D Views
    5.3.2 Exhaustive Search using a 3D Model

6 Datasets
  6.1 The KTH Dataset
  6.2 The Weizmann dataset
  6.3 The IXMAS dataset
  6.4 Other datasets

7 Conclusion

1 Introduction

Action recognition is a very active research topic in computer vision with many important applications, including human-computer interfaces, content-based video indexing, full-video search, video surveillance, robotics, and programming by demonstration, among others. Historically, visual action recognition has been divided into sub-topics such as gesture recognition (especially hand gestures) for human-computer interfaces [36, 122], facial expression recognition [204], and movement behavior recognition for video surveillance [66]. However, full-body actions usually include different motions and require a unified approach for recognition, encompassing facial actions, hand actions and feet actions.

Action recognition is the process of naming actions, usually in the simple form of an action verb, using sensory observations. Technically, an action is a sequence of movements generated by a human agent during the performance of a task. As such, it is a four-dimensional object, which may be further decomposed into spatial and temporal parts. In this paper, we are only concerned with visual observations, typically by means of one or more video cameras, but it should be noted that actions can of course also be recognized from other sensory channels, including audio. An action label is a name such that an average human agent can understand and perform the named action. The task of action recognition is to name actions, i.e. to determine the action label that best describes an action instance, even when performed by different agents under different viewpoints, and in spite of large differences in manner and speed. A typical set-up for testing and evaluating action recognition systems consists of sending instructions to the actors, using simple action-verb imperatives, and comparing those instructions with the recognized action names.

To reach that goal, the various approaches typically employ a combination of vision and machine learning tools. Vision techniques attempt to extract action-discriminative features from the video sequences, while providing appropriate robustness to distracting cues. Machine learning attempts to learn statistical models from those features, and to classify new features based on the learned models. Two issues of particular importance are dealing with changing viewpoints and segmenting the observed motions into semantically meaningful instances of actions.

Note that our definition of an action is more restrictive than the one proposed by Pinhanez [127], who states that actions are sequences of movements performed in a given context (action = context + movement), with the example of typing or playing piano, which involve the same quick movements of the fingers in the different contexts of a computer desk or a concert hall. For our purpose, they are one and the same action of quickly moving one's fingers, and this action can be executed as part of different tasks, such as playing the piano or typing. The importance of context for visual action recognition is the focus of an excellent recent survey on the meaning of action [85]. Here, we concentrate on the structure of action by reviewing vision-based techniques that can be used for analyzing, segmenting and classifying movements in order to recognize actions independently of the task and context in which they are performed.

Generic action recognition has already been surveyed in [25, 2, 49, 106, 107] in the context of motion capture and body tracking, and in [66] in the context of surveillance. High-level analysis of activities was recently surveyed in [174]. In contrast, our survey focuses exclusively on action recognition, and it is the first work investigating the three related issues of representing, segmenting and recognizing actions.

Figure 1: A typical data flow for a generic action recognition system comprises inter-dependent stages of feature extraction, learning, segmentation and classification.

Figure 1 illustrates the major components of a generic action recognition system and their typical arrangement.

Feature extraction is the main vision task in action recognition and consists of extracting posture and motion cues from the video that are discriminative with respect to human actions. Very different representations can be used, ranging from complex body models to simple silhouette images. In either case, issues such as person location, robustness to partial occlusion, background clutter, shadows and different illumination need to be addressed. Furthermore, representations should provide some insensitivity to different types of clothing and physiques.

Action learning and classification are the steps of learning statistical models from the extracted features, and using those models to classify new feature observations. A major challenge is to deal with the large variability that an action class can exhibit, in particular if performed by different subjects of different gender and size, and with different speed and style. Action categories which might seem clearly defined to us, such as kicking, punching, or waving, can have very large variability when performed in practice. It is thus a particular challenge to design an action model which identifies for each action the characteristic attitudes, while maintaining appropriate adaptability to all forms of variation.
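As a rough illustration of the feature extraction and learning/classification stages described above, the following minimal sketch assumes pre-segmented clips given as lists of frames, each frame being a flat list of intensities. The two features (mean frame difference and mean intensity) and the nearest-centroid classifier are stand-ins chosen for brevity, not methods taken from the surveyed literature, and all function names are hypothetical.

```python
from statistics import mean


def extract_features(frames):
    """Toy feature vector: (mean absolute frame difference, mean intensity).

    `frames` is a non-empty list of equally sized lists of numbers.
    """
    diffs = [
        mean(abs(a - b) for a, b in zip(prev, cur))
        for prev, cur in zip(frames, frames[1:])
    ]
    motion = mean(diffs) if diffs else 0.0
    intensity = mean(mean(frame) for frame in frames)
    return (motion, intensity)


def learn(training_set):
    """Nearest-centroid 'model': one mean feature vector per action label.

    `training_set` is a list of (frames, label) pairs.
    """
    by_label = {}
    for frames, label in training_set:
        by_label.setdefault(label, []).append(extract_features(frames))
    return {
        label: tuple(mean(dim) for dim in zip(*feats))
        for label, feats in by_label.items()
    }


def classify(model, frames):
    """Return the label whose centroid is closest to the clip's features."""
    feat = extract_features(frames)

    def sq_dist(centroid):
        return sum((x - y) ** 2 for x, y in zip(feat, centroid))

    return min(model, key=lambda label: sq_dist(model[label]))
```

A real system would replace both the features and the classifier; the sketch only shows how the stages of Figure 1 hand data to one another.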

Action segmentation is necessary to cut streams of motions into single action instances that are consistent with the set of initial training sequences used to learn the models. Closely related are the questions of how to choose such initial segmentations, and whether there is something like an elementary vocabulary of primitive motions in action articulation and perception.
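One simple way to picture the segmentation step is a sliding window over the feature stream, a strategy listed in the outline under Section 4.2. The sketch below is an assumption-laden illustration rather than any specific published method: the stream is a plain list of per-frame values, `classify_window` is a hypothetical user-supplied function, and consecutive or overlapping windows that receive the same label are merged into one segment.

```python
def sliding_windows(stream, size, step=1):
    """Yield (start, end, window) for every full window of `size` frames."""
    for start in range(0, len(stream) - size + 1, step):
        yield start, start + size, stream[start:start + size]


def label_stream(stream, size, step, classify_window):
    """Classify each window and merge adjacent windows sharing a label.

    Returns a list of (start, end, label) segments with `end` exclusive.
    """
    segments = []
    for start, end, window in sliding_windows(stream, size, step):
        label = classify_window(window)
        if segments and segments[-1][2] == label and start <= segments[-1][1]:
            # Same label and touching/overlapping the last segment: extend it.
            segments[-1] = (segments[-1][0], end, label)
        else:
            segments.append((start, end, label))
    return segments
```

With overlapping windows (a step smaller than the window size), neighbouring segments may overlap at label changes; resolving such overlaps is left out of this sketch.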

Vision-based techniques for representing, segmenting and recognizing human actions can be classified according to many different criteria, e.g. the body parts involved (facial expressions, hand gestures, upper-body gestures, full-body gestures, etc.); the selected image features (interest points, landmarks, edges, optical flow, etc.); the class of statistical models used for learning and recognition (nearest neighbors, discriminant analysis, Markov models, Bayesian networks, conditional random fields, etc.). The classification we have found to be the most useful is how the different methods proposed in the literature represent the spatial and temporal structure of actions. Indeed, our analysis of the recent literature in computer vision reveals a large variety of approaches in both the temporal and the spatial dimensions, which can be summarized as follows. In the spatial domain, action recognition can be based on global image features, aligned to the geometry of the scene or camera; or on parametric image features, aligned to the geometry of the human body; or on local image features, without structure. We review those three important classes in Section 2. In the temporal domain, action recognition can be based on global temporal signatures, such as stacked features, that represent an entire action from start to finish; or on grammatical models that represent how the moments of actions are organized sequentially, usually with several states and transitions between those states; or on sparse and unstructured observations, such as isolated key-frames. We review those three important classes in Section 3. By combining the three main spatial classes with the three main temporal classes, we end up with a synoptic classification of action recognition into nine basic classes, shown in Table 1.
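The nine classes are simply the Cartesian product of the three spatial and the three temporal representations. As a small sketch, the snippet below enumerates that product using the row, column and cell names of Table 1; the variable names themselves are hypothetical.

```python
# Row (spatial) and column (temporal) labels as they appear in Table 1.
SPATIAL = ["Body Model", "Image Model", "Spatial Bag of Features"]
TEMPORAL = ["Action Grammar", "Action Template", "Bag of Features"]

# Cell names of Table 1, one inner list per spatial row.
CELLS = [
    ["Body Grammar", "Body Template", "Bag of Postures"],
    ["Image Grammar", "Image Template", "Bag of Keyframes"],
    ["Feature Grammar", "Feature Template", "Bag of ST-Features"],
]

# Map every (spatial, temporal) pair to the name of its class.
taxonomy = {
    (spatial, temporal): CELLS[i][j]
    for i, spatial in enumerate(SPATIAL)
    for j, temporal in enumerate(TEMPORAL)
}

for (spatial, temporal), name in taxonomy.items():
    print(f"{spatial} x {temporal}: {name}")
```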

Additional difficulties are introduced when actions may be observed from different and changing views. In such unconstrained, realistic settings a single pose or motion can result in an almost infinite number of possible observations. An appropriate representation thus needs to account for such changes. To this end, view-independent approaches have been introduced. Because of the importance of that issue and because of the large variety of different approaches that have been proposed, we discuss those approaches in a separate section.

The paper is therefore organized as follows. First, we present a general overview of action recognition methods, based on how they represent the spatial structure of actions in Section 2, and the temporal structure of actions in Section 3. Then, we review the special topics of action segmentation in Section 4 and view-invariant action recognition in Section 5. We close this survey with a discussion on available datasets and experimental evaluation.

2 Spatial Action Representations

We begin this survey with a review of the spatial representations used to discriminate actions from visual data. As mentioned previously, a first step in action recognition is the extraction of image features that are discriminative with respect to posture and motion of the human body. Various representations have been suggested. They mainly differ in the amount of high-level information they represent versus how efficiently they can be extracted in practice. For the purpose of this survey, we classify them into three main groups: body models, image models, and unstructured features. Body models are based on a parametric representation of the human body recovered from images using body-part detection and tracking. Image models are based on dense image features computed over a regular grid. Sparse features are based on sparse image features computed at specially detected interest regions and loosely organized into a spatial bag-of-features.
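To make the contrast between dense image models and sparse features concrete, here is a minimal sketch under simplifying assumptions: an image is a list of rows of intensities, local descriptors are short tuples, and the codebook is given. Mean intensity per grid cell stands in for a dense grid feature, and a nearest-codeword histogram stands in for a bag-of-features; both functions are hypothetical illustrations, not the descriptors used by the cited methods.

```python
from collections import Counter


def grid_features(image, rows, cols):
    """Dense feature: mean intensity in each cell of a regular grid.

    Assumes rows <= image height and cols <= image width.
    """
    height, width = len(image), len(image[0])
    feats = []
    for r in range(rows):
        for c in range(cols):
            cell = [
                image[y][x]
                for y in range(r * height // rows, (r + 1) * height // rows)
                for x in range(c * width // cols, (c + 1) * width // cols)
            ]
            feats.append(sum(cell) / len(cell))
    return feats


def bag_of_features(descriptors, codebook):
    """Sparse feature: histogram of nearest-codeword assignments.

    The positions of the descriptors are discarded; only counts remain.
    """
    def nearest(desc):
        return min(
            range(len(codebook)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(desc, codebook[k])),
        )

    counts = Counter(nearest(d) for d in descriptors)
    return [counts[k] for k in range(len(codebook))]
```

The grid feature keeps the spatial layout of the image, whereas the histogram is invariant to where each descriptor was found, which is the trade-off the paragraph above describes.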

Table 1: Classification of Action Recognition Methods based on Spatial (vertical axis) and Temporal Representations (horizontal axis). Only some of the more recent approaches are listed in each cell.

Temporal representations (columns): Parametric, Action Grammar; Global, Action Template; Local, Bag of Features.

Parametric, Body Model
  Body Grammar: e.g. Wang [183], Kojima [84], Zhao [203], Park [121], Ramanan [133], Green [54], Nguyen [111], Guerra-Filho [57], Parameswaran [118], Peursum [125], Kitani [82], Lv [99], Wang [184], Ali [4], Ikizler [67], Morency [108]
  Body Template: e.g. Guo [58], Niyogi [114], Gavrila [47], Seitz [153], Yacoob [195], Ben-Arie [8], Rao [136], Gritai [55], Alon [5], Sheikh [156], Yilmaz [200], Shen [157]
  Bag of Postures: (none listed)

Global, Image Model
  Image Grammar: e.g. Brand [17], Elgammal [35], Cuzzolin [28], Ogale [116], Robertson [138], Sminchisescu [161], Ahmad [3], Lv [100], Turaga [176], Weinland [187], Natarajan [110], Vitaladevuni [180]
  Image Template: e.g. Pierobon [126], Roh [141], Weinland [190], Kim [81], Laptev [89], Meng [105], Wang [181], Farhadi [37], Fathi [39], Holte [64], Jia [72], Jiang [73], Junejo [76], Rodriguez [139], Souvenir [164], Yan [197]
  Bag of Keyframes: e.g. Carlsson [24], Efros [33], Jhuang [71], Thurau [169], Wang [185], Schindler [149], Weinland [186], Zhang [202]

Local, Spatial Bag of Features
  Feature Grammar: e.g. Shi [158]
  Feature Template: e.g. Laptev [86], Ke [80]
  Bag of ST-Features: e.g. Schuldt [151], Boiman [16], Dollar [32], Niebles [112], Ikizler [68], Niebles [113], Nowozin [115], Scovanner [152], Wong [193], Filipovych [42], Gilbert [51], Klaser [83], Laptev [87], Liu [96]

Figure 2: Illustration of moving light displays, taken from [74]. Johansson showed that humans can recognize actions merely from the motion of a few light displays attached to the human body. (Awaiting publisher permission.)

2.1 Body models

In this section, we review methods that represent the spatial structure of actions with reference to the human body. In each frame of the observed video stream, the pose of a human body is recovered from a variety of available image features, and action recognition is performed based on such pose estimates. This is an intuitive and biologically plausible approach to action recognition, which is supported by psychophysical work on visual interpretation of biological motion [74].

Johansson showed that humans can recognize actions merely from the motion of a few moving light displays (MLD) attached to the human body (Figure 2). Over several decades his experiments inspired approaches in action recognition which used similar representations based on the motion of landmark points on the human body. His experiments were also the origin of the unresolved controversy on whether humans actually recognize actions directly from 2D motion patterns, or whether they first compute a 3D reconstruction from the motion of the patterns. The observation that upside-down recordings of MLDs are usually not recognized by humans can be interpreted as evidence for the presence of a strong prior model in human perception [166, 52], i.e. humans expect people to walk upright and cannot easily adapt to strong transformations.

In the context of machine vision, both approaches have been advocated, resulting in two main classes of methods [107]: 1) recognition by reconstruction of 3D body models and 2) direct recognition from 2D body models.

Recognition by reconstruction divides the task of action recognition into two well-separated stages: a motion capture stage, which estimates a 3D model of the human body, typically represented as a kinematic joint model; and an action recognition stage, which operates on joint trajectories. Two major difficulties are the large number of degrees of freedom of the human body and the high variability of body shapes. As a result, a parametric model of the human body must be carefully selected and calibrated to support action recognition and generalization. A large variety of parametric models have been proposed over the years and we can only mention some of them. See Figure 3 for some examples.
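To illustrate what "operating on joint trajectories" can look like once a motion capture stage has produced 3D joint positions, the sketch below represents a pose as a dictionary from joint name to (x, y, z) and compares two equal-length trajectories after expressing every joint relative to a root joint. The four-joint skeleton, the choice of the hip as root, and the mean Euclidean distance are all assumptions made for this example; published body models are considerably richer.

```python
import math

# Hypothetical, deliberately tiny skeleton for illustration only.
JOINTS = ("hip", "head", "left_hand", "right_hand")


def root_relative(pose, root="hip"):
    """Express every joint relative to the root joint (removes translation)."""
    rx, ry, rz = pose[root]
    return {j: (x - rx, y - ry, z - rz) for j, (x, y, z) in pose.items()}


def trajectory_distance(traj_a, traj_b):
    """Mean per-joint Euclidean distance between two joint trajectories.

    Both trajectories are non-empty lists of poses of equal length.
    """
    if not traj_a or len(traj_a) != len(traj_b):
        raise ValueError("trajectories must be non-empty and of equal length")
    total, count = 0.0, 0
    for pose_a, pose_b in zip(traj_a, traj_b):
        rel_a, rel_b = root_relative(pose_a), root_relative(pose_b)
        for joint in JOINTS:
            total += math.dist(rel_a[joint], rel_b[joint])
            count += 1
    return total / count
```

Because poses are taken relative to the root, the distance ignores where in the scene the person stands, but it still depends on body orientation, body proportions and execution speed, which are among the difficulties named above.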