Spatiotemporal MRF approach to video segmentation: Application to motion detection and lip segmentation

(1)

HAL Id: hal-00785993

https://hal.archives-ouvertes.fr/hal-00785993

Submitted on 7 Feb 2013

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Spatiotemporal MRF approach to video segmentation:

Application to motion detection and lip segmentation

Franck Luthon, Alice Caplier, Marc Liévin

To cite this version:

Franck Luthon, Alice Caplier, Marc Liévin. Spatiotemporal MRF approach to video segmentation:

Application to motion detection and lip segmentation. Signal Processing, Elsevier, 1999, 76 (1),

pp.61-80. �hal-00785993�

(2)

Appliation to Motion Detetion and Lip Segmentation.

F. Luthon,A. Caplier, M.Lievin

Laboratoire des Images et des Signaux

Institut NationalPolytehnique de Grenoble

LIS, INPG, 46 avenue Felix-Viallet

38031 Grenoble Cedex, Frane

Tel: +33(0)4 76 57 43 72 Fax: +33 (0)4 76 57 47 90

Email: Frank.Luthoninpg.fr

Signal Proessing,76(1):61-80, July 1999

Abstrat

In this paper, a spatiotemporal strategy for image sequene analysis is proposed: a video

sequeneisproessedasa3-Ddatabathinsteadofaseriesof2-Dimages.

Applying this approah to motion detetion, a 3-DMarkovian model assoiatedwith aspa-

tiotemporal relaxationis dened. Using a3-Dneighbourhood of pixelsfor modelling spatiotem-

poralinterations, robustresultsareobtainedfordeteting movingobjetsinnoisysequenesor

intheaseofoverlappingmotion.

In order to improve the performane to detet poorly-texturedobjets orveryslow motion,

thealgorithmisintegratedinaspatiotemporalmultiresolutionsheme. Thedatapyramidisbuilt

byusing3-Dlow-passlteringand3-Dsubsampling. Robustresultsforsynthetiand real-world

outdoorimagesequenesarereported.

Thisapproahisalsoappliedsuessfullytospeaker'slipsegmentationinimagesequenes,for

audiovisualteleommuniation.

Key words: motiondetetion,imagesequenes,MarkovRandomField(MRF),spatiotempo-

ralapproah,multiresolution,lipsegmentation.

Resume

Cet artile presente une approhe spatio-temporelle pour l'analyse de sequenes d'images 1

:

une sequene est traitee omme un ot de donnees a trois dimensions au lieu d'une suession

d'imagesadeuxdimensions.

L'utilisation de etteapprohepour la detetion demouvement onduit ala denition d'un

modelemarkovien3-Dassoieaune relaxationspatio-temporelle. Gr^aeaunemodelisationne

desinterations spatio-temporellesentre lespixelsd'un voisinageubique, desresultatsrobustes

sont obtenus pour la detetion d'objets mobiles dans une sene tres bruitee et d'objets dont le

mouvements'eetue avereouvrementd'uneimagealasuivante.

Dans lebut d'ameliorerl'aptitude de l'algorithme adeteter des objets tres peu textureset

desobjets demouvementtreslent, ondenitun adrede multiresolutionspatio-temporelle. La

pyramide de donnees est onstruite par une suession de ltrages et de sous-ehantillonnages

appliquesdanshaunedestroisdimensions. L'inter^etdelamultiresolutionspatio-temporelleest

mis en evidene par diversresultats de detetion de mouvement sur des senes synthetiques et

reelles.

Une autre appliation de ette approhe porte sur la segmentation des levres d'un louteur,

dansunontextedeteleommuniationsaudio-visuelles.

1

ApaperinFrenhisalsoavailable[5 ℄.

(3)

Motion detetion and region-based segmentation areimportant issuesinimage sequene analysis or

oding,withappliationsinvideo-surveillaneand video-ommuniation.

Althoughthree dimensions(x;y;t)arerequiredto desribeanimagesequene,mostofthemeth-

odsdealing with sequene analysis are time sequential (eah image is proessed in turn), and work

on a pair of onseutive images. This might induelimitations e.g. for deteting subpixel motion 2

.

A ommon way to integrate motion information over a larger temporal domain is to use reursive

temporal lteringsuh asKalmanltering.

Inthispaper, another strategy isproposed. The point isto onsidera videosequene not asan

imageseries, butasa3-D databath, takinginto aount spatialand temporal dimensionswithina

single proess. This approahis oherent with thefat that a movingobjet overs a volume inthe

(x;y;t) spae.

Thesope ofthepaperistwofold: to giveaninsightintotheprosand onsofthespatiotemporal

approah,togetherwithfousingonpratialappliations. Theperformaneofthisapproahisindeed

illustratedwith twoappliations: robustmotiondetetion and lipsegmentationin videosequenes.

Asfor robust motion detetion, a 3-D non-separable Markov Random Field (MRF) based algo-

rithmis dened. Thismethodyieldsbetter resultsthantheseparableversion ofthe samealgorithm

in the ase of noisy sequenes oroverlapping motion 3

. The same observations (temporal variations

of the intensity funtion) as in the separable ase are retained, the enhaned performane of the

3-D algorithm oming from theimprovement of the MRFmodel whihis better at takingtemporal

onstraintsinto aount.

Todetet subpixelmotionand uniformmovingobjets 4

,alargerspatiotemporaldomainmustbe

taken into aount. This isdone byomputingobservationson a spatiotemporalpyramid.

In setion 2, a separable motion detetion algorithm is presented. The algorithm is inspiredby

thework ofBouthemyet al. [3 ℄. Therearetwomajordierenesbetweenthealgorithmdesribedin

[3 ℄ andtheone presentedhere. The rstdierene isthewaytemporalinformationisdealt with. In

[3 ℄,the proessingof eah imageis doneintwo steps("two-pass algorithm"). An initialdetetion of

movingareasattimetisderivedwhenonsideringimagesI(t 1)andI(t). Thisdetetionisupdated

when onsidering images I(t) and I(t+1). Two suessive label elds are always simultaneously

onsidered(optimization intwo passes),and the deisionaboutunoveredareas is postponedto the

nextproessingpass. Insetion2,weproposea"one-passalgorithm": asinglelabeleld(theurrent

one) is optimized at eah time (and only one). It makes implementation easier, for an equivalent

qualityofresults. Thisismadepossiblethanksto anotherwayofdoinginitialisation: weuseaoarse

estimate of the futurelabeleld, instead of repeting the past asis donein [3℄. Unovered areas are

handledbygivingmore weight to thefuture thanto thepast (anisotropyintemporal interations).

The seond dierene onerns omputational omplexity: we use four model parameters (se-

tion2.4), insteadofve in[3 ℄,sinethefuntionexpressingthelinkbetweenobservationsandlabels

is simpler inour ase. The deision abouta temporal lique(past orfuture) requires onlyone on-

ditional test to hoose among two ongurations, while eight dierent ongurations are tested in

Bouthemy's algorithm (Table 1 in[3℄). The number of 2-D eldsrequired for the relaxation is ve

inour ase (Fig. 1-b), instead of sixforBouthemy'salgorithm (Fig. 2 in [3 ℄). Hene,the amount of

memory requiredfor data storage is minor inour ase. Therefore, thetwo-step algorithm proposed

in[3℄ isless adequateforreal-timeimplementation(i.e. proessingat videorate).

Sine real-time proessing is of major onern for pratial video appliations, the paper ad-

dresseson several oasions theissuesof omputation ost and hardware implementation,either on

generalpurposeprogrammabledevies(digitalsignalproessors(DSPs)orvideoproessors),parallel

mahines (SIMDorMIMD)ordediated iruits(ASICs,VLSIellularanalog networks).

2

Subpixelmotionmeansdisplaementsoflessthanonepixelbetweentwoimages(i.e.slowmotion).

3

Overlappingmotionmeansthattheintersetionofthemasksofamovingobjetattimest 1andtisnotempty.

4

Uniformmovingobjetsmeansmovingobjetsthatarepoorly-textured,i.e. haveuniformintensity.

(4)

in the sense that spae and time have distint roles in the proessing (hereafter, the algorithm is

referred to asthe"separablealgorithm").

Its3-Dnonseparableounterpartisdesribedinsetion3. Aomparisonbetweentheperformane

of both algorithms is made. In setion 4, it is shown how the integration of the 3-D algorithm

in a spatiotemporal multiresolution framework allows subpixel motion and poorly-textured moving

objets to be deteted. In setion 5, anotherappliation of thisapproah is presented, forspeaker's

lipsegmentation ina ontext ofaudiovisualteleommuniation. A disussionin setion6 onludes

thepaper.

2 Separable MRF Model

MRFmodellingiswidely usedformotionanalysis,eitherfordetetion,estimation, orsegmentation.

For a state of the art about image motion analysis and an extensive bibliography, the reader may

refer to [13 ℄.

2.1 Observations and Labels

The purpose of motion detetion is to loalize moving and stati areas in a dynamisene. It is a

binary labelling problem that onsists in attributing to eah pixel or site s = (x;y) of image S at

time t one of the two labels: l

s

= a if s belongs to a moving area, l

s

= b if s belongs to the stati

bakground.

Withtheassumptionsofquasi-onstant illumination(verysmalllightingvariationsbetweent 1

and t) and stati amera, motion informationis loselyrelated to temporal hanges of theintensity

funtionI

s

(t). Therefore, observations are given by:

o

s

=jI

s (t) I

s

(t 1)j: (1)

The following notation is used: l = fl

s

;s 2 Sg and o = fo

s

;s 2 Sg represent one partiular

realisationat time tofthe labeland observation eldsL andO,respetively 5

.

Given a realisation o of eld O, the aim is to nd the most probable onguration l of eld

L. This is done by using the Maximum A Posteriori riterion (MAP). From Bayes theorem and

the equivalene between MRF and Gibbs distribution, it is known that the maximisation of the a

posteriori probabilityisequivalent to theminimisation ofan energyfuntion[9 ℄:

max

l

P(L =ljO=o) ()min

l

U(l;o): (2)

2.2 Energy Funtions

The energyfuntionislassiallythesum oftwo terms(orresponding to priorknowledge anddata-

link,respetively):

U(l;o)=U

m (l)+U

a

(o;l): (3)

The modelenergy U

m

(l) is aregularisationterm. Itputs apriorionstraints(spatiotemporal homo-

geneity) onthe masksof movingobjets, erasingisolatedpointsdueto noise. Itsexpressionisgiven

by:

U

m (l)=

X

2C V

(l

s

;l

n

) (4)

where denotes anyof thebinary liques denedin theneighbourhood ofFig. 1-a. A binarylique

5

Everytimetheinstantonsideredisdierentfromtheurrenttimet,atemporalindexwillbeaddedinthenotation.

(5)

binarization binarization ICM Relaxation minimum of energy

o o

I I

I

- + | | - + | |

l l

buffer l

l n

s

n n

n n n n n n

t-1 t

: current pixel s : a neighbour n a clique c = (s,n)

: n

t+1 a)

b)

t-1 t t+1

t t+1

t-1

t t

0 t+1 0

θ

parameters α β s β

p β f , , ,

Figure 1: a) Neighbourhoodand binaryliques. b) Separablealgorithm blokdiagram (l 0

denotesa

oarseestimate orinitialisationof labeleldL).

=(s;n) isanypair of distintsites intheneighbourhood,inludingtheurrentpixel sand anyof

theneighboursn. C istheset of all liques. V

(l

s

;l

n

) isan elementarypotentialfuntionassoiated

witheah lique=(s;n). In orderto puthomogeneityonstraintsintothe model,itis denedas:

V

(l

s

;l

n )=

(

ifl

s

=l

n

+ ifl

s 6=l

n

(5)

where the positive parameter depends on the nature of the lique: a parameter

s

is dened for

spatialliques,aparameter

p

forpasttemporalliqueandaparameter

f

forfuturetemporallique.

Thelink betweenlabels and observationsis expressedbytherelationship: o

s

= (l

s )+g

s where

g isa Gaussianunorrelated entered noisewithvariane 2

and:

(l

s )=

(

0 ifl

s

=b

>0 otherwise.

(6)

modelstheobservations: ifapixelisstati,notemporalhangeoursintheintensityfuntionand

theobservation shouldbe zero; ifa pixelis mobile,a hangeoursand theobservation issupposed

to take apositive value loseto , whihrepresents theaverage value ofnon-zero observations.

Thelink-to-data energyU

a

(o;l) (attahment energy) isderivedfrom theabove funtion:

U

a

(o;l)= 1

2 2

X

s2S [o

s (l

s )℄

2

(7)

wherethe observationvariane 2

is evaluatedon-lineforeah image.

2.3 Spatial Deterministi Relaxation

Fig. 1-b shows the blok diagram of the separable algorithm. The algorithm works on three on-

seutive frames. Suppose thepast labeleld l

t 1

hasbeen determined asthe result of the previous

optimization. The urrent label eld is initialised with a binary map l 0

t

derived from observation

eld o

t

, and a oarse estimate l 0

t+1

of thefuture label eld is also derived from binarisation of eld

o

t+1

. The binarymaps are obtainedwith the likelihood method proposed in[10 ℄, but ouldalso be

omputed withasimplethresholdingmethod,foromputation savingspurpose.

Tondtheminimumoftheenergyfuntion,thedeterministirelaxationalgorithmICM(Iterated

Conditional Modes) is used [2 ℄. For eah pixel s of the urrent image, the two labels a and b are

tested and the label whih indues the minimum loal energy in the neighbourhood is kept. The

proess iterates over the image until onvergene, one iteration orresponding to the sanning in x

and y dimensionsof the imageat time t. The stopping riterionfor onvergeneof therelaxation is

(6)

next imageof thesequene isproessed.

Notethat, sinethe algorithm workswith three frames, labelelds areobtained with adelayof

one frame.

2.4 Parameter Setting

Theseparablealgorithmdependsonveparameters: fourparametersforMRFmodelling(

s

;

f

;

p

;),

plusone threshold parameter for binarisationof observations. From various experiments bothon

real-world and syntheti image sequenes, the model parameters are xed to the following values:

s

=20,

p

=10,

f

=30, =10. This manuallearningphase for parametertuning wasbased on

empirialobservations: ontextual homogeneityofdetetedmasks,goodagreement betweenontours

of masks and atual moving objets, and insensitivity to aquisition noise. Unsupervised estima-

tion methods, like Expetation-Maximisation [7 ℄, ould also be used to estimate model parameters

s

;

f

;

p

. But they are prohibitive in terms of omputation ost. Morevover high preision in the

determinationofthesevaluesisnotrequired(robustnessofMRFmethodinsensitivetoaslighthange

of these values). Parameter

s

ontrols spatial homogeneity and may be dereased in ase of very

noisy sequenes. Parameters

f and

p

ontrol temporal homogeneity. More weight is given to the

future by taking

f

>

p

, so that the bakground area whih has beenunovered duringmotion is

fastereliminated. Indeed,insuharegion,thepasttemporalneighbourisa-labelledwhilethefuture

one isb-labelled. Butthegood labelis thestati one(l

s

=b),given bythefutureinformation. Note

thattemporal homogeneityonstraint an be relaxed inaseof fastmotion.

Parameterstandsforsome kindofaverage valueof non-zeroobservations. Thisparametermay

either be omputed on-line for eah image asexplained in [3 ℄, or xed to an arbitrary value before

proessing. From experimental tests, on-line omputation of for eah imagedoes notsigniantly

improvemotion detetion results.

The threshold required for binarisation (omputation of initial binary maps with a method

derived from [10 ℄) is the only parameter whih must be adjusted for eah sequene. Here, it is

determined manually (o-linelearningphase at thebeginningof videoaquisition orbeforerunning

the automati proessing). One ould use likelihood tests suh as desribed in [10 , 1 ℄ to determine

this deision threshold automatially, but at the expense of omputation ost. A too low value of

indues many false detetions. A too high value of erases moving pixels in overlapping motion

areas. Forallsequenesaquiredwiththesame ameraunderthesame lightingonditions,thesame

valueof maybekept (e.g. =32 forall street sequenespresentedinthispaper).

2.5 Computational Complexity

The proessing rate is evaluated in the ase of images of size 128 128. When implemented on

a Spar-10 workstation with C programming, the proessing of an image takes about 1:8s of pu

time ( 0:4s per iteration). This orresponds roughly to N

0 N

x N

y N

i

=2:5 10 7

elementary

operations. N

0

=400 isthenumberofelementaryoperations(additions,multipliations,onditional

tests) involved intheomputation oftheloalenergyassoiatedwitheah pixel (1 multipliation=

10 additions). N

x

= 128 and N

y

= 128 represent the image dimensions and N

i

=4 is the average

numberofiterationsuntilonvergene.

To ahieve real-time proessing, various hardware implementations (on parallel SIMD mahine,

DSP board, or ellular VLSI analog network) have been either developped or simulated [6 , 8 ℄. A

proessing rate of 12 to 25 frames per seond is then ahieved. Another implementationon a Pro-

grammableVideoProessor(PVP)forteleommuniationappliationsisnowunderstudy. ThePVP

is an intensive omputing unit with a parallel SIMD arhiteture (8 sub-proessors onneted to a

sharedmemoryof16Kbytes)andsevenI/Oportsfordataowirulation. Itoersahighomputing

power(2 Gops)witha highI/Orate(4 Gbits/s). Forimages ofsize256256 anda lokfrequeny

(7)

PVPsoftwaresimulator.

2.6 Experimental Results

The separable algorithm was tested both on syntheti and real-world image sequenes. A typial

example forvideo-surveillane appliation (traÆ ontrol) is shown inFig. 2. This street sequene,

Figure2: Top)Streetsequenewithamovingpedestrian;Bottom)Masksofthemovingbodydeteted

after relaxation (blak =movinglabel,white =stati label).

aquiredwith a standard video amera, ontains a single pedestrianwalking on the pavement. The

image sequene is notvery noisyand motionof the pedestrianis largeenough betweentwo images,

allowingagooddetetion. Themaskofthemovingbodydetetedintheimageplaneisgivenat four

onseutive instants.

3 3-D Non Separable MRF Model

3.1 Spatiotemporal Relaxation

Althoughtheseparablealgorithmintegratesmotion informationfromthree onseutiveframes, only

the urrent frame is proessed at eah time (Fig. 1-b). The 3-D non separable model for motion

detetion is based on the intuitive idea that, by taking into aount more than three onseutive

framesof thesequene, theanalysis ofmotionmaybe improved. Therefore, thevideosequeneisno

longer onsidered as an image series butas a 3-D data bath. L and O are now 3-D random elds

(orvolumes).

Tondtheminimumoftheenergyfuntion,aspatiotemporalversionofICMisrequired. Thekey

point isthat, at eahiteration,therelaxation runsovertemporal setionsof lengthN

t

(Fig.3). The

sanning isdone notonlyin spatialdimensions(x;y) at a given time t, butin thethree dimensions

(x;y;t)together. It isperformedbak-and-forthspatiallyandtemporally. Oneiterationorresponds

tothesanningofawholetemporalsetion. Allframesofthetemporalsetionareproessedtogether.

After onvergeneof ICM, labels ofall pixelsinludedin thatsetion areavailable.

Allalong the paper, we refer to the 3-D non-separable motion detetion algorithm as "the 3-D

algorithm".

3.2 A Priori Model

The mathematial framework of MRF modelling remains the same. The relationships of setion

2 still hold, sine there is no restrition about the dimensions of elds L and O. However, elds

L and O are now supposed to be spatiotemporal 3-D random elds, bringing about the following

hanges: in Eq. (7), S represents now a temporal setion of N

t

images, instead of a single image.

Theneighbourhoodstrutureassoiatedwith Lisnowaompletespatiotemporalube (Fig.4),and

(8)

N images

x y t t

N _y

N _x

Figure 3: Temporal setionof lengthN

t .

t

t-1 t+1

s

n n n

n n

n n n

: a neighbour n current pixel s

: : spatial clique

: temporal clique

: spatiotemporal clique

Figure 4: 3-Dneighbourhoodand binaryliques.

(9)

parameterisaddedintheglobalenergyfuntionforbalaning U

m

(l) andU

a

(o;l) inuenes inthis

extended neighbourhood:

U(l;o)=U

m

(l)+U

a

(o;l): (8)

Wesupposethattheproposedneighbourhoodontainsallthedependeniesofpixels. Thisisthe

simplest 3-D neighbourhood. One ould inrease in spae and time the size of the neighbourhood,

butattheexpenseof omputationost. Inthisspatiotemporalneighbourhood,three kindsof binary

liques are dened: spatial, temporal and spatiotemporal (Fig. 5). They dier aording to their

x

y

t δ x

δ _y = 1

= 1

Spatial cliques (horizontal, vertical and diagonal )

Temporal cliques x

y

t δ t =1

x

y

t δ _x = δ _t =1

δ _x = δ _y = δ _t =1

Spatiotemporal cliques δ _y = δ _t =1

s s s

δ _x = δ _y =1

Figure5: Thethree typesof binaryliques.

spatial and temporal extent (along the x;y and t axis, respetively). Let Æ

x , Æ

y

;Æ

t

represent in the

3-Dspae(x;y;t) theoordinatesofvetor

!

(s;n)orrespondingtoa liquewithoriginintheurrent

pixel s (Æ 2 f 1;0;1g). Then we get: eight purely spatial liques (horizontal (Æ

x

= 1, Æ

y

= 0,

Æ

t

= 0), or vertial (Æ

x

=0, Æ

y

=1, Æ

t

=0), ordiagonal (Æ

x

=1, Æ

y

= 1, Æ

t

=0)); two purely

temporalliques(Æ

x

=0,Æ

y

=0,Æ

t

=1);sixteenspatiotemporalliques((Æ

x

=1,Æ

y

=0,Æ

t

=1)

or(Æ

x

=0, Æ

y

=1, Æ

t

=1)or(Æ

x

=1,Æ

y

=1, Æ

t

=1)).

Forthedenitionofliquepotentials inEq. (5),a spatialparameter

s

is usedto ontrolspatial

homogeneity(nodistintionismade betweenxandy) andatemporalparameter

t

forhomogeneity

intemporaldimension. Thisisasimplewaytotakeintoaountthenon-homogeneitybetweenspae

and time. Note that no more distintionis made betweenpast and future, sine the3-D algorithm

will propagate information forward and bakward in time and allow to hange a deision taken in

thepast,espeiallyasregardsunovered areas,thanksto thespatiotemporal natureof iterations(see

omments insetion 3.5).

Alllique potentials are dened with these two parameters, aording to the physial priniple

that interation with the urrent pixel gets weaker when the neighbour is far. Here, interation is

assumed to be inversely proportionalto the squared distane between sites in the ube. Thus, the

atualpotential(s;n)assoiatedwith alique=(s;n) isdened bythefollowingexpression:

(s;n)=

1

d 2

(s;n) h

Æ 2

x (s;n)

s +

Æ 2

y (s;n)

s +

Æ 2

t (s;n)

t i

(9)

whered(s;n)= q

Æ 2

x +Æ

2

y +Æ

2

t

istheEulidiandistanebetweentheurrentpixelsandtheonsidered

neighbourn. This relationshipgives:

(s;n)=

s

forspatialhorizontal orvertialliques(d(s;n)=1);

(10)

(s;n)=

s

4

forspatialdiagonal liques(d(s;n)= 2);

(s;n)=

t

fortemporal liques(d(s;n)=1);

(s;n)=

s

t

2(s+t)

forspatiotemporalhorizontalorvertialliques (d(s;n)= p

2 );

(s;n)= st

3(s+2t)

forspatiotemporal diagonalliques(d(s;n)= p

3).

3.3 Parameter Setting

Four model parameters are required:

s

= 20,

t

= 5, = 15 and = 5. These values were

determined experimentally (as in theseparablemodel). We hoose inpratie

s

>

t

to give more

importaneto spatialhomogeneitywhihissupposedtobemorereliablethantemporal homogeneity

(espeiallytrue inthease ofnon-deformableobjets undergoing arbitrary motion).

Parameterontrols theinueneof bothterms of energy. If itisneessary to reinforeapriori

onstraints (beause of bad observations for example), should be dereased. If it is neessary to

reinforethelinkto data, shouldbeinreased.

Thespeiationof neighbourhoodandliquepotentialsentirelydenes theMRFmodel,sothat

atualvaluesofN

t

;N

x orN

y

donotinuenethemodelling. DierentvaluesofN

t

weretested. The

defaultvalue isN

t

=8. Itmaybe dereased whenspatiotemporal homogeneity onstraint isbroken

(fast motion) and it may be inreased for very noisy sequenes. Still, at least N

t

= 5 images per

setion are requiredbeause of temporal boundaryeets (rst and lastimages of a setion are not

proessedbeause of thelak ofpast andfuture neighbours,respetively).

At rst sight, the omputational omplexity of the 3-D algorithm may be a bottlenek. In pra-

tie, handling a video sequene as a 3-D data bath does not drastially inrease the global om-

putation time ompared to a serial proessing image per image. On a Spar-10 workstation with

C-programming, 4s of pu-time per image of size 128128 are neessary to detet motion. The

inreaseof omputation ost omesprimarilyfrom theinreasednumberof iterationsrequireduntil

onvergene (10 iterationson average instead of 4), the stopping riterion remainingthe same asin

setion 2.3. The neighbourhood extension (26 neighbours instead of 10) does not indue a major

extraomputingharge.

Onthe other hand,the delayrequired before obtainingmotion detetion resultsmaybe ruial.

Sine the 3-D algorithm runs on temporal setions of length N

t

, all motion masks of a setion are

available at thesame time, when theproessing of thewhole setion is ompleted. In order to limit

boththe delay and the required memory for software implementation, N

t

should be small(anyway

muh lowerthan theatuallengthof anyvideo sequene).

Therefore, a long sequene should be proessed reursively, by utting it into smaller temporal

setions. Fig. 6 illustratesthe reursive proess withN

t

=5. The 3-D algorithm runs in spae and

time on therst setion of ve images: images t 2, t 1 and t areproessedtogether (setion 1);

thenitruns ontheseond setionof ve images: imagest 1,tand t+1 areproessed, withinitial

label eldsl 0

t 1 ,l

0

t

givenbyresultsof setion1, et...

Every time, one new imageis stored and only5 suessive frames stay inmemory. When image

t+3 is aquired, the nal result for image t may be omputed, orresponding to a delayof 120ms

(3 40ms for sequenes aquired at 25 images per seond), whih might be aeptable in many

appliations(e.g. video-surveillane).

The reursive proess does not inrease omputational omplexity. Of ourse, eah label eld

is estimated in three onseutive temporal setions. For example, l

t

is proessed when estimating

(l

t 2

;l

t 1

;l

t

) (setion 1), (l

t 1

;l

t

;l

t+1

) (setion 2), and (l

t

;l

t+1

;l

t+2

) (setion 3). But as regards the

(11)

section 1

section 2

section 3 t-2 t-1

t-3 t t+1 t+2 t+3

frames not processed (temporal boundary effects) :

: processed frames

Figure6: Setion-reursive algorithm (N

t

=5)

two lastestimations(temporal setions2and 3),theinitiallabeleldl 0

t

ismorereliable(loseto the

nalone), sothat onvergeneis faster(feweriterations areneeded).

This reursive version of the algorithm was implemented on a SIMD mahine [6℄. The parallel

mahine isalinearnetwork of256elementaryproessorswith4Kbytesofloalmemoryeah. Itom-

muniates withahost workstation via Ethernetinterfae. AssemblerorC-parallelprogrammingan

beused. Loalomputations are doneinparallel. Data areuniformlydistributedamongproessors.

The proessingrate ahieved is around3 to 4images/seond (imagesofsize128128).

Fig. 7 illustratesthe eÆieny of the 3-D algorithm to reover movingobjets in a noisy sequene.

The syntheti sequeneontains twomovingobjets: alearretangle whihtranslates rightward(1

Figure 7: From top to bottom: 1) Syntheti sequene with impulse noise; 2) Initial binary maps

( =20); 3) Masks deteted after spatial relaxation (separable algorithm); 4) Masks deteted after

spatiotemporal relaxation (3-D algorithm,N

t

=8).

pixel/image)andadarksquarewhihtranslatesleftward(1pixel/image). Shownarethebinarymasks

deteted with the separable and the 3-D algorithms, respetively. One an see that spatiotemporal

relaxation isusefulto eliminatebad detetion dueto noise(isolatedpoints).

(12)

propagated both in spae and time. The sequene of Fig. 8 ontains two moving areas: a group

of three pedestrians walking on the pavement and a biyle riding leftward on the road. With the

Figure 8: From top to bottom: 1) Street sequene; 2) Masks deteted with separable algorithm; 3)

Masks deteted with 3-D algorithm; 4) Contours of masks obtained with the 3-D algorithm, super-

imposedon theimagesequene.

separablealgorithm,thepedestriansmaskisonlypartiallyreoveredbeauseofalakofinformation

intheoverlappingmotionarea. Theseparablealgorithmimpliesausalproessinganddoesnotallow

tobak-propagatespatiotemporalonstraintsintimeandtohangeadeisiontakeninthepast. The

3-D algorithm, in ontrary, makes it possible to bak-propagate information in time and to fully

reoverthe pedestrians maskfor eah image of the sequene. In thebottom of Fig.8, the preision

ofthemasks intermsofontoursisshown: theuppermaskorrespondsto thegroupofpedestrians,

whilethelower maskorrespondsto thebiyle.

4 Spatiotemporal Multiresolution Framework

Both versions of the algorithm (separable and 3-D) yield poor results in ase of uniform intensity

movingareas orsubpixelmotion. Insuhases, althoughobjets aremoving, temporalvariationsof

theintensityfuntionare almost zero(observationsof poorquality). To solve thisproblem,the3-D

algorithm is runon a spatiotemporal pyramid of data witha oarse-to-ne strategy. Spatial ltering

isa ommonwayto dealwithlargeuniformintensitymovingareas. Temporallteringiseetive in

order to dealwith subpixelmotion.

Multiresolutionmay also improve theinitialisationstep for spatiotemporal relaxation. Indeed it

hasbeenonjeturedthatmultiresolutionanalysissmoothestheenergyfuntion[11 ℄,makingiteasier

to ndtheglobalminimum. ThismayberuialwhenadeterministirelaxationalgorithmlikeICM

is used,sineitmaygetstukin therst enountered loalminimumof theenergy funtioninase

of bad initialisation.

4.1 Spatiotemporal Low-Pass Pyramid

The spatiotemporal struture of the 3-D MRF model suggests to build not only a spatial but a

spatiotemporal pyramid. The basionvolutionkernelisthebinomiallow-passlter 1

4

[121℄whih is

(13)

x x x

1 2 1

2 4 2

1 2 1

x x x

1 2 1

2 4 2

1 2 1

x x x

2 4 2

4 8 4

2 4 2

t t-1

t+1

1 64

F F F

I(x,y,t)

I (x,y,t) ⁰ I (x,y,t) ¹ I (x,y,t) ^k

2 2

: 3-D filtering : 3-D subsampling

a)

b)

F 2

Figure 9: a) Spatiotemporal lter onvolution kernel. b) Pyramid buildingproess (the supersript

k denotestheresolutionlevel).

byBurt's spatialpyramid[4 ℄, this kernel, assoiated with a spatiotemporal subsampling, is used to

builda spatiotemporal pyramid (Fig. 9-b). Note that frames after lteringand before subsampling

arelessorruptedbynoisethanframesafterlteringandsubsampling. Anexampleofspatiotemporal

pyramid with three resolution levels is shown in Fig. 10. Spatiotemporal subsampling redues the

Figure10: Spatiotemporalpyramid: originalsequeneintoprow,andthethreespatiotemporallevels

below(k=0;1;2).

sizeof eah imagebya fator of4 and thelength ofthesequene bya fator of 2at eah resolution

level.

The 3-D algorithm is run at eah level of the spatiotemporal low-pass pyramid. The strategy

is oarse-to-ne: the algorithm starts at the lowest resolution level (k

max

). After spatiotemporal

interpolation(Fig.11), theresultofrelaxation atlevelk isusedto initialiserelaxation atlevelk 1.

Running the algorithm on this pyramid gives a multiresolution label eld. At eah level, the label

eldisoptimised aordingtoobservationsattheorrespondinglevelinthespatiotemporalpyramid

(14)

level k-1

level k

Figure 11: Spatiotemporal interpolation

(Fig. 12).

Temporal section S

Observations O

Spatiotemporal pyramid of observations O ⁰

O ¹ O ²

L L L 0

1 2

Spatiotemporal pyramid of labels

Figure12: Spatiotemporal multiresolutionsheme.

The optimal numberof levels for the pyramid (k

max

) depends on the size and speed of moving

objets: todetetveryslowmotion,k

max

shouldbeinreased. Iftheseneontainsverysmallmoving

objets, k

max

shouldbe dereased. Thedefaultvalue isk

max

=2 (threeresolution levels).

4.2 Pyramid of Observations

Thespatiotemporallteringintegratesmotioninformationoveralargerspatialandtemporaldomain,

so that observations at low resolution levels are more relevant in ase of subpixel motion and uni-

form movingareas. Spatial lteringimproves observationsforpoorly-texturedareas, whiletemporal

lteringon manyonseutive framesimprovesobservationsinase of subpixelmotion.

Fig. 13 exhibits the quality of observations, omputed both with mono- and multiresolution

shemes, forthe well-known Trevor sequene. This sequene representsthemotion ofa TV speaker.

Motion between two onseutive frames isvery slow(subpixelmotion) and manyareas of theshirt,

head and hands are poorly-textured. The gure presents the multiresolution observations at three

resolution levels. The darker the pixel, the larger the orresponding observation. Multiresolution

observationsare learlymore onsistant thanmonoresolution observationsinthat ase.

4.3 Parameter Adaptation

Parametersof thealgorithm areadapted alongthe pyramidasexplained below.

First,theevolutionof observationsalongthepyramidhasbeeninvestigatedinorder toadaptpa-

rameterateahresolutionlevel. Ofourse,low-passlteringreduestheamplitudeofobservations.

Letusfousona pixelsat a motiontransition,asshowninFig. 14(vertialedgemovingrightward

at a speedof 1 pixel/frame). After spatiotemporal ltering, theamplitude of observation isdivided

by a fator 8

3

' 3 (obvious omputation with the 3-D onvolution kernel of Fig. 9-a). Therefore

(15)

tion observations; 3) Multiresolution observations, 3 resolution levels (k = 0;1;2). All displays are

normalisedinorder to spanoverthefullavailabledynamirange [ 0;255℄.

t-2 t-1 t t+1

I0 ∆Ι

s

0 I0 I0

I0 + ^I0 ⁺ ∆Ι ^I0 ⁺ ∆Ι

s s s

s s

s

0 ∆Ι 0 0 ∆Ι

∆Ι

t-1 t t+1

sequence

observations +

- - + - +

Figure 14: Evolution of observation after spatiotemporal ltering: typial ase of a vertial edge

movingrightward. I

s

(t 1)=I

0 ,I

s (t)=I

0

+IandI

s

(t+1) =I

0

+I. Irepresentstheamplitude

of stepobserved at pixels. Before3-D spatiotemporal ltering: o

s

=jI

s (t) I

s

(t 1)j=jIj. After

spatiotemporal ltering: o

s

= 1

64

(4jIj+16jIj+4jIj)= 3

8 jIj.

(16)

sameproportionalongthepyramid,i.e.

k

=

0

=3 k

. Sineobservationsdereasebyafatorofabout

3, theobservation variane 2

dereasesbya fator of about 9,so that theratio inEq. (7) remains

onstant.

Seondly,sinespatialandtemporalinformationareintegratedinthesamewayalongthepyramid,

theparameter ratio

s

=

t

iskept onstantfor allresolution levels.

Thirdly, from a qualitative point of view, spatiotemporal interations should get weaker at low

resolution levels, sine two neighbouring pixels are atually far away in the full-resolution image

sequene. The evolution of potentials (s;n) should be related to the physial distane between

pixelsina squaregrid. This an alsobe statedfrom aquantitative point of view: at eah resolution

level, the physial distane d(s;n) between two pixels is atually doubled beause of subsampling.

This leads to a derease of 4 for lique potentials (s;n) in Eq. (9). For omputational simpliity,

this evolution law is simply implemented by adapting the weight fator

k

as follows:

k

= 4 k

0 .

The globalenergyis then: U(l;o)=U

m (l)+

k U

a (o;l).

Finally,parameterdoesnotneedtobeadaptedalongthepyramid,sinethebinarisationmethod

derivedfrom[10 ℄ (andhene parameter)isonlyusedforlabelinitialisationat thelowest resolution

level k

max

. At ner resolution levels, initialisationis simply performed by interpolating the results

of lower resolution levels, with no need of . But ompared to the monoresolution sheme, must

beinreasedwhen multiresolutionisused(experimentalobservation). Thetheoretialexplanationof

theneessaryinreaseofwith multiresolutionlevelis theinueneof datalow-passlteringonthe

method given in[10 ℄ forsetting thethresholdvalue.

The buildingof the pyramidis not omputationally expensive: sine the 3-D onvolution kernel of

Fig. 9-a is separable, the implementationof the spatiotemporal ltering is equivalent to the imple-

mentation ofthree 1-D binomiallters inx,y and tdimensions, respetively.

The relaxation at low resolution levels is quik dueto the smaller number of sites and therefore

Markovianonstraintsarepropagatedfaster. Comparedwiththefull-resolutionlevelk=0,thedata

ow to be proessedat levelk =1is redued by a fator of 8 (N

x

;N

y

;N

t

derease eah bya fator

of 2). Thus, one iteration at resolution levelk is equivalent (in terms of omputationost) to 1=2 3k

iterationsat thenest resolutionlevel (k=0).

Then,athigher(ner)resolutionlevels,feweriterationsareneededomparedtoamonoresolution

sheme,beauseofabetter initialisationpropagatedfromlowerresolutionlevels. So,multiresolution

usuallyreduesthe overall numberof iterations.

Theomputationtimehasbeenreordedexperimentallyformanysequenes. Thesame stopping

riterion as in setion 2.3 wasused. In fat, the multiresolution spatiotemporal algorithm does not

drastiallyspeeduptheproessingrate. Sothemaininterestofthemultiresolutionframeworkhereis

theimprovedperformanefordeteting subpixelmotionand poorly-texturedmovingareas asshown

innext setion,butnotomputation savings.

Fig. 15 presents the masks deteted in a ase of subpixel motion with both versions (mono- and

multiresolution) of the 3-D algorithm. The syntheti sene ontains three mobile objets: a lear

retangle movingrightward (1 pixel/frame), a dark square moving leftward(1 pixel/frame) and an-

other square on the left moving slowly upwards (0.35 pixel/frame). With the 3-D monoresolution

algorithm, the slowest square is badly deteted. With the multiresolution version of the algorithm,

thissquareis welldeteted, startingfrom theseond resolutionlevel.

Fig. 16 presents the masks deteted for the Trevor sequene with both versions (mono- and

multiresolution) of the 3-D algorithm. Monoresolution masks are very fragmented, sine speaker's

(17)

subpixelmotion;2) Monoresolutionmasks; 3) Multiresolutionmasks (2 levels: k =0;1).

motion isvery slowand thesene ontainsmanypoorly-texturedareas (hands, shirt,head). Onthe

ontrary, multiresolution masks arespatially and temporallyhomogeneous. The whole bodyis fully

deteted, startingfrom thethirdresolutionlevel (k=2).

5 Lip Segmentation

Theproposedapproahwasalsoappliedtolipsegmentationinolorimagesequenes,foraudiovisual

ommuniation between two speakers. Fig. 17 shows the ontext of appliation for a high quality

and lowbitratevideophone. It an alsobeusedforman-mahineommuniation(automatispeeh

reognition)orvideoonferening.

The speaker wears a light helmet equipped with a miro-amera and a mirophone, so that the

amerais xedwith respet tothe head. The segmentationis basedon theassumptionthat lipsare

areas inthefaewereredhue and motionpredominate.

Themainstepsoftheproessingareasfollows(detailsmaybefoundin[12 ℄). First,aolorvideo

sequene of speaker's fae is aquired under natural lighting onditions and without anypartiular

make-up. A logarithmi olor transform is performed from RGB (red, green, blue) to HIS (hue,

intensity, saturation) olor spae, in order to gain independene from illumination brightness and

noise.

Then,two observationsarederived. Therst observationisomputedfromthehuevalueateah

pixel: itgives informationaboutareas wherered hue ismost prominent. The seond observation is

thesameasinEq.(1): framedierenesbetweentwoonseutiveimages. Itgivesinformationabout

motion areas.

Fromthese twothresholdedobservations, fourinitiallabels(a

0 ,a

1 ,b

0 ,b

1

) arederived,foroding

four pixel lasses: pixels with (

1

), respetively without (

0

) motion, belonging (a), respetively not

belonging(b), to redhueareas.

Thespatiotemporal MRFapproah isthenusedforregularizingthesolution. Somehangeswere

introdued in the model presented in setion 3.2, in order to take into aount better the a priori

knowledge availableforthisspei appliation(lip shapeand motion). Namely, thespatiotemporal

potentialfuntion(s;n)isnowinverselyproportionaltotheEulidiandistane(andnotthesquared

(18)

Multiresolutionmasks (3 levels : k=0;1;2).

Figure 17: Context of audiovisualommuniation: from the image sequene of speaker's fae, geo-

metrial features of lips are extrated and provide modelling parameters for talking fae synthesis

and animation.

(19)

s and

t

assalingfators. Butforthisappliation,weforesomespatialanisotropy:

x

=2:

y

=

s

in order to putemphasize on horizontal ongurations (geometrial onstraints on lipshape). This

yields:

(s;n)=

1

r

Æ

x

2

+

Æ

y

2

+

Æ

t

2

=

s

t

r

2

t

Æ 2

x +4Æ

2

y

+ 2

s Æ

2

t

: (10)

Moreover, inontraryto setion3.2,parameters

s and

t

arenotonstant,butdependonthelabels

taken bysitessandn. Theyaredenedto onstrainthemodelto,respetively,spatialhomogeneity

of labels, and temporal homogeneity of hue when no motion is deteted. For example,

s (l

s

;l

n ) is

proportionalto: jr(s) r(n)j+jm(s) m(n)j, wherer(s)and m(s)arebinarydigits (0or1) oding

thepresene at pixel s ofred hue and motion,respetively. Forthe denitionof

t (l

s

;l

n

),see Table

3 in[12 ℄.

Withthismodelling,one obtains robustlabel eldsafter relaxation, exhibitingareas inthe fae

wherered hue and motionare predominant (Fig.18).

Figure 18: From top to bottom: 1) Sequene of luminane images: male fae without make-up; 2)

Initiallabelelds; 3)Finallabeleldsafterrelaxation: thefourlabelsareshowningraylevels(from

white to blak: b

1 ,a

1 ,b

0 ,a

0

); 4) Sequeneof lipmasks(ombination ofa

0 and a

1 ).

From thenal label eld,a region ofinterest isdetermined automatially (mouthboundingbox

in Fig. 19). Measurements of geometrial features are performed on lip masks (height and width,

Figure19: Top)Sequeneofluminaneimages: femalefaewithsoftredmake-up; Bottom)Sequene

of lipmasks withboundingboxsuperimposedon theluminane.

surfae), and usedforfaesynthesis at thereeiver'send.

Theproposedmethod forlipsegmentationsolves two ruialproblemsthatusually ariseinsuh

a ontext: indeed, theproessing gainsindependene bothfrom lighting onditionsand make-up of

lips. Thisisduebothto theuseof thelogarithmiolortransform,andtotherobustspatiotemporal

(20)

lipareas.

Aparallel implementationof thisalgorithmon a Programmable Video Proessoris under study.

The ahievable proessing rateis estimatedto be 13 images/sforimages ofsize256256.

6 Disussion

A spatiotemporal strategy for image sequene analysis was presented, and applied suessfully to

motiondetetionand lipsegmentationinaMarkovianframework. Itprimarilyonsistsinproessing

a videosequene asa3-D databath.

Withsuh an approah, improved performaneis reported formotion detetion inase of noisy

sequenesand inase of overlappingmotion.

A3-Dspatiotemporal multiresolutionsheme oherent withthe3-D MRFisalso proposed. This

multiresolutionapproahiseÆienttohandletwodiÆultases: subpixelmotionandpoorly-textured

movingareas. Butinaseofveryfastmotion,themultiresolutionalgorithmyieldsworseresultsthan

the monoresolution version. This is due to the fat that temporal ltering indues an averaging of

motion information over many images, so that it is no longer possible to preisely detet motion

boundaries. Asaresult,motionmasks arebiggerthanatualmovingobjets. Spatialmultiresolution

withouttemporalmultiresolutionwouldbebeneialinthatase,sineitallowstospatiallylinearize

intensity without temporal blurring. Mono- and multiresolution algorithms being omplementary,

it would be interesting to develop a strategy for swithing automatially between both versions of

the algorithm aording to the analysed sequene. Moreover, the multiresolution pyramid involves

3-D low-pass ltering. In order to limitthe blurringeet, the use of 3-D wavelets (3-D orthogonal

high-passand low-passlterbanks)ould beonsidered.

Theseondappliationreportedhereonernsspeaker'slipsegmentationinaolorvideosequene.

Theinterestofthespatiotemporalmethod,togetherwitha logarithmiolortransform,issupported

bythegoodqualityof resultsobtainedinthishallengingsituation(naturalimages ofspeaker'sfae

withoutanypartiularmake-up orlighting).

Thespatiotemporalapproahhasalsobeenusedtoomputespatiotemporalgradientswithspline

funtions(resultsnotreported here). Theimplementationinvolves3-D reursive lterings. Thus,we

do believe itould also be appliedwith suessto optial owestimation.

Referenes

[1℄ T. Aah, A. Kaup, R. Mester, "Statistial model-based hange detetion in moving video",

Signal Proessing,Vol. 31,No.2, Marh 1993,pp.165-180.

[2℄ J. Besag, "On the Statistial Analysis of Dirty Pitures", Journal ofRoyalStatistial Soiety,

Vol. B-48,No. 3,1986, pp.259-302.

[3℄ P. Bouthemy, P. Lalande, "Reovery of moving objet masks in an image sequene using loal

spatiotemporalontextualinformation",OptialEngineering,Vol.32,No.6,June1993,pp.1205-

1212.

[4℄ P.J. Burt, E.H. Adelson,"The Laplaian PyramidasCompat Image Code",

IEEE Trans.on Communiations,Vol. 31,No. 4,1984,pp. 532-540.

[5℄ A.Caplier,F.Luthon,"Approhe spatio-temporellepourl'analysedesequenesd'images. Appli-

ation endetetion de mouvement",Traitement du Signal,Vol. 14,No. 2,1997, pp.195-208.

[6℄ A. Caplier, F. Luthon, C. Dumontier, "Real-Time Implementations of an MRF-based Motion

Detetion Algorithm",Real-TimeImaging, Vol. 4,No.1, February1998, pp.41-54.

(21)

algorithm", Journalof theRoyalStatistialSoiety, SeriesB, Vol.39, 1977, pp.1-38.

[8℄ C.Dumontier,F.Luthon,J.P.Charras,"Real-TimeImplementationofanMRF-basedMotionDe-

tetion Algorithm on a DSP Board", Pro. 7 th

IEEE DigitalSignalProessing Workshop , Loen,

Norway, September1996,pp. 183-186.

[9℄ S.Geman,D.Geman,"StohastiRelaxation,GibbsDistributions,andtheBayesianRestoration

of Images", IEEE Trans.PatternAnal.and Mahine Intel., Vol. 6, No. 6, November 1984, pp.

721-741.

[10℄ Y.Z.Hsu, H.H.Nagel,G.Rekers,"New likelihoodtestmethodsforhangedetetion inimage

sequenes", Comput. VisionGraph. ImageProess., Vol. 26, 1984,pp.73-106.

[11℄ J. Konrad, E. Dubois, "Multigrid Estimation of Image Motion Fields using Stohasti Relax-

ation",Pro.2 nd

IEEE Int.Conf.on ComputerVision,TarponSprings,Florida,Deember1988,

pp.354-362.

[12℄ M. Lievin, F. Luthon, "Lip features automati ex-

tration", Pro.5 th

IEEE Int. Conf.onImage Proessing, Chiago, Illinois, Otober 1998, Vol.

3, pp.168-172.

[13℄ A.Mitihe, P. Bouthemy, "Computation and analysis of Image Motion: A Synopsis of Current

Problems and Methods", International Journalof ComputerVision, Vol. 19, No. 1, 1996, pp.

29-55.

Spatiotemporal MRF approach to video segmentation: Application to motion detection and lip segmentation

HAL Id: hal-00785993

https://hal.archives-ouvertes.fr/hal-00785993

Submitted on 7 Feb 2013

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Spatiotemporal MRF approach to video segmentation:

Application to motion detection and lip segmentation

Franck Luthon, Alice Caplier, Marc Liévin

To cite this version:

Franck Luthon, Alice Caplier, Marc Liévin. Spatiotemporal MRF approach to video segmentation:

Application to motion detection and lip segmentation. Signal Processing, Elsevier, 1999, 76 (1),

pp.61-80. �hal-00785993�

binarization binarization ICM Relaxation minimum of energy

o o

I I

I

- + | | - + | |

l l

buffer l

l n

s

n n

n n n n n n

t-1 t

: current pixel s : a neighbour n a clique c = (s,n)

: n

t+1 a)

b)

t-1 t t+1

t t+1

t-1

t t

0 t+1 0

θ

parameters α β s β

p β f , , ,

N images

x y t t

N y

N x

t

t-1 t+1

s

n n n

n n n

n n n

n n n

n n

n n n

n n n

n n n

n n n

: a neighbour n current pixel s

: : spatial clique

: temporal clique

: spatiotemporal clique

x

y

t δ x

δ y = 1

= 1

Spatial cliques (horizontal, vertical and diagonal )

Temporal cliques x

y

t δ t =1

x

y

t δ x = δ t =1

δ x = δ y = δ t =1

Spatiotemporal cliques δ y = δ t =1

s s s

δ x = δ y =1

section 1

section 2

section 3

t-2 t-1

t-3 t t+1 t+2 t+3

frames not processed (temporal boundary effects) :

: processed frames

N _y

N _x

δ _y = 1

t δ _x = δ _t =1

δ _x = δ _y = δ _t =1

Spatiotemporal cliques δ _y = δ _t =1

δ _x = δ _y =1

I (x,y,t) ⁰ I (x,y,t) ¹ I (x,y,t) ^k

Spatiotemporal pyramid of observations O ⁰

O ¹ O ²

I0 + ^I0 ⁺ ∆Ι ^I0 ⁺ ∆Ι