HAL Id: hal-00785993
https://hal.archives-ouvertes.fr/hal-00785993
Submitted on 7 Feb 2013
HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.
Spatiotemporal MRF approach to video segmentation:
Application to motion detection and lip segmentation
Franck Luthon, Alice Caplier, Marc Liévin
To cite this version:
Franck Luthon, Alice Caplier, Marc Liévin. Spatiotemporal MRF approach to video segmentation:
Application to motion detection and lip segmentation. Signal Processing, Elsevier, 1999, 76 (1),
pp.61-80. �hal-00785993�
Appliation to Motion Detetion and Lip Segmentation.
F. Luthon,A. Caplier, M.Lievin
Laboratoire des Images et des Signaux
Institut NationalPolytehnique de Grenoble
LIS, INPG, 46 avenue Felix-Viallet
38031 Grenoble Cedex, Frane
Tel: +33(0)4 76 57 43 72 Fax: +33 (0)4 76 57 47 90
Email: Frank.Luthoninpg.fr
Signal Proessing,76(1):61-80, July 1999
Abstrat
In this paper, a spatiotemporal strategy for image sequene analysis is proposed: a video
sequeneisproessedasa3-Ddatabathinsteadofaseriesof2-Dimages.
Applying this approah to motion detetion, a 3-DMarkovian model assoiatedwith aspa-
tiotemporal relaxationis dened. Using a3-Dneighbourhood of pixelsfor modelling spatiotem-
poralinterations, robustresultsareobtainedfordeteting movingobjetsinnoisysequenesor
intheaseofoverlappingmotion.
In order to improve the performane to detet poorly-texturedobjets orveryslow motion,
thealgorithmisintegratedinaspatiotemporalmultiresolutionsheme. Thedatapyramidisbuilt
byusing3-Dlow-passlteringand3-Dsubsampling. Robustresultsforsynthetiand real-world
outdoorimagesequenesarereported.
Thisapproahisalsoappliedsuessfullytospeaker'slipsegmentationinimagesequenes,for
audiovisualteleommuniation.
Key words: motiondetetion,imagesequenes,MarkovRandomField(MRF),spatiotempo-
ralapproah,multiresolution,lipsegmentation.
Resume
Cet artile presente une approhe spatio-temporelle pour l'analyse de sequenes d'images 1
:
une sequene est traitee omme un ot de donnees a trois dimensions au lieu d'une suession
d'imagesadeuxdimensions.
L'utilisation de etteapprohepour la detetion demouvement onduit ala denition d'un
modelemarkovien3-Dassoieaune relaxationspatio-temporelle. Gr^aeaunemodelisationne
desinterations spatio-temporellesentre lespixelsd'un voisinageubique, desresultatsrobustes
sont obtenus pour la detetion d'objets mobiles dans une sene tres bruitee et d'objets dont le
mouvements'eetue avereouvrementd'uneimagealasuivante.
Dans lebut d'ameliorerl'aptitude de l'algorithme adeteter des objets tres peu textureset
desobjets demouvementtreslent, ondenitun adrede multiresolutionspatio-temporelle. La
pyramide de donnees est onstruite par une suession de ltrages et de sous-ehantillonnages
appliquesdanshaunedestroisdimensions. L'inter^etdelamultiresolutionspatio-temporelleest
mis en evidene par diversresultats de detetion de mouvement sur des senes synthetiques et
reelles.
Une autre appliation de ette approhe porte sur la segmentation des levres d'un louteur,
dansunontextedeteleommuniationsaudio-visuelles.
1
ApaperinFrenhisalsoavailable[5 ℄.
Motion detetion and region-based segmentation areimportant issuesinimage sequene analysis or
oding,withappliationsinvideo-surveillaneand video-ommuniation.
Althoughthree dimensions(x;y;t)arerequiredto desribeanimagesequene,mostofthemeth-
odsdealing with sequene analysis are time sequential (eah image is proessed in turn), and work
on a pair of onseutive images. This might induelimitations e.g. for deteting subpixel motion 2
.
A ommon way to integrate motion information over a larger temporal domain is to use reursive
temporal lteringsuh asKalmanltering.
Inthispaper, another strategy isproposed. The point isto onsidera videosequene not asan
imageseries, butasa3-D databath, takinginto aount spatialand temporal dimensionswithina
single proess. This approahis oherent with thefat that a movingobjet overs a volume inthe
(x;y;t) spae.
Thesope ofthepaperistwofold: to giveaninsightintotheprosand onsofthespatiotemporal
approah,togetherwithfousingonpratialappliations. Theperformaneofthisapproahisindeed
illustratedwith twoappliations: robustmotiondetetion and lipsegmentationin videosequenes.
Asfor robust motion detetion, a 3-D non-separable Markov Random Field (MRF) based algo-
rithmis dened. Thismethodyieldsbetter resultsthantheseparableversion ofthe samealgorithm
in the ase of noisy sequenes oroverlapping motion 3
. The same observations (temporal variations
of the intensity funtion) as in the separable ase are retained, the enhaned performane of the
3-D algorithm oming from theimprovement of the MRFmodel whihis better at takingtemporal
onstraintsinto aount.
Todetet subpixelmotionand uniformmovingobjets 4
,alargerspatiotemporaldomainmustbe
taken into aount. This isdone byomputingobservationson a spatiotemporalpyramid.
In setion 2, a separable motion detetion algorithm is presented. The algorithm is inspiredby
thework ofBouthemyet al. [3 ℄. Therearetwomajordierenesbetweenthealgorithmdesribedin
[3 ℄ andtheone presentedhere. The rstdierene isthewaytemporalinformationisdealt with. In
[3 ℄,the proessingof eah imageis doneintwo steps("two-pass algorithm"). An initialdetetion of
movingareasattimetisderivedwhenonsideringimagesI(t 1)andI(t). Thisdetetionisupdated
when onsidering images I(t) and I(t+1). Two suessive label elds are always simultaneously
onsidered(optimization intwo passes),and the deisionaboutunoveredareas is postponedto the
nextproessingpass. Insetion2,weproposea"one-passalgorithm": asinglelabeleld(theurrent
one) is optimized at eah time (and only one). It makes implementation easier, for an equivalent
qualityofresults. Thisismadepossiblethanksto anotherwayofdoinginitialisation: weuseaoarse
estimate of the futurelabeleld, instead of repeting the past asis donein [3℄. Unovered areas are
handledbygivingmore weight to thefuture thanto thepast (anisotropyintemporal interations).
The seond dierene onerns omputational omplexity: we use four model parameters (se-
tion2.4), insteadofve in[3 ℄,sinethefuntionexpressingthelinkbetweenobservationsandlabels
is simpler inour ase. The deision abouta temporal lique(past orfuture) requires onlyone on-
ditional test to hoose among two ongurations, while eight dierent ongurations are tested in
Bouthemy's algorithm (Table 1 in[3℄). The number of 2-D eldsrequired for the relaxation is ve
inour ase (Fig. 1-b), instead of sixforBouthemy'salgorithm (Fig. 2 in [3 ℄). Hene,the amount of
memory requiredfor data storage is minor inour ase. Therefore, thetwo-step algorithm proposed
in[3℄ isless adequateforreal-timeimplementation(i.e. proessingat videorate).
Sine real-time proessing is of major onern for pratial video appliations, the paper ad-
dresseson several oasions theissuesof omputation ost and hardware implementation,either on
generalpurposeprogrammabledevies(digitalsignalproessors(DSPs)orvideoproessors),parallel
mahines (SIMDorMIMD)ordediated iruits(ASICs,VLSIellularanalog networks).
2
Subpixelmotionmeansdisplaementsoflessthanonepixelbetweentwoimages(i.e.slowmotion).
3
Overlappingmotionmeansthattheintersetionofthemasksofamovingobjetattimest 1andtisnotempty.
4
Uniformmovingobjetsmeansmovingobjetsthatarepoorly-textured,i.e. haveuniformintensity.
in the sense that spae and time have distint roles in the proessing (hereafter, the algorithm is
referred to asthe"separablealgorithm").
Its3-Dnonseparableounterpartisdesribedinsetion3. Aomparisonbetweentheperformane
of both algorithms is made. In setion 4, it is shown how the integration of the 3-D algorithm
in a spatiotemporal multiresolution framework allows subpixel motion and poorly-textured moving
objets to be deteted. In setion 5, anotherappliation of thisapproah is presented, forspeaker's
lipsegmentation ina ontext ofaudiovisualteleommuniation. A disussionin setion6 onludes
thepaper.
2 Separable MRF Model
MRFmodellingiswidely usedformotionanalysis,eitherfordetetion,estimation, orsegmentation.
For a state of the art about image motion analysis and an extensive bibliography, the reader may
refer to [13 ℄.
2.1 Observations and Labels
The purpose of motion detetion is to loalize moving and stati areas in a dynamisene. It is a
binary labelling problem that onsists in attributing to eah pixel or site s = (x;y) of image S at
time t one of the two labels: l
s
= a if s belongs to a moving area, l
s
= b if s belongs to the stati
bakground.
Withtheassumptionsofquasi-onstant illumination(verysmalllightingvariationsbetweent 1
and t) and stati amera, motion informationis loselyrelated to temporal hanges of theintensity
funtionI
s
(t). Therefore, observations are given by:
o
s
=jI
s (t) I
s
(t 1)j: (1)
The following notation is used: l = fl
s
;s 2 Sg and o = fo
s
;s 2 Sg represent one partiular
realisationat time tofthe labeland observation eldsL andO,respetively 5
.
Given a realisation o of eld O, the aim is to nd the most probable onguration l of eld
L. This is done by using the Maximum A Posteriori riterion (MAP). From Bayes theorem and
the equivalene between MRF and Gibbs distribution, it is known that the maximisation of the a
posteriori probabilityisequivalent to theminimisation ofan energyfuntion[9 ℄:
max
l
P(L =ljO=o) ()min
l
U(l;o): (2)
2.2 Energy Funtions
The energyfuntionislassiallythesum oftwo terms(orresponding to priorknowledge anddata-
link,respetively):
U(l;o)=U
m (l)+U
a
(o;l): (3)
The modelenergy U
m
(l) is aregularisationterm. Itputs apriorionstraints(spatiotemporal homo-
geneity) onthe masksof movingobjets, erasingisolatedpointsdueto noise. Itsexpressionisgiven
by:
U
m (l)=
X
2C V
(l
s
;l
n
) (4)
where denotes anyof thebinary liques denedin theneighbourhood ofFig. 1-a. A binarylique
5
Everytimetheinstantonsideredisdierentfromtheurrenttimet,atemporalindexwillbeaddedinthenotation.
binarization binarization ICM Relaxation minimum of energy
o o
I I
I
- + | | - + | |
l l
buffer l
l n
s
n n
n n n n n n
t-1 t
: current pixel s : a neighbour n a clique c = (s,n)
: n
t+1 a)
b)
t-1 t t+1
t t+1
t-1
t t
0 t+1 0
θ
parameters α β s β
p β f , , ,
Figure 1: a) Neighbourhoodand binaryliques. b) Separablealgorithm blokdiagram (l 0
denotesa
oarseestimate orinitialisationof labeleldL).
=(s;n) isanypair of distintsites intheneighbourhood,inludingtheurrentpixel sand anyof
theneighboursn. C istheset of all liques. V
(l
s
;l
n
) isan elementarypotentialfuntionassoiated
witheah lique=(s;n). In orderto puthomogeneityonstraintsintothe model,itis denedas:
V
(l
s
;l
n )=
(
ifl
s
=l
n
+ ifl
s 6=l
n
(5)
where the positive parameter depends on the nature of the lique: a parameter
s
is dened for
spatialliques,aparameter
p
forpasttemporalliqueandaparameter
f
forfuturetemporallique.
Thelink betweenlabels and observationsis expressedbytherelationship: o
s
= (l
s )+g
s where
g isa Gaussianunorrelated entered noisewithvariane 2
and:
(l
s )=
(
0 ifl
s
=b
>0 otherwise.
(6)
modelstheobservations: ifapixelisstati,notemporalhangeoursintheintensityfuntionand
theobservation shouldbe zero; ifa pixelis mobile,a hangeoursand theobservation issupposed
to take apositive value loseto , whihrepresents theaverage value ofnon-zero observations.
Thelink-to-data energyU
a
(o;l) (attahment energy) isderivedfrom theabove funtion:
U
a
(o;l)= 1
2 2
X
s2S [o
s (l
s )℄
2
(7)
wherethe observationvariane 2
is evaluatedon-lineforeah image.
2.3 Spatial Deterministi Relaxation
Fig. 1-b shows the blok diagram of the separable algorithm. The algorithm works on three on-
seutive frames. Suppose thepast labeleld l
t 1
hasbeen determined asthe result of the previous
optimization. The urrent label eld is initialised with a binary map l 0
t
derived from observation
eld o
t
, and a oarse estimate l 0
t+1
of thefuture label eld is also derived from binarisation of eld
o
t+1
. The binarymaps are obtainedwith the likelihood method proposed in[10 ℄, but ouldalso be
omputed withasimplethresholdingmethod,foromputation savingspurpose.
Tondtheminimumoftheenergyfuntion,thedeterministirelaxationalgorithmICM(Iterated
Conditional Modes) is used [2 ℄. For eah pixel s of the urrent image, the two labels a and b are
tested and the label whih indues the minimum loal energy in the neighbourhood is kept. The
proess iterates over the image until onvergene, one iteration orresponding to the sanning in x
and y dimensionsof the imageat time t. The stopping riterionfor onvergeneof therelaxation is
next imageof thesequene isproessed.
Notethat, sinethe algorithm workswith three frames, labelelds areobtained with adelayof
one frame.
2.4 Parameter Setting
Theseparablealgorithmdependsonveparameters: fourparametersforMRFmodelling(
s
;
f
;
p
;),
plusone threshold parameter for binarisationof observations. From various experiments bothon
real-world and syntheti image sequenes, the model parameters are xed to the following values:
s
=20,
p
=10,
f
=30, =10. This manuallearningphase for parametertuning wasbased on
empirialobservations: ontextual homogeneityofdetetedmasks,goodagreement betweenontours
of masks and atual moving objets, and insensitivity to aquisition noise. Unsupervised estima-
tion methods, like Expetation-Maximisation [7 ℄, ould also be used to estimate model parameters
s
;
f
;
p
. But they are prohibitive in terms of omputation ost. Morevover high preision in the
determinationofthesevaluesisnotrequired(robustnessofMRFmethodinsensitivetoaslighthange
of these values). Parameter
s
ontrols spatial homogeneity and may be dereased in ase of very
noisy sequenes. Parameters
f and
p
ontrol temporal homogeneity. More weight is given to the
future by taking
f
>
p
, so that the bakground area whih has beenunovered duringmotion is
fastereliminated. Indeed,insuharegion,thepasttemporalneighbourisa-labelledwhilethefuture
one isb-labelled. Butthegood labelis thestati one(l
s
=b),given bythefutureinformation. Note
thattemporal homogeneityonstraint an be relaxed inaseof fastmotion.
Parameterstandsforsome kindofaverage valueof non-zeroobservations. Thisparametermay
either be omputed on-line for eah image asexplained in [3 ℄, or xed to an arbitrary value before
proessing. From experimental tests, on-line omputation of for eah imagedoes notsigniantly
improvemotion detetion results.
The threshold required for binarisation (omputation of initial binary maps with a method
derived from [10 ℄) is the only parameter whih must be adjusted for eah sequene. Here, it is
determined manually (o-linelearningphase at thebeginningof videoaquisition orbeforerunning
the automati proessing). One ould use likelihood tests suh as desribed in [10 , 1 ℄ to determine
this deision threshold automatially, but at the expense of omputation ost. A too low value of
indues many false detetions. A too high value of erases moving pixels in overlapping motion
areas. Forallsequenesaquiredwiththesame ameraunderthesame lightingonditions,thesame
valueof maybekept (e.g. =32 forall street sequenespresentedinthispaper).
2.5 Computational Complexity
The proessing rate is evaluated in the ase of images of size 128 128. When implemented on
a Spar-10 workstation with C programming, the proessing of an image takes about 1:8s of pu
time ( 0:4s per iteration). This orresponds roughly to N
0 N
x N
y N
i
=2:5 10 7
elementary
operations. N
0
=400 isthenumberofelementaryoperations(additions,multipliations,onditional
tests) involved intheomputation oftheloalenergyassoiatedwitheah pixel (1 multipliation=
10 additions). N
x
= 128 and N
y
= 128 represent the image dimensions and N
i
=4 is the average
numberofiterationsuntilonvergene.
To ahieve real-time proessing, various hardware implementations (on parallel SIMD mahine,
DSP board, or ellular VLSI analog network) have been either developped or simulated [6 , 8 ℄. A
proessing rate of 12 to 25 frames per seond is then ahieved. Another implementationon a Pro-
grammableVideoProessor(PVP)forteleommuniationappliationsisnowunderstudy. ThePVP
is an intensive omputing unit with a parallel SIMD arhiteture (8 sub-proessors onneted to a
sharedmemoryof16Kbytes)andsevenI/Oportsfordataowirulation. Itoersahighomputing
power(2 Gops)witha highI/Orate(4 Gbits/s). Forimages ofsize256256 anda lokfrequeny
PVPsoftwaresimulator.
2.6 Experimental Results
The separable algorithm was tested both on syntheti and real-world image sequenes. A typial
example forvideo-surveillane appliation (traÆ ontrol) is shown inFig. 2. This street sequene,
Figure2: Top)Streetsequenewithamovingpedestrian;Bottom)Masksofthemovingbodydeteted
after relaxation (blak =movinglabel,white =stati label).
aquiredwith a standard video amera, ontains a single pedestrianwalking on the pavement. The
image sequene is notvery noisyand motionof the pedestrianis largeenough betweentwo images,
allowingagooddetetion. Themaskofthemovingbodydetetedintheimageplaneisgivenat four
onseutive instants.
3 3-D Non Separable MRF Model
3.1 Spatiotemporal Relaxation
Althoughtheseparablealgorithmintegratesmotion informationfromthree onseutiveframes, only
the urrent frame is proessed at eah time (Fig. 1-b). The 3-D non separable model for motion
detetion is based on the intuitive idea that, by taking into aount more than three onseutive
framesof thesequene, theanalysis ofmotionmaybe improved. Therefore, thevideosequeneisno
longer onsidered as an image series butas a 3-D data bath. L and O are now 3-D random elds
(orvolumes).
Tondtheminimumoftheenergyfuntion,aspatiotemporalversionofICMisrequired. Thekey
point isthat, at eahiteration,therelaxation runsovertemporal setionsof lengthN
t
(Fig.3). The
sanning isdone notonlyin spatialdimensions(x;y) at a given time t, butin thethree dimensions
(x;y;t)together. It isperformedbak-and-forthspatiallyandtemporally. Oneiterationorresponds
tothesanningofawholetemporalsetion. Allframesofthetemporalsetionareproessedtogether.
After onvergeneof ICM, labels ofall pixelsinludedin thatsetion areavailable.
Allalong the paper, we refer to the 3-D non-separable motion detetion algorithm as "the 3-D
algorithm".
3.2 A Priori Model
The mathematial framework of MRF modelling remains the same. The relationships of setion
2 still hold, sine there is no restrition about the dimensions of elds L and O. However, elds
L and O are now supposed to be spatiotemporal 3-D random elds, bringing about the following
hanges: in Eq. (7), S represents now a temporal setion of N
t
images, instead of a single image.
Theneighbourhoodstrutureassoiatedwith Lisnowaompletespatiotemporalube (Fig.4),and
N images
x y t t
N y
N x
Figure 3: Temporal setionof lengthN
t .
t
t-1 t+1
s
n n n
n n n
n n n
n n n
n n
n n n
n n n
n n n
n n n
: a neighbour n current pixel s
: : spatial clique
: temporal clique
: spatiotemporal clique
Figure 4: 3-Dneighbourhoodand binaryliques.
parameterisaddedintheglobalenergyfuntionforbalaning U
m
(l) andU
a
(o;l) inuenes inthis
extended neighbourhood:
U(l;o)=U
m
(l)+U
a
(o;l): (8)
Wesupposethattheproposedneighbourhoodontainsallthedependeniesofpixels. Thisisthe
simplest 3-D neighbourhood. One ould inrease in spae and time the size of the neighbourhood,
butattheexpenseof omputationost. Inthisspatiotemporalneighbourhood,three kindsof binary
liques are dened: spatial, temporal and spatiotemporal (Fig. 5). They dier aording to their
x
y
t δ x
δ y = 1
= 1
Spatial cliques (horizontal, vertical and diagonal )
Temporal cliques x
y
t δ t =1
x
y
t δ x = δ t =1
δ x = δ y = δ t =1
Spatiotemporal cliques δ y = δ t =1
s s s
δ x = δ y =1
Figure5: Thethree typesof binaryliques.
spatial and temporal extent (along the x;y and t axis, respetively). Let Æ
x , Æ
y
;Æ
t
represent in the
3-Dspae(x;y;t) theoordinatesofvetor
!
(s;n)orrespondingtoa liquewithoriginintheurrent
pixel s (Æ 2 f 1;0;1g). Then we get: eight purely spatial liques (horizontal (Æ
x
= 1, Æ
y
= 0,
Æ
t
= 0), or vertial (Æ
x
=0, Æ
y
=1, Æ
t
=0), ordiagonal (Æ
x
=1, Æ
y
= 1, Æ
t
=0)); two purely
temporalliques(Æ
x
=0,Æ
y
=0,Æ
t
=1);sixteenspatiotemporalliques((Æ
x
=1,Æ
y
=0,Æ
t
=1)
or(Æ
x
=0, Æ
y
=1, Æ
t
=1)or(Æ
x
=1,Æ
y
=1, Æ
t
=1)).
Forthedenitionofliquepotentials inEq. (5),a spatialparameter
s
is usedto ontrolspatial
homogeneity(nodistintionismade betweenxandy) andatemporalparameter
t
forhomogeneity
intemporaldimension. Thisisasimplewaytotakeintoaountthenon-homogeneitybetweenspae
and time. Note that no more distintionis made betweenpast and future, sine the3-D algorithm
will propagate information forward and bakward in time and allow to hange a deision taken in
thepast,espeiallyasregardsunovered areas,thanksto thespatiotemporal natureof iterations(see
omments insetion 3.5).
Alllique potentials are dened with these two parameters, aording to the physial priniple
that interation with the urrent pixel gets weaker when the neighbour is far. Here, interation is
assumed to be inversely proportionalto the squared distane between sites in the ube. Thus, the
atualpotential(s;n)assoiatedwith alique=(s;n) isdened bythefollowingexpression:
(s;n)=
1
d 2
(s;n) h
Æ 2
x (s;n)
s +
Æ 2
y (s;n)
s +
Æ 2
t (s;n)
t i
(9)
whered(s;n)= q
Æ 2
x +Æ
2
y +Æ
2
t
istheEulidiandistanebetweentheurrentpixelsandtheonsidered
neighbourn. This relationshipgives:
(s;n)=
s
forspatialhorizontal orvertialliques(d(s;n)=1);
(s;n)=
s
4
forspatialdiagonal liques(d(s;n)= 2);
(s;n)=
t
fortemporal liques(d(s;n)=1);
(s;n)=
s
t
2(s+t)
forspatiotemporalhorizontalorvertialliques (d(s;n)= p
2 );
(s;n)= st
3(s+2t)
forspatiotemporal diagonalliques(d(s;n)= p
3).
3.3 Parameter Setting
Four model parameters are required:
s
= 20,
t
= 5, = 15 and = 5. These values were
determined experimentally (as in theseparablemodel). We hoose inpratie
s
>
t
to give more
importaneto spatialhomogeneitywhihissupposedtobemorereliablethantemporal homogeneity
(espeiallytrue inthease ofnon-deformableobjets undergoing arbitrary motion).
Parameterontrols theinueneof bothterms of energy. If itisneessary to reinforeapriori
onstraints (beause of bad observations for example), should be dereased. If it is neessary to
reinforethelinkto data, shouldbeinreased.
Thespeiationof neighbourhoodandliquepotentialsentirelydenes theMRFmodel,sothat
atualvaluesofN
t
;N
x orN
y
donotinuenethemodelling. DierentvaluesofN
t
weretested. The
defaultvalue isN
t
=8. Itmaybe dereased whenspatiotemporal homogeneity onstraint isbroken
(fast motion) and it may be inreased for very noisy sequenes. Still, at least N
t
= 5 images per
setion are requiredbeause of temporal boundaryeets (rst and lastimages of a setion are not
proessedbeause of thelak ofpast andfuture neighbours,respetively).
3.4 Computational Complexity
At rst sight, the omputational omplexity of the 3-D algorithm may be a bottlenek. In pra-
tie, handling a video sequene as a 3-D data bath does not drastially inrease the global om-
putation time ompared to a serial proessing image per image. On a Spar-10 workstation with
C-programming, 4s of pu-time per image of size 128128 are neessary to detet motion. The
inreaseof omputation ost omesprimarilyfrom theinreasednumberof iterationsrequireduntil
onvergene (10 iterationson average instead of 4), the stopping riterion remainingthe same asin
setion 2.3. The neighbourhood extension (26 neighbours instead of 10) does not indue a major
extraomputingharge.
Onthe other hand,the delayrequired before obtainingmotion detetion resultsmaybe ruial.
Sine the 3-D algorithm runs on temporal setions of length N
t
, all motion masks of a setion are
available at thesame time, when theproessing of thewhole setion is ompleted. In order to limit
boththe delay and the required memory for software implementation, N
t
should be small(anyway
muh lowerthan theatuallengthof anyvideo sequene).
Therefore, a long sequene should be proessed reursively, by utting it into smaller temporal
setions. Fig. 6 illustratesthe reursive proess withN
t
=5. The 3-D algorithm runs in spae and
time on therst setion of ve images: images t 2, t 1 and t areproessedtogether (setion 1);
thenitruns ontheseond setionof ve images: imagest 1,tand t+1 areproessed, withinitial
label eldsl 0
t 1 ,l
0
t
givenbyresultsof setion1, et...
Every time, one new imageis stored and only5 suessive frames stay inmemory. When image
t+3 is aquired, the nal result for image t may be omputed, orresponding to a delayof 120ms
(3 40ms for sequenes aquired at 25 images per seond), whih might be aeptable in many
appliations(e.g. video-surveillane).
The reursive proess does not inrease omputational omplexity. Of ourse, eah label eld
is estimated in three onseutive temporal setions. For example, l
t
is proessed when estimating
(l
t 2
;l
t 1
;l
t
) (setion 1), (l
t 1
;l
t
;l
t+1
) (setion 2), and (l
t
;l
t+1
;l
t+2
) (setion 3). But as regards the
section 1
section 2
section 3
t-2 t-1
t-3 t t+1 t+2 t+3
frames not processed (temporal boundary effects) :
: processed frames
Figure6: Setion-reursive algorithm (N
t
=5)
two lastestimations(temporal setions2and 3),theinitiallabeleldl 0
t
ismorereliable(loseto the
nalone), sothat onvergeneis faster(feweriterations areneeded).
This reursive version of the algorithm was implemented on a SIMD mahine [6℄. The parallel
mahine isalinearnetwork of256elementaryproessorswith4Kbytesofloalmemoryeah. Itom-
muniates withahost workstation via Ethernetinterfae. AssemblerorC-parallelprogrammingan
beused. Loalomputations are doneinparallel. Data areuniformlydistributedamongproessors.
The proessingrate ahieved is around3 to 4images/seond (imagesofsize128128).
3.5 Experimental Results
Fig. 7 illustratesthe eÆieny of the 3-D algorithm to reover movingobjets in a noisy sequene.
The syntheti sequeneontains twomovingobjets: alearretangle whihtranslates rightward(1
Figure 7: From top to bottom: 1) Syntheti sequene with impulse noise; 2) Initial binary maps
( =20); 3) Masks deteted after spatial relaxation (separable algorithm); 4) Masks deteted after
spatiotemporal relaxation (3-D algorithm,N
t
=8).
pixel/image)andadarksquarewhihtranslatesleftward(1pixel/image). Shownarethebinarymasks
deteted with the separable and the 3-D algorithms, respetively. One an see that spatiotemporal
relaxation isusefulto eliminatebad detetion dueto noise(isolatedpoints).
propagated both in spae and time. The sequene of Fig. 8 ontains two moving areas: a group
of three pedestrians walking on the pavement and a biyle riding leftward on the road. With the
Figure 8: From top to bottom: 1) Street sequene; 2) Masks deteted with separable algorithm; 3)
Masks deteted with 3-D algorithm; 4) Contours of masks obtained with the 3-D algorithm, super-
imposedon theimagesequene.
separablealgorithm,thepedestriansmaskisonlypartiallyreoveredbeauseofalakofinformation
intheoverlappingmotionarea. Theseparablealgorithmimpliesausalproessinganddoesnotallow
tobak-propagatespatiotemporalonstraintsintimeandtohangeadeisiontakeninthepast. The
3-D algorithm, in ontrary, makes it possible to bak-propagate information in time and to fully
reoverthe pedestrians maskfor eah image of the sequene. In thebottom of Fig.8, the preision
ofthemasks intermsofontoursisshown: theuppermaskorrespondsto thegroupofpedestrians,
whilethelower maskorrespondsto thebiyle.
4 Spatiotemporal Multiresolution Framework
Both versions of the algorithm (separable and 3-D) yield poor results in ase of uniform intensity
movingareas orsubpixelmotion. Insuhases, althoughobjets aremoving, temporalvariationsof
theintensityfuntionare almost zero(observationsof poorquality). To solve thisproblem,the3-D
algorithm is runon a spatiotemporal pyramid of data witha oarse-to-ne strategy. Spatial ltering
isa ommonwayto dealwithlargeuniformintensitymovingareas. Temporallteringiseetive in
order to dealwith subpixelmotion.
Multiresolutionmay also improve theinitialisationstep for spatiotemporal relaxation. Indeed it
hasbeenonjeturedthatmultiresolutionanalysissmoothestheenergyfuntion[11 ℄,makingiteasier
to ndtheglobalminimum. ThismayberuialwhenadeterministirelaxationalgorithmlikeICM
is used,sineitmaygetstukin therst enountered loalminimumof theenergy funtioninase
of bad initialisation.
4.1 Spatiotemporal Low-Pass Pyramid
The spatiotemporal struture of the 3-D MRF model suggests to build not only a spatial but a
spatiotemporal pyramid. The basionvolutionkernelisthebinomiallow-passlter 1
4
[121℄whih is
x x x
x x x
x x x
1 2 1
2 4 2
1 2 1
x x x
x x x
x x x
1 2 1
2 4 2
1 2 1
x x x
x x x
x x x
2 4 2
4 8 4
2 4 2
t t-1
t+1
1 64
F F F
I(x,y,t)
I (x,y,t) 0 I (x,y,t) 1 I (x,y,t) k
2 2
: 3-D filtering : 3-D subsampling
a)
b)
F 2
Figure 9: a) Spatiotemporal lter onvolution kernel. b) Pyramid buildingproess (the supersript
k denotestheresolutionlevel).
byBurt's spatialpyramid[4 ℄, this kernel, assoiated with a spatiotemporal subsampling, is used to
builda spatiotemporal pyramid (Fig. 9-b). Note that frames after lteringand before subsampling
arelessorruptedbynoisethanframesafterlteringandsubsampling. Anexampleofspatiotemporal
pyramid with three resolution levels is shown in Fig. 10. Spatiotemporal subsampling redues the
Figure10: Spatiotemporalpyramid: originalsequeneintoprow,andthethreespatiotemporallevels
below(k=0;1;2).
sizeof eah imagebya fator of4 and thelength ofthesequene bya fator of 2at eah resolution
level.
The 3-D algorithm is run at eah level of the spatiotemporal low-pass pyramid. The strategy
is oarse-to-ne: the algorithm starts at the lowest resolution level (k
max
). After spatiotemporal
interpolation(Fig.11), theresultofrelaxation atlevelk isusedto initialiserelaxation atlevelk 1.
Running the algorithm on this pyramid gives a multiresolution label eld. At eah level, the label
eldisoptimised aordingtoobservationsattheorrespondinglevelinthespatiotemporalpyramid
level k-1
level k
Figure 11: Spatiotemporal interpolation
(Fig. 12).
Temporal section S
Observations O
Spatiotemporal pyramid of observations O 0
O 1 O 2
L L L 0
1 2
Spatiotemporal pyramid of labels
Figure12: Spatiotemporal multiresolutionsheme.
The optimal numberof levels for the pyramid (k
max
) depends on the size and speed of moving
objets: todetetveryslowmotion,k
max
shouldbeinreased. Iftheseneontainsverysmallmoving
objets, k
max
shouldbe dereased. Thedefaultvalue isk
max
=2 (threeresolution levels).
4.2 Pyramid of Observations
Thespatiotemporallteringintegratesmotioninformationoveralargerspatialandtemporaldomain,
so that observations at low resolution levels are more relevant in ase of subpixel motion and uni-
form movingareas. Spatial lteringimproves observationsforpoorly-texturedareas, whiletemporal
lteringon manyonseutive framesimprovesobservationsinase of subpixelmotion.
Fig. 13 exhibits the quality of observations, omputed both with mono- and multiresolution
shemes, forthe well-known Trevor sequene. This sequene representsthemotion ofa TV speaker.
Motion between two onseutive frames isvery slow(subpixelmotion) and manyareas of theshirt,
head and hands are poorly-textured. The gure presents the multiresolution observations at three
resolution levels. The darker the pixel, the larger the orresponding observation. Multiresolution
observationsare learlymore onsistant thanmonoresolution observationsinthat ase.
4.3 Parameter Adaptation
Parametersof thealgorithm areadapted alongthe pyramidasexplained below.
First,theevolutionof observationsalongthepyramidhasbeeninvestigatedinorder toadaptpa-
rameterateahresolutionlevel. Ofourse,low-passlteringreduestheamplitudeofobservations.
Letusfousona pixelsat a motiontransition,asshowninFig. 14(vertialedgemovingrightward
at a speedof 1 pixel/frame). After spatiotemporal ltering, theamplitude of observation isdivided
by a fator 8
3
' 3 (obvious omputation with the 3-D onvolution kernel of Fig. 9-a). Therefore
tion observations; 3) Multiresolution observations, 3 resolution levels (k = 0;1;2). All displays are
normalisedinorder to spanoverthefullavailabledynamirange [ 0;255℄.
t-2 t-1 t t+1
I0 ∆Ι
s
0
I0 I0
I0 + I0 + ∆Ι I0 + ∆Ι
s s s
s s
s
0 ∆Ι 0 0 ∆Ι
∆Ι
t-1 t t+1
sequence
observations +
- - + - +
Figure 14: Evolution of observation after spatiotemporal ltering: typial ase of a vertial edge
movingrightward. I
s
(t 1)=I
0 ,I
s (t)=I
0
+IandI
s
(t+1) =I
0
+I. Irepresentstheamplitude
of stepobserved at pixels. Before3-D spatiotemporal ltering: o
s
=jI
s (t) I
s
(t 1)j=jIj. After
spatiotemporal ltering: o
s
= 1
64
(4jIj+16jIj+4jIj)= 3
8 jIj.
sameproportionalongthepyramid,i.e.
k
=
0
=3 k
. Sineobservationsdereasebyafatorofabout
3, theobservation variane 2
dereasesbya fator of about 9,so that theratio inEq. (7) remains
onstant.
Seondly,sinespatialandtemporalinformationareintegratedinthesamewayalongthepyramid,
theparameter ratio
s
=
t
iskept onstantfor allresolution levels.
Thirdly, from a qualitative point of view, spatiotemporal interations should get weaker at low
resolution levels, sine two neighbouring pixels are atually far away in the full-resolution image
sequene. The evolution of potentials (s;n) should be related to the physial distane between
pixelsina squaregrid. This an alsobe statedfrom aquantitative point of view: at eah resolution
level, the physial distane d(s;n) between two pixels is atually doubled beause of subsampling.
This leads to a derease of 4 for lique potentials (s;n) in Eq. (9). For omputational simpliity,
this evolution law is simply implemented by adapting the weight fator
k
as follows:
k
= 4 k
0 .
The globalenergyis then: U(l;o)=U
m (l)+
k U
a (o;l).
Finally,parameterdoesnotneedtobeadaptedalongthepyramid,sinethebinarisationmethod
derivedfrom[10 ℄ (andhene parameter)isonlyusedforlabelinitialisationat thelowest resolution
level k
max
. At ner resolution levels, initialisationis simply performed by interpolating the results
of lower resolution levels, with no need of . But ompared to the monoresolution sheme, must
beinreasedwhen multiresolutionisused(experimentalobservation). Thetheoretialexplanationof
theneessaryinreaseofwith multiresolutionlevelis theinueneof datalow-passlteringonthe
method given in[10 ℄ forsetting thethresholdvalue.
4.4 Computational Complexity
The buildingof the pyramidis not omputationally expensive: sine the 3-D onvolution kernel of
Fig. 9-a is separable, the implementationof the spatiotemporal ltering is equivalent to the imple-
mentation ofthree 1-D binomiallters inx,y and tdimensions, respetively.
The relaxation at low resolution levels is quik dueto the smaller number of sites and therefore
Markovianonstraintsarepropagatedfaster. Comparedwiththefull-resolutionlevelk=0,thedata
ow to be proessedat levelk =1is redued by a fator of 8 (N
x
;N
y
;N
t
derease eah bya fator
of 2). Thus, one iteration at resolution levelk is equivalent (in terms of omputationost) to 1=2 3k
iterationsat thenest resolutionlevel (k=0).
Then,athigher(ner)resolutionlevels,feweriterationsareneededomparedtoamonoresolution
sheme,beauseofabetter initialisationpropagatedfromlowerresolutionlevels. So,multiresolution
usuallyreduesthe overall numberof iterations.
Theomputationtimehasbeenreordedexperimentallyformanysequenes. Thesame stopping
riterion as in setion 2.3 wasused. In fat, the multiresolution spatiotemporal algorithm does not
drastiallyspeeduptheproessingrate. Sothemaininterestofthemultiresolutionframeworkhereis
theimprovedperformanefordeteting subpixelmotionand poorly-texturedmovingareas asshown
innext setion,butnotomputation savings.
4.5 Experimental Results
Fig. 15 presents the masks deteted in a ase of subpixel motion with both versions (mono- and
multiresolution) of the 3-D algorithm. The syntheti sene ontains three mobile objets: a lear
retangle movingrightward (1 pixel/frame), a dark square moving leftward(1 pixel/frame) and an-
other square on the left moving slowly upwards (0.35 pixel/frame). With the 3-D monoresolution
algorithm, the slowest square is badly deteted. With the multiresolution version of the algorithm,
thissquareis welldeteted, startingfrom theseond resolutionlevel.
Fig. 16 presents the masks deteted for the Trevor sequene with both versions (mono- and
multiresolution) of the 3-D algorithm. Monoresolution masks are very fragmented, sine speaker's
subpixelmotion;2) Monoresolutionmasks; 3) Multiresolutionmasks (2 levels: k =0;1).
motion isvery slowand thesene ontainsmanypoorly-texturedareas (hands, shirt,head). Onthe
ontrary, multiresolution masks arespatially and temporallyhomogeneous. The whole bodyis fully
deteted, startingfrom thethirdresolutionlevel (k=2).
5 Lip Segmentation
Theproposedapproahwasalsoappliedtolipsegmentationinolorimagesequenes,foraudiovisual
ommuniation between two speakers. Fig. 17 shows the ontext of appliation for a high quality
and lowbitratevideophone. It an alsobeusedforman-mahineommuniation(automatispeeh
reognition)orvideoonferening.
The speaker wears a light helmet equipped with a miro-amera and a mirophone, so that the
amerais xedwith respet tothe head. The segmentationis basedon theassumptionthat lipsare
areas inthefaewereredhue and motionpredominate.
Themainstepsoftheproessingareasfollows(detailsmaybefoundin[12 ℄). First,aolorvideo
sequene of speaker's fae is aquired under natural lighting onditions and without anypartiular
make-up. A logarithmi olor transform is performed from RGB (red, green, blue) to HIS (hue,
intensity, saturation) olor spae, in order to gain independene from illumination brightness and
noise.
Then,two observationsarederived. Therst observationisomputedfromthehuevalueateah
pixel: itgives informationaboutareas wherered hue ismost prominent. The seond observation is
thesameasinEq.(1): framedierenesbetweentwoonseutiveimages. Itgivesinformationabout
motion areas.
Fromthese twothresholdedobservations, fourinitiallabels(a
0 ,a
1 ,b
0 ,b
1
) arederived,foroding
four pixel lasses: pixels with (
1
), respetively without (
0
) motion, belonging (a), respetively not
belonging(b), to redhueareas.
Thespatiotemporal MRFapproah isthenusedforregularizingthesolution. Somehangeswere
introdued in the model presented in setion 3.2, in order to take into aount better the a priori
knowledge availableforthisspei appliation(lip shapeand motion). Namely, thespatiotemporal
potentialfuntion(s;n)isnowinverselyproportionaltotheEulidiandistane(andnotthesquared
Multiresolutionmasks (3 levels : k=0;1;2).
Figure 17: Context of audiovisualommuniation: from the image sequene of speaker's fae, geo-
metrial features of lips are extrated and provide modelling parameters for talking fae synthesis
and animation.
s and
t
assalingfators. Butforthisappliation,weforesomespatialanisotropy:
x
=2:
y
=
s
in order to putemphasize on horizontal ongurations (geometrial onstraints on lipshape). This
yields:
(s;n)=
1
r
Æ
x
x
2
+
Æ
y
y
2
+
Æ
t
t
2
=
s
t
r
2
t
Æ 2
x +4Æ
2
y
+ 2
s Æ
2
t
: (10)
Moreover, inontraryto setion3.2,parameters
s and
t
arenotonstant,butdependonthelabels
taken bysitessandn. Theyaredenedto onstrainthemodelto,respetively,spatialhomogeneity
of labels, and temporal homogeneity of hue when no motion is deteted. For example,
s (l
s
;l
n ) is
proportionalto: jr(s) r(n)j+jm(s) m(n)j, wherer(s)and m(s)arebinarydigits (0or1) oding
thepresene at pixel s ofred hue and motion,respetively. Forthe denitionof
t (l
s
;l
n
),see Table
3 in[12 ℄.
Withthismodelling,one obtains robustlabel eldsafter relaxation, exhibitingareas inthe fae
wherered hue and motionare predominant (Fig.18).
Figure 18: From top to bottom: 1) Sequene of luminane images: male fae without make-up; 2)
Initiallabelelds; 3)Finallabeleldsafterrelaxation: thefourlabelsareshowningraylevels(from
white to blak: b
1 ,a
1 ,b
0 ,a
0
); 4) Sequeneof lipmasks(ombination ofa
0 and a
1 ).
From thenal label eld,a region ofinterest isdetermined automatially (mouthboundingbox
in Fig. 19). Measurements of geometrial features are performed on lip masks (height and width,
Figure19: Top)Sequeneofluminaneimages: femalefaewithsoftredmake-up; Bottom)Sequene
of lipmasks withboundingboxsuperimposedon theluminane.
surfae), and usedforfaesynthesis at thereeiver'send.
Theproposedmethod forlipsegmentationsolves two ruialproblemsthatusually ariseinsuh
a ontext: indeed, theproessing gainsindependene bothfrom lighting onditionsand make-up of
lips. Thisisduebothto theuseof thelogarithmiolortransform,andtotherobustspatiotemporal
lipareas.
Aparallel implementationof thisalgorithmon a Programmable Video Proessoris under study.
The ahievable proessing rateis estimatedto be 13 images/sforimages ofsize256256.
6 Disussion
A spatiotemporal strategy for image sequene analysis was presented, and applied suessfully to
motiondetetionand lipsegmentationinaMarkovianframework. Itprimarilyonsistsinproessing
a videosequene asa3-D databath.
Withsuh an approah, improved performaneis reported formotion detetion inase of noisy
sequenesand inase of overlappingmotion.
A3-Dspatiotemporal multiresolutionsheme oherent withthe3-D MRFisalso proposed. This
multiresolutionapproahiseÆienttohandletwodiÆultases: subpixelmotionandpoorly-textured
movingareas. Butinaseofveryfastmotion,themultiresolutionalgorithmyieldsworseresultsthan
the monoresolution version. This is due to the fat that temporal ltering indues an averaging of
motion information over many images, so that it is no longer possible to preisely detet motion
boundaries. Asaresult,motionmasks arebiggerthanatualmovingobjets. Spatialmultiresolution
withouttemporalmultiresolutionwouldbebeneialinthatase,sineitallowstospatiallylinearize
intensity without temporal blurring. Mono- and multiresolution algorithms being omplementary,
it would be interesting to develop a strategy for swithing automatially between both versions of
the algorithm aording to the analysed sequene. Moreover, the multiresolution pyramid involves
3-D low-pass ltering. In order to limitthe blurringeet, the use of 3-D wavelets (3-D orthogonal
high-passand low-passlterbanks)ould beonsidered.
Theseondappliationreportedhereonernsspeaker'slipsegmentationinaolorvideosequene.
Theinterestofthespatiotemporalmethod,togetherwitha logarithmiolortransform,issupported
bythegoodqualityof resultsobtainedinthishallengingsituation(naturalimages ofspeaker'sfae
withoutanypartiularmake-up orlighting).
Thespatiotemporalapproahhasalsobeenusedtoomputespatiotemporalgradientswithspline
funtions(resultsnotreported here). Theimplementationinvolves3-D reursive lterings. Thus,we
do believe itould also be appliedwith suessto optial owestimation.
Referenes
[1℄ T. Aah, A. Kaup, R. Mester, "Statistial model-based hange detetion in moving video",
Signal Proessing,Vol. 31,No.2, Marh 1993,pp.165-180.
[2℄ J. Besag, "On the Statistial Analysis of Dirty Pitures", Journal ofRoyalStatistial Soiety,
Vol. B-48,No. 3,1986, pp.259-302.
[3℄ P. Bouthemy, P. Lalande, "Reovery of moving objet masks in an image sequene using loal
spatiotemporalontextualinformation",OptialEngineering,Vol.32,No.6,June1993,pp.1205-
1212.
[4℄ P.J. Burt, E.H. Adelson,"The Laplaian PyramidasCompat Image Code",
IEEE Trans.on Communiations,Vol. 31,No. 4,1984,pp. 532-540.
[5℄ A.Caplier,F.Luthon,"Approhe spatio-temporellepourl'analysedesequenesd'images. Appli-
ation endetetion de mouvement",Traitement du Signal,Vol. 14,No. 2,1997, pp.195-208.
[6℄ A. Caplier, F. Luthon, C. Dumontier, "Real-Time Implementations of an MRF-based Motion
Detetion Algorithm",Real-TimeImaging, Vol. 4,No.1, February1998, pp.41-54.
algorithm", Journalof theRoyalStatistialSoiety, SeriesB, Vol.39, 1977, pp.1-38.
[8℄ C.Dumontier,F.Luthon,J.P.Charras,"Real-TimeImplementationofanMRF-basedMotionDe-
tetion Algorithm on a DSP Board", Pro. 7 th
IEEE DigitalSignalProessing Workshop , Loen,
Norway, September1996,pp. 183-186.
[9℄ S.Geman,D.Geman,"StohastiRelaxation,GibbsDistributions,andtheBayesianRestoration
of Images", IEEE Trans.PatternAnal.and Mahine Intel., Vol. 6, No. 6, November 1984, pp.
721-741.
[10℄ Y.Z.Hsu, H.H.Nagel,G.Rekers,"New likelihoodtestmethodsforhangedetetion inimage
sequenes", Comput. VisionGraph. ImageProess., Vol. 26, 1984,pp.73-106.
[11℄ J. Konrad, E. Dubois, "Multigrid Estimation of Image Motion Fields using Stohasti Relax-
ation",Pro.2 nd
IEEE Int.Conf.on ComputerVision,TarponSprings,Florida,Deember1988,
pp.354-362.
[12℄ M. Lievin, F. Luthon, "Lip features automati ex-
tration", Pro.5 th
IEEE Int. Conf.onImage Proessing, Chiago, Illinois, Otober 1998, Vol.
3, pp.168-172.
[13℄ A.Mitihe, P. Bouthemy, "Computation and analysis of Image Motion: A Synopsis of Current
Problems and Methods", International Journalof ComputerVision, Vol. 19, No. 1, 1996, pp.
29-55.