HAL Id: inria-00590110
https://hal.inria.fr/inria-00590110
Submitted on 3 May 2011
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Scene Segmentation and Image Feature Extraction for Video Indexing and Retrieval
Patrick Bouthemy, Christophe Garcia, Rémi Ronfard, Georges Tziritas, Emmanuel Veneau, Didier Zugaj
To cite this version:
Patrick Bouthemy, Christophe Garcia, Rémi Ronfard, Georges Tziritas, Emmanuel Veneau, et al. Scene Segmentation and Image Feature Extraction for Video Indexing and Retrieval. 3rd International Conference on Visual Information Systems (VISUAL'99), Jun 1999, Amsterdam, Netherlands. pp. 245–253, 10.1007/3-540-48762-X_31. inria-00590110
Scene Segmentation and Image Feature Extraction in the DiVAN Video Indexing and Retrieval Architecture

Patrick Bouthemy, Christophe Garcia†, Rémi Ronfard‡, Georges Tziritas†, Emmanuel Veneau‡, Didier Zugaj

INRIA–IRISA, Campus Universitaire de Beaulieu, 35042 Rennes Cedex, France
† ICS–FORTH, P.O. Box 1385, GR 71110 Heraklion, Crete, Greece
‡ INA, 4 avenue de l'Europe, 94366 Bry-sur-Marne, France
Email: {cgarcia,tziritas}@csi.forth.gr, {rronfard,eveneau}@ina.fr, {bouthemy,dzugaj}@irisa.fr
Abstract. We present a video analysis and indexing engine that can perform fully automatic scene segmentation and feature extraction in the context of a television archive. A template-based indexing engine is proposed, which interacts with a general-purpose video analysis server (providing spatio-temporal segmentation, feature extraction, feature clustering and feature matching services) and a template server (which provides tool settings, scripts and interpretation rules for specific program collections).
1 Introduction

The ESPRIT project DiVAN aims at building a distributed architecture for the management of, and access to, large television archives. An important part of the system is devoted to automatic segmentation of the video content, in order to assist the indexing and retrieval of programs at several levels of detail, from individual frames to scenes and stories, using a combination of annotated semantic information and automatically extracted content-based signatures.

In Section 2, the template-based architecture of the system is described. The tools developed in this architecture are presented in Section 3. In Section 4, we briefly sketch how those tools can be integrated into complete analysis graphs, driven by the internal structure of television programs.
2 Template-based indexing and retrieval

In DiVAN, a video indexing session is distributed between several machines, using connections to two remote CORBA servers: a video analysis server and a template library server (Figure 1). Each program template contains the taxonomy of event categories (such as shots, scenes and sequences) meaningful to a collection of programs (such as news, drama, etc.), along with their temporal and compositional structure, which is used to recover the structure of each individual program, based on observable events from the tools server (such as dissolves, wipes, camera motion types, extracted faces and logos). Section 3 reviews the tools available in the DiVAN system, and Section 4 presents examples of template-based indexing using those tools.

Figure 1. The indexing session is driven by scripts stored in the template library, and launches tools from the tools library. Both libraries are accessed via the CORBA ORB.
3 Shot segmentation and feature extraction

The first step in the analysis of a video is its partitioning into elementary subparts, called shots, according to significant scene changes (cuts or progressive transitions). Features such as keyframes, dominant colors, faces and logos can also be extracted to represent each shot.
3.1 Cut detection

Our algorithm involves the frames that are included in the interval between two successive I-frames. It is assumed that this interval is short enough to include only one cut point. Given this assumption, the algorithm proceeds as follows.

Step 1: Perform cut detection considering the next I-frame pair, using one of the histogram metrics and a predefined cut detection threshold.
Step 2: Confirm the detected cut in the next subinterval between two consecutive I- or P-frames included in the interval of I-frames.
Step 3: Perform cut detection between two successive I- or P-frames using histogram or MPEG macro-block motion information. If the cut is not confirmed in this subinterval, go to Step 2.
Step 4: Determine the exact cut point in the subinterval, by applying a histogram- or MPEG-motion-based cut detection method on each frame pair of the interval. If the cut is not confirmed, go to Step 2; otherwise, go to Step 1.
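For illustration, the coarse-to-fine search over anchor frames can be sketched as follows. This is a minimal sketch, not the DiVAN implementation: it assumes precomputed per-frame histograms, binary splitting of each interval (in practice the subintervals follow the actual I/P layout of the MPEG stream), and a hypothetical L1 histogram metric with a single threshold.

```python
def hist_dist(h1, h2):
    """L1 distance between two normalized histograms (illustrative metric)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def refine_cut(hists, lo, hi, threshold):
    """Steps 2-4: narrow a detected cut down to one frame pair.

    `hists[lo]` and `hists[hi]` differ by more than `threshold`; the span is
    split at intermediate anchor frames until a single frame pair remains.
    """
    if hi - lo == 1:
        return (lo, hi)                     # exact cut point found
    mid = (lo + hi) // 2                    # next anchor (I- or P-frame)
    if hist_dist(hists[lo], hists[mid]) > threshold:
        return refine_cut(hists, lo, mid, threshold)
    if hist_dist(hists[mid], hists[hi]) > threshold:
        return refine_cut(hists, mid, hi, threshold)
    return None                             # cut not confirmed in subinterval

def detect_cuts(hists, i_frames, threshold=0.5):
    """Step 1: test successive I-frame pairs; refine each confirmed cut."""
    cuts = []
    for a, b in zip(i_frames, i_frames[1:]):
        if hist_dist(hists[a], hists[b]) > threshold:
            cut = refine_cut(hists, a, b, threshold)
            if cut is not None:
                cuts.append(cut)
    return cuts
```

Testing only I-frame pairs first keeps the per-frame cost low; the finer subintervals are examined only where a coarse cut candidate exists.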
Key-frame extraction is performed during shot detection, and is based upon a simple sequential clustering algorithm using a single cluster C of candidate key-frames, which is successively updated by including the I-frame that follows its last element in display order. C is a set of K consecutive I-frames C_fk (k = 1..K), and C_mf is an "ideal" frame with the property that its features match the mean vector ("mean frame") of the corresponding characteristics of the frames C_fk. Assuming that the only feature used for classification is the colour/intensity histogram H_fk of frame C_fk, the "mean frame" histogram H_mf is computed. Then, in order to measure the distance δ(C_fk, C_mf), an appropriate histogram metric is used, and the key-frame C_KF (the most representative frame of C) is determined according to a threshold.
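This sequential clustering can be sketched as follows, under stated assumptions: one colour histogram per I-frame, an illustrative L1 distance δ, and the cluster member closest to the "mean frame" taken as the key-frame (the exact metric and threshold are tool parameters in DiVAN).

```python
def mean_frame(hists):
    """Histogram of the 'mean frame' C_mf: bin-wise mean over the cluster."""
    k = len(hists)
    return [sum(h[b] for h in hists) / k for b in range(len(hists[0]))]

def delta(h1, h2):
    """Illustrative L1 histogram distance."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def extract_keyframes(i_frame_hists, threshold):
    """Grow a single cluster C sequentially over I-frames in display order;
    emit a key-frame whenever the next I-frame no longer matches C_mf."""
    keyframes, cluster, start = [], [i_frame_hists[0]], 0
    for i, h in enumerate(i_frame_hists[1:], start=1):
        if delta(h, mean_frame(cluster)) <= threshold:
            cluster.append(h)               # still matches the mean frame
        else:
            mf = mean_frame(cluster)        # close C: pick C_KF, the member
            best = min(range(len(cluster)), # closest to the mean frame
                       key=lambda j: delta(cluster[j], mf))
            keyframes.append(start + best)
            cluster, start = [h], i         # start a new cluster
    mf = mean_frame(cluster)                # flush the last cluster
    best = min(range(len(cluster)), key=lambda j: delta(cluster[j], mf))
    keyframes.append(start + best)
    return keyframes
```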
3.2 Progressive transitions and wipes

The method we propose for detecting progressive transitions relies on the idea of temporal coherence of the global dominant image motion within a shot. To this end, we use a method exploiting an M-estimation criterion and a multiresolution scheme to estimate a 2D parametric motion model and to ensure robustness in the presence of secondary motions [8]. The minimization problem is solved using an IRLS (Iteratively Re-weighted Least Squares) technique.

The set of pixels conforming with the estimated dominant motion forms its estimation support. The analysis of the temporal evolution of the support size supplies the partitioning of the video into shots [3]. A cumulative sum test is used to detect significant jumps of this variable, accounting for the presence of shot changes. It can accurately determine the beginning and end time instants of progressive transitions. This technique, due to the use of the robust motion estimation method, is fairly resilient to the presence of mobile objects or to global illumination changes. More details can be found in [3]. It is particularly efficient at detecting dissolve transitions. To handle wipe transitions between two static shots, we have developed a specific extension of this method.

In the case of wipes, we conversely pay attention to the outliers. Indeed, as illustrated in Figure 2, the geometry of the main outlier region reflects the nature of the wipe, with the presence of a rectangular stripe. To exploit this property in a simple and efficient way, we project the outlier binary pixels onto the vertical and horizontal image axes, which supplies characterizable peaks. Then, for detecting wipes, we compute the temporal correlation of these successive projections. The location and value of the correlation extremum allow us to characterize wipes in the video. One representative result is reported in Figure 2.
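The projection-and-correlation step can be sketched as follows. This is an illustrative sketch only: it assumes the outlier mask from the robust motion estimation is available as a binary 2D array, and uses a simple shifted correlation whose extremum location tracks the stripe displacement between frames.

```python
def axis_projections(outlier_mask):
    """Project the binary outlier pixels onto the image axes.

    Returns normalized column sums (horizontal axis) and row sums
    (vertical axis); a wipe stripe shows up as a narrow peak.
    """
    rows, cols = len(outlier_mask), len(outlier_mask[0])
    horiz = [sum(outlier_mask[r][c] for r in range(rows)) / rows
             for c in range(cols)]
    vert = [sum(outlier_mask[r][c] for c in range(cols)) / cols
            for r in range(rows)]
    return horiz, vert

def shifted_correlation(p_prev, p_curr, max_shift=10):
    """Correlate successive projections over small spatial shifts and
    return the shift and value at the correlation extremum."""
    best_shift, best_corr = 0, float("-inf")
    n = len(p_prev)
    for s in range(-max_shift, max_shift + 1):
        corr = sum(p_prev[i] * p_curr[i + s]
                   for i in range(n) if 0 <= i + s < n)
        if corr > best_corr:
            best_shift, best_corr = s, corr
    return best_shift, best_corr
```

Over a wipe, the per-frame best shift drifts steadily in one direction as the stripe sweeps across the image; over a static shot, the projections carry little energy and no coherent drift appears.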
3.3 Dominant color extraction

We characterize the color content of keyframes by automatically selecting a small set of colours, called dominant colours. Because, in most images, a small number of color ranges captures the majority of pixels, these colours can be used efficiently to characterize the color content. These dominant colors can easily be determined from the keyframe color histograms. Getting a smaller set of colours enhances the performance of image matching, because colour variations due to noise are removed. The description of the image, using these dominant colours, is simply the percentages of those colors found in the image.

We compute the set of dominant colors using the k-means clustering algorithm. This iterative algorithm determines a set of clusters of nearly homogeneous colours, each cluster being represented by its mean value. At each iteration, two steps are performed: 1) using the nearest-neighbour rule, pixels are grouped into clusters; 2) each cluster mean is recomputed from the pixels assigned to it.
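A minimal k-means sketch of this two-step iteration follows, returning each dominant colour with its pixel percentage as in the image description above. It is illustrative only: initialization, the number of colours k, and the convergence test are parameters of the actual tool.

```python
import random

def kmeans_dominant_colors(pixels, k, iters=20, seed=0):
    """Alternate 1) nearest-neighbour assignment and 2) mean update, then
    report each dominant colour (cluster mean) with its pixel percentage."""
    rng = random.Random(seed)
    centers = rng.sample(pixels, k)            # simple random initialization
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in pixels:                       # step 1: nearest centre
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        for j, cl in enumerate(clusters):      # step 2: recompute means
            if cl:
                centers[j] = tuple(sum(v) / len(cl) for v in zip(*cl))
    percentages = [len(cl) / len(pixels) for cl in clusters]
    return list(zip(centers, percentages))
```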
Figure 2. Detection of wipes: (a,b) images within a wipe effect, (c) image of outliers in black, (d,e) projections on horizontal and vertical axes, (f) plot of the correlation criterion.
3.4 Logo detection

We introduce a simple but efficient method to perform the detection of static logos, which is the most frequent case in TV programs. To speed up the detection, we specify the expected screen location of the logo by defining its associated bounding box B (see Figure 3).

The proposed method is based on the two following steps: first, determination of the logo binary mask; then, evaluation of a matching criterion combining luminance and chrominance attributes. If the logo to be detected is not rectangular, the first step consists in building the exact logo mask for detection. That is, we have to determine which pixels belong to the logo in the bounding box. From at least two logos taken from images with different backgrounds, a discriminant analysis technique is used to cluster pixels into two classes: those belonging to the logo and the others. The clustering step is performed by considering image difference values for the luminance attribute {a_1}, and directly the values of the chrominance images for the chrominance attributes {a_2, a_3} in the Y,U,V space. The two classes are discriminated by considering a threshold T for each attribute {a_1, a_2, a_3}. We consider the inter-class variance σ²_InterC(T) given for each attribute {a_k}. The optimal threshold T* then corresponds to an extremum of the inter-class variance σ²_InterC(T).

We display in Figure 3 an example of the histogram of the image difference pixels within the bounding box, a plot of the inter-class variance σ²_InterC(T) supplying the dissimilarity measure between the two classes, and the determined binary mask M. The matching simply consists in comparing each pixel that belongs to the logo mask in the current color image of the sequence with the value in the reference logo image. The criterion is given by the sum of the squared differences over the mask pixels.
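The threshold selection on each attribute is essentially an Otsu-style criterion: pick the T maximizing the inter-class variance. A minimal sketch over a 1-D attribute histogram (illustrative names, not the DiVAN implementation):

```python
def optimal_threshold(hist):
    """Return the threshold T* maximizing the inter-class variance

        sigma2_InterC(T) = w0(T) * w1(T) * (mu0(T) - mu1(T))**2

    where w0, w1 are the weights and mu0, mu1 the means of the two classes
    {a <= T} and {a > T} of an attribute histogram `hist`.
    """
    total = sum(hist)
    total_mean = sum(v * hist[v] for v in range(len(hist))) / total
    best_t, best_var = 0, -1.0
    w0 = s0 = 0.0
    for t in range(len(hist) - 1):
        w0 += hist[t] / total                 # weight of class {a <= t}
        s0 += t * hist[t] / total             # partial mean accumulator
        w1 = 1.0 - w0
        if w0 == 0 or w1 == 0:
            continue                          # degenerate split, skip
        mu0, mu1 = s0 / w0, (total_mean - s0) / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t
```

A clearly bimodal attribute distribution (logo pixels vs. background pixels) yields a sharp variance peak; a flat plot of σ²_InterC(T), as monitored in Figure 3(c), signals an unreliable mask.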
Figure 3. Top row: images of logos and associated bounding boxes. Middle row: determination of the binary mask (d) from the image of chrominance U (a); plot of the attribute distribution {a_2} (b) and of the inter-class variance σ²_InterC(T) (c).
3.5 Face detection

We developed a fast and complete scheme for face detection in color images where the number of faces, their location, their orientation and their size are unknown, under non-constrained conditions such as complex background [6]. Our scheme starts by performing a chrominance-based segmentation using the YCbCr color model, which is closely related to the MPEG coding scheme. According to a precise approximation of the skin color sub-spaces in the YCbCr color model, each macro-block average color value is classified. A binary mask is computed in which a "one" corresponds to a skin-color macro-block, and a "zero" corresponds to a non-skin-color macro-block. Then, scanning through the binary mask, continuous "one" candidate face regions are extracted. In order to respect the speed requirements, we use a simple method of integral projection, as in [10]. Each macro-block mask image is segmented into non-overlapping rectangular regions that contain a contiguous "one" region. Then, we search for candidate face areas in these resulting areas. Given that we do not know a priori the size of the face regions contained in the frame, we first look for the largest possible ones and we iteratively reduce their size, running over the possible aspect ratios and the possible positions in each binary mask segment area. Constraints related to size range, shape and homogeneity are then applied to detect these candidate face regions, and possible overlaps are resolved. The main purpose of the last stage of our algorithm is to verify the face detection results obtained after candidate face area segmentation, and to remove false alarms caused by objects with color similar to skin colors and similar aspect ratios, such as exposed parts of the body or parts of the background. For this purpose, a wavelet packet analysis scheme is applied, using a method similar to the one we described in [4,5] in the case of face recognition. For a candidate face region, we search for a face at every position inside the region and for a possible number of dimensions of a bounding rectangle. This search is based on a feature vector extracted from a set of wavelet packet coefficients related to the corresponding region, compared with prototype face area vectors acquired in a previous training step.
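The chrominance-based first stage can be sketched as a per-macro-block test against a Cb/Cr box. The bounds below are common illustrative skin-tone values, not the precise sub-space approximation of [6], and a simple flood fill stands in for the integral-projection grouping of [10]:

```python
def skin_mask(macroblocks, cb_range=(77, 127), cr_range=(133, 173)):
    """Classify each macro-block's average (Y, Cb, Cr) colour:
    1 for skin-colour macro-blocks, 0 otherwise (illustrative bounds)."""
    return [[1 if cb_range[0] <= cb <= cb_range[1]
               and cr_range[0] <= cr <= cr_range[1] else 0
             for (_, cb, cr) in row]
            for row in macroblocks]

def bounding_regions(mask):
    """Group contiguous '1' macro-blocks into candidate face rectangles
    (min_row, min_col, max_row, max_col)."""
    rows, cols = len(mask), len(mask[0])
    seen, regions = set(), []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and (r, c) not in seen:
                stack, cells = [(r, c)], []
                seen.add((r, c))
                while stack:                  # flood fill one '1' region
                    y, x = stack.pop()
                    cells.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < rows and 0 <= nx < cols \
                           and mask[ny][nx] and (ny, nx) not in seen:
                            seen.add((ny, nx))
                            stack.append((ny, nx))
                ys = [y for y, _ in cells]
                xs = [x for _, x in cells]
                regions.append((min(ys), min(xs), max(ys), max(xs)))
    return regions
```

The rectangles returned here would then feed the size/shape/homogeneity constraints and the wavelet-packet verification stage described above.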
4 Scene segmentation

Shot segmentation and feature extraction alone cannot meet the needs of professional television archives, where the meaningful units are higher-level program segments such as stories and scenes. This paper focuses on programs described by generic templates (special-purpose templates are described in [9]).
4.1 Simple program template

The DiVAN simple program template implements a scene segmentation algorithm introduced by Yeo et al. [11]. A scene is defined as a grouping of contiguous shots, based on image similarity. This algorithm uses three parameters: an image similarity function, an average scene duration T, and a cluster distance threshold. The similarity function can be defined using a combination of extracted features: color histograms, dominant colors, camera motion types, number, size and positions of faces, etc. Figure 4 shows the analysis graph for scene analysis using dominant colors. The result of the indexing session consists of a sound track and a hierarchically organized scene track.

Figure 4. Analysis graph for the simple program template.
The details of the scene segmentation algorithm follow the original algorithm by Yeo et al. A binary tree is constructed in order to represent the similarity between keyframes. A temporal constraint is introduced in order to reduce computation time. To calculate the N(N−1) distances describing the similarity between clusters, we initially introduced a distance between keyframes with a temporal constraint, as done in [2]. Distances are accessed via a priority stack for efficiency reasons. At each step, the smallest distance among the N(N−1) distances is selected and the two corresponding clusters are merged. The whole set of distances is updated through the Lance & Williams formula. A binary tree is built by successive merging of the nearest clusters, until the threshold value is reached. We build an oriented graph where the nodes are the previously selected clusters and the edges are temporal succession links. The oriented graph is segmented into its strongly connected components, which correspond to scenes. Finally, we complete the scene track by adding all progressive transitions at their appropriate level (between shots or scenes). We obtained good scene segmentation results using regional color histograms or dominant colors in programs with long sequences corresponding to one scene (drama, fiction). In many other cases, scene segmentation suffers some limitations, due to the restrictive definition of a scene. For instance, a short story item in a news broadcast will be incorporated into a larger scene, because of the recurrent anchorperson images before and after the story.
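The grouping step above (clusters of similar shots, temporal succession links, strongly connected components) can be sketched as follows. This is an illustration, not the DiVAN code: shots are assumed already assigned to clusters by the agglomerative step, and a small Kosaraju-style two-pass DFS extracts the components.

```python
def scene_components(cluster_of_shot):
    """Build the succession graph over shot clusters and return its
    strongly connected components; each component is one scene."""
    # Edges: cluster of shot i -> cluster of shot i+1
    edges = set(zip(cluster_of_shot, cluster_of_shot[1:]))
    nodes = sorted(set(cluster_of_shot))
    fwd = {n: [] for n in nodes}
    rev = {n: [] for n in nodes}
    for a, b in edges:
        fwd[a].append(b)
        rev[b].append(a)

    seen = set()

    def dfs(n, graph, out):
        """Iterative DFS appending nodes to `out` in postorder."""
        stack = [(n, iter(graph[n]))]
        seen.add(n)
        while stack:
            node, it = stack[-1]
            nxt = next((m for m in it if m not in seen), None)
            if nxt is None:
                stack.pop()
                out.append(node)
            else:
                seen.add(nxt)
                stack.append((nxt, iter(graph[nxt])))

    order = []
    for n in nodes:                    # pass 1: finish order on fwd graph
        if n not in seen:
            dfs(n, fwd, order)
    seen.clear()
    comps = []
    for n in reversed(order):          # pass 2: reversed graph
        if n not in seen:
            comp = []
            dfs(n, rev, comp)
            comps.append(sorted(comp))
    return comps
```

For the news example above, a recurring anchor cluster A interleaved with a story cluster B (shot sequence A, B, A) creates a cycle A→B→A, so both collapse into one component; this is exactly why the short story is absorbed into the larger scene.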
4.2 Composite program template

The DiVAN composite program template is used to introduce a broader definition of program segments such as stories, using cues such as logo or face detection. Figure 5 shows the complete analysis graph for a composite program. The program is segmented into stories, based on detected logos, wipes, dissolves and soundtrack changes, and stories are segmented into scenes using the previously described algorithm. In [9], we state the story segmentation problem as a CSP (constraint satisfaction problem), in the framework of an extended description logics.

Figure 5. Analysis graph for a composite program template.
5 Conclusion

This paper describes the first prototype of the DiVAN indexing system, with a focus on the tools library and the general system architecture. Template-based video analysis systems have been proposed in academic research [2,7,1], but no such system is available commercially; the closest commercial offering is the Video Analysis Engine by Excalibur. The prototype allows its users to customize the system through the edition of their own templates, and will be thoroughly tested by documentalists at INA, RAI and ERT in early 1999. The second DiVAN prototype will provide many more functions, such as audio analysis, face recognition, query-by-example, moving object analysis and camera motion analysis.
Acknowledgments
This work was funded in part under the ESPRIT EP24956 project DiVAN.
We would like to thank the Innovation Department at Institut National de
l'Audiovisuel (INA) for providing the video test material reproduced in this
communication.
References

1. P. Aigrain, P. Joly, and V. Longueville. Medium knowledge-based macro-segmentation of video into sequences. In Intelligent Multimedia Information Retrieval, M.T. Maybury, Editor. MIT Press, 1997.
2. G. Ahanger and T.D.C. Little. A survey of technologies for parsing and indexing digital video. Journal of Visual Communication and Image Representation, 7(1), March 1996.
3. P. Bouthemy and F. Ganansia. Video partitioning and camera motion characterization for content-based video indexing. In Proc. 3rd IEEE Int. Conf. on Image Processing, ICIP'96, Lausanne, September 1996.
4. C. Garcia, G. Zikos, G. Tziritas. A wavelet-based framework for face recognition. In Int. Workshop on Advances in Facial Image Analysis and Recognition Technology, 5th European Conference on Computer Vision, June 2–6, 1998, Freiburg, Germany.
5. C. Garcia, G. Zikos, G. Tziritas. Wavelet packet analysis for face recognition. To appear in Image and Vision Computing (1999).
6. C. Garcia, G. Zikos, G. Tziritas. Face detection in color images using wavelet packet analysis. Submitted to IEEE Multimedia Systems'99 (1999).
7. B. Merialdo and F. Dubois. An agent-based architecture for content-based multimedia browsing. In Intelligent Multimedia Information Retrieval, M.T. Maybury, Editor. MIT Press, 1997.
8. J.M. Odobez and P. Bouthemy. Robust multiresolution estimation and object-oriented segmentation of parametric motion models. Journal of Visual Communication and Image Representation, 6(4), 1995.
9. R. Ronfard, J. Carrive, F. Pachet. A film template library for video analysis and indexing based on description logics. Submitted to IEEE Multimedia Systems'99.
10. H. Wang and S.-F. Chang. A highly efficient system for automatic face region detection in MPEG video. IEEE Transactions on Circuits and Systems for Video Technology, 7(4):615-628, 1997.
11. M. Yeung, B.-L. Yeo and B. Liu. Extracting story units from long programs for video browsing and navigation. International Conference on Multimedia Computing and Systems, June 1996.