HAL Id: inria-00590110
https://hal.inria.fr/inria-00590110
Submitted on 3 May 2011
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Scene Segmentation and Image Feature Extraction for Video Indexing and Retrieval
Patrick Bouthemy, Christophe Garcia, Rémi Ronfard, Georges Tziritas, Emmanuel Veneau, Didier Zugaj
To cite this version:
Patrick Bouthemy, Christophe Garcia, Rémi Ronfard, Georges Tziritas, Emmanuel Veneau, et al. Scene Segmentation and Image Feature Extraction for Video Indexing and Retrieval. 3rd International Conference on Visual Information Systems (VISUAL'99), Jun 1999, Amsterdam, Netherlands. pp. 245–253, 10.1007/3-540-48762-X_31. inria-00590110
Scene Segmentation and Image Feature Extraction in the DiVAN Video Indexing and Retrieval Architecture

Patrick Bouthemy, Christophe Garcia†, Rémi Ronfard‡, Georges Tziritas†, Emmanuel Veneau‡, Didier Zugaj

INRIA–IRISA, Campus Universitaire de Beaulieu, 35042 Rennes Cedex, France
† ICS–FORTH, P.O. Box 1385, GR 71110 Heraklion, Crete, Greece
‡ INA, 4 avenue de l'Europe, 94366 Bry-sur-Marne, France
Email: {cgarcia,tziritas}@csi.forth.gr, {rronfard,eveneau}@ina.fr, {bouthemy,dzugaj}@irisa.fr
Abstract. We present a video analysis and indexing engine that can perform fully automatic scene segmentation and feature extraction in the context of a television archive. A template-based indexing engine is proposed, which interacts with a general-purpose video analysis server (providing spatio-temporal segmentation, feature extraction, feature clustering and feature matching services) and a template server (which provides tool settings, scripts and interpretation rules for specific program collections).
1 Introduction

The ESPRIT project DiVAN aims at building a distributed architecture for the management of, and access to, large television archives. An important part of the system is devoted to automatic segmentation of the video content, in order to assist the indexing and retrieval of programs at several levels of detail, from individual frames to scenes and stories, using a combination of annotated semantic information and automatically extracted content-based signatures.

In Section 2, the template-based architecture of the system is described. The tools developed in this architecture are presented in Section 3. In Section 4, we briefly sketch how those tools can be integrated into complete analysis graphs, driven by the internal structure of television programs.
2 Template-based indexing and retrieval

In DiVAN, a video indexing session is distributed between several machines, using connections to two remote CORBA servers: a video analysis server and a template library server (Figure 1). Each program template contains the taxonomy of event categories (such as shots, scenes and sequences) meaningful to a collection of programs (such as news, drama, etc.), along with their temporal and compositional structure, which is used to recover the structure of each individual program, based on observable events from the tools server (such as dissolves, wipes, camera motion types, extracted faces and logos). Section 3 reviews the tools available in the DiVAN system, and Section 4 presents examples of template-based indexing using those tools.

Figure 1. The indexing session is driven by scripts stored in the template library, and launches tools from the tools library. Both libraries are accessed via the CORBA ORB.
3 Shot segmentation and feature extraction

The first step in the analysis of a video is its partitioning into elementary subparts, called shots, according to significant scene changes (cuts or progressive transitions). Features such as keyframes, dominant colors, faces and logos can also be extracted to represent each shot.
3.1 Cut detection

Our algorithm involves the frames that are included in the interval between two successive I-frames. It is assumed that this interval is short enough to include only one cut point. Given this assumption, the algorithm proceeds as follows.

Step 1: Perform cut detection considering the next I-frame pair, using one of the histogram metrics and a predefined cut detection threshold.
Step 2: Confirm the detected cut in the next subinterval between two consecutive I- or P-frames included in the interval of I-frames.
Step 3: Perform cut detection between two successive I- or P-frames using histogram or MPEG macro-block motion information. If the cut is not confirmed in this subinterval, go to Step 2.
Step 4: Determine the exact cut point in the subinterval, by applying a histogram- or MPEG-motion-based cut detection method on each frame pair of the interval. If the cut is not confirmed, go to Step 2; otherwise, go to Step 1.
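For illustration, the coarse-to-fine search over anchor frames can be sketched as follows. This is a minimal sketch, not the DiVAN implementation: it assumes precomputed per-frame histograms, binary splitting of each interval (in practice the subintervals follow the actual I/P layout of the MPEG stream), and a hypothetical L1 histogram metric with a single threshold.

```python
def hist_dist(h1, h2):
    """L1 distance between two normalized histograms (illustrative metric)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def refine_cut(hists, lo, hi, threshold):
    """Steps 2-4: narrow a detected cut down to one frame pair.

    `hists[lo]` and `hists[hi]` differ by more than `threshold`; the span is
    split at intermediate anchor frames until a single frame pair remains.
    """
    if hi - lo == 1:
        return (lo, hi)                     # exact cut point found
    mid = (lo + hi) // 2                    # next anchor (I- or P-frame)
    if hist_dist(hists[lo], hists[mid]) > threshold:
        return refine_cut(hists, lo, mid, threshold)
    if hist_dist(hists[mid], hists[hi]) > threshold:
        return refine_cut(hists, mid, hi, threshold)
    return None                             # cut not confirmed in subinterval

def detect_cuts(hists, i_frames, threshold=0.5):
    """Step 1: test successive I-frame pairs; refine each confirmed cut."""
    cuts = []
    for a, b in zip(i_frames, i_frames[1:]):
        if hist_dist(hists[a], hists[b]) > threshold:
            cut = refine_cut(hists, a, b, threshold)
            if cut is not None:
                cuts.append(cut)
    return cuts
```

Testing only I-frame pairs first keeps the per-frame cost low; the finer subintervals are examined only where a coarse cut candidate exists.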
Key-frame extraction is performed during shot detection, and is based upon a simple sequential clustering algorithm using a single cluster C of candidate key-frames, which is successively updated by including the I-frame that follows its last element in display order. C is a set of K consecutive I-frames C_fk (k = 1..K), and C_mf is an "ideal" frame with the property that its features match the mean vector ("mean frame") of the corresponding characteristics of the frames C_fk. Assuming that the only feature used for classification is the colour/intensity histogram H_fk of frame C_fk, the "mean frame" histogram H_mf is computed. Then, in order to measure the distance δ(C_fk, C_mf), an appropriate histogram metric is used, and the key-frame C_KF (the most representative frame of C) is determined according to a threshold.
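This sequential clustering can be sketched as follows, under stated assumptions: one colour histogram per I-frame, an illustrative L1 distance δ, and the cluster member closest to the "mean frame" taken as the key-frame (the exact metric and threshold are tool parameters in DiVAN).

```python
def mean_frame(hists):
    """Histogram of the 'mean frame' C_mf: bin-wise mean over the cluster."""
    k = len(hists)
    return [sum(h[b] for h in hists) / k for b in range(len(hists[0]))]

def delta(h1, h2):
    """Illustrative L1 histogram distance."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def extract_keyframes(i_frame_hists, threshold):
    """Grow a single cluster C sequentially over I-frames in display order;
    emit a key-frame whenever the next I-frame no longer matches C_mf."""
    keyframes, cluster, start = [], [i_frame_hists[0]], 0
    for i, h in enumerate(i_frame_hists[1:], start=1):
        if delta(h, mean_frame(cluster)) <= threshold:
            cluster.append(h)               # still matches the mean frame
        else:
            mf = mean_frame(cluster)        # close C: pick C_KF, the member
            best = min(range(len(cluster)), # closest to the mean frame
                       key=lambda j: delta(cluster[j], mf))
            keyframes.append(start + best)
            cluster, start = [h], i         # start a new cluster
    mf = mean_frame(cluster)                # flush the last cluster
    best = min(range(len(cluster)), key=lambda j: delta(cluster[j], mf))
    keyframes.append(start + best)
    return keyframes
```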
3.2 Progressive transitions and wipes

The method we propose for detecting progressive transitions relies on the idea of temporal coherence of the global dominant image motion within a shot. To this end, we use a method exploiting an M-estimation criterion and a multiresolution scheme to estimate a 2D parametric motion model and to ensure robustness in the presence of secondary motions [8]. The minimization problem is solved using an IRLS (Iteratively Re-weighted Least Squares) technique.

The set of pixels conforming with the estimated dominant motion forms its estimation support. The analysis of the temporal evolution of the support size supplies the partitioning of the video into shots [3]. A cumulative sum test is used to detect significant jumps of this variable, accounting for the presence of shot changes. It can accurately determine the beginning and end time instants of progressive transitions. This technique, due to the use of the robust motion estimation method, is fairly resilient to the presence of mobile objects or to global illumination changes. More details can be found in [3]. It is particularly efficient at detecting dissolve transitions. To handle wipe transitions between two static shots, we have developed a specific extension of this method.

In the case of wipes, we conversely pay attention to the outliers. Indeed, as illustrated in Figure 2, the geometry of the main outlier region reflects the nature of the wipe, with the presence of a rectangular stripe. To exploit this property in a simple and efficient way, we project the outlier binary pixels onto the vertical and horizontal image axes, which supplies characterizable peaks. Then, for detecting wipes, we compute the temporal correlation of these successive projections. The location and value of the correlation extremum allow us to characterize wipes in the video. One representative result is reported in Figure 2.
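The projection-and-correlation step can be sketched as follows. This is an illustrative sketch only: it assumes the outlier mask from the robust motion estimation is available as a binary 2D array, and uses a simple shifted correlation whose extremum location tracks the stripe displacement between frames.

```python
def axis_projections(outlier_mask):
    """Project the binary outlier pixels onto the image axes.

    Returns normalized column sums (horizontal axis) and row sums
    (vertical axis); a wipe stripe shows up as a narrow peak.
    """
    rows, cols = len(outlier_mask), len(outlier_mask[0])
    horiz = [sum(outlier_mask[r][c] for r in range(rows)) / rows
             for c in range(cols)]
    vert = [sum(outlier_mask[r][c] for c in range(cols)) / cols
            for r in range(rows)]
    return horiz, vert

def shifted_correlation(p_prev, p_curr, max_shift=10):
    """Correlate successive projections over small spatial shifts and
    return the shift and value at the correlation extremum."""
    best_shift, best_corr = 0, float("-inf")
    n = len(p_prev)
    for s in range(-max_shift, max_shift + 1):
        corr = sum(p_prev[i] * p_curr[i + s]
                   for i in range(n) if 0 <= i + s < n)
        if corr > best_corr:
            best_shift, best_corr = s, corr
    return best_shift, best_corr
```

Over a wipe, the per-frame best shift drifts steadily in one direction as the stripe sweeps across the image; over a static shot, the projections carry little energy and no coherent drift appears.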
3.3 Dominant color extraction

We characterize the color content of keyframes by automatically selecting a small set of colours, called dominant colours. Because, in most images, a small number of color ranges captures the majority of pixels, these colours can be used efficiently to characterize the color content. These dominant colors can easily be determined from the keyframe color histograms. Getting a smaller set of colours enhances the performance of image matching, because colour variations due to noise are removed. The description of the image, using these dominant colours, is simply the percentages of those colors found in the image.

We compute the set of dominant colors using the k-means clustering algorithm. This iterative algorithm determines a set of clusters of nearly homogeneous colours, each cluster being represented by its mean value. At each iteration, two steps are performed: 1) using the nearest-neighbour rule, pixels are grouped into clusters; 2) each cluster mean is recomputed from the pixels assigned to it.
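A minimal k-means sketch of this two-step iteration follows, returning each dominant colour with its pixel percentage as in the image description above. It is illustrative only: initialization, the number of colours k, and the convergence test are parameters of the actual tool.

```python
import random

def kmeans_dominant_colors(pixels, k, iters=20, seed=0):
    """Alternate 1) nearest-neighbour assignment and 2) mean update, then
    report each dominant colour (cluster mean) with its pixel percentage."""
    rng = random.Random(seed)
    centers = rng.sample(pixels, k)            # simple random initialization
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in pixels:                       # step 1: nearest centre
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        for j, cl in enumerate(clusters):      # step 2: recompute means
            if cl:
                centers[j] = tuple(sum(v) / len(cl) for v in zip(*cl))
    percentages = [len(cl) / len(pixels) for cl in clusters]
    return list(zip(centers, percentages))
```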
Figure 2. Detection of wipes: (a,b) images within a wipe effect, (c) image of outliers in black, (d,e) projections on horizontal and vertical axes, (f) plot of the correlation criterion.
3.4 Logo detection

We introduce a simple but efficient method to perform the detection of static logos, which is the most frequent case in TV programs. To speed up the detection, we specify the expected screen location of the logo by defining its associated bounding box B (see Figure 3).

The proposed method is based on the two following steps: first, determination of the logo binary mask; then, evaluation of a matching criterion combining luminance and chrominance attributes. If the logo to be detected is not rectangular, the first step consists in building the exact logo mask for detection. That is, we have to determine which pixels belong to the logo in the bounding box. From at least two logos taken from images with different backgrounds, a discriminant analysis technique is used to cluster pixels into two classes: those belonging to the logo and the others. The clustering step is performed by considering image difference values for the luminance attribute {a_1}, and directly the values of the chrominance images for the chrominance attributes {a_2, a_3} in the Y,U,V space. The two classes are discriminated by considering a threshold T for each attribute {a_1, a_2, a_3}. We consider the inter-class variance σ²_InterC(T) given for each attribute {a_k}. The optimal threshold T* then corresponds to an extremum of the inter-class variance σ²_InterC(T).

We display in Figure 3 an example of the histogram of the image difference pixels within the bounding box, a plot of the inter-class variance σ²_InterC(T) supplying the dissimilarity measure between the two classes, and the determined binary mask M. The matching simply consists in comparing each pixel that belongs to the logo mask in the current color image of the sequence with the value in the reference logo image. The criterion is given by the sum of the squared differences over the mask pixels.
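The threshold selection on each attribute is essentially an Otsu-style criterion: pick the T maximizing the inter-class variance. A minimal sketch over a 1-D attribute histogram (illustrative names, not the DiVAN implementation):

```python
def optimal_threshold(hist):
    """Return the threshold T* maximizing the inter-class variance

        sigma2_InterC(T) = w0(T) * w1(T) * (mu0(T) - mu1(T))**2

    where w0, w1 are the weights and mu0, mu1 the means of the two classes
    {a <= T} and {a > T} of an attribute histogram `hist`.
    """
    total = sum(hist)
    total_mean = sum(v * hist[v] for v in range(len(hist))) / total
    best_t, best_var = 0, -1.0
    w0 = s0 = 0.0
    for t in range(len(hist) - 1):
        w0 += hist[t] / total                 # weight of class {a <= t}
        s0 += t * hist[t] / total             # partial mean accumulator
        w1 = 1.0 - w0
        if w0 == 0 or w1 == 0:
            continue                          # degenerate split, skip
        mu0, mu1 = s0 / w0, (total_mean - s0) / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t
```

A clearly bimodal attribute distribution (logo pixels vs. background pixels) yields a sharp variance peak; a flat plot of σ²_InterC(T), as monitored in Figure 3(c), signals an unreliable mask.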
Figure 3. Top row: images of logos and associated bounding boxes. Middle row: determination of the binary mask (d) from the image of chrominance U (a); plot of the attribute distribution {a_2} (b) and of the inter-class variance σ²_InterC(T) (c).
3.5 Face detection

We developed a fast and complete scheme for face detection in color images where the number of faces, their location, their orientation and their size are unknown, under non-constrained conditions such as complex background [6]. Our scheme starts by performing a chrominance-based segmentation using the YCbCr color model, which is closely related to the MPEG coding scheme. According to a precise approximation of the skin color sub-spaces in the YCbCr color model, each macro-block average color value is classified. A binary mask is computed in which a "one" corresponds to a skin-color macro-block, and a "zero" corresponds to a non-skin-color macro-block. Then, scanning through the binary mask, continuous "one" candidate face regions are extracted. In order to respect the speed requirements, we use a simple method of integral projection, as in [10]. Each macro-block mask image is segmented into non-overlapping rectangular regions that contain a contiguous "one" region. Then, we search for candidate face areas in these resulting areas. Given that we do not know a priori the size of the face regions contained in the frame, we first look for the largest possible ones and we iteratively reduce their size, running over the possible aspect ratios and the possible positions in each binary mask segment area. Constraints related to size range, shape and homogeneity are then applied to detect these candidate face regions, and possible overlaps are resolved. The main purpose of the last stage of our algorithm is to verify the face detection results obtained after candidate face area segmentation, and to remove false alarms caused by objects with color similar to skin colors and similar aspect ratios, such as exposed parts of the body or parts of the background. For this purpose, a wavelet packet analysis scheme is applied, using a method similar to the one we described in [4,5] in the case of face recognition. For a candidate face region, we search for a face at every position inside the region and for a possible number of dimensions of a bounding rectangle. This search is based on a feature vector extracted from a set of wavelet packet coefficients related to the corresponding region, compared with prototype face area vectors acquired in a previous training step.
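The chrominance-based first stage can be sketched as a per-macro-block test against a Cb/Cr box. The bounds below are common illustrative skin-tone values, not the precise sub-space approximation of [6], and a simple flood fill stands in for the integral-projection grouping of [10]:

```python
def skin_mask(macroblocks, cb_range=(77, 127), cr_range=(133, 173)):
    """Classify each macro-block's average (Y, Cb, Cr) colour:
    1 for skin-colour macro-blocks, 0 otherwise (illustrative bounds)."""
    return [[1 if cb_range[0] <= cb <= cb_range[1]
               and cr_range[0] <= cr <= cr_range[1] else 0
             for (_, cb, cr) in row]
            for row in macroblocks]

def bounding_regions(mask):
    """Group contiguous '1' macro-blocks into candidate face rectangles
    (min_row, min_col, max_row, max_col)."""
    rows, cols = len(mask), len(mask[0])
    seen, regions = set(), []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and (r, c) not in seen:
                stack, cells = [(r, c)], []
                seen.add((r, c))
                while stack:                  # flood fill one '1' region
                    y, x = stack.pop()
                    cells.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < rows and 0 <= nx < cols \
                           and mask[ny][nx] and (ny, nx) not in seen:
                            seen.add((ny, nx))
                            stack.append((ny, nx))
                ys = [y for y, _ in cells]
                xs = [x for _, x in cells]
                regions.append((min(ys), min(xs), max(ys), max(xs)))
    return regions
```

The rectangles returned here would then feed the size/shape/homogeneity constraints and the wavelet-packet verification stage described above.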
4 Scene segmentation

Shot segmentation and feature extraction alone cannot meet the needs of professional television archives, where the meaningful units are higher-level program segments such as stories and scenes. This paper focuses on programs described by generic templates (special-purpose templates are described in [9]).
4.1 Simple program template

The DiVAN simple program template implements a scene segmentation algorithm introduced by Yeo et al. [11]. A scene is defined as a grouping of contiguous shots, based on image similarity. This algorithm uses three parameters: an image similarity function, an average scene duration T, and a cluster distance threshold. The similarity function can be defined using a combination of extracted features: color histograms, dominant colors, camera motion types, number, size and positions of faces, etc. Figure 4 shows the analysis graph for scene analysis using dominant colors. The result of the indexing session consists of a sound track and a hierarchically organized scene track.

Figure 4. Analysis graph for the simple program template.
The details of the scene segmentation algorithm follow the original algorithm by Yeo et al. A binary tree is constructed in order to represent the similarity between keyframes. A temporal constraint is introduced in order to reduce computation time. To calculate the N(N−1) distances describing the similarity between clusters, we initially introduced a distance between keyframes with a temporal constraint, as done in [2]. Distances are accessed via a priority stack for efficiency reasons. At each step, the smallest distance among the N(N−1) distances is selected and the two corresponding clusters are merged. The whole set of distances is updated through the Lance & Williams formula. A binary tree is built by successive merging of the nearest clusters, until the threshold value is reached. We build an oriented graph where the nodes are the previously selected clusters and the edges are temporal succession links. The oriented graph is segmented into its strongly connected components, which correspond to scenes. Finally, we complete the scene track by adding all progressive transitions at their appropriate level (between shots or scenes). We obtained good scene segmentation results using regional color histograms or dominant colors in programs with long sequences corresponding to one scene (drama, fiction). In many other cases, scene segmentation suffers some limitations, due to the restrictive definition of a scene. For instance, a short story item in a news broadcast will be incorporated into a larger scene, because of the recurrent anchorperson images before and after the story.
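The grouping step above (clusters of similar shots, temporal succession links, strongly connected components) can be sketched as follows. This is an illustration, not the DiVAN code: shots are assumed already assigned to clusters by the agglomerative step, and a small Kosaraju-style two-pass DFS extracts the components.

```python
def scene_components(cluster_of_shot):
    """Build the succession graph over shot clusters and return its
    strongly connected components; each component is one scene."""
    # Edges: cluster of shot i -> cluster of shot i+1
    edges = set(zip(cluster_of_shot, cluster_of_shot[1:]))
    nodes = sorted(set(cluster_of_shot))
    fwd = {n: [] for n in nodes}
    rev = {n: [] for n in nodes}
    for a, b in edges:
        fwd[a].append(b)
        rev[b].append(a)

    seen = set()

    def dfs(n, graph, out):
        """Iterative DFS appending nodes to `out` in postorder."""
        stack = [(n, iter(graph[n]))]
        seen.add(n)
        while stack:
            node, it = stack[-1]
            nxt = next((m for m in it if m not in seen), None)
            if nxt is None:
                stack.pop()
                out.append(node)
            else:
                seen.add(nxt)
                stack.append((nxt, iter(graph[nxt])))

    order = []
    for n in nodes:                    # pass 1: finish order on fwd graph
        if n not in seen:
            dfs(n, fwd, order)
    seen.clear()
    comps = []
    for n in reversed(order):          # pass 2: reversed graph
        if n not in seen:
            comp = []
            dfs(n, rev, comp)
            comps.append(sorted(comp))
    return comps
```

For the news example above, a recurring anchor cluster A interleaved with a story cluster B (shot sequence A, B, A) creates a cycle A→B→A, so both collapse into one component; this is exactly why the short story is absorbed into the larger scene.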
4.2 Composite program template

The DiVAN composite program template is used to introduce a broader definition of program segments such as stories, using cues such as logo or face detection. Figure 5 shows the complete analysis graph for a composite program. The program is segmented into stories, based on detected logos, wipes, dissolves and soundtrack changes, and stories are segmented into scenes using the previously described algorithm. In [9], we state the story segmentation problem as a CSP (constraint satisfaction problem), in the framework of an extended description logics.

Figure 5. Analysis graph for a composite program template.
5 Conclusion

This paper describes the first prototype of the DiVAN indexing system, with a focus on the tools library and the general system architecture. Template-based video analysis systems have been proposed in academic research [2,7,1], but no such system is available commercially; the closest commercial offering is the Video Analysis Engine by Excalibur. The prototype allows its users to customize the system through the edition of their own templates, and will be thoroughly tested by documentalists at INA, RAI and ERT in early 1999. The second DiVAN prototype will provide many more functions, such as audio analysis, face recognition, query-by-example, moving object analysis and camera motion analysis.
Acknowledgments
This work was funded in part under the ESPRIT EP24956 project DiVAN.
We would like to thank the Innovation Department at Institut National de
l'Audiovisuel (INA) for providing the video test material reproduced in this
communication.
References

1. P. Aigrain, P. Joly, and V. Longueville. Medium knowledge-based macro-segmentation of video into sequences. In Intelligent Multimedia Information Retrieval, M.T. Maybury, Editor. MIT Press, 1997.
2. G. Ahanger and T.D.C. Little. A survey of technologies for parsing and indexing digital video. Journal of Visual Communication and Image Representation, 7(1), March 1996.
3. P. Bouthemy and F. Ganansia. Video partitioning and camera motion characterization for content-based video indexing. In Proc. 3rd IEEE Int. Conf. on Image Processing, ICIP'96, Lausanne, September 1996.
4. C. Garcia, G. Zikos, G. Tziritas. A wavelet-based framework for face recognition. In Int. Workshop on Advances in Facial Image Analysis and Recognition Technology, 5th European Conference on Computer Vision, June 2–6, 1998, Freiburg, Germany.
5. C. Garcia, G. Zikos, G. Tziritas. Wavelet packet analysis for face recognition. To appear in Image and Vision Computing (1999).
6. C. Garcia, G. Zikos, G. Tziritas. Face detection in color images using wavelet packet analysis. Submitted to IEEE Multimedia Systems'99 (1999).
7. B. Merialdo and F. Dubois. An agent-based architecture for content-based multimedia browsing. In Intelligent Multimedia Information Retrieval, M.T. Maybury, Editor. MIT Press, 1997.
8. J.M. Odobez and P. Bouthemy. Robust multiresolution estimation and object-oriented segmentation of parametric motion models. Journal of Visual Communication and Image Representation, 6(4), 1995.
9. R. Ronfard, J. Carrive, F. Pachet. A film template library for video analysis and indexing based on description logics. Submitted to IEEE Multimedia Systems'99.
10. H. Wang and S.-F. Chang. A highly efficient system for automatic face region detection in MPEG video. IEEE Transactions on Circuits and Systems for Video Technology, 7(4):615-628, 1997.
11. M. Yeung, B.-L. Yeo and B. Liu. Extracting story units from long programs for video browsing and navigation. International Conference on Multimedia Computing and Systems, June 1996.