A probabilistic framework of selecting effective key frames for video browsing and indexing


HAL Id: inria-00548295

https://hal.inria.fr/inria-00548295

Submitted on 21 Dec 2010


Riad Hammoud, Roger Mohr


Riad Hammoud and Roger Mohr
MOVI-INRIA Rhône-Alpes and GRAVIR-CNRS
655 avenue de l'Europe, 38330 Montbonnot Saint Martin, FRANCE
riad.hammoud@inrialpes.fr

International Workshop on Real-Time Image Sequence Analysis, pages 79-88, August 2000

Abstract: To represent video content effectively for browsing, indexing and video skimming, the most characteristic frames (called key-frames) should be extracted from given shots. This paper briefly reviews and evaluates the existing approaches to key-frame extraction, and then introduces a framework for selecting effective key-frames using an unsupervised clustering method. A mixture of Gaussians is used to model the temporal variation of the feature vectors of all frames in the shot. As a result, the feature-based representation of the shot is partitioned into several clusters. From each obtained cluster, the frame closest to the median of its frames is first selected as a reference key-frame. Then, depending on the variation in time and appearance of the cluster content against the reference key-frame, multiple frames can be extracted to represent the cluster effectively. The number of clusters is determined automatically by the Bayes Information Criterion. Experimental results on tracked objects in a real-world video stream are presented, which illustrate the performance of the proposed technique.

1 Introduction and motivation

As the amount of video data grows rapidly, the ability to manipulate it efficiently becomes of greater importance, for the purpose of selecting appropriate elements of information [6]. The selection or extraction of limited and meaningful information is a way to resolve a set of challenging problems for recently emerging multimedia applications: video browsing and navigation, content-based indexing, video summarization and trailers, storage and transmission bandwidth of digitized video information [14] [9].

Access to video is still a hard task due to video's length and unstructured format. Video abstraction and summarization techniques are needed to solve this difficulty. Shot boundary detection and key-frame extraction are two bases for abstraction and summarization techniques [15] [1] [11].

A shot is defined as an unbroken sequence of frames recorded from a single camera, which forms the building block of a video. The purpose of shot boundary detection is to segment the video stream into multiple shots. Many effective shot boundary detection techniques already exist [2].

Beyond the shot level, an abstraction level could be constructed by mapping the entire shot to a small number of representative frames, called key-frames [14]. Indeed, an index [...] can subsequently be displayed for browsing purposes.

1 This work is supported by Alcatel CRC Grant Alcatel-Inria No. 198G098.

This paper focuses on key-frame extraction techniques. There exist many different approaches to extract key-frames [14] [15] [1]. However, they cannot effectively capture the major visual content, and/or are not user-friendly because a set of parameters must be adjusted by the user, and/or are computationally expensive.

In this paper, a new strategy to extract the most characteristic key-frames is proposed. The main idea is to cluster similar or redundant views within the shot together. The clusters are approximated by a mixture of Gaussians using the standard Expectation-Maximization (EM) algorithm [4]. Here, the estimation is performed in the color histogram feature space. The Bayes Information Criterion [12] is used to choose the appropriate number of clusters (i.e. the number of key-frames) for each shot individually, depending on its complexity. From each obtained cluster, the frame closest to the median of its frames is first selected as a reference key-frame. Then, depending on the variation in time and appearance of the cluster content against the reference key-frame, multiple frames can be extracted to represent the cluster effectively. A temporal filter is applied to the set of all selected key-frames in order to eliminate overlaps between the constructed clusters of frames. Using the proposed framework, only frames sufficiently separated in time and appearance are kept. The selection of key-frames is fully automatic: no parameters need to be adjusted by the user.

The organization of this paper is as follows. Sections 2 and 3 respectively review and evaluate some approaches relevant to the present work. Section 4.1 details the clustering strategy and section 4.2 describes the algorithm to extract key-frames. In this work the key-frames are extracted for browsing purposes only, since key-frames summarize the content of a shot [9]. In section 5, experimental results on different tracked objects in a real-world video sequence are presented. The video sequence has already been segmented into shots [3], and moving objects are localized and tracked in shots [5]. These experiments demonstrate the performance of the proposed technique. A short discussion and concluding remarks are given in sections 5.2 and 6 respectively.

2 Related work

Many research efforts have been made in the area of key frame extraction [14] [15] [1] [13]. They can be grouped into the three following categories.

1. Shot boundary based approach. O'Connor et al. use either the first, the middle or the last frame of the shot as the shot's key frames [11].

2. Motion analysis based approach. Wolf proposes a motion based approach to key-frame extraction [13]. He first computes the optical flow for each frame and then computes a simple motion metric based on the optical flow. Finally he analyzes the metric as a function of time to select key-frames at the local minima of motion.

[...] key frames [14]. The similarity between the current frame and the last key-frame is identified in each feature space by a thresholding technique.

• Motivated by the same observation as Wolf and Zhang, Avrithis et al. combine the color and motion features in a fuzzy feature vector [1]. The trajectory of the feature vectors of all frames of a given shot is analyzed first. Then the key-frames are selected on the curve points: the local minima and maxima of the magnitude of the second derivative on the initial trajectory, in the discrete case.

• Zhuang et al. propose an adaptive key frame extraction using a linear clustering technique to regroup similar frames together [15]. The similarity between images of the same shot is computed in the 128-dimensional Hue-Saturation color histogram space. Based on a predefined threshold of similarity for each video sequence, the number of clusters is determined. After that, an arbitrary point of each cluster is selected as a key-frame. Only clusters of proportions greater than a predefined threshold are represented.
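The selection rule attributed to Wolf in item 2 (key-frames at the local minima of a motion metric over time) can be illustrated with a minimal sketch. The function name `local_minima` and the sample values are assumptions made for illustration; the optical-flow computation that would produce the metric is not shown and is not taken from [13].

```python
import numpy as np

def local_minima(motion):
    """Return the indices whose value is strictly below both temporal neighbours."""
    m = np.asarray(motion, dtype=float)
    inner = (m[1:-1] < m[:-2]) & (m[1:-1] < m[2:])
    return (np.flatnonzero(inner) + 1).tolist()

# Hypothetical per-frame motion metric (placeholder numbers, not measured data).
motion = [5.0, 3.0, 4.0, 6.0, 2.0, 2.5, 7.0]
key_frame_indices = local_minima(motion)
```

With a metric that varies steadily in one direction, this rule selects nothing, which matches the limitation noted for this family of methods in section 3.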

3 Evaluation of existing techniques

The approach of O'Connor is the easiest way to extract key frames. However, it does not capture the visual content of the video shot. The methods of Avrithis and Wolf give interesting results. However, they are computationally expensive due to their analysis of motion, and their underlying assumption of local minima does not work very well when the feature vectors vary at a constant rate. The methods of Zhuang and Zhang are relatively fast. However, they are very sensitive to the choice of the similarity threshold. As a result, the number of selected key-frames is very variable. Adjusting the threshold parameter is a challenging problem for the user of these methods.

The next section details the theoretical part of the proposed framework for automatically selecting effective key-frames for a given shot. Our approach uses an unsupervised clustering algorithm to group similar frames within a shot together. The Gaussian mixture density is used to model the temporal variation of color histograms in the RGB color space. In order to select the appropriate number of components (clusters) automatically, the Bayes Information Criterion is used.

4 Probabilistic framework for shot abstraction

Assume that temporal video segmentation into shots has already been performed. Then, each frame within a shot is characterized by a vector of measurements called a feature. Each feature is represented by a single point or individual in the d-dimensional feature space, where its coordinates are the values of the feature vector.

Now, for a given shot of n frames (or a tracked object of n occurrences), n points [...] shot. The 64-dimensional space of this data was already reduced by performing Principal Components Analysis (PCA). In the current framework the method of [5] was used to track non-rigid objects.

In the following, both the clustering strategy used to group similar frames together and the key-frame extraction algorithm used to realize the abstraction level are described.

Figure 1: Illustration of the content-based variation of the tracked "Ford Car" within the shot of figure 3, in the 3 principal components of the RGB histogram space (axes: Var 1, Var 2, Var 3). Some labels of points are shown (e.g. the corresponding frame numbers 9, 22, 32, 54 and 65, labeled "White car").

4.1 Clustering by Gaussian mixture densities

Again, assume that a video shot consisting of $n$ images has been selected. Let us denote by $y_i$ the feature vector of dimension $d$ that characterizes the $i$-th frame, and by $Y = \{y_i;\ i = 1 \ldots n\}$ the set of feature vectors collected for all frames of the shot. The distribution of $Y$ is modeled as a joint probability density function, $f(y \mid Y, \Theta)$, where $\Theta$ is the set of parameters [...] by the center and the covariance matrix, $\theta = (\mu, \Sigma)$. In the following, we denote by $\Theta_j = (p_j, \mu_j, \Sigma_j)$, for $j = 1, \ldots, J$, the parameters to be estimated.

Each cluster approximated by a Gaussian component of the mixture groups a set of similar points (i.e. similar frames) in the feature space. Thus a transition from one Gaussian component to another indicates a significant temporal variation within the shot.

Parameters Estimation. Gaussian mixture density estimation is performed in a semi-parametric way so that the number of components scales with the complexity of the data and not with the size of the data set. The density estimation procedure is a missing data estimation problem to which the EM algorithm [4] can be applied. The type of Gaussian mixture model to be used (see next paragraph) has to be fixed, as does the number of components in the mixture. If the number of components is one, the estimation procedure is a standard computation (step M); otherwise the expectation (E) and maximization (M) steps are executed alternately until the log-likelihood of $\Theta$ stabilizes or the maximum number of iterations is reached.

Let $y = \{y_i,\ 1 \le i \le n\}$, with $y_i \in \mathbb{R}^d$, be the observed sample from the mixture distribution $f(y \mid \Theta)$. We assume that the component from which each $y_i$ arises is unknown, so that the missing data are the labels $c_i$ ($i = 1, \ldots, n$). We have $c_i = j$ if and only if $j$ is the mixture component from which $y_i$ arises. Let $c = (c_1, \ldots, c_n)$ denote the missing data, $c \in B^n$, where $B = \{1, \ldots, J\}$. The complete sample is $x = (x_1, \ldots, x_n)$ with $x_i = (y_i, c_i)$. The complete log-likelihood is

$$L(\Theta; x) = \sum_{i=1}^{n} \log \left\{ \sum_{j=1}^{J} p_j \, \varphi(x_i \mid \mu_j, \Sigma_j) \right\}. \tag{2}$$
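As a hedged illustration of equation (2), the sketch below evaluates the sum of log mixture densities on the observed feature vectors $y_i$ with NumPy. The function names are hypothetical, full covariance matrices are assumed, and the log-sum-exp step is a numerical-stability choice of this sketch rather than something stated in the paper.

```python
import numpy as np

def log_gaussian(Y, mean, cov):
    """Log density of N(mean, cov) at each row of Y (shape n x d)."""
    d = Y.shape[1]
    diff = Y - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def mixture_log_likelihood(Y, weights, means, covs):
    """Sum over frames i of log( sum_j p_j * phi(y_i | mu_j, Sigma_j) )."""
    log_terms = np.stack(
        [np.log(p) + log_gaussian(Y, mu, S) for p, mu, S in zip(weights, means, covs)],
        axis=1,
    )
    # Log-sum-exp over components, to avoid underflow for distant points.
    top = log_terms.max(axis=1, keepdims=True)
    per_frame = top[:, 0] + np.log(np.exp(log_terms - top).sum(axis=1))
    return float(per_frame.sum())
```

This value is what the EM iterations monitor for convergence, and it is also the quantity $L_M$ that enters the model selection criterion later in this section.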

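The EM algorithm at iteration "m" is summarized as follows:

Step-E: For $i = 1, \ldots, n$ and $j = 1, \ldots, J$, compute the conditional probability, given $y$, that $y_i$ arises from the mixture component with density $\varphi(\cdot \mid \mu_j^m, \Sigma_j^m)$ and mixing proportion $p_j^m$:

$$t_{ij}(\Theta^m) = \frac{p_j^m \, \varphi(x_i \mid \mu_j^m, \Sigma_j^m)}{\sum_{\ell=1}^{J} p_\ell^m \, \varphi(x_i \mid \mu_\ell^m, \Sigma_\ell^m)}. \tag{3}$$

Step-M: Maximize the log-likelihood conditionally on $t_{ij}^m$. Indeed, in the case of a [...]

$$p_j^{m+1} = \frac{1}{n} \sum_{i=1}^{n} t_{ij}(\Theta^m), \qquad \mu_j^{m+1} = \frac{\sum_{i=1}^{n} t_{ij}(\Theta^m)\, y_i}{\sum_{i=1}^{n} t_{ij}(\Theta^m)}, \qquad \Sigma_j^{m+1} = \frac{\sum_{i=1}^{n} t_{ij}(\Theta^m)\,(y_i - \mu_j^{m+1})(y_i - \mu_j^{m+1})^T}{\sum_{i=1}^{n} t_{ij}(\Theta^m)}. \tag{4}$$

At each iteration, the following properties hold: for $i = 1, \ldots, n$,

$$\sum_{j=1}^{J} t_{ij}(\Theta^m) = 1 \quad \text{and} \quad \sum_{j=1}^{J} p_j^m = 1. \tag{5}$$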
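One EM iteration, as described by equations (3) and (4), might be sketched as follows. This is an illustrative NumPy version assuming unconstrained full covariance matrices; the constrained models of [4] used in the paper are not reproduced, and the names `e_step` and `m_step` as well as the synthetic data are assumptions of the sketch.

```python
import numpy as np

def gaussian_pdf(Y, mean, cov):
    """Density of N(mean, cov) at each row of Y (shape n x d)."""
    d = Y.shape[1]
    diff = Y - mean
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * maha) / norm

def e_step(Y, weights, means, covs):
    """Equation (3): t[i, j] is the probability that y_i arises from component j."""
    dens = np.stack(
        [p * gaussian_pdf(Y, mu, S) for p, mu, S in zip(weights, means, covs)], axis=1
    )
    return dens / dens.sum(axis=1, keepdims=True)

def m_step(Y, t):
    """Equation (4): update proportions, centers and covariance matrices."""
    n = Y.shape[0]
    mass = t.sum(axis=0)                      # sum_i t_ij for each component j
    weights = mass / n
    means = (t.T @ Y) / mass[:, None]
    covs = []
    for j in range(t.shape[1]):
        diff = Y - means[j]
        covs.append((t[:, j, None] * diff).T @ diff / mass[j])
    return weights, means, np.array(covs)

# Synthetic two-cluster data standing in for frame features (not from the paper).
rng = np.random.default_rng(0)
Y = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [5.0, 5.0]])
covs = np.array([np.eye(2), np.eye(2)])
t = e_step(Y, weights, means, covs)
weights, means, covs = m_step(Y, t)
```

Both properties of equation (5) follow directly from the normalizations in `e_step` and `m_step`, which makes them a convenient sanity check on any implementation.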

More details on the EM algorithm can be found in [4]. Initialization of the clusters is done randomly. In order to limit dependence on the initial position, the algorithm is run several times (10 times in our experiments) and the best solution is kept.

Gaussian models. Gaussian mixtures are sufficiently general to model arbitrarily complex, non-linear distributions accurately given enough data [4]. When the data is limited, i.e. the number of frames of a shot is small, the method should be constrained to provide better conditioning for the estimation. For these reasons, and in order to make the method fast, some constraints are added on the covariance parameter. In a previous work we have described these Gaussian models and their application [8]. These models were originally introduced in [4].

Choosing models and mixture components' number. To avoid a hand-picked number of Gaussians in the mixture, i.e. the number of clusters and hence the number of key-frames to be selected, the Bayes Information Criterion (BIC) [12] is used to determine the best probability density representation (appropriate Gaussian model and number of components). It is an approach based on a measure that determines the best balance between the number of parameters used and the performance achieved in classification. It minimizes the following criterion:

$$\mathrm{BIC}(M) = -2\,L_M + Q_M \ln(n) \tag{6}$$

where $L_M$ is the maximized log-likelihood of the Gaussian model $M$ and $Q_M$ is its number of free parameters.
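The criterion of equation (6) is simple to compute once the maximized log-likelihood is known. The sketch below assumes a mixture with unconstrained full covariance matrices when counting $Q_M$ (the constrained models of [4] have fewer free parameters), and the helper names are hypothetical.

```python
import math

def n_free_parameters(J, d):
    """Free parameters of a J-component, d-dimensional mixture with full covariances."""
    proportions = J - 1                  # the p_j sum to one
    centers = J * d
    covariances = J * d * (d + 1) // 2   # symmetric d x d matrices
    return proportions + centers + covariances

def bic(log_likelihood, n_params, n):
    """Equation (6): BIC(M) = -2 L_M + Q_M ln(n)."""
    return -2.0 * log_likelihood + n_params * math.log(n)

def select_model(candidates, n):
    """candidates maps J to (maximized log-likelihood, Q); return the J minimizing BIC."""
    return min(candidates, key=lambda J: bic(candidates[J][0], candidates[J][1], n))
```

In use, one would fit the mixture for each candidate number of components up to the permitted maximum, collect the maximized log-likelihoods, and pass them to `select_model` together with the number of frames `n` in the shot.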

4.2 Key-frames extraction

A few images called "key-frames" can summarize the visual content of a video shot. By [...] result of the first part of the approach, a set of clusters is identified. Each cluster is characterized by its center (mass of the distribution), its covariance matrix (dispersion around the center) and the number of individuals belonging to it.

Figure 2: Key frames browsinginterfa e

The algorithm to extract key-frames from a given shot has two stages:

• Perform the following procedure on each cluster $C_k$, with $k = 1..K$, where $K$ denotes the number of constructed clusters of frames.

1- Compute the median frame, $F^k_m$, for the set of frames $\{F^k_i;\ i = 1..n_k\}$ belonging to the cluster $C_k$, where $n_k$ represents the proportion of cluster $C_k$.

2- Select as a reference key-frame, $F^k_r$, the closest frame to $F^k_m$.

3- For each frame $F^k_i$ belonging to $C_k$, check (a) whether the temporal distance between it and the reference frames is greater than a predefined temporal threshold. If this condition is verified, then check (b) whether the similarity distance between it and the reference frames is greater than a predefined similarity threshold. If both conditions are verified, add this frame $F^k_i$ to the set of reference key-frames. The Mahalanobis distance, $d_M(F^k_i, F^k_r) = (F^k_i - F^k_r)\,\Sigma_k^{-1}\,(F^k_i - F^k_r)^t$, is used here to compute the similarity between the feature vectors of two frames.

• Merge the set of reference key-frames obtained for all clusters of a shot. From this [...]
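The per-cluster procedure (steps 1 to 3 above) can be sketched in a few lines. Several points are assumptions of this sketch, not statements of the paper: the median frame is taken as the component-wise median of the feature vectors, closeness to the median uses the same Mahalanobis form $d_M$, a candidate is compared against every key-frame already kept for the cluster, and the function name and thresholds are hypothetical. The final merging stage is not shown.

```python
import numpy as np

def mahalanobis(a, b, cov_inv):
    """d_M as written in the text: (a - b) Sigma^-1 (a - b)^t, without a square root."""
    diff = a - b
    return float(diff @ cov_inv @ diff)

def cluster_key_frames(frame_ids, features, cov, t_thresh, s_thresh):
    """Return the frame numbers kept as key-frames for one cluster C_k."""
    frame_ids = np.asarray(frame_ids)
    features = np.asarray(features, dtype=float)
    cov_inv = np.linalg.inv(cov)

    # Steps 1-2: median feature vector, then the actual frame closest to it.
    median = np.median(features, axis=0)
    ref = int(np.argmin([mahalanobis(f, median, cov_inv) for f in features]))
    selected = [ref]

    # Step 3: keep frames far enough in time (a), then in appearance (b).
    for i in np.argsort(frame_ids):
        far_in_time = all(abs(frame_ids[i] - frame_ids[r]) > t_thresh for r in selected)
        if not far_in_time:
            continue
        far_in_look = all(
            mahalanobis(features[i], features[r], cov_inv) > s_thresh for r in selected
        )
        if far_in_look:
            selected.append(int(i))
    return sorted(int(frame_ids[r]) for r in selected)
```

Checking the temporal condition first mirrors the order given in step 3 and avoids computing the Mahalanobis distance for frames that are too close in time to qualify.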

Figure 3: Subset of views of the "Ford Car" sequence of 66 frames (views shown: #425, #430, #435, #440, #445, #450, #455, #460).

5 Implementations and experimental results

In our project for building and browsing interactive video [9] [7], a video sequence is first segmented into shots, using the method of [3], and then moving objects are localized and tracked in each shot separately. The method of [5] was used to track objects. As mentioned previously, we extract the key-frames here for browsing purposes. Browsing key-frames allows a fast visualization of the content of the shot (see figure 2 for example).

In the current experiments each occurrence of a tracked object is characterized by a histogram computed in the RGB color space. The histogram approach is well known as an attractive method for image retrieval because of its simplicity, speed and robustness. The RGB space is quantized into 64 colors. Then, Principal Component Analysis was applied to the entire set of feature vectors in order to reduce their dimensionality. Only 10 [...] the clustering strategy was applied. This makes the method more accurate and faster.

[Figure: views labeled #412, #435, #462]
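The feature pipeline described here (a 64-color RGB histogram followed by PCA) could look like the sketch below. The paper does not say how the 64 colors are obtained; quantizing each channel to 4 levels (4 x 4 x 4 = 64) is an assumption, as are the function names. PCA is done with a plain SVD of the centered data.

```python
import numpy as np

def rgb_histogram_64(image):
    """Normalized 64-bin histogram of an H x W x 3 uint8 RGB image (4 levels per channel)."""
    levels = np.asarray(image, dtype=np.uint8) // 64      # each channel mapped to 0..3
    r = levels[..., 0].astype(int)
    g = levels[..., 1].astype(int)
    b = levels[..., 2].astype(int)
    codes = r * 16 + g * 4 + b                             # one color index in 0..63
    hist = np.bincount(codes.ravel(), minlength=64).astype(float)
    return hist / hist.sum()

def pca_reduce(X, n_components):
    """Project the rows of X onto their first n_components principal directions."""
    centered = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T
```

For a shot, one would stack the histograms of all occurrences of the tracked object into a matrix with one row per frame and call `pca_reduce` on it before clustering.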

5.1 Results

Experiments are conducted on the MPEG "Avengers" TV movie of the "Institut National de l'Audiovisuel en France" (INA). The extraction of key-frames is performed separately on each tracked object within a shot. During the estimation process, the maximum number of permitted Gaussian components, K, depends on the number of frames in the shot. Using the BIC criterion, the appropriate number of clusters (in [1..K]), and hence the number of selected key-frames, is chosen automatically. The size of the temporal window was fixed to 20 frames, which is reasonable to separate two key-frames in time.

To evaluate the effectiveness and accuracy of the proposed key-frame extraction technique, we illustrate in this paper the results on two different tracked objects of the database. The "Ford Car" and "la Licorne" sequences, consisting of 66 and 100 frames, are illustrated in figures 3 and 5 respectively. One of every 5 frames is depicted. The results of the proposed approach are presented in figures 4 and 6.

5.2 Discussion

For each tested shot, it can be seen that the selected key-frames provide a sufficient visualization of all the frames of the shot. They are clearly representative of the different views of tracked objects, which change continuously with time.

The closest work to our approach is that of [15]. Zhuang et al. use a linear clustering algorithm with a predefined threshold to determine the number of clusters. Then, they represent each formed cluster of frames by an arbitrary one. The technique presented here uses the EM algorithm, which is better suited to finding the partition of a complex distribution, and the number of clusters (complexity of the distribution) is determined automatically using the BIC criterion. Also, the key-frames are extracted here for tracked objects.

The method of Zhuang et al. is obviously faster, because the clustering strategy it employs is linear and non-parameterized. The approach proposed here is adopted by our project [9] since the estimated Gaussian components are used, before the selection of key-frames, by the recognition process of similar tracked objects in the whole video sequence [7].

6 Conclusion

In this paper, an efficient video content representation has been presented for realizing an abstraction level beyond the shot level: the key-frame level. The presented framework is a fully automatic method to extract key-frames where no parameters are needed [...]

Figure 5: Subset of views of the "la Licorne" sequence of 100 frames (views shown: #7145, #7150, #7155, #7160, #7165, #7170, #7175, #7180, #7185, #7190, #7200, #7205, #7210, #7215, #7218).

The experiments on different real-world tracked objects have shown the performance of the proposed technique. Such a technique can also be performed on non-segmented frames of shots. It is able to capture the salient visual content of the key clusters and thus that of the underlying shot. The color histogram was used as the feature, although other features could be tested. However, the accuracy of the estimation of the Gaussian mixture densities is related to the dimension of the feature space, which must be chosen carefully with respect to the size of a shot.

The key-frames are extracted here for browsing purposes only, but an index may be [...]

Figure 6: Key-frame results for the "la Licorne" sequence

Acknowledgments

We would like to acknowledge Alcatel CRC for its support of this work, and the "Institut National de l'Audiovisuel en France", dept of Innovation, for providing the video used in this paper.

References

[1] Y. S. Avrithis, A. D. Doulamis, N. D. Doulamis, and S. D. Kollias. A stochastic framework for optimal key frame extraction from MPEG video databases. Computer Vision and Image Understanding, 75(1/2):3-24, July-August 1999.

[2] J. S. Boreczky and L. A. Rowe. Comparison of video shot boundary detection techniques. In Proc. SPIE Conf. on Vis. Commun. and Image Proc., 1996.

[3] P. Bouthemy and F. Ganansia. Video partitioning and camera motion characterisation for content-based video indexing. Proc. 3rd IEEE Int. Conf. Image Processing, September 1996.

[4] G. Celeux and G. Govaert. Gaussian Parsimonious Models. Pattern Recognition, 28(5):781-783, 1995.

[5] M. Gelgon and P. Bouthemy. A region-level graph labeling approach to motion-based segmentation. In CVPR, pages 514-519, Puerto Rico, June 17-19 1997.

[6] N. Guimaraes, N. Correia, I. Oliveira, and J. Martins. Designing computer for content analysis: A situated use of video parsing and analysis techniques. Multimedia Tools and Applications, 7:159-180, 1998.

[7] R. Hammoud and R. Mohr. Building and browsing hyper-videos: a content variation modeling solution. Pattern Analysis and Applications, 2000. Special Issue on Image Indexation, Submitted.

[...] for movie films. In ACM Multimedia, pages 497-498, Los Angeles, California, USA, October 30 - November 3 2000. (demo session).

[10] R. Hammoud and R. Mohr. Mixture densities for video objects recognition. In International Conference on Pattern Recognition, volume 2, pages 71-75, Barcelona, Spain, 3-8 September 2000.

[11] B. C. O'Connor. Selecting key frames of moving image documents: A digital environment for analysis and navigation. Microcomputers for Information Management, 8(2):119-133, 1991.

[12] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461-464, March 1978.

[13] W. Wolf. Key frame selection by motion analysis. In Proc. IEEE Int. Conf. Acoust., Speech, and Signal Proc., 1996.

[14] H. J. Zhang, C. Y. Low, S. W. Smoliar, and J. H. Wu. Video parsing, retrieval and browsing: An integrated and content-based solution. ACM Multimedia, pages 15-24, 1995.
