HAL Id: inria-00590141
https://hal.inria.fr/inria-00590141
Submitted on 3 May 2011
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Conceptual indexing of television images based on face and caption sizes and locations
Rémi Ronfard, Christophe Garcia, Jean Carrive
To cite this version:
Rémi Ronfard, Christophe Garcia, Jean Carrive. Conceptual indexing of television images based on face and caption sizes and locations. 4th International Conference on Advances in Visual Information Systems (VISUAL '00), Nov 2000, Lyon, France. pp. 349–359, 10.1007/3-540-40053-2_31. inria-00590141
Conceptual indexing of television images based on face and caption sizes and locations

Rémi Ronfard, Christophe Garcia†, Jean Carrive
INA, 4 avenue de l'Europe, 94366, Bry-sur-Marne, France
† ICS-FORTH, P.O. Box 1385, GR 71110 Heraklion, Crete, Greece
Email: {ronfard,jcarrive}@ina.fr, garcia@ics.forth.gr
Abstract. Indexing videos by their image content is an important issue for digital audiovisual archives. While much work has been devoted to classification and indexing methods based on perceptual qualities of images, such as color, shape and texture, there is also a need for classification and indexing of some structural properties of images. In this paper, we present some methods for image classification in video, based on the presence, size and location of faces and captions. We argue that such classifications are highly domain-dependent, and are best handled using flexible knowledge management systems (in our case, a description logics).
1 Introduction

Classifying shots based on their visual content is an important step toward higher-level segmentation of a video into meaningful units such as stories in broadcast news or scenes in comedy and drama. Earlier work on the subject has shown that shot similarity based on global features such as duration and color could be efficient in limited cases [14,1]. More recent work tends to highlight the limits of such techniques, and to emphasize more specific features, such as caption and face sizes and locations [11,12,9].

Captions and faces are powerful video indexes, given that they generally give a clue about the video content. In video segmentation, they may help to find program boundaries, by detecting script lines, and to select more meaningful keyframes containing textual data and/or human faces. Automatic detection of programs, such as TV commercials or news, becomes possible using the location and size of text.
One important issue that is not dealt with by previous work is the necessity of exploiting domain knowledge, which may only be available at run-time. In this paper, we establish a clear-cut separation between feature extraction, which is based on generic tools (face detection, caption detection), and classification, which is based on heuristic, domain-specific rules. With examples drawn from real broadcast news, we illustrate how such classes can be organized into taxonomies.
We use the CLASSIC Description Logics system [2] as a representation for both the image classes and the image observations, which are obtained through video analysis. CLASSIC represents classes as concepts, which can be primitive or defined. Primitive concepts are only represented with necessary conditions. We use them to represent event classes which are directly observable: shots, keyframes, faces and captions. The necessary conditions determine the inferences which can be drawn in such classes: for instance, shots have at least one keyframe, keyframes may have faces or captions. Defined concepts are represented with both necessary and sufficient conditions. Therefore, class membership can be inferred automatically for defined concepts. In this paper, we focus on defined concepts for keyframe and shot classes. Relations between concepts are called roles, and one important role between audiovisual events is containment (the part-of role). Concepts and roles are organized in taxonomies, such as the one shown in Fig. 1, which contains both primitive and defined concepts implemented in our current prototype.
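To make the primitive/defined distinction concrete, here is a minimal Python sketch (ours, not the CLASSIC system itself): primitive concepts such as shots, keyframes and faces are asserted directly from observation, while a defined concept carries necessary-and-sufficient conditions, so membership can be inferred automatically.

```python
# Primitive observations: a shot and its parts (the part-of role).
# A shot necessarily has at least one keyframe; keyframes *may* have
# faces or captions -- these are asserted, never inferred.
shot = {
    "keyframes": [
        {"faces": [{"width": 80}], "captions": []},
        {"faces": [], "captions": [{"text": "LYON"}]},
    ],
}

def is_face_keyframe(kf):
    """Defined concept: a keyframe with at least one detected face.
    The condition is both necessary and sufficient, so membership
    follows automatically from the observations."""
    return len(kf["faces"]) >= 1

# Classification is automatic for defined concepts:
assert [is_face_keyframe(kf) for kf in shot["keyframes"]] == [True, False]
```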
[Figure 1 depicts the taxonomy: Keyframe and Image Region are linked to Shot by part-of roles; Caption and Face specialize Image Region; the face classes are CU Face, MCU Face, MS Face, MLS Face and LS Face; the caption classes are Upper Left Caption, Bottom Caption and Center Caption, with semantic counterparts Locational, Personal and Topical Caption; the shot classes include Locational Shot, Interview Shot and Reporter Shot.]

Figure 1. A taxonomy of image regions, keyframes and shots. Context-specific classes are defined in terms of more generic classes using subsumption and part-of links.
3 Feature extraction

In general, a shot can be represented synthetically by a small number of static keyframes. We select keyframes by clustering them based on their color content. The video segmentation techniques used in DiVAN have been described elsewhere [3], and we focus here on the techniques used to detect faces and captions.
3.1 Face detection

Faces appearing in video frames are detected using a novel and efficient method that we presented in detail in [7]. The proposed scheme is designed for human face detection in color images under non-constrained scene conditions, such as the presence of a complex background and uncontrolled illumination. Color clustering and filtering using approximations of the HSV skin color subspaces are applied to the original image, providing quantized skin color regions which are iteratively merged in order to provide a set of candidate face areas. Constraints related to shape and face texture analysis are then applied, by performing a wavelet packet decomposition on each candidate face area and extracting simple statistical features such as the standard deviation. Compact and meaningful feature vectors are built with these statistical features. Then, the Bhattacharyya distance is used for classifying the feature vectors into face or non-face areas, using prototype face area vectors acquired in a previous training stage. For a data set of 100 images with 104 faces covering most of the cases of human face appearance, a 94.23% good detection rate, 20 false alarms and a 5.76% false dismissal rate were obtained.
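As a hedged illustration of the final classification step, the following NumPy sketch compares a feature vector to face and non-face prototypes by Bhattacharyya distance; the toy three-bin vectors stand in for the wavelet-packet statistics of [7], and the prototypes are invented for the example.

```python
import numpy as np

def bhattacharyya(p, q):
    """Bhattacharyya distance between two non-negative feature vectors,
    treated as discrete distributions (normalized to sum to 1)."""
    p = p / p.sum()
    q = q / q.sum()
    bc = np.sum(np.sqrt(p * q))   # Bhattacharyya coefficient, in (0, 1]
    return -np.log(bc)            # 0 when the distributions coincide

def classify(feature, face_protos, nonface_protos):
    """Label a feature vector by its nearest prototype class."""
    d_face = min(bhattacharyya(feature, p) for p in face_protos)
    d_non = min(bhattacharyya(feature, p) for p in nonface_protos)
    return "face" if d_face < d_non else "non-face"

# Toy prototypes standing in for trained wavelet-packet statistics:
face_protos = [np.array([0.5, 0.3, 0.2])]
nonface_protos = [np.array([0.1, 0.1, 0.8])]
assert classify(np.array([0.45, 0.35, 0.2]), face_protos, nonface_protos) == "face"
```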
3.2 Caption detetion
Our method for aption detetion is espeially designed for being applied to
thediÆultase wheretext issuperimposed onolorimageswith ompliated
bakgroundandisdesribedin[8℄.Ourgoalistominimizethenumberoffalse
alarms and to binarize eÆiently the deteted text areasso that they an be
proessedbystandardOCRsoftware.First,potentialareasoftextaredeteted
byenhanementandlusteringproesses,onsideringmostofonstraintsrelated
to the texture of words. Then, lassiationand binarization of potential text
areasareahievedinasingleshemeperformingolorquantizationandhara-
tersperiodiityanalysis.First resultsusingadataset of200imagesontaining
480linesoftextwithharatersizesrangingfrom8to30,areveryenouraging.
Our algorithm deteted93%of thelines and binarizethem with an estimated
good readabilityrate of 82%. An overall numberof 23 false alarmshave been
found,in areaswithontrastedrepetitivetexture.
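For intuition only, a toy binarization step might look as follows; the naive global threshold is our stand-in for the color quantization and periodicity analysis of [8], which this sketch does not reproduce.

```python
import numpy as np

def binarize_text_area(gray):
    """Toy binarization of a grayscale text patch: characters become 1,
    background 0, ready for OCR. The global mean threshold is a crude
    placeholder for the color-quantization scheme of [8]."""
    t = gray.mean()
    return (gray < t).astype(np.uint8)  # dark characters on light background

patch = np.array([[200, 200, 30], [200, 30, 30]])
assert binarize_text_area(patch).tolist() == [[0, 0, 1], [0, 1, 1]]
```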
4 Shot classification

The automatic detection of human faces and textual information provides users with powerful indexing capacities for the video material. Frames containing detected faces or text areas may be searched according to their number, sizes and locations. The number and size of detected faces may characterize a big audience (multiple faces), an interview (two medium-size faces) or a close-up view of a speaker (a large face). The location and size of text areas help in characterizing the video content, especially in news. In this section, we explain in more detail how such shot classes can be defined, and their instances recognized automatically, in a DL framework.
4.1 Face classes

The first axis for shot classification is the apparent size of the detected faces. Faces are an important semantic marker in images, and they also serve as a very intuitive and immediate spatial reference. With respect to the human figure, cinematographers use a vocabulary of common framings, called 'shot values', from which we have selected five classes, corresponding to cases where the face can be seen and detected clearly. They range from the close-up (CU), where the face occupies approximately half of the screen, to the long shot (LS), where the face is much smaller; the intermediate classes are the medium shot (MS), the medium-close-up (MCU) and the medium-long-shot (MLS) [13].
Shot value classes are usually defined in relative and imprecise terms, based on the distance of the subject to the camera. In order to provide a quantitative definition, we use the fact that in television and film, the apparent size of faces on the screen varies inversely with their distance to the camera (perspective shortening). We therefore compute the quantity d = FrameWidth / FaceWidth and classify the face regions according to five overlapping bins, based on a uniform quantization of d in the range [0,12] (see Table 1). Note that this is consistent with the resolution used (MPEG-1 video with 22 macroblocks per line).
The five face classes shown in Fig. 1 follow immediately from the correspondence shown in Table 1. Given such classes, it is possible to define keyframe classes based on the number and size of their detected faces. When all faces are in a given class (mcu-face), then the keyframe itself can be qualified (mcu-frame). Note that in the case of multiple face classes, we do not attempt to classify the keyframe. But using overlapping face value classes allows us to automatically classify the frame into the common class of all its detected faces, in most practical cases.
Value  CU      MCU         MS          MLS          LS
Size   1/2     1/4         1/6         1/8          1/10
Range  d ≤ 4   2 ≤ d ≤ 6   4 ≤ d ≤ 8   6 ≤ d ≤ 10   8 ≤ d

Table 1. Face sizes, distances and shot values.
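Table 1 can be read as overlapping intervals over d. The sketch below (our illustration; the bin bounds are taken from Table 1, the function names are ours) returns every shot-value class whose bin contains d, and classifies a keyframe by the common class of all its detected faces.

```python
# Overlapping bins over d = frame_width / face_width, after Table 1.
BINS = {"CU": (0, 4), "MCU": (2, 6), "MS": (4, 8), "MLS": (6, 10), "LS": (8, 12)}

def face_classes(frame_width, face_width):
    """All shot-value classes whose bin contains d (bins overlap on purpose)."""
    d = frame_width / face_width
    return {v for v, (lo, hi) in BINS.items() if lo <= d <= hi}

def frame_class(frame_width, face_widths):
    """Common class of all detected faces, or None when they share none."""
    common = set.intersection(*(face_classes(frame_width, w) for w in face_widths))
    return common if common else None

# A face 1/5 of the frame width gives d = 5, in both the MCU and MS bins;
# adding a second face with d = 3 narrows the frame down to MCU.
assert face_classes(352, 70.4) == {"MCU", "MS"}
assert frame_class(352, [70.4, 352 / 3]) == {"MCU"}
```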
4.2 Caption classes

While faces are classified according to their dimension, captions are best classified according to their position on the screen. In many contexts, such as broadcast news, the caption location determines the semantic class of the caption text. As an example, Fig. 2 shows examples of three caption classes: topical (center-left caption), personal (bottom caption) and locational (upper-left caption). In this case, we therefore define three caption classes based on simple geometric tests for bottom, upper-left and center-left captions, as we did with faces. But we propagate the class memberships from captions to frames and shots in a very different way from what we did with shot values, because in this case the presence of a single center-left caption suffices to classify the frame as a topical keyframe, and the shot as a topical shot. Since CLASSIC does not provide the existential operator, this is done with a special-purpose propagation rule, triggered for all such caption instances.
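A minimal sketch of such geometric tests follows, with bounding boxes in pixels; the fractional thresholds and function names are our assumptions for illustration, not values from the paper.

```python
def caption_class(box, frame_w, frame_h):
    """Classify a caption bounding box (x, y, w, h) by screen position.
    Thresholds are illustrative assumptions."""
    x, y, w, h = box
    cx, cy = x + w / 2, y + h / 2       # caption center
    if cy > 0.75 * frame_h:
        return "personal"               # bottom caption
    if cx < 0.5 * frame_w and cy < 0.33 * frame_h:
        return "locational"             # upper-left caption
    if cx < 0.5 * frame_w:
        return "topical"                # center-left caption
    return None

def keyframe_class(caption_boxes, frame_w, frame_h):
    """Propagate: one center-left caption suffices to label the keyframe."""
    classes = {caption_class(b, frame_w, frame_h) for b in caption_boxes}
    return "topical" if "topical" in classes else None

assert caption_class((10, 250, 120, 30), 352, 288) == "personal"
assert keyframe_class([(10, 120, 100, 24)], 352, 288) == "topical"
```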
Shot classification immediately follows from keyframe classification in the case of simple shots (shots with exactly one keyframe). Shots containing more than one keyframe are qualified as composite shots and are only classified as CU, MCU, etc. when all their keyframes are in the same class. In all other cases, we leave them unclassified, for lack of more specific information. Curiously, this limitation coincides with limitations of CLASSIC itself, which can only handle conjunctions of role restrictions, but not negations or disjunctions. In the future, we will investigate other DL systems to overcome this limitation. As another extension, we are currently developing a constraint-based temporal reasoning system on top of CLASSIC, which will allow us to define and classify a composite shot as a zoom in from MS to CU [4].
In some contexts, such as broadcast news or sports, more specialized shot classes can be defined, using simple combinations of the previously introduced classes. For instance, an interview shot can be defined as a one-shot which is both an MCU-shot and a personal-shot. A reporter shot can be defined similarly, as a one-shot, MCU, locational shot. And an anchor shot can be defined as a one-shot, MCU, topical shot. While such classes are only valid within a particular context, they allow useful inferences, especially when dealing with large collections of very similar television broadcasts.
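These conjunctive definitions can be sketched as follows (an illustration of the idea, not the CLASSIC encoding; the class names follow this section, the function signature is ours).

```python
def shot_labels(n_keyframes, face_value, caption_class):
    """Derive specialized labels for a shot from generic classes.
    Each label is a pure conjunction of necessary-and-sufficient
    conditions, in the spirit of CLASSIC defined concepts."""
    labels = set()
    one_shot = n_keyframes == 1
    if one_shot and face_value == "MCU" and caption_class == "personal":
        labels.add("interview")
    if one_shot and face_value == "MCU" and caption_class == "locational":
        labels.add("reporter")
    if one_shot and face_value == "MCU" and caption_class == "topical":
        labels.add("anchor")
    return labels

assert shot_labels(1, "MCU", "topical") == {"anchor"}
assert shot_labels(2, "MCU", "topical") == set()  # composite shot: unlabeled
```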
5 Experimental results and further work
OurshotlassiationsystemhasbeentestedaspartoftheDiVANprototype.
DiVAN is a ditributedaudiovisual arhivenetwork whih usesadvaned video
segmentation tehniques to failitatethe taskof doumentalists, whoannotate
thevideoontentswithtime-odeddesriptions.Inourexperiments,thevideois
proessedsequentially,fromsegmentationtofeatureextration,toshotlassi-
ationandsenegroupings,withouthumanintervention,basedonapreompiled
shottaxonomyrepresentingtheavailableknowledgeaboutaolletionofrelated
televisionprograms.
Shot  Classes
S1    CU
S2    MLS, MS
S3    MCU, MS, Topical, Anchor
S4    CU, MCU, Personal, Interview
S5    MCU, MS, Locational, Personal, Reporter, Interview
S6    CU, MCU, Personal, Interview

Table 2. Shot classification results for Fig. 2.
In Fig. 2, we present some results of the proposed face and text detection methods, with faces and captions appearing in the same frame. Table 2 shows the classification results for those shots. In those examples, it should be noted that multiple or even conflicting interpretations (such as Interview and Reporter shot) are allowed. We believe that such ambiguities can only be resolved by adding more knowledge and more features into the system.
One way of adding such knowledge is to go from detection to recognition. Face and caption recognition enable more powerful indexing capacities, such as indexing sports programs by score figures and player names, or indexing news by person and place names. When detected faces are recognized and associated automatically with textual information, as in the systems Name-it [10] or Piction [5], potential applications become possible, such as news video viewers providing descriptions of the displayed faces, news text browsers giving facial information, or automated video annotation generators for faces.

In order to implement such capabilities, we are developing an algorithm dedicated to face recognition when faces are large enough and in a semi-frontal position [6]. This algorithm directly uses the features extracted in the detection stage. In addition, our algorithm for text detection [8] includes a text binarization stage that makes the use of standard OCR software possible. We are also currently completing our study by using standard OCR software for text recognition. With those capabilities, we will be able to extend the number of shot classes recognized by our system, to recognize shot sequences, such as shot-reverse-shots, and to resolve ambiguous cases, such as determining whether two keyframes contain the same faces or not (within a shot boundary).
6 Conclusions

Based on extracted faces and captions, we have been able to build some useful classes for describing television images. The description logic framework used allows us to easily specialize and extend the taxonomies. Classification of new instances is performed using a combination of numerical methods and symbolic reasoning, and allows us to always store the most specific descriptions for shots or groups of shots, based on the available knowledge and feature-based information.
References

1. Aigrain, Ph., Joly, Ph. and Longueville, V. Medium knowledge-based macro-segmentation of video into sequences. Intelligent Multimedia Information Retrieval, AAAI Press - MIT Press, 1997.
2. Borgida, A., Brachman, R.J., McGuinness, D.L., Resnick, L.A. CLASSIC: A Structural Data Model for Objects. ACM SIGMOD Int. Conf. on Management of Data, 1989.
3. Bouthemy, P., Garcia, C., Ronfard, R., Tziritas, G., Veneau, E. Scene segmentation and image feature extraction for video indexing and retrieval. VISUAL'99, 1999.
4. Carrive, J., Pachet, F., Ronfard, R. Using Description Logics for Indexing Audiovisual Documents. Proceedings of the International Workshop on Description Logics, Trento, Italy, 1998.
5. Chopra, K., Srihari, R.K. Control Structures for Incorporating Picture-Specific Context in Image Interpretation. Proceedings of the Int'l Joint Conf. on Artificial Intelligence, 1995.
6. Garcia, C., Zikos, G., Tziritas, G. Wavelet Packet Analysis for Face Recognition. To appear in Image and Vision Computing, 18(4).
7. Garcia, C. and Tziritas, G. Face Detection Using Quantized Skin Color Regions Merging and Wavelet Packet Analysis. IEEE Transactions on Multimedia, 1(3):264-277, Sept. 1999.
8. Garcia, C., Apostolidis, X. Text Detection and Segmentation in Complex Color Images. IEEE International Conference on Acoustics, Speech, and Signal Processing, June 5-9 2000, Istanbul, Turkey.
9. Ide, I., Yamamoto, K. and Tanaka, H. Automatic indexing to video based on shot classification. Advanced Multimedia Content Processing, LNCS 1554, November 1998.
10. Satoh, S., Kanade, T. Name-it: Association of Face and Name in Video. Proc. of Computer Vision and Pattern Recognition, IEEE Computer Society Press, pp. 368-373, 1997.
11. Gunsel, B., Ferman, A.M. and Tekalp, A.M. Video Indexing Through Integration of Syntactic and Semantic Features. WACV, 1996.
12. Ferman, A.M., Tekalp, A.M. and Mehrotra, R. Effective Content Representation for Video. IEEE Intern. Conference on Image Processing, October 1998.
13. Thompson, R. Grammar of the Shot. Media Manual, Focal Press, Oxford, UK, 1998.
14. Yeung, M. and Yeo, B.-L. Time-constrained Clustering for Segmentation of Video into Story Units. International Conference on Pattern Recognition, 1996.