HAL Id: inria-00590141
https://hal.inria.fr/inria-00590141
Submitted on 3 May 2011
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Conceptual indexing of television images based on face and caption sizes and locations
Rémi Ronfard, Christophe Garcia, Jean Carrive
To cite this version:
Rémi Ronfard, Christophe Garcia, Jean Carrive. Conceptual indexing of television images based on face and caption sizes and locations. 4th International Conference on Advances in Visual Information Systems (VISUAL '00), Nov 2000, Lyon, France. pp. 349–359, 10.1007/3-540-40053-2_31. inria-00590141
Conceptual indexing of television images based on face and caption sizes and locations

Rémi Ronfard, Christophe Garcia†, Jean Carrive
INA, 4 avenue de l'Europe, 94366, Bry-sur-Marne, France
† ICS-FORTH, P.O. Box 1385, GR 71110 Heraklion, Crete, Greece
Email: {ronfard,jcarrive}@ina.fr, garcia@ics.forth.gr
Abstract. Indexing videos by their image content is an important issue for digital audiovisual archives. While much work has been devoted to classification and indexing methods based on perceptual qualities of images, such as color, shape and texture, there is also a need for classification and indexing of some structural properties of images. In this paper, we present some methods for image classification in video, based on the presence, size and location of faces and captions. We argue that such classifications are highly domain-dependent, and are best handled using flexible knowledge management systems (in our case, a description logics).
1 Introduction

Classifying shots based on their visual content is an important step toward higher-level segmentation of a video into meaningful units such as stories in broadcast news or scenes in comedy and drama. Earlier work on the subject has shown that shot similarity based on global features such as duration and color could be efficient in limited cases [14,1]. More recent work tends to highlight the limits of such techniques, and to emphasize more specific features, such as caption and face sizes and locations [11,12,9].

Captions and faces are powerful video indexes, given that they generally give a clue about the video content. In video segmentation, they may help to find program boundaries, by detecting script lines, and to select more meaningful keyframes containing textual data and/or human faces. Automatic detection of programs, such as TV commercials or news, becomes possible using the location and size of text.
One important issue that is not dealt with by previous work is the necessity of exploiting domain knowledge, which may only be available at run-time. In this paper, we establish a clear-cut separation between feature extraction, which is based on generic tools (face detection, caption detection), and classification, which is based on heuristic, domain-specific rules. With examples drawn from real broadcast news, we illustrate how such classes can be organized into taxonomies.
We use the CLASSIC Description Logics system [2] as a representation for both the image classes and the image observations, which are obtained through video analysis. CLASSIC represents classes as concepts, which can be primitive or defined. Primitive concepts are only represented with necessary conditions. We use them to represent event classes which are directly observable: shots, keyframes, faces and captions. The necessary conditions determine the inferences which can be drawn in such classes: for instance, shots have at least one keyframe, keyframes may have faces or captions. Defined concepts are represented with both necessary and sufficient conditions. Therefore, class membership can be inferred automatically for defined concepts. In this paper, we focus on defined concepts for keyframe and shot classes. Relations between concepts are called roles, and one important role between audiovisual events is containment (the part-of role). Concepts and roles are organized in taxonomies, such as the one shown in Fig. 1, which contains both primitive and defined concepts implemented in our current prototype.
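To make the primitive/defined distinction concrete, here is a minimal Python sketch (ours, not the CLASSIC system itself): primitive concepts such as shots, keyframes and faces are asserted directly from observation, while a defined concept carries necessary-and-sufficient conditions, so membership can be inferred automatically.

```python
# Primitive observations: a shot and its parts (the part-of role).
# A shot necessarily has at least one keyframe; keyframes *may* have
# faces or captions -- these are asserted, never inferred.
shot = {
    "keyframes": [
        {"faces": [{"width": 80}], "captions": []},
        {"faces": [], "captions": [{"text": "LYON"}]},
    ],
}

def is_face_keyframe(kf):
    """Defined concept: a keyframe with at least one detected face.
    The condition is both necessary and sufficient, so membership
    follows automatically from the observations."""
    return len(kf["faces"]) >= 1

# Classification is automatic for defined concepts:
assert [is_face_keyframe(kf) for kf in shot["keyframes"]] == [True, False]
```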
[Figure 1 depicts the taxonomy: Keyframe and Image Region are linked to Shot by part-of roles; Caption and Face specialize Image Region; the face classes are CU Face, MCU Face, MS Face, MLS Face and LS Face; the caption classes are Upper Left Caption, Bottom Caption and Center Caption, with semantic counterparts Locational, Personal and Topical Caption; the shot classes include Locational Shot, Interview Shot and Reporter Shot.]

Figure 1. A taxonomy of image regions, keyframes and shots. Context-specific classes are defined in terms of more generic classes using subsumption and part-of links.
3 Feature extraction

In general, a shot can be represented synthetically by a small number of static keyframes. We select keyframes by clustering them based on their color content. The video segmentation techniques used in DiVAN have been described elsewhere [3], and we focus here on the techniques used to detect faces and captions.
3.1 Face detection

Faces appearing in video frames are detected using a novel and efficient method that we presented in detail in [7]. The proposed scheme is designed for human face detection in color images under non-constrained scene conditions, such as the presence of a complex background and uncontrolled illumination. Color clustering and filtering using approximations of the HSV skin color subspaces are applied to the original image, providing quantized skin color regions which are iteratively merged in order to provide a set of candidate face areas. Constraints related to shape and face texture analysis are then applied, by performing a wavelet packet decomposition on each candidate face area and extracting simple statistical features such as the standard deviation. Compact and meaningful feature vectors are built with these statistical features. Then, the Bhattacharyya distance is used for classifying the feature vectors into face or non-face areas, using prototype face area vectors acquired in a previous training stage. For a data set of 100 images with 104 faces covering most of the cases of human face appearance, a 94.23% good detection rate, 20 false alarms and a 5.76% false dismissal rate were obtained.
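As a hedged illustration of the final classification step, the following NumPy sketch compares a feature vector to face and non-face prototypes by Bhattacharyya distance; the toy three-bin vectors stand in for the wavelet-packet statistics of [7], and the prototypes are invented for the example.

```python
import numpy as np

def bhattacharyya(p, q):
    """Bhattacharyya distance between two non-negative feature vectors,
    treated as discrete distributions (normalized to sum to 1)."""
    p = p / p.sum()
    q = q / q.sum()
    bc = np.sum(np.sqrt(p * q))   # Bhattacharyya coefficient, in (0, 1]
    return -np.log(bc)            # 0 when the distributions coincide

def classify(feature, face_protos, nonface_protos):
    """Label a feature vector by its nearest prototype class."""
    d_face = min(bhattacharyya(feature, p) for p in face_protos)
    d_non = min(bhattacharyya(feature, p) for p in nonface_protos)
    return "face" if d_face < d_non else "non-face"

# Toy prototypes standing in for trained wavelet-packet statistics:
face_protos = [np.array([0.5, 0.3, 0.2])]
nonface_protos = [np.array([0.1, 0.1, 0.8])]
assert classify(np.array([0.45, 0.35, 0.2]), face_protos, nonface_protos) == "face"
```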
3.2 Caption detetion
Our method for aption detetion is espeially designed for being applied to
thediÆultase wheretext issuperimposed onolorimageswith ompliated
bakgroundandisdesribedin[8℄.Ourgoalistominimizethenumberoffalse
alarms and to binarize eÆiently the deteted text areasso that they an be
proessedbystandardOCRsoftware.First,potentialareasoftextaredeteted
byenhanementandlusteringproesses,onsideringmostofonstraintsrelated
to the texture of words. Then, lassiationand binarization of potential text
areasareahievedinasingleshemeperformingolorquantizationandhara-
tersperiodiityanalysis.First resultsusingadataset of200imagesontaining
480linesoftextwithharatersizesrangingfrom8to30,areveryenouraging.
Our algorithm deteted93%of thelines and binarizethem with an estimated
good readabilityrate of 82%. An overall numberof 23 false alarmshave been
found,in areaswithontrastedrepetitivetexture.
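For intuition only, a toy binarization step might look as follows; the naive global threshold is our stand-in for the color quantization and periodicity analysis of [8], which this sketch does not reproduce.

```python
import numpy as np

def binarize_text_area(gray):
    """Toy binarization of a grayscale text patch: characters become 1,
    background 0, ready for OCR. The global mean threshold is a crude
    placeholder for the color-quantization scheme of [8]."""
    t = gray.mean()
    return (gray < t).astype(np.uint8)  # dark characters on light background

patch = np.array([[200, 200, 30], [200, 30, 30]])
assert binarize_text_area(patch).tolist() == [[0, 0, 1], [0, 1, 1]]
```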
4 Shot classification

The automatic detection of human faces and textual information provides users with powerful indexing capacities for the video material. Frames containing detected faces or text areas may be searched according to their number, sizes and locations. The number and size of detected faces may characterize a big audience (multiple faces), an interview (two medium-size faces) or a close-up view of a speaker (a large face). The location and size of text areas help in characterizing the video content, especially in news. In this section, we explain in more detail how such shot classes can be defined, and their instances recognized automatically, in a DL framework.
4.1 Face classes

The first axis for shot classification is the apparent size of the detected faces. Faces are an important semantic marker in images, and they also serve as a very intuitive and immediate spatial reference. With respect to the human figure, cinematographers use a vocabulary of common framings, called 'shot values', from which we have selected five classes, corresponding to cases where the face can be seen and detected clearly. They range from the close-up (CU), where the face occupies approximately half of the screen, to the long shot (LS), where the face is much smaller; the intermediate classes are the medium shot (MS), the medium-close-up (MCU) and the medium-long-shot (MLS) [13].
Shot value classes are usually defined in relative and imprecise terms, based on the distance of the subject to the camera. In order to provide a quantitative definition, we use the fact that in television and film, the apparent size of faces on the screen varies inversely with their distance to the camera (perspective shortening). We therefore compute the quantity d = FrameWidth / FaceWidth and classify the face regions according to five overlapping bins, based on a uniform quantization of d in the range [0,12] (see Table 1). Note that this is consistent with the resolution used (MPEG-1 video with 22 macroblocks per line).
The five face classes shown in Fig. 1 follow immediately from the correspondence shown in Table 1. Given such classes, it is possible to define keyframe classes based on the number and size of their detected faces. When all faces are in a given class (mcu-face), then the keyframe itself can be qualified (mcu-frame). Note that in the case of multiple face classes, we do not attempt to classify the keyframe. But using overlapping face value classes allows us to automatically classify the frame into the common class of all its detected faces, in most practical cases.
Value  CU      MCU         MS          MLS          LS
Size   1/2     1/4         1/6         1/8          1/10
Range  d ≤ 4   2 ≤ d ≤ 6   4 ≤ d ≤ 8   6 ≤ d ≤ 10   8 ≤ d

Table 1. Face sizes, distances and shot values.
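Table 1 can be read as overlapping intervals over d. The sketch below (our illustration; the bin bounds are taken from Table 1, the function names are ours) returns every shot-value class whose bin contains d, and classifies a keyframe by the common class of all its detected faces.

```python
# Overlapping bins over d = frame_width / face_width, after Table 1.
BINS = {"CU": (0, 4), "MCU": (2, 6), "MS": (4, 8), "MLS": (6, 10), "LS": (8, 12)}

def face_classes(frame_width, face_width):
    """All shot-value classes whose bin contains d (bins overlap on purpose)."""
    d = frame_width / face_width
    return {v for v, (lo, hi) in BINS.items() if lo <= d <= hi}

def frame_class(frame_width, face_widths):
    """Common class of all detected faces, or None when they share none."""
    common = set.intersection(*(face_classes(frame_width, w) for w in face_widths))
    return common if common else None

# A face 1/5 of the frame width gives d = 5, in both the MCU and MS bins;
# adding a second face with d = 3 narrows the frame down to MCU.
assert face_classes(352, 70.4) == {"MCU", "MS"}
assert frame_class(352, [70.4, 352 / 3]) == {"MCU"}
```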
4.2 Caption classes

While faces are classified according to their dimension, captions are best classified according to their position on the screen. In many contexts, such as broadcast news, the caption location determines the semantic class of the caption text. As an example, Fig. 2 shows examples of three caption classes: topical (center-left caption), personal (bottom caption) and locational (upper-left caption). In this case, we therefore define three caption classes based on simple geometric tests for bottom, upper-left and center-left captions, as we did with faces. But we propagate the class memberships from captions to frames and shots in a very different way from what we did with shot values, because in this case the presence of a single center-left caption suffices to classify the frame as a topical keyframe, and the shot as a topical shot. Since CLASSIC does not provide the existential operator, this is done with a special-purpose propagation rule, triggered for all such caption instances.
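A minimal sketch of such geometric tests follows, with bounding boxes in pixels; the fractional thresholds and function names are our assumptions for illustration, not values from the paper.

```python
def caption_class(box, frame_w, frame_h):
    """Classify a caption bounding box (x, y, w, h) by screen position.
    Thresholds are illustrative assumptions."""
    x, y, w, h = box
    cx, cy = x + w / 2, y + h / 2       # caption center
    if cy > 0.75 * frame_h:
        return "personal"               # bottom caption
    if cx < 0.5 * frame_w and cy < 0.33 * frame_h:
        return "locational"             # upper-left caption
    if cx < 0.5 * frame_w:
        return "topical"                # center-left caption
    return None

def keyframe_class(caption_boxes, frame_w, frame_h):
    """Propagate: one center-left caption suffices to label the keyframe."""
    classes = {caption_class(b, frame_w, frame_h) for b in caption_boxes}
    return "topical" if "topical" in classes else None

assert caption_class((10, 250, 120, 30), 352, 288) == "personal"
assert keyframe_class([(10, 120, 100, 24)], 352, 288) == "topical"
```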
Shot classification immediately follows from keyframe classification in the case of simple shots (shots with exactly one keyframe). Shots containing more than one keyframe are qualified as composite shots and are only classified as CU, MCU, etc. when all their keyframes are in the same class. In all other cases, we leave them unclassified, for lack of more specific information. Curiously, this limitation coincides with limitations of CLASSIC itself, which can only handle conjunctions of role restrictions, but not negations or disjunctions. In the future, we will investigate other DL systems to overcome this limitation. As another extension, we are currently developing a constraint-based temporal reasoning system on top of CLASSIC, which will allow us to define and classify a composite shot as a zoom in from MS to CU [4].
In some contexts, such as broadcast news or sports, more specialized shot classes can be defined, using simple combinations of the previously introduced classes. For instance, an interview shot can be defined as a one-shot which is both an MCU-shot and a personal-shot. A reporter shot can be defined similarly, as a one-shot, MCU, locational shot. And an anchor shot can be defined as a one-shot, MCU, topical shot. While such classes are only valid within a particular context, they allow useful inferences, especially when dealing with large collections of very similar television broadcasts.
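These conjunctive definitions can be sketched as follows (an illustration of the idea, not the CLASSIC encoding; the class names follow this section, the function signature is ours).

```python
def shot_labels(n_keyframes, face_value, caption_class):
    """Derive specialized labels for a shot from generic classes.
    Each label is a pure conjunction of necessary-and-sufficient
    conditions, in the spirit of CLASSIC defined concepts."""
    labels = set()
    one_shot = n_keyframes == 1
    if one_shot and face_value == "MCU" and caption_class == "personal":
        labels.add("interview")
    if one_shot and face_value == "MCU" and caption_class == "locational":
        labels.add("reporter")
    if one_shot and face_value == "MCU" and caption_class == "topical":
        labels.add("anchor")
    return labels

assert shot_labels(1, "MCU", "topical") == {"anchor"}
assert shot_labels(2, "MCU", "topical") == set()  # composite shot: unlabeled
```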
5 Experimental results and further work
OurshotlassiationsystemhasbeentestedaspartoftheDiVANprototype.
DiVAN is a ditributedaudiovisual arhivenetwork whih usesadvaned video
segmentation tehniques to failitatethe taskof doumentalists, whoannotate
thevideoontentswithtime-odeddesriptions.Inourexperiments,thevideois
proessedsequentially,fromsegmentationtofeatureextration,toshotlassi-
ationandsenegroupings,withouthumanintervention,basedonapreompiled
shottaxonomyrepresentingtheavailableknowledgeaboutaolletionofrelated
televisionprograms.
Shot  Classes
S1    CU
S2    MLS, MS
S3    MCU, MS, Topical, Anchor
S4    CU, MCU, Personal, Interview
S5    MCU, MS, Locational, Personal, Reporter, Interview
S6    CU, MCU, Personal, Interview

Table 2. Shot classification results for Fig. 2.
In Fig. 2, we present some results of the proposed face and text detection methods, with faces and captions appearing in the same frame. Table 2 shows the classification results for those shots. In those examples, it should be noted that multiple or even conflicting interpretations (such as Interview and Reporter shot) are allowed. We believe that such ambiguities can only be resolved by adding more knowledge and more features into the system.
One way of adding such knowledge is to go from detection to recognition. Face and caption recognition enable more powerful indexing capacities, such as indexing sports programs by score figures and player names, or indexing news by person and place names. When detected faces are recognized and associated automatically with textual information, as in the systems Name-it [10] or Piction [5], potential applications become possible, such as news video viewers providing descriptions of the displayed faces, news text browsers giving facial information, or automated video annotation generators for faces.

In order to implement such capabilities, we are developing an algorithm dedicated to face recognition when faces are large enough and in a semi-frontal position [6]. This algorithm directly uses the features extracted in the detection stage. In addition, our algorithm for text detection [8] includes a text binarization stage that makes the use of standard OCR software possible. We are also currently completing our study by using standard OCR software for text recognition. With those capabilities, we will be able to extend the number of shot classes recognized by our system, to recognize shot sequences, such as shot-reverse-shots, and to resolve ambiguous cases, such as determining whether two keyframes contain the same faces or not (within a shot boundary).
6 Conclusions

Based on extracted faces and captions, we have been able to build some useful classes for describing television images. The description logic framework used allows us to easily specialize and extend the taxonomies. Classification of new instances is performed using a combination of numerical methods and symbolic reasoning, and allows us to always store the most specific descriptions for shots or groups of shots, based on the available knowledge and feature-based information.
References

1. Aigrain, Ph., Joly, Ph. and Longueville, V. Medium knowledge-based macro-segmentation of video into sequences. Intelligent Multimedia Information Retrieval, AAAI Press - MIT Press, 1997.
2. Borgida, A., Brachman, R.J., McGuinness, D.L., Resnick, L.A. CLASSIC: A Structural Data Model for Objects. ACM SIGMOD Int. Conf. on Management of Data, 1989.
3. Bouthemy, P., Garcia, C., Ronfard, R., Tziritas, G., Veneau, E. Scene segmentation and image feature extraction for video indexing and retrieval. VISUAL'99, 1999.
4. Carrive, J., Pachet, F., Ronfard, R. Using Description Logics for Indexing Audiovisual Documents. Proceedings of the International Workshop on Description Logics, Trento, Italy, 1998.
5. Chopra, K., Srihari, R.K. Control Structures for Incorporating Picture-Specific Context in Image Interpretation. Proceedings of the Int'l Joint Conf. on Artificial Intelligence, 1995.
6. Garcia, C., Zikos, G., Tziritas, G. Wavelet Packet Analysis for Face Recognition. To appear in Image and Vision Computing, 18(4).
7. Garcia, C. and Tziritas, G. Face Detection Using Quantized Skin Color Regions Merging and Wavelet Packet Analysis. IEEE Transactions on Multimedia, 1(3):264-277, Sept. 1999.
8. Garcia, C., Apostolidis, X. Text Detection and Segmentation in Complex Color Images. IEEE International Conference on Acoustics, Speech, and Signal Processing, June 5-9 2000, Istanbul, Turkey.
9. Ide, I., Yamamoto, K. and Tanaka, H. Automatic indexing to video based on shot classification. Advanced Multimedia Content Processing, LNCS 1554, November 1998.
10. Satoh, S., Kanade, T. Name-it: Association of Face and Name in Video. Proc. of Computer Vision and Pattern Recognition, IEEE Computer Society Press, pp. 368-373, 1997.
11. Gunsel, B., Ferman, A.M. and Tekalp, A.M. Video Indexing Through Integration of Syntactic and Semantic Features. WACV, 1996.
12. Ferman, A.M., Tekalp, A.M. and Mehrotra, R. Effective Content Representation for Video. IEEE Intern. Conference on Image Processing, October 1998.
13. Thompson, R. Grammar of the Shot. Media Manual, Focal Press, Oxford, UK, 1998.
14. Yeung, M. and Yeo, B.-L. Time-constrained Clustering for Segmentation of Video into Story Units. International Conference on Pattern Recognition, 1996.