HAL Id: inria-00507752
https://hal.inria.fr/inria-00507752
Submitted on 30 Jul 2010
Xavier Naturel, Patrick Gros
To cite this version:
Xavier Naturel, Patrick Gros. Dealing with Television Archives: Television Structuring. [Research Report] RR-7301, INRIA. 2010, pp.24. ⟨inria-00507752⟩
ISRN INRIA/RR--7301--FR+ENG
Vision, Perception and Multimedia Understanding
Dealing with Television Archives: Television Structuring
Xavier Naturel — Patrick Gros
N° 7301
June 2010
Perception, Cognition, Interaction
Project-Team texmex
Research Report n° 7301, June 2010, 23 pages
Abstract: This paper investigates the problem of managing very large digital television archives. This problem is called television structuring (or TV broadcast macro-segmentation) and is defined as the process of identifying the structure of a television stream as watchers perceive it: a succession of programs. This is the very first step in order to manage a television collection. In this paper, a complete solution for television structuring is proposed, which makes use of simple yet efficient methods in order to deal with huge datasets. Methods from commercial detection are generalized to be able to distinguish regular programs from non-programs. It is shown how television program guides can be used to label the identified programs. It is finally shown how an update procedure can improve the segmentation results over time. Results are provided on 3 weeks of French television.
Key-words: Video indexing, Television structuring, macro-segmentation, Perceptual hashing
Résumé (translated from French): This research report addresses the structuring of large volumes of television archives. By structuring, we mean identifying the television programs, with their start and end, within the stream, and thus cutting this stream into a succession of programs. This is the very first step in indexing a television stream so that it can easily be browsed and queried. We present a complete solution based on simple methods, so as to be able to process very large amounts of data. We generalize methods from television commercial detection in order to distinguish programs from inter-programs. It is also shown how program guides can be used to label the identified programs. Finally, we propose an update procedure that yields consistent results over time. Results on three weeks of French television confirm the effectiveness of the methods.
Contents
1 Introduction
2 Previous work
3 Structuring method
3.1 Definitions and overview
3.2 Segmentation and classification
3.2.1 Separation detection
3.2.2 Repetition detection
3.2.3 Classification and fusion
3.2.4 Results
3.3 Labeling
3.3.1 Alignment using DTW
3.3.2 Improvements
3.3.3 Results
4 Updating the reference video dataset
4.1 Identification of unknown non-programs
4.2 Identifying trailers
4.2.1 Identifying sequences
4.2.2 Identifying and labeling trailers
4.3 Update procedure and results
5 Conclusion
1 Introduction
Television is an important part of today's sources of information, as well as an important part of our cultural heritage. Information retrieval from television streams is however still in its infancy, whereas the amount of television content is increasing, with an ever larger set of available channels. For example, since September 2006, the National Institute of Audiovisual in France (INA) has been archiving 540,000 hours of television per year. There is thus a need to develop methods to retrieve information from very large television collections, which might come with little or no additional information apart from the video stream itself.
The context of the work is assumed to be television archives. This means that weeks, months, or years of continuous recordings have to be analyzed to extract relevant information. We assume that television program guides are available together with the video data. While this is actually the case in France at INA, it may be different in other countries. In this context, we are interested in finding the structure of the television stream as TV watchers perceive it: a succession of well-identified programs, together with some non-program events (commercials, trailers, sponsoring...). This is what we call television structuring.
More precisely, the goal is first to perform a segmentation of the television stream into programs and non-programs, and in a second step, to label these programs with information coming from the program guide. This process may also be viewed as metadata refinement: the process takes as input the video stream as well as some metadata (the program guide), and outputs a corrected version of the program guide, where the given schedules have been checked in order to match the actual video stream.
This may be seen as a trivial or unnecessary problem, since the EPG can be thought of as self-sufficient. Berrani et al. [1] have shown that this is not the case, and that precise television structuring cannot be achieved with only the information provided by channels and/or broadcasters. They assert the necessity of content-based techniques.
Even if our first goal is archiving, quite a large number of applications can benefit from this conceptually simple task of finding programs in a television stream. In the case of television archives, it is obvious that exact program boundaries should be available when browsing or querying a television corpus. Manual structuring is a very tedious and time-consuming task, and automatic or semi-automatic methods should be investigated to reduce the need for manual parsing and labeling. Monitoring television may also be an application, for example to verify legal regulations about commercials, or to provide statistics about delays between the actual broadcast time and the scheduled one. A more user-oriented application might be to manage recorded programs, allowing features like skipping commercials or finding programs that may not be in the EPG¹.
¹ In the paper, program guides and EPG (Electronic Program Guides) are used as synonyms.
Our approach is stream-based and bottom-up. Three kinds of information are mainly used. Joint silence and monochrome image detection makes it possible to find program boundaries and to detect non-program segments. Repetition detection [15] using a reference video dataset (RVD) finds similar segments appearing several times in the stream and is useful to characterize many non-programs. Finally, the program guide is used to assign labels to program segments [16].
The paper is organized as follows. Section 2 provides a brief overview of the state of the art. Section 3 presents the structuring technique. Section 4 explains the problem of updating the RVD. Section 5 concludes the work.
2 Previous work
Video structuring for specific programs like sports or news is a well-studied domain. The aim is to infer the structure of the program by analyzing the video and audio streams [10, 6, 2]. Studies have also been conducted on collections of programs, leading to non-obvious tasks like topic threading [9]. These works deal with sets of homogeneous programs (mainly news or a specific sport), and look for a structure inside the program itself. They are thus quite different from our task, and are not likely to be suited to finding program boundaries.
A more relevant topic is commercial detection. This is a well-studied domain, where effective solutions have been proposed to identify commercials in a TV stream. Some simple but effective rule-based methods have been proposed, which use detection of monochrome frames and silence between commercials [12, 14, 19]. Classification techniques have also been proposed [5, 22, 13], as well as techniques based on recognition [8, 4, 21, 18]. Commercial detection is of major importance for television structuring because it can detect non-programs. However, some non-programs are not commercials, and these techniques thus have to be extended to handle all types of non-programs.
Very few works have considered television structuring. Liang et al. [11] proposed to detect programs by their lead-in/lead-out. Interesting results are obtained on their dataset but cannot be generalized to other TV channels, which may not flag their programs with systematic lead-ins and lead-outs.
A very different work is proposed by Poli [17]. The basic idea is to implement a top-down approach using a very large set of already annotated data to learn a model of the TV stream. Poli proposes to use a hidden Markov model and a decision tree for that purpose. The result is a weekly program providing an approximate start time, duration, and type for each program and non-program during the week. This model is eventually checked against the stream.
To the best of our knowledge, these two methods are the only ones dealing with television structuring.
3 Structuring method
3.1 Definitions and overview
First, let us define more precisely the notions of program and non-program. Programs are regular television broadcasts which make up the core of a channel's broadcasting (e.g. news, weather forecast, movies, shows...). Non-programs are commercials, sponsoring, or channel promotion (e.g. trailers, jingles).
An overview of the proposed method is shown in figure 1. Three inputs are used: the video stream itself, the program guide, and a Reference Video Dataset (RVD). This RVD is a dataset composed of manually labeled programs and non-programs, with the following information:
- a category (program or non-program)
- a title
Figure 1: Overview of the structuring method (diagram elements: Stream, EPG, Reference video dataset, Segmentation, Labeling, Labeled stream)
The structuring method has two steps: segmentation and labeling. Segmentation cuts the stream into segments², which are then classified as either program or non-program. Labeling assigns labels to every segment classified as a program.
To test the method, a corpus of three weeks of television has been recorded from a French channel (France 2) from 5/9/2005 to 5/30/2005. This corpus is composed of 21 files, each one representing 24 h of TV. Manual structuring has been performed on this corpus to obtain the ground truth.
3.2 Segmentation and classification
The aim of the segmentation is to find the different segments of programs and non-programs. Methods developed in the context of commercial detection are well suited to this segmentation. However, as stated in section 2, the task has to be generalized to detect all kinds of non-programs, e.g. commercials, trailers, jingles, sponsoring... Since trailers and jingles have different characteristics from commercials, in terms of shot length and visual activity, approaches based on classification are not likely to perform well. We focus on methods that are able to generalize to all non-programs: recognition-based methods, and joint detection of silence and monochrome frames.
3.2.1 Separation detection
We call separations the simultaneous occurrences of monochrome frames and silence that happen between commercials. This is a very popular feature for detecting commercials [12, 19, 14] and it is used on every French TV channel.
² The term segment will be used in the remainder of the paper to denote the result of this segmentation process.
Figure 2: Overview of the segmentation and classification processes (diagram elements: Detection of Separations, Detection of Repetitions, Pre-segmentation, Classification/fusion, Final segmentation; phases: Detection phase, Segmentation phase)
To detect monochrome frames, a 48-bin histogram of the luminance channel is first computed. Monochrome frames are then detected by thresholding the histogram entropy. For a histogram h quantized into N bins, the entropy is given by:
$$H = -\sum_{i=1}^{N} p_i \log p_i \quad \text{with} \quad p_i = \frac{h(i)}{\sum_k h(k)}$$
Figure 3 shows a sample of the histogram entropy on 1 hour of our corpus. The threshold is set experimentally to 2.
Figure 3: Variation of the luminance histogram entropy on one hour of TV
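The entropy test above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the natural logarithm is assumed (the report does not state the base), luminance is assumed to be 8-bit, and a frame is assumed to be flagged as monochrome when its entropy falls below the threshold.

```python
import math

def luminance_histogram(pixels, bins=48, max_value=256):
    """Count 8-bit luminance values into `bins` equal-width bins."""
    hist = [0] * bins
    for value in pixels:
        hist[value * bins // max_value] += 1
    return hist

def histogram_entropy(hist):
    """H = -sum(p_i * log(p_i)); empty bins contribute nothing."""
    total = sum(hist)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in hist:
        if count:
            p = count / total
            entropy -= p * math.log(p)  # natural log: an assumption
    return entropy

def is_monochrome(pixels, threshold=2.0):
    """Assumed decision rule: low entropy means a near-uniform frame."""
    return histogram_entropy(luminance_histogram(pixels)) < threshold
```

With 48 bins and the natural logarithm, the entropy ranges from 0 (a single populated bin) to log 48, about 3.87, so a threshold of 2 sits roughly in the middle of the range.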
To detect silence, a very simple method is used. It consists in building overlapping audio frames of 10 ms, and computing the log-energy of each frame using the standard formula:
$$E_{db}(i) = 10 \log_{10} \sum_{n=1}^{N} x_n^2(i)$$
The threshold is set to 60 (see figure 4), and only segments longer than 30 ms are kept. The reason for using such a simple method is given by figure 4, which shows the variation of energy on 1 hour of TV. One can easily see two separations, where the energy is actually zero. This phenomenon has been observed on every French channel. It might be different in other countries.
Figure 4: Variation of the audio energy on a few minutes of TV
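A minimal sketch of this silence detector follows. The function names, the 5 ms hop between overlapping frames, and the small floor that avoids taking the logarithm of zero are assumptions made for illustration; the report only specifies 10 ms overlapping frames, a threshold of 60, and a minimum duration of 30 ms.

```python
import math

def log_energy_db(frame, floor=1e-10):
    """10 * log10 of the summed squared samples; `floor` avoids log(0)."""
    energy = sum(x * x for x in frame)
    return 10.0 * math.log10(max(energy, floor))

def silent_spans(samples, rate, frame_ms=10, hop_ms=5, threshold=60.0, min_ms=30):
    """Return (start_ms, end_ms) spans covered by consecutive quiet frames."""
    frame_len = rate * frame_ms // 1000
    hop = rate * hop_ms // 1000  # hop size is an assumption
    spans, start, end = [], None, None
    for pos in range(0, len(samples) - frame_len + 1, hop):
        if log_energy_db(samples[pos:pos + frame_len]) < threshold:
            if start is None:
                start = pos
            end = pos + frame_len
        elif start is not None:
            spans.append((start, end))
            start = None
    if start is not None:
        spans.append((start, end))
    to_ms = 1000.0 / rate
    # Keep only spans strictly longer than the minimum duration.
    return [(a * to_ms, b * to_ms) for a, b in spans if (b - a) * to_ms > min_ms]
```

The sketch expects a sampling rate of at least 200 Hz so that the hop is at least one sample; real audio rates are far above that.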
The results of the silence and monochrome frame detection are then merged using a successive analysis. Since the audio feature is far more discriminative than the image one, the process consists in taking the segments detected by the audio as candidate segments, and then checking their correctness using the image feature. Results are shown in table 1, where precision and recall are computed over the number of images correctly detected as belonging to a separation.
Modality   Precision   Recall
Audio      0.82        0.9
Image      0.41        0.89
Fusion     1           0.9
Table 1: Separation detection results
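The successive analysis can be illustrated as follows. The report does not say how the image feature confirms an audio candidate, so this sketch assumes a simple rule: a silent interval is kept when it overlaps at least one monochrome interval. The function name and the interval representation are hypothetical.

```python
def confirm_separations(silences, monochromes):
    """Keep each silent interval that overlaps a monochrome interval.

    Both arguments are lists of (start, end) pairs in the same time unit.
    The overlap rule is an assumption, not the report's exact criterion.
    """
    confirmed = []
    for s_start, s_end in silences:
        if any(m_start < s_end and s_start < m_end
               for m_start, m_end in monochromes):
            confirmed.append((s_start, s_end))
    return confirmed
```

Starting from the audio candidates rather than the image ones is consistent with table 1, where audio alone already reaches a precision of 0.82 against 0.41 for the image feature.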
3.2.2 Repetition detection
Television streams are highly redundant. Detecting repetitions can greatly help to uncover the structure of the stream. In particular, all kinds of non-programs are frequently repeated with no modification except for broadcast and compression noise. Because only minor transformations exist between two instances of the same video clip, it is quite easy to detect those repetitions. However, it is important to be able to deal with a very large database and to have a low complexity. Therefore, a repetition detection method has to put the emphasis on those two aspects. A popular method to achieve both efficiency and effectiveness in the context of commercial detection is perceptual hashing [4, 8]. Note that this method is particularly suited to our context because it can deal with all kinds of non-programs.
Figure 5: Image signature computation (pipeline shown: shots, image, DCT, 64 first coefficients C1 ... C64 except DC, binarization w.r.t. median value, signature)
To detect repeated video clips using perceptual hashing, the method proposed in [15] is used. Shots are considered as the recognition unit, i.e. we detect repeated shots. A visual signature is built for each image of each shot. The signature extraction is presented in figure 5. For each image, the DCT is applied to the whole image, on the luminance channel only, and the 8x8 top-left sub-matrix is extracted from the lowest AC frequency coefficients (the DC coefficient is not taken into account). The median value of this sub-matrix is computed, and the coefficients are then binarized according to this median, thus making a 64-bit signature.
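The signature computation can be sketched in pure Python as below. Several points are assumptions: the text does not say exactly which 64 coefficients are kept once the DC term is excluded, so this sketch takes the 8x8 block of frequencies 1 to 8 in both directions (one possible reading); a bit is set when a coefficient is strictly above the median; and the function names are hypothetical. The DCT is left unnormalized, which does not affect the bits because all retained coefficients share the same scale factor.

```python
import math
from statistics import median

def low_freq_dct(image, size=8, skip=1):
    """Unnormalized 2-D DCT-II for frequencies skip..skip+size-1 on each axis.

    `image` is a list of rows of luminance values, at least 9x9 pixels.
    """
    rows, cols = len(image), len(image[0])
    freqs = range(skip, skip + size)
    col_basis = [[math.cos(math.pi * (2 * x + 1) * v / (2 * cols))
                  for x in range(cols)] for v in freqs]
    row_basis = [[math.cos(math.pi * (2 * y + 1) * u / (2 * rows))
                  for y in range(rows)] for u in freqs]
    # Horizontal pass: project every image row on each horizontal basis vector.
    horiz = [[sum(p * b for p, b in zip(row, basis)) for basis in col_basis]
             for row in image]
    # Vertical pass: project the intermediate columns on each vertical basis.
    return [[sum(row_basis[i][y] * horiz[y][j] for y in range(rows))
             for j in range(size)] for i in range(size)]

def frame_signature(image):
    """64-bit integer: one bit per coefficient, set when above the median."""
    coeffs = [c for row in low_freq_dct(image) for c in row]
    med = median(coeffs)
    bits = 0
    for c in coeffs:
        bits = (bits << 1) | (1 if c > med else 0)
    return bits
```

Packing the bits into a single integer makes the signature directly usable as a hash table key, which is what the exact-match retrieval step relies on.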
This signature is sufficiently robust to noise to be queried by exact matching, allowing the use of a fast retrieval structure like a hash table. The retrieval process indeed makes use of a hash table, in which a pair (signature, shot id) is stored for every frame of every shot of the database. When a query is made, i.e. we want to know if a certain shot has a duplicate in the database, the signatures of each frame of this query shot are computed, and then queried one by one against the hash table. If an exact match occurs, that is, a signature s_q of the query shot is equal to a signature s_d in the hash table, a pair (s_d, shot id) is recovered. This shot id gives a candidate shot, which is further analyzed by computing a similarity distance between this candidate shot and the query shot.
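The retrieval step maps naturally onto a dictionary. The sketch below is an illustration only: the names `build_index` and `candidate_shots` are hypothetical, and signatures are assumed to be integers grouped per shot.

```python
from collections import defaultdict

def build_index(shots):
    """Map each frame signature to the set of shot ids that contain it.

    `shots` is a dict {shot_id: [signature, ...]} with one signature per frame.
    """
    index = defaultdict(set)
    for shot_id, signatures in shots.items():
        for sig in signatures:
            index[sig].add(shot_id)
    return index

def candidate_shots(index, query_signatures):
    """Ids of database shots sharing at least one exact signature with the query."""
    candidates = set()
    for sig in query_signatures:
        candidates |= index.get(sig, set())
    return candidates
```

Each lookup costs constant time on average, so a query scales with the number of frames in the query shot rather than with the size of the database; the candidates returned still have to pass the similarity test.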
The similarity distance between shots is defined as the average Hamming distance between the signatures of the retrieved and query shots. This distance makes use of the relative positions of the matched signatures to align the shots, thus gaining robustness to temporal variations and shot segmentation artifacts. To decide whether two shots match, this distance is thresholded.
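One way to read this distance is sketched below. The alignment rule (shift the candidate so that the two matched frames coincide, then average over the overlapping frames) and the threshold value of 8 bits are assumptions for illustration; the report does not give either in this section, and the function names are hypothetical.

```python
def hamming(a, b):
    """Number of differing bits between two integer signatures."""
    return bin(a ^ b).count("1")

def shot_distance(query, candidate, q_pos, c_pos):
    """Average Hamming distance over the frames that overlap once
    query[q_pos] is aligned with candidate[c_pos]."""
    offset = c_pos - q_pos
    distances = [
        hamming(sig, candidate[i + offset])
        for i, sig in enumerate(query)
        if 0 <= i + offset < len(candidate)
    ]
    return sum(distances) / len(distances)

def shots_match(query, candidate, q_pos, c_pos, threshold=8.0):
    """Thresholded decision; the threshold value here is hypothetical."""
    return shot_distance(query, candidate, q_pos, c_pos) <= threshold
```

Because the matched positions fix the offset, two instances of a clip whose shot boundaries were detected a few frames apart can still be compared frame by frame.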
This method is used several times throughout the structuring process. It is used together with the reference video dataset (RVD) previously defined, which is chosen as the first day of our 3-week corpus (5/9/2005) and was manually labeled.
3.2.3 Classification and fusion
Figure 6: The three steps of segmentation: pre-segmentation, classification, and fusion (legend: pre-segments, separations, repeated non-programs, programs, non-programs)
The detection of separations and repetitions yields a pre-segmentation of the stream, as can be seen in figure 6. Once this pre-segmentation is computed, the next step is to classify pre-segments as either program or non-program. The decision is taken by simply thresholding the length of the pre-segment: short pre-segments are classified as non-programs and long ones as programs. The threshold T_s is chosen so as to maximize the F-measure of correct classification on a sample day of our corpus.
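The threshold selection can be sketched as a small search over candidate values. The assumptions here are that the F-measure is computed for the non-program class (the report does not say which class is taken as positive), that candidate thresholds are the observed segment lengths, and that a segment at least as long as the threshold counts as a program. All names are hypothetical.

```python
def classify(length, threshold):
    """Long pre-segments are programs, short ones non-programs."""
    return "program" if length >= threshold else "non-program"

def f_measure(lengths, labels, threshold, positive="non-program"):
    """F-measure of the `positive` class for a given length threshold."""
    tp = fp = fn = 0
    for length, truth in zip(lengths, labels):
        predicted = classify(length, threshold)
        if predicted == positive and truth == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif truth == positive:
            fn += 1
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(lengths, labels):
    """Pick the observed length that maximizes the F-measure."""
    candidates = sorted(set(lengths))
    return max(candidates, key=lambda t: f_measure(lengths, labels, t))
```

On a labeled sample day, `best_threshold` would be called once with the pre-segment lengths and their ground-truth categories, and the resulting value reused on the rest of the corpus.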
Finally, contiguous segments of repetitions, separations, and non-programs are merged into a single non-program segment. Figure 6 sums up the segmentation process.
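The fusion step amounts to a single pass over the time-ordered segments. In this sketch the tuple layout (start, end, kind) and the kind names are hypothetical choices made for illustration.

```python
NON_PROGRAM_KINDS = {"non-program", "separation", "repetition"}

def fuse(segments):
    """Merge runs of adjacent separation, repetition and non-program
    segments into single non-program segments.

    `segments` is a time-ordered list of (start, end, kind) tuples.
    """
    fused = []
    for start, end, kind in segments:
        label = "non-program" if kind in NON_PROGRAM_KINDS else "program"
        if (fused and label == "non-program"
                and fused[-1][2] == "non-program" and fused[-1][1] == start):
            # Extend the previous non-program segment instead of adding one.
            fused[-1] = (fused[-1][0], end, "non-program")
        else:
            fused.append((start, end, label))
    return fused
```

Program segments are never merged with each other, so two programs broadcast back to back keep their own boundaries.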