
Mohammed V University - Souissi, Rabat

École Nationale Supérieure d'Informatique et d'Analyse des Systèmes

ENSIAS - SI3M

2010

Scholarship No. : b13/005

Thesis

Presented in Partial Fulfillment of the Requirements of the Degree of Doctor in « Computer Science »

by

Youness TABII

Sports Video Analysis

Thesis defended on July 10, 2010 at 10:30 am, before the committee :

President : Driss ABOUTAJDINE Professor, FS, Rabat

Reporter : Azedine BOULMAKOUL Professor, FST, Mohammedia

Reporter : Ahmed HAMMOUCH Professor, ENSET, Rabat

Reporter : Hamid ZOUAKI Professor, FS, El Jadida

Examiner : Bouchaib BOUNABAT Professor, ENSIAS, Rabat


To my mother, who shall always be with me within my heart, and to my father, for his love, sacrifice, encouragement, and patience


Acknowledgment

In the name of Allah, the Most Gracious, the Most Merciful.

"And say: Work, for Allah will see your work, and so will His Messenger and the believers." (Quran 9:105)

Allah the Almighty has spoken the truth.

First and foremost, I would like to thank my advisor, Prof. Rachid OULAD HAJ THAMI. Without his inspiration, confidence and support, this dissertation would have remained merely a dream. Ever since I joined the SI3M Laboratory, he has always been there to provide guidance, friendship, and encouragement, and I just cannot thank him enough.

I particularly thank Mr. Azedine BOULMAKOUL, Professor at FST Mohammedia, Mr. Ahmed HAMMOUCH, Professor at ENSET Rabat, Mr. Hamid ZOUAKI, Professor at FS El Jadida, and Mr. Bouchaib BOUNABAT, Professor at ENSIAS Rabat, for having agreed to judge this work. I thank them for their advice and the suggestions that helped me improve this manuscript. Please accept, gentlemen, my sincere thanks for your presence on the jury.

Special thanks to Mr. Driss ABOUTAJDINE, Professor at the Faculty of Sciences, Rabat, and member of the Hassan II Academy of Science and Technology, for chairing the jury and for having agreed to judge this work.

I would like to extend my thanks to my colleagues in the SI3M laboratory; I express to them my deepest sympathy and wish them the best.

I will not forget FAZ, for helping me find the correct words and sentences in English; without her advice, support, assistance and encouragement this thesis might not have been completed. Thank you FAZ.

Last, but most importantly, I would like to dedicate this thesis to my mum and dad to express my deepest gratitude. They are the best parents, always so willing to give me the best in life. I thank them for all that they have given me. This dissertation is a tribute to their love, patience, sacrifice and understanding. To my sister Mounia and my brother Karim, for too many things to mention.

Youness TABII

ENSIAS, July 28, 2010.


Abstract

The rapidly increasing power of computers has led to the development of novel applications in the multimedia world, like image processing, video processing and analysis, and image/video databases. These applications take advantage of the increasing processing power and storage of computers to rapidly process large amounts of data. The challenge has now become one of developing suitable tools and methods for manipulating the data that has become available. Given the enormous amount of information contained in a multimedia data stream, a deeper comprehension of the data stream may need to be achieved through the integration of many independent analyses of its different aspects.

The processing and analysis of video is still a young field of research and attracts a large number of researchers. In this field, experts do not process the whole video at once, because it is difficult and sometimes impossible to extract the contents of the video, especially if it contains many scenes, colors, actors, etc. Instead, they decompose the video into small coherent segments, called shots; a shot should contain the same high-level (scene, dialogue, debate, etc.) and low-level (color, forms, actors, dots, etc.) features. After segmenting the video into shots, and in order to extract the contents of the video (content understanding : semantics), the shots must be classified into classes; this classification allows us to cluster the shots that have the same (high- and low-level) features.

In general, we cannot analyse and process videos of different genres (movie, documentary, news, sports, etc.) with the same methods; instead we need to study each video domain, extract its features, and understand how the directors of such videos organize the shots. This study allows us to identify concepts and rules that govern each video domain, known collectively as domain knowledge. The domain knowledge of a given domain is all the information perceived, discovered or learned in that field; this information changes from one domain to another and from one subject to another. For these reasons, researchers develop a video processing system for each specific video domain, integrating segmentation and classification, the steps common to all systems. Domain knowledge is thus the factor that differentiates between domains in video processing and analysis.

This thesis focuses on existing methods in the field of video processing and analysis. We propose a new method for segmenting video into shots, an algorithm for classifying these shots in the case of soccer video, and a new algorithm for score box detection in soccer match video based on motion vector computation. We exploit the domain knowledge of soccer video to generate summaries and highlights. The developed algorithms are integrated in one framework. We conclude with a presentation of further ideas to improve our framework as future work.


Résumé

The rapid evolution of computers has led to the development of new applications in the multimedia world, such as image processing, video processing and analysis, and image/video databases. These applications take advantage of the growing processing power and storage capacity of computers to rapidly process large quantities of data. The challenge is now to design tools and methods suited to manipulating the multimedia data that has become available everywhere. Given the enormous amount of information contained in a multimedia data stream, a deeper understanding of the stream is best achieved by integrating many independent analyses of its different aspects.

Video processing and analysis remains a young research field and attracts many researchers. In this field, experts do not process the video all at once, because it is difficult and sometimes impossible to extract the content of the video, especially if it contains many scenes, colors, actors, etc. Instead, they decompose the video into small coherent segments called shots; a shot is supposed to contain the same high-level (scene, dialogue, debate, etc.) and low-level (color, forms, actors, etc.) features. After segmenting the video into shots, and in order to extract the content of the video (content understanding : semantics), these shots must be classified into classes; this classification allows us to group the shots that have the same (high- or low-level) features.

In general, we cannot process and analyse videos from different domains (film, documentary, news, sports, etc.) in the same way; we need to study the domain, extract its characteristics, and understand how the directors of this type of video organize the shots. This study allows us to identify some concepts and rules that govern the domain, known as domain knowledge. The domain knowledge of a given domain is the set of information perceived, discovered or learned in that domain; this information changes from one domain to another and from one subject to another. For these reasons, researchers develop specific video processing systems for each domain, integrating segmentation and classification, which are steps common to all systems. Domain knowledge is the factor that differentiates the domains.

This thesis focuses on existing methods in the field of video processing and analysis. We propose a new method for segmenting videos into shots, as well as an algorithm for classifying these shots in the case of soccer video. We exploit the domain knowledge of soccer video to generate summaries and digests of the important moments of a match. The developed algorithms are integrated into a single framework. We conclude by presenting other research directions for improving our framework.

Contents

Contents
List of Figures

1 Introduction

2 Sports video analysis : State of the art
2.1 Introduction
2.2 Video structure
2.3 Soccer video structure
2.4 Structural analysis for sports videos
2.4.1 Play and break detection
2.4.2 Event detection
2.4.3 Replay detection
2.4.4 Highlight detection
2.5 Conclusion

3 A framework for automatic sports video analysis
3.1 Introduction
3.2 Sports video analysis and processing
3.3 Proposed framework for sports video processing
3.4 MPEG standards
3.5 Conclusion

4 Dominant color detection
4.1 Introduction
4.2 The color spaces
4.3 Dominant color extraction
4.4 Results
4.5 Conclusion

5 Shot boundary detection and classification
5.1 Introduction
5.2 Methods in the literature on shot-boundary detection
5.2.1 Pixel based methods
5.2.2 Block based methods
5.2.3 Histogram based methods
5.2.4 Compressed domain methods
5.2.5 Scene detection
5.3 Shot-boundary detection
5.3.1 Discrete cosine transform (DCT)
5.3.2 Discrete cosine transform multi-resolution (DCT-MR)
5.4 Methods in the literature on classification
5.5 Shot classification
5.5.1 Key frame extraction
5.5.2 Domain knowledge
5.5.3 Spatial segmentation
5.5.4 Classification process
5.6 Performance measures
5.7 Shot detection results
5.8 Shot classification results
5.9 Conclusion

6 Text detection and recognition
6.1 Introduction
6.2 Methods in the literature on text detection and recognition
6.3 Proposed method
6.3.1 Score box sub-block extraction
6.3.2 Score box detection
6.3.3 Pre-processing
6.3.4 OCR recognition
6.4 Experimental results
6.5 Conclusion

7 Video summaries and highlights extraction
7.1 Introduction
7.2 Audio descriptor
7.3 Soccer domain knowledge
7.4 Finite state machine
7.5 Video XMLisation
7.6 Summaries and highlights generation
7.7 Experimental results
7.8 Conclusion

General conclusion
7.9 General conclusion
7.10 Future work

A Appendix
Bibliography
Notations

List of Figures

2.1 Video content
2.2 (a) Video structure, (b) video sub-element
2.3 Play field area
2.4 Soccer video structure
2.5 (a) American football line-up event. (b) Tennis starting serve
3.1 Three-level event definition and detection hierarchy adopted in many sports video processing/analysis works
3.2 The flowchart of the proposed framework for sports video processing and analysis
3.3 Video sequence in an MPEG stream
4.1 Light dispersed by a prism into all visible wavelengths
4.2 RGB color space
4.3 CMY color space
4.4 HSV color space
4.5 HSL color space
4.6 YUV color space (with Y=0)
4.7 x̄(λ), ȳ(λ), z̄(λ) CIE 1931 Standard Colorimetric Observer functions
4.8 CIE L*a*b* (CIELAB) color space
4.9 Lighting conditions
4.10 The flowchart of the proposed dominant color region detection algorithm
4.11 Example of dominant color extraction
4.12 Result of dominant color extraction and binarization : (a) RGB frame, (b) HSV binarization, (c) L*a*b* binarization, (d) combined HSV-L*a*b* binarization
4.13 Binarization result for soccer frames
4.14 Binarization result for golf frames
4.15 Binarization result for US football frames
5.1 Three types of camera breaks : (a) Cut, (b) Wipe, (c) Dissolve
5.2 Shot boundary detection system : the frames are used to compute z(k, k+L), in this case L=2, and the shot detector inserts a boundary if z(k, k+L) > T
5.3 Flowchart for shot change detection
5.4 Stages for DCT-MR and shot change detection
5.5 Adjacent pixel representation
5.6 Flowchart for shot classification using domain knowledge and spatial segmentation of key frames
5.7 The four defined classes : (a) Long view (LS), (b) Medium view (MS), (c) Close-up view (CpS) and (d) Out of field view (OFS)
5.8 Domain knowledge presentation of zoom in/out and transitions
5.9 3:5:3 format representation in an RGB frame
5.10 Different frames in 3:5:3 format
5.11 3:5:3 format representation in binary frames
5.12 Long shot models
5.13 Medium shot model
5.14 Close-up and out-of-field shot models
5.15 Result of coarse spatial representation
5.16 Frames from clips
5.17 Result for 400 I/P frames with resolution R=5
6.1 Processes in video OCR
6.2 Flowchart of score box detection and text recognition
6.3 Score box position
6.4 Team name and match result
6.5 Player substitution
6.6 Frame sampling, sub-block extraction, motion vector computation and text detection
6.7 Diamond search algorithm
6.8 Result of score box detection
6.9 Result of score box detection
7.1 Windowing of the audio track
7.2 Adopted finite state machine for soccer video
7.3 Representation of video by sequences of alphabet and/or words
7.4 Structure of the XML file for video description
7.5 Screen snapshot of the XML file
7.6 Last step in the summaries and highlights extraction algorithm


Chapter 1

Introduction

Motivations

In the last decade, the rapid increase in the adoption of video production devices, such as digital cameras and camcorders, along with advances in video compression and the increasing usage of the internet and wireless communication, such as the 3G and 4G standards, have enabled the production, storage and sharing of large amounts of video. The ubiquitous consumption of video, however, poses many problems; among these, the field of multimedia processing focuses on the effective description of video information and the extraction of such information for fast and easy access to the relevant set at a later time (video querying / video search and retrieval). This thesis work has been inspired by the lack of tools that aim to solve these challenging problems.

The defining characteristic of a video document is its ability to convey a rich semantic presentation through synchronized audio, visual and text presentations over a period of time. Nowadays, we find several genres of video, such as drama movies and science fiction movies, and, on the other hand, sports video. Each genre of video has its own editing style, syntax, semantics and rules for generating the final sequence presented to the viewers. Sports video poses some unique challenges, namely : 1) each sports genre has context-dependent characteristics, such as a special game structure and camera views; 2) sports video is recorded without control over the script and setting/environment, therefore the temporal structures are difficult to predict and background noise cannot be avoided; 3) sports video is broadcast with different styles of editing effects (e.g. slow motion replay and text displays) depending on the broadcaster; 4) sports video can be used for various purposes, such as entertainment, performance analysis and refereeing.


Aims

This thesis focuses on exploring techniques for summarization and highlights extraction in sports video.

The primary aims are to develop, implement and test :

1. An enhanced technique for dominant color extraction in soccer video

2. A novel technique for shot detection

3. A new method for shot classification

4. A new method for score-box detection and text recognition using an optical character recognition algorithm

5. A framework for soccer video summaries and highlights extraction using audio/video features with a finite state machine

Thesis Outline

The thesis is organized in the following manner :

Chapter 2

The second chapter presents a state of the art related to sports video processing and structural analysis, such as play/break detection, event/highlight extraction and replay detection. We also give a structural model of soccer video for later analysis. This structure can be used for the processing and analysis of sports video content.

Chapter 3

In the third chapter, we introduce a framework for sports video processing and analysis based on new algorithms. The framework combines audio stream and video stream analysis, and uses domain knowledge to model the structure of the sports video in the case of soccer. The framework aims to extract high-level features of sports video, such as summaries and/or highlights.

Chapter 4

In the fourth chapter, we present an enhanced algorithm for dominant color detection that automatically detects the color of the sports field and adapts to spatio-temporal variations in the dominant color. Such variations may be caused by differences in official field color specifications, changes in environmental factors such as sunset, spatio-temporal variations in illumination intensity due to irregularly positioned and/or imperfect stadium lights, and spatial variations in field properties due to deformations. Because of the limitations of a single color space, we use two spaces : HSV and L*a*b*.
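To make the two-color-space idea concrete, here is a minimal sketch in Python with OpenCV, assuming the dominant hue (HSV) and dominant a* chroma (L*a*b*) can be estimated from histogram peaks and the two binarizations combined, as in Figure 4.12(d). The tolerance values and the peak-based estimator are illustrative assumptions, not the exact algorithm of Chapter 4.

```python
import cv2
import numpy as np

def dominant_field_mask(frame_bgr, hue_tol=15, a_tol=20):
    """Binarize a frame against its dominant (field) color in HSV and
    L*a*b* and keep pixels accepted by both spaces. Tolerances are
    illustrative; hue wrap-around is ignored for brevity."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    lab = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2LAB)

    # Dominant values taken as the histogram peaks of H and a*.
    h_peak = int(np.argmax(cv2.calcHist([hsv], [0], None, [180], [0, 180])))
    a_peak = int(np.argmax(cv2.calcHist([lab], [1], None, [256], [0, 256])))

    mask_hsv = np.abs(hsv[:, :, 0].astype(int) - h_peak) <= hue_tol
    mask_lab = np.abs(lab[:, :, 1].astype(int) - a_peak) <= a_tol

    # Combined binarization, cf. Figure 4.12(d).
    return ((mask_hsv & mask_lab) * 255).astype(np.uint8)
```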

Chapter 5

In this chapter, two novel algorithms are presented : sports video shot-boundary detection and shot-type classification. The proposed algorithms are applicable to multiple sports, such as US football, golf and any other sport whose playing field is green.

Specifically, we propose :

– A new shot-boundary detection algorithm that reliably detects both abrupt and gradual shot transitions and is robust to large camera and object motion

– A new shot-type classification algorithm that classifies shots into specific classes defined by domain knowledge, in the case of soccer video

Chapter 6

This chapter introduces a new method for the detection of the score box superimposed on sports video with high motion, and for the recognition of the text written in this extracted score box. The method is based on a block matching algorithm and motion vector computation. After motion vector computation, we use the 2D variance to extract the score box. For text recognition, we use optical character recognition (OCR) and morphological constraints in order to reduce noise and to reconnect characters lost from complete words.

Chapter 7

The video summaries and highlights extraction chapter gathers all the algorithms proposed in the previous chapters to retrieve summaries and highlights from soccer video. These algorithms can be adapted easily to other sports videos, especially as most of them have generic behavior.


Journal and conference publications based on this thesis work include :

Journals :

[67] Youness TABII, Rachid OULAD HAJ THAMI, "A Framework for Soccer Video Processing and Analysis Based on Enhanced Algorithm for Dominant Color Extraction", Computer Science Journals : International Journal of Image Processing, Vol. 3, Issue 4, pp. 131-142, ISSN 1985-2304, September 2009.

[65] Youness TABII, Rachid OULAD HAJ THAMI, "A Method for Automatic Score Box Detection and Text Recognition in Soccer Video", International Review on Computers and Software, Vol. 4, No. 2, pp. 188-192, ISSN 1828-6003, March 2009.

Conferences :

[64] Youness TABII, Rachid OULAD HAJ THAMI, "An Efficient and Simple Framework for Soccer Video Summaries and Highlights Extraction", IPCV'09 (WORLDCOMP'09) : International Conference on Image Processing, Computer Vision, and Pattern Recognition (ISBN 1-60132-117-1), pp. 461-465, Las Vegas, Nevada, USA, 13-16 July, 2009.

[66] Youness TABII, Rachid OULAD HAJ THAMI, "A New Algorithm for Soccer Video Summarizing Based on Shot Detection, Classification and Finite State Machine", SETIT'09 : 5th International Conference : Sciences of Electronics, Technologies of Information and Telecommunications (IEEE), book chapter (ISBN 978-9973-0-0122-1), Hammamet, Tunisia, March 22-26, 2009.

[62] Youness TABII, Rachid OULAD HAJ THAMI, "A new method for soccer shot detection with multi-resolution DCT", CORESA'07 : COmpression et REprésentation des Signaux Audiovisuels, Montpellier, France, 8-9 November, 2007.

[68] Youness TABII, Mohamed OULD DJIBRIL, Youssef HADI, Rachid OULAD HAJ THAMI, "A new method for video soccer shot classification", VISAPP'07 : 2nd International Conference on Computer Vision Theory and Applications (ISBN 978-972-8865-73-3), pp. 221-224, Barcelona, Spain, 8-11 March, 2007.

[63] Youness TABII, Rachid OULAD HAJ THAMI, "Classification des plans dans les vidéos de sport : cas de football", SITA'08 : Conférence sur les Systèmes Intelligents : Théories et Applications, Rabat, Morocco, 5-6 May, 2008.

[61] Youness TABII, Rachid OULAD HAJ THAMI, "A method for cut detection in soccer video", WOTIC'07 : Workshop sur les Technologies de l'Information et de la Communication, Rabat, Morocco, 5-6 July, 2007.

Others :

[53] Abdel Alim SADIQ, Youness TABII, Rachid OULAD HAJ THAMI, "Indexation des objets 3D avec les graphes de Reeb multi-résolution", WOTIC'07 : Workshop sur les Technologies de l'Information et de la Communication, Rabat, Morocco, 5-6 July, 2007.


Chapter 2

Sports video analysis : State of the art

Digital video has become a major information storage and exchange medium in our modern era. It plays a very important role in current multimedia computing and communication environments, with various applications in entertainment, broadcasting, education, publishing, etc.

In this chapter, we shall present the structure of video (video components) in general (movie, news, sport, documentary, etc.). Then, we will give a model of soccer video structure based on the style of editing and broadcasting in this kind of sports game. Finally, we will present a number of works in the literature related to sports video processing and analysis.


2.1 Introduction

Database management systems are a part of the normal day-to-day operation of computer systems. The types of databases available on computers today are mainly limited to alphanumeric databases. In the recent past, some research has been done on image/video data management systems. With the coming of age of digital video technology, video data promises to be a ubiquitous medium for the representation of information in computer systems. The need for managing video data on computer systems is growing.

Let us pose the question : what is a video ? Video is an audio-visual medium of information presentation. Figure 2.1 below shows a high-level view of the content of video. In this figure the content of the video has been grouped into two types :

Figure 2.1 – Video content

Information Content : This is the message or information conveyed by the video. For example, after watching news about sports, the viewer acquires information from the video about several events, like the type of sport, where it was played, who played, etc. This information is conveyed to the viewer via the audio-visual medium of video.

Audio-Visual Content : This is the audio-visual content of the video. It includes the video clips and audio signals. For example, in sports video, in the case of a soccer match at goal time, the viewer sees the location of the goal and hears the associated sound track (the supporters celebrating the goal, the commentator announcing the goal with the word Goal or But). Depending on how the video was produced, the same information content can be presented through an infinite number of different audio-visual presentations.

The key distinction between the information content and the audio-visual content is the amount of contextual information and knowledge required to extract each of them. The information content of video requires the use of contextual information along with a large body of associated knowledge, whereas the audio-visual content is primarily oriented towards the aural and visual senses and does not require an understanding of the information. The audio-visual content can be extracted from video using capabilities like speech recognition, image understanding and interpretation.

2.2 Video structure

Structural video models represent video as a union of smaller coherent units that are obtained by a temporal or a spatio-temporal segmentation process. The boundaries of these temporal units (segments) correspond to large differences in some feature space, while a temporal unit has similar features within itself. These features are usually a combination of color, texture, shape, and motion, commonly referred to as low-level features.

The video can be trimmed into a set of elements (Figure 2.2(b)), and each element can in turn be trimmed into a set of sub-elements; between those elements we find a transition (a large difference in low-level features). From this point of view, the video can be trimmed into scenes; within those scenes we find a set of shots, so a scene too can be clipped into shots; and a shot is like a container of frames (Figure 2.2(a)), in which we find objects (actors, forms, etc.). A frame presents only one feature, the color, and we use this single feature to extract a description of the whole video and to retrieve its semantics. Domain dependency tends to increase, and computability/analysis accuracy tends to decrease, toward the higher structural levels. Specifically, in frame-level analysis, low-level features such as color, texture, shape, motion and audio are generally used, and the analysis requires no or minimal domain knowledge. At this level, many shot boundary detection (SBD) methods have been proposed to segment video into shots, each of which can then be represented by one or a few key frames from the shot.
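As a concrete baseline for the SBD step mentioned above, the following sketch declares a shot boundary whenever the frame-to-frame color histogram distance exceeds a threshold. This is the generic histogram-difference detector from the literature, not the DCT-MR method proposed in Chapter 5; the bin counts and the threshold are illustrative.

```python
import cv2

def detect_shot_boundaries(video_path, threshold=0.5):
    """Histogram-difference shot boundary detection: flag frame k when
    the distance between consecutive frame histograms is large."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, k = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        hist = cv2.normalize(hist, None).flatten()
        if prev_hist is not None:
            # Bhattacharyya distance: near 0 within a shot, large at cuts.
            d = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
            if d > threshold:
                boundaries.append(k)
        prev_hist, k = hist, k + 1
    cap.release()
    return boundaries
```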



Figure 2.2 – (a) Video structure, (b) video sub-element.

Before we go into the details, it will be beneficial first to introduce some important terms used in digital video analysis research field.

– Video shot : a consecutive sequence of frames recorded from a single camera, sharing the same low-level features.

– Video scene : defined as a collection of semantically related and temporally adjacent shots, depicting and conveying a high-level concept or story. While shots are marked by physical boundaries, scenes are marked by semantic boundaries. (Some of the early literature in video parsing misused the phrase scene change detection for shot boundary detection; to avoid any later confusion, we will use shot boundary detection to mean the detection of physical shot boundaries and scene boundary detection to mean the detection of semantic scene boundaries.)

– Play and Break : the first level of semantic segmentation in sports video and surveillance video. In sports video (soccer, baseball, golf, US football and rugby), a game is in play when the ball is in the field and the game is going on; break, or out of play, is the complement set : whenever "the ball has completely crossed the goal line or has touched line, whether on the ground or in the air" or "the game has been halted by the referee" [77], the game is out of play. In surveillance video, a play is a period in which there is some activity in the scene.

– Key frame : is the frame which represents the salient visual content of a shot. Depending on the complexity of the content of the shot, one or more key frames can be extracted.

– Video Marker : is a contiguous sequence of video frames containing a key video object that is indicative of the events of interest in the video. An example of a video marker for baseball videos is the video segment containing the squatting catcher at the beginning of every pitch.

– Audio Marker : is a contiguous sequence of audio frames representing a key audio class that is indicative of the events of interest in the video. An example of an audio marker for sports video can be the audience reaction sound (cheering and applause) or the commentator's excited speech.

– Highlight Candidate : is a video segment that is likely to be remarkable and can be identified using the video and audio markers.

2.3 Soccer video structure

The editors of soccer video follow very regular steps during the broadcasting and recording of soccer matches. These steps are largely common among editors, giving television viewers a global view that helps them understand what happened and follow the match. This set of common stages can be called the Editing Style (ES).

Figure 2.3 shows the play field of soccer video with the different regions filmed by match editors.

Based on the segmentation of the play field presented, we propose a model of soccer video structure that can be used for soccer video processing and analysis (Figure 2.4).

Soccer videos are composed of a regular set of units that editors use in a reliable, periodic manner; we present this set of units as a hierarchy tree describing soccer video (Figure 2.4).


Figure 2.3 – Play field area

Figure 2.4 – Soccer video structure

The soccer video game can be represented and segmented as shown in Figure 2.4 : the "Non Play" shots include interview shots in the soccer live broadcast, for example studio discussion between an interlocutor and guests, and advertisements. The "Play" class can be further divided into "Game Play", "Replay" and "Slow Motion Replay". In "In Field" shots, during the match when the ball is in play, the camera view mainly focuses on the field with a "Medium", "Long" or "Close-up" view of the action. In "Out Field" shots the editors show the "Stadium", "Coach" or "Substitution" view, if any. Based on the defined areas of the playing field, the "Long View" shots can be divided into the following classes :


– Goal Area : includes the left and right goal areas; the meaning of "left" and "right" is the position and the motion of the main camera relative to the playing field.

– Non-Goal Area : the "Non-Goal Area" class includes "Left Field", "Middle Field" and "Right Field"; in the left and right fields we find the corners, two on each side, namely the "Left Corner" and the "Right Corner".

2.4 Structural analysis for sports videos

In this section we introduce methods existing in the literature. Researchers in the sports video analysis and processing field try to retrieve the semantics of sports video using low-level features based on segmentation, color and the audio stream. Video semantics is a wide field, and researchers are still investigating this notion and trying to describe it. To achieve this aim (finding the semantics of a video, also called its high-level features), most researchers use low-level features to extract the high-level features through segmentation of the video into play/break, replay extraction, event detection and highlight detection.

2.4.1 Play and break detection

The sports video is composed of play events and break events. The play events refer to the times when the ball is in play, and the break events refer to intervals of stoppage in the game. Many researchers have reported play-break detection methods. In [77], six HMM topologies were trained for segmentation and classification of play and break in one pass. In [22], play/break was detected by thresholding the duration of the time interval between consecutive long shots; Xu et al. [78] segmented the soccer game into play/break by classifying the respective visual patterns, such as shot type and camera motion, during play, break and play/break transitions.

Soccer is a fast-moving game, so close-up shots sometimes represent a break in play : for example, after a goal or foul, before a kick-off, a corner, a goal kick or a free kick, or during the substitution of players. The main objects in close-up shots are players, referees and sometimes the coaches.

In other works, researchers analyze similar sports videos, such as US football and tennis. Li and Sezan [36, 35] proposed an algorithm for American football play-break event detection. The proposed method uses football players' lining up (Figure 2.5(a)) to detect the start of the play, while the end of the detected play is assumed to be signaled by the start of a new shot. The algorithm involves grass region detection, selection of frames having a large grass ratio, line detection, and color-based blob verification to detect the lining up of the players. The introduced algorithm, although efficient, relies strictly on domain rules. A similar type of domain information is used in [86], where tennis games are classified into serves and non-serves by modeling the start of the serve and the field structure (Figure 2.5(b)). Unlike the previous algorithms, which utilize only visual information, Dahyot et al. [18] use both audio and video information to detect plays in tennis broadcasts. Because during tennis plays the ball goes back and forth between two players and a global view of the field is shown, the audio features resulting from ball hits are fused with the edge features of the field to detect play events.

Figure 2.5 – (a) American football line-up event. (b) Tennis starting serve

2.4.2 Event detection

In the last few years, event detection research has formed the largest body of content-based sports video analysis work. In the literature, researchers have proposed a number of methods and various algorithms for event detection in sports video. These methods aim at the automatic recognition and extraction of interesting or significant events from sports video to generate condensed summaries and/or textual indexes labeling the events. For example, in soccer and US football the researchers label the video with the number of goals, and in tennis with the number of sets.

In recent years, researchers have proposed methods that use multi-modal analysis techniques for event detection [58] (visual, audio and text analysis), in contrast to older algorithms that used only one modality. In [43], audio, video and superimposed text annotation information were utilized to extract highlights from television Formula One programs. A mixture of cinematic and object descriptors is used in [87, 86] to index basketball video; scene cuts and camera motion parameters were used for soccer event detection in [34], where the authors showed that unreliable detection would occur if limited cinematic features were used; camera motion and object-based features were employed in [8] to detect soccer events; Closed Caption (CC) information and visual features were integrated in [10] for event-based football video indexing. In [43, 24], audio information was used jointly with video/text features for content characterization of sports video.

2.4.3 Replay detection

In sports videos the editors always repeat the important moments, sometimes in slow motion and sometimes at normal speed. Hence replay scenes are excellent indicators of semantically important segments, which is useful for many sports highlight generation applications. The most common characteristic of replay scenes is that they are usually played in a slow-motion manner. Such a slow-motion effect is produced either by repeating frames of the original video sequence or by playing a sequence captured from a high-speed camera at the normal frame rate.

The first slow-motion effect, by repeating frames, results in the presence of still and shifted frames in the replay video. These frames generate unique feature patterns in the frame differences [21, 47], macro-block types [32], vector flow and encoding size [31], etc., and hence can be used to detect replays. However, if the replays are produced from a high-speed camera, the above-mentioned methods cannot be used, and other replay detection techniques have been proposed. Nitta and Babaguchi [45] and Babaguchi and Nitta [12] used a Bayesian Network together with six textual features extracted from Closed Captions (CC) to detect replay shots; Pan and Sezan [46] proposed to detect the transition effect, for example a flying logo, before and after the replay segments; Tong et al. [71] introduced a method to automatically discover the flying logo for replay detection; Mihajlovic and Petrovic [43] and Babaguchi et al. [11] attempted to model the "Digital Video Effect (DVE)" and locate pairs of DVEs for replay identification.

2.4.4 Highlight detection

Sports highlight detection research aims to extract the most interesting segments (known as game highlights) automatically from the full-length video. Systems that use visual information often define interesting segments as slow-motion replays or high motion activity segments. Audio highlights are defined as high-volume or excited human speech (commentator, audience) segments.

The segments containing important moments (highlights) in sports video can usually be distinguished by the detection of certain low-level features, such as the occurrence of a replay scene [47], excited audience or commentator speech [51, 72, 73], certain camera motion [50, 42] or certain highlight-related sounds [51, 82]. In [26], the authors introduced a generic game highlight detection method based on exciting segment detection. The exciting segments were detected using empirically selected cinematic features from different sports domains, weighted and combined to obtain an excitement level indicator. Tjondronegoro et al. [70] detected whistle sounds, crowd excitement, and text boxes to complement previous play-break and highlight localization methods, generating more complete and generic sports video summarizations. Generally, the sports video highlight detection task is often associated with sports video event detection; the difference between the two is that the latter goes further, recognizing the type of event in the detected game highlight.
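The weighted-combination idea behind the excitement indicator of [26] can be sketched as follows; the feature names, weights and threshold below are hypothetical placeholders, since the empirically selected cinematic features differ from one sports domain to another.

```python
def excitement_level(features, weights):
    """Combine normalized per-segment features into one excitement
    indicator as a weighted sum, sketching the idea behind [26]."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical per-segment features, each normalized to [0, 1]:
segment = {"audio_energy": 0.8, "motion_activity": 0.6, "replay": 1.0}
weights = {"audio_energy": 0.4, "motion_activity": 0.3, "replay": 0.3}

if excitement_level(segment, weights) > 0.7:  # illustrative threshold
    print("candidate highlight segment")
```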

2.5 Conclusion

In this chapter we have presented the structure of videos in general and the structure of sports video, and afterward proposed a partitioning model of soccer video. A state of the art of sports video processing and analysis was presented, exhibiting the most-used methods and features for play/break, event, replay and highlight detection.


Chapter 3

A framework for automatic sports video analysis

Sports video analysis has become a new research domain in the last decade. This new domain attracts many researchers who contribute to developing and enhancing systems (frameworks) for sports video processing and analysis. The need for such frameworks grows with the number of viewers and supporters of various sports games, in particular team sports, and with the rapid advances in information technology on two sides, software and hardware, which are the materials of video recording.

In this chapter, first and for ease of reference, we present some keywords used in video processing and analysis. Second, we introduce our framework for sports video processing and analysis. Our framework uses new algorithms : an enhanced algorithm for dominant color extraction; a new algorithm for shot boundary detection; a new algorithm for shot classification in the case of soccer match video; and a new method for score box detection and text recognition using optical character recognition (OCR).


3.1 Introduction

Sports broadcasts constitute a major percentage of total public and commercial television broadcasts. The growing demands of viewers require developments in video capturing, storage, delivery, and video processing capability. With the creation of large storage capacities and more TV channels with full coverage of large sports events, the organization of and search in these data sets become more appealing. But this situation poses a challenging research problem : how to quickly find the interesting video segments for various consumers with differing preferences. In other words, consumers want a system that allows them to retrieve specific segments quickly from the huge volume of available sports video, thus saving time. Other growing demands of viewers are new enhancements and various techniques that provide a better viewing experience, or that even generate the feeling of actually taking part in the sports event, instead of only watching a transmission of a video captured by a single camera. All of these require research in the fields of reconstruction, virtual reality generation, enrichment, etc.

3.2 Sports video analysis and processing

Sports video analysis and processing is concerned with the extraction of semantics (video understanding) through efficient and effective analysis of visual, audio, textual and any other available information.

Content-based video retrieval systems usually rely on event and object feature extraction. Event detection concerns summaries of the whole sports video and the most important moments and events, such as goals and penalties in a soccer match. Although each of the visual, audio and text modalities contributes to high-level analysis, visual features require special treatment, because text may not be available, while audio may show variations due to speaker characteristics and compression. Therefore, systems for sports video processing and analysis use the analysis of visual information, which involves the extraction of low-level features (color, texture, shape, and motion) for high-level event detection (summaries and highlights). Shot boundaries, shot classification, and replays are cinematic features that we use in sports video processing. Object-based features include low-level features, such as motion trajectories, and high-level features of semantic objects, such as players, referee, field objects, and ball. In the following, the properties of some features are described :

– Play field color : In sports video, the play field has a single dominant color, green (in sports like soccer, US football and golf). We can easily and automatically extract this green color, thereby detecting the play field. Color is also the key feature used in other applications such as object detection (players, referee, etc.). Color features are used for shot-boundary detection, shot-type classification, and object detection and tracking.

– Shot-boundaries : In shot boundary detection methods, researchers use frame-to-frame differences in some preselected features and indicate a shot boundary when a large difference value is found. Other cues, such as large camera and object motion, are also used. In sports video, where we find high motion activity, large camera and object motion and gradual shot transitions make shot boundary detection difficult.

– Shot classes : Sports video presents a limited number of views. For instance, in soccer video, editors show a global view of the play field to follow the ball while play continues, whereas player close-ups are preferred during important moments (goals, kicks, penalties, etc.) or stoppages in the game.

– Slow-motion replays : TV viewers like and appreciate slow-motion replays of the action in sports video. In general, television broadcasters replay the interesting events in slow motion; hence, a sports video summary may be generated from all the replay shots in a game.

– Object detection : In the sports play field, we find several distinct objects such as players, referee, lines, the center ellipse, goal mouth and penalty boxes, which make object detection a challenging task.

All the sports video features cited above may be used to detect events that constitute interesting moments (highlights) or a summary according to some subjective measures, such as high volume and high motion activity. Figure 3.1 shows the stages of high-level event detection adopted in sports video processing and analysis.

The detection of play and break events may help generate condensed summaries for non-continuous sports, such as US football and baseball.


Figure 3.1 – Three-level event definition and detection hierarchy that is adopted in many sports video processing/analysis works.

It is often argued that play summaries are lossless because they are composed of all play actions in the game. However, break events in some games may be as interesting as play events; hence, the next level of the hierarchy is concerned with the detection of interesting events (highlights), independent of their being plays or breaks. For a video segment to be interesting, it should satisfy some constraints defined by a number of features and conditions, such as high volume, high motion activity, and the existence of slow-motion replays at close temporal distances, which may result in a subjective definition. Although the definition is subjective, the evaluation may be objective if the performance of the systems that detect interesting events is reported as a function of the detected and the missed objective events. At the top level, the hierarchy uses official definitions of events for detection as well as recognition. The summaries at this level can be defined in terms of detected events; for example, soccer summaries consisting of only goal events may be presented to the viewers.

3.3 Proposed framework for sports video processing

In this section, we propose a generic and scalable sports video processing framework that employs multi-modal features for summaries and highlights extraction in sports video.

Shot boundary detection

Many visual features change at shot boundaries and between shots. It is therefore crucial to detect shot boundaries before doing further analysis. The identification of shot boundaries is a key step prior to performing shot-level feature extraction and any subsequent scene-level analysis. Shot transitions can be classified into two types : abrupt transitions (cuts) and gradual transitions (fades, wipes, dissolves, etc.).

Figure 3.2 – The flowchart of the proposed framework for sports video processing and analysis.

In our framework, the shot boundary detection block implements an algorithm for detecting transitions between shots (cuts and gradual transitions). This algorithm is based on the Discrete Cosine Transform (DCT) with a multi-resolution notion (DCT-MR) to detect the different types of shot transitions.
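A minimal sketch of the DCT side of this block : each frame is reduced to its low-frequency DCT coefficients, and two frames are compared through the distance between their signatures. The `keep` parameter loosely plays the role of the resolution level in DCT-MR; the actual multi-resolution scheme and thresholds of the proposed algorithm are not reproduced here.

```python
import cv2
import numpy as np

def dct_signature(gray, keep=8):
    """Low-frequency DCT coefficients of a grayscale frame, used as a
    compact signature; frames are resized to an even, fixed size."""
    small = cv2.resize(gray, (64, 64))
    coeffs = cv2.dct(np.float32(small) / 255.0)
    return coeffs[:keep, :keep].flatten()

def is_transition(frame_a, frame_b, keep=8, threshold=1.0):
    """Flag a possible shot transition when the DCT signatures of two
    frames are far apart; the threshold is illustrative."""
    d = np.linalg.norm(dct_signature(frame_a, keep) -
                       dct_signature(frame_b, keep))
    return d > threshold
```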

Shot classification

Shot classification is an important step on the way to searching for and finding the semantics of video. After shot detection, the next step is to classify the detected shots into defined classes (the definition of the classes depends on the domain of the video treated). In our shot classification block, we present a new algorithm for the classification of shots, based on the golden section (a spatial pixel representation) and domain knowledge of soccer matches.
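A rough sketch of how the 3:5:3 golden-section representation can drive shot-type classification from a binarized field mask (see Figures 5.9-5.11) : the frame is split into three vertical bands of width ratio 3:5:3 and simple grass-ratio rules assign one of the four classes. The cutoffs and rules are illustrative stand-ins for the domain-knowledge models of Chapter 5.

```python
def classify_shot(field_mask):
    """Classify a key frame from its binary field mask (255 = field
    pixel) using a 3:5:3 column split and grass-ratio rules."""
    h, w = field_mask.shape
    c1, c2 = round(w * 3 / 11), round(w * 8 / 11)
    bands = [field_mask[:, :c1], field_mask[:, c1:c2], field_mask[:, c2:]]
    ratios = [band.mean() / 255.0 for band in bands]
    overall = field_mask.mean() / 255.0

    if overall > 0.6:                  # field dominates the whole frame
        return "long"
    if overall > 0.3 and ratios[0] > ratios[1] < ratios[2]:
        return "medium"                # central subject hides the grass
    if overall > 0.05:
        return "close-up"
    return "out-of-field"
```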

Text detection

Text in video is normally generated in order to supplement or summarize the visual content, and is thus important in carrying information that is highly relevant to the content of the video. As such, it is a potential ready-to-use source of semantic information. In our work, we propose an algorithm for detecting the score-box bar superimposed on soccer video and for recognizing the text in this score box using Optical Character Recognition (OCR). This algorithm is based on block matching, motion vector computation and 2D variance.
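The motion-based intuition, that the score box is a static overlay inside an otherwise fast-moving picture, can be sketched as follows. For brevity, this sketch estimates motion with OpenCV's Farneback optical flow rather than the block matching (diamond search) used in Chapter 6, and the block size and thresholds are illustrative.

```python
import cv2
import numpy as np

def static_overlay_mask(frames, block=16, motion_thresh=1.0):
    """Mark blocks whose estimated motion stays near zero across all
    sampled frame pairs; such blocks are score-box candidates."""
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    h, w = grays[0].shape
    hits = np.zeros((h // block, w // block))
    for a, b in zip(grays, grays[1:]):
        flow = cv2.calcOpticalFlowFarneback(a, b, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag = np.linalg.norm(flow, axis=2)
        for i in range(h // block):
            for j in range(w // block):
                blk = mag[i*block:(i+1)*block, j*block:(j+1)*block]
                if blk.mean() < motion_thresh:  # block barely moves
                    hits[i, j] += 1
    # Blocks static in every frame pair are overlay candidates.
    return hits == len(frames) - 1
```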

Audio descriptor detection

The most unique characteristic of a video document is its ability to convey a rich semantic presentation through synchronized audio, visual and text presentations over a period of time. Video content analysis should therefore use techniques from both audio and image track analysis. In this work, in addition to image analysis, we propose a windowing algorithm that computes the audio energy as a descriptor of the audio stream.
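A minimal sketch of this descriptor : split the audio samples into fixed-length windows and compute the energy of each (cf. Figure 7.1). The window length is an illustrative assumption; high-energy windows (crowd cheering, excited commentary) then flag candidate highlight segments for the later stages.

```python
import numpy as np

def short_time_energy(samples, rate, win_ms=20):
    """Windowed short-time energy of an audio track: sum of squared
    amplitudes over consecutive fixed-length windows.
    `samples` is a 1-D NumPy array, `rate` the sampling rate in Hz."""
    win = int(rate * win_ms / 1000)
    n = len(samples) - len(samples) % win       # drop the ragged tail
    frames = samples[:n].astype(np.float64).reshape(-1, win)
    return (frames ** 2).sum(axis=1)
```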

Video XMLisation

In video processing and analysis, it is highly recommended to present the video content in a simple way in order to facilitate the processing of video information and its reuse in other applications (web, video annotation, etc.). To achieve this aim, we present the video content in an XML file, a procedure we call video XMLisation. The content of the XML file is the description of the video produced by the algorithms developed in this thesis (shot boundary detection, shot classification, text in the score box and the audio descriptor).
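The sketch below shows what such an XML description might look like, generated with Python's standard library; the tag and attribute names are hypothetical, since the actual schema used in the thesis (Figures 7.4-7.5) may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema: one <shot> element per detected shot, carrying
# its class, audio descriptor and recognized score-box text.
video = ET.Element("video", name="match.mpg")
shot = ET.SubElement(video, "shot", id="12", begin="3024", end="3310")
ET.SubElement(shot, "class").text = "long"
ET.SubElement(shot, "audio_energy").text = "0.82"
ET.SubElement(shot, "scorebox_text").text = "RMA 1 - 0 FCB"

ET.ElementTree(video).write("match.xml", encoding="utf-8",
                            xml_declaration=True)
```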

Domain knowledge

Domain knowledge is defined as the content of a particular field of knowledge. We develop our domain knowledge of soccer video in order to generate rules used in processing and analyzing the content of soccer video and to improve the results of the developed algorithms.

Finite state machine

A finite state machine (FSM) consists of a set of states, including a start state, an input alphabet, and a transition function that maps input symbols and current states to a next state. Computation begins in the start state with an input string, and moves to new states depending on the transition function. In this work we present a finite state machine for modeling soccer video. Using this finite state machine we extract summaries and the important events in soccer video, known as highlights.
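A toy version of such a machine over shot-class labels, to make the mechanics concrete; the states, alphabet and transitions below are illustrative, not the machine actually adopted in Chapter 7 (Figure 7.2).

```python
# Transition function as a dictionary: (state, input symbol) -> state.
TRANSITIONS = {
    ("play", "long"): "play",
    ("play", "close-up"): "excited",     # crowd / player reaction
    ("excited", "replay"): "highlight",  # a replay confirms an event
    ("excited", "long"): "play",
    ("highlight", "long"): "play",
}

def run_fsm(shot_labels, start="play"):
    """Feed the shot-class sequence to the FSM and collect the indices
    of shots that complete a highlight pattern."""
    state, highlights = start, []
    for i, label in enumerate(shot_labels):
        state = TRANSITIONS.get((state, label), state)
        if state == "highlight":
            highlights.append(i)
    return highlights

print(run_fsm(["long", "close-up", "replay", "long"]))  # -> [2]
```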

Summary and highlight candidate segment selection

To generate summaries and the important moments (highlights) of soccer video, we make use of all the algorithms developed in this work : shot boundary detection and classification, score-box detection, text recognition and the domain knowledge.

3.4 MPEG standards

To test the algorithms implemented in the context of this thesis, we chose to use videos compressed with the MPEG standard. In this section we present the MPEG standards existing in the field of video compression and some notions used in video processing and analysis.

MPEG is one of the most popular audio/video compression techniques because it is not just a single standard; instead, it is a range of standards suitable for different applications but based on similar principles. MPEG is an acronym for the Moving Picture Experts Group, established by ISO (International Standards Organization) and IEC (International Electrotechnical Commission). A video is a sequence of pictures and each picture is an array of pixels. This video data is organized in a hierarchical fashion in an MPEG video stream. An MPEG video sequence consists of different layers : GOP, pictures, slices, macro-blocks and blocks. A comprehensive picture is shown in Figure 3.3.

Hereafter, we give definitions of all layers used both in MPEG standards and in video processing.

Video Sequence : Begins with a sequence header, includes one or more groups of pictures, and ends with an end-of-sequence code.

Group of Pictures (GOP) : A header and a series of one or more pictures, intended to allow random access into the sequence.


Figure 3.3 – Video Sequence in MPEG stream.

Picture : This is the primary coding unit of a video sequence. A picture consists of three rectangular matrices representing luminance (Y) and two chrominance (Cb and Cr) values. The Y matrix has an even number of rows and columns. The Cb and Cr matrices are one half the size of the Y matrix in the horizontal and vertical directions.

Slice : Contains one or more contiguous macro-blocks. The order of the macro-blocks within a slice is from left to right and top to bottom. Slices are important in the handling of errors : if the bitstream contains an error, the decoder can skip to the start of the next slice.

Macro-Block : This is the basic coding unit in the MPEG algorithm. It is a 16x16 pixel segment of a frame. If each chrominance component has one-half the vertical and horizontal resolution of the luminance component, a macro-block consists of four Y blocks, one Cr block, and one Cb block.

Block : This is the smallest coding unit in the MPEG algorithm. It consists of 8x8 pixels and can be one of three types : luminance (Y), red chrominance (Cr), or blue chrominance (Cb).

Picture Types : The MPEG standard specifically defines three types of pictures :

– Intra Pictures (I-Pictures)
– Predicted Pictures (P-Pictures)
– Bidirectional Pictures (B-Pictures)

These three types of pictures are combined to form a group of pictures (GOP). Typical GOP structures are as follows :


IBBPBBPBBPBBPI . . .
IPPIPPIPPIPPIP . . .
IIIIIIIIIIIIII . . .

Intra Pictures : Intra pictures, or I-pictures, are coded using only information present in the picture itself, and provide potential random access points into the compressed video data. They use only transform coding and provide moderate compression.

Predicted Pictures : Predicted pictures, or P-pictures, are coded with respect to the nearest previous I- or P-picture. This technique is called forward prediction. P-pictures use motion compensation to provide more compression than is possible with I-pictures.

Bidirectional Pictures : Bidirectional pictures, or B-pictures, are pictures that use both a past and a future picture as references. This technique is called bidirectional prediction. B-pictures provide the most compression, since they use both past and future pictures as references; however, the computation time is larger.
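Because B-pictures reference a future anchor, the transmission/decode order of a GOP differs from its display order : each I- or P-anchor must be decoded before the B-pictures that depend on it. A small sketch of this standard reordering :

```python
def decode_order(display_order):
    """Reorder a GOP from display order to decode order by moving each
    I/P anchor ahead of the B-pictures that reference it."""
    out, pending_b = [], []
    for pic in display_order:
        if pic in ("I", "P"):        # anchor: decode before pending Bs
            out.append(pic)
            out.extend(pending_b)
            pending_b = []
        else:                        # B-picture: waits for next anchor
            pending_b.append(pic)
    return out + pending_b

print(decode_order(list("IBBPBBP")))
# -> ['I', 'P', 'B', 'B', 'P', 'B', 'B']
```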

MPEG 1

The video compression technique developed by MPEG-1 covers many applications, from interactive systems on CD-ROM to the delivery of video over telecommunications networks. The MPEG-1 video coding standard is designed to be generic : to support a wide range of application profiles, a diversity of input parameters, including flexible picture size and frame rate, can be specified by the user. MPEG has recommended a constrained parameter set : every MPEG-1 compatible decoder must be able to support at least video source parameters up to TV size, including 720 pixels per line, 576 lines per picture, a frame rate of 30 frames per second and a bit rate of 1.86 Mbit/s.

Important features provided by MPEG-1 include frame-based random access of video, fast forward/fast reverse (FF/FR) searches through compressed bit streams, reverse playback of video and editability of the compressed bit stream.


MPEG 2

MPEG-2 was given the charter to provide video quality not lower than NTSC/PAL and up to CCIR 601 quality. Emerging applications, such as digital cable TV distribution, networked database services via ATM, digital VTR applications and satellite and terrestrial digital broadcasting distribution, were seen to benefit from the increased quality expected to result from the new MPEG-2 standardization phase. Work was carried out in collaboration with the ITU-T SG 15 Experts Group for ATM Video Coding, and in 1994 the MPEG-2 Draft International Standard (which is identical to the ITU-T H.262 recommendation) was released. The specification of the standard is intended to be generic ; hence the standard aims to facilitate bit stream interchange among different applications, transmission and storage media.

It is expected that most MPEG-2 implementations will support at least the following video source parameters : a maximum sample density of 720 samples per line and 576 lines per frame, a maximum frame rate of 30 frames per second and a maximum bit rate of 15 Mbit/s.

MPEG 4

MPEG-4 is the result of another international effort involving hundreds of researchers and engineers from all over the world. MPEG-4, whose formal ISO/IEC designation is 'ISO/IEC 14496', was finalized in October 1998 and became an International Standard in the first months of 1999. The fully backward compatible extensions under the title of MPEG-4 Version 2 were frozen at the end of 1999 and acquired formal International Standard status early in 2000.

MPEG-4 builds on the proven success of three fields :

– Digital television.

– Interactive graphics applications (synthetic content).

– Interactive multimedia (World Wide Web, distribution of and access to content).

MPEG 7

MPEG-7 is targeted to produce a standardized description of multimedia material including images, text, graphics, 3D models, audio, speech, analog/digital video, and composition information. The standardized description will enable fast and efficient search and retrieval of multimedia content and advance the search mechanism from a text-based approach to a content-based approach. Currently, feature extraction and search engine design are considered to be outside the standard. Nevertheless, when MPEG-7 is finalized and widely adopted, efficient implementations of feature extraction and search mechanisms will be very important. The applications of MPEG-7 can be categorized into pull and push scenarios. In the pull scenario, MPEG-7 technologies can be used for information retrieval from a database or from the Internet. In the push scenario, MPEG-7 can provide the filtering mechanism applied to multimedia content broadcast by an information provider.

Compared with earlier MPEG standards, MPEG-7 possesses some essential differences. For example, MPEG-1, 2, and 4 all focus on the representation of audiovisual data, whereas MPEG-7 focuses on representing the meta-data (information about the data). MPEG-7 may, however, utilize the results of previous MPEG standards (e.g., the shape information in MPEG-4 or the motion vector field in MPEG-1 and 2).

MPEG 21

MPEG-21 is the newest of the series of standards produced by the Moving Picture Experts Group. The basic concepts in MPEG-21 relate to the what and the who within the multimedia framework. What is MPEG-21 ? MPEG-21 aims at defining the technology needed to support Users in exchanging, accessing, consuming, trading and otherwise manipulating digital content in an efficient, transparent and inter-operable manner. Why MPEG-21 ? In today's digital world, individuals are producing more and more digital media content. These content providers need better management and protection of the rights on their digital content. A new multimedia framework to facilitate the trading and consumption of digital content is needed.

One of the key aspects of MPEG-21 is that it is a standard framework and not a complete, implementable solution. Hence, the multimedia and signal processing communities have many opportunities to use new techniques and solutions within the framework.


3.5 Conclusion

In this chapter, we proposed a framework for sports video processing and analysis. This framework uses shot detection and classification, text detection and recognition, and an audio descriptor ; all of this information is gathered into an XML file for high-level semantic representation, so that the description can be reused in other applications, such as video description on the web. On the other hand, we gave notions about the MPEG norms, because all of our experiments are carried out under the MPEG norm to test the algorithms in the compressed domain. We also presented the MPEG norms in their different versions, up to the newest one, MPEG-21.


Chapter 4

Dominant color detection

The field region in many sports can be described by a single dominant color. This dominant color demonstrates variations from one sport to another, from one stadium to another, and even within one stadium during a sporting event.

In this chapter, we propose an enhanced dominant color region detection algorithm that is robust to dominant color variations that may be caused by differences in the official field color specifications, changes in environmental factors such as sunset and shadows, spatio-temporal variations in illumination intensity due to irregularly positioned and/or imperfect stadium lights, and spatial variations in field properties due to deformations. We begin this chapter with a presentation of different color spaces, in order to deduce which color spaces are the most suitable for dominant color extraction in sports video.
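Before developing the full algorithm, the core idea can be sketched in a few lines : histogram a color component over the frame and take the most populated bin as the dominant color. The toy Python sketch below (our own illustration based on hue only ; the method proposed in this chapter is more robust) demonstrates the principle :

```python
import colorsys

def dominant_hue(pixels, bins=36):
    """Return the dominant hue (in degrees) of a list of (r, g, b)
    pixels with components in [0, 1], by hue-histogram peak."""
    hist = [0] * bins
    for r, g, b in pixels:
        h, _, _ = colorsys.rgb_to_hsv(r, g, b)
        hist[min(int(h * bins), bins - 1)] += 1
    peak = max(range(bins), key=hist.__getitem__)
    return (peak + 0.5) * 360.0 / bins  # centre of the peak bin

# A mostly-green field patch yields a hue near 120 degrees:
patch = [(0.1, 0.6, 0.1)] * 90 + [(0.9, 0.9, 0.9)] * 10
print(dominant_hue(patch))
```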


4.1 Introduction

Color is a phenomenon of perception, not an objective component or characteristic of a substance. Color is an aspect of vision ; it is a psychophysical response consisting of the physical reaction of the eye and the automatic interpretive response of the brain to the wavelength characteristics of light above a certain brightness level (at lower levels the eye senses brightness differences but is unable to make color discriminations).

That light is the source of color was first demonstrated in 1666 by Isaac Newton, who passed a beam of sunlight through a glass prism (Figure 4.1), producing the rainbow of hues of the visible spectrum. This phenomenon had often been observed before, but it had always been attributed to latent color that was said to exist in the glass of the prism. Newton, however, took this simple experiment a step further. He passed his miniature rainbow through a second prism that reconstituted the original white beam of light. His conclusion was revolutionary : color is in the light, not in the glass, and the light people see as white is a mixture of all the colors of the visible spectrum.

Figure 4.1 – Light dispersed by a prism into its constituent visible wavelengths.

Rainbows appear colored because light is broken down into its constituent parts as it passes through the water droplets in the air (the perception of a rainbow depends on the viewer's perspective : as the viewer moves, the rainbow moves). The theory of color has gone through some changes over time, and it is now an accepted fact that color is truly in the eye of the beholder. This is because, as sensed by man, color is a sensation and not a substance : a sensation created in response to the excitation of the human visual system by light in the visible part of the electromagnetic spectrum.


4.2 The color spaces

In color segmentation algorithms, the choice of color space is crucial to the success of region extraction. In this section, we present the properties of various color spaces and discuss which ones are suitable to our case, namely soccer video.

A color space is a model for representing color in terms of intensity values ; a color space specifies how color information is represented. It defines a one-, two-, three-, or four-dimensional space whose dimensions, or components, represent intensity values. A color component is also referred to as a color channel. For example, RGB space is a three-dimensional color space whose components are the red, green, and blue intensities that make up a given color. Visually, these spaces are often represented by various solid shapes, such as a cube or a cylinder.

The representations of colors in the RGB and CMY color spaces are designed for specific devices, but for a human observer they have no intuitive meaning. For user interfaces, a more intuitive color space is preferred, such as HSL or HSV. All color spaces within a base family differ only in details of storage format or are related to each other by very simple mathematical formulas.

The next step in this section is to introduce the color spaces best known in the image/video analysis and processing area. The color spaces presented are : RGB, CMY, HSV, HSL, YUV, XYZ and L*a*b*. We shall then discuss the color spaces adequate to our case, and how to combine them to improve the results of dominant color extraction.

RGB color model : The most famous color encoding is an additive system with three components : Red, Green and Blue. Those components are added together in various ways to reproduce a broad array of colors (Figure 4.2).

CMY color model : The complement of RGB, it is a subtractive system with three components : Cyan, Magenta and Yellow. The conversion from RGB is a simple subtraction, since the two spaces complement each other (Figure 4.3).


Figure 4.2 – RGB color space.

Figure 4.3 – CMY color space.

RGB to CMY formula :

$$\begin{cases} C = 1 - R \\ M = 1 - G \\ Y = 1 - B \end{cases} \qquad (4.1)$$
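A direct transcription of Equation 4.1 in Python (a trivial sketch ; RGB components are assumed normalized to [0, 1]) :

```python
def rgb_to_cmy(r, g, b):
    """Convert normalized RGB in [0, 1] to CMY via Equation 4.1."""
    return 1.0 - r, 1.0 - g, 1.0 - b

print(rgb_to_cmy(1.0, 0.0, 0.0))  # pure red -> (0.0, 1.0, 1.0): no cyan
```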

HSV color model : A system which represents color as Hue, Saturation and Value. Hue is what is seen as the color. Saturation is how "pure" the color is : 0% is grey and 100% is a pure color. Value represents the brightness (Figure 4.4).

Figure 4.4 – HSV color space.

The algorithm of conversion from the RGB color space to the HSV color space is presented in Annexe A 2.
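For reference, one standard formulation of this conversion (not necessarily identical to the variant given in Annexe A 2 ; Python's standard library also provides colorsys.rgb_to_hsv) :

```python
def rgb_to_hsv(r, g, b):
    """Convert normalized RGB in [0, 1] to (hue in degrees,
    saturation in [0, 1], value in [0, 1])."""
    mx, mn = max(r, g, b), min(r, g, b)
    d = mx - mn
    v = mx
    s = 0.0 if mx == 0 else d / mx
    if d == 0:
        h = 0.0  # achromatic: hue is undefined, use 0 by convention
    elif mx == r:
        h = 60.0 * (((g - b) / d) % 6)
    elif mx == g:
        h = 60.0 * ((b - r) / d + 2)
    else:
        h = 60.0 * ((r - g) / d + 4)
    return h, s, v

print(rgb_to_hsv(0.1, 0.6, 0.1))  # grass green: hue 120, s ~0.83, v 0.6
```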

HSL color model : The HSL color space has three coordinates : Hue, Saturation, and Lightness (sometimes Luminance) ; it is sometimes referred to as HLS. The hue is an angle from 0 to 360 degrees ; typically 0 is red, 60 degrees yellow, 120 degrees green, 180 degrees cyan, 240 degrees blue, and 300 degrees magenta.


Saturation typically ranges from 0 to 1 (sometimes 0 to 100%) and defines how grey the color is : 0 indicates grey and 1 is the pure primary color. Lightness is intuitively what its name indicates ; varying the lightness reduces the values of the primary colors while keeping them in the same ratio (Figure 4.5).

Figure 4.5 – HSL color space.

Both HSV and HSL are well suited to color adjustment and balance, since their parameters are more clearly separated (the algorithm of conversion from the RGB color space to the HSL color space is presented in Annexe A 3).
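In Python, the standard library's colorsys module implements this conversion ; note the H-L-S component ordering and the hue scaled to [0, 1] rather than degrees. A quick check, assuming nothing beyond the standard library :

```python
import colorsys

h, l, s = colorsys.rgb_to_hls(0.1, 0.6, 0.1)  # a grass-green pixel
print(h * 360, l, s)  # hue 120 degrees, lightness 0.35, saturation ~0.71
```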

YUV color model : The YUV color space is a bit unusual. The Y component determines the brightness of the color (referred to as luminance or luma), while the U and V components determine the color itself (the chroma). Y ranges from 0 to 1 (or 0 to 255 in digital formats), while U and V range from −0.5 to 0.5 (or −128 to 127 in signed digital form, or 0 to 255 in unsigned form) (Figure 4.6).

Figure 4.6 – YUV color space (with Y=0).
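A hedged sketch of the analog RGB-to-YUV transform, using the commonly quoted BT.601 luma weights (any scaling to digital ranges is omitted) :

```python
def rgb_to_yuv(r, g, b):
    """Convert normalized RGB in [0, 1] to analog YUV (BT.601 weights):
    Y in [0, 1], U roughly in [-0.436, 0.436], V in [-0.615, 0.615]."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = 0.492 * (b - y)
    v = 0.877 * (r - y)
    return y, u, v

print(rgb_to_yuv(0.1, 0.6, 0.1))  # green: high Y weight, negative U and V
```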
