

International Scholarly Research Notices, Volume 2011, Article ID 975145, 17 pages. doi:10.5402/2011/975145

Research Article

TV Stream Structuring

Zein Al Abidin Ibrahim (1) and Patrick Gros (2)

(1) LERIA Laboratory, Angers University, 49045 Angers, France
(2) INRIA, Centre Rennes-Bretagne Atlantique, 35042 Rennes, France

Correspondence should be addressed to Patrick Gros, patrick.gros@inria.fr

Received 3 March 2011; Accepted 19 April 2011

Academic Editors: H. Araujo and W.-L. Hwang

Copyright © 2011 Z. A. A. Ibrahim and P. Gros. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

TV stream structuring consists in precisely detecting the first and last frames of all the programs and breaks (commercials, trailers, station identifications, bumpers) of a given stream and then annotating all these segments with metadata. Breaks are usually broadcast several times during a stream, so the detection of these repetitions can be considered a key tool for stream structuring. After the detection stage, a classification method is applied to separate the repetitions into programs and breaks. In turn, break repetitions are used to classify the segments that appear only once in the stream. Finally, the stream is aligned with an electronic program guide (EPG) in order to annotate the programs. Our experiments were conducted on a 22-day-long TV stream, and the results show the efficiency of the proposed method for TV stream structuring.

1. Introduction

With the rapid development of digital technologies in the last decades, capturing and storing digital video content have become very common and easy. Television is probably the main source of such videos. However, navigating within TV streams is still a very complicated task. Users conceive a TV stream as a sequence of programs (P) and breaks (B), while from a signal point of view this stream is a continuous flow of frames and sounds, with no external markers of the start and end points of the included programs and no apparent structure. Most TV streams have no associated metadata describing their structure, except the program guide produced by TV channels. This program guide lacks precision, since TV channels cannot predict the exact duration of live programs, for example. In addition, it does not provide any information about breaks such as commercials or trailers. The first step in the stream structuring chain is to segment the stream into TV programs by providing their precise boundaries. The second step consists in recognizing each program and describing its actual content if needed.

In addition to program retrieval, TV stream structuring is very valuable for several applications. One example is archive management, as it is done at the French National Audiovisual Institute (INA). This institute is responsible for archiving all programs for which French money was invested, and, in order to achieve this goal, it records all French TV channels 24/7. Since September 2006, INA has archived 540,000 hours of TV per year [1]. This amount of data keeps increasing as the number of channels grows. Accessing the data is thus a crucial problem. To facilitate this task, INA manually describes the structure and the content of each stream.

Another application of TV stream structuring is the anal- ysis of commercial breaks, either to enforce legal regulations, to get some statistics, or to skip these commercials.

Moreover, video analysis techniques nowadays face a major problem. Most of these techniques are designed to analyze a video with homogeneous content (content of the same type, e.g., a soccer game), while TV streams may contain several heterogeneous programs. In this case, TV stream structuring can be used to segment the stream into programs and then to apply the appropriate analysis technique to each program based on its type.

In addition, Internet Protocol Television (IPTV) services may use TV stream structuring techniques for catch-up TV, start-over TV, TVoD, or nPVR services.

Several methods have already been proposed to delimit or detect some specific content items in TV streams. Some detect bumpers to find breaks or programs. Others are dedicated to commercials. Each of the existing methods solves only a part of the structuring problem. An example of the first complete solutions is Naturel's [1]. Nevertheless, this approach requires a lot of manual annotation and cannot scale up. On the other hand, Manson and Berrani [2] proposed a technique based on a supervised learning algorithm, but it also requires manual annotation in order to train the system. Moreover, Manson and Berrani classify each segment independently from its repetitions, even though they probably have the same content type (P or B). Furthermore, Poli [3] proposed a top-down approach, which learns to predict a more accurate program guide.

This approach, in turn, requires an enormous learning set (several years of exact program guides in this case).

The complete and pioneering method proposed by Naturel et al. [1] is based on a manually annotated reference video database, which is considered its main limitation. Our approach is an extension of Naturel's method that tries to suppress its manual annotation stage and to reduce the manual annotation work of supervised methods. It starts by detecting the segments that appear several times in the stream. Then, a classification method separates the program segments (P) from the break segments (B), which are used later to segment the complete stream into program and break segments. Finally, the EPG is used to label the program segments.

The rest of this paper is organized as follows. In Section 2, we present an overview of the existing TV structuring methods. To facilitate the comprehension of our method, a vocabulary is defined in Section 3. Section 4 details our proposed method. Experiments and results are provided in Section 5. Finally, we conclude in Section 6.

2. State of the Art

The TV stream structuring problem has not been extensively addressed in the literature. Most previous works focus on structuring a single program or a collection of programs without dealing with streams containing several heterogeneous programs. However, the literature is rich in systems dedicated to detecting commercials, which could be considered the basis of any TV stream structuring system (e.g., [4–9]). However, these techniques are not sufficient to structure the streams, because commercials are not the only kind of breaks.

Returning to the TV stream structuring process, it can be divided into two complementary tasks.

(i) The first task consists in segmenting the stream into program/break sequences, where the precise starts and ends of programs and breaks are provided.

(ii) In the second task, each segmented program is labeled with metadata in order to identify it and to facilitate the retrieval of information from the stream.

The first task of the process can be addressed with different approaches.

(i) Segmenting the stream into logical units and then classifying each segment as a program or a break segment [2]. These segments may be of different granularities (key-frame, shot, scene, etc.). Then, consecutive segments with the same content (e.g., the same commercial) are combined.

(ii) Searching for the start and the end of program segments, based on detecting discontinuities in the homogeneity of some features [10], modeling the boundary between programs and breaks [11], or detecting the repetition of opening and closing credits [12].

(iii) Searching for the start and the end of break segments by recognizing them in a reference database [1] or based on their repetitive behavior [2, 13]. The latter should be followed by a classification step in order to separate repeated program segments from break ones.

The existing approaches in the literature are of two types.

(1) Metadata-based: methods that rely almost exclusively on the metadata available with the stream in order to structure it [3].

(2) Content-based: methods that use the audiovisual stream to structure TV streams. In turn, these methods can be classified into two subclasses.

(i) Methods that search for the boundaries of the programs themselves. This type of method is referred to as program-based [10–12, 14].

(ii) Methods that search for breaks, which may separate consecutive programs. These methods are called break-based [1, 2, 13].

We classify the methods of the literature into four categories based on the techniques they rely on.

Category 1. A prototype of the first category is the metadata- based method developed by Poli [3] in order to structure TV streams. His main idea is to use a large set of already annotated data to learn a model of the program guide and thus of the stream structure. A hidden Markov model and a decision tree are used to learn this model that predicts the start time and the genre of all programs and breaks appearing during a week. This is the only method that is totally based on television schedules, and it requires a huge amount of annotated data for the learning stage (up to one year for each channel). An additional step may be required afterward to analyze the stream since the prediction is not perfect. This work proved that, on the channels used, the stream structure is very stable over the years.

Category 2. The second category contains the program-based methods that recover the structure of the TV stream by detecting the program boundaries [10–12, 14]. In [12], the authors start from the assumption that, when considering two consecutive days, a given program starts approximately at the same time with the same opening and closing credits. As a consequence, their method relies on the repetitive behavior of the opening and closing credits of programs in order to detect their start and end times. The assumption used by the authors is not always true: some programs do not have opening and closing credits, and, in addition, TV channel schedules change completely on weekends. Likewise, Liang et al. and Wang et al. propose in [11] a method based on the opening and closing credits of programs. The idea is to detect special images called Program-Oriented Informative iMages (POIMs), which are frames containing logos with monochrome backgrounds and big text characters. From the authors' point of view, these POIMs appear in opening and closing credits and at the end of commercial segments. Unfortunately, opening and closing credits are not always present, as mentioned before. Moreover, POIM frames are not always present at the end of commercials and vary from channel to channel. Contrary to the methods proposed in [11, 12], El-Khoury et al. propose an original unsupervised method based on the fact that each program has homogeneous properties [10]. Consequently, programs are extracted by detecting discontinuities in some audiovisual features. The authors start from the idea that, during a program, a selected set of features behaves in a homogeneous manner. With this method, short programs may not be detected, and consecutive segments that belong to the same program are not merged. Moreover, detecting the boundaries of breaks is easier and more precise than detecting those of programs.

Category 3. In the third category, we find the recognition-based techniques that detect break segments. The work of Naturel et al. presented in [1] is the only complete structuring solution; it is based on a reference database containing manually annotated breaks. This database is used to detect breaks from the database that are rebroadcast later in the stream. The authors use hash tables with video signatures in order to detect such repetitions, which are used later to get the structure of the stream. The manual annotation of the database is the main constraint and drawback of the method. On the other hand, an automatic technique is proposed to update the database and thus cope with the continuous change of the breaks. Unfortunately, the experimental data set used in this paper is not long enough to validate this updating approach.

Category 4. The techniques of the last category are the break-based methods that rely on the detection of repeated audiovisual sequences in the TV stream. The underlying idea is that breaks, and especially commercials, have a repetitive behavior. Our technique falls into this category, but several methods using this principle have already been proposed. For example, Zeng et al. [13] use hash tables with audio signatures in order to detect such repetitions.

On the other hand, Berrani and Manson in [2, 16] use a clustering-based approach, which groups similar key-frames and visual features and then uses inductive logic programming (ILP) to classify them into P and B segments. This last method, based on a supervised symbolic machine learning technique (ILP), shows that it is possible to learn the structure of the stream from raw data. Thus, in some sense, it makes a link between Naturel's technique and Poli's. The drawback of this method is that it needs 7 days of manually annotated data to train the system. Moreover, ILP restricts the usable information to the local context of each segment. In addition, the authors have chosen to classify each segment independently from its repetitions. From our point of view, most of the time a segment and its repetitions are of the same type, except in the case of trailers. The trailer segments can be filtered using predefined rules. In our method, we take into account the contextual information of all occurrences in the stream of a given content. This last step adds a considerable improvement to the results, as shown in the experiments. Contrary to the other methods in this category, the method proposed in [13] relies on audio signatures. The authors of [13] justify this choice by the fact that audio can overcome the limitation of time-consuming video decoding. Using audio is a good idea, but it should be noticed that video decoding is not so time-consuming nowadays. Moreover, detecting audio segment boundaries is not easy; as a consequence, video signatures are used to overcome the latter problem. In addition, video signatures may be more robust than audio ones, since the audio signal is very sensitive to noise, and this may affect the repetition detection. On the other hand, the video stream used to evaluate the method is not long enough, and the number of repetitions it contains is not provided. The rules used for the segmentation are very simple, and their effectiveness is not clearly evaluated, for example, by a comparison with the ground truth. Finally, the programs segmented by this method have to be annotated manually. The metadata provided with the stream (EPG), which is an interesting source of information, is not used.

As a conclusion of this state-of-the-art survey, it should be noticed that techniques based on repetition detection are the most suitable ones when we seek to segment the stream into programs and breaks. Of course, breaks are not the only repetitive segments. Some program segments can appear several times in the stream, like opening and closing credits, flashbacks, and news reports, and even a whole program can be repeated. Thus, a classification step is required to differentiate the P repetitions from the B ones.

3. Vocabulary Definition

Before presenting our method, we define in this section some vocabulary and basic concepts that will help to better understand the proposed approach. First, a clear distinction should be made between the content and its representation. The "content" is somehow an abstract concept, since content cannot exist without a representation. Each content item, such as a movie or a commercial, has a duration but no start or end time, and it can appear several times in the stream, whereas a representation of this item has a start and an end time and is thus unique in the stream. Let us define the following.

Content: what the stream represents.

Content element: a piece of contiguous content.


Content item: a content element which corresponds to a basic broadcasting and semantic unit (e.g., a movie, a commercial, a news report, a weather forecast). When a movie is split in two by a break, both parts are considered as content items.

Video stream: succession of frames presented at a fixed rate.

Shot: a contiguous series of frames taken by a single source (camera).

Segment: any contiguous series of shots that may be semantically unrelated. A segment represents a content element.

Sequence: a contiguous series of shots that are semantically related (e.g., all the shots of a given commercial). A sequence represents a content item.

Break (B): every sequence with a commercial aim such as commercials, interludes, trailers, jingles, bumpers, and self-promotions.

Program (P): every sequence that is not a break (e.g., movies, weather forecasts, news, etc.)

A content element can appear only once in the stream or be represented several times, because it is broadcast several times (e.g., reruns) or because it appears several times in the same program (e.g., some scenes in a cartoon). To distinguish such situations at the stream level, we will use the words UniqSegment and RepSegment, UniqSequence and RepSequence. For example, a content item appearing several times in the stream is represented by several RepSequences in the stream.

We will call RepSet a set of RepSegments or RepSequences corresponding to the same content element or content item.

A RepSet is a set of stream segments, which are almost identical from a content point of view. The set of all the RepSets of a stream will be called a RepStreamSet.

4. A Repetition-Based TV Stream Structuring Method

As we have mentioned in Section 2, the method proposed by Naturel et al. in [1] is costly and time-consuming, as it uses manual annotation to bootstrap the structuring process. Our method improves on this approach by using a machine learning technique that limits the manual annotation to the data necessary to train the system. Such data do not need to be contiguous to the testing data. Figure 1 gives an overview of the proposed method.

The first step consists in detecting the repetitions (i.e., the RepStreamSet). Then, a postprocessing step fuses the consecutive RepSets in the RepStreamSet that belong to the same sequence (e.g., the same commercial) in order to get a set of RepSequences. After that, a classification method separates the program RepSequences from the break ones. Then, the P/B segmentation is extended to the whole stream. Finally, the segmented stream is aligned with the electronic program guide (EPG) to label the various segments.

Figure 1: An overview of the proposed method (inputs: TV stream, EPG(s); output: labeled stream). The pipeline comprises five steps: (1) repetition detection, (2) post-processing, (3) classification, (4) P/B segmentation, and (5) EPG alignment.

In the remainder of this section, we present each step in more detail.

4.1. RepSegment Detection. The purpose of this first step is to provide a fast method for detecting the common content elements in two videos. In our context, two cases can occur: (1) the element can be an item broadcast several times, or (2) it can be a part of an item that appears several times within this item. In all cases, all the representations of this element should be very similar, and the set of all possible transformations between two representations of the same content element should be rather small. These transformations come basically from broadcasting noise (additive Gaussian noise, color shift, digitization) and from editing effects at the postproduction stage (small temporal variations, text, banners). For two segments to be considered RepSegments, these editing effects should be limited; otherwise, the corresponding frames will present very large differences, and it will no longer be possible to consider them similar.

The literature is very rich in methods that detect repeated sequences in audiovisual streams. Some of these methods are limited to the repetition detection task (e.g., [17–23]), while others are designed to detect breaks and commercials (e.g., [4, 6, 13, 24, 25]). To detect repeated sequences, a simple method consists in searching the document for frames or segments that are near-duplicates [4, 25]. Another idea is to compare the document with itself in order to build a similarity matrix that can be processed to retrieve the repeated sequences [19]. These methods face problems when dealing with large audiovisual documents. For this reason, Herley proposes in [20] a method that can process a continuous, infinite stream; the search for repetitions is limited to a sliding window of finite history. Perceptual hashing is also used to detect repetitions by considering a stream as a database of signatures, each representing a frame, a shot, or a segment [6, 13, 18, 21, 24].

As mentioned before, Naturel et al. proposed the first complete TV stream structuring solution [1]. To our knowledge, the results obtained by this method are the most interesting in the domain; its only drawback is the use of a video reference database that needs to be annotated manually. In our work, we do not aim to propose a new method to detect repeated sequences, since the literature is very rich in such methods. Rather, we try to propose a TV stream structuring solution that overcomes the limitations of the one proposed by Naturel et al. For this reason, we have chosen to use the repetition detection method proposed by Naturel et al.; adopting the same algorithm will also facilitate the comparison of our TV stream structuring method with Naturel's.

In this paper, we rely on the method presented in Naturel's work [1] to detect the RepSegments. This approach uses a perceptual hashing technique with a reference video database (RVD) composed of manually labeled video segments. Each label contains the type of the segment (P or B) and its title. Each shot of the stream is then compared to this database. In order to avoid a time-consuming exhaustive search, each frame is described by a visual signature, so each shot is characterized by a set of signatures, and the search is limited to the database shots that share at least one signature.

For shot segmentation, we rely on the method presented in [26]. This method uses an adaptive threshold on the luminance histogram, with improvements to detect dissolves and fades. The visual signature is based on the 64 lower-frequency coefficients (except the DC coefficient itself) extracted from the discrete cosine transform (DCT) of the image luminance channel. The median value of the coefficients is used to binarize them, thus providing a 64-bit signature.
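To make the signature concrete, here is a minimal sketch in Python (NumPy/SciPy). The paper only specifies 64 low-frequency DCT coefficients, DC excluded, binarized against their median; the choice of a 9x9 top-left block and the scan order below are our assumptions.

```python
import numpy as np
from scipy.fftpack import dct

def frame_signature(luma: np.ndarray) -> int:
    """64-bit DCT signature of a luminance frame (2D float array)."""
    # 2D DCT (orthonormal) of the luminance channel.
    coeffs = dct(dct(luma, axis=0, norm="ortho"), axis=1, norm="ortho")
    # Keep 64 low-frequency coefficients, skipping the DC term coeffs[0, 0].
    # Taking the top-left 9x9 block and dropping DC is an assumption here.
    low = coeffs[:9, :9].flatten()[1:65]
    # Binarize each coefficient against the median of the retained set.
    bits = low > np.median(low)
    sig = 0
    for b in bits:
        sig = (sig << 1) | int(b)
    return sig
```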

In this step, our contribution is an adaptation of the repetition detection method proposed in [1] that does not rely on the RVD. To this end, we propose an algorithm that compares the stream with itself in real time. First, the stream is segmented into shots, and a 64-bit visual signature is extracted from each frame. Then, these signatures are inserted into a hash table with a reference to the shot the corresponding frame belongs to. Using this representation, we can efficiently compare the stream with itself to retrieve the RepSegments. Additionally, as Naturel did, we assume that RepSegments have at least one visual signature in common. Algorithm 1 shows the different steps of the detection process.

The Hamming distance is used to compare two shots and is computed as follows. Each shot contains several frames, and each frame is assigned a signature. Two shots p and q are compared when two frames, one from p and the other from q, have equal signatures. We first measure the Hamming distance between two signatures by counting the number of differing bits:

$$\mathrm{Hamm}\big(\mathrm{Sig}_i, \mathrm{Sig}_j\big) = \sum_{k=1}^{64} \mathrm{Sig}_i[k] \oplus \mathrm{Sig}_j[k]. \quad (1)$$

Then, in order to measure the distance D between two shots Sh_i and Sh_j with the same duration, we take the average of the Hamming distances between the signatures of Sh_i and Sh_j:

$$D\big(\mathrm{Sh}_i, \mathrm{Sh}_j\big) = \frac{1}{N} \sum_{k=1}^{N} \mathrm{Hamm}\big(\mathrm{Sig}_{ik}, \mathrm{Sig}_{jk}\big), \quad (2)$$

where Sig_{ik} (resp., Sig_{jk}) is the signature of frame number k of shot Sh_i (resp., Sh_j) and N is the number of frames in the two shots.

In the case where the two shots have different durations, the middle frame of the first shot is aligned with the middle frame of the second one. The frames at the boundaries that have no associated frames in the other shot are discarded when computing the distance. Two shots are considered as having the same content if their distance D is less than a fixed threshold.
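A minimal sketch of distances (1) and (2) in Python, assuming shots are given as lists of 64-bit integer signatures; the centering of the shorter shot on the longer one follows the description above.

```python
def hamming(sig_i: int, sig_j: int) -> int:
    """Hamming distance (1): number of differing bits between two signatures."""
    return bin(sig_i ^ sig_j).count("1")

def shot_distance(shot_i: list[int], shot_j: list[int]) -> float:
    """Average Hamming distance (2) between temporally aligned frames."""
    if len(shot_i) > len(shot_j):
        shot_i, shot_j = shot_j, shot_i
    # Align middle frames; boundary frames of the longer shot are discarded.
    offset = (len(shot_j) - len(shot_i)) // 2
    n = len(shot_i)
    return sum(hamming(shot_i[k], shot_j[k + offset]) for k in range(n)) / n
```

Two shots would then be declared repetitions when shot_distance is below the fixed threshold mentioned above.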

At the end, the algorithm provides sets of similar shots (or parts of shots) called RepSets. Each shot is represented by its start and end times in the stream, measured in frame numbers. Most content items are composed of several shots. The detection process described above should thus be completed by a postprocessing step that combines RepSegments in order to get RepSequences. These RepSequences should correspond, if the postprocessing step were error-free, to content items, for example, programs (or program parts) or commercials.

4.2. RepSegments Postprocessing. The detection process explained in the previous section supplies sets of analogous stream segments, that is, RepSets of RepSegments. The goal of the next step is to glue these RepSegments together in order to get fewer RepSegments of longer duration. Ideally, we should get RepSequences corresponding to semantically coherent content items (programs, trailers, commercials, etc.).

From a theoretical point of view, we should fuse consecutive RepSets that have the same cardinality. Thus, the first proposed rule deals with the simple case where two RepSets have the same number of elements, which may belong to the same repeated content (content item, e.g., the same commercial). From a practical point of view, this rule is not sufficient. Other rules are needed to take into account scenarios implied by the production rules used by TV channels and by errors propagated from the previous steps of the TV stream structuring process (shot segmentation and repetition detection). For these reasons, we have also proposed two additional rules dealing with the fact that some segments may have been missed (not broadcast by TV channels, or missed by the shot segmentation or repetition detection steps), leading to RepSets of different cardinalities.

Let us denote by R_i the various RepSets and by S_ij the segments of R_i. We assume that the S_ij are sorted by increasing starting time and that the R_i are sorted by increasing starting time of their first element S_i1.


(1) Compute the signatures of all frames and insert them in a hash table HT.
(2) For each signature S_i of HT:
    (a) Find in HT the set Set(S_i) of all shots where this signature appears.
    (b) For all pairs (p, q) of shots in Set(S_i):
        (i) Check if the pair is already in the list of similar shots.
        (ii) If the Hamming distance between p and q is smaller than a threshold α, insert (p, q) in the list of similar shots.
(3) Compute and return the equivalence classes from the list of similar shots.

Algorithm 1: RepSegments detection process.
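The following Python sketch mirrors Algorithm 1, reusing shot_distance from the earlier sketch; the input format (a dict mapping shot ids to lists of signatures) and the union-find step used to build the equivalence classes are our choices.

```python
from collections import defaultdict
from itertools import combinations

def detect_repsegments(shots: dict[int, list[int]], alpha: float) -> list[list[int]]:
    # (1) Hash table: signature -> set of shots containing it.
    ht = defaultdict(set)
    for shot_id, sigs in shots.items():
        for sig in sigs:
            ht[sig].add(shot_id)
    # (2) Candidate pairs share at least one signature; keep the close ones.
    similar = set()
    for bucket in ht.values():
        for p, q in combinations(sorted(bucket), 2):
            if (p, q) not in similar and shot_distance(shots[p], shots[q]) < alpha:
                similar.add((p, q))
    # (3) Equivalence classes of similar shots (union-find) give the RepSets.
    parent = {s: s for s in shots}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for p, q in similar:
        parent[find(p)] = find(q)
    classes = defaultdict(list)
    for s in shots:
        classes[find(s)].append(s)
    return [c for c in classes.values() if len(c) > 1]
```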

The first postprocessing rule deals with the following simple case: two consecutive RepSets R_i and R_{i+1} have the same number of elements, and these elements are contiguous, that is, for all j, S_ij and S_(i+1)j are contiguous. In order to deal with some possible noise, we introduce thresholds and distances, but the idea remains the same. The fusion is straightforward in this case.

First, let us define a distance between two segments. This distance is equal to the duration between the end of the earlier segment and the start of the later one if both segments were broadcast the same day, and to +∞ otherwise.

Second, we extend this distance to get a comparison function ∂ between two RepSets R_i and R_i'. This function simply returns the average μ and the standard deviation σ of the distances between the corresponding segments S_ij and S_i'j of the two RepSets. Two RepSets are fused if μ is smaller than a first threshold and σ is smaller than a second one. In this case, a new RepSet is created whose segments are the concatenations of the S_ij and S_i'j, and R_i and R_i' are removed.

We should mention here that we may fuse two consecutive RepSets even if they have different content. This scenario appears when every broadcast of a commercial C_i is followed or preceded by the same commercial C_j, and vice versa. In this case, the two commercials are fused into the same RepSet.

The first postprocessing rule is based on the fact that two consecutive RepSets have the same content if they have the same number of elements. This rule performs well when all the shots of a commercial are rebroadcast and are detected as repeated. Unfortunately, this is not always the case; the rule may be sufficient from a theoretical point of view but not in practice. Problems appear for several reasons. A first reason is that a commercial may be shortened after one or several broadcasts by TV channels. Another reason is errors in the shot segmentation and repetition detection steps. In these cases, we may have two consecutive RepSets that belong to the same content but have different numbers of elements. In other words, some shots in a broadcast of a commercial, for example, are not rebroadcast or have not been detected as repeated. We have thus proposed two other rules that deal with these scenarios.

The second proposed postprocessing rule corresponds to the case where two consecutive RepSets have different numbers of segments and where |R_i| > |R_{i+1}|. In this rule, we search for a RepSet R_k with k > i + 1 and ∂(R_i, R_k) < (α, β), where ∂(R_i, R_k) returns the mean and the standard deviation of dist (see Post-processing, Type 1). In this case, a new R_i is created as the fusion of R_i and R_k, and the R_j (j = i, ..., k) are removed from the RepStreamSet.

The third postprocessing rule takes into account the cases where |R_i| < |R_{i+1}|. It consists in applying the second rule if the R_j (j = i, ..., k) contain segments shorter than one second. In this case, some small segments are dropped, but longer segments can be obtained.

At the end of this postprocessing step, we get a set of RepSets (RepStreamSet). Each segment of these sets is composed of one or several contiguous shots. In the next section, we present a procedure to separate the segments corresponding to programs from those corresponding to breaks.

4.3. P/B Classification of the Segments. The previous stages, repetition detection and postprocessing, provide sets of similar segments that we call RepSets. Each segment is represented by its start and end times in the stream. We assume that the similar segments of a given RepSet represent the same content.

The main idea of our TV stream structuring solution is to detect the boundaries of B (break) segments. The latter are detected based on their repetitive behavior in order to obtain the RepSets. These B segments are used later to segment the stream into P/B sequences. The segmentation is achieved by first extracting all B segments from the stream and then considering the remaining segments longer than one minute as P segments. Unfortunately, B segments are not the only segments that appear more than once in the stream. The content present in the RepSets can be of very diverse natures: breaks, program segments like opening and closing credits, programs broadcast twice, small programs like weather forecasts, and so forth. As a consequence, the RepSets do not provide enough information to get a clear structure of the stream. Therefore, the B RepSets should be separated from the P ones before segmenting the stream. To do so, a classification step is applied to differentiate P segments from B ones.

Theoretically, we may expect the duration of the B segments to be smaller than that of the P segments in the RepSets; in other words, a P RepSet may contain segments that are longer than the segments of a B RepSet. This assumption is not always true in reality, and duration alone can be insufficient for the classification. For example, the news opening credits in France are shorter than many commercial segments.

(1) For all pairs (R_i, R_{i+1}) of successive RepSets:
    (a) If R_i and R_{i+1} have the same number of elements n:
        (i) For j ranging from 1 to n:
            dist[j] = S_(i+1)j.start_time − S_ij.end_time
        (ii) ∂(R_i, R_{i+1}) = (mean(dist), standard_deviation(dist))
             If mean(dist) < α and standard_deviation(dist) < β, insert (i, i + 1) in the result.
(2) Return the result.

Algorithm 2: Post-processing, Type 1.
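A sketch of the fusion test of Algorithm 2 in Python; segments are assumed to be dicts with day, start_time, and end_time fields (times in seconds), and the same-day test that sends the distance to +∞ is folded into the gap computation.

```python
import statistics

def fusible(rep_i: list[dict], rep_next: list[dict],
            alpha: float, beta: float) -> bool:
    """Type 1 test: fuse two successive RepSets of equal cardinality."""
    if len(rep_i) != len(rep_next):
        return False
    gaps = []
    for s1, s2 in zip(rep_i, rep_next):
        if s1["day"] != s2["day"]:
            return False  # distance is +infinity across days
        gaps.append(s2["start_time"] - s1["end_time"])
    # Fuse only if the gaps are small and stable across all occurrences.
    return statistics.mean(gaps) < alpha and statistics.pstdev(gaps) < beta
```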

Furthermore, a repeated P segment may not be entirely detected or may be oversegmented. This can be due to a shot segmentation problem or a repetition detection problem, which leads to several shorter P segments instead of the entire one. Moreover, the detection of false-positive repeated segments in TV programs such as talk shows may oversegment the TV stream when they are considered B segments. In such programs, several long static wide-shot segments with no motion may have approximately the same duration and the same content and may be filmed by the same camera, which can mislead the repetition detection process.

To separate the P RepSets from B ones, two different methods are proposed according to the data to be classified.

(i) In the first method, each segment of each RepSet in the RepStreamSet is described by a set of local and global features and then classified. In this case, the data to be classified are the segments. This approach is referred to as segment-based. The segments may be of P or B type. This method is important for classifying RepSets that contain segments of different types (P and B segments in the same RepSet), because each segment is classified independently.

(ii) The second method directly classifies each RepSet in the RepStreamSet rather than the individual segments. Each RepSet is described by a list of global features. This approach is referred to as RepSet-based.

Most RepSets contain only P segments or only B segments. Trailers present a third, quite annoying case: a RepSet corresponding to a trailer contains several B segments representing the broadcasts of the trailer itself and one or several P segments corresponding to the announced program. Labeling such a RepSet as B would oversegment the corresponding program. Consequently, the classification step should separate the data into three rather than two classes: (1) P RepSets, (2) B RepSets, and (3) trailer (T) RepSets.

4.3.1. Segment and RepSet Features. To classify the segments and the RepSets, a set of features is proposed to describe them. Some features are chosen based on everyday observations of television (e.g., most breaks are repeated more than once). Others are derived from rules used by TV channels (e.g., the presence of silence segments before commercials). Other features may also be useful in this classification (e.g., features derived from audio repetitions). In this section, we present the features used to represent the segments and the RepSets.

Segment Features. To describe a segment S_ij, a set of global and local features is used. The local features are issued from the neighboring segments. The global ones are issued from the RepSet R_i that contains the segment. The global features used to represent a segment are the following:

(i) |R_i|: number of occurrences of S_ij. It represents how many times the corresponding content item was broadcast.

(ii) Number of days: this feature counts the number of different calendar days on which the corresponding content item appears.

(iii) Number of days of the week: it measures the number of different days of the week (Monday to Sunday) on which the content appears. For example, for a content item broadcast on 10 consecutive Tuesdays, the number of days will be 10 while the number of days of the week will be 1.

(iv) Duration of the segment S_ij.

The local features used to describe a segment are issued from two sources. The first source is the presence of a separation before and/or after a segment. We call separation the simultaneous occurrence of monochrome frames and silence that happens between commercials in France due to legal regulations. To detect monochrome frames, we use the method proposed in [1]. In this method, a 48-bin histogram of the luminance channel is computed and its entropy is thresholded. The entropy of a histogram h quantized into n bins is given by

$$H = -\sum_{i} p_i \log p_i, \qquad p_i = \frac{h(i)}{\sum_{k} h(k)}. \quad (3)$$

On the other hand, the silence segments are detected using a simple method where the log-energy of overlapping 10 ms audio frames is computed using the standard formula

$$E_{\mathrm{db}}(i) = 10 \log_{10} \sum_{n} x_n^2(i). \quad (4)$$

A successive analysis is used to merge silent segments and monochrome frames. The audio feature being more discriminative than the visual one, silent audio segments are detected first, and their correspondence to monochrome frames is then checked. Table 1 presents the results obtained by Naturel et al. in [1].

Table 1: Separation detection results.

Modality              Precision   Recall
Audio only            0.82        0.90
Image only            0.41        0.89
Successive analysis   1.00        0.90


The separations are used as binary features. For each segment, they tell whether a separation was observed in a temporal window before and/or after it. This type of feature helps differentiate between P and B segments. A separation may appear before and after breaks. Such separations do not appear within programs, but they can appear at their borders, before the opening credits or after the closing credits, thus separating them from breaks. These separations do not follow a strict and systematic rule, and systems relying only on such information can produce poor results, depending on the production rules of the considered channel.

The second source of information to describe a segment locally is its neighboring segments. Let S_ij be a RepSegment. First, we consider a temporal window W_b before S_ij and the RepSegments s_bk(S_ij) contained in W_b. We count their number N_b(S_ij). Each of these RepSegments s_bk(S_ij) belongs to a RepSet whose cardinality is denoted C_kb(S_ij); this is the number of times the content of s_bk(S_ij) was broadcast. Second, we consider the same quantities relative to a temporal window W_a after S_ij: N_a(S_ij) and C_ka(S_ij). From the above information, we derive the following local features:

(i) three binary features issued from the separations: (1) presence of a separation before S_ij; (2) presence of a separation after S_ij; and (3) presence of separations before and after S_ij;

(ii) two features N_b(S_ij) and N_a(S_ij);

(iii) five features issued from the C_kb(S_ij) and C_ka(S_ij):

(1) Sum_b(S_ij) = Σ_k C_kb(S_ij),
(2) Sum_a(S_ij) = Σ_k C_ka(S_ij),
(3) Min(S_ij) = min(Sum_b(S_ij), Sum_a(S_ij)),
(4) Max(S_ij) = max(Sum_b(S_ij), Sum_a(S_ij)),
(5) Avg(S_ij) = average(Sum_b(S_ij), Sum_a(S_ij)).

RepSet Features. Two kinds of global features are used to describe a set of repeated segments corresponding to the same content, that is, a RepSet R_i. The first ones are analogous to those used to describe segments:

(i) |R_i|,
(ii) number of days,
(iii) number of days of the week,
(iv) mean duration of the segments in R_i.

The second ones come from the local features defined for segments:

(i) percentage of segments of R_i that have (1) a separation before, (2) a separation after, (3) a separation before or after, and (4) a separation before and after them;

(ii) Σ_j N_b(S_ij) and Σ_j N_a(S_ij);

(iii) for all the segments S_ij of R_i, we compute Sum_b(S_ij), Sum_a(S_ij), Min(S_ij), Max(S_ij), and Avg(S_ij). We then associate 13 features to R_i:

(a) min_j(Sum_b(S_ij)), max_j(Sum_b(S_ij)), average_j(Sum_b(S_ij)),
(b) min_j(Sum_a(S_ij)), max_j(Sum_a(S_ij)), average_j(Sum_a(S_ij)),
(c) min_j(Min(S_ij)), max_j(Min(S_ij)),
(d) min_j(Max(S_ij)), max_j(Max(S_ij)),
(e) average_j(Avg(S_ij)).

As a result, fourteen features, four global and ten local, describe a segment, while twenty-three features describe a RepSet.

4.4. Stream Segmentation. The previous step was devoted to the P/B classification of the RepSets. The goal of the present step is to segment the stream into P/B sequences. It is composed of several substeps. In the first substep, all the segments classified as breaks are extracted from the stream. At this point, the stream is segmented into presegments, each with a start time and an end time. Let Stm = {Seg_i | Seg_i = (Seg_i.start_time, Seg_i.end_time)} represent the segmented stream.

The aim of the second substep is to classify the remaining segments of the stream (each segment of Stm) as P or B segments. The classification is based on the length of the segments: a fixed threshold d_min is used, and every segment longer than d_min is labeled as P, all the others as B. Experiments done by Naturel et al. [1] led them to fix the threshold at one minute.

In the third substep, we fuse consecutive B segments into one segment. At the end of this step, we obtain a new segmentation of the stream as a P/B sequence.
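A sketch of the three substeps in Python, assuming breaks is the sorted, non-overlapping list of segments classified as B (times in seconds), stream_end is the total stream duration, and d_min is the one-minute threshold.

```python
def segment_stream(stream_end: float, breaks: list[dict],
                   d_min: float = 60.0) -> list[tuple]:
    """Return (start, end, label) tuples forming the final P/B sequence."""
    labeled = [(b["start_time"], b["end_time"], "B") for b in breaks]
    # Substep 2: label the remaining gaps (the presegments) by their duration.
    edges = [0.0]
    for b in breaks:
        edges += [b["start_time"], b["end_time"]]
    edges.append(stream_end)
    for start, end in zip(edges[0::2], edges[1::2]):
        if end > start:
            labeled.append((start, end, "P" if end - start > d_min else "B"))
    labeled.sort()
    # Substep 3: fuse consecutive B segments into one.
    fused = [labeled[0]]
    for start, end, lab in labeled[1:]:
        if lab == "B" and fused[-1][2] == "B":
            fused[-1] = (fused[-1][0], end, "B")
        else:
            fused.append((start, end, lab))
    return fused
```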

4.5. Segment Labeling. Once the stream is segmented, the next step is to add a label to each program segment. Two types of analysis may be used here to label the segmented stream: content-based analysis or metadata analysis. In our work, we have used the metadata broadcast with the stream, namely, the EPG (Electronic Program Guide). The EPG contains useful information about the broadcast programs, for example, titles, genres, and sometimes other information such as a short description and the list of actors.

Table 2: Segment classification using the cross-validation sampling method.

CV 10-folds           Programs              Breaks
                      Precision   Recall    Precision   Recall
Random Forest         96.38%      98.01%    97.66%      95.76%
Classification Tree   97.29%      97.48%    97.09%      96.87%
C4.5                  97.12%      97.29%    96.88%      96.68%
KNN (10NN)            95.98%      96.66%    96.12%      95.33%
Naïve Bayes           92.36%      94.93%    93.97%      90.95%
SVM (RBF)             92.31%      95.54%    94.65%      90.83%
CN2                   95.77%      98.34%    98.03%      95.00%

Table 3: Segment classification using the random sampling method (30% to train and 70% to test).

RS (30:70), 5 iterations   Programs              Breaks
                           Precision   Recall    Precision   Recall
Random Forest              96.17%      97.08%    96.59%      95.55%
Classification Tree        96.29%      96.76%    96.25%      95.71%
C4.5                       96.46%      97.03%    96.56%      95.89%
KNN (10NN)                 95.16%      96.19%    95.56%      94.36%
Naïve Bayes                92.41%      94.88%    93.92%      91.02%
SVM (RBF)                  94.37%      96.44%    95.79%      93.38%
CN2                        95.31%      97.71%    97.28%      94.46%


To label the segmented stream, we propose to align it with the EPG using the Dynamic Time Warping (DTW) algorithm. DTW is a well-known method that computes a path and a distance between two sequences X and Y. The distance may be interpreted as the cost to be paid to transform X into Y by a set of weighted edit operations: insertion, deletion, and substitution. The path with minimal cost provides the best alignment. In our system, a distance is computed between segments; it measures the similarity of the segments in terms of duration, start time, and end time:

$$\mathrm{dist}\big(\mathrm{seg}_i, \mathrm{seg}_j\big) = \big|\mathrm{duration}_i - \mathrm{duration}_j\big| + \big|s_i - s_j\big| + \big|e_i - e_j\big|, \quad (5)$$

where duration_i (resp., duration_j) is the duration of seg_i (resp., seg_j), s_i (resp., s_j) is the start time of seg_i (resp., seg_j), and e_i (resp., e_j) is the end time of seg_i (resp., seg_j).

The costs of the insertion, deletion, and substitution operations are defined as

$$C_{\mathrm{del}} = \mathrm{dist}\big(\mathrm{seg}_i, \mathrm{seg}_j\big), \qquad C_{\mathrm{ins}} = \mathrm{dist}\big(\mathrm{seg}_i, \mathrm{seg}_j\big), \qquad C_{\mathrm{sub}} = \alpha \, \mathrm{dist}\big(\mathrm{seg}_i, \mathrm{seg}_j\big), \quad \text{where } 1 < \alpha < 2. \quad (6)$$

The α parameter is used to favor a substitution operation over a deletion followed by an insertion. The reader can refer to Naturel's Ph.D. thesis [15] for more information.
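A compact DTW sketch in Python using the distance (5) and the costs (6); segments are dicts with start and end times in seconds, and the boundary gap penalty for leading insertions/deletions is our assumption, since (6) only defines costs where both segments exist.

```python
def seg_dist(a: dict, b: dict) -> float:
    """Distance (5): duration difference plus start and end time differences."""
    dur_a, dur_b = a["end"] - a["start"], b["end"] - b["start"]
    return abs(dur_a - dur_b) + abs(a["start"] - b["start"]) + abs(a["end"] - b["end"])

def dtw_align(stream_segs: list[dict], epg_segs: list[dict],
              alpha: float = 1.5, gap: float = 1.0) -> float:
    """DTW cost between the P/B segmentation and the EPG; 1 < alpha < 2."""
    n, m = len(stream_segs), len(epg_segs)
    D = [[float("inf")] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + gap   # boundary deletions (assumed penalty)
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + gap   # boundary insertions (assumed penalty)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = seg_dist(stream_segs[i - 1], epg_segs[j - 1])
            D[i][j] = min(D[i - 1][j - 1] + alpha * d,  # substitution, cf. (6)
                          D[i - 1][j] + d,              # deletion
                          D[i][j - 1] + d)              # insertion
    return D[n][m]  # backtracking through D recovers the label alignment
```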

5. Experiments

This section presents a series of experiments illustrating the results of our method. In these experiments, we used a corpus of 22 consecutive days of television recorded from a French channel (France 2) between 9 May 2005 and 30 May 2005. The evaluation of the proposed method concerns the classification, segmentation, and alignment steps. The repetition detection and postprocessing steps are not evaluated, for two reasons. The first is the practical impossibility of manually annotating a database in terms of repeated content. Some methods in the literature provide an estimation of the precision of the repetition detection process by randomly choosing a sample set of repetitions and verifying them manually; this type of evaluation remains imprecise. The second reason is that we would like to compare our results to those of Naturel et al., so we adopt approximately the same evaluation process.

5.1. Classification Evaluation. The first experiments evaluate the RepStreamSet classification, since it is the core of our contribution. The first set of experiments is at the segment level (segment-based), where each segment is described by its 14 features and classified as a P or B segment. The second set is at the RepSet level (RepSet-based), where each RepSet is described by its 23 features and classified as a P RepSet, a B RepSet, or a T (trailer) RepSet. In this section, we present the experiments at the two levels and then compare the results from both.


Table 4: RepSet classification using the cross-validation sampling method.

CV 10-folds           Trailers              Programs              Breaks
                      Precision   Recall    Precision   Recall    Precision   Recall
Random Forest         62.16%      31.08%    97.75%      98.19%    93.24%      93.49%
Classification Tree   43.14%      29.73%    98.13%      97.45%    91.31%      94.20%
C4.5                  37.20%      23.46%    97.65%      97.52%    91.35%      92.83%
KNN (10NN)            42.11%      21.62%    97.62%      96.39%    88.24%      93.04%
Naïve Bayes           0.89%       78.38%    99.89%      13.08%    95.29%      73.10%
SVM (RBF)             72.73%      10.81%    97.88%      97.12%    90.00%      94.56%
CN2                   64.86%      32.43%    97.02%      97.94%    92.82%      91.66%

Table 5: RepSet classification using the random sampling method.

RS (30:70), 5 iterations   Trailers              Programs              Breaks
                           Precision   Recall    Precision   Recall    Precision   Recall
Random Forest              62.50%      13.46%    97.85%      97.82%    91.87%      94.34%
Classification Tree        29.91%      13.46%    98.00%      97.23%    90.49%      94.28%
C4.5                       36.54%      25.68%    97.72%      97.77%    92.17%      92.95%
KNN (10NN)                 28.57%      9.23%     97.41%      96.05%    87.15%      92.83%
Naïve Bayes                0.92%       76.54%    99.94%      35.23%    93.96%      77.69%
SVM (RBF)                  72.73%      10.81%    97.88%      97.12%    90.00%      94.56%
CN2                        27.16%      8.46%     97.89%      97.07%    90.22%      94.59%

Several classification methods were compared (Random Forest, Classification Tree, C4.5, KNN, Naïve Bayes, SVM, CN2), and several sampling methods were used to validate our results using the Orange data mining software [27] (cross-validation, noted CV; random sampling, noted RS). The corpus was manually annotated and then used to label the RepSegments and RepSets in the RepStreamSet. The labeled RepStreamSet is used, on the one hand, to train all the tested classification methods on part of the data and, on the other hand, to check the results on the remaining part.
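As an illustration of the protocol, here is a hedged sketch using scikit-learn's Random Forest in place of the Orange toolbox actually used in the paper; X is the feature matrix (14 columns for segments, 23 for RepSets) and y holds the manual labels.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

def evaluate(X, y):
    """10-fold cross-validated precision/recall, analogous to the CV setting."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    scores = cross_validate(clf, X, y, cv=10,
                            scoring=("precision_macro", "recall_macro"))
    return (scores["test_precision_macro"].mean(),
            scores["test_recall_macro"].mean())
```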

5.1.1. Segment Classification. The set of labeled RepSegments in the RepStreamSet is used to evaluate the segment-based classification approach. As already stated, each segment can take one of two labels: P segment or B segment. It should be noticed that, due to postprocessing errors, a segment could aggregate both P and B shots. In such a case, the segment is considered P or B in the ground truth according to the most overlapping class. This case rarely occurred in our results. Tables 2 and 3 show the precision and recall using several classification and sampling methods.

5.1.2. RepSet Classification. The set of labeled RepSets in the RepStreamSet is used to evaluate the RepSet-based classification approach. Contrary to segment classification, RepSets can take one of three labels: purely P RepSet, purely B RepSet, or T (trailer) RepSet. Tables 4 and 5 show the precision and recall obtained with the same classification and sampling methods as above.

We can observe the poor detection results for T RepSets. This may be due to several problems. First, the small number of trailers in the corpus is not sufficient to correctly train the corresponding model (178 trailers in a set of 14,280 RepSets). Second, the pattern of these RepSets is much more complex, mixing P and B segments. Third, we faced some incoherencies in the shot detection process: some content items that are broadcast several times were not always segmented in the same way. This may be due to capture problems, which affected the quality of the analog streams that were recorded.

In order to correct the misclassification of the T RepSets, an a priori filtering rule is applied to separate most of the T RepSets before the classification step. This rule is based on the local features of the segments contained in the RepSet.

For example, let R_i be a T RepSet. The P segments of R_i will not have neighboring RepSegments (before or after). In other words, a program may have isolated RepSegments, which does not happen with B sequences.

We have also used two other rules to filter some P RepSets. The first rule detects the RepSets whose mean segment duration is greater than three minutes; these RepSets are labeled P, since B sequences are overwhelmingly shorter.

We have also observed that, in programs such as political debates, games, and talk shows, some shots are detected as repeated within a given program. Such detections are due to the fact that, in these programs, some shots are taken with a fixed camera, have the same duration, and show a very static scene with no motion. As a consequence, these shots are almost identical from a visual point of view and give rise to RepSets. Other shots may also be detected, like opening and closing credits when the program is split by a break, or shots showing the prizes to be won in game shows. The second rule uses the proximity of such segments in the same RepSet and the ratio of separations before and after these segments.


Table 6: Classification of the RepSets after filtering, using the cross-validation sampling method.

CV 10-folds           Trailers              Programs              Breaks
                      Precision   Recall    Precision   Recall    Precision   Recall
Random Forest         31.51%      32.86%    95.85%      95.08%    91.32%      92.45%
Classification Tree   34.00%      24.29%    94.82%      94.88%    90.49%      91.06%
C4.5                  36.51%      32.86%    95.55%      95.68%    91.96%      92.00%
KNN (10NN)            58.54%      34.29%    96.38%      94.35%    90.14%      94.32%
Naïve Bayes           2.32%       70.00%    99.60%      66.03%    94.11%      75.92%
SVM (RBF)             90.00%      12.86%    96.07%      94.77%    90.05%      94.13%
CN2                   20.78%      22.85%    94.82%      94.50%    90.42%      90.69%

Table 7: Classification of the RepSets after filtering, using the random sampling method.

RS (30:70), 5 iterations   Trailers              Programs              Breaks
                           Precision   Recall    Precision   Recall    Precision   Recall
Random Forest              20.49%      20.58%    95.31%      94.80%    91.00%      91.84%
Classification Tree        19.15%      22.04%    94.57%      94.70%    90.57%      90.00%
C4.5                       26.00%      24.08%    94.92%      94.87%    90.51%      90.77%
KNN (10NN)                 29.33%      26.94%    94.35%      93.50%    88.55%      90.09%
Naïve Bayes                2.37%       68.98%    99.55%      67.88%    93.91%      75.29%
SVM (RBF)                  20.49%      20.58%    95.31%      94.80%    91.00%      91.84%
CN2                        0%          0%        97.20%      94.24%    89.21%      96.11%

Figure 2: Results in terms of P F-measure (program-based evaluation; F-measure versus days, over 0–25 days, for the results of Naturel et al., the proposed method with RepSet classification, and the proposed method with segment classification).

Applying the three previous rules to the RepSets provides two RepStreamSets as a result: the first, called for simplicity RepStreamSet_filtered, contains the RepSets that obey the filters; the second, called RepStreamSet_remained, contains the remaining RepSets after the filtering step. In our experiments, 5114 RepSets are filtered (i.e., the size of RepStreamSet_filtered), while 9166 remain (i.e., the size of RepStreamSet_remained). In RepStreamSet_filtered, 5025 P RepSets and 88 trailer RepSets are correctly filtered. After the application of the three filters mentioned above, the new set of repeated content (i.e., RepStreamSet_remained) is used. Tables 6 and 7 show the precision and recall results on this new set.

Most of the RepSets in RepStreamSet_filtered are filtered correctly. They were not considered when computing the precision and recall of the classification. By adding the correctly filtered RepSets to the previous tables, we obtain new results (Tables 8 and 9). These results give better performance, especially for the T RepSets.

Figure 3: Results in terms of B F-measure (break-based evaluation; F-measure versus days, over 0–25 days, for the results of Naturel et al., the proposed method with RepSet classification, and the proposed method with segment classification).


As a complementary experiment, we checked whether the performance of the classification at the segment level remains the same when using the two filters. Tables 10 and 11 show the precision and recall of the previously used methods with the cross-validation sampling method, but at the segment level.

5.1.3. Comparing Segment-Based and RepSet-Based Classification. In this section, we randomly chose 30% of the RepSets to train the models, and then all the RepSets were classified, in order to assign every data point to a class, even those belonging to the training set.

Table 8: Table 6 plus counting the filtered RepSets.

CV 10-folds + filtering   Trailers              Programs              Breaks
                          Precision   Recall    Precision   Recall    Precision   Recall
Random Forest             71.69%      77.00%    98.00%      98.24%    91.32%      92.14%
Classification Tree       77.00%      74.00%    97.48%      98.14%    90.49%      90.76%
C4.5                      75.12%      77.00%    97.83%      98.51%    91.96%      91.69%
KNN (10NN)                84.50%      77.45%    98.23%      97.89%    90.14%      94.00%
Naïve Bayes               8.12%       89.70%    99.75%      84.53%    94.11%      75.66%
SVM (RBF)                 91.66%      70.10%    98.00%      97.43%    90.05%      93.81%
CN2                       67.26%      73.50%    97.50%      97.40%    90.42%      90.38%

Table 9: Table 7 plus counting the filtered RepSets.

RS (30:70), 5 iterations + filtering   Trailers              Programs              Breaks
                                       Precision   Recall    Precision   Recall    Precision   Recall
Random Forest                          78.62%      78.46%    98.12%      97.97%    91.00%      91.39%
Classification Tree                    76.00%      78.87%    97.83%      97.93%    90.57%      89.57%
C4.5                                   80.88%      79.38%    97.96%      98.00%    90.50%      90.33%
KNN (10NN)                             82.55%      80.10%    97.75%      97.48%    88.55%      89.65%
Naïve Bayes                            11.25%      90.67%    99.75%      87.69%    93.90%      74.93%
SVM (RBF)                              96.61%      76.20%    98.17%      98.00%    90.22%      92.50%
CN2                                    97.41%      73.33%    98.85%      97.76%    89.22%      95.65%

Figure 4: Results in terms of P F-measure, without separations (program-based evaluation of the P/B segmentation; F-measure versus days, over 0–25 days, for the results of Naturel et al., the proposed method with RepSet classification, and the proposed method with segment classification).

Tables 12(a), 12(b), 12(c), and 12(d) show the distribution of RepSets between classes using several classification methods. The rows correspond to the true class of each RepSet, and the columns to the class in which these RepSets were classified. In an analogous way, we randomly chose 30% of the RepSegments and classified the whole set. Tables 13(a), 13(b), 13(c), and 13(d) show the corresponding confusion matrices.

In order to compare the results obtained with RepSets to those obtained with the segments, the former should be translated in terms of segments. To this aim, we compute the percentage of well-classified segments produced by the RepSet classification. Table 14 gathers the results in terms of numbers of RepSets (column 1) or segments (columns 2 and 3) correctly classified. Columns 1 and 3 correspond to the classification of the RepSets and the segments, respectively. Column 2 is the translation of column 1 in terms of correctly classified segments (only columns 2 and 3 can be directly compared).

Figure 5: Results in terms of B F-measure, without separations (frame-based evaluation of the P/B segmentation; F-measure versus days, over 0–25 days, for the results of Naturel et al., the proposed method with RepSet classification, and the proposed method with segment classification).


It should be noticed that the Random Forest algorithm provides the best performance and appears to be the algorithm best adapted to our problem. Furthermore, the comparison of the two numbers in boldface shows that RepSet-based classification provides better performance than segment-based classification: the misclassification rate falls from 5.27% to 3.28%. These results mean that using information about all the repetitions of a content item, rather than considering the RepSegments in their local context only, is useful and increases the global performance of the system.

Table 10: Classification of the segments after filtering, using the cross-validation sampling method.

CV 10-folds           Programs              Breaks
                      Precision   Recall    Precision   Recall
Random Forest         96.56%      97.14%    93.91%      92.72%
Classification Tree   97.44%      97.23%    94.20%      94.63%
C4.5                  97.26%      97.45%    94.63%      94.23%
KNN (10NN)            96.05%      96.81%    93.18%      91.63%
Naïve Bayes           92.93%      92.48%    84.35%      85.23%
SVM (RBF)             94.17%      96.94%    93.15%      87.38%
CN2                   96.59%      98.03%    95.72%      92.73%

Table 11: Table 10 plus counting the filtered segments.

CV 10-folds + filtering   Programs              Breaks
                          Precision   Recall    Precision   Recall
Random Forest             97.79%      98.18%    94.58%      93.43%
Classification Tree       98.35%      98.23%    94.82%      95.15%
C4.5                      98.17%      98.25%    94.84%      94.62%
KNN (10NN)                97.47%      97.97%    93.40%      92.52%
Naïve Bayes               95.53%      95.26%    86.17%      86.90%
SVM (RBF)                 96.26%      98.05%    94.00%      88.79%
CN2                       97.80%      98.73%    96.17%      93.48%

Table 12: Confusion matrices of the RepSet-based classification (rows: true class; columns: predicted class).

(a) Classification Tree
            Trailers   P      B
B           19         122    2101
P           5          6733   113
Trailers    28         20     26

(b) Random Forest
            Trailers   P      B
B           10         98     2134
P           1          6760   90
Trailers    34         17     23

(c) SVM
            Trailers   P      B
B           2          109    2131
P           0          6601   250
Trailers    8          25     41

(d) KNN (10NN)
            Trailers   P      B
B           8          230    2107
P           8          6613   230
Trailers    14         27     33

5.2. Stream Segmentation and Characterization. The goal of this section is to extend the classification step to the whole stream, that is, to UniqSegments also, and to label the segments whenever possible. As mentioned before, this labeling stage is done by aligning the stream with an electronic program guide (EPG) provided with the stream or by an external source. In order to compare our solution to the one proposed in [1], we adopted the same algorithm that segments the stream into P/B segments (as described above) and aligns this segmentation with the EPG, with the same parameters. As was done in [1], we ran the method both with and without the detection of separations (silence + monochrome frames) in order to measure the influence of this feature.

Table 13: Confusion matrices of the segment-based classification (rows: true class; columns: predicted class).

(a) Classification Tree
     P       B
B    2228    20329
P    25048   939

(b) Random Forest
     P       B
B    1829    20728
P    25259   728

(c) SVM
     P       B
B    3291    19266
P    25066   921

(d) KNN (10NN)
     P       B
B    1829    20728
P    25259   728

