MOVIE-IN-A-MINUTE: AUTOMATICALLY GENERATED VIDEO PREVIEWS

Mauro Barbieri, Nevenka Dimitrova and Lalitha Agnihotri

6.5 Implementation and results

The generation of a Miami video can be divided into four main steps (see Figure 6.1):

1. audio and video feature extraction,

2. audio and video segmentation and classification,

3. segment selection, and

4. preview composition.

Figure 6.1. Steps for the generation of a Miami video

6.5.1 Audio and video feature extraction

Various algorithms are applied to the audio and video signals to extract the features required by the next steps and for computing the priority score according to equation (6.3) after normalization over the entire video. Video features include low-level attributes such as contrast, color distribution and motion activity, and mid-level attributes such as face location and size, and camera motion. Audio features include RMS value, spectral centroid, bandwidth, zero-crossing rate, and MFCC (see [McKinney & Breebaart, 2003] for a detailed description).
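As a rough illustration of the audio side, the frame-level features named above can be computed as follows. This is a generic sketch, not the authors' implementation: the frame length, sample rate, and the FFT-based definitions of centroid and bandwidth are common textbook choices.

```python
import numpy as np

def audio_features(frame, sample_rate):
    """Frame-level audio features (generic sketch; assumes a non-silent frame)."""
    rms = np.sqrt(np.mean(frame ** 2))
    # zero-crossing rate: fraction of consecutive samples that change sign
    zcr = np.mean(np.abs(np.diff(np.signbit(frame).astype(int))))
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    centroid = (freqs * power).sum() / power.sum()  # spectral centroid (Hz)
    bandwidth = np.sqrt((((freqs - centroid) ** 2) * power).sum() / power.sum())
    return rms, zcr, centroid, bandwidth

# Example: a pure 1 kHz tone sampled at 8 kHz
sr = 8000
t = np.arange(1024) / sr
rms, zcr, centroid, bw = audio_features(np.sin(2 * np.pi * 1000 * t), sr)
```

For the pure tone, the RMS is close to 0.707, the centroid sits at the tone frequency, and the bandwidth is near zero; MFCC extraction needs a mel filterbank on top of the same power spectrum and is omitted here.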

6.5.2 Audio and video segmentation and classification

This step can be divided into five sub-steps:

a. Shot segmentation and clustering: a standard shot cut detection algorithm [Lienhart, 1999] is applied to divide the video stream into continuous shots. Time-constrained clustering [Boreczky et al., 2000] is then applied to group together visually consistent shots that are not far apart in time.

b. Audio classification: the synchronized audio stream is classified into coherent audio classes such as silence, speech, music, noise, etc. Changes in the audio class indicate the audio segment boundaries used to verify constraints (6.6).

c. Micro-segmentation: segments exceeding the maximum duration after the shot segmentation are further divided into sub-segments with durations greater than dmin and with boundaries possibly aligned with content-based clues such as: a change in the audio class, the appearance or disappearance of a detected face, or a change in camera motion or object motion. The micro-segmentation step can be easily formalized as an integer linear programming problem and solved with standard methods (e.g. the simplex method).

d. Segment compensation: successive segments violating constraints (6.6) are merged until the continuity requirement is fulfilled without violating the maximum segment duration dmax. When this is not possible, the segmentation induced by the audio classifier is used as primary instead of the shot-based one.

e. Pre-filtering: commercial detection [Schaffer et al., 2002] is performed over the entire video and the detected commercials are discarded from the set of segments available for the generation of the Miami video. In order not to disclose the end of the program, an extra 10% of segments is removed from the end.
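To illustrate the time-constrained clustering of sub-step (a), a minimal greedy sketch is shown below. The similarity predicate and the 30-second gap are hypothetical placeholders, not the values or method used by the authors:

```python
def time_constrained_clusters(shots, similar, max_gap=30.0):
    """Greedy time-constrained clustering (illustrative sketch).

    shots: ordered list of (start, end) times in seconds.
    similar(i, j): True if shots i and j are visually consistent.
    A shot joins a cluster only if it resembles the cluster's last
    shot and starts within max_gap seconds of that shot's end.
    """
    clusters = []
    for i, (start, _) in enumerate(shots):
        for cluster in clusters:
            last = cluster[-1]
            if similar(i, last) and start - shots[last][1] <= max_gap:
                cluster.append(i)
                break
        else:  # no existing cluster accepts this shot
            clusters.append([i])
    return clusters

# Toy example: shots 0 and 2 look alike and are close in time;
# shot 3 also resembles shot 0 but is too far away in time to join.
shots = [(0, 4), (4, 10), (12, 15), (300, 305)]
alike = {(0, 2), (2, 0), (0, 3), (3, 0)}
clusters = time_constrained_clusters(
    shots, lambda i, j: i == j or (i, j) in alike)
```

The time constraint is what separates this from plain visual clustering: shot 3 matches shot 0 visually but lands in its own cluster because it occurs much later.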

6.5.3 Segment selection

The segment selection step consists of searching for the best set of segments, i.e. the set that maximizes the objective function (6.1) in the space of all possible previews.

The space of all possible previews can be explored using a local search method such as simulated annealing or a genetic algorithm [Aarts & Lenstra, 1997], because at this point each requirement or constraint has either been solved in the previous steps or mapped to a priority or penalty score term in the objective function (6.1). However, considering that the actual usage of the Miami video is to provide a rough overview of the content of a program, the goal of finding the absolute maximum of (6.1) can be relaxed to finding a good approximation: a preview with a reasonably high value of eval(S).

We have implemented a heuristic search strategy that iteratively improves an initial set of selected segments. The starting set is constructed by selecting for each scene the segment with the highest priority score π(sj) that generates the minimum redundancy ρ(S). At every iteration, the first segments of each scene that improve the objective function are added to the set. The algorithm stops after a certain fixed number of iterations or if eval(S) cannot be significantly improved. The solution is not optimal but usually good enough for the typical Miami video usage.
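A much-simplified sketch of such a greedy strategy is shown below. It keeps only per-segment priority scores and a duration budget, omitting the redundancy term ρ(S) and the full objective eval(S), so it illustrates the scene-by-scene structure of the heuristic rather than the authors' exact algorithm:

```python
def select_segments(scenes, budget):
    """Greedy selection sketch (not the authors' exact algorithm).

    scenes: list of scenes, each a list of (duration, priority) pairs.
    Start from the highest-priority segment of every scene, then keep
    adding the next-best segments while the duration budget allows.
    """
    ranked = [sorted(s, key=lambda seg: seg[1], reverse=True) for s in scenes]
    selected, used = [], 0
    for segs in ranked:               # initial set: best segment of each scene
        duration = segs[0][0]
        if used + duration <= budget:
            selected.append(segs[0])
            used += duration
    for segs in ranked:               # improvement passes: next-best per scene
        for seg in segs[1:]:
            if used + seg[0] <= budget:
                selected.append(seg)
                used += seg[0]
    return selected, used

# Three scenes; durations in seconds, priorities normalized to [0, 1]
scenes = [[(4, 0.9), (3, 0.5)], [(5, 0.8)], [(4, 0.7), (2, 0.6)]]
chosen, total = select_segments(scenes, budget=15)
```

Starting from the per-scene best guarantees every scene is represented before any scene gets a second segment, which mirrors the coverage-first behaviour described above.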

6.5.4 Preview composition

The last step consists of the actual composition of the preview by fusing the selected segments into one continuous audiovisual stream. Abrupt audio and video transitions between segments are smoothed using fading and dissolve effects.
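For the audio track, such a smoothing step can be sketched as a linear cross-fade over a short overlap window. This is a generic sketch; the actual system also applies video dissolves:

```python
import numpy as np

def crossfade(a, b, overlap):
    """Linearly cross-fade the tail of signal a into the head of signal b."""
    fade = np.linspace(0.0, 1.0, overlap)
    mixed = a[-overlap:] * (1.0 - fade) + b[:overlap] * fade
    return np.concatenate([a[:-overlap], mixed, b[overlap:]])

# Two toy "audio" segments: constant 1.0 fading into constant 0.0
a = np.ones(1000)
b = np.zeros(1000)
out = crossfade(a, b, overlap=100)
# result length: 1000 + 1000 - 100 samples, ramping down inside the overlap
```

A video dissolve works the same way, with the ramp applied per pixel between the last frames of one segment and the first frames of the next.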

6.5.5 Prototype implementation

A prototype of Miami video has been implemented in C++ (MPEG-2 decoding and content analysis algorithms) and Java (local search, segment selection and preview composition) for the generation of previews of recorded broadcast programs in MPEG-2 video format. The generation of a Miami video on a state-of-the-art personal computer requires no longer than the actual program duration. Most of the CPU time is used for video decoding and content analysis algorithms; the segment selection step requires only a fraction of the total running time.

In preliminary tests the system has been manually tuned and tested with a large set of narrative programs such as feature films and documentaries. The typical duration of a Miami video for a two-hour-long feature film is usually set to 60 or 90 seconds.

6.5.6 Results

The first reaction of most of the users to seeing the generated previews was always very positive. However, evaluation of the results has always been a difficult task for video summarization. Just as there are many ways to describe an event or a scene, users can produce many video previews that they consider acceptable. Objective evaluation and benchmarking of different algorithms are still open challenges.

To judge whether the Miami video algorithm fulfils actual users' requirements, whether we should consider other requirements and, ultimately, if a Miami video provides a good overview of a program, we performed a user study involving ten subjects, male and female, in various age categories. None of the participants were in any way involved in the development of the Miami video.

We conducted guided interviews organized in three parts. The first part was aimed at getting an impression of how much of the story line is comprehensible, the second part contained questions related to the requirements, and the third part consisted of a benchmark against a preview generated by uniform sub-sampling.

In the first part, participants had to write down a description of the story line of four movies after seeing only the corresponding Miami videos, 60 or 120 seconds long. Some of the users had seen some of the four movies at least once in the recent past. However, only half of the participants who had seen a movie and one third of the participants who had not could give a correct description of its story. Overall, 23% of all participants gave a wrong description. These results indicate that it is difficult to grasp the story line of a movie from a 60- or 120-second Miami video. However, presenting the ambiance of a movie is just as important, and in this respect most of the users indicated that the Miami video is a useful tool.

In the second part, participants were shown examples of Miami videos and were asked various questions related to each of the seven categories of requirements presented in Section 6.3. The results indicate that the set of requirements considered by the Miami algorithm is relevant and complete. Generally speaking, participants were moderately positive about the degree of fulfillment of the requirements. In particular, segment duration and speech continuity were not perceived as satisfactory in many cases. Fulfillment of these requirements can be improved by using a more accurate and robust audio classifier and video segmentation algorithm.

In the third part of the interview, subjects were shown two versions of a video preview (for five movies of various genres) and were asked to choose which one they preferred and why. The two versions were a Miami video and a preview generated by uniformly sub-sampling the program while preserving shot boundaries. The tests indicate that the Miami video is only slightly more appreciated than uniform sub-sampling. Moreover, users found it very difficult to choose between the two previews.
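The baseline can be approximated as follows. This is a sketch under the assumption that whole shots are kept or dropped, which preserves shot boundaries; the spacing rule is illustrative, not the exact benchmark implementation:

```python
def uniform_subsample(shots, target):
    """Keep roughly every k-th whole shot so about `target` seconds remain.

    shots: ordered list of (start, end) times in seconds. Shots are
    kept or dropped whole, never cut, so shot boundaries are preserved.
    """
    total = sum(end - start for start, end in shots)
    if total <= target:
        return list(shots)
    k = max(1, round(total / target))
    return shots[::k]

# 20 five-second shots (100 s in total), 25 s target -> every 4th shot
shots = [(i * 5, i * 5 + 5) for i in range(20)]
picked = uniform_subsample(shots, target=25)
```

Unlike the Miami selection, this baseline ignores content entirely, which is what makes it a useful control in the benchmark.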

This could be related to the fact that some requirements (e.g. continuity) were not fully met. Users might have perceived the Miami videos as randomly composed as the sub-sampled versions (although this type of randomness is different from the randomness introduced by uniform sub-sampling). To verify this hypothesis, a new user test should be performed using Miami videos 'manually repaired' to fully meet the users' requirements.