• Aucun résultat trouvé

User evaluation of multi-episode video summaries

N/A
N/A
Protected

Academic year: 2022

Partager "User evaluation of multi-episode video summaries"

Copied!
6
0
0

Texte intégral

(1)

USER EVALUATION OF MULTI-EPISODES

VIDEO SUMMARIES

Itheri Yahiaoui, Bernard Merialdo, Benoit Huet

Multimedia Communication Department Institut Eurécom

2229 Route des Crêtes, 06904 Sophia-Antipolis, France

{itheri.yahiaoui, bernard.merialdo, benoit.huet}@eurecom.fr

Content

Video summaries

Multi-Episode Video Summaries Optimal summaries

Maximal Recall Automatic Summarization Experiments

User Evaluation

Conclusion and Future work.

Video Summaries

A summary is a subset of the video

Identify important information Constrained duration

A summary can be good or bad

Depends on task

Movie Trailer, Informative or Descriptive, etc…

Quality is generally difficult to evaluate

Video Summary format

keyframe set video skim

(2)

Multi-Episode Summaries

Independently created summaries may contain redundant information

Specific requirements to construct multi- episode summaries

Identification of :

What is common to several episodes What is specific (unique) to each episode

Typical applications

TV series, Set-Top-Box, etc…

Optimal Summaries

What is the best summary for a video?

Many proposals, two basic approaches:

User-based evaluation (qualitative) Smith and Kanade [CBAIVL 1998]

Informedia Project: video skims.

Mathematical criterion (quantitative) Gong and Liu [ICME 2000]

Use of SVD over a feature frame matrix.

Uchihashi and Foote [ICASSP 1999]

Definition of a shot importance measure.

Ideal summary evaluation

User u without summary performs task T:

performance pT(u)

User u with summary S performs task T:

performance pT(u | S)

Ideal summary efficiency:

average( pT(u | S) - pT(u) )

But:

users are different (many users required) users learn (cannot compute pT(u | S) after pT(u)) evaluation is very expensive (often not feasible)

Maximal Recall Task

Idea: Identify a movie from a picture from a magazine

Formalization:

User u knows summaries Siof video Vi User u is shown an excerpt E (from video Vj) User u is asked to guess j

Optimal summaries:

Should maximize the performance over all E Evaluation can be automated if the behavior of u can be reasonably simulated

(3)

Maximal Visual Recall

User chooses video jif (s)he recognizes similar images in both excerpt E and summary Sj

In case of ambiguity:

no decision This process can be automated based on similarity measure Similarity based on color histograms

Intuitive Idea

Consider videos:

red frame is good for V1, but will generate ambiguities with V2and V3

Summary should contain frames:

frequent in one video unfrequent in others

V1 V2 V3

User Performance

Number of excerpts with correct unambiguous answers

Computed using all excerpts of fixed duration dfrom all the videos

Note: performance vary with d.





v' m j m

v i j

v m j m

v i j

S f f similar to f

E f v v'

, S f and f similar to f

E f j : v) Card (i,

Evaluation Criterion

V1 V2 V3 V4 V5 V6

Summary construction

Iterative process Greedy algorithm Selection based on frame coverage

In-place refinement Try to replace each frame individually to improve quality Repeat until no change

(4)

Experiments

Six episodes from the TV serie «Friends»

Total videos duration 83150 frames

(99 min)

Summary of six key-frames per video Key-frames are selected according to method described earlier

Video processing

Elimination of jingle and credits Feature Vectors construction

Video Summary Evaluation

Issue: Evaluation

Two opposite approaches

User based evaluation: difficult to set-up, possible bias, ...) Mathematical criterion: (easy to set-up, difficult to interpret)

Simulation of user behavior based on Maximal Recall

Real experimentation

User simulated performance measure Limitation of image similarity measure Single and Multi-episode videos

Video Summary Evaluation

Experiment based on visual recall capabilities

Show summaries S1...Skof videos V1…..Vkto the user

For example a grid of images, where each line represent a Video Show an excerpt E of a video Vi to the user, then ask the user to guess i

Video Summary Evaluation

User answers:

Don’t Know

Unknown case when no similar image between E and any summary Si

Confused

Ambiguous case when similar images between E and summaries Siand Sj Sure

Unambiguous case when similar images between E and a single summary Si

Not sure

??!!??

I am Sure Don’t know

??!!??

(5)

Excerpt

duration %

correct %

ambiguous %

incorrect

4 sec 25.25 1.27 3.53

6 sec 29.87 1.61 4.79

8 sec 33.36 2.51 6.38

10 sec 36.82 2.86 6.54

20 sec 46.70 7.02 9.54

40 sec 54.06 15.47 13.14

Coverage over the original videos Evaluation of summaries

Experimental results Evaluation Results Analysis

Idea: Look precisely at the difference between the system’s evaluation method and the user’s answers.

Count the number of correct and wrong answers

Discuss the reason of the choice made by the users

Results based on 100 excerpts for 10 users

Evaluation Results

36%

84%

81%

89%

80% 83%

89%

79%

85%

79% 80%

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

C o v e r a g e

Simulated User

User1 User2 User3 User4 User5 User6 User7 User8 User9 User10

Performance

Average Real User Performance = 82.9%

People with high score (89%) are fan of the serie

What makes our Simulated User perform so poorly (36%) ?

Evaluation Results

10 9 8 7 6

5 4 3 2 1

0

Nb of excerpts for which the system is correct Nb of excerpts

0 10 20 30 40 50 60 70 N b o f e x c e r p t s

Nb of users

Nb of excerpts for which the system is correct 19 7 4 2 1 1 0 0 0 1 0

Nb of excerpts 61 12 6 4 3 1 2 0 0 4 6

10 9 8 7 6 5 4 3 2 1 0

42 excerpts have been correctly identified by allreal users but incorrectly by the system

5 excerpts have been correctly identified by 9real users but incorrectly by the system

(6)

Results Analysis

Objective

:

Improve the performance of our automatic summarization scheme

Major factors:

Person, Object, Action, Location, Time

Actors + Clothes, 27.45 Actor + Clothes, 25.96 1/n Frames, 7.55

Decor, 4.47

Action, 10.32

Actor Occas., 3.4 Unusual Object, 2.34

details Face, 3.4

Clothing Specific, 16.59

Face detection and recognition, 23

Object detection and recognition, 18

Person and its Clothing Identification, 47 Identification of a group of

Persons and their clothing, 32

Recognize Objects and Persons in various

environments, 34

Improvements

Nb of excerpts for which the system could be correct depending on methodology employed.

(out of the 47 incorrectly identified excerpts)

Conclusion

Novel approach to automated video summary creation (inc. Multi-Video case) New method for evaluation

Use of Maximal Recall

Performance levels are easy to understand

New method for summary creation

Suboptimal automatic construction Summary duration is user definable

Work on Region Matching/Recognition

Références

Documents relatifs

Finally, the global summary can be constructed and presented to the user, as an hypermedia document composed of representative images of videos content selected in the previous step

[r]

In this section we present the evaluation results using the simulated user principal on multi-episodes video summaries created with six different algorithms.. As test data, we

Having defined Simulated User Principle for single video summaries, a process to automatically construct a summary with optimal (or near optimal) performance for this experiment can

Finally, the global summary can be constructed and presented to the user, as an hypermedia document composed of representative images of videos content selected in the previous step

Despite the fact that our performance criterion is based on the rate of correct simulated user responses (ex- cerpts for which the simulated user has correctly guessed the

While, Balanced AV-MMR further considers some important semantic features in visual information and audio information, including the audio genre, the face, and the temporal

One uses a circle space to demonstrate the relations between multi-video frames, and the other one shows the dynamic construction of a multi-video summary based on hierarchical