• Aucun résultat trouvé

Multimedia indexing

N/A
N/A
Protected

Academic year: 2022

Partager "Multimedia indexing"

Copied!
35
0
0

Texte intégral

(1)

Multimedia Indexing

Prof. Bernard Merialdo

Institut Eurecom

[email protected]

ERMITES 2007

Contents

‹Shot Segmentation

‹Video Analysis

‹TrecVid:

z Semantic Classification

z Video Search

z Summarization

‹MPEG-7

(2)

ERMITES 2007 3

Video Indexing

Keyframes

Camera movements Objects / events Text / captions

Scenes Shots

ERMITES 2007 4

Shot Segmentation

‹A shot is a continuous take from one camera

‹The transition from one shot to the next can be a hard cut or a gradual transition

‹Hard cuts can generally be easily detected:

‹Gradual transitions span over several frames

‹There are many types of gradual transitions based on different visual effects

(3)

ERMITES 2007 5

Shot Segmentation

‹Dissolve

z Special case: fade-in, fade-out

‹Wipe

Color Histogram: per keyframe

(4)

ERMITES 2007 7

Color Histogram: region-based

‹Split the image into regions, concatenate the region histograms

1 2

3 4

1 2 3 4

ERMITES 2007 8

Cut Detection: hard cuts

‹ Basic idea:

z Measure distance d(It,It+1) between consecutive frames

z Detect cut if distance is greater than threshold:

d(It,It+1) ≥θ

‹ Common distance: color histogram

z Depends on color space

z Robust to object movements

z Efficient for hard cuts

z Poor for gradual transitions

+

+ =

Colors

c t t 1

1 t

t,I ) h(c) h (c)

d(I

(5)

ERMITES 2007 9

Cut Detection: Gradual Transitions

‹Sliding window:

z Compare pre- and post- frames with current frame fc

z Compute PrePostRatio:

z Peak of PrePostRatio = end of gradual transition

=

PostFrames f

c PreFrames f

c

) f d(f,

) f d(f, io

PrePostRat

Cut Detection: Gradual Transitions

‹Dissolve between shot A and shot B:

‹PrePostRatio is usually minimal at the

beginning of a gradual transition and rises up to a maximum at the end of the transition

Pre-frames Current frame Post-frames A

A A A A A

A A A

A A A A A A

B B B B B B A A A A A A A

B B B

B B B

A A A A A A A B B B B B B

A A A B B B B B B

A B B B B B B

A

B

PrePostRatio

minimal slowly rising steeply rising

maximum falling

(6)

ERMITES 2007 11

Cut Detection: Gradual Transitions

‹Example of PrePostRatio curve

z two short gradual transitions and two cuts

0 100 200 300 400 500 600 700 800 900 1000

Frames

PrePostRatio

0

ERMITES 2007 12

Cut detection: difficult cases

‹Similar environment

z Change in camera position

z Same color ambiance

z Cut is difficult to detect

(7)

ERMITES 2007 13

Cut detection: difficult cases

‹Fast movement of large object

z Can be confused with wipe

z Shot can be over-segmented

Cut detection: difficult cases

‹Sudden change in illumination

z Sudden modification of colors

z Also the case in explosions, etc…

z Shot can be over-segmented

(8)

ERMITES 2007 15

Cut detection: ambiguous cases

‹Inserts

‹Interview edit

ERMITES 2007 16

Camera Motion

z

y

x z

y

x

Panning Tilting Rotation

Translation Zooming

(9)

ERMITES 2007 17

Camera Motion

‹Pan

‹Rotation

‹Zoom

Camera Motion

‹ How to find the motion vectors ?

z For example, block matching (remember image compression)

z Motion vector:

„ u(t) = x(t+1) – x(t)

„ v(t) = y(t+1) – y(t) y(t)

x(t)

y(t+1)

x(t+1)

(10)

ERMITES 2007 19

Camera Motion : Pan

‹Ideal Real

ERMITES 2007 20

Camera Motion : Zoom In

‹Ideal Real

(11)

ERMITES 2007 21

Camera Motion

‹How to determine camera motion ?

‹Three steps:

z Find movement in the image (motion vector field)

z Model field with parametric model

z Decide motion from parameter values

Camera Motion : Model Estimation

‹Affine model for Motion Vector field:

‹Least Square Estimation from motion vectors

‹Interpretation of coefficients

z Horizontal Pan:

„ a2=a3=a4=a5=a6=0, a1≠0

z Zoom in:

„ a1=a3=a4=a5=0, a2=a6>0

z Zoom out:

„ a1=a3=a4=a5=0, a2=a6<0

⎟⎟

⎜⎜

+

⎟⎟

⎜⎜

⎟⎟

⎜⎜

=

⎟⎟

⎜⎜

4 1 0

0 6

5 3 2

a a y

y x x a a

a a v u

(12)

ERMITES 2007 23

Camera Motion : Other Models

ERMITES 2007 24

Shot Representation: Keyframes

‹It is interesting to represent a shot with a single image: keyframe

z Useful for content browsing

(13)

ERMITES 2007 25

Shot Representation: Keyframe Selection

‹Approaches:

z First, last, middle frame of shot

z Fixed spacing (e.g. every 5 sec)

z Cluster centroid:

„ Cluster frames of shot

„ Choose keyframe(s) closest to centroid

z Difference based:

„ Make new keyframe when difference with last keyframe is greater than threshold

z Take motion into account:

„ E.g: zoom finishes on interesting picture

„ First and last image of pan

Shot Representation: Features

‹Image and video features:

z Color histogram

z Texture

z Edges

z Regions (segmentation or grid)

z Movement

‹But also:

z Audio analysis (silence, speech, music, noise…)

z Speech recognition on audio track

z Captions from subtitling

z Text recognition

(14)

ERMITES 2007 27

Scene Segmentation

‹A scene is a sequence of shots with similar topic, action and/or location

z Alternance of similar shots

z Same location, similar content

ERMITES 2007 28

Scene Segmentation

‹Idea:

z Similar shots close in time belong to the same scene

‹Algorithm:

z Clustering with temporal distance:

d(i,j) = 100 – s(i,j) x W(i,j) if |i-j| < T infinity else

with: s(i,j) similarity of shots i and j (normalized to 100) W(i,j) temporal weight function

(vanishes for |i-j|=T)

(15)

ERMITES 2007 29

Scene Transition Graphs

Yeung & Yeo, Princeton

‹ Cluster shots and build graph of shot sequences

‹ Scenes are

islands connected by a single arc in graph

TV News parsing

‹Detect anchor

‹split into stories

(16)

ERMITES 2007 31

Anchor Detection (UCF)

‹Problem:

z Same anchor person on different backgrounds

‹Face Detection

z Based on color

‹Person Detection

z Extend face region and cluster

TrecVid

(17)

ERMITES 2007 33

TrecVid

‹Evaluation campaign organized by NIST (National Institute of Standards, USA)

‹Purpose: compare video retrieval algorithms on same data and tasks

‹Started in 2001 as a track of TREC

‹Independant campaign from 2003

‹Participants: 12 in 2001, 54 in 2006

TrecVid Data

Sound and Vision (dutch) 50/50

2007

BBC Rushes 50/50

BBC Rushes 50

TV News (+arabic, chinese) 0/158

2006

TV News (+arabic, chinese) 85/85

2005

TV News (ABC, CNN, CSPAN) 0/70

2004

TV News (ABC, CNN, CSPAN) 66/67

2003

Internet Open Archive 73

2002

NIST videos 11

2001

Type Hours of video

(training/test) Year

(18)

ERMITES 2007 35

TrecVid Tasks

‹Shot Boundary Determination 2001-2007

‹Search 2001-2007

‹High-Level Feature Extraction 2003-2007

‹Stories 2003-2004

‹BBC Rushes 2005-2007

‹Camera motion 2006

ERMITES 2007 36

TrecVid: Shot Boundary Determination

‹2006 Results:

z 13 news videos, 3785 transitions

Cuts Gradual transitions

(19)

ERMITES 2007 37

TrecVid: High Level Feature Extraction

‹Goal: decide if a shot contains a concept or not

‹39 concepts:

Charts Animal

Road

Maps Prisoner

Mountain

Natural-Disaster Military

Vegetation

Explosion_Fire Police_Security

Desert

People-Marching Corporate-Leader

Building

Walking_Running Government-Leader

Outdoor

Boat_Ship Person

Studio

Truck Face

Meeting

Bus Crowd

Office

Car Waterscape_Waterfront

Court

Airplane Urban

Weather

Flag-US Snow

Entertainment

Computer_TV-screen Sky

Sports

TrecVid: High Level Feature Extraction

‹2006: 20 concepts evaluated

‹Overall results:

(20)

ERMITES 2007 39

TrecVid: High Level Feature Extraction

‹Example: CMU system (2005):

z Low level features extraction

„ Image

„ Audio

„ Motion

„ Detectors (face & text)

z Elementary Concept classifiers (168 concepts)

„ SVM (Support Vector Machines)

z Multi-Classifier fusion

ERMITES 2007 40

TrecVid : CMU Architecture

Uni-Modal Features

Multi-modal

Features Feature Tasks

Timing

Kinetic Motion Video OCR Face Feature

SFFT MFCC

Optical Motion

Canny Edge HSV/HVC/GRB

Color Gabor Texture

Transcript

Concept 1

O O O

1. Boat / Ship

O O O

Concept 2 Concept 3 Concept 4

Concept 168 SVM-based

Combination Multi-concepts

Combination

2. Madeleine Albright

3. Bill Clinton

10. Road Visual Info.

Audio Info.

Textual Info.

Structural Info.

Common

(21)

ERMITES 2007 41

TrecVid : CMU Low level features

‹ Image features

z Color histogram

z Texture

z Edge

‹ Audio features

z FFT

z MFCC

‹ Motion features

z Kinetic energy

z Optical flow

‹ Detector features

z Face detection

z Video-OCR detection

TrecVid : CMU Image features

‹5 by 5 grids for key-frame per shot

‹Color histogram

z 5 by 5, 125 bins color histogram

z HSV, HVC, and RGB color space

z 3125 dimensions (5*5*125)

z row-wise grids

‹Texture

z Six orientated Gabor filters

‹Edge

z Canny edge detector, 8 orientations

(22)

ERMITES 2007 43

TrecVid : CMU Audio & Motion

‹Every 20 msecs (512 windows at 44100 HZ sampling rate)

z FFT – Short Time Fourier Transform

z MFCC – Mel-Frequency cepstral coefficients

z SFFT – simplified FFT

‹Kinetic energy

z Capture the pixel difference between frames

‹Mpeg motion

z Mpeg motion vector extracted from p-frame

‹Optical flow

z Capture optical flow in each grid

ERMITES 2007 44

TrecVid : CMU Detector features

‹Face detector

z Detecting faces in the images

‹VOCR detector

z Detecting and recognizing VOCR

(23)

ERMITES 2007 45

TrecVid : CMU SVM Classifier

‹ Binary classifier

‹ Constructed from training data

‹ Linear separator with highest margin in space with Kernel distance

hyperplane

margin

TrecVid : CMU Multi-concepts Combination

‹ Bayesian Networks from 168 common annotation concepts

‹ Combine 4 most related concepts with target concept

Car, Road_Traffic, Truck, Vehicle_Noise Road

Gun_Shot, Building, Gun, Explosion Physical violence

Walking, Running, People, Person People Walking/running

Airplane, Sky, Smoke, Space_Vehicle_Launch Airplane Takeoff

Crowd, People, Running, Non-Studio_Setting Basket Scored

Sky, Water_Body, Nature_Non-Vegetation, Cloud Beach

Car_Crash, Man_Made_scene, Smoke, Road Train

Boat, Water_Body, Sky, Cloud Boat/Ship

(24)

ERMITES 2007 47

TrecVid Video Annotation

‹Training classifiers require a lot of training data

‹Data should be annotated by concepts

‹TrecVid annotation effort:

z Collaborative annotation in 2005

z Annotating with 39 concepts

‹LSCOM extension to 449 concepts

z Large Scale Concept Ontology for Multimedia

z Annotated on TRECVID 2005 data

‹2007: Collaborative Annotation using active learning strategy (organized by Grenoble)

ERMITES 2007 48

TrecVid: Search Task

‹Goal: find the shots satisfying a query

‹Queries are defined by text + sample keyframes + sample shots

‹2006 Topic examples:

z Topic 173: Find shots with one or more emergency vehicles in motion (e.g., ambulance, police car, fire truck, etc.)

z Topic 174: Find shots with a view of one or more tall buildings (more than 4 stories) and the top story visible

z Topic 175: Find shots with one or more people leaving or entering a vehicle

z

z Topic 195: Find shots of one or more soccer goalposts

z Topic 196: Find shots of scenes with snow

(25)

ERMITES 2007 49

TrecVid: Search Task

‹3 types of experiments:

z Automatic:

z Human-assisted:

z Interactive:

Topic Human query System Result

Topic Human query System Result

Topic System Result

TrecVid: Search Task Example

‹ IBM 2005 Automatic Search

‹ 1.Visual-based:

z light-weight learning (discriminative and nearest neighbor modeling)

‹ 2.Text-based:

z automatic query expansion

‹ 3.Model-based:

z automatic query-to-model mapping & weighting

‹ 4.Fusion:

z Query-independent

z Statistical normalization (visual)

z Rank normalization (text)

z Model-based re-ranking (text

& visual)

(26)

ERMITES 2007 51

TrecVid: Search Task Example

‹CMU 2004 Automatic Search

Multi-modal Query

Video Library

Weighted Linear Combination of Similarity Rankings

Final Ranked List of Video Shots Pope

John Paul II

Speech Trans. Video

OCR Color

Feature Semantic

Feature Detector Audio

Feature Texture

Feature

Multiple Modality Video Collection Analysis

ERMITES 2007 52

TrecVid: Search Task Example

‹Query-type dependent weights

Video Shots Query

λ

Σ

λ

Query Type

= n k

k

kP R q v

f

1

) ,

| λ (

λk

Text Ranking

Color

Ranking Pk(R|q,v)

Speech Trans. Video

OCR Color

Feature Face

Detector Audio

Feature Texture

Feature

Face Ranking

(27)

ERMITES 2007 53

TrecVid: Search Task Example

‹ Possible query types PEOS:

z Named person queries (P-queries)

„ “Find shots of Yasser Arafat“

“Find shots of Ronald Reagan speaking"

z Named entity queries (E-queries)

„ “Find shots of the Statue of Liberty“

“Find shots of the Mercedes logo"

z General object queries (O-queries)

„ “Find shots of snow-covered mountains“

“Find shots of one or more cats"

z Scene queries (S-queries)

„ “Find shots of roads with lots of vehicles“

“Find shots of people spending leisure time on the beach"

TrecVid: Search Task Example

‹Learning Query-type dependent weights

Training Search Training

Queries

Similarity rankings from Query

Labels

Video Library

Finding Cats Finding

Cats

Finding Clinton Finding Clinton

w1: (0.5, 0.3,....) w2: (0.4, 0.2,....)

Weights for Similarity Rankings

Object People Query Types

(28)

ERMITES 2007 55

TrecVid: Search Task Example

‹2005 MediaMill Interactive Search (Amsterdam)

‹Use 101 concept detectors

‹Cross Browser to explore multimedia space

time similarity

ERMITES 2007 56

TrecVid: BBC Rushes

‹2005: no task

‹2006: organize, no evaluation

‹2007: summarize, evaluation

z List of topics and events as ground truth

z 10 topics picked randomly

z Evaluator watches summary and counts topics present

(29)

ERMITES 2007 57

TrecVid: BBC Rushes

‹ Topics for video MRS044493:

z man from distance riding bicycle approach camera

z camera pans man with knapsack riding bicycle

z House

z zooming in

z bald man looking out of window

z bald man at window

z closeup at boy with fair hair looking out of window,only head visible

z zooming in boy with fair hair at window

z closeup at boy with fair hair at window looking at camera

z door

z bald man walking out of the house and a woman followed behind quarelling

z woman pull out bald's man shirt entering storeroom

z bald man and woman talking in store room

z bald man went out the storeroom

z bald man talking with man in dark overalls

z bald man enter store room

z bald man open cupboard and pull out things

z bald man shake woman's cheek and talking

z bald man went out store room and woman followed

z 3 people standing talking one facing camera

z

Eurecom Summarization

‹Hierarchical clustering of 1 sec video segments

z Level to be adjusted based on video content

(30)

ERMITES 2007 59

Eurecom Summarization

‹Detection of event redundancy:

z If two shots contain replays of the same event, they should contain (almost) the same sequences of clusters

Cluster 3 Cluster 3

Cluster 2 Cluster 1

Cluster 1

ERMITES 2007 60

Eurecom Summarization

‹Shot selection principle:

z Find a set of shots which best covers the set of clusters

‹Shot selection algorithm:

z Greedy selection:

„ Make all clusters initially active

„ Find shot which contains most active clusters

„ Select this shot and make its clusters inactive

„ Iterate

(31)

ERMITES 2007 61

Eurecom Summarization

‹Additional refinements:

z Cluster value:

„ Based on activity and presence of faces

z Dynamic acceleration:

„ Accelerate a shot to minimize static content

z Split-scren display

„ Maximize information displayed by time unit

„ Group shots by 4

Eurecom Summarization

‹Comparative results:

z Good on inclusions

z Low on usability

0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.7

usheffumadriduglasgowucalthu-icrcntuniilip6kddietaljoanneumhuthkpufxpaleurecomdcucurtincost292colucmucityubrnoattlabsCMUBASE2CMUBASE1

Mean

Participants Fraction of inclusions found in the summary

Eurecom mean

1.8 2 2.2 2.4 2.6 2.8 3 3.2 3.4 3.6

usheffumadriduglasgowucalthu-icrcntuniilip6kddietal

joanneumhuthkpufxpaleurecomdcucurtincost292colucmucityubrnoattlabsCMUBASE2CMUBASE1

Mean

Participants Was the summary east to understand (1-5)

Eurecom mean

(32)

MPEG-7

Multimedia Content Description Interface

ERMITES 2007 64

MPEG History

‹MPEG = Moving Picture Experts Group

z Started in 1988, Leonardo Chiariglione

‹MPEG-1: Interactive CD and MP3 1992

‹MPEG-2: DTV, STB, DVD 1994

‹MPEG-4: Web and Mobility 1998-1999

‹MPEG-7: Multimedia Content Description

Interface 2001

‹MPEG-21: Multimedia Framework ---

(33)

ERMITES 2007 65

MPEG-7 Objective

‹« Standardize content-based description for various types of audio-visual information, allowing quick and efficient content

identification, and addressing a large range of applications »

‹MPEG-7:

z Information about the content

z The bits about the bits

z Metadata

MPEG-7 Scope

Normalized content description

Search Filtering

Algorithms Applications

User interaction Media Combination

Media analysis

MPEG-7

(34)

ERMITES 2007 67

MPEG-7 Components

‹MPEG-7 Systems

‹MPEG-7 Description Definition Language

‹MPEG-7 Visual

‹MPEG-7 Audio

‹MPEG-7 Multimedia DSs

‹MPEG-7 Reference Software

‹MPEG-7 Conformance

ERMITES 2007 68

Low level Audio Visual descriptors

Color

Camera motion

Motion activity

Mosaic

Color

Motion trajectory

Parametric motion

Spatio-temporal shape

Color

Shape

Position

Texture

Video segments Still regions

Moving regions Audio segments

Spoken content

Spectral characterization

Music: timbre, melody

(35)

ERMITES 2007 69

Multimedia DS: Content Description

« DS » Still region

« DS » Multimedia segment

« DS » Audio visual region

« DS » Audio segment

« DS » Semantic

« DS » Event

« DS » Semantic place

« DS » Still region 3D

« DS » Video segment

« DS » Moving region

« DS » Ink segment

« DS » Segment

« DS » Semantic base Conceptual aspects Structural aspects

Content description

« DS » Semantic state

« DS » Semantic time

« DS » Object

« DS » Agent object

« DS » Audio visual segment

MPEG-7: Application Areas

‹ Storage and retrieval of audiovisual databases (image, film, radio archives)

‹ Broadcast media selection (radio, TV programs)

‹ Surveillance (traffic control, surface transportation, production chains)

‹ E-commerce and Tele-shopping (searching for clothes / patterns)

‹ Remote sensing (cartography, ecology, natural resources management)

‹ Entertainment (searching for a game, for a karaoke)

‹ Cultural services (museums, art galleries)

‹ Journalism (searching for events, persons)

‹ Personalized news service on Internet (push media filtering)

‹ Intelligent multimedia presentations

‹ Educational applications

‹ Bio-medical applications

Références

Documents relatifs

From this we conclude that the number of input words O = G 1 for which M will terminate successfully in k itérations, under the assumption that the test of Step 5 remains

Scene Segmentation and Image Feature Extraction for Video Indexing and Retrieval,. Patrick Bouthemy, Christophe Garcia, Rémi Ronfard, Georges Tziritas, Emmanuel Veneau,

Hierarchical fusion with multiple descriptor variants and multiple classifier variants was used and optimized for the semantic indexing task.. We made several ex- periment in order

Secondly, we present our retrieval system called Retimm, able to retrieve very quickly images from a large database from any request composed of images and/or text..

For example word spotting on the audio component can be used to detect if a given word has been pronounced, face recognition on the video component can be used to detect if a

\query image&#34;, also called the \example image&#34;, and the algorithm searches the database of \test images&#34; for images that are most similar to the query image..

Since no particular feature is relevant to all concepts and each concept is a priori best described by different features, we build concept detectors by combining different

Considering only the results in table 7 (2013, 2014 and 2015 test sets), with and without I-frames, all of en- gineered features, temporal re-scoring and conceptual feed back do