EURECOM and ECNU at TRECVID 2010:

The Semantic Indexing Task

Miriam Redi, Bernard Merialdo
Multimedia Department, EURECOM
Sophia Antipolis, France
{redi,merialdo}@eurecom.fr

Feng Wang
East China Normal University
Shanghai, China
fwang@cs.ecnu.edu.cn

October 22, 2010

1 Abstract

This year EURECOM and ECNU participated jointly in the TRECVID Semantic Indexing task. We built four different systems for the light (10 concepts) submission. Three of our runs are functionally similar to the system used by EURECOM for last year's High Level Feature Extraction task (see [6] for further details).

We keep as a basic run (Fusebase) the best-performing system from 2009, testing how such a system performs on the new dataset; we then improve the EURECOM Fusebase by adding a global descriptor, originally built for scene recognition, which proved to be effective in the TRECVID context for spatially-independent concepts like "Nighttime". We then experiment with a multi-modal analysis, combining the visual features with the textual metadata that have been provided with the 2010 video database. As a last run, we try a new system based on Hamming Embedding and Weighted Visual Words. The runs are composed as follows:

1. EURECOM Fusebase: This run fuses a pool of visual features, namely the SIFT descriptor, the Color Moments global descriptor, the Wavelet Feature and the Edge Histogram. On top of this, a face detector and a re-ranking method based on video knowledge are applied, following the 4th run that EURECOM presented in the 2009 edition.

2. EURECOM Gist: This run extends the set of visual features with a global descriptor presented by Oliva and Torralba in [5], the gist descriptor.

3. EURECOM Metadata: This run adds to the previous run the information mined from the textual metadata files associated with each video.

4. EURECOM Weight HE: This run combines Hamming Embedding and Weighted Bag-of-Visual-Words.


Besides this participation, EURECOM took part in the joint IRIM submission; system details are included in the IRIM notebook paper.

The remainder of this paper briefly describes the content of each run (Sections 2-5), including feature extraction, fusion and re-ranking methods. Results are discussed in Section 6.

2 EURECOM Basic Run: EURECOM Fusebase

This run is composed of three main modules: first, a model is built by combining a pool of visual features; a face detector is then merged into the system; finally, a re-ranking method based on context knowledge is applied.

1. Visual Feature Extraction and Fusion: in this stage, four different types of features are computed. For each feature, a Support Vector Machine is trained to predict the presence of a concept c in a keyframe s of the test set. The choice of the descriptors is based on their effectiveness on the training set. For each keyframe the following descriptors are extracted:

• Sift with Bag of Words: Two sets of interest points are identified using different salient point detectors:

– Difference of Gaussian
– Hessian-Laplace detector

Each of these keypoints is then described with the SIFT [4] descriptor, using the VIREO system [1]. A K-means algorithm clusters a subset of the training set descriptors into a vocabulary of n visual words. Then, for each SIFT point in a keyframe, the nearest neighbor in the vocabulary is computed; based on these statistics, an n-dimensional feature vector is built, counting the number of points in the image assigned to each visual word. Experiments on the training set suggested choosing n = 500.

• Color Moments: This global descriptor computes, for each color channel in the LAB space, the first, second and third moment statistics.

• Wavelet Feature: This texture-based descriptor calculates the variance in the Haar wavelet sub-bands for each window resulting from a 3×3 division of a given keyframe.

• Edge Histogram: The MPEG-7 edge histogram describes the edges' spatial distribution over 16 sub-regions of the image.

The extracted features are then used as input to a Support Vector Machine (SVM) (see implementation details in [2]) that creates, for each concept c, one model per feature from the training data. Each model is then used to detect the presence of c in a new sample s based on the corresponding feature. The classifier parameters are selected via exhaustive grid search: their values are chosen by maximizing the Mean Average Precision on the development set. We now have, for each concept c and keyframe s, 5 feature-specific outputs that we call p_n(c, s), n = 1, ..., 5. We fuse these scores with weighted linear fusion in order to obtain a single output, namely p_v(c, s), that represents the probability of the concept c in the keyframe s given the set of visual features (a minimal sketch of the full scoring pipeline is given after this list).

2. Face Detector: p_v(c, s) (see Figure 1) is then interpolated with the output, denoted p_f(c, s), of a face-specific detector run on s. The interpolation is computed as follows:

p(s|c) = p_v(c, s) (1 + α · p_f(c, s)),

where the parameter α is estimated by maximizing the Mean Average Precision on the development set.

3. Re-Ranking based on video and context knowledge: as we did last year, p(s|c) is updated with a quantity P_v, namely the average concept score over all shots of the video, according to the equation p(c|s) = p(s|c) + P_v. Furthermore, the concept scores of the neighboring shots of s are updated based on p(c|s) (further details can be found in [6]).

We performed steps 2 and 3 only for the concepts for which adding such modules introduced a significant improvement in the final MAP (in the development stage).
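To make the scoring pipeline of this run concrete, the following is a minimal sketch in Python (NumPy and scikit-learn are assumptions made for illustration; the actual system used LIBSVM [2] and the VIREO toolkit [1], and the helper names below are ours, not part of the original implementation):

import numpy as np
from sklearn.cluster import KMeans

# --- Bag-of-visual-words histogram (n = 500 visual words, as in the paper) ---
def build_vocabulary(train_sift, n_words=500, seed=0):
    """Cluster a subset of training SIFT descriptors into visual words."""
    return KMeans(n_clusters=n_words, random_state=seed).fit(train_sift)

def bow_histogram(keyframe_sift, vocabulary):
    """Count how many keypoints of a keyframe fall into each visual word."""
    words = vocabulary.predict(keyframe_sift)            # nearest visual word per keypoint
    hist = np.bincount(words, minlength=vocabulary.n_clusters)
    return hist / max(hist.sum(), 1)                     # normalized n-dimensional vector

# --- Step 1: weighted linear fusion of the per-feature SVM scores ---
def fuse_visual_scores(p_features, weights):
    """p_v(c, s) = sum_i w_i * p_i(c, s) over the feature-specific SVM outputs."""
    return float(np.dot(weights, p_features))

# --- Step 2: interpolation with the face detector output ---
def add_face_score(p_v, p_face, alpha):
    """p(s|c) = p_v(c, s) * (1 + alpha * p_f(c, s))."""
    return p_v * (1.0 + alpha * p_face)

# --- Step 3: re-ranking with the average concept score over the video ---
def rerank_with_video_average(shot_scores):
    """p(c|s) = p(s|c) + P_v, with P_v the mean score over all shots of the video."""
    p_video = np.mean(shot_scores)
    return [score + p_video for score in shot_scores]

In the real runs the per-feature scores p_i(c, s) come from the SVMs described above, and the fusion weights w_i, as well as α, are tuned by maximizing the MAP on the development set.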

[Figure: system overview. Visual features (SIFT with DoG and Hessian-Laplace detectors, Color Moments, Wavelet Feature, Edge Histogram, GIST) are combined by weighted linear fusion into p_v(c, s) = Σ_i w_i · p_i(c, s); the face detector output gives p(s|c) = p_v(c, s)(1 + α · p_f(c, s)); video-level re-ranking gives p(c|s) = p(s|c) + P_v; the metadata keyword model and term frequencies yield a likelihood l(c, v) = Σ_k p_{t_k}(c) · n(w_k, v) used in the final score P(c|s) = p(c|s)(1 + β · l(c, v)). The Weighted_HE run fuses Weighted Visual Words, Local Binary Pattern and a Hamming Embedding kernel with weighted linear fusion.]

Figure 1: Framework of our system for the semantic indexing task


3 EURECOM Second Run: EURECOM Gist

In this run, the Visual Feature Extraction and Fusion module of the basic run is improved by adding a new global descriptor to the feature pool, namely the gist descriptor proposed by Oliva and Torralba in [5]. Originally designed as a feature for natural scene recognition, gist was chosen for its good performance in global concept detection (e.g. Cityscape, Nighttime). This feature exploits the properties of the Fourier spectrum to represent the spatial envelope of the image/keyframe, i.e. the general structure of the scene. As for the other visual features, the descriptor extracted from each keyframe is used as input to an SVM, and the corresponding output is fused with the other visual feature-based concept scores via linear fusion.
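For illustration only, a rough sketch of a gist-style descriptor is given below (NumPy only; this approximates the idea of summarizing local Fourier energy per orientation and scale over a spatial grid, and is not the actual descriptor of [5], which relies on a Gabor filter bank). The resulting vector is fed to an SVM exactly like the other visual features.

import numpy as np

def gist_like_descriptor(gray, grid=4, n_orient=4, n_scales=2):
    """Rough gist-style feature: local Fourier energy per orientation/scale band,
    averaged over a grid x grid spatial layout (a sketch, not the original descriptor)."""
    h, w = gray.shape
    gh, gw = h // grid, w // grid
    feat = []
    for gy in range(grid):
        for gx in range(grid):
            cell = gray[gy * gh:(gy + 1) * gh, gx * gw:(gx + 1) * gw]
            spec = np.abs(np.fft.fftshift(np.fft.fft2(cell)))
            fy, fx = np.indices(spec.shape)
            fy = fy - spec.shape[0] / 2.0                 # centered frequency coordinates
            fx = fx - spec.shape[1] / 2.0
            radius = np.hypot(fx, fy)
            angle = np.mod(np.arctan2(fy, fx), np.pi)
            r_edges = np.linspace(0.0, radius.max() + 1e-6, n_scales + 1)
            a_edges = np.linspace(0.0, np.pi, n_orient + 1)
            for s in range(n_scales):
                for o in range(n_orient):
                    band = ((radius >= r_edges[s]) & (radius < r_edges[s + 1]) &
                            (angle >= a_edges[o]) & (angle < a_edges[o + 1]))
                    feat.append(spec[band].sum())         # energy in this orientation/scale band
    feat = np.asarray(feat)
    return feat / (np.linalg.norm(feat) + 1e-12)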

4 EURECOM Third Run: EURECOM Metadata

Here a textual feature module is added to the previous visual-only run.

This year, a set of XML-formatted metadata is available for each video. We use Term Frequency statistics to create a model for these textual descriptions: on the training set, for each concept c we compute the quantities p_{t_k}(c), i.e. the probability for word t_k to appear in correspondence with concept c. We compute these statistics on a reduced set of fields of the XML metadata file, chosen based on their effectiveness on the global system performance, namely "title", "description", "subject" and "keywords".

Given this model, for a new test video v we compute the term counts n(w, v), where w is a term that appears in the metadata file corresponding to v. We then compute the likelihood l(c, v) between the textual feature of the test video and each concept-based text model. These values are then used to update the output of the EURECOM Gist run, obtaining, for each shot s ∈ v,

P(c|s) = p(c|s)(1 + β · l(c, v)),

where the value of β is estimated on the development data.
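A minimal sketch of this textual module is given below (Python; the likelihood is assumed to be a sum of term counts weighted by the concept keyword probabilities, l(c, v) = Σ_k p_{t_k}(c) · n(t_k, v), which follows the structure shown in Figure 1 — the exact smoothing and normalization of the original system are not reproduced here, and the helper names are illustrative):

from collections import Counter

FIELDS = ("title", "description", "subject", "keywords")

def train_keyword_model(training_videos, concept):
    """p_t(c): relative frequency of each term in the metadata of videos annotated with c."""
    counts = Counter()
    for video in training_videos:
        if concept in video["concepts"]:
            for field in FIELDS:
                counts.update(video["metadata"].get(field, "").lower().split())
    total = sum(counts.values()) or 1
    return {term: n / total for term, n in counts.items()}

def metadata_likelihood(test_metadata, keyword_model):
    """l(c, v) = sum over metadata terms of n(term, v) * p_term(c)."""
    terms = Counter()
    for field in FIELDS:
        terms.update(test_metadata.get(field, "").lower().split())
    return sum(n * keyword_model.get(term, 0.0) for term, n in terms.items())

def final_score(p_c_given_s, likelihood, beta):
    """P(c|s) = p(c|s) * (1 + beta * l(c, v)); beta estimated on the development data."""
    return p_c_given_s * (1.0 + beta * likelihood)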

5 EURECOM Fourth Run: EURECOM Weight HE

This run fuses the Color Moments and Wavelet features computed for runs 1-3 with SIFT features.

The key modules here rely on the techniques used to process the raw SIFT features: weighted visual words and Hamming embedding.

5.1 Weighting Informativeness of BoW

In current BoW-based approaches, each visual word is treated equally for every concept. As a consequence, the discriminative ability of the truly informative visual words can be seriously reduced. In this run, we employ an approach that measures the importance of each visual word for a given concept. This is achieved by iteratively updating the weights to push the SVM kernel towards the optimal one. For this purpose, the kernel alignment score (KAS) is used to measure the discriminative ability of the SVM kernel. The problem is then to maximize the KAS by optimizing the weights of the different visual words. Finally, the resulting weights are used in a modified kernel for concept detection. The details can be found in [7].
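As an illustration of the measure involved, the kernel alignment score between a kernel matrix K and the ideal label kernel y y^T can be computed as below (a generic sketch, paired with an illustrative weighted chi-square kernel; this is not the weight-optimization procedure of [7], which iteratively updates the visual-word weights to maximize this score):

import numpy as np

def kernel_alignment_score(K, y):
    """Alignment between kernel matrix K and the ideal kernel y y^T (labels in {-1, +1})."""
    y = np.asarray(y, dtype=float)
    Y = np.outer(y, y)                                    # ideal target kernel
    num = np.sum(K * Y)                                   # Frobenius inner product <K, Y>
    den = np.sqrt(np.sum(K * K) * np.sum(Y * Y))
    return num / den

def weighted_chi2_kernel(H1, H2, w, gamma=1.0):
    """Chi-square kernel on BoW histograms with per-visual-word weights w (illustrative)."""
    K = np.zeros((len(H1), len(H2)))
    for i, h1 in enumerate(H1):
        for j, h2 in enumerate(H2):
            d = np.sum(w * (h1 - h2) ** 2 / (h1 + h2 + 1e-12))
            K[i, j] = np.exp(-gamma * d)
    return K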

5.2 Hamming Embedding

During the construction of the visual vocabulary, the SIFT descriptor space is quantized into Voronoi cells corresponding to the different visual words [3]. This kind of quantization treats all points falling into one cell as identical, which results in an inaccurate distance measure between image samples.

In this run, we propose a Hamming Embedding kernel to alleviate the information loss problem of BoW features in video concept detection. Hamming Embedding (HE) was originally employed in [3] for similar image search and copy detection. For each keypoint quantized by the visual vocabulary, a binary signature is further generated that encodes its location inside the cell. The distance between two keypoints in the same cell can then be roughly estimated by the Hamming distance between their binary signatures. We revise HE as a kernel for discriminative classification.
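The signature generation and matching steps can be sketched as follows (Python, following the general scheme of [3]: a random orthogonal projection plus per-visual-word median thresholds; how the resulting Hamming distances are turned into an SVM kernel in our run is described in the text and not reproduced here, and the helper names are illustrative):

import numpy as np

def fit_hamming_embedding(train_desc, train_words, dim=128, n_bits=64, seed=0):
    """Learn a projection and per-visual-word median thresholds from training descriptors."""
    rng = np.random.default_rng(seed)
    P = np.linalg.qr(rng.standard_normal((dim, dim)))[0][:n_bits]   # n_bits x dim projection
    thresholds = {}
    for w in np.unique(train_words):
        proj = train_desc[train_words == w] @ P.T
        thresholds[w] = np.median(proj, axis=0)                     # per-bit median in this cell
    return P, thresholds

def he_signature(desc, word, P, thresholds):
    """Binary signature encoding the descriptor's position inside its Voronoi cell."""
    return (desc @ P.T > thresholds[word]).astype(np.uint8)

def hamming_distance(sig1, sig2):
    """Number of differing bits between two signatures of keypoints in the same cell."""
    return int(np.count_nonzero(sig1 != sig2))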

6 Results Analysis

[Figure: Mean Average Precision (0% to 11%) of all runs submitted to the light SIN task, sorted by score, with our four runs (Eurecom_Fusebase, Eurecom_Gist, Eurecom_Metadata, Eurecom_Weight_HE) highlighted.]

Figure 2: Evaluation results for our submitted runs

In Figure 2, the performances (MAP) of the various systems submitted for the light SIN task are presented.

The first three runs were based on classical visual features and textual information, as described in Sections 2-4. Figure 3 shows the concept-specific final performances; for runs 1, 2 and 3 the expected MAP (i.e. the value estimated during the training phase) is also presented. Compared to the estimated Mean Average Precision, the performance on the test set decreases by about 60% for the basic run and 70% for Eurecom Metadata. These systems had indeed been tuned on a specific subset of data, causing overtraining and a performance drop. n-fold cross-validation and negative example separation could have avoided this failure.

[Figure: per-concept average precision (0 to 0.5) comparing training estimates (Base_Training, Gist_Training, Metadata_Training) with test results (Base_Test, Gist_Test, Metadata_Test, WeightedHE_Test), together with the task median and maximum.]

Figure 3: Concept-specific performance of runs 1-3 on the development and test data

We can see in Fig. 3 that, by adding the Gist descriptor to the pool of visual features, the system performs slightly worse overall; this is mainly due to the fact that, at test time, the average precision for the concept Nighttime decreases by about 50%, while the predicted difference between Eurecom Fusebase and Eurecom Gist on the development (overtrained) data was around +600%. However, for other concepts, e.g. Demonstration Or Protest (+15%) or Airplane Flying (+10%), the gist descriptor contributes positively to the final MAP.

The use of textual data on top of the Eurecom Gist run improves the MAP by 11.7%. Only 3 out of 10 concept scores (i.e. Airplane Flying (+17%), Demonstration Or Protest (+62%) and Singing (+16%), exactly the concepts for which the textual features were extracted) contribute to the increase of the global performance.

The Eurecom Weighted HE run is shown to be quite effective for local concepts (e.g. Boat Ship and Hand). The reason is that this run relies on a different set of global features and, moreover, uses improved local features, whose discriminative power is increased by the described techniques. Our development estimates showed that weighting and HE brought about 11% and 15% improvement, respectively.

7 Conclusions

This year EURECOM and ECNU presented together a set of systems for the Semantic Indexing task. Three out of four runs aimed at improving a system similar to the EURECOM 2009 submission. We showed that, by adding a global descriptor, namely the GIST descriptor, the final MAP increases for some concepts. We also found that adding textual features extracted from the metadata improves the visual-only system. Hamming embedding and weighted visual words, added to a set of global features, also proved to be very effective for local concept detection.

References

[1] VIREO group, http://vireo.cs.cityu.edu.hk/links.html.

[2] C. Chang and C. Lin. LIBSVM: a library for support vector machines. 2001.

[3] H. Jégou, M. Douze, and C. Schmid. Hamming embedding and weak geometric consistency for large scale image search. In European Conf. on Computer Vision, 2008.

[4] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.

[5] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001.

[6] F. Wang and B. Merialdo. EURECOM at TRECVID 2009: High-Level Feature Extraction, 2009.

[7] F. Wang and B. Merialdo. Weighting informativeness of bag-of-visual-words by kernel optimization for video concept detection. In ACM Workshop on Very-Large-Scale Multimedia Corpus, Mining and Retrieval, 2010.
