
Indexing Models

In the document The DART-Europe E-theses Portal (Page 33-37)

Over the past two decades, several indexing models have been proposed in the literature. The objective of image indexing is to store images effectively in a database and to retrieve the images most similar to a given query image.

Images can be indexed directly from the extracted visual features (such as color, texture and shape) using a vector representation. Recently, the bag-of-visual-features (or bag-of-words) model, inspired by textual indexing, has drawn increasing attention for its simplicity and effectiveness in storing visual content. This section is dedicated to the presentation of some of these indexing methods.

2.4.1 Vector space model

This is the simplest model in CBIR systems. Images are represented by their feature vectors. These vectors have the same dimension and are normalized to the same scale (usually between 0 and 1). The tf.idf (term frequency, inverse document frequency) normalization is often used in information retrieval and text mining, and has also been widely adopted in CBIR systems. This weighting scheme is a statistical measure that evaluates how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus.
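As an illustration, a minimal tf.idf weighting over a document-by-term count matrix might look like the sketch below; the function name and toy corpus are hypothetical, and real systems often add smoothing to the idf term:

```python
import math

def tfidf(count_matrix):
    """Compute tf.idf weights for a document-by-term count matrix.

    count_matrix: one row per document, each row a list of raw term counts.
    Returns a matrix of the same shape with tf.idf weights.
    """
    n_docs = len(count_matrix)
    n_terms = len(count_matrix[0])
    # df[t]: number of documents in which term t appears at least once
    df = [sum(1 for doc in count_matrix if doc[t] > 0) for t in range(n_terms)]
    weighted = []
    for doc in count_matrix:
        total = sum(doc)  # document length, for term-frequency normalization
        row = []
        for t in range(n_terms):
            tf = doc[t] / total if total else 0.0
            idf = math.log(n_docs / df[t]) if df[t] else 0.0
            row.append(tf * idf)
        weighted.append(row)
    return weighted
```

Note how a term occurring in every document receives an idf of log(1) = 0, so it contributes nothing to the similarity, which is exactly the "offset by the frequency of the word in the corpus" behaviour described above.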

Given two feature vectors Vq and Vd extracted from a query image q and a document image d, the visual similarity is computed using one of two measurement functions: the Euclidean distance or the cosine similarity.

Euclidean distance

The Euclidean distance is probably the most common approach for directly comparing two images. Given Vq and Vd, two vectors in Euclidean n-space, the distance between the two images q and d is given by:

d(Vq, Vd) = ||Vq − Vd|| = √( ||Vq||^2 + ||Vd||^2 − 2 Vq · Vd )

A smaller distance indicates that the two images are closer. This value reflects the visual similarity of the two images.
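Both the direct form and the expanded dot-product form of the distance can be sketched in plain Python (the function names are illustrative):

```python
import math

def euclidean_distance(vq, vd):
    """Direct form: square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vq, vd)))

def euclidean_expanded(vq, vd):
    """Expanded form: ||Vq||^2 + ||Vd||^2 - 2 Vq.Vd, then the square root."""
    sq = (sum(a * a for a in vq) + sum(b * b for b in vd)
          - 2 * sum(a * b for a, b in zip(vq, vd)))
    return math.sqrt(max(sq, 0.0))  # clamp tiny negative rounding errors
```

The expanded form is convenient in retrieval because the squared norms of the database vectors can be precomputed once, leaving only one dot product per comparison at query time.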

Cosine similarity

In contrast to the distance measure, two vectors Vq and Vd can be considered similar if the angle between them is small. To compute the cosine similarity, the normalized scalar product is used to measure the angle between the two vectors:



cos(θ) = (Vq · Vd) / (||Vq|| ||Vd||)

In information retrieval, the cosine similarity of two documents ranges from 0 to 1. A similarity of 1 implies that the documents are identical, and a similarity of 0 implies that they are unrelated.
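A matching plain-Python sketch of the normalized scalar product (the helper name and the zero-vector convention are illustrative):

```python
import math

def cosine_similarity(vq, vd):
    """Normalized scalar product of two feature vectors."""
    dot = sum(a * b for a, b in zip(vq, vd))
    nq = math.sqrt(sum(a * a for a in vq))
    nd = math.sqrt(sum(b * b for b in vd))
    if nq == 0.0 or nd == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (nq * nd)
```

Because the product is normalized, scaling a vector does not change the score: a vector and any positive multiple of it have similarity 1, which makes the measure robust to overall feature magnitude.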

2.4.2 Bag-of-words model

A simple approach to indexing images is to treat them as a collection of regions, describing only the statistical distribution of typical regions and ignoring their spatial structure. Similar models have been successfully used in the text community for analyzing documents, and are known as “bag-of-words” (BoW) models, since each document is represented by a distribution over a fixed vocabulary.

Figure 2.8: Image is represented by a collection of visual words [Fei-Fei & Perona 2005].

The construction of this model is based on four main steps:

1. Image segmentation consists of dividing the image into smaller parts. As introduced in section 2.2, we can consider different types of image segmentation, such as pixels, regions or interest points.

2. The feature extraction step consists of representing each image region by a set of visual features, as detailed in section 2.3. Each feature is quantized and normalized into a vector of fixed size.

3. The visual vocabulary construction step converts the feature vectors representing image regions into “visual words” or “visual concepts” (by analogy with words in text documents), which also produces a “visual dictionary” (by analogy with a word dictionary). A visual word can be considered a representative of several similar image regions. One simple method is to perform k-means clustering over all the vectors. Visual words are then defined as the centers of the clusters, and the number of clusters k is the vocabulary size.

4. Each image region is mapped to a visual word through the clustering process, and the image can be represented by a quantized vector over the visual vocabulary.
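Assuming the visual vocabulary (the cluster centers from step 3) is already available, the quantization of steps 3 and 4 can be sketched as follows (function names are illustrative):

```python
import math

def nearest_word(feature, vocabulary):
    """Index of the visual word (cluster center) closest to a region feature."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(vocabulary)), key=lambda i: dist(feature, vocabulary[i]))

def bow_histogram(region_features, vocabulary):
    """Represent an image as a histogram of visual word occurrences (step 4)."""
    hist = [0] * len(vocabulary)
    for f in region_features:
        hist[nearest_word(f, vocabulary)] += 1
    return hist
```

The resulting histogram plays the same role as a term-count vector for a text document, so the vector-space measures of section 2.4.1 apply to it unchanged.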

In step 3, k-means clustering is performed on a set of visual features to construct the visual words. We present below a brief description of this algorithm.

K-means clustering is a popular technique for automatic data partitioning in machine learning. The goal is to find k centroid vectors µ1, ..., µk representing each cluster. The basic idea of this iterative algorithm is to assign each feature vector x to a cluster such that the sum of squared errors Err is minimized:

Err = ∑_{j=1}^{k} ∑_{n=1}^{Nj} ||xn − µj||^2

where Nj is the number of patterns in the jth cluster. In general, the k-means clustering algorithm works as follows:

1. Select an initial mean vector for each of the k clusters.

2. Partition the data into k clusters by assigning each pattern xn to its closest cluster centroid µi.

3. Compute the new cluster means µ1, ..., µk as the centroids of the k clusters.

4. Repeat steps 2 and 3 until the convergence criterion is reached.

In the first step, the initial mean vectors can be chosen as k random seed points from the data. The partitioning is then performed from these initial points. In the second step, different metric distances (e.g., Hamming distance, Euclidean distance, etc.) can be applied to measure the distance between two patterns.

Usually, the Euclidean distance is good enough to measure the distance between two vectors in the same feature space. In step 3, the centroid µi of each cluster is re-estimated by computing the mean of the cluster members. The number of iterations can be used as the convergence criterion in the last step. The k-means algorithm has a time complexity of O(nk) per iteration. The only parameter that needs to be fixed is the number of clusters k.
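The four steps above can be sketched in plain Python as follows; the seeding and stopping choices (a random sample of data points as seeds, a fixed iteration count) are just one of the conventions mentioned in the text:

```python
import random

def kmeans(data, k, n_iter=20, seed=0):
    """Plain k-means: returns the k centroid vectors (steps 1-4 above)."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial mean vectors.
    centroids = [list(p) for p in rng.sample(data, k)]
    for _ in range(n_iter):  # step 4: iterate until the criterion is reached
        clusters = [[] for _ in range(k)]
        for x in data:
            # Step 2: assign each pattern to its closest centroid
            # (squared Euclidean distance, which preserves the ordering).
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(x, centroids[j])))
            clusters[i].append(x)
        for j, members in enumerate(clusters):
            # Step 3: re-estimate each centroid as the mean of its members;
            # an empty cluster keeps its previous centroid.
            if members:
                dim = len(members[0])
                centroids[j] = [sum(m[d] for m in members) / len(members)
                                for d in range(dim)]
    return centroids
```

In a BoW pipeline, the returned centroids are exactly the visual words of step 3, and new region features are quantized against them.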

As demonstrated in [Fei-Fei & Perona 2005], this model is simple yet effective for image indexing. However, the lack of spatial relations and location information between visual words is the main drawback of this model. Using this representation, methods based on latent semantics extraction, such as latent semantic analysis [Monay & Gatica-Perez 2003, Pham et al. 2007], probabilistic latent semantic analysis [Monay & Gatica-Perez 2004] and latent Dirichlet allocation [Blei et al. 2003], are able to extract coherent topics within document collections in an unsupervised manner. Other approaches are based on discriminative methods with annotated or weakly annotated examples, such as support vector machines [Vapnik 1995] and nearest neighbors [Shakhnarovich et al. 2005].

In the next chapter, we will review some of these learning methods.

2.4.3 Latent Semantic Indexing

Latent Semantic Analysis (LSA) was first introduced as a text retrieval technique [Deerwester et al. 1990], motivated by problems in the textual domain.

A fundamental problem was that users wanted to retrieve documents on the basis of their conceptual meaning, while individual terms provide little reliable evidence about the conceptual meaning of a document. This issue has two aspects: synonymy and polysemy. Synonymy describes the fact that different terms can be used to refer to the same concept. Polysemy describes the fact that the same term can refer to different concepts depending on the context in which it appears. LSA is said to overcome these deficiencies because of the way it associates meaning to words and groups of words according to the mutual constraints embedded in the contexts in which they appear. In addition, this technique is similar to principal component analysis [Gorban et al. 2007], a popular technique for dimension reduction in data mining. It analyzes the document-by-term matrix by mapping the original matrix into a lower-dimensional space; hence, the computational cost is also reduced.

Considering each image as a document, a co-occurrence document-by-term matrix M is built as the concatenation of the vectors extracted from all documents with the BoW model. Following the analogy between textual documents and image documents, given a co-occurrence document-by-term matrix M of rank r, M is decomposed into three matrices using Singular Value Decomposition (SVD) as follows:

M = U Σ V^t

where

U is the matrix of eigenvectors derived from M M^t;
V^t is the matrix of eigenvectors derived from M^t M;
Σ is an r×r diagonal matrix of singular values σ, the positive square roots of the eigenvalues of M M^t (or M^t M).

This transformation divides the matrix M into two parts: one related to the documents and the other related to the terms. By selecting only the k largest values of the matrix Σ and keeping the corresponding columns in U and V, the reduced matrix Mk is given by:

Mk = Uk Σk Vk^t

where k < r is the dimensionality of the concept space. Indeed, the choice of the parameter k is not obvious and depends on each data collection. It should be large enough to fit the characteristics of the data; on the other hand, it must be small enough to filter out non-relevant representation details. To rank a given document, the query vector q is then projected into the latent space to obtain a pseudo-vector qk = q · Uk of reduced dimension.
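The decomposition, truncation and query projection can be sketched with numpy on a toy matrix (the matrix, the value of k and the query are illustrative). Here M is stored document-by-term, so a new query given as a term-count vector is projected with the term-side eigenvectors Vk; under the transposed (term-by-document) convention this is the qk = q · Uk projection described above.

```python
import numpy as np

# Toy document-by-term co-occurrence matrix M
# (3 documents, 4 visual words; entries are counts).
M = np.array([[2.0, 1.0, 0.0, 0.0],
              [1.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 2.0]])

# Full decomposition: M = U Sigma V^t.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Keep only the k largest singular values: the latent concept space.
k = 2
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T

# Rank-k approximation of M; the discarded singular values measure
# exactly the representation detail that has been filtered out.
Mk = Uk @ np.diag(sk) @ Vk.T

# Project a query (a term-count vector) into the k-dimensional latent space.
q = np.array([1.0, 1.0, 0.0, 0.0])
qk = q @ Vk  # pseudo-vector of the query in concept space
```

By the Eckart-Young theorem, the Frobenius norm of M − Mk equals the norm of the discarded singular values, which makes the truncation error easy to monitor when choosing k.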

Recently, LSA has been applied to scene modeling [Quelhas et al. 2007], image annotation [Monay & Gatica-Perez 2003], improving multimedia document retrieval [Pham et al. 2007, Monay & Gatica-Perez 2007] and the indexing of video shots [Souvannavong et al. 2004]. In [Monay & Gatica-Perez 2003], Monay and Gatica-Perez demonstrated that LSA outperformed pLSA by more than 10% on annotation and retrieval tasks on the COREL collection.

Unfortunately, LSA lacks a clear probabilistic interpretation compared to generative models such as probabilistic latent semantic analysis.
