
7.6 Image Categorization

7.6.1 Appearance-Based “Bag-of-Features” Approach

The task of image categorization is to assign a query image to a certain scene type, e.g., “building,” “street,” “mountains,” or “forest.” The main difference compared to recognition tasks for distinct objects is a much wider range of intra-class variation.

Two instances of type “building,” for example, can look very different in spite of having certain common features. Therefore, a more or less rigid model of the object geometry is no longer applicable.

7.6.1.1 Main Idea

A similar problem is faced in document analysis when attempting to automatically assign a piece of text to a certain topic, e.g., “mathematics,” “news,” or “sports”.

This problem is solved there by the definition of a so-called codebook (cf. [11] for example). A codebook consists of lists of words or phrases which are typical for a certain topic. It is built in a training phase. As a result, each topic is characterized by a “bag of words” (a set of codebook entries), regardless of the position at which the entries actually appear in the text. During classification of an unknown text, the codebook entries can be used for gathering evidence that the text belongs to a specific topic.

Fig. 7.13 Exemplifying the process of scene categorization (panels for scene type 1 and scene type 2)

This solution can be applied to the image categorization task as well: here, the “visual codebook” consists of characteristic region descriptors (which correspond to the “words”), and the “bag of words” is often called a “bag of features” in the literature. In principle, each of the previously described region descriptors can be used for this task, e.g., the SIFT descriptor.

The visual codebook is built in a training phase where descriptors are extracted from sample images of different scene types and clustered in feature space. The cluster centers can be interpreted as the visual words. Each scene type can then be characterized by an orderless feature distribution, e.g., by assigning each descriptor to its nearest cluster center and building a histogram based on the counts for each center (cf. Fig. 7.13). The geometric relations between the features are no longer evaluated.

In the recognition phase, the feature distribution of a query image is derived from the codebook data (e.g., through assignment of each descriptor to the most similar codebook entry), and classification is done by comparing it to the distributions of the scene types learnt in the training phase, e.g., by calculating some kind of similarity between the histograms of the query image and the known scene types in the model database.
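To make both phases concrete, the following is a minimal Python sketch of the pipeline, assuming the region descriptors have already been extracted as NumPy arrays and that scikit-learn is available; all function names are illustrative and not taken from the book. It uses a plain L1 histogram distance for the comparison; the similarity measures discussed below (χ² test, EMD) can be substituted.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(training_descriptors, k=5):
    """Training phase: cluster all descriptors; centers act as visual words."""
    X = np.vstack(training_descriptors)  # descriptors of all training images
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    return kmeans.cluster_centers_       # shape (k, descriptor dimension)

def bag_of_features(descriptors, codebook):
    """Orderless histogram: count how often each visual word occurs."""
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    words = dists.argmin(axis=1)         # nearest codebook entry per descriptor
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()             # normalize to a distribution

def classify(query_descriptors, codebook, type_histograms):
    """Recognition phase: pick the scene type with the most similar histogram."""
    h_q = bag_of_features(query_descriptors, codebook)
    return min(type_histograms,          # dict: scene type -> histogram
               key=lambda t: np.abs(h_q - type_histograms[t]).sum())
```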

7.6.1.2 Example

Figure 7.13 shows a schematic toy example for scene categorization. On the left side, an image with 100 randomly sampled patches (blue circles) is shown. A schematic distribution of the descriptors in feature space (only two dimensions are shown for illustrative purposes; for SIFT we would need 128 dimensions) is depicted in the middle. The descriptors are divided into five clusters. Hence, for each scene type a 5-bin histogram specifying the occurrence of each descriptor class can be calculated. Example histograms for two scene types are shown on the right.

7.6.1.3 Modifications

Many proposed algorithms follow this outline. Basically, there remain four degrees of freedom in algorithm design. A choice has to be made for each of the following points:

• Identification method of the image patches: the “sampling strategy”

• Method for descriptor calculation

• Characterization of the resulting distribution of the descriptors in feature space

• Classification method of a query image in the recognition phase

The identification of image patches can be achieved by one of the keypoint detection methods already described. An alternative strategy is to sample the image at random. Empirical studies conducted by Nowak et al. [25] provide evidence that such a simple random sampling strategy yields equal or even better recognition results: random sampling allows image patches to be sampled densely, whereas the number of patches is limited for keypoint detectors as they focus on characteristic points. Dense sampling has the advantage of capturing more information.
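A random sampling strategy is straightforward to implement. The sketch below assumes the image is given as a two-dimensional NumPy array; it returns patch crops which would subsequently be fed into descriptor computation (patch size and count are free parameters).

```python
import numpy as np

def sample_random_patches(img, n_patches=100, patch_size=16, seed=None):
    """Randomly sample square patches, as densely as n_patches allows."""
    rng = np.random.default_rng(seed)
    h, w = img.shape
    ys = rng.integers(0, h - patch_size, size=n_patches)  # top-left corners
    xs = rng.integers(0, w - patch_size, size=n_patches)
    return [img[y:y + patch_size, x:x + patch_size] for y, x in zip(ys, xs)]
```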

As far as the descriptor choice is concerned, one of the descriptor methods described above is often chosen for image categorization tasks, too (e.g., the SIFT descriptor).

A simple clustering scheme is to perform vector quantization, i.e., to partition the feature space (e.g., a 128-dimensional space for SIFT descriptors) into equally sized cells. Hence, each descriptor is located in a specific cell. The codebook is built by taking into account all cells which contain at least one descriptor (all training images of all scene types are considered); the center position of such a cell can be referred to as a code word. Each scene type can be characterized by a histogram counting, for each cell, the number of occurrences of visual code words identified in all training images belonging to that type. Please note that such a partitioning leads to high memory demand for high-dimensional feature spaces.
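A sketch of this grid-based vector quantization follows, assuming descriptors are NumPy vectors and the cell size is a free parameter. Storing only the occupied cells in a dictionary keeps the memory demand manageable even in high-dimensional spaces.

```python
import numpy as np
from collections import Counter

def cell_index(descriptor, cell_size):
    """Integer grid coordinates of the cell a descriptor falls into."""
    return tuple(np.floor(descriptor / cell_size).astype(int))

def grid_histogram(descriptors, cell_size):
    """Count descriptors per occupied cell; cell centers act as code words."""
    counts = Counter(cell_index(d, cell_size) for d in descriptors)
    return counts  # sparse: only cells containing at least one descriptor
```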

An alternative clustering scheme is the k-means algorithm (cf. [19]), which aims to identify densely populated regions in feature space (i.e., regions where many descriptors are located close to each other). The distribution of the descriptors is then characterized by a so-called signature, which consists of the set of cluster centers and, if indicated, the cluster sizes (i.e., the number of descriptors belonging to each cluster). The advantage of k-means clustering is that the codebook fits the actual distribution of the data better, but on the other hand – at least in its original form – k-means only performs local optimization, and the number of clusters k has to be known in advance. Therefore many modifications of the scheme exist which intend to overcome these limitations.
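In code, building such a signature is a thin wrapper around a k-means implementation; the sketch below assumes scikit-learn is available and k has been chosen in advance.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_signature(descriptors, k=5):
    """Signature = cluster centers plus cluster sizes as weights."""
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(descriptors)
    sizes = np.bincount(kmeans.labels_, minlength=k)  # descriptors per cluster
    return kmeans.cluster_centers_, sizes
```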

If the descriptor distribution of a specific scene type is characterized by a histogram, the classification of a query image can be performed by calculating similarity measures between the query image histogram H_Q and the histograms of the scene types H_{S,l} determined in the training phase. A popular similarity metric is the χ² test. It defines a distance measure d_{χ²,l} for each scene type l:

d_{\chi^2,l} = \sum_{k} \frac{\left( H_Q(k) - H_{S,l}(k) \right)^2}{H_Q(k) + H_{S,l}(k)}

where k runs over the histogram bins.
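In code, this distance is a single vectorized expression; the small epsilon below is an implementation detail guarding against empty bins, not part of the formula.

```python
import numpy as np

def chi2_distance(h_q, h_s, eps=1e-10):
    """Chi-square distance between two histograms of equal length."""
    return np.sum((h_q - h_s) ** 2 / (h_q + h_s + eps))
```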

An alternative method, the Earth Mover’s Distance (EMD, cf. the article of Rubner et al. [29]), can be applied if the distribution is characterized by a signature, i.e., a collection of cluster centers and the sizes of each cluster. For example, the signature of the distribution of the descriptors of a query image consists of m cluster centers c_{Q,i} and weighting factors w_i, 1 ≤ i ≤ m, as a measure of the cluster sizes. A scene type l is characterized by a signature with cluster centers c_{S,l,j} and weights w_{l,j}, 1 ≤ j ≤ n, respectively.

The EMD defines a measure of the “work” which is necessary for transforming one signature into another. It can be calculated by

d_{EMD,l} = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} \, d_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}}

where d_{ij} denotes a distance measure between the cluster centers c_{Q,i} and c_{S,l,j} (e.g., the Euclidean distance), and f_{ij} denotes the flow between the two clusters, which results from an optimization constrained by the weights w_i and w_{l,j}; see [29] for details.

The Earth Mover’s Distance has the advantage that it can also be calculated if the numbers of cluster centers of the two distributions differ from each other, i.e., m ≠ n. Additionally, it avoids quantization effects resulting from bin borders.
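The optimal flows f_ij can be obtained with a standard linear program. The following sketch follows the transportation formulation of Rubner et al. [29], assuming SciPy is available; it is an illustration of the idea, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import linprog

def emd(centers_q, w_q, centers_s, w_s):
    """Earth Mover's Distance between two signatures (centers, weights)."""
    m, n = len(w_q), len(w_s)
    # ground distances d_ij between all pairs of cluster centers
    d = np.linalg.norm(centers_q[:, None, :] - centers_s[None, :, :], axis=2)
    c = d.ravel()                        # minimize sum_ij f_ij * d_ij
    # inequality constraints: flow out of i <= w_q[i], flow into j <= w_s[j]
    A_ub = np.zeros((m + n, m * n))
    for i in range(m):
        A_ub[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):
        A_ub[m + j, j::n] = 1.0
    b_ub = np.concatenate([w_q, w_s]).astype(float)
    # equality constraint: total flow equals the smaller total weight
    A_eq = np.ones((1, m * n))
    b_eq = [min(np.sum(w_q), np.sum(w_s))]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None))
    return res.fun / res.x.sum()         # total work, normalized by total flow
```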

Results of comparative studies for a number of degrees of freedom in algorithm design like sampling strategies, codebook size, descriptor choice, or classification scheme are reported in [25] or [34].

7.6.1.4 Spatial Pyramid Matching

A modification of the orderless bag-of-features approach, described by Lazebnik et al. [15], which in fact considers geometric relations to some degree, is called spatial pyramid matching. Here, the descriptor distribution in feature space is characterized by histograms based on a codebook built with the k-means algorithm.

Compared to the bag-of-features approach, additional actions are performed: spatially, the image region is divided into four sub-regions. Consequently, an additional distribution histogram can be calculated for each sub-region. This process can be repeated several times. Overall, when concatenating the histograms of all pyramid levels into one vector, the resulting description is in part identical to the histogram computed with a bag-of-features approach (for level 0), but has additional entries characterizing the distribution in the sub-images.

Fig. 7.14 Giving an example of the method with three descriptor types (indicated by red circles, green squares, and blue triangles) and three pyramid levels (levels 0, 1, and 2, weighted by × 1/4, × 1/4, and × 1/2, respectively)

The similarity metric for histogram comparison applied by Lazebnik et al. [15] differs from the χ² test. Without going into details, let’s just mention that a similarity value of a sub-histogram at a high level (“high” means division of the image into many sub-images) is weighted more strongly than a similarity value at a lower level, because at higher levels matching descriptors are similar not only in appearance but also in location. Lazebnik et al. report improved performance compared to a totally unordered bag-of-features approach.

The procedure is illustrated in Fig. 7.14, where a schematic example is given for three descriptor types (indicated by red circles, green squares, and blue triangles) and three pyramid levels. In the top part, the spatial descriptor distribution as well as the spatial partitioning is shown. At each pyramid level, the number of descriptors of each type is determined for each spatial bin and summarized in separate histograms (depicted below the spatial descriptor distribution). When concatenating the bins of all histograms into the final description of the scene, each level is weighted by the factor given in the caption of Fig. 7.14.
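A sketch of the pyramid construction is given below, assuming each descriptor has already been assigned a visual word and carries an (x, y) position normalized to [0, 1); the level weights follow the × 1/4, × 1/4, × 1/2 example of Fig. 7.14 (the weighting actually used by Lazebnik et al. differs in detail).

```python
import numpy as np

def spatial_pyramid(positions, words, n_words,
                    weights=(0.25, 0.25, 0.5)):
    """Concatenate weighted per-cell histograms over all pyramid levels."""
    parts = []
    for level, weight in enumerate(weights):
        g = 2 ** level                                 # g x g spatial grid
        cx = np.minimum((positions[:, 0] * g).astype(int), g - 1)
        cy = np.minimum((positions[:, 1] * g).astype(int), g - 1)
        cells = cy * g + cx                            # spatial bin per descriptor
        for cell in range(g * g):
            hist = np.bincount(words[cells == cell], minlength=n_words)
            parts.append(weight * hist)                # weighted sub-histogram
    return np.concatenate(parts)                       # final scene description
```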
