Variants of Region Descriptors - Advances in Pattern Recognition

A simple approach for characterizing a region is to describe it by its raw intensity values. Matching then amounts to calculating the cross-correlation between two descriptors. However, this procedure suffers from high computational complexity (the descriptor size equals the number of pixels in the region) and provides little invariance. Descriptor design therefore aims at finding a balance between dimensionality reduction and maintaining discriminative power.

Additionally, it should focus on converting the information of the image region such that it becomes invariant or at least robust to typical variations, e.g., non-linear illumination changes or affine transformations due to viewpoint change.
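The raw-intensity baseline mentioned at the start of this section can be illustrated with a short sketch. It is not taken from the text: the normalization (zero mean, unit length) is an added assumption that makes the score insensitive to linear brightness changes, and the random 41×41 patch merely stands in for real image data.

```python
import numpy as np

def raw_descriptor(patch):
    """Flatten a region into a zero-mean, unit-length intensity vector.

    The normalization is an assumption added for this sketch.
    """
    v = np.asarray(patch, dtype=float).ravel()
    v = v - v.mean()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def ncc(desc_a, desc_b):
    """Normalized cross-correlation of two equally sized descriptors."""
    return float(np.dot(desc_a, desc_b))

rng = np.random.default_rng(0)
patch = rng.random((41, 41))            # synthetic stand-in for an image region
brighter = 1.5 * patch + 0.2            # linear illumination change
shifted = np.roll(patch, 5, axis=1)     # small displacement of the content

d = raw_descriptor(patch)               # one element per pixel: 41 * 41 = 1,681
score_same = ncc(d, raw_descriptor(brighter))
score_shifted = ncc(d, raw_descriptor(shifted))
```

The descriptor has as many elements as the region has pixels, and while the score survives the linear brightness change, a displacement of a few pixels is enough to destroy the correlation for this unstructured patch, which illustrates the lack of invariance.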

Many of the descriptors found in the literature belong to one of the following two categories:

• Distribution-based descriptors derived from the distribution of some kind of information available in the region, e.g., gradient orientation in SIFT descriptors. Commonly, the distribution is described by a histogram of some kind of “typical” information.

• Filter-based descriptors calculated with the help of some kind of filtering. More precisely, a bank of filters is applied to the region content. The descriptor consists of the responses of all filters. Each filter is designed to be sensitive to a specific kind of information. Commonly used filter types separate properties in the frequency domain (e.g., Gabor filters or wavelet filters) or are based on derivatives.

Some descriptors for each of the two categories are presented in the following.

7.4.1 Variants of the SIFT Descriptor

Due to its good performance, the descriptor used in the SIFT algorithm, which is based on the distribution of gradient orientations, has become very popular. During the last decade several proposals have been made to increase its performance even further, with respect to both computation speed and recognition rate.

Ke and Sukthankar [12] proposed a modification they called PCA-SIFT. In principle they follow the outline of the SIFT method, but instead of calculating gradient orientation histograms as descriptors they resample the interest region (its detection is identical to SIFT) to 41×41 pixels and calculate the x- and y-gradients within the resampled region, yielding a descriptor consisting of 2×39×39=3,042 elements. In the next step, they apply a PCA to the normalized gradient descriptor (cf. Chapter 2), where the eigenspace has been calculated in advance. The eigenspace is derived from normalized gradient descriptors extracted from the salient regions of a large image dataset (about 21,000 images). Usually the descriptors contain highly structured gradient information, as they are calculated around well-chosen characteristic points. Therefore, the eigenvalues decay much faster compared to randomly chosen image patches. Experiments of Ke and Sukthankar [12] showed that about 20 eigenvectors are sufficient for a proper descriptor representation.

Hence descriptor dimensionality is reduced by a factor of about 6 (20 elements compared to the 128 elements of the standard SIFT method), resulting in much faster matching. Additionally, Ke and Sukthankar reported that PCA-SIFT descriptors lead to more accurate descriptor matching than standard SIFT. A more extensive study by Mikolajczyk and Schmid [23], in which other descriptors are compared as well, showed that the matching accuracy of PCA-SIFT relative to standard SIFT depends on the scene type; there are also scenes for which PCA-SIFT performs slightly worse than standard SIFT.
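The PCA-SIFT pipeline described above (gradient vector of 3,042 elements, projection onto about 20 eigenvectors) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: central differences are assumed for the gradients, the function names are hypothetical, and 200 random patches stand in for the large training set from which the eigenspace is actually derived.

```python
import numpy as np

def gradient_vector(patch41):
    """x- and y-gradients of a 41x41 patch as a normalized 3,042-vector.

    Central differences (an assumption) shrink each gradient map to 39x39.
    """
    p = np.asarray(patch41, dtype=float)
    gx = p[1:-1, 2:] - p[1:-1, :-2]
    gy = p[2:, 1:-1] - p[:-2, 1:-1]
    v = np.concatenate([gx.ravel(), gy.ravel()])
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def fit_eigenspace(train_vectors, n_components=20):
    """Estimate the mean and the leading principal directions offline."""
    mean = train_vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(train_vectors - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_sift(patch41, mean, basis):
    """Project the gradient vector of a patch into the precomputed eigenspace."""
    return basis @ (gradient_vector(patch41) - mean)

rng = np.random.default_rng(0)
# Synthetic stand-in for gradient vectors collected from real salient regions.
train = np.stack([gradient_vector(rng.random((41, 41))) for _ in range(200)])
mean, basis = fit_eigenspace(train)
desc = pca_sift(rng.random((41, 41)), mean, basis)   # 20 elements
```

Because the eigenspace is computed once in advance, the per-region cost at run time is a single matrix-vector product.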

Another modification of the SIFT descriptor, called gradient location orientation histogram (GLOH), is reported by Mikolajczyk and Schmid [23] and is based on an idea very similar to the log-polar choice of the histogram bins for shape contexts, which are presented in a subsequent section. Contrary to the SIFT descriptor, where the region is separated into a 4×4 rectangular grid, the descriptor is calculated for a log-polar location grid consisting of three bins in radial direction; the outer two radial bins are further separated into eight sectors (see Fig. 7.7).

Hence the region is separated into 17 location bins altogether (one central bin plus 2×8 sectors). For each spatial bin a 16-bin histogram of gradient orientation is calculated, yielding a 272-bin histogram altogether. The log-polar choice enhances the descriptor's robustness, as the relatively large outer location bins are less sensitive to deformations. Descriptor dimensionality is reduced to 128 again by applying a PCA based on the eigenvectors of a covariance matrix which is estimated from 47,000 descriptors collected from various images. With the help of the PCA, distinctiveness is increased compared to standard SIFT although descriptor dimensionality remains the same. Extensive studies reported in [23] show that GLOH slightly outperforms SIFT for most image data.
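The location and orientation binning that leads to the 272-bin histogram can be made concrete with a small sketch. Several details are assumptions made for illustration only: the ring radii are free parameters here, each pixel votes with its gradient magnitude into exactly one bin (no interpolation between bins), and the final PCA step to 128 dimensions is omitted.

```python
import numpy as np

def gloh_histogram(patch, radii=(6.0, 11.0, 15.0), n_sectors=8, n_orient=16):
    """Log-polar location bins x orientation bins, before any PCA.

    radii are the outer radii of the three rings (assumed values); the
    innermost ring is a single bin, the others are split into n_sectors.
    """
    p = np.asarray(patch, dtype=float)
    gy, gx = np.gradient(p)
    mag = np.hypot(gx, gy)
    ori = np.arctan2(gy, gx) % (2 * np.pi)

    h, w = p.shape
    yy, xx = np.mgrid[:h, :w]
    dy, dx = yy - (h - 1) / 2.0, xx - (w - 1) / 2.0
    r = np.hypot(dx, dy)
    phi = np.arctan2(dy, dx) % (2 * np.pi)

    ring = np.searchsorted(radii, r, side="right")   # 0, 1, 2 inside; 3 outside
    sector = np.minimum((phi / (2 * np.pi) * n_sectors).astype(int), n_sectors - 1)
    loc = np.where(ring == 0, 0, 1 + (ring - 1) * n_sectors + sector)
    obin = np.minimum((ori / (2 * np.pi) * n_orient).astype(int), n_orient - 1)

    inside = ring < len(radii)                       # ignore pixels beyond the grid
    n_loc = 1 + (len(radii) - 1) * n_sectors         # 1 + 2 * 8 = 17
    hist = np.zeros((n_loc, n_orient))
    np.add.at(hist, (loc[inside], obin[inside]), mag[inside])
    return hist.ravel()                              # 17 * 16 = 272 elements

gloh = gloh_histogram(np.random.default_rng(0).random((41, 41)))
```

The outer sectors cover many more pixels than the central bin, which is the property the text credits for the improved robustness to deformations.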

Fig. 7.7 Spatial partitioning of the descriptor region in the GLOH method

Fig. 7.8 Example of a CCH descriptor. For illustration purposes, the region is separated into only two radial and four angular sub-regions (four radial and eight angular bins are used in the original descriptor)

Contrast context histograms (CCH), which were proposed recently by Huang et al. [10], aim at speeding up descriptor calculation. Instead of calculating the local gradient at all pixels of the region, the intensity difference of each region pixel to the center pixel of the region is calculated, which yields a contrast value for each region pixel. Similar to the GLOH method, the region is separated into spatial bins with the help of a log-polar grid. For each spatial bin, all positive contrast values as well as all negative contrast values are added separately, resulting in a two-bin contrast histogram for each spatial bin (see Fig. 7.8 for an example: at each sub-region, the separate addition of all positive and negative contrast values yields the blue and green histogram bins, respectively).

In empirical studies performed by the authors, a 64-bin descriptor (32 spatial bins, two contrast values for each spatial bin) achieved an accuracy comparable to that of the standard SIFT descriptor, whereas the descriptor calculation was accelerated by a factor of approximately 5 and the matching stage by a factor of 2.
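A compact sketch of the CCH computation described above is given next. It follows the text in summing positive and negative contrast values separately per spatial bin; the logarithmic spacing of the ring boundaries, the use of the patch center as reference pixel, and the function name are assumptions of this sketch rather than details taken from [10].

```python
import numpy as np

def cch_descriptor(patch, n_radial=4, n_angular=8):
    """Two contrast sums (positive, negative) per log-polar spatial bin."""
    p = np.asarray(patch, dtype=float)
    h, w = p.shape
    cy, cx = h // 2, w // 2
    contrast = p - p[cy, cx]                 # difference to the center pixel

    yy, xx = np.mgrid[:h, :w]
    dy, dx = yy - cy, xx - cx
    r = np.hypot(dx, dy)
    phi = np.arctan2(dy, dx) % (2 * np.pi)

    r_max = min(cy, cx)
    # Interior ring boundaries, log-spaced between 1 pixel and r_max (assumed).
    inner = np.geomspace(1.0, r_max, n_radial + 1)[1:-1]
    ring = np.searchsorted(inner, r, side="right")            # 0 .. n_radial - 1
    sector = np.minimum((phi / (2 * np.pi) * n_angular).astype(int), n_angular - 1)
    spatial = ring * n_angular + sector
    valid = (r > 0) & (r <= r_max)           # skip the center and the corners

    n_bins = n_radial * n_angular
    pos = valid & (contrast > 0)
    neg = valid & (contrast < 0)
    pos_sum = np.bincount(spatial[pos], weights=contrast[pos], minlength=n_bins)
    neg_sum = np.bincount(spatial[neg], weights=contrast[neg], minlength=n_bins)
    return np.stack([pos_sum, neg_sum], axis=1).ravel()       # 32 * 2 = 64

cch = cch_descriptor(np.random.default_rng(0).random((41, 41)))
```

Only one subtraction per pixel is needed, with no gradient or orientation computation, which is consistent with the speedup in descriptor calculation reported in the text.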

7.4.2 Differential-Based Filters

Biological visual systems give motivation for another approach to model the content of a region around an interest point. Koenderink and Van Doorn [13] reported the idea to model the response of specific neurons of a visual system with blurred partial derivatives of local image intensities. This amounts to the calculation of a convolution of the partial derivative of a Gaussian kernel with a local image patch.

For example, the first partial derivative with respect to the x-direction yields

$$d_{R,x} = \frac{\partial G(x, y, \sigma)}{\partial x} * I_R(x, y) \qquad (7.8)$$

where $*$ denotes the convolution operator. As the size of the Gaussian derivative $\partial G / \partial x$ equals the size of the region $R$, the convolution result is a scalar $d_{R,x}$. Multiple convolution results with derivatives in different directions and of different order can be combined to a vector $d_R$, the so-called local jet, giving a distinctive, low-dimensional representation of the content of an image region. Compared to distribution-based descriptors the dimensionality is usually considerably lower here.

For example, partial derivatives in x- and y-direction (Table 7.1) up to fourth order yield a 14D descriptor.

Table 7.1 The 14 derivatives of a Gaussian kernel G (size 41×41 pixels, σ=6.7) in x- and y-direction up to fourth order (top row from left to right: Gx, Gy, Gxx, Gxy, Gyy; middle row from left to right: Gxxx, Gxxy, Gxyy, Gyyy, Gxxxx; bottom row from left to right: Gxxxy, Gxxyy, Gxyyy, Gyyyy)
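The 14D local jet can be sketched in a few lines. The kernel size and σ are taken from Table 7.1; everything else is an assumption of this sketch: the derivatives of the Gaussian are built analytically from probabilists' Hermite polynomials, the 2D kernels are formed as separable outer products, and the function names are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval

def gauss_deriv_1d(n, size=41, sigma=6.7):
    """n-th derivative of a sampled 1D Gaussian.

    Uses d^n/dx^n g(x) = (-1)^n * He_n(x / sigma) / sigma^n * g(x).
    """
    x = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-x**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    he_n = hermeval(x / sigma, [0] * n + [1])
    return (-1) ** n * he_n / sigma**n * g

def local_jet(patch, sigma=6.7, max_order=4):
    """Responses to all Gaussian derivatives of order 1 .. max_order."""
    p = np.asarray(patch, dtype=float)
    size = p.shape[0]
    jet = []
    for order in range(1, max_order + 1):
        for j in range(order + 1):           # j: derivative order in y
            i = order - j                    # i: derivative order in x
            kernel = np.outer(gauss_deriv_1d(j, size, sigma),
                              gauss_deriv_1d(i, size, sigma))
            # Kernel and region have equal size, so the convolution collapses
            # to one scalar: the sum over the flipped kernel times the patch.
            jet.append(np.sum(kernel[::-1, ::-1] * p))
    return np.array(jet)

ramp = np.tile(np.arange(41) - 20.0, (41, 1))    # intensity grows along x only
jet = local_jet(ramp)                            # 2 + 3 + 4 + 5 = 14 elements
```

The elements appear in the order of Table 7.1 (Gx, Gy, Gxx, Gxy, Gyy, ...). For the ramp, the first element is close to the true slope of 1 (slightly smaller because the kernel is truncated at about 3σ), while the response to Gy vanishes.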

In order to achieve rotational invariance, the directions of the partial derivatives of the Gaussian kernels have to be adjusted such that they are in accord with the dominant gradient orientation of the region. This can be done either by rotating the image region content itself or by using so-called steerable filters developed by Freeman and Adelson [7]: they developed a theory that allows the derivatives, which are already calculated in x- and y-direction, to be steered to a particular direction, hence making the local jet invariant to rotation.

Florack et al. [5] developed so-called “differential invariants”: they consist of specific combinations of the components of the local jet. These combinations make the descriptor invariant to rotation, too.
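Both routes to rotation invariance can be illustrated on the first- and second-order components of a local jet. The sketch below is restricted to second order and makes two assumptions: the steering direction is taken from the first-order responses themselves rather than from a separately estimated dominant orientation, and the three combinations in `differential_invariants` are common textbook examples, not necessarily the exact set used in [5]. All names and the numeric responses are hypothetical.

```python
import numpy as np

def steered_jet(dx, dy, dxx, dxy, dyy):
    """Steer derivatives to the gradient direction u and its normal v."""
    theta = np.arctan2(dy, dx)
    c, s = np.cos(theta), np.sin(theta)
    d_u = c * dx + s * dy                                   # first order along u
    d_uu = c * c * dxx + 2 * c * s * dxy + s * s * dyy
    d_uv = -c * s * dxx + (c * c - s * s) * dxy + c * s * dyy
    d_vv = s * s * dxx - 2 * c * s * dxy + c * c * dyy
    return np.array([d_u, d_uu, d_uv, d_vv])

def differential_invariants(dx, dy, dxx, dxy, dyy):
    """Rotation-invariant combinations of jet components."""
    return np.array([
        dx**2 + dy**2,                                      # squared gradient magnitude
        dxx + dyy,                                          # Laplacian
        dxx * dx * dx + 2 * dxy * dx * dy + dyy * dy * dy,  # g^T H g
    ])

def rotate_responses(dx, dy, dxx, dxy, dyy, alpha):
    """Responses the same jet would give after rotating the region by alpha."""
    rot = np.array([[np.cos(alpha), -np.sin(alpha)],
                    [np.sin(alpha), np.cos(alpha)]])
    g = rot @ np.array([dx, dy])
    hess = rot @ np.array([[dxx, dxy], [dxy, dyy]]) @ rot.T
    return g[0], g[1], hess[0, 0], hess[0, 1], hess[1, 1]

responses = (0.8, -0.3, 0.05, 0.02, -0.04)     # made-up jet components
rotated = rotate_responses(*responses, alpha=0.7)
```

Evaluating `steered_jet` or `differential_invariants` on `responses` and on `rotated` gives the same values up to rounding, although the raw components differ.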

7.4.3 Moment Invariants

Moment invariants are low-dimensional region descriptors proposed by Van Gool et al. [32]. Each element of the descriptor represents a combination of moments $M_{pq}^a$ of order $p+q$ and degree $a$. The moments are calculated for the derivatives of the image intensities $I_d(x, y)$ with respect to direction $d$. All pixels located within an image region of size $s$ are considered. The $M_{pq}^a$ can be defined as

$$M_{pq}^a = \frac{1}{s} \sum_x \sum_y x^p y^q \cdot I_d(x, y)^a \qquad (7.9)$$

Flusser and Suk [6] have shown for binary images that specific polynomial combinations of moments are invariant to affine transformations. In other words, the value of a moment invariant should remain constant if the region from which it was derived has undergone an affine transformation. Hence, the usage of these invariants is a way of achieving descriptor invariance with respect to viewpoint change. Several invariants can be concatenated in a vector yielding a low-dimensional descriptor with the desired invariance properties.

Note that in [32] moment invariants based on color images are reported, but the approach can be easily adjusted to the usage of derivatives of gray value intensities, which leads to the above definition.
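The moment definition (7.9) translates almost directly into code. Two points are not fixed by the text and are assumptions of this sketch: the coordinates are measured relative to the region center, and the region size s is taken to be the number of pixels.

```python
import numpy as np

def moment(deriv_patch, p, q, a):
    """M_pq^a = (1/s) * sum_x sum_y x^p * y^q * I_d(x, y)^a.

    deriv_patch holds the derivative image I_d of the region for one
    direction d. Centered coordinates and s = pixel count are assumptions.
    """
    d = np.asarray(deriv_patch, dtype=float)
    h, w = d.shape
    yy, xx = np.mgrid[:h, :w]
    x = xx - (w - 1) / 2.0
    y = yy - (h - 1) / 2.0
    s = d.size
    return float(np.sum(x**p * y**q * d**a) / s)

ones = np.ones((5, 5))            # toy derivative image
m00 = moment(ones, 0, 0, 1)       # mean of I_d: 1.0
m10 = moment(ones, 1, 0, 1)       # vanishes for centered coordinates: 0.0
m20 = moment(ones, 2, 0, 1)       # mean of x^2 over x = -2..2: 2.0
```

The invariants themselves are polynomial combinations of such moments; which combinations are used is specified in [32] and [6] and is not reproduced here.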

7.4.4 Rating of the Descriptors

Mikolajczyk and Schmid [23] investigated moment invariants with derivatives in x- and y-directions up to second order and second degree (yielding a 20D descriptor without usage of $M_{00}^a$) as well as other types of descriptors in an extensive comparative study with empirical image data of different scene types undergoing different kinds of modifications. They showed that GLOH performed best in most cases, closely followed by SIFT. The shape context method presented in the next section also performs well, but is less reliable in scenes that lack clear gradient information. Gradient moments and steerable filters (of Gaussian derivatives) perform worse, but consist of only very few elements. Hence, their dimensionality is considerably lower compared to distribution-based descriptors. These two methods are found to be the best-performing low-dimensional descriptors.
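The figure of 20 dimensions for the gradient moments can be checked by enumerating the index combinations. The enumeration below is one plausible reading of the description above (two derivative directions, degrees 1 and 2, orders 1 and 2, with the zero-order moments left out) and is meant as a sanity check only.

```python
from itertools import product

directions = ("x", "y")                  # derivative direction d
degrees = (1, 2)                         # degree a
# Orders p + q up to 2, skipping p = q = 0 (the excluded M_00^a).
orders = [(p, q) for p, q in product(range(3), repeat=2) if 1 <= p + q <= 2]
index_set = [(d, p, q, a) for d in directions for a in degrees for (p, q) in orders]
n_elements = len(index_set)              # 2 directions * 2 degrees * 5 orders
```

Five (p, q) pairs remain per direction and degree, so each derivative direction contributes 10 elements.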

7.5 Descriptors Based on Local Shape Information
