
8 Detection of Sparse Models: Counting


8.4 Further Analysis of a Detection

8.4.1 Detection as Classification

The detection scheme provides a classification at each possible instantiation θ between the two classes, object and nonobject. Note that in the first two steps of the algorithm there was hardly any use of nonobject or background images in terms of training. The reason is that the background image population is so large and diverse that a very large sample would be needed to yield reliable estimates for a global classifier between object and nonobject images. Large data sets of many thousands of images have been used for this purpose in Rowley, Baluja, and Kanade (1998) and Sung and Poggio (1998). The sparse-object models described here can be trained successfully with several tens of examples. The only information regarding background in the counting algorithm involves the densities of the local features, studied in chapter 6. Only a crude estimate of these is needed, simply to ensure the feasibility of the first step of the algorithm. However, once a detector is produced, the population of false positives is much more homogeneous and constrained.

Using the original training sample of the object, together with a sample of false positives, a classifier can be trained. Run the detector on all the images of the object in the training set and register the pixel intensities, the edge data, or the edge arrangement data from the ROI defined by the detection to the reference grid, as specified in equation 8.2. The same is done on images containing no object, each of which may yield a number of registered detections, producing a population of false positives. These data are then used to train classifiers of the type described in chapter 9.
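As a rough illustration of this step, the sketch below registers the edge data of a detected ROI to the reference grid and assembles a labeled training set. The helper names, the nearest-neighbor sampling, and the least-squares solve for the affine map are stand-ins of our own, assuming anchor and reference points are given as (x, y) pairs; the precise registration of equation 8.2 is only approximated.

```python
import numpy as np

def register_roi(edge_map, anchor_points, reference_points, grid_shape=(16, 16)):
    # Solve for the affine map taking the three reference points to the
    # three detected anchor points: ref @ params = anchors, where params
    # is a 3 x 2 matrix holding the linear part and the translation.
    ref = np.hstack([np.asarray(reference_points, float), np.ones((3, 1))])
    params, *_ = np.linalg.lstsq(ref, np.asarray(anchor_points, float), rcond=None)
    registered = np.zeros(grid_shape, dtype=edge_map.dtype)
    for i in range(grid_shape[0]):              # grid row -> y
        for j in range(grid_shape[1]):          # grid col -> x
            x, y = np.array([j, i, 1.0]) @ params    # grid pixel -> image
            xi, yi = int(round(x)), int(round(y))
            if 0 <= yi < edge_map.shape[0] and 0 <= xi < edge_map.shape[1]:
                registered[i, j] = edge_map[yi, xi]  # nearest-neighbor sample
    return registered

def build_training_set(object_rois, false_positive_rois):
    # Label 1 for registered object ROIs, 0 for registered false positives.
    X = [r.ravel() for r in object_rois] + [r.ravel() for r in false_positive_rois]
    y = [1] * len(object_rois) + [0] * len(false_positive_rois)
    return np.array(X), np.array(y)
```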

In figure 8.4, we show a sample of ROIs of false positives for the face detector. The reference grid has been reduced to a 16×16 lattice immediately surrounding the eyes and mouth. In the middle panel, we show the histogram maps of four edge types (two vertical and two horizontal) on this population, compared to the maps for the face population shown in the bottom panel. The similarity is quite striking, indicating that this false-positive population is quite restricted.

Figure 8.4 (Top) Registered ROIs of false positives from the face detector. (Middle) Edge maps for this population. (Bottom) Edge maps for the face population. The reference grid is reduced to a 16×16 lattice covering the immediate region around the eyes and mouth.


The edge histograms for randomly selected 16×16 images in the background population, by contrast, are uniform. Still, to the human eye there is no question that most of the false positives are not faces.
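The comparison in figure 8.4 comes down to pointwise frequencies over each population. A minimal sketch, assuming the registered edge data is stored as binary arrays:

```python
import numpy as np

def edge_histogram_maps(registered_edges):
    # registered_edges: shape (n_images, n_edge_types, 16, 16), binary
    # indicators of each edge type at each reference-grid location.
    # Returns shape (n_edge_types, 16, 16): the fraction of the
    # population exhibiting each edge type at each location.
    return np.asarray(registered_edges, dtype=float).mean(axis=0)
```

A flat map signals generic background; the pronounced structure of the false-positive maps is what indicates that this population is restricted to face-like configurations.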

8.5 Examples

In this section we illustrate some aspects of the counting detector for a variety of objects: faces, randomly perturbed LaTeX symbols in cluttered scenes, axial MRI scans, and a 3D object at a limited range of viewpoints. In addition to detections of the sparse model, we show examples of how these detections can provide automatic initialization for the deformable models described in chapters 3, 4, and 5. Specifically, we will use the maps a, estimated in step II, to initialize the templates. Recall that the sparse model represents the smallest range of scales at which the object is detected, and the algorithm is applied at six different resolutions to cover a range of scales of 4 : 1, as sketched below.
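A minimal sketch of this multiresolution scheme, assuming six geometrically spaced subsampled copies of the image (the use of scipy's zoom and the exact spacing are illustrative choices, not the book's implementation):

```python
from scipy.ndimage import zoom

def resolution_pyramid(image, n_levels=6, total_range=4.0):
    # Six resolutions spaced by a constant factor cover an overall
    # scale range of total_range : 1.  The per-level step is
    # total_range ** (1 / (n_levels - 1)), about 1.32 here, which the
    # detector's own scale tolerance at each resolution bridges.
    step = total_range ** (1.0 / (n_levels - 1))
    return [zoom(image, 1.0 / step ** k) for k in range(n_levels)]
```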

The edge arrangements in the examples are defined with the 16 smaller wedges shown in the top row of figure 6.7. The complexity of the arrangements is 3 (i.e., n_r = 3), and apart from the MRI example, we always pick n = 20 arrangements for the model, with lower bound ρ = 0.5 (see algorithm 6.1).

8.5.1 Faces

The face model shown in figure 6.9 was trained from 300 face images of the Olivetti data set. Much smaller data sets yield very similar results. The original images were downscaled to 44×36. The centers of the two eyes and the mouth were manually marked on each image and used as the three anchor points. The mean locations of these three landmarks are used for the three reference points p1, p2, p3 on the reference grid, as defined in section 6.4. At this scale, the mean distance between the centers of the eyes is 14 pixels, and the distance between the center of the mouth and the midpoint between the two eyes is also 14 pixels. Edges are extracted on each downscaled image, after which their locations are registered to the reference grid using the affine map taking the anchor points to the reference points. The edge statistics are graphically represented in figure 6.8, and in terms of these, an edge model for step II was derived with 110 edge-type/location pairs. The 20 edge arrangements of complexity n_r = 3 obtained from training are shown in figure 6.9. A model using n_r = 1 is shown in chapter 11.

Figure 8.5 (Top) Detections from step I. (Middle) Detections from step II. (Bottom) Detections remaining after final classifier. Detection triangles: mapping of the reference points into the image by a.

The range A of linear maps at which we expect to detect a face at a given resolution covers ±25% scaling and ±15° of rotation. The smallest scale at which a face is detected is approximately 10 pixels between the two eyes. Faces at much larger scales are detected by subsampling the image and rerunning the detection algorithm.
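As an illustration of restricting detections to the range A, the hypothetical check below decomposes the linear part of a detected map into its closest rotation and its singular-value scales; the tolerances follow the text, while the decomposition is our own choice:

```python
import numpy as np

def pose_in_range(A, max_scale_dev=0.25, max_rot_deg=15.0):
    # Polar-decompose the 2 x 2 linear part: R = U @ Vt is the rotation
    # closest to A, and the singular values s are the scale factors.
    U, s, Vt = np.linalg.svd(A)
    R = U @ Vt
    if np.linalg.det(R) < 0:        # reflections are not admissible
        return False
    angle = np.degrees(np.arctan2(R[1, 0], R[0, 0]))
    return abs(angle) <= max_rot_deg and np.all(np.abs(s - 1.0) <= max_scale_dev)
```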

The results of step I of the detection algorithm using this model, processing six resolutions covering a range of scales of 4 : 1, are shown in the top row of figures 8.5 and 8.6. The middle row shows all the detected affine maps obtained from step II.


Figure 8.6 (Top) Detections from step I. (Middle) Detections from step II. (Bottom) Detections remaining after final classifier. Detection triangles: mapping of the reference points into the image by a.

The affine maps are represented using the detection triangle showing the locations of the two eyes and mouth.

A final classifier was produced for face versus nonface using randomized classification trees (see chapter 9). These trees were trained using the registered edge data of the face training set against registered edge data from false detections obtained on a collection of random images, as shown in figure 8.4. The results are shown in the bottom row of figures 8.5 and 8.6. Some additional detections are shown in figure 8.7.
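The randomized classification trees of chapter 9 are not reproduced here; as a rough stand-in, scikit-learn's extremely randomized trees, which likewise split on randomly chosen features, can be trained on the registered edge vectors. All names below are hypothetical, with (X_train, y_train) as assembled in the earlier sketch:

```python
from sklearn.ensemble import ExtraTreesClassifier

def train_final_classifier(X_train, y_train):
    # X_train: flattened registered edge data; y_train: 1 for faces,
    # 0 for registered false detections.
    clf = ExtraTreesClassifier(n_estimators=100, max_depth=10, random_state=0)
    clf.fit(X_train, y_train)
    return clf

def prune_detections(clf, X_detections, detections):
    # Keep only the detections the final classifier labels as "face".
    keep = clf.predict(X_detections) == 1
    return [d for d, k in zip(detections, keep) if k]
```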

Figure 8.7 Some additional examples of face detections: steps I and II.

The training set for this classifier is quite small—300 face images of 30 different people (10 per person) and a similar number of false detections. This is therefore the least stable component of the algorithm and increases the number of false negatives.

Deformable Models Initialized at Detections of Sparse Model

In figure 8.8, we show a close-up of the four faces of figures 8.5 and 8.6 and the locations of the points in the instantiation obtained from step II. The white dots show edge features detected at the estimated pose, and the black dots show the extrapolated features that were not detected. If the features in the model are labeled according to the components of the face to which they belong, we obtain an estimate not only of where the centers of the two eyes and mouth are, but of other parts of the face as well: for example, the locations of other parts of the eyes, part of the nose, the hairline, and so on.
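A sketch of this extrapolation, assuming each model feature carries a part label assigned during training (the labeling scheme and the averaging are illustrative choices of our own):

```python
import numpy as np

def extrapolate_parts(model_points, part_labels, A, b):
    # Map each labeled reference-grid point into the image through the
    # estimated pose x -> A @ x + b, grouping points by face part.
    parts = {}
    for p, label in zip(model_points, part_labels):
        parts.setdefault(label, []).append(A @ np.asarray(p, float) + b)
    # One location estimate per part: the mean of its mapped features.
    return {label: np.mean(pts, axis=0) for label, pts in parts.items()}
```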


Figure 8.8 Estimated instantiation θ on the four faces detected above. White dots show detected edge features; black dots show extrapolated features.

Figure 8.9 Pose detected on the right-hand image in figure 8.5 initializes the Bernoulli model for deformable images.

One can also initialize a deformable-image algorithm at the detected pose. Due to the high variability in the illumination of faces, we implement the Bernoulli data model from section 5.4. This data model, being based on probabilities of edges, which are quite robust to illumination changes, inherits some of this robustness. On the left in figure 8.9, we show the locations of a set of points chosen on the reference grid, mapped into the image by the detected pose. In the middle, we show the outcome of a global search over a range of scale and location parameters for updating the pose. On the right is the outcome of the deformable-image algorithm using a small number of basis coefficients. Note how the outlines of the eyes have been adjusted, as well as the hairline, which in the initial instantiation was outside the face. The first example of this procedure was shown in figure 5.4, together with the edge maps and the edge data used to drive the algorithm.
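For concreteness, here is a schematic version of the Bernoulli log-likelihood that the deformable-image algorithm increases by adjusting a few basis coefficients of the deformation; the exact form in section 5.4 differs in detail, and the clipping constant is our addition:

```python
import numpy as np

def bernoulli_loglik(edge_data, probs, eps=1e-3):
    # edge_data: binary indicators of observed edges at the template
    # locations mapped through the current deformation.
    # probs: model probabilities of finding each of those edges.
    p = np.clip(probs, eps, 1.0 - eps)          # guard against log(0)
    x = np.asarray(edge_data, dtype=float)
    return np.sum(x * np.log(p) + (1.0 - x) * np.log(1.0 - p))
```

Starting from the pose supplied by the detector, the search of figure 8.9 amounts to increasing this quantity, first over scale and location, then over the deformation coefficients.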
