A Piecewise Smooth Gaussian Mixture Representation

2.8 Other Systems

We conclude this chapter by comparing our work with three other face detection systems.

Like ours, all three systems pass a test window over the input image to determine whether or not a face exists at each window location. The key difference is in the way each system tests its window patterns for faces.

2.8.1 Sinha

In an early window-based approach for finding faces, Sinha [84] uses a small set of spatial image invariants to describe the space of face patterns. He notes that even though windows containing face patterns may vary considerably in appearance for a variety of reasons, there are some spatial image relationships common and possibly unique to all face window patterns. To detect faces, the idea is to compile an appropriate set of these spatial relationships as invariants, and to test for window patterns that satisfy these relationships.

Sinha's image invariance scheme is based on a set of observed brightness invariants between different parts of a human face. The underlying observation is that while illumination and other changes can significantly alter brightness levels at different parts of a face, the local ordinal structure of the brightness distribution remains largely unchanged.

For example, the eye regions of a face are almost always darker than the cheeks and the forehead, except possibly under some very unlikely lighting conditions. Similarly, the bridge of the nose is always brighter than the two flanking eye regions. The scheme encodes these observed brightness regularities as a ratio template, which it uses to pattern match for faces.

A ratio template is a coarse spatial template of a face with a few appropriately chosen sub-regions, roughly corresponding to key facial features. The brightness constraints between facial parts are captured by an appropriate set of pairwise brighter-darker relationships between corresponding sub-regions. An image pattern matches the template if it satisfies all the pairwise brighter-darker constraints.
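The matching rule above can be sketched in a few lines. The region layout and the particular constraint list below are illustrative placeholders, not Sinha's actual template; the point is only the mechanism of averaging sub-regions and checking pairwise brighter-darker orderings.

```python
import numpy as np

# Each sub-region is a (row_slice, col_slice) into a coarse face window.
# These coordinates are invented for illustration.
REGIONS = {
    "left_eye":    (slice(4, 8),   slice(2, 7)),
    "right_eye":   (slice(4, 8),   slice(12, 17)),
    "nose_bridge": (slice(4, 8),   slice(8, 11)),
    "forehead":    (slice(0, 3),   slice(2, 17)),
    "left_cheek":  (slice(9, 13),  slice(2, 7)),
    "right_cheek": (slice(9, 13),  slice(12, 17)),
}

# Pairwise brighter-darker constraints: (brighter_region, darker_region).
CONSTRAINTS = [
    ("forehead", "left_eye"),
    ("forehead", "right_eye"),
    ("left_cheek", "left_eye"),
    ("right_cheek", "right_eye"),
    ("nose_bridge", "left_eye"),
    ("nose_bridge", "right_eye"),
]

def matches_ratio_template(window):
    """Return True if every brighter-darker constraint holds for the window."""
    mean = {name: window[rs, cs].mean() for name, (rs, cs) in REGIONS.items()}
    return all(mean[brighter] > mean[darker]
               for brighter, darker in CONSTRAINTS)
```

Because the test uses only ordinal comparisons between region averages, it is insensitive to monotonic brightness changes such as a global gain or offset, which is exactly the invariance Sinha exploits.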

So far, Sinha's invariance-based approach has only been demonstrated on images with little background clutter. Even though it is unclear how well the approach actually avoids false positive errors in cluttered scenes, we still find the spatial face representation scheme promising, and were encouraged to pursue a window-based face detection approach.

2.8.2 Moghaddam and Pentland

In a later system after ours, Moghaddam and Pentland [61] present a learning-based object finding technique that also models pattern classes by performing density estimation in a high dimensional space using eigenvector decomposition. These probability densities are then used to formulate a maximum-likelihood (ML) estimation framework for finding objects in images.

Their approach shares some common features with ours. Unlike an earlier scheme [69] that uses a unimodal Gaussian distribution to model data, their approach, like ours, uses a multimodal Gaussian mixture to approximate the target data distribution more tightly.
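The shared modeling idea can be made concrete with a short sketch: the class density is a weighted sum of Gaussian component densities rather than a single Gaussian. The weights and parameters here are arbitrary illustrations, not either system's fitted values.

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Density of a multivariate Gaussian at point x."""
    d = len(mu)
    diff = x - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

def mixture_pdf(x, weights, mus, covs):
    """Multimodal density: a convex combination of Gaussian components."""
    return sum(w * gaussian_pdf(x, mu, cov)
               for w, mu, cov in zip(weights, mus, covs))
```

With several components, the mixture can place probability mass on each cluster of face patterns separately, which is why it hugs the target distribution more tightly than a single unimodal Gaussian.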

To estimate high dimensional Gaussian densities with limited data, they also use a regularization approach that decomposes displacement vectors from a Gaussian centroid into two complementary components, identical to D1 and D2 of our 2-Value distance metric.

Their approach, however, combines the two distances into a single, robust estimate for high dimensional Gaussian densities, whose form is the product of two lower-dimensional and independent Gaussian densities. This robust estimate is exactly a re-expression of Equation 2.1 as a probabilistic likelihood measure.
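The decomposition can be sketched as follows, under the assumption that the Gaussian's top-k eigenvectors span a principal subspace: the first component is a Mahalanobis distance measured within that subspace, and the second is the squared Euclidean residual outside it. The variable names are ours, chosen for illustration.

```python
import numpy as np

def two_value_distance(x, mu, eigvals, eigvecs, k):
    """Split the displacement x - mu into two complementary distances.

    eigvals/eigvecs: eigenvalues (descending) and eigenvectors of the
    estimated covariance; k: dimension of the principal subspace.
    Returns (d1, d2): Mahalanobis distance within the top-k subspace,
    and the squared residual distance from that subspace.
    """
    d = x - mu
    coeffs = eigvecs[:, :k].T @ d          # projection onto top-k eigenvectors
    d1 = np.sum(coeffs ** 2 / eigvals[:k]) # Mahalanobis distance in subspace
    recon = eigvecs[:, :k] @ coeffs        # component lying in the subspace
    d2 = np.sum((d - recon) ** 2)          # squared residual outside subspace
    return d1, d2
```

Only the k largest eigenvalues, which are reliably estimated from limited data, enter d1; the poorly estimated trailing eigenvalues are replaced by the isotropic residual d2, which is the regularizing step.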

So far, Moghaddam and Pentland's approach has only been demonstrated on localization problems for human faces and hands. Recall that a localization problem differs from a detection problem in that the input image contains exactly one target object. Because a localization problem assumes exactly one target object, many localization approaches simply treat the localization task as searching an image for the location and scale that best matches a target model. These approaches assume that in any image containing a target object, the region containing the target object should match the model best. Moghaddam and Pentland's formulation of their object finding problem is exactly an object localization formulation. They cast the object finding problem as a visual search that returns only the best matching location and scale between a target model and an input image, in a probabilistic maximum likelihood sense.

We believe that a detection problem can be much harder than a localization problem, because in the former, one requires more than just a reasonable model to describe target patterns. One also needs reliable thresholds to determine whether or not a match is close enough to be accepted as a target object.
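The contrast between the two problem formulations can be sketched abstractly. The `score` function below is a hypothetical stand-in for any match measure between a model and a window: localization only needs its argmax, whereas detection must compare every window's score against an absolute threshold.

```python
def localize(image, locations, scales, score):
    """Localization: return the single best (location, scale) pair.

    No threshold is needed; the best match is accepted by assumption.
    """
    return max(((loc, s) for loc in locations for s in scales),
               key=lambda ls: score(image, *ls))

def detect(image, locations, scales, score, threshold):
    """Detection: return every (location, scale) whose score clears
    an absolute acceptance threshold -- possibly none, possibly many.
    """
    return [(loc, s) for loc in locations for s in scales
            if score(image, loc, s) >= threshold]
```

The extra difficulty of detection lives entirely in choosing `threshold`: a relative "best match" carries no information about whether the match is good enough in absolute terms.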

2.8.3 Rowley, Baluja and Kanade

In another recent piece of work after ours, Rowley et al. [76] report a window-based face detection approach that uses a highly constrained, retinally connected multi-layer perceptron net to classify local image patches. Networks with similar architectures and connection schemes have previously been used in character recognition tasks [28]. To improve performance over a single network, the system arbitrates between the outputs of multiple networks to obtain a more reliable class prediction for each window pattern. The authors report that even simple arbitration schemes, like ANDing the outputs of two networks, help greatly in reducing the number of false positives, with perhaps only a small penalty in the number of missed faces. They argue that this is because individual networks tend to detect roughly the same set of faces, but generally make different false positive errors. So the number of additional missed faces due to an ANDing operation should be small.
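The ANDing arbitration reduces to a conjunction over classifier outputs. A minimal sketch, with the networks stood in by arbitrary boolean predicates:

```python
def and_arbitrate(nets, window):
    """Accept a window as a face only if every network accepts it.

    Because the networks tend to agree on true faces but make different
    false positive errors, the conjunction prunes false alarms at only a
    small cost in missed faces.
    """
    return all(net(window) for net in nets)
```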

Like our technique, their network training procedure also requires a comprehensive but tractable set of prototypical "non-face" images. They use a "bootstrapping" method like ours to selectively add non-face patterns to the training set as training progresses. The "bootstrapping" method helps them reduce the non-face data set to only 9000 examples, which is a very small and tractable number.

In preparing their database of face patterns, Rowley et al. also introduce additional normalization procedures which we neglected when preparing our own face database. They normalize each original face pattern so that both eyes appear on a horizontal line, and scale the image precisely so that the distance from a point between the eyes to the upper lip is fixed for all images. By normalizing the original face patterns, they were able to generate virtual face patterns in a more principled fashion.
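The geometric part of that normalization amounts to computing a similarity transform from three landmarks. A minimal sketch, assuming landmarks are given as (x, y) points; the target distance of 24 pixels is an arbitrary illustrative value, and applying the transform to pixels is omitted.

```python
import numpy as np

def normalization_params(left_eye, right_eye, upper_lip, target_dist=24.0):
    """Rotation, scale, and center that put the eyes on a horizontal line
    and fix the eye-midpoint-to-upper-lip distance at target_dist."""
    le, re, lip = (np.asarray(p, dtype=float)
                   for p in (left_eye, right_eye, upper_lip))
    dx, dy = re - le
    angle = -np.arctan2(dy, dx)          # rotation making the eyes horizontal
    mid = (le + re) / 2.0                # point between the eyes
    scale = target_dist / np.linalg.norm(lip - mid)
    return angle, scale, mid
```

Because every face is brought to a canonical orientation and scale before perturbation, virtual face patterns generated by small rotations and scalings are sampled from a consistent neighborhood of each example.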

The individual networks they trained missed only one-third the number of face patterns our system missed on a common test database. However, each had about an order of magnitude more false positives than our system. After arbitration with several schemes, their resulting systems had slightly better detection and false alarm rates than ours.

Chapter 3

Generalizing the Object and