Thesis Outline and Contributions

The rest of this thesis describes and analyzes our learning-based technique for detecting spatially well-defined objects and pattern classes. We begin in Chapter 2 by presenting a human face detection system we have developed using this approach. Human faces make up a natural and challenging class of spatially well-defined 3D objects, and the problem of finding them is further complicated if one has to deal with varied lighting conditions.

Here, our goal is twofold. We introduce the key elements of our object detection technique using a concrete example from a specific pattern detection application. We also quickly demonstrate the power of our approach by showing how well it performs on a reasonably complex real-world problem.

Chapter 3 generalizes from our human face detection system and presents the underlying framework as a scheme for detecting spatially well-defined objects and pattern classes. We attempt to understand the overall approach in terms of its individual components, their functionality, limitations and underlying assumptions. Much of our analysis here will be based on an effort to identify the critical components of our face detection system, which includes an empirical study of how the system's performance varies as one changes the architecture of the various components.

We also show three new applications of our pattern detection approach to demonstrate its generality and extensibility. The first is a direct extension of our human face detection system to handle a wider range of poses. We present a new but identically structured system, trained with additional examples covering a wider range of views. The second application detects a different class of spatially well-defined objects, human eyes, to show that the approach works well for more than just human faces. Although there is less pattern variability between human eyes than between human faces, the problem is still challenging because there can be a much wider range of isolated background patterns that resemble human eyes than human faces. The third application is somewhat different in spirit from the detection problems considered so far. We look at the problem of recognizing isolated hand-printed digits using our underlying object and pattern class identification approach.

There has been a lot of work in hand-printed digit recognition over the past twenty to thirty years, with current state-of-the-art systems achieving recognition rates comparable to humans. Our goal in this third application is not to better the state-of-the-art performance in isolated hand-printed digit recognition systems. Rather, we wish to demonstrate that our underlying approach is truly general enough to model and capture localized pattern variations, even in a totally different problem domain, and on a task that is essentially pattern recognition and not detection in nature.

In Chapter 4, we take a formal look at one very critical aspect of our learning-based object and pattern class detection approach: the problem of selecting high utility examples for training a system. We argue that the example selection task is essentially an active learning problem, and we propose a function approximation based active learning formulation to show that one can indeed select useful training data in a principled and "optimal" fashion. While the formulation we propose is computationally intractable in its original form for a wide range of approximation function classes, we see it as a possible benchmark for evaluating other active example selection schemes. We then consider a reduced version of the original active learning formulation that essentially hunts for new data where approximation "error bars" are high. Furthermore, we show how such a scheme, with minor modifications, can lead to a practical "boot-strap" example selection strategy that we have adopted in our object detection training methodology. Although the "boot-strap" strategy loses some of the original active learning flavor, and may thus be "sub-optimal" in its choice of new examples, it is nevertheless a very effective means of sieving through unmanageably large sets of potential training data to make learning problems tractable.
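The "boot-strap" idea described above can be illustrated with a minimal sketch. The details here are assumptions made purely for illustration and are not taken from the thesis: the function names (`bootstrap_select`, `fit_centroids`, `predict_centroids`), the one-dimensional feature, the nearest-mean toy learner, and the rule of adding only examples that the current model misclassifies are all hypothetical.

```python
def fit_centroids(examples):
    """Toy learner (hypothetical): mean of each class on a 1-D feature."""
    neg = [x for x, y in examples if y == 0]
    pos = [x for x, y in examples if y == 1]
    return sum(neg) / len(neg), sum(pos) / len(pos)


def predict_centroids(model, x):
    """Label x with the class whose mean is nearer (ties go to class 0)."""
    neg_mean, pos_mean = model
    return 0 if abs(x - neg_mean) <= abs(x - pos_mean) else 1


def bootstrap_select(seed, pool, fit, predict, rounds=5, batch=10):
    """Grow a training set with pool examples the current model gets wrong.

    Assumed loop: train, scan the candidate pool for mistakes, add up to
    `batch` of them to the training set, retrain, and repeat.
    """
    train = list(seed)
    remaining = list(pool)
    model = fit(train)
    for _ in range(rounds):
        mistakes = [(x, y) for x, y in remaining if predict(model, x) != y]
        if not mistakes:
            break  # the current model already handles every candidate
        added = mistakes[:batch]
        train.extend(added)
        remaining = [ex for ex in remaining if ex not in added]
        model = fit(train)
    return model, train


# Usage: only the one misclassified candidate (6.0, 0) is added.
seed = [(0.0, 0), (10.0, 1)]
pool = [(1.0, 0), (6.0, 0), (9.0, 1), (2.0, 0)]
model, train = bootstrap_select(seed, pool, fit_centroids, predict_centroids)
print(train)
```

The point of the sketch is that correctly handled candidates never enter the training set, which is one way a very large candidate pool can be reduced to a small, informative subset.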

Finally, in Chapter 5, we discuss two extensions to our object and pattern detection technique. The first looks at how one can combine the output results of several pattern detectors to achieve better detection rates with fewer false alarms. Recently, Rowley et al. [76] have applied some simple arbitration techniques to a few face detection networks trained with our example selection methods, and have reported very impressive face detection results. We shall discuss a more powerful arbitration scheme, called network boosting [30], that can potentially lead to systems with arbitrarily high correct classification rates. The second extension is about building hierarchical architectures for dealing with occlusion, and for detecting pattern classes with less well-defined boundaries. Recall from an earlier discussion that one can detect pattern classes with moderately variable boundaries by defining simpler sub-pattern classes that can be easily isolated and identified in an image. One main difficulty in this approach is to reliably identify and locate full target patterns from the spatial distribution and composition of these sub-patterns in an image. We shall look at some possible techniques for performing such a task.
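As a rough illustration of what combining the outputs of several detectors can look like, the sketch below keeps a detection only when enough detectors agree on the same window location. This agreement-count rule is an assumption chosen for simplicity; it is not the arbitration used by Rowley et al. [76], and it is not network boosting [30]. The function name `combine_detections` and the representation of a detection as an `(x, y)` tuple are likewise hypothetical.

```python
def combine_detections(per_detector, min_agree):
    """Keep locations reported by at least `min_agree` detectors.

    per_detector: list of sets, one per detector, each holding the
    (x, y) window locations where that detector fired.
    """
    counts = {}
    for detections in per_detector:
        for loc in detections:
            counts[loc] = counts.get(loc, 0) + 1
    return {loc for loc, n in counts.items() if n >= min_agree}


# Usage: three hypothetical detectors run on the same image.
a = {(10, 10), (40, 12)}
b = {(10, 10), (70, 5)}
c = {(10, 10), (40, 12)}
print(combine_detections([a, b, c], min_agree=2))
```

Raising `min_agree` trades missed detections for fewer false alarms: requiring all three detectors to agree keeps only `(10, 10)`, while `min_agree=1` returns the union of all reported locations.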

The contributions of this thesis are as follows:

1. A new framework for detecting spatially well-defined objects and pattern classes with image variations that are difficult to parameterize. While most of the individual components within the proposed system architecture are not new, our work is the first attempt at integrating and understanding these separate components as parts of an overall framework for detecting spatially well-defined objects and image patterns.

2. An implementation of a very robust human face detection system, based on our proposed pattern detection scheme. At its time of conception, our system was probably the state-of-the-art implementation for correctly finding human faces with extremely few false alarm errors, even in highly cluttered images. We currently know of two later systems [61] [76] that are based on ideas developed in our face detection approach. Both systems have also reported very impressive classification results.

3. A "boot-strap" paradigm for selecting useful training examples and for incrementally training object and pattern detection systems to arbitrary levels of robustness. Our object detection approach uses the "boot-strap" paradigm as part of its recommended example selection and training procedure. We have successfully demonstrated the paradigm in building our human face detection system, and have shown that the paradigm is vital for making an otherwise unmanageably complex learning problem tractable. The "boot-strap" idea is very general and is suitable for training most highly complex learning architectures and approximation function classes.

4. A highly robust 2-Value distance metric for measuring directionally dependent distances to a Gaussian mixture sample distribution model. The individual components of our 2-Value metric correspond to different classification measures that have been used recently. As far as we know, our work is the first attempt to combine the two measures for pattern classification. We also show how the 2-Value distance metric relates to classical measures like the Mahalanobis distance and probabilistic models in a Bayesian framework.

5. The idea of explicitly modeling the distribution of highly informative negative examples to create additional features for classification. We show empirically that a well chosen negative example distribution of a learning problem gives rise to a very discriminative set of additional classification features for pattern detection. We also propose two possible interpretations of a negative example distribution in the context of building distribution-based models for representing large pattern classes.

6. An active learning formulation for example-based function approximation learning. We propose an optimality criterion for measuring the marginal utility of new data samples in a function approximation learning problem. We also derive a principled strategy for sampling new data in an "optimal" fashion based on our proposed utility measure. Finally, we show how simplifying the original formulation leads to practical example selection strategies like the "boot-strap" paradigm used by our object and pattern detection training approach.
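Item 4 above mentions the classical Mahalanobis distance as a point of comparison for the 2-Value metric. The sketch below shows only that classical measure, applied to the nearest cluster of a Gaussian mixture; it is not the 2-Value metric itself, whose definition is not given in this section. The function names, the cluster representation as `(mean, covariance)` pairs, and the small linear solver are all illustrative assumptions.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def mahalanobis(x, mean, cov):
    """sqrt((x - mean)^T cov^{-1} (x - mean)) for one Gaussian cluster."""
    d = [xi - mi for xi, mi in zip(x, mean)]
    z = solve(cov, d)  # z = cov^{-1} d without forming the inverse
    return math.sqrt(sum(di * zi for di, zi in zip(d, z)))


def nearest_cluster(x, clusters):
    """Return (index, distance) of the cluster with the smallest distance."""
    dists = [mahalanobis(x, mean, cov) for mean, cov in clusters]
    best = min(range(len(dists)), key=dists.__getitem__)
    return best, dists[best]


# Usage: cluster 0 is elongated along the first axis, cluster 1 is round.
clusters = [
    ([0.0, 0.0], [[4.0, 0.0], [0.0, 1.0]]),
    ([10.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]),
]
index, dist = nearest_cluster([6.0, 0.0], clusters)
print(index, dist)
```

In this example the point is closer to cluster 1 in Euclidean terms (4 versus 6), yet cluster 0 is nearer under the Mahalanobis distance (3 versus 4) because it is elongated toward the point. This is the sense in which such a distance is directionally dependent.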

Chapter 2

Learning an Object Detection

From the document: MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY and CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES A.I.T.R. No. January, (Page 34-38)