

5.1 Improving Detection Rates by Combining Classifiers

One way of building more accurate and reliable decision procedures is to combine multiple classifiers for the given task. Individual classifiers tend to have their own advantages and deficiencies. A collective decision made by a classifier ensemble should therefore be more reliable than one made by any individual classifier alone.

Network boosting [79] [30] is a classifier combination technique, based on the probably approximately correct (PAC) learning framework [98], that converts a learning architecture with a finite error rate into an ensemble of learning machines with an arbitrarily low error rate. The idea works as follows:

Suppose we have three classifiers with almost identical error rates that produce independent classification results on a given task. One can obtain a lower error rate decision procedure by passing each new pattern through all three classifiers and using the following output voting scheme: if the first two classifiers agree, then return the common label; otherwise, return the label given by the third classifier. Notice that what we have just described is simply a majority voting scheme on the three classifier output labels. In theory, however, one can repeatedly embed this voting process to obtain ensembles of 3² = 9 classifiers, 3³ = 27 classifiers, 3⁴ = 81 classifiers and so on, with decreasingly small error rates.
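To make the voting rule concrete, here is a minimal sketch in Python, assuming each classifier is simply a callable that maps an input pattern to a class label; the function names and the nested-list representation of an ensemble are illustrative, not part of the original formulation:

    # Voting rule: if the first two classifiers agree, return their label;
    # otherwise, return the label given by the third classifier.
    def vote3(classifiers, pattern):
        a, b, c = (clf(pattern) for clf in classifiers)
        return a if a == b else c

    # Recursively embedded voting: `tree` is either a single callable or a list
    # of three sub-trees, giving ensembles of 3, 3**2 = 9, 3**3 = 27, ... classifiers.
    def boosted_label(tree, pattern):
        if callable(tree):
            return tree(pattern)
        labels = [boosted_label(sub, pattern) for sub in tree]
        return labels[0] if labels[0] == labels[1] else labels[2]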

Recently, Drucker et al. [30] have applied this classifier combination idea to optical character recognition (OCR) problems. They constructed an ensemble of three OCR neural networks and reported a significant overall improvement in character recognition results. We believe one can similarly combine multiple pattern recognizers like ours to build arbitrarily robust object detection systems.

5.1.1 Network Boosting

We now derive the boosting procedure, which arises from a variant of the PAC learning framework known as the weak learning model [50].

Figure 5-1: Top: A 3-classifier ensemble has a lower error rate ε₃ than a single classifier over a wide range of individual classifier error rates 0 < ε < 0.5. Note: a perfect classifier has ε = 0 while random guessing has ε = 0.5; boosting does not improve performance in these two cases. Bottom: The actual difference in error rates between a single classifier and a 3-classifier ensemble over the same range of individual classifier error rates.

Let the three classifiers in the above example have independent error rates ε < 0.5. Using the output voting scheme described earlier, the resulting classifier ensemble makes a mistake only when:

1. Both the first and second classifiers mislabel the input pattern. This occurs with probability ε².

2. The first and second classifiers disagree (with probability 2ε(1 − ε)), and the third classifier wrongly labels the input pattern (with probability ε). This occurs with joint probability 2ε²(1 − ε).

Adding the probabilities for the two cases above, we arrive at the following error rate for the classifier ensemble: ε₃ = 3ε² − 2ε³, which can be significantly smaller than the individual classifier error rate ε for 0 < ε < 0.5.
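A short numeric check of this expression, under the same independence assumption, illustrates how quickly the ensemble error falls below the individual error rate:

    # Ensemble error rate eps_3 = 3*eps**2 - 2*eps**3 for three independent
    # classifiers, each with individual error rate eps.
    def ensemble_error(eps):
        return 3 * eps**2 - 2 * eps**3

    for eps in (0.4, 0.3, 0.2, 0.1):
        print(f"single: {eps:.2f}   ensemble: {ensemble_error(eps):.4f}")
    # single: 0.40   ensemble: 0.3520
    # single: 0.30   ensemble: 0.2160
    # single: 0.20   ensemble: 0.1040
    # single: 0.10   ensemble: 0.0280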

Figure 5-1 plots the difference in performance between a 3-classifier ensemble and a single classifier over a range of individual classifier error rates 0 < ε < 0.5. The graphs confirm that, in order to construct an arbitrarily robust system with boosting, the learner only has to produce a sufficiently large number of independent classifiers whose individual error rates are slightly better than random guessing (ε = 0.5).

5.1.2 Maintaining Independence between Classifiers

One major assumption in weak learning and boosting is that all classifiers being combined make independent errors on new input patterns. When two or more classifiers in an ensemble make correlated predictions, one can show that boosting reduces the combined error rate less significantly. Furthermore, if the first two classifiers produce fully correlated results, both classifiers will always assign identical labels to new patterns, and boosting does not improve the ensemble error rate at all. Hence, when applying boosting to any non-trivial learning problem, one should always try to maintain a reasonable degree of independence between individual classifiers.
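The effect of correlated errors can be illustrated with a small Monte Carlo sketch. The correlation model below (with probability rho all three classifiers share a single error outcome, otherwise they err independently with rate eps) is purely illustrative and is not taken from the source:

    import random

    # Simulated ensemble error under an illustrative correlation model.
    def ensemble_error_sim(eps, rho, trials=100_000):
        wrong = 0
        for _ in range(trials):
            if random.random() < rho:
                errs = [random.random() < eps] * 3              # fully correlated draw
            else:
                errs = [random.random() < eps for _ in range(3)]  # independent draws
            # voting rule: follow the first two if they agree, else the third
            wrong += errs[2] if errs[0] != errs[1] else errs[0]
        return wrong / trials

    print(ensemble_error_sim(0.2, 0.0))   # close to 3(0.2)^2 - 2(0.2)^3 = 0.104
    print(ensemble_error_sim(0.2, 1.0))   # close to 0.2: no improvement at all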

Drucker et al. [30] have proposed a network ensemble training procedure that helps maintain output independence between individual classifiers. Their idea is to train each network classifier with a different example distribution (a code sketch of the filtering steps appears after the list below):

1. Assume that the learner has access to an "infinite" stream of training patterns.

2. Draw from the stream a requisite number of data samples and train the first network classifier.

3. Use the first network classifier together with the "infinite" example stream to generate a second training set as follows: flip a fair coin. If the coin is heads, draw examples from the data stream until the first network misclassifies a pattern and add this pattern to the second training set. If the coin is tails, draw examples until the first network correctly classifies a pattern and add the pattern to the second training set. When the second data set is sufficiently large, use it to train the second network classifier.

4. Generate a third training set using the first two classifiers as follows: propagate examples from the "infinite" data stream through the first two classifiers. Add each new pattern to the third training set only if the two classifiers disagree on the output label. When the third data set is sufficiently large, use it to train the third network classifier.
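The sketch below restates the filtering in steps 3 and 4, assuming the "infinite" example stream is modelled as a Python generator of (pattern, label) pairs and that net1 and net2 are already-trained classifier callables; all names are illustrative rather than the authors' implementation:

    import random

    # Step 3: coin-flip filtering, so that roughly half of the kept examples
    # are ones the first network gets wrong and half are ones it gets right.
    def second_training_set(stream, net1, size):
        kept = []
        while len(kept) < size:
            want_error = random.random() < 0.5        # fair coin flip
            for pattern, label in stream:             # keeps consuming the shared stream
                if (net1(pattern) != label) == want_error:
                    kept.append((pattern, label))
                    break
        return kept

    # Step 4: keep only examples on which the first two classifiers disagree.
    def third_training_set(stream, net1, net2, size):
        kept = []
        for pattern, label in stream:
            if net1(pattern) != net2(pattern):
                kept.append((pattern, label))
                if len(kept) == size:
                    break
        return kept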

Drucker et al. have successfully applied the above procedure to train independent OCR networks for their boosting experiment. Clearly, the procedure makes a very strong assumption that the learner has access to an "infinite" data stream. For an OCR scenario, this assumption might still be reasonable because there are very large digit and character databases available that can serve as an almost "infinite" example source. Recently, Simard et al. [83] have also derived some useful transformations for artificially generating additional digit and character patterns from existing database examples.

In other object and pattern detection problems, the learner may not have access to much data, so the training paradigm outlined above may not be feasible. This is because, after training the first classifier, there may not be enough remaining examples to generate independent training sets for the second and third classifiers, unless the individual classifier error rates are all very high. We suggest three possible ways of training several independent classifiers with limited data (a sketch of the first option follows the list):

1. Choose a classifier architecture with multiple plausible solutions for a given data set. A highly connected multi-layer perceptron net can be one such architecture. When trained with a backpropagation algorithm, a multi-layer perceptron net initialized with a different set of parameters can converge onto a different set of weights (i.e., a different solution). Assuming that nets with different weights make relatively independent classification errors, one can still perform boosting using several multi-layer perceptron nets trained on the same data set.

2. Build each classifier using a different machine architecture. Because each machine architecture implements a different approximation function class, one can expect each classifier to generalize differently between existing training examples, and hence make different classification mistakes.

3. Train each classifier on a different set of input features. Feature measurements help enhance certain image properties that can act as useful cues for distinguishing between pattern classes. Classifiers that use different feature measurements thus rely on different image attributes to identify patterns, and should therefore make different classification mistakes. In Chapters 2 and 3, our face and eye detection systems perform classification in a view-based feature space. To improve detection results with boosting, one can train and combine additional face and eye pattern recognizers that operate on different input image representations, such as a gradient-based map or other filtered versions of the input pattern.
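As an illustration of the first option, the sketch below trains three multi-layer perceptron nets on the same data from different random initializations and combines them with the voting rule of Section 5.1.1. It assumes scikit-learn and NumPy arrays X_train, y_train, and X_test; the hyperparameters and names are placeholders, not values from the source:

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    # Train three MLPs that differ only in their random initialization.
    def train_ensemble(X_train, y_train, seeds=(0, 1, 2)):
        nets = []
        for seed in seeds:
            net = MLPClassifier(hidden_layer_sizes=(32,), random_state=seed,
                                max_iter=500)
            nets.append(net.fit(X_train, y_train))
        return nets

    # Voting rule: agree-on-first-two, else defer to the third classifier.
    def vote(nets, X):
        a, b, c = (net.predict(X) for net in nets)
        return np.where(a == b, a, c)

    # Usage (hypothetical data): labels = vote(train_ensemble(X_train, y_train), X_test)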

5.2 Building Hierarchical Architectures from Simple