Learning by classifier combination: boosting


Frédéric Precioso

Binary classification

•  The objective is to detect the regions of the data that correspond to a given category.

•  This defines a binary classification problem: each region either corresponds to the category or does not.

Context: Example on Corel 1000

Rose / Other

Classification

How to find the boundary between classes (or between categories):

•  Rose (flower)

•  Not rose (other)

Observations:

•  It is easy to produce decision rules that are often right

   -  If the histogram contains blue, then predict "other"

•  It is hard to find rules that are right all the time

•  We focus here on one solution among many others: boosting

Ideas of boosting: horse racing

How to win?

•  Ask professional gamblers

•  Let us assume:

   -  That professional gamblers cannot provide one single decision rule that is both simple and accurate

   -  But that, faced with several races, they can always provide decision rules that are a little better than random

•  Can we become rich?

Idea

•  Ask the expert for heuristics

•  Gather the set of cases for which these heuristics fail (difficult cases)

•  Ask the expert again to provide heuristics for the difficult cases

•  And so on…

•  Combine these heuristics

•  Here "expert" stands for the weak learner

Questions

•  How to choose the races (i.e. the training examples) at each step?

   -  Focus on the most "difficult" races (examples): the ones on which the previous heuristics perform worst

•  How to merge the heuristics (decision rules) into one single decision rule?

   -  Take a weighted vote of all decision rules

Boosting

•  Boosting = a general method to convert several poor decision rules into one very powerful decision rule

•  More precisely:

   -  Assume a weak learner that can always provide a decision rule with an error ≤ 1/2 − γ, for some γ > 0

   -  A boosting algorithm can then (theoretically) build a global decision rule with an error rate ≤ ε

Boosting: a little history

Kearns asked: « Can a weak learner (that is to say, one only slightly better than random) be made as accurate as desired? »

Schapire answered in 1990: « Yes! », and exhibited the first elementary boosting algorithm, which shows that a weak binary classifier can always be improved by training it on 3 well-chosen subsamples.

The choice of the weak learner does not matter (a decision tree, a Bayesian classifier, an SVM, etc.), but the 3 training subsamples have to be chosen with respect to its performance.

Boosting: elementary boosting

1.  We train a first decision rule h₁ on a subsample S₁ of size m₁ < m (m being the size of the full training set S). We then classify all of S − S₁ with h₁.

2.  We then train a second decision rule h₂ on a subsample S₂ of size m₂, drawn from S − S₁, half of whose examples were misclassified by h₁.

3.  We finally train a third decision rule h₃ on m₃ examples picked from S − S₁ − S₂ on which h₁ and h₂ disagree.

4.  The global decision rule is then a majority vote of the 3 decision rules: H = majority_vote(h₁, h₂, h₃)
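The final majority vote can be illustrated with a minimal Python sketch. The data and the three rules below are hand-picked, hypothetical 1D threshold rules; they are not produced by the subsampling procedure described above. The sketch only shows that a vote of three rules, each wrong on a different part of the data, can be right everywhere.

```python
def majority_vote(h1, h2, h3):
    """Combine three {-1, +1}-valued decision rules by majority vote."""
    def H(x):
        # The sum of three values in {-1, +1} is odd, so it is never zero.
        return 1 if h1(x) + h2(x) + h3(x) > 0 else -1
    return H

# Hypothetical toy data: the label is +1 for 3 <= x <= 7 and -1 otherwise.
xs = list(range(11))
ys = [1 if 3 <= x <= 7 else -1 for x in xs]

# Three hand-picked weak rules, each better than random on this data.
h1 = lambda x: 1 if x > 2 else -1   # wrong on x = 8, 9, 10
h2 = lambda x: 1 if x < 8 else -1   # wrong on x = 0, 1, 2
h3 = lambda x: -1                   # wrong on x = 3, ..., 7

def error_rate(h):
    """Fraction of the toy examples misclassified by h."""
    return sum(h(x) != y for x, y in zip(xs, ys)) / len(xs)

H = majority_vote(h1, h2, h3)
print([round(error_rate(h), 2) for h in (h1, h2, h3)], error_rate(H))
```

Each individual rule misclassifies some examples, while the vote is only wrong where at least two rules are wrong at the same time.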

Boosting: Efficiency

A theorem of Schapire on « weak learning power » proves that H is more accurate than a global decision rule learnt directly on S.

The next slides give a geometric example of boosting following this elementary algorithm.

The weak learner is a hyperplane (here, a line in the plane).

Boosting: Start

Left: training set S and subset S₁ (circled in red or blue).

Right: subset S₁ and the line C₁ learnt on this set.

Boosting: to be continued

Left: training set S − S₁ and the line C₁ learnt on S₁.

Right: subset S₂, included in S − S₁, made of the most informative examples for C₁ (circled in red or blue).

Boosting: …the end

Left: the line C₂ learnt on S₂.

Center: S₃ = S − S₁ − S₂ and the line C₃ learnt on S₃.

Right: S and the combination of the 3 lines.

Probabilistic boosting

3 main ideas to generalize towards probabilistic boosting:

1.  Use a set of specialized experts and ask them to vote to reach a decision.

2.  Weight the votes adaptively, by multiplicative update.

3.  Modify the distribution of examples used to train each expert, iteratively increasing the weights of the examples misclassified at the previous iteration.

Boosting: Basics

Parameter: a weak learner h

Input: a training set S

Initialization: all examples have the same weight

For t = 1,…,T do

    train h on S with the current distribution

    increase the weights of the examples misclassified at the current iteration

done

Output H: a weighted majority vote of all the h

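Boosting: Basics

[Diagram: the training set X, weighted by the successive distributions D₀, D₁, D₂, …, D_T, yields the weak classifiers h₀, h₁, h₂, …, h_T.]

$$H_{final}(x) = \mathrm{sgn}\left[\sum_{t=0}^{T} \alpha_t\, h_t(x)\right]$$

•  How to update D_t to D_{t+1}?

•  How to compute the weight α_t?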

Probabilistic boosting: AdaBoost

The standard algorithm is AdaBoost (Adaptive Boosting).

The main idea is to define, at each iteration 1 ≤ t ≤ T, a new distribution of a priori probabilities over the training samples, based on the classification results of the previous iteration.

The weight at iteration t of the example (x_i, y_i) of index i is denoted D_t(i).

Initially, all examples get the same weight; at each iteration, the weights of the misclassified examples are increased. Thus the learner focuses on the difficult examples of the training set.

Boosting: the algorithm

•  A training set: S = {(x_1, y_1), …, (x_m, y_m)}

•  y_i ∈ {−1, +1}: label (annotation) of example x_i ∈ S

•  For t = 0, …, T:

   -  Build the distribution D_t on {1, …, m}

   -  Find the weak decision ("heuristic") h_t : S → {−1, +1} with a small error ε_t on D_t:

$$\varepsilon_t = \Pr_{D_t}[h_t(x_i) \neq y_i] = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)$$

•  Find the final decision h_final

The AdaBoost Algorithm

Given: (x_1, y_1), …, (x_m, y_m) where x_i ∈ X, y_i ∈ {−1, +1}

Initialization: D_1(i) = 1/m, i = 1, …, m

For t = 1, …, T:

•  Find the classifier h_t : X → {−1, +1} which minimizes the error with respect to D_t, i.e.,

$$h_t = \arg\min_{h_j} \varepsilon_j, \quad \text{where } \varepsilon_j = \sum_{i=1}^{m} D_t(i)\,[y_i \neq h_j(x_i)]$$

•  Weight the classifier:

$$\alpha_t = \frac{1}{2} \ln \frac{1 - \varepsilon_t}{\varepsilon_t}$$

•  Update the distribution:

$$D_{t+1}(i) = \frac{D_t(i)\, \exp[-\alpha_t\, y_i\, h_t(x_i)]}{Z_t}, \quad Z_t \text{ is for normalization}$$

Output the final classifier:

$$H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t\, h_t(x)\right)$$

What goal does AdaBoost want to reach?

The classifier weight α_t and the distribution update are goal dependent.
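The AdaBoost loop (find the best weak classifier, weight it, update the distribution, vote) can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not a reference implementation: the weak learners are hypothetical 1D threshold rules ("stumps") drawn from a fixed candidate list, the data set is a made-up toy example, and the loop stops early if the best rule is perfect or no better than random.

```python
import math

def stump(threshold, polarity):
    """Hypothetical weak rule on a 1D input: polarity if x > threshold, else -polarity."""
    return lambda x: polarity if x > threshold else -polarity

def adaboost(xs, ys, candidates, T):
    m = len(xs)
    D = [1.0 / m] * m                       # D_1(i) = 1/m
    ensemble = []                           # list of (alpha_t, h_t)
    for _ in range(T):
        def err(h):
            # Weighted error of h under the current distribution D_t.
            return sum(d for d, x, y in zip(D, xs, ys) if h(x) != y)
        h_t = min(candidates, key=err)      # classifier with minimum weighted error
        eps = err(h_t)
        if eps == 0 or eps >= 0.5:          # perfect or useless rule: stop (assumption)
            break
        alpha = 0.5 * math.log((1 - eps) / eps)
        # Multiplicative update, then normalization by Z_t.
        D = [d * math.exp(-alpha * y * h_t(x)) for d, x, y in zip(D, xs, ys)]
        Z = sum(D)
        D = [d / Z for d in D]
        ensemble.append((alpha, h_t))
    return ensemble

def predict(ensemble, x):
    """Sign of the weighted vote of the weak classifiers."""
    score = sum(alpha * h(x) for alpha, h in ensemble)
    return 1 if score > 0 else -1

# Made-up toy data: no single threshold rule classifies it perfectly.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
candidates = [stump(t + 0.5, p) for t in range(-1, 10) for p in (1, -1)]
ensemble = adaboost(xs, ys, candidates, T=3)
print([predict(ensemble, x) for x in xs])
```

On this toy set the best single stump has a weighted error of 0.3 in the first round, whereas the weighted vote of three stumps classifies every training point correctly.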

Boosting: global classifier

The boosting algorithm builds the final decision as a sum over a basis of functions {h_t}:

$$H_{final}(x) = \mathrm{sgn}\left(\sum_{t} \alpha_t\, h_t(x)\right)$$

This is a classic construction in machine learning algorithms (such as SVMs, Bayesian classifiers, neural networks, etc.).

« Toy » example

A very popular « toy » example, 10 points in 2D with initial distribution D₁, illustrates the boosting scheme (these points can be seen as vectors ∈ R²).

D₁: 0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1

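Step 1

$$\alpha_1 = \frac{1}{2}\ln\left(\frac{1-\varepsilon_1}{\varepsilon_1}\right) = \frac{1}{2}\ln\left(\frac{1-0.3}{0.3}\right) = \frac{1}{2}\ln\left(\frac{0.7}{0.3}\right) = \frac{1}{2}\ln\left(\frac{7}{3}\right)$$

$$D'_1 = \begin{cases} 0.1\, e^{\alpha_1} = 0.1 \sqrt{7/3} \approx 0.152 & \text{(misclassified points)} \\ 0.1\, e^{-\alpha_1} = 0.1 \sqrt{3/7} \approx 0.065 & \text{(correctly classified points)} \end{cases}$$

$$Z_1 = 3 \times 0.152 + 7 \times 0.065 = 0.911$$

D₂ = D'₁/Z₁: 0.17  0.17  0.17  0.07  0.07  0.07  0.07  0.07  0.07  0.07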

Step 2

$$D'_2 = \begin{cases} 0.17\, e^{-\alpha_2} \approx 0.0876 \\ 0.07\, e^{-\alpha_2} \approx 0.036 \\ 0.07\, e^{\alpha_2} \approx 0.1357 \end{cases}$$

$$Z_2 = 3 \times 0.0876 + 4 \times 0.036 + 3 \times 0.1357 = 0.814$$

D₃ = D'₂/Z₂: 0.11  0.11  0.11  0.045  0.045  0.045  0.045  0.17  0.17  0.17
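The arithmetic of the first two steps can be re-derived with a few lines of Python. This is a sketch under one assumption read off the weight rows: the three points with weight 0.17 in D₂ are the ones misclassified at step 1, and the three points with weight 0.17 in D₃ are the ones misclassified at step 2. The helper name `reweight` is hypothetical. Because the code keeps full floating-point precision, the normalization constants differ slightly from the rounded values 0.911 and 0.814 above.

```python
import math

def reweight(D, wrong):
    """One AdaBoost reweighting step; `wrong` is the set of misclassified indices."""
    eps = sum(D[i] for i in wrong)                      # weighted error
    alpha = 0.5 * math.log((1 - eps) / eps)             # classifier weight
    unnorm = [d * math.exp(alpha if i in wrong else -alpha)
              for i, d in enumerate(D)]                 # D'_t
    Z = sum(unnorm)                                     # normalization constant
    return alpha, Z, [u / Z for u in unnorm]

D1 = [0.1] * 10
alpha1, Z1, D2 = reweight(D1, {0, 1, 2})   # step 1: first three points misclassified
alpha2, Z2, D3 = reweight(D2, {7, 8, 9})   # step 2: last three points misclassified

print([round(d, 2) for d in D2])
print([round(d, 3) for d in D3])
```

The rounded distributions agree with the rows D₂ and D₃ given above.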

Step 3

Final decision

AdaBoost Hinge Loss

Let us write the error ε_t of h_t as 1/2 − γ_t, where γ_t measures the improvement brought by h_t over the baseline error of 1/2.

Freund and Schapire have shown that the hinge loss (here, the fraction of misclassified examples of the training set S) of the final decision H is bounded by:

$$\prod_t \left[2\sqrt{\varepsilon_t (1 - \varepsilon_t)}\right] = \prod_t \sqrt{1 - 4\gamma_t^2} \le \exp\left(-2 \sum_t \gamma_t^2\right)$$

Hence, if each weak learner is slightly better than random (γ_t > 0), then the hinge loss decreases exponentially with t.

Proof

Let

$$f(x) = \sum_t \alpha_t\, h_t(x)$$

We have:

$$D_{final}(i) = \frac{1}{m} \prod_t \frac{\exp(-\alpha_t\, y_i\, h_t(x_i))}{Z_t} = \frac{1}{m} \cdot \frac{\exp\left(-y_i \sum_t \alpha_t\, h_t(x_i)\right)}{\prod_t Z_t} = \frac{1}{m} \cdot \frac{\exp(-y_i f(x_i))}{\prod_t Z_t}$$

The hinge loss is less than $\prod_t Z_t$:

$$\text{Hinge Loss} = \frac{1}{m} \sum_{i:\, H(x_i) \neq y_i} 1 \le \frac{1}{m} \sum_{i:\, y_i f(x_i) \le 0} 1 \le \frac{1}{m} \sum_i \exp(-y_i f(x_i)) = \sum_i D_{final}(i) \prod_t Z_t = \prod_t Z_t$$

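Proof

And

$$Z_t = \sum_i D_t(i)\, \exp(-\alpha_t\, y_i\, h_t(x_i)) = \sum_{i:\, y_i = h_t(x_i)} D_t(i)\, e^{-\alpha_t} + \sum_{i:\, y_i \neq h_t(x_i)} D_t(i)\, e^{\alpha_t} = (1 - \varepsilon_t)\, e^{-\alpha_t} + \varepsilon_t\, e^{\alpha_t} = 2\sqrt{\varepsilon_t (1 - \varepsilon_t)}$$

Then the hinge loss is bounded by:

$$\prod_t Z_t = \prod_t 2\sqrt{\varepsilon_t (1 - \varepsilon_t)} = \prod_t \sqrt{1 - 4\gamma_t^2}$$

The learning error decreases exponentially with t.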
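The identity for Z_t and the resulting exponential decrease of the bound can be checked numerically. The sketch below assumes the usual AdaBoost weight α_t = ½ ln((1 − ε_t)/ε_t); the function names `z_value` and `training_error_bound` are hypothetical and only serve this illustration.

```python
import math

def z_value(eps):
    """Z_t = (1 - eps) * exp(-alpha) + eps * exp(alpha), with the AdaBoost alpha."""
    alpha = 0.5 * math.log((1 - eps) / eps)
    return (1 - eps) * math.exp(-alpha) + eps * math.exp(alpha)

def training_error_bound(epsilons):
    """Product over t of 2 * sqrt(eps_t * (1 - eps_t))."""
    bound = 1.0
    for eps in epsilons:
        bound *= 2 * math.sqrt(eps * (1 - eps))
    return bound

# A weak learner that is only slightly better than random: eps = 0.4, gamma = 0.1.
for T in (10, 50, 200):
    print(T, training_error_bound([0.4] * T), math.exp(-2 * T * 0.1 ** 2))
```

Even with an error of 0.4 at every round, the bound shrinks geometrically with the number of rounds and stays below exp(−2 Σ γ_t²).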

Error of generalization

The generalization error of H can be bounded by:

$$E_{Real}(H_T) = E_{HingeLoss}(H_T) + O\left(\sqrt{\frac{T \cdot d}{m}}\right)$$

where T is the number of boosting iterations, m the number of examples, and d the dimension (VC-dimension) of the space of weak learners (« weak learner complexity »).

When T becomes large, the 2nd term becomes large too, so one may expect the learner to overfit (i.e. to over-weight ambiguous examples).

Error of generalization

However, observations show that this occurs very rarely. On the contrary, the real loss tends to keep decreasing even after the hinge loss has vanished.

Classic behavior:

-  Overfitting

-  Occam's razor principle

Boosting real behavior:

-  The generalization error does not increase…

Margin Theory

The generalization error does not increase… Why? Relation with large margin methods:

•  The hinge loss only evaluates whether the classification labels are correct

•  The confidence in the result must also be taken into account

Margin Theory

•  Margin: confidence in the prediction

$$\mathrm{margin}(x, y) = \frac{y \sum_{t=1}^{T} \alpha_t\, h_t(x)}{\sum_{t=1}^{T} \alpha_t}$$

With high probability, $\forall\, \theta > 0$:

$$\text{generalisation error} \le \hat{P}[\mathrm{margin} \le \theta] + O\left(\frac{\sqrt{d/m}}{\theta}\right)$$

-  m is the number of training samples

-  d is the « complexity » of the weak learners

•  The bound is independent of T, the number of boosting iterations

•  Boosting tends to maximise the margins on the training set, since it focuses on the samples that are difficult to classify (the samples whose margin is the smallest)
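The margin of an example and the empirical term of the bound (the fraction of training examples with margin at most θ) can be computed as in the sketch below. The ensemble is a hypothetical one: three 1D threshold rules with hand-set, rounded weights, on a made-up data set.

```python
def margin(ensemble, x, y):
    """Normalized margin y * sum(alpha_t * h_t(x)) / sum(alpha_t), in [-1, 1]."""
    total = sum(alpha for alpha, _ in ensemble)
    return y * sum(alpha * h(x) for alpha, h in ensemble) / total

def margin_cdf(ensemble, xs, ys, theta):
    """Fraction of examples whose margin is at most theta."""
    return sum(margin(ensemble, x, y) <= theta for x, y in zip(xs, ys)) / len(xs)

# Hypothetical ensemble of (alpha_t, h_t) pairs with rounded weights.
ensemble = [
    (0.42, lambda x: 1 if x < 2.5 else -1),
    (0.65, lambda x: 1 if x < 8.5 else -1),
    (0.75, lambda x: 1 if x > 5.5 else -1),
]
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

print([round(margin(ensemble, x, y), 3) for x, y in zip(xs, ys)])
print(margin_cdf(ensemble, xs, ys, 0.2))
```

A positive margin means the example is correctly classified; the closer it is to 1, the more the weak classifiers agree on it.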

Advantages of AdaBoost

•  (Very) fast

•  Simple and easy to code

•  A single parameter to set: the number of boosting iterations (T)

•  Can be used with any machine learning algorithm

•  Rarely overfits in practice

•  Can be adapted to multi-class problems, where y_i ∈ {1, …, c}, and to multi-label problems

•  Helps to find outliers

Drawbacks of AdaBoost

•  Performance depends on the training data and on the weak learners

•  AdaBoost can fail if

   -  The weak learner is too complex: low margins → overfitting

   -  The weak learner is too weak: γ_t → 0 too fast → underfitting

•  AdaBoost seems especially sensitive to noise

Applications

•  Face detection, tracking and recognition

•  Spam, Zip Code OCR

•  Text classification: Schapire and Singer (used stumps with normalized term frequency and multi-class encoding)

•  OCR: Schwenk and Bengio (neural networks)

•  Natural Language Processing: Collins; Haruno, Shirai and Ooyama

•  Image retrieval: Thieu and Viola

•  Medical diagnosis: Merle et al.

•  Fraud detection: Rätsch & Müller 2001

•  Drug discovery: Rätsch, Demiriz, Bennett 2002

•  Electric power monitoring: Onoda, Rätsch & Müller 2000

Boosting: Applet & demo

http://cse.ucsd.edu/~yfreund/adaboost/index.html

http://vision.ucla.edu/~vedaldi/code/snippets/assets/adademo/adademo.mov

http://morph.cs.st-andrews.ac.uk/fof/haarDemo/index.html

Background

•  [Valiant ʼ84]: introduced the theoretical PAC model for studying machine learning

•  [Kearns & Valiant ʼ88]: open problem of finding a boosting algorithm

•  [Schapire ʼ89], [Freund ʼ90]: first polynomial-time boosting algorithms

•  [Drucker, Schapire & Simard ʼ92]: first experiments using boosting

Background (cont.)

•  [Freund & Schapire ʼ95]

   -  Provided the AdaBoost algorithm

   -  Bigger advantage in practice than previous boosting algorithms

•  Experiments using AdaBoost:

   [Drucker & Cortes ʼ95] [Schapire & Singer ʼ98]

   [Jackson & Craven ʼ96] [Maclin & Opitz ʼ97]

   [Freund & Schapire ʼ96] [Bauer & Kohavi ʼ97]

   [Quinlan ʼ96] [Schwenk & Bengio ʼ98]

   [Breiman ʼ96] [Dietterich ʼ98]

•  Further developments on theory & algorithms:

   [Schapire, Freund, Bartlett & Lee ʼ97] [Schapire & Singer ʼ98]

   [Breiman ʼ97] [Mason, Bartlett & Baxter ʼ98]

   [Grove & Schuurmans ʼ98] [Friedman, Hastie & Tibshirani ʼ98]

Variations

•  Multi-class AdaBoost

   -  AdaBoost.M1: same as standard but y ∈ {1,…,k}

   -  AdaBoost.M2: introduces a confidence degree

•  Solving overfitting on noisy data:

   -  Weight Decay (1998): introduces slack variables regularizing the error

   -  GentleBoost (1998)

   -  LogitBoost (2000): additive logistic regression model (statistical point of view)

   -  Regularized AdaBoost (2000): « soft » margins

   -  BrownBoost (2001): removes examples that are misclassified many times, considered as « noisy »

   -  WeightBoost (2003): weights in the final decision depend on the input

•  Reducing the number of features (or weak learners):

   -  FloatBoost (2003): removes the worst weak learner after each iteration

   -  JointBoost (2004): multi-class, jointly trains N binary classifiers sharing the same features

1st example: Face detection

Demo

The Task of Face Detection

Many slides adapted from P. Viola

Basic Idea

•  Slide a window across the image and evaluate a face model at every location.

Challenges

•  A sliding window detector must evaluate tens of thousands of location/scale combinations.

•  Faces are rare: 0–10 per image

   -  For computational efficiency, we should try to spend as little time as possible on the non-face windows

   -  A megapixel image has ~10⁶ pixels and a comparable number of candidate face locations

   -  To avoid having a false positive in every image, our false positive rate has to be less than 10⁻⁶

The Viola/Jones Face Detector

•  A seminal approach to real-time object detection

•  Training is slow, but detection is very fast

•  Key ideas

   -  Integral images for fast feature evaluation

   -  Boosting for feature selection

   -  Attentional cascade for fast rejection of non-face windows

P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.

P. Viola and M. Jones. Robust real-time face detection. IJCV 57(2), 2004.

Image Features

Rectangle filters

Feature value = Σ (pixels in white area) − Σ (pixels in black area)

Size of Feature Space

•  How many possible rectangle features are there for a 24x24 detection region?

The three terms below count the rectangle filters of types A+B, C and D respectively, for a total of about 160,000:

$$2\sum_{w=1}^{12}\sum_{h=1}^{24}(24-2w+1)(24-h+1) \;+\; 2\sum_{w=1}^{8}\sum_{h=1}^{24}(24-3w+1)(24-h+1) \;+\; \sum_{w=1}^{12}\sum_{h=1}^{12}(24-2w+1)(24-2h+1) \approx 160{,}000$$

Feature Selection

A 24x24 detection region has about 160,000 possible rectangle features (filter types A+B, C and D).

•  What features are good for face detection?

•  Can we create a good classifier using just a small subset of all possible features?

•  How to select such a subset?

Integral images

•  The integral image computes a value at each pixel (x, y) that is the sum of the pixel values above and to the left of (x, y), inclusive:

$$ii(x, y) = \sum_{x' \le x,\; y' \le y} i(x', y')$$

Computing the Integral Image

•  This can quickly be computed in one pass through the image, using the cumulative row sum s(x, y):

$$s(x, y) = s(x-1, y) + i(x, y)$$

$$ii(x, y) = ii(x, y-1) + s(x, y)$$
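The one-pass computation can be sketched in plain Python with the two recurrences above. The image is assumed to be a list of rows, so the pixel i(x, y) is `img[y][x]`; values outside the image are taken to be zero.

```python
def integral_image(img):
    """Return ii with ii[y][x] = sum of img[y'][x'] for all y' <= y and x' <= x."""
    height, width = len(img), len(img[0])
    ii = [[0] * width for _ in range(height)]
    for y in range(height):
        s = 0                                   # cumulative row sum s(x, y)
        for x in range(width):
            s += img[y][x]                      # s(x, y) = s(x-1, y) + i(x, y)
            above = ii[y - 1][x] if y > 0 else 0
            ii[y][x] = above + s                # ii(x, y) = ii(x, y-1) + s(x, y)
    return ii

print(integral_image([[1, 2, 3],
                      [4, 5, 6]]))
```

Each pixel is visited once, so the cost is linear in the number of pixels.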

Computing Sum within a Rectangle

Let ii_A, ii_B, ii_C and ii_D be the values of the integral image at the four corners of a rectangle, A and D being diagonally opposite. The sum of the pixel values within the rectangle is:

$$\text{sum} = ii_A - ii_B - ii_C + ii_D$$

Only 3 additions are required for any size of rectangle!
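The four-corner lookup can be sketched as follows. To avoid special cases at the image border, this sketch uses an integral image padded with one extra row and column of zeros, which is a convenience chosen here rather than something stated in the slides; the function names are hypothetical.

```python
def padded_integral_image(img):
    """Integral image with an extra zero row and column: ii[y][x] sums img[:y][:x]."""
    height, width = len(img), len(img[0])
    ii = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        s = 0                                   # cumulative sum of the current row
        for x in range(width):
            s += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + s
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of the pixels in the rectangle with top-left corner (x, y), width w, height h."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = padded_integral_image(img)
print(rect_sum(ii, 1, 1, 2, 2))   # sum of 5, 6, 8, 9
```

A rectangle feature is then a difference of two or more such sums, so its cost does not depend on the rectangle size.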

Scaling

•  The integral image enables us to evaluate rectangles of all sizes in constant time.

•  Therefore, no image scaling is necessary.

•  Scale the rectangular features instead!

Boosting

•  Boosting is a classification scheme that works by combining weak learners into a more accurate ensemble classifier

   -  A weak learner need only do better than chance

•  Training consists of multiple boosting rounds

   -  During each boosting round, we select a weak learner that does well on examples that were hard for the previous weak learners

   -  "Hardness" is captured by weights attached to training examples

Y. Freund and R. Schapire, A short introduction to boosting, Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, September, 1999.

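The AdaBoost Algorithm

For t = 1, …, T:

•  Find the classifier h_t : X → {−1, +1} which minimizes the weighted error with respect to D_t, where D_t(i) is the probability distribution of the x_i's at time t:

$$h_t = \arg\min_{h_j} \varepsilon_j, \quad \text{where } \varepsilon_j = \sum_{i=1}^{m} D_t(i)\,[y_i \neq h_j(x_i)]$$

•  Weight the classifier (this choice minimizes the exponential loss):

$$\alpha_t = \frac{1}{2} \ln \frac{1 - \varepsilon_t}{\varepsilon_t}$$

•  Update the distribution (this gives misclassified patterns more chance for learning):

$$D_{t+1}(i) = \frac{D_t(i)\, \exp[-\alpha_t\, y_i\, h_t(x_i)]}{Z_t}, \quad Z_t \text{ is for normalization}$$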


Weak Learners for Face Detection

What base learner is proper for face detection?

$$h_t(x) = \begin{cases} 1 & \text{if } p_t\, f_t(x) > p_t\, \theta_t \\ 0 & \text{otherwise} \end{cases}$$

where x is the window, f_t(x) the value of the rectangle feature, p_t the parity and θ_t the threshold.
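Such a thresholded feature, and the search for its best threshold and parity under example weights, can be sketched as below. The feature values are assumed to be precomputed numbers (one per training window), labels are 1 for faces and 0 for non-faces, and candidate thresholds are taken halfway between consecutive distinct feature values; the function names and the candidate-threshold rule are choices made for this illustration only.

```python
def make_stump(theta, p):
    """Weak classifier on a feature value f: 1 if p * f > p * theta, else 0."""
    return lambda f: 1 if p * f > p * theta else 0

def best_stump(values, labels, weights):
    """Return (weighted error, threshold, parity) of the best stump for one feature."""
    distinct = sorted(set(values))
    # One threshold below all values, then midpoints between consecutive values.
    thetas = [distinct[0] - 1] + [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best = None
    for theta in thetas:
        for p in (1, -1):
            h = make_stump(theta, p)
            err = sum(w for v, y, w in zip(values, labels, weights) if h(v) != y)
            if best is None or err < best[0]:
                best = (err, theta, p)
    return best

# Hypothetical feature values for six training windows (three non-faces, three faces).
values = [1, 2, 3, 10, 11, 12]
labels = [0, 0, 0, 1, 1, 1]
weights = [1 / 6] * 6
print(best_stump(values, labels, weights))
```

Flipping the labels selects the opposite parity with the same threshold, which is the role of p_t in the formula.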

Boosting

•  The training set contains face and non-face examples

   -  Initially, with equal weight

•  For each round of boosting:

   -  Evaluate each rectangle filter on each example

   -  Select the best threshold for each filter

   -  Select the best filter/threshold combination

   -  Reweight the examples

•  Computational complexity of learning: O(MNK)

   -  M rounds, N examples, K features

AdaBoost

•  1990 Schapire: Boosting

•  1996 Freund & Schapire: AdaBoost

   -  Does not require any a priori knowledge

•  2001 Viola & Jones: Face detection

   -  Haar descriptors, integral image, cascades

   -  Robust, fast

•  2002 Lienhart et al.

   -  Extended descriptors, 45 degree rotation

[Figure: Haar descriptors and extended descriptors]

AdaBoost

[Flowchart of the training loop:]

1.  Start: provide the training examples (x_1, y_1), …, (x_n, y_n), with y_i = 1 for positives (P) and y_i = 0 for negatives (N)

2.  Initialize the weights of the examples: P: w_{1,i} = 1/(2m); N: w_{1,i} = 1/(2l)

3.  For t = 1, 2, …, T:

    -  Normalize the weights w_i

    -  Evaluate the error of each classifier

    -  Choose a weak classifier: the classifier h_t which gets the minimum error

    -  Update the weights (separate rules for correctly and incorrectly classified examples)

4.  End: obtain a strong classifier

Features Selected by Boosting

First two features selected by boosting:

This feature combination can yield a 100% detection rate and a 50% false positive rate.

ROC Curve for 200-Feature Classifier

A 200-feature classifier can yield a 95% detection rate and a false positive rate of 1 in 14084.

Not good enough!

To be practical for real applications, the false positive rate must be closer to 1 in 1,000,000.

Attentional Cascade

[Diagram: IMAGE SUB-WINDOW → Classifier 1 → (T) → Classifier 2 → (T) → Classifier 3 → (T) → FACE; an F outcome at any classifier leads to NON-FACE]

•  We start with simple classifiers which reject many of the negative sub-windows while detecting almost all positive sub-windows

•  A positive response from the first classifier triggers the evaluation of a second (more complex) classifier, and so on

•  A negative outcome at any point leads to the immediate rejection of the sub-window

•  Chain classifiers that are progressively more complex and have lower false positive rates

[Figure: ROC curve, % Detection (0 to 100) against % False Pos (0 to 50)]
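The early-rejection logic of the cascade can be sketched in a few lines. In this illustration each stage is simply a function returning True (pass on) or False (reject), and the "window" is a plain number; both are placeholders for real strong classifiers and image sub-windows.

```python
def cascade_classify(stages, window):
    """Run the stages in order; return (is_face, number_of_stages_evaluated)."""
    for evaluated, stage in enumerate(stages, start=1):
        if not stage(window):
            return False, evaluated     # rejected: later stages are never evaluated
    return True, len(stages)

# Placeholder stages, ordered from cheapest to most selective.
stages = [
    lambda w: w > 0,
    lambda w: w > 5,
    lambda w: w % 2 == 0,
]

for w in (-3, 3, 8):
    print(w, cascade_classify(stages, w))
```

Most windows are rejected by the first, cheapest stages, which is where the speed of the detector comes from.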

Detection Rate and False Positive Rate for Chained Classifiers

[Diagram: the same cascade, where stage i has false positive rate f_i and detection rate d_i (f₁, d₁; f₂, d₂; f₃, d₃) and the whole cascade has rates F, D]

•  The detection rate and the false positive rate of the cascade are found by multiplying the respective rates of the individual stages:

$$F = \prod_{i=1}^{K} f_i, \qquad D = \prod_{i=1}^{K} d_i$$

•  A detection rate of 0.9 and a false positive rate on the order of 10⁻⁶ can be achieved by a 10-stage cascade if each stage has a detection rate of 0.99 (0.99¹⁰ ≈ 0.9) and a false positive rate of about 0.30 (0.3¹⁰ ≈ 6×10⁻⁶)
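The two products can be evaluated directly for the 10-stage example. The sketch uses `math.prod` from the standard library (Python 3.8 or later); the function name `cascade_rates` is hypothetical.

```python
import math

def cascade_rates(stage_fp_rates, stage_det_rates):
    """Overall (F, D) of a cascade from the per-stage rates f_i and d_i."""
    return math.prod(stage_fp_rates), math.prod(stage_det_rates)

# Ten identical stages: false positive rate 0.30, detection rate 0.99.
F, D = cascade_rates([0.30] * 10, [0.99] * 10)
print(f"F = {F:.2e}, D = {D:.3f}")
```

The same function shows how quickly the detection rate degrades if individual stages are less careful, for example with d_i = 0.95.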

Training the Cascade

•  Set target detection and false positive rates for each stage

•  Keep adding features to the current stage until its target rates have been met

   -  Need to lower the AdaBoost threshold to maximize detection (as opposed to minimizing total classification error)

   -  Test on a validation set

•  If the overall false positive rate is not low enough, then add another stage

•  Use the false positives from the current stage as the negative training examples for the next stage

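AdaBoost (6) – Cascade Principle

[Diagram: a chain of classifiers AdaBoost 1, AdaBoost 2, …, AdaBoost N. After stage 1, 99% of the faces and 30% of the non-faces are kept (70% of the non-faces are rejected). After stage 2, 98% of the faces and 9% of the non-faces are kept (a further 21% of the non-faces are rejected). After stage N, 90% of the faces and 0.00006% of the non-faces are kept.]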

Training the Cascade

[Flowchart of the cascade training loop:]

1.  Begin: provide the training examples (x_1, y_1), …, (x_n, y_n), with y_i = 1 for positives (P) and y_i = 0 for negatives (N)

2.  Define the maximum f per iteration, the minimum d per iteration, and the global F_target

3.  Initialize the global rates F₀ = 1; D₀ = 1

4.  While F_i > F_target, build the next classifier of the attentional cascade:

    -  While F_i > f × F_{i−1}: use AdaBoost to choose one more weak classifier (next iteration), so as to obtain a strong classifier

    -  Update the threshold so that D_i > d × D_{i−1}

    -  Decrease the set of negative examples

5.  End: obtain a cascade of strong classifiers

ROC Curves: Cascaded Classifier vs. Monolithic Classifier

•  There is little difference between the two in terms of accuracy.

•  There is a big difference in terms of speed.

   -  The cascaded classifier is nearly 10 times faster, since its first stage throws out most non-faces so that they are never evaluated by subsequent stages.

The Implemented System

•  Training data

   -  5000 faces: all frontal, rescaled to 24x24 pixels

   -  300 million non-faces: 9500 non-face images

   -  Faces are normalized: scale, translation

•  Many variations

   -  Across individuals

   -  Illumination

   -  Pose

Structure of the Detector Cascade

•  Combining successively more complex classifiers in cascade

   -  38 stages

   -  A total of 6060 features

[Diagram: All Sub-Windows → stages 1, 2, 3, 4, 5, 6, 7, 8, …, 38 → Face; a T outcome passes the sub-window to the next stage, an F outcome sends it to Reject Sub-Window]

Annotations on the successive stages:

-  2 features, reject 50% of non-faces, detect 100% of faces

-  10 features, reject 80% of non-faces, detect 100% of faces

-  25 features

-  50 features

-  Then: chosen by the algorithm

Speed of the Final Detector

•  On a 700 MHz Pentium III processor, the face detector can process a 384 × 288 pixel image in about 0.067 seconds

   -  15 Hz

   -  15 times faster than the previous detector of comparable accuracy (Rowley et al., 1998)

•  An average of 8 features is evaluated per window on the test set

Image Processing

•  Training: all example sub-windows were variance normalized to minimize the effect of different lighting conditions

•  Detection: variance normalized as well

$$\sigma^2 = \frac{1}{N} \sum x^2 - m^2$$

-  The mean m can be computed using the integral image

-  The sum of squares Σ x² can be computed using the integral image of the squared image
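The variance of any window can be obtained from two integral images, one of the image and one of the squared image. The sketch below assumes zero-padded integral images and the variance written as the mean of the squares minus the square of the mean; the function names are hypothetical.

```python
def padded_integral_image(img):
    """Integral image with an extra zero row and column."""
    height, width = len(img), len(img[0])
    ii = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        s = 0                                   # cumulative sum of the current row
        for x in range(width):
            s += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + s
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum over the rectangle with top-left corner (x, y), width w, height h."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

def window_variance(ii, ii_sq, x, y, w, h):
    """Variance of the window: mean of squares minus squared mean."""
    n = w * h
    mean = rect_sum(ii, x, y, w, h) / n
    return rect_sum(ii_sq, x, y, w, h) / n - mean * mean

img = [[1, 2],
       [3, 4]]
ii = padded_integral_image(img)
ii_sq = padded_integral_image([[v * v for v in row] for row in img])
print(window_variance(ii, ii_sq, 0, 0, 2, 2))
```

Both integral images are computed once per image, after which the variance of every scanned window costs a constant number of lookups.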

Scanning the Detector

•  Scaling is achieved by scaling the detector itself, rather than scaling the image

•  Good detection results were obtained for a scaling factor of 1.25

•  The detector is scanned across locations

   -  Subsequent locations are obtained by shifting the window by [sΔ] pixels, where s is the current scale

   -  Results for Δ = 1.0 and Δ = 1.5 were reported
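The scanning strategy can be sketched as a generator of candidate windows. The sketch assumes a 24-pixel base detector, reads [sΔ] as rounding to the nearest integer (with a minimum shift of one pixel), and stops when the scaled detector no longer fits in the image; these details and the function name are assumptions made for the illustration.

```python
def scan_windows(img_w, img_h, base=24, scale_factor=1.25, delta=1.0):
    """Yield (x, y, size) for every window position at every detector scale."""
    s = 1.0
    while round(base * s) <= min(img_w, img_h):
        size = round(base * s)                  # detector size at the current scale
        step = max(1, round(s * delta))         # shift of [s * delta] pixels
        for y in range(0, img_h - size + 1, step):
            for x in range(0, img_w - size + 1, step):
                yield x, y, size
        s *= scale_factor                       # next scale of the detector

windows = list(scan_windows(48, 48))
print(len(windows), sorted({size for _, _, size in windows}))
```

Even for a small image the number of windows is large, which is why each window must be rejected as cheaply as possible.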

Merging Multiple Detections

ROC Curves for Face Detection

Results

•  Fixed images

•  Video sequence

[Figure panels: Frontal face, Left profile face, Right profile face]

Extension

•  Fast and robust

•  Other descriptors instead of Haar

•  Other cascades (rotation…)

•  Eye detection

Conclusions

•  How does AdaBoost work?

•  Why does AdaBoost work?

•  AdaBoost for face detection

   -  Rectangle features

   -  Integral images for fast computation

   -  Boosting for feature selection