• Aucun résultat trouvé

Learning by classifier combination: boosting

N/A
N/A
Protected

Academic year: 2022

Partager "Learning by classifier combination: boosting"

Copied!
22
0
0

Texte intégral

(1)

Learning by classifier combination: boosting

Frédéric Precioso

Binary classification

n 

The objective is to detect regions which correspond to a certain category among data.

n 

This defines a binary classification : corresponding / not corresponding to the category.

2

Context: Example on Corel 1000

3

Rose / Other

4

(2)

Classification

How to find the boundary between classes (or between categories) :

n 

Rose (flower)

n 

Not rose (other)

Observations:

n 

Easy to produce decision rules often right

q  If the histogram contains blue, then predict other n 

Hard to find ones right all the time

n 

We focus here on a solution among many others:

boosting

?

5

Ideas of boosting:

horseracing

6

How to win?

n 

Ask to professional gamblers

n 

Lets assume:

q 

That professional gamblers can provide one single decision rule simple and relevant

q 

But that face to several races, they can always provide decision rules a little bit better than random

n 

Can we become rich?

7

Idea

n 

Ask heuristics to the expert

n 

Gather a set of cases for which these heuristics fail (difficult cases)

n 

Ask again the expert to provide heuristics for the difficult cases

n 

And so one…

n 

Combine these heuristics

n 

expert stands for weak learner

8

(3)

Questions

n 

How to choose races (i.e. learning examples) at each step?

q Focus on races (examples) the most “difficult” (the ones on which previous heuristics are the less relevant)

n 

How to merge heuristics (decision rules) into one single decision rule?

q Take a weighted vote of all decision rules

9

Boosting

n 

boosting = general method to convert several poor decision rules into one very powerful decision rule

n 

More precisely:

q 

Let have a weak learner which can always provide a decision rule with an error ≤1/2-γ

q 

A boosting algorithm can build (theoretically) a global decision rule with an error rate ≤ ε

10

Boosting: the little History

Mathematician Kearns asks: « Is it possible to make as good as possible a weak learner » (that is to say better than random)?

Schapire, answered in 90: « Yes ! » , and exhibited the first elementary boosting algorithm which shows that a weak binary classifier can always improve by being trained on 3 subsamples well chosen.

The choice of the weak learner does not have any importance (a decision tree, a bayesian classifier, a SVM, etc.), but one has to choose the 3 training subsamples with respect to its performance.

11

Boosting: elementary boosting

1.  We train a first decision rule

h1

over a subsampling

S1

of size m

1 < m (m being the size of S the full training set).

Then we classified with h

1

all S – S

1

.

2.  We then train a second decision rule h

2

over a subsampling

S2, of size m2

, extracted from

S S1

whom half of the examples were misclassified by h

1

.

3.  We finally train a third decision rule

h3

over

m3

examples picked up in S − S

1

− S

2

for which h

1

and h

2

disagree on.

4.  The global decision rule is then a majority vote on the

3

decision rules: H = majority_vote(h

1

,h

2

,h

3

)

12

(4)

Boosting: Efficiency

A theorem of Schapire on « weak learning power » proves that H gets a higher relevance than a global decision rule which would have been learnt directly on S.

Geometric example of boosting following this elementary algorithm in the next slides.

The weak learner is a hyperplan.

13

Boosting: Start

Left: training set S and subset S1 (red or blue circled).

Right: subset S1 and the line C1 learnt on this set.

14

Boosting: to be continued

Left: training set S – S1 and line C1 learnt on S1.

Right: subset S2 included in S – S1 among the most informative for C1 (red or blue circled).

15

Boosting: …the end

Left: the line C2 learnt on S2.

Center: S3 = S – S1– S2 and the line C3 learnt on S3. Right: S and the combination of the 3 lines.

16

(5)

Probabilistic boosting

3 main ideas to generalize towards probabilistic boosting:

1.  A set of specialized experts and ask them to vote to take a decision.

2.  Adaptive weighting of votes by multiplicative update.

3.  Modifying example distribution to train each expert, increasing the weights iteratively of examples misclassified at previous iteration.

17

Boosting: Basics

Parameter: weak learners h Input: A set S

Initialization: All examples have same weights For t = 1,T do

use h with S and current distribution

increase weight of misclassified examples at current iteration

done

Output H : a majority weighted vote of all the h

18

Boosting: Basics

X h

0

D

0

X h

1

D

1

X h

2

D

2

X h

T

D

T

⎥⎦

⎢ ⎤

⎣

⎡ α

=

= T 0 t

t t

final( ) sgn .h( )

H x x

n  How to update Dt to Dt+1 ?

n  How to compute the weight αt ? 19

Probabilistic boosting: AdaBoost

The standard algorithm is AdaBoost (Adaptive Boosting).

The main idea is to define at each iteration 1< t < T, a new distribution of a priori probabilities on training samples with respect to classification result at previous iteration.

The weight at iteration t of an example (x

i

,y

i

) of index i is denoted D

t

(i).

Initially, all the examples get a similar weight, at each iteration, weights of misclassified examples are increased Thus the learner focuses on difficult examples of training set

20

(6)

n 

A training set: S = {(x

1,y1),…,(xm,ym)}

n  yi∈{-1,+1} label (annotation) of example xi∈ S

n 

For t = 0,…,T:

•  Build the distribution Dt on {1,…,m}

•  Find the weak decision (“heuristic”)

•  ht : S → {-1,+1}

•  with a small error εt on Dt:

n 

Find the final decision h

final

Boosting: the algorithm

( )

=

=

i i t t

y x h

i t

i i t D

t

h x y D i

:

) ( ]

) ( [ ε Pr

21

1

where

arg min ( )[ ( )]

j

m

t j j t i j i

i h

h ε ε D i y h x

=

= =

The AdaBoost Algorithm

Given: ( , ), ,( ,x y1 1 K x ym m) where xiX y, i∈ −{ 1, 1}+ Initialization: D i1( )=m1,i=1, ,K m

For : t=1, ,K T

•  Find classifier which minimizes error wrt Dh Xt: → −{ 1, 1}+ t ,i.e.,

•  Weight classifier: 1ln1 2

t t

t

α ε ε

= −

•  Update distribution: 1 ( ) exp[ ( )]

, is for normalizati

( ) t t i t i on

t t

t

D i y h x

D i Z

Z α

+

= −

Output final classifier:

1

( ) T t t( )

t

sign H x αh x

=

⎛ ⎞

⎜ = ⎟

⎝

⎠

What goal the AdaBoost wants to reach?

1

where

arg min ( )[ ( )]

j

m

t j j t i j i

i h

h ε ε D i y h x

=

= =

The AdaBoost Algorithm

Given: ( , ), ,( ,x y1 1 K x ym m) where xiX y, i∈ −{ 1, 1}+ Initialization: D i1( )=m1,i=1, ,K m

For : t=1, ,K T

•  Find classifier which minimizes error wrt Dh Xt: → −{ 1, 1}+ t ,i.e.,

•  Weight classifier: 1 1 2ln

t t t

α ε ε

= −

•  Update distribution: t1( ) t( ) exp[ t i t( )]i , is for normalizatit on

t

D i y h x

D i Z

Z α

+

= −

Output final classifier:

1

( ) T t t( )

t

sign H x αh x

=

⎛ ⎞

⎜ = ⎟

⎝

⎠

What goal the AdaBoost wants to reach?

They are goal dependent.

Boosting: global classifier

The boosting algorithm builds the finale decision as a summation on a fonction basis {h

t

}.

This is a classic process of machine learning algorithms (as SVM, bayesian classifier, neural networks, etc.).

⎟ ⎠

⎜ ⎞

⎝

= ⎛ ∑

t t

t

h x

x

H

final

( ) sgn α ( )

24

(7)

Example Toy

Very popular «Toy» example, 2D points in domain D1, to illustrate boosting schema.

(these points can be seen as vectors ∈ R2).

D1 0,1 0,1 0,1 0,1 0,1 0,1 0,1 0,1 0,1 0,1

25

Step 1

D2=D’1/Z1 0,17 0,17 0,17 0,07 0,07 0,07 0,07 0,07 0,07 0,07

⎟⎟

⎠

⎞

⎜⎜

⎝

= ⎛

⎟⎠

⎜ ⎞

⎝

= ⎛

⎟⎠

⎜ ⎞

⎝

= ⎛ −

⎟⎟⎠

⎞

⎜⎜⎝

⎛

ε ε

= − α

3 ln 7 3

. 0 7 . ln0 2 1

3 . 0 3 . 0 ln1 2 1 ln1

2 1

1 1 1

⎪⎪

⎩

⎪⎪⎨

⎧

=

=

=

=

=

α α

0,152 3 1 7 . 0 e 1 . 0

0,065 7 1 3 . 0 e 1 . 0 ' D

1 1

1

911 . 0 065 . 0 7 152 . 0 3

Z

1

= × + × =

26

Step 2

D3=D’2/Z2 0,11 0,11 0,11 0,045 0,045 0,045 0,045 0,17 0,17 0,17

814 . 0 1357 , 0 3 036 . 0 4 0876 . 0 3

Z

2

= × + × + × =

⎪ ⎩

⎪ ⎨

⎧

=

=

=

=

α α

α

0,1357 e

07 . 0

0,036 e

07 . 0

0,0876 e

17 . 0 ' D

2 2 2

2

27

Step 3

28

(8)

Final decision

29

AdaBoost Hinge Loss

Lets write the error ε

t

of

ht

as: 1/2

γ

t

, where γ

t

measure the improvement brought by h

t

regarding the basic error of 1/2.

Freund and Schapire, have shown that the hinge loss (the amount of misclassified data over training set

S) of finale decision H is

bounded by:

Hence, if each weak learner is slightly better than random ( γ

t > 0)

then the hinge loss decrease exponentially with t.

30

Proof

) x ( ) x (

f =

∑ α

t t

h

t

= −

⎟⎠

⎜ ⎞

⎝

⎛− ⋅

=

=

t t

i i

t t

t t t i

i i

t i t t i t Final i

Z x f y m

Z x h y x m

h Z y

y D

D

x

)) ( exp(

1

) ( 1exp

)) ( exp(

) , (

α α

Let

We have:

The hinge loss is less than ∏

t Zt

∑ ∑ ∏

∑ ∑

=

=

=

= <

t t

i t

t

i final

i i

i Hx y i yfx

Z Z i x

f m y

m m

D

i i i i

) ( ))

( 1 exp(

1 Loss 1

Hinge

1

( )

1

( )0

31

And

Then the hinge loss is bounded by:

The error of learning decreases exponentially with t

( ) ( )

( )

( )

( )

( ε ) ε ( ε )

ε

α α

α α α

t t t

t

x h y

i t

x h y

i t

i i t

i t

t

e e

e D e

D D

Z

t t

t i t i t i t i

i i

x f y i

− =

− +

=

+ −

=

=

∑ ∑ ∑

=

1 2 1

) ( exp ) (

: :

Proof

32

(9)

Error of generalizaton

Error of generalization of H can be bounded by:

where T is the number of boosting iterations, m the number of examples, and d a dimension (VC-dimension) of H

T

space (« weaks learner complexity »)

When T becomes large, the 2

nd

term becomes large too, then one can think that the learner would overfit

the solution (i.e. over-weighting of ambiguous examples)

( ) ( ) ⎟ ⎟

⎠

⎞

⎜ ⎜

⎝

Ο ⎛

+

= m

d

H T

E H

E

al T HingeLoss T

.

Re

33

However observations show that this occurs very rarely On the opposite, real loss tends to decrease even after that the hinge loss has vanished

Classic behavior:

-overfitting

-Occam razor principle

Boosting real behavior:

- Error of generalization does not increase…

Error of generalizaton

34

Error of generalization does not increase…Why ? Relation with large

margin methods

n 

The hinge loss only evaluates if classification labels are correct

n 

The confidence in the result must also be taken into account

Margin Theory

35

Margin Theory

n 

Margin: confidence into the prediction

With high probability,

-  m is the number of training samples -  d is the « complexity » of weaks learners

> 0

∀ θ

µ

( )

dm generalisation error P margin

θ

θ

⎛ ⎞

⎜ ⎟

≤ ≤ +Ο⎜ ⎟

⎜ ⎟

⎝ ⎠

( ) ( )

1

1

,

T t t t

T t t

y h x

x y α

α

=

=

=

margin

36

(10)

n 

The bound is independant from T: the number of boosting iterations

n 

Boosting tends to maximise margins of training set since it focuses on samples difficult to classify (samples whom margin is the smallest)

Margin Theory

37

Advantages of AdaBoost

n 

(very) fast

n 

simple + easy to code

n 

One unique parameter to set: the number of boosting iterations (T)

n 

Can be use with any machine learning algorithm

n 

No-overfitting

n 

Is adapted to multi-class problems where y

i

{1,..,c} and to multi-label problems

n 

Allow to find out outliers

38

Drawbacks of AdaBoost

n 

Performances depend on training data & weak learners

n 

AdaBoost can fail if

q 

The weak learner is to complex

§  Low margins → overfitting

q 

The weak learner is too weak

n 

γ

t

0 too fast

underfitting

n 

AdaBoost seems specially sensitive to noise

39

Applications

n 

Detection, tracking and face recognition

n 

Spam, Zip Code OCR

n 

Text classification: Schapire and Singer - Used stumps with normalized term frequency and multi-class encoding

n 

OCR: Schwenk and Bengio (neural networks)

n 

Natural language Processing: Collins; Haruno, Shirai and Ooyama

n 

Image retrieval: Thieu and Viola

n 

Medical diagnosis: Merle et al.

n 

Fraud Detection: Rätsch & Müller 2001

n 

Drug Discovery: Rätsch, Demiriz, Bennett 2002

n 

Elect. Power Monitoring: Onoda, Rätsch & Müller 2000

40

(11)

Boosting: Applet & demo

http://cse.ucsd.edu/~yfreund/adaboost/index.html

http://vision.ucla.edu/~vedaldi/code/snippets/assets/adademo/adademo.mov http://morph.cs.st-andrews.ac.uk/fof/haarDemo/index.html

41

Background

n 

[Valiant ʼ 84]

introduced theoretical PAC model for studying machine learning

n 

[Kearns&Valiantʼ88]

open problem of finding a boosting algorithm

n 

[Schapireʼ89], [Freundʼ90]

first polynomial-time boosting algorithms

n 

[Drucker, Schapire&Simard ʼ92]

first experiments using boosting

42

Backgroung (cont.)

n 

[Freund&Schapire ʼ95]

q 

Provided AdaBoost algorithm

q 

Bigger advantage in pratice than previous boosting algorithms

n 

experiences using AdaBoost:

[Drucker&Cortes ʼ95] [Schapire&Singer ʼ98]

[Jackson&Cravon ʼ96] [Maclin&Opitz ʼ97]

[Freund&Schapire ʼ96] [Bauer&Kohavi ʼ97]

[Quinlan ʼ96] [Schwenk&Bengio ʼ98]

[Breiman ʼ96] [Dietterichʼ98]

n 

Further development on theory & algorithms:

[Schapire,Freund,Bartlett&Lee ʼ97] [Schapire&Singer ʼ98]

[Breiman ʼ97] [Mason, Bartlett&Baxter ʼ98]

[Grive and Schuurmansʼ98] [Friedman, Hastie&Tibshirani ʼ98]

43

Variations

n 

Adaboost Multi-class

q  AdaBoost.M1: same as standard but y ∈ {1,…,k}

q  AdaBoost.M2: introduces a confidence degree n 

Solve overfitting when noisy data:

q  Weight Decay (1998): introduces slack variables regularizing the error

q  GentleBoost (1998)

q  LogitBoost (2000): additive logistic regression model (statistical point of view)

q  Regularized AdaBoost (2000): « soft » margins

q  BrownBoost (2001): removes examples many times misclassified, considered as « noisy »

q  WeightBoost (2003): weights in final decision dependent on input

n 

Reduce the number of features (or Weak Learner) :

q  FloatBoost (2003): removes the worst WL after each iteration

q  JointBoost (2004): multi-class, jointly train N binary classifier sharing same features

44

(12)

1 st example: Face detection

Démo

45

The Task of Face Detection

Many slides adapted from P. Viola

n  Slide a window across image and evaluate a face model at every location.

Basic Idea Challenges

n  Slide a window across image and evaluate a face model at every location.

n  Sliding window detector must evaluate tens of thousands of location/scale combinations.

n  Faces are rare: 0–10 per image

q  For computational efficiency, we should try to spend as little time as possible on the non-face windows

q  A megapixel image has ~106 pixels and a comparable number of candidate face locations

q  To avoid having a false positive in every image image, our false positive rate has to be less than 10-6

(13)

The Viola/Jones Face Detector

n 

A seminal approach to real-time object detection

n 

Training is slow, but detection is very fast

n 

Key ideas

q  Integral images for fast feature evaluation

q  Boosting for feature selection

q  Attentional cascade for fast rejection of non-face windows

P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features.

CVPR 2001.

P. Viola and M. Jones. Robust real-time face detection. IJCV 57(2), 2004.

Image Features

Rectangle filters

Feature Value (Pixel in white area) (Pixel in black area)

=

Image Features

Rectangle filters

Feature Value (Pixel in white area) (Pixel in black area)

=

Size of Feature Space

Feature Value (Pixel in white area) (Pixel in black area)

=

•  How many number of possible rectangle features for a 24x24 detection region?

Rectangle filters

Impossible d'afficher l'image. Votre ordinateur manque peut-être de mémoire pour ouvrir l'image ou l'image est endommagée. Redémarrez l'ordinateur, puis ouvrez à nouveau le fichier. Si le x rouge est toujours affiché, vous devrez peut-être supprimer l'image avant de la réinsérer.

12 24

1 1 8 24

1 1 12 12

1 1

2 (24 2 1)(24 1)

2 (24 3 1)(24 1)

(24 2 1)(24 2 1)

w h

w h

w h

w h

w h

w h

= =

= =

= =

− + − + +

− + − + +

− + − +

∑∑

∑∑

∑∑

A+B C D

160,000

:

(14)

Feature Selection

•  How many number of possible rectangle features for a 24x24 detection region?

Impossible d'afficher l'image. Votre ordinateur manque peut-être de mémoire pour ouvrir l'image ou l'image est endommagée. Redémarrez l'ordinateur, puis ouvrez à nouveau le fichier. Si le x rouge est toujours affiché, vous devrez peut-être supprimer l'image avant de la réinsérer.

A+B C D

160,000 :

What features are good for face detection?

12 24

1 1 8 24

1 1 12 12

1 1

2 (24 2 1)(24 1)

2 (24 3 1)(24 1)

(24 2 1)(24 2 1)

w h

w h

w h

w h

w h

w h

= =

= =

= =

− + − + +

− + − + +

− + − +

∑∑

∑∑

∑∑

Feature Selection

•  How many number of possible rectangle features for a 24x24 detection region?

Impossible d'afficher l'image. Votre ordinateur manque peut-être de mémoire pour ouvrir l'image ou l'image est endommagée. Redémarrez l'ordinateur, puis ouvrez à nouveau le fichier. Si le x rouge est toujours affiché, vous devrez peut-être supprimer l'image avant de la réinsérer.

A+B C D

160,000 :

l  Can we create a good classifier using just a small subset of all possible features?

l  How to select such a subset?

12 24

1 1 8 24

1 1 12 12

1 1

2 (24 2 1)(24 1)

2 (24 3 1)(24 1)

(24 2 1)(24 2 1)

w h

w h

w h

w h

w h

w h

= =

= =

= =

− + − + +

− + − + +

− + − +

∑∑

∑∑

∑∑

Integral images

n  The integral image computes a value at each pixel (x, y) that is the sum of the pixel values above and to the left of (x, y), inclusive.

(x, y)

,

( , ) ( , )

x x y y

ii x y i x y

ʹ′ ʹ′

ʹ′ ʹ′

= ∑

( , 1) ii x y − ( 1, ) s xy

Computing the Integral Image

n  The integral image computes a value at each pixel (x, y) that is the sum of the pixel values above and to the left of (x, y), inclusive.

n  This can quickly be computed in one pass through the image.

Impossible d'afficher l'image.

Votre ordinateur manque peut-être de mémoire pour ouvrir l'image ou l'image est endommagée. Redémarrez l'ordinateur, puis ouvrez à nouveau le fichier. Si le x rouge est toujours affiché, vous devrez peut-être supprimer l'image avant de la réinsérer.

(x, y)

,

( , ) ( , )

x x y y

ii x y i x y

ʹ′ ʹ′

ʹ′ ʹ′

= ∑

( , ) i x y

( , ) ( 1, ) ( , ) s x y = s xy + i x y

( , ) ( , 1) ( , )

ii x y = ii x y − + s x y

(15)

Computing Sum within a Rectangle

Impossible d'afficher l'image.

Votre ordinateur manque peut-être de mémoire pour ouvrir l'image ou l'image est endommagée. Redémarrez l'ordinateur, puis ouvrez à nouveau le fichier. Si le x rouge est toujours affiché, vous devrez peut-être supprimer l'image avant de la réinsérer.

A B C D

sum ii = − iiii + ii D B

C A

Only 3 additions are required for any size of rectangle!

Scaling

n 

Integral image enables us to evaluate all rectangle sizes in constant time.

n 

Therefore, no image scaling is necessary.

n 

Scale the rectangular features instead!

1 2

3 4

5 6

Boosting

n 

Boosting is a classification scheme that works by combining weak learners into a more accurate ensemble classifier

q  A weak learner need only do better than chance n 

Training consists of multiple boosting rounds

q  During each boosting round, we select a weak learner that does well on examples that were hard for the previous weak learners

q  “Hardness” is captured by weights attached to training examples

Y. Freund and R. Schapire, A short introduction to boosting, Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, September, 1999.

The AdaBoost Algorithm

Given: ( , ), ,( ,x y1 1 K x ym m) where xiX y, i∈ −{ 1, 1}+ Initialization: D i1( )=m1,i=1, ,K m

For : t=1, ,K T

•  Find classifier which minimizes error wrt Dh Xt: → −{ 1, 1}+ t ,i.e.,

1

where

arg min ( )[ ( )]

j

m

t j j t i j i

i h

h ε ε D i y h x

=

= =

:probability distribution of 's at time

t( ) i

D i x t

•  Weight classifier: 1 1 2ln

t t t

α ε ε

= −

•  Update distribution: t1( ) t( ) exp[ t i t( )]i , is for normalizatit on

t

D i y h x

D i Z

Z α

+

= −

minimize weighted error for minimize exponential loss

Give error classified patterns more chance for learning.

(16)

The AdaBoost Algorithm

Given: ( , ), ,( ,x y1 1 K x ym m) where xiX y, i∈ −{ 1, 1}+ Initialization: D i1( )=m1,i=1, ,K m

For : t=1, ,K T

•  Find classifier which minimizes error wrt Dh Xt: → −{ 1, 1}+ t ,i.e.,

1

where

arg min ( )[ ( )]

j

m

t j j t i j i

i h

h ε ε D i y h x

=

= =

•  Weight classifier: 1ln1 2

t t

t

α ε ε

= −

•  Update distribution: 1 ( ) exp[ ( )]

, is for normalizati

( ) t t i t i on

t t

t

D i y h x

D i Z

Z α

+

= −

Output final classifier:

1

( ) T t t( )

t

sign H x αh x

=

⎛ ⎞

⎜ = ⎟

⎝

⎠

Weak Learners for Face Detection

Given: ( , ), ,( ,x y1 1 K x ym m) where xiX y, i∈ −{ 1, 1}+ Initialization: D i1( )=m1,i=1, ,K m

For : t=1, ,K T

•  Find classifier which minimizes error wrt Dh Xt: → −{ 1, 1}+ t ,i.e.,

1

where

arg min ( )[ ( )]

j

m

t j j t i j i

i h

h ε ε D i y h x

=

= =

•  Weight classifier: 1ln1 2

t t

t

α ε ε

= −

•  Update distribution: 1 ( ) exp[ ( )]

, is for normalizati

( ) t t i t i on

t t

t

D i y h x

D i Z

Z α

+

= −

Output final classifier:

1

( ) T t t( )

t

sign H x αh x

=

⎛ ⎞

⎜ = ⎟

⎝

⎠

What base learner is proper for face detection?

Weak Learners for Face Detection

1 if ( ) ( ) 0 otherwise

t t t t

t

p f x p

h x ⎧ > θ

= ⎨

⎩

window

value of rectangle feature

parity threshold

Boosting

n 

Training set contains face and nonface examples

q  Initially, with equal weight n 

For each round of boosting:

q  Evaluate each rectangle filter on each example

q  Select best threshold for each filter

q  Select best filter/threshold combination

q  Reweight examples

n 

Computational complexity of learning: O(MNK)

q  M rounds, N examples, K features

(17)

AdaBoost

•  1990 Schapire, Boosting

•  1996 Freund & Schapire, AdaBoost -  Do not require any a priori knwoledge

•  2001 Viola & Jones -  Face detection

-  Haar descriptors, integral image, cascades -  Robust, fast

•  2002 Lienhart et al.

-  Extended descriptors, 45 degres rotation

Extended descriptors Haar descriptors

65

AdaBoost

Choisir un classifieur faible

Choose the classifier ht which gets minimum error Initialize weights of examples

P: w1,i = 1/2m ; N: w1,i = 1/2l

Strong Classifier

where

Evaluate the error of each classifier Normalize weights wi Provide training examples

(x1, y1), . . . , (xn , yn ) P: yi = 1; N: yi = 0

For t =1,2,…,T start

Next T

end

Update weights correct:

incorrect:

Obtenir un classifieur fort

66

Features Selected by Boosting

First two features selected by boosting:

This feature combination can yield 100% detection rate and 50% false positive rate

ROC Curve for 200-Feature Classifier

A 200-feature classifier can yield 95% detection rate and a false positive rate of 1 in 14084.

Not good enough!

To be practical for real application, the false positive rate must be closer to 1 in 1,000,000.

(18)

Attentional Cascade

Classifier 3 Classifier 2

Classifier 1 IMAGE

SUB-WINDOW

T T T FACE

NON-FACE

F F

NON-FACE

F NON-FACE n  We start with simple classifiers which reject many of the

negative sub-windows while detecting almost all positive sub- windows

n  Positive response from the first classifier triggers the evaluation of a second (more complex) classifier, and so on

n  A negative outcome at any point leads to the immediate rejection of the sub-window

Attentional Cascade

Classifier 3 Classifier 2

Classifier 1 IMAGE

SUB-WINDOW

T T T FACE

NON-FACE

F F

NON-FACE

F NON-FACE n 

Chain classifiers that are

progressively more complex and have lower false positive rates

0

100 0 % False Pos 50

% Detection

ROC Curve

Detection Rate and False Positive Rate for Chained Classifiers

Classifier 3 Classifier 2

Classifier 1 IMAGE

SUB-WINDOW

T T T FACE

NON-FACE

F F

NON-FACE

F NON-FACE n  The detection rate and the false positive rate of the cascade

are found by multiplying the respective rates of the individual stages

n  A detection rate of 0.9 and a false positive rate on the order of 10-6 can be achieved by a 10-stage cascade if each stage has a detection rate of 0.99 (0.9910 ≈ 0.9) and a false positive rate of about 0.30 (0.310 ≈ 6×10-6 )

1 1

,

f d f

2

, d

2

f

3

, d

3

1 1

,

K i

K i

i i

F f D d

=

=

=

= ∏ ∏

, F D

Training the Cascade

n 

Set target detection and false positive rates for each stage

n 

Keep adding features to the current stage until its target rates have been met

q  Need to lower AdaBoost threshold to maximize detection (as opposed to minimizing total classification error)

q  Test on a validation set

n 

If the overall false positive rate is not low enough, then add another stage

n 

Use false positives from current stage as the

negative training examples for the next stage

(19)

AdaBoost (6) – Cascade Principle

AdaBoost

1 N

Face x 99%

Non Face x 30%

1 1 0

0 0 0

Non Face x 70%

Face x 90%

Non Face x 0.00006%

1 1 1

0 1 0

AdaBoost 2

0 1

Non Face x 21%

Face x 98%

Non Face x 9%

73

Training the Cascade

Attentional Cascade

Decrease the set of negative examples

AdaBoost to choose the classifier

Update the threshold for Di > d * Di-1

Fi > f * Fi-1 ?

Initialize global rate F0 = 1 ; D0 = 1 Provide training exemples (x1, y1), . . . , (xn , yn ) P: yi = 1; N: yi = 0

begin

end Define max f /iteration, min d /iteration, global Ftarget

Fi > Ftarget ?

Next iteration

next classifier

Obtain a cascade of strong classifiers

Obtain a strong classifier

Weak classifier

75

ROC Curves Cascaded Classifier to Monlithic

Classifier

(20)

ROC Curves Cascaded Classifier to Monlithic Classifier

n  There is little difference between the two in terms of accuracy.

n  There is a big difference in terms of speed.

q  The cascaded classifier is nearly 10 times faster since its first stage throws out most non-faces so that they are never evaluated by subsequent stages.

The Implemented System

n 

Training Data

q  5000 faces

n  All frontal, rescaled to 24x24 pixels

q  300 million non-faces

n  9500 non-face images

q  Faces are normalized

n  Scale, translation n 

Many variations

q  Across individuals

q  Illumination

q  Pose

Structure of the Detector Cascade

n  Combining successively more complex classifiers in cascade

q  38 stages

q  included a total of 6060 features

Reject Sub-Window

1 2 3 4 5 6 7 8 38 Face

F F F F F F F F F

T T T T T T T T T

All Sub-Windows

Structure of the Detector Cascade

Reject Sub-Window

1 2 3 4 5 6 7 8 38 Face

F F F F F F F F F

T T T T T T T T T

All Sub-Windows

2 features, reject 50% non-faces, detect 100% faces 10 features, reject 80% non-faces, detect 100% faces

25 features 50 features

by algorithm

(21)

Speed of the Final Detector

n 

On a 700 Mhz Pentium III processor, the face detector can process a 384 ×288 pixel image in about .067 seconds”

q  15 Hz

q  15 times faster than previous detector of comparable accuracy (Rowley et al., 1998)

n 

Average of 8 features evaluated per window on test set

Image Processing

n 

Training ⎯ all example sub-windows were variance normalized to minimize the effect of different lighting conditions

n 

Detection ⎯ variance normalized as well

2 2 1 2

m

N

x σ = − ∑

Can be computed using integral image

Can be computed using integral image of squared image

Scanning the Detector

n 

Scaling is achieved by scaling the detector itself, rather than scaling the image

n 

Good detection results for scaling factor of 1.25

n 

The detector is scanned across location

q  Subsequent locations are obtained by shifting the window [sΔ]

pixels, where s is the current scale

q  The result for Δ = 1.0 and Δ = 1.5 were reported

Merging Multiple Detections

(22)

ROC Curves for Face Detection Results

•  Fixed images

•  Video sequence

Frontal face Left profile face Right profile face

86

•  Fast and robust

•  Other descriptors instead of Haar

•  Other cascades (rotation…)

•  Eye detection

Extension

87

Conclusions

n 

How adaboost works?

n 

Why adaboost works?

n 

Adaboost for face detection

q 

Rectangle features

q 

Integral images for fast computation

q 

Boosting for feature selection

q 

Attentional cascade for fast rejection of

negative windows

Références

Documents relatifs

In this paper we propose a novel multi-task learning algorithm called MT-Adaboost: it extends Ada boost algorithm [1] to the multi-task setting; it uses as multi-task weak classifier

Motivated by applications in mathematical imaging, asymptotic expansions of the heat generated by single, or well separated, small particles are derived in [5, 7] based on

The frequency of multiple sclerosis in Mediterranean Europe: an incidence and prevalence study in Barbagia, Sardinia, insular Italy.. The frequency of multiple

We introduce in Section 2 a general framework for studying the algorithms from the point of view of functional optimization in an L 2 space, and prove in Section 3 their convergence

• appliquer de façon successive le même algorithme à des versions de l’échantillon initial d’apprentissage qui sont modifiées à chaque étape pour tenir compte des erreurs

In Section 3, we introduce the concept of Fano–Mori triples and their basic properties, in particular, we show that any canonical weak Q -Fano 3-fold is birational to a Fano–Mori

The system developed by the LAST for the PAN@FIRE 2020 Task on Authorship Identification of SOurce COde (AI-SOCO) was aimed at obtaining the best possible performance by combining a

The significance of the Ambiguity decomposition it that the error of the ensemble will be less than or equal to the average error of the individuals, and then the ensemble has