Learning by classifier combination: boosting
Frédéric Precioso
Binary classification
n
The objective is to detect regions which correspond to a certain category among data.
n
This defines a binary classification : corresponding / not corresponding to the category.
2
Context: Example on Corel 1000
3
Rose / Other
4
Classification
How to find the boundary between classes (or between categories) :
n
Rose (flower)
n
Not rose (other)
Observations:
n
Easy to produce decision rules often right
q If the histogram contains blue, then predict other n
Hard to find ones right all the time
n
We focus here on a solution among many others:
boosting
?
5
Ideas of boosting:
horseracing
6
How to win?
n
Ask to professional gamblers
n
Lets assume:
q
That professional gamblers can provide one single decision rule simple and relevant
q
But that face to several races, they can always provide decision rules a little bit better than random
n
Can we become rich?
7
Idea
n
Ask heuristics to the expert
n
Gather a set of cases for which these heuristics fail (difficult cases)
n
Ask again the expert to provide heuristics for the difficult cases
n
And so one…
n
Combine these heuristics
n
expert stands for weak learner
8
Questions
n
How to choose races (i.e. learning examples) at each step?
q Focus on races (examples) the most “difficult” (the ones on which previous heuristics are the less relevant)
n
How to merge heuristics (decision rules) into one single decision rule?
q Take a weighted vote of all decision rules
9
Boosting
n
boosting = general method to convert several poor decision rules into one very powerful decision rule
n
More precisely:
q
Let have a weak learner which can always provide a decision rule with an error ≤1/2-γ
q
A boosting algorithm can build (theoretically) a global decision rule with an error rate ≤ ε
10
Boosting: the little History
Mathematician Kearns asks: « Is it possible to make as good as possible a weak learner » (that is to say better than random)?
Schapire, answered in 90: « Yes ! » , and exhibited the first elementary boosting algorithm which shows that a weak binary classifier can always improve by being trained on 3 subsamples well chosen.
The choice of the weak learner does not have any importance (a decision tree, a bayesian classifier, a SVM, etc.), but one has to choose the 3 training subsamples with respect to its performance.
11
Boosting: elementary boosting
1. We train a first decision rule
h1over a subsampling
S1of size m
1 < m (m being the size of S the full training set).Then we classified with h
1all S – S
1.
2. We then train a second decision rule h
2over a subsampling
S2, of size m2, extracted from
S − S1whom half of the examples were misclassified by h
1.
3. We finally train a third decision rule
h3over
m3examples picked up in S − S
1− S
2for which h
1and h
2disagree on.
4. The global decision rule is then a majority vote on the
3decision rules: H = majority_vote(h
1,h
2,h
3)
12
Boosting: Efficiency
A theorem of Schapire on « weak learning power » proves that H gets a higher relevance than a global decision rule which would have been learnt directly on S.
Geometric example of boosting following this elementary algorithm in the next slides.
The weak learner is a hyperplan.
13
Boosting: Start
Left: training set S and subset S1 (red or blue circled).
Right: subset S1 and the line C1 learnt on this set.
14
Boosting: to be continued
Left: training set S – S1 and line C1 learnt on S1.
Right: subset S2 included in S – S1 among the most informative for C1 (red or blue circled).
15
Boosting: …the end
Left: the line C2 learnt on S2.
Center: S3 = S – S1– S2 and the line C3 learnt on S3. Right: S and the combination of the 3 lines.
16
Probabilistic boosting
3 main ideas to generalize towards probabilistic boosting:
1. A set of specialized experts and ask them to vote to take a decision.
2. Adaptive weighting of votes by multiplicative update.
3. Modifying example distribution to train each expert, increasing the weights iteratively of examples misclassified at previous iteration.
17
Boosting: Basics
Parameter: weak learners h Input: A set S
Initialization: All examples have same weights For t = 1,T do
use h with S and current distribution
increase weight of misclassified examples at current iteration
done
Output H : a majority weighted vote of all the h
18
Boosting: Basics
X h
0D
0X h
1D
1X h
2D
2X h
TD
T⎥⎦
⎢ ⎤
⎣
⎡ α
=
∑
= T 0 t
t t
final( ) sgn .h( )
H x x
n How to update Dt to Dt+1 ?
n How to compute the weight αt ? 19
Probabilistic boosting: AdaBoost
The standard algorithm is AdaBoost (Adaptive Boosting).
The main idea is to define at each iteration 1< t < T, a new distribution of a priori probabilities on training samples with respect to classification result at previous iteration.
The weight at iteration t of an example (x
i,y
i) of index i is denoted D
t(i).
Initially, all the examples get a similar weight, at each iteration, weights of misclassified examples are increased Thus the learner focuses on difficult examples of training set
20
n
A training set: S = {(x
1,y1),…,(xm,ym)}n yi∈{-1,+1} label (annotation) of example xi∈ S
n
For t = 0,…,T:
• Build the distribution Dt on {1,…,m}
• Find the weak decision (“heuristic”)
• ht : S → {-1,+1}
• with a small error εt on Dt:
n
Find the final decision h
finalBoosting: the algorithm
∑
( )≠
=
≠
=
i i t t
y x h
i t
i i t D
t
h x y D i
:
) ( ]
) ( [ ε Pr
21
1
where
arg min ( )[ ( )]
j
m
t j j t i j i
i h
h ε ε D i y h x
=
= =
∑
≠The AdaBoost Algorithm
Given: ( , ), ,( ,x y1 1 K x ym m) where xi∈X y, i∈ −{ 1, 1}+ Initialization: D i1( )=m1,i=1, ,K m
For : t=1, ,K T
• Find classifier which minimizes error wrt Dh Xt: → −{ 1, 1}+ t ,i.e.,
• Weight classifier: 1ln1 2
t t
t
α ε ε
= −
• Update distribution: 1 ( ) exp[ ( )]
, is for normalizati
( ) t t i t i on
t t
t
D i y h x
D i Z
Z α
+
= −
Output final classifier:
1
( ) T t t( )
t
sign H x αh x
=
⎛ ⎞
⎜ = ⎟
⎝
∑
⎠What goal the AdaBoost wants to reach?
1
where
arg min ( )[ ( )]
j
m
t j j t i j i
i h
h ε ε D i y h x
=
= =
∑
≠The AdaBoost Algorithm
Given: ( , ), ,( ,x y1 1 K x ym m) where xi∈X y, i∈ −{ 1, 1}+ Initialization: D i1( )=m1,i=1, ,K m
For : t=1, ,K T
• Find classifier which minimizes error wrt Dh Xt: → −{ 1, 1}+ t ,i.e.,
• Weight classifier: 1 1 2ln
t t t
α ε ε
= −
• Update distribution: t1( ) t( ) exp[ t i t( )]i , is for normalizatit on
t
D i y h x
D i Z
Z α
+
= −
Output final classifier:
1
( ) T t t( )
t
sign H x αh x
=
⎛ ⎞
⎜ = ⎟
⎝
∑
⎠What goal the AdaBoost wants to reach?
They are goal dependent.
Boosting: global classifier
The boosting algorithm builds the finale decision as a summation on a fonction basis {h
t}.
This is a classic process of machine learning algorithms (as SVM, bayesian classifier, neural networks, etc.).
⎟ ⎠
⎜ ⎞
⎝
= ⎛ ∑
t t
t
h x
x
H
final( ) sgn α ( )
24
Example “ Toy ”
Very popular «Toy» example, 2D points in domain D1, to illustrate boosting schema.
(these points can be seen as vectors ∈ R2).
D1 0,1 0,1 0,1 0,1 0,1 0,1 0,1 0,1 0,1 0,1
25
Step 1
D2=D’1/Z1 0,17 0,17 0,17 0,07 0,07 0,07 0,07 0,07 0,07 0,07
⎟⎟
⎠
⎞
⎜⎜
⎝
= ⎛
⎟⎠
⎜ ⎞
⎝
= ⎛
⎟⎠
⎜ ⎞
⎝
= ⎛ −
⎟⎟⎠
⎞
⎜⎜⎝
⎛
ε ε
= − α
3 ln 7 3
. 0 7 . ln0 2 1
3 . 0 3 . 0 ln1 2 1 ln1
2 1
1 1 1
⎪⎪
⎩
⎪⎪⎨
⎧
=
=
=
=
=
α α
−
0,152 3 1 7 . 0 e 1 . 0
0,065 7 1 3 . 0 e 1 . 0 ' D
1 1
1
911 . 0 065 . 0 7 152 . 0 3
Z
1= × + × =
26
Step 2
D3=D’2/Z2 0,11 0,11 0,11 0,045 0,045 0,045 0,045 0,17 0,17 0,17
814 . 0 1357 , 0 3 036 . 0 4 0876 . 0 3
Z
2= × + × + × =
⎪ ⎩
⎪ ⎨
⎧
=
=
=
=
α α
− α
−
0,1357 e
07 . 0
0,036 e
07 . 0
0,0876 e
17 . 0 ' D
2 2 2
2
27
Step 3
28
Final decision
29
AdaBoost Hinge Loss
Lets write the error ε
tof
htas: 1/2
−γ
t, where γ
tmeasure the improvement brought by h
tregarding the basic error of 1/2.
Freund and Schapire, have shown that the hinge loss (the amount of misclassified data over training set
S) of finale decision H isbounded by:
Hence, if each weak learner is slightly better than random ( γ
t > 0)then the hinge loss decrease exponentially with t.
30
Proof
) x ( ) x (
f =
∑ α
t th
t∏
∏
∑
= −
⎟⎠
⎜ ⎞
⎝
⎛− ⋅
=
⋅
⋅
−
⋅
=
t t
i i
t t
t t t i
i i
t i t t i t Final i
Z x f y m
Z x h y x m
h Z y
y D
D
x)) ( exp(
1
) ( 1exp
)) ( exp(
) , (
α α
Let
We have:
The hinge loss is less than ∏
t Zt
∏
∑ ∑ ∏
∑ ∑
=
=
−
≤
=
= ≠ <
t t
i t
t
i final
i i
i Hx y i yfx
Z Z i x
f m y
m m
D
i i i i
) ( ))
( 1 exp(
1 Loss 1
Hinge
1
( )1
( )031
And
Then the hinge loss is bounded by:
The error of learning decreases exponentially with t
( ) ( )
( )
( )
( )
( ε ) ε ( ε )
ε
α αα α α
t t t
t
x h y
i t
x h y
i t
i i t
i t
t
e e
e D e
D D
Z
t t
t i t i t i t i
i i
x f y i
−
− =
− +
=
+ −
=
−
=
∑ ∑ ∑
=
≠
1 2 1
) ( exp ) (
: :
Proof
32
Error of generalizaton
Error of generalization of H can be bounded by:
where T is the number of boosting iterations, m the number of examples, and d a dimension (VC-dimension) of H
Tspace (« weaks learner complexity »)
When T becomes large, the 2
ndterm becomes large too, then one can think that the learner would overfit
the solution (i.e. over-weighting of ambiguous examples)
( ) ( ) ⎟ ⎟
⎠
⎞
⎜ ⎜
⎝
Ο ⎛
+
= m
d
H T
E H
E
al T HingeLoss T.
Re
33
However observations show that this occurs very rarely On the opposite, real loss tends to decrease even after that the hinge loss has vanished
Classic behavior:
-overfitting
-Occam razor principle
Boosting real behavior:
- Error of generalization does not increase…
Error of generalizaton
34
Error of generalization does not increase…Why ? Relation with large
margin methods
n
The hinge loss only evaluates if classification labels are correct
n
The confidence in the result must also be taken into account
Margin Theory
35
Margin Theory
n
Margin: confidence into the prediction
With high probability,
- m is the number of training samples - d is the « complexity » of weaks learners
> 0
∀ θ
µ
( )
dm generalisation error P margin
θ
θ
⎛ ⎞
⎜ ⎟
≤ ≤ +Ο⎜ ⎟
⎜ ⎟
⎝ ⎠
( ) ( )
1
1
,
T t t t
T t t
y h x
x y α
α
=
=
=
∑
∑
margin
36
n
The bound is independant from T: the number of boosting iterations
n
Boosting tends to maximise margins of training set since it focuses on samples difficult to classify (samples whom margin is the smallest)
Margin Theory
37
Advantages of AdaBoost
n
(very) fast
n
simple + easy to code
n
One unique parameter to set: the number of boosting iterations (T)
n
Can be use with any machine learning algorithm
n
No-overfitting
n
Is adapted to multi-class problems where y
i ∈{1,..,c} and to multi-label problems
n
Allow to find out outliers
38
Drawbacks of AdaBoost
n
Performances depend on training data & weak learners
n
AdaBoost can fail if
q
The weak learner is to complex
§ Low margins → overfitting
q
The weak learner is too weak
n
γ
t→0 too fast
→underfitting
n
AdaBoost seems specially sensitive to noise
39
Applications
n
Detection, tracking and face recognition
n
Spam, Zip Code OCR
n
Text classification: Schapire and Singer - Used stumps with normalized term frequency and multi-class encoding
n
OCR: Schwenk and Bengio (neural networks)
n
Natural language Processing: Collins; Haruno, Shirai and Ooyama
n
Image retrieval: Thieu and Viola
n
Medical diagnosis: Merle et al.
n
Fraud Detection: Rätsch & Müller 2001
n
Drug Discovery: Rätsch, Demiriz, Bennett 2002
n
Elect. Power Monitoring: Onoda, Rätsch & Müller 2000
40
Boosting: Applet & demo
http://cse.ucsd.edu/~yfreund/adaboost/index.html
http://vision.ucla.edu/~vedaldi/code/snippets/assets/adademo/adademo.mov http://morph.cs.st-andrews.ac.uk/fof/haarDemo/index.html
41
Background
n
[Valiant ʼ 84]
introduced theoretical PAC model for studying machine learning
n[Kearns&Valiantʼ88]
open problem of finding a boosting algorithm
n[Schapireʼ89], [Freundʼ90]
first polynomial-time boosting algorithms
n[Drucker, Schapire&Simard ʼ92]
first experiments using boosting
42
Backgroung (cont.)
n
[Freund&Schapire ʼ95]
q
Provided AdaBoost algorithm
q
Bigger advantage in pratice than previous boosting algorithms
n
experiences using AdaBoost:
[Drucker&Cortes ʼ95] [Schapire&Singer ʼ98]
[Jackson&Cravon ʼ96] [Maclin&Opitz ʼ97]
[Freund&Schapire ʼ96] [Bauer&Kohavi ʼ97]
[Quinlan ʼ96] [Schwenk&Bengio ʼ98]
[Breiman ʼ96] [Dietterichʼ98]
n
Further development on theory & algorithms:
[Schapire,Freund,Bartlett&Lee ʼ97] [Schapire&Singer ʼ98]
[Breiman ʼ97] [Mason, Bartlett&Baxter ʼ98]
[Grive and Schuurmansʼ98] [Friedman, Hastie&Tibshirani ʼ98]
43
Variations
n
Adaboost Multi-class
q AdaBoost.M1: same as standard but y ∈ {1,…,k}
q AdaBoost.M2: introduces a confidence degree n
Solve overfitting when noisy data:
q Weight Decay (1998): introduces slack variables regularizing the error
q GentleBoost (1998)
q LogitBoost (2000): additive logistic regression model (statistical point of view)
q Regularized AdaBoost (2000): « soft » margins
q BrownBoost (2001): removes examples many times misclassified, considered as « noisy »
q WeightBoost (2003): weights in final decision dependent on input
n
Reduce the number of features (or Weak Learner) :
q FloatBoost (2003): removes the worst WL after each iteration
q JointBoost (2004): multi-class, jointly train N binary classifier sharing same features
44
1 st example: Face detection
Démo
45
The Task of Face Detection
Many slides adapted from P. Viola
n Slide a window across image and evaluate a face model at every location.
Basic Idea Challenges
n Slide a window across image and evaluate a face model at every location.
n Sliding window detector must evaluate tens of thousands of location/scale combinations.
n Faces are rare: 0–10 per image
q For computational efficiency, we should try to spend as little time as possible on the non-face windows
q A megapixel image has ~106 pixels and a comparable number of candidate face locations
q To avoid having a false positive in every image image, our false positive rate has to be less than 10-6
The Viola/Jones Face Detector
n
A seminal approach to real-time object detection
n
Training is slow, but detection is very fast
n
Key ideas
q Integral images for fast feature evaluation
q Boosting for feature selection
q Attentional cascade for fast rejection of non-face windows
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features.
CVPR 2001.
P. Viola and M. Jones. Robust real-time face detection. IJCV 57(2), 2004.
Image Features
Rectangle filters
Feature Value (Pixel in white area) (Pixel in black area)=
∑
−∑
Image Features
Rectangle filters
Feature Value (Pixel in white area) (Pixel in black area)
=
∑
−∑
Size of Feature Space
Feature Value (Pixel in white area) (Pixel in black area)
=∑ −
∑
• How many number of possible rectangle features for a 24x24 detection region?
Rectangle filters
Impossible d'afficher l'image. Votre ordinateur manque peut-être de mémoire pour ouvrir l'image ou l'image est endommagée. Redémarrez l'ordinateur, puis ouvrez à nouveau le fichier. Si le x rouge est toujours affiché, vous devrez peut-être supprimer l'image avant de la réinsérer.
12 24
1 1 8 24
1 1 12 12
1 1
2 (24 2 1)(24 1)
2 (24 3 1)(24 1)
(24 2 1)(24 2 1)
w h
w h
w h
w h
w h
w h
= =
= =
= =
− + − + +
− + − + +
− + − +
∑∑
∑∑
∑∑
A+B C D
160,000
:
Feature Selection
• How many number of possible rectangle features for a 24x24 detection region?
Impossible d'afficher l'image. Votre ordinateur manque peut-être de mémoire pour ouvrir l'image ou l'image est endommagée. Redémarrez l'ordinateur, puis ouvrez à nouveau le fichier. Si le x rouge est toujours affiché, vous devrez peut-être supprimer l'image avant de la réinsérer.
A+B C D
160,000 :
What features are good for face detection?
12 24
1 1 8 24
1 1 12 12
1 1
2 (24 2 1)(24 1)
2 (24 3 1)(24 1)
(24 2 1)(24 2 1)
w h
w h
w h
w h
w h
w h
= =
= =
= =
− + − + +
− + − + +
− + − +
∑∑
∑∑
∑∑
Feature Selection
• How many number of possible rectangle features for a 24x24 detection region?
Impossible d'afficher l'image. Votre ordinateur manque peut-être de mémoire pour ouvrir l'image ou l'image est endommagée. Redémarrez l'ordinateur, puis ouvrez à nouveau le fichier. Si le x rouge est toujours affiché, vous devrez peut-être supprimer l'image avant de la réinsérer.
A+B C D
160,000 :
l Can we create a good classifier using just a small subset of all possible features?
l How to select such a subset?
12 24
1 1 8 24
1 1 12 12
1 1
2 (24 2 1)(24 1)
2 (24 3 1)(24 1)
(24 2 1)(24 2 1)
w h
w h
w h
w h
w h
w h
= =
= =
= =
− + − + +
− + − + +
− + − +
∑∑
∑∑
∑∑
Integral images
n The integral image computes a value at each pixel (x, y) that is the sum of the pixel values above and to the left of (x, y), inclusive.
(x, y)
,
( , ) ( , )
x x y y
ii x y i x y
ʹ′≤ ʹ′≤
ʹ′ ʹ′
= ∑
( , 1) ii x y − ( 1, ) s x − y
Computing the Integral Image
n The integral image computes a value at each pixel (x, y) that is the sum of the pixel values above and to the left of (x, y), inclusive.
n This can quickly be computed in one pass through the image.
Impossible d'afficher l'image.
Votre ordinateur manque peut-être de mémoire pour ouvrir l'image ou l'image est endommagée. Redémarrez l'ordinateur, puis ouvrez à nouveau le fichier. Si le x rouge est toujours affiché, vous devrez peut-être supprimer l'image avant de la réinsérer.
(x, y)
,
( , ) ( , )
x x y y
ii x y i x y
ʹ′≤ ʹ′≤
ʹ′ ʹ′
= ∑
( , ) i x y
( , ) ( 1, ) ( , ) s x y = s x − y + i x y
( , ) ( , 1) ( , )
ii x y = ii x y − + s x y
Computing Sum within a Rectangle
Impossible d'afficher l'image.
Votre ordinateur manque peut-être de mémoire pour ouvrir l'image ou l'image est endommagée. Redémarrez l'ordinateur, puis ouvrez à nouveau le fichier. Si le x rouge est toujours affiché, vous devrez peut-être supprimer l'image avant de la réinsérer.
A B C D
sum ii = − ii − ii + ii D B
C A
Only 3 additions are required for any size of rectangle!
Scaling
n
Integral image enables us to evaluate all rectangle sizes in constant time.
n
Therefore, no image scaling is necessary.
n
Scale the rectangular features instead!
1 2
3 4
5 6
Boosting
n
Boosting is a classification scheme that works by combining weak learners into a more accurate ensemble classifier
q A weak learner need only do better than chance n
Training consists of multiple boosting rounds
q During each boosting round, we select a weak learner that does well on examples that were hard for the previous weak learners
q “Hardness” is captured by weights attached to training examples
Y. Freund and R. Schapire, A short introduction to boosting, Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, September, 1999.
The AdaBoost Algorithm
Given: ( , ), ,( ,x y1 1 K x ym m) where xi∈X y, i∈ −{ 1, 1}+ Initialization: D i1( )=m1,i=1, ,K m
For : t=1, ,K T
• Find classifier which minimizes error wrt Dh Xt: → −{ 1, 1}+ t ,i.e.,
1
where
arg min ( )[ ( )]
j
m
t j j t i j i
i h
h ε ε D i y h x
=
= =
∑
≠:probability distribution of 's at time
t( ) i
D i x t
• Weight classifier: 1 1 2ln
t t t
α ε ε
= −
• Update distribution: t1( ) t( ) exp[ t i t( )]i , is for normalizatit on
t
D i y h x
D i Z
Z α
+
= −
minimize weighted error for minimize exponential loss
Give error classified patterns more chance for learning.
The AdaBoost Algorithm
Given: ( , ), ,( ,x y1 1 K x ym m) where xi∈X y, i∈ −{ 1, 1}+ Initialization: D i1( )=m1,i=1, ,K m
For : t=1, ,K T
• Find classifier which minimizes error wrt Dh Xt: → −{ 1, 1}+ t ,i.e.,
1
where
arg min ( )[ ( )]
j
m
t j j t i j i
i h
h ε ε D i y h x
=
= =
∑
≠• Weight classifier: 1ln1 2
t t
t
α ε ε
= −
• Update distribution: 1 ( ) exp[ ( )]
, is for normalizati
( ) t t i t i on
t t
t
D i y h x
D i Z
Z α
+
= −
Output final classifier:
1
( ) T t t( )
t
sign H x αh x
=
⎛ ⎞
⎜ = ⎟
⎝
∑
⎠Weak Learners for Face Detection
Given: ( , ), ,( ,x y1 1 K x ym m) where xi∈X y, i∈ −{ 1, 1}+ Initialization: D i1( )=m1,i=1, ,K m
For : t=1, ,K T
• Find classifier which minimizes error wrt Dh Xt: → −{ 1, 1}+ t ,i.e.,
1
where
arg min ( )[ ( )]
j
m
t j j t i j i
i h
h ε ε D i y h x
=
= =
∑
≠• Weight classifier: 1ln1 2
t t
t
α ε ε
= −
• Update distribution: 1 ( ) exp[ ( )]
, is for normalizati
( ) t t i t i on
t t
t
D i y h x
D i Z
Z α
+
= −
Output final classifier:
1
( ) T t t( )
t
sign H x αh x
=
⎛ ⎞
⎜ = ⎟
⎝
∑
⎠What base learner is proper for face detection?
Weak Learners for Face Detection
1 if ( ) ( ) 0 otherwise
t t t t
t
p f x p
h x ⎧ > θ
= ⎨
⎩
window
value of rectangle feature
parity threshold
Boosting
n
Training set contains face and nonface examples
q Initially, with equal weight n
For each round of boosting:
q Evaluate each rectangle filter on each example
q Select best threshold for each filter
q Select best filter/threshold combination
q Reweight examples
n
Computational complexity of learning: O(MNK)
q M rounds, N examples, K features
AdaBoost
• 1990 Schapire, Boosting
• 1996 Freund & Schapire, AdaBoost - Do not require any a priori knwoledge
• 2001 Viola & Jones - Face detection
- Haar descriptors, integral image, cascades - Robust, fast
• 2002 Lienhart et al.
- Extended descriptors, 45 degres rotation
Extended descriptors Haar descriptors
65
AdaBoost
Choisir un classifieur faible
Choose the classifier ht which gets minimum error Initialize weights of examples
P: w1,i = 1/2m ; N: w1,i = 1/2l
Strong Classifier
where
Evaluate the error of each classifier Normalize weights wi Provide training examples
(x1, y1), . . . , (xn , yn ) P: yi = 1; N: yi = 0
For t =1,2,…,T start
Next T
end
Update weights correct:
incorrect:
Obtenir un classifieur fort
66
Features Selected by Boosting
First two features selected by boosting:
This feature combination can yield 100% detection rate and 50% false positive rate
ROC Curve for 200-Feature Classifier
A 200-feature classifier can yield 95% detection rate and a false positive rate of 1 in 14084.
Not good enough!
To be practical for real application, the false positive rate must be closer to 1 in 1,000,000.
Attentional Cascade
Classifier 3 Classifier 2
Classifier 1 IMAGE
SUB-WINDOW
T T T FACE
NON-FACE
F F
NON-FACE
F NON-FACE n We start with simple classifiers which reject many of the
negative sub-windows while detecting almost all positive sub- windows
n Positive response from the first classifier triggers the evaluation of a second (more complex) classifier, and so on
n A negative outcome at any point leads to the immediate rejection of the sub-window
Attentional Cascade
Classifier 3 Classifier 2
Classifier 1 IMAGE
SUB-WINDOW
T T T FACE
NON-FACE
F F
NON-FACE
F NON-FACE n
Chain classifiers that are
progressively more complex and have lower false positive rates
0
100 0 % False Pos 50
% Detection
ROC Curve
Detection Rate and False Positive Rate for Chained Classifiers
Classifier 3 Classifier 2
Classifier 1 IMAGE
SUB-WINDOW
T T T FACE
NON-FACE
F F
NON-FACE
F NON-FACE n The detection rate and the false positive rate of the cascade
are found by multiplying the respective rates of the individual stages
n A detection rate of 0.9 and a false positive rate on the order of 10-6 can be achieved by a 10-stage cascade if each stage has a detection rate of 0.99 (0.9910 ≈ 0.9) and a false positive rate of about 0.30 (0.310 ≈ 6×10-6 )
1 1
,
f d f
2, d
2f
3, d
31 1
,
K iK i
i i
F f D d
=
=
=
= ∏ ∏
, F D
Training the Cascade
n
Set target detection and false positive rates for each stage
n
Keep adding features to the current stage until its target rates have been met
q Need to lower AdaBoost threshold to maximize detection (as opposed to minimizing total classification error)
q Test on a validation set
n
If the overall false positive rate is not low enough, then add another stage
n
Use false positives from current stage as the
negative training examples for the next stage
AdaBoost (6) – Cascade Principle
AdaBoost
1 N
Face x 99%
Non Face x 30%
1 1 0
0 0 0
Non Face x 70%
Face x 90%
Non Face x 0.00006%
1 1 1
0 1 0
AdaBoost 2
0 1
Non Face x 21%
Face x 98%
Non Face x 9%
73
Training the Cascade
Attentional Cascade
Decrease the set of negative examples
AdaBoost to choose the classifier
Update the threshold for Di > d * Di-1
Fi > f * Fi-1 ?
Initialize global rate F0 = 1 ; D0 = 1 Provide training exemples (x1, y1), . . . , (xn , yn ) P: yi = 1; N: yi = 0
begin
end Define max f /iteration, min d /iteration, global Ftarget
Fi > Ftarget ?
Next iteration
next classifier
Obtain a cascade of strong classifiers
Obtain a strong classifier
Weak classifier
75
ROC Curves Cascaded Classifier to Monlithic
Classifier
ROC Curves Cascaded Classifier to Monlithic Classifier
n There is little difference between the two in terms of accuracy.
n There is a big difference in terms of speed.
q The cascaded classifier is nearly 10 times faster since its first stage throws out most non-faces so that they are never evaluated by subsequent stages.
The Implemented System
n
Training Data
q 5000 faces
n All frontal, rescaled to 24x24 pixels
q 300 million non-faces
n 9500 non-face images
q Faces are normalized
n Scale, translation n
Many variations
q Across individuals
q Illumination
q Pose
Structure of the Detector Cascade
n Combining successively more complex classifiers in cascade
q 38 stages
q included a total of 6060 features
Reject Sub-Window
1 2 3 4 5 6 7 8 38 Face
F F F F F F F F F
T T T T T T T T T
All Sub-Windows
Structure of the Detector Cascade
Reject Sub-Window
1 2 3 4 5 6 7 8 38 Face
F F F F F F F F F
T T T T T T T T T
All Sub-Windows
2 features, reject 50% non-faces, detect 100% faces 10 features, reject 80% non-faces, detect 100% faces
25 features 50 features
by algorithm
Speed of the Final Detector
n
On a 700 Mhz Pentium III processor, the face detector can process a 384 ×288 pixel image in about .067 seconds”
q 15 Hz
q 15 times faster than previous detector of comparable accuracy (Rowley et al., 1998)
n
Average of 8 features evaluated per window on test set
Image Processing
n
Training ⎯ all example sub-windows were variance normalized to minimize the effect of different lighting conditions
n
Detection ⎯ variance normalized as well
2 2 1 2
m
Nx σ = − ∑
Can be computed using integral image
Can be computed using integral image of squared image
Scanning the Detector
n
Scaling is achieved by scaling the detector itself, rather than scaling the image
n
Good detection results for scaling factor of 1.25
n
The detector is scanned across location
q Subsequent locations are obtained by shifting the window [sΔ]
pixels, where s is the current scale
q The result for Δ = 1.0 and Δ = 1.5 were reported
Merging Multiple Detections
ROC Curves for Face Detection Results
• Fixed images
• Video sequence
Frontal face Left profile face Right profile face
86
• Fast and robust
• Other descriptors instead of Haar
• Other cascades (rotation…)
• Eye detection
Extension
87
Conclusions
n
How adaboost works?
n
Why adaboost works?
n
Adaboost for face detection
q
Rectangle features
q
Integral images for fast computation
q
Boosting for feature selection
q