Part I (Theory)

(1)

Supervised learning for computer vision:

Theory and algorithms - Part I

Francis Bach

¹

& Jean-Yves Audibert

^2,1

1.

INRIA - Ecole Normale Sup´erieure

2.

ENPC

ECCV Tutorial - Marseille, 2008

(2)

Outline

• Probabilistic model

• Local averaging algorithms

– Link between binary classification and regression – k-Nearest Neighbors

– Kernel estimate

– Partitioning estimate

• Empirical risk minimization and variants – Neural networks

– Convexification in binary classification – Support Vector Machines

(3)

Probabilistic model

• Training data = n input-output pairs :

(X₁, Y₁), . . . , (X_n, Y_n) i.i.d.

from some unknown distribution P

• A new input X comes.

• Goal: predict the corresponding output Y .

• probabilistic assumption:

(X, Y ) = another independent realization of P.

(4)

Some typical examples

• Computer Vision

– object recognition X = an image

Y = +1 if the image contains the object, Y = 0 otherwise

• Textual document

– X = a mail Y = spam vs non spam

• Insurance

– X = data of a future policy holder Y = premium

• Finance

– X = data of a loanee Y = loan rate

(5)

Measuring the quality of prediction (1/2)

• `(y, yˆ) = measure the loss incurred by predicting yˆ while the true output is y

• Typical losses are:

– the p-power loss for real outputs

`(y, yˆ) = |y − yˆ|^p

– the classification loss for discrete outputs (e.g. in {0, 1})

`(y, yˆ) = 1_y6=ˆ_y

(6)

Measuring the quality of prediction (2/2)

• A prediction function = mapping from input space to output space f : X 7→ f(X)

• Quality of a prediction function

Risk of f = R(f) = E `[Y, f(X)]

• The best prediction function (=Bayes predictor):

f^∗ = argmin

f

R(f)

(7)

Noise in classification:

R(f ) =

E

1

_Y _6=f_(X₎

=

P

(Y 6= f (X ))

Input space Density

Y=1

Y=0

⇒

^R(f^∗^{) = 0}

Input space Density

Y=1

Y=0

⇒

^R(f^∗⁾ ^> ⁰

0 or 6

(8)

Bayes predictors for typical losses

• In classification (when `(y, y) =ˆ 1_y6=ˆ_y):

f_cla^∗ (x) = argmax

y P(Y = y|X = x)

• In least square regression (when `(y, yˆ) = (y − yˆ)²) f_reg^∗ (x) = E(Y |X = x)

(9)

What is formally a supervised learning algorithm?

• An estimator of the unobservable f^∗

• An algorithm = a training sample is mapped to a prediction function fˆ : T = {(X₁, Y₁), . . . , (X_n, Y_n)} 7→ fˆ_T

• Quality of an algorithm for training samples of size n E_T R( ˆf_T )

Here the expectation is wrt the training sample distribution.

(10)

Uniformly universal consistency

• An algorithm is uniformly universally consistent if we have limn sup

P

©E_T R( ˆf_T ) − R(f^∗)ª

= 0

• Bad news: uniformly universally consistent algorithms do not exist [Devroye (1982); Audibert (2008)]

• Practical meaning: you will never know beforehand how much data is required to reach a predefined accuracy

(11)

Universal consistency [Stone (1977)]

• An algorithm is universally consistent if for any P generating the data, we have

E_T R( ˆf_T ) −→

n→+∞ R(f^∗), in other words:

sup

P

limn

©E_T R( ˆf_T ) − R(f^∗)ª

= 0

¡ 6= unif. univ. consistency: lim_n sup_P ©

E_T R( ˆf_T ) − R(f^∗)ª

= 0¢

• Good news: universally consistent algorithms do exist

• Practical meaning: for a sufficiently large amount of data, you will reach any desired accuracy

(12)

What should we expect from a good supervised learning algorithm?

• its universal consistency sup

P

limn

©E_T R( ˆf_T ) − R(f^∗)ª

= 0

• a locally uniform universal consistency sup

P∈P

©E_T R( ˆf_T ) − R(f^∗)ª

goes to 0 fast (typically in 1/n^γ),

for P a known class of distributions in which (we hope/know that) the unknown distribution P is.

(13)

Outline

– Kernel estimate

– Boosting

(14)

Link between binary classification and regression

Y ∈ {0, 1}

f_reg^∗ (x) = P(Y = 1|X = x)

⇒ 1_f_reg^∗ _(x)≥1/2 = argmax_y P(Y = y|X = x)= f_cla^∗ (x)

Theorem:

f_reg: real-valued function defined on the input space f_cla = 1_f_reg_≥1/2

R_cla(f_cla) − R_cla(f_cla^∗ ) ≤ 2 q

R_reg(f_reg) − R_reg(f_reg^∗ )

Corollary:

fˆ_reg universally consistent ⇒ fˆ_cla = 1_f_ˆ

reg≥1/2 universally consistent

(15)

Local averaging methods [Gy¨ orfi et al. (2004)]

Context: Y ∈ R `(y, y⁰) = (y − y⁰)²

• Recall: f^∗(x) = E(Y |X = x) unknown but (X₁, Y₁), . . . , (X_n, Y_n) observed

• Implementation

For an input x, predict the average of the Y_i of the X_i’s close to x fˆ : x 7→

Xn i=1

W_i(x)Y_i,

with W_i(x) appropriate functions of x, n, X₁, . . . , X_n.

(16)

Stone’s Theorem [Stone (1977)]:

sufficient conditions for universal consistency

Assume that the weights W_i satisfies for any distribution P 1. ∀ε > 0 P©¯

¯ Pⁿ

i=1 W_i(X) − 1¯

¯ > εª

n→+∞−→ 0

2. ∀a > 0 E© P_n

i=1 |W_i(X)|1_kX_i_−X_k>aª

n→+∞−→ 0 3. E P_n

i=1[W_i(X)]² −→

n→+∞ 0

4. + two technical assumptions Then fˆ : x 7→ P_n

i=1 W_i(x)Y_i is universally consistent

(17)

First example: the k-Nearest Neighbors

W_i(x) =

(1

k if X_i belongs to the k-n.n. of x among X₁, . . . , X_n 0 otherwise

e.g. for k = 1: if X_i N.N. of x, then fˆ₁(x) = Y_i. More generally:

fˆ_k(x) = 1 k

Xk

j=1

Y_i_j

Universal consistency [Stone (1977)] :

The k_n-N.N. is univ. consistent iff k_n → +∞ and k_n/n → 0

• The nearest neighbor (k = 1) algorithm is not universally consistent [Cover and Hart (1967)].

(18)

Using k-N.N.

• Requires full storage of training points

• Naive implementation: O(n)

• Refined implementation using trees: O(log n) at test time (but O(n log n) for building the tree) ⁽ http://www.cs.umd.edu/~mount/ANN/ )

• How to choose k? answer: by cross-validation i.e. take the k by minimizing the risk estimate of fˆ_k by

1 n

Xp

j=1

X

(x,y)∈B_j

£y − fˆ_k(∪_l6=jB_l)(x)¤₂ ,

where B₁, ..., B_p is a partition of the training sample:

(19)

Second example: kernel estimate [Nadaraya (1964);

Watson (1964)]

X ∈ R^d h > 0 K : R^d → R fˆ(x) =

Xn

i=1

µ K(^x−X_h ⁱ) P_n

l=1 K(^x−X_h ^l)

¶ Y_i

Universal consistency [Devroye and Wagner (1980); Spiegelman and Sacks (1980)]: Let B(0, u) be the Euclidean ball in R^d of radius u > 0. If there are 0 < a ≤ A et b > 0 s.t.

∀u ∈ R^d b1_B(0,a) ≤ K(u) ≤ 1_B(0,A)

and if h_n −→

n→+∞ 0 and nh^d_n −→

n→+∞ +∞, then fˆ is universally

K

A

−a a

−A

b 1

(20)

Partitioning estimate [Tukey (1947)]

X

X ∈ [0, 1]^d = X₁ t · · · t X_p fˆ(x) =

Xn

i=1

µ 1_X_i_∈X_j_(x) P_n

l=1 1_X_l_∈X_j_(x)

¶ Y_i,

with j(x) such that x ∈ X_j_(x) Universal consistency [Gy¨orfi (1991)]:

Let Diam(X_j) = sup_x₁_,x₂_∈X_j kx₁ − x₂k. If p/n −→

n→+∞ 0 and maxj Diam(X_j) −→

n→+∞ 0 then fˆ is universally consistent

• Meaning for a regular grid of width h_n: nh^d_n → +∞ and h_n → 0

(21)

Using the partitioning estimate

• Fast and simple but ...

• border effects: “Mind the gap!”

• nonobvious choice of the partition

• Variants of the partitioning estimate: decision trees with partitions built from the training data ...

(22)

Outline

– Kernel estimate

(23)

Empirical risk minimization

R(f) = E `(Y, f(X)) unobservable

• Empirical risk: r(f) = _n¹ P_n

i=1 `(Y_i, f(X_i)) observable

• Law of large numbers and central limit theorem:

r(f) −→^p.s.

n→+∞ R(f)

√n£

r(f) − R(f)¤ _L

n→+∞−→ N¡

0, Var `[Y, f(X)]¢ .

• Goal of learning: predict as well as f^∗ = argmin

f

R(f)

• a “natural” algorithm is therefore: fˆ_ERM ∈ argmin

f

r(f)

(24)

Natural choice does not work

fˆ_ERM ∈ argmin

f

r(f)

• There is an infinity of minimizers. Most of them will not perform well on test data. −→ Overfitting

Y_i

(25)

Why is it not working?

Prediction function space Risk

fˆERM

R

r

f^∗

R( ˆf_ERM) − R(f^∗) = R( ˆf_ERM) − r( ˆf_ERM) + r( ˆf_ERM) − r(f^∗) + r(f^∗) − R(f^∗)

≤ sup

f

©R(f) − r(f)ª

+ 0 + O(1/√ n)

∀f, R(f) − r(f) = O(1/√

n)

6=

^sup

f

©R(f) − r(f)ª

n→+∞−→ 0

(26)

f ˆ

_ERM

∈ argmin

f∈F

r(f ) with F appropriately chosen

• Choice of F ? Introduce f˜ ∈ argmin

f∈F

R(f),

R( ˆf_ERM) − R(f^∗) = R( ˆf_ERM) − R( ˜f)

| {z }

Estimation error

+ R( ˜f) − R(f^∗)

| {z } Approximation error R( ˆf_ERM) − R( ˜f) = R( ˆf_ERM) − r( ˆf_ERM) + r( ˆf_ERM) − r( ˜f)

+ r( ˜f) − R( ˜f)

≤ sup

f∈F

©R(f) − r(f)ª

+ 0 + O(1/√ n)

• F should be small enough to ensure sup

f∈F

©R(f) − r(f)ª

n→+∞−→ 0

• F should be large enough to ensure R( ˜f) − R(f^∗) −→ 0

(27)

First example : “neural networks” [Rosenblatt (1958, 1962)]

0 1

• Squashing function σ: a nondecreasing function with σ(x) −→

x→−∞ 0 σ(x) −→

x→+∞ 1

e.g. σ(x) = 1_x≥0 or σ(x) = 1/(1 + e^−x)

• (Artificial) neuron: function defined on R^d by

g(x) = σ

µ X_d

j=1

a_jx^(j) + a₀

¶

= σ(a · x)˜

(28)

• Neural network with one hidden layer: function defined on R^d by f(x) =

Xk

j=1

c_jσ(a_j · x) +˜ c₀

where x˜ =

µ 1 x

¶

= (1, x⁽¹⁾, . . . , x^(d))^T.

Σ 1

a⁽⁰⁾

k

(d)

a⁽

d) 1

a⁽¹⁾

1

a⁽⁰⁾

1

x⁽¹⁾ 1

c_k c₁

c₀

c₀ + Σ^k_j=1c_jσ(a_j · x˜) x

˜ x

Σ σ

a⁽¹⁾

k

(29)

Σ 1

a⁽⁰⁾ k

a⁽ d) k a⁽

d) 1

a⁽¹⁾ 1 a⁽⁰⁾

1

x⁽^d⁾ x⁽¹⁾

1

c_k c₁

c₀

c₀ + Σ^k_j=1c_jσ(a_j ·x˜) x

˜ x

Σ σ

a⁽¹⁾ k

Universal consistency in least square setting [Lugosi and Zeger (1995); Farag´o and Lugosi (1993)]:

(k_n) integer sequence (β_n) real sequence

F_n = { n. n. with one hidden layer, k ≤ k_n and P_k

j=0 |c_j| ≤ β_n } ERM on F_n is universally consistent if k_n → +∞, β_n → +∞ and

k_nβ_n⁴ log(k_nβ_n²)

−→ 0.

(30)

Using neural networks

• In practice: use of multilayer neural nets

• Squashing function ⇒ ERM = nonconvex optimization pb

⇒ any algorithm will end in a local minimum

• With good intuitions on how to build the neural nets and good heuristics to perform the minimization [LeCun et al. (1998); LeCun (2005); Simard et al. (2003)], neural nets are great...

(31)

Convexification of empirical risk minimization in binary classification

Y ∈ {−1; +1} R(g) = P[Y 6= g(X)]

• ERM: gˆ ∈ argmin

g∈G

P_n

i=1 1_Y_i_6=g(X_i₎ −→ highly nonconvex

• f real-valued function and g : x 7→ sign[f(x)]

−→ fˆ∈ argmin

f∈F

Xn i=1

1_Y_i_f_(X_i_)≤0 F convex

−→ fˆ∈ argmin

f∈F

Xn i=1

φ[Y_if(X_i)] φ convex

(32)

Criterion to choose the convex function φ

• φ-risk of f: A(f) = Eφ[Y f(X)].

• φ should satisfy:

fˆ univ. consistent for the φ-risk

⇒ sign( ˆf) univ. consistent for the classification risk

• Necessary and sufficient cond. [Bartlett et al. (2006)]:

φ is differentiable at 0 and φ⁰(0) < 0

(33)

Some convex functions useful for classification and their remarkable property

f^∗ best function for the φ-risk f a real-valued function

• φ(u) = (1 − u)₊ = max(1 − u, 0): S.V.M. loss

R[sign(f)] − R(g^∗) ≤ A(f) − A(f^∗)

• φ(u) = e^−u: AdaBoost loss

R[sign(f)] − R(g^∗) ≤ √

2p

A(f) − A(f^∗)

• φ(u) = log(1 + e^−u): Logistic regression loss R[sign(f)] − R(g^∗) ≤ √

2p

A(f) − A(f^∗)

• φ(u) = (1 − u)²: Least square regression loss R[sign(f)] − R(g^∗) ≤ p

A(f) − A(f^∗)

(34)

Support Vector Machines [Boser et al. (1992);

Vapnik (1995)]

C > 0 φ(u) = (1 − u)₊ H a Reproducing Kernel Hilbert Space

b∈R,h∈Hinf C

Xn i=1

φ¡

Y_i[h(X_i) + b]¢

+ 1

2khk²_H (P_C)

• Empirical φ-risk minim. on F = ©

x 7→ h(x) + b; khk_H ≤ λ, b ∈ Rª

finf∈F

1 n

Xn i=1

φ¡

Y_if(X_i)¢

(Q_λ)

• (ˆh_C,ˆb_C) solution of (P_C) ⇒ hˆ_C + ˆb_C sol. of (Q_λ) for λ = khˆ_Ck_H

(35)

Reproducing Kernel pre-Hilbert Space

• Let K : X × X → R symmetric (i.e. K(u, v) = K(v, u)) and positive semi-definite (i.e. ∀J ∈ N, ∀α ∈ R^J and ∀x₁, . . . , x_J P

1≤j,k≤J α_jα_kK(x_j, x_k) ≥ 0) – K is called a (Mercer) kernel – Examples: X = R^d

∗ linear kernel K(x, x⁰) = hx, x⁰i_Rd

∗ polynomial kernel K(x, x⁰) = ¡

1 + hx, x⁰i_Rd

¢_p

for p ∈ N^∗

∗ gaussian kernel K(x, x⁰) = e^−kx−x⁰^k²^/(2σ²⁾ for σ > 0.

• Let H⁰ be the linear span of K(x, ·) : x⁰ 7→ K(x, x⁰), equipped with

* X

1≤i≤I

α_iK(x_i, ·), X

1≤j≤J

α⁰_jK(x⁰_j, ·) +

H⁰

= X

i,j

α_iα⁰_jK(x_i, x⁰_j)

(36)

Reproducing Kernel Hilbert Space

• the closure H of H⁰ is an Hilbert space

• H (as H⁰) has the reproducing property:

i=1 α_iK(X_i, ·); ∀i, α_i ∈ Rª

( H S.V.M. pb = min

b∈R,h∈H_n C

Xn i=1

φ¡

Y_i[h(X_i) + b]¢

+ 1

2khk²_H

⇒ tractable (n + 1)-dimensional optimization task

(37)

Universal consistency and using S.V.M.

• Universal consistency [Steinwart (2002)]:

X ∈ [0; 1]^d σ > 0

The S.V.M. with gaussian kernel K : (x, x⁰) 7→ e^−kx−x⁰^k²^/(2σ²⁾ and parameter C = n^β−1 with 0 < β < 1/d is universally consistent.

• Practical choices:

– kernel: linear, polynomial, gaussian, ...

– C (and parameters of the kernel) cross-validated

• Choice of the kernel ←→ functions approximated by linear combinations of the functions K(x, ·) : x⁰ 7→ K(x, x⁰)

• gaussian kernel with σ → +∞ = linear kernel !

(38)

Boosting methods

• Let G be a set of functions from X to {−1, +1}

1. G = ©

x 7→ sign(x^(j⁾ − τ); j ∈ {1, . . . , d}, τ ∈ Rª

∪©

x 7→ 1_x∈A − 1_x∈A^c; A hyper-rectangle of R^dª

∪©

x 7→ 1_x∈A^c − 1_x∈A; A hyper-rectangle of R^dª

−1 +1 −1 +1

(39)

3. G = ©

univariate decision trees with number of leaves = d + 1ª

−1 +1 +1

+1

−1

• Boosting looks for classification function of the form x 7→ sign

µ X_m

j=1

λ_jg_j(x)

¶

• Question: choice of λ_j and g_j ?

(40)

Boosting by L

₁

-regularization

• Let F_λ = n P_m

j=1 λ_jg_j ; m ∈ N, λ_j ≥ 0, g_j ∈ G, P_m

j=1 λ_j = λ o

• φ(u) = e^u A_n(f) = _n¹ P_n

i=1 φ[Y_if(X_i)]

• Boosting by L₁-regularization:

fˆ_λ = argmin

f∈F_λ

A_n(f)

• Universal consistency [Lugosi and Vayatis (2004)]:

If λ = (log n)/4 and G is one of the previous choice (except choice 1), then sign( ˆf_λ) is universally consistent

(41)

Usual description of AdaBoost

• Initialisation: w_i = 1/n for i = 1, . . . , n

• Iterate: For j = 1 to J: – Take

g_j ∈ argmin

g∈G

1 n

Xn i=1

w_i1_g_(X_i_)6=Y_i and e_j the minimum value

– λ_j = ¹₂ log

³1−e_j e_j

´ .

– Update weights: for all i s.t. g_j(X_i) 6= Y_i, w_i ← w_i^1−e_e ^j

j . – Normalize the weights: w_i ← w_i/ P_n

i⁰=1 w_i⁰ for i = 1, . . . , n

• Output:

x 7→ sign

µ X_J

λ_jg_j(x)

¶

(42)

AdaBoost = greedy empirical φ-risk minimization

• φ(u) = e^u A_n(f) = _n¹ P_n

i=1 φ[Y_if(X_i)]

• f₀ = 0

• For j = 1 to J

• (λ_j, g_j) ∈ argmin

λ∈R,g∈G

A_n(f_j₋₁ + λg)

• f_j = f_j₋₁ + λ_jg_j

• Universal consistency [Bartlett and Traskin (2007)]:

If J = n^ν with 0 < ν < 1 and G is one of the previous choice (except choice 1), then AdaBoost is universally consistent

(43)

Link between boosting methods and S.V.M.

• AdaBoost output: x 7→ sign

µ P_J

j=1 λ_jg_j(x)

¶

• S.V.M. output: x 7→ sign

µ P_n

i=1 α_iK(X_i, x)+b

¶

• Consider K(x, x⁰) = P_J

j=1 g_j(x)g_j(x⁰). Then Xn

i=1

α_iK(X_i, x) =

Xn i=1

α_i

XJ j=1

g_j(X_i)g_j(x) =

XJ j=1

λ_jg_j(x)

with

λ_j =

Xn i=1

α_ig_j(X_i)

(44)

Boosting vs S.V.M. vs Neural networks

• Boosting advantages:

– Variable selection

– Ability to handle very large amount of features – Simple tricks to reduce computational complexity

– S.V.M. can be run at the end on the selected features

• S.V.M. advantages:

– Easy to use off-the-shelf – Consistently good results

• Neural networks advantages:

– Works well in practice

(45)

J.-Y. Audibert. Fast learning rates in statistical inference through aggregation. 2008. To be published in Annals of Statistics, http://www.e-publications.org/ims/submission/index.php/AOS/

user/submissionFile/1175?confirm=51fc3552.

P.L. Bartlett and M. Traskin. Adaboost is consistent. J. Mach. Learn. Res., 8:2347–2368, 2007.

P.L. Bartlett, M.I. Jordan, and J.D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101:138–156, 2006.

Boser, Guyon, and Vapnik. A training algorithm for optimal margin classifiers. In COLT: Proceedings of the Workshop on Computational Learning Theory, Morgan Kaufmann Publishers, 1992.

T. M. Cover and P. E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, IT-13, 1967.

L. Devroye. Any discrimination rule can have an arbitrarily bad probability of error for finite sample size. IEEE Trans. Pattern Analysis and Machine Intelligence, 4:154–157, 1982.

L. Devroye and T. Wagner. Distribution-free consistency results in nonparametric discrimination and regression function estimation. Annals of Statistics, 8:231–239, 1980.

András Faragó and Gábor Lugosi. Strong universal consistency of neural network classifiers. IEEE Transactions on Information Theory, 39(4):1146–1151, 1993.

L. Gy¨orfi. Nonparametric estimation II. statistically equivalent blocks and tolerance regions. In Nonparametric Functional Estimation and Related Topics, pages 329–338, 1991.

L. Gy¨orfi, M. Kohler, A. Krzy˙zak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer, 2004.

(46)

lecture09-optim.djvu, requiress the djvu reader http://djvu.org/download/.

Y. LeCun, L. Bottou, G. Orr, and K. Muller. Efficient backprop. In G. Orr and Muller K., editors, http: // yann. lecun. com/ exdb/ publis/ pdf/ lecun-98b. pdf, Neural Networks: Tricks of the trade. Springer, 1998.

G. Lugosi and N. Vayatis. On the bayes-risk consistency of regularized boosting methods. Ann. Stat., 32(1):30–55, 2004.

G´abor Lugosi and Kenneth Zeger. Nonparametric estimation via empirical risk minimization. IEEE Transactions on Information Theory, 41(3):677–687, 1995.

E. A. Nadaraya. On estimating regression. Theor. Probability Appl., 9:141–142, 1964.

F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408, 1958.

F. Rosenblatt. Principles of Neurodynamics. Spartan Books, Washington, 1962.

P.Y. Simard, D. Steinkraus, and J. Platt. Best practice for convolutional neural networks applied to visual document analysis. http: // research. microsoft. com/ ~patrice/PDF/fugu9.pdf, International Conference on Document Analysis and Recogntion (ICDAR), IEEE Computer Society, pages 958–962, 2003.

C. Spiegelman and J. Sacks. Consistent window estimation in nonparametric regression. Annals of Statistics, 8:240–246, 1980.

I. Steinwart. Support vector machines are universally consistent. Journal of Complexity, 18, 2002.

C. J. Stone. Consistent nonparametric regression (with discussion). Annals of Statistics, 5:595–645,

(47)

J. W. Tukey. Nonparametric estimation ii. statistically equivalent blocks and tolerance regions. Annals of Mathematical Statistics, 18:529–539, 1947.

V. Vapnik. The nature of statistical learning theory. Springer-Verlag, second edition, 1995.

G. S. Watson. Smooth regression analysis. Sankhya Series A, 26:359–372, 1964.