
(1)

Supervised learning for computer vision:
Theory and algorithms - Part I

Francis Bach (1) & Jean-Yves Audibert (2,1)

1. INRIA - Ecole Normale Supérieure
2. ENPC

ECCV Tutorial - Marseille, 2008

(2)

Outline

Probabilistic model

Local averaging algorithms
– Link between binary classification and regression
– k-Nearest Neighbors
– Kernel estimate
– Partitioning estimate

Empirical risk minimization and variants
– Neural networks
– Convexification in binary classification
– Support Vector Machines
– Boosting

(3)

Probabilistic model

Training data = n input-output pairs:

$(X_1, Y_1), \dots, (X_n, Y_n)$ i.i.d. from some unknown distribution $P$

A new input $X$ comes.

Goal: predict the corresponding output $Y$.

Probabilistic assumption: $(X, Y)$ is another independent realization of $P$.

(4)

Some typical examples

Computer vision
– Object recognition: X = an image, Y = 1 if the image contains the object, Y = 0 otherwise

Textual documents
– X = an e-mail, Y = spam vs. non-spam

Insurance
– X = data of a future policy holder, Y = premium

Finance
– X = data of a borrower, Y = loan rate

(5)

Measuring the quality of prediction (1/2)

$\ell(y, \hat{y})$ = the loss incurred by predicting $\hat{y}$ while the true output is $y$

Typical losses are:

– the p-power loss for real outputs

$$\ell(y, \hat{y}) = |y - \hat{y}|^p$$

– the classification loss for discrete outputs (e.g. in {0, 1})

$$\ell(y, \hat{y}) = \mathbf{1}_{y \neq \hat{y}}$$

(6)

Measuring the quality of prediction (2/2)

A prediction function = a mapping from the input space to the output space, $f : X \mapsto f(X)$

Quality of a prediction function

Risk of $f$: $R(f) = \mathbb{E}\,\ell[Y, f(X)]$

The best prediction function (= Bayes predictor):

$$f^* = \operatorname*{argmin}_f R(f)$$

(7)

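Noise in classification

$$R(f) = \mathbb{E}\,\mathbf{1}_{Y \neq f(X)} = \mathbb{P}(Y \neq f(X))$$

[Figure: two panels showing the densities of the classes Y = 1 and Y = 0 over the input space. In the first panel $R(f^*) = 0$; in the second $R(f^*) > 0$.]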

(8)

Bayes predictors for typical losses

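In classification (when $\ell(y, \hat{y}) = \mathbf{1}_{y \neq \hat{y}}$):

$$f^*_{\mathrm{cla}}(x) = \operatorname*{argmax}_y \mathbb{P}(Y = y \mid X = x)$$

In least square regression (when $\ell(y, \hat{y}) = (y - \hat{y})^2$):

$$f^*_{\mathrm{reg}}(x) = \mathbb{E}(Y \mid X = x)$$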
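As an illustration, the two Bayes predictors can be computed exactly when the joint distribution is known. The sketch below assumes a small, made-up discrete distribution (the table `joint` and all function names are hypothetical, not from the tutorial); in practice P is unknown, which is the whole point of learning.

```python
# Hypothetical joint distribution P(X = x, Y = y) on a tiny discrete space.
joint = {
    ("a", 0): 0.30, ("a", 1): 0.10,
    ("b", 0): 0.05, ("b", 1): 0.25,
    ("c", 0): 0.15, ("c", 1): 0.15,
}

def conditional(joint, x):
    """Return {y: P(Y = y | X = x)}."""
    px = sum(p for (xi, _), p in joint.items() if xi == x)
    return {y: p / px for (xi, y), p in joint.items() if xi == x}

def bayes_classifier(joint, x):
    """argmax_y P(Y = y | X = x): Bayes predictor for the classification loss."""
    cond = conditional(joint, x)
    return max(cond, key=cond.get)

def bayes_regressor(joint, x):
    """E(Y | X = x): Bayes predictor for the least square loss."""
    return sum(y * p for y, p in conditional(joint, x).items())

def classification_risk(joint, f):
    """R(f) = P(Y != f(X)) under the known joint distribution."""
    return sum(p for (x, y), p in joint.items() if f(x) != y)

bayes_risk = classification_risk(joint, lambda x: bayes_classifier(joint, x))
```

Here the Bayes risk is strictly positive because the two classes overlap at every input value.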

(9)

What is formally a supervised learning algorithm?

An estimator of the unobservable $f^*$

An algorithm = a mapping from a training sample to a prediction function:

$$\hat{f} : T = \{(X_1, Y_1), \dots, (X_n, Y_n)\} \mapsto \hat{f}_T$$

Quality of an algorithm for training samples of size n: $\mathbb{E}_T R(\hat{f}_T)$

Here the expectation is with respect to the training sample distribution.

(10)

Uniformly universal consistency

An algorithm is uniformly universally consistent if we have

$$\lim_{n \to +\infty} \sup_P \big\{ \mathbb{E}_T R(\hat{f}_T) - R(f^*) \big\} = 0$$

Bad news: uniformly universally consistent algorithms do not exist [Devroye (1982); Audibert (2008)]

Practical meaning: you will never know beforehand how much data is required to reach a predefined accuracy

(11)

Universal consistency [Stone (1977)]

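An algorithm is universally consistent if, for any $P$ generating the data, we have

$$\mathbb{E}_T R(\hat{f}_T) \xrightarrow[n \to +\infty]{} R(f^*),$$

in other words:

$$\sup_P \lim_{n \to +\infty} \big\{ \mathbb{E}_T R(\hat{f}_T) - R(f^*) \big\} = 0$$

($\neq$ uniformly universal consistency: $\lim_{n \to +\infty} \sup_P \big\{ \mathbb{E}_T R(\hat{f}_T) - R(f^*) \big\} = 0$)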

Good news: universally consistent algorithms do exist

Practical meaning: for a sufficiently large amount of data, you will reach any desired accuracy

(12)

What should we expect from a good supervised learning algorithm?

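– its universal consistency:

$$\sup_P \lim_{n \to +\infty} \big\{ \mathbb{E}_T R(\hat{f}_T) - R(f^*) \big\} = 0$$

– a locally uniform universal consistency:

$$\sup_{P \in \mathcal{P}} \big\{ \mathbb{E}_T R(\hat{f}_T) - R(f^*) \big\}$$

goes to 0 fast (typically in $1/n^\gamma$), for $\mathcal{P}$ a known class of distributions to which (we hope/know that) the unknown distribution $P$ belongs.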


(14)

Link between binary classification and regression

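$Y \in \{0, 1\}$

$f^*_{\mathrm{reg}}(x) = \mathbb{P}(Y = 1 \mid X = x)$

$\mathbf{1}_{f^*_{\mathrm{reg}}(x) \ge 1/2} = \operatorname*{argmax}_y \mathbb{P}(Y = y \mid X = x) = f^*_{\mathrm{cla}}(x)$

Theorem:

Let $f_{\mathrm{reg}}$ be a real-valued function defined on the input space and $f_{\mathrm{cla}} = \mathbf{1}_{f_{\mathrm{reg}} \ge 1/2}$. Then

$$R_{\mathrm{cla}}(f_{\mathrm{cla}}) - R_{\mathrm{cla}}(f^*_{\mathrm{cla}}) \le 2 \sqrt{R_{\mathrm{reg}}(f_{\mathrm{reg}}) - R_{\mathrm{reg}}(f^*_{\mathrm{reg}})}$$

Corollary:

$\hat{f}_{\mathrm{reg}}$ universally consistent $\Rightarrow$ $\hat{f}_{\mathrm{cla}} = \mathbf{1}_{\hat{f}_{\mathrm{reg}} \ge 1/2}$ universally consistent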

(15)

Local averaging methods [Györfi et al. (2004)]

Context: $Y \in \mathbb{R}$, $\ell(y, y') = (y - y')^2$

Recall: $f^*(x) = \mathbb{E}(Y \mid X = x)$ unknown but $(X_1, Y_1), \dots, (X_n, Y_n)$ observed

Implementation

For an input $x$, predict the average of the $Y_i$ of the $X_i$'s close to $x$:

$$\hat{f} : x \mapsto \sum_{i=1}^n W_i(x) Y_i,$$

with $W_i(x)$ appropriate functions of $x, n, X_1, \dots, X_n$.

(16)

Stone’s Theorem [Stone (1977)]:

sufficient conditions for universal consistency

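Assume that the weights $W_i$ satisfy, for any distribution $P$:

1. $\forall \varepsilon > 0$, $\mathbb{P}\big\{ \big| \sum_{i=1}^n W_i(X) - 1 \big| > \varepsilon \big\} \xrightarrow[n \to +\infty]{} 0$

2. $\forall a > 0$, $\mathbb{E}\big\{ \sum_{i=1}^n |W_i(X)| \, \mathbf{1}_{\|X_i - X\| > a} \big\} \xrightarrow[n \to +\infty]{} 0$

3. $\mathbb{E} \sum_{i=1}^n [W_i(X)]^2 \xrightarrow[n \to +\infty]{} 0$

4. + two technical assumptions

Then $\hat{f} : x \mapsto \sum_{i=1}^n W_i(x) Y_i$ is universally consistent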

(17)

First example: the k-Nearest Neighbors

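$$W_i(x) = \begin{cases} 1/k & \text{if } X_i \text{ belongs to the k-n.n. of } x \text{ among } X_1, \dots, X_n \\ 0 & \text{otherwise} \end{cases}$$

e.g. for k = 1: if $X_i$ is the N.N. of $x$, then $\hat{f}_1(x) = Y_i$. More generally:

$$\hat{f}_k(x) = \frac{1}{k} \sum_{j=1}^k Y_{i_j},$$

where $i_1, \dots, i_k$ index the k nearest neighbors of $x$.

Universal consistency [Stone (1977)]:

The $k_n$-N.N. is universally consistent iff $k_n \to +\infty$ and $k_n / n \to 0$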

The nearest neighbor (k = 1) algorithm is not universally consistent [Cover and Hart (1967)].
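The k-N.N. estimate above can be sketched in a few lines. This is a naive version under simple assumptions: inputs are tuples of floats, distances are Euclidean, and the function name `knn_predict` and the toy sample are illustrative only.

```python
import math

def knn_predict(train, x, k):
    """k-N.N. regression estimate: average of the Y_i over the k nearest X_i."""
    neighbours = sorted(train, key=lambda pair: math.dist(pair[0], x))[:k]
    return sum(y for _, y in neighbours) / k

# Toy training sample of (X_i, Y_i) pairs with Y = X^2 (made up for illustration).
train = [((0.0,), 0.0), ((1.0,), 1.0), ((2.0,), 4.0), ((3.0,), 9.0)]
```

With k = 1 the prediction is the output of the single nearest point; with k = n it is the global average of the outputs, whatever the input.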

(18)

Using k-N.N.

Requires full storage of training points

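Naive implementation: O(n) per test point

Refined implementation using trees: O(log n) at test time (but O(n log n) for building the tree) ( http://www.cs.umd.edu/~mount/ANN/ )

How to choose k? Answer: by cross-validation, i.e. take the k minimizing the risk estimate of $\hat{f}_k$ given by

$$\frac{1}{n} \sum_{j=1}^{p} \sum_{(x,y) \in B_j} \Big[ y - \hat{f}_k^{(\cup_{l \neq j} B_l)}(x) \Big]^2,$$

where $B_1, \dots, B_p$ is a partition of the training sample and $\hat{f}_k^{(\cup_{l \neq j} B_l)}$ is the k-N.N. estimate built from all blocks except $B_j$.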
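The cross-validation criterion can be sketched as follows. The block construction (interleaved slices), the function names and the noise-free toy sample are assumptions made for illustration; any partition of the training sample would do.

```python
import math

def knn_predict(train, x, k):
    """k-N.N. regression estimate (naive O(n log n) per query)."""
    neighbours = sorted(train, key=lambda pair: math.dist(pair[0], x))[:k]
    return sum(y for _, y in neighbours) / k

def cv_risk(sample, k, p):
    """p-fold cross-validation estimate of the squared-error risk of k-N.N."""
    blocks = [sample[j::p] for j in range(p)]  # B_1, ..., B_p
    total = 0.0
    for j, block in enumerate(blocks):
        # Training data for this fold: union of all blocks except B_j.
        rest = [pair for l, b in enumerate(blocks) if l != j for pair in b]
        total += sum((y - knn_predict(rest, x, k)) ** 2 for x, y in block)
    return total / len(sample)

def choose_k(sample, candidates, p):
    """Pick the candidate k with the smallest cross-validated risk."""
    return min(candidates, key=lambda k: cv_risk(sample, k, p))

# Toy sample: Y = X on a regular grid of [0, 1) (no noise).
sample = [((i / 20,), i / 20) for i in range(20)]
```

Each candidate k must not exceed the size of the smallest training fold, otherwise the average in `knn_predict` is taken over fewer than k points.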

(19)

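Second example: kernel estimate [Nadaraya (1964); Watson (1964)]

$\mathcal{X} \subset \mathbb{R}^d$, $h > 0$, $K : \mathbb{R}^d \to \mathbb{R}$

$$\hat{f}(x) = \sum_{i=1}^n \left( \frac{K\big(\frac{x - X_i}{h}\big)}{\sum_{l=1}^n K\big(\frac{x - X_l}{h}\big)} \right) Y_i$$

Universal consistency [Devroye and Wagner (1980); Spiegelman and Sacks (1980)]: Let $B(0, u)$ be the Euclidean ball in $\mathbb{R}^d$ of radius $u > 0$. If there are $0 < a \le A$ and $b > 0$ s.t.

$$\forall u \in \mathbb{R}^d, \quad b \, \mathbf{1}_{B(0,a)}(u) \le K(u) \le \mathbf{1}_{B(0,A)}(u)$$

and if $h_n \xrightarrow[n \to +\infty]{} 0$ and $n h_n^d \xrightarrow[n \to +\infty]{} +\infty$, then $\hat{f}$ is universally consistent.

[Figure: a kernel K lying between $b \, \mathbf{1}_{B(0,a)}$ and $\mathbf{1}_{B(0,A)}$, with marks at −A, −a, a, A on the horizontal axis and at b, 1 on the vertical axis.]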
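A minimal sketch of the kernel estimate, assuming the naive (window) kernel $K(u) = \mathbf{1}_{\|u\| \le 1}$, which satisfies the sandwich condition with a = A = 1 and b = 1. The function names and the convention of returning 0 when no training point falls in the window are assumptions for illustration.

```python
import math

def naive_kernel(u):
    """K(u) = 1 if ||u|| <= 1, else 0."""
    return 1.0 if math.sqrt(sum(c * c for c in u)) <= 1.0 else 0.0

def kernel_estimate(train, x, h, kernel=naive_kernel):
    """Nadaraya-Watson estimate: kernel-weighted average of the Y_i."""
    weights = [kernel(tuple((a - b) / h for a, b in zip(x, xi)))
               for xi, _ in train]
    total = sum(weights)
    if total == 0.0:
        return 0.0  # assumed convention when the window contains no point
    return sum(w * y for w, (_, y) in zip(weights, train)) / total

# Toy training sample of (X_i, Y_i) pairs (made up for illustration).
train = [((0.0,), 0.0), ((1.0,), 1.0), ((2.0,), 4.0), ((3.0,), 9.0)]
```

The bandwidth h plays the role that k plays for nearest neighbors: a small h gives a local but noisy estimate, a large h a smooth but biased one.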

(20)

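Partitioning estimate [Tukey (1947)]

$\mathcal{X} \subset [0, 1]^d$, $\mathcal{X} = \mathcal{X}_1 \sqcup \dots \sqcup \mathcal{X}_p$

$$\hat{f}(x) = \sum_{i=1}^n \left( \frac{\mathbf{1}_{X_i \in \mathcal{X}_{j(x)}}}{\sum_{l=1}^n \mathbf{1}_{X_l \in \mathcal{X}_{j(x)}}} \right) Y_i,$$

with $j(x)$ such that $x \in \mathcal{X}_{j(x)}$

Universal consistency [Györfi (1991)]:

Let $\mathrm{Diam}(\mathcal{X}_j) = \sup_{x_1, x_2 \in \mathcal{X}_j} \|x_1 - x_2\|$. If $p/n \xrightarrow[n \to +\infty]{} 0$ and $\max_j \mathrm{Diam}(\mathcal{X}_j) \xrightarrow[n \to +\infty]{} 0$, then $\hat{f}$ is universally consistent

Meaning for a regular grid of width $h_n$: $n h_n^d \to +\infty$ and $h_n \to 0$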
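For a regular grid, the partitioning estimate reduces to averaging the outputs cell by cell. The sketch below assumes inputs in $[0, 1]^d$, `m` cells per axis (so the width is 1/m), and a default prediction of 0 in empty cells; these names and conventions are illustrative assumptions.

```python
from collections import defaultdict

def cell(x, m):
    """Index j(x) of the grid cell containing x, with m cells per axis."""
    return tuple(min(int(c * m), m - 1) for c in x)

def fit_partition(train, m):
    """Average of the Y_i in each non-empty cell of the regular grid."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in train:
        j = cell(x, m)
        sums[j] += y
        counts[j] += 1
    return {j: sums[j] / counts[j] for j in sums}

def predict_partition(means, x, m, default=0.0):
    """Piecewise-constant prediction; `default` is used for empty cells."""
    return means.get(cell(x, m), default)

# Toy training sample in [0, 1] (made up for illustration).
train = [((0.1,), 1.0), ((0.2,), 3.0), ((0.6,), 5.0)]
means = fit_partition(train, 2)
```

Training is a single pass over the data and prediction is a dictionary lookup, which is why the method is fast; the empty-cell case shows where the choice of partition starts to matter.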

(21)

Using the partitioning estimate

Fast and simple but ...

border effects: “Mind the gap!”

nonobvious choice of the partition

Variants of the partitioning estimate: decision trees with partitions built from the training data ...


(23)

Empirical risk minimization

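$R(f) = \mathbb{E}\,\ell(Y, f(X))$: unobservable

Empirical risk: $r(f) = \frac{1}{n} \sum_{i=1}^n \ell(Y_i, f(X_i))$: observable

Law of large numbers and central limit theorem:

$$r(f) \xrightarrow[n \to +\infty]{\text{a.s.}} R(f)$$

$$\sqrt{n} \, \big[ r(f) - R(f) \big] \xrightarrow[n \to +\infty]{\mathcal{L}} \mathcal{N}\big(0, \mathrm{Var}\,\ell[Y, f(X)]\big)$$

Goal of learning: predict as well as $f^* = \operatorname*{argmin}_f R(f)$

A "natural" algorithm is therefore: $\hat{f}_{\mathrm{ERM}} \in \operatorname*{argmin}_f r(f)$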

(24)

Natural choice does not work

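$$\hat{f}_{\mathrm{ERM}} \in \operatorname*{argmin}_f r(f)$$

There is an infinity of minimizers. Most of them will not perform well on test data. → Overfitting

[Figure: a function interpolating the training outputs $Y_i$.]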

(25)

Why is it not working?

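[Figure: the risk $R$ and the empirical risk $r$ plotted over the prediction function space, with $\hat{f}_{\mathrm{ERM}}$ and $f^*$ marked.]

$$R(\hat{f}_{\mathrm{ERM}}) - R(f^*) = \big[ R(\hat{f}_{\mathrm{ERM}}) - r(\hat{f}_{\mathrm{ERM}}) \big] + \big[ r(\hat{f}_{\mathrm{ERM}}) - r(f^*) \big] + \big[ r(f^*) - R(f^*) \big]$$

$$\le \sup_f \big\{ R(f) - r(f) \big\} + 0 + O(1/\sqrt{n})$$

$\forall f,\ R(f) - r(f) = O(1/\sqrt{n})$ does not imply $\sup_f \big\{ R(f) - r(f) \big\} \xrightarrow[n \to +\infty]{} 0$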

(26)

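$\hat{f}_{\mathrm{ERM}} \in \operatorname*{argmin}_{f \in \mathcal{F}} r(f)$ with $\mathcal{F}$ appropriately chosen

Choice of $\mathcal{F}$? Introduce $\tilde{f} \in \operatorname*{argmin}_{f \in \mathcal{F}} R(f)$,

$$R(\hat{f}_{\mathrm{ERM}}) - R(f^*) = \underbrace{R(\hat{f}_{\mathrm{ERM}}) - R(\tilde{f})}_{\text{Estimation error}} + \underbrace{R(\tilde{f}) - R(f^*)}_{\text{Approximation error}}$$

$$R(\hat{f}_{\mathrm{ERM}}) - R(\tilde{f}) = \big[ R(\hat{f}_{\mathrm{ERM}}) - r(\hat{f}_{\mathrm{ERM}}) \big] + \big[ r(\hat{f}_{\mathrm{ERM}}) - r(\tilde{f}) \big] + \big[ r(\tilde{f}) - R(\tilde{f}) \big]$$

$$\le \sup_{f \in \mathcal{F}} \big\{ R(f) - r(f) \big\} + 0 + O(1/\sqrt{n})$$

$\mathcal{F}$ should be small enough to ensure $\sup_{f \in \mathcal{F}} \big\{ R(f) - r(f) \big\} \xrightarrow[n \to +\infty]{} 0$

$\mathcal{F}$ should be large enough to ensure $R(\tilde{f}) - R(f^*) \to 0$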

(27)

First example : “neural networks” [Rosenblatt (1958, 1962)]

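[Figure: a squashing function increasing from 0 to 1.]

Squashing function $\sigma$: a nondecreasing function with $\sigma(x) \xrightarrow[x \to -\infty]{} 0$ and $\sigma(x) \xrightarrow[x \to +\infty]{} 1$

e.g. $\sigma(x) = \mathbf{1}_{x \ge 0}$ or $\sigma(x) = 1/(1 + e^{-x})$

(Artificial) neuron: function defined on $\mathbb{R}^d$ by

$$g(x) = \sigma\Big( \sum_{j=1}^d a_j x^{(j)} + a_0 \Big) = \sigma(a \cdot \tilde{x})$$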

(28)

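Neural network with one hidden layer: function defined on $\mathbb{R}^d$ by

$$f(x) = \sum_{j=1}^k c_j \sigma(a_j \cdot \tilde{x}) + c_0$$

where $\tilde{x} = \begin{pmatrix} 1 \\ x \end{pmatrix} = (1, x^{(1)}, \dots, x^{(d)})^T$.

[Figure: network diagram. The inputs $1, x^{(1)}, \dots, x^{(d)}$ feed k hidden units, unit j computing $\sigma(a_j \cdot \tilde{x})$ with weights $a_j^{(0)}, \dots, a_j^{(d)}$; the hidden outputs are combined with weights $c_1, \dots, c_k$ and offset $c_0$ into $c_0 + \sum_{j=1}^k c_j \sigma(a_j \cdot \tilde{x})$.]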
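The forward computation of such a network is short enough to write out. This is only a sketch of the function class (no training); the function names and the parameter layout, with `a[0]` holding the offset $a_0$, are assumptions for illustration.

```python
import math

def sigmoid(t):
    """Logistic squashing function 1 / (1 + e^{-t})."""
    return 1.0 / (1.0 + math.exp(-t))

def step(t):
    """Threshold squashing function 1_{t >= 0}."""
    return 1.0 if t >= 0 else 0.0

def neuron(a, x, squash=sigmoid):
    """g(x) = sigma(a . x~) with x~ = (1, x); a[0] is the offset a_0."""
    x_tilde = (1.0, *x)
    return squash(sum(ai * xi for ai, xi in zip(a, x_tilde)))

def one_hidden_layer(c0, c, A, x, squash=sigmoid):
    """f(x) = c_0 + sum_j c_j sigma(a_j . x~), with one row of A per hidden unit."""
    return c0 + sum(cj * neuron(aj, x, squash) for cj, aj in zip(c, A))

# Example with k = 2 hidden units on R^2 (weights made up for illustration).
value = one_hidden_layer(0.5, [2.0, -1.0],
                         [(0.0, 0.0, 0.0), (-1.0, 1.0, 0.0)],
                         (0.0, 3.0), squash=step)
```

In this example the first unit fires (its pre-activation is 0) and the second does not (its pre-activation is −1), so the output is $c_0 + c_1$.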

(29)


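Universal consistency in least square setting [Lugosi and Zeger (1995); Faragó and Lugosi (1993)]:

$(k_n)$ integer sequence, $(\beta_n)$ real sequence

$\mathcal{F}_n = \big\{ \text{n.n. with one hidden layer, } k \le k_n \text{ and } \sum_{j=0}^k |c_j| \le \beta_n \big\}$

ERM on $\mathcal{F}_n$ is universally consistent if $k_n \to +\infty$, $\beta_n \to +\infty$ and

$$\frac{k_n \beta_n^4 \log(k_n \beta_n^2)}{n} \longrightarrow 0.$$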

(30)

Using neural networks

In practice: use of multilayer neural nets

Squashing function ⇒ ERM = nonconvex optimization problem

⇒ any algorithm will end in a local minimum

With good intuitions on how to build the neural nets and good heuristics to perform the minimization [LeCun et al. (1998); LeCun (2005); Simard et al. (2003)], neural nets are great...

(31)

Convexification of empirical risk minimization in binary classification

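$Y \in \{-1; +1\}$, $R(g) = \mathbb{P}[Y \neq g(X)]$

ERM: $\hat{g} \in \operatorname*{argmin}_{g \in \mathcal{G}} \sum_{i=1}^n \mathbf{1}_{Y_i \neq g(X_i)}$ → highly nonconvex

$f$ real-valued function and $g : x \mapsto \mathrm{sign}[f(x)]$

→ $\hat{f} \in \operatorname*{argmin}_{f \in \mathcal{F}} \sum_{i=1}^n \mathbf{1}_{Y_i f(X_i) \le 0}$, with $\mathcal{F}$ convex

→ $\hat{f} \in \operatorname*{argmin}_{f \in \mathcal{F}} \sum_{i=1}^n \phi[Y_i f(X_i)]$, with $\phi$ convex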

(32)

Criterion to choose the convex function φ

φ-risk of $f$: $A(f) = \mathbb{E}\,\phi[Y f(X)]$.

φ should satisfy:

$\hat{f}$ universally consistent for the φ-risk ⇒ $\mathrm{sign}(\hat{f})$ universally consistent for the classification risk

Necessary and sufficient condition [Bartlett et al. (2006)]:

φ is differentiable at 0 and $\phi'(0) < 0$

(33)

Some convex functions useful for classification and their remarkable property

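$f^*$: best function for the φ-risk; $g^*$: best classifier for the classification risk; $f$: a real-valued function

– $\phi(u) = (1 - u)_+ = \max(1 - u, 0)$: S.V.M. loss

$$R[\mathrm{sign}(f)] - R(g^*) \le A(f) - A(f^*)$$

– $\phi(u) = e^{-u}$: AdaBoost loss

$$R[\mathrm{sign}(f)] - R(g^*) \le \sqrt{2} \, \sqrt{A(f) - A(f^*)}$$

– $\phi(u) = \log(1 + e^{-u})$: Logistic regression loss

$$R[\mathrm{sign}(f)] - R(g^*) \le \sqrt{2} \, \sqrt{A(f) - A(f^*)}$$

– $\phi(u) = (1 - u)^2$: Least square regression loss

$$R[\mathrm{sign}(f)] - R(g^*) \le \sqrt{A(f) - A(f^*)}$$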
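The four convex losses above, and the empirical counterpart of the φ-risk, can be written down directly. The sketch below also checks the condition $\phi'(0) < 0$ numerically with a central difference; the dictionary `LOSSES`, the function names and the three-point sample are illustrative assumptions.

```python
import math

# The four convex surrogates, as functions of the margin u = y * f(x).
LOSSES = {
    "hinge": lambda u: max(1.0 - u, 0.0),
    "exponential": lambda u: math.exp(-u),
    "logistic": lambda u: math.log(1.0 + math.exp(-u)),
    "square": lambda u: (1.0 - u) ** 2,
}

def empirical_phi_risk(phi, f, sample):
    """(1/n) sum_i phi(Y_i f(X_i)) for labels Y_i in {-1, +1}."""
    return sum(phi(y * f(x)) for x, y in sample) / len(sample)

def derivative_at_zero(phi, eps=1e-6):
    """Central-difference approximation of phi'(0)."""
    return (phi(eps) - phi(-eps)) / (2.0 * eps)

# Toy sample (made up): margins of f(x) = x are 1, 2 and -0.5.
sample = [(1.0, +1), (-2.0, -1), (0.5, -1)]
hinge_risk = empirical_phi_risk(LOSSES["hinge"], lambda x: x, sample)
```

All four derivatives at 0 come out negative (about −1, −1, −1/2 and −2 respectively), in line with the calibration condition of the previous slide.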

(34)

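Support Vector Machines [Boser et al. (1992); Vapnik (1995)]

$C > 0$, $\phi(u) = (1 - u)_+$, $\mathcal{H}$ a Reproducing Kernel Hilbert Space

$$\inf_{b \in \mathbb{R},\, h \in \mathcal{H}} \ C \sum_{i=1}^n \phi\big( Y_i [h(X_i) + b] \big) + \frac{1}{2} \|h\|_{\mathcal{H}}^2 \qquad (P_C)$$

Empirical φ-risk minimization on $\mathcal{F} = \big\{ x \mapsto h(x) + b;\ \|h\|_{\mathcal{H}} \le \lambda,\ b \in \mathbb{R} \big\}$:

$$\inf_{f \in \mathcal{F}} \ \frac{1}{n} \sum_{i=1}^n \phi\big( Y_i f(X_i) \big) \qquad (Q_\lambda)$$

$(\hat{h}_C, \hat{b}_C)$ solution of $(P_C)$ ⇒ $\hat{h}_C + \hat{b}_C$ solution of $(Q_\lambda)$ for $\lambda = \|\hat{h}_C\|_{\mathcal{H}}$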

(35)

Reproducing Kernel pre-Hilbert Space

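Let $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be symmetric (i.e. $K(u, v) = K(v, u)$) and positive semi-definite (i.e. $\forall J \in \mathbb{N}$, $\forall \alpha \in \mathbb{R}^J$ and $\forall x_1, \dots, x_J \in \mathcal{X}$: $\sum_{1 \le j,k \le J} \alpha_j \alpha_k K(x_j, x_k) \ge 0$)

– $K$ is called a (Mercer) kernel

– Examples: $\mathcal{X} = \mathbb{R}^d$

linear kernel: $K(x, x') = \langle x, x' \rangle_{\mathbb{R}^d}$

polynomial kernel: $K(x, x') = \big( 1 + \langle x, x' \rangle_{\mathbb{R}^d} \big)^p$ for $p \in \mathbb{N}$

gaussian kernel: $K(x, x') = e^{-\|x - x'\|^2 / (2\sigma^2)}$ for $\sigma > 0$.

Let $\mathcal{H}_0$ be the linear span of the functions $K(x, \cdot) : x' \mapsto K(x, x')$, equipped with

$$\Big\langle \sum_{1 \le i \le I} \alpha_i K(x_i, \cdot),\ \sum_{1 \le j \le J} \alpha'_j K(x'_j, \cdot) \Big\rangle_{\mathcal{H}_0} = \sum_{i,j} \alpha_i \alpha'_j K(x_i, x'_j)$$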
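The three example kernels and the quadratic form in the positive semi-definiteness condition can be sketched as follows. The function names, the default parameters and the random test points are assumptions for illustration; a numerical check on one draw is of course not a proof of positive semi-definiteness.

```python
import math
import random

def linear_kernel(x, z):
    """K(x, x') = <x, x'>."""
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, p=2):
    """K(x, x') = (1 + <x, x'>)^p."""
    return (1.0 + linear_kernel(x, z)) ** p

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def quadratic_form(kernel, points, alpha):
    """sum_{j,k} alpha_j alpha_k K(x_j, x_k), i.e. the squared H_0-norm
    of sum_j alpha_j K(x_j, .)."""
    return sum(aj * ak * kernel(xj, xk)
               for aj, xj in zip(alpha, points)
               for ak, xk in zip(alpha, points))

rng = random.Random(0)  # fixed seed so the example is reproducible
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(5)]
alpha = [rng.uniform(-1, 1) for _ in range(5)]
```

Up to rounding error, `quadratic_form` is nonnegative for each of the three kernels, whatever the points and coefficients.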

(36)

Reproducing Kernel Hilbert Space

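– the closure $\mathcal{H}$ of $\mathcal{H}_0$ is a Hilbert space

– $\mathcal{H}$ (as $\mathcal{H}_0$) has the reproducing property: $\langle f, K(x, \cdot) \rangle_{\mathcal{H}} = f(x)$ for any $f \in \mathcal{H}$

Back to S.V.M.:

Training set: $(X_1, Y_1), \dots, (X_n, Y_n)$

Let $\mathcal{H}_n = \big\{ \sum_{i=1}^n \alpha_i K(X_i, \cdot);\ \forall i,\ \alpha_i \in \mathbb{R} \big\} \subset \mathcal{H}$

$$\text{S.V.M. problem} = \min_{b \in \mathbb{R},\, h \in \mathcal{H}_n} \ C \sum_{i=1}^n \phi\big( Y_i [h(X_i) + b] \big) + \frac{1}{2} \|h\|_{\mathcal{H}}^2$$

→ tractable (n + 1)-dimensional optimization task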

(37)

Universal consistency and using S.V.M.

Universal consistency [Steinwart (2002)]:

$\mathcal{X} \subset [0; 1]^d$, $\sigma > 0$

The S.V.M. with gaussian kernel $K : (x, x') \mapsto e^{-\|x - x'\|^2 / (2\sigma^2)}$ and parameter $C = n^{\beta - 1}$ with $0 < \beta < 1/d$ is universally consistent.

Practical choices:

– kernel: linear, polynomial, gaussian, ...

– $C$ (and parameters of the kernel) cross-validated

Choice of the kernel ←→ functions approximated by linear combinations of the functions $K(x, \cdot) : x' \mapsto K(x, x')$

gaussian kernel with $\sigma \to +\infty$ = linear kernel!

(38)

Boosting methods

Let G be a set of functions from X to {−1, +1}

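1. $\mathcal{G} = \big\{ x \mapsto \mathrm{sign}(x^{(j)} - \tau);\ j \in \{1, \dots, d\},\ \tau \in \mathbb{R} \big\} \cup \big\{ x \mapsto \mathrm{sign}(-x^{(j)} + \tau);\ j \in \{1, \dots, d\},\ \tau \in \mathbb{R} \big\}$

2. $\mathcal{G} = \big\{ x \mapsto \mathbf{1}_{x \in A} - \mathbf{1}_{x \in A^c};\ A \text{ hyper-rectangle of } \mathbb{R}^d \big\} \cup \big\{ x \mapsto \mathbf{1}_{x \in A^c} - \mathbf{1}_{x \in A};\ A \text{ hyper-rectangle of } \mathbb{R}^d \big\}$

[Figure: examples of such classifiers, with regions labelled −1 and +1.]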

(39)

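3. $\mathcal{G} = \big\{ x \mapsto \mathbf{1}_{x \in H} - \mathbf{1}_{x \in H^c};\ H \text{ halfspace of } \mathbb{R}^d \big\}$

4. $\mathcal{G} = \big\{ \text{univariate decision trees with number of leaves} = d + 1 \big\}$

[Figure: examples of such classifiers, with regions labelled −1 and +1.]

Boosting looks for a classification function of the form

$$x \mapsto \mathrm{sign}\Big( \sum_{j=1}^m \lambda_j g_j(x) \Big)$$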

Question: choice of λj and gj ?

(40)

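Boosting by L1-regularization

Let $\mathcal{F}_\lambda = \big\{ \sum_{j=1}^m \lambda_j g_j;\ m \in \mathbb{N},\ \lambda_j \ge 0,\ g_j \in \mathcal{G},\ \sum_{j=1}^m \lambda_j = \lambda \big\}$

$\phi(u) = e^{-u}$, $A_n(f) = \frac{1}{n} \sum_{i=1}^n \phi[Y_i f(X_i)]$

Boosting by L1-regularization:

$$\hat{f}_\lambda = \operatorname*{argmin}_{f \in \mathcal{F}_\lambda} A_n(f)$$

Universal consistency [Lugosi and Vayatis (2004)]:

If $\lambda = (\log n)/4$ and $\mathcal{G}$ is one of the previous choices (except choice 1), then $\mathrm{sign}(\hat{f}_\lambda)$ is universally consistent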

(41)

Usual description of AdaBoost

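Initialisation: $w_i = 1/n$ for $i = 1, \dots, n$

Iterate: For $j = 1$ to $J$:

– Take $g_j \in \operatorname*{argmin}_{g \in \mathcal{G}} \sum_{i=1}^n w_i \mathbf{1}_{g(X_i) \neq Y_i}$ and $e_j$ the minimum value

– Set $\lambda_j = \frac{1}{2} \log\Big( \frac{1 - e_j}{e_j} \Big)$.

– Update weights: for all $i$ s.t. $g_j(X_i) \neq Y_i$, $w_i \leftarrow w_i \frac{1 - e_j}{e_j}$.

– Normalize the weights: $w_i \leftarrow w_i / \sum_{i'=1}^n w_{i'}$ for $i = 1, \dots, n$

Output:

$$x \mapsto \mathrm{sign}\Big( \sum_{j=1}^J \lambda_j g_j(x) \Big)$$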
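The iteration above can be sketched with decision stumps (choice 1 of the weak classifier sets) as the class of weak classifiers. This is a naive illustration, not an efficient implementation: thresholds are restricted to observed coordinate values, the loop stops early when the best stump is perfect or no better than chance, and all names and the toy data are assumptions.

```python
import math

def stumps(X):
    """Enumerate stumps x -> s * sign(x^(j) - tau), thresholds at data values."""
    for j in range(len(X[0])):
        for tau in sorted({x[j] for x in X}):
            for s in (+1, -1):
                yield (j, tau, s)

def stump_predict(g, x):
    j, tau, s = g
    return s if x[j] - tau >= 0 else -s

def adaboost(X, Y, J):
    n = len(X)
    w = [1.0 / n] * n                      # initial weights
    candidates = list(stumps(X))
    ensemble = []                          # list of (lambda_j, g_j)
    for _ in range(J):
        def weighted_error(g):
            return sum(wi for wi, x, y in zip(w, X, Y) if stump_predict(g, x) != y)
        g = min(candidates, key=weighted_error)
        e = weighted_error(g)
        if e <= 0.0 or e >= 0.5:           # simplification: stop on degenerate cases
            break
        ensemble.append((0.5 * math.log((1.0 - e) / e), g))
        # Increase the weights of misclassified points, then normalize.
        w = [wi * (1.0 - e) / e if stump_predict(g, x) != y else wi
             for wi, x, y in zip(w, X, Y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    score = sum(lam * stump_predict(g, x) for lam, g in ensemble)
    return 1 if score >= 0 else -1

# Toy 1-D data that no single stump classifies correctly.
X = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
Y = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(X, Y, 3)
```

On this toy sample three rounds are enough to classify every training point correctly, although each individual stump misclassifies at least two points.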

(42)

AdaBoost = greedy empirical φ-risk minimization

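$\phi(u) = e^{-u}$, $A_n(f) = \frac{1}{n} \sum_{i=1}^n \phi[Y_i f(X_i)]$

$f_0 = 0$

For $j = 1$ to $J$:

$(\lambda_j, g_j) \in \operatorname*{argmin}_{\lambda \in \mathbb{R},\, g \in \mathcal{G}} A_n(f_{j-1} + \lambda g)$

$f_j = f_{j-1} + \lambda_j g_j$

Universal consistency [Bartlett and Traskin (2007)]:

If $J = n^\nu$ with $0 < \nu < 1$ and $\mathcal{G}$ is one of the previous choices (except choice 1), then AdaBoost is universally consistent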

(43)

Link between boosting methods and S.V.M.

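AdaBoost output: $x \mapsto \mathrm{sign}\Big( \sum_{j=1}^J \lambda_j g_j(x) \Big)$

S.V.M. output: $x \mapsto \mathrm{sign}\Big( \sum_{i=1}^n \alpha_i K(X_i, x) + b \Big)$

Consider $K(x, x') = \sum_{j=1}^J g_j(x) g_j(x')$. Then

$$\sum_{i=1}^n \alpha_i K(X_i, x) = \sum_{i=1}^n \alpha_i \sum_{j=1}^J g_j(X_i) g_j(x) = \sum_{j=1}^J \lambda_j g_j(x)$$

with

$$\lambda_j = \sum_{i=1}^n \alpha_i g_j(X_i)$$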
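The identity can be checked numerically on a small example. The weak classifiers, training inputs and coefficients below are arbitrary values chosen for illustration, not the output of an actual S.V.M. or AdaBoost run.

```python
# Three hypothetical weak classifiers on the real line, with values in {-1, +1}.
weak = [
    lambda x: 1 if x >= 0.5 else -1,
    lambda x: 1 if x >= 1.5 else -1,
    lambda x: -1 if x >= 1.0 else 1,
]

def induced_kernel(x, z):
    """K(x, x') = sum_j g_j(x) g_j(x')."""
    return sum(g(x) * g(z) for g in weak)

X_train = [0.0, 1.0, 2.0]     # training inputs X_i (made up)
alpha = [0.5, -1.0, 2.0]      # arbitrary coefficients alpha_i

def svm_side(x):
    """sum_i alpha_i K(X_i, x)."""
    return sum(a * induced_kernel(xi, x) for a, xi in zip(alpha, X_train))

# lambda_j = sum_i alpha_i g_j(X_i)
lambdas = [sum(a * g(xi) for a, xi in zip(alpha, X_train)) for g in weak]

def boosting_side(x):
    """sum_j lambda_j g_j(x)."""
    return sum(lam * g(x) for lam, g in zip(lambdas, weak))
```

Both functions agree at every input, which is the exchange of the two finite sums written above.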

(44)

Boosting vs S.V.M. vs Neural networks

Boosting advantages:

– Variable selection

– Ability to handle a very large number of features

– Simple tricks to reduce computational complexity

– S.V.M. can be run at the end on the selected features

S.V.M. advantages:

– Easy to use off-the-shelf

– Consistently good results

Neural networks advantages:

– Work well in practice

(45)

J.-Y. Audibert. Fast learning rates in statistical inference through aggregation. 2008. To be published in Annals of Statistics, http://www.e-publications.org/ims/submission/index.php/AOS/user/submissionFile/1175?confirm=51fc3552.

P.L. Bartlett and M. Traskin. Adaboost is consistent. J. Mach. Learn. Res., 8:2347–2368, 2007.

P.L. Bartlett, M.I. Jordan, and J.D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101:138–156, 2006.

Boser, Guyon, and Vapnik. A training algorithm for optimal margin classifiers. In COLT: Proceedings of the Workshop on Computational Learning Theory, Morgan Kaufmann Publishers, 1992.

T. M. Cover and P. E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, IT-13, 1967.

L. Devroye. Any discrimination rule can have an arbitrarily bad probability of error for finite sample size. IEEE Trans. Pattern Analysis and Machine Intelligence, 4:154–157, 1982.

L. Devroye and T. Wagner. Distribution-free consistency results in nonparametric discrimination and regression function estimation. Annals of Statistics, 8:231–239, 1980.

András Faragó and Gábor Lugosi. Strong universal consistency of neural network classifiers. IEEE Transactions on Information Theory, 39(4):1146–1151, 1993.

L. Györfi. Nonparametric estimation II. statistically equivalent blocks and tolerance regions. In Nonparametric Functional Estimation and Related Topics, pages 329–338, 1991.

L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer, 2004.

(46)

lecture09-optim.djvu, requires the djvu reader http://djvu.org/download/.

Y. LeCun, L. Bottou, G. Orr, and K. Muller. Efficient backprop. In G. Orr and K. Muller, editors, Neural Networks: Tricks of the Trade. Springer, 1998. http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf

G. Lugosi and N. Vayatis. On the bayes-risk consistency of regularized boosting methods. Ann. Stat., 32(1):30–55, 2004.

Gábor Lugosi and Kenneth Zeger. Nonparametric estimation via empirical risk minimization. IEEE Transactions on Information Theory, 41(3):677–687, 1995.

E. A. Nadaraya. On estimating regression. Theor. Probability Appl., 9:141–142, 1964.

F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408, 1958.

F. Rosenblatt. Principles of Neurodynamics. Spartan Books, Washington, 1962.

P.Y. Simard, D. Steinkraus, and J. Platt. Best practice for convolutional neural networks applied to visual document analysis. In International Conference on Document Analysis and Recognition (ICDAR), IEEE Computer Society, pages 958–962, 2003. http://research.microsoft.com/~patrice/PDF/fugu9.pdf

C. Spiegelman and J. Sacks. Consistent window estimation in nonparametric regression. Annals of Statistics, 8:240–246, 1980.

I. Steinwart. Support vector machines are universally consistent. Journal of Complexity, 18, 2002.

C. J. Stone. Consistent nonparametric regression (with discussion). Annals of Statistics, 5:595–645, 1977.

(47)

J. W. Tukey. Nonparametric estimation II. Statistically equivalent blocks and tolerance regions. Annals of Mathematical Statistics, 18:529–539, 1947.

V. Vapnik. The nature of statistical learning theory. Springer-Verlag, second edition, 1995.

G. S. Watson. Smooth regression analysis. Sankhya Series A, 26:359–372, 1964.
