• Aucun résultat trouvé

General Curse of Dimensionality Argument

N/A
N/A
Protected

Academic year: 2022

Partager "General Curse of Dimensionality Argument"

Copied!
1
0
0

Texte intégral

(1)

'

&

$

% Intuitions and Classical Results

Locality of the Kernel Curse of Dimensionality Arguments Learning Complex Functions

Complexity along a straight line Complexity of parity

Simple Functions can be Difficult to Learn

decision surface

Class −1

Class 1

This “complex” sinusoidal decision surface can- not be learned with less than 10 Gaussians.

Theorem

If ∃ a line in Rd that intersects m times with the decision surface S (and is not included in S), then one needs at least dm2 e Gaussians (of same width) to learn S with a Gaussian kernel classifier.

(Am I repeating myself ?) SVMs are doomed ! Snowbird Skiing Workshop

Intuitions and Classical Results Locality of the Kernel Curse of Dimensionality Arguments Learning Complex Functions

Complexity along a straight line Complexity of parity

The Parity Problem

parity :

(b1, . . . ,bd) ∈ {0,1}d 7→

1 if Pd

i=1bi is even

−1 otherwise

Theorem

A Gaussian kernel classifier needs at least 2d−1 Gaussians (i.e.

support vectors) to learn the parity function (when Gaussians have fixed width and are centered on training points).

(Just in case you missed it) SVMs are doomed ! Snowbird Skiing Workshop

Intuitions and Classical Results Locality of the Kernel Curse of Dimensionality Arguments Learning Complex Functions

Complexity along a straight line Complexity of parity

Then What ?

• Kernel machines with a local or local- derivative kernel appear not to scale to learn complex functions in high dimension.

•The no-free-lunch theorem suggests that there is no universal recipe for learning such functions without using a priori knowledge.

• Is there hope ?

• Humans seem to do learn such functions !

• There might be loose enough priors on general classes of functions that allow non-local learning algorithms to learn them.

• Examples : poster on non-local manifold Par- zen windows.

Maybe we are NOT DOOMED after all ! Snowbird Skiing Workshop

Intuitions and Classical Results Locality of the Kernel Curse of Dimensionality Arguments Learning Complex Functions

Complexity along a straight line Complexity of parity

Bibliography

Bengio, Y., Delalleau, O., and Le Roux, N. (2005).

The curse of dimensionality for local kernel machines.

Technical Report 1258, D´epartement d’informatique et recherche op´erationnelle, Universit´e de Montr´eal.

ardle, W., M¨uller, M., Sperlich, S., and Werwatz, A. (2004).

Nonparametric and Semiparametric Models.

Springer, http ://www.xplore-stat.de/ebooks/ebooks.html.

Keerthi, S. S. and Lin, C.-J. (2003).

Asymptotic behaviors of support vector machines with Gaussian kernel.

Neural Computation, 15(7) :1667–1689.

Snapp, R. R. and Venkatesh, S. S. (1998).

Asymptotic derivation of the finite-sample risk of the k nearest neighbor classifier.

Technical Report UVM-CS-1998-0101, Department of Computer Science, University of Vermont.

Yoshua Bengio, Olivier Delalleau & Nicolas Le Roux Snowbird Skiing Workshop '

&

$

% Intuitions and Classical Results

Locality of the Kernel Curse of Dimensionality Arguments Learning Complex Functions

General argument

Curse on spectral manifold learning Curse on SVMs

General Curse of Dimensionality Argument

Locality. Show that crucial properties of f (x) (e.g. tangent plane, decision surface normal vector) depend mostly on examples in ball N(x).

Smooothness. Show that within N(x), crucial property of f (x) must vary slowly ( = smoothness within N(x) ).

Complexity. Consider targets that vary sufficiently so that one needs to consider O(constd) different neighborhoods, with significantly different properties in each neighborhood.

Curse. Combine the above : an exponential number (in d) of examples are needed to achieve a given performance level.

... in the meantime, please enjoy the colors Snowbird Learning Workshop

Intuitions and Classical Results Locality of the Kernel Curse of Dimensionality Arguments Learning Complex Functions

General argument

Curse on spectral manifold learning Curse on SVMs

Spectral Manifold Learning Algorithms

Many manifold learning algorithms can be seen as kernel machines with data-dependent kernel (LLE, Isomap, kernel PCA, Laplacian Eigenmaps, charting, etc...).

Locality

Shown for the estimated tangent plane.

Smoothness

The tangent plane varies slowly within N(x), since it is in the span of vectors x −xi.

Complexity

If the underlying manifold has high curvature in many places, we are doomed...

Our sincere apologies to color-blind people Snowbird Learning Workshop

Intuitions and Classical Results Locality of the Kernel Curse of Dimensionality Arguments Learning Complex Functions

General argument

Curse on spectral manifold learning Curse on SVMs

Ex : Translation of a High Contrast Image

tangent directions tangent image

tangent directions tangent image

shifted image high−contrast image

Spectral manifolds are doomed ! Snowbird Learning Workshop

Intuitions and Classical Results Locality of the Kernel Curse of Dimensionality Arguments Learning Complex Functions

General argument

Curse on spectral manifold learning Curse on SVMs

The 1-Norm Soft Margin SVM with Gaussian Kernel

Locality

As shown in (Keerthi and Lin, 2003), the SVM becomes constant when σ → 0 or σ → ∞ ⇒ notion of locality w.r.t σ.

Smoothness

When there are training examples at a distance of the order of σ, the normal vector is almost constant in a ball whose radius is small with respect to σ.

Complexity

The number of examples needed may grow exponentially with the dimensionality of the decision surface.

SVMs are doomed ! Snowbird Learning Workshop '

&

$

% Intuitions and Classical Results

Locality of the Kernel Curse of Dimensionality Arguments Learning Complex Functions

Non-locality due to theα’s

Test examples far from training examples Local-derivative kernels

Kernel Methods

f (x) = b +

n

X

i=1

αiK(x,xi)

Used in classification (KNN, SVM, ...), dimen- sionality reduction (kernel PCA, LLE, Isomap, Laplacian eigenmaps, ...)

This talk = independent of the way the αi

are learned

Is the rest of the kernel world doomed as well ? Snowbird Learning Workshop

Intuitions and Classical Results Locality of the Kernel Curse of Dimensionality Arguments Learning Complex Functions

Non-locality due to theα’s

Test examples far from training examples Local-derivative kernels

When a Test Example is Far from Training Examples

If the kernel is local, i.e.

||x−xlimi||→∞K(x,xi) → ci

then when x gets farther from the training set f (x) → b + X

i

αici

The predictor is either constant or (approximately) the nearest neighbor predictor (e.g. with the Gaussian kernel)

In high dimensions, a random test point tends to be equally far from most training examples.

Aren’t we all so far from each other ? Snowbird Learning Workshop

Intuitions and Classical Results Locality of the Kernel Curse of Dimensionality Arguments Learning Complex Functions

Non-locality due to theα’s

Test examples far from training examples Local-derivative kernels

Local-Derivative Kernels

The derivative of f is

∂f(x)

∂x =

n

X

i=1

αi

∂K(x,xi)

∂x

Local-derivative kernel

When ∂f /∂x is (approximately) contained in the span of the vectors (x − xj) with xj a neighbor of x

∂f(x)

∂x ' X

xj∈N(x)

γj(x − xj)

Especially lonely are those who stay local Snowbird Learning Workshop

Intuitions and Classical Results Locality of the Kernel Curse of Dimensionality Arguments Learning Complex Functions

Non-locality due to theα’s

Test examples far from training examples Local-derivative kernels

Tangent Planes and Decision Surfaces

Manifold learning : ∂f /∂x span manifold’s tangent plane Classification : ∂f/∂x is the decision surface normal vector Constraining ∂f/∂x in the span of the neighbors is a very strong constraint, possibly leading to high-variance estimators.

SVMs with Gaussian kernel are local-derivative LLE is local-derivative

Isomap is local-derivative

Kernel PCA with Gaussian kernel is local-derivative

Spectral clustering with Gaussian kernel is local-derivative

Don’t worry, there are more images coming... Snowbird Learning Workshop '

&

$

% Intuitions and Classical Results

Locality of the Kernel Curse of Dimensionality Arguments Learning Complex Functions

Summary

Geometric Intuition

Classical Curse of Dimensionality

Summary

Curse of dimensionality for classical non-parametric models Bias-Variance Dilemma in High Dimension

Asymptotic results : Exponential growth of nb examples w.r.t. dim.

Local kernels : when a test example is far from training examples Convergence of the predictor f(x) to either a constant or the nearest neighbor predictor.

Local-derivative kernels : strong subspace constraint

f(x)

∂x is a linear combination of x −xi with xi a near neighbor of x. Number of support vectors needed

Consider SVM with Gaussian kernel. If the target function varies a lot (e.g. parity function), a large (possibly exponential) number of support vectors is required.

If you can read this, be ready for a lot of distractions here... Snowbird Learning Workshop

Intuitions and Classical Results Locality of the Kernel Curse of Dimensionality Arguments Learning Complex Functions

Summary

Geometric Intuition

Classical Curse of Dimensionality

Geometric Intuition

If we have to tile the space or the mani- fold where the bulk of the distribution is concentrated, then we will need an ex- ponential number of “patches” :

For classification problems no need to co- ver the whole space/manifold, only deci- sion surface, but still has dim. d − 1.

Number of required examples ∝ constd

.. and if you can’t read this, you should go to the optician Snowbird Learning Workshop

Intuitions and Classical Results Locality of the Kernel Curse of Dimensionality Arguments Learning Complex Functions

Summary

Geometric Intuition

Classical Curse of Dimensionality

Kernel Density Estimation

For a wide class of kernel density estima- tors (H¨ardle et al., 2004), the generaliza- tion error converges in n−4/(4+d), i.e. :

The required number of examples to reach a given error level is exponential in d

Kernel density estimation seems doomed ! Snowbird Learning Workshop

Intuitions and Classical Results Locality of the Kernel Curse of Dimensionality Arguments Learning Complex Functions

Summary

Geometric Intuition

Classical Curse of Dimensionality

K nearest neighbors

In the context of K nearest neighbors with weighted Lp metrics of the form dist(x,y) = kA(x − y)kp, (Snapp and Venkatesh, 1998) show the generalization error can be written as a series expansion of the form

En = E +

X

j=2

cjn−j/d

under smoothness constraints on the class distributions, i.e. again The required number of examples to reach a given error level is exponential in d

Nearest neighbor predictor seems doomed ! Snowbird Learning Workshop

Yoshua Bengio

Olivier Delalleau Nicolas Le Roux

The Curse of Dimensionality for Local Kernel

Machines

Références

Documents relatifs

Abstract—We show an effective cut-free variant of Glivenko’s theorem extended to formulas with weak quantifiers (those without eigenvariable conditions): “There is an

Percentage relative errors in the approximate solution for the mean number of requests in a Ph/Ph/c queue with c=64 servers as a function of the offered load per

Abstract—In the context of data tokenization, we model a token as a vector of a finite dimensional metric space E and given a finite subset of E, called the token set, we address

The recent article [8] gives related deviation inequalities. Compared to Corollary 1.5, on the one hand they restrict to the i.i.d. case and Wasserstein distances, but on the other

This is achieved through a parameterization of similarity functions by convex combinations of sparse rank-one matrices, together with the use of a greedy approximate

– In Section 4, we provide a detailed theoretical analysis of the approximation properties of (sin- gle hidden layer) convex neural networks with monotonic homogeneous

Brunnschweiler and Bulte (2008) argue that natural resource exports over GDP cap- ture resource dependence more than resource abundance, and their use as a proxy for..

On the contrary, convergence in W 1 (or in duality with some other class F ) ensures good estimates simultaneously for