The Curse of Highly Variable Functions for Local Kernel Machines

Partial References

Bengio, Y., Delalleau, O., and Le Roux, N. (2005). The curse of dimensionality for local kernel machines. Technical Report 1258, Département d'informatique et recherche opérationnelle, Université de Montréal.

Bengio, Y. and Larochelle, H. (2006). Non-local manifold Parzen windows. In Advances in Neural Information Processing Systems 18. MIT Press.

Härdle, W., Müller, M., Sperlich, S., and Werwatz, A. (2004). Nonparametric and Semiparametric Models. Springer, http://www.xplore-stat.de/ebooks/ebooks.html.

Schmitt, M. (2002). Descartes' rule of signs for radial basis function neural networks. Neural Computation, 14(12):2997–3011.

Snapp, R. R. and Venkatesh, S. S. (1998). Asymptotic derivation of the finite-sample risk of the k nearest neighbor classifier. Technical Report UVM-CS-1998-0101, Department of Computer Science, University of Vermont.

Non-Local Learning: We are not doomed!

• complex ≠ non-smooth (e.g. Kolmogorov complexity, a weak prior that can buy a lot of power)

• similar ≠ close in vector space ⇒ task-specific similarity

• non-local learning algorithms ⇒ can generalize even with few training samples (Bengio and Larochelle, 2006) and can guess the shape of the density near never-seen examples, even though no explicit and specific prior knowledge is used.

Graph-Based Semi-Supervised Learning

A number of semi-supervised learning algorithms are based on the idea of propagating labels on the nodes of a neighborhood graph, starting from the known labels, until some convergence is obtained. A typical cost function optimized this way is:

$$C(\hat{Y}) = \|\hat{Y}_l - Y_l\|^2 + \mu\, \hat{Y}^\top L \hat{Y} + \mu\, \|\hat{Y}\|^2$$

where $\hat{Y}$ is the vector of estimated labels, $Y_l$ the known labels on the labeled subset, and $L$ the Laplacian of the neighborhood graph.

Proposition: the number of regions with constant estimated label is less than (or equal to) the number of labeled examples.

“Double curse”: with a Gaussian kernel, we would need many labeled examples (one in each colored region), and many unlabeled ones (along the sinusoidal line).

⇒ a neighborhood graph may not be appropriate for the task (e.g. parity problem).
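
To make this concrete, here is a minimal sketch (not the authors' code: the Gaussian affinity, the value of µ and the toy data are illustrative assumptions) that minimizes the cost above in closed form.

```python
# Minimal sketch (not the authors' code): minimize
#   C(Y_hat) = ||Y_hat_l - Y_l||^2 + mu * Y_hat^T L Y_hat + mu * ||Y_hat||^2
# in closed form.  The Gaussian affinity, mu and the toy data are assumptions.
import numpy as np

def propagate_labels(X, y, labeled_mask, sigma=1.0, mu=0.1):
    """X: (n, d) inputs; y: (n,) labels in {-1, +1} (ignored where unlabeled);
    labeled_mask: (n,) bool.  Returns the minimizer Y_hat of the cost above."""
    # Gaussian-kernel affinity W and graph Laplacian L = D - W
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W
    # Setting the gradient to zero gives (S + mu*L + mu*I) Y_hat = S Y,
    # where S is the diagonal indicator of the labeled examples.
    S = np.diag(labeled_mask.astype(float))
    Y = np.where(labeled_mask, y, 0.0)
    return np.linalg.solve(S + mu * L + mu * np.eye(len(X)), S @ Y)

# Toy usage: two labeled points, the remaining labels inferred through the graph.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.sign(X[:, 0])
labeled = np.zeros(20, dtype=bool)
labeled[:2] = True
print(np.sign(propagate_labels(X, y, labeled)))
```

The closed-form solve is only for clarity; iterative label-propagation schemes optimize essentially the same kind of cost.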

Manifold Learning

Many manifold learning algorithms can be written as in eq. 1, with $K_D$ having some locality property such that

$$\frac{\partial f(x)}{\partial x} \approx \sum_{x_i \in N(x)} D(x, x_i)\,(x_i - x)$$

i.e. the tangent plane of the manifold is approximately in the span of the vectors $(x_i - x)$, with $x_i$ a near neighbor of x.

⇒ high-variance estimators if there are not enough training points in the neighborhood of x (curse of the manifold dimensionality).

[Figure: a point x and its near neighbors $x_i$ on the manifold.]

Ex: kernel PCA with a Gaussian kernel, Locally Linear Embedding, ISOMAP, ...
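
For illustration, here is a small sketch (my own, not from the slides; the circle data, the number of neighbors and the target dimension are assumptions) that estimates the tangent direction at x from the span of the difference vectors $(x_i - x)$:

```python
# Minimal sketch (an illustration, not from the slides): estimate the tangent
# plane at x from the span of (x_i - x) over the k nearest neighbors, as done
# implicitly by LLE / ISOMAP / kernel-PCA style manifold learners.
import numpy as np

def tangent_plane(X, x, k=10, dim=1):
    """Return an orthonormal basis (dim, d) of the estimated tangent plane at x."""
    # k nearest training points around x
    dists = np.linalg.norm(X - x, axis=1)
    neighbors = X[np.argsort(dists)[:k]]
    # The principal directions of the difference vectors (x_i - x) span the plane.
    diffs = neighbors - x
    _, _, Vt = np.linalg.svd(diffs, full_matrices=False)
    return Vt[:dim]

# Toy usage: points on a circle (a 1-D manifold in R^2); the estimated tangent
# at (1, 0) should be close to the vertical direction (0, 1).
theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
X = np.c_[np.cos(theta), np.sin(theta)]
print(tangent_plane(X, np.array([1.0, 0.0]), k=8, dim=1))
```

With few training points in N(x), these estimated directions fluctuate strongly from one query point to the next, which is the high-variance problem above.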

Minimum Number of Bases

Corollary of (Schmitt, 2002): if $K_D$ is the Gaussian kernel and f(·) changes sign at least 2k times along some straight line (i.e. that line crosses the decision surface at least 2k times), then there must be at least k bases (non-zero $\alpha_i$'s).

[Figure: sinusoidal decision surface separating Class −1 from Class 1.]

Ex: This “complex” sinusoidal decision surface requires a minimum of 10 Gaussians to be learned with a Gaussian kernel classifier.

Parity problem: learning the d-bit parity function

$$\mathrm{parity}: (b_1, \ldots, b_d) \in \{0,1\}^d \mapsto \begin{cases} 1 & \text{if } \sum_{i=1}^{d} b_i \text{ is even} \\ -1 & \text{otherwise} \end{cases}$$

with a Gaussian kernel classifier requires at least $2^{d-1}$ bases (when centered on training points).

Bottom line: with a purely local kernel, it can be difficult to learn a simple function that varies a lot.
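
A quick experiment (an assumed setup, not from the slides; scikit-learn and the particular kernel width and C below are arbitrary choices) makes this tangible: fit a Gaussian-kernel SVM on the full d-bit parity dataset and count how many bases (support vectors) it keeps.

```python
# Assumed setup (not from the slides): Gaussian-kernel SVM on the full d-bit
# parity dataset; count the bases (support vectors) it retains.  The slide's
# lower bound for bases centered on training points is 2^(d-1).
import itertools
import numpy as np
from sklearn.svm import SVC

def parity_dataset(d):
    X = np.array(list(itertools.product([0, 1], repeat=d)), dtype=float)
    y = np.where(X.sum(axis=1) % 2 == 0, 1, -1)   # +1 if the sum of bits is even
    return X, y

for d in range(2, 9):
    X, y = parity_dataset(d)
    clf = SVC(kernel="rbf", gamma=1.0, C=1e6).fit(X, y)
    print(f"d={d}: {len(clf.support_)} support vectors out of {2**d} points "
          f"(lower bound {2**(d-1)})")
```

The count typically stays close to $2^d$, consistent with the $2^{d-1}$ lower bound.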

General Curse Idea

Setting: the learned function is expressed as a linear combination of (possibly data-dependent) kernel functions:

$$f(x) = b + \sum_{i=1}^{n} \alpha_i K_D(x, x_i) \qquad (1)$$

1. Locality: if the kernel is local in some sense, important properties of f(·) at point x will depend mostly on training examples $x_i$ in a neighborhood N(x) of x.

Ex: the derivative of the Gaussian kernel w.r.t. the distance $\|x - x_i\|$. The derivative of f(·) at x depends on those $x_i$ in N(x) ⇒ e.g. the shape of the decision surface in a kernel classifier.

[Plot: derivative of the Gaussian kernel as a function of the distance $\|x - x_i\|$.]

2. Smoothness: important properties of f(·) at x vary slowly in N(x) (i.e. f(·) is smooth, in some sense, within N(x)).

3. Complexity: if the target function for f(·) is “complex”, i.e. it varies a lot, smoothness ⇒ need to consider many neighborhoods (“tiling” the space with local patches).

CURSE! Locality ⇒ need training examples in each patch (whose number may grow in $O(\text{const}^d)$).
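
As a small illustration of eq. 1 and of the locality point (arbitrary assumptions: random coefficients $\alpha_i$, toy Gaussian data, σ = 0.5), truncating the sum to the neighborhood N(x) barely changes f(x):

```python
# Minimal sketch of eq. 1 with a Gaussian kernel: f(x) = b + sum_i alpha_i K(x, x_i).
# The coefficients, the data and sigma below are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))      # training points x_i (toy data)
alpha = rng.normal(size=500)       # "fitted" coefficients alpha_i (random here)
b, sigma = 0.0, 0.5

def K(x, Xi):
    # Gaussian kernel between a query x and a set of points Xi
    return np.exp(-np.linalg.norm(Xi - x, axis=1) ** 2 / (2.0 * sigma ** 2))

def f_full(x):
    return b + alpha @ K(x, X)

def f_local(x, k=20):
    # keep only the k nearest training points, i.e. the neighborhood N(x)
    idx = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    return b + alpha[idx] @ K(x, X[idx])

x = X[0] + 0.1 * rng.normal(size=5)    # a query point near the data
print(f_full(x), f_local(x))           # nearly identical: far x_i contribute ~0
```

The flip side is the curse: every region where f(·) must behave differently needs its own training examples with non-negligible kernel values there.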

Outlook

• The smoothness prior is the basis for many algorithms that learn only from local neighborhoods of the training data

• This makes them unfit to learn functions that vary a lot, e.g. as needed for complex AI tasks

• Ex: SVMs with a local kernel, manifold learning algorithms (LLE, ISOMAP, kernel PCA), graph-based semi-supervised learning, ...

• Need for non-local learning!

Yoshua Bengio, Olivier Delalleau, Nicolas Le Roux

The Curse of Highly Variable Functions for Local Kernel Machines
