
(1)

Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA

Yoshua Bengio, Pascal Vincent

Jean-François Paiement

University of Montreal

Snowbird Learning'2003 Workshop, April 2, 2003

(2)

Learning Modal Structures of the Distribution

Manifold learning and clustering = learning where the main high-density zones are.

Learning a transformation that reveals "clusters" and manifolds:

Cluster = zone of high density, separated from other clusters by regions of low density.

(3)

Spectral Embedding Algorithms

Many learning algorithms, e.g. spectral clustering, kernel PCA, Local Linear Embedding (LLE), Isomap, Multi-Dimensional Scaling (MDS), Laplacian eigenmaps, have at their core the following (or its equivalent):

1. Start from $n$ data points $x_1, \ldots, x_n$.

2. Construct a "neighborhood" or similarity matrix $M$ (with corresponding [possibly data-dependent] kernel $K$).

3. Normalize it (and make it symmetric), yielding $\tilde{M}$ (with corresponding kernel $\tilde{K}$).

4. Compute the $m$ largest (equivalently, smallest) eigenvalues/eigenvectors of $\tilde{M}$.

5. Embedding of $x_i$ = the $i$-th elements of each of the $m$ principal eigenvectors (possibly scaled using the eigenvalues).
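A minimal NumPy sketch of this generic recipe, assuming a Gaussian similarity kernel and the divisive (spectral-clustering-style) normalization discussed later; the function name and parameters are illustrative, not from the talk.

```python
import numpy as np

def spectral_embedding(X, sigma=1.0, m=2):
    """Generic spectral embedding: similarity matrix -> normalize -> top eigenvectors."""
    # 1-2. Pairwise Gaussian similarities M_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    M = np.exp(-sq_dists / (2 * sigma ** 2))
    # 3. Divisive (symmetric) normalization: M~_ij = M_ij / sqrt(D_i D_j)
    d = M.sum(axis=1)
    M_tilde = M / np.sqrt(np.outer(d, d))
    # 4. Largest eigenvalues/eigenvectors of the symmetric normalized matrix
    eigvals, eigvecs = np.linalg.eigh(M_tilde)
    order = np.argsort(eigvals)[::-1][:m]
    # 5. Embedding of x_i = i-th coordinates of the m principal eigenvectors
    return eigvecs[:, order], eigvals[order]
```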

(4)

Kernel PCA

Data $x$ is implicitly mapped to the "feature space" $\phi(x)$ of a kernel $K$ s.t. $K(x,y) = \phi(x) \cdot \phi(y)$.

PCA is performed in feature space.

Projecting points into high dimension might allow us to find a straight line along which they are almost aligned (if the basis, i.e. the kernel, is "right").

(5)

Kernel PCA

Eigenvectors $v_k$ of the (generally infinite) feature-space covariance matrix are $v_k = \sum_i \alpha_{ki}\, \phi(x_i)$, where $\alpha_k$ is an eigenvector of the Gram matrix $M$.

Projection on the $k$-th p.c. = $v_k \cdot \phi(x) = \sum_i \alpha_{ki}\, K(x, x_i)$.

N.B. $\phi$ needs to be centered, $\sum_i \phi(x_i) = 0$: subtractive normalization (Schölkopf 96).
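A short sketch of the computation just described, assuming a precomputed Gram matrix; the centering step is the subtractive normalization of (Schölkopf 96), and the names are illustrative.

```python
import numpy as np

def kernel_pca_projections(K, m=2):
    """Kernel PCA projections of the training points, from an uncentered Gram matrix K (n x n)."""
    n = K.shape[0]
    # Subtractive normalization: center the implicit feature vectors phi(x_i)
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one
    # alpha_k = unit-norm eigenvectors of the centered Gram matrix, mu_k = its eigenvalues
    mu, alpha = np.linalg.eigh(Kc)
    order = np.argsort(mu)[::-1][:m]
    mu, alpha = mu[order], alpha[:, order]
    # Feature-space axis v_k = sum_i alpha_ki phi(x_i) / sqrt(mu_k) (unit norm);
    # projection of training point i on v_k = (Kc @ alpha)_ik / sqrt(mu_k) = sqrt(mu_k) * alpha_ik
    return (Kc @ alpha) / np.sqrt(mu)
```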

(6)

Laplacian Eigenmaps

Gram matrix from Laplace-Beltrami operator (

), which on finite data (neighborhood graph) gives graph Laplacian.

Gaussian kernel. Approximated by k-NN adjacency matrix Normalization: row average - Gram matrix.

Laplace-Beltrami operator

: justified as a smoothness regularizer on the manifold

:

, which equals eigenvalue of

for eigenfunctions .

Successfully used for semi-supervised learning.

(Belkin & Niyogi, 2002)
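A sketch of one common construction consistent with this slide, assuming a k-NN graph with Gaussian weights and the unnormalized Laplacian $L = D - W$; details (weighting, normalization) vary across implementations.

```python
import numpy as np

def laplacian_eigenmaps(X, k=10, sigma=1.0, m=2):
    """Laplacian-eigenmaps sketch: k-NN graph with Gaussian weights, embed with the
    smallest non-trivial eigenvectors of the graph Laplacian L = D - W."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / (2 * sigma ** 2))
    # Keep only each point's k nearest neighbours (plus itself), then symmetrize
    far = np.argsort(sq, axis=1)[:, k + 1:]
    np.put_along_axis(W, far, 0.0, axis=1)
    W = np.maximum(W, W.T)
    np.fill_diagonal(W, 0.0)                  # no self-loops
    L = np.diag(W.sum(axis=1)) - W            # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)
    # Skip the constant eigenvector (eigenvalue ~0); keep the next m as the embedding
    return eigvecs[:, 1:m + 1]
```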

(7)

Spectral Clustering

Normalize kernel or Gram matrix divisively:

Embedding of

=

where

is

-th eigenvector of Gram matrix.

Perform clustering on the embedded points (e.g. after normalizing them by their norm).

Weiss, Ng, Jordan, ...
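A compact sketch of this procedure in the Ng-Jordan-Weiss style, assuming scikit-learn's KMeans for the final clustering step; names and parameter choices are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(M, n_clusters=2):
    """Spectral clustering from a precomputed similarity matrix M."""
    d = M.sum(axis=1)
    M_tilde = M / np.sqrt(np.outer(d, d))             # divisive normalization
    eigvals, eigvecs = np.linalg.eigh(M_tilde)
    Y = eigvecs[:, -n_clusters:]                      # embedding = leading eigenvectors
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)  # project rows onto the unit sphere
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Y)
```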

(8)

Spectral Clustering

After normalization by their norm, the embedded points lie on the unit sphere.

The principal eigenfunctions approximate the kernel (= dot product) in the MSE sense:

embeddings of points in the same cluster are almost collinear; embeddings of points in different clusters are almost orthogonal.

Points in the same cluster are mapped to points at nearly the same angle, even for a non-blob cluster (global constraint = transitivity of "nearness").

(9)

Density-Dependent Hilbert Space

Define a Hilbert space with the density-dependent inner product $\langle f, g \rangle_p = \int f(x)\, g(x)\, p(x)\, dx$, with density $p$.

A kernel function $K$ defines a linear operator in that space: $(K_p f)(x) = \int K(x,y)\, f(y)\, p(y)\, dy$.

(10)

Eigenfunctions of a Kernel

Infinite-dimensional version of the eigenvectors of the Gram matrix: $(K_p f_k)(x) = \lambda_k f_k(x)$ (some conditions are needed to obtain a discrete spectrum).

Convergence of the eigenvectors/eigenvalues of the Gram matrix, built from $n$ data points sampled from $p$, to the eigenfunctions/eigenvalues of the linear operator with underlying density $p$, proven as $n \to \infty$ (Williams & Seeger, 2000).

(11)

Link between Spectral Clustering and Eigenfunctions

Equivalence between eigenvectors and eigenfunctions (and corresponding eigenvalues) when $p$ is the empirical distribution:

Proposition 1: If we choose for $p$ the empirical distribution of the data, then the spectral embedding from $\tilde{M}$ is equivalent to the values of the eigenfunctions of the normalized kernel $\tilde{K}$ at the data points: $f_k(x_i) = \sqrt{n}\, v_{ki}$.

Proof: come and see our poster!

(12)

Link between Kernel PCA and Eigenfunctions

Proposition 2: If we choose for $p$ the empirical distribution of the data, then the kernel PCA projection is equivalent to scaled values of the eigenfunctions of $\tilde{K}$: $\pi_k(x) = \sqrt{\lambda_k}\, f_k(x)$.

Proof: come and see our poster!

Consequence: up to the choice of kernel, kernel normalization, and up to scaling by $\sqrt{\lambda_k}$, spectral clustering, Laplacian eigenmaps and kernel PCA give the same embedding. Isomap, MDS and LLE also give eigenfunctions, but of a different type of kernel.

(13)

From Embedding to General Mapping

Laplacian eigenmaps, spectral clustering, Isomap, LLE, and MDS only provided an embedding for the given data points.

Natural generalization to new points: consider these algorithms as learning eigenfunctions of

.

eigenfunctions

provide a mapping for new points. e.g. for empirical

:

Data-dependent “kernels” (Isomap, LLE): need to compute

without changing

. Reasonable for Isomap, less clear it makes

sense for LLE.
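A sketch of such an out-of-sample mapping under one common Nyström convention, scaled so that on a training point it reproduces the corresponding eigenvector coordinate; the exact $\sqrt{n}$ factor depends on how the eigenfunctions are normalized, and the names are illustrative.

```python
import numpy as np

def out_of_sample_embedding(K_new, eigvecs, eigvals):
    """Nystrom-style mapping of new points.

    K_new  : (n_new, n) values of the normalized kernel K~(x, x_i) between new and training points
    eigvecs: (n, m) leading unit-norm eigenvectors v_k of the normalized Gram matrix
    eigvals: (m,) corresponding eigenvalues ell_k
    """
    # f_k(x) proportional to (1 / ell_k) * sum_i v_ki * K~(x, x_i);
    # on a training point x_j this reduces to v_kj, the spectral-embedding coordinate.
    return K_new @ eigvecs / eigvals
```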

(14)

Criterion to Learn Eigenfunctions

Proposition 3: Given the first $m$ eigenfunctions $f_1, \ldots, f_m$ of a symmetric function $\tilde{K}$, the $(m+1)$-th one can be obtained by minimizing w.r.t. $(f, \lambda)$, subject to $\|f\| = 1$, the expected value of $\big( \tilde{K}(x,y) - \sum_{k \le m} \lambda_k f_k(x) f_k(y) - \lambda f(x) f(y) \big)^2$ over $p(x)\, p(y)$. Then we get $f = f_{m+1}$ and $\lambda = \lambda_{m+1}$.

This helps understand what the eigenfunctions are doing (approximating the "dot product" $\tilde{K}(x,y)$) and provides a possible criterion for estimating the eigenfunctions when $p$ is not an empirical distribution.

Kernels such as the Gaussian kernel and nearest-neighbor-related kernels force the eigenfunctions to correctly reconstruct $\tilde{K}(x,y)$ only for nearby objects: in high dimension, don't trust the Euclidean distance between far-apart objects.
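A small numerical illustration of this reconstruction-error view in the empirical case, where the operator reduces to the Gram matrix: keeping the leading (eigenvalue, eigenvector) pairs gives progressively better mean-squared reconstruction of the kernel values. The data and kernel below are arbitrary, chosen only for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K = np.exp(-sq / 2.0)                                  # Gaussian Gram matrix

eigvals, eigvecs = np.linalg.eigh(K)
order = np.argsort(np.abs(eigvals))[::-1]              # decreasing |eigenvalue|
lam, v = eigvals[order], eigvecs[:, order]

# Rank-m reconstruction sum_k lam_k v_k v_k' and its mean squared error on K
for m in (1, 2, 5, 10):
    K_hat = (v[:, :m] * lam[:m]) @ v[:, :m].T
    print(m, np.mean((K - K_hat) ** 2))                # error shrinks as m grows
```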

(15)

Using a Smooth Density to Define Eigenfunctions?

Use your best estimator $\hat{p}$ of the density of the data, instead of the data themselves, for defining the eigenfunctions.

A constrained class of eigenfunctions, e.g. neural networks, can force the eigenfunctions to be smooth and not necessarily local.

Advantage? Better generalization away from training points?

Advantage? Better scaling with $n$? (no Gram matrix, no eigenvectors)

Disadvantage? Optimization of the eigenfunctions may be more difficult?

(16)

Recovering the Density from the Eigenfunctions?

Visually, the eigenfunctions appear to capture the main characteristics of the density.

Can we obtain a better estimate of the density using the principal eigenfunctions?

(Girolami 2001): truncating the orthogonal-series expansion of the density in terms of the eigenfunctions.

Use ideas similar to (Teh & Roweis 2003) and other mixtures of factor analyzers: project back into input space, convolving with a model of the reconstruction error as noise.

(17)

Role of Kernel Normalization?

Subtractive normalization leads to kernel PCA:

$\tilde{K}(x,y) = K(x,y) - E_{x'}[K(x',y)] - E_{y'}[K(x,y')] + E_{x'}E_{y'}[K(x',y')]$

Thus, for the corresponding kernel: the constant function is an eigenfunction (with eigenvalue 0), and the other eigenfunctions have zero mean and unit variance.

Double-centering normalization (MDS, Isomap): as above (based on the relation between dot product and distance).

What can be said about the divisive normalization? It seems better at clustering.
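The two normalizations contrasted above, written out as a sketch: the subtractive form is the kernel PCA centering, and the divisive form is the $D^{-1/2} M D^{-1/2}$ normalization used by spectral clustering and related methods. Helper names are illustrative.

```python
import numpy as np

def subtractive_normalization(K):
    """Centering used by kernel PCA (applied twice to squared distances in MDS/Isomap)."""
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    return K - one @ K - K @ one + one @ K @ one

def divisive_normalization(K):
    """Normalization used by spectral clustering: K_ij / sqrt(D_i D_j)."""
    d = K.sum(axis=1)
    return K / np.sqrt(np.outer(d, d))
```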

(18)

Multi-layer Learning of Similarity and Density?

The learned eigenfunctions capture salient features of the distribution:

abstractions such as clusters and manifolds.

Old AI (and connectionist) idea: build high-level abstractions on top of lower-level abstractions.

empirical density + local Euclidean similarity → improved density model + farther-reaching notion of similarity → ...

(19)

Density-Adjusted Similarity and Kernel

[Figure: three points A, B, C]

Want points connected through a high-density region to be "closer" than points separated by a region of low density.

Define a density-adjusted distance as a geodesic w.r.t. a Riemannian metric, with a metric tensor that penalizes low density.

SEE OTHER POSTER (Vincent & Bengio)
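One plausible graph-based sketch of this idea (not the construction from the Vincent & Bengio poster): approximate the geodesic by shortest paths in a k-NN graph whose edge lengths are inflated wherever a crude kernel density estimate is low. All names and the penalty form are assumptions for illustration.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def density_adjusted_distances(X, bandwidth=1.0, k=10):
    """Shortest-path approximation of a distance that penalizes low-density regions."""
    n = X.shape[0]
    d2 = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    density = np.exp(-d2 / (2 * bandwidth ** 2)).mean(axis=1)   # crude KDE at each point
    W = np.zeros((n, n))                                        # zero = no edge for csgraph
    neighbours = np.argsort(d2, axis=1)[:, 1:k + 1]
    for i in range(n):
        for j in neighbours[i]:
            # Edge length = Euclidean length / average endpoint density,
            # so paths through low-density regions become long.
            w = np.sqrt(d2[i, j]) * 2.0 / (density[i] + density[j])
            W[i, j] = W[j, i] = w
    return shortest_path(W, directed=False)
```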

(20)

Density-Adjusted Similarity and Kernel

[Figure: two-spirals data. Panels: original spirals; Gaussian kernel spectral embedding; density-adjusted embedding (two variants).]
(21)

Conclusions

Many unsupervised learning algorithms (kernel PCA, spectral clustering, Laplacian eigenmaps, MDS, LLE, Isomap) are linked: they all compute eigenfunctions of a normalized kernel.

The embedding can be generalized to a mapping applicable to new points.

Eigenfunctions seem to capture salient features of the distribution by minimizing the kernel reconstruction error.

Many questions remain open:

Can the eigenfunctions recover an explicit density function?

Finding eigenfunctions with a smooth $\hat{p}$?

Meaning of the various kernel normalizations?

Multi-layer learning?

Density-adjusted similarity (see the Vincent & Bengio poster).

(22)

Proposition 3

The principal eigenfunction of the linear operator corresponding to the kernel $\tilde{K}$ is the (or a, if there are repeated eigenvalues) norm-1 function $f$ that, together with a scalar $\lambda$, minimizes the reconstruction error $E_{p(x)p(y)}\big[ (\tilde{K}(x,y) - \lambda f(x) f(y))^2 \big]$.

(23)

Proof of Proposition 1

Proposition 1: If we choose for $p$ the empirical distribution of the data, then the spectral embedding from $\tilde{M}$ is equivalent to the values of the eigenfunctions of the normalized kernel $\tilde{K}$ at the data points: $f_k(x_i) = \sqrt{n}\, v_{ki}$.

(Simplified) proof:

As shown in Proposition 3, finding the function $f$ and scalar $\lambda$ minimizing $E_{p(x)p(y)}\big[ (\tilde{K}(x,y) - \lambda f(x) f(y))^2 \big]$ s.t. $\|f\| = 1$ yields a solution that satisfies $\int \tilde{K}(x,y)\, f(y)\, p(y)\, dy = \lambda f(x)$, with $\lambda$ the (possibly repeated) eigenvalue of maximum absolute value.

(24)

Proof of Proposition 1

With the empirical $p$, the above criterion becomes (an average over the $n$ data points):

$\frac{1}{n^2} \sum_{i,j} \big( \tilde{K}(x_i, x_j) - \lambda f(x_i) f(x_j) \big)^2$

Write $v_i = f(x_i)/\sqrt{n}$ and $\tilde{M}_{ij} = \tilde{K}(x_i, x_j)$; then minimizing the criterion amounts to finding the best rank-1 approximation $\lambda v v^T$ of $\tilde{M}$ with $\|v\| = 1$, and we obtain for the principal eigenvector: $f_1(x_i) = \sqrt{n}\, v_{1i}$.

For the other eigenvalues, consider the "residual kernel" $\tilde{K}(x,y) - \lambda_1 f_1(x) f_1(y)$ and recursively apply the same reasoning to obtain $f_2, \lambda_2$, then $f_3, \lambda_3$, etc.

Q.E.D.

(25)

Proof of Proposition 2

Proposition 2: If we choose for $p$ the empirical distribution of the data, then the kernel PCA projection is equivalent to scaled values of the eigenfunctions of $\tilde{K}$: $\pi_k(x) = \sqrt{\lambda_k}\, f_k(x)$.

(Simplified) proof:

Apply the linear operator $\tilde{K}_p$ on both sides of the eigenfunction equation $\int \tilde{K}(x,y)\, f_k(y)\, p(y)\, dy = \lambda_k f_k(x)$,

or, changing the order of integrals on the left-hand side and using $\tilde{K}(x,y) = \phi(x) \cdot \phi(y)$:

$\phi(x) \cdot \int\!\!\int \phi(y)\, \big( \phi(y) \cdot \phi(z) \big)\, f_k(z)\, p(y)\, p(z)\, dy\, dz = \lambda_k\, \phi(x) \cdot \int \phi(y)\, f_k(y)\, p(y)\, dy$

Plug in the empirical distribution for $p$:

$\phi(x) \cdot \frac{1}{n^2} \sum_{i,j} \phi(x_i)\, \big( \phi(x_i) \cdot \phi(x_j) \big)\, f_k(x_j) = \lambda_k\, \phi(x) \cdot \frac{1}{n} \sum_j \phi(x_j)\, f_k(x_j)$

(26)

Proof of Proposition 2

which contains the elements of the covariance matrix $C = \frac{1}{n} \sum_i \phi(x_i)\, \phi(x_i)^T$, thus yielding

$\phi(x) \cdot C w_k = \lambda_k\, \phi(x) \cdot w_k$

where $w_k = \frac{1}{n} \sum_j \phi(x_j)\, f_k(x_j)$. So, on the subspace where $\phi$ takes its values, $C w_k = \lambda_k w_k$, i.e. $w_k$ is also the $k$-th eigenvector of $C$.

(27)

Proof of Proposition 2

The PCA projection of $\phi(x)$ on $w_k / \|w_k\|$ is $\frac{\phi(x) \cdot w_k}{\|w_k\|} = \frac{\lambda_k f_k(x)}{\sqrt{\lambda_k}} = \sqrt{\lambda_k}\, f_k(x)$.

Q.E.D.

(28)

Proof of Proposition 3

Proposition 3: Given the first $m$ eigenfunctions $f_1, \ldots, f_m$ of a symmetric function $\tilde{K}$, the $(m+1)$-th one can be obtained by minimizing w.r.t. $(f, \lambda)$, subject to $\|f\| = 1$, the expected value of $\big( \tilde{K}(x,y) - \sum_{k \le m} \lambda_k f_k(x) f_k(y) - \lambda f(x) f(y) \big)^2$ over $p(x)\, p(y)$. Then we get $f = f_{m+1}$ and $\lambda = \lambda_{m+1}$.

Proof:

Reconstruction error using the approximation $\tilde{K}(x,y) \approx \sum_{k \le m} \lambda_k f_k(x) f_k(y) + \lambda f(x) f(y)$:

$J = E_{p(x)p(y)}\Big[ \big( \tilde{K}(x,y) - \sum_{k \le m} \lambda_k f_k(x) f_k(y) - \lambda f(x) f(y) \big)^2 \Big]$

where $(f_k, \lambda_k)$, $k = 1, \ldots, m$, are the first (eigenfunction, eigenvalue) pairs in order of decreasing absolute value of $\lambda_k$.

(29)

Proof of Proposition 3

Minimization of $J$ w.r.t. $\lambda$ gives

$\lambda = E_{p(x)p(y)}\Big[ \big( \tilde{K}(x,y) - \sum_{k \le m} \lambda_k f_k(x) f_k(y) \big)\, f(x)\, f(y) \Big]$   (1)

and, substituting back using eq. 1, $J = E\big[ (\tilde{K} - \sum_{k \le m} \lambda_k f_k f_k)^2 \big] - \lambda^2$, so $\lambda^2$ should be maximized.

(30)

Proof of Proposition 3

Take the derivative w.r.t. $f$ (under the constraint $\|f\| = 1$) and set it equal to zero. Using eq. 1, this gives

$\int \Big( \tilde{K}(x,y) - \sum_{k \le m} \lambda_k f_k(x) f_k(y) \Big) f(y)\, p(y)\, dy = \lambda f(x)$   (2)

Using the recursive assumption that the $f_k$, $k \le m$, are orthogonal eigenfunctions of $\tilde{K}_p$, and writing the application of $\tilde{K}_p$ in terms of the eigenfunctions:

$(\tilde{K}_p f)(x) = \sum_k \lambda_k \langle f_k, f \rangle_p\, f_k(x)$,

(31)

Proof of Proposition 3

we obtain

$\sum_{k > m} \lambda_k\, \langle f_k, f \rangle_p\, f_k(x) = \lambda f(x)$

Applying Parseval's theorem to obtain the norm on both sides:

$\sum_{k > m} \lambda_k^2\, \langle f_k, f \rangle_p^2 = \lambda^2$

If the $\lambda_k$'s are distinct, $\lambda^2$ is maximized when $\langle f_{m+1}, f \rangle_p^2 = 1$ and $\langle f_k, f \rangle_p = 0$ for $k > m+1$, which gives $\lambda = \lambda_{m+1}$.

Since $\|f\| = 1$ and we obtained $\langle f_{m+1}, f \rangle_p^2 = 1$ and $\langle f_k, f \rangle_p = 0$ for $k \neq m+1$, we get $f = f_{m+1}$ (up to sign) and $\lambda = \lambda_{m+1}$.

Q.E.D.
