Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA
Yoshua Bengio, Pascal Vincent, Jean-François Paiement
University of Montreal
Learning'2003 Workshop, Snowbird, April 2003
Learning Modal Structures of the Distribution
Manifold learning and clustering = learning where the main high-density zones are.
Learning a transformation that reveals "clusters" and manifolds.
Cluster = zone of high density, separated from other clusters by regions of low density.
Spectral Embedding Algorithms
Many learning algorithms, e.g.
spectral clustering, kernel PCA,
Locally Linear Embedding (LLE), Isomap,
Multi-Dimensional Scaling (MDS), Laplacian eigenmaps
have at their core the following (or its equivalent):
1. Start from $n$ data points $x_1, \ldots, x_n$.
2. Construct a "neighborhood" or similarity matrix $M$ (with corresponding [possibly data-dependent] kernel $K$).
3. Normalize it (and make it symmetric), yielding $\tilde{M}$ (with corresponding kernel $\tilde{K}$).
4. Compute the $m$ largest (equivalently, smallest) e-values/e-vectors.
5. Embedding of $x_i$ = the $i$-th elements of each of the $m$ e-vectors (possibly scaled using the e-values).
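A minimal numpy sketch of this five-step recipe, assuming a Gaussian similarity and divisive normalization (both illustrative choices, not fixed by the slide):

```python
import numpy as np

def spectral_embedding(X, m=2, sigma=1.0):
    # 1. data points (n x d)
    n = X.shape[0]
    # 2. similarity matrix from a Gaussian kernel (an illustrative choice)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    M = np.exp(-sq / (2 * sigma ** 2))
    # 3. divisive normalization, keeping the matrix symmetric
    s = M.sum(axis=1)
    M_tilde = M / np.sqrt(np.outer(s, s))
    # 4. leading e-values/e-vectors (eigh returns ascending order)
    w, V = np.linalg.eigh(M_tilde)
    w, V = w[::-1], V[:, ::-1]
    # 5. embedding of x_i = i-th coordinates of the m leading e-vectors,
    #    optionally scaled by the e-values
    return V[:, :m] * w[:m]

X = np.random.randn(50, 3)
print(spectral_embedding(X).shape)   # (50, 2)
```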
Kernel PCA
Data $x$ is implicitly mapped to "feature space" $\phi(x)$ of kernel $K$, s.t. $K(x,y) = \phi(x) \cdot \phi(y)$.
PCA is performed in feature space.
Projecting points in high dimension can reveal a straight line along which they are almost aligned (if the basis, i.e. the kernel, is "right").
Kernel PCA
Eigenvectors $v_k$ of the (generally infinite) covariance matrix $C = \frac{1}{n} \sum_i \phi(x_i) \phi(x_i)'$ are $v_k = \sum_i \alpha_{ki}\, \phi(x_i)$, where $\alpha_k$ is an eigenvector of the Gram matrix $M_{ij} = K(x_i, x_j)$.
Projection on $k$-th p.c. = $v_k \cdot \phi(x) = \sum_i \alpha_{ki} K(x_i, x)$.
N.B. need $\phi(x)$ centered, $\sum_i \phi(x_i) = 0$: subtractive normalization (Schölkopf et al., 1996).
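A hedged sketch of this computation: double-center the Gram matrix (the subtractive normalization), diagonalize it, scale each eigenvector so the corresponding p.c. has unit norm in feature space, and project. The Gaussian Gram matrix at the end is a placeholder example.

```python
import numpy as np

def kernel_pca(K, m=2):
    n = K.shape[0]
    # subtractive normalization: centers phi(x) in feature space
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H
    ell, V = np.linalg.eigh(Kc)
    ell, V = ell[::-1], V[:, ::-1]        # descending e-values
    # alpha_k = v_k / sqrt(ell_k) gives unit-norm p.c.'s in feature space
    # (assumes the m leading e-values are strictly positive)
    alphas = V[:, :m] / np.sqrt(ell[:m])
    # projection of training point i on k-th p.c. = sum_j alpha_kj Kc[j, i]
    return Kc @ alphas

X = np.random.randn(40, 5)
K = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1))   # Gaussian Gram matrix
print(kernel_pca(K).shape)   # (40, 2)
```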
Laplacian Eigenmaps
Gram matrix from the Laplace-Beltrami operator $\mathcal{L}$, which on finite data (neighborhood graph) gives the graph Laplacian.
Gaussian kernel, approximated by a k-NN adjacency matrix $W$. Normalization: row averages (degrees) minus the Gram matrix, i.e. the graph Laplacian $L = D - W$ with $D_{ii} = \sum_j W_{ij}$.
The Laplace-Beltrami operator is justified as a smoothness regularizer on the manifold $\mathcal{M}$: $\int_{\mathcal{M}} \|\nabla f\|^2 = \langle f, \mathcal{L} f \rangle$, which equals the eigenvalue of $\mathcal{L}$ for unit-norm eigenfunctions.
Successfully used for semi-supervised learning (Belkin & Niyogi, 2002).
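A sketch of this construction, assuming a symmetric k-NN graph with binary weights (one common instantiation; the Gaussian-kernel variant mentioned above would use heat-kernel edge weights instead):

```python
import numpy as np

def laplacian_eigenmaps(X, m=2, k=5):
    n = X.shape[0]
    sq = ((X[:, None] - X[None]) ** 2).sum(-1)
    # symmetric k-NN adjacency matrix approximating the manifold
    idx = np.argsort(sq, axis=1)[:, 1:k + 1]
    W = np.zeros((n, n))
    W[np.arange(n)[:, None], idx] = 1.0
    W = np.maximum(W, W.T)
    L = np.diag(W.sum(1)) - W          # graph Laplacian L = D - W
    w, V = np.linalg.eigh(L)
    # skip the constant e-vector (e-value ~ 0); keep the next m smallest
    return V[:, 1:m + 1]

X = np.random.randn(60, 3)
print(laplacian_eigenmaps(X).shape)    # (60, 2)
```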
Spectral Clustering
Normalize the kernel or Gram matrix divisively: $\tilde{M}_{ij} = \frac{M_{ij}}{\sqrt{S_i S_j}}$, where $S_i = \sum_j M_{ij}$.
Embedding of $x_i$ = $(v_{1i}, \ldots, v_{mi})$, where $v_k$ is the $k$-th eigenvector of the normalized Gram matrix.
Perform clustering on the embedded points (e.g. after normalizing them by their norm).
Weiss, Ng, Jordan, ...
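A sketch of the pipeline just described; the tiny k-means loop at the end is a stand-in for whatever clustering routine is applied to the embedded points:

```python
import numpy as np

def spectral_clustering(M, n_clusters=2, iters=20, seed=0):
    s = M.sum(axis=1)
    M_tilde = M / np.sqrt(np.outer(s, s))          # divisive normalization
    _, V = np.linalg.eigh(M_tilde)
    E = V[:, -n_clusters:]                         # leading e-vectors
    E = E / np.linalg.norm(E, axis=1, keepdims=True)   # normalize by norm
    rng = np.random.default_rng(seed)
    centers = E[rng.choice(len(E), n_clusters, replace=False)]
    for _ in range(iters):                         # plain k-means on embedding
        labels = ((E[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.array([E[labels == c].mean(0) if (labels == c).any()
                            else centers[c] for c in range(n_clusters)])
    return labels
```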
Spectral Clustering
On the unit sphere, the principal eigenfunctions approximate the kernel (= dot product) in the MSE sense.
$x$ and $y$ in the same cluster: embeddings almost collinear; $x$ and $y$ in different clusters: embeddings almost orthogonal.
Points in the same cluster are mapped to points at a nearby angle, even for a non-blob cluster (global constraint = transitivity of "nearness").
Density-Dependent Hilbert Space
Define a Hilbert space with density-dependent inner product $\langle f, g \rangle_p = \int f(x) g(x) p(x)\,dx$, with density $p$. A kernel function $K$ defines a linear operator in that space: $(K_p f)(x) = \int K(x,y) f(y) p(y)\,dy$.
Eigenfunctions of a Kernel
Infinite-dimensional version of the eigenvectors of the Gram matrix: $(K_p f_k)(x) = \lambda_k f_k(x)$ (some conditions needed to obtain a discrete spectrum).
Convergence of the e-vectors/e-values of the Gram matrix, built from $n$ data points sampled from $p$, to the e-functions/e-values of the linear operator with underlying $p$, proven as $n \to \infty$ (Williams & Seeger, 2000).
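A numeric illustration of this convergence, assuming a Gaussian kernel and data drawn from a standard normal density (both illustrative choices): the leading e-values of the Gram matrix divided by $n$ stabilize as $n$ grows.

```python
import numpy as np

def operator_eigenvalues(n, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)                # sample from p = N(0, 1)
    K = np.exp(-(x[:, None] - x[None]) ** 2 / (2 * sigma ** 2))
    # (1/n) * Gram matrix discretizes (K_p f)(x) = int K(x,y) f(y) p(y) dy
    return np.sort(np.linalg.eigvalsh(K / n))[::-1]

for n in (100, 400, 1600):
    print(n, operator_eigenvalues(n)[:4])     # leading e-values stabilize
```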
Link between Spectral Clustering and Eigenfunctions
Equivalence between eigenvectors and eigenfunctions (and corresponding eigenvalues) when $p$ is the empirical distribution.
Proposition 1: If we choose for $p$ the empirical distribution of the data, then the spectral embedding from $\tilde{M}$ is equivalent to the values of the eigenfunctions of the normalized kernel $\tilde{K}$: $f_k(x_i) = \sqrt{n}\, v_{ki}$.
Proof: come and see our poster!
Link between Kernel PCA and Eigenfunctions
Proposition 2: If we choose for $p$ the empirical distribution of the data, then the kernel PCA projection is equivalent to scaled values of the eigenfunctions of $\tilde{K}$: $\pi_k(x) = \sqrt{\lambda_k}\, f_k(x)$.
Proof: come and see our poster!
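A quick numerical check of both propositions under the empirical distribution, assuming a Gaussian kernel with subtractive centering (illustrative choices): the eigenfunction values at the training points should equal $\sqrt{n}\, v_{ki}$, and the kernel PCA projection should equal $\sqrt{\lambda_k}\, f_k(x_i)$.

```python
import numpy as np

n = 30
X = np.random.default_rng(1).standard_normal((n, 2))
K = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1))
H = np.eye(n) - np.ones((n, n)) / n
Kt = H @ K @ H                                  # normalized kernel matrix
ell, V = np.linalg.eigh(Kt)
ell, V = ell[::-1], V[:, ::-1]                  # descending e-values
k = 0
f_k = np.sqrt(n) * V[:, k]                      # Prop. 1: f_k(x_i) = sqrt(n) v_ki
lam_k = ell[k] / n                              # operator e-value
proj = Kt @ (V[:, k] / np.sqrt(ell[k]))         # kernel PCA projection
print(np.allclose(proj, np.sqrt(lam_k) * f_k))  # Prop. 2 -> True
```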
Consequence: up to the choice of kernel, kernel normalization, and scaling by $\sqrt{\lambda_k}$, spectral clustering, Laplacian eigenmaps and kernel PCA give the same embedding. Isomap, MDS and LLE also give eigenfunctions, but from a different type of kernel.
From Embedding to General Mapping
Laplacian eigenmaps, spectral clustering, Isomap, LLE, and MDS only provide an embedding for the given data points.
Natural generalization to new points: consider these algorithms as learning eigenfunctions of $\tilde{K}$.
The eigenfunctions provide a mapping for new points, e.g. for empirical $p$: $f_k(x) = \frac{\sqrt{n}}{\ell_k} \sum_{i=1}^{n} v_{ki}\, \tilde{K}(x, x_i)$.
Data-dependent "kernels" (Isomap, LLE): need to compute $\tilde{K}(x, x_i)$ without changing the training embedding. Reasonable for Isomap; less clear that it makes sense for LLE.
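A sketch of this mapping, assuming the `kernel` argument already computes the same normalized kernel $\tilde{K}$ used to build the training Gram matrix (extending the normalization to a new point is part of that assumption):

```python
import numpy as np

def eigenfunction_map(x, X, V, ell, kernel, m=2):
    """Embed a new point x, given training data X, the e-vectors V and
    e-values ell of the normalized Gram matrix, and the normalized kernel."""
    n = X.shape[0]
    kx = np.array([kernel(x, xi) for xi in X])   # K_tilde(x, x_i) for all i
    # f_k(x) = (sqrt(n) / ell_k) * sum_i v_ki * K_tilde(x, x_i)
    return np.array([np.sqrt(n) / ell[k] * (V[:, k] @ kx) for k in range(m)])
```

At a training point $x_i$ this reduces to $\sqrt{n}\, v_{ki}$, consistent with Proposition 1.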
Criterion to Learn Eigenfunctions
Proposition 3: Given the first $m-1$ eigenfunctions $f_1, \ldots, f_{m-1}$ of a symmetric function $\tilde{K}$, the $m$-th one can be obtained by minimizing, w.r.t. $f$ and $\lambda$, the expected value of $\big(\tilde{K}(x,y) - \sum_{k=1}^{m-1} \lambda_k f_k(x) f_k(y) - \lambda f(x) f(y)\big)^2$ over $p$. Then we get $f = f_m$ and $\lambda = \lambda_m$.
This helps understand what the eigenfunctions are doing (approximating the "dot product" $\tilde{K}(x,y)$) and provides a possible criterion for estimating the eigenfunctions when $p$ is not an empirical distribution.
Kernels such as the Gaussian kernel and nearest-neighbor-related kernels force the eigenfunctions to reconstruct $\tilde{K}(x,y)$ correctly only for nearby objects: in high dimension, don't trust the Euclidean distance between far-apart objects.
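A numerical illustration of the criterion under the empirical distribution, using an illustrative Gaussian kernel: among unit-norm candidates, the principal eigenvector attains the smallest rank-1 reconstruction error of the kernel matrix, as the $m=1$ case of Proposition 3 predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((25, 2))
Kt = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1))
ell, V = np.linalg.eigh(Kt)

def recon_error(u, lam):
    # mean squared error of the rank-1 approximation lam * u u'
    return ((Kt - lam * np.outer(u, u)) ** 2).mean()

best = recon_error(V[:, -1], ell[-1])              # principal eigenpair
rand_errs = []
for _ in range(100):
    u = rng.standard_normal(25)
    u /= np.linalg.norm(u)
    rand_errs.append(recon_error(u, u @ Kt @ u))   # optimal lambda = <u, K u>
print(best <= min(rand_errs))                      # True
```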
Using a Smooth Density to Define Eigenfunctions?
Use your best estimator $\hat{p}$ of the density of the data, instead of the data themselves, for defining the eigenfunctions.
A constrained class of e-fns, e.g. neural networks, can force the e-fns to be smooth and not necessarily local.
Advantage? Better generalization away from the training points?
Advantage? Better scaling with $n$? (no Gram matrix, no e-vectors)
Disadvantage? Optimization of the e-fns may be more difficult?
Recovering the Density from the Eigenfunctions?
Visually, the eigenfunctions appear to capture the main characteristics of the density.
Can we obtain a better estimate of the density using the principal eigenfunctions?
(Girolami, 2001): truncate the orthogonal-series expansion of the density in terms of the eigenfunctions.
Use ideas similar to (Teh & Roweis, 2003) and other mixtures of factor analyzers: project back into input space, convolving with a model of the reconstruction error as noise.
Role of Kernel Normalization?
Subtractive normalization yields kernel PCA: $\tilde{K}(x,y) = K(x,y) - E_x[K(x,y)] - E_y[K(x,y)] + E_x[E_y[K(x,y)]]$.
Thus the corresponding kernel is expanded so that the constant function is an eigenfunction (with eigenvalue 0) and the other eigenfunctions have zero mean and unit variance.
Double-centering normalization (MDS, Isomap): as above, but applied to squared distances (based on the relation between dot product and distance).
What can be said about the divisive normalization? It seems better at clustering.
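The two normalizations discussed above, side by side on a Gram matrix `M` (a sketch following the slide's empirical-expectation definitions):

```python
import numpy as np

def subtractive(M):
    """Kernel PCA / double-centering: the constant vector becomes an
    eigenvector with eigenvalue 0; other eigenvectors have zero mean."""
    n = M.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ M @ H

def divisive(M):
    """Spectral clustering normalization: M_ij / sqrt(S_i * S_j)."""
    s = M.sum(axis=1)
    return M / np.sqrt(np.outer(s, s))
```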
Multi-layer Learning of Similarity and Density?
The learned eigenfunctions capture salient features of the distribution: abstractions such as clusters and manifolds.
Old AI (and connectionist) idea: build high-level abstractions on top of lower-level abstractions.
local Euclidean similarity + empirical density model -> farther-reaching notion of similarity + improved density model -> ...
Density-Adjusted Similarity and Kernel
[Figure: three points A, B, C.]
Want $A$ and $B$ "closer" than $A$ and $C$.
Define a density-adjusted distance as a geodesic w.r.t. a Riemannian metric, with a metric tensor that penalizes low density.
SEE OTHER POSTER (Vincent & Bengio)
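A hypothetical sketch in this spirit (the actual construction is in the Vincent & Bengio poster, so the inverse-density edge cost below is purely an assumption for illustration): build a neighbor graph, make edges through low-density regions expensive, and take shortest paths as approximate geodesics.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def density_adjusted_distances(X, k=5, bandwidth=1.0):
    n = X.shape[0]
    d = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
    dens = np.exp(-d ** 2 / (2 * bandwidth ** 2)).mean(1)  # Parzen estimate
    G = np.zeros((n, n))               # 0 = no edge (dense csgraph convention)
    idx = np.argsort(d, axis=1)[:, 1:k + 1]
    for i in range(n):
        for j in idx[i]:
            # hypothetical cost: edges through low-density endpoints get longer
            G[i, j] = G[j, i] = d[i, j] / np.sqrt(dens[i] * dens[j])
    return shortest_path(G, method='D', directed=False)
```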
Density-Adjusted Similarity and Kernel
[Figure: four panels comparing embeddings of spiral data: original spirals; Gaussian kernel spectral embedding; two density-adjusted embeddings.]
Conclusions
Many unsupervised learning algorithms (kernel PCA, spectral clustering, Laplacian eigenmaps, MDS, LLE, Isomap) are linked: they compute eigenfunctions of a normalized kernel.
The embedding can be generalized to a mapping applicable to new points.
The eigenfunctions seem to capture salient features of the distribution by minimizing the kernel reconstruction error.
Many questions remain open:
can the eigenfunctions recover an explicit density function?
finding e-fns with a smooth $\hat{p}$?
meaning of the various kernel normalizations?
multi-layer learning?
density-adjusted similarity (see the Vincent & Bengio poster).
Proposition 3
The principal eigenfunction of the linear operator corresponding to kernel $\tilde{K}$ is the (or a, if eigenvalues are repeated) norm-1 function $f$ that minimizes the reconstruction error $E_{x,y}\big[(\tilde{K}(x,y) - \lambda f(x) f(y))^2\big]$.
Proof of Proposition 1
Proposition 1: If we choose for $p$ the empirical distribution of the data, then the spectral embedding from $\tilde{M}$ is equivalent to the values of the eigenfunctions of the normalized kernel $\tilde{K}$: $f_k(x_i) = \sqrt{n}\, v_{ki}$.
(Simplified) proof:
As shown in Proposition 3, finding the function $f$ and scalar $\lambda$ minimizing $E_{x,y}\big[(\tilde{K}(x,y) - \lambda f(x) f(y))^2\big]$ s.t. $\langle f, f \rangle_p = 1$ yields a solution satisfying $(\tilde{K}_p f)(x) = \lambda f(x)$, with $\lambda$ the (possibly repeated) maximum-norm eigenvalue.
Proof of Proposition 1
With empirical $p$, the above criterion becomes (summing over the data points):
$\frac{1}{n^2} \sum_{i,j} \big(\tilde{M}_{ij} - \lambda f(x_i) f(x_j)\big)^2$.
Write $u_i = f(x_i)/\sqrt{n}$ and $\ell = n\lambda$; then $\|u\| = 1$ and the criterion is the rank-1 approximation error of $\tilde{M}$, and we obtain for the principal eigenvector: $\tilde{M} u = \ell u$, i.e. $u = v_1$, so $f(x_i) = \sqrt{n}\, v_{1i}$ and $\lambda = \ell_1 / n$.
For the other eigenvalues, consider the "residual kernel" $\tilde{K}(x,y) - \lambda_1 f_1(x) f_1(y)$ and recursively apply the same reasoning to obtain $f_2, \lambda_2$, $f_3, \lambda_3$, etc.
Q.E.D.
Proof of Proposition 2
Proposition 2: If we choose for $p$ the empirical distribution of the data, then the kernel PCA projection is equivalent to scaled values of the eigenfunctions of $\tilde{K}$: $\pi_k(x) = \sqrt{\lambda_k}\, f_k(x)$.
(Simplified) proof:
Apply the linear operator $\tilde{K}_p$ on both sides of $(\tilde{K}_p f_k)(x) = \lambda_k f_k(x)$:
$\int \tilde{K}(x,y) \left( \int \tilde{K}(y,z) f_k(z) p(z)\,dz \right) p(y)\,dy = \lambda_k \int \tilde{K}(x,y) f_k(y) p(y)\,dy$
or, changing the order of integrals on the left-hand side:
$\int \left( \int \tilde{K}(x,y)\, \tilde{K}(y,z)\, p(y)\,dy \right) f_k(z) p(z)\,dz = \lambda_k \int \tilde{K}(x,y) f_k(y) p(y)\,dy$
Plug in the empirical $p$ and write $\tilde{K}(x,y) = \phi(x) \cdot \phi(y)$:
$\frac{1}{n^2} \sum_{i,j} \phi(x) \cdot \phi(x_i)\, \big(\phi(x_i) \cdot \phi(x_j)\big)\, f_k(x_j) = \frac{\lambda_k}{n} \sum_j \phi(x) \cdot \phi(x_j)\, f_k(x_j)$
Proof of Proposition 2
which contains the elements of the covariance matrix $C = \frac{1}{n} \sum_i \phi(x_i) \phi(x_i)'$, thus yielding
$\phi(x) \cdot C w_k = \lambda_k\, \phi(x) \cdot w_k$, where $w_k = \frac{1}{n} \sum_j f_k(x_j)\, \phi(x_j)$.
So, where $\phi(x)$ takes its values, $C w_k = \lambda_k w_k$: $w_k$ is the $k$-th principal direction in feature space. By Proposition 1, $f_k(x_j) = \sqrt{n}\, v_{kj}$, so $w_k \propto \sum_j v_{kj}\, \phi(x_j)$, where $v_k$ is also the $k$-th e-vector of $\tilde{M}$.
Proof of Proposition 2
PCA projection on $w_k / \|w_k\|$ is $\pi_k(x) = \phi(x) \cdot \frac{w_k}{\|w_k\|}$. Since $\phi(x) \cdot w_k = (\tilde{K}_p f_k)(x) = \lambda_k f_k(x)$ and $\|w_k\|^2 = \langle f_k, \tilde{K}_p f_k \rangle_p = \lambda_k$, we get $\pi_k(x) = \sqrt{\lambda_k}\, f_k(x)$.
Q.E.D.
Proof of Proposition 3
Proposition 3: Given the first $m-1$ eigenfunctions $f_1, \ldots, f_{m-1}$ of a symmetric function $\tilde{K}$, the $m$-th one can be obtained by minimizing, w.r.t. $f$ and $\lambda$, the expected value of $\big(\tilde{K}(x,y) - \sum_{k=1}^{m-1} \lambda_k f_k(x) f_k(y) - \lambda f(x) f(y)\big)^2$ over $p$. Then we get $f = f_m$ and $\lambda = \lambda_m$.
Proof:
Reconstruction error using the approximation $\tilde{K}(x,y) \approx \sum_{k=1}^{m-1} \lambda_k f_k(x) f_k(y) + \lambda f(x) f(y)$:
$J = E_{x,y}\big[(K_{m-1}(x,y) - \lambda f(x) f(y))^2\big]$,
where $K_{m-1}(x,y) = \tilde{K}(x,y) - \sum_{k=1}^{m-1} \lambda_k f_k(x) f_k(y)$ is the residual kernel, with $\langle f, f \rangle_p = 1$, and $(f_k, \lambda_k)$ are the first $m-1$ (eigenfunction, eigenvalue) pairs in order of decreasing absolute value of $\lambda_k$.
Proof of Proposition 3
Minimization of $J$ w.r.t. $\lambda$ gives
$\lambda = E_{x,y}[K_{m-1}(x,y) f(x) f(y)] = \langle f, K_{m-1} f \rangle_p$   (1)
Using eq. 1, $J = E_{x,y}[K_{m-1}(x,y)^2] - \lambda^2$, so $\lambda^2$ should be maximized.
Proof of Proposition 3
Take the derivative of $J$ w.r.t. $f$ and set it equal to zero. Using $\langle f, f \rangle_p = 1$:
$(K_{m-1} f)(x) = \lambda f(x)$   (2)
Using the recursive assumption that $f$ and the $f_k$ are orthogonal for $k < m$: $(\tilde{K} f)(x) = \lambda f(x)$.
Write the application of $\tilde{K}$ to $f$ in terms of the eigenfunctions: $f = \sum_k c_k f_k$, so $\tilde{K} f = \sum_k \lambda_k c_k f_k$.
Proof of Proposition 3
we obtain $\sum_k \lambda_k c_k f_k = \lambda \sum_k c_k f_k$.
Applying Parseval's theorem to obtain the norm on both sides: $\sum_k \lambda_k^2 c_k^2 = \lambda^2 \sum_k c_k^2 = \lambda^2$.
If the $\lambda_k$'s are distinct, $\lambda^2 = \sum_k \lambda_k^2 c_k^2$ is maximized when $c_m = 1$ and $c_k = 0$ for $k \neq m$ (by orthogonality with $f_1, \ldots, f_{m-1}$, the largest remaining $|\lambda_k|$ is $|\lambda_m|$).
Since $\langle f, f \rangle_p = 1$ and we obtained $c_m = 1$ and $c_k = 0$ otherwise, we get $f = f_m$ and $\lambda = \lambda_m$.
Q.E.D.