Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA
Yoshua Bengio, Pascal Vincent, Jean-François Paiement
University of Montreal
Learning'2003 Workshop, Snowbird, April 2003
Learning Modal Structures of the Distribution
Manifold learning and clustering = learning where the main high-density zones are.
Learning a transformation that reveals "clusters" and manifolds.
Cluster = zone of high density, separated from other clusters by regions of low density.
Spectral Embedding Algorithms
Many learning algorithms, e.g.
spectral clustering, kernel PCA,
Locally Linear Embedding (LLE), Isomap,
Multi-Dimensional Scaling (MDS), Laplacian eigenmaps
have at their core the following (or its equivalent):
1. Start from $n$ data points $x_1, \ldots, x_n$.
2. Construct a "neighborhood" or similarity matrix $M$ (with corresponding [possibly data-dependent] kernel $K$).
3. Normalize it (and make it symmetric), yielding $\tilde{M}$ (with corresponding kernel $\tilde{K}$).
4. Compute the $m$ largest (equivalently, smallest) e-values/e-vectors.
5. Embedding of $x_i$ = the $i$-th elements of each of the $m$ e-vectors (possibly scaled using the e-values).
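A minimal numpy sketch of this five-step recipe, assuming a Gaussian similarity and divisive normalization (both illustrative choices, not fixed by the slide):

```python
import numpy as np

def spectral_embedding(X, m=2, sigma=1.0):
    # 1. data points (n x d)
    n = X.shape[0]
    # 2. similarity matrix from a Gaussian kernel (an illustrative choice)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    M = np.exp(-sq / (2 * sigma ** 2))
    # 3. divisive normalization, keeping the matrix symmetric
    s = M.sum(axis=1)
    M_tilde = M / np.sqrt(np.outer(s, s))
    # 4. leading e-values/e-vectors (eigh returns ascending order)
    w, V = np.linalg.eigh(M_tilde)
    w, V = w[::-1], V[:, ::-1]
    # 5. embedding of x_i = i-th coordinates of the m leading e-vectors,
    #    optionally scaled by the e-values
    return V[:, :m] * w[:m]

X = np.random.randn(50, 3)
print(spectral_embedding(X).shape)   # (50, 2)
```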
Kernel PCA
Data $x$ is implicitly mapped to "feature space" $\phi(x)$ of kernel $K$, s.t. $K(x,y) = \phi(x) \cdot \phi(y)$.
PCA is performed in feature space.
Projecting points in high dimension can reveal a straight line along which they are almost aligned (if the basis, i.e. the kernel, is "right").
Kernel PCA
Eigenvectors $v_k$ of the (generally infinite) covariance matrix $C = \frac{1}{n} \sum_i \phi(x_i) \phi(x_i)'$ are $v_k = \sum_i \alpha_{ki}\, \phi(x_i)$, where $\alpha_k$ is an eigenvector of the Gram matrix $M_{ij} = K(x_i, x_j)$.
Projection on $k$-th p.c. = $v_k \cdot \phi(x) = \sum_i \alpha_{ki} K(x_i, x)$.
N.B. need $\phi(x)$ centered, $\sum_i \phi(x_i) = 0$: subtractive normalization (Schölkopf et al., 1996).
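A hedged sketch of this computation: double-center the Gram matrix (the subtractive normalization), diagonalize it, scale each eigenvector so the corresponding p.c. has unit norm in feature space, and project. The Gaussian Gram matrix at the end is a placeholder example.

```python
import numpy as np

def kernel_pca(K, m=2):
    n = K.shape[0]
    # subtractive normalization: centers phi(x) in feature space
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H
    ell, V = np.linalg.eigh(Kc)
    ell, V = ell[::-1], V[:, ::-1]        # descending e-values
    # alpha_k = v_k / sqrt(ell_k) gives unit-norm p.c.'s in feature space
    # (assumes the m leading e-values are strictly positive)
    alphas = V[:, :m] / np.sqrt(ell[:m])
    # projection of training point i on k-th p.c. = sum_j alpha_kj Kc[j, i]
    return Kc @ alphas

X = np.random.randn(40, 5)
K = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1))   # Gaussian Gram matrix
print(kernel_pca(K).shape)   # (40, 2)
```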
Laplacian Eigenmaps
Gram matrix from the Laplace-Beltrami operator $\mathcal{L}$, which on finite data (neighborhood graph) gives the graph Laplacian.
Gaussian kernel, approximated by a k-NN adjacency matrix $W$. Normalization: row averages (degrees) minus the Gram matrix, i.e. the graph Laplacian $L = D - W$ with $D_{ii} = \sum_j W_{ij}$.
The Laplace-Beltrami operator is justified as a smoothness regularizer on the manifold $\mathcal{M}$: $\int_{\mathcal{M}} \|\nabla f\|^2 = \langle f, \mathcal{L} f \rangle$, which equals the eigenvalue of $\mathcal{L}$ for unit-norm eigenfunctions.
Successfully used for semi-supervised learning (Belkin & Niyogi, 2002).
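A sketch of this construction, assuming a symmetric k-NN graph with binary weights (one common instantiation; the Gaussian-kernel variant mentioned above would use heat-kernel edge weights instead):

```python
import numpy as np

def laplacian_eigenmaps(X, m=2, k=5):
    n = X.shape[0]
    sq = ((X[:, None] - X[None]) ** 2).sum(-1)
    # symmetric k-NN adjacency matrix approximating the manifold
    idx = np.argsort(sq, axis=1)[:, 1:k + 1]
    W = np.zeros((n, n))
    W[np.arange(n)[:, None], idx] = 1.0
    W = np.maximum(W, W.T)
    L = np.diag(W.sum(1)) - W          # graph Laplacian L = D - W
    w, V = np.linalg.eigh(L)
    # skip the constant e-vector (e-value ~ 0); keep the next m smallest
    return V[:, 1:m + 1]

X = np.random.randn(60, 3)
print(laplacian_eigenmaps(X).shape)    # (60, 2)
```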
Spectral Clustering
Normalize the kernel or Gram matrix divisively: $\tilde{M}_{ij} = \frac{M_{ij}}{\sqrt{S_i S_j}}$, where $S_i = \sum_j M_{ij}$.
Embedding of $x_i$ = $(v_{1i}, \ldots, v_{mi})$, where $v_k$ is the $k$-th eigenvector of the normalized Gram matrix.
Perform clustering on the embedded points (e.g. after normalizing them by their norm).
Weiss, Ng, Jordan, ...
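A sketch of the pipeline just described; the tiny k-means loop at the end is a stand-in for whatever clustering routine is applied to the embedded points:

```python
import numpy as np

def spectral_clustering(M, n_clusters=2, iters=20, seed=0):
    s = M.sum(axis=1)
    M_tilde = M / np.sqrt(np.outer(s, s))          # divisive normalization
    _, V = np.linalg.eigh(M_tilde)
    E = V[:, -n_clusters:]                         # leading e-vectors
    E = E / np.linalg.norm(E, axis=1, keepdims=True)   # normalize by norm
    rng = np.random.default_rng(seed)
    centers = E[rng.choice(len(E), n_clusters, replace=False)]
    for _ in range(iters):                         # plain k-means on embedding
        labels = ((E[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.array([E[labels == c].mean(0) if (labels == c).any()
                            else centers[c] for c in range(n_clusters)])
    return labels
```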
Spectral Clustering
On the unit sphere, the principal eigenfunctions approximate the kernel (= dot product) in the MSE sense.
$x$ and $y$ in the same cluster: embeddings almost collinear; $x$ and $y$ in different clusters: embeddings almost orthogonal.
Points in the same cluster are mapped to points at a nearby angle, even for a non-blob cluster (global constraint = transitivity of "nearness").
Density-Dependent Hilbert Space
Define a Hilbert space with density-dependent inner product $\langle f, g \rangle_p = \int f(x) g(x) p(x)\,dx$, with density $p$. A kernel function $K$ defines a linear operator in that space: $(K_p f)(x) = \int K(x,y) f(y) p(y)\,dy$.
Eigenfunctions of a Kernel
Infinite-dimensional version of the eigenvectors of the Gram matrix: $(K_p f_k)(x) = \lambda_k f_k(x)$ (some conditions needed to obtain a discrete spectrum).
Convergence of the e-vectors/e-values of the Gram matrix, built from $n$ data points sampled from $p$, to the e-functions/e-values of the linear operator with underlying $p$, proven as $n \to \infty$ (Williams & Seeger, 2000).
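A numeric illustration of this convergence, assuming a Gaussian kernel and data drawn from a standard normal density (both illustrative choices): the leading e-values of the Gram matrix divided by $n$ stabilize as $n$ grows.

```python
import numpy as np

def operator_eigenvalues(n, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)                # sample from p = N(0, 1)
    K = np.exp(-(x[:, None] - x[None]) ** 2 / (2 * sigma ** 2))
    # (1/n) * Gram matrix discretizes (K_p f)(x) = int K(x,y) f(y) p(y) dy
    return np.sort(np.linalg.eigvalsh(K / n))[::-1]

for n in (100, 400, 1600):
    print(n, operator_eigenvalues(n)[:4])     # leading e-values stabilize
```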
Link between Spectral Clustering and Eigenfunctions
Equivalence between eigenvectors and eigenfunctions (and corresponding eigenvalues) when $p$ is the empirical distribution.
Proposition 1: If we choose for $p$ the empirical distribution of the data, then the spectral embedding from $\tilde{M}$ is equivalent to the values of the eigenfunctions of the normalized kernel $\tilde{K}$: $f_k(x_i) = \sqrt{n}\, v_{ki}$.
Proof: come and see our poster!
Link between Kernel PCA and Eigenfunctions
Proposition 2: If we choose for $p$ the empirical distribution of the data, then the kernel PCA projection is equivalent to scaled values of the eigenfunctions of $\tilde{K}$: $\pi_k(x) = \sqrt{\lambda_k}\, f_k(x)$.
Proof: come and see our poster!
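A quick numerical check of both propositions under the empirical distribution, assuming a Gaussian kernel with subtractive centering (illustrative choices): the eigenfunction values at the training points should equal $\sqrt{n}\, v_{ki}$, and the kernel PCA projection should equal $\sqrt{\lambda_k}\, f_k(x_i)$.

```python
import numpy as np

n = 30
X = np.random.default_rng(1).standard_normal((n, 2))
K = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1))
H = np.eye(n) - np.ones((n, n)) / n
Kt = H @ K @ H                                  # normalized kernel matrix
ell, V = np.linalg.eigh(Kt)
ell, V = ell[::-1], V[:, ::-1]                  # descending e-values
k = 0
f_k = np.sqrt(n) * V[:, k]                      # Prop. 1: f_k(x_i) = sqrt(n) v_ki
lam_k = ell[k] / n                              # operator e-value
proj = Kt @ (V[:, k] / np.sqrt(ell[k]))         # kernel PCA projection
print(np.allclose(proj, np.sqrt(lam_k) * f_k))  # Prop. 2 -> True
```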
Consequence: up to the choice of kernel, kernel normalization, and scaling by $\sqrt{\lambda_k}$, spectral clustering, Laplacian eigenmaps and kernel PCA give the same embedding. Isomap, MDS and LLE also give eigenfunctions, but from a different type of kernel.
From Embedding to General Mapping
Laplacian eigenmaps, spectral clustering, Isomap, LLE, and MDS only provide an embedding for the given data points.
Natural generalization to new points: consider these algorithms as learning eigenfunctions of $\tilde{K}$.
The eigenfunctions provide a mapping for new points, e.g. for empirical $p$: $f_k(x) = \frac{\sqrt{n}}{\ell_k} \sum_{i=1}^{n} v_{ki}\, \tilde{K}(x, x_i)$.
Data-dependent "kernels" (Isomap, LLE): need to compute $\tilde{K}(x, x_i)$ without changing the training embedding. Reasonable for Isomap; less clear that it makes sense for LLE.
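A sketch of this mapping, assuming the `kernel` argument already computes the same normalized kernel $\tilde{K}$ used to build the training Gram matrix (extending the normalization to a new point is part of that assumption):

```python
import numpy as np

def eigenfunction_map(x, X, V, ell, kernel, m=2):
    """Embed a new point x, given training data X, the e-vectors V and
    e-values ell of the normalized Gram matrix, and the normalized kernel."""
    n = X.shape[0]
    kx = np.array([kernel(x, xi) for xi in X])   # K_tilde(x, x_i) for all i
    # f_k(x) = (sqrt(n) / ell_k) * sum_i v_ki * K_tilde(x, x_i)
    return np.array([np.sqrt(n) / ell[k] * (V[:, k] @ kx) for k in range(m)])
```

At a training point $x_i$ this reduces to $\sqrt{n}\, v_{ki}$, consistent with Proposition 1.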
Criterion to Learn Eigenfunctions
Proposition 3: Given the first $m-1$ eigenfunctions $f_1, \ldots, f_{m-1}$ of a symmetric function $\tilde{K}$, the $m$-th one can be obtained by minimizing, w.r.t. $f$ and $\lambda$, the expected value of $\big(\tilde{K}(x,y) - \sum_{k=1}^{m-1} \lambda_k f_k(x) f_k(y) - \lambda f(x) f(y)\big)^2$ over $p$. Then we get $f = f_m$ and $\lambda = \lambda_m$.
This helps understand what the eigenfunctions are doing (approximating the "dot product" $\tilde{K}(x,y)$) and provides a possible criterion for estimating the eigenfunctions when $p$ is not an empirical distribution.
Kernels such as the Gaussian kernel and nearest-neighbor-related kernels force the eigenfunctions to reconstruct $\tilde{K}(x,y)$ correctly only for nearby objects: in high dimension, don't trust the Euclidean distance between far-apart objects.
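A numerical illustration of the criterion under the empirical distribution, using an illustrative Gaussian kernel: among unit-norm candidates, the principal eigenvector attains the smallest rank-1 reconstruction error of the kernel matrix, as the $m=1$ case of Proposition 3 predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((25, 2))
Kt = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1))
ell, V = np.linalg.eigh(Kt)

def recon_error(u, lam):
    # mean squared error of the rank-1 approximation lam * u u'
    return ((Kt - lam * np.outer(u, u)) ** 2).mean()

best = recon_error(V[:, -1], ell[-1])              # principal eigenpair
rand_errs = []
for _ in range(100):
    u = rng.standard_normal(25)
    u /= np.linalg.norm(u)
    rand_errs.append(recon_error(u, u @ Kt @ u))   # optimal lambda = <u, K u>
print(best <= min(rand_errs))                      # True
```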
Using a Smooth Density to Define Eigenfunctions?
Use your best estimator $\hat{p}$ of the density of the data, instead of the data themselves, for defining the eigenfunctions.
A constrained class of e-fns, e.g. neural networks, can force the e-fns to be smooth and not necessarily local.
Advantage? Better generalization away from the training points?
Advantage? Better scaling with $n$? (no Gram matrix, no e-vectors)
Disadvantage? Optimization of the e-fns may be more difficult?
Recovering the Density from the Eigenfunctions?
Visually, the eigenfunctions appear to capture the main characteristics of the density.
Can we obtain a better estimate of the density using the principal eigenfunctions?
(Girolami, 2001): truncate the orthogonal-series expansion of the density in terms of the eigenfunctions.
Use ideas similar to (Teh & Roweis, 2003) and other mixtures of factor analyzers: project back into input space, convolving with a model of the reconstruction error as noise.
Role of Kernel Normalization?
Subtractive normalization yields kernel PCA: $\tilde{K}(x,y) = K(x,y) - E_x[K(x,y)] - E_y[K(x,y)] + E_x[E_y[K(x,y)]]$.
Thus the corresponding kernel is expanded so that the constant function is an eigenfunction (with eigenvalue 0) and the other eigenfunctions have zero mean and unit variance.
Double-centering normalization (MDS, Isomap): as above, but applied to squared distances (based on the relation between dot product and distance).
What can be said about the divisive normalization? It seems better at clustering.
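The two normalizations discussed above, side by side on a Gram matrix `M` (a sketch following the slide's empirical-expectation definitions):

```python
import numpy as np

def subtractive(M):
    """Kernel PCA / double-centering: the constant vector becomes an
    eigenvector with eigenvalue 0; other eigenvectors have zero mean."""
    n = M.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ M @ H

def divisive(M):
    """Spectral clustering normalization: M_ij / sqrt(S_i * S_j)."""
    s = M.sum(axis=1)
    return M / np.sqrt(np.outer(s, s))
```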
Multi-layer Learning of Similarity and Density?
The learned eigenfunctions capture salient features of the distribution: abstractions such as clusters and manifolds.
Old AI (and connectionist) idea: build high-level abstractions on top of lower-level abstractions.
local Euclidean similarity + empirical density model -> farther-reaching notion of similarity + improved density model -> ...
Density-Adjusted Similarity and Kernel
[Figure: three points A, B, C.]
Want $A$ and $B$ "closer" than $A$ and $C$.
Define a density-adjusted distance as a geodesic w.r.t. a Riemannian metric, with a metric tensor that penalizes low density.
SEE OTHER POSTER (Vincent & Bengio)
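A hypothetical sketch in this spirit (the actual construction is in the Vincent & Bengio poster, so the inverse-density edge cost below is purely an assumption for illustration): build a neighbor graph, make edges through low-density regions expensive, and take shortest paths as approximate geodesics.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def density_adjusted_distances(X, k=5, bandwidth=1.0):
    n = X.shape[0]
    d = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
    dens = np.exp(-d ** 2 / (2 * bandwidth ** 2)).mean(1)  # Parzen estimate
    G = np.zeros((n, n))               # 0 = no edge (dense csgraph convention)
    idx = np.argsort(d, axis=1)[:, 1:k + 1]
    for i in range(n):
        for j in idx[i]:
            # hypothetical cost: edges through low-density endpoints get longer
            G[i, j] = G[j, i] = d[i, j] / np.sqrt(dens[i] * dens[j])
    return shortest_path(G, method='D', directed=False)
```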
Density-Adjusted Similarity and Kernel
[Figure: four panels comparing embeddings of spiral data: original spirals; Gaussian kernel spectral embedding; two density-adjusted embeddings.]
Conclusions
Many unsupervised learning algorithms (kernel PCA, spectral clustering, Laplacian eigenmaps, MDS, LLE, Isomap) are linked: they compute eigenfunctions of a normalized kernel.
The embedding can be generalized to a mapping applicable to new points.
The eigenfunctions seem to capture salient features of the distribution by minimizing the kernel reconstruction error.
Many questions remain open:
can the eigenfunctions recover an explicit density function?
finding e-fns with a smooth $\hat{p}$?
meaning of the various kernel normalizations?
multi-layer learning?
density-adjusted similarity (see the Vincent & Bengio poster).
Proposition 3
The principal eigenfunction of the linear operator corresponding to kernel $\tilde{K}$ is the (or a, if eigenvalues are repeated) norm-1 function $f$ that minimizes the reconstruction error $E_{x,y}\big[(\tilde{K}(x,y) - \lambda f(x) f(y))^2\big]$.
Proof of Proposition 1
Proposition 1: If we choose for $p$ the empirical distribution of the data, then the spectral embedding from $\tilde{M}$ is equivalent to the values of the eigenfunctions of the normalized kernel $\tilde{K}$: $f_k(x_i) = \sqrt{n}\, v_{ki}$.
(Simplified) proof:
As shown in Proposition 3, finding the function $f$ and scalar $\lambda$ minimizing $E_{x,y}\big[(\tilde{K}(x,y) - \lambda f(x) f(y))^2\big]$ s.t. $\langle f, f \rangle_p = 1$ yields a solution satisfying $(\tilde{K}_p f)(x) = \lambda f(x)$, with $\lambda$ the (possibly repeated) maximum-norm eigenvalue.
Proof of Proposition 1
With empirical $p$, the above criterion becomes (summing over the data points):
$\frac{1}{n^2} \sum_{i,j} \big(\tilde{M}_{ij} - \lambda f(x_i) f(x_j)\big)^2$.
Write $u_i = f(x_i)/\sqrt{n}$ and $\ell = n\lambda$; then $\|u\| = 1$ and the criterion is the rank-1 approximation error of $\tilde{M}$, and we obtain for the principal eigenvector: $\tilde{M} u = \ell u$, i.e. $u = v_1$, so $f(x_i) = \sqrt{n}\, v_{1i}$ and $\lambda = \ell_1 / n$.
For the other eigenvalues, consider the "residual kernel" $\tilde{K}(x,y) - \lambda_1 f_1(x) f_1(y)$ and recursively apply the same reasoning to obtain $f_2, \lambda_2$, $f_3, \lambda_3$, etc.
Q.E.D.
Proof of Proposition 2
Proposition 2: If we choose for $p$ the empirical distribution of the data, then the kernel PCA projection is equivalent to scaled values of the eigenfunctions of $\tilde{K}$: $\pi_k(x) = \sqrt{\lambda_k}\, f_k(x)$.
(Simplified) proof:
Apply the linear operator $\tilde{K}_p$ on both sides of $(\tilde{K}_p f_k)(x) = \lambda_k f_k(x)$:
$\int \tilde{K}(x,y) \left( \int \tilde{K}(y,z) f_k(z) p(z)\,dz \right) p(y)\,dy = \lambda_k \int \tilde{K}(x,y) f_k(y) p(y)\,dy$
or, changing the order of integrals on the left-hand side:
$\int \left( \int \tilde{K}(x,y)\, \tilde{K}(y,z)\, p(y)\,dy \right) f_k(z) p(z)\,dz = \lambda_k \int \tilde{K}(x,y) f_k(y) p(y)\,dy$
Plug in the empirical $p$ and write $\tilde{K}(x,y) = \phi(x) \cdot \phi(y)$:
$\frac{1}{n^2} \sum_{i,j} \phi(x) \cdot \phi(x_i)\, \big(\phi(x_i) \cdot \phi(x_j)\big)\, f_k(x_j) = \frac{\lambda_k}{n} \sum_j \phi(x) \cdot \phi(x_j)\, f_k(x_j)$
Proof of Proposition 2
which contains the elements of the covariance matrix $C = \frac{1}{n} \sum_i \phi(x_i) \phi(x_i)'$, thus yielding
$\phi(x) \cdot C w_k = \lambda_k\, \phi(x) \cdot w_k$, where $w_k = \frac{1}{n} \sum_j f_k(x_j)\, \phi(x_j)$.
So, where $\phi(x)$ takes its values, $C w_k = \lambda_k w_k$: $w_k$ is the $k$-th principal direction in feature space. By Proposition 1, $f_k(x_j) = \sqrt{n}\, v_{kj}$, so $w_k \propto \sum_j v_{kj}\, \phi(x_j)$, where $v_k$ is also the $k$-th e-vector of $\tilde{M}$.
Proof of Proposition 2
PCA projection on $w_k / \|w_k\|$ is $\pi_k(x) = \phi(x) \cdot \frac{w_k}{\|w_k\|}$. Since $\phi(x) \cdot w_k = (\tilde{K}_p f_k)(x) = \lambda_k f_k(x)$ and $\|w_k\|^2 = \langle f_k, \tilde{K}_p f_k \rangle_p = \lambda_k$, we get $\pi_k(x) = \sqrt{\lambda_k}\, f_k(x)$.
Q.E.D.
Proof of Proposition 3
Proposition 3: Given the first $m-1$ eigenfunctions $f_1, \ldots, f_{m-1}$ of a symmetric function $\tilde{K}$, the $m$-th one can be obtained by minimizing, w.r.t. $f$ and $\lambda$, the expected value of $\big(\tilde{K}(x,y) - \sum_{k=1}^{m-1} \lambda_k f_k(x) f_k(y) - \lambda f(x) f(y)\big)^2$ over $p$. Then we get $f = f_m$ and $\lambda = \lambda_m$.
Proof:
Reconstruction error using the approximation $\tilde{K}(x,y) \approx \sum_{k=1}^{m-1} \lambda_k f_k(x) f_k(y) + \lambda f(x) f(y)$:
$J = E_{x,y}\big[(K_{m-1}(x,y) - \lambda f(x) f(y))^2\big]$,
where $K_{m-1}(x,y) = \tilde{K}(x,y) - \sum_{k=1}^{m-1} \lambda_k f_k(x) f_k(y)$ is the residual kernel, with $\langle f, f \rangle_p = 1$, and $(f_k, \lambda_k)$ are the first $m-1$ (eigenfunction, eigenvalue) pairs in order of decreasing absolute value of $\lambda_k$.
Proof of Proposition 3
Minimization of $J$ w.r.t. $\lambda$ gives
$\lambda = E_{x,y}[K_{m-1}(x,y) f(x) f(y)] = \langle f, K_{m-1} f \rangle_p$   (1)
Using eq. 1, $J = E_{x,y}[K_{m-1}(x,y)^2] - \lambda^2$, so $\lambda^2$ should be maximized.
Proof of Proposition 3
Take the derivative of $J$ w.r.t. $f$ and set it equal to zero. Using $\langle f, f \rangle_p = 1$:
$(K_{m-1} f)(x) = \lambda f(x)$   (2)
Using the recursive assumption that $f$ and the $f_k$ are orthogonal for $k < m$: $(\tilde{K} f)(x) = \lambda f(x)$.
Write the application of $\tilde{K}$ to $f$ in terms of the eigenfunctions: $f = \sum_k c_k f_k$, so $\tilde{K} f = \sum_k \lambda_k c_k f_k$.
Proof of Proposition 3
we obtain $\sum_k \lambda_k c_k f_k = \lambda \sum_k c_k f_k$.
Applying Parseval's theorem to obtain the norm on both sides: $\sum_k \lambda_k^2 c_k^2 = \lambda^2 \sum_k c_k^2 = \lambda^2$.
If the $\lambda_k$'s are distinct, $\lambda^2 = \sum_k \lambda_k^2 c_k^2$ is maximized when $c_m = 1$ and $c_k = 0$ for $k \neq m$ (by orthogonality with $f_1, \ldots, f_{m-1}$, the largest remaining $|\lambda_k|$ is $|\lambda_m|$).
Since $\langle f, f \rangle_p = 1$ and we obtained $c_m = 1$ and $c_k = 0$ otherwise, we get $f = f_m$ and $\lambda = \lambda_m$.
Q.E.D.