TP7: KERNEL PRINCIPAL COMPONENT ANALYSIS
COURS D'APPRENTISSAGE, ÉCOLE NORMALE SUPÉRIEURE
Raphaël Berthier (raphael.berthier@inria.fr). This exercise sheet was composed using [1, Exercises 1.6.4, 9.5.6].
1. Plain Principal Component Analysis (PCA)
Let $X_1, \dots, X_n \in \mathbb{R}^p$ be a sequence of samples. The goal of PCA is to compute a low-dimensional representation of the data. For a given dimension $d \le p$, PCA computes the subspace $V_d$ such that
\[
V_d \in \operatorname*{argmin}_{V \subset \mathbb{R}^p :\, \dim(V) \le d} \ \sum_{i=1}^n \| X_i - \mathrm{Proj}_V X_i \|_2^2 , \tag{1}
\]
where $\mathrm{Proj}_V$ denotes the orthogonal projection onto $V$. Let us denote by $X$ the data matrix, whose rows are $X_1, \dots, X_n$, and by $X = \sum_{k=1}^r \sigma_k u_k v_k^T$ its singular value decomposition (SVD), with $\sigma_1 \ge \dots \ge \sigma_r > 0$.
We have the following theorem.

Theorem 1. Let $d \le r$. We have
\[
\min_{Y :\, \mathrm{rank}(Y) \le d} \|Y - X\|_F^2 = \sum_{k=d+1}^r \sigma_k^2 ,
\]
where $\|\cdot\|_F$ denotes the Frobenius norm of a matrix, $\|Z\|_F^2 = \mathrm{tr}(Z Z^T) = \sum_{i,j} Z_{i,j}^2$. Moreover, the minimum is reached for
\[
Y = \sum_{k=1}^d \sigma_k u_k v_k^T .
\]
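Theorem 1 (the Eckart–Young theorem) can be checked numerically. Below is a minimal NumPy sketch; the dimensions and the random data are illustrative choices, not part of the exercise:

```python
# Numerical check of Theorem 1 (Eckart-Young) on random data.
import numpy as np

rng = np.random.default_rng(0)
n, p, d = 8, 5, 2
X = rng.standard_normal((n, p))

# Full SVD: X = sum_k sigma_k u_k v_k^T
U, s, Vt = np.linalg.svd(X, full_matrices=False)
r = int(np.sum(s > 1e-12))

# Best rank-d approximation: keep the d largest singular values
Y = U[:, :d] @ np.diag(s[:d]) @ Vt[:d, :]

# The squared Frobenius error equals the sum of the discarded sigma_k^2
err = np.linalg.norm(X - Y, "fro") ** 2
tail = np.sum(s[d:r] ** 2)
assert np.isclose(err, tail)
```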
1) Check that
\[
\sum_{i=1}^n \|X_i - \mathrm{Proj}_V X_i\|_2^2 = \|X - X \,\mathrm{Proj}_V\|_F^2 .
\]
Conclude that $V_d = \mathrm{Span}\{v_1, \dots, v_d\}$ minimizes (1).
2) Prove that the coordinates of $\mathrm{Proj}_{V_d} X_i$ in the orthonormal basis $(v_1, \dots, v_d)$ of $V_d$ are given by $(\sigma_1 \langle e_i, u_1\rangle, \dots, \sigma_d \langle e_i, u_d\rangle)$.
The right-singular vectors $v_1, \dots, v_d \in \mathbb{R}^p$ are called the principal axes. They represent the directions in which the data varies the most. The vectors $c_k = X v_k = \sigma_k u_k \in \mathbb{R}^n$, $k = 1, \dots, d$, are called the principal components. The principal component $c_k$ gathers the coordinates of $X_1, \dots, X_n$ on $v_k$.
Note that in practice, since $V_d$ is a linear span and not an affine span, it is highly recommended to first center the data points, $\tilde{X}_i = X_i - \frac{1}{n} \sum_{j=1}^n X_j$, before applying PCA.
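The whole procedure of this section (centering, SVD, principal axes and components) can be sketched in a few lines of NumPy; the dimensions and the random data below are illustrative assumptions:

```python
# A minimal plain-PCA sketch: center the data, take the SVD,
# and read off the principal axes and principal components.
import numpy as np

rng = np.random.default_rng(1)
n, p, d = 100, 4, 2
X = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))

# Center each sample: X_i - (1/n) sum_j X_j
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
axes = Vt[:d]              # principal axes v_1, ..., v_d (as rows)
components = Xc @ axes.T   # principal components c_k = X v_k = sigma_k u_k

# Sanity check: c_k = sigma_k u_k
assert np.allclose(components, U[:, :d] * s[:d])
```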
2. Kernel Principal Component Analysis
The Reproducing Kernel Hilbert Space (RKHS) framework allows us to "delinearize" some linear algorithms. We show here how it can be applied to PCA. Let us consider that we have $n$ points $X_1, \dots, X_n \in \mathcal{X}$. Note that in the kernel setting, $\mathcal{X}$ does not need to be a vector space anymore. Let $\varphi : \mathcal{X} \to \mathcal{F}$ be a mapping from $\mathcal{X}$ to some RKHS $\mathcal{F}$ associated to some positive definite kernel $k$ on $\mathcal{X}$. The principle of kernel PCA is to perform a PCA on the points $\varphi(X_1), \dots, \varphi(X_n)$ without computing these points explicitly.
Let $d$ be a fixed positive integer. We seek the subspace $V_d \subset \mathcal{F}$ such that
\[
V_d \in \operatorname*{argmin}_{V \subset \mathcal{F} :\, \dim(V) \le d} \ \sum_{i=1}^n \|\varphi(X_i) - \mathrm{Proj}_V \varphi(X_i)\|_{\mathcal{F}}^2 ,
\]
where $\mathrm{Proj}_V$ denotes the orthogonal projection onto $V$ in the Hilbert space $\mathcal{F}$. In the following, we denote by $\mathcal{L}$ the linear map
\[
\mathcal{L} : \mathbb{R}^n \to \mathcal{F}, \qquad \alpha \mapsto \mathcal{L}\alpha = \sum_{i=1}^n \alpha_i \varphi(X_i) .
\]
3) Prove that $V_d = \mathcal{L}\mathcal{V}_d$, with $\mathcal{V}_d$ fulfilling
\[
\mathcal{V}_d \in \operatorname*{argmin}_{V \subset \mathbb{R}^n :\, \dim(V) \le d} \ \sum_{i=1}^n \|\varphi(X_i) - \mathrm{Proj}_{\mathcal{L}V} \varphi(X_i)\|_{\mathcal{F}}^2 . \tag{2}
\]
4) We denote by $K$ the $n \times n$ matrix with entries $K_{i,j} = k(X_i, X_j)$ for $i, j = 1, \dots, n$. We assume in the following that $K$ is non-singular. Prove that for any $\alpha \in \mathbb{R}^n$, $\|\mathcal{L} K^{-1/2} \alpha\|_{\mathcal{F}}^2 = \|\alpha\|_2^2$.
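This isometry property can be checked numerically in the special case of the linear kernel $k(x, y) = \langle x, y \rangle$, where $\varphi(x) = x$, $\mathcal{L}\alpha = X^T \alpha$ and $K = X X^T$. The sketch below rests on that special-case assumption, with illustrative dimensions:

```python
# Numerical check of question 4 for the linear kernel:
# phi(x) = x, L(alpha) = X^T alpha, K = X X^T.
import numpy as np

rng = np.random.default_rng(2)
n, p = 4, 6
X = rng.standard_normal((n, p))   # n <= p, so K is (generically) non-singular
K = X @ X.T

# K^{-1/2} via the eigendecomposition of the symmetric matrix K
w, Q = np.linalg.eigh(K)
K_inv_sqrt = Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T

alpha = rng.standard_normal(n)
# ||L K^{-1/2} alpha||_F = ||X^T K^{-1/2} alpha||_2 should equal ||alpha||_2
assert np.isclose(np.linalg.norm(X.T @ K_inv_sqrt @ alpha),
                  np.linalg.norm(alpha))
```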
5) Let $V$ be a subspace of $\mathbb{R}^n$ of dimension $d$ and denote by $(b_1, \dots, b_d)$ an orthonormal basis of the linear span $K^{1/2} V$. Prove that $(\mathcal{L} K^{-1/2} b_1, \dots, \mathcal{L} K^{-1/2} b_d)$ is an orthonormal basis of $\mathcal{L} V$.
6) Prove the identities
\[
\mathrm{Proj}_{\mathcal{L}V} \mathcal{L}\alpha
= \sum_{k=1}^d \langle \mathcal{L} K^{-1/2} b_k, \mathcal{L}\alpha \rangle_{\mathcal{F}} \, \mathcal{L} K^{-1/2} b_k
= \mathcal{L} K^{-1/2} \mathrm{Proj}_{K^{1/2} V} K^{1/2} \alpha .
\]
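The second identity can also be checked numerically for the linear kernel $\varphi(x) = x$, so that $\mathcal{L}\alpha = X^T \alpha$ and $K = X X^T$; the subspace $V$ and the dimensions are arbitrary illustrative choices:

```python
# Numerical check of the identity in question 6 for the linear kernel.
import numpy as np

rng = np.random.default_rng(4)
n, p, d = 4, 6, 2
X = rng.standard_normal((n, p))
K = X @ X.T

# K^{1/2} and K^{-1/2} via the eigendecomposition of K
w, Q = np.linalg.eigh(K)
K_sqrt = Q @ np.diag(np.sqrt(w)) @ Q.T
K_inv_sqrt = Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T

V = rng.standard_normal((n, d))   # a basis of V (columns)
alpha = rng.standard_normal(n)

def proj(B):
    """Orthogonal projection matrix onto the column span of B."""
    Qb, _ = np.linalg.qr(B)
    return Qb @ Qb.T

# Left-hand side: project L(alpha) = X^T alpha onto L(V) = span of X^T V
lhs = proj(X.T @ V) @ (X.T @ alpha)
# Right-hand side: L K^{-1/2} Proj_{K^{1/2} V} K^{1/2} alpha
rhs = X.T @ K_inv_sqrt @ proj(K_sqrt @ V) @ K_sqrt @ alpha
assert np.allclose(lhs, rhs)
```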
7) Let us denote by $(e_1, \dots, e_n)$ the canonical basis of $\mathbb{R}^n$. Check that
\begin{align*}
\sum_{i=1}^n \|\varphi(X_i) - \mathrm{Proj}_{\mathcal{L}V} \varphi(X_i)\|_{\mathcal{F}}^2
&= \sum_{i=1}^n \|\mathcal{L} e_i - \mathcal{L} K^{-1/2} \mathrm{Proj}_{K^{1/2}V} K^{1/2} e_i\|_{\mathcal{F}}^2 \\
&= \sum_{i=1}^n \|K^{1/2} e_i - \mathrm{Proj}_{K^{1/2}V} K^{1/2} e_i\|_2^2 \\
&= \|K^{1/2} - \mathrm{Proj}_{K^{1/2}V} K^{1/2}\|_F^2 .
\end{align*}
8) Using Theorem 1, show that $\mathcal{V}_d = \mathrm{Span}\{v_1, \dots, v_d\}$, where $v_1, \dots, v_d$ are eigenvectors of $K$ associated to the $d$ largest eigenvalues of $K$, is a minimizer of (2).
9) We set $f_k = \mathcal{L} K^{-1/2} v_k$ for $k = 1, \dots, d$. Check that $(f_1, \dots, f_d)$ is an orthonormal basis of $V_d$ and
\[
\mathrm{Proj}_{V_d} \varphi(X_i) = \sum_{k=1}^d \langle v_k, K^{1/2} e_i \rangle f_k .
\]
Thus, in the basis $(f_1, \dots, f_d)$, the coordinates of the orthogonal projection of the point $\varphi(X_i)$ onto $V_d$ are $(\langle v_1, K^{1/2} e_i\rangle, \dots, \langle v_d, K^{1/2} e_i\rangle)$.
The take-home message is that kernel PCA can be computed in the original space $\mathcal{X}$ through an eigenvalue decomposition of the Gram matrix $K$.
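The resulting algorithm can be sketched as follows. Since $K^{1/2} v_k = \sqrt{\lambda_k}\, v_k$ for an eigenvector $v_k$ of $K$, the coordinate of $\varphi(X_i)$ on $f_k$ is $\sqrt{\lambda_k}\,(v_k)_i$. The RBF kernel, its bandwidth `gamma` and the function name `kernel_pca` are illustrative choices, not part of the exercise sheet:

```python
# A minimal kernel-PCA sketch: eigendecompose the Gram matrix K and read off
# the coordinates <v_k, K^{1/2} e_i> = sqrt(lambda_k) * (v_k)_i.
import numpy as np

def kernel_pca(X, d, gamma=1.0):
    """Return the n x d matrix of kernel-PCA coordinates for an RBF kernel."""
    # Gram matrix K_ij = exp(-gamma ||X_i - X_j||^2)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq)
    # Eigenvectors of K for the d largest eigenvalues
    w, V = np.linalg.eigh(K)                 # ascending order
    w, V = w[::-1][:d], V[:, ::-1][:, :d]
    # Coordinate of phi(X_i) on f_k is sqrt(lambda_k) * (v_k)_i
    return V * np.sqrt(np.maximum(w, 0.0))

rng = np.random.default_rng(3)
Z = kernel_pca(rng.standard_normal((10, 3)), d=2)
print(Z.shape)   # (10, 2)
```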
References
[1] Christophe Giraud. Introduction to High-Dimensional Statistics. Chapman and Hall/CRC, 2014.