Principal Component Analysis (PCA)
Master MAS, Université de Bordeaux, 18 September 2019
Introduction
The aim is to explore numerical data.
Example: 8 mineral waters described on 13 sensory descriptors (5 of them are shown below).
##             bitter sweet acid salted alcaline
## St Yorre       3.4   3.1  2.9    6.4      4.8
## Badoit         3.8   2.6  2.7    4.7      4.5
## Vichy          2.9   2.9  2.1    6.0      5.0
## Quézac         3.9   2.6  3.8    4.7      4.3
## Arvie          3.1   3.2  3.0    5.2      5.0
## Chateauneuf    3.7   2.8  3.0    5.2      4.6
## Salvetat       4.0   2.8  3.0    4.1      4.5
## Perrier        4.4   2.2  4.0    4.9      3.9
The rows describe observations or individuals (the 8 mineral waters) and the columns describe variables (the sensory descriptors).
The aim is to know:
- which observations are similar,
- which variables are linked.
[Figure: scatterplot matrix of the five descriptors bitter, sweet, acid, salted and alcaline]
One can look at:
- the distance matrix between observations:
##             St Yorre Badoit Vichy Quézac Arvie Chateauneuf Salvetat
## Badoit           4.1
## Vichy            7.9    4.8
## Quézac           2.9    5.3   9.7
## Arvie            3.0    1.8   5.5    4.7
## Chateauneuf      2.9    1.8   5.7    4.3   1.3
## Salvetat         4.0    1.2   5.4    4.9   1.8         1.6
## Perrier          8.2   10.6  14.7    6.2  10.1         9.9     10.3
- the correlation matrix between variables:
##          bitter sweet  acid salted alcaline
## bitter     1.00 -0.83  0.78  -0.67    -0.96
## sweet     -0.83  1.00 -0.61   0.49     0.93
## acid       0.78 -0.61  1.00  -0.44    -0.82
## salted    -0.67  0.49 -0.44   1.00     0.56
## alcaline  -0.96  0.93 -0.82   0.56     1.00
It is also possible to use multivariate descriptive statistics like PCA in order to:
- visualize on graphics the distances between observations or the correlations between variables,
[Figure: "Distances between observations", map of the 8 mineral waters on the axes Dim 1 (77.61%) and Dim 2 (12.48%)]
[Figure: "Correlations between variables", correlation circle of the 5 descriptors on the axes Dim 1 (77.61%) and Dim 2 (12.48%)]
- build new numerical variables that summarize the original variables as well as possible, in order to reduce the dimension.
Table: Original data
            bitter sweet acid salted alcaline
St Yorre       3.4   3.1  2.9    6.4      4.8
Badoit         3.8   2.6  2.7    4.7      4.5
Vichy          2.9   2.9  2.1    6.0      5.0
Quézac         3.9   2.6  3.8    4.7      4.3
Arvie          3.1   3.2  3.0    5.2      5.0
Chateauneuf    3.7   2.8  3.0    5.2      4.6
Salvetat       4.0   2.8  3.0    4.1      4.5
Perrier        4.4   2.2  4.0    4.9      3.9
Table: Two new synthetic variables
              PC1   PC2
St Yorre     1.85  1.19
Badoit      -0.49 -0.64
Vichy        2.77  0.24
Quézac      -1.72  0.11
Arvie        1.93 -0.48
Chateauneuf  0.09  0.00
Salvetat    -0.93 -1.39
Perrier     -3.49  0.97
Outline
Basic concepts
Analysis of the set of observations
Analysis of the set of variables
Interpretation of PCA results
PCA with metrics and GSVD
Basic concepts
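We consider a numerical data table where $n$ observations are described on $p$ variables.
$$
\begin{array}{c|ccccc}
 & 1 & \cdots & j & \cdots & p \\ \hline
1 & & & \vdots & & \\
i & \cdots & & x_{ij} & & \cdots \\
n & & & \vdots & &
\end{array}
$$
Some notations:
- $X = (x_{ij})_{n \times p}$ is the numerical data matrix, where $x_{ij} \in \mathbb{R}$ is the value of the $i$th observation on the $j$th variable.
- $x_i = (x_{i1}, \dots, x_{ip})^T \in \mathbb{R}^p$ is the description of the $i$th observation (row of $X$).
- $x^j = (x_{1j}, \dots, x_{nj})^T \in \mathbb{R}^n$ is the description of the $j$th variable (column of $X$).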
Example: 6 patients described on 3 variables (diastolic pressure, systolic pressure and cholesterol).
load("../data/chol.rda")
print(X)
##          diast syst chol
## Brigitte    90  140  6.0
## Marie       60   85  5.9
## Vincent     75  135  6.1
## Alex        70  145  5.8
## Manue       85  130  5.4
## Fred        70  145  5.0
Exercise: on this example, identify $n$, $p$, $X$, $x_3$ and $x^2$.
⇒ Two sets of points.
The first set is the set of observations.
Example: the 6 patients define a set of $n = 6$ points in $\mathbb{R}^3$.
[Figure: set of observations, 3D scatterplot of the 6 patients on the axes diast, syst and chol]
More generally:
- Each observation $i$ is a point $x_i$ in $\mathbb{R}^p$ (a row of $X$).
- A weight $w_i$ is associated with each observation $i$. Usually:
  - $w_i = \frac{1}{n}$ for randomly drawn observations,
  - $w_i \neq \frac{1}{n}$ for adjusted samples, aggregated data, etc.
A preprocessing step is often applied to the data, which may be:
- centered, so that the columns (variables) have mean zero,
- scaled, so that the columns (variables) have variance 1.
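Original data matrix $X$ (last row: column means)
$$
\begin{array}{c|ccccc}
 & 1 & \cdots & j & \cdots & p \\ \hline
1 & & & \vdots & & \\
i & \cdots & & x_{ij} & & \cdots \\
n & & & \vdots & & \\ \hline
\bar{x} & \cdots & & \bar{x}_j & & \cdots
\end{array}
$$
Centered data matrix $Y$ (last row: column means)
$$
\begin{array}{c|ccccc}
 & 1 & \cdots & j & \cdots & p \\ \hline
1 & & & \vdots & & \\
i & \cdots & & y_{ij} & & \cdots \\
n & & & \vdots & & \\ \hline
\bar{y} & \cdots & & 0 & & \cdots
\end{array}
$$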
Here:
- $\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}$ is the mean of the $j$th variable (column $j$ of $X$),
- $y_{ij} = x_{ij} - \bar{x}_j$ is the general term of the centered data matrix $Y$.
The columns of the centered data matrix $Y$ have zero mean:
$$\bar{y}_j = \frac{1}{n} \sum_{i=1}^{n} y_{ij} = 0.$$
Example: the set of 6 patients.
Original data matrix $X$
##          diast syst chol
## Brigitte    90  140  6.0
## Marie       60   85  5.9
## Vincent     75  135  6.1
## Alex        70  145  5.8
## Manue       85  130  5.4
## Fred        70  145  5.0
Means of the columns of $X$
## diast  syst  chol
##  75.0 130.0   5.7
Centered data matrix $Y$
##          diast syst chol
## Brigitte    15   10  0.3
## Marie      -15  -45  0.2
## Vincent      0    5  0.4
## Alex        -5   15  0.1
## Manue       10    0 -0.3
## Fred        -5   15 -0.7
Means of the columns of $Y$
## diast  syst  chol
##     0     0     0
- Centering the data can be interpreted as a translation of the set of observations in $\mathbb{R}^p$.
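The centering step can be reproduced with a few lines of R. This is only a minimal sketch: it assumes the 6-patient table is re-entered by hand (the file chol.rda is not reproduced here), and the names `col_means` and `max_diff` are illustrative. The last lines check numerically that a translation leaves the distances between observations unchanged.

```r
# Data of the 6 patients, re-entered by hand from the table above
X <- matrix(c(90, 140, 6.0,
              60,  85, 5.9,
              75, 135, 6.1,
              70, 145, 5.8,
              85, 130, 5.4,
              70, 145, 5.0),
            ncol = 3, byrow = TRUE,
            dimnames = list(c("Brigitte", "Marie", "Vincent",
                              "Alex", "Manue", "Fred"),
                            c("diast", "syst", "chol")))

# Centering: subtract from each column its mean
col_means <- colMeans(X)
Y <- sweep(X, 2, col_means)

print(col_means)
print(round(Y, 1))
print(round(colMeans(Y), 10))   # every column of Y has mean zero

# Largest difference between the pairwise distances computed on X and on Y
max_diff <- max(abs(as.matrix(dist(X)) - as.matrix(dist(Y))))
print(max_diff)
```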
[Figure: centered set of the 6 patients, 3D scatterplot on the axes diast, syst and chol]
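Original data matrix $X$ (last rows: column means and standard deviations)
$$
\begin{array}{c|ccccc}
 & 1 & \cdots & j & \cdots & p \\ \hline
1 & & & \vdots & & \\
i & \cdots & & x_{ij} & & \cdots \\
n & & & \vdots & & \\ \hline
\bar{x} & \cdots & & \bar{x}_j & & \cdots \\
s & \cdots & & s_j & & \cdots
\end{array}
$$
Standardized data matrix $Z$ (last rows: column means and standard deviations)
$$
\begin{array}{c|ccccc}
 & 1 & \cdots & j & \cdots & p \\ \hline
1 & & & \vdots & & \\
i & \cdots & & z_{ij} & & \cdots \\
n & & & \vdots & & \\ \hline
\bar{z} & \cdots & & 0 & & \cdots \\
s & \cdots & & 1 & & \cdots
\end{array}
$$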
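Here:
- $s_j^2 = \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2$ is the variance of the $j$th variable (column $j$ of $X$),
- $z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}$ is the general term of the standardized data matrix $Z$.
The columns of the standardized data matrix $Z$ have a mean equal to 0 and a variance equal to 1:
$$\bar{z}_j = \frac{1}{n} \sum_{i=1}^{n} z_{ij} = 0, \qquad \mathrm{var}(z^j) = \frac{1}{n} \sum_{i=1}^{n} (z_{ij} - \bar{z}_j)^2 = 1.$$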
Example: the set of 6 patients.
Original data matrix $X$
##          diast syst chol
## Brigitte    90  140  6.0
## Marie       60   85  5.9
## Vincent     75  135  6.1
## Alex        70  145  5.8
## Manue       85  130  5.4
## Fred        70  145  5.0
Means and sd of the columns of $X$
##      diast  syst  chol
## mean    75 130.0 5.700
## sd      10  20.8 0.383
Standardized data matrix $Z$
##          diast  syst  chol
## Brigitte   1.5  0.48  0.78
## Marie     -1.5 -2.16  0.52
## Vincent    0.0  0.24  1.04
## Alex      -0.5  0.72  0.26
## Manue      1.0  0.00 -0.78
## Fred      -0.5  0.72 -1.83
Means and sd of the columns of $Z$
## diast  syst  chol
##     0     0     0
## diast  syst  chol
##     1     1     1
- Standardization (centering and scaling) can be interpreted as a translation followed by a normalization of the set of observations in $\mathbb{R}^p$.
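The standardized matrix $Z$ of the example can be recomputed as follows. This is a sketch under the same assumption as before (data re-entered by hand); the names `Xc`, `s` and `Z_scale` are illustrative. One point worth noting: the definition above uses the divisor $n$ for the variance, whereas `sd()` and `scale()` in base R use $n - 1$, so a correction factor is needed when relying on `scale()`.

```r
# Data of the 6 patients, re-entered by hand from the table above
X <- matrix(c(90, 140, 6.0,
              60,  85, 5.9,
              75, 135, 6.1,
              70, 145, 5.8,
              85, 130, 5.4,
              70, 145, 5.0),
            ncol = 3, byrow = TRUE,
            dimnames = list(c("Brigitte", "Marie", "Vincent",
                              "Alex", "Manue", "Fred"),
                            c("diast", "syst", "chol")))
n <- nrow(X)

Xc <- sweep(X, 2, colMeans(X))    # centered data
s  <- sqrt(colSums(Xc^2) / n)     # standard deviations with divisor n
Z  <- sweep(Xc, 2, s, "/")        # standardized data

print(round(s, 3))
print(round(Z, 2))

# scale() divides by the sd computed with n - 1, hence the correction factor
Z_scale <- scale(X) * sqrt(n / (n - 1))
print(max(abs(Z - Z_scale)))
```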
[Figure: standardized set of the 6 patients, 3D scatterplot on the axes diast, syst and chol]
In summary, three datasets of the same observations.
[Figure: three 3D scatterplots drawn on common axes (diast, syst, chol), with panels "original data", "centered data" and "standardized data"]
- Centering does not change the distances between the observations: $d^2(x_i, x_{i'}) = d^2(y_i, y_{i'})$.
- Standardization changes the distances between the observations: in general, $d^2(x_i, x_{i'}) \neq d^2(z_i, z_{i'})$.
Proximity between two observations can be measured with the Euclidean distance.
- The Euclidean distance $d$ between two observations $i$ and $i'$ (two rows of $X$) satisfies:
$$d^2(x_i, x_{i'}) = \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2.$$
- When data are standardized, the Euclidean distance between two observations $i$ and $i'$ (two rows of $Z$) satisfies:
$$d^2(z_i, z_{i'}) = \sum_{j=1}^{p} \frac{1}{s_j^2} (x_{ij} - x_{i'j})^2.$$
This means that:
- if the variables (columns of $X$) are measured on different scales, variables with a larger variance weigh more than variables with a smaller variance in the Euclidean distance,
- standardizing the data gives the same importance to all the variables in the Euclidean distance.
Example: distance between Brigitte and Marie
Original data ($X$):
##          diast syst chol
## Brigitte    90  140  6.0
## Marie       60   85  5.9
## Vincent     75  135  6.1
## Alex        70  145  5.8
## Manue       85  130  5.4
## Fred        70  145  5.0
Mean and sd of the columns:
##      diast  syst  chol
## mean    75 130.0 5.700
## sd      10  20.8 0.383
Standardized data ($Z$):
##          diast  syst  chol
## Brigitte   1.5  0.48  0.78
## Marie     -1.5 -2.16  0.52
## Vincent    0.0  0.24  1.04
## Alex      -0.5  0.72  0.26
## Manue      1.0  0.00 -0.78
## Fred      -0.5  0.72 -1.83
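Euclidean distance between the first two rows of $X$:
$$d(x_1, x_2) = \sqrt{(90-60)^2 + (140-85)^2 + (6-5.9)^2} = \sqrt{30^2 + 55^2 + 0.1^2} \approx 62.65$$
Euclidean distance between the first two rows of $Z$:
$$d(z_1, z_2) = \sqrt{\frac{1}{10^2}(90-60)^2 + \frac{1}{20.8^2}(140-85)^2 + \frac{1}{0.383^2}(6-5.9)^2}
= \sqrt{(1.5+1.5)^2 + (0.48+2.16)^2 + (0.78-0.52)^2}
= \sqrt{3^2 + 2.64^2 + 0.26^2} \approx 4.00$$
On the original scale the distance is dominated by the systolic pressure; after standardization the three variables contribute on comparable scales.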
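The two distances between Brigitte and Marie can be checked numerically. The sketch below assumes the data are re-entered by hand and uses the divisor $n$ for the standard deviations, as in the definitions above; `d_X`, `d_Z` and `d_Z_weighted` are illustrative names. Because it works with unrounded standardized values, the second distance may differ slightly in the last digits from the hand computation.

```r
# Data of the 6 patients, re-entered by hand from the table above
X <- matrix(c(90, 140, 6.0,
              60,  85, 5.9,
              75, 135, 6.1,
              70, 145, 5.8,
              85, 130, 5.4,
              70, 145, 5.0),
            ncol = 3, byrow = TRUE,
            dimnames = list(c("Brigitte", "Marie", "Vincent",
                              "Alex", "Manue", "Fred"),
                            c("diast", "syst", "chol")))
n  <- nrow(X)
Xc <- sweep(X, 2, colMeans(X))
s  <- sqrt(colSums(Xc^2) / n)     # sd with divisor n
Z  <- sweep(Xc, 2, s, "/")

# Euclidean distance on the original data and on the standardized data
d_X <- sqrt(sum((X["Brigitte", ] - X["Marie", ])^2))
d_Z <- sqrt(sum((Z["Brigitte", ] - Z["Marie", ])^2))

# Same standardized distance, written with the weights 1 / s_j^2
d_Z_weighted <- sqrt(sum((X["Brigitte", ] - X["Marie", ])^2 / s^2))

print(c(d_X = d_X, d_Z = d_Z, d_Z_weighted = d_Z_weighted))
```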
The dispersion of the set of observations in $\mathbb{R}^p$ is measured by the inertia.
- The inertia of the $n$ observations (the $n$ rows of $X$) is defined by:
$$I(X) = \frac{1}{n} \sum_{i=1}^{n} d^2(x_i, \bar{x}),$$
where $\bar{x} = (\bar{x}_1, \dots, \bar{x}_p)^T$ is the vector of the column means.
- Inertia is a generalization of the variance to the case of multivariate data ($p$ variables).
- One can show that:
$$I(X) = \sum_{j=1}^{p} \mathrm{var}(x^j).$$
This means that:
- when the variables are centered, $I(Y) = \sum_{j=1}^{p} s_j^2$,
- when the variables are standardized, $I(Z) = p$.
Example: inertia of the set of 6 patients
Centered data ($Y$):
##          diast syst chol
## Brigitte    15   10  0.3
## Marie      -15  -45  0.2
## Vincent      0    5  0.4
## Alex        -5   15  0.1
## Manue       10    0 -0.3
## Fred        -5   15 -0.7
Variance of the columns:
##  diast   syst   chol
## 100.00 433.33   0.15
Standardized data ($Z$):
##          diast  syst  chol
## Brigitte   1.5  0.48  0.78
## Marie     -1.5 -2.16  0.52
## Vincent    0.0  0.24  1.04
## Alex      -0.5  0.72  0.26
## Manue      1.0  0.00 -0.78
## Fred      -0.5  0.72 -1.83
Variance of the columns:
## diast  syst  chol
##     1     1     1
- Inertia of the centered dataset:
$$I(Y) = 100 + 433.33 + 0.15 = 533.48$$
- Inertia of the standardized dataset:
$$I(Z) = 1 + 1 + 1 = 3$$
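Both expressions of the inertia (mean squared distance to the centroid, and sum of the variances) can be compared numerically. This is a sketch only, with the data re-entered by hand; the helper `inertia()` is a hypothetical name introduced for the illustration, not a function of any package.

```r
# Data of the 6 patients, re-entered by hand from the table above
X <- matrix(c(90, 140, 6.0,
              60,  85, 5.9,
              75, 135, 6.1,
              70, 145, 5.8,
              85, 130, 5.4,
              70, 145, 5.0),
            ncol = 3, byrow = TRUE,
            dimnames = list(c("Brigitte", "Marie", "Vincent",
                              "Alex", "Manue", "Fred"),
                            c("diast", "syst", "chol")))
n  <- nrow(X)
Xc <- sweep(X, 2, colMeans(X))
Z  <- sweep(Xc, 2, sqrt(colSums(Xc^2) / n), "/")

# Hypothetical helper: mean squared distance of the rows to their centroid
inertia <- function(M) {
  g <- colMeans(M)
  mean(rowSums(sweep(M, 2, g)^2))
}

var_n <- colMeans(Xc^2)     # variances with divisor n
I_X <- inertia(X)           # same value as inertia of the centered data
I_Z <- inertia(Z)

print(round(var_n, 2))
print(c(I_X = I_X, sum_of_variances = sum(var_n), I_Z = I_Z))
```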
The second set of points associated with a numerical data matrix is the set of variables.
Example: the variables diastolic pressure, systolic pressure and cholesterol define a set of $p = 3$ points in $\mathbb{R}^6$.
##       Brigitte Marie Vincent  Alex Manue Fred
## diast       90  60.0    75.0  70.0  85.0   70
## syst       140  85.0   135.0 145.0 130.0  145
## chol         6   5.9     6.1   5.8   5.4    5
This set cannot be visualized directly, since the points lie in $\mathbb{R}^6$.
- Each variable $j$ is a point $x^j$ in $\mathbb{R}^n$ (a column of $X$).
- A weight $m_j$ is associated with each variable $j$. Usually:
  - $m_j = 1$ in PCA,
  - $m_j \neq 1$ in MCA (Multiple Correspondence Analysis).
When data are centered:
- each variable $j$ is a point denoted $y^j$ in $\mathbb{R}^n$ (a column of $Y$),
- we talk about the set of centered variables.
When data are standardized:
- each variable $j$ is a point denoted $z^j$ in $\mathbb{R}^n$ (a column of $Z$),
- we talk about the set of standardized variables.
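The link between two variables is measured by the covariance or the correlation.
To define covariance and correlation, a metric is associated with $\mathbb{R}^n$: $N = \mathrm{diag}\left(\frac{1}{n}, \dots, \frac{1}{n}\right)$.
- The scalar product between $x$ and $y$ in $\mathbb{R}^n$ is defined by:
$$\langle x, y \rangle_N = x^T N y = \frac{1}{n} x^T y = \frac{1}{n} \sum_{i=1}^{n} x_i y_i.$$
- The norm of $x$ in $\mathbb{R}^n$ is then:
$$\|x\|_N = \sqrt{\langle x, x \rangle_N} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^2}.$$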
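With this metric, the variance can be written as a squared norm:
- $\mathrm{var}(x^j) = \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 = \|y^j\|_N^2$,
- $\mathrm{var}(z^j) = \frac{1}{n} \sum_{i=1}^{n} (z_{ij} - \bar{z}_j)^2 = \|z^j\|_N^2$.
The $p$ standardized variables therefore lie on the unit sphere of $\mathbb{R}^n$, since $\|z^j\|_N = 1$.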
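Moreover, the covariance and the correlation can be written as scalar products:
- $c_{jj'} = \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ij'} - \bar{x}_{j'}) = \langle y^j, y^{j'} \rangle_N$,
- $r_{jj'} = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{x_{ij} - \bar{x}_j}{s_j}\right) \left(\frac{x_{ij'} - \bar{x}_{j'}}{s_{j'}}\right) = \langle z^j, z^{j'} \rangle_N$.
This leads to a simple expression of the covariance matrix, denoted $C$, and of the correlation matrix, denoted $R$:
- $C = Y^T N Y$,
- $R = Z^T N Z$.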
Example:
Covariance matrix:
##        diast  syst  chol
## diast 100.00 112.5  0.25
## syst  112.50 433.3 -2.17
## chol    0.25  -2.2  0.15
Correlation matrix:
##       diast  syst   chol
## diast 1.000  0.54  0.065
## syst  0.540  1.00 -0.272
## chol  0.065 -0.27  1.000
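The matrix expressions $C = Y^T N Y$ and $R = Z^T N Z$ translate directly into R. The sketch below assumes the data are re-entered by hand; it also compares $R$ with the output of the base function `cor()`, which gives the same correlations because the divisor ($n$ or $n - 1$) cancels in a correlation.

```r
# Data of the 6 patients, re-entered by hand from the table above
X <- matrix(c(90, 140, 6.0,
              60,  85, 5.9,
              75, 135, 6.1,
              70, 145, 5.8,
              85, 130, 5.4,
              70, 145, 5.0),
            ncol = 3, byrow = TRUE,
            dimnames = list(c("Brigitte", "Marie", "Vincent",
                              "Alex", "Manue", "Fred"),
                            c("diast", "syst", "chol")))
n <- nrow(X)
Y <- sweep(X, 2, colMeans(X))                 # centered data
Z <- sweep(Y, 2, sqrt(colSums(Y^2) / n), "/") # standardized data (divisor n)

N <- diag(1 / n, n)        # metric on R^n: weights 1/n on the diagonal
C <- t(Y) %*% N %*% Y      # covariance matrix
R <- t(Z) %*% N %*% Z      # correlation matrix

print(round(C, 2))
print(round(R, 3))
print(max(abs(R - cor(X))))   # difference with the built-in correlation
```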
With this metric, the correlation can be written as a cosine:
- $r_{jj'} = \frac{\langle y^j, y^{j'} \rangle_N}{\|y^j\|_N \, \|y^{j'}\|_N} = \cos \theta_N(y^j, y^{j'})$,
- $r_{jj'} = \langle z^j, z^{j'} \rangle_N = \cos \theta_N(z^j, z^{j'})$.
This leads to a geometrical interpretation of the correlation between variables:
- an angle of 90 degrees between two standardized variables corresponds to a zero correlation (cosine equal to 0), i.e. to the absence of a linear link,
- an angle of 0 degrees corresponds to a correlation of 1 (cosine equal to 1), i.e. to a positive linear link,
- an angle of 180 degrees corresponds to a correlation of -1 (cosine equal to -1), i.e. to a negative linear link.
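As an illustration of the cosine interpretation, the angle between the centered variables diast and syst can be computed with the metric $N = \frac{1}{n} I_n$. This is a sketch with the data re-entered by hand; `inner_N` and `norm_N` are hypothetical helper names for the scalar product and the norm defined above.

```r
# Columns diast and syst of the 6 patients, re-entered by hand
diast <- c(90, 60, 75, 70, 85, 70)
syst  <- c(140, 85, 135, 145, 130, 145)
n <- length(diast)

# Centered variables (points of R^n)
y_diast <- diast - mean(diast)
y_syst  <- syst - mean(syst)

# Hypothetical helpers: scalar product and norm for the metric N = (1/n) I_n
inner_N <- function(x, y) sum(x * y) / n
norm_N  <- function(x) sqrt(inner_N(x, x))

# Cosine of the angle between the two centered variables, and the angle itself
cos_theta <- inner_N(y_diast, y_syst) / (norm_N(y_diast) * norm_N(y_syst))
angle_deg <- acos(cos_theta) * 180 / pi

print(c(cos_theta = cos_theta, cor = cor(diast, syst), angle_deg = angle_deg))
```

The cosine coincides with the correlation 0.54 of the example, which corresponds to an angle between 57 and 58 degrees.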
PCA analyses:
- either the centered data matrix $Y$,
- or the standardized data matrix $Z$.
This leads to two different PCA methods:
- non-normalized PCA (or PCA on the covariance matrix), which analyses $Y$,
- normalized PCA (or PCA on the correlation matrix), which analyses $Z$.
From now on, normalized PCA is considered.
Analysis of the set of observations
Find the subspace which gives the best representation of the observations.
- Best approximation of the data by projection.
- Best representation of the variability of the observations.
Example: the set of the 6 patients described on the 3 standardized variables.
[Figure: standardized set of the 6 patients, 3D scatterplot on the axes diast, syst and chol]
[Figure: "Projection of the 6 patients", map on the axes Dim 1 (52.69%) and Dim 2 (35.07%)]
The aim is to find the projection plane which preserves as well as possible the distances between the patients, i.e. their variability, and hence their inertia.
Projection of an observation (a point in $\mathbb{R}^p$) on an axis.
The coordinate of the orthogonal projection of a point $z_i \in \mathbb{R}^p$ on an axis $\Delta_\alpha$ with orientation vector $v_\alpha$ ($v_\alpha^T v_\alpha = 1$) is:
$$f_{i\alpha} = \langle z_i, v_\alpha \rangle = z_i^T v_\alpha.$$
The vector of coordinates of the projections of the $n$ observations is:
$$f_\alpha = (f_{1\alpha}, \dots, f_{n\alpha})^T = Z v_\alpha = \sum_{j=1}^{p} v_{j\alpha} z^j.$$
- $f_\alpha$ is a linear combination of the columns of $Z$.
- $f_\alpha$ is centered if the columns of $Z$ are centered.
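Example: the 6 patients are the rows of the following standardized data matrix
$$Z = \begin{pmatrix}
1.50 & 0.48 & 0.78 \\
-1.50 & -2.16 & 0.52 \\
0.00 & 0.24 & 1.04 \\
-0.50 & 0.72 & 0.26 \\
1.00 & 0.00 & -0.78 \\
-0.50 & 0.72 & -1.83
\end{pmatrix}$$
Let us project the 6 "standardized" patients on two orthogonal axes $\Delta_1$ and $\Delta_2$ with orientation vectors:
$$v_1 = \begin{pmatrix} 0.641 \\ 0.72 \\ -0.265 \end{pmatrix}, \qquad v_2 = \begin{pmatrix} 0.4433 \\ -0.0652 \\ 0.894 \end{pmatrix}.$$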
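The vectors $f_1$ and $f_2$ of the coordinates of the projections of the 6 patients on $\Delta_1$ and $\Delta_2$ are:
$$f_1 = Z v_1 = 0.641 \begin{pmatrix} 1.5 \\ \vdots \\ -0.5 \end{pmatrix} + 0.72 \begin{pmatrix} 0.48 \\ \vdots \\ 0.72 \end{pmatrix} - 0.265 \begin{pmatrix} 0.78 \\ \vdots \\ -1.83 \end{pmatrix} = \begin{pmatrix} 1.10 \\ \vdots \\ 0.683 \end{pmatrix}$$
$$f_2 = Z v_2 = 0.4433 \begin{pmatrix} 1.5 \\ \vdots \\ -0.5 \end{pmatrix} - 0.0652 \begin{pmatrix} 0.48 \\ \vdots \\ 0.72 \end{pmatrix} + 0.894 \begin{pmatrix} 0.78 \\ \vdots \\ -1.83 \end{pmatrix} = \begin{pmatrix} 1.333 \\ \vdots \\ -1.9 \end{pmatrix}$$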
$f_1$ and $f_2$ are two new synthetic and centered variables.
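The projection $f_\alpha = Z v_\alpha$ is a plain matrix-vector product. The sketch below recomputes $f_1$ and $f_2$ from the orientation vectors given in the example (copied with their rounding); it assumes the data are re-entered by hand and that $Z$ is computed with the divisor $n$, so the results agree with the slide values only up to rounding.

```r
# Data of the 6 patients, re-entered by hand from the table above
X <- matrix(c(90, 140, 6.0,
              60,  85, 5.9,
              75, 135, 6.1,
              70, 145, 5.8,
              85, 130, 5.4,
              70, 145, 5.0),
            ncol = 3, byrow = TRUE,
            dimnames = list(c("Brigitte", "Marie", "Vincent",
                              "Alex", "Manue", "Fred"),
                            c("diast", "syst", "chol")))
n  <- nrow(X)
Xc <- sweep(X, 2, colMeans(X))
Z  <- sweep(Xc, 2, sqrt(colSums(Xc^2) / n), "/")

# Orientation vectors of the example (rounded values)
v1 <- c(0.641, 0.720, -0.265)
v2 <- c(0.4433, -0.0652, 0.894)

# Coordinates of the 6 patients on the two axes
f1 <- drop(Z %*% v1)
f2 <- drop(Z %*% v2)

print(round(cbind(f1, f2), 2))
print(c(mean_f1 = mean(f1), mean_f2 = mean(f2)))   # both are centered
print(sum(v1 * v2))                                # close to 0: orthogonal axes
```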
[Figure: left, the standardized set of the 6 patients on the axes diast, syst and chol; right, the 6 patients plotted on the new axes f1 and f2]
In PCA, the orientation vectors $v_1$ and $v_2$ are chosen to maximize the inertia of the set of projected observations, and thus to preserve as well as possible the distances between the observations.
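Axes of projection of the observations in PCA.
$\Delta_1$ is the axis with orientation vector $v_1 \in \mathbb{R}^p$ which maximizes the variance of the $n$ projected observations:
$$v_1 = \arg\max_{\|v\|=1} \mathrm{var}(Zv) = \arg\max_{\|v\|=1} v^T R v,$$
where $R = \frac{1}{n} Z^T Z$ is the $p \times p$ correlation matrix. Indeed, since $Zv$ is centered, $\mathrm{var}(Zv) = \frac{1}{n} (Zv)^T (Zv) = v^T R v$.
One can show that:
- $v_1$ is the eigenvector associated with the largest eigenvalue $\lambda_1$ of $R$,
- the first principal component (PC) $f_1 = Z v_1$ is centered: $\bar{f}_1 = 0$,
- $\lambda_1$ is the variance of the first PC: $\mathrm{var}(f_1) = \lambda_1$.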
$\Delta_2$ is the axis with orientation vector $v_2 \perp v_1$ which maximizes the variance of the $n$ projected observations:
$$v_2 = \arg\max_{\|v\|=1,\; v \perp v_1} \mathrm{var}(Zv).$$
One can show that:
- $v_2$ is the eigenvector associated with the second largest eigenvalue $\lambda_2$ of $R$,
- the second principal component (PC) $f_2 = Z v_2$ is centered: $\bar{f}_2 = 0$,
- $\lambda_2$ is the variance of the second PC: $\mathrm{var}(f_2) = \lambda_2$,
- the principal components $f_1$ and $f_2$ are not correlated.
In the same way, we can get $q \le r$ (where $r$ is the rank of $Z$) orthogonal axes $\Delta_1, \dots, \Delta_q$ on which the observations are projected.
In summary:
1. The eigendecomposition of the correlation matrix $R$ is performed and $q \le r$ is chosen.
2. The $n \times q$ matrix $F = ZV$ of the $q$ principal components is obtained with the matrix $V$ of the first $q$ eigenvectors of $R$.
- The principal components $f_\alpha = Z v_\alpha$ (columns of $F$) are centered and of variance $\lambda_\alpha$.
- The elements $f_{i\alpha}$ are called the factor coordinates of the observations, or the scores of the observations on the principal components.
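$$
F = \begin{array}{c|ccccc}
 & 1 & \cdots & \alpha & \cdots & q \\ \hline
1 & & & \vdots & & \\
i & \cdots & & f_{i\alpha} & & \cdots \\
n & & & \vdots & & \\ \hline
\text{mean} & \cdots & & 0 & & \cdots \\
\text{var} & \cdots & & \lambda_\alpha & & \cdots
\end{array}
$$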
Example of the 6 patients: matrix $F$ of the first $q = 2$ PCs
##             f1     f2
## Brigitte  1.10  1.334
## Marie    -2.66 -0.057
## Vincent  -0.10  0.918
## Alex      0.13 -0.035
## Manue     0.85 -0.257
## Fred      0.68 -1.903
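The two summary steps (eigendecomposition of $R$, then $F = ZV$) can be sketched with the base function `eigen()`. The data are assumed to be re-entered by hand, and the object names (`eig`, `lambda`, `PC`) are illustrative. Eigenvectors are only defined up to their sign, so some columns of the result may come out with the opposite sign compared with the table above.

```r
# Data of the 6 patients, re-entered by hand from the table above
X <- matrix(c(90, 140, 6.0,
              60,  85, 5.9,
              75, 135, 6.1,
              70, 145, 5.8,
              85, 130, 5.4,
              70, 145, 5.0),
            ncol = 3, byrow = TRUE,
            dimnames = list(c("Brigitte", "Marie", "Vincent",
                              "Alex", "Manue", "Fred"),
                            c("diast", "syst", "chol")))
n  <- nrow(X)
Xc <- sweep(X, 2, colMeans(X))
Z  <- sweep(Xc, 2, sqrt(colSums(Xc^2) / n), "/")

# Step 1: eigendecomposition of the correlation matrix R = Z'Z / n
R      <- crossprod(Z) / n
eig    <- eigen(R, symmetric = TRUE)
lambda <- eig$values

# Step 2: principal components for q = 2
V  <- eig$vectors[, 1:2]
PC <- Z %*% V
colnames(PC) <- c("f1", "f2")

print(round(100 * lambda / sum(lambda), 2))   # percentage of inertia per axis
print(round(PC, 3))
print(round(colMeans(PC^2), 4))               # variances of the PCs = eigenvalues
```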
[Figure: "Projection of the 6 patients by PCA", map on the axes Dim 1 (52.69%) and Dim 2 (35.07%)]
Analysis of the set of variables
Find the subspace which gives the best representation of the variables.
Example: the set of 3 standardized variables.
3 variables on the unit sphere of $\mathbb{R}^6$.
##       Brigitte Marie Vincent Alex Manue Fred
## diast      1.5  -1.5     0.0 -0.5   1.0 -0.5
## syst       0.5  -2.2     0.2  0.7   0.0  0.7
## chol       0.8   0.5     1.0  0.3  -0.8 -1.8
[Figure: "Projection of the 3 standardized variables" (diast, syst, chol) on the axes Dim 1 (52.69%) and Dim 2 (35.07%)]
The aim is to find the projection plane which best represents the variables, and hence preserves as well as possible the angles between the variables, i.e. their correlations.
Projection of a variable (a point in $\mathbb{R}^n$) on an axis.
The coordinate of the $N$-orthogonal projection of a point $z^j \in \mathbb{R}^n$ on an axis $G_\alpha$ with orientation vector $u_\alpha$ ($u_\alpha^T N u_\alpha = 1$) is:
$$a_{j\alpha} = \langle z^j, u_\alpha \rangle_N = (z^j)^T N u_\alpha,$$
and the vector of coordinates of the projections of the $p$ variables is:
$$a_\alpha = (a_{1\alpha}, \dots, a_{p\alpha})^T = Z^T N u_\alpha.$$
Warning: a metric $N$ in $\mathbb{R}^n$ is used.
- A metric in $\mathbb{R}^n$ is an $n \times n$ symmetric positive definite matrix.
- Here in PCA, $N$ is the diagonal matrix of the weights of the observations: $N = \mathrm{diag}(w_1, \dots, w_n)$.
- When all observations are weighted by $\frac{1}{n}$ (the usual default): $N = \frac{1}{n} I_n$.
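Example: the three variables (diast, syst, chol) are the columns of the following standardized data matrix
$$Z = \begin{pmatrix}
1.50 & 0.48 & 0.78 \\
-1.50 & -2.16 & 0.52 \\
0.00 & 0.24 & 1.04 \\
-0.50 & 0.72 & 0.26 \\
1.00 & 0.00 & -0.78 \\
-0.50 & 0.72 & -1.83
\end{pmatrix}$$
Let us project the 3 standardized variables on two $N$-orthogonal axes $G_1$ and $G_2$ with orientation vectors (here $N = \frac{1}{6} I_6$):
$$u_1 = \begin{pmatrix} 0.87 \\ -2.11 \\ -0.08 \\ 0.10 \\ 0.67 \\ 0.54 \end{pmatrix}, \qquad u_2 = \begin{pmatrix} 1.30 \\ -0.06 \\ 0.90 \\ -0.03 \\ -0.25 \\ -1.80 \end{pmatrix}.$$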
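The vectors $a_1$ and $a_2$ of the coordinates of the projections of the 3 variables on $G_1$ and $G_2$ are:
$$a_1 = Z^T N u_1 = \frac{0.87}{6} \begin{pmatrix} 1.5 \\ 0.48 \\ 0.78 \end{pmatrix} - \frac{2.11}{6} \begin{pmatrix} -1.5 \\ -2.16 \\ 0.52 \end{pmatrix} + \dots + \frac{0.54}{6} \begin{pmatrix} -0.5 \\ 0.72 \\ -1.83 \end{pmatrix} = \begin{pmatrix} 0.81 \\ 0.91 \\ -0.33 \end{pmatrix}$$
$$a_2 = Z^T N u_2 = \frac{1.30}{6} \begin{pmatrix} 1.5 \\ 0.48 \\ 0.78 \end{pmatrix} - \frac{0.06}{6} \begin{pmatrix} -1.5 \\ -2.16 \\ 0.52 \end{pmatrix} + \dots - \frac{1.80}{6} \begin{pmatrix} -0.5 \\ 0.72 \\ -1.83 \end{pmatrix} = \begin{pmatrix} 0.45 \\ -0.07 \\ 0.92 \end{pmatrix}$$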
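The coordinates $a_\alpha = Z^T N u_\alpha$ can be recomputed from the orientation vectors of the example. The sketch below copies $u_1$ and $u_2$ with their rounding, re-enters the data by hand and builds $Z$ with the divisor $n$; because of the rounding of $u_1$ and $u_2$, the results match the values above only to about two decimals.

```r
# Data of the 6 patients, re-entered by hand from the table above
X <- matrix(c(90, 140, 6.0,
              60,  85, 5.9,
              75, 135, 6.1,
              70, 145, 5.8,
              85, 130, 5.4,
              70, 145, 5.0),
            ncol = 3, byrow = TRUE,
            dimnames = list(c("Brigitte", "Marie", "Vincent",
                              "Alex", "Manue", "Fred"),
                            c("diast", "syst", "chol")))
n  <- nrow(X)
Xc <- sweep(X, 2, colMeans(X))
Z  <- sweep(Xc, 2, sqrt(colSums(Xc^2) / n), "/")

N <- diag(1 / n, n)     # metric on R^n (weights 1/n)

# Orientation vectors of the example (rounded values)
u1 <- c(0.87, -2.11, -0.08, 0.10, 0.67, 0.54)
u2 <- c(1.30, -0.06, 0.90, -0.03, -0.25, -1.80)

# Coordinates of the 3 variables on the two axes
a1 <- drop(t(Z) %*% N %*% u1)
a2 <- drop(t(Z) %*% N %*% u2)
print(round(cbind(a1, a2), 2))

# N-norm of u1: close to 1, up to the rounding of its entries
norm_u1 <- sqrt(drop(t(u1) %*% N %*% u1))
print(norm_u1)
```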