Principal Component Analysis (PCA)

(1)

Master MAS, Université de Bordeaux, 18 September 2019

(2)

Introduction

The aim is to explore numerical data.

Example: 8 mineral waters described on 13 sensory descriptors.

##             bitter sweet acid salted alcaline
## St Yorre       3.4   3.1  2.9    6.4      4.8
## Badoit         3.8   2.6  2.7    4.7      4.5
## Vichy          2.9   2.9  2.1    6.0      5.0
## Quézac         3.9   2.6  3.8    4.7      4.3
## Arvie          3.1   3.2  3.0    5.2      5.0
## Chateauneuf    3.7   2.8  3.0    5.2      4.6
## Salvetat       4.0   2.8  3.0    4.1      4.5
## Perrier        4.4   2.2  4.0    4.9      3.9

The rows describe observations or individuals (the 8 mineral waters) and the columns describe variables (the sensory descriptors).

The aim is to know:

- which observations are similar,
- which variables are linked.

(3)

[Figure: scatterplot matrix of the five descriptors bitter, sweet, acid, salted and alcaline.]

(4)

One can look at:

- the distance matrix between observations:

##             St Yorre Badoit Vichy Quézac Arvie Chateauneuf Salvetat
## Badoit           4.1
## Vichy            7.9    4.8
## Quézac           2.9    5.3   9.7
## Arvie            3.0    1.8   5.5    4.7
## Chateauneuf      2.9    1.8   5.7    4.3   1.3
## Salvetat         4.0    1.2   5.4    4.9   1.8         1.6
## Perrier          8.2   10.6  14.7    6.2  10.1         9.9     10.3

- the correlation matrix between variables:

##          bitter sweet  acid salted alcaline
## bitter     1.00 -0.83  0.78  -0.67    -0.96
## sweet     -0.83  1.00 -0.61   0.49     0.93
## acid       0.78 -0.61  1.00  -0.44    -0.82
## salted    -0.67  0.49 -0.44   1.00     0.56
## alcaline  -0.96  0.93 -0.82   0.56     1.00

(5)

It is also possible to use multivariate descriptive statistics like PCA in order to:

- visualize on graphics the distances between observations or the correlations between variables.

[Figure, first panel: "Distances between observations", the 8 mineral waters plotted on Dim 1 (77.61%) and Dim 2 (12.48%). Second panel: "Correlations between variables", the 5 descriptors plotted on Dim 1 (77.61%) and Dim 2 (12.48%).]

(6)

- build new numerical variables "summarizing" the original variables as well as possible in order to reduce dimension.

Table: Original data

            bitter sweet acid salted alcaline
St Yorre       3.4   3.1  2.9    6.4      4.8
Badoit         3.8   2.6  2.7    4.7      4.5
Vichy          2.9   2.9  2.1    6.0      5.0
Quézac         3.9   2.6  3.8    4.7      4.3
Arvie          3.1   3.2  3.0    5.2      5.0
Chateauneuf    3.7   2.8  3.0    5.2      4.6
Salvetat       4.0   2.8  3.0    4.1      4.5
Perrier        4.4   2.2  4.0    4.9      3.9

Table: Two new synthetic variables

              PC1   PC2
St Yorre     1.85  1.19
Badoit      -0.49 -0.64
Vichy        2.77  0.24
Quézac      -1.72  0.11
Arvie        1.93 -0.48
Chateauneuf  0.09  0.00
Salvetat    -0.93 -1.39
Perrier     -3.49  0.97

(7)

Outline

Basic concepts

Analysis of the set of observations

Analysis of the set of variables

Interpretation of PCA results

PCA with metrics and GSVD

(8)

Basic concepts

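We consider a numerical data table where $n$ observations are described on $p$ variables.

$$
\begin{array}{c|ccccc}
 & 1 & \cdots & j & \cdots & p \\ \hline
1 & & & \vdots & & \\
\vdots & & & \vdots & & \\
i & \cdots & \cdots & x_{ij} & \cdots & \cdots \\
\vdots & & & \vdots & & \\
n & & & \vdots & &
\end{array}
$$

Some notations:

$X = (x_{ij})_{n \times p}$ is the numerical data matrix where $x_{ij} \in \mathbb{R}$ is the value of the $i$-th observation on the $j$-th variable.

$x_i = (x_{i1}, \ldots, x_{ip})^T \in \mathbb{R}^p$ is the description of the $i$-th observation (row of $X$).

$x^j = (x_{1j}, \ldots, x_{nj})^T \in \mathbb{R}^n$ is the description of the $j$-th variable (column of $X$).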

(9)

Example: 6 patients described on 3 variables (diastolic pressure, systolic pressure and cholesterol).

load("../data/chol.rda")
print(X)

##          diast syst chol
## Brigitte    90  140  6.0
## Marie       60   85  5.9
## Vincent     75  135  6.1
## Alex        70  145  5.8
## Manue       85  130  5.4
## Fred        70  145  5.0

$n = \;?$, $p = \;?$, $X = \;?$, $x_3 = \;?$, $x^2 = \;?$

⇒ Two sets of points.

(10)

The first set is the set of observations.

Example: the 6 patients define a set of $n = 6$ points in $\mathbb{R}^3$.

[Figure: "Set of observations", 3D scatterplot of the 6 patients (Brigitte, Marie, Vincent, Alex, Manue, Fred) with axes diast, syst and chol.]

(11)

- Each observation $i$ is a point $x_i$ in $\mathbb{R}^p$ (a row of $X$).
- A weight $w_i$ is associated with each observation $i$. Usually:
    - $w_i = \frac{1}{n}$ for randomly drawn observations,
    - $w_i \neq \frac{1}{n}$ for adjusted samples, aggregated data...

A preprocessing step is often applied to the data, which may be:

- centered, to have columns (variables) with mean zero,
- scaled, to have columns (variables) with variance 1.

(12)

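Original data matrix $X$

$$
\begin{array}{c|ccccc}
 & 1 & \cdots & j & \cdots & p \\ \hline
1 & & & \vdots & & \\
i & \cdots & \cdots & x_{ij} & \cdots & \cdots \\
n & & & \vdots & & \\ \hline
\bar{x} & \cdots & \cdots & \bar{x}^j & \cdots & \cdots
\end{array}
$$

Centered data matrix $Y$

$$
\begin{array}{c|ccccc}
 & 1 & \cdots & j & \cdots & p \\ \hline
1 & & & \vdots & & \\
i & \cdots & \cdots & y_{ij} & \cdots & \cdots \\
n & & & \vdots & & \\ \hline
\bar{y} & \cdots & \cdots & 0 & \cdots & \cdots
\end{array}
$$

Here:

- $\bar{x}^j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}$ is the mean of the $j$-th variable (column $j$ of $X$),
- $y_{ij} = x_{ij} - \bar{x}^j$ is the general term of the centered data matrix $Y$.

The columns of the centered data matrix $Y$ have zero mean:

$$\bar{y}^j = \frac{1}{n} \sum_{i=1}^{n} y_{ij} = 0.$$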

(13)

Example: the set of 6 patients.

##          diast syst chol
## Brigitte    90  140  6.0
## Marie       60   85  5.9
## Vincent     75  135  6.1
## Alex        70  145  5.8
## Manue       85  130  5.4
## Fred        70  145  5.0

Means of the columns of X

## diast  syst  chol
##  75.0 130.0   5.7

Centered data matrix Y

##          diast syst chol
## Brigitte    15   10  0.3
## Marie      -15  -45  0.2
## Vincent      0    5  0.4
## Alex        -5   15  0.1
## Manue       10    0 -0.3
## Fred        -5   15 -0.7

Means of the columns of Y

## diast  syst  chol
##     0     0     0
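The centering step can be reproduced with a few lines of R. This is only a sketch: the data are re-typed by hand from the table above (assumed identical to the contents of `../data/chol.rda`), and the names `xbar` and `Y` are chosen here for illustration.

```r
# Data re-typed from the table above (assumption: same values as ../data/chol.rda)
X <- matrix(c(90, 140, 6.0,  60, 85, 5.9,  75, 135, 6.1,
              70, 145, 5.8,  85, 130, 5.4, 70, 145, 5.0),
            ncol = 3, byrow = TRUE,
            dimnames = list(c("Brigitte", "Marie", "Vincent", "Alex", "Manue", "Fred"),
                            c("diast", "syst", "chol")))

xbar <- colMeans(X)            # mean of each column (variable)
Y <- sweep(X, 2, xbar, "-")    # subtract the column mean from every column

print(xbar)
print(round(Y, 1))
print(round(colMeans(Y), 10))  # zero up to floating-point error
```

`sweep()` is used rather than `scale()` so that the result stays a plain matrix with the original row and column names.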

(14)

- Centering the data can be interpreted as a translation of the set of observations in $\mathbb{R}^p$.

[Figure: "Centered set of 6 patients", 3D scatterplot of the 6 patients with axes diast, syst and chol.]

(15)

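Original data matrix $X$, with the mean $\bar{x}^j$ and the standard deviation $s_j$ of each column

$$
\begin{array}{c|ccccc}
 & 1 & \cdots & j & \cdots & p \\ \hline
1 & & & \vdots & & \\
i & \cdots & \cdots & x_{ij} & \cdots & \cdots \\
n & & & \vdots & & \\ \hline
\bar{x} & \cdots & \cdots & \bar{x}^j & \cdots & \cdots \\
s & \cdots & \cdots & s_j & \cdots & \cdots
\end{array}
$$

Standardized data matrix $Z$

$$
\begin{array}{c|ccccc}
 & 1 & \cdots & j & \cdots & p \\ \hline
1 & & & \vdots & & \\
i & \cdots & \cdots & z_{ij} & \cdots & \cdots \\
n & & & \vdots & & \\ \hline
\bar{z} & \cdots & \cdots & 0 & \cdots & \cdots \\
s & \cdots & \cdots & 1 & \cdots & \cdots
\end{array}
$$

Here:

- $s_j^2 = \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}^j)^2$ is the variance of the $j$-th variable (column $j$ of $X$),
- $z_{ij} = \frac{x_{ij} - \bar{x}^j}{s_j}$ is the general term of the standardized data matrix $Z$.

The columns of the standardized data matrix $Z$ have a mean equal to 0 and a variance equal to 1:

$$\bar{z}^j = \frac{1}{n} \sum_{i=1}^{n} z_{ij} = 0, \qquad \mathrm{var}(z^j) = \frac{1}{n} \sum_{i=1}^{n} (z_{ij} - \bar{z}^j)^2 = 1.$$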

(16)

Example: the set of 6 patients.

##          diast syst chol
## Brigitte    90  140  6.0
## Marie       60   85  5.9
## Vincent     75  135  6.1
## Alex        70  145  5.8
## Manue       85  130  5.4
## Fred        70  145  5.0

Means and sd of the columns of X

##      diast  syst  chol
## mean    75 130.0 5.700
## sd      10  20.8 0.383

Standardized data matrix Z

##          diast  syst  chol
## Brigitte   1.5  0.48  0.78
## Marie     -1.5 -2.16  0.52
## Vincent    0.0  0.24  1.04
## Alex      -0.5  0.72  0.26
## Manue      1.0  0.00 -0.78
## Fred      -0.5  0.72 -1.83

Means and sd of the columns of Z

## diast  syst  chol
##     0     0     0
## diast  syst  chol
##     1     1     1
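A possible way to obtain $Z$ in R is sketched below. The data are re-typed by hand (assumed identical to `../data/chol.rda`), and the variable names are illustrative. Note that the standard deviation here uses the $1/n$ convention of these slides, whereas R's `sd()` and `scale()` divide by $n - 1$, so they would give slightly different values.

```r
# Data re-typed from the table above (assumption: same values as ../data/chol.rda)
X <- matrix(c(90, 140, 6.0,  60, 85, 5.9,  75, 135, 6.1,
              70, 145, 5.8,  85, 130, 5.4, 70, 145, 5.0),
            ncol = 3, byrow = TRUE,
            dimnames = list(c("Brigitte", "Marie", "Vincent", "Alex", "Manue", "Fred"),
                            c("diast", "syst", "chol")))

Y <- sweep(X, 2, colMeans(X), "-")   # centered data
s <- sqrt(colMeans(Y^2))             # standard deviations with the 1/n convention
Z <- sweep(Y, 2, s, "/")             # standardized data

print(round(s, 3))
print(round(Z, 2))
print(round(colMeans(Z), 10))        # means: 0
print(colMeans(Z^2))                 # variances (1/n convention): 1
```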

(17)

- Standardization (centering and scaling) can be interpreted as a translation and a normalisation of the set of observations in $\mathbb{R}^p$.

[Figure: "Standardized set of the 6 patients", 3D scatterplot of the 6 patients with axes diast, syst and chol.]

(18)

In summary, three datasets of the same observations.

[Figure: three 3D scatterplots of the 6 patients on common axes (diast, syst, chol): original data, centered data, standardized data.]

- Centering does not change the distances between the observations: $d^2(x_i, x_{i'}) = d^2(y_i, y_{i'})$.
- Standardization changes the distances between the observations: $d^2(x_i, x_{i'}) \neq d^2(z_i, z_{i'})$.
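These two statements can be checked numerically on the 6 patients. The following is a minimal sketch with hand-typed data (assumed identical to `../data/chol.rda`); `DX`, `DY` and `DZ` are illustrative names for the three distance matrices.

```r
# Data re-typed from the patient table (assumption: same values as ../data/chol.rda)
X <- matrix(c(90, 140, 6.0,  60, 85, 5.9,  75, 135, 6.1,
              70, 145, 5.8,  85, 130, 5.4, 70, 145, 5.0),
            ncol = 3, byrow = TRUE,
            dimnames = list(c("Brigitte", "Marie", "Vincent", "Alex", "Manue", "Fred"),
                            c("diast", "syst", "chol")))
Y <- sweep(X, 2, colMeans(X), "-")        # centered data
Z <- sweep(Y, 2, sqrt(colMeans(Y^2)), "/") # standardized data (1/n variance)

# Euclidean distances between all pairs of rows
DX <- as.matrix(dist(X))
DY <- as.matrix(dist(Y))
DZ <- as.matrix(dist(Z))

print(max(abs(DX - DY)))  # numerically zero: centering is a translation
print(max(abs(DX - DZ)))  # far from zero: scaling changes the distances
```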

(19)

Proximity between two observations can be measured with the Euclidean distance.

- The squared Euclidean distance between two observations $i$ and $i'$ (two rows of $X$) is:

$$d^2(x_i, x_{i'}) = \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2.$$

- When data are standardized, the squared Euclidean distance between two observations $i$ and $i'$ (two rows of $Z$) is:

$$d^2(z_i, z_{i'}) = \sum_{j=1}^{p} \frac{1}{s_j^2} (x_{ij} - x_{i'j})^2.$$

It means:

- If variables (columns of $X$) are measured on different scales, variables with larger variance weigh more than variables with smaller variance in the Euclidean distance.
- Standardizing the data gives the same importance to all the variables in the Euclidean distance.

(20)

Example: distance between Brigitte and Marie

Original data (X):

##          diast syst chol
## Brigitte    90  140  6.0
## Marie       60   85  5.9
## Vincent     75  135  6.1
## Alex        70  145  5.8
## Manue       85  130  5.4
## Fred        70  145  5.0

Mean and sd of the columns:

##      diast  syst  chol
## mean    75 130.0 5.700
## sd      10  20.8 0.383

Standardized data (Z):

##          diast  syst  chol
## Brigitte   1.5  0.48  0.78
## Marie     -1.5 -2.16  0.52
## Vincent    0.0  0.24  1.04
## Alex      -0.5  0.72  0.26
## Manue      1.0  0.00 -0.78
## Fred      -0.5  0.72 -1.83

Euclidean distance between the first two rows of $X$:

$$d(x_1, x_2) = \sqrt{(90-60)^2 + (140-85)^2 + (6-5.9)^2} = \sqrt{30^2 + 55^2 + 0.1^2} \approx 62.6$$

Euclidean distance between the first two rows of $Z$:

$$d(z_1, z_2) = \sqrt{\frac{1}{10^2}(90-60)^2 + \frac{1}{20.8^2}(140-85)^2 + \frac{1}{0.383^2}(6-5.9)^2}$$

$$= \sqrt{(1.5+1.5)^2 + (0.48+2.16)^2 + (0.78-0.52)^2} = \sqrt{3^2 + 2.64^2 + 0.26^2} \approx 4.0$$

On the original data the distance is dominated by the systolic pressure, whereas after standardization the three variables contribute on comparable scales.
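The two distances between Brigitte and Marie can be computed directly in R. This is a sketch with hand-typed data (assumed identical to `../data/chol.rda`); it also checks that the distance on $Z$ equals the distance on $X$ weighted by $1/s_j^2$. The names `d_x`, `d_z` and `d_w` are illustrative.

```r
# Data re-typed from the table above (assumption: same values as ../data/chol.rda)
X <- matrix(c(90, 140, 6.0,  60, 85, 5.9,  75, 135, 6.1,
              70, 145, 5.8,  85, 130, 5.4, 70, 145, 5.0),
            ncol = 3, byrow = TRUE,
            dimnames = list(c("Brigitte", "Marie", "Vincent", "Alex", "Manue", "Fred"),
                            c("diast", "syst", "chol")))
Y <- sweep(X, 2, colMeans(X), "-")
s <- sqrt(colMeans(Y^2))           # standard deviations (1/n convention)
Z <- sweep(Y, 2, s, "/")

diff_x <- X["Brigitte", ] - X["Marie", ]
d_x <- sqrt(sum(diff_x^2))                              # distance on original data
d_z <- sqrt(sum((Z["Brigitte", ] - Z["Marie", ])^2))    # distance on standardized data
d_w <- sqrt(sum(diff_x^2 / s^2))                        # weighted form of the same distance

print(c(d_x = d_x, d_z = d_z, d_w = d_w))
```

With unrounded standard deviations, `d_x` is about 62.65 and `d_z` about 4.01.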

(21)

The dispersion of the set of observations in $\mathbb{R}^p$ is measured by the inertia.

- The inertia of the $n$ observations (the $n$ rows of $X$) is defined by:

$$I(X) = \frac{1}{n} \sum_{i=1}^{n} d^2(x_i, \bar{x}).$$

- Inertia is a generalization of the variance to the case of multivariate data ($p$ variables).
- One can show that:

$$I(X) = \sum_{j=1}^{p} \mathrm{var}(x^j).$$

This means that:

- when the variables are centered, $I(Y) = \sum_{j=1}^{p} s_j^2$,
- when the variables are standardized, $I(Z) = p$.

(22)

Example: inertia of the set of 6 patients

Centered data (Y):

##          diast syst chol
## Brigitte    15   10  0.3
## Marie      -15  -45  0.2
## Vincent      0    5  0.4
## Alex        -5   15  0.1
## Manue       10    0 -0.3
## Fred        -5   15 -0.7

Variance of the columns:

##  diast   syst   chol
## 100.00 433.33   0.15

Standardized data (Z):

##          diast  syst  chol
## Brigitte   1.5  0.48  0.78
## Marie     -1.5 -2.16  0.52
## Vincent    0.0  0.24  1.04
## Alex      -0.5  0.72  0.26
## Manue      1.0  0.00 -0.78
## Fred      -0.5  0.72 -1.83

Variance of the columns:

## diast  syst  chol
##     1     1     1

- Inertia of the centered dataset:

$$I(Y) = 100 + 433.33 + 0.15 = 533.48$$

- Inertia of the standardized dataset:

$$I(Z) = 1 + 1 + 1 = 3$$
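As a quick check, the inertia can be computed from its definition (mean squared distance to the centre of gravity) and compared with the sum of the column variances. The sketch below uses hand-typed data (assumed identical to `../data/chol.rda`); `inertia()` is a hypothetical helper written for this example, not a function of any package.

```r
# Data re-typed from the patient table (assumption: same values as ../data/chol.rda)
X <- matrix(c(90, 140, 6.0,  60, 85, 5.9,  75, 135, 6.1,
              70, 145, 5.8,  85, 130, 5.4, 70, 145, 5.0),
            ncol = 3, byrow = TRUE,
            dimnames = list(c("Brigitte", "Marie", "Vincent", "Alex", "Manue", "Fred"),
                            c("diast", "syst", "chol")))
Y <- sweep(X, 2, colMeans(X), "-")
s <- sqrt(colMeans(Y^2))
Z <- sweep(Y, 2, s, "/")

# Hypothetical helper: mean squared Euclidean distance of the rows to their mean point
inertia <- function(M) {
  centred <- sweep(M, 2, colMeans(M), "-")
  mean(rowSums(centred^2))
}

print(inertia(X))   # same as inertia(Y): centering does not change the dispersion
print(inertia(Y))   # equals sum(s^2)
print(sum(s^2))
print(inertia(Z))   # equals p = 3
```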

(23)

The second set of points associated with a numerical data matrix is the set of variables.

Example: the variables diastolic pressure, systolic pressure and cholesterol define a set of $p = 3$ points in $\mathbb{R}^6$.

##       Brigitte Marie Vincent  Alex Manue Fred
## diast       90  60.0    75.0  70.0  85.0   70
## syst       140  85.0   135.0 145.0 130.0  145
## chol         6   5.9     6.1   5.8   5.4    5

Can't be visualized!

(24)

- Each variable $j$ is a point $x^j$ in $\mathbb{R}^n$ (a column of $X$).
- A weight $m_j$ is associated with each variable $j$. Usually:
    - $m_j = 1$ in PCA,
    - $m_j \neq 1$ in MCA (Multiple Correspondence Analysis).

When data are centered:

- each variable $j$ is a point denoted $y^j$ in $\mathbb{R}^n$ (a column of $Y$),
- we talk about the set of centered variables.

When data are standardized:

- each variable $j$ is a point denoted $z^j$ in $\mathbb{R}^n$ (a column of $Z$),
- we talk about the set of standardized variables.

(25)

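The link between two variables is measured by the covariance or the correlation.

To define covariance and correlation, a metric is associated with $\mathbb{R}^n$: $N = \mathrm{diag}(\frac{1}{n}, \ldots, \frac{1}{n})$.

- The scalar product between $x$ and $y$ in $\mathbb{R}^n$ is defined by:

$$\langle x, y \rangle_N = x^T N y = \frac{1}{n} x^T y = \frac{1}{n} \sum_{i=1}^{n} x_i y_i.$$

- The norm of $x$ in $\mathbb{R}^n$ is then:

$$\|x\|_N = \sqrt{\langle x, x \rangle_N} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^2}.$$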

(26)

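With this metric, the variance can be written as a squared norm:

- $\mathrm{var}(x^j) = \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}^j)^2 = \|y^j\|_N^2$,
- $\mathrm{var}(z^j) = \frac{1}{n} \sum_{i=1}^{n} (z_{ij} - \bar{z}^j)^2 = \|z^j\|_N^2$.

The $p$ standardized variables therefore lie on the unit sphere of $\mathbb{R}^n$, since $\|z^j\|_N = 1$.

Moreover, the covariance and the correlation can be written as scalar products:

- $c_{jj'} = \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}^j)(x_{ij'} - \bar{x}^{j'}) = \langle y^j, y^{j'} \rangle_N$,
- $r_{jj'} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_{ij} - \bar{x}^j}{s_j} \right) \left( \frac{x_{ij'} - \bar{x}^{j'}}{s_{j'}} \right) = \langle z^j, z^{j'} \rangle_N$.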
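The identities above can be illustrated on the patient data. In this sketch the data are hand-typed (assumed identical to `../data/chol.rda`), and `inner_N()` and `norm_N()` are hypothetical helper names for the scalar product and norm induced by $N = \frac{1}{n} I_n$.

```r
# Data re-typed from the patient table (assumption: same values as ../data/chol.rda)
X <- matrix(c(90, 140, 6.0,  60, 85, 5.9,  75, 135, 6.1,
              70, 145, 5.8,  85, 130, 5.4, 70, 145, 5.0),
            ncol = 3, byrow = TRUE,
            dimnames = list(c("Brigitte", "Marie", "Vincent", "Alex", "Manue", "Fred"),
                            c("diast", "syst", "chol")))
n <- nrow(X)
Y <- sweep(X, 2, colMeans(X), "-")
Z <- sweep(Y, 2, sqrt(colMeans(Y^2)), "/")

N <- diag(1 / n, n)                                      # metric on R^n
inner_N <- function(x, y) drop(t(x) %*% N %*% y)         # <x, y>_N
norm_N  <- function(x) sqrt(inner_N(x, x))               # ||x||_N

print(norm_N(Y[, "syst"])^2)                   # variance of syst (1/n convention)
print(inner_N(Y[, "diast"], Y[, "syst"]))      # covariance of diast and syst
print(norm_N(Z[, "chol"]))                     # standardized variables have norm 1
print(inner_N(Z[, "diast"], Z[, "syst"]))      # correlation of diast and syst
```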

(27)

This leads to a simple expression of the covariance matrix, denoted $C$, and of the correlation matrix, denoted $R$:

- $C = Y^T N Y$,
- $R = Z^T N Z$.

Example:

Covariance matrix:

##        diast  syst  chol
## diast 100.00 112.5  0.25
## syst  112.50 433.3 -2.17
## chol    0.25  -2.2  0.15

Correlation matrix:

##       diast  syst   chol
## diast 1.000  0.54  0.065
## syst  0.540  1.00 -0.272
## chol  0.065 -0.27  1.000
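A short R sketch of these two matrix products is given below, with hand-typed data (assumed identical to `../data/chol.rda`). It also compares the results with R's built-in `cov()` and `cor()`; since `cov()` divides by $n - 1$, it has to be rescaled by $(n-1)/n$ to match the $1/n$ convention used here.

```r
# Data re-typed from the patient table (assumption: same values as ../data/chol.rda)
X <- matrix(c(90, 140, 6.0,  60, 85, 5.9,  75, 135, 6.1,
              70, 145, 5.8,  85, 130, 5.4, 70, 145, 5.0),
            ncol = 3, byrow = TRUE,
            dimnames = list(c("Brigitte", "Marie", "Vincent", "Alex", "Manue", "Fred"),
                            c("diast", "syst", "chol")))
n <- nrow(X)
Y <- sweep(X, 2, colMeans(X), "-")
Z <- sweep(Y, 2, sqrt(colMeans(Y^2)), "/")
N <- diag(1 / n, n)

C <- t(Y) %*% N %*% Y   # covariance matrix (1/n convention)
R <- t(Z) %*% N %*% Z   # correlation matrix

print(round(C, 2))
print(round(R, 3))
print(max(abs(C - cov(X) * (n - 1) / n)))  # numerically zero
print(max(abs(R - cor(X))))                # numerically zero
```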

(28)

With this metric, the correlation can be written as a cosine:

- $r_{jj'} = \frac{\langle y^j, y^{j'} \rangle_N}{\|y^j\|_N \, \|y^{j'}\|_N} = \cos\theta_N(y^j, y^{j'})$,
- $r_{jj'} = \langle z^j, z^{j'} \rangle_N = \cos\theta_N(z^j, z^{j'})$.

This leads to a geometrical interpretation of the correlation between variables:

- an angle of 90 degrees between two standardized variables corresponds to a null correlation (cosine equal to 0) and thus to the absence of a linear link,
- an angle of 0 degrees corresponds to a correlation of 1 (cosine equal to 1) and thus to a positive linear link,
- an angle of 180 degrees corresponds to a correlation of -1 (cosine equal to -1) and thus to a negative linear link.
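To make the geometric reading concrete, the angles between the three patient variables can be recovered from their correlations. This is only an illustrative sketch with hand-typed data (assumed identical to `../data/chol.rda`); `angle_deg()` is a hypothetical helper.

```r
# Data re-typed from the patient table (assumption: same values as ../data/chol.rda)
X <- matrix(c(90, 140, 6.0,  60, 85, 5.9,  75, 135, 6.1,
              70, 145, 5.8,  85, 130, 5.4, 70, 145, 5.0),
            ncol = 3, byrow = TRUE,
            dimnames = list(c("Brigitte", "Marie", "Vincent", "Alex", "Manue", "Fred"),
                            c("diast", "syst", "chol")))
R <- cor(X)

# Hypothetical helper: angle (in degrees) whose cosine is the correlation r
angle_deg <- function(r) acos(r) * 180 / pi

theta_ds <- angle_deg(R["diast", "syst"])  # acute angle: positive correlation
theta_dc <- angle_deg(R["diast", "chol"])  # close to 90 degrees: almost no linear link
theta_sc <- angle_deg(R["syst", "chol"])   # obtuse angle: negative correlation

print(round(c(diast_syst = theta_ds, diast_chol = theta_dc, syst_chol = theta_sc), 1))
```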

(29)

PCA analyses:

- either the centered data matrix $Y$,
- or the standardized data matrix $Z$.

This leads to two different methods of PCA:

- non normalized PCA (or PCA on the covariance matrix), which analyses $Y$,
- normalized PCA (or PCA on the correlation matrix), which analyses $Z$.

From now on, normalized PCA is considered.

(31)

Analysis of the set of observations

Find the subspace which gives the best representation of the observations.

- Best approximation of the data by projection.
- Best representation of the variability of the observations.

(32)

Example: the set of the 6 patients described on the 3 standardized variables.

[Figure, first panel: "Standardized set of observations", 3D scatterplot of the 6 patients with axes diast, syst and chol. Second panel: "Projection of the 6 patients" on Dim 1 (52.69%) and Dim 2 (35.07%).]

The aim is to find the projection plane which preserves as well as possible the distances between the patients, i.e. their variability and hence their inertia.

(33)

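Projection of an observation (a point in $\mathbb{R}^p$) on an axis.

The coordinate of the orthogonal projection of a point $z_i \in \mathbb{R}^p$ on an axis $\Delta_\alpha$ with orientation vector $v_\alpha$ ($v_\alpha^T v_\alpha = 1$) is:

$$f_{i\alpha} = \langle z_i, v_\alpha \rangle = z_i^T v_\alpha.$$

The vector of coordinates of the projections of the $n$ observations is:

$$f^\alpha = \begin{pmatrix} f_{1\alpha} \\ \vdots \\ f_{n\alpha} \end{pmatrix} = Z v_\alpha = \sum_{j=1}^{p} v_{j\alpha} z^j.$$

- $f^\alpha$ is a linear combination of the columns of $Z$.
- $f^\alpha$ is centered if the columns of $Z$ are centered.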

(34)

Example: the 6 patients are the rows of the following standardized data matrix

$$
Z = \begin{pmatrix}
1.50 & 0.48 & 0.78 \\
-1.50 & -2.16 & 0.52 \\
0.00 & 0.24 & 1.04 \\
-0.50 & 0.72 & 0.26 \\
1.00 & 0.00 & -0.78 \\
-0.50 & 0.72 & -1.83
\end{pmatrix}
$$

Let us project the 6 "standardized" patients on two orthogonal axes $\Delta_1$ and $\Delta_2$ with orientation vectors:

$$
v_1 = \begin{pmatrix} 0.641 \\ 0.72 \\ -0.265 \end{pmatrix}, \qquad
v_2 = \begin{pmatrix} 0.4433 \\ -0.0652 \\ 0.894 \end{pmatrix}.
$$

(35)

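The vectors $f^1$ and $f^2$ of the coordinates of the projections of the 6 patients on $\Delta_1$ and $\Delta_2$ are:

$$
f^1 = Z v_1 = 0.641 \begin{pmatrix} 1.5 \\ \vdots \\ -0.5 \end{pmatrix}
+ 0.72 \begin{pmatrix} 0.48 \\ \vdots \\ 0.72 \end{pmatrix}
- 0.265 \begin{pmatrix} 0.78 \\ \vdots \\ -1.83 \end{pmatrix}
= \begin{pmatrix} 1.09 \\ \vdots \\ 0.683 \end{pmatrix}
$$

$$
f^2 = Z v_2 = 0.4433 \begin{pmatrix} 1.5 \\ \vdots \\ -0.5 \end{pmatrix}
- 0.0652 \begin{pmatrix} 0.48 \\ \vdots \\ 0.72 \end{pmatrix}
+ 0.894 \begin{pmatrix} 0.78 \\ \vdots \\ -1.83 \end{pmatrix}
= \begin{pmatrix} 1.333 \\ \vdots \\ -1.9 \end{pmatrix}
$$

$f^1$ and $f^2$ are two new synthetic and centered variables.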
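The projection onto the two axes is a plain matrix-vector product, as the following sketch shows. The data are hand-typed (assumed identical to `../data/chol.rda`), and `v1` and `v2` are the rounded orientation vectors given in the example, so they are unit-norm and orthogonal only up to rounding.

```r
# Data re-typed from the patient table (assumption: same values as ../data/chol.rda)
X <- matrix(c(90, 140, 6.0,  60, 85, 5.9,  75, 135, 6.1,
              70, 145, 5.8,  85, 130, 5.4, 70, 145, 5.0),
            ncol = 3, byrow = TRUE,
            dimnames = list(c("Brigitte", "Marie", "Vincent", "Alex", "Manue", "Fred"),
                            c("diast", "syst", "chol")))
Y <- sweep(X, 2, colMeans(X), "-")
Z <- sweep(Y, 2, sqrt(colMeans(Y^2)), "/")

# Rounded orientation vectors from the example
v1 <- c(0.641, 0.72, -0.265)
v2 <- c(0.4433, -0.0652, 0.894)

f1 <- drop(Z %*% v1)   # coordinates of the 6 patients on the first axis
f2 <- drop(Z %*% v2)   # coordinates of the 6 patients on the second axis

print(round(cbind(f1, f2), 3))
print(c(sum(v1^2), sum(v2^2), sum(v1 * v2)))  # about 1, 1 and 0
print(round(c(mean(f1), mean(f2)), 10))       # both projections are centered
```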

(36)

[Figure, first panel: 3D scatterplot of the 6 standardized patients with axes diast, syst and chol. Second panel: projection of the 6 patients on the plane (f1, f2).]

In PCA the orientation vectors $v_1$ and $v_2$ are defined to maximize the inertia of the set of projections of the observations, and thus to preserve the distances between the observations as well as possible.

(37)

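Axes of projection of the observations in PCA.

$\Delta_1$ is the axis with orientation vector $v_1 \in \mathbb{R}^p$ which maximises the variance of the $n$ projected observations:

$$v_1 = \arg\max_{\|v\|=1} \mathrm{var}(Zv) = \arg\max_{\|v\|=1} v^T R v,$$

where $R = \frac{1}{n} Z^T Z$ is the $p \times p$ correlation matrix. The second equality holds because $Zv$ is centered, so that $\mathrm{var}(Zv) = \frac{1}{n} v^T Z^T Z v$.

One can show that:

- $v_1$ is the eigenvector associated with the largest eigenvalue $\lambda_1$ of $R$,
- the first principal component (PC) $f^1 = Z v_1$ is centered: $\bar{f}^1 = 0$,
- $\lambda_1$ is the variance of the first PC:

$$\mathrm{var}(f^1) = \lambda_1.$$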

(38)

$\Delta_2$ is the axis with orientation vector $v_2 \perp v_1$ which maximises the variance of the $n$ projected observations:

$$v_2 = \arg\max_{\|v\|=1,\; v \perp v_1} \mathrm{var}(Zv).$$

One can show that:

- $v_2$ is the eigenvector associated with the second largest eigenvalue $\lambda_2$ of $R$,
- the second principal component (PC) $f^2 = Z v_2$ is centered: $\bar{f}^2 = 0$,
- $\lambda_2$ is the variance of the second PC:

$$\mathrm{var}(f^2) = \lambda_2,$$

- the principal components $f^1$ and $f^2$ are not correlated.

In the same way, we can get $q \le r$ ($r$ is the rank of $Z$) orthogonal axes $\Delta_1, \ldots, \Delta_q$ on which the observations are projected.
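These properties can be checked numerically with base R's `eigen()`. The sketch below uses hand-typed data (assumed identical to `../data/chol.rda`); the names `lambda` and `V` are illustrative. The sign of each eigenvector returned by `eigen()` is arbitrary, so the vectors may come out with the opposite sign to those shown in the slides.

```r
# Data re-typed from the patient table (assumption: same values as ../data/chol.rda)
X <- matrix(c(90, 140, 6.0,  60, 85, 5.9,  75, 135, 6.1,
              70, 145, 5.8,  85, 130, 5.4, 70, 145, 5.0),
            ncol = 3, byrow = TRUE,
            dimnames = list(c("Brigitte", "Marie", "Vincent", "Alex", "Manue", "Fred"),
                            c("diast", "syst", "chol")))
n <- nrow(X)
Y <- sweep(X, 2, colMeans(X), "-")
Z <- sweep(Y, 2, sqrt(colMeans(Y^2)), "/")

R <- crossprod(Z) / n              # correlation matrix (1/n) Z'Z
e <- eigen(R, symmetric = TRUE)    # eigenvalues are returned in decreasing order
lambda <- e$values
V <- e$vectors                     # unit-norm eigenvectors, signs arbitrary

f1 <- drop(Z %*% V[, 1])           # first principal component
f2 <- drop(Z %*% V[, 2])           # second principal component

print(round(lambda, 4))
print(c(var_f1 = mean(f1^2), var_f2 = mean(f2^2)))  # equal to lambda[1], lambda[2]
print(mean(f1 * f2))                                 # covariance of f1 and f2: zero
```

Because the principal components are centered, `mean(f1^2)` is their variance with the $1/n$ convention.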

(39)

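In summary:

1. The eigendecomposition of the correlation matrix $R$ is performed and $q \le r$ is chosen.
2. The $n \times q$ matrix $F = ZV$ of the $q$ principal components is obtained with the matrix $V$ of the first $q$ eigenvectors of $R$.

- The principal components $f^\alpha = Z v_\alpha$ (columns of $F$) are centered and have variance $\lambda_\alpha$.
- The elements $f_{i\alpha}$ are called the factor coordinates of the observations, or the scores of the observations on the principal components.

$$
F = \begin{array}{c|ccccc}
 & 1 & \cdots & \alpha & \cdots & q \\ \hline
1 & & & \vdots & & \\
i & \cdots & \cdots & f_{i\alpha} & \cdots & \cdots \\
n & & & \vdots & & \\ \hline
\text{mean} & \cdots & \cdots & 0 & \cdots & \cdots \\
\text{var} & \cdots & \cdots & \lambda_\alpha & \cdots & \cdots
\end{array}
$$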

(40)

Example of the 6 patients: matrix $F$ of the first $q = 2$ PCs

##             f1     f2
## Brigitte  1.10  1.334
## Marie    -2.66 -0.057
## Vincent  -0.10  0.918
## Alex      0.13 -0.035
## Manue     0.85 -0.257
## Fred      0.68 -1.903

[Figure: "Projection of the 6 patients by PCA" on Dim 1 (52.69%) and Dim 2 (35.07%).]
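The two steps of the summary can be sketched in R as follows, with hand-typed data (assumed identical to `../data/chol.rda`). Since eigenvector signs are arbitrary, the columns of `F` are flipped here so that Brigitte's scores are positive, a convention chosen only to make the output comparable with the table above. `F` and `pct` are illustrative names.

```r
# Data re-typed from the patient table (assumption: same values as ../data/chol.rda)
X <- matrix(c(90, 140, 6.0,  60, 85, 5.9,  75, 135, 6.1,
              70, 145, 5.8,  85, 130, 5.4, 70, 145, 5.0),
            ncol = 3, byrow = TRUE,
            dimnames = list(c("Brigitte", "Marie", "Vincent", "Alex", "Manue", "Fred"),
                            c("diast", "syst", "chol")))
n <- nrow(X)
Y <- sweep(X, 2, colMeans(X), "-")
Z <- sweep(Y, 2, sqrt(colMeans(Y^2)), "/")

# Step 1: eigendecomposition of the correlation matrix, keep q = 2 axes
e <- eigen(crossprod(Z) / n, symmetric = TRUE)
q <- 2
lambda <- e$values
V <- e$vectors[, 1:q]

# Step 2: principal components F = Z V
F <- Z %*% V
colnames(F) <- c("f1", "f2")
F <- sweep(F, 2, sign(F["Brigitte", ]), "*")   # sign convention for display only

pct <- 100 * lambda / sum(lambda)              # share of inertia carried by each axis
print(round(F, 3))
print(round(pct, 2))
```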

(42)

Analysis of the set of variables

Find the subspace which gives the best representation of the variables.

(43)

Example: the set of 3 standardized variables.

3 variables on the unit sphere of $\mathbb{R}^6$.

##       Brigitte Marie Vincent Alex Manue Fred
## diast      1.5  -1.5     0.0 -0.5   1.0 -0.5
## syst       0.5  -2.2     0.2  0.7   0.0  0.7
## chol       0.8   0.5     1.0  0.3  -0.8 -1.8

[Figure: "Projection of the 3 standardized variables" (diast, syst, chol) on Dim 1 (52.69%) and Dim 2 (35.07%).]

The aim is to find the projection plane which best represents the variables, and thus preserves as well as possible the angles between the variables, i.e. their correlations.

(44)

Projection of a variable (a point in $\mathbb{R}^n$) on an axis.

The coordinate of the $N$-orthogonal projection of a point $z^j \in \mathbb{R}^n$ on an axis $G_\alpha$ with orientation vector $u_\alpha$ ($u_\alpha^T N u_\alpha = 1$) is:

$$a_{j\alpha} = \langle z^j, u_\alpha \rangle_N = (z^j)^T N u_\alpha,$$

and the vector of coordinates of the projections of the $p$ variables is:

$$a^\alpha = \begin{pmatrix} a_{1\alpha} \\ \vdots \\ a_{p\alpha} \end{pmatrix} = Z^T N u_\alpha.$$

Warning: a metric $N$ in $\mathbb{R}^n$ is used.

- A metric in $\mathbb{R}^n$ is an $n \times n$ symmetric positive definite matrix.
- Here in PCA, $N$ is the diagonal matrix of the weights of the observations: $N = \mathrm{diag}(w_1, \ldots, w_n)$.
- When all observations are weighted by $\frac{1}{n}$ (usually by default): $N = \frac{1}{n} I_n$.

(45)

Example: the three variables (diast, syst, chol) are the columns of the following standardized data matrix

$$
Z = \begin{pmatrix}
1.50 & 0.48 & 0.78 \\
-1.50 & -2.16 & 0.52 \\
0.00 & 0.24 & 1.04 \\
-0.50 & 0.72 & 0.26 \\
1.00 & 0.00 & -0.78 \\
-0.50 & 0.72 & -1.83
\end{pmatrix}
$$

Let us project the 3 standardized variables on two $N$-orthogonal axes $G_1$ and $G_2$ with orientation vectors (here $N = \frac{1}{6} I_6$):

$$
u_1 = \begin{pmatrix} 0.87 \\ -2.11 \\ -0.08 \\ 0.10 \\ 0.67 \\ 0.54 \end{pmatrix}, \qquad
u_2 = \begin{pmatrix} 1.30 \\ -0.06 \\ 0.90 \\ -0.03 \\ -0.25 \\ -1.8 \end{pmatrix}.
$$

(46)

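The vectors $a^1$ and $a^2$ of coordinates of the projections of the 3 variables on $G_1$ and $G_2$ are:

$$
a^1 = Z^T N u_1 = \frac{0.87}{6} \begin{pmatrix} 1.5 \\ 0.48 \\ 0.78 \end{pmatrix}
- \frac{2.11}{6} \begin{pmatrix} -1.5 \\ -2.16 \\ 0.52 \end{pmatrix}
+ \ldots + \frac{0.54}{6} \begin{pmatrix} -0.5 \\ 0.72 \\ -1.83 \end{pmatrix}
= \begin{pmatrix} 0.81 \\ 0.91 \\ -0.33 \end{pmatrix}
$$

$$
a^2 = Z^T N u_2 = \frac{1.30}{6} \begin{pmatrix} 1.5 \\ 0.48 \\ 0.78 \end{pmatrix}
- \frac{0.06}{6} \begin{pmatrix} -1.5 \\ -2.16 \\ 0.52 \end{pmatrix}
+ \ldots - \frac{1.80}{6} \begin{pmatrix} -0.5 \\ 0.72 \\ -1.83 \end{pmatrix}
= \begin{pmatrix} 0.45 \\ -0.07 \\ 0.92 \end{pmatrix}
$$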
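The coordinates of the variables can be reproduced in R under one assumption that the slides do not state at this point: that $u_1$ and $u_2$ are the principal components of the observations rescaled to unit $N$-norm, $u_\alpha = f^\alpha / \sqrt{\lambda_\alpha}$. The numerical values of $u_1$ and $u_2$ above are consistent with this. The data are hand-typed (assumed identical to `../data/chol.rda`), and signs are fixed so that Brigitte's scores are positive, for comparability only.

```r
# Data re-typed from the patient table (assumption: same values as ../data/chol.rda)
X <- matrix(c(90, 140, 6.0,  60, 85, 5.9,  75, 135, 6.1,
              70, 145, 5.8,  85, 130, 5.4, 70, 145, 5.0),
            ncol = 3, byrow = TRUE,
            dimnames = list(c("Brigitte", "Marie", "Vincent", "Alex", "Manue", "Fred"),
                            c("diast", "syst", "chol")))
n <- nrow(X)
Y <- sweep(X, 2, colMeans(X), "-")
Z <- sweep(Y, 2, sqrt(colMeans(Y^2)), "/")
N <- diag(1 / n, n)

e <- eigen(crossprod(Z) / n, symmetric = TRUE)
lambda <- e$values[1:2]
F <- Z %*% e$vectors[, 1:2]
colnames(F) <- c("f1", "f2")
F <- sweep(F, 2, sign(F["Brigitte", ]), "*")   # sign convention for display only

# Assumed construction of the axes: principal components rescaled to N-norm 1
U <- sweep(F, 2, sqrt(lambda), "/")

# Coordinates of the variables: a^alpha = Z' N u_alpha
A <- t(Z) %*% N %*% U

print(round(U, 2))
print(round(A, 2))
print(round(cor(X, F), 2))   # same values: correlations between variables and PCs
```

Under this assumption each coordinate $a_{j\alpha}$ is the correlation between variable $j$ and the principal component $f^\alpha$, which is why all values lie between -1 and 1.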