• Aucun résultat trouvé

TP1: PCA with R

N/A
N/A
Protected

Academic year: 2022

Partager "TP1: PCA with R"

Copied!
5
0
0

Texte intégral

(1)

TP1: PCA with R

1 An example.

Example ofn= 8 mineral waters described onp= 13 variables.

load("../data/eaux.rda") print(data[,1:3])

X <- data # original quantitative data matrix dim(X)

1.1 Bivariate description, distances and correlations

Dispersion plot of all the pairs of variables.

pairs(data[,1:5]) library(GGally) ggpairs(data[,1:5])

Matrices of the distances between the observations and of the correlations between the variables.

dist(data) cor(data[,1:5])

1.2 Centered and standardized data.

Mean and standard deviation of the columns.

m <- apply(X,2,mean) #mean print(m,digits=3)

s <- apply(X,2,sd) #standard deviation print(s,digits=3)

Centered data.

Y <- sweep(X,2,m,"-") apply(Y,2,mean) apply(Y,2,sd)

The covariance matrix is n1YTY. n <- nrow(X)

(t(as.matrix(Y))%*%as.matrix(Y)/n)[1,2]

cov(X)[1,2]

all.equal(cov(X),t(as.matrix(Y))%*%as.matrix(Y)/n)

#modify the covariance matrix formula to find the same results.

Standardized data.

Z <- ?

apply(Z,2,mean) apply(Z,2,sd)

The correlation matrix is n1ZTZ.

(2)

(t(as.matrix(Z))%*%as.matrix(Z)/n)[1,2]

cor(X)[1,2]

all.equal(cor(X),t(as.matrix(Z))%*%as.matrix(Z)/n)

Standardize now the data with thenon correctedstandard deviation.

s <- apply(X,2,sd)*sqrt((n-1)/n) #non corrected standard deviation print(s,digits=3)

Z <- ?

apply(Z,2,mean)

apply(Z,2,sd)*sqrt((n-1)/(n))

(t(as.matrix(Z))%*%as.matrix(Z)/n)[1,2]

cor(X)[1,2]

all.equal(cor(X),t(as.matrix(Z))%*%as.matrix(Z)/n)

1.3 PCA of the covariance matrix or of the correlation matrix ?

We use the functionPCAof the R packageFactoMineR

library(FactoMineR) # R package dedicated to multivariate data analysis

?PCA

1.3.1 PCA of the covariance matrix

Analysis of the rows and columns of the centered data matrixY i.e. eigen decomposition of the covariance matrixC=n1YtY.

The argumentscale.unit=FALSEis used.

res <- PCA(X,graph=FALSE,scale.unit=FALSE) # original data in input plot(res,choix="ind",cex=0.8,title="")

plot(res,choix="var",cex=0.8,title="")

#Give an interpretation of these two plots.

1.3.2 PCA of the correlation matrix

Analysis of the rows and columns of the standardized data matrixZi.e. eigen decomposition of the correlation matrixR= n1ZtZ.

res <- PCA(X,graph=FALSE) # original data in input plot(res,choix="ind",cex=0.8,title="")

plot(res,choix="var",cex=0.8,title="")

#plot the observations and the variables on the dimensions 3-4.

The plot of the observations is build from the matrix F of the principal coordinates of the individus (the principal components).

F <- res$ind$coord print(F[1:4,],digits=2)

#plot the observations on the first principal component plan using F axes <- c(1,2)

plot(F[,axes],pch=19,col=4,cex=1) abline(h=0,lty=2)

(3)

abline(v=0,lty=2)

text(F[,axes],labels=rownames(Z),pos=3,col=4,cex=1)

The variance of the principal components (the columns of F) are the eigenvalues of the correlation matrix.

res$eig[,1]

#calculate the variance of the columns of F n <- nrow(X)

apply(F,2,var)*(n-1)/n

The plot of the variables is build from the matrixAof the principal coordinates of the variables (the loadings).

A <- res$var$coord print(A[1:4,],digits=2)

#plot the variables on the first principal component map using A axes <- c(1,2)

plot(A[,axes],pch=19,col=4,cex=1) abline(h=0,lty=2)

abline(v=0,lty=2)

text(A[,axes],labels=colnames(Z),pos=3,col=4,cex=1)

The loadings are the correlations between the variables and the principal components.

A[,1]

# calculate the correlations between the first principal component and the 13 variables.

cor(X[,1],F[,1]) cor(X,F[,1])

1.3.3 PCA with other R functions.

Two functions perform PCA in the basicstatspackage: princompandprcomp. Do they perform PCA on the covariance or on the correlation matrix by default ?

?princomp

#pr <- princomp(X,cor=T) # on covariance matrix

?prcomp

pr <- prcomp(X,scale=TRUE) # on covariance matrix pr$sdev^2 # eigenvalues

1.4 PCA “by hand”

1.4.1 Via an eigen decomposition (ofC orR)

The eigenvectorsv1, ..., vq (q≤rwithrthe rank ofZ) of the correlation matrixR= n1ZtZ give the principal components inF with:

F=ZV where V is the matrix of theqeigenvectors in column.

R <- t(as.matrix(Z))%*%as.matrix(Z)/7 e <- eigen(R)

names(e) dim(e$vectors)

#matrix with the q=3 first eigenvectors q <- 3

(4)

V <- e$vectors[,1:q]

Fbis <- as.matrix(Z)%*%V head(Fbis)

head(F[,1:q]) #with the PCA function

The values λ1, ..., λq are the variance of the principal components (columns of $F).

e$values[1:q]

res$eig[1:q,1] #output of the PCA function

1.4.2 Via an SVD decomposition with metric

In PCA the metric inRn is the diagonal matrix of the weights of the rows (here 1/n) and the metric is inRp is the identity matrix.

#first compute Z with the non corrected standard deviation s <- apply(Y,2,sd)*sqrt((n-1)/n)

Z <- sweep(Y,2,s,"/") e_svd <- svd(Z) names(e_svd)

(e_svd$d^2)[1:q]/n #division by n because of the metric res$eig[1:q,1] #output of the PCA function

U <- e_svd$u[,1:q] #matrix of the q first left singular vectors Fbisbis <- U%*%diag(e_svd$d[1:q]) #principal components

head(Fbisbis)

head(F[,1:q]) #output of the PCA function

1.5 Interpretation of the results

1.5.1 Inertia (variance) of the principal components.

res$eig

barplot(res$eig[,1])

1.5.2 Projection plan of the observations

The 3 mineral waters best projected.

res$ind$cos2[,1:3]

plot(res,choix="ind",cex=1.5,title="",select="cos2 3")

The 3 mineral waters with highest contributions.

res$ind$contrib[,1:3]

plot(res,choix="ind",cex=1.5,title="",select="contrib 3")

1.5.3 Projection plan of of the variables

Correlation circle res$var$coord[,1:2]

plot(res,choix="var",title="",cex=1)

(5)

Interpretation of the projection plan of the observations according to the correlation circle.

plot(res,choix="ind",title="",cex=1)

2 Size effect

Example of fish contaminated with mercury load("../data/poissons.rda")

datalog <- data

datalog[,5:10] <- log(data[,5:10])

res <- PCA(datalog[,-1],quanti.sup=2:3,quali.sup=1,graph=FALSE) plot(res,choix="var",title="")

plot(res,choix="ind",invisible = "quali",label = "none",habillage=1,title ="")

Références

Documents relatifs

El Soufi, Sublaplacian eigenvalue functionals and contact structure deformations on compact CR manifolds, in preparation..

The problems as well as the objective of this realized work are divided in three parts: Extractions of the parameters of wavelets from the ultrasonic echo of the detected defect -

The problems as well as the objective of this realized work are divided in three parts: Extractions of the parameters of wavelets from the ultrasonic echo of the detected defect -

Using only the low-redshift background probes, and not imposing the SNIa intrinsic luminosity to be redshift independent (accounting for several luminosity evolution models), we

In this paper we focus on the problem of high-dimensional matrix estimation from noisy observations with unknown variance of the noise.. Our main interest is the high

The correlation matrix after the relative normalization when Exp.1 is the reference, on the left, and when Exp.2 is the reference, on the right..

The correlation matrix after the relative normalization when Exp.1 is the reference, on the left, and when Exp.2 is the reference, on the right.. Julien-Laferrière et al.: EPJ

After classifying the effective temperature into three parts including the daily variation, the season variation, and the high-frequency noise, the correlation analysis is