
(1)

Machine Learning - TP

Nathalie Villa-Vialaneix - nathalie.villa@univ-paris1.fr - http://www.nathalievilla.org

IUT STID (Carcassonne) & SAMM (Université Paris 1)

INRA training course (Formation INRA), Level 3

(2)

Packages in R

R is provided with basic functions, but more than 3,000 packages are available on the CRAN (Comprehensive R Archive Network) for additional functions (see also the Bioconductor project).

Installing new packages (has to be done only once)

with the command line:

install.packages(c("nnet", "e1071", "rpart", "car", "randomForest"))

with the menu (Windows or Mac OS X)

Loading a package (has to be done each time R is restarted):

library(nnet)
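As a convenience, a small sketch (an addition, not from the original slides) that installs only the packages that are missing before loading them all:

# Sketch: install only missing packages, then load them all
pkgs <- c("nnet", "e1071", "rpart", "car", "randomForest")
missing <- pkgs[!pkgs %in% installed.packages()[, "Package"]]
if (length(missing) > 0) install.packages(missing)
for (p in pkgs) library(p, character.only=TRUE)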

(3)

Working directory

All files, in proper directories/subdirectories, can be downloaded at:

http://www.nathalievilla.org/docs/formation_inra, either as individual files or as a full zip file, inra-package.zip.

The companion script ML-scriptR.R is made to be run from the subdirectory ML/TP of the provided material. For the computers in the classroom, this can be set with the following command line:

setwd("/home/fp/Bureau/inra-package/ML/TP")

If you are using Windows or Mac OS X, you can also choose to do it from the menu (in Fichier / Définir le répertoire de travail).

In any case, you must adapt the working directory if you want to run the script on another computer!
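A possible safeguard (added here, not in the original script): check from R that the data files are reachable before running the rest of the script, so a wrong working directory fails early and explicitly.

# Sketch: fail early if the working directory is not set correctly
if (!file.exists("../../Data/sel_data.csv")) {
  stop("Set the working directory to .../inra-package/ML/TP with setwd()")
}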

(4)

Outline

1 Introduction: Data importation and exploration
2 Neural networks
3 CART
4 Random forest

(5)

Introduction: Data importation and exploration

Outline

1 Introduction: Data importation and exploration
2 Neural networks
3 CART
4 Random forest

(6)

Use case description

Data kindly provided by Laurence Liaubet, described in [Liaubet et al., 2011]:

microarray data: expression of 272 selected genes over 57 individuals (pigs);

a phenotype of interest (muscle pH) measured over the same 57 individuals (numerical variable).

file 1: gene expressions
file 2: muscle pH

(7)

Introduction: Data importation and exploration

Load data

# Loading gene expressions
d <- read.table("../../Data/sel_data.csv", sep=";", header=T, row.names=1, dec=",")
dim(d)
names(d)
summary(d)

# Loading pH
pH <- scan("../../Data/sel_ph.csv", dec=",")
length(pH)
summary(pH)

(8)

Basic analysis

# Data distribution
layout(matrix(c(1,1,2), ncol=3))
boxplot(d, main="genes expressions")
boxplot(pH, main="pH")

(9)

Introduction: Data importation and exploration

Correlation analysis

# Correlation analysis
heatmap(as.matrix(d))

(10)

Correlation with pH

# Correlation with pH
pH.cor <- function(x) { cor(x, pH) }
totalcor <- apply(d, 1, pH.cor)
hist(totalcor, main="Correlations between genes expression\n and pH", xlab="Correlations")
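As an aside (not in the original script), the helper function is not strictly needed: cor() accepts a matrix, and since d stores genes in rows, transposing it puts one gene per column, giving the same vector of correlations.

# Equivalent vectorized computation of the gene/pH correlations
totalcor2 <- as.vector(cor(t(d), pH))
all.equal(as.vector(totalcor), totalcor2)  # should be TRUE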

(11)

Introduction: Data importation and exploration

Classification and regression tasks

1 Regression: predict pH (a numerical variable) from gene expressions. Useful to help verify the strength of the relation between gene expressions and pH, and to help understand the nature of this relation.

2 Classification: predict whether the pH is smaller or greater than 5.7 from gene expressions (toy example).

# Classes definition
pH.classes <- rep(0, length(pH))
pH.classes[pH > 5.7] <- 1
table(pH.classes)
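Equivalently (a one-line variant added for reference), the 0/1 coding can be derived directly from the logical comparison:

# Same 0/1 coding in one line
pH.classes2 <- as.integer(pH > 5.7)
table(pH.classes2)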

(12)

Train/Test split: regression framework

# Initialization and training sample selection
set.seed(16011357)
training <- sample(1:ncol(d), round(0.8*ncol(d)), replace=F)

# Matrices definition
d.train <- t(d[, training])
pH.train <- pH[training]
d.test <- t(d[, -training])
pH.test <- pH[-training]

# Data frames definition
r.train <- cbind(d.train, pH.train)
r.test <- cbind(d.test, pH.test)
colnames(r.train) <- c(colnames(d.train), "pH")
colnames(r.test) <- c(colnames(d.train), "pH")
r.train <- data.frame(r.train)
r.test <- data.frame(r.test)

(13)

Introduction: Data importation and exploration

Train/Test split: classification framework

# Vectors definition
pHc.train <- pH.classes[training]
pHc.test <- pH.classes[-training]

# Data frames definition
c.train <- cbind(d.train, pHc.train)
c.test <- cbind(d.test, pHc.test)
colnames(c.train) <- c(colnames(d.train), "pHc")
colnames(c.test) <- c(colnames(d.train), "pHc")
c.train <- data.frame(c.train)
c.test <- data.frame(c.test)

# Transforming pHc into a factor
c.train$pHc <- factor(c.train$pHc)
c.test$pHc <- factor(c.test$pHc)

(14)

Training / Test sets (short) analysis

layout(matrix(c(1,1,2,3,3,4), ncol=3, nrow=2, byrow=T))
boxplot(d.train, main="genes expressions (train)")
boxplot(pH.train, main="pH (train)")
boxplot(d.test, main="genes expressions (test)")
boxplot(pH.test, main="pH (test)")
table(pHc.train)
table(pHc.test)

(15)

Neural networks

Outline

1 Introduction: Data importation and exploration
2 Neural networks
3 CART
4 Random forest

(16)

Loading the data (to be consistent) and nnet library

# Loading the data
load("../../Data/train-test.Rdata")

MLPs are unusable with a large number of predictors (here: 272 predictors for only 45 observations in the training set) ⇒ selection of a relevant subset of variables (with LASSO):

# Loading the subset of predictors
load("../../Data/selected.Rdata")

MLPs are provided in the nnet package:

# Loading the nnet library
library(nnet)
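The slides load a precomputed index vector, selected; the LASSO step itself is not shown. Purely as an illustration of how such a subset could be built, here is a hedged sketch using the glmnet package (an assumption: the original material may have used a different tool, and the selected subset would not necessarily coincide):

# Hypothetical sketch of a LASSO-based variable selection (not the authors' code)
library(glmnet)
set.seed(1)
cv.lasso <- cv.glmnet(as.matrix(d.train), pH.train, alpha=1)  # alpha=1: LASSO penalty
coefs <- coef(cv.lasso, s="lambda.min")
selected.sketch <- which(coefs[-1, 1] != 0)  # non-zero coefficients, intercept dropped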

(17)

Neural networks

Simple use: MLP in the regression framework

Training

# Simple use: MLP with p=3 and no decay
set.seed(17011644)
nn1 <- nnet(d.train[, selected], pH.train, size=3, decay=0, maxit=500, linout=T)

Analysis

print(nn1)
summary(nn1)

# Training error and pseudo-R2
mean((pH.train - nn1$fitted)^2)
1 - mean((pH.train - nn1$fitted)^2)/var(pH.train)

# Predictions (test set)
pred.test <- predict(nn1, d.test[, selected])

# Test error and pseudo-R2
mean((pH.test - pred.test)^2)
1 - mean((pH.test - pred.test)^2)/var(pH.train)

(18)

Predictions vs observations

plot(pH.train, nn1$fitted, xlab="Observations", ylab="Fitted values", main="", pch=3, col="blue")
points(pH.test, pred.test, pch=19, col="darkred")
legend("bottomright", pch=c(3,19), col=c("blue","darkred"), legend=c("Train","Test"))
abline(0, 1, col="darkgreen")

(19)

Neural networks

Check results variability

set.seed(17012120)
nn2 <- nnet(d.train[, selected], pH.train, size=3, decay=0, maxit=500, linout=T)

...

Summary

      Train   Test
nn1   30.8%   44.9%
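The "..." above elides the repeated analysis for nn2. A minimal sketch of how this variability could be checked more systematically (an addition, not original code): re-train the same MLP over several seeds and collect the training pseudo-R2.

# Sketch: training pseudo-R2 of the same MLP over several arbitrary seeds
seeds <- c(1, 2, 3, 4, 5)
r2 <- sapply(seeds, function(s) {
  set.seed(s)
  nn <- nnet(d.train[, selected], pH.train, size=3, decay=0,
             maxit=500, linout=TRUE, trace=FALSE)
  1 - mean((pH.train - nn$fitted.values)^2)/var(pH.train)
})
summary(r2)  # the spread reflects sensitivity to the random initialization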

(20)

Simple use: MLP in the classification framework

Training

# Simple use: MLP with p=3 and no decay
set.seed(17011716)
nnc1 <- nnet(pHc~., data=c.train, size=3, decay=0, maxit=500, linout=F)

Analysis

print(nnc1)
summary(nnc1)

# Predictions
nnc1$fitted

# Recoding
pred.train <- rep(0, length(nnc1$fitted))
pred.train[nnc1$fitted > 0.5] <- 1

# Training error
table(pred.train, c.train$pHc)
sum(pred.train != c.train$pHc)/length(pred.train)

(21)

Neural networks

Test error

# Predictions and recoding
raw.pred.test <- predict(nnc1, c.test)
pred.test <- rep(0, length(raw.pred.test))
pred.test[raw.pred.test > 0.5] <- 1

# Test error
table(pred.test, c.test$pHc)
sum(pred.test != c.test$pHc)/length(pred.test)

Summary

Train:
            Observations
Predictions    0    1
          0   22    1
          1    0   22

Test:
            Observations
Predictions    0    1
          0    4    3
          1    2    2

Overall misclassification rate: 2.22% (train), 45.45% (test).

(22)

Tuning MLP with the e1071 package

Search for the best parameters with 10-fold CV:

library(e1071)
set.seed(1643)
t.nnc2 <- tune.nnet(pHc~., data=c.train[, c(selected, ncol(c.train))],
    size=c(2,5,10), decay=10^(-c(10,8,6,4,2)), maxit=500,
    linout=F, tunecontrol=tune.control(nrepeat=5, sampling="cross", cross=10))

Basic analysis of the output

# Looking for the best parameters
plot(t.nnc2)

(23)

Neural networks

Best parameters?

summary(t.nnc2)
t.nnc2$best.parameters

# Selecting the best MLP
nnc2 <- t.nnc2$best.model

(24)

Results analysis

# Training error
pred.train <- rep(0, length(nnc2$fitted))
pred.train[nnc2$fitted > 0.5] <- 1
table(pred.train, c.train$pHc)
sum(pred.train != c.train$pHc)/length(pred.train)

# Predictions and test error
raw.pred.test <- predict(nnc2, c.test)
pred.test <- rep(0, length(raw.pred.test))
pred.test[raw.pred.test > 0.5] <- 1
table(pred.test, c.test$pHc)
sum(pred.test != c.test$pHc)/length(pred.test)

Summary

        Train   Test
nnc1    2.22%   45.45%
nnc2    0%      22.27%

(25)

CART

Outline

1 Introduction: Data importation and exploration
2 Neural networks
3 CART
4 Random forest

(26)

Regression tree training

Training with the rpart package:

library(rpart)

# Regression tree
tree1 <- rpart(pH~., data=r.train, control=rpart.control(minsplit=2))

Basic analysis of the output:

print(tree1)
summary(tree1)
plot(tree1)
text(tree1)
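The default plot()/text() rendering of an rpart tree is rather terse. As an optional alternative (not used in the original material, and assuming the rpart.plot package is installed), a more readable display:

# Optional: nicer tree display with the rpart.plot package
library(rpart.plot)
rpart.plot(tree1)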

(27)

CART

Resulting tree

tree1$where  # leaf number

(28)

Performance analysis

# Training predictions and error
pred.train <- tree1$frame$yval[tree1$where]
mean((pred.train - r.train$pH)^2)
1 - mean((pred.train - r.train$pH)^2)/var(pH.train)

# Test predictions and error
pred.test <- predict(tree1, r.test)
mean((pred.test - r.test$pH)^2)
1 - mean((pred.test - r.test$pH)^2)/var(pH.train)

# Fitted values vs true values
plot(pH.train, pred.train, xlab="Observations", ylab="Fitted values", main="", pch=3, col="blue")
points(pH.test, pred.test, pch=19, col="darkred")
legend("bottomright", pch=c(3,19), col=c("blue","darkred"), legend=c("Train","Test"))
abline(0, 1, col="darkgreen")

(29)

CART

Summary

Numerical performance

        Train   Test
tree1   96.6%   11.6%

(30)

Classification tree training and tuning

Training with the rpart package and tuning with the e1071 package

# Random seed initialization
set.seed(20011108)

# Tuning the minimal number of observations in a node
# (minsplit; default value is 20)
t.treec1 <- tune.rpart(pHc~., data=c.train, minsplit=c(4,6,8,10),
    tunecontrol=tune.control(sampling="bootstrap", nboot=20))

Basic analysis of the output:

plot(t.treec1)
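Besides tuning minsplit, the classical way to control an rpart tree is cost-complexity pruning via the cp parameter, using the cross-validated error table rpart computes during training. A sketch (an alternative not shown in the slides, illustrated on the regression tree tree1 from above):

# Sketch: cost-complexity pruning as an alternative tuning strategy
printcp(tree1)  # cross-validated error (xerror) for each cp value
best.cp <- tree1$cptable[which.min(tree1$cptable[, "xerror"]), "CP"]
tree1.pruned <- prune(tree1, cp=best.cp)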

(31)

CART

Tuning results

t.treec1$best.parameters

# Selecting the best tree
treec1 <- t.treec1$best.model

(32)

Basic analysis of the best tree

summary(treec1)
plot(treec1)
text(treec1)

treec1$where  # leaf number
treec1$frame  # nodes features

(33)

CART

Predictions and errors

Training set

# Make the prediction
pred.train <- predict(treec1, c.train)

# Find out which class is predicted
pred.train <- apply(pred.train, 1, which.max)
library(car)  # to have the "recode" function
pred.train <- recode(pred.train, "2=1;1=0")

# Calculate misclassification error
table(pred.train, pHc.train)
sum(pred.train != pHc.train)/length(pred.train)

Test set

pred.test <- predict(treec1, c.test)
pred.test <- apply(pred.test, 1, which.max)
pred.test <- recode(pred.test, "2=1;1=0")
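The which.max/recode detour can be avoided: predict() on an rpart classification tree returns the predicted class directly with type="class" (a simpler equivalent, added here for reference):

# Equivalent, simpler prediction of the class labels
pred.train2 <- predict(treec1, c.train, type="class")
pred.test2 <- predict(treec1, c.test, type="class")
table(pred.test2, pHc.test)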

(34)

Classification performance summary

Overall misclassification rates

        Train   Test
nnc1    2.22%   45.45%
nnc2    0%      22.27%
treec1  0%      45.45%

(35)

Random forest

Outline

1 Introduction: Data importation and exploration
2 Neural networks
3 CART
4 Random forest

(36)

Random forest training (regression framework)

Training with the randomForest package:

library(randomForest)
set.seed(21011144)
rf1 <- randomForest(d.train, pH.train, importance=T, keep.forest=T)

Basic analysis of the output:

plot(rf1)
rf1$ntree
rf1$mtry
rf1$importance
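The randomForest package also ships a ready-made display of both importance measures (an optional shortcut, not in the original code):

# Built-in importance plot (%IncMSE and IncNodePurity side by side)
varImpPlot(rf1)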

(37)

Random forest

Predictions and errors

# Training set (oob error and training error)
mean((pH.train - rf1$predicted)^2)
rf1$mse
1 - mean((pH.train - rf1$predicted)^2)/var(pH.train)
1 - mean((pH.train - predict(rf1, d.train))^2)/var(pH.train)

# Test set
pred.test <- predict(rf1, d.test)
mean((pH.test - pred.test)^2)
1 - mean((pH.test - pred.test)^2)/var(pH.train)

MSE

      Train   Test
nn1   30.8%   44.9%
nn2   99.8%   -34%

(38)

Predictions vs observations

plot(pH.train, rf1$predicted, xlab="Observations", ylab="Fitted values", main="", pch=3, col="blue")
points(pH.test, pred.test, pch=19, col="darkred")
legend("bottomright", pch=c(3,19), col=c("blue","darkred"), legend=c("Train","Test"))
abline(0, 1, col="darkgreen")

(39)

Random forest

Importance analysis

layout(matrix(c(1,2), ncol=2))
barplot(t(rf1$importance[, 1]), xlab="variables", ylab="% Inc. MSE",
    col="darkred", las=2, names=rep(NA, nrow(rf1$importance)))
barplot(t(rf1$importance[, 2]), xlab="variables", ylab="Inc. Node Purity",
    col="darkred", las=2, names=rep(NA, nrow(rf1$importance)))
which(rf1$importance[, 1] > 0.0005)

(40)

Random forest training (classification framework)

Training with the randomForest package (advanced features):

set.seed(20011203)
rfc1 <- randomForest(d.train, factor(pHc.train), ntree=5000, mtry=150,
    sampsize=40, nodesize=2, xtest=d.test, ytest=factor(pHc.test),
    importance=T, keep.forest=F)

Basic analysis of the output:

plot(rfc1)
rfc1$importance

(41)

Random forest

Predictions and errors

# Training set (oob error and training error)
table(rfc1$predicted, pHc.train)
sum(rfc1$predicted != pHc.train)/length(pHc.train)

# Test set
table(rfc1$test$predicted, pHc.test)
sum(rfc1$test$predicted != pHc.test)/length(pHc.test)

Overall misclassification rates

        Train   Test
nnc1    2.22%   45.45%
nnc2    0%      22.27%
treec1  0%      45.45%
rfc1    33.33%  36.36%

(42)

Importance analysis

# Importance analysis
layout(matrix(c(1,2), ncol=2))
barplot(t(rfc1$importance[, 3]), xlab="variables", ylab="Mean Dec. Accu.",
    col="darkred", las=2, names=rep(NA, nrow(rfc1$importance)))
barplot(t(rfc1$importance[, 4]), xlab="variables", ylab="Mean Dec. Gini",
    col="darkred", las=2, names=rep(NA, nrow(rfc1$importance)))
which(rfc1$importance[, 3] > 0.002)

(43)

References

Liaubet, L., Lobjois, V., Faraut, T., Tircazes, A., Benne, F., Iannuccelli, N., Pires, J., Glénisson, J., Robic, A., Le Roy, P., SanCristobal, M., and Cherel, P. (2011). Genetic variability of transcript abundance in pig peri-mortem skeletal muscle: eQTL localized genes involved in stress response, cell death, muscle disorders and metabolism. BMC Genomics, 12:548.
