
(1)

Some Sparse Methods for High Dimensional Data

Gilbert Saporta
CEDRIC, Conservatoire National des Arts et Métiers, Paris
gilbert.saporta@cnam.fr

(2)

• A joint work with:

– Anne Bernard (QFAB Bioinformatics)
– Stéphanie Bougeard (ANSES)
– Ndeye Niang (CNAM)
– Cristian Preda (Lille)

(3)

Outline

1. Introduction

2. Sparse regression

3. Sparse PCA

4. Group sparse PCA and sparse MCA

5. Clusterwise sparse PLS regression

6. Conclusion and perspectives

(4)

1. Introduction

• High dimensional data: p >> n

– Gene expression data
– Chemometrics
– etc.

• Several solutions for regression problems with all variables (PLS, ridge, PCR), but interpretation is difficult

• Sparse methods: provide combinations of only a few variables

(5)

• This talk:

– A survey of sparse methods for supervised (regression) and unsupervised (PCA) problems

– New proposals for large n and large p when unobserved heterogeneity occurs among observations: clusterwise sparse PLS regression

(6)

2. Sparse regression

• Keeping all predictors is a drawback for high dimensional data: combinations of too many variables cannot be interpreted

• Sparse methods simultaneously shrink coefficients and select variables, hence better predictions

(7)

2.1 Lasso and elastic-net

The Lasso (Tibshirani, 1996) is a shrinkage and

selection method. It minimizes the usual sum of squared errors, with a bound on the sum of the absolute values of the coefficients (L1 penalty).

\[
\hat{\boldsymbol\beta}_{\text{lasso}} = \arg\min_{\boldsymbol\beta}\; \|\mathbf{y} - \mathbf{X}\boldsymbol\beta\|^2 + \lambda \sum_{j=1}^{p}|\beta_j|
\quad\Longleftrightarrow\quad
\min_{\boldsymbol\beta}\; \|\mathbf{y} - \mathbf{X}\boldsymbol\beta\|^2 \ \text{ with } \ \sum_{j=1}^{p}|\beta_j| \le c
\]
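Not on the original slides: a minimal illustrative sketch with scikit-learn's Lasso on simulated data, showing how the L1 penalty drives many coefficients exactly to zero (the data and the value of alpha are arbitrary placeholders).

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 50, 200                                  # high dimensional: p >> n
X = rng.standard_normal((n, p))
beta_true = np.zeros(p); beta_true[:5] = 2.0    # only 5 truly active variables
y = X @ beta_true + 0.5 * rng.standard_normal(n)

lasso = Lasso(alpha=0.1).fit(X, y)              # alpha plays the role of the L1 penalty
print("nonzero coefficients:", np.sum(lasso.coef_ != 0))
```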

(8)

• Looks similar to ridge regression (L2 penalty):

\[
\hat{\boldsymbol\beta}_{\text{ridge}} = \arg\min_{\boldsymbol\beta}\; \|\mathbf{y}-\mathbf{X}\boldsymbol\beta\|_2^2 + \lambda\|\boldsymbol\beta\|_2^2
\quad\Longleftrightarrow\quad
\min_{\boldsymbol\beta}\; \|\mathbf{y}-\mathbf{X}\boldsymbol\beta\|_2^2 \ \text{ with } \ \|\boldsymbol\beta\|_2^2 \le c
\]

• Lasso continuously shrinks the coefficients towards zero as c decreases

• Convex optimisation; no explicit solution

• If \( c \ge \sum_{j=1}^{p} |\hat{b}_j^{\,\text{ols}}| \), the Lasso solution is identical to OLS

(9)

• Constraints and log-priors

– Like ridge regression, the Lasso can be seen as a Bayesian regression, but with a Laplace (double-exponential) prior

– The penalty \(\lambda|\beta_j|\) is proportional to the negative log-prior:

\[
f(\beta_j) = \frac{\lambda}{2}\exp\left(-\lambda|\beta_j|\right)
\]
(10)

• Finding the optimal parameter

– Cross-validation if optimal prediction is needed
– BIC when sparsity is the main concern; a good unbiased estimate of df(λ) is the number of nonzero coefficients (Zou et al., 2007)

\[
\lambda_{\text{opt}} = \arg\min_{\lambda}\;\left\{ \frac{\|\mathbf{y}-\mathbf{X}\hat{\boldsymbol\beta}(\lambda)\|^2}{n\hat{\sigma}^2} + \frac{\log(n)}{n}\, df(\lambda) \right\}
\]
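As a sketch of this selection rule (not from the slides), scikit-learn's LassoLarsIC picks the penalty along the Lasso/LARS path using an information criterion; its internal estimates of df and of the noise variance are the library's own, so the result only approximates the formula above.

```python
from sklearn.linear_model import LassoLarsIC

# BIC-based choice of the penalty; df is taken as the number of nonzero coefficients
model = LassoLarsIC(criterion="bic").fit(X, y)   # X, y as in the previous sketch
print("selected penalty:", model.alpha_)
print("nonzero coefficients:", (model.coef_ != 0).sum())
```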

(11)

• A few drawbacks

– Lasso produces a sparse model, but the number of selected variables cannot exceed the number of units n

Ill-suited to DNA arrays: n (arrays) << p (genes)

– Selects only one variable among a set of correlated predictors

(12)

Elastic net

• Combines ridge and lasso penalties to select more predictors than the number of observations (Zou & Hastie, 2005)

– The L1 part leads to a sparse model

– The L2 part removes the limitation on the number of retained predictors and favours grouped selection

\[
\hat{\boldsymbol\beta}_{\text{en}} = \arg\min_{\boldsymbol\beta}\; \|\mathbf{y}-\mathbf{X}\boldsymbol\beta\|^2 + \lambda_2\|\boldsymbol\beta\|_2^2 + \lambda_1\|\boldsymbol\beta\|_1
\]
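For illustration (not from the slides), the same simulated data fitted with scikit-learn's ElasticNet; alpha and l1_ratio jointly play the role of (λ1, λ2) up to the library's parametrization.

```python
from sklearn.linear_model import ElasticNet

# l1_ratio balances the L1 (sparsity) and L2 (grouping) parts of the penalty
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("nonzero coefficients:", (enet.coef_ != 0).sum())   # may exceed n, unlike the Lasso
```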

(13)

2.2 Group-lasso

• X matrix divided into J sub-matrices Xj of pj variables

• Group Lasso: extension of the Lasso for selecting groups of variables (Yuan & Lin, 2007):

\[
\hat{\boldsymbol\beta}_{GL} = \arg\min_{\boldsymbol\beta}\; \left\|\mathbf{y} - \sum_{j=1}^{J}\mathbf{X}_j\boldsymbol\beta_j\right\|^2 + \lambda \sum_{j=1}^{J}\sqrt{p_j}\,\|\boldsymbol\beta_j\|_2
\]

If pj=1 for all j, group Lasso = Lasso
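The slides do not show a solver; the following is a minimal proximal-gradient (block soft-thresholding) sketch of the group-Lasso criterion above, written in plain NumPy and kept deliberately simple (fixed step size, fixed number of iterations).

```python
import numpy as np

def group_lasso(X, y, groups, lam, n_iter=500):
    """ISTA for  min ||y - X b||^2 + lam * sum_j sqrt(p_j) * ||b_j||_2 .
    `groups` is an integer label per column of X."""
    n, p = X.shape
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)      # 1/L, L = Lipschitz constant of the gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = -2 * X.T @ (y - X @ beta)              # gradient of the squared-error term
        z = beta - step * grad
        for g in np.unique(groups):                   # block soft-thresholding, group by group
            idx = groups == g
            thr = step * lam * np.sqrt(idx.sum())
            nrm = np.linalg.norm(z[idx])
            beta[idx] = 0.0 if nrm <= thr else (1 - thr / nrm) * z[idx]
    return beta
```

With pj = 1 for every group, the block soft-thresholding reduces to coordinate-wise soft-thresholding, i.e. the ordinary Lasso.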

(14)

• Drawback: no sparsity within groups

A solution: sparse group lasso

(Simon et al. , 2012)

– Two tuning parameters: grid search

\[
\min_{\boldsymbol\beta}\; \left\|\mathbf{y} - \sum_{j=1}^{J}\mathbf{X}_j\boldsymbol\beta_j\right\|^2 + \lambda_1 \sum_{j=1}^{J}\|\boldsymbol\beta_j\|_2 + \lambda_2 \sum_{j=1}^{J}\sum_{i=1}^{p_j}|\beta_{ij}|
\]

(15)

2.3 Sparse PLS

Several solutions:

Le Cao et al. (2008)

Chun & Keles (2010)

(16)

3. Sparse PCA

In PCA, each PC is a linear combination of all the original variables: this makes the results difficult to interpret

Challenge of SPCA: obtain easily interpretable components (many zero loadings in the principal factors)

Principle of SPCA: modify PCA by imposing lasso/elastic-net constraints to construct modified PCs with sparse loadings

Warning: sparse PCA does not provide a global selection of variables but a selection dimension by dimension: different from the regression context (Lasso, Elastic Net, …)

(17)

3.1 First attempts:

Simple PCA

– by Vines (2000) : integer loadings

– Rousson, V. and Gasser, T. (2004) : loadings (+ , 0, -)

SCoTLASS (Simplified Component Technique – LASSO) by Jolliffe et al. (2003): extra L1 constraint

\[
\max\; \mathbf{u}'\mathbf{V}\mathbf{u} \quad \text{with } \mathbf{u}'\mathbf{u} = 1 \ \text{ and } \ \sum_{j=1}^{p}|u_j| \le t
\]

(18)

SCoTLASS properties:

• Non-convex problem

• t ≥ √p : usual PCA
• t < 1 : no solution
• t = 1 : only one nonzero coefficient

(19)

3.2 S-PCA R-SVD (Zou et al., 2006)

Let the SVD of X be X = UDV', with Z = UD the principal components.

PCA can be written as a regression-type optimization problem: the loadings can be recovered by regressing the PCs on the p variables (ridge regression):

\[
\hat{\boldsymbol\beta}_{\text{ridge}} = \arg\min_{\boldsymbol\beta}\; \|\mathbf{Z}_i - \mathbf{X}\boldsymbol\beta\|^2 + \lambda \|\boldsymbol\beta\|^2
\]

Since \(\mathbf{X}'\mathbf{X} = \mathbf{V}\mathbf{D}^2\mathbf{V}'\) and \(\mathbf{V}'\mathbf{V} = \mathbf{I}\),

\[
\hat{\boldsymbol\beta}_{i,\text{ridge}} = (\mathbf{X}'\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}'\mathbf{X}\,\mathbf{v}_i
= \frac{d_{ii}^2}{d_{ii}^2 + \lambda}\,\mathbf{v}_i
\qquad\Rightarrow\qquad
\frac{\hat{\boldsymbol\beta}_{i,\text{ridge}}}{\|\hat{\boldsymbol\beta}_{i,\text{ridge}}\|} = \mathbf{v}_i
\]
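A small NumPy check of this identity on simulated data (not from the slides): ridge-regressing a principal component on X and normalizing recovers the corresponding loading vector up to sign.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
X -= X.mean(axis=0)                          # centered data matrix

U, d, Vt = np.linalg.svd(X, full_matrices=False)
Z = U * d                                    # principal components Z = UD

lam, i = 1.0, 0
beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Z[:, i])
v_hat = beta / np.linalg.norm(beta)
print(np.allclose(np.abs(v_hat), np.abs(Vt[i])))   # True: i-th loading recovered up to sign
```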

(20)

Sparse PCA adds an L1 (Lasso) penalty to produce sparse loadings:

\[
\hat{\boldsymbol\beta} = \arg\min_{\boldsymbol\beta}\; \|\mathbf{Z}_i - \mathbf{X}\boldsymbol\beta\|^2 + \lambda\|\boldsymbol\beta\|^2 + \lambda_1\|\boldsymbol\beta\|_1
\]

\(\hat{\mathbf{v}}_i = \hat{\boldsymbol\beta}/\|\hat{\boldsymbol\beta}\|\) is an approximation to \(\mathbf{v}_i\), and \(\mathbf{X}\hat{\mathbf{v}}_i\) is the i-th approximated component.

Produces sparse loadings with zero coefficients, which facilitates interpretation.

Alternating algorithm between elastic net and SVD.
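For illustration (not from the slides), scikit-learn's SparsePCA fits a related L1-penalized matrix-factorization formulation rather than the exact Zou et al. elastic-net/SVD algorithm, but it shows the same effect: many loadings become exactly zero.

```python
from sklearn.decomposition import PCA, SparsePCA

pca = PCA(n_components=2).fit(X)                      # X: the centered matrix from the sketch above
spca = SparsePCA(n_components=2, alpha=1.0).fit(X)    # alpha controls the L1 penalty on loadings

print("zero loadings, PCA:", (pca.components_ == 0).sum())          # typically none
print("zero loadings, sparse PCA:", (spca.components_ == 0).sum())  # typically many
```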

(21)

• Loss of orthogonality

– SCoTLASS: orthogonal loadings but correlated components

– S-PCA: neither the loadings nor the components are orthogonal

– Necessity of adjusting the % of explained variance

(22)

4. Group sparse PCA and sparse MCA

(Bernard, Guinot, Saporta, 2012)

• Industrial context and motivation:

– Funded by the R&D department of the Chanel cosmetics company

– Relate gene expression data to skin aging measures

– n = 500, p = 800 000 SNPs, 15 000 genes

(23)

4.1 Group Sparse PCA

Data matrix X divided into J groups Xj of pj variables

Group Sparse PCA: a compromise between SPCA and the group Lasso

Goal: select groups of continuous variables (assign zero coefficients to entire blocks of variables)

Principle: replace the penalty function in the SPCA algorithm

\[
\hat{\boldsymbol\beta} = \arg\min_{\boldsymbol\beta}\; \|\mathbf{Z} - \mathbf{X}\boldsymbol\beta\|^2 + \lambda\|\boldsymbol\beta\|^2 + \lambda_1\|\boldsymbol\beta\|_1
\]

by the one used in the group Lasso:

\[
\hat{\boldsymbol\beta}_{GL} = \arg\min_{\boldsymbol\beta}\; \left\|\mathbf{Z} - \sum_{j=1}^{J}\mathbf{X}_j\boldsymbol\beta_j\right\|^2 + \lambda \sum_{j=1}^{J}\sqrt{p_j}\,\|\boldsymbol\beta_j\|_2
\]

(24)

4.2 Sparse MCA

In MCA, selecting one column of the original table (a categorical variable Xj) amounts to selecting the corresponding block of pj indicator variables in the complete disjunctive table.

[Illustration: the categorical variable XJ of the original table corresponds to the block of indicator columns XJ1, …, XJpJ of the complete disjunctive table.]

Challenge of Sparse MCA: select categorical variables, not categories.

Principle: a straightforward extension of Group Sparse PCA to groups of indicator variables, with the chi-square metric. Uses the s-PCA r-SVD algorithm.
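To make the indicator coding concrete, a small pandas sketch (not from the slides): one categorical SNP column of the original table expands into a block of indicator columns of the complete disjunctive table, and Sparse MCA keeps or drops such blocks as a whole.

```python
import pandas as pd

snp = pd.DataFrame({"SNP1": ["AA", "AG", "GG", "AG"],
                    "SNP2": ["CC", "CT", "CC", "TT"]})   # original table: 2 categorical variables
K = pd.get_dummies(snp)                                  # complete disjunctive table
print(K.columns.tolist())   # SNP1_AA, SNP1_AG, SNP1_GG | SNP2_CC, SNP2_CT, SNP2_TT: two blocks
```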

(25)

Single Nucleotide Polymorphisms

Application on genetic data

Data:

– n = 502 individuals
– p = 537 SNPs (out of more than 800 000 in the original database, 15 000 genes)
– q = 1554 (total number of columns)
– X: 502 × 537 matrix of qualitative variables
– K: 502 × 1554 complete disjunctive table, K = (K1, …, K1554)
– 1 block = 1 SNP = 1 Kj matrix

(26)

Single Nucleotide Polymorphisms

Application on genetic data

(27)

Single Nucleotide Polymorphisms

Application on genetic data

λ = 0.005: CPEV = 0.32% and 174 columns selected on Comp 1

(28)

Comparison of the loadings

Application on genetic data


(29)

Single Nucleotide Polymorphisms

Application on genetic data

Comparison between MCA and Sparse MCA on the first factorial plane

(30)

Properties (MCA vs. Sparse MCA):

– Uncorrelated components: MCA TRUE, Sparse MCA FALSE
– Orthogonal loadings: MCA TRUE, Sparse MCA FALSE
– Barycentric property: MCA TRUE, Sparse MCA partly true
– % of inertia: MCA \(100\,\lambda_j/\lambda_{\text{tot}}\); Sparse MCA \(100\,\|\mathbf{Z}_{j\cdot 1,\ldots,j-1}\|^2/\lambda_{\text{tot}}\)
– Total inertia: \(\frac{1}{p}\sum_{j=1}^{p} p_j - 1\)

where \(\mathbf{Z}_{j\cdot 1,\ldots,j-1}\) are the residuals of \(\mathbf{Z}_j\) after adjusting for \(\mathbf{Z}_1,\ldots,\mathbf{Z}_{j-1}\) (regression projection).

(31)

5. Clusterwise sparse PLS regression

Context

– High-dimensional explanatory data X (with p >> n)
– One or several variables Y to explain
– Unknown clusters of the n observations

Twofold aim

– Clustering: get the optimal clustering of observations
– Sparse regression: compute the sparse regression coefficients within each cluster to improve the prediction of Y
– Improve high-dimensional data interpretation

(32)

• Clusterwise (typological) regression

– Diday (1974), Charles (1977), Späth (1979)

– Optimizes simultaneously the partition and the local models

– p >nk our proposal : local sparse PLS regressions

 

2

1 1

min ( ) ( ' )

n K

k i k k i

i k

i y

 

 1 β x

(33)

Criterion (for a single dependent variable and G clusters)

• w: loading weights of the PLS component t = Xw
• v: loading weights, u = Yv
• c: regression coefficient of Y on t: c = (T'T)⁻¹T'y

\[
\arg\min\; \sum_{g=1}^{G}
\underbrace{\bigl\|\mathbf{X}_g'\mathbf{y}_g - \mathbf{w}_g\mathbf{v}_g'\bigr\|^2}_{\text{(cluster) sparse components}}
+
\underbrace{\bigl\|\mathbf{y}_g - \mathbf{X}_g\mathbf{w}_g c_g\bigr\|^2}_{\text{(cluster) residuals}}
\qquad \text{with } \|\mathbf{w}_g\| = 1
\]

(34)

A simulation study

• n=500, p= 100

• X is a random vector, normally distributed with mean 0 and covariance matrix S with elements s_ij = cov(X_i, X_j) = ρ^{|i−j|}, ρ ∈ [0, 1]

• The response variable Y is defined as follows: there exists a latent categorical variable G ∈ {1, …, K} such that, conditionally on G = i and X = x,

\[
Y \mid G=i,\ \mathbf{X}=\mathbf{x}\ \sim\ \mathcal{N}\bigl(\beta_{0i} + \beta_{1i}x_1 + \cdots + \beta_{pi}x_p,\ \sigma_i^2\bigr)
\]
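A NumPy sketch of this simulation design (not from the slides; the coefficient values and noise levels below are illustrative placeholders, not those used in the study).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K, rho = 500, 100, 2, 0.5

S = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))   # s_ij = rho^|i-j|
X = rng.multivariate_normal(np.zeros(p), S, size=n)

G = rng.integers(0, K, size=n)                    # latent cluster label, equiprobable
betas = rng.normal(size=(K, p))                   # illustrative cluster-specific coefficients
sigma = np.array([0.5, 1.0])                      # illustrative noise standard deviations
y = np.einsum("ij,ij->i", X, betas[G]) + sigma[G] * rng.standard_normal(n)
```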

(35)

• Signal/noise

• The n subjects are equiprobably distributed into the K clusters

• For K = 2 we have chosen

\[
\frac{\sigma_i^2}{\operatorname{Var}(Y \mid G=i)} = 0.1
\]

(36)
(37)

• A wrong number of groups

(38)

Application to real data

Data features (liver toxicity data from the mixOmics R package):

– X: 3116 genes (mRNA extracted from liver)
– y: a clinical chemistry marker for liver injury (serum enzyme level)
– Observations: 64 rats
– Each rat was subjected to one of several acetaminophen doses of varying toxicity (50, 150, 1500, 2000)
– Necropsy was performed 6, 18, 24 or 48 hours after exposure (i.e., the time when the mRNA was extracted from the liver)

Data pre-processing

y and X are centered and scaled

Twofold aim

Improve the sparse PLS link between genes (X) and liver injury marker (y)…

… while seeking G unknown clusters of the 64 rats.

(39)

Batch clusterwise algorithm (a code sketch follows below)

– For several random allocations of the n observations into G clusters:

Compute G sparse PLS models, one within each cluster (sparsity and dimension selected through 10-fold CV)

Assign each observation to the cluster with the minimum criterion value

– Iterate until convergence

– Select the best initialization (i.e., the one which minimizes the criterion) and get:

The allocation of the n observations into G clusters,

The sparse regression coefficients within each cluster.
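A schematic Python sketch of this batch algorithm (not the authors' implementation). The helpers fit_spls(X, y) and criterion(model, x, y) are hypothetical placeholders standing for "fit a sparse PLS model on one cluster" and "criterion value of one observation under a model".

```python
import numpy as np

def clusterwise_sparse_pls(X, y, G, fit_spls, criterion, n_starts=20, max_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    best = (np.inf, None, None)
    for _ in range(n_starts):                               # several random initializations
        labels = rng.integers(0, G, size=n)                 # random allocation into G clusters
        for _ in range(max_iter):
            models = [fit_spls(X[labels == g], y[labels == g]) for g in range(G)]
            new = np.array([np.argmin([criterion(m, X[i], y[i]) for m in models])
                            for i in range(n)])             # reassign each observation
            if np.array_equal(new, labels):                 # convergence: stable partition
                break
            labels = new
        total = sum(criterion(models[labels[i]], X[i], y[i]) for i in range(n))
        if total < best[0]:                                 # keep the best initialization
            best = (total, labels, models)
    return best
```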

(40)

Clusterwise sparse PLS: performance vs. standard sPLS

No cluster (N = 64):
R² = 0.968, RMSE = 0.313, sparsity parameter = 0.9, H = 3 dimensions

G = 3 clusters of rats, N = (12; 42; 10):
R² = (0.99; 0.928; 0.995)
RMSE = (0.0272; 0.0111; 0.0186)
sparsity parameter = (0.9; 0.8; 0.9)
H = (3; 3; 3)

(41)

Clusterwise sparse PLS: regression coefficients

Interpretation

– The links between genes (X) and the liver injury marker (y) are specific to each cluster

– Cluster 1: 12 selected genes; Cluster 2: 86 selected genes; Cluster 3: 15 selected genes (no cluster: 31 selected genes)

– The cluster structure is significantly associated with the acetaminophen dose (chi-square/Fisher test, p-value = 0.0057)

Cluster 1: all the doses
Cluster 2: mainly doses 50 and 150
Cluster 3: mainly doses 1500 and 2000

(42)

6. Conclusions and perspectives

• Sparse techniques provide elegant and efficient solutions to the problems posed by high-dimensional data: a new generation of data analysis methods with few restrictive hypotheses

• Sparse PCA and the group Lasso have been extended to sparse MCA; clusterwise extensions handle large p and large n

(43)

Research in progress:

– Criteria for choosing the tuning parameter λ

– Extension of Sparse MCA: a compromise between Sparse MCA and the sparse group Lasso developed by Simon et al. (2012), selecting groups and predictors within a group in order to produce sparsity at both levels

– Extension to the case where variables have a block structure

– Writing an R package

(44)

References

Bernard, A., Guinot, C., Saporta, G. (2012) Sparse principal component analysis for multiblock data and its extension to sparse multiple correspondence analysis, Compstat 2012, pp. 99-106, Limassol, Cyprus

Charles, C. (1977): Régression Typologique et Reconnaissance des Formes. Thèse de doctorat, Université Paris IX.

Chun, H. and Keles, S. (2010), Sparse partial least squares for

simultaneous dimension reduction and variable selection, Journal of the Royal Statistical Society - Series B, Vol. 72, pp. 3–25.

Diday, E. (1974): Introduction à l’analyse factorielle typologique, Revue de Statistique Appliquée, 22, 4, pp.29-38

Jolliffe, I.T. , Trendafilov, N.T. and Uddin, M. (2003) A modified principal component technique based on the LASSO. Journal of Computational and Graphical Statistics, 12, 531–547

Rousson, V. , Gasser, T. (2004), Simple component analysis. Journal of the Royal Statistical Society: Series C (Applied Statistics), 53,539- 555

Shen, H. and Huang, J. Z. (2008). Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis, 99, 1015-1034.

(45)

Simon, N., Friedman, J., Hastie, T., and Tibshirani, R. (2012) A Sparse-Group Lasso. Journal of Computational and Graphical Statistics.

Späth, H. (1979): Clusterwise linear regression, Computing, 22, pp.367-373

Tibshirani, R. (1996) Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58, 267-288

Vines, S.K., (2000) Simple principal components, Journal of the Royal Statistical Society: Series C (Applied Statistics), 49, 441-451

Yuan, M., Lin, Y. (2007) Model selection and estimation in

regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68, 49-67

Zou, H., Hastie, T. (2005) Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67, 301-320.

Zou, H., Hastie, T. and Tibshirani, R. (2006) Sparse Principal Component Analysis. Journal of Computational and Graphical Statistics, 15, 265-286.

H. Zou, T. Hastie, R. Tibshirani, (2007), On the “degrees of freedom”

of the lasso, The Annals of Statistics, 35, 5, 2173–2192.
