HAL Id: hal-01310571
https://hal.archives-ouvertes.fr/hal-01310571
Submitted on 29 May 2020
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Variable selection for correlated data in high dimension using decorrelation methods
Emeline Perthame, David Causeur, Ching-Fan Sheu, Chloé Friguet
To cite this version:
Emeline Perthame, David Causeur, Ching-Fan Sheu, Chloé Friguet. Variable selection for correlated
data in high dimension using decorrelation methods. Statlearn: Challenging problems in statistical
learning, Apr 2016, Vannes, France. ⟨hal-01310571⟩
Variable selection for correlated data in high dimension using decorrelation methods
Emeline Perthame INRIA, team MISTIS, Grenoble
Joint work with
David Causeur (Agrocampus, Rennes), Ching-Fan Sheu (NCKU, Tainan, Taiwan), Chloé Friguet (UBS, Vannes)
StatLearn, Vannes, April 2016
1 / 41
Outline
1. Introduction
2. Impact of dependence and dependence modeling
3. Disentangling signal from noise ...
• ... for a multiple testing issue
• ... for a supervised classification issue
4. Conclusion
2 / 41
The instrument: a 128-channel geodesic sensor net
• Electroencephalography (EEG) is the recording of electrical activity at scalp locations over time.
• The recorded EEG traces, which are time locked to external events, are averaged to form the event-related (brain) potentials (ERPs).
[Slide figure: electrode locations comprising the right and left anterior scalp regions where the frontal late positive potential (LPP) was recorded; reproduced from W.A. Cunningham et al., NeuroImage 28 (2005) 827-834.]
3 / 41
Auditory oddball experiment
A very commonly used experimental task
• Two auditory stimuli are presented to subjects
− A stimulus (500 Hz) occurring frequently
− A stimulus (1000 Hz) occurring infrequently
• ERPs are recorded over a 400 ms interval after stimulus onset.
Motivations
• Auditory evoked potential (AEP): elicited by auditory stimulus
• Mismatch negativity (MMN): elicited by any change in the stimulus (odd/frequent)
• AEP and MMN are electrophysiological marker candidates for psychiatric disorders such as schizophrenia
4 / 41
ERP curves
[Figure: raw ERP curves for 13 subjects at channel FZ, Hz500 vs. Hz1000 conditions (auditory ERP data, Kaohsiung Medical University); time (ms) vs. ERP amplitude (µV).]
→ Signal detection: is there any difference between the two conditions?
→ Signal identification: when does the difference occur?
5 / 41
Linear model framework for ERP curves
At time t for subject i in condition j
• Multivariate analysis of variance model:
$Y_{ijt} = \mu_t + \alpha_{it} + \gamma_{jt} + \varepsilon_{ijt}$
• Functional analysis of variance model:
$Y_{ijt} = \sum_{s=1}^{S} m_s\,\varphi_s(t) + \sum_{s=1}^{S} a_{is}\,\varphi_s(t) + \sum_{s=1}^{S} g_{js}\,\varphi_s(t) + \varepsilon_{ijt}$
where $\varphi_s(\cdot)$, $s = 1, \dots, S$, are B-splines.
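As a rough illustration, the functional ANOVA above can be fitted in R with B-spline bases; this is a minimal sketch, assuming a hypothetical long-format data.frame erp with columns amplitude, time, and factors subject and condition:

library(splines)
S <- 10  # number of B-spline basis functions (assumption)
fit <- lm(amplitude ~ bs(time, df = S)              # common effects m_s
                    + subject:bs(time, df = S)      # subject effects a_is
                    + condition:bs(time, df = S),   # condition effects g_js
          data = erp)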
6 / 41
Linear model framework for ERP curves
At time t for subject i in condition j
$Y_{ijt} = \mu_t + \alpha_{it} + \gamma_{jt} + \varepsilon_{ijt}$
Signal detection
• Is there any difference between the two conditions?
$H_0$: for $t = 1, \dots, T$ and $j = 1, 2$, $\gamma_{jt} = 0$
• Is it relevant to predict the label from the ERP curves?
→ High dimension: need for variable selection
Signal identification
For $t = 1, \dots, T$, $H_{0t}$: for $j = 1, 2$, $\gamma_{jt} = 0$
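For identification, a naive pointwise sketch in R (not the procedure developed later in this talk; Y1 and Y2 are hypothetical n × T amplitude matrices for the two conditions, ignoring the subject effect for brevity):

pvals <- sapply(seq_len(ncol(Y1)), function(t) t.test(Y1[, t], Y2[, t])$p.value)
identified <- which(p.adjust(pvals, method = "BH") <= 0.05)  # BH-adjusted selection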
6 / 41
Some approaches
Detection
• F-test for multivariate (or functional) ANOVA 1
• Optimal detection (Higher Criticism 2)
Supervised classification
• Ignoring correlations: naive approaches 3
• Introducing sparsity: Lasso, sparse LDA 4
Identification
• FDR control: Benjamini-Hochberg, ...
→ Efficient under independence
1. Bugli and Lambert, 2006, Stat Med
2. Donoho and Jin, 2004, AOS
3. Bickel and Levina, 2004, Bernoulli; Tibshirani et al., 2003, Stat Sc
4. Tibshirani, 1996, JRSS; Clemmensen et al., 2011, Technometrics
7 / 41
Guthrie-Buchwald procedure 5
• Assumes an auto-regressive process with auto-correlation ρ
• Distribution of $L_\rho$ under the null:
$L_\rho = \#\{t : p_t \le \alpha\}$
where $(p_1, \dots, p_T)$ are p-values and $\alpha$ is a preset level
• A time interval is rejected if it is significant at the preset level and longer than the significant runs expected under the null
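A hedged R sketch of the underlying idea: simulate the null distribution of the longest run of consecutive p-values below the preset level under an AR(1) process (rho = 0.8 and T = 200 are assumptions):

rho <- 0.8
longest_run <- function(sig) with(rle(sig), max(c(0, lengths[values])))
null_runs <- replicate(1000, {
  z <- as.numeric(arima.sim(list(ar = rho), n = 200)) * sqrt(1 - rho^2)  # marginal N(0, 1)
  longest_run(2 * pnorm(-abs(z)) <= 0.05)     # longest run of pointwise significance
})
crit <- quantile(null_runs, 0.95)  # intervals longer than 'crit' are declared significant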
5. Guthrie and Buchwald, 1991, Psychophysiology
8 / 41
Strong and complex temporal dependence structure
→ Dependence affects the stability of selection procedures
9 / 41
Outline
1. Introduction
2. Impact of dependence and dependence modeling
3. Disentangling signal from noise ...
• ... for a multiple testing issue
• ... for a supervised classification issue
4. Conclusion
10 / 41
Rare and Weak paradigm 6
• Two-component mixture for the test statistics: $T = \mu + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, I_T)$
• Where the signal is
− Rare: $\eta = T^{-\beta}$, $\beta \in (\tfrac{1}{2}, 1)$
− Weak: $A = \sqrt{2 r \log T}$, $r \in (0, 1)$
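A minimal R sketch generating test statistics in this rare-and-weak regime (the values of T, beta, and r are assumptions):

T_ <- 1000; beta <- 0.7; r <- 0.3
eta <- T_^(-beta)              # proportion of non-null tests (rare)
A   <- sqrt(2 * r * log(T_))   # common signal amplitude (weak)
mu  <- ifelse(runif(T_) < eta, A, 0)
stat <- mu + rnorm(T_)         # T = mu + eps, eps ~ N(0, I_T)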
6. Donoho and Jin, 2004, AOS ; 2008, PNAS
11 / 41
Phase diagram under independence 7
• The signal is detectable when $r > \rho^*_D(\beta)$:
$\rho^*_D(\beta) = \begin{cases} \beta - \tfrac{1}{2} & \text{if } \tfrac{1}{2} < \beta \le \tfrac{3}{4} \\ (1 - \sqrt{1 - \beta})^2 & \text{if } \tfrac{3}{4} < \beta < 1 \end{cases}$
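The boundary is easy to reproduce; a short R sketch tracing it over $\beta \in (0.5, 1)$:

rho_star <- function(beta) ifelse(beta <= 3/4, beta - 1/2, (1 - sqrt(1 - beta))^2)
curve(rho_star, from = 0.5, to = 1, xlab = expression(beta), ylab = "r")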
[Figure: phase diagram in the $(\beta, r)$ plane, detectable vs. undetectable regions.]
In the detectable region, the sum of Type I and Type II error rates of the LRT tends to 0 as $T \to +\infty$.
7. Ingster, 1999, Math Meth of Stat; Donoho and Jin, 2004, AOS
12 / 41
Impact of dependence - Signal identification
[Figure: true signal amplitude over time (T = 1000); r ≈ 0.4, β = 0.51.]
• Independence and ERP time dependence pattern
• 1000 datasets for each amplitude
• Benjamini-Hochberg correction
13 / 41
Impact of dependence - Signal identification
[Figure: empirical phase diagram in the (β, r) plane, detectable vs. undetectable regions.]
• Independence and ERP time dependence pattern
• 1000 datasets for each amplitude
• Benjamini-Hochberg correction
13 / 41
Impact of dependence - Signal identification
[Figure: false discovery rate (left) and probability of no rejection (right) as functions of peak amplitude, for BH under independence and under dependence.]
• Instability of multiple testing procedures: FDR = pFDR × (1 − PNR), where PNR is the probability of no rejection
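As a worked example of this identity: if pFDR = 0.10 but the procedure rejects nothing 60% of the time (PNR = 0.6), the realized FDR is 0.10 × (1 − 0.6) = 0.04. A seemingly well-controlled FDR can thus mask a highly unstable procedure.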
14 / 41
Impact of dependence - Variable selection
[Figure: true regression coefficients β of the logistic model over the variables' index (0 to 500).]
$\log \frac{P(Y = 2 \mid X)}{P(Y = 1 \mid X)} = \beta_0 + x'\beta$
• Independence and ERP time dependence pattern
• 1000 datasets for each dependence structure
• Variable selection performed by Lasso 8
8. glmnet R package, Friedman et al., 2010, JSS
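A minimal sketch of such a Lasso fit with glmnet (X and y are hypothetical data objects):

library(glmnet)
# X: n x T matrix of ERP amplitudes; y: factor with levels Hz500 / Hz1000
cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 1)  # alpha = 1: Lasso penalty
b   <- as.matrix(coef(cvfit, s = "lambda.min"))[-1, 1]    # drop the intercept
sel <- which(b != 0)                                      # selected time points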
15 / 41
Impact of dependence - Variable selection
[Figure: true regression coefficients β over the variables' index (0 to 500), with the signal peaks labeled 1 to 4.]
• Predictor $X_t$ is assessed by its rank $r_t$, deduced from its regression coefficient
• Relevance of a selected set $S$ is given by the mean rank in $S$: $\bar{r}_S = \frac{1}{\#S} \sum_{t \in S} r_t$
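In R, this relevance criterion is a one-liner (true_beta, of length T, and the selected indices sel are assumptions):

r_t <- rank(-abs(true_beta))  # rank 1 = most predictive variable
r_S <- mean(r_t[sel])         # mean rank of the selected set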
16 / 41
Impact of dependence - Variable selection
[Figure: histograms of the mean ranks among selected features, under independence (top) and under dependence (bottom).]
• Relevance: the most predictive variables are not selected under dependence
• Stability: selected subsets are not reproducible
17 / 41
Impact of dependence - Improving stability
• Bootstrap
− Bolasso 9
− Stability selection 10
• Dependence modeling
− Surrogate variable analysis 11
− Latent effect adjustment after primary projection 12
− Factor analysis for multiple testing 13
9. Bach, 2008, Proceedings ICML
10. Meinshausen and Bühlmann, 2010, JRSS
11. Leek and Storey, 2007, PLoS Genetics
12. Sun, Zhang and Owen, 2012, AOAS
13. Friguet, Kloareg and Causeur, 2009, JASA
18 / 41
Factor modeling of dependence
• Distribution of ERP curves
$X = (X_1, \dots, X_T) \mid Y = y \sim \mathcal{N}_T(\mu_y, \Sigma)$
• Latent factor modeling:
$X = \mu_y + BZ + e$ with $e \sim \mathcal{N}_T(0, \Psi)$, $\Psi$ diagonal, $\mathrm{rank}(B) = q$, $Z \sim \mathcal{N}_q(0, I_q)$
• Decomposition of the covariance matrix: $\Sigma = \Psi + BB'$
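As an illustration only (the authors use the EM algorithm of Friguet et al., 2009, implemented in the FAMT package on CRAN), a rough sketch of the decomposition with stats::factanal, viable when T is small relative to n:

q  <- 5                           # number of factors (assumption)
fa <- factanal(Xc, factors = q)   # Xc: n x T matrix of centered profiles
L  <- unclass(fa$loadings)        # T x q loadings, correlation scale
sds <- apply(Xc, 2, sd)
B_hat   <- L * sds                # rescale rows to the covariance scale
Psi_hat <- diag(fa$uniquenesses * sds^2)
Sigma_hat <- Psi_hat + B_hat %*% t(B_hat)  # Sigma = Psi + B B'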
19 / 41
Signal is hidden by noise
[Figure: test statistics over time (ms).]
20 / 41
Signal is hidden by noise
[Figure: decomposition of the test statistics into signal (left) and noise (right) over time (ms).]
20 / 41
Outline
1. Introduction
2. Impact of dependence and dependence modeling
3. Disentangling signal from noise ...
• ... for a multiple testing issue
• ... for a supervised classification issue
4. Conclusion
21 / 41
Multiple testing issue
• ERP measure at time $t$ for subject $i$ in condition $j$: $Y_{ijt} = \mu_t + \alpha_{it} + \gamma_{jt} + \varepsilon_{ijt}$
• In matrix notation:
$Y_t = \mu_t + X_0\alpha_t + X\gamma_t + \varepsilon_t$ with $\mathbb{V}(\varepsilon_1, \dots, \varepsilon_T) = \Sigma$
• Multiple testing: for $t = 1, \dots, T$, $H_{0,t}: \gamma_t = 0$
• Dependence among tests
22 / 41
Prior knowledge of the signal
• OLS estimation of the signal $\gamma = (\gamma_1, \dots, \gamma_T)$:
$\hat{\gamma} = \gamma + \delta$ with $\delta \sim \mathcal{N}(0, \tilde{\Sigma})$ and $\tilde{\Sigma} \propto \Sigma$
[Figure: estimated signal over time (ms).]
23 / 41
Prior knowledge of the signal
• OLS estimation of the signal $\gamma = (\gamma_1, \dots, \gamma_T)$:
$\hat{\gamma} = \gamma + \delta$ with $\delta \sim \mathcal{N}(0, \tilde{\Sigma})$ and $\tilde{\Sigma} \propto \Sigma$
[Figure: estimated signal over time (ms), with the null region $T_0$ highlighted.]
23 / 41
Prior knowledge of the signal
• OLS estimation of the signal $\gamma = (\gamma_1, \dots, \gamma_T)$:
$\hat{\gamma} = \gamma + \delta$ with $\delta \sim \mathcal{N}(0, \tilde{\Sigma})$ and $\tilde{\Sigma} \propto \Sigma$
• On $T_0$, the noise is observed without signal:
$\begin{pmatrix} \delta_0 \\ \delta_{-0} \end{pmatrix} \sim \mathcal{N}\left[\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \tilde{\Sigma}_{0,0} & \tilde{\Sigma}_{0,-0} \\ \tilde{\Sigma}_{-0,0} & \tilde{\Sigma}_{-0,-0} \end{pmatrix}\right]$
• And can be estimated elsewhere: $\hat{\delta}_{-0} = \hat{\Sigma}_{-0,0}\,\hat{\Sigma}_{0,0}^{-1}\,\hat{\delta}_0$
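A minimal R sketch of this extrapolation step (Sigma_hat, the indices T0, and the estimated noise delta0_hat on the null region are assumptions):

S00 <- Sigma_hat[T0, T0]                     # covariance within the null region
Sm0 <- Sigma_hat[-T0, T0]                    # cross-covariance with the rest
delta_hat_m0 <- Sm0 %*% solve(S00, delta0_hat)  # hat(delta)_{-0}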
23 / 41
Prior knowledge of the signal
• And can be estimated elsewhere: $\hat{\delta}_{-0} = \hat{\Sigma}_{-0,0}\,\hat{\Sigma}_{0,0}^{-1}\,\hat{\delta}_0$
[Figure: estimated signal over time (ms), with the null region $T_0$ highlighted.]
23 / 41
Prior knowledge of the signal
• New estimation of the signal: $\hat{\gamma}^{\text{new}} = \hat{\gamma} - \hat{\delta}$
[Figure: corrected estimate of the signal over time (ms).]
23 / 41
Iterative algorithm
• New estimation of the signal: $\hat{\gamma}^{\text{new}} = \hat{\gamma} - \hat{\delta}$
• Update of the residual errors: $\hat{\varepsilon}^{\text{new}} = Y_t - (\hat{\mu}_t + \hat{\alpha}_{it} + \hat{\gamma}_t^{\text{new}})$
• New estimation of the covariance matrix
• Alternates between estimation of the signal and of the covariance structure
• Until convergence of the test statistics
• Update of $T_0$
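A hedged R-style sketch of this alternating loop; all helper functions are hypothetical placeholders, not the authors' implementation:

tstat_old <- rep(Inf, T_)
repeat {
  gamma_new <- gamma_hat - delta_hat                 # corrected signal estimate
  E         <- Y - model_fit(mu_hat, alpha_hat, gamma_new)  # residuals (hypothetical helper)
  Sigma_hat <- factor_cov(E, q)                      # re-estimate Psi + B B' (hypothetical)
  delta_hat <- noise_pred(Sigma_hat, gamma_new, T0)  # conditional noise prediction on T0 (hypothetical)
  tstat     <- gamma_new / sqrt(diag(Sigma_hat))     # updated test statistics
  if (max(abs(tstat - tstat_old)) < 1e-6) break      # stop at convergence of test statistics
  tstat_old <- tstat
  gamma_hat <- gamma_new
  T0        <- which(pvals(tstat) >= t0)             # update of T0 (pvals hypothetical)
}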
24 / 41
Choice of T 0
Prior knowledge
• ERP: psychologists may know that signal does not occur before/after some time points
• Genomics: biologists may know that some genes are not involved in a biological process
No prior knowledge
• Conservative approach
$T_0 = \{t : p_t \ge t_0\}$, where $(p_1, \dots, p_T)$ are p-values
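In R, this conservative choice is immediate (pvals from a first pass and the threshold t0 = 0.2 are assumptions):

T0 <- which(pvals >= 0.2)  # time points kept in the null region T0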
25 / 41
Simulations - Adaptive factor analysis procedure
[Figure: simulated signal amplitude over time (T = 1000).]
• Dependence structure of ERP experiment
• 1000 generated datasets
26 / 41
Simulations - Adaptive factor analysis procedure
Method               FDR 14   TDR 15   PD 16
Benjamini-Hochberg   0.031    0.057    0.281
Benjamini-Yekutieli  0.009    0.011    0.101
Guthrie-Buchwald     0.086    0.233    0.538
SVA                  0.088    0.151    0.599
LEAPP                0.151    0.304    0.847
AFA                  0.034    0.498    1.000
14. False Discovery Rate
15. True Discovery Rate
16. Probability of Detecting the peak
27 / 41
Application to auditory data
[Figure: estimated condition effect along time (ms), with significant time points flagged by LEAPP, SVA, AFA, and BH.]
80 - 120 ms: Auditory evoked potential
100 - 200 ms: Mismatch negativity for the difference curve
28 / 41
Conclusion
• Adaptive estimation of signal and factor model parameters
• Designed for strong dependence
• Efficient multiple testing procedure
− FDR is controlled
− Good detection power
• ERP package available on CRAN 17
17. Causeur and Sheu, 2014, R package version 1.0.1
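A hedged usage sketch; the function and argument names below are assumed from the package's naming and should be checked against the ERP manual on CRAN:

library(ERP)
# dta: n x T matrix of ERP curves; design / design0: full and null model design matrices
res <- erpfatest(dta, design, design0)  # factor-adjusted FDR-controlling tests (AFA)
res$significant                         # assumed accessor for significant time points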
29 / 41
Outline
1. Introduction
2. Impact of dependence and dependence modeling
3. Disentangling signal from noise ...
• ... for a multiple testing issue
• ... for a supervised classification issue
4. Conclusion
30 / 41
Supervised classification issue
• Prediction of a label → Hz500 or Hz1000 frequency
• From ERP curve profiles $X = (X_1, \dots, X_T)$: $(X \mid Y = y) \sim \mathcal{N}_T(\mu_y, \Sigma)$
• Among linear classification rules:
$LR(x) = \log \frac{P(Y = 2 \mid X)}{P(Y = 1 \mid X)} = \beta_0 + x'\beta$
• The best one is Bayes' rule:
$\beta = \Sigma^{-1}(\mu_2 - \mu_1)$, $\beta_0 = \log \frac{p_2}{p_1} - 0.5\,(\mu_2 + \mu_1)'\Sigma^{-1}(\mu_2 - \mu_1)$
• Theoretical misclassification rate π
31 / 41
Some estimation methods
Logistic regression
• Minimizing the deviance:
$(\hat{\beta}_0, \hat{\beta}) = \operatorname*{argmin}_{\beta_0, \beta} \; 2 \sum_{i=1}^{n} \log\left[1 + \exp\left(-V_i(\beta_0 + x_i'\beta)\right)\right]$
where $V_i = \pm 1$
• High dimension
− $\ell_2$-penalization: Ridge 18
− $\ell_1$-penalization: Lasso 19
18. Hoerl and Kennard, 1970, Technometrics
19. Tibshirani, 1996, JRSS
32 / 41
Some estimation methods
Linear Discriminant Analysis
• OLS estimate → method of moments:
$(\hat{\beta}_0, \hat{\beta}) = \operatorname*{argmin}_{\beta_0, \beta} \sum_{i=1}^{n} \left[V_i - (\beta_0 + x_i'\beta)\right]^2$, where $V_i = \pm 1$
• High dimension
− Ignoring correlations: Diagonal Discriminant Analysis (DDA) 18 , Nearest Shrunken Centroids 19
− Shrinkage Discriminant Analysis 20 (SDA)
− Sparse linear discriminant analysis 21 (SLDA)
18. Bickel and Levina, 2004, Bernoulli
19. Tibshirani et al., 2003, Stat Sc
20. Ahdesmäki and Strimmer, 2010, AOAS
21. Clemmensen et al., 2011, Technometrics
32 / 41
Conditional classification rule
• Under the factor model assumption ($\Sigma = \Psi + BB'$):
$\begin{pmatrix} X \\ Z \end{pmatrix} \sim \mathcal{N}\left[\begin{pmatrix} \mu_y \\ 0 \end{pmatrix}, \begin{pmatrix} \Sigma & B \\ B' & I_q \end{pmatrix}\right]$
• Among classification rules linear in $(x, z)$, the best one is the conditional Bayes classifier:
$LR(x, z) = \log \frac{P(Y = 2 \mid X, Z)}{P(Y = 1 \mid X, Z)} = \beta_0^* + (x - Bz)'\beta^*$
with $\beta^* = \Psi^{-1}(\mu_2 - \mu_1)$ and $\beta_0^* = \log \frac{p_2}{p_1} - 0.5\,(\mu_2 + \mu_1)'\Psi^{-1}(\mu_2 - \mu_1)$
• Analytical expression of the misclassification rate $\pi^*_Z$
33 / 41
Conditional classification rule
• Bayes rule error: $\pi$
• Under the factor model assumption:
$\begin{pmatrix} X \\ Z \end{pmatrix} \sim \mathcal{N}\left[\begin{pmatrix} \mu_y \\ 0 \end{pmatrix}, \begin{pmatrix} \Sigma & B \\ B' & I_q \end{pmatrix}\right]$
• Conditional Bayes rule error: $\pi^*_Z$
• One can show that $\pi \ge \pi^*_Z$
→ Theoretical superiority of the conditional approach based on the decorrelated data $\tilde{X} = X - BZ$
33 / 41
Iterative decorrelation of data
• Estimation of µ 1 and µ 2
• Computation of centered profiles
• Estimation of factor model parameters 22 (Ψ, B)
• Decorrelation of the data using the generalized Thompson formula:
$\tilde{x} = x - \hat{B}\hat{z}$
Generalized Thompson formula:
$\hat{Z} = \mathbb{E}_X(Z) = (I_q + B'\Psi^{-1}B)^{-1} B'\Psi^{-1}\left[x - \left(\mu_1 P_X(1) + \mu_2 P_X(2)\right)\right]$
22. Friguet, Kloareg and Causeur, 2009, JASA
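A minimal R sketch of the decorrelation step (B, Psi, mu1, mu2 are assumed estimated beforehand; p1x and p2x denote the posterior class probabilities $P_X(1)$ and $P_X(2)$ for the profile x):

q <- ncol(B)
PsiInv  <- diag(1 / diag(Psi))                      # Psi is diagonal
M       <- solve(diag(q) + t(B) %*% PsiInv %*% B)   # (I_q + B' Psi^-1 B)^-1
z_hat   <- M %*% t(B) %*% PsiInv %*% (x - (mu1 * p1x + mu2 * p2x))
x_tilde <- x - B %*% z_hat                          # decorrelated profile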
34 / 41
Simulations
[Figure: true regression coefficients β over the variables' index (0 to 1000).]
• $n_0 = n_1 = 13$
• Various dependence structures 23
• 1000 learning datasets
• 1 testing dataset
23. Meinshausen and Bühlmann, 2010, JRSS
35 / 41
Simulations - Prediction error rates
[Figure: boxplots of prediction error rates for LASSO, FA LASSO, SLDA, FA SLDA, SDA, FA SDA, DDA, and FA DDA.]