The degrees of freedom of the group Lasso for a general design


HAL Id: hal-00926929

https://hal.archives-ouvertes.fr/hal-00926929

Submitted on 10 Jan 2014



Samuel Vaiter, Gabriel Peyré, Jalal M. Fadili, Charles-Alban Deledalle, Charles Dossal

To cite this version:

Samuel Vaiter, Gabriel Peyré, Jalal M. Fadili, Charles-Alban Deledalle, Charles Dossal. The degrees of freedom of the group Lasso for a general design. SPARS’13, Jul 2013, Lausanne, Switzerland. 1 page. ⟨hal-00926929⟩


Samuel Vaiter and Gabriel Peyré: CEREMADE, CNRS-U. Paris-Dauphine, Place du Maréchal De Lattre De Tassigny, 75775 Paris Cedex 16, France.

Jalal M. Fadili: GREYC, CNRS-ENSICAEN-U. Caen, 6, Bd du Maréchal Juin, 14050 Caen Cedex, France.

Charles Deledalle and Charles Dossal: IMB, CNRS-U. Bordeaux 1, 351, cours de la Libération, 33405 Talence Cedex, France.

Abstract—In this paper, we are concerned with regression problems where covariates can be grouped in nonoverlapping blocks, from which a few are active. In such a situation, the group Lasso is an attractive method for variable selection since it promotes sparsity of the groups. We study the sensitivity of any group Lasso solution to the observations and provide its precise local parameterization. When the noise is Gaussian, this allows us to derive an unbiased estimator of the degrees of freedom of the group Lasso. This result holds true for any fixed design, no matter whether it is under- or overdetermined. Our results specialize to those of [1], [2] for blocks of size one, i.e. the $\ell^1$ norm. These results allow an objective choice of the regularization parameter through e.g. the SURE.

I. GROUP LASSO AND DEGREES OF FREEDOM

Consider the linear regression problem $Y = X\beta_0 + \varepsilon$, where $Y$ is the real $n$-dimensional response vector, $\beta_0 \in \mathbb{R}^p$ is the unknown vector of regression coefficients to be estimated, $X \in \mathbb{R}^{n \times p}$ is the design matrix whose columns are the $p$ covariate vectors, and $\varepsilon$ is the error term. In this paper, we do not make any specific assumption on $n$ with respect to $p$.

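Let $\mathcal{B}$ be a partition of the set of indices into disjoint blocks, i.e. $\bigcup_{b \in \mathcal{B}} b = \{1, \dots, p\}$ and $b \cap b' = \emptyset$ for all $b, b' \in \mathcal{B}$ with $b \neq b'$. For $\beta \in \mathbb{R}^p$ and each $b \in \mathcal{B}$, $\beta_b = (\beta_i)_{i \in b}$ is the subvector of $\beta$ whose entries are indexed by the block $b$, and $|b|$ is the cardinality of $b$. The group Lasso amounts to solving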

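$$\hat\beta(y) \in \operatorname*{Argmin}_{\beta \in \mathbb{R}^p} \; \frac{1}{2} \|y - X\beta\|^2 + \lambda \sum_{b \in \mathcal{B}} \|\beta_b\|, \qquad (\mathcal{P}_\lambda(y))$$

from an observation $y \in \mathbb{R}^n$ of the regression model, where $\lambda > 0$ is the regularization parameter and $\|\cdot\|$ is the $\ell^2$-norm.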
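The objective of the group Lasso problem can be evaluated and minimized numerically with a few lines of NumPy. The sketch below is only an illustration: the paper does not prescribe a solver, and the proximal gradient (ISTA) iteration, the function names and the representation of the blocks as lists of column indices are all assumptions made here.

```python
import numpy as np

def group_lasso_objective(X, y, beta, blocks, lam):
    """Value of 0.5 * ||y - X beta||^2 + lam * sum_b ||beta_b||."""
    residual = y - X @ beta
    penalty = sum(np.linalg.norm(beta[b]) for b in blocks)
    return 0.5 * residual @ residual + lam * penalty

def block_soft_threshold(v, t):
    """Proximal operator of t * ||.||_2: shrinks the whole block toward zero."""
    norm = np.linalg.norm(v)
    if norm <= t:
        return np.zeros_like(v)
    return (1.0 - t / norm) * v

def group_lasso_ista(X, y, blocks, lam, n_iter=5000):
    """Proximal gradient iterations for the group Lasso (assumed solver, not the paper's)."""
    beta = np.zeros(X.shape[1])
    # Step size 1 / L, where L = ||X||_2^2 is the Lipschitz constant of the gradient.
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    for _ in range(n_iter):
        # Gradient step on the quadratic fidelity term.
        z = beta - step * (X.T @ (X @ beta - y))
        # Blockwise proximal step on the penalty.
        for b in blocks:
            beta[b] = block_soft_threshold(z[b], step * lam)
    return beta

# Hypothetical usage: three blocks of size two, only the first one active.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((20, 6))
blocks_demo = [[0, 1], [2, 3], [4, 5]]
y_demo = X_demo @ np.array([1.0, -1.0, 0.0, 0.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(20)
beta_demo = group_lasso_ista(X_demo, y_demo, blocks_demo, lam=1.0)
```

The block soft-thresholding step either zeroes a whole block or keeps all of its entries, which is how the penalty promotes sparsity at the level of groups rather than of individual coefficients.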

Let $y \mapsto \hat\mu(y) = X\hat\beta(y)$ be the response or the prediction associated to $\hat\beta(y)$, and let $\mu_0 = X\beta_0$. We recall that $\hat\mu(y)$ is always uniquely defined (by strict convexity of the fidelity term), although $\hat\beta(y)$ may not be, as is the case when $X$ is a rank-deficient or underdetermined design matrix. Suppose that $\varepsilon \sim \mathcal{N}(0, \sigma^2 \mathrm{Id}_n)$.

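Following [3], the DOF is given by $\mathrm{df} = \sum_{i=1}^n \frac{\operatorname{cov}(Y_i, \hat\mu_i(Y))}{\sigma^2}$. The well-known Stein’s lemma asserts that, if $y \mapsto \hat\mu(y)$ is a weakly differentiable function with an essentially bounded gradient, then an unbiased estimator of $\mathrm{df}$ is $\widehat{\mathrm{df}}(Y) = \operatorname{div} \hat\mu(Y) = \operatorname{tr}(\partial\hat\mu(Y))$ and $\mathbb{E}_\varepsilon(\widehat{\mathrm{df}}(Y)) = \mathrm{df}$, where $\partial\hat\mu(\cdot)$ is the Jacobian of $\hat\mu(\cdot)$.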

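In the sequel, we define the $\mathcal{B}$-support $\operatorname{supp}_{\mathcal{B}}(\beta)$ of $\beta \in \mathbb{R}^p$ as $\operatorname{supp}_{\mathcal{B}}(\beta) = \{b \in \mathcal{B} \mid \|\beta_b\| \neq 0\}$. The size of $\operatorname{supp}_{\mathcal{B}}(\beta)$ is $|\operatorname{supp}_{\mathcal{B}}(\beta)| = \sum_{b \in \operatorname{supp}_{\mathcal{B}}(\beta)} |b|$. The set of all $\mathcal{B}$-supports is denoted $\mathcal{I}$, and $X_I$, where $I$ is a $\mathcal{B}$-support, is the matrix formed by the columns $X_i$ where $i$ is an element of $b \in I$. We also introduce the following block-diagonal operators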

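$$\delta_\beta : v \in \mathbb{R}^{|I|} \mapsto (v_b / \|\beta_b\|)_{b \in I} \in \mathbb{R}^{|I|} \quad \text{and} \quad P_\beta : v \in \mathbb{R}^{|I|} \mapsto (\operatorname{Proj}_{\beta_b^\perp}(v_b))_{b \in I} \in \mathbb{R}^{|I|},$$

where $\operatorname{Proj}_{\beta_b^\perp} = \mathrm{Id} - \beta_b \beta_b^T / \|\beta_b\|^2$ is the orthogonal projector on $\beta_b^\perp$.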
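The block support and the two block-diagonal operators translate directly into matrices. The following is a minimal sketch, assuming the blocks are given as lists of column indices and using hypothetical function names; the projector is written in its normalized form $\mathrm{Id} - \beta_b \beta_b^T / \|\beta_b\|^2$, which is what makes it idempotent.

```python
import numpy as np

def block_support(beta, blocks):
    """Positions (in `blocks`) of the blocks b with ||beta_b|| != 0."""
    return [k for k, b in enumerate(blocks) if np.linalg.norm(beta[b]) != 0]

def delta_matrix(beta, blocks, support):
    """Block-diagonal matrix of delta_beta: divides v_b by ||beta_b||."""
    sizes = [len(blocks[k]) for k in support]
    D = np.zeros((sum(sizes), sum(sizes)))
    start = 0
    for k, size in zip(support, sizes):
        norm = np.linalg.norm(beta[blocks[k]])
        D[start:start + size, start:start + size] = np.eye(size) / norm
        start += size
    return D

def projector_matrix(beta, blocks, support):
    """Block-diagonal matrix of P_beta: projects v_b on the orthogonal of beta_b."""
    sizes = [len(blocks[k]) for k in support]
    P = np.zeros((sum(sizes), sum(sizes)))
    start = 0
    for k, size in zip(support, sizes):
        u = beta[blocks[k]] / np.linalg.norm(beta[blocks[k]])  # unit vector along beta_b
        P[start:start + size, start:start + size] = np.eye(size) - np.outer(u, u)
        start += size
    return P

# Hypothetical usage: two active blocks (of sizes 2 and 1) out of three.
beta_demo = np.array([3.0, 4.0, 0.0, 0.0, 2.0])
blocks_demo = [[0, 1], [2, 3], [4]]
support_demo = block_support(beta_demo, blocks_demo)
D_demo = delta_matrix(beta_demo, blocks_demo, support_demo)
P_demo = projector_matrix(beta_demo, blocks_demo, support_demo)
```

For a block of size one the projector is zero, which is why these operators play no role in the classical Lasso.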

II. MAIN CONTRIBUTIONS

The first difficulty we need to overcome when $X$ is not full column rank is that $y \mapsto \hat\beta(y)$ is set-valued. To handle this, we impose the following assumption on $X$ with respect to the block structure.

Assumption $(\mathcal{A}(\beta))$: Given a vector $\beta \in \mathbb{R}^p$ of $\mathcal{B}$-support $I$, we assume that the finite subset of vectors $\{X_b \beta_b \mid b \in I\}$ is linearly independent.

It is important to notice that $(\mathcal{A}(\beta))$ is weaker than imposing that $X_I$ is full column rank, which is standard when analyzing the Lasso. The two assumptions coincide for the Lasso, i.e. $|b| = 1$, $\forall b \in I$.
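The assumption only involves one vector $X_b \beta_b$ per active block, so it can be checked with a rank computation. The sketch below is an illustration with a hypothetical function name and a toy design chosen here; it shows a case with $n = 2$ where $X_I$ has four columns, hence cannot be full column rank, while the assumption still holds for one $\beta$ and fails for another.

```python
import numpy as np

def assumption_A_holds(X, beta, blocks):
    """True if the vectors {X_b beta_b : b in the block support of beta} are linearly independent."""
    active = [b for b in blocks if np.linalg.norm(beta[b]) != 0]
    if not active:
        return True  # the empty family is linearly independent
    # One column per active block: the image X_b beta_b of that block.
    V = np.column_stack([X[:, b] @ beta[b] for b in active])
    return bool(np.linalg.matrix_rank(V) == len(active))

# Toy design (assumed for illustration): two observations, two blocks of size two.
X_demo = np.array([[1.0, 0.0, 0.0, 1.0],
                   [0.0, 1.0, 1.0, 0.0]])
blocks_demo = [[0, 1], [2, 3]]

# X_b beta_b equals (1, 0) and (0, 1): independent although rank(X_I) = 2 < 4.
holds = assumption_A_holds(X_demo, np.array([1.0, 0.0, 1.0, 0.0]), blocks_demo)

# X_b beta_b equals (1, 0) for both blocks: the assumption fails.
fails = assumption_A_holds(X_demo, np.array([1.0, 0.0, 0.0, 1.0]), blocks_demo)
```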

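Definition 1: Let $\lambda > 0$. The transition space $\mathcal{H}$ is defined as

$$\mathcal{H} = \bigcup_{I \subset \mathcal{I}} \bigcup_{b \notin I} \mathcal{H}_{I,b}, \quad \text{where} \quad \mathcal{H}_{I,b} = \operatorname{bd}(\pi(\mathcal{A}_{I,b})),$$

where we have denoted by

$$\pi : \mathbb{R}^n \times \mathbb{R}^{I,*} \times \mathbb{R}^{I,*} \to \mathbb{R}^n, \qquad \mathbb{R}^{I,*} = \prod_{b \in I} (\mathbb{R}^{|b|} \setminus \{0\}),$$

the canonical projection on $\mathbb{R}^n$ (with respect to the first component), $\operatorname{bd} C$ is the boundary of the set $C$, and

$$\mathcal{A}_{I,b} = \Big\{ (y, \beta_I, v_I) \in \mathbb{R}^n \times \mathbb{R}^{I,*} \times \mathbb{R}^{I,*} \;\Big|\; \|X_b^T (y - X_I \beta_I)\| = \lambda, \;\; X_I^T (X_I \beta_I - y) + \lambda v_I = 0, \;\; \forall g \in I, \; v_g = \frac{\beta_g}{\|\beta_g\|} \Big\}.$$

We are now equipped to state our main sensitivity analysis result.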

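Theorem 1: Let $\lambda > 0$. Let $y \notin \mathcal{H}$, and $\hat\beta(y)$ a solution of $(\mathcal{P}_\lambda(y))$. Let $I = \operatorname{supp}_{\mathcal{B}}(\hat\beta(y))$ be the $\mathcal{B}$-support of $\hat\beta(y)$ such that $(\mathcal{A}(\hat\beta(y)))$ holds. Then, there exists an open neighborhood $\mathcal{O} \subset \mathbb{R}^n$ of $y$, and a mapping $\tilde\beta : \mathcal{O} \to \mathbb{R}^p$ such that

1) For all $\bar y \in \mathcal{O}$, $\tilde\beta(\bar y)$ is a solution of $(\mathcal{P}_\lambda(\bar y))$, and $\tilde\beta(y) = \hat\beta(y)$.

2) the $\mathcal{B}$-support of $\tilde\beta(\bar y)$ is constant on $\mathcal{O}$.

3) the mapping $\tilde\beta$ is $C^1(\mathcal{O})$ and its Jacobian is such that $\forall \bar y \in \mathcal{O}$,

$$\partial \tilde\beta_{I^c}(\bar y) = 0 \quad \text{and} \quad \partial \tilde\beta_I(\bar y) = d(y, \lambda),$$

where

$$d(y, \lambda) = \big( X_I^T X_I + \lambda\, \delta_{\hat\beta(y)} \circ P_{\hat\beta(y)} \big)^{-1} X_I^T \quad \text{and} \quad I^c = \{ b \in \mathcal{B} \mid b \notin I \}.$$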

The next theorem provides a closed-form expression of the local variations of $y \mapsto \hat\mu(y)$. In turn, when $\varepsilon \sim \mathcal{N}(0, \sigma^2 \mathrm{Id})$, this will yield an unbiased estimator of the degrees of freedom and of the prediction risk of the group Lasso.

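Theorem 2: Let $\lambda > 0$. For all $y \notin \mathcal{H}$, there exists a solution $\hat\beta(y)$ of $(\mathcal{P}_\lambda(y))$ with $\mathcal{B}$-support $I = \operatorname{supp}_{\mathcal{B}}(\hat\beta(y))$ such that $(\mathcal{A}(\hat\beta(y)))$ is fulfilled. The mapping $y \mapsto \hat\mu(y) = X\hat\beta(y)$ is $C^1(\mathbb{R}^n \setminus \mathcal{H})$ and,

$$\operatorname{div}(\hat\mu(y)) = \operatorname{tr}(X_I\, d(y, \lambda)),$$

where $\hat\beta(y)$ is such that $(\mathcal{A}(\hat\beta(y)))$ holds. Moreover, the set $\mathcal{H}$ has Lebesgue measure zero. If $Y = X\beta_0 + \varepsilon$ where $\varepsilon \sim \mathcal{N}(0, \sigma^2 \mathrm{Id}_n)$, then $\operatorname{tr}(X_I\, d(Y, \lambda))$ is an unbiased estimate of the DOF of the group Lasso.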
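Once a solution is available, the estimate $\operatorname{tr}(X_I\, d(y, \lambda))$ with $d(y, \lambda) = (X_I^T X_I + \lambda\, \delta_{\hat\beta(y)} \circ P_{\hat\beta(y)})^{-1} X_I^T$ reduces to a small linear solve. The sketch below is an illustration under stated assumptions: the function name is hypothetical, the blocks are lists of column indices, the projector is taken in its normalized form $\mathrm{Id} - \beta_b \beta_b^T / \|\beta_b\|^2$, and `beta_hat` is assumed to be a solution for which the linear independence assumption holds, so that the matrix being inverted is nonsingular.

```python
import numpy as np

def group_lasso_dof(X, beta_hat, blocks, lam):
    """Evaluate tr(X_I d(y, lam)) at a given group Lasso solution beta_hat."""
    active = [np.asarray(b) for b in blocks if np.linalg.norm(beta_hat[b]) != 0]
    if not active:
        return 0.0  # empty support: the prediction is locally constant
    idx = np.concatenate(active)
    X_I = X[:, idx]
    m = idx.size
    # Block-diagonal matrix of delta o P: (Id - u u^T) / ||beta_b|| on each active block.
    M = np.zeros((m, m))
    start = 0
    for b in active:
        norm = np.linalg.norm(beta_hat[b])
        u = beta_hat[b] / norm
        stop = start + b.size
        M[start:stop, start:stop] = (np.eye(b.size) - np.outer(u, u)) / norm
        start = stop
    # d = (X_I^T X_I + lam * M)^{-1} X_I^T, computed without forming the inverse.
    d = np.linalg.solve(X_I.T @ X_I + lam * M, X_I.T)
    return float(np.trace(X_I @ d))

# Hypothetical usage with an orthonormal design, where the solution is known:
# for y = (6, 8, 0, 0) and lam = 5, block soft-thresholding gives beta_hat = (3, 4, 0, 0).
dof_demo = group_lasso_dof(np.eye(4), np.array([3.0, 4.0, 0.0, 0.0]), [[0, 1], [2, 3]], lam=5.0)
```

In this orthonormal example the estimate evaluates to $1 + (|b| - 1)(1 - \lambda / \|y_b\|) = 1.5$ for the single active block. When all blocks have size one, the projector term vanishes and the estimate is the rank of $X_I$, i.e. the number of active coefficients when $X_I$ has full column rank, in line with the Lasso results of [1], [2].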

REFERENCES

[1] H. Zou, T. Hastie, and R. Tibshirani, “On the “degrees of freedom” of the Lasso,” The Annals of Statistics, vol. 35, no. 5, pp. 2173–2192, 2007.

[2] C. Dossal, M. Kachour, J. Fadili, G. Peyré, and C. Chesneau, “The degrees of freedom of penalized $\ell_1$ minimization,” to appear in Statistica Sinica, 2012. [Online]. Available: http://hal.archives-ouvertes.fr/hal-00638417

[3] B. Efron, “How biased is the apparent error rate of a prediction rule?”