Tight conditions for consistency of variable selection in the context of high dimensionality



HAL Id: hal-00602211

https://hal-enpc.archives-ouvertes.fr/hal-00602211v3

Submitted on 30 Mar 2012

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Tight conditions for consistency of variable selection in the context of high dimensionality

Laëtitia Comminges, Arnak S. Dalalyan

To cite this version:

Laëtitia Comminges, Arnak S. Dalalyan. Tight conditions for consistency of variable selection in the context of high dimensionality. Annals of Statistics, Institute of Mathematical Statistics, 2012, 40 (5), pp. 2667-2696. doi:10.1214/12-AOS1046. hal-00602211v3.


arXiv:1106.4293v2 [math.ST]

TIGHT CONDITIONS FOR CONSISTENCY OF VARIABLE SELECTION IN THE CONTEXT OF HIGH DIMENSIONALITY

BY LAËTITIA COMMINGES AND ARNAK S. DALALYAN

Université Paris Est/ENPC and ENSAE-CREST

We address the issue of variable selection in the regression model with very high ambient dimension, i.e., when the number of variables is very large. The main focus is on the situation where the number of relevant variables, called the intrinsic dimension and denoted by d∗, is much smaller than the ambient dimension d. Without assuming any parametric form of the underlying regression function, we obtain tight conditions making it possible to consistently estimate the set of relevant variables. These conditions relate the intrinsic dimension to the ambient dimension and to the sample size. The procedure that is provably consistent under these tight conditions is based on comparing quadratic functionals of the empirical Fourier coefficients with appropriately chosen threshold values.

The asymptotic analysis reveals the presence of two quite different regimes. The first regime is when d∗ is fixed. In this case the situation in nonparametric regression is the same as in linear regression, i.e., consistent variable selection is possible if and only if log d is small compared to the sample size n. The picture is different in the second regime, d∗ → ∞ as n → ∞, where we prove that consistent variable selection in the nonparametric set-up is possible only if d∗ + log log d is small compared to log n. We apply these results to derive minimax separation rates for the problem of variable selection.

1. Introduction. Real-world data such as those obtained from neuroscience, chemometrics, data mining, or sensor-rich environments are often extremely high-dimensional, severely underconstrained (few data samples compared to the dimensionality of the data), and interspersed with a large number of irrelevant or redundant features. Furthermore, in most situations the data are contaminated by noise, making it even more difficult to retrieve useful information from the data. Relevant variable selection is a compelling approach for addressing statistical issues in the scenario of high-dimensional and noisy data with small sample size. Starting from Mallows [29], Akaike [1] and Schwarz [36], who

The authors acknowledge the support of the French Agence Nationale de la Recherche (ANR) under the grant PARCIMONIE. The authors would like to thank the reviewers for very useful remarks.

AMS 2000 subject classifications: Primary 62G08; secondary 62H12, 62H15.

Keywords and phrases: variable selection, nonparametric regression, set estimation, sparsity pattern.



introduced respectively the famous criteria Cp, AIC and BIC, the problem of variable selection was extensively studied in the statistical and machine learning literature, both from the theoretical and algorithmic viewpoints. It appears, however, that the theoretical limits of performing variable selection in the context of nonparametric regression are still poorly understood, especially when the number of variables, denoted by d and referred to as the ambient dimension, is much larger than the sample size n. The purpose of the present work is to explore this setting under the assumption that the number of relevant variables, hereafter called the intrinsic dimension and denoted by d∗, may grow with the sample size but remains much smaller than d.

In the important particular case of linear regression, the latter scenario was the subject of a number of recent studies. Many of them rely on ℓ1-norm penalization [38, 46, 31] and constitute an attractive alternative to iterative variable selection procedures [2, 45] and to marginal regression or correlation screening [42, 18]. Promising results for feature selection are also obtained by conformal prediction [20], (minimax) concave penalties [16, 17, 44], the Bayesian approach [37]

and higher criticism [15]. Extensions to other settings, including logistic regression, the generalized linear model and the Ising model, were carried out in [8, 34, 18], respectively. Variable selection in the context of groups of variables with disjoint or overlapping groups was studied by [43, 24, 28, 32, 21]. Hierarchical procedures for selection of relevant variables were proposed by [3, 5, 47].

It is now well understood that in the Gaussian sequence model and in high-dimensional linear regression with a Gram matrix satisfying some variant of the irrepresentable condition, consistent estimation of the pattern of relevant variables, also called the sparsity pattern, is possible under the condition d∗ log(d/d∗) = o(n) as n → ∞ [41]. Furthermore, it is well known that if (d∗ log(d/d∗))/n remains bounded from below by some positive constant as n → ∞, then it is impossible to consistently recover the sparsity pattern [40].

Thus, a tight condition exists that describes in an exhaustive manner the interplay between the quantities d∗, d and n that guarantees the existence of consistent estimators. The situation is very different in the case of non-linear regression, since, to our knowledge, there is no result providing tight conditions for consistent estimation of the sparsity pattern.

The papers [26] and [4], closely related to the present work, considered the problem of variable selection in the nonparametric Gaussian regression model. They proved the consistency of the proposed procedures under some assumptions that, in the light of the present work, turn out to be suboptimal. More precisely, Lafferty and Wasserman [26] assumed the unknown regression function to be four times continuously differentiable with bounded derivatives. The


algorithm they proposed, termed Rodeo, is a greedy procedure performing simultaneously local bandwidth choice and variable selection. Rodeo is shown to converge when the ambient dimension d is O(log n / log log n) while the intrinsic dimension d∗ does not increase with n. On the other hand, Bertin and Lecué [4]

proposed a procedure based on the ℓ1-penalization of local polynomial estimators and proved its consistency when d∗ = O(1) but d is allowed to be as large as log n, up to a constant. They also have a weaker assumption on the regression function, merely assumed to belong to the Hölder class with smoothness β > 1.

To complete the picture, let us mention that estimation and hypotheses testing problems for high-dimensional nonparametric regression under sparse additive modeling were recently addressed in [25, 33, 19].

This brief review of the literature reveals that there is an important gap in consistency conditions for linear regression and for non-linear regression. For instance, if the intrinsic dimension d∗ is fixed, then the condition guaranteeing consistent estimation of the sparsity pattern is (log d)/n → 0 in linear regression, whereas it is d = O(log n) in the nonparametric case. While it is undeniable that nonparametric regression is much more complex than linear regression, it is not easy to find a justification for such an important gap between the two conditions. The situation is even worse in the case where d∗ → ∞. In fact, for the linear model with at most polynomially increasing ambient dimension d = O(n^k), it is possible to estimate the sparsity pattern for intrinsic dimensions d∗ as large as n^{1−ε}, for some ε > 0. In other words, the sparsity index can be almost of the same order as the sample size. In contrast, in nonparametric regression, there is no procedure that is proved to converge to the true sparsity pattern when both n and d∗ tend to infinity, even if d∗ grows extremely slowly.

In the present work, we fill this gap by introducing a simple variable selection procedure that selects the relevant variables by comparing some quadratic functionals of empirical Fourier coefficients to prescribed significance levels. Consistency of this procedure is established under some conditions on the triplet (d∗, d, n) and the tightness of these conditions is proved. The main take-away messages deduced from our results are the following:

• When the number of relevant variables d∗ is fixed and the sample size n tends to infinity, there exist positive real numbers c and c̄ such that (a) if (log d)/n ≤ c the estimator proposed in Section 3 is consistent and (b) no estimator of the sparsity pattern may be consistent if (log d)/n ≥ c̄.

• When the number of relevant variables d∗ tends to infinity with n → ∞, there exist real numbers c_i and c̄_i, i = 1, 2, such that c₁ > 0, c̄₁ > 0 and (a) if c₁d∗ + log log(d/d∗) − log n < c₂ the estimator proposed in Section 3 is consistent and (b) no estimator of the sparsity pattern may be consistent if c̄₁d∗ + log log(d/d∗) − log n > c̄₂.

• In particular, if d grows no faster than a polynomial in n, then there exist positive real numbers c₀ and c̄₀ such that (a) if d∗ ≤ c₀ log n the estimator proposed in Section 3 is consistent and (b) no estimator of the sparsity pattern may be consistent if d∗ ≥ c̄₀ log n.

In the regime of a growing intrinsic dimension d∗ → ∞ and a moderately large ambient dimension d = O(n^C), for some C > 0, we make a concentrated effort to get the constant c̄₀ as close as possible to the constant c₀. This goal is reached for the model of Gaussian white noise and, quite surprisingly, it required us to apply tools from complex analysis, such as the Jacobi θ-function and the saddle-point method, in order to evaluate the number of lattice points lying in a ball of a Euclidean space of increasing dimension.

The rest of the paper is organized as follows. The notation and assumptions necessary for stating our main results are presented in Section 2. In Section 3, an estimator of the set of relevant variables is introduced and its consistency is established in the case where the data come from the Gaussian white noise model. The main condition required in the consistency result involves the number of lattice points in a ball of a high-dimensional Euclidean space. An asymptotic equivalent for this number is presented in Section 4. Results on the impossibility of consistent estimation of the sparsity pattern are derived in Section 5. Section 6 is devoted to exploring adaptation to the unknown parameters (smoothness and degree of significance) and to recovering minimax rates of separation. Then, in Section 7, we show that some of our results can be extended to the model of nonparametric regression. The relations between the consistency and inconsistency results are discussed in Section 8. The technical parts of the proofs are postponed to the Appendix.

2. The problem formulation and the assumptions. We are interested in the variable selection task (also known as model selection, feature selection, or sparsity pattern estimation) in the context of high-dimensional non-linear regression. Let f : [0,1]^d → R denote the unknown regression function. We assume that the number of variables d is very large, possibly much larger than the sample size n, but that only a small number of these variables contribute to the fluctuations of the regression function f.

To be more precise, we assume that for some small subset J of the index set {1,...,d} satisfying Card(J) ≤ d∗, there is a function f̄ : R^{Card(J)} → R such that

f(x) = f̄(x_J),  ∀x ∈ R^d,

where x_J stands for the subvector of x obtained by removing from x all the coordinates with indices lying outside J. In what follows, we allow d and d∗ to


depend on n, but we will not always indicate this dependence in the notation. Note also that the genuine intrinsic dimension is Card(J); d∗ is merely a known upper bound on the intrinsic dimension. In what follows, we use the standard notation for the vector and sequence norms:

‖x‖₀ = Σ_j 1(x_j ≠ 0),  ‖x‖_p^p = Σ_j |x_j|^p, ∀p ∈ [1,∞),  ‖x‖_∞ = sup_j |x_j|,

for every x ∈ R^d or x ∈ R^N.

Let us stress right away that the primary aim of this work is to understand when it is possible to estimate the sparsity pattern J (with theoretical guarantees on the convergence of the estimator) and when it is impossible. The estimator that we will define in the next sections is intended to show the possibility of consistent estimation, rather than to provide a practical procedure for recovering the sparsity pattern. Therefore, the estimator will be allowed to depend on different constants appearing in the conditions imposed on the regression function f and on some characteristics of the noise.

To make the consistent estimation of the set J realizable, we impose some smoothness and identifiability assumptions on f. In order to describe the smoothness assumption imposed on f, let us introduce the trigonometric Fourier basis:

φ₀ ≡ 1 and

φ_k(x) = √2 cos(2πk·x) if k ∈ (Z^d)₊,  φ_k(x) = √2 sin(2πk·x) if −k ∈ (Z^d)₊,  (1)

where (Z^d)₊ denotes the set of all k ∈ Z^d \ {0} such that the first nonzero element of k is positive, and k·x stands for the usual inner product in R^d. In what follows, we use the notation 〈·,·〉 for the scalar product in L²([0,1]^d; R), that is, 〈h, h̃〉 = ∫_{[0,1]^d} h(x)h̃(x) dx for every h, h̃ ∈ L²([0,1]^d; R). Using this orthonormal Fourier basis, we define

Σ_L = { f : Σ_{k∈Z^d} k_j² 〈f, φ_k〉² ≤ L, ∀j ∈ {1,...,d} }.

To ease notation, we set θ_k[f] = 〈f, φ_k〉 for all k ∈ Z^d. In addition to the smoothness, we also need to require that the relevant variables are sufficiently relevant to make their identification possible. This is done by means of the following condition.

[C1(κ, L)] The regression function f belongs to Σ_L. Furthermore, for some subset J ⊂ {1,...,d} of cardinality ≤ d∗, there exists a function f̄ : R^{Card(J)} → R such that f(x) = f̄(x_J), ∀x ∈ R^d, and it holds that

Q_j[f] = Σ_{k : k_j ≠ 0} θ_k[f]² ≥ κ,  ∀j ∈ J.  (2)


One easily checks that Q_j[f] = 0 for every j that does not lie in the sparsity pattern. This provides a characterization of the sparsity pattern as the set of indices of the nonzero coefficients of the vector Q[f] = (Q₁[f],...,Q_d[f]).

Prior to describing the procedures for estimating J, let us comment on Condition [C1]. It is important to note that the identifiability assumption (2) can be rewritten as

∫_{[0,1]^d} ( f(x) − ∫₀¹ f(x) dx_j )² dx ≥ κ

and, therefore, is not intrinsically related to the basis we have chosen. In the case of a continuously differentiable and 1-periodic function f, the smoothness assumption f ∈ Σ_L can also be rewritten without using the trigonometric basis, since Σ_{k∈Z^d} k_j² θ_k[f]² = (2π)^{−2} ∫_{[0,1]^d} [∂_j f(x)]² dx. Thus, condition [C1] is essentially a constraint on the function f itself and not on its representation in the specific basis of trigonometric functions.

The results of this work can be extended with minor modifications to other types of smoothness conditions imposed on f, such as Hölder continuity or Besov regularity. In these cases the trigonometric basis (1) should be replaced by a basis adapted to the smoothness condition (spline, wavelet, etc.). Furthermore, even in the case of Sobolev smoothness, one can replace the set Σ_L, corresponding to smoothness order one, by any Sobolev ellipsoid of smoothness β > 0; see for instance [10], where the case β = 2 is explored. Roughly speaking, the role of the smoothness assumption is to reduce the statistical model with infinite-dimensional parameter f to a finite-dimensional model having good approximation properties. Any value of the smoothness order β > 0 leads to this reduction.

The value β = 1 is chosen for simplicity of exposition only.

3. Idealized setup: Gaussian white noise model. To convey the main ideas without taking care of some technical details, we start by focusing our attention on the Gaussian white noise model, which was proved to be asymptotically equivalent to the model of regression [6, 35], as well as to other nonparametric models [7, 13]. Thus, we assume that the available data consist of a Gaussian process {Y(φ) : φ ∈ L²([0,1]^d; R)} such that

E_f[Y(φ)] = ∫_{[0,1]^d} f(x)φ(x) dx,  Cov_f( Y(φ), Y(φ′) ) = (1/n) ∫_{[0,1]^d} φ(x)φ′(x) dx.

It is well known that these two properties uniquely characterize the probability distribution of a Gaussian process. An alternative representation of Y is

dY(x) = f(x) dx + n^{−1/2} dW(x),  x ∈ [0,1]^d,

where W(x) is a d-parameter Brownian sheet. Note that minimax estimation and detection of the function f in this set-up (but without a sparsity assumption) was studied by [23].


3.1. Estimation of J by multiple hypotheses testing. We intend to tackle the variable selection problem by multiple hypotheses testing; each hypothesis concerns a group of Fourier coefficients of the observed signal and suggests that all the elements within the group are zero. The rationale behind this approach is the following simple observation: since the trigonometric basis is orthonormal and contains the constant function,

j ∉ J ⟺ θ_k[f] = 〈f, φ_k〉 = 0, ∀k s.t. k_j ≠ 0.  (3)

This observation entails that if the intrinsic dimension |J| is small compared to d, then the sequence of Fourier coefficients is sparse. Furthermore, as explained below, there is a sort of group sparsity with overlapping groups.

For every ℓ ∈ {1,...,d}, we denote by P_ℓ^d the set of all subsets I of {1,...,d} having exactly ℓ elements: P_ℓ^d = { I ⊂ {1,...,d} : Card(I) = ℓ }. For every multi-index k ∈ Z^d, we denote by supp(k) the set of indices corresponding to the nonzero entries of k. To define the blocks of coefficients θ_k that will be tested for significance, we introduce the following notation: for every I ⊂ {1,...,d} and every j ∈ I, we set

V_I^j[f] = ( θ_k[f] : j ∈ supp(k) ⊂ I ).

It follows from (3) that the characterization

j ∉ J ⟺ max_I ‖V_I^j[f]‖_p = 0  (4)

holds true for every p ∈ [0, +∞]. Furthermore, again in view of (3), the maximum over I of the norms ‖V_I^j[f]‖_p is attained when I = J and is equal to the maximum over all subsets I such that Card(I) ≤ d∗. Summarizing these arguments, we can formulate the problem of variable selection as a problem of testing d null hypotheses

H_0^j : ‖V_I^j[f]‖_p = 0  ∀I ⊂ {1,...,d} such that Card(I) ≤ d∗.  (5)

If the hypothesis H_0^j is rejected, then the jth covariate is declared relevant.

Note that by virtue of assumption [C1], the alternatives can be written as

H_1^j : ‖V_I^j[f]‖₂² ≥ κ for some I ⊂ {1,...,d} such that Card(I) ≤ d∗.  (6)

Our estimator is based on this characterization of the sparsity pattern. If we denote by y_k the observable random variable Y(φ_k), we have

y_k = θ_k[f] + n^{−1/2} ξ_k,  θ_k = 〈f, φ_k〉,  k ∈ Z^d,  (7)

where {ξ_k ; k ∈ Z^d} form a countable family of independent Gaussian random variables with zero mean and variance equal to one. According to this property,


y_k is a good estimate of θ_k[f]: it is unbiased and has mean squared error equal to 1/n. Using the plug-in argument, this suggests estimating V_I^j by V̂_I^j = ( y_k : j ∈ supp(k) ⊂ I ) and the norm of V_I^j by the norm of V̂_I^j. However, since this amounts to estimating an infinite-dimensional vector, the estimation error will be infinitely large. To cope with this issue, we restrict the set of indices for which θ_k is estimated by y_k to a finite set, outside of which θ_k is merely estimated by 0. Such a restriction is justified by the fact that f is assumed to be smooth: Fourier coefficients corresponding to very high frequencies are very small.

Let us fix an integer m > 0, the cut-off level, and denote, for j ∈ I ⊂ {1,...,d},

S_m^{j,I} = { k ∈ Z^d : ‖k‖₂ ≤ m and {j} ⊂ supp(k) ⊂ I }.

Since the alternatives H_1^j are concerned with the ℓ2-norm, we build our test statistic on an estimate of the norm ‖V_I^j[f]‖₂. To this end, we introduce

Q̂_{m,I}^j = Σ_{k∈S_m^{j,I}} ( y_k² − 1/n ),

which is an unbiased estimator of Q_{m,I}^j = Σ_{k∈S_m^{j,I}} θ_k². Note that when m → ∞, the quantity Q_{m,I}^j approaches ‖V_I^j[f]‖₂². It is clear that larger values of m lead to a smaller bias, while the variance gets increased. Moreover, the variance of Q̂_{m,I}^j is proportional to the cardinality of the set S_m^{j,I}. The latter is an increasing function of Card(I). Therefore, if we aim at getting comparable estimation accuracies when estimating the functionals ‖V_I^j[f]‖₂² by Q̂_{m,I}^j for various I's, it is reasonable to make the cut-off level m vary with the cardinality of I.
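As a concrete illustration (not part of the original analysis), here is a minimal Python sketch of the statistic Q̂_{m,I}^j. The data structure is hypothetical: a dictionary y mapping multi-indices k (tuples) to the observed coefficients y_k, assumed to contain every k in the ball of radius m.

```python
import itertools

def S_mjI(m, j, I, d):
    """Enumerate k in Z^d with ||k||_2 <= m and {j} subset of supp(k) subset of I."""
    rng = range(-int(m), int(m) + 1)
    for vals in itertools.product(rng, repeat=len(I)):
        k = [0] * d                         # coordinates outside I stay zero
        for idx, v in zip(sorted(I), vals):
            k[idx] = v
        if k[j] != 0 and sum(v * v for v in k) <= m * m:
            yield tuple(k)

def Q_hat(y, n, m, j, I, d):
    """Q_hat^j_{m,I} = sum over S_m^{j,I} of (y_k^2 - 1/n); the dictionary y
    must contain every multi-index k with ||k||_2 <= m."""
    return sum(y[k] ** 2 - 1.0 / n for k in S_mjI(m, j, I, d))
```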

Thus, we consider a multivariate cut-off m = (m₁,...,m_{d∗}) ∈ N^{d∗}. For a subset I of cardinality ℓ ≤ d∗, we test the significance of the vector V_I^j[f] by comparing its estimate Q̂_{m_ℓ,I}^j with a prescribed threshold λ_ℓ. This leads us to define an estimator of the set J by

Ĵ_n(m, λ) = { j ∈ {1,...,d} : max_{ℓ≤d∗} λ_ℓ^{−1} max_{I∈P_ℓ^d} Q̂_{m_ℓ,I}^j ≥ 1 },

where m = (m₁,...,m_{d∗}) ∈ N^{d∗} and λ = (λ₁,...,λ_{d∗}) ∈ R₊^{d∗} are two vectors of tuning parameters. As already mentioned, the role of m is to ensure that the truncated sums Q_{m_s,J}^j do not deviate too much from the complete sums Q_J^j. Quantitatively speaking, for a given τ > 0, we would like to choose the m_ℓ's so that Q_{m_s,J}^j ≥ κτ/(τ+1), where s = Card(J). This guarantee can be achieved thanks to the smoothness assumption. Indeed, as proved in (26) (cf. Appendix B), it holds that

Q_{m_s,J}^j ≥ κ − m_s^{−2} L s,  ∀j ∈ J.

Therefore, choosing m_ℓ = ( ℓL(1+τ)/κ )^{1/2} for every ℓ = 1,...,d∗ entails the inequality Q_{m_s,J}^j ≥ κτ/(τ+1), which indicates that the relevance of variables is not affected too much by the truncation.

Pushing further the analogy with hypotheses testing, we define the Type I error of an estimator Ĵ_n of J as that of having Ĵ_n ⊄ J, i.e., classifying some irrelevant variables as relevant. The Type II error is then that of having J ⊄ Ĵ_n, which amounts to classifying some relevant variables as irrelevant. As in a testing problem, handling the Type I error is easier, since under the null the distribution of the test statistic does not depend on f: it is the maximum of a finite family of random variables drawn from translated and scaled χ²-distributions. Using the Bonferroni adjustment leads to the following control of the Type I error.

PROPOSITION 1. Let us denote by N(ℓ, γ) the cardinality of the set { k ∈ Z^ℓ : ‖k‖₂² ≤ γℓ and k₁ ≠ 0 }. If for some A > 1 and for every ℓ = 1,...,d∗,

λ_ℓ ≥ [ 2√( A N(ℓ, m_ℓ²/ℓ) d∗ log(2ed/d∗) ) + 2A d∗ log(2ed/d∗) ] / n,  (8)

then the Type I error P( Ĵ_n(m, λ) ⊄ J ) is upper-bounded by (2ed/d∗)^{−d∗(A−1)} and, therefore, tends to 0 as d → +∞.

This proposition shows that the Type I error of a variable selection procedure may be made small by choosing a sufficiently high threshold. By doing this, we run the risk of rejecting H_0^j very often and of drastically underestimating the set of relevant variables. The next result establishes a condition, which will be shown to be tight, ensuring that such an underestimation does not occur.

THEOREM 1. Let condition [C1(κ, L)] be satisfied with some known constants κ > 0 and L < ∞, and let s = Card(J). For some real numbers τ > 0 and A > 1, set m_ℓ = ( ℓL(1+τ)/κ )^{1/2}, ℓ = 1,...,d∗, and define λ_ℓ to be equal to the right-hand side of (8). If the condition

λ_s ≤ κτ/(1+τ)  (9)

is fulfilled, then Ĵ_n(m, λ) is consistent and satisfies the inequalities P( Ĵ_n(m, λ) ⊉ J ) ≤ 2(2ed/d∗)^{−d∗(A−1)} and P( Ĵ_n(m, λ) ≠ J ) ≤ 3(2ed/d∗)^{−d∗(A−1)}.

Condition (9), ensuring the consistency of the variable selection procedure Ĵ_n, admits a very natural interpretation: it is possible to detect relevant variables if the degree of relevance κ is larger than a multiple of the threshold λ_s, the latter being chosen according to the noise level.
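For illustration only, the following sketch assembles the pieces into the selector Ĵ_n(m, λ) with the tuning of the form (11) below, on data from the sequence model (7). The exhaustive scan over subsets I is exponential in d∗, so the dimensions are assumed tiny; the value of ϑ (which in the theory comes from L, κ and τ) and the constant A are user-supplied assumptions, and the helpers S_mjI and Q_hat are those of the earlier sketch.

```python
import itertools, math

def N_count(ell, gamma):
    """N(ell, gamma): cardinality of {k in Z^ell : ||k||_2^2 <= gamma*ell, k_1 != 0}."""
    r = int(math.sqrt(gamma * ell))
    box = itertools.product(range(-r, r + 1), repeat=ell)
    return sum(1 for k in box if k[0] != 0 and sum(v * v for v in k) <= gamma * ell)

def select_J(y, n, d, d_star, vartheta, A=2.0):
    """Toy version of J_hat_n(m, lambda): a variable j is declared relevant as
    soon as one statistic Q_hat exceeds its threshold lambda_ell."""
    log_term = d_star * math.log(2 * math.e * d / d_star)
    J_hat = set()
    for ell in range(1, d_star + 1):
        m_ell = math.sqrt(ell * vartheta)                 # cut-off (ell * vartheta)^{1/2}
        lam = (2 * math.sqrt(A * N_count(ell, vartheta) * log_term)
               + 2 * A * log_term) / n                    # threshold as in (11)
        for I in itertools.combinations(range(d), ell):
            for j in I:
                if j not in J_hat and Q_hat(y, n, m_ell, j, I, d) >= lam:
                    J_hat.add(j)
    return J_hat
```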


A first observation is that this theorem provides interesting insight into the possibility of consistent recovery of the sparsity pattern J in the context of fixed intrinsic dimension. In fact, when d∗ remains bounded from above as n → ∞ and d → ∞, we get that P( Ĵ_n(m, λ) = J ) → 1 provided that

log d ≤ Const · n.  (10)

Although we did not find (exactly) this result in the statistical literature on variable selection, it can be checked that (10) is a necessary and sufficient condition for recovering the sparsity pattern J in linear regression with fixed sparsity d∗ and growing dimension d and sample size n. Thus, in the regime of fixed or bounded d∗, sparsity pattern estimation in nonparametric regression is not more difficult than in parametric linear regression, as far as only the consistency of estimation is considered and the precise value of the constant in (10) is neglected. Furthermore, there is a simple estimator Ĵ_n^{(1)} of J (cf. Eq. (3) in [10]) which is provably consistent under condition (10). This estimator can be seen as a procedure for testing the hypotheses H_0^j of form (5) with p = ∞ and, therefore, it does not really exploit the structure of the Fourier coefficients of the regression function. To some extent, this is the reason why, in the regime of growing intrinsic dimension d∗ → ∞, the estimator Ĵ_n^{(1)} proposed by [10] is no longer optimal.

In fact, when d∗ → ∞, the term N(s, m_s²/s) present in (9) tends to infinity as well. Furthermore, as we show in Section 4, this convergence takes place at an exponential rate in d∗. Therefore, in this asymptotic set-up it is crucial to have the right order of N(s, m_s²/s) in the condition that ensures consistency. As shown in Section 5, this is the case for condition (9).

REMARK 1. An apparent drawback of the estimator Ĵ_n is the large dimensionality of the tuning parameters involved in Ĵ_n. However, Theorem 1 reveals that, for achieving good selection power, it is sufficient to select the 2d∗-dimensional tuning parameter (m, λ) on a one-dimensional curve parameterized by ϑ = L(1+τ)/κ. Indeed, once the value of ϑ is given, Thm. 1 advocates choosing

m_ℓ = (ℓϑ)^{1/2} and λ_ℓ = [ 2√( A N(ℓ, ϑ) d∗ log(2ed/d∗) ) + 2A d∗ log(2ed/d∗) ] / n  (11)

for every ℓ = 1,...,d∗. As discussed in Section 6.1, this property allows us to relax the requirement that the values L and κ involved in [C1] are known in advance.

REMARK 2. The result of the last theorem is in some sense adaptive w.r.t. the unknown sparsity. Indeed, while the estimator Ĵ_n involves d∗, which is merely a known upper bound on the true sparsity s = Card(J) and may be significantly larger than s, it is the true sparsity s that appears in condition (9) as the first argument of the quantity N(·, ϑ). This point is important given the exponential rate of divergence of N(·, ϑ) when its first argument tends to infinity. On the other hand, if condition (9) is satisfied with N(d∗, ϑ) instead of N(Card(J), ϑ), then consistent estimation of J can be achieved by a slightly simpler procedure:

J̃_n(m, λ) = { j ∈ {1,...,d} : max_{I∈P_{d∗}^d} Q̂_{m_{d∗},I}^j ≥ λ_{d∗} }.

The proof of this statement is similar to that of Thm. 1 and will be omitted.

4. Counting lattice points in a ball. The aim of the present section is to investigate the properties of the quantity N(d∗, γ) involved in the conditions ensuring the consistency of the proposed procedures. Quite surprisingly, the asymptotic behavior of N(d∗, γ) turns out to be related to the Jacobi θ-function. To show this, let us introduce some notation. For a positive number γ, we set

C₁(d, γ) = { k ∈ Z^d : k₁² + ... + k_d² ≤ γd },  C₂(d, γ) = { k ∈ C₁(d, γ) : k₁ = 0 },

along with N₁(d, γ) = Card C₁(d, γ) and N₂(d, γ) = Card C₂(d, γ). In simple words, N₁(d, γ) is the number of lattice points lying in the d-dimensional ball of radius (γd)^{1/2} centered at the origin, while N₂(d, γ) is the number of (integer) lattice points lying in the (d−1)-dimensional ball of radius (γd)^{1/2} centered at the origin. With this notation, the quantity N(ℓ, ·) of Theorem 1 can be written as N₁(ℓ, ·) − N₂(ℓ, ·). By volumetric arguments, one can check that

V(d)(√γ − 1)^d d^{d/2} ≤ N₁(d, γ) ≤ V(d)(√γ + 1)^d d^{d/2},

where V(d) = π^{d/2}/Γ(1 + d/2) is the volume of the unit ball in R^d. Furthermore, similar bounds hold true for N₂(d, γ) as well. Unfortunately, when d → ∞, these inequalities are not accurate enough to yield non-trivial results in the problem of variable selection we are dealing with. This is especially true for the results on impossibility of consistent estimation stated in Section 5.

In order to determine the asymptotic behavior of N₁(d, γ) and N₂(d, γ) when d tends to infinity, we rely on their integral representation through Jacobi's θ-function. Recall that the latter is given by h(z) = Σ_{r∈Z} z^{r²}, which is well defined for any complex number z in the unit disk |z| < 1. To briefly explain where the relation between N_i(d, γ) and the θ-function comes from, let us denote by {a_r} the sequence of coefficients of the power series of h(z)^d, that is, h(z)^d = Σ_{r≥0} a_r z^r. One easily checks that, for all r ∈ N, a_r = Card{ k ∈ Z^d : k₁² + ... + k_d² = r }. Thus, for every γ such that γd is an integer, we have N₁(d, γ) = Σ_{r=0}^{γd} a_r. As a consequence of Cauchy's theorem, we get:

N₁(d, γ) = (1/2πi) ∮ h(z)^d z^{−γd} dz / ( z(1 − z) ),


FIG 1. Lattice points in a ball of radius R = γd = 3.2 in the three-dimensional space (d = 3). Red points are those of C₁(d, γ) \ C₂(d, γ), while blue stars are those of C₂(d, γ). In this example, N(d, γ) = N(3, 1.07) = 110.

where the integral is taken over any circle |z| = w with 0 < w < 1. Exploiting this representation and applying the saddle-point method thoroughly described in [14], we get the following result.

PROPOSITION 2. Let γ > 0 be an integer and let l_γ(z) = log h(z) − γ log z.

1. There is a unique solution z_γ in (0,1) to the equation l′_γ(z) = 0. Furthermore, the function γ ↦ z_γ is increasing and l″_γ(z_γ) > 0.

2. For i = 1, 2, the following equivalences hold true, as d tends to infinity:

N_i(d, γ) = ( h(z_γ) / z_γ^γ )^d · (1 + o(1)) / [ h(z_γ)^{i−1} z_γ (1 − z_γ) ( 2 l″_γ(z_γ) π d )^{1/2} ].

Hereafter, it will be useful to note that the second part of Prop. 2 yields

log( N₁(d, γ) − N₂(d, γ) ) = d l_γ(z_γ) − (1/2) log d + c_γ + o(1), as d → ∞,  (12)

with c_γ = log[ ( h(z_γ) − 1 ) / ( h(z_γ) z_γ (1 − z_γ) ( 2π l″_γ(z_γ) )^{1/2} ) ]. Furthermore, while the asymptotic equivalences of Prop. 2 are established for integer values of γ > 0, the relation log( N₁(d, γ) − N₂(d, γ) ) = d l_γ(z_γ)(1 + o(1)) holds true for any positive real number γ [30]. In order to get an idea of how the terms z_γ and l_γ(z_γ) depend on γ, we depict in Fig. 2 the plots of these quantities as functions of γ > 0.
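These quantities are straightforward to evaluate numerically. The sketch below (with ad hoc truncation levels, adequate only for moderate γ) finds z_γ by bisection on l′_γ(z) = h′(z)/h(z) − γ/z, evaluates l_γ(z_γ), and counts N₁(d, γ) exactly by convolving the power-series coefficients of h(z)^d, so the leading-order relation log N₁(d, γ) ≈ d·l_γ(z_γ) can be checked on small examples.

```python
import math

def theta_h(z, rmax=200):
    """h(z) = sum_{r in Z} z^{r^2} and h'(z), truncated at |r| <= rmax."""
    h, h1 = 1.0, 0.0
    for r in range(1, rmax + 1):
        e = r * r
        h += 2.0 * z ** e
        h1 += 2.0 * e * z ** (e - 1)
    return h, h1

def z_gamma(gamma, tol=1e-12):
    """Unique root in (0,1) of l'_gamma(z) = h'(z)/h(z) - gamma/z."""
    lo, hi = 1e-9, 1.0 - 1e-9
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        h, h1 = theta_h(mid)
        if h1 / h - gamma / mid < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def l_gamma(gamma):
    z = z_gamma(gamma)
    h, _ = theta_h(z)
    return math.log(h) - gamma * math.log(z)

def N1_exact(d, gamma):
    """Number of k in Z^d with ||k||^2 <= gamma*d, via d-fold convolution."""
    R = int(gamma * d)
    a = [0] * (R + 1)                    # coefficients of h(z) up to degree R
    for r in range(int(math.isqrt(R)) + 1):
        a[r * r] += 1 if r == 0 else 2
    coef = [1] + [0] * R                 # start from h(z)^0 = 1
    for _ in range(d):
        coef = [sum(coef[i] * a[s - i] for i in range(s + 1)) for s in range(R + 1)]
    return sum(coef)
```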

Combining relation (12) with Thm. 1, we get the following result.


FIG 2. The plots of the mappings γ ↦ z_γ and γ ↦ l_γ(z_γ). One can observe that both functions are increasing; the first one converges to 1 very rapidly, while the second one seems to diverge very slowly.

COROLLARY 3. Let condition [C1(κ, L)] be satisfied with some known constants κ > 0 and L < ∞. Consider the asymptotic set-up in which both d∗ = d∗_n and d = d_n tend to infinity as n → ∞. Assume that d grows at a sub-exponential rate in n, that is, log log d = o(log n). If

limsup_{n→∞} d∗ / log n < 2 / l_γ(z_γ)

with γ = L/κ, then consistent estimation of J is possible and can be achieved, for instance, by the estimator Ĵ_n.
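As a usage note, the consistency cap of Corollary 3 can be evaluated with the functions from the sketch above; the values of γ below are arbitrary placeholders.

```python
# Hypothetical illustration of Corollary 3: for gamma = L/kappa, consistent
# selection is possible while d* stays below roughly (2 / l_gamma(z_gamma)) * log n.
for gamma in (1, 5, 25):
    print(f"gamma = {gamma:3d}: d*/log n must stay below about {2.0 / l_gamma(gamma):.3f}")
```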

5. Tightness of the assumptions. In this section, we focus our attention on the functional class Σ(κ, L) of all functions satisfying assumption [C1(κ, L)]. For emphasizing that J is the sparsity pattern of the function f, we write J_f instead of J. We assume that s = Card(J) = d∗. The goal is to provide conditions under which consistent estimation of the sparsity support is impossible, that is, there exist a constant c > 0 and an integer n₀ ∈ N such that, if n ≥ n₀,

inf_{J̃} sup_{f∈Σ(κ,L)} P_f( J̃ ≠ J_f ) ≥ c,

where the inf is over all possible estimators of J_f. To this end, we introduce a set of M+1 probability distributions μ₀,...,μ_M on Σ(κ, L) and use the fact that

inf_{J̃} sup_{f∈Σ(κ,L)} P_f( J̃ ≠ J_f ) ≥ inf_{J̃} (1/M) Σ_{ℓ=1}^M ∫_{Σ(κ,L)} P_f( J̃ ≠ J_f ) μ_ℓ(df).  (13)

These measures μ_ℓ will be chosen in such a way that for each ℓ ≥ 1 there is a set J_ℓ of cardinality d∗ such that μ_ℓ{ J_f = J_ℓ } = 1 and all the sets J₁,...,J_M are distinct.

The measure μ₀ is the Dirac measure at 0. Considering these μ_ℓ's as "priors" on Σ(κ, L) and defining the corresponding "posteriors" P₀, P₁,...,P_M by

P_ℓ(A) = ∫_{Σ(κ,L)} P_f(A) μ_ℓ(df), for every measurable set A ⊂ R^n,

we can rewrite inequality (13) as

inf_{J̃} sup_{f∈Σ(κ,L)} P_f( J̃ ≠ J_f ) ≥ inf_ψ (1/M) Σ_{ℓ=1}^M P_ℓ( ψ ≠ ℓ ),  (14)

where the inf is taken over all random variables ψ taking values in {0,...,M}. The latter inf will be controlled using a suitable version of Fano's lemma. To state it, we denote by K(P, Q) the Kullback-Leibler divergence between two probability measures P and Q defined on the same probability space.

LEMMA 4 (Corollary 2.6 of [39]). Let M ≥ 3 be an integer, let (X, A) be a measurable space and let P₀,...,P_M be probability measures on (X, A). Set p̄_{e,M} = inf_ψ (1/M) Σ_{ℓ=1}^M P_ℓ( ψ ≠ ℓ ), where the inf is taken over all measurable functions ψ : X → {1,...,M}. If for some 0 < α < 1,

(1/(M+1)) Σ_{ℓ=1}^M K( P_ℓ, P₀ ) ≤ α log M,

then p̄_{e,M} ≥ 1/2 − α.

We apply this lemma with X being the set of all arrays y = { y_k : k ∈ Z^d } such that, for some K > 0, the entries y_k vanish for every k larger than K in ℓ2-norm. It follows from Fano's lemma that one can deduce a lower bound on p̄_{e,M}, the quantity we are interested in, from an upper bound on the average Kullback-Leibler divergence between P_ℓ and P₀. With these tools at hand, we are in a position to state the main result on the impossibility of consistent estimation of the sparsity pattern in the case when the conditions of Thm. 1 are violated.
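To see how Lemma 4 interacts with the sequence model (7), note that when μ_ℓ is a Dirac mass at a single function f_ℓ (a simplifying assumption; for genuine mixtures, convexity of the Kullback-Leibler divergence yields an upper bound of the same form), the measures P_ℓ are products of Gaussians and the divergences in the lemma reduce to squared L2-norms:

```latex
% Under (7), y_k ~ N(theta_k[f_ell], 1/n) independently over k, while P_0 is pure noise,
% so the KL divergence sums the per-coordinate terms theta^2 / (2 sigma^2) with sigma^2 = 1/n:
\[
\mathcal{K}(\mathbb{P}_\ell,\mathbb{P}_0)
  \;=\; \sum_{k\in\mathbb{Z}^d} \frac{\theta_k[f_\ell]^2}{2/n}
  \;=\; \frac{n}{2}\,\|f_\ell\|_{L^2([0,1]^d)}^2 ,
\]
% hence the hypothesis of Lemma 4 becomes a bound on the average squared norm:
% (n / (2(M+1))) * sum_{ell=1}^{M} ||f_ell||_2^2 <= alpha * log M.
```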

THEOREM 2. Assume that ϑ = L/κ > 1 and d − d∗ ≥ 3. Let γ_ϑ be the largest integer satisfying γ( 1 + (h(z_γ) − 1)^{−1} ) ≤ ϑ, where the Jacobi θ-function h and z_γ are those defined in Section 4.

i) If for some α ∈ (0, 1/2),

N(d∗, γ_ϑ) d∗ log(d/d∗) / n² ≥ ϑκ² / (αγ_ϑ),  (15)

then, for d∗ large enough, inf_{J̃} sup_{f∈Σ} P_f( J̃ ≠ J_f ) ≥ 1/2 − α.

ii) If for some α ∈ (0, 1/2),

d∗ log(d/d∗) / n ≥ κ/α,  (16)

then inf_{J̃} sup_{f∈Σ} P_f( J̃ ≠ J_f ) ≥ 1/2 − α.

It is worth stressing here that condition (15) is the converse of condition (9) of Thm. 1 in the case d∗ → ∞, in the sense that condition (9) amounts to requiring that the left-hand side of (15) is smaller than some constant. There is, however, one difference between the quantities involved in these conditions: the term N(d∗, ϑ(1+τ)) of (9) is replaced by N(d∗, γ_ϑ) in condition (15). One can wonder how close γ_ϑ is to ϑ. To give a qualitative answer to this question, we plotted in Figure 3 the curve of the mapping ϑ ↦ γ_ϑ along with the bisector ϑ ↦ ϑ. We observe that the difference between the two curves is small compared to ϑ. As we discuss later, this property shows that the constants involved in the necessary condition and in the sufficient condition for consistent estimation of J are very close, especially for large values of ϑ.

FIG 3. The curve of the function ϑ ↦ γ_ϑ (blue) and the bisector (red).

6. Adaptivity and minimax rates of separation.

6.1. Adaptation with respect to L and κ. The estimator Ĵ(m, λ) we introduced in Section 3 is clearly nonadaptive: the tuning parameters (m, λ) recommended by the developed theory involve the values L and κ, which are generally unknown. Fortunately, we can take advantage of the fact that the choice of m and λ is governed by the one-dimensional parameter ϑ = L(1+τ)/κ. Therefore, it is realistic to assume that a finite grid of values 1 < ϑ₁ ≤ ... ≤ ϑ_K < ∞ is available containing the true value of ϑ. The following result provides an adaptive procedure of variable selection with guaranteed control of the error.

PROPOSITION 5. Let 1 < ϑ₁ ≤ ... ≤ ϑ_K < ∞ and τ > 0 be given values, and set¹

i∗ = min{ i : (1+τ) [ max_{j=1,...,d} Σ_{k∈Z^d} k_j² θ_k² ] / [ min_{j∈J} Σ_{k : k_j≠0} θ_k² ] ≤ ϑ_i } ∧ K.

For every i, ℓ ∈ N, let us denote Ĵ_n(i) = Ĵ_n( m(ϑ_i), λ(ϑ_i) ) with m_ℓ(ϑ) = (ϑℓ)^{1/2} and

λ_ℓ(ϑ) = [ 2√( 2 N(ℓ, ϑ) d∗ log(2ed/d∗) ) + 4 d∗ log(2ed/d∗) ] / n.

¹We use the convention that the minimum over an empty set equals +∞.

If the condition λ_s(ϑ_{i∗}) < κτ/(1+τ) is fulfilled, then the estimator Ĵ_n^{ad} = ∪_{i=1}^K Ĵ_n(i) satisfies P( Ĵ_n^{ad} ≠ J ) ≤ (K+2)(d∗/2ed)^{d∗}.

In simple words, if the grid of possible values {ϑ_i} has a cardinality K which is not too large (that is, K(d∗/2ed)^{d∗} → 0), then declaring a variable relevant as soon as at least one of the procedures Ĵ_n(i) suggests its relevance provides a consistent and adaptive variable selection strategy. The proof of this statement follows immediately from Prop. 1 and Thm. 1. Indeed, applying Prop. 1 with A = 2 yields P( Ĵ_n^{ad} ⊄ J ) ≤ Σ_{i=1}^K P( Ĵ_n(i) ⊄ J ) ≤ K(d∗/2ed)^{d∗}, while Thm. 1 ensures that P( Ĵ_n^{ad} ⊉ J ) ≤ P( Ĵ_n(i∗) ⊉ J ) ≤ 2(d∗/2ed)^{d∗}.
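A toy version of this adaptive rule, reusing select_J from the sketch of Section 3 (the grid values below are placeholders):

```python
def select_J_adaptive(y, n, d, d_star, grid=(1.5, 3.0, 6.0, 12.0)):
    """Union over a grid of candidate values of vartheta, as in Prop. 5."""
    J_ad = set()
    for vartheta in grid:
        J_ad |= select_J(y, n, d, d_star, vartheta)   # default A=2.0 matches Prop. 5
    return J_ad
```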

6.2. Minimax rates of separation. Since the methodology of Section 3 takes its roots in the theory of hypotheses testing, one naturally wonders what the minimax rates of separation are in the problem of variable selection. The results stated in the foregoing sections allow us to answer this question in the case of Sobolev smoothness one and alternatives separated in L2-norm. The following result, the proof of which is postponed to Appendix E, provides the minimax rates. We assume herein that the true sparsity s = Card(J) and its known upper estimate d∗ are such that d∗/s is bounded from above by some constant.

PROPOSITION 6. There is a constant D depending only on L such that if

κ ≥ D [ ( log(d/s)/n² )^{2/(4+s)} ∨ s log(d/s)/n ],

then there exists a consistent estimator of J. Furthermore, the consistency is uniform in f ∈ Σ(κ, L). On the other hand, there is a constant D′ depending only on L such that if

κ ≤ D′ [ ( log(d/s)/n² )^{2/(4+s)} ∨ s log(d/s)/n ],

then uniformly consistent estimation of J is impossible.

Borrowing the terminology of the theory of hypotheses testing, we say that

( log(d/s)/n² )^{2/(4+s)} ∨ s log(d/s)/n

is the minimax rate of separation in the problem of variable selection for Sobolev smoothness one. These results readily extend to Sobolev smoothness of any order β ≥ 1, in which case the rate of separation takes the form ( log(d/s)/n² )^{2β/(4β+s)} ∨ s log(d/s)/n. The first term in this maximum coincides, up to the logarithmic term, with the minimax rate of separation in the problem of detection of an s-dimensional signal [22]. Note, however, that in our case this logarithmic inflation is unavoidable: it is the price to pay for not knowing in advance which s variables are relevant.


7. Nonparametric regression with random design. So far, we have analyzed the situation in which noisy observations of the regression function f(·) are available at all points x ∈ [0,1]^d. Let us now turn to the more realistic model of nonparametric regression, where the observed noisy values of f are sampled at random in the unit hypercube [0,1]^d. More precisely, we assume that n independent and identically distributed pairs of input-output variables (X_i, Y_i), i = 1,...,n, are observed that obey the regression model

Y_i = f(X_i) + σε_i,  i = 1,...,n.

The input variables X₁,...,X_n are assumed to take values in R^d, while the output variables Y₁,...,Y_n are scalar. As usual, ε₁,...,ε_n are such that E[ε_i | X_i] = 0, i = 1,...,n; additional conditions will be imposed later. Without requiring f to be of a special parametric form, we aim at recovering the set J ⊂ {1,...,d} of its relevant variables. The noise magnitude σ is assumed to be known.
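In this model the coefficients θ_k[f] are no longer observed directly. A natural plug-in, sketched below under the simplifying assumption of a uniform design (g ≡ 1), which need not coincide with the estimator actually analyzed in this section, replaces y_k by the empirical coefficient θ̂_k = n^{−1} Σ_i Y_i φ_k(X_i):

```python
import math

def phi(k, x):
    """Basis function phi_k from (1): cosine if the first nonzero entry of k is
    positive, sine if it is negative, and the constant 1 for k = 0."""
    first = next((kj for kj in k if kj != 0), 0)
    if first == 0:
        return 1.0
    dot = sum(kj * xj for kj, xj in zip(k, x))
    if first > 0:
        return math.sqrt(2.0) * math.cos(2.0 * math.pi * dot)
    return math.sqrt(2.0) * math.sin(2.0 * math.pi * dot)

def empirical_coef(k, X, Y):
    """theta_hat_k = (1/n) sum_i Y_i phi_k(X_i); unbiased for theta_k[f]
    when X_i ~ Uniform([0,1]^d) and E[eps_i | X_i] = 0."""
    return sum(y_i * phi(k, x_i) for x_i, y_i in zip(X, Y)) / len(X)
```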

It is clear that the estimation of J cannot be accomplished without imposing some further assumptions on f and on the distribution P_X of the input variables. Roughly speaking, we will assume that f is differentiable with a square-integrable gradient and that P_X admits a density which is bounded from below. More precisely, let g denote the density of P_X w.r.t. the Lebesgue measure.

[C2] g(x) = 0 for any x ∉ [0,1]^d and g(x) ≥ g_min for any x ∈ [0,1]^d.

The next assumptions, imposed on the regression function and on the noise, require their boundedness in an appropriate sense. These assumptions are needed in order to prove, by means of a concentration inequality, the closeness of the empirical coefficients to the true ones.

[C3(L∞, L₂)] The L∞([0,1]^d, R, P_X) and L₂([0,1]^d, R, P_X) norms of the function f are bounded from above by L∞ and L₂ respectively, i.e., P( |f(X)| ≤ L∞ ) = 1 and E[f(X)²] ≤ L₂².

[C4] The noise variables satisfy, almost surely, E[ e^{tε_i} | X_i ] ≤ e^{t²/2} for all t > 0.

We stress once again that the primary aim of this work is merely to understand when it is possible to consistently estimate the sparsity pattern. The estimator that we will define is intended to show the possibility of consistent estimation, rather than being a practical procedure for recovering the sparsity pattern. Therefore, the estimator will be allowed to depend on the parameters g_min, L∞, κ and L₂ appearing in conditions [C1-C3].

7.1. An estimator of J and its consistency. The estimator of the sparsity pattern J that we are going to introduce now is based on the following simple observation.
