
HAL Id: hal-00371237

https://hal.archives-ouvertes.fr/hal-00371237

Preprint submitted on 27 Mar 2009

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Sparse classification boundaries

Yuri Ingster, Christophe Pouet, Alexandre Tsybakov

To cite this version:

Yuri Ingster, Christophe Pouet, Alexandre Tsybakov. Sparse classification boundaries. 2009. ⟨hal-00371237⟩


Sparse classification boundaries

Yuri I. Ingster¹, Christophe Pouet² and Alexandre B. Tsybakov³

March 27, 2009

Abstract

Given a training sample of size $m$ from a $d$-dimensional population, we wish to allocate a new observation $Z \in \mathbb{R}^d$ to this population or to the noise.

We suppose that the difference between the distribution of the population and that of the noise is only in a shift, which is a sparse vector. For the Gaussian noise, fixed sample size $m$, and the dimension $d$ that tends to infinity, we obtain the sharp classification boundary and we propose classifiers attaining this boundary. We also give extensions of this result to the case where the sample size $m$ depends on $d$ and satisfies the condition $(\log m)/\log d \to \gamma$, $0 \le \gamma < 1$, and to the case of non-Gaussian noise satisfying the Cramér condition.

Keywords: Bayes risk, classification boundary, high-dimensional data, optimal classifier, sparse vectors

¹ Research partially supported by the RFBI Grant 08-01-00692-a and by Grant NSh–638.2008.1.

² Research partially supported by the grant ANR-07-BLAN-0234 and by PICS–2715.

³ Research partially supported by the grant ANR-06-BLAN-0194, by the PASCAL Network of Excellence and Isaac Newton Institute for Mathematical Sciences in Cambridge (Statistical Theory and Methods for Complex, High-Dimensional Data Programme, 2008).

1 Introduction

1.1 Model and problem

Let $X = (X_1,\dots,X_n)$ and $Y = (Y_1,\dots,Y_m)$ be two i.i.d. samples from two different populations with probability distributions $P_X$ and $P_Y$ on $\mathbb{R}^d$, respectively. Here

$$X_i = (X_{i1},\dots,X_{id}), \qquad Y_j = (Y_{j1},\dots,Y_{jd}),$$

where $X_{ik}$ and $Y_{jk}$ are the components of $X_i$ and $Y_j$. We consider the problem of discriminant analysis when the dimension $d$ of the observations is very large (tends to $+\infty$). Assume that we observe a random vector $Z = (Z_1,\dots,Z_d)$ independent of $(X, Y)$ and we know that the distribution of $Z$ is either $P_X$ or $P_Y$. Our aim is to classify $Z$, i.e., to decide whether $Z$ comes from the population with distribution $P_X$ or from that with distribution $P_Y$.

In this paper we assume that

Xik =vk+ξik, Yjk =uk+ηkj, (1.1) where v = (v1, . . . , vd), u = (u1, . . . , ud) are deterministic mean vectors and the errors ξi1, . . . , ξid, η1j, . . . , ηjd are (unless other conditions are explicitly mentioned) jointly i.i.d. zero mean random variables with probability density f on IR.

Distinguishing between $P_X$ and $P_Y$ presents a difficulty only when the vectors $v$ and $u$ are close to each other. A particular type of closeness for large $d$ can be characterized by the sparsity assumption [9, 1] that we shall adopt in this paper.

As in [9, 1], we introduce the following set of sparse vectors in $\mathbb{R}^d$, characterized by a positive number $a_d$ and a sparsity index $\beta \in (0,1]$:

$$U_{\beta,a_d} = \Big\{\, u = (u_1,\dots,u_d):\ u_k = a_d\varepsilon_k,\ \varepsilon_k \in \{0,1\},\ c\,d^{1-\beta} \le \sum_{k=1}^d \varepsilon_k \le C\,d^{1-\beta} \,\Big\}.$$

Here $0 < c < C < +\infty$ are two constants that are supposed to be fixed throughout the paper. The value $p = d^{-\beta}$ can be interpreted as the "probability" of occurrence of non-zero components in vector $u$.

In what follows we shall deal only with a special case of model (1.1) that was also considered recently by [4]. Namely, we assume:

$$v = 0, \qquad u \in U_{\beta,a_d}.$$

In this paper we establish the classification boundary, i.e., we specify the necessary and sufficient conditions on $\beta$ and $a_d$ such that successful classification is possible. Let us first define the notion of successful classification. We shall need some notation. Let $\psi$ be a decision rule, i.e., a measurable function of $X, Y, Z$ with values in $[0,1]$. If $\psi = 0$ we allocate $Z$ to the $P_X$-population, whereas for $\psi = 1$ we allocate $Z$ to the $P_Y$-population. The rules $\psi$ taking intermediate values in $(0,1)$ can be interpreted as randomized decision rules. Let $P_{H_0}^{(u)}$ and $P_{H_1}^{(u)}$ denote the joint probability distributions of $X, Y, Z$ when $Z \sim P_X$ and $Z \sim P_Y$ respectively, and let $E_{H_0}^{(u)}, \mathrm{Var}_{H_0}^{(u)}$ and $E_{H_1}^{(u)}, \mathrm{Var}_{H_1}^{(u)}$ denote the corresponding expectation and variance operators. We shall also denote by $P^{(u)}$ the distribution of $Y$ and by $E^{(u)}, \mathrm{Var}^{(u)}$ the corresponding expectation and variance operators. Consider the Bayes risk

$$R_B(\psi) = \pi\,E_{H_0}^{(u)}(\psi) + (1-\pi)\,E_{H_1}^{(u)}(1-\psi),$$

where $0 < \pi < 1$ is a prior probability of the $P_X$-population, and the maximum risk

$$R_M(\psi) = \max\Big(E_{H_0}^{(u)}(\psi),\ E_{H_1}^{(u)}(1-\psi)\Big).$$

Let $R(\psi)$ be either the Bayes risk $R_B(\psi)$ or the maximum risk $R_M(\psi)$.


We shall say that successful classification is possible if $\beta$ and $a_d$ are such that

$$\lim_{d\to+\infty}\ \inf_{\psi}\ \sup_{u \in U_{\beta,a_d}} R(\psi) = 0 \qquad (1.2)$$

for $R = R_M$ and $R = R_B$ with any fixed $0 < \pi < 1$. Conversely, we say that successful classification is impossible if $\beta$ and $a_d$ are such that

$$\liminf_{d\to+\infty}\ \inf_{\psi}\ \sup_{u \in U_{\beta,a_d}} R(\psi) = R_{\max}, \qquad (1.3)$$

where $R_{\max} = 1/2$ for $R = R_M$ and $R_{\max} = \min(\pi, 1-\pi)$ for $R = R_B$ with $0 < \pi < 1$.

We call (1.2) the upper bound of classification and (1.3) the lower bound of classification. The lower bound (1.3) for the maximum risk $R = R_M$ is interpreted as the fact that no decision rule is better (in a minimax sense) than the simple random guess. For the Bayes risk $R_B$, the lower bound (1.3) is attained at the degenerate decision rule that does not depend on the observations: $\psi \equiv 0$ if $\pi > 1/2$ or $\psi \equiv 1$ if $\pi \le 1/2$.

The condition on $(\beta, a_d)$ corresponding to the passage from (1.2) to (1.3) is called the classification boundary. We shall say that a classifier $\psi = \psi_d$ is asymptotically optimal (or that $\psi$ attains the classification boundary) if, for all $\beta$ and $a_d$ such that successful classification is possible, we have

$$\lim_{d\to+\infty}\ \sup_{u \in U_{\beta,a_d}} R(\psi) = 0, \qquad (1.4)$$

where $R = R_M$ or $R = R_B$ with any fixed $0 < \pi < 1$.

1.2 Main results

According to the value of $\beta$, we shall distinguish between moderately sparse vectors and highly sparse vectors. This division depends on the relation between $m$ and $d$. For $m$ not too large, i.e., when $\log m = o(\log d)$, moderately sparse vectors correspond to $\beta \in (0, 1/2]$ and highly sparse vectors to $\beta \in (1/2, 1)$. For large $m$, i.e., when $\log m \sim \gamma\log d$, $\gamma \in (0,1)$, moderate sparsity corresponds to $\beta \in (0, (1-\gamma)/2]$ and high sparsity to $\beta \in ((1-\gamma)/2, 1-\gamma)$.

The classification boundary for moderately sparse vectors is obtained in a relatively simple way (cf. Section 2). It is of the form

$$R_d = d^{1/2-\beta} a_d \asymp 1. \qquad (1.5)$$

This means that successful classification is possible if $R_d \to +\infty$, and it is impossible if $R_d \to 0$ as $d \to +\infty$. The result is valid both for $\beta \in (0,1/2]$ and $m \ge 1$ fixed, or for $m$ depending on $d$ such that $\log m \sim \gamma\log d$, $\gamma \in (0,1)$ as $d \to +\infty$ and $\beta \in (0,(1-\gamma)/2]$. Moreover, (1.5) holds under weak assumptions on the noise. In particular, for the upper bound of classification we only need to assume that the noise has mean zero and finite second moment (cf. Section 2). The lower bound is proved under a mild regularity condition on the density $f$ of the noise.

The case of highly sparse vectors is more involved. We establish the classification boundary for the following scenarios:

(A) $m \ge 1$ is a fixed integer, and the noise density $f$ is Gaussian $\mathcal N(0,\sigma^2)$ with known or unknown $\sigma > 0$;

(B) $m \to +\infty$ as $d \to +\infty$, $\log m = o(\log d)$, and $f$ is Gaussian $\mathcal N(0,\sigma^2)$ with known or unknown $\sigma > 0$;

(C) $\log m \sim \gamma\log d$, $\gamma \in (0,1)$, and $f$ is Gaussian $\mathcal N(0,\sigma^2)$ with known or unknown $\sigma > 0$.

The upper bounds are extended to the following additional scenario:

(D) $m \to +\infty$ as $d \to +\infty$, $(\log m)/\log d \to \gamma$, $0 \le \gamma < 1$, $m/\log d \to +\infty$, and the noise satisfies the Cramér condition.

The conditions on the noise in (A)–(D) are crucial and, as we shall see later, they suggest that a special dependence of $a_d$ on $d$ and $m$ of the form $a_d \asymp \sqrt{(\log d)/m}$ is meaningful in the highly sparse case. More specifically, we take

$$a_d = \sigma s\sqrt{\log d}, \qquad x_1 = s\sqrt{m+1}, \qquad (1.6)$$

where $x_1 > 0$ is fixed. The classification boundary in (A), (B), (D) is then expressed by the following condition on $\beta$, $s$ and $m$:

$$x_1 = \phi(\beta), \qquad (1.7)$$

where

$$\phi(\beta) = \begin{cases}\phi_1(\beta) & \text{if } 1/2 < \beta \le 3/4,\\ \phi_2(\beta) & \text{if } 3/4 < \beta < 1,\end{cases} \qquad (1.8)$$

with

$$\phi_1(\beta) = \sqrt{2\beta - 1}, \qquad \phi_2(\beta) = \sqrt2\,\big(1 - \sqrt{1-\beta}\big). \qquad (1.9)$$

In other words, successful classification is possible if $x_1 \ge \phi(\beta) + \delta$, and it is impossible if $x_1 \le \phi(\beta) - \delta$, for any $\delta > 0$ and $d$ large enough. This classification boundary is also extended to the case where $x_1$ depends on $d$ but stays bounded.
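As a purely numerical illustration of (1.7)–(1.9) (our own sketch; the function names are not from the paper):

```python
import numpy as np

def phi(beta):
    """Boundary function (1.8)-(1.9) for beta in (1/2, 1):
    phi_1 on (1/2, 3/4], phi_2 on (3/4, 1)."""
    beta = np.asarray(beta, dtype=float)
    phi1 = np.sqrt(2 * beta - 1)                 # sqrt(2*beta - 1)
    phi2 = np.sqrt(2) * (1 - np.sqrt(1 - beta))  # sqrt(2)*(1 - sqrt(1 - beta))
    return np.where(beta <= 0.75, phi1, phi2)

def classification_possible(x1, beta):
    """Successful classification requires x1 to exceed phi(beta)."""
    return x1 > phi(beta)

print(phi([0.6, 0.75, 0.9]))              # [0.447..., 0.707..., 0.967...]
print(classification_possible(0.8, 0.9))  # False: 0.8 < phi(0.9) ≈ 0.967
```

The two branches meet at $\beta = 3/4$, where $\phi_1 = \phi_2 = 1/\sqrt2 \approx 0.707$.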

For Scenario (C), let $a_d = \sigma x\sqrt{(\log d)/m}$ with fixed $x > 0$. We show that in this framework successful classification is impossible if $\beta > 1-\gamma$ (cf. item 1 in Section 2), and therefore we are interested in $\beta \in ((1-\gamma)/2, 1-\gamma)$. Set $\beta^* = \beta/(1-\gamma) \in (1/2,1)$ and $x^* = x/\sqrt{1-\gamma}$. Then the classification boundary is of the form $x^* = \phi(\beta^*)$, for the function $\phi$ defined above.


Note that if $f$ is known, the distribution $P_X$ is also known. This means that we do not need the sample $X$ to construct decision rules. Thus, in Scenarios (A), (B) and (C) when $\sigma$ is known we can suppose w.l.o.g. that only the sample $Y$ is available; this remark remains valid in the case of unknown $\sigma$, as we shall see later. As to Scenario (D), we shall also treat it under the assumption that only the sample $Y$ is available (w.l.o.g. if $f$ is known), to be consistent with other results.

However, if f is not known, the sample X contains additional information which can be used. The results for this case under Scenario (D) are similar to those that we obtain below but they are left beyond the scope of the paper.

For $m = 0$ (i.e., when there is no sample $Y$) the problem that we consider here reduces to the problem of signal detection in growing dimension $d$, cf. [6, 7, 8, 9, 10, 1, 12], and our classification boundary coincides with the detection boundary established in [6]. Sharp asymptotics in the detection problem was studied in [6] (see also [9], Chapter 8) for known $a_d$ or $\beta$. The adaptive problem (corresponding to unknown $a_d$ and $\beta$) was studied in [7, 8]. Various procedures attaining the detection boundary were proposed in [10, 1, 12]. Ingster and Suslina [10] introduced a method attaining the detection boundary based on the combination of three different procedures for the zones $\beta \in (0,1/2]$, $\beta \in (1/2,3/4]$ and $\beta \in (3/4,1)$. Later Donoho and Jin [1] showed that a test based on the higher criticism statistic attains the detection boundary simultaneously for these zones. More recently Jager and Wellner [12] proved that the same is true for a large class of statistics including the higher criticism statistic.

The paper of Hall et al. [4] deals with the same classification model as the one we consider here but studies a problem which is different from ours. They analyse the conditions under which some simple (for example, minimum distance) classifiers $\psi$ satisfy

$$\lim_{d\to+\infty} E_{H_0}^{(u)}(\psi) = 0. \qquad (1.10)$$

Hall et al. [4] conclude that for minimum distance classifiers (1.10) holds if and only if $0 < \beta < 1/2$. This implies that such classifiers cannot be optimal for $1/2 \le \beta < 1$. They also derive (1.10) for some other classifiers in the case $m = 1$.

The results of this paper and their extensions to the multi-class setting were summarized in [14] and presented at the Meeting "Rencontres de Statistique Mathématique" (Luminy, December 16-21, 2008) and at the Oberwolfach meeting "Sparse Recovery Problems in High Dimensions: Statistical Inference and Learning Theory" (March 15-21, 2009). In a work parallel to ours, Donoho and Jin [2, 3] and Jin [11] independently and contemporaneously have analysed a setting less general than the present one. They did not consider a minimax framework, but rather demonstrated that the higher criticism (HC) methodology can be successfully extended to the classification problem. Donoho and Jin [3] showed that, for a special case of Scenario (B), the "ideal" HC statistic attains the same upper bound of classification that we prove below. Together with our lower bound, this implies that the "ideal" HC statistic is asymptotically optimal, in the sense defined above, for Scenario (B). Donoho and Jin announce that similar results for the HC statistic in Scenarios (A) and (C) will appear in their work in preparation.


This paper is organized as follows. Section 2 contains some preliminary remarks. In Section 3 we present the classification boundary and an asymptotically optimal classifier for moderately sparse vectors under rather general conditions on the noise. In Section 4 we give the classification boundary and asymptotically optimal classifiers for highly sparse vectors under Scenarios (A), (B) and (C). Section 5 provides an extension to Scenario (D). Proofs of the lower and upper bounds of classification are given in Sections 6 and 7, respectively.

2 Preliminary remarks

In this section we collect some basic remarks on the problem, assuming that $f$ is the standard Gaussian density. As a starting point, we discuss some natural limitations for $a_d$.

1. Remark that $a_d$ cannot be too small. Indeed, assume that instead of the set $U_{\beta,a_d}$ we have only one vector $u = (a_d\varepsilon_1,\dots,a_d\varepsilon_d)$ with known $\varepsilon_k \in \{0,1\}$. Then we get a familiar problem of classification with two given Gaussian populations.

The notion of classification boundary can be defined here in the same terms as above, and the explicit form of the boundary can be derived from the standard textbook results. It is expressed through the behavior of $Q_d^2 = a_d^2\sum_{k=1}^d\varepsilon_k$:

– if $Q_d \to 0$, then successful classification is impossible:

$$\liminf_{d\to+\infty}\ \inf_{\psi} R(\psi) = R_{\max};$$

– if $Q_d \to +\infty$, then successful classification is realized by the maximum likelihood classifier $\psi = \mathbb 1\{T > 0\}$, where

$$T = \sum_{k=1:\ \varepsilon_k=1}^{d}\big(Z_k - a_d/2\big).$$

Here and below $\mathbb 1\{\cdot\}$ denotes the indicator function.

If we assume that $\sum_{k=1}^d\varepsilon_k \asymp d^{1-\beta}$, we immediately obtain some consequences for our model defined in Section 1. We see that successful classification in that model is impossible if $a_d$ is so small that $d^{1-\beta}a_d^2 = o(1)$, and it makes sense to consider only such $a_d$ that

$$d^{1-\beta}a_d^2 \to +\infty. \qquad (2.1)$$

In particular, for $\gamma > 1-\beta$ successful classification is impossible under Scenario (C) with $a_d \asymp \sqrt{(\log d)/m}$.

We shall see later that (2.1) is a rough condition, which is necessary but not sufficient for successful classification in the model of Section 1. For example, in that model with $\beta \in (0,1/2]$ and fixed $m$, the value $a_d$ should be substantially larger than given by the condition $d^{1-\beta}a_d^2 \asymp 1$, cf. (1.5).
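A minimal sketch of the oracle classifier from item 1, assuming $\sigma = 1$ and that both the support $(\varepsilon_1,\dots,\varepsilon_d)$ and $a_d$ are known; the function names are ours and the snippet only illustrates the role of $Q_d$:

```python
import numpy as np

def oracle_classifier(Z, eps, a_d):
    """Maximum likelihood classifier for a known sparse shift:
    psi = 1{T > 0} with T = sum over {k : eps_k = 1} of (Z_k - a_d / 2)."""
    T = np.sum(Z[eps.astype(bool)] - a_d / 2)
    return int(T > 0)

def Q_d(eps, a_d):
    """Q_d = a_d * sqrt(sum(eps)); success requires Q_d -> infinity."""
    return a_d * np.sqrt(eps.sum())
```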


2. Our second remark is that, on the other extreme, for sufficiently large $a_d$ the problem is trivial. Specifically, non-trivial results can be expected only under the condition

$$x = a_d\sqrt{m/\log d}\ \le\ 2\sqrt2. \qquad (2.2)$$

Indeed, assume that

$$x > 2\sqrt2. \qquad (2.3)$$

Then the problem becomes simple in the sense that successful classification is easily realisable under (2.3) and the classical condition (2.1). Indeed, take an analog $\tilde T$ of the statistic $T$ where $a_d$ and $\varepsilon_k$ are replaced by their natural estimators:

$$\tilde T = \sum_{k=1}^d\Big(Z_k - \frac{S_{Yk}}{2\sqrt m}\Big)\hat\varepsilon_k, \quad\text{with}\quad S_{Yk} = \frac{1}{\sqrt m}\sum_{i=1}^m Y_{ik}, \quad \hat\varepsilon_k = \mathbb 1\big\{S_{Yk} > \sqrt{2\log d}\big\}, \qquad (2.4)$$

and consider the classifier $\tilde\psi = \mathbb 1\{\tilde T > 0\}$. We can write $S_{Yk} = \varepsilon_k\lambda + \zeta_k$ with $\lambda = a_d\sqrt m$, where the $\zeta_k$ are independent standard normal random variables. It is well known that $\max_{k=1,\dots,d}|\zeta_k| \le \sqrt{2\log d}$ with probability tending to 1 as $d\to+\infty$. This and (2.3) imply that, with probability tending to 1, the vector $(\hat\varepsilon_1,\dots,\hat\varepsilon_d)$ recovers exactly $(\varepsilon_1,\dots,\varepsilon_d)$ and the statistic $\tilde T$ coincides with

$$\hat T = \sum_{k=1:\ \varepsilon_k=1}^{d}\Big(Z_k - \frac{S_{Yk}}{2\sqrt m}\Big).$$

Since $E^{(u)}(S_{Yk}/\sqrt m) = \varepsilon_k a_d$ and $\mathrm{Var}^{(u)}(S_{Yk}/\sqrt m) = 1/m$, we find:

$$E_{H_0}^{(u)}(\hat T) = -\frac{a_d}{2}\sum_{k=1}^d\varepsilon_k, \qquad E_{H_1}^{(u)}(\hat T) = \frac{a_d}{2}\sum_{k=1}^d\varepsilon_k,$$

$$\mathrm{Var}_{H_0}^{(u)}(\hat T) = \mathrm{Var}_{H_1}^{(u)}(\hat T) = \Big(1 + \frac{1}{4m}\Big)\sum_{k=1}^d\varepsilon_k.$$

It follows from Chebyshev's inequality that under (2.1) we have

$$E_{H_0}^{(u)}(\tilde\psi) = P_{H_0}^{(u)}(\tilde T > 0) \to 0, \qquad E_{H_1}^{(u)}(1-\tilde\psi) = P_{H_1}^{(u)}(\tilde T \le 0) \to 0 \qquad (2.5)$$

as $d\to+\infty$. Note that this argument is applicable in the general model of Section 1 (since the convergence in (2.5) is uniform in $u \in U_{\beta,a_d}$), implying successful classification by $\tilde\psi$ under conditions (2.1) and (2.3).

3. Let us now discuss a connection between conditions (2.1) and (2.3). First, (2.3) implies (2.1) if $m$ is not too large:

$$m = o\big(d^{1-\beta}\log d\big). \qquad (2.6)$$

On the other hand, if $m$ is very large:

$$\exists\, b > 0:\quad m \ge b\,d^{1-\beta}\log d, \qquad (2.7)$$

then we have $x^2 \ge b\,a_d^2 d^{1-\beta}$, and condition (2.1) implies (2.3). Thus, the relation

$$d^{1-\beta}a_d^2 \asymp 1$$

determines the classification boundary in the general model of Section 1 if $m$ is very large (satisfies (2.7)).

4. Finally, note that we can control conditions (2.3) and (2.2) by their data-driven counterparts. In fact, $\max_{k=1,\dots,d}|\zeta_k| \le \sqrt{2\log d}$ with probability tending to 1 as $d\to+\infty$. Hence, if (2.2) holds, then $M_Y = \max_{1\le k\le d} S_{Yk} \le 3\sqrt{2\log d}$ with the same probability. It is therefore convenient to consider the following pre-classifier taking values in $\{0, 1, ND\}$ ($ND$ means "No Decision", i.e., we need to switch to some other classifier):

$$\psi_{pre} = \begin{cases} 0 & \text{if } \tilde T \le 0,\ M_Y > 3\sqrt{2\log d},\\ 1 & \text{if } \tilde T > 0,\ M_Y > 3\sqrt{2\log d},\\ ND & \text{if } M_Y \le 3\sqrt{2\log d}, \end{cases}$$

where $\tilde T$ is given by (2.4). The argument in item 2 implies that $\psi_{pre}$ classifies successfully if $ND$ is not chosen. Under condition (2.2) the pre-classifier chooses $ND$ with probability tending to 1 and then we apply one of the classifiers suggested below in this paper. We prove their optimality under assumption (2.2).
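A sketch of the statistic (2.4) and of the pre-classifier $\psi_{pre}$, assuming standard Gaussian noise ($\sigma = 1$); the helper names are ours and the string "ND" stands for the no-decision value:

```python
import numpy as np

def T_tilde(Z, Y):
    """Statistic (2.4): thresholded analogue of the oracle statistic."""
    m, d = Y.shape
    S_Y = Y.sum(axis=0) / np.sqrt(m)              # S_Yk: normalized column sums
    eps_hat = (S_Y > np.sqrt(2 * np.log(d))).astype(float)
    return np.sum((Z - S_Y / (2 * np.sqrt(m))) * eps_hat)

def psi_pre(Z, Y):
    """Pre-classifier: decide only if M_Y = max_k S_Yk is large, else return 'ND'."""
    m, d = Y.shape
    M_Y = (Y.sum(axis=0) / np.sqrt(m)).max()
    if M_Y <= 3 * np.sqrt(2 * np.log(d)):
        return "ND"                               # switch to a refined classifier
    return int(T_tilde(Z, Y) > 0)
```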

The above remarks can be easily extended to the case of Gaussian errors with known variance $\sigma^2 > 0$ by using the normalization $Z_k/\sigma$, $S_{Yk}/\sigma$. Moreover, they extend to the case of non-Gaussian errors under the Cramér condition and the additional assumption $m/\log d \to +\infty$ (cf. Section 5).

3 Classification boundary for moderately sparse vectors

In this section we consider the case of moderately sparse vectors. To simplify the notation, we set without loss of generality $\sigma = 1$. Assume that $R_d = d^{1/2-\beta}a_d$ satisfies

$$\lim_{d\to+\infty} R_d = +\infty \qquad (3.1)$$

and consider the classifier based on a linear statistic:

$$\psi_{lin} = \mathbb 1\{T > 0\}, \qquad T = \sum_{k=1}^d\Big(Z_k - \frac{1}{2m}\sum_{i=1}^m Y_{ik}\Big).$$

Note that $T$ is similar to the statistic $\tilde T$ defined in (2.4), with the difference that in $T$ we do not threshold to estimate the positions of the non-zero $\varepsilon_k$. Indeed, here we do not necessarily assume (2.3), and thus there is no guarantee that the $\varepsilon_k$ can be correctly recovered.

Assume that $\eta_{jk}$ and $\xi_{ik}$ for all $k, j, i$ are random variables with zero mean and variance 1 (we do not suppose here that the $\eta_{jk}$ have the same distribution as the $\xi_{ik}$).


Then the means of $Y_{ik}$ and $Z_k$ are $E^{(u)}(Y_{ik}) = E_{H_1}^{(u)}(Z_k) = \varepsilon_k a_d$, $E_{H_0}^{(u)}(Z_k) = 0$, their variances are equal to 1, and we have:

$$E_{H_0}^{(u)}(T) = -\frac{a_d}{2}\sum_{k=1}^d\varepsilon_k, \qquad E_{H_1}^{(u)}(T) = \frac{a_d}{2}\sum_{k=1}^d\varepsilon_k, \qquad \mathrm{Var}_{H_0}^{(u)}(T) = \mathrm{Var}_{H_1}^{(u)}(T) = d\Big(1 + \frac{1}{4m}\Big).$$

We consider now a vector $u \in \mathbb R^d$ of the form

$$u = (u_1,\dots,u_d):\quad u_k = a_d\varepsilon_k,\ \ \varepsilon_k \in \{0,1\},\ \ \sum_{k=1}^d\varepsilon_k \ \ge\ c\,d^{1-\beta}. \qquad (3.2)$$

By (3.2), Chebyshev's inequality and (3.1), we obtain

$$E_{H_0}^{(u)}(\psi_{lin}) = P_{H_0}^{(u)}(T > 0) \le P_{H_0}^{(u)}\Big(T - E_{H_0}^{(u)}(T) > c\,d^{1-\beta}a_d/2\Big) \le \frac{4d}{(c\,d^{1-\beta}a_d)^2}\Big(1 + \frac{1}{4m}\Big) \ \to\ 0$$

as $d\to+\infty$. An analogous argument yields that $E_{H_1}^{(u)}(1-\psi_{lin}) \to 0$. The convergence here is uniform in $u$ satisfying (3.2), and thus uniform in $u \in U_{\beta,a_d}$. Therefore, we have the following result.

Theorem 3.1 Let $\eta_{jk}$ and $\xi_{ik}$ for all $k, j, i$ be random variables with zero mean and variance 1. If (3.1) holds, then successful classification is possible and it is realized by the classifier $\psi_{lin}$.

Remark 3.1 We have proved theorem 3.1 with the set of vectors $u$ defined by (3.2), which is larger than $U_{\beta,a_d}$. The upper bound on $\sum_k\varepsilon_k$ in the definition of $U_{\beta,a_d}$ is not needed. Also the $\eta_{jk}$ need not have the same distribution as the $\xi_{ik}$ and their variances need not be equal to 1. It is easy to see that the result of theorem 3.1 remains valid if these random variables have unknown variances uniformly bounded by an (unknown) constant.
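A small Monte Carlo sketch in the spirit of theorem 3.1 (our own illustration: the constants, the Bernoulli model for $\varepsilon_k$ and the number of repetitions are arbitrary choices, and the empirical risk is only indicative):

```python
import numpy as np

def psi_lin(Z, Y):
    """Linear classifier: psi_lin = 1{T > 0}, T = sum_k (Z_k - (1/(2m)) sum_i Y_ik)."""
    m = Y.shape[0]
    T = np.sum(Z - Y.sum(axis=0) / (2 * m))
    return int(T > 0)

rng = np.random.default_rng(1)
d, beta, m, n_rep = 20_000, 0.4, 3, 200
a_d = 5 * d ** (beta - 0.5)                # chosen so that R_d = d**(1/2-beta) * a_d = 5
eps = (rng.random(d) < d ** (-beta)).astype(float)
u = a_d * eps

errors = 0
for _ in range(n_rep):
    Y = u + rng.standard_normal((m, d))
    Z0 = rng.standard_normal(d)            # Z ~ P_X
    Z1 = u + rng.standard_normal(d)        # Z ~ P_Y
    errors += psi_lin(Z0, Y) + (1 - psi_lin(Z1, Y))
print("empirical misclassification rate:", errors / (2 * n_rep))
```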

The corresponding lower bound is given in the next theorem. For $a > 0$, $t \in \mathbb R$, set

$$\ell_a(t) = f(t-a)/f(t), \qquad D_a = \int \ell_a^2(t)\,f(t)\,dt,$$

and

$$D_d(m, a, \beta) = d^{1-2\beta} D_a^m\,(D_a - 1).$$

Theorem 3.2 Let either $m \ge 1$ be fixed or $m = m_d \to +\infty$. If

$$\lim_{d\to+\infty} D_d(m, a_d, \beta) = 0, \qquad (3.3)$$

then successful classification is impossible.


Proof of theorem 3.2 is given in Section 6.

Corollary 3.1 Let $f$ be the density of the standard normal distribution. If

$$\lim_{d\to+\infty} R_d = 0, \qquad (3.4)$$

then successful classification is impossible for $\beta \in (0,1/2]$ and $m$ fixed, or for $\beta \in (0,1/2)$ and $m = m_d \to +\infty$ such that $m = O(d^{1-2\beta})$.

Proof. For the standard normal errors we have $D_a = e^{a^2}$. Therefore, condition (3.3) can be satisfied only if $m a_d^2 = o(1)$ as $d\to+\infty$. Moreover, in this case

$$D_d(m, a_d, \beta) \asymp d^{1-2\beta}a_d^2\,(1 + m a_d^2) \asymp R_d^2. \qquad (3.5)$$

Thus, if $m a_d^2 = o(1)$, conditions (3.3) and (3.4) are equivalent. Now, (3.4) and the assumption $\beta \in (0,1/2]$ imply $a_d = o(1)$. This proves the corollary for fixed $m$. Also, if $\beta \in (0,1/2)$ and $m = m_d \to +\infty$ such that $m = O(d^{1-2\beta})$, then $m a_d^2 = O(R_d^2) = o(1)$. □
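For completeness, the Gaussian computation behind $D_a = e^{a^2}$ used in this proof is the following (our own verification):

$$D_a = \int_{\mathbb R}\frac{f^2(t-a)}{f(t)}\,dt = \int_{\mathbb R}\frac{1}{\sqrt{2\pi}}\exp\Big(-(t-a)^2 + \frac{t^2}{2}\Big)\,dt = e^{a^2}\int_{\mathbb R}\frac{1}{\sqrt{2\pi}}e^{-(t-2a)^2/2}\,dt = e^{a^2}.$$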

Remark 3.2 Relation (3.5) is valid for a larger class of noise distributions, e.g., for non-Gaussian noise with finite Fisher information. Indeed, assume that $\ell_a(t)$ is $L_2(f)$-differentiable at the point $a = 0$, i.e., there exists a function $\ell(\cdot)$ such that

$$\|\ell_a(\cdot) - 1 - a\,\ell(\cdot)\|_f = o(a), \qquad 0 < \|\ell(\cdot)\|_f < +\infty, \qquad (3.6)$$

where $\|g(\cdot)\|_f^2 = \int_{\mathbb R} g^2(x)\,f(x)\,dx$. Observe that

$$\|\ell(\cdot)\|_f^2 = \int_{\mathbb R}\frac{(f'(x))^2}{f(x)}\,dx = I(f)$$

is the Fisher information of $f$ (with $f'$ defined in a somewhat stronger sense than, for instance, in [5]). Under assumption (3.6) we have

$$D_a = 1 + \|\ell_a(\cdot) - 1\|_f^2, \qquad \|\ell_a(\cdot) - 1\|_f^2 = a^2\,(I(f) + o(1)) \quad\text{as } a \to 0.$$

Combining remarks 3.1 and 3.2 with theorems 3.1 and 3.2 we see that relation (1.5) determines the classification boundary for $\beta \in (0,1/2]$ and fixed $m$, or for $\beta \in (0,1/2)$ and $m \to +\infty$, $m = O(d^{1-2\beta})$, if the errors have zero mean, finite variance and finite Fisher information.

As corollaries of theorems 3.1 and 3.2 we can establish classification boundaries for particular choices of $a_d$. Recall that non-trivial results can be expected only if $a_d$ satisfies (2.2). For instance, consider $a_d = d^{-s}$ with some $s > 0$. Then for fixed $m$ the classification boundary in the region $\beta \in (0,1/2]$ is given by $s = 1/2 - \beta$, i.e., successful classification is possible if $s < 1/2 - \beta$, and is impossible if $s > 1/2 - \beta$. Other choices of $a_d$ appear to be less interesting when $\beta \in (0,1/2]$. For example, in the next section we consider the sequence $a_d = s\sqrt{(\log d)/m}$ with some $s > 0$. If $a_d$ is chosen in this way, successful classification is possible for all $\beta \in (0,1/2]$ with no exception, so that there is no classification boundary in this range of $\beta$.

Finally, note that theorem 3.1 is valid for all $\beta \in (0,1)$. However, for $\beta > 1/2$ its assumption $\lim_{d\to+\infty} R_d = +\infty$ guaranteeing successful classification is much too restrictive as compared to the correct classification boundary that we shall derive in the next section. The lower bound of theorem 3.2 is also valid for all $\beta \in (0,1)$. However, we shall see in the next section that it is not tight for highly sparse vectors when $\beta > 3/4$ (cf. proof of theorem 4.1).

4 Classification boundary for highly sparse vectors

We now analyse the case of highly sparse vectors, i.e., we suppose that $\beta \in (1/2,1)$ if $\log m = o(\log d)$, and $\beta^* = \beta/(1-\gamma) \in (1/2,1)$ if $\log m \sim \gamma\log d$, $\gamma \in (0,1)$. We shall show that the classification boundary for this case is expressed in terms of the function

$$\phi(\beta) = \begin{cases}\phi_1(\beta) & \text{if } 1/2 < \beta \le 3/4,\\ \phi_2(\beta) & \text{if } 3/4 < \beta < 1,\end{cases}$$

where the functions $\phi_1$ and $\phi_2$ are defined in (1.9). Note that $\phi_1$ and $\phi_2$ are monotone increasing on $(1/2,1)$, satisfy $\phi_1(\beta) \le \phi_2(\beta)$ for all $\beta \in (1/2,1)$, and the equality $\phi_1(\beta) = \phi_2(\beta)\ (= 1/\sqrt2)$ holds if and only if $\beta = 3/4$.
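For completeness, a short check of this comparison (our own computation, not in the original text): setting $t = \sqrt{1-\beta} \in (0, 1/\sqrt2)$,

$$\phi_2^2(\beta) - \phi_1^2(\beta) = 2(1-t)^2 - (1 - 2t^2) = 4t^2 - 4t + 1 = (2t-1)^2 \ \ge\ 0,$$

with equality if and only if $t = 1/2$, i.e., $\beta = 3/4$.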

The following notation will be useful in the sequel:

$$T_d = \sqrt{\log d}, \qquad s = s_d = a_d/(\sigma T_d), \qquad (4.1)$$

and

$$x = s\sqrt m, \qquad x_0 = s\,m/\sqrt{m+1}, \qquad x_1 = s\sqrt{m+1}, \qquad x^* = \frac{x}{\sqrt{1-\gamma}}. \qquad (4.2)$$

Clearly, $x_0 < x < x_1$. We allow $s, x, x_0, x_1$ to depend on $d$ but do not indicate this dependence in the notation for the sake of brevity. We shall also suppose throughout that (2.2) holds, so that $x_1 = O(1)$ as $d\to+\infty$.

4.1 Lower bound

The next theorem gives a lower bound of classification for highly sparse vectors.

Theorem 4.1 Let the noise density $f$ be Gaussian $\mathcal N(0,\sigma^2)$, $\sigma^2 > 0$. Assume that $\beta \in (1/2,1)$ and $\limsup_{d\to+\infty} x_1 < \phi(\beta)$. Then successful classification is impossible for fixed $m$ and for $m = m_d \to +\infty$.

Proof of theorem 4.1 is given in Section 6.

Though theorem 4.1 is valid with no restriction on $m$, it does not provide a correct classification boundary if $m$ is large, i.e., $\log m \sim \gamma\log d$, $\gamma \in (0,1)$, as in Scenarios (C) and (D). The correct lower bound for large $m$ is given in the next theorem.


Theorem 4.2 Consider Scenario (C) with $\beta^* = \beta/(1-\gamma) \in (1/2,1)$ and $a_d = \sigma x\sqrt{(\log d)/m}$. Assume that $\limsup_{d\to+\infty} x^* < \phi(\beta^*)$. Then successful classification is impossible.

Proof of theorem 4.2 is given in Section 6.

Recall that, by an elementary argument, under Scenario (C) and for $a_d$ as in theorem 4.2, successful classification is impossible if $\beta > 1-\gamma$ (cf. the remark after (2.1)). This is the reason why in theorem 4.2 we consider only $\beta < 1-\gamma$.

4.2 Upper bounds for fixed m

We now propose optimal classifiers attaining the lower bound of theorem 4.1 under Scenario (A). First, we consider a procedure that attains the classification boundary only for $\beta \in [3/4,1)$ but has a simple structure. Introduce the statistics

$$M_0 = \max_{1\le k\le d} S_{Yk}, \qquad M = \max_{1\le k\le d} S_{Zk},$$

where

$$S_{Yk} = \frac{1}{\sqrt m}\sum_{i=1}^m Y_{ik}, \qquad S_{Zk} = \frac{1}{\sqrt{m+1}}\Big(Z_k + \sum_{i=1}^m Y_{ik}\Big). \qquad (4.3)$$

Define

$$\Lambda_M = \frac{M}{\max\big(\sqrt2\,\sigma T_d,\ M_0\big)}.$$

Taking a small $c_0 > 0$, consider the classifier of the form:

$$\psi_{max} = \mathbb 1\{\Lambda_M > 1 + c_0\}.$$
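A sketch of $\psi_{max}$ under Scenario (A) with known $\sigma$ (our own implementation sketch; the default values of $\sigma$ and $c_0$ are illustrative):

```python
import numpy as np

def psi_max(Z, Y, sigma=1.0, c0=0.05):
    """Max-type classifier: psi_max = 1{Lambda_M > 1 + c0}, where
    Lambda_M = M / max(sqrt(2) * sigma * T_d, M_0) and M_0, M are the
    maxima of the normalized sums S_Yk, S_Zk from (4.3)."""
    m, d = Y.shape
    T_d = np.sqrt(np.log(d))
    col_sums = Y.sum(axis=0)
    S_Y = col_sums / np.sqrt(m)              # S_Yk
    S_Z = (Z + col_sums) / np.sqrt(m + 1)    # S_Zk
    M0, M = S_Y.max(), S_Z.max()
    Lambda_M = M / max(np.sqrt(2) * sigma * T_d, M0)
    return int(Lambda_M > 1 + c0)
```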

Theorem 4.3 Consider Scenario (A). Let $\beta \in (0,1)$ and let (2.2) hold. Then, for any $c_0 > 0$,

$$\lim_{d\to+\infty}\ \sup_{u\in U_{\beta,a_d}} E_{H_0}^{(u)}(\psi_{max}) = 0. \qquad (4.4)$$

If $\limsup_{d\to+\infty} x_1 < \phi_2(\beta)$, then, for any $c_0 > 0$,

$$\lim_{d\to+\infty}\ \sup_{u\in U_{\beta,a_d}} E_{H_1}^{(u)}(\psi_{max}) = 0. \qquad (4.5)$$

If $\liminf_{d\to+\infty} x_1 > \phi_2(\beta)$, then there exists $c_0 > 0$ such that

$$\lim_{d\to+\infty}\ \sup_{u\in U_{\beta,a_d}} E_{H_1}^{(u)}(1-\psi_{max}) = 0. \qquad (4.6)$$
