
www.imstat.org/aihp 2013, Vol. 49, No. 1, 288–306

DOI:10.1214/11-AIHP458

© Association des Publications de l’Institut Henri Poincaré, 2013

On the optimality of the empirical risk minimization procedure for the Convex aggregation problem^1

Guillaume Lecué^a and Shahar Mendelson^b

^a CNRS, LAMA, Université Paris-Est Marne-la-Vallée, 77454 Marne-la-Vallée, France. E-mail: guillaume.lecue@univ-mlv.fr
^b Department of Mathematics, Technion, I.I.T, Haifa 32000, Israel. E-mail: shahar@tx.technion.ac.il

Received 24 September 2010; revised 12 October 2011; accepted 14 October 2011

Abstract. We study the performance of empirical risk minimization (ERM), with respect to the quadratic risk, in the context of convex aggregation, in which one wants to construct a procedure whose risk is as close as possible to the risk of the best function in the convex hull of an arbitrary finite class $F$. We show that ERM performed in the convex hull of $F$ is an optimal aggregation procedure for the convex aggregation problem. We also show that if this procedure is used for the problem of model selection aggregation, in which one wants to mimic the performance of the best function in $F$ itself, then its rate is the same as the one achieved for the convex aggregation problem, and thus is far from optimal. These results are obtained in deviation and are sharp up to logarithmic factors.

Résumé. We study the performance of the empirical risk minimization procedure, with respect to the quadratic risk, for the convex aggregation problem. In this problem, one wants to construct procedures whose risk is as close as possible to the risk of the best element in the convex hull of a finite class $F$ of functions. We prove that empirical risk minimization over the convex hull of $F$ is an optimal procedure for the convex aggregation problem. We also prove that if this procedure is used for the model selection aggregation problem, in which one wants to mimic the best element of $F$ itself, then the aggregation residue is the same as the one obtained for the convex aggregation problem; the procedure is therefore far from optimal for model selection aggregation. These results are obtained in deviation and are optimal up to logarithmic factors.

MSC: 62G08

Keywords: Learning theory; Aggregation theory; Empirical process theory

1. Introduction and main results

In this note, we study the optimality of the empirical risk minimization procedure in the aggregation framework.

Let $\mathcal{X}$ be a probability space and let $(X, Y)$ and $(X_1, Y_1), \ldots, (X_n, Y_n)$ be $n + 1$ i.i.d. random variables with values in $\mathcal{X} \times \mathbb{R}$. From the statistical point of view, $\mathcal{D} = ((X_1, Y_1), \ldots, (X_n, Y_n))$ is the family of given data.

The quadratic risk of a real-valued function $f$ defined on $\mathcal{X}$ is given by
$$R(f) = \mathbb{E}\big(Y - f(X)\big)^2.$$

^1 Part of this research was supported by the Centre for Mathematics and its Applications, The Australian National University, Canberra, ACT 0200, Australia, by an Australian Research Council Discovery Grant DP0559465 and by the European Community's Seventh Framework Programme (FP7/2007-2013), ERC Grant Agreement 203134.

If $\hat{f}$ is a function constructed using the data $\mathcal{D}$, the quadratic risk of $\hat{f}$ is the random variable
$$R(\hat{f}) = \mathbb{E}\big[\big(Y - \hat{f}(X)\big)^2 \,\big|\, \mathcal{D}\big].$$

For the sake of simplicity, throughout this article we will restrict ourselves to functions $f$ and random variables $(X, Y)$ for which $|Y|, |f(X)| \le b$ almost surely, for some fixed $b \ge 1$. One should note, though, that it is possible to extend the results beyond this case, to functions with well-behaved tails, though at a high technical price (cf. the chaining arguments in [20] and [21]).

In the aggregation framework, one is given a finite set $F$ of real-valued functions defined on $\mathcal{X}$ (usually called a dictionary) of cardinality $M$. There are three main types of aggregation problems:

1. In the Model Selection (MS) aggregation problem, one has to construct a procedure that produces a function whose risk is as close as possible to the risk of the best element in the given class $F$ (cf. [2,3,9–12,16,24,25,27]).

2. In the Convex (C) aggregation problem (cf. [1,7–9,12,24,28]), one wants to construct a procedure whose risk is as close as possible to the risk of the best function in the convex hull of $F$ (later denoted by $\mathrm{conv}(F)$).

3. In the Linear (L) aggregation problem (cf. [9,11,15,24]), one wants to construct a procedure whose risk is as close as possible to the risk of the best function in the linear span of $F$ (later denoted by $\mathrm{span}(F)$).

The aim in the aggregation framework is to construct a procedure $\tilde{f}$ for which, with high probability,
$$R(\tilde{f}) \le C\min_{f\in\Delta(F)} R(f) + \psi_n^{\Delta(F)}(M) \tag{1.1}$$
with $C = 1$, where $\Delta(F)$ is either $F$, $\mathrm{conv}(F)$ or $\mathrm{span}(F)$. It is worth mentioning that it is desirable for the constant $C$ in (1.1) to be one in the aggregation setup, for at least two reasons. First, there are some obvious mathematical differences between the analysis leading to exact oracle inequalities ($C = 1$) and non-exact oracle inequalities ($C > 1$). In particular, the geometry of the set $\Delta(F)$ plays a key role in obtaining exact oracle inequalities, whereas non-exact oracle inequalities are mainly based on complexity and concentration arguments (cf. [17]). Second, an exact oracle inequality for the prediction risk $R(\cdot)$ leads to an exact oracle inequality for the estimation risk; namely, with high probability,
$$\mathbb{E}\big[\big(\tilde{f}(X) - f^*(X)\big)^2 \,\big|\, \mathcal{D}\big] \le \min_{f\in\Delta(F)}\mathbb{E}\big(f(X) - f^*(X)\big)^2 + \psi_n^{\Delta(F)}(M),$$
where $f^*$ denotes the regression function of $Y$ given $X$. Such an estimate on the regression function cannot follow from a non-exact oracle inequality; thus, exact oracle inequalities provide both prediction and estimation results, whereas non-exact oracle inequalities only lead to prediction results.
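To spell out the step from prediction to estimation (a routine verification added here for the reader's convenience): since $f^*$ is the regression function, the cross term $\mathbb{E}[(Y - f^*(X))(f^*(X) - f(X))]$ vanishes for every $f$, so that
$$R(f) = \mathbb{E}\big(Y - f^*(X)\big)^2 + \mathbb{E}\big(f(X) - f^*(X)\big)^2.$$
Subtracting $\mathbb{E}(Y - f^*(X))^2$ from both sides of an exact oracle inequality yields the estimation bound above, while for $C > 1$ the same subtraction leaves an extra term $(C - 1)\mathbb{E}(Y - f^*(X))^2$, which need not be small.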

One can define the optimal rates of the (MS), (C) and (L) aggregation problems, respectively denoted by $\psi_n^{(\mathrm{MS})}(M)$, $\psi_n^{(\mathrm{C})}(M)$ and $\psi_n^{(\mathrm{L})}(M)$ (see, for example, [24]). The optimal rates are the smallest prices, in the minimax sense, that one has to pay to solve the (MS), (C) or (L) aggregation problems in expectation, as a function of the cardinality $M$ of the dictionary and of the sample size $n$. It has been proved in [24] (see also [12] and [28] for the (C) aggregation problem) that
$$\psi_n^{(\mathrm{MS})}(M) \sim \frac{\log M}{n}, \qquad \psi_n^{(\mathrm{C})}(M) \sim \begin{cases}\dfrac{M}{n} & \text{if } M \le \sqrt{n},\\[2mm] \sqrt{\dfrac{1}{n}\log\Big(\dfrac{eM}{\sqrt{n}}\Big)} & \text{if } M > \sqrt{n},\end{cases} \qquad \psi_n^{(\mathrm{L})}(M) \sim \frac{M}{n},$$
where we write $a \sim b$ if there are absolute positive constants $c$ and $C$ such that $cb \le a \le Cb$. Note that the rates obtained in [24] hold in expectation; in particular, the rate $\psi_n^{(\mathrm{C})}(M)$ was achieved in the Gaussian regression model with a known variance and a known marginal distribution of the design. In [8], the authors were able to remove these assumptions at the price of an extra $\log n$ factor for $1 \le M \le \sqrt{n}$ (the results are still in expectation). We also refer the reader to [6,28] for non-exact oracle inequalities in the (C) aggregation context.
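As a quick numerical illustration (ours, not part of the paper; absolute constants are ignored, in line with the $\sim$ notation), the three rates can be compared directly:

```python
import math

def psi_ms(M, n):
    # (MS) optimal rate: log(M) / n
    return math.log(M) / n

def psi_c(M, n):
    # (C) optimal rate: M/n for M <= sqrt(n), sqrt(log(e*M/sqrt(n))/n) otherwise
    if M <= math.sqrt(n):
        return M / n
    return math.sqrt(math.log(math.e * M / math.sqrt(n)) / n)

def psi_l(M, n):
    # (L) optimal rate: M / n
    return M / n

# For n = 10_000 and M = 1_000 > sqrt(n) = 100:
# psi_ms ~ 6.9e-4, psi_c ~ 1.8e-2, psi_l ~ 1e-1, so (C) sits strictly between (MS) and (L).
print(psi_ms(1_000, 10_000), psi_c(1_000, 10_000), psi_l(1_000, 10_000))
```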

Lower bounds in deviation follow from the arguments of [24] for the three aggregation problems, with the same rates $\psi_n^{(\mathrm{MS})}(M)$, $\psi_n^{(\mathrm{C})}(M)$ and $\psi_n^{(\mathrm{L})}(M)$. In other words, there exist two absolute constants $c_0, c_1 > 0$ such that for any sample size $n \ge 1$, any cardinality of a dictionary $M \ge 1$ and any aggregation procedure $\bar{f}_n$, there exists a dictionary $F$ of size $M$ such that, with probability larger than $c_0$,
$$R(\bar{f}_n) \ge \min_{f\in\Delta(F)} R(f) + c_1\psi_n^{\Delta(F)}(M), \tag{1.2}$$
where the residual term $\psi_n^{\Delta(F)}(M)$ is $\psi_n^{(\mathrm{MS})}(M)$ (resp. $\psi_n^{(\mathrm{C})}(M)$ or $\psi_n^{(\mathrm{L})}(M)$) when $\Delta(F) = F$ (resp. $\Delta(F) = \mathrm{conv}(F)$ or $\Delta(F) = \mathrm{span}(F)$). Procedures achieving these rates in deviation have been constructed for the (MS) aggregation problem ([2] and [16]) and the (L) aggregation problem ([15]). So far, there was no example of a procedure that achieves the rate $\psi_n^{(\mathrm{C})}(M)$ with high probability for the (C) aggregation problem, and the aim of this note is to prove that the most natural procedure, empirical risk minimization over the convex hull of $F$, achieves the rate $\psi_n^{(\mathrm{C})}(M)$ in deviation (up to a $\log n$ factor for values of $M$ close to $\sqrt{n}$).

Indeed, we will show that the procedure $\tilde{f}^{\mathrm{ERM-C}}$ minimizing the empirical risk functional
$$f \longmapsto R_n(f) = \frac{1}{n}\sum_{i=1}^n\big(Y_i - f(X_i)\big)^2 \tag{1.3}$$
in $\mathrm{conv}(F)$ achieves, with high probability, the rate $\min\big(M/n, \sqrt{(\log M)/n}\big)$ for the (C) aggregation problem (see the exact formulation in Theorem A.1 in the Appendix). Moreover, we will show that the rate $\psi_n^{(\mathrm{C})}(M)$ can be achieved by $\tilde{f}^{\mathrm{ERM-C}}$ for any orthogonal dictionary (formulated in Theorem B). On the other hand, it turns out that the same algorithm is far from the conjectured optimal rate $\psi_n^{(\mathrm{MS})}(M)$ for the (MS) aggregation problem (see Theorem A and [16] for the conjecture).

Our first main result is a lower bound on the performance of $\tilde{f}^{\mathrm{ERM-C}}$ (ERM in the convex hull) in the context of the (MS) aggregation problem. In [16], it was proved that this procedure is suboptimal for the problem of (MS) aggregation when the size of the dictionary is of the order of $\sqrt{n}$. Here we complement that result by providing a lower bound for almost all values of $M$ and $n$.

Theorem A. There exist absolute positive constants $c_0$, $c_2$ and $c_3$ for which the following holds. For any integers $n$ and $M$ such that $\log M \le c_0 n^{1/3}$, there exists a dictionary $F$ of cardinality $M$ such that, with probability greater than $9/12$,
$$R\big(\tilde{f}^{\mathrm{ERM-C}}\big) \ge \min_{f\in F} R(f) + c_2\psi_n(M),$$
where $\psi_n(M) = M/n$ when $M \le \sqrt{n}$ and $\psi_n(M) = \big(n\log(eM/\sqrt{n})\big)^{-1/2}$ when $M > \sqrt{n}$. Moreover, for the same class $F$, if $M \ge \sqrt{n}$, then with probability larger than $7/12$,
$$R\big(\tilde{f}^{\mathrm{ERM-C}}\big) \le \min_{f\in F} R(f) + c_3\psi_n(M).$$

Note that the residual term $\psi_n(M)$ of Theorem A is much larger than the optimal rate $\psi_n^{(\mathrm{MS})}(M) = (\log M)/n$ for the (MS) aggregation problem. It shows that ERM in the convex hull satisfies a much stronger lower bound than the one mentioned in (1.2), which holds for any algorithm. This result is of particular importance since optimal aggregation procedures for the (MS) aggregation problem take their values in $\mathrm{conv}(F)$, and it was thus conjectured that $\tilde{f}^{\mathrm{ERM-C}}$ could be an optimal aggregation procedure for the (MS) aggregation problem (cf. [16] for more details on this problem). In [16] it was proved that this is not the case for $M = \sqrt{n}$; Theorem A shows that this is not the case for all values of $M$ and $n$ in the significant range (when $M$ is sub-exponential in $n$).
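To quantify the suboptimality in the range $M \ge \sqrt{n}$ (a computation we add; it is implicit in the statement),
$$\frac{\psi_n(M)}{\psi_n^{(\mathrm{MS})}(M)} = \frac{\sqrt{n}}{\log M\,\sqrt{\log(eM/\sqrt{n})}} \ge \frac{\sqrt{n}}{\log^{3/2}(eM)},$$
which is bounded below by a positive constant when $\log M \le c_0 n^{1/3}$ and tends to infinity whenever $\log M = o(n^{1/3})$.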

The proof of Theorem A requires two separate arguments (as in the proofs of the lower bounds in [24] and [28]). The case $M \le \sqrt{n}$ is easier, and follows a path identical to the one used in [16] for $M = \sqrt{n}$. Its proof is presented for the sake of completeness, and to allow the reader a comparison with the situation in the other case, when $M > \sqrt{n}$. In the “large $M$” range things are very different, and we present a more intuitive description of the idea behind the construction in Section 2.


The performance of ERM in the convex hull has been studied for an infinite dictionary in [7], in which estimates on its performance have been obtained in terms of the metric entropy of $F$. The resulting upper bounds were conjectured to be suboptimal in the case of a finite dictionary, since they provide an upper bound of $M/n$ for every $n$ and $M$, whereas it is possible to achieve the rate $\sqrt{(\log M)/n}$ when $M \ge \sqrt{n}$. Although this result is probably known to experts and relies on standard machinery (see, for instance, [15,14]), we present its proof in the Appendix.

The residual term $\min\big(M/n, \sqrt{(\log M)/n}\big)$ of Theorem A.1 behaves like $\psi_n^{(\mathrm{C})}(M)$ except for values of $M$ for which $n^{1/2} < M \le c(\varepsilon)n^{1/2+\varepsilon}$ for $\varepsilon > 0$. And although there is a gap in this range in the general case, under the additional assumption that the dictionary is orthogonal, this gap can be removed.

Theorem B. For every $b > 0$ there is a constant $c_1(b)$ and an absolute constant $c_2$ for which the following holds. Let $n$ and $M$ be integers which satisfy $\log M \le c_1(b)\sqrt{n}$. Let $F$ be a finite dictionary of cardinality $M$ and $(X, Y)$ be such that $|Y|, \sup_{f\in F}|f(X)| \le b$. If $F = \{f_1, \ldots, f_M\}$ satisfies $\mathbb{E}f_i(X)f_j(X) = 0$ for any $i \ne j \in \{1, \ldots, M\}$, then $\tilde{f}^{\mathrm{ERM-C}}$ achieves the rate $\psi_n^{(\mathrm{C})}(M)$: for any $u > 0$, with probability greater than $1 - \exp(-u)$,
$$R\big(\tilde{f}^{\mathrm{ERM-C}}\big) \le \min_{f\in\mathrm{conv}(F)} R(f) + c_2 b^2\max\Big(\psi_n^{(\mathrm{C})}(M), \frac{u}{n}\Big).$$

Removing the gap in the general case is likely to be a much harder problem, although we believe that the orthogonal case should be the “worst” one.

Finally, a word about notation. Throughout, we denote absolute constants or constants that depend on other parameters by $c$, $C$, $c_1$, $c_2$, etc. (and, of course, we will specify when a constant is absolute and when it depends on other parameters). The values of constants may change from line to line. The notation $x \sim y$ (resp. $x \lesssim y$) means that there exist absolute constants $0 < c < C$ such that $cy \le x \le Cy$ (resp. $x \le Cy$). If $b > 0$ is a parameter, then $x \lesssim_b y$ means that $x \le C(b)y$ for some constant $C(b)$ depending only on $b$. We denote by $\ell_p^M$ the space $\mathbb{R}^M$ endowed with the $\ell_p$ norm. The unit ball there is denoted by $B_p^M$. We also denote the unit Euclidean sphere in $\mathbb{R}^M$ by $S^{M-1}$.

If $F$ is a class of functions, let $f^*$ be a minimizer in $F$ of the true risk; in our case, $f^*$ is the minimizer of $\mathbb{E}(f(X) - Y)^2$. For every $f \in F$ set $\mathcal{L}_f = (Y - f(X))^2 - (Y - f^*(X))^2$, and let $\mathcal{L}_F = \{\mathcal{L}_f: f \in F\}$ be the excess loss class associated with $F$, the target $Y$ and the quadratic risk.

2. On the complexity of $B_1^M$ with respect to $\ell_2^M$

The aim of this section is to present some of the ideas needed in the proof of Theorem A in the case $M \ge \sqrt{n}$, and to explain why the seemingly unlikely fact that the rate
$$\frac{1}{\sqrt{n\log(eM/\sqrt{n})}} \tag{2.1}$$
actually improves as the size of the dictionary $M$ increases is true in our construction.

The example used for this result is a class $F_M = \{0, \pm\phi_1, \ldots, \pm\phi_M\}$, where $(\phi_i)_{i=1}^M$ is a bounded orthonormal family in $L_2(P_X)$ and $Y = \phi_{M+1}(X)$ is orthogonal to this family. We also assume that $\Phi(X) = (\phi_1(X), \ldots, \phi_M(X))$ is isotropic, that is, for every $\lambda\in\mathbb{R}^M$, $\mathbb{E}\langle\Phi(X), \lambda\rangle^2 = \|\lambda\|_2^2$.

An element of $\mathrm{conv}(F_M)$ is of the form $f_\lambda = \langle\Phi, \lambda\rangle$ for some $\lambda \in B_1^M$; its excess loss is $\mathcal{L}_{f_\lambda} = \langle\Phi, \lambda\rangle^2 - 2\langle\Phi, \lambda\rangle\phi_{M+1}$, and the process one has to minimize is indexed by $B_1^M$ and given by
$$P_n\mathcal{L}_{f_\lambda} = \frac{1}{n}\sum_{i=1}^n\langle\Phi(X_i), \lambda\rangle^2 - \frac{2}{n}\sum_{i=1}^n\langle\Phi(X_i), \lambda\rangle\phi_{M+1}(X_i). \tag{2.2}$$
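For completeness (a one-line verification we add), the form of the excess loss follows from $Y = \phi_{M+1}(X)$ and $f^* = 0$:
$$\mathcal{L}_{f_\lambda} = \big(Y - f_\lambda\big)^2 - Y^2 = f_\lambda^2 - 2Yf_\lambda = \langle\Phi, \lambda\rangle^2 - 2\langle\Phi, \lambda\rangle\phi_{M+1},$$
and taking expectations, isotropy and the orthogonality $\mathbb{E}\,\phi_{M+1}(X)\Phi(X) = 0$ give $P\mathcal{L}_{f_\lambda} = \|\lambda\|_2^2$.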

It follows from [21] that the oscillations of the quadratic term $\lambda \in B_1^M \mapsto |(P_n - P)\langle\Phi, \lambda\rangle^2|$ are of lower order, and that the empirical process (2.2) behaves like $\lambda \in B_1^M \mapsto \|\lambda\|_2^2 - 2n^{-1/2}\langle V, \lambda\rangle$, where $V = n^{-1/2}\sum_{i=1}^n\phi_{M+1}(X_i)\Phi(X_i)$, while a Gaussian approximation shows that $V$ essentially behaves like a standard Gaussian vector $G$ in $\mathbb{R}^M$. Hence, the excess risk $P\mathcal{L}_{\hat{f}} = \|\hat{\lambda}\|_2^2$ of the empirical risk minimization procedure $\hat{f} = f_{\hat{\lambda}}$ will be located around
$$\operatorname*{argmin}_{0\le r\le 1}\;\min_{\lambda\in B_1^M\cap\sqrt{r}S^{M-1}}\Big(r - \frac{2\langle G, \lambda\rangle}{\sqrt{n}}\Big) = \operatorname*{argmin}_{0\le r\le 1}\Big(r - \frac{2}{\sqrt{n}}\sup_{\lambda\in B_1^M\cap\sqrt{r}S^{M-1}}\langle G, \lambda\rangle\Big).$$

Observe that for every radius $0 < r \le 1$, $\sup_{\lambda\in B_1^M\cap\sqrt{r}S^{M-1}}\langle\cdot, \lambda\rangle$ is an interpolation norm, which will be denoted by $\|\cdot\|_{A_r}$. The problem arises because in the range $1/M \le r \le 1$ (which is the range we are interested in), a proportional change in the radius $r$ only results in a logarithmic change in the value of $\mathbb{E}\|G\|_{A_r}$, which is why one has to obtain a sharp estimate on $\mathbb{E}\|G\|_{A_r}$ for every $r$.
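Anticipating the sets $C_k$ introduced below, a back-of-the-envelope computation (ours) makes this logarithmic sensitivity explicit: for $r = 1/k$ with $1 \le k \le M$,
$$\mathbb{E}\|G\|_{A_r} \ge \mathbb{E}\sup_{v\in C_k}\langle G, v\rangle = \frac{1}{\sqrt{k}}\,\mathbb{E}\Big(\sum_{i=1}^k\big(g_i^*\big)^2\Big)^{1/2} \sim \sqrt{\log(eM/k)} = \sqrt{\log(eMr)}$$
(here $(g_i^*)$ denotes the non-increasing rearrangement of $(|g_i|)$, formalized below). Hence multiplying $r$ by a constant factor shifts $\mathbb{E}\|G\|_{A_r}$ only by an additive term of order $1/\sqrt{\log(eMr)}$.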

It turns out that a rather accurate estimate on the complexity of $B_1^M\cap\sqrt{r}S^{M-1}$ comes from vectors of “short” support. Namely, for every $I \subset \{1, \ldots, M\}$, let $S_I$ be the set of vectors in $S^{M-1}$ supported in $I$. Set
$$C_k = \bigcup_{|I|=k}\frac{1}{\sqrt{k}}S_I \subset B_1^M\cap\frac{1}{\sqrt{k}}S^{M-1}.$$
If one replaces $B_1^M\cap\frac{1}{\sqrt{k}}S^{M-1}$ by $C_k$, it is much easier to analyze ERM over that set. Indeed, it is straightforward to verify that ERM is likely to choose a vector in $C_k$, where $k$ minimizes the functional

$$k \longmapsto \frac{1}{k} - \frac{2}{\sqrt{n}}\,\mathbb{E}\sup_{v\in C_k}\langle G, v\rangle = \frac{1}{k} - \frac{2}{\sqrt{nk}}\,\mathbb{E}\Big(\sum_{i=1}^k \big(g_i^*\big)^2\Big)^{1/2}, \tag{2.3}$$
where $(x_i^*)$ denotes a non-increasing rearrangement of the vector $(|x_i|)$.
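A quick Monte Carlo sketch (ours, not from the paper; the sample and dictionary sizes are arbitrary choices) of the functional (2.3) illustrates the phenomenon: the minimizing level $k^*$ grows with $M$, so the corresponding excess-risk level $1/k^*$ shrinks as the dictionary gets larger.

```python
import numpy as np

rng = np.random.default_rng(0)

def functional_23(sorted_abs_g, k, n):
    """Monte Carlo estimate of (2.3): 1/k - (2/sqrt(n)) * E sup_{v in C_k} <G, v>,
    using sup_{v in C_k} <G, v> = k**-0.5 * (sum of the k largest g_i^2) ** 0.5."""
    topk = sorted_abs_g[:, -k:]                       # k largest |g_i| per replication
    e_sup = np.mean(np.sqrt((topk ** 2).sum(axis=1) / k))
    return 1.0 / k - 2.0 / np.sqrt(n) * e_sup

n = 10_000
for M in (200, 2_000, 20_000):
    g = np.sort(np.abs(rng.standard_normal((200, M))), axis=1)
    ks = np.arange(10, min(M, 700), 10)
    vals = [functional_23(g, k, n) for k in ks]
    k_star = int(ks[np.argmin(vals)])
    # k* should grow like sqrt(n * log(eM/sqrt(n))), so the level 1/k* decreases in M
    print(M, k_star, 1.0 / k_star)
```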

A sharp estimate on the Gaussian quantity reveals that the gap between the “level” $k$ and the “level” $\ell$ decreases with the dimension $M$. Thus, the minimum of (2.3), which is proportional to (2.1), decreases as $M$ increases.

The proof of Theorem A will be a combination of two approximation arguments: first, an approximation of the vector $V = n^{-1/2}\sum_{i=1}^n\phi_{M+1}(X_i)\Phi(X_i)$ by a Gaussian vector, and second, an approximation of $B_1^M$ by the sets $C_k$, reducing the problem to the one described above.

One should comment that it is possible to approximate $B_1^M\cap\frac{1}{\sqrt{k}}S^{M-1}$ using the completely combinatorial sets $\bigcup_{|I|=k}\frac{1}{k}\{-1, 1\}^I$, and the way the complexities of these sets change between the levels $k$ and $\ell$ as $M$ increases gives a more geometric explanation of why the minimizer moves closer to $0$.

3. Proof of the lower bound for the (MS) aggregation problem (Theorem A)

The proof of Theorem A consists of two parts. The first, simpler part is the case $M \le \sqrt{n}$. It relies on the fact that if $0 < \theta < 1$ and $\rho = \theta r \sim M/n$, the set $B_1^M\cap\sqrt{r}S^{M-1}$ is much “larger” than the set $B_1^M\cap\sqrt{\rho}B_2^M$. This results in much larger “oscillations” of the appropriate empirical process on the former set than on the latter, leading to very negative values of the empirical excess risk functional for functions whose excess risk is larger than $\rho$. The case $M \ge \sqrt{n}$ is much harder, because for the required values of $r$ and $\rho$ the complexities of the two sets are very close, and comparing the two oscillations accurately requires a far more delicate analysis.

3.1. The case $M \le \sqrt{n}$

We will follow the method used in [16]. Let $(\phi_i)_{i\in\mathbb{N}}$ be a sequence of functions defined on $[0, 1]$ and let $\mu$ be a probability measure on $[0, 1]$ such that $(\phi_i: i\in\mathbb{N})$ is a sequence of independent Rademacher variables in $L_2([0, 1], \mu)$.

Let $M \le \sqrt{n}$ be fixed and let $(X, Y)$ be a couple of random variables; $X$ is distributed according to $\mu$ and $Y = \phi_{M+1}(X)$. Let $F = \{0, \pm\phi_1, \ldots, \pm\phi_M\}$ be the dictionary, and note that any function in the convex hull of $F$ can be written as $f_\lambda = \sum_{j=1}^M\lambda_j\phi_j$ for $\lambda\in B_1^M$. Since, relative to $\mathrm{conv}(F)$, $f^* = 0$, the excess quadratic loss function is
$$\mathcal{L}_\lambda(X, Y) = -2\phi_{M+1}(X)\big\langle\lambda, \Phi(X)\big\rangle + \big\langle\lambda, \Phi(X)\big\rangle^2,$$
where we set $\Phi(\cdot) = (\phi_1(\cdot), \ldots, \phi_M(\cdot))$.

The following is a reformulation of Lemma 5.4 in [16].

Lemma 3.1. There exist absolute constants $c_0$, $c_1$ and $c_2$ for which the following holds. Let $(X_i, Y_i)_{i=1,\ldots,n}$ be $n$ independent copies of $(X, Y)$. Then, for every $r > 0$, with probability greater than $1 - 8\exp(-c_0 M)$, for any $\lambda\in\mathbb{R}^M$,
$$\Big|\|\lambda\|_2^2 - \frac{1}{n}\sum_{i=1}^n\big\langle\lambda, \Phi(X_i)\big\rangle^2\Big| \le \frac{1}{2}\|\lambda\|_2^2 \tag{3.1}$$
and
$$c_1\sqrt{\frac{rM}{n}} \le \sup_{\lambda\in\sqrt{r}B_2^M}\Big|\frac{1}{n}\sum_{i=1}^n\big\langle\lambda, \Phi(X_i)\big\rangle\phi_{M+1}(X_i)\Big| \le c_2\sqrt{\frac{rM}{n}}. \tag{3.2}$$

Set $r = \beta M/n$ for some $0 < \beta \le 1$ to be named later, and observe that $B_1^M\cap\sqrt{r}S^{M-1} = \sqrt{r}S^{M-1}$ because $r \le 1/M$ (recall that $M \le \sqrt{n}$). For any $\lambda\in\sqrt{r}S^{M-1}$, $P\mathcal{L}_\lambda = \|\lambda\|_2^2 = r$, and thus, applying (3.1) and (3.2), it is evident that with probability greater than $1 - 8\exp(-c_0 M)$,

$$\inf_{\lambda\in B_1^M\cap\sqrt{r}S^{M-1}} P_n\mathcal{L}_\lambda = r - \sup_{\lambda\in\sqrt{r}S^{M-1}}(P - P_n)\mathcal{L}_\lambda \le r + \sup_{\lambda\in\sqrt{r}S^{M-1}}\Big|\|\lambda\|_2^2 - \frac{1}{n}\sum_{i=1}^n\big\langle\lambda, \Phi(X_i)\big\rangle^2\Big| - \sup_{\lambda\in\sqrt{r}S^{M-1}}\frac{2}{n}\sum_{i=1}^n\big\langle\lambda, \Phi(X_i)\big\rangle\phi_{M+1}(X_i)$$
$$\le \frac{3r}{2} - 2c_1\sqrt{\frac{rM}{n}} = \Big(\frac{3\beta}{2} - 2c_1\sqrt{\beta}\Big)\frac{M}{n} \le -c_1\sqrt{\beta}\,\frac{M}{n},$$
provided that $\beta \le (2c_1/3)^2$.

On the other hand, let $\rho = \alpha M/n$ for some $\alpha$ to be chosen later. Using (3.1) and (3.2) again, it follows that with probability at least $1 - 8\exp(-c_0 M)$, for any $\lambda\in B_1^M\cap\sqrt{\rho}B_2^M$,
$$|P_n\mathcal{L}_\lambda| \le P\mathcal{L}_\lambda + \Big|\|\lambda\|_2^2 - \frac{1}{n}\sum_{i=1}^n\big\langle\lambda, \Phi(X_i)\big\rangle^2\Big| + \Big|\frac{2}{n}\sum_{i=1}^n\big\langle\lambda, \Phi(X_i)\big\rangle\phi_{M+1}(X_i)\Big| \le \frac{3\rho}{2} + 2c_2\sqrt{\frac{\rho M}{n}} = \Big(\frac{3\alpha}{2} + 2c_2\sqrt{\alpha}\Big)\frac{M}{n}.$$
Therefore, if $0 < \alpha < \beta$ satisfies $3\alpha/2 + 2c_2\sqrt{\alpha} < c_1\sqrt{\beta}$ for some $0 < \beta \le (2c_1/3)^2$, then with probability greater than $1 - 16\exp(-c_0 M)$, the empirical risk function $\lambda\mapsto R_n(f_\lambda)$ achieves smaller values on $B_1^M\cap\sqrt{r}S^{M-1}$ than on $B_1^M\cap\sqrt{\rho}B_2^M$. Hence, with the same probability, the empirical minimizer lies outside $\sqrt{\rho}B_2^M$ and $P\mathcal{L}_{\tilde{f}^{\mathrm{ERM-C}}} \ge \rho = \alpha M/n$.

3.2. The case $M \ge \sqrt{n}$

Let us reformulate the second part of Theorem A.

Theorem 3.2. There exist absolute constants $c_0$, $c_1$, $c_2$ and $n_0$ for which the following holds. For all integers $n \ge n_0$ and $M$, if $\sqrt{n} \le M \le \exp(c_0 n^{1/3})$, there is a function class $F_M$ of cardinality $M$, consisting of functions that are bounded by $1$, and a couple $(X, Y)$ distributed according to a probability measure $\mu$, such that with $\mu^n$-probability at least $9/12$,
$$R(\hat{f}) \ge \min_{f\in F_M} R(f) + \frac{c_1}{\sqrt{n\log(eM/\sqrt{n})}},$$
where $\hat{f}$ is the empirical minimizer in $\mathrm{conv}(F_M)$. Moreover, with $\mu^n$-probability greater than $7/12$,
$$R(\hat{f}) \le \min_{f\in F_M} R(f) + \frac{c_2}{\sqrt{n\log(eM/\sqrt{n})}}.$$

The proof will require accurate information on a monotone rearrangement of almost Gaussian random variables.

Lemma 3.3. There exists an absolute constant $C$ for which the following holds. Let $g$ be a standard Gaussian random variable, set $H(x) = \mathbb{P}(|g| > x)$ and put $W(p) = H^{-1}(p)$ (the inverse function of $H$). Then, for every $0 < p < 1$,
$$\Big|W^2(p) - \log\frac{2}{\pi p^2} + \log\log\frac{2}{\pi p^2}\Big| \le C\,\frac{\log\log(2/(\pi p^2))}{\log(2/(\pi p^2))}.$$
Moreover, for every $0 < \varepsilon < 1/2$ and $0 < p < 1/(1+\varepsilon)$,
$$\big|W^2(p) - W^2\big((1+\varepsilon)p\big)\big| \le C\varepsilon \quad\text{and}\quad \big|W^2(p) - W^2\big((1-\varepsilon)p\big)\big| \le C\varepsilon.$$

Proof. The proof of the first part follows from the observation that for every $x > 0$,
$$\sqrt{\frac{2}{\pi}}\,\frac{\exp(-x^2/2)}{x}\Big(1 - \frac{1}{x^2}\Big) \le \mathbb{P}\big(|g| > x\big) \le \sqrt{\frac{2}{\pi}}\,\frac{\exp(-x^2/2)}{x} \tag{3.3}$$
(see, e.g., [22]), combined with a straightforward (yet tedious) computation. The second part of the claim follows from the first one, and is omitted. □
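To indicate where the two logarithmic terms come from (a heuristic sketch we add), drop the factor $(1 - x^{-2})$ in (3.3); then $p = H(x)$ gives $2/(\pi p^2) = x^2 e^{x^2}$, i.e.
$$x^2 = \log\frac{2}{\pi p^2} - \log x^2 = \log\frac{2}{\pi p^2} - \log\log\frac{2}{\pi p^2} + O\Big(\frac{\log\log(2/(\pi p^2))}{\log(2/(\pi p^2))}\Big),$$
where the last step replaces $x^2$ inside the logarithm by $\log(2/(\pi p^2))$, at the stated cost.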

The next step is a Gaussian approximation of a variable $Y = n^{-1/2}\sum_{i=1}^n X_i$, where $X_1, \ldots, X_n$ are i.i.d. random variables with mean zero and variance $1$, under the additional assumption that $X$ has well-behaved tails.

Definition 3.4 ([18,26]). Let $1 \le \alpha \le 2$. We say that a random variable $X$ belongs to $L_{\psi_\alpha}$ if there exists a constant $C$ such that
$$\mathbb{E}\exp\big(|X|^\alpha/C^\alpha\big) \le 2. \tag{3.4}$$
The infimum over all constants $C$ for which (3.4) holds defines a norm, called the $\psi_\alpha$ norm of $X$, which we denote by $\|X\|_{\psi_\alpha}$.

Proposition 3.5 ([22], p. 183). For every $L$ there exist constants $c_1$ and $c_2$ that depend only on $L$ and for which the following holds. Let $(X_n)_{n\in\mathbb{N}}$ be a sequence of i.i.d., mean zero random variables with variance $1$ and $\|X\|_{\psi_1} \le L$. If $Y = \frac{1}{\sqrt{n}}\sum_{i=1}^n X_i$, then for any $0 < x \le c_1 n^{1/6}$,
$$\mathbb{P}[Y \ge x] = \mathbb{P}[g \ge x]\exp\Big(\frac{\mathbb{E}X_1^3\,x^3}{6\sqrt{n}}\Big)\Big(1 + c_2\,\frac{x+1}{\sqrt{n}}\Big)$$
and
$$\mathbb{P}[Y \le -x] = \mathbb{P}[g \le -x]\exp\Big(-\frac{\mathbb{E}X_1^3\,x^3}{6\sqrt{n}}\Big)\Big(1 + c_2\,\frac{x+1}{\sqrt{n}}\Big).$$
In particular, if $0 < x \le c_1 n^{1/6}$ and $\mathbb{E}X_1^3 = 0$, then
$$\big|\mathbb{P}\big(|Y| \ge x\big) - \mathbb{P}\big(|g| \ge x\big)\big| \le c_2\,\mathbb{P}\big(|g| \ge x\big)\frac{x+1}{\sqrt{n}}.$$

Since Proposition 3.5 implies a better Gaussian approximation than the standard Berry–Esséen bounds, one may consider the following family of random variables, which will be used in the construction.

Definition 3.6. We say that a random variable $Y$ is $(L, n)$-almost Gaussian, for $L > 0$ and $n\in\mathbb{N}$, if $Y = n^{-1/2}\sum_{i=1}^n X_i$, where $X_1, \ldots, X_n$ are independent copies of $X$, a non-atomic random variable with mean $0$ and variance $1$ that satisfies $\mathbb{E}X^3 = 0$ and $\|X\|_{\psi_1} \le L$.

Let $X_1, \ldots, X_n$ and $Y$ be such that $Y = n^{-1/2}\sum_{i=1}^n X_i$ is $(L, n)$-almost Gaussian. For $0 < p < 1$ set
$$U(p) = \big\{x > 0: \mathbb{P}\big(|Y| > x\big) = p\big\}.$$
Since $X$ is non-atomic, $U(p)$ is non-empty; let
$$u_+(p) = \sup U(p) \quad\text{and}\quad u_-(p) = \inf U(p).$$
We shall apply Lemma 3.3 and Proposition 3.5 to bound $u_+(i/M)$ and $u_-(i/M)$ for every $i$, as long as $M$ is not too large (i.e., $\log M \le c_1 n^{1/3}$). To that end, set $\varepsilon_{M,n} = [(\log M)/n]^{1/2}$, and for fixed values of $M$ and $n$ and $1 \le i \le M$, let
$$u_i^+ = u_+(i/M) \quad\text{and}\quad u_i^- = u_-(i/M).$$

Corollary 3.7. For every $L > 0$ there exist a constant $C_0$ that depends on $L$ and an absolute constant $C_1$ for which the following holds. Assume that $Y$ is $(L, n)$-almost Gaussian and that $\log M \le C_0 n^{1/3}$. Then, for every $1 \le i \le M/2$,
$$\big(u_i^+\big)^2 \le \log\frac{2M^2}{\pi i^2} - \log\log\frac{2M^2}{\pi i^2} + C_1\max\Big(\frac{\log\log(2M^2/(\pi i^2))}{\log(2M^2/(\pi i^2))},\ \varepsilon_{M,n}\Big)$$
and
$$\big(u_i^-\big)^2 \ge \log\frac{2M^2}{\pi i^2} - \log\log\frac{2M^2}{\pi i^2} - C_1\max\Big(\frac{\log\log(2M^2/(\pi i^2))}{\log(2M^2/(\pi i^2))},\ \varepsilon_{M,n}\Big).$$

Proof. Since $\sqrt{4\log M} \le 2\sqrt{C_0}\,n^{1/6}$, one may use the Gaussian approximation from Proposition 3.5 to obtain
$$\mathbb{P}\big(|Y| \ge \sqrt{4\log M}\big) \le \mathbb{P}\big(|g| \ge \sqrt{4\log M}\big)\Big(1 + c_1\frac{\sqrt{4\log M}+1}{\sqrt{n}}\Big) \le \frac{2}{\sqrt{4\pi\log M}}\exp(-2\log M)\Big(1 + c_1\frac{\sqrt{4\log M}+1}{\sqrt{n}}\Big) \le \frac{1}{M^2}.$$
Thus, for every $1 \le i \le M$, if $x\in U(i/M)$ then $x \le \sqrt{4\log M}$.

Let $1 \le i \le M/2$ and $x\in U(i/M)$. Since $x \le c_1 n^{1/6}$ (because $x \le \sqrt{4\log M} \le 2\sqrt{C_0}\,n^{1/6}$), it follows from Proposition 3.5 that
$$\Big|\frac{i}{M} - H(x)\Big| \le c_3 H(x)\frac{x+1}{\sqrt{n}} \le c_4 H(x)\varepsilon_{M,n}, \tag{3.5}$$
where $H(x) = \mathbb{P}(|g| \ge x)$. Observe that if $W(p) = H^{-1}(p)$, then
$$\big|W^2(i/M) - x^2\big| \le c_5\varepsilon_{M,n}.$$
Indeed, since $H(x)(1 - c_4\varepsilon_{M,n}) \le i/M \le H(x)(1 + c_4\varepsilon_{M,n})$, the monotonicity of $W$ and the second part of Lemma 3.3, applied with $p = H(x)$, yield
$$W^2(i/M) \le W^2\big((1 - c_4\varepsilon_{M,n})H(x)\big) \le W^2\big(H(x)\big) + c_6\varepsilon_{M,n} = x^2 + c_6\varepsilon_{M,n}.$$
One obtains the lower bound in a similar way. The claim follows by using the approximate value of $W^2(i/M)$ provided in the first part of Lemma 3.3. □

The parameters $u_i^+$ and $u_i^-$ can be used to estimate the distribution of the non-increasing rearrangement $(Y_i^*)_{i=1}^M$ of the absolute values of $M$ independent copies of $Y$.

Lemma 3.8. There exist constants $c > 0$ and $j_0\in\mathbb{N}$ for which the following holds. Let $Y_1, \ldots, Y_M$ be i.i.d. non-atomic random variables. For every $1 \le s \le M$, with probability at least $1 - 2\exp(-cs)$,
$$\#\big\{i: |Y_i| \ge u_s^-\big\} \ge s/2 \quad\text{and}\quad \#\big\{i: |Y_i| \ge u_s^+\big\} \le 3s/2.$$
In particular, with probability at least $11/12$, for every $j_0 \le j \le M/2$,
$$u_{2j}^- \le Y_j^* \le u_{\lceil 2(j-1)/3\rceil}^+,$$
where $\lceil x\rceil = \min\{n\in\mathbb{N}: x \le n\}$.

Proof. Fix $0 < p < 1$ to be named later and let $(\delta_i)_{i=1}^M$ be independent $\{0, 1\}$-valued random variables with $\mathbb{E}\delta_i = p$. A straightforward application of Bernstein's inequality [26] shows that for every $t > 0$,
$$\mathbb{P}\Big(\Big|\frac{1}{M}\sum_{i=1}^M\delta_i - p\Big| \ge t\Big) \le 2\exp\big(-cM\min\big(t^2/p,\ t\big)\big).$$
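Instantiating with $t = p/2$ (a routine step we make explicit), $\min(t^2/p, t) = \min(p/4, p/2) = p/4$, so
$$\mathbb{P}\Big(\Big|\frac{1}{M}\sum_{i=1}^M\delta_i - p\Big| \ge \frac{p}{2}\Big) \le 2\exp(-cMp/4),$$
which is the form used in the next display (with $c_1 = c/4$).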

In particular, with probability at least $1 - 2\exp(-c_1 Mp)$,
$$\frac{1}{2}Mp \le \sum_{i=1}^M\delta_i \le \frac{3}{2}Mp.$$
We will apply this observation to the independent random variables $\delta_i = 1_{\{|Y_i| > a\}}$, $1 \le i \le M$, for an appropriate choice of $a$. Indeed, if we take $a$ for which $\mathbb{P}(|Y_1| > a) = s/M$ (such an $a$ exists because $Y_1$ is non-atomic), then with probability at least $1 - 2\exp(-c_1 s)$, at least $s/2$ of the $|Y_i|$ will be larger than $a$, and at most $3s/2$ will be larger than $a$. Since this result holds for any $a\in U(s/M)$, the first part of the claim follows.

Now take $s_0$ to be the smallest integer such that $1 - 2\sum_{s=s_0}^M\exp(-cs) \ge 11/12$ (in particular, $c^{-1}\log 24 \le s_0 \le c^{-1}(\log 48 + 1)$). Applying the union bound and a change of variables, it is evident that with probability at least $11/12$, for every $\lceil 3s_0/2\rceil + 1 \le j \le M/2$,
$$\#\big\{i: |Y_i| \ge u_{2j}^-\big\} \ge j \quad\text{and}\quad \#\big\{i: |Y_i| \ge u_{\lceil 2(j-1)/3\rceil}^+\big\} \le j - 1,$$
and thus $u_{2j}^- \le Y_j^* \le u_{\lceil 2(j-1)/3\rceil}^+$. □

With Lemma 3.8 and Corollary 3.7 in hand, one can bound the following functional of the random variables $(Y_i^*)_{i=1}^M$.

Lemma 3.9. For every $L > 0$ there exist constants $c_1, \ldots, c_4$, $j_0$ and $\alpha < 1$ that depend only on $L$, for which the following holds. Let $Y$ be $(L, n)$-almost Gaussian and let $Y_1, \ldots, Y_M$ be independent copies of $Y$. Then, with probability at least $11/12$, for every $j_0 \le \ell \le k \le \alpha M$,
$$c_1\,\frac{\log(ek/\ell) - \varepsilon_{M,n}}{\sqrt{\log(eM/\ell)}} \le Y_\ell^* - Y_k^* \le c_2\,\frac{\log(ek/\ell) + \varepsilon_{M,n}}{\sqrt{\log(eM/\ell)}}.$$
Moreover, with probability at least $10/12$, for every $j_0 \le \ell \le k \le \alpha M$,
$$Y_\ell^* - Y_k^* - \Big(\frac{1}{k}\sum_{i=1}^k\big(Y_i^* - Y_k^*\big)^2\Big)^{1/2} \ge c_3\,\frac{\log(ek/\ell)}{\sqrt{\log(eM/\ell)}} - \frac{c_4}{\sqrt{\log(eM/k)}},$$
and if $j_0 \le k \le \alpha M$, then $u_{2k}^- \le Y_k^* \le u_{\lceil 2(k-1)/3\rceil}^+$ and
$$\Big(\frac{1}{k}\sum_{i=1}^k\big(Y_i^* - Y_k^*\big)^2\Big)^{1/2} \le \frac{c_4}{\sqrt{\log(eM/k)}},$$
provided that $\log^2 M \lesssim_L k$ and that $\varepsilon_{M,n} = \sqrt{(\log M)/n} \le 1$.

Proof. The first part of the claim follows from Lemma 3.8 and Corollary 3.7, combined with a straightforward computation. For the second part, observe that, for a well-chosen constant $c_1(L)$ depending only on $L$, with probability at least $11/12$, $Y_1^* \le c_1(L)\sqrt{\log M}$. Hence, applying the first part of the claim, with probability at least $10/12$,
$$\frac{1}{k}\sum_{i=1}^k\big(Y_i^* - Y_k^*\big)^2 \le c_1(L)\,\frac{j_0\log M}{k} + \frac{1}{k}\sum_{i=j_0}^k\big(Y_i^* - Y_k^*\big)^2 \le c_1(L)\,\frac{j_0\log M}{k} + \frac{c_2}{k}\sum_{i=j_0}^k\Big(\frac{\log^2(ek/i)}{\log(eM/i)} + \frac{\varepsilon_{M,n}^2}{\log(eM/i)}\Big) \le c_1(L)\,\frac{j_0\log M}{k} + c_3\,\frac{1 + \varepsilon_{M,n}^2}{\log(eM/k)} \le \frac{c_4}{\log(eM/k)},$$
provided that $\log^2 M \lesssim_L k$ and that $\varepsilon_{M,n} \le 1$. Note that to estimate the sum we have used that
$$\frac{1}{k}\sum_{i=j_0}^k\frac{\log^2(ek/i)}{\log(eM/i)} \le \frac{1}{\log(eM/k)}\cdot\frac{1}{k}\sum_{i=j_0}^k\log^2(ek/i) \le \frac{c_3}{\log(eM/k)}.$$
Now the second and the third parts follow from the first one. □

The next preliminary step we need is a simple bound on the dual norm to the one whose unit ball is $A_r = B_1^M\cap\sqrt{r}B_2^M$. Recall that for a convex body $C\subset\mathbb{R}^M$, the polar body of $C$ is $C^\circ = \{x\in\mathbb{R}^M: \sup_{y\in C}\langle x, y\rangle \le 1\}$, and in our case $A_r^\circ = \mathrm{conv}\big(B_\infty^M\cup r^{-1/2}B_2^M\big)$ (see, for example, [23]). From here on, given $v\in\mathbb{R}^M$, set
$$\|v\|_{A_r^\circ} = \sup_{w\in A_r}\langle v, w\rangle,$$
and, as always, $(v_i^*)_{i=1}^M$ is the monotone rearrangement of $(|v_i|)_{i=1}^M$.

Lemma 3.10. For every $v\in\mathbb{R}^M$ and any $0 < \rho < r \le 1$ such that $1/r$ and $1/\rho$ are integers,
$$\|v\|_{A_r^\circ} - \|v\|_{A_\rho^\circ} \ge v_{1/r}^* - v_{1/\rho}^* - \sqrt{\rho}\Big(\sum_{i=1}^{1/\rho}\big(v_i^* - v_{1/\rho}^*\big)^2\Big)^{1/2},$$
and in general, for any $0 < r \le 1$,
$$v_{\lceil 1/r\rceil}^* \le \|v\|_{A_r^\circ} \le v_{\lceil 1/r\rceil}^* + \sqrt{\frac{1}{\lceil 1/r\rceil}}\Big(\sum_{i=1}^{\lceil 1/r\rceil}\big(v_i^* - v_{\lceil 1/r\rceil}^*\big)^2\Big)^{1/2}.$$


Proof. First, observe that for every $v\in\mathbb{R}^M$,
$$\|v\|_{A_r^\circ} = \min_{1\le j\le M}\Big(\sqrt{r}\Big(\sum_{i=1}^j\big(v_i^* - v_j^*\big)^2\Big)^{1/2} + v_j^*\Big). \tag{3.6}$$
Indeed, since $A_r^\circ = \mathrm{conv}(B_\infty^M\cup r^{-1/2}B_2^M)$, it is evident that $\|v\|_{A_r^\circ} = \inf\{\|u\|_\infty + \sqrt{r}\|w\|_2: v = u + w\}$. One may verify that if $v = u + w$ is an optimal decomposition, then $\mathrm{supp}(w)\subset\{i: |u_i| = \|u\|_\infty\}$. Hence, if $\|u\|_\infty = K$, then for every $1 \le i \le M$, $u_i = K\,\mathrm{sgn}(v_i)1_{\{|v_i|\ge K\}} + v_i 1_{\{|v_i| < K\}}$, and thus $w_i = 1_{\{|v_i|\ge K\}}\big(v_i - \mathrm{sgn}(v_i)K\big)$. Therefore,
$$\|v\|_{A_r^\circ} = \inf_{K>0}\Big(K + \sqrt{r}\Big(\sum_{\{i:\ |v_i|\ge K\}}\big(|v_i| - K\big)^2\Big)^{1/2}\Big).$$
Moreover, since it is enough to consider only values of $K$ in $\{v_j^*: 1\le j\le M\}$, (3.6) is verified. In particular, if $1/r$ is an integer, then
$$\|v\|_{A_r^\circ} \le \sqrt{r}\Big(\sum_{i=1}^{1/r}\big(v_i^* - v_{1/r}^*\big)^2\Big)^{1/2} + v_{1/r}^*.$$
On the other hand, if $T_r = \{u\in\mathbb{R}^M: \|u\|_2\le\sqrt{r},\ |\mathrm{supp}(u)|\le 1/r\}$, then $T_r\subset B_1^M\cap\sqrt{r}B_2^M$. Hence,
$$\|v\|_{A_r^\circ} \ge \sup_{w\in T_r}\langle v, w\rangle = \sqrt{r}\Big(\sum_{i=1}^{1/r}\big(v_i^*\big)^2\Big)^{1/2}.$$
Therefore, if $1/r$ and $1/\rho$ are integers, it follows that
$$\|v\|_{A_r^\circ} - \|v\|_{A_\rho^\circ} \ge \sqrt{r}\Big(\sum_{i=1}^{1/r}\big(v_i^*\big)^2\Big)^{1/2} - \sqrt{\rho}\Big(\sum_{i=1}^{1/\rho}\big(v_i^* - v_{1/\rho}^*\big)^2\Big)^{1/2} - v_{1/\rho}^* \ge v_{1/r}^* - v_{1/\rho}^* - \sqrt{\rho}\Big(\sum_{i=1}^{1/\rho}\big(v_i^* - v_{1/\rho}^*\big)^2\Big)^{1/2},$$
because $\big(r\sum_{i=1}^{1/r}(v_i^*)^2\big)^{1/2} \ge v_{1/r}^*$. The second part follows in a similar fashion and is omitted. □
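The inf-convolution formula from the proof is easy to test numerically. The sketch below (ours, not from the paper) evaluates $\inf_{K\ge 0}\big(K + \sqrt{r}\,\|(|v| - K)_+\|_2\big)$ on a fine grid and checks, up to grid resolution, the sandwich used at the end of the proof, namely $\sqrt{r}\big(\sum_{i\le 1/r}(v_i^*)^2\big)^{1/2} \le \|v\|_{A_r^\circ} \le \sqrt{r}\big(\sum_{i\le 1/r}(v_i^* - v_{1/r}^*)^2\big)^{1/2} + v_{1/r}^*$, for $r = 1/k$ with integer $k$:

```python
import numpy as np

def norm_Ar_polar(v, r, grid=20_000):
    """||v||_{A_r polar} via the inf-convolution form from the proof:
    inf over K >= 0 of K + sqrt(r) * || (|v| - K)_+ ||_2, evaluated on a grid."""
    a = np.abs(v)
    Ks = np.linspace(0.0, a.max(), grid)
    pos = np.maximum(a[None, :] - Ks[:, None], 0.0)
    return float((Ks + np.sqrt(r) * np.linalg.norm(pos, axis=1)).min())

rng = np.random.default_rng(1)
v = rng.standard_normal(100)
vs = np.sort(np.abs(v))[::-1]                 # non-increasing rearrangement v*
for k in (2, 10, 50):                         # r = 1/k, so 1/r is an integer
    r = 1.0 / k
    lower = np.sqrt(r) * np.linalg.norm(vs[:k])                          # sup over T_r
    upper = np.sqrt(r) * np.linalg.norm(vs[:k] - vs[k - 1]) + vs[k - 1]  # K = v*_{1/r}
    print(k, lower, norm_Ar_polar(v, r), upper)   # lower <= norm <= upper
```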

Proof of the lower bound of Theorem 3.2. Let $\phi_1, \ldots, \phi_M$, $X$ and $a > 0$ be such that $\phi_1(X), \ldots, \phi_M(X)$ are uniformly distributed on $[-a, a]$ and have variance $1$ (in particular, $a = \sqrt{3}$). Set $Y = \phi_{M+1}(X)$ to be a Rademacher variable. Assume further that $(\phi_i)_{i=1}^{M+1}$ are independent in $L_2(P_X)$ and let $F_M = \{0, \pm\phi_1, \ldots, \pm\phi_M\}$.

Note that the functions in $\mathrm{conv}(F_M)$ are given by $f_\lambda = \langle\Phi, \lambda\rangle$, where $\Phi = (\phi_1, \ldots, \phi_M)$ and $\lambda\in B_1^M$. It is straightforward to verify that the excess loss function of $f_\lambda$ relative to $\mathrm{conv}(F_M)$ is
$$\mathcal{L}_{f_\lambda} = (f_\lambda - \phi_{M+1})^2 - (0 - \phi_{M+1})^2 = \langle\Phi, \lambda\rangle^2 - 2\langle\Phi, \lambda\rangle\phi_{M+1}$$
(since $f^* = 0$), implying that $\mathbb{E}\mathcal{L}_{f_\lambda} = \|\lambda\|_2^2$.

Let us consider the problem of empirical minimization in $\mathrm{conv}(F_M) = \{\langle\lambda, \Phi\rangle: \lambda\in B_1^M\}$. Recall that $A_r = B_1^M\cap\sqrt{r}B_2^M$ and, for an independent sample $(\Phi(X_i), \phi_{M+1}(X_i))_{i=1}^n$, define the functional
$$\psi(r, \rho) = n\Big(\inf_{\lambda\in A_r} R_n(f_\lambda) - \inf_{\mu\in A_\rho} R_n(f_\mu)\Big).$$
If we show that for some $r > \rho$, $\psi(r, \rho) < 0$, then for that sample the empirical minimizer lies outside $A_\rho$, and thus $\mathbb{E}\mathcal{L}_{\hat{f}} \ge \rho$.
