Oracle inequalities and the isomorphic method

Shahar Mendelson 1,2,3 January 12, 2012

Abstract

We use the isomorphic method to study exact and non-exact oracle inequalities for empirical risk minimization in classes of functions that satisfy a subgaussian condition. We show that the “isomorphic profile” of the corresponding squared-loss classes may be characterized using properties of the gaussian process indexed by the underlying base class.

We use this information to obtain the oracle inequalities and to show that these are the sharpest that one may obtain using the isomorphic method.

We also show that in very general situations, there is a true gap between the resulting exact and non-exact inequalities, and that the reason the oracle inequalities cannot be improved further using the isomorphic method is structural: typical coordinate projections of the given base class contain extremal coordinate cubes.

1 Introduction

Oracle inequalities play an important part in modern learning theory and nonparametric statistics. Roughly put, given a class of functions $F$ on a probability space $(\Omega, \mu)$, an unknown target $Y$ and a loss function $\ell : \mathbb{R}^2 \to \mathbb{R}$, one tries to find a function in $F$ that approximates $Y$ almost as well as the best possible function in the class, relative to the loss; that is, almost as well as the minimizer in $F$ of $\mathbb{E}\ell(f(X), Y)$, where $X$ is distributed according to $\mu$.

The given data at one’s disposal is an independent sample $D = (X_i, Y_i)_{i=1}^N$, selected according to the joint distribution of $X$ and $Y$. Setting $\hat f$ to be the (random) function selected using the data, an oracle inequality with parameter $A \geq 1$ ensures that with probability at least $1-\delta$,

\[ \mathbb{E}\bigl(\ell(\hat f, Y) \mid D\bigr) \leq A \inf_{f \in F} \mathbb{E}\ell(f, Y) + \phi(F, N, \delta), \]

where $\phi$ is the “price” one has to pay for selecting an almost optimal function in $F$. This price should depend on the “richness” of $F$, the sample size $N$ and the desired high probability $1-\delta$. It should also be noted that since the measure $\mu$ is unknown, as is the target $Y$, one would like $\phi$ to hold uniformly for every reasonable joint distribution $(X, Y)$.

¹ Department of Mathematics, Technion, I.I.T, Haifa 32000, Israel.

² This research was supported by the Centre for Mathematics and its Applications, The Australian National University, Canberra, ACT 0200, Australia. Additional support was provided by an Australian Research Council Discovery grant DP0559465, the European Community’s Seventh Framework Programme (FP7/2007-2013) under ERC grant agreement 203134, and the Israel Science Foundation grant 900/10.

³ Email: shahar@tx.technion.ac.il

One of the main methods used to prove oracle inequalities is based on the comparison of the two structures involved: the random, visible structure, endowed on the class by the sample $(X_i, Y_i)_{i=1}^N$, and the actual structure, given by the unknown, underlying joint distribution $(X, Y)$. Such an “isomorphic” argument, introduced in this context in [3], is based on the following simple observation.

Let $H$ be a class of functions on a probability space $(\Omega, \nu)$ and let $Z$ be distributed according to $\nu$. If there is some $0 < \varepsilon < 1$ for which, with probability at least $1-\delta$,

\[ (1-\varepsilon)\mathbb{E}h \leq \frac{1}{N}\sum_{i=1}^N h(Z_i) \leq (1+\varepsilon)\mathbb{E}h \]

on the set $\{h : \mathbb{E}h \geq \lambda_N\}$, then functions in $H$ can either have a small expectation (smaller than $\lambda_N$), or alternatively, satisfy that their empirical means are equivalent to the actual means.

It should be noted that one way of obtaining an upper bound on the “isomorphic profile” of the class (i.e., the function $\lambda_N(H, \varepsilon, \delta)$) is using peeling, which is a standard method in statistical literature. Therefore, even if it is not stated in this way, the isomorphic method is at the heart of many fundamental results in modern learning theory/nonparametric statistics (even though, at times, peeling need not be the optimal way of controlling the isomorphic profile of the class).

As will be explained in Section 5, the isomorphic method allows one to obtain oracle inequalities for the empirical risk minimizer (ERM), which is the function that best approximates $Y$ on the data: i.e., the function $\hat f$ is a minimizer in $F$ of the functional $\sum_{i=1}^N \ell(f(X_i), Y_i)$.
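As an illustration of the definition of ERM, the following is a minimal sketch for a finite class and the squared loss. The names `squared_loss`, `empirical_risk`, `erm`, `toy_class` and `toy_sample` are hypothetical and chosen only for this example; the classes studied in the article are convex and typically infinite, so this sketch only shows the selection rule itself, not a practical procedure.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Sequence[Tuple[float, float]]
Predictor = Callable[[float], float]


def squared_loss(prediction: float, target: float) -> float:
    """The squared loss l(x, y) = (x - y)^2."""
    return (prediction - target) ** 2


def empirical_risk(f: Predictor, sample: Sample) -> float:
    """Average loss of f on the sample; the 1/N factor does not change the minimizer."""
    return sum(squared_loss(f(x), y) for x, y in sample) / len(sample)


def erm(function_class: Sequence[Predictor], sample: Sample) -> Predictor:
    """Return a minimizer of the empirical risk (the first one found if there are ties)."""
    return min(function_class, key=lambda f: empirical_risk(f, sample))


# Hypothetical toy data: three candidate functions and three sample points (X_i, Y_i).
toy_class: List[Predictor] = [lambda x: 0.0, lambda x: x, lambda x: 2.0 * x]
toy_sample = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2)]
f_hat = erm(toy_class, toy_sample)
print(empirical_risk(f_hat, toy_sample))
```

On this toy sample the identity function has the smallest empirical risk, so it is the function selected.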

There is very extensive literature on ERM-based oracle inequalities (for example, see the books [4, 33, 1, 32, 11] and [10, 16]). In general, one would like an oracle inequality to satisfy three main properties. First, that the way $\phi$ depends on the complexity of the given class $F$ is reasonable (and, of course, one would like to understand the correct notion of complexity and when a class is not too “rich” from that point of view). Second, that $\phi$ decreases quickly as a function of the sample size $N$, where the term “fast rates” is often used to describe a rate of decay faster than $1/\sqrt{N}$, and hopefully of the order of $1/N$, perhaps up to logarithmic factors. Finally, that the constant $A$ equals 1; such oracle inequalities are called exact, and when $A > 1$ they are called non-exact.

The motivation for this article was the observation that in essentially all known examples, when using the isomorphic method there is a real gap between the complexity functional φ one can have in exact and non-exact oracle inequalities.

It is well known [1, 20, 14] that unless $F$ has a reasonable structure (for example, when $F$ is a convex class), then even in the simplest of cases, when $|F| = 2$ and $\ell(x, y) = (x-y)^2$, it is impossible to obtain an exact oracle inequality with a rate better than $1/\sqrt{N}$, while the rate in a non-exact inequality is much faster. Therefore, it seems reasonable to compare the behavior of exact and non-exact inequalities only in cases in which the geometry of $F$ is not an obvious obstacle for obtaining a fast rate in an exact inequality. But even for classes with a “nice geometry”, the situation, as reflected in known examples, is unchanged: while one can obtain non-exact inequalities with fast rates, the exact ones have considerably slower ones.

Having this in mind, our aim here is to study the following three questions.

1. What sort of oracle inequalities (exact and non-exact) can one obtain using the isomorphic method, and what aspects in the structure of a given class govern the resulting “price” functionals $\phi$?

2. Is there a real gap between the exact and non-exact inequalities that are derived using the isomorphic method?

3. If the estimates from (1) are indeed optimal, is there an underlying structural phenomenon that exhibits this optimality?

We will consider the case when $\ell$ is the squared loss, $F$ is a convex, symmetric (i.e., if $f \in F$ then $-f \in F$) and $L$-subgaussian class of functions, where by $L$-subgaussian we mean that the $L_2(\mu)$ and $\psi_2(\mu)$ norms are equivalent on $F$ with a constant $L$ (see Definition 2.1). We will also assume that $Y$ is well bounded in $L_{\psi_2}$.¹

It turns out that for both oracle inequalities, the key parameters that appear when applying the isomorphic method are the levels $\rho_k$ at which the expected supremum of the canonical gaussian process $G_f$ indexed by $\{f \in F : \|f\|_{L_2(\mu)} \leq \rho\}$ is of the order of $\rho\sqrt{k}$; $k^*(F)$ is the first integer for which such a value exists, i.e., $k^*$ is of the order of

\[ \left( \frac{\mathbb{E}\sup_{f \in F} G_f}{\sup_{f \in F}\|f\|_{L_2}} \right)^2. \]

Note that by the results of [19], if $F$ is a subgaussian class then $k^*$ is equivalent to the smallest cardinality of a sample $N$ for which

\[ \sup_{f \in F} \left| \frac{1}{N}\sum_{i=1}^N f^2(X_i) - \mathbb{E}f^2 \right| \]

exhibits a non-trivial concentration, that is, when this supremum is smaller than $\sup_{f \in F}\mathbb{E}f^2$; and clearly, below the “level” at which concentration begins, there is no hope of obtaining any useful bound in our context.

As we will explain, for $k \geq k^*$, each $\rho_k$ measures the “maximal complexity” of a typical $k$-dimensional coordinate projection of $F$, $\{(f(X_i))_{i=1}^k : f \in F\}$ (see Definition 4.1).

Although the same parameters play a central role in both types of oracle inequalities, there are clear differences in the resulting complexities.

Theorem A. For every $L > 1$, there exist constants $c_0, c_1, c_2, c_3$ and $c_4$ that depend only on $L$ for which the following holds. Let $F$ be a convex, symmetric, $L$-subgaussian class. For every integer $N \geq k^*(F)$ set $\phi_1 = c_0(\rho^2_{c_1 N} + 1/N)$, $k_1 = \max\{k^* \leq k \leq c_2 N : \rho_k \geq c_3\sqrt{k/N}\}$, and $\phi_2 = c_3\rho^2_{k_1}$. If $\|Y\|_{\psi_2} \leq 1$, then with probability at least $1 - \exp(-c_4 N)$, ERM satisfies a non-exact oracle inequality with constant $A = 2$ and a complexity term of $\phi_1$. It also satisfies an exact inequality with probability at least $1 - \exp(-c_4 k_1)$ with (the much larger) complexity term $\phi_2$.

Note that the constant $A = 2$ can be replaced by any $A > 1$ and $Y \in B_{\psi_2}$ by $Y \in M B_{\psi_2}$, both at a price of modifying the constants $c_1, ..., c_4$. Also, observe that if the sequence $(\rho_k)$ is even mildly regular, then $\phi_2 \sim \min_{k^* \leq k \leq cN}(\rho_k^2 + k/N)$.

¹ Although many of the results may be extended beyond this case, the technical cost is rather high and we decided not to pursue this direction here.

Theorem A answers Question (1) in the subgaussian case, and the oracle inequalities are given using the “localized” structure of the canonical gaussian process indexed by $F$.

What is more surprising is that under additional mild structural assumptions on $F$, the estimate on the rate of the exact inequality that one may obtain using the isomorphic method is optimal. Therefore, the sequence $(\rho_k)$ determines the limitation of the isomorphic method, and there is a real gap between the exact and non-exact oracle inequalities that one may obtain using the isomorphic method, resolving Question (2).

To formulate this result, recall that $\ell$ is the squared loss, let $f^* = \mathrm{argmin}_{f \in F}\mathbb{E}\ell(f(X), Y)$ and set $L_f = \ell(f(X), Y) - \ell(f^*(X), Y)$.

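Theorem B. Under mild assumptions on $F$, $X$ and $Y$ (see Assumption 6.1 for an exact formulation), if $N \geq k^*(F)$ and $\lambda \lesssim \max\{\rho_k^2 : \rho_k \leq c_1\sqrt{k/N}\}$, then

\[ \mathbb{E}\sup_{\{L_f : \mathbb{E}L_f \geq \lambda\}} \left| \frac{1}{N}\sum_{i=1}^N \frac{L_f(X_i, Y_i)}{\mathbb{E}L_f} - 1 \right| \geq 1, \]

and a similar estimate holds with probability at least $c_3$.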

In particular, under the mildest regularity assumptions on $(\rho_k)$ and as will be explained in Section 5 and Section 6, Theorem B implies that the isomorphic method cannot yield an exact oracle inequality with a better complexity term than $c_1\min_{k^* \leq k \leq c_2 N}(\rho_k^2 + k/N)$, showing that the upper bound from Theorem A is sharp.

Finally, turning to Question (3), we will focus on the case $Y = 0$. It was conjectured that the limitation of the isomorphic method in this case should be attributed to some natural geometric phenomenon in $F$, and, in view of classical results in empirical processes theory (see, for example, the books [33, 5]), it should not be surprising that this phenomenon is connected with the combinatorial dimension.

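Definition 1.1 For every $\varepsilon > 0$, a set $I \subset \{1, ..., m\}$ is $\varepsilon$-shattered by $V \subset \mathbb{R}^m$ if there is some function $s : I \to \mathbb{R}$, satisfying that for every $J \subset I$ there is some $v_J \in V$ for which $v_J(i) \geq s(i) + \varepsilon$ if $i \in J$, and $v_J(i) \leq s(i) - \varepsilon$ if $i \in I \setminus J$. The combinatorial dimension of $V$ is the function

\[ vc(V, \varepsilon) = \sup\{|I| : I \text{ is } \varepsilon\text{-shattered by } V\}. \]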

With obvious modifications, one may define the combinatorial dimension for a class of functions F.
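For a finite set $V$ the definition can be checked by brute force, which may help in reading it. The sketch below is only an illustration under that finiteness assumption (the sets of interest in the article are convex and infinite), and the names `shattered_with_levels`, `is_shattered`, `combinatorial_dimension` and `cube` are hypothetical. It uses the observation that if some witness function $s$ exists, then $s(i)$ can be raised until $s(i) + \varepsilon$ equals a coordinate value $v(i)$ of some $v \in V$, so only finitely many candidate levels need to be tried.

```python
from itertools import combinations, product
from typing import Dict, Sequence, Tuple

Vector = Tuple[float, ...]


def shattered_with_levels(
    V: Sequence[Vector], I: Sequence[int], upper: Dict[int, float], eps: float
) -> bool:
    """Check the shattering condition for fixed levels upper[i] = s(i) + eps."""
    for r in range(len(I) + 1):
        for J in combinations(I, r):
            chosen = set(J)
            # Look for v_J with v_J(i) >= s(i) + eps on J and v_J(i) <= s(i) - eps off J.
            witness_exists = any(
                all(
                    v[i] >= upper[i] if i in chosen else v[i] <= upper[i] - 2 * eps
                    for i in I
                )
                for v in V
            )
            if not witness_exists:
                return False
    return True


def is_shattered(V: Sequence[Vector], I: Sequence[int], eps: float) -> bool:
    """Try every candidate level s(i) + eps taken from the coordinate values of V."""
    candidates = [sorted({v[i] for v in V}) for i in I]
    return any(
        shattered_with_levels(V, I, dict(zip(I, levels)), eps)
        for levels in product(*candidates)
    )


def combinatorial_dimension(V: Sequence[Vector], eps: float) -> int:
    """Largest |I| that is eps-shattered by the finite, non-empty set V (0 if none)."""
    m = len(V[0])
    for size in range(m, 0, -1):
        if any(is_shattered(V, I, eps) for I in combinations(range(m), size)):
            return size
    return 0


# The four vertices of the square [-1, 1]^2.
cube = list(product((-1.0, 1.0), repeat=2))
print(combinatorial_dimension(cube, 1.0), combinatorial_dimension(cube, 1.5))
```

The search is exponential in $m$ and in $|V|$, so it is only meant for very small examples such as the square above.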

Note that the combinatorial dimension at scale $\varepsilon$ measures the largest dimension of a “coordinate cube” of size $\varepsilon$ that can be found in a coordinate projection of $V$ (or of the class $F$). Moreover, it is standard to verify that if $V$ is convex and symmetric and $\sigma$ is $\varepsilon$-shattered by $V$, then the $\varepsilon$-cube on the coordinates $\sigma$, $\varepsilon B_\infty^\sigma$, satisfies that

\[ \varepsilon B_\infty^\sigma \subset P_\sigma V = \{(v_i)_{i \in \sigma} : v \in V\}. \]

It is well known [2] that a class of uniformly bounded functions satisfies the uniform law of large numbers if and only if $vc(F, \varepsilon)$ is finite for every $\varepsilon > 0$. Moreover, by the Giné-Zinn symmetrization method [8], if $F$ is a class of mean-zero functions and $(\varepsilon_i)_{i=1}^N$ are independent, symmetric $\{-1, 1\}$-valued random variables, then

\[ \mathbb{E}\sup_{f \in F} \left| \frac{1}{N}\sum_{i=1}^N f(X_i) - \mathbb{E}f \right| \sim \mathbb{E}\sup_{f \in F} \left| \frac{1}{N}\sum_{i=1}^N \varepsilon_i f(X_i) \right|. \]

Therefore, if the law of large numbers “breaks down” at some dimension/sample size $N$, then with high probability, a typical coordinate projection $V = P_\sigma F = \{(f(X_i))_{i=1}^N : f \in F\}$ satisfies that $\mathbb{E}_\varepsilon\sup_{v \in P_\sigma F}|\sum_{i=1}^N \varepsilon_i v_i| \geq \delta N$. By the main result of [17], if $F$ consists of uniformly bounded functions, this “break-down” is exhibited by the fact that $vc(P_\sigma F, c_1\delta) \geq c_2\delta^2 N$, i.e., that $P_\sigma F$ contains a large ($\sim\delta$), high dimensional ($\sim\delta^2 N$) coordinate cube.

We will show that under reasonable assumptions on $F$, if $N \geq k \geq k^*$ then with $\mu^N$-high probability, a typical further $k$-dimensional projection of $F \cap \rho_k B(L_2)$ is an extremal set in some sense: although it is contained in $c_3\rho_k\sqrt{k}B_2^k$, it has a coordinate projection of dimension proportional to $k$ that contains a coordinate cube of size $\sim\rho_k$ (which is the “largest” coordinate cube a proportional coordinate projection of a subset of $\rho_k\sqrt{k}B_2^k$ can contain). We will show that the appearance of this structure exhibits the break-down in the isomorphic condition and thus it resolves Question (3).

Below we have formulated a slightly weaker version of this result (see Theorem 4.5 for the full statement).

Theorem C. Under mild assumptions on $F$ (see Definition 4.4), if $F$ is convex, symmetric and $L$-subgaussian, then with probability at least $1 - \exp(-c_1 N)$,

\[ vc\bigl(P_\sigma(F \cap \rho_N B(L_2(\mu))), c_2\rho_N\bigr) \geq c_3 N. \]

Besides resolving Question (3), Theorem C has interesting implications on the geometry of random polytopes. When applied to $F = \{\langle t, \cdot\rangle : t \in B_1^n\}$, then for $|\sigma| = k < n$, $P_\sigma F$ is the symmetric convex hull of the $n$ vertices $\{(\langle X_i, e_j\rangle)_{i=1}^k : 1 \leq j \leq n\}$, which we denote by $K_{n,k}$. In [15], the authors show that if $\xi$ is a mean-zero, variance 1, subgaussian random variable, $X = (\xi_i)_{i=1}^n$ is a vector with independent coordinates distributed according to $\xi$, and $(X_i)_{i=1}^k$ are independent copies of $X$, then with high probability,

\[ c_1\bigl(B_\infty^k \cap \sqrt{\log(en/k)}\,B_2^k\bigr) \subset K_{n,k}, \]

and in particular,

\[ c_2\sqrt{\frac{\log(en/k)}{k}}\,B_\infty^k \subset K_{n,k}. \tag{1.1} \]

Here, Theorem C implies a subgaussian version of (1.1). We will show that $F = \{\langle t, \cdot\rangle : t \in B_1^n\}$ satisfies the assumptions of Theorem C, and therefore, if $X$ is an isotropic (i.e., $\mathbb{E}|\langle X, t\rangle|^2 = \|t\|_2^2$), $L$-subgaussian vector (rather than a vector with i.i.d., mean-zero, subgaussian coordinates), there is a subset $I$ of $\{1, ..., k\}$ of proportional cardinality for which the coordinate projection $P_I K_{n,k}$ satisfies that

\[ c_3\sqrt{\frac{\log(en/k)}{k}}\,B_\infty^I \subset P_I K_{n,k}. \]

The proofs of Theorems A, B and C are based on the following steps.

First, one has to extend the results from [19] and obtain high probability estimates on the empirical process

\[ \sup_{f \in F,\, h \in H} \left| \frac{1}{N}\sum_{i=1}^N f(X_i)h(X_i) - \mathbb{E}fh \right| \]

for reasonable classes $F$ and $H$.

The required extension follows the path of [23] and [25], by obtaining accurate information on the structure of the random sets PσF and PσH, and is presented in Section 3.

In Section 4 we will study properties of the fixed points $\rho_k$, and focus on situations in which the gaussian complexities $\mathbb{E}\sup_{f \in F \cap rB(L_2)} G_f$ originate from “finite dimensional” skeletons of $F$. In particular, we will prove the extended version of Theorem C and its corollary on the geometry of random polytopes. We will also show that in such cases, if $(g_i)_{i=1}^N$ are independent, standard gaussian random variables, then

\[ \mathbb{E}\sup_{f \in F \cap \rho_k B(L_2)} G_f \sim \mathbb{E}_g\sup_{v \in P_\sigma(F \cap \rho_k B(L_2))} \frac{1}{\sqrt{N}}\sum_{i=1}^N g_i v_i \]

with high probability. And, what is more important for the proof of the lower bound, we will show that the Bernoulli process indexed by $P_\sigma(F \cap \rho_k B(L_2))$ is also equivalent to the corresponding gaussian process (whereas for arbitrary subsets of $\mathbb{R}^N$ there could be a $\sqrt{\log N}$ gap between the expected supremum of the Bernoulli process and the gaussian one). The proof will be based on the “good geometry” of $P_\sigma(F \cap \rho_k B(L_2))$, which allows one to construct an optimal admissible sequence for that set.

Finally, using the results of Section 3 and Section 4 we will present the sharp upper and lower bounds resulting from the isomorphic method (Theorem A and Theorem B) in Section 5 and Section 6 respectively.

2 Notation

This section is devoted to various definitions and notation that will be used throughout the article. Many results and methods used here are standard in the areas of probability in Banach spaces and empirical processes. We refer the reader to [13] and to [33, 5] as general references on these topics.

Absolute constants are denoted by $c_1, c_2, ...$ and their value may change from line to line. We write $A \lesssim B$ if there is an absolute constant $c_1$ for which $A \leq c_1 B$. $A \sim B$ means that $c_1 A \leq B \leq c_2 A$ for absolute constants $c_1$ and $c_2$. If the constants depend on some parameter $r$ we will write $A \lesssim_r B$ or $A \sim_r B$.

We will consider various normed spaces. Given a probability measure $\mu$, denote by $B(L_2(\mu))$ or by $D$ the unit ball of the space $L_2(\mu)$. For $\alpha \geq 1$, $L_{\psi_\alpha}$ is the Orlicz space of all measurable functions for which the $\psi_\alpha$ norm, defined by

\[ \|f\|_{\psi_\alpha} = \inf\{c > 0 : \mathbb{E}\exp(|f/c|^\alpha) \leq 2\}, \]

is finite. For basic facts on Orlicz spaces we refer the reader to [28, 33].

Here, we will only require the observation that $\|f\|_{\psi_\alpha}$ is equivalent to the smallest constant $\kappa$ for which, for every $t \geq 1$, $Pr(|f| \geq t) \leq 2\exp(-(t/\kappa)^\alpha)$.
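To make the definition of the $\psi_\alpha$ norm concrete, here is a minimal numerical sketch for a random variable that is uniformly distributed on a finite list of values, so that the expectation is a plain average. The function name `psi_alpha_norm` and its parameters are hypothetical, and bisection is just one convenient way to locate the infimum; it relies on the fact that $c \mapsto \mathbb{E}\exp(|f/c|^\alpha)$ is decreasing in $c$, and it assumes a moderate value of $\alpha$ so that the exponentials do not overflow.

```python
import math
from typing import Sequence


def psi_alpha_norm(values: Sequence[float], alpha: float = 2.0, iterations: int = 200) -> float:
    """Smallest c > 0 with mean(exp(|x / c|^alpha)) <= 2 over the given values."""
    scale = max(abs(x) for x in values)
    if scale == 0.0:
        return 0.0

    def orlicz_mean(c: float) -> float:
        return sum(math.exp((abs(x) / c) ** alpha) for x in values) / len(values)

    # Bracket the threshold: the constraint holds at hi and fails at lo.
    hi = scale
    while orlicz_mean(hi) > 2.0:
        hi *= 2.0
    lo = hi
    while orlicz_mean(lo) <= 2.0:
        lo /= 2.0

    # Bisection on the decreasing function c -> orlicz_mean(c).
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if orlicz_mean(mid) <= 2.0:
            hi = mid
        else:
            lo = mid
    return hi


# A symmetric {-1, 1}-valued variable: exp(1 / c^2) = 2 gives c = 1 / sqrt(log 2).
print(psi_alpha_norm([1.0, -1.0], alpha=2.0), 1.0 / math.sqrt(math.log(2.0)))
```

For the symmetric $\{-1, 1\}$-valued variable the constraint reads $\exp(c^{-2}) \leq 2$, so the two printed numbers should agree up to numerical precision.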

Let $\ell_p^n$ be $\mathbb{R}^n$ endowed with the $\ell_p$ norm and set $L_p^n$ to be $\mathbb{R}^n$ when viewed as the $L_p$ space of functions on $\{1, ..., n\}$, relative to the uniform probability measure on that set. $\ell_{p,\infty}^n$ is $\mathbb{R}^n$ endowed with the weak $\ell_p$ (quasi)-norm: $\|x\|_{p,\infty} = \sup_{i \geq 1} i^{1/p}x_i^*$, where $(x_i^*)$ is a monotone non-increasing rearrangement of $(|x_i|)$. If $I \subset \{1, ..., n\}$, $\ell_{p,\infty}^I$ is $\mathbb{R}^I$ endowed with the weak $\ell_p$ norm.

Put $\|\ \|_{\psi_\alpha^n}$ to be the $\psi_\alpha$ Orlicz norm on $\mathbb{R}^n$, which is viewed as a space of functions on $\{1, ..., n\}$ endowed with the uniform probability measure. In particular, using the characterization of the $\psi_\alpha$ norm, it follows that for $x \in \mathbb{R}^n$, $x_i^* \lesssim \|x\|_{\psi_\alpha^n}\log^{1/\alpha}(en/i)$.

We will denote by $B_p^n$, $B_{p,\infty}^n$ and $B_{\psi_\alpha}^n$ the unit balls of $\mathbb{R}^n$ endowed with the corresponding $\ell_p$, weak-$\ell_p$ and $\psi_\alpha$ norms, respectively. For other normed spaces $X$, set $B_X$ to be the unit ball and let $S_X$ be the unit sphere.

Given an independent sample $\sigma = (X_1, ..., X_N)$ selected according to $\mu$, set $\|f\|_{L_2^\sigma} = \bigl(\frac{1}{N}\sum_{i=1}^N f^2(X_i)\bigr)^{1/2}$, and given a class $F$ let

\[ P_\sigma F = \{(f(X_i))_{i=1}^N : f \in F\} \subset \mathbb{R}^N \]

be the coordinate projection of $F$ generated by the sample (coordinates) $\sigma$.

We will also be interested in further coordinate projections of $P_\sigma F$ that are given by selectors. Let $1 \leq k \leq N$, set $\delta = k/N$ and put $\delta_1, ..., \delta_N$ to be independent, $\{0, 1\}$-valued random variables with mean $\delta$. Let $J = \{i : \delta_i = 1\}$ and set

\[ P_J P_\sigma F = \{(f(X_i))_{i \in J} : f \in F\}. \]

A standard concentration argument shows that with probability at least $1 - 2\exp(-c_0 k)$, $|J| \sim k$.

Our main interest will be subgaussian classes:

Definition 2.1 Let $F$ be a class of functions on a probability space $(\Omega, \mu)$. We say that $F$ is $L$-subgaussian if for every $f, h \in F$, $\|f - h\|_{\psi_2} \leq L\|f - h\|_{L_2}$ and $\|f\|_{\psi_2} \leq L\|f\|_{L_2}$, where both norms are relative to the measure $\mu$.

Theorem 2.2 There exists an absolute constant $c_1$ for which the following holds. If $f \in L_{\psi_1}$ and $X_1, ..., X_N$ are independent random variables distributed according to $\mu$, then for every $u > 0$,

\[ Pr\left( \left| \frac{1}{N}\sum_{i=1}^N f(X_i) - \mathbb{E}f \right| \geq u\|f\|_{\psi_1} \right) \leq 2\exp(-c_1 N\min\{u^2, u\}). \]

In particular, if $f \in L_{\psi_2}$ then

\[ Pr\left( \left| \frac{1}{N}\sum_{i=1}^N f^2(X_i) - \mathbb{E}f^2 \right| \geq u\|f\|_{\psi_2}^2 \right) \leq 2\exp(-c_1 N\min\{u^2, u\}). \]

The first part of Theorem 2.2 is a $\psi_1$ version of Bernstein’s inequality (see, for example, [33]). The second one is an immediate outcome of the first claim, because $\|f^2\|_{\psi_1} = \|f\|_{\psi_2}^2$. Theorem 2.2 will be used extensively in the chaining process that is at the heart of our proofs.

2.0.1 Chaining

Chaining based methods play a central role in the study of certain stochastic processes, and in particular, in the study of gaussian and empirical processes. Chaining allows one to translate the probabilistic structure of the given process to a metric invariant of the indexing set, using Talagrand’s $\gamma$ functionals [31].

Definition 2.3 [31] For a metric space $(F, d)$, an admissible sequence of $F$ is a collection of subsets of $F$, $\{F_s : s \geq 0\}$, such that for every $s \geq 1$, $|F_s| \leq 2^{2^s}$ and $|F_0| = 1$. For $\beta \geq 1$ and $j \geq 0$, define the $\gamma_{\beta,j}$ functional by

\[ \gamma_{\beta,j}(F, d) = \inf\sup_{f \in F}\sum_{s=j}^{\infty} 2^{s/\beta}d(f, F_s), \]

where the infimum is taken with respect to all admissible sequences of $F$. If $j = 0$ we will write $\gamma_\beta$ instead of $\gamma_{\beta,j}$. Given an admissible sequence $(F_s)_{s \geq 0}$ we denote by $\pi_s f$ a nearest point to $f$ in $F_s$ relative to the metric $d$, and for $s > 0$, let $\Delta_s f = \pi_s f - \pi_{s-1}f$.
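The quantity inside the infimum is easy to evaluate for one fixed admissible sequence on a finite metric space, and any such value is an upper bound on $\gamma_{\beta,j}(F, d)$. The sketch below does exactly that and nothing more: it does not search for an optimal sequence. The names `is_admissible` and `chaining_sum` are hypothetical, and the sketch assumes that the last set of the sequence is all of $F$, so that every later term of the series vanishes.

```python
from typing import Callable, Sequence


def is_admissible(sequence: Sequence[Sequence[float]]) -> bool:
    """Check |F_0| = 1 and |F_s| <= 2^(2^s) for every s >= 1."""
    return len(sequence[0]) == 1 and all(
        len(F_s) <= 2 ** (2 ** s) for s, F_s in enumerate(sequence) if s >= 1
    )


def chaining_sum(
    points: Sequence[float],
    sequence: Sequence[Sequence[float]],
    dist: Callable[[float, float], float],
    beta: float = 2.0,
    j: int = 0,
) -> float:
    """sup over f of sum_{s >= j} 2^(s / beta) * d(f, F_s) for the given sequence."""

    def dist_to_set(f: float, F_s: Sequence[float]) -> float:
        return min(dist(f, g) for g in F_s)

    return max(
        sum(2 ** (s / beta) * dist_to_set(f, sequence[s]) for s in range(j, len(sequence)))
        for f in points
    )


# Four points on the real line with the usual distance; the last set is the whole space.
points = [0.0, 1.0, 2.0, 3.0]
sequence = [[0.0], [0.0, 2.0], [0.0, 1.0, 2.0, 3.0]]
value = chaining_sum(points, sequence, dist=lambda a, b: abs(a - b))
print(is_admissible(sequence), value)
```

In this toy example the supremum is attained at the point 3, which is at distance 3 from $F_0$ and at distance 1 from $F_1$, giving $3 + \sqrt{2}$.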

When considered for a set $F \subset L_2$ (or $\ell_2^N$), $\gamma_2$ is determined by properties of the canonical gaussian process indexed by $F$ (see [5, 31] for detailed expositions on these connections). Indeed, under certain mild measurability assumptions, if $\{G_f : f \in F\}$ is a centered gaussian process indexed by a set $F$ then

\[ c_1\gamma_2(F, d) \leq \mathbb{E}\sup_{f \in F} G_f \leq c_2\gamma_2(F, d), \]

where $c_1$ and $c_2$ are absolute constants, and for every $f, h \in F$, $d^2(f, h) = \mathbb{E}|G_f - G_h|^2$. The upper bound is due to Fernique [6] and the lower bound is Talagrand’s Majorizing Measures Theorem [30].
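The expected supremum in the middle of this two-sided bound can be estimated numerically when the index set is a finite subset of $\mathbb{R}^n$ and the process is $G_t = \sum_i g_i t_i$ with independent standard gaussians $g_i$. The following is a plain Monte Carlo sketch, which is not a method used in the article; the function name `expected_gaussian_supremum`, the number of trials and the seed are all hypothetical choices made for the illustration.

```python
import math
import random
from typing import Sequence


def expected_gaussian_supremum(
    T: Sequence[Sequence[float]], trials: int = 20000, seed: int = 0
) -> float:
    """Monte Carlo estimate of E sup_{t in T} sum_i g_i t_i for a finite set T in R^n."""
    rng = random.Random(seed)  # fixed seed, so repeated calls give the same estimate
    n = len(T[0])
    total = 0.0
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += max(sum(g_i * t_i for g_i, t_i in zip(g, t)) for t in T)
    return total / trials


# For T = {e_1, -e_1} the supremum is |g_1|, whose mean is sqrt(2 / pi).
segment_ends = [(1.0, 0.0), (-1.0, 0.0)]
estimate = expected_gaussian_supremum(segment_ends)
print(estimate, math.sqrt(2.0 / math.pi))
```

The two-point example is chosen because its exact value is known, which gives a simple sanity check for the estimator.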

Note that if $T \subset \ell_2^n$, $(g_i)_{i=1}^n$ are independent, standard gaussians and $G_t = \sum_{i=1}^n g_i t_i$, then $d(s, t) = \|s - t\|_2$, and therefore

\[ c_1\gamma_2(T, \ell_2^n) \leq \mathbb{E}\sup_{t \in T}\sum_{i=1}^n g_i t_i \leq c_2\gamma_2(T, \ell_2^n). \tag{2.1} \]

The understanding of empirical processes is far less complete, because there is no single metric that captures the tail behavior of a sum of iid random variables. A very simple case is the subgaussian one, in which a standard chaining argument may be used to show that if $F \subset L_{\psi_2}$ is a class of mean-zero functions, then

\[ \mathbb{E}\sup_{f \in F}\left| \sum_{i=1}^N f(X_i) \right| \lesssim \gamma_2(F, \psi_2)\sqrt{N}, \]

and similar bounds hold with high probability.

Here, to study the squared loss, we will need a stronger result: estimates on the quadratic process

\[ \sup_{f \in F}\left| \sum_{i=1}^N \bigl(f^2(X_i) - \mathbb{E}f^2\bigr) \right| \]

using a metric structure on $F$. This problem has been studied in the subgaussian context in [19], which is the case that will interest us here, and in far more general situations in [22, 23, 25].

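Theorem 2.4 [19] There exist absolute constants $c_1$ and $c_2$ for which the following holds. Let $F \subset S_{L_2}$ be a symmetric class and set $d_F(\psi_2) = \sup_{f \in F}\|f\|_{\psi_2}$. Then, with probability at least $1 - 2\exp\bigl(-c_1\min\{N, \gamma_2^2(F, \psi_2)/d_F^2(\psi_2)\}\bigr)$,

\[ \sup_{f \in F}\left| \frac{1}{N}\sum_{i=1}^N f^2(X_i) - 1 \right| \leq c_2\left( d_F(\psi_2)\frac{\gamma_2(F, \psi_2)}{\sqrt{N}} + \frac{\gamma_2^2(F, \psi_2)}{N} \right), \]

and

\[ \mathbb{E}\sup_{f \in F}\left| \frac{1}{N}\sum_{i=1}^N f^2(X_i) - 1 \right| \leq c_2\left( d_F(\psi_2)\frac{\gamma_2(F, \psi_2)}{\sqrt{N}} + \frac{\gamma_2^2(F, \psi_2)}{N} \right). \]

The main result of Section 3 below is a stronger version of Theorem 2.4, still in the subgaussian context.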

3 Bounds on empirical processes

This section is devoted to the proof of the following extension of Theorem 2.4.

Theorem 3.1 There exist absolute constants $c_1, c_2$ and $c_3$ for which the following holds. Let $F$ and $H$ be classes of functions of cardinality at least 2 on $(\Omega, \mu)$ and assume that $\gamma_2(F, \psi_2)/d_F(\psi_2) \geq \gamma_2(H, \psi_2)/d_H(\psi_2)$. For every integer $N \geq (\gamma_2(F, \psi_2)/d_F(\psi_2))^2$ and $u \geq c_1$, with probability at least

\[ 1 - 2\exp\left( -c_2 u^2\left( \frac{\gamma_2(F, \psi_2)}{d_F(\psi_2)} \right)^2 \right), \]

\[ \sup_{f \in F,\, h \in H}\left| \frac{1}{N}\sum_{i=1}^N f(X_i)h(X_i) - \mathbb{E}fh \right| \leq c_3 u^3\left( d_H(\psi_2)\frac{\gamma_2(F, \psi_2)}{\sqrt{N}} + \frac{\gamma_2^2(F, \psi_2)}{N} \right). \tag{3.1} \]

In particular, if $H = F$ then with the same probability,

\[ \sup_{f \in F}\left| \frac{1}{N}\sum_{i=1}^N f^2(X_i) - \mathbb{E}f^2 \right| \leq c_3 u^3\left( d_F(\psi_2)\frac{\gamma_2(F, \psi_2)}{\sqrt{N}} + \frac{\gamma_2^2(F, \psi_2)}{N} \right). \tag{3.2} \]

Although Theorem 3.1 seems similar to Theorem 2.4, with the major improvement (which is essential to our analysis) being in the probability estimate for “large values” of $u$, the proofs of Theorem 2.4 and Theorem 3.1 are very different, and the latter is based on a more subtle decomposition of $F$, as will be explained below.

When studying empirical processes, a very significant observation is due to Giné and Zinn [8]: instead of the supremum of the empirical process, it suffices to bound the supremum of the Bernoulli process indexed by a typical coordinate projection of the given class. Therefore, and following ideas from [23], we will study the structure of such random coordinate projections and the way their geometry is dictated by the structure of the class.

A very general notion of a “well behaved structure” of a coordinate projection has been introduced in [25], but we will need it in a much simpler (subgaussian) situation, leading to the following definition. To formulate it, let $\sigma = (X_1, ..., X_N)$ and consider $V = P_\sigma F$. If $v = P_\sigma f = (f(X_i))_{i=1}^N$, we will abuse notation by writing $\pi_s v$ both for $\pi_s f$ and for $P_\sigma\pi_s f$, and $\Delta_s v$ for $\Delta_s f$ and $P_\sigma\Delta_s f$. Also, set $d_V = \sup_{f \in F}\|f\| \equiv d_F$. This way $V$ is endowed with the metric structure from $F$, as well as having a natural $\ell_p$ structure through the coordinate system on $\mathbb{R}^N$.

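Definition 3.2 Let $V \subset \mathbb{R}^N$ be a coordinate projection of a class of functions $F$, and assume that the metric on $F$ is endowed by a norm $\|\ \|$. Given an integer $1 \leq j \leq \log_2 N$ and $\alpha, \gamma_j > 0$, we say that $V = P_\sigma F$ admits a $(j, \alpha, \gamma_j)$ decomposition with respect to the norm if it satisfies the following properties:

1. $F$ has an admissible sequence $(F_s)_{s \geq 0}$ for which

\[ \sup_{f \in F}\sum_{s > j} 2^{s/2}\|\Delta_s(f)\| \leq \gamma_j. \]

2. For every $v = P_\sigma f$, $v \in V$, every $j < s \leq \log_2 N$ and every subset $I \subset \{1, ..., N\}$

\[ \left( \sum_{i \in I}(\Delta_s v)_i^2 \right)^{1/2} \leq \alpha\|\Delta_s v\|\left( 2^{s/2} + \sqrt{|I|}\log^{1/2}(eN/|I|) \right), \]

and for every $j \leq s \leq \log_2 N$ and every $I \subset \{1, ..., N\}$

\[ \left( \sum_{i \in I}(\pi_s v)_i^2 \right)^{1/2} \leq \alpha\|\pi_s v\|\left( 2^{s/2} + \sqrt{|I|}\log^{1/2}(eN/|I|) \right). \]

3. For every $v \in V$ and every $s \geq \log_2 N$,

\[ \left( \sum_{i=1}^N(\Delta_s v)_i^2 \right)^{1/2} \leq \alpha 2^{s/2}\|\Delta_s v\|. \]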

The reason we chose to emphasize the set $V$ rather than $F$ in Definition 3.2 is that the structural results of this section remain true (with the obvious adjustments) for an arbitrary $V \subset \mathbb{R}^N$ that admits such a decomposition, which is very useful when studying various processes indexed by $V$. In our case, the fact that $V$ is a coordinate projection of $F$ is only used to decompose $V$ using the structure of $F$.

The “local” properties (1), (2) and (3) imply “global” information on the structure of $V$, which will be referred to in what follows as property (4).

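Lemma 3.3 If $V = P_\sigma F$ admits a $(j, \alpha, \gamma_j)$ decomposition with respect to the norm $\|\ \|$, then for every $v \in V$ and every $I \subset \{1, ..., N\}$,

\[ \left( \sum_{i \in I} v_i^2 \right)^{1/2} \lesssim \alpha\left( \gamma_j + d_F\max\left\{ 2^{j/2}, \sqrt{|I|}\log^{1/2}(eN/|I|) \right\} \right), \]

where $d_F = \sup_{f \in F}\|f\|$.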

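Proof. We will separate the proof into two cases. First, if $\exp(|I|\log(eN/|I|)) > 2^{2^j}$ (i.e., if $|I| \gtrsim 2^j/\log(eN/2^j)$), then there is some $j < s_0 \leq \log_2 N$ for which $2^{s_0} \sim |I|\log(eN/|I|)$. Since $f = \sum_{s > s_0}\Delta_s f + \pi_{s_0}f$, then if $v = P_\sigma f$, $v = \sum_{s > s_0}\Delta_s v + \pi_{s_0}v$. Therefore, by (2) and (3), and recalling that $\|\Delta_s v\| = \|\Delta_s f\|$ and $\|\pi_s v\| = \|\pi_s f\|$,

\[ \begin{aligned} \|v\|_{\ell_2^I} &\lesssim \alpha\sum_{s > s_0}\|\Delta_s v\|\left( 2^{s/2} + \sqrt{|I|\log(eN/|I|)} \right) + \alpha\|\pi_{s_0}v\|\left( 2^{s_0/2} + \sqrt{|I|\log(eN/|I|)} \right) \\ &\lesssim \alpha\left( \gamma_j + \sup_{f \in F}\|f\|\sqrt{|I|\log(eN/|I|)} \right). \end{aligned} \]

If, on the other hand, $\exp(|I|\log(eN/|I|)) \leq 2^{2^j}$ then we repeat the chaining argument with $s_0 = j$. The rest of the proof is identical.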

Although this notion of decomposition appears strange, we will show in the next section that for every $j \geq 1$, a typical coordinate projection $P_\sigma F$ of a subgaussian class $F$ admits such a decomposition, with $\|\cdot\| = \|\cdot\|_{\psi_2}$, $\gamma_j = \gamma_{2,j}(F, \psi_2)$, and $\alpha$ that depends on the desired probability.

Property (2) implies that for “small” values of $s$ ($2^s \leq N$), each $w = \Delta_s v$ can be written as a sum of two vectors, a “peaky” part and a “regular” part: $P_I w + P_{I^c}w$. The first, “peaky” vector is supported on relatively few coordinates and

\[ \|P_I w\|_{\ell_2^N} \lesssim \alpha 2^{s/2}\|w\|, \]

while the “regular” part satisfies that $\|P_{I^c}w\|_{\psi_2^N} \lesssim \alpha\|w\|$, since a monotone rearrangement of the coordinates of $(|P_{I^c}w|)_i$ satisfies that

\[ (P_{I^c}w)_i^* \lesssim \alpha\|w\|\log^{1/2}(eN/i). \]

Indeed, this decomposition is evident from (2), by considering the $i$ largest coordinates of $w$, where $i$ is the largest integer in $\{1, ..., N\}$ for which $2^s \geq i\log(eN/i)$.
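The splitting described here can be sketched as follows. The sketch assumes that the cutoff is the largest $i \in \{1, ..., N\}$ with $i\log(eN/i) \leq 2^s$ (and uses 0 if no such $i$ exists), and the names `peaky_cutoff` and `split_peaky_regular` are hypothetical. It only performs the splitting of a given vector; it does not verify the norm estimates.

```python
import math
from typing import List, Sequence, Tuple


def peaky_cutoff(N: int, s: int) -> int:
    """Largest i in {1, ..., N} with i * log(e * N / i) <= 2^s, or 0 if there is none."""
    best = 0
    for i in range(1, N + 1):
        if i * math.log(math.e * N / i) <= 2 ** s:
            best = i
    return best


def split_peaky_regular(w: Sequence[float], s: int) -> Tuple[List[int], List[float], List[float]]:
    """Split w into P_I w (the largest coordinates in absolute value) and P_{I^c} w."""
    N = len(w)
    cutoff = peaky_cutoff(N, s)
    # Indices sorted by decreasing absolute value; the first `cutoff` of them form I.
    order = sorted(range(N), key=lambda idx: abs(w[idx]), reverse=True)
    I = sorted(order[:cutoff])
    in_I = set(I)
    peaky = [w[idx] if idx in in_I else 0.0 for idx in range(N)]
    regular = [0.0 if idx in in_I else w[idx] for idx in range(N)]
    return I, peaky, regular


w = [0.1, -3.0, 0.5, 2.0, 0.0, -0.2, 1.0, 0.3]
I, peaky, regular = split_peaky_regular(w, s=2)
print(I, peaky, regular)
```

With $N = 8$ and $s = 2$ only $i = 1$ satisfies the cutoff condition, so the peaky part consists of the single largest coordinate of $w$.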

Next, by (3), one can control the $\ell_2^N$ norm of all the increments $\Delta_s v$ for large values of $s$ (i.e., $2^s \geq N$), uniformly.

Finally, the local estimates imply a “global” property, namely that $V$ is contained in a Minkowski sum of $\ell_2^N$ and $\psi_2^N$, using the global parameters $\gamma_j$ and $d_V$ instead of the local parameters $2^{s/2}\|\ \|$ or $\|\ \|\sqrt{|I|}\log^{1/2}(eN/|I|)$ used in properties (2) or (3):

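Lemma 3.4 There exist absolute constants $c_1, c_2$ and $c_3$ for which the following holds. If $V$ satisfies property (4) and $\bar\gamma_j = \gamma_j + 2^{j/2}d_V$, then for every $v \in V$ there is some $I \subset \{1, ..., N\}$, of cardinality

\[ |I| \leq c_1\frac{\bar\gamma_j^2/d_V^2}{\log\left( c_2\sqrt{N}d_V/\bar\gamma_j \right)}, \]

for which

\[ \|P_I v\|_{\ell_2^N} \leq c_3\alpha\bar\gamma_j \quad \text{and} \quad \|P_{I^c}v\|_{\psi_2^N} \leq c_3\alpha d_V. \]

Proof. Let $k$ be the largest integer satisfying that $\bar\gamma_j \geq d_V\sqrt{k\log(eN/k)}$, and if $k \geq N$ set $k = N$. Hence,

\[ k \sim \min\left\{ N, \frac{\bar\gamma_j^2/d_V^2}{\log\left( e\sqrt{N}d_V/\bar\gamma_j \right)} \right\}. \]

For every $v$ let $I$ be the set of the largest $k$ coordinates of $(|v_i|)_{i=1}^N$. If $k < N$ then by property (4), $\|P_I v\|_{\ell_2^N} \lesssim \alpha\bar\gamma_j$, while for $i \geq k$, $v_i^* \lesssim \alpha d_V\sqrt{\log(eN/i)}$, and thus $\|P_{I^c}v\|_{\psi_2^N} \lesssim \alpha d_V$. If $k = N$ then $I^c = \emptyset$ and the claim follows from property (4).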

It should be noted that more complicated decompositions of $F$ are needed when more than a single norm on $F$ has to be used. This is the case, for example, when elements of $F$ do not exhibit a purely subgaussian behavior, which is the situation in [23, 25].

This notion of decomposition allows one to study the Bernoulli process indexed by $V^2 = \{(v_i^2)_{i=1}^N : v \in V\}$. The following theorem follows an almost identical path to the proofs of Theorem 3.2 and Corollary 3.3 in [25]. Since the modifications needed to adjust these proofs to our setup are minor and a complete proof is rather lengthy, we will not present it here.

Theorem 3.5 There exist absolute constants $c_1, c_2$ and $c_3$ for which the following holds. Assume that $V = P_\sigma F$ admits a $(j, \alpha, \gamma_j(V))$ decomposition and that $W = P_\sigma H$ admits a $(j, \beta, \gamma_j(W))$ decomposition, both with respect to the norm $\|\ \|$ on the classes $F$ and $H$. Then for every $r \geq c_1$, with probability at least $1 - 2\exp(-c_2 r^2 2^j)$,

\[ \sup_{v \in V,\, w \in W}\left| \sum_{i=1}^N \varepsilon_i v_i w_i \right| \leq c_3\alpha\beta r\left( \sqrt{N}\left( d_W\gamma_j(V) + d_V\gamma_j(V) + 2^{j/2}d_V d_W \right) + \gamma_j(V)\gamma_j(W) \right). \]

3.1 Decomposition of coordinate projections

The main results of this article depend on the ability to decompose coordinate projections in the sense of Definition 3.2. Here, we will show that a typical coordinate projection does admit a decomposition with respect to the norm $\|\cdot\|_{\psi_2}$, and for $1 \leq j \leq \log_2 N$,

\[ \gamma_j = \inf\sup_{f \in F}\sum_{s > j} 2^{s/2}\|\Delta_s f\|_{\psi_2} = \gamma_{2,j}(F, \psi_2). \]

Theorem 3.6 There exist absolute constants $c_1$ and $c_2$ for which the following holds. If $F \subset L_{\psi_2}$ then for every $u \geq c_1$ and every $1 \leq j \leq \log_2 N$, with probability at least $1 - 2\exp(-c_2 2^j u^2)$, $P_\sigma F$ admits a $(j, u, \gamma_j)$ decomposition with respect to the $\psi_2$ norm.

Proof. Let $(F_s)_{s \geq 0}$ be an almost optimal admissible sequence of $F$ with respect to the $\psi_2$ norm and put $V_s = P_\sigma F_s$.

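Recall that for every $f \in L_{\psi_2}$, $\|f^2\|_{\psi_1} = \|f\|_{\psi_2}^2$ and $\|f\|_{L_2} \leq c_0\|f\|_{\psi_2}$ for an absolute constant $c_0$. Hence, by Theorem 2.2, for every $I \subset \{1, ..., N\}$,

\[ \begin{aligned} Pr\left( \frac{1}{|I|}\sum_{i \in I} f^2(X_i) \geq (t^2 + c_0^2)\|f\|_{\psi_2}^2 \right) &\leq Pr\left( \left| \frac{1}{|I|}\sum_{i \in I} f^2(X_i) - \mathbb{E}f^2 \right| \geq t^2\|f\|_{\psi_2}^2 \right) \\ &\leq 2\exp(-c_1|I|\min\{t^4, t^2\}). \end{aligned} \tag{3.3} \]

If $u \geq 1$ and $t = u\sqrt{\log(eN/|I|)} \geq 1$, then with probability at least $1 - 2\exp(-c_1 u^2|I|\log(eN/|I|))$,

\[ \left( \sum_{i \in I} f^2(X_i) \right)^{1/2} \leq c_2 u\sqrt{|I|\log(eN/|I|)}\,\|f\|_{\psi_2}. \]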

For every $1 \leq k \leq N$, let $E_k$ be the collection of all subsets of $\{1, ..., N\}$ of cardinality $k$ and let $s_k$ be the first integer satisfying that

\[ 2^{2^s} \geq \exp(k\log(eN/k)) \geq |E_k|. \]

Assume that $k = 2^n$, let $s \in [s_{2^n}, s_{2^{n+1}}]$ and observe that this interval contains at most two integers. Since $|\Delta_s F|, |\cup_{\ell \leq k}E_\ell| \leq \exp(c_3 k\log(eN/k))$, then for every $u \geq c_4$, with probability at least $1 - 2\exp(-c_5 u^2 k\log(eN/k))$, for every $I \in \cup_{\ell \leq k}E_\ell$ and every $f \in F$,

\[ \left( \sum_{i \in I}(\Delta_s f)^2(X_i) \right)^{1/2} \lesssim u 2^{s/2}\|\Delta_s f\|_{\psi_2}. \tag{3.4} \]

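Using the same argument, if $m = 2^i$ for $i \geq n$, then with probability at least $1 - 2\exp(-c_6 u^2 m\log(eN/m))$, for every $I \in E_m$ and every $f \in F$,

\[ \left( \sum_{i \in I}(\Delta_s f)^2(X_i) \right)^{1/2} \lesssim u\|\Delta_s f\|_{\psi_2}\sqrt{m\log(eN/m)}. \tag{3.5} \]

Combining (3.4) and (3.5), it follows that with probability at least $1 - 2\exp(-c_7 u^2 k\log(eN/k))$, for every $f \in F$, every $s \in [s_{2^n}, s_{2^{n+1}}]$ and every $I \subset \{1, ..., N\}$,

\[ \left( \sum_{i \in I}(\Delta_s f)^2(X_i) \right)^{1/2} \lesssim u\left( 2^{s/2} + \sqrt{|I|\log(eN/|I|)} \right)\|\Delta_s f\|_{\psi_2}. \]

Let $k_0 = 2^n$, $k_1 = 2^{n+1}$, ..., $k_\ell = 2^{n+\ell}$, where $N/2 < k_\ell \leq N$. Applying (3.4) and (3.5) to $k_0, ..., k_\ell$, and since

\[ \sum_{i=0}^{\ell}\exp(-c_7 u^2 k_i\log(eN/k_i)) \leq \exp(-c_8 u^2 2^n\log(eN/2^n)), \]

then with probability at least $1 - 2\exp(-c_8 u^2 2^n\log(eN/2^n))$, for every $f \in F$, every $I \subset \{1, ..., N\}$ and every $2^s \geq 2^n\log(eN/2^n)$,

\[ \left( \sum_{i \in I}(\Delta_s f)^2(X_i) \right)^{1/2} \lesssim u\|\Delta_s f\|_{\psi_2}\left( 2^{s/2} + \sqrt{|I|\log(eN/|I|)} \right). \]

Letting $2^j \sim 2^n\log(eN/2^n)$ proves the “small $s$” component in Definition 3.2 for $\Delta_s v = P_\sigma\Delta_s f$ and $\alpha \sim u$. The same argument can be used to prove the “small $s$” component for $P_\sigma\pi_s f$ as well.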

To prove the “large $s$” part, observe that if $u \geq c_9$ and $2^s \geq N$, then by (3.3) for $t = u\sqrt{2^s/N}$ and since $|F_s| \leq 2^{2^s}$, then with probability at least $1 - 2\exp(-c_{10}u^2 2^s)$, for every $f \in F$,

\[ \left( \sum_{i=1}^N(\Delta_s f)^2(X_i) \right)^{1/2} \leq c_{11}u 2^{s/2}\|\Delta_s f\|_{\psi_2}. \tag{3.6} \]

Therefore, summing over $2^s \geq N$, with probability at least $1 - 2\exp(-c_{12}N u^2)$, (3.6) holds for every $f \in F$ and every such $s$, which concludes the proof.
