• Aucun résultat trouvé

5 Upper bounds on ERM

Dans le document 4 The geometry of the fixed point (Page 32-39)

X

i>p

((PJ(v−w))i)2

1/2

& p

kPJ(v−w)k`J 2

à p X

i=1

((v−w)i)2

!1/2

&(1/61/12) N√

pkf −hkL2 &kX

i∈J

gi(vi−wi)kLp, proving the first part by applying Lemma 4.16.

For the second part, let A be the (r, k, c)-skeleton of F ∩ρkD. Since N &kthen Pσ : (A, L2)→LN2 is a 3-isomorphism with probability greater than 12 exp(−c0N). Hence, for 0< θ <1,

γ2(Pσ(A ∩θD), LN2 )∼γ2(A ∩θD, L2), and since |A| ≤exp(k/c), then

γ2(A ∩θρkD, L2).p

log|A|θρk.cθρk k .L,cθγ2(F ∩ρkD, L2).

Therefore, by the first part, ifθis small enough andA0 ={f ∈ A:kfkL2 θρk} there isc7 =c7(L, c) for which, if|I| ≥(1−β)N, then

Eε sup

v∈PσA0

|X

I

εivi| ≥Eε sup

v∈PσA

|X

I

εivi| −c7·θγ2(F ∩ρkD, L2) N

&r,c,Lγ2(F ∩ρkD, L2) N .

Remark 4.19 The proof actually shows that one can have a better proba-bility estimate of 12 exp(−c0N) for c0 =c0(L, c, β), a fact we shall not use here.

5 Upper bounds on ERM

In this section we will prove the first of our two main results (Theorem A), and establish general exact and non-exact oracle inequalities in the subgaus-sian case.

LetF be a class of functions and denote by`the squared loss function.

Set`f =`(f(X), Y) = (f(X)−Y)2, put f = argminf∈FE`f and let `F = {`f :f ∈F} be the loss class associated withF (or alternatively, if Lf =

`f −`f, then LF = {Lf : f F} is the excess loss class associated with F). Given the data D= (Xi, Yi)Ni=1, the empirical minimization algorithm produces

fˆ= argminf∈FPN`f = argminf∈F 1 N

XN

i=1

`(f(Xi), Yi),

and the hope is that the error of the empirical minimizer E(`fˆ|D) will be close to minf∈FE`f (assuming, as we do, that the minimum exists).

Recall that using the isomorphic method for H =`F, one ensures that for some 0< ε <1, with high probability, (1−ε)Eh≤PNh≤(1 +ε)Ehfor everyh in the set {h∈`F :Eh≥λN}. On that event, for every h∈`F,

PNh

1 +ε−λN Eh PNh 1−ε+λN,

and therefore, since`F consists of nonnegative functions, then E(`fˆ|D)≤ 1

1−ε·PN`fˆ+λN.

Since ˆf is an empirical minimizer, thenPN`fˆ≤PN`f (1 +ε)(E`f+λN) and thus

E(`fˆ|D)≤ 1 + 2ε 1−ε E`f+

µ

1 +1 + 2ε 1−ε

λN, (5.1)

which is a non-exact oracle inequality with a constant that tends to 1 asε tends to zero (i.e., as the “isomorphic” condition approaches an “isometry”).

A similar argument applied to the excess loss class shows that since PNLfˆ0, then E(Lfˆ|D) ¯λN (where ¯λN is defined relative to the excess loss class). Hence,

E(`fˆ|D)≤E`f+ ¯λN, (5.2) which is an exact oracle inequality.

Having this in mind, it is clear that to obtain exact and non-exact oracle inequalities it is enough to identify the “isomorphic profiles” ofF – the levels λN for `F and LF, at which an isomorphic estimate on the corresponding loss or excess loss classes is possible.

We will show that in a subgaussian situation, this parameter is deter-mined by the fixed pointsρk of the corresponding gaussian process indexed byF.

Definition 5.1 Let H be a class of functions. For every integer N and 0< η, δ <1, let

λN(H, η, δ) = inf (

λ >0 :P r Ã

sup

Eh≥λ

|PN(h/Eh)1| ≤η

!

1−δ )

. (5.3) Clearly, ifEh≥λN then with probability 1−δ

1

1 +ηPNh≤Eh 1

1−ηPNh,

which is the desired isomorphic condition. Therefore, high probability bounds on the ratios

sup

Eh≥λ

|PN(h/Eh)1|

forH=`F orH =LF lead to the isomorphic profile of F.

Here, we will assume that the L2(µ) andψ2(µ) norms are L-equivalent on the closed span ofF and that the unknown target is bounded inLψ2 (e.g.

Y ∈BLψ

2). Moreover, since our goal is to study ERM when the structure of the base class F is not an obstacle for obtaining an exact oracle inequality with fast rates, we will assume further that F is a compact, convex set.

Thus,f = argminf∈FE(f(X)−Y)2 exists and is unique.

Let ψf be either `f orLf and set ψF =f :f ∈F}. For everyλ >0, put

Fψ,λ+ ={f ∈F :Eψf ≥λ} and Fλ=

³

2F∩c0 λD

´ , wherec0 is a constant that will be named later.

Our first observation is that the both`f andLf can be decomposed into two parts: a quadratic term, that has a very weak dependence on the target Y, and an affine or linear one, which will turn out to be the dominating term in the ratio estimate.

Setξ =f(X)−Y and we will assume thatξ 6= 0. Ifξ= 0 the treatment is much simpler and will not be presented here.

For everyf ∈F, putQf =f2(X) andUf =ξf(X). Note that for every f ∈F,Lf =Qf−f+ 2Uf−f and `f =Qf−f+ 2Uf−f+ξ2. Indeed,

Lf(X, Y) = (f−f)(X) ((f+f)(X)2Y)

= (f−f)(X) ((f−f)(X) + 2(f(X)−Y))

= (f−f)(X)2+ 2(f−f)(X) (f(X)−Y) =Qf−f+ 2Uf−f, (5.4)

and thus To obtain the ratio estimate for the excess loss class LF, one has to establish high probability bounds on

sup In a similar fashion, for the loss `one has to bound

sup Lemma 5.2 There exists an absolute constant c1 for which the following holds. Letψ be either`ofLand assume that there is a constantB >0such

Moreover, WL,λ c1

λ

³

2F∩√ λBD

´

and W`,λ c1 kξkL2

λ

³

2F ∩√ λBD

´ . Remark 5.3 The assumption that for everyf ∈F,kf−fk2L2 ≤BEψf, is the “significant” part of the so-called Bernstein condition [3, 21, 20]. This condition is satisfied in a wide variety of cases and for many loss functionals.

For example, when F is a compact, convex set, then the squared loss and excess squared loss classes satisfy it withB= 1 because of the characteriza-tion of the nearest point map onto such sets in a Hilbert space. We refer the reader to [21] for more details on the Bernstein condition in more general situations.

Proof of Lemma 5.2. Clearly, k(f −f)/(Eψf)1/2kL2

B, and thus (f−f)/(Eψf)1/2 ∈√

BD. Since F is convex and symmetric and Eψf ≥λ then

f−f

(Eψf)1/2 = f −f

√λ ·

√λ

(Eψf)1/2 2F/ λ, and the first part of the claim follows.

Ifψ=Lthen 1/(ELf)1/21/

λand thus WL,λ c1

λ

³

2F∩√ λBD

´ .

Finally, sinceE`f ≥ kf−fk2L2+kξk2L2, then 1/(E`f)1/2.1/(kξkL2+kf− fkL2) and thus

W`,λ c1 kξkL2

λ

³

2F ∩√ λBD

´ .

Let us reformulate Theorem A:

Theorem 5.4 For every 0 < η <1 and L > 1, there are constants c0, c1, c2, c3 and c4 that depend on L, η,kξkψ2 and kξkL2 for which the following holds. For every L-subgaussian class F, Y BLψ2 and N k(F), let k1 = max{k k cN : ρk c1p

k/N}, put δ1 = 2 exp(−c2N), and set δ2 = 2 exp(−c2k1). Then,

λN(`F, η, δ1)≤c3 µ

ρ2c4N+ 1 N

and

λN(LF, η, δ2)≤c3ρ2k1 Therefore, with probability at least12 exp(−c2N),

E`fˆ(1 + 2η)E`f+c3

Observe that if the sequence is even remotely regular, than ρ2k1 min

k≤k≤c0N2k+k/N).

Also, note that the sequence ρk/√

k decreases with k, and thus k1 grows withN – as should be expected.

The ratio estimate leading to the oracle inequality follows from Theorem 5.5 below. To formulate it, setA= (1/

Proof. In light of Lemma 5.2, the three processes one has to study are

λ)Fλ, all of the above may be bounded using Theorem 3.1 and Corollary 3.9. Also, |N−1PN

i=1ξ2(Xi)2| can be estimated by Theorem 2.2, proving the desired result.

Let us estimate the parameters appearing in Theorem 5.5. Since µ is L-subgaussian, then dA2) LdA(L2), and since λ−1/2Fλ c1D, then

λ)/λ≤N then with probability at least 12 exp(−c5(L)u2(H2(

and to ensure that supELf≥λ

ifγ ≥c2(L,kξkψ2, η). Finally, the probability estimate clearly holds because H(γρk1/(γρk1)&γ

Hence, it suffices to ensure thatH(√ λ)/√

N√

λ.L0 η, which concludes the proof by takingλ∼ρ2c4N + 1/N.

Dans le document 4 The geometry of the fixed point (Page 32-39)

Documents relatifs