\[
\sqrt{p}\,\Bigl(\sum_{i>p}\bigl((P_J(v-w))_i^*\bigr)^2\Bigr)^{1/2}
\gtrsim \sqrt{p}\,\Bigl(\|P_J(v-w)\|_{\ell_2^J}-\Bigl(\sum_{i=1}^{p}\bigl((v-w)_i^*\bigr)^2\Bigr)^{1/2}\Bigr)
\gtrsim \Bigl(\tfrac{1}{6}-\tfrac{1}{12}\Bigr)\sqrt{N}\sqrt{p}\,\|f-h\|_{L_2}
\gtrsim \Bigl\|\sum_{i\in J} g_i(v_i-w_i)\Bigr\|_{L_p},
\]
proving the first part by applying Lemma 4.16.
For the second part, let $A$ be the $(r,k,c)$-skeleton of $F\cap\rho_kD$. Since $N\gtrsim k$, $P_\sigma\colon (A,L_2)\to L_2^N$ is a $3$-isomorphism with probability greater than $1-2\exp(-c_0N)$. Hence, for $0<\theta<1$, $\gamma_2(P_\sigma(A\cap\theta D),L_2^N)\sim\gamma_2(A\cap\theta D,L_2)$, and since $|A|\le\exp(k/c)$,
\[
\gamma_2(A\cap\theta\rho_kD,L_2)\lesssim \sqrt{\log|A|}\,\theta\rho_k \lesssim_{c}\theta\rho_k\sqrt{k}\lesssim_{L,c}\theta\,\gamma_2(F\cap\rho_kD,L_2).
\]
Therefore, by the first part, if $\theta$ is small enough and $A_0=\{f\in A:\|f\|_{L_2}\ge\theta\rho_k\}$, there is $c_7=c_7(L,c)$ for which, if $|I|\ge(1-\beta)N$, then
\[
\mathbb{E}_\varepsilon \sup_{v\in P_\sigma A_0}\Bigl|\sum_{I}\varepsilon_iv_i\Bigr| \ \ge\ \mathbb{E}_\varepsilon \sup_{v\in P_\sigma A}\Bigl|\sum_{I}\varepsilon_iv_i\Bigr| - c_7\,\theta\,\gamma_2(F\cap\rho_kD,L_2)\sqrt{N}\ \gtrsim_{r,c,L}\ \gamma_2(F\cap\rho_kD,L_2)\sqrt{N}.
\]
Remark 4.19 The proof actually shows that one can have a better probability estimate of $1-2\exp(-c_0N)$ for $c_0=c_0(L,c,\beta)$, a fact we shall not use here.
5 Upper bounds on ERM
In this section we will prove the first of our two main results (Theorem A), and establish general exact and non-exact oracle inequalities in the subgaussian case.
Let $F$ be a class of functions and denote by $\ell$ the squared loss function. Set $\ell_f=\ell(f(X),Y)=(f(X)-Y)^2$, put $f^*=\operatorname{argmin}_{f\in F}\mathbb{E}\ell_f$ and let $\ell_F=\{\ell_f : f\in F\}$ be the loss class associated with $F$ (alternatively, if $\mathcal{L}_f=\ell_f-\ell_{f^*}$, then $\mathcal{L}_F=\{\mathcal{L}_f : f\in F\}$ is the excess loss class associated with $F$). Given the data $D=(X_i,Y_i)_{i=1}^N$, the empirical minimization algorithm produces
\[
\hat f=\operatorname{argmin}_{f\in F}P_N\ell_f=\operatorname{argmin}_{f\in F}\frac{1}{N}\sum_{i=1}^N \ell(f(X_i),Y_i),
\]
and the hope is that the error of the empirical minimizer $\mathbb{E}(\ell_{\hat f}\,|\,D)$ will be close to $\min_{f\in F}\mathbb{E}\ell_f$ (assuming, as we do, that the minimum exists).
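As a toy illustration of this procedure (the finite class, the regression function and the noise level below are arbitrary choices, not part of the argument), empirical minimization with the squared loss can be sketched as:

```python
import numpy as np

# A minimal sketch of empirical risk minimization with the squared loss.
# The base class F, the target, and the noise level are illustrative
# choices only.

rng = np.random.default_rng(0)

# Base class F: a grid of linear functions x -> a * x.
F = [lambda x, a=a: a * x for a in np.linspace(-2.0, 2.0, 81)]

def empirical_minimizer(F, X, Y):
    """Return argmin over f in F of P_N l_f, with l_f = (f(X) - Y)^2."""
    risks = [np.mean((f(X) - Y) ** 2) for f in F]
    return F[int(np.argmin(risks))]

# Data D = (X_i, Y_i), i = 1..N, with Y = 0.7 X + noise.
N = 1000
X = rng.normal(size=N)
Y = 0.7 * X + 0.1 * rng.normal(size=N)

f_hat = empirical_minimizer(F, X, Y)
print(f_hat(1.0))  # slope of the empirical minimizer, close to 0.7
```

With this sample size the empirical minimizer lands on (or next to) the grid point closest to the true slope, in line with the hope expressed above.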
Recall that using the isomorphic method for $H=\ell_F$, one ensures that for some $0<\varepsilon<1$, with high probability, $(1-\varepsilon)\mathbb{E}h\le P_Nh\le(1+\varepsilon)\mathbb{E}h$ for every $h$ in the set $\{h\in\ell_F:\mathbb{E}h\ge\lambda_N\}$. On that event, for every $h\in\ell_F$,
\[
\frac{P_Nh}{1+\varepsilon}-\lambda_N\ \le\ \mathbb{E}h\ \le\ \frac{P_Nh}{1-\varepsilon}+\lambda_N,
\]
and therefore, since $\ell_F$ consists of nonnegative functions,
\[
\mathbb{E}(\ell_{\hat f}\,|\,D)\le \frac{1}{1-\varepsilon}\,P_N\ell_{\hat f}+\lambda_N.
\]
Since $\hat f$ is an empirical minimizer, $P_N\ell_{\hat f}\le P_N\ell_{f^*}\le(1+\varepsilon)(\mathbb{E}\ell_{f^*}+\lambda_N)$ and thus
\[
\mathbb{E}(\ell_{\hat f}\,|\,D)\le \frac{1+2\varepsilon}{1-\varepsilon}\,\mathbb{E}\ell_{f^*}+\Bigl(1+\frac{1+2\varepsilon}{1-\varepsilon}\Bigr)\lambda_N, \tag{5.1}
\]
which is a non-exact oracle inequality with a constant that tends to $1$ as $\varepsilon$ tends to zero (i.e., as the “isomorphic” condition approaches an “isometry”).
A similar argument applied to the excess loss class shows that since $P_N\mathcal{L}_{\hat f}\le 0$, then $\mathbb{E}(\mathcal{L}_{\hat f}\,|\,D)\le\bar\lambda_N$ (where $\bar\lambda_N$ is defined relative to the excess loss class). Hence,
\[
\mathbb{E}(\ell_{\hat f}\,|\,D)\le \mathbb{E}\ell_{f^*}+\bar\lambda_N, \tag{5.2}
\]
which is an exact oracle inequality.
Having this in mind, it is clear that to obtain exact and non-exact oracle inequalities it is enough to identify the “isomorphic profiles” of $F$ – the levels $\lambda_N$ for $\ell_F$ and $\mathcal{L}_F$ at which an isomorphic estimate on the corresponding loss or excess loss classes is possible.
We will show that in a subgaussian situation, this parameter is determined by the fixed points $\rho_k$ of the corresponding gaussian process indexed by $F$.
Definition 5.1 Let $H$ be a class of functions. For every integer $N$ and $0<\eta,\delta<1$, let
\[
\lambda_N(H,\eta,\delta)=\inf\Bigl\{\lambda>0 : Pr\Bigl(\sup_{\{h\in H:\mathbb{E}h\ge\lambda\}}\bigl|P_N(h/\mathbb{E}h)-1\bigr|\le\eta\Bigr)\ge 1-\delta\Bigr\}. \tag{5.3}
\]
Clearly, if $\mathbb{E}h\ge\lambda_N$ then with probability $1-\delta$,
\[
\frac{1}{1+\eta}\,P_Nh\ \le\ \mathbb{E}h\ \le\ \frac{1}{1-\eta}\,P_Nh,
\]
which is the desired isomorphic condition. Therefore, high probability bounds on the ratios
\[
\sup_{\{h\in H:\mathbb{E}h\ge\lambda\}}\bigl|P_N(h/\mathbb{E}h)-1\bigr|
\]
for $H=\ell_F$ or $H=\mathcal{L}_F$ lead to the isomorphic profile of $F$.
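For intuition, the ratio in (5.3) can be probed numerically. In the sketch below, the class $H$ (taken as $h_a(x)=(x-a)^2$ with $X\sim N(0,1)$, so $\mathbb{E}h_a=1+a^2$), the grid of parameters and the sample size are all illustrative choices, not from the text:

```python
import numpy as np

# A toy numerical probe of the ratio sup_{Eh >= lam} |P_N(h/Eh) - 1|
# from Definition 5.1. The class h_a(x) = (x - a)^2, with E h_a = 1 + a^2
# for X ~ N(0,1), is an illustrative choice only.

rng = np.random.default_rng(1)
alphas = np.linspace(0.0, 2.0, 41)
X = rng.normal(size=500)

def ratio_sup(lam):
    """sup over {h in H : Eh >= lam} of |P_N(h / Eh) - 1|."""
    ratios = [abs(np.mean((X - a) ** 2) / (1 + a ** 2) - 1.0)
              for a in alphas if 1 + a ** 2 >= lam]
    return max(ratios) if ratios else 0.0

# Raising lambda shrinks the index set {h : Eh >= lambda}, so the
# supremum can only decrease:
print(ratio_sup(1.0), ratio_sup(4.0))
```

Since the set $\{h:\mathbb{E}h\ge\lambda\}$ shrinks as $\lambda$ grows, the supremum is monotone in $\lambda$, which is what makes the infimum in (5.3) meaningful.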
Here, we will assume that the $L_2(\mu)$ and $\psi_2(\mu)$ norms are $L$-equivalent on the closed span of $F$ and that the unknown target is bounded in $L_{\psi_2}$ (e.g. $Y\in B_{L_{\psi_2}}$). Moreover, since our goal is to study ERM when the structure of the base class $F$ is not an obstacle to obtaining an exact oracle inequality with fast rates, we will further assume that $F$ is a compact, convex set.
Thus, $f^*=\operatorname{argmin}_{f\in F}\mathbb{E}(f(X)-Y)^2$ exists and is unique.
Let $\psi_f$ be either $\ell_f$ or $\mathcal{L}_f$ and set $\psi_F=\{\psi_f : f\in F\}$. For every $\lambda>0$, put
\[
F^+_{\psi,\lambda}=\{f\in F : \mathbb{E}\psi_f\ge\lambda\} \quad\text{and}\quad F_\lambda=\bigl(2F\cap c_0\sqrt{\lambda}\,D\bigr),
\]
where $c_0$ is a constant that will be specified later.
Our first observation is that both $\ell_f$ and $\mathcal{L}_f$ can be decomposed into two parts: a quadratic term, which has a very weak dependence on the target $Y$, and an affine or linear one, which will turn out to be the dominant term in the ratio estimate.
Set $\xi=f^*(X)-Y$; we will assume that $\xi\neq 0$. If $\xi=0$ the treatment is much simpler and will not be presented here.
For every $f\in F$, put $Q_f=f^2(X)$ and $U_f=\xi f(X)$. Note that for every $f\in F$, $\mathcal{L}_f=Q_{f-f^*}+2U_{f-f^*}$ and $\ell_f=Q_{f-f^*}+2U_{f-f^*}+\xi^2$. Indeed,
\[
\begin{aligned}
\mathcal{L}_f(X,Y) &= (f-f^*)(X)\,\bigl((f+f^*)(X)-2Y\bigr)\\
&= (f-f^*)(X)\,\bigl((f-f^*)(X)+2(f^*(X)-Y)\bigr)\\
&= (f-f^*)^2(X)+2(f-f^*)(X)\,\bigl(f^*(X)-Y\bigr)=Q_{f-f^*}+2U_{f-f^*},
\end{aligned} \tag{5.4}
\]
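The identity (5.4) is pure algebra and can be checked pointwise on data; the particular functions $f$, $f^*$ and the sample below are arbitrary choices used only for the check:

```python
import numpy as np

# A pointwise check of the decomposition (5.4):
#   L_f = l_f - l_{f*} = Q_{f-f*} + 2 U_{f-f*},
# with Q_g = g^2(X), U_g = xi * g(X) and xi = f*(X) - Y.
# The functions f, f* and the sample are arbitrary illustrative choices.

rng = np.random.default_rng(2)
X = rng.normal(size=10)
Y = 0.5 * X + rng.normal(size=10)

f      = lambda x: 1.3 * x        # an arbitrary class member
f_star = lambda x: 0.5 * x        # playing the role of the minimizer

xi = f_star(X) - Y
L_f = (f(X) - Y) ** 2 - (f_star(X) - Y) ** 2   # excess loss
Q   = (f(X) - f_star(X)) ** 2                  # quadratic part
U   = xi * (f(X) - f_star(X))                  # multiplier part

assert np.allclose(L_f, Q + 2 * U)             # identity (5.4)
print("decomposition holds pointwise")
```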
and thus $\ell_f=\mathcal{L}_f+\xi^2=Q_{f-f^*}+2U_{f-f^*}+\xi^2$.
To obtain the ratio estimate for the excess loss class $\mathcal{L}_F$, one has to establish high probability bounds on
\[
\sup_{f\in F^+_{\mathcal{L},\lambda}}\bigl|P_N(\mathcal{L}_f/\mathbb{E}\mathcal{L}_f)-1\bigr|.
\]
In a similar fashion, for the loss $\ell$ one has to bound
\[
\sup_{f\in F^+_{\ell,\lambda}}\bigl|P_N(\ell_f/\mathbb{E}\ell_f)-1\bigr|.
\]
For every $\lambda>0$, set $W_{\psi,\lambda}=\{(f-f^*)/\mathbb{E}\psi_f : f\in F^+_{\psi,\lambda}\}$.
Lemma 5.2 There exists an absolute constant $c_1$ for which the following holds. Let $\psi$ be either $\ell$ or $\mathcal{L}$ and assume that there is a constant $B>0$ such that for every $f\in F$, $\|f-f^*\|_{L_2}^2\le B\,\mathbb{E}\psi_f$. Then
\[
\bigl\{(f-f^*)/(\mathbb{E}\psi_f)^{1/2} : f\in F^+_{\psi,\lambda}\bigr\}\subset \frac{1}{\sqrt{\lambda}}\bigl(2F\cap\sqrt{\lambda B}\,D\bigr).
\]
Moreover,
\[
W_{\mathcal{L},\lambda}\subset \frac{c_1}{\lambda}\bigl(2F\cap\sqrt{\lambda B}\,D\bigr) \quad\text{and}\quad W_{\ell,\lambda}\subset \frac{c_1}{\|\xi\|_{L_2}\sqrt{\lambda}}\bigl(2F\cap\sqrt{\lambda B}\,D\bigr).
\]
Remark 5.3 The assumption that for every $f\in F$, $\|f-f^*\|_{L_2}^2\le B\,\mathbb{E}\psi_f$, is the “significant” part of the so-called Bernstein condition [3, 21, 20]. This condition is satisfied in a wide variety of cases and for many loss functionals.
For example, when $F$ is a compact, convex set, the squared loss and excess squared loss classes satisfy it with $B=1$, because of the characterization of the nearest point map onto such sets in a Hilbert space. We refer the reader to [21] for more details on the Bernstein condition in more general situations.
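The case $B=1$ of the remark can be made concrete on a toy convex class. In the sketch below, the class $F=\{x\mapsto ax : a\in[-1,1]\}$, the target $Y=2X$ (so the minimizer sits on the boundary, $f^*(x)=x$), and the closed-form population quantities are all illustrative choices, not from the text:

```python
import numpy as np

# A sanity check of the Bernstein condition ||f - f*||_{L2}^2 <= B * E L_f
# with B = 1 on a convex class. Toy setup: F = {x -> a x : a in [-1, 1]},
# X ~ N(0,1), Y = 2 X, so the L2-projection of Y onto F is f*(x) = x.

E_X2 = 1.0                                  # E X^2 for X ~ N(0,1)
for a in np.linspace(-1.0, 1.0, 21):
    dist2  = (a - 1.0) ** 2 * E_X2          # ||f_a - f*||_{L2}^2
    excess = ((a - 2.0) ** 2 - 1.0) * E_X2  # E L_{f_a} = E l_{f_a} - E l_{f*}
    assert dist2 <= excess + 1e-12          # Bernstein condition with B = 1
print("Bernstein condition with B = 1 verified on the grid")
```

Here $\mathbb{E}\mathcal{L}_{f_a}-\|f_a-f^*\|_{L_2}^2=2(1-a)\ge 0$ on $[-1,1]$: the cross term is nonnegative exactly because of the characterization of the metric projection onto a convex set.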
Proof of Lemma 5.2. Clearly, $\|(f-f^*)/(\mathbb{E}\psi_f)^{1/2}\|_{L_2}\le\sqrt{B}$, and thus $(f-f^*)/(\mathbb{E}\psi_f)^{1/2}\in\sqrt{B}\,D$. Since $F$ is convex and symmetric and $\mathbb{E}\psi_f\ge\lambda$,
\[
\frac{f-f^*}{(\mathbb{E}\psi_f)^{1/2}}=\frac{f-f^*}{\sqrt{\lambda}}\cdot\frac{\sqrt{\lambda}}{(\mathbb{E}\psi_f)^{1/2}}\in 2F/\sqrt{\lambda},
\]
and the first part of the claim follows.
If $\psi=\mathcal{L}$ then $1/(\mathbb{E}\mathcal{L}_f)^{1/2}\le 1/\sqrt{\lambda}$, and thus
\[
W_{\mathcal{L},\lambda}\subset \frac{c_1}{\lambda}\bigl(2F\cap\sqrt{\lambda B}\,D\bigr).
\]
Finally, since $\mathbb{E}\ell_f\ge\|f-f^*\|_{L_2}^2+\|\xi\|_{L_2}^2$, then $1/(\mathbb{E}\ell_f)^{1/2}\lesssim 1/(\|\xi\|_{L_2}+\|f-f^*\|_{L_2})$, and thus
\[
W_{\ell,\lambda}\subset \frac{c_1}{\|\xi\|_{L_2}\sqrt{\lambda}}\bigl(2F\cap\sqrt{\lambda B}\,D\bigr).
\]
Let us reformulate Theorem A:
Theorem 5.4 For every $0<\eta<1$ and $L>1$, there are constants $c_0,c_1,c_2,c_3$ and $c_4$ that depend on $L$, $\eta$, $\|\xi\|_{\psi_2}$ and $\|\xi\|_{L_2}$ for which the following holds. For every $L$-subgaussian class $F$, $Y\in B_{L_{\psi_2}}$ and $N\ge k^*(F)$, let $k_1=\max\{k^*\le k\le c_0N : \rho_k\ge c_1\sqrt{k/N}\}$, put $\delta_1=2\exp(-c_2N)$, and set $\delta_2=2\exp(-c_2k_1)$. Then
\[
\lambda_N(\ell_F,\eta,\delta_1)\le c_3\Bigl(\rho^2_{c_4N}+\frac{1}{N}\Bigr) \quad\text{and}\quad \lambda_N(\mathcal{L}_F,\eta,\delta_2)\le c_3\rho^2_{k_1}.
\]
Therefore, with probability at least $1-2\exp(-c_2N)$,
\[
\mathbb{E}\ell_{\hat f}\le(1+2\eta)\,\mathbb{E}\ell_{f^*}+c_3\Bigl(\rho^2_{c_4N}+\frac{1}{N}\Bigr).
\]
Observe that if the sequence is even remotely regular, then
\[
\rho^2_{k_1}\sim \min_{k^*\le k\le c_0N}\bigl(\rho^2_k+k/N\bigr).
\]
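This comparison is easy to see numerically for a concrete regular sequence. In the sketch below, the sequence $\rho_k=k^{-1/4}$ and the normalization $c_0=c_1=1$ are hypothetical choices made only for illustration:

```python
import numpy as np

# Toy illustration: for a "regular" sequence, rho_{k1}^2 is comparable
# to min_k (rho_k^2 + k/N). The sequence rho_k = k^{-1/4} and the
# constants c0 = c1 = 1 are hypothetical illustrative choices.

N = 10 ** 6
k = np.arange(1, N + 1)
rho = k ** (-0.25)                       # a hypothetical regular sequence

k1 = k[rho >= np.sqrt(k / N)].max()      # k1 = max{k <= N : rho_k >= sqrt(k/N)}
lhs = rho[k1 - 1] ** 2                   # rho_{k1}^2
rhs = (rho ** 2 + k / N).min()           # min_k (rho_k^2 + k/N)

print(lhs / rhs)                         # a bounded constant
```

For this sequence $k_1\approx N^{2/3}$ and both sides are of order $N^{-1/3}$, so the ratio stays bounded as $N$ grows.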
Also, note that the sequence $\rho_k/\sqrt{k}$ decreases with $k$, and thus $k_1$ grows with $N$ – as should be expected.
The ratio estimate leading to the oracle inequality follows from Theorem 5.5 below. To formulate it, set $A=(1/\sqrt{\lambda})F_\lambda$.
Proof. In light of Lemma 5.2, the three processes one has to study are indexed by multiples of the set $A=(1/\sqrt{\lambda})F_\lambda$; all of the above may be bounded using Theorem 3.1 and Corollary 3.9. Also, $|N^{-1}\sum_{i=1}^N\xi^2(X_i)-\mathbb{E}\xi^2|$ can be estimated by Theorem 2.2, proving the desired result.
Let us estimate the parameters appearing in Theorem 5.5. Since $\mu$ is $L$-subgaussian, $d_A(\psi_2)\le L\,d_A(L_2)$, and since $\lambda^{-1/2}F_\lambda\subset c_1D$, then, provided $H^2(\sqrt{\lambda})/\lambda\le N$, the bound holds with probability at least $1-2\exp\bigl(-c_5(L)u^2H^2(\sqrt{\lambda})/\lambda\bigr)$; and to ensure that $\sup_{\mathbb{E}\mathcal{L}_f\ge\lambda}\bigl|P_N(\mathcal{L}_f/\mathbb{E}\mathcal{L}_f)-1\bigr|\le\eta$, it suffices that $\gamma\ge c_2(L,\|\xi\|_{\psi_2},\eta)$. Finally, the probability estimate clearly holds because $H(\gamma\rho_{k_1})/(\gamma\rho_{k_1})\gtrsim\gamma\sqrt{k_1}$.
Hence, it suffices to ensure that $H(\sqrt{\lambda})/(\sqrt{N}\sqrt{\lambda})\lesssim_{L}\eta$, which concludes the proof by taking $\lambda\sim\rho^2_{c_4N}+1/N$.