\[
\sqrt{p}\,\Bigl(\sum_{i>p}\bigl((P_J(v-w))_i^*\bigr)^2\Bigr)^{1/2}
\gtrsim \sqrt{p}\,\Bigl(\|P_J(v-w)\|_{\ell_2^J}-\Bigl(\sum_{i=1}^{p}\bigl((v-w)_i^*\bigr)^2\Bigr)^{1/2}\Bigr)
\gtrsim \Bigl(\tfrac{1}{6}-\tfrac{1}{12}\Bigr)\sqrt{N}\sqrt{p}\,\|f-h\|_{L_2}
\gtrsim \Bigl\|\sum_{i\in J} g_i(v_i-w_i)\Bigr\|_{L_p},
\]
proving the first part by applying Lemma 4.16.
For the second part, let $A$ be the $(r,k,c)$-skeleton of $F\cap\rho_kD$. Since $N\gtrsim k$, $P_\sigma\colon (A,L_2)\to L_2^N$ is a $3$-isomorphism with probability greater than $1-2\exp(-c_0N)$. Hence, for $0<\theta<1$, $\gamma_2(P_\sigma(A\cap\theta D),L_2^N)\sim\gamma_2(A\cap\theta D,L_2)$, and since $|A|\le\exp(k/c)$,
\[
\gamma_2(A\cap\theta\rho_kD,L_2)\lesssim \sqrt{\log|A|}\,\theta\rho_k \lesssim_{c}\theta\rho_k\sqrt{k}\lesssim_{L,c}\theta\,\gamma_2(F\cap\rho_kD,L_2).
\]
Therefore, by the first part, if $\theta$ is small enough and $A_0=\{f\in A:\|f\|_{L_2}\ge\theta\rho_k\}$, there is $c_7=c_7(L,c)$ for which, if $|I|\ge(1-\beta)N$, then
\[
\mathbb{E}_\varepsilon \sup_{v\in P_\sigma A_0}\Bigl|\sum_{I}\varepsilon_iv_i\Bigr| \ \ge\ \mathbb{E}_\varepsilon \sup_{v\in P_\sigma A}\Bigl|\sum_{I}\varepsilon_iv_i\Bigr| - c_7\,\theta\,\gamma_2(F\cap\rho_kD,L_2)\sqrt{N}\ \gtrsim_{r,c,L}\ \gamma_2(F\cap\rho_kD,L_2)\sqrt{N}.
\]
Remark 4.19 The proof actually shows that one can have a better probability estimate of $1-2\exp(-c_0N)$ for $c_0=c_0(L,c,\beta)$, a fact we shall not use here.
5 Upper bounds on ERM
In this section we will prove the first of our two main results (Theorem A), and establish general exact and non-exact oracle inequalities in the subgaussian case.
Let $F$ be a class of functions and denote by $\ell$ the squared loss function. Set $\ell_f=\ell(f(X),Y)=(f(X)-Y)^2$, put $f^*=\operatorname{argmin}_{f\in F}\mathbb{E}\ell_f$ and let $\ell_F=\{\ell_f : f\in F\}$ be the loss class associated with $F$ (alternatively, if $\mathcal{L}_f=\ell_f-\ell_{f^*}$, then $\mathcal{L}_F=\{\mathcal{L}_f : f\in F\}$ is the excess loss class associated with $F$). Given the data $D=(X_i,Y_i)_{i=1}^N$, the empirical minimization algorithm produces
\[
\hat f=\operatorname{argmin}_{f\in F}P_N\ell_f=\operatorname{argmin}_{f\in F}\frac{1}{N}\sum_{i=1}^N \ell(f(X_i),Y_i),
\]
and the hope is that the error of the empirical minimizer $\mathbb{E}(\ell_{\hat f}\,|\,D)$ will be close to $\min_{f\in F}\mathbb{E}\ell_f$ (assuming, as we do, that the minimum exists).
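As a toy illustration of this procedure (the finite class, the regression function and the noise level below are arbitrary choices, not part of the argument), empirical minimization with the squared loss can be sketched as:

```python
import numpy as np

# A minimal sketch of empirical risk minimization with the squared loss.
# The base class F, the target, and the noise level are illustrative
# choices only.

rng = np.random.default_rng(0)

# Base class F: a grid of linear functions x -> a * x.
F = [lambda x, a=a: a * x for a in np.linspace(-2.0, 2.0, 81)]

def empirical_minimizer(F, X, Y):
    """Return argmin over f in F of P_N l_f, with l_f = (f(X) - Y)^2."""
    risks = [np.mean((f(X) - Y) ** 2) for f in F]
    return F[int(np.argmin(risks))]

# Data D = (X_i, Y_i), i = 1..N, with Y = 0.7 X + noise.
N = 1000
X = rng.normal(size=N)
Y = 0.7 * X + 0.1 * rng.normal(size=N)

f_hat = empirical_minimizer(F, X, Y)
print(f_hat(1.0))  # slope of the empirical minimizer, close to 0.7
```

With this sample size the empirical minimizer lands on (or next to) the grid point closest to the true slope, in line with the hope expressed above.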
Recall that using the isomorphic method for $H=\ell_F$, one ensures that for some $0<\varepsilon<1$, with high probability, $(1-\varepsilon)\mathbb{E}h\le P_Nh\le(1+\varepsilon)\mathbb{E}h$ for every $h$ in the set $\{h\in\ell_F:\mathbb{E}h\ge\lambda_N\}$. On that event, for every $h\in\ell_F$,
\[
\frac{P_Nh}{1+\varepsilon}-\lambda_N\ \le\ \mathbb{E}h\ \le\ \frac{P_Nh}{1-\varepsilon}+\lambda_N,
\]
and therefore, since $\ell_F$ consists of nonnegative functions,
\[
\mathbb{E}(\ell_{\hat f}\,|\,D)\le \frac{1}{1-\varepsilon}\,P_N\ell_{\hat f}+\lambda_N.
\]
Since $\hat f$ is an empirical minimizer, $P_N\ell_{\hat f}\le P_N\ell_{f^*}\le(1+\varepsilon)(\mathbb{E}\ell_{f^*}+\lambda_N)$ and thus
\[
\mathbb{E}(\ell_{\hat f}\,|\,D)\le \frac{1+2\varepsilon}{1-\varepsilon}\,\mathbb{E}\ell_{f^*}+\Bigl(1+\frac{1+2\varepsilon}{1-\varepsilon}\Bigr)\lambda_N, \tag{5.1}
\]
which is a non-exact oracle inequality with a constant that tends to $1$ as $\varepsilon$ tends to zero (i.e., as the “isomorphic” condition approaches an “isometry”).
A similar argument applied to the excess loss class shows that since $P_N\mathcal{L}_{\hat f}\le 0$, then $\mathbb{E}(\mathcal{L}_{\hat f}\,|\,D)\le\bar\lambda_N$ (where $\bar\lambda_N$ is defined relative to the excess loss class). Hence,
\[
\mathbb{E}(\ell_{\hat f}\,|\,D)\le \mathbb{E}\ell_{f^*}+\bar\lambda_N, \tag{5.2}
\]
which is an exact oracle inequality.
Having this in mind, it is clear that to obtain exact and non-exact oracle inequalities it is enough to identify the “isomorphic profiles” of $F$ – the levels $\lambda_N$ for $\ell_F$ and $\mathcal{L}_F$ at which an isomorphic estimate on the corresponding loss or excess loss classes is possible.
We will show that in a subgaussian situation, this parameter is determined by the fixed points $\rho_k$ of the corresponding gaussian process indexed by $F$.
Definition 5.1 Let $H$ be a class of functions. For every integer $N$ and $0<\eta,\delta<1$, let
\[
\lambda_N(H,\eta,\delta)=\inf\Bigl\{\lambda>0 : Pr\Bigl(\sup_{\{h\in H:\mathbb{E}h\ge\lambda\}}\bigl|P_N(h/\mathbb{E}h)-1\bigr|\le\eta\Bigr)\ge 1-\delta\Bigr\}. \tag{5.3}
\]
Clearly, if $\mathbb{E}h\ge\lambda_N$ then with probability $1-\delta$,
\[
\frac{1}{1+\eta}\,P_Nh\ \le\ \mathbb{E}h\ \le\ \frac{1}{1-\eta}\,P_Nh,
\]
which is the desired isomorphic condition. Therefore, high probability bounds on the ratios
\[
\sup_{\{h\in H:\mathbb{E}h\ge\lambda\}}\bigl|P_N(h/\mathbb{E}h)-1\bigr|
\]
for $H=\ell_F$ or $H=\mathcal{L}_F$ lead to the isomorphic profile of $F$.
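For intuition, the ratio in (5.3) can be probed numerically. In the sketch below, the class $H$ (taken as $h_a(x)=(x-a)^2$ with $X\sim N(0,1)$, so $\mathbb{E}h_a=1+a^2$), the grid of parameters and the sample size are all illustrative choices, not from the text:

```python
import numpy as np

# A toy numerical probe of the ratio sup_{Eh >= lam} |P_N(h/Eh) - 1|
# from Definition 5.1. The class h_a(x) = (x - a)^2, with E h_a = 1 + a^2
# for X ~ N(0,1), is an illustrative choice only.

rng = np.random.default_rng(1)
alphas = np.linspace(0.0, 2.0, 41)
X = rng.normal(size=500)

def ratio_sup(lam):
    """sup over {h in H : Eh >= lam} of |P_N(h / Eh) - 1|."""
    ratios = [abs(np.mean((X - a) ** 2) / (1 + a ** 2) - 1.0)
              for a in alphas if 1 + a ** 2 >= lam]
    return max(ratios) if ratios else 0.0

# Raising lambda shrinks the index set {h : Eh >= lambda}, so the
# supremum can only decrease:
print(ratio_sup(1.0), ratio_sup(4.0))
```

Since the set $\{h:\mathbb{E}h\ge\lambda\}$ shrinks as $\lambda$ grows, the supremum is monotone in $\lambda$, which is what makes the infimum in (5.3) meaningful.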
Here, we will assume that the $L_2(\mu)$ and $\psi_2(\mu)$ norms are $L$-equivalent on the closed span of $F$ and that the unknown target is bounded in $L_{\psi_2}$ (e.g. $Y\in B_{L_{\psi_2}}$). Moreover, since our goal is to study ERM when the structure of the base class $F$ is not an obstacle to obtaining an exact oracle inequality with fast rates, we will further assume that $F$ is a compact, convex set.
Thus, $f^*=\operatorname{argmin}_{f\in F}\mathbb{E}(f(X)-Y)^2$ exists and is unique.
Let $\psi_f$ be either $\ell_f$ or $\mathcal{L}_f$ and set $\psi_F=\{\psi_f : f\in F\}$. For every $\lambda>0$, put
\[
F^+_{\psi,\lambda}=\{f\in F : \mathbb{E}\psi_f\ge\lambda\} \quad\text{and}\quad F_\lambda=\bigl(2F\cap c_0\sqrt{\lambda}\,D\bigr),
\]
where $c_0$ is a constant that will be specified later.
Our first observation is that both $\ell_f$ and $\mathcal{L}_f$ can be decomposed into two parts: a quadratic term, which has a very weak dependence on the target $Y$, and an affine or linear one, which will turn out to be the dominant term in the ratio estimate.
Set $\xi=f^*(X)-Y$; we will assume that $\xi\neq 0$. If $\xi=0$ the treatment is much simpler and will not be presented here.
For every $f\in F$, put $Q_f=f^2(X)$ and $U_f=\xi f(X)$. Note that for every $f\in F$, $\mathcal{L}_f=Q_{f-f^*}+2U_{f-f^*}$ and $\ell_f=Q_{f-f^*}+2U_{f-f^*}+\xi^2$. Indeed,
\[
\begin{aligned}
\mathcal{L}_f(X,Y) &= (f-f^*)(X)\,\bigl((f+f^*)(X)-2Y\bigr)\\
&= (f-f^*)(X)\,\bigl((f-f^*)(X)+2(f^*(X)-Y)\bigr)\\
&= (f-f^*)^2(X)+2(f-f^*)(X)\,\bigl(f^*(X)-Y\bigr)=Q_{f-f^*}+2U_{f-f^*},
\end{aligned} \tag{5.4}
\]
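The identity (5.4) is pure algebra and can be checked pointwise on data; the particular functions $f$, $f^*$ and the sample below are arbitrary choices used only for the check:

```python
import numpy as np

# A pointwise check of the decomposition (5.4):
#   L_f = l_f - l_{f*} = Q_{f-f*} + 2 U_{f-f*},
# with Q_g = g^2(X), U_g = xi * g(X) and xi = f*(X) - Y.
# The functions f, f* and the sample are arbitrary illustrative choices.

rng = np.random.default_rng(2)
X = rng.normal(size=10)
Y = 0.5 * X + rng.normal(size=10)

f      = lambda x: 1.3 * x        # an arbitrary class member
f_star = lambda x: 0.5 * x        # playing the role of the minimizer

xi = f_star(X) - Y
L_f = (f(X) - Y) ** 2 - (f_star(X) - Y) ** 2   # excess loss
Q   = (f(X) - f_star(X)) ** 2                  # quadratic part
U   = xi * (f(X) - f_star(X))                  # multiplier part

assert np.allclose(L_f, Q + 2 * U)             # identity (5.4)
print("decomposition holds pointwise")
```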
and thus $\ell_f=\mathcal{L}_f+\xi^2=Q_{f-f^*}+2U_{f-f^*}+\xi^2$.
To obtain the ratio estimate for the excess loss class $\mathcal{L}_F$, one has to establish high probability bounds on
\[
\sup_{f\in F^+_{\mathcal{L},\lambda}}\bigl|P_N(\mathcal{L}_f/\mathbb{E}\mathcal{L}_f)-1\bigr|.
\]
In a similar fashion, for the loss $\ell$ one has to bound
\[
\sup_{f\in F^+_{\ell,\lambda}}\bigl|P_N(\ell_f/\mathbb{E}\ell_f)-1\bigr|.
\]
For every $\lambda>0$, set $W_{\psi,\lambda}=\{(f-f^*)/\mathbb{E}\psi_f : f\in F^+_{\psi,\lambda}\}$.
Lemma 5.2 There exists an absolute constant $c_1$ for which the following holds. Let $\psi$ be either $\ell$ or $\mathcal{L}$ and assume that there is a constant $B>0$ such that for every $f\in F$, $\|f-f^*\|_{L_2}^2\le B\,\mathbb{E}\psi_f$. Then
\[
\bigl\{(f-f^*)/(\mathbb{E}\psi_f)^{1/2} : f\in F^+_{\psi,\lambda}\bigr\}\subset \frac{1}{\sqrt{\lambda}}\bigl(2F\cap\sqrt{\lambda B}\,D\bigr).
\]
Moreover,
\[
W_{\mathcal{L},\lambda}\subset \frac{c_1}{\lambda}\bigl(2F\cap\sqrt{\lambda B}\,D\bigr) \quad\text{and}\quad W_{\ell,\lambda}\subset \frac{c_1}{\|\xi\|_{L_2}\sqrt{\lambda}}\bigl(2F\cap\sqrt{\lambda B}\,D\bigr).
\]
Remark 5.3 The assumption that for every $f\in F$, $\|f-f^*\|_{L_2}^2\le B\,\mathbb{E}\psi_f$, is the “significant” part of the so-called Bernstein condition [3, 21, 20]. This condition is satisfied in a wide variety of cases and for many loss functionals.
For example, when $F$ is a compact, convex set, the squared loss and excess squared loss classes satisfy it with $B=1$, because of the characterization of the nearest point map onto such sets in a Hilbert space. We refer the reader to [21] for more details on the Bernstein condition in more general situations.
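The case $B=1$ of the remark can be made concrete on a toy convex class. In the sketch below, the class $F=\{x\mapsto ax : a\in[-1,1]\}$, the target $Y=2X$ (so the minimizer sits on the boundary, $f^*(x)=x$), and the closed-form population quantities are all illustrative choices, not from the text:

```python
import numpy as np

# A sanity check of the Bernstein condition ||f - f*||_{L2}^2 <= B * E L_f
# with B = 1 on a convex class. Toy setup: F = {x -> a x : a in [-1, 1]},
# X ~ N(0,1), Y = 2 X, so the L2-projection of Y onto F is f*(x) = x.

E_X2 = 1.0                                  # E X^2 for X ~ N(0,1)
for a in np.linspace(-1.0, 1.0, 21):
    dist2  = (a - 1.0) ** 2 * E_X2          # ||f_a - f*||_{L2}^2
    excess = ((a - 2.0) ** 2 - 1.0) * E_X2  # E L_{f_a} = E l_{f_a} - E l_{f*}
    assert dist2 <= excess + 1e-12          # Bernstein condition with B = 1
print("Bernstein condition with B = 1 verified on the grid")
```

Here $\mathbb{E}\mathcal{L}_{f_a}-\|f_a-f^*\|_{L_2}^2=2(1-a)\ge 0$ on $[-1,1]$: the cross term is nonnegative exactly because of the characterization of the metric projection onto a convex set.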
Proof of Lemma 5.2. Clearly, $\|(f-f^*)/(\mathbb{E}\psi_f)^{1/2}\|_{L_2}\le\sqrt{B}$, and thus $(f-f^*)/(\mathbb{E}\psi_f)^{1/2}\in\sqrt{B}\,D$. Since $F$ is convex and symmetric and $\mathbb{E}\psi_f\ge\lambda$,
\[
\frac{f-f^*}{(\mathbb{E}\psi_f)^{1/2}}=\frac{f-f^*}{\sqrt{\lambda}}\cdot\frac{\sqrt{\lambda}}{(\mathbb{E}\psi_f)^{1/2}}\in 2F/\sqrt{\lambda},
\]
and the first part of the claim follows.
If $\psi=\mathcal{L}$ then $1/(\mathbb{E}\mathcal{L}_f)^{1/2}\le 1/\sqrt{\lambda}$, and thus
\[
W_{\mathcal{L},\lambda}\subset \frac{c_1}{\lambda}\bigl(2F\cap\sqrt{\lambda B}\,D\bigr).
\]
Finally, since $\mathbb{E}\ell_f\ge\|f-f^*\|_{L_2}^2+\|\xi\|_{L_2}^2$, then $1/(\mathbb{E}\ell_f)^{1/2}\lesssim 1/(\|\xi\|_{L_2}+\|f-f^*\|_{L_2})$, and thus
\[
W_{\ell,\lambda}\subset \frac{c_1}{\|\xi\|_{L_2}\sqrt{\lambda}}\bigl(2F\cap\sqrt{\lambda B}\,D\bigr).
\]
Let us reformulate Theorem A:
Theorem 5.4 For every $0<\eta<1$ and $L>1$, there are constants $c_0,c_1,c_2,c_3$ and $c_4$ that depend on $L$, $\eta$, $\|\xi\|_{\psi_2}$ and $\|\xi\|_{L_2}$ for which the following holds. For every $L$-subgaussian class $F$, $Y\in B_{L_{\psi_2}}$ and $N\ge k^*(F)$, let $k_1=\max\{k^*\le k\le c_0N : \rho_k\ge c_1\sqrt{k/N}\}$, put $\delta_1=2\exp(-c_2N)$, and set $\delta_2=2\exp(-c_2k_1)$. Then
\[
\lambda_N(\ell_F,\eta,\delta_1)\le c_3\Bigl(\rho^2_{c_4N}+\frac{1}{N}\Bigr) \quad\text{and}\quad \lambda_N(\mathcal{L}_F,\eta,\delta_2)\le c_3\rho^2_{k_1}.
\]
Therefore, with probability at least $1-2\exp(-c_2N)$,
\[
\mathbb{E}\ell_{\hat f}\le(1+2\eta)\,\mathbb{E}\ell_{f^*}+c_3\Bigl(\rho^2_{c_4N}+\frac{1}{N}\Bigr).
\]
Observe that if the sequence is even remotely regular, then
\[
\rho^2_{k_1}\sim \min_{k^*\le k\le c_0N}\bigl(\rho^2_k+k/N\bigr).
\]
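This comparison is easy to see numerically for a concrete regular sequence. In the sketch below, the sequence $\rho_k=k^{-1/4}$ and the normalization $c_0=c_1=1$ are hypothetical choices made only for illustration:

```python
import numpy as np

# Toy illustration: for a "regular" sequence, rho_{k1}^2 is comparable
# to min_k (rho_k^2 + k/N). The sequence rho_k = k^{-1/4} and the
# constants c0 = c1 = 1 are hypothetical illustrative choices.

N = 10 ** 6
k = np.arange(1, N + 1)
rho = k ** (-0.25)                       # a hypothetical regular sequence

k1 = k[rho >= np.sqrt(k / N)].max()      # k1 = max{k <= N : rho_k >= sqrt(k/N)}
lhs = rho[k1 - 1] ** 2                   # rho_{k1}^2
rhs = (rho ** 2 + k / N).min()           # min_k (rho_k^2 + k/N)

print(lhs / rhs)                         # a bounded constant
```

For this sequence $k_1\approx N^{2/3}$ and both sides are of order $N^{-1/3}$, so the ratio stays bounded as $N$ grows.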
Also, note that the sequence $\rho_k/\sqrt{k}$ decreases with $k$, and thus $k_1$ grows with $N$ – as should be expected.
The ratio estimate leading to the oracle inequality follows from Theorem 5.5 below. To formulate it, set $A=(1/\sqrt{\lambda})F_\lambda$.
Proof. In light of Lemma 5.2, the three processes one has to study are indexed by multiples of the set $A=(1/\sqrt{\lambda})F_\lambda$; all of the above may be bounded using Theorem 3.1 and Corollary 3.9. Also, $|N^{-1}\sum_{i=1}^N\xi^2(X_i)-\mathbb{E}\xi^2|$ can be estimated by Theorem 2.2, proving the desired result.
Let us estimate the parameters appearing in Theorem 5.5. Since $\mu$ is $L$-subgaussian, $d_A(\psi_2)\le L\,d_A(L_2)$, and since $\lambda^{-1/2}F_\lambda\subset c_1D$, then, provided $H^2(\sqrt{\lambda})/\lambda\le N$, the bound holds with probability at least $1-2\exp\bigl(-c_5(L)u^2H^2(\sqrt{\lambda})/\lambda\bigr)$; and to ensure that $\sup_{\mathbb{E}\mathcal{L}_f\ge\lambda}\bigl|P_N(\mathcal{L}_f/\mathbb{E}\mathcal{L}_f)-1\bigr|\le\eta$, it suffices that $\gamma\ge c_2(L,\|\xi\|_{\psi_2},\eta)$. Finally, the probability estimate clearly holds because $H(\gamma\rho_{k_1})/(\gamma\rho_{k_1})\gtrsim\gamma\sqrt{k_1}$.
Hence, it suffices to ensure that $H(\sqrt{\lambda})/(\sqrt{N}\sqrt{\lambda})\lesssim_{L}\eta$, which concludes the proof by taking $\lambda\sim\rho^2_{c_4N}+1/N$.