Suppose that $\Theta$ is convex, and that the loss function $\ell : \Theta \to \mathbb{R}^{\mathcal{O}}$ is $a_1$-Lipschitz,
\begin{equation}
|\ell(\theta_1) - \ell(\theta_2)| \le a_1 |\theta_1 - \theta_2| \tag{20}
\end{equation}
and $a_2$-strongly convex: for all $s \in [0,1]$ and $\theta_1, \theta_2 \in \Theta$,
\begin{multline*}
\ell(s\theta_1 + (1-s)\theta_2) - \frac{a_2}{2}\left(s\theta_1 + (1-s)\theta_2\right)^2 \\
\le s\left(\ell(\theta_1) - \frac{a_2}{2}\theta_1^2\right) + (1-s)\left(\ell(\theta_2) - \frac{a_2}{2}\theta_2^2\right)
\end{multline*}
(both inequalities above are understood pointwise). Then the modulus of continuity of $\ell$ is lower-bounded by $\rho \mapsto \frac{a_2}{8}\rho^2$ in the sense that, for all $\theta_1, \theta_2 \in \Theta$,
\begin{equation}
\frac{1}{2}\left(\ell(\theta_1) + \ell(\theta_2)\right) - \ell\left(\frac{1}{2}(\theta_1 + \theta_2)\right) \ge \frac{a_2}{8}(\theta_1 - \theta_2)^2 \tag{21}
\end{equation}
(pointwise). Let $P$ be a law on $\mathcal{O}$ such that $P\ell(\theta)$ is well defined for all $\theta \in \Theta$, where we note $Pf := \int f \, dP$. Choose $\theta^\circ \in \Theta$ such that $P\ell(\theta^\circ) \le P\ell(\theta)$ for all $\theta \in \Theta$. Then, for all $\theta \in \Theta$,
\begin{align*}
\frac{1}{2} P\left(\ell(\theta) + \ell(\theta^\circ)\right) &\ge P\ell\left(\tfrac{1}{2}(\theta + \theta^\circ)\right) + \frac{a_2}{8} P(\theta - \theta^\circ)^2 \\
&\ge P\ell(\theta^\circ) + \frac{a_2}{8} P(\theta - \theta^\circ)^2 \\
&\ge P\ell(\theta^\circ) + \frac{a_2}{8a_1^2} P\left(\ell(\theta) - \ell(\theta^\circ)\right)^2,
\end{align*}
where the first inequality follows from (21), the second holds by convexity of $\Theta$ and the choice of $\theta^\circ$, and the third follows from (20). Therefore,
\[
P\left(\ell(\theta^\circ) - \ell(\theta)\right)^2 \le \frac{4a_1^2}{a_2} P\left(\ell(\theta) - \ell(\theta^\circ)\right),
\]
which concludes the argument.
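As a sanity check, inequality (21) can be verified numerically on a concrete instance. The sketch below uses the loss $\ell(\theta) = \theta^2/2$ on $\Theta = [0,1]$, for which $a_2 = 1$; this choice of $\ell$ is ours, for illustration only, and (21) in fact holds with equality for it.

```python
import random

# Numerical sanity check of inequality (21) on an illustrative example:
# on Theta = [0, 1], the loss ell(theta) = theta**2 / 2 is 1-strongly
# convex (a_2 = 1), and (21) holds with equality for this quadratic loss.
def ell(theta):
    return theta ** 2 / 2

a2 = 1.0
random.seed(0)
for _ in range(10_000):
    t1, t2 = random.random(), random.random()
    lhs = 0.5 * (ell(t1) + ell(t2)) - ell(0.5 * (t1 + t2))
    rhs = (a2 / 8) * (t1 - t2) ** 2
    assert lhs >= rhs - 1e-12  # inequality (21)
```

For a quadratic loss the left-hand side of (21) equals $(\theta_1-\theta_2)^2/8$ exactly, which shows that the constant $a_2/8$ cannot be improved in general.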
B Proofs
B.1 Proof of Theorem 1
The proof unfolds in three steps.
Step 1: an algebraic decomposition. For all $j \in \llbracket J \rrbracket$, $t \ge 1$ and $\theta \in \Theta$, let us define
\[
\tilde{H}_{j,t} := \tilde{R}_{j,t} - \tilde{R}_t(\theta^\circ), \qquad \hat{H}_{j,t} := \hat{R}_{j,t} - \hat{R}_t(\theta^\circ),
\]
and
\[
\Delta^\circ\bar{\ell}(\theta)(\bar{O}_t, \bar{Z}_t) := \bar{\ell}(\theta)(\bar{O}_t, \bar{Z}_t) - \bar{\ell}(\theta^\circ)(\bar{O}_t, \bar{Z}_t)
\]
($\bar{\ell}(\theta)$ is defined in (2)). Fix arbitrarily $a > 0$. An algebraic decomposition at the heart of all studies of the Super Learner [see, e.g., Dudoit and van der Laan, 2005, van der Laan et al., 2007, Benkeser et al., 2018] states that the excess risk of the Super Learner (that is, $\tilde{H}_{\hat{j}_t,t}$) can be bounded by $(1+2a)$ times the excess risk of the oracle (that is, $\tilde{H}_{\tilde{j}_t,t}$), plus some remainder terms:
\begin{align}
\tilde{H}_{\hat{j}_t,t} &\le (1+2a)\tilde{H}_{\tilde{j}_t,t} + A_{\hat{j}_t,t}(a) + B_{\tilde{j}_t,t}(a) \nonumber\\
&\le (1+2a)\tilde{H}_{\tilde{j}_t,t} + \max_{j \in \llbracket J \rrbracket}\{A_{j,t}(a)\} + \max_{j \in \llbracket J \rrbracket}\{B_{j,t}(a)\} \tag{22}
\end{align}
where
\[
A_{j,t}(a) := (1+a)\left(\tilde{H}_{j,t} - \hat{H}_{j,t}\right) - a\tilde{H}_{j,t} \qquad \text{and} \qquad B_{j,t}(a) := (1+a)\left(\hat{H}_{j,t} - \tilde{H}_{j,t}\right) - a\tilde{H}_{j,t}.
\]
The first terms in the definitions of $A_{j,t}(a)$ and $B_{j,t}(a)$ equal $\pm(1+a)$ times
\[
\frac{1}{t}\sum_{\tau=1}^{t}\left(\Delta^\circ\bar{\ell}(\theta_{j,\tau-1})(\bar{O}_\tau, \bar{Z}_\tau) - E\left[\Delta^\circ\bar{\ell}(\theta_{j,\tau-1})(\bar{O}_\tau, \bar{Z}_\tau) \,\middle|\, \bar{Z}_\tau, \mathcal{F}_{\tau-1}\right]\right),
\]
that is, the average of the $t$ first terms of a martingale difference sequence. As for the shared second term in the definitions of $A_{j,t}(a)$ and $B_{j,t}(a)$, it satisfies $-a\tilde{H}_{j,t} \le 0$. The second step of the proof consists in exploiting two so-called Bernstein inequalities to control the probabilities $P[A_{j,t}(a) \ge x]$ and $P[B_{j,t}(a) \ge x]$ for $x \ge 0$.
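The two inequalities in (22) are purely algebraic: the right-hand side of the first line minus its left-hand side equals $(1+a)(\hat{H}_{\tilde{j}_t,t} - \hat{H}_{\hat{j}_t,t}) \ge 0$ because $\hat{j}_t$ minimizes the empirical risk. The sketch below checks (22) on synthetic excess risks; all numerical values are arbitrary placeholders, not quantities from the paper.

```python
import random

# Sanity check of the algebraic decomposition (22), with synthetic numbers
# standing in for the excess risks: H_tilde[j] plays the role of
# tilde-H_{j,t} and H_hat[j] that of hat-H_{j,t}. The Super Learner index
# j_hat minimizes the empirical criterion (H_hat), the oracle index j_tilde
# minimizes H_tilde.
random.seed(1)
a = 0.5  # arbitrary a > 0
for _ in range(1000):
    J = 5
    H_tilde = [random.uniform(0, 1) for _ in range(J)]
    H_hat = [h + random.uniform(-0.2, 0.2) for h in H_tilde]
    j_hat = min(range(J), key=lambda j: H_hat[j])
    j_tilde = min(range(J), key=lambda j: H_tilde[j])
    A = [(1 + a) * (H_tilde[j] - H_hat[j]) - a * H_tilde[j] for j in range(J)]
    B = [(1 + a) * (H_hat[j] - H_tilde[j]) - a * H_tilde[j] for j in range(J)]
    # first line of (22)
    assert H_tilde[j_hat] <= (1 + 2 * a) * H_tilde[j_tilde] + A[j_hat] + B[j_tilde] + 1e-12
    # second line of (22)
    assert H_tilde[j_hat] <= (1 + 2 * a) * H_tilde[j_tilde] + max(A) + max(B) + 1e-12
```

Note that neither inequality uses any probabilistic property: only the minimizing definitions of $\hat{j}_t$ (for the first line) and the passage to maxima (for the second).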
Step 2: Bounding positive deviations of $A_{j,t}(a)$ and $B_{j,t}(a)$. Set arbitrarily two integers $N, N_0 \ge 2$ and a real number $x \ge 0$. The analysis of $P[B_{j,t}(a) \ge x]$ is exactly the same as that of $P[A_{j,t}(a) \ge x]$, so we present only the latter. The key to the analysis is a so-called stratification argument inspired by Cesa-Bianchi and Gentile [2008].
For every $j \in \llbracket J \rrbracket$ and $t \ge 1$, recall the definitions (15) and (16) of $\mathrm{var}_{j,t}$ and $\widetilde{\mathrm{var}}_{j,t}$. On the one hand, by A4 and because the functions of a real variable $u \mapsto u^2$ and $u \mapsto u^\beta$ are respectively convex and concave, (23) holds almost surely. Moreover, it also holds that $\widetilde{\mathrm{var}}_{j,t} \le v^2$ almost surely by Theorem 3. The previous upper bound and (23) play a key role in the first version of Step 2 (Step 2 (v1)) presented below.
On the other hand, by A2, A4, and because the function $u \mapsto u^\beta$ is concave, a corresponding almost-sure bound holds for all $\tau \in \llbracket t \rrbracket$. Consequently, if $\tilde{H}_{j,t} \le B$ (an inequality that holds almost surely when $B = b_1$, by A3), then a further bound also holds almost surely; it plays a key role in the second version of Step 2 (Step 2 (v2)) presented below.
Step 2 (v1). Set $v_{(-1)}^2 := 0$ and, for all $i \in \llbracket N-1 \rrbracket$, $v_{(i)}^2 := 2^{i+1-N} \times v^2$. In view of (23), of the almost-sure boundedness granted by A3, and of the Fan-Grama-Liu concentration inequality for martingales [Theorem 3.10 in Bercu et al., 2015], (25) implies
\[
P[A_{j,t}(a) \ge x] \le
\]
• Moreover, if $x \ge x_i$, then $v_{(i)}^2 \le \left(x + a(v_{(i-1)}^2/\gamma)^{1/\beta}\right) \times b_2/(3(1+a))$, hence
Therefore, in light of (26), (28), (29) and the definitions of $C_1(a)$, $C_2(a)$ given in the theorem, for all $x \ge x(a,N)$, it holds that
By A3 and A4, Theorem 3 applies and (31) yields
\[
P[A_{j,t}(a) \ge x] \le
\]
Set arbitrarily $i \in \llbracket N_0-1 \rrbracket \cup \{0\}$ and define $x'_i := 32e(1+a)\gamma(tb_1^{(i)})^\beta/(15b_2) - ab_1^{(i-1)}$.
Step 3: end of the proof. In view of (22), the announced bound follows from a union bound, (36) and the above inequality. This completes the proof of Theorem 1.
B.2 Proof of Corollary 2
Corollary 2 follows from the straightforward application, twice, of the next technical lemma, based on (7) on the one hand and on (8) on the other hand.
Lemma 4. Let $a, b, c > 0$ and $\beta \in \,]0,1]$ be some constants and let $(x(N))_{N \ge 2}$ be a sequence of positive numbers that decreases to 0. Let $U$ be a real-valued random variable such that $E[|U|] < \infty$ and, for all integers $N \ge 2$ and all $x \ge x(N) > 0$,
\[
P[U \ge x] \le aN\exp(-x^{2-\beta}/b) + aN\exp(-x/c).
\]
Then, for all $N \ge \min\{n \ge 2 : x(n) \le b^{1/(2-\beta)}, \log(an) \ge 1\}$,
\[
E[U] \le 3(b\log(aN))^{1/(2-\beta)} + 2c\log(aN).
\]

Proof of Lemma 4. It is well known that
\begin{equation}
E[U] \le E[U\mathbf{1}\{U \ge 0\}] = \int_0^\infty P[U \ge x]\,dx \le x(N) + \int_{x(N)}^\infty \min\left(1, aNe^{-x^{2-\beta}/b}\right)dx + \int_{x(N)}^\infty \min\left(1, aNe^{-x/c}\right)dx. \tag{39}
\end{equation}
Let us denote by $L$ and $R$ the above left-hand side and right-hand side integrals. Choose $N \ge \min\{n \ge 2 : x(n) \le b^{1/(2-\beta)}, \log(an) \ge 1\}$.
Bounding $L$. Let $x_L$ be chosen so that $aN\exp(-x_L^{2-\beta}/b) = 1$, i.e., $x_L := (b\log(aN))^{1/(2-\beta)}$. Now, thanks to the change of variable $u = x^{2-\beta}/b$ and because $u \mapsto u^{1/(2-\beta)-1}$ is nonincreasing,
\begin{align}
L &\le x_L + aN\int_{x_L}^\infty \exp(-x^{2-\beta}/b)\,dx \nonumber\\
&\le x_L + aNb^{1/(2-\beta)}\int_{\log(aN)}^\infty \exp(-u)\,u^{1/(2-\beta)-1}\,du \nonumber\\
&\le x_L + \frac{aN(b\log(aN))^{1/(2-\beta)}}{\log(aN)}\int_{\log(aN)}^\infty \exp(-u)\,du \nonumber\\
&= x_L\left(1 + 1/\log(aN)\right) \le 2(b\log(aN))^{1/(2-\beta)}. \tag{40}
\end{align}
Bounding $R$. Let $x_R$ be chosen so that $aN\exp(-x_R/c) = 1$, i.e., $x_R := c\log(aN)$. It is readily seen that
\begin{equation}
R \le x_R + aN\int_{x_R}^\infty \exp(-x/c)\,dx = x_R + acN\exp(-x_R/c) = c(1 + \log(aN)). \tag{41}
\end{equation}
In view of (39), (40), (41), and by choice of $N$, we obtain
\begin{align*}
E[U] &\le b^{1/(2-\beta)} + 2(b\log(aN))^{1/(2-\beta)} + c(1 + \log(aN)) \\
&\le 3(b\log(aN))^{1/(2-\beta)} + 2c\log(aN).
\end{align*}
This completes the proof.
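The bounds (40) and (41) can be checked numerically. In the sketch below the values of $a$, $b$, $c$, $\beta$ and $N$ are arbitrary choices satisfying $\log(aN) \ge 1$, and the tail integral appearing in (40) is approximated by a Riemann sum.

```python
import math

# Numerical sanity check of the bounds (40) and (41). The parameter values
# are arbitrrary illustrative choices, not quantities from the paper.
a, b, c, beta, N = 2.0, 1.5, 0.7, 0.5, 50
assert math.log(a * N) >= 1  # condition imposed by the choice of N

# (41): the exponential tail integral has a closed form.
x_R = c * math.log(a * N)
R = x_R + a * N * c * math.exp(-x_R / c)
assert abs(R - c * (1 + math.log(a * N))) < 1e-12

# (40): approximate x_L + aN * integral_{x_L}^inf exp(-x^{2-beta}/b) dx by a
# left Riemann sum (an overestimate, since the integrand is decreasing) and
# compare with 2 * (b log(aN))^{1/(2-beta)}.
x_L = (b * math.log(a * N)) ** (1 / (2 - beta))
dx = 1e-3
grid = [x_L + k * dx for k in range(int(60 / dx))]  # truncate the tail
integral = sum(math.exp(-x ** (2 - beta) / b) * dx for x in grid)
L = x_L + a * N * integral
assert L <= 2 * (b * math.log(a * N)) ** (1 / (2 - beta))  # bound (40)
```

The left Riemann sum overestimates the decreasing integrand, so the final assertion is a conservative check of (40).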
Set $t \ge 1$ and $a \in \,]0,1]$. In view of (7), Lemma 4 yields (9) under the sufficient condition that $N \ge 2$ also satisfy (10). Moreover, in view of (8), Lemma 4 also yields (11) under the sufficient condition that $N \ge 2$ also satisfy (12). This completes the proof of the corollary.
B.3 Proof of Theorem 3
The proof of Theorem 3 hinges on a Bernstein-like concentration inequality for sums of partly dependent random variables shown by Janson [2004, Theorem 2.3]. Janson emphasizes that his
theorem uses the independence of suitable (large) subsets of (ζα)α∈A, not any other information on the dependencies, so that the result must be suboptimal when the dependencies that exist are weak. We recall the theorem for completeness.
Theorem 5 (Janson [2004]). Let $(\zeta_\alpha)_{\alpha \in \mathcal{A}}$ be a collection of random variables with dependency graph $G$ such that $\zeta_\alpha - E[\zeta_\alpha] \le B$ for some $B > 0$ and all $\alpha \in \mathcal{A}$. Define $Z := |\mathcal{A}|^{-1}\sum_{\alpha \in \mathcal{A}} \zeta_\alpha$.
Note that (18) from Theorem 3 can also be written as a deviation bound of this form.
Following the line of proof of the Rosenthal inequality by Petrov [1995, page 59] (see also the proof of Theorem 5.2 in [Baraud, 2000]), we use (42) to control $E[|Z - E[Z]|^p]$, hence $E[|\hat{H}_{j,t} - \tilde{H}_{j,t}|^p]$ (by convexity), for all $p \ge 2$. The bound (18) follows as in [Dedecker, 2001, proof of Corollary 3(b)], a method inspired by the proof of Theorem 6 in [Doukhan et al., 1984].
We first prove this corollary of Theorem 5. The constants are in no way optimal.
Corollary 6. In the context of Theorem 5, for all $p \ge 2$,
\[
E[|Z - E[Z]|^p] \le 3\pi \cdots \tag{43}
\]
Proof of Corollary 6. We will use the identity $E[U] = \int_0^\infty P[U \ge x]\,dx$, valid for any nonnegative random variable $U$. Let $r > 0$ be a constant that we will carefully choose later on. Set arbitrarily $s \ge 0$, define $m := s/r$, and introduce
\[
\tilde{Z}_m := |\mathcal{A}|^{-1}\sum_{\alpha \in \mathcal{A}} (\zeta_\alpha - E[\zeta_\alpha])\mathbf{1}\{|\zeta_\alpha - E[\zeta_\alpha]| < m\}.
\]
By (44), the rightmost term in (44), say $I_p$, satisfies
\[
I_p \le
\]
Because $h(u) \ge \frac{u}{2}\log(1+u)$ for all $u \ge 0$, (45) yields (46); the bound $\tilde{V} \le V$ thus implies that the rightmost term in (46) is smaller than
\[
2p \cdots
\]
We now choose $r := 5(p+1)\deg(G)/|\mathcal{A}|$ to guarantee the convergence of the above integral, to $\pi/2$, and conclude that (44) and (46) imply
\begin{align}
E[|Z - E[Z]|^p] &\le r^p E\left[\max_{\alpha \in \mathcal{A}} |\zeta_\alpha - E[\zeta_\alpha]|^p\right] + \pi(rV)^{p/2}\left(2^{p/2} + p(5/4)^{p/2}\right) \nonumber\\
&\le (rB)^p + \pi(p+1)(2rV)^{p/2}. \tag{47}
\end{align}
Finally, since $(p+1)/p \le 3/2$ and $p^{2/p} \le e^{2/e} \approx 2.09$, we can simplify (47) to (43), thus completing the proof of Corollary 6.
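The two elementary facts invoked in this last step can be verified numerically; the grid of values of $p$ below is an arbitrary illustrative choice.

```python
import math

# Sanity check of the two elementary facts used to simplify (47):
# (i)  p**(2/p) <= e**(2/e) for all p > 0 (the map x -> x**(2/x) is
#      maximized at x = e, where it equals e**(2/e) ~ 2.09);
# (ii) 2**(p/2) + p * (5/4)**(p/2) <= (p + 1) * 2**(p/2) for p >= 2,
#      the step producing the factor (p + 1) in (47).
for k in range(2, 401):
    p = k / 2  # p ranges over 1.0, 1.5, ..., 200.0
    assert p ** (2 / p) <= math.e ** (2 / math.e) + 1e-12
    if p >= 2:
        assert 2 ** (p / 2) + p * (5 / 4) ** (p / 2) <= (p + 1) * 2 ** (p / 2)
```

Fact (ii) holds simply because $(5/4)^{p/2} \le 2^{p/2}$.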
Fix arbitrarily $j \in \llbracket J \rrbracket$, $t \ge 1$, $V > 0$, and $p \ge 2$. To save space, introduce, for each $\tau \in \llbracket t \rrbracket$,
\[
Z_{j,\tau} := \Delta^\circ\bar{\ell}(\theta_{j,\tau-1})(\bar{O}_\tau, \bar{Z}_\tau) - E\left[\Delta^\circ\bar{\ell}(\theta_{j,\tau-1})(\bar{O}_\tau, \bar{Z}_\tau) \,\middle|\, \bar{Z}_\tau, \mathcal{F}_{\tau-1}\right].
\]
In view of A1, A2, A3 and A5, Corollary 6 applies and guarantees that, almost surely, for all $\tau \in \llbracket t \rrbracket$,
It is now easy to show that $\widetilde{\mathrm{var}}_{j,t} \le v^2$ almost surely (see (6) for the definition of $v^2$). By A5,
Therefore Markov's inequality implies that, for all $x > 0$,
By the technical Lemma 7, there exists $p_x > 0$ such that
\[
x|\mathcal{A}| = 15eb_2\deg(G)p_x/2 + \cdots
\]
Otherwise, $P[|\hat{H}_{j,t} - \tilde{H}_{j,t}| \ge x, \tilde{F}_V] \le \exp(2 - q_x)$ holds trivially. This completes the proof of Theorem 3.
Lemma 7. For any $a, b, c > 0$, there exists $p > 0$ such that $c = b\sqrt{p} + ap$. Moreover, $c^2 \le (b^2 + 2ac)p$.

Proof of Lemma 7. The quadratic equation $c = bX + aX^2$ has a positive solution, so there does exist $p > 0$ such that $c = b\sqrt{p} + ap$. Moreover, $c^2/p = b^2 + 2ab\sqrt{p} + a^2p$ on the one hand and $2ac = 2ab\sqrt{p} + 2a^2p \ge 2ab\sqrt{p} + a^2p$ on the other hand, implying that $c^2/p \le b^2 + 2ac$. This completes the proof.
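Lemma 7 is easy to check numerically: with $X = \sqrt{p}$, the equation $c = bX + aX^2$ has the positive root $X = (-b + \sqrt{b^2 + 4ac})/(2a)$. The sketch below (the random parameter ranges are arbitrary) verifies both claims of the lemma.

```python
import math
import random

# Numerical check of Lemma 7: solve c = b*sqrt(p) + a*p for p (a quadratic
# in X = sqrt(p)) and verify that c**2 <= (b**2 + 2*a*c) * p.
def solve_p(a, b, c):
    # positive root of a*X**2 + b*X - c = 0, with X = sqrt(p)
    x = (-b + math.sqrt(b ** 2 + 4 * a * c)) / (2 * a)
    return x ** 2

random.seed(2)
for _ in range(10_000):
    a, b, c = (random.uniform(0.01, 10) for _ in range(3))
    p = solve_p(a, b, c)
    assert abs(b * math.sqrt(p) + a * p - c) < 1e-8 * max(1, c)  # c = b*sqrt(p) + a*p
    assert c ** 2 <= (b ** 2 + 2 * a * c) * p * (1 + 1e-12)      # c^2 <= (b^2 + 2ac) p
```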