Suppose that $\Theta$ is convex, and that the loss function $\ell : \Theta \to \mathbb{R}^{\mathcal{O}}$ is $a_1$-Lipschitz,
\begin{equation}
|\ell(\theta_1) - \ell(\theta_2)| \le a_1 |\theta_1 - \theta_2| \tag{20}
\end{equation}
and $a_2$-strongly convex: for all $s \in [0,1]$ and $\theta_1, \theta_2 \in \Theta$,
\begin{multline*}
\ell(s\theta_1 + (1-s)\theta_2) - \frac{a_2}{2}\left(s\theta_1 + (1-s)\theta_2\right)^2 \\
\le s\left(\ell(\theta_1) - \frac{a_2}{2}\theta_1^2\right) + (1-s)\left(\ell(\theta_2) - \frac{a_2}{2}\theta_2^2\right)
\end{multline*}
(both inequalities above are understood pointwise). Then the modulus of continuity of $\ell$ is lower-bounded by $\rho \mapsto \frac{a_2}{8}\rho^2$ in the sense that, for all $\theta_1, \theta_2 \in \Theta$,
\begin{equation}
\frac{1}{2}\left(\ell(\theta_1) + \ell(\theta_2)\right) - \ell\left(\frac{1}{2}(\theta_1 + \theta_2)\right) \ge \frac{a_2}{8}(\theta_1 - \theta_2)^2 \tag{21}
\end{equation}
(pointwise). Let $P$ be a law on $\mathcal{O}$ such that $P\ell(\theta)$ is well defined for all $\theta \in \Theta$, where we note $Pf := \int f \, dP$. Choose $\theta^\circ \in \Theta$ such that $P\ell(\theta^\circ) \le P\ell(\theta)$ for all $\theta \in \Theta$. Then, for all $\theta \in \Theta$,
\begin{align*}
\frac{1}{2} P\left(\ell(\theta) + \ell(\theta^\circ)\right) &\ge P\ell\left(\tfrac{1}{2}(\theta + \theta^\circ)\right) + \frac{a_2}{8} P(\theta - \theta^\circ)^2 \\
&\ge P\ell(\theta^\circ) + \frac{a_2}{8} P(\theta - \theta^\circ)^2 \\
&\ge P\ell(\theta^\circ) + \frac{a_2}{8a_1^2} P\left(\ell(\theta) - \ell(\theta^\circ)\right)^2,
\end{align*}
where the first inequality follows from (21), the second holds by convexity of $\Theta$ and the choice of $\theta^\circ$, and the third follows from (20). Therefore,
\[
P\left(\ell(\theta^\circ) - \ell(\theta)\right)^2 \le \frac{4a_1^2}{a_2} P\left(\ell(\theta) - \ell(\theta^\circ)\right),
\]
which concludes the argument.
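As a sanity check, inequality (21) can be verified numerically on a concrete instance. The sketch below uses the loss $\ell(\theta) = \theta^2/2$ on $\Theta = [0,1]$, for which $a_2 = 1$; this choice of $\ell$ is ours, for illustration only, and (21) in fact holds with equality for it.

```python
import random

# Numerical sanity check of inequality (21) on an illustrative example:
# on Theta = [0, 1], the loss ell(theta) = theta**2 / 2 is 1-strongly
# convex (a_2 = 1), and (21) holds with equality for this quadratic loss.
def ell(theta):
    return theta ** 2 / 2

a2 = 1.0
random.seed(0)
for _ in range(10_000):
    t1, t2 = random.random(), random.random()
    lhs = 0.5 * (ell(t1) + ell(t2)) - ell(0.5 * (t1 + t2))
    rhs = (a2 / 8) * (t1 - t2) ** 2
    assert lhs >= rhs - 1e-12  # inequality (21)
```

For a quadratic loss the left-hand side of (21) equals $(\theta_1-\theta_2)^2/8$ exactly, which shows that the constant $a_2/8$ cannot be improved in general.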
B Proofs
B.1 Proof of Theorem 1
The proof unfolds in three steps.
Step 1: an algebraic decomposition. For all $j \in \llbracket J \rrbracket$, $t \ge 1$ and $\theta \in \Theta$, let us define
\[
\tilde{H}_{j,t} := \tilde{R}_{j,t} - \tilde{R}_t(\theta^\circ), \qquad \hat{H}_{j,t} := \hat{R}_{j,t} - \hat{R}_t(\theta^\circ),
\]
and
\[
\Delta^\circ\bar{\ell}(\theta)(\bar{O}_t, \bar{Z}_t) := \bar{\ell}(\theta)(\bar{O}_t, \bar{Z}_t) - \bar{\ell}(\theta^\circ)(\bar{O}_t, \bar{Z}_t)
\]
($\bar{\ell}(\theta)$ is defined in (2)). Fix arbitrarily $a > 0$. An algebraic decomposition at the heart of all studies of the Super Learner [see, e.g., Dudoit and van der Laan, 2005, van der Laan et al., 2007, Benkeser et al., 2018] states that the excess risk of the Super Learner (that is, $\tilde{H}_{\hat{j}_t,t}$) can be bounded by $(1+2a)$ times the excess risk of the oracle (that is, $\tilde{H}_{\tilde{j}_t,t}$), plus some remainder terms:
\begin{align}
\tilde{H}_{\hat{j}_t,t} &\le (1+2a)\tilde{H}_{\tilde{j}_t,t} + A_{\hat{j}_t,t}(a) + B_{\tilde{j}_t,t}(a) \nonumber\\
&\le (1+2a)\tilde{H}_{\tilde{j}_t,t} + \max_{j \in \llbracket J \rrbracket}\{A_{j,t}(a)\} + \max_{j \in \llbracket J \rrbracket}\{B_{j,t}(a)\} \tag{22}
\end{align}
where
\[
A_{j,t}(a) := (1+a)\left(\tilde{H}_{j,t} - \hat{H}_{j,t}\right) - a\tilde{H}_{j,t} \qquad \text{and} \qquad B_{j,t}(a) := (1+a)\left(\hat{H}_{j,t} - \tilde{H}_{j,t}\right) - a\tilde{H}_{j,t}.
\]
The first terms in the definitions of $A_{j,t}(a)$ and $B_{j,t}(a)$ equal $\pm(1+a)$ times
\[
\frac{1}{t}\sum_{\tau=1}^{t}\left(\Delta^\circ\bar{\ell}(\theta_{j,\tau-1})(\bar{O}_\tau, \bar{Z}_\tau) - E\left[\Delta^\circ\bar{\ell}(\theta_{j,\tau-1})(\bar{O}_\tau, \bar{Z}_\tau) \,\middle|\, \bar{Z}_\tau, \mathcal{F}_{\tau-1}\right]\right),
\]
that is, the average of the $t$ first terms of a martingale difference sequence. As for the shared second term in the definitions of $A_{j,t}(a)$ and $B_{j,t}(a)$, it satisfies $-a\tilde{H}_{j,t} \le 0$. The second step of the proof consists in exploiting two so-called Bernstein inequalities to control the probabilities $P[A_{j,t}(a) \ge x]$ and $P[B_{j,t}(a) \ge x]$ for $x \ge 0$.
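The two inequalities in (22) are purely algebraic: the right-hand side of the first line minus its left-hand side equals $(1+a)(\hat{H}_{\tilde{j}_t,t} - \hat{H}_{\hat{j}_t,t}) \ge 0$ because $\hat{j}_t$ minimizes the empirical risk. The sketch below checks (22) on synthetic excess risks; all numerical values are arbitrary placeholders, not quantities from the paper.

```python
import random

# Sanity check of the algebraic decomposition (22), with synthetic numbers
# standing in for the excess risks: H_tilde[j] plays the role of
# tilde-H_{j,t} and H_hat[j] that of hat-H_{j,t}. The Super Learner index
# j_hat minimizes the empirical criterion (H_hat), the oracle index j_tilde
# minimizes H_tilde.
random.seed(1)
a = 0.5  # arbitrary a > 0
for _ in range(1000):
    J = 5
    H_tilde = [random.uniform(0, 1) for _ in range(J)]
    H_hat = [h + random.uniform(-0.2, 0.2) for h in H_tilde]
    j_hat = min(range(J), key=lambda j: H_hat[j])
    j_tilde = min(range(J), key=lambda j: H_tilde[j])
    A = [(1 + a) * (H_tilde[j] - H_hat[j]) - a * H_tilde[j] for j in range(J)]
    B = [(1 + a) * (H_hat[j] - H_tilde[j]) - a * H_tilde[j] for j in range(J)]
    # first line of (22)
    assert H_tilde[j_hat] <= (1 + 2 * a) * H_tilde[j_tilde] + A[j_hat] + B[j_tilde] + 1e-12
    # second line of (22)
    assert H_tilde[j_hat] <= (1 + 2 * a) * H_tilde[j_tilde] + max(A) + max(B) + 1e-12
```

Note that neither inequality uses any probabilistic property: only the minimizing definitions of $\hat{j}_t$ (for the first line) and the passage to maxima (for the second).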
Step 2: Bounding positive deviations of $A_{j,t}(a)$ and $B_{j,t}(a)$. Set arbitrarily two integers $N, N_0 \ge 2$ and a real number $x \ge 0$. The analysis of $P[B_{j,t}(a) \ge x]$ is exactly the same as that of $P[A_{j,t}(a) \ge x]$, so we present only the latter. The key to the analysis is a so-called stratification argument inspired by Cesa-Bianchi and Gentile [2008].
For every $j \in \llbracket J \rrbracket$ and $t \ge 1$, recall the definitions (15) and (16) of $\mathrm{var}_{j,t}$ and $\widetilde{\mathrm{var}}_{j,t}$. On the one hand, by A4 and because the functions of a real variable $u \mapsto u^2$ and $u \mapsto u^\beta$ are respectively convex and concave, (23) holds almost surely. Moreover, it also holds that $\widetilde{\mathrm{var}}_{j,t} \le v^2$ almost surely by Theorem 3. The previous upper bound and (23) play a key role in the first version of Step 2 (Step 2 (v1)) presented below.
On the other hand, by A2, A4, and because the function $u \mapsto u^\beta$ is concave, a corresponding almost-sure bound holds for all $\tau \in \llbracket t \rrbracket$. Consequently, if $\tilde{H}_{j,t} \le B$ (an inequality that holds almost surely when $B = b_1$, by A3), then a further bound also holds almost surely; it plays a key role in the second version of Step 2 (Step 2 (v2)) presented below.
Step 2 (v1). Set $v_{(-1)}^2 := 0$ and, for all $i \in \llbracket N-1 \rrbracket$, $v_{(i)}^2 := 2^{i+1-N} \times v^2$. In view of (23), of the almost-sure boundedness granted by A3, and of the Fan-Grama-Liu concentration inequality for martingales [Theorem 3.10 in Bercu et al., 2015], (25) implies
\[
P[A_{j,t}(a) \ge x] \le
\]
• Moreover, if $x \ge x_i$, then $v_{(i)}^2 \le \left(x + a(v_{(i-1)}^2/\gamma)^{1/\beta}\right) \times b_2/(3(1+a))$, hence
Therefore, in light of (26), (28), (29) and the definitions of $C_1(a)$, $C_2(a)$ given in the theorem, for all $x \ge x(a,N)$, it holds that
By A3 and A4, Theorem 3 applies and (31) yields
\[
P[A_{j,t}(a) \ge x] \le
\]
Set arbitrarily $i \in \llbracket N_0-1 \rrbracket \cup \{0\}$ and define $x'_i := 32e(1+a)\gamma(tb_1^{(i)})^\beta/(15b_2) - ab_1^{(i-1)}$.
Step 3: end of the proof. In view of (22), the announced bound follows from a union bound, (36) and the above inequality. This completes the proof of Theorem 1.
B.2 Proof of Corollary 2
Corollary 2 follows from the straightforward application, twice, of the next technical lemma, based on (7) on the one hand and on (8) on the other hand.
Lemma 4. Let $a, b, c > 0$ and $\beta \in \,]0,1]$ be some constants and let $(x(N))_{N \ge 2}$ be a sequence of positive numbers that decreases to 0. Let $U$ be a real-valued random variable such that $E[|U|] < \infty$ and, for all integers $N \ge 2$ and all $x \ge x(N) > 0$,
\[
P[U \ge x] \le aN\exp(-x^{2-\beta}/b) + aN\exp(-x/c).
\]
Then, for all $N \ge \min\{n \ge 2 : x(n) \le b^{1/(2-\beta)}, \log(an) \ge 1\}$,
\[
E[U] \le 3(b\log(aN))^{1/(2-\beta)} + 2c\log(aN).
\]

Proof of Lemma 4. It is well known that
\begin{equation}
E[U] \le E[U\mathbf{1}\{U \ge 0\}] = \int_0^\infty P[U \ge x]\,dx \le x(N) + \int_{x(N)}^\infty \min\left(1, aNe^{-x^{2-\beta}/b}\right)dx + \int_{x(N)}^\infty \min\left(1, aNe^{-x/c}\right)dx. \tag{39}
\end{equation}
Let us denote by $L$ and $R$ the above left-hand side and right-hand side integrals. Choose $N \ge \min\{n \ge 2 : x(n) \le b^{1/(2-\beta)}, \log(an) \ge 1\}$.
Bounding $L$. Let $x_L$ be chosen so that $aN\exp(-x_L^{2-\beta}/b) = 1$, i.e., $x_L := (b\log(aN))^{1/(2-\beta)}$. Now, thanks to the change of variable $u = x^{2-\beta}/b$ and because $u \mapsto u^{1/(2-\beta)-1}$ is nonincreasing,
\begin{align}
L &\le x_L + aN\int_{x_L}^\infty \exp(-x^{2-\beta}/b)\,dx \nonumber\\
&\le x_L + aNb^{1/(2-\beta)}\int_{\log(aN)}^\infty \exp(-u)\,u^{1/(2-\beta)-1}\,du \nonumber\\
&\le x_L + \frac{aN(b\log(aN))^{1/(2-\beta)}}{\log(aN)}\int_{\log(aN)}^\infty \exp(-u)\,du \nonumber\\
&= x_L\left(1 + 1/\log(aN)\right) \le 2(b\log(aN))^{1/(2-\beta)}. \tag{40}
\end{align}
Bounding $R$. Let $x_R$ be chosen so that $aN\exp(-x_R/c) = 1$, i.e., $x_R := c\log(aN)$. It is readily seen that
\begin{equation}
R \le x_R + aN\int_{x_R}^\infty \exp(-x/c)\,dx = x_R + acN\exp(-x_R/c) = c(1 + \log(aN)). \tag{41}
\end{equation}
In view of (39), (40), (41), and by choice of $N$, we obtain
\begin{align*}
E[U] &\le b^{1/(2-\beta)} + 2(b\log(aN))^{1/(2-\beta)} + c(1 + \log(aN)) \\
&\le 3(b\log(aN))^{1/(2-\beta)} + 2c\log(aN).
\end{align*}
This completes the proof.
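The bounds (40) and (41) can be checked numerically. In the sketch below the values of $a$, $b$, $c$, $\beta$ and $N$ are arbitrary choices satisfying $\log(aN) \ge 1$, and the tail integral appearing in (40) is approximated by a Riemann sum.

```python
import math

# Numerical sanity check of the bounds (40) and (41). The parameter values
# are arbitrrary illustrative choices, not quantities from the paper.
a, b, c, beta, N = 2.0, 1.5, 0.7, 0.5, 50
assert math.log(a * N) >= 1  # condition imposed by the choice of N

# (41): the exponential tail integral has a closed form.
x_R = c * math.log(a * N)
R = x_R + a * N * c * math.exp(-x_R / c)
assert abs(R - c * (1 + math.log(a * N))) < 1e-12

# (40): approximate x_L + aN * integral_{x_L}^inf exp(-x^{2-beta}/b) dx by a
# left Riemann sum (an overestimate, since the integrand is decreasing) and
# compare with 2 * (b log(aN))^{1/(2-beta)}.
x_L = (b * math.log(a * N)) ** (1 / (2 - beta))
dx = 1e-3
grid = [x_L + k * dx for k in range(int(60 / dx))]  # truncate the tail
integral = sum(math.exp(-x ** (2 - beta) / b) * dx for x in grid)
L = x_L + a * N * integral
assert L <= 2 * (b * math.log(a * N)) ** (1 / (2 - beta))  # bound (40)
```

The left Riemann sum overestimates the decreasing integrand, so the final assertion is a conservative check of (40).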
Set $t \ge 1$ and $a \in \,]0,1]$. In view of (7), Lemma 4 yields (9) under the sufficient condition that $N \ge 2$ also satisfy (10). Moreover, in view of (8), Lemma 4 also yields (11) under the sufficient condition that $N \ge 2$ also satisfy (12). This completes the proof of the corollary.
B.3 Proof of Theorem 3
The proof of Theorem 3 hinges on a Bernstein-like concentration inequality for sums of partly dependent random variables shown by Janson [2004, Theorem 2.3]. Janson emphasizes that his
theorem uses the independence of suitable (large) subsets of (ζα)α∈A, not any other information on the dependencies, so that the result must be suboptimal when the dependencies that exist are weak. We recall the theorem for completeness.
Theorem 5 (Janson [2004]). Let $(\zeta_\alpha)_{\alpha \in \mathcal{A}}$ be a collection of random variables with dependency graph $G$ such that $\zeta_\alpha - E[\zeta_\alpha] \le B$ for some $B > 0$ and all $\alpha \in \mathcal{A}$. Define $Z := |\mathcal{A}|^{-1}\sum_{\alpha \in \mathcal{A}} \zeta_\alpha$.
Note that (18) from Theorem 3 can also be written as a deviation bound of this form.
Following the line of proof of the Rosenthal inequality by Petrov [1995, page 59] (see also the proof of Theorem 5.2 in [Baraud, 2000]), we use (42) to control $E[|Z - E[Z]|^p]$, hence $E[|\hat{H}_{j,t} - \tilde{H}_{j,t}|^p]$ (by convexity), for all $p \ge 2$. The bound (18) follows as in [Dedecker, 2001, proof of Corollary 3(b)], a method inspired by the proof of Theorem 6 in [Doukhan et al., 1984].
We first prove this corollary of Theorem 5. The constants are in no way optimal.
Corollary 6. In the context of Theorem 5, for all $p \ge 2$,
\[
E[|Z - E[Z]|^p] \le 3\pi \cdots \tag{43}
\]
Proof of Corollary 6. We will use the identity $E[U] = \int_0^\infty P[U \ge x]\,dx$, valid for any nonnegative random variable $U$. Let $r > 0$ be a constant that we will carefully choose later on. Set arbitrarily $s \ge 0$, define $m := s/r$, and introduce
\[
\tilde{Z}_m := |\mathcal{A}|^{-1}\sum_{\alpha \in \mathcal{A}} (\zeta_\alpha - E[\zeta_\alpha])\mathbf{1}\{|\zeta_\alpha - E[\zeta_\alpha]| < m\}.
\]
By (44), the rightmost term in (44), say $I_p$, satisfies
\[
I_p \le
\]
Because $h(u) \ge \frac{u}{2}\log(1+u)$ for all $u \ge 0$, (45) yields (46); the bound $\tilde{V} \le V$ thus implies that the rightmost term in (46) is smaller than
\[
2p \cdots
\]
We now choose $r := 5(p+1)\deg(G)/|\mathcal{A}|$ to guarantee the convergence of the above integral, to $\pi/2$, and conclude that (44) and (46) imply
\begin{align}
E[|Z - E[Z]|^p] &\le r^p E\left[\max_{\alpha \in \mathcal{A}} |\zeta_\alpha - E[\zeta_\alpha]|^p\right] + \pi(rV)^{p/2}\left(2^{p/2} + p(5/4)^{p/2}\right) \nonumber\\
&\le (rB)^p + \pi(p+1)(2rV)^{p/2}. \tag{47}
\end{align}
Finally, since $(p+1)/p \le 3/2$ and $p^{2/p} \le e^{2/e} \approx 2.09$, we can simplify (47) to (43), thus completing the proof of Corollary 6.
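The two elementary facts invoked in this last step can be verified numerically; the grid of values of $p$ below is an arbitrary illustrative choice.

```python
import math

# Sanity check of the two elementary facts used to simplify (47):
# (i)  p**(2/p) <= e**(2/e) for all p > 0 (the map x -> x**(2/x) is
#      maximized at x = e, where it equals e**(2/e) ~ 2.09);
# (ii) 2**(p/2) + p * (5/4)**(p/2) <= (p + 1) * 2**(p/2) for p >= 2,
#      the step producing the factor (p + 1) in (47).
for k in range(2, 401):
    p = k / 2  # p ranges over 1.0, 1.5, ..., 200.0
    assert p ** (2 / p) <= math.e ** (2 / math.e) + 1e-12
    if p >= 2:
        assert 2 ** (p / 2) + p * (5 / 4) ** (p / 2) <= (p + 1) * 2 ** (p / 2)
```

Fact (ii) holds simply because $(5/4)^{p/2} \le 2^{p/2}$.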
Fix arbitrarily $j \in \llbracket J \rrbracket$, $t \ge 1$, $V > 0$, and $p \ge 2$. To save space, introduce, for each $\tau \in \llbracket t \rrbracket$,
\[
Z_{j,\tau} := \Delta^\circ\bar{\ell}(\theta_{j,\tau-1})(\bar{O}_\tau, \bar{Z}_\tau) - E\left[\Delta^\circ\bar{\ell}(\theta_{j,\tau-1})(\bar{O}_\tau, \bar{Z}_\tau) \,\middle|\, \bar{Z}_\tau, \mathcal{F}_{\tau-1}\right].
\]
In view of A1, A2, A3 and A5, Corollary 6 applies and guarantees that, almost surely, for all $\tau \in \llbracket t \rrbracket$,
It is now easy to show that $\widetilde{\mathrm{var}}_{j,t} \le v^2$ almost surely (see (6) for the definition of $v^2$). By A5,
Therefore Markov's inequality implies that, for all $x > 0$,
By the technical Lemma 7, there exists $p_x > 0$ such that
\[
x|\mathcal{A}| = 15eb_2\deg(G)p_x/2 + \cdots
\]
Otherwise, $P[|\hat{H}_{j,t} - \tilde{H}_{j,t}| \ge x, \tilde{F}_V] \le \exp(2 - q_x)$ holds trivially. This completes the proof of Theorem 3.
Lemma 7. For any $a, b, c > 0$, there exists $p > 0$ such that $c = b\sqrt{p} + ap$. Moreover, $c^2 \le (b^2 + 2ac)p$.

Proof of Lemma 7. The quadratic equation $c = bX + aX^2$ has a positive solution, so there does exist $p > 0$ such that $c = b\sqrt{p} + ap$. Moreover, $c^2/p = b^2 + 2ab\sqrt{p} + a^2p$ on the one hand and $2ac = 2ab\sqrt{p} + 2a^2p \ge 2ab\sqrt{p} + a^2p$ on the other hand, implying that $c^2/p \le b^2 + 2ac$. This completes the proof.
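Lemma 7 is easy to check numerically: with $X = \sqrt{p}$, the equation $c = bX + aX^2$ has the positive root $X = (-b + \sqrt{b^2 + 4ac})/(2a)$. The sketch below (the random parameter ranges are arbitrary) verifies both claims of the lemma.

```python
import math
import random

# Numerical check of Lemma 7: solve c = b*sqrt(p) + a*p for p (a quadratic
# in X = sqrt(p)) and verify that c**2 <= (b**2 + 2*a*c) * p.
def solve_p(a, b, c):
    # positive root of a*X**2 + b*X - c = 0, with X = sqrt(p)
    x = (-b + math.sqrt(b ** 2 + 4 * a * c)) / (2 * a)
    return x ** 2

random.seed(2)
for _ in range(10_000):
    a, b, c = (random.uniform(0.01, 10) for _ in range(3))
    p = solve_p(a, b, c)
    assert abs(b * math.sqrt(p) + a * p - c) < 1e-8 * max(1, c)  # c = b*sqrt(p) + a*p
    assert c ** 2 <= (b ** 2 + 2 * a * c) * p * (1 + 1e-12)      # c^2 <= (b^2 + 2ac) p
```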