Orlicz norms and concentration inequalities for β-heavy tailed random variables


HAL Id: hal-03175697

https://hal.archives-ouvertes.fr/hal-03175697v3

Preprint submitted on 30 Jun 2021



Linda Chamakh, Emmanuel Gobet, Wenjun Liu

To cite this version:

Linda Chamakh, Emmanuel Gobet, Wenjun Liu. Orlicz norms and concentration inequalities for β-heavy tailed random variables. 2021. ⟨hal-03175697v3⟩


LINDA CHAMAKH¹,² and EMMANUEL GOBET¹ and WENJUN LIU¹

¹CMAP, CNRS, École Polytechnique, Institut Polytechnique de Paris, 91128 Palaiseau, France. E-mail: [email protected]; [email protected]

²Global Markets Quantitative Research – BNP Paribas, France. E-mail: [email protected]

We establish a new concentration-of-measure inequality for the sum of independent random variables with β-heavy tails. This includes exponentials of Gaussian distributions (a.k.a. log-normal distributions) and exponentials of Weibull distributions, among others. These distributions have finite polynomial moments of any order but may not have finite α-exponential moments. We exhibit an Orlicz norm adapted to this setting of β-heavy tails, and we prove a new Talagrand inequality for the sum and a new maximal inequality. As a consequence, we obtain a bound on the deviation probability of the sum from its mean, as well as a bound on the uniform deviation probability.

MSC2020 subject classifications: Primary 60E15; secondary 60F10

Keywords: heavy tails; deviation inequality; Orlicz norm; Talagrand inequality; maximal inequality; empirical process

1. Introduction

1.1. Concentration inequalities

Understanding how sample statistical fluctuations impact prediction errors is crucial in learning algorithms. Typically, we are interested in bounding the probability that a sum of random variables exceeds a certain threshold, essentially in quantifying the deviation of the sum from its expectation. In other words, we aim at analyzing how fast the sum concentrates around its expectation.

Write $[M]$ for the set of integers from 1 to $M$ inclusive. For independent and centered random variables $(Y_m)_{m\in[M]}$ taking values in a Banach space $(B,\|\cdot\|_B)$, the quantity of interest takes the form

$$\mathbb{P}\Big(\Big\|\sum_{m\in[M]}Y_m\Big\|_B>\varepsilon\Big)\le f(\varepsilon,M)$$

for the most explicit and tightest possible function $f$. The bounded, sub-Gaussian and sub-exponential cases have been largely covered by the literature (for example, via Bennett and Bernstein inequalities; see [BLM13] for an extensive review of the main concentration inequality techniques), as well as the case of α-exponential tails [CGS20] (random variables $Y$ such that there exist $\alpha>0$, $c>0$ with $\mathbb{E}\big[\exp\big(\|Y\|_B^{\alpha}/c\big)\big]<\infty$). The fat-tailed case, for which the moment generating function does not exist but some polynomial moments exist, can be tackled for example via Burkholder or Fuk-Nagaev type inequalities [Rio17, Mar17]. These inequalities are based on the existence and on the bounding of polynomial moments of the random variables. In this work, we focus on the heavy-tailed case, in the limiting situation where no α-exponential moment is finite but every polynomial moment exists.

1.2. Orlicz norm

The Orlicz norm [KR61] provides a convenient tool to study the statistical fluctuations of an estimator for a given family of distributions. Consider an Orlicz function $\Psi:\mathbb{R}^+\to\mathbb{R}^+$, that is a continuous non-decreasing function, vanishing at zero and with $\lim_{x\to+\infty}\Psi(x)=+\infty$, and define the Ψ-Orlicz norm of the $B$-valued random variable $Y$ by

$$\|Y\|_\Psi:=\inf\Big\{c>0:\ \mathbb{E}\Big[\Psi\Big(\frac{\|Y\|_B}{c}\Big)\Big]\le 1\Big\}. \qquad (1.1)$$

With the additional property that $\Psi$ is convex, Orlicz functions are commonly referred to as "Young functions" (or "N-functions" as in [KR61]). Van de Geer and Lederer [vdGL13] exhibit in their work a "Bernstein-Orlicz" norm (the "(L)-Bernstein-Orlicz" norm) adapted to sub-Gaussian and sub-exponential tails and provide deviation inequalities for suprema of functions of random variables [vdGL13, Theorem 8]. The (L)-Bernstein-Orlicz norm is the $\Psi_L$-Orlicz norm with

$$\Psi_L(z)=\exp\Big[\Big(\frac{\sqrt{1+2Lz}-1}{L}\Big)^2\Big]-1.$$

Clearly $\|Y\|_{\Psi_L}<\infty$ implies the existence of an exponential moment. As shown in Wellner [Wel17], it is possible to generalize these results to any Orlicz function $\Psi(x)=e^{h(x)}-1$ with $h$ convex. This again requires the existence of an exponential moment, which is not our framework. We would like to go beyond this and not assume any α-exponential moment.

As a new Orlicz function able to handle heavy-tail situations, we will consider

$$\Psi^{HT}_\beta(x):=\exp\big((\ln(x+1))^\beta\big)-1,\qquad x\ge 0, \qquad (1.2)$$

for a parameter $\beta>1$. We say that $Y$ is β-heavy tailed if there exists a $c>0$ such that

$$\mathbb{E}\Big[\Psi^{HT}_\beta\Big(\frac{\|Y\|_B}{c}\Big)\Big]<\infty.$$

Typically, we aim at encompassing situations like $Y=\exp(|G|^{2/\beta})$ where $G$ is a Gaussian random variable; the case $\beta=2$ corresponds to log-normal tails. See Section 2.2 for various examples.

Observe that when (1.1) is finite with $\Psi=\Psi^{HT}_\beta$, $Y$ has a finite polynomial moment of order $p$ for any $p>0$, but may not have α-exponential moments. Besides, our β-heavy tailed setting is closely related to long-tail modelling¹, which is used for instance in queuing applications [Asm03, Chapter 10].
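To make definitions (1.1) and (1.2) concrete, here is a minimal Python sketch that evaluates $\Psi^{HT}_\beta$ and approximates the Orlicz norm of an empirical sample by bisection. The function names `psi_ht` and `empirical_orlicz_norm` and the tolerance are illustrative assumptions, not part of the paper; the expectation in (1.1) is replaced by a sample average, so the output is only an empirical proxy for the true norm.

```python
import math

def psi_ht(x, beta):
    """Psi^HT_beta(x) = exp((ln(1 + x))**beta) - 1 for x >= 0, as in (1.2)."""
    try:
        return math.expm1(math.log1p(x) ** beta)
    except OverflowError:
        return math.inf  # treat overflow as "larger than any threshold"

def empirical_orlicz_norm(sample, beta, tol=1e-9):
    """Bisection for inf{c > 0 : mean(Psi(|y| / c)) <= 1} over a finite sample."""
    def mean_psi(c):
        return sum(psi_ht(abs(y) / c, beta) for y in sample) / len(sample)

    lo, hi = 0.0, 1.0
    while mean_psi(hi) > 1.0:       # enlarge hi until the constraint holds
        hi *= 2.0
    while hi - lo > tol * hi:       # c -> mean_psi(c) is non-increasing
        mid = 0.5 * (lo + hi)
        if mean_psi(mid) <= 1.0:
            hi = mid
        else:
            lo = mid
    return hi

# Usage: a constant sample equal to 1 has norm 1 / Psi^HT_{1/beta}(1).
norm_const = empirical_orlicz_norm([1.0] * 5, beta=2.0)
```

For a constant sample the constraint reduces to $\Psi^{HT}_\beta(1/c)\le 1$, which gives a closed form to compare against.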

1.3. Deviation inequalities for sum via Talagrand and Markov inequalities

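What we call a Talagrand inequality is an inequality of the type

$$\Big\|\sum_{m\in[M]}Y_m\Big\|_\Psi\le C_\Psi\Big(\Big\|\sum_{m\in[M]}Y_m\Big\|_{L_1(B)}+\Big\|\max_{m\in[M]}\|Y_m\|_B\Big\|_\Psi\Big). \qquad (1.3)$$

¹ Typically $S(x):=\mathbb{P}(\|Y\|_B>x)=\exp(-(\ln(1+x))^\beta)$, for which $\lim_{x\to+\infty}S(x+t)/S(x)=1$ for any $t>0$.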

Talagrand [Tal89, Theorem 3] showed that this inequality is satisfied with $\Psi_\alpha(x):=e^{x^\alpha}-1$ ($\alpha\in(0,1]$). For the sake of presentation, let us consider i.i.d. $(Y_m)_{m\in[M]}$. The first term then satisfies $\big\|\sum_{m\in[M]}Y_m\big\|_{L_1(B)}\le O(\sqrt{M})$ by the Burkholder inequality when $B\subseteq\mathbb{R}$, or by the more general inequality of [Pis16, Proposition 4.35] when $B$ is a Hilbert space or a Banach space of type 2.

When the maximal inequality is satisfied, that is in the form of [vdVW96, Lemma 2.2.2], the second term is bounded by

$$\Big\|\max_{m\in[M]}\|Y_m\|_B\Big\|_\Psi\le K_\Psi\,\Psi^{-1}(M)\max_{m\in[M]}\|Y_m\|_\Psi.$$

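Hence, for any $\varepsilon>0$, denoting $X:=\frac{1}{M}\big\|\sum_{m\in[M]}Y_m\big\|_B$, thanks to the Markov inequality (in the form of the deviation inequality (iii) of Section 2.1), the Talagrand inequality and the two previous norm controls, we get

$$\mathbb{P}(X\ge\varepsilon)\le\frac{2}{1+\Psi(\varepsilon/\|X\|_\Psi)}=2\Big(1+\Psi\Big(\frac{\varepsilon M}{\big\|\sum_{m\in[M]}Y_m\big\|_\Psi}\Big)\Big)^{-1} \qquad (1.4)$$

$$\le 2\Big(1+\Psi\Big(\frac{\varepsilon M}{C'_\Psi\,(\Psi^{-1}(M)+\sqrt{M})}\Big)\Big)^{-1}. \qquad (1.5)$$

In particular, for $\Psi=\Psi_\alpha$, the above inequality simplifies to

$$\mathbb{P}(X\ge\varepsilon)\le 2\exp\big(-C'_\alpha\,(\varepsilon\sqrt{M})^\alpha\big).$$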

It is then possible to extend this type of inequality to suprema of functions as done in [CGS20], in the spirit of [Ada08]. In any case, a key element to derive these concentration inequalities is the Talagrand inequality (1.3).

1.4. Our contribution

The purpose of this work is mainly to establish the Talagrand inequality for $\Psi=\Psi^{HT}_\beta$, in order to tackle β-heavy tailed random variables (in contrast with previous contributions available in the literature), and to derive some ready-to-use consequences. Note that this particular Orlicz function (1.2) is not covered by the general result established by Talagrand [Tal89, Proposition 12], which states that the inequality (1.3) holds for Orlicz functions of the form $\Psi(x):=e^{x\zeta(x)}$ with $\zeta$ non-decreasing for $x$ large enough and satisfying $\limsup_{u\to+\infty}\frac{\zeta(e^u)}{\zeta(u)}<+\infty$; indeed, in our setting, one easily checks that $\zeta(x)=x^{-1}\ln(\Psi^{HT}_\beta(x))=x^{-1}\ln\big(\exp((\ln(x+1))^\beta)-1\big)$ is decreasing for $x$ large.

1.5. Outline

In Section 2, we recall the motivating examples and define the adapted Orlicz function. Then we state our main results: Talagrand inequality (Theorem 2.1), maximal inequality (Theorem 2.2), pointwise and uniform deviation estimates (Corollary 2.3 and Theorem 2.4). Derivations of finite-sample confidence intervals and of excess risk bounds [Kol11] in regression are given, as direct applications of our results.

Section 3 is devoted to the proofs. In all these results, some universal constants appear: we do not investigate the question of having the best possible constants.

2. Motivating examples and main results

2.1. Orlicz norm properties

Although $\|\cdot\|_\Psi$ defined in (1.1) may not satisfy the triangle inequality in general, we keep calling it an Orlicz norm for the sake of simplicity. For a given Banach space $(B,\|\cdot\|_B)$ over the field $\mathbb{R}$, we denote by $L_\Psi(B):=\{Y:\Omega\to B\ \text{s.t.}\ \|Y\|_\Psi<+\infty\}$ the set of $B$-valued random variables with finite Ψ-Orlicz norm. For self-containedness we summarize a few well-known properties of the $\|\cdot\|_\Psi$ norm, for a given Orlicz function $\Psi$, which hold independently of the convexity of $\Psi$ (unless explicitly required).

See [KR61], or more recently [CGS20, Section 4].

(i) Normalization: If $Y\in L_\Psi(B)$ then $\mathbb{E}\big[\Psi\big(\frac{\|Y\|_B}{\|Y\|_\Psi}\big)\big]\le 1$.

(ii) Homogeneity: If $Y\in L_\Psi(B)$ and $c\in\mathbb{R}$ then $cY\in L_\Psi(B)$ and $\|cY\|_\Psi=|c|\,\|Y\|_\Psi$.

(iii) Deviation inequality: If $Y\in L_\Psi(B)$ then $\mathbb{P}(\|Y\|_B\ge c)\le\frac{2}{\Psi(c/\|Y\|_\Psi)+1}$ for any $c\ge 0$.

(iv) If $\Psi$ is convex, $\|\cdot\|_\Psi$ satisfies the triangle inequality.

2.2. Motivating examples of heavy-tailed distributions and adapted Orlicz norm

2.2.1. Log-normal distribution

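Let $Y$ be a scalar random variable with log-normal distribution, i.e. $\ln(Y)\overset{d}{=}\mathcal{N}(\mu,\sigma^2)$, with $\sigma>0$. The distribution of $Y$ admits the density

$$f_Y(y;\mu,\sigma):=\frac{1}{\sigma\sqrt{2\pi}\,y}\,e^{-\frac{(\ln y-\mu)^2}{2\sigma^2}}\,\mathbf{1}_{y>0}.$$

Let us investigate what kind of Orlicz function $\Psi$ can be used to have $\|Y\|_\Psi<\infty$. In particular, we search for $\Psi(x)=\exp(\xi(x))-1$ such that $\xi$ is non-decreasing, $\xi(0)=0$ and $\lim_{x\to+\infty}\xi(x)=+\infty$, in order to ensure that $\Psi(0)=0$ and $\lim_{x\to+\infty}\Psi(x)=+\infty$. Let $c>0$ and observe that

$$\mathbb{E}\Big[\exp\Big(\xi\Big(\frac{|Y|}{c}\Big)\Big)\Big]<\infty\ \Longrightarrow\ \liminf_{x\to\infty}\Big(\xi\Big(\frac{x}{c}\Big)-\frac{(\ln x)^2}{2\sigma^2}\Big)=-\infty. \qquad (2.1)$$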

Consider the following functions for $\beta>0$:

1. $\xi_\beta(x)=(\ln(x+1))^\beta$, $x\ge 0$. Note that the case $\beta\le 1$ is of little interest in our setting since it quantifies tails with finite expectation at most (fat tail cases).

2. $\xi_\beta(x)=(\ln(x+1))^\beta(\ln(\ln(x+1)+1))^\alpha$, $x\ge 0$, $\alpha\in\mathbb{R}$. This second case is a scale refinement of the first case. It is not studied here.

These functions satisfy the necessary condition (2.1) if $\beta<2$ and for a large $c$². Furthermore, since for any $c>0$ and any $\varepsilon>0$ we have $\xi_\beta\big(\frac{x}{c}\big)<\varepsilon(\ln x-\mu)^2$ for $x$ large enough, it follows that $\mathbb{E}\big[\exp\big(\xi_\beta\big(\frac{|Y|}{c}\big)\big)\big]<+\infty$.

² $\beta=2$ is possible under a restriction on $\sigma$: if $\sigma<\frac{1}{\sqrt{2}}$, then $\liminf_{x\to\infty}\big(\xi_2(x)-\frac{(\ln x)^2}{2\sigma^2}\big)=-\infty$.

2.2.2. Other distributions satisfying $\mathbb{E}\big[\exp\big(\xi_\beta\big(\frac{\|Y\|_B}{c}\big)\big)\big]<+\infty$

The associated Orlicz function $\Psi^{HT}_\beta(x):=\exp(\xi_\beta(x))-1$ is adapted to distributions other than just the log-normal distribution. For any random variable $X$ admitting a finite α-exponential moment with $\alpha>1$, the variable $Y$ defined by $\ln(Y)=X$ is β-heavy tailed for any $1<\beta<\alpha$. We refer the reader to [CGS20, Table 2] for an exhaustive list of distributions admitting α-exponential moments. Here are a few examples:

• The Generalized normal distribution with parameters $c\in\mathbb{R}$, $b>0$, $\alpha>0$ has density $f(x)=c_f\,e^{-\frac{1}{2}\left(\frac{|x-c|}{b}\right)^{\alpha}}$ up to a positive normalization constant $c_f$: it clearly admits a finite α-exponential moment. Hence $Y=\exp(X)$, where $X$ has density $f$, admits β-heavy tails for $\beta<\alpha$.

• The Skew normal distribution with parameters $b\in\mathbb{R}$, $c\in\mathbb{R}$, $v>0$ has density $f(x)=c_f\,e^{-\frac{(x-c)^2}{2v}}\,\Phi\big(\frac{b(x-c)}{\sqrt{v}}\big)$, where $\Phi$ denotes the standard Gaussian cumulative distribution function and $c_f$ is a positive normalization constant: it admits a 2-exponential moment. If $X$ has this density, then $Y=\exp(X)$ has β-heavy tails for $\beta<2$.

• The Weibull distribution with parameters $\lambda>0$, $k>0$ has density $f(x)=c_f\,x^{k-1}e^{-(x/\lambda)^k}\mathbf{1}_{x\ge 0}$ up to a positive normalization constant $c_f$: it has a finite $k$-exponential moment. Consequently, $Y=\exp(X)$, where $X$ has the density above, admits β-heavy tails for $\beta<k$. Such distributions are used, for instance, for earthquake magnitude modelling [HR99].

2.3. $\Psi^{HT}_\beta$-Orlicz norm: properties and inequalities

We state different properties of the Orlicz function to be used for β-heavy tailed distributions. The proof is postponed to Section 3.4.

Proposition 2.1. For $\beta>0$ define $\Psi^{HT}_\beta:\mathbb{R}^+\to\mathbb{R}^+$ by

$$\Psi^{HT}_\beta(x):=\exp(\xi_\beta(x))-1\quad\text{with}\quad\xi_\beta(x):=(\ln(1+x))^\beta,\qquad x\ge 0. \qquad (2.2)$$

The following properties hold:

1. The map $\beta\mapsto\Psi^{HT}_\beta$ defines a group isomorphism between $((0,+\infty),\times)$ and $((\Psi^{HT}_\beta:\beta>0),\circ)$, and in particular, $(\Psi^{HT}_\beta)^{-1}=\Psi^{HT}_{1/\beta}$.

2. For $\beta>0$, $\Psi^{HT}_\beta$ is an Orlicz function.

3. For $\beta>1$, $\Psi^{HT}_\beta$ is convex.

4. For $\beta>1$ (resp. $\beta<1$), the limit as $x\to+\infty$ of $\Psi^{HT}_\beta(x)/x^k$ equals $+\infty$ (resp. 0), for any $k>0$.

As a consequence, the associated $\Psi^{HT}_\beta$-Orlicz norm satisfies the triangle inequality for $\beta>1$.
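Item 1 of Proposition 2.1 can be checked numerically. The short Python sketch below (the helper name `psi_ht` and the test point are illustrative choices, not from the paper) verifies at one point that composing $\Psi^{HT}_a$ with $\Psi^{HT}_b$ gives $\Psi^{HT}_{ab}$, and that $\Psi^{HT}_{1/a}$ inverts $\Psi^{HT}_a$.

```python
import math

def psi_ht(x, beta):
    """Psi^HT_beta(x) = exp((ln(1 + x))**beta) - 1, see (2.2)."""
    return math.expm1(math.log1p(x) ** beta)

x = 3.7
a, b = 1.5, 2.0

# Composition corresponds to multiplying the parameters.
composed = psi_ht(psi_ht(x, b), a)
direct = psi_ht(x, a * b)

# The inverse of Psi^HT_a is Psi^HT_{1/a}.
roundtrip = psi_ht(psi_ht(x, a), 1.0 / a)
```

The identity follows from $\ln(1+\Psi^{HT}_b(x))=(\ln(1+x))^b$, which is why `log1p` and `expm1` are used as a matched pair here.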

Hereafter, we mostly restrict the results to the more interesting case $\beta>1$. Let us start with the Talagrand inequality (1.3) for the $\Psi^{HT}_\beta$-Orlicz norm.

Theorem 2.1 (Talagrand type inequality). Let $\beta\in(1,+\infty)$. Then there is a universal constant $K_{\beta,(2.3)}$ such that for every sequence of independent, mean-zero random variables $(Y_m)_{m\in[M]}$ with $Y_m\in L_{\Psi^{HT}_\beta}(B)$ for all $m\in[M]$, we have

$$\Big\|\sum_{m\in[M]}Y_m\Big\|_{\Psi^{HT}_\beta}\le K_{\beta,(2.3)}\Big(\Big\|\sum_{m\in[M]}Y_m\Big\|_{L_1(B)}+\Big\|\max_{m\in[M]}\|Y_m\|_B\Big\|_{\Psi^{HT}_\beta}\Big). \qquad (2.3)$$

We also establish that the general maximal inequality [vdVW96, Lemma 2.2.2] (recalled in Lemma 3.6) holds for the $\Psi^{HT}_\beta$ function:

Theorem 2.2 (A $\Psi^{HT}_\beta$ maximal inequality). Let $\beta\in(1,+\infty)$. Then there exists a universal constant $C_{\beta,(2.4)}$ such that for any random variables $Y_1,\dots,Y_M$ in $L_{\Psi^{HT}_\beta}(B)$,

$$\Big\|\max_{m\in[M]}\|Y_m\|_B\Big\|_{\Psi^{HT}_\beta}\le C_{\beta,(2.4)}\,(\Psi^{HT}_\beta)^{-1}(M)\max_{m\in[M]}\|Y_m\|_{\Psi^{HT}_\beta}. \qquad (2.4)$$

Recall that $(\Psi^{HT}_\beta)^{-1}(M)=\Psi^{HT}_{1/\beta}(M)$. As a consequence of the Talagrand inequality (2.3) and the maximal inequality (2.4), by following the same steps as described in (1.5), we can derive the following concentration inequality:

Corollary 2.3 (A concentration inequality for sums of independent β-heavy tailed random variables). Let $\beta\in(1,+\infty)$. Assume that $B$ is a Hilbert space or a Banach space of type 2. Then for any independent and centered random variables $Y_1,\dots,Y_M$ in $L_{\Psi^{HT}_\beta}(B)$, for any $\varepsilon>0$,

$$\mathbb{P}\Big(\frac{1}{M}\Big\|\sum_{m\in[M]}Y_m\Big\|_B\ge\varepsilon\Big)\le 2\exp\Big(-\Big[\ln\Big(1+\frac{\varepsilon M}{K_{\beta,(2.3)}\big(C(2)^{1/2}\mu_2\sqrt{M}+C_{\beta,(2.4)}\,\mu_{\Psi^{HT}_\beta}\Psi^{HT}_{1/\beta}(M)\big)}\Big)\Big]^\beta\Big), \qquad (2.5)$$

where $\mu_{\Psi^{HT}_\beta}:=\max_{m\in[M]}\|Y_m\|_{\Psi^{HT}_\beta}$ and $\mu_2:=\max_{m\in[M]}\|Y_m\|_{L_2(B)}$, $C(2)$ denotes the universal constant in the Pisier inequality [Pis16, Proposition 4.35], $K_{\beta,(2.3)}$ the Talagrand constant in (2.3) and $C_{\beta,(2.4)}$ the maximal inequality constant in (2.4).

Recall that $\Psi^{HT}_{1/\beta}(M)$ goes to infinity more slowly than $M^k$ (for $\beta>1$, $k>0$) (Proposition 2.1-(4)).

Thus, when $Y_1,\dots,Y_M$ are i.i.d. (implying that $\mu_{\Psi^{HT}_\beta}$ and $\mu_2$ do not depend on $M$), the above upper bound takes the simple form

$$2\exp\big(-\big[\ln(1+K\varepsilon\sqrt{M})\big]^\beta\big),$$

for some constant $K>0$ (depending on $\mu_{\Psi^{HT}_\beta}$ and $\mu_2$).
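As an illustration of how the bound (2.5) behaves with the sample size, the following Python sketch evaluates its right-hand side. The constants $K_{\beta,(2.3)}$, $C(2)$ and $C_{\beta,(2.4)}$ are not computed in the paper, so the default values of 1 used below are purely hypothetical placeholders; only the qualitative dependence on $M$ and $\varepsilon$ is meaningful.

```python
import math

def psi_ht(x, beta):
    """Psi^HT_beta(x) = exp((ln(1 + x))**beta) - 1."""
    return math.expm1(math.log1p(x) ** beta)

def sum_deviation_bound(eps, M, beta, mu2=1.0, mu_psi=1.0,
                        K_tal=1.0, C_pisier=1.0, C_max=1.0):
    """Right-hand side of (2.5); K_tal, C_pisier, C_max are placeholder constants."""
    denom = K_tal * (math.sqrt(C_pisier) * mu2 * math.sqrt(M)
                     + C_max * mu_psi * psi_ht(M, 1.0 / beta))
    return 2.0 * math.exp(-(math.log1p(eps * M / denom) ** beta))

# Usage: the bound tightens as M grows, for fixed eps and beta.
bound_small = sum_deviation_bound(eps=1.0, M=100, beta=2.0)
bound_large = sum_deviation_bound(eps=1.0, M=10_000, beta=2.0)
```

With these placeholder constants the term $\sqrt{M}$ dominates $\Psi^{HT}_{1/\beta}(M)$ for large $M$, consistent with the simplified form stated for the i.i.d. case.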


One may wonder about the sharpness of the bound (2.5) with respect to $M$ and $\varepsilon$; there is no reason for the constants in (2.5) to be optimal. For $M=1$, one retrieves a bound of the form $2\exp\big(-(\ln(1+K\varepsilon))^\beta\big)$, whose optimality (w.r.t. $\varepsilon$) is essentially equivalent (up to constants) to that of the Markov inequality combined with the Orlicz norm in the left-hand side inequality of (1.4): hence, possibilities of improvement are quite limited in general. For $M>1$, investigating non-sharpness (w.r.t. $\varepsilon$) would require, for instance, identifying the distribution of $\sum_{m\in[M]}Y_m$ in some cases and, in doing so, exhibiting a large gap between the left and right-hand sides of (2.5); unfortunately, to the best of our knowledge, there is no known situation where the distribution of the sum (under β-heavy tail conditions) has a tractable expression.

Besides, Corollary 2.3 can potentially be used to construct nonasymptotic confidence intervals for the mean of $f(X)$ using i.i.d. observations, under the assumption of β-heavy tails. For this, set $Y:=f(X)-\mathbb{E}[f(X)]$. In that case, renormalizing the deviation by $\sqrt{M}$ as for asymptotic confidence intervals based on the central limit theorem (CLT), Corollary 2.3 reads

$$\mathbb{P}\Big(\sqrt{M}\,\Big\|\frac{1}{M}\sum_{m\in[M]}f(X_m)-\mathbb{E}[f(X)]\Big\|_B\ge\varepsilon\Big)\le 2\exp\big(-(\ln(1+K\varepsilon))^\beta\big),$$

where $K$ depends explicitly on the constants of Corollary 2.3. Were these constants known, we would obtain tractable nonasymptotic confidence intervals. The rate is $\sqrt{M}$ as in the usual CLT, but the dependence on $\varepsilon$ in the tails is different because of the β-heavy tail setting. Alternatively, using CLT-based confidence intervals would reduce to asymptotic Gaussian bounds, whose lighter tails would result in narrower confidence intervals: the latter might be quite incorrect for finite samples, which highlights the interest of Corollary 2.3 for deriving nonasymptotic confidence intervals.

In addition, the pointwise estimate from Corollary 2.3 can be turned into a uniform deviation estimate. On the technical side, the strategy consists in splitting the deviation between truncated functions and their residuals. The residuals are handled using the Hoffmann-Jørgensen inequality [LT13, Proposition 6.8], following an initial idea from [Ada08] and the recent analysis of [CGS20]. The "truncated part" can be handled using Klein-Rio concentration bounds, together with the Dudley entropy integral bounds. The latter are related to the complexity of the space of functions and their covering numbers, which we choose to describe through the Vapnik-Chervonenkis (VC) dimension (see [GKKW02, Theorem 9.4]). For alternative descriptions, see [vdG00, Sections 2.3 and 2.4] and [NP07]; adapting the following result to these other complexity descriptions is fairly direct and left to the reader.

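Theorem 2.4 (A uniform concentration inequality for β-heavy tailed random variables). Let $\beta\in(1,+\infty)$. Let $(X_1,\dots,X_M)$ be independent random variables taking values in $\mathbb{R}^d$ and let $\mathcal{F}$ be a countably-generated class of functions $f:\mathbb{R}^d\to\mathbb{R}$ with envelope $F(x):=\sup_{f\in\mathcal{F}}|f(x)|$, such that $F(X_m)\in L_{\Psi^{HT}_\beta}(\mathbb{R})$ for any $m\in[M]$. Set

$$\mu_{\Psi^{HT}_\beta}:=\max_{m\in[M],\,f\in\mathcal{F}}\|f(X_m)\|_{\Psi^{HT}_\beta},\qquad \bar\mu_{\Psi^{HT}_\beta}:=\max_{m\in[M]}\|F(X_m)\|_{\Psi^{HT}_\beta},\qquad \mu_2:=\max_{m\in[M],\,f\in\mathcal{F}}\|f(X_m)\|_{L_2}. \qquad (2.6)$$

Assume that the Vapnik-Chervonenkis dimension $V_{\mathcal{F}^+}$ of $\mathcal{F}^+:=\{\{(x,t)\in\mathbb{R}^d\times\mathbb{R},\ t\le f(x)\};\ f\in\mathcal{F}\}$ is finite. Then, there exist two universal constants $K_1,K_2$ (depending only on $\beta$) such that for any $\varepsilon>0$ satisfying the constraint

$$\varepsilon\ge K_1\,c\,\sqrt{\frac{V_{\mathcal{F}^+}}{M}} \qquad (2.7)$$

with

$$c:=\Big(K_1\,\Psi^{HT}_{1/\beta}(M)\,\bar\mu_{\Psi^{HT}_\beta}\Big)\vee\Big(\mu_{\Psi^{HT}_\beta}\Big[\exp\Big(\big(2\ln_+\big(K_1\mu_{\Psi^{HT}_\beta}/\varepsilon\big)\big)^{1/\beta}\Big)-1\Big]\Big), \qquad (2.8)$$

where $\ln_+(x):=\max(\ln(x),0)$, we have

$$\mathbb{P}\Big(\sup_{f\in\mathcal{F}}\frac{1}{M}\sum_{m\in[M]}\big(f(X_m)-\mathbb{E}[f(X_m)]\big)\ge\varepsilon\Big)\le 2\exp\Big(-\Big[\ln\Big(1+\frac{M\varepsilon}{K_2\,\bar\mu_{\Psi^{HT}_\beta}\Psi^{HT}_{1/\beta}(M)}\Big)\Big]^\beta\Big)+\exp\Big(-\frac{M\varepsilon^2}{K_2(\mu_2^2+c\varepsilon)}\Big). \qquad (2.9)$$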

A similar bound holds for lower deviations, i.e. replacing the $\sup$ and $\ge\varepsilon$ by $\inf$ and $\le-\varepsilon$: it is obtained by changing $\mathcal{F}$ into $-\mathcal{F}$ in the bounds.

If $\mathcal{F}$ is a finite-dimensional vector space, $V_{\mathcal{F}^+}\le\dim(\mathcal{F})+1$ [GKKW02, Theorem 9.5].

For i.i.d. $(X_m)_m$, i.e. when the µ-parameters (2.6) do not depend on $M$, both the condition (2.8) and the bound (2.9) take simple forms in terms of $M$ (without focusing much on the best constants), which makes Theorem 2.4 even more easily applicable.

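• The bound (2.9) becomes $2\exp\Big(-\Big[\ln\Big(1+\frac{M\varepsilon}{K\Psi^{HT}_{1/\beta}(M)}\Big)\Big]^\beta\Big)+\exp\Big(-\frac{M\varepsilon^2}{K(1+c\varepsilon)}\Big)$ for a positive constant $K$ depending on $\beta$ and the µ-parameters.

• The equation (2.8) becomes simply

$$c:=K_1\,\Psi^{HT}_{1/\beta}(M), \qquad (2.10)$$

with a new constant $K_1$, depending on $\beta$ and the µ-parameters. Indeed, from the first term in the definition (2.8) of $c$, one gets that $c\ge\inf_{M\ge 1}K_1\Psi^{HT}_{1/\beta}(M)\bar\mu_{\Psi^{HT}_\beta}=:c_0>0$, which, from (2.7), yields the rough lower bound $\varepsilon\ge K_1c_0/\sqrt{M}$. This implies in turn (after tedious computations) that the second term in the definition (2.8) of $c$ cannot be (up to a constant) larger than the first term, hence the equality (2.10).

• Furthermore, the probability upper bound (2.9) can be re-parametrized in the form $\mathbb{P}(\cdots\ge\varepsilon(t))\le e^{-t}$ for any given $t\ge 0$ with some appropriate $\varepsilon(t)$. Consider again the i.i.d. case to simplify the exposition, leveraging the above simplifications.

– The inequality $2\exp\Big(-\Big[\ln\Big(1+\frac{M\varepsilon}{K\Psi^{HT}_{1/\beta}(M)}\Big)\Big]^\beta\Big)\le e^{-t}/2$ holds when $\varepsilon\ge\varepsilon_1(t):=\frac{K\,\Psi^{HT}_{1/\beta}(M)}{M}\,\Psi^{HT}_{1/\beta}(4e^t-1)$.

– Because $\exp\Big(-\frac{M\varepsilon^2}{K(1+c\varepsilon)}\Big)\le\exp\Big(-\frac{M\varepsilon^2}{2K}\Big)+\exp\Big(-\frac{M\varepsilon}{2Kc}\Big)$, the inequality $\exp\Big(-\frac{M\varepsilon^2}{K(1+c\varepsilon)}\Big)\le e^{-t}/2$ holds as soon as $\varepsilon\ge\varepsilon_2(t):=\sqrt{\frac{2K(t+\ln(4))}{M}}$ and $\varepsilon\ge\varepsilon_3(t):=\frac{2KK_1\Psi^{HT}_{1/\beta}(M)(t+\ln(4))}{M}$.

– Last, (2.7) remains, i.e. $\varepsilon\ge\varepsilon_0:=K_1^2\,\Psi^{HT}_{1/\beta}(M)\sqrt{\frac{V_{\mathcal{F}^+}}{M}}$.

All in all, (2.9) can be rewritten as

$$\mathbb{P}\Big(\sup_{f\in\mathcal{F}}\frac{1}{M}\sum_{m\in[M]}\big(f(X_m)-\mathbb{E}[f(X_m)]\big)\ge\max(\varepsilon_0,\varepsilon_1(t),\varepsilon_2(t),\varepsilon_3(t))\Big)\le e^{-t},\qquad t\ge 0.$$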

Observe that as $t\to+\infty$ (all other parameters being fixed), the terms $\varepsilon_2(t)$ and $\varepsilon_3(t)$ grow like $\sqrt{t}$ and $t$ respectively, as in the usual cases of sub-Gaussian and sub-exponential tails. The effect of β-heavy tails appears in $\varepsilon_1(t)$, which grows like $e^{t^{1/\beta}}$.
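The re-parametrization in terms of $t$ can be sketched in Python as follows, for the i.i.d. case. The constants $K$ and $K_1$ are not explicit in the paper, so the values used here are hypothetical placeholders, and the function name `epsilon_of_t` is an assumption made for illustration. The sketch uses the identity $\Psi^{HT}_{1/\beta}(4e^t-1)=\exp((t+\ln 4)^{1/\beta})-1$ to avoid overflow for large $t$.

```python
import math

def psi_ht(x, beta):
    """Psi^HT_beta(x) = exp((ln(1 + x))**beta) - 1."""
    return math.expm1(math.log1p(x) ** beta)

def epsilon_of_t(t, M, beta, vc_dim, K=1.0, K1=1.0):
    """Thresholds eps_0, eps_1(t), eps_2(t), eps_3(t); K and K1 are placeholders."""
    psi_inv_M = psi_ht(M, 1.0 / beta)           # Psi^HT_{1/beta}(M)
    log_term = t + math.log(4.0)
    eps0 = K1 ** 2 * psi_inv_M * math.sqrt(vc_dim / M)
    # Psi^HT_{1/beta}(4 e^t - 1) written as expm1((t + ln 4)**(1/beta))
    eps1 = K * psi_inv_M / M * math.expm1(log_term ** (1.0 / beta))
    eps2 = math.sqrt(2.0 * K * log_term / M)
    eps3 = 2.0 * K * K1 * psi_inv_M * log_term / M
    return max(eps0, eps1, eps2, eps3), (eps0, eps1, eps2, eps3)

# Usage with illustrative values.
eps, parts = epsilon_of_t(t=3.0, M=10_000, beta=2.0, vc_dim=5)
```

Plugging `parts[1]` back into the first term of the simplified bound returns exactly $e^{-t}/2$, which is a quick consistency check on the formula for $\varepsilon_1(t)$.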

Theorem 2.4 is an instrumental result in statistical learning theory, in particular for Empirical Risk Minimization. We refer to the lectures [Kol11] for a broad exposition. Let us illustrate this on a regression problem, by deriving data-dependent bounds on the excess risk in this setting of β-heavy tailed random variables. Results for bounded functions can be found in [Kol11, Chapter 5]. Under the assumptions of Theorem 2.4, set

$$f^\star(x):=\mathbb{E}[Y\mid X=x]\quad\text{and}\quad f_M(x):=\arg\inf_{f\in\mathcal{F}}\frac{1}{M}\sum_{m=1}^{M}(Y_m-f(X_m))^2,$$

corresponding to the empirical regression function with quadratic loss. For a given measurable function $g:\mathbb{R}^d\times\mathbb{R}\to\mathbb{R}$, define its squared $L_2$ norm under the true and the empirical measures by

$$\|g(X,Y)\|_2^2:=\int_{\mathbb{R}^d}\int_{\mathbb{R}}g^2(x,y)\,\mathbb{P}(dx,dy),\qquad \|g(X,Y)\|_{2,M}^2:=\frac{1}{M}\sum_{m=1}^{M}g^2(X_m,Y_m).$$

We claim that the excess risk is bounded as follows:

$$\|f_M(X)-f^\star(X)\|_2^2-\inf_{f\in\mathcal{F}}\|f(X)-f^\star(X)\|_2^2\le\sup_{f\in\mathcal{F}}\Big(\|Y-f(X)\|_2^2-\|Y-f(X)\|_{2,M}^2\Big)+\sup_{f\in\mathcal{F}}\Big(\|Y-f(X)\|_{2,M}^2-\|Y-f(X)\|_2^2\Big), \qquad (2.11)$$

which implies a bound on the probability of the excess risk deviating by more than $\varepsilon$:

$$\mathbb{P}\Big(\|f_M(X)-f^\star(X)\|_2^2-\inf_{f\in\mathcal{F}}\|f(X)-f^\star(X)\|_2^2\ge\varepsilon\Big)\le\mathbb{P}\Big(\sup_{f\in\mathcal{F}}\Big(\|Y-f(X)\|_2^2-\|Y-f(X)\|_{2,M}^2\Big)\ge\varepsilon/2\Big)+\mathbb{P}\Big(\sup_{f\in\mathcal{F}}\Big(\|Y-f(X)\|_{2,M}^2-\|Y-f(X)\|_2^2\Big)\ge\varepsilon/2\Big).$$

The above probabilities are estimated by two applications of Theorem 2.4 with $\tilde X=(X,Y)$ and $\tilde{\mathcal{F}}^{\pm}=\{(x,y)\mapsto\pm(y-f(x))^2,\ f\in\mathcal{F}\}$, leading to an explicit bound for the probability of large excess risk.

To prove (2.11), observe that for any $f\in\mathcal{F}$, $\|Y-f(X)\|_2^2=\|Y-f^\star(X)\|_2^2+\|f(X)-f^\star(X)\|_2^2$ and

$$\|Y-f_M(X)\|_{2,M}^2-\inf_{f\in\mathcal{F}}\|Y-f(X)\|_{2,M}^2=0. \qquad (2.12)$$

This readily gives

$$\|f_M(X)-f^\star(X)\|_2^2-\inf_{f\in\mathcal{F}}\|f(X)-f^\star(X)\|_2^2=\|Y-f_M(X)\|_2^2-\inf_{f\in\mathcal{F}}\|Y-f(X)\|_2^2-\|Y-f_M(X)\|_{2,M}^2+\inf_{f\in\mathcal{F}}\|Y-f(X)\|_{2,M}^2$$

$$\le\sup_{f\in\mathcal{F}}\Big(\|Y-f(X)\|_2^2-\|Y-f(X)\|_{2,M}^2\Big)+\sup_{f\in\mathcal{F}}\Big(\|Y-f(X)\|_{2,M}^2-\|Y-f(X)\|_2^2\Big),$$

as claimed.

3. Proofs

3.1. Proof of Theorem 2.1

3.1.1. Preliminary results

Here, we recall Lemmas 8 and 9 of [Tal89], as well as the "Basic Estimate", which will enable us to prove Theorem 2.1. In addition to the independent $B$-valued random variables $(Y_m)_{m\in[M]}$, we will need to consider extra independent Rademacher random variables. Everything is defined as follows.

Let $\big(\Omega^M\times\Omega',\ \mathcal{P}^M\otimes\mathcal{P}',\ \mathbb{P}\big)$ be the basic probability space, where $\mathbb{P}=P\otimes P'$, such that the variables $Y_m$ are defined on $\Omega^M$ and, for $\omega=(\omega_m)_{m\in[M]}$, $Y_m(\omega)$ depends only on $\omega_m$. Let $(\varepsilon_m)_{m\in[M]}$ be a family of random variables defined on $\Omega'$ with a Rademacher distribution, independent of $(Y_m)_{m\in[M]}$. The following inequalities can be proven independently of the context of Orlicz norms.

Lemma 3.1 ([Tal89, Lemma 8]). If $\mathbb{P}\big(\max_{m\in[M]}\|Y_m\|_B\ge t\big)\le\frac{1}{2}$, then

$$\sum_{m\in[M]}\mathbb{P}(\|Y_m\|_B\ge t)\le 2\,\mathbb{P}\Big(\max_{m\in[M]}\|Y_m\|_B\ge t\Big). \qquad (3.1)$$

Lemma 3.2 ([Tal89, Lemma 9]). Let $X^{(r)}$ be the $r$-th largest term of $(\|Y_m\|_B)_{m\in[M]}$. Then

$$\mathbb{P}\big(X^{(r)}\ge t\big)\le\frac{1}{r!}\Big(\sum_{m\in[M]}\mathbb{P}(\|Y_m\|_B\ge t)\Big)^{r}. \qquad (3.2)$$

Set

$$\mu:=\mathbb{E}\Big[\Big\|\sum_{m\in[M]}\varepsilon_mY_m\Big\|_B\Big], \qquad (3.3)$$

where $\mu>0$ because the $Y_m$'s are not all zero random variables (to avoid trivial situations). We now recall a key inequality which, combined with the previous lemmas, will enable us to prove the announced theorem.

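Theorem 3.1 ([Tal89, Equation (2.5)]). For $k,q$ positive integers such that $k\ge q$, $u>0$ and $u'>0$, we have

$$\mathbb{P}\Big(\Big\|\sum_{i\in[M]}\varepsilon_iY_i\Big\|_B\ge 4q\mu+u+u'\Big)\le 4\exp\Big(-\frac{u^2}{64q\mu^2}\Big)+\Big(\frac{K_0}{q}\Big)^{k}+\mathbb{P}\Big(\sum_{r\le k}X^{(r)}>u'\Big),$$

where $K_0$ is a universal constant.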

3.1.2. Symmetrisation argument for $\Psi$ convex

In the next subsection, because we rely on Theorem 3.1, we are going to prove the inequality (2.3) on symmetric random variables first (i.e. variables $Y_m$ such that $\varepsilon_mY_m\overset{d}{=}Y_m$). The extension to non-symmetric variables will be direct thanks to Lemma 3.4, which establishes an "equivalence in norms" relationship between the Orlicz norm of the sum of random variables and its associated Rademacher average.

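Lemma 3.3. Let $\Psi$ be a convex Orlicz function and $\|\cdot\|_\Psi$ the associated Orlicz norm. For any mean-zero random variable $Z\in L_\Psi(B)$, we have $\|Z\|_\Psi\le\|Z-Z'\|_\Psi$, with $Z'$ any $B$-valued random variable such that $\mathbb{E}[Z'\mid Z]=0$.

Proof. Let $c>0$. Then

$$\mathbb{E}[\Psi(\|Z\|_B/c)]\overset{(a)}{=}\mathbb{E}\Big[\Psi\Big(\frac{\|\mathbb{E}[Z-Z'\mid Z]\|_B}{c}\Big)\Big]\overset{(b)}{\le}\mathbb{E}\Big[\Psi\Big(\frac{\mathbb{E}[\|Z-Z'\|_B\mid Z]}{c}\Big)\Big]\overset{(c)}{\le}\mathbb{E}\Big[\mathbb{E}\Big[\Psi\Big(\frac{\|Z-Z'\|_B}{c}\Big)\,\Big|\,Z\Big]\Big]=\mathbb{E}\Big[\Psi\Big(\frac{\|Z-Z'\|_B}{c}\Big)\Big],$$

where in (a) we use that $Z'$ has zero conditional mean, in (b) we use that $\Psi$ is non-decreasing and that the triangle inequality holds for $\|\cdot\|_B$, and in (c) we apply the Jensen inequality. Hence by taking $c=\|Z-Z'\|_\Psi>0$, the right-hand side is smaller than 1 (using Property (i) in Section 2.1), and therefore $\|Z\|_\Psi\le c=\|Z-Z'\|_\Psi$.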

Lemma 3.4. Let $\Psi$ be as in Lemma 3.3. Let $(Y_m)_{m\in[M]}$ be a sequence of independent mean-zero random variables in $L_\Psi(B)$. Let $(\varepsilon_m)_{m\in[M]}$ be independent Rademacher random variables, and let $(Y'_m)_{m\in[M]}$ be an independent copy of the sequence $(Y_m)_{m\in[M]}$. Then

$$\Big\|\sum_{m\in[M]}Y_m\Big\|_\Psi\le\Big\|\sum_{m\in[M]}Y_m-\sum_{m\in[M]}Y'_m\Big\|_\Psi=\Big\|\sum_{m\in[M]}\varepsilon_m(Y_m-Y'_m)\Big\|_\Psi\le 2\Big\|\sum_{m\in[M]}\varepsilon_mY_m\Big\|_\Psi\le 4\Big\|\sum_{m\in[M]}Y_m\Big\|_\Psi.$$

Later on, we will apply these inequalities with $\Psi=\Psi^{HT}_\beta$ and $\Psi(x)=x$ (the associated Orlicz norm then corresponds to the $L_1$ norm).

Proof. The first inequality comes from the application of Lemma 3.3 with $Z=\sum_{m\in[M]}Y_m$ and $Z'=\sum_{m\in[M]}Y'_m$. Since $\varepsilon_m$ takes values $\pm 1$ independently of $Z,Z'$, we have $Y_m-Y'_m\overset{d}{=}Y'_m-Y_m\overset{d}{=}\varepsilon_m(Y_m-Y'_m)$. Since the sequences are independent in $m$, we obtain the equality of Lemma 3.4. The second inequality is a consequence of the triangle inequality (iv) and the previous identities in distribution. The last inequality is a consequence of the application of Lemma 3.3 with $Z=\sum_{m\in[M]}\varepsilon_mY_m$ and $Z'=\sum_{m\in[M]}\varepsilon_mY'_m$ satisfying

$$\mathbb{E}[Z'\mid Z]=\mathbb{E}\Big[\mathbb{E}\Big[\sum_{m\in[M]}\varepsilon_mY'_m\,\Big|\,\varepsilon_m,Y_m,\ m\in[M]\Big]\,\Big|\,Z\Big]=0$$

and of the triangle inequality: $\|Z\|_\Psi\le\|Z-Z'\|_\Psi=\big\|\sum_{m\in[M]}(Y_m-Y'_m)\big\|_\Psi\le 2\big\|\sum_{m\in[M]}Y_m\big\|_\Psi$.

3.1.3. Completion of the proof of Theorem 2.1

We denote by $K$ a positive constant depending only on $\beta$, that may vary from line to line. We assume that at least one of the $Y_m$'s is not zero a.s., otherwise the announced inequality (2.3) is obvious.

In view of the inequalities of Lemma 3.4, it is enough to carry out the reasoning and show the inequality (2.3) with the variables $(\varepsilon_mY_m,\ m\in[M])$ instead of $(Y_m,\ m\in[M])$.

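Rescaling. Note that (2.3) is invariant under homogeneous rescaling (see Property (ii) of Section 2.1), i.e. the inequality remains the same for the random variables $\tilde Y_m:=\frac{\varepsilon_mY_m}{C}$ for any $C>0$. For the choice

$$C:=\Big\|\sum_{m\in[M]}\varepsilon_mY_m\Big\|_{L_1(B)}+\Big\|\max_{m\in[M]}\|Y_m\|_B\Big\|_{\Psi^{HT}_\beta}>0,$$

observe that

$$\Big\|\sum_{m\in[M]}\tilde Y_m\Big\|_{L_1(B)}\le 1\quad\text{and}\quad\Big\|\max_{m\in[M]}\|\tilde Y_m\|_B\Big\|_{\Psi^{HT}_\beta}\le 1, \qquad (3.4)$$

therefore the inequality (2.3) reads

$$\Big\|\sum_{m\in[M]}\tilde Y_m\Big\|_{\Psi^{HT}_\beta}\le 2K.$$

Conversely, if the above holds for some $K$ (independent of the $\tilde Y_m$'s satisfying (3.4)), then (2.3) holds for the $Y_m$'s. All in all, it means that without loss of generality, we can assume

$$\Big\|\sum_{m\in[M]}\varepsilon_mY_m\Big\|_{L_1(B)}\le 1\quad\text{and}\quad\Big\|\max_{m\in[M]}\|Y_m\|_B\Big\|_{\Psi^{HT}_\beta}\le 1,$$

and then show, under these assumptions, the existence of $K\in\mathbb{R}$ (independent of the $Y_m$'s) such that

$$\mathbb{E}\Big[\exp\Big(\xi_\beta\Big(\frac{\big\|\sum_{m\in[M]}\varepsilon_mY_m\big\|_B}{K}\Big)\Big)\Big]\le 2.$$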

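Deviation bounds. By Property (iii) of Section 2.1 and since we assumed $\big\|\max_{m\in[M]}\|Y_m\|_B\big\|_{\Psi^{HT}_\beta}\le 1$,

$$\mathbb{P}\Big(\max_{m\in[M]}\|Y_m\|_B\ge t\Big)\le 2\exp\big(-\xi_\beta(t)\big),\qquad t\ge 0.$$

The function $\xi_\beta(\cdot)=(\ln(1+\cdot))^\beta$ being continuous and increasing from 0 to $+\infty$, there exists $t_0$ such that $\xi_\beta(t_0)=2\ln 2$ and, for all $t\ge t_0$, $2\exp(-\xi_\beta(t))\le 1/2$; for further use, notice the value $t_0=e^{(2\ln 2)^{1/\beta}}-1$.

Then applying Lemma 3.1, for $t\ge t_0$, we have

$$\sum_{m\in[M]}\mathbb{P}(\|Y_m\|_B\ge t)\le 2\,\mathbb{P}\Big(\max_{m\in[M]}\|Y_m\|_B\ge t\Big)\le 4\exp\big(-\xi_\beta(t)\big).$$

Hence Lemma 3.2 yields, for $r\in\mathbb{N}^*$ and $t\ge t_0$,

$$\mathbb{P}\big(X^{(r)}\ge t\big)\le\frac{4^r\exp\big(-r\,\xi_\beta(t)\big)}{r!}. \qquad (3.5)$$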

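Denote $\tilde\beta=\lfloor\beta\rfloor+1\ge 2$. Equation (3.5) yields, for $t\ge(e^{\tilde\beta}-1)\,r^{\tilde\beta/\beta}$ (notice that then $t\,r^{-\tilde\beta/\beta}\ge e^{\tilde\beta}-1\ge e^2-1\ge t_0$, as requested),

$$\mathbb{P}\big(X^{(r)}\ge t\,r^{-\tilde\beta/\beta}\big)\le\frac{4^r\exp\big(-r\,[\ln(1+t\,r^{-\tilde\beta/\beta})]^\beta\big)}{r!}=:f(r,t).$$

Since $\tilde\beta/\beta>1$, the sequence $(r^{-\tilde\beta/\beta})_{r\ge 1}$ is summable. Set $S_\beta:=\sum_{r\ge 1}r^{-\tilde\beta/\beta}<+\infty$ and $g(t):=\big(t/(e^{\tilde\beta}-1)\big)^{\beta/\tilde\beta}$. From the inclusion $\big\{\sum_{r\le g(t)}X^{(r)}\ge tS_\beta\big\}\subset\bigcup_{r\le g(t)}\big\{X^{(r)}\ge t\,r^{-\tilde\beta/\beta}\big\}$ and a union bound, we get

$$\mathbb{P}\Big(\sum_{r\le g(t)}X^{(r)}\ge tS_\beta\Big)\le\sum_{r\le g(t)}f(r,t). \qquad (3.6)$$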

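We claim that for all $1\le r\le g(t)$,

$$r^{1/\beta}\ln\big(1+t\,r^{-\tilde\beta/\beta}\big)\ge\ln(1+t). \qquad (3.7)$$

This is a consequence of Lemma 3.5 below applied with $\rho=r^{1/\beta}\ge 1$ and $\tau=t\,r^{-\tilde\beta/\beta}\ge e^{\tilde\beta}-1$, since $\tau\rho^{\tilde\beta}=t$.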

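Lemma 3.5. For all $\rho\ge 1$ and $\tau\ge e^{\tilde\beta}-1$, we have $\rho\ln(1+\tau)\ge\ln(1+\tau\rho^{\tilde\beta})$.

Proof. The function $f(\rho):=\rho\ln(1+\tau)-\ln(1+\tau\rho^{\tilde\beta})$ vanishes at $\rho=1$; let us prove that it is non-decreasing in $\rho$ provided that $\tau\ge e^{\tilde\beta}-1$. Indeed,

$$f'(\rho)=\ln(1+\tau)-\frac{\tilde\beta\rho^{\tilde\beta-1}\tau}{1+\rho^{\tilde\beta}\tau}.$$

Since $\rho^{\tilde\beta}\tau\ge 0$ and $\rho\ge 1$, we have $\frac{\tilde\beta\rho^{\tilde\beta-1}\tau}{1+\rho^{\tilde\beta}\tau}\le\frac{\tilde\beta}{\rho}\le\tilde\beta$. Hence, $f'(\rho)\ge\ln(1+\tau)-\tilde\beta\ge 0$. We are done.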
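A quick numerical sanity check of Lemma 3.5 can be written in Python as below. The grid, the value $\beta=2.5$ and the helper name `lemma_gap` are illustrative assumptions; the check is of course no substitute for the proof, but it also shows that the lower bound on $\tau$ cannot simply be dropped.

```python
import math

def lemma_gap(rho, tau, beta_tilde):
    """rho * ln(1 + tau) - ln(1 + tau * rho**beta_tilde); Lemma 3.5 claims this is >= 0."""
    return rho * math.log1p(tau) - math.log1p(tau * rho ** beta_tilde)

beta = 2.5
beta_tilde = math.floor(beta) + 1            # here beta_tilde = 3
tau_min = math.exp(beta_tilde) - 1.0         # threshold e^{beta_tilde} - 1

# Grid over rho >= 1 and tau >= tau_min.
gaps = [lemma_gap(1.0 + 0.25 * i, tau_min * (1 + j), beta_tilde)
        for i in range(40) for j in range(40)]
min_gap = min(gaps)

# Below the threshold on tau the inequality can fail.
gap_small_tau = lemma_gap(2.0, 0.1, beta_tilde)
```

On this grid the minimum is attained at $\rho=1$, where both sides coincide, in line with the proof's observation that $f(1)=0$ and $f$ is non-decreasing.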

Plugging (3.7) into (3.6) yields

$$\mathbb{P}\Big(\sum_{r\le g(t)}X^{(r)}\ge tS_\beta\Big)\le\sum_{r\le g(t)}\frac{4^r}{r!}\exp\big(-[\ln(1+t)]^\beta\big)\le\exp(4)\exp(-\xi_\beta(t)).$$

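Let us recall that $\mu=\big\|\sum_{m\in[M]}\varepsilon_mY_m\big\|_{L_1(B)}\le 1$. We are now in a position to apply Theorem 3.1 with $q=\lceil eK_0\rceil$, $u=t$, $u'=tS_\beta$, $2q\mu\le t$ and $k=\lfloor g(t)\rfloor$:

$$\mathbb{P}\Big(\Big\|\sum_{m\in[M]}\varepsilon_mY_m\Big\|_B\ge t\,(S_\beta+3)\Big)\le\mathbb{P}\Big(\Big\|\sum_{m\in[M]}\varepsilon_mY_m\Big\|_B\ge 4q\mu+u+u'\Big)$$

$$\le 4\exp\Big(-\frac{t^2}{64q\mu^2}\Big)+\exp\big(-\lfloor g(t)\rfloor\big)+\mathbb{P}\Big(\sum_{r\le g(t)}X^{(r)}\ge tS_\beta\Big)$$

$$\le 4\exp\Big(-\frac{t^2}{64q\mu^2}\Big)+\exp\big(-\lfloor g(t)\rfloor\big)+\exp\big(4-\xi_\beta(t)\big).$$

The above inequality is valid for any $t\ge t_0\vee(2\lceil eK_0\rceil\mu)$. Besides, in the above upper bound, the third (last) term is asymptotically the largest one, therefore there exists $K>0$ such that

$$\mathbb{P}\Big(\Big\|\sum_{m\in[M]}\varepsilon_mY_m\Big\|_B\ge Kt\Big)\le K\exp\big(-\xi_\beta(t)\big),\qquad t\ge 0.$$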

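Orlicz norm bounds. The estimate implies for all $c>0$:

$$\mathbb{E}\Big[\exp\Big(\xi_\beta\Big(\Big\|\sum_{m\in[M]}\varepsilon_mY_m\Big\|_B\Big/(cK)\Big)\Big)\Big]-1=\int_0^\infty\exp(\xi_\beta(t))\,\xi'_\beta(t)\,\mathbb{P}\Big(\frac{\big\|\sum_{m\in[M]}\varepsilon_mY_m\big\|_B}{cK}\ge t\Big)\,dt\le K\int_0^\infty\xi'_\beta(t)\exp\big(\xi_\beta(t)-\xi_\beta(ct)\big)\,dt. \qquad (3.8)$$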

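Let us check that the above integral is finite for $c>1$. Only the integrability at $t\to+\infty$ is questionable. Write

$$\xi_\beta(t)-\xi_\beta(ct)=(\ln(1+t))^\beta\Big(1-\Big(1+\frac{\ln\big(\frac{1+ct}{1+t}\big)}{\ln(1+t)}\Big)^\beta\Big)\underset{t\to+\infty}{\approx}-\beta(\ln(1+t))^{\beta-1}\ln(c).$$

Therefore, the function to integrate is bounded for $t$ large by (up to a constant)

$$g(t):=\frac{(\ln(1+t))^{\beta-1}}{1+t}\,e^{-\frac{1}{2}\beta(\ln(1+t))^{\beta-1}\ln(c)}.$$

We easily check that $\int_0^{+\infty}g(t)\,dt=\int_0^{+\infty}y^{\beta-1}e^{-\frac{1}{2}\beta y^{\beta-1}\ln(c)}\,dy<+\infty$ since $\beta>1$ and $c>1$.

Furthermore, by the monotone convergence theorem, the bound (3.8) converges to 0 as $c\to+\infty$; consequently

$$\mathbb{E}\Big[\exp\Big(\xi_\beta\Big(\frac{\big\|\sum_{m\in[M]}\varepsilon_mY_m\big\|_B}{cK}\Big)\Big)\Big]\le 2$$

for $c=c_\beta$ large enough. We have proved that $\big\|\sum_{m\in[M]}\varepsilon_mY_m\big\|_{\Psi^{HT}_\beta}\le c_\beta K$.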

3.2. Proof of Theorem 2.2

We start by recalling the general maximal inequality on which our proof is based.

Lemma 3.6 ([vdVW96, Lemma 2.2.2]). Let $\Psi$ be a convex Orlicz function satisfying

$$\limsup_{x,y\to+\infty}\Psi(x)\Psi(y)/\Psi(c_\Psi xy)<+\infty \qquad (3.9)$$

for some constant $c_\Psi>0$. Then, there is a constant $K>0$ such that for any $B$-valued random variables $Y_1,\dots,Y_M$,

$$\Big\|\max_{m\in[M]}\|Y_m\|_B\Big\|_\Psi\le K\,\Psi^{-1}(M)\max_{m\in[M]}\|Y_m\|_\Psi.$$

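For $\beta>1$, $\Psi^{HT}_\beta$ is a convex Orlicz function, thus it remains to establish (3.9) to get Theorem 2.2. We prove that one can take $c_\Psi=1$. Let $c\ge 1$³ be such that $\Psi^{HT}_\beta(c^2)\ge 1$. Let $x,y$ be such that $x\ge c$ and $y\ge c$: then

$$\Psi^{HT}_\beta(x)\,\Psi^{HT}_\beta(y)\le e^{(\ln(1+x))^\beta}e^{(\ln(1+y))^\beta}\le e^{(\ln x)^\beta+(\ln y)^\beta-(\ln(xy))^\beta}\;e^{(\ln(1+xy))^\beta}\;\underbrace{e^{2\sup_{z\ge c}\left[(\ln(1+z))^\beta-(\ln z)^\beta\right]}}_{=:C(c)}.$$

• $C(c)$ is finite: indeed, by standard equivalents, we have that

$$(\ln(1+z))^\beta-(\ln z)^\beta=(\ln z)^\beta\Big(\Big(1+\frac{\ln(1+z^{-1})}{\ln(z)}\Big)^\beta-1\Big)\underset{z\to\infty}{\sim}\frac{\beta(\ln z)^{\beta-1}}{z},$$

which converges to 0 at infinity.

• Notice that $(\ln(xy))^\beta-(\ln(x))^\beta\ge(\ln(y))^\beta$ for any $x,y\ge 1$. Indeed, setting $u=\ln x\ge 0$, $v=\ln y\ge 0$,

$$(u+v)^\beta-u^\beta=\int_u^{u+v}\beta z^{\beta-1}\,dz\ge\int_0^v\beta z^{\beta-1}\,dz=v^\beta$$

(because $z\mapsto\beta z^{\beta-1}$ is increasing since $\beta>1$).

• Last, since $e^{(\ln(1+xy))^\beta}=\Psi^{HT}_\beta(xy)+1\ge\Psi^{HT}_\beta(c^2)+1\ge 2$, one has

$$e^{(\ln(1+xy))^\beta}=\frac{e^{(\ln(1+xy))^\beta}}{e^{(\ln(1+xy))^\beta}-1}\,\Psi^{HT}_\beta(xy)\le 2\,\Psi^{HT}_\beta(xy).$$

All in all, we conclude that $\Psi^{HT}_\beta(x)\Psi^{HT}_\beta(y)\le 2C(c)\Psi^{HT}_\beta(xy)$ for any $x,y\ge c$. We are done.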
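The sub-multiplicativity bound just proved can be explored numerically. In the Python sketch below, $\beta=1.5$, the geometric grids and the variable names are illustrative assumptions, and the supremum defining $C(c)$ is only approximated on a finite grid; the threshold $c$ is chosen so that $\Psi^{HT}_\beta(c^2)=1$.

```python
import math

def psi_ht(x, beta):
    """Psi^HT_beta(x) = exp((ln(1 + x))**beta) - 1."""
    return math.expm1(math.log1p(x) ** beta)

beta = 1.5
# Threshold c with Psi^HT_beta(c**2) = 1.
c = math.sqrt(math.exp(math.log(2.0) ** (1.0 / beta)) - 1.0)

# Grid approximation of C(c) = exp(2 * sup_{z >= c} [(ln(1+z))^beta - (ln z)^beta]).
sup_h = max(math.log1p(z) ** beta - math.log(z) ** beta
            for z in (c * 1.01 ** k for k in range(2000)))
C_c = math.exp(2.0 * sup_h)

# Largest ratio Psi(x) Psi(y) / Psi(x y) over a geometric grid of x, y >= c.
grid = [c * 1.5 ** i for i in range(30)]
max_ratio = max(psi_ht(x, beta) * psi_ht(y, beta) / psi_ht(x * y, beta)
                for x in grid for y in grid)
```

On this grid the observed ratio stays well below $2C(c)$, which suggests the constant obtained in the proof is far from tight, as expected since the argument discards a non-positive exponent.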

3.3. Proof of Theorem 2.4

We follow the strategy of [CGS20] by truncating the unbounded functions $f$ by a threshold $c$, whose impact is analyzed using the Hoffmann-Jørgensen inequality [LT13, Proposition 6.8] and the Talagrand inequality of Theorem 2.1. The deviation probability related to the newly bounded random variables is quantified thanks to the Klein-Rio inequalities [KR05] together with the Dudley entropy integral bound.

Here is the notation used throughout this proof. We denote by $K$ a positive constant that may change from line to line in the computations: this generic constant $K$ may depend on universal constants and $\beta$, but it does not depend on the sample $X_1,\dots,X_M$, its size $M$, the class of functions $\mathcal{F}$, or $\varepsilon$. For ease of notation, we write $a\le_K b$ when $a\le Kb$.

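For a given $c>0$, set

$$R_cf:=f-T_cf\quad\text{where}\quad T_cf:=-c\vee f\wedge c,\qquad T_c\mathcal{F}:=\{T_cf:\ f\in\mathcal{F}\},$$

$$T_c^mf(\cdot):=T_cf(\cdot)-\mathbb{E}[T_cf(X_m)],\qquad Z_c:=\sup_{f\in\mathcal{F}}\frac{1}{M}\sum_{m\in[M]}T_c^mf(X_m).$$

³ One can take $c=\sqrt{e^{(\ln 2)^{1/\beta}}-1}\ge 1$, for which $\Psi^{HT}_\beta(c^2)=1$.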

Note that the function $T_c^mf$ is centered w.r.t. the distribution of $X_m$, and bounded by $2c$.
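The truncation and residual operators can be illustrated on scalar values with the following Python sketch. The function names `truncate` and `residual` are illustrative, and the operators are applied to numbers rather than to functions for simplicity; the sketch also checks the identity $R_cf=(f-c)_+-(f+c)_-$ used later in the proof.

```python
def truncate(value, c):
    """T_c applied to a value: clip to the interval [-c, c]."""
    return max(-c, min(value, c))

def residual(value, c):
    """R_c applied to a value: what is left after truncation."""
    return value - truncate(value, c)

def pos_part(x):
    return max(x, 0.0)

def neg_part(x):
    return max(-x, 0.0)

c = 2.0
values = [-5.0, -2.0, -0.5, 0.0, 1.5, 2.0, 7.0]

truncated = [truncate(v, c) for v in values]
residuals = [residual(v, c) for v in values]
# Alternative expression of the residual: (f - c)_+ - (f + c)_-
residuals_alt = [pos_part(v - c) - neg_part(v + c) for v in values]

# Centering the truncated values keeps them within [-2c, 2c].
mean_trunc = sum(truncated) / len(truncated)
centered = [t - mean_trunc for t in truncated]
```

Since every truncated value lies in $[-c,c]$, so does their mean, which is why the centered version is bounded by $2c$.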

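Assume that $c>0$ and $\varepsilon>0$ are such that

$$\sup_{f\in\mathcal{F}}\Big|\frac{1}{M}\sum_{m\in[M]}\mathbb{E}[R_cf(X_m)]\Big|\le\varepsilon/4, \qquad (3.10)$$

$$\mathbb{E}[Z_c]\le\varepsilon/4. \qquad (3.11)$$

By writing $f=R_cf+T_cf$ and using the sub-additivity of the supremum, we easily get

$$\sup_{f\in\mathcal{F}}\frac{1}{M}\sum_{m\in[M]}\big(f(X_m)-\mathbb{E}[f(X_m)]\big)\le Z_c-\mathbb{E}[Z_c]+\sup_{f\in\mathcal{F}}\frac{1}{M}\sum_{m\in[M]}R_cf(X_m)+\varepsilon/2.$$

Hence, the probability of deviation in Theorem 2.4 is bounded by

$$\mathbb{P}\Big(\sup_{f\in\mathcal{F}}\frac{1}{M}\sum_{m\in[M]}R_cf(X_m)\ge\varepsilon/4\Big)+\mathbb{P}\big(Z_c-\mathbb{E}[Z_c]\ge\varepsilon/4\big)=:(\star)+(\star\star).$$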

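Term $(\star)$. Owing to the deviation inequality (iii) from Section 2.1, it is bounded by

$$\mathbb{P}\Big(\sum_{m\in[M]}\sup_{f\in\mathcal{F}}|R_cf(X_m)|\ge M\varepsilon/4\Big)\le 2\exp\Big(-\Big[\ln\Big(\frac{M\varepsilon/4}{\big\|\sum_{m\in[M]}\sup_{f\in\mathcal{F}}|R_cf(X_m)|\big\|_{\Psi^{HT}_\beta}}+1\Big)\Big]^\beta\Big).$$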

Using the Talagrand inequality of Theorem 2.1 and the Hoffmann-Jørgensen inequality [LT13, Proposition 6.8], and following line by line the arguments of [CGS20, Section 5.5, Inequalities (38) and (39)], we can show that the above $\|\cdot\|_{\Psi^{HT}_\beta}$ norm is bounded by

$$K\,\Big\|\max_{m\in[M]}F(X_m)\Big\|_{\Psi^{HT}_\beta},$$

provided that $c\ge 8\,\mathbb{E}\big[\max_{m\in[M]}F(X_m)\big]$. The above arguments are crucial to deal both with the truncation in $c$ and with the supremum in $f$. Furthermore, the maximal inequality (2.4) with $Y_m:=F(X_m)$ gives that

$$\Big\|\max_{m\in[M]}F(X_m)\Big\|_{\Psi^{HT}_\beta}\le C_{\beta,(2.4)}\,\Psi^{HT}_{1/\beta}(M)\,\bar\mu_{\Psi^{HT}_\beta}. \qquad (3.12)$$

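All in all, we have

$$c\ge 8\,\mathbb{E}\Big[\max_{m\in[M]}F(X_m)\Big]\ \Longrightarrow\ (\star)\le 2\exp\Big(-\Big[\ln\Big(\frac{M\varepsilon}{K\,\bar\mu_{\Psi^{HT}_\beta}\Psi^{HT}_{1/\beta}(M)}+1\Big)\Big]^\beta\Big).$$

The condition on the left-hand side is met as soon as

$$c\ge_K\Psi^{HT}_{1/\beta}(M)\,\bar\mu_{\Psi^{HT}_\beta},$$

where we have used $\mathbb{E}[\cdot]\le_K\|\cdot\|_{\Psi^{HT}_\beta}$ and (3.12).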

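Term $(\star\star)$. Applying the Klein-Rio inequality [KR05, Theorem 1.1] (we shall use the form presented in [CGS20, Theorem 8], which directly fits our setting) shows that $(\star\star)$ is bounded by

$$\exp\Big(-\frac{M(\varepsilon/4)^2}{2(\sigma^2+4c\,\mathbb{E}(Z_c))+6c(\varepsilon/4)}\Big)$$

where $\sigma^2:=\sup_{f\in\mathcal{F}}\frac{1}{M}\max_{m\in[M]}\mathbb{E}\big[(T_c^mf)^2(X_m)\big]$. Observe that

$$\sigma^2\le\sup_{f\in\mathcal{F}}\max_{m\in[M]}\mathrm{Var}\big[T_c^mf(X_m)\big]\le\sup_{f\in\mathcal{F}}\max_{m\in[M]}\mathbb{E}\big[(T_cf)^2(X_m)\big]\le\mu_2^2.$$

Using in addition the bound (3.11) on $\mathbb{E}(Z_c)$, we get

$$(\star\star)\le\exp\Big(-\frac{M\varepsilon^2}{K(\mu_2^2+c\varepsilon)}\Big)$$

where $K$ is a universal constant.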

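Condition (3.10). From $R_cf(x)=(f(x)-c)_+-(f(x)+c)_-$, we easily get

$$|\mathbb{E}[R_cf(X_m)]|\le\int_c^{+\infty}\mathbb{P}(|f(X_m)|\ge z)\,dz\le 2\int_c^{+\infty}\exp\big(-(\ln(z/\lambda+1))^\beta\big)\,dz=2\lambda\int_{c/\lambda}^{+\infty}\exp\big(-(\ln(z+1))^\beta\big)\,dz=:2\lambda\,I(c/\lambda)$$

where $\lambda:=\mu_{\Psi^{HT}_\beta}$. A standard computation shows that

$$I(y)\underset{y\to+\infty}{\sim}\frac{y}{\beta(\ln(y+1))^{\beta-1}}\exp\big(-(\ln(y+1))^\beta\big),$$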