Robust Estimation of a Regression Function in Exponential Families

Yannick Baraud and Juntong Chen

Abstract. We observe n pairs X1 = (W1, Y1), ..., Xn = (Wn, Yn) of independent random variables and assume, although this might not be true, that for each i ∈ {1, ..., n} the conditional distribution of Yi given Wi belongs to a given exponential family with real parameter θ⋆i = θ⋆(Wi), the value at Wi of a function θ⋆ of the covariate. Given a model Θ for θ⋆, we propose an estimator θ̂ with values in Θ whose construction is independent of the distribution of the Wi and which possesses the properties of being robust to contamination, outliers and model misspecification. We establish non-asymptotic exponential inequalities for the upper deviations of a Hellinger-type distance between the true distribution of the data and the estimated one based on θ̂. Under a suitable parametrization of the exponential family, we deduce a uniform risk bound for θ̂ over the class of Hölderian functions and we prove the optimality of this bound up to a logarithmic factor. Finally, we provide an algorithm for calculating θ̂ when θ⋆ is assumed to belong to functional classes of low or medium dimensions (in a suitable sense) and, on a simulation study, we compare the performance of θ̂ to that of the MLE and of median-based estimators. The proof of our main result relies on an upper bound, with explicit numerical constants, on the expectation of the supremum of an empirical process over a VC-subgraph class. This bound may be of independent interest.

1. Introduction

In order to motivate the statistical problem we wish to solve here, let us start with a preliminary example.

Example 1. We study a cohort of n patients with respective clinical characteristics W1, ..., Wn with values in R^d. For the sake of simplicity we shall assume that d is small compared to n even though this situation might not be the practical one. We associate the label Yi = 1 to patient i if he/she

Date: October 28th, 2020.

2010 Mathematics Subject Classification. Primary 62J12, 62F35, 62G35, 62G05; Secondary 60G99.

Key words and phrases. Generalized linear model, logit (logistic) regression, Poisson regression, robust estimation, supremum of an empirical process.

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 811017.


develops the disease D and Yi = −1 otherwise. A classical model for studying the effect of the clinical characteristic W on the probability of developing the disease D is the logit one

(1) $$\mathbb{P}[Y = y\,|\,W] = \frac{1}{1 + \exp[-y\langle w^{\star}, W\rangle]} \in (0,1) \quad\text{for } y \in \{-1, +1\},$$

where w⋆ is an unknown vector and ⟨·,·⟩ the inner product of R^d. The problem is to estimate w⋆ on the basis of the observations (Wi, Yi) for i ∈ {1, ..., n}.

A common way of solving this problem is to use the Maximum Likelihood Estimator (MLE for short). In exponential families, the MLE is known to enjoy many nice properties but it also suffers from several defects. First of all, it is not difficult to see that it might not exist. This is in particular the case when a hyperplane separates the two subsets of R^d given by W₊ = {Wi, Yi = +1} and W₋ = {Wi, Yi = −1}, i.e. when there exists a unit vector w0 ∈ R^d such that ⟨w, w0⟩ > 0 for all w ∈ W₊ and ⟨w, w0⟩ < 0 for all w ∈ W₋. In this case, the conditional likelihood function at λw0 with λ > 0 writes as

$$\prod_{i=1}^{n} \frac{1}{1 + \exp[-\lambda Y_i \langle w_0, W_i\rangle]} = \prod_{i=1}^{n} \frac{1}{1 + \exp[-\lambda |\langle w_0, W_i\rangle|]} \xrightarrow[\lambda \to +\infty]{} 1,$$

hence the maximal value 1 is never reached. For a thorough study of the existence of the MLE in the logit model we refer to Candès and Sur (2020) as well as the references therein.

Another issue with the use of the MLE lies in the fact that it is not robust, and we shall illustrate its instability in our simulation study. Robustness is nevertheless an important property in practice since, getting back to our Example 1, it may happen that our database contains a few corrupted data corresponding to mislabelled patients (some patients might have developed a disease which is not D but has similar symptoms). A natural question arises: how can we provide a suitable estimation of w⋆ despite the presence of such corrupted data?

This is the kind of issue we want to solve here. Our approach is not, however, restricted to the logit model but applies more generally whenever the conditional distribution of Y given W belongs to a one-parameter exponential family. More precisely, we assume that we observe n pairs of independent random variables (W1, Y1), ..., (Wn, Yn) with values in W × Y and assume that the conditional distribution of Yi given Wi belongs to a one-parameter exponential family with parameter θi = θ⋆(Wi) ∈ R, which is the value at Wi of an unknown function θ⋆ that we wish to estimate.


To our knowledge, there exist only few papers in the literature that tackle this estimation problem and establish risk bounds for the proposed estimators of θ⋆. When W = [0, 1], Kolaczyk and Nowak (2005) proposed an estimation of θ⋆ by piecewise polynomials. When the exponential family is given in its canonical form and the natural parameter is a smooth function of the mean, they propose estimators that achieve, up to extra logarithmic factors, the classical rate n^{−α/(2α+1)} over Besov balls with regularity α > 0 for a Hellinger-type loss. Brown et al (2010) considered one-parameter exponential families which possess the property that the variances of the distributions are quadratic functions of their means. These families include as special cases the binomial, gamma and Poisson distributions, among others, and have been studied earlier by Antoniadis and Sapatinas (2001). When the exponential family is parametrized by its mean, Brown et al (2010) used a variance-stabilizing transformation in order to turn the original problem of estimating the function θ⋆ into that of estimating a regression function in the homoscedastic Gaussian regression framework. They established uniform rates of convergence with respect to some L2-loss over classes of functions θ⋆ that belong to Besov balls and are bounded from above and below by positive numbers. Finally, in the case of the Poisson family parametrized by its mean, Kroll (2019) proposed an estimator of θ⋆ which is based on a model selection procedure. He proved that, under suitable but rather restrictive conditions, his estimator achieves the minimax rate of convergence over Sobolev-type ellipsoids.


In the present paper, we shall see that it is easy to derive upper bounds for the risk using a method called ρ-estimation, which has been introduced in Baraud et al (2017) and Baraud and Birgé (2018) and which solves the problem of estimating θ⋆ under very mild assumptions.

Our main result takes the form of a non-asymptotic exponential deviation inequality for a Hellinger-type loss between the true conditional distribution of the data and the estimated one based on our model. The values of the numerical constants that are involved in this inequality are given explicitly and they do not depend on the exponential family. We shall also present an algorithm for calculating our estimator as well as a simulation study evaluating its practical performance. Finally, let us mention that the proof of this main result relies on an upper bound for the expectation of the supremum of an empirical process over a VC-subgraph class. This bound provides explicit numerical constants, which is to our knowledge new in the literature, and can be of independent interest.

The paper is organized as follows. We describe our statistical framework in Section 2 and present there several examples to which our approach applies. The construction of the estimator and our main result about its risk are presented in Section 3. We shall also explain why the deviation inequality we derive guarantees the desired robustness property of the estimator. Uniform risk bounds over Hölderian classes for a suitably chosen parameter are established in Section 4. As already mentioned, we shall see that, even when the exponential family is parametrized by the mean of the distribution, the rates we get may differ from the (classical) ones established in the Gaussian case. Section 5 is devoted to the description of our algorithm and the simulation study. Our bound on the expectation of the supremum of an empirical process over a VC-subgraph class can be found in Section 6, together with its proof. Section 7 is devoted to the other proofs.

2. The statistical setting

We observe n pairs of independent, but not necessarily i.i.d., random variables X1 = (W1, Y1), ..., Xn = (Wn, Yn) with values in a measurable product space (X, 𝒳) = (W × Y, 𝒲 ⊗ 𝒴). We assume that for each i ∈ {1, ..., n}, the conditional probability of Yi given Wi exists and is given by the value at Wi of a measurable function Q⋆i from (W, 𝒲) into the set of all probabilities on (Y, 𝒴) equipped with the Borel σ-algebra 𝒯 associated to the total variation distance (which induces the same topology as the Hellinger one). In particular the mapping w ↦ h²(Q⋆i(w), Q) is measurable whatever the probability Q on (Y, 𝒴) and i ∈ {1, ..., n}.

With a slight abuse of language, we shall also refer to Q⋆i as the conditional distribution of Yi given Wi, although this distribution is actually the value of Q⋆i at Wi. As for the Wi, we assume


nothing about their respective distributions PWi, which can therefore be arbitrary.

Let Q be an exponential family on the measured space (Y, 𝒴, ν). We assume that Q = {Qθ, θ ∈ I} is indexed by a natural parameter θ that belongs to some non-trivial interval I ⊂ R (i.e. with non-empty interior I̊). This means that for all θ ∈ I, the distribution Qθ admits a density (with respect to ν) of the form

(2) $$q_{\theta}\colon y \mapsto e^{S(y)\theta - A(\theta)} \quad\text{with}\quad A(\theta) = \log\left[\int_{Y} e^{\theta S(y)}\, d\nu(y)\right],$$

where S is a real-valued measurable function on (Y, 𝒴) which does not coincide with a constant ν-a.e. We also recall that the function A is infinitely differentiable on the interior I̊ of I and strictly convex on I. It is of course possible to parametrize Q in a different way (i.e. with a non-natural parameter) by performing the change of variable γ = u(θ), where u is a continuous and strictly monotone function on I. We shall see in Section 3.4 that our main result remains unchanged under such a transformation and we therefore choose, for the sake of simplicity, to present our statistical setting under a natural parametrization.

Given a class of functions Θ from W into I, we presume that all the conditional distributions Q⋆i(Wi) are of the form Q_{θ⋆(Wi)} with θ⋆ ∈ Θ. We shall refer to θ⋆ as the regression function.

Throughout this paper, we shall keep in mind that all these assumptions about the Q⋆i might not be true: the conditional distributions Q⋆i(Wi) might not be exactly of the form Q_{θ⋆(Wi)} but only close to distributions of this form and, even if they were, the set Θ might not contain θ⋆ but only provide a suitable approximation of it. Nevertheless, we base our construction on the assumption that the conditional probabilities Q⋆i are all equal to some element of the set {Q_θ : w ↦ Q_{θ(w)}, θ ∈ Θ} associated to the exponential family Q and the function space Θ.

Let Q_W be the set of all measurable mappings from (W, 𝒲) into the space of probabilities on (Y, 𝒴) equipped with the topology 𝒯, and let 𝒬_W = Q_W^n. Hence, the vector Q⋆ = (Q⋆1, ..., Q⋆n) of the true conditional distributions belongs to 𝒬_W. We endow the space 𝒬_W with the Hellinger-type (pseudo) distance h defined as follows: for Q = (Q1, ..., Qn) and Q′ = (Q′1, ..., Q′n) in 𝒬_W,

$$h^2(\mathbf{Q}, \mathbf{Q}') = \sum_{i=1}^{n} \int_{W} h^2\big(Q_i(w), Q'_i(w)\big)\, dP_{W_i}(w),$$

where h denotes the Hellinger distance. In particular, h(Q, Q′) = 0 implies that for all i ∈ {1, ..., n}, Qi = Q′i PWi-a.s. We recall that the Hellinger distance between two probabilities P = p·µ and R = r·µ dominated by a measure µ on a measurable space (E, ℰ) is given by

$$h(P, R) = \left[\frac{1}{2}\int_{E} \big(\sqrt{p} - \sqrt{r}\big)^2\, d\mu\right]^{1/2},$$

the result being independent of the choice of the dominating measure µ.

On the basis of the observations X1, ..., Xn, we build an estimator θ̂ of θ⋆ with values in Θ and evaluate its performance by the quantity h²(Q⋆, Q_θ̂) with the notation

$$\mathbf{Q}^{\star} = (Q^{\star}_1, \dots, Q^{\star}_n) \quad\text{and}\quad \mathbf{Q}_{\theta} = (Q_{\theta}, \dots, Q_{\theta}),$$

where θ denotes a function from W into I. Our aim is to design θ̂ not only to guarantee that h²(Q⋆, Q_θ̂) is small but also that this quantity remains stable under a possible departure from the assumptions we started from, i.e. when Q⋆ ∉ {Q_θ, θ ∈ Θ} while inf_{θ∈Θ} h²(Q⋆, Q_θ) remains small.
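As a concrete illustration of these distances (a standard computation, recalled here for the reader's convenience and consistent with (26) and with formula (34) used later in the simulations): in the Gaussian family of Example 2 below, with mean θ and known variance σ²,

$$h^2\big(\mathcal{N}(\theta, \sigma^2), \mathcal{N}(\theta', \sigma^2)\big) = 1 - \exp\left[-\frac{(\theta - \theta')^2}{8\sigma^2}\right] \leq \frac{(\theta - \theta')^2}{8\sigma^2}.$$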

If Pi and P′i denote two distributions for a random variable (Wi, Yi) the conditional distributions of which given Wi are Qi and Q′i respectively, then

$$h^2(P_i, P'_i) = \int_{W} h^2\big(Q_i(w), Q'_i(w)\big)\, dP_{W_i}(w)$$

and we shall write Pi = Qi·PWi and P′i = Q′i·PWi. In particular, for all functions θ : W → I,

$$h^2(\mathbf{Q}^{\star}, \mathbf{Q}_{\theta}) = \sum_{i=1}^{n} h^2(P^{\star}_i, P_{i,\theta}),$$

where P⋆i = Q⋆i·PWi and P_{i,θ} = Q_θ·PWi for all i ∈ {1, ..., n}. The probability P⋆ = ⊗ⁿᵢ₌₁ P⋆i corresponds to the true distribution of the observed data X = (X1, ..., Xn) while P_θ = ⊗ⁿᵢ₌₁ P_{i,θ} denotes the distribution of independent random variables (W1, Y1), ..., (Wn, Yn) for which the conditional distribution of Yi given Wi is Q_{θ(Wi)} ∈ Q for all i. Unlike Q_θ̂, P_θ̂ is not an estimator (of P⋆) since the distributions of the Wi are unknown. Nevertheless it will sometimes be convenient to interpret our results in terms of a Hellinger-type distance between P⋆ and P_θ̂. For these reasons we shall sometimes write h(P⋆, P_θ̂) for h(Q⋆, Q_θ̂) and h(P_θ, P_{θ′}) for h(Q_θ, Q_{θ′}) when θ and θ′ denote mappings from W into I.

2.1. Examples. Let us present here some typical statistical settings to which our approach applies.

Example 2 (Gaussian regression with known variance). Given n independent random variables W1, ..., Wn with values in W, let

(3) $$Y_i = \theta^{\star}(W_i) + \sigma\varepsilon_i \quad\text{for all } i \in \{1, \dots, n\},$$

where the εi are i.i.d. standard real-valued Gaussian random variables, σ is a known positive number and θ⋆ an unknown regression function with values in I = R. In this case, Q is the set of all Gaussian distributions with variance σ² and for all θ ∈ I = R, Qθ = N(θ, σ²) has a density with respect to ν = N(0, σ²) on (Y, 𝒴) = (R, B(R)) which is of the form (2) with A(θ) = θ²/(2σ²) and S(y) = y/σ² for all y ∈ R.

Example 3 (Binary regression). The pairs of random variables (Wi, Yi) with i ∈ {1, ..., n} are independent with values in W × {0, 1} and

(4) $$\mathbb{P}[Y_i = y\,|\,W_i] = \frac{\exp[y\theta^{\star}(W_i)]}{1 + \exp[\theta^{\star}(W_i)]} \quad\text{for all } y \in \{0, 1\}.$$

This means that the conditional distribution of Yi given Wi is Bernoulli with mean (1 + exp[−θ⋆(Wi)])⁻¹ for some regression function θ⋆ with values in I = R. This model is equivalent to the logit one presented in Example 1 after changing Yi ∈ {0, 1} into Y′i = 2Yi − 1 ∈ {−1, 1} for all i. The exponential family Q consists of the Bernoulli distributions Qθ with mean 1/(1 + e^{−θ}) ∈ (0, 1), θ ∈ I = R. For all θ ∈ R, Qθ thus admits a density with respect to the counting measure ν on Y = {0, 1} of the form (2) with A(θ) = log(1 + e^θ) and S(y) = y for all y ∈ Y.

Example 4 (Poisson regression). The exponential family Q is the set of all Poisson distributions Qθ with mean e^θ, θ ∈ I = R. Taking for ν the Poisson distribution with mean 1, the density of Qθ with respect to ν takes the form (2) with S(y) = y for all y ∈ N and A(θ) = e^θ − 1 for all θ ∈ R. The conditional distribution of Yi given Wi is presumed to be Poisson with mean exp[θ⋆(Wi)] for some regression function θ⋆ with values in I = R.
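As a quick check of this density formula (a routine computation, added here for the reader's convenience): the Poisson distribution with mean e^θ assigns mass e^{−e^θ}(e^θ)^y/y! to y ∈ N, while ν assigns mass e^{−1}/y!, so that

$$\frac{dQ_{\theta}}{d\nu}(y) = \frac{e^{-e^{\theta}}(e^{\theta})^{y}/y!}{e^{-1}/y!} = e^{y\theta - (e^{\theta} - 1)} = e^{S(y)\theta - A(\theta)} \quad\text{with } S(y) = y,\ A(\theta) = e^{\theta} - 1.$$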

Example 5 (Exponential multiplicative regression). The random variables W1, ..., Wn are independent and

(5) $$Y_i = \frac{Z_i}{\theta^{\star}(W_i)} \quad\text{for all } i \in \{1, \dots, n\},$$

where the Zi are i.i.d. with exponential distribution of parameter 1 and independent of the Wi. The conditional distribution of Yi given Wi is then exponential with mean 1/θ⋆(Wi), θ⋆ taking its values in I = (0, +∞). Exponential distributions parametrized by θ ∈ I admit densities with respect to the Lebesgue measure on R₊ of the form (2) with S(y) = −y for all y ∈ Y = R₊ and A(θ) = −log θ.

3. The main results

3.1. Construction of the estimator. Our construction follows the ρ-estimation methodology; for a detailed account we refer the reader to Baraud and Birgé (2018). Let ψ be the function defined on [0, +∞] by

(6) $$\psi(x) = \frac{x - 1}{x + 1} \quad\text{for } x \in [0, +\infty) \quad\text{and}\quad \psi(+\infty) = 1.$$

In order to avoid measurability issues, we restrict ourselves to a finite or countable subset Θ̄ of Θ and define

(7) $$\mathbf{T}(X, \theta, \theta') = \sum_{i=1}^{n} \psi\left(\sqrt{\frac{q_{\theta'}(X_i)}{q_{\theta}(X_i)}}\right) \quad\text{for } \theta, \theta' \in \overline{\Theta},$$

with the conventions 0/0 = 1 and a/0 = +∞ for all a > 0. Then, we set

(8) $$\upsilon(X, \theta) = \sup_{\theta' \in \overline{\Theta}} \mathbf{T}(X, \theta, \theta') \quad\text{for all } \theta \in \overline{\Theta}$$

and choose θ̂ = θ̂(X) as any (measurable) element of the random (and non-void) set

(9) $$\mathscr{E}(X) = \left\{\theta \in \overline{\Theta} \text{ such that } \upsilon(X, \theta) \leq \inf_{\theta' \in \overline{\Theta}} \upsilon(X, \theta') + 1\right\}.$$

The random variable θ̂(X) is our estimator of the regression function θ⋆ and we recall that Q_θ̂ = (Q_θ̂, ..., Q_θ̂).

The construction of the estimator is only based on the choices of the exponential family given by (2) and the subset Θ̄ of Θ. In particular, the estimator does not depend on the distributions PWi of the Wi, which may therefore be unknown.

In the right-hand side of (9), the additive constant 1 plays no magic role and can be replaced by any smaller positive number. Whenever possible, the choice of an estimator θ̂ satisfying υ(X, θ̂) = inf_{θ∈Θ̄} υ(X, θ) should be preferred.

The fact that we build our estimator on a finite or countable subset Θ̄ of Θ will not be restrictive, as we shall see. Besides, this assumption is consistent with the practice of calculating an estimator on a computer, which can only handle a finite number of values.
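The following minimal Python sketch (ours, for illustration only) implements (7)-(9) for the Bernoulli family of Example 3 over a small finite set Θ̄ of candidate functions; psi, T and rho_estimator mirror ψ, T and the selection rule (9), and all names, the toy data and the grid Θ̄ are our assumptions:

import numpy as np

def psi(x):
    # psi of (6), with psi(+inf) = 1
    return np.where(np.isinf(x), 1.0, (x - 1.0) / (x + 1.0))

def q(theta, w, y):
    # Bernoulli density of Example 3: q_theta(w)(y) = e^{y theta(w)} / (1 + e^{theta(w)})
    t = theta(w)
    return np.exp(y * t) / (1.0 + np.exp(t))

def T(X, theta, theta_prime):
    # statistic (7); the conventions 0/0 = 1 and a/0 = +inf are handled below
    W, Y = X
    with np.errstate(divide="ignore", invalid="ignore"):
        r = np.sqrt(q(theta_prime, W, Y) / q(theta, W, Y))
    r = np.where(np.isnan(r), 1.0, r)      # 0/0 = 1
    return float(np.sum(psi(r)))

def rho_estimator(X, Theta_bar):
    # pick a minimizer of upsilon(X, .), which certainly belongs to the set (9)
    upsilon = [max(T(X, th, tp) for tp in Theta_bar) for th in Theta_bar]
    return Theta_bar[int(np.argmin(upsilon))]

# toy data from the model theta*(w) = 2w - 1 on W = [0, 1]
rng = np.random.default_rng(1)
W = rng.uniform(size=200)
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(2 * W - 1))))
Theta_bar = [lambda w, a=a, b=b: a * w + b
             for a in np.linspace(-3, 3, 7) for b in np.linspace(-2, 2, 5)]
theta_hat = rho_estimator((W, Y), Theta_bar)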

3.2. The main assumption. Let us make the following assumption.

Assumption 1. The class Θ is VC-subgraph on W with dimension not larger than V ≥ 1.

We recall that Θ is VC-subgraph on W with dimension not larger than V ≥ 1 if, whatever the finite subset S of V + 1 points in W × R, there exists at least one subset S′ of S that is not the intersection of S with an (open) subgraph of a function in Θ, i.e.

(10) $$S' \neq S \cap \{(w, u) \in W \times \mathbb{R},\ u < \theta(w)\} \quad\text{for all } \theta \in \Theta.$$

Assumption 1 is satisfied when Θ is a linear space 𝒱 with finite dimension d ≥ 1, in which case one can take V = d + 1. It also holds for any set Θ of the form {F ∘ β, β ∈ 𝒱} where F is a monotone function on the real line.

Another situation, although elementary, is that of a finite set Θ. As soon as the cardinality k of a set S satisfies 2^k > Card Θ, there exists S′ ⊂ S that fulfils (10). This means that a finite set Θ is VC-subgraph and its dimension is not larger than V = (log(Card Θ)/log 2) ∨ 1.

For more properties of the classes of functions which are VC-subgraph, we refer the reader to van der Vaart and Wellner (1996) [Section 2.6.2].

3.3. The performance of the estimator θ̂. Let us set

(11) $$c_1 = 150, \qquad c_2 = 1.1 \times 10^6, \qquad c_3 = 5014$$

and, for Q ∈ 𝒬_W and 𝒜 ⊂ 𝒬_W,

$$h(\mathbf{Q}, \mathscr{A}) = \inf_{\mathbf{Q}' \in \mathscr{A}} h(\mathbf{Q}, \mathbf{Q}').$$

The risk of our estimator satisfies the following properties.

The risk of our estimator satisfies the following properties.

Theorem 1. Let ξ > 0. Under Assumption 1, whatever the conditional probabilities Q⋆ = (Q⋆1, ..., Q⋆n) of the Yi given Wi and the distributions of the Wi, the estimator θ̂ defined in Section 3.1 satisfies, with probability at least 1 − e^{−ξ},

(12) $$h^2(\mathbf{Q}^{\star}, \mathbf{Q}_{\hat\theta}) \leq c_1 h^2(\mathbf{Q}^{\star}, \overline{\mathscr{Q}}) + c_2 V\left[9.11 + \log_{+}\frac{n}{V}\right] + c_3(1.5 + \xi),$$

where 𝒬̄ = {Q_θ = (Q_θ, ..., Q_θ), θ ∈ Θ̄} and log₊ = max(0, log).

If, in particular, 𝒬̄ is dense in 𝒬 = {Q_θ, θ ∈ Θ} with respect to the Hellinger-type distance h, there exists a numerical constant C > 0 such that the estimator θ̂ satisfies

(13) $$C h^2(\mathbf{Q}^{\star}, \mathbf{Q}_{\hat\theta}) \leq h^2(\mathbf{Q}^{\star}, \mathscr{Q}) + V\left[1 + \log_{+}\frac{n}{V}\right] + \xi$$

with probability at least 1 − e^{−ξ}.

The difference between (12) and (13) depends on the approximation properties of 𝒬̄ = {Q_θ, θ ∈ Θ̄} with respect to 𝒬 = {Q_θ, θ ∈ Θ} (for the Hellinger-type distance h). Actually, whatever Q⋆,

(14) $$h(\mathbf{Q}^{\star}, \overline{\mathscr{Q}}) - h(\mathbf{Q}^{\star}, \mathscr{Q}) \leq \sup_{\theta \in \Theta} h(\mathbf{Q}_{\theta}, \overline{\mathscr{Q}})$$

and, if the right-hand side is not larger than some η ≥ 0, i.e. if 𝒬̄ is an η-net for 𝒬, (12) leads to

$$C h^2(\mathbf{Q}^{\star}, \mathbf{Q}_{\hat\theta}) \leq h^2(\mathbf{Q}^{\star}, \mathscr{Q}) + \eta^2 + V\left[1 + \log_{+}(n/V)\right] + \xi.$$

When Θ̄ is dense in Θ for the topology of pointwise convergence, it can be shown that one can always take η = 0. We shall not comment on our result any further in this direction and rather refer to Baraud and Birgé (2018), Section 4.2. From now on, we shall assume for the sake of simplicity that η = 0, as if Θ̄ = Θ. In the remaining part of this section, C will denote a positive numerical constant that may vary from line to line.

In order to comment further on Theorem 1, we shall present (13) in a slightly different form. We have seen in Section 2 that the quantity h(Q⋆, Q_θ) with θ ∈ Θ, which involves the conditional probabilities of P⋆ and P_θ given the Wi, can also be interpreted in terms of the Hellinger(-type) distance between these two product probabilities. Inequality (13) is therefore equivalent to

(15) $$C h^2(\mathbf{P}^{\star}, \mathbf{P}_{\hat\theta}) \leq h^2(\mathbf{P}^{\star}, \mathscr{P}) + V\left[1 + \log_{+}\frac{n}{V}\right] + \xi,$$

where 𝒫 = {P_θ, θ ∈ Θ}. Integrating this inequality with respect to ξ > 0 leads to the following risk bound for our estimator θ̂:

(16) $$C\,\mathbb{E}\left[h^2(\mathbf{P}^{\star}, \mathbf{P}_{\hat\theta})\right] \leq h^2(\mathbf{P}^{\star}, \mathscr{P}) + V\left[1 + \log_{+}\frac{n}{V}\right].$$

In order to comment upon (16), let us start with the ideal situation where P⋆ belongs to 𝒫, i.e. P⋆ = P_{θ⋆} with θ⋆ ∈ Θ, in which case (16) leads to

(17) $$C\,\mathbb{E}\left[h^2(\mathbf{P}_{\theta^{\star}}, \mathbf{P}_{\hat\theta})\right] \leq V\left[1 + \log_{+}\frac{n}{V}\right].$$

Up to the logarithmic factor, the right-hand side of this inequality is of the expected order of magnitude V for the quantity h²(P_{θ⋆}, P_θ̂).

When the true distribution P⋆ writes as P_{θ⋆} but the regression function θ⋆ does not belong to Θ, or when the conditional distributions of the Yi given Wi do not belong to our exponential family, inequality (16) shows that, as compared to (17), the bound we get involves the approximation term h²(P⋆, 𝒫) that accounts for the fact that our statistical model is misspecified. However, as long as this quantity remains small enough as compared to V[1 + log₊(n/V)], our risk bound will be of the same order as that given by (17) when the model is exact. This property accounts for the stability of our estimation procedure under misspecification.

In order to be more specific, let us assume that our data set has been contaminated in such a way that the true distribution P⋆ of X = (X1, ..., Xn) writes as ⊗ⁿᵢ₌₁ [(1 − α)P_{i,θ} + αR] with α ∈ (0, 1/2), θ ∈ Θ and an arbitrary distribution R or, alternatively, that the data set contains a proportion k/n ≤ α of outliers a1, ..., ak ∈ X, in which case we may write P⋆ (up to a reordering of the observations) as [⊗ⁿ⁻ᵏᵢ₌₁ P_{i,θ}] ⊗ δ_{a1} ⊗ ... ⊗ δ_{ak}. Using the classical inequality h² ≤ D, where D denotes the total variation distance between probabilities, we get

(18) $$h^2(\mathbf{P}^{\star}, \mathscr{P}) \leq h^2(\mathbf{P}^{\star}, \mathbf{P}_{\theta}) \leq \sum_{i=1}^{n} D(P^{\star}_i, P_{i,\theta}) \leq n\alpha,$$

which means that whenever nα remains small as compared to V(1 + log₊(n/V)), the performance of the estimator remains almost the same as if P⋆ were equal to P_θ. The estimator θ̂ therefore possesses some stability properties with respect to contamination and the presence of outliers.

3.4. From a natural to a general exponential family. In Section 2, we focused on an exponential family Q parametrized by its natural parameter. However, statisticians often write exponential families Q under the general form Q = {R_γ = r_γ·ν, γ ∈ J} with

(19) $$r_{\gamma}\colon y \mapsto e^{u(\gamma)S(y) - B(\gamma)} \quad\text{for } \gamma \in J.$$

In (19), J denotes a (non-trivial) interval of R and u a continuous and strictly monotone function from J onto I, so that B = A ∘ u. In the exponential family Q = {R_γ, γ ∈ J} = {Q_θ, θ ∈ I}, the probabilities R_γ are associated to the probabilities Q_θ by the formula R_γ = Q_{u(γ)}.

With this new parametrization, we could alternatively write our statistical model 𝒬 as

(20) $$\mathscr{Q} = \left\{\mathbf{R}_{\gamma} = (R_{\gamma}, \dots, R_{\gamma}),\ \gamma \in \Gamma\right\},$$

where Γ is a class of functions γ from W into J. Starting from such a statistical model and presuming that Q⋆ = R_{γ⋆} for some function γ⋆ ∈ Γ, we could build an estimator γ̂ of γ⋆ as follows: given a finite or countable subset Γ̄ of Γ, we set γ̂ = u⁻¹(θ̂) where θ̂ is any estimator obtained by applying the procedure described in Section 3.1 under the natural parametrization of the exponential family Q and the finite or countable model Θ̄ = {θ = u ∘ γ, γ ∈ Γ̄}.

Since our model 𝒬 for the conditional probabilities Q⋆ is unchanged (only its parametrization changes), it is interesting to establish a result on the performance of the estimator R_γ̂ = Q_θ̂ which is independent of the parametrization. A nice feature of the VC-subgraph property lies in the fact that it is preserved by composition with a monotone function: since u is monotone, if Γ is VC-subgraph with dimension not larger than V, so is Θ, and our Theorem 1 applies. The following corollary is therefore straightforward.

Corollary 1. Let ξ > 0. Assume that the statistical model 𝒬 is given under the general form (20) and that Γ is VC-subgraph on W with dimension not larger than V ≥ 1. Then the estimator γ̂ satisfies, with probability at least 1 − e^{−ξ},

(21) $$h^2(\mathbf{Q}^{\star}, \mathbf{R}_{\hat\gamma}) \leq c_1 h^2(\mathbf{Q}^{\star}, \overline{\mathscr{Q}}) + c_2 V\left[9.11 + \log_{+}\frac{n}{V}\right] + c_3(1.5 + \xi),$$

where 𝒬̄ = {R_γ, γ ∈ Γ̄}. In particular,

(22) $$\mathbb{E}\left[h^2(\mathbf{Q}^{\star}, \mathbf{R}_{\hat\gamma})\right] \leq C'\left[h^2(\mathbf{Q}^{\star}, \overline{\mathscr{Q}}) + V\left[1 + \log_{+}\frac{n}{V}\right]\right]$$

for some numerical constant C′ > 0.

A nice feature of our approach lies in the fact that (21) holds for all exponential families simultaneously and all ways of parametrizing them.

4. Uniform risk bounds

Throughout this section, we shall assume that the Wi are i.i.d. with common distribution P_W and that Q⋆ = R_{γ⋆} belongs to a statistical model of the (general) form given by (20), where Γ is a class H of smooth functions. Our aim is to estimate the regression function γ⋆ under the assumption that it belongs to H. More precisely, we want to evaluate the minimax risk over H, i.e. the quantity

$$R_n(H) = \inf_{\tilde\gamma}\ \sup_{\gamma^{\star} \in H} \mathbb{E}\left[\bar h^2(R_{\gamma^{\star}}, R_{\tilde\gamma})\right],$$

where the infimum runs among all estimators γ̃ of γ⋆ based on the n-sample X1, ..., Xn and, for all functions γ, γ′ from W into J,

$$\bar h^2(R_{\gamma}, R_{\gamma'}) = \frac{1}{n} h^2(\mathbf{R}_{\gamma}, \mathbf{R}_{\gamma'}) = \int_{W} h^2\big(R_{\gamma(w)}, R_{\gamma'(w)}\big)\, dP_W(w).$$

We have seen that Corollary 1 shows that our risk bound is invariant under a parameter change γ = v(θ), with v = u⁻¹. This property is due to the fact that composition with v preserves the VC-subgraph property and the VC-dimension. The situation changes dramatically when θ is assumed to belong to a smoothness class, since a parameter change does not preserve smoothness in general. This means that, under smoothness assumptions, the order of magnitude of R_n(H) with respect to the sample size n depends on our way of parametrizing the exponential family Q, or equivalently of choosing v.


4.1. Parametrizing by the mean. A common parametrization of an exponential family Q = {R_γ, γ ∈ J} is by the means of the distributions, i.e. γ = ∫_Y y dR_γ(y). This is typically the case for the Bernoulli, Gaussian and Poisson families, for example. Our observations (W1, Y1), ..., (Wn, Yn) then satisfy the equality

(23) $$Y_i = \gamma^{\star}(W_i) + \varepsilon_i \quad\text{with}\quad \mathbb{E}[\varepsilon_i\,|\,W_i] = 0 \quad\text{for all } i \in \{1, \dots, n\}.$$

With the relationship between Yi and Wi written in this form, the problem becomes equivalent to that of estimating a regression function in a regression model, and one might expect the rates for estimating γ⋆ under smoothness assumptions to be similar to those established when the errors are i.i.d. Gaussian random variables. Unfortunately, this is not true in general since the regression model given by (23) is actually heteroscedastic and the variances of the errors depend on the values of the regression function. We shall now show that, in this case, the rates can be much slower than those we would get in the Gaussian case.

For α ∈ (0, 1] and M > 0, let H = H_α(M) be the set of functions γ on [0, 1] with values in J that satisfy the Hölder condition

(24) $$|\gamma(x) - \gamma(y)| \leq M|x - y|^{\alpha} \quad\text{for all } x, y \in [0, 1].$$

We prove the following.

Proposition 1. Let α ∈ (0, 1], M > 0, let P_W be the uniform distribution on [0, 1] and let Q be the set of Poisson distributions R_γ with means γ ∈ J = (0, +∞). For all n ≥ 1,

$$R_n(H_{\alpha}(M)) \geq \frac{1 - e^{-1}}{144}\left[\left(\frac{3M^{1/\alpha}}{2^{4 + \alpha + 3/\alpha}\, n}\right)^{\frac{\alpha}{1 + \alpha}} \wedge \frac{M}{8} \wedge \frac{1 + \sqrt{3}}{2}\right].$$

In the Poisson case, the rate for R_n(H_α(M)) is therefore at least of order n^{−α/(1+α)}, hence much slower than the one we would get in the Gaussian case, namely n^{−2α/(2α+1)}. We conclude that the parametrization by the mean leads to minimax rates that do depend on the exponential family. In order to obtain a rate that is independent of the exponential family, we must turn to another way of parametrizing it.

4.2. Connecting the Hellinger distance to the L2-norm. The unusual rate obtained in Proposition 1 can be traced back to the behaviour of the Hellinger distance in the Poisson family. In the Poisson case, for functions γ, γ′ from W into J = (0, +∞),

(25) $$\bar h^2(R_{\gamma}, R_{\gamma'}) = \int_{W}\left[1 - e^{-\left(\sqrt{\gamma(w)} - \sqrt{\gamma'(w)}\right)^2/2}\right] dP_W(w)$$

and therefore behaves like

$$\frac{1}{2}\left\|\sqrt{\gamma} - \sqrt{\gamma'}\right\|_2^2 = \frac{1}{2}\int_{W}\left(\sqrt{\gamma} - \sqrt{\gamma'}\right)^2 dP_W$$

whenever γ and γ′ are close for the supremum norm.

In contrast, if Q is the family of Gaussian distributions with variance 1, we obtain that

(26) $$\bar h^2(R_{\gamma}, R_{\gamma'}) = \int_{W}\left[1 - e^{-(\gamma(w) - \gamma'(w))^2/8}\right] dP_W(w)$$

and this quantity behaves like ‖γ − γ′‖²₂/8 whenever γ and γ′ are close for the supremum norm.

Unless one assumes that the functions γ and γ′ are bounded away from 0 and infinity, the losses h̄²(R_γ, R_{γ′}) are not equivalent in the Poisson and Gaussian cases. In fact, the risk for estimating a regression function γ⋆ ∈ H = H_α(M) in the Poisson case is expected to be of the same order as that for estimating √γ⋆ in the Gaussian case. If γ⋆ is assumed to be of regularity α, the regularity of √γ⋆ is in general not better than α/2. The lower bound we established in Proposition 1 actually corresponds to the usual (Gaussian) rate for estimating functions with regularity α/2.

In order to avoid this phenomenon, we need to choose a parametrization of Q = {R_γ, γ ∈ J} that makes the quantities h̄²(R_γ, R_{γ′}) equivalent in all exponential families, at least when γ and γ′ are close in supremum norm. To achieve this goal it is actually enough to make all these quantities locally equivalent to a fixed one, and we shall choose the L2(P_W)-norm between the functions γ and γ′. For example, in the Poisson case, a look at (25) shows that taking for R_γ the Poisson distribution with parameter γ² instead of γ would make this connection possible.

For a general exponential family Q = {Q_θ, θ ∈ I} given by (2) under its natural form, we shall look for a parametrization γ = v(θ) for which h²(Q_θ, Q_{θ′}) = h²(R_γ, R_{γ′}) is of order |γ − γ′|² whenever γ and γ′ are close enough. When I is an open interval, it is well known from Ibragimov and Has'minskiĭ (1981) that the squared Hellinger distance h²(R_γ, R_{γ′}) behaves locally like I(γ)|γ − γ′|²/8, where I(γ) denotes the Fisher information at γ = v(θ), i.e. the quantity given by the formula A″(θ)[v′(θ)]⁻². As a consequence, if we want h(R_γ, R_{γ′}) to be locally equivalent to |γ − γ′| up to a multiplicative constant that does not depend on γ, it suffices to choose the parametrization v in such a way that this Fisher information is constant, say equal to 8. This means that v must satisfy

(27) $$v'(\theta) = \sqrt{\frac{A''(\theta)}{8}} > 0 \quad\text{for all } \theta \in I.$$

Such a function v is strictly increasing and maps the open interval I onto an open interval J = v(I), which means that v and u = v⁻¹ are both continuously differentiable on I and J respectively. The exponential family Q thus remains regular under the parametrization γ = v(θ) and we can derive the following results.

Proposition 2. Let v = u⁻¹ be a function that satisfies (27) on the open interval I. If the exponential family Q is parametrized by γ = v(θ), i.e. Q = {R_γ = Q_{u(γ)}, γ ∈ J}, then for all functions γ, γ′ on W with values in J,

(28) $$\bar h^2(R_{\gamma}, R_{\gamma'}) \leq \left\|\gamma - \gamma'\right\|_2^2 = \int_{W}(\gamma - \gamma')^2\, dP_W.$$

Besides, for all compact subsets K of J, there exists a constant c_K > 0 such that for all functions γ, γ′ on W with values in K,

$$\bar h^2(R_{\gamma}, R_{\gamma'}) \geq c_K\left\|\gamma - \gamma'\right\|_2^2.$$

This result implies that, under this new parametrization, the Hellinger-type distance h̄ between R_γ and R_{γ′} is equivalent to the L2(P_W)-distance between the functions γ and γ′, at least when γ and γ′ take their values in a compact subset of J.

It is well known that the functions v1 : θ ↦ (1/√2) arcsin((1 + e^{−θ})^{−1/2}) for θ ∈ R, v2 : θ ↦ (1/√2) e^{θ/2} for θ ∈ R and v3 : θ ↦ (√8)⁻¹ log θ for θ > 0 satisfy Condition (27) in the cases of Examples 3, 4 and 5 respectively.
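As a check in the Poisson case of Example 4 (a direct computation, added for the reader's convenience): there A(θ) = e^θ − 1, so A″(θ) = e^θ and (27) reads

$$v'(\theta) = \sqrt{\frac{e^{\theta}}{8}} = \frac{e^{\theta/2}}{2\sqrt{2}}, \qquad\text{whence}\qquad v(\theta) = \frac{e^{\theta/2}}{\sqrt{2}} + \mathrm{const},$$

which is the function v2 above, up to an irrelevant additive constant.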

4.3. Uniform risk bounds over Hölder classes. Let us now assume that the exponential family Q = {R_γ, γ ∈ J} has been parametrized in such a way that there exists a constant κ > 0 such that

(29) $$h(R_{\gamma}, R_{\gamma'}) \leq \kappa\,|\gamma - \gamma'| \quad\text{for all } \gamma, \gamma' \in J$$

and that, for some (non-trivial) compact interval K ⊂ J, there exists a constant c_K > 0 such that

(30) $$h(R_{\gamma}, R_{\gamma'}) \geq c_K\,|\gamma - \gamma'| \quad\text{for all } \gamma, \gamma' \in K.$$

We have seen in Proposition 2 that (29) and (30) are both satisfied (with κ = 1) as soon as γ is a function v of the natural parameter θ that fulfils (27). For α ∈ (0, 1] and M > 0, we consider the class H_α(M) of functions γ defined on W = [0, 1] with values in J that satisfy (24). We can then derive the following upper bound for the risk. It holds whatever the distributions PWi.

Proposition 3. Let α ∈ (0, 1] and M > 0. If (29) is satisfied, then

(31) $$R_n(H_{\alpha}(M)) \leq 2C'\left[\left(\frac{(\kappa M)^{1/\alpha}\log(en)}{n}\right)^{\frac{2\alpha}{1 + 2\alpha}} + \frac{3\log(en)}{2n}\right].$$


The next result shows that, up to a logarithmic factor, the rate n^{−2α/(1+2α)} is optimal, at least when the Wi are uniformly distributed on [0, 1].

Proposition 4. Let α ∈ (0, 1], M > 0 and n ≥ 1. If P_W is the uniform distribution on [0, 1] and the inequalities (29) and (30) are satisfied for an interval K of length 2L > 0, then

$$R_n(H_{\alpha}(M)) \geq \frac{c_K^2}{48}\left[\left(\frac{3M^{1/\alpha}}{2^{2\alpha + 4 + 1/\alpha}\,\kappa^2 n}\right)^{\frac{2\alpha}{1 + 2\alpha}} \wedge \frac{M^2}{4} \wedge L^2\right].$$

It follows from Proposition 2 that the risk bound (31) holds with κ = 1 when the exponential family is parametrized by γ = v(θ) and v satisfies (27). Proposition 3 thus shows that it is possible to establish a risk bound of optimal order on R_n(H_α(M)) in great generality, i.e. independently of the exponential family, provided that it is suitably parametrized.

The result established in Proposition 3 relies on the crucial property that ρ-estimators still perform correctly on a parameter space that does not necessarily contain the true parameter but only provides a suitable approximation of it. We may therefore use (21) together with the well-known fact that the Hölder class H_α(M) can be well approximated in supremum norm, hence in L2(P_W)-norm, by linear spaces which are VC-classes; then (28) enables us to control the Hellinger-type approximation properties of these VC-classes by their L2(P_W) ones. Our approach is not restricted to Hölder classes and can easily be extended to any functional space H the elements of which are well approximated by linear spaces.

5. Calculation of ρ-estimators and simulation study

Let us now go back to Examples 3, 4 and 5, which correspond respectively to the Bernoulli, Poisson and exponential distributions parametrized by their natural parameters. For these three cases, we shall illustrate by a simulation study the respective performance of the ρ-estimator θ̂ and the MLE in two different situations: when the statistical model is exact and when it is not. For the Poisson and exponential distributions, we shall in addition study a median-based estimator θ̂0, defined as any minimizer over Θ̄ of the criterion

$$\theta \mapsto \sum_{i=1}^{n}\left|Y_i - m(\theta(W_i))\right|,$$

where m(θ) is the median of the distribution Q_θ, θ ∈ I, or an approximation of it. We set m(θ) = e^θ + 1/3 − 0.02e^{−θ} for the Poisson distribution and m(θ) = (log 2)/θ for the exponential one. This estimator is chosen for its robustness properties with respect to contamination and outliers; a minimal computational sketch is given below.
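The following Python sketch (ours, for illustration; the optimizer and the affine form of θ are assumptions, not the parametrization (32) used in our simulations) computes such a median-based estimator θ̂0 in the Poisson case by minimizing the above L1 criterion:

import numpy as np
from scipy.optimize import minimize

def m_poisson(t):
    # approximate median of a Poisson distribution with mean e^t
    return np.exp(t) + 1.0 / 3.0 - 0.02 * np.exp(-t)

def median_criterion(eta, W, Y):
    # theta(w) = eta_0 + <(eta_1, ..., eta_5), w>; an affine theta is assumed
    theta = eta[0] + W @ eta[1:]
    return np.sum(np.abs(Y - m_poisson(theta)))

# toy Poisson data with an affine regression function
rng = np.random.default_rng(2)
n, d = 500, 5
W = rng.uniform(0.0, 0.3, size=(n, d))
eta_true = np.array([0.5, 1.0, 1.0, 1.0, 1.0, 1.0])
Y = rng.poisson(np.exp(eta_true[0] + W @ eta_true[1:]))

# minimizer of the L1 criterion; Nelder-Mead avoids differentiating the
# non-smooth objective
res = minimize(median_criterion, x0=np.zeros(d + 1), args=(W, Y),
               method="Nelder-Mead", options={"maxiter": 5000})
eta_hat = res.x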


We consider a model Θ of functions θ from W ⊂ R⁵ into I, with I = R in the logit and Poisson models and I = (0, +∞) in the exponential one, that take the following forms for each of these exponential families: for all w = (w1, ..., w5) ∈ W,

(32) $$\theta(w) = \begin{cases} \eta_0 + \sum_{j=1}^{5}\eta_j w_j & \text{(logit)}\\[4pt] \log\log\left[1 + \exp\left(\eta_0 + \sum_{j=1}^{5}\eta_j w_j\right)\right] & \text{(Poisson)}\\[4pt] \log\left[1 + \exp\left(\eta_0 + \sum_{j=1}^{5}\eta_j w_j\right)\right] & \text{(exponential)} \end{cases}$$

where η = (η0, ..., η5) belongs to R⁶. In particular, when P⋆ = P_θ with θ given by (32), the conditional expectation of Y1 given W1 = w satisfies

$$\mathbb{E}[Y_1\,|\,W_1 = w] = \begin{cases} \log\left[1 + \exp\left(\eta_0 + \sum_{j=1}^{5}\eta_j w_j\right)\right] & \text{(Poisson)}\\[4pt] \left(\log\left[1 + \exp\left(\eta_0 + \sum_{j=1}^{5}\eta_j w_j\right)\right]\right)^{-1} & \text{(exponential)}. \end{cases}$$

The set Θ is VC-subgraph with dimension not larger than 7.

5.1. Calculation of the ρ-estimator. As mentioned in Section 3, we call ρ-estimator θ̂ = θ̂(X) any element of the random set

$$\mathscr{E}(X) = \left\{\theta \in \overline{\Theta} \text{ such that } \upsilon(X, \theta) \leq \inf_{\theta' \in \overline{\Theta}} \upsilon(X, \theta') + 1\right\},$$

where

$$\upsilon(X, \theta) = \sup_{\theta' \in \overline{\Theta}} \mathbf{T}(X, \theta, \theta') \quad\text{with}\quad \mathbf{T}(X, \theta, \theta') = \sum_{i=1}^{n}\psi\left(\sqrt{\frac{q_{\theta'}(X_i)}{q_{\theta}(X_i)}}\right) \quad\text{for all } \theta, \theta' \in \overline{\Theta}.$$

An interesting feature of this definition lies in the fact that if θ̂ satisfies υ(X, θ̂) ≤ ε for some ε ≤ 1, it is necessarily a ρ-estimator. Indeed, since υ(X, θ′) ≥ T(X, θ′, θ′) = 0 for all θ′ ∈ Θ̄, if υ(X, θ̂) ≤ ε then

$$\upsilon(X, \hat\theta) \leq \varepsilon \leq \inf_{\theta' \in \overline{\Theta}} \upsilon(X, \theta') + \varepsilon \leq \inf_{\theta' \in \overline{\Theta}} \upsilon(X, \theta') + 1,$$

and θ̂ therefore belongs to 𝓔(X). Consequently, in order to calculate our ρ-estimator, we may look for an element θ̂ that satisfies υ(X, θ̂) ≤ ε and, to find such a θ̂, we use the following algorithm. We first initialise the process at some value θ0 ∈ Θ̄. We know from Baraud and Birgé (2018) that, for all θ ∈ Θ, the quantity T(X, θ0, θ) is an estimator of the difference h²(Q⋆, Q_{θ0}) − h²(Q⋆, Q_θ). More precisely, the following inequalities hold:

$$a_1 h^2(\mathbf{Q}^{\star}, \mathbf{Q}_{\theta_0}) - a_0 h^2(\mathbf{Q}^{\star}, \mathbf{Q}_{\theta}) \leq \mathbb{E}\left[\mathbf{T}(X, \theta_0, \theta)\right] \leq a_0 h^2(\mathbf{Q}^{\star}, \mathbf{Q}_{\theta_0}) - a_1 h^2(\mathbf{Q}^{\star}, \mathbf{Q}_{\theta})$$

for some positive numerical constants a0 and a1. Note that the mappings θ ↦ a0h²(Q⋆, Q_{θ0}) − a1h²(Q⋆, Q_θ) and θ ↦ a1h²(Q⋆, Q_{θ0}) − a0h²(Q⋆, Q_θ) are both maximal at a value of θ that minimizes h²(Q⋆, Q_θ). This suggests maximizing instead their empirical counterpart T(X, θ0, θ) and defining θ1 as the maximizer of θ ↦ T(X, θ0, θ) over Θ̄. If the quantity

$$\mathbf{T}(X, \theta_0, \theta_1) = \sup_{\theta \in \overline{\Theta}} \mathbf{T}(X, \theta_0, \theta) = \upsilon(X, \theta_0)$$

is not larger than ε, θ0 is necessarily a ρ-estimator. Otherwise, we iterate the process, taking for θ0 the value θ1, and stop as soon as the condition υ(X, θ0) ≤ ε is satisfied or the number of loops exceeds some given number L > 0. In our simulations, we chose ε = 1 and L = 100. To find the maximizer of the mapping θ ↦ T(X, θ0, θ) we used the CMA (Covariance Matrix Adaptation) method, which turned out to be more stable than gradient descent. For more details about the CMA method, we refer the reader to Hansen (2016).

Algorithm 1 Searching for the ρ-estimator
Input: X = (X1, ..., Xn): the data; θ0: the starting point
Output: θ̂
1: Initialize l = 0, θ̂ = θ0
2: while υ(X, θ̂) > ε and l ≤ L do
3:   l ← l + 1
4:   θ1 = argmax_{θ∈Θ̄} T(X, θ̂, θ)
5:   θ̂ ← θ1
6: end while
7: Return θ̂

To initialize the process we choose the value of θ0 as follows. In the case of logit regression, we take for θ0 the function on R^d that minimizes over Θ̄ the penalized criterion (implemented in the e1071 R-package)

$$\theta \mapsto 10\sum_{i=1}^{n}\left(1 - Y_i\theta(W_i)\right)_{+} + \frac{1}{2}\sum_{i=1}^{d}\left|\theta(e_i) - \theta(0)\right|^2,$$

where e1, ..., ed denotes the canonical basis of R^d. The e1071 R-package is used here for the purpose of classifying the Yi from the Wi. For the other exponential families we choose θ0 = θ̂0.

5.2. Comparisons of the estimators when the model is exact. Throughout this section, we assume that the data X1, ..., Xn are i.i.d. with distribution P_{θ⋆} = Q_{θ⋆}·P_W, θ⋆ ∈ Θ, and we evaluate the risk

$$\mathbb{E}\left[\bar h^2(P_{\theta^{\star}}, P_{\tilde\theta})\right] = \mathbb{E}\left[\int_{W} h^2\big(Q_{\theta^{\star}(w)}, Q_{\tilde\theta(X)(w)}\big)\, dP_W(w)\right]$$

of an estimator θ̃ by a Monte Carlo method: we simulate the data 100 times in order to build 100 i.i.d. copies of the estimator θ̃(X). Then we generate independently a sample W′1, ..., W′1000 of size 1000 of the distribution P_W and compute

(33) $$\widehat{R}_n(\tilde\theta) = \frac{1}{100}\sum_{i=1}^{100}\left[\frac{1}{1000}\sum_{j=1}^{1000} h^2\big(Q_{\theta^{\star}(W'_j)}, Q_{\tilde\theta(X_i)(W'_j)}\big)\right].$$

In (33) the Hellinger distances are calculated for all θ, θ′ ∈ I according to the formula

(34) $$h^2(Q_{\theta}, Q_{\theta'}) = 1 - \exp\left[A\left(\frac{\theta + \theta'}{2}\right) - \frac{A(\theta) + A(\theta')}{2}\right],$$

where A is given in (2). For this simulation study n = 500.
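Formula (34) is straightforward to evaluate numerically; here is a short Python sketch (ours) for the three families of Examples 3-5, using the cumulant functions A recalled in Section 2.1:

import numpy as np

# cumulant functions A(theta) of Examples 3 (Bernoulli), 4 (Poisson), 5 (exponential)
A = {
    "logit": lambda t: np.log1p(np.exp(t)),   # A(theta) = log(1 + e^theta)
    "poisson": lambda t: np.exp(t) - 1.0,     # A(theta) = e^theta - 1
    "exponential": lambda t: -np.log(t),      # A(theta) = -log(theta), theta > 0
}

def hellinger2(theta, theta_prime, family):
    # squared Hellinger distance (34) between Q_theta and Q_theta'
    a = A[family]
    return 1.0 - np.exp(a(0.5 * (theta + theta_prime))
                        - 0.5 * (a(theta) + a(theta_prime)))

print(hellinger2(0.0, 1.0, "poisson"))        # example usage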

Logit model. We consider the function θ⋆ = θ given by (32) with η = (1, ..., 1) ∈ R⁶ and the distribution P_W = (P_W⁽¹⁾ + P_W⁽²⁾ + P_W⁽³⁾)/3, where P_W⁽¹⁾, P_W⁽²⁾ and P_W⁽³⁾ are respectively the uniform distributions on the cubes [−a, a]⁵, [b − 0.25, b + 0.25]⁵ and [−b − 0.25, −b + 0.25]⁵ with a = 0.25 and b = 2.

Poisson model. In this case θ⋆ = θ given by (32) with η = (0.7, 3, 4, 10, 2, 5) and P_W = P_{W,1}^{⊗2} ⊗ P_{W,2} ⊗ P_{W,3}^{⊗2}, where P_{W,1}, P_{W,2} and P_{W,3} are the uniform distributions on [0.2, 0.25], [0.2, 0.3] and [0.1, 0.2] respectively.

Exponential model. We set θ⋆ = θ given by (32) with η = (0.07, 3, 4, 6, 2, 1) and P_W = P_{W,1}^{⊗3} ⊗ P_{W,2}^{⊗2}, where P_{W,1} and P_{W,2} are the uniform distributions on [0, 0.01] and [0, 0.1] respectively.

In order to compare the performance of the ρ-estimator to the two other estimators of interest, we first compute the estimated risk R̂n(θ̂) of our estimator and then, given another estimator θ̃ (either the MLE or the median-based estimator θ̂0), the quantity

(35) $$\mathcal{E}(\tilde\theta) = \frac{\widehat{R}_n(\tilde\theta) - \widehat{R}_n(\hat\theta)}{\widehat{R}_n(\hat\theta)}.$$

Note that large positive values of 𝓔(θ̃) indicate a significant superiority of our estimator as compared to θ̃, and small negative values a slight inferiority. The respective values of R̂n(θ̂) and 𝓔(θ̃) are displayed in Table 1.

We observe that the quantities R̂n(θ̂) are of order 1.4 × 10⁻³ in all three cases.

Table 1. Values of R̂n(θ̂) and 𝓔(θ̃) given by (33) and (35) respectively when the model is well-specified

              R̂n(θ̂)    MLE       θ̂0
Logit         0.0013    +0.45%    −
Poisson       0.0016    −0.34%    +440%
Exponential   0.0014    −0.58%    +120%

When the model is correct, the risks of the MLE and θ̂ are quite similar. In fact, a look at the simulations shows that the ρ-estimator coincides most of the time with the MLE, a fact which is consistent with the result proved in Baraud et al (2017) (Section 5) that states the following: under (strong enough) assumptions, the MLE is a ρ-estimator when the statistical model is regular, exact and n is large enough. Our simulations indicate that the result actually holds under weaker assumptions. In this case, both the MLE and the ρ-estimator outperform the estimator θ̂0. Moreover, we observe that in all cases the algorithm returns the ρ-estimator after at most two steps.

5.3. Comparisons of the estimators in presence of outliers. We now work with n = 501 independent random variables X1, ..., Xn. The first 500

variables X1, ..., Xn−1 are i.i.d. and simply follow the framework of the previous section, with the same distributions P_{θ⋆} = Q_{θ⋆}·P_W and parameter values θ⋆ ∈ Θ we previously used for each of our three exponential families. But we now add to the sample an extra variable Xn = (Wn, Yn) which is an outlier. We still evaluate the performance of an estimator θ̃(X) by its empirical risk (33). For the logit regression we take Wn = 1000(1, 1, 1, 1, 1) and Yn = 0, for the Poisson family Wn = 0.1(1, 1, 1, 1, 1) and Yn = 200, and for the exponential distribution Wn = 5 × 10⁻³(1, 1, 1, 10, 10) and Yn = 1000. The results are displayed in Table 2. We note that the risks of the ρ-estimator remain of the same order as in the well-specified case, whereas the risk of the MLE explodes.

Table 2. Values of R̂n(θ̂) and 𝓔(θ̃) given by (33) and (35) respectively in presence of an outlier

              R̂n(θ̂)    MLE        θ̂0
Logit         0.0014    +15000%    −
Poisson       0.0019    +1900%     +330%
Exponential   0.0015    +7400%     +99%


Table 3. Quartiles for the number of loops in presence of outliers

              1st Quartile   Median   3rd Quartile   Maximum
Logit         3              3        3              6
Poisson       2              2        2              3
Exponential   2              2        2              3

Table 3 displays the quartiles of the empirical distributions, based on our 100 simulations, of the number of loops l that were necessary for our algorithm to compute the ρ-estimator. It shows that the presence of the outlier has slightly increased the number of loops needed to compute the ρ-estimator. We recall that when the model was exact, at most two loops were necessary.

5.4. Comparisons of the estimators when the data are contaminated. We now set n = 500 and keep the respective values of the distributions P_{θ⋆} of Section 5.2, but we now assume that the true distribution of the i.i.d. random variables X1, ..., Xn is P⋆ = 0.95P_{θ⋆} + 0.05R for some distribution R on W × Y, thus allowing an amount of contamination of 5%. In such a situation, the squared Hellinger distance between the true distribution P⋆ and the model is bounded by h̄²(P⋆, P_{θ⋆}) ≤ 5% according to (18). We assume that R is the distribution of a pair of random variables (W′, Y′), W′ with distribution P_W and the conditional distribution of Y′ given W′ = w admitting a density r_w with respect to ν. As before, we evaluate the risk

$$R_n(\tilde\theta) = \mathbb{E}\left[\bar h^2(P^{\star}, P_{\tilde\theta})\right] = \mathbb{E}\left[\int_{W} h^2\big(Q^{\star}(w), Q_{\tilde\theta(X)(w)}\big)\, dP_W(w)\right]$$

of an estimator θ̃(X) by the Monte Carlo method. To compute the integral involved in the definition of the Hellinger distance

$$h^2\big(Q^{\star}(w), Q_{\tilde\theta(X)(w)}\big) = \frac{1}{2}\int_{Y}\left(\sqrt{0.95\,q_{\theta^{\star}(w)}(y) + 0.05\,r_w(y)} - \sqrt{q_{\tilde\theta(w)}(y)}\right)^2 d\nu(y)$$

we used a numerical approximation. For each estimator θ̃ of θ⋆, we obtain an estimation R̃n(θ̃) of Rn(θ̃). As before, in order to compare the performance

of the other estimators to θ̂, we evaluate

(36) $$\mathcal{E}'(\tilde\theta) = \frac{\widetilde{R}_n(\tilde\theta) - \widetilde{R}_n(\hat\theta)}{\widetilde{R}_n(\hat\theta)}.$$

In the Poisson case, the random variable Y′ writes as 80 + B, where the conditional distribution of B given W = (w1, ..., w5) is Bernoulli with a mean depending on w. In the case of the exponential distribution, Y′ is independent of W and uniformly distributed on [50, 60].

With these values of the distribution R, the approximation error of the model, which is bounded by h̄²(P⋆, P_{θ⋆}), is not larger than 0.025 in the Poisson case and 0.029 for the exponential distribution, hence smaller than 5% as expected. The corresponding results are presented in Table 4.

Table 4. Values of R̃n(θ̂) and 𝓔′(θ̃) given by (36) when (on average) 5% of the data are contaminated

              R̃n(θ̂)   MLE      θ̂0
Poisson       0.027    +760%    +11%
Exponential   0.039    +340%    −15%

For both models we see that the risk of the ρ-estimator is of the same order of magnitude as the approximation term h̄²(P⋆, 𝒫) ≤ h̄²(P⋆, P_{θ⋆}). As before, the MLE behaves poorly: its risk explodes with the contamination, while the performance of the median-based estimator θ̂0 is comparable to that of the ρ-estimator.

In Table 5, we observe that the number of loops for calculating the ρ-estimator increases substantially as compared to the two previous situations. Our algorithm seems to have more difficulty reaching the ρ-estimator. Nevertheless, the fact that it sometimes has to stop at l = 100, i.e. before reaching its goal, does not alter its final performance as described by Table 4.

Table 5. Quartiles for the number of loops when the data are contaminated

              1st Quartile   Median   3rd Quartile   Maximum
Poisson       5              5        5              100
Exponential   5              11       40             100

In conclusion, in our examples, we have seen that if the model is exact the MLE performs well but cannot deal with even a slight misspecification of it: the presence of a single outlier makes its risk explode. The median-based estimator θ̂0 performs poorly as compared to the ρ-estimator when the model is exact and when the data set contains an outlier. Its performance becomes comparable to that of the ρ-estimator only when the data are contaminated. As compared to the MLE and θ̂0, only the ρ-estimator shows good and stable performance in all the situations we have considered.


6. An upper bound on the expectation of the supremum of an empirical process over a VC-subgraph class

The aim of this section is to prove the following result.

Theorem 2. Let X1, ..., Xn be n independent random variables with values in (X, 𝒳) and F an at most countable VC-subgraph class of functions with values in [−1, 1] and VC-dimension not larger than V ≥ 1. If

$$Z(\mathscr{F}) = \sup_{f \in \mathscr{F}}\sum_{i=1}^{n}\big(f(X_i) - \mathbb{E}[f(X_i)]\big) \quad\text{and}\quad \sup_{f \in \mathscr{F}}\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\big[f^2(X_i)\big] \leq \sigma^2 \leq 1,$$

then

(37) $$\mathbb{E}[Z(\mathscr{F})] \leq 4.74\sqrt{nV\sigma^2\mathcal{L}(\sigma)} + 90\,V\mathcal{L}(\sigma), \quad\text{with}\quad \mathcal{L}(\sigma) = 9.11 + \log(1/\sigma^2).$$

Let us now turn to the proof. It follows from classical symmetrisation arguments that E[Z(F)] ≤ 2E[Z̄(F)], where

$$\overline{Z}(\mathscr{F}) = \sup_{f \in \mathscr{F}}\sum_{i=1}^{n}\varepsilon_i f(X_i)$$

and ε1, ..., εn are i.i.d. Rademacher random variables. It is therefore enough to prove that

(38) $$\mathbb{E}\big[\overline{Z}(\mathscr{F})\big] \leq 2.37\sqrt{nV\sigma^2\mathcal{L}(\sigma)} + 45\,V\mathcal{L}(\sigma).$$

Given a probability P and a class of functions 𝒢 on a measurable space (E, ℰ), we denote by N_r(ε, 𝒢, P) the smallest cardinality of an ε-net for the L_r(E, ℰ, P)-norm ‖·‖_{r,P}, i.e. the minimal cardinality of a subset 𝒢[ε] of 𝒢 that satisfies, for all g ∈ 𝒢,

$$\inf_{\bar g \in \mathscr{G}[\epsilon]}\left\|g - \bar g\right\|_{r,P} = \inf_{\bar g \in \mathscr{G}[\epsilon]}\left(\int_{E}|g - \bar g|^r\, dP\right)^{1/r} \leq \epsilon.$$

We start with the following lemma.

Lemma 1. Whatever the probability P on (X, 𝒳), ε ∈ (0, 2) and r ≥ 1,

$$N_r(\epsilon, \mathscr{F}, P) \leq e(V + 1)(2e)^V\left(\frac{2}{\epsilon}\right)^{rV}.$$

Proof of Lemma 1. Let λ be the Lebesgue measure on ([−1, 1], B([−1, 1])) and Q the product probability P ⊗ (λ/2) on (E, ℰ) = (X × [−1, 1], 𝒳 ⊗ B([−1, 1])). Given two elements f, g ∈ F and x ∈ X, |f(x) − g(x)| is the Lebesgue measure of the set of those t ∈ [−1, 1] that lie between f(x) and g(x) and, setting C_f = {(x, t) ∈ X × [−1, 1], f(x) > t} for the subgraph of f and similarly C_g for that of g, we deduce from Fubini's theorem that

$$\|f - g\|_{1,P} = \int_{X}|f - g|\, dP = 2\int_{X \times [-1,1]}\left|\mathbf{1}_{C_f}(x, t) - \mathbf{1}_{C_g}(x, t)\right| dQ = 2\left\|\mathbf{1}_{C_f} - \mathbf{1}_{C_g}\right\|_{1,Q}.$$

Since the functions f, g ∈ F take their values in [−1, 1],

$$\|f - g\|_{r,P}^{r} = \int_{X}|f - g|^{r}\, dP \leq 2^{r-1}\int_{X}|f - g|\, dP \leq 2^{r}\left\|\mathbf{1}_{C_f} - \mathbf{1}_{C_g}\right\|_{1,Q}$$

and consequently, for all ε > 0,

$$N_r(\epsilon, \mathscr{F}, P) \leq N_1\big((\epsilon/2)^r, \mathscr{G}, Q\big) \quad\text{with}\quad \mathscr{G} = \{\mathbf{1}_{C_f},\ f \in \mathscr{F}\}.$$

Since F is VC-subgraph with VC-dimension not larger than V, the class 𝒢 is by definition VC with dimension not larger than V and the result follows from Corollary 1 in Haussler (1995). □

The proof of Theorem 2 is based on a chaining argument. It follows from the monotone convergence theorem that it is actually enough to prove (38) with F_J, J ≥ 1, in place of F, where (F_J)_{J≥1} is a sequence of finite subsets of F which is increasing for the inclusion and satisfies ∪_{J≥1}F_J = F. We may therefore assume with no loss of generality that F is finite.

Let q be some positive number in (0, 1) to be chosen later on and P_X the empirical distribution n⁻¹Σⁿᵢ₌₁δ_{Xi}. We shall denote by E_ε the expectation with respect to the Rademacher random variables εi, hence conditionally on X = (X1, ..., Xn). We set

$$\hat\sigma^2 = \hat\sigma^2(X) = \sup_{f \in \mathscr{F}}\|f\|_{2,X}^2 = \sup_{f \in \mathscr{F}}\left[\frac{1}{n}\sum_{i=1}^{n}f^2(X_i)\right] \in [0, 1].$$

For each positive integer k, let F_k = F_k(X) be a minimal (q^k σ̂)-net for F with respect to the L2(X, 𝒳, P_X)-norm denoted ‖·‖_{2,X}. In particular, we can associate to a function f ∈ F a sequence (f_k)_{k≥1} with f_k ∈ F_k satisfying ‖f − f_k‖_{2,X} ≤ q^k σ̂ for all k ≥ 1. Actually, since F is finite, f_k = f for all k large enough. Besides, it follows from Lemma 1 with the choices r = 2 and P = P_X that for all k ≥ 1 we can choose F_k in such a way that log[Card F_k] is not larger than h(q^k σ̂), where

(39) $$h(\epsilon) = \log\left[e(V + 1)(2e)^V\right] + 2V\log\left(\frac{2}{\epsilon}\right).$$


For f ∈ F, the following (finite) decomposition holds:

$$\sum_{i=1}^{n}\varepsilon_i f(X_i) = \sum_{i=1}^{n}\varepsilon_i f_1(X_i) + \sum_{i=1}^{n}\varepsilon_i\sum_{k=1}^{+\infty}\left[f_{k+1}(X_i) - f_k(X_i)\right] = \sum_{i=1}^{n}\varepsilon_i f_1(X_i) + \sum_{k=1}^{+\infty}\left[\sum_{i=1}^{n}\varepsilon_i\big(f_{k+1}(X_i) - f_k(X_i)\big)\right].$$

Setting F_k² = {(f_k, f_{k+1}), f ∈ F} for all k ≥ 1, we deduce that

$$\overline{Z}(\mathscr{F}) \leq \sup_{f \in \mathscr{F}_1}\sum_{i=1}^{n}\varepsilon_i f(X_i) + \sum_{k=1}^{+\infty}\sup_{(f_k, f_{k+1}) \in \mathscr{F}_k^2}\sum_{i=1}^{n}\varepsilon_i\left[f_{k+1}(X_i) - f_k(X_i)\right]$$

and consequently,

$$\mathbb{E}_{\varepsilon}\big[\overline{Z}(\mathscr{F})\big] \leq \mathbb{E}_{\varepsilon}\left[\sup_{f \in \mathscr{F}_1}\sum_{i=1}^{n}\varepsilon_i f(X_i)\right] + \sum_{k=1}^{+\infty}\mathbb{E}_{\varepsilon}\left[\sup_{(f_k, f_{k+1}) \in \mathscr{F}_k^2}\sum_{i=1}^{n}\varepsilon_i\left[f_{k+1}(X_i) - f_k(X_i)\right]\right].$$

Given a finite set 𝒢 of functions on X and setting −𝒢 = {−g, g ∈ 𝒢} and v² = max_{g∈𝒢}‖g‖²_{2,X}, we shall repeatedly use the inequality

$$\mathbb{E}_{\varepsilon}\left[\sup_{g \in \mathscr{G}}\left|\sum_{i=1}^{n}\varepsilon_i g(X_i)\right|\right] = \mathbb{E}_{\varepsilon}\left[\sup_{g \in \mathscr{G} \cup (-\mathscr{G})}\sum_{i=1}^{n}\varepsilon_i g(X_i)\right] \leq \sqrt{2n\log(2\,\mathrm{Card}\,\mathscr{G})\,v^2},$$

which can be found in Massart (2007) [inequality (6.3)]. Since max_{f∈F₁}‖f‖²_{2,X} ≤ σ̂², log(Card F₁) ≤ h(qσ̂), log(Card F_k²) ≤ h(q^k σ̂) + h(q^{k+1}σ̂) and

$$\sup_{(f_k, f_{k+1}) \in \mathscr{F}_k^2}\left\|f_k - f_{k+1}\right\|_{2,X}^2 \leq \sup_{f \in \mathscr{F}}\big(\|f - f_k\|_{2,X} + \|f - f_{k+1}\|_{2,X}\big)^2 \leq (1 + q)^2 q^{2k}\hat\sigma^2,$$

we deduce that

$$\mathbb{E}_{\varepsilon}\left[\sup_{f \in \mathscr{F}_1}\sum_{i=1}^{n}\varepsilon_i f(X_i)\right] \leq \hat\sigma\sqrt{2n\left(\log 2 + h(q\hat\sigma)\right)}$$

and, for all k ≥ 1,

$$\mathbb{E}_{\varepsilon}\left[\sup_{(f_k, f_{k+1}) \in \mathscr{F}_k^2}\sum_{i=1}^{n}\varepsilon_i\left[f_{k+1}(X_i) - f_k(X_i)\right]\right] \leq (1 + q)q^k\hat\sigma\sqrt{2n\left(\log 2 + h(q^k\hat\sigma) + h(q^{k+1}\hat\sigma)\right)}.$$

Setting g : u ↦ √(log 2 + h(u) + h(qu)) on (0, 1] and using the fact that g is decreasing (since h is), we deduce that

$$\begin{aligned} \mathbb{E}_{\varepsilon}\big[\overline{Z}(\mathscr{F})\big] &\leq \hat\sigma\sqrt{2n}\left[\sqrt{\log 2 + h(q\hat\sigma)} + (1 + q)\sum_{k \geq 1} q^k\sqrt{\log 2 + h(q^k\hat\sigma) + h(q^{k+1}\hat\sigma)}\right]\\ &\leq \hat\sigma\sqrt{2n}\left[g(\hat\sigma) + (1 + q)\sum_{k \geq 1} q^k g(q^k\hat\sigma)\right]\\ &\leq \sqrt{2n}\left[\frac{1}{1 - q}\int_{q\hat\sigma}^{\hat\sigma} g(u)\, du + \frac{1 + q}{1 - q}\sum_{k \geq 1}\int_{q^{k+1}\hat\sigma}^{q^k\hat\sigma} g(u)\, du\right] \leq \sqrt{2n}\,\frac{1 + q}{1 - q}\int_{0}^{\hat\sigma} g(u)\, du. \end{aligned}$$

The mapping g being positive and decreasing, the function G : y ↦ ∫₀^y g(u) du is increasing and concave. Taking the expectation with respect to X on both sides of the previous inequality and using Jensen's inequality, we get

(40) $$\mathbb{E}\big[\overline{Z}(\mathscr{F})\big] \leq \sqrt{2n}\,\frac{1 + q}{1 - q}\,\mathbb{E}[G(\hat\sigma)] \leq \sqrt{2n}\,\frac{1 + q}{1 - q}\,G(\mathbb{E}[\hat\sigma]) \leq \sqrt{2n}\,\frac{1 + q}{1 - q}\,G\left(\sqrt{\mathbb{E}[\hat\sigma^2]}\right).$$

By symmetrization and contraction arguments (see Theorem 4.12 in Ledoux and Talagrand (1991)),

(41) $$\begin{aligned} \mathbb{E}\left[n\hat\sigma^2\right] &\leq \mathbb{E}\left[\sup_{f \in \mathscr{F}}\sum_{i=1}^{n}\left(f^2(X_i) - \mathbb{E}\left[f^2(X_i)\right]\right)\right] + \sup_{f \in \mathscr{F}}\sum_{i=1}^{n}\mathbb{E}\left[f^2(X_i)\right]\\ &\leq 2\,\mathbb{E}\left[\sup_{f \in \mathscr{F}}\sum_{i=1}^{n}\varepsilon_i f^2(X_i)\right] + n\sigma^2 \leq 8\,\mathbb{E}\left[\sup_{f \in \mathscr{F}}\sum_{i=1}^{n}\varepsilon_i f(X_i)\right] + n\sigma^2 = 8\,\mathbb{E}\big[\overline{Z}(\mathscr{F})\big] + n\sigma^2, \end{aligned}$$

and we infer from (40) that

(42) $$\mathbb{E}\big[\overline{Z}(\mathscr{F})\big] \leq \sqrt{2n}\,\frac{1 + q}{1 - q}\,G(B) \quad\text{with}\quad B = \sqrt{\sigma^2 + \frac{8\,\mathbb{E}[\overline{Z}(\mathscr{F})]}{n}} \wedge 1.$$

The following lemma provides an evaluation of G.

Lemma 2. Let a, b, y0 be positive numbers. For all y ∈ [y0, 1],

$$\int_{0}^{y}\sqrt{a + b\log(1/u)}\, du \leq \left(1 + \frac{b}{2a}\right) y\sqrt{a + b\log(1/y_0)}.$$

Proof. Using an integration by parts and the fact that

$$\frac{d}{du}\sqrt{a + b\log(1/u)} = -\frac{b}{2u\sqrt{a + b\log(1/u)}},$$

we get

$$\begin{aligned} \int_{0}^{y}\sqrt{a + b\log(1/u)}\, du &= \left[u\sqrt{a + b\log(1/u)}\right]_{0}^{y} + \frac{1}{2}\int_{0}^{y}\frac{b}{\sqrt{a + b\log(1/u)}}\, du\\ &\leq y\sqrt{a + b\log(1/y)} + \frac{by}{2\sqrt{a + b\log(1/y)}} = y\sqrt{a + b\log(1/y)}\left[1 + \frac{b}{2\left(a + b\log(1/y)\right)}\right], \end{aligned}$$

and the conclusion follows from the fact that y0 ≤ y ≤ 1. □

Since for all y ∈ (0, 1], g(y) = √(a + b log(1/y)) with

$$a = \log\left[2e^2(V + 1)^2\right] + 2V\log(8e/q) \quad\text{and}\quad b = 4V,$$

we may apply Lemma 2 with y0 = σ and y = B and deduce from (42) that

$$\mathbb{E}\big[\overline{Z}(\mathscr{F})\big] \leq \sqrt{2n}\,\frac{1 + q}{1 - q}\left(1 + \frac{b}{2a}\right) B\sqrt{a + b\log(1/\sigma)} \leq \sqrt{2n}\,\frac{1 + q}{1 - q}\left(1 + \frac{b}{2a}\right)\sqrt{\sigma^2 + \frac{8\,\mathbb{E}[\overline{Z}(\mathscr{F})]}{n}}\,\sqrt{a + b\log(1/\sigma)}.$$

Solving the inequality E[Z̄(F)] ≤ A√(2nσ² + 16E[Z̄(F)]) with

$$A = \frac{1 + q}{1 - q}\left(1 + \frac{b}{2a}\right)\sqrt{a + b\log(1/\sigma)},$$

we get that

(43) $$\mathbb{E}\big[\overline{Z}(\mathscr{F})\big] \leq 8A^2 + \sqrt{64A^4 + 2A^2 n\sigma^2} \leq 16A^2 + A\sqrt{2n\sigma^2}.$$
