Aggregated estimators and empirical complexity for least square regression

(1)

Aggregated estimators and empirical complexity for least square regression

Jean-Yves Audibert

^a,b

aUniversité Paris VI Pierre et Marie Curie, laboratoire de probabilités et modèles aléatoires, 175, rue du Chevaleret, 75013 Paris, France bCREST, laboratoire de finance et assurance, 15, bd Gabriel Péri, 92245 Malakoff Cedex, France

Received 10 March 2003; accepted 10 November 2003 Available online 8 June 2004

Abstract

Numerous empirical results have shown that combining regression procedures can be a very efficient method. This work provides PAC bounds for theL²generalization error of such methods. The interest of these bounds are twofold.

First, it gives for any aggregating procedure a bound for the expected risk depending on the empirical risk and the empirical complexity measured by the Kullback–Leibler divergence between the aggregating distributionρˆand a prior distributionπand by the empirical mean of the variance of the regression functions under the probabilityρ.ˆ

Secondly, by structural risk minimization, we derive an aggregating procedure which takes advantage of the unknown properties of the best mixturef˜: when the best convex combination f˜of d regression functions belongs to the d initial functions (i.e. when combining does not make the bias decrease), the convergence rate is of order(logd)/N. In the worst case, our combining procedure achieves a convergence rate of order

(logd)/N which is known to be optimal in a uniform sense whend >√

N(see [A. Nemirovski, in: Probability Summer School, Saint Flour, 1998; Y. Yang, Aggregating regression procedures for a better performance, 2001]).

As in AdaBoost, our aggregating distribution tends to favor functions which disagree with the mixture on mispredicted points.

Our algorithm is tested on artificial classification data (which have been also used for testing other boosting methods, such as AdaBoost).

Résumé

De nombreuses études empiriques ont montré l’efficacité des méthodes consistant à combiner différentes procédures de régression. Ce travail fournit de nouvelles bornes “PAC” (probablement approximativement correct) pour l’erreur de généralisationL₂de ces méthodes. Ces bornes présentent un double intérêt.

Tout d’abord, elles donnent une borne sur le risque de n’importe quelle procédure d’agrégation en termes du risque empirique et d’une mesure empirique de la complexité basée sur la divergence de Kullback–Leibler entre la probabilité d’agrégationρˆet la probabilité a priori et sur la moyenne empirique de la variance des fonctions de régression sous la probabilitéρˆ.

Deuxièmement, par minimisation du “risque structurel”, nous dérivons une procédure d’agrégation qui s’adapte aux propriétés pourtant inconnues du meilleur mélangef˜ : quand la meilleure combinaison convexef˜ded fonctions est une de cesdfonctions (c’est-à-dire quand combiner les fonctions ne permet pas de réduire le biais), le taux de convergence est

E-mail address: [email protected] (J.-Y. Audibert).

doi:10.1016/j.anihpb.2003.11.006

(2)

d’ordre(logd)/N. Dans le pire des cas, notre procédure d’agrégation a un taux de convergence d’ordre

(logd)/N qui est le taux minimax optimal quandd >√

N (cf. [A. Nemirovski, in : Summer School, Saint Flour, 1998 ; Y. Yang, Aggregating regression procedures for a better performance, 2001]).

Comme la méthode AdaBoost, le mélange obtenu par notre algorithme pondère les fonctions qui ne sont pas en accord avec le mélange sur les points mal classés de l’ensemble d’apprentissage. L’algorithme est testé sur des problèmes de classification artificiels (déjà utilisés pour tester des procédures de boosting telles qu’AdaBoost).

MSC: primary 62G08; secondary 62J02, 94A17, 62H30

Keywords: Nonparametric regression; Deviation inequalities; Adaptive estimator; Oracle inequalities; Boosting

1. Introduction

Boosting algorithms (AdaBoost introduced by Freund and Schapire in [4], Bagging and Arcing introduced by Breiman in [1,2]) have been successful in practical classification applications. With support vector machines, boosting is known to be one of the best off-the-shelf classification procedure. As a consequence, numerous researchers have studied the reasons of their efficiency and have looked for means to extend their application domain to regression problems.

Friedman, Hastie and Tibshirani have proved [5] that AdaBoost is a stage-wise estimation procedure for fitting an additive logistic regression model. From this idea, Friedman derive a “gradient boosting machine” to estimate a function for some specified loss criteria.

Rätsch et al. [10] have shown that boosting is similar to an iterative strategy which maximizes the minimum margin of the aggregated classifier using an exponential barrier. They also use their view to obtain a boosting technique for regression.

In [13], Yang has studied minimax properties of aggregating regression procedures. In particular, he has proved that when the numberd of aggregated procedures is less than√

N (whereN is the size of the training set), the order of the convergence rate of the best mixture (in the minimax sense) is the same as the one of the best linear combination (i.e.d/N). Whend is greater than√

N, the convergence rate of the best convex combination attains (logd)/N (see also [9]).

In this paper, we will obtain new bounds for any aggregating procedure (Section 4) and derive from these bounds a procedure which achieves the optimal minimax convergence rate. Before proving these bounds, we will review Catoni results [3] on randomization procedures (Section 3). The estimators obtained by minimization of the bound are tested on classification using common artificial data: Twonorm, Threenorm and Ringnorm (Section 5).

2. Framework

We assume that we observe an i.i.d. sampleZ^N₁ (X_i, Y_i)^N_i₌₁ of random variables distributed according to a product probability measure P^⊗^N, where P is a probability distribution on (Z,B_Z)(X ⊗Y,B_X ⊗B_Y), (X,B_X)is a measurable space,Y=RandB_Y is the Borel sigma algebra. LetP(dY|X)denote a regular version of the conditional probabilities (which we will use in the following without further mention).

We assume that we have no prior information about the distributionPof(X, Y ), and that we have to guess it entirely from the training sample. We have to work with a prescribed set of regression functions since it is well known that there is generally no estimatorfˆ:Z^N→F(X,Y)such that

lim

N→+∞ sup

P∈M¹₊(Z)

E_P⊗(N+1)L

Y_N₊₁,f (Zˆ ₁^N)(X_N₊₁)

− inf

f∈F(X,Y)E_PL

Y, f (X)

=0,

(3)

whereF(X,Y)denotes the set of all the measurable functions fromX toY andLis a loss function. However, replacingF(X,Y)by the set of mixturesR˜ of a set of functionsRin the previous equality makes the problem feasible (provided the model R is not too big) with a speed of convergence depending on the capacity (or complexity) ofR. So we are interested in a particular non-parametric regression problem. For convenience of notation, we will index the functions in the model by the parameterθ:

R

f_θ∈F(X,Y);θ∈Θ .

Note that the set R(or equivalently the parameter set Θ) is not necessarily finite. Let π(dθ ) denote a prior distribution on the measurable space (Θ,T), where T is aσ-field on the parameter spaceΘ. In practice, the probability distributionπwill be chosen according to our preferences (and to our prior knowledge had we some).

For instance, if the modelRis the set of decision trees of depth lower than a certain limit and if we do not have any prior knowledge, we would like to favour small trees with respect to big ones since they are simpler and therefore more easily interpretable. To favour these trees, it suffices to give them a biggerπ-probability. On the contrary, if a subsetS ofRhas aπ-probability equal to one, then the functions in theπ-negligible setR\S are eliminated from the model.

We assume that the map(θ , x)→f_θ(x)is(B_X⊗T)-measurable. The set of mixtures of the setRis written as R˜

Eρ(dθ )f_θ;ρ∈M¹₊(Θ) .

The best possible guess is defined as the one minimizing the expected risk R(f )ˆ E_PL

Y,f (X)ˆ ,

whereLis the square loss:L(Y, Y)=(Y−Y)². The mean square loss has the distinguished property of being minimized by the conditional expectation ofY givenX. More precisely, it decomposes into

R(f )ˆ =E_P

Y−E_P(Y /X)2 +E_P

E_P(Y /X)− ˆf (X)2 .

Therefore, minimizing the mean square loss is equivalent to minimizing the quadratic distance to the conditional expectation.

Since the expected risk is not observable, we will have to use the empirical risk r(f )ˆ 1

N N

i=1

L

Y_i,f (Xˆ _i)

=E_P_¯L

Y,f (X)ˆ , whereP¯ denotes the empirical distribution

P¯ 1 N

N i=1

δ_(X_i_,Y_i₎.

LetΘ₁, . . . , Θ_Mbe subsets ofΘsuch that their union isΘ. Consider a regression procedure which estimate the bestθamong a subset ofΘ. Using this procedure, we getθˆ₁∈Θ₁, . . . ,θˆ_M∈Θ_M.

• Deterministic model selection consists in choosing one of theθˆ_ito estimateE_P(Y /X).

• In stochastic model selection (or randomized estimation), the choice of θˆ_i is randomized. This two-steps procedure (estimating the bestθ in each sub-modelΘ_i and choosing randomly the sub-model) can be seen as a one-step procedure if we allowfˆto be drawn fromRaccording to some posterior distributionρ(dθ )on the parameter set(Θ,T)(see [3,8]).

• In model averaging (or aggregated estimation), the idea is to use a weighting average of thef_θ_ˆ

i, in other words to combine the different estimators. This could also be done in a one-step procedure searching for the posterior distributionρon(Θ,T)such thatfˆ=Eρ(dθ )fθ is close toE_P(Y /X).

(4)

In this paper, we give results concerning these last two estimation problems. Our assumptions are the two following ones. First the conditional expectationE_P(Y /X)and the regression function in the models are relatively bounded inL^∞-norm, i.e. for anyf,ginR∪ {E(Y /X= ·)}, for anyx∈X,

f (x)−g(x) B. (2.1) Secondly, we assume that the noise has a uniform exponential moment conditionally to the explanatory variable, i.e. there existsα >0, M >0 such that for anyx∈X,

E_P(dY )exp

α Y−f^∗(X) /X=x

M, (2.2)

wheref^∗E_P(Y /X= ·) is the regression function associated with the distributionP. Note that this second assumption is sufficiently weak to deal with the case in which the output is equal to a function of the input plus a gaussian noise.

Letf˜denote the best mixture (for the square loss) of the regression functions in the modelR: f˜argmin

f∈ ˜R

R(f ). (2.3)

Finally, introduce a mixture distributionρ˜∈M¹₊(Θ)defined asEρ(dθ )_˜ f_θ= ˜f (the probability distributionρ˜is not necessarily unique).

3. Randomization

3.1. PAC-Bayesian expected risk bound

The following theorems bound the expected risk of a randomized procedure in terms of the empirical risk and a term of empirical complexity relying on the Kullback–Leibler divergence between the randomizing distributionρ and the prior distributionπ. Introduce the functions

G(λ) 8M

(αB−2λ)²e²+e^2λ−1−2λ

λ² and H (λ) 1

1−λG(λ).

Theorem 3.1. For anyε >0 and 0< λ < ^αB₂ such thatλG(λ) <1, withP^⊗^N-probability at least 1−ε, for any randomizing procedureρˆ:Z^N→M¹₊(Θ), we have

Eρ(dθ )_ˆ R(f_θ)−R(f )˜ H (λ)

Eρ(dθ )_ˆ r(f_θ)−r(f )˜ + B² λN

K(ρ, π )ˆ +log(ε⁻¹)

. (3.1)

Proof. See Section 7.1. 2

To use this bound, one has to choose arbitrarily the parameterλ. To avoid this choice, one can use an union bound.

Theorem 3.2. Introduce countable families(λ_i)_i_∈_I and(η_i)_i_∈_I such that 0< λ_i < αB/2, λ_iG(λ_i) <1, η_i >0 and

i∈Iη_i =1. For anyε >0, withP^⊗^N-probability at least 1−ε, for any randomizing procedureρˆ:Z^N→ M¹₊(Θ), for anyi∈I, we have

Eρ(dθ )_ˆ R(fθ)−R(f )˜ H (λi)

Eρ(dθ )_ˆ r(fθ)−r(f )˜ + B² N λ_i

K(ρ, π )ˆ +log

(ηiε)⁻¹

. (3.2)

(5)

Proof. Introduce the event A_i

Eρ(dθ )_ˆ R(fθ)−R(f )˜

H (λi) >Eρ(dθ )_ˆ r(f_θ)−r(f )˜ + B² N λi

K(ρ, π )ˆ +log

(η_iε)⁻¹ . From Theorem 3.1, for anyi∈I, we haveP^⊗^N(A_i) < η_iε. Hence we have

P^⊗^N

i∈I

Ai

i∈I

P^⊗^N(Ai) <

i∈I

ηiε=ε. 2

The problem is then to choose appropriately the parameter families(λi)i∈I and(ηi)i∈I. 3.2. Optimal randomizing procedure

In this section we use Theorem 3.2 to define a randomizing procedure. The bounds in the previous theorems cannot be computed from the data only. However they can be upper bounded by replacing the empirical risk of the unknown best mixturer(f )˜ by the infimum over the setR˜ of the empirical risk inf_R_˜r.

Introduce











Q(ρ, λ, η)Eρ(dθ )_ˆ r(f_θ)−inf_R_˜r 1−λG(λ) +B²

N

K(ρ, π )ˆ +log[(ηε)⁻¹] λ[1−λG(λ)] , Q

ρ, (λ_i)_i_∈_I, (η_i)_i_∈_I inf

i∈IQ(ρ, λ_i, η_i), Q(ρ) inf

(λi)i∈I∈Pλ

(η_i)_i∈I∈P^η

Q

ρ, (λi)i∈I, (ηi)i∈I

,

where Pλ and Pη are respectively the set of parameter families (λ_i)_i_∈_I and (η_i)_i_∈_I such that 0< λ_i < ^αB₂ , λ_iG(λ_i) <1, η_i >0 and

i∈Iη_i =1. Then the quantitiesQ(ρ, λ,1)andQ(ρ, λ_i, η_i)are respectively slightly weakened version of the RHS of inequalities (3.1) and (3.2).

The quantityQ(ρ)can also be written as Q(ρ)= inf

0<λ<αB/2 such thatλG(λ)<1Q(ρ, λ,1).

Let us define the optimal posterior distributionρˆoptas ˆ

ρ_opt= argmin

ρ∈M¹₊(Θ)

Q(ρ).

For any 0< ε <1, one may prove the existence of the “argmin” and thatρˆoptis a Gibbs distribution which can be written as

ˆ

ρopt= e⁻

N λopt B2 r(f )

Eπ(dθ )e⁻

N λopt

B2 r(f_θ) ·π,

for an appropriate parameter 0< λ_opt< αB/2 satisfyingλ_optG(λ_opt) <1. Then the inverse temperature parameter of the Gibbs distribution isβN λ_opt/B².

We would like to choose the parameter families such that the infimum inf_ρQ

ρ, (λ_i)_i_∈_I, (η_i)_i_∈_I

is not “too far”

from the optimal quantityQ(ρˆopt). The bound in Theorem 3.2 is appropriate when its order is 1/√

N. Therefore relevant values ofλ are greater than 1/√

N. Let us define 0< Λ < αB/2 such thatΛG(Λ)=1. Consider the

(6)

family(λi)i=1,...,p, where λi Λ/2ⁱ and p is such that Λ/2^p⁺¹<1/√

N Λ/2^p. When the parameter λ_opt belongs to[1/√

N;Λ[(which is the case we are interested in), for anyρ∈M¹₊(Θ), we have inf

i=1,...,pQ(ρ, λ_i,1)2Q(ρ, λ_opt,1).

So we just lose in the worst case a factor 2. It remains to choose the parametersη_i such that for anyρ∈M¹₊(Θ), the quantityQ(ρ, λ_i, η_i)is not “too far” from the quantityQ(ρ, λ_i,1). By takingη_i =1/p,i=1, . . . , p, we lose an additive log logN factor in front of the Kullback–Leibler divergenceK(ρ, π )which, in general, would be for the optimal distribution at least of the same order as the Kullback–Leibler divergence (in practice, log logN never exceeds 3).

Since the minimum over M¹₊(Θ) of the quantity Q(ρ, λ,1)(achieved for the probability distributionρ ∝ e⁻^{N λ}^B²^{r(f )}·π) is

B²

N λ[1−λG(λ)]log

εEπ(dθ )e⁻

N λ

B2[r(f_θ)−inf_R_˜r]₋1 , let us introduce for anyi=1, . . . , p,

Qi 1

λ_i[1−λ_iG(λ_i)]log

p

εEπ(dθ )e⁻^{N λi}^B² ^[^r(f^θ⁾⁻^inf^R^˜^r^]

,

whereλ_i=Λ/2ⁱ. Finally, we obtain the following randomizing procedure 1. Compute

i_optargmin

i=1,...,p

Q_i.

2. Randomize using the probability distribution e⁻

N Λ B2 2ioptr(f )

Eπ(dθ )e⁻

N Λ

B22ioptr(f_θ)·π.

Remark 3.1. Note that since our optimal randomizing procedure comes from a deviation inequality, the inverse temperature parameterβdepends on the probabilityε. Indeed, to get a higher confidence level, we need to have a biggerλand therefore to take a biggerβ (i.e. to be more selective). However in practiceεhas little influence on the temperature.

Remark 3.2. Our optimal randomizing distribution is a Gibbs distribution. We find it in a minimax context. One may notice that the randomizing distribution minimizing the Bayesian risk in a gaussian noise context is also a Gibbs distribution. More precisely, consider that the output is given by

Y=f_θ(X)+η,

where the random variableηis a centered gaussian with varianceσ² independent of the inputX. The Bayesian risk is

R_Bay(f )ˆ E_π(dθ/Z^N

1)EPθ(dZ_N₊₁)

Y_N₊₁− ˆf (X_N₊₁)₂

=σ²+E_π(dθ/Z^N

1)E_P(dX_N+1)

f_θ(X_N₊₁)− ˆf (X_N₊₁)2

=σ²+E_P(dXN+1)E_π(dθ/Z^N

1)

fθ(XN+1)− ˆf (XN+1)2 .

(7)

Hence the optimal estimator isfˆ=E_π(dθ/Z^N

1)fθ. It is associated with the posterior distribution ˆ

ρ(dθ )=π(dθ /Z₁^N)= e⁻

N 2σ2r(f_θ)

Eπe⁻

N

2σ2r(f )·π(dθ ),

which is a Gibbs distribution with inverse temperature parameterN/(2σ²).

4. Aggregated estimators

4.1. PAC-Bayesian expected risk bound

In the least square regression framework, there exists a simple relation between the risk of an aggregated estimator and the one of the associated randomized estimator which is

R(Eρ(dθ )f_θ)=Eρ(dθ )R(f_θ)−E_PVar_{ρ(dθ )}f_θ(X). (4.1)

This equality shows that aggregated regression procedures are more efficient than randomized ones and that the difference is measured by E_PVar_{ρ(dθ )}fθ(X). The first term of the RHS has already been bounded (see Theorem 3.1). So, to bound the expected risk of the aggregated estimator, it remains to bound the deviations of the variance term and this is done with similar techniques to those used for randomized estimators.

We obtain the following theorems which bound the expected risk of any aggregated estimator in terms of

• the empirical risk,

• the empirical complexity measured by the Kullback–Leibler divergence between the aggregating distribution ˆ

ρand the prior distributionπand by the empirical mean of the variance of the regression functions under the posterior distribution.

We still denote

G(λ) 8M

(αB−2λ)²e²+e^2λ−1−2λ

λ² and H (λ) 1

1−λG(λ), and we define

g(β)e^β−1−β

β² and h(β) 1

1+βg(β).

Theorem 4.1. For anyε >0,β >0 and 0< λ < αB/2 such thatλG(λ) <1, withP^⊗^N-probability at least 1−2ε, for any aggregating procedureρˆ:Z^N→M¹₊(Θ),

R(Eρ(dθ )_ˆ f_θ)−R(f )˜ H (λ)

Eρ(dθ )_ˆ r(f_θ)−r(f )˜ + B² N λ

K(ρ, π )ˆ +log(ε⁻¹)

+h(β)

− ¯V+ B² 2Nβ

2K(ρ, π )ˆ +log(ε⁻¹)

=H (λ)

r(Eρ(dθ )_ˆ f_θ)−r(f )˜ +

H (λ)−h(β)V¯ +B²H (λ)

N λ

K(ρ, π )ˆ +log(ε⁻¹)

+B²h(β) 2Nβ

2K(ρ, π )ˆ +log(ε⁻¹)

, (4.2)

whereV¯ EP¯Var_{ρ(dθ )}_ˆ fθ.

(8)

Using an union bound, we get

Theorem 4.2. Introduce countable families (λi)i∈I, (ηi)i∈I, (βj)j∈J and (ζj)j∈J such that 0< λi < αB/2, λ_iG(λ_i) <1, η_i >0,

i∈Iη_i =1, β_j >0, ζ_j >0 and

j∈Jζ_j=1. For anyε >0, with P^⊗^N-probability at least 1−2ε, for any aggregating procedureρˆ:Z^N→M¹₊(Θ), for anyi∈Iand for anyj∈J, we have

R(Eρ(dθ )_ˆ fθ)−R(f )˜ H (λi)

r(Eρ(dθ )_ˆ fθ)−r(f )˜ +

H (λi)−h(βj)V¯ +B²H (λ_i)

N λ_i

K(ρ, π )ˆ +log

(η_iε)⁻¹ +B²h(βj)

2Nβ_j

2K(ρ, π )ˆ +log

(ζ_jε)⁻¹

. (4.3)

Proof. In the proof of Theorem 4.1 (see Section 7.2), we have obtained that withP^⊗^N-probability at least 1−ε, for anyρ∈M¹₊(Θ),

−E_PVar_{ρ(dθ )}f_θh(β)

−E_P_¯Var_{ρ(dθ )}f_θ+ B² 2Nβ

2K(ρ, π )+log(ε⁻¹) .

Instead of using an union bound directly on inequality (4.2), we use it on this inequation. We get that withP^⊗^N- probability at least 1−ε, for anyρ∈M¹₊(Θ)and for anyj∈J,

−E_PVar_{ρ(dθ )}f_θh(β_j)

−E_P_¯Var_{ρ(dθ )}f_θ+ B² 2Nβ_j

2K(ρ, π )+log

(ζ_jε)⁻¹ , where(β_j)_j_∈_J and(ζ_j)_j_∈_J are parameter families such thatβ_j>0,ζ_j >0 and

j∈Jζ_j =1. It remains to add this inequation to inequality (3.2) to get the result. 2

Now let us introduce











B(ρ, λ, η, β, ζ )H (λ)

Eρ(dθ )r(f_θ)−r(f )˜ + B² N λ

K(ρ, π )+log

(ηε)⁻¹ +h(β)

− ¯V + B² 2Nβ

2K(ρ, π )+log

(ζ ε)⁻¹ B

ρ, (λi)i∈I, (ηi)i∈I, (βj)j∈J, (ζj)j∈J

κB²∧ inf

i∈I j∈J

B(ρ, λi, ηi, βj, ζj),

(4.4)

whereκ1+_e2(αB)^4M ².

By bounding the expected risk using assumptions (2.1) and (2.2), and from the previous theorem, we obtain Corollary 4.3. For anyε >0, with P^⊗^N-probability at least 1−2ε, for any aggregating procedureρˆ:Z^N→ M¹₊(Θ), we have

R(Eρ(dθ )_ˆ f_θ)−R(f )˜ B

ρ, (λ_i)_i_∈_I, (η_i)_i_∈_I, (β_j)_j_∈_J, (ζ_j)_j_∈_J .

Proof. From Theorem 4.1, withP^⊗^N-probability at least 1−2ε, for any aggregating procedureρˆ:Z^N→M¹₊(Θ), we have

R(Eρ(dθ )_ˆ fθ)−R(f )˜ inf

i∈I j∈J

B(ρ, λi, ηi, βj, ζj). (4.5)

(9)

Since the noise has a conditional uniform exponential moment (assumption (2.2)), the expected risk is bounded.

Specifically, we can write R(Eρf )=EP

Y −E(Y /X)2

+EP

E(Y /X)−Eρf2

E_P

e^α^|^Y⁻^{E(Y /X)}^| sup

u∈R+

{u²e⁻^αu} +B²

2 αe

2

M+B²

κB², (4.6)

whereκ_e2(αB)^4M ² +1. Since the quadratic riskR(f )˜ is positive, for any probability distributionρ, we have

Eρ(dθ )R(θ )−R(f )˜ κB². (4.7)

The result follows from equalities (4.5) and (4.7). 2 This corollary is the keystone of this work since

• by appropriately choosing the parameter families, one can deduce a parameter-free theorem which has the optimal minimax convergence rate except for a logarithmic factor (see Section 4.2.1),

• there exists an efficient procedure calculating one of the probability distributions minimizing the bound B(ρ, (λi)i∈I, (ηi)i∈I, (βj)j∈J, (ζj)j∈J), when the setsI andJare finite (see Section 4.2.2).

4.2. Optimal aggregating procedure 4.2.1. Comparison with minimax bounds

In this section, we derive from Corollary 4.3 an aggregating procedure which is optimal in a minimax sense according to lower bounds established by Juditsky and Nemirovski [6] and by Yang [13]. We still denoteρ˜ a posterior distribution such thatR(Eρ(dθ )_˜ fθ)=min_R_˜R.

Lemma 4.4. For a well chosen finite parameter families, we have B

˜

ρ, (λ_i)_i_∈_I, (η_i)_i_∈_I, (β_j)_j_∈_J, (ζ_j)_j_∈_J

γ (ε), where











γ (ε)2

C1V (¯ ρ)˜ +6

C2V (¯ ρ)˜ +2C1+2C2, V (¯ ρ)˜ E_P_¯Var_{ρ(dθ )}_˜ f_θ,

C1C1(ε)B² N

K(ρ, π )˜ +log(L₁ε⁻¹) κ1

, C2C2(ε) B²

8N

2K(ρ, π )˜ +log(L₂ε⁻¹)

κ₂ ,

andκ₁andκ₂, by definition, respectively satisfy 2κ₁G(κ₁)=1 andκ₂g(κ₂)=1 and finally











L₁log( ^4κ¹^N

log(ε⁻¹)) 2 log 2 ∨2, L₂log( ^8κ²^N

log(ε⁻¹)) 2 log 2 ∨2.

(10)

The proof and the parameter families are given in Section 7.3. From this lemma and from Corollary 4.3, by using the same parameter families, we get

Theorem 4.5. For anyε >0, withP^⊗^N-probability at least 1−2ε, any aggregating procedureρˆminimizing B

ρ, (λ_i)_i₌_0,...,p, (η_i)_i₌_0,...,p, (β_j)_j₌_0,...,q, (ζ_j)_j₌_0,...,q satisfies

R(Eρ(dθ )_ˆ fθ)−R(f )˜ γ (ε).

In the worst case, the bound has the same order (whenεis fixed) as C˜V˜∨ ˜C,

whereC˜^K(_N^{ρ,π )}^˜ B²andV˜ sup_x_∈XVar_{ρ(dθ )}_˜ f_θ(x)(we neglect the log logN term).

When the best mixturef˜belongs to the initial modelR, the variance term vanishes and the order of the bounds is given byC˜. A particular case of interest is when the parameter setΘis finite:Θ= {1, . . . , d}. Taking arbitrarily π=¹_d_d

i=1δi (uniform measure onΘ), one can check easily that for anyρ∈M¹₊(Θ), we have K(ρ, π )=logd−Hs(ρ)logd,

whereH_s(ρ)denotes the Shannon entropy ofρ (H_s(ρ)−_d

i=1ρ_ilogρ_i). In this case, when the best convex combinationf˜belongs to the modelR(V˜ =0), the convergence rate of our estimator will be logd/N, whereas whenf˜is not too close to the regression functions in the modelR(i.e. whenV˜ K(ρ, π )/N˜ ), the convergence rate will be

logd

N V˜. In the worst case, the quantityV˜ has the same order asB², and we find a convergence rate logd

N known to be optimal in the uniform sense as soon asd >√

Naccording to the following theorem Theorem 4.6 (Yang, 2001). Letd=N^τfor someτ >0. There exists a model

R=

f_i∈F(X,Y): i=1, . . . , d

such that for any aggregating procedureρ, one can find a functionˆ f˜∈ ˜R= {_d

i=1ρ˜_if_i: ρ˜∈M¹₊{1, . . . , d}}

satisfying

E_P⊗NR(Eρ(dθ )_ˆ f_θ)−R(f )˜ C







 d

N whenτ1

2, logd

N whenτ >1 2, where the constantCdoes not depend onN.

Remark 4.1. In [13], Yang also proposed an adaptive estimator. The advantage of the procedure designed here is to be feasible, to avoid splitting the data in many parts and to hold when the regression function wrt the unknown probability distribution is not in the modelR˜. Besides, our results also hold when the set of aggregated functions is infinite and under weaker assumptions (particularly on the noise).

(11)

Remark 4.2. Note that the unobservable termr(f )˜ in the boundBdoes not modify the probability distribution ˆ

ρλ,η,β,ξminimizingB(ρ, λ, η, β, ζ ). However the choice ofλamong(λi)i=0,...,pdepends onr(f ). To circumvent˜ this difficulty, one can, for instance, weaken the boundBby replacing^r(^E^{ρ(dθ )}^ˆ₁₋_λG(λ)^f^θ⁾⁻^r(^{f )}^˜ with

r(Eρ(dθ )_ˆ f_θ)−r(f )˜ + λG(λ) 1−λG(λ)

r(Eρ(dθ )_ˆ f_θ)−r(fˆ_ERM) ,

where the functionfˆ_ERM minimizes the empirical risk among the functions in R˜. For this algorithm, the first assertion of Theorem 4.5 becomes: for any 1/2ε >0,

P^⊗^N

R(Eρ(dθ )_ˆ fθ)−R(f )˜ γ (ε)+r(f )˜ −r(fˆ_ERM)

1−2ε, (4.8)

since sup

λ∈(λ_i)_i₌_0,...,p

λG(λ) 1−λG(λ)

=1.

By using Theorem 4.1 (for a posterior distributionρˆ_ERMsatisfyingEρ_ˆ_ERM(dθ )f_θ = ˆf_ERMand forλandβof order logd

N ), we get that the added termr(f )˜ −r(fˆ_ERM)is at most of order logd

N when the parameter setΘis finite:

|Θ| =d.

Another solution to determine the right parameters is to cut the training sample into two parts, use the first part of the training sample to compute the distributionsρˆ_λ,η,β,ξand use the second part of the training sample to select the best distribution among the O[(logN )²]distributions (each distribution corresponds to a point in the(λ, β)-grid).

This last step is almost free (since we neglect log logN terms), so the convergence rate of the resulting procedure is effectively of order

C˜V˜ ∨ ˜C.

Remark 4.3. Had we not been interested in having tight explicit constants, we could have written Theorem 4.1 in the following way (taking arbitrarilyβ=λ): there existsC₁, C₂>0 depending only on the constantsB,αand Msuch that for anyε >0 and 0< λ< C₁, withP^⊗^N-probability at least 1−2ε, for any aggregating procedure

ˆ

ρ:Z^N→M¹₊(Θ),

R(Eρ(dθ )_ˆ f_θ)−R(f )˜ (1+λ)

r(Eρ(dθ )_ˆ f_θ)−r(f )˜

+2λV¯ +C2

N

K(ρ, π )ˆ +log(ε⁻¹)

λ ,

where we still haveV¯ =EP¯Var_{ρ(dθ )}_ˆ f_θ. This inequation would have also led to the optimal convergence rate after optimization of the parameterλ.

Theorem 4.6 also shows that a direct application of our aggregating procedure is not optimal whend is lower than√

N, since then the convergence rate towards functions for whichV˜ =sup_x_∈XVar_{ρ(dθ )}_˜ fθ(x)has the same order asB²is

logd

N d

N.

However, in this case (d√

N), one can consider a gridRon the simplexR˜: R

_d

i=1

a_i √

dNfi: ai∈Nsuch that d i=1

ai= √ dN

,

wherex denotes the integer satisfying x−1<xx. We haveR˜= ˜R. Then applying our aggregating procedure to the new initial modelRfor a uniform prior distributionπonR, we obtain the desired convergence rate except for the logarithmic factor.

(12)

Proof. The best convex combinationf˜=_d

i=1ρ˜ifi belongs to S∩

_d

i=1

√ dN ˜ρi

√

dN f_i+ 1 √

dNC_d

, whereS is the simplex {d

i=1ρifi: ρi 0,d

i=1ρi =1}andCd is thed-dimensional cube{d

i=1aifi: 0 ai1}. This set is the convex combination of its vertices, so the functionf˜can be written as a convex combination of the functions in

R _d

i=1

√ dN ˜ρ_i

+ε_i √

dN f_i: ε_i∈ {0,1}

∩R. For anyf, g∈R, we have

f −g_∞d 2

B √

dN, hence¹

V˜ d² 16√

dN²B².

The number of functions in R is upper bounded by (√

dN +1)^d. Since we have K(ρ, π˜ )log CardR (because the distribution π is uniform over the set R), we get C˜ ^d^log(N_N^3/4⁺¹⁾B². As a result, we have C˜V˜ ∨ ˜C=O(_N^d logN ), which is the desired convergence rate up to the logarithmic factor. 2

In fact, whend√

N, the optimal convergence rate can also be obtained by randomizing functions from the gridR⊂ ˜R. To combined regression functions is then equivalent (in terms of convergence rate) to randomizing with an appropriate Gibbs distribution on the gridR.

Remark 4.4. The previous idea of discretizing the model can be also used to find the best linear combination with bounded coefficients. Indeed, letAbe the bound on the coefficients. Introduce the model

R_lin _d

i=1

qi

√

dNAf_i;q_i∈Z∩

−√

dN; √ dN

.

Then the best convex combinationf˜of functions fromR_linis also the best linear combination of functions from Rwith coefficients bounded by A. Using our aggregating (or an appropriate randomizing) procedure on R_lin, we obtain thatE_P⊗NR(Eρ(dθ )_ˆ f_θ)−R(f )˜ is of order _N^d logN, which cannot be improved uniformly beyond the logarithmic factor.

Remark 4.5. Note that to obtain an algorithm with optimal convergence rate in the uniform sense, we need not have used sophisticated tools. We just need deviation inequalities, a simple union bound and to discretize the simplexR˜. Indeed, any functionf ofR˜ satisfies a deviation inequality similar to the one of Lemma 7.3: for any 0λαB/2 satisfying 8Mλ(αB−2λ)²e², the deviations of

Z= −

Y−f (X)2

+

Y− ˜f (X)2

1 We use that for any random variableXsuch thataXba.s., the variance ofXis bounded by(b−a)²/4.

(13)

are given by logE_Pe^λ

Z−EPZ

B2 λ²R(f )¯

B² G(λ), (4.9)

whereG(λ)_(αB₋^8M_2λ)2e² +^e^2λ⁻_λ¹2⁻^2λ. The quantitiesR(f )¯ andr(f )¯ are still defined as





R(f )¯ =R(f )−R(f )˜ =EP

Y−f (X)2

−EP

Y − ˜f (X)2 ,

¯

r(f )=r(f )−r(f )˜ =E_P_¯

Y −f (X)2

−E_P_¯

Y − ˜f (X)2 . Hence, for any 0λαB/2 satisfyingλG(λ)1, we have successively

E_P⊗Ne

λN

B2{EP¯Z−EPZ[1−λG(λ)]}1.

For anyε >0, P^⊗^N

λN B²

E_P_¯Z−E_PZ

1−λG(λ)

−log(ε⁻¹)0

ε.

WithP^⊗^N-probability at least 1−ε, R(f )¯ r(f )¯

1−λG(λ)+B² N

log(ε⁻¹) λ[1−λG(λ)].

By using an union bound, for any discretized simplexRdiscwithP^⊗^N-probability at least 1−ε, for anyf ∈Rdisc, we get

R(f )¯ r(f )¯

1−λG(λ)+B² N

log(ε⁻¹CardRdisc) λ[1−λG(λ)] . For somem∈Nwhich will be chosen later, let us take

Rdisc= _d

i=1

a_i

mfi: ai∈Nsuch that d i=1

ai=m

. Then we have

CardRdisc= m+d

d

2×m^d whendm, 2×d^m whendm,

and for anyg∈ ˜Rthere existsf ∈Rdiscsuch thatf −g∞^B_m. This last inequality implies that there exists f∈Rdiscsuch that

¯

r(f )= 1 N

N i=1

2Y_i−f (X_i)− ˜f (X_i)

f (X_i)− ˜f (X_i)

ΣB

m, where

Σ

_N

i=1|2Y_i−f (X_i)− ˜f (X_i)|

N 2

_N

i=1|Y_i−f^∗(X_i)|

N +2B.

The algorithm which minimizes the empirical risk on the netRdiscsatisfies withP^⊗^N-probability at least 1−ε, for anyf ∈Rdisc,

R(¯ f )ˆ r(¯ f˜_disc) 1−λG(λ)+B²

N

log(ε⁻¹CardRdisc) λ[1−λG(λ)] ,