HAL Id: hal-01314571
https://hal.archives-ouvertes.fr/hal-01314571v2
Preprint submitted on 25 May 2016
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
On optimality of empirical risk minimization in linear aggregation
Adrien Saumard
To cite this version:
Adrien Saumard. On optimality of empirical risk minimization in linear aggregation. 2016. hal-01314571v2
On optimality of empirical risk minimization in linear aggregation
Adrien Saumard
May 25, 2016
Abstract
In the first part of this paper, we show that the small-ball condition, recently introduced by [Men15], may behave poorly for important classes of localized functions such as wavelets, leading to suboptimal estimates of the rate of convergence of ERM for the linear aggregation problem.
In a second part, we derive optimal upper and lower bounds for the excess risk of ERM when the dictionary is made of trigonometric functions. While the validity of the small-ball condition remains essentially open in the Fourier case, we show a strong connection between our results and concentration inequalities recently obtained for the excess risk in [Cha14] and [vdGW16].
Keywords: empirical risk minimization, linear aggregation, small-ball property, concentration inequality, empirical process theory.
1 Introduction
Consider the following general regression framework: $(\mathcal{X}, \mathcal{T}_{\mathcal{X}})$ is a measurable space, $(X, Y) \in \mathcal{X} \times \mathbb{R}$ is a pair of random variables of joint distribution $P$ - the marginal of $X$ being denoted $P_X$ - and it holds
$$ Y = s(X) + \sigma(X)\varepsilon , \qquad (1) $$
where $s$ is the regression function of the response variable $Y$ with respect to the random design $X$, $\sigma(X) \geq 0$ is the heteroscedastic noise level and $\varepsilon$ is the conditionally standardized noise, satisfying $\mathbb{E}[\varepsilon \mid X] = 0$ and $\mathbb{E}[\varepsilon^2 \mid X] = 1$. Relation (1) is very general and is indeed satisfied as soon as $\mathbb{E}[Y^2] < +\infty$. In this case $s \in L^2(P_X)$ is the orthogonal projection of $Y$ onto the space of $X$-measurable functions. In particular, no restriction is made on the structure of dependence between $Y$ and $X$.
Research partly supported by the French Agence Nationale de la Recherche (ANR 2011 BS01 010 01, projet Calibration).
We thus face a typical learning problem, where the statistical modelling is minimal, and the goal will be, given a sample $(X_i, Y_i)_{i=1}^{n}$ of law $P^{\otimes n}$ and a new covariate $X_{n+1}$, to predict the value of the associated response variable $Y_{n+1}$. More precisely, we want to construct a function $\hat{s}$, depending on the data $(X_i, Y_i)_{i=1}^{n}$, such that the least-squares risk $R(\hat{s}) = \mathbb{E}[(Y_{n+1} - \hat{s}(X_{n+1}))^2]$ is as small as possible, the pair $(X_{n+1}, Y_{n+1})$ being independent of the sample $(X_i, Y_i)_{i=1}^{n}$.
In this paper, we focus on the technique of linear aggregation via Empirical Risk Minimization (ERM). This means that we are given a dictionary $S = \{s_1, \dots, s_D\}$ and that we produce the least-squares estimator $\hat{s}$ on its linear span $m = \mathrm{Span}(S)$,
$$ \hat{s} \in \arg\min_{s \in m} R_n(s) , \quad \text{where} \quad R_n(s) = \frac{1}{n} \sum_{i=1}^{n} (Y_i - s(X_i))^2 . \qquad (2) $$
The quantity $R_n(s)$ is called the empirical risk of the function $s$. The accuracy of the method is tackled through an oracle inequality, where the risk of the estimator $R(\hat{s})$ is compared - on an event of probability close to one - to the risk of the best possible function within the linear model $m$. The latter function is denoted $s_m$ and is called the oracle, or the (orthogonal) projection of the regression function $s$ onto $m$,
$$ s_m \in \arg\min_{s \in m} R(s) . $$
An oracle inequality then writes, on an event $\Omega_0$ of probability close to one,
$$ R(\hat{s}) \leq R(s_m) + r_n(D) , \qquad (3) $$
for a positive residual term $r_n(D)$. An easy and classical computation gives that the excess risk satisfies $R(\hat{s}) - R(s_m) = \|\hat{s} - s_m\|_2^2$, where $\|\cdot\|_2$ is the natural quadratic norm in $L^2(P_X)$. Hence, inequality (3) can be rewritten as $\|\hat{s} - s_m\|_2^2 \leq r_n(D)$ and the quantity $r_n(D)$ thus corresponds to the rate of estimation of the projection $s_m$ by the least-squares estimator $\hat{s}$ in terms of excess risk, corresponding here to the squared quadratic norm.
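The minimization in (2) over a linear span reduces to ordinary least squares on the design matrix of dictionary evaluations. The sketch below is our own illustration, not from the paper; the monomial dictionary and the sine regression function are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D = 500, 5

# Hypothetical dictionary: monomials s_j(x) = x^j on [0, 1] (our choice, for illustration).
def dictionary(x):
    return np.vstack([x ** j for j in range(D)]).T  # design matrix, shape (len(x), D)

X = rng.uniform(0.0, 1.0, size=n)
Y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(n)  # Y = s(X) + sigma * eps

# ERM over m = Span(S): minimizing the empirical risk R_n in (2) is ordinary least squares.
Phi = dictionary(X)
theta, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
R_n = np.mean((Y - Phi @ theta) ** 2)

# The minimizer's empirical risk is no larger than that of any element of m, e.g. s = 0.
assert R_n <= np.mean(Y ** 2) + 1e-9
```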
The linear aggregation problem has been well studied in various settings linked to nonparametric regression ([Nem00], [Tsy03], [BTW07], [AC11]) and density estimation ([RT07]). It has been consequently understood that the optimal rate $r_n(D)$ of linear aggregation is of the order of $D/n$, where $D$ is the size of the dictionary.
Recently, [LM15] have shown that ERM is suboptimal for the linear aggregation problem in general, in the sense that there exist a dictionary S and a pair (X; Y ) of random variables for which the rate of ERM (drastically) deteriorates, even in the case where the response variable Y and the dictionary are uniformly bounded.
On the positive side, [LM15] also made a breakthrough by showing that if a so-called small-ball condition is achieved with absolute constants, uniformly over the functions in the linear model m, then the optimal rate is recovered by ERM.
We recall and discuss in detail the small-ball condition in Section 2, but it is worth mentioning here that one of the main advantages of the small-ball method developed in a series of papers, [Men14b], [Men15], [Men14a], [KM15], [LM14], [LM15], is that it enables one to prove sharp bounds under very weak moment conditions and thus to derive results that were unachievable with more standard concentration arguments.
In Section 2, we contribute to the growing understanding of this very recent approach by looking at the behavior of the small-ball condition when the dictionary is made of localized functions such as compactly supported wavelets, histograms or piecewise polynomials. It appears that with such functions, the small-ball condition cannot be satisfied with absolute constants, and the resulting bounds obtained in [LM15] are far from optimal in this case since they are of the order of $D^3/n$.
The question of the validity of the small-ball approach in the Fourier case, which is a case where the functions are typically unlocalized, remains an open issue, with potentially great consequences in compressed sensing, [LM15]. Despite this lack of understanding, we prove by other means optimal upper and lower bounds for ERM when the dictionary is made of trigonometric functions. Our result, stated in Section 3, also outperforms previously obtained bounds [AC11].
An outline of the proofs related to Section 3 is given in Section 3.2. While detailing our arguments, we show the strong connection of our approach to optimal bounds with recent works of [Cha14] concerning least-squares under convex constraint, extended to the setting of regularized ERM by [MvdG15] and [vdGW16].
Finally, complete proofs are dispatched in the Appendix.
2 The small-ball method for classical functional bases
We recall in Section 2.1 one of the main results of [LM15], linking the small-ball condition to the rate of convergence of ERM in linear aggregation. Then, we show in Section 2.2 that the constants involved in the small-ball condition behave poorly for dictionaries made of localized bases.
2.1 The small-ball condition and the rate of ERM in linear aggregation
Let us first recall the definition of the small-ball condition for a linear span, as exposed in [LM15].
Definition 1 A linear span $m \subset L^2(P_X)$ is said to satisfy the small-ball condition for some positive constants $\kappa_0$ and $\beta_0$ if for every $s \in m$,
$$ P\left( |s(X)| \geq \kappa_0 \|s\|_2 \right) \geq \beta_0 . \qquad (4) $$
The small-ball condition thus ensures that the functions of the model $m$ do not put too much weight around zero. From a statistical perspective, it is also explained in [LM15] that the small-ball condition can be viewed as a quantified version of identifiability of the model $m$. A more general small-ball condition - which reduces to the previous definition for linear models - is also available when the model isn't necessarily linear, [Men15].
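To make condition (4) concrete, the following sketch (our own, not from the paper) estimates $P(|s(X)| \geq \kappa_0 \|s\|_2)$ by Monte Carlo for an arbitrary element of a linear span; the trigonometric basis and coefficients are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def small_ball_prob(basis_funcs, beta, kappa0, X):
    """Monte Carlo estimate of P(|s(X)| >= kappa0 * ||s||_2) for s = sum_k beta_k phi_k,
    the L2(P_X) norm being estimated on the same design sample."""
    vals = sum(b * f(X) for b, f in zip(beta, basis_funcs))
    l2 = np.sqrt(np.mean(vals ** 2))
    return np.mean(np.abs(vals) >= kappa0 * l2)

X = rng.uniform(0.0, 2 * np.pi, size=100_000)
# First three Fourier basis functions (orthonormal for the uniform design on [0, 2*pi]).
basis = [lambda x: np.ones_like(x),
         lambda x: np.sqrt(2) * np.cos(x),
         lambda x: np.sqrt(2) * np.sin(x)]
p = small_ball_prob(basis, beta=[0.3, 1.0, -0.5], kappa0=0.5, X=X)
assert 0.0 < p <= 1.0
```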
Under the small-ball condition, [LM15] derive the following result, describing the rate of convergence of ERM in linear aggregation.
Theorem 2 ([LM15]) Let $S = \{s_1, \dots, s_D\} \subset L^2(P_X)$ be a dictionary and assume that $m = \mathrm{Span}(S)$ satisfies the small-ball condition with constants $\kappa_0$ and $\beta_0$ (see Definition 1 above). Let $n \geq (400)^2 D / \beta_0^2$ and set $\xi = Y - s_m(X)$, where $s_m$ is the projection of the regression function $s$ onto $m$. Assume further that one of the following two conditions holds:
1. $\xi$ is independent of $X$ and $\mathbb{E}[\xi^2] \leq \sigma^2$, or
2. $|\xi| \leq \sigma$ almost surely.
Then the least-squares estimator $\hat{s}$ on $m$, defined in (2), satisfies for every $x > 0$, with probability at least $1 - \exp(-\beta_0^2 n / 4) - (1/x)$,
$$ \|\hat{s} - s_m\|_2^2 \leq \frac{16}{\kappa_0^2 \beta_0^2} \, \sigma^2 \, \frac{Dx}{n} . \qquad (5) $$
Notice that Alternative 1 in Theorem 2 is equivalent to assuming that the regression function belongs to $m$ - that is, $s = s_m$ - and that the noise is independent of the design - that is, $\sigma(X)$ is homoscedastic and $\varepsilon$ is independent of $X$ in relation (1).
The main feature of Theorem 2 is that if the small-ball condition is achieved with absolute constants $\kappa_0$ and $\beta_0$, not depending on the dimension $D$ nor on the sample size $n$, then optimal linear aggregation rates of order $D/n$ are recovered by ERM.
If moreover the regression function belongs to $m$ (Alternative 1), then the only moment assumption required is that the noise is in $L^2$. Otherwise, Alternative 2 asks for a uniformly bounded noise. Some variants of Theorem 2 are also presented in [LM15], showing for instance that optimal rates can also be derived for ERM when the noise has a fourth moment.
In the analysis of optimal rates in linear aggregation, it is thus worth understanding when the small-ball condition stated in Definition 1 is achieved with absolute constants.
One typical such situation is for linear measurements, that is, when the functions of the dictionary are of the form $f_i(x) = \langle x, t_i \rangle$, $t_i \in \mathbb{R}^d$. Indeed, very weak conditions are asked on the design $X$ in this case to ensure the small-ball property: for instance, it suffices to assume that $X$ has independent coordinates that are absolutely continuous with respect to the Lebesgue measure, with a density almost surely bounded (see [LM14] and [Men15], Section 6, for more details). As shown in [LM14] and [LM16], this implies that the small-ball property has important consequences in sparse recovery and analysis of regularized linear regression.
2.2 The constants in the small-ball condition for general linear bases
Besides linear measurements discussed in Section 2.1 above, an important class of dictionaries for the linear aggregation problem consists in expansions along orthonormal bases of $L^2(P_X)$. Our goal in this section is thus to investigate the behavior of the small-ball condition for some classical orthonormal bases such as piecewise polynomial functions, including histograms, wavelets or the Fourier basis.
The following assumption, which states the equivalence between the $L^\infty$ and $L^2$ norms for functions in the linear model $m$, is satisfied by many classical functional bases:
(A1) Take $S = \{s_1, \dots, s_D\} \subset L^2(P_X)$ a dictionary and consider its linear span $m = \mathrm{Span}(S)$. Assume that there exists a positive constant $L_0$ such that, for every $s \in m$,
$$ \| s \|_\infty \leq L_0 \sqrt{D} \, \| s \|_2 . $$
Examples of linear models $m$ satisfying Assumption (A1) with an absolute constant $L_0$ are given for instance in [BBM99]. More precisely, when $\mathcal{X} = [0, 2\pi]$, $X$ is uniform on $\mathcal{X}$ and $S$ consists of the first $D$ elements of the Fourier basis, then (A1) is verified.
Furthermore, if $\mathcal{X} = [0,1]^d$ for some $d \geq 1$, $X$ is uniform on $\mathcal{X}$, $\Lambda$ is a regular partition of $\mathcal{X}$ made of $J$ hyper-rectangles and $m$ is made of the piecewise polynomial functions defined on $\Lambda$, of maximal degrees on each element of $\Lambda$ not larger than $r \in \mathbb{N}$, then Assumption (A1) is also satisfied, with a dimension $D = (r+1) J$. This example includes in particular, for $r = 0$, the case of histograms on $\Lambda$.
Some wavelet expansions also satisfy Assumption (A1). As it will be useful in the following, let us state some notations more precisely (for more details about wavelets, see for instance [HKPT98]). We consider in this case that $\mathcal{X} = [0,1]$ and $X$ is uniformly distributed on $\mathcal{X}$. Set $\phi_0$ the father wavelet and $\psi_0$ the mother wavelet, two functions defined on $\mathbb{R}$. Thus, the support of the wavelets may not be contained in $[0,1]$, but for the estimation, only the wavelets whose support intersects $[0,1]$ will count. For every integers $j \geq 0$, $1 \leq k \leq 2^j$, define
$$ \psi_{j,k} : x \mapsto 2^{j/2} \psi_0\left( 2^j x - k + 1 \right) . $$
We set for every integer $j \geq 0$,
$$ \Lambda(j) = \left\{ (j,k) \,;\, 1 \leq k \leq 2^j \ \& \ \mathrm{Supp}(\psi_{j,k}) \cap [0,1] \neq \emptyset \right\} . $$
Moreover, we set $\psi_{-1,k}(x) = \phi_0(x - k + 1)$ and, for any integer $l \geq 0$,
$$ \Lambda(-1) = \left\{ (-1,k) \,;\, \mathrm{Supp}(\psi_{-1,k}) \cap [0,1] \neq \emptyset \right\} \quad \text{and} \quad \Lambda_l = \bigcup_{j=-1}^{l} \Lambda(j) . $$
Then we consider the model
$$ m = \mathrm{Span}\left\{ \psi_\lambda \,;\, \lambda \in \Lambda_l \right\} . $$
If $\phi_0$ and $\psi_0$ are compactly supported, then $\{ \psi_\lambda \,;\, \lambda \in \Lambda_l \}$ satisfies Assumption (A1) for an absolute constant $L_0$ and dimension $D = \mathrm{Card}(\Lambda_l)$. It is worth noting that more general multidimensional wavelets could also be considered, at the price of more technicalities.
When a model $m$ satisfies Assumption (A1), the small-ball condition is also verified, but with constants that may depend on the dimension of the model.
Proposition 3 If a linear model $m$ satisfies Assumption (A1), then inequality (4) of the small-ball condition given in Definition 1 is verified for any $\kappa_0 \in (0,1)$, with
$$ \beta_0 = (1 - \kappa_0^2) \, L_0^{-2} D^{-1} . $$
When applied to Theorem 2, a direct consequence of Proposition 3 is that ERM satisfies the following bound on a model $m$ satisfying Assumption (A1): for every $x > 0$ and $\kappa_0 \in (0,1)$, with probability at least $1 - \exp\left( -(1 - \kappa_0^2)^2 n / (4 L_0^4 D^2) \right) - (1/x)$,
$$ \|\hat{s} - s_m\|_2^2 \leq 16 L_0^4 (1 - \kappa_0^2)^{-2} \kappa_0^{-2} \, \sigma^2 \, \frac{D^3 x}{n} . $$
Hence, in such a case Theorem 2 only says something for models of dimension $D \lesssim \sqrt{n}$ (using the condition that $\exp(-(1 - \kappa_0^2)^2 n / (4 L_0^4 D^2))$ is small), and when the latter restriction is achieved, it provides a rate of convergence of the order of $D^3/n$. This is essentially a weakness of the small-ball approach in this case since, considering histograms and piecewise polynomial functions, [Sau12] proved in a bounded setting that the rate of convergence of ERM is actually $D/n$ (whenever $D \lesssim n / (\ln n)^2$), which is the optimal rate of linear aggregation. Furthermore, for some more general linear models with localized bases such as Haar expansions, [Sau15] also proved that the rate of convergence of ERM is still $D/n$.
The proof of Proposition 3, detailed in the Appendix, is a direct application of the Paley-Zygmund inequality (see [dlPG99]). [LM15] also noticed that, more generally, the Paley-Zygmund inequality could be used to prove the small-ball property when for some $p > 2$ the $L^p$ and $L^2$ norms are equivalent, or also for subgaussian classes, where the Orlicz $\psi_2$ norm is controlled by the $L^2$ norm, see [LM13]. These conditions are weaker than the control of the $L^\infty$ norm by the $L^2$ norm; however, we will show that the dependence on $D$ for $\beta_0$ given in Proposition 3 above is sharp for localized bases such as histograms, piecewise polynomials and wavelets. Hence, the control of the $L^\infty$ norm by the $L^2$ norm is in some way optimal in these cases, and weaker assumptions could not imply some improvement on the behavior of the small-ball property for these models. In conclusion, when applied to histograms and piecewise polynomials on regular partitions, or to compactly supported wavelets, the small-ball method developed in [LM15] enables one only to prove suboptimal rates of the order $D^3/n$.
Consider first a model of histograms on a regular partition $\Lambda$ of $\mathcal{X} = [0,1]^d$ made of $D$ pieces, $X$ being uniformly distributed on $\mathcal{X}$. More precisely, for any $I \in \Lambda$, set $s_I = \mathbf{1}_I / \sqrt{P_X(I)} = \sqrt{D} \, \mathbf{1}_I$ and take a dictionary $S = \{ s_I \,;\, I \in \Lambda \}$, associated to the model $m = \mathrm{Span}(S)$. It holds, for every $I \in \Lambda$,
$$ \| s_I \|_\infty = \sqrt{D} \, \| s_I \|_2 $$
and so, as the $s_I$'s have disjoint supports, it is easy to see that $m$ satisfies (A1) with $L_0 = 1$. Furthermore, for any $\kappa_0 \in (0,1)$,
$$ P\left( |s_I(X)| \geq \kappa_0 \| s_I \|_2 \right) = P_X(I) = \frac{1}{D} . \qquad (6) $$
This shows that necessarily $\beta_0 \leq D^{-1}$ and so, up to absolute constants, the value of $\beta_0$ given in Proposition 3 is optimal in the case of histograms on a regular partition. In particular, the small-ball condition cannot be satisfied with absolute constants in this case.
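Identity (6) is easy to confirm numerically. In this sketch (ours, not from the paper), $X$ is uniform on $[0,1]$, the partition has $D$ cells, and the Monte Carlo frequency of $\{|s_I(X)| \geq \kappa_0 \|s_I\|_2\}$ is close to $1/D$ for every $\kappa_0 \in (0,1)$:

```python
import numpy as np

rng = np.random.default_rng(2)
D = 10
X = rng.uniform(0.0, 1.0, size=200_000)

# s_I = sqrt(D) * 1_I for the first cell I = [0, 1/D); then ||s_I||_2 = 1.
s_I = np.sqrt(D) * (X < 1.0 / D)

for kappa0 in (0.1, 0.5, 0.9):
    # For kappa0 < sqrt(D), the event {|s_I(X)| >= kappa0} is exactly {X in I},
    # of probability 1/D: the small-ball constant beta_0 cannot exceed 1/D.
    p = np.mean(np.abs(s_I) >= kappa0)
    assert abs(p - 1.0 / D) < 0.01
```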
When considering the case of piecewise polynomial functions on a regular partition, identity (6) above still holds for polynomial functions of degree zero supported by one element of the partition. Thus, when the degrees of the polynomial functions in the model $m$ are bounded by a constant $r$, we easily deduce that $\beta_0 \leq (r+1) D^{-1}$ for any $\kappa_0 \in (0,1)$, and the value of $\beta_0$ given in Proposition 3 is again optimal in this case.
Finally, when the model $m$ corresponds to a finite expansion in some compactly supported wavelet basis, we have the following property, which again proves that the value of $\beta_0$ given in Proposition 3 is optimal. Examples of compactly supported wavelets include Daubechies wavelets and coiflets, see [HKPT98].
Proposition 4 Assume that $\mathcal{X} = [0,1]$ and that the design $X$ is uniformly distributed on $\mathcal{X}$. Take $m = \mathrm{Span}\{ \psi_\lambda \,;\, \lambda \in \Lambda_l \}$ (using the notations above defining the wavelets and the index set $\Lambda_l$) a linear model corresponding to some compactly supported wavelet expansion. More precisely, assume that $\mathrm{Supp}(\phi_0) \cup \mathrm{Supp}(\psi_0) \subset [0, R]$ for some $R \geq 1$, where $\mathrm{Supp}(\phi_0)$ and $\mathrm{Supp}(\psi_0)$ are the supports of the father wavelet $\phi_0$ and the mother wavelet $\psi_0$ respectively. Assume that the linear dimension $D = \mathrm{Card}(\Lambda_l)$ of the model $m$ is greater than $[R]$, the integer part of $R$: $D > [R]$. Then there exists an absolute constant $\kappa_0 > 0$ such that, if $m$ achieves the small-ball condition given in Definition 1 with constants $(\kappa_0, \beta_0)$, then
$$ \beta_0 \leq (D - [R])^{-1} . $$
The proof of Proposition 4 consists of basic calculations and can be found in the Appendix.
3 Optimal excess risk bounds for Fourier expansions
3.1 Main theorem
We have shown in Section 2 that the small-ball condition is satis…ed for linear models such as histograms, piecewise polynomials or compactly supported wavelets, but with constants that depend on the dimension of the model in such a way that using this condition to analyze the rate of convergence of ERM on these models may lead to suboptimal bounds.
The behavior of the small-ball condition when the model $m$ is spanned by the first $D$ elements of the Fourier basis - considering that $\mathcal{X} = [0, 2\pi]$ - remains however essentially unknown. Indeed, in the Fourier case, the bound obtained in Proposition 3 for the constants involved in the small-ball condition might be suboptimal. In particular, lower bounds such as the one established in Proposition 4 in the case of compactly supported wavelets remain inaccessible for us in the Fourier case.
It is worth noting that in the context of sparse recovery, [LM14] already noticed that the small-ball condition for Fourier measurements is hard to check, and the authors mention a possible 'non-uniform' small-ball property satisfied in this case, but leave this direction open (see Remark 1.5 of [LM14] for more details).
Our aim in this section is to show that optimal rates of linear aggregation are attained by ERM in the Fourier case, that is, when the model $m$ is spanned by the first $D$ elements of the Fourier basis. As optimal rates would also be achieved through Theorem 2 if the model $m$ satisfied the small-ball condition with absolute constants, this supports the conjecture made by [LM14] that the Fourier basis achieves a condition which is close to the small-ball condition with absolute constants.
We only tackle the bounded setting. One of the main reasons for this restriction is that we make recurrent use along our proofs of classical Talagrand-type concentration inequalities for suprema of the empirical process with bounded arguments. Indeed, our approach, which is based on [Sau12] and is detailed at a heuristic level in Section 3.2 below, is very different from the small-ball approach. In fact, as we will explain in Section 3.2, it is closely related to recent advances linked to excess risk concentration due to [Cha14] in the context of least-squares under convex constraint, extended to regularized ERM by [vdGW16]. Even if concentration inequalities exist for suprema of the empirical process with unbounded arguments, the unbounded case would involve many more technicalities and this would go beyond the scope of this paper.
Let us now precisely detail our assumptions. Assume that the design $X$ is uniformly distributed on $\mathcal{X} = [0, 2\pi]$ and that the regression function $s$ satisfies $s(0) = s(2\pi)$. Then the Fourier basis is orthonormal in $L^2(P_X)$ and we consider a model $m$ of dimension $D$ (assumed to be odd) corresponding to the linear vector space spanned by the first $D$ elements of the Fourier basis. More precisely, if we set $\varphi_0 \equiv 1$, $\varphi_{2k-1}(x) = \sqrt{2} \cos(kx)$ and $\varphi_{2k}(x) = \sqrt{2} \sin(kx)$ for $1 \leq k \leq l$, then $(\varphi_j)_{j=0}^{D-1}$ is an orthonormal basis of $(m, \|\cdot\|_2)$, for an integer $l$ satisfying $2l + 1 = D$. Assume also:
(H1) The data and the linear projection of the target onto $m$ are bounded by a positive finite constant $A$:
$$ |Y| \leq A \quad a.s. \qquad (7) $$
and
$$ \| s_m \|_\infty \leq A . \qquad (8) $$
Hence, from (H1) we deduce that
$$ \| s \|_\infty = \| \mathbb{E}[Y \mid X = \cdot\,] \|_\infty \leq A \qquad (9) $$
and that there exists a constant $\sigma_{\max} > 0$ such that
$$ \sigma^2(X_i) \leq \sigma_{\max}^2 \leq A^2 \quad a.s. \qquad (10) $$
(H2) The heteroscedastic noise level is not reduced to zero:
$$ \| \sigma \|_2 = \sqrt{\mathbb{E}[\sigma^2(X)]} > 0 . $$
We are now in position to state our main result.
Theorem 5 Let $A_+, A_-, \alpha > 0$ and let $m$ be a linear vector space spanned by a dictionary made of the first $D$ elements of the Fourier basis. Assume (H1) and take $\varphi = (\varphi_k)_{k=0}^{D-1}$ the Fourier basis of $m$. If it holds
$$ A_- (\ln n)^2 \leq D \leq A_+ n^{1/2} (\ln n)^{-2} , \qquad (11) $$
then there exists a constant $A_0 > 0$, only depending on $\alpha$, $A_-$ and on the constants $A$, $\|\sigma\|_2$ defined in assumptions (H1), (H2) respectively, such that by setting
$$ \varepsilon_n = A_0 \max\left\{ \left( \frac{\ln n}{D} \right)^{1/4} , \left( \frac{D^2 \ln n}{n} \right)^{1/4} \right\} , \qquad (12) $$
we have for all $n \geq n_0(A_-, A_+, A, \|\sigma\|_2, \alpha)$,
$$ P\left( \|\hat{s} - s_m\|_2^2 \geq (1 - \varepsilon_n) \frac{D}{n} \, \mathcal{C}_m^2 \right) \geq 1 - 5 n^{-\alpha} , \qquad (13) $$
$$ P\left( \|\hat{s} - s_m\|_2^2 \leq (1 + \varepsilon_n) \frac{D}{n} \, \mathcal{C}_m^2 \right) \geq 1 - 5 n^{-\alpha} , \qquad (14) $$
where $\hat{s}$ is the least-squares estimator on $m$, defined in (2), and
$$ \mathcal{C}_m^2 = \mathbb{E}\left[ \sigma^2(X) \right] + \| s - s_m \|_2^2 . $$
The rate of convergence of ERM for linear aggregation with a Fourier dictionary exhibited by Theorem 5 is thus of the order of $D/n$, which is the optimal rate of linear aggregation. In particular, this outperforms the bounds obtained in Theorem 2.2 of [AC11] under the same assumption as Assumption (A1), which is satisfied in the Fourier case, but also under more general moment assumptions on the noise. Indeed, as noticed in [LM15], the bounds obtained by [AC11] are in this case of the order of $D^3/n$, for models of dimension lower than $n^{1/4}$. In Theorem 5, our condition on the permitted dimension is less restrictive, since models with dimension close to $n^{1/2}$ are allowed.
Concerning the assumptions, uniform boundedness of the projection of the target onto the model, as described in (8), is guaranteed as soon as the regression function belongs to a broad class of functions named the Wiener algebra, that is, whenever the Fourier coefficients of the regression function are summable (in other words, when the Fourier series of the regression function is absolutely convergent). For instance, functions that are Hölder continuous with index greater than 1/2 belong to the Wiener algebra, [Kat04].
Furthermore, Theorem 5 gives information that is far more precise than simply the rate of convergence of the least-squares estimator. Indeed, the conjunction of inequalities (13) and (14) of Theorem 5 actually proves the concentration of the excess risk of the least-squares estimator around one precise value, which is $D \, \mathcal{C}_m^2 / n$.
There are only very few and recent such concentration results for the excess risk of an M-estimator in the literature, and this question constitutes an exciting new line of research in learning theory. Considering the same regression framework as ours, [Sau12] has shown concentration bounds for the excess risk of the least-squares estimator on models of piecewise polynomial functions. In a slightly different context of least-squares estimation under convex constraint, [Cha14] also proved the concentration in $L^2$ norm, with fixed design and Gaussian noise. Under the latter assumptions, [MvdG15] have shown the excess risk's concentration for the penalized least-squares estimator. Finally, [vdGW16] recently proved some concentration results for some regularized M-estimators. They also give an application of their results to a linearized regression context with random design and independent Gaussian noise.
3.2 Outline of the approach
The aim of this section is to explain the main ideas leading to the proof of Theorem 5 and to highlight some connections with other works in the literature.
The proof of Theorem 5 is technically very involved and is an adaptation to the Fourier case of the approach developed in [Sau12] concerning the performance of the least-squares estimator on models of piecewise polynomial functions and more general models endowed with a localized orthonormal basis.
An orthonormal basis $(\psi_k)_{k=1}^{D}$ of a linear model $m \subset L^2(P_X)$ is said to be localized if there exists a constant $L > 0$ such that
$$ \left\| \sum_{k=1}^{D} \beta_k \psi_k \right\|_\infty \leq L \sqrt{D} \sup_{k} |\beta_k| $$
for any $(\beta_k)_{k=1}^{D} \in \mathbb{R}^D$. This condition, taken advantage of in [Sau12], is typically valid for models of piecewise polynomial functions and wavelets, but it is false in the Fourier case. Indeed, if $(\varphi_k)_{k=0}^{D-1}$ is the collection of the $D = 2l+1$ first elements of the Fourier basis defined in Section 3.1 above, then by taking $(\beta_k)_{k=0}^{D-1} = (1, 1, 0, 1, 0, 1, 0, \dots, 1, 0)$, it holds
$$ \left\| \sum_{k=0}^{D-1} \beta_k \varphi_k \right\|_\infty \geq \sum_{k=0}^{D-1} \beta_k \varphi_k(0) = 1 + \sqrt{2} \sum_{k=0}^{l-1} \cos(k \cdot 0) \geq l + 1 \geq \frac{D}{2} \sup_{k} |\beta_k| . $$
Hence, the Fourier basis has a behavior with respect to the sup-norm which is harder to control from this point of view than localized bases. It is however possible, for models of dimension $D \lesssim \sqrt{n}$, to prove the consistency in sup-norm of the least-squares estimator towards the projection of the regression function $s$ onto a model corresponding to finite Fourier expansions. More precisely, we prove the following theorem, which is an essential piece in the proof of Theorem 5.
Theorem 6 Let $\alpha > 0$. Assume that $m$ is a linear vector space spanned by the first $D$ elements of the Fourier basis. Assume that (H1) holds and that there exists $A_+ > 0$ such that
$$ D \leq A_+ n^{1/2} (\ln n)^{-2} . $$
Then we have, for all $n \geq n_0(A_+, \alpha)$,
$$ P\left( \|\hat{s} - s_m\|_\infty \geq L^{(1)}_{A,\alpha} \, D \sqrt{\frac{\ln n}{n}} \right) \leq n^{-\alpha} . $$
The proof of Theorem 6, which uses concentration tools such as Bernstein's inequality (see Theorem 19), can be found in the Appendix, Section 5.1. Using Theorem 6, we can localize - with probability close to one, equal to $1 - n^{-\alpha}$ - our analysis in a ball in sup-norm $B_{L^\infty}(s_m, R_{n,D,\alpha})$, centered on the projection of the target and of radius $R_{n,D,\alpha} = L^{(1)}_{A,\alpha} \, D \sqrt{\ln n / n}$.
As explained in [Sau12], empirical process theory can be used to derive optimal bounds such as Theorem 5, through the use of a representation formula for the excess risk, in terms of local suprema of the underlying empirical process. Indeed, set the least-squares contrast $\gamma : L^2(P_X) \to L^1(P)$ defined by
$$ \gamma(s) : (x, y) \in \mathcal{X} \times \mathbb{R} \mapsto (y - s(x))^2 . $$
Then, we can write $\hat{s} \in \arg\min_{s \in m} \{ P_n(\gamma(s)) \}$, where $P_n$ is the empirical measure associated to the sample, and we also have $\|\hat{s} - s_m\|_2^2 = P(\gamma(\hat{s}) - \gamma(s_m))$. Furthermore, the following representation formula holds for the excess risk (see identity (3.9) of [Sau12]) with probability close to one,
$$ \|\hat{s} - s_m\|_2^2 \in \arg\max_{C \geq 0} \left\{ \sup_{s \in D_C} (P - P_n)\left( \gamma(s) - \gamma(s_m) \right) - C \right\} , \qquad (15) $$
where
$$ D_C := \left\{ s \in m \,;\, \| s - s_m \|_2^2 = C \right\} \cap B_{L^\infty}(s_m, R_{n,D,\alpha}) . $$
A similar representation formula is also at the core of the approach developed by [Cha14] for least-squares estimation under convex constraints, extended to regularized ERM by [MvdG15] and [vdGW16]. In [Cha14] and [MvdG15], the framework allows one to replace the empirical process appearing in (15) by a Gaussian process, while in [vdGW16] the more general framework of M-estimation forces the authors to work with an empirical process, exactly as in (15).
Using concentration inequalities for the supremum of the empirical process, one may show from (15) that with probability close to one,
$$ \|\hat{s} - s_m\|_2^2 \approx \arg\max_{C \geq 0} \left\{ \mathbb{E}\left[ \sup_{s \in D_C} (P - P_n)\left( \gamma(s) - \gamma(s_m) \right) \right] - C \right\} . \qquad (16) $$
The quantity of interest is thus the expectation of the supremum of the empirical process over a slice $D_C$ of the model $m$. Moreover, as we want to derive optimal bounds, we are looking for a control, with the right constant, of the first order of this quantity. To this end, we introduce an argument of contrast expansion into a linear and a quadratic part, originally developed in [Sau12] and which is also one of the main features of the small-ball approach first built in [Men15]. More precisely, for every $s \in m$ and $z = (x, y) \in \mathcal{X} \times \mathbb{R}$, it holds
$$ \gamma(s)(z) - \gamma(s_m)(z) = \psi_{1,m}(z) \cdot (s - s_m)(x) + (s - s_m)^2(x) , \qquad (17) $$
where $\psi_{1,m}(z) = -2\left( y - s_m(x) \right)$. Using (17), we then split the quantity of interest in (16) into two parts,
$$ \mathbb{E}\left[ \sup_{s \in D_C} (P - P_n)\left( \gamma(s) - \gamma(s_m) \right) \right] \leq \underbrace{\mathbb{E}\left[ \sup_{s \in D_C} (P - P_n)\left( \psi_{1,m} \cdot (s - s_m) \right) \right]}_{\text{main part}} + \underbrace{\mathbb{E}\left[ \sup_{s \in D_C} (P - P_n)\left( (s - s_m)^2 \right) \right]}_{\text{remainder term}} . $$
We finally show that the empirical process corresponding to the linear part of the contrast expansion gives the exact first order of the rate of linear aggregation of the least-squares estimator, while the empirical process corresponding to the quadratic part of the contrast expansion contributes only through remainder terms. Some technical lemmas roughly corresponding to the previous observations can be found in the Appendix, Section 5.2.1.
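The expansion (17) is a pointwise algebraic identity, which can be sanity-checked numerically; in this sketch (ours), the functions playing the roles of $s$ and $s_m$ are arbitrary stand-ins:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(size=1000)
y = rng.standard_normal(1000)

s_m = np.sin(2 * np.pi * x)                    # stand-in for the projection s_m
s = s_m + 0.3 * np.cos(2 * np.pi * x)          # another hypothetical element of the model

# gamma(s) - gamma(s_m) = psi_{1,m} * (s - s_m) + (s - s_m)^2,
# with psi_{1,m}(x, y) = -2 (y - s_m(x)): a linear part plus a quadratic part.
lhs = (y - s) ** 2 - (y - s_m) ** 2
psi_1m = -2.0 * (y - s_m)
rhs = psi_1m * (s - s_m) + (s - s_m) ** 2
assert np.allclose(lhs, rhs)
```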
4 Proofs related to Section 2
Proof of Proposition 3. Take $s \in m$ and $\kappa_0 \in (0,1)$. Set
$$ \Omega_0 = \left\{ |s(X)| \geq \kappa_0 \| s \|_2 \right\} . $$
By the Paley-Zygmund inequality (Corollary 3.3.2 in [dlPG99]), it holds
$$ P(\Omega_0) \geq (1 - \kappa_0^2) \frac{\| s \|_2^2}{\| s \|_\infty^2} \geq \frac{1 - \kappa_0^2}{L_0^2} \, \frac{1}{D} , $$
which readily proves Proposition 3.
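The form of the Paley-Zygmund inequality used here, $P(|s(X)| \geq \kappa_0 \|s\|_2) \geq (1 - \kappa_0^2) \|s\|_2^2 / \|s\|_\infty^2$, admits a quick Monte Carlo sanity check (ours; the test function is an arbitrary bounded choice):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.uniform(0.0, 1.0, size=500_000)
s = 1.0 + 0.8 * np.cos(2 * np.pi * X)   # an arbitrary bounded function of the design

l2_sq = np.mean(s ** 2)                 # ||s||_2^2 under P_X (Monte Carlo)
sup_sq = np.max(np.abs(s)) ** 2         # ||s||_inf^2 (on the sample)

for kappa0 in (0.2, 0.5, 0.8):
    lhs = np.mean(np.abs(s) >= kappa0 * np.sqrt(l2_sq))
    rhs = (1.0 - kappa0 ** 2) * l2_sq / sup_sq
    assert lhs >= rhs - 0.01            # Paley-Zygmund lower bound, up to MC error
```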
Proof of Proposition 4. Take $(j,k) \in \Lambda_l$, $j \geq 0$. It holds
$$ \| \psi_{j,k} \|_2^2 = \int_0^1 \psi_{j,k}^2(x) \, dx = 2^j \int_0^1 \psi_0^2\left( 2^j x - k + 1 \right) dx = \int_{-k+1}^{2^j - k + 1} |\psi_0(y)|^2 \, dy . $$
Thus, whenever $\mathrm{Supp}(\psi_0) \subset [-k+1, 2^j - k + 1]$, one has $\| \psi_{j,k} \|_2 = 1$. Take $j_0 \geq 0$ such that $2^{j_0} \geq R$. Then $\| \psi_{j_0,1} \|_2 = 1$ and it is easy to see that for any $j \geq j_0$ there exist at least $2^{j - j_0}$ values of $k$ such that $\mathrm{Supp}(\psi_0) \subset [-k+1, 2^j - k + 1]$ and $\| \psi_{j,k} \|_2 = 1$. Now, take $\kappa_0 > 0$ such that
$$ \int_0^R \mathbf{1}_{\{ |\psi_0(y)| \geq \kappa_0 \}} \, dy \leq \frac{1}{2} . $$
For any $j \geq j_0$ and $k$ such that $\mathrm{Supp}(\psi_0) \subset [-k+1, 2^j - k + 1]$, we get
$$ P\left( |\psi_{j,k}(X)| \geq \kappa_0 \| \psi_{j,k} \|_2 \right) = P\left( 2^{j/2} |\psi_0(2^j X - k + 1)| \geq \kappa_0 \right) = 2^{-j} \int_{-k+1}^{2^j - k + 1} \mathbf{1}_{\{ 2^{j/2} |\psi_0(y)| \geq \kappa_0 \}} \, dy = 2^{-j} \int_0^R \mathbf{1}_{\{ 2^{j/2} |\psi_0(y)| \geq \kappa_0 \}} \, dy \leq 2^{-j-1} . \qquad (18) $$
Furthermore, notice that $\mathrm{Card}(\Lambda(-1)) \leq [R] + 1$ and $D = \mathrm{Card}(\Lambda_l) \leq [R] + 2^{l+1}$. Hence, taking $j = l$ in (18), we deduce that
$$ P\left( |\psi_{l,k}(X)| \geq \kappa_0 \| \psi_{l,k} \|_2 \right) \leq \frac{1}{D - [R]} , $$
which gives the result.
5 Proofs related to Section 3
5.1 Proof of Theorem 6
Proof of Theorem 6. Let $\alpha, C > 0$. Set
$$ F_\infty^{\leq C} := \left\{ s \in m \,;\, \| s - s_m \|_\infty \leq C \right\} \quad \text{and} \quad F_\infty^{> C} := \left\{ s \in m \,;\, \| s - s_m \|_\infty > C \right\} = m \setminus F_\infty^{\leq C} . $$
Take the Fourier basis $(\varphi_k)_{k=0}^{D-1}$ of $(m, \|\cdot\|_2)$. By assumption on $D$, it holds $D \leq n$.
Hence, by Lemma 8 below, we get that there exists $L^{(1)}_{A,\alpha} > 0$ such that, by setting
$$ \Omega_1 = \left\{ \max_{k \in \{0, \dots, D-1\}} \left| (P_n - P)\left( \psi_{1,m} \cdot \varphi_k \right) \right| \leq L^{(1)}_{A,\alpha} \sqrt{\frac{\ln n}{n}} \right\} , $$
we have for all $n \geq n_0(A_+)$, $P(\Omega_1) \geq 1 - n^{-\alpha}$. Moreover, we set
$$ \Omega_2 = \left\{ \max_{(k,l) \in \{0, \dots, D-1\}^2} \left| (P_n - P)(\varphi_k \varphi_l) \right| \leq L^{(2)} \sqrt{\frac{\ln n}{n}} \right\} , $$
where $L^{(2)}$ is defined in Lemma 7 below. By Lemma 7, we have $P(\Omega_2) \geq 1 - n^{-\alpha}$ and so, for all $n \geq n_0(A_+)$,
$$ P\left( \Omega_1 \cap \Omega_2 \right) \geq 1 - 2 n^{-\alpha} . \qquad (19) $$
We thus have for all $n \geq n_0(A_+)$,
$$ P\left( \| \hat{s} - s_m \|_\infty > C \right) \leq P\left( \inf_{s \in F_\infty^{>C}} P_n\left( \gamma(s) - \gamma(s_m) \right) \leq \inf_{s \in F_\infty^{\leq C}} P_n\left( \gamma(s) - \gamma(s_m) \right) \right) $$
$$ = P\left( \sup_{s \in F_\infty^{>C}} P_n\left( \gamma(s_m) - \gamma(s) \right) \geq \sup_{s \in F_\infty^{\leq C}} P_n\left( \gamma(s_m) - \gamma(s) \right) \right) $$
$$ \leq P\left( \left\{ \sup_{s \in F_\infty^{>C}} P_n\left( \gamma(s_m) - \gamma(s) \right) \geq \sup_{s \in F_\infty^{\leq C/2}} P_n\left( \gamma(s_m) - \gamma(s) \right) \right\} \cap \Omega_1 \cap \Omega_2 \right) + 2 n^{-\alpha} . \qquad (20) $$
Now, for any $s \in m$ such that
$$ s - s_m = \sum_{k=0}^{D-1} \beta_k \varphi_k , \quad \beta = (\beta_k)_{k=0}^{D-1} \in \mathbb{R}^D , $$
we have
$$ P_n\left( \gamma(s_m) - \gamma(s) \right) = (P_n - P)\left( \psi_{1,m} \cdot (s_m - s) \right) - (P_n - P)\left( (s - s_m)^2 \right) - \| s - s_m \|_2^2 $$
$$ = \sum_{k=0}^{D-1} \beta_k \, (P - P_n)\left( \psi_{1,m} \cdot \varphi_k \right) - \sum_{k,l=0}^{D-1} \beta_k \beta_l \, (P_n - P)(\varphi_k \varphi_l) - \sum_{k=0}^{D-1} \beta_k^2 .
We set, for any $(k,l) \in \{0, \dots, D-1\}^2$,
$$ R^{(1)}_k = (P - P_n)\left( \psi_{1,m} \cdot \varphi_k \right) \quad \text{and} \quad R^{(2)}_{k,l} = (P_n - P)(\varphi_k \varphi_l) . $$
Moreover, we set a function $h_n$, defined as follows,
$$ h_n : \beta = (\beta_k)_{k=0}^{D-1} \mapsto \sum_{k=0}^{D-1} \beta_k R^{(1)}_k - \sum_{k,l=0}^{D-1} \beta_k \beta_l R^{(2)}_{k,l} - \sum_{k=0}^{D-1} \beta_k^2 . $$
We thus have, for any $s \in m$ such that $s - s_m = \sum_{k=0}^{D-1} \beta_k \varphi_k$, $\beta = (\beta_k)_{k=0}^{D-1} \in \mathbb{R}^D$,
$$ P_n\left( \gamma(s_m) - \gamma(s) \right) = h_n(\beta) . \qquad (21) $$
In addition, we set for any $\beta = (\beta_k)_{k=0}^{D-1} \in \mathbb{R}^D$,
$$ |\beta|_{m,\infty} = \sqrt{2} \, D \, |\beta|_\infty . \qquad (22) $$
It is straightforward to see that $|\cdot|_{m,\infty}$ is a norm on $\mathbb{R}^D$, proportional to the sup-norm. For a real $D \times D$ matrix $B$, we also write $\|B\|_m$ for its operator norm associated to the norm $|\cdot|_{m,\infty}$ on the $D$-dimensional vectors. More explicitly, we set for any $B \in \mathbb{R}^{D \times D}$,
$$\|B\|_m := \sup_{\beta \in \mathbb{R}^D,\, \beta \neq 0} \frac{|B\beta|_{m,\infty}}{|\beta|_{m,\infty}} = \sup_{\beta \in \mathbb{R}^D,\, \beta \neq 0} \frac{|B\beta|_{\infty}}{|\beta|_{\infty}}.$$
Note that $\|\cdot\|_m$ is an operator norm, so $\left\| B^k \right\|_m \leq \|B\|_m^k$ for any $k \in \mathbb{N}$. We also have, for any $B = (B_{k,l})_{k,l=0,\dots,D-1} \in \mathbb{R}^{D \times D}$, the following classical formula:
$$\|B\|_m = \max_{k \in \{0,\dots,D-1\}} \sum_{l \in \{0,\dots,D-1\}} \left| B_{k,l} \right|. \quad (23)$$
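As a sanity check of formula (23), the following sketch (using NumPy; all variable names are ours) compares the maximum absolute row sum of a random matrix with a direct evaluation of the operator norm induced by the sup-norm, and with the submultiplicativity $\|B^k\|_m \leq \|B\|_m^k$.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8
B = rng.standard_normal((D, D))

# Formula (23): the operator norm induced by any positive multiple of the
# sup-norm is the maximum absolute row sum of the matrix.
row_sum_norm = np.abs(B).sum(axis=1).max()
assert np.isclose(row_sum_norm, np.linalg.norm(B, ord=np.inf))

# Lower-bound the norm by evaluating |B beta|_inf / |beta|_inf over many
# random directions: no ratio can exceed the max absolute row sum.
betas = rng.standard_normal((10000, D))
ratios = np.abs(betas @ B.T).max(axis=1) / np.abs(betas).max(axis=1)
assert ratios.max() <= row_sum_norm + 1e-12

# Submultiplicativity of the operator norm: ||B^3||_m <= ||B||_m^3.
assert np.linalg.norm(np.linalg.matrix_power(B, 3), np.inf) <= row_sum_norm ** 3 + 1e-9
```

The scaling factor $\sqrt{2}D$ in $|\cdot|_{m,\infty}$ cancels in the ratio defining $\|B\|_m$, which is why the max-row-sum formula does not depend on it.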
Notice that for any $\beta = (\beta_k)_{k=0}^{D-1} \in \mathbb{R}^D$,
$$\left\| \sum_{k=0}^{D-1} \beta_k \varphi_k \right\|_{\infty} \leq D\, |\beta|_{\infty} \sup_k \|\varphi_k\|_{\infty} \leq |\beta|_{m,\infty}.$$
Hence, it holds
$$\mathcal{F}_{>C}^{\infty} \subset \left\{ s \in m \; ; \; s - s_m = \sum_{k=0}^{D-1} \beta_k \varphi_k \text{ and } |\beta|_{m,\infty} \geq C \right\} \quad (24)$$
and
$$\mathcal{F}_{C/2}^{\infty} \supset \left\{ s \in m \; ; \; s - s_m = \sum_{k=0}^{D-1} \beta_k \varphi_k \text{ and } |\beta|_{m,\infty} \leq C/2 \right\}. \quad (25)$$
Hence, from (20), (21), (24) and (25), we deduce that if we find, on $\Omega_1 \cap \Omega_2$, a value of $C$ such that
$$\sup_{\beta \in \mathbb{R}^D,\, |\beta|_{m,\infty} \geq C} h_n(\beta) < \sup_{\beta \in \mathbb{R}^D,\, |\beta|_{m,\infty} \leq C/2} h_n(\beta),$$
then we will get
$$P\left( \|\widehat{s} - s_m\|_{\infty} > C \right) \leq 2n^{-\alpha}.$$
Taking the partial derivatives of $h_n$ with respect to the coordinates of its argument, it holds for any $k \in \{0,\dots,D-1\}$ and $\beta = (\beta_i)_{i=0}^{D-1} \in \mathbb{R}^D$,
$$\frac{\partial h_n}{\partial \beta_k}(\beta) = -R^{(1)}_k - 2\sum_{i=0}^{D-1} \beta_i R^{(2)}_{k,i} - 2\beta_k. \quad (26)$$
We now look at the set of solutions of the following system:
$$\frac{\partial h_n}{\partial \beta_k}(\beta) = 0, \qquad \forall k \in \{0,\dots,D-1\}. \quad (27)$$
We define the $D \times D$ matrix $R^{(2)}_n$ to be
$$R^{(2)}_n := \left( R^{(2)}_{k,l} \right)_{k,l=0,\dots,D-1}$$
and, by (26), the system given in (27) can be written
$$2\left( I_D + R^{(2)}_n \right)\beta = -R^{(1)}_n, \quad (S)$$
where $R^{(1)}_n$ is the $D$-dimensional vector defined by $R^{(1)}_n = \left( R^{(1)}_k \right)_{k=0,\dots,D-1}$.
Let us give an upper bound on the norm $\left\| R^{(2)}_n \right\|_m$, in order to show that the matrix $I_D + R^{(2)}_n$ is nonsingular. On $\Omega_2$ we have
$$\left\| R^{(2)}_n \right\|_m = \max_{k \in \{0,\dots,D-1\}} \sum_{l \in \{0,\dots,D-1\}} \left| (P_n - P)(\varphi_k \varphi_l) \right| \leq L^{(2)}_{\alpha} \max_{k \in \{0,\dots,D-1\}} \sum_{l \in \{0,\dots,D-1\}} \sqrt{\frac{\ln n}{n}} = L^{(2)}_{\alpha}\, D\, \sqrt{\frac{\ln n}{n}}. \quad (28)$$
Hence, from (28) and the fact that $D \leq A_+ n^{1/2} (\ln n)^{-2}$, we get that for all $n \geq n_0(A_+, \alpha)$, it holds on $\Omega_2$,
$$\left\| R^{(2)}_n \right\|_m \leq \frac{1}{2},$$
so the matrix $I_D + R^{(2)}_n$ is nonsingular, with inverse $\left( I_D + R^{(2)}_n \right)^{-1} = \sum_{u=0}^{+\infty} \left( -R^{(2)}_n \right)^u$. Hence, the system (S) admits a unique solution $\beta^{(n)}$, given by
$$\beta^{(n)} = -\frac{1}{2} \left( I_D + R^{(2)}_n \right)^{-1} R^{(1)}_n.$$
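The inversion step can be illustrated numerically: whenever a perturbation matrix has max-row-sum norm at most $1/2$, the Neumann series converges geometrically and yields the unique solution of a system of the form (S). A minimal sketch, with synthetic matrices standing in for $R^{(2)}_n$ and $R^{(1)}_n$ (the data here is random, not drawn from the paper's setting):

```python
import numpy as np

rng = np.random.default_rng(1)
D = 6

# A perturbation with max absolute row sum exactly 1/2, as guaranteed on
# Omega_2 for n large enough.
R2 = rng.uniform(-1, 1, (D, D))
R2 *= 0.5 / np.abs(R2).sum(axis=1).max()

# Since ||R2||_m <= 1/2 < 1, I + R2 is nonsingular and its inverse is the
# Neumann series sum_{u >= 0} (-R2)^u, truncated here after 60 terms.
neumann = np.zeros((D, D))
term = np.eye(D)
for _ in range(60):
    neumann += term
    term = term @ (-R2)
assert np.allclose(neumann, np.linalg.inv(np.eye(D) + R2))

# The unique critical point of h_n solves 2 (I + R2) beta = -R1.
R1 = rng.standard_normal(D)
beta = -0.5 * neumann @ R1
assert np.allclose(2 * (np.eye(D) + R2) @ beta, -R1)
```

Truncating at 60 terms is ample here: the remainder is bounded in norm by $2^{-60}$, far below floating-point tolerance.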
Now, on $\Omega_1$ we have
$$\left| R^{(1)}_n \right|_{m,\infty} \leq \sqrt{2}\, D \max_{k \in \{0,\dots,D-1\}} \left| (P_n - P)\left(\psi_{1,m}\varphi_k\right) \right| \leq L^{(1)}_{A,\alpha}\, D\, \sqrt{\frac{2\ln n}{n}},$$
and we deduce that for all $n \geq n_0(A_+, \alpha)$, it holds on $\Omega_1 \cap \Omega_2$,
$$\left| \beta^{(n)} \right|_{m,\infty} \leq \frac{1}{2} \left\| \left( I_D + R^{(2)}_n \right)^{-1} \right\|_m \left| R^{(1)}_n \right|_{m,\infty} \leq L^{(1)}_{A,\alpha}\, D\, \sqrt{\frac{2\ln n}{n}}. \quad (29)$$
Moreover, by formula (21) we have
$$h_n(\beta) = P_n\left(\gamma(s_m)\right) - \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - s_m(X_i) - \sum_{k=0}^{D-1} \beta_k \varphi_k(X_i) \right)^2,$$
and we thus see that $h_n$ is concave. Hence, for all $n \geq n_0(A_+, \alpha)$, we get that on $\Omega_2$, $\beta^{(n)}$ is the unique maximizer of $h_n$, and on $\Omega_1 \cap \Omega_2$, by (29), the concavity of $h_n$ and the uniqueness of $\beta^{(n)}$, we get
$$h_n\left( \beta^{(n)} \right) = \sup_{\beta \in \mathbb{R}^D,\, |\beta|_{m,\infty} \leq C/2} h_n(\beta) > \sup_{\beta \in \mathbb{R}^D,\, |\beta|_{m,\infty} \geq C} h_n(\beta),$$
with $C = 2 L^{(1)}_{A,\alpha} D \sqrt{\frac{2\ln n}{n}}$, which concludes the proof.
Lemma 7 Let $\alpha > 0$. Assume that $m$ is a linear vector space spanned by the first $D$ elements of the Fourier basis, where $D \leq n$. Then there exists $L^{(2)}_{\alpha} > 0$ such that
$$P\left( \max_{(k,l) \in \{0,\dots,D-1\}^2} \left| (P_n - P)(\varphi_k \varphi_l) \right| \geq L^{(2)}_{\alpha} \sqrt{\frac{\ln n}{n}} \right) \leq n^{-\alpha}. \quad (30)$$
Proof. For any $(k,l) \in \{0,\dots,D-1\}^2$, we have
$$\mathbb{E}\left[ (\varphi_k \varphi_l)^2 \right] \leq \left\| \varphi_k \varphi_l \right\|_{\infty}^2 \leq 4.$$
Hence, applying Bernstein's inequality (see Proposition 2.9 in [Mas07]), we get, for all $\gamma > 0$,
$$P\left( \left| (P_n - P)(\varphi_k \varphi_l) \right| \geq 2\sqrt{\frac{2\gamma \ln n}{n}} + \frac{2\gamma \ln n}{3n} \right) \leq 2n^{-\gamma}. \quad (31)$$
We get from (31) that, for all $\gamma > 0$,
$$P\left( \max_{(k,l) \in \{0,\dots,D-1\}^2} \left| (P_n - P)(\varphi_k \varphi_l) \right| \geq \left( 2\sqrt{2\gamma} + \frac{2\gamma}{3} \right) \sqrt{\frac{\ln n}{n}} \right) \leq 2D^2 n^{-\gamma} \leq 2n^{-\gamma + 2}. \quad (32)$$
To conclude, take $\gamma = \alpha + 2$.
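A quick Monte Carlo sketch of the deviation bound (30), with an explicitly constructed Fourier basis and $X$ uniform on $[0,1]$ (sample size, dimension and seed are arbitrary choices for illustration): the largest entry-wise deviation of the empirical Gram matrix from the identity is of order $\sqrt{\ln n / n}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, D = 2000, 11   # constant function plus 5 cosine/sine pairs

def fourier_basis(x, D):
    """First D elements of the Fourier basis on [0,1], orthonormal in L2(dx)."""
    cols = [np.ones_like(x)]
    j = 1
    while len(cols) < D:
        cols.append(np.sqrt(2) * np.cos(2 * np.pi * j * x))
        if len(cols) < D:
            cols.append(np.sqrt(2) * np.sin(2 * np.pi * j * x))
        j += 1
    return np.stack(cols, axis=1)   # shape (n, D)

X = rng.uniform(size=n)
Phi = fourier_basis(X, D)

# Empirical Gram matrix P_n(phi_k phi_l); its population counterpart is I_D.
G = Phi.T @ Phi / n
dev = np.abs(G - np.eye(D)).max()   # max_{k,l} |(P_n - P)(phi_k phi_l)|
rate = np.sqrt(np.log(n) / n)
print(dev, rate)                    # dev is of order sqrt(ln n / n)
```

Since each product $\varphi_k \varphi_l$ is bounded by 2, Bernstein's inequality controls every entry, and the union bound over the $D^2$ pairs only costs a logarithmic factor.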
Lemma 8 Let $\alpha > 0$. Assume that $m$ is a linear vector space spanned by the first $D$ elements of the Fourier basis, where $D \leq n$. If (H1) holds and $\psi_{1,m}(X,Y) := -2\left( Y - s_m(X) \right)$, then there exists $L^{(1)}_{A,\alpha} > 0$ such that
$$P\left( \max_{k \in \{0,\dots,D-1\}} \left| (P_n - P)\left(\psi_{1,m}\varphi_k\right) \right| \geq L^{(1)}_{A,\alpha} \sqrt{\frac{\ln n}{n}} \right) \leq n^{-\alpha}. \quad (33)$$
Proof. Let $\gamma > 0$. By Bernstein's inequality, we get by straightforward computations (in the spirit of the proof of Lemma 7) that there exists $L^{(1)}_{A,\gamma} > 0$ such that, for all $k \in \{0,\dots,D-1\}$,
$$P\left( \left| (P_n - P)\left(\psi_{1,m}\varphi_k\right) \right| \geq L^{(1)}_{A,\gamma} \sqrt{\frac{\ln n}{n}} \right) \leq n^{-\gamma}.$$
Now the result follows from a simple union bound with $\gamma = \alpha + 1$.
5.2 Proof of Theorem 5
Aiming to clarify the proofs, the arguments involved and the connection with the proofs exposed in [Sau12], we slightly generalize the Fourier framework by invoking along the proofs the three following assumptions, which are satisfied by Fourier expansions. From now on, $m \subset L_2(P_X)$ is a linear model of dimension $D$, not necessarily built from the Fourier basis.
Let us define a function $\Psi_m$ on $\mathcal{X}$, which we call the unit envelope of $m$, by
$$\Psi_m(x) = \frac{1}{\sqrt{D}} \sup_{s \in m,\, \|s\|_2 \leq 1} |s(x)|. \quad (34)$$
As $m$ is a finite-dimensional real vector space, the supremum in (34) can also be taken over a countable subset of $m$, so $\Psi_m$ is a measurable function.
(H3) The unit envelope of $m$ is uniformly bounded on $\mathcal{X}$: there exists a positive constant $A_{3,m}$ such that
$$\left\| \Psi_m \right\|_{\infty} \leq A_{3,m} < \infty.$$
In the Fourier case, (H3) is valid by taking $A_{3,m} = \sqrt{2}$. In fact, it is easy to see that assumption (H3) is equivalent to assumption (A1). Moreover, several technical lemmas derived in [Sau12] only assume the validity of (H3) and will thus be used without repeating their proofs.
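For an orthonormal basis, Cauchy-Schwarz gives the closed form $\Psi_m(x) = D^{-1/2} \left( \sum_k \varphi_k(x)^2 \right)^{1/2}$, so the unit envelope is easy to evaluate numerically. A sketch in the Fourier case (grid, dimension and helper names are illustrative choices of ours):

```python
import numpy as np

def unit_envelope(Phi):
    """Unit envelope from basis values; Phi has shape (n_points, D)."""
    return np.sqrt((Phi ** 2).sum(axis=1)) / np.sqrt(Phi.shape[1])

# First D = 8 Fourier basis functions evaluated on a fine grid of [0,1].
x = np.linspace(0.0, 1.0, 4097)
D = 8
cols = [np.ones_like(x)]
j = 1
while len(cols) < D:
    cols.append(np.sqrt(2) * np.cos(2 * np.pi * j * x))
    if len(cols) < D:
        cols.append(np.sqrt(2) * np.sin(2 * np.pi * j * x))
    j += 1
Phi = np.stack(cols, axis=1)

# (H3) in the Fourier case: the envelope is bounded by sqrt(2), uniformly in D.
print(unit_envelope(Phi).max())
```

Note that whenever $D$ is odd, the paired identity $\cos^2 + \sin^2 = 1$ even makes the Fourier envelope identically equal to 1 on the grid, comfortably below the bound $\sqrt{2}$.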
(H4) Uniformly bounded basis: there exists an orthonormal basis $\varphi = (\varphi_k)_{k=1}^{D}$ of $(m, \|\cdot\|_2)$ that satisfies, for a positive constant $u_m(\varphi)$,
$$\max_{k \in \{1,\dots,D\}} \left\| \varphi_k \right\|_{\infty} \leq u_m(\varphi).$$
Again, in the Fourier case, (H4) is valid by taking $u_m(\varphi) = \sqrt{2}$.
Remark 3 (H4) implies (H3), and in that case one can take $A_{3,m} = u_m(\varphi)$.
The assumption of consistency in sup-norm:
We assume that the least squares estimator is consistent for the sup-norm on the space $\mathcal{X}$. More precisely, this requirement can be stated as follows.
(H5) Assumption of consistency in sup-norm: for any $A_+ > 0$, if $m$ is a model of dimension $D$ satisfying
$$D \leq A_+ n^{1/2} (\ln n)^{-2},$$
then for every $\alpha > 0$, we can find a positive integer $n_1$ and a positive constant $A_{cons}$ satisfying the following property: there exists $R_{n,D,\alpha} > 0$, depending on $D$, $n$ and $\alpha$, such that
$$R_{n,D,\alpha} \leq \frac{A_{cons}}{\sqrt{\ln n}} \quad (35)$$
and, by setting
$$\Omega_{\infty,\alpha} = \left\{ \|\widehat{s} - s_m\|_{\infty} \leq R_{n,D,\alpha} \right\}, \quad (36)$$
it holds for all $n \geq n_1$,
$$P\left[ \Omega_{\infty,\alpha} \right] \geq 1 - n^{-\alpha}. \quad (37)$$
By Theorem 6, (H5) is verified with $R_{n,D,\alpha} \propto D \sqrt{\ln n / n}$.
In order to express the quantities of interest in the proof of Theorem 5, we need some preliminary definitions. Let $\alpha > 0$ be fixed and, for $R_{n,D,\alpha}$ defined in (H5), we set
$$\tilde{R}_{n,D,\alpha} = \max\left\{ R_{n,D,\alpha},\; A_{\infty} \sqrt{\frac{D \ln n}{n}} \right\} \quad (38)$$
where $A_{\infty}$ is a positive constant to be chosen later. Moreover, we set