Model selection for (auto-)regression with dependent data


URL:http://www.emath.fr/ps/


Yannick Baraud¹, F. Comte² and G. Viennet³

Abstract. In this paper, we study the problem of non parametric estimation of an unknown regression function from dependent data with sub-Gaussian errors. As a particular case, we handle the autoregressive framework. For this purpose, we consider a collection of finite dimensional linear spaces (e.g. linear spaces spanned by wavelets or piecewise polynomials on a possibly irregular grid) and we estimate the regression function by a least-squares estimator built on a data driven selected linear space among the collection. This data driven choice is performed via the minimization of a penalized criterion akin to Mallows' $C_p$. We state non asymptotic risk bounds for our estimator in some $L_2$-norm and we show that it is adaptive in the minimax sense over a large class of Besov balls of the form $B_{\alpha,p,\infty}(R)$ with $p \ge 1$.

Mathematics Subject Classification. 62G08, 62J02.

Received April 15, 1999. Revised July 20, 1999 and May 14, 2001.

1. Introduction

We consider here the problem of estimating the unknown function $f$ from $n$ observations $(Y_i, \vec X_i)$, $1 \le i \le n$, drawn from the regression model

$$Y_i = f(\vec X_i) + \varepsilon_i \tag{1.1}$$

where $(\vec X_i)_{1 \le i \le n}$ is a sequence of possibly dependent random vectors in $\mathbb{R}^k$ and the $\varepsilon_i$'s are i.i.d. unobservable real valued centered errors with variance $\sigma^2$. In particular, if $Y_i = X_i$ and $\vec X_i = (X_{i-1}, \dots, X_{i-k})'$ we recover the classical autoregressive framework of order $k$. In this paper, we measure the risk of an estimator via the expectation of some random $L_2$-norm based on the $\vec X_i$'s. More precisely, if $\hat f$ denotes some estimator of $f$, we define the risk of $\hat f$ by

$$E\big[d_n^2(f, \hat f)\big] = E\left[\frac{1}{n} \sum_{i=1}^n \big(f(\vec X_i) - \hat f(\vec X_i)\big)^2\right]$$

where for any functions $s, t$, $d_n^2(s, t)$ denotes the squared random distance $n^{-1} \sum_{i=1}^n (s(\vec X_i) - t(\vec X_i))^2$. We have in mind to estimate $f$ by means of some suitable least-squares estimator. For this purpose we introduce some finite

Keywords and phrases: Nonparametric regression, least-squares estimator, adaptive estimation, autoregression, mixing processes.

¹ École Normale Supérieure, DMA, 45 rue d'Ulm, 75230 Paris Cedex 05, France; e-mail: [email protected]

² Laboratoire de Probabilités et Modèles Aléatoires, Boîte 188, Université Paris 6, 4 place Jussieu, 75252 Paris Cedex 05, France.

³ Laboratoire de Probabilités et Modèles Aléatoires, Boîte 7012, Université Paris 7, 2 place Jussieu, 75251 Paris Cedex 05, France.

© EDP Sciences, SMAI 2001


collection of finite dimensional linear spaces $\{S_m, m \in \mathcal{M}_n\}$ (in the sequel, the $S_m$'s are called models) and we associate to each $S_m$ the least-squares estimator $\hat f_m$ of $f$ on it. Under suitable assumptions (in particular if the $\vec X_i$'s and the $\varepsilon_i$'s are independent sequences) the risk of $\hat f_m$ is equal to

$$E\big[d_n^2(f, S_m)\big] + \frac{\dim(S_m)}{n} \sigma^2.$$

The aim of this paper is to propose some suitable data driven selection procedure to select some $\hat m$ among $\mathcal{M}_n$ in such a way that the least-squares estimator $\hat f_{\hat m}$ performs almost as well as the best $\hat f_m$ over the collection (i.e. the one which has the smallest risk). The selection procedure that is considered is a penalized criterion of the following form:

$$\hat m = \arg\min_{m \in \mathcal{M}_n} \left[\frac{1}{n} \sum_{i=1}^n \big(Y_i - \hat f_m(\vec X_i)\big)^2 + \mathrm{pen}(m)\right]$$

where pen is a penalty function mapping $\mathcal{M}_n$ into $\mathbb{R}_+$. Of course the major problem is to determine such a penalty function in order to obtain a resulting estimator $\tilde f = \hat f_{\hat m}$ that performs almost as well as the best $\hat f_m$, i.e. such that the risk of $\tilde f$ achieves, up to a constant, the minimum of the risks over the collection $\mathcal{M}_n$. More precisely we show that one can find a penalty function such that

$$E\big[d_n^2(f, \tilde f)\big] \le C \inf_{m \in \mathcal{M}_n} \left[E\big[d_n^2(f, S_m)\big] + \frac{\dim(S_m) L_m}{n} \sigma^2\right] \tag{1.2}$$

where the $L_m$'s are related to the collection of models. If the collection of models is not too "rich" then the $L_m$'s can be chosen to be constants independent of $n$ and the right-hand side of (1.2) turns out to be the minimum of the risks (up to a multiplicative constant) among the collection of least-squares estimators that are considered.

In most cases the $L_m$'s are either constants or of order $\ln(n)$.

There have been many studies concerning model selection based on Mallows' [22] $C_p$ or related penalization criteria like Akaike's or the BIC criterion for regressive models (see Akaike [1, 2], Shibata [28, 29], Li [20], Polyak and Tsybakov [27], among many others). A common characteristic of all their results is their asymptotic feature. More recently, a general approach to model selection for various statistical frameworks including density estimation and regression has been developed in Barron et al. [7] with many applications to adaptive estimation.

An original feature of their viewpoint is its non asymptotic character. Unfortunately, their general approach imposes such restrictions on the regression Model (1.1) that it is hardly usable in practice. Following their ideas, Baraud [4, 5] has extended their results to more attractive situations involving realistic assumptions.

Baraud [4] is devoted to the study of fixed design regression while Baraud [5] considers Model (1.1) when all random variables $\vec X_i$'s and $\varepsilon_i$'s are independent, the $\varepsilon_i$'s being i.i.d. with a moment of order $p > 2$. Then Baraud et al. [6] relaxed the assumption of independence on the $\vec X_i$'s and the $\varepsilon_i$'s as well. Our approach here as well as in the previous papers remains non asymptotic. Although there have been many results concerning adaptation for the classical regression model with independent variables, to our knowledge, not much is known concerning general adaptation methods for non parametric regression involving dependent variables. It is not within the scope of this paper to make a historical review for the case of independent variables.

Concerning dependent variables, Modha and Masry [24] deal with the model given by (1.1) when the process $(\vec X_i, Y_i)_{i \in \mathbb{Z}}$ is strongly mixing. Their approach leads to sub-optimal rates of convergence. It is worth mentioning, for a one dimensional first order autoregressive model, the works of Neumann and Kreiss [26] and Hoffmann [16] which rely on the approximation of an AR(1) autoregression experiment by a regression experiment with independent variables. They study various non parametric adaptive estimators such as local polynomials and wavelet thresholding estimators. Modha and Masry [25] consider the problem of one step ahead prediction of real valued stationary exponentially strongly mixing processes. Minimum complexity regression estimators based on Legendre polynomials are used to estimate both the model memory and the predictor function. Again their approach does not lead to optimal rates of convergence, at least in the particular case of an autoregressive model.

Of course, this paper must be compared with our previous work (Baraud et al. [6]), where we had milder moment conditions on the errors (the $\varepsilon_i$'s must admit moments of order $p > 2$) but a stronger condition on the collection of models. Now we require the $\varepsilon_i$'s to be sub-Gaussian (typically, the $\varepsilon_i$'s are Gaussian or bounded) but we do not impose any assumption on our family of models (except for finiteness); it can be in particular as large as desired. Moreover, we no longer allow any dependency between the $\varepsilon_i$'s, but we can provide results for more general types of dependency for the $\vec X_i$'s, typically when some norm connections are fulfilled (i.e. on the set $\Omega_n$ defined by (3.6)). Any kind of dependency is permitted on the $\vec X_i$'s as soon as the $\vec X_i$'s and the $\varepsilon_i$'s are independent sequences of random variables. In the autoregressive framework, they are possibly arithmetically or geometrically $\beta$-mixing (the definitions are recalled below). Note that Baraud [5] gave the same kind of results in the independent framework under even milder conditions but assuming that the errors are Gaussian.

The techniques involved are appreciably different. We can also refer to Birgé and Massart [8] for a general study of the fixed design regression with Gaussian errors.

Let us now present our results briefly. One can find collections of models such that the estimator $\hat f_{\hat m}$ is adaptive in the minimax sense over some Besov balls $B_{\alpha,p,\infty}(R)$ with $p \ge 1$. Furthermore, in various statistical contexts, we also show that the estimator achieves the minimax rate of convergence although the underlying distribution of the $\vec X_i$'s is not assumed to be absolutely continuous with respect to the Lebesgue measure. For other estimators and in the case of independent data, such a result has been established by Kohler [18].

The paper is organized as follows: the general statistical framework is described in Section 2, and the main results are given under an Assumption $(H_\mu)$ in Section 3. Section 4 gives applications to minimax adaptive estimation in the case of wavelet bases. Section 5 is devoted to the study of condition $(H_\mu)$ in the case of independent sequences $\vec X_i$'s and $\varepsilon_i$'s or in the case of dependent sequences and ($\beta$-mixing) variables $\vec X_i$'s. Most proofs are gathered in Sections 6 to 9.

2. The estimation procedure

Let us recall that we observe pairs $(Y_i, \vec X_i)$, $i = 1, \dots, n$, arising from (1.1): $Y_i = f(\vec X_i) + \varepsilon_i$.

The $\vec X_i' = (X_{i,1}, \dots, X_{i,k})$'s are random variables with law $\mu_i$ and we set $\mu = n^{-1} \sum_{i=1}^n \mu_i$. The $\varepsilon_i$'s are independent centered random variables. The $\varepsilon_i$'s may be independent of the $\vec X_i$'s or not. In particular, we have in mind to handle the autoregressive case for which $Y_i = X_i$ and $\vec X_i = (X_{i-1}, \dots, X_{i-k})'$. Then the model can be written:

$$X_i = f(X_{i-1}, \dots, X_{i-k}) + \varepsilon_i, \quad i = 1, \dots, n. \tag{2.1}$$

Since we do not assume the $\varepsilon_i$'s to be bounded random variables, the law of the $\vec X_i$'s is supported by $\mathbb{R}^k$. Nevertheless we aim at providing a "good" estimator of the unknown function $f : \mathbb{R}^k \to \mathbb{R}$ only on some given compact set $A \subset \mathbb{R}^k$.

Let us now describe our estimation procedure. We consider a finite collection of finite dimensional linear spaces $\{S_m\}_{m \in \mathcal{M}_n}$ consisting of $A$-supported functions belonging to $L_2(A, \mu)$. In the sequel the linear spaces $S_m$'s are called models. To each model $S_m$, $m \in \mathcal{M}_n$, we associate the least-squares estimator of $f$, denoted by $\hat f_m$, which minimizes over $t \in S_m$ the least-squares contrast function $\gamma_n$ defined by

$$\gamma_n(t) = \frac{1}{n} \sum_{i=1}^n \big[Y_i - t(\vec X_i)\big]^2. \tag{2.2}$$


Then, given a suitable penalty function $\mathrm{pen}(\cdot)$, that is a nonnegative function on $\mathcal{M}_n$ depending only on the data and known parameters, we define $\hat m$ as the minimizer over $\mathcal{M}_n$ of $\gamma_n(\hat f_m) + \mathrm{pen}(m)$. This implies that the resulting Penalized Least Square Estimator (PLSE for short) $\tilde f = \hat f_{\hat m}$ satisfies for all $m \in \mathcal{M}_n$ and $t \in S_m$

$$\gamma_n(\tilde f) + \mathrm{pen}(\hat m) \le \gamma_n(t) + \mathrm{pen}(m). \tag{2.3}$$

The choice of a proper penalty function is the main concern of this paper since it determines the properties of the PLSE.

Throughout this paper, we denote by $\|\cdot\|$ the Hilbert norm associated to the Hilbert space $L_2(A, \mu)$ and for each $t \in L_2(A, \mu)$, $\|t\|_n^2$ denotes the random variable $n^{-1} \sum_{i=1}^n t^2(\vec X_i)$. For each $m \in \mathcal{M}_n$, $D_m$ denotes the dimension of $S_m$ and $f_m$ the $L_2(A, \mu)$-orthogonal projection of $f$ onto $S_m$. Moreover, we denote by $\mathbb{R}_+^*$ the set of positive real numbers and by $\nu$ the Lebesgue measure.
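The procedure of this section can be illustrated by a minimal Python sketch. It is not the authors' implementation: it assumes $k = 1$, $A = [0,1]$, models made of piecewise constant functions on regular partitions, a known variance proxy `s2`, and a penalty of the hypothetical form `theta * s2 * D_m / n` with `theta = 2.0` chosen arbitrarily for illustration (the universal constant of the theory is not specified here). The names `fit_histogram` and `select_model` are introduced only for this example.

```python
import numpy as np

def fit_histogram(x, y, d):
    """Least-squares fit on piecewise constants over d equal cells of [0, 1)."""
    cells = np.minimum((x * d).astype(int), d - 1)
    sums = np.bincount(cells, weights=y, minlength=d)
    counts = np.bincount(cells, minlength=d)
    # Cell means minimize the least-squares contrast; empty cells get 0.
    coef = np.divide(sums, counts, out=np.zeros(d), where=counts > 0)
    return coef, cells

def select_model(x, y, dims, s2, theta=2.0):
    """Minimize gamma_n(f_hat_m) + pen(m) over the candidate dimensions."""
    n = len(y)
    crit = {}
    for d in dims:
        coef, cells = fit_histogram(x, y, d)
        gamma = np.mean((y - coef[cells]) ** 2)   # contrast (2.2) at f_hat_m
        crit[d] = gamma + theta * s2 * d / n      # assumed penalty shape
    best = min(crit, key=crit.get)
    return best, crit

# Illustrative data (assumed): smooth f, Gaussian errors with variance 0.25.
rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(0.0, 1.0, n)
y = np.sin(2 * np.pi * x) + 0.5 * rng.standard_normal(n)
dims = [2 ** j for j in range(8)]
best, crit = select_model(x, y, dims, s2=0.25)
```

The selected dimension balances the approximation error, which decreases with `d`, against the penalty, which grows linearly in `d`; richer collections would require the weights $L_m$ discussed in Section 3.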

3. Main theorem

Our main result relies on the following assumption on the joint law of the $\vec X_i$'s and the $\varepsilon_i$'s:

$(H_{X,\varepsilon})$

(i) The $\varepsilon_i$'s are i.i.d. centered random variables that satisfy for all $u \in \mathbb{R}$

$$E[\exp(u \varepsilon_1)] \le \exp\left(\frac{u^2 s^2}{2}\right), \tag{3.1}$$

for some positive $s$.

(ii) For each $k \in \{1, \dots, n\}$, $\varepsilon_k$ is independent of the $\sigma$-field $\mathcal{F}_k = \sigma(\vec X_j, 1 \le j \le k)$.

Inequality (3.1) is fulfilled as soon as $\varepsilon_1$ is a centered random variable either Gaussian with variance $s^2 = \sigma^2$ or a.s. bounded by $s$. In the autoregressive model given by (2.1), Condition (ii) is satisfied.
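As a quick sanity check of (3.1) in the bounded case, the following Python sketch compares the moment generating function of a symmetric $\pm 1$ error (bounded by $s = 1$), which equals $\cosh(u)$, with the sub-Gaussian bound $\exp(u^2 s^2 / 2)$. The choice of a Rademacher error and the helper names are assumptions made for this illustration only.

```python
import math

def mgf_rademacher(u):
    # E[exp(u * eps)] when eps takes the values +1 and -1 with probability 1/2.
    return math.cosh(u)

def subgaussian_bound(u, s):
    # Right-hand side of (3.1).
    return math.exp(u * u * s * s / 2.0)

def satisfies_31(us, s=1.0):
    """Check inequality (3.1) on a finite grid of values of u."""
    return all(mgf_rademacher(u) <= subgaussian_bound(u, s) for u in us)

grid = [k / 10.0 for k in range(-200, 201)]
ok = satisfies_31(grid)
```

For a centered Gaussian error with variance $s^2$ the two sides of (3.1) coincide, so no numerical check is needed in that case.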

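Theorem 3.1. Let us consider Model (1.1) where $f$ is an unknown function belonging to $L_2(A, \mu)$ and the random variables $\varepsilon_i$'s and $\vec X_i$'s satisfy $(H_{X,\varepsilon})$. Set $f_A = f \mathbf{1}_A$, let $(L_m)_{m \in \mathcal{M}_n}$ be nonnegative numbers and set

$$\Sigma_n = \sum_{m \in \mathcal{M}_n} \exp(-L_m D_m). \tag{3.2}$$

There exists some universal constant $\vartheta$ such that if the penalty function is chosen to satisfy

$$\mathrm{pen}(m) \ge \vartheta s^2 \frac{D_m}{n} (1 + L_m) \quad \text{for all } m \in \mathcal{M}_n,$$

then the PLSE $\tilde f$ defined by

$$\tilde f = \hat f_{\hat m} \tag{3.3}$$

with

$$\hat m = \arg\min_{m \in \mathcal{M}_n} \left\{\frac{1}{n} \sum_{i=1}^n \big[Y_i - \hat f_m(\vec X_i)\big]^2 + \mathrm{pen}(m)\right\} \tag{3.4}$$

satisfies

$$E\big[\|f_A - \tilde f\|_n^2 \mathbf{1}_{\Omega_n}\big] \le C \inf_{m \in \mathcal{M}_n} \big[\|f_A - f_m\|^2 + \mathrm{pen}(m)\big] + C' s^2 \frac{\Sigma_n}{n} \tag{3.5}$$

where $C$ and $C'$ are universal constants and

$$\Omega_n = \left\{\omega \Big/ \left|\frac{\|t\|_n^2}{\|t\|^2} - 1\right| \le \frac{1}{2}, \ \forall t \in \bigcup_{m, m' \in \mathcal{M}_n} (S_m + S_{m'}) \setminus \{0\}\right\}. \tag{3.6}$$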

Comments

• For the proof of this result we use an exponential martingale inequality given by Meyer [23] and chaining arguments that can also be found in Barron et al. [7] to state exponential bounds on suprema of empirical processes.

• One can also define $\Omega_n$ by

$$\left\{\omega \Big/ \left|\frac{\|t\|_n^2}{\|t\|^2} - 1\right| \le \rho, \ \forall t \in \bigcup_{m, m' \in \mathcal{M}_n} (S_m + S_{m'}) \setminus \{0\}\right\}$$

for some $\rho$ chosen to be less than one; then (3.5) holds for some constant $C$ that now depends on $\rho$.

• A precise calibration of the penalty term (best choices of $\vartheta$ and $L_m$'s) can be determined by carrying out simulation experiments (see the related work for density estimation by Birgé and Rozenholc [9]).

• When the $\vec X_i$'s are random variables independent of the $\varepsilon_i$'s the indicator $\mathbf{1}_{\Omega_n}$ can be removed in (3.5) (see Sect. 5). We emphasize that in this case no assumption on the type of dependency between the $\vec X_i$'s is required.
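To make the event $\Omega_n$ concrete, here is a hedged Python sketch for one linear space $S$. If $(\varphi_1, \dots, \varphi_D)$ is an orthonormal basis of $S$ in $L_2(A, \mu)$ and $G_n$ is the empirical Gram matrix with entries $n^{-1} \sum_i \varphi_j(X_i) \varphi_l(X_i)$, then the supremum over nonzero $t \in S$ of $|\|t\|_n^2 / \|t\|^2 - 1|$ equals the largest $|\lambda - 1|$ over the eigenvalues $\lambda$ of $G_n$. The sketch assumes $k = 1$, $A = [0,1]$, $\mu$ equal to the Lebesgue measure on $[0,1]$ and regular piecewise constant functions; the function name `omega_deviation` is hypothetical.

```python
import numpy as np

def omega_deviation(x, D):
    """Largest value of | ||t||_n^2 / ||t||^2 - 1 | over nonzero t in the
    space of piecewise constants on D equal cells of [0, 1), assuming mu is
    the Lebesgue measure on [0, 1]."""
    n = len(x)
    cells = np.minimum((x * D).astype(int), D - 1)
    phi = np.zeros((n, D))
    phi[np.arange(n), cells] = np.sqrt(D)   # basis orthonormal in L2([0,1])
    gram = phi.T @ phi / n                  # empirical Gram matrix G_n
    eig = np.linalg.eigvalsh(gram)
    return float(np.max(np.abs(eig - 1.0)))

# A perfectly balanced design gives deviation 0; a design concentrated on
# half of the interval leaves empty cells and violates the bound 1/2 in (3.6).
balanced = (np.arange(1000) + 0.5) / 1000
dev_balanced = omega_deviation(balanced, 10)
dev_half = omega_deviation(np.linspace(0.0, 0.49, 500), 4)
```

When all the spaces $S_m + S_{m'}$ are contained in one largest space, checking the deviation on that space is enough to decide whether the realization belongs to $\Omega_n$.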

Below, we present a useful corollary which makes the performance of $\tilde f$ more precise when $\Omega_n$ (as defined by (3.6)) is known to occur with high probability. Indeed, assume that:

$(H_\mu)$ There exists $\ell > 1$ such that $P(\Omega_n^c) \le C_\ell / n^\ell$,

then the following result holds:

Corollary 3.1. Let us consider Model (1.1) where $f$ is an unknown function belonging to $L_2(A, \mu) \cap L_\infty(A, \mu)$. Under the Assumptions of Theorem 3.1 and $(H_\mu)$, the PLSE $\tilde f$ defined by (3.3) satisfies

$$E\big[\|f_A - \tilde f\|_n^2\big] \le C \inf_{m \in \mathcal{M}_n} \big[\|f_A - f_m\|^2 + \mathrm{pen}(m)\big] + C' s^2 \frac{\Sigma_n}{n} + C'' \frac{\|f_A\|_\infty^2 + s^2}{n} \tag{3.7}$$

where $C$ and $C'$ are universal constants, and $C''$ depends on $C_\ell$ and $\ell$ only.

The constants $C$ and $C'$ in Corollary 3.1 are the same as those in Theorem 3.1. The proof of Corollary 3.1 is deferred to Section 6. We shall then see that if $S_m$ contains the constant functions then $\|f_A\|_\infty^2$ can be replaced by $\|f_A - \int f_A \, d\mu\|_\infty^2$. Comments on Condition $(H_\mu)$ are to be found in Section 5.

4. Adaptation in the minimax sense

Throughout this section we take $k = 1$ for the sake of simplicity and since we aim at estimating $f$ on some compact set, with no loss of generality we can assume that $A = [0, 1]$.

4.1. Two examples of collection of models

This section presents two collections of models which are frequently used for estimation: piecewise polynomials and compactly supported wavelets. In the sequel, $J_n$ denotes some positive integer.


(P) Let $\mathcal{M}_n$ be the set of pairs $(d, \{b_0 = 0 < b_1 < \dots < b_{d-1} < b_d = 1\})$ when $d$ varies among $\{1, \dots, 2^{J_n}\}$ and $\{b_0 = 0 < b_1 < \dots < b_{d-1} < b_d = 1\}$ among the dyadic knots $N_j / 2^{J_n}$ with $N_j \in \mathbb{N}$. For each $m = (m_1, m_2) \in \mathcal{M}_n$ we define $S_m$ as the linear span generated by the piecewise polynomials of degree less than $r$ based on the dyadic knots given by $m_2$. More precisely, if $m_1 = d$ and $m_2 = \{b_0 = 0 < b_1 < \dots < b_{d-1} < b_d = 1\}$ then $S_m$ consists of all the functions of the form

$$t = \sum_{j=1}^d P_j \mathbf{1}_{[b_{j-1}, b_j[},$$

where the $P_j$'s are polynomials of degree less than $r$. Note that $\dim(S_m) = r m_1$. We denote by $S_n$ the linear space $S_m$ corresponding to the choice $m_1 = 2^{J_n}$ and $m_2 = \{j / 2^{J_n}, j = 0, \dots, 2^{J_n}\}$. Since $\dim(S_n) = r 2^{J_n}$, we impose the natural constraint $r 2^{J_n} \le n$.

By choosing for all $m \in \mathcal{M}_n$ $L_m = \ln(n/r)/r$, $\Sigma_n$ defined by (3.2) remains bounded by a constant that is free from $n$. Indeed for each $d \in \{1, \dots, 2^{J_n}\}$,

$$|\{m \in \mathcal{M}_n / m_1 = d\}| = C_{2^{J_n} - 1}^{d - 1} \le C_{2^{J_n}}^{d},$$

where $C_k^d$ denotes the binomial coefficient $\binom{k}{d}$. Thus,

$$\sum_{m \in \mathcal{M}_n} e^{-L_m D_m} \le \sum_{d=1}^{2^{J_n}} C_{2^{J_n}}^{d} e^{-\ln(n/r) d} \le \big(1 + \exp(-\ln(n/r))\big)^{2^{J_n}} \le \exp\big((n/r) \exp(-\ln(n/r))\big) = e$$

using that $2^{J_n} \le n/r$.

(W) For all integers $j$ let $\Lambda(j)$ be the set $\{(j, k), k = 1, \dots, 2^j\}$. Let us consider the $L_2$-orthonormal system of compactly supported wavelets of regularity $r$,

$$\{\phi_{J_0,k}, (J_0, k) \in \Lambda(J_0)\} \cup \{\varphi_{j,k}, (j, k) \in \cup_{J=J_0}^{+\infty} \Lambda(J)\},$$

built by Cohen et al. [10]; for a precise description and use, see Donoho and Johnstone [13]. These new functions derive from Daubechies' [11] wavelets at the interior of $[0, 1]$ and are boundary corrected at the "edges". For some positive integer $J_n$, let $S_n$ be the linear span of the $\phi_{J_0,k}$'s for $(J_0, k) \in \Lambda(J_0)$ together with the $\varphi_{j,k}$'s for $(j, k) \in \bar\Lambda_n = \cup_{J=J_0}^{J_n - 1} \Lambda(J)$. We have that $\dim(S_n) = 2^{J_0} + \sum_{j=J_0}^{J_n - 1} |\Lambda(j)| = 2^{J_n} \le n$ if $J_n \le \log_2(n)$.

We take $\mathcal{M}_n = \mathcal{P}(\bar\Lambda_n)$ ($\mathcal{P}(A)$ denotes the power set of the set $A$) and for each $m \in \mathcal{M}_n$, define $S_m$ as the linear space generated by the $\phi_{J_0,k}$'s for $(J_0, k) \in \Lambda(J_0)$ and the $\varphi_{j,k}$'s for $(j, k) \in m$.

We choose $L_m = \ln(n)$ in order to bound $\Sigma_n$ by a constant that does not depend on $n$:

$$\sum_{m \in \mathcal{M}_n} e^{-L_m D_m} \le \sum_{D=1}^{2^{J_n}} C_{2^{J_n}}^{D} e^{-\ln(n) D} \le \big(1 + \exp(-\ln(n))\big)^{2^{J_n}} \le \exp\big(n \exp(-\ln(n))\big) = e$$

using that $2^{J_n} \le n$.
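The two bounds on $\Sigma_n$ can be checked numerically for small values of $J_n$. The Python sketch below evaluates the sums exactly from the model counts given above; it assumes, for collection (W), that $D_m = 2^{J_0} + |m|$, which follows from the description of $S_m$. The function names and the numerical values of `n`, `r`, `J0`, `J` are illustrative assumptions, and the sketch is limited to moderate sizes so that the binomial coefficients fit in floating point.

```python
import math

def sigma_n_P(n, r, J):
    """Sigma_n for collection (P) with L_m = ln(n/r)/r and D_m = r*d:
    sum over d of C(2^J - 1, d - 1) * exp(-ln(n/r) * d)."""
    N = 2 ** J
    x = math.exp(-math.log(n / r))
    return sum(math.comb(N - 1, d - 1) * x ** d for d in range(1, N + 1))

def sigma_n_W(n, J0, J):
    """Sigma_n for collection (W) with L_m = ln(n) and D_m = 2^J0 + |m|,
    where m ranges over all subsets of a set with 2^J - 2^J0 elements."""
    size = 2 ** J - 2 ** J0
    x = math.exp(-math.log(n))
    return sum(math.comb(size, k) * x ** (2 ** J0 + k) for k in range(size + 1))

val_P = sigma_n_P(n=1000, r=2, J=8)    # r * 2^J = 512 <= n
val_W = sigma_n_W(n=1000, J0=2, J=9)   # 2^J = 512 <= n
```

Both values stay far below the bound $e$, which reflects how crude the binomial bound is for these particular sizes.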


4.2. Two results about adaptation in the minimax sense

For $p \ge 1$ and $\alpha > 0$, we set

$$|t|_{\alpha,p} = \sup_{y > 0} y^{-\alpha} w_d(t, y)_p, \quad d = [\alpha] + 1,$$

$$|t|_\infty = \sup_{x, y \in [0,1]} |t(x) - t(y)|$$

where $w_d(t, \cdot)_p$ denotes the modulus of smoothness of $t$. For a precise definition of those notions, we refer to DeVore and Lorentz [12], Chapter 2, Section 7. We recall that a function $t$ belongs to the Besov space $B_{\alpha,p,\infty}([0,1])$ if $|t|_{\alpha,p} < \infty$.

In this section we show how an adequate choice of the collection of models leads to an estimator $\tilde f$ that is adaptive in the minimax sense (up to a constant) over Besov bodies of the form

$$B_{\alpha,p,\infty}(R_1, R_2) = \{t \in B_{\alpha,p,\infty}(A) / |t|_{\alpha,p} \le R_1, \ |t|_\infty \le R_2\}$$

with $p \ge 1$. In a related regression framework, the case $p \ge 2$ was considered in Baraud et al. [6] and it is shown there that weak moment conditions on the $\varepsilon_i$'s are sufficient to obtain such estimators. We shall take advantage here of the strong integrability assumption on the $\varepsilon_i$'s to extend the result to the case where $p \in [1, 2[$. The PLSE defined by (3.3) with the collections (W) or (P) described in Section 4.1 (and the corresponding $L_m$'s) achieves the minimax rates up to a $\ln(n)$ factor. The extra $\ln(n)$ factor is due to the fact that those collections are "too big" for the problem at hand. In the sequel, we exhibit a subcollection of models (W') out of (W) which has the property to be both "small" enough to avoid the $\ln(n)$ factor in the convergence rate and "big" enough to allow the PLSE to be rate optimal. The choice of this subcollection comes from the compression algorithm field and we refer to Birgé and Massart [8] for more details. It is also proved there how to obtain a suitable collection from piecewise polynomials instead of wavelets.

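For $a > 2$ and $x \in (0, 1)$, let us set

$$K_j = \big[L(2^{J-j}) 2^J\big] \quad \text{and} \quad L(x) = \left(1 - \frac{\ln x}{\ln 2}\right)^{-a}, \tag{4.1}$$

where $[x]$ denotes the integer part of $x$, and

$$L(a) = 1 + \sum_{j=0}^{+\infty} \frac{1 + (a + \ln(2)) j}{(1 + j)^a}. \tag{4.2}$$

Then we define the new collection of models (we take the notations used in the description of collection (W)) by:

(W') For $J \in \{J_0, \dots, J_n - 1\}$, let

$$\mathcal{M}_n^J = \left\{\left[\bigcup_{j=J_0}^{J-1} \Lambda(j)\right] \bigcup \left[\bigcup_{j=J}^{J_n - 1} m_j\right], \ m_j \subset \Lambda(j), \ |m_j| = K_j\right\}$$

and set $\mathcal{M}_n = \bigcup_{J=J_0}^{J_n - 1} \mathcal{M}_n^J$. For $m \in \mathcal{M}_n$, we define $S_m$ as the linear span of the $\phi_{J_0,k}$'s for $(J_0, k) \in \Lambda(J_0)$ together with the $\varphi_{j,k}$'s for $(j, k) \in m$.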


For each $J \in \{J_0, \dots, J_n - 1\}$ and $m \in \mathcal{M}_n^J$,

$$2^J \le D_m = 2^J + \sum_{j=J}^{J_n - 1} K_j \le 2^J \left(1 + \sum_{j=1}^{+\infty} j^{-a}\right). \tag{4.3}$$

Hence, for each $J$, the linear spaces belonging to the collection $\{S_m, m \in \mathcal{M}_n^J\}$ have their dimension of order $2^J$. Besides, it will be shown in Section 8 that the space $\cup_{m \in \mathcal{M}_n^J} S_m$ has good (nonlinear) approximation properties with respect to functions belonging to inhomogeneous Besov spaces.
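The dimension bound (4.3) can be illustrated with a short Python sketch. It uses the fact that $L(2^{J-j}) = (1 + j - J)^{-a}$, so that $K_j = [2^J (1 + j - J)^{-a}]$. The values of `a` and `Jn` and the function names are assumptions chosen for illustration, and the infinite series in (4.3) is replaced by a long partial sum, which is still an upper bound for the finite sum actually involved.

```python
import math

def K(j, J, a):
    """K_j = [L(2^(J-j)) * 2^J] with L(x) = (1 - ln(x)/ln(2))^(-a), see (4.1)."""
    return math.floor(2 ** J * (1 + j - J) ** (-a))

def dimension(J, Jn, a):
    """D_m = 2^J + sum of K_j for j = J, ..., Jn - 1, for any m in M_n^J."""
    return 2 ** J + sum(K(j, J, a) for j in range(J, Jn))

def upper_bound(J, a, terms=10000):
    """2^J * (1 + partial sum of j^(-a)); valid whenever Jn - J <= terms."""
    return 2 ** J * (1.0 + sum(j ** (-a) for j in range(1, terms + 1)))

a, Jn = 3.0, 12
dims = {J: dimension(J, Jn, a) for J in range(2, Jn)}
```

For instance, with `a = 3.0` the ratio `dims[J] / 2**J` stays between 1 and about 2.2 for every level `J`, in line with the statement that the dimension is of order $2^J$.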

We give a first result under the assumption that µ is absolutely continuous with respect to the Lebesgue measure on [0,1].

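Proposition 4.1. Assume that $(H_\mu)$ and $(H_{X,\varepsilon})$ hold and that $\mu$ admits a density with respect to the Lebesgue measure on $[0, 1]$ that is bounded from above by some constant $h_1$. Consider the collection of models (W') with $J_n$ such that $2^{J_n} \ge \Gamma n / \ln^b(n)$ for some $b > 0$ and $\Gamma > 0$. Let $p \in [1, +\infty]$ and set

$$\left(\frac{1}{p} - \frac{1}{2}\right)_+ \le \alpha_p = \begin{cases} \dfrac{1}{2} \left(\dfrac{1}{p} - \dfrac{1}{2}\right) \left[1 + \sqrt{\dfrac{2 + 3p}{2 - p}}\right] & \text{if } p < 2 \\ 0 & \text{else.} \end{cases}$$

If $\alpha_p < \alpha \le r$ then $\forall (R_1, R_2) \in \mathbb{R}_+^* \times \mathbb{R}_+^*$, the PLSE defined by (3.3) with $L_m = L(a)$ for all $m \in \mathcal{M}_n$ satisfies

$$\sup_{f \in B_{\alpha,p,\infty}(R_1, R_2)} E\big[\|f - \tilde f\|_n^2\big] \le C_1 n^{-\frac{2\alpha}{2\alpha + 1}} \tag{4.4}$$

where $C_1$ depends on $\alpha$, $a$, $s$, $h_1$, $R_1$, $R_2$, $b$ and $\Gamma$.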

We now relax the assumption that $\mu$ is absolutely continuous with respect to the Lebesgue measure.

Proposition 4.2. Assume that $(H_\mu)$ and $(H_{X,\varepsilon})$ hold. Consider the collection of models (W') with $J_n$ such that $2^{J_n} \ge \Gamma n / \ln^b(n)$ for some $b > 0$ and $\Gamma > 0$. Let $p \in [1, +\infty]$ and set

$$\alpha'_p = \frac{1 + \sqrt{2p + 1}}{2p}.$$

If $\alpha'_p < \alpha \le r$ then $\forall (R_1, R_2) \in \mathbb{R}_+^* \times \mathbb{R}_+^*$, the PLSE defined by (3.3) with $L_m = L(a)$ for all $m \in \mathcal{M}_n$ satisfies

$$\sup_{f \in B_{\alpha,p,\infty}(R_1, R_2)} E\big[\|f - \tilde f\|_n^2\big] \le C_2 n^{-\frac{2\alpha}{2\alpha + 1}} \tag{4.5}$$

where $C_2$ depends on $\alpha$, $a$, $s$, $R_1$, $R_2$, $b$ and $\Gamma$.

Equations (4.4) and (4.5) hold for $R_2 = +\infty$ if the left-hand-side term is replaced by

$$\sup_{f \in B_{\alpha,p,\infty}(R_1, +\infty)} E\big[\|f - \tilde f\|_n^2 \mathbf{1}_{\Omega_n}\big],$$

i.e. no assumption on $\|f\|_\infty$ is required provided that the indicator function $\mathbf{1}_{\Omega_n}$ is added.

We shall see in Section 5 that Condition $(H_\mu)$ need not be assumed to hold when the sequences $(\vec X_i)_{i=1,\dots,n}$ and $(\varepsilon_i)_{i=1,\dots,n}$ are independent. Moreover in this case one can assume $R_2$ to be infinite. The proofs of Propositions 4.1 and 4.2 are deferred to Section 8.


5. Study of $\Omega_n$ and condition $(H_\mu)$

In this section, we study $\Omega_n$ and we give sufficient conditions for $(H_\mu)$ to hold. For this purpose, we examine various dependency structures for the joint law of the $\vec X_i$'s and the $\varepsilon_i$'s.

5.1. Case of independent sequences $(\vec X_i)_{i=1,\dots,n}$ and $(\varepsilon_i)_{i=1,\dots,n}$

We start with the case of deterministic $\vec X_i$'s. In this context it is clear from the definition of $\Omega_n$ that $P(\Omega_n) = 1$. Thus the indicator $\mathbf{1}_{\Omega_n}$ can be removed in (3.5). More precisely under the assumptions of Theorem 3.1 we have that for some universal constants $C$ and $C'$

$$E\big[\|f_A - \tilde f\|_n^2\big] \le C \inf_{m \in \mathcal{M}_n} \big[\|f_A - f_m\|_n^2 + \mathrm{pen}(m)\big] + C' s^2 \frac{\Sigma_n}{n}. \tag{5.1}$$

If the sequences $(\vec X_i)_{i=1,\dots,n}$ and $(\varepsilon_i)_{i=1,\dots,n}$ are independent then, by conditioning on the $\vec X_i$'s, (5.1) holds and it is enough to average over the $\vec X_i$'s to recover (3.5) where the indicator of $\Omega_n$ is removed. In conclusion, in this context, Inequality (3.7) holds for any function $f \in L_2(A, \mu)$ with $C'' = 0$. Let us emphasize again that in this case no assumption on the type of dependency of the $\vec X_i$'s is required.

5.2. Case of $\beta$-mixing $\vec X_i$'s

The next proposition presents some dependency situations where Assumption $(H_\mu)$ is fulfilled: more precisely, we can check this assumption when the variables are geometrically or arithmetically $\beta$-mixing. We refer to Kolmogorov and Rozanov [19] for a precise definition of $\beta$-mixing and to Ibragimov [17], Volkonskii and Rozanov [31] or Doukhan [14] for examples. A sequence of random vectors is said to be geometrically $\beta$-mixing if the decay of their $\beta$-mixing coefficients, $(\beta_k)_{k \ge 0}$, is exponential, that is if there exist two positive numbers $M$ and $\theta$ such that $\beta_k \le M e^{-\theta k}$ for all $k \ge 0$. The sequence is said to be arithmetically $\beta$-mixing if the decay is hyperbolic, that is if there exist two positive numbers $M$ and $\theta$ such that $\beta_k \le M k^{-\theta}$ for all $k > 0$.

Since our results are expressed in terms of the $\mu$-norm, we introduce a condition ensuring that there exists a connection between this norm and the $\nu$-norm. We recall that $\nu$ denotes the Lebesgue measure.

(C1): The restriction of $\mu$ to the set $A$ admits a density $h_X$ w.r.t. the Lebesgue measure such that $0 < h_0 \le h_X \le h_1$, where $h_0$ and $h_1$ are some fixed constants chosen such that $h_0 \le 1 \le h_1$.

A typical situation where (C1) is satisfied is once again the autoregressive model (2.1): in the particular case where $k = 1$ and where the stationary distribution $\mu_\varepsilon$ of the $\varepsilon_i$'s is equivalent to the Lebesgue measure, it follows from Duflo [15] that the variables $X_i$'s admit a density $h_X$ w.r.t. the Lebesgue measure on $\mathbb{R}$ which satisfies:

$$h_X(y) = \int h_\varepsilon[y - f(x)] h_X(x) \, dx.$$

Then $h_X$ is a continuous function and since $A$ is compact, there exist two constants $h_0 > 0$ and $h_1 \ge 1$ such that $h_0 \le h_X(x) \le h_1$, $\forall x \in A$.

Proposition 5.1. Assume that (C1) holds.

(i) If the process $(\vec X_i)$ is geometrically $\beta$-mixing with constants $M$ and $\theta$ and if $\dim(S_n) \le n / \ln^3(n)$ then $(H_\mu)$ is satisfied for the collections (P) and (W) with $\ell = 2$ and $C_\ell = C(M, \theta, h_0, h_1)$.

(ii) If the process $(\vec X_i)$ is arithmetically $\beta$-mixing with constants $M$ and $\theta > 12$ and if $\dim(S_n) \le n^{1 - 3/\theta} / \ln(n)$ then $(H_\mu)$ is satisfied for the collections (P) and (W) with $\ell = 2$ and $C_\ell = C(M, \theta, h_0, h_1)$.

Proof. The result derives from Claim 5 in Baraud et al. [6] with $\rho = 1/2$: (4.23) is fulfilled with $\Psi(n) = \ln^2(n)$ in case (i) and $\Psi(n) = n^{3/\theta}$ in case (ii).

Comments

• Under suitable conditions on the function $f$ the process $(X_i)_{i \ge 1 - k}$ generated by the autoregressive model (2.1) is stationary and geometrically $(M, \theta)$-mixing. More precisely, the classical condition is (see Doukhan [14], Th. 7, p. 102):

$(H^\star)$ (i) The $\varepsilon_i$'s are independent and independent of the initial variables $X_0, \dots, X_{-k+1}$.

(ii) There exist nonnegative constants $a_1, \dots, a_k$ and positive constants $c_0$ and $c_1$ such that $|f(x)| \le \sum_{i=1}^k a_i |x_i| - c_1$ if $\max_{i=1,\dots,k} |x_i| > c_0$ and the unique nonnegative real zero $x_0$ of the polynomial $P(z) = z^k - \sum_{i=1}^k a_i z^{k-i}$ satisfies $x_0 < 1$. Moreover, the Markov chain $(\vec X_i)$ is irreducible with respect to the Lebesgue measure on $\mathbb{R}^k$.

In particular, the irreducibility condition for the Markov chain $(\vec X_i)$ is satisfied as soon as $\mu_\varepsilon$ is equivalent to the Lebesgue measure.

• Examples of arithmetically mixing processes corresponding to the autoregressive model (2.1) can be found in Ango Nze [3].

6. Proof of Theorem 3.1 and Corollary 3.1

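In order to detail the steps of the proofs, we demonstrate consecutive claims. From now on we fix some $m \in \mathcal{M}_n$ to be chosen at the end of the proof.

Claim 1: We have

$$\|f_A - \tilde f\|_n^2 \le \|f_A - f_m\|_n^2 + \frac{2}{n} \sum_{i=1}^n \varepsilon_i (\tilde f - f_m)(\vec X_i) + \mathrm{pen}(m) - \mathrm{pen}(\hat m). \tag{6.1}$$

Proof. Starting from (2.3) we know that $\gamma_n(\tilde f) - \gamma_n(f_m) \le \mathrm{pen}(m) - \mathrm{pen}(\hat m)$ and since $\gamma_n(\tilde f) - \gamma_n(f_m) = \|f - \tilde f\|_n^2 - \|f - f_m\|_n^2 - 2 n^{-1} \sum_{i=1}^n \varepsilon_i (\tilde f - f_m)(\vec X_i)$, the claim is proved for $f_A$ replaced by $f$, namely

$$\|f - \tilde f\|_n^2 \le \|f - f_m\|_n^2 + \frac{2}{n} \sum_{i=1}^n \varepsilon_i (\tilde f - f_m)(\vec X_i) + \mathrm{pen}(m) - \mathrm{pen}(\hat m). \tag{6.2}$$

Noticing that if $t$ is an $A$-supported function then $\|f - t\|_n^2 = \|f \mathbf{1}_{A^c}\|_n^2 + \|f_A - t\|_n^2$ and applying this identity to $t = \tilde f$ and $t = f_m$, we obtain the claim from (6.2) after simplification by $\|f \mathbf{1}_{A^c}\|_n^2$.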

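Recall that $\Omega_n$ is defined by equation (3.6), and for each $m' \in \mathcal{M}_n$, let

$$G_1(m') = \sup_{t \in B_{m'}} \frac{1}{n} \sum_{i=1}^n \varepsilon_i t(\vec X_i),$$

where $B_{m'} = \{t \in S_m + S_{m'} / \|t\| \le 1\}$.

The key of Theorem 3.1 relies on the following proposition which is proved in Section 7.

Proposition 6.1. Under $(H_{X,\varepsilon})$, for all $m' \in \mathcal{M}_n$

$$E\Big[\big(G_1^2(m') - (p_1(m') + p_2(m))\big)_+ \mathbf{1}_{\Omega_n}\Big] \le 1.6 \kappa s^2 \frac{e^{-L_{m'} D_{m'}}}{n}$$

where $p_1(m') = \kappa s^2 D_{m'} (1 + L_{m'}) / n$, $p_2(m) = \kappa s^2 D_m / n$ and $\kappa$ is a universal constant (that can be taken to be 38).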

Next, we show

Claim 2: There exists a universal constant $C$ such that

$$C^{-1} E\big[\|f_A - \tilde f\|_n^2 \mathbf{1}_{\Omega_n}\big] \le \|f_A - f_m\|^2 + \mathrm{pen}(m) + s^2 \frac{\Sigma_n}{n}.$$


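Proof. From Claim 1 we deduce

$$\|f_A - \tilde f\|_n^2 \le \|f_A - f_m\|_n^2 + 2 \|\tilde f - f_m\| G_1(\hat m) + \mathrm{pen}(m) - \mathrm{pen}(\hat m). \tag{6.3}$$

On $\Omega_n$, we can ensure that $\|\tilde f - f_m\| \le \sqrt{2} \|\tilde f - f_m\|_n$, therefore the following inequalities hold

$$\begin{aligned} 2 \|\tilde f - f_m\| G_1(\hat m) &\le 2 \sqrt{2} \|\tilde f - f_m\|_n G_1(\hat m) \le \frac{1}{4} \|\tilde f - f_m\|_n^2 + 8 G_1^2(\hat m) \\ &\le \frac{1}{4} \big(\|\tilde f - f_A\|_n + \|f_A - f_m\|_n\big)^2 + 8 G_1^2(\hat m) \\ &\le \frac{1}{2} \big(\|\tilde f - f_A\|_n^2 + \|f_A - f_m\|_n^2\big) + 8 G_1^2(\hat m). \end{aligned} \tag{6.4}$$

Combining (6.3) and (6.4) leads on $\Omega_n$ to

$$\begin{aligned} \|f_A - \tilde f\|_n^2 &\le \|f_A - f_m\|_n^2 + \frac{1}{2} \|f_A - \tilde f\|_n^2 + \frac{1}{2} \|f_A - f_m\|_n^2 + \mathrm{pen}(m) + 8 G_1^2(\hat m) - \mathrm{pen}(\hat m) \\ &\le \|f_A - f_m\|_n^2 + \frac{1}{2} \|f_A - \tilde f\|_n^2 + \frac{1}{2} \|f_A - f_m\|_n^2 + \mathrm{pen}(m) + 8 p_2(m) \\ &\quad + 8 \big(G_1^2(\hat m) - (p_1(\hat m) + p_2(m))\big)_+ + 8 p_1(\hat m) - \mathrm{pen}(\hat m). \end{aligned} \tag{6.5}$$

By taking $\vartheta \ge 8 \kappa$, we have $\mathrm{pen}(m') \ge 8 p_1(m')$ for all $m' \in \mathcal{M}_n$ and $8 p_2(m) \le \mathrm{pen}(m)$. Thus we derive from (6.5)

$$\frac{1}{2} \|f_A - \tilde f\|_n^2 \mathbf{1}_{\Omega_n} \le \frac{3}{2} \|f_A - f_m\|_n^2 + 2 \mathrm{pen}(m) + 8 \big(G_1^2(\hat m) - (p_1(\hat m) + p_2(m))\big)_+ \mathbf{1}_{\Omega_n},$$

and by taking the expectation on both sides of this inequality we get

$$\frac{1}{2} E\big[\|f_A - \tilde f\|_n^2 \mathbf{1}_{\Omega_n}\big] \le \frac{3}{2} \|f_A - f_m\|^2 + 2 \mathrm{pen}(m) + 8 \sum_{m' \in \mathcal{M}_n} E\Big[\big(G_1^2(m') - (p_1(m') + p_2(m))\big)_+ \mathbf{1}_{\Omega_n}\Big].$$

We conclude by using Proposition 6.1 and (3.2), and by choosing $m$ among $\mathcal{M}_n$ to minimize $m' \mapsto \|f_A - f_{m'}\|^2 + \mathrm{pen}(m')$. This ends the proof of Theorem 3.1 with $C = 4$ and $C' = 16 \times 1.6 \kappa$.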

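For the proof of Corollary 3.1, we introduce the notation $\Pi_{\hat m}$ for the orthogonal projector (with respect to the usual inner product of $\mathbb{R}^n$) onto the $\mathbb{R}^n$-subspace $\{(t(\vec X_1), \dots, t(\vec X_n))' / t \in S_{\hat m}\}$. It follows from the definition of the least-squares estimator that $(\tilde f(\vec X_1), \dots, \tilde f(\vec X_n))' = \Pi_{\hat m} Y$. Denoting in the same way the function $t$ and the vector $(t(\vec X_1), \dots, t(\vec X_n))'$, we see that $\|f_A - \tilde f\|_n^2 = \|f_A - \Pi_{\hat m} f_A\|_n^2 + \|\Pi_{\hat m} \varepsilon\|_n^2 \le \|f_A\|_n^2 + n^{-1} \sum_{i=1}^n \varepsilon_i^2$. Thus,

$$E\big[\|f_A - \tilde f\|_n^2 \mathbf{1}_{\Omega_n^c}\big] \le \|f_A\|_\infty^2 P(\Omega_n^c) + \frac{1}{n} \sum_{i=1}^n E\big[\varepsilon_i^2 \mathbf{1}_{\Omega_n^c}\big].$$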


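Let now $x$ and $y$ be positive constants to be chosen later. By a truncation argument we have

$$\begin{aligned} E\big[\varepsilon_i^2 \mathbf{1}_{\Omega_n^c}\big] &\le x^2 P(\Omega_n^c) + E\big[\varepsilon_i^2 \mathbf{1}_{|\varepsilon_i| > x} \mathbf{1}_{\Omega_n^c}\big] \\ &\le x^2 P(\Omega_n^c) + E\big[\varepsilon_i^2 e^{y |\varepsilon_i| - y x} \mathbf{1}_{|\varepsilon_i| > x} \mathbf{1}_{\Omega_n^c}\big] \\ &\le x^2 P(\Omega_n^c) + 2 y^{-2} e^{-y x} E\big[e^{2 y |\varepsilon_i|} \mathbf{1}_{\Omega_n^c}\big] \end{aligned}$$

by using in the last inequality that for all $u > 0$, $u^2 e^u / 2 \le e^{2u}$. Now by $(H_{X,\varepsilon})$ together with Hölder's inequality (we set $\bar\ell^{-1} = 1 - \ell^{-1}$) we have

$$E\big[e^{2 y |\varepsilon_i|} \mathbf{1}_{\Omega_n^c}\big] \le E^{1/\bar\ell}\big[e^{2 y \bar\ell |\varepsilon_i|}\big] P^{1/\ell}(\Omega_n^c) \le 2^{1/\bar\ell} e^{2 y^2 \bar\ell s^2} P^{1/\ell}(\Omega_n^c).$$

Thus we deduce that

$$E\big[\|f_A - \tilde f\|_n^2 \mathbf{1}_{\Omega_n^c}\big] \le \big(\|f_A\|_\infty^2 + x^2\big) P(\Omega_n^c) + 2^{1 + 1/\bar\ell} y^{-2} e^{2 y^2 \bar\ell s^2 - y x} P^{1/\ell}(\Omega_n^c).$$

We now choose $x = 2 \sqrt{\bar\ell} s$ and $y = 1/x$ and under $(H_\mu)$ we get

$$E\big[\|f_A - \tilde f\|_n^2 \mathbf{1}_{\Omega_n^c}\big] \le \Big[\big(\|f_A\|_\infty^2 + 4 \bar\ell s^2\big) C_\ell + 2^{3 + 1/\bar\ell} e^{-1/2} C_\ell^{1/\ell} \bar\ell s^2\Big] \frac{1}{n}.$$

The proof of Corollary 3.1 is completed by combining this inequality with the result of Claim 2.

Moreover, if for all $m \in \mathcal{M}_n$, $\mathbf{1} \in S_m$ then we notice that all along the proof, $f$ can be replaced by $f + c = g$ where $c$ is a given constant. Indeed, in this case, $g_m = f_m + c$, $\hat g_m = \hat f_m + c$, so that $f - f_m = g - g_m$ and $f - \hat f_m = g - \hat g_m$. If we choose $c = -\int f_A \, d\mu$, we find the same result with $\|f_A\|_\infty$ replaced by $\|f_A - \int f_A \, d\mu\|_\infty$ in the last inequality.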

7. Proof of Proposition 6.1

7.1. A key lemma

To prove the proposition we use the following lemma which is inspired by a work on exponential inequalities for martingales due to Meyer [23] (Prop. 4, p. 168).

Lemma 7.1. Assume that Condition $(H_{X,\varepsilon})$ holds, then for any positive numbers $\epsilon$, $v$ we have:

$$P\left[\sum_{i=1}^n \varepsilon_i t(\vec X_i) \ge n \epsilon, \ \|t\|_n^2 \le v^2\right] \le \exp\left(-\frac{n \epsilon^2}{2 s^2 v^2}\right). \tag{7.1}$$

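Proof. Let $M_n = \sum_{i=1}^n \varepsilon_i t(\vec X_i)$, $M_0 = 1$ and $\mathcal{G}_n$ the $\sigma$-field generated by the $\varepsilon_i$'s for $i < n$ and the $\vec X_i$'s for $i \le n$. Note that $E(M_n) = 0$. For each $\lambda > 0$ we have

$$P\big[M_n \ge n \epsilon, \ \|t\|_n^2 \le v^2\big] \le \exp\big(-\lambda n \epsilon + n v^2 s^2 \lambda^2 / 2\big) \, E\big[\exp\big(\lambda M_n - \lambda^2 n \|t\|_n^2 s^2 / 2\big)\big].$$

Let

$$Q_n = \exp\left(\lambda M_n - \frac{1}{2} \lambda^2 s^2 n \|t\|_n^2\right) = \exp\left(\lambda M_n - \frac{1}{2} \lambda^2 s^2 \sum_{i=1}^n t^2(\vec X_i)\right).$$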