DATA DRIVEN MODEL SELECTION FOR SAME-REALIZATION PREDICTIONS IN AUTOREGRESSIVE PROCESSES


HAL Id: hal-03169343

https://hal.archives-ouvertes.fr/hal-03169343

Preprint submitted on 15 Mar 2021


Kare Kamila

To cite this version:

Kare Kamila. DATA DRIVEN MODEL SELECTION FOR SAME-REALIZATION PREDICTIONS IN AUTOREGRESSIVE PROCESSES. 2021. ⟨hal-03169343⟩


BY Kare KAMILA∗

March 15, 2021

SAMM, Université Paris 1 Panthéon-Sorbonne, FRANCE

Abstract

This paper deals with the one-step-ahead prediction of observations drawn from an infinite-order autoregressive AR(∞) process. Its aim is to design fully data-driven penalties ensuring that the selected model satisfies the efficiency property in the non-asymptotic framework. We present an oracle inequality with a leading constant equal to one. Moreover, we show that the excess risk of the selected estimator enjoys the best bias-variance trade-off over the considered collection. To achieve these results, we overcome the dependence difficulties by following a classical approach, which consists in restricting to a set where the empirical covariance matrix is equivalent to the theoretical one. We show that this event happens with probability larger than $1 - c_0/n^3$ with $c_0 > 0$. The proposed data-driven criteria are based on the minimization of a penalized criterion akin to Mallows's $C_p$. Monte Carlo experiments are performed to illustrate the obtained results.

Key words: Model selection, oracle inequality, efficiency, autoregressive process, data driven.

1 Introduction

Consider observations $(X_1, X_2, \ldots, X_n)$ arising from a trajectory of the process

$$X_t = f^*\big((X_{t-i})_{i \in \mathbb{N}^*}\big) + \sigma\, \xi_t \quad \text{for any } t \in \mathbb{Z}, \tag{1.1}$$

where $(\xi_t)_{t \in \mathbb{Z}}$ is a sequence of zero-mean independent and identically distributed random variables (i.i.d.r.v.) satisfying $E(|\xi_0|^4) < \infty$, $f^* : \mathbb{R}^{\mathbb{N}} \to \mathbb{R}$ is a measurable function and $\sigma > 0$ is an unknown constant.

The problem is to estimate the function $f^*$ using these observations. The process (1.1) is a particular case of the general class of affine causal processes studied in [10] and [4]. The study of this type of process usually requires a classical regularity condition on the function $f^*$, which is not restrictive and remains valid in various time series models. This condition can be stated as follows:

$$\sum_{k=1}^{\infty} \sup_{x \in \mathbb{R}^{\infty}} \left| \frac{\partial}{\partial x_k} f^*(x) \right| < 1, \tag{1.2}$$

∗ This author has received funding from the European Union’s Horizon 2020 research and innovation


provided that $f^*$ admits partial derivatives on $\mathbb{R}^{\mathbb{N}}$. Under (1.2) and if the noise $\xi_0$ admits $r$-order moments, [10] showed that there exists a stationary, mixing and ergodic solution to (1.1) admitting $r$-order moments.

Moreover, [4] studied the consistency and the asymptotic normality of the QMLE of $\theta^* = (\theta_i^*)_{i \in \mathbb{N}}$ in the case $f^* = f_{\theta^*}$.

In this paper, we focus only on processes with a regression function $f_{\theta^*}$ that is linear with respect to the past and depends on some parameter $\theta^* \in \mathbb{R}^{\mathbb{N}}$; that is,

$$f^*(X_{t-1}, X_{t-2}, \ldots) = f_{\theta^*}(X_{t-1}, X_{t-2}, \ldots) = \sum_{i=1}^{\infty} \theta_i^* X_{t-i}. \tag{1.3}$$

For such processes, condition (1.2) becomes

$$\textbf{A1}: \quad \sum_{i=1}^{\infty} |\theta_i^*| < 1.$$
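As a quick check of how (1.2) reduces to A1 in the linear case, the partial derivatives of $f_{\theta^*}$ can be computed explicitly. The short derivation below is a sketch that uses only (1.3) and assumes that the supremum in (1.2) is taken over the absolute value of each partial derivative:

```latex
\frac{\partial}{\partial x_k} f_{\theta^*}(x)
  = \frac{\partial}{\partial x_k} \sum_{i=1}^{\infty} \theta_i^* x_i
  = \theta_k^*
\quad\Longrightarrow\quad
\sum_{k=1}^{\infty} \sup_{x \in \mathbb{R}^{\infty}}
  \left| \frac{\partial}{\partial x_k} f_{\theta^*}(x) \right|
  = \sum_{k=1}^{\infty} |\theta_k^*| < 1 .
```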

Even if this condition slightly reduces the set of parameters, the class of AR(∞) processes satisfying condition A1 is rich and of practical importance, because it contains almost all invertible causal ARMA(p, q) processes and is very useful for prediction given the past. Moreover, contrary to the autocovariance of ARMA(p, q) processes, which decays exponentially fast, AR(∞) processes are able to model more complex behaviour such as a slower decay of the covariance structure.

Henceforth, let the observations $(X_1, X_2, \ldots, X_n)$ be a trajectory of the solution $X := (X_t)_{t \in \mathbb{Z}}$ of (1.1) satisfying A1. The goal of this paper is to predict the next value $X_{n+1}$. If $\theta^*$ were known, a simple prediction of $X_{n+1}$ would be $f_{\theta^*}(X_n, X_{n-1}, \ldots)$, setting $X_t = 0$ for all $t < 0$. However, $\theta^*$ is generally unknown and it is impossible to provide a direct estimator since it has infinitely many coordinates. It is classical to identify a 'good' finite-dimensional model based on the data, which can be done by sieve estimation, where only a finite number of coefficients $\{\theta_i^*\}_{i=1}^{K}$ is estimated and $K$ grows as the sample size increases. A usual approach to this is model selection, and the goal is to provide a model with a prediction error as small as the oracle's.
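The plug-in prediction described above, with unobserved past values set to zero, can be sketched as follows. This is only an illustration: the function name `predict_next`, the coefficient sequence and the sample values are hypothetical and are not taken from the paper.

```python
def predict_next(x, theta):
    """One-step prediction sum_i theta[i-1] * X_{n+1-i}, with X_t = 0 before the sample."""
    n = len(x)
    # theta[i-1] multiplies X_{n+1-i}, which is stored at x[n - i]; terms outside the sample are dropped.
    return sum(th * x[n - i] for i, th in enumerate(theta, start=1) if n - i >= 0)


# Hypothetical coefficients theta_i = 0.5 * 2**(-i): their absolute sum is below 0.5 < 1 (A1).
theta = [0.5 * 2.0 ** (-i) for i in range(1, 30)]
x = [0.3, -1.2, 0.8, 0.1]          # hypothetical observations X_1, ..., X_4
x_hat = predict_next(x, theta)     # prediction of X_5
```

In practice `theta` is unknown, which is why the rest of the paper replaces it by a least-squares estimate in a selected finite-order model.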

This question has already been addressed in the literature. [17] was the first to tackle this issue. He proved that the Akaike criterion is asymptotically efficient, in the sense that the selected model achieves a smaller one-step mean squared error of prediction when it is fitted to predict an independent realization of the same process. Following Shibata's asymptotic setting, [13] and [15] extended this result to same-realization predictions. Indeed, they argued that Shibata's idea of fitting the model to another independent realization is unrealistic, since in practice only one data set is at hand. The common feature of these works is their asymptotic framework.

Meanwhile, several authors have studied this question in the non-asymptotic regime. [11], in the nonparametric framework, studied how well a Gaussian process admitting an AR(∞) representation can be approximated by a finite-order AR model.

[2] and [3] analyzed a similar, though slightly different, question, as their observations arise from an autoregressive model of order $k$. They proved an oracle inequality under several conditions, for instance a compactly supported basis of the regression function. Moreover, they assume that the process is β-mixing, which is commonly assumed but quite hard to verify in practice. For linear processes, τ-mixing is more suitable since its coefficients can be easily computed (see [7]) and bounded by a function of the model parameter $\theta^*$ (see [10]). In this work, we do not assume any mixing property of the process, since condition A1 implies the τ-mixing property (see [10]), and we will see that the decay rate of the τ-mixing coefficients is bounded by the decay rate of the coefficients $\theta^* = (\theta_i^*)_{i \in \mathbb{N}}$.

Based on the above and following a model selection approach, our purpose in this work is to design adaptive penalties such that the selected model mimics the oracle when observations arise from an AR(∞) process under mild conditions, including the existence of moments of all orders for the noise and a decay rate for the coefficients $(\theta_i^*)_{i \in \mathbb{N}}$, so that, thanks to a result by [10], the generating process has nice properties such as stationarity and τ-mixing. The main contributions of this paper include:

(i) Using least squares contrast, we have shown an oracle inequality with a leading constant equal to one.

(ii) We have also proved that the excess risk of the selected estimator enjoys the best bias-variance trade-off over the considered collection.

The paper is organized as follows. The model selection approach along with preliminary results is described in Section 2. The main results are presented in Section 3. Finally, numerical results are presented in Section 4, and Section 5 contains the proofs.

2 Model Selection Approach and Preliminary Results

2.1 Model Selection Approach

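Let $S_m$ (shortly $m$), a model for $f^*$, be the set of linear functions $f$ from $\mathbb{R}^{D_m}$ to $\mathbb{R}$ such that

$$f(x_1, x_2, \ldots, x_{D_m}) = \sum_{i=1}^{D_m} \theta_i x_i, \tag{2.1}$$

with $\theta = (\theta_1, \ldots, \theta_{D_m}) \in \Theta_m$ and $\Theta_m$ a compact set of $\mathbb{R}^{D_m}$. $S_m$ can be viewed as an AR($D_m$) model.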

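Given a predictor $f_\theta \in S_m$, its quality is measured by the quadratic loss

$$R(\theta) = E\big[(X_{n+1} - f_\theta^{n+1})^2\big],$$

where $f_\theta^{n} = f_\theta(X_{n-1}, \ldots, X_{n-D_m})$. The Bayes predictor, which minimizes $R(\theta)$ over the set of all predictors, is clearly the inaccessible function $f_{\theta^*}$. Let us then introduce the excess loss of the predictor $f_\theta$ (with respect to $f_{\theta^*}$):

$$\ell(\theta, \theta^*) := R(\theta) - R(\theta^*) = E\big[(f_{\theta^*}^{n+1} - f_\theta^{n+1})^2\big] \ge 0.$$

Given a model $m$, we define its best predictor $f_{\theta_m^*}$ by

$$\theta_m^* = \operatorname*{argmin}_{\theta \in \Theta_m} R(\theta).$$

Its empirical version, minimizing the least-squares contrast, is

$$\hat\theta_m = \operatorname*{argmin}_{\theta \in \Theta_m} \gamma_n(\theta) \quad \text{where} \quad \gamma_n(\theta) = \frac{1}{n} \sum_{t=1}^{n} (X_t - f_\theta^t)^2. \tag{2.2}$$

Note that (thanks to the stationarity of the process), for an estimate $\hat\theta = \hat\theta(X_1, \ldots, X_n)$, the excess loss can be rewritten as

$$\ell(\hat\theta, \theta^*) = E\big[\|F_{\hat\theta} - F_{\theta^*}\|_n^2\big], \tag{2.3}$$

where $F_\theta := (f_\theta^1, \ldots, f_\theta^n)^\top$ and $\|x\|_n^2 = \frac{1}{n} \sum_{t=1}^{n} x_t^2$.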


Given that all the models which can be considered must have finite dimensions for fixed n, making all Sm wrong models, it is classical to let the dimension of competitive models grow with the number of observations. This will help reduce the excess loss and provide a better approximation of fθ∗.

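Let $\mathcal{M}_n$ be a countable collection of hierarchical models $S_m$ and let $K_n$ be the dimension of the largest model in $\mathcal{M}_n$, satisfying $|\mathcal{M}_n| \le K_n < n$. We follow the classical approach of model selection, which consists in minimizing the penalized least-squares criterion. Let pen: $\mathcal{M}_n \to \mathbb{R}_+$ be a penalty function, possibly data-dependent, and define

$$\hat m = \operatorname*{argmin}_{m \in \mathcal{M}_n} \{C(m)\} \quad \text{with} \quad C(m) := \gamma_n(\hat\theta_m) + \mathrm{pen}(S_m). \tag{2.4}$$

The best possible choice over $\mathcal{M}_n$ is $m^*$, the so-called oracle, defined as

$$m^* \in \operatorname*{arg\,inf}_{m \in \mathcal{M}_n} \ell(\hat\theta_m, \theta^*). \tag{2.5}$$

The oracle $m^*$ is unachievable since it depends on $\theta^*$ and on the distribution $P_{(X_1, \ldots, X_n)}$, which are unknown. However, we hope to select a model $\hat m$ such that $\ell(\hat\theta_{\hat m}, \theta^*)$ is as close as possible to $\ell(\hat\theta_{m^*}, \theta^*)$.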

The goal of this paper is twofold. First, we want to propose a data-driven penalty in order to obtain an oracle inequality

$$\ell(\hat\theta_{\hat m}, \theta^*) \le C_1 \inf_{m \in \mathcal{M}_n} \ell(\hat\theta_m, \theta^*) + \frac{C_2}{n} \tag{2.6}$$

with the leading constant $C_1$ close to one and $C_2 > 0$.

Since, for every $m \in \mathcal{M}_n$, the following decomposition holds,

$$\ell(\hat\theta_m, \theta^*) = \ell(\theta_m^*, \theta^*) + \ell(\hat\theta_m, \theta_m^*) =: \mathrm{Bias}(m) + \mathrm{Variance}(m), \tag{2.7}$$

the inequality (2.6) implies that the excess risk of the selected estimator $\hat\theta_{\hat m}$ realizes the best bias-variance trade-off, which would make our penalty an ideal choice in terms of excess risk. That is to say, the selected model $\hat m$ will be large enough to reduce its bias, but not too large, in order to avoid a high variance.

Moreover, in decomposition (2.7), the bias term generally cannot be simplified further, whereas the variance is approximately proportional to the size of the model. Our second goal is to simplify the oracle inequality (2.6) in order to obtain the following common inequality

$$\ell(\hat\theta_{\hat m}, \theta^*) \le C_1' \inf_{m \in \mathcal{M}_n} \big\{ \ell(\theta_m^*, \theta^*) + \mathrm{pen}(S_m) \big\} + \frac{C_2'}{n} \tag{2.8}$$

with the leading constant $C_1' = 1 + \delta$, where $\delta > 0$ is close to 0, and $C_2' > 0$.

2.2 Notations

We will use the following norms:

• $\|\cdot\|$ denotes the usual Euclidean norm on $\mathbb{R}^\nu$, with $\nu \ge 1$;

• $\|A\|_{op}$ is the operator norm of $A$, defined as the square root of the largest eigenvalue of $A^\top A$. If $A$ is symmetric, then $\|A\|_{op}$ is the largest (in absolute value) eigenvalue of $A$;

• if $X$ is an $\mathbb{R}^\nu$-valued random variable and $r \ge 1$, we set $\|X\|_r = \big(E\|X\|^r\big)^{1/r}$.

2.3 Preliminary Results

As we are in a dependence setting, we leverage the τ-mixing property of $(X_t)_{t \in \mathbb{Z}}$ in order to obtain some exponential inequalities. The τ-mixing coefficients are a measure of the dependence of the process and were introduced by [9]. This will help us build 'independent' random vectors and apply classical exponential inequalities. Let us first introduce some notation.

Let $(\Omega, \mathcal{C}, P)$ be a probability space, $\mathcal{M}$ a σ-subalgebra of $\mathcal{C}$ and $Z$ a random variable with values in a Banach space $(E, \|\cdot\|_E)$. Assume that $E\|Z\|_E < \infty$ and define

$$\tau^{(p)}(\mathcal{M}, Z) = \left\| \sup_{f \in \Lambda(E)} \left\{ \left| \int f(x)\, P_{Z|\mathcal{M}}(dx) - \int f(x)\, P_Z(dx) \right| \right\} \right\|_p,$$

where $\Lambda(E)$ is the set of 1-Lipschitz functions, i.e. the functions $f$ from $(E, \|\cdot\|_E)$ to $\mathbb{R}$ such that $|f(x) - f(y)| \le \|x - y\|_E$.

Using the definition of τ, we measure the dependence of the strictly stationary sequence $(Z_t)_{t \in \mathbb{Z}}$ through the coefficients defined as follows. For any $s \ge 0$, introducing the norm $\|x - y\|_{\mathbb{R}^k} = |x_1 - y_1| + \cdots + |x_k - y_k|$, setting $\mathcal{M}_i = \sigma(Z_t, t \le i)$ and if $E(|Z_1|) < \infty$, let

$$\tau^{(p)}_{Z,\infty}(s) = \sup_{l > 0} \left\{ \max_{1 \le k \le l} \frac{1}{k} \sup \left\{ \tau^{(p)}\big(\mathcal{M}_i, (Z_{i_1}, \ldots, Z_{i_k})\big),\ i + s \le i_1 < \cdots < i_k \right\} \right\}.$$

Finally, the time series $(Z_t)_{t \in \mathbb{Z}}$ is $\tau^{(p)}_{Z,\infty}$-weakly dependent when its coefficients $\tau^{(p)}_{Z,\infty}(s)$ tend to 0 as $s$ tends to infinity.

The next proposition, which is a consequence of Theorem 3.1 in [10], gives a link between the τ-mixing coefficients of the process $(X_t)_{t \in \mathbb{Z}}$ and the coefficients $\theta_i^*$ of the model (1.3).

Proposition 1. Assume A1 holds. If $|\theta_t^*| = O(t^{-\gamma})$ with $\gamma > 1$, there exists a τ-weakly dependent stationary solution of (1.1) and a constant $C_\tau > 0$ such that for $r > 0$

$$\tau^{(2)}_{X,\infty}(r) \le C_\tau \left( \frac{\log r}{r} \right)^{\gamma - 1}. \tag{2.9}$$

Proof. With $G(x, \xi_0) = \sigma \xi_0 + f_{\theta^*}(x)$ for any $x \in \mathbb{R}^\infty$, it holds that

$$\|G(x, \xi_0) - G(y, \xi_0)\|_2 = |f_{\theta^*}(x) - f_{\theta^*}(y)| \le \sum_{i=1}^{\infty} |\theta_i^*|\, |x_i - y_i|.$$

Therefore (2.9) is a straightforward application of Theorem 3.1 in [10].

As we are going to need independence for blocks of random variables, let us denote, for $t = 1, \ldots, n$, the random vector $\vec X_t := (X_{t-1}, \ldots, X_{t-K_n})^\top$. One can see that the process $(\vec X_t)_{t \in \mathbb{Z}}$ is also mixing, with $\tau^{(1)}_{\vec X, \infty}$ upper bounded by $K_n\, \tau^{(1)}_{X,\infty}$ (see Lemma 1).

Now, we construct random variables approximating the $\vec X_t$'s and enjoying the independence-by-block property. Let $s_n, q_n$ be two integers such that $n = 2 s_n q_n$. We are going to build $2 s_n$ blocks of length $q_n$ so that the even-index blocks are independent, and so are the odd-index blocks.

For $k = 0, \ldots, s_n - 1$, let us denote $A_k = (\vec X_{2kq_n+1}, \ldots, \vec X_{(2k+1)q_n})$ and $B_k = (\vec X_{(2k+1)q_n+1}, \ldots, \vec X_{(2k+2)q_n})$.
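To make the block layout concrete, the sketch below lists the time indices falling into each block $A_k$ and $B_k$. The helper name `block_indices` and the values n = 24, s_n = 3 are hypothetical and serve only as an illustration; the code requires n to be a multiple of 2 s_n, as in the text.

```python
def block_indices(n, s_n):
    """Return (A, B): lists of time-index blocks, assuming n = 2 * s_n * q_n."""
    if n % (2 * s_n) != 0:
        raise ValueError("n must be a multiple of 2 * s_n")
    q_n = n // (2 * s_n)
    # A_k covers t = 2 k q_n + 1, ..., (2k + 1) q_n
    a_blocks = [list(range(2 * k * q_n + 1, (2 * k + 1) * q_n + 1)) for k in range(s_n)]
    # B_k covers t = (2k + 1) q_n + 1, ..., (2k + 2) q_n
    b_blocks = [list(range((2 * k + 1) * q_n + 1, (2 * k + 2) * q_n + 1)) for k in range(s_n)]
    return a_blocks, b_blocks


a_blocks, b_blocks = block_indices(n=24, s_n=3)  # q_n = 4
```

Two consecutive blocks of the same family are separated by a gap of q_n time steps, which is what allows the mixing coefficients evaluated at lag q_n to control the approximation by independent blocks.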


Proposition 2. Let $(X_t)_{t \in \mathbb{Z}}$ be the stationary mixing process obtained in Proposition 1. Let also $s_n, q_n, A_k, B_k$ be defined as above for $k = 0, \ldots, s_n - 1$. There exist random vectors $A_k^* = (\vec X^*_{2kq_n+1}, \ldots, \vec X^*_{(2k+1)q_n})$ and $B_k^* = (\vec X^*_{(2k+1)q_n+1}, \ldots, \vec X^*_{(2k+2)q_n})$ such that:

1. For $k = 0, \ldots, s_n - 1$, $A_k^*$ has the same law as $A_k$, and $B_k^*$ has the same law as $B_k$.

2. The random vectors $(A_k^*)_{0 \le k \le s_n - 1}$ are independent, and so are the vectors $(B_k^*)_{0 \le k \le s_n - 1}$.

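To prove the oracle inequality, we will assume some constraints on the observations.

A2: $X_t$ is sub-Gaussian with variance proxy $\sigma_0^2 > 0$, i.e.

$$E[e^{\lambda X_t}] \le e^{\lambda^2 \sigma_0^2 / 2} \quad \text{for any } \lambda > 0.$$

Condition A2 implies that the vector $Z_t^m = (X_{t-1}, \ldots, X_{t-D_m})^\top$, which will be prominent in the proofs, is sub-Gaussian with variance proxy $D_m \sigma_0^2$. Indeed, for any $v \in \mathbb{R}^{D_m}$ such that $\|v\| = 1$,

$$E\big[\exp(\lambda\, v^\top Z_t^m)\big] = E\Big[\prod_{i=1}^{D_m} \exp(\lambda v_i X_{t-i})\Big] \le \prod_{i=1}^{D_m} \big\| \exp(\lambda v_i X_{t-i}) \big\|_{D_m} \le \prod_{i=1}^{D_m} \exp\big(\lambda^2 D_m \sigma_0^2 v_i^2 / 2\big) = e^{\frac{\lambda^2}{2} D_m \sigma_0^2},$$

where the first inequality follows from Hölder's inequality and the second from A2.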

The following assumption provides a sufficient condition to ensure the invertibility of both $\hat\Sigma_m$ and $\Sigma_m$.

A3: For any $f_\theta \in S_m$, $\langle \alpha, \partial_\theta f_\theta \rangle = 0$ a.s. $\Longrightarrow \alpha = 0$.

This condition means that the columns of the matrix $M_m$ are linearly independent. We will also need to bound the eigenvalues of the matrices $\Sigma_m$ for any $m \in \mathcal{M}_n$. To do so, we leverage the relation between the spectral density of the process and these eigenvalues. Let us denote by $r$ the covariance function, $r(h) := E[X_t X_{t+h}]$ for any integer $h$. Let us also introduce the function $g : [-\pi, \pi[ \to \mathbb{C}$ such that for any $\lambda$,

$$g(\lambda) = \frac{1}{2\pi} \sum_{h \in \mathbb{Z}} r(h)\, e^{-ih\lambda},$$

which exists under A1 with $|\theta_t^*| = O(t^{-\gamma})$ where $\gamma \ge 1$. Therefore, $r$ is the inverse transform of $g$ and $r(h) = \int_{-\pi}^{\pi} e^{ih\lambda} g(\lambda)\, d\lambda$ for any $h \in \mathbb{Z}$. We will assume that

A4: There exists a constant $a > 0$ such that $\inf_{-\pi \le \lambda < \pi} g(\lambda) \ge a$.

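This is a very weak assumption, and we now give the value of $a$ for an AR($p$) process with $p \in \mathbb{N}^*$. Denote $\theta^*(z) = 1 - \sum_{j=1}^{p} \theta_j^* z^j$; it is well known for such a process that

$$g(\lambda) = \frac{\sigma^2}{2\pi\, |\theta^*(e^{-i\lambda})|^2}.$$

For instance, for $p$ equal to one and $X_t = \theta_1^* X_{t-1} + \sigma \xi_t$ with $|\theta_1^*| < 1$, it follows that

$$g(\lambda) = \frac{\sigma^2}{2\pi\, |1 - \theta_1^* e^{-i\lambda}|^2} = \frac{\sigma^2}{2\pi \big(1 - 2\theta_1^* \cos(\lambda) + (\theta_1^*)^2\big)},$$

and then it is simple to see that

$$a := \frac{\sigma^2}{2\pi\, (1 + |\theta_1^*|)^2} \le g(\lambda) \le \frac{\sigma^2}{2\pi\, (1 - |\theta_1^*|)^2}.$$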
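The AR(1) bounds above are easy to check numerically. The following sketch evaluates the spectral density on a grid and compares it with the two bounds; the value 0.6 for the coefficient, σ = 1 and the grid size are hypothetical choices made only for illustration.

```python
import math


def ar1_spectral_density(lam, theta1, sigma=1.0):
    """g(lambda) = sigma^2 / (2 pi |1 - theta1 e^{-i lambda}|^2) for an AR(1) process."""
    return sigma ** 2 / (2 * math.pi * (1 - 2 * theta1 * math.cos(lam) + theta1 ** 2))


theta1, sigma = 0.6, 1.0
lower = sigma ** 2 / (2 * math.pi * (1 + abs(theta1)) ** 2)  # the constant a of A4
upper = sigma ** 2 / (2 * math.pi * (1 - abs(theta1)) ** 2)

# Grid over [-pi, pi); it contains lambda = -pi and (up to rounding) lambda = 0.
grid = [-math.pi + 2 * math.pi * k / 1000 for k in range(1000)]
values = [ar1_spectral_density(lam, theta1, sigma) for lam in grid]
```

For a positive coefficient the lower bound is attained at λ = −π and the upper bound at λ = 0, so both bounds are sharp in this case.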

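For $p \ge 1$ and $X_t = \sum_{j=1}^{p} \theta_j^* X_{t-j} + \sigma \xi_t$ satisfying $\sum_{j=1}^{p} \theta_j^* < 1$ and $\theta_j^* \ge 0$, we have

$$g(\lambda) = \frac{\sigma^2}{2\pi\, \big|1 - \sum_{j=1}^{p} \theta_j^* e^{-ij\lambda}\big|^2} = \sigma^2 (2\pi)^{-1} \left( 1 + \sum_{j=1}^{p} (\theta_j^*)^2 - 2 \sum_{j=1}^{p} \theta_j^* \cos(j\lambda) + 2 \sum_{k=1}^{p-1} \theta_k^* \Big\{ \sum_{j=k+1}^{p} \theta_j^* \cos\big((j-k)\lambda\big) \Big\} \right)^{-1}.$$

Thus, using $-1 \le \cos(x) \le 1$ for any real $x$, it follows for every $\lambda$ that

$$\sigma^2 (2\pi)^{-1} \left( 1 + \sum_{j=1}^{p} (\theta_j^*)^2 + 2 \sum_{j=1}^{p} \theta_j^* + 2 \sum_{k=1}^{p-1} \theta_k^* \Big\{ \sum_{j=k+1}^{p} \theta_j^* \Big\} \right)^{-1} \le g(\lambda) \le \sigma^2 (2\pi)^{-1} \left( 1 + \sum_{j=1}^{p} (\theta_j^*)^2 - 2 \sum_{j=1}^{p} \theta_j^* - 2 \sum_{k=1}^{p-1} \theta_k^* \Big\{ \sum_{j=k+1}^{p} \theta_j^* \Big\} \right)^{-1}.$$

For such an AR($p$) process, one can take the constant $a$ in A4 to be equal to

$$a = \sigma^2 (2\pi)^{-1} \left( 1 + \sum_{j=1}^{p} (\theta_j^*)^2 + 2 \sum_{j=1}^{p} \theta_j^* + 2 \sum_{k=1}^{p-1} \theta_k^* \Big\{ \sum_{j=k+1}^{p} \theta_j^* \Big\} \right)^{-1}.$$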

We can now state an important intermediate result, which provides uniform lower and upper bounds on the spectral norm of the matrices $\Sigma_m$.

Proposition 3. Under A1 with $|\theta_t^*| = O(t^{-\gamma})$ where $\gamma \ge 2$, we have for any $m \in \mathcal{M}_n$

$$\|\Sigma_m\|_{op} \le \pi^{-1} \sum_{i=0}^{\infty} \big| E[X_0 X_i] \big| < \infty. \tag{2.10}$$

Moreover, under A3-A4, it holds that

$$\|\Sigma_m^{-1}\|_{op} \le 1/a. \tag{2.11}$$

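Now, for technical convenience, we choose the integer $s_n$ and the cardinality of $\mathcal{M}_n$ such that

$$\textbf{A5}: \quad \frac{a\, s_n}{2} \min\left\{ \left( \frac{r \wedge 1}{2^6 \sigma_0^2 K_n} \right)^2, \frac{r \wedge 1}{2^7 \sigma_0^2 K_n} \right\} \ge 3 \log n, \tag{2.12}$$

where $r := a / E[X_0^2]$ and $b \wedge c$ is the minimum of $b$ and $c$. This means that $s_n$ is of the form $s_n = C \log n$ where

$$C \ge 6\, a^{-1} \max\left\{ \left( \frac{2^6 \sigma_0^2 K_n}{r \wedge 1} \right)^2, \frac{2^7 \sigma_0^2 K_n}{r \wedge 1} \right\}.$$

When $a \approx E[X_0^2] \approx \sigma_0^2$, we can choose $s_n = 6\, (2)^{12} (\log n)^3$. Henceforth, $K_n$ will satisfy

$$\textbf{A6}: \quad K_n = C_K \log n \quad \text{for some constant } C_K > 0. \tag{2.13}$$

Let us introduce some additional notation. From the definition of the LSE (2.2), it follows that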

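$$\hat\theta_m = \hat\Sigma_m^{-1} M_m^\top X, \tag{2.14}$$

where the matrix $M_m = (X_{i-1}, \ldots, X_{i-D_m})_{i=1}^{n}$, $\hat\Sigma_m = M_m^\top M_m$ and $X = (X_1, \ldots, X_n)^\top$, provided that $\hat\Sigma_m$ is invertible almost everywhere (see Lemma 6). Let us denote the expected value of the random matrix $\hat\Sigma_m$ by $\Sigma_m = E[\hat\Sigma_m]$. Rewriting (1.1) in vector form (with $\xi = (\xi_1, \ldots, \xi_n)^\top$), i.e.

$$X = F_{\theta^*} + \sigma \xi = (F_{\theta^*} - F_{\theta_m^*}) + F_{\theta_m^*} + \sigma \xi,$$

it follows that

$$M_m^\top X = M_m^\top (F_{\theta^*} - F_{\theta_m^*}) + M_m^\top M_m \theta_m^* + \sigma M_m^\top \xi.$$

Thus, using (2.14), it holds that

$$\hat\theta_m - \theta_m^* = \hat\Sigma_m^{-1} M_m^\top (F_{\theta^*} - F_{\theta_m^*}) + \sigma\, \hat\Sigma_m^{-1} M_m^\top \xi, \tag{2.15}$$

which implies

$$M_m (\hat\theta_m - \theta_m^*) = M_m \hat\Sigma_m^{-1} M_m^\top (F_{\theta^*} - F_{\theta_m^*}) + \sigma M_m \hat\Sigma_m^{-1} M_m^\top \xi = P_{M_m}(F_{\theta^*} - F_{\theta_m^*}) + \sigma P_{M_m}(\xi),$$

where $P_{M_m} = M_m (M_m^\top M_m)^{-1} M_m^\top$ is the projection matrix onto the subspace spanned by the columns of $M_m$.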

The main difficulty in this work lies in the handling of the matrix $\hat\Sigma_m^{-1}$. We use a classical approach to overcome this issue: it consists in defining a set on which $\hat\Sigma_m^{-1}$ can be approximated by $\Sigma_m^{-1}$ ([3], [12], [19], [8] among others), which is invertible (see Lemma 6). For $m \in \mathcal{M}_n$, let us define $\Gamma_{m,r}$ as the set

$$\Gamma_{m,r} = \Big\{ \big\| \hat\Sigma_m^{-1} - \Sigma_m^{-1} \big\|_{op} \le r\, \big\| \Sigma_m^{-1} \big\|_{op} \Big\}.$$

We will see that in our framework $\Gamma_{m,r}$ holds with high probability. Before proving this, let us notice that $\hat\Sigma_m$ can be rewritten as

$$\hat\Sigma_m = \frac{1}{n} \sum_{t=1}^{n} \hat\Sigma_{m,t} \quad \text{with} \quad \hat\Sigma_{m,t} = Z_t^m (Z_t^m)^\top \quad \text{where} \quad Z_t^m = (X_{t-1}, \ldots, X_{t-D_m})^\top. \tag{2.16}$$

Since we have fixed $r$ equal to $a / E[X_0^2]$, the set $\Gamma_{m,r}$ will be denoted by $\Gamma_m$ in the sequel.

The following result shows that the event $\Gamma_m$ holds with high probability.

Proposition 4. Under assumptions A1-A6 and if $|\theta_t^*| = O(t^{-\gamma})$ with $\gamma \ge 8$, it holds that

$$P(\Gamma_m^c) \le \frac{c_0}{n^3}, \tag{2.17}$$

with $c_0 = 1 + 8 A\, C_\tau C_K \|X_0\|_2\, \big(a (r \wedge 1)\big)^{-1}$, where $A$ satisfies (5.1).

3 Oracle Inequality for Same Realization Prediction

We are now able to state the main result of the paper.

Theorem 3.1. Consider observations $(X_1, \ldots, X_n)$ arising from a solution of the process (1.1) satisfying A1 with $|\theta_t^*| = O(t^{-\gamma})$ where $\gamma \ge 8$, and also satisfying A2 and A4. Let $\mathcal{M}_n$ be some countable family of AR models satisfying A3 and A5-A6. For $x \ge 4$, let pen: $\mathcal{M}_n \to \mathbb{R}_+$ be a penalty function such that

$$\mathrm{pen}(S_m) = x\, \sigma^2\, \frac{D_m}{n}. \tag{3.1}$$

Then, with probability at least $1 - c_0 n^{-2}$, the LSE $\hat\theta_{\hat m}$ with $\hat m$ given in (2.4) satisfies

$$\ell(\hat\theta_{\hat m}, \theta^*) \le \inf_{m \in \mathcal{M}_n} \big\{ \ell(\hat\theta_m, \theta^*) \big\} + \frac{2 x \sigma^2}{n}. \tag{3.2}$$

Let us give some remarks about this result:

• The oracle inequality (3.2) is optimal in the sense that the leading constant is exactly one;

• This result is new in the non-asymptotic framework for AR(∞) processes under mild conditions. Indeed, [14] obtained a counterpart of our result when n → ∞ under several assumptions;

• The designed penalty (3.1) generalizes Mallows' Cp.

As a consequence of Theorem 3.1, we obtain for free the asymptotic efficiency obtained by [18] and [15].

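Corollary 1. Under the assumptions of Theorem 3.1, it holds that

$$\frac{\ell(\hat\theta_{\hat m}, \theta^*)}{\inf_{m \in \mathcal{M}_n} \ell(\hat\theta_m, \theta^*)} \xrightarrow[n \to \infty]{P} 1.$$

We now state the second main result, described in (2.8).

Theorem 3.2. Under the assumptions of Theorem 3.1, with the same penalty (3.1), with probability at least $1 - c_0 n^{-2}$, the LSE $\hat\theta_{\hat m}$ with $\hat m$ given in (2.4) satisfies

$$\ell(\hat\theta_{\hat m}, \theta^*) \le 2 \inf_{m \in \mathcal{M}_n} \big\{ \ell(\theta_m^*, \theta^*) + \mathrm{pen}(S_m) \big\} + \frac{2 x \sigma^2}{n}. \tag{3.3}$$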

Let us comment on the optimality of the leading constant 2 relative to the ones obtained in similar frameworks:

1. for regression in fixed design, Birgé and Massart ([5], [6]) obtained a similar inequality with the leading constant $C = 1 + \delta$ with $\delta > 0$. Also in [1], $C = 1 + \delta$ with $\delta \in (0, 2)$.

2. in Theorem 3.1 in [2], the leading constant is equal to 4, and it is equal to $2\, \frac{(x + \rho)^2}{(x - \rho)^2} > 2$ in [3]. These results have been obtained in a more general framework but under strong conditions.

Figure 1: Dimension Jump

4 Numerical Experiments

This section aims at investigating how well the proposed penalties agree with the results of Section 3. To do so, we generated observations from a causal invertible ARMA(1, 1) process

$$X_t = \phi_0 X_{t-1} + \xi_t + \theta_0 \xi_{t-1},$$

where the $\xi_t$'s are independent and identically $\mathcal{N}(0, 1)$ distributed and

$$(\phi_0, \theta_0) \in \big\{ (a, b) : a \in \{0.9, 0.7, 0.5, -0.9, -0.7, -0.5\} \text{ and } b \in \{0.8, 0.6, -0.8, -0.6\} \big\}$$

as in [15].

Since all of these models are invertible, they admit an AR(∞) representation. In order to assess our theoretical results, we consider as candidate set of models the family $\mathcal{M}_n$ of increasing AR($p$) models defined in (2.1) with $1 \le p \le K_n$, where $K_n = \lfloor 2 \log n \rfloor$ according to condition A6.
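The AR(∞) representation of an invertible ARMA(1, 1) process can be written down explicitly: inverting $(1 + \theta_0 B)$ in $(1 - \phi_0 B) X_t = (1 + \theta_0 B) \xi_t$ gives $X_t = \sum_{i \ge 1} \pi_i X_{t-i} + \xi_t$ with $\pi_i = (\phi_0 + \theta_0)(-\theta_0)^{i-1}$. The sketch below computes these coefficients; the function name and the pair (0.5, −0.6) are chosen here for illustration.

```python
def arma11_ar_coefficients(phi0, theta0, k):
    """First k AR(infinity) coefficients of X_t = phi0 X_{t-1} + xi_t + theta0 xi_{t-1}.

    Uses pi_i = (phi0 + theta0) * (-theta0)**(i - 1), valid when |theta0| < 1.
    """
    return [(phi0 + theta0) * (-theta0) ** (i - 1) for i in range(1, k + 1)]


coeffs = arma11_ar_coefficients(0.5, -0.6, 50)
# The absolute sum approaches |phi0 + theta0| / (1 - |theta0|) = 0.1 / 0.4 = 0.25.
abs_sum = sum(abs(c) for c in coeffs)
```

The coefficients decay geometrically at rate $|\theta_0|$, so the polynomial decay conditions of the theory hold for any γ. The absolute sum, however, depends on the pair: for (0.9, 0.8) the same formula gives 1.7/0.2 = 8.5, so the sufficient condition A1 is not met by every pair of the design.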

For each pair $(\phi_0, \theta_0)$, we compute an empirical version $\widehat{ME}$ of

$$ME := \frac{\ell(\hat\theta_{\hat m}, \theta^*)}{\inf_{m \in \mathcal{M}_n} \ell(\hat\theta_m, \theta^*)}$$

with $\hat m$ selected as in (2.4), where $\mathrm{pen}(S_m) = \hat x\, \hat\sigma^2\, \frac{D_m}{n}$ and the optimal constant $\hat x$ has been calibrated using the dimension jump algorithm implemented in the R package capushe and illustrated in Figure 1. To produce Figure 1, we used an ARMA(1,1) model with $\phi_0 = 0.9$ and $\theta_0 = 0.7$ and a sample size of 500. Then $\hat x = 2x$, where $x$ is the value which gives the highest jump. In Figure 1, $\hat x = x_{opt} = 2 \times 0.5975 = 1.195$. This optimal value was used throughout the simulation study.
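The calibration rule described above (locate the constant at which the selected dimension drops the most, then double it) can be sketched in a few lines. This is not the capushe implementation: the function `dimension_jump`, the grid of constants and the synthetic contrasts are hypothetical, the variance factor is absorbed into the constant, and the value returned is the first grid point after the largest drop, which is one convention among several.

```python
def dimension_jump(contrasts, n, kappas):
    """contrasts[d-1] is the least-squares contrast of the model of dimension d.

    For each kappa in the increasing grid `kappas`, select the dimension minimizing
    contrast + kappa * d / n, locate the largest drop in the selected dimension,
    and return (kappa at that drop, 2 * kappa).
    """
    def selected(kappa):
        crit = [c + kappa * d / n for d, c in enumerate(contrasts, start=1)]
        return crit.index(min(crit)) + 1

    dims = [selected(k) for k in kappas]
    jumps = [dims[i - 1] - dims[i] for i in range(1, len(dims))]
    i_max = jumps.index(max(jumps)) + 1
    return kappas[i_max], 2 * kappas[i_max]


# Hypothetical contrasts decreasing linearly with slope 0.0103 per dimension (n = 100):
# the selected dimension falls from 10 to 1 once kappa / n exceeds 0.0103.
n = 100
contrasts = [1.0 - 0.0103 * d for d in range(1, 11)]
kappas = [0.05 * k for k in range(0, 61)]
kappa_jump, x_hat = dimension_jump(contrasts, n, kappas)
```

On this synthetic example the jump is located at the first grid value above 1.03, so the resolution of the result is limited by the grid step.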

In the penalty pen(Sm), the variance σ2 is estimated by considering the largest model, i.e. the one of size Kn, as traditionally done with Mallows' Cp. Table 1 summarizes the results obtained over 500 replications.
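The selection procedure used in the experiments can be sketched as follows with the standard library only. Several choices here are assumptions made for illustration and not the author's code: the data are generated from a hypothetical AR(2) model, all candidate models are fitted on the same time range t > K_n (instead of padding with zeros), the normal equations are solved by plain Gaussian elimination, and the default constant `pen_const = 2.0` is the classical Mallows value rather than the calibrated one.

```python
import random


def simulate_ar(theta, n, burn=200, seed=0):
    """Simulate n values of X_t = sum_i theta[i-1] X_{t-i} + xi_t with N(0, 1) noise."""
    rng = random.Random(seed)
    p = len(theta)
    x = [0.0] * p
    for _ in range(n + burn):
        past = x[-1:-p - 1:-1]  # X_{t-1}, ..., X_{t-p}
        x.append(sum(c * v for c, v in zip(theta, past)) + rng.gauss(0.0, 1.0))
    return x[-n:]


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    d = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(d):
        piv = max(range(c, d), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, d):
            f = m[r][c] / m[c][c]
            for k in range(c, d + 1):
                m[r][k] -= f * m[c][k]
    z = [0.0] * d
    for r in range(d - 1, -1, -1):
        z[r] = (m[r][d] - sum(m[r][k] * z[k] for k in range(r + 1, d))) / m[r][r]
    return z


def fit_ar(x, d, start):
    """Least-squares AR(d) fit on t = start, ..., len(x) - 1; returns (theta_hat, contrast)."""
    rows = [[x[t - i] for i in range(1, d + 1)] for t in range(start, len(x))]
    ys = x[start:]
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    rhs = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(d)]
    th = solve(gram, rhs)
    rss = sum((y - sum(c * v for c, v in zip(th, r))) ** 2 for r, y in zip(rows, ys))
    return th, rss / len(ys)


def select_order(x, k_max, pen_const=2.0):
    """Minimize contrast + pen_const * sigma2_hat * d / n over d = 1, ..., k_max."""
    fits = {d: fit_ar(x, d, k_max) for d in range(1, k_max + 1)}
    sigma2_hat = fits[k_max][1]  # variance estimated from the largest model
    n_eff = len(x) - k_max
    crit = {d: fits[d][1] + pen_const * sigma2_hat * d / n_eff for d in fits}
    return min(crit, key=crit.get), crit, fits


x = simulate_ar([0.5, -0.3], n=500, seed=1)
d_hat, crit, fits = select_order(x, k_max=8)
```

Fitting every candidate on the same range makes the contrasts directly comparable and guarantees that they are non-increasing in the dimension, so only the penalty prevents the largest model from always being selected.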


Table 1: Empirical estimates of ME

                          θ0
φ0      n/Kn        0.8     0.6    -0.8    -0.6
0.9     60/8       1.17    1.24    1.21    1.11
        120/9      1.10    1.19    1.12    1.06
        200/10     1.05    1.16    1.11    1.12
        500/12     1.01    1.04    1.13    1.05
        1000/13    1.01    1.01    1.04    1.06
        2000/15    1.00    1.01    1.03    1.03
0.7     60/8       1.20    1.23    1.15    1.22
        120/9      1.15    1.21    1.09    1.16
        200/10     1.14    1.18    1.18    1.24
        500/12     1.03    1.15    1.09    1.16
        1000/13    1.01    1.07    1.11    1.12
        2000/15    1.01    1.03    1.03    1.12
0.5     60/8       1.25    1.13    1.12    1.14
        120/9      1.13    1.11    1.06    1.14
        200/10     1.15    1.14    1.07    1.16
        500/12     1.08    1.14    1.05    1.10
        1000/13    1.03    1.08    1.03    1.10
        2000/15    1.01    1.07    1.04    1.11
-0.9    60/8       1.14    1.09    1.17    1.21
        120/9      1.10    1.10    1.07    1.14
        200/10     1.10    1.11    1.04    1.14
        500/12     1.10    1.12    1.02    1.04
        1000/13    1.04    1.08    1.01    1.02
        2000/15    1.07    1.04    1.00    1.01
-0.7    60/7       1.17    1.21    1.20    1.17
        120/8      1.14    1.21    1.11    1.11
        200/9      1.18    1.12    1.12    1.21
        500/12     1.11    1.12    1.03    1.13
        1000/13    1.06    1.05    1.01    1.07
        2000/15    1.05    1.11    1.00    1.03
-0.5    60/8       1.10    1.22    1.13    1.12
        120/9      1.12    1.13    1.14    1.12
        200/10     1.09    1.08    1.14    1.08
        500/12     1.06    1.17    1.11    1.08
        1000/13    1.05    1.12    1.04    1.06
        2000/15    1.04    1.07    1.01    1.09

As can be seen, these results support our theory: $\widehat{ME}$ is not far from 1 and globally decreases with $n$. Moreover, we note the convergence towards 1, as announced in our theoretical results. These results are better than those obtained in [15]. The fastest decrease occurs when $\phi_0 \in \{0.9, -0.9\}$ and $\theta_0 \in \{0.8, -0.8\}$ with sgn($\phi_0$) = sgn($\theta_0$) (where sgn(a) = 1 if a > 0 and sgn(a) = −1 otherwise). Let us also note that even if there is a global decrease between n = 60 and n = 2000, there are cases where $\widehat{ME}$ increases slightly from n = 1000. This happens most often when $\phi_0$ is very close to $-\theta_0$, which makes the process $(X_t)_t$ very close to a white noise, which is unpredictable. We think that this explains the lack of global decay in these cases.


5 Proofs

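5.1 Proof of Theorem 3.1

Proof. We have

$$\gamma_n(\hat\theta_m) = \frac{1}{n} \sum_{t=1}^{n} (X_t - f^t_{\hat\theta_m})^2 = \frac{1}{n} \sum_{t=1}^{n} (f^t_{\theta^*} - f^t_{\hat\theta_m})^2 + \frac{\sigma^2}{n} \sum_{t=1}^{n} \xi_t^2 - \frac{2\sigma}{n} \sum_{t=1}^{n} \xi_t\, (f^t_{\hat\theta_m} - f^t_{\theta^*}).$$

Since $\ell(\hat\theta_m, \theta^*) = E\big[\frac{1}{n} \sum_{t=1}^{n} (f^t_{\theta^*} - f^t_{\hat\theta_m})^2\big]$ by stationarity, the difficult part of the proof is to obtain the penalty term from the expectation of the scalar product $\langle \xi, F_{\hat\theta_m} - F_{\theta^*} \rangle_n$.

Let $m \in \mathcal{M}_n$. By definition, $C(\hat m) \le C(m)$. Therefore

$$\|F_{\hat\theta_{\hat m}} - F_{\theta^*}\|_n^2 - \frac{2\sigma}{n} \sum_{t=1}^{n} \xi_t\, (f^t_{\hat\theta_{\hat m}} - f^t_{\theta^*}) + \mathrm{pen}(S_{\hat m}) \le \|F_{\hat\theta_m} - F_{\theta^*}\|_n^2 - \frac{2\sigma}{n} \sum_{t=1}^{n} \xi_t\, (f^t_{\hat\theta_m} - f^t_{\theta^*}) + \mathrm{pen}(S_m).$$

Moreover, the term of interest can be decomposed into

$$\langle \xi, F_{\hat\theta_m} - F_{\theta^*} \rangle = \langle \xi, F_{\hat\theta_m} - F_{\theta_m^*} \rangle + \langle \xi, F_{\theta_m^*} - F_{\theta^*} \rangle$$
$$= \sigma \langle \xi, P_{M_m}(\xi) \rangle + \langle \xi, P_{M_m}(F_{\theta^*} - F_{\theta_m^*}) \rangle + \langle \xi, F_{\theta_m^*} - F_{\theta^*} \rangle$$
$$= \sigma \langle \xi, P_{M_m}(\xi) \rangle + \langle \xi, (I_n - P_{M_m})(F_{\theta_m^*} - F_{\theta^*}) \rangle,$$

and then

$$\|F_{\hat\theta_{\hat m}} - F_{\theta^*}\|_n^2 - 2\sigma^2 \langle \xi, P_{M_{\hat m}}(\xi) \rangle_n - 2\sigma \langle \xi, (I_n - P_{M_{\hat m}})(F_{\theta_{\hat m}^*} - F_{\theta^*}) \rangle_n + \mathrm{pen}(S_{\hat m})$$
$$\le \|F_{\hat\theta_m} - F_{\theta^*}\|_n^2 - 2\sigma^2 \langle \xi, P_{M_m}(\xi) \rangle_n - 2\sigma \langle \xi, (I_n - P_{M_m})(F_{\theta_m^*} - F_{\theta^*}) \rangle_n + \mathrm{pen}(S_m).$$

It remains to take the expectation of these terms. However, since the matrix $P_{M_m}$ is random and not independent of $\xi$, the task is difficult. Taking expectations and applying Lemma 5, we obtain

$$\ell(\hat\theta_{\hat m}, \theta^*) + \mathrm{pen}(S_{\hat m}) - 2\sigma^2 E\big[\langle \xi, P_{M_{\hat m}}(\xi) \rangle_n\big] \le \ell(\hat\theta_m, \theta^*) + \mathrm{pen}(S_m) - 2\sigma^2 E\big[\langle \xi, P_{M_m}(\xi) \rangle_n\big].$$

In view of Lemma 4 and the choice of the penalty according to (3.1), it holds on $\Gamma_m$ that

$$0 \le \mathrm{pen}(S_m) - 2\sigma^2 E\big[\langle \xi, P_{M_m}(\xi) \rangle_n\big] \le 2\, \mathrm{pen}(S_m).$$

As a result, on $\Gamma_m$,

$$\ell(\hat\theta_{\hat m}, \theta^*) \le \ell(\hat\theta_m, \theta^*) + 2\, \mathrm{pen}(S_m).$$

Let us set $\Gamma = \bigcap_{m \in \mathcal{M}_n} \Gamma_m$. $\Gamma$ holds with probability larger than $1 - c_0 n^{-2}$. Indeed,

$$P(\Gamma) = 1 - P\Big( \bigcup_{m \in \mathcal{M}_n} \Gamma_m^c \Big) \ge 1 - \sum_{m \in \mathcal{M}_n} P(\Gamma_m^c) \ge 1 - \frac{c_0 K_n}{n^3} \ge 1 - \frac{c_0}{n^2},$$

using Proposition 4 and the fact that $K_n \le n$. Hence, it holds on $\Gamma$ that

$$\ell(\hat\theta_{\hat m}, \theta^*) \le \inf_{m \in \mathcal{M}_n} \big\{ \ell(\hat\theta_m, \theta^*) + 2\, \mathrm{pen}(S_m) \big\} \le \inf_{m \in \mathcal{M}_n} \ell(\hat\theta_m, \theta^*) + \frac{2 x \sigma^2}{n}.$$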

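5.2 Proof of Theorem 3.2

Proof. We have

$$\ell(\hat\theta_m, \theta_m^*) = \frac{1}{n} E\Big[ \sum_{t=1}^{n} (f^t_{\hat\theta_m} - f^t_{\theta_m^*})^2 \Big] = \frac{1}{n} E\big[ (\hat\theta_m - \theta_m^*)^\top M_m^\top M_m (\hat\theta_m - \theta_m^*) \big].$$

Using (2.15), it follows that

$$(\hat\theta_m - \theta_m^*)^\top M_m^\top M_m (\hat\theta_m - \theta_m^*) = \langle F_{\theta^*} - F_{\theta_m^*}, P_{M_m}(F_{\theta^*} - F_{\theta_m^*}) \rangle + \sigma^2 \langle \xi, P_{M_m} \xi \rangle + 2\sigma \langle \xi, P_{M_m}(F_{\theta^*} - F_{\theta_m^*}) \rangle.$$

Moreover, since $P_{M_m}$ is a projection matrix, $\|P_{M_m}\|_{op} = 1$ and

$$E\big| \langle F_{\theta^*} - F_{\theta_m^*}, P_{M_m}(F_{\theta^*} - F_{\theta_m^*}) \rangle \big| \le E\big[ \|F_{\theta^*} - F_{\theta_m^*}\|\, \|P_{M_m}(F_{\theta^*} - F_{\theta_m^*})\| \big] \le E\big[ \|F_{\theta^*} - F_{\theta_m^*}\|^2 \big].$$

Hence, we deduce from Lemma 4 that

$$\ell(\hat\theta_m, \theta_m^*) = \frac{1}{n} E \langle F_{\theta^*} - F_{\theta_m^*}, P_{M_m}(F_{\theta^*} - F_{\theta_m^*}) \rangle + \frac{\sigma^2}{n} E \langle \xi, P_{M_m} \xi \rangle \le \ell(\theta_m^*, \theta^*) + \frac{2}{n} \sigma^2 D_m \le \ell(\theta_m^*, \theta^*) + 0.5\, \mathrm{pen}(S_m),$$

so that

$$\ell(\hat\theta_m, \theta^*) = \ell(\theta_m^*, \theta^*) + \ell(\hat\theta_m, \theta_m^*) \le 2\, \ell(\theta_m^*, \theta^*) + \mathrm{pen}(S_m).$$

This fact, along with Theorem 3.1, establishes (3.3).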

5.3 Proof of Corollary 1

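Proof. First, let us remark that $P(\Gamma) \to 1$ as $n \to \infty$ (where $\Gamma$ is the set defined in the proof of Theorem 3.1). Also, from (3.2) and for any $n$,

$$\inf_{m \in \mathcal{M}_n} \ell(\hat\theta_m, \theta^*) \le \ell(\hat\theta_{\hat m}, \theta^*) \le \inf_{m \in \mathcal{M}_n} \ell(\hat\theta_m, \theta^*) + \frac{2 x \sigma^2}{n}.$$

The proof is completed by letting $n \to \infty$ in the previous double inequality.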

5.4 Proof of Proposition 4

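Let us recall the definition of $\hat\Sigma_m$ as in (2.16):

$$\hat\Sigma_m = \frac{1}{n} \sum_{t=1}^{n} \hat\Sigma_{m,t} \quad \text{with} \quad \hat\Sigma_{m,t} = Z_t^m (Z_t^m)^\top \quad \text{where} \quad Z_t^m = (X_{t-1}, \ldots, X_{t-D_m})^\top.$$

Following the idea of the proof of Proposition 4 in [8], we claim that

$$\Gamma_r^c = \Big\{ \big\| \hat\Sigma_m^{-1} - \Sigma_m^{-1} \big\|_{op} > r\, \big\| \Sigma_m^{-1} \big\|_{op} \Big\} \subset \Big\{ \big\| \Sigma_m^{-1/2} \hat\Sigma_m \Sigma_m^{-1/2} - I_m \big\|_{op} > \frac{r \wedge 1}{2} \Big\},$$

so that

$$\Gamma_r^c \subset \Big\{ \big\| \hat\Sigma_m - \Sigma_m \big\|_{op}\, \big\| \Sigma_m^{-1} \big\|_{op} > \frac{r \wedge 1}{2} \Big\}.$$

Therefore, with $r = a / E[X_0^2]$ and using Proposition 3,

$$P(\Gamma_r^c) \le P\Big( \big\| \hat\Sigma_m - \Sigma_m \big\|_{op} > \frac{r \wedge 1}{2\, \|\Sigma_m^{-1}\|_{op}} \Big) \le P\Big( \big\| \hat\Sigma_m - \Sigma_m \big\|_{op} > \frac{a (r \wedge 1)}{2} \Big)$$
$$\le P\Big( \big\| \hat\Sigma_m^* - \Sigma_m \big\|_{op} > \frac{a (r \wedge 1)}{4} \Big) + P\Big( \big\| \hat\Sigma_m - \hat\Sigma_m^* \big\|_{op} > \frac{a (r \wedge 1)}{4} \Big) =: P_1 + P_2.$$

First, using Lemma 3 with $u = a (r \wedge 1)/4$ and by virtue of A5, it follows that

$$P_1 \le 2 \exp(-3 \log n) \le \frac{2}{n^3}.$$

Let us now bound $P_2$. We know that for a $D_m \times D_m$ matrix $A$,

$$\|A\|_{op} \le \|A\|_\infty := \max_{1 \le i \le D_m} \sum_{j=1}^{D_m} |A_{ij}|.$$

Thus, from Markov's inequality,

$$P_2 \le \frac{4}{a (r \wedge 1)} E\big[ \| \hat\Sigma_m - \hat\Sigma_m^* \|_{op} \big] \le \frac{4}{a (r \wedge 1)} E\Big[ \max_{1 \le i \le D_m} \sum_{j=1}^{D_m} \big| (\hat\Sigma_m - \hat\Sigma_m^*)_{i,j} \big| \Big]$$
$$\le \frac{4}{a (r \wedge 1)} \sum_{j=1}^{D_m} E\big| (\hat\Sigma_m - \hat\Sigma_m^*)_{i_0,j} \big| \le \frac{4}{a (r \wedge 1)} \sum_{j=1}^{D_m} E\big| X_{t-i_0} X_{t-j} - X^*_{t-i_0} X^*_{t-j} \big|.$$

Moreover,

$$\big| X_{t-i} X_{t-j} - X^*_{t-i} X^*_{t-j} \big| \le |X_{t-i}|\, |X_{t-j} - X^*_{t-j}| + |X^*_{t-j}|\, |X_{t-i} - X^*_{t-i}|,$$

so that, with the Cauchy-Schwarz inequality,

$$E\big| X_{t-i} X_{t-j} - X^*_{t-i} X^*_{t-j} \big| \le 2\, \|X_0\|_2\, \|X_{t-1} - X^*_{t-1}\|_2 \le 2\, \|X_0\|_2\, \tau^{(2)}(q_n).$$

Hence,

$$P_2 \le \frac{8\, \|X_0\|_2}{a (r \wedge 1)} D_m\, \tau^{(2)}(q_n) \le \frac{8\, \|X_0\|_2}{a (r \wedge 1)} D_m\, C_\tau \Big( \frac{\log q_n}{q_n} \Big)^{\gamma - 1},$$

where the last inequality follows from Proposition 3 and Proposition 1. As a result, choosing $s_n = O(\sqrt{n}\, \log n)$ (ensuring A5), one can find a constant $A$ such that

$$\Big( \frac{\log q_n}{q_n} \Big)^{\gamma - 1} \le A \Big( \frac{1}{\sqrt{n}} \Big)^{\gamma - 1}. \tag{5.1}$$

As a result, with $\gamma \ge 8$ and $C = 8\, \|X_0\|_2 / (a (r \wedge 1))$,

$$P_2 \le A\, C\, C_\tau D_m \frac{1}{n^{7/2}} \le A\, C\, C_\tau C_K \frac{1}{n^3}$$

by virtue of A6. The result is proved with $c_0 = A\, C\, C_\tau C_K + 1$.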

5.5 Proof of Proposition 3

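Proof. The proof is based on the relation between the spectral density function and the extreme eigenvalues of the variance-covariance matrix.

Denote by $u \in \mathbb{R}^{D_m}$ the normalized eigenvector associated with the largest eigenvalue $\lambda_{\max}(\Sigma_m)$. Hence,

$$\lambda_{\max}(\Sigma_m) = u^\top \Sigma_m u = \sum_{j,k=1}^{D_m} u_j\, r(j-k)\, u_k = \int_{-\pi}^{\pi} g(\lambda) \sum_{j,k=1}^{D_m} u_j e^{i(j-k)\lambda} u_k\, d\lambda = \int_{-\pi}^{\pi} g(\lambda) \Big| \sum_{j=1}^{D_m} u_j e^{ij\lambda} \Big|^2 d\lambda$$
$$\le \sup_{-\pi \le \lambda < \pi} g(\lambda) \int_{-\pi}^{\pi} \Big| \sum_{j=1}^{D_m} u_j e^{ij\lambda} \Big|^2 d\lambda \le \sup_{-\pi \le \lambda < \pi} g(\lambda),$$

since, using Parseval's identity, $\int_{-\pi}^{\pi} \big| \sum_{j=1}^{D_m} u_j e^{ij\lambda} \big|^2 d\lambda = \sum_{j=1}^{D_m} u_j^2 = 1$. But, from Lemma 2 and since $\gamma \ge 2$, it follows that

$$\sup_{-\pi \le \lambda < \pi} g(\lambda) \le \frac{1}{2\pi} \sum_{h \in \mathbb{Z}} |r(h)| \le \frac{C}{\pi} \sum_{h=0}^{+\infty} \frac{1}{(h+1)^\gamma} < \infty.$$

Given that $\Sigma_m$ is symmetric, it follows that

$$\|\Sigma_m\|_{op} = \lambda_{\max}(\Sigma_m) \le \frac{C}{\pi} \sum_{h=0}^{+\infty} \frac{1}{(h+1)^\gamma},$$

which concludes the proof of (2.10).

We end with the proof of (2.11). Reasoning as above, and by virtue of A4, one can show that $\lambda_{\min}(\Sigma_m) \ge \inf_{-\pi \le \lambda < \pi} g(\lambda) \ge a$, which yields

$$\|\Sigma_m^{-1}\|_{op} = \frac{1}{\lambda_{\min}(\Sigma_m)} \le \frac{1}{a},$$

so that (2.11) is established.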

5.6 Technical Lemmas

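Lemma 1. Assume A1 holds and let $(X_t)$ be the mixing stationary solution of (1.1). Then the process $(\vec X_t)$ is mixing and

$$\tau^{(1)}_{\vec X, \infty}(r) \le K_n\, \tau^{(1)}_{X, \infty}(r - 1). \tag{5.2}$$

Proof. Let $\mathcal{M}^i_{\vec X} = \sigma(\vec X_t, t \le i)$ and $\mathcal{M}^i_X = \sigma(X_t, t \le i)$ for an integer $i$. One would like to bound $\tau\big(\mathcal{M}^i_{\vec X}, (\vec X_{j_1}, \ldots, \vec X_{j_k})\big)$ for $j_k > \ldots > j_1 \ge i + r$. Let us assume that the universe $\Omega$ is rich enough so that one can find $\vec X^*_{j_l} = (X^*_{j_l - 1}, \ldots, X^*_{j_l - K_n})^\top$ with $l = 1, \ldots, k$ satisfying:

1. $(\vec X^*_{j_1}, \ldots, \vec X^*_{j_k})$ is distributed as $(\vec X_{j_1}, \ldots, \vec X_{j_k})$ and independent of $\mathcal{M}^i_{\vec X}$;

2. $(X^*_{j_1 - 1}, \ldots, X^*_{j_k - 1})^\top$ is distributed as $(X_{j_1 - 1}, \ldots, X_{j_k - 1})^\top$ and independent of $\mathcal{M}^i_X$.

As a result,

$$\tau\big(\mathcal{M}^i_{\vec X}, (\vec X_{j_1}, \ldots, \vec X_{j_k})\big) \le \sum_{l=1}^{k} \|\vec X_{j_l} - \vec X^*_{j_l}\|_1 = \sum_{l=1}^{k} \sum_{t=1}^{K_n} E|X_{j_l - t} - X^*_{j_l - t}| \le K_n \sum_{l=1}^{k} E|X_{j_l - 1} - X^*_{j_l - 1}|$$
$$= K_n \big\| (X_{j_1 - 1}, \ldots, X_{j_k - 1})^\top - (X^*_{j_1 - 1}, \ldots, X^*_{j_k - 1})^\top \big\|_1 = K_n\, \tau\big(\mathcal{M}^i_X, (X_{j_1 - 1}, \ldots, X_{j_k - 1})\big).$$

This fact, along with the definition of $\tau^{(1)}_{\vec X, \infty}(r)$, leads to (5.2).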

Lemma 2. UnderA1 with |θ_t∗| = O(t−γ_{) where γ > 1, we have}

r(h) = E[X0Xh] = O (h + 1)−γ

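Proof. By virtue of A1, the process $(X_t)_t$ is causal; that is, there exists $(\varphi_i)_{i \in \mathbb{N}}$ such that $X_t = \sum_{i=0}^{+\infty} \varphi_i \xi_{t-i}$ with $\sum_{i=0}^{+\infty} |\varphi_i| < \infty$. The sequence $(\varphi_i)_{i \in \mathbb{N}}$ is given by the relation

$$\varphi(z) = \sum_{i=0}^{+\infty} \varphi_i z^i = \frac{1}{\theta(z)} \quad \text{with} \quad \theta(z) = 1 - \sum_{i=1}^{+\infty} \theta^*_i z^i.$$

Equating coefficients of $z^j$, $j = 0, 1, \dots$, we find that $\varphi_0 = 1$ and, for $i \ge 1$,

$$\varphi_i = \sum_{j=1}^{i} \theta^*_j \varphi_{i-j}.$$

This allows us to deduce that the sequences $(\varphi_i)_{i \in \mathbb{N}}$ and $(\theta^*_i)_{i \in \mathbb{N}}$ decay at the same rate. Therefore, since $|\theta^*_t| = O\big((t+1)^{-\gamma}\big)$, there exist a constant $C > 0$ and $t_0 \in \mathbb{N}$ such that $|\varphi_t| \le C (t+1)^{-\gamma}$ for any $t \ge t_0$ (and, enlarging $C$ if necessary, for all $t \ge 0$). Thus,

$$|r(h)| = \Big|\sum_{j=0}^{\infty} \varphi_j \varphi_{j+h}\Big| \le C^2 \sum_{j=0}^{\infty} \frac{1}{(j+1)^{\gamma}} \frac{1}{(j+h+1)^{\gamma}} \le C^2 (h+1)^{-\gamma} \sum_{j=0}^{\infty} \frac{1}{(j+1)^{\gamma}} \le \frac{C^2 \pi^2}{6} (h+1)^{-\gamma},$$

where the last inequality follows from the fact that $\gamma \ge 2$ (for a general $\gamma > 1$ the series $\sum_{j \ge 0} (j+1)^{-\gamma}$ is still finite, so the same bound holds with a different constant). This establishes the Lemma.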
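The recursion for the moving-average coefficients and the resulting decay of $r(h)$ can be explored numerically. The following is a minimal sketch, not the author's code: the coefficients `theta` (a polynomially decaying sequence with $\gamma = 2$), the truncation levels and the unit noise variance are all assumptions chosen for illustration.

```python
import numpy as np

def ma_coefficients(theta, n_terms):
    """phi_0 = 1 and phi_i = sum_{j=1}^{i} theta_j * phi_{i-j}; theta[j-1] stores theta_j."""
    phi = np.zeros(n_terms)
    phi[0] = 1.0
    for i in range(1, n_terms):
        j_max = min(i, len(theta))
        # theta_1..theta_{j_max} are paired with phi_{i-1}..phi_{i-j_max}
        phi[i] = np.dot(theta[:j_max], phi[i - 1::-1][:j_max])
    return phi

def autocovariance(phi, max_lag):
    """Truncated r(h) = sum_j phi_j * phi_{j+h}, assuming unit noise variance."""
    n = len(phi)
    return np.array([np.dot(phi[: n - h], phi[h:]) for h in range(max_lag + 1)])

gamma = 2.0
j = np.arange(1, 201, dtype=float)
theta = 0.3 * j ** (-gamma)           # hypothetical coefficients with sum |theta_j| < 1
phi = ma_coefficients(theta, 2000)
r = autocovariance(phi, 50)

# If r(h) = O((h+1)^(-gamma)), this rescaled sequence should remain bounded.
scaled = np.abs(r) * (np.arange(51) + 1.0) ** gamma
print(scaled[[0, 10, 25, 50]])
```

As a sanity check of the recursion, taking `theta = np.array([0.5])` (an AR(1) model) returns $\varphi_i = 0.5^i$, the familiar geometric expansion.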


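Lemma 3. Under assumption A2, for any model $m \in \mathcal{M}_n$ and for all $u > 0$,

$$\mathbb{P}\Big(\big\|\widehat{\Sigma}^*_m - \Sigma_m\big\|_{op} \ge u\Big) \le 2 \exp\left\{ -\frac{s_n}{2} \min\left( \Big(\frac{u}{16\, D_m \sigma_0^2}\Big)^2,\ \frac{u}{32\, D_m \sigma_0^2} \right) \right\}.$$

Proof. For a matrix $A$, one can write

$$\|A\|_{op} = \max_{v:\ \|v\| = 1} v^\top A v = v_0^\top A v_0.$$

Therefore, one can find a vector $v_0 \in \mathbb{R}^{D_m}$ with $\|v_0\| = 1$ such that

$$\mathbb{P}\Big(\big\|\widehat{\Sigma}^*_m - \Sigma_m\big\|_{op} \ge u\Big) = \mathbb{P}\Big(v_0^\top \big(\widehat{\Sigma}^*_m - \Sigma_m\big) v_0 \ge u\Big).$$

But

$$\begin{aligned}
v_0^\top \big(\widehat{\Sigma}^*_m - \Sigma_m\big) v_0
&= \frac{1}{n} \sum_{t=1}^{n} v_0^\top \widehat{\Sigma}^*_{m,t} v_0 - v_0^\top \Sigma_m v_0 \\
&= \frac{1}{n} \sum_{t=1}^{n} v_0^\top (Z^{*m}_t)(Z^{*m}_t)^\top v_0 - v_0^\top \Sigma_m v_0 \\
&= \frac{1}{n} \sum_{t=1}^{n} \big(Y_t^2 - \mathbb{E}[Y_t^2]\big),
\end{aligned}$$

with $Y_t = v_0^\top Z^{*m}_t = \sum_{i=1}^{D_m} v_{0i} X^*_{t-i}$. From A2, $Y_t$ is $SG(D_m \sigma_0^2)$. Therefore, $Y_t^2$ is $SE(256\, D_m^2 \sigma_0^4,\ 16\, D_m \sigma_0^2)$ (where SG stands for Sub-Gaussian and SE for Sub-Exponential).

Moreover, we can write

$$\begin{aligned}
v_0^\top \big(\widehat{\Sigma}^*_m - \Sigma_m\big) v_0
&= \frac{1}{n} \sum_{t=1}^{n} \big(Y_t^2 - \mathbb{E}[Y_t^2]\big) \\
&= \frac{1}{s_n} \sum_{k=0}^{s_n - 1} \left( \frac{1}{2 q_n} \sum_{i=1}^{q_n} \big(Y^2_{2kq_n + i} - \mathbb{E}[Y_1^2]\big) \right) + \frac{1}{s_n} \sum_{k=0}^{s_n - 1} \left( \frac{1}{2 q_n} \sum_{i=1}^{q_n} \big(Y^2_{(2k+1)q_n + i} - \mathbb{E}[Y_1^2]\big) \right) \\
&= \overline{Y}_1 + \overline{Y}_2.
\end{aligned}$$

Therefore,

$$\overline{Y}_1 = \frac{1}{s_n} \sum_{k=0}^{s_n - 1} Y_{1,k} \quad \text{and} \quad \overline{Y}_2 = \frac{1}{s_n} \sum_{k=0}^{s_n - 1} Y_{2,k},$$

with

$$Y_{1,k} = \frac{1}{2 q_n} \sum_{i=1}^{q_n} \big(Y^2_{2kq_n + i} - \mathbb{E}[Y_1^2]\big) \quad \text{and} \quad Y_{2,k} = \frac{1}{2 q_n} \sum_{i=1}^{q_n} \big(Y^2_{(2k+1)q_n + i} - \mathbb{E}[Y_1^2]\big).$$

By virtue of Proposition 2, $\{Y_{1,k}\}_k$ and $\{Y_{2,k}\}_k$ are sequences of independent random variables. Now, let us show that the $Y_{i,k}$ are sub-exponential. For $\lambda$ such that $|\lambda| < \frac{1}{16\, D_m \sigma_0^2}$, and denoting $w_i = Y^2_{2kq_n + i} - \mathbb{E}[Y_1^2]$, we have

$$\begin{aligned}
\mathbb{E}\big[e^{\lambda Y_{1,k}}\big]
&= \mathbb{E}\left[\exp\Big(\frac{1}{2 q_n} \sum_{i=1}^{q_n} \lambda w_i\Big)\right]
= \mathbb{E}\left[\prod_{i=1}^{q_n} \exp\Big(\frac{\lambda w_i}{2 q_n}\Big)\right]
= \mathbb{E}\left[\prod_{i=1}^{q_n} \Big(\exp\Big(\frac{\lambda w_i}{2}\Big)\Big)^{1/q_n}\right] \\
&\le \prod_{i=1}^{q_n} \left(\mathbb{E}\Big[\exp\Big(\frac{\lambda w_i}{2}\Big)\Big]\right)^{1/q_n}
\le \exp\Big(\frac{\lambda^2}{2}\, 64\, D_m^2 \sigma_0^4\Big),
\end{aligned}$$

where we have used Hölder's inequality. Hence $Y_{1,k}$ is $SE(64\, D_m^2 \sigma_0^4,\ 16\, D_m \sigma_0^2)$. As a result, using exponential inequalities for SE random variables, it follows that

$$\mathbb{P}\big(\overline{Y}_1 \ge u/2\big) \le \exp\left\{ -\frac{s_n}{2} \min\left( \Big(\frac{u}{16\, D_m \sigma_0^2}\Big)^2,\ \frac{u}{32\, D_m \sigma_0^2} \right) \right\},$$

and the same bound holds for $\overline{Y}_2$, so that

$$\mathbb{P}\Big(v_0^\top \big(\widehat{\Sigma}^*_m - \Sigma_m\big) v_0 \ge u\Big) \le 2 \exp\left\{ -\frac{s_n}{2} \min\left( \Big(\frac{u}{16\, D_m \sigma_0^2}\Big)^2,\ \frac{u}{32\, D_m \sigma_0^2} \right) \right\}.$$

Lemma 4. For every $m \in \mathcal{M}_n$, it holds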

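$$\mathbb{E}\big[\langle \xi, P_{M_m}(\xi) \rangle\, \mathbb{1}_{\Gamma_m}\big] \le 2 D_m. \tag{5.3}$$

Proof. We have

$$\begin{aligned}
\langle \xi, P_{M_m}(\xi) \rangle
&= \big\langle \xi, M_m \big(M_m^\top M_m\big)^{-1} M_m^\top \xi \big\rangle
= \big\langle M_m^\top \xi, \widehat{\Sigma}_m^{-1} M_m^\top \xi \big\rangle \\
&= \big\langle M_m^\top \xi, \big(\widehat{\Sigma}_m^{-1} - \Sigma_m^{-1}\big) M_m^\top \xi \big\rangle + \big\langle M_m^\top \xi, \Sigma_m^{-1} M_m^\top \xi \big\rangle.
\end{aligned}$$

On one hand,

$$\mathbb{E}\big[(M_m^\top \xi)^\top \Sigma_m^{-1} (M_m^\top \xi)\big] = D_m. \tag{5.4}$$

Indeed, denote $\tilde{\xi} = M_m^\top \xi$, so that for all $k = 1, \dots, D_m$, $\tilde{\xi}_k = \sum_{t=1}^{n} X_{t-k} \xi_t$. Using conditional expectation, it can be shown that each component of $\tilde{\xi}$ is a sum of martingale differences. Let us compute the covariance matrix of $\tilde{\xi}$. The $(k, l)$ element of this matrix is

$$\big(\Sigma_{\tilde{\xi}}\big)_{k,l} = \mathbb{E}\Big[\sum_{i=1}^{n} X_{i-k} \xi_i \sum_{j=1}^{n} X_{j-l} \xi_j\Big] = \mathbb{E}\Big[\sum_{i=1}^{n} X_{i-k} X_{i-l}\Big] = \big(\mathbb{E}[M_m^\top M_m]\big)_{k,l} = \big(\Sigma_m\big)_{k,l}.$$

Therefore, $\Sigma_{\tilde{\xi}} = \Sigma_m$ and

$$\mathbb{E}\big[(M_m^\top \xi)^\top \Sigma_m^{-1} (M_m^\top \xi)\big] = \mathrm{Trace}\big(\Sigma_m^{-1} \Sigma_m\big) = D_m.$$

Moreover, it holds on $\Gamma_m$ that

$$\mathbb{E}\big|\big\langle M_m^\top \xi, \big(\widehat{\Sigma}_m^{-1} - \Sigma_m^{-1}\big) M_m^\top \xi \big\rangle\big| \le D_m. \tag{5.5}$$

Indeed,

$$\begin{aligned}
\mathbb{E}\big|\big\langle M_m^\top \xi, \big(\widehat{\Sigma}_m^{-1} - \Sigma_m^{-1}\big) M_m^\top \xi \big\rangle\big|
&\le \mathbb{E}\Big[\|\tilde{\xi}\|\, \big\|\big(\widehat{\Sigma}_m^{-1} - \Sigma_m^{-1}\big) \tilde{\xi}\big\|\Big]
\le \mathbb{E}\Big[\|\tilde{\xi}\|\, \big\|\widehat{\Sigma}_m^{-1} - \Sigma_m^{-1}\big\|_{op}\, \|\tilde{\xi}\|\Big] \\
&\le r\, \big\|\Sigma_m^{-1}\big\|_{op}\, \mathbb{E}\big[\|\tilde{\xi}\|^2\big]
\le \frac{r\, D_m\, \mathbb{E}[X_0^2]}{a},
\end{aligned}$$

where the last inequality holds since $\mathbb{E}\|\tilde{\xi}\|^2 = \mathrm{Trace}(\Sigma_m) = D_m \mathbb{E}[X_0^2]$. This implies (5.5) as $r = a / \mathbb{E}[X_0^2]$. Combining (5.4) and (5.5) gives (5.3).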
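The algebraic identity at the start of this proof, and the fact that the projector onto a $D_m$-dimensional space has trace $D_m$, can be verified on simulated data. The sketch below is illustrative only: the AR(1) sample path, its coefficient 0.5, the Gaussian innovations, and the names `M`, `gram` and `P` are assumptions introduced here, and the Gram matrix is left unnormalised.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_m = 200, 4

# Hypothetical AR(1) sample path with standard Gaussian innovations.
xi_all = rng.standard_normal(n + d_m)
x = np.zeros(n + d_m)
for t in range(1, n + d_m):
    x[t] = 0.5 * x[t - 1] + xi_all[t]

# Row i of M is (X_{i-1}, ..., X_{i-D_m}); xi holds the innovations of the same times.
M = np.column_stack([x[d_m - k : n + d_m - k] for k in range(1, d_m + 1)])
xi = xi_all[d_m:]

gram = M.T @ M                          # plays the role of hat{Sigma}_m here
P = M @ np.linalg.solve(gram, M.T)      # orthogonal projector onto the columns of M

lhs = xi @ P @ xi                       # <xi, P xi>
score = M.T @ xi                        # M^T xi, the vector called tilde{xi} in the proof
rhs = score @ np.linalg.solve(gram, score)

print(lhs, rhs, np.trace(P))
```

The two quadratic forms coincide up to rounding, and `np.trace(P)` equals `d_m`, which is the finite-sample counterpart of the trace computation leading to (5.4).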


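Lemma 5. For every $m \in \mathcal{M}_n$, it holds

$$\mathbb{E}\big\langle \xi, \big(I_n - P_{M_m}\big)\big(F_{\theta^*_m} - F_{\theta^*}\big) \big\rangle = 0. \tag{5.6}$$

Proof. First, we have

$$\mathbb{E}\big\langle \xi, I_n \big(F_{\theta^*_m} - F_{\theta^*}\big) \big\rangle = 0. \tag{5.7}$$

Indeed,

$$\mathbb{E}\big\langle \xi, I_n \big(F_{\theta^*_m} - F_{\theta^*}\big) \big\rangle = \sum_{t=1}^{n} \mathbb{E}\big[\xi_t \big(f^t_{\theta^*_m} - f^t_{\theta^*}\big)\big] = \sum_{t=1}^{n} \mathbb{E}\Big[\big(f^t_{\theta^*_m} - f^t_{\theta^*}\big)\, \mathbb{E}[\xi_t \mid \mathcal{F}_t]\Big] = 0.$$

Secondly, $P_{M_m}\big(F_{\theta^*_m} - F_{\theta^*}\big)$ is an element of $S_m$. Therefore, there exists $\theta_0 \in \Theta_m$, possibly dependent on $(X_1, X_2, \dots, X_n)$, such that

$$P_{M_m}\big(F_{\theta^*_m} - F_{\theta^*}\big) = M_m \theta_0.$$

As a result,

$$\big\langle \xi, P_{M_m}\big(F_{\theta^*_m} - F_{\theta^*}\big) \big\rangle = \langle \xi, M_m \theta_0 \rangle \le \sup_{\theta \in \Theta_m} \langle \xi, M_m \theta \rangle.$$

Since $\theta \mapsto \langle \xi, M_m \theta \rangle$ is a continuous function and $\Theta_m$ is compact, one can find $\theta_1 \in \Theta_m$ such that

$$\sup_{\theta \in \Theta_m} \langle \xi, M_m \theta \rangle = \langle \xi, M_m \theta_1 \rangle.$$

But, for any $\theta \in \Theta_m$, $\mathbb{E}\big|\langle M_m^\top \xi, \theta \rangle\big| \le \sum_{k=1}^{D_m} \mathbb{E}\big|\theta_k \tilde{\xi}_k\big| = 0$. It then follows that

$$\mathbb{E}\big|\big\langle \xi, P_{M_m}\big(F_{\theta^*_m} - F_{\theta^*}\big) \big\rangle\big| \le \mathbb{E}\big|\langle M_m^\top \xi, \theta_1 \rangle\big| = 0,$$

which, along with (5.7), implies (5.6).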

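Lemma 6. Assume A3 holds. Then $\widehat{\Sigma}_m$ is a.e. invertible. Also, $\Sigma_m$ is invertible.

Proof. We can write $\widehat{\Sigma}_m = M_m^\top M_m$ with $M_m = \big(X_{i-1}, \dots, X_{i-D_m}\big)_{i=1}^{n}$. By virtue of A3, $M_m$ is of full rank, which implies the a.e. invertibility of $\widehat{\Sigma}_m$.

Moreover, $\Sigma_m = \mathbb{E}\big[\widehat{\Sigma}_m\big] = \mathbb{E}\big[Z^m_0 (Z^m_0)^\top\big]$ with $Z^m_0 = (X_{-1}, \dots, X_{-D_m})^\top$. Let $u \in \mathbb{R}^{D_m}$; it follows that

$$u^\top \Sigma_m u = \mathbb{E}\big[\big((Z^m_0)^\top u\big)^2\big] \ge 0.$$

Let us show that whenever equality holds ($u^\top \Sigma_m u = 0$), we have $u = 0$. Since $\big((Z^m_0)^\top u\big)^2 \ge 0$, its expectation vanishes if and only if $(Z^m_0)^\top u = 0$ a.e., which yields $u = 0$ by A3. Hence, $\Sigma_m$ is positive definite and therefore invertible.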

6 Acknowledgements

The author thanks William KENGNE and Jean-Marc BARDET for proofreading and helpful discussions.


References

[1] S. Arlot and F. Bach. Data-driven calibration of linear estimators with minimal penalties. Advances in Neural Information Processing Systems, 22:46–54, 2009.

[2] Y. Baraud, F. Comte, and G. Viennet. Model selection for (auto-) regression with dependent data. ESAIM: Probability and Statistics, 5:33–49, 2001.

[3] Y. Baraud, F. Comte, G. Viennet, et al. Adaptive estimation in autoregression or-mixing regression via model selection. The Annals of Statistics, 29(3):839–875, 2001.

[4] J.-M. Bardet and O. Wintenberger. Asymptotic normality of the quasi-maximum likelihood estimator for multidimensional causal processes. The Annals of Statistics, 37(5B):2730–2759, 2009.

[5] L. Birgé and P. Massart. A generalized cp criterion for gaussian model selection. technical report, universités de paris 6 et paris 7, 2010. prépublication 647,39 pages. 2001.

[6] L. Birgé and P. Massart. Minimal penalties for gaussian model selection. Probability theory and related fields, 138(1-2):33–73, 2007.

[7] F. Comte, J. Dedecker, and M.-L. Taupin. Adaptive density deconvolution with dependent inputs. Mathematical methods of Statistics, 17(2):87, 2008.

[8] F. Comte and V. Genon-Catalot. Regression function estimation as a partly inverse problem. Annals of the Institute of Statistical Mathematics, 72(4):1023–1054, 2020.

[9] J. Dedecker and C. Prieur. New dependence coefficients. examples and applications to statis-tics. Probability Theory and Related Fields, 132(2):203–236, 2005.

[10] P. Doukhan and O. Wintenberger. Weakly dependent chains with infinite memory. Stochastic Processes and their Applications, 118(11):1997–2013, 2008.

[11] A. Goldenshluger and A. Zeevi. Nonasymptotic bounds for autoregressive time series model-ing. Annals of statistics, pages 417–444, 2001.

[12] D. Hsu, S. M. Kakade, and T. Zhang. An analysis of random design linear regression. arXiv preprint arXiv:1106.2363, 2011.

[13] C.-K. Ing and C.-Z. Wei. On same-realization prediction in an infinite-order autoregressive process. Journal of Multivariate Analysis, 85(1):130–155, 2003.

[14] C.-K. Ing and C.-Z. Wei. Order selection for same-realization predictions in autoregressive processes. The Annals of Statistics, 33(5):2423–2474, 2005.

[15] C.-K. Ing, C.-Z. Wei, et al. Order selection for same-realization predictions in autoregressive processes. The Annals of Statistics, 33(5):2423–2474, 2005.

[16] M. Lerasle et al. Optimal model selection for density estimation of stationary data under various mixing conditions. The Annals of Statistics, 39(4):1852–1877, 2011.

[17] R. Shibata. Asymptotically efficient selection of the order of the model for estimating param-eters of a linear process. The Annals of Statistics, pages 147–164, 1980.

[18] R. Shibata. Consistency of model selection and parameter estimation. Journal of Applied Probability, pages 127–141, 1986.

[19] S. A. van de Geer. On hoeffding’s inequality for dependent random variables. In Empirical process techniques for dependent data, pages 161–169. Springer, 2002.