DATA DRIVEN MODEL SELECTION FOR SAME-REALIZATION PREDICTIONS IN AUTOREGRESSIVE PROCESSES

HAL Id: hal-03169343

https://hal.archives-ouvertes.fr/hal-03169343

Preprint submitted on 15 Mar 2021



Kare Kamila

To cite this version:

Kare Kamila. DATA DRIVEN MODEL SELECTION FOR SAME-REALIZATION PREDICTIONS IN AUTOREGRESSIVE PROCESSES. 2021. ⟨hal-03169343⟩


BY Kare KAMILA∗

March 15, 2021

SAMM, Université Paris 1 Panthéon-Sorbonne, FRANCE

Abstract

This paper deals with the one-step-ahead prediction of observations drawn from an infinite-order autoregressive AR(∞) process. The aim is to design fully data-driven penalties ensuring that the selected model satisfies the efficiency property in a non-asymptotic framework. We present an oracle inequality with a leading constant equal to one. Moreover, we show that the excess risk of the selected estimator enjoys the best bias-variance trade-off over the considered collection. To achieve these results, we overcome the difficulties due to dependence by following a classical approach, which consists in restricting the analysis to an event on which the empirical covariance matrix is equivalent to the theoretical one. We show that this event holds with probability larger than $1 - c_0/n^3$ with $c_0 > 0$. The proposed data-driven criteria are based on the minimization of a penalized criterion akin to Mallows's $C_p$. Monte Carlo experiments are performed to illustrate the obtained results.

Key words: Model selection, oracle inequality, efficiency, autoregressive process, data driven.

1 Introduction

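Consider observations $(X_1, X_2, \dots, X_n)$ arising from a trajectory of the process

\[ X_t = f^*\big((X_{t-i})_{i \in \mathbb{N}^*}\big) + \sigma\, \xi_t \quad \text{for any } t \in \mathbb{Z}, \tag{1.1} \]

where $(\xi_t)_{t \in \mathbb{Z}}$ is a sequence of zero-mean independent and identically distributed random variables (i.i.d.r.v.) satisfying $\mathbb{E}(|\xi_0|^4) < \infty$, $f^* : \mathbb{R}^{\mathbb{N}} \to \mathbb{R}$ is a measurable function and $\sigma > 0$ is an unknown constant.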

The problem is to estimate the function $f^*$ using these observations. The process (1.1) is a particular case of the general class of affine causal processes studied in [10] and [4]. The study of this type of process usually requires a classical regularity condition on the function $f^*$, which is not restrictive and remains valid in various time series models. This condition can be stated as follows:

\[ \sum_{k=1}^{\infty} \sup_{x \in \mathbb{R}^{\infty}} \Big| \frac{\partial}{\partial x_k} f^*(x) \Big| < 1, \tag{1.2} \]

This author has received funding from the European Union’s Horizon 2020 research and innovation


provided that $f^*$ admits partial derivatives on $\mathbb{R}^{\mathbb{N}}$. Under (1.2) and if the noise $\xi_0$ admits moments of order $r$, [10] showed that there exists a stationary, mixing and ergodic solution to (1.1) admitting moments of order $r$.

Moreover, [4] studied the consistency and the asymptotic normality of the QMLE of $\theta^* = (\theta_i^*)_{i \in \mathbb{N}}$ in the case $f^* = f_{\theta^*}$.

In this paper, we focus only on processes whose regression function $f_{\theta^*}$ is linear with respect to the past and depends on some parameter $\theta^* \in \mathbb{R}^{\mathbb{N}}$; that is,

\[ f^*(X_{t-1}, X_{t-2}, \dots) = f_{\theta^*}(X_{t-1}, X_{t-2}, \dots) = \sum_{i=1}^{\infty} \theta_i^* X_{t-i}. \tag{1.3} \]

For such processes, condition (1.2) becomes

\[ \textbf{A1}: \quad \sum_{i=1}^{\infty} |\theta_i^*| < 1. \]

Even if this condition slightly reduces the set of parameters, the class of AR(∞) processes satisfying condition A1 is rich and of practical importance because it contains almost all invertible causal ARMA(p, q) processes and it is very useful for prediction given the past. Moreover, contrary to the autocovariance of ARMA(p, q) processes, which decays exponentially fast, AR(∞) processes are able to model more complex behaviour such as a slower decay of the covariance structure.

Henceforth, let the observations $(X_1, X_2, \dots, X_n)$ be a trajectory of the solution $X := (X_t)_{t \in \mathbb{Z}}$ of (1.1) satisfying A1. The goal of this paper is to predict the next value $X_{n+1}$. In fact, if $\theta^*$ were known, a simple prediction of $X_{n+1}$ would be $f_{\theta^*}(X_n, X_{n-1}, \dots)$, setting $X_t = 0$ for all $t < 0$. However, $\theta^*$ is generally unknown and it is impossible to provide a direct estimator since it has infinitely many coordinates. It is classical to identify a 'good' finite-dimensional model based on the data, which can be done by sieve estimation, where only a finite number of coefficients $\{\theta_i^*\}_{i=1}^{K}$ is estimated and $K$ grows as the sample size increases. A usual approach to this is model selection, and the goal is to provide a model whose prediction error is as small as that of the oracle.

This question has already been addressed in the literature. [17] was the first to tackle this issue. He proved that the Akaike criterion is asymptotically efficient, in the sense that the selected model asymptotically achieves the smallest one-step mean squared error of prediction when it is fitted to predict an independent realization of the same process. Following Shibata's asymptotic setting, [13] and [15] extended this result to same-realization predictions. Indeed, they argued that Shibata's idea of fitting the model to another independent realization is unrealistic, since in practice only one realization is at hand. The common feature of these works is their asymptotic framework.

Meanwhile, several authors have studied this question in the non-asymptotic regime. In the nonparametric framework, [11] studied how well a Gaussian process admitting an AR(∞) representation can be approximated by a finite-order AR model.

[2] and [3] analyzed a similar but slightly different question, since their observations arise from an autoregressive model of order k. They proved an oracle inequality under several conditions, for instance a compactly supported basis for the regression function. Moreover, they assume that the process is β-mixing, which is a common assumption but quite hard to verify in practice. For linear processes, τ-mixing is more suitable since its coefficients can be easily computed (see [7]) and bounded by a function of the model parameter $\theta^*$ (see [10]). In this work, we do not assume any mixing property of the process, since condition A1 implies the τ-mixing property (see [10]), and we will see that the decay rate of the τ-mixing coefficients is bounded by the decay rate of the coefficients $\theta^* = (\theta_i^*)_{i \in \mathbb{N}}$.

Based on the above and following a model selection approach, our purpose in this work is to design adaptive penalties in such a way that the selected model mimics the oracle when the observations arise from an AR(∞) process under mild conditions, including the existence of moments of all orders for the noise and a decay rate of the coefficients $(\theta_i^*)_{i \in \mathbb{N}}$ such that, thanks to a result of [10], the generating process has nice properties such as stationarity and τ-mixing. The main contributions of this paper include:

(i) Using least squares contrast, we have shown an oracle inequality with a leading constant equal to one.

(ii) We have also proved that the excess risk of the selected estimator enjoys the best bias-variance trade-off over the considered collection.

The paper is organized as follows. The model selection approach along with preliminary results are described in Section 2. The main results are presented in Section 3. Finally, numerical results are presented in Section 4 and Section 5 contains the proofs.

2 Model Selection Approach and Preliminary Results

2.1 Model Selection Approach

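Let a model $S_m$ (shortly $m$) for $f^*$ be the set of linear functions $f$ from $\mathbb{R}^{D_m}$ to $\mathbb{R}$ such that

\[ f(x_1, x_2, \dots, x_{D_m}) = \sum_{i=1}^{D_m} \theta_i x_i, \tag{2.1} \]

with $\theta = (\theta_1, \dots, \theta_{D_m}) \in \Theta_m$ and $\Theta_m$ a compact subset of $\mathbb{R}^{D_m}$. $S_m$ can be viewed as an AR($D_m$) model.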

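Given a predictor $f_\theta \in S_m$, its quality is measured by the quadratic loss

\[ R(\theta) = \mathbb{E}\big[(X_{n+1} - f_\theta^{n+1})^2\big], \]

where $f_\theta^{n} = f_\theta(X_{n-1}, \dots, X_{n-D_m})$ and, more generally, $f_\theta^{t} = f_\theta(X_{t-1}, \dots, X_{t-D_m})$. The Bayes predictor, which minimizes $R(\theta)$ over the set of all predictors, is clearly the inaccessible function $f_{\theta^*}$. Let us then introduce the excess loss of the predictor $f_\theta$ (with respect to $f_{\theta^*}$):

\[ \ell(\theta, \theta^*) := R(\theta) - R(\theta^*) = \mathbb{E}\big[(f_{\theta^*}^{n+1} - f_\theta^{n+1})^2\big] \ge 0. \]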

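Given a model $m$, we define its best predictor $f_{\theta_m^*}$ by

\[ \theta_m^* = \operatorname*{argmin}_{\theta \in \Theta_m} R(\theta). \]

Its empirical version, minimizing the least-squares contrast, is

\[ \widehat{\theta}_m = \operatorname*{argmin}_{\theta \in \Theta_m} \gamma_n(\theta) \quad \text{where} \quad \gamma_n(\theta) = \frac{1}{n} \sum_{t=1}^{n} (X_t - f_\theta^t)^2. \tag{2.2} \]

Note that (thanks to the stationarity of the process), for an estimate $\widehat{\theta} = \widehat{\theta}(X_1, \dots, X_n)$, the excess loss can be rewritten as

\[ \ell(\widehat{\theta}, \theta^*) = \mathbb{E}\Big[ \big\| F_{\widehat{\theta}} - F_{\theta^*} \big\|_n^2 \Big] \tag{2.3} \]

where $F_\theta := (f_\theta^1, \dots, f_\theta^n)^\top$ and $\|x\|_n^2 = \frac{1}{n} \sum_{t=1}^{n} x_t^2$.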


Given that all the models that can be considered must have finite dimension for fixed $n$, which makes every $S_m$ a wrong model, it is classical to let the dimension of the competing models grow with the number of observations. This helps reduce the excess loss and provides a better approximation of $f_{\theta^*}$.

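Let $\mathcal{M}_n$ be a countable collection of hierarchical models $S_m$ and let $K_n$ be the dimension of the largest model in $\mathcal{M}_n$, satisfying $|\mathcal{M}_n| \le K_n < n$. We follow the classical approach of model selection, which consists in minimizing the penalized LSE. Let $\mathrm{pen} : \mathcal{M}_n \to \mathbb{R}_+$ be a penalty function, possibly data-dependent, and define

\[ \widehat{m} = \operatorname*{argmin}_{m \in \mathcal{M}_n} \{ C(m) \} \quad \text{with} \quad C(m) := \gamma_n(\widehat{\theta}_m) + \mathrm{pen}(S_m). \tag{2.4} \]

Thus, the best possible choice over $\mathcal{M}_n$ is $m^*$, the so-called oracle, defined as

\[ m^* \in \operatorname*{arg\,inf}_{m \in \mathcal{M}_n} \ell(\widehat{\theta}_m, \theta^*). \tag{2.5} \]

The oracle $m^*$ is unachievable since it depends on $\theta^*$ and on the distribution $P_{(X_1, \dots, X_n)}$, which are unknown. However, we hope to select a model $\widehat{m}$ so that $\ell(\widehat{\theta}_{\widehat{m}}, \theta^*)$ is as close as possible to $\ell(\widehat{\theta}_{m^*}, \theta^*)$.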

The goal of this paper is twofold. First, we want to propose a data-driven penalty in order to obtain an oracle inequality

\[ \ell(\widehat{\theta}_{\widehat{m}}, \theta^*) \le C_1 \inf_{m \in \mathcal{M}_n} \ell(\widehat{\theta}_m, \theta^*) + \frac{C_2}{n} \tag{2.6} \]

with the leading constant $C_1$ close to one and $C_2 > 0$.

Since for every $m \in \mathcal{M}_n$ the following decomposition holds

\[ \ell(\widehat{\theta}_m, \theta^*) = \ell(\theta_m^*, \theta^*) + \ell(\widehat{\theta}_m, \theta_m^*) =: \mathrm{Bias}(m) + \mathrm{Variance}(m), \tag{2.7} \]

the inequality (2.6) implies that the excess risk of the selected estimator $\widehat{\theta}_{\widehat{m}}$ realizes the best bias-variance trade-off, which would make our penalty an ideal choice in terms of excess risk. That is to say, the selected model $\widehat{m}$ will be large enough to reduce its bias, but not so large that its variance becomes high.

Moreover, in decomposition (2.7), the bias term cannot in general be simplified further, whereas the variance is approximately proportional to the size of the model. Our second goal is to simplify the oracle inequality (2.6) in order to obtain the following common inequality

\[ \ell(\widehat{\theta}_{\widehat{m}}, \theta^*) \le C_1' \inf_{m \in \mathcal{M}_n} \Big\{ \ell(\theta_m^*, \theta^*) + \mathrm{pen}(S_m) \Big\} + \frac{C_2'}{n} \tag{2.8} \]

with the leading constant $C_1' = 1 + \delta$ with $\delta > 0$ (and close to 0) and $C_2' > 0$.

2.2 Notations

We will use the following norms:

• $\|\cdot\|$ denotes the usual Euclidean norm on $\mathbb{R}^\nu$, with $\nu \ge 1$;

• $\|A\|_{op}$ is the operator norm of $A$, defined as the square root of the largest eigenvalue of $A^\top A$. If $A$ is symmetric, then $\|A\|_{op}$ is the largest (in absolute value) eigenvalue of $A$;

• if $X$ is an $\mathbb{R}^\nu$-valued random variable and $r \ge 1$, we set $\|X\|_r = \big(\mathbb{E}\|X\|^r\big)^{1/r}$.

2.3 Preliminary Results

As we are in a dependent setting, we are going to leverage the τ-mixing property of $(X_t)_{t \in \mathbb{Z}}$ in order to obtain some exponential inequalities. The τ-mixing coefficients are a measure of the dependence of the process and were introduced by [9]. This will help us build 'independent' random vectors and apply classical exponential inequalities. Let us then introduce some notation.

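Let $(\Omega, \mathcal{C}, \mathbb{P})$ be a probability space, $\mathcal{M}$ a σ-subalgebra of $\mathcal{C}$ and $Z$ a random variable with values in a Banach space $(E, \|\cdot\|_E)$. Assume that $\mathbb{E}|Z| < \infty$ and define

\[ \tau^{(p)}(\mathcal{M}, Z) = \Big\| \sup_{f \in \Lambda(E)} \Big\{ \Big| \int f(x)\, P_{Z|\mathcal{M}}(dx) - \int f(x)\, P_Z(dx) \Big| \Big\} \Big\|_p \]

where $\Lambda(E)$ is the set of 1-Lipschitz functions, i.e. the functions $f$ from $(E, \|\cdot\|_E)$ to $\mathbb{R}$ such that $|f(x) - f(y)| \le \|x - y\|_E$.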

Using the definition of τ, we measure the dependence of the strictly stationary sequence $(Z_t)_{t \in \mathbb{Z}}$ through the coefficients defined as follows. Introduce on $\mathbb{R}^k$ the norm $\|x - y\|_{\mathbb{R}^k} = |x_1 - y_1| + \dots + |x_k - y_k|$, set $\mathcal{M}_i = \sigma(Z_t, t \le i)$ and, if $\mathbb{E}(|Z_1|) < \infty$, let for any $s \ge 0$

\[ \tau_{Z,\infty}^{(p)}(s) = \sup_{l > 0} \Big\{ \max_{1 \le k \le l} \frac{1}{k} \sup \Big\{ \tau^{(p)}\big(\mathcal{M}_i, (Z_{i_1}, \dots, Z_{i_k})\big),\ i + s \le i_1 < \dots < i_k \Big\} \Big\}. \]

Finally, the time series $(Z_t)_{t \in \mathbb{Z}}$ is $\tau_{Z,\infty}^{(p)}$-weakly dependent when its coefficients $\tau_{Z,\infty}^{(p)}(s)$ tend to 0 as $s$ tends to infinity.

The next proposition, which is a consequence of Theorem 3.1 in [10], gives a link between the τ-mixing coefficients of the process $(X_t)_{t \in \mathbb{Z}}$ and the coefficients $\theta_i^*$ of the model (1.3).

Proposition 1. Assume A1 holds. If $|\theta_t^*| = O(t^{-\gamma})$ with $\gamma > 1$, there exists a τ-weakly dependent stationary solution of (1.1) and a constant $C_\tau > 0$ such that for $r > 0$

\[ \tau_{X,\infty}^{(2)}(r) \le C_\tau \Big( \frac{\log r}{r} \Big)^{\gamma - 1}. \tag{2.9} \]

Proof. With $G(x, \xi_0) = \sigma\, \xi_0 + f_{\theta^*}(x)$ for any $x \in \mathbb{R}^\infty$, it holds

\[ \big\| G(x, \xi_0) - G(y, \xi_0) \big\|_2 = \big| f_{\theta^*}(x) - f_{\theta^*}(y) \big| \le \sum_{i=1}^{\infty} |\theta_i^*|\, |x_i - y_i|. \]

Therefore (2.9) is a straightforward application of Theorem 3.1 in [10]. □

As we are going to need independence between blocks of random variables, let us denote, for $t = 1, \dots, n$, the random vector $\vec{X}_t := (X_{t-1}, \dots, X_{t-K_n})^\top$. One can see that the process $(\vec{X}_t)_{t \in \mathbb{Z}}$ is also mixing, with $\tau_{\vec{X},\infty}^{(1)}$ upper bounded by $K_n\, \tau_{X,\infty}^{(1)}$ (see Lemma 1).

Now, we construct random variables approximating the $\vec{X}_t$'s and enjoying an independence-by-block property. Let $s_n$, $q_n$ be two integers such that $n = 2 s_n q_n$. We are going to build $2 s_n$ blocks of length $q_n$ so that the even-index blocks are independent, and so are the odd-index blocks.

For $k = 0, \dots, s_n - 1$, let us denote $A_k = \big(\vec{X}_{2kq_n+1}, \dots, \vec{X}_{(2k+1)q_n}\big)$ and $B_k = \big(\vec{X}_{(2k+1)q_n+1}, \dots, \vec{X}_{(2k+2)q_n}\big)$.
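To make the indexing of the blocks concrete, the following minimal sketch lists the time indices covered by each $A_k$ and $B_k$. The function name `alternating_blocks` is hypothetical and only illustrates the definition above; it is not part of the original paper.

```python
def alternating_blocks(s_n, q_n):
    """Time indices of the blocks A_k and B_k, k = 0..s_n-1, for n = 2 * s_n * q_n.

    A_k covers t = 2*k*q_n + 1, ..., (2*k + 1)*q_n
    B_k covers t = (2*k + 1)*q_n + 1, ..., (2*k + 2)*q_n
    """
    blocks_a, blocks_b = [], []
    for k in range(s_n):
        start_a = 2 * k * q_n + 1
        start_b = (2 * k + 1) * q_n + 1
        blocks_a.append(list(range(start_a, start_a + q_n)))
        blocks_b.append(list(range(start_b, start_b + q_n)))
    return blocks_a, blocks_b


# Illustrative values: n = 12 observations split into 2 * 2 blocks of length 3.
blocks_a, blocks_b = alternating_blocks(s_n=2, q_n=3)
print(blocks_a)  # [[1, 2, 3], [7, 8, 9]]
print(blocks_b)  # [[4, 5, 6], [10, 11, 12]]
```

Two consecutive blocks of the same family are separated by a gap of $q_n$ time steps, which is what lets the mixing property be used to replace them by independent copies.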


Proposition 2. Let $(X_t)_{t \in \mathbb{Z}}$ be the stationary mixing process obtained in Proposition 1. Let also $s_n$, $q_n$, $A_k$, $B_k$ be defined as above for $k = 0, \dots, s_n - 1$. There exist random vectors $A_k^* = \big(\vec{X}^*_{2kq_n+1}, \dots, \vec{X}^*_{(2k+1)q_n}\big)$ and $B_k^* = \big(\vec{X}^*_{(2k+1)q_n+1}, \dots, \vec{X}^*_{(2k+2)q_n}\big)$ such that:

1. For $k = 0, \dots, s_n - 1$, $A_k^*$ has the same law as $A_k$, and $B_k^*$ has the same law as $B_k$.

2. The random vectors $(A_k^*)_{0 \le k \le s_n - 1}$ are independent and so are the vectors $(B_k^*)_{0 \le k \le s_n - 1}$.

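To prove the oracle inequality, we will assume some constraints on the observations.

A2: $X_t$ is sub-Gaussian with variance proxy $\sigma_0^2 > 0$, i.e.

\[ \mathbb{E}\big[e^{\lambda X_t}\big] \le e^{\lambda^2 \sigma_0^2 / 2} \quad \text{for any } \lambda > 0. \]

Condition A2 implies that the vector $Z_t^m = (X_{t-1}, \dots, X_{t-D_m})^\top$, which will be prominent in the proofs, is sub-Gaussian with variance proxy $D_m \sigma_0^2$. Indeed, for any $v \in \mathbb{R}^{D_m}$ such that $\|v\| = 1$,

\[ \mathbb{E}\Big[\exp\big(\lambda\, v^\top Z_t^m\big)\Big] = \mathbb{E}\Big[\prod_{i=1}^{D_m} \exp\big(\lambda\, v_i X_{t-i}\big)\Big] \le \prod_{i=1}^{D_m} \Big\| \exp\big(\lambda\, v_i X_{t-i}\big) \Big\|_{D_m} \le \prod_{i=1}^{D_m} \exp\big(\lambda^2 D_m \sigma_0^2 v_i^2 / 2\big) = e^{\frac{\lambda^2}{2} D_m \sigma_0^2}, \]

where the first inequality follows from Hölder's inequality and the second one from A2 applied with $\lambda D_m v_i$ in place of $\lambda$.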

The following assumption provides a sufficient condition to ensure the invertibility of both $\widehat{\Sigma}_m$ and $\Sigma_m$.

A3: For any $f_\theta \in S_m$, $\langle \alpha, \partial_\theta f_\theta \rangle = 0$ a.s. $\Longrightarrow \alpha = 0$.

This condition means that the columns of the matrix $M_m$ are linearly independent. We will also need to bound the eigenvalues of the matrices $\Sigma_m$ for any $m \in \mathcal{M}_n$. To do that, we will leverage the relation between the spectral density of the process and these eigenvalues. Let us denote by $r$ the covariance function $r(h) := \mathbb{E}[X_t X_{t+h}]$ for any integer $h$. Let us also introduce the function $g : [-\pi, \pi[ \to \mathbb{C}$ such that for any $\lambda$,

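\[ g(\lambda) = \frac{1}{2\pi} \sum_{h \in \mathbb{Z}} r(h)\, e^{-ih\lambda}, \]

which exists under A1 with $|\theta_t^*| = O(t^{-\gamma})$ where $\gamma \ge 1$. Therefore, $r$ is the inverse transform of $g$ and $r(h) = \int_{-\pi}^{\pi} e^{ih\lambda} g(\lambda)\, d\lambda$ for any $h \in \mathbb{Z}$. We will assume that

A4: There exists a constant $a > 0$ such that $\inf_{-\pi \le \lambda < \pi} g(\lambda) \ge a$.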

This is a very weak assumption, and we are going to give the value of $a$ for AR(p) processes with $p \in \mathbb{N}^*$. Denoting $\theta(z) = 1 - \sum_{j=1}^{p} \theta_j^* z^j$, it is well known for such processes that

\[ g(\lambda) = \frac{\sigma^2}{2\pi\, \big|\theta(e^{-i\lambda})\big|^2}. \]

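For instance, for $p$ equal to one and $X_t = \theta_1^* X_{t-1} + \sigma\, \xi_t$ with $|\theta_1^*| < 1$, it follows that

\[ g(\lambda) = \frac{\sigma^2}{2\pi\, \big|1 - \theta_1^* e^{-i\lambda}\big|^2} = \frac{\sigma^2}{2\pi \big(1 - 2\theta_1^* \cos(\lambda) + (\theta_1^*)^2\big)}, \]

and then it is simple to see that

\[ a := \frac{\sigma^2}{2\pi (1 + |\theta_1^*|)^2} \le g(\lambda) \le \frac{\sigma^2}{2\pi (1 - |\theta_1^*|)^2}. \]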
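The AR(1) bounds above can be checked numerically on a grid of frequencies. The sketch below is only an illustration: the parameter values `theta1 = 0.5` and `sigma = 1.0` and the function name `ar1_spectral_density` are assumptions chosen for the example, not values from the paper.

```python
import numpy as np


def ar1_spectral_density(lam, theta1, sigma=1.0):
    """Spectral density of X_t = theta1 * X_{t-1} + sigma * xi_t (|theta1| < 1)."""
    return sigma**2 / (2 * np.pi * (1 - 2 * theta1 * np.cos(lam) + theta1**2))


theta1, sigma = 0.5, 1.0                      # illustrative values only
lam = np.linspace(-np.pi, np.pi, 2001)        # frequency grid on [-pi, pi]
g = ar1_spectral_density(lam, theta1, sigma)

# Bounds from the text: a <= g(lambda) <= sigma^2 / (2 pi (1 - |theta1|)^2)
lower = sigma**2 / (2 * np.pi * (1 + abs(theta1)) ** 2)
upper = sigma**2 / (2 * np.pi * (1 - abs(theta1)) ** 2)

tol = 1e-12  # guard against floating-point rounding at the endpoints
print(bool(g.min() >= lower - tol), bool(g.max() <= upper + tol))
```

The lower bound is attained at $\lambda = \pm\pi$ and the upper bound at $\lambda = 0$ when $\theta_1^* > 0$, so the constant $a$ cannot be improved in this case.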

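For $p \ge 1$ and $X_t = \sum_{j=1}^{p} \theta_j^* X_{t-j} + \sigma\, \xi_t$ satisfying $\sum_{j=1}^{p} \theta_j^* < 1$ and $\theta_j^* \ge 0$, we have

\[ g(\lambda) = \frac{\sigma^2}{2\pi\, \big|1 - \sum_{j=1}^{p} \theta_j^* e^{-ij\lambda}\big|^2} = \sigma^2 (2\pi)^{-1} \Big( 1 + \sum_{j=1}^{p} (\theta_j^*)^2 - 2 \sum_{j=1}^{p} \theta_j^* \cos(j\lambda) + 2 \sum_{k=1}^{p-1} \theta_k^* \Big\{ \sum_{j=k+1}^{p} \theta_j^* \cos\big((j-k)\lambda\big) \Big\} \Big)^{-1}. \]

Thus, using $-1 \le \cos(x) \le 1$ for any real $x$, it follows for every $\lambda$ that

\[ \sigma^2 (2\pi)^{-1} \Big( 1 + \sum_{j=1}^{p} (\theta_j^*)^2 + 2 \sum_{j=1}^{p} \theta_j^* + 2 \sum_{k=1}^{p-1} \theta_k^* \Big\{ \sum_{j=k+1}^{p} \theta_j^* \Big\} \Big)^{-1} \le g(\lambda) \le \sigma^2 (2\pi)^{-1} \Big( 1 + \sum_{j=1}^{p} (\theta_j^*)^2 - 2 \sum_{j=1}^{p} \theta_j^* - 2 \sum_{k=1}^{p-1} \theta_k^* \Big\{ \sum_{j=k+1}^{p} \theta_j^* \Big\} \Big)^{-1}. \]

For such an AR(p) process, one can take the constant $a$ in A4 to be equal to

\[ a = \sigma^2 (2\pi)^{-1} \Big( 1 + \sum_{j=1}^{p} (\theta_j^*)^2 + 2 \sum_{j=1}^{p} \theta_j^* + 2 \sum_{k=1}^{p-1} \theta_k^* \Big\{ \sum_{j=k+1}^{p} \theta_j^* \Big\} \Big)^{-1}. \]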

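We can now state an important intermediate result which provides uniform bounds on the spectral norms of the matrices $\Sigma_m$ and of their inverses.

Proposition 3. Under A1 with $|\theta_t^*| = O(t^{-\gamma})$ where $\gamma \ge 2$, we have for any $m \in \mathcal{M}_n$

\[ \big\| \Sigma_m \big\|_{op} \le \pi^{-1} \sum_{i=0}^{\infty} \big| \mathbb{E}[X_0 X_i] \big| < \infty. \tag{2.10} \]

Moreover, under A3-A4, it holds

\[ \big\| \Sigma_m^{-1} \big\|_{op} \le 1/a. \tag{2.11} \]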

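Now, for technical convenience, we choose the integer $s_n$ and the cardinality of $\mathcal{M}_n$ such that

\[ \textbf{A5}: \quad \frac{a\, s_n}{2} \min\Big\{ \Big( \frac{r \wedge 1}{2^6 \sigma_0^2 K_n} \Big)^2, \frac{r \wedge 1}{2^7 \sigma_0^2 K_n} \Big\} \ge 3 \log n, \tag{2.12} \]

where $r := a / \mathbb{E}[X_0^2]$ and $b \wedge c$ is the minimum of $b$ and $c$. This means that $s_n$ is of the form $s_n = C \log n$ where

\[ C \ge 6\, a^{-1} \max\Big\{ \Big( \frac{2^6 \sigma_0^2 K_n}{r \wedge 1} \Big)^2, \frac{2^7 \sigma_0^2 K_n}{r \wedge 1} \Big\}. \]

For instance, with $K_n$ of order $\log n$ and $a \approx \mathbb{E}[X_0^2] \approx \sigma_0^2$, we can choose $s_n = 6\, (2)^{12} (\log n)^3$. Henceforth, in the sequel $K_n$ will satisfy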

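\[ \textbf{A6}: \quad K_n = C_K \log n \quad \text{for some constant } C_K > 0. \tag{2.13} \]

Let us introduce some additional important notation. From the definition of the LSE (2.2), it follows that

\[ \widehat{\theta}_m = \widehat{\Sigma}_m^{-1} M_m^\top X \tag{2.14} \]

where the matrix $M_m = \big(X_{i-1}, \dots, X_{i-D_m}\big)_{i=1}^{n}$, $\widehat{\Sigma}_m = M_m^\top M_m$ and $X = (X_1, \dots, X_n)^\top$, provided that $\widehat{\Sigma}_m$ is invertible almost everywhere (see Lemma 6). Let us denote the expected value of the random matrix $\widehat{\Sigma}_m$ by $\Sigma_m = \mathbb{E}\big[\widehat{\Sigma}_m\big]$. Rewriting (1.1) in vector form (with $\xi = (\xi_1, \dots, \xi_n)^\top$), i.e.

\[ X = F_{\theta^*} + \sigma\, \xi = (F_{\theta^*} - F_{\theta_m^*}) + F_{\theta_m^*} + \sigma\, \xi, \]

it follows that

\[ M_m^\top X = M_m^\top (F_{\theta^*} - F_{\theta_m^*}) + M_m^\top M_m \theta_m^* + \sigma\, M_m^\top \xi. \]

Thus, using (2.14), it holds

\[ \widehat{\theta}_m - \theta_m^* = \widehat{\Sigma}_m^{-1} M_m^\top (F_{\theta^*} - F_{\theta_m^*}) + \sigma\, \widehat{\Sigma}_m^{-1} M_m^\top \xi, \tag{2.15} \]

which implies

\[ M_m (\widehat{\theta}_m - \theta_m^*) = M_m \widehat{\Sigma}_m^{-1} M_m^\top (F_{\theta^*} - F_{\theta_m^*}) + \sigma\, M_m \widehat{\Sigma}_m^{-1} M_m^\top \xi = P_{M_m}\big(F_{\theta^*} - F_{\theta_m^*}\big) + \sigma\, P_{M_m}(\xi) \]

where $P_{M_m} = M_m \big(M_m^\top M_m\big)^{-1} M_m^\top$ is the projection matrix onto the subspace spanned by the columns of $M_m$.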
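The least-squares estimator (2.14) and the projection matrix $P_{M_m}$ can be sketched numerically as follows. This is a minimal illustration under stated assumptions: the helper names `design_matrix` and `least_squares_ar` are hypothetical, unobserved values $X_s$ with $s \le 0$ are set to zero (a convention assumed here for the example), and the AR(1) sample used for the demonstration is simulated with arbitrary illustrative parameters.

```python
import numpy as np


def design_matrix(x, d):
    """Matrix M_m whose t-th row is (X_{t-1}, ..., X_{t-d}), t = 1..n.

    Assumption for this sketch: X_s = 0 for s <= 0 (zero padding).
    """
    n = len(x)
    padded = np.concatenate([np.zeros(d), x])  # padded[d + s - 1] = X_s for s >= 1
    return np.column_stack([padded[d - i: d - i + n] for i in range(1, d + 1)])


def least_squares_ar(x, d):
    """Return theta_hat (least-squares solution), M_m and the projector P_{M_m}."""
    M = design_matrix(x, d)
    theta_hat, *_ = np.linalg.lstsq(M, x, rcond=None)
    P = M @ np.linalg.pinv(M)  # orthogonal projector onto the column span of M
    return theta_hat, M, P


# Illustrative data: an AR(1) path with coefficient 0.5 (arbitrary choice).
rng = np.random.default_rng(0)
n = 300
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.5 * x[t - 1] + rng.standard_normal()

theta_hat, M, P = least_squares_ar(x, d=3)
print(theta_hat)
```

By construction the fitted values $M_m \widehat{\theta}_m$ coincide with $P_{M_m} X$, which is the identity used repeatedly in the proofs.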

The main difficulty in this work lies in the handling of the matrix $\widehat{\Sigma}_m^{-1}$. We use a classical approach to overcome this issue: it consists in defining a set on which $\widehat{\Sigma}_m^{-1}$ can be approximated by $\Sigma_m^{-1}$ ([3], [12], [19], [8] among others), where $\Sigma_m$ is invertible (see Lemma 6). For $m \in \mathcal{M}_n$, let us define the set

\[ \Gamma_{m,r} = \Big\{ \big\| \widehat{\Sigma}_m^{-1} - \Sigma_m^{-1} \big\|_{op} \le r\, \big\| \Sigma_m^{-1} \big\|_{op} \Big\}. \]

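We will see that in our framework $\Gamma_{m,r}$ holds with high probability. Before proving that, let us notice that $\widehat{\Sigma}_m$ can be rewritten as

\[ \widehat{\Sigma}_m = \frac{1}{n} \sum_{t=1}^{n} \widehat{\Sigma}_{m,t} \quad \text{with} \quad \widehat{\Sigma}_{m,t} = Z_t^m (Z_t^m)^\top \quad \text{where} \quad Z_t^m = (X_{t-1}, \dots, X_{t-D_m})^\top. \tag{2.16} \]

Since we have fixed $r$ equal to $a / \mathbb{E}[X_0^2]$, the set $\Gamma_{m,r}$ will be denoted by $\Gamma_m$ in the sequel.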

The following result shows that the event $\Gamma_m$ holds with high probability.

Proposition 4. Under assumptions A1-A6 and if $|\theta_t^*| = O(t^{-\gamma})$ with $\gamma \ge 8$, it holds

\[ \mathbb{P}(\Gamma_m^c) \le \frac{c_0}{n^3}, \tag{2.17} \]

with $c_0 = 1 + 8\, A\, C_\tau C_K \|X_0\|_2 \big(a\, (r \wedge 1)\big)^{-1}$, where $A$ satisfies (5.1).

3 Oracle Inequality for Same Realization Prediction

We are now able to state the main result of the paper.

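Theorem 3.1. Consider observations $(X_1, \dots, X_n)$ arising from a solution of the process (1.1) satisfying A1 with $|\theta_t^*| = O(t^{-\gamma})$ where $\gamma \ge 8$, and also satisfying A2 and A4. Let $\mathcal{M}_n$ be some countable family of AR models satisfying A3 and A5-A6. For $x \ge 4$, let $\mathrm{pen} : \mathcal{M}_n \to \mathbb{R}_+$ be a penalty function such that

\[ \mathrm{pen}(S_m) = x\, \sigma^2\, \frac{D_m}{n}. \tag{3.1} \]

Then, with probability at least $1 - c_0 n^{-2}$, the LSE $\widehat{\theta}_{\widehat{m}}$ with $\widehat{m}$ given in (2.4) satisfies

\[ \ell(\widehat{\theta}_{\widehat{m}}, \theta^*) \le \inf_{m \in \mathcal{M}_n} \Big\{ \ell(\widehat{\theta}_m, \theta^*) \Big\} + \frac{2\, x\, \sigma^2}{n}. \tag{3.2} \]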

Let us give some remarks about this result:

• The oracle inequality (3.2) is optimal in the sense that the leading constant is exactly one;

• This result is new in the non-asymptotic framework for AR(∞) processes under mild conditions. Indeed, [14] obtained a counterpart of our result when $n \to \infty$ under several assumptions;

• The designed penalty (3.1) generalizes Mallows' $C_p$.

As a consequence of Theorem 3.1, we obtain for free the asymptotic efficiency obtained by [18] and [15].

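Corollary 1. Under the assumptions of Theorem 3.1, it holds

\[ \frac{\ell(\widehat{\theta}_{\widehat{m}}, \theta^*)}{\inf_{m \in \mathcal{M}_n} \ell(\widehat{\theta}_m, \theta^*)} \xrightarrow[n \to \infty]{\mathbb{P}} 1. \]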

We now state the second main result, described in (2.8).

Theorem 3.2. Under the assumptions of Theorem 3.1 and with the same penalty (3.1), with probability at least $1 - c_0 n^{-2}$, the LSE $\widehat{\theta}_{\widehat{m}}$ with $\widehat{m}$ given in (2.4) satisfies

\[ \ell(\widehat{\theta}_{\widehat{m}}, \theta^*) \le 2 \inf_{m \in \mathcal{M}_n} \Big\{ \ell(\theta_m^*, \theta^*) + \mathrm{pen}(S_m) \Big\} + \frac{2\, x\, \sigma^2}{n}. \tag{3.3} \]

Let us comment on the leading constant 2 relative to the ones obtained in similar frameworks:

1. for regression in fixed design, Birgé and Massart ([5], [6]) obtained a similar inequality with the leading constant $C = 1 + \delta$ with $\delta > 0$. Also in [1], $C = 1 + \delta$ with $\delta \in (0, 2)$.

2. in Theorem 3.1 in [2], the leading constant is equal to 4, and it is equal to $2\, \frac{(x + \rho)^2}{(x - \rho)^2} > 2$ in [3]. These results have been obtained in a more general framework but under strong conditions.

Figure 1: Dimension Jump

4 Numerical Experiments

This section aims at investigating how well the proposed penalties agree with the results of Section 3. To do that, we generated observations from a causal invertible ARMA(1, 1) process

\[ X_t = \phi_0 X_{t-1} + \xi_t + \theta_0\, \xi_{t-1} \]

where the $\xi_t$'s are independent and identically $\mathcal{N}(0, 1)$ distributed and

\[ (\phi_0, \theta_0) \in \Big\{ (a, b) : a \in \{0.9, 0.7, 0.5, -0.9, -0.7, -0.5\} \text{ and } b \in \{0.8, 0.6, -0.8, -0.6\} \Big\} \]

as in [15].

Since all of these models are invertible, they admit an AR(∞) representation. In order to assess our theoretical results, we consider as candidate set of models the family $\mathcal{M}_n$ of nested AR(p) models defined in (2.1) with $1 \le p \le K_n$, where $K_n = \lfloor 2 \log n \rfloor$ according to condition A6.
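The AR(∞) representation mentioned above can be written explicitly: inverting the moving-average part of $X_t = \phi_0 X_{t-1} + \xi_t + \theta_0 \xi_{t-1}$ (for $|\theta_0| < 1$) gives $X_t = \sum_{j \ge 1} \theta_j^* X_{t-j} + \xi_t$ with $\theta_j^* = (\phi_0 + \theta_0)(-\theta_0)^{j-1}$. The short sketch below computes the first coefficients; the function name `arma11_ar_inf_coeffs` and the parameter values are illustrative assumptions, not taken from the paper.

```python
def arma11_ar_inf_coeffs(phi0, theta0, k):
    """First k coefficients theta*_1..theta*_k of the AR(inf) form of
    X_t = phi0 * X_{t-1} + xi_t + theta0 * xi_{t-1}, assuming |theta0| < 1.

    theta*_j = (phi0 + theta0) * (-theta0) ** (j - 1)
    """
    return [(phi0 + theta0) * (-theta0) ** (j - 1) for j in range(1, k + 1)]


# Illustrative parameter values only.
coeffs = arma11_ar_inf_coeffs(phi0=0.5, theta0=0.6, k=5)
print(coeffs)
```

The coefficients decay geometrically at rate $|\theta_0|$, so any polynomial decay condition of the form $|\theta_t^*| = O(t^{-\gamma})$ is satisfied, and their absolute sum $|\phi_0 + \theta_0| / (1 - |\theta_0|)$ can be computed in closed form for a given pair $(\phi_0, \theta_0)$.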

For each pair $(\phi_0, \theta_0)$, we compute an empirical version of

\[ ME := \frac{\ell(\widehat{\theta}_{\widehat{m}}, \theta^*)}{\inf_{m \in \mathcal{M}_n} \ell(\widehat{\theta}_m, \theta^*)} \]

with $\widehat{m}$ selected as in (2.4), where $\mathrm{pen}(S_m) = \widehat{x}\, \widehat{\sigma}^2\, \frac{D_m}{n}$ and the optimal constant $\widehat{x}$ has been calibrated using the dimension jump algorithm implemented in the R package capushe and illustrated in Figure 1. To do so and to produce Figure 1, we used an ARMA(1,1) model with $\phi_0 = 0.9$ and $\theta_0 = 0.7$ and a sample size of 500. Then, $\widehat{x} = 2x$ where $x$ is the value which gives the highest jump. In Figure 1, $\widehat{x} = x_{opt} = 2 \times 0.5975 = 1.195$. This optimal value was used throughout the simulation study.

In the penalty $\mathrm{pen}(S_m)$, the variance $\sigma^2$ is estimated by considering the largest model, i.e. of size $K_n$, as traditionally done with Mallows' $C_p$. Table 1 summarizes the results obtained over 500 replications.
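The selection step described above can be sketched as follows. This is a rough illustration under several assumptions, not the code used for the experiments: the function names are hypothetical, the path is simulated with a burn-in period, unobserved past values are set to zero, and $\widehat{\sigma}^2$ is taken to be the mean squared residual of the largest model (one possible reading of "estimated by considering the largest model"). The dimension jump calibration of capushe is not reproduced; the constant is simply passed in as `x_const`.

```python
import numpy as np


def simulate_arma11(n, phi0, theta0, rng, burn_in=200):
    """Simulate X_t = phi0 * X_{t-1} + xi_t + theta0 * xi_{t-1} with N(0, 1) noise."""
    xi = rng.standard_normal(n + burn_in + 1)
    x = np.zeros(n + burn_in)
    for t in range(n + burn_in):
        prev = x[t - 1] if t > 0 else 0.0
        x[t] = phi0 * prev + xi[t + 1] + theta0 * xi[t]
    return x[burn_in:]  # drop the burn-in to get close to stationarity


def fit_ar(x, p):
    """Least-squares AR(p) fit; returns (theta_hat, gamma_n) as in (2.2)."""
    n = len(x)
    padded = np.concatenate([np.zeros(p), x])  # assumed convention: X_s = 0, s <= 0
    M = np.column_stack([padded[p - i: p - i + n] for i in range(1, p + 1)])
    theta_hat, *_ = np.linalg.lstsq(M, x, rcond=None)
    return theta_hat, float(np.mean((x - M @ theta_hat) ** 2))


def select_order(x, x_const):
    """Minimize gamma_n + x_const * sigma2_hat * D_m / n over AR(1..K_n)."""
    n = len(x)
    k_n = int(np.floor(2 * np.log(n)))          # K_n = floor(2 log n)
    dims = np.arange(1, k_n + 1)
    contrasts = np.array([fit_ar(x, p)[1] for p in dims])
    sigma2_hat = contrasts[-1]                  # assumed estimator: largest model
    crit = contrasts + x_const * sigma2_hat * dims / n
    return int(dims[np.argmin(crit)]), crit


rng = np.random.default_rng(0)
x = simulate_arma11(500, phi0=0.9, theta0=0.7, rng=rng)
p_hat, crit = select_order(x, x_const=1.195)
print(p_hat)
```

Repeating this over many replications and comparing the loss of the selected order with the smallest loss over the collection would give an empirical counterpart of $ME$; how the losses themselves are approximated is left out of this sketch.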

Table 1: Empirical estimates of ME

                          θ0
φ0      n/Kn       0.8     0.6    -0.8    -0.6
0.9     60/8       1.17    1.24    1.21    1.11
        120/9      1.10    1.19    1.12    1.06
        200/10     1.05    1.16    1.11    1.12
        500/12     1.01    1.04    1.13    1.05
        1000/13    1.01    1.01    1.04    1.06
        2000/15    1.00    1.01    1.03    1.03
0.7     60/8       1.20    1.23    1.15    1.22
        120/9      1.15    1.21    1.09    1.16
        200/10     1.14    1.18    1.18    1.24
        500/12     1.03    1.15    1.09    1.16
        1000/13    1.01    1.07    1.11    1.12
        2000/15    1.01    1.03    1.03    1.12
0.5     60/8       1.25    1.13    1.12    1.14
        120/9      1.13    1.11    1.06    1.14
        200/10     1.15    1.14    1.07    1.16
        500/12     1.08    1.14    1.05    1.10
        1000/13    1.03    1.08    1.03    1.10
        2000/15    1.01    1.07    1.04    1.11
-0.9    60/8       1.14    1.09    1.17    1.21
        120/9      1.10    1.10    1.07    1.14
        200/10     1.10    1.11    1.04    1.14
        500/12     1.10    1.12    1.02    1.04
        1000/13    1.04    1.08    1.01    1.02
        2000/15    1.07    1.04    1.00    1.01
-0.7    60/7       1.17    1.21    1.20    1.17
        120/8      1.14    1.21    1.11    1.11
        200/9      1.18    1.12    1.12    1.21
        500/12     1.11    1.12    1.03    1.13
        1000/13    1.06    1.05    1.01    1.07
        2000/15    1.05    1.11    1.00    1.03
-0.5    60/8       1.10    1.22    1.13    1.12
        120/9      1.12    1.13    1.14    1.12
        200/10     1.09    1.08    1.14    1.08
        500/12     1.06    1.17    1.11    1.08
        1000/13    1.05    1.12    1.04    1.06
        2000/15    1.04    1.07    1.01    1.09

As can be seen, these results support our theory: $\widehat{ME}$ is not far from 1 and globally decreases with $n$. Moreover, we note the convergence towards 1 announced in our theoretical results. These results are better than those obtained in [15]. The fastest decrease occurs when $\phi_0 \in \{0.9, -0.9\}$ and $\theta_0 \in \{0.8, -0.8\}$ with $\mathrm{sgn}(\phi_0) = \mathrm{sgn}(\theta_0)$ (where $\mathrm{sgn}(a) = 1$ if $a > 0$ and $\mathrm{sgn}(a) = -1$ otherwise). Let us also note that even if there is a global decrease between $n = 60$ and $n = 2000$, there are cases where $\widehat{ME}$ increases slightly from $n = 1000$ onwards. This happens most often when $\phi_0$ is very close to $-\theta_0$, which makes the process $(X_t)_t$ very close to a white noise, which is unpredictable. We think that this explains the global non-decay in these cases.

5 Proofs

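5.1 Proof of Theorem 3.1

Proof. For any $m \in \mathcal{M}_n$, we have

\[ \gamma_n(\widehat{\theta}_m) = \frac{1}{n} \sum_{t=1}^{n} (X_t - f^t_{\widehat{\theta}_m})^2 = \frac{1}{n} \sum_{t=1}^{n} (f^t_{\theta^*} - f^t_{\widehat{\theta}_m})^2 + \frac{\sigma^2}{n} \sum_{t=1}^{n} \xi_t^2 - \frac{2\sigma}{n} \sum_{t=1}^{n} \xi_t \big(f^t_{\widehat{\theta}_m} - f^t_{\theta^*}\big). \]

Since $\ell(\widehat{\theta}_m, \theta^*) = \mathbb{E}\Big[ \frac{1}{n} \sum_{t=1}^{n} (f^t_{\theta^*} - f^t_{\widehat{\theta}_m})^2 \Big]$ by stationarity, the difficult part of the proof is to obtain the penalty term from the expectation of the scalar product $\langle \xi, F_{\widehat{\theta}_m} - F_{\theta^*} \rangle_n$.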

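Let $m \in \mathcal{M}_n$. By definition, $C(\widehat{m}) \le C(m)$. Therefore

\[ \big\| F_{\widehat{\theta}_{\widehat{m}}} - F_{\theta^*} \big\|_n^2 - \frac{2\sigma}{n} \sum_{t=1}^{n} \xi_t \big(f^t_{\widehat{\theta}_{\widehat{m}}} - f^t_{\theta^*}\big) + \mathrm{pen}(S_{\widehat{m}}) \le \big\| F_{\widehat{\theta}_m} - F_{\theta^*} \big\|_n^2 - \frac{2\sigma}{n} \sum_{t=1}^{n} \xi_t \big(f^t_{\widehat{\theta}_m} - f^t_{\theta^*}\big) + \mathrm{pen}(S_m). \]

Moreover, the term of interest can be decomposed into

\[ \begin{aligned} \langle \xi, F_{\widehat{\theta}_m} - F_{\theta^*} \rangle &= \langle \xi, F_{\widehat{\theta}_m} - F_{\theta_m^*} \rangle + \langle \xi, F_{\theta_m^*} - F_{\theta^*} \rangle \\ &= \sigma \langle \xi, P_{M_m}(\xi) \rangle + \langle \xi, P_{M_m}\big(F_{\theta^*} - F_{\theta_m^*}\big) \rangle + \langle \xi, F_{\theta_m^*} - F_{\theta^*} \rangle \\ &= \sigma \langle \xi, P_{M_m}(\xi) \rangle + \langle \xi, (I_n - P_{M_m})\big(F_{\theta_m^*} - F_{\theta^*}\big) \rangle, \end{aligned} \]

and then

\[ \begin{aligned} &\big\| F_{\widehat{\theta}_{\widehat{m}}} - F_{\theta^*} \big\|_n^2 - 2\sigma^2 \langle \xi, P_{M_{\widehat{m}}}(\xi) \rangle_n - 2\sigma \langle \xi, (I_n - P_{M_{\widehat{m}}})\big(F_{\theta_{\widehat{m}}^*} - F_{\theta^*}\big) \rangle_n + \mathrm{pen}(S_{\widehat{m}}) \\ &\quad \le \big\| F_{\widehat{\theta}_m} - F_{\theta^*} \big\|_n^2 - 2\sigma^2 \langle \xi, P_{M_m}(\xi) \rangle_n - 2\sigma \langle \xi, (I_n - P_{M_m})\big(F_{\theta_m^*} - F_{\theta^*}\big) \rangle_n + \mathrm{pen}(S_m). \end{aligned} \]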

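If the matrix $P_{M_m}$ were deterministic, the desired expectation would be quite easy to obtain. But since $P_{M_m}$ is random and not independent of $\xi$, the task is more difficult. Taking expectations and applying Lemma 5 yields

\[ \ell(\widehat{\theta}_{\widehat{m}}, \theta^*) + \mathrm{pen}(S_{\widehat{m}}) - 2\sigma^2\, \mathbb{E}\big[\langle \xi, P_{M_{\widehat{m}}}(\xi) \rangle_n\big] \le \ell(\widehat{\theta}_m, \theta^*) + \mathrm{pen}(S_m) - 2\sigma^2\, \mathbb{E}\big[\langle \xi, P_{M_m}(\xi) \rangle_n\big]. \]

In view of Lemma 4 and of the choice of the penalty according to (3.1), it holds on $\Gamma_m$ that

\[ 0 \le \mathrm{pen}(S_m) - 2\sigma^2\, \mathbb{E}\big[\langle \xi, P_{M_m}(\xi) \rangle_n\big] \le 2\, \mathrm{pen}(S_m). \]

As a result, on $\Gamma_m$,

\[ \ell(\widehat{\theta}_{\widehat{m}}, \theta^*) \le \ell(\widehat{\theta}_m, \theta^*) + 2\, \mathrm{pen}(S_m). \]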

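Let us set $\Gamma = \bigcap_{m \in \mathcal{M}_n} \Gamma_m$. The event $\Gamma$ holds with probability larger than $1 - c_0 n^{-2}$. Indeed,

\[ \mathbb{P}(\Gamma) = 1 - \mathbb{P}\Big( \bigcup_{m \in \mathcal{M}_n} \Gamma_m^c \Big) \ge 1 - \sum_{m \in \mathcal{M}_n} \mathbb{P}(\Gamma_m^c) \ge 1 - \frac{c_0 K_n}{n^3} \ge 1 - \frac{c_0}{n^2}, \]

using Proposition 4 and the fact that $K_n \le n$. Hence, it holds on $\Gamma$ that

\[ \ell(\widehat{\theta}_{\widehat{m}}, \theta^*) \le \inf_{m \in \mathcal{M}_n} \Big\{ \ell(\widehat{\theta}_m, \theta^*) + 2\, \mathrm{pen}(S_m) \Big\} \le \inf_{m \in \mathcal{M}_n} \ell(\widehat{\theta}_m, \theta^*) + \frac{2\, x\, \sigma^2}{n}. \]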

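5.2 Proof of Theorem 3.2

Proof. We have

\[ \ell(\widehat{\theta}_m, \theta_m^*) = \frac{1}{n}\, \mathbb{E}\Big[ \sum_{t=1}^{n} \big(f^t_{\widehat{\theta}_m} - f^t_{\theta_m^*}\big)^2 \Big] = \frac{1}{n}\, \mathbb{E}\Big[ (\widehat{\theta}_m - \theta_m^*)^\top M_m^\top M_m (\widehat{\theta}_m - \theta_m^*) \Big]. \]

Using (2.15), it follows that

\[ (\widehat{\theta}_m - \theta_m^*)^\top M_m^\top M_m (\widehat{\theta}_m - \theta_m^*) = \langle F_{\theta^*} - F_{\theta_m^*}, P_{M_m}\big(F_{\theta^*} - F_{\theta_m^*}\big) \rangle + \sigma^2 \langle \xi, P_{M_m} \xi \rangle + 2\sigma \langle \xi, P_{M_m}\big(F_{\theta^*} - F_{\theta_m^*}\big) \rangle. \]

Moreover, since $P_{M_m}$ is a projection matrix, $\|P_{M_m}\|_{op} = 1$ and

\[ \mathbb{E}\big| \langle F_{\theta^*} - F_{\theta_m^*}, P_{M_m}\big(F_{\theta^*} - F_{\theta_m^*}\big) \rangle \big| \le \mathbb{E}\Big[ \big\|F_{\theta^*} - F_{\theta_m^*}\big\|\, \big\|P_{M_m}\big(F_{\theta^*} - F_{\theta_m^*}\big)\big\| \Big] \le \mathbb{E}\Big[ \big\|F_{\theta^*} - F_{\theta_m^*}\big\|\, \big\|P_{M_m}\big\|_{op}\, \big\|F_{\theta^*} - F_{\theta_m^*}\big\| \Big] \le \mathbb{E}\Big[ \big\|F_{\theta^*} - F_{\theta_m^*}\big\|^2 \Big]. \]

Hence, we deduce from Lemma 4 that

\[ \ell(\widehat{\theta}_m, \theta_m^*) = \frac{1}{n}\, \mathbb{E}\big[ \langle F_{\theta^*} - F_{\theta_m^*}, P_{M_m}\big(F_{\theta^*} - F_{\theta_m^*}\big) \rangle \big] + \frac{\sigma^2}{n}\, \mathbb{E}\big[ \langle \xi, P_{M_m} \xi \rangle \big] \le \ell(\theta_m^*, \theta^*) + \frac{2}{n} \sigma^2 D_m \le \ell(\theta_m^*, \theta^*) + 0.5\, \mathrm{pen}(S_m), \]

so that

\[ \ell(\widehat{\theta}_m, \theta^*) = \ell(\theta_m^*, \theta^*) + \ell(\widehat{\theta}_m, \theta_m^*) \le 2\, \ell(\theta_m^*, \theta^*) + \mathrm{pen}(S_m). \]

This fact along with Theorem 3.1 establishes (3.3). □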

5.3 Proof of Corollary 1

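Proof. First, let us remark that $\mathbb{P}(\Gamma) \to 1$ as $n \to \infty$ (where $\Gamma$ is the set defined in the proof of Theorem 3.1). Also, from (3.2) and for any $n$,

\[ \inf_{m \in \mathcal{M}_n} \ell(\widehat{\theta}_m, \theta^*) \le \ell(\widehat{\theta}_{\widehat{m}}, \theta^*) \le \inf_{m \in \mathcal{M}_n} \ell(\widehat{\theta}_m, \theta^*) + \frac{2\, x\, \sigma^2}{n}. \]

The proof is completed by letting $n \to \infty$ in the previous double inequality. □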

5.4 Proof of Proposition 4

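Let us recall the definition of $\widehat{\Sigma}_m$ as in (2.16):

\[ \widehat{\Sigma}_m = \frac{1}{n} \sum_{t=1}^{n} \widehat{\Sigma}_{m,t} \quad \text{with} \quad \widehat{\Sigma}_{m,t} = Z_t^m (Z_t^m)^\top \quad \text{where} \quad Z_t^m = (X_{t-1}, \dots, X_{t-D_m})^\top. \]

Following the idea of the proof of Proposition 4 in [8], we claim that

\[ \Gamma_r^c = \Big\{ \big\| \widehat{\Sigma}_m^{-1} - \Sigma_m^{-1} \big\|_{op} > r\, \big\| \Sigma_m^{-1} \big\|_{op} \Big\} \subset \Big\{ \big\| \Sigma_m^{-1/2}\, \widehat{\Sigma}_m\, \Sigma_m^{-1/2} - I_m \big\|_{op} > \frac{r \wedge 1}{2} \Big\}, \]

so that

\[ \Gamma_r^c \subset \Big\{ \big\| \widehat{\Sigma}_m - \Sigma_m \big\|_{op}\, \big\| \Sigma_m^{-1} \big\|_{op} > \frac{r \wedge 1}{2} \Big\}. \]

Therefore, with $r = a / \mathbb{E}[X_0^2]$ and using Proposition 3,

\[ \begin{aligned} \mathbb{P}(\Gamma_r^c) &\le \mathbb{P}\Big( \big\| \widehat{\Sigma}_m - \Sigma_m \big\|_{op} > \frac{r \wedge 1}{2\, \| \Sigma_m^{-1} \|_{op}} \Big) \le \mathbb{P}\Big( \big\| \widehat{\Sigma}_m - \Sigma_m \big\|_{op} > \frac{a\, (r \wedge 1)}{2} \Big) \\ &\le \mathbb{P}\Big( \big\| \widehat{\Sigma}_m^* - \Sigma_m \big\|_{op} > \frac{a\, (r \wedge 1)}{4} \Big) + \mathbb{P}\Big( \big\| \widehat{\Sigma}_m - \widehat{\Sigma}_m^* \big\|_{op} > \frac{a\, (r \wedge 1)}{4} \Big) =: P_1 + P_2, \end{aligned} \]

where $\widehat{\Sigma}_m^*$ is defined as $\widehat{\Sigma}_m$ with the variables $X_t$ replaced by the variables $X_t^*$ of Proposition 2.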

First, using Lemma 3 with $u = \frac{a\, (r \wedge 1)}{4}$ and by virtue of A5, it follows that

\[ P_1 \le 2 \exp\big( -3 \log n \big) \le \frac{2}{n^3}. \]

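Now let us bound $P_2$. We know that for a $D_m \times D_m$ matrix $A$,

\[ \|A\|_{op} \le \|A\|_{\infty} := \max_{1 \le i \le D_m} \sum_{j=1}^{D_m} |A_{ij}|. \]

Thus, from Markov's inequality,

\[ \begin{aligned} P_2 &\le \frac{4}{a\, (r \wedge 1)}\, \mathbb{E}\Big[ \big\| \widehat{\Sigma}_m - \widehat{\Sigma}_m^* \big\|_{op} \Big] \le \frac{4}{a\, (r \wedge 1)}\, \mathbb{E}\Big[ \max_{1 \le i \le D_m} \sum_{j=1}^{D_m} \Big| \big(\widehat{\Sigma}_m - \widehat{\Sigma}_m^*\big)_{i,j} \Big| \Big] \\ &\le \frac{4}{a\, (r \wedge 1)} \sum_{j=1}^{D_m} \mathbb{E}\Big[ \Big| \big(\widehat{\Sigma}_m - \widehat{\Sigma}_m^*\big)_{i_0,j} \Big| \Big] \le \frac{4}{a\, (r \wedge 1)} \sum_{j=1}^{D_m} \mathbb{E}\Big[ \big| X_{t-i_0} X_{t-j} - X^*_{t-i_0} X^*_{t-j} \big| \Big]. \end{aligned} \]

Moreover, $\big| X_{t-i} X_{t-j} - X^*_{t-i} X^*_{t-j} \big| \le |X_{t-i}|\, \big| X_{t-j} - X^*_{t-j} \big| + |X^*_{t-j}|\, \big| X_{t-i} - X^*_{t-i} \big|$, so that with the Cauchy-Schwarz inequality,

\[ \mathbb{E}\Big[ \big| X_{t-i} X_{t-j} - X^*_{t-i} X^*_{t-j} \big| \Big] \le 2\, \|X_0\|_2\, \big\| X_{t-1} - X^*_{t-1} \big\|_2 \le 2\, \|X_0\|_2\, \tau^{(2)}(q_n). \]

Hence,

\[ P_2 \le \frac{8\, \|X_0\|_2}{a\, (r \wedge 1)}\, D_m\, \tau^{(2)}(q_n) \le \frac{8\, \|X_0\|_2}{a\, (r \wedge 1)}\, D_m\, C_\tau \Big( \frac{\log q_n}{q_n} \Big)^{\gamma - 1}, \]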

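where the last inequality follows from Proposition 3 and Proposition 1. As a result, choosing $s_n = O\big(\sqrt{n}\, \log n\big)$ (ensuring A5), one can find a constant $A$ such that

\[ \Big( \frac{\log q_n}{q_n} \Big)^{\gamma - 1} \le A \Big( \frac{1}{\sqrt{n}} \Big)^{\gamma - 1}. \tag{5.1} \]

As a result, with $\gamma \ge 8$ and $C = 8\, \|X_0\|_2 / \big(a\, (r \wedge 1)\big)$,

\[ P_2 \le A\, C\, C_\tau D_m \frac{1}{n^{7/2}} \le A\, C\, C_\tau C_K \frac{1}{n^3} \]

by virtue of A6. The result is proved with $c_0 = A\, C\, C_\tau C_K + 1$. □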

5.5 Proof of Proposition 3

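Proof. The proof is based on the relation between the spectral density function and the extreme eigenvalues of the variance-covariance matrix.

Denote by $u \in \mathbb{R}^{D_m}$ the normalized eigenvector associated with the largest eigenvalue $\lambda_{\max}(\Sigma_m)$. Hence,

\[ \begin{aligned} \lambda_{\max}(\Sigma_m) = u^\top \Sigma_m u &= \sum_{j,k=1}^{D_m} u_j\, r(j - k)\, u_k = \int_{-\pi}^{\pi} g(\lambda) \sum_{j,k=1}^{D_m} u_j\, e^{i(j-k)\lambda}\, u_k\, d\lambda = \int_{-\pi}^{\pi} g(\lambda)\, \Big| \sum_{j=1}^{D_m} u_j\, e^{ij\lambda} \Big|^2 d\lambda \\ &\le \sup_{-\pi \le \lambda < \pi} g(\lambda) \int_{-\pi}^{\pi} \Big| \sum_{j=1}^{D_m} u_j\, e^{ij\lambda} \Big|^2 d\lambda \le \sup_{-\pi \le \lambda < \pi} g(\lambda), \end{aligned} \]

since, using the Parseval identity, $\int_{-\pi}^{\pi} \big| \sum_{j=1}^{D_m} u_j\, e^{ij\lambda} \big|^2 d\lambda = \sum_{j=1}^{D_m} u_j^2 = 1$. But, from Lemma 2 and since $\gamma \ge 2$, it follows that

\[ \sup_{-\pi \le \lambda < \pi} g(\lambda) \le \frac{1}{2\pi} \sum_{h \in \mathbb{Z}} |r(h)| \le \frac{C}{\pi} \sum_{h=0}^{+\infty} \frac{1}{(h+1)^\gamma} < \infty. \]

Given that $\Sigma_m$ is symmetric, it follows that

\[ \big\| \Sigma_m \big\|_{op} = \lambda_{\max}(\Sigma_m) \le \frac{C}{\pi} \sum_{h=0}^{+\infty} \frac{1}{(h+1)^\gamma}, \]

which concludes the proof of (2.10).

We end with the proof of (2.11). Reasoning as above, and by virtue of A4, one can show that

\[ \lambda_{\min}(\Sigma_m) \ge \inf_{-\pi \le \lambda < \pi} g(\lambda) \ge a, \]

which yields $\big\| \Sigma_m^{-1} \big\|_{op} = \frac{1}{\lambda_{\min}(\Sigma_m)} \le \frac{1}{a}$, so that (2.11) is established. □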

5.6 Technical Lemmas

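Lemma 1. Assume A1 holds and let $(X_t)$ be the mixing stationary solution of (1.1). Then the process $(\vec{X}_t)$ is mixing and

\[ \tau_{\vec{X},\infty}^{(1)}(r) \le K_n\, \tau_{X,\infty}^{(1)}(r - 1). \tag{5.2} \]

Proof. Let us set $\mathcal{M}^i_{\vec{X}} = \sigma(\vec{X}_t, t \le i)$ and $\mathcal{M}^i_X = \sigma(X_t, t \le i)$ for an integer $i$. One would like to bound $\tau\big(\mathcal{M}^i_{\vec{X}}, (\vec{X}_{j_1}, \dots, \vec{X}_{j_k})\big)$ for $j_k > \dots > j_1 \ge i + r$. Let us assume that the universe $\Omega$ is rich enough so that one can find $\vec{X}^*_{j_l} = \big(X^*_{j_l - 1}, \dots, X^*_{j_l - K_n}\big)^\top$ with $l = 1, \dots, k$ verifying

1. $\big(\vec{X}^*_{j_1}, \dots, \vec{X}^*_{j_k}\big)$ is distributed as $\big(\vec{X}_{j_1}, \dots, \vec{X}_{j_k}\big)$ and independent of $\mathcal{M}^i_{\vec{X}}$;

2. $\big(X^*_{j_1 - 1}, \dots, X^*_{j_k - 1}\big)^\top$ is distributed as $\big(X_{j_1 - 1}, \dots, X_{j_k - 1}\big)^\top$ and independent of $\mathcal{M}^i_X$.

As a result,

\[ \begin{aligned} \tau\big(\mathcal{M}^i_{\vec{X}}, (\vec{X}_{j_1}, \dots, \vec{X}_{j_k})\big) &\le \sum_{l=1}^{k} \big\| \vec{X}_{j_l} - \vec{X}^*_{j_l} \big\|_1 = \sum_{l=1}^{k} \sum_{t=1}^{K_n} \mathbb{E}\big[ |X_{j_l - t} - X^*_{j_l - t}| \big] \le K_n \sum_{l=1}^{k} \mathbb{E}\big[ |X_{j_l - 1} - X^*_{j_l - 1}| \big] \\ &= K_n\, \Big\| \big(X_{j_1 - 1}, \dots, X_{j_k - 1}\big)^\top - \big(X^*_{j_1 - 1}, \dots, X^*_{j_k - 1}\big)^\top \Big\|_1 = K_n\, \tau\big(\mathcal{M}^i_X, (X_{j_1 - 1}, \dots, X_{j_k - 1})\big). \end{aligned} \]

This fact along with the definition of $\tau_{\vec{X},\infty}^{(1)}(r)$ leads to (5.2). □

Lemma 2. Under A1 with $|\theta_t^*| = O(t^{-\gamma})$ where $\gamma > 1$, we have

\[ r(h) = \mathbb{E}[X_0 X_h] = O\big( (h + 1)^{-\gamma} \big). \]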

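Proof. By virtue of A1, the process $(X_t)_t$ is causal; that is, there exists $(\phi_i)_{i \in \mathbb{N}}$ such that $X_t = \sum_{i=0}^{+\infty} \phi_i \xi_{t-i}$ with $\sum_{i=0}^{+\infty} |\phi_i| < \infty$. The sequence $(\phi_i)_{i \in \mathbb{N}}$ is given by the relation $\phi(z) = \sum_{i=0}^{+\infty} \phi_i z^i = \frac{1}{\theta(z)}$ with $\theta(z) = 1 - \sum_{i=0}^{+\infty} \theta^*_i z^i$. Equating coefficients of $z^j$, $j = 0, 1, \dots$, we find that $\phi_0 = 1$ and, for $i \ge 1$,
\[
\phi_i = \sum_{j=1}^{i} \theta^*_j \phi_{i-j}.
\]
This fact allows us to deduce that the sequences $(\phi_i)_{i \in \mathbb{N}}$ and $(\theta^*_i)_{i \in \mathbb{N}}$ decay at the same rate. Therefore, since $|\theta^*_t| = O\big((t+1)^{-\gamma}\big)$, there exists $t_0 \in \mathbb{N}$ such that for any $t \ge t_0$, it holds that $|\phi_t| \le C (t+1)^{-\gamma}$ for some constant $C > 0$. Thus,
\[
r(h) = \sum_{j=0}^{\infty} \phi_j \phi_{j+h}
\le C^2 \sum_{j=0}^{\infty} \frac{1}{(j+1)^{\gamma}} \frac{1}{(j+h+1)^{\gamma}}
\le C^2 (h+1)^{-\gamma} \sum_{j=0}^{\infty} \frac{1}{(j+1)^{\gamma}}
\le \frac{C^2 \pi^2}{6} (h+1)^{-\gamma},
\]
where the last inequality follows from the fact that $\gamma \ge 2$, which establishes the lemma. □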
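The recursion $\phi_0 = 1$, $\phi_i = \sum_{j=1}^{i} \theta^*_j \phi_{i-j}$ used in the proof can be explored numerically. The following is only a minimal sketch: the function names, the truncation length and the example coefficients $\theta_j = 0.3\, j^{-2}$ are hypothetical choices made for illustration (they are not taken from the paper), the AR coefficients are indexed from $j = 1$, and the innovation variance is taken equal to one so that $r(h) = \sum_j \phi_j \phi_{j+h}$.

```python
def ma_coefficients(theta, n_terms):
    """Return [phi_0, ..., phi_{n_terms-1}] from the recursion
    phi_0 = 1 and phi_i = sum_{j=1}^{i} theta_j * phi_{i-j}.

    theta[j - 1] stores theta_j; coefficients beyond len(theta)
    are treated as zero (a truncation assumed for this sketch).
    """
    phi = [1.0]
    for i in range(1, n_terms):
        upper = min(i, len(theta))
        phi.append(sum(theta[j - 1] * phi[i - j] for j in range(1, upper + 1)))
    return phi


def truncated_autocovariance(phi, h):
    """Approximate r(h) = sum_j phi_j * phi_{j+h} using the available terms
    (unit innovation variance is assumed)."""
    return sum(phi[j] * phi[j + h] for j in range(len(phi) - h))


# Hypothetical example: polynomially decaying AR coefficients with gamma = 2.
gamma = 2.0
theta = [0.3 * j ** (-gamma) for j in range(1, 200)]
phi = ma_coefficients(theta, 200)

# If phi decays at the same rate as theta, this quantity stays bounded.
decay_ratio = max(abs(p) * (t + 1) ** gamma for t, p in enumerate(phi))
print("max_t |phi_t| (t+1)^gamma over the first 200 terms:", decay_ratio)
```

For an AR(1) model with coefficient 0.5, `ma_coefficients([0.5], 5)` returns the powers of 0.5, which matches the closed form $\phi_i = 0.5^i$.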

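Lemma 3. Under assumption A2, it holds for any model $m \in \mathcal{M}_n$ and for all $u > 0$ that
\[
P\Big( \big\|\widehat{\Sigma}^*_m - \Sigma_m\big\|_{op} \ge u \Big)
\le 2 \exp\left\{ -\frac{s_n}{2} \min\left\{ \Big(\frac{u}{16\, D_m \sigma_0^2}\Big)^2,\ \frac{u}{32\, D_m \sigma_0^2} \right\} \right\}.
\]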

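Proof. For a matrix $A$, one can write
\[
\|A\|_{op} = \max_{v:\ \|v\| = 1} \big|v^\top A v\big| = \big|v_0^\top A v_0\big|.
\]
Therefore one can find a vector $v_0 \in \mathbb{R}^{D_m}$ with $\|v_0\| = 1$ such that
\[
P\Big( \big\|\widehat{\Sigma}^*_m - \Sigma_m\big\|_{op} \ge u \Big)
= P\Big( \big|v_0^\top (\widehat{\Sigma}^*_m - \Sigma_m) v_0\big| \ge u \Big).
\]
But
\[
\begin{aligned}
v_0^\top (\widehat{\Sigma}^*_m - \Sigma_m) v_0
&= \frac{1}{n} \sum_{t=1}^{n} \Big( v_0^\top \widehat{\Sigma}^*_{m,t} v_0 - v_0^\top \Sigma_m v_0 \Big) \\
&= \frac{1}{n} \sum_{t=1}^{n} \Big( v_0^\top (Z_t^{*m}) (Z_t^{*m})^\top v_0 - v_0^\top \Sigma_m v_0 \Big) \\
&= \frac{1}{n} \sum_{t=1}^{n} \Big( Y_t^2 - E[Y_t^2] \Big),
\end{aligned}
\]
with $Y_t = v_0^\top Z_t^{*m} = \sum_{i=1}^{D_m} v_{0i} X^*_{t-i}$. From A2, $Y_t$ is $SG(D_m \sigma_0^2)$. Therefore, $Y_t^2$ is $SE(256\, D_m^2 \sigma_0^4,\ 16\, D_m \sigma_0^2)$ (where SG stands for Sub-Gaussian and SE for Sub-Exponential).

Moreover, we can write
\[
\begin{aligned}
v_0^\top (\widehat{\Sigma}^*_m - \Sigma_m) v_0
&= \frac{1}{n} \sum_{t=1}^{n} \Big( Y_t^2 - E[Y_t^2] \Big) \\
&= \frac{1}{s_n} \sum_{k=0}^{s_n - 1} \left( \frac{1}{2 q_n} \sum_{i=1}^{q_n} \Big( Y^2_{2k q_n + i} - E[Y_1^2] \Big) \right)
 + \frac{1}{s_n} \sum_{k=0}^{s_n - 1} \left( \frac{1}{2 q_n} \sum_{i=1}^{q_n} \Big( Y^2_{(2k+1) q_n + i} - E[Y_1^2] \Big) \right) \\
&= \mathcal{Y}_1 + \mathcal{Y}_2.
\end{aligned}
\]
Therefore,
\[
\mathcal{Y}_1 = \frac{1}{s_n} \sum_{k=0}^{s_n - 1} \mathcal{Y}_{1,k}
\quad \text{and} \quad
\mathcal{Y}_2 = \frac{1}{s_n} \sum_{k=0}^{s_n - 1} \mathcal{Y}_{2,k},
\]
with
\[
\mathcal{Y}_{1,k} = \frac{1}{2 q_n} \sum_{i=1}^{q_n} \Big( Y^2_{2k q_n + i} - E[Y_1^2] \Big)
\quad \text{and} \quad
\mathcal{Y}_{2,k} = \frac{1}{2 q_n} \sum_{i=1}^{q_n} \Big( Y^2_{(2k+1) q_n + i} - E[Y_1^2] \Big).
\]
$\{\mathcal{Y}_{1,k}\}$ and $\{\mathcal{Y}_{2,k}\}$ are independent random vectors by virtue of Proposition 2. Now, let us show that the $\mathcal{Y}_{i,k}$ are sub-exponential. For $\lambda$ such that $|\lambda| < \frac{1}{16\, D_m \sigma_0^2}$, and denoting $w_i = Y^2_{2k q_n + i} - E[Y_1^2]$, we have
\[
\begin{aligned}
E\big[ e^{\lambda \mathcal{Y}_{1,k}} \big]
&= E\left[ \exp\Big( \frac{1}{2 q_n} \sum_{i=1}^{q_n} \lambda w_i \Big) \right]
= E\left[ \prod_{i=1}^{q_n} \exp\Big( \frac{\lambda w_i}{2 q_n} \Big) \right]
= E\left[ \prod_{i=1}^{q_n} \Big( \exp\Big( \frac{\lambda w_i}{2} \Big) \Big)^{1/q_n} \right] \\
&\le \prod_{i=1}^{q_n} \Big( E\Big[ \exp\Big( \frac{\lambda w_i}{2} \Big) \Big] \Big)^{1/q_n}
\le e^{\frac{\lambda^2}{2}\, 64\, D_m^2 \sigma_0^4},
\end{aligned}
\]
where we have used Hölder's inequality. Hence $\mathcal{Y}_{1,k}$ is $SE(64\, D_m^2 \sigma_0^4,\ 16\, D_m \sigma_0^2)$. As a result, using exponential inequalities for SE random variables, it follows that
\[
P\big( \mathcal{Y}_1 \ge u/2 \big)
\le \exp\left\{ -\frac{s_n}{2} \min\left\{ \Big(\frac{u}{16\, D_m \sigma_0^2}\Big)^2,\ \frac{u}{32\, D_m \sigma_0^2} \right\} \right\}.
\]
The same bound holds for $\mathcal{Y}_2$, so that, by a union bound,
\[
P\Big( \big|v_0^\top (\widehat{\Sigma}^*_m - \Sigma_m) v_0\big| \ge u \Big)
\le 2 \exp\left\{ -\frac{s_n}{2} \min\left\{ \Big(\frac{u}{16\, D_m \sigma_0^2}\Big)^2,\ \frac{u}{32\, D_m \sigma_0^2} \right\} \right\}.
\]
□

Lemma 4. For every $m \in \mathcal{M}_n$, it holds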

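\[
E\big[ \langle \xi, P_{\mathbf{M}_m}(\xi) \rangle\, \mathbb{1}_{\Gamma_m} \big] \le 2 D_m. \tag{5.3}
\]
Proof. We have
\[
\begin{aligned}
\langle \xi, P_{\mathbf{M}_m}(\xi) \rangle
&= \big\langle \xi, \mathbf{M}_m (\mathbf{M}_m^\top \mathbf{M}_m)^{-1} \mathbf{M}_m^\top \xi \big\rangle
= \big\langle \mathbf{M}_m^\top \xi, \widehat{\Sigma}_m^{-1} \mathbf{M}_m^\top \xi \big\rangle \\
&= \big\langle \mathbf{M}_m^\top \xi, (\widehat{\Sigma}_m^{-1} - \Sigma_m^{-1}) \mathbf{M}_m^\top \xi \big\rangle
 + \big\langle \mathbf{M}_m^\top \xi, \Sigma_m^{-1} \mathbf{M}_m^\top \xi \big\rangle.
\end{aligned}
\]
On one hand,
\[
E\big[ (\mathbf{M}_m^\top \xi)^\top \Sigma_m^{-1} (\mathbf{M}_m^\top \xi) \big] = D_m. \tag{5.4}
\]
Indeed, let $\tilde{\xi} = \mathbf{M}_m^\top \xi$, so that for all $k = 1, \dots, D_m$, $\tilde{\xi}_k = \sum_{t=1}^{n} X_{t-k} \xi_t$. Using conditional expectation, it can be shown that each component of $\tilde{\xi}$ is a sum of a martingale difference sequence. Let us compute the covariance matrix of $\tilde{\xi}$. The $(k, l)$ element of this matrix is
\[
\big( \Sigma_{\tilde{\xi}} \big)_{k,l}
= E\Big[ \sum_{i=1}^{n} X_{i-k} \xi_i \sum_{j=1}^{n} X_{j-l} \xi_j \Big]
= E\Big[ \sum_{i=1}^{n} X_{i-k} X_{i-l} \Big]
= \big( E[\mathbf{M}_m^\top \mathbf{M}_m] \big)_{k,l}
= \big( \Sigma_m \big)_{k,l}.
\]
Therefore, $\Sigma_{\tilde{\xi}} = \Sigma_m$ and
\[
E\big[ (\mathbf{M}_m^\top \xi)^\top \Sigma_m^{-1} (\mathbf{M}_m^\top \xi) \big] = \mathrm{Trace}\big( \Sigma_m^{-1} \Sigma_m \big) = D_m.
\]
Moreover, it holds on $\Gamma_m$ that
\[
E\big| \big\langle \mathbf{M}_m^\top \xi, (\widehat{\Sigma}_m^{-1} - \Sigma_m^{-1}) \mathbf{M}_m^\top \xi \big\rangle \big| \le D_m. \tag{5.5}
\]
Indeed,
\[
\begin{aligned}
E\big| \big\langle \mathbf{M}_m^\top \xi, (\widehat{\Sigma}_m^{-1} - \Sigma_m^{-1}) \mathbf{M}_m^\top \xi \big\rangle \big|
&\le E\big[ \|\tilde{\xi}\|\, \|(\widehat{\Sigma}_m^{-1} - \Sigma_m^{-1}) \tilde{\xi}\| \big]
\le E\big[ \|\tilde{\xi}\|\, \|\widehat{\Sigma}_m^{-1} - \Sigma_m^{-1}\|_{op}\, \|\tilde{\xi}\| \big] \\
&\le r\, \|\Sigma_m^{-1}\|_{op}\, E\big[ \|\tilde{\xi}\|^2 \big]
\le \frac{r\, D_m\, E[X_0^2]}{a},
\end{aligned}
\]
where the last inequality holds since $E\|\tilde{\xi}\|^2 = \mathrm{Trace}(\Sigma_m) = D_m E[X_0^2]$. This implies (5.5) as r = a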

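Lemma 5. For every $m \in \mathcal{M}_n$, it holds
\[
E\big[ \big\langle \xi, (I_n - P_{\mathbf{M}_m}) (F_{\theta^*_m} - F_{\theta^*}) \big\rangle \big] = 0. \tag{5.6}
\]
Proof. First, we have
\[
E\big[ \big\langle \xi, I_n (F_{\theta^*_m} - F_{\theta^*}) \big\rangle \big] = 0. \tag{5.7}
\]
Indeed,
\[
E\big[ \big\langle \xi, I_n (F_{\theta^*_m} - F_{\theta^*}) \big\rangle \big]
= \sum_{t=1}^{n} E\big[ \xi_t \big( f^t_{\theta^*_m} - f^t_{\theta^*} \big) \big]
= \sum_{t=1}^{n} E\big[ \big( f^t_{\theta^*_m} - f^t_{\theta^*} \big)\, E[\xi_t \mid \mathcal{F}_t] \big] = 0.
\]
Secondly, $P_{\mathbf{M}_m} (F_{\theta^*_m} - F_{\theta^*})$ is an element of $S_m$. Therefore, there exists $\theta_0 \in \Theta_m$, possibly dependent on $(X_1, X_2, \dots, X_n)$, such that
\[
P_{\mathbf{M}_m} (F_{\theta^*_m} - F_{\theta^*}) = \mathbf{M}_m \theta_0.
\]
As a result,
\[
\big\langle \xi, P_{\mathbf{M}_m} (F_{\theta^*_m} - F_{\theta^*}) \big\rangle
= \langle \xi, \mathbf{M}_m \theta_0 \rangle
\le \sup_{\theta \in \Theta_m} \langle \xi, \mathbf{M}_m \theta \rangle.
\]
Since $\theta \mapsto \langle \xi, \mathbf{M}_m \theta \rangle$ is a continuous function and $\Theta_m$ is compact, one can find $\theta_1 \in \Theta_m$ such that
\[
\sup_{\theta \in \Theta_m} \langle \xi, \mathbf{M}_m \theta \rangle = \langle \xi, \mathbf{M}_m \theta_1 \rangle.
\]
But, for any $\theta \in \Theta_m$, $E\big| \langle \mathbf{M}_m^\top \xi, \theta \rangle \big| \le \sum_{k=1}^{D_m} E\big| \theta_k \tilde{\xi}_k \big| = 0$. It then follows that
\[
E\big| \big\langle \xi, P_{\mathbf{M}_m} (F_{\theta^*_m} - F_{\theta^*}) \big\rangle \big|
\le E\big| \langle \mathbf{M}_m^\top \xi, \theta_1 \rangle \big| = 0,
\]
which along with (5.7) implies (5.6). □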

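Lemma 6. Assume A3 holds. Then $\widehat{\Sigma}_m$ is a.e. invertible. Also, $\Sigma_m$ is invertible.

Proof. We can write $\widehat{\Sigma}_m = \mathbf{M}_m^\top \mathbf{M}_m$ with $\mathbf{M}_m = \big( X_{i-1}, \dots, X_{i-D_m} \big)_{i=1}^{n}$. By virtue of A3, $\mathbf{M}_m$ is of full rank, which implies the a.e. invertibility of $\widehat{\Sigma}_m$.

Moreover, $\Sigma_m = E\big[ \widehat{\Sigma}_m \big] = E\big[ Z_0^m (Z_0^m)^\top \big]$ with $Z_0^m = (X_{-1}, \dots, X_{-D_m})^\top$. Let $u \in \mathbb{R}^{D_m}$; it follows that
\[
u^\top \Sigma_m u = E\big[ \big( (Z_0^m)^\top u \big)^2 \big] \ge 0.
\]
Let us show that whenever equality holds ($u^\top \Sigma_m u = 0$), $u = 0$. Since $\big( (Z_0^m)^\top u \big)^2 \ge 0$, its expectation vanishes if and only if $(Z_0^m)^\top u = 0$ a.e., which yields $u = 0$ by A3. Hence, $\Sigma_m$ is positive definite and therefore invertible. □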
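The positive definiteness argument above can be illustrated numerically on a concrete covariance matrix. The sketch below is not part of the paper: it assumes, purely for illustration, an AR(1) process with unit innovation variance, whose autocovariance is $r(h) = \phi^{|h|} / (1 - \phi^2)$, and it checks positive definiteness by attempting a Cholesky factorization (which succeeds exactly when all pivots are strictly positive). The function names and the tolerance are hypothetical.

```python
import math


def ar1_autocovariance(phi):
    """Autocovariance h -> r(h) of a stationary AR(1) process with
    coefficient phi and unit innovation variance (illustrative assumption)."""
    return lambda h: phi ** abs(h) / (1.0 - phi ** 2)


def toeplitz_covariance(r, dim):
    """Build the dim x dim matrix with entries r(|k - l|)."""
    return [[r(abs(k - l)) for l in range(dim)] for k in range(dim)]


def is_positive_definite(matrix, tol=1e-12):
    """Try a Cholesky factorization of a symmetric matrix.

    Returns False as soon as a pivot is not larger than tol,
    which signals that the matrix is not (numerically) positive definite.
    """
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= tol:
                    return False
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return True


sigma_m = toeplitz_covariance(ar1_autocovariance(0.8), 5)
print("AR(1) covariance positive definite:", is_positive_definite(sigma_m))
```

By contrast, a degenerate process with $r(h) = 1$ for all $h$ gives a rank-one matrix, for which `is_positive_definite` returns `False`; this corresponds to a case where $(Z_0^m)^\top u = 0$ a.e. for some $u \ne 0$.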

6 Acknowledgements

The author thanks William KENGNE and Jean-Marc BARDET for proofreading and helpful discussions.

References

[1] S. Arlot and F. Bach. Data-driven calibration of linear estimators with minimal penalties. Advances in Neural Information Processing Systems, 22:46–54, 2009.

[2] Y. Baraud, F. Comte, and G. Viennet. Model selection for (auto-) regression with dependent data. ESAIM: Probability and Statistics, 5:33–49, 2001.

[3] Y. Baraud, F. Comte, G. Viennet, et al. Adaptive estimation in autoregression or β-mixing regression via model selection. The Annals of Statistics, 29(3):839–875, 2001.

[4] J.-M. Bardet and O. Wintenberger. Asymptotic normality of the quasi-maximum likelihood estimator for multidimensional causal processes. The Annals of Statistics, 37(5B):2730–2759, 2009.

[5] L. Birgé and P. Massart. A generalized cp criterion for gaussian model selection. technical report, universités de paris 6 et paris 7, 2010. prépublication 647,39 pages. 2001.

[6] L. Birgé and P. Massart. Minimal penalties for gaussian model selection. Probability theory and related fields, 138(1-2):33–73, 2007.

[7] F. Comte, J. Dedecker, and M.-L. Taupin. Adaptive density deconvolution with dependent inputs. Mathematical methods of Statistics, 17(2):87, 2008.

[8] F. Comte and V. Genon-Catalot. Regression function estimation as a partly inverse problem. Annals of the Institute of Statistical Mathematics, 72(4):1023–1054, 2020.

[9] J. Dedecker and C. Prieur. New dependence coefficients. Examples and applications to statistics. Probability Theory and Related Fields, 132(2):203–236, 2005.

[10] P. Doukhan and O. Wintenberger. Weakly dependent chains with infinite memory. Stochastic Processes and their Applications, 118(11):1997–2013, 2008.

[11] A. Goldenshluger and A. Zeevi. Nonasymptotic bounds for autoregressive time series modeling. Annals of Statistics, pages 417–444, 2001.

[12] D. Hsu, S. M. Kakade, and T. Zhang. An analysis of random design linear regression. arXiv preprint arXiv:1106.2363, 2011.

[13] C.-K. Ing and C.-Z. Wei. On same-realization prediction in an infinite-order autoregressive process. Journal of Multivariate Analysis, 85(1):130–155, 2003.

[14] C.-K. Ing and C.-Z. Wei. Order selection for same-realization predictions in autoregressive processes. The Annals of Statistics, 33(5):2423–2474, 2005.

[15] C.-K. Ing, C.-Z. Wei, et al. Order selection for same-realization predictions in autoregressive processes. The Annals of Statistics, 33(5):2423–2474, 2005.

[16] M. Lerasle et al. Optimal model selection for density estimation of stationary data under various mixing conditions. The Annals of Statistics, 39(4):1852–1877, 2011.

[17] R. Shibata. Asymptotically efficient selection of the order of the model for estimating parameters of a linear process. The Annals of Statistics, pages 147–164, 1980.

[18] R. Shibata. Consistency of model selection and parameter estimation. Journal of Applied Probability, pages 127–141, 1986.

[19] S. A. van de Geer. On hoeffding’s inequality for dependent random variables. In Empirical process techniques for dependent data, pages 161–169. Springer, 2002.

Figure

Figure 1: Dimension Jump
Table 1: Empirical estimates of ME

φ0     n/Kn       θ0 = 0.8   θ0 = 0.6   θ0 = -0.8   θ0 = -0.6
0.9    60/8       1.17       1.24       1.21        1.11
0.9    120/9      1.10       1.19       1.12        1.06
0.9    200/10     1.05       1.16       1.11        1.12
0.9    500/12     1.01       1.04       1.13        1.05
0.9    1000/13    1.01       1.01       1.04        1.06
0.9    2000/15    1.00       1.01       1.03        1.03
0.7    60/8       1.20       1.23       1.15
