Random thresholds for linear model selection

(1)

January 2008, Vol. 12, p. 173–195 www.esaim-ps.org

DOI: 10.1051/ps:2007047

RANDOM THRESHOLDS FOR LINEAR MODEL SELECTION

Marc Lavielle

¹

and Carenne Lude˜ na

²

Abstract. A method is introduced to select the significant or non null mean terms among a collection of independent random variables. As an application we consider the problem of recovering the significant coefficients in non ordered model selection. The method is based on a convenient random centering of the partial sums of the ordered observations. Based onL-statistics methods we show consistency of the proposed estimator. An extension to unknown parametric distributions is considered. Simulated examples are included to show the accuracy of the estimator. An example of signal denoising with wavelet thresholding is also discussed.

Mathematics Subject Classiﬁcation. 62F12, 62G05, 62P99.

Received October 9, 2006.

1. Introduction

Consider the following model

y_i=µ_i+ε_i, i= 1, . . . , n

where (µ_i) is a sequence of unknown constants some of which are zero and (ε_i) are centered, independent random variables with common cumulative distribution F_ε. The problem we study in this article is choosing the signiﬁcant, non zero mean, coeﬃcients based on the observations (y_i). Based on the data, it only seems reasonable to choose indexiif

|yi|> τ, (1)

for some appropriate threshold. Thus the problem is the choice ofτin (1), which will depend on the distribution of the sequence (ε_i).

In practiceτmust be calibrated in terms of the data. A usual technique is to consider a sequence of thresholds (τ_j) for values ranging from very small (many significant coefficients) to very big (few significant coefficients) and study the point where a substantial decrease in the number of significant coefficients occurs. Of course choosing the “right”τ is equivalent to choosing the “right” numberkof significant coefficients, with the advantage that this can be done independently of the choice of (τ_j) by looking at the relative size of the observations. Indeed, a jump in the relative size of the observations should indicate the existence of significant (not noise) coefficients.

This has been considered by a number of authors (see, for example, [12,13,17]) and many adaptive procedures aimed at studying the correct “jump point” have been developed.

Keywords and phrases. Adaptive estimation, linear model selection, hard thresholding, random thresholding,L-statistics.

1 University Paris-Sud and University Ren´e Descartes, France;[email protected]

2 IVIC, Venezuela;[email protected]

c EDP Sciences, SMAI 2008

Article published by EDP Sciences and available at http://www.esaim-ps.org or http://dx.doi.org/10.1051/ps:2007047

(2)

One method that has proved to work remarkably well in practice in many settings including non ordered model selection for regression problems is to consider not only the individual observations, but instead the partial sums of the absolute (or squared) observations ordered decreasingly, and study the ﬂuctuations of these partial sums around their expectation conditional to the total sum. Even when this conditional expectation cannot be computed in a closed form, an exponential change of variable makes this centering possible. The estimated “jump point” is obtained by minimization of a certain functional of the conditionally centered partial sums. Intuitively, each ordered observation from the null mean subset should account for a certain expected proportion of the observable total sum. If non null mean terms exist, this proportion changes and can thus be detected. Centering the partial sums by their non conditional expectations introduces an additional variance term.

Our main result, Theorem 2.2 proves the method will consistently select this “jump point” under mild assumptions over the gap between significant and non significant coefficients.

Our method requires previous knowledge of F_ε. In a parametric setting,F_ε =F_ε(·;θ), we show that the case θ unknown can also be successfully considered introducing a consistent estimator of θ. When θ is a scale parameter, an appropriate modiﬁcation of the estimating procedure yields a scale free statistic, which is also shown to be consistent.

We then apply our method to the problem of estimating the signiﬁcant coeﬃcients for the regression problem

y_i=f(x_i) +η_i, i= 1, . . . , n

where f is an unknown function in some function space S and η_i are independent random variables with variance σ². A usual estimation procedure is to consider f ∈L²(µ) and a ﬁnite orthonormal system {φλ}Λ, with |Λ| =M_n. Denoting by y, φj_n = 1/n_n

i=1y_iφ_λ(x_i) the empirical coeﬃcients, Donoho and Johnstone [10] in their seminal article proposed choosing only those coeﬃcients whose absolute value exceeded a certain thresholdu=

τ σ²log(n)

n .This procedure has since been refereed to as hard thresholding.

In a very interesting reinterpretation, Barronet al. [4] study the problem of hard thresholding in the context of non ordered model selection based on the addition of a penalization term. Their arguments are combinatorial based on the complexity of the underlying linear spaces: the size of the set of all possible models of sizekout of K is bounded by (eK/k)^k and a logarithmic factor depending onKmust be introduced in order to bound the probabilities. In terms of (1) our observations would now be the empirical coeﬃcientsy, φj_n. Of course, except for the caseη_i∼N(0, σ²), the empirical coeﬃcients will not be necessarily independent, although uncorrelated, so that the problem does not comply to our assumptions. However, in practice the method works well. As discussed in Section 6.1, our method can be interpreted as a random threshold procedure.

Although closely related to the problem of estimating the proportion of false null hypothesis [9, 14], our goals are different. We are more interested in finding a consistent estimator for the jump point as well as convergence rates than in establishing an overall test for the proportion of false null hypothesis. In the latter case the main goal is establishing lower bounds that assure a specified confidence level. In terms of the proposed methodology, in our procedure we look at the fluctuations of the conditionally centered ordered data and not at the number of individual rejected tests. However, the connection between both approaches remains an interesting research topic.

The article is organized as follows: in Section 2 we introduce the problem and basic notation as well as the proposed test procedure. In Section 3 we state and prove theoretical results that justify our procedure, namely consistency of the selected subset of signiﬁcant coeﬃcients. In Section 4 we consider certain extensions which include the parametric distribution case, an application to the problem of non ordered linear model selection for the regression setting and interpret out testing scheme in terms of a random penalization procedure. In Section 5 we present simulated examples. An application to signal denoising with wavelet thresholding is proposed Section 6.

(3)

2. Describing the procedure 2.1. A ﬁrst hypothesis testing procedure

Assume we observe y_i = µ_i+ε_i. Variables ε_i are assumed to be centered, independent and identically distributed with common cumulative distribution F_ε. We begin by assuming that the cumulative distribution functionF_|ε| of the|εi|’s is known. In Section 4 we will deal with the unknownF_|ε|case.

Given the collection (y_i; 1≤i≤n), we are interested in this section in testing if all theµ_i’s are null or not.

Thus, we introduce the following hypothesis:

Null hypothesis:

H₀: µ_i≡0 fori= 1, . . . , n.

Alternative hypothesis:

H₁: there exists a non empty subsetI of{1,2, . . . , n} such thatµ_i= 0 fori∈I.

Then, the test procedure is deﬁned as follows:

i) Order the observations|y(1)| ≥ |y(2)| ≥. . .≥ |y(n)|.

ii) Fori= 1, . . . , n, letX(i)=−log

1−F_|ε|(|y(i)|) . iii) LetT_j=_j

i=1X(i)andQ_j=EH0(T_j|Tn).

iv) Deﬁne the test statisticD_n= max_j|Tj−Q_j|/√

n. We will reject the null hypothesis ifD_n> d_α, where d_αis deﬁned in Section 3.

Remark 1. Under the null hypothesis, the sequence (X(i)) is a decreasing sequence of exponential random variables with parameter 1. Then, the conditional expectation EH0(T_j|Tn) can easily be computed using the following proposition (the proof is given in the Appendix):

Proposition 2.1. Assume X(1), X(2), . . . , X(n) is an ordered sequence of Exp(1) random variables, with X(1)≥X(2)≥. . . X(n). For any1≤j≤n, let T_j=_j

i=1X(i). Then, for anyj ≤K≤n, E

X(i)

= n =1

1

(2)

E(T_j) = j+j n i=j+1

1

i (3)

E(T_j|T_K) = E(T_j)

E(T_K)T_K. (4)

Remark 2. The distribution of the test statistic D_n cannot be computed in a closed form. Nevertheless, the following standard result will allow us to construct probability tables (the proof is given in the Appendix):

Theorem 2.1. AssumeX₍₁₎, X₍₂₎, . . . , X_(n)is an ordered sequence of Exp(1) random variables, with X₍₁₎≥ X(2) ≥. . . X(n). For any 1 ≤j ≤n, let T_j =_j

i=1X(i). Introduce for t ∈[0,1]the random process d_n(t) = T[nt]−E

T[nt]|Tn

. Then, √¹

nd_n(t), as a stochastic process indexed ont∈[0,1], converges in distribution to a zero mean Gaussian process∆ with covariance function defined by

E(∆(t)∆(s)) = 1

0

1

0 [(1−u)∧(1−v)−(1−u)(1−v)][1I[0,t](u)−t+tlog(t)]

×[1I[0,s](v)−s+slog(s)]dG⁻¹(u)dG⁻¹(v), whereG(x) is the distribution function of an exponential r.v.

Using Theorem 2.1, we can conclude that statistic D_n deﬁned in the test procedure converges weakly to

∆_∞= sup_t∆(t). Then,d_αis deﬁned as the αquantile of ∆_∞.

(4)

Remark 3. Instead of assuming that the distribution of the |ε_i| is known, we can assume that there exists an increasing continuous function h:R⁺ →R⁺ such that the cumulative distribution function F_h of h(|ε|) is known. Then, X(i) is deﬁned as−log

1−F_h(h(|y(i)|))

. Without any loss of generality, we will consider the caseh=idin the following.

Remark 4. A uniform change of variable can also be used, by settingX(i)=F_|ε|(|y(i)|). Indeed, the conditional expectation ofT_j can also be computed here:

E(T_j|TK) =j(K−j)

K+ 1 + j(j+ 1) K(K+ 1)T_K.

2.2. Choosing the right coeﬃcients

If we reject the null hypothesis, the next step is to select the significant coefficients. For this we define the following procedure:

i) Fori= 1, . . . , n, letX(i)=−log

1−F_|ε|(|y(i)|) .

ii) LetK_n be some positive integer. For 1≤k≤n−K_n and 1≤j≤K_n, set

B_k,j,n= j

1 +_n−k

i=j+11/i K_n

1 +_n−k

i=K_n+11/i

(5)

and compute

T_k,j =

k+j

i=k+1

X(i), (6)

Q_k,j = B_k,j,nT_k,K_n, (7)

η_k = max

1≤j≤Kn

|Tk,j√−Q_k,j|

n . (8)

iii) Let

ˆk= Arg min

1≤k≤n−Kn

η_k.

Remark 2.1. The1or the2norms can be used instead of the_∞norm to deﬁneη by setting η_k =n⁻³²

Kn

j=1

|Tk,j−Q_k,j|

or

η_k =n⁻²

Kn

j=1

(T_k,j−Q_k,j)².

In the spirit of Section 2.1, it is possible to reinterpret the above procedure in a data-dependent “Hypothesis testing” setting. Lets be the (data-dependent) one to one mapping from{1,2, . . . , n} to{1,2, . . . , n}deﬁned byYs(i)=Y(i)(recall that (Y(i)) is a decreasing sequence).

Then, deﬁne the set of alternative hypotheses:

Alternative hypothesis:

H1(k): there exists a subsetI_k ⊂ {1,2, . . . , n}, with|I_k|=k, such that, – for anyi∈I_k,s(i)≤k andEH1(k)(Y_i)= 0,

– for anyi∈I_k,s(i)≥k+ 1 andEH1(k)(Y_i) = 0.

(5)

Under H1(k), there arek signiﬁcant coeﬃcients and|y_(k+1)|, . . . ,|y_(n)| have distributionF_|ε|. We will denote Ek() the expectation underH1(k)(instead ofEH1(k)()). That is,

Ek(T_k,j) :=j

⎛

⎝1 +

n−k

i=j+1

1/i

⎞

⎠. (9) So that,

B_k,j,n= Ek(T_k,j) Ek(T_k,K_n)

andQ_k,jcan be thought of asQ_k,j=Ek(T_k,j|Tk,Kn) using the results of the previous section and Proposition 2.1.

Our main consistency result requires bounding deviations of thek−th order statistic, amongnobservations, for k = n and k = n^d, for some d < 1/2. For this, two general results are required. The first concerning the asymptotic behavior of the maximum of n independent variables, and the second related to the limiting Gaussian behavior of intermediate order statistics. In order to unify notation and simplify the presentation both results will be given under the assumption that there existw =w(F_ε) and x0 < w such that the distribution of the errors,F_ε has a differentiable densityf_ε over (x0, w). We also assumef_ε satisfies one of the Von-Misses type conditions we give below.

ConditionVM.

1. We assumew=∞thatf_εis positive near inﬁnity and that there exists α >0 such that

x→∞lim

xf_ε(x) [1−F_ε(x)] =α.

2. We assumew <∞,f_εis positive near inﬁnity and that there existsα >0 such that

x↑wlim

(w−x)f_ε(x) 1−F_ε(x) =α.

3. We assume thatf_ε(x) is positive and that

x↑wlim

f(x)_w

x(1−F_ε(t))dt [1−F_ε(x)]² = 1, forx∈(x0, w).

We then have the following result Proposition 2.2.

1. (Convergence of extremes, de Haan [8], Ths. 2.7.1, 2.7.2, 2.7.3.) Assume one of the VM conditions hold. Then, F_ε(a_n+b_nx)ⁿ → G, for G = G(i, θ) the generalized extreme value function, and some choice of constants a_n, b_n. The parameters of the GEV function Gdepend on the VM condition that holds.

2. (Convergence of intermediate extreme values, Falk [11], Th.2.1.) Assume one of the VM conditions hold. Assumek→ ∞,k/n→0 and set

a_n,k=F_ε⁻¹(1−k/n); b_n,k=k¹^/²/(nf_ε(a_n,k)).

Then, if|ε|(k,n)is the n−korder statistic

P(|ε|(k,n)≤d_n+c_nx)→Φ(x),

(6)

whereΦ stands for the standard normal distribution function andc_n, d_n are any sequences that satisfy limn c_n/b_n,k= 1 and lim

n (d_n−a_n,k)/b_n,k = 0.

We now state our asymptotic framework:

AF1. There exists t ∈ (0,1) and a subset I_k

n of {1,2, . . . , n}, with k_n = [tn] and |Ik_n| = k_n, such that µ_i= 0 ifi∈I_k

n. For all other index,µ_i = 0.

AF2. For anyi∈I_k_n,|µi| ≥α_n, whereα_n→ ∞is such that

α_n= 2a_n+r_nb_n (10)

forr_n→ ∞.

AF3. K_n/n→csuch that 0< c <1−t. We have the following result

Theorem 2.2. Let (u_n) be any positive and decreasing sequence such that √

n u_n → ∞. Then, under the asymptotic framework defined byVM, AF1, AF2, AF3,

P

ˆk n−t

> u_n

→0. (11)

Moreover, fora >0 there exist constants c1, c2 which depend onasuch that if u_n=c1α_n√

logn 2√

n +c2α_nlog(n)

2n ,

then

P_H₁(k_n)

kˆ n−t

> u_n

≤2e^−a^log(ⁿ⁾+ 2P

1max≤i≤n|εi|> α_n

. (12)

The proof of Theorem 2.2 is given in Section 3.

Remark 2.2. No overlapping is ensured with high probability when n → ∞ under Condition AF2. This condition can be weakened assuming that the minimum gap between both subsets is such that the overlap between both ordered sequences is not any larger thann^d, for some d <1/2. In this case we would require the alternative condition

AF2’. For anyi∈I_k

n,|µi| ≥α_n, whereα_n→ ∞is such that

α_n=a_n+r1,nb_n+a_nd,n−k_n +r2,nb_nd,n−k_n, forr_i,n→ ∞fori= 1,2 and somed <1/2.

Thus the condition over α_n in [AF2’] yields a slight improvement with respect to [AF2] in terms of the constant since a_nd,n−kn +r2,nb_nd,n−kn is smaller than 2a_n+r1,nb_n. However it is not possible to have α_n≤a_n+r1,nb_n+a_n1/2,n−n^1/2 if rates as in Theorem 2.2 are sought. This fact yields an important insight as to the minimal signal to noise ratio that should be required.

3. Proof of Theorem 2.2

Our procedure is based on two facts: a) under our assumptions over the error distribution, if the null hypothesis is rejected, that is, if there is a group of significant coefficients and one of non significant coefficients,

(7)

both groups of observations will not mix with high probability under assumptionAF2) and b) if both groups are separated at indexk_n, thenT_k_n_,j−Q_k_n_,jwill only converge at rate√

nfor indexk_nsuch that|k_n−k_n|=o(√ n).

Setu_i=y_i fori∈I_k

n andv_i=y_i fori∈I_k

n. Thus (v_i) is an i.i.d. sequence with distributionF_ε.

We have the following lemma that assures that both collections are stochastically in order with high probability:

Lemma 3.1. Let (u(i))and(v(i))be the sequences(|ui|)and(|vi|)in a decreasing order. Then P

v(1)> u(k_n)

→0 and

P

v(1)> α_n 2

→0.

Proof. Let (˜v_i) be a sequence of i.i.d. r.v. with distributionF_ε. Then, P

v₍₁₎> u_(k n)

≤P

v₍₁₎+ ˜v₍₁₎> α_n

≤P

v₍₁₎> α_n/2 +P

˜

v₍₁₎> α_n/2

≤2P

v(1)> α_n/2

→2W

α_n/2−a_n b_n

→0.

Lemma 3.1 yields the (u(i)) and (v(i)) are stochastically in order with high probability. Let Ω_n be the subset of Ω wherev(1) < α_n/2 andu(k_n)> α_n/2. ClearlyP(Ω_n)→1. In what follows we will restrict our proof to Ω_n.

LetEk(T_k,j) be as deﬁned in equation (9). Also leta_i=E0 X(i)

=_n

=11/.

1) Consider ﬁrst the casek > k_n. On Ω_n,

T_k,j−Q_k,j = T_k,j−B_k,j,nT_k,K_n

=

T_k,j−Ek_n(T_k,j)

−B_k,j,n

T_k,K_n−Ek_n(T_k,K_n) +E_k_n(T_k,j)−B_k,jE_k_n(T_k,K_n)

= R_k,j+S_k,j.

We have decomposed the statisticsT_k,j−Q_k,j into a random partR_k,j and a deterministic partS_k,j. Remark over Ω_n, R_k,j is a function of v(k), . . . , v(K_n). First, let k = [tn] and j = [sn] for t ≤s. As in Theorem 2.1, R_k,j1IΩ_n (normalized by√

n) as a process indexed by (t, s)∈(0,1)² converges in distribution to a zero-mean Gaussian process Γ_t,s= (1−t)[∆(s−t)−∆(t−t)].

On the other hand,

S_k,j = Ek_n(T_k,j)− Ek(T_k,j)

Ek(T_k,K_n)Ek_n(T_k,K_n)

= Ek_n(T_k,j)−Ek(T_k,j)− Ek(T_k,j) Ek(T_k,K_n)

Ek_n(T_k,K_n)−Ek(T_k,K_n)

=

j+k−k_n i=j+1

a_i−

k−k_n

i=1

a_i+B_k,j,n

⎛

⎝^Kⁿ

+k−k_n

i=K_n+1

a_i−

k−k_n

i=1

a_i

⎞

⎠

=

k−k_n

i=1

a_i+j−a_i+B_k,j,n(a_i+Kn−a_i) .

(8)

Thus, there exists a constant, γ >0, which depends oncin [AF3], such that sup_j|S_k,j| ≥γ(k−k_n) and Pk_n

k_n−k > n u_n

≤ P

η_k_n>supη_k , (k−k_n)> n u_n

(13)

≤ P

2 sup

k

sup

j

R_k,j >inf

k sup

j |Sk,j|, (k−k_n)> n u_n

+P(Ω^c_n)

≤ P

2 sup

k

sup

j

R_k,j > γ n u_n

+P(Ω^c_n).

Because of the weak convergence ofR_k,j1IΩ_n the above probability tends to zero whenngoes to inﬁnity.

2) Consider now the casek < k_n. On Ω_n, T_k,j−Q_k,j = T_k,j−B_k,j,nT_k,K_n

= (1−B_k,j,n)T_k,k_n_−k+

T_k_n_,j−E T_k_n_,j

−B_k,j,n

T_k_n_,K_n−E

T_k_n_,K_n +E

T_k_n_,j

−B_k,j,nE

T_k_n_,K_n

= A_k,j+R_k_n_,j+U_k,j

whereA_k,j= (1−B_k,j,n)T_k,k_n_−k. Observe that over Ω_n,|y(i)|> α_n/2. Then, there existsc(α_n)>0 such that T_k,k_n_−k > c(α_n)(k_n−k), thus A_k,j =O(k_n−k). On the other hand,R_k_n_,j1IΩ_n converges in distribution to a zero-mean Gaussian process Γ_t_,s. Consider now the bias termU_k,j:

U_k,j = Ek_n

T_k_n_,j

− Ek(T_k,j) Ek(T_k,K_n)Ek_n

T_k_n_,K_n

= Ek_n

T_k_n_,j

−Ek

T_k_n_,j

− Ek

T_k_n_,j Ek(T_k,K_n)

Ek_n

T_k_n_,K_n

−Ek(T_k,K_n) +Ek_n

T_k n,Kn

Ek(T_k,K_n) E_k

T_k,k n

=

j−k

i=j+k−k_n+1

a_i−

k_n−k i=1

a_i+ Ek

T_k n,j

Ek(T_k,K_n)

Kn

i=Kn+k−k_n+1

a_i − Ek_n

T_k n,Kn

Ek(T_k,K_n)

k_n−k i=1

a_i.

Thus, there exists a constant δ > 0, which depends on c in [AF3], such that sup_j|Uk,j| ≥ δ(k−k_n) and Pk_n

k−k_n> n u_n

→0 when n→ ∞. The latter together with (13) shows (11).

In order to show (12), sharper bounds onPk_n

sup_ksup_jR_k,j > C n u_n

are required for any given constant C. As above, we will restrict our attention to the set Ω_n∩Ω_n and drop this fact from the notation. Consider ﬁrst as above the casek > k_n. Write,

R_k,j =

T_k,j−Ek_n(T_k,j)

−B_k,j,n

T_k,K_n−Ek_n(T_k,K_n)

= R⁽¹⁾_k,j+R_k,j⁽²⁾.

Since sup_jB_k,j,n= 1, we have that sup_j|R⁽²⁾_k,j|=|Tk,Kn−Ek_n(T_k,K_n)|.

LetGdenote the common distribution function of the collection (X_i). We can rewrite T_k,K_n−E_k_n(T_k,K_n) =

i

[X_i−E_k_n(X_i)]1I_{G−1(1−Kn/n)<Xi}.

(9)

Thus, recallingw_n=α_n−(a_n−r1,nb_n),

T_k,K_n−Ek_n(T_k,K_n) w_n

is the sum of independent bounded r.v. with variance bounded by 1, so that by Bennet’s inequality

P

sup_j|R⁽²⁾_k,j|

w_n >γ/2nu_n w_n

≤ P

|Tk,Kn−Ek_n(T_k,K_n)| w_n > c1γ

4√ 2

2nlogn+3c2γ 4

logn 3

≤ e⁻⁽â^{+1) log(}ⁿ⁾, choosingc2≥ ⁴⁽â_3γ⁺¹⁾ andc1≥4^√²^√_γâ⁺¹.

Hence, adding ink

P

sup

k

sup

j<k+Kn

R⁽²⁾_k,j > γ 2nu_n

≤e^−a^logⁿ. ForR_k,j⁽¹⁾ we have

T_k,j−Ek_n(T_k,j)

w_n =

i

[X_i−Ek_n(X_i)]1I_{G−1(1−j/n)<Xi}

w_n .

Hence in this case we must use a functional version of Bennet’s inequality (Th. 7.3 in [6]) which yields P

sup_j|R_k,j⁽²⁾| w_n >E

sup_j|R_k,j⁽²⁾| w_n

+√

2xv+x 3

≤e^−x,

forv ≥n+ 2E

sup_j|R⁽²⁾_k,j| wn

. Thus it remains to boundA=E

sup_j|R⁽²⁾_k,j| wn

. This can be done using standard symmetrization and entropy techniques to obtain, A ≤ 4√

nlogn, as the random entropy of the class A = {1I_G⁻¹(1−t), t∈[0,1]}(as it is a collection of increasing functions) is bounded by 2 logn.

As above, P

sup_j|R⁽¹⁾_k,j|

w_n > γ/2nu_n w_n

≤ P

|Tk,j−Ek_n(T_k,j)| w_n >4

nlogn +

2(a+ 1) logn(n+ 4

nlogn) + (a+ 1)logn 3

≤ e⁻(a+1) log(n), choosingc_i, i= 1,2 appropriately.

The casek < k_n follows analogously.

4. Some extensions 4.1. Unknown distribution

Assume now that the distribution F_ε of the ε_i’s is a parametric distribution F_ε(· ; θ), but where the parameterθ is unknown. For any 0≤k≤n−1, let θ_k =θ(y _(k+1), y_(k+2), . . . , y_(n)) be an estimator ofθ. Let F_|ε|(·; θ) be the distribution of the|ε_i|’s. We will consider the following assumptions:

(10)

F1. The cumulative distribution functionF_|ε|is two times diﬀerentiable as a function ofθwitha.e. strictly positive derivative atθ=θ.

F2. θbelongs to some compact set Θ and there exists, underH_k

n, a consistent estimatorθ_k

n =θ(y _(k n+1), y(k_n+2), . . . , y(n)) ofθ.

F3. There exists (a, b) such that 0< a < t< b <1 and a Lipschitz continuous function ˜θdeﬁned on [a, b]

such that, underH_k_n, (θ[tn]) converges uniformly on [a, b] in probability to (˜θ(t)).

Remark 1. Under hypothesisF2 andF3,θ_k

n is a consistent estimator of ˜θ(t) =θ.

Remark 2. Whent < t, convergence ofθ[tn] can be diﬃcult to check with any estimator, sinceθ[tn] depends on some y_i’s that are not distributed under distribution F_ε. Nevertheless, it is possible to use an estimator based on some empirical quantiles and that only depends on the smallest observations, that is, that depends only on the observations distributed under F_ε.

For anyθ∈Θ, letX_i(θ) =−log

1−F_|ε|(|yi|, θ)

, and T_k,j(θ) =_k+j

i=k+1X(i)(θ).

Then, we deﬁne the following procedure:

i) LetK_n≤[(1−b)n] be some positive integer. For [a n]≤k≤n−K_n; 1. letθ_k =θ(y (k+1), y(k+2), . . . , y(n));

2. fori= 1, . . . , n, letX(i)(θ_k) =−log

1−F_|ε|(|y(i)|;θ_k)

; 3. for 1≤j≤K_n, compute

T_k,j(θ_k) =

k+j

i=k+1

X(i)(θ_k), Q_k,j(θ_k) = B_k,j,nT_k,K_n(θ_k),

η_k(θ_k) = max

k+1≤j≤n

|Tk,j(θ_k)√−Q_k,j(θ_k)|

n ;

ii) let

kˆ= Arg min

an≤k≤bnη_k(θ_k).

Remark. Here, Q_k,j(θ_k) = B_k,j,nT_k,K_n(θ_k) is the conditional expectation of T_k,j, conditionally to T_k,K_n, assuming that k_n =k and thatθ=θ_k.

Then, we have the following result, Theorem 4.1. AssumeF1, F2, F3.

i) Introduce fort∈[0,1]ands∈[a, b]the random process dˆ_n(t, s) =T_k

n,[K_nt](θ_[ns])−EHkn

T_k

n,[K_nt](θ_[ns])|T_k

n,Kn(θ_[ns])

.

Then, dˆ_n(t, s)/√

n, as a stochastic process indexed on [0,1]×[a, b], converges in distribution, underH_k_n, to a zero mean Gaussian process(Λ(t, s)).

ii) Let (u_n) be any positive and decreasing sequence such that √

n u_n → ∞. Then, under the asymptotic framework defined by AF1, AF2, AF3,

P_H₁(k_n)

ˆk n −t

> u_n

→0. (14)

Proof. We ﬁrst showi).

(11)

With the above notation, for anyθ∈Θ, let

Ψ_j(θ) = T_k_n_,j(θ)−Q_k_n_,j(θ) Ψ_j(θ) = ∂Ψ_j

∂θ (θ).

Fort∈[0,1] ands∈[a, b], let

dˆ_n(t, s) = Ψ[nt](θ[ns])

= Ψ[nt](˜θ(s)) + (θ[ns]−θ(s))Ψ˜ _[_nt_](˜θ(s)) +O((˜θ(s)−θ[ns])²).

Using the same proof used for the convergence of (Ψ_[nt](θ))/√

n (see the Appendix), we show that, for s ∈ [a, b], (Ψ[nt](˜θ(s)))/√

n and (Ψ_[_nt_](˜θ(s)))/√

nalso converge to two zero-mean Gaussian processes. Then, using hypothesisF3,θ[ns]→θ(s) uniformly over [a, b], and then, ˆ˜ d_n(t, s) to a zero mean Gaussian process Λ(t, s).

We show nowii). Fors∈[a, b], leta_i(˜θ(s)) =EH0

X(i)(˜θ(s))

. Following the proof of Theorem 2.2, consider ﬁrst the casek > k_n. On Ω_n, fors∈[a, b],

T_k,j(˜θ(s))−Q_k,j(˜θ(s)) =

T_k,j(˜θ(s))−Ek_n

T_k,j(˜θ(s))

−B_k,j,n

T_k,K_n(˜θ(s))−Ek_n

T_k,K_n(˜θ(s)) +Ek_n

T_k,j(˜θ(s))

−B_k,jEk_n

T_k,K_n(˜θ(s))

= R_k,j(˜θ(s)) +S_k,j(˜θ(s)).

As in Theorem 2.2,R_k,j(˜θ(s))1IΩ_n(normalized by√

n) as a process indexed by (t, w, s)∈(0,1)²×(a, b) converges in distribution to a zero-mean Gaussian process. On the other hand,

S_k,j(˜θ(s)) =

k−kn

i=1

a_i+j(˜θ(s))−a_i(˜θ(s)) +B_k,j,n(a_i+Kn(˜θ(s))−a_i(˜θ(s)) .

Thus, there exists a constant, γ > 0, which depends on a, b in [F3], such that sup_j|Sk,j(˜θ(s))| ≥ (k−k_n)γ.

We conclude that Pk_n

k_n−k > n u_n

→ 0 using the arguments used for Theorem 2.2. The case k < k_n is

identical.

4.2. The unknown variance case

When θ is a scale parameter, i.e. F_ε(y;θ) = F_ε(y/θ; 1), we introduce the following procedure which is scale invariant:

i) fori= 1, . . . , n, letX_(i)=|y_(i)|;

ii) letK_n be some positive integer. For 1≤k≤n−K_n and 1≤j≤K_n, compute T_k,j =

k+j

i=1

X(i), (15)

Q_k,j = EH1(k)_k+j i=k X_(i)

EH1(k)_k+Kn

i=k X_(i)

T_k,K_n, (16)

η_k = max

1≤j≤Kn

|T_k,j√−Q_k,j|

n ; (17)