Polygonal smoothing of the empirical distribution function
Texte intégral

HAL Id: hal-02062903

https://hal-univ-avignon.archives-ouvertes.fr/hal-02062903

Submitted on 10 Mar 2019

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Polygonal smoothing of the empirical distribution function

Delphine Blanke, Denis Bosq

To cite this version:

Delphine Blanke, Denis Bosq. Polygonal smoothing of the empirical distribution function. Statistical Inference for Stochastic Processes, Springer Verlag, 2018, 21 (2), pp. 263-287. 10.1007/s11203-018-9183-y. hal-02062903


Abstract. We present two families of polygonal estimators of the distribution function: the first family is based on the knowledge of the support, while the second addresses the case of an unknown support. Polygonal smoothing is a simple and natural method for regularizing the empirical distribution function F_n, but its properties have not been studied deeply. First, consistency and exponential-type inequalities are derived from well-known convergence properties of F_n. Then, we study their mean integrated squared error (MISE) and we establish that polygonal estimators may improve the MISE of F_n. We conclude with some numerical results comparing these estimators globally, and also with the integrated kernel distribution estimator.

Keywords: Distribution function estimation, Polygonal estimator, Cumulative frequency polygon, Order statistics, Mean integrated squared error, Exponential inequality, Smoothed processes.

MSC: 62G05, 62G30, 62G20.

1. Introduction

Distribution function estimation finds many applications, especially in survival analysis or for quantile estimation of a population. This explains why a large number of articles are devoted to its estimation; see e.g. the nice review written by Servien (2009). The most classical and simple estimate of the cumulative distribution function (cdf) F is the empirical distribution function (edf) F_n. The properties of F_n are well known: it is an unbiased and strongly uniformly consistent estimator of F. But if F is absolutely continuous with density f, the edf F_n does not take this smoothness property into account. Several smoothing techniques have been considered in the literature to correct this drawback. The kernel estimator is one of the most commonly used. Theoretical properties of this estimator are now well established; we may refer to Swanepoel and Van Graan (2005); Servien (2009) and, more recently, to Quintela-del Río and Estévez-Pérez (2012) for a literature review about it. Other techniques are possible, such as moving polynomial regression (Lejeune and Sarda, 1992), polynomial spline regression (Cheng and Peng, 2002), level crossings (Huang and Brill, 2004), and Bernstein polynomials (with degree depending on the sample size n, Babu et al., 2002; Leblanc, 2012), among others. All these methods depend on some smoothing parameter that must be calibrated to avoid the classical phenomena of over- or under-smoothing.

For density estimation, polygonal frequency estimators, connecting the mid-bin values of a histogram by straight lines, are classical and have been widely studied; see e.g. Scott (1985), Simonoff (1996, p. 20-39) and references therein. For estimating the distribution function, practitioners are also accustomed to using cumulative frequency polygons (hand-drawn graphs, sometimes called ogives), especially for grouped data. In this paper, we consider n independent and identically distributed (iid) real-valued random variables X_1, ..., X_n with an absolutely continuous cumulative distribution function (cdf) F having density f. For estimating F, a simple polygonal smoothing of the empirical distribution function is considered. More precisely, we present and study two families of polygonal estimators of F: the first one supposes that the support [a, b] of f is known, while the second one addresses the case of an unknown support. Until now, and as far as we can judge, the comparison of these simple estimators with the classical edf has not been studied deeply. To our knowledge, the single reference is Read (1972), where a continuous estimator of F, namely joining the points (X^*_i, i/(n+1)), with X^*_1, ..., X^*_n the ordered sample, is studied. For n sufficiently large, it is shown that its expected squared error is no larger than that of F_n and that it dominates F_n in terms of the integrated error (with respect to F). Our goal is to extend these first results to our two general families of smoothed estimators of F_n.

The paper is organized as follows. In Section 2, we introduce our two families of polygonal estimators and give their first properties, which can be deduced from their proximity to F_n. In Section 3, the MISE of these estimators is established in Theorem 3.6, the main result of this paper. To examine the small-sample behaviour of the polygonal estimators, we present in Section 4 a simulation study that also includes the kernel distribution estimator. To this aim, we consider the mixtures of normal distributions introduced by Marron and Wand (1992) and completed by Janssen et al. (1995). A conclusion and discussion about possible extensions of our results appear in Section 5. Finally, the most technical auxiliary result is proved in Appendix A, and the parameters of the normal mixtures involved in the simulations are recalled in Appendix B.

2. Polygonal distribution estimators

2.1. Definition. For i.i.d. random variables X_1, ..., X_n with absolutely continuous cdf F, we consider two families of polygonal estimators of F derived from the classical empirical distribution function F_n:

$$F_n(x) = \frac{1}{n}\sum_{i=1}^{n} I_{(-\infty,x]}(X_i), \qquad x \in \mathbb{R},$$

where I_A denotes the indicator function of the set A. The first family addresses the case of a known support [a, b], while the second is adapted to the case of an unknown or infinite support. If X^*_1 < \cdots < X^*_n (almost surely) denotes the ordered sample, we define G_n^{(j,p)}, j = 1, 2, as follows (see also Figure 1 for an illustration with p = 0, p = 1/2 and p = 1):

$$G_n^{(1,p)}(t) = \frac{(t-a)(1-p)}{n(X^*_1 - a)}\, I_{[a,X^*_1)}(t) + \Big(1 - \frac{p(b-t)}{n(b-X^*_n)}\Big)\, I_{[X^*_n,b]}(t) + \sum_{k=1}^{n-1} \frac{t + (k-p)X^*_{k+1} - (k+1-p)X^*_k}{n(X^*_{k+1} - X^*_k)}\, I_{[X^*_k,X^*_{k+1})}(t) \qquad (1)$$

and

$$G_n^{(2,p)}(t) = G_n^{(1,p)}(t)\, I_{[X^*_1,X^*_n)}(t) + \max\Big(0,\; \frac{t - (2-p)X^*_1 + (1-p)X^*_2}{n(X^*_2 - X^*_1)}\Big)\, I_{(-\infty,X^*_1)}(t) + \min\Big(1,\; \frac{t + (n-1-p)X^*_n - (n-p)X^*_{n-1}}{n(X^*_n - X^*_{n-1})}\Big)\, I_{[X^*_n,+\infty)}(t). \qquad (2)$$

In this way, their construction depends on a known real parameter p, chosen in [0, 1], indicating that the sample points (X^*_k + p(X^*_{k+1} - X^*_k), k/n), k = 1, ..., n-1, are connected. For example, the case p = 0 (respectively p = 1) joins the points (X^*_k, k/n) (resp. (X^*_{k+1}, k/n)), while the midpoints ((X^*_k + X^*_{k+1})/2, k/n) are connected when p = 1/2. These two families of estimators constitute proper distribution functions, being continuous and increasing.

[Figure 1. Estimators F_n, G_n^{(j,0)}, G_n^{(j,1/2)}, G_n^{(j,1)}, j = 1, 2, for a distribution with support [a, b].]
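Concretely, both families reduce to linear interpolation through the knots (X^*_k, (k-p)/n): G_n^{(1,p)} is anchored at (a, 0) and (b, 1), while G_n^{(2,p)} extends the two end segments until they reach 0 and 1. The paper's computations are done in R; the following NumPy sketch (function name and layout are ours) implements both families from that knot description:

```python
import numpy as np

def polygonal_cdf(x, t, p=0.5, support=None):
    """Polygonal cdf estimators of equations (1)-(2).

    Both families interpolate linearly through the points (X*_k, (k-p)/n),
    k = 1,...,n.  With a known support [a, b] the polygon is anchored at
    (a, 0) and (b, 1), giving G_n^{(1,p)}; otherwise the end segments are
    extended until they reach 0 and 1, at (2-p)X*_1 - (1-p)X*_2 and
    (1+p)X*_n - p X*_{n-1} respectively, giving G_n^{(2,p)}.
    """
    xs = np.sort(np.asarray(x, dtype=float))
    n = xs.size
    ky = (np.arange(1, n + 1) - p) / n           # values (k-p)/n at X*_k
    if support is not None:                      # G_n^{(1,p)}
        lo, hi = support
    else:                                        # G_n^{(2,p)}
        lo = (2 - p) * xs[0] - (1 - p) * xs[1]
        hi = (1 + p) * xs[-1] - p * xs[-2]
    kx = np.concatenate(([lo], xs, [hi]))
    ky = np.concatenate(([0.0], ky, [1.0]))
    # np.interp is constant outside [lo, hi], matching the max/min in (2)
    return np.interp(t, kx, ky)

# the estimator coincides with F_n at the connecting points X*_k + p(X*_{k+1}-X*_k)
xs = np.sort(np.random.default_rng(0).uniform(0.1, 0.9, 12))
mid = xs[:-1] + 0.5 * np.diff(xs)
print(polygonal_cdf(xs, mid, p=0.5, support=(0.0, 1.0)))  # k/n, k = 1,...,n-1
```

The `support` argument selects between the two families; everything else is shared, which reflects the fact that G_n^{(2,p)} only modifies G_n^{(1,p)} at its extremities.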

Note that Read (1972) studied an estimator very similar to G_n^{(1,0)} (namely, with a = 0, b = 1 and n + 1 in the denominator rather than n). He established that its pointwise risk is no larger than that of F_n and that it dominates F_n in terms of integrated risk (however, this latter result is stated without proof). In the following, we study the exact second-order asymptotic behaviour of the MISE of our general estimators.

Remark 2.1. The families G_n^{(1,p)} and G_n^{(2,p)} differ only at the two ends of the interval [a, b]: the first uses the support in its construction, while the second is adapted to the case of an unknown or unbounded support. We may remark, as in Babu et al. (2002), that to obtain the bounded support [0, 1], a monotone transformation like Y = X/(1 + X) may handle the case of random variables with support [0, ∞), and Y = 1/2 + tan⁻¹(X)/π can be taken for real random variables. Estimators of the cdf may then be derived with F̃_n^{(p)}(x) = G_n^{(1,p)}(y) (and [a, b] = [0, 1]). Also, in our numerical studies of Section 4, where random variables are real-valued, [a, b] is taken as a prediction interval, with quite good results.

2.2. Properties. First, remark that G_n^{(1,p)} and G_n^{(2,p)} are equal to (k-p)/n at the sample points. So for k = 1, ..., n-1 and j = 1, 2:

$$G_n^{(j,p)}(X^*_{k+1}) - G_n^{(j,p)}(X^*_k) = F_n(X^*_{k+1}) - F_n(X^*_k) = \frac{1}{n}.$$

Moreover, it is easy to show that for p in [0, 1) and j = 1, 2, G_n^{(j,p)}(t) and F_n(t) are equal for t = X^*_k + p(X^*_{k+1} - X^*_k), k = 1, ..., n-1. For example, G_n^{(j,1/2)} intersects F_n in the middle of [X^*_k, X^*_{k+1}], k = 1, ..., n-1. The case p = 1 is specific, with G_n^{(j,1)}(X^*_{k+1}) = F_n(X^*_k), j = 1, 2, k = 1, ..., n-1.

For random variables with support [a, b], one also has G_n^{(1,p)}(a) = F_n(a) = 0 and G_n^{(1,p)}(b) = F_n(b) = 1. On the other hand, we may note that

$$G_n^{(2,p)}(t) = \begin{cases} 0 & \text{for } t \le (2-p)X^*_1 - (1-p)X^*_2 \le X^*_1, \\ 1 & \text{for } t \ge (1+p)X^*_n - pX^*_{n-1} \ge X^*_n, \end{cases}$$

but [(2-p)X^*_1 - (1-p)X^*_2, (1+p)X^*_n - pX^*_{n-1}] is not necessarily included in [a, b].

With the help of known properties of F_n, it is easy to obtain first convergence results. To simplify the presentation of our results, and without loss of generality, we suppose from now on that [a, b] = [0, 1] in the definition of the family G_n^{(1,p)}. We start with a lemma that gives the proximity of the estimators G_n^{(j,p)} to F_n. This lemma is simple but essential for all the results derived in this paper.

Lemma 2.2. We get

(a) $$G_n^{(1,p)}(t) - F_n(t) = \frac{(1-p)t}{nX^*_1}\, I_{[0,X^*_1)}(t) - \frac{(1-t)p}{n(1-X^*_n)}\, I_{[X^*_n,1]}(t) + \sum_{k=1}^{n-1} \frac{t - pX^*_{k+1} - (1-p)X^*_k}{n(X^*_{k+1} - X^*_k)}\, I_{[X^*_k,X^*_{k+1})}(t); \qquad (3)$$

(b) $$G_n^{(2,p)}(t) - F_n(t) = \frac{t + (1-p)X^*_2 - (2-p)X^*_1}{n(X^*_2 - X^*_1)}\, I_{[(2-p)X^*_1 - (1-p)X^*_2,\, X^*_1)}(t) + \frac{t - (1+p)X^*_n + pX^*_{n-1}}{n(X^*_n - X^*_{n-1})}\, I_{[X^*_n,\, (1+p)X^*_n - pX^*_{n-1}]}(t) + \sum_{k=1}^{n-1} \frac{t - pX^*_{k+1} - (1-p)X^*_k}{n(X^*_{k+1} - X^*_k)}\, I_{[X^*_k,X^*_{k+1})}(t). \qquad (4)$$

Proof. Straightforward from the definition of F_n and relations (1)-(2).

From Lemma 2.2, one may deduce that

$$\frac{1}{2n} \le \|F_n - G_n^{(j,p)}\|_\infty = \frac{\max(p, 1-p)}{n} \le \frac{1}{n}, \qquad (5)$$

so that \|F_n - G_n^{(j,p)}\|_\infty is maximal, equal to 1/n, for p = 0 or p = 1. Its minimal value, 1/(2n), is reached for p = 1/2. Next, we may derive an exponential inequality.
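The two-sided bound (5) can be checked numerically. Since F_n - G_n^{(j,p)} is piecewise linear between the jumps of F_n, its sup is attained at the order statistics; a small sketch with simulated uniform data (our own setup, using G_n^{(1,p)} with support [0, 1]):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 25))
n = x.size

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    # G_n^{(1,p)}: polygon through (X*_k, (k-p)/n), anchored at (0,0) and (1,1)
    kx = np.concatenate(([0.0], x, [1.0]))
    ky = np.concatenate(([0.0], (np.arange(1, n + 1) - p) / n, [1.0]))
    g = np.interp(x, kx, ky)                  # G at the order statistics: (k-p)/n
    # F_n - G is extremal at the order statistics: F_n(X*_k) - G(X*_k) = p/n,
    # and just below X*_k: G(X*_k) - F_n(X*_k - 0) = (1-p)/n
    gap_right = np.max(np.arange(1, n + 1) / n - g)
    gap_left = np.max(g - np.arange(0, n) / n)
    assert np.isclose(max(gap_right, gap_left), max(p, 1 - p) / n)
print("sup-distance matches max(p, 1-p)/n for all tested p")
```

The equality in (5) holds for every sample, not just on average, since the gaps p/n and (1-p)/n at the order statistics are deterministic.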

Proposition 2.3. We obtain

(a) For j = 1, 2 and ε > 1/(2n(1-a_0)) with 0 < a_0 < 1,

$$P\big(\|G_n^{(j,1/2)} - F\|_\infty \ge \varepsilon\big) \le 2\exp(-2n a_0^2 \varepsilon^2), \qquad n \ge 1.$$

(b) More generally,

$$P\big(\|G_n^{(j,p)} - F\|_\infty \ge \varepsilon\big) \le 2\exp(-2n a_0^2 \varepsilon^2), \qquad 0 < a_0 < 1,$$

for ε > max(p, 1-p)/(n(1-a_0)), n ≥ 1.

Proof. First, for all p ∈ [0, 1], we have

$$\|G_n^{(j,p)} - F\|_\infty \le \|G_n^{(j,p)} - F_n\|_\infty + \|F_n - F\|_\infty,$$

so by (5)

$$\|G_n^{(j,p)} - F\|_\infty \le \frac{\max(p, 1-p)}{n} + \|F_n - F\|_\infty.$$

Thus, since F is supposed to be absolutely continuous in our framework, Massart's inequality (1990) gives

$$P\big(\|G_n^{(j,p)} - F\|_\infty \ge \varepsilon\big) \le P\Big(\|F_n - F\|_\infty \ge \varepsilon - \frac{\max(p, 1-p)}{n}\Big) \le 2\exp\Big(-2n\Big(\varepsilon - \frac{\max(p, 1-p)}{n}\Big)^2\Big).$$

Next, (b) holds with ε - max(p, 1-p)/n ≥ a_0 ε for ε > max(p, 1-p)/(n(1-a_0)), and one derives (a) with the choice p = 1/2.

Another derivation is envisaged in the following proposition.

Proposition 2.4. We get

(a) for p = 1/2, j = 1, 2:

$$P\big(\|G_n^{(j,1/2)} - F\|_\infty \ge \varepsilon\big) \le 2\exp(-2n\varepsilon^2), \qquad 0 < \varepsilon < \frac{1}{4n}, \; n \ge 1;$$

(b) and more generally,

$$P\big(\|G_n^{(j,p)} - F\|_\infty \ge \varepsilon\big) \le 2\exp(-2n\varepsilon^2), \qquad 0 < \varepsilon < \frac{\max(p, 1-p)}{2n}, \; n \ge 1.$$

Proof. For all p ∈ [0, 1], we have again:

$$P\big(\|G_n^{(j,p)} - F\|_\infty \ge \varepsilon\big) \le 2\exp\Big(-2n\Big(\varepsilon - \frac{\max(p, 1-p)}{n}\Big)^2\Big) = 2\exp(-2n\varepsilon^2)\exp\Big(-\frac{2}{n}\max(p, 1-p)^2 + 4\varepsilon\max(p, 1-p)\Big).$$

So, for n ≥ 1, one gets the result P(\|G_n^{(j,p)} - F\|_\infty ≥ ε) ≤ 2 exp(-2nε²) as

$$2n\varepsilon < \max(p, 1-p) \iff 4\varepsilon\max(p, 1-p) - \frac{2}{n}\max(p, 1-p)^2 < 0.$$

Finally, note that known asymptotic results on F_n allow one to derive easily limits in distribution for the estimators G_n^{(j,p)}, j = 1, 2 and p ∈ [0, 1]. From (5), we get

$$\sup_t \Big| \sqrt{n}\big(G_n^{(j,p)}(t) - F(t)\big) - \sqrt{n}\big(F_n(t) - F(t)\big) \Big| \le \frac{1}{\sqrt{n}},$$

and Theorem 3.1 in Billingsley (1999, p. 27) implies that G_n^{(j,p)} and F_n, properly normalized, have the same limit in distribution.

3. Mean integrated error of polygonal estimators

To establish the mean integrated error of the polygonal estimators, we suppose from now on that X_1, ..., X_n are i.i.d. random variables with absolutely continuous cdf F such that f has compact support [0, 1].

3.1. Asymptotic integrated bias. We begin with a technical lemma useful for further derivations of our results.

Lemma 3.1. For all m ∈ ℕ and p ∈ [0, 1],

(a) $$\int_0^1 \big(G_n^{(1,p)}(t) - F_n(t)\big)^m \,dt = \frac{\big((1-p)^m - (-1)^m p^m\big)\big(pX^*_1 + (1-p)X^*_n\big) + (-1)^m p^m}{(m+1)n^m};$$

(b) $$\int_{-\infty}^{\infty} \big(G_n^{(2,p)}(t) - F_n(t)\big)^m \,dt = \frac{(1-p)^{m+1}X^*_2 - X^*_1\big(2(1-p)^{m+1} + (-1)^m p^{m+1}\big) + X^*_n\big(2(-1)^m p^{m+1} + (1-p)^{m+1}\big) + (-1)^{m+1} p^{m+1} X^*_{n-1}}{(m+1)n^m}.$$

Proof. Straightforward from Lemma 2.2, formulas (3)-(4) being raised to the power m and integrated term by term.

Note that for p = 1/2 and m even, ∫_0^1 (G_n^{(1,1/2)}(t) - F_n(t))^m dt becomes constant, equal to 1/((m+1)(2n)^m).

In the sequel, we will use the following conditions on the density f.

Assumption 3.1 (A3.1).
(i) f is continuous on [0, 1] and inf_{x∈[0,1]} f(x) ≥ c_0 for some positive constant c_0;
(ii) f is a Lipschitz function: there exists a positive constant c_1 such that for all (x, y) ∈ (0, 1)², |f(x) - f(y)| ≤ c_1|x - y|.

Note that A3.1-(ii) is less stringent than the usual conditions for kernel distribution estimators where, in general, f is supposed to be at least differentiable; see e.g. Azzalini (1981); Swanepoel (1988); Jones (1990). The lower-bound condition in A3.1-(i) is of course more stringent, but it is useful to derive the following lemma, where equivalent expressions are obtained for expectations of functions of X_1, ..., X_n.

Lemma 3.2. If condition A3.1-(i) holds then, for all integers r ≥ 0 and m ≥ 1 not depending on n, we get

(a) $$E\Big(\inf_{i=1,\dots,n+r} X_i\Big)^m = \frac{a_m}{n^m} + O\Big(\frac{1}{n^{m+1}}\Big), \qquad a_m > 0;$$

(b) $$E\Big(1 - \sup_{i=1,\dots,n+r} X_i\Big)^m = \frac{b_m}{n^m} + O\Big(\frac{1}{n^{m+1}}\Big), \qquad b_m > 0;$$

(c) $$E(X^*_2 - X^*_1) = \frac{d_1}{n} + O\Big(\frac{1}{n^2}\Big), \quad d_1 > 0, \qquad \text{and} \qquad E(X^*_2 - X^*_1)^m = O\Big(\frac{1}{n^m}\Big);$$

(d) $$E(X^*_n - X^*_{n-1}) = \frac{e_1}{n} + O\Big(\frac{1}{n^2}\Big), \quad e_1 > 0, \qquad \text{and} \qquad E(X^*_n - X^*_{n-1})^m = O\Big(\frac{1}{n^m}\Big).$$

Proof. (a) We may write

$$E\Big(\inf_{i=1,\dots,n+r} X_i\Big)^m = (n+r)\int_0^1 x^m f(x)\big(1 - F(x)\big)^{n+r-1}\,dx = m\int_0^1 x^{m-1}\big(1 - F(x)\big)^{n+r}\,dx.$$

For m = 1, we get by condition A3.1-(i) that

$$\frac{1}{(n+r+1)\|f\|_\infty} \le \int_0^1 \big(1 - F(x)\big)^{n+r}\,dx \le \frac{1}{c_0(n+r+1)},$$

which implies in turn that there exists a_1 > 0 such that

$$\int_0^1 \big(1 - F(x)\big)^{n+r}\,dx = \frac{a_1}{n} + O\Big(\frac{1}{n^2}\Big).$$

For m ≥ 2, the result follows by induction.

(b) The proof is similar, starting from

$$E\Big(1 - \sup_{i=1,\dots,n+r} X_i\Big)^m = (n+r)\int_0^1 (1-x)^m f(x) F^{n+r-1}(x)\,dx = m\int_0^1 (1-x)^{m-1} F^{n+r}(x)\,dx.$$

(c) From the joint density of (X^*_1, X^*_2) (see (A.16) or e.g. David and Nagaraja, 2003, p. 12), we have

$$E(X^*_2 - X^*_1)^m = \int_0^1 \int_x^1 (y-x)^m\, n(n-1) f(x) f(y)\big(1 - F(y)\big)^{n-2}\,dy\,dx = nm\int_0^1 \int_0^y (y-x)^{m-1} f(x)\big(1 - F(y)\big)^{n-1}\,dx\,dy.$$

For m > 1, we may bound the last term by

$$\|f\|_\infty\, n\int_0^1 y^m\big(1 - F(y)\big)^{n-1}\,dy = \frac{\|f\|_\infty\, n}{m+1}\int_0^1 P\Big(\inf_{i=1,\dots,n-1} X_i > t^{\frac{1}{m+1}}\Big)\,dt = \frac{\|f\|_\infty\, n}{m+1}\, E\Big(\inf_{i=1,\dots,n-1} X_i\Big)^{m+1} = O\Big(\frac{1}{n^m}\Big)$$

by (a). Also, for m = 1, we have

$$E(X^*_2 - X^*_1) = n\int_0^1 F(y)\big(1 - F(y)\big)^{n-1}\,dy = n\int_0^1 \frac{t(1-t)^{n-1}}{f(F^{-1}(t))}\,dt,$$

and (c) may be deduced from

$$\frac{n}{\|f\|_\infty}\int_0^1 t(1-t)^{n-1}\,dt \le n\int_0^1 \frac{t(1-t)^{n-1}}{f(F^{-1}(t))}\,dt \le \frac{n}{c_0}\int_0^1 t(1-t)^{n-1}\,dt,$$

where n∫_0^1 t(1-t)^{n-1} dt = 1/(n+1).

(d) The last assertions follow from the joint density of (X^*_{n-1}, X^*_n), see (A.16), leading to

$$E(X^*_n - X^*_{n-1})^m = \int_0^1 \int_0^y (y-x)^m\, n(n-1) f(x) f(y) F^{n-2}(x)\,dx\,dy = nm\int_0^1 \int_x^1 (y-x)^{m-1} F^{n-1}(x) f(y)\,dy\,dx.$$

Now, we are in a position to derive the following result with asymptotic equivalents for the integrated bias of the estimators G_n^{(j,p)}, j = 1, 2.

Proposition 3.3. Suppose that condition A3.1-(i) holds. Then for all p ∈ [0, 1], with the constants a_1, b_1, d_1 and e_1 defined in Lemma 3.2, we have:

(a) $$E\int_0^1 \big(G_n^{(1,p)}(t) - F(t)\big)\,dt = \frac{pE(X^*_1) + (1-p)E(X^*_n) - p}{2n} = \frac{1-2p}{2n} + \frac{pa_1 - (1-p)b_1}{2n^2} + O\Big(\frac{1}{n^3}\Big);$$

(b) $$E\int_{-\infty}^{\infty} \big(G_n^{(2,p)}(t) - F(t)\big)\,dt = \frac{(1-p)^2 E(X^*_2 - X^*_1) + (1-2p)E(X^*_n - X^*_1) - p^2 E(X^*_n - X^*_{n-1})}{2n} = \frac{1-2p}{2n} - \frac{(a_1 + b_1)(1-2p)}{2n^2} + \frac{d_1(1-p)^2 - e_1 p^2}{2n^2} + O\Big(\frac{1}{n^3}\Big).$$

Proof. Since F_n is unbiased for estimating F, we have, for j = 1, 2,

$$E\int_{-\infty}^{\infty} \big(G_n^{(j,p)}(t) - F(t)\big)\,dt = E\int_{-\infty}^{\infty} \big(G_n^{(j,p)}(t) - F_n(t)\big)\,dt,$$

and the results are straightforward from Lemma 3.1 (m = 1) and Lemma 3.2.

One may remark that these estimators are asymptotically unbiased. For p = 1/2, the bias is minimal and of order n^{-2}, while other values of p give only an integrated bias of order n^{-1}.
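The leading bias term (1-2p)/(2n) of Proposition 3.3 is easy to see in a toy Monte Carlo experiment. For the uniform cdf F(t) = t on [0, 1], the integral of G_n^{(1,p)} - F can be computed exactly for each replicate (trapezoid areas of the polygon minus 1/2); the sketch below, with a helper name of our own, contrasts p = 0 (bias of order n^{-1}) with p = 1/2 (no first-order bias):

```python
import numpy as np

def integrated_error(xs, p):
    """Exact integral of G_n^{(1,p)}(t) - t over [0,1] for the uniform cdf F(t) = t.

    G_n^{(1,p)} is piecewise linear through (0,0), (X*_k, (k-p)/n) and (1,1),
    so its integral is a sum of trapezoid areas, while the integral of F is 1/2.
    """
    n = xs.size
    kx = np.concatenate(([0.0], xs, [1.0]))
    ky = np.concatenate(([0.0], (np.arange(1, n + 1) - p) / n, [1.0]))
    return np.sum(np.diff(kx) * (ky[:-1] + ky[1:]) / 2.0) - 0.5

rng = np.random.default_rng(2)
n, reps = 50, 4000
samples = [np.sort(rng.uniform(0, 1, n)) for _ in range(reps)]
bias_0 = np.mean([integrated_error(xs, 0.0) for xs in samples])
bias_half = np.mean([integrated_error(xs, 0.5) for xs in samples])
print(bias_0, bias_half)   # bias_0 close to 1/(2n) = 0.01, bias_half close to 0
```

For the uniform distribution, E X*_1 = 1/(n+1) and E X*_n = n/(n+1), so the exact bias in Proposition 3.3(a) is even identically zero for p = 1/2, which the second average illustrates up to Monte Carlo noise.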

3.2. Mean integrated squared error of polygonal estimators. Concerning the mean integrated squared error (MISE), we consider its decomposition with respect to F_n:

$$E\int_{-\infty}^{\infty} \big(G_n^{(j,p)}(t) - F(t)\big)^2\,dt = E\int_{-\infty}^{\infty} \big(F_n(t) - F(t)\big)^2\,dt + E\int_{-\infty}^{\infty} \big(G_n^{(j,p)}(t) - F_n(t)\big)^2\,dt + 2E\int_{-\infty}^{\infty} \big(G_n^{(j,p)}(t) - F_n(t)\big)\big(F_n(t) - F(t)\big)\,dt, \qquad j = 1, 2, \; p \in [0, 1]. \qquad (6)$$

Note that these integrals exist, as X_1, ..., X_n are supposed to be compactly supported on [0, 1]. In this way, the range of integration can be taken as [0, 1] for j = 1 and as [(2-p)X^*_1 - (1-p)X^*_2, (1+p)X^*_n - pX^*_{n-1}] when j = 2. The first term represents the error of reference:

$$E\int_{-\infty}^{\infty} \big(F_n(t) - F(t)\big)^2\,dt = \frac{\int_0^1 F(t)\big(1 - F(t)\big)\,dt}{n}. \qquad (7)$$

It remains to study the values of p for which the sum of the last two terms in (6) is globally negative. For such values, our families of polygonal estimators should be more efficient than the empirical distribution estimator.

First, E∫_{-∞}^{∞} (G_n^{(j,p)}(t) - F_n(t))² dt, j = 1, 2, can be directly deduced from Lemma 3.1 with m = 2 and Lemma 3.2.

Proposition 3.4. Under condition A3.1-(i), we have for all p ∈ [0, 1]:

(a) $$E\int_0^1 \big(G_n^{(1,p)}(t) - F_n(t)\big)^2\,dt = \frac{(1-2p)\big(pE(X^*_1) + (1-p)E(X^*_n)\big) + p^2}{3n^2} = \frac{1 - 3p(1-p)}{3n^2} + O\Big(\frac{1}{n^3}\Big);$$

(b) $$E\int_{-\infty}^{\infty} \big(G_n^{(2,p)}(t) - F_n(t)\big)^2\,dt = \frac{p^3 E(X^*_n - X^*_{n-1}) + (1-p)^3 E(X^*_2 - X^*_1) + \big((1-p)^3 + p^3\big)E(X^*_n - X^*_1)}{3n^2} = \frac{1 - 3p(1-p)}{3n^2} + O\Big(\frac{1}{n^3}\Big).$$

We may remark that, again, the case p = 1/2 is simpler, since the calculation of E∫_0^1 (G_n^{(1,1/2)}(t) - F_n(t))² dt reduces to 1/(12n²). The most difficult task is the study of the double product; we obtain the following result, which is proved in Appendix A.

Proposition 3.5. Under Assumption 3.1, we get for j = 1, 2 and p ∈ [0, 1]:

$$2E\int_0^1 \big(G_n^{(j,p)}(t) - F_n(t)\big)\big(F_n(t) - F(t)\big)\,dt = -\frac{1}{3n^2} + O\Big(\frac{1}{n^3}\Big).$$

Now, collecting the results in (6)-(7) and Propositions 3.4 and 3.5, we are in a position to state the main result of this paper.

Theorem 3.6. Under Assumption 3.1, we get for j = 1, 2 and all p ∈ [0, 1]:

$$E\int_{-\infty}^{\infty} \big(G_n^{(j,p)}(t) - F(t)\big)^2\,dt = \frac{1}{n}\int_0^1 F(t)\big(1 - F(t)\big)\,dt - \frac{p(1-p)}{n^2} + O\Big(\frac{1}{n^3}\Big).$$

First, we may conclude that for all p ∈ [0, 1], the estimators G_n^{(1,p)} and G_n^{(2,p)} are asymptotically equivalent. Indeed, G_n^{(2,p)} is only a slight modification of G_n^{(1,p)} at its extremities, so this result seems natural. Next, for all p ∈ (0, 1), the families G_n^{(j,p)}, j = 1, 2, appear more efficient than the empirical distribution estimator F_n. The choices p = 0 or p = 1 turn out to be more problematic, since the term p(1-p)/n² vanishes in these cases. If these estimators improve F_n, the gain can only occur at the third order, which seems of less interest. Finally, this latter term is maximal for p = 1/2, with the value 1/(4n²). In conclusion, among the whole family of polygonal estimators, one should prefer to work with G_n^{(1,1/2)} or G_n^{(2,1/2)} (in the case of unknown support [a, b]), because these estimators have both the smaller asymptotic bias, by Proposition 3.3, and the better efficiency relative to the classical distribution estimator F_n.
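Theorem 3.6 can be illustrated numerically. For uniform data (F(t) = t), both integrated squared errors admit exact closed forms per replicate, so a paired Monte Carlo comparison isolates the n^{-2} gain; the helpers below are our own sketch, not the paper's code:

```python
import numpy as np

def ise_edf(xs):
    """Exact integral of (F_n(t) - t)^2 over [0,1] for the uniform cdf F(t) = t."""
    n = xs.size
    knots = np.concatenate(([0.0], xs, [1.0]))
    levels = np.arange(0, n + 1) / n           # F_n = k/n on [X*_k, X*_{k+1})
    u, v = knots[:-1], knots[1:]
    return np.sum(((levels - u) ** 3 - (levels - v) ** 3) / 3.0)

def ise_polygon(xs, p):
    """Exact integral of (G_n^{(1,p)}(t) - t)^2 over [0,1] (support [0,1])."""
    n = xs.size
    kx = np.concatenate(([0.0], xs, [1.0]))
    ky = np.concatenate(([0.0], (np.arange(1, n + 1) - p) / n, [1.0]))
    d = ky - kx                                 # G - F at the knots, linear in between
    du, dv = d[:-1], d[1:]
    return np.sum(np.diff(kx) * (du * du + du * dv + dv * dv) / 3.0)

rng = np.random.default_rng(3)
n, reps, p = 50, 3000, 0.5
gain = np.mean([ise_edf(xs) - ise_polygon(xs, p)
                for xs in (np.sort(rng.uniform(0, 1, n)) for _ in range(reps))])
print(gain, p * (1 - p) / n ** 2)   # the paired gain is close to p(1-p)/n^2
```

Pairing the two errors on the same samples removes most of the Monte Carlo variance, so the n^{-2} effect is visible even with a moderate number of replicates.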

4. Simulation

4.1. Framework. In this section, we look at the small-sample behaviour of the polygonal estimators G_n^{(j,p)}, j = 1, 2, with a focus on the values p = 1/2 and p = 0 or 1 (represented in Figure 1). To this aim, we consider the set of 15 Gaussian mixtures defined in Marron and Wand (1992), denoted by MW1-MW15 in the sequel, and also the 16th density introduced in Janssen et al. (1995), say MW16. The parameters of these normal mixtures are recalled in Appendix B.

[Figure 2. MISE for F_n, G_n^{(j,0)}, G_n^{(j,1/2)}, G_n^{(j,1)}, j = 1, 2, for n = 20 (left) and n = 50 (right).]

[Figure 3. MISE for F_n and kernel distribution estimators with three bandwidth choices (AL, PB and CV) for n = 20 (left) and n = 50 (right).]

These distributions are easy to implement and describe a broad class of potential problems that may occur in nonparametric estimation (skewness, multimodality, and heavy kurtosis). As their parameters were chosen such that min(μ - 3σ) = -3 and max(μ + 3σ) = 3, the estimators G_n^{(1,p)} are computed with the values a = -3 and b = 3. For each of these distributions, N = 500 samples of sizes n = 20, 50 and 100 are generated with the R software (R Core Team, 2017). Next, a Monte Carlo approximation, based on 14000 trials over [-7, 7], is performed for each sample of size n to estimate ∫_{-7}^{7} (G_n^{(j,p)}(t) - F(t))² dt. The final approximation of the MISE, denoted \widehat{MISE}, is obtained by averaging the results over the N = 500 replicates.
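The paper's simulations are run in R with the exact Marron-Wand parameters of Appendix B. As a hypothetical stand-in, the sketch below runs the same scheme on a generic two-component normal mixture (mixture parameters, replicate count and grid size are ours, and far smaller than in the paper):

```python
import numpy as np
from math import erf

rng = np.random.default_rng(4)
MEANS, SD, W = (-1.0, 1.0), 0.4, 0.5           # toy mixture, not an MW density

def sample_mixture(n):
    comp = rng.random(n) < W
    return np.where(comp, rng.normal(MEANS[0], SD, n), rng.normal(MEANS[1], SD, n))

def mixture_cdf(t):
    phi = np.vectorize(lambda z: 0.5 * (1.0 + erf(z / np.sqrt(2.0))))
    return W * phi((t - MEANS[0]) / SD) + (1 - W) * phi((t - MEANS[1]) / SD)

def polygonal(xs, t, p, a, b):                  # G_n^{(1,p)} with support [a, b]
    kx = np.concatenate(([a], xs, [b]))
    ky = np.concatenate(([0.0], (np.arange(1, xs.size + 1) - p) / xs.size, [1.0]))
    return np.interp(t, kx, ky)

grid = np.linspace(-7.0, 7.0, 1401)             # evaluation grid over [-7, 7]
dx = grid[1] - grid[0]
F = mixture_cdf(grid)
ise = [np.sum((polygonal(np.sort(sample_mixture(50)), grid, 0.5, -3.0, 3.0) - F) ** 2) * dx
       for _ in range(200)]
print(np.mean(ise))                             # Monte Carlo approximation of the MISE
```

As in the paper, [-3, 3] is used as a prediction interval rather than the true (infinite) support of the mixture.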

4.2. Results for the polygonal estimators. Approximations of the MISE for F_n and the estimators G_n^{(j,p)}, j ∈ {1, 2}, p ∈ {0, 1/2, 1}, are reported in Figure 2. Here, only the cases n = 20 and n = 50 are represented, since for n = 100 the results are similar but with almost indistinguishable differences between the estimators. First of all, we remark that the obtained results are in accordance with the theoretical results of Theorem 3.6: the estimators G_n^{(j,1/2)} have a smaller MISE than F_n for all simulated distributions. In addition, the estimator G_n^{(1,1/2)}, calculated with the choice [a, b] = [-3, 3], gives slightly better results than G_n^{(2,1/2)}. So even if [-3, 3] is not the true support of our simulated distributions, the knowledge of this prediction interval is sufficient to improve the estimation. Concerning G_n^{(j,0)} and G_n^{(j,1)}, the results are clearly inferior to those of G_n^{(j,1/2)}. In accordance with our theoretical results, the use of these estimators should not be recommended (even if G_n^{(j,0)} seems to achieve globally better results than G_n^{(j,1)}).

[Figure 4. Compared MISE of G_n^{(1,1/2)} and K_{n,PB} for n = 20, 50 and 100.]

4.3. Comparison with kernel distribution estimators. To complete our simulation study, we also compare our estimators with the nonparametric kernel distribution estimator defined by

$$K_n(t) = \frac{1}{n}\sum_{i=1}^{n} L\Big(\frac{t - X_i}{h_n}\Big), \qquad t \in \mathbb{R},$$

where h_n is the bandwidth and L(t) = ∫_{-∞}^{t} K(x) dx. Here K is the usual kernel used in density estimation, chosen as a known continuous density on ℝ, symmetric about 0. Theoretical properties of this estimator are well known; one may refer e.g. to Swanepoel and Van Graan (2005) or Servien (2009) with a rich literature review. The weighted MISE of K_n, E∫_{-∞}^{∞} (K_n(t) - F(t))² f(t) dt, is established in Swanepoel (1988), where the optimal choice of the kernel K is discussed. The unweighted case is derived in Jones (1990) when F has two continuous derivatives f and f′:

$$E\int_{-\infty}^{\infty} \big(K_n(t) - F(t)\big)^2\,dt = \frac{\int_{-\infty}^{\infty} F(t)\big(1 - F(t)\big)\,dt}{n} - \frac{2h_n}{n}\int_{-\infty}^{\infty} tK(t)L(t)\,dt + \frac{h_n^4}{4}\Big(\int_{-\infty}^{\infty} t^2 K(t)\,dt\Big)^2 \int_{-\infty}^{\infty} f'(t)^2\,dt + o(h_n^4) + o\Big(\frac{h_n}{n}\Big).$$

Actually, if one considers only the Lipschitz condition A3.1-(ii) on f, a straightforward calculation with Taylor series yields that the latter result weakens to

$$E\int_0^1 \big(K_n(t) - F(t)\big)^2\,dt = \frac{\int_0^1 F(t)\big(1 - F(t)\big)\,dt}{n} - \frac{2h_n}{n}\int_{-\infty}^{\infty} tK(t)L(t)\,dt + O(h_n^4) + o\Big(\frac{h_n}{n}\Big). \qquad (8)$$

Comparing (8) with the results of Theorem 3.6, it appears that the expressions are similar, with the MISE of F_n present in both. Also, for h_n of order n^{-1/3}, the improvement over F_n stands only on the second-order effect: about n^{-4/3} for K_n, but only n^{-2} for the estimators G_n^{(j,p)} when p ∈ (0, 1). Hence K_n should be asymptotically better, but the following numerical results show that exceptions hold for some densities. As usual, choosing the bandwidth h_n is a critical task to avoid over- or under-smoothing of the data. Several procedures have been proposed in the literature for kernel distribution estimation; we may refer, among others, to Sarda (1993) for a leave-one-out cross-validation method, Altman and Léger (1995) or Polansky and Baker (2000) for a plug-in bandwidth choice, and Bowman et al. (1998) for a modified cross-validation method. First, we compare F_n to these estimators, respectively called K_{n,AL}, K_{n,PB} and K_{n,CV} in the following. To this end, we use in R the kerdiest package of Quintela-del Río and Estévez-Pérez (2012). The selection of K appears less crucial, and we take the normal kernel in our simulations.
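With the normal kernel, L is the standard normal cdf, so K_n is straightforward to code; a minimal Python sketch (our own function name; the paper itself uses the R kerdiest package, and the bandwidth h is left here as a plain input rather than one of the data-driven choices above):

```python
import numpy as np
from math import erf

def kernel_cdf(x, t, h):
    """Kernel distribution estimator K_n(t) = (1/n) sum_i L((t - X_i)/h),
    with the normal kernel, so that L is the standard normal cdf."""
    x = np.asarray(x, dtype=float)[:, None]
    z = (np.asarray(t, dtype=float)[None, :] - x) / h
    L = 0.5 * (1.0 + np.vectorize(erf)(z / np.sqrt(2.0)))
    return L.mean(axis=0)
```

As h → 0, K_n approaches the edf F_n; unlike the polygonal estimators, the quality of K_n hinges on choosing h well, which is exactly what the AL, PB and CV rules attempt.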

It appears in Figure 3 that for n = 20 and n = 50, the plug-in bandwidth choices of Altman and Léger (1995) and Polansky and Baker (2000) give very similar results, except for MW16, where K_{n,AL} performs quite badly (with an approximate MISE beyond the frame). Also, the cross-validation proposed by Bowman et al. (1998) seems less advisable for these sample sizes. We may observe that for most of the distributions, K_{n,AL} and K_{n,PB} have a smaller MISE than F_n. Exceptions hold for n = 20 with the two distributions MW3 and MW16 and, in addition, MW15 for n = 50. Again, the case n = 100 is not represented, but the results are similar for K_{n,AL} and K_{n,PB} (with the same exceptions). Also, we observe that K_{n,CV} gives much more satisfactory results.

[Figure 5. Densities of four selected MW distributions: MW3 (strongly skewed), MW4 (kurtotic unimodal), MW15 (discrete comb) and MW16 (distant bimodal).]

4.4. Comparison of G_n^{(1,1/2)} and K_{n,PB}. In this part, we focus on the Polansky and Baker (2000) bandwidth choice (with multistage plug-in), because it is the fastest in terms of computation time and also gives the most homogeneous results for the tested distributions. Indeed, since polygonal estimators do not depend on any bandwidth, it seems fairer to select one particular bandwidth choice rather than to adapt the method according to the density or the sample size. In Figure 4, the MISE of this kernel estimator (ordered by increasing value) is compared to that of the best polygonal estimator, G_n^{(1,1/2)}. For n = 20, 50 and 100, the obtained results are quite close, and K_{n,PB} performs better for almost all distributions. This is not surprising given the result derived in (8). Sensitivity to a bad bandwidth choice appears for the distributions MW3, MW4, MW15 and MW16, where the polygonal estimator achieves better results. The respective densities are drawn in Figure 5, and the estimates obtained with G_n^{(1,1/2)} and K_{n,PB} are given in Figure 6 for n = 50. It appears that the estimates are quite close, but in these special cases the kernel estimator misses the curvatures. Finally, recall that G_n^{(1,1/2)} is constructed with the help of the prediction interval [-3, 3], and the fully nonparametric estimator G_n^{(2,1/2)} differs from G_n^{(1,1/2)} only at both ends. The lower tails of the estimated distributions are zoomed in Figure 7 to compare these two estimators.

[Figure 6. Estimates of F (dashed) with G_n^{(1,1/2)} (blue) and K_{n,PB} (red) for the four selected MW distributions and n = 50.]

[Figure 7. The same estimates of F (dashed), zoomed on small values of x, with G_n^{(1,1/2)} (blue), G_n^{(2,1/2)} (green), K_{n,PB} (red) and n = 50.]

5. Discussion

We have studied two general families, G_n^{(1,p)} and G_n^{(2,p)}, of smoothed polygonal distribution estimators and have derived their properties as well as exact second-order expansions of the MISE for compactly supported distributions. These estimators present several advantages: they can be derived directly from the empirical distribution function F_n, they do not depend on any smoothing parameter, and they may be drawn by hand. Because of their proximity to F_n, they also inherit its convergence properties. Their study fills a gap, since these estimators are quite naturally used by practitioners but their theoretical properties had not yet been studied in depth. Examination of the second-order effect in the MISE shows that G_n^{(1,p)} and G_n^{(2,p)} are asymptotically equivalent. Also, the MISE of F_n is improved for all p chosen in (0, 1), and its minimal value is reached for p equal to 1/2. On the other hand, our results for p = 0 or 1 also show that joining the ends of F_n does not necessarily improve the estimation and may even worsen it. Our simulations on Gaussian mixtures support these conclusions and also show that G_n^{(1,1/2)} may achieve better results than G_n^{(2,1/2)} even for distributions with infinite support.
