TAIL INDEX ESTIMATION BASED ON SURVEY DATA


Academic year: 2022

DOI:10.1051/ps/2014011 www.esaim-ps.org

Patrice Bertail (1,2), Emilie Chautru (3) and Stéphan Clémençon (4)

Abstract. This paper is devoted to tail index estimation in the context of survey data. Assuming that the population of interest is described by a heavy-tailed statistical model, we prove that the survey scheme plays a crucial role in the design of consistent inference methods for extremes. As simulation experiments reveal, ignoring the sampling plan generally induces a significant bias, jeopardizing the accuracy of the extreme value statistics thus computed. The focus here is on the celebrated Hill method for tail index estimation: we show how to modify it in order to take the survey design into account.

Precisely, under specific conditions on the inclusion probabilities of first and second orders, we establish the consistency of the variant of the Hill estimator we propose. Additionally, its asymptotic normality is proved in a specific situation. Application of this limit result for building Gaussian confidence intervals is thoroughly discussed and illustrated by numerical results.

Mathematics Subject Classification. 62D05, 62F12, 62G32.

Received August 12, 2013. Revised February 5, 2014.

1. Introduction

It is the main purpose of this paper to study the impact of a survey sampling scheme on tail index estimation.

Indeed, in many situations, statisticians have at their disposal not only data but also weights arising from some survey sampling stratification. These weights correspond either to true inclusion probabilities, as is often the case for institutional data, or to some calibrated or post-stratification weights (minimizing some discrepancy with the inclusion probabilities subject to some margin constraints). In most cases, the survey design is ignored; such an omission can induce a significant sampling bias on the computed statistics [5]. When considering statistics of extremes in particular, this may cause severe drawbacks and completely jeopardize the estimation, as can be revealed by simulation experiments. Whereas asymptotic analysis of the Horvitz-Thompson estimator [27], in the context of mean estimation and regression in particular, has been the subject of much attention (see [2,16,23,24,31,32] for instance) and the last few years have witnessed significant progress towards a comprehensive functional limit theory for distribution function estimation (refer to [6–8,18,33]), no result on tail estimation has been documented in the survey sampling literature yet.

Keywords and phrases. Survey sampling, tail index estimation, Hill estimator, Poisson survey scheme, rejective sampling.

1 MODAL'X - Université Paris Ouest, 92001 Nanterre, France

2 Laboratoire de Statistique, CREST, France

3 Laboratoire AGM - Université de Cergy-Pontoise, 95000 Cergy-Pontoise, France

4 Institut Mines-Télécom - LTCI UMR Télécom ParisTech/CNRS No. 5141, 75634 Paris, France.

Article published by EDP Sciences © EDP Sciences, SMAI 2015


Our goal here is to show how to incorporate the survey scheme into extreme value statistical techniques, in order to guarantee their asymptotic validity. Our approach is illustrated through the tail index estimation problem in the context of heavy-tailed survey data. We propose a specific modification of the Hill estimator, accounting for the survey plan by means of which data have been collected, and establish its consistency under adequate assumptions on the probabilities of inclusion of first and second orders. Following in the footsteps of [24], its asymptotic normality is also investigated in the particular situations of Poisson and rejective schemes.

Based on this limit result, the issue of building Gaussian confidence intervals for the tail index is next considered.

The rest of the paper is structured as follows. Basics about survey sampling and tail index estimation in the standard i.i.d. setup are briefly recalled in Section 2. Section 3 describes at length the modification of the Hill estimator we propose in the context of a general sampling plan and proves its consistency. In Section 4, its asymptotic normality is established and the main asymptotic results of the paper are stated. Possible heuristic methods for selecting the number of observations that should ideally be involved in the computation of the modified Hill estimator are discussed in Section 5, together with illustrative numerical experiments. Technical proofs are deferred to the Appendix section.

2. Background and preliminaries

We first recall some crucial notions in survey sampling, which shall be used extensively in the subsequent analysis. Basic concepts of heavy-tail modeling, including Pareto-type distributions and standard strategies for statistical estimation of the related parameters, are next briefly described for the sake of clarity. Throughout the article, the Dirac mass at $x \in \mathbb{R}$ is denoted by $\delta_x$ and the indicator function of any event $E$ by $\mathbb{I}\{E\}$. The (left-continuous) inverse of any nondecreasing function $H : (a, b) \to \mathbb{R}$, where $-\infty \le a < b \le +\infty$, is denoted by $H^{\leftarrow}(t) = \inf\{s \in (a, b) : H(s) \ge t\}$, $t \in \mathbb{R}$, with the convention that the infimum over an empty set is $-\infty$.

The minimum of two real numbers $x$ and $y$ is denoted by $x \wedge y$ and the maximum by $x \vee y$. The expectation of an integrable random variable $Z$ is denoted by $\mathbb{E}(Z)$ and, when it is square integrable, its variance is denoted by $\mathbb{V}(Z)$.

2.1. Survey sampling

Here and throughout, we consider a finite population of size $N \ge 1$, referred to as $\mathcal{U}_N := \{1, \dots, N\}$, and denote by $\mathcal{P}(\mathcal{U}_N)$ its power set. We call a sample of (possibly random) size $n \le N$ any subset $s := \{i_1, \dots, i_{n(s)}\} \in \mathcal{P}(\mathcal{U}_N)$ with cardinality $n =: n(s)$ no greater than $N$. A sampling scheme (design/plan) without replacement is determined by a probability distribution $R_N$ on the set of all possible samples $s \in \mathcal{P}(\mathcal{U}_N)$. For any $i \in \{1, \dots, N\}$, the following quantity, generally called the (first order) inclusion probability,

$$\pi_i(R_N) := \mathbb{P}_{R_N}(i \in S),$$

is the probability that the unit $i$ belongs to a random sample $S$ drawn from distribution $R_N$. In vectorial form, we shall write $\pi(R_N) := (\pi_1(R_N), \dots, \pi_N(R_N))$. First order inclusion probabilities are assumed to be strictly positive in the subsequent analysis: $\forall i \in \{1, \dots, N\}$, $\pi_i(R_N) > 0$. Additionally, the second order inclusion probabilities are denoted by

$$\pi_{i,j}(R_N) := \mathbb{P}_{R_N}\left((i, j) \in S^2\right),$$

for any $i \ne j$ in $\{1, \dots, N\}^2$. When no confusion is possible, we shall omit the dependence on $R_N$ when writing the first/second order probabilities of inclusion. The information related to the observed sample $S \subset \{1, \dots, N\}$ is encapsulated by the random vector $\epsilon := (\epsilon_1, \dots, \epsilon_N)$, where

$$\epsilon_i = \begin{cases} 1 & \text{if } i \in S, \\ 0 & \text{otherwise.} \end{cases}$$

The distribution of the sampling scheme has one-dimensional marginals that correspond to the Bernoulli distributions $\mathcal{B}(\pi_i)$, $1 \le i \le N$, and covariance matrix given by $\{\pi_{i,j} - \pi_i \pi_j\}_{1 \le i,j \le N}$. Notice incidentally that, equipped with these notations, we have $\sum_{i=1}^N \epsilon_i = n(S)$.


Example 1 (Poisson survey sampling). Though of extreme simplicity, the Poisson scheme (without replacement) plays a crucial role in sampling theory, insofar as it can be used to approximate a wide range of survey plans. This is indeed a key observation to establish general asymptotic results in the survey context, see [24] and Section 4 of the present paper. For such a plan $T_N$, the $\epsilon_i$'s are independent Bernoulli random variables with parameters $p_1, \dots, p_N$ in $(0,1)$. The first order inclusion probabilities thus fully characterize such a plan. Observe in addition that the size $n(S)$ of a sample generated this way is random and goes to infinity as $N \to +\infty$ with probability one, provided that $\min_{1 \le i \le N} p_i$ remains bounded away from zero.
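The Poisson scheme of Example 1 can be sketched in a few lines. This is a minimal illustration only: the function name `poisson_sample`, the seed and the constant inclusion probabilities are assumptions made for the example, not part of the paper.

```python
import random


def poisson_sample(incl_probs, rng):
    """Draw one Poisson sample: unit i is kept independently with probability p_i.

    Returns the list of selected indices (0-based), i.e. the units with epsilon_i = 1.
    """
    return [i for i, p in enumerate(incl_probs) if rng.random() < p]


rng = random.Random(0)          # fixed seed, for reproducibility only
probs = [0.2] * 1000            # hypothetical inclusion probabilities p_1, ..., p_N
sizes = [len(poisson_sample(probs, rng)) for _ in range(200)]
mean_size = sum(sizes) / len(sizes)   # the random size n(S) fluctuates around sum(probs)
```

The sample size varies from draw to draw, which is the feature that rejective sampling (Section 4.2) removes by conditioning on a fixed size.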

The superpopulation model we consider here stipulates that a real-valued random variable $X$ with distribution $P$ and cdf $F$ is observable on the population $\mathcal{U}_N$, i.e. $X_1, \dots, X_N$ are i.i.d. realizations drawn from $P$. In practice, it is customary to determine the first order inclusion probabilities as a function of an auxiliary variable, which is observed on the entire population. Here, it is denoted by $\mathbf{W}$ with distribution $P_{\mathbf{W}}$. Hence, for all $i \in \{1, \dots, N\}$ we can write $\pi_i = \pi(\mathbf{W}_i)$ for some link function $\pi(\cdot)$. When $\mathbf{W}$ and $X$ are strongly correlated, proceeding in this way helps select more informative samples and subsequently reduces the variance of estimators (see Sect. 4 for a more detailed discussion on the use of auxiliary information in survey sampling). One may refer to [9,15,22] for accounts of survey sampling techniques.

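We recall that the Horvitz-Thompson estimator of the empirical measure $P_N := N^{-1} \sum_{i=1}^N \delta_{X_i}$ based on the survey data described above is defined as follows [27]:

$$P^{\pi(R_N)}_{R_N} := \frac{1}{N} \sum_{i=1}^N \frac{\epsilon_i}{\pi_i} \delta_{X_i} = \frac{1}{N} \sum_{i \in S} \frac{1}{\pi_i} \delta_{X_i},$$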

where the subscript $R_N$ stipulates that the vector $\epsilon := (\epsilon_1, \dots, \epsilon_N)$ is in correspondence with a sample $S$ drawn at random from distribution $R_N$, and the superscript $\pi(R_N)$ indicates that the inclusion probabilities used in the formula are those of the design $R_N$. When there is no ambiguity, we shall simplify notations and write $P^{\pi}_N$ instead of $P^{\pi(R_N)}_{R_N}$. We highlight the fact that, conditionally upon all vectors $\{(X_i, \mathbf{W}_i), 1 \le i \le N\}$, the latter is an unbiased estimator of $P_N$, although it is not a probability measure. Its (pointwise) consistency and asymptotic normality are established in [2,31]. Limit results of functional nature are established in [18] for specific biased sampling models (see also [3,7,8,33]). The weighted quantity $F^{\pi}_N(x) := P^{\pi}_N(-\infty, x]$ is naturally different from the empirical cumulative distribution function of the observations, namely $F_n(x) := n^{-1} \sum_{i \in S} \mathbb{I}\{X_i \le x\}$, whose asymptotic behavior is investigated in [5]. It is then straightforward to deduce the following (unbiased) estimate of the probability of exceedance $\bar{F}(x) := \mathbb{P}(X > x)$, $x \in \mathbb{R}$, given by:

$$\bar{F}^{\pi}_N(x) := \frac{1}{N} \sum_{i=1}^N \frac{\epsilon_i}{\pi_i} \mathbb{I}\{X_i > x\} = \frac{1}{N} \sum_{i \in S} \frac{1}{\pi_i} \mathbb{I}\{X_i > x\}. \tag{2.1}$$
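As a small illustration of equation (2.1), the sketch below evaluates the Horvitz-Thompson exceedance estimate from sampled values and their inclusion probabilities. The function name `ht_exceedance` and the toy numbers are hypothetical; only sampled units enter the sum, each weighted by the inverse of its inclusion probability.

```python
def ht_exceedance(sample_values, sample_probs, population_size, x):
    """Horvitz-Thompson estimate of P(X > x) as in equation (2.1).

    sample_values   -- observed X_i for the sampled units i in S
    sample_probs    -- matching first order inclusion probabilities pi_i
    population_size -- N, the size of the full population
    """
    weighted_count = sum(
        1.0 / p for v, p in zip(sample_values, sample_probs) if v > x
    )
    return weighted_count / population_size


# Toy data: three sampled units out of a population of size 10.
estimate = ht_exceedance([1.0, 5.0, 10.0], [0.5, 0.25, 1.0], 10, x=4.0)
```

Note that the estimate need not lie in [0, 1] for every sample, in line with the remark that the Horvitz-Thompson measure is not a probability measure.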

2.2. Tail index inference – the Hill estimator

In a wide variety of situations, it is appropriate to assume that a statistical population is described by a heavy-tailed probability distribution (the field of heavy-tail analysis is well depicted in [30]). A distribution with Pareto-like right tail is any probability measure $P$ on $\mathbb{R}$ with cdf $F$ such that for all $x \in \mathbb{R}$,

$$1 - F(x) = \bar{F}(x) = x^{-1/\gamma} L(x), \tag{2.2}$$

where $\gamma > 0$ is the extreme value index (EVI) of distribution $P$ and $L(x)$ is a slowly varying function, i.e. a function such that $L(tx)/L(x) \to 1$ as $x \to +\infty$ for all $t > 0$. Notice that instead of the EVI, focus is often on $\alpha := 1/\gamma$, the tail index of the distribution $P$. Functions of the form introduced in equation (2.2) are said to be regularly varying with index $-1/\gamma$; the set of such functions is denoted by $\mathcal{R}_{-1/\gamma}$. One may refer to [4] for an account of the theory of regularly varying functions. The Hill estimator [26] provides a popular way of estimating the EVI $\gamma$. Its asymptotic behavior and the practical issues related to its computation are well documented in


the literature, see for instance ([30], Chap. 4) and the references therein. Given an i.i.d. population $X_1, \dots, X_N$ of size $N \ge 1$ drawn from $P$ and a number $K \in \{1, \dots, N-1\}$ of largest observations, it is written as

$$H_{K,N} := \frac{1}{K} \sum_{i=1}^K \log \frac{X_{N-i+1,N}}{X_{N-K,N}}, \tag{2.3}$$

where $X_{1,N} \le \dots \le X_{N,N}$ denote the order statistics related to the population. Whereas the theory has been extensively developed in the case where the observations are independent and identically distributed, including issues related to the choice of $K$ in equation (2.3), to the best of our knowledge the Hill procedure has received no attention when data arise from a general survey. We point out that there exist alternative methods for tail index or EVI estimation; refer for instance to Chapter 4 of [1] for further details. The argument of the subsequent analysis paves the way for studying extensions of such techniques in the context of survey sampling models.
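For concreteness, here is a minimal sketch of the standard Hill estimator of equation (2.3) for a fully observed population with positive values. The function name `hill_estimator` and the toy data are illustrative choices, not taken from the paper.

```python
import math


def hill_estimator(values, K):
    """Standard Hill estimator H_{K,N} of equation (2.3).

    values -- the full population X_1, ..., X_N (all strictly positive)
    K      -- number of largest observations used, 1 <= K <= N - 1
    """
    xs = sorted(values)                 # order statistics X_{1,N} <= ... <= X_{N,N}
    N = len(xs)
    if not 1 <= K <= N - 1:
        raise ValueError("K must lie in {1, ..., N - 1}")
    threshold = xs[N - K - 1]           # X_{N-K,N} in 1-based notation
    top = xs[N - K:]                    # the K largest observations
    return sum(math.log(x / threshold) for x in top) / K


# Toy check on a geometric grid: log-ratios to the threshold are log 2 and log 4.
h = hill_estimator([1.0, 2.0, 4.0, 8.0, 16.0], K=2)
```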

3. The Hill estimator in survey sampling

Placing ourselves in the framework described in Section 2.1, we shall denote by $X_{1,n} \le \dots \le X_{n,n}$ the order statistics related to the survey sample $(X_{i_1}, \dots, X_{i_n})$, where $n = n(S)$ may be random. When unit $j$ is such that $X_j = X_{i,N}$, the $i$-th order statistic of the population, $1 \le i, j \le N$, its inclusion indicator and probability are denoted by $\epsilon_{i,N} = \epsilon_j$ and $\pi_{i,N} = \pi_j$ respectively. Similarly, we write $\pi_{i,n} := \pi_j$ when $X_{i,n} = X_j$, $1 \le i, j \le n$.

As a general rule, indexes in uppercase shall designate the full population, as opposed to those in lowercase, which shall refer to the sample. We assume in addition that the distribution function $F$ has the semi-parametric form set out in equation (2.2) with unknown parameter $\gamma > 0$ and, for the sake of simplicity, that its support is included in $(0, +\infty]$. Because it is destined to be extensively used in the sequel, we also introduce the tail quantile function, written for all $x \in [1, +\infty]$ as

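$$U(x) := F^{\leftarrow}\left(1 - \frac{1}{x}\right).$$

Its empirical and Horvitz-Thompson equivalents are respectively denoted by

$$U_N(x) := F_N^{\leftarrow}\left(1 - \frac{1}{x}\right) \quad \text{and} \quad U^{\pi}_N(x) := (F^{\pi}_N)^{\leftarrow}\left(1 - \frac{1}{x}\right).$$

Notice that when $\bar{F} \in \mathcal{R}_{-1/\gamma}$, the corresponding tail quantile function $U$ is also regularly varying with index $\gamma$ ([1], Sects. 2.3.2 and 2.9.3). The goal pursued here is to estimate the tail parameter $\gamma$ based on the survey data $X_{i_1}, \dots, X_{i_n}$ and the sampling plan $R_N$.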

3.1. The Horvitz−Thompson variant of the Hill estimator

Notice first that, under the heavy-tail assumption above, we have:

$$\gamma = \lim_{x \to \infty} \int_x^{+\infty} \frac{\bar{F}(u)}{\bar{F}(x)} \frac{du}{u}, \tag{3.1}$$

see Section 2.6 of [1], for instance. In the case of the iid population $X_1, \dots, X_N$ drawn from $P$, one classically recovers the celebrated Hill estimator by substituting $F$ with the empirical cdf $F_N$ in equation (3.1) and taking $x = U_N(N/K) = X_{N-K,N}$ for some number $1 \le K \le N-1$ of largest observations, supposedly representative


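of the tail of the distribution. Indeed, we have:

$$\begin{aligned}
\int_{X_{N-K,N}}^{+\infty} \frac{\bar{F}_N(u)}{\bar{F}_N(X_{N-K,N})} \frac{du}{u}
&= \sum_{i=1}^K \int_{X_{N-i,N}}^{X_{N-i+1,N}} \frac{\bar{F}_N(u)}{\bar{F}_N(X_{N-K,N})} \frac{du}{u} \\
&= \sum_{i=1}^K \frac{i}{K} \left(\log X_{N-i+1,N} - \log X_{N-i,N}\right) \\
&= \frac{1}{K} \sum_{i=1}^K \log \frac{X_{N-i+1,N}}{X_{N-K,N}} = H_{K,N}.
\end{aligned}$$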
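The last two equalities in this derivation rest on a summation by parts, which can be spelled out as follows. The shorthand $a_i := \log X_{N-i+1,N}$ is introduced here only for this computation, and ties among the observations are assumed away. For $u$ in $[X_{N-i,N}, X_{N-i+1,N})$ exactly $i$ observations exceed $u$, so that $\bar{F}_N(u)/\bar{F}_N(X_{N-K,N}) = (i/N)/(K/N) = i/K$, and

$$\sum_{i=1}^K \frac{i}{K}\,(a_i - a_{i+1}) = \frac{1}{K}\left(\sum_{i=1}^K a_i - K\,a_{K+1}\right) = \frac{1}{K}\sum_{i=1}^K (a_i - a_{K+1}),$$

which is $H_{K,N}$ because $a_{K+1} = \log X_{N-K,N}$.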

Consistency of this estimator is classically guaranteed as soon as $K \to +\infty$ when $N \to +\infty$ so that $K = o(N)$, see [28]. In this case, the empirical threshold $X_{N-K,N}$ (equivalent in probability to $U(N/K)$) goes to infinity (again in probability) as $N \to +\infty$.

Going back to the survey data situation, one may naturally replace $U(N/K)$ by $U^{\pi}_N(N/K)$ and build a plug-in estimate of the EVI $\gamma$ based on the Horvitz-Thompson estimator given in equation (2.1) of the tail probability $\bar{F}(x)$. Observe that by definition $U^{\pi}_N(N/K)$ corresponds to one of the observations in the sample, say $X_i$ with rank $\ell \in \{1, \dots, n\}$. To this obviously corresponds an index $k \in \{0, \dots, n-1\}$ such that $\ell = n - k$, implying $X_{n-k,n} = U^{\pi}_N(N/K)$, the Horvitz-Thompson estimator of the quantile of order $1 - K/N$. We denote by $\kappa^{\pi}_N$ the map linking $k$ to $K$ in $\mathcal{U}_N$ under the sampling scheme $R_N$:

$$\kappa^{\pi}_N : \{1, \dots, N-1\} \longrightarrow \{1, \dots, n-1\}, \qquad K \longmapsto k := \kappa^{\pi}_N(K),$$

where

$$\kappa^{\pi}_N(K) := n - \inf\left\{ i \in \{1, \dots, n-1\} : \sum_{j=1}^{i} \frac{1}{\pi_{j,n}} \ge N - K \right\}. \tag{3.2}$$
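A direct transcription of equation (3.2) may help fix ideas. This is a sketch under stated assumptions: the function name `kappa` is hypothetical, the inclusion probabilities are assumed to be listed in increasing order of the sampled values (so that entry $j$ is $\pi_{j,n}$), and an error is raised when no rank satisfies the condition, a case the formula itself does not address.

```python
def kappa(probs_by_rank, population_size, K):
    """Map the population-level K to the sample-level k as in equation (3.2).

    probs_by_rank   -- pi_{1,n}, ..., pi_{n,n}: inclusion probabilities of the
                       sampled units, ordered by increasing observed value
    population_size -- N
    K               -- number of upper values targeted in the population
    """
    n = len(probs_by_rank)
    cumulative = 0.0
    for i in range(1, n):                       # i ranges over {1, ..., n - 1}
        cumulative += 1.0 / probs_by_rank[i - 1]
        if cumulative >= population_size - K:   # smallest such i attains the infimum
            return n - i
    raise ValueError("no rank i in {1, ..., n - 1} satisfies the condition of (3.2)")


# With N = 10 and every sampled unit standing for two population units,
# the top K = 4 population values correspond to the top k = 2 sampled values.
k = kappa([0.5] * 5, population_size=10, K=4)
```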

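This leads to the quantity:

$$\begin{aligned}
\widehat{\gamma} &= \int_{X_{n-k,n}}^{+\infty} \frac{\bar{F}^{\pi(R_N)}_{R_N}(u)}{\bar{F}^{\pi(R_N)}_{R_N}(X_{n-k,n})} \frac{du}{u}
= \sum_{i=1}^k \int_{X_{n-i,n}}^{X_{n-i+1,n}} \frac{\bar{F}^{\pi(R_N)}_{R_N}(u)}{\bar{F}^{\pi(R_N)}_{R_N}(X_{n-k,n})} \frac{du}{u} \\
&= \sum_{i=1}^k \frac{\sum_{j=1}^{i} \frac{1}{\pi_{n-j+1,n}}}{\sum_{j=1}^{k} \frac{1}{\pi_{n-j+1,n}}} \times \left(\log(X_{n-i+1,n}) - \log(X_{n-i,n})\right) \\
&= \left(\sum_{j=1}^k \frac{1}{\pi_{n-j+1,n}}\right)^{-1} \sum_{i=1}^k \frac{1}{\pi_{n-i+1,n}} \log \frac{X_{n-i+1,n}}{X_{n-k,n}} \\
&=: H^{\pi}_{k,n}.
\end{aligned}$$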

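Hence, $k$ is to the sample what $K$ is to the population: the number of upper values on which the estimation should rely. Notice that we may also write

$$H^{\pi}_{k,n} = \left(\sum_{j=1}^K \frac{\epsilon_{N-j+1,N}}{\pi_{N-j+1,N}}\right)^{-1} \sum_{i=1}^K \frac{\epsilon_{N-i+1,N}}{\pi_{N-i+1,N}} \log \frac{X_{N-i+1,N}}{X_{N-K,N}} =: H^{\pi}_{K,N}, \tag{3.3}$$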


where $K$ is the chosen number of largest observations in the population from which $k$ was constructed. Observe that $\kappa^{\pi}_N$ in equation (3.2) is a surjective, non-injective random map, which suggests that the subsequent asymptotic analysis should rely on some appropriately chosen $K$ and its random image $k := \kappa^{\pi}_N(K)$ rather than the converse. Since in practice only $k$ can be computed from the $X_i$'s and $\pi_i$'s, $i \in S$, the total population being partly unobserved, considerations about the choice of an appropriate $k$ are discussed in detail in Subsection 5.1. From this point forward, the Horvitz-Thompson Hill estimator shall be written $H^{\pi}_{K,N}$ with $K \in \{1, \dots, N-1\}$ held fixed.
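The sample-level form of the estimator, $H^{\pi}_{k,n}$, is a weighted average of log-excesses with weights $1/\pi_i$. The sketch below illustrates it; the function name `ht_hill` and the toy data are assumptions for the example, and positive observations are assumed.

```python
import math


def ht_hill(sample_values, sample_probs, k):
    """Horvitz-Thompson variant of the Hill estimator, H^pi_{k,n}.

    sample_values -- observed X_i for the sampled units (strictly positive)
    sample_probs  -- matching first order inclusion probabilities pi_i
    k             -- number of upper sample values used, 1 <= k <= n - 1
    """
    pairs = sorted(zip(sample_values, sample_probs))   # order by observed value
    n = len(pairs)
    if not 1 <= k <= n - 1:
        raise ValueError("k must lie in {1, ..., n - 1}")
    threshold = pairs[n - k - 1][0]                    # X_{n-k,n}
    top = pairs[n - k:]                                # the k largest sampled values
    total_weight = sum(1.0 / p for _, p in top)
    weighted_logs = sum(math.log(x / threshold) / p for x, p in top)
    return weighted_logs / total_weight


# With equal inclusion probabilities the weights cancel and the usual
# Hill estimator computed on the sample is recovered.
h_equal = ht_hill([1.0, 2.0, 4.0, 8.0, 16.0], [0.5] * 5, k=2)
h_unequal = ht_hill([1.0, 2.0, 4.0, 8.0, 16.0], [0.5, 0.5, 0.5, 0.5, 1.0], k=2)
```

Units with small inclusion probability receive more weight, which is how the estimator compensates for their under-representation in the sample.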

3.2. Consistency result

Here we investigate the limit properties of the estimator $H^{\pi}_{K,N}$ as $N$ and $n$ simultaneously go to infinity, with $n \le N$. The following assumptions, related to the sample design, shall be involved in the asymptotic analysis.

Assumption 1. There exist $\pi_\star > 0$ and $N_0 \in \mathbb{N} \setminus \{0\}$ such that for all $N \ge N_0$ and $i \in \mathcal{U}_N$, $\pi_i > \pi_\star$.

Assumption 2. There exists $c < +\infty$ such that we have, $\forall N \ge 1$,

$$\max_{1 \le i \ne j \le N} |\pi_{i,j} - \pi_i \pi_j| \le \frac{c}{n}.$$

Assumption 1 guarantees that first order inclusion probabilities do not vanish asymptotically, while Assumption 2 corresponds to the situation where the second order inclusion probabilities are not too different from those in the case of independent sampling (it is thus fulfilled by the Poisson design, see Sect. 4.1).

Remark 3.1 (On Assumptions 1 and 2). The assumptions introduced above are rather mild and are fulfilled in a wide variety of situations. Indeed, Assumption 2 is standard in asymptotic analysis of sampling techniques, see [24,25] for instance. As for Assumption 1, recall that $\pi_i = \pi(\mathbf{W}_i)$ with $\mathbf{W}$ an auxiliary variable and $\pi(\cdot)$ a link function. Then, it is fulfilled as soon as $\pi(\cdot)$ is continuous and the support of $P_{\mathbf{W}}$ is a compact subset of $(\mathbb{R}_+^*)^d$, where $d$ denotes the dimension of the random vector $\mathbf{W}$ and $\mathbb{R}_+^*$ the set of positive real numbers.

Given this framework, following in the footsteps of Section 4.4.1 from [30], the consistency of $H^{\pi}_{K,N}$ can be handled by exploiting the properties of regularly varying distributions. Indeed, under the heavy-tail assumption in equation (2.2), provided that $K \to +\infty$ and $K/N \to 0$ as $N \to +\infty$, we have

$$\frac{N}{K}\, \mathbb{P}\left(\frac{X}{U(N/K)} \in \cdot\right) \xrightarrow[N \to \infty]{v} \nu_{-1/\gamma}(\cdot)$$

in the space of Radon measures on $(0, +\infty]$. There, "$\xrightarrow{v}$" stands for the vague convergence of measures⁵ and $\nu_{-1/\gamma}(\cdot)$ is such that $\nu_{-1/\gamma}(x, \infty] = x^{-1/\gamma}$ for all $x > 0$ (see [30], Thm. 3.6 for instance). Its empirical counterpart in the population, usually called the tail empirical measure, is defined as follows:

$$\nu_N := \frac{1}{K} \sum_{i=1}^N \delta_{X_i/U(N/K)}.$$

When replacing $U(N/K)$ by its estimate $U_N(N/K)$ in the expression above and assuming that $K = K(N) \to +\infty$ where $K/N \to 0$ as $N \to +\infty$, it can be shown that it converges to $\nu_{-1/\gamma}$ in probability ([30], Eq. (4.21)).

Since we have

$$H_{K,N} = \int_1^{\infty} \nu_N\left(\frac{X_{N-K,N}}{U(N/K)}\,(x, +\infty]\right) \frac{dx}{x}$$

⁵ Recall that, in the space of non-negative Radon measures on $(0, +\infty]$, a sequence $(\mu_m)_{m \ge 1}$ is said to converge vaguely to $\mu$ iff for any compactly supported continuous function $h : (0, +\infty] \to \mathbb{R}$, we have: $\int_0^{+\infty} h(x)\, \mu_m(dx) \to \int_0^{+\infty} h(x)\, \mu(dx)$ as $m \to +\infty$.


and

$$\gamma = \int_1^{\infty} \nu_{-1/\gamma}(x, +\infty]\, \frac{dx}{x},$$

the asymptotic properties of $\nu_N$ naturally convey the consistency of the Hill estimator. Generalizing this result to the Horvitz-Thompson tail empirical measure

$$\nu^{\pi}_N = \frac{1}{K} \sum_{i=1}^N \frac{\epsilon_i}{\pi_i}\, \delta_{X_i/U^{\pi}_N(N/K)} = \frac{1}{K} \sum_{i \in S} \frac{1}{\pi_i}\, \delta_{X_i/U^{\pi}_N(N/K)} \tag{3.4}$$

would then yield the theorem below (see the proof in the Appendix section). It reveals that, in regard to the asymptotic statistical estimation of the EVI $\gamma$, the Horvitz-Thompson variant of the Hill estimator $H^{\pi}_{K,N}$ is consistent.

Theorem 3.2 (Consistency). Let $K = K(N)$ be a sequence of integers such that $K \to +\infty$ and $K/N \to 0$ as $N, n \to +\infty$. Provided that Assumptions 1 and 2 are fulfilled, we then have, as $N$ and $n$ tend to $+\infty$:

$$H^{\pi}_{K,N} \xrightarrow{\ \mathbb{P}\ } \gamma. \tag{3.5}$$

4. Asymptotic normality

Whereas the consistency of the standard Hill estimator in equation (2.3) can be proved for any sequence $K$ going to infinity at a reasonable rate, asymptotic normality cannot be guaranteed at such a level of generality.

Higher-order regular variation properties of the heavy-tail model in equation (2.2) are required [12,14]. More specifically, consider the hypothesis below, referred to as the Von Mises condition [20].

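Assumption 3. The regularly varying tail quantile function $U \in \mathcal{R}_{\gamma}$ with $\gamma > 0$ is such that there is a real parameter $\rho < 0$, referred to as the second order parameter, and a positive or negative function $A$ with $\lim_{x \to +\infty} A(x) = 0$ such that for any $t > 0$,

$$\frac{1}{A(x)} \left(\frac{U(tx)}{U(x)} - t^{\gamma}\right) \xrightarrow[x \to \infty]{} t^{\gamma}\, \frac{t^{\rho} - 1}{\rho},$$

or equivalently

$$\frac{1}{A\left(1/\bar{F}(x)\right)} \left(\frac{\bar{F}(tx)}{\bar{F}(x)} - t^{-1/\gamma}\right) \xrightarrow[x \to \infty]{} t^{-1/\gamma}\, \frac{t^{\rho/\gamma} - 1}{\gamma \rho}.$$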

This condition simply establishes some constraints on the slowly varying function $L(\cdot)$ in equation (2.2) to ensure that its influence vanishes quickly enough not to interfere with the Pareto form $x^{-1/\gamma}$ of $\bar{F}$.

The limit distribution of the standard Hill estimator in equation (2.3) has been investigated by means of Rényi's exponential representation of log-spacings under Assumption 3. Of course, this condition can hardly be checked in practice and the number of extremal observations is generally selected so as to minimize an estimate of the asymptotic mean squared error (MSE), see Section 5.1 and the references therein.

Remark 4.1 (On the Hill estimator and the tail empirical process). Approaches other than the Rényi decomposition in log-spacings have been developed to prove the asymptotic normality of the Hill estimator $H_{K,N}$ (see [30], Chaps. 4 and 9 and the references therein). The version presented at length in [30] involves the preliminary study of the tail empirical process

$$T_N := \sqrt{K} \left(\nu_N - \nu_{-1/\gamma}\right)(x^{-1/\gamma}, \infty], \quad x \ge 0,$$


from which the asymptotic properties of the Hill estimator are later deduced. We refer to Theorem 9.1 and Section 9.1.2 in [30] for more details on this seemingly simple but actually quite intricate procedure.

Unfortunately, contrary to the classical empirical process, the asymptotic properties of $T_N$ cannot be extended to its Horvitz-Thompson equivalent. This is essentially due to the fact that the population $\mathcal{U}_N$ contains a finite number of observations. Hence, there is always a maximum $X_{N,N} < \infty$ bounding $P_N$ and sampling fails to distinguish between a distribution with finite support such as those in the Weibull domain of attraction and a heavy-tailed distribution with no endpoint. Actually, these arguments are exactly the same as those introduced when discussing the applicability of bootstrap in extreme value analysis. Indeed, survey sampling under a superpopulation model can be viewed as a generalization of weighted bootstrap. The interested reader may refer to Section 6.4 in [30], for a brief introduction to the difficulty of bootstrapping heavy-tailed phenomena.

In this section, we shall aim at proving first that under Poisson survey schemes, the Horvitz-Thompson version of the Hill estimator computed on the sample is asymptotically close to its standard version calculated over the entire population. This result is next extended to rejective sampling plans, which can be viewed as Poisson plans conditioned upon a fixed sample size.

4.1. The case of the Poisson survey scheme

In this subsection we assume that the vector $\epsilon$ corresponds to that of a Poisson survey scheme, such as depicted in Example 1. The distribution of this design is denoted by $T_N$ and the first order inclusion probabilities by $p_1, \dots, p_N$. Under this setting, the Horvitz-Thompson variant of the Hill estimator is naturally denoted by $H^{p}_{K,N}$ and the $\pi_i$'s in equation (3.3) are to be replaced by the corresponding $p_i$'s. In addition, just like we previously set $\pi_i = \pi(\mathbf{W}_i)$, we write $p_i = p(\mathbf{W}_i)$ for all $i \in \mathcal{U}_N$, with $p$ the Poisson link function. In order to prove the asymptotic normality of $H^{p}_{K,N}$, we shall require that the $p_i$'s fulfill Assumption 1 with lower bound $p_\star$ and that the auxiliary variable from which they are built satisfies the condition below.

Assumption 4. The random vectors $\mathbf{W}_1, \dots, \mathbf{W}_N$ are iid with continuous distribution $P_{\mathbf{W}}$ on $\mathcal{W} \subset \mathbb{R}^d$, $d$-variate cdf $F_{\mathbf{W}}$ and marginals $F_{W_1}, \dots, F_{W_d}$. The joint distribution of the entailed iid sequence $\{(X_i, \mathbf{W}_i), 1 \le i \le N\}$ is denoted by $P_{X,\mathbf{W}}$ with corresponding cdf $F_{X,\mathbf{W}}$.

The following result reveals that under the Poisson survey scheme, when based on the $K$ largest values among the whole population $X_1, \dots, X_N$, $H^{p}_{K,N}$ converges at the same rate ($1/\sqrt{K}$ namely) to the same limit distribution as $H_{K,N}$, up to a multiplicative term in the asymptotic variance induced by the sampling scheme. Further details about the convergence of the classical Hill estimator can be found e.g. in ([12], Thm. 1) and ([30], Sect. 9).

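Theorem 4.2 (Limit distribution in the Poisson survey case). Suppose that Assumption 3 is fulfilled by the underlying heavy-tailed model and that Assumption 1 is satisfied by the considered sequence of Poisson inclusion probabilities $p_1, \dots, p_N$, $N \ge 1$, constructed from some set of auxiliary variables as in Assumption 4. In addition, assume that as $N, K \to \infty$, $K/N \to 0$,

$$\mathbb{E}\left[\frac{1}{p(\mathbf{W})} \,\middle|\, X > U(N/K)\right] \xrightarrow[N \to \infty]{} \sigma_p^2 < \infty.$$

Then, provided that $K \to +\infty$ as $N \to +\infty$ so that $\sqrt{K}\, A(N/K) \to \lambda$ for some constant $\lambda \in \mathbb{R}$, we have the convergence in distribution as $N \to +\infty$:

$$\sqrt{K} \left(H^{p}_{K,N} - \gamma\right) \Rightarrow \mathcal{N}\left(\frac{\lambda}{1 - \rho},\ \gamma^2 \sigma_p^2\right). \tag{4.1}$$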


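As can be seen by examining the proof of this theorem in the Appendix section, the limit result in equation (4.1) can be obtained using the following decomposition:

$$\sqrt{K} \left(H^{p}_{K,N} - \gamma\right) = \underbrace{\frac{\sqrt{K}}{r_{K,N}} \left(r_{K,N} H^{p}_{K,N} - H_{K,N}\right)}_{Q^{(1)}_N} + \underbrace{\frac{\sqrt{K}}{r_{K,N}} \left(H_{K,N} - \gamma\right)}_{Q^{(2)}_N} + \underbrace{\gamma \sqrt{K} \left(\frac{1}{r_{K,N}} - 1\right)}_{Q^{(3)}_N}, \tag{4.2}$$

where

$$r_{K,N} := r_{K,N}(T_N, p) := \frac{1}{K} \sum_{i=1}^K \frac{\epsilon_{N-i+1,N}}{p_{N-i+1,N}}. \tag{4.3}$$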
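As a quick sanity check, and using nothing beyond the definitions in equations (4.2) and (4.3), the three terms can be seen to sum to the left-hand side. Writing $r$ for $r_{K,N}$, $H^p$ for $H^{p}_{K,N}$ and $H$ for $H_{K,N}$ (shorthand used only here), and assuming $r > 0$:

$$Q^{(1)}_N + Q^{(2)}_N + Q^{(3)}_N = \sqrt{K}\,H^{p} - \frac{\sqrt{K}}{r}\,H + \frac{\sqrt{K}}{r}\,H - \frac{\sqrt{K}}{r}\,\gamma + \frac{\sqrt{K}}{r}\,\gamma - \sqrt{K}\,\gamma = \sqrt{K}\left(H^{p} - \gamma\right).$$

The identity is purely algebraic; the probabilistic content lies in the separate behavior of each term.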

These three quantities are studied independently under the hypotheses required in Theorem 4.2. First, we show that $r_{K,N}$ converges to 1 in probability as $N \to +\infty$. Combined with Rényi's decomposition in log-spacings of the Hill estimator (refer for instance to [1], Sect. 4.4), this establishes the convergence in probability of $Q^{(1)}_N$ to 0. It also implies that $Q^{(2)}_N$ is equivalent to $\sqrt{K}(H_{K,N} - \gamma)$, a well-known quantity which tends to a Gaussian distribution with expectation $\lambda/(1-\rho)$ and variance $\gamma^2$ under the second order condition stipulated in Assumption 3 ([11], Thm. 3.2.5). As for $Q^{(3)}_N$, we calculate its expectation and variance conditionally on the full vector of observations $\{(X_i, \mathbf{W}_i), 1 \le i \le N\}$, yielding expressions where the randomness induced by the survey scheme has been controlled. Then, along the lines of a Lindeberg-Feller theorem for independent and non-identically distributed variables ([17], Thm. 3, p. 262), we analyze the conditions under which the conditional variance has a finite limit in probability relative to $\{(X_i, \mathbf{W}_i), 1 \le i \le N\}$. Provided that they are fulfilled, $Q^{(3)}_N$ converges weakly to a centered Gaussian distribution with variance $\gamma^2(\sigma_p^2 - 1)$.

This approach makes it possible to treat $Q^{(2)}_N$ and $Q^{(3)}_N$ as independent random variables (one depends on the data and the other on the survey scheme). Thus, the limit distribution of their sum is the distribution of the sum of their independent Gaussian limits, whose variances add up to $\gamma^2 + \gamma^2(\sigma_p^2 - 1) = \gamma^2 \sigma_p^2$, thereby yielding equation (4.1).

In the remark below, we exhibit some sufficient conditions that guarantee the existence of $\sigma_p^2$, the limit of the conditional expectation appearing in Theorem 4.2.

Remark 4.3 (Sufficient conditions for Theorem 4.2).

It is possible to give a more explicit expression of $\sigma_p^2$ under some additional conditions. Let $C_{X,\mathbf{W}} : [0,1]^{d+1} \to [0,1]$ be the copula of cdf $F_{X,\mathbf{W}}$ such that for all $x \in \mathbb{R}_+$, $\mathbf{w} = (w_1, \dots, w_d) \in \mathcal{W}$,

$$F_{X,\mathbf{W}}(x, \mathbf{w}) = C_{X,\mathbf{W}}(F(x), F_{W_1}(w_1), \dots, F_{W_d}(w_d)).$$

In addition to the set of assumptions required in Theorem 4.2, assume that the conditions below hold.

(A1) The cdf $F$ of $X$ is absolutely continuous with respect to the Lebesgue measure with density $f$.

(A2) The one-dimensional marginal cdfs $F_{W_1}, \dots, F_{W_d}$ of $\mathbf{W}$ are absolutely continuous with respect to the Lebesgue measure with respective densities $f_{W_1}, \dots, f_{W_d}$.

(A3) The copula $C_{X,\mathbf{W}}$ is absolutely continuous with Lebesgue-integrable density $c_{X,\mathbf{W}}$ bounded in a neighborhood of $\{1\} \times [0,1]^d$.

Then,

$$\sigma_p^2 = \int_{\mathcal{W}} \frac{1}{p(\mathbf{w})}\, c_{X,\mathbf{W}}(1, F_{W_1}(w_1), \dots, F_{W_d}(w_d)) \prod_{j=1}^d f_{W_j}(w_j)\, dw_1 \dots dw_d.$$


In order to understand the behavior of $\sigma_p^2$ as a function of the copula density and of the choice of the inclusion probabilities, we illustrate our result by means of a simple example.

Example 2 (Clayton copula). Consider the simple situation where $W$ is a one-dimensional random variable with cdf $F_W$ and density $f_W$, and where the copula $C_{X,W}$ belongs to the Clayton class of copulas, i.e. there exists some $\theta > 0$ such that for all $(u, v) \in [0,1]^2 \setminus \{(0,0)\}$,

$$C_{X,W}(u, v) = C_\theta(u, v) := (u^{-\theta} + v^{-\theta} - 1)^{-1/\theta}.$$

Written this way, the function $C_{X,W}$ models the positive dependence of rare events. It is known that the related tail dependence coefficient is given by $2^{-1/\theta}$ ([29], Example 5.22 p. 215, Family (4.2.1)) and the Kendall's tau by $\theta/(\theta+2)$ ([29], Example 5.4 p. 163), yielding a wide range of possible correlations. Straightforward computations show that the density is given by: for all $(u, v) \in [0,1]^2 \setminus \{(0,0)\}$,

$$c_\theta(u, v) = (\theta + 1)\, u^{-1-\theta} v^{-1-\theta} (u^{-\theta} + v^{-\theta} - 1)^{-2-1/\theta}.$$

Notice that the density function $c_\theta$ is not bounded near $(0,0)$. Nevertheless it has no singularity near $(1,1)$ and is integrable at $(1,0)$. Hence, conditions (A1)-(A3) in Remark 4.3 hold true. Thus, for all $v \in [0,1]$, we have the simpler expression

$$c_\theta(1, v) = (\theta + 1)\, v^{\theta}.$$

Now take

$$p(w) = \begin{cases} F_W(w) & \text{if } F_W(w) \in (p_\star, 1], \\ p_\star & \text{if } F_W(w) \in [0, p_\star]. \end{cases}$$

When $i$ is not too small, these inclusion probabilities are simply proportional to the rank of $W_i$ divided by $N$. Then we have

$$\sigma_p^2 = \frac{1}{p_\star} \int_0^{p_\star} (\theta + 1)\, v^{\theta}\, dv + \int_{p_\star}^1 (\theta + 1)\, v^{\theta - 1}\, dv = 1 + \frac{1}{\theta}\left(1 - p_\star^{\theta}\right).$$

It follows that, as the parameter $\theta$ grows to $+\infty$ (i.e. the greater the correlation in the extremes), $\sigma_p^2$ tends to 1. In other words, sampling in this case does not degrade at all the asymptotic variance of the Hill estimator.
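The closed form obtained in Example 2 can be checked numerically. The sketch below works on the rank scale $v = F_W(w)$, so that the integral becomes $\int_0^1 c_\theta(1, v) / \max(v, p_\star)\, dv$, and approximates it with a midpoint rule. The function names, the step count and the parameter values are illustrative assumptions.

```python
def sigma2_closed_form(theta, p_low):
    """Closed form 1 + (1 - p_low**theta) / theta from Example 2."""
    return 1.0 + (1.0 - p_low ** theta) / theta


def sigma2_numeric(theta, p_low, steps=100_000):
    """Midpoint-rule approximation of the integral defining sigma_p^2.

    Uses the rank scale v = F_W(w): the Clayton density at u = 1 is
    (theta + 1) * v**theta and the inclusion probability is max(v, p_low).
    """
    h = 1.0 / steps
    total = 0.0
    for m in range(steps):
        v = (m + 0.5) * h
        total += (theta + 1.0) * v ** theta / max(v, p_low)
    return total * h


# Hypothetical parameter values: theta = 2 and lower bound 0.1.
exact = sigma2_closed_form(2.0, 0.1)
approx = sigma2_numeric(2.0, 0.1)
```

Increasing `theta` in `sigma2_closed_form` drives the value towards 1, which matches the remark that stronger dependence in the extremes makes the sampling cost vanish.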

Before turning to the extension of the results above to more general sampling schemes, a few remarks are in order.

Remark 4.4 (A singular situation). A situation of particular interest is when $W$ is one-dimensional and there exists a continuous increasing function $g : \mathbb{R}_+ \to \mathcal{W}$ such that $W = g(X)$. In that case, the copula $C_{X,W}$ linking both cdfs $F$ of $X$ and $F_W$ of $W$ is such that for all $(u, v) \in [0,1]^2$, $C_{X,W}(u, v) = u \wedge v$, which is not differentiable.

Nevertheless, assuming that $F$ is absolutely continuous with respect to the Lebesgue measure with density $f$ and that $p$ is increasing, the result still holds with $\sigma_p^2 = 1/p(w^*)$, where $w^* := \inf\{w : F_W(w) = 1\}$ is the right endpoint of the distribution of $W$ (potentially $+\infty$). Indeed, we have

$$\mathbb{E}\left[\frac{1}{p \circ g(X)} \,\middle|\, X > U(N/K)\right] = \frac{N}{K} \int_{U(N/K)}^{+\infty} \frac{1}{p \circ g(x)}\, f(x)\, dx.$$

Setting $u = \frac{N}{K}(1 - F(x))$ entails

$$\mathbb{E}\left[\frac{1}{p \circ g(X)} \,\middle|\, X > U(N/K)\right] = \int_0^1 \frac{1}{p \circ g(U(N/(K u)))}\, du,$$

and since, by virtue of Assumption 1, the Poisson link function $p$ is bounded away from zero, as $N \to \infty$ this last integral tends to $1/(p \circ g(\infty)) = 1/p(w^*)$ for $g$ increasing. For such a $\sigma_p^2$ to be close to 1, it suffices to give a probability of inclusion of 1 to the most extreme observations of $W$ (and a fortiori of $X$).


Remark 4.5 (On the asymptotic variance). Looking at the variance term in equation (4.1) of Theorem 4.2, we see that the influence of the survey scheme is encapsulated by the multiplicative term $\sigma_p^2$, which is by definition greater than or equal to 1. Ideally, we would like to have at our disposal inclusion probabilities for which $\sigma_p^2$ is as close to 1 as possible. This aspect is discussed in Section 5.2. In any case, it is possible to use the Horvitz-Thompson empirical version of $\sigma_p^2$ given by

$$(S^{p}_N)^2 := \frac{1}{K} \sum_{i=1}^N \frac{\epsilon_i}{p(\mathbf{W}_i)}\, \frac{1}{p(\mathbf{W}_i)}\, \mathbb{I}\{X_i > U(N/K)\}$$

to estimate this variance term. Indeed, conditionally on the full vector of observations $(\mathbf{W}_i, X_i)_{1 \le i \le N}$, we have that

$$\mathbb{E}\left[(S^{p}_N)^2\right] = \frac{1}{K} \sum_{i=1}^N \frac{1}{p(\mathbf{W}_i)}\, \mathbb{I}\{X_i > U(N/K)\} \xrightarrow[N \to \infty]{} \sigma_p^2 \quad \text{a.s.}$$

(see Lem. 19). It follows that $(S^{p}_N)^2$ converges in $L^1$ and thus in probability to $\sigma_p^2$.
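To illustrate how the limit result (4.1) and the variance estimate of Remark 4.5 might be combined into a Gaussian confidence interval, here is a rough sketch. Several choices in it are assumptions of this example and are not taken from the paper: the bias term is neglected ($\lambda = 0$), the unknown threshold $U(N/K)$ is replaced by the sample order statistic $X_{n-k,n}$, $K$ is replaced by its Horvitz-Thompson estimate $\sum 1/p_i$ over the $k$ upper sample values, $\gamma$ is replaced by the estimate in the variance, and the name `ht_hill_confidence_interval` is hypothetical.

```python
import math


def ht_hill_confidence_interval(sample_values, sample_probs, k, z=1.96):
    """Plug-in Gaussian interval for gamma around the Horvitz-Thompson Hill estimate.

    Assumptions of this sketch: lambda = 0 in (4.1), U(N/K) estimated by
    X_{n-k,n}, and K estimated by the sum of 1/p_i over the k upper values.
    z is the standard normal quantile (1.96 for a nominal 95% level).
    """
    pairs = sorted(zip(sample_values, sample_probs))
    n = len(pairs)
    if not 1 <= k <= n - 1:
        raise ValueError("k must lie in {1, ..., n - 1}")
    threshold = pairs[n - k - 1][0]
    top = pairs[n - k:]
    k_hat = sum(1.0 / p for _, p in top)                  # estimate of K
    gamma_hat = sum(math.log(x / threshold) / p for x, p in top) / k_hat
    sigma2_hat = sum(1.0 / p ** 2 for _, p in top) / k_hat  # plug-in (S_N^p)^2
    half_width = z * gamma_hat * math.sqrt(sigma2_hat / k_hat)
    return gamma_hat - half_width, gamma_hat + half_width


# Toy data with all inclusion probabilities equal to 1 (census case):
# sigma2_hat is then 1 and the interval reduces to the usual Hill interval.
low, high = ht_hill_confidence_interval([1.0, 2.0, 4.0, 8.0, 16.0], [1.0] * 5, k=2)
```

With so few observations the interval is of course meaningless in practice; the toy data only serve to make the arithmetic traceable.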

4.2. Extension to rejective sampling schemes

We now show how the result stated in Theorem 4.2 can be extended to an important class of survey plans, namely rejective sampling schemes. For the sake of clarity, we first provide a brief description of the latter; refer to [2,24] for further details.

Fix $n \le N$ and consider a vector $(\pi_1, \dots, \pi_N)$ of first order inclusion probabilities. Rejective sampling, sometimes referred to as conditional Poisson sampling (CPS in short), exponential design without replacement or maximum entropy design [34], is the sampling plan $R_N$ which picks samples of fixed size $n(S) := n$ in order to maximize the entropy measure

$$H(R_N) = -\sum_{\{s \in \mathcal{P}(\mathcal{U}_N) :\, \#s = n\}} R_N(s) \log R_N(s)$$

subject to the constraint stipulating that its vector of first order inclusion probabilities coincides with $(\pi_1, \dots, \pi_N)$. It can be implemented in two steps, as follows.

1. Draw a sample $S$ with a Poisson sampling plan (without replacement), with properly chosen first order inclusion probabilities $(p_1, \ldots, p_N)$. The representation is called canonical if $\sum_{i=1}^{N} p_i = n$. In that case the relationships between $p_i$ and $\pi_i$, $1 \le i \le N$, are established in [24].

2. If $n(S) \neq n$, then reject it and go back to step one; otherwise stop.
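The two steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function names `poisson_sample` and `rejective_sample` and the safety cap `max_tries` are hypothetical, and the Poisson probabilities `p` are taken as given (the source obtains them from the target $\pi_i$ through a dedicated algorithm that is not reproduced here).

```python
import random


def poisson_sample(p, rng):
    """Step 1: include each unit i independently with probability p[i]."""
    return [i for i, pi in enumerate(p) if rng.random() < pi]


def rejective_sample(p, n, rng=None, max_tries=100000):
    """Step 2: redraw Poisson samples until one has exactly n units.

    max_tries is a hypothetical safety cap, not part of the scheme itself.
    """
    if rng is None:
        rng = random.Random()
    for _ in range(max_tries):
        s = poisson_sample(p, rng)
        if len(s) == n:
            return s
    raise RuntimeError("no sample of the requested size was drawn")
```

A call such as `rejective_sample([0.5] * 10, 5, random.Random(0))` returns five distinct unit indices. The rejection loop can be slow when the expected Poisson sample size is far from `n`, which is one practical reason to work with probabilities that sum to `n`, as in the canonical representation.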

The vector $(p_1, \ldots, p_N)$ must be chosen in a way that the resulting first order inclusion probabilities coincide with $\pi_1, \ldots, \pi_N$, by means of a dedicated optimization algorithm, see [34]. The corresponding probability distribution is given by: for all $s \in \mathcal{P}(U_N)$,

$$R_N(s) = \frac{T_N^p(s)\,\mathbb{I}\{\#s = n\}}{\sum_{s'\in\mathcal{P}(U_N):\ \#s' = n} T_N^p(s')} \propto \prod_{i\in s} p_i \prod_{i\notin s} (1-p_i) \times \mathbb{I}\{\#s = n\}.$$

Refer to ([24], p. 1496) for more details on the $p_i$'s.

Turning now to the extension of the result stated in Theorem 4.2 for the Poisson survey scheme to the case of rejective sampling, we proceed in two steps. For clarity, we shall write $H^{\pi}_{K,N}(R_N)$ when the Horvitz−Thompson version of the Hill estimator involves the inclusion variables $\epsilon_1, \ldots, \epsilon_N$ drawn under the sampling plan $R_N$ and the probabilities of inclusion $\pi_1, \ldots, \pi_N$. First, we show that the asymptotic distribution of $\sqrt{K}\,(H^{p}_{K,N}(R_N) - \gamma)$ is the same as that of $\sqrt{K}\,(H^{p}_{K,N}(T_N) - \gamma)$, where $R_N$ (resp. $T_N$) denotes a rejective (resp. Poisson) sampling plan with inclusion probabilities $\pi_1, \ldots, \pi_N$ (resp. $p_1, \ldots, p_N$). Then, we control the difference between $H^{p}_{K,N}(R_N)$ and $H^{\pi}_{K,N}(R_N)$. For this, we introduce the following quantities: for all $K \le N$,

$$D_{K,N}(R_N, T_N) := r_{K,N}(R_N, \pi)\,H^{\pi}_{K,N}(R_N) - r_{K,N}(R_N, p)\,H^{p}_{K,N}(R_N)$$

and

$$r_{K,N}(R_N, T_N) := r_{K,N}(R_N, \pi) - r_{K,N}(R_N, p).$$

The ensuing approach follows in the footsteps of [24] and relies more specifically on the results displayed in Theorem 5.1 (p. 508). Let us start by defining the quantities

$$d_N = \sum_{i=1}^{N} p_i(1-p_i) \quad\text{and}\quad \bar{p}_N = \frac{1}{d_N}\sum_{i=1}^{N} p_i^2(1-p_i).$$
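For concreteness, the two quantities just defined can be computed directly from the Poisson inclusion probabilities. The short Python sketch below assumes these probabilities are available as a plain list; the function name `poisson_design_constants` is hypothetical.

```python
def poisson_design_constants(p):
    """Return (d_N, p_bar_N) for Poisson inclusion probabilities p.

    d_N     = sum_i p_i (1 - p_i)
    p_bar_N = (1 / d_N) * sum_i p_i^2 (1 - p_i)
    """
    d_n = sum(pi * (1.0 - pi) for pi in p)
    if d_n <= 0.0:
        # Happens only if every p_i is 0 or 1; p_bar_N is then undefined.
        raise ValueError("d_N must be positive")
    p_bar = sum(pi * pi * (1.0 - pi) for pi in p) / d_n
    return d_n, p_bar
```

Since $\bar{p}_N$ is a weighted average of the $p_i$ with weights $p_i(1-p_i)$, it always lies between the smallest and the largest inclusion probability; with equal probabilities it reduces to their common value.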

We assume that both $R_N$ and $T_N$ fulfill Assumption 1 with lower-bounding constants $\pi_*$ and $p_*$ respectively, and that the Poisson inclusion probabilities further satisfy the following condition.

Assumption 5.

$$\limsup_{N\to+\infty} \frac{1}{N}\sum_{i=1}^{N} p_i(T_N) < 1.$$

Notice that, in this situation, $1/d_N = o(1/K)$ and $\bar{p}_N$ is bounded. In addition, as shown in [24] (see p. 1510 therein), the decomposition below holds for all $i \in \{1, \ldots, N\}$:

$$p_i - \pi_i = \left(\frac{\bar{p}_N - p_i}{d_N} + o(1/d_N)\right) p_i(1-\pi_i).$$

This roughly means that the inclusion probabilities of the rejective sampling scheme are very close to those of the underlying Poisson design from which it was built. So close in fact that $D_{K,N}(R_N, T_N)$ and $r_{K,N}(R_N, T_N)$ asymptotically vanish. Therefore, as revealed by the following result, Theorem 4.2 also holds when the sample is constructed with a rejective plan.

Theorem 4.6 (Limit distribution in the rejective survey case). Suppose that all the conditions required in Theorem 4.2 hold together with Assumption 5. Then, for $\sigma_p^2$ as in Theorem 4.2 and provided that $K \to +\infty$ as $N \to +\infty$ so that $\sqrt{K}\,A(N/K) \to \lambda$ for some constant $\lambda \in \mathbb{R}$, we have the convergence in distribution as $N \to +\infty$:

$$\sqrt{K}\left(H^{\pi}_{K,N} - \gamma\right) \Rightarrow \mathcal{N}\left(\frac{\lambda}{1-\rho},\ \gamma^2\sigma_p^2\right). \tag{4.4}$$

The proof of this theorem is available in the Appendix section. Notice that the limit variance does not depend on the inclusion probabilities $\pi_1, \ldots, \pi_N$, but on those of the underlying Poisson design, namely $p_1, \ldots, p_N$. The asymptotic properties of the rejective sampling scheme are intricately linked to the Poisson plan with which it is associated.

5. Practical issues and illustrative experiments

5.1. On the choice of an optimal k

All results presented in the previous section depend on some appropriately chosen number $K$ of largest observations in the population $X_1, \ldots, X_N$. Unfortunately, the estimated tail quantile $X_{N-K,N}$ from which $H^{\pi}_{K,N}$ is computed may not be included in the sample. Hence, we need to choose a number $k$ of largest values in the sample to which we may associate some $K$ that respects the necessary conditions for consistency and asymptotic normality to hold ($K = K(N) \to +\infty$ and $\sqrt{K}\,A(N/K) \to \lambda < \infty$ as $N \to +\infty$). Recall that we defined $\kappa^{\pi}_N$ in equation (3.2), a non-injective random map that assigns an index $k$ in the sample to any index $K$ in the population so that $X_{n-k,n} = U^{\pi}_N(N/K)$. Setting

$$K(k) := (\kappa^{\pi}_N)^{\leftarrow}(k) := \left\lceil N - \sum_{i=1}^{n-k} \frac{1}{\pi_{i,n}} \right\rceil,$$

where $\lceil\cdot\rceil$ is the ceiling function, it is straightforward to show that the limit results stated in the sections above remain true for $\sqrt{K(k)}\,\big(H^{\pi}_{k,n} - \gamma\big)$ as $N, k \to +\infty$ so that $\sqrt{K(k)}\,A(N/K(k)) \to \lambda$ almost-surely. This result can be naturally used to ground the construction of asymptotic Gaussian confidence intervals. Indeed, by virtue of Slutsky's Lemma combined with Theorem 3.2, the quantity $\sqrt{K(k)}\,\big(H^{\pi}_{k,n} - \gamma\big)/H^{\pi}_{k,n}$ is then asymptotically pivotal, distributed as a standard Gaussian random variable as $N, k \to +\infty$.
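To make the map from a sample index $k$ to a population index $K(k)$ more tangible, here is a minimal Python sketch of the ceiling formula above. It assumes that `pi_sorted[i - 1]` holds $\pi_{i,n}$, read here as the inclusion probability attached to the $i$-th smallest sampled observation; that reading, like the function name `population_index`, is an assumption made for illustration.

```python
import math


def population_index(pi_sorted, k, N):
    """Sketch of K(k) = ceil(N - sum_{i=1}^{n-k} 1 / pi_{i,n}).

    pi_sorted : inclusion probabilities of the sampled units, ordered by
                increasing observed value (assumed reading of pi_{i,n})
    k         : number of largest sample values retained, 1 <= k < n
    N         : population size
    """
    n = len(pi_sorted)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    # Horvitz-Thompson count of population units at or below X_{n-k,n}.
    below = sum(1.0 / q for q in pi_sorted[: n - k])
    return math.ceil(N - below)
```

With a census (all probabilities equal to 1 and $n = N$) the function returns `k` itself, so sample and population indices coincide. With a uniform sampling rate of one half, each sampled unit below the threshold counts for two population units.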

In practice, choosing an optimal threshold $X_{N-K,N}$ is already complicated in the i.i.d. case. Many techniques have been proposed in the literature, often based on the minimization of the MSE (see [10,19,21] and the references therein). Since they involve in general the estimation of the second order parameter $\rho$, which goes beyond the scope of our analysis, we leave such considerations for future research. In the meantime, we propose to simply rely on heuristics such as the stability of the Horvitz−Thompson version of the Hill estimator around the appropriate $k$.

5.2. On the choice of the sampling weights

A question of interest is the design of a survey sampling scheme yielding a weighted Hill estimator of minimum asymptotic variance. In order to give some hints of the corresponding plan, assume for simplicity that $W$ is a one-dimensional random variable with distribution $F_W$ and density $f_W$ from which we observe $N$ i.i.d. realizations, denoted by $W_{i,N}$ once ordered. Also consider the conditions of Theorem 4.2 to hold together with those of Remark 4.3. Then, minimizing the variance of the Horvitz−Thompson variant of the Hill estimator boils down to minimizing

$$\sigma_p^2 := \int_{\mathcal{W}} \frac{1}{p(w)}\,c_{X,W}(1, F_W(w))\,f_W(w)\,\mathrm{d}w,$$

subject to the constraints

$$\begin{cases} \displaystyle\int_{\mathcal{W}} p(w)\,f_W(w)\,\mathrm{d}w =: \eta = \lim_{N\to\infty} \frac{\mathbb{E}(n)}{N} < 1, \\ \forall w \in \mathcal{W},\quad p_* \le p(w) \le 1. \end{cases}$$

This variational problem may be translated in terms of the simpler convex optimisation problem (for choosing the weights) given by

$$\min\ \frac{1}{N}\sum_{i=1}^{N} \frac{c_{X,W}\left(1, \frac{i}{N}\right)}{p(W_{i,N})} \quad\text{subject to}\quad \sum_{i=1}^{N} p(W_{i,N}) = \mathbb{E}(n),\quad p_* < p(W_{i,N}) < 1. \tag{5.1}$$

If for all $i \in \{1, \ldots, N\}$ we set $p_i := p(W_{i,N})$ and denote by $\lambda_0, \ldots, \lambda_{2N}$ the KKT multipliers, then the Lagrangian of this problem is simply given by

$$L(p) := \frac{1}{N}\sum_{i=1}^{N} \frac{1}{p_i}\,c_{X,W}\left(1, \frac{i}{N}\right) + \lambda_0\left(\sum_{i=1}^{N} p_i - \mathbb{E}(n)\right) + \sum_{i=1}^{N} \lambda_i(p_i - 1) + \sum_{i=1}^{N} \lambda_{N+i}(p_* - p_i).$$
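As a hedged numerical illustration of problem (5.1), the Python sketch below solves the discrete allocation problem under additional assumptions that are not taken from the source: the coefficients `c[i]`, standing for $c_{X,W}(1, i/N)$, are supposed known and strictly positive, and the bound constraints are treated as closed. Setting the derivative of the Lagrangian with respect to $p_i$ to zero when neither bound is active gives $p_i$ proportional to $\sqrt{c_i}$, so the sketch clips that square-root rule to $[p_*, 1]$ and tunes the proportionality factor by bisection until the expected sample size is met. The function name `optimal_inclusion_probs` and the parameter `iters` are hypothetical.

```python
import math


def optimal_inclusion_probs(c, expected_size, p_min, iters=200):
    """Sketch: minimize (1/N) * sum_i c[i] / p[i]
    subject to sum_i p[i] = expected_size and p_min <= p[i] <= 1.

    c[i] is an assumed, strictly positive stand-in for c_{X,W}(1, i/N).
    """
    N = len(c)
    if any(ci <= 0.0 for ci in c):
        raise ValueError("this sketch assumes strictly positive coefficients")
    if not N * p_min <= expected_size <= N:
        raise ValueError("expected_size is not attainable within the bounds")

    def probs(t):
        # Square-root rule p_i = t * sqrt(c_i), clipped to [p_min, 1].
        return [min(1.0, max(p_min, t * math.sqrt(ci))) for ci in c]

    lo, hi = 0.0, 1.0 / math.sqrt(min(c))  # at t = hi every p_i equals 1
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(probs(mid)) < expected_size:
            lo = mid
        else:
            hi = mid
    return probs(hi)
```

For instance, with `c = [1.0, 4.0, 9.0, 16.0]`, an expected size of 2 and `p_min = 0.1`, no bound is active and the allocation is proportional to the square roots 1, 2, 3, 4. Bisection is used because the total of the clipped probabilities is nondecreasing in the scaling factor, which makes the search simple and robust.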
