HAL Id: hal-01418370
https://hal.archives-ouvertes.fr/hal-01418370v2
Submitted on 13 Jan 2017
HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.
Extreme value statistics for censored data with heavy tails under competing risks
Julien Worms, Rym Worms
To cite this version:
Julien Worms, Rym Worms. Extreme value statistics for censored data with heavy tails under com-
peting risks. Metrika, Springer Verlag, 2018, 81 (7), pp.849-889. �10.1007/s00184-018-0662-3�. �hal-
01418370v2�
Extreme value statistics for censored data with heavy tails under competing risks
Julien Worms (1) & Rym Worms
1(2)
(1) Universit´ e Paris-Saclay / Universit´ e de Versailles-Saint-Quentin-En-Yvelines Laboratoire de Math´ ematiques de Versailles (CNRS UMR 8100),
F-78035 Versailles Cedex, France, e-mail : julien.worms@uvsq.fr
(2) Universit´ e Paris-Est
Laboratoire d’Analyse et de Math´ ematiques Appliqu´ ees (CNRS UMR 8050),
UPEMLV, UPEC, F-94010, Cr´ eteil, France, e-mail : rym.worms@u-pec.fr
1Corresponding author
Extreme value statistics for censored data with heavy tails under competing risks
Abstract
This paper addresses the problem of estimating, in the presence of random censoring as well as competing risks, the extreme value index of the (sub)-distribution function associated to one partic- ular cause, in the heavy-tail case. Asymptotic normality of the proposed estimator (which has the form of an Aalen-Johansen integral, and is the first estimator proposed in this context) is es- tablished. A small simulation study exhibits its performances for finite samples. Estimation of extreme quantiles of the cumulative incidence function is also addressed.
AMS Classification. Primary 62G32 ; Secondary 62N02
Keywords and phrases. Extreme value index. Tail inference. Random censoring. Competing Risks. Aalen- Johansen estimator.
2
1. Introduction
The study of duration data (lifetime, failure time, re-employment time...) subject to random censoring is a major topic of the domain of statistics, which finds applications in many areas (in the sequel we will, for convenience, talk about lifetimes to refer to these observed durations, but without restricting our scope to lifetime data analysis). In general, the interest lies in obtaining informations about the central characteristics of the underlying lifetime distribution (mean lifetime or survival probabilities for instance), often with the objective of comparing results between different conditions under which the lifetime data are acquired. In this work, we will address the problem of inferring about the (upper) tail of the lifetime distribution, for data subject both to random (right) censoring and competing risks.
Suppose indeed that we are interested in the lifetimes of n individuals or items, which are subject to K different causes of death or failure, and to random censorship (from the right) as well. We are particularly interested in one of these causes (this main cause will be considered as cause number k thereafter, where k P t1, . . . , Ku), and we suppose that all causes are exclusive and are likely to be dependent on the others.
The censoring time is assumed to be independent of the different causes of death or failure and of the observed lifetime itself. However, since the other causes (different from the k-th cause of interest) generally cannot be considered as independent of the main cause, in no way they can be included in the censoring mechanism.
This prevents us from relying on the basic independent censoring statistical framework, and we are thus in the presence of what is called a competing risks framework (see Moeschberger and Klein (1995)).
For instance, if a patient is suffering from a very serious disease and starts some treatment, then the final outcome of the treatment can be death due to the main disease, or death due to other causes (nosocomial infection for instance). And censoring can occur due to loss of follow up or end of the clinical study. Another example, in a reliability experiment, is that the failure of some mechanical system can be due to the failure of a particular subpart, or component, of the system : since separating the different components for studying the reliability of only one of them is generally not possible, accounting for these different competing causes of failure is necessary. Another field where competing risks often arise are labor economics, for instance in re-employment studies (see Fermanian (2003) for practical examples).
One way of formalising this is to say that we observe a sample of n independent couples pZ
i, ξ
iq
1ďiďnwhere
Z
i“ minpX
i, C
iq, δ
i“ I
XiďCi, ξ
i“
"
0 if δ
i“ 0, C
iif δ
i“ 1.
The i.i.d. samples pX
iq
iďnand pC
iq
iďn, of respective continuous distribution functions F and G, represent the lifetimes and censoring times of the individuals, and are supposed to be independent. For convenience, we will suppose in this work that they are non-negative. The variables pC
iq
iďnform a discrete sample with values in t1, . . . , Ku, and represent the causes of failure or death of the n individuals or items. It is important to note that these causes are observed only when the data is uncensored (i.e. when δ
i“ 1), therefore we only observe the ξ
i’s, not the complete C
i’s.
One way of considering the failure times X
iis to write
X
i“ minpX
i,1, . . . , X
i,Kq,
where the variable X
i,kis a (rather artificial) variable representing the imaginary latent lifetime of the i-th individual when the latter is only affected by the k-th cause (the other causes being absent). This viewpoint may be interesting in its own right, but we will not keep on considering it in the sequel, one reason being that such variables X
i,1, . . . , X
i,Kcannot be realistically considered as independent, and their respective distributions are of no practical use or interpretability (as explained and demonstrated in the competing risks literature, these distributions are in fact not statistically identifiable, see Tsiatis (1975) for example).
The object of interest is the probability that a subject dies or fails after some given time t, due to the k-th cause, for high values of t. This quantity, denoted by
F s
pkqptq “ P r X ą t , C “ k s, is related to the so-called cumulative incidence function F
pkqdefined by
F
pkqptq “ P r X ď t , C “ k s.
Note that F s
pkqptq is not equal to 1´F
pkqptq, but to P pC “ kq´F
pkqptq, because F
pkqis only a sub-distribution function. However we have F s
pkqptq “ ş
8dF
pkqpuq. In the sequel, the notation ¯ Sp.q “ Sp8q ´ Sp.q will be
used, for any non-decreasing function S.
In this paper, we are interested in investigating the behaviour of F s
pkqptq for large values of t. This amounts to statistically study extreme values in a context of censored data under competing risks, and will lead us to consider some extreme value index γ
krelated to F s
pkq, which will be defined in a few lines. Equivalently, the object of interest is the high quantile x
pkqp“ p F s
pkqq
´ppq “ inft x P R ; F s
pkqpxq ě p u when p is close to 0, which can be interpreted as follows (in the context of lifetimes of individuals or failure times of systems) : in the presence of the other competing causes, a given individual (or item) will die (or fail), due to cause k after such a time x
pkqp, only with small probability p. A nonparametric inference for quantiles of fixed (and therefore not extreme) order, in the competing risk setting, has been already proposed in Peng and Fine (2007).
One way of addressing this problem could be through a parametric point of view (see Crowder (2001) for further methods in the competing risk setting), however, the non-parametric approach is the most common choice of people faced with data presenting censorship or competing risks. Of course, the standard Kaplan- Meier method for survival analysis does not yield valid results for a particular risk if failures from other causes are treated as censoring times, because the other causes cannot always be considered independent of the particular cause of interest.
The commonly used nonparametric estimator of the cumulative incidence function F
pkqis the so-called Aalen-Johansen estimator (see Aalen and Johansen (1978), or Geffray (2009) equation p7q) defined by
F
npkqptq “ ÿ
Ziďt
δ
iI
Ci“kn G s
npZ
i´q ,
where G s
ndenotes the standard Kaplan-Meier estimator of G (and G s
npt
´q denotes lim
sÒtG s
npsq), so that we can introduce the following estimator for F s
pkq:
F s
npkqptq “ ÿ
Ziąt
δ
iI
Ci“kn G s
npZ
i´q .
But if the value t considered is so high that only very few (if any) observations Z
i(such that C
i“ k) exceed t, then this purely nonparametric approach will lead to very unstable estimations F s
npkqptq of F s
pkqptq. This is why a semiparametric approach is desirable, and the one we will consider here is the one inspired by classical extreme value theory.
First note that in this paper, we will only consider situations where the underlying distributions F and G of the variables X and C are supposed to present power-like tails (also commonly named heavy tails), and we will focus on the evaluation of the order of this tail. Our working hypothesis will be thus that the different functions F s
pkq(for k “ 1, . . . , K) as well as G s “ 1 ´ G belong to the Fr´ echet maximum domain of attraction. In other words, we assume that they are (see Definition 1 in the Appendix) regularly varying at infinity, with respective negative indices ´1{γ
1, . . . , ´1{γ
Kand ´1{γ
C@1 ď k ď K, @x ą 0, lim
tÑ`8
F s
pkqptxq{ F s
pkqptq “ x
´1{γkand lim
tÑ`8
Gptxq{ s Gptq “ s x
´1{γC. (1) Consequently, F s “ 1´F “ ř
Kk“1
F s
pkqand H s “ F s G s (the survival function of Z) are regularly varying (at `8) with respective indices ´1{γ
Fand ´1{γ, where γ
F“ maxpγ
1, . . . , γ
Kq and γ satisfies γ
´1“ γ
F´1` γ
C´1(these relations are constantly used in this paper).
The estimation of γ
Fhas been already studied in the literature, as it corresponds to the random (right) censoring framework, without competing risks. We can cite Beirlant et al. (2007) and Einmahl et al. (2008), where the authors propose to use consistent estimators of γ divided by the proportion of non-censored observations in the tail, or Worms and Worms (2014), where two Hill-type estimators are proposed for γ
F, based on survival analysis techniques. However, our target here is γ
k(for a fixed k “ 1, . . . , K) and the point is that there seems to be no way to deduce an estimator of γ
kfrom an estimator of γ
F. Note that the useful trick used in Beirlant et al. (2007) and Einmahl et al. (2008) to construct an estimator of γ
Fdoes not seem to be extendable to this competing risks setting. To the best of our knowledge, our present paper is the first one addressing the problem of estimating the cause-specific extreme value index γ
k.
4
Considering assumption (1), it is simple to check that, for a given k, we have
tÑ`8
lim 1 F s
pkqptq
ż
`8 tlogpu{tq dF
pkqptq “ γ
k.
It is therefore most natural to propose the following (Hill-type) estimator of γ
k, for some given threshold value t
n(assumptions on this threshold are detailed in the next section) :
γ p
n,k“ ż
φ p
npuqdF
npkqpuq where φ p
npuq “ 1 F s
npkqpt
nq
log ˆ u
t
n˙ I
uątn,
which can be also written as p γ
n,k“ 1
n F s
npkqpt
nq
n
ÿ
i“1
logpZ
i{t
nq
G s
npZ
i´q I
ξi“kI
Ziątn“ 1 n F s
npkqpt
nq
ÿ
Zpiqątn
logpZ
piq{t
nq
G s
npZ
pi´1qq δ
piqI
Cpiq“k,
where Z
p1qď . . . ď Z
pnqare the ordered random variables associated to Z
1, . . . , Z
n, and δ
piqand C
piqare the censoring indicator and cause number which correspond to the order statistic Z
piq. It is clear that this estimator is a generalisation of one of the estimators proposed in Worms and Worms (2014), in which the situation K “ 1 (with only one cause of failure/death) was considered. The asymptotic result we prove in the present work is then valid in the situation studied in the latter, where only consistency was proved and a random threshold was used.
Our paper is organized as follows: in Section 2, we state the asymptotic normality result of the proposed estimator, and of a corresponding estimator of an extreme quantile of the cumulative incidence function.
Section 5 is devoted to the proofs. In Section 3, we present some simulations in order to illustrate finite sample behaviour of our estimator. Some technical aspects of the proofs are postponed to the Appendix.
2. Assumptions and Statement of the results
The central limit theorem which is going to be proved has the rate ?
v
nwhere v
n“ n F s
npkqpt
nq Gpt s
nq and t
nis a threshold tending to 8 with the following constraint
v
nnÑ8
ÝÑ `8 such that n
´η0v
nnÑ8
ÝÑ `8 for some η
0ą 0. (2) If we note l
kthe slowly varying function associated to F s
pkq(i.e. such that F s
pkqpxq “ x
´1{γkl
kpxq in condition (1)), the second order condition we consider is the classical SR2 condition for l
k(see Bingham, Goldie and Teugels (1987)),
@x ą 0, l
kptxq
l
kptq ´ 1
tÑ8„ h
ρkpxq gptq p@x ą 1q, (3) where g is a positive measurable function, slowly varying with index ρ
kď 0, and h
ρkpxq “
xρkρ´1k
when
ρ
kă 0, or h
ρkpxq “ log x when ρ
k“ 0.
Theorem 1. Under assumptions p1q, p2q and p3q, if there exists λ ě 0 such that ?
v
ngpt
nq
nÑ8ÝÑ λ, and if γ
kă γ
Cthen we have
? v
np p γ
n,k´ γ
kq ÝÑ
dN pλm, σ
2q as n Ñ 8 where
m “
$
&
%
γk2
1´γkρk
if ρ
kă 0, γ
k2if ρ
k“ 0,
and σ
2“ γ
k2p1 ´ rq
3` p1 ` r
2q ´ 2cr ˘ ,
with c “ lim
xÑ8F s
pkqpxq{ Fpxq P r0, s 1s and r “ γ
k{γ
CPs0, 1r.
Remark 1. Note that when γ
kă γ
F, then c “ 0, and, when γ
k“ γ
Fand c “ 1 (for instance when there is only one cause of failure/death), then σ
2reduces to γ
F2{p1 ´ rq.
Proposition 1. Under assumptions p1q and p2q, we have p γ
n,k PÝÑ γ
kas n Ñ 8.
Remark 2. The condition γ
kă γ
C(weak censoring) is not necessary for the consistency of γ
n,k.
Now, concerning the estimation of an extreme quantile x
pkqpn(of order p
ntending to 0) associated to F s
pkq, we propose the usual Weissman-type estimator (in this heavy tailed context), associated to the threshold t
nused in the estimation of γ
k,
ˆ x
pkqpn,tn
“ t
n˜
F s
npkqpt
nq p
n¸
γpn,k,
where p
nis assumed to satisfy the constraint p
n“ o `
F s
pkqpt
nq ˘
. Remind that by definition F s
pkqpx
pkqpnq “ p
n, and thus the definition of this estimator is based on the fact that, by the assumed regular variation of F s
pkq, the ratio F s
pkqpx
pkqpnq{ F s
pkqpt
nq is close to px
pkqpn{t
nq
´1{γk.
Corollary 1. Under the assumptions of Theorem 1, if in addition ρ
kă 0 (in (3)) and d
n“ F s
pkqpt
nq{p
nÑ 8 satisfies the condition
? v
n{ logpd
nq
nÑ8ÝÑ 8, (4)
then (with λ, m and σ
2being defined in the statement of Theorem 1)
? v
nlogpd
nq
˜ x ˆ
pkqpn,tn
x
pkqpn´ 1
¸
ÝÑ
dN pλm, σ
2q as n Ñ 8.
3. Simulations
In this section, a small simulation study is conducted in order to illustrate the finite-sample behaviour of our new estimator in some simple cases, and discuss the main issues associated with the competing risks setting.
For simplicity, we focus on the situation with two competing risks (K “ 2), also called causes below, and our aim is the extreme value index γ
1associated to the first cause. Data are generated from one of the following two models : for c
1, c
2non-negative constants satisfying c
1` c
2“ 1, we consider the following (sub-)distribution for each cause-specific function F s
pkq(k P t1, 2u) :
´ Fr´ echet : F s
pkqptq “ c
kexpp´t
´1{γkq, for t ě 0 ;
´ Burr : F s
pkqptq “ c
kp1 ` t
τk{βq
´1{pγkτkq, for t ě 1, where τ
ką 0, β ą 0.
The lifetime X , of survival function ¯ F “ F ¯
p1q` F ¯
p2q, is generated by the inversion method (with numerical computation of F s
´1). Censoring times are then generated from a Fr´ echet or a Burr distribution :
Gptq “ s expp´t
´1{γCq pt ě 0q or Gptq “ p1 s ` t
τC{βq
´1{pγCτCqpt ě 1q.
In this section, we consider (as it is often done in simulation studies) that the threshold t
nused in the definition of our new estimator ˆ γ
n,1is taken equal to Z
pn´knq(i.e. we consider it as random). One aim of this section is to show how our estimator (with random threshold)
ˆ
γ
1“ 1
n F s
np1qpZ
pn´knqq
kn
ÿ
i“1
logpZ
pn´i`1q{Z
pn´knqq
G s
npZ
pn´i,nqq δ
pn´i`1qI
Cpn´i`1q“1of γ
1behaves when the proportion c
1of cause 1 events varies : we consider c
1P t1, 0.9, 0.7, 0.5u, the case c
1“ 1 corresponding to the simple censoring framework, without competing risk.
Another aim is to illustrate the impact of dependency between the causes, when estimating the tail. The starting point is that, if cause 2 could be considered independent of cause 1, then we could (and would) include it in the censoring mechanism and we would be in the simple random censoring setting, without competing risk. In this case, it would be possible to estimate γ
1by one of the following two estimators, the first one being proposed in Beirlant et al. (2007) (a Hill estimator weighted with a constant weight), and the second one in Worms and Worms (2014) (a Hill estimator weighted with varying Kaplan-Meier weights):
ˆ
γ
1pBDF Gq“ 1 k
nkn
ÿ
i“1
1 ˆ
p
1logpZ
pn´i`1q{Z
pn´knqq (5)
ˆ
γ
1pKMq“ 1 n F s
n,bpZ
pn´knqq
kn
ÿ
i“1
δ
pn´i`1qI
Cpn´i`1q“1G s
n,bpZ
pn´i,nqq logpZ
pn´i`1q{Z
pn´knqq, (6)
6
100 300
−0.010.03
kn
Bias c1=1
c1=0.9 c1=0.7 c1=0.5
100 300
0.0000.003
kn
MSE c1=1
c1=0.9 c1=0.7 c1=0.5
(a) Fr´echet case, γ1“0.1, γ2“0.25, γC“0.3
100 300
−0.010.03
kn
Bias c1=1
c1=0.9 c1=0.7 c1=0.5
100 300
0.0000.003
kn
MSE c1=1
c1=0.9 c1=0.7 c1=0.5
(b) Same case as (a) but for Burr distribution
100 300
−0.010.03
kn
Bias c1=1
c1=0.9 c1=0.7 c1=0.5
100 300
0.0000.003
kn
MSE c1=1
c1=0.9 c1=0.7 c1=0.5
(c) Fr´echet case, γ1“0.1, γ2“0.25, γC“0.2
100 300
−0.010.03
kn
Bias c1=1
c1=0.9 c1=0.7 c1=0.5
100 300
0.0000.003
kn
MSE c1=1
c1=0.9 c1=0.7 c1=0.5
(d) Same case as (c) but for Burr distribution
100 300
−0.010.03
kn
Bias c1=1
c1=0.9 c1=0.7 c1=0.5
100 300
0.0000.003
kn
MSE c1=1
c1=0.9 c1=0.7 c1=0.5
(e) Fr´echet case, γ1“0.25, γ2“0.1, γC“0.45
100 300
−0.010.03
kn
Bias c1=1
c1=0.9 c1=0.7 c1=0.5
100 300
0.0000.003
kn
MSE c1=1
c1=0.9 c1=0.7 c1=0.5
(f) Same case as (e) but for Burr distribution
Figure 1: Comparison of bias and MSE (respectively left and right in each subfigure) ofpγn,kfor different values ofc1; in figures paq,pcqandpeq,X andCare Fr´echet distributed, but in figurespbq,pdqandpfqthey are Burr distributed.
100 300
−0.010.03
kn
Bias
100 300
0.00000.0025
kn
MSE
(a) Fr´echet case, γ1“0.1, γ2“0.25, γC“0.3, andc1“1
100 300
−0.010.03
kn
Bias
100 300
0.00000.0025
kn
MSE
(b) Case (a) but withc1“0.9
100 300
−0.010.03
kn
Bias
100 300
0.00000.0025
kn
MSE
(c) Fr´echet case, γ1“0.1, γ2“0.25, γC“0.2, andc1“1
100 300
−0.010.03
kn
Bias
100 300
0.00000.0025
kn
MSE
(d) Case (c) but withc1“0.9
100 300
−0.010.03
kn
Bias
100 300
0.0000.003
kn
MSE
(e) Fr´echet case, γ1“0.1, γ2“0.25, γC“0.45, andc1“1
100 300
−0.010.03
kn
Bias
100 300
0.0000.003
kn
MSE
(f) Case (e) but withc1“0.9
Figure 2: Comparison of bias and MSE (respectively left and right in each subfigure) forγpn,k (plain thick), ˆγpBDF Gq1 (plain thin) and ˆγ1pKMq (dashed), for Fr´echet distributedXandC.
where, in Equation p5q, ˆ p
1“
k1n
ř
kni“1
δ
pn´i`1qI
Cpn´i`1q“1, and in Equation p6q, the Kaplan Meier estimators F s
n,band G s
n,bare based on the ˜ δ
i“ δ
iI
Ci“1. These two estimators consider the uncensored lifetimes associated to cause 2 as independent censoring times. Comparing our new estimator with these latter two estimators, when c
1ă 1, will empirically prove that considering cause 2 as a competing risk independent of cause 1 has a great (negative) impact on the estimation of γ
1. Note that when c
1“ 1, the new estimator ˆ
γ
1and ˆ γ
1pKMqare exactly the same (therefore the thick and dashed lines in sub-figures (a), (c) and (e) of Figures 2 and 3 are overlapping, identical).
We address these two aims for each set-up (Fr´ echet, or Burr), by generating 2000 datasets of size 500, with three configurations of the triplet pγ
1, γ
2, γ
Cq : p0.1, 0.25, 0.3q (γ
1ă γ
2, moderate censoring γ
Cą γ
F), p0.1, 0.25, 0.2q (γ
1ă γ
2, heavy censoring γ
Că γ
F), or p0.25, 0.1, 0.45q (γ
1ą γ
2, moderate censoring γ
Că γ
F). Median bias and mean squared error (MSE) of the different estimators are plotted against different values of k
n, the number of excesses used. When Burr distributions are simulated, the parameter β is taken equal to 1, and the parameters pτ
1, τ
2, τ
Cq are taken equal to p12, 6, 5q in configurations 1 and 2, and to p6, 12, 5q in configuration 3.
Figure 1 illustrates the behaviour of our estimator when c
1varies. In terms of bias and MSE, we can see that the first configuration is a little better than the second one, which is itself much better than the third one. We observed this phenomenon in many other cases, not reported here : our estimator behaves best when it is the smallest parameter γ
kwhich is estimated, and when the censoring is not too strong. Our simulations also show that the quality of our estimator (especially in terms of the MSE) diminishes with c
1. Figures 2 and 3 present the comparison between our new estimator and the ones described in (5) and (6). A general conclusion (confirmed by other simulations not reported here) is that ˆ γ
1pBDF Gqand ˆ γ
1pKMq8
100 300
0.000.06
kn
Bias
100 300
0.0000.005
kn
MSE
(a) Burr case, γ1“0.1, γ2“0.25, γC“0.3, andc1“1
100 300
0.000.06
kn
Bias
100 300
0.0000.005
kn
MSE
(b) Case (a) but withc1“0.9
100 300
0.000.06
kn
Bias
100 300
0.0000.005
kn
MSE
(c) Burr case, γ1“0.1, γ2“0.25, γC“0.2, andc1“1
100 300
0.000.06
kn
Bias
100 300
0.0000.005
kn
MSE
(d) Case (c) but withc1“0.9
100 300
−0.040.04
kn
Bias
100 300
0.0000.005
kn
MSE
(e) Burr case, γ1“0.1, γ2“0.25, γC“0.45, andc1“1
100 300
−0.040.04
kn
Bias
100 300
0.0000.005
kn
MSE
(f) Case (e) but withc1“0.9
Figure 3: Comparison of bias and MSE (respectively left and right in each subfigure) forγpn,k (plain thick), ˆγpBDF Gq1 (plain thin) and ˆγ1pKMq (dashed), for Burr distributedX andC.
behave worse in most cases, even for a value of c
1of 0.9, which is only a slight modification of the situation without competing risk (c
1“ 1). Therefore, a contamination of the cause 1 distribution by another cause rapidly yield inadequate estimations of γ
1if dependency between causes is ignored ; this conclusion is true for both ˆ γ
1pBDF Gqand ˆ γ
1pKMq, but to a greater extent for ˆ γ
1pBDF Gq. In the third configuration pγ
1, γ
2, γ
Cq “ p0.25, 0.1, 0.45q, the improvement provided by ˆ γ
1(with respect to ˆ γ
pKMq1) becomes notable when c
1drops below 0.7.
4. Conclusion
In this paper, we consider heavy tailed lifetime data subject to random censoring and competing risks, and use the Aalen-Johansen estimator of the cumulative incidence function to construct an estimator for the extreme value index associated to the main cause of interest. To the best of our knowledge, this is the first estimator proposed in this context. Its asymptotic normality is proved and a small simulation study exhibiting its finite-sample performance shows that accounting for the dependency of the different causes is important, but that the bias can be particularly high. Estimating second order tail parameters would then be interesting in order to reduce this bias. A first step towards this aim could be to study the following moments
M
npαq“ 1 n F s
npkqpt
nq
n
ÿ
i“1
log
αpZ
i{t
nq
G s
npZ
i´q I
ξi“kI
Ziątn,
which asymptotic behaviour can be derived following the same lines as in the proof of Theorem 1.
5. Proofs
This section is essentially devoted to the proof of the main Theorem 1. Some hints about the proof of the consistency result contained in Proposition 1 are given in Subsection 5.3, and Corollary 1 is proved in Subsection 5.4.
We adopt a strategy developed by Stute in Stute (1995) in order to prove his Theorem 1.1, a well- known result which states that a Kaplan-Meier integral of the form ş
φ dF
ncan be approximated by a sum of independent terms. This idea is used in Suzukawa (2002) in the context of competing risks. We thus intend to approximate p γ
n,kby the integral r γ
n,k“ ş
φ
ndF
npkqof some deterministic function φ
n, with respect to the Aalen-Johansen estimator, and approximate this integral by the mean q γ
n,kof independent variables U
i,n(defined a few lines below). The passage from p γ
n,kto r γ
n,k(which amounts to replacing F s
npkqpt
nq by F s
pkqpt
nq in the denominator of p γ
n,k) will imply an additional sum of independent variables V
i,n, which will participate to the asymptotic variance of our estimator.
However, a major difference with Stute (1995) or Suzukawa (2002) is that the function we integrate here, φ
npuq “
1Fspkqptnq
logpu{t
nq I
uątn, is not only an unbounded function, depending on n, but it also has a ”sliding” support rt
n, `8r, which is therefore always close to the endpoint `8 of the distribution H . In Stute (1995), a crucial point of the proof consists in temporarily considering that the integrated function φ has a support which is bounded away from the endpoint of H (condition (2.3) there). Considering the kind of function φ
nwe have to deal with here, we cannot follow the same strategy : dealing with the remainder terms will thus be a particularly challenging part of our work. Finally note that, in order to deal with the ratio F s
npkqpt
nq{ F s
pkqpt
nq (and somehow approximate γ p
n,kby r γ
n,k) we will have to consider simultaneously integrals (with respect to F
npkq) of φ
nand of another function g
n, defined below, which basically shares the same flaws as φ
n.
10
Let us first recall or define the following objects : φ p
npuq “ 1
F s
npkqpt
nq log
ˆ u t
n˙ I
uątnφ
npuq “ 1 F s
pkqpt
nq log
ˆ u t
n˙ I
uątnγ
n,k“ ż
φ
npuqdF
pkqpuq
nÑ8ÝÑ γ
kr γ
n,k“
ż
φ
npuqdF
npkqpuq p γ
n,k“
ż
φ p
npuqdF
npkqpuq.
We thus have p γ
n,k“ ∆
´1nr γ
n,k, where
∆
n“ F s
npkqpt
nq{ F s
pkqpt
nq “ ż
g
npuqdF
npkqpuq and g
npuq “ 1
F s
pkqpt
nq I
uątn,
and we now introduce the following new quantities, related to the Stute-like decomposition of r γ
n,kand ∆
n: U
i,np1q“ φ
npZ
iq
GpZ s
iq δ
iI
Ci“kand V
i,np1q“ g
npZ
iq GpZ s
iq δ
iI
Ci“kU
i,np2q“ 1 ´ δ
iHpZ s
iq ψpφ
n, Z
iq and V
i,np2q“ 1 ´ δ
iH s pZ
iq ψpg
n, Z
iq U
i,np3q“
ż
Zi 0ψpφ
n, uq dC puq and V
i,np3q“ ż
Zi0
ψpg
n, uq dCpuq
U
i,n“ U
i,np1q` U
i,np2q´ U
i,np3qand V
i,n“ V
i,np1q` V
i,np2q´ V
i,np3qwhere, for any function f : R
`Ñ R , we note (for any given z ě 0)
ψpf, zq “ ż
`8z
f ptqdF
pkqptq and Cpzq “ ż
z0
dGptq Hptq s Gptq s . This enables us to finally define the important objects
q γ
n,k“ 1 n
n
ÿ
i“1
U
i,nand ∆ p
n“ 1 n
n
ÿ
i“1
V
i,n(7)
which are the triangular sums of independent terms which will respectively approximate r γ
n,kand ∆
n. At the beginning of section 5.1, it will be proved that EpU
i,np1qq “ γ
n,kand EpV
i,np1qq “ 1, while EpU
i,np2qq “ EpU
i,np3qq and E pV
i,np2qq “ E pV
i,np3qq, yielding E p q γ
n,kq “ γ
n,kand E p ∆ p
nq “ 1 ; the terms U
i,np2q, U
i,np3q, V
i,np2qand V
i,np3qonly participate to the variance component of the estimator. The relation between all these quantities is made clearer in the following Lemma :
Lemma 1. We have
? v
np p γ
n,k´ γ
kq “ ∆
´1np Z
n` ?
v
nR
n` ?
v
npγ
n,k´ γ
kq q (8) where
Z
n“ ? v
n´
p γ q
n,k´ γ
n,kq ´ γ
kp ∆ p
n´ 1q ¯
and R
n“ p γ r
n,k´ q γ
n,kq ´ γ
kp∆
n´ ∆ p
nq The proof of Lemma 1 is simple :
? v
np p γ
n,k´ γ
kq “ ∆
´1n?
v
np r γ
n,k´ ∆
nγ
kq
“ ∆
´1n?
v
ntp q γ
n,k´ γ
n,kq ` γ
kp1 ´ ∆
nq ` p γ r
n,k´ q γ
n,kq ` pγ
n,k´ γ
kqu which leads to the desired relation (8).
The main theorem thus becomes an immediate consequence of the following four results, the second one
being the most difficult to establish.
Proposition 2. Under condition p1q and assuming that v
nnÑ8
ÝÑ `8, (9)
if γ
kă γ
C, then
Z
nÝÑ
dN p0, σ
2q as n Ñ 8 where σ
2is defined in the statement of Theorem 1.
Proposition 3. Under conditions p1q and p2q , if γ
kă γ
C, then
r γ
n,k“ q γ
n,k` R
n,φand ∆
n“ ∆ p
n` R
n,g(10) where R
n,φ, R
n,g(and consequently R
ntoo) are o
Ppv
´1{2nq.
Corollary 2. Under the conditions of Proposition 3, ?
v
np∆
n´1q converges in distribution to N p0, 1{p1´rqq where r “ γ
k{γ
CPs0, 1r.
Lemma 2. Under conditions p1q, p3q and ?
v
ngpt
nq Ñ λ ě 0, the bias term ?
v
npγ
n,k´ γ
kq in (8) converges to λm as n Ñ 8, where m is defined in Theorem 1.
Propositions 2 and 3 will be proved in Sections 5.1 and 5.2 respectively, sometimes with the help of other results stated and established in the Appendix. The proofs of Corollary 2 and Lemma 2 are short, we state them below.
Concerning Corollary 2, once the proof of Proposition 2 has been gone through, it will become clear to the reader that ?
v
np ∆ ˆ
n´ 1q converges in distribution to the centred gaussian distribution of variance 1{p1 ´ rq, because ?
v
np ∆ ˆ
n´ 1q “ ř
ni“1
W ˜
i,nwhere ˜ W
i,n“
?vn n
V ˜
i,n“
?vn
n
pV
i,n´ 1q are centred, and V arp W ˜
i,nq “
n11´r1` op1{nq (this is proved similarly as (11) and (15)). Since Proposition 3 states that
∆
n“ ∆ ˆ
n` o
Ppv
1{2nq, the same central limit theorem holds for ∆
nand the corollary is proved.
Concerning now Lemma 2, remind that γ
n,k“ ş
φ
npuqdF
npkqpuq. An integration by parts and the fact that F s
pkqpxq “ x
´1{γkl
kpxq yield
? v
npγ
n,k´ γ
kq “ ? v
nż
`8 1y
´1{γk´1ˆ l
kpyt
nq l
kpt
nq ´ 1
˙ dy,
and, using assumption p3q and Proposition 3.1 in de Haan and Ferreira (2006), we can write ż
`81
y
´1{γk´1ˆ l
kpyt
nq l
kpt
nq ´ 1
˙
dy “ gpt
nq ż
`81
y
´1{γk´1h
ρkpyqdy ` opgpt
nqq.
The result then follows from assumption ?
v
ngpt
nq Ñ λ ě 0 and the fact that ş
`81
y
´1{γk´1h
ρkpyqdy “ m.
In the rest of the paper, we will very often handle the well-known sub-distributions functions H
p0qand H
p1,kqdefined, for all t ě 0, by
H
p0qptq “ P pZ ď t, δ “ 0q and H
p1,kqptq “ P pZ ď t, ξ “ kq.
Note that we have
dH
p0q“ F dG s and dH
p1,kq“ GdF s
pkq. 5.1. Proof of Proposition 2
We first write Z
n“
n
ÿ
i“1
W
i,nwhere W
i,n“
? v
nn
´ U ˜
i,n´ γ
kV ˜
i,n¯ and
" U ˜
i,n“ U
i,n´ γ
n,kV ˜
i,n“ V
i,n´ 1
where W
i,n, ˜ U
i,nand ˜ V
i,nare centred, because the random variables U
i,nand V
i,nhave expectations respec- tively equal to γ
n,kand 1. Indeed, we have
E
´ U
i,np1q¯
“ E
´
φnpZiq GpZs iq
δ
iI
Ci“k¯
“ ş
φnpuqGpuqs
dH
p1,kqpuq “ ş
φ
npuqdF
pkqpuq “ γ
n,kand E
´ U
i,np2q¯
“ E
´
1´δiHpZĎ iq
ψpφ
n, Z
iq
¯
“ ş ş
1ĎHpuq
I
tąuφ
nptq dF
pkqptq dH
p0qpuq “ ş ş
1Gpuqs
I
tąuφ
nptq dF
pkqptq dGpuq
12
as well as E
´ U
i,np3q¯
“ E
´ ş
Zi0
ψpφ
n, uq dC puq
¯
“ ş ş ş
I
ząuI
tąuφ
nptq dH pzq dF
pkqptq
dGpuqGpuqs HpuqĎ
“ E
´ U
i,np2q¯ . The proof for E pV
i,nq “ 1 is similar.
We will now prove the asymptotic normality of Z
nby using the Lyapunov criteria.
Lemma 3. Under the conditions p1q and p9q , if γ
kă γ
C: piq we have
V arp U ˜
1,n´ γ
kV ˜
1,nq “ E pU
1,n2q ` γ
k2E pV
1,n2q ´ 2γ
kE pU
1,nV
1,nq ` op1q (11) piiq we have
E pU
1,n2q “
ż φ
2npuq
Gpuq s dF
pkqpuq ´ ż
pψpφ
n, uqq
2dC puq (12) E pV
1,n2q “
ż g
n2puq
Gpuq s dF
pkqpuq ´ ż
pψpg
n, uqq
2dCpuq (13) E pU
1,nV
1,nq “
ż φ
npuqg
npuq
Gpuq s dF
pkqpuq ´ ż
ψpφ
n, uqψpg
n, uq dC puq (14) piiiq we have, noting r “ γ
k{γ
C(which belongs to s0, 1r under our conditions) as well as p “ γ{γ
F“
γ
C{pγ
F` γ
Cq Ps0, 1r,
V arp U ˜
1,n´ γ
kV ˜
1,nq “ 1 F s
npkqpt
nq Gpt s
nq
ˆ γ
2kp1 ` r
2q p1 ´ rq
3` op1q
˙
´ 1 ´ p H s pt
nq
ˆ 2γ
k3γp2 ´ γ
k{γq
3` op1q
˙
` op1q (15) Lemma 4. Under the conditions p1q and p9q , if γ
kă γ
C, then
ř
ni“1
E |W
i,n|
2`δÑ 0, as n tends to infinity, for some δ ą 0.
We can then immediately prove Proposition 2. Indeed, since Z
n“ ř
ni“1
W
i,n, Lemma 3 yields V arpZ
nq “ n V arpW
1,nq “ v
nn V arp U ˜
1,n´ γ
kV ˜
1,nq.
which, since v
n“ n F s
npkqpt
nq Gpt s
nq, becomes V arpZ
nq “ γ
2kp1 ` r
2qp1 ´ rq
´3´ `
2p1 ´ pqγ
k3γ
´1p2 ´ γ
k{γq
´3˘ ´
F s
pkqpt
nq{ F s pt
nq
¯
` op1q.
Therefore, depending on the limit c of the ratio F s
pkqpt
nq{ F s pt
nq when n Ñ 8 (for instance, it converges to 0 when γ
kă γ
F), it is simple to check that the variance of Z
nconverges to the value σ
2described in the statement of Theorem 1. Thanks to Lemma 4, the Lyapunov CLT applies and Proposition 2 is proved.
The two subsections 5.1.1 and 5.1.2 are now respectively devoted to the proofs of Lemmas 3 and 4.
5.1.1. Proof of Lemma 3
Part piq of the lemma is straightforward : since ˜ U
1,nand ˜ V
1,nare centred, we have indeed V arp U ˜
1,n´ γ
kV ˜
1,nq “ Epp U ˜
1,n´ γ
kV ˜
1,nq
2q “ Ep U ˜
1,n2q ` γ
k2Ep V ˜
1,n2q ´ 2γ
kEp U ˜
1,nV ˜
1,nq
“ p E pU
1,n2q ´ γ
n,k2q ` γ
k2p E pV
1,n2q ´ 1q ´ 2γ
kp E pU
1,nV
1,nq ´ γ
n,kq and the result comes by using the fact that γ
n,kconverges to γ
kas n Ñ 8.
Now we proceed to the proof of part piiq, and will only prove (12) because, by definition of φ
nand g
n, the proofs for (13) and (14) will be completely similar. First of all, we obviously have
pU
1,nq
2“ pU
1,np1qq
2` pU
1,np2qq
2` pU
1,np3qq
2` 2U
1,np1qU
1,np2q´ 2U
1,np1qU
1,np3q´ 2U
1,np2qU
1,np3q(16) The first term in the right-hand side of (12) is equal to E ppU
1,np1qq
2q, and the second one (without the minus sign) is equal to E ppU
1,np2qq
2q and to E pU
1,np1qU
1,np3qq because
E ppU
1,np2qq
2q “ E
ˆ 1 ´ δ
1H
2pZ pψpφ
n, Z
1˙
“ ż 1
H
2pzq pψpφ
n, zqq
2dH
p0qpzq “ ż
pψpφ
n, zqq
2dCpzq
and
EpU
1,np1qU
1,np3qq “
ż φ
npzq Gpzq s
ˆż
z 0ψpφ
n, uqdC puq
˙
dH
p1,kqpzq
“ ż
ψpφ
n, uq ˆż
8u
φ
npzqdF
pkqpzq
˙
dC puq “ ż
pψpφ
n, zqq
2dCpzq
The expectation E pU
1,np1qU
1,np2qq equals 0 because δ
1p1 ´ δ
1q is constantly 0, and we are now going to prove that E ppU
1,np3qq
2q “ 2 E pU
1,np2qU
1,np3qq, which ends the proof of (12) in view of (16). Indeed, noting hpzq “ ş
z0
ψpφ
n, uqdCpuq and using the simple fact that hpzq “ hpyq ` ş
zy
ψpφ
n, uqdCpuq for every y ă z, we have E ppU
1,np3qq
2q “
ż
8 0ˆż
z 0ψpφ
n, yqhpzqdCpyq
˙ dH pzq
“ ż
80
ˆż
z 0ψpφ
n, yqhpyqdC pyq
˙
dHpzq ` ż
80
"ż
z 0ˆż
z yψpφ
n, uqdC puq
˙
ψpφ
n, yqdCpyq
* dHpzq
“ ż
80
ˆż
z 0ψpφ
n, yqhpyqdC pyq
˙
dHpzq ` ż
80
"ż
z 0ˆż
u 0ψpφ
n, yqdCpyq
˙
ψpφ
n, uqdC puq
* dHpzq
“ 2 ż
80
ˆż
z 0ψpφ
n, yqhpyqdC pyq
˙ dHpzq
and
E pU
1,np2qU
1,np3qq “ E
ˆ 1 ´ δ
1H s pZ
1q ψpφ
n, Z
1q hpZ
1q
˙
“ ż
Hpyqhpyqψpφ s
n, yq dGpyq H s pyq Gpyq s
“ ż
80
ˆż
8 ydHpzq
˙
hpyqψpφ
n, yq dC pyq
“ ż
80
ˆż
z 0ψpφ
n, yqhpzqdC pyq
˙
dHpzq “ 1
2 E ppU
1,np3qq
2q as announced.
We can now start proving part piiiq of the lemma, in which the exact nature of the function φ
nmatters.
First remind that functions F s
pkq, G s and C are regularly varying of respective orders ´1{γ
k, ´1{γ
Cand 1{γ (for C, this is proved in Lemma 8 with δ “ 1). Let us define the constants c
jand d
j(j “ 0, 1, 2) by
c
j“ j!γ
kjp1 ´ qq
j`1and d
j“ j!γ
kj`1γp2 ´ γ
k{γq
j`1.
Since γ
kă γ
Cwas assumed, then according to Lemma 7 part piiq (applied first with a ` b “ 1{γ
C´ 1{γ
kă 0 for c
j, and then with a ` b “ ´2{γ
k` 1{γ “ p1{γ
C´ 1{γ
kq ` p1{γ
F´ 1{γ
kq ă 0 for d
j) , we have
ż
8 tnlog
jˆ u
t
n˙ dF
pkqpuq Gpuq s „ c
jF s
pkqpt
nq Gpt s
nq and
ż
8 tnlog
jˆ u
t
n˙ ˆ
F s
pkqpuq F s
pkqpt
nq
˙
2dC puq
Cpt
nq
nÑ8
ÝÑ d
j. (17) Hence, by definition of φ
n, g
n, the first terms of EpU
1,n2q, EpV
1,n2q and EpU
1,nV
1,nq in relations (12), (13) and (14) are respectively equivalent (as n Ñ 8) to c
2Dpt
nq, c
0Dpt
nq and c
1Dpt
nq where Dpt
nq denotes
Dpt
nq “ 1 F s
pkqpt
nq Gpt s
nq .
Since c
2` γ
2kc
0´ 2γ
kc
1is found to be equal to γ
k2p1 ` r
2q{p1 ´ rq
3, then in view of (11) this proves the first term in relation (15). We now need to obtain equivalent expressions for the quantities ş
pψpφ
n, uqq
2dC puq, ş pψpg
n, uqq
2dC puq and ş
ψpφ
n, uqψpg
n, uq dC puq in order to prove the second part of relation (15) and there- fore finish the proof of Lemma 3.
For saving space, we will use temporarily the following notations :
l
npuq “ log pu{t
nq , R
npuq “ F s
pkqpuq{ F s
pkqpt
nq
According to the technical Lemma 9 of the Appendix and, after splitting the integral into ş
`80
and ş
`8 tn, we
14
can write ż
pψpφ
n, uqq
2dCpuq “ γ
n,k2ż
tn0
dC puq
` ż
`8tn
´
l
2npuqR
2npuq `γ
k2pu{t
nq
´2{γk`2γ
kl
npuqR
npuq pu{t
nq
´1{γk¯
dC puq ` opCpt
nqq, (18) where opCpt
nqq in p18q is due to part piiq of Lemma 7 and to the fact that
npuq in Lemma 9 converges to 0 uniformly in u. According to the second part of relation p17q, we thus have
ż
pψpφ
n, uqq
2dCpuq “ pγ
k2` d
2` γ
2kd
0` 2γ
kd
1qCpt
nq ` opCpt
nqq (19) The other terms are treated similarly (using the fact that ψpg
n, uq “ 1 when u ď t
n, and “ F s
pkqpuq{ F s
pkqpt
nq when u ą t
n) and we obtain
ż
pψpg
n, uqq
2dC puq “ p1 ` d
0q Cpt
nq ` opCpt
nqq, (20) ż
ψpφ
n, uqψpg
n, uq dC puq “ pγ
k` d
1` γ
kd
0q Cpt
nq ` opCpt
nqq. (21) In view of (11), combining p19q, p20q and p21q and using Remark 3 (following Lemma 8) to write that Cpt
nq „ p1 ´ pq{ H s pt
nq (as n Ñ 8), this proves the second term in relation (15).
5.1.2. Proof of Lemma 4
We have to prove that, for some δ ą 0 small enough, n E |W
1,n|
2`δtends to 0, as n Ñ 8. In the sequel, cst denotes an unspecified absolute positive constant. According to the definition of W
1,n, it is clear that
n |W
1,n|
2`δď cst v
n1`δ{2n
1`δ˜
3ÿ
j“1
ˇ ˇ ˇ U
1,npjqˇ ˇ ˇ
2`δ
` γ
k2`δ3
ÿ
j“1
ˇ ˇ ˇ V
1,npjqˇ ˇ ˇ
2`δ
` |γ
n,k´ γ
k|
2`δ¸
First, we clearly have n
´1´δv
n1`δ{2|γ
n,k´ γ
k|
2`δÝÑ 0 as n Ñ 8. Secondly, since V
1,npjqhas the same form as U
1,npjq, with g
ninstead of φ
n(i.e. without the log factor), we will only prove that there exists some δ ą 0 such that, as n Ñ 8,
n
´1´δv
n1`δ{2E ˇ ˇ ˇ U
1,npjqˇ ˇ ˇ
2`δ
“ n
´δ{2´
F s
pkqpt
nq Gpt ¯
nq
¯
1`δ{2E ˇ ˇ ˇ U
1,npjqˇ ˇ ˇ
2`δ nÑ8
ÝÑ 0 for j P t1, 2, 3u (22) For j “ 1, we have
E ˇ ˇ ˇ U
1,np1qˇ ˇ ˇ
2`δ
“ ż
`80
ˇ ˇ ˇ ˇ
φ
npzq Gpzq ¯ ˇ ˇ ˇ ˇ
2`δ
dH
p1,kqpzq “ p F s
pkqpt
nż
`8tn
p Gpzqq s
´2´δplogpz{t
nGpzqdF ¯
pkqpzq
“
´
F s
pkqpt
nq Gpt ¯
nq
¯
´1´δż
`8 tnplogpz{t
nˆ Gpt ¯
nq Gpzq ¯
˙
1`δdF
pkqpzq F s
pkqpt
nq .
Applying part piiq of Lemma 7 for α “ 2 ` δ, a “ p1 ` δq{γ
Cand b “ ´1{γ
k(with δ sufficiently small so that a ` b “ 1{γ
C´ 1{γ
k` δ{γ
Cis kept ă 0), and using the fact that v
n“ n F s
pkqpt
nq Gpt s
nq Ñ 8, this ends the proof of (22) for j “ 1.
For j “ 2, we have
E ˇ ˇ ˇ U
1,np2qˇ ˇ ˇ
2`δ
“ ż
`80
ˇ ˇ ˇ ˇ
ψpφ
n, zq Hpzq ¯
ˇ ˇ ˇ ˇ
2`δ
FpzqdGpzq. s
By definition of ψ, φ
n, and γ
n,k, we have ψpφ
n, zq “ γ
n,kwhen z ď t
n. Therefore, splitting the integral above into two integrals ş
tn0
and ş
`8tn
we obtain E
ˇ ˇ ˇ U
1,np2qˇ ˇ ˇ
2`δ
“ I
1pt
nq ` I
2pt
nq, where, on one hand,
I
1pt
nq “ pγ
n,kq
2`δż
tn0
F ¯ pzqdGpyq
p Hpzqq ¯
2`δ“ pγ
n,kq
2`δż
tn0