
MOHAMMAD HOSSEIN DEHGHAN

NONPARAMETRIC METHODS FOR THE ESTIMATION OF THE CONDITIONAL DISTRIBUTION OF AN INTERVAL-CENSORED LIFETIME GIVEN CONTINUOUS COVARIATES

Thesis presented to the Faculté des études supérieures et postdoctorales of Université Laval as part of the doctoral program in mathematics, for the degree of Philosophiae Doctor (Ph.D.)

Département de mathématiques et de statistique
Faculté des sciences et de génie
Université Laval, Québec

2012


Résumé

This thesis contributes to the development of nonparametric estimation of the conditional survival function given a continuous covariate with censored data. It is based on three papers written with my thesis supervisor, Professor Thierry Duchesne.

The first paper, titled "A generalization of Turnbull's estimator for nonparametric estimation of the conditional survival function with interval-censored data," was published in 2011 in Lifetime Data Analysis, vol. 17, pp. 234-255.

The second paper, titled "On the performance of some non-parametric estimators of the conditional survival function with interval-censored data," appeared in 2011 in Computational Statistics & Data Analysis, vol. 55, pp. 3355-3364.

The third paper, titled "Estimation of the conditional survival function of a failure time given a time-varying covariate with interval-censored observations," will soon be submitted to Statistica Sinica.


Abstract

This thesis contributes to the development of nonparametric estimation of the conditional survival function given a continuous time-fixed or time-varying covariate with censored data. It is based on three papers written with my Ph.D. supervisor, Professor Thierry Duchesne.

The first paper, titled "A generalization of Turnbull's estimator for nonparametric estimation of the conditional survival function with interval-censored data," was published in 2011 in Lifetime Data Analysis, vol. 17, pp. 234-255.

The second paper, titled "On the performance of some non-parametric estimators of the conditional survival function with interval-censored data," was published in 2011 in Computational Statistics & Data Analysis, vol. 55, pp. 3355-3364.

The third paper, titled "Estimation of the conditional survival function of a failure time given a time-varying covariate with interval-censored observations," will be submitted for publication in Statistica Sinica.


Avant-propos

First of all, I thank God Almighty for granting me His infinite goodness, as well as the courage, strength and patience to carry out this humble work.

During my doctoral (Ph.D.) studies at Université Laval, kind people helped me in many ways, and it would be impossible for me to thank each of them. However, I must take the time to express my gratitude to the main contributors to this work. First of all, I sincerely thank my great thesis supervisor, Professor Thierry Duchesne, for agreeing to direct my work. I am grateful to him for the confidence he placed in me and for the support he gave me. His encouragement, his valuable advice and his availability allowed me to broaden my knowledge and to advance my research considerably.

I am also grateful to the other members of my thesis defence committee: the chair of the statistics group, Mr. Louis-Paul Rivest, Mr. Lajmi Lahkal Chaieb and the external examiner, Mr. John Braun, for agreeing to evaluate my thesis. I also thank them for their important comments, which helped improve my thesis. I consider myself lucky to have been able to benefit from their expertise in the field.

My gratitude also goes to all the directors of the department, Mr. Frédéric Gourdeau, Ms. Line Baribeau and Mr. Roger Pierre, and to the professors of the department, who gave me the opportunity to sharpen my skills in mathematics and statistics. I especially thank Mr. Thierry Duchesne for his support and for helping me integrate into the student community.

I wish to express my deep gratitude to my beloved family, my dear wife Azam and my dear daughter Mahsa, and to the dear members of our extended families, for the moral support, understanding and encouragement they offered me. Finally, I thank all my close friends and colleagues, thanks to whom these four years of work have been pleasant.

Finally, thanks to the following organizations for their financial support: the Natural Sciences and Engineering Research Council of Canada and the Ministry of Science and Technology of the I.R. of Iran.


Acknowledgement

Before all, I would like to thank our Creator for endowing me with the ability to study the wondrous world in which we all live. I hope that I will continue to have the opportunity to use my given abilities for the betterment of others.

During my doctoral (Ph.D.) studies at Laval University, kind people helped me in numerous ways, and it would be impossible for me to thank all of them. However, I must take the time to express my deepest gratitude to the main contributors to this work. First and foremost I wish to thank my great supervisor, Professor Thierry Duchesne, for his guidance during these years. Not only has he carefully advised and taught me in courses and research, he also supported me financially. In addition, he has taught me, through his kind and principled conduct, lessons that will be the best wealth for my future academic life, for which I am deeply grateful.

I am also grateful to the other members of my thesis defence committee: the chair of the statistics group, Dr. Louis-Paul Rivest, Dr. Lajmi Lahkal Chaieb and the external examiner, Dr. John Braun, for agreeing to evaluate my thesis. I also thank them for their important comments, which helped improve my thesis. I consider myself lucky to have been able to benefit from their expertise.

I would like to express my thanks and acknowledge the support of the following individuals and organizations:

1. The Department of Mathematics and Statistics (Frédéric Gourdeau, Line Baribeau, Roger Pierre and all the professors of the department), for providing me with the opportunity to attend graduate school at Laval University.

2. The Natural Sciences and Engineering Research Council of Canada and the Ministry of Science and Technology of I.R. Iran, for their financial support.

3. Computing services administrator Michel Lapointe, who greatly helped me adapt to and learn the programming environment.

4. The employees of the Mathematics and Statistics Department: Sylvie Drolet, Suzanne Talbot, Sylvie Lambert and Caroline Garneau.



Lastly but most importantly, I would like to express my gratitude and respect to my family, my dear wife Azam and my daughter Mahsa, and to the dear members of our large family, for their moral support, understanding and encouragement. I also thank all my nice friends and partners.


In the name of Allah, the Beneficent, the Merciful.

Nun. I swear by the pen and what the angels write. By the grace of your Lord you are not mad. And most surely you shall have a reward never to be cut off. And most surely you conform (yourself) to sublime morality. So you shall see, and they (too) shall see, which of you is afflicted with madness.


Contents

Résumé ii
Abstract iii
Avant-propos iv
Acknowledgement vi

1 Introduction 1

2 Preliminary 4
  2.1 Censored data 4
  2.2 Kernel smoothing and Nadaraya-Watson estimator 6
  2.3 Generalized Kaplan-Meier estimator 7
  2.4 Imputation methods
  2.5 Turnbull estimator and self-consistent algorithm 9
  2.6 Measures of performance of the estimators 10

3 A generalization of Turnbull's estimator for nonparametric estimation of the conditional survival function with interval-censored data 14
  3.1 Introduction 16
  3.2 Generalization of Turnbull's estimator 17
    3.2.1 Proposed estimator 18
    3.2.2 Special cases 20
    3.2.3 Behavior of the algorithm 20
  3.3 Pointwise variance estimation 23
  3.4 Bandwidth selection 24
  3.5 Simulation study 26
    3.5.1 Integrated bias and mean squared error of the estimator 28
    3.5.2 Pointwise bias and variance of the estimator, bootstrap variance estimation 30
    3.5.3 Bandwidth selection 31
  3.6 Example: HIV study 37
  3.7 Conclusion 39

4 On the performance of some non-parametric estimators of the conditional survival function with interval-censored data 43
  4.1 Introduction 45
  4.2 Notation, model assumptions and estimators 46
    4.2.1 Notation and assumptions 46
    4.2.2 Nonparametric estimators considered 47
  4.3 Simulation study 50
    4.3.1 Simulation study design 50
    4.3.2 Results 51
    4.3.3 Practical guideline 55
  4.4 Discussion 55
  4.5 Acknowledgements 57

5 Estimation of the conditional survival function of a failure time given a time-varying covariate with interval-censored observations 62
  5.1 Introduction 64
  5.2 Assumptions and method 65
  5.3 Properties of the estimator 69
    5.3.1 Theoretical properties 69
    5.3.2 Finite sample properties 70
  5.4 Real data applications 74
    5.4.1 Pine weevil data 74
    5.4.2 Reliability of electrical equipment 80
  5.5 Discussion 81

6 Conclusion 85

Bibliographie 90

A Appendix of Chapter 3 91
  A.1 Proof that (3.3) simplifies to Beran's (1981) estimator 91
  A.2 Proof of Lemma 1 93

B Supplementary material for Chapter 4 94
  B.1 IMSE vs p with pure interval-censored data 94
  B.2 Mixed case censoring 96
  B.3 IMSE vs p with hybrid-censored data 96

C.1 Appendix of Chapter 5 116
  C.1.1 σ_N → 0 and N σ_N → ∞ when N → ∞ 116
  C.1.2 Proof of Theorem 1 116
C.2 Supplement of Chapter 5 118
  C.2.1 Derivation of Var(Ŝ(t|β₀)) in Theorem 1 118
  C.2.2 Additional simulation results 120

D Computer Code 123
  D.1 R and C code implementing the estimator of Chapter 3 123
  D.2 R code to calculate the generalized Turnbull estimator 125
  D.3 R code for the bootstrap method in Chapter 3 127
  D.4 R code for cross-validation in Chapter 3 129


List of Tables

3.1 Integrated variance (IV), integrated absolute bias (IAB) and integrated mean squared error (IMSE) of the estimator for various choices of the kernel function when the intervals are small or big. The bandwidth h is the optimal bandwidth for each kernel function. 30

3.2 Monte Carlo ("true") variances vs average bootstrap variance estimates obtained with versions A and B when n = 50, p = 4 and z₀ = 15. Values of t are approximately the 10th, 25th, 50th, 75th and 90th percentiles of the distribution of T | Z = 15. Bandwidth obtained by D-fold cross-validation. MSE_opt is the MSE of the estimator obtained with the bandwidth that minimizes the IMSE. 31

3.3 Monte Carlo ("true") variances vs average bootstrap variance estimates obtained with versions A and B when n = 50, p = 0.25 and z₀ = 15. Values of t are approximately the 10th, 25th, 50th, 75th and 90th percentiles of the distribution of T | Z = 15. Bandwidth obtained by D-fold cross-validation. MSE_opt is the MSE of the estimator obtained with the bandwidth that minimizes the IMSE.

3.4 Monte Carlo ("true") variances vs average bootstrap variance estimates obtained with versions A and B when n = 50, p = 4 and z₀ = 15. Values of t are approximately the 10th, 25th, 50th, 75th and 90th percentiles of the distribution of T | Z = 15. Bandwidth obtained with the rule-of-thumb. 33

3.5 Monte Carlo ("true") variances vs average bootstrap variance estimates obtained with versions A and B when n = 50, p = 0.25 and z₀ = 15. Values of t are approximately the 10th, 25th, 50th, 75th and 90th percentiles of the distribution of T | Z = 15. Bandwidth obtained with the rule-of-thumb. 34

3.6 Optimal (h_opt), average cross-validation (h_CV) and average rule-of-thumb (h_ROT) bandwidths, with T ~ Weibull(shape = 3, scale = 1.5z₀) and Z ~ U(5, 25), for various values of n, p and z₀ under the normal kernel.

4.1 Time required to run 6,000 simulations (n = 50) or 5,000 simulations (n = 100) for the four estimators considered with purely interval-censored data.

4.2 Time required to run 6,000 simulations (n = 50) or 5,000 simulations (n = 100) for the four estimators considered under mixed case censoring. The times are in hours and averaged over all 22 values of p. 52

4.3 Average IMSE over p, the three values z₀ = 8, 15, 22 and 6,000 simulations of the estimators, for various conditional failure time and covariate distributions under interval-censoring and n = 50. 53

4.4 Average IMSE over p, the three values z₀ = 8, 15, 22 and 6,000 simulations of the estimators, for various conditional failure time and covariate distributions under mixed case censoring and n = 50. 53

4.5 Decrease in IMSE when going from n = 50 to n = 100, averaged over p, z₀ and the distribution of Z, when T | Z follows the Weibull distribution and the samples are purely interval-censored or mixed case censored. 54

5.1 IMSE of the nonparametric estimator of S(t|β₀) for various values of the point of estimation β₀, of the variance σε² of the error terms in the covariate model, of the rate p of the homogeneous Poisson process that generates the interval-censoring times, and of the sample size n under the pure interval-censoring scheme. Each IMSE value reported is calculated from 5,000 replications. GT refers to estimation with Ŝ_GTL, while NW refers to estimation with Ŝ_NW and midpoint imputation of the interval-censored times. 72

5.2 IMSE of the nonparametric estimator of S(t|β₀) for various values of the point of estimation β₀, of the variance σε² of the error terms in the covariate model, of the rate p of the homogeneous Poisson process that generates the interval-censoring times, and of the sample size n under the hybrid censoring scheme. Each IMSE value reported is calculated from 5,000 replications. GT refers to estimation with Ŝ_GTL, while GKM refers to estimation with Ŝ_GKM and midpoint imputation of the interval-censored times. 73

5.3 IMSE of the nonparametric estimator of S(t|β₀) for various values of A in Z(t_ij) = β_i t_ij + A log(t_ij) + ε_ij, three values of p, two values of n, and σε² = 1 under pure interval-censoring. Each IMSE value is calculated from 5,000 replications. GT refers to estimation with Ŝ_GTL, while NW refers to estimation with Ŝ_NW and midpoint imputation of the interval-censored times. 75

5.4 IMSE of the nonparametric estimator of S(t|β₀) for various values of A in Z(t_ij) = β_i t_ij + A log(t_ij) + ε_ij, three values of p, two values of n, and σε² = 1 under hybrid censoring. Each IMSE value is calculated from 5,000 replications. GT refers to estimation with Ŝ_GTL, while GKM refers to estimation with Ŝ_GKM and midpoint imputation of the interval-censored times.

5.5 Monte Carlo variances of Ŝ(t|β₀ = 1.5) at five values of t corresponding to the 10th, 25th, 50th, 75th and 90th percentiles of the Weibull distribution, together with bootstrap estimates of this variance. Results are given for σε² = 1, Z(t_ij) = β_i t_ij + ε_ij, four values of p, n = 50 and under pure interval-censoring. All values are based on 5,000 replications. S_True denotes the true value of S(t|β₀ = 1.5), S_GTL and S_GTR respectively denote the average values of the Ŝ_GTL and Ŝ_GTR estimators, and VE_GT and VB_GT respectively denote the empirical (Monte Carlo) variance and the average bootstrap variance of the Ŝ_GTL estimator. 77

5.6 Monte Carlo variances of Ŝ(t|β₀ = 1.5) at five values of t corresponding to the 10th, 25th, 50th, 75th and 90th percentiles of the Weibull distribution, together with bootstrap estimates of this variance. Results are given for σε² = 1, Z(t_ij) = β_i t_ij + ε_ij, four values of p, n = 100 and under pure interval-censoring. All values are based on 5,000 replications. S_True denotes the true value of S(t|β₀ = 1.5), S_GTL and S_GTR respectively denote the average values of the Ŝ_GTL and Ŝ_GTR estimators, and VE_GT and VB_GT respectively denote the empirical (Monte Carlo) variance and the average bootstrap variance of the Ŝ_GTL estimator. 78

C.1 IMSE of the nonparametric estimator of S(t|β₀) for various values of A in Z(t_ij) = β_i t_ij + A t_ij + ε_ij, three values of p, two values of n, and σε² = 1 under pure interval-censoring. Each IMSE value is calculated from 5,000 replications. GT refers to estimation with Ŝ_GTL, while NW refers to estimation with Ŝ_NW and midpoint imputation of the interval-censored times. 121

C.2 IMSE of the nonparametric estimator of S(t|β₀) for various values of A in Z(t_ij) = β_i t_ij + A t_ij + ε_ij, three values of p, two values of n, and σε² = 1 under hybrid censoring. Each IMSE value is calculated from 5,000 replications. GT refers to estimation with Ŝ_GTL, while GKM refers to estimation with Ŝ_GKM and midpoint imputation of the interval-censored times.


List of Figures

2.1 Both panels: each line segment represents an observed interval, with the x-axis representing time and the y-axis the item number. Left panel: intervals obtained when the censoring intervals are wide. Right panel: intervals obtained when the censoring intervals are of medium length. 12

2.2 Both panels: each line segment represents an observed interval, with the x-axis representing time and the y-axis the item number. Left panel: intervals obtained when the censoring intervals are of medium length. Right panel: intervals obtained when the censoring intervals are narrow. 12

3.1 Case n = 100, p = 4, h = 4, normal kernel. Black solid lines: true values of the conditional survival function. Dashed lines: estimates of the conditional survival function given Z = 8 (red), Z = 15 (blue) and Z = 22 (green). Top panel: a typical realization of the estimator. Bottom panel: average of the estimator over 100 simulations. 29

3.2 Estimates of S(t|z₀) obtained with the proposed method and the Cox model for two different values of z₀. The solid lines represent the true values of S(t|z₀) used to simulate the dataset. 38

3.3 Left panel: conditional survival function estimate for time to HIV infection (in days) given various values of age (in years) at entry in the study, obtained with different values of the bandwidth parameter. Right panel: check of the proportional hazards assumption.

4.1 Integrated mean squared error (IMSE) of S(t|z₀) for GT, the Cox model and GKM under pure interval-censoring with midpoint and multiple imputation. The Zi's follow a uniform distribution on the interval (5, 25) and the conditional distribution of T | Z is log-normal with location = 1.65 + 0.081z.

4.2 Integrated mean squared error (IMSE) of S(t|z₀) for GT, the Cox model and GKM under pure interval-censoring with midpoint and multiple imputation. The Zi's follow a uniform distribution on the interval (5, 25), the conditional distribution of T | Z is Weibull with shape = 3 and scale = 1.5z, and z₀ is 8. In panel 1, n = 50 and in panel 2, n = 100. 59

4.3 Integrated mean squared error (IMSE) of S(t|z₀) for GT, GKM with midpoint imputation and NW with midpoint and multiple imputation. The Zi's follow a log-normal with location = 2.69 and scale = 0.2, z₀ is 15 and n = 50. The conditional distributions of T | Z are normal with mean = 1.3z and variance = 49 under pure interval-censoring in panel 1, and Weibull with shape = 3 and scale = 1.5z under mixed case censoring in panel 2. 60

5.1 Plot of Ŝ_GTL(t|β) as a function of time (in years) for the pine weevil data for the quartile values of β: the solid line corresponds to β = 1.56, the dashed line to β = 1.81 and the dotted line to β = 2.11. 79

5.2 Plot of log[−log{Ŝ_GTL(t|β)}] as a function of log(time) (in years) for the pine weevil data for the quartile values of β: the solid line corresponds to β = 1.56, the dashed line to β = 1.81 and the dotted line to β = 2.11. 80

5.3 Plot of Ŝ_GTL(t|t > 7, β) as a function of time (in years) for the pine weevil data for the quartile values of β: the solid line corresponds to β = 1.56, the dashed line to β = 1.81 and the dotted line to β = 2.11. 81

5.4 Plot of Ŝ_GTL(t|β) as a function of time for the electrical equipment data for the quartile values of β: the solid line corresponds to β = −1.3, the dashed line to β = −0.3 and the dotted line to β = 0.05.

5.5 Plot of log[−log{Ŝ_GTL(t|β)}] as a function of log(time) for the electrical equipment data for the quartile values of β: the solid line corresponds to β = −1.3, the dashed line to β = −0.3 and the dotted line to β = 0.05.

5.6 Plot of Ŝ_GTL(t|t > 76, β) as a function of time for the electrical equipment data for the quartile values of β: the solid line corresponds to β = −1.3, the dashed line to β = −0.3 and the dotted line to β = 0.05. 83

B.1 Integrated mean squared error of S(t|z₀) as a function of p, the rate of the Poisson process that generates the censoring intervals, for the GT, Cox model and GKM (by midpoint and multiple imputation) estimators. The Zi's follow a normal with mean = 15 and variance = 25, the conditional distribution of T | Z is Weibull with shape = 3 and scale = 1.5z, and z₀ is 8.

B.2 Integrated mean squared error of S(t|z₀) as a function of p, the rate of the Poisson process that generates the censoring intervals, for the GT, Cox model and GKM (by midpoint and multiple imputation) estimators. The Zi's follow a normal with mean = 15 and variance = 25, the conditional distribution of T | Z is Weibull with shape = 3 and scale = 1.5z, and z₀ is 15. In panel 1, n = 50 and in panel 2, n = 100. 97

B.3 Integrated mean squared error of S(t|z₀) as a function of p, the rate of the Poisson process that generates the censoring intervals, for the GT, Cox model and GKM (by midpoint and multiple imputation) estimators. The Zi's follow a normal with mean = 15 and variance = 25, the conditional distribution of T | Z is Weibull with shape = 3 and scale = 1.5z, and z₀ is 22. In panel 1, n = 50 and in panel 2, n = 100. 98

B.4 Integrated mean squared error of S(t|z₀) as a function of p, the rate of the Poisson process that generates the censoring intervals, for the GT and GKM (by midpoint and multiple imputation) estimators. The Zi's follow a uniform distribution on the interval (5, 25), the conditional distribution of T | Z is normal with mean = 1.3z and variance = 49, and z₀ is 15. In panel 1, n = 50 and in panel 2, n = 100. 99

B.5 Integrated mean squared error of S(t|z₀) as a function of p, the rate of the Poisson process that generates the censoring intervals, for the GT, Cox model and GKM (by midpoint and multiple imputation) estimators. The Zi's follow a uniform distribution on the interval (5, 25), the conditional distribution of T | Z is log-normal with location = 1.65 + 0.081z and scale = 0.3, and z₀ is 15. In panel 1, n = 50 and in panel 2, n = 100. 100

B.6 Integrated mean squared error of S(t|z₀) as a function of p, the rate of the Poisson process that generates the censoring intervals, for the GT, Cox model and GKM (by midpoint and multiple imputation) estimators. The Zi's follow a uniform distribution on the interval (5, 25), the conditional distribution of T | Z is Weibull with shape = 3 and scale = 1.5z, and z₀ is 15. In panel 1, n = 50 and in panel 2, n = 100. 101

B.7 Integrated mean squared error of S(t|z₀) as a function of p, the rate of the Poisson process that generates the censoring intervals, for the GT and GKM (by midpoint and multiple imputation) estimators. The Zi's follow a normal distribution with mean = 15 and variance = 25, and the conditional distribution of T | Z is normal with mean = 1.3z and variance = 49.

B.8 Integrated mean squared error of S(t|z₀) as a function of p, the rate of the Poisson process that generates the censoring intervals, for the GT, Cox model and GKM (by midpoint and multiple imputation) estimators. The Zi's follow a normal distribution with mean = 15 and variance = 25, the conditional distribution of T | Z is log-normal with location = 1.65 + 0.081z and scale = 0.3, and z₀ is 15. In panel 1, n = 50 and in panel 2, n = 100. 103

B.9 Integrated mean squared error of S(t|z₀) as a function of p, the rate of the Poisson process that generates the censoring intervals, for the GT, Cox model and GKM (by midpoint and multiple imputation) estimators. The Zi's follow a log-normal with location = 2.69 and scale = 0.2, the conditional distribution of T | Z is Weibull with shape = 3 and scale = 1.5z, and z₀ is 15. In panel 1, n = 50 and in panel 2, n = 100. 104

B.10 Integrated mean squared error of S(t|z₀) as a function of p, the rate of the Poisson process that generates the censoring intervals, for the GT, Cox model and GKM (by midpoint and multiple imputation) estimators. The Zi's follow a log-normal with location = 2.69 and scale = 0.2, the conditional distribution of T | Z is log-normal with location = 1.65 + 0.081z and scale = 0.3, and z₀ is 15. In panel 1, n = 50 and in panel 2, n = 100. 105

B.11 Integrated mean squared error of S(t|z₀) as a function of p, the rate of the Poisson process that generates the censoring intervals, for the GT and GKM (by midpoint and multiple imputation) estimators. The Zi's follow a log-normal with location = 2.69 and scale = 0.2, the conditional distribution of T | Z is normal with mean = 1.3z and variance = 49, and z₀ is 15. In panel 1, n = 50 and in panel 2, n = 100. 106

B.12 Integrated mean squared error of S(t|z₀) as a function of p, the rate of the Poisson process that generates the censoring intervals, for the GT, Cox model and GKM (by midpoint and multiple imputation) estimators. The Zi's follow a log-normal with location = 2.69 and scale = 0.2, the conditional distribution of T | Z is Weibull with shape = 3 and scale = 1.5z, and n = 50. In panel 1, z₀ is 8 and in panel 2, z₀ is 22. 107

B.13 Integrated mean squared error of S(t|z₀) as a function of p, the rate of the Poisson process that generates the censoring intervals, for the GT and GKM (by midpoint and multiple imputation) estimators. The Zi's follow a normal with mean = 15 and variance = 25, the conditional distribution of T | Z is normal with mean = 1.3z and variance = 49, and n = 50.

B.14 Integrated mean squared error of S(t|z₀) as a function of p, the rate of the Poisson process that generates the censoring intervals, for the GT, GKM (by midpoint imputation) and NW using midpoint (resp. multiple) imputation for interval-censored (resp. right-censored) observations. The Zi's follow a log-normal with location = 2.69 and scale = 0.2, the conditional distribution of T | Z is log-normal with location = 1.65 + 0.081z and scale = 0.3, and z₀ is 8. In panel 1, n = 50 and in panel 2, n = 100. 109

B.15 Integrated mean squared error of S(t|z₀) as a function of p, the rate of the Poisson process that generates the censoring intervals, for the GT, GKM (by midpoint imputation) and NW using midpoint (resp. multiple) imputation for interval-censored (resp. right-censored) observations. The Zi's follow a log-normal with location = 2.69 and scale = 0.2, the conditional distribution of T | Z is log-normal with location = 1.65 + 0.081z and scale = 0.3, and z₀ is 15. In panel 1, n = 50 and in panel 2, n = 100. 110

B.16 Integrated mean squared error of S(t|z₀) as a function of p, the rate of the Poisson process that generates the censoring intervals, for the GT, GKM (by midpoint imputation) and NW using midpoint (resp. multiple) imputation for interval-censored (resp. right-censored) observations. The Zi's follow a log-normal with location = 2.69 and scale = 0.2, the conditional distribution of T | Z is Weibull with shape = 3 and scale = 1.5z, and z₀ is 15. In panel 1, n = 50 and in panel 2, n = 100. 111

B.17 Integrated mean squared error of S(t|z₀) as a function of p, the rate of the Poisson process that generates the censoring intervals, for the GT, GKM (by midpoint imputation) and NW using midpoint (resp. multiple) imputation for interval-censored (resp. right-censored) observations. The Zi's follow a normal distribution with mean = 15 and variance = 25, z₀ is 15 and n = 50. The conditional distribution of T | Z is Weibull with shape = 3 and scale = 1.5z in panel 1, and normal with mean = 1.3z and variance = 49 in panel 2. 112

B.18 Integrated mean squared error of S(t|z₀) as a function of p, the rate of the Poisson process that generates the censoring intervals, for the GT, GKM (by midpoint imputation) and NW using midpoint (resp. multiple) imputation for interval-censored (resp. right-censored) observations. The Zi's follow a uniform distribution on the interval (5, 25), and the conditional distribution of T | Z is Weibull with shape = 3 and scale = 1.5z.

B.19 Integrated mean squared error of S(t|z₀) as a function of p, the rate of the Poisson process that generates the censoring intervals, for the GT, GKM (by midpoint imputation) and NW using midpoint (resp. multiple) imputation for interval-censored (resp. right-censored) observations. The Zi's follow a uniform distribution on the interval (5, 25), and the conditional distribution of T | Z is Weibull with shape = 3 and scale = 1.5z. In panel 1, z₀ is 8 and in panel 2, z₀ is 22. 114

B.20 Integrated mean squared error of S(t|z₀) as a function of p, the rate of the Poisson process that generates the censoring intervals, for the GT, GKM (by midpoint imputation) and NW using midpoint (resp. multiple) imputation for interval-censored (resp. right-censored) observations. The Zi's follow a log-normal with location = 2.69 and scale = 0.2, and the conditional distribution of T | Z is normal with mean = 1.3z and variance = 49.


Chapter 1

Introduction

Survival analysis is the name given to a collection of statistical techniques concerned with the modeling of lifetime data. These methods are used to describe, quantify and understand the stochastic behavior of times to events. In survival analysis we use the term 'failure' for the occurrence of the event of interest (even though the event may actually be a 'success', such as recovery from therapy). The term 'survival time' denotes the length of time until failure occurs, usually written T, which is assumed to be a positive random variable. Survival analysis methods have been used in many applied fields, such as medicine, public health, biology, epidemiology, engineering, economics, finance, the social sciences, psychology and demography. The analysis of failure time data usually means addressing one of three problems: the estimation of survival functions, the comparison of treatments or survival functions, and the assessment of covariate effects, that is, of the dependence of the failure time on explanatory variables.

The survival function at time t is defined as

S(t) = Pr(T > t) = ∫ₜ^∞ f(u) du = 1 − F(t),

where f and F are the density and the distribution function of T, respectively; S(t) can be interpreted as the proportion of the population that survives beyond time t. The empirical survival function is a nonparametric estimator of the unconditional survival function for complete data and is given by

Ŝ(t) = (1/n) Σᵢ₌₁ⁿ 1{Tᵢ > t} = 1 − F̂(t).

In this thesis, we are mainly interested in the conditional survival function at time t given a time-fixed covariate z₀:

S(t|z₀) = Pr(T > t | Z = z₀),

where Z is the covariate and z₀ is a fixed value. Not only are the lifetime and its covariate random variables, but the conditional survival function itself is usually unknown and needs to be estimated.
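As a concrete illustration, the empirical survival function defined above can be computed in a few lines. The sketch below is in Python with made-up lifetimes (the code accompanying the thesis, in Appendix D, is in R and C); the data and the function name are purely illustrative:

```python
# Empirical (unconditional) survival function for complete data:
# S_hat(t) = (1/n) * sum_i 1{T_i > t} = 1 - F_hat(t).
# The lifetimes below are invented purely for illustration.
lifetimes = [2.0, 3.5, 1.2, 4.8, 2.9, 6.1, 0.7, 3.3]

def empirical_survival(times, t):
    """Proportion of observed lifetimes strictly greater than t."""
    return sum(1 for ti in times if ti > t) / len(times)

print(empirical_survival(lifetimes, 3.0))  # 4 of the 8 lifetimes exceed 3.0 -> 0.5
```

For censored data this simple proportion is no longer computable, which is precisely what motivates the estimators studied in this thesis.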

There are many reasons why it is difficult to obtain complete data in studies involving survival times. A study often ends before the death of all patients, and we may only know that some patients are still alive at the end of the study, without observing when they actually die. In the presence of censored data the time to event is not observed exactly; all we know is that it occurred before, between or after certain time points. This creates the need for inference methods designed for censored data.

When the failure time is observed completely, there are numerous methods to make nonparametric inference on its conditional distribution. For instance, Nadaraya (1964) and Watson (1964) proposed a nonparametric estimator (NW) of the conditional expectation $\mu(z_0) = E(T \mid Z = z_0)$ as a locally weighted average based on a kernel function. Beran (1981) extended the Kaplan-Meier estimator and proposed a method for nonparametric estimation (the generalized Kaplan-Meier estimator) of the conditional survival function with right-censored data. Turnbull (1976) proposed a nonparametric estimator of the unconditional survival function under interval-censoring.

Our objectives in this thesis are mainly to present simple nonparametric or semiparametric approaches to estimate the conditional survival function given a continuous time-fixed or time-varying covariate with interval-censored data. In Chapter 3, we propose a general nonparametric estimator of the conditional survival function with a continuous covariate under interval-censoring. We show that it generalizes the estimators of Turnbull (1976) and Beran (1981). We show that our algorithm is an EM algorithm, we propose an approach to select the bandwidth and we estimate the pointwise variance of the estimator using the bootstrap method. In Chapter 4 we study the performance of the method proposed in Chapter 3 and compare it to alternatives based on missing data imputation.

In some cases the covariate may not be time-fixed but time-varying. Examples include the lifetime of a person with an illness who periodically visits a clinic; the lifetime of a tree or plant before attack by an insect, given its diameter or height; the lifetime of a vehicle given its accumulated mileage; or the survival of a marriage given the number of children, drug usage, etc. One cannot directly use the


method proposed in Chapter 3 to estimate the conditional survival function. Hence in Chapter 5 we propose a semi-parametric approach to estimate the survival function given a time-varying covariate with interval-censored data.

In Chapter 2, we review some of the basic tools from survival analysis and kernel smoothing that will serve as the building blocks of our methods.


Chapter 2

Preliminary

2.1 Censored data

Survival analysis methods have been developed to analyze survival time data. Indeed, in studies where the variable of interest is the time until a well-defined event, we are usually unable to observe exactly when the event of interest occurs. We therefore have to work with samples of what we call "censored data". Censored data refers to data for which we have only partial information. Mathematically, the information about a survival time can always be written as an interval that includes the real survival time, say $\{(L_i, R_i) : (L_i, R_i) \ni T_i,\ i = 1, \ldots, n\}$. We can categorize the various censoring schemes as follows:

(i) If $0 < L_i < R_i < +\infty$, the datum $T_i$ is said to be interval-censored.

(ii) If $0 = L_i < R_i < +\infty$, the datum $T_i$ is said to be left-censored.

(iii) If $0 \le L_i < R_i = +\infty$, the datum $T_i$ is said to be right-censored.

(iv) If $L_i = T_i = R_i$, the datum $T_i$ is said to be complete.


Chapter 2. Preliminary 5

In this thesis we mainly focus our attention on interval-censored data. Figures 2.1 and 2.2 depict typical samples of interval-censored data. In these figures, each horizontal segment represents a censoring interval $(L_i, R_i)$: we do not know exactly when the $i$th failure occurred, only that it lies within this interval.

Let $T$ represent the time to the event of interest and $Z$ represent the covariate. If $\gamma = 1_{\{T > t\}}$ is an indicator function, then the unconditional survival function is defined as $S(t) = \Pr(T > t) = E[\gamma]$ and the conditional survival function is defined as $S(t|z_0) = \Pr(T > t \mid Z = z_0) = E[\gamma \mid z_0]$.

2.2 Kernel smoothing and the Nadaraya-Watson estimator

Because we will be interested in estimating conditional distributions, we briefly review the method of kernel smoothing. Basically, a kernel smoother is a statistical technique for estimating a real-valued smooth function $f(x)$, $x \in \mathbb{R}^d$, when no parametric model for this function is assumed. The estimate of the function is smooth, and the level of smoothness is controlled by a single parameter in each dimension, the bandwidth, which we will denote by $h$. This technique is most appropriate for low-dimensional ($d \le 3$) data visualization purposes. In effect, the kernel smoother represents a set of irregular data points as a smooth line or surface. A kernel is a symmetric, non-negative, real-valued integrable function $K_h(u) = h^{-1}K(u/h)$, typically satisfying the two following requirements:

1. $\int_{-\infty}^{\infty} K(u)\,du = 1$;

2. $K(-u) = K(u)$ for all values of $u$.

Although several symmetric kernel functions work well, in this thesis we focus on the Gaussian kernel, $K(u) = (2\pi)^{-1/2} \exp(-u^2/2)$, $u \in \mathbb{R}$, and on the case $d = 1$.

Since the nonparametric estimation of the conditional survival function given a covariate is our main goal, we briefly review a classical kernel estimator from this field. Nadaraya (1964) and Watson (1964) proposed a nonparametric estimator (NW) for complete data to estimate the conditional expectation $\mu(x) = E(Y|x)$ as a locally weighted average using a kernel smoothing function. They defined $\hat{\mu}(x) = \sum_{i=1}^{n} Y_i\, \omega_i^h(x)$, where $\omega_i^h(x) = K_h^i(x)/\sum_{j=1}^{n} K_h^j(x)$ and $K_h^i(x) = h^{-1}K\{(X_i - x)/h\}$. The asymptotic bias and variance of $\hat{\mu}(x)$ are well known; see for instance Fan (1996, p. 17).


where $\mu_{jk}(y|x) = \partial^{j+k}\mu(y|x)/(\partial x^j\, \partial y^k)$, and $u_2 = \int_{-\infty}^{\infty} u^2 K(u)\,du$.

The asymptotic variance is

$$\frac{\sigma^2(x)}{nh\, f(x)} \int_{-\infty}^{\infty} K^2(u)\,du + n^{-1} o(h^{-1}). \tag{2.2}$$

The NW estimator can be adapted to estimate the conditional survival function given a covariate, since $E[1_{\{T_i > t\}} | z_0] = S(t|z_0)$. Indeed, under complete data the $T_i$ are observed and hence so are the $\gamma_i = 1_{\{T_i > t\}}$. The estimator is given by

$$\hat{S}_h^{NW}(t|z_0) = \sum_{i=1}^{n} 1_{\{T_i > t\}}\, \omega_i^h(z_0). \tag{2.3}$$
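Equation (2.3) is a kernel-weighted average of the indicators $1_{\{T_i > t\}}$. The following Python sketch implements it with the Gaussian kernel; function names and data are illustrative, not from the thesis:

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density, K(u) = (2*pi)^(-1/2) exp(-u^2/2)."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def nw_conditional_survival(T, Z, t, z0, h):
    """Nadaraya-Watson estimate of S(t|z0) from complete data, eq. (2.3):
    a kernel-weighted average of the indicators 1{T_i > t}."""
    T, Z = np.asarray(T, float), np.asarray(Z, float)
    w = gaussian_kernel((Z - z0) / h)
    w = w / w.sum()                  # normalized weights omega_i^h(z0)
    return float(np.sum(w * (T > t)))
```

When all covariate values are equal, the weights are equal and the estimator reduces to the empirical survival function.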

2.3 Generalized Kaplan-Meier estimator

Beran (1981) combined the Nadaraya-Watson (1964) and Kaplan-Meier (1958) estimators for the estimation of the conditional survival function given a covariate with right-censored data. The proposed estimator is

$$\hat{S}_h^{GKM}(t|z_0) = \prod_{i:\, L_i \le t} \left( 1 - \frac{1\{R_i < +\infty\}\, \omega_i^h(z_0)}{\sum_{j:\, L_j \ge L_i} \omega_j^h(z_0)} \right), \tag{2.4}$$

where $L_i = \min(T_i, C_i)$; if $T_i \le C_i$ then $L_i = R_i = T_i < +\infty$, and if $T_i > C_i$ then $R_i = +\infty$ and the survival time is right-censored. Here $C_i$ is a censoring time, a random variable independent of $T_i$. Dabrowska (1989) gives expressions for the asymptotic behavior of $\hat{S}_h^{GKM}(t|z_0)$.
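A rough Python sketch of Beran's estimator (2.4) for right-censored data follows. It assumes a Gaussian kernel and handles ties naively; all names and data are illustrative:

```python
import numpy as np

def gkm_conditional_survival(times, events, Z, t, z0, h):
    """Generalized Kaplan-Meier (Beran) estimate of S(t|z0) for
    right-censored data; events[i] is 1 for an observed failure,
    0 for a right-censored observation."""
    times = np.asarray(times, float)
    events = np.asarray(events, float)
    Z = np.asarray(Z, float)
    w = np.exp(-0.5 * ((Z - z0) / h) ** 2)   # Gaussian kernel, unnormalized
    w = w / w.sum()                          # weights omega_i^h(z0)
    s = 1.0
    for i in np.argsort(times):              # sweep event times up to t
        if times[i] > t:
            break
        risk = w[times >= times[i]].sum()    # kernel mass still at risk
        if events[i] == 1 and risk > 0:
            s *= 1.0 - w[i] / risk
    return s
```

With equal weights and no censoring this reduces to the ordinary Kaplan-Meier, i.e., the empirical survival function.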

In the presence of interval-censored data, neither of the previously mentioned nonparametric estimators can be used directly to estimate the conditional survival function. If one wishes to use these estimators with interval-censored or hybrid-censored (i.e., a mix of interval- and right-censored) data, one needs to impute a value to the unobserved (interval-censored) data.

2.4 Imputation methods

Imputation is the substitution of plausible values for missing data; the imputed dataset can then be analyzed as complete data. This method can be an attractive approach


to analyze incomplete data; for more information see Little and Rubin (1987) and Rubin (1987). In this thesis we will consider imputation methods such as midpoint imputation and multiple imputation. Midpoint imputation assigns the censoring interval's midpoint to the unobserved event time, whereas multiple imputation randomly assigns a given number of values to each unobserved event time.

If our dataset is purely interval-censored or hybrid-censored, we use the two following algorithms, respectively. To impute failure times using midpoint imputation, we simply set $\tilde{T}_i = (L_i + R_i)/2$, which yields $\tilde{T}_i = L_i = R_i$ under complete data. Under hybrid-censored data, we take $\tilde{T}_i = (L_i + R_i)/2$ for interval-censored observations, and we set $\tilde{L}_i = L_i$ and $\tilde{R}_i = +\infty$ for right-censored observations.

Algorithm 1 (Midpoint imputation).

Step 1: For $i = 1, \ldots, n$:
(for interval-censored observations) set $\tilde{L}_i = \tilde{R}_i = \tilde{T}_i = (L_i + R_i)/2$;
(for right-censored observations) set $\tilde{L}_i = L_i$ and $\tilde{R}_i = +\infty$.

Step 2: Compute $\hat{S}(\cdot|z_0)$ from $\tilde{T}_1, \ldots, \tilde{T}_n$ with equation (2.3) and the NW estimator if there are no right-censored observations; or from $(\tilde{L}_1, \tilde{R}_1), \ldots, (\tilde{L}_n, \tilde{R}_n)$ with equation (2.4) and the GKM estimator if there are right-censored observations.
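Step 1 of Algorithm 1 can be sketched as follows (illustrative Python; NaN marks the observations left right-censored for the GKM estimator):

```python
import numpy as np

def midpoint_impute(L, R):
    """Step 1 of Algorithm 1: interval-censored observations (finite R_i)
    are replaced by the interval midpoint; right-censored ones (R_i = inf)
    are flagged with NaN and kept as (L_i, +inf) for the GKM estimator."""
    L, R = np.asarray(L, float), np.asarray(R, float)
    return np.where(np.isfinite(R), (L + R) / 2.0, np.nan)

T = midpoint_impute([1.0, 2.0, 3.0], [3.0, np.inf, 3.0])  # -> [2.0, nan, 3.0]
```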

To impute failure times using multiple imputation, we start as in midpoint imputation by setting $\tilde{T}_i = (L_i + R_i)/2$, which yields $\tilde{T}_i = L_i = R_i$ under complete data. Under hybrid-censored data we take $\tilde{T}_i = (L_i + R_i)/2$ for interval-censored observations, and $\tilde{L}_i = L_i$ and $\tilde{R}_i = +\infty$ for right-censored observations. An initial value of the estimator can be obtained by midpoint imputation (Algorithm 1). We then follow the iterative algorithm below. Here $\delta_i$ represents the type of censoring: for interval-censored observations we set $\delta_i = 1$ and for right-censored observations we set $\delta_i = 0$. Let $R^* = \max\{\max_{i:\delta_i=0}\{L_i\},\ \max_{i:\delta_i=1}\{R_i\}\}$ serve as an initial value for $R_i$ when the item is right-censored and one uses the NW estimator.

Algorithm 2 (Multiple imputation).

Step 1: Put $l = 0$, $r = 0$ and set initial values for the elements of $\mathbf{T}^{(r)} = (T_1^{(r)}, \ldots, T_n^{(r)})$. For $i = 1, \ldots, n$:

(a) If $R_i < +\infty$, set $T_i^{(r)} = (L_i + R_i)/2$.

(b) If $R_i = +\infty$, set $T_i^{(r)} = (L_i + R^*)/2$ if using the NW estimator (2.3), or set $\tilde{L}_i = L_i$ and $\tilde{R}_i = +\infty$ if using the GKM estimator (2.4). Then compute $\hat{S}^{(r)}(\cdot|\cdot)$.

Step 2: Put $r = r + 1$ and set $T_i^{(r)} = \hat{S}^{(r-1)\,-1}\!\left(u_i^{(r)} \mid Z_i,\ T_i \in (L_i, R_i)\right)$, where $u_i^{(r)} \sim U(0,1)$, $i = 1, \ldots, n$.

Step 3: If using the NW estimator (2.3), compute $\hat{S}^{(r)}(\cdot|\cdot)$ from $T_1^{(r)}, \ldots, T_n^{(r)}$. If using the GKM estimator (2.4), for right-censored observations use $(\tilde{L}_i, \tilde{R}_i)$ instead of $T_i^{(r)}$.

Step 4: If $r < M$, return to Step 2; otherwise put $\bar{S}_l(\cdot|\cdot) = \sum_{r=1}^{M} \hat{S}^{(r)}(\cdot|\cdot)/M$.

Step 5: If $l < N$, put $r = 0$, $\hat{S}^{(r)}(\cdot|\cdot) = \bar{S}_l(\cdot|\cdot)$, $l = l + 1$ and go to Step 2; otherwise stop.

The corresponding estimate of $S(t|z_0)$ is given by $\bar{S}_N(t|z_0)$.
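The key operation in Step 2 of Algorithm 2 is to draw each unobserved $T_i$ from the current estimate restricted to its censoring interval. A sketch of one such draw on a discrete support (names are illustrative; the midpoint fallback when the interval carries no mass is an assumption of this sketch, not part of the algorithm):

```python
import numpy as np

def draw_in_interval(support, probs, L, R, rng):
    """One draw of Step 2 of Algorithm 2: sample T from the current
    discrete estimate (support points, masses), conditioned on T in (L, R]."""
    support = np.asarray(support, float)
    p = np.asarray(probs, float) * ((support > L) & (support <= R))
    if p.sum() == 0.0:             # no mass in the interval: fall back
        return (L + R) / 2.0       # to the midpoint (sketch assumption)
    return float(rng.choice(support, p=p / p.sum()))

rng = np.random.default_rng(0)
d = draw_in_interval([1.0, 2.0, 3.0, 4.0], [0.25] * 4, 1.5, 3.5, rng)
```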

2.5 Turnbull estimator and self-consistent algorithm

A self-consistent estimate usually refers to an estimate that can be characterized by a self-consistency equation and is the limit of iterates obtained from that equation (Efron, 1967). Turnbull (1976) developed a self-consistency algorithm for the estimation of $S(t)$ with censored data. To derive the self-consistency equation for interval-censored data, one may treat interval-censored data as incomplete data and then apply the EM algorithm. Let $0 = \tau_0 < \tau_1 < \tau_2 < \cdots < \tau_g$, $g \le 2n$, be the ordered distinct values of $\{L_i, R_i,\ i = 1, \ldots, n\}$. Put $B_j = (\tau_{j-1}, \tau_j]$, $j = 1, \ldots, g$, and let $p_j$ be the probability of death in $B_j$. Define

$$\mathcal{C}_P = \left\{ P \in [0,1]^g \,:\, \sum_{j=1}^{g} p_j = 1,\ p_j \ge 0 \right\},$$

a subset of $\mathbb{R}^g$. One determines the nonparametric maximum likelihood estimator (NPMLE) of $P$ by maximizing the following likelihood function:

$$L_P(S) = \prod_{i=1}^{n} \{S(L_i) - S(R_i)\} = \prod_{i=1}^{n} \sum_{j=1}^{g} \alpha_{ij}\, p_j, \tag{2.5}$$

where

$$\alpha_{ij} = \begin{cases} 1, & B_j \subseteq (L_i, R_i], \\ 0, & \text{otherwise}, \end{cases} \qquad j = 1, \ldots, g,\ i = 1, \ldots, n. \tag{2.6}$$


Algorithm 3 (Turnbull estimator).

Step 1: Choose initial values $p_j^{(0)}$ for the $p_j$, for instance $p_j^{(0)} = 1/g$, $j = 1, \ldots, g$.

Step 2: At the $l$th iteration, define the updated estimate $P^{(l)} = (p_1^{(l)}, \ldots, p_g^{(l)})$ of $P$ as

$$p_j^{(l)} = \frac{1}{n} \sum_{i=1}^{n} \frac{\alpha_{ij}\, p_j^{(l-1)}}{\sum_{k=1}^{g} \alpha_{ik}\, p_k^{(l-1)}}, \qquad j = 1, \ldots, g.$$

Step 3: Repeat Step 2 until convergence.

The corresponding estimate of $S(t)$ does not specify how to distribute the probability weights $p_j$ over the intervals $B_j$. Arbitrarily, one may assign all the probability mass to the left-hand endpoints of the intervals, which yields

$$\hat{S}(t) = \begin{cases} 1, & t < \rho, \\ \sum_{j:\, \tau_j > t} \hat{p}_j, & t \ge \rho, \end{cases} \tag{2.7}$$

where $\rho = \min\{L_i, R_i,\ i = 1, \ldots, n\}$.
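Algorithm 3 can be sketched as follows for purely interval-censored data with finite endpoints (illustrative Python; a production implementation would restrict the mass to the innermost intervals discussed in Chapter 3):

```python
import numpy as np

def turnbull(L, R, tol=1e-8, max_iter=5000):
    """Turnbull's self-consistency iteration (Algorithm 3) for purely
    interval-censored data with finite endpoints. Returns the grid tau and
    the masses p_j of the intervals B_j = (tau[j-1], tau[j]]."""
    L, R = np.asarray(L, float), np.asarray(R, float)
    tau = np.unique(np.concatenate(([0.0], L, R)))
    # alpha[i, j] = 1 iff B_j is contained in (L_i, R_i]
    alpha = (tau[None, :-1] >= L[:, None]) & (tau[None, 1:] <= R[:, None])
    g = alpha.shape[1]
    p = np.full(g, 1.0 / g)
    for _ in range(max_iter):
        denom = alpha @ p                               # mass of (L_i, R_i]
        p_new = (alpha * p).T @ (1.0 / denom) / len(L)  # self-consistency step
        if np.abs(p_new - p).sum() < tol:
            break
        p = p_new
    return tau, p
```

On two disjoint intervals $(0,1]$ and $(1,2]$ the iteration splits the mass evenly, as expected.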

2.6 Measures of performance of the estimators

Throughout this thesis we will compare the performance of various estimators. To measure their performance, we review the following classical measures: the variance, the bias, the mean squared error (MSE) and the integrated mean squared error (IMSE).

First we consider the conditional survival function $S(t|z_0)$ and a nonparametric estimator $\hat{S}(t|z_0)$ of it at a fixed time $t$. The pointwise variance is

$$V(\hat{S}(t|z_0)) = E\left[ \left( \hat{S}(t|z_0) - E[\hat{S}(t|z_0)] \right)^2 \right]. \tag{2.8}$$

In practice, we estimate this variance by simulating $B$ samples and computing

$$\hat{V}(\hat{S}(t|z_0)) = \frac{1}{B} \sum_{b=1}^{B} \left( \hat{S}_b(t|z_0) - \bar{S}(t|z_0) \right)^2, \tag{2.9}$$

where $\bar{S}(t|z_0) = \frac{1}{B} \sum_{b=1}^{B} \hat{S}_b(t|z_0)$.

The bias of the estimator is defined as

$$\mathrm{Bias}(\hat{S}(t|z_0)) = E[\hat{S}(t|z_0)] - S(t|z_0). \tag{2.10}$$

In practice the bias of the estimator is estimated by

$$\widehat{\mathrm{Bias}}(\hat{S}(t|z_0)) = \frac{1}{B} \sum_{b=1}^{B} \left( \hat{S}_b(t|z_0) - S(t|z_0) \right). \tag{2.11}$$

One may consider the variance or the bias separately to compare two or more estimators, but it is easier to compare their performance by combining variance and bias into a single value. We therefore define the mean squared error of $\hat{S}(t|z_0)$,

$$\mathrm{MSE}(\hat{S}(t|z_0)) = E\left[ \left( \hat{S}(t|z_0) - S(t|z_0) \right)^2 \right] = V(\hat{S}(t|z_0)) + \left[ \mathrm{Bias}(\hat{S}(t|z_0)) \right]^2. \tag{2.12}$$

In practice it can be estimated by

$$\widehat{\mathrm{MSE}}(\hat{S}(t|z_0)) = \frac{1}{B} \sum_{b=1}^{B} \left( S(t|z_0) - \hat{S}_b(t|z_0) \right)^2. \tag{2.13}$$

To assess the performance of the estimators at several values of $t$ on a grid instead of at a single point $t$, the integrated mean squared error (IMSE) can be used. It is defined as

$$\mathrm{IMSE}(\hat{S}(\cdot|z_0)) = \int_0^{\tau} E\left[ \left( \hat{S}(t|z_0) - S(t|z_0) \right)^2 \right] dt, \tag{2.14}$$

where $\tau$ is a large value, for instance the 99th percentile of the survival times generated. It can also be defined as

$$\mathrm{IMSE}(\hat{S}(\cdot|z_0)) = E_T\left\{ E_{\hat{S}}\left( S(T|z_0) - \hat{S}(T|z_0) \right)^2 \right\}, \tag{2.15}$$

which in practice can be estimated by

$$\widehat{\mathrm{IMSE}}(\hat{S}(\cdot|z_0)) = \frac{1}{Bn} \sum_{i=1}^{n} \sum_{j=1}^{B} \left( S(t_i|z_0) - \hat{S}_j(t_i|z_0) \right)^2. \tag{2.16}$$
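The Monte Carlo versions (2.9), (2.11) and (2.13) of these pointwise measures can be computed from replicated estimates as follows (illustrative Python; the replicate values are hypothetical):

```python
import numpy as np

def mc_performance(estimates, truth):
    """Monte Carlo bias (2.11), variance (2.9) and MSE (2.13) of B
    replicated estimates of the single true value S(t|z0)."""
    e = np.asarray(estimates, float)
    bias = float(np.mean(e - truth))
    var = float(np.mean((e - e.mean()) ** 2))
    mse = float(np.mean((e - truth) ** 2))
    return bias, var, mse

bias, var, mse = mc_performance([0.6, 0.5, 0.7], truth=0.5)
# The decomposition (2.12) holds: mse equals var + bias**2 up to rounding.
```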


Figure 2.1: Both panels: each line segment represents an observed interval, with the x-axis representing time and the y-axis the item number. Left panel: intervals obtained when the censoring intervals are wide. Right panel: intervals obtained when the censoring intervals are of medium length.


Figure 2.2: Both panels: each line segment represents an observed interval, with the x-axis representing time and the y-axis the item number. Left panel: intervals obtained when the censoring intervals are of medium length. Right panel: intervals obtained when the censoring intervals are narrow.


Transition

In the previous chapter we gave a summary of the nonparametric methods that are available to estimate a survival or conditional survival function under various censoring schemes. Unless one is willing to use imputation methods, none of these methods can be used directly to obtain an estimator of the conditional survival function when the time-to-event is subject to interval-censoring. We propose a solution to this problem in the next chapter. Its contents have been published in Lifetime Data Analysis, vol. 17, pp. 234-255.


Chapter 3

A generalization of Turnbull's estimator for nonparametric estimation of the conditional survival function with interval-censored data

Résumé

L'estimation non paramétrique de la distribution conditionnelle d'une variable réponse étant donnée une covariable est souvent utile à des fins d'exploration ou d'aide à la spécification ou à la validation d'un modèle de régression paramétrique ou semi-paramétrique. Dans cet article nous proposons un tel estimateur, dans le cas où la variable réponse est censurée par intervalle et la covariable est continue. Notre approche consiste à ajouter des poids qui dépendent de la valeur de la covariable dans l'équation proposée par Turnbull (1976). Ceci résulte en un estimateur qui n'est pas plus difficile à mettre en oeuvre que celui de Turnbull. Nous montrons la convergence de notre algorithme et que notre estimateur se réduit à l'estimateur généralisé de Kaplan-Meier (Beran, 1981) lorsque les données sont complètes ou censurées à droite. Nous démontrons par simulation que l'estimateur, l'estimation de la variance par la méthode du bootstrap et la sélection de la fenêtre de lissage (par une règle du pouce ou la validation croisée) donnent de bons résultats dans les échantillons finis. Nous illustrons la méthode en l'appliquant à un ensemble de données issu d'une étude sur l'incidence du VIH dans un groupe de travailleuses du sexe de Kinshasa.


A generalization of Turnbull's estimator 15

Abstract

Simple nonparametric estimates of the conditional distribution of a response variable given a covariate are often useful for data exploration purposes or to help with the specification or validation of a parametric or semi-parametric regression model. In this paper we propose such an estimator in the case where the response variable is interval-censored and the covariate is continuous. Our approach consists of adding weights that depend on the covariate value in the self-consistency equation proposed by Turnbull (1976). This results in an estimator that is no more difficult to implement than Turnbull's estimator itself. We show the convergence of our algorithm and that our estimator reduces to the generalized Kaplan-Meier estimator (Beran, 1981) when the data are either complete or right-censored. We demonstrate by simulation that the estimator, bootstrap variance estimation and bandwidth selection (by rule of thumb or cross-validation) all perform well in finite samples. We illustrate the method by applying it to a dataset from a study on the incidence of HIV in a group of female sex workers from Kinshasa.


3.1 Introduction

Simple nonparametric estimators of the conditional survival function are often useful in the analysis of survival data. They can be used at a primary data exploration stage to summarize the data, or perhaps to help out with the task of selecting an appropriate parametric or semiparametric regression model. For instance, it is common practice to produce an overlay of the Kaplan-Meier estimators for all treatment groups in some studies for descriptive or model selection purposes. This task, however, becomes somewhat more complicated when the survival times are interval-censored. As explained by Sun (2006, pp. 234-235), in the case of discrete covariates, one can look at the plots described above, but with Turnbull's (1976) estimator replacing the Kaplan-Meier, while in the case of continuous covariates one can partition the dataset into k groups according to the covariate values and then look at the plot of the Turnbull estimators obtained in each group. Our primary objective in this paper is to formalize this latter approach by extending Turnbull's estimator so that it can directly estimate a conditional survival function when the survival times are interval censored and the covariate(s) is(are) continuous. We can then plot the nonparametric estimates of the conditional survival function at various covariate values to either summarize the data or to get a first idea of potential parametric or semiparametric models that could fit the data. Hence the method that we propose here is intended as a model selection or validation tool, and not as a device to perform formal inferences about covariate effects, for which (semi-)parametric models are better suited.

At such an initial stage in the analysis, we deem it important that the proposed method be no more computationally demanding than applying Turnbull's method to a few subsets of the data. Though some authors have proposed semi- and nonparametric means of estimating conditional hazard functions in the presence of interval-censored data and covariates (Rabinowitz et al., 1995; Huang and Rossini, 1997; Betensky et al., 2001; Betensky et al., 2002), a direct extension of Turnbull's method does not seem to have been considered; we propose such an extension here. Our approach consists of adding kernel weights in the self-consistency equation proposed by Turnbull (1976). By doing so, the numerical work required to obtain our proposed estimator is the same as that required to solve Turnbull's self-consistency equation. Note that other authors have considered modifications of self-consistency algorithms, for instance Braun et al. (2005), who add smoothing in the E-step of an EM algorithm for density estimation with interval-censored data in the absence of covariates, Betensky (2000), who introduces dependence on disease markers in Turnbull's algorithm, or Betensky et al. (2002), who use a local likelihood method to fit the proportional hazards model when the response is interval-censored; but these methods do not directly produce a nonparametric estimate


of the conditional survival function in the presence of continuous covariates.

The paper is organized as follows. In Section 3.2, we define the new estimator and show how it reduces to the generalized Kaplan-Meier (GKM) estimator proposed by Beran (1981) under right-censored data, or to Turnbull's nonparametric maximum likelihood estimator (NPMLE) when all observations have equal kernel weights. We also show how the modified self-consistency algorithm can be viewed as an algorithm of EM type that maximizes a weighted marginal log-likelihood function. We show that the proposed algorithm converges monotonically and, in doing so, we generalize the traditional proof of monotonicity of the EM algorithm to a proof of monotonicity of what we term a weighted EM algorithm. We consider extensions of the proposed method that are more computationally demanding in Sections 3.3 and 3.4: in Section 3.3, we propose a bootstrap algorithm that can be used if pointwise standard errors or confidence limits are desired, while in Section 3.4 we investigate two methods to obtain a value for the smoothing parameter (a simple "rule-of-thumb" method and a method based on cross-validation). The finite sample properties of the estimator and its extensions are studied by simulation in Section 3.5. We illustrate how these methods can be used for data exploration and model selection by applying them to a dataset from a study on the incidence of HIV in a group of female sex workers from Kinshasa (Democratic Republic of the Congo) in Section 3.6. We conclude the paper with a discussion and ideas for further research in Section 3.7.

3.2 Generalization of Turnbull's estimator

We consider nonparametric estimation of the conditional survival function of a random variable $T$ given a value $z_0$ of the covariate vector $Z$, i.e., we want to construct a nonparametric estimator of $S(t|z_0) = \Pr[T > t \mid Z = z_0]$. For instance, in the case study on female sex workers of Section 3.6, $T$ is the time to HIV infection and $Z$ is the woman's age at enrolment in the study. In this paper, we consider the case where $T$ is not observed exactly but is known to belong to an interval $(L, R)$. Hence, we want to estimate $S(t|z_0)$ on the basis of a sample of $n$ independent observations $(L_i, R_i, Z_i)$ with $(L_i, R_i) \ni T_i$, $i = 1, \ldots, n$, with the slight abuse of notation $(T_i, T_i, Z_i)$ if $T_i$ is known exactly. We assume that, conditionally on the covariates, interval-censoring is uninformative. More specifically, we suppose that given $Z = z_0$ the value of $T$ is independent of the process that generates the bracketing times $L$ and $R$. For instance, under mixed-case interval-censored data (Sun, 2006, pp. 12), we observe $(L, R)$ if for $0 = U_0 < U_1 < \cdots < U_K < U_{K+1} = \infty$ with $T \in (U_j, U_{j+1}]$, $L = U_j$ and $R = U_{j+1}$.


Then we assume that, given $Z = z_0$, $T$ and $\{K, U_1, U_2, \ldots, U_K\}$ are independent.

3.2.1 Proposed estimator

To construct the new estimator, we add weights in Turnbull's self-consistency equation for the NPMLE of the survival function. Let $I_i = (L_i, R_i]$ denote the $i$th observed censoring interval and let $0 = \tau_0 < \tau_1 < \tau_2 < \cdots < \tau_g$, $g \le 2n$, be the ordered distinct values of $\{L_i, R_i,\ i = 1, \ldots, n\}$. Put $B_j = (\tau_{j-1}, \tau_j]$, $j = 1, \ldots, g$, and let $\{A_\ell,\ \ell = 1, \ldots, m\}$ be the set of innermost intervals, i.e., $\{A_\ell,\ \ell = 1, \ldots, m\} = \{B_j : \exists\, i, i' : \tau_{j-1} = L_i \text{ and } \tau_j = R_{i'},\ j = 1, \ldots, g\}$. It is well known (e.g., Turnbull, 1976, Li et al., 1997, Sun, 2006) that the NPMLE of the survival function assigns all probability mass to the innermost intervals. Let us now define the dummy variable $\alpha_{i,j}$ as

$$\alpha_{i,j} = \begin{cases} 1, & B_j \subseteq I_i, \\ 0, & \text{otherwise}, \end{cases} \qquad i = 1, \ldots, n,\ j = 1, \ldots, g. \tag{3.1}$$

Let $p_j$ be the mass assigned to $B_j$ in Turnbull's NPMLE of $S(t)$, i.e., $p_j = \Pr[T \in B_j]$, $j = 1, \ldots, g$. Then $p_j$ is found by solving the self-consistency equation

$$p_j = \frac{1}{n} \sum_{i=1}^{n} \frac{\alpha_{i,j}\, p_j}{\sum_{k=1}^{g} \alpha_{i,k}\, p_k}, \qquad j = 1, \ldots, g, \tag{3.2}$$

which will yield $p_j = 0$ if $B_j$ is not an innermost interval and $p_j > 0$ otherwise. We cannot directly apply the strategies proposed by Beran (1981) or Betensky (2000), who make the indicator $\alpha_{i,j}$ depend on $Z_i$ and $z_0$. Indeed, any such weighting would have to vary with $j$, otherwise it would cancel out in (3.2). Instead, we weight the $i$th term in the sum (3.2) as follows. Let $\omega_i^h(z_0)$ be the (kernel) weight for observation $i$, defined as

$$\omega_i^h(z_0) = \frac{K_h\{(Z_i - z_0)\}}{\sum_{j=1}^{n} K_h\{(Z_j - z_0)\}}, \qquad i = 1, \ldots, n,$$

where $K(\cdot)$ is a symmetric kernel function and $h$ is a smoothing parameter (bandwidth). If $\dim(Z) = d$, then $K(\cdot)$ is typically chosen as the product of $d$ density functions symmetric around zero. Because the choice of the smoothing parameters $h$ is difficult and because completely nonparametric estimation of $S(t|z_0)$ is likely to be noisy under interval-censored data if $d > 1$, for the remainder of this paper we limit our investigation to the case where the covariate $Z_i$ is one-dimensional. As a matter of fact, our intent is to propose a method to be used in the initial univariate exploratory stages of analysis. Even though in theory our results for fixed $h$ hold if $Z_i$ is $d$-dimensional, numerical implementation and bandwidth selection, which we consider in Sections 3.3 and 3.4, are


of course much more difficult when $\dim(Z_i) > 1$. In such multivariate settings, other smoothing techniques (e.g., penalized likelihood, splines) should be more appropriate.

Let us define $p_j(z_0)$, the probability mass assigned to $B_j$ by our estimator of $S(t|z_0)$, as the value obtained by solving the following self-consistency equation:

$$p_j(z_0) = \sum_{i=1}^{n} \omega_i^h(z_0)\, \frac{\alpha_{i,j}\, p_j(z_0)}{\sum_{k=1}^{g} \alpha_{i,k}\, p_k(z_0)}, \qquad j = 1, \ldots, g. \tag{3.3}$$

In our own practical implementation of (3.3), we have used the standard normal kernel and the Epanechnikov kernel, $K(x) = 0.75(1 - x^2)\, 1_{\{x \in [-1,1]\}}$, where we use $1_A$ as the indicator of event $A$. Other density functions symmetric around zero, such as the tricube or uniform kernels, can also be used. Though the functional form of the kernel does not have a practically significant impact on the statistical properties of the estimator (see Section 3.5), the bandwidth selection methods presented below only work well under the normal kernel.

As is the case with Turnbull's algorithm, equation (3.3) suggests the following iterative solution scheme.

Algorithm 4.

Step 1: Put $r = 0$. Set $\varepsilon$ equal to a small positive real number and set initial values $p_j^{(0)}(z_0)$, $j = 1, \ldots, g$, for the probability masses, for example $p_j^{(0)}(z_0) = 1/m$ if $B_j$ is an innermost interval and $0$ otherwise; if the innermost intervals have yet to be identified, one can set $p_j^{(0)}(z_0) = 1/g$ for all $j$.

Step 2: Compute

$$p_j^{(r+1)}(z_0) = \sum_{i=1}^{n} \omega_i^h(z_0)\, \frac{\alpha_{i,j}\, p_j^{(r)}(z_0)}{\sum_{k=1}^{g} \alpha_{i,k}\, p_k^{(r)}(z_0)}, \qquad j = 1, \ldots, g, \tag{3.4}$$

and put $r = r + 1$.

Step 3: If $\|p^{(r)}(z_0) - p^{(r-1)}(z_0)\| / \|p^{(r)}(z_0)\| > \varepsilon$, where $p^{(r)}(z_0) = (p_1^{(r)}(z_0), \ldots, p_g^{(r)}(z_0))^T$, return to Step 2; otherwise stop.

The corresponding estimate of $S(t|z_0)$, if one assigns all the probability masses to the left-hand endpoints of the innermost intervals, is given by

$$\hat{S}_h(t|z_0) = \begin{cases} 1, & t < \tau_1, \\ \sum_{j:\, \tau_j > t} \hat{p}_j(z_0), & t \ge \tau_1. \end{cases} \tag{3.5}$$
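A minimal Python sketch of Algorithm 4, assuming a Gaussian kernel and finite censoring intervals (names and data are illustrative; for simplicity the mass is iterated over all intervals $B_j$ rather than only the innermost ones):

```python
import numpy as np

def weighted_turnbull(L, R, Z, z0, h, tol=1e-8, max_iter=5000):
    """Sketch of Algorithm 4: Turnbull's iteration with Gaussian-kernel
    weights omega_i^h(z0) replacing the equal weights 1/n; assumes finite
    censoring intervals. Returns tau, the masses p_j(z0) and S_hat(.|z0)."""
    L, R, Z = (np.asarray(a, float) for a in (L, R, Z))
    w = np.exp(-0.5 * ((Z - z0) / h) ** 2)
    w = w / w.sum()                                    # omega_i^h(z0)
    tau = np.unique(np.concatenate(([0.0], L, R)))
    # alpha[i, j] = 1 iff B_j = (tau[j-1], tau[j]] is contained in (L_i, R_i]
    alpha = (tau[None, :-1] >= L[:, None]) & (tau[None, 1:] <= R[:, None])
    p = np.full(alpha.shape[1], 1.0 / alpha.shape[1])
    for _ in range(max_iter):
        p_new = (alpha * p).T @ (w / (alpha @ p))      # equation (3.4)
        if np.abs(p_new - p).sum() < tol:
            break
        p = p_new
    surv = lambda t: float(p[tau[1:] > t].sum())       # as in (3.5)
    return tau, p, surv
```

With equal covariate values the kernel weights are all $1/n$ and the iteration reduces to Turnbull's unconditional estimator.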


3.2.2 Special cases

Note that, following the arguments in Betensky (2000, Section 3), for right-censored data this algorithm amounts to redistributing the weights $\omega_i^h(z_0)$ (instead of $1/n$) of the censored observations to the complete observations. Thus, for the special case of right-censored data, our algorithm is equivalent to the algorithm proposed by Efron (1967, Section 7) with his weights $1/n$ replaced by our weights $\omega_i^h(z_0)$. In Appendix A.1, we show that under such replacement, Efron's algorithm yields the GKM. Hence, when applied to right-censored data, our method yields the GKM.

It is also worth mentioning that when all observations are complete, the innermost intervals are the observed "intervals" $(T_i, T_i)$. It is then easy to see that (3.3) yields $p_j(z_0) = \omega_i^h(z_0)$ for the interval corresponding to $T_i$ and, thus, that our method applied to complete data gives the Nadaraya (1964) and Watson (1964) estimator of the conditional distribution.

Finally, consider the case where the covariate $Z$ is discrete rather than continuous. Then applying our algorithm with $h$ small and a kernel function with finite support is equivalent to fitting Turnbull's estimator to the dataset $\{(L_i, R_i) : Z_i = z_0\}$. Indeed, in this case, with $h$ small enough, all observations such that $Z_i = z_0$ will have equal weight $\omega > 0$ and all observations such that $Z_i \ne z_0$ will have weight equal to zero.

3.2.3 Behavior of the algorithm

Numerical convergence of the algorithm

The proposed Algorithm 4 is closely tied to the EM algorithm and we now show that, for a fixed value of $h$, it possesses the same monotone convergence property. Consider the "weighted marginal log-likelihood"

$$\mathcal{L} = \sum_{i=1}^{n} \omega_i^h(z_0) \log\{S(L_i) - S(R_i)\} \tag{3.6}$$

and let $(T_i, L_i, R_i)$, $i = 1, \ldots, n$, be the "complete data" and consider the corresponding complete-data weighted log-likelihood,

$$\mathcal{L}_{comp} = \sum_{j=1}^{g} d_j^* \log p_j,$$

where $d_j^* = \sum_{i=1}^{n} \omega_i^h(z_0)\, 1_{\{T_i \in B_j\}}$. The E-step at the $r$th iteration consists in taking the

expectation of $\mathcal{L}_{comp}$ given $T_i \in (L_i, R_i)$, $i = 1, \ldots, n$, i.e., we compute

$$Q(p|p^{(r)}) = \sum_{j=1}^{g} d_j^*(p^{(r)}) \log p_j, \qquad \text{with} \qquad d_j^*(p) = \sum_{i=1}^{n} \omega_i^h(z_0)\, \frac{\alpha_{i,j}\, p_j}{\sum_{k=1}^{g} \alpha_{i,k}\, p_k}.$$

Note that this is the same expectation as in Sun (2006, pp. 53), but with $d_j^*$ and $d_j^*(p)$ that now depend on the weights $\omega_i^h(z_0)$. Now, again following Sun (2006, pp. 53-54), we get that $p^{(r+1)} = \arg\max_p Q(p|p^{(r)})$ is obtained by setting $p_j^{(r+1)}$ equal to $p_j^{(r+1)}(z_0)$ given by (3.4), $j = 1, \ldots, g$.

To show that Algorithm 4 possesses the monotone convergence property, we first show that the EM algorithm also has the monotone convergence property when applied to a weighted log-likelihood under general conditions. We actually find this latter general result interesting in its own right and so treat it in detail here. Let $(X_i, Y_i)$, $i = 1, \ldots, n$, denote $n$ i.i.d. random vectors. Assume that the $X_i$ are observed but that the $Y_i$ are missing. We have that $f_C(X_i, Y_i; \theta) = f_O(X_i; \theta)\, f_{M|O}(Y_i | X_i; \theta)$ for all $i$ and $\theta$, where $f_C$, $f_O$ and $f_{M|O}$ denote the densities of the complete, observed and missing given observed data, respectively. Now let us consider the maximization of the following weighted likelihood, or corresponding weighted log-likelihood, for the observed data:

$$L_{Obs}(\theta; X) = \prod_{i=1}^{n} \{f_O(X_i; \theta)\}^{\omega_i} \qquad \text{and} \qquad \mathcal{L}_{Obs}(\theta; X) = \sum_{i=1}^{n} \omega_i \log f_O(X_i; \theta),$$

where $\theta$ is the parameter to be estimated and the $\omega_i$ are weights such that $0 \le \omega_i \le 1$ and $\sum_i \omega_i = 1$, and that might depend on $\theta$ or $X_i$, but not on $Y_i$. We have that $\mathcal{L}_{comp}(\theta; X, Y) = \mathcal{L}_{Obs}(\theta; X) + \mathcal{L}_{Mis}(\theta; X, Y)$, where $\mathcal{L}_{comp}(\theta; X, Y) = \log \prod_i f_C^{\omega_i}(X_i, Y_i; \theta)$ and $\mathcal{L}_{Mis}(\theta; X, Y) = \log \prod_i f_{M|O}^{\omega_i}(Y_i | X_i; \theta)$.

Lemma 1. Let $Q(\theta|\tilde{\theta}) = E_{\tilde{\theta}}\left[ \mathcal{L}_{comp}(\theta; X, Y) \mid X \right]$ (E-step) and put

$$\theta^{(r+1)} = \arg\max_{\theta} Q(\theta|\theta^{(r)}) \quad \text{(M-step)}.$$

Then $\mathcal{L}_{Obs}(\theta^{(r+1)}; X) \ge \mathcal{L}_{Obs}(\theta^{(r)}; X)$, $r = 1, 2, \ldots$

The proof of Lemma 1 is the usual proof of the monotone convergence of the EM algorithm, but with one additional use of Jensen's inequality. It is outlined in Appendix A.2.
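Lemma 1 can be checked numerically on a small example: along the iterates of the update (3.4), the weighted marginal log-likelihood (3.6) should never decrease. The following illustrative Python uses hypothetical censoring intervals and weights:

```python
import numpy as np

# Hypothetical censoring intervals (0,1], (1,2], (0,2] and weights standing
# in for omega_i^h(z0); all values are illustrative.
L = np.array([0.0, 1.0, 0.0])
R = np.array([1.0, 2.0, 2.0])
w = np.array([0.5, 0.3, 0.2])
tau = np.unique(np.concatenate(([0.0], L, R)))
alpha = (tau[None, :-1] >= L[:, None]) & (tau[None, 1:] <= R[:, None])
p = np.full(alpha.shape[1], 1.0 / alpha.shape[1])
loglik = []
for _ in range(50):
    loglik.append(float(np.sum(w * np.log(alpha @ p))))  # equation (3.6)
    p = (alpha * p).T @ (w / (alpha @ p))                # one step of (3.4)
# Monotone non-decreasing, as Lemma 1 guarantees
assert all(b >= a - 1e-12 for a, b in zip(loglik, loglik[1:]))
```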


As a corollary to Lemma 1, since (3.6) is bounded from above, Algorithm 4 will converge monotonically. As is the case with Turnbull's self-consistency algorithm (Gentleman and Geyer, 1994), Lemma 1 does not imply that (3.6) has a unique global maximum and/or that the sequence $\{p^{(r)}\}$ will converge to a maximizer of (3.6) (see also Braun et al., 2005).

Uniqueness of the maximizer of (3.6)

We can follow the arguments of Gentleman and Geyer (1994) and of Böhning et al. (1996) to derive conditions under which (3.6) has a unique maximum for a fixed value of $h$. Define the following quantities: $n(h) = \#\{i : \omega_i(z_0) > 0\}$, $A_h$ is the $n(h) \times g$ matrix of the $\alpha_{ij}$'s, $i \in \{i' : \omega_{i'}(z_0) > 0\}$, $j = 1, \dots, g$, and $D_h$ is the $n(h) \times n(h)$ diagonal matrix defined by $D_h = \mathrm{diag}\!\left(-\omega_i(z_0)/\eta_i^2,\; i \in \{i' : \omega_{i'}(z_0) > 0\}\right)$, with $\eta_i = \sum_{j=1}^{g} \alpha_{ij} p_j$. Then the Hessian of (3.6) is given by $H_h = A_h^\top D_h A_h$. The weighted log-likelihood (3.6) will be strictly concave and have a unique maximum when $-H_h$ is strictly positive definite; a sufficient condition for this is that $n(h) \ge g$ and $\mathrm{rank}(A_h) = g$.
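For concreteness, this concavity check can be coded directly. The Python sketch below is illustrative only (the matrix $A_h$, the kernel weights and the probability vector are made-up toy values, not thesis output): it builds $\eta_i = \sum_j \alpha_{ij} p_j$, the diagonal matrix $D_h$ and the Hessian $H_h = A_h^\top D_h A_h$, then verifies strict concavity by a Cholesky test that $-H_h$ is positive definite.

```python
import math

def hessian(A, w, p):
    """Hessian H_h = A^T D_h A of the weighted log-likelihood, where
    D_h = diag(-w_i / eta_i^2) and eta_i = sum_j A[i][j] * p[j]."""
    n, g = len(A), len(A[0])
    eta = [sum(A[i][j] * p[j] for j in range(g)) for i in range(n)]
    d = [-w[i] / eta[i] ** 2 for i in range(n)]
    return [[sum(A[i][j] * d[i] * A[i][k] for i in range(n))
             for k in range(g)] for j in range(g)]

def is_positive_definite(M):
    """Attempt a Cholesky factorization; it succeeds iff M is positive definite."""
    g = len(M)
    L = [[0.0] * g for _ in range(g)]
    for j in range(g):
        s = M[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if s <= 0.0:
            return False
        L[j][j] = math.sqrt(s)
        for i in range(j + 1, g):
            L[i][j] = (M[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return True

# Toy setting with n(h) = 3 >= g = 2 and rank(A_h) = 2, so the sufficient
# condition holds and -H_h should be strictly positive definite.
A = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
w = [0.5, 0.3, 0.2]     # kernel weights omega_i(z0), summing to one
p = [0.4, 0.6]
H = hessian(A, w, p)
strictly_concave = is_positive_definite([[-x for x in row] for row in H])
```

Dropping the third row of `A` would make $n(h) = g$ with $A_h$ still of full column rank, so the test would continue to pass; a rank-deficient $A_h$ would make it fail.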

Note that $\hat{\mathbf{p}}(z_0)$ may be a global maximizer of (3.6) even when (3.6) is not strictly concave. Böhning et al. (1996) show that $\hat{\mathbf{p}}(z_0)$ globally maximizes (3.6) if and only if $d_j(\hat{\mathbf{p}}(z_0)) \le n$ for all $j$, with $d_j(\hat{\mathbf{p}}(z_0)) = n$ whenever $\hat{p}_j(z_0) > 0$.

Relationship to local likelihood

A thorough investigation of the asymptotic properties of $\hat{\mathbf{p}}(z_0)$ under general mixed-case censoring and with a data-driven choice of the bandwidth $h$ remains an open problem. We will see in Section 3.5 that the method does lead to valid inferences in finite samples with data-driven bandwidth selection. Here we simply argue that maximizing the "weighted marginal log-likelihood" (3.6) is sensible. Indeed, consider the local approximation

$$\Pr[T \in B_j \,|\, Z = z] = p_j^{(0)}(z_0) + p_j^{(1)}(z_0)(z - z_0) + \cdots + p_j^{(q)}(z_0)(z - z_0)^q, \qquad |z - z_0| \ll 1.$$

Then local likelihood estimation would maximize

$$\ell_{\mathrm{loc}} \approx \sum_{i=1}^{n} \omega_i(z_0) \log\left\{ \sum_{j=1}^{g} \alpha_{ij} \sum_{v=0}^{q} p_j^{(v)}(z_0)(Z_i - z_0)^v \right\},$$


which would give (3.3) with $p_j(z_0)$ replaced by $p_j^{(0)}(z_0)$ in the score equation when $q = 0$. Hence maximizing (3.6) amounts to maximizing a local likelihood with a locally constant parametric model.
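To spell out the reduction: with $q = 0$ the inner sum over $v$ in $\ell_{\mathrm{loc}}$ has the single term $v = 0$, so that

$$\ell_{\mathrm{loc}} \approx \sum_{i=1}^{n} \omega_i(z_0) \log\left\{ \sum_{j=1}^{g} \alpha_{ij}\, p_j^{(0)}(z_0) \right\},$$

which is the weighted marginal log-likelihood (3.6) with $p_j^{(0)}(z_0)$ in place of $p_j(z_0)$.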

3.3 Pointwise variance estimation

Though semi-parametric methods are better suited for testing hypotheses of the form $H_0: S(t|z_0) = S(t|z_1)$, one may be interested in measuring the amount of uncertainty in an estimate $\hat{S}_h(t|z_0)$. We thus consider estimation of the variance of $\hat{S}_{h(n)}(t|z_0)$ for fixed $t$ and $z_0$, where now $h(n)$ emphasizes the fact that the bandwidth value may be data-driven (see Section 3.4). Following Sun (2001), we propose bootstrap estimation of this variance. We consider two versions of the bootstrap. Version A of Algorithm 5 below is computationally much faster than Version B, but one might suspect that the former will slightly underestimate the variance of the estimator.

Algorithm 5. Bootstrap algorithm for pointwise variance estimation

1. Fix $t$ and $z_0$. Compute a bandwidth $h(n)$.

2. For $k$ in 1 to $K$:

Step 1: Sample (with equal weight and with replacement) $n$ observations from $(L_i, R_i, Z_i)$, $i = 1, \dots, n$. Denote the sample obtained $(L_i^{(k)}, R_i^{(k)}, Z_i^{(k)})$, $i = 1, \dots, n$.

Step 2, Version A: Compute $\hat{S}^{(k)}_{h(n)}(t|z_0)$ using the bootstrap sample generated at Step 1.

Step 2, Version B: Compute a bandwidth $h^{(k)}(n)$ and then $\hat{S}^{(k)}_{h^{(k)}(n)}(t|z_0)$, both using the bootstrap sample generated at Step 1.

3. Use the sample variance of the $\hat{S}^{(k)}$'s as the variance estimator.

The difference between Versions A and B is that the former uses the bandwidth obtained from the original sample in each bootstrap replicate, while the latter will recalculate a new bandwidth for each bootstrap sample. We compare and discuss the accuracy of the variance estimators given by Versions A and B of Algorithm 5 in Section 3.5 and see that Version A is preferable when the bandwidth is selected by cross-validation, while Version B is recommended when h is chosen with the rule-of-thumb.
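Version A of Algorithm 5 can be sketched in Python as follows. This is an illustrative assumption rather than the thesis code: `s_hat` is a simple kernel-weighted midpoint survival estimator standing in for $\hat{S}_h(t|z_0)$ (the thesis uses the generalized Turnbull estimator), and the simulated interval-censored data are made up for the example; only the resampling logic mirrors Algorithm 5, with the bandwidth from the original sample reused in every replicate.

```python
import math
import random

def s_hat(data, t, z0, h):
    """Stand-in for S_hat_h(t | z0): kernel-weighted proportion of interval
    midpoints exceeding t (a placeholder for the generalized Turnbull
    estimator used in the thesis)."""
    num = den = 0.0
    for (L, R, Z) in data:
        k = math.exp(-0.5 * ((Z - z0) / h) ** 2)   # normal kernel weight
        num += k * (1.0 if (L + R) / 2.0 > t else 0.0)
        den += k
    return num / den

def bootstrap_variance_A(data, t, z0, h, K=200, seed=42):
    """Algorithm 5, Version A: resample the (L_i, R_i, Z_i) triplets with
    replacement K times, recompute the estimator with the *same* bandwidth
    h each time, and return the sample variance of the replicates."""
    rng = random.Random(seed)
    n = len(data)
    reps = []
    for _ in range(K):
        boot = [data[rng.randrange(n)] for _ in range(n)]   # Step 1
        reps.append(s_hat(boot, t, z0, h))                  # Step 2, Version A
    mean = sum(reps) / K
    return sum((s - mean) ** 2 for s in reps) / (K - 1)     # Step 3

# Simulated interval-censored data: T | Z ~ Weibull(shape 3, scale Z/10),
# censored into intervals of width 0.5 on a fixed inspection grid.
random.seed(0)
data = []
for _ in range(100):
    z = random.uniform(5.0, 25.0)
    T = random.weibullvariate(z / 10.0, 3.0)
    L = math.floor(T / 0.5) * 0.5
    data.append((L, L + 0.5, z))

v = bootstrap_variance_A(data, t=1.0, z0=15.0, h=2.0)
```

Version B would differ only in recomputing a bandwidth $h^{(k)}(n)$ from each bootstrap sample before calling `s_hat`.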

