Generalized Linear Longitudinal Semi-parametric Models with Time Dependent Covariates

by

© Vineetha Warriyar K. V

A thesis submitted to the School of Graduate Studies
in partial fulfilment of the requirements for the degree of
Doctor of Philosophy

Department of Mathematics and Statistics
Memorial University of Newfoundland

November 2012

St. John's, Newfoundland

Abstract

Longitudinal data analysis is challenging because of the difficulties in modelling the correlations among the repeated responses, especially when the associated covariates are time dependent. Recent studies have examined correlations for both linear and discrete unbalanced longitudinal data, which are modelled following a Gaussian-type auto-regressive moving average (ARMA) class of auto-correlations. However, these studies were confined to a regression setup where the regression function is completely specified. In this thesis, we consider a semi-parametric regression setup in which the regression function involves a specified as well as an unspecified function over time.

Under the ARMA type correlation structure, we provide a semi-parametric generalized quasi-likelihood (SGQL) approach for the estimation of the main regression parameters. The proposed inference approach is compared with some existing generalized estimating equation (GEE) approaches mainly through simulation studies. The linear longitudinal semi-parametric model, for its foundational nature, is discussed in detail. Theoretical details on semi-parametric estimation for longitudinal count and binary data are also provided.

Acknowledgements

I would like to express sincere thanks to my supervisor, Dr. Brajendra Sutradhar. His support, vast knowledge and logical way of thinking have been invaluable throughout my work.

Thanks to all my friends, colleagues and house mates for putting up with me. I owe a great deal for their encouragement and moral support.

A special thanks to the Department of Mathematics and Statistics at Memorial University for providing financial and academic assistance for my research, especially all administrative staff members for the help they offered me during my study.

I would like to thank all the examiners of my thesis, Dr. Mary Thompson, Dr. Gary Sneddon and Dr. Alwell Oyet, for their invaluable comments and suggestions.

I am so indebted to my parents, my sister, my in-laws, and my uncle Dr. Asokan M. Variyath, who motivated me to explore the wonderful world of Statistics, for their constant support and encouragement. Indeed, I am grateful to my husband, Ranjith, for his encouragement, confidence and forbearance. Without my family's understanding and love, I would not have been able to finish this thesis. Above all, I thank Almighty God, without whose blessing I would never have been able to complete this work.

- Vineetha Warriyar K. V

I lovingly dedicate this thesis

to my husband Ranjith, who supported me every step of the way.

Contents

Abstract
Acknowledgements
List of Tables
List of Figures

1 Background of the Problem
1.1 Generalized linear models (GLMs)
1.1.1 Quasi-likelihood estimation for $\beta$
1.2 Semi-parametric GLMs
1.2.1 Linear model
1.2.1.1 Estimation of non-parametric function $\gamma(z_0)$
1.2.1.2 Estimation of regression effects $\beta$
1.2.2 Count data model
1.2.2.1 Estimation of non-parametric function $\gamma(z_0)$
1.2.2.2 Estimation of regression effects $\beta$
1.2.3 Binary data model
1.2.3.1 Estimation of non-parametric function $\gamma(z_0)$
1.2.3.2 Estimation of regression effects $\beta$
1.3 Generalized linear longitudinal models (GLLMs)
1.4 Semi-parametric GLLMs
1.5 Objective of the thesis

2 Semi-parametric Linear Longitudinal Models
2.1 Existing semi-parametric estimation methods
2.1.1 PSSGEE approach
2.1.1.1 Estimation of non-parametric function
2.1.1.2 Estimation of regression effects
2.1.1.3 Estimation of the 'working' correlation parameter $\alpha$
2.1.2 Partially standardized semi-parametric heteroscedastic GEE (PSSHGEE) approach
2.2 Proposed FSSGQL approach
2.2.1 Estimation of non-parametric function
2.2.2 Estimation of $\beta$
2.2.2.1 Basic properties of $\hat\beta_{FSSGQL}$
2.2.3 Estimation of $\rho$ and $\sigma^2$
2.3 A simulation study
2.3.1 Simulation design
2.3.2 Data generation and simulation results

3 Semi-parametric Longitudinal Models for Discrete Data with Non-stationary Correlation Structures
3.1 Semi-parametric longitudinal models for count data with non-stationary correlation structures
3.1.1 Stationary correlation models for count data in semi-parametric setup
3.1.2 Non-stationary correlation models for count data
3.1.2.1 Non-stationary AR(1) models in semi-parametric setup
3.1.2.2 Non-stationary MA(1) models in semi-parametric setup
3.1.2.3 Non-stationary EQC models in semi-parametric setup
3.2 Estimation in semi-parametric models for longitudinal count data
3.2.1 Estimation of non-parametric function $\gamma(\cdot)$
3.2.2 Estimation of $\beta$
3.2.2.1 Naive GQL estimation approach
3.2.2.2 PSSGQL estimation under non-stationary (ns) correlation structure
3.2.2.3 Estimation of correlation index parameter $\rho$
3.2.2.4 FSSGQL estimation under non-stationary correlation structure
3.2.2.5 Existing PSSGEE approach
3.2.2.6 Estimation of 'working' correlation parameter $\alpha$
3.3 Semi-parametric longitudinal models for binary data with non-stationary correlation structures
3.3.1 Non-stationary correlation models for binary data
3.3.1.1 Non-stationary AR(1) models in semi-parametric setup
3.3.2 Non-stationary MA(1) models in semi-parametric setup
3.3.3 Non-stationary EQC models in semi-parametric setup
3.4 Estimation in semi-parametric models in longitudinal binary data
3.4.1 Estimation of non-parametric function $\gamma(\cdot)$
3.4.1.1 PSSGQL(ns) estimation of $\beta$
3.4.1.2 Estimation of correlation index parameter $\rho$
3.4.1.3 FSSGQL(ns) estimation of $\beta$

4 Empirical Study for Semi-parametric Longitudinal Count Data Models
4.1 Simulation design
4.2 Data generation
4.3 NGQL estimation: A biased approach
4.4 A finite sample efficiency comparison between PSSGQL(ns) and PSSGEE estimations
4.5 Performance of the FSSGQL(ns) estimation

5 Concluding Remarks

Bibliography

List of Tables

2.1 Simulated means (SMs) and simulated standard errors (SSEs) of the estimates of regression parameters $\beta_1 = 1$ and $\beta_2 = 0.5$, under AR(1) correlation model for selected values of the model parameters $\phi$ and $\sigma^2$; with $\gamma(t) = 3 + 2\left(t - \frac{n+1}{2}\right) + \left(t - \frac{n+1}{2}\right)^2$; K=100; n=4; and 1000 simulations.

2.2 Simulated means (SMs) and simulated standard errors (SSEs) of the estimates of regression parameters $\beta_1 = 1$ and $\beta_2 = 0.5$, under MA(1) correlation model for selected values of the model parameters $\theta$ and $\sigma^2$; with $\gamma(t) = 3 + 2\left(t - \frac{n+1}{2}\right) + \left(t - \frac{n+1}{2}\right)^2$; K=100; n=4; and 1000 simulations.

2.3 Simulated means (SMs) and simulated standard errors (SSEs) of the estimates of regression parameters $\beta_1 = 1$ and $\beta_2 = 0.5$, under equi-correlation model for selected values of the model parameters $\zeta$ and $\sigma^2$; with $\gamma(t) = 3 + 2\left(t - \frac{n+1}{2}\right) + \left(t - \frac{n+1}{2}\right)^2$; K=100; n=4; and 1000 simulations.

2.4 Simulated means (SMs) and simulated standard errors (SSEs) of the estimates of regression parameters $\beta_1 = 1$ and $\beta_2 = 0.5$, under AR(1) correlation model for selected values of the model parameters $\phi$ and $\sigma^2$; with $\gamma(t) = \sin 2t$; K=100; n=4; and 1000 simulations.

2.5 Simulated means (SMs) and simulated standard errors (SSEs) of the estimates of regression parameters $\beta_1 = 1$ and $\beta_2 = 0.5$, under MA(1) correlation model for selected values of the model parameters $\theta$ and $\sigma^2$; with $\gamma(t) = \sin 2t$; K=100; n=4; and 1000 simulations.

2.6 Simulated means (SMs) and simulated standard errors (SSEs) of the estimates of regression parameters $\beta_1 = 1$ and $\beta_2 = 0.5$, under equi-correlation model for selected values of the model parameters $\zeta$ and $\sigma^2$; with $\gamma(t) = \sin 2t$; K=100; n=4; and 1000 simulations.

4.1 Simulated means (SMs), simulated standard errors (SSEs) and mean squared errors (MSEs) of the naive estimates of regression parameters $\beta$ under non-stationary AR(1) correlation model for selected values of correlation index parameter $\rho$ with K=100; n=4; and 1000 simulations.

4.2 Simulated means (SMs), simulated standard errors (SSEs) and mean squared errors (MSEs) of the PSSGQL and PSSGEE estimates of regression parameters $\beta_1 = 0.0$ and $\beta_2 = 0.0$, under non-stationary AR(1) correlation model for selected values of correlation index parameter $\rho$ with K=100; n=4; and 1000 simulations.

4.3 Simulated means (SMs), simulated standard errors (SSEs) and mean squared errors (MSEs) of the PSSGQL and PSSGEE estimates of regression parameters $\beta_1 = 1.0$ and $\beta_2 = 1.0$, under non-stationary AR(1) correlation model for selected values of correlation index parameter $\rho$ with K=100; n=4; and 1000 simulations.

4.4 Simulated means (SMs), simulated standard errors (SSEs) and mean squared errors (MSEs) of the PSSGQL and PSSGEE estimates of regression parameters $\beta_1 = 0.5$ and $\beta_2 = 0.5$, under non-stationary AR(1) correlation model for selected values of correlation index parameter $\rho$ with K=100; n=4; and 1000 simulations.

4.5 Simulated means (SMs), simulated standard errors (SSEs) and mean squared errors (MSEs) of the FSSGQL(ns) estimates of regression parameter $\beta$ under non-stationary AR(1) correlation model for selected values of correlation index parameter $\rho$ with K=100; n=4; and 1000 simulations.

List of Figures

2.1 Efficiency comparisons of various semi-parametric methods for the estimates of $\beta_1$ with $\gamma(t) = 3 + 2\left(t - \frac{n+1}{2}\right) + \left(t - \frac{n+1}{2}\right)^2$, under selected correlation processes: AR(1) with $\phi = 0.8$, MA(1) with $\theta = 0.4$ and EQC with $\zeta = 0.8$.

2.2 Efficiency comparisons of various semi-parametric methods for the estimates of $\beta_2$ with $\gamma(t) = 3 + 2\left(t - \frac{n+1}{2}\right) + \left(t - \frac{n+1}{2}\right)^2$, under selected correlation processes: AR(1) with $\phi = 0.8$, MA(1) with $\theta = 0.4$ and EQC with $\zeta = 0.8$.

2.3 Efficiency comparisons of various semi-parametric methods for the estimates of $\beta_1$ with $\gamma(t) = \sin 2t$, under selected correlation processes: AR(1) with $\phi = 0.8$, MA(1) with $\theta = 0.4$ and EQC with $\zeta = 0.8$.

2.4 Efficiency comparisons of various semi-parametric methods for the estimates of $\beta_2$ with $\gamma(t) = \sin 2t$, under selected correlation processes: AR(1) with $\phi = 0.8$, MA(1) with $\theta = 0.4$ and EQC with $\zeta = 0.8$.

2.5 Simulated means of estimates of the non-parametric function ($\gamma(t) = 3 + 2\left(t - \frac{4+1}{2}\right) + \left(t - \frac{4+1}{2}\right)^2$) under the true correlation matrix (TCM) and other selected correlation based FSSGQL method with AR(1) correlated errors.

2.6 Simulated means of estimates of the non-parametric function ($\gamma(t) = 3 + 2\left(t - \frac{4+1}{2}\right) + \left(t - \frac{4+1}{2}\right)^2$) under the true correlation matrix (TCM) and other selected correlation based FSSGQL method with MA(1) correlated errors.

2.7 Simulated means of estimates of the non-parametric function ($\gamma(t) = 3 + 2\left(t - \frac{4+1}{2}\right) + \left(t - \frac{4+1}{2}\right)^2$) under the true correlation matrix (TCM) and other selected correlation based FSSGQL method with equi-correlated errors.

2.8 Simulated means of estimates of the non-parametric function ($\gamma(t) = \sin 2t$) under selected correlation based FSSGQL method with equi-correlated errors.

4.1 Simulated means of estimates of $\gamma(t)$ for PSSGQL and PSSGEE methods, and true values of $\gamma(t)$ under non-stationary AR(1) correlation models for count data with a correlation index parameter $\rho = 0.8$ and regression parameters $(\beta_1, \beta_2)' = (0, 0)'$.

4.2 Simulated means of estimates of $\gamma(t)$ for PSSGQL and PSSGEE methods, and true values of $\gamma(t)$ under non-stationary AR(1) correlation models for count data with a correlation index parameter $\rho = 0.8$ and regression parameters $(\beta_1, \beta_2)' = (1, 1)'$.

4.3 Simulated means of estimates of $\gamma(t)$ for PSSGQL and PSSGEE methods, and true values of $\gamma(t)$ under non-stationary AR(1) correlation models for count data with a correlation index parameter $\rho = 0.8$ and regression parameters $(\beta_1, \beta_2)' = (0.5, 0.5)'$.

4.4 Simulated means of estimates of $\gamma(t)$ for FSSGQL(ns) method and true values of $\gamma(t)$ under non-stationary AR(1) correlation models for count data with regression parameters $(\beta_1, \beta_2)' = (0, 0)'$.

4.5 Simulated means of estimates of $\gamma(t)$ for FSSGQL(ns) method and true values of $\gamma(t)$ under non-stationary AR(1) correlation models for count data with regression parameters $(\beta_1, \beta_2)' = (0.5, 0.5)'$.

Chapter 1 Background of the Problem

Longitudinal studies are common in many scientific research areas such as clinical trials, economics, public health, agriculture, and so on. In these studies, the responses along with the covariates are collected from individuals over a period of time. In many cases, the time points are equally spaced. Examples include (1) the Ohio asthma data [Zeger, Liang and Albert (1988)] collected from 537 children every year over a period of four years; (2) the health care utilization data [Sutradhar (2003, page 391)] collected by the General Hospital of the city of St. John's, Newfoundland, Canada, which contain the number of yearly visits to a physician by individuals over four consecutive years; and (3) the survey of labour and income dynamics (SLID) data on unemployment status, among others, collected by Statistics Canada [Sutradhar (2011)] every year over a period of six or more years. There are other situations where a respondent reports a response whenever an event occurs, so that the time points may not be equi-spaced. Because the repeated data are likely to be correlated, it is important to take such correlations into account for efficient inferences about the regression effects involved in the model.

However, the modelling of the correlations, especially when the responses are discrete, is difficult even if the responses are collected over equi-spaced time points. In a fixed regression setup, Sutradhar (2010) suggested a Gaussian-type ARMA class of auto-correlation models appropriate for both linear and discrete longitudinal data. These regression models, however, may be inadequate in situations where a specified (or fixed) regression function is not sufficient to explain the responses completely. In such cases, one may extend these models by adding an unspecified non-parametric function in time to the fixed regression function. This leads to a semi-parametric regression model setup where longitudinal responses still follow a suitable correlation structure. There exist generalized estimating equation (GEE) based approaches to deal with the inferences for the aforementioned semi-parametric models in the longitudinal setup, where the longitudinal correlations are not modelled. In this thesis, however, we concentrate on the semi-parametric inferences for repeated data which follow an ARMA-type class of auto-correlations.

In order to give a background for this semi-parametric modelling and inference problem in the longitudinal setup, we first provide the notation and an overview of the semi-parametric problem in the independence setup in Sections 1.1 and 1.2. A brief overview of the same semi-parametric problem in the longitudinal setup is provided in Sections 1.3 and 1.4.

1.1 Generalized linear models (GLMs)

Consider a GLM regression setup [Nelder and Wedderburn (1972)] in which exponential family based independent responses $\{y_i\}$, $i = 1, \ldots, K$, are observed. Let $x_i = (x_{i1}, \ldots, x_{ip})'$ be a multidimensional covariate vector corresponding to $y_i$ for the $i$th individual. Suppose that the mean response $\mu_i(\beta) = E(Y_i)$ is influenced by a specified fixed regression function (linear predictor) $x_i'\beta$ with $\beta = (\beta_1, \ldots, \beta_p)'$. The density of the exponential family based response $y_i$ can be written as

$$f(y_i) = \exp\left[ y_i\theta_i - a(\theta_i) + b(y_i) \right], \tag{1.1}$$

where $a(\cdot)$ and $b(\cdot)$ are known functional forms such that $b(\cdot)$ depends only on $y_i$, and the canonical parameter $\theta_i$ is defined with a suitable link function $h(\cdot)$ as

$$\theta_i = h(x_i'\beta). \tag{1.2}$$

The parameter $\theta_i$ is related to the mean response through

$$\mu_i(\beta) = E(Y_i) = a'(\theta_i), \tag{1.3}$$

where $a'(\cdot)$ is the first derivative of $a(\cdot)$ with respect to $\theta_i$. Also, it follows that the variance of $Y_i$ is

$$\sigma_{ii}(\beta) = \mathrm{var}(Y_i) = a''(\theta_i), \tag{1.4}$$

where $a''(\cdot)$ is the second derivative of $a(\cdot)$ with respect to $\theta_i$.

1.1.1 Quasi-likelihood estimation for $\beta$

In the above exponential setup, the regression parameter $\beta$ is involved in $\mu_i(\beta) = a'(\theta_i)$ as well as in $\sigma_{ii}(\beta) = a''(\theta_i)$. Since $\sigma_{ii}(\beta)$ is a function of the mean response, it is sufficient to estimate $\beta$ involved in $\mu_i(\beta)$. When the density function is not known, and only the mean and variance are given, Wedderburn (1974) proposed the quasi-likelihood (QL) estimation approach to estimate the regression parameter. In this approach, one solves the QL estimating equation

$$\sum_{i=1}^{K} \frac{\partial a'(\theta_i)}{\partial \beta}\left[a''(\theta_i)\right]^{-1}\left(y_i - a'(\theta_i)\right) = \sum_{i=1}^{K} \frac{\partial \mu_i(\beta)}{\partial \beta}\left[\sigma_{ii}(\beta)\right]^{-1}\left(y_i - \mu_i(\beta)\right) = 0 \tag{1.5}$$

[see also McCullagh (1983), McCullagh and Nelder (1989)]. The estimate $\hat\beta_{QL}$ obtained by solving (1.5) is consistent and highly efficient. This is because, under the exponential family setup, the QL estimate turns out to be the likelihood estimate, which is known to be optimal (highly efficient) [Sutradhar (2010a)].
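As an illustration of how an estimating equation of the form (1.5) can be solved numerically, the following sketch treats the Poisson case with a log link, where $\partial\mu_i/\partial\beta = \mu_i x_i$ and $\sigma_{ii} = \mu_i$, so that (1.5) becomes $\sum_i x_i (y_i - \mu_i) = 0$. The function names `solve` and `ql_poisson`, the zero starting value and the stopping rule are assumptions made for this illustration only; they are not taken from the thesis.

```python
import math


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix [A | b]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def ql_poisson(X, y, tol=1e-10, max_iter=50):
    """Newton-type iteration for sum_i x_i (y_i - mu_i) = 0 with mu_i = exp(x_i' beta).

    X is a list of covariate rows and y a list of responses (hypothetical inputs).
    """
    p = len(X[0])
    beta = [0.0] * p  # assumed starting value
    for _ in range(max_iter):
        mu = [math.exp(sum(xj * bj for xj, bj in zip(x, beta))) for x in X]
        # Left-hand side of the QL estimating equation at the current beta.
        score = [sum(x[j] * (yi - mi) for x, yi, mi in zip(X, y, mu))
                 for j in range(p)]
        # Matrix sum_i mu_i x_i x_i', the negative derivative of the score.
        info = [[sum(mi * x[j] * x[k] for x, mi in zip(X, mu)) for k in range(p)]
                for j in range(p)]
        step = solve(info, score)
        beta = [bj + sj for bj, sj in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta
```

With noise-free responses $y_i = \exp(x_i'\beta)$ the estimating equation is satisfied exactly at the generating $\beta$, which gives a simple way to check the routine.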

1.2 Semi-parametric GLMs

In semi-parametric models, the mean response $\mu_i(\beta)$ depends not only on a fixed regression function, but also on an unspecified (non-parametric) smooth function, namely $\gamma(z_i)$, where $z_i$ is an auxiliary covariate which influences the response $y_i$. Then $\mu_i(\beta)$ becomes a function of an unknown parameter vector $\beta$ and an unknown smooth function $\gamma(z_i)$, which we abbreviate as

$$\mu_i(\beta, \gamma(z_i)). \tag{1.6}$$

In this setup, the canonical parameter $\theta_i$ defined in (1.2) has the form

$$\theta_i = h\left(x_i'\beta + \gamma(z_i)\right). \tag{1.7}$$

It is clear that the main regression parameter $\beta$ can no longer be estimated without bias if the estimation of $\gamma(z_i)$ is ignored. The semi-parametric GLMs are more flexible than the parametric GLMs, especially when the regression function in fixed covariates is insufficient to explain the mean response.

Even though the estimation of both the fixed regression parameter vector $\beta$ and the non-parametric function $\gamma(\cdot)$ is of interest, many early works [Staniswalis (1989), and Muller (1988)] concentrated on the estimation of the non-parametric mean function, which is the same as substituting $\beta = 0$ in (1.7). To deal with this type of non-parametric regression estimation there exist many kernel methods and their variants, such as the Nadaraya-Watson kernel regression estimation [Nadaraya (1964), Watson (1964), Bierens (1987), Andrews (1995)], local linear and polynomial regression [Cleveland (1979), Fan (1992, 1993), Stone (1980, 1982)], recursive kernel estimation [see e.g., Ahmad and Lin (1976), Greblicki and Krzyzak (1980)], spline smoothing [Whittaker (1923), Eubank (1988), Wahba (1990)], and nearest neighbour estimation [Royall (1966), Stone (1977)]. Among these techniques, the simpler Nadaraya-Watson kernel estimator, or the local constant estimator, for $\gamma(z)$ at a given covariate level $z = z_0$ involved in the linear model

$$y_i = \gamma(z_i) + \epsilon_i$$

has the form

$$\hat\gamma(z_0) = \frac{\sum_{i=1}^{K} y_i\, K^{*}\!\left(\frac{z_0 - z_i}{b}\right)}{\sum_{i=1}^{K} K^{*}\!\left(\frac{z_0 - z_i}{b}\right)},$$

where $K^{*}(\cdot)$ is a suitable kernel density function and $b$ is known as the bandwidth.

The selection of an appropriate bandwidth parameter $b$ is always a problem in non-parametric regression [Silverman (1986)]. In practice, one tries to choose a value of $b$ that keeps both the bias and the variance of the estimator small. Many data-based methods, such as cross validation [see Stone (1974), Picard and Cook (1984), Ansley, Kohn, and Tharm (1991)] and generalized cross validation [Craven and Wahba (1979)], were discussed in the literature for choosing an appropriate $b$. Altman (1990) suggested that these commonly used bandwidth selection techniques do not perform well when the errors are correlated. Hence we excluded these techniques and followed Pagan and Ullah (1999), who proposed an optimum value for the bandwidth which minimizes the approximate mean integrated squared error. The authors recommended $b \propto n^{-1/5}$, and suggested that this is the only choice of $b$ for which the bias and variance are of the same order of magnitude. Thus, as a practical choice, we will consider $b = K^{-1/5}$.
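The Nadaraya-Watson (local constant) estimator and the practical bandwidth $b = K^{-1/5}$ described above can be sketched as follows. The Gaussian kernel is one possible choice of kernel density, and the function names `gaussian_kernel` and `nadaraya_watson` are hypothetical names introduced for this illustration.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel density function."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def nadaraya_watson(z0, z, y, b=None):
    """Local constant estimate of gamma(z0) from pairs (z_i, y_i), i = 1..K.

    If no bandwidth is supplied, the practical choice b = K^(-1/5) is used.
    The weights can underflow to zero if z0 lies very far from every z_i
    relative to b, in which case the ratio below is undefined.
    """
    K = len(y)
    if b is None:
        b = K ** (-1.0 / 5.0)
    weights = [gaussian_kernel((z0 - zi) / b) for zi in z]
    return sum(w * yi for w, yi in zip(weights, y)) / sum(weights)
```

Because the estimate is a weighted average of the $y_i$, it reproduces a constant response exactly, and at a point with symmetrically placed neighbours it returns the symmetric average.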

In the independence setup, the estimation of both $\beta$ and $\gamma(\cdot)$ is also extensively studied in the literature [e.g., Severini and Staniswalis (1994), Carota and Parmigiani (2002)]. Under the exponential family, for example, Severini and Staniswalis (1994) suggested a semi-parametric QL (SQL) approach for the estimation of $\beta$ and $\gamma(\cdot)$. The authors illustrated their estimation methodology using examples with linear, gamma and binary data. Note that we do not deal with (continuous) gamma data in the thesis; instead, we concentrate on modelling and inferences for linear and discrete data, such as count and binary data, in the semi-parametric setup for independent and longitudinal responses. For convenience, we now provide semi-parametric QL estimation in detail for linear, count and binary data in the independence setup.

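1.2.1 Linear model

Consider the model

$$y_i = x_i'\beta + \gamma(z_i) + \epsilon_i, \tag{1.8}$$

where the $\epsilon_i$'s are independent and identically distributed with mean 0 and variance $\sigma_\epsilon^2$. Thus $\mu_i(\beta, \gamma(z_i)) = E(Y_i) = x_i'\beta + \gamma(z_i)$, that is, the link function $h(\cdot)$ in (1.7) is the identity function. Also, $\mathrm{var}(Y_i) = \sigma_{ii} = \sigma_\epsilon^2$, $i = 1, \ldots, K$.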

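1.2.1.1 Estimation of non-parametric function $\gamma(z_0)$

For model (1.8), the quasi-likelihood function $Q(\mu_i, y_i)$ can be written as

$$Q(\mu_i, y_i) = -\frac{(y_i - \mu_i)^2}{2\sigma_\epsilon^2}.$$

Then, the semi-parametric QL estimating equation for $\gamma(z_0)$ is

$$\sum_{i=1}^{K} w_i(z_0)\, \frac{\partial \mu_i(\beta, \gamma(z_0))}{\partial \gamma(z_0)}\, \frac{y_i - \mu_i(\beta, \gamma(z_0))}{\sigma_\epsilon^2} = 0, \tag{1.9}$$

where

$$w_i(z_0) = \frac{p_i\!\left(\frac{z_0 - z_i}{b}\right)}{\sum_{i=1}^{K} p_i\!\left(\frac{z_0 - z_i}{b}\right)},$$

$p_i(\cdot)$ being a kernel density function. For example, one may choose

$$p_i\!\left(\frac{z_0 - z_i}{b}\right) = \frac{1}{\sqrt{2\pi}\, b} \exp\left(-\frac{1}{2}\left(\frac{z_0 - z_i}{b}\right)^2\right)$$

with a suitable bandwidth $b$. Note that when $w_i(z_0) = 1$, this SQL equation further reduces to the well-known quasi-likelihood estimating equation [Wedderburn (1974)].

Since

$$\frac{\partial \mu_i(\beta, \gamma(z_0))}{\partial \gamma(z_0)} = \frac{\partial \left[x_i'\beta + \gamma(z_0)\right]}{\partial \gamma(z_0)} = 1,$$

the SQL estimating equation (1.9) has the form

$$\sum_{i=1}^{K} w_i(z_0)\left(y_i - x_i'\beta - \gamma(z_0)\right) = 0 \tag{1.10}$$

$$\Rightarrow \sum_{i=1}^{K} w_i(z_0)\left(y_i - x_i'\beta\right) - \sum_{i=1}^{K} w_i(z_0)\gamma(z_0) = 0,$$

yielding an estimate for the non-parametric function $\gamma(z)$ evaluated at $z = z_0$ as

$$\hat\gamma(z_0) = \sum_{i=1}^{K} w_i(z_0)\left(y_i - x_i'\beta\right), \tag{1.11}$$

where $\sum_{i=1}^{K} w_i(z_0) = 1$. Now replacing $z_0$ in (1.11) with $z_i$, we write

$$\hat\gamma(z_i) = \sum_{j=1}^{K} w_j(z_i)\left(y_j - x_j'\beta\right) = \tilde y_i - \tilde x_i'\beta, \tag{1.12}$$

where

$$\tilde y_i = \sum_{j=1}^{K} w_j(z_i)\, y_j \quad \text{and} \quad \tilde x_i = \sum_{j=1}^{K} w_j(z_i)\, x_j. \tag{1.13}$$

Note that the estimator $\hat\gamma(z_i)$ in (1.12) is constructed for a given value of the regression parameter vector $\beta$. Because in practice $\beta$ is unknown, and is in fact the main parameter of interest, we provide the estimating equation for $\beta$ in the following section. These formulas for $\hat\gamma(z_i)$ and $\hat\beta$ are, however, already discussed in the literature; see, for example, Severini and Staniswalis (1994), Speckman (1988) and Hastie and Tibshirani (1990).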

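1.2.1.2 Estimation of regression effects $\beta$

For linear models the QL estimator of $\beta$ has a closed form expression. To derive the estimator, we first write $\mu_i(\beta, \hat\gamma(z_i)) = x_i'\beta + \hat\gamma(z_i)$ and compute

$$\frac{\partial \mu_i(\beta, \hat\gamma(z_i))}{\partial \beta} = \left(x_i' - \tilde x_i'\right)' = x_i - \tilde x_i, \tag{1.14}$$

where $\tilde x_i$ is given in (1.13). Similar to (1.5) we now write the QL estimating equation for $\beta$ as

$$\sum_{i=1}^{K} \frac{\partial \mu_i(\beta, \hat\gamma(z_i))}{\partial \beta}\, \frac{y_i - \mu_i(\beta, \hat\gamma(z_i))}{\sigma_\epsilon^2} = 0,$$

and by substituting $\hat\gamma(z_i) = \tilde y_i - \tilde x_i'\beta$ we obtain

$$\sum_{i=1}^{K} (x_i - \tilde x_i)\left[y_i - x_i'\beta - \tilde y_i + \tilde x_i'\beta\right] = \sum_{i=1}^{K} (x_i - \tilde x_i)\left[(y_i - \tilde y_i) - (x_i - \tilde x_i)'\beta\right] = 0,$$

yielding

$$\sum_{i=1}^{K} (x_i - \tilde x_i)(y_i - \tilde y_i) = \sum_{i=1}^{K} (x_i - \tilde x_i)(x_i - \tilde x_i)'\beta.$$

It then follows that $\hat\beta$ has the closed form expression given by

$$\hat\beta = \left[\sum_{i=1}^{K} (x_i - \tilde x_i)(x_i - \tilde x_i)'\right]^{-1} \sum_{i=1}^{K} (x_i - \tilde x_i)(y_i - \tilde y_i), \tag{1.15}$$

where $\tilde y_i$ and $\tilde x_i$ are given in equation (1.13). The above equation (1.15) is the same as in Severini and Staniswalis (1994) [eqn. (10), page 503] with $D = I$, the identity matrix.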
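The closed form estimator for the linear model can be sketched in a few lines: smooth $y$ and $x$ with normalised kernel weights to obtain $\tilde y_i$ and $\tilde x_i$, regress the residuals $y_i - \tilde y_i$ on $x_i - \tilde x_i$ to obtain $\hat\beta$, and then recover $\hat\gamma(z_i) = \tilde y_i - \tilde x_i'\hat\beta$. The Gaussian kernel, the default bandwidth $K^{-1/5}$ and the names `solve`, `kernel_weights` and `semiparametric_linear_fit` are assumptions made for this illustration.

```python
import math


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix [A | b]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def kernel_weights(z, b):
    """Row i holds the normalised Gaussian weights w_j(z_i), which sum to one.

    The constant 1 / (sqrt(2 pi) b) of the Gaussian density cancels in the ratio.
    """
    W = []
    for zi in z:
        raw = [math.exp(-0.5 * ((zi - zj) / b) ** 2) for zj in z]
        total = sum(raw)
        W.append([r / total for r in raw])
    return W


def semiparametric_linear_fit(X, y, z, b=None):
    """Return (beta_hat, gamma_hat) for y_i = x_i' beta + gamma(z_i) + error."""
    K, p = len(y), len(X[0])
    if b is None:
        b = K ** (-1.0 / 5.0)  # practical bandwidth choice
    W = kernel_weights(z, b)
    # Kernel-smoothed responses and covariates (y-tilde and x-tilde).
    y_s = [sum(w * yj for w, yj in zip(W[i], y)) for i in range(K)]
    X_s = [[sum(w * xj[k] for w, xj in zip(W[i], X)) for k in range(p)]
           for i in range(K)]
    dX = [[X[i][k] - X_s[i][k] for k in range(p)] for i in range(K)]
    dy = [y[i] - y_s[i] for i in range(K)]
    # Closed form: [sum dX dX']^{-1} sum dX dy.
    A = [[sum(d[j] * d[k] for d in dX) for k in range(p)] for j in range(p)]
    c = [sum(d[j] * e for d, e in zip(dX, dy)) for j in range(p)]
    beta = solve(A, c)
    gamma = [y_s[i] - sum(X_s[i][k] * beta[k] for k in range(p)) for i in range(K)]
    return beta, gamma
```

When $\gamma(\cdot)$ is constant and there is no noise, $y_i - \tilde y_i = (x_i - \tilde x_i)'\beta$ holds exactly because the weights sum to one, so the routine should return the generating $\beta$ and the constant; this makes a convenient sanity check.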

1.2.2 Count data model

There are many situations in practice where one is interested in analyzing count and binary data to understand the effect of covariates on the responses. Similar to the normally distributed responses considered in the previous section, these responses also follow the exponential family. However, in the present semi-parametric setup we are interested in examining the regression effect when the mean response is assumed to consist of the fixed regression function as well as a non-parametric smooth function.

For count responses, the Poisson density function $f(y_i)$ can be expressed as a special form of the exponential family density (1.1) given by

$$f(y_i) = \frac{1}{y_i!} \exp\left[y_i \log \mu_i - \mu_i\right], \tag{1.16}$$

where $\theta_i = \log \mu_i$ and $a(\theta_i) = \mu_i$.

Thus we write the Poisson mean and variance as

$$E(Y_i) = a'(\theta_i) = \mu_i(\beta, \gamma(z_i)) \quad \text{and} \quad \mathrm{var}(Y_i) = a''(\theta_i) = \mu_i(\beta, \gamma(z_i)),$$

where $\mu_i(\beta, \gamma(z_i)) = \exp(x_i'\beta + \gamma(z_i))$, which is different from the mean function in (1.8) under the linear case.

1.2.2.1 Estimation of non-parametric function $\gamma(z_0)$

The SQL estimating equation for $\gamma(z_0)$ in the count data case has the form

$$\sum_{i=1}^{K} w_i(z_0)\, \frac{\partial \mu_i(\beta, \gamma(z_0))}{\partial \gamma(z_0)}\, \frac{y_i - \mu_i(\beta, \gamma(z_0))}{\mu_i(\beta, \gamma(z_0))} = 0, \tag{1.17}$$

where $\mu_i(\beta, \gamma(z_0)) = \exp(x_i'\beta + \gamma(z_0))$. Because

$$\frac{\partial \mu_i(\beta, \gamma(z_0))}{\partial \gamma(z_0)} = \frac{\partial \exp(x_i'\beta + \gamma(z_0))}{\partial \gamma(z_0)} = \exp(x_i'\beta + \gamma(z_0)),$$

(1.17) reduces to

$$\sum_{i=1}^{K} w_i(z_0)\left[y_i - \exp(x_i'\beta + \gamma(z_0))\right] = 0, \tag{1.18}$$

and hence

$$\sum_{i=1}^{K} w_i(z_0)\, y_i = \exp(\gamma(z_0)) \sum_{i=1}^{K} w_i(z_0) \exp(x_i'\beta).$$

The estimator for $\gamma(z)$ computed at $z = z_0$ under the Poisson model is then given by

$$\hat\gamma(z_0) = \log\left(\frac{\sum_{i=1}^{K} w_i(z_0)\, y_i}{\sum_{i=1}^{K} w_i(z_0) \exp(x_i'\beta)}\right).$$

Thus for $z = z_i$ the estimator of $\gamma(z)$ has the form

$$\hat\gamma(z_i) = \log\left(\frac{\sum_{j=1}^{K} w_j(z_i)\, y_j}{\sum_{j=1}^{K} w_j(z_i) \exp(x_j'\beta)}\right). \tag{1.19}$$

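1.2.2.2 Estimation of regression effects $\beta$

Unlike the linear models, the estimator of $\beta$ has no explicit form under the Poisson count data model, and one has to estimate $\beta$ by solving a non-linear equation iteratively. For this purpose, similar to (1.5), the QL estimating equation for $\beta$ is

$$\sum_{i=1}^{K} \frac{\partial \mu_i(\beta, \hat\gamma(z_i))}{\partial \beta}\left[\mu_i(\beta, \hat\gamma(z_i))\right]^{-1}\left(y_i - \mu_i(\beta, \hat\gamma(z_i))\right) = 0, \tag{1.20}$$

where

$$\frac{\partial \mu_i(\beta, \hat\gamma(z_i))}{\partial \beta} = \mu_i(\beta, \hat\gamma(z_i))\left[x_i + \frac{\partial \hat\gamma(z_i)}{\partial \beta}\right], \tag{1.21}$$

with $\hat\gamma(z_i)$ as in (1.19). The derivative $\partial \hat\gamma(z_i)/\partial \beta$ is computed as

$$\frac{\partial \hat\gamma(z_i)}{\partial \beta} = -\left[\frac{\sum_{j=1}^{K} w_j(z_i)\, y_j}{\sum_{j=1}^{K} w_j(z_i) \exp(x_j'\beta)}\right]^{-1} \frac{\sum_{j=1}^{K} w_j(z_i)\, y_j \sum_{j=1}^{K} w_j(z_i) \exp(x_j'\beta)\, x_j}{\left[\sum_{j=1}^{K} w_j(z_i) \exp(x_j'\beta)\right]^2} = -\frac{\sum_{j=1}^{K} w_j(z_i) \exp(x_j'\beta)\, x_j}{\sum_{j=1}^{K} w_j(z_i) \exp(x_j'\beta)}. \tag{1.22}$$

Now by using (1.22) in (1.21) we write

$$\frac{\partial \mu_i(\beta, \hat\gamma(z_i))}{\partial \beta} = \mu_i(\beta, \hat\gamma(z_i))\left[x_i - \frac{\sum_{j=1}^{K} w_j(z_i) \exp(x_j'\beta)\, x_j}{\sum_{j=1}^{K} w_j(z_i) \exp(x_j'\beta)}\right].$$

Consequently, the estimating equation (1.20) leads to

$$\sum_{i=1}^{K} \left[x_i - \frac{\sum_{j=1}^{K} w_j(z_i) \exp(x_j'\beta)\, x_j}{\sum_{j=1}^{K} w_j(z_i) \exp(x_j'\beta)}\right]\left(y_i - \hat\mu_i\right) = 0,$$

where $\hat\mu_i = \exp(x_i'\beta + \hat\gamma(z_i))$. Now by defining

$$\tilde x_i = \frac{\sum_{j=1}^{K} w_j(z_i) \exp(x_j'\beta)\, x_j}{\sum_{j=1}^{K} w_j(z_i) \exp(x_j'\beta)}, \tag{1.23}$$

we rewrite the estimating equation as

$$\sum_{i=1}^{K} (x_i - \tilde x_i)(y_i - \hat\mu_i) = 0. \tag{1.24}$$

The estimating equation (1.24) can be solved iteratively using the well-known Newton-Raphson method. The iterative equation has the form

$$\hat\beta_{(r+1)} = \hat\beta_{(r)} - \left[\frac{\partial}{\partial \beta'} \sum_{i=1}^{K} (x_i - \tilde x_i)(y_i - \hat\mu_i)\right]^{-1}\left[\sum_{i=1}^{K} (x_i - \tilde x_i)(y_i - \hat\mu_i)\right]$$

$$= \hat\beta_{(r)} + \left[\sum_{i=1}^{K} (x_i - \tilde x_i)\, \hat\mu_i\, (x_i - \tilde x_i)'\right]^{-1}\left[\sum_{i=1}^{K} (x_i - \tilde x_i)(y_i - \hat\mu_i)\right], \tag{1.25}$$

where the quantities in the square brackets are evaluated at $\beta = \hat\beta_{(r)}$. The iteration is continued until convergence to obtain the final estimate $\hat\beta$.

Severini and Staniswalis (1994, Example 2, page 503) provided an estimate for $\gamma(z_i)$ under the gamma distribution, which is similar to, but different from, (1.19). Hence, for the estimation of $\beta$, we have provided the exact iterative equation in (1.25) under the Poisson case.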
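The Poisson iteration can be sketched as an alternation between the closed form $\hat\gamma(z_i)$ in (1.19), the weighted covariate average $\tilde x_i$ in (1.23) and the update (1.25). The Gaussian kernel weights, the default bandwidth $K^{-1/5}$, the zero starting value, the stopping rule and the names `solve`, `kernel_weights` and `sql_poisson` are assumptions of this illustration, not specifications from the thesis.

```python
import math


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix [A | b]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def kernel_weights(z, b):
    """Row i holds the normalised Gaussian weights w_j(z_i), which sum to one."""
    W = []
    for zi in z:
        raw = [math.exp(-0.5 * ((zi - zj) / b) ** 2) for zj in z]
        total = sum(raw)
        W.append([r / total for r in raw])
    return W


def sql_poisson(X, y, z, b=None, tol=1e-10, max_iter=100):
    """Iterate (1.25); returns (beta_hat, gamma_hat).

    gamma_hat is computed from the last beta used before the final update,
    so it lags beta_hat by one step that is smaller than tol at convergence.
    """
    K, p = len(y), len(X[0])
    if b is None:
        b = K ** (-1.0 / 5.0)
    W = kernel_weights(z, b)
    beta = [0.0] * p  # assumed starting value
    for _ in range(max_iter):
        e = [math.exp(sum(xk * bk for xk, bk in zip(x, beta))) for x in X]
        gamma, mu, dX = [], [], []
        for i in range(K):
            num = sum(w * yj for w, yj in zip(W[i], y))
            den = sum(w * ej for w, ej in zip(W[i], e))
            g = math.log(num / den)  # closed form (1.19)
            x_t = [sum(w * ej * xj[k] for w, ej, xj in zip(W[i], e, X)) / den
                   for k in range(p)]  # weighted average (1.23)
            gamma.append(g)
            mu.append(e[i] * math.exp(g))
            dX.append([X[i][k] - x_t[k] for k in range(p)])
        score = [sum(d[j] * (yi - mi) for d, yi, mi in zip(dX, y, mu))
                 for j in range(p)]
        info = [[sum(mi * d[j] * d[k] for d, mi in zip(dX, mu)) for k in range(p)]
                for j in range(p)]
        step = solve(info, score)  # update term in (1.25)
        beta = [bj + sj for bj, sj in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta, gamma
```

With noise-free responses $y_i = \exp(x_i'\beta + c)$ and a constant $\gamma(\cdot) = c$, the estimating equation (1.24) is satisfied exactly at the generating values, which provides a simple check of the sketch.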

1.2.3 Binary data model

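In the semi-parametric GLM setup for binary responses, the binary distribution is

$$f(y_i) = \mu_i^{y_i}(1 - \mu_i)^{1 - y_i},$$

which is a special case of the exponential family density (1.1) with

$$\theta_i = \log\left(\frac{\mu_i}{1 - \mu_i}\right) \quad \text{and} \quad a(\theta_i) = -\log(1 - \mu_i).$$

In the partially specified regression case we consider $\theta_i = x_i'\beta + \gamma(z_i)$ and it then follows that

$$a(\theta_i) = \log\left[1 + \exp(x_i'\beta + \gamma(z_i))\right],$$

yielding

$$E(Y_i) = a'(\theta_i) = \mu_i(\beta, \gamma(z_i)) = \frac{\exp(x_i'\beta + \gamma(z_i))}{1 + \exp(x_i'\beta + \gamma(z_i))}$$

and

$$\mathrm{var}(Y_i) = a''(\theta_i) = \mu_i(\beta, \gamma(z_i))\left(1 - \mu_i(\beta, \gamma(z_i))\right).$$

1.2.3.1 Estimation of non-parametric function $\gamma(z_0)$

In the binary case, the SQL estimating equation for $\gamma(z)$ at $z = z_0$ is given by

$$\sum_{i=1}^{K} w_i(z_0)\, \frac{\partial \mu_i(\beta, \gamma(z_0))}{\partial \gamma(z_0)}\, \frac{y_i - \mu_i(\beta, \gamma(z_0))}{\mu_i(\beta, \gamma(z_0))\left(1 - \mu_i(\beta, \gamma(z_0))\right)} = 0, \tag{1.26}$$

where $\mu_i(\beta, \gamma(z_0)) = \frac{\exp(x_i'\beta + \gamma(z_0))}{1 + \exp(x_i'\beta + \gamma(z_0))}$. Because

$$\frac{\partial \mu_i(\beta, \gamma(z_0))}{\partial \gamma(z_0)} = \frac{\exp(x_i'\beta + \gamma(z_0))}{1 + \exp(x_i'\beta + \gamma(z_0))} \cdot \frac{1}{1 + \exp(x_i'\beta + \gamma(z_0))} = \mu_i(\beta, \gamma(z_0))\left(1 - \mu_i(\beta, \gamma(z_0))\right),$$

the estimating equation (1.26) reduces to

$$\sum_{i=1}^{K} w_i(z_0)\left[y_i - \mu_i(\beta, \gamma(z_0))\right] = 0, \tag{1.27}$$

which is similar to (1.18). The difference lies in the formula for $\mu_i(\beta, \gamma(z_0))$.

1.2.3.2 Estimation of regression effects $\beta$

For the estimation of $\beta$, the QL estimating equation has the form

$$\sum_{i=1}^{K} \frac{\partial \mu_i(\beta, \gamma(z_i))}{\partial \beta}\left[\mu_i(\beta, \gamma(z_i))\left(1 - \mu_i(\beta, \gamma(z_i))\right)\right]^{-1}\left(y_i - \mu_i(\beta, \gamma(z_i))\right) = 0, \tag{1.28}$$

where

$$\frac{\partial \mu_i(\beta, \gamma(z_i))}{\partial \beta} = \frac{\partial}{\partial \beta}\left[\frac{\exp(x_i'\beta + \gamma(z_i))}{1 + \exp(x_i'\beta + \gamma(z_i))}\right] = \frac{\exp(x_i'\beta + \gamma(z_i))}{\left[1 + \exp(x_i'\beta + \gamma(z_i))\right]^2}\left[x_i + \frac{\partial \gamma(z_i)}{\partial \beta}\right]$$

$$= \mu_i(\beta, \gamma(z_i))\left(1 - \mu_i(\beta, \gamma(z_i))\right)\left[x_i + \frac{\partial \gamma(z_i)}{\partial \beta}\right].$$

The estimating equation in (1.28) then reduces to

$$\sum_{i=1}^{K} \left[x_i + \frac{\partial \gamma(z_i)}{\partial \beta}\right]\left(y_i - \mu_i(\beta, \gamma(z_i))\right) = 0. \tag{1.29}$$

Note that the estimating equation for $\gamma(\cdot)$ in (1.27) and the estimating equation for $\beta$ in (1.29) are the same as those in equations (6) and (8), respectively, in Severini and Staniswalis (1994), and that these equations must be solved iteratively. However, there is a closed form expression (1.19) for $\hat\gamma(\cdot)$ in the Poisson case, whereas the estimating equation (1.27) for the binary case has to be solved iteratively. One needs to solve the estimating equation for $\beta$ iteratively in both the binary and the Poisson cases.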
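Because the left-hand side of the binary estimating equation for $\gamma(z_0)$ is strictly decreasing in $\gamma(z_0)$ for a given $\beta$, its root can be found by any one-dimensional root finder. The sketch below uses plain bisection, chosen here only for robustness; the thesis states that the equation is solved iteratively without prescribing this method. The Gaussian kernel, the search interval and the names `expit` and `gamma_binary` are assumptions of this illustration.

```python
import math


def expit(eta):
    """Logistic function exp(eta) / (1 + exp(eta)), written to avoid overflow."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    t = math.exp(eta)
    return t / (1.0 + t)


def gamma_binary(z0, X, y, z, beta, b, lo=-30.0, hi=30.0, n_iter=200):
    """Solve sum_i w_i(z0) [y_i - mu_i(beta, gamma)] = 0 for gamma by bisection.

    Unnormalised Gaussian weights are used, since rescaling all weights by a
    common constant does not change the root. The interval [lo, hi] is an
    assumed search range.
    """
    w = [math.exp(-0.5 * ((z0 - zi) / b) ** 2) for zi in z]
    lin = [sum(xk * bk for xk, bk in zip(x, beta)) for x in X]

    def g(gam):
        # Left-hand side of the estimating equation; decreasing in gam.
        return sum(wi * (yi - expit(li + gam)) for wi, yi, li in zip(w, y, lin))

    if g(lo) < 0 or g(hi) > 0:
        raise ValueError("no root of the estimating equation in [lo, hi]")
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid  # root lies to the right of mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

In practice this routine would be called for each $z_i$ inside an outer iteration for $\beta$; with binary data the root exists only when the locally weighted mean of the $y_i$ lies strictly between 0 and 1.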

1.3 Generalized linear longitudinal models (GLLMs)

We have discussed the GLMs in the independence setup in Section 1.1 and their generalization to the independent semi-parametric setup in detail in Section 1.2. The purpose of this research is to study models and inferences for semi-parametric longitudinal data. For convenience, in this section, we review the existing models and associated inferences in the longitudinal setup.

In notation, let $y_i = (y_{i1}, \ldots, y_{it}, \ldots, y_{iT})'$ represent the response vector, where $y_{it}$ is the response recorded at time $t$ for the $i$th individual. Suppose that $x_{it} = (x_{it1}, \ldots, x_{itv}, \ldots, x_{itp})'$ is the $p$-dimensional covariate vector corresponding to the scalar $y_{it}$, and $\beta$ is the $p$-dimensional vector of regression effects of $x_{it}$ on $y_{it}$ for all $i = 1, \ldots, K$, and all $t = 1, \ldots, T$. Since the same outcome is measured consecutively over time for each individual, the repeated responses of an individual are likely to be correlated. In this setup we assume that each response $y_{it}$ marginally follows (1.1), but their joint distribution is difficult to write, especially for discrete responses. The mean and variance of the response are denoted by $\mu_{it}(\beta) = a'(\theta_{it}) = E[Y_{it}]$ and $\sigma_{itt}(\beta) = a''(\theta_{it}) = \mathrm{var}[Y_{it}]$. Similar to (1.5), the QL estimating equation for the unknown regression parameter $\beta$ can be written as

$$\sum_{i=1}^{K}\sum_{t=1}^{T} \frac{\partial a'(\theta_{it})}{\partial \beta}\left[a''(\theta_{it})\right]^{-1}\left(y_{it} - a'(\theta_{it})\right) = \sum_{i=1}^{K}\sum_{t=1}^{T} \frac{\partial \mu_{it}(\beta)}{\partial \beta}\left[\sigma_{itt}(\beta)\right]^{-1}\left(y_{it} - \mu_{it}(\beta)\right) = 0. \tag{1.30}$$

The QL estimating equation (1.30) is the same as the independence assumption based QL estimating equation, and its solution provides a consistent, but inefficient, estimate for $\beta$. This is because the observations from the same individual are correlated and (1.30) is written ignoring such correlations. As a remedy, one must take the correlations of longitudinal responses into account to achieve the desired efficiency of the regression estimates.

The relevant works in the field of longitudinal data analysis originated from Liang and Zeger (1986). The authors introduced an extension of the GLM for independent data to the longitudinal setup and proposed the generalized estimating equations (GEEs) to acquire consistent and efficient estimates of the regression parameters involved in the GLLM.

The backbone of their methodology is a 'working' correlation matrix. Liang and Zeger defined the GEE estimating equation as

$$\sum_{i=1}^{K}\frac{\partial \mu_i'(\beta)}{\partial \beta}\,V_i^{-1}(\alpha)\,(y_i-\mu_i(\beta)) = 0, \tag{1.31}$$

where $\mu_i(\beta) = (\mu_{i1}(\beta), \ldots, \mu_{it}(\beta), \ldots, \mu_{iT}(\beta))'$ is the mean vector of $y_i$ and $V_i(\alpha) = A_i^{1/2}R_i(\alpha)A_i^{1/2}$ is the covariance matrix with $A_i = \mathrm{diag}[\sigma_{i11}(\beta), \ldots, \sigma_{itt}(\beta), \ldots, \sigma_{iTT}(\beta)]$, $R_i(\alpha)$ is a 'working' correlation matrix, and $\alpha$ is the 'working' correlation parameter. Subsequent research in the longitudinal data analysis literature shows that, in several situations, these 'working' correlation based regression parameter estimates are inconsistent [Crowder (1995)]. Crowder showed that this consistency breakdown occurs due to the problem in estimating the so-called 'working' correlation parameter $\alpha$. In cases where 'working' correlations are estimable, Sutradhar and Das (1999) showed that even if the estimator of $\alpha$ converges to a value, so that the GEE approach gives consistent estimators of the regression parameters, these estimators may be less efficient than the regression estimators obtained from the independence estimating equations approach. Sutradhar (2003) proposed a generalization of the QL estimation approach, where $\beta$ is obtained by solving the generalized quasi-likelihood (GQL) estimating equation given by

$$\sum_{i=1}^{K}\frac{\partial \mu_i'(\beta)}{\partial \beta}\,\Sigma_i^{-1}(\rho)\,(y_i-\mu_i(\beta)) = 0, \tag{1.32}$$

where $\mu_i(\beta) = (\mu_{i1}(\beta), \ldots, \mu_{it}(\beta), \ldots, \mu_{iT}(\beta))'$ is the mean vector of $y_i$ and $\Sigma_i(\rho) = A_i^{1/2}C_i(\rho)A_i^{1/2}$ is the covariance matrix with $A_i = \mathrm{diag}[\sigma_{i11}(\beta), \ldots, \sigma_{itt}(\beta), \ldots, \sigma_{iTT}(\beta)]$, $C_i(\rho)$ is a general class of auto-correlations, and $\rho$ is a correlation index parameter.

The estimator $\hat\beta_{GQL}$ obtained by solving (1.32) is consistent and highly efficient for $\beta$.
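To make the structure of the GQL estimating equation concrete, the following sketch solves it by Gauss-Newton (Fisher scoring) updates for longitudinal counts. It is only an illustration under stated assumptions: a log link with Poisson-type variances $\sigma_{itt}(\beta) = \mu_{it}(\beta)$, balanced data, and a correlation matrix `C` that is treated as known and common to all individuals. The function name `gql_poisson` and its arguments are hypothetical.

```python
import numpy as np

def gql_poisson(X, y, C, n_iter=25, tol=1e-10):
    """Solve sum_i D_i' Sigma_i^{-1} (y_i - mu_i) = 0 for beta.

    X: (K, T, p) covariates, y: (K, T) counts,
    C: (T, T) correlation matrix, assumed known here.
    """
    K, T, p = X.shape
    beta = np.zeros(p)                           # starting value (assumption)
    for _ in range(n_iter):
        score = np.zeros(p)
        info = np.zeros((p, p))
        for i in range(K):
            mu = np.exp(X[i] @ beta)             # mean vector mu_i(beta)
            D = mu[:, None] * X[i]               # d mu_i / d beta', shape (T, p)
            s = np.sqrt(mu)                      # diagonal of A_i^{1/2}
            Sigma = s[:, None] * C * s[None, :]  # A_i^{1/2} C A_i^{1/2}
            Sinv_D = np.linalg.solve(Sigma, D)   # Sigma_i^{-1} D_i
            score += Sinv_D.T @ (y[i] - mu)      # D_i' Sigma_i^{-1} (y_i - mu_i)
            info += D.T @ Sinv_D                 # D_i' Sigma_i^{-1} D_i
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

Setting `C` to the identity matrix recovers the independence based QL equation (1.30); in practice the correlation index parameter would itself be estimated rather than supplied.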


1.4 Semi-parametric GLLMs

In the above mentioned longitudinal studies, the regression functions involved in the longitudinal model are fully specified. For example, in the linear longitudinal setup, $\mu_{it}(\beta)$ is expressed as $\mu_{it}(\beta) = x_{it}'\beta$. This leads to parametric modelling of marginal longitudinal models [Gilmour, Anderson, and Rae (1985), Liang and Zeger (1986), Zeger and Liang (1986), Fitzmaurice, Laird and Rotnitzky (1993)]. However, there are situations where the regression functions involved in the model are only partially specified, which leads to semi-parametric models in the longitudinal setup. In the linear longitudinal setup, semi-parametric models have been studied by Severini and Wong (1992), Zeger and Diggle (1994), Moyeed and Diggle (1994), You and Chen (2007), Fan, Huang and Li (2007), Fan and Wu (2008), and Li (2011). Some of these studies used the 'working' correlations based GEE approach for the estimation of the regression parameters, and the non-parametric function was estimated separately under an independence assumption [see Zeger and Diggle (1994)]. Other works such as Fan, Huang and Li (2007) assumed normality for the responses and used a likelihood approach for the estimation, but the covariance matrix for the multivariate distribution was constructed based on the 'working' correlation matrix. There also exist some generalizations where heteroscedasticity is assumed among the responses at a given time.

The semi-parametric analysis has also been studied for (marginal) exponential family data by using the 'working' correlations based GEE approach. To be specific, we refer to Severini and Staniswalis (1994) and Lin and Carroll (2001, 2001a) for this GEE based analysis. These studies estimate the regression parameters and the non-parametric functions separately, and GEE approaches have been used in both cases.

1.5 Objective of the thesis

The main objective of this thesis is to study the semi-parametric regression models when the repeated responses follow a non-stationary correlation model that belongs to a class of Gaussian-type ARMA correlation structures. The plan of the thesis is as follows.

In Chapter 2, we focus on the semi-parametric linear longitudinal model where a stationary correlation structure is used for inference. In the linear model setup, this type of stationary correlation structure is quite appropriate because the correlations under linear models do not depend on any covariates, irrespective of whether the covariates are time dependent. Even though the semi-parametric analysis in the linear model setup for longitudinal data is a direct extension of the independence based semi-parametric analysis discussed in Section 1.2, a close look at the estimation problem (to be discussed in Chapter 2) reveals that the existing studies in the semi-parametric longitudinal setup did not incorporate the estimation effect of the non-parametric function $\gamma(\cdot)$ while estimating the main regression parameter $\beta$. Also, the existing studies have extended the 'working' correlations based GEE approach explained in (1.31) to the semi-parametric setup, which may not provide efficient regression estimates. To overcome these two problems, we revisit the inferences for the semi-parametric linear longitudinal models and provide appropriate estimating equations for efficient inferences by (1) using an ARMA type class of auto-correlation structures, and (2) taking the estimation effect of the non-parametric function into account in estimating $\beta$. We carry out a simulation study to examine the finite sample based efficiencies of the proposed semi-parametric GQL (SGQL) as well as various semi-parametric GEE (SGEE) approaches. The asymptotic distribution of the proposed estimator is also discussed.

In Chapter 3, we extend the semi-parametric linear longitudinal model discussed in Chapter 2 to the discrete data setup. In particular, we consider semi-parametric models for longitudinal count and binary data. Note that some of the existing studies such as Lin and Carroll (2001) and Severini and Staniswalis (1994) deal with such models, but they mainly use the 'working' correlations based GEE approach. These studies do not appear to accommodate the estimation effect of the non-parametric function $\gamma(\cdot)$ while estimating $\beta$. As far as the correlation structure is concerned, in our approach we use the non-stationary correlation structures suggested by Sutradhar (2010) for both count and binary data. However, we do not discuss any diagnostic procedure for the identification of the non-stationary correlation structure; this can be done following the technique given in Sutradhar (2010, Section 4). Rather, we assume that the correlation structure involving the time dependent covariates is known and develop a semi-parametric GQL (SGQL) approach for the main regression parameters by taking the estimation effect of the non-parametric function as well as the longitudinal correlations into account. Analytical details for the SGQL approach for both count and binary data are also provided. For the comparison with the existing studies, the proposed SGQL estimating equation is written in two ways. First, a partially standardized SGQL (PSSGQL) approach is described where the covariance matrix involved in the estimating equation for $\beta$ is free from the estimation effect of $\gamma(\cdot)$. Second, a fully standardized SGQL (FSSGQL) approach is discussed in which the estimation effect of $\gamma(\cdot)$ is accommodated in the covariance matrix.

To examine the finite sample performance of the proposed SGQL approaches, we carry out several simulation studies in Chapter 4 for longitudinal count data. First we study the effect of ignoring the non-parametric function in estimating $\beta$ using a naive GQL (NGQL) approach. Because the performance of the leading GEE based approaches has not been adequately studied for count data in the semi-parametric setup, we make a detailed comparison of the proposed PSSGQL approach with the existing partially standardized semi-parametric GEE (PSSGEE) approaches in order to identify efficient inference methods. We also provide the simulation results for the proposed FSSGQL approach.

The thesis concludes in Chapter 5.

Chapter 2 Semi-parametric Linear Longitudinal Models

In this chapter, we revisit the semi-parametric analysis for linear longitudinal data collected over equi-spaced and unbalanced time points. However, we use general notation such that the regression function can be written for responses collected over unequi-spaced time points, which accommodates equi-spaced time data as an important special case. As far as the correlation structure for the repeated responses is concerned, we concentrate on equi-spaced time data only. Thus, as opposed to the notation $y_{it}$ used in Section 1.3 to represent the response at time $t$ ($t = 1, \ldots, T$) from the $i$th ($i = 1, \ldots, K$) individual, we now use a general notation, namely $y_{ij}(t_{ij})$, to denote the $j$th ($j = 1, \ldots, n_i$) response of the $i$th individual at time $t_{ij}$. Here $n_i$ denotes the total number of responses for the $i$th individual collected over $n_i$ time points. Further, for equi-spaced time data, the time points would satisfy the relationship $t_{ij} - t_{i,j-1} = t_{i,j+1} - t_{ij}$, for example.

Suppose that $y_i = (y_{i1}(t_{i1}), \ldots, y_{ij}(t_{ij}), \ldots, y_{in_i}(t_{in_i}))'$ denotes the $n_i \times 1$ vector of repeated responses for the $i$th ($i = 1, \ldots, K$) individual. Also suppose that these repeated responses are influenced by a smooth non-parametric function $\gamma(t_{ij})$, and a fixed and known $p \times n_i$ covariate matrix $X_i' = (x_{i1}(t_{i1}), \ldots, x_{ij}(t_{ij}), \ldots, x_{in_i}(t_{in_i}))$, $x_{ij}(t_{ij})$ being the $p$-dimensional covariate vector at time point $t_{ij}$. This type of repeated continuous data measured at time point $t_{ij}$ is usually modelled as

$$y_{ij}(t_{ij}) = x_{ij}'(t_{ij})\beta + \gamma(t_{ij}) + \epsilon_{ij}(t_{ij}) = \mu_{ij}(t_{ij}) + \epsilon_{ij}(t_{ij}), \tag{2.1}$$

or equivalently

$$y_i = X_i\beta + \gamma(t_i) + \epsilon_i, \tag{2.2}$$

where $\gamma(t_i) = (\gamma(t_{i1}), \ldots, \gamma(t_{in_i}))'$ and $\epsilon_i = (\epsilon_{i1}(t_{i1}), \ldots, \epsilon_{ij}(t_{ij}), \ldots, \epsilon_{in_i}(t_{in_i}))'$. We assume $E(\epsilon_i) = 0$ and $\mathrm{var}(\epsilon_i) = \mathrm{var}(Y_i) = \Sigma_i$.

Note that in (2.2), $\gamma(t_i)$ is not a subject specific non-parametric function, as its construction requires only knowing $\gamma(t)$ at any time $t$ [Zeger and Diggle (1994); Sneddon and Sutradhar (2004)]. To be specific, $\gamma(t_i)$ is used here to represent $n_i$ components, each with the same non-parametric function but evaluated at $n_i$ different time points for the $i$th individual.
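As a small illustration of this point, the sketch below generates responses from the vector form of the model for individuals with different numbers of time points, using one common function of time for everyone. The particular function `gamma_fn`, the parameter values, and the use of independent normal errors are assumptions made only to keep the example short; they are not the simulation design of the thesis.

```python
import numpy as np

def gamma_fn(t):
    """Hypothetical smooth function of time, shared by all individuals."""
    return 0.3 + 0.2 * np.sin(t)

def simulate_individual(t_i, beta, sigma, rng):
    """Draw (X_i, y_i) for one individual observed at the time points t_i."""
    n_i, p = len(t_i), len(beta)
    X_i = rng.normal(size=(n_i, p))              # rows are x_ij(t_ij)'
    eps_i = rng.normal(scale=sigma, size=n_i)    # independent errors: an assumption
    y_i = X_i @ beta + gamma_fn(t_i) + eps_i     # y_i = X_i beta + gamma(t_i) + eps_i
    return X_i, y_i

rng = np.random.default_rng(1)
beta = np.array([0.5, 1.0])
# Unbalanced but equi-spaced designs: n_i = 3, 5 and 4 time points.
times = [np.arange(1, n + 1, dtype=float) for n in (3, 5, 4)]
data = [simulate_individual(t, beta, 1.0, rng) for t in times]
```

Each individual contributes a vector $\gamma(t_i)$ of a different length, yet every component comes from the same function, so two individuals observed at the same time share the same non-parametric effect.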

To develop an efficient estimation procedure it is important to consider the correlation structure of the repeated responses. Let $\rho_{|t_{ij}-t_{ik}|}$ denote the pairwise correlation between the two responses $y_{ij}(t_{ij})$ and $y_{ik}(t_{ik})$ for all $j \neq k$; $j, k = 1, \ldots, n_i$. The $n_i \times n_i$ correlation matrix for $y_i = (y_{i1}(t_{i1}), \ldots, y_{ij}(t_{ij}), \ldots, y_{in_i}(t_{in_i}))'$ is denoted by $C_i(\rho) = (\rho_{|t_{ij}-t_{ik}|})$.

For the purpose of constructing a suitable estimating equation for $\beta$, it is necessary to obtain an estimate $\hat C_i(\rho)$ to compute $\hat\Sigma_i(\rho) = A_i^{1/2}\hat C_i(\rho)A_i^{1/2}$. However, in an experiment where an individual can report a response at any time, that is, when $t_{ij} \neq t_{hj}$, $i \neq h$, $i, h = 1, \ldots, K$, it is possible that in some situations the $C_i(\rho)$ matrices may have unbalanced dimensions. In other situations, it may happen that any two matrices $C_i(\rho)$ and $C_h(\rho)$ with $n_i = n_h$ may not be the same. In such cases, it is impossible to estimate $C_i(\rho)$ for the $i$th individual by borrowing information from the other (remaining) individuals. For this reason, many authors have written the estimating equations for $\beta$ and $\gamma(\cdot)$ for the general case, that is, for unequi-spaced and unequal time points across individuals, but the estimation of the correlation matrices was given (1) for $n_i = n$ for $i = 1, \ldots, K$, and (2) under the assumption that $C_i(\rho) = C(\rho)$, a constant and common matrix. For example, we refer to Lin and Carroll (2001, p. 1048) where $C_i(\rho)$ was estimated by

(2.3)

Note that there are a few difficulties with the construction of this correlation matrix in (2.3). This is because: (1) as the unbalanced $n_i \times n_i$ matrices $(r_i r_i')$ cannot be added over all individuals, the computation of $\hat C(\rho)$ is meaningful only when $n_i = n$, say. However, it is not understood how one may compute the $\hat C_i(\rho)$ needed for the construction of $\hat\Sigma_i$ when the dimensions are not the same; (2) when a situation is considered where the $t_{ij}$'s may be unequi-spaced, there is no reason to justify the use of $n_i = n$ for all $i$.

In the thesis, we concentrate on equi-spaced data and study the inferences for the regression effects in the semi-parametric setup by properly accommodating the longitudinal correlations for both continuous and discrete data. This type of data was used in Sutradhar (2010), but the author dealt only with a fixed (specified) regression function as opposed to a semi-parametric regression function. As far as the correlation structure is concerned, following Sutradhar (2011), we assume that the repeated data follow a class of auto-correlation structures that accommodates all possible Gaussian type auto-regressive moving average correlation models of order $r, s$ (ARMA($r, s$)), with AR(1), MA(1), AR(2), MA(2) and EQC (equi-correlations) as some special cases. Note that the AR(1), MA(1), and EQC structures for repeated data were also discussed in Liang and Zeger (1986), and subsequently these structures were used by Severini and Staniswalis (1994) in the semi-parametric longitudinal setup. Further note that in this approach it is not necessary that $n_i = n$ (balanced data) for all $i = 1, \ldots, K$.

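Specifically, we consider the correlation matrix $C_i(\rho)$ for the error vector $\epsilon_i$ in (2.2) as

$$C_i(\rho) = \begin{pmatrix} 1 & \rho_1 & \rho_2 & \cdots & \rho_{n_i-1} \\ \rho_1 & 1 & \rho_1 & \cdots & \rho_{n_i-2} \\ \vdots & \vdots & \vdots & & \vdots \\ \rho_{n_i-1} & \rho_{n_i-2} & \rho_{n_i-3} & \cdots & 1 \end{pmatrix} \quad \text{for all } i = 1, 2, \ldots, K;$$

$$\Sigma_i(\rho) = \mathrm{var}(Y_i) = A_i^{1/2}C_i(\rho)A_i^{1/2}, \tag{2.4}$$

where for $\ell = 1, \ldots, n_i - 1$, $\rho_\ell$ denotes the lag $\ell$ correlation between $\epsilon_{ij}(t_{ij})$ and $\epsilon_{i,j+\ell}(t_{i,j+\ell})$. We assume, however, that the variances are stationary and hence write $A_i = \sigma^2 I_{n_i}$, where $\sigma^2$ is an unknown scalar constant and $I_{n_i}$ is the $n_i \times n_i$ identity matrix. The following examples demonstrate the correlation models that produce $C_i(\rho)$ in (2.4) in the linear model setup: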

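(i) AR(1) model:

$$\epsilon_{ij}(t_{ij}) = \phi\,\epsilon_{i,j-1}(t_{i,j-1}) + a_{ij}(t_{ij}); \tag{2.5}$$

(ii) MA(1) model:

$$\epsilon_{ij}(t_{ij}) = \theta\,a_{i,j-1}(t_{i,j-1}) + a_{ij}(t_{ij}), \tag{2.6}$$

with $a_{ij}(t_{ij}) \overset{iid}{\sim} N(0, \sigma_a^2)$ for all $i = 1, 2, \ldots, K$; $j = 1, \ldots, n_i$,

and

(iii) EQC model:

$$\epsilon_{ij}(t_{ij}) = a_{i0} + a_{ij}(t_{ij}), \quad a_{i0} \overset{iid}{\sim} N(0, \tilde\sigma_a^2). \tag{2.7}$$

The lag $\ell$ correlations $\rho_\ell$ between $\epsilon_{ij}(t_{ij})$ and $\epsilon_{i,j+\ell}(t_{i,j+\ell})$ for (2.5), (2.6) and (2.7) are

$$\rho_\ell = \phi^\ell, \quad \ell = 1, \ldots, n_i - 1;$$

$$\rho_\ell = \begin{cases} \dfrac{\theta}{1+\theta^2}, & \text{for } \ell = 1, \\ 0, & \text{for } \ell = 2, 3, \ldots, n_i - 1; \end{cases}$$

and

$$\rho_\ell = \frac{\tilde\sigma_a^2}{\tilde\sigma_a^2 + \sigma_a^2}, \quad \ell = 1, \ldots, n_i - 1,$$

respectively, and they satisfy the auto-correlation structure $C_i(\rho)$ in (2.4).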
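The way these lag correlations fill the banded (Toeplitz) correlation matrix of (2.4) can be sketched as follows. The function names, the `model` labels, and the convention that the EQC case receives the common correlation directly as `param` are assumptions made for illustration.

```python
import numpy as np

def lag_correlations(model, n_i, param):
    """Lag 1, ..., n_i - 1 correlations under the AR(1), MA(1) or EQC structure."""
    lags = np.arange(1, n_i)
    if model == "AR1":
        return param ** lags                    # rho_l = phi^l
    if model == "MA1":
        rho = np.zeros(n_i - 1)
        if n_i > 1:
            rho[0] = param / (1.0 + param**2)   # theta / (1 + theta^2) at lag 1 only
        return rho
    if model == "EQC":
        return np.full(n_i - 1, param)          # same correlation at every lag
    raise ValueError(f"unknown model: {model}")

def correlation_matrix(model, n_i, param):
    """n_i x n_i matrix whose (j, k) entry is rho_{|j - k|}, with rho_0 = 1."""
    rho = np.concatenate(([1.0], lag_correlations(model, n_i, param)))
    lag = np.abs(np.subtract.outer(np.arange(n_i), np.arange(n_i)))
    return rho[lag]
```

Because the matrix depends on the individual only through $n_i$, it can be built separately for each individual in unbalanced data, for example `correlation_matrix("AR1", 3, 0.5)` for one individual and `correlation_matrix("AR1", 5, 0.5)` for another.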

Note that even though the $C_i(\rho)$ matrix in (2.4) is written corresponding to $n_i$ time points of the $i$th


A special thanks to the Department of Mathematics and Statistics at Memorial University for providing financial and academic assistance for my research, especially all administrative staff members for the help they offered me during my study.

I would like to thank all the examiners of my thesis, Dr. Mary Thompson, Dr. Gary Sneddon and Dr. Alwell Oyet, for their invaluable comments and suggestions. I am so indebted to my parents, my sister, my in-laws, and my uncle Dr. Asokan.

- Vineetha Warriyar K. V

I lovingly dedicate this thesis

to my husband Ranjith, who supported me every step of the way.

Contents

Abstract

Acknowledgements

List of Tables

List of Figures

1 Background of the Problem
1.1 Generalized linear models (GLMs)
1.1.1 Quasi-likelihood estimation for β
1.2 Semi-parametric GLMs
1.2.1 Linear model
1.2.1.1 Estimation of non-parametric function γ(z)
1.2.1.2 Estimation of regression effects β
1.2.2 Count data model
1.2.2.1 Estimation of non-parametric function γ(z)
1.2.2.2 Estimation of regression effects β
1.2.3 Binary data model
1.2.3.1 Estimation of non-parametric function γ(z)
1.2.3.2 Estimation of regression effects β
1.3 Generalized linear longitudinal models (GLLMs)
1.4 Semi-parametric GLLMs
1.5 Objective of the thesis

2 Semi-parametric Linear Longitudinal Models
2.1 Existing semi-parametric estimation methods
2.1.1 PSSGEE approach
2.1.1.1 Estimation of non-parametric function
2.1.1.2 Estimation of regression effects
2.1.1.3 Estimation of the 'working' correlation parameter α
2.1.2 Partially standardized semi-parametric heteroscedastic GEE (PSSHGEE) approach
2.2 Proposed FSSGQL approach
2.2.1 Estimation of non-parametric function
2.2.2 Estimation of β
2.2.2.1 Basic properties of
2.2.3 Estimation of and a
2.3 A Simulation study
2.3.1 Simulation design
2.3.2 Data generation and simulation results

3 Semi-parametric Longitudinal Models for Discrete Data with Non-stationary Correlation Structures
3.1 Semi-parametric longitudinal models for count data with non-stationary correlation structures
3.1.1 Stationary correlation models for count data in semi-parametric
3.2.1 Estimation of non-parametric function γ(·)
3.2.2 Estimation of β
3.2.2.1 Naive GQL estimation approach
3.2.2.2 PSSGQL estimation under non-stationary (ns) correlation structure
