Measures of model adequacy and model selection in mixed-effects models


JACOT, Nadège

Abstract

This thesis contributes to the development of measures of model selection and model adequacy for mixed-effects models. In the context of linear mixed-effects models, we review and compare in a simulation study a large set of measures proposed to evaluate model adequacy and/or to perform model selection. In the more general context of generalized linear mixed-effects models, we develop a measure of both model adequacy and model selection, which we name PRDpen. As a measure of model adequacy, our proposition gives information about the model at hand, as it measures the proportional reduction in deviance due to the model of interest in comparison with a prespecified null model. Furthermore, as a measure of model selection, PRDpen is able to choose the model that best fits the data among a set of alternatives, similarly to the information criteria.

JACOT, Nadège. Measures of model adequacy and model selection in mixed-effects models. Thèse de doctorat : Univ. Genève, 2016, no. GSEM 33

URN : urn:nbn:ch:unige-905276

DOI : 10.13097/archive-ouverte/unige:90527

Available at:

http://archive-ouverte.unige.ch/unige:90527

Disclaimer: layout of this document may differ from the published version.


Measures of model adequacy and model selection in mixed-effects models

by

Nadège Jacot

A thesis submitted to the

Geneva School of Economics and Management, University of Geneva, Switzerland,

in fulfillment of the requirements for the degree of PhD in Statistics

Members of the thesis committee:

Prof. Eva Cantoni, Co-Adviser, University of Geneva
Prof. Paolo Ghisletta, Co-Adviser, University of Geneva
Prof. Benjamin Scheibehenne, Chair, University of Geneva
Prof. Thomas Kneib, Georg-August-Universität Göttingen

Thesis No. 33
September 2016


Acknowledgements

First of all, I would like to thank my two co-advisers, Prof. Eva Cantoni and Prof. Paolo Ghisletta. Their support, availability, and enthusiasm allowed me to accomplish this work. Their encouragement and precious advice were an invaluable help.

I would also like to thank the other members of my committee, Prof. Benjamin Scheibehenne and Prof. Thomas Kneib, for their careful reading and for their relevant questions and comments.

I would then like to thank all my colleagues in the Methodology and Data Analysis group (MAD group) at the Faculty of Psychology and Educational Sciences of the University of Geneva. In particular, I would like to thank Dr. Catherine Audrin for her good mood and support in all circumstances, and also for all the good moments enjoyed together; Sandrine Amstutz for being an inexhaustible source of information, but especially for her caring and generosity; Emmanuelle Grob for her positive energy, which spread through our office; Victorin Luisier for his help when I arrived in the MAD group and for sharing many train journeys; and Dr. Guillaume Fuerst for his incredible ability to make me laugh and for our discussions about photography. I am also grateful to my colleagues at the Geneva School of Economics and Management (GSEM) of the University of Geneva.

I would also like to thank my colleagues in the structural business statistics team at the Federal Statistical Office. I would especially like to thank Pierre Maftei for his moral support, and Mathieu Gunzinger for his encouragement.

My parents and my sister gave me unconditional support throughout my studies and always encouraged me to give the best of myself, and I really would like to thank them for this. I am also grateful to my parents-in-law, my sister-in-law, my brother-in-law, as well as my close friends, who never really understood the work I did but always stood by my side. Finally, I would like to sincerely thank my husband for his patience and for giving me comfort and strength throughout these five years.


Abstract

This thesis contributes to the development of measures of model selection and model adequacy for mixed-effects models. First, we select and briefly describe, in the context of linear mixed-effects models, a large set of measures proposed to evaluate model adequacy and/or to perform model selection. We evaluate their sensitivity in a simulation study aimed at selecting the correct model from a series of nested alternative models, and we illustrate their use on the home radon levels data of Gelman and Pardoe (2006). We then give recommendations on the use of these different indices.

Second, we develop, in the more general context of generalized linear mixed-effects models, a measure of both model adequacy and model selection, which we name PRDpen. As a measure of model adequacy, our proposition gives information about the model at hand, as it measures the proportional reduction in deviance due to the model of interest in comparison with a prespecified null model. Furthermore, as a measure of model selection, PRDpen is able to choose the model that best fits the data among a set of alternatives, similarly to the information criteria. Indeed, our proposition is closely related to the information criteria, as it is composed of a similar penalty function. However, in comparison with the existing measures of model adequacy, which are not able to perform model selection, and with the information criteria, which give no indication about model adequacy, PRDpen is innovative, as it combines the advantages of both the measures of model adequacy and the information criteria.

Third, we demonstrate that PRDpen is able to perform generalized linear mixed-effects model selection. Indeed, we prove the asymptotic validity of PRDpen when selecting fixed effects of a linear mixed-effects model. Then, we conduct a simulation study to show that our proposition is competitive with the existing information criteria in terms of model selection, and that the existing measures of model adequacy perform very poorly for comparing models. To illustrate that PRDpen can be used similarly to the information criteria for model selection, and similarly to the existing measures of model adequacy for evaluating the fit of the selected model, we analyze a sub-sample of the data from the 1988-89 Bangladesh fertility survey (Huq and Cleland, 1990).

We conclude that this thesis sheds light on the diversity of measures proposed to assess model adequacy and/or to select the most adequate model in the context of mixed-effects models. We further identify the most promising propositions among the considered measures. In comparison with the existing measures of model adequacy, which can only be used for evaluating model adequacy, and with the information criteria, which can only be used for model selection, we recommend the use of the measure we propose, PRDpen. Indeed, the dual use of PRDpen (model adequacy and model selection) makes it very appealing.


Résumé

Cette thèse contribue au développement de mesures de sélection de modèle et d'adéquation du modèle pour les modèles à effets mixtes. Premièrement, nous sélectionnons et décrivons brièvement, dans le contexte des modèles linéaires à effets mixtes, un grand nombre de mesures proposées pour évaluer l'adéquation du modèle et/ou pour effectuer de la sélection de modèle. Nous évaluons leur sensibilité dans une étude de simulation dont le but est de sélectionner le modèle correct parmi une série de modèles alternatifs emboîtés, et nous illustrons leur utilisation sur les données de Gelman and Pardoe (2006) qui s'intéressent aux niveaux de radon dans des maisons. Ensuite, nous donnons des recommandations à propos de l'utilisation de ces différents indices.

Deuxièmement, nous développons, dans le contexte plus général des modèles linéaires généralisés à effets mixtes, une mesure d'adéquation du modèle ainsi que de sélection de modèle, que nous nommons PRDpen. En tant que mesure d'adéquation du modèle, notre proposition donne de l'information à propos du modèle à disposition, étant donné qu'elle mesure la réduction proportionnelle dans la déviance due au modèle d'intérêt en comparaison d'un modèle nul préspécifié. De plus, en tant que mesure de sélection de modèle, PRDpen est capable de choisir le modèle qui s'ajuste le mieux aux données parmi un ensemble de modèles alternatifs, de manière similaire aux critères d'information. En effet, notre proposition est étroitement liée aux critères d'information, étant donné qu'elle est composée d'une fonction de pénalité similaire. Cependant, en comparaison des mesures d'adéquation du modèle existantes qui ne sont pas capables d'effectuer de la sélection de modèle, et des critères d'information qui ne donnent aucune indication sur l'adéquation du modèle, PRDpen est novateur, étant donné qu'il combine les avantages des mesures d'adéquation du modèle ainsi que ceux des critères d'information.

Troisièmement, nous démontrons que PRDpen est capable d'effectuer de la sélection de modèle linéaire généralisé à effets mixtes. En effet, nous prouvons la validité asymptotique de PRDpen pour la sélection des effets fixes d'un modèle linéaire à effets mixtes. Ensuite, nous avons conduit une étude de simulation pour montrer que notre proposition est concurrentielle, en termes de sélection de modèle, par rapport aux critères d'information existants, et que les mesures existantes d'adéquation du modèle ont des performances médiocres pour comparer des modèles. Pour illustrer que PRDpen peut être utilisé de manière similaire aux critères d'information pour la sélection de modèle, et de manière similaire aux mesures existantes d'adéquation du modèle pour évaluer l'ajustement du modèle sélectionné, nous analysons un sous-échantillon des données de l'enquête sur la fertilité au Bangladesh menée en 1988-89 (Huq and Cleland, 1990).

Nous concluons que cette thèse met en lumière la diversité des mesures proposées pour évaluer l'adéquation du modèle et/ou pour sélectionner le modèle le plus adéquat dans le contexte des modèles à effets mixtes. Nous identifions également les propositions les plus prometteuses parmi les mesures considérées. En comparaison des mesures existantes d'adéquation du modèle, qui ne peuvent être utilisées que pour évaluer l'adéquation du modèle, et des critères d'information, qui ne peuvent être utilisés que pour la sélection de modèle, nous recommandons l'utilisation de la mesure que nous proposons, PRDpen. En effet, son utilisation double (adéquation du modèle et sélection de modèle) la rend très attrayante.


Contents

Acknowledgements
Abstract
Résumé
Introduction
1 Measures of explained variation and model selection in linear mixed-effects models: A review and a comparison based on a simulation study
1.1 Introduction
1.2 Linear mixed-effects model (LMM)
1.3 Measures
1.3.1 Measures of model adequacy only and of model adequacy and model selection
1.3.1.1 Gelman and Pardoe (2006)
1.3.1.2 Snijders and Bosker (1994)
1.3.1.3 Vonesh et al. (1996)
1.3.1.4 Vonesh and Chinchilli (1996)
1.3.1.5 Zheng (2000)
1.3.1.6 Xu (2003)
1.3.1.7 H. Liu et al. (2008)
1.3.1.8 Measures comparison
1.3.2 Measures of model selection
1.3.2.1 Wilks (1938)
1.3.2.2 Akaike (1974) and Schwarz (1978) information criteria
1.3.2.3 Spiegelhalter et al. (2002)
1.4 Home radon levels
1.4.1 Description of data and fitted models
1.4.2 Results
1.5 Simulation study
1.5.1 Design
1.5.2 Results
1.5.2.1 Correlations
1.5.2.2 Comparison of measures of overall model adequacy
1.5.2.3 Comparison of measures of model adequacy due to fixed effects
1.5.2.4 Comparison of measures of model selection
1.6 Discussion
2 … mixed-effects models
2.1 Introduction
2.2 Generalized linear mixed-effects model (GLMM)
2.3 Proportional reduction in deviance measure
2.3.1 Definition of PRDpen
2.3.1.1 Deviance and saturated mixed-effects model
2.3.1.2 PRD and null model
2.3.1.3 PRDpen
2.3.2 Computations
2.3.3 Asymptotic properties of PRDpen
2.3.3.1 Known variance components
2.3.3.2 Unknown variance components
2.4 Simulation study
2.4.1 Design
2.4.2 Alternative measures
2.4.2.1 Marginal and conditional IC
2.4.2.2 Drand, Prand and marginal Drand
2.4.2.3 R2m and R2c
2.4.2.4 Marginal and conditional concordance correlation coefficients
2.4.3 Results
2.4.3.1 Frequencies, sensitivities and specificities
2.4.3.2 Presented measures
2.4.3.3 Analysis of the results
2.4.3.4 Summary and conclusion
2.5 Illustration
2.6 Discussion
Conclusion
A Supplementary material for Chapter 1
A.1 Illustrations
A.1.1 Orthodontic data analysis
A.1.2 Students' performance on a math test
A.2 ANOVAs and measures of Gelman and Pardoe (2006)
A.3 Performance in model selection
B Supplementary material for Chapter 2
B.1 Marginal log-likelihood functions
B.1.1 Normal distribution
B.1.2 Poisson distribution
B.1.3 Bernoulli distribution
B.1.4 Binomial distribution
B.2 Explicit formula for PRDN,pen(α)
B.3 Simulation cases
B.4 Drand formulas
B.4.1 Normal distribution
B.4.2 Poisson distribution
B.4.3 Bernoulli distribution
B.4.4 Binomial distribution
B.5 Simulation study results
C List of acronyms


Introduction

This thesis is composed of two chapters, along with this general introduction and an overall conclusion. Each chapter is written in the form of an article, with its own introduction and discussion. Supplementary material for these chapters is placed after the conclusion in the Appendices, except for the code used for some analyses and the code for computing the measure we propose in Chapter 2, which are provided online. Furthermore, the list of acronyms is provided in Appendix C. Both chapters share the common aim of studying measures of model adequacy and model selection in mixed-effects models, and this thesis brings them together, even if they can be read independently. In this Introduction, we explain what motivates our work and describe the content of both chapters.

In many disciplines, such as psychology, forestry, pharmacokinetics, or medicine, the use of mixed-effects models (e.g., C. E. McCulloch et al., 2008; Vonesh and Chinchilli, 1996) is very popular. These models are attractive because they contain fixed and random effects, and thus allow for modeling the mean structure as well as the covariance structure. An example of their usefulness is in longitudinal studies, where subjects are repeatedly measured over a period of time. In that context, mixed-effects models allow for taking into account both the within- and the between-subjects variability.

Furthermore, they have the advantage, with respect to the repeated measures analysis of variance (ANOVA), of being able to deal with unbalanced data, in which subjects are measured a different number of times.

As discussed in particular by Gurka and Edwards (2008), the mixed-effects model is also called a subject-specific model by some authors, to differentiate it from the population-averaged model. A population-averaged model is considered when the interest is in estimation and inference about the fixed-effects parameters. This model may also contain random effects, but they are treated as nuisance parameters that model the variation in the data. The distributional assumptions are thus only made on the responses. A subject-specific model, in contrast, is considered when the interest is also in the random effects themselves. The distributional assumptions of such a model are made on the random effects as well as on the responses conditionally on the random effects.

Many software packages, such as R (R Core Team, 2013) or SAS (SAS Institute Inc., 2011), enable the estimation of mixed-effects models. However, two questions arise for an applied researcher: how to evaluate the adequacy of the model at hand, and how to identify the most relevant model among a set of alternatives?

To evaluate the adequacy of the model at hand, many researchers proposed to extend the coefficient of determination, namely R2, from regression models (Draper and Smith, 1998). For linear mixed-effects models (LMM), such extensions were proposed by Edwards et al. (2008), Gelman and Pardoe (2006), H. Liu et al. (2008), Snijders and Bosker (1994), and Xu (2003). Johnson (2014), Nakagawa and Schielzeth (2013), and Zheng (2000) proposed extensions in the more general context of generalized linear mixed-effects models (GLMM). And for GLMM as well as nonlinear mixed-effects models (NLMM), Vonesh and Chinchilli (1996) also proposed two versions of R2, to evaluate the proportional decrease in residual variability due to the fixed effects and that due to both fixed and random effects, respectively.

R2 is attractive: it has, for instance, the advantages of being intuitively interpreted as a measure of explained variation, of being unit free, and of ranging between 0 and 1, where 1 represents a perfect fit. However, for mixed-effects models the definition of an R2 is difficult, as these models contain more than one source of variation due to the presence of random effects. This explains why there exist many different definitions of a measure of explained variation. Given the existence of these various extensions of R2, there is no consensus in the statistical literature about a preferred proposition.
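As a point of reference for the extensions discussed above, the classical R2 of ordinary regression can be sketched as follows; the data here are made up purely for illustration.

```python
# Classical R^2 of ordinary regression: the proportional reduction in
# residual sum of squares relative to the intercept-only baseline. This
# is the quantity the mixed-effects measures reviewed here generalize;
# the observed and fitted values below are invented for illustration.
def r_squared(observed, fitted):
    mean_y = sum(observed) / len(observed)
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot

y = [2.0, 4.1, 6.2, 7.9]          # hypothetical observations
y_hat = [2.1, 4.0, 6.0, 8.1]      # predictions from some fitted model
print(round(r_squared(y, y_hat), 3))
```

For mixed-effects models, the difficulty is precisely that no single residual sum of squares exists: the variation splits across levels, which is why the extensions below differ.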

Vonesh et al. (1996) proposed an alternative strategy to assess model adequacy. They defined two indices that are interpreted as concordance correlation coefficients between observed values and values predicted by fixed effects only, and by both fixed and random effects, respectively.
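For intuition, Lin's concordance correlation coefficient between observed and predicted values can be computed as below. This is a generic illustration of the concordance idea, not the exact Vonesh et al. (1996) definitions, and the data are invented.

```python
# Lin's concordance correlation coefficient: agreement between observed
# values and model predictions, penalizing both lack of correlation and
# location/scale shifts. The Vonesh et al. (1996) indices are interpreted
# in this spirit; this generic version is for illustration only.
def concordance_cc(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * cov / (vx + vy + (mx - my) ** 2)

obs = [1.0, 2.0, 3.0, 4.0]        # hypothetical observed values
pred = [1.1, 1.9, 3.2, 3.8]       # hypothetical predicted values
print(round(concordance_cc(obs, pred), 3))
```

The coefficient equals 1 only for perfect agreement (predictions on the 45-degree line), which is what makes it attractive as a model adequacy summary.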

To identify the most relevant model among a set of alternatives, many methods exist, such as the likelihood ratio test (LRT; Greven et al., 2008; Stram and Lee, 1994; Wilks, 1938), which allows for comparing two nested models (i.e., the parameters of the more parsimonious model constitute a subset of the parameters of the larger model), information criteria (IC), shrinkage based on penalized loss functions, fence methods, initially proposed by Jiang et al. (2008) for GLMM, or Bayesian techniques. However, these are rarely implemented in statistical software.
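For two nested models, the LRT statistic is simply twice the gain in maximized log-likelihood. The sketch below uses hypothetical log-likelihood values and the usual chi-squared critical value 3.84 for one extra parameter at the 5% level, setting aside the boundary issues for variance components addressed by Stram and Lee (1994).

```python
# Likelihood ratio test for two nested models: twice the gain in
# maximized log-likelihood, compared to a chi-squared quantile whose
# degrees of freedom equal the number of extra parameters. The
# log-likelihood values below are hypothetical, not from a real fit.
def lrt_statistic(loglik_small, loglik_large):
    return 2.0 * (loglik_large - loglik_small)

ll_small, ll_large = -152.3, -149.1   # hypothetical maximized log-likelihoods
stat = lrt_statistic(ll_small, ll_large)
# One extra parameter: chi-squared(1) critical value at the 5% level is 3.84
print(stat, stat > 3.84)
```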

Among the IC, the most commonly used are the Akaike IC (AIC; Akaike, 1974) and the Bayesian IC (BIC; Schwarz, 1978). Variations of the AIC include those of Hurvich and Tsai (1989), who proposed a correction for small samples or highly parameterized models, and of Bozdogan (1987), who defined a consistent AIC. For LMM, Vaida and Blanchard (2005) defined a conditional AIC (cAIC) by focusing on clusters, which implies that they focused on random effects selection. The cAIC aroused the interest of many authors. Indeed, it was generalized to an unknown variance-covariance matrix of the random effects by H. Liang et al. (2008), and its analytic representation was proposed by Greven and Kneib (2010). The cAIC was extended to generalized linear and proportional hazards mixed models by Donohue et al. (2011), to Poisson regression with random effects by Lian (2012), and to GLMM by Yu and Yau (2012), and a unified framework was finally given by Saefken et al. (2014a) to estimate the cAIC in GLMM. Two modifications of the BIC were proposed by Pauler (1998) in order to identify the relevant fixed effects of an LMM and, more recently, Delattre et al. (2014) defined another modified BIC for selecting 2-level fixed effects in GLMM and NLMM. In the framework of LMM, Pu and Niu (2006) extended the generalized IC (GIC) of R. Rao and Wu (1989), and a similar criterion was further given by Jiang and J. S. Rao (2003).
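The AIC and BIC referred to throughout can be sketched from their standard definitions (minus twice the maximized log-likelihood plus a complexity penalty); the candidate models and values below are hypothetical. Note how the stronger log(n) penalty of the BIC can reverse a near-tie under the AIC.

```python
import math

# Standard AIC and BIC from a maximized log-likelihood, number of
# parameters k, and sample size n; lower values indicate a better
# compromise between fit and complexity. Candidate values are invented.
def aic(loglik, k):
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    return -2.0 * loglik + k * math.log(n)

candidates = {"M1": (-149.1, 5), "M2": (-151.0, 3)}   # name: (loglik, k)
n = 200
for name, (ll, k) in candidates.items():
    print(name, round(aic(ll, k), 1), round(bic(ll, k, n), 1))
```

With these numbers the two models are nearly tied on AIC, while BIC clearly prefers the more parsimonious M2: the choice of penalty matters.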

Existing shrinkage methods include the least absolute shrinkage and selection operator (LASSO; Tibshirani, 1996), the adaptive LASSO (ALASSO; Zou, 2006), and the smoothly clipped absolute deviation (SCAD; J. Fan and Li, 2001). The adaptation of such methods for the joint selection of both fixed and random effects was done by Bondell et al. (2010), Ibrahim et al. (2011), and Peng and Lu (2012) in LMM, and by Schelldorfer et al. (2014) in GLMM.

In the Bayesian framework, Spiegelhalter et al. (2002) proposed the deviance IC (DIC), and a partitioned version was proposed by Wheeler et al. (2010) to assess local model fits. The stochastic search variable selection (SSVS; George and R. E. McCulloch, 1993, 1997; Geweke, 1996) approach is used by Chen and Dunson (2003) to select the random effects of a reparameterized LMM. This method was extended to simultaneously select fixed and random effects by Cai and Dunson (2006) for GLMM by approximating intractable integrals, and exact alternatives were proposed by Kinney and Dunson (2007) for logistic mixed-effects models. To the same end of jointly selecting fixed and random effects, similar methods were proposed by Frühwirth-Schnatter and Tüchler (2008) and Frühwirth-Schnatter and Wagner (2010) for LMM, and by Tüchler (2008) and Wagner and Duller (2012) for logistic mixed-effects models. Furthermore, T.-H. Fan et al. (2014) introduced another SSVS approach for proportional hazards mixed-effects models. Finally, a Bayesian nonparametric centered random effects model was proposed by Yang (2013) in the context of LMM, which was then extended by Yang (2012) to the context of GLMM, again with the aim of identifying the correct set of fixed as well as random effects.

As for evaluating model adequacy, there is no consensus in the statistical literature on how to perform model selection for mixed-effects models. In the context of LMM, a review and description of model selection methods, focusing on frequentist methods, is provided by Müller et al. (2013). However, to our knowledge, no complete review of model selection procedures exists in the more general framework of GLMM or NLMM.

This thesis aims at reviewing and comparing the existing measures of model adequacy and model selection. Furthermore, it aims at proposing a single measure that is able to answer both questions: how to evaluate the adequacy of the model at hand, and how to identify the most relevant model among a set of alternatives. To achieve this latter aim, we choose to develop a measure that evaluates model adequacy and that is subsequently adapted for model selection by incorporating a penalty function similar to that of the IC. For this reason, we focus in this thesis on the measures of model adequacy and on the IC. The other methods of model selection, such as fence methods or Bayesian techniques, are thus not studied here.

In Chapter 1, we focus on LMM. We review and compare the measures of model adequacy and the methods of model selection, with a focus on the existing measures of model adequacy, most of them being extensions of R2, and on the LRT, the AIC, the cAIC, the BIC, and the DIC, due to their popularity. We thus intend to identify the most promising propositions that exist in the literature. To do so, we use a comparable notation across all propositions.

In this Chapter, we classify the measures of model adequacy, the LRT, and the IC into categories determined according to their use. Among the measures of model adequacy, some measure the adequacy due to the fixed effects, referred to as marginal, and some measure the adequacy due to the fixed and random effects, referred to as conditional. The first category includes the conditional measures of model adequacy and is denoted category A (A for adequacy). The second one highlights that some of the conditional indices of model adequacy can further be used for model selection, similarly to the adjusted R2 for linear regression models, thanks to a penalty function for large models. This second category is denoted category A&S (S for selection). The third category, denoted category F&S (F for adequacy due to fixed effects), contains the marginal measures of model adequacy. All of them can further be used for fixed effects selection, as advised by Orelien and Edwards (2008). Finally, the fourth category includes the measures that only allow for model selection and is denoted category S.

To illustrate the use of the indices of interest, we analyze the home radon levels data of Gelman and Pardoe (2006), which concern houses clustered within counties in Minnesota, USA. We aim in particular at highlighting the differences in terms of interpretation of the measures in categories A, A&S, and F&S.

We also aim at comparing the results obtained with indices within a same category (A, A&S, F&S, or S), which we expected to be similar. Among the alternative models that we consider, the same model is indicated as the most appropriate by all the measures allowing for model selection (categories A&S, F&S, and S). This selected model indicates that the levels of radon gas are larger for houses with a basement and for counties with a higher soil uranium content. Furthermore, the random part of the model indicates that the levels of radon gas differ among counties, and that the effect of having a house with a basement on the level of radon gas varies among counties.

Two additional illustrations are given in Appendix A.1 to further highlight that the choice of the most appropriate model can differ according to the considered measure of model selection (categories A&S, F&S, S). In particular, we analyze the orthodontic data of Potthoff and Roy (1964), in which the aim is to find what influences a dental measurement in children between the ages of 8 and 14 years. The other data we consider were previously analyzed by Kreft and Leeuw (1998), and the interest is in the reasons behind students' performance on a math test.

The comparison of the measures of interest is performed in a simulation study. We first evaluate the sensitivity of the considered indices to modifications of five parameters of a two-level LMM that contains a random intercept and a random slope, which are correlated. Second, we evaluate the ability of the measures that can be used for model selection (categories A&S, F&S, and S) to identify the correct model among a set of alternatives constituted of the population model (the model from which the data are generated) and five simpler models (the ability to discriminate larger models is not evaluated). The comparison of the measures in categories A and A&S shows that their sensitivity to modifications of some model parameters is similar, and that within category A&S, only a few of the indices are competitive for model selection. When comparing the measures in category F&S, we observe that they have similar sensitivity to modifications of some model parameters, and that only a few of them are unsatisfactory for fixed effects selection. Finally, the measures in the fourth category (S) are mainly IC, and we observe that the cAIC is the most appropriate for model selection in our setup.

We finally discuss the results obtained in our simulation study by contrasting them with the advantages and disadvantages of the different considered indices. For instance, some of the measures of interest have an intuitive interpretation, but are defined only for an LMM assuming normal errors and random effects. We conclude Chapter 1 by giving recommendations on the use of these different indices.

In Chapter 2, we consider the more general framework of GLMM, which allows, for instance, for modeling non-continuous responses, such as binary or count data. We thus introduce the GLMM and, in that context, based on the results of the extensive comparison conducted in Chapter 1, we develop the penalized proportional reduction in deviance measure, named PRDpen. To this end, the deviance is defined using the marginal log-likelihood, as in the context of generalized linear models (GLM; McCullagh and Nelder, 1989). PRDpen is both a measure of model adequacy and of model selection, thanks to a penalty function. We thus aim at defining an index that is able to perform model selection, that has an intuitive interpretation, and that is competitive with the existing measures of model adequacy.


The penalty function of PRDpen can take different forms and is similar to that of most of the IC. For instance, the penalty function can be equal to the logarithm of the total number of observations multiplied by the number of parameters in the model, which is the same penalty term as that of the BIC. In this way, PRDpen inherits the ability of the IC to choose the most promising model among a set of alternatives, and the procedure for identifying the best model using our index is thus the following: the models having larger values are considered the best compromises between fit and complexity. Note that PRDpen generally ranges between 0 and 1 but, as with the adjusted R2, it can be negative, which indicates that the model of interest provides no improvement over the considered null model once the complexity of the model is taken into account.
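To make the selection procedure concrete, here is a minimal sketch of a penalized proportional-reduction-in-deviance index, assuming a BIC-like penalty k·log(n) is added to the model deviance before taking the ratio to the null deviance. This illustrates the mechanics only; it is not necessarily the thesis's exact definition of PRDpen, and all deviance values are invented.

```python
import math

# Sketch of a penalized proportional reduction in deviance: the
# unpenalized index is 1 - D(model)/D(null); here a BIC-like penalty
# k * log(n) is charged to the model deviance, so extra parameters must
# buy enough deviance reduction to pay for themselves. Hypothetical
# values; not necessarily the exact PRDpen of the thesis.
def prd_pen(dev_model, dev_null, k, n):
    penalty = k * math.log(n)
    return 1.0 - (dev_model + penalty) / dev_null

dev_null = 520.0                        # deviance of a prespecified null model
models = {"small": (400.0, 3), "large": (395.0, 8)}   # name: (deviance, k)
n = 150
for name, (dev, k) in models.items():
    print(name, round(prd_pen(dev, dev_null, k, n), 3))
# Larger values are preferred: here "large" reduces the deviance by only 5
# at the cost of 5 extra parameters, so "small" wins. A value below 0 would
# signal no improvement over the null model once complexity is accounted for.
```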

Despite its similarity to the IC, which can only be used for comparing models, PRDpen also gives information about the model at hand. Indeed, it is intuitively interpreted as the proportional reduction in deviance due to the model of interest compared with a prespecified null model. Different null models can be considered but, in the literature, the most popular ones are those that contain only a fixed intercept, or a fixed and a random intercept.

The ability of PRDpen to perform model selection is evaluated in two ways. First, we study the asymptotic behavior of PRDpen for fixed effects selection in the particular context of the LMM with normal random effects and errors. The asymptotic properties that we prove mean that, under some conditions, PRDpen asymptotically selects the best set of fixed effects. Second, in the context of GLMM, we conduct a simulation study to compare, in finite samples, the performance in terms of model selection of several versions of PRDpen characterized by their penalty functions, of several marginal and conditional measures of model adequacy, and of the most commonly used IC. We consider in this simulation study four of the most common distributions from the exponential family for the responses in GLMM (Normal, Poisson, Bernoulli, and Binomial) and three different cases in terms of the balance of importance between the fixed and random parts of the model, obtained by modifying four parameters of a two-level GLMM with a random intercept and a random slope that are correlated. We thus have 4×3 different setups under which we evaluate the ability of the considered indices to identify the correct model among a set of alternatives constituted of the population model and 15 simpler or more complex models.

The results of this simulation study indicate that, in terms of model selection, PRDpen performs similarly to the considered IC and outperforms the marginal and conditional measures of model adequacy. Thereby, the advantage of PRDpen over the IC is its ability to measure the adequacy of the model at hand (cf. next paragraph), and the advantage of PRDpen over the existing measures of model adequacy is its ability to perform model selection.

To further illustrate that PRDpen can be used similarly to the IC for model selection, and similarly to the existing measures of model adequacy for evaluating model fit, we analyze a sub-sample of the data from the 1988-89 Bangladesh fertility survey (Huq and Cleland, 1990). We identify the best model among a set of 16 alternatives, and we aim at highlighting that the selected model can differ according to the considered index. Using the versions of PRDpen that provide the best results in the simulation study, the selected model indicates that the probability of using a contraceptive is higher for women living in an urban area (in comparison with women living in a rural area) and for women having children (in comparison with women having no children). This probability seems to decline with age for women having no children, but is nearly constant for those having children.

The random part of the model further highlights a high variability across districts: the use of contraceptives varies greatly by district. The adequacy of this model is then evaluated with the measures of model adequacy (including PRDpen). While all agree that the selected model fits the data poorly, we highlight that their interpretations differ.

Finally, we discuss the similarities and differences between PRDpen and existing measures, as well as the choice of the penalty function of PRDpen. Furthermore, we expose the limitations of our simulation study and give some lines of future research.

We conclude this thesis with an overall picture of our work, compare the results obtained in Chapter 1 with those obtained in Chapter 2, and give some directions for future research.


Measures of explained variation and model selection in linear mixed-effects models: A review and a comparison based on a simulation study

1.1 Introduction

The linear mixed-effects model (LMM; a.k.a. linear multilevel model, hierarchical linear model, or random-effects model) is widely used, especially to analyze clustered data.

Whether in educational sciences, psychology, medicine, biology or other domains, researchers need tools to compare alternative models and to evaluate their adequacy with respect to the data at hand. Several measures, both in the frequentist and in the Bayesian framework, serve this purpose and have different characteristics. Some of them allow for the selection of alternative models that differ in their fixed and/or random effects, others allow for assessing the adequacy of a given model, and some allow for both selecting among alternatives and evaluating their adequacy.

For model selection, the most frequently used indices are the Akaike Information Criterion (AIC; Akaike, 1974), the Bayesian Information Criterion (BIC; Schwarz, 1978) and the Deviance Information Criterion (DIC; Spiegelhalter et al., 2002). The Likelihood Ratio Test (LRT; Wilks, 1938) can also be used, but only to compare two nested models (i.e., the parameters of the more parsimonious model constitute a subset of the parameters of the larger model). Vaida and Blanchard (2005) proposed a conditional AIC for LMMs, which was then generalized by H. Liang et al. (2008) and Greven and Kneib (2010) (cf. Section 1.3.2.2). Pu and Niu (2006) extended the Generalized Information Criterion (GIC) from linear regression to LMMs, and a similar criterion is further given by Jiang and J. S. Rao (2003). In the Bayesian framework, Wheeler et al. (2010) proposed a partitioned DIC to assess local model fits instead of a single DIC value for the entire model.
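For reference, the generic forms of the AIC and BIC can be sketched as follows. This is a minimal illustration only: for mixed models, the appropriate number of parameters $k$ and sample size $N$ are themselves debated, hence the marginal versus conditional AIC distinction discussed above.

```python
# Generic information criteria: smaller values indicate a better trade-off
# between fit (log-likelihood) and complexity (number of parameters k).
import math

def aic(loglik: float, k: int) -> float:
    """Akaike Information Criterion: -2 loglik + 2k."""
    return -2.0 * loglik + 2.0 * k

def bic(loglik: float, k: int, n_obs: int) -> float:
    """Bayesian Information Criterion: -2 loglik + k log(N)."""
    return -2.0 * loglik + k * math.log(n_obs)

# Example: two nested models; the larger model must gain enough likelihood
# to offset its extra parameter.
print(aic(-100.0, 3), aic(-99.5, 4))  # 206.0 207.0: the smaller model wins
```

The BIC penalty grows with the sample size, so for moderate to large $N$ it favors more parsimonious models than the AIC does.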

In the same perspective of model selection, Jiang et al. (2008) introduced a class of strategies called fence methods for LMMs and generalized LMMs. Other authors, such as Orelien and Edwards (2008) and Edwards et al. (2008), interested in $R^2$ indices (estimates of the variance of the dependent variable explained by the independent variables; Draper and Smith, 1998), focused on the selection of the fixed effects. Moreover, Orelien and Edwards (2008) showed that only measures based on the marginal model (called marginal measures) are appropriate for that purpose. Chen and Dunson (2003) developed, in the Bayesian framework, a general stochastic search variable selection (SSVS; George and R. E. McCulloch, 1993, 1997; Geweke, 1996) approach to select only the random effects of a LMM. Methods for simultaneously selecting the fixed and random effects of a LMM were introduced by Bondell et al. (2010), Ibrahim et al. (2011), and Peng and Lu (2012) in the frequentist framework. These three latter propositions are extensions of shrinkage methods based on penalized loss functions, such as the least absolute shrinkage and selection operator (LASSO; Tibshirani, 1996), the adaptive LASSO (ALASSO; Zou, 2006) and the smoothly clipped absolute deviation (SCAD; J. Fan and Li, 2001).

For evaluating model adequacy, several papers considered extensions of the classical $R^2$ measure, owing to its simplicity of interpretation. Unfortunately, few of the available measures are absolute, in the sense that they can be interpreted without reference to a comparison (often called null) model. By contrast, a relative measure can only be interpreted in comparison with another model. This is the case of the measures of model selection defined above and also of most of the definitions of $R^2$, which require the specification of a null model. Among the indices that assess model adequacy, Snijders and Bosker (1994), Xu (2003), H. Liu et al. (2008) and Gelman and Pardoe (2006) presented extensions of the $R^2$ measure (cf. Section 1.3). For generalized LMMs, Zheng (2000) extended some goodness-of-fit (GOF) indices of the generalized linear model and Pan and D. Y. Lin (2005) developed graphical and numerical methods. Finally, for generalized nonlinear mixed-effects models, Vonesh et al. (1996) and Vonesh and Chinchilli (1996) proposed marginal and conditional versions of a concordance correlation coefficient and of a measure of explained residual variation, respectively. We note that in the literature extensions of $R^2$ are sometimes presented as GOF measures, as in Vonesh and Chinchilli (1996) or H. Liu et al. (2008), but, as Korn and Simon (1991) highlighted, it is important to distinguish measures of explained variation from GOF. In particular, although some authors use $R^2$ for model selection (e.g., Xu, 2003), this measure cannot decrease, and usually increases, with the addition of predictors to the model. Thus, to use $R^2$ for model selection, a penalty function (cf. Section 1.3) is necessary to account for the increase in model complexity (i.e., the loss in model parsimony).
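To illustrate this last point, here is a minimal sketch in ordinary linear regression (not an LMM; all data and variable names are fabricated): adding a pure-noise predictor cannot decrease $R^2$, whereas the classical adjusted $R^2$ penalizes the extra parameter.

```python
# Why a penalty is needed before using R^2 for model selection: adding an
# irrelevant predictor cannot decrease R^2, while adjusted R^2 accounts for
# the loss in parsimony. Simulated data for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)   # true model uses x only
noise = rng.normal(size=n)               # irrelevant predictor

def r2_and_adjusted(predictors, y):
    """Return (R^2, adjusted R^2) for an OLS fit with intercept."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
    p = X.shape[1]                       # number of estimated coefficients
    r2_adj = 1.0 - (1.0 - r2) * (len(y) - 1) / (len(y) - p)
    return r2, r2_adj

r2_small, adj_small = r2_and_adjusted([x], y)
r2_big, adj_big = r2_and_adjusted([x, noise], y)
```

Here `r2_big` is never smaller than `r2_small`, regardless of how useless `noise` is, which is exactly why an unpenalized $R^2$ cannot discriminate among nested models.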

In this Chapter, we are particularly interested in extensions of R2 and in information criteria. Indeed, they can be used together, with the latter used to compare models and the former used to evaluate the overall quality of the selected model. The considered measures, and their characteristics, are listed in Table 1.1. The third column of Table 1.1 specifies whether a measure can be used to evaluate model adequacy due to both fixed and random effects (A), or to evaluate model adequacy due to fixed effects (F), or to perform model selection (S). Some of the considered measures have dual use, as they can be used to both evaluate model adequacy and perform model selection (categories A&S and F&S).

In the following, when speaking about overall model adequacy, we imply model adequacy due to both fixed and random effects. Column 4 shows whether the measure is absolute or relative, and column 5 indicates which relative measures require the specification of a null model. Finally, the last column concerns the interpretation: as a measure of explained variation, as a concordance coefficient between observed and predicted values, as an information criterion, or otherwise. For instance, $D_{rand}$ is a measure of the proportional reduction in deviance.

The aim of this Chapter is to compare the measures considered in Table 1.1.


Table 1.1: Characteristics of the considered measures. A = overall model adequacy; F = model adequacy due to fixed effects; S = model selection; Abs = absolute; Rel = relative; EV = explained variation; C = concordance coefficient between observed and predicted values; IC = information criterion; m. = marginal; c. = conditional.

| Reference | Measure | Category | Type | Null model | Measure of |
|---|---|---|---|---|---|
| Gelman and Pardoe (2006) | $R^2_{“level”}$ | A | Abs | | EV |
| | $\lambda_{“level”}$ | A | Abs | | other |
| Zheng (2000) | $D_{rand}$ | A | Rel | required | other |
| | $c$ | A | Abs | | C |
| Xu (2003) | $R^2$ | A | Rel | required | EV |
| | $\rho^2$ | A | Rel | required | EV |
| H. Liu et al. (2008) | $R^2_T$ | A | Rel | required | EV |
| Vonesh et al. (1996) | c. $r_{c,a}$ | A&S | Abs | | C |
| Vonesh and Chinchilli (1996) | c. $R^2_{VC,a}$ | A&S | Rel | required | EV |
| Zheng (2000) | $P_{rand}$ | A&S | Rel | required | other |
| Xu (2003) | $r^2$ | A&S | Rel | required | EV |
| H. Liu et al. (2008) | $R^2_{T,a}$ | A&S | Rel | required | EV |
| Snijders and Bosker (1994) | $R^2_1$ | F&S | Rel | required | EV |
| | $R^2_2$ | F&S | Rel | required | EV |
| Vonesh et al. (1996) | m. $r_{c,a}$ | F&S | Abs | | C |
| Vonesh and Chinchilli (1996) | m. $R^2_{VC,a}$ | F&S | Rel | required | EV |
| Zheng (2000) | m. $D_{rand}$ | F&S | Rel | required | other |
| | m. $P_{rand}$ | F&S | Rel | required | other |
| | m. $c$ | F&S | Abs | | C |
| Xu (2003) | m. $R^2$ | F&S | Rel | required | EV |
| H. Liu et al. (2008) | $R^2_{F,a}$ | F&S | Rel | required | EV |
| Wilks (1938) | LRT | S | Rel | | other |
| Akaike (1974) | mAIC | S | Rel | | IC |
| Schwarz (1978) | BIC | S | Rel | | IC |
| Vaida and Blanchard (2005) | cAIC | S | Rel | | IC |
| Spiegelhalter et al. (2002) | DIC | S | Rel | | IC |


In particular, we compare measures belonging to (a) categories A and A&S; (b) category F&S; and (c) category S. To do so, we conduct a simulation study using the home radon levels data of Gelman and Pardoe (2006), which initially motivated our work. In our simulation study, we manipulate five parameters of a varying-intercept and varying-slope model. We thus identify which of the considered measures are the most sensitive to these modifications and, among those allowing for model selection, which ones identify the correct model among a series of seven nested alternatives.

In Section 1.2, we define the LMM and make the notation explicit. In Section 1.3, we present and discuss the considered indices. We present and analyze the home radon levels data used for the simulation study in Section 1.4. We further analyze the data of Potthoff and Roy (1964) and of Kreft and Leeuw (1998) in Appendix A.1. We then describe the simulation study and present and comment on the results in Section 1.5. Finally, we discuss the results: it appears that $P_{rand}$ and the conditional $r_{c,a}$ are useful to check the overall adequacy of the model at hand; the marginal versions of $P_{rand}$ and $r_{c,a}$ are the most promising measures, among those investigated here, to identify the best set of fixed effects; and the conditional AIC fares best with restricted maximum likelihood (REML) estimation to compare LMMs.

1.2 Linear mixed-effects model (LMM)

Assume the following LMM (e.g., Bryk and Raudenbush, 1992; Goldstein, 2011; Laird and Ware, 1982; Skrondal and Rabe-Hesketh, 2004):

$$y_i = X_i \beta + Z_i b_i + \epsilon_i, \qquad i = 1, \dots, m, \qquad (1.1)$$

where $y_i = [y_{i1} \cdots y_{in_i}]'$ is the $n_i \times 1$ vector of responses for group $i$, $X_i$ is the $n_i \times p$ design matrix for fixed effects for group $i$, $\beta$ is the $p \times 1$ vector of unknown fixed-effects parameters, $Z_i$ is the $n_i \times q$ design matrix for random effects for group $i$, $b_i$ is the $q \times 1$ vector of unobservable random effects for group $i$, and $\epsilon_i$ is the $n_i \times 1$ vector of errors. We assume $b_i \sim \mathcal{D}(0, D)$, where $\mathcal{D}$ is a distribution with mean $0$ and $q \times q$ covariance matrix $D$, and $\epsilon_i \sim \mathcal{D}(0, R_i)$, with $R_i$ the $n_i \times n_i$ covariance matrix. The variance of the responses $y_i$ is thus $\Sigma_i = Z_i D Z_i' + R_i$. The total number of observations is $N = \sum_{i=1}^m n_i$. The $n_i \times 1$ vector of predicted values for group $i$ is $\hat{y}_i = X_i \hat{\beta} + Z_i \hat{b}_i$, where $\hat{\beta}$ is the vector of estimated fixed effects and $\hat{b}_i$ are the predicted random effects. In the frequentist framework, the parameters of $\Sigma_i$ can be estimated with maximum likelihood (ML), or with REML, and plugged into $\hat{\beta}$, and the predicted random effects are the conditional modes (best linear unbiased predictors, or BLUPs). For the Bayesian estimation, it is standard to consider noninformative normal priors for the coefficients associated with the fixed effects. The appropriate choice of priors for the parameters of the covariance matrix $D$ is less obvious and there is still a debate among specialists. In this Chapter, we follow Gelman and Pardoe (2006) and thus use noninformative uniform priors for the parameters of the covariance matrix $D$.
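As a concrete illustration, the following sketch fits a random-intercept LMM by REML with the Python package statsmodels and extracts the marginal ($X_i \hat{\beta}$) and conditional ($X_i \hat{\beta} + Z_i \hat{b}_i$) predictions. The data are simulated and all variable names are ours; this is a sketch, not the estimation procedure used in the thesis.

```python
# Simulate a two-level random-intercept model and fit it by REML.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
m, n = 30, 10                                  # 30 groups of 10 observations
g = np.repeat(np.arange(m), n)
x = rng.normal(size=m * n)
b0 = rng.normal(scale=1.0, size=m)             # random intercepts b_i0
y = 1.0 + 2.0 * x + b0[g] + rng.normal(scale=0.5, size=m * n)
data = pd.DataFrame({"y": y, "x": x, "g": g})

fit = smf.mixedlm("y ~ x", data, groups=data["g"]).fit(reml=True)
beta_hat = fit.fe_params                       # estimated fixed effects
marginal = fit.predict(data)                   # X beta_hat (fixed part only)
conditional = fit.fittedvalues                 # X beta_hat + Z b_hat (BLUPs)
```

The gap between `marginal` and `conditional` predictions is exactly what distinguishes the marginal and conditional versions of the measures reviewed in Section 1.3.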

Given that the most frequently used LMM is applied to data organized on two levels (e.g., children within classrooms), we present the LMM with two levels, which is consistent with the reviewed literature. In most cases, the distribution of the random effects and of the errors is assumed Gaussian. For this reason, model (1.1) is a generalization of the LMM as described in Laird and Ware (1982). We observe in Section 1.3 that some of the considered indices can be applied to two-level LMMs (Snijders and Bosker, 1994), while others can be applied to nonlinear mixed-effects models (e.g., Vonesh et al., 1996). Several measures, such as those of Xu (2003) and H. Liu et al. (2008), have the disadvantage of depending on the normality assumption of the errors and/or of the random effects.

1.3 Measures

In this Section, we present the considered measures, with a particular effort to make the various notations comparable. Some measures require the specification of a null model, which is either one containing only a fixed intercept, or one with a fixed intercept and a random intercept. For subsequent use in this Section, we define $\bar{y}$, the grand mean of the observed values $y_{ij}$; $1_{n_i}$, the $n_i \times 1$ unit vector; $I_{n_i}$, the $n_i \times n_i$ identity matrix; $\bar{\hat{y}}$, the grand mean of the predicted values $\hat{y}_{ij}$; $\hat{y}_{i0}$, the $n_i \times 1$ vector of fitted values for group $i$ obtained with the null model; and $R_{i0}$, the covariance matrix of the errors of the null model.

1.3.1 Measures of model adequacy only and of model adequacy and model selection

We introduce and compare in this Section the measures belonging to categories A, A&S and F&S (cf. Table 1.1).

1.3.1.1 Gelman and Pardoe (2006)

For a LMM with $L$ variance components, Gelman and Pardoe (2006) presented two Bayesian measures that summarize information in the data at each “level” of the model. We write “level” in quotation marks because it corresponds to the separate variance components, rather than to the more usual definition based on the hierarchy of the data. For instance, consider this varying-intercept model:

$$y_{ij} = \beta_0 + b_{i0} + \beta_1 x_{ij} + \epsilon_{ij}, \qquad i = 1, \dots, m, \; j = 1, \dots, n_i,$$
$$b_{i0} \sim \mathcal{N}(0, \tau^2), \qquad \epsilon_{ij} \sim \mathcal{N}(0, \sigma^2), \qquad (1.2)$$

with a predictor $x_{ij}$. Model (1.2) can be written hierarchically, with $\beta_{0i} = \beta_0 + b_{i0}$, as
$$y_{ij} \sim \mathcal{N}(\beta_{0i} + \beta_1 x_{ij}, \sigma^2), \qquad \beta_{0i} \sim \mathcal{N}(\beta_0, \tau^2).$$

This model has two “levels” (data $y_{ij}$, intercepts $\beta_{0i}$) with a different variance component at each “level” ($\sigma^2$, $\tau^2$).

The model is defined at each “level” $l = 1, \dots, L$ as
$$\zeta_k^{(l)} = \nu_k^{(l)} + e_k^{(l)},$$
for $k = 1, \dots, K^{(l)}$. $\zeta_k^{(1)}$ corresponds to $y_{ij}$ in (1.1) and, for $l > 1$, the $\zeta_k^{(l)}$ are the random coefficients. The $\nu_k^{(l)}$ are the linear predictors and the $e_k^{(l)}$ are the errors, which follow a distribution with mean $0$ and standard deviation $\sigma^{(l)}$. For instance, for model (1.2) and for $l = 1$, we have $\zeta_k^{(1)} = y_{ij}$, $\nu_k^{(1)} = \beta_0 + b_{i0} + \beta_1 x_{ij}$, $e_k^{(1)} = \epsilon_{ij}$ and $\sigma^{(1)} = \sigma$. For $l = 2$, we have $\zeta_k^{(2)} = \beta_{0i}$, $\nu_k^{(2)} = \beta_0$, $e_k^{(2)} = b_{i0}$ and $\sigma^{(2)} = \tau$.


Subsequently, we suppress the superscripts $(l)$, as in Gelman and Pardoe (2006), because we work with each “level” separately. The variation explained by the linear predictors $\nu_k$ for each “level” is defined in the population by $1 - [\mathrm{E}(\mathrm{Var}(e_k))][\mathrm{E}(\mathrm{Var}(\zeta_k))]^{-1}$ and is computed as
$$R^2_{“level”} = 1 - \frac{\mathrm{E}\left(V_{k=1}^{K}(\hat{e}_k)\right)}{\mathrm{E}\left(V_{k=1}^{K}(\hat{\zeta}_k)\right)},$$
where “$\mathrm{E}$” is the posterior mean, “$\mathrm{Var}$” is the posterior variance, “$V$” is the finite-sample variance operator ($V_{i=1}^{m}(x_i) = (m-1)^{-1}\sum_{i=1}^{m}(x_i - \bar{x})^2$), and $\hat{e}_k$ and $\hat{\zeta}_k$ are the estimates of $e_k$ and $\zeta_k$, respectively. The expectations are estimated by averaging over posterior simulation draws, which gives rise to a “Bayesian adjusted $R^2$” that is a generalization of the classical adjusted $R^2$ in regression. $R^2_{“level”}$ usually takes values between 0 and 1. For each “level,” the values of 0 and 1 indicate, respectively, a poor and a perfect fit with respect to the error variance explained at each “level.” If $R^2_{“level”}$ is negative, the prediction is so poor that the estimated error variance is larger than the variance of the data.

The measure that summarizes the average amount of pooling at each “level” is the pooling factor $\lambda_{“level”}$, which is defined in the population by $1 - [\mathrm{Var}(\mathrm{E}(e_k))][\mathrm{E}(\mathrm{Var}(e_k))]^{-1}$ and is computed as
$$\lambda_{“level”} = 1 - \frac{V_{k=1}^{K}\left(\mathrm{E}(\hat{e}_k)\right)}{\mathrm{E}\left(V_{k=1}^{K}(\hat{e}_k)\right)}.$$
This measure ranges from 0 to 1, where 0 and 1 correspond to no and complete pooling, respectively. For model (1.2), the complete-pooling model is $y_{ij} = \beta_0 + \beta_1 x_{ij} + \epsilon_{ij}$ with common estimates for all $i$, and the no-pooling model is $y_{ij} = \beta_{0i} + \beta_1 x_{ij} + \epsilon_{ij}$ with the $m$ $\beta_{0i}$'s estimated by least squares. A low pooling factor ($\lambda_{“level”} < 0.5$) indicates a higher degree of within-group information than population-“level” information. A high pooling factor ($\lambda_{“level”} > 0.5$) indicates a higher degree of population-“level” information than within-group information.
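As a numerical sketch, the pooling factor can be estimated from an $S \times K$ matrix of posterior draws of the errors $\hat{e}_k$ (rows are draws, columns are the $K$ “level” errors). The draws below are fabricated to mimic the two extreme situations; in practice they would come from an MCMC fit.

```python
# Pooling factor lambda = 1 - V_k(E(e_k)) / E(V_k(e_k)), estimated from
# posterior draws: numerator = finite-sample variance (over k) of the
# posterior means; denominator = posterior mean of the per-draw variance.
import numpy as np

def pooling_factor(e_draws):
    num = e_draws.mean(axis=0).var(ddof=1)               # V_k of posterior means
    den = np.mean([row.var(ddof=1) for row in e_draws])  # E of per-draw V_k
    return 1.0 - num / den

rng = np.random.default_rng(0)
S, K = 2000, 40
# Little pooling: each e_k concentrates around its own distinct mean.
spread = pooling_factor(rng.normal(loc=rng.normal(size=K), scale=0.3, size=(S, K)))
# Near-complete pooling: all e_k share the same posterior distribution.
shared = pooling_factor(rng.normal(loc=0.0, scale=0.3, size=(S, K)))
```

With distinct group-specific means, the posterior means of the $\hat{e}_k$ spread out and `spread` stays close to 0; when the $\hat{e}_k$ are exchangeable, `shared` approaches 1.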

1.3.1.2 Snijders and Bosker (1994)

Two measures of modeled variation, one at each level of a two-level LMM, are defined. This model is equivalent to model (1.1), in which levels 1 and 2 correspond to the subject level $j$ and the group level $i$, respectively, $R_i = \sigma^2 I_{n_i}$ and $D = \tau^2 I_q$, where $I_q$ is the $q \times q$ identity matrix. These measures are defined for two-level models only and they require a null model, which contains a fixed intercept and a random intercept, with variance $\tau_0^2 I_q$, and for which $R_{i0} = \sigma_0^2 I_{n_i}$.

The level-1 modeled proportion of variation is defined in the population as the proportional reduction in mean squared prediction error for $y_{ij}$, $1 - \operatorname{var}(y_{ij} - X_{ij}\beta)\,[\operatorname{var}(y_{ij})]^{-1}$. The corresponding criterion is
$$R_1^2 = 1 - \frac{\widehat{\operatorname{var}}(y_{ij} - X_{ij}\beta)}{\widehat{\operatorname{var}}(y_{ij})},$$
where $\widehat{\operatorname{var}}$ is obtained by plugging the parameter estimates into the population formula and $X_{ij}$ is the $1 \times p$ vector of fixed effects for subject $j$ in group $i$.

The level-2 modeled proportion of variation is defined in the population as the proportional reduction in mean squared prediction error for $\bar{y}_{i.}$, $1 - \operatorname{var}(\bar{y}_{i.} - \bar{X}_{i.}\beta)\,[\operatorname{var}(\bar{y}_{i.})]^{-1}$. The corresponding criterion is
$$R_2^2 = 1 - \frac{\widehat{\operatorname{var}}(\bar{y}_{i.} - \bar{X}_{i.}\beta)}{\widehat{\operatorname{var}}(\bar{y}_{i.})},$$
where $\bar{y}_{i.}$ and $\bar{X}_{i.}$ are the group means of $y_{ij}$ and $X_{ij}$, respectively.

To estimate the population parameters, the random effects and the errors are assumed normally distributed. For $R_1^2$, and for $R_2^2$ with balanced data ($n_i = n$ for all $i$), the numerator is the sample variance of the model of interest and the denominator is the sample variance of the null model. For example, for model (1.2), the criteria that estimate the population parameters are $R_1^2 = 1 - (\hat{\sigma}^2 + \hat{\tau}^2)/(\hat{\sigma}_0^2 + \hat{\tau}_0^2)$ and $R_2^2 = 1 - (\hat{\sigma}^2/n + \hat{\tau}^2)/(\hat{\sigma}_0^2/n + \hat{\tau}_0^2)$, with $\hat{\sigma}^2$, $\hat{\tau}^2$, $\hat{\sigma}_0^2$ and $\hat{\tau}_0^2$ the estimates of $\sigma^2$, $\tau^2$, $\sigma_0^2$ and $\tau_0^2$, respectively.

In the case of unbalanced data, the authors advise using a representative value of $n_i$, such as the harmonic mean $(m^{-1}\sum_i n_i^{-1})^{-1}$. The interpretation of $R_1^2$ and $R_2^2$ is the same as that of the traditional coefficient of determination (Draper and Smith, 1998). $R_1^2$ and $R_2^2$ identify which predictors are useful to predict $y_{ij}$ and $\bar{y}_{i.}$, respectively. Population values lie between 0 and 1. Negative values of the estimates are possible when the fixed part of the model is misspecified.
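For balanced data, $R_1^2$ and $R_2^2$ thus reduce to simple functions of the estimated variance components, which can be sketched as follows (the variance values below are illustrative, not from a real fit):

```python
# Snijders-Bosker explained variation for a balanced two-level model:
# R2_1 compares sigma^2 + tau^2, and R2_2 compares sigma^2/n + tau^2,
# with the corresponding quantities under the null (intercept-only) model.
def snijders_bosker_r2(sigma2, tau2, sigma2_0, tau2_0, n):
    r2_1 = 1.0 - (sigma2 + tau2) / (sigma2_0 + tau2_0)
    r2_2 = 1.0 - (sigma2 / n + tau2) / (sigma2_0 / n + tau2_0)
    return r2_1, r2_2

def harmonic_mean(group_sizes):
    """Representative n for unbalanced data, as advised in the text."""
    return len(group_sizes) / sum(1.0 / ni for ni in group_sizes)

# Null model: sigma0^2 = 4, tau0^2 = 1; model of interest: sigma^2 = 2, tau^2 = 0.5.
r2_1, r2_2 = snijders_bosker_r2(2.0, 0.5, 4.0, 1.0, n=10)
```

With these illustrative values, both criteria equal 0.5: the predictors halve the unexplained variation at each level.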

1.3.1.3 Vonesh et al. (1996)

The model concordance correlation coefficient for generalized nonlinear mixed-effects models is defined as
$$r_c = 1 - \frac{\sum_{i=1}^{m} (y_i - \hat{y}_i)'(y_i - \hat{y}_i)}{\sum_{i=1}^{m} (y_i - \bar{y}\,1_{n_i})'(y_i - \bar{y}\,1_{n_i}) + \sum_{i=1}^{m} (\hat{y}_i - \bar{\hat{y}}\,1_{n_i})'(\hat{y}_i - \bar{\hat{y}}\,1_{n_i}) + N(\bar{y} - \bar{\hat{y}})^2}.$$
Initially introduced by L. I. Lin (1989) to measure the degree of agreement between pairs of observations, $r_c$ is interpretable as a concordance correlation coefficient between observed and predicted values.

To assess the GOF associated with the fixed effects and to select the best set of fixed effects, a marginal model concordance correlation is obtained by setting $\hat{b}_i = 0$ in $\hat{y}_i$. If the $\hat{b}_i$ are not set to zero, as in the original definition, $r_c$ is referred to as the conditional model concordance correlation and it assesses the GOF associated with fixed and random effects. The range of values of $r_c$ is between $-1$ and $1$, as for the usual Pearson correlation, but with a slightly different interpretation. Indeed, $r_c$ measures the level of agreement, or concordance, between $y_i$ and $\hat{y}_i$: a value of 1 indicates perfect fit, while a value smaller than, or equal to, zero indicates lack of fit. Values adjusted for the number of parameters in $\beta$ are defined as $r_{c,a} = 1 - N(N-p)^{-1}(1 - r_c)$, which allows for using the conditional $r_{c,a}$ for model selection.
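The definitions of $r_c$ and $r_{c,a}$ can be sketched as follows. The data are fabricated; in a real application $y$ and $\hat{y}$ would be stacked over the $m$ groups, with $\hat{b}_i = 0$ for the marginal version.

```python
# Vonesh et al. (1996) concordance correlation r_c and adjusted r_{c,a}.
import numpy as np

def concordance(y, yhat, p):
    """Return (r_c, r_{c,a}) for stacked observed y and predicted yhat."""
    N = len(y)
    sse = ((y - yhat) ** 2).sum()
    denom = (((y - y.mean()) ** 2).sum()
             + ((yhat - yhat.mean()) ** 2).sum()
             + N * (y.mean() - yhat.mean()) ** 2)
    rc = 1.0 - sse / denom
    rc_a = 1.0 - N / (N - p) * (1.0 - rc)   # penalize the p fixed effects
    return rc, rc_a

rng = np.random.default_rng(3)
y = rng.normal(size=100)
rc_perfect, _ = concordance(y, y.copy(), p=2)   # perfect agreement gives r_c = 1
rc_noisy, rc_a_noisy = concordance(y, y + rng.normal(scale=0.5, size=100), p=2)
```

Note that the adjustment can only lower $r_c$ (for $r_c < 1$ and $p \geq 1$), so a larger model is preferred only if its raw concordance gain outweighs the penalty.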

1.3.1.4 Vonesh and Chinchilli (1996)

Another measure of explained residual variation for generalized nonlinear mixed-effects models, which requires the specification of a null model, is introduced as follows:
$$R^2_{VC} = 1 - \frac{\sum_{i=1}^{m} (y_i - \hat{y}_i)' V_i^{-1} (y_i - \hat{y}_i)}{\sum_{i=1}^{m} (y_i - \hat{y}_{i0})' V_i^{-1} (y_i - \hat{y}_{i0})},$$
for any positive definite matrix $V_i$. Sensible choices for $V_i$ are either $\hat{R}_i$ or $\hat{R}_{i0}$, the covariance estimates of $R_i$ or $R_{i0}$, respectively obtained by plugging-in the parameter
