
In this Chapter, we discuss measures allowing for evaluating model adequacy and performing model selection in the linear mixed-effects model. In particular, we focus on extensions of R2 and on information criteria. Thus, we deliberately do not consider shrinkage methods based on penalized loss functions (Bondell et al., 2010; Ibrahim et al., 2011; Peng and Lu, 2012) in the frequentist framework, or Bayesian selection methods, such as those of Chen and Dunson (2003) or Frühwirth-Schnatter and Tüchler (2008) (see Section 3 of Fahrmeir et al., 2010, for an overview). The comparison of such methods is left for future research. Note that, because we were initially motivated by the example of Gelman and Pardoe (2006), in our simulation study design the population model 6, defined by equation (1.4), was the most complex. Hence, we examined how the various measures fare in rejecting a simpler model in light of the correct and most complex model. We did not address the question of how well these measures reject more complex models (i.e., containing unneeded parameters) when compared to the correct and simpler model.

Based on the results of the simulation study in Section 1.5.2 and the characteristics of the measures partly listed in Table 1.1, we now discuss some guidelines for what appear to be the most promising indices. All the measures evaluating overall model adequacy (categories A and A&S), except the measures of Gelman and Pardoe (2006) and those based on a null model with a fixed intercept and a random intercept, give similar simulation results in terms of sensitivity to the modifications of some parameters. Some of them have the advantage of being further usable for model selection (category A&S), as they include some sort of penalty function (e.g., an adjustment for the number of parameters). Among them, Prand was the only measure able to identify the population model in all simulation cases, most likely thanks to its penalty term for a large number of random effects. As the generalizability of our results is limited by the implemented simulation study design, we also favor the use of the conditional rc,a in addition to Prand. Indeed, the conditional rc,a does not require the specification of a null model, assumes neither normality nor constant variances for the random effects and the errors, and is defined for generalized nonlinear mixed-effects models. Future research should also examine whether Prand requires normality of the random effects. Extensions to nonlinear mixed-effects models (Davidian and Giltinan, 1995; Vonesh and Chinchilli, 1996) also ought to be investigated.

Concerning the measures that allow for assessing model adequacy due to fixed effects and identifying the best fixed part of the model (category F&S), all of them yield similar results in the simulation study, except the marginal R2VC,a, the marginal rc and R2F,a, which fail to identify the best set of fixed effects. Among the remaining measures that correctly identify the best fixed part of the model, we favor the use of the marginal Prand and of the marginal rc,a, as their conditional counterparts (Prand and the conditional rc,a) perform well for testing overall model adequacy and for model selection.

In our simulation study, all the information criteria but one wrongly selected a model different from the population model. Only the cAIC REML did not fail in this regard, most probably because it is computed from the conditional likelihood and uses a penalty function adapted to LMMs, which differs depending on whether the LMM is estimated with ML or REML. As the cAIC is obtained from the conditional likelihood, both the cAIC ML and the cAIC REML can be used to compare models differing in their fixed and/or random parts (Greven and Kneib, 2010).

Based on the results of our simulation study, we therefore recommend using REML estimation for model selection when using the cAIC. However, the same is not true for the measures based on the marginal likelihood (e.g., -2LL, the mAIC or the BIC). Indeed, as noted in Sections 1.3.2.1 and 1.3.2.2, when ML estimation is used, these measures can compare models that differ in their fixed parts as well as models that differ in their random parts. On the other hand, when REML estimation is used, these measures can only compare models that differ in their random parts. In our simulation study, we considered -2LL ML, -2LL REML, the mAIC and the BIC. The results presented in Section 1.5.2.4 indicate that the sensitivity of -2LL ML to modifications of model parameters is similar to that of -2LL REML. Furthermore, in terms of model selection, the smallest values of -2LL ML are obtained for the most complex model, as expected, while the smallest values of -2LL REML are obtained for the models having the most complex random part (models 4 and 6). This latter result confirms that it is inappropriate to use REML estimation to compare models differing in their fixed parts.

We also computed the mAIC and the BIC, but only under ML estimation, as explained in Section 1.3.2.2. Thus, no comparison between the ML and REML estimations is possible for these measures. However, based on the results obtained for -2LL, when using the measures obtained from the marginal likelihood, we recommend using ML estimation for model selection in general, and REML estimation only to compare models with different sets of random effects.

For model specification, researchers should ideally consider all plausible models and apply some of the measures described in this Chapter. However, this approach can result in a very large number of alternative models and thus become unfeasible. Guidelines on how to proceed in practice for model specification can be found in Chapter 6 of Snijders and Bosker (1999). Another interesting topic, discussed in Chapter 9 of Snijders and Bosker (1999), is the possibility of including contextual effects to model a difference between the within- and between-group regression coefficients of a variable. Treating both model specification and contextual effects was, however, beyond the scope of this Chapter.

To conclude, for researchers wanting to compare LMMs and to evaluate their adequacy in order to capture as much information as possible in the data, we recommend jointly considering indices allowing for (a) model selection (cAIC REML), (b) fixed effects selection (the marginal rc,a and/or the marginal Prand), and (c) testing overall model quality (the conditional rc,a and/or Prand). The choice between the (marginal) Prand and the (marginal) conditional rc,a comes down to interpretation. The former can be interpreted similarly to the classical R2, as it measures the proportional reduction in PQL, whereas the latter is interpreted, less commonly, as a concordance correlation coefficient between observed and predicted values.
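The concordance correlation coefficient just mentioned can be computed directly from observed and model-predicted responses. The following Python sketch gives the unadjusted coefficient in the spirit of Vonesh et al. (1996); the function name and the NumPy implementation are ours, and the adjusted rc,a additionally corrects for the number of fixed effects.

```python
import numpy as np

def concordance_cc(y, yhat):
    """Unadjusted concordance correlation coefficient between
    observed (y) and predicted (yhat) responses; equals 1 for a
    perfect fit and decreases as agreement worsens."""
    y = np.asarray(y, dtype=float)
    yhat = np.asarray(yhat, dtype=float)
    n = len(y)
    num = np.sum((y - yhat) ** 2)
    den = (np.sum((y - y.mean()) ** 2)
           + np.sum((yhat - yhat.mean()) ** 2)
           + n * (y.mean() - yhat.mean()) ** 2)
    return 1.0 - num / den
```

For a perfect fit the coefficient equals 1; for y = (1, 2, 3) and yhat = (2, 3, 4) it equals 1 - 3/7, about 0.57.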

A measure of proportional reduction in deviance in generalized linear mixed-effects models

2.1 Introduction

Generalized linear mixed-effects models (GLMM, Breslow and Clayton, 1993) are an extension of linear mixed-effects models (LMM, Laird and Ware, 1982). LMM already provide a flexible framework for dealing with longitudinal (repeated measures) or clustered data. In comparison with linear regression models, which already allow for modeling the mean structure, the inclusion of random effects in LMM further allows for modeling the covariance structure in the presence of correlated data. However, LMM can only be used for symmetric continuous responses. GLMM further allow for modeling non-continuous responses, such as binary, count or skewed (e.g., Gamma distributed) data.

In practical applications, two aspects are especially important. The first aspect is model selection (e.g., Burnham and D. R. Anderson, 2002; Claeskens and Hjort, 2008).

In the GLMM, model selection involves identifying the relevant fixed part of the model, the random part of the model and the link function. In this Chapter, we focus on the selection of the fixed and random parts of the model. Thus, when we speak about model selection, we mean performing both fixed and random effects selection. The second aspect is the evaluation of the adequacy of a model at hand with respect to the data. To this end, a very intuitive and popular index is the coefficient of determination, namely R2, from ordinary least squares (OLS) regression models (Draper and Smith, 1998). The R2 measures the proportion of the variation of the responses explained by the predictors. In the context of mixed-effects models, many tools have emerged over the last few decades for these two aspects.
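As a reminder of the baseline that the mixed-model extensions generalize, the classical R2 can be computed directly from observed and fitted values; this minimal Python sketch (function name ours) illustrates the "proportion of variation explained" reading.

```python
import numpy as np

def r_squared(y, yhat):
    """Classical coefficient of determination: the proportion of
    the variation of the responses explained by the predictors."""
    y = np.asarray(y, dtype=float)
    yhat = np.asarray(yhat, dtype=float)
    ss_res = np.sum((y - yhat) ** 2)      # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot
```

For example, with y = (1, 2, 3) and fitted values (1.5, 2, 2.5), the residual sum of squares is 0.5 against a total of 2, giving R2 = 0.75.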

For model selection, the four main methods used are (1) information criteria (IC); (2) shrinkage methods based on penalized loss functions, such as the least absolute shrinkage and selection operator (LASSO, Tibshirani, 1996), the adaptive LASSO (ALASSO, Zou, 2006) or the smoothly clipped absolute deviation (SCAD, J. Fan and Li, 2001), which were adapted for LMM as well as for GLMM by Bondell et al. (2010), Ibrahim et al. (2011), Peng and Lu (2012), and Schelldorfer et al. (2014); (3) fence methods initially proposed by Jiang et al. (2008) for GLMM; and (4) Bayesian techniques, such as the deviance IC (DIC) of Spiegelhalter et al. (2002) or the adaptation of the stochastic search variable selection (SSVS, George and R. E. McCulloch, 1993, 1997; Geweke, 1996) approach initially proposed by Chen and Dunson (2003) for LMM. The idea of Chen and Dunson (2003) was taken up and extended by many authors. For instance, extensions to logistic mixed-effects models were proposed by Kinney and Dunson (2007) and Tüchler (2008). Frühwirth-Schnatter and Wagner (2010) considered the case of a non-Gaussian LMM and, more recently, Yang (2012) presented another method for GLMM with random effects in nonparametric settings. A review and a description of the four above-mentioned main selection methods in LMM can be found in Müller et al. (2013).

In this Chapter, we focus on the IC because we propose a measure inspired by them, as explained in the subsequent paragraphs. We especially focus on several versions of the Akaike IC (AIC, Akaike, 1974), including the conditional AIC (cAIC, Vaida and Blanchard, 2005), whose initial goal is to perform random effects selection (cf. Section 2.4.2), and on several versions of the Bayesian IC (BIC, Schwarz, 1978). The asymptotic behavior of many model selection procedures, such as the AIC, the BIC or the generalized IC (GIC) of R. Rao and Wu (1989), was studied by Shao (1997) for linear regression models.

However, the presence of random effects in mixed-effects models makes it difficult to establish asymptotic results and thus, very few such results for the IC can be found in the literature on mixed-effects models. To our knowledge, only Jiang and J. S. Rao (2003) and Pu and Niu (2006), who extended the GIC of R. Rao and Wu (1989) to LMM in two ways, studied the asymptotic behavior of their selection criteria.
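For reference, the generic marginal criteria discussed here are simple functions of the maximized log-likelihood. The Python sketch below (function names ours) uses the textbook penalties; keep in mind that counting parameters in mixed-effects models is itself non-trivial, which is precisely why the cAIC replaces the raw count with an effective number of degrees of freedom.

```python
import math

def maic(loglik, n_params):
    """Marginal AIC: -2 log-likelihood plus twice the number of
    parameters (fixed effects and variance components)."""
    return -2.0 * loglik + 2.0 * n_params

def bic(loglik, n_params, n_obs):
    """BIC: -2 log-likelihood plus log(n) times the number of
    parameters; the penalty grows with the sample size."""
    return -2.0 * loglik + math.log(n_obs) * n_params
```

In both cases, smaller values indicate the preferred model among those compared.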

For evaluating model adequacy, several extensions of the R2 were proposed. Such definitions were given, in particular, by Gelman and Pardoe (2006), H. Liu et al. (2008), Snijders and Bosker (1994), and Xu (2003) for LMM and by Zheng (2000) for GLMM.

Vonesh et al. (1996) proposed to consider a marginal version and a conditional version of a measure of model adequacy. The marginal index allows for evaluating the adequacy due to fixed effects, and the conditional index, that due to both fixed and random effects.

For GLMM as well as for nonlinear mixed-effects models (NLMM), Vonesh et al. (1996) thus defined a marginal and a conditional concordance correlation coefficient and further proposed an adjustment for the number of fixed effects. More recently, Nakagawa and Schielzeth (2013) and Johnson (2014) followed the idea of Vonesh et al. (1996) and defined a marginal R2 and a conditional R2 for GLMM.
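For the Gaussian (LMM) special case, the marginal and conditional R2 of Nakagawa and Schielzeth (2013) reduce to simple variance ratios. The sketch below (function names ours) covers this case only; the GLMM versions add a link-specific distribution variance to the denominator.

```python
def marginal_r2(var_fixed, var_random, var_resid):
    """Variance explained by the fixed effects alone (Gaussian case).
    var_random should be the summed variance of all random effects."""
    return var_fixed / (var_fixed + var_random + var_resid)

def conditional_r2(var_fixed, var_random, var_resid):
    """Variance explained by fixed and random effects together."""
    return (var_fixed + var_random) / (var_fixed + var_random + var_resid)
```

For example, with fixed-effects variance 2, random-effects variance 1 and residual variance 1, the marginal R2 is 0.5 and the conditional R2 is 0.75.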

In the literature, it is common to see some measures of model adequacy used for model selection. For instance, Zheng (2000) presented two examples to illustrate the use of their extensions of R2 and of the conditional concordance correlation coefficient of Vonesh et al. (1996) for model selection. Orelien and Edwards (2008) advised using marginal measures of model adequacy for fixed effects selection. Similarly, Vonesh et al. (1996) used their adjusted marginal concordance correlation coefficient to select the best set of fixed effects in an example. However, these kinds of measures are not initially designed for model selection. As a consequence, the existing measures of model adequacy perform poorly in terms of model selection, as we will demonstrate in a simulation study.

With this Chapter, we aim to propose a measure that combines the strengths (1) of the IC, allowing for performing model selection, and (2) of the measures of model adequacy, allowing for evaluating the quality of fit of the model at hand. Indeed, the IC can only be used to compare models. Thus, they provide no information about the model at hand, as noted by Vonesh and Chinchilli (1996), and their values cannot be intuitively interpreted. In contrast, the existing measures of model adequacy apply directly to the model at hand and thereby provide immediate feedback during the model building stage. However, we will see in this Chapter that the existing measures of model adequacy are not able to identify the correct model among a set of alternatives. Thus, in the context of the GLMM, we propose a penalized proportional reduction in deviance measure, named PRDpen, which is demonstrated both (1) to perform model selection, and (2) to assess the adequacy of the model at hand.

PRDpen is obtained first by defining a non-penalized proportional reduction in deviance measure, named PRD, and second by adding a penalty function on large models in the same way as the IC. PRD is an index of model adequacy that measures the proportional reduction in deviance due to the model of interest compared with a prespecified null model.

Thus, it is intuitively interpreted, similarly to the R2. The inclusion of the penalization for large models further enables PRDpen to identify the most appropriate model among a set of alternatives. Thereby, PRDpen is a hybrid measure that combines some of the characteristics of the IC with some of the characteristics of the existing measures of model adequacy. PRDpen thus has the advantage over the IC of giving information about the model at hand and of being intuitively interpreted. Furthermore, PRDpen has the advantage over the existing measures of model adequacy of being able to perform model selection.
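The construction just described can be sketched numerically. In the Python sketch below, prd is exactly the proportional reduction in deviance; the penalized form shown is only an illustrative BIC-style variant of our own, not the chapter's exact definition of PRDpen.

```python
import math

def prd(dev_model, dev_null):
    """Proportional reduction in deviance of the model of interest
    relative to a prespecified null model (interpreted like R2)."""
    return 1.0 - dev_model / dev_null

def prd_pen_sketch(dev_model, dev_null, n_params, n_obs):
    """Illustrative penalized variant (NOT the chapter's exact
    definition): a BIC-style penalty inflates the model deviance,
    lowering the score of large models."""
    return 1.0 - (dev_model + math.log(n_obs) * n_params) / dev_null
```

For instance, a model halving the null deviance (50 vs. 100) has PRD = 0.5, while any positive penalty pushes the penalized value below 0.5, disadvantaging larger models.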

This Chapter is organized as follows. Section 2.2 defines the GLMM. In Section 2.3, we first define PRDpen, which involves defining the deviance in GLMM and a saturated GLMM. Second, we explain how we compute PRDpen using the statistical software R (R Core Team, 2013). Third, we study its asymptotic loss efficiency and consistency (defined in Section 2.3.3) for fixed effects selection in the context of the LMM, following Pu and Niu (2006). In Section 2.4, we conduct a simulation study in order to evaluate the ability of PRDpen to perform model selection in finite samples. We further compare its performance with that of the most commonly used IC and of several measures of model adequacy. To illustrate the use of PRDpen as a tool for model selection as well as for evaluating model adequacy, we analyze a sub-sample of the data from the 1988-89 Bangladesh fertility survey (Huq and Cleland, 1990) in Section 2.5. Finally, we conclude with a discussion.