Thesis
Reference
Two essays in statistics: a prediction divergence criterion for model selection & wavelet variance based estimation of latent time series models
GUERRIER, Stéphane
Abstract
This thesis is divided into two parts. First, it presents a new criterion for model selection which is shown to be particularly well suited to "sparse" settings, which we believe to be common in many research fields. Our selection procedure is developed for linear regression models, smoothing splines, autoregressive models and mixed linear models. These developments are then applied in biostatistics. The second part presents a new estimation method for the parameters of a time series model. The proposed method offers an alternative to maximum likelihood estimation that is straightforward to implement and is often the only feasible estimation method for complex models. We derive the asymptotic properties of the proposed estimator for inference and perform an extensive simulation study to compare our estimator to existing methods. Finally, we apply our method in engineering to calibrate inertial sensors and demonstrate that it represents a considerable improvement over benchmark methods.
GUERRIER, Stéphane. Two essays in statistics: a prediction divergence criterion for model selection & wavelet variance based estimation of latent time series models . Thèse de doctorat : Univ. Genève, 2013, no. SES 814
URN : urn:nbn:ch:unige-296284
DOI : 10.13097/archive-ouverte/unige:29628
Available at:
http://archive-ouverte.unige.ch/unige:29628
Disclaimer: layout of this document may differ from the published version.
A Prediction Divergence Criterion for Model Selection
&
Wavelet Variance based Estimation of Latent Time Series Models
Thesis
presented to the Faculty of Economic and Social Sciences of the Université de Genève
by
Stéphane Guerrier
under the supervision of
Prof. Maria-Pia Victoria-Feser
for the degree of
Doctor of Economic and Social Sciences, specialisation in Statistics
Members of the thesis committee:
Prof. Trevor Hastie, Stanford University
Prof. Olivier Renaud, Université de Genève
Prof. Elvezio Ronchetti, Université de Genève
Prof. Maria-Pia Victoria-Feser, Université de Genève
Thesis No. 814
Geneva, 6 September 2013
... are stated therein and which engage only the responsibility of their author.
Geneva, 6 September 2013
The Dean
Bernard MORARD
Printed from the author's manuscript
The aim of this thesis is twofold, and it is therefore divided into two parts. First, we propose a new model selection criterion and, second, we introduce a new estimation method for the parameters of latent time series models.
Model selection is a crucial part of any statistical analysis. Indeed, the methods used in this context become unavoidable in most problems that involve partial theoretical knowledge and a very large amount of information. This is, for example, the case in most research conducted nowadays in medicine, biology or economics. These methods aim to determine which variables are "important" to "explain" a phenomenon under study. However, the terms "important" and "explain" can have very different meanings depending on the context and, in fact, model selection can be applied to any situation in which a compromise between variability and complexity has to be found (McQuarrie and Tsai, 1998). For example, these techniques can be applied to select the "significant" variables in a regression problem, to determine the number of dimensions in a principal component analysis or simply to construct a histogram. In this respect, this thesis presents a new model selection criterion, called the Prediction Divergence Criterion, and demonstrates that it is, under certain regularity conditions, asymptotically efficient and possibly consistent. This approach is particularly suited to problems in which the number of variables considered is large but only a small subset of them is "significant". This type of situation is common in many research fields such as genomics and proteomics. Our selection procedure is developed for linear regression as well as for spline smoothing. The problem of identifying the order of an autoregressive model and the determination of the random structure of a mixed model are also investigated. These theoretical developments are then applied to the modelling of childhood malnutrition in Zambia as well as to the classification of two types of leukaemia using DNA microarray gene expression data.
The second part of this thesis addresses an entirely different problem by presenting a new estimation method for the parameters of time series models. We consider here, in particular, models constructed as a sum of latent processes. Such models are used, for example, in many applications in engineering and the natural sciences. The proposed estimation method, called the Generalised Method of Wavelet Moments, is very often the only one that allows complex models to be estimated, thus offering an alternative to maximum likelihood. This estimator results from the optimisation of a criterion based on a standardised distance between the estimated wavelet variance and the one implied by the model considered. The estimator is consistent and asymptotically normally distributed under conditions that can be easily verified. This thesis also presents a simulation study demonstrating that this approach compares favourably with alternative methods and makes it possible to estimate models for which no other alternative exists. Finally, this methodology is applied in engineering to the problem of calibrating inertial sensors and demonstrates a considerable improvement over existing methods.
This thesis is divided into two parts. In the first part, we propose a new criterion for model selection and, in the second part, we introduce a novel estimation technique for the parameters of latent time series models.
The problem of model selection is a crucial part of any statistical analysis and is at the centre of the first part of this thesis. In fact, model selection methods become inevitable in an increasingly large number of applications involving partial theoretical knowledge and vast amounts of information, such as in medicine, biology or economics. These techniques are intended to determine which variables are "important" to "explain" a phenomenon under investigation. The terms "important" and "explain" can have very different meanings according to the context and, in fact, model selection can be applied to any situation where one tries to balance variability with complexity (McQuarrie and Tsai, 1998). For example, these techniques can be applied to select "significant" variables in regression problems, to determine the number of dimensions in principal component analyses or simply to construct histograms. In this respect, we introduce a novel model selection criterion called the Prediction Divergence Criterion Estimator and we demonstrate that, under some regularity conditions, it is asymptotically loss efficient and can also be consistent. This new approach is shown to be particularly well suited to "sparse" settings, which we believe to be common in many research fields such as genomics and proteomics.
Our selection procedure is developed for linear regression models and smoothing splines. The problem of identifying the order of an autoregressive model and the determination of the random structure of mixed linear models are also investigated. These developments are then applied to an analysis of childhood malnutrition in Zambia and to the distinction of two leukaemia classes using microarray gene expression data.
The second part of this thesis presents a new estimation method for the parameters of a time series model. We consider here composite Gaussian processes that are the sum of independent Gaussian processes, each of which in turn explains an important aspect of the time series, as is the case in engineering and the natural sciences. The proposed estimation method offers an alternative to classical likelihood-based estimation that is straightforward to implement and is often the only feasible estimation method for complex models. The estimator results from the optimisation of a criterion based on a standardised distance between the sample estimates of the Wavelet Variance (WV) and the model-based WV. Indeed, the WV provides a decomposition of the variance of the process across different scales, so that it contains information about different features of the stochastic model. We derive the asymptotic properties of the proposed estimator for inference and perform an extensive simulation study to compare our estimator to existing methods. We also give sufficient, easily verified conditions on latent time series models for our estimator to be consistent. Finally, we apply our method in engineering to calibrate inertial sensors and demonstrate that it represents a considerable improvement over benchmark methods.
It is my pleasant duty to thank all the people who helped and supported me throughout my journey up to the completion of this work.
First of all, I would like to thank Prof. Maria-Pia Victoria-Feser, my thesis supervisor, who trusted me and gave me a great deal of freedom in carrying out this work while keeping a critical and demanding, but above all encouraging and benevolent, eye on it. Beyond her remarkable academic skills, Maria-Pia impressed me with her great human qualities. It has been an immense pleasure to work with her and I hope to be able to do so for a long time to come.
I am particularly indebted to my friend Dr. Yannick Stebler, who contributed to many of the developments presented here. Working with Yannick has been a great privilege for me because, in addition to being an exceptional person, he is a brilliant engineer and researcher. Without his contribution, this thesis would not be the same. I would also like to thank Dr. Adrian Waegli and Dr. Jan Skaloud, with whom I carried out my master's project and learned a great deal. I had the pleasure of continuing my research with them during my thesis, and this collaboration was very enriching.
During these years at the Université de Genève, I was fortunate to work alongside many people who all, in their own way, gave me their support and assistance: Dr. Dominique-Laurent Couturier and Elise Dupuis-Lozeron, with whom I shared an office and, above all, good times, and Roberto Molinari, with whom I greatly enjoyed working and whose contribution helped me a lot. My thanks also go to Charlotte Beauchamp, Prof. Eva Cantoni, Dr. Nabil Mili and Dr. Stéphane Rothen.
I am very honoured that Prof. Trevor Hastie of Stanford University agreed to be a member of my committee. It is also a great honour to have had Prof. Olivier Renaud and Prof. Elvezio Ronchetti, both of the Université de Genève, as committee members. I would like to thank them for the time they devoted to this thesis and for their pertinent comments, which notably improved its quality.
This thesis is the culmination of a long journey during which I was accompanied by those close to me, whom I would like to thank in turn.
Prof. Hannelore Lee-Jahnke, who, since my childhood, has tried to interest me in languages and has always given me friendly and judicious advice. I also express my friendship and gratitude to my friends, in particular Gaetan Bakalli, Messaoud Benabdelouahad, Stéphane Bilen, Cédric de L'Epine, Julien Forbat, Christian Imperiale, Mucyo Karemera and Fayçal Meflah, who all, each in their own way, accompanied and supported me and brought me their good spirits and humour. A special thought goes to Cédric, a faithful friend, to whom I am greatly indebted for always having been there for me. My gratitude also goes to Steve Ré for the support he gave me.
I owe a great deal to my love, Lorena Garzoni, who accompanied me, listened to me, encouraged me and, above all, put up with me at the end of the thesis. Her love, her comforting presence, her understanding and her humour are a great happiness to me.
Finally, my parents, who, through their love and unfailing confidence in me, supported and helped me, and without whom I could not have come this far.
Résumé
Abstract
Acknowledgements
List of Abbreviations
1 Introduction
2 A Prediction Divergence Criterion for Model Selection
    2.1 Introduction
    2.2 The d-Class Error Measures
    2.3 The Prediction Divergence Criterion
    2.4 Linear Regression Models
        2.4.1 Asymptotic Properties: The Ordered Case
        2.4.2 Asymptotic Properties: The Unordered Case
    2.5 Extensions of the PDC Approach
        2.5.1 Smoothing Splines
        2.5.2 Order Selection in Autoregressive Models
        2.5.3 Random Effects Selection in Mixed Linear Models
        2.5.4 Application to Large and High-Dimensional Problems
        2.5.5 Alternative Sorting Algorithm
    2.6 Simulation Study
        2.6.1 Linear Regression Models
        2.6.2 Smoothing Splines
        2.6.3 Autoregressive Models
        2.6.4 Mixed Linear Models
    2.7 Case Study: Applications in Biostatistics
        2.7.1 Childhood Malnutrition in Zambia
        2.7.2 Application to Acute Leukemias Classification
        2.7.3 Application to Pharmacokinetic Data
3 Estimation of Latent Time Series Models
    3.1 Introduction
    3.2 Important Conventions, Notations & Definitions
        3.2.1 Conventions & Notations
        3.2.2 Stochastic Processes Definitions
    3.3 The Allan Variance Methodology
        3.3.1 The Allan Variance
        3.3.2 Allan Variance based Estimation
    3.4 The Generalised Method of Wavelet Moments
        3.4.1 The Wavelet Variance
        3.4.2 The Generalised Method of Wavelet Moments Estimator
        3.4.3 Connection with the Allan Variance based Estimation
        3.4.4 Modified Forms of the Generalised Method of Wavelet Moments
    3.5 Model Selection Strategies for the GMWM
        3.5.1 Goodness-of-Fit Test
        3.5.2 Model Selection Criterion
    3.6 Simulation Study
    3.7 Case Study: Inertial Sensors
    3.8 Possible Extensions of the GMWM Methodology
    3.9 Concluding remarks
4 Conclusions
Appendix A Additional Results of Chapter 2
    A.1 Results on Quadratic Forms
    A.2 Additional Simulation Results
        A.2.1 Simulation 2.2
        A.2.2 Simulation 2.4
        A.2.3 Simulation 2.5
        A.2.4 Simulation 2.6
        A.2.5 Simulation 2.7
Appendix B Additional Results of Chapter 3
    B.1 Parametric Wavelet Variances for some specific processes
    B.2 Covariance Estimation of WV Estimators
    B.3 Graphical use of Wavelet Variances for model building purposes
    B.4 Additional Simulation Study
ACF auto-covariance function
AD Allan deviation
AIC Akaike information criterion
AV Allan variance
AR autoregressive
ARMA autoregressive moving-average
BIC Bayesian information criterion
DWT discrete wavelet transform
EKF extended Kalman filter
EM expectation-maximisation
GLS generalised least squares
GM Gauss-Markov
GMM generalised method of moments
GMWM generalised method of wavelet moments
GPS global positioning system
HRV heart rate variability
IEEE Institute of Electrical and Electronics Engineers
IMU inertial measurement unit
INS inertial navigation system
LS(E) least squares (estimator)
MA moving-average
MEMS micro-electro-mechanical system
MGF moment generating function
ML(E) maximum likelihood (estimator)
MLM mixed linear models
MODWT maximal overlap discrete wavelet transform
PDC(E) prediction divergence criterion (estimator)
PSD power spectral density
QN quantisation noise
RMSE root mean squared error
R-RMSE relative root mean squared error
RW random walk
SNR signal-to-noise ratio
UAV unmanned air vehicle
WVIC wavelet variance information criterion
WN white noise
WV wavelet variance
Chapter 1
Introduction
This thesis is divided into two parts. First, we propose a new criterion for model selection and, second, we introduce a novel estimation technique for the parameters of latent time series models.
The problem of model selection is a crucial part of any statistical analysis and is at the centre of the first part of this thesis. In fact, model selection methods become inevitable in an increasingly large number of applications involving partial theoretical knowledge and vast amounts of information, such as in medicine, biology or economics. These techniques are intended to determine which variables are "important" to "explain" a phenomenon under investigation. The terms "important" and "explain" can have very different meanings according to the context and, in fact, model selection can be applied to any situation where one tries to balance variability with complexity (McQuarrie and Tsai, 1998). For example, these techniques can be applied to select "significant" variables in regression problems, to determine the number of dimensions in principal component analyses or simply to construct histograms. In this respect, we introduce a new class of error measures and a new class of model selection criteria. Moreover, a novel criterion, called the Prediction Divergence Criterion Estimator, is derived from these two classes and we demonstrate that, under some regularity conditions, it is asymptotically loss efficient and can also be consistent¹. This new criterion is shown to be particularly well suited to "sparse" settings, which we believe to be common in many research fields such as genomics and proteomics. Our selection procedure is developed for linear regression models and smoothing splines. The problem of identifying the order of an autoregressive model and the determination of the random structure of mixed linear models are also investigated. These developments are then used in an analysis of childhood malnutrition in Zambia and for the distinction of two leukaemia classes using microarray gene expression data.
¹ The definitions of efficiency and consistency for model selection are given in Section 2.4.1 (see (2.40) and (2.39), respectively). See also, e.g., Shao (1997).
The second part of this thesis is dedicated to the estimation of latent time series models. These latent models naturally describe the evolution of time processes that cannot be measured directly or perfectly. Indeed, imperfectly measured processes can be thought of as being composed of the sum of at least two latent processes, describing the measurement error and the "true" process of interest. In many research problems, one either tries to recover the latter or to estimate the parameters characterising it. Additionally, the measurement error, as well as the process of interest, can itself be composed of several latent processes. This is, for example, the case with the measurement errors of inertial sensors, which are believed to be composite stochastic processes (Titterton and Weston, 2004). These sensors are commonly used in many engineering applications such as robotics or virtual reality. In this context, we propose a new estimation method for the parameter vector of latent time series models. This estimator, called the Generalised Method of Wavelet Moments, exploits the mapping that exists between the model and the vector of wavelet variances. The idea behind our estimator is, in some sense, to invert this mapping and find the model that is implied by the observed wavelet variance. We derive sufficient conditions on latent time series models to ensure consistency and asymptotic normality of the proposed estimator. Moreover, we show that this approach has many advantages over existing alternative methods for engineering and natural science applications.
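The idea of inverting the mapping between a model and its wavelet variances can be illustrated with a toy sketch. It is not the estimator defined in Section 3.4: it assumes a single white noise process, a Haar wavelet filter and an identity weighting (ordinary least squares) instead of the standardised distance used in the thesis. All function names are hypothetical. For white noise with variance σ², the Haar wavelet variance at level j is σ²/2^j, so matching empirical and model-implied wavelet variances gives a closed-form estimate of σ².

```python
import random

def haar_wavelet_variance(x, level):
    """Empirical Haar (maximal overlap) wavelet variance of x at a given level."""
    half = 2 ** (level - 1)   # observations in each half of the filter window
    width = 2 * half          # filter length at this level
    coeffs = []
    for t in range(width - 1, len(x)):
        recent = sum(x[t - half + 1 : t + 1])
        older = sum(x[t - width + 1 : t - half + 1])
        coeffs.append((recent - older) / width)
    return sum(w * w for w in coeffs) / len(coeffs)

def white_noise_wv(sigma2, level):
    """Model-implied Haar wavelet variance of white noise with variance sigma2."""
    return sigma2 / 2 ** level

def fit_white_noise(x, levels):
    """Least squares match of empirical and implied wavelet variances (identity weights)."""
    nu_hat = [haar_wavelet_variance(x, j) for j in levels]
    a = [white_noise_wv(1.0, j) for j in levels]   # implied WV per unit of sigma2
    return sum(n * ai for n, ai in zip(nu_hat, a)) / sum(ai * ai for ai in a)

rng = random.Random(1)
series = [rng.gauss(0.0, 2.0) for _ in range(4000)]   # true sigma2 = 4
print(fit_white_noise(series, levels=[1, 2, 3, 4]))
```

With several latent processes the implied wavelet variance is no longer linear in the parameters, and the minimisation generally has to be carried out numerically.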
This thesis is organised around two main chapters. The problem of model selection is treated in Chapter 2, which is organised as follows.
• Section 2.1 introduces the general problem of model selection and presents a motivating example.
• A new class of error measures, which generalises Efron's q-class, is introduced in Section 2.2 together with the associated optimism theorem.
• Section 2.3 presents a new class of model selection criteria which is based on the previously mentioned class of error measures.
• In Section 2.4 our methodology is applied to linear regression models and its asymptotic properties (e.g. efficiency or consistency) are derived.
• Section 2.5 discusses some possible extensions as well as future research plans. The application of our methodology to smoothing splines, to the selection of the correct order of an autoregressive model and to the choice of the random structure of mixed linear models is briefly analysed. A strategy for dealing with large datasets, where we may possibly have n ≪ p², is also presented.
• A detailed simulation study is presented in Section 2.6. These simulations reveal that our criterion performs particularly well in "sparse" settings and that it often behaves as an improved version of the Bayesian information criterion.
• Section 2.7 is dedicated to three case studies. The first one presents an analysis of childhood malnutrition in Zambia, while the second is an application to classification and gene selection in a leukaemia microarray problem. Finally, the third is a simple example of application in pharmacokinetics.
² Where n and p denote, respectively, the sample size and the number of parameters.
Chapter 3 is dedicated to the estimation of latent time series models and presents a new estimation method in this context. More precisely,
• Section 3.1 introduces the framework of latent time series models and presents existing estimation methods. Furthermore, Section 3.2 summarises the most important notations and conventions used throughout this chapter.
• Section 3.3 presents an estimation method commonly used in the engineering community for the estimation of latent time series models. We derive the conditions for the consistency of this estimator.
• The Generalised Method of Wavelet Moments is introduced in Section 3.4. We derive sufficient conditions on latent time series models to ensure consistency and asymptotic normality of the proposed estimator. The connection between this method and the one presented in Section 3.3 is also investigated. Finally, we present several extensions that are aimed at improving the performance of our estimator. One of these extensions is a general procedure which corrects the estimator's bias.
• Model selection strategies for the Generalised Method of Wavelet Moments estimators are briefly presented in Section 3.5.
• An extensive simulation study is presented in Section 3.6. These simulations reveal that our method is able to estimate complex models for which alternative estimation techniques fail.
• Section 3.7 presents a case study where our methodology is applied to the stochastic modelling of inertial sensor errors. This approach is shown to clearly outperform classical methods.
• Section 3.8 describes some possible extensions and presents future research plans.
Finally, Chapter 4 concludes.
Chapter 2
A Prediction Divergence Criterion for Model Selection
2.1 Introduction
Model selection is an important and challenging problem in statistics. Indeed, it becomes unavoidable in a growing number of applications that combine incomplete theoretical knowledge about the phenomenon under investigation with large amounts of available information, such as in medicine, biology or economics. Very often, model selection is about choosing, among a set of predictors, the subset that best predicts or explains a response variable.
A common model selection procedure consists in computing a criterion associated either with each potential model or with a suitable sequence of potential models, and in choosing the one(s) that optimise(s) this criterion. Many criteria have been proposed and the most popular ones include Mallows' C_p (Mallows, 1973), based on prediction error, Akaike's Information Criterion (AIC) (Akaike, 1974), based on the Kullback-Leibler divergence between the candidate model and the true one, and the Bayesian Information Criterion (BIC) (Schwarz, 1978). Comparisons of the relative (asymptotic) properties of the AIC and the BIC have flourished in the literature (see e.g. Zhang, 1993, Yang, 2005 and the references therein). When the true model is finite, the BIC is consistent¹ (see e.g. Haughton, 1988), while the AIC and the C_p (which are asymptotically equivalent, as shown e.g. in Nishii, 1984) have a non-zero (asymptotic) probability of overfitting.
In finite samples, consistency is not necessarily a good property since consistent selection criteria may tend to underfit, so that the chosen models could have larger prediction errors. A related measure is the Signal-to-Noise Ratio (SNR) (see e.g. McQuarrie and Tsai, 1998): a criterion with a weak SNR will tend to choose models that overfit, while a criterion with a strong SNR will tend to choose models that underfit. For example, the SNR of the BIC is larger than that of most criteria for a reasonable sample size, and this criterion is known for choosing models that are often too simple. Besides the C_p, the AIC and the BIC, a great number of criteria have been proposed in the literature (see e.g. Hannan and Quinn, 1979, Foster and George, 1994, Zheng and Loh, 1997, Tibshirani and Knight, 1999, George, 2000 and the references therein).
An alternative to explicit model selection criteria is based on, for example, prediction error estimates obtained by simulation methods such as the bootstrap (see e.g. Efron, 1983) or cross-validation (see e.g. Shao, 1993). Nevertheless, these methods are strongly linked to explicit model selection criteria. For example, Shao (1993) showed that leave-one-out cross-validation is asymptotically equivalent to the AIC. Moreover, Efron (2004) demonstrated that model-based penalty methods such as the AIC have better model selection performance than nonparametric methods like cross-validation, assuming that the model is believable.
¹ The definition of consistency for model selection is given in Section 2.4.1. See also, e.g., Shao (1997).
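To make the contrast between the AIC and BIC penalties concrete, the following minimal sketch uses their usual forms for a Gaussian linear model, up to additive constants: n log(RSS/n) + 2k and n log(RSS/n) + k log(n). The function names and the residual sums of squares are hypothetical illustrative values, not results from this thesis.

```python
import math

def aic(rss, n, k):
    """AIC of a Gaussian linear model with k regressors (additive constants dropped)."""
    return n * math.log(rss / n) + 2 * k

def bic(rss, n, k):
    """BIC of the same model: the penalty per regressor is log(n) instead of 2."""
    return n * math.log(rss / n) + k * math.log(n)

def select_size(rss_by_size, n, criterion):
    """Return the model size k that minimises the criterion over nested models."""
    return min(rss_by_size, key=lambda k: criterion(rss_by_size[k], n, k))

# Hypothetical residual sums of squares for nested models with k regressors.
n = 70
rss_by_size = {3: 120.0, 5: 70.0, 8: 62.0, 12: 58.0}

print("AIC selects k =", select_size(rss_by_size, n, aic))
print("BIC selects k =", select_size(rss_by_size, n, bic))
```

Since log(n) exceeds 2 as soon as n is at least 8, the BIC penalises each additional regressor more heavily than the AIC, which is consistent with its tendency to choose simpler models.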
The task of model selection can also be performed by adding some suitable penalty to the estimating function. The nonnegative garrote (Breiman, 1995), the lasso (Tibshirani, 1996), the elastic-net (Zou and Hastie, 2005) or the Dantzig Selector (Candès and Tao, 2007) are all examples of such methods. Nowadays, these methods are increasingly being used and often provide better results than selection approaches based on model selection criteria.
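As an illustration of such penalties, the lasso and the (naive) elastic-net can be written in their usual textbook forms as penalised least squares problems. The scaling of the loss and of the penalties varies between references and software packages, so the display below is a sketch of the general form rather than the exact parametrisation used later in this chapter.

```latex
\[
\hat{\boldsymbol{\beta}}_{\text{lasso}}
  = \operatorname*{argmin}_{\boldsymbol{\beta}}
    \; \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2
    + \lambda \lVert \boldsymbol{\beta} \rVert_1 ,
\qquad
\hat{\boldsymbol{\beta}}_{\text{enet}}
  = \operatorname*{argmin}_{\boldsymbol{\beta}}
    \; \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2
    + \lambda_2 \lVert \boldsymbol{\beta} \rVert_2^2
    + \lambda_1 \lVert \boldsymbol{\beta} \rVert_1 .
\]
```

The L1 term sets some coefficients exactly to zero, which is what turns estimation into variable selection, and setting λ2 = 0 in the elastic-net recovers the lasso.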
In this thesis we propose a new class of error measures, called the d-class, which builds on Efron's q-class (Efron, 1986), and derive the optimism² theorem (using the terminology of Efron, 2004) associated with this class. This enables one to easily construct model selection criteria which are consistent estimators of the error measure of interest. Additionally, we propose a new class of selection criteria in which one can choose between a consistent criterion and one with a small but non-zero (asymptotic) probability of overfitting. The latter, called the Prediction Divergence Criterion Estimator (PDCE), is derived from the optimism theorem we propose for the d-class of error measures. In finite samples, the PDCE often behaves as an improved version of the BIC and is particularly well suited to "sparse" settings. The derivation and the properties of this new class are presented later in the text; we start with a motivating example.
Suppose that we wish to explain the behaviour of a response variable, say y, using a set of p regressors and a linear regression model. Assume further that, among these p regressors, only p* ≪ p have an influence on y. This type of sparse situation is certainly common in practice, especially in the rapidly growing fields of genomics and proteomics. As an illustration of such a setting, consider the following linear model
\[
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad \boldsymbol{\varepsilon} \sim \mathcal{N}\left(\mathbf{0}, \sigma_\varepsilon^2 \mathbf{I}\right)
\]
with
\[
\boldsymbol{\beta} = (0, 1, 0, 1, 0, 1, 0, 1, 0, 1, \underbrace{0, \ldots, 0}_{50}) \tag{2.1}
\]
and $\sigma_\varepsilon^2 = 1$. Suppose also that the pairwise correlations between $\mathbf{x}_j$ and $\mathbf{x}_k$ (i.e. the jth and kth columns of $\mathbf{X}$) are arbitrarily chosen to be $\operatorname{corr}(\mathbf{x}_j, \mathbf{x}_k) = 0.5^{|j-k|}$. This situation corresponds to a theoretical $R^2$ of approximately 88.2% and to an SNR for the slope coefficients of about 7.4. In Table 2.2 we present the results of a simulation study comparing the performance of our PDCE³ with the lasso⁴, the elastic-net⁵ and the stepwise forward approach using the AIC, AICc, AICu, FPE, FPEu, BIC, HQ and HQc criteria (see Table 2.3). We also computed the Least Squares Estimator (LSE) on the complete model as a benchmark. The performance criteria used for comparing the methods are presented in Table 2.1 and were obtained using 500 simulated samples under the correct model. For the training and test samples, we chose respectively n = 70 and n* = 700. While a more extensive simulation study is provided in Section 2.6, Table 2.2 clearly reveals the advantage of our PDCE, not only in the probability of selecting the correct model, but also in prediction and estimation error. Figure 2.1 presents the PE_y (see (2.70)) and MSE_β (see (2.71)) distributions and reveals even more clearly the advantage of the PDCE over the other methods in the sparse setting considered in this simulation.
² The notion of optimism is defined in Section 2.2. See also, e.g., Efron (2004) or Hastie et al. (2009), Chapter 7.
³ Using the stepwise selection algorithm presented in Section 2.4.2.
⁴ For the lasso we used the R function lars of the lars package. The shrinkage coefficient λ was chosen by minimising the C_p statistic.
⁵ For the elastic-net we used the R function enet of the elasticnet package. The shrinkage coefficient λ2 (L2 penalty) was chosen by tenfold cross-validation and λ1 (L1 penalty) by minimising the C_p statistic.

Table 2.1: Model selection evaluation criteria

Criterion     Description
Cor. [%]      Proportion of times the correct model is selected.
Inc. [%]      Proportion of times the correct model is nested within the selected model.
true+         Average number of selected significant variables (true positives).
false+        Average number of selected non-significant variables (false positives).
NbReg         Average number of regressors in the selected model.
Med(PE_y)     Median of PE_y (see (2.70)) computed on test samples.
Med(MSE_β)    Median of MSE_β (see (2.71)) computed on test samples.
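The figures quoted for the design in (2.1) can be checked with a short sketch. It assumes (this is not stated in the excerpt) that the regressors are Gaussian with zero mean and unit variance, that the SNR is the variance of the signal Xβ divided by the error variance, and that the theoretical R² is the share of the signal variance in the total variance. The function names are hypothetical.

```python
import math
import random

P = 60                        # 10 leading coefficients plus 50 trailing zeros
ACTIVE = (1, 3, 5, 7, 9)      # zero-based positions of the non-zero slopes
RHO, SIGMA2 = 0.5, 1.0

beta = [1.0 if j in ACTIVE else 0.0 for j in range(P)]

def signal_variance(beta, rho):
    """beta' Sigma beta with Sigma[j][k] = rho ** |j - k| (unit-variance regressors)."""
    return sum(bj * bk * rho ** abs(j - k)
               for j, bj in enumerate(beta)
               for k, bk in enumerate(beta))

def draw_sample(n, rng):
    """Draw (X, y): each row of X is a stationary AR(1) sequence, so that
    corr(x_j, x_k) = RHO ** |j - k|, and y follows the linear model (2.1)."""
    scale = math.sqrt(1.0 - RHO ** 2)
    X, y = [], []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0)]
        for _ in range(P - 1):
            row.append(RHO * row[-1] + scale * rng.gauss(0.0, 1.0))
        X.append(row)
        noise = rng.gauss(0.0, math.sqrt(SIGMA2))
        y.append(sum(b * x for b, x in zip(beta, row)) + noise)
    return X, y

snr = signal_variance(beta, RHO) / SIGMA2
r2 = signal_variance(beta, RHO) / (signal_variance(beta, RHO) + SIGMA2)
print(f"SNR = {snr:.2f}, theoretical R^2 = {100 * r2:.1f}%")

X_train, y_train = draw_sample(70, random.Random(0))   # training sample, n = 70
```

Under these assumptions the signal variance is 5 + 2 + 0.375 + 0.0625 + 0.0078125 = 7.4453125, which gives an SNR of about 7.4 and an R² of about 88.2%, in agreement with the values quoted in the text.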
The rest of the chapter is organised as follows. In Section 2.2 we introduce the d-class of error measures, which generalises Efron's q-class, and we derive the associated optimism theorem. Section 2.3 presents a new class of model selection criteria, called the Prediction Divergence Criterion (PDC), which is based on the d-class of error measures. In Section 2.4 we apply the PDC to linear regression models and derive, for these models, the asymptotic properties of this selection approach. Section 2.5 presents some possible extensions of the PDC methodology as well as future research plans. The application of this approach to smoothing splines, the selection of the order of an autoregressive model as well as the choice of random effects in Mixed
Table 2.2: Evaluation criteria as explained in Table 2.1 for the full model with the LS estimator (LS), stepwise forward FPE (FPE), FPEu (FPEu), AIC (AIC), AICc (AICc), AICu (AICu), BIC (BIC), HQ (HQ), HQc (HQc), lasso (lasso), elastic-net (enet) and stepwise forward PDCE (PDCE) based on 500 simulated samples under the correct model. The definition of the model selection criteria can be found in Table 2.3. The numbers in parentheses for the columns Med($\mathrm{PE}_y$) and Med($\mathrm{MSE}_\beta$) are the corresponding standard errors estimated by using the bootstrap with $B = 500$ resamplings. The numbers in square brackets (superscripts in the original layout) indicate the ranked performance for each evaluation criterion (before rounding).

Method   Med(PE_y)                Med(MSE_β)                     Cor. [%]      Inc. [%]       true+       false+      NbReg
LS       6.81 (1.7·10^-1) [12]    9.73·10^0  (3.2·10^-1) [12]    0.0 [9.5]     100.0 [5.5]    5.0 [5.5]   55.0 [12]   60.0 [12]
FPE      2.88 (6.5·10^-2) [10]    2.89·10^0  (1.0·10^-1) [10]    0.0 [9.5]     100.0 [5.5]    5.0 [5.5]   25.4 [10]   30.4 [10]
FPEu     1.75 (2.3·10^-2) [8]     1.02·10^0  (4.3·10^-2) [8]     0.0 [9.5]     100.0 [5.5]    5.0 [5.5]   10.2 [7]    15.2 [7]
AIC      4.04 (2.0·10^-1) [11]    4.82·10^0  (2.8·10^-1) [11]    0.0 [9.5]     100.0 [5.5]    5.0 [5.5]   33.7 [11]   38.7 [11]
AICc     1.72 (2.0·10^-2) [7]     9.87·10^-1 (3.6·10^-2) [7]     0.0 [9.5]     100.0 [5.5]    5.0 [5.5]   9.3 [6]     14.3 [6]
AICu     1.46 (1.8·10^-2) [6]     5.61·10^-1 (2.2·10^-2) [6]     1.4 [4.5]     100.0 [5.5]    5.0 [5.5]   5.2 [4]     10.2 [4]
BIC      1.44 (1.6·10^-2) [4]     5.40·10^-1 (2.2·10^-2) [4]     4.6 [3]       100.0 [5.5]    5.0 [5.5]   5.7 [5]     10.7 [5]
HQ       2.25 (6.9·10^-2) [9]     1.72·10^0  (1.0·10^-1) [9]     0.0 [9.5]     100.0 [5.5]    5.0 [5.5]   19.5 [9]    24.5 [9]
HQc      1.45 (1.6·10^-2) [5]     5.49·10^-1 (1.9·10^-2) [5]     1.4 [4.5]     100.0 [5.5]    5.0 [5.5]   5.0 [3]     10.0 [3]
lasso    1.33 (1.4·10^-2) [3]     3.76·10^-1 (1.4·10^-2) [3]     0.2 [6]       100.0 [5.5]    5.0 [5.5]   14.7 [8]    19.7 [8]
enet     1.27 (8.4·10^-3) [2]     3.00·10^-1 (1.1·10^-2) [2]     14.2 [2]      98.8 [12]      5.0 [12]    4.9 [2]     9.9 [2]
PDCE     1.09 (6.7·10^-3) [1]     8.71·10^-2 (4.9·10^-3) [1]     74.6 [1]      99.6 [11]      5.0 [11]    0.3 [1]     5.3 [1]
[Figure 2.1 panels: $\mathrm{PE}_y$ and $\mathrm{MSE}_\beta$, each shown for LS, FPE, FPEu, AIC, AICc, AICu, BIC, HQ, HQc, lasso, enet and PDCE.]
Figure 2.1: Empirical distributions of $\mathrm{PE}_y$ and $\mathrm{MSE}_\beta$ as defined in (2.70) and (2.71) (see also Table 2.1) for the full model with the LS estimator (LS), stepwise forward FPE (FPE), FPEu (FPEu), AIC (AIC), AICc (AICc), AICu (AICu), BIC (BIC), HQ (HQ), HQc (HQc), lasso (lasso), elastic-net (enet) and stepwise forward PDCE (PDCE) based on 500 simulated samples under the correct model.
Linear Models (MLM) are briefly investigated. Section 2.6 presents a simulation study that compares the finite sample performance of different PDC criteria with other model selection techniques such as the lasso or the stepwise AIC. In Section 2.7 we present three case studies.
In the first one, the PDC selection procedure and other selection approaches are employed in an analysis of childhood malnutrition in Zambia. We then present an application to classification and gene selection in a leukaemia microarray problem. Finally, the PDC approach for the selection of random effects in MLM is illustrated with pharmacokinetics data.
2.2 The d-Class Error Measures
Consider a random variable $Y$ distributed according to model $F_\theta$, possibly conditionally on a set of fixed covariates $\mathbf{x} = [x_1 \ldots x_p]$. We observe a random sample $\mathbf{Y} = (Y_i)_{i=1,\ldots,n}$ supposedly generated from $F_\theta$, possibly together with a non-random $n \times p$ full rank matrix of inputs $\mathbf{X}$.
Given a prediction function $\hat{Y}$ that depends on the chosen model, Efron (1986) uses a function $Q(u, v)$ from the $q$-class of error measures to define a prediction error measure between the in-sample prediction and an out-of-sample predicted response. The $q$-class of error measures based on the concave function $q(\cdot)$ is given by
$$Q(u, v) = q(v) + \dot{q}(v)(u - v) - q(u)$$
where $\dot{q}(v)$ is the derivative of $q(\cdot)$ evaluated at $v$. The particular choice of $q(u) = u(1 - u)$ gives the squared loss function $Q(u, v) = (u - v)^2$. The prediction error measure is quantified by the (out-of-sample) expected prediction error
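$$\mathrm{EPErr} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{EPErr}_i \quad \text{where} \quad \mathrm{EPErr}_i = E\left[ E_0\left[ Q(Y_i^0, \hat{Y}_i) \mid \mathbf{Y} \right] \right] \tag{2.2}$$
with $\mathbf{Y}^0 = (Y_i^0)_{i=1,\ldots,n}$ a random variable distributed as $\mathbf{Y}$, and where, as throughout this thesis, $E[\cdot]$ and $E_0[\cdot]$ denote expectations under the distribution of $Y_i \mid \mathbf{x}_i$, respectively $Y_i^0 \mid \mathbf{x}_i$, the correct model. The expectations are, depending on the context, simple or multiple.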
In the special case where the distribution of $\mathbf{Y}$ is replaced by the empirical distribution $\mathbf{y}$, one gets the in-sample error,
$$\mathrm{ISErr} = \frac{1}{n} \sum_{i=1}^{n} E_0\left[ Q(Y_i^0, \hat{y}_i) \mid \mathbf{y} \right].$$
A training or apparent error can simply be computed as the average loss over the training sample $\mathbf{y}$, i.e.
$$\mathrm{AErr} = \frac{1}{n} \sum_{i=1}^{n} Q(y_i, \hat{y}_i).$$
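AErr is an optimistic estimate of EPErr because the same data are used both to fit the prediction rule and to assess its error. Let the optimism $\Psi$ and the expected optimism $\Omega$ be defined as
$$\begin{aligned}
\Psi &= \sum_{i=1}^{n} \Psi_i \quad \text{where} \quad \Psi_i = E_0\left[ Q(Y_i^0, \hat{y}_i) \mid \mathbf{y} \right] - Q(y_i, \hat{y}_i), \\
\Omega &= \sum_{i=1}^{n} \Omega_i \quad \text{where} \quad \Omega_i = E[\Psi_i],
\end{aligned} \tag{2.3}$$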
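respectively. Then, Efron’s optimism theorem (see Efron, 2004) demonstrates that
$$\mathrm{EPErr}_i = E\left[ E_0\left[ Q(Y_i^0, \hat{Y}_i) \mid \mathbf{y} \right] \right] = E\left[ Q(Y_i, \hat{Y}_i) + \Omega_i \right] \quad \text{where} \quad \Omega_i = \operatorname{cov}\left( \dot{q}(\hat{Y}_i), Y_i \right).$$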
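Hence, an estimator of EPErr is obtained as
$$\widehat{\mathrm{EPErr}} = \frac{1}{n} \sum_{i=1}^{n} \widehat{\mathrm{EPErr}}_i \quad \text{where} \quad \widehat{\mathrm{EPErr}}_i = Q(y_i, \hat{y}_i) + \widehat{\operatorname{cov}}\left( \dot{q}(\hat{Y}_i), Y_i \right) \tag{2.4}$$
where, depending on the distribution of $Y_i \mid \mathbf{x}_i$, $\widehat{\operatorname{cov}}(\cdot)$ is obtained either analytically up to a value of $\theta$, the model’s parameters, which is then replaced by $\hat{\theta}$, or by resampling methods (see e.g. Efron, 2004).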
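We may extend this methodology and the $q$-class of error measures as follows. Suppose that we wish to construct a model selection criterion, say $C$, in order to assess the discrepancy $D(\cdot,\cdot)$ between two equidimensional vector valued functions $\mathbf{f}_1(\hat{\boldsymbol{\theta}}_1, \mathbf{Y})$ and $\mathbf{f}_2(\hat{\boldsymbol{\theta}}_2, \mathbf{Y})$ where $\hat{\boldsymbol{\theta}}_1$ and $\hat{\boldsymbol{\theta}}_2$ denote the estimated parameter vectors associated, respectively, to the models $F_{\theta_1}$ and $F_{\theta_2}$. Such a criterion can be defined without loss of generality as
$$C = E\left[ E_0\left[ D\left( \mathbf{f}_1(\mathbf{Y}^0, \hat{\boldsymbol{\theta}}_1), \mathbf{f}_2(\mathbf{Y}, \hat{\boldsymbol{\theta}}_2) \right) \right] \right] \tag{2.5}$$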
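where the expectation is multidimensional. The discrepancy $D(\mathbf{f}_1(\mathbf{Y}^0, \hat{\boldsymbol{\theta}}_1), \mathbf{f}_2(\mathbf{Y}, \hat{\boldsymbol{\theta}}_2))$ is said to belong to the $d$-class of error measures if the function $D(\cdot,\cdot)$ is a valid Bregman divergence and if the functions $\mathbf{f}_i(\hat{\boldsymbol{\theta}}_i, \mathbf{Y}) : \mathbb{R}^n \to \mathbb{R}^n$ (for $\boldsymbol{\theta}_i$ fixed) are equidimensional and associated, respectively, to the models $F_{\theta_i}$, $i = 1, 2$. In addition, we assume that the estimators $\hat{\boldsymbol{\theta}}_1$ and $\hat{\boldsymbol{\theta}}_2$ are based, respectively, on $\mathbf{Y}^0$ and $\mathbf{Y}$. In some sense, Bregman divergences are the multivariate equivalent of Efron’s $q$-class (see Bregman, 1967 for more details). The Bregman divergence encompasses squared error, relative entropy, logistic loss, Mahalanobis distance and other error measures. The Bregman divergence between two equidimensional vectors $\mathbf{u}$ and $\mathbf{v}$ is defined as
$$D(\mathbf{u}, \mathbf{v}) = \psi(\mathbf{u}) - \psi(\mathbf{v}) - (\mathbf{u} - \mathbf{v})^T \nabla\psi(\mathbf{v}) \tag{2.6}$$
where $\psi(\cdot)$ is a scalar-valued function and $\nabla\psi(\mathbf{v})$ represents the gradient vector of $\psi(\cdot)$ evaluated at $\mathbf{v}$. The function $\psi(\cdot)$ is strictly convex and differentiable. For example, a squared loss function $D(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\|_2^2$ is obtained when $\psi(\mathbf{v}) = \mathbf{v}^T \mathbf{v}$.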
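As an illustration of (2.6), the following minimal Python sketch evaluates a Bregman divergence from a user-supplied $\psi$ and its gradient. The function name `bregman_divergence` and the two example generators are illustrative choices and not part of the original text. The negative-entropy generator is the standard one that yields relative entropy when both arguments are probability vectors.

```python
import numpy as np

def bregman_divergence(psi, grad_psi, u, v):
    """Bregman divergence D(u, v) = psi(u) - psi(v) - (u - v)^T grad_psi(v)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return psi(u) - psi(v) - (u - v) @ grad_psi(v)

# Generator psi(z) = z^T z, which gives the squared loss ||u - v||_2^2.
def psi_squared(z):
    return z @ z

def grad_psi_squared(z):
    return 2.0 * z

# Generator psi(z) = sum_i z_i log z_i (negative entropy; entries must be positive).
def psi_negentropy(z):
    return np.sum(z * np.log(z))

def grad_psi_negentropy(z):
    return np.log(z) + 1.0

# Two probability vectors used purely as an example.
u = np.array([0.2, 0.3, 0.5])
v = np.array([0.4, 0.4, 0.2])
d_squared = bregman_divergence(psi_squared, grad_psi_squared, u, v)
d_entropy = bregman_divergence(psi_negentropy, grad_psi_negentropy, u, v)
```

Here `d_squared` coincides with $\|\mathbf{u} - \mathbf{v}\|_2^2$ and `d_entropy` with the relative entropy $\sum_i u_i \log(u_i / v_i)$, since both vectors sum to one.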
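Similarly to (2.3) we define the optimism $\Delta$ for a $d$-class error measure and for a criterion defined in (2.5) as
$$\Delta = E\left[ E_0\left[ D\left( \mathbf{f}_1(\mathbf{Y}^0, \hat{\boldsymbol{\theta}}_1), \mathbf{f}_2(\mathbf{Y}, \hat{\boldsymbol{\theta}}_2) \right) \right] \right] - E\left[ D\left( \mathbf{f}_1(\mathbf{Y}, \hat{\boldsymbol{\theta}}_1), \mathbf{f}_2(\mathbf{Y}, \hat{\boldsymbol{\theta}}_2) \right) \right] \tag{2.7}$$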
and from this definition we have the following “optimism” theorem.
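Theorem 2.1: Let the discrepancy $D(\cdot,\cdot)$ be a valid $d$-class error measure based on $\psi(\cdot)$ and assume that
$$E\left[ \left( \mathbf{f}_1(\mathbf{Y}, \hat{\boldsymbol{\theta}}_1) - E_0\left[ \mathbf{f}_1(\mathbf{Y}^0, \hat{\boldsymbol{\theta}}_1) \right] \right)^T \left( \mathbf{f}_1(\mathbf{Y}, \hat{\boldsymbol{\theta}}_1) - E_0\left[ \mathbf{f}_1(\mathbf{Y}^0, \hat{\boldsymbol{\theta}}_1) \right] \right) \right] < \infty$$
$$E\left[ \nabla\psi\left( \mathbf{f}_2(\mathbf{Y}, \hat{\boldsymbol{\theta}}_2) \right)^T \nabla\psi\left( \mathbf{f}_2(\mathbf{Y}, \hat{\boldsymbol{\theta}}_2) \right) \right] < \infty.$$
Then
$$\Delta = \operatorname{tr}\left\{ \operatorname{cov}\left[ \mathbf{f}_1(\mathbf{Y}, \hat{\boldsymbol{\theta}}_1), \nabla\psi\left( \mathbf{f}_2(\mathbf{Y}, \hat{\boldsymbol{\theta}}_2) \right) \right] \right\}.$$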
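Proof: By definition, we have that
$$\Delta = E\left[ E_0\left[ D\left( \mathbf{f}_1(\mathbf{Y}^0, \hat{\boldsymbol{\theta}}_1), \mathbf{f}_2(\mathbf{Y}, \hat{\boldsymbol{\theta}}_2) \right) \right] \right] - E\left[ D\left( \mathbf{f}_1(\mathbf{Y}, \hat{\boldsymbol{\theta}}_1), \mathbf{f}_2(\mathbf{Y}, \hat{\boldsymbol{\theta}}_2) \right) \right]$$
and since the discrepancy $D(\cdot,\cdot)$ belongs to the $d$-class we have that $D(\cdot,\cdot)$ is a valid Bregman divergence. Therefore, we may express the above terms as: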
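$$\begin{aligned}
E\left[ E_0\left[ D\left( \mathbf{f}_1(\mathbf{Y}^0, \hat{\boldsymbol{\theta}}_1), \mathbf{f}_2(\mathbf{Y}, \hat{\boldsymbol{\theta}}_2) \right) \right] \right] ={}& E_0\left[ \psi\left( \mathbf{f}_1(\mathbf{Y}^0, \hat{\boldsymbol{\theta}}_1) \right) \right] - E\left[ \psi\left( \mathbf{f}_2(\mathbf{Y}, \hat{\boldsymbol{\theta}}_2) \right) \right] \\
& - E\left[ \left( E_0\left[ \mathbf{f}_1(\mathbf{Y}^0, \hat{\boldsymbol{\theta}}_1) \right] - \mathbf{f}_2(\mathbf{Y}, \hat{\boldsymbol{\theta}}_2) \right)^T \nabla\psi\left( \mathbf{f}_2(\mathbf{Y}, \hat{\boldsymbol{\theta}}_2) \right) \right]
\end{aligned}$$
and
$$\begin{aligned}
E\left[ D\left( \mathbf{f}_1(\mathbf{Y}, \hat{\boldsymbol{\theta}}_1), \mathbf{f}_2(\mathbf{Y}, \hat{\boldsymbol{\theta}}_2) \right) \right] ={}& E\left[ \psi\left( \mathbf{f}_1(\mathbf{Y}, \hat{\boldsymbol{\theta}}_1) \right) \right] - E\left[ \psi\left( \mathbf{f}_2(\mathbf{Y}, \hat{\boldsymbol{\theta}}_2) \right) \right] \\
& - E\left[ \left( \mathbf{f}_1(\mathbf{Y}, \hat{\boldsymbol{\theta}}_1) - \mathbf{f}_2(\mathbf{Y}, \hat{\boldsymbol{\theta}}_2) \right)^T \nabla\psi\left( \mathbf{f}_2(\mathbf{Y}, \hat{\boldsymbol{\theta}}_2) \right) \right].
\end{aligned}$$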
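So by subtracting the two terms we obtain
$$\Delta = E_0\left[ \psi\left( \mathbf{f}_1(\mathbf{Y}^0, \hat{\boldsymbol{\theta}}_1) \right) \right] - E\left[ \psi\left( \mathbf{f}_1(\mathbf{Y}, \hat{\boldsymbol{\theta}}_1) \right) \right] + E\left[ \left( \mathbf{f}_1(\mathbf{Y}, \hat{\boldsymbol{\theta}}_1) - E_0\left[ \mathbf{f}_1(\mathbf{Y}^0, \hat{\boldsymbol{\theta}}_1) \right] \right)^T \nabla\psi\left( \mathbf{f}_2(\mathbf{Y}, \hat{\boldsymbol{\theta}}_2) \right) \right].$$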
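Since $E_0\left[ \psi\left( \mathbf{f}_1(\mathbf{Y}^0, \hat{\boldsymbol{\theta}}_1) \right) \right] = E\left[ \psi\left( \mathbf{f}_1(\mathbf{Y}, \hat{\boldsymbol{\theta}}_1) \right) \right]$ it follows that
$$\Delta = E\left[ \left( \mathbf{f}_1(\mathbf{Y}, \hat{\boldsymbol{\theta}}_1) - E_0\left[ \mathbf{f}_1(\mathbf{Y}^0, \hat{\boldsymbol{\theta}}_1) \right] \right)^T \nabla\psi\left( \mathbf{f}_2(\mathbf{Y}, \hat{\boldsymbol{\theta}}_2) \right) \right].$$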
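Let $\mathbf{x}$ and $\mathbf{z}$ be two real-valued vectors of the same dimension such that $E[\mathbf{x}^T \mathbf{x}] < \infty$ and $E[\mathbf{z}^T \mathbf{z}] < \infty$. Then we can always write
$$E\left[ \mathbf{x}^T \mathbf{z} \right] = E\left[ \operatorname{tr}\left( \mathbf{x}^T \mathbf{z} \right) \right] = \operatorname{tr}\left( E\left[ \mathbf{z} \mathbf{x}^T \right] \right) = \operatorname{tr}\left( \operatorname{cov}(\mathbf{x}, \mathbf{z}) \right) + \operatorname{tr}\left( E[\mathbf{x}] \, E[\mathbf{z}]^T \right). \tag{2.8}$$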
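Using (2.8) we can express $\Delta$ as
$$\begin{aligned}
\Delta ={}& \operatorname{tr}\left\{ \operatorname{cov}\left[ \mathbf{f}_1(\mathbf{Y}, \hat{\boldsymbol{\theta}}_1) - E_0\left[ \mathbf{f}_1(\mathbf{Y}^0, \hat{\boldsymbol{\theta}}_1) \right], \nabla\psi\left( \mathbf{f}_2(\mathbf{Y}, \hat{\boldsymbol{\theta}}_2) \right) \right] \right\} \\
& + \operatorname{tr}\left\{ E\left[ \mathbf{f}_1(\mathbf{Y}, \hat{\boldsymbol{\theta}}_1) - E_0\left[ \mathbf{f}_1(\mathbf{Y}^0, \hat{\boldsymbol{\theta}}_1) \right] \right] E\left[ \nabla\psi\left( \mathbf{f}_2(\mathbf{Y}, \hat{\boldsymbol{\theta}}_2) \right) \right]^T \right\}.
\end{aligned}$$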
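Since $E_0\left[ \mathbf{f}_1(\mathbf{Y}^0, \hat{\boldsymbol{\theta}}_1) \right]$ is a non-stochastic quantity, it does not affect the covariance in the first term. Moreover, $\mathbf{Y}$ and $\mathbf{Y}^0$ share the same distribution, so that $E\left[ \mathbf{f}_1(\mathbf{Y}, \hat{\boldsymbol{\theta}}_1) \right] = E_0\left[ \mathbf{f}_1(\mathbf{Y}^0, \hat{\boldsymbol{\theta}}_1) \right]$ and the second term vanishes. It follows that
$$\Delta = \operatorname{tr}\left\{ \operatorname{cov}\left[ \mathbf{f}_1(\mathbf{Y}, \hat{\boldsymbol{\theta}}_1), \nabla\psi\left( \mathbf{f}_2(\mathbf{Y}, \hat{\boldsymbol{\theta}}_2) \right) \right] \right\}$$
which verifies the result of Theorem 2.1.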
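Identity (2.8), the key algebraic step of the proof, also holds exactly for sample moments when the cross-covariance is normalised by the number of draws, which gives a quick numerical sanity check. The sketch below uses simulated vectors; the helper name `trace_identity_sides` and the simulated data are purely illustrative assumptions.

```python
import numpy as np

def trace_identity_sides(x, z):
    """Return both sides of (2.8) computed with sample moments.

    x, z: arrays of shape (n_draws, dim), one draw per row.
    """
    n_draws = x.shape[0]
    # Left-hand side: sample version of E[x^T z].
    lhs = np.mean(np.sum(x * z, axis=1))
    # Right-hand side: tr(cov(x, z)) + E[x]^T E[z], with divisor n_draws.
    x_bar, z_bar = x.mean(axis=0), z.mean(axis=0)
    cross_cov = (x - x_bar).T @ (z - z_bar) / n_draws
    rhs = np.trace(cross_cov) + x_bar @ z_bar
    return lhs, rhs

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, size=(500, 3))        # non-zero mean on purpose
z = 0.5 * x + rng.normal(size=(500, 3))       # correlated with x
lhs, rhs = trace_identity_sides(x, z)
```

The two returned values agree up to floating-point error, whatever the dependence between `x` and `z`.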
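A direct consequence of Theorem 2.1 is that for any criterion $C$ as defined in (2.5) one can construct an estimator, say $\hat{C}$, similarly to (2.4) for the $q$-class of error measures. Indeed, using (2.7) and applying Theorem 2.1 we have that
$$E\left[ E_0\left[ D\left( \mathbf{f}_1(\mathbf{Y}^0, \hat{\boldsymbol{\theta}}_1), \mathbf{f}_2(\mathbf{Y}, \hat{\boldsymbol{\theta}}_2) \right) \right] \right] = E\left[ D\left( \mathbf{f}_1(\mathbf{Y}, \hat{\boldsymbol{\theta}}_1), \mathbf{f}_2(\mathbf{Y}, \hat{\boldsymbol{\theta}}_2) \right) \right] + \operatorname{tr}\left\{ \operatorname{cov}\left[ \mathbf{f}_1(\mathbf{Y}, \hat{\boldsymbol{\theta}}_1), \nabla\psi\left( \mathbf{f}_2(\mathbf{Y}, \hat{\boldsymbol{\theta}}_2) \right) \right] \right\}$$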
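and therefore a “natural” (and consistent) estimator of $C$ is
$$\hat{C} = D\left( \mathbf{f}_1(\mathbf{y}, \hat{\boldsymbol{\theta}}_1), \mathbf{f}_2(\mathbf{y}, \hat{\boldsymbol{\theta}}_2) \right) + \operatorname{tr}\left\{ \widehat{\operatorname{cov}}\left[ \mathbf{f}_1(\mathbf{y}, \hat{\boldsymbol{\theta}}_1), \nabla\psi\left( \mathbf{f}_2(\mathbf{y}, \hat{\boldsymbol{\theta}}_2) \right) \right] \right\} \tag{2.9}$$
where, as in (2.4), depending on the distribution of $\mathbf{Y} \mid \mathbf{X}$, $\widehat{\operatorname{cov}}(\cdot)$ is obtained either analytically up to a value of $\boldsymbol{\theta}_1$ and $\boldsymbol{\theta}_2$, the models’ parameters, which are then replaced by $\hat{\boldsymbol{\theta}}_1$ and $\hat{\boldsymbol{\theta}}_2$, or by resampling methods (see e.g. Efron, 2004).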
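When no closed form is available, the covariance term in (2.9) can be approximated by resampling. The sketch below is one possible parametric bootstrap for the Gaussian linear model fitted by least squares, with $\mathbf{f}_1(\mathbf{y}) = \mathbf{y}$, $\mathbf{f}_2$ equal to the fitted values and $\psi(\mathbf{z}) = \mathbf{z}^T \mathbf{z}$. It is an illustration under these stated assumptions and not necessarily the resampling scheme used in the thesis; the function name, the number of replicates and the Gaussian resampling are all illustrative choices. In this special case the trace has the closed form $2\sigma_\varepsilon^2 \operatorname{tr}(\mathbf{S})$ with $\mathbf{S}$ the hat matrix, which the function returns alongside the bootstrap value for comparison.

```python
import numpy as np

def bootstrap_divergence_optimism(X, y, n_boot=4000, seed=0):
    """Parametric-bootstrap estimate of tr{cov[y, grad_psi(y_hat)]}
    for psi(z) = z^T z (so grad_psi(z) = 2 z) in a Gaussian linear model."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    S = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix (symmetric)
    y_hat = S @ y
    sigma2 = np.sum((y - y_hat) ** 2) / (n - p)    # unbiased noise variance

    # Bootstrap responses y* ~ N(y_hat, sigma2 * I), one replicate per row.
    y_star = y_hat + np.sqrt(sigma2) * rng.standard_normal((n_boot, n))
    # Gradient of psi at the refitted predictions S y*, one replicate per row.
    grad_star = 2.0 * (y_star @ S.T)

    # Trace of the cross-covariance = sum over i of cov(y*_i, grad*_i).
    y_c = y_star - y_star.mean(axis=0)
    g_c = grad_star - grad_star.mean(axis=0)
    boot_value = np.sum(y_c * g_c) / (n_boot - 1)

    analytic_value = 2.0 * sigma2 * p              # 2 * sigma2 * tr(S)
    return boot_value, analytic_value
```

For a linear smoother the analytic value would of course be used directly; the bootstrap route is of interest mainly when the fitted values depend on $\mathbf{y}$ in a non-linear way.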
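Although we defined the $d$-class in a general manner for two models $F_{\theta_1}$ and $F_{\theta_2}$, nearly all model selection criteria are based on a chosen discrepancy between a given candidate model $F_\theta$ and the true model $F_0$. This is, for example, the case for the AIC which aims to compute the Kullback-Leibler divergence between $F_\theta$ and $F_0$. In such a setting, we may simplify our definition given in (2.5) and write instead
$$C^\star = E\left[ E_0\left[ D\left( \mathbf{f}_1(\mathbf{Y}^0), \mathbf{f}_2(\mathbf{Y}, \hat{\boldsymbol{\theta}}) \right) \right] \right] \tag{2.10}$$
which leads to the following (consistent) estimator
$$\hat{C}^\star = D\left( \mathbf{f}_1(\mathbf{y}), \mathbf{f}_2(\mathbf{y}, \hat{\boldsymbol{\theta}}) \right) + \operatorname{tr}\left\{ \widehat{\operatorname{cov}}\left[ \mathbf{f}_1(\mathbf{y}), \nabla\psi\left( \mathbf{f}_2(\mathbf{y}, \hat{\boldsymbol{\theta}}) \right) \right] \right\}. \tag{2.11}$$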
We shall refer to the first and second terms of (2.9) (or of (2.11)) as the apparent divergence and the divergence optimism, respectively. Many criteria can be defined using (2.5) (or (2.10)); however, only a few of them are meaningful for the task of model selection. Indeed, an estimator of $C^\star$ as defined in (2.10) should ideally satisfy (at least) two properties. Consider two candidate models, say $F_{\theta_1}$ and $F_{\theta_2}$, such that $F_{\theta_1}$ is nested within $F_{\theta_2}$. Then, $\hat{C}^\star$ should satisfy the following two properties:
(Pr.1) For nested models, the apparent divergence is non-increasing with respect to model complexity. More formally, we have that for any nested models $F_{\theta_1}$ in $F_{\theta_2}$ the apparent divergence is such that:
$$D\left( \mathbf{f}_1(\mathbf{y}), \mathbf{f}_2(\mathbf{y}, \hat{\boldsymbol{\theta}}_1) \right) \ge D\left( \mathbf{f}_1(\mathbf{y}), \mathbf{f}_2(\mathbf{y}, \hat{\boldsymbol{\theta}}_2) \right).$$
(Pr.2) For nested models, the divergence optimism is non-decreasing with respect to model complexity. More formally, we have that for any nested models $F_{\theta_1}$ in $F_{\theta_2}$ the divergence optimism is such that:
$$\operatorname{tr}\left\{ \widehat{\operatorname{cov}}\left[ \mathbf{f}_1(\mathbf{y}), \nabla\psi\left( \mathbf{f}_2(\mathbf{y}, \hat{\boldsymbol{\theta}}_1) \right) \right] \right\} \le \operatorname{tr}\left\{ \widehat{\operatorname{cov}}\left[ \mathbf{f}_1(\mathbf{y}), \nabla\psi\left( \mathbf{f}_2(\mathbf{y}, \hat{\boldsymbol{\theta}}_2) \right) \right] \right\}.$$
If Property (Pr.1) were not satisfied, a larger model (i.e. $F_{\theta_2}$) could provide a poorer (apparent) fit than $F_{\theta_1}$. This could, for example, correspond to a situation where the residual sum of squares increases with the complexity of the model. Although such situations may appear extremely unlikely, they could for instance occur when a robust estimator of $\boldsymbol{\theta}$ is used together with a non-robust apparent divergence such as the residual sum of squares. In Corollary 2.1 (below) we relate Property (Pr.1) to the estimator $\hat{\boldsymbol{\theta}}$. This makes it easy to construct a criterion which verifies (Pr.1) given an estimator $\hat{\boldsymbol{\theta}}$. The proof of this result is straightforward and is therefore omitted.
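Corollary 2.1: The estimator $\hat{C}^\star$ as defined in (2.11) satisfies Property (Pr.1) if the estimator $\hat{\boldsymbol{\theta}}$ is the result of the following minimisation problem
$$\hat{\boldsymbol{\theta}} = \operatorname*{argmin}_{\boldsymbol{\theta} \in \Theta} D\left( \mathbf{f}_1(\mathbf{y}), \mathbf{f}_2(\mathbf{y}, \boldsymbol{\theta}) \right).$$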
The rationale behind Property (Pr.2) is the following. Suppose that $F_{\theta_1}$ is the true model (or is nested within the true model) and assume that models $F_{\theta_j}$ with $j = 1, \ldots, K$ are models with increasing complexity such that model $F_{\theta_j}$ is nested within model $F_{\theta_k}$ if $1 \le j < k \le K$. Then, if a criterion satisfies Property (Pr.1), the apparent divergence will be smaller as the model complexity increases. Therefore, the optimism should have exactly the opposite property in order to allow a model “close” to $F_{\theta_1}$ to be selected by the criterion at hand.
In the following example we will use the above theory to derive a criterion equivalent to Mallows’ $C_p$.
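Example 2.1: Consider the linear model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ where $\mathbf{X} \in \mathbb{R}^{n \times p}$ is a full-rank constant matrix, $\boldsymbol{\varepsilon} \sim \mathcal{N}\left( \mathbf{0}, \sigma_\varepsilon^2 \mathbf{I}_n \right)$ and $\boldsymbol{\beta} \in \mathcal{B} \subseteq \mathbb{R}^p$. Let $\hat{\boldsymbol{\beta}}$ denote the LSE of $\boldsymbol{\beta}$, i.e.
$$\hat{\boldsymbol{\beta}} = \operatorname*{argmin}_{\boldsymbol{\beta} \in \mathcal{B}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2.$$
Suppose that we wish to find an estimator for the following criterion:
$$C^\star = E\left[ E_0\left[ \|\mathbf{Y}^0 - \hat{\mathbf{Y}}\|_2^2 \right] \right].$$
Clearly, this model selection criterion belongs to the $d$-class of error measures (and to the $q$-class as well) with $\mathbf{f}_1(\mathbf{Y}^0) = \mathbf{Y}^0$, $\mathbf{f}_2(\mathbf{Y}, \hat{\boldsymbol{\beta}}) = \mathbf{X}\hat{\boldsymbol{\beta}}$ and $\psi(\mathbf{z}) = \mathbf{z}^T \mathbf{z}$. By applying Theorem 2.1 we obtain
$$\Delta = \operatorname{tr}\left[ \operatorname{cov}\left( \mathbf{Y}, 2\mathbf{X}\hat{\boldsymbol{\beta}} \right) \right] = 2\sigma_\varepsilon^2 \operatorname{tr}(\mathbf{S}) = 2\sigma_\varepsilon^2 p$$
where $\mathbf{S}$ denotes the “hat” matrix of $\mathbf{X}$, i.e. $\mathbf{S} = \mathbf{X}\left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T$. Using (2.11) we get
$$\hat{C}^\star = D\left( \mathbf{f}_1(\mathbf{y}), \mathbf{f}_2(\mathbf{y}, \hat{\boldsymbol{\beta}}) \right) + \Delta = \|\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\|_2^2 + 2\sigma_\varepsilon^2 p$$
which is unsurprisingly equivalent to Mallows’ $C_p$. Since $\hat{\boldsymbol{\beta}}$ is obtained by minimising $\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2$, which is equivalent to $D(\mathbf{f}_1(\mathbf{y}), \mathbf{f}_2(\mathbf{y}, \boldsymbol{\beta}))$, Property (Pr.1) is satisfied by Corollary 2.1. Note that this is also the case for the MLE since this estimator is equivalent to the LSE in this context. Moreover, for any fixed (across models) $\sigma_\varepsilon^2$, Property (Pr.2) is also satisfied. In practice, $\sigma_\varepsilon^2$ is replaced by an estimate, say $\hat{\sigma}_\varepsilon^2$, of the noise variance obtained from a “low-bias” model, generally the largest model.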
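The quantities in Example 2.1 are straightforward to compute. The following sketch evaluates the apparent divergence and the divergence optimism of (2.11) for a sequence of nested least-squares fits, with $\sigma_\varepsilon^2$ estimated once from the largest model as suggested above. The convention that candidate $p$ uses the first $p$ columns of the design matrix, and the function names, are illustrative assumptions rather than part of the original text.

```python
import numpy as np

def hat_matrix(X):
    """Hat matrix S = X (X^T X)^{-1} X^T of a full-rank design X."""
    return X @ np.linalg.solve(X.T @ X, X.T)

def cp_type_criterion(X, y, sigma2):
    """Return (apparent divergence, divergence optimism, criterion) for LS on X."""
    S = hat_matrix(X)
    resid = y - S @ y
    apparent = resid @ resid                    # ||y - X beta_hat||_2^2
    optimism = 2.0 * sigma2 * np.trace(S)       # tr(S) = p for full-rank X
    return apparent, optimism, apparent + optimism

def criterion_path(X_full, y):
    """Evaluate the criterion for the nested candidates X_full[:, :p], p = 1..K."""
    n, K = X_full.shape
    resid_full = y - hat_matrix(X_full) @ y
    sigma2 = resid_full @ resid_full / (n - K)  # noise variance from the largest model
    return [cp_type_criterion(X_full[:, :p], y, sigma2) for p in range(1, K + 1)]
```

Along such a path the first component is non-increasing and the second is non-decreasing, which is exactly what Properties (Pr.1) and (Pr.2) require when $\sigma_\varepsilon^2$ is held fixed across models.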
2.3 The Prediction Divergence Criterion
When a model selection criterion, say $C$, is used in practice to choose between two models, say $M_j$ nested in $M_k$, an estimate of $C$ (or of EPErr as defined in (2.2)) is computed for both models and the difference is used for selection. We propose instead another class of criteria that aims at directly measuring a prediction divergence between the two models. More formally, any criterion that compares the out-of-sample prediction computed in the smaller model, $\hat{\mathbf{Y}}_j^0$, with the in-sample prediction in the larger model, $\hat{\mathbf{Y}}_k$, quantified by the Bregman divergence $D(\cdot,\cdot)$ (based on $\psi(\cdot)$), belongs to this class, i.e.
$$\mathrm{PDC}_{j,k} = E\left[ E_0\left[ D\left( \hat{\mathbf{Y}}_j^0, \hat{\mathbf{Y}}_k \right) \mid \mathbf{Y} \right] \right] \tag{2.12}$$
where the expectation is multidimensional. If the smaller model is not correct, the additional elements in the larger model create differences in the predictions and therefore should be accounted for. Intuitively, in a “suitable” sequence of nested models with increasing complexity and such that model $M_j$ is nested in model $M_{j+1}$, supposing we have a consistent estimator of $\mathrm{PDC}_{j,j+1}$, say $\mathrm{PDCE}_{j,j+1}$ (see (2.13)), then this estimator is expected to be minimal when $j = 0 < K$, where $K$ is the number of potential sequentially nested models and $M_0$ denotes the correct (or closest to the correct) underlying model. Indeed, while $j < 0$, we expect $\mathrm{PDCE}_{j,j+1}$ to be relatively large since model $M_j$ is missing some elements of the correct model $M_0$ which are included in model $M_{j+1}$. This is also true with $\mathrm{PDCE}_{j,j+m}$, $j < j + m \le K$, or $\mathrm{PDCE}_{j-m,j}$, $1 < j - m < j$. On the other hand, if $j \ge 0$, $\mathrm{PDCE}_{j,j+1}$ (or indeed $\mathrm{PDCE}_{j,j+m}$, $m > 0$) is relatively small compared to when $j < 0$ because both models include the correct one. Among all models $j \ge 0$, $\mathrm{PDCE}_{j,j+m}$ should be minimised at $j = 0$ and $m = 1$ since $\mathrm{PDCE}_{0,0+1}$ compares the prediction of the correct model with the least overfitted one.
In the case of the linear regression model, we derive in Section 2.4 the (asymptotic) properties of the PDCE based on the squared loss function. In particular, we show in Theorem 2.2 that under the setting previously defined and for sufficiently large sample size $n$ we have that $E[\mathrm{PDCE}_{0,0+1}] \le E[\mathrm{PDCE}_{j,j+m}]$ for $j$ and $m$ such that $0 < j < K$, $m > 0$ and $j + m \le K + 1$.
We also have that $E[\mathrm{PDCE}_{0,0+1}] = E[\mathrm{PDCE}_{j,j+m}]$ if and only if $j = 0$ and $m = 1$. This confirms the intuitive explanation given above.
Clearly, the discrepancy $D(\hat{\mathbf{Y}}_j^0, \hat{\mathbf{Y}}_k)$ belongs to the $d$-class of error measures but not to the $q$-class. Therefore, by using Theorem 2.1 as in (2.9) we obtain the following consistent estimator of the $\mathrm{PDC}_{j,k}$
$$\widehat{\mathrm{PDC}}_{j,k} = \mathrm{PDCE}_{j,k} = D\left( \hat{\mathbf{y}}_j, \hat{\mathbf{y}}_k \right) + \operatorname{tr}\left\{ \widehat{\operatorname{cov}}\left[ \hat{\mathbf{y}}_j, \nabla\psi(\hat{\mathbf{y}}_k) \right] \right\}. \tag{2.13}$$
Although the above estimator can be computed analytically or using resampling methods for any valid Bregman divergence, we shall only consider here the squared loss function. The properties and performance of other divergences are left for further research. Therefore, (2.13) can be further simplified as
$$\mathrm{PDCE}_{j,k} = \|\hat{\mathbf{y}}_j - \hat{\mathbf{y}}_k\|_2^2 + 2 \operatorname{tr}\left[ \widehat{\operatorname{cov}}\left( \hat{\mathbf{y}}_j, \hat{\mathbf{y}}_k \right) \right]. \tag{2.14}$$
For notational simplicity, we will not make a distinction between the PDCE defined in (2.13) and in (2.14), but in the rest of the text the term PDCE refers to (2.14) since the squared loss function is the only Bregman divergence considered in this chapter. In some sense, the PDCE defined in (2.14) is the equivalent of Mallows’ $C_p$ in the PDC class since both criteria are based on the same loss function. It will thus be of particular interest to compare (2.14) with the $C_p$ to understand the differences between the PDC approach and the classical model selection approach (see (2.10)).
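Many authors (see e.g. Bhansali and Downham, 1977) have examined the penalty function of the AIC (and of other criteria) and defined, for example, the $\mathrm{AIC}_\alpha$ in which the term 2 (see Table 2.3) of the conventional AIC is replaced by $\alpha$. We follow this strategy and define the modified PDCE as
$$\mathrm{PDCE}^{\lambda_n}_{j,k} = \|\hat{\mathbf{y}}_j - \hat{\mathbf{y}}_k\|_2^2 + \lambda_n \operatorname{tr}\left[ \widehat{\operatorname{cov}}\left( \hat{\mathbf{y}}_j, \hat{\mathbf{y}}_k \right) \right] \tag{2.15}$$
where $\lambda_n$ is a constant depending possibly on the sample size $n$.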
Hence, assuming that there exist $K$ competing nested models (and that the largest model is not the correct one) for describing the behaviour of $Y$, we propose to choose the model $M_{\hat{\jmath}_{\lambda_n}}$ satisfying
$$\hat{\jmath}_{\lambda_n} = \operatorname*{argmin}_{j = 1, \ldots, K-1} \mathrm{PDCE}^{\lambda_n}_{j,j+1}. \tag{2.16}$$
If a clear sequence of competing nested models does not exist, one can build one prior to applying the selection rule (2.16). This will be explained when treating the linear regression model in Section 2.4.2 (in particular see iterative rule (2.62)).
2.4 Linear Regression Models
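We consider in this section the usual linear regression model with Gaussian errors
$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \quad \boldsymbol{\varepsilon} \sim \mathcal{N}\left( \mathbf{0}, \sigma_\varepsilon^2 \mathbf{I} \right)$$
where $\mathbf{X}$ is a known $n \times p$ design matrix of rank $p$ and $\boldsymbol{\beta}$ is a $p \times 1$ vector of unknown parameters.
We assume that the method of LS is used to fit a model to the data and thus we consider the usual ordinary LS parameter estimates of $\boldsymbol{\beta}$, i.e.
$$\hat{\boldsymbol{\beta}} = \left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{y} \tag{2.17}$$
leading to the (linear) prediction
$$\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{S}\mathbf{y}$$
where $\mathbf{S} = \mathbf{X}\left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T$ denotes the “hat” matrix. Since the errors are Gaussian, $\hat{\boldsymbol{\beta}}$ is also the MLE of $\boldsymbol{\beta}$. The unbiased and maximum likelihood estimates of $\sigma_\varepsilon^2$ are, respectively, given by
$$\tilde{\sigma}_\varepsilon^2 = \frac{\|\mathbf{y} - \hat{\mathbf{y}}\|_2^2}{n - p} \quad \text{and} \quad \hat{\sigma}_\varepsilon^2 = \frac{\|\mathbf{y} - \hat{\mathbf{y}}\|_2^2}{n}. \tag{2.18}$$
Throughout this chapter we assume that $0 < \sigma_\varepsilon^2 < \infty$. In Table 2.3 we provide some of the most commonly used criteria for model selection in linear regression models.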
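The estimators in (2.17) and (2.18) can be computed as follows. This is a minimal sketch: the function name `ls_summary` is hypothetical, and `numpy.linalg.lstsq` is used rather than forming $(\mathbf{X}^T\mathbf{X})^{-1}$ explicitly, a numerical choice that yields the same $\hat{\boldsymbol{\beta}}$ for a full-rank design.

```python
import numpy as np

def ls_summary(X, y):
    """LS estimate (2.17), fitted values and the two variance estimates (2.18)."""
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta_hat
    rss = np.sum((y - y_hat) ** 2)
    return {
        "beta_hat": beta_hat,
        "y_hat": y_hat,
        "sigma2_unbiased": rss / (n - p),   # divisor n - p
        "sigma2_mle": rss / n,              # divisor n
    }
```

The two variance estimates differ only by the factor $n / (n - p)$, so the distinction matters mostly when $p$ is not small relative to $n$.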
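The $\mathrm{PDCE}^{\lambda_n}_{j,k}$ defined in (2.15) can be simplified for two linear nested candidate models. Indeed, let model $M_j$ be nested within model $M_k$ so that $\dim(\boldsymbol{\beta}_j) = j < \dim(\boldsymbol{\beta}_k) = k$, where $\boldsymbol{\beta}_j$ and $\boldsymbol{\beta}_k$ denote, respectively, the vectors of unknown parameters associated with models $M_j$ and $M_k$. Then, the $\mathrm{PDC}_{j,k}$ based on the squared loss function defined by $\psi(\mathbf{z}) = \mathbf{z}^T \mathbf{z}$ is equal to
$$\mathrm{PDC}_{j,k} = \|\hat{\mathbf{Y}}_j - \hat{\mathbf{Y}}_k\|_2^2 + 2\sigma_\varepsilon^2 \operatorname{tr}\left( \mathbf{S}_j \mathbf{S}_k \right) = \|\hat{\mathbf{Y}}_j - \hat{\mathbf{Y}}_k\|_2^2 + 2\sigma_\varepsilon^2 j$$
where $\mathbf{S}_j$ and $\mathbf{S}_k$ denote, respectively, the hat matrices of models $M_j$ and $M_k$. The second equality holds because the column space of the design of $M_j$ is contained in that of $M_k$, so that $\mathbf{S}_j \mathbf{S}_k = \mathbf{S}_j$ and $\operatorname{tr}(\mathbf{S}_j) = j$. Therefore, we obtain for $\mathrm{PDCE}^{\lambda_n}_{j,k}$:
$$\mathrm{PDCE}^{\lambda_n}_{j,k} = \|\hat{\mathbf{y}}_j - \hat{\mathbf{y}}_k\|_2^2 + \lambda_n \sigma_\varepsilon^2 j. \tag{2.19}$$
When the value of $\sigma_\varepsilon^2$ is unknown, one may replace $\sigma_\varepsilon^2$ by a consistent estimator, say $\hat{\sigma}_\varepsilon^2$, like the LS estimator at the full model. As already mentioned, we show in Theorem 2.2 (below, and based on Lemma 2.1) that the $\mathrm{PDCE}^{\lambda_n}_{j,j+m}$ is expected, for sufficiently large sample size, to reach its smallest values for $j = 0$ and $m = 1$. This motivates the PDCE selection rule defined in (2.16).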
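The selection rule (2.16) combined with the simplified form (2.19) can be sketched in a few lines. The code below assumes that the columns of the design matrix are already ordered so that model $M_j$ uses the first $j$ columns, and it plugs in the unbiased LS variance estimate from the full model. The function names and the default $\lambda_n = 2$ are illustrative choices, not prescriptions from the text.

```python
import numpy as np

def ls_fitted_values(X, y):
    """Least-squares fitted values X beta_hat."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta_hat

def pdce_sequence(X, y, lam=2.0):
    """PDCE^{lam}_{j,j+1} of (2.19) for j = 1, ..., K-1, where M_j uses X[:, :j]."""
    n, K = X.shape
    fits = [ls_fitted_values(X[:, :j], y) for j in range(1, K + 1)]
    resid_full = y - fits[-1]
    sigma2_hat = resid_full @ resid_full / (n - K)   # variance estimate at the full model
    values = {}
    for j in range(1, K):
        diff = fits[j - 1] - fits[j]                 # y_hat_j - y_hat_{j+1}
        values[j] = diff @ diff + lam * sigma2_hat * j
    return values

def select_model(X, y, lam=2.0):
    """Index j minimising PDCE^{lam}_{j,j+1}, as in the selection rule (2.16)."""
    values = pdce_sequence(X, y, lam)
    return min(values, key=values.get)
```

Because the minimisation runs over $j = 1, \ldots, K-1$, the largest model can never be selected, which matches the assumption stated with (2.16) that the largest model is not the correct one.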