Two essays in statistics: a prediction divergence criterion for model selection & wavelet variance based estimation of latent time series models


GUERRIER, Stéphane

Abstract

This thesis is divided into two parts. First, it presents a new criterion for model selection which is shown to be particularly well suited to "sparse" settings, which we believe to be common in many research fields. Our selection procedure is developed for linear regression models, smoothing splines, autoregressive models and mixed linear models. These developments are then applied in Biostatistics. The second part presents a new estimation method for the parameters of a time series model. The proposed method offers an alternative to maximum likelihood estimation that is straightforward to implement and is often the only feasible estimation method for complex models. We derive the asymptotic properties of the proposed estimator for inference and perform an extensive simulation study to compare our estimator to existing methods. Finally, we apply our method in engineering to calibrate inertial sensors and demonstrate that it represents a considerable improvement compared to benchmark methods.

GUERRIER, Stéphane. Two essays in statistics: a prediction divergence criterion for model selection & wavelet variance based estimation of latent time series models . Thèse de doctorat : Univ. Genève, 2013, no. SES 814

URN : urn:nbn:ch:unige-296284

DOI : 10.13097/archive-ouverte/unige:29628

Available at:

http://archive-ouverte.unige.ch/unige:29628


A Prediction Divergence Criterion for Model Selection

&

Wavelet Variance based Estimation of Latent Time Series Models

Thesis

presented to the Faculté des sciences économiques et sociales of the Université de Genève

by

Stéphane Guerrier

under the supervision of

Prof. Maria-Pia Victoria-Feser

for the degree of

Docteur ès sciences économiques et sociales, mention statistique (Doctor of Economic and Social Sciences, specialisation in statistics)

Members of the thesis jury:

Prof. Trevor Hastie, Stanford University

Prof. Olivier Renaud, Université de Genève

Prof. Elvezio Ronchetti, Université de Genève

Prof. Maria-Pia Victoria-Feser, Université de Genève

Thesis No. 814

Geneva, 6 September 2013

... are set out therein and which are the sole responsibility of their author.

Geneva, 6 September 2013

The Dean

Bernard MORARD

Printed from the author's manuscript


The aim of this thesis is twofold, and it is therefore divided into two parts. First, we propose a new model selection criterion and, second, we introduce a new estimation method for the parameters of latent time series models.

Model selection is a crucial part of any statistical analysis. Indeed, the methods used in this context become unavoidable in most problems involving partial theoretical knowledge and a very large amount of information. This is the case, for example, in most research conducted today in medicine, biology or economics. These methods aim to determine which variables are “important” for “explaining” a phenomenon under study. However, the terms “important” and “explain” can have very different meanings depending on the context and, in fact, model selection can be applied to any situation where a trade-off between variability and complexity has to be determined (McQuarrie and Tsai, 1998). For example, these techniques can be applied to select the “significant” variables in a regression problem, to determine the number of dimensions in a principal component analysis or simply to construct a histogram. In this respect, this thesis presents a new model selection criterion, called the Prediction Divergence Criterion, and shows that, under certain regularity conditions, it is asymptotically efficient and possibly consistent. This approach is particularly suited to problems where the number of variables considered is large but only a small subset of them is “significant”. This type of situation is common in many research fields such as Genomics and Proteomics. Our selection procedure is developed for linear regression as well as for spline smoothing. The problem of identifying the order of an autoregressive model and the determination of the random structure of a mixed model are also investigated. These theoretical developments are then applied to the modelling of childhood malnutrition in Zambia and to the classification of two types of leukaemia using DNA microarray gene expression data.

The second part of this thesis deals with an entirely different problem by presenting a new estimation method for the parameters of time series models. We consider here in particular models constructed as a sum of latent processes. Such models are used, for example, in many applications in engineering and the natural sciences. The proposed estimation method, called the Generalised Method of Wavelet Moments, is very often the only one that allows complex models to be estimated, thereby offering an alternative to maximum likelihood. This estimator results from the optimisation of a criterion based on a standardised distance between the estimated wavelet variance and the one implied by the model considered. It is consistent and asymptotically normally distributed under conditions that can easily be verified.

This thesis also presents a simulation study showing that this approach compares favourably with alternative methods and makes it possible to estimate models for which no other alternative exists. Finally, this methodology is applied in engineering to the problem of calibrating inertial sensors and shows a considerable improvement over existing methods.


This thesis is divided into two parts. In the first, we propose a new criterion for model selection and, in the second, we introduce a novel estimation technique for the parameters of latent time series models.

The problem of model selection is a crucial part of any statistical analysis and is at the centre of the first part of this thesis. In fact, model selection methods become inevitable in an increasingly large number of applications involving partial theoretical knowledge and vast amounts of information, such as in medicine, biology or economics. These techniques are intended to determine which variables are “important” to “explain” a phenomenon under investigation. The terms “important” and “explain” can have very different meanings according to the context and, in fact, model selection can be applied to any situation where one tries to balance variability with complexity (McQuarrie and Tsai, 1998). For example, these techniques can be applied to select “significant” variables in regression problems, to determine the number of dimensions in principal component analyses or simply to construct histograms. In this respect, we introduce a novel model selection criterion called the Prediction Divergence Criterion Estimator and we demonstrate that, under some regularity conditions, it is asymptotically loss efficient and can also be consistent. This new approach is shown to be particularly well suited to “sparse” settings, which we believe to be common in many research fields such as Genomics and Proteomics.

Our selection procedure is developed for linear regression models and smoothing splines. The problem of identifying the order of an autoregressive model and the determination of the random structure of mixed linear models are also investigated. These developments are then applied to an analysis of childhood malnutrition in Zambia and to the distinction of two leukaemia classes using microarray gene expression data.

The second part of this thesis presents a new estimation method for the parameters of a time series model. We consider here composite Gaussian processes, that is, sums of independent Gaussian processes each of which explains an important aspect of the time series, as is the case in engineering and natural sciences. The proposed estimation method offers an alternative to classical likelihood-based estimation that is straightforward to implement and is often the only feasible estimation method for complex models. The estimator results from the optimisation of a criterion based on a standardised distance between the sample Wavelet Variance (WV) estimates and the model-based WV. Indeed, the WV provides a decomposition of the variance of the process across scales, so that it contains information about different features of the stochastic model. We derive the asymptotic properties of the proposed estimator for inference and perform an extensive simulation study to compare our estimator to existing methods. We also give sufficient conditions on latent time series models, which are easy to verify, for our estimator to be consistent. Finally, we apply our method in engineering to calibrate inertial sensors and demonstrate that it represents a considerable improvement compared to benchmark methods.


It is my pleasant duty to thank all the people who helped and supported me along the way to the completion of this work.

First of all, I would like to thank Prof. Maria-Pia Victoria-Feser, my thesis supervisor, who trusted me and gave me a great deal of freedom in carrying out this work while keeping a critical and demanding, but above all encouraging and benevolent, eye on it. Beyond her remarkable academic skills, Maria-Pia impressed me with her great human qualities. It was an immense pleasure to work with her and I hope to be able to do so for a long time to come.

I am particularly indebted to my friend Dr. Yannick Stebler, who contributed to many of the developments presented here. Working with Yannick was a great privilege for me because, in addition to being an exceptional person, he is a brilliant engineer and researcher. Without his contribution, this thesis would not be the same. I also wish to thank Dr. Adrian Waegli and Dr. Jan Skaloud, with whom I carried out my master's project and from whom I learned a great deal. I had the pleasure of continuing my research with them during my thesis, and this collaboration was very enriching.

During these years at the Université de Genève, I was fortunate to be around many people who all, in their own way, gave me their support and assistance: Dr. Dominique-Laurent Couturier and Elise Dupuis-Lozeron, with whom I shared an office and, above all, good times; and Roberto Molinari, with whom I greatly enjoyed working and whose contribution helped me a lot. My thanks also go to Charlotte Beauchamp, Prof. Eva Cantoni, Dr. Nabil Mili and Dr. Stéphane Rothen.

I am very honoured that Prof. Trevor Hastie of Stanford University agreed to be a member of my jury. It is also a great honour to have had Prof. Olivier Renaud and Prof. Elvezio Ronchetti, both of the Université de Genève, as jury members. I would like to thank them for the time they devoted to this thesis and for their pertinent comments, which notably improved its quality.

This thesis is the culmination of a long journey on which I was accompanied by those close to me, whom I wish to thank in turn.

Prof. Hannelore Lee-Jahnke, who since my childhood has tried to interest me in languages and has always given me friendly and wise advice. I also express my friendship and gratitude to my friends, in particular Gaetan Bakalli, Messaoud Benabdelouahad, Stéphane Bilen, Cédric de L’Epine, Julien Forbat, Christian Imperiale, Mucyo Karemera and Fayçal Meflah, who all, each in their own way, accompanied and supported me and brought me their good spirits and humour. A special thought goes to Cédric, a faithful friend, to whom I am greatly indebted for always having been there for me. My gratitude also goes to Steve Ré for the support he gave me.

I owe a great deal to my love, Lorena Garzoni, who accompanied me, listened to me, encouraged me and, above all, put up with me at the end of the thesis. Her love, her comforting presence, her understanding and her humour are a great happiness to me.

Finally, my parents, who, through their love and unfailing confidence in me, supported and helped me, and without whom I could not have come this far.


Résumé
Abstract
Remerciements (Acknowledgements)
List of Abbreviations

1 Introduction

2 A Prediction Divergence Criterion for Model Selection
  2.1 Introduction
  2.2 The d-Class Error Measures
  2.3 The Prediction Divergence Criterion
  2.4 Linear Regression Models
    2.4.1 Asymptotic Properties: The Ordered Case
    2.4.2 Asymptotic Properties: The Unordered Case
  2.5 Extensions of the PDC Approach
    2.5.1 Smoothing Splines
    2.5.2 Order Selection in Autoregressive Models
    2.5.3 Random Effects Selection in Mixed Linear Models
    2.5.4 Application to Large and High-Dimensional Problems
    2.5.5 Alternative Sorting Algorithm
  2.6 Simulation Study
    2.6.1 Linear Regression Models
    2.6.2 Smoothing Splines
    2.6.3 Autoregressive Models
    2.6.4 Mixed Linear Models
  2.7 Case Study: Applications in Biostatistics
    2.7.1 Childhood Malnutrition in Zambia
    2.7.2 Application to Acute Leukemias Classification
    2.7.3 Application to Pharmacokinetic Data

3 Estimation of Latent Time Series Models
  3.1 Introduction
  3.2 Important Conventions, Notations & Definitions
    3.2.1 Conventions & Notations
    3.2.2 Stochastic Processes Definitions
  3.3 The Allan Variance Methodology
    3.3.1 The Allan Variance
    3.3.2 Allan Variance based Estimation
  3.4 The Generalised Method of Wavelet Moments
    3.4.1 The Wavelet Variance
    3.4.2 The Generalised Method of Wavelet Moments Estimator
    3.4.3 Connection with the Allan Variance based Estimation
    3.4.4 Modified Forms of the Generalised Method of Wavelet Moments
  3.5 Model Selection Strategies for the GMWM
    3.5.1 Goodness-of-Fit Test
    3.5.2 Model Selection Criterion
  3.6 Simulation Study
  3.7 Case Study: Inertial Sensors
  3.8 Possible Extensions of the GMWM Methodology
  3.9 Concluding remarks

4 Conclusions

Appendix A Additional Results of Chapter 2
  A.1 Results on Quadratic Forms
  A.2 Additional Simulation Results
    A.2.1 Simulation 2.2
    A.2.2 Simulation 2.4
    A.2.3 Simulation 2.5
    A.2.4 Simulation 2.6
    A.2.5 Simulation 2.7

Appendix B Additional Results of Chapter 3
  B.1 Parametric Wavelet Variances for some specific processes
  B.2 Covariance Estimation of WV Estimators
  B.3 Graphical use of Wavelet Variances for model building purposes
  B.4 Additional Simulation Study


ACF auto-covariance function
AD Allan deviation
AIC Akaike information criterion
AV Allan variance
AR autoregressive
ARMA autoregressive moving-average
BIC Bayesian information criterion
DWT discrete wavelet transform
EKF extended Kalman filter
EM expectation-maximisation
GLS generalised least squares
GM Gauss-Markov
GMM generalised method of moments
GMWM generalised method of wavelet moments
GPS global positioning system
HRV heart rate variability
IEEE Institute of Electrical and Electronics Engineers
IMU inertial measurement unit
INS inertial navigation system
LS(E) least squares (estimator)
MA moving-average
MEMS micro-electro-mechanical system
MGF moment generating function
ML(E) maximum likelihood (estimator)
MLM mixed linear models
MODWT maximal overlap discrete wavelet transform
PDC(E) prediction divergence criterion (estimator)
PSD power spectral density
QN quantisation noise
RMSE root mean squared error
R-RMSE relative root mean squared error
RW random walk
SNR signal-to-noise ratio
UAV unmanned air vehicle
WVIC wavelet variance information criterion
WN white noise
WV wavelet variance


Chapter 1 Introduction

This thesis is divided into two parts. First, we propose a new criterion for model selection and, second, we introduce a novel estimation technique for the parameters of latent time series models.

The problem of model selection is a crucial part of any statistical analysis and is at the centre of the first part of this thesis. In fact, model selection methods become inevitable in an increasingly large number of applications involving partial theoretical knowledge and vast amounts of information, such as in medicine, biology or economics. These techniques are intended to determine which variables are “important” to “explain” a phenomenon under investigation. The terms “important” and “explain” can have very different meanings according to the context and, in fact, model selection can be applied to any situation where one tries to balance variability with complexity (McQuarrie and Tsai, 1998). For example, these techniques can be applied to select “significant” variables in regression problems, to determine the number of dimensions in principal component analyses or simply to construct histograms. In this respect, we introduce a new class of error measures and a new class of model selection criteria. Moreover, a novel criterion, called the Prediction Divergence Criterion Estimator, is derived from these two classes and we demonstrate that, under some regularity conditions, it is asymptotically loss efficient and can also be consistent¹. This new criterion is shown to be particularly well suited to “sparse” settings, which we believe to be common in many research fields such as Genomics and Proteomics. Our selection procedure is developed for linear regression models and smoothing splines. The problem of identifying the order of an autoregressive model and the determination of the random structure of mixed linear models are also investigated. These developments are then used in an analysis of childhood malnutrition in Zambia and for the distinction of two leukaemia classes using microarray gene expression data.

¹ The definitions of efficiency and consistency for model selection are given in Section 2.4.1 (see (2.40) and (2.39), respectively). The reader can also refer e.g. to Shao (1997).

The second part of this thesis is dedicated to the estimation of latent time series models. These latent models naturally describe the evolution of time processes that cannot be measured directly or perfectly. Indeed, an imperfectly measured process can be thought of as the sum of at least two latent processes, describing the measurement error and the “true” process of interest. In many research problems, one either tries to recover the latter or to estimate the parameters characterising it. Additionally, the measurement error, as well as the process of interest, can itself be composed of several latent processes. This is, for example, the case with the measurement errors of inertial sensors, which are believed to be composite stochastic processes (Titterton and Weston, 2004). These sensors are commonly used in many engineering applications such as robotics or virtual reality. In this context, we propose a new estimation method for the parameter vector of latent time series models. This estimator, called the Generalised Method of Wavelet Moments, exploits the mapping that exists between the model and the vector of wavelet variances. The idea behind our estimator is, in some sense, to invert this mapping and find the model that is implied by the observed wavelet variance. We derive sufficient conditions on latent time series models to ensure consistency and asymptotic normality of the proposed estimator. Moreover, we show that this approach has many advantages over existing alternative methods for engineering and natural science applications.
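The idea of inverting the mapping between a model and its wavelet variances can be illustrated with a small sketch. It is not the estimator developed in Chapter 3: it assumes a latent model made of a white noise plus a random walk, uses Haar wavelet variances at scales τ_j = 2^j, and replaces the standardised distance by a plain diagonal weighting. The closed-form wavelet variances used below (σ²/τ_j for the white noise and γ²(τ_j² + 2)/(12τ_j) for the random walk) are obtained by applying the Haar filter to each process and are not quoted from this section; all function names are hypothetical.

```python
import numpy as np

def haar_wavelet_variance(x, n_levels):
    """Empirical Haar wavelet variance of x at scales tau_j = 2**j, j = 1..n_levels."""
    wv = np.empty(n_levels)
    for j in range(1, n_levels + 1):
        half = 2 ** (j - 1)
        # Level-j Haar filter: +1/2**j on one half of the window, -1/2**j on the other.
        h = np.concatenate([np.full(half, 1.0), np.full(half, -1.0)]) / 2 ** j
        coeffs = np.convolve(x, h, mode="valid")  # "valid" drops boundary coefficients
        wv[j - 1] = np.mean(coeffs ** 2)
    return wv

def implied_wv_basis(n_levels):
    """Model-implied Haar WV per unit of sigma2 (white noise) and gamma2 (random walk)."""
    tau = 2.0 ** np.arange(1, n_levels + 1)
    white_noise = 1.0 / tau
    random_walk = (tau ** 2 + 2.0) / (12.0 * tau)
    return np.column_stack([white_noise, random_walk])

def fit_wn_plus_rw(x, n_levels):
    """Match empirical and implied WV by least squares on relative errors."""
    nu_hat = haar_wavelet_variance(x, n_levels)
    basis = implied_wv_basis(n_levels)
    weights = 1.0 / nu_hat  # assumption: simple diagonal weighting, for illustration only
    theta_hat, *_ = np.linalg.lstsq(basis * weights[:, None], nu_hat * weights, rcond=None)
    return theta_hat  # (sigma2_hat, gamma2_hat)

# Simulate a composite process: white noise (variance 4) plus random walk (innovation variance 0.01).
rng = np.random.default_rng(1)
n = 20_000
sigma2, gamma2 = 4.0, 0.01
white_noise = np.sqrt(sigma2) * rng.standard_normal(n)
random_walk = np.cumsum(np.sqrt(gamma2) * rng.standard_normal(n))
theta_hat = fit_wn_plus_rw(white_noise + random_walk, n_levels=8)
print(theta_hat)
```

Because the implied wavelet variance is linear in (σ², γ²) for this particular model, the inversion reduces to a weighted linear regression; for models whose wavelet variance is nonlinear in the parameters, a numerical optimiser would be needed instead.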

The core of this thesis is organised in two chapters. Chapter 2 treats the problem of model selection and is organised as follows.

• Section 2.1 introduces the general problem of model selection, and presents a motivating example.

• A new class of error measures, which generalises Efron’s q-class, is introduced in Section 2.2 together with the associated optimism theorem.

• Section 2.3 presents a new class of model selection criteria which is based on the previously mentioned class of error measures.

• In Section 2.4 our methodology is applied to linear regression models and its asymptotic properties (e.g. efficiency or consistency) are derived.

• Section 2.5 discusses some possible extensions as well as future research plans. The application of our methodology to smoothing splines, to the selection of the order of an autoregressive model and to the choice of the random structure of mixed linear models is briefly analysed. A strategy for dealing with large datasets, possibly with $n \ll p$ (where $n$ and $p$ denote, respectively, the sample size and the number of parameters), is also presented.

• A detailed simulation study is presented in Section 2.6. These simulations reveal that our criterion performs particularly well in “sparse” settings and that it often behaves as an improved version of the Bayesian information criterion.

• Section 2.7 is dedicated to three case studies. The first presents an analysis of childhood malnutrition in Zambia, while the second is an application to classification and gene selection in a leukaemia microarray problem. Finally, the third is a simple example of application in pharmacokinetics.

Chapter 3 is dedicated to the estimation of latent time series models and presents a new estimation method in this context. More precisely,

• Section 3.1 introduces the framework of latent time series models and presents existing estimation methods. Furthermore, Section 3.2 summarises the most important notations and conventions used throughout this chapter.

• Section 3.3 presents an estimation method commonly used in the engineering community for the estimation of latent time series models. We derive the conditions for the consistency of this estimator.

• The Generalised Method of Wavelet Moments is introduced in Section 3.4. We derive the sufficient conditions on latent time series models to ensure consistency and asymptotic normality of the proposed estimator. The connection between this method and the one presented in Section 3.3 is also investigated. Finally, we present several extensions that aim to improve the performance of our estimator. One of these extensions is a general procedure which corrects the estimator’s bias.

• Model selection strategies for the Generalised Method of Wavelet Moments estimators are briefly presented in Section 3.5.

• An extensive simulation study is presented in Section 3.6. These simulations reveal that our method is able to estimate complex models for which alternative estimation techniques fail.

• Section 3.7 presents a case study where our methodology is applied to the stochastic modelling of inertial sensor errors. This approach is shown to clearly outperform classical methods.

• Section 3.8 describes some possible extensions and presents future research plans.

Finally, Chapter 4 concludes.


Chapter 2 A Prediction Divergence Criterion for Model Selection

2.1 Introduction

Model selection is an important and challenging problem in statistics. Indeed, it becomes unavoidable in more and more applications involving incomplete theoretical knowledge about the phenomenon under investigation and large amounts of available information, such as in medicine, biology and economics. Very often, model selection is about choosing, among a set of predictors, the subset that best predicts or explains a response variable.

A common model selection procedure consists in computing a criterion associated with either each potential model or a suitable sequence of potential models, and in choosing the one(s) that optimise(s) this criterion. Many criteria have been proposed and the most popular ones include Mallows’ $C_p$ (Mallows, 1973), based on prediction error, Akaike’s Information Criterion (AIC) (Akaike, 1974), based on the Kullback-Leibler divergence between the candidate model and the true one, and the Bayesian Information Criterion (BIC) (Schwarz, 1978). Comparisons of the relative (asymptotic) properties of the AIC and the BIC have flourished in the literature (see e.g. Zhang, 1993, Yang, 2005 and the references therein). When the true model is finite, the BIC is consistent¹ (see e.g. Haughton, 1988), while the AIC and the $C_p$ (which are asymptotically equivalent, as shown e.g. in Nishii, 1984) have a non-zero (asymptotic) probability of overfitting.
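The procedure described above (compute a criterion along a sequence of candidate models, then keep the optimiser) can be sketched as follows. This is a minimal illustration, assuming a Gaussian linear regression, candidate models nested in column order, and common textbook forms of the $C_p$, AIC and BIC, which may differ by constants from the definitions used later in the thesis. The function name and the simulated data are hypothetical.

```python
import numpy as np

def nested_criteria(X, y):
    """Cp, AIC and BIC for the nested models using the first k columns of X, k = 1..p."""
    n, p = X.shape
    rss = np.empty(p)
    for k in range(1, p + 1):
        coef, *_ = np.linalg.lstsq(X[:, :k], y, rcond=None)
        rss[k - 1] = np.sum((y - X[:, :k] @ coef) ** 2)
    k = np.arange(1, p + 1)
    sigma2_full = rss[-1] / (n - p)  # error variance estimated from the largest model
    return {
        "Cp": rss / sigma2_full - n + 2 * k,
        "AIC": n * np.log(rss / n) + 2 * k,
        "BIC": n * np.log(rss / n) + np.log(n) * k,
    }

# Toy data: only the first three of eight regressors matter.
rng = np.random.default_rng(2)
n, p = 200, 8
X = rng.standard_normal((n, p))
y = X[:, :3] @ np.array([1.0, 1.0, 1.0]) + rng.standard_normal(n)

crit = nested_criteria(X, y)
chosen = {name: int(np.argmin(values)) + 1 for name, values in crit.items()}
print(chosen)  # selected model size for each criterion
```

Since the BIC penalty per parameter, log(n), exceeds the AIC penalty of 2 for n ≥ 8, the BIC never selects a larger model than the AIC along the same nested sequence, which is the finite-sample counterpart of the overfitting behaviour mentioned above.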

In finite samples, consistency is not necessarily a good property since consistent selection criteria may tend to underfit, so that the chosen models could have larger prediction errors. A related measure is the Signal-to-Noise Ratio (SNR) (see e.g. McQuarrie and Tsai, 1998); a criterion with a weak SNR will tend to choose models that overfit, while a criterion with a strong SNR will tend to choose models that underfit. For example, the SNR of the BIC is larger than that of most criteria for a reasonable sample size, and this criterion is known for choosing models that are often too simple. Besides the $C_p$, the AIC and the BIC, a great number of criteria have been proposed in the literature (see e.g. Hannan and Quinn, 1979, Foster and George, 1994, Zheng and Loh, 1997, Tibshirani and Knight, 1999, George, 2000 and the references therein).

An alternative to explicit model selection criteria is based on, for example, prediction error estimates obtained by simulation methods such as the bootstrap (see e.g. Efron, 1983) or cross-validation (see e.g. Shao, 1993). Nevertheless, these methods are strongly linked to explicit model selection criteria. For example, Shao (1993) showed that leave-one-out cross-validation is asymptotically equivalent to the AIC. Moreover, Efron (2004) demonstrated that model-based penalty methods such as the AIC have better model selection performance than nonparametric methods like cross-validation, assuming that the model is believable.

¹ Consistency for model selection is defined later in Section 2.4.1. The reader can also refer e.g. to Shao (1997).

The task of model selection can also be performed by adding some suitable penalty to the estimating function. The nonnegative garrote (Breiman, 1995), the lasso (Tibshirani, 1996), the elastic-net (Zou and Hastie, 2005) or the Dantzig Selector (Candès and Tao, 2007) are all examples of such methods. Nowadays, these methods are increasingly being used and often provide better results than selection approaches based on model selection criteria.

In this thesis we propose a new class of error measures, called the d-class, which builds on Efron’s q-class (Efron, 1986), and derive the optimism² theorem (using the terminology of Efron, 2004) associated with this class. This enables one to easily construct model selection criteria which are consistent estimators of the error measure of interest. Additionally, we propose a new class of selection criteria in which one can choose between a consistent criterion and one with a small but non-zero (asymptotic) probability of overfitting. The latter, called the Prediction Divergence Criterion Estimator (PDCE), is derived from the optimism theorem we propose for the d-class of error measures. In finite samples, the PDCE often behaves as an improved version of the BIC and is particularly well suited to “sparse” settings. The derivation and the properties of this new class are presented later in the text; we start with a motivating example.

Suppose that we wish to explain the behaviour of a response variable, say $y$, using a set of $p$ regressors and a linear regression model. Assume further that among these $p$ regressors, only $p^\star \ll p$ have an influence on $y$. This type of sparse situation is certainly common in practice, especially in the rapidly growing fields of Genomics and Proteomics. As an illustration of such a setting, consider the following linear model

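$$y = X\beta + \varepsilon, \qquad \varepsilon \sim \mathcal{N}\left(0, \sigma_\varepsilon^2 I\right)$$

with

$$\beta = (0, 1, 0, 1, 0, 1, 0, 1, 0, 1, \underbrace{0, \ldots, 0}_{50}) \tag{2.1}$$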

and $\sigma_\varepsilon^2 = 1$. Suppose also that the pairwise correlations between $x_j$ and $x_k$ (i.e. the $j$th and $k$th columns of $X$) are arbitrarily chosen to be $\mathrm{corr}(x_j, x_k) = 0.5^{|j-k|}$. This situation corresponds to a theoretical $R^2$ of approximately 88.2% and to an SNR for the slope coefficients of about 7.4. In Table 2.2 we present the results of a simulation study comparing the performance of our PDCE³ with that of the lasso⁴, the elastic-net⁵ and the stepwise forward approach using the AIC, AICc, AICu, FPE, FPEu, BIC, HQ and HQc criteria (see Table 2.3). We also computed the Least Squares Estimator (LSE) on the complete model as a benchmark. The performance criteria used for comparing the methods are presented in Table 2.1 and were obtained using 500 simulated samples under the correct model. For the training and test samples, we chose respectively $n = 70$ and $n^\star = 700$. While a more extensive simulation study is provided in Section 2.6, Table 2.2 clearly reveals the advantage of our PDCE, not only in the probability of selecting the correct model, but also in prediction and estimation error. Figure 2.1 presents the distributions of $\mathrm{PE}_y$ (see (2.70)) and $\mathrm{MSE}_\beta$ (see (2.71)) and reveals even more clearly the advantage of the PDCE over the other methods in the sparse setting considered in this simulation.

² The notion of optimism is defined later in Section 2.2. The reader can also refer e.g. to Efron (2004) or Hastie et al. (2009), Chapter 7.

³ Using the stepwise selection algorithm presented later in Section 2.4.2.

⁴ For the lasso we used the R function lars of the lars package. The shrinkage coefficient λ was chosen by minimising the $C_p$ statistic.

⁵ For the elastic-net we used the R function enet of the elasticnet package. The shrinkage coefficient λ2 (L2 penalty) was chosen by tenfold cross-validation and λ1 (L1 penalty) by minimising the $C_p$ statistic.

Table 2.1: Model selection evaluation criteria

| Criterion | Description |
|---|---|
| Cor. [%] | Proportion of times the correct model is selected. |
| Inc. [%] | Proportion of times the correct model is nested within the selected model. |
| true+ | Average number of selected significant variables (true positives). |
| false+ | Average number of selected non-significant variables (false positives). |
| NbReg | Average number of regressors in the selected model. |
| Med(PE_y) | Median of PE_y (see (2.70)) computed on test samples. |
| Med(MSE_β) | Median of MSE_β (see (2.71)) computed on test samples. |
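The design of this motivating example can be reproduced in a few lines, which also recovers the quoted theoretical $R^2$ and SNR. This is a minimal sketch under stated assumptions: the regressors are taken to be zero-mean, unit-variance Gaussian variables (the text only specifies their correlations), there is no intercept, and the SNR is taken to be Var(Xβ)/σ_ε². The function name is hypothetical.

```python
import numpy as np

def make_design(n, p, rho, rng):
    """Draw n rows of p Gaussian regressors with corr(x_j, x_k) = rho**|j-k|."""
    idx = np.arange(p)
    corr = rho ** np.abs(idx[:, None] - idx[None, :])
    chol = np.linalg.cholesky(corr)
    return rng.standard_normal((n, p)) @ chol.T, corr

# Coefficient vector of (2.1): five unit slopes interleaved with zeros, then 50 zeros.
beta = np.concatenate([np.tile([0.0, 1.0], 5), np.zeros(50)])
p = beta.size
sigma_eps2 = 1.0

rng = np.random.default_rng(0)
X, corr = make_design(n=70, p=p, rho=0.5, rng=rng)
y = X @ beta + np.sqrt(sigma_eps2) * rng.standard_normal(70)

# Population quantities implied by the design (assumed definitions, see lead-in).
signal_var = beta @ corr @ beta
r2 = signal_var / (signal_var + sigma_eps2)
snr = signal_var / sigma_eps2
print(round(100 * r2, 1), round(snr, 1))  # → 88.2 7.4
```

Under these assumptions the signal variance is 7.4453125, which gives both figures quoted in the text.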

The rest of the chapter is organised as follows. In Section 2.2 we introduce the d-class of error measures, which generalises Efron’s q-class, and we derive the associated optimism theorem. Section 2.3 presents a new class of model selection criteria, called the Prediction Divergence Criterion (PDC), which is based on the d-class of error measures. In Section 2.4 we apply the PDC to linear regression models and derive, for these models, the asymptotic properties of this selection approach. Section 2.5 presents some possible extensions of the PDC methodology as well as future research plans. The application of this approach to smoothing splines, the selection of the order of an autoregressive model as well as the choice of random effects in Mixed

Table 2.2: Evaluation criteria as explained in Table 2.1 for the full model with the LS estimator (LS), stepwise forward FPE (FPE), FPEu (FPEu), AIC (AIC), AICc (AICc), AICu (AICu), BIC (BIC), HQ (HQ), HQc (HQc), lasso (lasso), elastic-net (enet) and stepwise forward PDCE (PDCE) based on 500 simulated samples under the correct model. The definition of the model selection criteria can be found in Table 2.3. The numbers in parentheses for the columns Med(PEy) and Med(MSEβ) are the corresponding standard errors estimated by using the bootstrap with B = 500 resamplings. The numbers in superscript indicate the ranked performance for each evaluation criterion (before rounding).

Method  Med(PEy)             Med(MSEβ)                  Cor. [%]  Inc. [%]   true+    false+   NbReg
LS      6.81 (1.7·10⁻¹)¹²    9.73·10⁰  (3.2·10⁻¹)¹²     0.0⁹·⁵    100.0⁵·⁵   5.0⁵·⁵   55.0¹²   60.0¹²
FPE     2.88 (6.5·10⁻²)¹⁰    2.89·10⁰  (1.0·10⁻¹)¹⁰     0.0⁹·⁵    100.0⁵·⁵   5.0⁵·⁵   25.4¹⁰   30.4¹⁰
FPEu    1.75 (2.3·10⁻²)⁸     1.02·10⁰  (4.3·10⁻²)⁸      0.0⁹·⁵    100.0⁵·⁵   5.0⁵·⁵   10.2⁷    15.2⁷
AIC     4.04 (2.0·10⁻¹)¹¹    4.82·10⁰  (2.8·10⁻¹)¹¹     0.0⁹·⁵    100.0⁵·⁵   5.0⁵·⁵   33.7¹¹   38.7¹¹
AICc    1.72 (2.0·10⁻²)⁷     9.87·10⁻¹ (3.6·10⁻²)⁷      0.0⁹·⁵    100.0⁵·⁵   5.0⁵·⁵   9.3⁶     14.3⁶
AICu    1.46 (1.8·10⁻²)⁶     5.61·10⁻¹ (2.2·10⁻²)⁶      1.4⁴·⁵    100.0⁵·⁵   5.0⁵·⁵   5.2⁴     10.2⁴
BIC     1.44 (1.6·10⁻²)⁴     5.40·10⁻¹ (2.2·10⁻²)⁴      4.6³      100.0⁵·⁵   5.0⁵·⁵   5.7⁵     10.7⁵
HQ      2.25 (6.9·10⁻²)⁹     1.72·10⁰  (1.0·10⁻¹)⁹      0.0⁹·⁵    100.0⁵·⁵   5.0⁵·⁵   19.5⁹    24.5⁹
HQc     1.45 (1.6·10⁻²)⁵     5.49·10⁻¹ (1.9·10⁻²)⁵      1.4⁴·⁵    100.0⁵·⁵   5.0⁵·⁵   5.0³     10.0³
lasso   1.33 (1.4·10⁻²)³     3.76·10⁻¹ (1.4·10⁻²)³      0.2⁶      100.0⁵·⁵   5.0⁵·⁵   14.7⁸    19.7⁸
enet    1.27 (8.4·10⁻³)²     3.00·10⁻¹ (1.1·10⁻²)²      14.2²     98.8¹²     5.0¹²    4.9²     9.9²
PDCE    1.09 (6.7·10⁻³)¹     8.71·10⁻² (4.9·10⁻³)¹      74.6¹     99.6¹¹     5.0¹¹    0.3¹     5.3¹

[Figure 2.1: two panels, PEy and MSEβ, each showing one empirical distribution per method (LS, FPE, FPEu, AIC, AICc, AICu, BIC, HQ, HQc, lasso, enet, PDCE).]

Figure 2.1: Empirical distributions of PEy and MSEβ as defined in (2.70) and (2.71) (see also Table 2.1) for the full model with the LS estimator (LS), stepwise forward FPE (FPE), FPEu (FPEu), AIC (AIC), AICc (AICc), AICu (AICu), BIC (BIC), HQ (HQ), HQc (HQc), lasso (lasso), elastic-net (enet) and stepwise forward PDCE (PDCE) based on 500 simulated samples under the correct model.

Linear Models (MLM) are briefly investigated. Section 2.6 presents a simulation study that compares the finite sample performance of different PDC criteria with other model selection techniques such as the lasso or the stepwise AIC. In Section 2.7 we present three case studies. In the first one, the PDC selection procedure and other selection approaches are employed in an analysis of childhood malnutrition in Zambia. We then present an application to classification and gene selection in a leukaemia microarray problem. Finally, the PDC approach for the selection of random effects in MLM is illustrated with pharmacokinetics data.

2.2 The d-Class Error Measures

Consider a random variable $Y$ distributed according to model $F_\theta$, possibly conditionally on a set of fixed covariates $x = [x_1 \ldots x_p]$. We observe a random sample $Y = (Y_i)_{i=1,\ldots,n}$ supposedly generated from $F_\theta$, possibly together with a non-random $n \times p$ full rank matrix of inputs $X$.

Given a prediction function $\hat{Y}$ that depends on the chosen model, Efron (1986) uses a function $Q(u, v)$ based on the $q$-class of error measures to define a prediction error measure between the in-sample prediction and an out-of-sample predicted response. The $q$-class of error measures based on the concave function $q(\cdot)$ is given by

$$Q(u, v) = q(v) + \dot{q}(v)(u - v) - q(u)$$

where $\dot{q}(v)$ is the derivative of $q(\cdot)$ evaluated at $v$. The particular choice of $q(u) = u(1 - u)$ gives the squared loss function $Q(u, v) = (u - v)^2$. The prediction error measure is quantified by the (out-of-sample) expected prediction error

$$\mathrm{EPErr} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{EPErr}_i \quad \text{where} \quad \mathrm{EPErr}_i = E\left[ E_0\left[ Q(Y_i^0, \hat{Y}_i) \mid Y \right] \right] \tag{2.2}$$

with $Y^0 = (Y_i^0)_{i=1,\ldots,n}$ a random variable distributed as $Y$, and where, as throughout this thesis, $E[\cdot]$ and $E_0[\cdot]$ denote expectations under the distribution of $Y_i \mid x_i$, respectively $Y_i^0 \mid x_i$, at the correct model. The expectations are, depending on the context, simple or multiple.

In the special case where the distribution of $Y$ is replaced by the empirical distribution $y$, one gets the in-sample error,

$$\mathrm{ISErr} = \frac{1}{n} \sum_{i=1}^{n} E_0\left[ Q(Y_i^0, \hat{y}_i) \mid y \right].$$

A training or apparent error can simply be computed as the average loss over the training sample $y$, i.e.

$$\mathrm{AErr} = \frac{1}{n} \sum_{i=1}^{n} Q(y_i, \hat{y}_i).$$
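The $q$-class construction and the apparent error can be checked numerically. The sketch below is illustrative only: the helper names `q_class_loss` and `apparent_error` and the toy responses are assumptions introduced here. It builds $Q$ from $q(u) = u(1-u)$ and its derivative, and averages the resulting loss over a small training sample.

```python
def q_class_loss(q, q_dot):
    """Build Q(u, v) = q(v) + q'(v) (u - v) - q(u) from a concave q and its derivative."""
    def Q(u, v):
        return q(v) + q_dot(v) * (u - v) - q(u)
    return Q


# q(u) = u (1 - u) is concave with derivative 1 - 2u; Q is then the squared loss.
squared_loss = q_class_loss(lambda u: u * (1.0 - u), lambda u: 1.0 - 2.0 * u)


def apparent_error(Q, y, y_hat):
    """AErr: average loss between observed and fitted values on the training sample."""
    return sum(Q(yi, fi) for yi, fi in zip(y, y_hat)) / len(y)


y = [1.0, 2.0, 4.0]        # toy responses (illustrative values)
y_hat = [1.5, 2.0, 3.0]    # toy fitted values (illustrative values)
a_err = apparent_error(squared_loss, y, y_hat)
print(a_err)
```

Expanding $q(v) + \dot{q}(v)(u - v) - q(u)$ with $q(u) = u - u^2$ gives $u^2 - 2uv + v^2$, which is what the function returns up to rounding.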

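AErr is an optimistic estimate of EPErr because the same data are used to fit the prediction rule and to assess its error. Let the optimism $\Psi$ and the expected optimism $\Omega$ be defined as

$$\Psi = \sum_{i=1}^{n} \Psi_i \quad \text{where} \quad \Psi_i = E_0\left[ Q(Y_i^0, \hat{y}_i) \mid y \right] - Q(y_i, \hat{y}_i),$$

$$\Omega = \sum_{i=1}^{n} \Omega_i \quad \text{where} \quad \Omega_i = E[\Psi_i], \tag{2.3}$$

respectively. Then, Efron's optimism theorem (see Efron, 2004) demonstrates that

$$\mathrm{EPErr}_i = E\left[ E_0\left[ Q(Y_i^0, \hat{Y}_i) \mid y \right] \right] = E\left[ Q(Y_i, \hat{Y}_i) + \Omega_i \right] \quad \text{where} \quad \Omega_i = \mathrm{cov}\left( \dot{q}(\hat{Y}_i), Y_i \right).$$

Hence, an estimator of EPErr is obtained as

$$\widehat{\mathrm{EPErr}} = \frac{1}{n} \sum_{i=1}^{n} \widehat{\mathrm{EPErr}}_i \quad \text{where} \quad \widehat{\mathrm{EPErr}}_i = Q(y_i, \hat{y}_i) + \widehat{\mathrm{cov}}\left( \dot{q}(\hat{Y}_i), Y_i \right) \tag{2.4}$$

where, depending on the distribution of $Y_i \mid x_i$, $\widehat{\mathrm{cov}}(\cdot)$ is obtained analytically up to a value of $\theta$, the model's parameters, which is then replaced by $\hat{\theta}$, or by resampling methods (see e.g. Efron, 2004).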

We may extend this methodology and the $q$-class of error measures as follows. Suppose that we wish to construct a model selection criterion, say $C$, in order to assess the discrepancy $D(\cdot,\cdot)$ between two equidimensional vector valued functions $f_1(Y, \hat{\theta}_1)$ and $f_2(Y, \hat{\theta}_2)$ where $\hat{\theta}_1$ and $\hat{\theta}_2$ denote the estimated parameter vectors associated, respectively, to the models $F_{\theta_1}$ and $F_{\theta_2}$. Such a criterion can be defined without loss of generality as

$$C = E\left[ E_0\left[ D\left( f_1(Y^0, \hat{\theta}_1), f_2(Y, \hat{\theta}_2) \right) \right] \right] \tag{2.5}$$

where the expectation is multidimensional. The discrepancy $D(f_1(Y^0, \hat{\theta}_1), f_2(Y, \hat{\theta}_2))$ is said to belong to the $d$-class of error measures if the function $D(\cdot,\cdot)$ is a valid Bregman divergence and if the functions $f_i(Y, \hat{\theta}_i): \mathbb{R}^n \to \mathbb{R}^n$ (for $\theta_i$ fixed) are equidimensional and associated, respectively, to the models $F_{\theta_i}$, $i = 1, 2$. In addition, we assume that the estimators $\hat{\theta}_1$ and $\hat{\theta}_2$ are based, respectively, on $Y^0$ and $Y$. In some sense, Bregman divergences are the multivariate equivalent of Efron's $q$-class (see Bregman, 1967 for more details). The Bregman divergence encompasses squared error, relative entropy, logistic loss, Mahalanobis distance and other error measures. The Bregman divergence between two equidimensional vectors $u$ and $v$ is defined as

$$D(u, v) = \psi(u) - \psi(v) - (u - v)^T \nabla\psi(v) \tag{2.6}$$

where $\psi(\cdot)$ is a scalar-valued function and $\nabla\psi(v)$ represents the gradient vector of $\psi(\cdot)$ evaluated at $v$. The function $\psi(\cdot)$ is strictly convex and differentiable. For example, a squared loss function $D(u, v) = \|u - v\|_2^2$ is obtained when $\psi(v) = v^T v$.
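Definition (2.6) translates directly into code. The following is a minimal sketch, with the helper names (`bregman`, `sq_norm`, `neg_entropy` and their gradients) introduced here for illustration. It evaluates the divergence for $\psi(v) = v^T v$, which gives the squared loss, and for the negative entropy $\psi(v) = \sum_i v_i \log v_i$, which gives the relative entropy when both arguments are probability vectors.

```python
import math


def bregman(psi, grad_psi, u, v):
    """D(u, v) = psi(u) - psi(v) - (u - v)^T grad_psi(v) for equal-length vectors."""
    g = grad_psi(v)
    inner = sum((ui - vi) * gi for ui, vi, gi in zip(u, v, g))
    return psi(u) - psi(v) - inner


# psi(v) = v^T v with gradient 2v: squared Euclidean distance.
def sq_norm(v):
    return sum(vi * vi for vi in v)


def sq_norm_grad(v):
    return [2.0 * vi for vi in v]


# psi(v) = sum_i v_i log v_i with gradient log v_i + 1: relative entropy
# when u and v are probability vectors with strictly positive entries.
def neg_entropy(v):
    return sum(vi * math.log(vi) for vi in v)


def neg_entropy_grad(v):
    return [math.log(vi) + 1.0 for vi in v]


d_sq = bregman(sq_norm, sq_norm_grad, [1.0, 2.0], [0.0, 4.0])
d_kl = bregman(neg_entropy, neg_entropy_grad, [0.5, 0.5], [0.25, 0.75])
print(d_sq, d_kl)
```

Keeping $\psi$ and its gradient as separate arguments avoids any numerical differentiation; the price is that the caller must supply a gradient consistent with $\psi$.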

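Similarly to (2.3) we define the optimism $\Delta$ for a $d$-class error measure and for a criterion defined in (2.5) as

$$\Delta = E\left[ E_0\left[ D\left( f_1(Y^0, \hat{\theta}_1), f_2(Y, \hat{\theta}_2) \right) \right] \right] - E\left[ D\left( f_1(Y, \hat{\theta}_1), f_2(Y, \hat{\theta}_2) \right) \right] \tag{2.7}$$

and from this definition we have the following "optimism" theorem.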

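Theorem 2.1: Let the discrepancy $D(\cdot,\cdot)$ be a valid $d$-class error measure based on $\psi(\cdot)$ and assume that

$$E\left[ \left( f_1(Y, \hat{\theta}_1) - E_0\left[ f_1(Y^0, \hat{\theta}_1) \right] \right)^T \left( f_1(Y, \hat{\theta}_1) - E_0\left[ f_1(Y^0, \hat{\theta}_1) \right] \right) \right] < \infty,$$

$$E\left[ \nabla\psi\left( f_2(Y, \hat{\theta}_2) \right)^T \nabla\psi\left( f_2(Y, \hat{\theta}_2) \right) \right] < \infty.$$

Then

$$\Delta = \mathrm{tr}\left\{ \mathrm{cov}\left[ f_1(Y, \hat{\theta}_1), \nabla\psi\left( f_2(Y, \hat{\theta}_2) \right) \right] \right\}.$$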

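Proof: By definition, we have that

$$\Delta = E\left[ E_0\left[ D\left( f_1(Y^0, \hat{\theta}_1), f_2(Y, \hat{\theta}_2) \right) \right] \right] - E\left[ D\left( f_1(Y, \hat{\theta}_1), f_2(Y, \hat{\theta}_2) \right) \right]$$

and since the discrepancy $D(\cdot,\cdot)$ belongs to the $d$-class, $D(\cdot,\cdot)$ is a valid Bregman divergence. Therefore, we may express the above terms as

$$E\left[ E_0\left[ D\left( f_1(Y^0, \hat{\theta}_1), f_2(Y, \hat{\theta}_2) \right) \right] \right] = E_0\left[ \psi\left( f_1(Y^0, \hat{\theta}_1) \right) \right] - E\left[ \psi\left( f_2(Y, \hat{\theta}_2) \right) \right] - E\left[ \left( E_0\left[ f_1(Y^0, \hat{\theta}_1) \right] - f_2(Y, \hat{\theta}_2) \right)^T \nabla\psi\left( f_2(Y, \hat{\theta}_2) \right) \right]$$

and

$$E\left[ D\left( f_1(Y, \hat{\theta}_1), f_2(Y, \hat{\theta}_2) \right) \right] = E\left[ \psi\left( f_1(Y, \hat{\theta}_1) \right) \right] - E\left[ \psi\left( f_2(Y, \hat{\theta}_2) \right) \right] - E\left[ \left( f_1(Y, \hat{\theta}_1) - f_2(Y, \hat{\theta}_2) \right)^T \nabla\psi\left( f_2(Y, \hat{\theta}_2) \right) \right].$$

So by subtracting the two terms we obtain

$$\Delta = E_0\left[ \psi\left( f_1(Y^0, \hat{\theta}_1) \right) \right] - E\left[ \psi\left( f_1(Y, \hat{\theta}_1) \right) \right] + E\left[ \left( f_1(Y, \hat{\theta}_1) - E_0\left[ f_1(Y^0, \hat{\theta}_1) \right] \right)^T \nabla\psi\left( f_2(Y, \hat{\theta}_2) \right) \right].$$

Since $E_0[\psi(f_1(Y^0, \hat{\theta}_1))] = E[\psi(f_1(Y, \hat{\theta}_1))]$ it follows that

$$\Delta = E\left[ \left( f_1(Y, \hat{\theta}_1) - E_0\left[ f_1(Y^0, \hat{\theta}_1) \right] \right)^T \nabla\psi\left( f_2(Y, \hat{\theta}_2) \right) \right].$$

Let $x$ and $z$ be two real-valued vectors of the same dimension such that $E[x^T x] < \infty$ and $E[z^T z] < \infty$. Then we can always write

$$E\left[ x^T z \right] = E\left[ \mathrm{tr}\left( x^T z \right) \right] = \mathrm{tr}\left( E\left[ z x^T \right] \right) = \mathrm{tr}\left( \mathrm{cov}(x, z) \right) + \mathrm{tr}\left( E[x] E^T[z] \right). \tag{2.8}$$

Using (2.8) we can express $\Delta$ as

$$\Delta = \mathrm{tr}\left\{ \mathrm{cov}\left[ f_1(Y, \hat{\theta}_1) - E_0\left[ f_1(Y^0, \hat{\theta}_1) \right], \nabla\psi\left( f_2(Y, \hat{\theta}_2) \right) \right] \right\} + \mathrm{tr}\left\{ E\left[ f_1(Y, \hat{\theta}_1) - E_0\left[ f_1(Y^0, \hat{\theta}_1) \right] \right] E^T\left[ \nabla\psi\left( f_2(Y, \hat{\theta}_2) \right) \right] \right\}.$$

Since $E_0[f_1(Y^0, \hat{\theta}_1)]$ is a non-stochastic quantity, it does not contribute to the covariance, and since $Y^0$ is distributed as $Y$ we have $E[f_1(Y, \hat{\theta}_1)] = E_0[f_1(Y^0, \hat{\theta}_1)]$, so that the second trace vanishes. It follows that

$$\Delta = \mathrm{tr}\left\{ \mathrm{cov}\left[ f_1(Y, \hat{\theta}_1), \nabla\psi\left( f_2(Y, \hat{\theta}_2) \right) \right] \right\}$$

which verifies the result of Theorem 2.1.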

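The direct consequence of Theorem 2.1 is that for any criterion $C$ as defined in (2.5) one can construct an estimator, say $\hat{C}$, similarly to (2.4) for the $q$-class of error measures. Indeed, using (2.7) and applying Theorem 2.1 we have that

$$E\left[ E_0\left[ D\left( f_1(Y^0, \hat{\theta}_1), f_2(Y, \hat{\theta}_2) \right) \right] \right] = E\left[ D\left( f_1(Y, \hat{\theta}_1), f_2(Y, \hat{\theta}_2) \right) \right] + \mathrm{tr}\left\{ \mathrm{cov}\left[ f_1(Y, \hat{\theta}_1), \nabla\psi\left( f_2(Y, \hat{\theta}_2) \right) \right] \right\}$$

and therefore a "natural" (and consistent) estimator of $C$ is

$$\hat{C} = D\left( f_1(y, \hat{\theta}_1), f_2(y, \hat{\theta}_2) \right) + \mathrm{tr}\left\{ \widehat{\mathrm{cov}}\left[ f_1(y, \hat{\theta}_1), \nabla\psi\left( f_2(y, \hat{\theta}_2) \right) \right] \right\} \tag{2.9}$$

where, as in (2.4), depending on the distribution of $Y \mid X$, $\widehat{\mathrm{cov}}(\cdot)$ is obtained analytically up to a value of $\theta_1$ and $\theta_2$, the models' parameters, which are then replaced by $\hat{\theta}_1$ and $\hat{\theta}_2$, or by resampling methods (see e.g. Efron, 2004).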

Although we defined the $d$-class in a general manner for two models $F_{\theta_1}$ and $F_{\theta_2}$, nearly all model selection criteria are based on a chosen discrepancy between a given candidate model $F_\theta$ and the true model $F_0$. This is, for example, the case for the AIC which aims to compute the Kullback-Leibler divergence between $F_\theta$ and $F_0$. In such a setting, we may simplify our definition given in (2.5) and write instead

$$C^\star = E\left[ E_0\left[ D\left( f_1(Y^0), f_2(Y, \hat{\theta}) \right) \right] \right] \tag{2.10}$$

which leads to the following (consistent) estimator

$$\hat{C}^\star = D\left( f_1(y), f_2(y, \hat{\theta}) \right) + \mathrm{tr}\left\{ \widehat{\mathrm{cov}}\left[ f_1(y), \nabla\psi\left( f_2(y, \hat{\theta}) \right) \right] \right\}. \tag{2.11}$$

We shall refer to the first and second terms of (2.9) (or of (2.11)) as the apparent divergence and the divergence optimism, respectively. Many criteria can be defined using (2.5) (or (2.10)); however, only a few of them are meaningful for the task of model selection. Indeed, an estimator of $C^\star$ as defined in (2.10) should ideally satisfy (at least) two properties. Consider two candidate models, say $F_{\theta_1}$ and $F_{\theta_2}$, such that $F_{\theta_1}$ is nested within $F_{\theta_2}$. Then, $\hat{C}^\star$ should satisfy the following two properties:

(Pr.1) For nested models, the apparent divergence is non-increasing with respect to model complexity. More formally, for any nested models $F_{\theta_1}$ in $F_{\theta_2}$ the apparent divergence is such that:

$$D\left( f_1(y), f_2(y, \hat{\theta}_1) \right) \geq D\left( f_1(y), f_2(y, \hat{\theta}_2) \right).$$

(Pr.2) For nested models, the divergence optimism is non-decreasing with respect to model complexity. More formally, for any nested models $F_{\theta_1}$ in $F_{\theta_2}$ the divergence optimism is such that:

$$\mathrm{tr}\left\{ \widehat{\mathrm{cov}}\left[ f_1(y), \nabla\psi\left( f_2(y, \hat{\theta}_1) \right) \right] \right\} \leq \mathrm{tr}\left\{ \widehat{\mathrm{cov}}\left[ f_1(y), \nabla\psi\left( f_2(y, \hat{\theta}_2) \right) \right] \right\}.$$

If Property (Pr.1) is not satisfied, a larger model (i.e. $F_{\theta_2}$) can provide a poorer (apparent) fit than $F_{\theta_1}$. This could, for example, correspond to a situation where the residual sum of squares increases with the complexity of the model. Although such situations may appear extremely unlikely, they could for instance occur when a robust estimator of $\theta$ is used together with a non-robust apparent divergence such as the residual sum of squares. In Corollary 2.1 (below) we relate Property (Pr.1) to the estimator $\hat{\theta}$. This makes it easy to construct a criterion which satisfies (Pr.1) given an estimator $\hat{\theta}$. The proof of this result is straightforward and is therefore omitted.

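Corollary 2.1: The estimator $\hat{C}^\star$ as defined in (2.11) satisfies Property (Pr.1) if the estimator $\hat{\theta}$ is the result of the following minimisation problem

$$\hat{\theta} = \underset{\theta \in \Theta}{\operatorname{argmin}} \; D\left( f_1(y), f_2(y, \theta) \right).$$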

The rationale behind Property (Pr.2) is the following. Suppose that $F_{\theta_1}$ is the true model (or is nested within the true model) and assume that models $F_{\theta_j}$ with $j = 1, \ldots, K$ are models with increasing complexity such that model $F_{\theta_j}$ is nested within model $F_{\theta_k}$ if $1 \leq j < k \leq K$. Then, if a criterion satisfies Property (Pr.1), the apparent divergence will not increase as the model complexity increases. Therefore, the optimism should have exactly the opposite property in order to allow a model "close" to $F_{\theta_1}$ to be selected by the criterion at hand.

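In the following example we use the above theory to derive a criterion equivalent to Mallows' $C_p$.

Example 2.1: Consider the linear model $y = X\beta + \varepsilon$ where $X \in \mathbb{R}^{n \times p}$ is a full-rank constant matrix, $\varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2 I_n)$ and $\beta \in \mathcal{B} \subseteq \mathbb{R}^p$. Let $\hat{\beta}$ denote the LSE of $\beta$, i.e.

$$\hat{\beta} = \underset{\beta \in \mathcal{B}}{\operatorname{argmin}} \; \|y - X\beta\|_2^2.$$

Suppose that we wish to find an estimator for the following criterion:

$$C^\star = E\left[ E_0\left[ \|Y^0 - \hat{Y}\|_2^2 \right] \right].$$

Clearly, this model selection criterion belongs to the $d$-class of error measures (and to the $q$-class as well) with $f_1(Y^0) = Y^0$, $f_2(Y, \hat{\beta}) = X\hat{\beta}$ and $\psi(z) = z^T z$. By applying Theorem 2.1 we obtain

$$\Delta = \mathrm{tr}\left[ \mathrm{cov}\left( Y, 2X\hat{\beta} \right) \right] = 2\sigma_\varepsilon^2 \, \mathrm{tr}(S) = 2\sigma_\varepsilon^2 p$$

where $S$ denotes the "hat" matrix of $X$, i.e. $S = X(X^T X)^{-1} X^T$. Using (2.11) we get

$$\hat{C}^\star = D\left( f_1(y), f_2(y, \hat{\beta}) \right) + \Delta = \|y - X\hat{\beta}\|_2^2 + 2\sigma_\varepsilon^2 p$$

which is unsurprisingly equivalent to Mallows' $C_p$. Since $\hat{\beta}$ is obtained by minimising $\|y - X\beta\|_2^2$, which is equivalent to $D(f_1(y), f_2(y, \beta))$, Property (Pr.1) is satisfied by Corollary 2.1. Note that this is also the case for the MLE since this estimator is equivalent to the LSE in this context. Moreover, for any $\sigma_\varepsilon^2$ that is fixed across models, Property (Pr.2) is also satisfied. In practice, $\sigma_\varepsilon^2$ is replaced by an estimate, say $\hat{\sigma}_\varepsilon^2$, of the noise variance obtained from a "low-bias" model, generally the largest model.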
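The two terms of the criterion in Example 2.1 can be evaluated numerically. The sketch below is a hedged illustration on simulated data: the dimensions, the coefficient vector and the known noise variance are assumptions chosen here, not values from the thesis. It checks that $\mathrm{tr}(S) = p$, so that the divergence optimism is $2\sigma_\varepsilon^2 p$.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed, illustrative data only
n, p, sigma2 = 50, 4, 1.0                # assumed toy dimensions and known noise variance
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.0, 0.5])   # hypothetical coefficients
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)

# Least-squares estimate and hat matrix S = X (X^T X)^{-1} X^T.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
S = X @ np.linalg.solve(X.T @ X, X.T)

rss = float(np.sum((y - X @ beta_hat) ** 2))    # apparent divergence ||y - X beta_hat||^2
optimism = 2.0 * sigma2 * float(np.trace(S))    # divergence optimism 2 sigma^2 tr(S)
criterion = rss + optimism                      # estimate of C* in Example 2.1
print(np.trace(S), criterion)
```

In practice $\sigma_\varepsilon^2$ is unknown; replacing `sigma2` by an estimate from the largest candidate model, as described in the example, leaves the rest of the computation unchanged.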

2.3 The Prediction Divergence Criterion

When a model selection criterion, say $C$, is used in practice to choose between two models, say $\mathcal{M}_j$ nested in $\mathcal{M}_k$, an estimate of $C$ (or of EPErr as defined in (2.2)) is computed for both models and the difference is used for selection. We propose instead another class of criteria that aims at directly measuring a prediction divergence between the two models. More formally, any criterion that compares the out-of-sample prediction computed in the smaller model, $\hat{Y}_j^0$, with the in-sample prediction in the larger model, $\hat{Y}_k$, quantified by the Bregman divergence $D(\cdot,\cdot)$ (based on $\psi(\cdot)$), belongs to this class, i.e.

$$\mathrm{PDC}_{j,k} = E\left[ E_0\left[ D\left( \hat{Y}_j^0, \hat{Y}_k \right) \mid Y \right] \right] \tag{2.12}$$

where the expectation is multidimensional. If the smaller model is not correct, the additional elements in the larger model create differences in the predictions and therefore should be accounted for. Intuitively, consider a "suitable" sequence of nested models with increasing complexity such that model $\mathcal{M}_j$ is nested in model $\mathcal{M}_{j+1}$, and suppose we have a consistent estimator of $\mathrm{PDC}_{j,j+1}$, say $\mathrm{PDCE}_{j,j+1}$ (see (2.13)). This estimator is then expected to be minimal when $j = j_0 < K$, where $K$ is the number of potential sequentially nested models and $\mathcal{M}_{j_0}$ denotes the correct (or closest to the correct) underlying model. Indeed, while $j < j_0$, we expect $\mathrm{PDCE}_{j,j+1}$ to be relatively large since model $\mathcal{M}_j$ is missing some elements of the correct model $\mathcal{M}_{j_0}$ which are included in model $\mathcal{M}_{j+1}$. This is also true with $\mathrm{PDCE}_{j,j+m}$, $j < j + m \leq K$, or $\mathrm{PDCE}_{j-m,j}$, $1 < j - m < j$. On the other hand, if $j \geq j_0$, $\mathrm{PDCE}_{j,j+1}$ (or indeed $\mathrm{PDCE}_{j,j+m}$, $m > 0$) is relatively small compared to when $j < j_0$ because both models include the correct one. Among all models $j \geq j_0$, $\mathrm{PDCE}_{j,j+m}$ should be minimised at $j = j_0$ and $m = 1$ since $\mathrm{PDCE}_{j_0,j_0+1}$ compares the prediction of the correct model with the least overfitted one.

In the case of the linear regression model, we derive in Section 2.4 the (asymptotic) properties of the PDCE based on the squared loss function. In particular, we show in Theorem 2.2 that under the setting previously defined and for sufficiently large sample size $n$ we have that $E[\mathrm{PDCE}_{j_0,j_0+1}] \leq E[\mathrm{PDCE}_{j,j+m}]$ for $j$ and $m$ such that $0 < j < K$, $m > 0$ and $j + m \leq K + 1$. We also have that $E[\mathrm{PDCE}_{j_0,j_0+1}] = E[\mathrm{PDCE}_{j,j+m}]$ if and only if $j = j_0$ and $m = 1$. This confirms the intuitive explanation given above.

Clearly, the discrepancy $D(\hat{Y}_j^0, \hat{Y}_k)$ belongs to the $d$-class of error measures but not to the $q$-class. Therefore, by using Theorem 2.1 as in (2.9) we obtain the following consistent estimator of the $\mathrm{PDC}_{j,k}$:

$$\widehat{\mathrm{PDC}}_{j,k} = \mathrm{PDCE}_{j,k} = D\left( \hat{y}_j, \hat{y}_k \right) + \mathrm{tr}\left\{ \widehat{\mathrm{cov}}\left[ \hat{y}_j, \nabla\psi(\hat{y}_k) \right] \right\}. \tag{2.13}$$

Although the above estimator can be computed analytically or using resampling methods for any valid Bregman divergence, we shall only consider here the squared loss function. The properties and performance of other divergences are left for further research. Therefore, (2.13) can be further simplified as

$$\mathrm{PDCE}_{j,k} = \|\hat{y}_j - \hat{y}_k\|_2^2 + 2 \, \mathrm{tr}\left[ \widehat{\mathrm{cov}}\left( \hat{y}_j, \hat{y}_k \right) \right]. \tag{2.14}$$

For notational simplicity, we will not make a distinction between the PDCE defined in (2.13) and in (2.14), but in the rest of the text the term PDCE refers to (2.14) since the squared loss function is the only Bregman divergence considered in this chapter. In some sense, the PDCE defined in (2.14) is the equivalent of Mallows' $C_p$ in the PDC class since both criteria are based on the same loss function. It will thus be of particular interest to compare (2.14) with the $C_p$ to understand the differences between the PDC approach and the classical model selection approach (see (2.10)).

Many authors (see e.g. Bhansali and Downham, 1977) have examined the penalty function of the AIC (and of other criteria) and defined, for example, the AIC$\alpha$ in which the term 2 (see Table 2.3) of the conventional AIC is replaced by $\alpha$. We follow this strategy and define the modified PDCE as

$$\mathrm{PDCE}^{\lambda_n}_{j,k} = \|\hat{y}_j - \hat{y}_k\|_2^2 + \lambda_n \, \mathrm{tr}\left[ \widehat{\mathrm{cov}}\left( \hat{y}_j, \hat{y}_k \right) \right] \tag{2.15}$$

where $\lambda_n$ is a constant depending possibly on the sample size $n$.

Hence, assuming that there exist $K$ competing nested models (and that the largest model is not the correct one) for describing the behaviour of $Y$, we propose to choose the model $\mathcal{M}_{\hat{j}^{\lambda_n}}$ satisfying

$$\hat{j}^{\lambda_n} = \underset{j = 1, \ldots, K-1}{\operatorname{argmin}} \; \mathrm{PDCE}^{\lambda_n}_{j,j+1}. \tag{2.16}$$

If a clear sequence of competing nested models does not exist, one can build one prior to applying the selection rule (2.16). This will be explained when treating the linear regression model in Section 2.4.2 (in particular see iterative rule (2.62)).

2.4 Linear Regression Models

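We consider in this section the usual linear regression model with Gaussian errors

$$Y = X\beta + \varepsilon, \quad \varepsilon \sim \mathcal{N}\left( 0, \sigma_\varepsilon^2 I \right)$$

where $X$ is a known $n \times p$ design matrix of rank $p$ and $\beta$ is a $p \times 1$ vector of unknown parameters. We assume that the method of LS is used to fit a model to the data and thus we consider the usual ordinary LS parameter estimates of $\beta$, i.e.

$$\hat{\beta} = \left( X^T X \right)^{-1} X^T y \tag{2.17}$$

leading to the (linear) prediction

$$\hat{y} = X\hat{\beta} = Sy$$

where $S = X(X^T X)^{-1} X^T$ denotes the "hat" matrix. Since the errors are Gaussian, $\hat{\beta}$ is also the MLE of $\beta$. The unbiased and maximum likelihood estimates of $\sigma_\varepsilon^2$ are, respectively, given by

$$\tilde{\sigma}_\varepsilon^2 = \frac{\|y - \hat{y}\|_2^2}{n - p} \quad \text{and} \quad \hat{\sigma}_\varepsilon^2 = \frac{\|y - \hat{y}\|_2^2}{n}. \tag{2.18}$$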

Throughout this chapter we assume that 0< σ_ε²<∞. In Table 2.3 we provide some of the most commonly used criteria for model selection in linear regression models.
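The estimates (2.17) and (2.18) can be computed in a few lines. The following is a minimal sketch, where the function name `ls_fit` and the toy design are assumptions introduced here; it solves the normal equations directly, which is adequate for a small full-rank design but not necessarily the most stable choice numerically.

```python
import numpy as np


def ls_fit(X, y):
    """Ordinary LS fit: returns beta_hat, fitted values, and both variance estimates."""
    n, p = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X^T X)^{-1} X^T y, X of rank p
    y_hat = X @ beta_hat                           # equals S y
    rss = float(np.sum((y - y_hat) ** 2))
    return beta_hat, y_hat, rss / (n - p), rss / n  # unbiased and ML estimates of sigma^2


# Toy design with an intercept and one covariate (illustrative values).
X_toy = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y_toy = np.array([1.0, 3.0, 2.0, 5.0])
beta_hat, y_hat, s2_unbiased, s2_ml = ls_fit(X_toy, y_toy)
print(beta_hat, s2_unbiased, s2_ml)
```

The two variance estimates differ only by the factor $(n - p)/n$, so the ML estimate is always the smaller of the two.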

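The $\mathrm{PDCE}^{\lambda_n}_{j,k}$ defined in (2.15) can be simplified for two linear nested candidate models. Indeed, let model $\mathcal{M}_j$ be nested within model $\mathcal{M}_k$ so that $\dim(\beta_j) = j < \dim(\beta_k) = k$, where $\beta_j$ and $\beta_k$ denote, respectively, the vectors of unknown parameters associated with models $\mathcal{M}_j$ and $\mathcal{M}_k$. Then, the $\mathrm{PDC}_{j,k}$ based on the squared loss function defined by $\psi(z) = z^T z$ is equal to

$$\mathrm{PDC}_{j,k} = \|\hat{Y}_j - \hat{Y}_k\|_2^2 + 2\sigma_\varepsilon^2 \, \mathrm{tr}\left( S_j S_k \right) = \|\hat{Y}_j - \hat{Y}_k\|_2^2 + 2\sigma_\varepsilon^2 j$$

where $S_j$ and $S_k$ denote, respectively, the hat matrices of models $\mathcal{M}_j$ and $\mathcal{M}_k$. The second equality holds because, for nested models, $S_k S_j = S_j$ and hence $\mathrm{tr}(S_j S_k) = \mathrm{tr}(S_j) = j$. Therefore, we obtain for $\mathrm{PDCE}^{\lambda_n}_{j,k}$:

$$\mathrm{PDCE}^{\lambda_n}_{j,k} = \|\hat{y}_j - \hat{y}_k\|_2^2 + \lambda_n \sigma_\varepsilon^2 j. \tag{2.19}$$

When the value of $\sigma_\varepsilon^2$ is unknown, one may replace $\sigma_\varepsilon^2$ by a consistent estimator, say $\hat{\sigma}_\varepsilon^2$, like the LS estimator at the full model. As already mentioned, we show in Theorem 2.2 (below, and based on Lemma 2.1) that the $\mathrm{PDCE}^{\lambda_n}_{j,j+m}$ is expected, for sufficiently large sample size, to reach its smallest value for $j = j_0$ (the index of the correct model) and $m = 1$. This motivates the PDCE selection rule defined in (2.16).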
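Selection rule (2.16) combined with the linear-model form (2.19) can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the thesis: the nested sequence is taken to be the first $j$ columns of a given design matrix, $\sigma_\varepsilon^2$ is replaced by the unbiased LS estimate at the full model, and the function names (`fitted_values`, `pdce_path`, `select_model`) and the toy data are introduced here.

```python
import numpy as np


def fitted_values(X, y):
    """Least-squares fitted values S y for design matrix X."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta_hat


def pdce_path(X, y, lam=2.0):
    """PDCE_{j,j+1} for j = 1, ..., K-1, model j using the first j columns of X."""
    n, K = X.shape
    fits = [fitted_values(X[:, :j], y) for j in range(1, K + 1)]
    sigma2 = float(np.sum((y - fits[-1]) ** 2)) / (n - K)   # full-model estimate
    return {
        j: float(np.sum((fits[j - 1] - fits[j]) ** 2)) + lam * sigma2 * j
        for j in range(1, K)
    }


def select_model(X, y, lam=2.0):
    """Index j minimising PDCE_{j,j+1}, as in rule (2.16)."""
    path = pdce_path(X, y, lam)
    return min(path, key=path.get)


# Toy example with orthonormal columns so the values can be checked by hand:
# the first three coefficients are large, the fourth is zero.
X_toy = np.eye(6)[:, :4]
y_toy = np.array([5.0, 5.0, 5.0, 0.0, 1.0, -1.0])
path = pdce_path(X_toy, y_toy, lam=2.0)
j_hat = select_model(X_toy, y_toy, lam=2.0)
print(path, j_hat)
```

In this toy case the full-model variance estimate is 1, the divergence between consecutive fits is 25 for $j = 1, 2$ and 0 for $j = 3$, so the rule selects the three-variable model. With a general design the order of the columns defines the nested sequence, which is why the sequence has to be built before the rule is applied.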
