Confidence sets for model selection


Thesis

Reference

Confidence sets for model selection

HANNAY, Mark

Abstract

At first glance the goals of model selection might seem clear. Out of a set of possible models, we want to select the "best" model or a subset of "best" models. This notion of "best", however, is not well defined, since it depends on the initial goals of the selection. In order to study the uncertainty in model selection, we introduce a new definition of a model, where models are no longer defined through zero and non-zero components but through irrelevant and relevant components. Then, inspired by confidence intervals for estimated parameters, we propose a method to build confidence sets for model selection in a parametric setting, i.e. to create sets of models within which the true model is included with a certain confidence. This allows us to perform inference on model selection. We discuss the computational challenges of such a method, how to find p-values (for the model) and consistency in model selection, and we show the implications of this new method through a data set and a simulation study.

HANNAY, Mark. Confidence sets for model selection. Thèse de doctorat : Univ. Genève, 2016, no. GSEM 31

URN : urn:nbn:ch:unige-864673

DOI : 10.13097/archive-ouverte/unige:86467

Available at:

http://archive-ouverte.unige.ch/unige:86467



Confidence sets for model selection

by

Mark Hannay

A thesis submitted to the

Geneva School of Economics and Management, University of Geneva, Switzerland,

in fulfillment of the requirements for the degree of PhD in Statistics

Members of the thesis committee:

Prof. Eva Cantoni, Chair, University of Geneva

Prof. Elvezio Ronchetti, Adviser, University of Geneva

Prof. Werner Stahel, Eidgenössische Technische Hochschule Zürich

Thesis No. 31

August 2016


Acknowledgements

First of all, I would like to express my gratitude to my adviser Prof. Elvezio Ronchetti.

Over these last 3 years, he has always had time for me and has consistently supported me in my endeavours, for which I am immensely thankful.

Secondly, I would like to thank the chair of my thesis committee Prof. Eva Cantoni, for her scrupulous study of the manuscript and her insightful comments on the dissertation.

During my time at the University of Geneva, I have had the good fortune of having my office next to hers and have therefore also benefited from her encouraging and positive attitude.

My sincere thanks also go to Prof. Werner Stahel, who was not only another valuable member of the thesis committee but is also someone I consider a dear friend. His constant questioning not only improved the text but also forced me to rethink many passages and ultimately allowed me to make surprising links between different fields.

Last but not least, I would like to thank my wife Digisha Hannay for putting up with all those long nights of me typing away on the laptop and for all those moments of me with my head in the clouds. You have been incredibly supportive and caring, which are only 2 of your many beautiful attributes.


Abstract

At first glance the goals of model selection might seem clear. Out of a set of possible models, we want to select the "best" model or a subset of "best" models. This notion of "best", however, is not well defined, since it depends on the initial goals of the selection.

In order to study the uncertainty in model selection, we introduce a new definition of a model, where models are no longer defined through zero and non-zero components but through irrelevant and relevant components. Then, inspired by confidence intervals for estimated parameters, we propose a method to build confidence sets for model selection in a parametric setting, i.e. to create sets of models within which the true model is included with a certain confidence. This allows us to perform inference on model selection.

We discuss the computational challenges of such a method, how to find p-values (for the model) and consistency in model selection, and we show the implications of this new method through a data set and a simulation study.


Résumé

At first glance, the goal of model selection is clear. Among a set of possible models, we want to choose the "best" model or a subset of "best" models.

This notion of a "best" model is not precise, since it depends on the initial goals of the selection.

To study the variability of model selection, we introduce a new definition of a model. With the help of this new definition, it is possible to create a confidence set for model selection in the parametric setting, i.e. we can create a set of models that includes the true model with a certain confidence. This can be interpreted as an extension of the notion of a confidence interval. With this set it is then possible to perform statistical inference on model selection.

In addition, we propose an algorithm for this new method, which allows us to find p-values (for a model), and we show the consistency of the model selection. Finally, using a data set and a simulation study, we show the implications of this new method.


Introduction

Model selection is the procedure of selecting the "best" model amongst a set of possible models based on a given data set. There are many reasons for wishing to do this. For instance, one may want to reduce the complexity of the model, or one may want to select a less complex model, which can lead to better predictions of future observations. Our main interest will be to select the true model, for purposes of inference.

We restrict our work to model selection in a parametric setting, which gains from the added structure with respect to the more general framework, while still encompassing many important cases. More specifically, in a parametric setting, selecting a model usually entails finding the zero components of the parameter of interest, which we denote by $\theta$, based on a data set with $n$ observations, i.e. $\{X_i\}_{i=1}^{n}$. We first introduce the notion of a model in this particular case, then briefly discuss some of the most common methods of model selection, and finally mention some of the existing literature on uncertainty in model selection. After this short introduction to the subject, we provide an outline.

Parametric models

Throughout, we assume the parameter of interest θ to have p components. A model can then be represented by a set A ⊂ {1, ..., p}, which represents the zero components of the parameter of interest. Thus, for A ⊂ {1, ..., p}, we first define a naive type of model:

$$M_A : \theta_{[A]} = 0_{[A]},$$

where $0 \in \mathbb{R}^p$ is the zero vector. Because of the hierarchy in this structure, if $M_A$ is a correct model then so is $M_{\tilde{A}}$ for any $\tilde{A} \subset A$. Thus, in this context, selecting the true models would require first finding the biggest $A$ such that $M_A$ is true, and then providing the set of all models $M_{\tilde{A}}$ with $\tilde{A} \subset A$. This selection is redundant, since it relies on a single model only, which is why we propose to work instead with the following definition of a model for a fixed $A$:

$$M_A : \theta_{[A]} = 0_{[A]} \ \text{ and } \ \theta_{[i]} \neq 0 \ \text{ for all } i \in A^c. \tag{1}$$

This definition implies that exactly one model, among all the possible models, can be correct, and so selecting the true model is now well defined.
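To illustrate the difference between the naive definition and Equation (1), the following minimal Python sketch enumerates all index sets $A \subset \{1, \dots, p\}$ for a small parameter vector and checks which models hold under each definition. The function names and the example vector `theta` are illustrative assumptions and do not appear in the thesis.

```python
from itertools import combinations

def all_index_sets(p):
    """All subsets A of {1, ..., p}, each encoded as a frozenset."""
    indices = range(1, p + 1)
    return [frozenset(c) for r in range(p + 1) for c in combinations(indices, r)]

def naive_model_holds(theta, A):
    """Naive model M_A: theta[i] == 0 for every i in A, no constraint elsewhere."""
    return all(theta[i - 1] == 0 for i in A)

def exclusive_model_holds(theta, A):
    """Model of Equation (1): theta[i] == 0 exactly for the indices i in A."""
    return all((theta[i - 1] == 0) == (i in A) for i in range(1, len(theta) + 1))

# Hypothetical parameter with p = 3; components 1 and 3 are zero.
theta = [0.0, 1.5, 0.0]
models = all_index_sets(len(theta))
naive_true = [A for A in models if naive_model_holds(theta, A)]
exclusive_true = [A for A in models if exclusive_model_holds(theta, A)]

print("naive models that hold:    ", sorted(map(sorted, naive_true)))
print("exclusive models that hold:", sorted(map(sorted, exclusive_true)))
```

Under the naive definition every subset of the true zero set yields a correct model, whereas under Equation (1) only the full zero set does.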

Common methods for model selection

Most of the common methods for model selection are based on the log-likelihood function. Therefore, we begin by introducing this notion. Let $f_\theta$ be a density function for observations $X_i$, which we assume to be independent and identically distributed (i.i.d.).

This leads to the log-likelihood function, defined as:

$$l(\theta) = \log\left(\prod_{i=1}^{n} f_\theta(X_i)\right) = \sum_{i=1}^{n} \log\left(f_\theta(X_i)\right).$$

This can be used to define estimators for each model, by maximizing the likelihood constrained to feasible points within a given model. More specifically, for a model $M_A$ we define:

$$\hat{\theta}_{M_A} = \arg\max_{\theta_{[A]} = 0_{[A]}} l(\theta).$$

One common method for model selection is based on testing between nested models.

More specifically, for $A_1 \subset A_2$, $M_{A_2}$ is said to be nested in $M_{A_1}$. Moreover, for $k = |A_2| - |A_1|$ and under mild conditions, if $M_{A_2}$ holds we asymptotically have:

$$2l(\hat{\theta}_{M_{A_1}}) - 2l(\hat{\theta}_{M_{A_2}}) \sim \chi^2_k.$$

Consequently, one can test between nested models. This allows for stepwise procedures, where one starts either at the full or at the null model and tests backwards or forwards, respectively, until a stopping rule is satisfied. See Bendel and Afifi (1977) for details on the stopping rule.

The stepwise procedure has a major drawback, namely that we can only compare nested models along the testing path. Information theory offers an elegant response to this problem. To each model MA we can associate a value:

$$-2l(\hat{\theta}_{M_A}) + k(n, |A|).$$

The idea is that $k(n, |A|)$ should penalize bigger models. Since a larger $A$ means more zero components and hence a smaller model, $k(n, |A|)$ is monotone decreasing in $|A|$. One can now even compare non-nested models. The two main information criteria are the AIC criterion (Akaike, 1974) and the BIC criterion (Schwarz, 1978).

More recently, regularization methods have been proposed for model selection. These methods benefit from convexity and do not require computing values for each model. For instance, Tibshirani (1996) defined the lasso estimator as:

$$\hat{\theta} = \arg\min_{\theta} \; -2l(\theta) + \lambda \sum_{i=1}^{p} |\theta_{[i]}|, \quad \text{where } \lambda \geq 0.$$

It has been well established by Bühlmann and Van De Geer (2011) that, under mild conditions on the sparsity of $\theta$ and on the covariance matrix, selecting the non-zero components of the lasso estimator for specific $\lambda$ performs well in variable selection.

However, the lasso is generally not consistent in model selection, prompting Zou (2006) to penalize the components differently, which leads to the adaptive lasso. This type of penalization method is then consistent in model selection.

Further common methods are model selection through cross validation (Shao, 1993; Zhang, 1993) and Bayesian model selection (Mitchell and Beauchamp, 1988; Chipman et al., 2001). The cross validation methods have asymptotic properties similar to those of information criteria, while the Bayesian method is based on model priors and parameter priors, with properties depending on the chosen priors. We will introduce Bayesian model selection more thoroughly in Chapter 1.


Uncertainty in model selection

The previous methods are commonly used as "point estimates" of models, which on their own do not account for the uncertainty in model selection.

The cross validation methods can produce some notion of uncertainty in model selection, since for each model one has a standard error of the estimated error, as is done by Cantoni et al. (2007). However, the theory of uncertainty in model selection using these techniques has not been fully studied yet.

As previously mentioned, under some mild conditions, the Lasso can be used to build a set of variables, within which the true non zero components are included with a certain probability. Unfortunately, generally this set will also include zero components, even asymptotically. The adaptive lasso, which is consistent in model selection, does not solve this problem, since no notion of uncertainty for the adaptive step has been derived.

Bayesian methods offer the most striking response to the quantification of uncertainty in model selection by providing credibility sets. These sets though rely on priors, which can put many practitioners at a certain unease. Once again, we shall investigate these properties in further detail in Chapter 1.

Fortunately, if we impose a threshold for the true non-zero components, we can recover the exact set of non-zero variables asymptotically by many methods, including the Lasso or the Zcut method described by Ishwaran and Rao (2005). Both methods provide lists of variables with statistical guarantees. Unfortunately, they do not provide lists of models, which is what would characterize uncertainty in model selection. To that end, Hansen et al. (2011) introduced model confidence sets, which include the "best" model with a given confidence. More recently, Ferrari and Yang (2015) introduced a method of selecting a set of models in the linear model based on F-tests. Yet, mostly because of the definition of models used in these papers, their sets of models do not asymptotically collapse to the set containing only the true model.

Outline

We show in Chapter 1 that the definition of a model used in (1.4), although widely used, poses a dilemma. This dilemma is caused by the focus on finding the exact zero components. We argue that we should instead focus on relevance, which leads to a new definition of a model. Furthermore, following the ideas of Hansen et al. (2011), we introduce confidence sets, which allow us to quantify the uncertainty in the selection, similarly to confidence intervals for estimation.

The new method we propose for model selection is based on testing theory. Therefore, in Chapter 2 we focus on testing theory in a general framework. We provide a method, based on the squared distance to the model of interest, to decide whether to reject or accept a model. Before discussing model selection, we then describe some interesting applications of this new testing method built on squared distances between models and apply it to a real data example.

In Chapter 3, we extend the confidence sets in the univariate case to the general case.

This extension is based on aggregating all the individual decisions on accepting or rejecting models from Chapter 2. Furthermore, we supply multiple examples of possible models, which are simple extensions of existing notions of models and can be dealt with efficiently as long as $p$, the dimension of the parameter of interest, is not too big. After first providing a simulation study to better understand the small sample properties of the new method, we use our newly developed method on two already widely studied data sets, yielding new interpretations.

Finally in Chapter 4, we focus on the computational methods used for testing and recovering the critical values. This particular chapter also adds some insight into how our proposed method reduces to well known methods in the standard cases. Most of the proposed computational methods are motivated by methods used in the growing field of regularization and rely heavily on convexity.


Chapter 1 Univariate model selection

In order to motivate our work, we start with the simplest model selection problem, which consists in choosing between two models defined by a single parameter. Let $M_0 : \theta = 0$ and $M_1 : \theta \neq 0$, so that $M_0$ and $M_1$ are different models conforming to the definition of a model in Equation (1), and let $X_i \sim N(\theta, \sigma^2)$ i.i.d. The goal is to determine which model is correct based on the data $\{X_i\}_{i=1}^{n}$.

We first present a frequentist approach and then a Bayesian approach to this problem.

We show there are conceptual problems in both approaches leading to new definitions of models, based on relevance, irrelevance and signed relevance. By using extensions of the existing methods, we can study the decision problem related to these new models. In the last part, we propose a method to quantify the uncertainty in model selection for this particular framework.

Throughout this chapter, the statistical decisions are made with respect to the maximum likelihood estimator, $\hat{\theta}_n$. By the distributional assumptions we have $\hat{\theta}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \sim N(\theta, \sigma^2/n)$.

1.1 Likelihood approach

1.1.1 Log-likelihood

For given data $\{X_i\}_{i=1}^{n}$, under the assumption of normality and independence, the log-likelihood function for known $\sigma$ is equal to:

$$l_n(\theta) = \log\left(\prod_{i=1}^{n} f_\theta(X_i)\right) = -n\log\left(\sqrt{2\pi}\,\sigma\right) - \sum_{i=1}^{n} \frac{1}{2\sigma^2}(X_i - \theta)^2,$$

where $f_\theta(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\theta)^2}{2\sigma^2}\right)$ is the density function.

By definition $\hat{\theta}_n$ maximizes $l_n(\cdot)$ over $\mathbb{R}$ and we have $\hat{\theta}_n \neq 0$ with probability 1.

This implies that $\hat{\theta}_n$ is the maximum likelihood estimator over all possible values in $M_1$ with probability 1. On the other hand, since 0 is the only value in $M_0$, 0 is the maximum likelihood estimator over all possible values in $M_0$.

Furthermore, we have $l_n(0) < l_n(\hat{\theta}_n)$ with probability 1. This means that if we were to pick the model with the highest likelihood, we would always pick $M_1$, even in the case where $M_0$ is the true model.

Information theory provides an elegant solution to this problem by focusing on the study of penalized likelihoods. More specifically, there exist penalties $k_{0;n}$ and $k_{1;n}$ such that we associate the values $-2l_n(0) + k_{0;n}$ and $-2l_n(\hat{\theta}_n) + k_{1;n}$ to the models $M_0$ and $M_1$ respectively. We then select $M_0$ if $-2l_n(0) + k_{0;n} \leq -2l_n(\hat{\theta}_n) + k_{1;n}$, otherwise we select $M_1$. In fact, if $k_{0;n} < k_{1;n}$, there is a strictly positive probability of selecting $M_0$. The most common choices for the penalties are $k_{i;n} = 2i$, the AIC criterion (Akaike, 1974), and $k_{i;n} = \log(n)\,i$, the BIC criterion (Schwarz, 1978).

1.1.2 The dilemma of the zero component

It is well known that the AIC criterion isn’t consistent in model selection, while the BIC criterion is. We revisit these known facts in our example and present a dilemma one is faced with when using these methods or indeed any other method.

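If one uses the AIC criterion to select the model, it can easily be shown that one selects $M_0$ if and only if $\hat{\theta}_n^2 \leq \frac{2}{n}\sigma^2$. If one uses the BIC criterion to select the model, one selects $M_0$ if and only if $\hat{\theta}_n^2 \leq \frac{\log(n)}{n}\sigma^2$. This means that for both criteria, the decision depends solely on $\hat{\theta}_n$. Therefore we define the decision functions $M_{AIC}, M_{BIC} : \mathbb{R} \to \{M_0, M_1\}$ as:

$$M_{AIC}(x) = \begin{cases} M_0 & \text{if } x^2 \leq \frac{2}{n}\sigma^2 \\ M_1 & \text{otherwise} \end{cases}, \qquad M_{BIC}(x) = \begin{cases} M_0 & \text{if } x^2 \leq \frac{\log(n)}{n}\sigma^2 \\ M_1 & \text{otherwise} \end{cases}.$$

Let $P_{\tilde{\theta}}(A)$ be the probability of $A$ under the assumption that $\theta = \tilde{\theta}$. We now use this notation to study the probabilities of selecting $M_0$ in different cases. In fact it follows from the previous observations that for any $\theta \in \mathbb{R}$:

$$P_\theta\left(M_{AIC}(\hat{\theta}_n) = M_0\right) = P\left(\left(Z + \frac{\sqrt{n}}{\sigma}\theta\right)^2 \leq 2\right),$$

$$P_\theta\left(M_{BIC}(\hat{\theta}_n) = M_0\right) = P\left(\left(Z + \frac{\sqrt{n}}{\sigma}\theta\right)^2 \leq \log(n)\right),$$

where $Z \sim N(0,1)$.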

First we assume $M_0$ to be true, i.e. $\theta = 0$. The probability of selecting the wrong model with the AIC criterion, i.e. $P_\theta\left(M_{AIC}(\hat{\theta}_n) = M_1\right)$, is equal to $P(Z^2 > 2) \approx 0.157$, which is strictly positive. Therefore, in this case, the AIC criterion is inconsistent, since even asymptotically we have a non-negligible probability of selecting the wrong model. On the other hand, the probability of selecting the wrong model with the BIC criterion, i.e. $P_\theta\left(M_{BIC}(\hat{\theta}_n) = M_1\right)$, is equal to $P(Z^2 > \log(n))$. This probability tends to 0 as $n \to \infty$, which implies that $M_0$ is selected with probability tending to 1. One can therefore argue that, in the case $M_0$ is true, the BIC criterion may be preferable to the AIC criterion, since it is asymptotically consistent while the AIC criterion is not.

We now assume $M_1$ to be true. The worst case probability of selecting the wrong model with the AIC criterion, i.e. $\sup_{\theta \neq 0} P_\theta\left(M_{AIC}(\hat{\theta}_n) = M_0\right)$, is equal to $P(Z^2 \leq 2) \approx 0.843$, which is strictly positive. The worst case probability of selecting the wrong model is thus large. The BIC criterion performs even worse in this case, since the worst case probability of selecting the wrong model, $\sup_{\theta \neq 0} P_\theta\left(M_{BIC}(\hat{\theta}_n) = M_0\right)$, is equal to $P(Z^2 \leq \log(n))$. Thus, the worst case probability of making the wrong decision tends to 1 as $n \to \infty$.
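The error probabilities quoted above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names are illustrative assumptions, and the supremum over $\theta \neq 0$ is approximated by evaluating at a very small non-zero $\theta$.

```python
import math

def std_normal_cdf(z):
    """Standard normal distribution function, computed from math.erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_select_m0(theta, n, sigma, threshold):
    """P((Z + sqrt(n) * theta / sigma)^2 <= threshold) for Z ~ N(0, 1)."""
    shift = math.sqrt(n) * theta / sigma
    root = math.sqrt(threshold)
    return std_normal_cdf(root - shift) - std_normal_cdf(-root - shift)

def aic_prob_m0(theta, n, sigma=1.0):
    """Probability that the AIC criterion selects M_0 (threshold 2)."""
    return prob_select_m0(theta, n, sigma, 2.0)

def bic_prob_m0(theta, n, sigma=1.0):
    """Probability that the BIC criterion selects M_0 (threshold log(n))."""
    return prob_select_m0(theta, n, sigma, math.log(n))

for n in (10, 100, 10_000, 1_000_000):
    # Error under M_0 (theta = 0): probability of selecting M_1.
    aic_err_m0 = 1.0 - aic_prob_m0(0.0, n)
    bic_err_m0 = 1.0 - bic_prob_m0(0.0, n)
    # Near-worst-case error under M_1: theta tiny but non-zero.
    aic_err_m1 = aic_prob_m0(1e-9, n)
    bic_err_m1 = bic_prob_m0(1e-9, n)
    print(f"n={n:>9}: AIC {aic_err_m0:.3f}/{aic_err_m1:.3f}  "
          f"BIC {bic_err_m0:.3f}/{bic_err_m1:.3f}")
```

Each printed pair shows the error under $M_0$ followed by the near-worst-case error under $M_1$, so the opposite behaviour of the two criteria as $n$ grows can be read off directly.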

This leads to a dilemma: even though the BIC criterion is consistent for a fixed $\theta$, the probability of making the wrong decision can be arbitrarily close to 1, whereas the AIC criterion is not consistent in model selection, but at least its probability of selecting the wrong model is bounded away from 1. We conclude that it is impossible, based on $\hat{\theta}_n$, to be uniformly consistent in model selection in the case $M_1$ is true while also being consistent in the case $M_0$ is true. This is because the difference between the distributions of $\hat{\theta}_n$ under $M_0$ and under $M_1$ can be arbitrarily small.

1.1.3 Relevance

The problem in the previous subsection comes from the fact that, if $|\theta|$ is very small, one would need a huge number of observations to detect a signal. This remains true for other notable model selection techniques, such as the Lasso (Tibshirani, 1996; Zou, 2006; Bühlmann and Van De Geer, 2011), cross validation methods (Shao, 1993; Zhang, 1993) or even testing theory (Lehmann and Romano, 2006).

All is not doom and despair though, since in many applications one might not be interested in showing that $\theta = 0$. Before even testing this hypothesis, one could imagine that the practitioner knows this hypothesis to be false or at least highly unlikely, as was noted by Jones and Tukey (2000). Furthermore, many practitioners are not strictly interested in significance testing but in discovering practical significance, which we denote as relevance.

For instance, in bioequivalence testing (Schuirmann, 1987), one is interested in testing the hypothesis $|\theta| \leq \delta$. This leads to the following new definitions of models, where for some fixed $\delta > 0$ we define:

$$M_0^{(\delta)} : \theta \in [-\delta, \delta], \qquad M_1^{(\delta)} : \theta \in [-\delta, \delta]^c. \tag{1.1}$$

The model $M_0^{(\delta)}$ represents the case where $\theta$ may not be zero but is irrelevant, while $M_1^{(\delta)}$ represents the case where $\theta$ is relevant. We believe this definition not only helps develop the theory, as we will see below, but also makes practical sense.

1.1.4 Maximum likelihood for relevance

Due to the definitions of $M_0^{(\delta)}$ and $M_1^{(\delta)}$ in Equations (1.1), the likelihood is no longer necessarily maximized for values within the domain of $M_1^{(\delta)}$. This means that a simple maximum likelihood method can be used to select the model. In fact, clearly

$$\sup_{|\theta| \leq \delta} l_n(\theta) \geq \sup_{|\theta| \geq \delta} l_n(\theta) \iff \hat{\theta}_n \in [-\delta, \delta].$$

Therefore, we can define the decision function based on the maximum likelihood, which once again depends entirely on $\hat{\theta}_n$. We denote this function as $M_{Ml} : \mathbb{R} \to \left\{M_0^{(\delta)}, M_1^{(\delta)}\right\}$:

$$M_{Ml}(x) = \begin{cases} M_0^{(\delta)} & \text{if } |x| \leq \delta \\ M_1^{(\delta)} & \text{otherwise} \end{cases}.$$

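We now study the probabilities of selecting $M_0^{(\delta)}$ and $M_1^{(\delta)}$ as functions of $\theta$. In fact it follows directly from the definition of $M_{Ml}$ that for any $\theta \in \mathbb{R}$:

$$P_\theta\left(M_{Ml}(\hat{\theta}_n) = M_0^{(\delta)}\right) = P_\theta\left(\hat{\theta}_n \in [-\delta, \delta]\right) = P\left(-\delta \leq \theta + \frac{\sigma}{\sqrt{n}} Z \leq \delta\right), \tag{1.2}$$

$$P_\theta\left(M_{Ml}(\hat{\theta}_n) = M_1^{(\delta)}\right) = P_\theta\left(\hat{\theta}_n \notin [-\delta, \delta]\right) = 1 - P\left(-\delta \leq \theta + \frac{\sigma}{\sqrt{n}} Z \leq \delta\right),$$

where $Z \sim N(0,1)$.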
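As a sanity check of Equation (1.2), the following minimal Python sketch compares the closed-form probability of selecting the irrelevance model with a small Monte Carlo simulation of the sample mean. The parameter values, the number of repetitions and the function names are illustrative assumptions.

```python
import math
import random

def std_normal_cdf(z):
    """Standard normal distribution function, computed from math.erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_select_irrelevant(theta, n, sigma, delta):
    """Equation (1.2): P(-delta <= theta + sigma / sqrt(n) * Z <= delta)."""
    scale = sigma / math.sqrt(n)
    upper = std_normal_cdf((delta - theta) / scale)
    lower = std_normal_cdf((-delta - theta) / scale)
    return upper - lower

def simulate_select_irrelevant(theta, n, sigma, delta, repetitions=4000, seed=0):
    """Fraction of simulated samples whose mean falls in [-delta, delta]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repetitions):
        mean = sum(rng.gauss(theta, sigma) for _ in range(n)) / n
        if abs(mean) <= delta:
            hits += 1
    return hits / repetitions

exact = prob_select_irrelevant(theta=0.3, n=25, sigma=1.0, delta=0.5)
approx = simulate_select_irrelevant(theta=0.3, n=25, sigma=1.0, delta=0.5)
print(f"closed form: {exact:.4f}, simulated: {approx:.4f}")
```

The two values should agree up to Monte Carlo error, which shrinks as `repetitions` grows.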

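Thus, $P_\theta\left(M_{Ml}(\hat{\theta}_n) = M_0^{(\delta)}\right)$ is a strictly decreasing function of $|\theta|$, whereas $P_\theta\left(M_{Ml}(\hat{\theta}_n) = M_1^{(\delta)}\right)$ is strictly increasing in $|\theta|$. This implies that we recover the following expressions for the worst case covering probabilities:

$$\inf_{\theta \in [-\delta, \delta]} P_\theta\left(M_{Ml}(\hat{\theta}_n) = M_0^{(\delta)}\right) = P\left(-\delta \leq \delta + \frac{\sigma}{\sqrt{n}} Z \leq \delta\right),$$

$$\inf_{\theta \in [-\delta, \delta]^c} P_\theta\left(M_{Ml}(\hat{\theta}_n) = M_1^{(\delta)}\right) = 1 - P\left(-\delta \leq \delta + \frac{\sigma}{\sqrt{n}} Z \leq \delta\right).$$

By using the fact that $P\left(-\delta \leq \delta + \frac{\sigma}{\sqrt{n}} Z \leq \delta\right) = P\left(-2\delta\frac{\sqrt{n}}{\sigma} \leq Z \leq 0\right)$ and $\delta > 0$, we recover that $\lim_{n \to \infty} P\left(-\delta \leq \delta + \frac{\sigma}{\sqrt{n}} Z \leq \delta\right) = P(Z \leq 0) = \frac{1}{2}$. Thus, we find the following asymptotic results:

$$\lim_{n \to \infty} \inf_{\theta \in [-\delta, \delta]} P_\theta\left(M_{Ml}(\hat{\theta}_n) = M_0^{(\delta)}\right) = \frac{1}{2},$$

$$\lim_{n \to \infty} \inf_{\theta \in [-\delta, \delta]^c} P_\theta\left(M_{Ml}(\hat{\theta}_n) = M_1^{(\delta)}\right) = 1 - \frac{1}{2} = \frac{1}{2}.$$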

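Similarly, we recover the following expressions for the worst case probability of selecting the wrong model:

$$\sup_{\theta \in [-\delta, \delta]} P_\theta\left(M_{Ml}(\hat{\theta}_n) = M_1^{(\delta)}\right) = 1 - P\left(-\delta \leq \delta + \frac{\sigma}{\sqrt{n}} Z \leq \delta\right),$$

$$\sup_{\theta \in [-\delta, \delta]^c} P_\theta\left(M_{Ml}(\hat{\theta}_n) = M_0^{(\delta)}\right) = P\left(-\delta \leq \delta + \frac{\sigma}{\sqrt{n}} Z \leq \delta\right),$$

which also both tend to $\frac{1}{2}$ as $n \to \infty$.

Therefore, this method does not suffer from the same problem as the BIC criterion, where it was possible to select the wrong model with a probability arbitrarily close to 1. This solves one part of the dilemma.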

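Furthermore, by using the expression for $P_\theta\left(M_{Ml}(\hat{\theta}_n) = M_0^{(\delta)}\right)$ in Equation (1.2), we can show that the method achieves near uniform consistency. More specifically, for any $0 < \varepsilon < \delta$, we have:

$$\begin{aligned}
\lim_{n \to \infty} \inf_{|\theta| \leq \delta - \varepsilon} P_\theta\left(M_{Ml}(\hat{\theta}_n) = M_0^{(\delta)}\right) &= \lim_{n \to \infty} P\left(-\delta \leq \delta - \varepsilon + \frac{\sigma}{\sqrt{n}} Z \leq \delta\right) \\
&= \lim_{n \to \infty} P\left(-(2\delta - \varepsilon)\frac{\sqrt{n}}{\sigma} \leq Z \leq \varepsilon\frac{\sqrt{n}}{\sigma}\right) = 1, \\
\lim_{n \to \infty} \inf_{|\theta| \geq \delta + \varepsilon} P_\theta\left(M_{Ml}(\hat{\theta}_n) = M_1^{(\delta)}\right) &= 1 - \lim_{n \to \infty} P\left(-\delta \leq \delta + \varepsilon + \frac{\sigma}{\sqrt{n}} Z \leq \delta\right) \\
&= 1 - \lim_{n \to \infty} P\left(-(2\delta + \varepsilon)\frac{\sqrt{n}}{\sigma} \leq Z \leq -\varepsilon\frac{\sqrt{n}}{\sigma}\right) \\
&= 1 - 0 = 1.
\end{aligned}$$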

Therefore if the true θ is not too close to ±δ we have uniform consistency in model selection. This implies that setting a threshold for relevance, namely δ >0, also offers a solution to the other part of the dilemma previously discussed.

Consequently, within the relevance framework, the maximum likelihood estimator is uniformly consistent in model selection for θ not too close to ±δ, while at the same time avoiding the case of selecting the wrong model with an arbitrarily high probability.
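The contrast between the worst case over the whole irrelevance region and the worst case at a distance $\varepsilon$ from the boundary can be illustrated numerically. The sketch below evaluates the two closed-form expressions derived above for increasing $n$; the function names and the chosen values of $\delta$, $\varepsilon$ and $\sigma$ are illustrative assumptions.

```python
import math

def std_normal_cdf(z):
    """Standard normal distribution function, computed from math.erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def worst_case_coverage(n, sigma, delta):
    """P(-2 * delta * sqrt(n) / sigma <= Z <= 0): the infimum over |theta| <= delta."""
    return 0.5 - std_normal_cdf(-2.0 * delta * math.sqrt(n) / sigma)

def coverage_with_margin(n, sigma, delta, eps):
    """Infimum over |theta| <= delta - eps of P_theta(select M_0^(delta))."""
    root_n = math.sqrt(n) / sigma
    upper = std_normal_cdf(eps * root_n)
    lower = std_normal_cdf(-(2.0 * delta - eps) * root_n)
    return upper - lower

for n in (10, 100, 1_000, 10_000):
    boundary = worst_case_coverage(n, sigma=1.0, delta=0.5)
    margin = coverage_with_margin(n, sigma=1.0, delta=0.5, eps=0.1)
    print(f"n={n:>6}: boundary worst case {boundary:.4f}, "
          f"worst case with margin {margin:.4f}")
```

The first column stays near $\frac{1}{2}$, while the second tends to 1, which is the near uniform consistency described in the text.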

1.1.5 Signed relevance

In many situations, one may not only be interested in the relevance of the parameter of interest but also in its sign or direction. In fact, Kaiser (1960) criticized the traditional two-sided test, i.e. the case of zero component testing, by stating: "to find a 'significant' effect and not be able to decide in which direction this difference or effect lies, seems a sterile way to do business". This prompted him to suggest using two one-sided tests instead of one two-sided test.

Translating this to our case, it entails separating $M_1^{(\delta)}$ into a model for negatively relevant $\theta$ and another for positively relevant $\theta$. More specifically, $M_1^{(\delta)}$ can be separated into $M_-^{(\delta)}$ and $M_+^{(\delta)}$, in which case for a fixed $\delta > 0$ there are three models:

$$M_-^{(\delta)} : \theta < -\delta, \qquad M_0^{(\delta)} : -\delta \leq \theta \leq \delta, \qquad M_+^{(\delta)} : \theta > \delta, \tag{1.3}$$

where $M_0^{(\delta)}$ is still the model of irrelevant $\theta$.

1.1.6 Maximum likelihood for signed relevance

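Once again, we can use the maximum likelihood estimator, $\hat{\theta}_n$, to select between the models $M_-^{(\delta)}$, $M_0^{(\delta)}$ and $M_+^{(\delta)}$, defined in Equations (1.3), where we fix $\delta > 0$. The corresponding decision function $M_{Ml_\pm} : \mathbb{R} \to \left\{M_-^{(\delta)}, M_0^{(\delta)}, M_+^{(\delta)}\right\}$ is defined as:

$$M_{Ml_\pm}(x) = \begin{cases} M_-^{(\delta)} & \text{if } x < -\delta \\ M_0^{(\delta)} & \text{if } -\delta \leq x \leq \delta \\ M_+^{(\delta)} & \text{if } x > \delta \end{cases}.$$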

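We now study the probabilities of selecting $M_-^{(\delta)}$, $M_0^{(\delta)}$ and $M_+^{(\delta)}$ as functions of $\theta$. In fact, just as in Subsection 1.1.4, for any $\theta \in \mathbb{R}$ we recover:

$$P_\theta\left(M_{Ml_\pm}(\hat{\theta}_n) = M_-^{(\delta)}\right) = P_\theta\left(\hat{\theta}_n < -\delta\right) = P\left(\theta + \frac{\sigma}{\sqrt{n}} Z < -\delta\right), \tag{1.4}$$

$$P_\theta\left(M_{Ml_\pm}(\hat{\theta}_n) = M_0^{(\delta)}\right) = P_\theta\left(-\delta \leq \hat{\theta}_n \leq \delta\right) = P\left(-\delta \leq \theta + \frac{\sigma}{\sqrt{n}} Z \leq \delta\right),$$

$$P_\theta\left(M_{Ml_\pm}(\hat{\theta}_n) = M_+^{(\delta)}\right) = P_\theta\left(\hat{\theta}_n > \delta\right) = P\left(\theta + \frac{\sigma}{\sqrt{n}} Z > \delta\right),$$

where $Z \sim N(0,1)$.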
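The three-way decision rule and the selection probabilities of Equations (1.4) can be sketched as follows. The string labels for the models and the function names are illustrative assumptions; only the thresholds and the normal probabilities come from the text.

```python
import math

def std_normal_cdf(z):
    """Standard normal distribution function, computed from math.erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def signed_decision(x, delta):
    """Select between the three signed-relevance models from an estimate x."""
    if x < -delta:
        return "M_minus"
    if x > delta:
        return "M_plus"
    return "M_zero"

def signed_selection_probs(theta, n, sigma, delta):
    """Probabilities of selecting each model when theta_hat ~ N(theta, sigma^2 / n)."""
    scale = sigma / math.sqrt(n)
    p_minus = std_normal_cdf((-delta - theta) / scale)
    p_plus = 1.0 - std_normal_cdf((delta - theta) / scale)
    return {"M_minus": p_minus, "M_zero": 1.0 - p_minus - p_plus, "M_plus": p_plus}

print(signed_decision(-0.7, delta=0.5))
for theta in (-0.5, 0.0, 0.5, 0.8):
    probs = signed_selection_probs(theta, n=100, sigma=1.0, delta=0.5)
    print(theta, {name: round(p, 4) for name, p in probs.items()})
```

At the boundary values $\theta = \pm\delta$ the corresponding one-sided model is selected with probability $\frac{1}{2}$ for every $n$.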

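Thus, $P_\theta\left(M_{Ml_\pm}(\hat{\theta}_n) = M_0^{(\delta)}\right)$ is a strictly decreasing function of $|\theta|$, while $P_\theta\left(M_{Ml_\pm}(\hat{\theta}_n) = M_-^{(\delta)}\right)$ is strictly decreasing in $\theta$ and $P_\theta\left(M_{Ml_\pm}(\hat{\theta}_n) = M_+^{(\delta)}\right)$ is strictly increasing in $\theta$. This implies that we recover the following analytical expressions for the worst case covering probabilities:

$$\inf_{\theta < -\delta} P_\theta\left(M_{Ml_\pm}(\hat{\theta}_n) = M_-^{(\delta)}\right) = P\left(-\delta + \frac{\sigma}{\sqrt{n}} Z < -\delta\right) = \frac{1}{2},$$

$$\inf_{-\delta \leq \theta \leq \delta} P_\theta\left(M_{Ml_\pm}(\hat{\theta}_n) = M_0^{(\delta)}\right) = P\left(-\delta \leq \delta + \frac{\sigma}{\sqrt{n}} Z \leq \delta\right),$$

$$\inf_{\theta > \delta} P_\theta\left(M_{Ml_\pm}(\hat{\theta}_n) = M_+^{(\delta)}\right) = P\left(\delta + \frac{\sigma}{\sqrt{n}} Z > \delta\right) = \frac{1}{2}.$$

By using the fact that $P\left(-\delta \leq \delta + \frac{\sigma}{\sqrt{n}} Z \leq \delta\right) = P\left(-2\delta\frac{\sqrt{n}}{\sigma} \leq Z \leq 0\right)$ and $\delta > 0$, we recover that $\lim_{n \to \infty} P\left(-\delta \leq \delta + \frac{\sigma}{\sqrt{n}} Z \leq \delta\right) = P(Z \leq 0) = \frac{1}{2}$. Similarly, we recover the following expressions for the worst case probability of selecting the wrong model:

$$\sup_{\theta \geq -\delta} P_\theta\left(M_{Ml_\pm}(\hat{\theta}_n) = M_-^{(\delta)}\right) = P\left(-\delta + \frac{\sigma}{\sqrt{n}} Z \leq -\delta\right) = \frac{1}{2},$$

$$\sup_{\theta \in [-\delta, \delta]^c} P_\theta\left(M_{Ml_\pm}(\hat{\theta}_n) = M_0^{(\delta)}\right) = P\left(-\delta \leq \delta + \frac{\sigma}{\sqrt{n}} Z \leq \delta\right),$$

$$\sup_{\theta \leq \delta} P_\theta\left(M_{Ml_\pm}(\hat{\theta}_n) = M_+^{(\delta)}\right) = P\left(\delta \leq \delta + \frac{\sigma}{\sqrt{n}} Z\right) = \frac{1}{2},$$

where $\sup_{\theta \in [-\delta, \delta]^c} P_\theta\left(M_{Ml_\pm}(\hat{\theta}_n) = M_0^{(\delta)}\right)$ tends to $\frac{1}{2}$ as $n \to \infty$.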

Therefore, just as the method described in Subsection 1.1.4, this method doesn’t suffer from the same problem as the BIC criterion, where it was possible to select the wrong model with a probability arbitrarily close to 1.

Furthermore, by using the expressions for the probabilities in Equations (1.4), we can show that the method achieves near uniform consistency, in the same way as in the relevance framework.


1.2 Bayesian approach

Ultimately we are interested in quantifying the uncertainty in model selection. Much work has already been done on this topic within the Bayesian framework. Interestingly, Bayesian model selection, as described in Mitchell and Beauchamp (1988) and in Chipman et al. (2001), provides both a method to select a model and a credibility set of models. We apply this methodology to our particular case and discuss some of the implications.

1.2.1 Posterior model probability

In a general Bayesian framework, to the models $M_0$ and $M_1$ we associate the distributions $\pi_0$ and $\pi_1$ on the parameter space. Additionally, let $f_n(\cdot \mid \theta)$ denote the density of $\hat{\theta}_n$ for a given $\theta$. The marginal densities for $\hat{\theta}_n$ are:

$$p_n(\cdot \mid M_k) = \int f_n(\cdot \mid \theta)\,\pi_k(\theta)\,d\theta, \tag{1.5}$$

where $k \in \{0, 1\}$.

Furthermore, let $P(M_0)$ and $P(M_1) = 1 - P(M_0)$ be the probabilities of the observations being generated by the models $M_0$ and $M_1$ respectively. As a result, by a standard application of Bayes' theorem, we recover the posterior model probabilities for $k \in \{0, 1\}$:

$$P\left(M_k \mid \hat{\theta}_n\right) = \frac{p_n\left(\hat{\theta}_n \mid M_k\right) P(M_k)}{p_n\left(\hat{\theta}_n \mid M_0\right) P(M_0) + p_n\left(\hat{\theta}_n \mid M_1\right) P(M_1)}. \tag{1.6}$$

One can then select the model with the higher posterior model probability. Consequently, we can define the Bayesian decision function $M_{Bayes} : \mathbb{R} \to \{M_0, M_1\}$ by:

$$M_{Bayes}(x) = \arg\max_{\tilde{M} \in \{M_0, M_1\}} P\left(\tilde{M} \mid x\right). \tag{1.7}$$

1.2.2 Spike and slab priors

Concerning the choice of the priors, a common approach, as suggested by Mitchell and Beauchamp (1988), is to use the point and uniform priors, otherwise known as the spike and slab priors. More specifically, we set $\pi_0(t) = \delta_0(t)$, a degenerate distribution with all its mass at 0, and $\pi_1(t) = \frac{1}{2t_0}\mathbf{1}_{|t| \leq t_0}$, a uniform distribution with $t_0 > 0$.

By the choices of the priors, we have $P_{\pi_0}(\theta = 0) = 1$ and $P_{\pi_1}(\theta \neq 0) = 1$. This means that under the assumption that $\theta$ follows the distribution $\pi_0$, the model $M_0$ is true with probability 1. Likewise, if $\theta$ follows the distribution $\pi_1$, the model $M_1$ is true with probability 1.

Furthermore, by plugging the above priors into Equation (1.5), it can be shown (details in Appendix A) that in this case the marginal likelihoods have the following simple expressions:

$$p_n\left(\hat{\theta}_n \mid M_0\right) = \frac{1}{\sqrt{2\pi}\sqrt{\sigma^2/n}} \exp\left(-\frac{\hat{\theta}_n^2}{2\sigma^2/n}\right),$$

$$p_n\left(\hat{\theta}_n \mid M_1\right) = \int_{-t_0}^{t_0} \frac{1}{\sqrt{2\pi}\sqrt{\sigma^2/n}} \exp\left(-\frac{(\hat{\theta}_n - t)^2}{2\sigma^2/n}\right) \frac{1}{2t_0}\,dt.$$

Now, by plugging these expressions into Equation (1.6), we can give a rather simple expression for the posterior model probabilities. More specifically (details in Appendix A), the posterior model probabilities are determined by:

$$P\left(M_0 \mid \hat{\theta}_n\right) = \left[1 + \int_{-t_0}^{t_0} \exp\left(-\frac{t^2}{2\sigma^2/n} + \frac{\hat{\theta}_n t}{\sigma^2/n}\right) \frac{1}{2t_0}\,dt \; P(M_0)^{-1} P(M_1)\right]^{-1} = 1 - P\left(M_1 \mid \hat{\theta}_n\right).$$

These probabilities are exact under the prior assumptions. Due to the definition of M₀, the distribution π₀ needs no justification since it is the unique distribution within the domain of M₀. The choice of π₁ is more arbitrary.
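The spike and slab posterior probability involves a one-dimensional integral, which can be approximated numerically. The following is a minimal sketch using a midpoint rule from the Python standard library; the step count, the default prior probability and the function names are illustrative assumptions. The closed-form helper, obtained by completing the square in the exponent, is included only as a cross-check of the numerical integral and is not taken from the thesis.

```python
import math

def std_normal_cdf(z):
    """Standard normal distribution function, computed from math.erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def slab_integral(theta_hat, n, sigma, t0, steps=2000):
    """Midpoint rule for the integral over [-t0, t0] of
    exp(-t^2 / (2 s2) + theta_hat * t / s2) / (2 t0), with s2 = sigma^2 / n."""
    s2 = sigma ** 2 / n
    width = 2.0 * t0 / steps
    total = 0.0
    for k in range(steps):
        t = -t0 + (k + 0.5) * width
        total += math.exp(-t * t / (2.0 * s2) + theta_hat * t / s2)
    return total * width / (2.0 * t0)

def slab_integral_closed_form(theta_hat, n, sigma, t0):
    """Same integral after completing the square (cross-check only)."""
    s = sigma / math.sqrt(n)
    mass = std_normal_cdf((t0 - theta_hat) / s) - std_normal_cdf((-t0 - theta_hat) / s)
    return math.exp(theta_hat ** 2 / (2.0 * s * s)) * math.sqrt(2.0 * math.pi) * s * mass / (2.0 * t0)

def posterior_m0_spike_slab(theta_hat, n, sigma, t0, prior_m0=0.5):
    """Posterior probability of M_0 under the spike and slab priors."""
    odds = slab_integral(theta_hat, n, sigma, t0) * (1.0 - prior_m0) / prior_m0
    return 1.0 / (1.0 + odds)

for theta_hat in (0.0, 0.2, 0.5):
    print(theta_hat, round(posterior_m0_spike_slab(theta_hat, n=25, sigma=1.0, t0=1.0), 4))
```

For large values of $\sqrt{n}\,|\hat{\theta}_n|/\sigma$ the integrand can overflow in floating point, so a production implementation would work on the log scale; this sketch does not.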

1.2.3 Point-normal priors

Another possible choice of priors is found in Chipman et al. (2001), who prefer to use the point-normal formulation. More specifically, we set $\pi_0(t) = \delta_0(t)$, a degenerate distribution with all its mass at 0, and $\pi_1(t) = \frac{1}{\sqrt{2\pi}\,\tau} \exp\left(-\frac{t^2}{2\tau^2}\right)$, a normal distribution with mean 0 and a scale parameter $\tau > 0$.

Just as for the spike and slab priors, we have $P_{\pi_0}(\theta = 0) = 1$ and $P_{\pi_1}(\theta \neq 0) = 1$.

Now, by plugging the above priors into Equation (1.5), it can be shown (details in Appendix A) that the marginal likelihoods are equal to:

$$p_n\left(\hat{\theta}_n \mid M_0\right) = \frac{1}{\sqrt{2\pi}\sqrt{\sigma^2/n}} \exp\left(-\frac{\hat{\theta}_n^2}{2\sigma^2/n}\right),$$

$$p_n\left(\hat{\theta}_n \mid M_1\right) = \frac{1}{\sqrt{2\pi}\sqrt{\tau^2 + \sigma^2/n}} \exp\left(-\frac{\hat{\theta}_n^2}{2(\tau^2 + \sigma^2/n)}\right).$$

Once again, by plugging these expressions into Equation (1.6), we can give an analytical expression for the posterior model probabilities. More specifically (details in Appendix A), the posterior model probabilities are determined by:

$$P\left(M_0 \mid \hat{\theta}_n\right) = \left[1 + \sqrt{\frac{\sigma^2/n}{\tau^2 + \sigma^2/n}} \exp\left(\frac{\hat{\theta}_n^2}{2\sigma^2/n} \cdot \frac{\tau^2}{\tau^2 + \sigma^2/n}\right) P(M_0)^{-1} P(M_1)\right]^{-1} = 1 - P\left(M_1 \mid \hat{\theta}_n\right).$$

These probabilities are exact under the prior assumptions, requiring the knowledge of τ.
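The closed-form expression can be checked against a direct evaluation of Equation (1.6) with the two normal marginal densities. The following minimal Python sketch does this for a few illustrative values; the function names and the numerical inputs are assumptions made for the example.

```python
import math

def normal_pdf(x, variance):
    """Density of N(0, variance) evaluated at x."""
    return math.exp(-x * x / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

def posterior_m0_direct(theta_hat, n, sigma, tau, prior_m0=0.5):
    """Equation (1.6) with the point-normal marginal likelihoods."""
    s2 = sigma ** 2 / n
    weight_m0 = normal_pdf(theta_hat, s2) * prior_m0
    weight_m1 = normal_pdf(theta_hat, tau ** 2 + s2) * (1.0 - prior_m0)
    return weight_m0 / (weight_m0 + weight_m1)

def posterior_m0_closed_form(theta_hat, n, sigma, tau, prior_m0=0.5):
    """Closed-form posterior probability of M_0 for the point-normal prior."""
    s2 = sigma ** 2 / n
    shrink = tau ** 2 / (tau ** 2 + s2)
    odds = math.sqrt(s2 / (tau ** 2 + s2)) * math.exp(theta_hat ** 2 / (2.0 * s2) * shrink)
    return 1.0 / (1.0 + odds * (1.0 - prior_m0) / prior_m0)

for theta_hat in (0.0, 0.2, 0.5):
    direct = posterior_m0_direct(theta_hat, n=25, sigma=1.0, tau=1.0)
    closed = posterior_m0_closed_form(theta_hat, n=25, sigma=1.0, tau=1.0)
    print(f"theta_hat={theta_hat}: direct {direct:.6f}, closed form {closed:.6f}")
```

Both routes should give the same value up to floating-point error, and the posterior probability of $M_0$ decreases as $|\hat{\theta}_n|$ grows.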

1.2.4 Choice of nuisance parameter

For both spike and slab and point-normal priors, there remain nuisance parameters to specify. Namely, we need to specify $t_0$ for the spike and slab prior and $\tau$ for the point-normal prior. Here we focus on the point-normal prior, to show how important the choice of the nuisance parameter is. Not wanting to favour one model over another, we set $P(M_0) = P(M_1) = \frac{1}{2}$ in the rest of the subsection.

Under the assumption that $M_0$ is true, we have $\hat{\theta}_n = \frac{\sigma}{\sqrt{n}} Z$ where $Z \sim N(0,1)$. It therefore follows that under this assumption:

$$P\left(M_0 \mid \hat{\theta}_n\right) = \left[1 + \sqrt{\frac{\sigma^2/n}{\tau^2 + \sigma^2/n}} \exp\left(\frac{Z^2}{2} \cdot \frac{\tau^2}{\tau^2 + \sigma^2/n}\right)\right]^{-1}.$$

Consequently, allowing $\tau$ to depend on $n$, if $\lim_{n \to \infty} \frac{\sigma^2/n}{\tau^2 + \sigma^2/n} = 0$ then $P\left(M_0 \mid \hat{\theta}_n\right)$ converges in probability to 1 under the assumption $M_0$ is true. On the other hand, if we allow the scale parameter $\tau$ to decay at the rate of $1/\sqrt{n}$ by setting $\tau = a \cdot \sigma/\sqrt{n}$, we have:

$$P\left(M_0 \mid \hat{\theta}_n\right) = \left[1 + \sqrt{\frac{1}{a^2 + 1}} \exp\left(\frac{Z^2}{2} \cdot \frac{a^2}{a^2 + 1}\right)\right]^{-1}.$$

In such a case, the probability of choosing $M_1$ converges to $P\left(P\left(M_0 \mid \hat{\theta}_n\right) < \frac{1}{2}\right) > 0$ as $n \to \infty$, which is the same problem we encountered using the AIC criterion.
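With equal prior model probabilities and $\tau = a \cdot \sigma/\sqrt{n}$, the event $P(M_0 \mid \hat{\theta}_n) < \frac{1}{2}$ can be rewritten, by solving the posterior expression for $Z^2$, as $Z^2 > \frac{a^2 + 1}{a^2}\log(a^2 + 1)$. This rearrangement is our own worked step rather than a formula stated in the text. The minimal Python sketch below, with illustrative function names, evaluates the resulting probability of wrongly selecting $M_1$ for a few values of $a$.

```python
import math

def std_normal_cdf(z):
    """Standard normal distribution function, computed from math.erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def posterior_m0_scaled(z, a):
    """Posterior probability of M_0 when tau = a * sigma / sqrt(n) and
    theta_hat = sigma / sqrt(n) * z, with equal prior model probabilities."""
    shrink = a * a / (a * a + 1.0)
    return 1.0 / (1.0 + math.sqrt(1.0 / (a * a + 1.0)) * math.exp(0.5 * z * z * shrink))

def selection_cutoff(a):
    """Value c such that the posterior of M_0 is below 1/2 exactly when Z^2 > c."""
    return (a * a + 1.0) / (a * a) * math.log(a * a + 1.0)

def prob_select_m1_under_m0(a):
    """P(Z^2 > cutoff): probability of selecting M_1 although M_0 is true."""
    return 2.0 * (1.0 - std_normal_cdf(math.sqrt(selection_cutoff(a))))

for a in (0.5, 1.0, 2.0, 5.0):
    print(f"a={a}: P(select M_1 | M_0 true) = {prob_select_m1_under_m0(a):.4f}")
```

The probability does not depend on $n$, so it stays bounded away from 0 however large the sample is.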

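In the case we assume $M_1$ to be true, the choice of $\tau$ can be just as crucial. Assume the true density of $\theta$ is equal to $\pi_*(x) = \frac{1}{\sqrt{2\pi}\,\tau_*} \exp\left(-\frac{x^2}{2\tau_*^2}\right)$. We have $\hat{\theta}_n = \sqrt{\tau_*^2 + \sigma^2/n} \cdot Z$, where $Z \sim N(0,1)$. By using the previous prior we recover:

$$P\left(M_0 \mid \hat{\theta}_n\right) = \left[1 + \sqrt{\frac{\sigma^2/n}{\tau^2 + \sigma^2/n}} \exp\left(\frac{Z^2}{2} \cdot \frac{\tau_*^2 + \sigma^2/n}{\sigma^2/n} \cdot \frac{\tau^2}{\tau^2 + \sigma^2/n}\right)\right]^{-1}.$$

Let $a_* > 0$ be such that $\tau_* = a_* \cdot \sigma/\sqrt{n}$. We then have:

$$\begin{aligned}
P\left(M_0 \mid \hat{\theta}_n\right) &= \left[1 + \sqrt{\frac{\sigma^2/n}{\tau^2 + \sigma^2/n}} \exp\left(\frac{Z^2}{2} \cdot (a_*^2 + 1) \cdot \frac{\tau^2}{\tau^2 + \sigma^2/n}\right)\right]^{-1} \\
&\geq \left[1 + \sqrt{\frac{\sigma^2}{n\tau^2 + \sigma^2}} \exp\left(\frac{Z^2}{2} \cdot (a_*^2 + 1)\right)\right]^{-1}.
\end{aligned}$$

Therefore in the case $\tau_*$ is badly specified, it is possible for $P\left(M_0 \mid \hat{\theta}_n\right)$ to be arbitrarily close to 1 in probability. For instance, if $a_*$, $\sigma$ and $\tau$ are kept constant while $n \to \infty$, then $P\left(M_0 \mid \hat{\theta}_n\right)$ converges in probability to 1. This points to the same problem the BIC criterion had in Subsection 1.1.2. Unfortunately, the exact same problem arises for the $t_0$ parameter in the spike and slab prior.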
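The effect of keeping $a_*$, $\sigma$ and $\tau$ fixed while $n$ grows can be illustrated by evaluating the posterior expression and its lower bound for a fixed realisation of $Z$. This is a minimal sketch with illustrative function names and input values; it evaluates the formulas above and is not a simulation of the full model.

```python
import math

def posterior_m0_exact(z, n, sigma, tau, a_star):
    """Posterior probability of M_0 when tau_star = a_star * sigma / sqrt(n)."""
    s2 = sigma ** 2 / n
    shrink = tau ** 2 / (tau ** 2 + s2)
    ratio = math.sqrt(s2 / (tau ** 2 + s2))
    return 1.0 / (1.0 + ratio * math.exp(0.5 * z * z * (a_star ** 2 + 1.0) * shrink))

def posterior_m0_lower_bound(z, n, sigma, tau, a_star):
    """Lower bound obtained by replacing tau^2 / (tau^2 + sigma^2 / n) with 1."""
    ratio = math.sqrt(sigma ** 2 / (n * tau ** 2 + sigma ** 2))
    return 1.0 / (1.0 + ratio * math.exp(0.5 * z * z * (a_star ** 2 + 1.0)))

z = 1.5  # illustrative fixed realisation of Z
for n in (10, 1_000, 100_000, 10_000_000):
    exact = posterior_m0_exact(z, n, sigma=1.0, tau=1.0, a_star=1.0)
    bound = posterior_m0_lower_bound(z, n, sigma=1.0, tau=1.0, a_star=1.0)
    print(f"n={n:>9}: posterior {exact:.5f} >= bound {bound:.5f}")
```

Both columns approach 1 as $n$ increases, even though $M_1$ is the true model in this setting.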

1.2.5 Relevance

The problem described in the previous subsection can be avoided by focusing on the notion of practical significance. Chipman et al. (2001) introduced the idea of working with models $M_0^{(\delta)}$ and $M_1^{(\delta)}$ as in Subsection 1.1.3 by defining $\delta$ as the "threshold of practical significance", i.e. the "threshold of relevance". In this subsection we therefore focus on the models $M_0^{(\delta)}$ and $M_1^{(\delta)}$ for a given $\delta > 0$. The definitions of marginal densities, posterior model probabilities and the Bayesian decision function from Equations (1.5), (1.6) and (1.7) translate directly to this setting, by replacing $M_0$ and $M_1$ by $M_0^{(\delta)}$ and $M_1^{(\delta)}$. Within this setting, we now propose two different types of priors and discuss some of their properties and implications.
